File Format #
Paimon currently supports the Parquet, Avro, ORC, CSV, and JSON file formats.
- The recommended columnar format is Parquet, which offers a high compression rate and fast column-projection queries.
- The recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
- The recommended format for testing is CSV, which is more readable but has the worst read and write performance.
PARQUET #
Parquet is the default file format for Paimon.
The following table lists the type mapping from Paimon type to Parquet type.
| Paimon Type | Parquet type | Parquet logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | |
| BINARY / VARBINARY | BINARY | |
| DECIMAL(P, S) | P <= 9: INT32, P <= 18: INT64, P > 18: FIXED_LEN_BYTE_ARRAY | DECIMAL(P, S) |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT | INT32 | |
| BIGINT | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
| TIMESTAMP_LOCAL_ZONE(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
| ARRAY | 3-LEVEL LIST | LIST |
| MAP | 3-LEVEL MAP | MAP |
| MULTISET | 3-LEVEL MAP | MAP |
| ROW | GROUP | |
Limitations:
- Parquet does not support nullable map keys.
- A Parquet TIMESTAMP with precision 9 is stored as INT96, but this INT96 value is time-zone converted and requires additional adjustment.
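The precision-dependent rows of the Parquet mapping table can be read as simple threshold rules. The following is a minimal sketch of those rules only; the function names are hypothetical and are not part of Paimon's API.

```python
from typing import Tuple


def parquet_decimal_physical_type(precision: int) -> str:
    """Physical type for DECIMAL(P, S), following the thresholds in the table."""
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def parquet_timestamp_types(precision: int) -> Tuple[str, str]:
    """(physical type, logical type) for TIMESTAMP(P) and TIMESTAMP_LOCAL_ZONE(P)."""
    if precision <= 3:
        return ("INT64", "MILLIS")
    if precision <= 6:
        return ("INT64", "MICROS")
    # Precisions above 6 fall back to INT96 with no logical type annotation.
    return ("INT96", "NONE")


if __name__ == "__main__":
    for p in (5, 12, 20):
        print(f"DECIMAL({p}, 2) -> {parquet_decimal_physical_type(p)}")
    for p in (3, 6, 9):
        print(f"TIMESTAMP({p}) -> {parquet_timestamp_types(p)}")
```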
AVRO #
The following table lists the type mapping from Paimon type to Avro type.
| Paimon type | Avro type | Avro logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | string | |
| BOOLEAN | boolean | |
| BINARY / VARBINARY | bytes | |
| DECIMAL | bytes | decimal |
| TINYINT | int | |
| SMALLINT | int | |
| INT | int | |
| BIGINT | long | |
| FLOAT | float | |
| DOUBLE | double | |
| DATE | int | date |
| TIME | int | time-millis |
| TIMESTAMP | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
| TIMESTAMP_LOCAL_ZONE | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
| ARRAY | array | |
| MAP (key must be string/char/varchar type) | map | |
| MULTISET (element must be string/char/varchar type) | map | |
| ROW | record | |
In addition to the types listed above, Paimon maps nullable types to the Avro union(something, null),
where something is the Avro type converted from the Paimon type.
Refer to the Avro specification for more information about Avro types.
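To illustrate the nullable mapping, here is a small sketch that builds an Avro schema as a plain Python dictionary. The helper `to_avro_type`, the record name, and the field names are hypothetical. The branch order inside the union follows the wording union(something, null) above and has not been checked against the schemas Paimon actually writes.

```python
import json


def to_avro_type(avro_type, nullable: bool):
    """Wrap an Avro type in a union with "null" when the Paimon type is nullable."""
    if nullable:
        # Assumption: branch order mirrors union(something, null) from the text.
        return [avro_type, "null"]
    return avro_type


# Hypothetical record with one non-nullable and two nullable fields.
schema = {
    "type": "record",
    "name": "ExampleRow",
    "fields": [
        {"name": "id", "type": to_avro_type("long", nullable=False)},
        {"name": "name", "type": to_avro_type("string", nullable=True)},
        {
            "name": "birthday",
            # DATE maps to Avro int with the "date" logical type.
            "type": to_avro_type({"type": "int", "logicalType": "date"}, nullable=True),
        },
    ],
}

if __name__ == "__main__":
    print(json.dumps(schema, indent=2))
```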
ORC #
The following table lists the type mapping from Paimon type to ORC type.
| Paimon Type | ORC physical type | ORC logical type |
|---|---|---|
| CHAR | bytes | CHAR |
| VARCHAR | bytes | VARCHAR |
| STRING | bytes | STRING |
| BOOLEAN | long | BOOLEAN |
| BYTES | bytes | BINARY |
| DECIMAL | decimal | DECIMAL |
| TINYINT | long | BYTE |
| SMALLINT | long | SHORT |
| INT | long | INT |
| BIGINT | long | LONG |
| FLOAT | double | FLOAT |
| DOUBLE | double | DOUBLE |
| DATE | long | DATE |
| TIMESTAMP | timestamp | TIMESTAMP |
| TIMESTAMP_LOCAL_ZONE | timestamp | TIMESTAMP_INSTANT |
| ARRAY | - | LIST |
| MAP | - | MAP |
| ROW | - | STRUCT |
Limitations:
- ORC has a time zone bias when mapping the TIMESTAMP_LOCAL_ZONE type: it saves the millis value corresponding to the UTC literal time. Due to compatibility issues, this behavior cannot be modified.
CSV #
Experimental feature, not recommended for production.
Format Options:
| Option | Default | Type | Description |
|---|---|---|---|
| csv.field-delimiter | , | String | Field delimiter character (',' by default); must be a single character. You can use a backslash to specify special characters, e.g. '\t' represents the tab character. |
| csv.line-delimiter | \n | String | The line delimiter for CSV format. |
| csv.quote-character | " | String | Quote character for enclosing field values (" by default). |
| csv.escape-character | \ | String | The escape character for CSV format. |
| csv.include-header | false | Boolean | Whether to include header in CSV files. |
| csv.null-literal | "" | String | Null literal string that is interpreted as a null value (disabled by default). |
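To show how these options interact, the sketch below applies them with Python's standard `csv` module. This is a stand-in for illustration only: Paimon's CSV format is not implemented this way, the option values are hypothetical, and the correspondence between each option and a `csv` parameter is an assumption.

```python
import csv
import io

# Hypothetical option values, keyed by the option names from the table above.
options = {
    "csv.field-delimiter": "\t",
    "csv.line-delimiter": "\n",
    "csv.quote-character": '"',
    "csv.escape-character": "\\",
    "csv.include-header": True,
    "csv.null-literal": "NULL",
}


def write_csv(header, rows, opts):
    """Serialize rows to CSV text, replacing None with the null literal."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=opts["csv.field-delimiter"],
        lineterminator=opts["csv.line-delimiter"],
        quotechar=opts["csv.quote-character"],
        escapechar=opts["csv.escape-character"],
        doublequote=False,  # escape quotes with the escape character instead
    )
    if opts["csv.include-header"]:
        writer.writerow(header)
    for row in rows:
        writer.writerow([opts["csv.null-literal"] if v is None else v for v in row])
    return buffer.getvalue()


def read_csv(text, opts):
    """Parse CSV text back into rows of strings, mapping the null literal to None."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=opts["csv.field-delimiter"],
        quotechar=opts["csv.quote-character"],
        escapechar=opts["csv.escape-character"],
        doublequote=False,
    )
    rows = list(reader)
    if opts["csv.include-header"]:
        rows = rows[1:]  # drop the header line
    null_literal = opts["csv.null-literal"]
    return [[None if v == null_literal else v for v in row] for row in rows]


header = ["id", "name", "note"]
rows = [[1, "alice", None], [2, 'say "hi"', "x\ty"]]
text = write_csv(header, rows, options)
parsed = read_csv(text, options)

if __name__ == "__main__":
    print(text)
    print(parsed)
```

Values come back as strings because CSV carries no type information. Note also that a real string equal to the null literal cannot be distinguished from a null value after a round trip.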
The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.
The following table lists the type mapping from Paimon type to CSV type.
| Paimon type | CSV type |
|---|---|
| CHAR / VARCHAR / STRING | string |
| BOOLEAN | boolean |
| BINARY / VARBINARY | string with encoding: base64 |
| DECIMAL | number |
| TINYINT | number |
| SMALLINT | number |
| INT | number |
| BIGINT | number |
| FLOAT | number |
| DOUBLE | number |
| DATE | string with format: date |
| TIME | string with format: time |
| TIMESTAMP | string with format: date-time |
| TIMESTAMP_LOCAL_ZONE | string with format: date-time |
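As a rough illustration of the CSV type mapping, the sketch below renders Python values as CSV field text. The helper `to_csv_field` is hypothetical. The ISO 8601 layout for date and time values and the lowercase `true`/`false` spelling are assumptions, since the table names only the format category.

```python
import base64
import datetime
import decimal


def to_csv_field(value) -> str:
    """Render a single value as CSV field text, following the mapping table."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    # BINARY / VARBINARY: string with base64 encoding.
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    # Numeric types are written as plain numbers.
    if isinstance(value, (int, float, decimal.Decimal)):
        return str(value)
    # DATE, TIME, TIMESTAMP: assumed ISO 8601 text.
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    return str(value)


if __name__ == "__main__":
    samples = [
        True,
        b"\x01\x02",
        decimal.Decimal("12.50"),
        datetime.date(2024, 1, 15),
        datetime.time(12, 30, 0),
        datetime.datetime(2024, 1, 15, 12, 30, 0),
    ]
    for sample in samples:
        print(repr(sample), "->", to_csv_field(sample))
```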
JSON #
Experimental feature, not recommended for production.
Format Options:
| Option | Default | Type | Description |
|---|---|---|---|
| json.ignore-parse-errors | false | Boolean | Whether to ignore parse errors for JSON format. Skip fields and rows with parse errors instead of failing. Fields are set to null in case of errors. |
| json.map-null-key-mode | FAIL | String | How to handle map keys that are null. Currently supported values are 'FAIL', 'DROP' and 'LITERAL'. |
| json.map-null-key-literal | null | String | Literal to use for null map keys when json.map-null-key-mode is LITERAL. |
| json.line-delimiter | \n | String | The line delimiter for JSON format. |
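The sketch below shows one plausible reading of these options. The function names are hypothetical. The semantics assumed for the modes (FAIL raises an error, DROP discards the entry, LITERAL substitutes the configured literal) are inferred from the option names and are not spelled out above. Only row-level skipping is modeled for json.ignore-parse-errors; setting individual fields to null is not.

```python
import json


def apply_map_null_key_mode(mapping, mode="FAIL", literal="null"):
    """Return a copy of mapping with null (None) keys handled per the mode."""
    result = {}
    for key, value in mapping.items():
        if key is None:
            if mode == "FAIL":
                raise ValueError("map contains a null key")
            if mode == "DROP":
                continue  # discard the entry entirely
            if mode == "LITERAL":
                key = literal  # substitute the configured literal
            else:
                raise ValueError(f"unsupported mode: {mode}")
        result[key] = value
    return result


def read_json_lines(text, line_delimiter="\n", ignore_parse_errors=False):
    """Parse delimiter-separated JSON rows, optionally skipping malformed rows."""
    rows = []
    for line in text.split(line_delimiter):
        if not line:
            continue  # ignore empty segments, e.g. after a trailing delimiter
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if not ignore_parse_errors:
                raise
    return rows


if __name__ == "__main__":
    data = {"a": 1, None: 2}
    print(apply_map_null_key_mode(data, mode="DROP"))
    print(apply_map_null_key_mode(data, mode="LITERAL", literal="null"))
    print(read_json_lines('{"a": 1}\nnot json\n{"a": 2}\n', ignore_parse_errors=True))
```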
The Paimon JSON format uses the Jackson databind API to parse and generate JSON strings.
The following table lists the type mapping from Paimon type to JSON type.
| Paimon type | JSON type |
|---|---|
| CHAR / VARCHAR / STRING | string |
| BOOLEAN | boolean |
| BINARY / VARBINARY | string with encoding: base64 |
| DECIMAL | number |
| TINYINT | number |
| SMALLINT | number |
| INT | number |
| BIGINT | number |
| FLOAT | number |
| DOUBLE | number |
| DATE | string with format: date |
| TIME | string with format: time |
| TIMESTAMP | string with format: date-time |
| TIMESTAMP_LOCAL_ZONE | string with format: date-time (with UTC time zone) |
| ARRAY | array |
| MAP | object |
| MULTISET | object |
| ROW | object |
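The nested rows of the JSON mapping can be sketched as a recursive conversion. The helper `to_json_value` is hypothetical. Modeling a ROW as a dictionary of field names, modeling a MULTISET as an element-to-count mapping, and using ISO 8601 text for timestamps are all assumptions made for illustration. DECIMAL values are left out of this sketch.

```python
import base64
import datetime
import json
from collections import Counter


def to_json_value(value):
    """Convert a Python value into a JSON-serializable value per the mapping table."""
    if isinstance(value, (bytes, bytearray)):
        # BINARY / VARBINARY: base64-encoded string.
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, datetime.datetime):
        if value.tzinfo is not None:
            # TIMESTAMP_LOCAL_ZONE: normalize to UTC before formatting.
            value = value.astimezone(datetime.timezone.utc)
        return value.isoformat()
    if isinstance(value, (datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, dict):
        # MAP, MULTISET (Counter is a dict subclass), and ROW all become objects.
        return {str(k): to_json_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # ARRAY becomes a JSON array.
        return [to_json_value(v) for v in value]
    return value


# Hypothetical row containing an array, a map, a multiset, and a zoned timestamp.
row = {
    "id": 1,
    "tags": ["a", "b"],
    "attrs": {"k": b"\x01\x02"},
    "counts": Counter(["x", "x", "y"]),
    "ts": datetime.datetime(
        2024, 1, 15, 20, 30,
        tzinfo=datetime.timezone(datetime.timedelta(hours=8)),
    ),
}
result = to_json_value(row)

if __name__ == "__main__":
    print(json.dumps(result))
```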