This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.
File Format #
Currently, Paimon supports the Parquet, Avro, ORC, CSV, and JSON file formats.
- The recommended columnar format is Parquet, which has a high compression rate and fast column projection queries.
- The recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
- The recommended testing format is CSV, which is more readable but has the worst read-write performance.
PARQUET #
Parquet is the default file format for Paimon.
The following table lists the type mapping from Paimon type to Parquet type.
Paimon Type | Parquet type | Parquet logical type |
---|---|---|
CHAR / VARCHAR / STRING | BINARY | UTF8 |
BOOLEAN | BOOLEAN | |
BINARY / VARBINARY | BINARY | |
DECIMAL(P, S) | P <= 9: INT32, P <= 18: INT64, P > 18: FIXED_LEN_BYTE_ARRAY | DECIMAL(P, S) |
TINYINT | INT32 | INT_8 |
SMALLINT | INT32 | INT_16 |
INT | INT32 | |
BIGINT | INT64 | |
FLOAT | FLOAT | |
DOUBLE | DOUBLE | |
DATE | INT32 | DATE |
TIME | INT32 | TIME_MILLIS |
TIMESTAMP(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
TIMESTAMP_LOCAL_ZONE(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
ARRAY | 3-LEVEL LIST | LIST |
MAP | 3-LEVEL MAP | MAP |
MULTISET | 3-LEVEL MAP | MAP |
ROW | GROUP | |
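The precision-dependent rows of the table above (DECIMAL and the two timestamp types) can be read as simple threshold rules. The following is a minimal sketch of those rules only; the function names are hypothetical and are not part of Paimon's API.

```python
from typing import Tuple


def parquet_decimal_physical_type(precision: int) -> str:
    """Physical Parquet type for DECIMAL(P, S), following the table above."""
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def parquet_timestamp_mapping(precision: int) -> Tuple[str, str]:
    """(physical type, logical type) for TIMESTAMP(P) and TIMESTAMP_LOCAL_ZONE(P)."""
    if precision <= 3:
        return ("INT64", "MILLIS")
    if precision <= 6:
        return ("INT64", "MICROS")
    # Per the table, precisions above 6 use INT96 with no logical type.
    return ("INT96", "NONE")


if __name__ == "__main__":
    for p in (5, 9, 10, 18, 19, 38):
        print(f"DECIMAL({p}, 0) -> {parquet_decimal_physical_type(p)}")
    for p in (0, 3, 6, 9):
        print(f"TIMESTAMP({p}) -> {parquet_timestamp_mapping(p)}")
```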
Limitations:
AVRO #
The following table lists the type mapping from Paimon type to Avro type.
Paimon type | Avro type | Avro logical type |
---|---|---|
CHAR / VARCHAR / STRING | string | |
BOOLEAN | boolean | |
BINARY / VARBINARY | bytes | |
DECIMAL | bytes | decimal |
TINYINT | int | |
SMALLINT | int | |
INT | int | |
BIGINT | long | |
FLOAT | float | |
DOUBLE | double | |
DATE | int | date |
TIME | int | time-millis |
TIMESTAMP | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
TIMESTAMP_LOCAL_ZONE | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
ARRAY | array | |
MAP (key must be string/char/varchar type) | map | |
MULTISET (element must be string/char/varchar type) | map | |
ROW | record | |
In addition to the types listed above, Paimon maps nullable types to Avro union(something, null), where something is the Avro type converted from the Paimon type.
Refer to the Avro Specification for more information about Avro types.
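To illustrate the nullable mapping, the sketch below builds an Avro record schema as plain JSON, wrapping nullable fields in a union with null. The helper names and the example record are hypothetical. The branch order inside the union follows the wording union(something, null) used here and is an assumption; the order a real writer emits may differ.

```python
import json


def to_nullable(avro_type):
    """Wrap an Avro type in a union with null (Avro unions are JSON arrays)."""
    # Assumption: branch order mirrors the text's union(something, null).
    return [avro_type, "null"]


def record_schema(name, fields):
    """Build an Avro record schema from (field_name, avro_type, nullable) tuples."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": field_name, "type": to_nullable(avro_type) if nullable else avro_type}
            for field_name, avro_type, nullable in fields
        ],
    }


# Hypothetical row type: a non-null BIGINT and a nullable STRING.
schema = record_schema(
    "example_row",
    [("id", "long", False), ("name", "string", True)],
)

if __name__ == "__main__":
    print(json.dumps(schema, indent=2))
```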
ORC #
The following table lists the type mapping from Paimon type to ORC type.
Paimon Type | ORC physical type | ORC logical type |
---|---|---|
CHAR | bytes | CHAR |
VARCHAR | bytes | VARCHAR |
STRING | bytes | STRING |
BOOLEAN | long | BOOLEAN |
BYTES | bytes | BINARY |
DECIMAL | decimal | DECIMAL |
TINYINT | long | BYTE |
SMALLINT | long | SHORT |
INT | long | INT |
BIGINT | long | LONG |
FLOAT | double | FLOAT |
DOUBLE | double | DOUBLE |
DATE | long | DATE |
TIMESTAMP | timestamp | TIMESTAMP |
TIMESTAMP_LOCAL_ZONE | timestamp | TIMESTAMP_INSTANT |
ARRAY | - | LIST |
MAP | - | MAP |
ROW | - | STRUCT |
Limitations:
- ORC has a time zone bias when mapping the TIMESTAMP_LOCAL_ZONE type, saving the millis value corresponding to the UTC literal time. Due to compatibility issues, this behavior cannot be modified.
CSV #
Experimental feature, not recommended for production.
Format Options:
Option | Default | Type | Description |
---|---|---|---|
csv.field-delimiter | , | String | Field delimiter character (',' by default), must be single character. You can use backslash to specify special characters, e.g. '\t' represents the tab character. |
csv.line-delimiter | \n | String | The line delimiter for CSV format. |
csv.quote-character | " | String | Quote character for enclosing field values (" by default). |
csv.escape-character | \ | String | The escape character for CSV format. |
csv.include-header | false | Boolean | Whether to include header in CSV files. |
csv.null-literal | "" | String | Null literal string that is interpreted as a null value (disabled by default). |
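To make the effect of these options concrete, the sketch below mirrors them with Python's standard csv module. This is only an analogy under stated assumptions: Paimon does not use this module, the option values shown are hypothetical, and treating the null literal as a plain string substitution is an assumption about its semantics.

```python
import csv
import io

# Hypothetical option values, keyed by the option names from the table above.
options = {
    "csv.field-delimiter": "\t",
    "csv.line-delimiter": "\n",
    "csv.quote-character": '"',
    "csv.escape-character": "\\",
    "csv.include-header": True,
    "csv.null-literal": "NULL",
}


def write_csv(header, rows, opts):
    """Serialize rows to a CSV string, honoring the options above."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=opts["csv.field-delimiter"],
        lineterminator=opts["csv.line-delimiter"],
        quotechar=opts["csv.quote-character"],
        escapechar=opts["csv.escape-character"],
        quoting=csv.QUOTE_MINIMAL,
    )
    if opts["csv.include-header"]:
        writer.writerow(header)
    for row in rows:
        # Assumption: nulls are written as the configured null literal.
        writer.writerow([opts["csv.null-literal"] if v is None else v for v in row])
    return buf.getvalue()


def read_csv(text, opts):
    """Parse a CSV string back into rows, mapping the null literal to None."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=opts["csv.field-delimiter"],
        quotechar=opts["csv.quote-character"],
        escapechar=opts["csv.escape-character"],
    )
    rows = list(reader)
    if opts["csv.include-header"]:
        rows = rows[1:]  # skip the header line
    return [[None if v == opts["csv.null-literal"] else v for v in row] for row in rows]


text = write_csv(["id", "name", "note"], [(1, "alice", None), (2, "a\tb", "x")], options)
parsed = read_csv(text, options)

if __name__ == "__main__":
    print(text)
    print(parsed)
```

A field that contains the delimiter (the tab in the second row) is enclosed in the quote character on write and restored unchanged on read. All values come back as strings, since CSV carries no type information.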
The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.
The following table lists the type mapping from Paimon type to CSV type.
Paimon type | CSV type |
---|---|
CHAR / VARCHAR / STRING | string |
BOOLEAN | boolean |
BINARY / VARBINARY | string with encoding: base64 |
DECIMAL | number |
TINYINT | number |
SMALLINT | number |
INT | number |
BIGINT | number |
FLOAT | number |
DOUBLE | number |
DATE | string with format: date |
TIME | string with format: time |
TIMESTAMP | string with format: date-time |
TIMESTAMP_LOCAL_ZONE | string with format: date-time |
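As a rough illustration of the mapping above, the sketch below renders Python values as CSV field text. The function name is hypothetical, and the concrete text forms (lowercase true/false, ISO 8601 for date, time, and date-time) are assumptions; the table only states the general category of each representation.

```python
import base64
from datetime import date, datetime, time
from decimal import Decimal


def to_csv_field(value):
    """Render a single value as CSV field text, following the mapping table."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    # BINARY / VARBINARY: string with base64 encoding.
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(bytes(value)).decode("ascii")
    # datetime must be checked before date, because datetime subclasses date.
    if isinstance(value, datetime):
        return value.isoformat()  # assumed ISO 8601 date-time
    if isinstance(value, date):
        return value.isoformat()  # assumed ISO 8601 date
    if isinstance(value, time):
        return value.isoformat()  # assumed ISO 8601 time
    # Numeric types are written as plain numbers.
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    # CHAR / VARCHAR / STRING are written as-is.
    return str(value)


if __name__ == "__main__":
    samples = [
        "hello", True, b"\x00\x01", Decimal("1.50"), 42,
        date(2024, 1, 2), time(3, 4, 5), datetime(2024, 1, 2, 3, 4, 5),
    ]
    for sample in samples:
        print(repr(sample), "->", to_csv_field(sample))
```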
JSON #
Experimental feature, not recommended for production.
TODO