File Format #

Paimon currently supports the Parquet, Avro, ORC, CSV, TEXT, JSON, Lance, and BLOB file formats.

The recommended columnar format is Parquet, which offers a high compression ratio and fast column-projection queries.
The recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
The recommended format for testing is CSV, which is the most readable but has the worst read and write performance.
The recommended format for ML workloads is Lance, which is optimized for vector search and machine learning use cases.

PARQUET #

Parquet is the default file format for Paimon.

The following table lists the type mapping from Paimon type to Parquet type.

Paimon Type	Parquet type	Parquet logical type
CHAR / VARCHAR / STRING	BINARY	UTF8
BOOLEAN	BOOLEAN
BINARY / VARBINARY	BINARY
DECIMAL(P, S)	P <= 9: INT32, P <= 18: INT64, P > 18: FIXED_LEN_BYTE_ARRAY	DECIMAL(P, S)
TINYINT	INT32	INT_8
SMALLINT	INT32	INT_16
INT	INT32
BIGINT	INT64
FLOAT	FLOAT
DOUBLE	DOUBLE
DATE	INT32	DATE
TIME	INT32	TIME_MILLIS
TIMESTAMP(P)	P <= 3: INT64, P <= 6: INT64, P > 6: INT96	P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE
TIMESTAMP_LOCAL_ZONE(P)	P <= 3: INT64, P <= 6: INT64, P > 6: INT96	P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE
ARRAY	3-LEVEL LIST	LIST
MAP	3-LEVEL MAP	MAP
MULTISET	3-LEVEL MAP	MAP
ROW	GROUP

Limitations:

Parquet does not support nullable map keys.
A Parquet TIMESTAMP with precision greater than 6 (for example, precision 9) is stored as INT96. This INT96 value is time-zone converted, so it requires additional adjustments when read.
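
The precision-dependent rows of the Parquet table can be read as a small decision function. The following is a minimal Python sketch of that reading only; the function names are hypothetical and are not part of any Paimon API.

```python
from typing import Tuple


def parquet_decimal_physical_type(precision: int) -> str:
    """Physical type for DECIMAL(P, S), following the mapping table above."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def parquet_timestamp_mapping(precision: int) -> Tuple[str, str]:
    """(physical type, logical type) for TIMESTAMP(P), following the table above."""
    if precision < 0:
        raise ValueError("precision must not be negative")
    if precision <= 3:
        return ("INT64", "MILLIS")
    if precision <= 6:
        return ("INT64", "MICROS")
    # Precision above 6 falls back to INT96 with no logical type annotation.
    return ("INT96", "NONE")
```

Under this reading, `DECIMAL(10, 2)` lands in INT64 and `TIMESTAMP(9)` is the only timestamp case affected by the INT96 limitation.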

AVRO #

The following table lists the type mapping from Paimon type to Avro type.

Paimon type	Avro type	Avro logical type
`CHAR / VARCHAR / STRING`	`string`
`BOOLEAN`	`boolean`
`BINARY / VARBINARY`	`bytes`
`DECIMAL`	`bytes`	`decimal`
`TINYINT`	`int`
`SMALLINT`	`int`
`INT`	`int`
`BIGINT`	`long`
`FLOAT`	`float`
`DOUBLE`	`double`
`DATE`	`int`	`date`
`TIME`	`int`	`time-millis`
`TIMESTAMP`	P <= 3: `long`, P <= 6: `long`, P > 6: unsupported	P <= 3: `timestamp-millis`, P <= 6: `timestamp-micros`, P > 6: unsupported
`TIMESTAMP_LOCAL_ZONE`	P <= 3: `long`, P <= 6: `long`, P > 6: unsupported	P <= 3: `local-timestamp-millis`, P <= 6: `local-timestamp-micros`, P > 6: unsupported
`ARRAY`	`array`
`MAP` (key must be string/char/varchar type)	`map`
`MULTISET` (element must be string/char/varchar type)	`map`
`ROW`	`record`

Note:

In addition to the types listed above, Paimon maps nullable types to the Avro union(something, null), where something is the Avro type converted from the Paimon type.

Refer to the Avro Specification for more information about Avro types.
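
To illustrate the nullable mapping described in the note, here is a minimal Python sketch that builds an Avro record schema as plain JSON. The helper `avro_field` and the record name `ExampleRow` are hypothetical. The order of the union members follows the wording of the note and may differ from what the actual writer emits.

```python
import json
from typing import Any, Dict


def avro_field(name: str, avro_type: Any, nullable: bool) -> Dict[str, Any]:
    """Build one Avro field; nullable fields become union(type, null)."""
    # Assumption: member order mirrors the note above, i.e. (something, null).
    field_type = [avro_type, "null"] if nullable else avro_type
    return {"name": name, "type": field_type}


schema = {
    "type": "record",
    "name": "ExampleRow",  # hypothetical record name
    "fields": [
        avro_field("id", "long", nullable=False),      # BIGINT NOT NULL
        avro_field("name", "string", nullable=True),   # STRING
        avro_field(
            "birthday",
            {"type": "int", "logicalType": "date"},    # DATE
            nullable=True,
        ),
    ],
}

schema_json = json.dumps(schema, indent=2)
```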

ORC #

The following table lists the type mapping from Paimon type to ORC type.

Paimon Type	ORC physical type	ORC logical type
CHAR	bytes	CHAR
VARCHAR	bytes	VARCHAR
STRING	bytes	STRING
BOOLEAN	long	BOOLEAN
BYTES	bytes	BINARY
DECIMAL	decimal	DECIMAL
TINYINT	long	BYTE
SMALLINT	long	SHORT
INT	long	INT
BIGINT	long	LONG
FLOAT	double	FLOAT
DOUBLE	double	DOUBLE
DATE	long	DATE
TIMESTAMP	timestamp	TIMESTAMP
TIMESTAMP_LOCAL_ZONE	timestamp	TIMESTAMP_INSTANT
ARRAY	-	LIST
MAP	-	MAP
ROW	-	STRUCT

Limitations:

ORC has a time zone bias when mapping the TIMESTAMP_LOCAL_ZONE type: it saves the millis value corresponding to the UTC literal time. For compatibility reasons, this behavior cannot be changed.

CSV #

Experimental feature, not recommended for production.

Format Options:

Option	Default	Type	Description
csv.field-delimiter	`,`	String	Field delimiter character (`','` by default). Must be a single character. You can use a backslash to specify special characters, e.g. `'\t'` represents the tab character.
csv.line-delimiter	`\n`	String	The line delimiter for CSV format.
csv.quote-character	`"`	String	Quote character for enclosing field values (`"` by default).
csv.escape-character	`\`	String	The escape character for CSV format.
csv.include-header	false	Boolean	Whether to include a header in CSV files.
csv.null-literal	`""`	String	Null literal string that is interpreted as a null value (disabled by default).
csv.mode	`PERMISSIVE`	String	The mode for handling corrupt records during reading. Supported values are `'PERMISSIVE'`, `'DROPMALFORMED'` and `'FAILFAST'`. `'PERMISSIVE'` sets malformed fields to null. `'DROPMALFORMED'` ignores the entire corrupt record. `'FAILFAST'` throws an exception when it encounters a corrupt record.
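
The three `csv.mode` values can be illustrated with a small Python sketch. This is not Paimon's reader: the function `read_csv`, its `converters` argument, and the rule that a wrong field count or a failed conversion counts as "malformed" are assumptions made for illustration.

```python
import csv
import io
from typing import Any, Callable, List, Optional, Sequence

MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def parse_record(
    record: Sequence[str],
    converters: Sequence[Callable[[str], Any]],
    mode: str,
) -> Optional[List[Any]]:
    """Convert one CSV record; return None if the record is dropped."""
    malformed = len(record) != len(converters)
    row: List[Any] = []
    for i, convert in enumerate(converters):
        try:
            row.append(convert(record[i]))
        except (ValueError, IndexError):
            # A field that is missing or cannot be converted becomes null.
            malformed = True
            row.append(None)
    if malformed and mode == "FAILFAST":
        raise ValueError(f"malformed record: {list(record)!r}")
    if malformed and mode == "DROPMALFORMED":
        return None
    return row


def read_csv(
    text: str,
    converters: Sequence[Callable[[str], Any]],
    mode: str = "PERMISSIVE",
) -> List[List[Any]]:
    """Read all records from text, applying the given corrupt-record mode."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    rows = []
    for record in csv.reader(io.StringIO(text)):
        row = parse_record(record, converters, mode)
        if row is not None:
            rows.append(row)
    return rows


SAMPLE = "1,alice\nx,bob\n3,carol\n"  # second record has a non-numeric id
```

With `SAMPLE` and the converters `[int, str]`, PERMISSIVE keeps three rows and nulls the bad id, DROPMALFORMED keeps two rows, and FAILFAST raises on the second record.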

The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.

The following table lists the type mapping from Paimon type to CSV type.

Paimon type	CSV type
`CHAR / VARCHAR / STRING`	`string`
`BOOLEAN`	`boolean`
`BINARY / VARBINARY`	`string with encoding: base64`
`DECIMAL`	`number`
`TINYINT`	`number`
`SMALLINT`	`number`
`INT`	`number`
`BIGINT`	`number`
`FLOAT`	`number`
`DOUBLE`	`number`
`DATE`	`string with format: date`
`TIME`	`string with format: time`
`TIMESTAMP`	`string with format: date-time`
`TIMESTAMP_LOCAL_ZONE`	`string with format: date-time`

TEXT #

Experimental feature, not recommended for production.

Format Options:

Option	Default	Type	Description
text.line-delimiter	`\n`	String	The line delimiter for TEXT format.

A Paimon text table contains a single field of string type.

JSON #

Experimental feature, not recommended for production.

Format Options:

Option	Default	Type	Description
json.ignore-parse-errors	false	Boolean	Whether to ignore parse errors for JSON format. When enabled, fields and rows with parse errors are skipped instead of causing a failure, and the affected fields are set to null.
json.map-null-key-mode	`FAIL`	String	How to handle null map keys. Supported values are `'FAIL'`, `'DROP'` and `'LITERAL'`. `'FAIL'` throws an exception when a map contains a null key. `'DROP'` drops map entries with null keys. `'LITERAL'` replaces the null key with a string literal, which is defined by the `json.map-null-key-literal` option.
json.map-null-key-literal	`null`	String	Literal to use for null map keys when `json.map-null-key-mode` is `LITERAL`.
json.line-delimiter	`\n`	String	The line delimiter for JSON format.
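
The interaction between `json.map-null-key-mode` and `json.map-null-key-literal` can be sketched as follows. The function `apply_map_null_key_mode` is hypothetical, and the map is modelled as a list of key/value pairs so that a null key can be represented; Paimon's actual writer may differ in detail.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def apply_map_null_key_mode(
    entries: Iterable[Tuple[Optional[str], Any]],
    mode: str = "FAIL",
    literal: str = "null",
) -> Dict[str, Any]:
    """Turn map entries into a JSON object, handling null keys per mode."""
    if mode not in ("FAIL", "DROP", "LITERAL"):
        raise ValueError(f"unsupported mode: {mode}")
    result: Dict[str, Any] = {}
    for key, value in entries:
        if key is None:
            if mode == "FAIL":
                raise ValueError("map contains a null key")
            if mode == "DROP":
                continue  # skip the whole entry
            key = literal  # LITERAL: substitute the configured string
        result[key] = value
    return result


entries = [("a", 1), (None, 2)]
```

Here `DROP` yields `{"a": 1}`, `LITERAL` with the default literal yields `{"a": 1, "null": 2}`, and `FAIL` raises an error.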

The Paimon JSON format uses the Jackson databind API to parse and generate JSON strings.

The following table lists the type mapping from Paimon type to JSON type.

Paimon type	JSON type
`CHAR / VARCHAR / STRING`	`string`
`BOOLEAN`	`boolean`
`BINARY / VARBINARY`	`string with encoding: base64`
`DECIMAL`	`number`
`TINYINT`	`number`
`SMALLINT`	`number`
`INT`	`number`
`BIGINT`	`number`
`FLOAT`	`number`
`DOUBLE`	`number`
`DATE`	`string with format: date`
`TIME`	`string with format: time`
`TIMESTAMP`	`string with format: date-time`
`TIMESTAMP_LOCAL_ZONE`	`string with format: date-time (with UTC time zone)`
`ARRAY`	`array`
`MAP`	`object`
`MULTISET`	`object`
`ROW`	`object`
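
As a rough illustration of the JSON mapping table, the Python sketch below converts native values into JSON-compatible ones. The helper `to_json_value` is hypothetical, and the use of ISO 8601 strings for date and time values is an assumption; the exact string formats Paimon writes are not specified here.

```python
import base64
import datetime
import json
from typing import Any


def to_json_value(value: Any) -> Any:
    """Map a Python value to its JSON representation per the table above."""
    if isinstance(value, (bytes, bytearray)):
        # BINARY / VARBINARY: base64-encoded string
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, datetime.datetime):
        # Checked before date, because datetime is a subclass of date.
        if value.tzinfo is not None:
            # TIMESTAMP_LOCAL_ZONE: rendered in the UTC time zone
            value = value.astimezone(datetime.timezone.utc)
        return value.isoformat()  # assumption: ISO 8601 text
    if isinstance(value, (datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, dict):
        # MAP / MULTISET / ROW: JSON object
        return {str(k): to_json_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # ARRAY: JSON array
        return [to_json_value(v) for v in value]
    return value  # strings, booleans, and numbers pass through


row = {
    "id": 7,
    "payload": b"hi",
    "created": datetime.date(2024, 1, 2),
    "tags": ["x", "y"],
}
line = json.dumps(to_json_value(row))
```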

LANCE #

Lance is a modern columnar data format optimized for machine learning and vector search workloads. It provides high-performance read and write operations with native support for Apache Arrow.

The following table lists the type mapping from Paimon type to Lance (Arrow) type.

Paimon Type	Lance (Arrow) type
CHAR / VARCHAR / STRING	UTF8
BOOLEAN	BOOL
BINARY / VARBINARY	BINARY
DECIMAL(P, S)	DECIMAL128(P, S)
TINYINT	INT8
SMALLINT	INT16
INT	INT32
BIGINT	INT64
FLOAT	FLOAT
DOUBLE	DOUBLE
DATE	DATE32
TIME	TIME32 / TIME64
TIMESTAMP(P)	TIMESTAMP (unit based on precision)
ARRAY	LIST
MULTISET	LIST
ROW	STRUCT

Limitations:

Lance file format does not support MAP type.
Lance file format does not support TIMESTAMP_LOCAL_ZONE type.
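
A schema can be checked against these two limitations before choosing Lance. The sketch below is a hypothetical helper, not a Paimon API; it assumes column types are given as strings that use the type names from the tables on this page (for example `MAP<STRING, INT>`).

```python
import re
from typing import Dict, List

# Type names that the limitations above rule out for Lance.
UNSUPPORTED_BY_LANCE = {"MAP", "TIMESTAMP_LOCAL_ZONE"}


def lance_unsupported_columns(schema: Dict[str, str]) -> List[str]:
    """Return the columns whose type uses MAP or TIMESTAMP_LOCAL_ZONE,
    including occurrences nested inside ARRAY or ROW types."""
    bad = []
    for column, type_string in schema.items():
        # Split the type string into identifier tokens such as MAP or INT.
        tokens = set(re.findall(r"[A-Z_]+", type_string.upper()))
        if tokens & UNSUPPORTED_BY_LANCE:
            bad.append(column)
    return bad


schema = {
    "id": "BIGINT",
    "embedding": "ARRAY<FLOAT>",
    "attrs": "MAP<STRING, INT>",
    "seen_at": "TIMESTAMP_LOCAL_ZONE(3)",
    "history": "ARRAY<MAP<STRING, INT>>",
}
```

For this example schema the helper reports `attrs`, `seen_at`, and `history`, while `id` and `embedding` are fine.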

BLOB #

The BLOB format is a specialized format for storing large binary objects such as images, videos, and other multimodal data. Unlike other formats that store data inline, BLOB format stores large binary data in separate files with an optimized layout for random access.

BLOB files use the .blob extension and have the following structure:

+------------------+
| Blob Entry 1     |
|   Magic Number   |  4 bytes (1481511375, Little Endian)
|   Blob Data      |  Variable length
|   Length         |  8 bytes (Little Endian)
|   CRC32          |  4 bytes (Little Endian)
+------------------+
| Blob Entry 2     |
|   ...            |
+------------------+
| Index            |  Variable (Delta-Varint compressed)
+------------------+
| Index Length     |  4 bytes (Little Endian)
| Version          |  1 byte
+------------------+

Key features:

CRC32 Checksums: Each blob entry has a CRC32 checksum for data integrity verification
Indexed Access: The index at the end enables efficient random access to any blob in the file
Delta-Varint Compression: The index uses delta-varint compression for space efficiency
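
To make the layout more concrete, here is a minimal Python sketch of one blob entry and of a delta-varint encoding. It rests on several assumptions that the layout above does not settle: that Length is the size of the blob data alone, that the CRC32 covers only the blob data, and that the index stores zigzag-encoded deltas as LEB128-style varints. All function names are hypothetical, and the real on-disk encoding may differ.

```python
import struct
import zlib
from typing import Iterable, List

MAGIC = 1481511375  # magic number from the layout above


def encode_entry(data: bytes) -> bytes:
    """Magic (4 bytes LE) + data + length (8 bytes LE) + CRC32 (4 bytes LE)."""
    # Assumption: length and CRC32 both refer to the data bytes only.
    return (
        struct.pack("<I", MAGIC)
        + data
        + struct.pack("<q", len(data))
        + struct.pack("<I", zlib.crc32(data))
    )


def decode_entry(entry: bytes) -> bytes:
    """Validate magic number, length, and checksum, then return the data."""
    (magic,) = struct.unpack_from("<I", entry, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    length, crc = struct.unpack_from("<qI", entry, len(entry) - 12)
    data = entry[4 : len(entry) - 12]
    if length != len(data):
        raise ValueError("length mismatch")
    if zlib.crc32(data) != crc:
        raise ValueError("CRC32 mismatch")
    return data


def encode_delta_varint(values: Iterable[int]) -> bytes:
    """Store each value as the zigzag varint of its delta to the previous one."""
    out = bytearray()
    previous = 0
    for value in values:
        delta = value - previous
        previous = value
        zigzag = (delta << 1) ^ (delta >> 63)  # maps signed to unsigned
        while zigzag >= 0x80:
            out.append((zigzag & 0x7F) | 0x80)  # continuation bit set
            zigzag >>= 7
        out.append(zigzag)
    return bytes(out)


def decode_delta_varint(buf: bytes) -> List[int]:
    """Inverse of encode_delta_varint."""
    values, previous, pos = [], 0, 0
    while pos < len(buf):
        shift, zigzag = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            zigzag |= (byte & 0x7F) << shift
            if byte < 0x80:
                break
            shift += 7
        previous += (zigzag >> 1) ^ -(zigzag & 1)
        values.append(previous)
    return values
```

Small deltas between neighbouring index values fit in a single byte, which is why this style of encoding keeps an index compact.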

Limitations:

BLOB format only supports a single BLOB type column per file.
BLOB format does not support predicate pushdown.
Statistics collection is not supported for BLOB columns.

For usage details, configuration options, and examples, see Blob Type.
