
File Format

Paimon currently supports the Parquet, Avro, ORC, CSV, JSON, Lance, Vortex, Mosaic, and Row file formats.

  • Recommended columnar format is Parquet, which has a high compression rate and fast column projection queries.
  • Recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
  • Recommended format for wide tables is Mosaic, a columnar-bucket hybrid format with column bucketing for parallel I/O.
  • Recommended columnar format for point lookups is Vortex, which uses adaptive encoding for excellent point-query performance and efficient vector data compression.
  • Recommended format for row-number based O(1) lookups is Row, which stores data in row-oriented blocks with ZSTD compression and supports fast random access by row number.
  • Recommended testing format is CSV, which is the most readable but has the worst read and write performance.
  • Recommended format for ML workloads is Lance, which is optimized for vector search and machine learning use cases.
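The recommendations above can be summarized as a small lookup. This is only an illustrative sketch: the workload labels and the helper name are hypothetical, the option key `file.format` is assumed from Paimon's table options rather than stated on this page, and the lowercase format identifiers are assumed spellings.

```python
# Hypothetical workload labels mapped to the formats recommended above.
# The lowercase identifiers are assumed spellings of the format names.
RECOMMENDED_FORMAT = {
    "column_projection": "parquet",
    "full_row": "avro",
    "wide_table": "mosaic",
    "point_lookup": "vortex",
    "row_number_lookup": "row",
    "testing": "csv",
    "ml": "lance",
}


def table_options(workload: str) -> dict:
    """Return table options selecting the recommended format for a workload.

    The 'file.format' key is an assumption about how the format is configured.
    """
    try:
        fmt = RECOMMENDED_FORMAT[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
    return {"file.format": fmt}
```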

PARQUET

Parquet is the default file format for Paimon.

The following table lists the type mapping from Paimon type to Parquet type.

Paimon type | Parquet type | Parquet logical type
CHAR / VARCHAR / STRING | BINARY | UTF8
BOOLEAN | BOOLEAN |
BINARY / VARBINARY | BINARY |
DECIMAL(P, S) | P <= 9: INT32, P <= 18: INT64, P > 18: FIXED_LEN_BYTE_ARRAY | DECIMAL(P, S)
TINYINT | INT32 | INT_8
SMALLINT | INT32 | INT_16
INT | INT32 |
BIGINT | INT64 |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
DATE | INT32 | DATE
TIME | INT32 | TIME_MILLIS
TIMESTAMP(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE
TIMESTAMP_LOCAL_ZONE(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE
ARRAY | 3-LEVEL LIST | LIST
MAP | 3-LEVEL MAP | MAP
MULTISET | 3-LEVEL MAP | MAP
ROW | GROUP |
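The precision-dependent rows for DECIMAL and TIMESTAMP in the table above can be read as simple threshold rules. The sketch below restates them in Python; the function names are hypothetical and this is not Paimon's own code.

```python
def decimal_physical_type(precision: int) -> str:
    """Parquet physical type for DECIMAL(P, S), following the table above."""
    if precision < 1:
        raise ValueError("precision must be positive")
    if precision <= 9:
        return "INT32"   # 10**9 - 1 fits in a signed 32-bit integer
    if precision <= 18:
        return "INT64"   # 10**18 - 1 fits in a signed 64-bit integer
    return "FIXED_LEN_BYTE_ARRAY"


def timestamp_types(precision: int):
    """(physical type, logical unit) for TIMESTAMP(P), following the table above."""
    if precision < 0:
        raise ValueError("precision must be non-negative")
    if precision <= 3:
        return ("INT64", "MILLIS")
    if precision <= 6:
        return ("INT64", "MICROS")
    return ("INT96", "NONE")


# Sanity checks for the integer-width reasoning in the comments.
assert 10**9 - 1 <= 2**31 - 1
assert 10**18 - 1 <= 2**63 - 1
```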

Limitations:

  1. Parquet does not support nullable map keys.
  2. A Parquet TIMESTAMP with precision 9 is stored as INT96, but the INT96 value is time-zone converted and requires additional adjustment.

AVRO

The following table lists the type mapping from Paimon type to Avro type.

Paimon type | Avro type | Avro logical type
CHAR / VARCHAR / STRING | string |
BOOLEAN | boolean |
BINARY / VARBINARY | bytes |
DECIMAL | bytes | decimal
TINYINT | int |
SMALLINT | int |
INT | int |
BIGINT | long |
FLOAT | float |
DOUBLE | double |
DATE | int | date
TIME | int | time-millis
TIMESTAMP | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported
TIMESTAMP_LOCAL_ZONE | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: localTimestampMillis, P <= 6: localTimestampMicros, P > 6: unsupported
ARRAY | array |
MAP (key must be string/char/varchar type) | map |
MULTISET (element must be string/char/varchar type) | map |
ROW | record |

Note:

In addition to the types listed above, Paimon maps nullable types to an Avro union(something, null), where something is the Avro type converted from the Paimon type.

Refer to the Avro specification for more information about Avro types.
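To make the nullable mapping concrete, the sketch below builds an Avro schema as plain JSON. The record name, field names, and helper are hypothetical, and the branch order inside the union simply follows the union(something, null) notation used in the note; the order Paimon actually writes is not stated here.

```python
import json


def to_avro_type(avro_type, nullable: bool):
    """Wrap an Avro type in a union with null when the Paimon type is nullable."""
    return [avro_type, "null"] if nullable else avro_type


# Hypothetical three-column table: id BIGINT NOT NULL, name STRING, price DECIMAL(10, 2).
schema = {
    "type": "record",
    "name": "example_row",
    "fields": [
        {"name": "id", "type": to_avro_type("long", nullable=False)},
        {"name": "name", "type": to_avro_type("string", nullable=True)},
        {
            "name": "price",
            "type": to_avro_type(
                {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2},
                nullable=True,
            ),
        },
    ],
}

print(json.dumps(schema, indent=2))
```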

ORC

The following table lists the type mapping from Paimon type to ORC type.

Paimon type | ORC physical type | ORC logical type
CHAR | bytes | CHAR
VARCHAR | bytes | VARCHAR
STRING | bytes | STRING
BOOLEAN | long | BOOLEAN
BYTES | bytes | BINARY
DECIMAL | decimal | DECIMAL
TINYINT | long | BYTE
SMALLINT | long | SHORT
INT | long | INT
BIGINT | long | LONG
FLOAT | double | FLOAT
DOUBLE | double | DOUBLE
DATE | long | DATE
TIMESTAMP | timestamp | TIMESTAMP
TIMESTAMP_LOCAL_ZONE | timestamp | TIMESTAMP_INSTANT
ARRAY | - | LIST
MAP | - | MAP
ROW | - | STRUCT

Limitations:

  1. ORC has a time zone bias when mapping TIMESTAMP_LOCAL_ZONE type, saving the millis value corresponding to the UTC literal time. Due to compatibility issues, this behavior cannot be modified.

CSV

Experimental feature, not recommended for production.

Format Options:

Option | Default | Type | Description
csv.field-delimiter | , | String | Field delimiter character; must be a single character. Use a backslash to specify special characters, e.g. '\t' for the tab character.
csv.line-delimiter | \n | String | The line delimiter for the CSV format.
csv.quote-character | " | String | Quote character for enclosing field values.
csv.escape-character | \ | String | The escape character for the CSV format.
csv.include-header | false | Boolean | Whether to include a header in CSV files.
csv.null-literal | "" | String | Null literal string that is interpreted as a null value (disabled by default).
csv.mode | PERMISSIVE | String | Mode for handling corrupt records during reading. Supported values are 'PERMISSIVE', 'DROPMALFORMED' and 'FAILFAST'.

The csv.mode values behave as follows:

  • 'PERMISSIVE' sets malformed fields to null.
  • 'DROPMALFORMED' ignores the whole corrupted record.
  • 'FAILFAST' throws an exception when it meets a corrupted record.

The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.
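The three csv.mode behaviours can be illustrated with a small reader built on Python's standard csv module. This is only a sketch of the semantics described above, not Paimon's Jackson-based implementation; the function name is hypothetical, and "malformed" is narrowed here to mean a field that cannot be parsed as an integer.

```python
import csv
import io

MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_int_rows(text: str, mode: str = "PERMISSIVE"):
    """Parse CSV text whose columns are all integers, applying a csv.mode-like policy."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    # Delimiter, quote, and escape characters mirror the documented defaults.
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"', escapechar="\\")
    rows = []
    for record in reader:
        parsed, malformed = [], False
        for field in record:
            try:
                parsed.append(int(field))
            except ValueError:
                malformed = True
                parsed.append(None)      # PERMISSIVE: malformed field becomes null
        if malformed and mode == "FAILFAST":
            raise ValueError(f"corrupted record: {record}")
        if malformed and mode == "DROPMALFORMED":
            continue                     # drop the whole record
        rows.append(parsed)
    return rows
```

With the input `"1,2\n3,oops\n5,6\n"`, PERMISSIVE keeps three rows and nulls the bad field, DROPMALFORMED keeps only the two clean rows, and FAILFAST raises on the second record.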

The following table lists the type mapping from Paimon type to CSV type.

Paimon type | CSV type
CHAR / VARCHAR / STRING | string
BOOLEAN | boolean
BINARY / VARBINARY | string with encoding: base64
DECIMAL | number
TINYINT | number
SMALLINT | number
INT | number
BIGINT | number
FLOAT | number
DOUBLE | number
DATE | string with format: date
TIME | string with format: time
TIMESTAMP | string with format: date-time
TIMESTAMP_LOCAL_ZONE | string with format: date-time

TEXT

Experimental feature, not recommended for production.

Format Options:

Option | Default | Type | Description
text.line-delimiter | \n | String | The line delimiter for the TEXT format.

The Paimon text table contains only one field, and it is of string type.

JSON

Experimental feature, not recommended for production.

Format Options:

Option | Default | Type | Description
json.ignore-parse-errors | false | Boolean | Whether to skip fields and rows with parse errors instead of failing. Fields are set to null in case of errors.
json.map-null-key-mode | FAIL | String | How to handle map keys that are null. Supported values are 'FAIL', 'DROP' and 'LITERAL'.
json.map-null-key-literal | null | String | Literal to use for null map keys when json.map-null-key-mode is LITERAL.
json.line-delimiter | \n | String | The line delimiter for the JSON format.

The json.map-null-key-mode values behave as follows:

  • 'FAIL' throws an exception when encountering a map with a null key.
  • 'DROP' drops null-key entries from the map.
  • 'LITERAL' replaces the null key with a string literal, defined by the json.map-null-key-literal option.

The Paimon JSON format uses the Jackson databind API to parse and generate JSON strings.
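The json.map-null-key-mode behaviours can be sketched as follows. The function name and the list-of-pairs input are hypothetical, chosen because a Python dict is the natural stand-in for a JSON object; this illustrates the documented semantics and is not Paimon's implementation.

```python
import json


def encode_map(entries, mode: str = "FAIL", literal: str = "null") -> str:
    """Serialize (key, value) pairs to a JSON object, handling null keys by mode."""
    if mode not in {"FAIL", "DROP", "LITERAL"}:
        raise ValueError(f"unsupported mode: {mode}")
    out = {}
    for key, value in entries:
        if key is None:
            if mode == "FAIL":
                raise ValueError("map contains a null key")
            if mode == "DROP":
                continue                 # silently drop the entry
            key = literal                # LITERAL: substitute the configured string
        out[str(key)] = value
    return json.dumps(out)
```

For the entries `[("a", 1), (None, 2)]`, DROP yields an object with only the key "a", LITERAL adds a second key named by `literal`, and FAIL raises.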

The following table lists the type mapping from Paimon type to JSON type.

Paimon type | JSON type
CHAR / VARCHAR / STRING | string
BOOLEAN | boolean
BINARY / VARBINARY | string with encoding: base64
DECIMAL | number
TINYINT | number
SMALLINT | number
INT | number
BIGINT | number
FLOAT | number
DOUBLE | number
DATE | string with format: date
TIME | string with format: time
TIMESTAMP | string with format: date-time
TIMESTAMP_LOCAL_ZONE | string with format: date-time (with UTC time zone)
ARRAY | array
MAP | object
MULTISET | object
ROW | object
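As a rough illustration of the mapping above, the sketch below converts Python values to JSON-compatible ones: binary data becomes a base64 string, temporal values become strings, and nested structures become arrays or objects. The helper name is hypothetical, and the ISO 8601 text produced by `isoformat()` is an assumption; the exact date, time, and date-time layout that Paimon writes is not specified here.

```python
import base64
import json
from datetime import date, datetime, time


def to_json_value(value):
    """Convert a Python value to something json.dumps can serialize."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")   # BINARY / VARBINARY
    if isinstance(value, datetime):                       # check before date:
        return value.isoformat()                          # datetime subclasses date
    if isinstance(value, (date, time)):
        return value.isoformat()
    if isinstance(value, dict):                           # MAP / ROW -> object
        return {str(k): to_json_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):                  # ARRAY -> array
        return [to_json_value(v) for v in value]
    return value                                          # str, bool, numbers, None


row = {"id": 1, "payload": b"\x00\x01", "day": date(2024, 1, 2), "tags": ["a", "b"]}
print(json.dumps(to_json_value(row)))
```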

LANCE

Lance is a modern columnar data format optimized for machine learning and vector search workloads. It provides high-performance read and write operations with native support for Apache Arrow.

The following table lists the type mapping from Paimon type to Lance (Arrow) type.

Paimon type | Lance (Arrow) type
CHAR / VARCHAR / STRING | UTF8
BOOLEAN | BOOL
BINARY / VARBINARY | BINARY
DECIMAL(P, S) | DECIMAL128(P, S)
TINYINT | INT8
SMALLINT | INT16
INT | INT32
BIGINT | INT64
FLOAT | FLOAT
DOUBLE | DOUBLE
DATE | DATE32
TIME | TIME32 / TIME64
TIMESTAMP(P) | TIMESTAMP (unit based on precision)
ARRAY | LIST
MULTISET | LIST
ROW | STRUCT

Limitations:

  1. Lance file format does not support MAP type.
  2. Lance file format does not support TIMESTAMP_LOCAL_ZONE type.

VORTEX

Vortex is a columnar file format that uses adaptive, data-dependent encodings to achieve high compression ratios while maintaining fast scan performance. It supports native predicate pushdown and efficient column projection.

Key features:

  • Adaptive Encoding: Automatically selects the best encoding per column based on data distribution
  • Native Predicate Pushdown: Supports filter expressions pushed down to the scan layer
  • Column Projection: Only reads requested columns from disk

Limitations:

  1. Vortex does not support MAP or MULTISET types.

MOSAIC

Mosaic is a columnar-bucket hybrid format optimized for wide tables. It groups columns into buckets and compresses each bucket independently with ZSTD, enabling efficient column projection that only reads the buckets containing requested columns.

Key features:

  • Column Bucketing: Columns are grouped into configurable buckets for parallel I/O, significantly reducing read amplification on wide tables
  • Row Group Statistics: Per-row-group min/max/null_count statistics enable row group skipping during scan
  • ZSTD Compression: All data is compressed with ZSTD (configurable level)
  • Arrow-native: Uses Apache Arrow as the in-memory representation for zero-copy integration
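The effect of column bucketing on projection can be sketched in a few lines. The round-robin assignment below is purely an assumption for illustration (how Mosaic actually assigns columns to buckets is not described here), and the function names are hypothetical.

```python
def assign_buckets(columns, num_buckets: int):
    """Group column names into buckets (round-robin; an illustrative assumption)."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    buckets = [[] for _ in range(num_buckets)]
    for i, col in enumerate(columns):
        buckets[i % num_buckets].append(col)
    return buckets


def buckets_to_read(buckets, projection):
    """Return the indexes of buckets holding at least one projected column."""
    wanted = set(projection)
    return [i for i, cols in enumerate(buckets) if wanted.intersection(cols)]


columns = [f"c{i}" for i in range(8)]
buckets = assign_buckets(columns, 4)
# Only the buckets containing c1 and c5 need to be decompressed and read.
print(buckets_to_read(buckets, ["c1", "c5"]))
```

Because each bucket is compressed independently, a projection that touches one bucket leaves the other buckets untouched on disk, which is where the reduced read amplification on wide tables comes from.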

Format Options:

Option | Default | Type | Description
mosaic.num-buckets | auto | Integer | Number of column buckets for parallel I/O. When set to 0 or not specified, the format determines the bucket count automatically.
mosaic.stats-columns | (empty) | String | Comma-separated column names to collect min/max statistics for filter pushdown. Empty means no statistics are collected.
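Row group skipping with min/max statistics can be sketched as below. The data structures and function names are hypothetical and only an equality filter is shown. A column without statistics (for example, one not listed in mosaic.stats-columns) can never be used to skip a row group, so the sketch treats missing statistics as "must scan".

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats: ColumnStats, value: int) -> bool:
    """True if a row group could contain `value` according to its min/max range."""
    return stats.min_value <= value <= stats.max_value


def row_groups_to_scan(row_group_stats, column: str, value: int):
    """Return indexes of row groups that cannot be skipped for `column = value`."""
    selected = []
    for idx, stats_by_column in enumerate(row_group_stats):
        stats = stats_by_column.get(column)
        if stats is None or may_contain(stats, value):
            selected.append(idx)
    return selected


row_group_stats = [
    {"id": ColumnStats(0, 99, 0)},
    {"id": ColumnStats(100, 199, 0)},
    {"id": ColumnStats(200, 299, 3)},
]
print(row_groups_to_scan(row_group_stats, "id", 150))
```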

Limitations:

  1. Mosaic does not support complex types: ARRAY, MAP, MULTISET, ROW, VARIANT, BLOB, VECTOR.

For more details, see the Mosaic documentation.

ROW

The Row format is a row-oriented storage format designed for O(1) random access by row number. Data is organized in blocks with ZSTD Level 1 compression. Each block contains complete rows serialized in a compact binary format with an offset array for direct row positioning.
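A minimal sketch of the lookup idea follows, under several loudly stated assumptions: every block holds a fixed number of rows (so the block is found by integer division), rows are opaque byte strings, and zlib from the standard library stands in for ZSTD. The real file layout is defined by the Row Format specification, not by this code.

```python
import zlib

ROWS_PER_BLOCK = 4  # hypothetical fixed block size


def build_blocks(rows):
    """Pack serialized rows into compressed blocks, each with an offset array."""
    blocks = []
    for start in range(0, len(rows), ROWS_PER_BLOCK):
        offsets, payload = [], b""
        for row in rows[start:start + ROWS_PER_BLOCK]:
            offsets.append(len(payload))
            payload += row
        offsets.append(len(payload))          # end sentinel for the last row
        # zlib level 1 is a stand-in for ZSTD level 1.
        blocks.append((offsets, zlib.compress(payload, 1)))
    return blocks


def get_row(blocks, row_number: int) -> bytes:
    """Fetch one row by global row number without scanning other rows."""
    if row_number < 0:
        raise IndexError("row number must be non-negative")
    block_id, in_block = divmod(row_number, ROWS_PER_BLOCK)
    offsets, compressed = blocks[block_id]
    if in_block >= len(offsets) - 1:
        raise IndexError("row number out of range")
    payload = zlib.decompress(compressed)
    return payload[offsets[in_block]:offsets[in_block + 1]]


rows = [f"row-{i}".encode() for i in range(10)]
blocks = build_blocks(rows)
print(get_row(blocks, 7))
```

The lookup cost does not depend on the total number of rows: one division picks the block, one block is decompressed, and the offset array gives the byte range of the row.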

Key features:

  • O(1) Row Lookup: Block index + in-block offset array enables direct access to any row by its global row number
  • Block-level ZSTD Compression: Each block is independently compressed for good compression ratio with fast decompression
  • Compact Serialization: Rows are serialized with a null bitmap followed by field values in sequence, minimizing overhead
  • Selection Pushdown: Supports RoaringBitmap-based row selection, skipping entire blocks that contain no selected rows
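Selection pushdown can be pictured as turning a set of selected row numbers into a per-block read plan, so that blocks with no selected rows are never touched. In this sketch a plain Python set stands in for a RoaringBitmap, the fixed rows-per-block layout is an assumption, and the function name is hypothetical.

```python
def plan_block_reads(selected_rows, rows_per_block: int, num_rows: int):
    """Map each block that has selected rows to the in-block positions to read."""
    plan = {}
    for row in sorted(selected_rows):
        if not 0 <= row < num_rows:
            raise IndexError(f"row {row} out of range")
        block_id, position = divmod(row, rows_per_block)
        plan.setdefault(block_id, []).append(position)
    return plan


# Rows 1, 2 and 9 are selected in a 12-row file with 4 rows per block:
# block 1 (rows 4..7) contains no selected row and is skipped entirely.
print(plan_block_reads({1, 2, 9}, rows_per_block=4, num_rows=12))
```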

The Row format supports all Paimon data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, CHAR, VARCHAR, BINARY, VARBINARY, DECIMAL, DATE, TIME, TIMESTAMP, TIMESTAMP_LOCAL_ZONE, VARIANT, ARRAY, MAP, ROW.

For detailed file layout and binary format specification, see Row Format.

BLOB

The BLOB format is a specialized format for storing large binary objects such as images, videos, and other multimodal data. Unlike other formats that store data inline, BLOB format stores large binary data in separate files with an optimized layout for random access.

BLOB files use the .blob extension and have the following structure:

+------------------+
| Blob Entry 1     |
|   Magic Number   |  4 bytes (1481511375, Little Endian)
|   Blob Data      |  Variable length
|   Length         |  8 bytes (Little Endian)
|   CRC32          |  4 bytes (Little Endian)
+------------------+
| Blob Entry 2     |
|   ...            |
+------------------+
| Index            |  Variable (Delta-Varint compressed)
+------------------+
| Index Length     |  4 bytes (Little Endian)
| Version          |  1 byte
+------------------+
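A single blob entry in the layout above can be sketched with Python's struct module. Two details are assumptions made only for this illustration and are not stated in the diagram: the Length field is taken to count the blob data bytes alone, and the CRC32 is computed over the blob data alone. The function names are hypothetical.

```python
import struct
import zlib

MAGIC = 1481511375  # magic number from the layout above


def write_entry(data: bytes) -> bytes:
    """Serialize one entry: magic (4, LE) + data + length (8, LE) + CRC32 (4, LE)."""
    crc = zlib.crc32(data) & 0xFFFFFFFF       # assumption: CRC covers the data only
    return (
        struct.pack("<I", MAGIC)
        + data
        + struct.pack("<Q", len(data))        # assumption: length of the data only
        + struct.pack("<I", crc)
    )


def read_entry(entry: bytes) -> bytes:
    """Validate an entry produced by write_entry and return its blob data."""
    if len(entry) < 16:
        raise ValueError("entry too short")
    (magic,) = struct.unpack_from("<I", entry, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    length, crc = struct.unpack_from("<QI", entry, len(entry) - 12)
    if 4 + length + 12 != len(entry):
        raise ValueError("length field does not match entry size")
    data = entry[4:4 + length]
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch")
    return data
```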

Key features:

  • CRC32 Checksums: Each blob entry has a CRC32 checksum for data integrity verification
  • Indexed Access: The index at the end enables efficient random access to any blob in the file
  • Delta-Varint Compression: The index uses delta-varint compression for space efficiency
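Delta-varint compression in general works by storing the difference between consecutive values and writing each difference as a variable-length integer. The sketch below shows the technique on a list of non-decreasing integers; what the BLOB index actually stores (offsets, lengths, or both) and its exact byte layout are not specified here, so the input is a hypothetical list of entry offsets.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_varint_encode(values) -> bytes:
    """Encode non-decreasing integers as varints of successive differences."""
    out, prev = bytearray(), 0
    for v in values:
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def delta_varint_decode(buf: bytes):
    """Inverse of delta_varint_encode."""
    values, prev, cur, shift = [], 0, 0, 0
    for byte in buf:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            values.append(prev)
            cur, shift = 0, 0
    if shift:
        raise ValueError("truncated varint")
    return values


offsets = [0, 21, 1045, 1_000_000]    # hypothetical entry offsets
encoded = delta_varint_encode(offsets)
print(len(encoded), delta_varint_decode(encoded))
```

Small deltas take a single byte, so an index of closely spaced values compresses to far less than a fixed 8 bytes per entry.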

Limitations:

  1. BLOB format only supports a single BLOB type column per file.
  2. BLOB format does not support predicate pushdown.
  3. Statistics collection is not supported for BLOB columns.

For usage details, configuration options, and examples, see Blob Type.