This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

File Format #

Paimon currently supports the Parquet, Avro, ORC, CSV, and JSON file formats.

  • The recommended columnar format is Parquet, which offers a high compression rate and fast column projection queries.
  • The recommended row-based format is Avro, which performs well when reading and writing full rows (all columns).
  • The recommended format for testing is CSV, which is easier to read but has the worst read and write performance.

PARQUET #

Parquet is the default file format for Paimon.

The following table lists the type mapping from Paimon type to Parquet type.

| Paimon type | Parquet type | Parquet logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | |
| BINARY / VARBINARY | BINARY | |
| DECIMAL(P, S) | P <= 9: INT32, P <= 18: INT64, P > 18: FIXED_LEN_BYTE_ARRAY | DECIMAL(P, S) |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT | INT32 | |
| BIGINT | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
| TIMESTAMP_LOCAL_ZONE(P) | P <= 3: INT64, P <= 6: INT64, P > 6: INT96 | P <= 3: MILLIS, P <= 6: MICROS, P > 6: NONE |
| ARRAY | 3-LEVEL LIST | LIST |
| MAP | 3-LEVEL MAP | MAP |
| MULTISET | 3-LEVEL MAP | MAP |
| ROW | GROUP | |
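
The precision-dependent rows for DECIMAL(P, S) and TIMESTAMP(P) can be read as a simple lookup. The Python sketch below only restates those two rows of the table. The function names `parquet_decimal_type` and `parquet_timestamp_type` are hypothetical and are not part of any Paimon API.

```python
from typing import Tuple


def parquet_decimal_type(precision: int) -> str:
    """Return the Parquet physical type the table lists for DECIMAL(P, S)."""
    if precision < 1:
        raise ValueError("decimal precision must be at least 1")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def parquet_timestamp_type(precision: int) -> Tuple[str, str]:
    """Return (physical type, logical type) the table lists for TIMESTAMP(P)."""
    if precision < 0:
        raise ValueError("timestamp precision must not be negative")
    if precision <= 3:
        return ("INT64", "MILLIS")
    if precision <= 6:
        return ("INT64", "MICROS")
    # Above microsecond precision the table lists INT96 with no logical type.
    return ("INT96", "NONE")
```

For example, a DECIMAL(10, 2) column falls in the INT64 range, and a TIMESTAMP(9) column falls in the INT96 range.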

Limitations:

  1. Parquet does not support nullable map keys.

AVRO #

The following table lists the type mapping from Paimon type to Avro type.

| Paimon type | Avro type | Avro logical type |
|---|---|---|
| CHAR / VARCHAR / STRING | string | |
| BOOLEAN | boolean | |
| BINARY / VARBINARY | bytes | |
| DECIMAL | bytes | decimal |
| TINYINT | int | |
| SMALLINT | int | |
| INT | int | |
| BIGINT | long | |
| FLOAT | float | |
| DOUBLE | double | |
| DATE | int | date |
| TIME | int | time-millis |
| TIMESTAMP | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
| TIMESTAMP_LOCAL_ZONE | P <= 3: long, P <= 6: long, P > 6: unsupported | P <= 3: timestampMillis, P <= 6: timestampMicros, P > 6: unsupported |
| ARRAY | array | |
| MAP (key must be string/char/varchar type) | map | |
| MULTISET (element must be string/char/varchar type) | map | |
| ROW | record | |

In addition to the types listed above, Paimon maps nullable types to Avro union(something, null), where something is the Avro type converted from the Paimon type.
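
As a rough illustration of the union mapping, the Python sketch below builds the JSON form of such a union. In Avro's JSON schema syntax a union is written as a JSON array. The helper name `to_nullable_avro` is hypothetical, and the member order follows the wording union(something, null) above as an assumption; the order Paimon actually writes may differ.

```python
import json


def to_nullable_avro(avro_type):
    """Wrap an Avro type in a union with "null" (member order is an assumption)."""
    return [avro_type, "null"]


# A nullable STRING column: the Avro type from the table is "string".
nullable_string = to_nullable_avro("string")

# A nullable DATE column: Avro int annotated with the date logical type.
nullable_date = to_nullable_avro({"type": "int", "logicalType": "date"})

print(json.dumps(nullable_string))
print(json.dumps(nullable_date))
```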

Refer to the Avro Specification for more information about Avro types.

ORC #

The following table lists the type mapping from Paimon type to ORC type.

| Paimon type | ORC physical type | ORC logical type |
|---|---|---|
| CHAR | bytes | CHAR |
| VARCHAR | bytes | VARCHAR |
| STRING | bytes | STRING |
| BOOLEAN | long | BOOLEAN |
| BYTES | bytes | BINARY |
| DECIMAL | decimal | DECIMAL |
| TINYINT | long | BYTE |
| SMALLINT | long | SHORT |
| INT | long | INT |
| BIGINT | long | LONG |
| FLOAT | double | FLOAT |
| DOUBLE | double | DOUBLE |
| DATE | long | DATE |
| TIMESTAMP | timestamp | TIMESTAMP |
| TIMESTAMP_LOCAL_ZONE | timestamp | TIMESTAMP_INSTANT |
| ARRAY | - | LIST |
| MAP | - | MAP |
| ROW | - | STRUCT |

Limitations:

  1. ORC has a time zone bias when mapping the TIMESTAMP_LOCAL_ZONE type: it saves the millis value corresponding to the UTC literal time. This behavior cannot be changed because of compatibility issues.

CSV #

Experimental feature, not recommended for production.

Format Options:

| Option | Default | Type | Description |
|---|---|---|---|
| csv.field-delimiter | , | String | Field delimiter character (',' by default), must be a single character. You can use a backslash to specify special characters, e.g. '\t' represents the tab character. |
| csv.line-delimiter | \n | String | The line delimiter for CSV format. |
| csv.quote-character | " | String | Quote character for enclosing field values (" by default). |
| csv.escape-character | \ | String | The escape character for CSV format. |
| csv.include-header | false | Boolean | Whether to include header in CSV files. |
| csv.null-literal | "" | String | Null literal string that is interpreted as a null value (disabled by default). |

The Paimon CSV format uses the Jackson databind API to parse and generate CSV strings.
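
To get a feel for what the delimiter, quote, escape, and header options control, the sketch below writes a few rows with Python's standard `csv` module using the default values from the options table. It is only an analogy: Paimon does not use this module, the function `write_csv` and its parameters are hypothetical, and Paimon's exact quoting and escaping rules may differ from what Python produces.

```python
import csv
import io


def write_csv(header, rows, include_header=False, field_delimiter=","):
    """Write rows to a CSV string using defaults mirroring the options table."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=field_delimiter,   # analogous to csv.field-delimiter
        lineterminator="\n",         # analogous to csv.line-delimiter
        quotechar='"',               # analogous to csv.quote-character
        escapechar="\\",             # analogous to csv.escape-character
        doublequote=False,           # escape embedded quotes instead of doubling them
        quoting=csv.QUOTE_MINIMAL,   # quote only fields that need it
    )
    if include_header:               # analogous to csv.include-header
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [[1, "plain"], [2, "has,comma"]]
print(write_csv(["id", "name"], rows))
print(write_csv(["id", "name"], rows, include_header=True, field_delimiter="\t"))
```

With the comma delimiter, the second row's value is wrapped in the quote character because it contains the delimiter. With the tab delimiter, the same value needs no quoting.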

The following table lists the type mapping from Paimon type to CSV type.

| Paimon type | CSV type |
|---|---|
| CHAR / VARCHAR / STRING | string |
| BOOLEAN | boolean |
| BINARY / VARBINARY | string with encoding: base64 |
| DECIMAL | number |
| TINYINT | number |
| SMALLINT | number |
| INT | number |
| BIGINT | number |
| FLOAT | number |
| DOUBLE | number |
| DATE | string with format: date |
| TIME | string with format: time |
| TIMESTAMP | string with format: date-time |
| TIMESTAMP_LOCAL_ZONE | string with format: date-time |
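
The sketch below shows one way a single value could be turned into its CSV text form following the table. Several details are assumptions, not confirmed Paimon behavior: the helper name `to_csv_field` is hypothetical, booleans are rendered as `true`/`false`, and the date, time, and date-time formats are taken to be the ISO 8601 forms produced by Python's `isoformat()`. Only the base64 encoding of binary values is stated directly in the table.

```python
import base64
import datetime
import decimal


def to_csv_field(value):
    """Render one Python value as CSV field text, following the mapping table."""
    if isinstance(value, bool):
        # Assumption: lowercase true/false literals.
        return "true" if value else "false"
    if isinstance(value, (bytes, bytearray)):
        # BINARY / VARBINARY: string with base64 encoding.
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        # Assumption: ISO 8601 text for DATE, TIME, and TIMESTAMP values.
        return value.isoformat()
    if isinstance(value, (int, float, decimal.Decimal)):
        # Numeric types are written as plain numbers.
        return str(value)
    return str(value)


print(to_csv_field(b"\x00\x01"))
print(to_csv_field(datetime.date(2024, 1, 2)))
print(to_csv_field(decimal.Decimal("1.50")))
```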

JSON #

Experimental feature, not recommended for production.

TODO

Copyright © 2025 The Apache Software Foundation. Apache Paimon, Paimon, and its feather logo are trademarks of The Apache Software Foundation.