
Iceberg Compatibility #

Paimon supports generating Iceberg-compatible metadata, so that Paimon tables can be consumed directly by Iceberg readers.

Set the following table option, so that Paimon tables can generate Iceberg-compatible metadata.

metadata.iceberg.storage
  Default: disabled
  Type: Enum
  Description: When set, produce Iceberg metadata after a snapshot is committed, so that Iceberg readers can read Paimon's raw data files.
    • disabled: Disable Iceberg compatibility support.
    • table-location: Store Iceberg metadata in each table's directory.
    • hadoop-catalog: Store Iceberg metadata in a separate directory. This directory can be specified as the warehouse directory of an Iceberg Hadoop catalog.
    • hive-catalog: Not only store Iceberg metadata like hadoop-catalog, but also create an Iceberg external table in Hive.

For most SQL users, we recommend setting 'metadata.iceberg.storage' = 'hadoop-catalog' or 'metadata.iceberg.storage' = 'hive-catalog', so that all tables can be accessed as an Iceberg warehouse. For Iceberg Java API users, consider setting 'metadata.iceberg.storage' = 'table-location', so that each table can be accessed by its table path.
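
For example, with 'table-location' the Iceberg metadata is written inside the table's own directory, so an Iceberg reader can open the table directly by its path. A minimal Flink SQL sketch (the catalog and table names are illustrative):

CREATE TABLE paimon_catalog.`default`.t (
    id BIGINT,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'table-location'
);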

Append Tables #

Let’s walk through a simple example, where we query Paimon tables with Iceberg connectors in Flink and Spark. Before trying out this example, make sure that your compute engine already supports Iceberg. Please refer to Iceberg’s documentation if you haven’t set up Iceberg yet.

Let’s now create a Paimon append-only table with Iceberg compatibility enabled and insert some data. In Flink SQL:

CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = '<path-to-warehouse>'
);

CREATE TABLE paimon_catalog.`default`.cities (
    country STRING,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'hadoop-catalog'
);

INSERT INTO paimon_catalog.`default`.cities VALUES ('usa', 'new york'), ('germany', 'berlin'), ('usa', 'chicago'), ('germany', 'hamburg');

Alternatively, start spark-sql with the following command line. Iceberg catalog caching is disabled in the command so that newly committed snapshots become visible to queries immediately.

spark-sql --jars <path-to-paimon-jar> \
    --conf spark.sql.catalog.paimon_catalog=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon_catalog.warehouse=<path-to-warehouse> \
    --packages org.apache.iceberg:iceberg-spark-runtime-<iceberg-version> \
    --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
    --conf spark.sql.catalog.iceberg_catalog.warehouse=<path-to-warehouse>/iceberg \
    --conf spark.sql.catalog.iceberg_catalog.cache-enabled=false \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Then run the following Spark SQL to create the Paimon table and insert data.

CREATE TABLE paimon_catalog.`default`.cities (
    country STRING,
    name STRING
) TBLPROPERTIES (
    'metadata.iceberg.storage' = 'hadoop-catalog'
);

INSERT INTO paimon_catalog.`default`.cities VALUES ('usa', 'new york'), ('germany', 'berlin'), ('usa', 'chicago'), ('germany', 'hamburg');

Now let’s query this Paimon table with the Iceberg connector. In Flink SQL:

CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hadoop',
    'warehouse' = '<path-to-warehouse>/iceberg',
    'cache-enabled' = 'false' -- disable iceberg catalog caching to quickly see the result
);

SELECT * FROM iceberg_catalog.`default`.cities WHERE country = 'germany';
/*
+----+--------------------------------+--------------------------------+
| op |                        country |                           name |
+----+--------------------------------+--------------------------------+
| +I |                        germany |                         berlin |
| +I |                        germany |                        hamburg |
+----+--------------------------------+--------------------------------+
*/
In Spark SQL:

SELECT * FROM iceberg_catalog.`default`.cities WHERE country = 'germany';
/*
germany berlin
germany hamburg
*/

Let’s insert more data and query again. In Flink SQL:

INSERT INTO paimon_catalog.`default`.cities VALUES ('usa', 'houston'), ('germany', 'munich');

SELECT * FROM iceberg_catalog.`default`.cities WHERE country = 'germany';
/*
+----+--------------------------------+--------------------------------+
| op |                        country |                           name |
+----+--------------------------------+--------------------------------+
| +I |                        germany |                         munich |
| +I |                        germany |                         berlin |
| +I |                        germany |                        hamburg |
+----+--------------------------------+--------------------------------+
*/
In Spark SQL:

INSERT INTO paimon_catalog.`default`.cities VALUES ('usa', 'houston'), ('germany', 'munich');

SELECT * FROM iceberg_catalog.`default`.cities WHERE country = 'germany';
/*
germany munich
germany berlin
germany hamburg
*/

Primary Key Tables #

Let’s create a Paimon primary key table with Iceberg compatibility enabled, insert some data, and query it with the Iceberg connector. In Flink SQL:

CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = '<path-to-warehouse>'
);

CREATE TABLE paimon_catalog.`default`.orders (
    order_id BIGINT,
    status STRING,
    payment DOUBLE,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'metadata.iceberg.storage' = 'hadoop-catalog',
    'compaction.optimization-interval' = '1ms' -- ATTENTION: this option is only for testing, see the discussion of full compaction timeliness below for more information
);

INSERT INTO paimon_catalog.`default`.orders VALUES (1, 'SUBMITTED', CAST(NULL AS DOUBLE)), (2, 'COMPLETED', 200.0), (3, 'SUBMITTED', CAST(NULL AS DOUBLE));

CREATE CATALOG iceberg_catalog WITH (
    'type' = 'iceberg',
    'catalog-type' = 'hadoop',
    'warehouse' = '<path-to-warehouse>/iceberg',
    'cache-enabled' = 'false' -- disable iceberg catalog caching to quickly see the result
);

SELECT * FROM iceberg_catalog.`default`.orders WHERE status = 'COMPLETED';
/*
+----+----------------------+--------------------------------+--------------------------------+
| op |             order_id |                         status |                        payment |
+----+----------------------+--------------------------------+--------------------------------+
| +I |                    2 |                      COMPLETED |                          200.0 |
+----+----------------------+--------------------------------+--------------------------------+
*/

INSERT INTO paimon_catalog.`default`.orders VALUES (1, 'COMPLETED', 100.0);

SELECT * FROM iceberg_catalog.`default`.orders WHERE status = 'COMPLETED';
/*
+----+----------------------+--------------------------------+--------------------------------+
| op |             order_id |                         status |                        payment |
+----+----------------------+--------------------------------+--------------------------------+
| +I |                    1 |                      COMPLETED |                          100.0 |
| +I |                    2 |                      COMPLETED |                          200.0 |
+----+----------------------+--------------------------------+--------------------------------+
*/

Alternatively, start spark-sql with the following command line. Iceberg catalog caching is disabled in the command so that newly committed snapshots become visible to queries immediately.

spark-sql --jars <path-to-paimon-jar> \
    --conf spark.sql.catalog.paimon_catalog=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon_catalog.warehouse=<path-to-warehouse> \
    --packages org.apache.iceberg:iceberg-spark-runtime-<iceberg-version> \
    --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
    --conf spark.sql.catalog.iceberg_catalog.warehouse=<path-to-warehouse>/iceberg \
    --conf spark.sql.catalog.iceberg_catalog.cache-enabled=false \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Then run the following Spark SQL to create the Paimon table, insert/update data, and query it with the Iceberg catalog.

CREATE TABLE paimon_catalog.`default`.orders (
    order_id BIGINT,
    status STRING,
    payment DOUBLE
) TBLPROPERTIES (
    'primary-key' = 'order_id',
    'metadata.iceberg.storage' = 'hadoop-catalog',
    'compaction.optimization-interval' = '1ms' -- ATTENTION: this option is only for testing, see the discussion of full compaction timeliness below for more information
);

INSERT INTO paimon_catalog.`default`.orders VALUES (1, 'SUBMITTED', CAST(NULL AS DOUBLE)), (2, 'COMPLETED', 200.0), (3, 'SUBMITTED', CAST(NULL AS DOUBLE));

SELECT * FROM iceberg_catalog.`default`.orders WHERE status = 'COMPLETED';
/*
2       COMPLETED       200.0
*/

INSERT INTO paimon_catalog.`default`.orders VALUES (1, 'COMPLETED', 100.0);

SELECT * FROM iceberg_catalog.`default`.orders WHERE status = 'COMPLETED';
/*
2       COMPLETED       200.0
1       COMPLETED       100.0
*/

Paimon primary key tables organize data files as LSM trees, so data files must be merged in memory before querying. However, Iceberg readers cannot merge data files, so they can only query data files on the highest level of the LSM tree. Data files on the highest level are produced by the full compaction process. In short, for primary key tables, Iceberg readers can only query data after a full compaction.

By default, there is no guarantee on how frequently Paimon will perform full compaction. You can configure one of the following table options so that Paimon is forced to perform full compaction at a fixed time interval or after a given number of commits.

compaction.optimization-interval
  Default: (none)
  Type: Duration
  Description: Full compaction will be triggered at this time interval. The first compaction after the job starts is always a full compaction.

full-compaction.delta-commits
  Default: (none)
  Type: Integer
  Description: Full compaction will be triggered after this number of delta commits. Only implemented in Flink.

Note that full compaction is a resource-consuming process, so the value of these table options should not be too small. We recommend performing full compaction once or twice per hour.
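
For example, once testing is done, the '1ms' interval used in the examples above can be raised to an hourly schedule. A minimal Flink SQL sketch, assuming the orders table from this example:

ALTER TABLE paimon_catalog.`default`.orders SET (
    'compaction.optimization-interval' = '1 h'
);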

Hive Catalog #

When creating a Paimon table, set 'metadata.iceberg.storage' = 'hive-catalog'. This option not only stores Iceberg metadata like hadoop-catalog, but also creates an Iceberg external table in Hive. The Paimon table can then be accessed from an Iceberg Hive catalog.

To provide information about the Hive metastore, you also need to set some (or all) of the following table options when creating the Paimon table. A brief sketch follows the option list below.

metadata.iceberg.uri
  Default: (none)
  Type: String
  Description: Hive metastore URI for the Iceberg Hive catalog.

metadata.iceberg.hive-conf-dir
  Default: (none)
  Type: String
  Description: hive-conf-dir for the Iceberg Hive catalog.

metadata.iceberg.hadoop-conf-dir
  Default: (none)
  Type: String
  Description: hadoop-conf-dir for the Iceberg Hive catalog.

metadata.iceberg.manifest-compression
  Default: snappy
  Type: String
  Description: Compression for Iceberg manifest files.

metadata.iceberg.manifest-legacy-version
  Default: false
  Type: Boolean
  Description: Whether to use the legacy manifest version to generate Iceberg's 1.4 manifest files.

metadata.iceberg.hive-client-class
  Default: org.apache.hadoop.hive.metastore.HiveMetaStoreClient
  Type: String
  Description: Hive client class name for the Iceberg Hive catalog.
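
For instance, a table whose Iceberg metadata is registered in a Hive metastore might be declared as follows. This is a minimal Flink SQL sketch; the thrift URI is a placeholder, and only the options relevant to your environment are needed:

CREATE TABLE paimon_catalog.`default`.t (
    id BIGINT,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'hive-catalog',
    'metadata.iceberg.uri' = 'thrift://<host>:<port>',
    'metadata.iceberg.manifest-compression' = 'snappy'
);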

AWS Glue Catalog #

You can use the Hive catalog to connect to the AWS Glue metastore by setting 'metadata.iceberg.hive-client-class' to 'com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient'.

Note: You can use this repo to build the required jar, include it in your classpath, and configure the AWSCatalogMetastoreClient.
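
A minimal Flink SQL sketch, assuming the Glue client jar is already on the classpath (the table name is illustrative, and additional Hive catalog options from the table above may be required depending on your setup):

CREATE TABLE paimon_catalog.`default`.t (
    id BIGINT,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'hive-catalog',
    'metadata.iceberg.hive-client-class' = 'com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient'
);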

AWS Athena #

AWS Athena may use an old manifest reader that reads Iceberg manifests by name, so Paimon should produce the legacy Iceberg manifest list file. You can enable this with 'metadata.iceberg.manifest-legacy-version'.
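
For example, to enable it on an existing table (a minimal Flink SQL sketch, assuming the cities table from the earlier example):

ALTER TABLE paimon_catalog.`default`.cities SET (
    'metadata.iceberg.manifest-legacy-version' = 'true'
);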

DuckDB #

DuckDB may expect data files to be placed under a root/data directory, while Paimon usually places them directly in the root directory. To achieve compatibility, configure this table option: 'data-file.path-directory' = 'data'.
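
For example (a minimal Flink SQL sketch; the table name is illustrative):

CREATE TABLE paimon_catalog.`default`.t (
    id BIGINT,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'hadoop-catalog',
    'data-file.path-directory' = 'data'
);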

Trino Iceberg #

In this example, we use the Trino Iceberg connector to access a Paimon table through the Iceberg Hive catalog. Before trying out this example, make sure that you have configured the Trino Iceberg connector. See Trino’s documentation for more information.

Let’s first create a Paimon table with Iceberg compatibility enabled. In Flink SQL:

CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = '<path-to-warehouse>'
);

CREATE TABLE paimon_catalog.`default`.animals (
    kind STRING,
    name STRING
) WITH (
    'metadata.iceberg.storage' = 'hive-catalog',
    'metadata.iceberg.uri' = 'thrift://<host>:<port>'
);

INSERT INTO paimon_catalog.`default`.animals VALUES ('mammal', 'cat'), ('mammal', 'dog'), ('reptile', 'snake'), ('reptile', 'lizard');

Alternatively, start spark-sql with the following command line. Iceberg catalog caching is disabled in the command so that newly committed snapshots become visible to queries immediately.

spark-sql --jars <path-to-paimon-jar> \
    --conf spark.sql.catalog.paimon_catalog=org.apache.paimon.spark.SparkCatalog \
    --conf spark.sql.catalog.paimon_catalog.warehouse=<path-to-warehouse> \
    --packages org.apache.iceberg:iceberg-spark-runtime-<iceberg-version> \
    --conf spark.sql.catalog.iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.iceberg_catalog.type=hadoop \
    --conf spark.sql.catalog.iceberg_catalog.warehouse=<path-to-warehouse>/iceberg \
    --conf spark.sql.catalog.iceberg_catalog.cache-enabled=false \
    --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Then run the following Spark SQL to create the Paimon table and insert data.

CREATE TABLE paimon_catalog.`default`.animals (
    kind STRING,
    name STRING
) TBLPROPERTIES (
    'metadata.iceberg.storage' = 'hive-catalog',
    'metadata.iceberg.uri' = 'thrift://<host>:<port>'
);

INSERT INTO paimon_catalog.`default`.animals VALUES ('mammal', 'cat'), ('mammal', 'dog'), ('reptile', 'snake'), ('reptile', 'lizard');

Start Trino using the Iceberg catalog and query the Paimon table.

SELECT * FROM animals WHERE kind = 'mammal';
/*
   kind | name 
--------+------
 mammal | cat  
 mammal | dog  
*/

Supported Types #

Paimon Iceberg compatibility currently supports the following data types.

Paimon Data Type Iceberg Data Type
BOOLEAN boolean
INT int
BIGINT long
FLOAT float
DOUBLE double
DECIMAL decimal
CHAR string
VARCHAR string
BINARY binary
VARBINARY binary
DATE date
TIMESTAMP* timestamp
TIMESTAMP_LTZ* timestamptz
ARRAY list
MAP map
ROW struct

*: TIMESTAMP and TIMESTAMP_LTZ types only support precision from 4 to 6.
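
For example, a table combining several of the types above might be declared as follows. This is a minimal Flink SQL sketch; the table is hypothetical, and the TIMESTAMP precision stays within the supported range of 4 to 6:

CREATE TABLE paimon_catalog.`default`.events (
    id BIGINT,
    attrs MAP<STRING, STRING>,
    created_at TIMESTAMP(6)
) WITH (
    'metadata.iceberg.storage' = 'hadoop-catalog'
);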

Table Options #

metadata.iceberg.compaction.min.file-num
  Default: 10
  Type: Integer
  Description: Minimum number of Iceberg metadata files to trigger metadata compaction.

metadata.iceberg.compaction.max.file-num
  Default: 50
  Type: Integer
  Description: If the number of small Iceberg metadata files exceeds this limit, always trigger metadata compaction regardless of their total size.
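
For example, to compact Iceberg metadata more aggressively (a minimal Flink SQL sketch, assuming the cities table from the earlier example; the values are illustrative):

ALTER TABLE paimon_catalog.`default`.cities SET (
    'metadata.iceberg.compaction.min.file-num' = '5',
    'metadata.iceberg.compaction.max.file-num' = '20'
);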