Apache Paimon

//paimon.apache.org/docs/1.4/
Recent content on Apache Paimon

Filesystems
//paimon.apache.org/docs/1.4/maintenance/filesystems/
Apache Paimon uses the same pluggable file systems as Apache Flink. When Flink is the compute engine, users can follow the standard plugin mechanism to configure the plugin structure. For other engines such as Spark or Hive, however, the opt jars provided by Flink may cause class conflicts and cannot be used directly. Because fixing class conflicts is inconvenient for users, Paimon provides self-contained, engine-unified pluggable FileSystem jars so that users can query tables from the Spark or Hive side.

Migration From Hive
//paimon.apache.org/docs/1.4/migration/migration-from-hive/
Hive Table Migration
Apache Hive supports the ORC and Parquet file formats, both of which can be migrated to Paimon. When data is migrated to a Paimon table, the original table disappears permanently, so back up your data if you still need the original table. The migrated table will be an append table. We can now use the Paimon Hive catalog with the Migrate Table Procedure to fully migrate a table from Hive to Paimon.

Overview
//paimon.apache.org/docs/1.4/append-table/overview/
If a table does not have a primary key defined, it is an append table. Compared to a primary key table, it cannot directly receive changelogs and cannot be directly updated through upsert. It can only receive appended data.
Flink
    CREATE TABLE my_table (
        product_id BIGINT,
        price DOUBLE,
        sales BIGINT
    ) WITH (
        -- 'target-file-size' = '256 MB',
        -- 'file.

Overview
//paimon.apache.org/docs/1.4/cdc-ingestion/overview/
Paimon supports a variety of ways to ingest data into Paimon tables with schema evolution. This means that added columns are synchronized to the Paimon table in real time, and the synchronization job does not need to be restarted for this purpose. The following synchronization methods are currently supported:
MySQL Synchronizing Table: synchronize one or multiple tables from MySQL into one Paimon table.
MySQL Synchronizing Database: synchronize the whole MySQL database into one Paimon database.

Overview
//paimon.apache.org/docs/1.4/concepts/overview/
Apache Paimon’s Architecture: As shown in the architecture above:
Read/Write: Paimon supports a versatile way to read/write data and perform OLAP queries. For reads, it supports consuming data from historical snapshots (in batch mode), from the latest offset (in streaming mode), or reading incremental snapshots in a hybrid way. For writes, it supports streaming synchronization from the changelog of databases (CDC) or batch insert/overwrite from offline data.

Overview
//paimon.apache.org/docs/1.4/concepts/rest/overview/
RESTCatalog
Overview
Paimon REST Catalog provides a lightweight implementation to access the catalog service. Paimon can access the catalog service through a catalog server that implements the REST API. You can see all APIs in REST API.
Key Features
User Defined Technology-Specific Logic Implementation: All technology-specific logic resides within the catalog server. This ensures that the user can define logic that could be owned by the user.

Overview
//paimon.apache.org/docs/1.4/concepts/spec/overview/
Spec Overview
This is the specification for the Paimon table format. This document standardizes the underlying file structure and design of Paimon.
Terms
Schema: fields, primary keys definition, partition keys definition and options.
Snapshot: the entrance to all data committed at some specific time point.
Manifest list: includes several manifest files.
Manifest: includes several data files or changelog files.
Data File: contains incremental records.
Changelog File: contains records produced by changelog-producer.

Overview
//paimon.apache.org/docs/1.4/ecosystem/overview/
Compatibility Matrix
Engine | Version | Batch Read | Batch Write | Create Table | Alter Table | Streaming Write | Streaming Read | Batch Overwrite | DELETE & UPDATE | MERGE INTO | Time Travel
Flink | 1.16 - 1.20 | ✅ | ✅ | ✅ | ✅(1.17+) | ✅ | ✅ | ✅ | ✅(1.17+) | ❌ | ✅
Spark | 3.2 - 4.0 | ✅ | ✅ | ✅ | ✅ | ✅(3.3+) | ✅(3.3+) | ✅ | ✅ | ✅ | ✅(3.3+)
Hive 2.

Overview
//paimon.apache.org/docs/1.4/iceberg/overview/
Paimon supports generating Iceberg compatible metadata, so that Paimon tables can be consumed directly by Iceberg readers.
Set the following table options, so that Paimon tables can generate Iceberg compatible metadata.
Option | Default | Type | Description
metadata.iceberg.storage | disabled | Enum | When set, produce Iceberg metadata after a snapshot is committed, so that Iceberg readers can read Paimon's raw data files. disabled: Disable Iceberg compatibility support.

Overview
//paimon.apache.org/docs/1.4/primary-key-table/merge-engine/overview/
When the Paimon sink receives two or more records with the same primary keys, it merges them into one record to keep primary keys unique. By specifying the merge-engine table property, users can choose how records are merged together.
Always set table.exec.sink.upsert-materialize to NONE in the Flink SQL TableConfig; sink upsert-materialize may result in strange behavior. When the input is out of order, we recommend using a Sequence Field to correct the disorder.

Overview
//paimon.apache.org/docs/1.4/primary-key-table/overview/
If you define a table with a primary key, you can insert, update or delete records in the table. Primary keys consist of a set of columns that contain unique values for each record. Paimon enforces data ordering by sorting the primary key within each bucket, allowing users to achieve high performance by applying filtering conditions on the primary key. See CREATE TABLE.
Bucket
Unpartitioned tables, or partitions in partitioned tables, are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying.

Overview
//paimon.apache.org/docs/1.4/pypaimon/overview/
PyPaimon is a Python implementation for connecting to a Paimon catalog and for reading and writing tables. The new PyPaimon is implemented entirely in Python and does not require a JDK installation.
Environment Settings
The SDK is published at pypaimon. You can install it with:
    pip install pypaimon
Build From Source
You can build the source package by executing the following command:
    python3 setup.py sdist
The package is under dist/.
Then you can install the package by executing the following command:

Quick Start
//paimon.apache.org/docs/1.4/flink/quick-start/
This documentation is a guide for using Paimon in Flink.
Jars
Paimon currently supports Flink 2.2, 2.1, 2.0, 1.20, 1.19, 1.18, 1.17, 1.16. We recommend the latest Flink version for a better experience.
Download the jar file with the corresponding version.
Currently, Paimon provides two types of jars: the bundled jar, which is used for reading and writing data, and the action jar, which is used for operations such as manual compaction.
Version | Type | Jar
Flink 2.

Quick Start
//paimon.apache.org/docs/1.4/spark/quick-start/
Preparation
Paimon supports the following Spark versions with their respective Java and Scala compatibility. We recommend using the latest Spark version for a better experience.
Spark 4.x (including 4.0): Pre-built with Java 17 and Scala 2.13
Spark 3.x (including 3.5, 3.4, 3.3, 3.2): Pre-built with Java 8 and Scala 2.12/2.13
Download the jar file with the corresponding version.

REST API
//paimon.apache.org/docs/1.4/program-api/rest-api/
This is the Java API for REST.
Dependency
Maven dependency:
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-api</artifactId>
        <version>1.4.1</version>
    </dependency>
Or download the jar file: Paimon API.
RESTApi
    import org.apache.paimon.options.Options;
    import org.apache.paimon.rest.RESTApi;

    import java.util.List;

    import static org.apache.paimon.options.CatalogOptions.WAREHOUSE;
    import static org.apache.paimon.rest.RESTCatalogOptions.DLF_ACCESS_KEY_ID;
    import static org.apache.paimon.rest.RESTCatalogOptions.DLF_ACCESS_KEY_SECRET;
    import static org.apache.paimon.rest.RESTCatalogOptions.TOKEN;
    import static org.apache.paimon.rest.RESTCatalogOptions.TOKEN_PROVIDER;
    import static org.apache.paimon.rest.RESTCatalogOptions.URI;

    public class RESTApiExample {
        public static void main(String[] args) {
            Options options = new Options();
            options.set(URI, "<catalog server url>");
            options.

Understand Files
//paimon.apache.org/docs/1.4/learn-paimon/understand-files/
This article is specifically designed to clarify the impact that various file operations have on files. This page provides concrete examples and practical tips for effectively managing them. Furthermore, through an in-depth exploration of operations such as commit and compact, we aim to offer insights into the creation and updates of files.
Prerequisite
Before delving further into this page, please ensure that you have read through the following sections:

Upsert To Partitioned
//paimon.apache.org/docs/1.4/migration/upsert-to-partitioned/
Note: Only Hive Engine can be used to query these upsert-to-partitioned tables.
The Tag Management will maintain the manifests and data files of the snapshot. A typical usage is creating tags daily, then you can maintain the historical data of each day for batch reading.
When using primary key tables, a non-partitioned approach is often used to maintain updates, in order to mirror and synchronize tables from upstream database tables.

Append Table
//paimon.apache.org/docs/1.4/iceberg/append-table/
Append Tables
Let’s walk through a simple example, where we query Paimon tables with Iceberg connectors in Flink and Spark. Before trying out this example, make sure that your compute engine already supports Iceberg. Please refer to Iceberg’s document if you haven’t set up Iceberg.
Flink: Preparation when using Flink SQL Client
Spark: Using Iceberg in Spark 3
Let’s now create a Paimon append only table with Iceberg compatibility enabled and insert some data.

Basic Concepts
//paimon.apache.org/docs/1.4/concepts/basic-concepts/
File Layouts
All files of a table are stored under one base directory. Paimon files are organized in a layered style. The following image illustrates the file layout. Starting from a snapshot file, Paimon readers can recursively access all records from the table.
Snapshot
All snapshot files are stored in the snapshot directory.
A snapshot file is a JSON file containing information about this snapshot, including

Bearer Token
//paimon.apache.org/docs/1.4/concepts/rest/bear/
A bearer token is an encrypted string, typically generated by the server based on a secret key. When the client sends a request to the server, it must include Authorization: Bearer <token> in the request header. After receiving the request, the server extracts the <token> and validates its legitimacy. If the validation passes, the authentication is successful.
    CREATE CATALOG `paimon-rest-catalog` WITH (
        'type' = 'paimon',
        'uri' = '<catalog server url>',
        'metastore' = 'rest',
        'warehouse' = 'my_instance_name',
        'token.
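The bearer-token exchange described in the Bearer Token entry can be illustrated with a small Python sketch. It is not Paimon's or any catalog server's implementation: the function names are hypothetical, and validating the token by comparing it with a single expected value is an assumption made for illustration (a real server would typically verify the token against its secret key).

```python
import hmac


def build_auth_header(token: str) -> dict:
    """Client side: place the token in the Authorization request header."""
    return {"Authorization": f"Bearer {token}"}


def extract_bearer_token(headers: dict):
    """Server side: return the token from the header, or None if it is missing or malformed."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme != "Bearer" or not token.strip():
        return None
    return token.strip()


def is_authenticated(headers: dict, expected_token: str) -> bool:
    """Hypothetical validation: constant-time comparison with a known token."""
    token = extract_bearer_token(headers)
    if token is None:
        return False
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

A request carrying build_auth_header("my-token") passes is_authenticated only when the server expects the same token; a header with a different scheme, such as Basic, is rejected before any comparison takes place.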

Data Distribution
//paimon.apache.org/docs/1.4/primary-key-table/data-distribution/
A bucket is the smallest storage unit for reads and writes; each bucket directory contains an LSM tree.
Fixed Bucket
Configure a bucket count greater than 0 to use Fixed Bucket mode, in which the bucket of a record is computed as Math.abs(key_hashcode % numBuckets).
Rescaling buckets can only be done through offline processes, see Rescale Bucket. Too many buckets lead to too many small files, and too few buckets lead to poor write performance.

Download
//paimon.apache.org/docs/1.4/project/download/
This documentation is a guide for downloading Paimon Jars.
Engine Jars
Version | Jar
Flink 2.2 | paimon-flink-2.2-1.4.1.jar
Flink 2.1 | paimon-flink-2.1-1.4.1.jar
Flink 2.0 | paimon-flink-2.0-1.4.1.jar
Flink 1.20 | paimon-flink-1.20-1.4.1.jar
Flink 1.19 | paimon-flink-1.19-1.4.1.jar
Flink 1.18 | paimon-flink-1.18-1.4.1.jar
Flink 1.17 | paimon-flink-1.17-1.4.1.jar
Flink 1.16 | paimon-flink-1.16-1.4.1.jar
Flink Action | paimon-flink-action-1.4.1.jar
Spark 4.

Flink API
//paimon.apache.org/docs/1.4/program-api/flink-api/
If possible, we recommend using Flink SQL or Spark SQL, or simply using SQL APIs in programs.
Dependency
Maven dependency:
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-flink-1.20</artifactId>
        <version>1.4.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge</artifactId>
        <version>1.20.0</version>
        <scope>provided</scope>
    </dependency>
Or download the jar file: Paimon Flink.
Please choose your Flink version.
Paimon relies on the Hadoop environment; you should add the Hadoop classpath or a bundled jar.
In addition to the DataStream API, you can also read from or write to Paimon tables by converting between DataStream and Table in Flink.

Incremental Clustering
//paimon.apache.org/docs/1.4/append-table/incremental-clustering/
Paimon currently supports ordering append tables using SFC (Space-Filling Curve) (see sort compact for more info). The resulting data layout typically delivers better performance for queries that target clustering keys. However, with the current SortCompaction, even when neither the data nor the clustering keys have changed, each run still rewrites the entire dataset, which is extremely costly.
To address this, Paimon introduced a more flexible, incremental clustering mechanism: Incremental Clustering.

MySQL CDC
//paimon.apache.org/docs/1.4/cdc-ingestion/mysql-cdc/
Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.
Prepare CDC Bundled Jar
Download the CDC bundled jars and put them under <FLINK_HOME>/lib/.
Version | Bundled Jar
3.5.0 | flink-sql-connector-mysql-cdc-3.5.0.jar, mysql-connector-java-8.0.27.jar
Only CDC 3.5.0 or above is supported.
Synchronizing Tables
By using MySqlSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from MySQL into one Paimon table.

Partial Update
//paimon.apache.org/docs/1.4/primary-key-table/merge-engine/partial-update/
By specifying 'merge-engine' = 'partial-update', users have the ability to update columns of a record through multiple updates until the record is complete.
This is achieved by updating the value fields one by one, using the latest data under the same primary key. However, null values do not overwrite existing values in the process.
For example, suppose Paimon receives three records:
<1, 23.0, 10, NULL>
<1, NULL, NULL, 'This is a book'>
<1, 25.

Postgres CDC
//paimon.apache.org/docs/1.4/cdc-ingestion/postgres-cdc/
Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.
Prepare CDC Bundled Jar
Download the CDC bundled jar and put it under <FLINK_HOME>/lib/.
Version | Bundled Jar
3.5.0 | flink-sql-connector-postgres-cdc-3.5.0.jar
Only CDC 3.5.0 or above is supported.
Synchronizing Tables
By using PostgresSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from PostgreSQL into one Paimon table.

Python API
//paimon.apache.org/docs/1.4/pypaimon/python-api/
Create Catalog
Before working with a Table, you need to create a Catalog.
filesystem
    from pypaimon import CatalogFactory

    # Note that keys and values are all string
    catalog_options = {
        'warehouse': 'file:///path/to/warehouse'
    }
    catalog = CatalogFactory.create(catalog_options)
rest catalog
The sample code is as follows. The detailed meaning of each option can be found in REST.
    from pypaimon import CatalogFactory

    # Note that keys and values are all string
    catalog_options = {
        'metastore': 'rest',
        'warehouse': 'xxx',
        'uri': 'xxx',
        'token.

Scenario Guide
//paimon.apache.org/docs/1.4/learn-paimon/scenario-guide/
This guide helps you choose the right Paimon table type and configuration for your specific use case.
Paimon provides Primary Key Table, Append Table, and Multimodal Data Lake capabilities, each with different modes and configurations that are suited for different scenarios.
Quick Decision
Scenario | Table Type | Key Configuration
CDC real-time sync from database | Primary Key Table | deletion-vectors.

Schema
//paimon.apache.org/docs/1.4/concepts/spec/schema/
The version of the schema file starts from 0, and all versions of the schema are currently retained. There may be old files that rely on an old schema version, so schema files should be deleted with caution.
The schema file is JSON and includes:
fields: data field list; each data field contains id, name and type, and the field id is used to support schema evolution.
partitionKeys: field name list, the partition definition of the table; it cannot be modified.

SQL DDL
//paimon.apache.org/docs/1.4/flink/sql-ddl/
Create Catalog
Paimon catalogs currently support three types of metastores:
filesystem metastore (default), which stores both metadata and table files in filesystems.
hive metastore, which additionally stores metadata in Hive metastore. Users can directly access the tables from Hive.
jdbc metastore, which additionally stores metadata in relational databases such as MySQL, Postgres, etc.
See CatalogOptions for detailed options when creating a catalog.

SQL DDL
//paimon.apache.org/docs/1.4/spark/sql-ddl/
Catalog
Create Catalog
Paimon catalogs currently support three types of metastores:
filesystem metastore (default), which stores both metadata and table files in filesystems.
hive metastore, which additionally stores metadata in Hive metastore. Users can directly access the tables from Hive.
jdbc metastore, which additionally stores metadata in relational databases such as MySQL, Postgres, etc.
See CatalogOptions for detailed options when creating a catalog.

SQL Functions
//paimon.apache.org/docs/1.4/spark/sql-functions/
This section introduces all available Paimon Spark functions.
Built-in Function
max_pt
    sys.max_pt($table_name)
It accepts a string type literal to specify the table name and returns a max-valid-toplevel partition value.
valid: the partition which contains data files
toplevel: only return the first partition value if the table has multi-partition columns
It throws an exception when:
the table is not a partitioned table
the partitioned table does not have any partitions
none of the partitions contain data files
Example

SQL Write
//paimon.apache.org/docs/1.4/flink/sql-write/
Syntax
    INSERT { INTO | OVERWRITE } table_identifier [ part_spec ] [ column_list ] { value_expr | query };
For more information, please check the syntax document: Flink INSERT Statement
INSERT INTO
Use INSERT INTO to apply records and changes to tables.
    INSERT INTO my_table SELECT ...
INSERT INTO supports both batch and streaming mode. In streaming mode, by default, it will also perform compaction, snapshot expiration, and even partition expiration in the Flink sink (if it is configured).

SQL Write
//paimon.apache.org/docs/1.4/spark/sql-write/
Insert Table
The INSERT statement inserts new rows into a table or overwrites the existing data in the table. The inserted rows can be specified by value expressions or result from a query.
Syntax
    INSERT { INTO | OVERWRITE } table_identifier [ part_spec ] [ column_list ] { value_expr | query };
Parameters
table_identifier: Specifies a table name, which may be optionally qualified with a database name.

StarRocks
//paimon.apache.org/docs/1.4/ecosystem/starrocks/
This documentation is a guide for using Paimon in StarRocks.
Version
Paimon currently supports StarRocks 3.1 and above. The recommended version is StarRocks 3.2.6 or above.
Create Paimon Catalog
Paimon catalogs are registered by executing a CREATE EXTERNAL CATALOG SQL statement in StarRocks. For example, you can use the following SQL to create a Paimon catalog named paimon_catalog.
    CREATE EXTERNAL CATALOG paimon_catalog PROPERTIES(
        "type" = "paimon",
        "paimon.

Aggregation
//paimon.apache.org/docs/1.4/primary-key-table/merge-engine/aggregation/
NOTE: Always set table.exec.sink.upsert-materialize to NONE in the Flink SQL TableConfig.
Sometimes users only care about aggregated results. The aggregation merge engine aggregates each value field with the latest data one by one under the same primary key according to the aggregate function.
Each field that is not part of the primary keys can be given an aggregate function, specified by the fields.<field-name>.aggregate-function table property; otherwise it uses the last_non_null_value aggregation by default.

Bucketed
//paimon.apache.org/docs/1.4/append-table/bucketed/
Bucketed Append
You can define the bucket and bucket-key to get a bucketed append table.
Example of creating a bucketed append table:
Flink
    CREATE TABLE my_table (
        product_id BIGINT,
        price DOUBLE,
        sales BIGINT
    ) WITH (
        'bucket' = '8',
        'bucket-key' = 'product_id'
    );
Data Skipping
The primary and most significant advantage of a bucketed append table is data skipping. When queries contain equality (=) or IN filter conditions on the bucket-key, Paimon can efficiently push these predicates down to skip irrelevant bucket files entirely.

Concurrency Control
//paimon.apache.org/docs/1.4/concepts/concurrency-control/
Paimon supports optimistic concurrency for multiple concurrent write jobs.
Each job writes data at its own pace and generates a new snapshot based on the current snapshot by applying incremental files (deleting or adding files) at the time of committing.
There may be two types of commit failures here:
Snapshot conflict: the snapshot id has been preempted because the table has generated a new snapshot from another job.

Contributing
//paimon.apache.org/docs/1.4/project/contributing/
Apache Paimon is developed by an open and friendly community. Everybody is cordially welcome to join the community and contribute to Apache Paimon. There are several ways to interact with the community and contribute to Paimon, including asking questions, filing bug reports, proposing new features, joining discussions on the mailing lists, contributing code or documentation, improving the website, testing release candidates and writing corresponding blog posts.
What do you want to do?

DLF Token
//paimon.apache.org/docs/1.4/concepts/rest/dlf/
DLF (Data Lake Formation) is a fully managed platform for unified metadata and data storage and management, aiming to provide customers with functions such as metadata management, storage management, permission management, storage analysis, and storage optimization.
DLF provides multiple authentication methods for different environments.
The 'warehouse' is your catalog instance name on the server, not the path.
Use the access key
    CREATE CATALOG `paimon-rest-catalog` WITH (
        'type' = 'paimon',
        'uri' = '<catalog server url>',
        'metastore' = 'rest',
        'warehouse' = 'my_instance_name',
        'token.

Doris
//paimon.apache.org/docs/1.4/ecosystem/doris/
This documentation is a guide for using Paimon in Doris.
More details can be found on the Apache Doris Website.
Version
Paimon currently supports Apache Doris 2.0.6 and above.
Create Paimon Catalog
Use the CREATE CATALOG statement in Apache Doris to create a Paimon Catalog.
Doris supports multiple types of Paimon Catalogs. Here are some examples:
    -- HDFS based Paimon Catalog
    CREATE CATALOG `paimon_hdfs` PROPERTIES (
        "type" = "paimon",
        "warehouse" = "hdfs://172.

Java API
//paimon.apache.org/docs/1.4/program-api/java-api/
If possible, we recommend using computing engines such as Flink SQL or Spark SQL.
Dependency
Maven dependency:
    <dependency>
        <groupId>org.apache.paimon</groupId>
        <artifactId>paimon-bundle</artifactId>
        <version>1.4.1</version>
    </dependency>
Or download the jar file: Paimon Bundle.
Paimon relies on the Hadoop environment; you should add the Hadoop classpath or a bundled jar.
Create Catalog
Before working with a Table, you need to create a Catalog.
    import org.apache.paimon.catalog.Catalog;
    import org.apache.paimon.catalog.CatalogContext;
    import org.

Kafka CDC
//paimon.apache.org/docs/1.4/cdc-ingestion/kafka-cdc/
Prepare Kafka Bundled Jar
    flink-sql-connector-kafka-*.jar
Supported Formats
Flink provides several Kafka CDC formats: Canal Json, Debezium Json, Debezium Avro, Ogg Json, Maxwell Json and Normal Json. If a message in a Kafka topic is a change event captured from another database using a Change Data Capture (CDC) tool, then you can use Paimon Kafka CDC to write the parsed INSERT, UPDATE and DELETE messages into the Paimon table.

Manage Tags
//paimon.apache.org/docs/1.4/pypaimon/manage-tags/
Just like the Java API of Paimon, you can create a tag based on a snapshot. The tag will maintain the manifests and data files of the snapshot. A typical usage is creating tags daily, then you can maintain the historical data of each day for batch reading.
Create and Delete Tag
You can create a tag with a given name and snapshot ID, and delete a tag with a given name.

Primary Key Table
//paimon.apache.org/docs/1.4/iceberg/primary-key-table/
Primary Key Tables
Let’s walk through a simple example, where we query Paimon tables with Iceberg connectors in Flink and Spark. Before trying out this example, make sure that your compute engine already supports Iceberg. Please refer to Iceberg’s document if you haven’t set up Iceberg.
Flink: Preparation when using Flink SQL Client
Spark: Using Iceberg in Spark 3
Flink SQL
    CREATE CATALOG paimon_catalog WITH (
        'type' = 'paimon',
        'warehouse' = '<path-to-warehouse>'
    );

    CREATE TABLE paimon_catalog.

Ray Data
//paimon.apache.org/docs/1.4/pypaimon/ray-data/
Read
This requires ray to be installed.
You can convert the splits into a Ray Dataset and handle it with the Ray Data API for distributed processing:
    table_read = read_builder.new_read()
    ray_dataset = table_read.to_ray(splits)

    print(ray_dataset)
    # MaterializedDataset(num_blocks=1, num_rows=9, schema={f0: int32, f1: string})

    print(ray_dataset.take(3))
    # [{'f0': 1, 'f1': 'a'}, {'f0': 2, 'f1': 'b'}, {'f0': 3, 'f1': 'c'}]

    print(ray_dataset.to_pandas())
    #    f0 f1
    # 0   1  a
    # 1   2  b
    # 2   3  c
    # 3   4  d
    # .

Snapshot
//paimon.apache.org/docs/1.4/concepts/spec/snapshot/
Each commit generates a snapshot file. The version of the snapshot file starts from 1 and must be continuous. EARLIEST and LATEST are hint files at the beginning and end of the snapshot list, and they can be inaccurate. When the hint files are inaccurate, the read scans all snapshot files to determine the beginning and end.
    warehouse
    └── default.db
        └── my_table
            ├── snapshot
                ├── EARLIEST
                ├── LATEST
                ├── snapshot-1
                ├── snapshot-2
                └── snapshot-3
A writing commit preempts the next snapshot id, and once the snapshot file is successfully written, the commit becomes visible.

Table Mode
//paimon.apache.org/docs/1.4/primary-key-table/table-mode/
The file structure of the primary key table is roughly shown in the above figure. The table or partition contains multiple buckets, and each bucket is a separate LSM tree structure that contains multiple files.
The writing process of LSM is roughly as follows: a Flink checkpoint flushes L0 files and triggers a compaction as needed to merge the data.
Depending on how data is processed during writing, there are three modes:

Write Performance
//paimon.apache.org/docs/1.4/maintenance/write-performance/
Paimon’s write performance is closely related to checkpointing, so if you need greater write throughput:
Flink Configuration ('flink-conf.yaml'/'config.yaml' or SET in SQL): Increase the checkpoint interval ('execution.checkpointing.interval'), increase max concurrent checkpoints to 3 ('execution.checkpointing.max-concurrent-checkpoints'), or just use batch mode.
Increase write-buffer-size.
Enable write-buffer-spillable.
Rescale the bucket number if you are using Fixed-Bucket mode.
The option 'changelog-producer' = 'lookup' or 'full-compaction' and the option 'full-compaction.delta-commits' have a large impact on write performance. During a snapshot / full synchronization phase, you can unset these options and then enable them again in the incremental phase.

Catalog
//paimon.apache.org/docs/1.4/concepts/catalog/
Paimon provides a Catalog abstraction to manage the table of contents and metadata. The Catalog abstraction provides a series of ways to help you better integrate with computing engines. We always recommend that you use a Catalog to access Paimon tables.
Catalogs
Paimon catalogs currently support four types of metastores:
filesystem metastore (default), which stores both metadata and table files in filesystems.
hive metastore, which additionally stores metadata in Hive metastore.

Catalog API
//paimon.apache.org/docs/1.4/program-api/catalog-api/
Create Database
You can use the catalog to create databases. The created databases are persisted in the file system.
import org.apache.paimon.catalog.Catalog; public class CreateDatabase { public static void main(String[] args) { try { Catalog catalog = CreateCatalog.createFilesystemCatalog(); catalog.createDatabase("my_db", false); } catch (Catalog.DatabaseAlreadyExistException e) { // do something } } } Determine Whether Database Exists # You can use the catalog to determine whether the database exists. Committer //paimon.apache.org/docs/1.4/project/committer/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/project/committer/ Committer # Become a Committer # How to become a committer # There is no strict protocol for becoming a committer. Candidates for new committers are typically people who are active contributors and community members. Candidates are suggested by current committers or PMC members, and voted upon by the PMC. If you would like to become a committer, you should engage with the community and start contributing to Apache Paimon in any of the above ways. Dedicated Compaction //paimon.apache.org/docs/1.4/maintenance/dedicated-compaction/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/dedicated-compaction/ Dedicated Compaction # Paimon’s snapshot management supports writing with multiple writers. For S3-like object stores, 'RENAME' does not have atomic semantics. We need to configure Hive metastore and enable the 'lock.enabled' option for the catalog. By default, Paimon supports concurrent writing to different partitions. A recommended mode is for a streaming job to write records to Paimon’s latest partition while a batch job (overwrite) simultaneously writes records to the historical partition. So far, everything works very well, but if you need multiple writers to write records to the same partition, it will be a bit more complicated.
First Row //paimon.apache.org/docs/1.4/primary-key-table/merge-engine/first-row/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/merge-engine/first-row/ First Row # By specifying 'merge-engine' = 'first-row', users can keep the first row of the same primary key. It differs from the deduplicate merge engine in that the first-row merge engine generates an insert-only changelog. The first-row merge engine only supports the none and lookup changelog producers. Streaming queries must use the lookup changelog producer. You cannot specify sequence.field. DELETE and UPDATE_BEFORE messages are not accepted. Hive //paimon.apache.org/docs/1.4/ecosystem/hive/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/ecosystem/hive/ Hive # This documentation is a guide for using Paimon in Hive. Version # Paimon currently supports Hive 3.1, 2.3, 2.2, 2.1 and 2.1-cdh-6.3. Execution Engine # Paimon currently supports the MR and Tez execution engines for Hive Read, and the MR execution engine for Hive Write. Note If you use beeline, please restart the hive cluster. Installation # Download the jar file with the corresponding version. Jar Hive 3. Iceberg Tags //paimon.apache.org/docs/1.4/iceberg/iceberg-tags/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/iceberg/iceberg-tags/ Iceberg Tags # When Iceberg compatibility is enabled, Paimon Tags will also be synced to Iceberg Tags. Tags are only synced to Iceberg if the referenced snapshot exists in the Iceberg table. CREATE CATALOG paimon WITH ( 'type' = 'paimon', 'warehouse' = '<path-to-warehouse>' ); CREATE CATALOG iceberg WITH ( 'type' = 'iceberg', 'catalog-type' = 'hadoop', 'warehouse' = '<path-to-warehouse>/iceberg', 'cache-enabled' = 'false' -- disable iceberg catalog caching to quickly see the result ); -- create tag for paimon table CALL paimon. 
Manifest //paimon.apache.org/docs/1.4/concepts/spec/manifest/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/spec/manifest/ Manifest # Manifest List # ├── manifest └── manifest-list-51c16f7b-421c-4bc0-80a0-17677f343358-1 Manifest List includes meta of several manifest files. Its name contains UUID, it is an avro file, the schema is: _FILE_NAME: STRING, manifest file name. _FILE_SIZE: BIGINT, manifest file size. _NUM_ADDED_FILES: BIGINT, number added files in manifest. _NUM_DELETED_FILES: BIGINT, number deleted files in manifest. _PARTITION_STATS: SimpleStats, partition stats, the minimum and maximum values of partition fields in this manifest are beneficial for skipping certain manifest files during queries, it is a SimpleStats. Mongo CDC //paimon.apache.org/docs/1.4/cdc-ingestion/mongo-cdc/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/cdc-ingestion/mongo-cdc/ Mongo CDC # Prepare MongoDB Bundled Jar # Version Bundled Jar 3.5.0 flink-sql-connector-mongodb-cdc-3.5.0.jar Only CDC 3.5.0 or above is supported. Synchronizing Tables # By using MongoDBSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one collection from MongoDB into one Paimon table. To use this feature through flink run, run the following shell command. PyTorch //paimon.apache.org/docs/1.4/pypaimon/pytorch/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/pypaimon/pytorch/ PyTorch # Read # This requires torch to be installed. 
You can read all the data into a torch.utils.data.Dataset or torch.utils.data.IterableDataset: from torch.utils.data import DataLoader table_read = read_builder.new_read() dataset = table_read.to_torch(splits, streaming=True, prefetch_concurrency=2) dataloader = DataLoader( dataset, batch_size=2, num_workers=2, # Concurrency to read data shuffle=False ) # Collect all data from dataloader for batch_idx, batch_data in enumerate(dataloader): print(batch_data) # output: # {'user_id': tensor([1, 2]), 'behavior': ['a', 'b']} # {'user_id': tensor([3, 4]), 'behavior': ['c', 'd']} # {'user_id': tensor([5, 6]), 'behavior': ['e', 'f']} # {'user_id': tensor([7, 8]), 'behavior': ['g', 'h']} When the streaming parameter is true, it will iteratively read; when it is false, it will read the full amount of data into memory. SQL Query //paimon.apache.org/docs/1.4/flink/sql-query/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/sql-query/ SQL Query # Just like all other tables, Paimon tables can be queried with SELECT statement. Batch Query # Paimon’s batch read returns all the data in a snapshot of the table. By default, batch reads return the latest snapshot. -- Flink SQL SET 'execution.runtime-mode' = 'batch'; Batch Time Travel # Paimon batch reads with time travel can specify a snapshot or a tag and read the corresponding data. SQL Query //paimon.apache.org/docs/1.4/spark/sql-query/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/sql-query/ SQL Query # Just like all other tables, Paimon tables can be queried with SELECT statement. Batch Query # Paimon’s batch read returns all the data in a snapshot of the table. By default, batch reads return the latest snapshot. -- read all columns SELECT * FROM t; Paimon also supports reading some hidden metadata columns, currently supporting the following columns: __paimon_partition: The partition of the record. __paimon_bucket: The bucket of the record. 
Changelog Producer //paimon.apache.org/docs/1.4/primary-key-table/changelog-producer/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/changelog-producer/ Changelog Producer # Streaming write can continuously produce the latest changes for streaming read. By specifying the changelog-producer table property when creating the table, users can choose the pattern of changes produced from table files. changelog-producer may significantly reduce compaction performance, please do not enable it unless necessary. None # By default, no extra changelog producer will be applied to the writer of table. Paimon source can only see the merged changes across snapshots, like what keys are removed and what are the new values of some keys. Clone To Paimon //paimon.apache.org/docs/1.4/migration/clone-to-paimon/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/migration/clone-to-paimon/ Clone To Paimon # Clone supports cloning tables to Paimon tables. Clone is OVERWRITE semantic that will overwrite the partitions of the target table according to the data. Clone is reentrant, but it requires existing tables to contain all fields from the source table and have the same partition fields. Currently, clone supports clone Hive tables in Hive Catalog to Paimon Catalog, supports Parquet, ORC, Avro formats, target table will be append table. Consumer ID //paimon.apache.org/docs/1.4/flink/consumer-id/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/consumer-id/ Consumer ID # Consumer id can help you accomplish the following two things: Safe consumption: When deciding whether a snapshot has expired, Paimon looks at all the consumers of the table in the file system, and if there are consumers that still depend on this snapshot, then this snapshot will not be deleted by expiration. Resume from breakpoint: When previous job is stopped, the newly started job can continue to consume from the previous progress without resuming from the state. 
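The safe-consumption rule described above can be sketched as a small check. This is a toy model under stated assumptions, not Paimon's expiration code: each consumer is represented only by the id of the next snapshot it still needs, and the function and variable names are hypothetical.

```python
from typing import Dict, List


def expirable_snapshots(snapshot_ids: List[int], consumers: Dict[str, int]) -> List[int]:
    """Return the snapshot ids that no consumer still depends on.

    `consumers` maps a consumer id to the next snapshot id it will read
    (an assumed representation of consumer progress). Other retention
    rules, such as time- or count-based ones, are outside this sketch.
    """
    if not consumers:
        # No consumer protects any snapshot in this model.
        return sorted(snapshot_ids)
    oldest_needed = min(consumers.values())
    # Anything strictly older than the slowest consumer's position is free to expire.
    return sorted(s for s in snapshot_ids if s < oldest_needed)


snapshots = [1, 2, 3, 4, 5, 6]
progress = {"job-a": 4, "job-b": 3}
print(expirable_snapshots(snapshots, progress))
```

In this model the slowest consumer (`job-b`) pins snapshot 3 and everything after it, so only snapshots 1 and 2 are candidates for expiration.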
Data Evolution //paimon.apache.org/docs/1.4/pypaimon/data-evolution/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/pypaimon/data-evolution/ Data Evolution # PyPaimon for Data Evolution mode. See Data Evolution. Prerequisites # To use partial updates / data evolution, enable both options when creating the table: row-tracking.enabled: true data-evolution.enabled: true Update Columns By Row ID # You can create TableUpdate.update_by_arrow_with_row_id to update columns to data evolution tables. The input data should include the _ROW_ID column, update operation will automatically sort and match each _ROW_ID to its corresponding first_row_id, then groups rows with the same first_row_id and writes them to a separate file. DataFile //paimon.apache.org/docs/1.4/concepts/spec/datafile/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/spec/datafile/ DataFile # Partition # Consider a Partition table via Flink SQL: CREATE TABLE part_t ( f0 INT, f1 STRING, dt STRING ) PARTITIONED BY (dt); INSERT INTO part_t VALUES (1, '11', '20240514'); The file system will be: part_t ├── dt=20240514 │ └── bucket-0 │ └── data-ca1c3c38-dc8d-4533-949b-82e195b41bd4-0.orc ├── manifest │ ├── manifest-08995fe5-c2ac-4f54-9a5f-d3af1fcde41d-0 │ ├── manifest-list-51c16f7b-421c-4bc0-80a0-17677f343358-0 │ └── manifest-list-51c16f7b-421c-4bc0-80a0-17677f343358-1 ├── schema │ └── schema-0 └── snapshot ├── EARLIEST ├── LATEST └── snapshot-1 Paimon adopts the same partitioning concept as Apache Hive to separate data. Hive Catalogs //paimon.apache.org/docs/1.4/iceberg/hive-catalog/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/iceberg/hive-catalog/ Hive Catalog # When creating Paimon table, set 'metadata.iceberg.storage' = 'hive-catalog'. This option value not only store Iceberg metadata like hadoop-catalog, but also create Iceberg external table in Hive. This Paimon table can be accessed from Iceberg Hive catalog later. 
To provide information about Hive metastore, you also need to set some (or all) of the following table options when creating Paimon table. Option Default Type Description metadata. Manage Snapshots //paimon.apache.org/docs/1.4/maintenance/manage-snapshots/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/manage-snapshots/ Manage Snapshots # This section will describe the management and behavior related to snapshots. Expire Snapshots # Paimon writers generate one or two snapshot per commit. Each snapshot may add some new data files or mark some old data files as deleted. However, the marked data files are not truly deleted because Paimon also supports time traveling to an earlier snapshot. They are only deleted when the snapshot expires. Pulsar CDC //paimon.apache.org/docs/1.4/cdc-ingestion/pulsar-cdc/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/cdc-ingestion/pulsar-cdc/ Pulsar CDC # Prepare Pulsar Bundled Jar # flink-connector-pulsar-*.jar Supported Formats # Flink provides several Pulsar CDC formats: Canal Json, Debezium Json, Debezium Avro, Ogg Json, Maxwell Json and Normal Json. If a message in a pulsar topic is a change event captured from another database using the Change Data Capture (CDC) tool, then you can use the Paimon Pulsar CDC. Write the INSERT, UPDATE, DELETE messages parsed into the paimon table. Rest Catalog //paimon.apache.org/docs/1.4/iceberg/rest-catalog/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/iceberg/rest-catalog/ Rest Catalog # When creating Paimon table, set 'metadata.iceberg.storage' = 'rest-catalog'. This option value will not only store Iceberg metadata like hadoop-catalog, but also create table in iceberg rest catalog. This Paimon table can be accessed from Iceberg Rest catalog later. You need to provide information about Rest Catalog by setting options prefixed with 'metadata.iceberg.rest.', such as 'metadata.iceberg.rest.uri' = 'https://localhost/'. 
Paimon will try to use these options to initialize an iceberg rest catalog, and use this rest catalog to commit metadata. Row Tracking //paimon.apache.org/docs/1.4/append-table/row-tracking/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/append-table/row-tracking/ Row tracking # Row tracking enables row-level tracking in a Paimon append table. Once enabled on a Paimon table, two more hidden columns will be added to the table schema: _ROW_ID: BIGINT, this is a unique identifier for each row in the table. It is used to track the update of the row and can be used to identify the row in case of update, merge into or delete. Tables //paimon.apache.org/docs/1.4/concepts/rest/tables/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/rest/tables/ Tables # Paimon supports tables: paimon table: Paimon Data Table with or without Primary key format-table: file format table refers to a directory that contains multiple files of the same format, where operations on this table allow for reading or writing to these files, compatible with Hive tables. object table: provides metadata indexes for unstructured data objects in the specified Object Storage directory. Paimon Table # Primary Key Table # See Paimon with Primary key. Trino //paimon.apache.org/docs/1.4/ecosystem/trino/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/ecosystem/trino/ Trino # This documentation is a guide for using Paimon in Trino. Version # Paimon currently supports Trino 440. Filesystem # Since version 0.8, Paimon shares the Trino filesystem for all actions, which means you should configure the Trino filesystem before using trino-paimon. You can find information about how to configure filesystems for Trino on the official Trino website. Preparing Paimon Jar File # Download You can also manually build a bundled jar from the source code. 
Amoro //paimon.apache.org/docs/1.4/ecosystem/amoro/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/ecosystem/amoro/ Apache Amoro With Paimon # Apache Amoro(incubating) is a Lakehouse management system built on open data lake formats. Working with compute engines including Flink, Spark, and Trino, Amoro brings pluggable and Table Maintenance features for a Lakehouse to provide out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused and lake-native architecture. AMS(Amoro Management Service) provides Lakehouse management features, like self-optimizing, data expiration, etc. It also provides a unified catalog service for all compute engines, which can also be combined with existing metadata services like HMS(Hive Metastore). Cpp API //paimon.apache.org/docs/1.4/program-api/cpp-api/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/program-api/cpp-api/ Cpp API # Paimon C++ is a high-performance C++ implementation of Apache Paimon. Paimon C++ aims to provide a native, high-performance and extensible implementation that allows native engines to access the Paimon datalake format with maximum efficiency. Environment Settings # Paimon C++ is currently governed under Alibaba open source community. You can checkout the document for more details about environment settings. git clone https://github.com/alibaba/paimon-cpp.git cd paimon-cpp mkdir build-release cd build-release cmake . Data Evolution //paimon.apache.org/docs/1.4/append-table/data-evolution/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/append-table/data-evolution/ Data Evolution # Overview # Paimon supports complete Schema Evolution, allowing you to freely add, modify, or delete column schema. But how to backfill newly added columns or update column data. Data Evolution Mode is a new feature for Append tables that revolutionizes how you handle data evolution, particularly when adding new columns. 
This mode allows you to update partial columns without rewriting entire data files. Instead, it writes new column data to separate files and intelligently merges them with the original data during read operations. Ecosystem //paimon.apache.org/docs/1.4/iceberg/ecosystem/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/iceberg/ecosystem/ Iceberg Ecosystems # AWS Athena # AWS Athena may use old manifest reader to read Iceberg manifest by names, we should let Paimon producing legacy Iceberg manifest list file, you can enable: 'metadata.iceberg.manifest-legacy-version'. DuckDB # Duckdb may rely on files placed in the root/data directory, while Paimon is usually placed directly in the root directory, so you can configure this parameter for the table to achieve compatibility: 'data-file. PVFS //paimon.apache.org/docs/1.4/concepts/rest/pvfs/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/rest/pvfs/ Paimon Virtual Storage # The REST Catalog provides built-in storage, including Paimon Table, Format Table, and Object Table (also known as Fileset or Volume), both of which require direct access to the file system. And our REST Catalog generates UUID paths, which makes it difficult to directly access the file system. So there is PVFS, which can allow users to access it through similar methods pvfs://catalog_name/database_name/table_name/, use the path to access all internal tables in the REST Catalog, including Paimon Table, Format Table, and Object Table. Sequence & Rowkind //paimon.apache.org/docs/1.4/primary-key-table/sequence-rowkind/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/sequence-rowkind/ Sequence and Rowkind # When creating a table, you can specify the 'sequence.field' by specifying fields to determine the order of updates, or you can specify the 'rowkind.field' to determine the changelog kind of record. 
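As a rough illustration of how a sequence field can determine the order of updates, the sketch below merges records with the same primary key by keeping the one with the largest sequence value, regardless of arrival order. It is a toy model, not Paimon's merge implementation: the record layout (`pk`, `seq`, `value`), the function name, and the rule that a later arrival wins ties are assumptions made for the example.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    pk: int      # primary key
    seq: int     # value of the (hypothetical) sequence field
    value: str   # payload column


def merge_by_sequence(records: Iterable[Record]) -> Dict[int, Record]:
    """Keep, per primary key, the record with the largest sequence value."""
    latest: Dict[int, Record] = {}
    for rec in records:
        current = latest.get(rec.pk)
        # ">=" lets a later arrival win ties (an assumption of this sketch).
        if current is None or rec.seq >= current.seq:
            latest[rec.pk] = rec
    return latest


# The "new" record arrives before the "old" one, as can happen with disorder.
arrivals = [Record(1, 2, "new"), Record(1, 1, "old"), Record(2, 5, "only")]
merged = merge_by_sequence(arrivals)
print(merged[1].value)
print(merged[2].value)
```

Without the sequence comparison, the late-arriving `old` record would overwrite `new`, which is the disorder problem a sequence field is meant to avoid.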
Sequence Field # By default, the primary key table determines the merge order according to the input order (the last input record will be the last to merge). However, in distributed computing, there will be some cases that lead to data disorder. SQL Alter //paimon.apache.org/docs/1.4/spark/sql-alter/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/sql-alter/ Altering Tables # Changing/Adding Table Properties # The following SQL sets write-buffer-size table property to 256 MB. ALTER TABLE my_table SET TBLPROPERTIES ( 'write-buffer-size' = '256 MB' ); Removing Table Properties # The following SQL removes write-buffer-size table property. ALTER TABLE my_table UNSET TBLPROPERTIES ('write-buffer-size'); Changing/Adding Table Comment # The following SQL changes comment of table my_table to table comment. ALTER TABLE my_table SET TBLPROPERTIES ( 'comment' = 'table comment' ); Removing Table Comment # The following SQL removes table comment. SQL Lookup //paimon.apache.org/docs/1.4/flink/sql-lookup/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/sql-lookup/ Lookup Joins # Lookup Joins are a type of join in streaming queries. It is used to enrich a table with data that is queried from Paimon. The join requires one table to have a processing time attribute and the other table to be backed by a lookup source connector. Paimon supports lookup joins on tables with primary keys and append tables in Flink. The following example illustrates this feature. Auxiliary //paimon.apache.org/docs/1.4/spark/auxiliary/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/auxiliary/ Auxiliary Statements # Set / Reset # The SET command sets a property, returns the value of an existing property or returns all SQLConf properties with value and meaning. The RESET command resets runtime configurations specific to the current session which were set via the SET command to their default values. 
To set dynamic options globally, you need add the spark.paimon. prefix. You can also set dynamic table options at this format: spark. Blob Storage //paimon.apache.org/docs/1.4/append-table/blob/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/append-table/blob/ Blob Storage # Overview # The BLOB (Binary Large Object) type is a data type designed for storing multimodal data such as images, videos, audio files, and other large binary objects in Paimon tables. Unlike traditional BYTES type which stores binary data inline with other columns, BLOB type stores large binary data in separate files and maintains references to them, providing better performance for large objects. The Blob Storage is based on Data Evolution mode. Compaction //paimon.apache.org/docs/1.4/primary-key-table/compaction/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/compaction/ Compaction # When more and more records are written into the LSM tree, the number of sorted runs will increase. Because querying an LSM tree requires all sorted runs to be combined, too many sorted runs will result in a poor query performance, or even out of memory. To limit the number of sorted runs, we have to merge several sorted runs into one big sorted run once in a while. Configurations //paimon.apache.org/docs/1.4/iceberg/configurations/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/iceberg/configurations/ Configurations # Options for Iceberg Compatibility. Key Default Type Description metadata.iceberg.compaction.max.file-num 50 Integer If number of small Iceberg manifest metadata files exceeds this limit, always trigger manifest metadata compaction regardless of their total size. metadata.iceberg.compaction.min.file-num 10 Integer Minimum number of Iceberg manifest metadata files to trigger manifest metadata compaction. metadata.iceberg.database (none) String Metastore database name for Iceberg Catalog. 
Set this as an iceberg database alias if using a centralized Catalog. FileFormat //paimon.apache.org/docs/1.4/concepts/spec/fileformat/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/spec/fileformat/ File Format # Currently, Paimon supports Parquet, Avro, ORC, CSV, JSON, and Lance file formats. The recommended columnar format is Parquet, which has a high compression rate and fast column projection queries. The recommended row-based format is Avro, which has good performance in reading and writing full rows (all columns). The recommended testing format is CSV, which has better readability but the worst read-write performance. The recommended format for ML workloads is Lance, which is optimized for vector search and machine learning use cases. FUSE Support //paimon.apache.org/docs/1.4/pypaimon/fuse-support/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/pypaimon/fuse-support/ FUSE Support # When using PyPaimon REST Catalog to access remote object storage (such as OSS, S3, or HDFS), data access typically goes through remote storage SDKs. However, in scenarios where remote storage paths are mounted locally via FUSE (Filesystem in Userspace), users can access data directly through local filesystem paths for better performance. This feature enables PyPaimon to use local file access when FUSE mount is available, bypassing remote storage SDKs. Rescale Bucket //paimon.apache.org/docs/1.4/maintenance/rescale-bucket/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/rescale-bucket/ Rescale Bucket # Since the number of total buckets dramatically influences the performance, Paimon allows users to tune bucket numbers by ALTER TABLE command and reorganize data layout by INSERT OVERWRITE without recreating the table/partition. When executing overwrite jobs, the framework will automatically scan the data with the old bucket number and hash the record according to the current bucket number. 
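The re-hashing step described above can be illustrated with a toy model. This sketch assumes a simple `hash(key) % bucket_count` assignment and uses `zlib.crc32` as a stand-in hash; Paimon's actual hash function and file layout differ, and `assign_bucket` and `rescale` are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def assign_bucket(key: str, num_buckets: int) -> int:
    """Map a key to a bucket with a stand-in hash (crc32), modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def rescale(old_layout: Dict[int, List[str]], new_num_buckets: int) -> Dict[int, List[str]]:
    """Scan every record in the old buckets and re-hash it with the new bucket count."""
    new_layout: Dict[int, List[str]] = defaultdict(list)
    for keys in old_layout.values():
        for key in keys:
            new_layout[assign_bucket(key, new_num_buckets)].append(key)
    return dict(new_layout)


keys = [f"order-{i}" for i in range(8)]

# Initial layout written with 2 buckets.
old_layout: Dict[int, List[str]] = defaultdict(list)
for key in keys:
    old_layout[assign_bucket(key, 2)].append(key)

# "Overwrite" with 4 buckets: every record is read and placed again.
new_layout = rescale(dict(old_layout), 4)
for bucket in sorted(new_layout):
    print(bucket, new_layout[bucket])
```

The point of the sketch is that changing the bucket count invalidates the old placement of every record, which is why the data layout has to be rewritten rather than patched in place.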
Rescale Overwrite # -- rescale number of total buckets ALTER TABLE table_identifier SET ('bucket' = '. SQL Alter //paimon.apache.org/docs/1.4/flink/sql-alter/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/sql-alter/ Altering Tables # Changing/Adding Table Properties # The following SQL sets the write-buffer-size table property to 256 MB. ALTER TABLE my_table SET ( 'write-buffer-size' = '256 MB' ); Removing Table Properties # The following SQL removes the write-buffer-size table property. ALTER TABLE my_table RESET ('write-buffer-size'); Changing/Adding Table Comment # The following SQL changes the comment of table my_table to table comment. ALTER TABLE my_table SET ( 'comment' = 'table comment' ); Removing Table Comment # The following SQL removes the table comment. System Tables //paimon.apache.org/docs/1.4/concepts/system-tables/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/system-tables/ System Tables # Paimon provides a very rich set of system tables to help users better analyze and query the status of Paimon tables: Query the status of the data table: Data System Table. Query the global status of the entire Catalog: Global System Table. Data System Table # Data System tables contain metadata and information about each Paimon data table, such as the snapshots created and the options in use. Table Index //paimon.apache.org/docs/1.4/concepts/spec/tableindex/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/spec/tableindex/ Table index # Table index files are in the index directory. Dynamic Bucket Index # The dynamic bucket index stores the correspondence between the hash value of the primary key and the bucket. Its structure is very simple, only storing hash values in the file: HASH_VALUE | HASH_VALUE | HASH_VALUE | HASH_VALUE | … HASH_VALUE is the hash value of the primary key. 4 bytes, BIG_ENDIAN. Deletion Vectors # A deletion file stores the positions of deleted records for each data file. 
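The dynamic bucket index layout described above (a flat sequence of 4-byte big-endian hash values) can be decoded with a few lines of Python. This is a minimal sketch, not Paimon's reader: the function names are hypothetical, and treating each value as a signed 32-bit integer is an assumption made here rather than something the text states.

```python
import struct
from typing import List


def encode_hashes(hashes: List[int]) -> bytes:
    """Pack hash values as consecutive 4-byte big-endian signed integers."""
    return struct.pack(f">{len(hashes)}i", *hashes)


def decode_hashes(data: bytes) -> List[int]:
    """Unpack a buffer of consecutive 4-byte big-endian signed integers."""
    if len(data) % 4 != 0:
        raise ValueError("index payload length must be a multiple of 4 bytes")
    count = len(data) // 4
    return list(struct.unpack(f">{count}i", data))


payload = encode_hashes([1, -2, 2147483647])
print(payload.hex())           # 00000001fffffffe7fffffff
print(decode_hashes(payload))  # [1, -2, 2147483647]
```

Because the file carries no header or per-entry framing in this layout, the number of entries is simply the payload length divided by four.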
Vector Storage //paimon.apache.org/docs/1.4/append-table/vector/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/append-table/vector/ Vector Storage # Overview # With the explosive growth of AI scenarios, vector storage has become increasingly important. Paimon provides optimized storage solutions specifically designed for vector data to meet the needs of various scenarios. Vector Data Type # Vector data comes in many types, among which dense vectors are the most commonly used. They are typically expressed as fixed-length, densely packed arrays, generally without null elements. Data Types //paimon.apache.org/docs/1.4/concepts/data-types/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/data-types/ Data Types # A data type describes the logical type of a value in the table ecosystem. It can be used to declare input and/or output types of operations. All data types supported by Paimon are as follows: DataType Description BOOLEAN Data type of a boolean with a (possibly) three-valued logic of TRUE, FALSE, and UNKNOWN. CHAR CHAR(n) Data type of a fixed-length character string. Default Value //paimon.apache.org/docs/1.4/flink/default-value/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/default-value/ Default Value # Paimon allows specifying default values for columns. When users write to these tables without explicitly providing values for certain columns, Paimon automatically generates default values for these columns. Create Table # Flink SQL does not have native support for default values, so we can only create a table without default values: CREATE TABLE my_table ( a BIGINT, b STRING, c INT, tags ARRAY<STRING>, properties MAP<STRING, STRING>, nested ROW<x INT, y STRING> ); We support the procedure of modifying column default values in Flink. 
Default Value //paimon.apache.org/docs/1.4/spark/default-value/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/default-value/ Default Value # Paimon allows specifying default values for columns. When users write to these tables without explicitly providing values for certain columns, Paimon automatically generates default values for these columns. Create Table # You can create a table with columns with default values using the following SQL: CREATE TABLE my_table ( a BIGINT, b STRING DEFAULT 'my_value', c INT DEFAULT 5, tags ARRAY<STRING> DEFAULT ARRAY('tag1', 'tag2', 'tag3'), properties MAP<STRING, STRING> DEFAULT MAP('key1', 'value1', 'key2', 'value2'), nested STRUCT<x: INT, y: STRING> DEFAULT STRUCT(42, 'default_value') ); Insert Table # For SQL commands that execute table writes, such as the INSERT, UPDATE, and MERGE commands, the DEFAULT keyword or NULL value is parsed into the default value specified for the corresponding column. File Index //paimon.apache.org/docs/1.4/concepts/spec/fileindex/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/spec/fileindex/ File index # Define file-index.${index_type}.columns, Paimon will create its corresponding index file for each file. If the index file is too small, it will be stored directly in the manifest, or in the directory of the data file. Each data file corresponds to an index file, which has a separate file definition and can contain different types of indexes with multiple columns. Index File # File index file format. Global Index //paimon.apache.org/docs/1.4/append-table/global-index/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/append-table/global-index/ Global Index # Overview # Global Index is a powerful indexing mechanism for Data Evolution (append) tables. It enables efficient row-level lookups and filtering without full-table scans. Paimon supports multiple global index types: BTree Index: A B-tree based index for scalar column lookups. 
Supports equality, IN, range predicates, and can be combined across multiple columns with AND/OR logic. Vector Index: An approximate nearest neighbor (ANN) index powered by DiskANN for vector similarity search. Manage Tags //paimon.apache.org/docs/1.4/maintenance/manage-tags/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/manage-tags/ Manage Tags # Paimon’s snapshots can provide an easy way to query historical data. But in most scenarios, a job will generate too many snapshots and table will expire old snapshots according to table configuration. Snapshot expiration will also delete old data files, and the historical data of expired snapshots cannot be queried anymore. To solve this problem, you can create a tag based on a snapshot. The tag will maintain the manifests and data files of the snapshot. PyJindoSDK Support //paimon.apache.org/docs/1.4/pypaimon/pyjindosdk-support/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/pypaimon/pyjindosdk-support/ PyJindoSDK Support # Introduction # JindoSDK is a high-performance storage SDK developed by Alibaba Cloud for accessing OSS (Object Storage Service) and other cloud storage systems. It provides optimized I/O performance and deep integration with the Alibaba Cloud ecosystem. PyPaimon now supports using PyJindoSDK (the Python binding of JindoSDK) to access OSS. Compared to the legacy implementation based on PyArrow’s S3FileSystem, PyJindoSDK offers better performance and compatibility when working with OSS. Query Performance //paimon.apache.org/docs/1.4/primary-key-table/query-performance/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/query-performance/ Query Performance # Table Mode # The table schema has the greatest impact on query performance. See Table Mode. For Merge On Read table, the most important thing you should pay attention to is the number of buckets, which will limit the concurrency of reading data. 
For MOW (Deletion Vectors), COW, or Read Optimized tables, there is no limit to the concurrency of reading data, and they can also utilize some filtering conditions for non-primary-key columns. Chain Table //paimon.apache.org/docs/1.4/primary-key-table/chain-table/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/chain-table/ Chain Table # Chain table is a new capability for primary key tables that transforms how you process incremental data. Imagine a scenario where you periodically store a full snapshot of data (for example, once a day), even though only a small portion changes between snapshots. ODS binlog dump is a typical example of this pattern. Take a daily binlog dump job as an example: a batch job merges yesterday’s full dataset with today’s incremental changes to produce a new full dataset. DataFrame //paimon.apache.org/docs/1.4/spark/dataframe/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/dataframe/ DataFrame # Paimon supports creating tables, inserting data, and querying through the Spark DataFrame API. Create Table # You can specify table properties with option or set partition columns with partitionBy if needed. val data: DataFrame = Seq((1, "x1", "p1"), (2, "x2", "p2")).toDF("a", "b", "pt") data.write.format("paimon") .option("primary-key", "a,pt") .option("k1", "v1") .partitionBy("pt") .saveAsTable("test_tbl") // or .save("/path/to/default.db/test_tbl") Insert # Insert Into # You can achieve INSERT INTO semantics by setting the mode to append. Functions //paimon.apache.org/docs/1.4/concepts/functions/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/functions/ Functions # Paimon introduces a Function abstraction designed to support functions in a standard format for compute engines, addressing: Unified Column-Level Filtering and Processing: Facilitates operations at the column level, including tasks such as encryption and decryption of data. 
Parameterized View Capabilities: Supports parameterized operations within views, enhancing the dynamism and usability of data retrieval processes. Types of Functions Supported # Currently, Paimon supports three types of functions: Metrics //paimon.apache.org/docs/1.4/maintenance/metrics/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/metrics/ Paimon Metrics # Paimon has built a metrics system to measure the behaviours of reading and writing, such as how many manifest files were scanned in the last planning, how long the last commit operation took, and how many files were deleted in the last compact operation. In Paimon’s metrics system, metrics are updated and reported at table granularity. The Paimon metrics system provides three types of metrics: Gauge, Counter, and Histogram. Flink CDC //paimon.apache.org/docs/1.4/cdc-ingestion/flink-cdc/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/cdc-ingestion/flink-cdc/ Flink CDC # Flink CDC is a streaming data integration tool for the Flink engine. It allows users to describe their ETL pipeline logic elegantly via YAML and helps them automatically generate customized Flink operators and submit jobs. The Paimon Pipeline connector can be used as either the Data Source or the Data Sink of the Flink CDC pipeline. This document describes how to set up the Paimon Pipeline connector as the Data Source. Manage Privileges //paimon.apache.org/docs/1.4/maintenance/manage-privileges/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/manage-privileges/ Manage Privileges # Paimon provides a privilege system on catalogs. Privileges determine which users can perform which operations on which objects, so that you can manage table access in a fine-grained manner. Currently, Paimon adopts the identity-based access control (IBAC) privilege model. That is, privileges are directly assigned to users. 
This privilege system only prevents unwanted users from accessing tables through catalogs. It does not block access through temporary tables (created by specifying the table path on the filesystem), nor does it prevent users from directly modifying data files on the filesystem. PK Clustering Override //paimon.apache.org/docs/1.4/primary-key-table/pk-clustering-override/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/primary-key-table/pk-clustering-override/ PK Clustering Override # By default, data files in a primary key table are physically sorted by the primary key. This is optimal for point lookups but can hurt scan performance when queries filter on non-primary-key columns. PK Clustering Override mode changes the physical sort order of data files from the primary key to user-specified clustering columns. This significantly improves scan performance for queries that filter or group by clustering columns, while still maintaining primary key uniqueness through deletion vectors. SQL Upsert //paimon.apache.org/docs/1.4/spark/sql-upsert/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/sql-upsert/ SQL Upsert # For tables without a primary key, Paimon supports an upsert write mode: if a row with the same upsert key already exists, it is updated; otherwise, the row is inserted. Usage # Specify the following table properties when creating the table: upsert-key: Defines the key columns used for upsert; cannot be used together with a primary key. Unlike a primary key, the upsert key value can be null, and null-equality matching is supported. Views //paimon.apache.org/docs/1.4/concepts/views/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/views/ Views # A view is a logical table that encapsulates business logic and domain-specific semantics. While most compute engines support views natively, each engine stores view metadata in proprietary formats, creating interoperability challenges across different platforms. 
Paimon views abstract engine-specific query dialects and establish unified metadata standards. View metadata can enable centralized view management that facilitates cross-engine sharing and reduces maintenance complexity in heterogeneous computing environments. Catalog support # View metadata is persisted only when the catalog implementation supports it: Manage Branches //paimon.apache.org/docs/1.4/maintenance/manage-branches/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/manage-branches/ Manage Branches # In streaming data processing, it’s difficult to correct data because the correction may affect the existing data, and users would see provisional streaming results, which is not expected. Suppose the branch that the existing workflow is processing on is the ‘main’ branch. Creating a custom data branch lets you run experimental tests and validate data for a new job on the existing table, without stopping the existing reading / writing workflows and without copying data from the main branch. Structured Streaming //paimon.apache.org/docs/1.4/spark/structured-streaming/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/structured-streaming/ Structured Streaming # Paimon supports streaming data processing with Spark Structured Streaming, enabling both streaming write and streaming query. Streaming Write # Paimon Structured Streaming supports only two output modes: append and complete. // Create a paimon table if not exists. spark.sql(s""" |CREATE TABLE T (k INT, v STRING) |TBLPROPERTIES ('primary-key'='k', 'bucket'='3') |""".stripMargin) // Here we use MemoryStream to fake a streaming source. val inputData = MemoryStream[(Int, String)] val df = inputData. 
Manage Partitions //paimon.apache.org/docs/1.4/maintenance/manage-partitions/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/manage-partitions/ Manage Partitions # Paimon provides multiple ways to manage partitions, including expiring historical partitions by different strategies and marking a partition done to notify downstream applications that the partition has finished writing. Expiring Partitions # You can set partition.expiration-time when creating a partitioned table. The Paimon streaming sink periodically checks the status of partitions and deletes expired partitions according to time. How to determine whether a partition has expired: you can set partition. Procedures //paimon.apache.org/docs/1.4/flink/procedures/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/procedures/ Procedures # Flink 1.18 and later versions support Call Statements, which make it easier to manipulate the data and metadata of Paimon tables by writing SQL instead of submitting Flink jobs. In 1.18, the procedure only supports passing arguments by position. You must pass all arguments in order, and if you don’t want to pass some arguments, you must use '' as a placeholder. For example, if you want to compact table default. Action Jars //paimon.apache.org/docs/1.4/flink/action-jars/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/action-jars/ Action Jars # After the Flink Local Cluster has been started, you can execute the action jar by using the following command. <FLINK_HOME>/bin/flink run \ /path/to/paimon-flink-action-1.4.1.jar \ <action> <args> The following command is used to compact a table. <FLINK_HOME>/bin/flink run \ /path/to/paimon-flink-action-1.4.1.jar \ compact \ --path <TABLE_PATH> Merging into table # Paimon supports “MERGE INTO” by submitting the ‘merge_into’ job through flink run. 
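The compact invocation shown in the Action Jars entry above can also be assembled programmatically, which helps when the command is launched from a scheduler. The following is a minimal sketch, not part of Paimon: the helper name build_action_command, the Flink home /opt/flink, and the table path are all hypothetical placeholders; only the overall shape (flink run, action jar, action name, arguments) and the compact action with its --path argument come from the entry above.

```python
import shlex
from typing import List, Sequence


def build_action_command(
    flink_home: str,
    action_jar: str,
    action: str,
    args: Sequence[str] = (),
) -> List[str]:
    """Assemble `<FLINK_HOME>/bin/flink run <action jar> <action> <args>` as an argv list.

    Hypothetical helper for illustration; it only builds the command, it does not run it.
    """
    if not action:
        raise ValueError("action must not be empty")
    flink_bin = flink_home.rstrip("/") + "/bin/flink"
    return [flink_bin, "run", action_jar, action, *args]


# Hypothetical locations; substitute your own FLINK_HOME and TABLE_PATH.
cmd = build_action_command(
    "/opt/flink",
    "/path/to/paimon-flink-action-1.4.1.jar",
    "compact",
    ["--path", "/tmp/warehouse/default.db/my_table"],
)

# shlex.join renders the argv list as a shell-safe string for logging or review.
print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids quoting problems if a path contains spaces; the list can be passed to subprocess.run as-is once the placeholders are replaced.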
Command Line Interface //paimon.apache.org/docs/1.4/pypaimon/cli/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/pypaimon/cli/ Command Line Interface # PyPaimon provides a command-line interface (CLI) for interacting with Paimon catalogs and tables. The CLI allows you to read data from Paimon tables directly from the command line. Installation # The CLI is installed automatically when you install PyPaimon: pip install pypaimon After installation, the paimon command will be available in your terminal. Basic Usage # Before using the CLI, you need to create a catalog configuration file. Procedures //paimon.apache.org/docs/1.4/spark/procedures/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/spark/procedures/ Procedures # This section introduces all available Spark procedures for Paimon. Procedure Name Explanation Example compact To compact files. Argument: table: the target table identifier. Cannot be empty. partitions: partition filter. The comma (",") represents "AND", and the semicolon (";") represents "OR". If you want to compact one partition with date=01 and day=01, you need to write 'date=01,day=01'. Leave empty for all partitions. (Can't be used together with " Savepoint //paimon.apache.org/docs/1.4/flink/savepoint/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/flink/savepoint/ Savepoint # Paimon has its own snapshot management, which may conflict with Flink’s checkpoint management and cause exceptions when restoring from a savepoint (don’t worry, it will not damage the storage). It is recommended that you use one of the following methods to take a savepoint: Use Flink Stop with savepoint. Use Paimon Tag with Flink Savepoint, and rollback-to-tag before restoring from the savepoint. Stop with savepoint # This feature of Flink ensures that the last checkpoint is fully processed, which means there will be no more uncommitted metadata left. 
Configurations //paimon.apache.org/docs/1.4/maintenance/configurations/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/maintenance/configurations/ Configuration # CoreOptions # Core options for paimon. Key Default Type Description add-column-before-partition false Boolean If true, when adding a new column without specifying a position, the column will be placed before the first partition column instead of at the end of the schema. This only takes effect for partitioned tables. aggregation.remove-record-on-delete false Boolean Whether to remove the whole row in aggregation engine when -D records are received. REST API //paimon.apache.org/docs/1.4/concepts/rest/rest-api/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/concepts/rest/rest-api/ Versions //paimon.apache.org/docs/1.4/versions/ Mon, 01 Jan 0001 00:00:00 +0000 //paimon.apache.org/docs/1.4/versions/ Versions # An appendix of hosted documentation for all versions of Apache Paimon. master stable 1.3 1.2 1.1