Apache Paimon 0.7 Available #
February 29, 2024 - Paimon Community (dev@paimon.apache.org)
The Apache Paimon PPMC has officially released Apache Paimon version 0.7.0-incubating. A total of 34 people contributed to this release, completing over 300 commits. Thank you to all contributors for your support!
In this version, we mainly focused on enhancing and optimizing existing features. Details follow below.
Flink #
Lookup Join #
- Fixed a bug where lookup join could not handle the sequence field of the dimension table.
- Introduced primary-key partial lookup, based on Paimon hash lookup, for lookup join.
- Use parallel reading and bulk loading to speed up the initial data load of the dimension table.
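For context, these improvements target the standard Flink SQL lookup join against a Paimon dimension table, which looks roughly like this (table and column names are illustrative):

```sql
-- Enrich a stream of orders with a Paimon dimension table via lookup join.
-- `orders`, `customers`, and their columns are hypothetical names.
SELECT o.order_id, o.total, c.country
FROM orders AS o
JOIN customers FOR SYSTEM_TIME AS OF o.proc_time AS c
  ON o.customer_id = c.customer_id;
```

The initial load of `customers` into the local lookup store is the step sped up by parallel reading and bulk loading.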
Paimon CDC #
In 0.7, we continued to improve the Paimon CDC:
- Support Postgres table synchronization.
- Paimon data commits rely on checkpointing, but many users forget to enable it when submitting a CDC job. In that case, the job now sets the checkpoint interval to 180 seconds automatically.
- Support UNAWARE bucket mode tables as a CDC sink.
- Support extracting a time attribute from the CDC input as a watermark. You can now set `tag.automatic-creation` to `watermark` in CDC jobs.
Spark #
- Merge-into supports WHEN NOT MATCHED BY SOURCE semantics.
- Sort compact supports Hilbert Curve sorter.
- Multiple optimizations to improve query performance.
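The new merge-into semantics follow standard Spark SQL `MERGE INTO` syntax; a minimal sketch (table names are hypothetical) might look like:

```sql
-- Illustrative Spark SQL: rows in `target` with no match in `source`
-- are handled by the new WHEN NOT MATCHED BY SOURCE clause.
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE;
```

This makes it possible to, for example, delete target rows that have disappeared from the source in a single statement.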
Hive #
In 0.7, we mainly focused on improving compatibility with Hive.
- Support timestamp with local time zone type.
- Support creating a database with location, comment, and properties via HiveSQL.
- Support table comment.
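The database creation support covers the standard HiveSQL clauses, for example (database name, path, and properties are illustrative):

```sql
-- Illustrative HiveSQL: create a database with comment, location, and properties.
CREATE DATABASE sales_db
COMMENT 'Sales domain tables'
LOCATION '/warehouse/sales_db'
WITH DBPROPERTIES ('owner' = 'data-team');
```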
Tag Management #
- Support a new `tag.automatic-creation` mode `batch`. In this mode, a tag is created after a batch job completes.
- Tag auto-creation relies on commits, so if no commit happens when it is time to auto-create a tag, the tag won't be created. For this case, we introduce the option `snapshot.watermark-idle-timeout`: if the Flink source idles longer than the specified timeout, the job forces a snapshot and thus triggers tag creation.
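As a sketch of how the new mode might be enabled (the table schema and names are illustrative, not from the release notes):

```sql
-- Illustrative Flink SQL: a Paimon table whose batch write jobs
-- create a tag on completion via the new `batch` mode.
CREATE TABLE daily_sales (
  id BIGINT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'tag.automatic-creation' = 'batch'
);
```

`snapshot.watermark-idle-timeout` is set the same way, as a table option, when using watermark-based tag creation.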
New Aggregation Functions #
- count: counts values across multiple rows.
- product: computes the product of values across multiple rows.
- nested-update: collects multiple rows into one ARRAY (a so-called 'nested table'). You can use `fields.<field-name>.nested-key=pk0,pk1,...` to define the primary keys of the nested table. If no keys are defined, rows are appended to the array.
- collect: collects elements into an ARRAY. You can set `fields.<field-name>.distinct=true` to deduplicate elements.
- merge_map: merges input maps into a single map.
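Aggregation functions are configured per field on a table with the aggregation merge engine; a minimal sketch (table and field names are illustrative) could look like:

```sql
-- Illustrative Flink SQL: per-field aggregation functions on a
-- primary-key table with the aggregation merge engine.
CREATE TABLE page_stats (
  page_id STRING,
  visit_count BIGINT,
  visitors ARRAY<STRING>,
  PRIMARY KEY (page_id) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation',
  'fields.visit_count.aggregate-function' = 'count',
  'fields.visitors.aggregate-function' = 'collect',
  'fields.visitors.distinct' = 'true'
);
```

Each new row with the same `page_id` is merged into the existing row according to the per-field functions.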
New Metrics #
- Support the Flink standard connector metric `currentEmitEventTimeLag`.
- Support `level0FileCount` to show compaction progress.
Other Improvements #
Besides the above, there are some useful improvements to existing features:
- New time travel option `scan.file-creation-time-millis`: with this option, only data files created after the given time are read. It is more convenient than `scan.timestamp-millis` and `scan.tag-name`, but imprecise (depending on whether compaction occurs).
- For primary key tables, the row kind can now be determined by a field specified by the option `rowkind.field`.
- Support ignoring delete records in deduplicate mode via the option `deduplicate.ignore-delete`.
- Support ignoring the consumer id when starting a streaming read job via the option `consumer.ignore-progress`.
- Support the new procedure `expire_snapshots` to manually trigger snapshot expiration.
- Support the new system table `aggregation_fields` to show aggregation field information for aggregate or partial-update tables.
- Introduce a bloom filter to speed up local file lookup, which benefits both the lookup changelog-producer and Flink lookup join.
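The new time travel option can be passed per query as a dynamic table option; a sketch in Flink SQL (table name and timestamp are illustrative):

```sql
-- Illustrative: read only data files created after the given epoch millis.
SELECT * FROM t
/*+ OPTIONS('scan.file-creation-time-millis' = '1696000000000') */;
```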