Overview

Apache Paimon (incubating) is a streaming data lake platform that supports high-speed data ingestion, change data tracking, and efficient real-time analytics.

Architecture

As shown in the architecture above:

Read/Write: Paimon supports versatile ways to read and write data and to perform OLAP queries.

For reads, it supports consuming data
- from historical snapshots (in batch mode),
- from the latest offset (in streaming mode), or
- from incremental snapshots (in a hybrid way).
For writes, it supports streaming synchronization from the changelog of databases (CDC) or batch insert/overwrite from offline data.
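
The three read paths can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Paimon's API: `ToyTable`, `commit`, `read_batch`, `read_incremental`, and `read_hybrid` are hypothetical names, and each commit is modeled as a plain list of key/value upserts.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

Change = Tuple[str, int]  # (key, value) upsert; a hypothetical simplification


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table with numbered snapshots."""

    commits: List[List[Change]] = field(default_factory=list)

    def commit(self, changes: List[Change]) -> int:
        """Append one batch of upserts and return the new snapshot id (1-based)."""
        self.commits.append(list(changes))
        return len(self.commits)

    def read_batch(self, snapshot_id: int) -> Dict[str, int]:
        """Batch mode: materialize the table state as of a historical snapshot."""
        state: Dict[str, int] = {}
        for changes in self.commits[:snapshot_id]:
            state.update(changes)  # later commits overwrite earlier values
        return state

    def read_incremental(self, after_snapshot: int) -> Iterator[Change]:
        """Streaming mode: yield only changes committed after the given snapshot."""
        for changes in self.commits[after_snapshot:]:
            yield from changes

    def read_hybrid(self, snapshot_id: int) -> Tuple[Dict[str, int], List[Change]]:
        """Hybrid: the full state at a snapshot, then the changes that follow it."""
        return self.read_batch(snapshot_id), list(self.read_incremental(snapshot_id))


table = ToyTable()
table.commit([("a", 1), ("b", 2)])  # snapshot 1
table.commit([("a", 10)])           # snapshot 2
table.commit([("c", 3)])            # snapshot 3

history = table.read_batch(1)               # state as of snapshot 1
tail = list(table.read_incremental(2))      # only what arrived after snapshot 2
full_then_tail = table.read_hybrid(1)       # snapshot 1, then everything after it
```

In this model a streaming reader that starts "from the latest offset" would call `read_incremental` with the current snapshot id and see only later commits.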

Ecosystem: In addition to Apache Flink, Paimon also supports reads from other computation engines such as Apache Hive, Apache Spark, and Trino.

Internal: Under the hood, Paimon stores columnar files on the filesystem or object store and uses an LSM tree structure to support a large volume of data updates and high-performance queries.
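
To make the LSM idea concrete, here is a minimal sketch of the general technique, not Paimon's implementation. Writes go to an in-memory buffer, the buffer is flushed as an immutable sorted run, reads consult the newest data first, and compaction merges runs so that newer values win. `ToyLsm`, `memtable_limit`, and the tombstone convention are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

TOMBSTONE = None  # assumption: None marks a deleted key in this toy model

Run = List[Tuple[str, Optional[int]]]


class ToyLsm:
    """Toy LSM tree: one in-memory buffer plus a list of sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, Optional[int]] = {}
        self.runs: List[Run] = []  # oldest run first, newest run last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: Optional[int]) -> None:
        """Buffer the write; flush once the buffer reaches its limit."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key: str) -> None:
        """Deletes are writes of a tombstone, resolved later by compaction."""
        self.put(key, TOMBSTONE)

    def flush(self) -> None:
        """Persist the buffer as a new immutable run sorted by key."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[int]:
        """Look in the buffer first, then in runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:  # linear scan keeps the sketch short
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        """Merge all runs into one; newer values win and tombstones are dropped."""
        merged: Dict[str, Optional[int]] = {}
        for run in self.runs:  # oldest to newest, so later runs overwrite
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [live] if live else []


lsm = ToyLsm(memtable_limit=2)
lsm.put("b", 1)
lsm.put("a", 2)   # buffer is full, so it is flushed as the first sorted run
lsm.put("a", 3)   # newer value for "a" shadows the flushed one
lsm.delete("b")   # tombstone; buffer is full again and becomes a second run
lsm.compact()     # the two runs collapse into a single run holding only "a"
```

The design point is that updates never rewrite existing files in place: they are appended as new runs, and the cost of reconciling versions is deferred to reads and compaction.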

Unified Storage

For streaming engines like Apache Flink, there are typically three types of connectors:

- Message queue, such as Apache Kafka: used in both the source and intermediate stages of the pipeline to keep latency within seconds.
- OLAP system, such as ClickHouse: receives processed data in a streaming fashion and serves users' ad-hoc queries.
- Batch storage, such as Apache Hive: supports the various operations of traditional batch processing, including INSERT OVERWRITE.

Paimon provides a table abstraction. It is used in a way that does not differ from a traditional database:

- In batch execution mode, it acts like a Hive table and supports the various operations of batch SQL. Querying it returns the latest snapshot.
- In streaming execution mode, it acts like a message queue. Querying it is like querying a stream changelog from a message queue where historical data never expires.
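
The relationship between the two modes can be sketched as follows: a streaming query sees every changelog row, while a batch query sees the state obtained by replaying those rows. This is an illustrative sketch rather than Paimon code. The row-kind labels follow the `+I`/`-U`/`+U`/`-D` changelog convention used by Flink, and `snapshot_from_changelog` and the sample keys are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (row kind, key, value): +I insert, -U update-before, +U update-after, -D delete
ChangelogRow = Tuple[str, str, int]


def snapshot_from_changelog(rows: Iterable[ChangelogRow]) -> Dict[str, int]:
    """Replay a changelog in order and return the resulting table state."""
    state: Dict[str, int] = {}
    for kind, key, value in rows:
        if kind in ("+I", "+U"):
            state[key] = value      # insert or new version of the row
        elif kind in ("-U", "-D"):
            state.pop(key, None)    # retract the old version or delete the row
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return state


changelog: List[ChangelogRow] = [
    ("+I", "order-1", 10),
    ("+I", "order-2", 5),
    ("-U", "order-1", 10),
    ("+U", "order-1", 12),
    ("-D", "order-2", 5),
]

streaming_view = list(changelog)                  # every change, in commit order
batch_view = snapshot_from_changelog(changelog)   # only the latest state
```

Here `streaming_view` holds all five rows, whereas `batch_view` contains only `order-1` with its updated value.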