Overview

Apache Paimon (incubating) is a streaming data lake platform that supports high-speed data ingestion, change data tracking, and efficient real-time analytics.

Architecture

As shown in the architecture diagram:

Read/Write: Paimon supports versatile ways to read and write data and to perform OLAP queries.

  • For reads, it supports consuming data
    • from historical snapshots (in batch mode),
    • from the latest offset (in streaming mode), or
    • from incremental snapshots (in a hybrid way).
  • For writes, it supports streaming synchronization from the changelog of databases (CDC) or batch insert/overwrite from offline data.
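
The three read modes above can be pictured with a toy in-memory model. This is a conceptual sketch only, not Paimon's API: the `ToyTable` class and its `commit`, `read_batch`, `read_incremental`, and `latest_snapshot_id` methods are hypothetical names, and each snapshot is assumed to be just the list of rows added by one commit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table; not Paimon's API."""

    # commits[i] holds the rows added by snapshot i + 1.
    commits: List[List[str]] = field(default_factory=list)

    def commit(self, rows: List[str]) -> int:
        """Append a new snapshot and return its 1-based id."""
        self.commits.append(list(rows))
        return len(self.commits)

    def latest_snapshot_id(self) -> int:
        """Id of the most recent snapshot (0 if the table is empty)."""
        return len(self.commits)

    def read_batch(self, snapshot_id: int) -> List[str]:
        """Batch mode: every row visible as of the given snapshot."""
        return [row for delta in self.commits[:snapshot_id] for row in delta]

    def read_incremental(self, after_snapshot_id: int) -> List[str]:
        """Streaming mode: only rows committed after the given snapshot."""
        return [row for delta in self.commits[after_snapshot_id:] for row in delta]


table = ToyTable()
table.commit(["a", "b"])  # snapshot 1
table.commit(["c"])       # snapshot 2

# Batch: query a historical snapshot.
historical = table.read_batch(1)

# Streaming: start at the latest offset, then see only newer commits.
offset = table.latest_snapshot_id()
table.commit(["d"])       # snapshot 3
new_rows = table.read_incremental(offset)

# Hybrid: bootstrap from a full snapshot, then continue incrementally.
hybrid = table.read_batch(offset) + table.read_incremental(offset)
```

In this sketch the hybrid read is simply a batch read followed by an incremental read from the same snapshot id, so no row is skipped or read twice.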

Ecosystem: In addition to Apache Flink, Paimon also supports reads from other computation engines such as Apache Hive, Apache Spark, and Trino.

Internal: Under the hood, Paimon stores columnar files on a filesystem or object store and uses an LSM tree structure to support large volumes of data updates and high-performance queries.
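
To give a feel for why an LSM tree suits update-heavy workloads, here is a minimal generic sketch. It is not Paimon's implementation: the `ToyLsmTree` class, its `buffer_limit` parameter, and the flush and compaction policy are all assumptions chosen for illustration. Writes land in an in-memory buffer, full buffers are flushed as immutable sorted runs, reads check the newest data first, and compaction merges runs so that the latest value for each key wins.

```python
class ToyLsmTree:
    """Hypothetical, simplified LSM tree; not Paimon's implementation."""

    def __init__(self, buffer_limit=2):
        self.buffer_limit = buffer_limit
        self.buffer = {}       # in-memory writes (newest data)
        self.sorted_runs = []  # immutable sorted runs, oldest first

    def put(self, key, value):
        """Updates are cheap: they only touch the in-memory buffer."""
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Persist the buffer as a new sorted run and clear it."""
        if self.buffer:
            self.sorted_runs.append(sorted(self.buffer.items()))
            self.buffer = {}

    def get(self, key):
        """Look in the buffer first, then in runs from newest to oldest."""
        if key in self.buffer:
            return self.buffer[key]
        for run in reversed(self.sorted_runs):
            for run_key, run_value in run:  # linear scan keeps the sketch short
                if run_key == key:
                    return run_value
        return None

    def compact(self):
        """Merge all runs into one; newer runs overwrite older values."""
        merged = {}
        for run in self.sorted_runs:  # oldest first, so later updates win
            merged.update(run)
        self.sorted_runs = [sorted(merged.items())] if merged else []


tree = ToyLsmTree(buffer_limit=2)
tree.put("k1", "v1")
tree.put("k2", "v2")   # buffer is full, flushed as the first run
tree.put("k1", "v1b")  # update of an existing key
tree.put("k3", "v3")   # buffer is full again, flushed as the second run
latest_k1 = tree.get("k1")
tree.compact()
```

The point of the design is that an update never rewrites existing files in place; older values are only discarded later, when runs are merged.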

Unified Storage

For streaming engines like Apache Flink, there are typically three types of connectors:

  • Message queue, such as Apache Kafka: used in both the source and intermediate stages of the pipeline to keep latency within seconds.
  • OLAP system, such as ClickHouse: receives processed data in a streaming fashion and serves users' ad-hoc queries.
  • Batch storage, such as Apache Hive: supports the various operations of traditional batch processing, including INSERT OVERWRITE.

Paimon provides a table abstraction that is used in the same way as a table in a traditional database:

  • In batch execution mode, it acts like a Hive table and supports the various operations of batch SQL. Querying it returns the latest snapshot.
  • In streaming execution mode, it acts like a message queue. Querying it is like querying a stream changelog from a message queue in which historical data never expires.
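
The two views of one table can be sketched with plain Python. This is an illustrative model under stated assumptions, not Paimon's API: the `changelog` contents, the `"upsert"` and `"delete"` operation names, and the `batch_view` and `streaming_view` functions are all hypothetical. The batch view folds the changelog into its latest state, while the streaming view replays the retained changes one by one.

```python
# Hypothetical changelog: (operation, key, value) tuples in commit order.
changelog = [
    ("upsert", "order-1", "created"),
    ("upsert", "order-2", "created"),
    ("upsert", "order-1", "paid"),
    ("delete", "order-2", None),
]


def batch_view(log):
    """Batch mode: materialize the latest snapshot from all changes."""
    state = {}
    for op, key, value in log:
        if op == "upsert":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


def streaming_view(log, start=0):
    """Streaming mode: replay changes from a position, like a queue
    whose history is retained rather than expired."""
    yield from log[start:]


snapshot = batch_view(changelog)
events = list(streaming_view(changelog))
recent_events = list(streaming_view(changelog, start=2))
```

Both functions read the same underlying data, which is the sense in which a single table can stand in for both batch storage and a message queue.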
Apache Paimon is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Copyright © 2023 The Apache Software Foundation. Apache Paimon, Paimon, and its feather logo are trademarks of The Apache Software Foundation.