Flink CDC

Flink CDC is a streaming data integration tool for the Flink engine. It allows users to describe their ETL pipeline logic via YAML elegantly and help users automatically generating customized Flink operators and submitting job.

The Paimon Pipeline connector can be used as both the Data Source or the Data Sink of the Flink CDC pipeline. This document describes how to set up the Paimon Pipeline connector as the Data Source. If you are interested in using Paimon as the Data Sink, please refer to Flink CDC's Paimon Pipeline Connector document.

What can the connector do?

Synchronizes data from a Paimon warehouse, database or table to an external system supported by Flink CDC
Synchronizes schema changes
Automatically discovers newly created tables in the source Paimon warehouse.

How to create Pipeline

The pipeline for reading data from Paimon and sink to Doris can be defined as follows:

source:
  type: paimon
  name: Paimon Source
  database: default
  table: test_table
  catalog.properties.metastore: filesystem
  catalog.properties.warehouse: /path/warehouse

sink:
  type: doris
  name: Doris Sink
  fenodes: 127.0.0.1:8030
  username: root
  password: pass

pipeline:
  name: Paimon to Doris Pipeline
  parallelism: 2

Pipeline Connector Options

Key	Default	Type	Description
database	(none)	String	Name of the database to be scanned. By default, all databases will be scanned.
table	(none)	String	Name of the table to be scanned. By default, all tables will be scanned.
table.discovery-interval	1 min	Duration	The discovery interval of new tables. Only effective when database or table is not set.

Catalog Options

Apart from the pipeline connector options described above, in the CDC yaml file you can also configure options that starts with catalog.properties.. For example, catalog.properties.warehouse or catalog.properties.metastore. Such options will have their prefix removed and the rest be regarded as catalog options. Please refer to the Configurations section for catalog options available.

Usage Notes

Data updates for primary key tables (-U, +U) will be replaced with -D and +I.
Does not support dropping tables. If you need to drop a table from the Paimon warehouse, please restart the Flink CDC job after performing the drop operation. When the job restarts, it will stop reading data from the dropped table, and the target table in the external system will remain unchanged from its state before the job was stopped.
Data from the same table will be consumed by the same Flink source subtask. If the amount of data varies significantly across different tables, performance bottlenecks caused by data skew may be observed in Flink CDC jobs.
If the CDC job has consumed up to the latest snapshot of a table and the next snapshot is not available yet, the monitoring and consumption of this table may be temporarily paused until continuous.discovery-interval has passed.

Data Type Mapping

Paimon type	CDC type	NOTE
TINYINT	TINYINT
SMALLINT	SMALLINT
INT	INT
BIGINT	BIGINT
FLOAT	FLOAT
DOUBLE	DOUBLE
DECIMAL(p, s)	DECIMAL(p, s)
BOOLEAN	BOOLEAN
DATE	DATE
TIMESTAMP	TIMESTAMP
TIMESTAMP_LTZ	TIMESTAMP_LTZ
CHAR(n)	CHAR(n)
VARCHAR(n)	VARCHAR(n)

What can the connector do?​

How to create Pipeline​

Pipeline Connector Options​

database

table

table.discovery-interval

Catalog Options​

Usage Notes​

Data Type Mapping​