Postgres CDC

Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.

Prepare CDC Bundled Jar

flink-connector-postgres-cdc-*.jar
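Before submitting a job, it can help to confirm that the connector jar is actually where Flink will look for it. The following is a minimal sketch, not part of Paimon: find_cdc_jar is a hypothetical helper, and the directory you pass (for example <FLINK_HOME>/lib) is an assumption about your deployment.

```shell
# Sketch: check that a Postgres CDC connector jar is present in a directory.
# find_cdc_jar is a hypothetical helper written for this illustration.

# find_cdc_jar DIR: print matching connector jars; return 1 if none are found.
find_cdc_jar() {
    _found=1
    for _jar in "$1"/flink-connector-postgres-cdc-*.jar; do
        # An unmatched glob is left as a literal string, so confirm the file exists.
        if [ -f "$_jar" ]; then
            printf '%s\n' "$_jar"
            _found=0
        fi
    done
    return "$_found"
}

# Usage (assumes FLINK_HOME is set and jars are kept in its lib directory):
# find_cdc_jar "$FLINK_HOME/lib" || echo "Postgres CDC jar not found" >&2
```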

Synchronizing Tables

By using PostgresSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from PostgreSQL into one Paimon table.

To use this feature through flink run, run the following shell command.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.8.2.jar \
    postgres_sync_table \
    --warehouse <warehouse_path> \
    --database <database_name> \
    --table <table_name> \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary_keys>] \
    [--type_mapping <option1,option2...>] \
    [--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
    [--metadata_column <metadata_column>] \
    [--postgres_conf <postgres_cdc_source_conf> [--postgres_conf <postgres_cdc_source_conf> ...]] \
    [--catalog_conf <paimon_catalog_conf> [--catalog_conf <paimon_catalog_conf> ...]] \
    [--table_conf <paimon_table_sink_conf> [--table_conf <paimon_table_sink_conf> ...]]

Configuration	Description
--warehouse	The path to Paimon warehouse.
--database	The database name in Paimon catalog.
--table	The Paimon table name.
--partition_keys	The partition keys for the Paimon table. If there are multiple partition keys, separate them with commas, for example "dt,hh,mm".
--primary_keys	The primary keys for the Paimon table. If there are multiple primary keys, separate them with commas, for example "buyer_id,seller_id".
--type_mapping	Specifies how PostgreSQL data types are mapped to Paimon types. Supported options: "to-string": maps all PostgreSQL types to STRING.
--computed_column	The definitions of computed columns. Each argument field must be a field name of the PostgreSQL table. See here for a complete list of configurations.
--metadata_column	Specifies which metadata columns to include in the output schema of the connector. Metadata columns provide additional information about the source data, for example: --metadata_column table_name,database_name,schema_name,op_ts. See the connector's documentation for a complete list of available metadata.
--postgres_conf	The configuration for Flink CDC Postgres sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name, schema-name, table-name and slot.name are required; all others are optional. See the Flink CDC Postgres connector documentation for a complete list of configurations.
--catalog_conf	The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.
--table_conf	The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.
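Because seven --postgres_conf keys are required, a small pre-flight check can catch a missing one before the job is submitted. The sketch below is illustrative only: missing_postgres_conf and REQUIRED_KEYS are hypothetical names, and the key list is copied from the --postgres_conf row above rather than read from the connector.

```shell
# Keys that the --postgres_conf row above lists as required.
REQUIRED_KEYS="hostname username password database-name schema-name table-name slot.name"

# missing_postgres_conf KEY=VALUE...: print each required key that is absent
# from the arguments; return 1 if any key is missing, 0 otherwise.
missing_postgres_conf() {
    _status=0
    for _key in $REQUIRED_KEYS; do
        _seen=no
        for _pair in "$@"; do
            case "$_pair" in
                # The quoted key is matched literally, followed by "=" and any value.
                "$_key"=*) _seen=yes ;;
            esac
        done
        if [ "$_seen" = no ]; then
            printf '%s\n' "$_key"
            _status=1
        fi
    done
    return "$_status"
}

# Usage: pass the same key=value pairs you plan to give to --postgres_conf.
# missing_postgres_conf hostname=127.0.0.1 username=root password=123456 \
#     database-name=source_db schema-name=public table-name=source_table1
# The call above would print "slot.name" and return 1.
```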

If the Paimon table you specify does not exist, this action will automatically create the table. Its schema will be derived from all specified PostgreSQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified PostgreSQL tables.

Example 1: synchronize tables into one Paimon table

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.8.2.jar \
    postgres_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --postgres_conf hostname=127.0.0.1 \
    --postgres_conf username=root \
    --postgres_conf password=123456 \
    --postgres_conf database-name='source_db' \
    --postgres_conf schema-name='public' \
    --postgres_conf table-name='source_table1|source_table2' \
    --postgres_conf slot.name='paimon_cdc' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

As the example shows, the table-name option of postgres_conf accepts a regular expression, so one job can monitor every table whose name matches it. The schemas of all matched tables are merged into one Paimon table schema.
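To preview which table names a pattern would select, you can try it locally before starting the job. This is a rough sketch under two assumptions: that the pattern must match the whole table name, and that grep -E -x (POSIX extended regular expressions) is a close enough stand-in for the Java regular expressions the connector uses. matches_pattern is a hypothetical helper, and source_table10 and other_table are made-up names.

```shell
# matches_pattern PATTERN NAME: succeed if NAME as a whole matches PATTERN.
# Assumption: whole-name matching, approximated with grep -E -x; Java regex
# syntax differs from POSIX ERE for advanced constructs.
matches_pattern() {
    printf '%s\n' "$2" | grep -E -x -q -- "$1"
}

pattern='source_table1|source_table2'
for name in source_table1 source_table2 source_table10 other_table; do
    if matches_pattern "$pattern" "$name"; then
        echo "$name: captured"
    else
        echo "$name: ignored"
    fi
done
```

Under these assumptions, source_table1 and source_table2 are reported as captured, while source_table10 and other_table are ignored, because the pattern has to cover the entire name.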

Example 2: synchronize shards into one Paimon table

You can also set 'schema-name' to a regular expression to capture multiple schemas. A typical scenario is a table 'source_table' that is sharded across the schemas 'source_schema1', 'source_schema2', ...; in that case you can synchronize the data of every 'source_table' into one Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.8.2.jar \
    postgres_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --postgres_conf hostname=127.0.0.1 \
    --postgres_conf username=root \
    --postgres_conf password=123456 \
    --postgres_conf database-name='source_db' \
    --postgres_conf schema-name='source_schema.+' \
    --postgres_conf table-name='source_table' \
    --postgres_conf slot.name='paimon_cdc' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4