CDC Ingestion #

Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.

MySQL #

Prepare CDC Bundled Jar #

flink-sql-connector-mysql-cdc-*.jar
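The bundled jar is not shipped with Paimon itself, so it has to be on Flink's classpath before the synchronization job is submitted. One common way to do this is to copy it into Flink's lib directory; the version below is only an illustration, so pick the release that matches your Flink and MySQL CDC setup.

# example only: adjust the connector version to your environment
cp /path/to/flink-sql-connector-mysql-cdc-2.3.0.jar <FLINK_HOME>/lib/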

Synchronizing Tables #

By using MySqlSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from MySQL into one Paimon table.

To use this feature through flink run, run the following shell command.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.4.0-incubating.jar \
    mysql-sync-table \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    [--partition-keys <partition-keys>] \
    [--primary-keys <primary-keys>] \
    [--computed-column <'column-name=expr-name(args[, ...])'> [--computed-column ...]] \
    [--mysql-conf <mysql-cdc-source-conf> [--mysql-conf <mysql-cdc-source-conf> ...]] \
    [--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \
    [--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]
Configuration | Description
--warehouse
The path to Paimon warehouse.
--database
The database name in Paimon catalog.
--table
The Paimon table name.
--partition-keys
The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm".
--primary-keys
The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id".
--computed-column
The definitions of computed columns. The argument field must be a field name of the MySQL table. Example definitions are sketched after this configuration list.

Supported expressions are:
  • year(date-column): Extract the year from a DATE, DATETIME or TIMESTAMP. Output is an INT value representing the year.
  • substring(column,beginInclusive): Get column.substring(beginInclusive). Output is a STRING.
  • substring(column,beginInclusive,endExclusive): Get column.substring(beginInclusive,endExclusive). Output is a STRING.
  • truncate(column,width): Truncate the column by width. The output type is the same as the column's.
    • If the column is a STRING, truncate(column,width) will truncate the string to width characters, namely "value.substring(0, width)".
    • If the column is an INT or LONG, truncate(column,width) will truncate the number with the algorithm "v - (((v % W) + W) % W)". The "redundant" compute part keeps the remainder non-negative, so negative values are floored to the lower multiple of width instead of being rounded up.
    • If the column is a DECIMAL, truncate(column,width) will truncate the decimal with the algorithm: let "scaled_W = decimal(W, scale(v))", then return "v - (v % scaled_W)".
--mysql-conf
The configuration for Flink CDC MySQL table sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations; others are optional. See the Flink CDC MySQL connector documentation for a complete list of configurations.
--catalog-conf
The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.
--table-conf
The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.
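As a sketch of the --computed-column syntax, the fragment below derives three extra columns from hypothetical MySQL fields (create_time, name and id); adapt the field and column names to your own table. These options slot into the full mysql-sync-table command, like the example further below.

    --computed-column 'pt_year=year(create_time)' \
    --computed-column 'name_prefix=substring(name,0,4)' \
    --computed-column 'id_bucket=truncate(id,100)' \

For an INT column, truncate(id,100) maps id = 259 to 200 and id = -1 to -100, because -1 - (((-1 % 100) + 100) % 100) = -1 - 99 = -100.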

If the Paimon table you specify does not exist, this action will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified MySQL tables.

This action supports a limited number of schema changes. Currently, the framework cannot drop columns, so DROP operations will be ignored and RENAME will add a new column instead. The currently supported schema changes include:

  • Adding columns.

  • Altering column types. More specifically,

    • altering from a string type (char, varchar, text) to another string type with longer length,
    • altering from a binary type (binary, varbinary, blob) to another binary type with longer length,
    • altering from an integer type (tinyint, smallint, int, bigint) to another integer type with wider range,
    • altering from a floating-point type (float, double) to another floating-point type with wider range,

    are supported.

Example

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.4.0-incubating.jar \
    mysql-sync-table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition-keys pt \
    --primary-keys pt,uid \
    --computed-column '_year=year(age)' \
    --mysql-conf hostname=127.0.0.1 \
    --mysql-conf username=root \
    --mysql-conf password=123456 \
    --mysql-conf database-name=source_db \
    --mysql-conf table-name='source_table_.*' \
    --catalog-conf metastore=hive \
    --catalog-conf uri=thrift://hive-metastore:9083 \
    --table-conf bucket=4 \
    --table-conf changelog-producer=input \
    --table-conf sink.parallelism=4

Synchronizing Databases #

By using MySqlSyncDatabaseAction in a Flink DataStream job or directly through flink run, users can synchronize the whole MySQL database into one Paimon database.

To use this feature through flink run, run the following shell command.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.4.0-incubating.jar \
    mysql-sync-database \
    --warehouse <warehouse-path> \
    --database <database-name> \
    [--ignore-incompatible <true/false>] \
    [--table-prefix <paimon-table-prefix>] \
    [--table-suffix <paimon-table-suffix>] \
    [--including-tables <mysql-table-name|name-regular-expr>] \
    [--excluding-tables <mysql-table-name|name-regular-expr>] \
    [--mysql-conf <mysql-cdc-source-conf> [--mysql-conf <mysql-cdc-source-conf> ...]] \
    [--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> ...]] \
    [--table-conf <paimon-table-sink-conf> [--table-conf <paimon-table-sink-conf> ...]]
Configuration | Description
--warehouse
The path to Paimon warehouse.
--database
The database name in Paimon catalog.
--ignore-incompatible
Defaults to false, in which case an exception will be thrown if a MySQL table name already exists in Paimon and their schemas are incompatible. You can explicitly set it to true to ignore incompatible tables and the exception.
--table-prefix
The prefix of all Paimon tables to be synchronized. For example, if you want all synchronized tables to have "ods_" as prefix, you can specify "--table-prefix ods_".
--table-suffix
The suffix of all Paimon tables to be synchronized. The usage is the same as "--table-prefix".
--including-tables
It is used to specify which source tables are to be synchronized. You must use '|' to separate multiple tables, and because '|' is a special character the whole value needs to be quoted, for example: 'a|b|c'. Regular expressions are supported; for example, specifying "--including-tables test|paimon.*" means to synchronize the table 'test' and all tables whose names start with 'paimon'. An example fragment follows this configuration list.
--excluding-tables
It is used to specify which source tables are not to be synchronized. The usage is the same as "--including-tables". "--excluding-tables" has higher priority than "--including-tables" if you specify both.
--mysql-conf
The configuration for Flink CDC MySQL table sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations; others are optional. See the Flink CDC MySQL connector documentation for a complete list of configurations.
--catalog-conf
The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.
--table-conf
The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.
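To illustrate the table-selection and naming options (with hypothetical source table names), the fragment below synchronizes every table whose name starts with 'ord_' plus the 'customers' table, skips the temporary ones, and prefixes the resulting Paimon tables with 'ods_'; these options slot into the full mysql-sync-database command shown further below.

    --table-prefix ods_ \
    --including-tables 'ord_.*|customers' \
    --excluding-tables 'ord_tmp_.*' \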

Only tables with primary keys will be synchronized.

For each MySQL table to be synchronized, if the corresponding Paimon table does not exist, this action will automatically create the table. Its schema will be derived from the corresponding MySQL table. If the Paimon table already exists, its schema will be compared against the schema of the corresponding MySQL table.

This action supports a limited number of schema changes. Currently, the framework cannot drop columns, so DROP operations will be ignored and RENAME will add a new column instead. The currently supported schema changes include:

  • Adding columns.

  • Altering column types. More specifically,

    • altering from a string type (char, varchar, text) to another string type with longer length,
    • altering from a binary type (binary, varbinary, blob) to another binary type with longer length,
    • altering from an integer type (tinyint, smallint, int, bigint) to another integer type with wider range,
    • altering from a floating-point type (float, double) to another floating-point type with wider range,

    are supported.

Example

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.4.0-incubating.jar \
    mysql-sync-database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql-conf hostname=127.0.0.1 \
    --mysql-conf username=root \
    --mysql-conf password=123456 \
    --mysql-conf database-name=source_db \
    --catalog-conf metastore=hive \
    --catalog-conf uri=thrift://hive-metastore:9083 \
    --table-conf bucket=4 \
    --table-conf changelog-producer=input \
    --table-conf sink.parallelism=4