MySQL CDC #

Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.

Prepare CDC Bundled Jar #

Download CDC Bundled Jar and put them under <FLINK_HOME>/lib/.

Version	Bundled Jar
2.3.x	Only supported in versions below 0.8.2 flink-sql-connector-mysql-cdc-2.3.x.jar
2.4.x	Only supported in versions below 0.8.2 flink-sql-connector-mysql-cdc-2.4.x.jar
3.0.x	Only supported in versions below 0.8.2 flink-sql-connector-mysql-cdc-3.0.x.jar flink-cdc-common-3.0.x.jar
3.1.x	flink-sql-connector-mysql-cdc-3.1.x.jar mysql-connector-java-8.0.27.jar

Synchronizing Tables #

By using MySqlSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from MySQL into one Paimon table.

To use this feature through flink run, run the following shell command.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_table
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary-keys>] \
    [--type_mapping <option1,option2...>] \
    [--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
    [--metadata_column <metadata-column>] \
    [--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]

Configuration	Description
--warehouse	The path to Paimon warehouse.
--database	The database name in Paimon catalog.
--table	The Paimon table name.
--partition_keys	The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm".
--primary_keys	The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id".
--type_mapping	It is used to specify how to map MySQL data type to Paimon type. Supported options: "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN. "to-nullable": ignores all NOT NULL constraints (except for primary keys). This is used to solve the problem that Flink cannot accept the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation. "to-string": maps all MySQL types to STRING. "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING. "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES. "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, SERIAL to BIGINT. You should ensure overflow won't occur when using this option.
--computed_column	The definitions of computed columns. The argument field is from MySQL table field name. See here for a complete list of configurations.
--metadata_column	--metadata_column is used to specify which metadata columns to include in the output schema of the connector. Metadata columns provide additional information related to the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata.
--mysql_conf	The configuration for Flink CDC MySQL sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations, others are optional. See its document for a complete list of configurations.
--catalog_conf	The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.
--table_conf	The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.

If the Paimon table you specify does not exist, this action will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified MySQL tables.

Example 1: synchronize tables into one Paimon table

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='source_db' \
    --mysql_conf table-name='source_table1|source_table2' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

As example shows, the mysql_conf’s table-name supports regular expressions to monitor multiple tables that satisfy the regular expressions. The schemas of all the tables will be merged into one Paimon table schema.

Example 2: synchronize shards into one Paimon table

You can also set ‘database-name’ with a regular expression to capture multiple databases. A typical scenario is that a table ‘source_table’ is split into database ‘source_db1’, ‘source_db2’ …, then you can synchronize data of all the ‘source_table’s into one Paimon table.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_table \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --table test_table \
    --partition_keys pt \
    --primary_keys pt,uid \
    --computed_column '_year=year(age)' \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='source_db.+' \
    --mysql_conf table-name='source_table' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Synchronizing Databases #

By using MySqlSyncDatabaseAction in a Flink DataStream job or directly through flink run, users can synchronize the whole MySQL database into one Paimon database.

To use this feature through flink run, run the following shell command.

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_database
    --warehouse <warehouse-path> \
    --database <database-name> \
    [--ignore_incompatible <true/false>] \
    [--merge_shards <true/false>] \
    [--table_prefix <paimon-table-prefix>] \
    [--table_suffix <paimon-table-suffix>] \
    [--including_tables <mysql-table-name|name-regular-expr>] \
    [--excluding_tables <mysql-table-name|name-regular-expr>] \
    [--mode <sync-mode>] \
    [--metadata_column <metadata-column>] \
    [--type_mapping <option1,option2...>] \
    [--partition_keys <partition_keys>] \
    [--primary_keys <primary-keys>] \
    [--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
    [--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]

Configuration	Description
--warehouse	The path to Paimon warehouse.
--database	The database name in Paimon catalog.
--ignore_incompatible	It is default false, in this case, if MySQL table name exists in Paimon and their schema is incompatible,an exception will be thrown. You can specify it to true explicitly to ignore the incompatible tables and exception.
--merge_shards	It is default true, in this case, if some tables in different databases have the same name, their schemas will be merged and their records will be synchronized into one Paimon table. Otherwise, each table's records will be synchronized to a corresponding Paimon table, and the Paimon table will be named to 'databaseName_tableName' to avoid potential name conflict.
--table_prefix	The prefix of all Paimon tables to be synchronized. For example, if you want all synchronized tables to have "ods_" as prefix, you can specify "--table_prefix ods_".
--table_suffix	The suffix of all Paimon tables to be synchronized. The usage is same as "--table_prefix".
--including_tables	It is used to specify which source tables are to be synchronized. You must use '\|' to separate multiple tables.Because '\|' is a special character, a comma is required, for example: 'a\|b\|c'.Regular expression is supported, for example, specifying "--including_tables test\|paimon.*" means to synchronize table 'test' and all tables start with 'paimon'.
--excluding_tables	It is used to specify which source tables are not to be synchronized. The usage is same as "--including_tables". "--excluding_tables" has higher priority than "--including_tables" if you specified both.
--mode	It is used to specify synchronization mode. Possible values: "divided" (the default mode if you haven't specified one): start a sink for each table, the synchronization of the new table requires restarting the job. "combined": start a single combined sink for all tables, the new table will be automatically synchronized.
--metadata_column	--metadata_column is used to specify which metadata columns to include in the output schema of the connector. Metadata columns provide additional information related to the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata.
--type_mapping	It is used to specify how to map MySQL data type to Paimon type. Supported options: "tinyint1-not-bool": maps MySQL TINYINT(1) to TINYINT instead of BOOLEAN. "to-nullable": ignores all NOT NULL constraints (except for primary keys). This is used to solve the problem that Flink cannot accept the MySQL 'ALTER TABLE ADD COLUMN column type NOT NULL DEFAULT x' operation. "to-string": maps all MySQL types to STRING. "char-to-string": maps MySQL CHAR(length)/VARCHAR(length) types to STRING. "longtext-to-bytes": maps MySQL LONGTEXT types to BYTES. "bigint-unsigned-to-bigint": maps MySQL BIGINT UNSIGNED, BIGINT UNSIGNED ZEROFILL, SERIAL to BIGINT. You should ensure overflow won't occur when using this option.
--partition_keys	The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm". If the keys are not in source table, the sink table won't set partition keys.
--primary_keys	The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id". If the keys are not in source table, but the source table has primary keys, the sink table will use source table's primary keys. Otherwise, the sink table won't set primary keys.
--mysql_conf	The configuration for Flink CDC MySQL sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations, others are optional. See its document for a complete list of configurations.
--catalog_conf	The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations.
--table_conf	The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations.

Only tables with primary keys will be synchronized.

For each MySQL table to be synchronized, if the corresponding Paimon table does not exist, this action will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified MySQL tables.

Example 1: synchronize entire database

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4

Example 2: synchronize newly added tables under database

Let’s say at first a Flink job is synchronizing tables [product, user, address] under database source_db. The command to submit the job looks like:

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4 \
    --including_tables 'product|user|address'

At a later point we would like the job to also synchronize tables [order, custom], which contains history data. We can achieve this by recovering from the previous snapshot of the job and thus reusing existing state of the job. The recovered job will first snapshot newly added tables, and then continue reading changelog from previous position automatically.

The command to recover from previous snapshot and add new tables to synchronize looks like:

<FLINK_HOME>/bin/flink run \
    --fromSavepoint savepointPath \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --including_tables 'product|user|address|order|custom'

You can set --mode combined to enable synchronizing newly added tables without restarting job.

Example 3: synchronize and merge multiple shards

Let’s say you have multiple database shards db1, db2, … and each database has tables tbl1, tbl2, …. You can synchronize all the db.+.tbl.+ into tables test_db.tbl1, test_db.tbl2 … by following command:

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.9.0.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='db.+' \
    --catalog_conf metastore=hive \
    --catalog_conf uri=thrift://hive-metastore:9083 \
    --table_conf bucket=4 \
    --table_conf changelog-producer=input \
    --table_conf sink.parallelism=4 \
    --including_tables 'tbl.+'

By setting database-name to a regular expression, the synchronization job will capture all tables under matched databases and merge tables of the same name into one table.

You can set --merge_shards false to prevent merging shards. The synchronized tables will be named to ‘databaseName_tableName’ to avoid potential name conflict.

FAQ #

Chinese characters in records ingested from MySQL are garbled.

Try to set env.java.opts: -Dfile.encoding=UTF-8 in flink-conf.yaml (the option is changed to env.java.opts.all since Flink-1.17).