This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.
MySQL CDC #
Paimon supports synchronizing changes from different databases using change data capture (CDC). This feature requires Flink and its CDC connectors.
Prepare CDC Bundled Jar #
Download the CDC Bundled Jar and put it under <FLINK_HOME>/lib/.
Version | Bundled Jar |
---|---|
2.3.x | Only supported in versions below 0.8.2 |
2.4.x | Only supported in versions below 0.8.2 |
3.0.x | Only supported in versions below 0.8.2 |
3.1.x | |
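For example, after downloading the bundled jar that matches your Flink CDC version, copy it into Flink's lib directory and restart the cluster so it is loaded (the jar file name below is a placeholder; use the name of the jar you actually downloaded):

```shell
# Copy the CDC bundled jar onto the Flink classpath.
# "paimon-flink-cdc-bundle.jar" is a placeholder name.
cp /path/to/paimon-flink-cdc-bundle.jar "$FLINK_HOME/lib/"

# Restart the standalone cluster so the new jar is picked up.
"$FLINK_HOME/bin/stop-cluster.sh"
"$FLINK_HOME/bin/start-cluster.sh"
```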
Synchronizing Tables #
By using MySqlSyncTableAction in a Flink DataStream job or directly through flink run, users can synchronize one or multiple tables from MySQL into one Paimon table.
To use this feature through flink run, run the following shell command.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_table \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
[--partition_keys <partition_keys>] \
[--primary_keys <primary-keys>] \
[--type_mapping <option1,option2...>] \
[--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
[--metadata_column <metadata-column>] \
[--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
[--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
[--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description |
---|---|
--warehouse | The path to Paimon warehouse. |
--database | The database name in Paimon catalog. |
--table | The Paimon table name. |
--partition_keys | The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm". |
--primary_keys | The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id". |
--type_mapping | It is used to specify how to map MySQL data types to Paimon types. See its document for the supported options. |
--computed_column | The definitions of computed columns. The argument field comes from the MySQL table field name. See here for a complete list of configurations. |
--metadata_column | It is used to specify which metadata columns to include in the output schema of the connector. Metadata columns provide additional information related to the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata. |
--mysql_conf | The configuration for Flink CDC MySQL sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations, others are optional. See its document for a complete list of configurations. |
--catalog_conf | The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations. |
--table_conf | The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations. |
If the Paimon table you specify does not exist, this action will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified MySQL tables.
Example 1: synchronize tables into one Paimon table
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_table \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--table test_table \
--partition_keys pt \
--primary_keys pt,uid \
--computed_column '_year=year(age)' \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name='source_db' \
--mysql_conf table-name='source_table1|source_table2' \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
As the example shows, the mysql_conf's table-name supports regular expressions to monitor multiple tables that satisfy the regular expressions. The schemas of all the tables will be merged into one Paimon table schema.
Example 2: synchronize shards into one Paimon table
You can also set ‘database-name’ with a regular expression to capture multiple databases. A typical scenario is that a table ‘source_table’ is split into database ‘source_db1’, ‘source_db2’ …, then you can synchronize data of all the ‘source_table’s into one Paimon table.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_table \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--table test_table \
--partition_keys pt \
--primary_keys pt,uid \
--computed_column '_year=year(age)' \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name='source_db.+' \
--mysql_conf table-name='source_table' \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
Synchronizing Databases #
By using MySqlSyncDatabaseAction in a Flink DataStream job or directly through flink run, users can synchronize the whole MySQL database into one Paimon database.
To use this feature through flink run, run the following shell command.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_database \
--warehouse <warehouse-path> \
--database <database-name> \
[--ignore_incompatible <true/false>] \
[--merge_shards <true/false>] \
[--table_prefix <paimon-table-prefix>] \
[--table_suffix <paimon-table-suffix>] \
[--including_tables <mysql-table-name|name-regular-expr>] \
[--excluding_tables <mysql-table-name|name-regular-expr>] \
[--mode <sync-mode>] \
[--metadata_column <metadata-column>] \
[--type_mapping <option1,option2...>] \
[--partition_keys <partition_keys>] \
[--primary_keys <primary-keys>] \
[--mysql_conf <mysql-cdc-source-conf> [--mysql_conf <mysql-cdc-source-conf> ...]] \
[--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
[--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description |
---|---|
--warehouse | The path to Paimon warehouse. |
--database | The database name in Paimon catalog. |
--ignore_incompatible | Default is false. In this case, if a MySQL table name exists in Paimon and their schemas are incompatible, an exception will be thrown. You can set it to true explicitly to ignore the incompatible tables and the exception. |
--merge_shards | Default is true. In this case, if some tables in different databases have the same name, their schemas will be merged and their records will be synchronized into one Paimon table. Otherwise, each table's records will be synchronized to a corresponding Paimon table, and the Paimon table will be named 'databaseName_tableName' to avoid potential name conflicts. |
--table_prefix | The prefix of all Paimon tables to be synchronized. For example, if you want all synchronized tables to have "ods_" as prefix, you can specify "--table_prefix ods_". |
--table_suffix | The suffix of all Paimon tables to be synchronized. The usage is the same as "--table_prefix". |
--including_tables | It is used to specify which source tables are to be synchronized. You must use '|' to separate multiple tables. Because '|' is a special character, quoting is required, for example: 'a|b|c'. Regular expressions are supported; for example, specifying "--including_tables test|paimon.*" means to synchronize the table 'test' and all tables starting with 'paimon'. |
--excluding_tables | It is used to specify which source tables are not to be synchronized. The usage is the same as "--including_tables". "--excluding_tables" has higher priority than "--including_tables" if you specify both. |
--mode | It is used to specify the synchronization mode. For example, setting "--mode combined" enables synchronizing newly added tables without restarting the job. |
--metadata_column | It is used to specify which metadata columns to include in the output schema of the connector. Metadata columns provide additional information related to the source data, for example: --metadata_column table_name,database_name,op_ts. See its document for a complete list of available metadata. |
--type_mapping | It is used to specify how to map MySQL data types to Paimon types. See its document for the supported options. |
--partition_keys | The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm". If the keys are not in the source table, the sink table won't set partition keys. |
--primary_keys | The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id". If the keys are not in the source table but the source table has primary keys, the sink table will use the source table's primary keys. Otherwise, the sink table won't set primary keys. |
--mysql_conf | The configuration for Flink CDC MySQL sources. Each configuration should be specified in the format "key=value". hostname, username, password, database-name and table-name are required configurations, others are optional. See its document for a complete list of configurations. |
--catalog_conf | The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations. |
--table_conf | The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations. |
Only tables with primary keys will be synchronized.
For each MySQL table to be synchronized, if the corresponding Paimon table does not exist, this action will automatically create the table. Its schema will be derived from all specified MySQL tables. If the Paimon table already exists, its schema will be compared against the schema of all specified MySQL tables.
Example 1: synchronize entire database
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name=source_db \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
Example 2: synchronize newly added tables under database
Let’s say at first a Flink job is synchronizing tables [product, user, address] under database source_db. The command to submit the job looks like:
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name=source_db \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4 \
--including_tables 'product|user|address'
At a later point we would like the job to also synchronize the tables [order, custom], which contain history data. We can achieve this by recovering from a previous snapshot of the job, thus reusing its existing state. The recovered job will first snapshot the newly added tables, and then automatically continue reading the changelog from the previous position.
The command to recover from previous snapshot and add new tables to synchronize looks like:
<FLINK_HOME>/bin/flink run \
--fromSavepoint savepointPath \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name=source_db \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--including_tables 'product|user|address|order|custom'
You can set --mode combined to enable synchronizing newly added tables without restarting the job.
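A sketch of such a submission, reusing the hypothetical connection settings from the examples above:

```shell
# Combined mode: newly created source tables that match
# --including_tables are picked up without restarting the job.
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --mode combined \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name=source_db \
    --including_tables 'product|user|address|order|custom'
```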
Example 3: synchronize and merge multiple shards
Let’s say you have multiple database shards db1, db2, …, and each database has tables tbl1, tbl2, …. You can synchronize all the db.+.tbl.+ tables into tables test_db.tbl1, test_db.tbl2, … with the following command:
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
mysql_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--mysql_conf hostname=127.0.0.1 \
--mysql_conf username=root \
--mysql_conf password=123456 \
--mysql_conf database-name='db.+' \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4 \
--including_tables 'tbl.+'
By setting database-name to a regular expression, the synchronization job will capture all tables under matched databases and merge tables of the same name into one table.
You can set --merge_shards false to prevent merging shards. The synchronized tables will be named 'databaseName_tableName' to avoid potential name conflicts.
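For instance, appending --merge_shards false to the command in Example 3 (a sketch; the connection settings are the same hypothetical ones) would produce tables such as test_db.db1_tbl1 and test_db.db2_tbl1 instead of one merged test_db.tbl1:

```shell
# With --merge_shards false, each source table gets its own
# Paimon table named databaseName_tableName.
<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-1.0-SNAPSHOT.jar \
    mysql_sync_database \
    --warehouse hdfs:///path/to/warehouse \
    --database test_db \
    --merge_shards false \
    --mysql_conf hostname=127.0.0.1 \
    --mysql_conf username=root \
    --mysql_conf password=123456 \
    --mysql_conf database-name='db.+' \
    --including_tables 'tbl.+'
```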
FAQ #
- Chinese characters in records ingested from MySQL are garbled.
- Try to set env.java.opts: -Dfile.encoding=UTF-8 in flink-conf.yaml (the option is changed to env.java.opts.all since Flink 1.17).
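For reference, the corresponding flink-conf.yaml fragment might look like this (which key applies depends on your Flink version):

```yaml
# Flink 1.17 and later
env.java.opts.all: -Dfile.encoding=UTF-8

# Flink versions before 1.17 use instead:
# env.java.opts: -Dfile.encoding=UTF-8
```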