Kafka CDC #
Prepare Kafka Bundled Jar #
flink-sql-connector-kafka-*.jar
Supported Formats #
Flink provides several Kafka CDC formats: Canal Json, Debezium Json, Debezium Avro, Ogg Json, Maxwell Json and Normal Json. If a message in a Kafka topic is a change event captured from another database using the Change Data Capture (CDC) tool, then you can use the Paimon Kafka CDC. Write the INSERT, UPDATE, DELETE messages parsed into the paimon table.
Formats | Supported |
---|---|
Canal CDC | True |
Debezium CDC | True |
Maxwell CDC | True |
OGG CDC | True |
JSON | True |
aws-dms-json | True |
The JSON sources possibly missing some information. For example, Ogg and Maxwell format standards don’t contain field types; When you write JSON sources into Flink Kafka sink, it will only reserve data and row type and drop other information. The synchronization job will try best to handle the problem as follows:
- Usually, debezium-json contains ‘schema’ field, from which Paimon will retrieve data types. Make sure your debezium json has this field, or Paimon will use ‘STRING’ type.
- If missing field types, Paimon will use ‘STRING’ type as default.
- If missing database name or table name, you cannot do database synchronization, but you can still do table synchronization.
- If missing primary keys, the job might create non primary key table. You can set primary keys when submit job in table synchronization.
Synchronizing Tables #
By using KafkaSyncTableAction in a Flink DataStream job or directly through flink run
, users can synchronize one or multiple tables from Kafka’s one topic into one Paimon table.
To use this feature through flink run
, run the following shell command.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_table
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
[--partition_keys <partition_keys>] \
[--primary_keys <primary-keys>] \
[--type_mapping to-string] \
[--computed_column <'column-name=expr-name(args[, ...])'> [--computed_column ...]] \
[--kafka_conf <kafka-source-conf> [--kafka_conf <kafka-source-conf> ...]] \
[--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
[--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description |
---|---|
--warehouse |
The path to Paimon warehouse. |
--database |
The database name in Paimon catalog. |
--table |
The Paimon table name. |
--partition_keys |
The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm". |
--primary_keys |
The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id". |
--type_mapping |
It is used to specify how to map MySQL data type to Paimon type. Supported options:
|
--computed_column |
The definitions of computed columns. The argument field is from Kafka topic's table field name. See here for a complete list of configurations. |
--kafka_conf |
The configuration for Flink Kafka sources. Each configuration should be specified in the format `key=value`. `properties.bootstrap.servers`, `topic/topic-pattern`, `properties.group.id`, and `value.format` are required configurations, others are optional.See its document for a complete list of configurations. |
--catalog_conf |
The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations. |
--table_conf |
The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations. |
If the Paimon table you specify does not exist, this action will automatically create the table. Its schema will be derived from all specified Kafka topic’s tables,it gets the earliest non-DDL data parsing schema from topic. If the Paimon table already exists, its schema will be compared against the schema of all specified Kafka topic’s tables.
Example 1:
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_table \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--table test_table \
--partition_keys pt \
--primary_keys pt,uid \
--computed_column '_year=year(age)' \
--kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
--kafka_conf topic=order \
--kafka_conf properties.group.id=123456 \
--kafka_conf value.format=canal-json \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
If the kafka topic doesn’t contain message when you start the synchronization job, you must manually create the table before submitting the job. You can define the partition keys and primary keys only, and the left columns will be added by the synchronization job.
NOTE: In this case you shouldn’t use –partition_keys or –primary_keys, because those keys are defined when creating the table and can not be modified. Additionally, if you specified computed columns, you should also define all the argument columns used for computed columns.
Example 2: If you want to synchronize a table which has primary key ‘id INT’, and you want to compute a partition key ‘part=date_format(create_time,yyyy-MM-dd)’, you can create a such table first (the other columns can be omitted):
CREATE TABLE test_db.test_table (
id INT, -- primary key
create_time TIMESTAMP, -- the argument of computed column part
part STRING, -- partition key
PRIMARY KEY (id, part) NOT ENFORCED
) PARTITIONED BY (part);
Then you can submit synchronization job:
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_table \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--table test_table \
--computed_column 'part=date_format(create_time,yyyy-MM-dd)' \
... (other conf)
Example 3: For some append data (such as log data), it can be treated as special CDC data with only INSERT operation type, so you can use ‘format=json’ to synchronize such data to the Paimon table.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_table \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--table test_table \
--partition_keys pt \
--computed_column 'pt=date_format(event_tm, yyyyMMdd)' \
--kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
--kafka_conf topic=test_log \
--kafka_conf properties.group.id=123456 \
--kafka_conf value.format=json \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf sink.parallelism=4
Synchronizing Databases #
By using KafkaSyncDatabaseAction in a Flink DataStream job or directly through flink run
, users can synchronize the multi topic or one topic into one Paimon database.
To use this feature through flink run
, run the following shell command.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_database
--warehouse <warehouse-path> \
--database <database-name> \
[--table_mapping <table-name>=<paimon-table-name>] \
[--table_prefix <paimon-table-prefix>] \
[--table_suffix <paimon-table-suffix>] \
[--table_prefix_db <paimon-table-prefix-by-db>] \
[--table_suffix_db <paimon-table-suffix-by-db>] \
[--including_tables <table-name|name-regular-expr>] \
[--excluding_tables <table-name|name-regular-expr>] \
[--including_dbs <database-name|name-regular-expr>] \
[--excluding_dbs <database-name|name-regular-expr>] \
[--type_mapping to-string] \
[--partition_keys <partition_keys>] \
[--primary_keys <primary-keys>] \
[--kafka_conf <kafka-source-conf> [--kafka_conf <kafka-source-conf> ...]] \
[--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]] \
[--table_conf <paimon-table-sink-conf> [--table_conf <paimon-table-sink-conf> ...]]
Configuration | Description |
---|---|
--warehouse |
The path to Paimon warehouse. |
--database |
The database name in Paimon catalog. |
--ignore_incompatible |
It is default false, in this case, if MySQL table name exists in Paimon and their schema is incompatible,an exception will be thrown. You can specify it to true explicitly to ignore the incompatible tables and exception. |
--table_mapping |
The table name mapping between source database and Paimon. For example, if you want to synchronize a source table named "test" to a Paimon table named "paimon_test", you can specify "--table_mapping test=paimon_test". Multiple mappings could be specified with multiple "--table_mapping" options. "--table_mapping" has higher priority than "--table_prefix" and "--table_suffix". |
--table_prefix |
The prefix of all Paimon tables to be synchronized except those specified by "--table_mapping" or "--table_prefix_db". For example, if you want all synchronized tables to have "ods_" as prefix, you can specify "--table_prefix ods_". |
--table_suffix |
The suffix of all Paimon tables to be synchronized except those specified by "--table_mapping" or "--table_suffix_db". The usage is same as "--table_prefix". |
--table_prefix_db |
The prefix of the Paimon tables to be synchronized from the specified db. For example, if you want to prefix the tables from db1 with "ods_db1_", you can specify "--table_prefix_db db1=ods_db1_". "--table_prefix_db" has higher priority than "--table_prefix". |
--table_suffix_db |
The suffix of the Paimon tables to be synchronized from the specified db. The usage is same as "--table_prefix_db". |
--including_tables |
It is used to specify which source tables are to be synchronized. You must use '|' to separate multiple tables.Because '|' is a special character, a comma is required, for example: 'a|b|c'.Regular expression is supported, for example, specifying "--including_tables test|paimon.*" means to synchronize table 'test' and all tables start with 'paimon'. |
--excluding_tables |
It is used to specify which source tables are not to be synchronized. The usage is same as "--including_tables". "--excluding_tables" has higher priority than "--including_tables" if you specified both. |
--including_dbs |
It is used to specify the databases within which the tables are to be synchronized. The usage is same as "--including_tables". |
--excluding_dbs |
It is used to specify the databases within which the tables are not to be synchronized. The usage is same as "--excluding_tables". "--excluding_dbs" has higher priority than "--including_dbs" if you specified both. |
--type_mapping |
It is used to specify how to map MySQL data type to Paimon type. Supported options:
|
--partition_keys |
The partition keys for Paimon table. If there are multiple partition keys, connect them with comma, for example "dt,hh,mm". If the keys are not in source table, the sink table won't set partition keys. |
--multiple_table_partition_keys |
The partition keys for each different Paimon table. If there are multiple partition keys, connect them with comma, for example
|
--primary_keys |
The primary keys for Paimon table. If there are multiple primary keys, connect them with comma, for example "buyer_id,seller_id". If the keys are not in source table, but the source table has primary keys, the sink table will use source table's primary keys. Otherwise, the sink table won't set primary keys. |
--kafka_conf |
The configuration for Flink Kafka sources. Each configuration should be specified in the format `key=value`. `properties.bootstrap.servers`, `topic/topic-pattern`, `properties.group.id`, and `value.format` are required configurations, others are optional.See its document for a complete list of configurations. |
--catalog_conf |
The configuration for Paimon catalog. Each configuration should be specified in the format "key=value". See here for a complete list of catalog configurations. |
--table_conf |
The configuration for Paimon table sink. Each configuration should be specified in the format "key=value". See here for a complete list of table configurations. |
This action will build a single combined sink for all tables. For each Kafka topic’s table to be synchronized, if the corresponding Paimon table does not exist, this action will automatically create the table, and its schema will be derived from all specified Kafka topic’s tables. If the Paimon table already exists and its schema is different from that parsed from Kafka record, this action will try to preform schema evolution.
Example
Synchronization from one Kafka topic to Paimon database.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
--kafka_conf topic=order \
--kafka_conf properties.group.id=123456 \
--kafka_conf value.format=canal-json \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
Synchronization from multiple Kafka topics to Paimon database.
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-1.0.0.jar \
kafka_sync_database \
--warehouse hdfs:///path/to/warehouse \
--database test_db \
--kafka_conf properties.bootstrap.servers=127.0.0.1:9020 \
--kafka_conf topic=order\;logistic_order\;user \
--kafka_conf properties.group.id=123456 \
--kafka_conf value.format=canal-json \
--catalog_conf metastore=hive \
--catalog_conf uri=thrift://hive-metastore:9083 \
--table_conf bucket=4 \
--table_conf changelog-producer=input \
--table_conf sink.parallelism=4
Additional kafka_config #
There are some useful options to build Flink Kafka Source, but they are not provided by flink-kafka-connector document. They are:
Key | Default | Type | Description |
---|---|---|---|
schema.registry.url | (none) | String | When configuring "value.format=debezium-avro" which requires using the Confluence schema registry model for Apache Avro serialization, you need to provide the schema registry URL. |