This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

SQL Write #

Insert Table #

The INSERT statement inserts new rows into a table or overwrites the existing data in the table. The inserted rows can be specified by value expressions or result from a query.

Syntax

INSERT { INTO | OVERWRITE } table_identifier [ part_spec ] [ column_list ] { value_expr | query };

Parameters

table_identifier: Specifies a table name, which may be optionally qualified with a database name.
part_spec: An optional parameter that specifies a comma-separated list of key and value pairs for partitions.
column_list: An optional parameter that specifies a comma-separated list of columns belonging to the table_identifier table. Spark will reorder the columns of the input query to match the table schema according to the specified column list.

Note: Since Spark 3.4, INSERT INTO commands with an explicit column list comprising fewer columns than the target table automatically add the corresponding default values for the remaining columns (or NULL for any column lacking an explicitly assigned default value). In Spark 3.3 or earlier, the number of columns in column_list must equal the number of columns in the target table; otherwise these commands fail.
value_expr ( { value | NULL } [ , … ] ) [ , ( … ) ]: Specifies the values to be inserted. Either an explicitly specified value or a NULL can be inserted. A comma must be used to separate each value in the clause. More than one set of values can be specified to insert multiple rows.

For more information, please check the syntax document: Spark INSERT Statement
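The parameters above can be illustrated with a short sketch. The table name orders and its columns are hypothetical and used only for illustration; the last statement assumes Spark 3.4 or later, as described in the note above.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE orders (id INT, status STRING, amount DOUBLE);

-- value_expr: several rows in one statement, including an explicit NULL.
INSERT INTO orders VALUES (1, 'new', 10.0), (2, NULL, 20.5);

-- column_list: columns given in a different order than the table schema;
-- Spark reorders the input to match the schema.
INSERT INTO orders (amount, id, status) VALUES (30.0, 3, 'paid');

-- Spark 3.4+ only: fewer columns than the table. The omitted column (status)
-- receives its default value, or NULL if it has no default.
INSERT INTO orders (id, amount) VALUES (4, 40.0);
```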

Insert Into #

Use INSERT INTO to apply records and changes to tables.

INSERT INTO my_table SELECT ...

Insert Overwrite #

Use INSERT OVERWRITE to overwrite the whole table.

INSERT OVERWRITE my_table SELECT ...

Insert Overwrite Partition #

Use INSERT OVERWRITE to overwrite a partition.

INSERT OVERWRITE my_table PARTITION (key1 = value1, key2 = value2, ...) SELECT ...

Dynamic Overwrite Partition #

Spark’s default overwrite mode is static partition overwrite. To enable dynamic partition overwrite, set the Spark session configuration spark.sql.sources.partitionOverwriteMode to dynamic.

For example:

CREATE TABLE my_table (id INT, pt STRING) PARTITIONED BY (pt);
INSERT INTO my_table VALUES (1, 'p1'), (2, 'p2');

-- Static overwrite (Overwrite the whole table)
INSERT OVERWRITE my_table VALUES (3, 'p1');
-- or 
INSERT OVERWRITE my_table PARTITION (pt) VALUES (3, 'p1');

SELECT * FROM my_table;
/*
+---+---+
| id| pt|
+---+---+
|  3| p1|
+---+---+
*/

-- Static overwrite with specified partitions (Only overwrite pt='p1')
INSERT OVERWRITE my_table PARTITION (pt='p1') VALUES (3);

SELECT * FROM my_table;
/*
+---+---+
| id| pt|
+---+---+
|  2| p2|
|  3| p1|
+---+---+
*/
  
-- Dynamic overwrite (Only overwrite pt='p1')
SET spark.sql.sources.partitionOverwriteMode=dynamic;
INSERT OVERWRITE my_table VALUES (3, 'p1');

SELECT * FROM my_table;
/*
+---+---+
| id| pt|
+---+---+
|  2| p2|
|  3| p1|
+---+---+
*/
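As a further sketch continuing from the table state above, the statements below show how the two modes differ when the incoming rows belong to a partition that does not exist yet. The expected rows in the comments follow from the overwrite semantics described in this section and are not captured output.

```sql
-- Continues from the table state above: (2, 'p2'), (3, 'p1').
SET spark.sql.sources.partitionOverwriteMode=dynamic;

-- Dynamic mode replaces only the partitions present in the incoming rows.
-- pt='p3' is new, so p1 and p2 are expected to stay untouched.
INSERT OVERWRITE my_table VALUES (4, 'p3');
-- Expected rows: (2, 'p2'), (3, 'p1'), (4, 'p3')

-- Switch back to Spark's default mode; an overwrite without a partition
-- specification replaces the whole table again.
SET spark.sql.sources.partitionOverwriteMode=static;
INSERT OVERWRITE my_table VALUES (5, 'p1');
-- Expected rows: (5, 'p1')
```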

Truncate Table #

The TRUNCATE TABLE statement removes all the rows from a table or partition(s).

TRUNCATE TABLE my_table;

Update Table #

Updates the column values for the rows that match a predicate. When no predicate is provided, updates the column values for all rows.

Note:

Updating primary key columns is not supported when the target table is a primary key table.

Spark supports updating columns of PrimitiveType and StructType, for example:

-- Syntax
UPDATE table_identifier SET column1 = value1, column2 = value2, ... WHERE condition;

CREATE TABLE t (
  id INT, 
  s STRUCT<c1: INT, c2: STRING>, 
  name STRING)
TBLPROPERTIES (
  'primary-key' = 'id', 
  'merge-engine' = 'deduplicate'
);

-- you can use
UPDATE t SET name = 'a_new' WHERE id = 1;
UPDATE t SET s.c2 = 'a_new' WHERE s.c1 = 1;
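A slightly fuller sketch against the table t defined above may help. The sample rows are hypothetical, and the commented-out statement illustrates the primary key restriction from the note rather than any specific error message.

```sql
-- Hypothetical sample rows for the table t defined above.
INSERT INTO t VALUES
  (1, named_struct('c1', 1, 'c2', 'a'), 'n1'),
  (2, named_struct('c1', 2, 'c2', 'b'), 'n2');

-- Several assignments in one statement: a top-level column and a nested struct field.
UPDATE t SET name = 'n2_new', s.c2 = 'b_new' WHERE id = 2;

-- No WHERE clause: the assignment applies to every row.
UPDATE t SET name = 'reset';

-- Not supported on this table: id is the primary key, so it cannot be assigned.
-- UPDATE t SET id = 10 WHERE id = 1;
```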

Delete From Table #

Deletes the rows that match a predicate. When no predicate is provided, deletes all rows.

DELETE FROM my_table WHERE id = 1;

Merge Into Table #

Merges a set of updates, insertions and deletions based on a source table into a target table.

Note:

In the update clause, updating primary key columns is not supported when the target table is a primary key table.

Example: One

This simple demo updates a row if it already exists in the target table and inserts it otherwise.

-- Here both source and target tables have the same schema: (a INT, b INT, c STRING), and a is a primary key.

MERGE INTO target
USING source
ON target.a = source.a
WHEN MATCHED THEN
   UPDATE SET *
WHEN NOT MATCHED THEN
   INSERT *;
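To make the upsert behavior concrete, here is a minimal end-to-end sketch. The table definitions and sample rows are assumptions chosen to match the schema described in the comment above; the expected rows are derived from the merge semantics, not from captured output.

```sql
-- Hypothetical tables matching the schema above; a is the primary key.
CREATE TABLE target (a INT, b INT, c STRING) TBLPROPERTIES ('primary-key' = 'a');
CREATE TABLE source (a INT, b INT, c STRING) TBLPROPERTIES ('primary-key' = 'a');

INSERT INTO target VALUES (1, 10, 'c1'), (2, 20, 'c2');
INSERT INTO source VALUES (2, 200, 'c2_new'), (3, 300, 'c3');

MERGE INTO target
USING source
ON target.a = source.a
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

SELECT * FROM target ORDER BY a;
-- Expected rows under these assumptions:
-- (1, 10, 'c1')        no matching source row, left untouched
-- (2, 200, 'c2_new')   matched, all columns updated from source
-- (3, 300, 'c3')       not matched, inserted
```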

Example: Two

This is a demo with multiple, conditional clauses.

-- Here both source and target tables have the same schema: (a INT, b INT, c STRING), and a is a primary key.

MERGE INTO target
USING source
ON target.a = source.a
WHEN MATCHED AND target.a = 5 THEN
   UPDATE SET b = source.b + target.b      -- matched and condition 1 holds: update b
WHEN MATCHED AND source.c > 'c2' THEN
   UPDATE SET *    -- matched and condition 2 holds: update all columns
WHEN MATCHED THEN
   DELETE      -- matched otherwise: delete this row from the target table
WHEN NOT MATCHED AND c > 'c9' THEN
   INSERT (a, b, c) VALUES (a, b * 1.1, c)      -- not matched and condition 3 holds: transform and insert this row
WHEN NOT MATCHED THEN
   INSERT *;      -- not matched otherwise: insert this row without any transformation

Write Merge Schema #

Since the table schema may be updated during writing, catalog caching needs to be disabled to use this feature. Configure spark.sql.catalog.<catalogName>.cache-enabled to false.

Write merge schema is a feature that allows users to easily modify the current schema of a table to adapt to existing data, or new data that changes over time, while maintaining data integrity and consistency.

Paimon can automatically merge the schema of the source data with the current table schema while data is being written, and then uses the merged schema as the latest schema of the table. This only requires configuring write.merge-schema.

data.write
  .format("paimon")
  .mode("append")
  .option("write.merge-schema", "true")
  .save(location)

When write.merge-schema is enabled, Paimon allows the following changes to the table schema by default:

Adding columns
Up-casting the type of a column (e.g. Int -> Long)

Paimon also supports explicit type conversions between certain types (e.g. String -> Date, Long -> Int). These require the additional configuration write.merge-schema.explicit-cast.

Write merge schema can also be used in streaming mode.

val inputData = MemoryStream[(Int, String)]
inputData
  .toDS()
  .toDF("col1", "col2")
  .writeStream
  .format("paimon")
  .option("checkpointLocation", "/path/to/checkpoint")
  .option("write.merge-schema", "true")
  .option("write.merge-schema.explicit-cast", "true")
  .start(location)

The configurations are listed below.

Option	Description
write.merge-schema	If true, merge the data schema and the table schema automatically before writing data.
write.merge-schema.explicit-cast	If true, allow merging data types when the two types meet the rules for explicit casting.

This feature is also available in Spark SQL. Here is an example:

SET `spark.paimon.write.merge-schema` = true;

CREATE TABLE t (a INT, b STRING);
INSERT INTO t VALUES (1, '1'), (2, '2');

-- The `BY NAME` clause is required here (requires Spark 3.5+)
INSERT INTO t BY NAME SELECT 3 AS a, '3' AS b, 3 AS c;
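One way to check the effect of the statement above is to inspect the table afterwards. This is a sketch that assumes the insert succeeded and that column c was added through schema merging; the comments describe what would be expected in that case rather than captured output.

```sql
-- The merged schema is expected to list the new column c after a and b.
DESCRIBE t;

-- Rows written before the schema change carry no value for c,
-- so they are expected to read back as NULL in that column.
SELECT * FROM t ORDER BY a;
```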