Quick Start #
Preparation #
Paimon currently supports Spark 3.5, 3.4, 3.3, 3.2 and 3.1. We recommend the latest Spark version for a better experience.
Download the jar file that matches your Spark version.
| Version | Jar |
|---|---|
| Spark 3.5 | paimon-spark-3.5-0.8.2.jar |
| Spark 3.4 | paimon-spark-3.4-0.8.2.jar |
| Spark 3.3 | paimon-spark-3.3-0.8.2.jar |
| Spark 3.2 | paimon-spark-3.2-0.8.2.jar |
| Spark 3.1 | paimon-spark-3.1-0.8.2.jar |
You can also build the bundled jar from the source code.
To do so, clone the git repository and run the following command.
mvn clean install -DskipTests
For Spark 3.3, you can find the bundled jar in ./paimon-spark/paimon-spark-3.3/target/paimon-spark-3.3-0.8.2.jar.
Setup #
If you are using HDFS, make sure that the environment variable HADOOP_HOME or HADOOP_CONF_DIR is set.
Step 1: Specify Paimon Jar File
Append the path of the Paimon jar file to the --jars argument when starting spark-sql.
spark-sql ... --jars /path/to/paimon-spark-3.3-0.8.2.jar
Or use the --packages option.
spark-sql ... --packages org.apache.paimon:paimon-spark-3.3:0.8.2
Alternatively, you can copy paimon-spark-3.3-0.8.2.jar into the spark/jars directory of your Spark installation.
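The jar names in the table and the --packages coordinate shown above share one naming pattern. The following is a minimal sketch of that pattern, not part of Paimon: the function names and the PAIMON_VERSION variable are made up for illustration, and it assumes the Maven coordinate for the other Spark versions follows the same shape as the 3.3 example.

```shell
# Illustrative helpers (not part of Paimon): derive the jar path and the
# Maven coordinate for a given Spark minor version.
PAIMON_VERSION="0.8.2"

# Print the artifact name, or fail for a Spark version not listed in the table.
paimon_artifact() {
  spark_version="$1"
  case "$spark_version" in
    3.1|3.2|3.3|3.4|3.5)
      echo "paimon-spark-${spark_version}"
      ;;
    *)
      echo "unsupported Spark version: ${spark_version}" >&2
      return 1
      ;;
  esac
}

# Value for --jars; /path/to is a placeholder directory, as in the text above.
paimon_jars_arg() {
  artifact="$(paimon_artifact "$1")" || return 1
  echo "/path/to/${artifact}-${PAIMON_VERSION}.jar"
}

# Value for --packages (assumed to follow the pattern of the 3.3 example).
paimon_packages_arg() {
  artifact="$(paimon_artifact "$1")" || return 1
  echo "org.apache.paimon:${artifact}:${PAIMON_VERSION}"
}

paimon_jars_arg 3.3      # prints /path/to/paimon-spark-3.3-0.8.2.jar
paimon_packages_arg 3.3  # prints org.apache.paimon:paimon-spark-3.3:0.8.2
```

Failing early on an unlisted version avoids passing a jar name to spark-sql that does not exist.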
Step 2: Specify Paimon Catalog
When starting spark-sql, use the following command to register Paimon’s Spark catalog under the name paimon. The table files of the warehouse are stored under /tmp/paimon.
spark-sql ... \
--conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
--conf spark.sql.catalog.paimon.warehouse=file:/tmp/paimon \
--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
Catalogs are configured using properties under spark.sql.catalog.(catalog_name). In the example above, paimon is the catalog name; you can replace it with any name you prefer.
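To make the role of the catalog name concrete, here is a small sketch that prints the same three --conf flags for an arbitrary catalog name and warehouse path. The function name paimon_catalog_confs and the catalog name my_lake are hypothetical; the property keys and class names are the ones used in the command above.

```shell
# Hypothetical helper: print the --conf flags that register a Paimon catalog
# under the given name, one flag per line.
paimon_catalog_confs() {
  catalog_name="$1"   # any name you like, e.g. paimon
  warehouse="$2"      # warehouse location, e.g. file:/tmp/paimon
  printf '%s\n' "--conf spark.sql.catalog.${catalog_name}=org.apache.paimon.spark.SparkCatalog"
  printf '%s\n' "--conf spark.sql.catalog.${catalog_name}.warehouse=${warehouse}"
  printf '%s\n' "--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions"
}

# Only the catalog-specific keys change with the name; the extensions key does not.
paimon_catalog_confs my_lake file:/tmp/paimon
```

As long as the values contain no whitespace, the output can be passed straight to the command line, for example spark-sql ... $(paimon_catalog_confs my_lake file:/tmp/paimon), after which the catalog is selected with USE my_lake.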
After the spark-sql command line has started, run the following SQL to switch to the paimon catalog and then to its default database.
USE paimon;
USE default;
After switching to the catalog (USE paimon), Spark’s existing tables are no longer directly accessible; you
can use spark_catalog.${database_name}.${table_name} to access them.
Alternatively, when starting spark-sql, use the following command to register Paimon’s Spark generic catalog in place of Spark’s
default catalog, spark_catalog. (The default warehouse is Spark’s spark.sql.warehouse.dir.)
Currently, SparkGenericCatalog is recommended only when you use a Hive metastore. Paimon infers the
Hive conf from the Spark session, so you only need to configure Spark’s Hive conf.
spark-sql ... \
--conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog \
--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
With SparkGenericCatalog, this catalog can hold both Paimon tables and non-Paimon tables, such as Spark’s csv,
parquet, and Hive tables.
Create Table #
create table my_table (
k int,
v string
) tblproperties (
'primary-key' = 'k'
);
With SparkGenericCatalog, add USING paimon so that the table is created as a Paimon table:
create table my_table (
k int,
v string
) USING paimon
tblproperties (
'primary-key' = 'k'
) ;
Insert Table #
Writing with SQL currently requires Spark 3.2 or later.
INSERT INTO my_table VALUES (1, 'Hi'), (2, 'Hello');
Query Table #
SELECT * FROM my_table;
/*
1 Hi
2 Hello
*/
The same table can also be read by path with the DataFrame API (Scala):
val dataset = spark.read.format("paimon").load("file:/tmp/paimon/default.db/my_table")
dataset.show()
/*
+---+------+
|  k|     v|
+---+------+
| 1| Hi|
| 2| Hello|
+---+------+
*/
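The catalog, create, insert, and query steps above can be gathered into one SQL file and handed to spark-sql with its -f option. The sketch below only writes the file and prints the command, so it runs without Spark installed. The file name quickstart.sql and the variables PAIMON_JAR and SQL_FILE are illustrative, and IF NOT EXISTS is added here (it is not in the statements above) so that the file can be run more than once.

```shell
# Sketch: collect the quick-start statements into one SQL file.
# PAIMON_JAR is the placeholder path used earlier; adjust it to your setup.
PAIMON_JAR="/path/to/paimon-spark-3.3-0.8.2.jar"
SQL_FILE="${TMPDIR:-/tmp}/quickstart.sql"

# Quoted 'EOF' keeps the SQL literal (no shell expansion inside the file).
cat > "$SQL_FILE" <<'EOF'
USE paimon;
USE default;
CREATE TABLE IF NOT EXISTS my_table (
  k INT,
  v STRING
) TBLPROPERTIES (
  'primary-key' = 'k'
);
INSERT INTO my_table VALUES (1, 'Hi'), (2, 'Hello');
SELECT * FROM my_table;
EOF

# Print the command instead of running it; copy it into a shell with Spark available.
cat <<EOF
spark-sql --jars $PAIMON_JAR \\
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \\
  --conf spark.sql.catalog.paimon.warehouse=file:/tmp/paimon \\
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \\
  -f $SQL_FILE
EOF
```

Because k is the primary key, running the file again should leave the same two rows rather than duplicating them.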
Spark Type Conversion #
This section lists all supported type conversions between Spark and Paimon.
All of Spark’s data types are available in the package org.apache.spark.sql.types.
| Spark Data Type | Paimon Data Type | Atomic Type |
|---|---|---|
| StructType | RowType | false |
| MapType | MapType | false |
| ArrayType | ArrayType | false |
| BooleanType | BooleanType | true |
| ByteType | TinyIntType | true |
| ShortType | SmallIntType | true |
| IntegerType | IntType | true |
| LongType | BigIntType | true |
| FloatType | FloatType | true |
| DoubleType | DoubleType | true |
| StringType | VarCharType, CharType | true |
| DateType | DateType | true |
| TimestampType | TimestampType, LocalZonedTimestamp | true |
| DecimalType(precision, scale) | DecimalType(precision, scale) | true |
| BinaryType | VarBinaryType, BinaryType | true |
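As a rough illustration of the table, the sketch below writes a DDL file with one column per Spark type, each column named after the Spark type it produces. The table name type_demo, the file name types_demo.sql, and the chosen precision, scale, and element types are arbitrary. It assumes the paimon catalog from Step 2 is the current catalog; under SparkGenericCatalog the statement would also need USING paimon.

```shell
# Sketch: DDL with one column per Spark type listed in the conversion table.
# Column names record the Spark type; the SQL keyword is what you write in DDL.
TYPES_SQL="${TMPDIR:-/tmp}/types_demo.sql"

cat > "$TYPES_SQL" <<'EOF'
CREATE TABLE type_demo (
  boolean_type   BOOLEAN,
  byte_type      TINYINT,
  short_type     SMALLINT,
  integer_type   INT,
  long_type      BIGINT,
  float_type     FLOAT,
  double_type    DOUBLE,
  string_type    STRING,
  date_type      DATE,
  timestamp_type TIMESTAMP,
  decimal_type   DECIMAL(10, 2),
  binary_type    BINARY,
  array_type     ARRAY<INT>,
  map_type       MAP<STRING, INT>,
  struct_type    STRUCT<a: INT, b: STRING>
);
EOF

echo "wrote $TYPES_SQL"
```

The file can then be passed to spark-sql with -f, using the same jar and catalog options as in the Setup section.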