Quick Start
Preparation
Paimon currently supports Spark 3.5, 3.4, 3.3, 3.2 and 3.1. We recommend the latest Spark version for a better experience.
Download the jar file that matches your Spark version.
Version | Jar |
---|---|
Spark 3.5 | paimon-spark-3.5-0.8.2.jar |
Spark 3.4 | paimon-spark-3.4-0.8.2.jar |
Spark 3.3 | paimon-spark-3.3-0.8.2.jar |
Spark 3.2 | paimon-spark-3.2-0.8.2.jar |
Spark 3.1 | paimon-spark-3.1-0.8.2.jar |
You can also build the bundled jar from the source code.
To do so, clone the git repository, then build the bundled jar with the following command.
mvn clean install -DskipTests
For Spark 3.3, you can find the bundled jar in ./paimon-spark/paimon-spark-3.3/target/paimon-spark-3.3-0.8.2.jar.
Setup
If you are using HDFS, make sure that the environment variable HADOOP_HOME or HADOOP_CONF_DIR is set.
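As a quick sanity check before launching Spark against HDFS, the following shell sketch verifies that at least one of the two variables is set. The function name check_hadoop_env is a hypothetical helper for illustration and is not part of Paimon or Hadoop.

```shell
# Hypothetical helper: succeeds when HADOOP_HOME or HADOOP_CONF_DIR is set
# to a non-empty value; otherwise prints a warning to stderr and fails.
check_hadoop_env() {
  if [ -n "${HADOOP_HOME:-}" ] || [ -n "${HADOOP_CONF_DIR:-}" ]; then
    return 0
  fi
  echo "Neither HADOOP_HOME nor HADOOP_CONF_DIR is set" >&2
  return 1
}
```

Calling check_hadoop_env || exit 1 at the top of a launch script would stop the script early when neither variable is set.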
Step 1: Specify Paimon Jar File
Append the path to the Paimon jar file to the --jars argument when starting spark-sql.
spark-sql ... --jars /path/to/paimon-spark-3.3-0.8.2.jar
Or use the --packages option.
spark-sql ... --packages org.apache.paimon:paimon-spark-3.3:0.8.2
Alternatively, you can copy paimon-spark-3.3-0.8.2.jar into spark/jars in your Spark installation directory.
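The jar names in the download table and the --packages coordinate follow one naming pattern, so both can be derived from the Spark minor version and the Paimon version. The sketch below assumes that pattern holds for the versions listed in this guide; paimon_jar_name and paimon_maven_coord are hypothetical helpers, not part of Paimon.

```shell
# Hypothetical helpers; the naming pattern is inferred from the jar table
# and the --packages example in this guide.

# paimon_jar_name SPARK_VERSION PAIMON_VERSION -> bundled jar file name
paimon_jar_name() {
  printf 'paimon-spark-%s-%s.jar\n' "$1" "$2"
}

# paimon_maven_coord SPARK_VERSION PAIMON_VERSION -> groupId:artifactId:version
paimon_maven_coord() {
  printf 'org.apache.paimon:paimon-spark-%s:%s\n' "$1" "$2"
}
```

For example, spark-sql ... --packages "$(paimon_maven_coord 3.3 0.8.2)" expands to the same coordinate as the --packages command shown earlier in this step.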
Step 2: Specify Paimon Catalog
When starting spark-sql, use the following command to register Paimon’s Spark catalog with the name paimon. Table files of the warehouse are stored under /tmp/paimon.
spark-sql ... \
--conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
--conf spark.sql.catalog.paimon.warehouse=file:/tmp/paimon \
--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
Catalogs are configured using properties under spark.sql.catalog.(catalog_name). In the case above, ‘paimon’ is the catalog name; you can change it to any name you prefer.
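To illustrate how the catalog name appears in the property keys, the sketch below prints the three --conf options for a given catalog name and warehouse path. paimon_catalog_conf is a hypothetical helper and my_catalog is an example name; the property keys and class names are the ones used in the command above.

```shell
# Hypothetical helper: print the --conf options that register a Paimon
# catalog under the name $1 with its warehouse located at $2.
paimon_catalog_conf() {
  catalog_name=$1
  warehouse=$2
  printf '%s\n' "--conf spark.sql.catalog.${catalog_name}=org.apache.paimon.spark.SparkCatalog"
  printf '%s\n' "--conf spark.sql.catalog.${catalog_name}.warehouse=${warehouse}"
  printf '%s\n' "--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions"
}
```

Running paimon_catalog_conf my_catalog file:/tmp/paimon prints keys that start with spark.sql.catalog.my_catalog, and the catalog would then be selected with USE my_catalog; instead of USE paimon;.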
After the spark-sql command line has started, run the following SQL to switch to the paimon catalog and its default database.
USE paimon;
USE default;
After switching to the catalog ('USE paimon'), Spark’s existing tables are no longer directly accessible. Use spark_catalog.${database_name}.${table_name} to access them.
Alternatively, when starting spark-sql, use the following command to register Paimon’s Spark generic catalog in place of Spark’s default catalog spark_catalog. (The default warehouse is Spark’s spark.sql.warehouse.dir.)
Currently, SparkGenericCatalog is recommended only when you use a Hive metastore. Paimon infers the Hive conf from the Spark session, so you only need to configure Spark’s Hive conf.
spark-sql ... \
--conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog \
--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions
With SparkGenericCatalog, you can use both Paimon tables and non-Paimon tables, such as Spark’s csv, parquet, and Hive tables, in this catalog.
Create Table
-- With the Paimon catalog (SparkCatalog), no USING clause is needed.
create table my_table (
    k int,
    v string
) tblproperties (
    'primary-key' = 'k'
);
-- With SparkGenericCatalog, add USING paimon to create a Paimon table.
create table my_table (
    k int,
    v string
) USING paimon
tblproperties (
    'primary-key' = 'k'
);
Insert Table
Paimon currently supports Spark 3.2+ for SQL write.
INSERT INTO my_table VALUES (1, 'Hi'), (2, 'Hello');
Query Table
SELECT * FROM my_table;
/*
1 Hi
2 Hello
*/
// Scala: read the same table by path with the DataFrame API
val dataset = spark.read.format("paimon").load("file:/tmp/paimon/default.db/my_table")
dataset.show()
/*
+---+------+
|  k|     v|
+---+------+
|  1|    Hi|
|  2| Hello|
+---+------+
*/
Spark Type Conversion
This section lists all supported type conversions between Spark and Paimon.
All of Spark’s data types are available in the package org.apache.spark.sql.types.
Spark Data Type | Paimon Data Type | Atomic Type |
---|---|---|
StructType | RowType | false |
MapType | MapType | false |
ArrayType | ArrayType | false |
BooleanType | BooleanType | true |
ByteType | TinyIntType | true |
ShortType | SmallIntType | true |
IntegerType | IntType | true |
LongType | BigIntType | true |
FloatType | FloatType | true |
DoubleType | DoubleType | true |
StringType | VarCharType, CharType | true |
DateType | DateType | true |
TimestampType | TimestampType, LocalZonedTimestamp | true |
DecimalType(precision, scale) | DecimalType(precision, scale) | true |
BinaryType | VarBinaryType, BinaryType | true |