Spark2 #

This documentation is a guide for using Paimon in Spark2.

Version #

Paimon supports Spark 2.4+. It is highly recommended to use the latest 2.x release, which includes many improvements.

Preparing Paimon Jar File #

Download paimon-spark-2-0.4.0-incubating.jar.

You can also manually build the bundled jar from the source code.

To build from source, clone the git repository.

Build the bundled jar with the following command:

mvn clean install -DskipTests

You can find the bundled jar in ./paimon-spark/paimon-spark-2/target/paimon-spark-2-0.4.0-incubating.jar.

Quick Start #

If you are using HDFS, make sure that the environment variable HADOOP_HOME or HADOOP_CONF_DIR is set.
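
For example, if your Hadoop configuration files live under /etc/hadoop/conf (the path is illustrative; use your cluster's actual location):

export HADOOP_CONF_DIR=/etc/hadoop/conf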

Step 1: Prepare Test Data

Paimon currently only supports reading tables through Spark2. To create a Paimon table with records, please follow our Flink quick start guide.

After completing that guide, all table files are stored under /tmp/paimon, or under whatever warehouse path you specified.

Step 2: Specify Paimon Jar File

You can append the path to the Paimon jar file to the --jars argument when starting spark-shell.

spark-shell ... --jars /path/to/paimon-spark-2-0.4.0-incubating.jar

Alternatively, you can copy paimon-spark-2-0.4.0-incubating.jar into the jars directory of your Spark installation.
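
For example, assuming the SPARK_HOME environment variable points at your Spark installation (the source path is illustrative):

cp /path/to/paimon-spark-2-0.4.0-incubating.jar $SPARK_HOME/jars/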

Step 3: Query Table

Paimon with Spark 2.4 does not support DDL. You can use the Dataset reader and register the Dataset as a temporary view. In the Spark shell:

// Read the Paimon table directory as a Dataset
val dataset = spark.read.format("paimon").load("file:/tmp/paimon/default.db/word_count")
// Register it as a temporary view so it can be queried with SQL
dataset.createOrReplaceTempView("word_count")
spark.sql("SELECT * FROM word_count").show()
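
Once the view is registered, you can run further queries against it. A minimal sketch (the word and cnt column names match the table created by the Flink quick start; adjust them if your table differs):

// Inspect the table's schema before writing targeted queries
dataset.printSchema()
// Filter rows with SQL on the registered temporary view
spark.sql("SELECT word, cnt FROM word_count WHERE cnt > 1").show()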