This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

Filesystems #

Apache Paimon utilizes the same pluggable file systems as Apache Flink. Users can follow the standard plugin mechanism to configure the plugin structure if using Flink as compute engine. However, for other engines like Spark or Hive, the provided opt jars (by Flink) may get conflicts and cannot be used directly. It is not convenient for users to fix class conflicts, thus Paimon provides the self-contained and engine-unified FileSystem pluggable jars for user to query tables from Spark/Hive side.

Supported FileSystems #

FileSystem	URI Scheme	Pluggable	Description
Local File System	file://	N	Built-in Support
HDFS	hdfs://	N	Built-in Support, ensure that the cluster is in the hadoop environment
Aliyun OSS	oss://	Y
S3	s3://	Y
Tencent Cloud Object Storage	cosn://	Y
Microsoft Azure Storage	abfs://	Y
Huawei OBS	obs://	Y
Google Cloud Storage	gs://	Y

Dependency #

We recommend you to download the jar directly: Download Link.

You can also manually build bundled jar from the source code.

To build from source code, clone the git repository.

Build shaded jar with the following command.

mvn clean install -DskipTests

You can find the shaded jars under ./paimon-filesystems/paimon-${fs}/target/paimon-${fs}-1.4-SNAPSHOT.jar.

HDFS #

You don’t need any additional dependencies to access HDFS because you have already taken care of the Hadoop dependencies.

HDFS Configuration #

For HDFS, the most important thing is to be able to read your HDFS configuration.

Flink

You may not have to do anything, if you are in a hadoop environment. Otherwise pick one of the following ways to configure your HDFS:

Set environment variable HADOOP_HOME or HADOOP_CONF_DIR.
Configure 'hadoop-conf-dir' in the paimon catalog.
Configure Hadoop options through prefix 'hadoop.' in the paimon catalog.

The first approach is recommended.

If you do not want to include the value of the environment variable, you can configure hadoop-conf-loader to option.

Hive/Spark

HDFS Configuration is available directly through the computation cluster, see cluster configuration of Hive and Spark for details.

Paimon-Hadoop Uber package #

If you want to read from and write to Paimon tables on a Hadoop cluster within an application that has no Hadoop dependencies, you can use the Paimon Hadoop uber jar.

Download paimon-hadoop-uber-1.4-SNAPSHOT.jar.

Put paimon-hadoop-uber-1.4-SNAPSHOT.jar into classpath directory of your application.

Hadoop-compatible file systems (HCFS) #

All Hadoop file systems are automatically available when the Hadoop libraries are on the classpath.

This way, Paimon seamlessly supports all of Hadoop file systems implementing the org.apache.hadoop.fs.FileSystem interface, and all Hadoop-compatible file systems (HCFS).

HDFS
Alluxio (see configuration specifics below)
XtreemFS
…

The Hadoop configuration has to have an entry for the required file system implementation in the core-site.xml file.

For Alluxio support add the following entry into the core-site.xml file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

Kerberos #

Flink

Configure the following options in your catalog configuration:

security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
security.kerberos.login.principal: Kerberos principal name associated with the keytab.

And configure the following option in the program’s java property:

java.security.krb5.conf: Absolute path to the Kerberos configuration file. Please make sure it is copied to each machine.

Spark

It is recommended to use Spark Kerberos Keytab.

Hive

An intuitive approach is to configure Hive’s kerberos authentication.

Trino/JavaAPI

Configure the following three options in your catalog configuration:

security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
security.kerberos.login.principal: Kerberos principal name associated with the keytab.
security.kerberos.login.use-ticket-cache: True or false, indicates whether to read from your Kerberos ticket cache.

For JavaAPI:

SecurityContext.install(catalogOptions);

HDFS HA #

Ensure that hdfs-site.xml and core-site.xml contain the necessary HA configuration.

HDFS ViewFS #

Ensure that hdfs-site.xml and core-site.xml contain the necessary ViewFs configuration.

OSS #

Download paimon-oss-1.4-SNAPSHOT.jar.

Flink

If you have already configured oss access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-oss-1.4-SNAPSHOT.jar into lib directory of your Flink home, and create catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://<bucket>/<path>',
    'fs.oss.endpoint' = 'oss-cn-hangzhou.aliyuncs.com',
    'fs.oss.accessKeyId' = 'xxx',
    'fs.oss.accessKeySecret' = 'yyy'
);

Spark

If you have already configured oss access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-oss-1.4-SNAPSHOT.jar together with paimon-spark-1.4-SNAPSHOT.jar under Spark’s jars directory, and start like

spark-sql \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=oss://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.fs.oss.endpoint=oss-cn-hangzhou.aliyuncs.com \
  --conf spark.sql.catalog.paimon.fs.oss.accessKeyId=xxx \
  --conf spark.sql.catalog.paimon.fs.oss.accessKeySecret=yyy

Hive

If you have already configured oss access through Hive (Via Hadoop FileSystem), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access oss.

Place paimon-oss-1.4-SNAPSHOT.jar together with paimon-hive-connector-1.4-SNAPSHOT.jar under Hive’s auxlib directory, and start like

SET paimon.fs.oss.endpoint=oss-cn-hangzhou.aliyuncs.com;
SET paimon.fs.oss.accessKeyId=xxx;
SET paimon.fs.oss.accessKeySecret=yyy;

And read table from hive metastore, table can be created by Flink or Spark, see Catalog with Hive Metastore

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;

Trino

From version 0.8, paimon-trino uses trino filesystem as basic file read and write system. We strongly recommend you to use jindo-sdk in trino.

You can find How to config jindo sdk on trino here. Please note that:

Use paimon to replace hive-hadoop2 when you decompress the plugin jar and find location to put in.
You can specify the core-site.xml in paimon.properties on configuration hive.config.resources.
Presto and Jindo use the same configuration method.

If you environment has jindo sdk dependencies, you can use Jindo Fs to connect OSS. Jindo has better read and write efficiency.

Download paimon-jindo-1.4-SNAPSHOT.jar.

S3 #

Download paimon-s3-1.4-SNAPSHOT.jar.

Flink

If you have already configured s3 access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-s3-1.4-SNAPSHOT.jar into lib directory of your Flink home, and create catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://<bucket>/<path>',
    's3.endpoint' = 'your-endpoint-hostname',
    's3.access-key' = 'xxx',
    's3.secret-key' = 'yyy'
);

Spark

If you have already configured s3 access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-s3-1.4-SNAPSHOT.jar together with paimon-spark-1.4-SNAPSHOT.jar under Spark’s jars directory, and start like

spark-sql \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=s3://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.s3.endpoint=your-endpoint-hostname \
  --conf spark.sql.catalog.paimon.s3.access-key=xxx \
  --conf spark.sql.catalog.paimon.s3.secret-key=yyy

Hive

If you have already configured s3 access through Hive ((Via Hadoop FileSystem)), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access s3.

Place paimon-s3-1.4-SNAPSHOT.jar together with paimon-hive-connector-1.4-SNAPSHOT.jar under Hive’s auxlib directory, and start like

SET paimon.s3.endpoint=your-endpoint-hostname;
SET paimon.s3.access-key=xxx;
SET paimon.s3.secret-key=yyy;

And read table from hive metastore, table can be created by Flink or Spark, see Catalog with Hive Metastore

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;

Trino

Paimon use shared trino filesystem as basic read and write system.

Please refer to Trino S3 to config s3 filesystem in trino.

S3 Compliant Object Stores #

The S3 Filesystem also support using S3 compliant object stores such as MinIO, Tencent’s COS and IBM’s Cloud Object Storage. Just configure your endpoint to the provider of the object store service.

s3.endpoint: your-endpoint-hostname

Configure Path Style Access #

Some S3 compliant object stores might not have virtual host style addressing enabled by default, for example when using Standalone MinIO for testing purpose. In such cases, you will have to provide the property to enable path style access.

s3.path.style.access: true

S3A Performance #

Tune Performance for S3AFileSystem.

If you encounter the following exception:

Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool.

Try to configure this in catalog options: fs.s3a.connection.maximum=1000.

Google Cloud Storage #

Download paimon-gs-1.4-SNAPSHOT.jar.

Flink

If you have already configured gcs access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-gs-1.4-SNAPSHOT.jar into lib directory of your Flink home, and create catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'gs://<bucket>/<path>',
    'fs.gs.auth.type' = 'SERVICE_ACCOUNT_JSON_KEYFILE',
    'fs.gs.auth.service.account.json.keyfile' = '/path/to/service-account-.json'
);

Microsoft Azure Storage #

Download paimon-azure-1.4-SNAPSHOT.jar.

Flink

If you have already configured azure access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-azure-1.4-SNAPSHOT.jar into lib directory of your Flink home, and create catalog:

CREATE CATALOG my_catalog WITH (
   'type' = 'paimon',
   'warehouse' = 'wasb://,<container>@<account>.blob.core.windows.net/<path>',
   'fs.azure.account.key.Account.blob.core.windows.net' = 'yyy'
);

Spark

If you have already configured azure access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-azure-1.4-SNAPSHOT.jar together with paimon-spark-1.4-SNAPSHOT.jar under Spark’s jars directory, and start like

spark-sql \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=wasb://,<container>@<account>.blob.core.windows.net/<path> \
  --conf fs.azure.account.key.Account.blob.core.windows.net=yyy \

OBS #

Download paimon-obs-1.4-SNAPSHOT.jar.

Flink

If you have already configured obs access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-obs-1.4-SNAPSHOT.jar into lib directory of your Flink home, and create catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'obs://<bucket>/<path>',
    'fs.obs.endpoint' = 'obs-endpoint-hostname',
    'fs.obs.access.key' = 'xxx',
    'fs.obs.secret.key' = 'yyy'
);

Spark

If you have already configured obs access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-obs-1.4-SNAPSHOT.jar together with paimon-spark-1.4-SNAPSHOT.jar under Spark’s jars directory, and start like

spark-sql \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=obs://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.fs.obs.endpoint=obs-endpoint-hostname \
  --conf spark.sql.catalog.paimon.fs.obs.access.key=xxx \
  --conf spark.sql.catalog.paimon.fs.obs.secret.key=yyy

Hive

If you have already configured obs access through Hive ((Via Hadoop FileSystem)), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access obs.

Place paimon-obs-1.4-SNAPSHOT.jar together with paimon-hive-connector-1.4-SNAPSHOT.jar under Hive’s auxlib directory, and start like

SET paimon.fs.obs.endpoint=obs-endpoint-hostname;
SET paimon.fs.obs.access.key=xxx;
SET paimon.fs.obs.secret.key=yyy;

And read table from hive metastore, table can be created by Flink or Spark, see Catalog with Hive Metastore

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;