Filesystems

Apache Paimon uses the same pluggable file systems as Apache Flink. If you use Flink as the compute engine, follow the standard Flink plugin mechanism to configure them. For other engines such as Spark or Hive, the opt jars provided by Flink may cause class conflicts and cannot be used directly. Because such conflicts are inconvenient for users to resolve, Paimon provides self-contained, engine-unified FileSystem pluggable jars so that tables can be queried from Spark and Hive.

Supported FileSystems

FileSystem	URI Scheme	Pluggable	Description
Local File System	file://	N	Built-in Support
HDFS	hdfs://	N	Built-in Support, ensure that the cluster is in the hadoop environment
Aliyun OSS	oss://	Y
S3	s3://	Y
Tencent Cloud Object Storage	cosn://	Y
Microsoft Azure Storage	abfs://	Y
Huawei OBS	obs://	Y
Google Cloud Storage	gs://	Y

Dependency

We recommend downloading the jar directly: Download Link.

You can also build the bundled jar manually from the source code.

To build from source code, clone the git repository.

Build the shaded jars with the following command.

mvn clean install -DskipTests

You can find the shaded jars under ./paimon-filesystems/paimon-${fs}/target/paimon-${fs}-1.5-SNAPSHOT.jar.
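As a rough sketch of the path pattern above, the following shell snippet expands ${fs} for the pluggable modules whose jars are named on this page (oss, s3, gs, azure, obs) and reports which shaded jars exist after a build. The helper name paimon_fs_jar is made up for this illustration, and the snippet assumes it is run from the root of the cloned repository.

```shell
# Hypothetical helper: print the expected shaded-jar path for one filesystem.
# $1 = filesystem module suffix, $2 = version (defaults to 1.5-SNAPSHOT)
paimon_fs_jar() {
  fs="$1"
  version="${2:-1.5-SNAPSHOT}"
  printf './paimon-filesystems/paimon-%s/target/paimon-%s-%s.jar\n' "$fs" "$fs" "$version"
}

# Report which of the expected jars are present in the working tree.
for fs in oss s3 gs azure obs; do
  jar="$(paimon_fs_jar "$fs")"
  if [ -f "$jar" ]; then
    echo "found:   $jar"
  else
    echo "missing: $jar"
  fi
done
```

A "missing" line usually just means that the module has not been built yet or that the snippet was run from a different directory.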

HDFS

You don't need any additional dependencies to access HDFS because you have already taken care of the Hadoop dependencies.

HDFS Configuration

For HDFS, the most important thing is to be able to read your HDFS configuration.

Flink
Hive/Spark

You may not need to do anything if you are in a Hadoop environment. Otherwise, pick one of the following ways to configure your HDFS:

Set the environment variable HADOOP_HOME or HADOOP_CONF_DIR.
Configure 'hadoop-conf-dir' in the Paimon catalog.
Configure Hadoop options through the prefix 'hadoop.' in the Paimon catalog.

The first approach is recommended.

If you do not want the environment variables to be taken into account, set hadoop-conf-loader to option.
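A minimal sketch of the first approach, assuming the Hadoop client configuration lives in /etc/hadoop/conf (an assumed path; use your cluster's actual directory). It exports HADOOP_CONF_DIR unless it is already set, then checks that core-site.xml and hdfs-site.xml are readable there:

```shell
# Assumed default location; an already exported HADOOP_CONF_DIR is kept as-is.
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

# Verify that the configuration files can be read from that directory.
for f in core-site.xml hdfs-site.xml; do
  if [ -r "$HADOOP_CONF_DIR/$f" ]; then
    echo "ok: $HADOOP_CONF_DIR/$f"
  else
    echo "not readable: $HADOOP_CONF_DIR/$f" >&2
  fi
done
```

Run this in the shell that launches the engine so that the exported variable is visible to it.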

Paimon-Hadoop Uber package

If you want to read from and write to Paimon tables on a Hadoop cluster within an application that has no Hadoop dependencies, you can use the Paimon Hadoop uber jar.

Download paimon-hadoop-uber-1.5-SNAPSHOT.jar.

Put paimon-hadoop-uber-1.5-SNAPSHOT.jar on the classpath of your application.

Hadoop-compatible file systems (HCFS)

All Hadoop file systems are automatically available when the Hadoop libraries are on the classpath.

This way, Paimon seamlessly supports all Hadoop file systems implementing the org.apache.hadoop.fs.FileSystem interface, and all Hadoop-compatible file systems (HCFS).

HDFS
Alluxio (see configuration specifics below)
XtreemFS
…

The Hadoop configuration has to have an entry for the required file system implementation in the core-site.xml file.

For Alluxio support, add the following entry to the core-site.xml file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

Kerberos

Flink
Spark
Hive
Trino/JavaAPI

Configure the following options in your catalog configuration:

security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
security.kerberos.login.principal: Kerberos principal name associated with the keytab.

Also set the following Java system property for the program:

java.security.krb5.conf: Absolute path to the Kerberos configuration file. Please make sure it is copied to each machine.
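One way to pass this Java property, sketched under the assumption that the Kerberos configuration file is at /etc/krb5.conf and that the program is launched from the same shell. JAVA_TOOL_OPTIONS is a standard environment variable that JVMs read at startup, so exporting it applies the property to JVMs started from this shell without changing each launch command. The application jar and main class in the second form are hypothetical.

```shell
# Assumed location of the Kerberos configuration file.
KRB5_CONF="/etc/krb5.conf"

# Option 1: let any JVM started from this shell pick the property up.
export JAVA_TOOL_OPTIONS="-Djava.security.krb5.conf=${KRB5_CONF}"

# Option 2: pass it explicitly on the java command line.
# my-app.jar and com.example.MyPaimonApp are placeholders; the command is
# only printed here, remove `echo` to run it.
echo java "-Djava.security.krb5.conf=${KRB5_CONF}" -cp my-app.jar com.example.MyPaimonApp
```

Neither form distributes the file itself; it still has to be present at that path on each machine.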

Configure the following three options in your catalog configuration:

security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
security.kerberos.login.principal: Kerberos principal name associated with the keytab.
security.kerberos.login.use-ticket-cache: True or false, indicates whether to read from your Kerberos ticket cache.

For JavaAPI:

SecurityContext.install(catalogOptions);

HDFS HA

Ensure that hdfs-site.xml and core-site.xml contain the necessary HA configuration.

HDFS ViewFS

Ensure that hdfs-site.xml and core-site.xml contain the necessary ViewFs configuration.

OSS

Download paimon-oss-1.5-SNAPSHOT.jar.

Flink
Spark
Hive
Trino

info

If you have already configured oss access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-oss-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://<bucket>/<path>',
    'fs.oss.endpoint' = 'oss-cn-hangzhou.aliyuncs.com',
    'fs.oss.accessKeyId' = 'xxx',
    'fs.oss.accessKeySecret' = 'yyy'
);

info

If you have already configured oss access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-oss-1.5-SNAPSHOT.jar together with paimon-spark-1.5-SNAPSHOT.jar under Spark's jars directory, and start spark-sql as follows:

spark-sql \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=oss://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.fs.oss.endpoint=oss-cn-hangzhou.aliyuncs.com \
  --conf spark.sql.catalog.paimon.fs.oss.accessKeyId=xxx \
  --conf spark.sql.catalog.paimon.fs.oss.accessKeySecret=yyy

info

If you have already configured oss access through Hive (Via Hadoop FileSystem), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access oss.

Place paimon-oss-1.5-SNAPSHOT.jar together with paimon-hive-connector-1.5-SNAPSHOT.jar under Hive's auxlib directory, and set the following options in your Hive session:

SET paimon.fs.oss.endpoint=oss-cn-hangzhou.aliyuncs.com;
SET paimon.fs.oss.accessKeyId=xxx;
SET paimon.fs.oss.accessKeySecret=yyy;

Then read the table from the Hive metastore. The table can be created by Flink or Spark; see Catalog with Hive Metastore.

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;

Since version 0.8, paimon-trino uses the Trino filesystem as its underlying file read and write layer. We strongly recommend using jindo-sdk in Trino.

See the guide "How to config jindo sdk on trino" for setup instructions. Please note that:

When you decompress the plugin jar and choose where to place it, use paimon in place of hive-hadoop2.
You can specify core-site.xml in paimon.properties through the hive.config.resources option.
Presto and Jindo use the same configuration method.

If your environment has jindo sdk dependencies, you can use Jindo Fs to connect to OSS. Jindo has better read and write efficiency.

Download paimon-jindo-1.5-SNAPSHOT.jar.

Server-Side Encryption

Paimon can stamp OSS server-side-encryption headers on the writes it performs (PutObject, server-side CopyObject used by rename/commit, and multipart-upload initiation) via three keys that map 1:1 to the OSS request headers:

# Encryption method: AES256, KMS or SM4 (-> x-oss-server-side-encryption)
fs.oss.server-side-encryption: KMS
# Customer CMK key id, only valid with KMS (-> x-oss-server-side-encryption-key-id)
fs.oss.server-side-encryption-key-id: your-cmk-key-id
# Data-encryption algorithm SM4, only valid with KMS (-> x-oss-server-side-data-encryption)
fs.oss.server-side-data-encryption: SM4

When any of these is set, Paimon takes over the SSE headers on its write paths (these keys take precedence over hadoop-aliyun's native fs.oss.server-side-encryption-algorithm). A key id or SM4 data-encryption requires fs.oss.server-side-encryption=KMS (it defaults to KMS when unset), and is rejected otherwise.
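The rule in the last sentence can be sketched as a small shell function. This is an illustration of the described behavior only, not Paimon's code: the helper name is hypothetical, the comparison is assumed to be case-sensitive, and the real implementation may report the rejection differently.

```shell
# Hypothetical helper mirroring the rule described above.
# $1 = fs.oss.server-side-encryption
# $2 = fs.oss.server-side-encryption-key-id
# $3 = fs.oss.server-side-data-encryption
# Prints the effective encryption method, or returns 1 if the combination is rejected.
resolve_sse_method() {
  method="$1"
  key_id="$2"
  data_enc="$3"
  if [ -n "$key_id" ] || [ -n "$data_enc" ]; then
    # A key id or SM4 data encryption implies KMS when the method is unset.
    if [ -z "$method" ]; then
      method="KMS"
    fi
    if [ "$method" != "KMS" ]; then
      echo "rejected: key id / data encryption require KMS, got $method" >&2
      return 1
    fi
  fi
  echo "$method"
}

resolve_sse_method "" "your-cmk-key-id" ""   # prints KMS
resolve_sse_method "AES256" "" ""            # prints AES256
```

Calling resolve_sse_method "AES256" "your-cmk-key-id" "" returns a non-zero status, which corresponds to the rejected combination.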

S3

Download paimon-s3-1.5-SNAPSHOT.jar.

Flink
Spark
Hive
Trino

info

If you have already configured s3 access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-s3-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://<bucket>/<path>',
    's3.endpoint' = 'your-endpoint-hostname',
    's3.access-key' = 'xxx',
    's3.secret-key' = 'yyy'
);

info

If you have already configured s3 access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-s3-1.5-SNAPSHOT.jar together with paimon-spark-1.5-SNAPSHOT.jar under Spark's jars directory, and start spark-sql as follows:

spark-sql \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=s3://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.s3.endpoint=your-endpoint-hostname \
  --conf spark.sql.catalog.paimon.s3.access-key=xxx \
  --conf spark.sql.catalog.paimon.s3.secret-key=yyy

info

If you have already configured s3 access through Hive (Via Hadoop FileSystem), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access s3.

Place paimon-s3-1.5-SNAPSHOT.jar together with paimon-hive-connector-1.5-SNAPSHOT.jar under Hive's auxlib directory, and set the following options in your Hive session:

SET paimon.s3.endpoint=your-endpoint-hostname;
SET paimon.s3.access-key=xxx;
SET paimon.s3.secret-key=yyy;

Then read the table from the Hive metastore. The table can be created by Flink or Spark; see Catalog with Hive Metastore.

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;

S3 Compliant Object Stores

The S3 Filesystem also supports S3 compliant object stores such as MinIO, Tencent's COS and IBM's Cloud Object Storage. Just configure your endpoint to point to the provider of the object store service.

s3.endpoint: your-endpoint-hostname

Configure Path Style Access

Some S3 compliant object stores might not have virtual host style addressing enabled by default, for example when using standalone MinIO for testing purposes. In such cases, you will have to set the following property to enable path style access.

s3.path.style.access: true
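Combined with the Spark options shown earlier, a path-style setup for an S3 compliant store might look like the sketch below. It assumes the same spark.sql.catalog.paimon. prefix applies to s3.path.style.access as to the other s3.* options; the warehouse bucket and path are hypothetical, and the endpoint is left as a placeholder. The command is only printed so that it can be reviewed first.

```shell
# Placeholder values; replace with your object store's endpoint and bucket.
ENDPOINT="your-endpoint-hostname"
WAREHOUSE="s3://my-bucket/warehouse"

# Collect the arguments in the positional parameters so they can be inspected.
set -- \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf "spark.sql.catalog.paimon.warehouse=${WAREHOUSE}" \
  --conf "spark.sql.catalog.paimon.s3.endpoint=${ENDPOINT}" \
  --conf spark.sql.catalog.paimon.s3.path.style.access=true \
  --conf spark.sql.catalog.paimon.s3.access-key=xxx \
  --conf spark.sql.catalog.paimon.s3.secret-key=yyy

# Print the command for review; remove `echo` to launch spark-sql.
echo spark-sql "$@"
```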

S3A Performance

Tune Performance for S3AFileSystem.

If you encounter the following exception:

Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool.

Try raising the connection pool size in the catalog options, for example fs.s3a.connection.maximum=1000.
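How a catalog option is spelled depends on the engine. Following the prefixes used in the Spark and Hive examples on this page (spark.sql.catalog.paimon. and paimon. respectively), the sketch below prints the same option in both forms. The helper names are hypothetical, and whether 1000 is an appropriate pool size for your workload is something to validate yourself.

```shell
# Hypothetical helpers: render one catalog option for Spark and for Hive.
spark_conf() {  # $1 = catalog name, $2 = option key, $3 = value
  printf '%s\n' "--conf spark.sql.catalog.$1.$2=$3"
}
hive_set() {    # $1 = option key, $2 = value
  printf '%s\n' "SET paimon.$1=$2;"
}

spark_conf paimon fs.s3a.connection.maximum 1000
hive_set fs.s3a.connection.maximum 1000
```

In a Flink CREATE CATALOG statement the option would presumably go into the WITH clause unprefixed, in the same way as 's3.endpoint' in the example above.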

Google Cloud Storage

Download paimon-gs-1.5-SNAPSHOT.jar.

Flink

info

If you have already configured gcs access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-gs-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'gs://<bucket>/<path>',
    'fs.gs.auth.type' = 'SERVICE_ACCOUNT_JSON_KEYFILE',
    'fs.gs.auth.service.account.json.keyfile' = '/path/to/service-account-.json'
);

Microsoft Azure Storage

Download paimon-azure-1.5-SNAPSHOT.jar.

Flink
Spark

info

If you have already configured azure access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-azure-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
   'type' = 'paimon',
   'warehouse' = 'wasb://<container>@<account>.blob.core.windows.net/<path>',
   'fs.azure.account.key.<account>.blob.core.windows.net' = 'yyy'
);

info

If you have already configured azure access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-azure-1.5-SNAPSHOT.jar together with paimon-spark-1.5-SNAPSHOT.jar under Spark's jars directory, and start spark-sql as follows:

spark-sql \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=wasb://<container>@<account>.blob.core.windows.net/<path> \
  --conf spark.sql.catalog.paimon.fs.azure.account.key.<account>.blob.core.windows.net=yyy

OBS

Download paimon-obs-1.5-SNAPSHOT.jar.

Flink
Spark
Hive

info

If you have already configured obs access through Flink (Via Flink FileSystem), here you can skip the following configuration.

Put paimon-obs-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'obs://<bucket>/<path>',
    'fs.obs.endpoint' = 'obs-endpoint-hostname',
    'fs.obs.access.key' = 'xxx',
    'fs.obs.secret.key' = 'yyy'
);

info

If you have already configured obs access through Spark (Via Hadoop FileSystem), here you can skip the following configuration.

Place paimon-obs-1.5-SNAPSHOT.jar together with paimon-spark-1.5-SNAPSHOT.jar under Spark's jars directory, and start spark-sql as follows:

spark-sql \
  --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions \
  --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
  --conf spark.sql.catalog.paimon.warehouse=obs://<bucket>/<path> \
  --conf spark.sql.catalog.paimon.fs.obs.endpoint=obs-endpoint-hostname \
  --conf spark.sql.catalog.paimon.fs.obs.access.key=xxx \
  --conf spark.sql.catalog.paimon.fs.obs.secret.key=yyy

info

If you have already configured obs access through Hive (Via Hadoop FileSystem), here you can skip the following configuration.

NOTE: You need to ensure that Hive metastore can access obs.

Place paimon-obs-1.5-SNAPSHOT.jar together with paimon-hive-connector-1.5-SNAPSHOT.jar under Hive's auxlib directory, and set the following options in your Hive session:

SET paimon.fs.obs.endpoint=obs-endpoint-hostname;
SET paimon.fs.obs.access.key=xxx;
SET paimon.fs.obs.secret.key=yyy;

Then read the table from the Hive metastore. The table can be created by Flink or Spark; see Catalog with Hive Metastore.

SELECT * FROM test_table;
SELECT COUNT(1) FROM test_table;
