
Filesystems

Apache Paimon uses the same pluggable file systems as Apache Flink. If you use Flink as the compute engine, you can follow Flink's standard plugin mechanism to set up the plugin structure. For other engines such as Spark or Hive, however, the opt jars provided by Flink may cause class conflicts and cannot be used directly. Because resolving class conflicts is inconvenient for users, Paimon provides self-contained, engine-unified pluggable FileSystem jars so that tables can be queried from Spark or Hive.

Supported FileSystems

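| FileSystem                   | URI Scheme | Pluggable | Description                                                               |
|------------------------------|------------|-----------|---------------------------------------------------------------------------|
| Local File System            | file://    | N         | Built-in support                                                          |
| HDFS                         | hdfs://    | N         | Built-in support; ensure that the cluster is in the Hadoop environment    |
| Aliyun OSS                   | oss://     | Y         |                                                                           |
| S3                           | s3://      | Y         |                                                                           |
| Tencent Cloud Object Storage | cosn://    | Y         |                                                                           |
| Microsoft Azure Storage      | abfs://    | Y         |                                                                           |
| Huawei OBS                   | obs://     | Y         |                                                                           |
| Google Cloud Storage         | gs://      | Y         |                                                                           |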

Dependency

We recommend downloading the jar directly: Download Link.

You can also build the bundled jars manually from the source code.

To build from source code, clone the git repository.

Build the shaded jars with the following command.

mvn clean install -DskipTests

You can find the shaded jars under ./paimon-filesystems/paimon-${fs}/target/paimon-${fs}-1.5-SNAPSHOT.jar.

HDFS

You don't need any additional dependencies to access HDFS because you have already taken care of the Hadoop dependencies.

HDFS Configuration

For HDFS, the most important thing is to be able to read your HDFS configuration.

If you are in a Hadoop environment, you may not have to do anything. Otherwise, pick one of the following ways to configure HDFS:

  1. Set the environment variable HADOOP_HOME or HADOOP_CONF_DIR.
  2. Configure 'hadoop-conf-dir' in the Paimon catalog.
  3. Configure Hadoop options through the prefix 'hadoop.' in the Paimon catalog.

The first approach is recommended.

If you do not want the values from the environment variables to be included, set hadoop-conf-loader to option.
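The second and third approaches above might look like the following in Flink SQL. This is only a sketch: the catalog names, the /etc/hadoop/conf path, the warehouse placeholders, and the choice of dfs.replication as the forwarded Hadoop key are illustrative assumptions, not values taken from this page.

```sql
-- Approach 2: point the catalog at a directory that holds
-- core-site.xml and hdfs-site.xml (the path is a placeholder).
CREATE CATALOG hdfs_conf_dir_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'hdfs://<namenode>:<port>/<path>',
    'hadoop-conf-dir' = '/etc/hadoop/conf'
);

-- Approach 3: pass individual Hadoop options with the 'hadoop.' prefix.
-- 'dfs.replication' is just an example of a Hadoop key.
-- 'hadoop-conf-loader' = 'option' keeps values from HADOOP_HOME and
-- HADOOP_CONF_DIR out of the loaded configuration.
CREATE CATALOG hdfs_inline_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'hdfs://<namenode>:<port>/<path>',
    'hadoop.dfs.replication' = '2',
    'hadoop-conf-loader' = 'option'
);
```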

Paimon-Hadoop Uber package

If you want to read from and write to Paimon tables on a Hadoop cluster within an application that has no Hadoop dependencies, you can use the Paimon Hadoop uber jar.

Download paimon-hadoop-uber-1.5-SNAPSHOT.jar.

Put paimon-hadoop-uber-1.5-SNAPSHOT.jar into the classpath directory of your application.

Hadoop-compatible file systems (HCFS)

All Hadoop file systems are automatically available when the Hadoop libraries are on the classpath.

This way, Paimon seamlessly supports all Hadoop file systems that implement the org.apache.hadoop.fs.FileSystem interface, as well as all Hadoop-compatible file systems (HCFS), for example:

  • HDFS
  • Alluxio (see configuration specifics below)
  • XtreemFS

The Hadoop configuration must contain an entry for the required file system implementation in the core-site.xml file.

For Alluxio support, add the following entry to the core-site.xml file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

Kerberos

Configure the following options in your catalog configuration:

  • security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
  • security.kerberos.login.principal: Kerberos principal name associated with the keytab.

Also set the following Java system property for the program:

  • java.security.krb5.conf: Absolute path to the Kerberos configuration file. Please make sure it is copied to each machine.
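As a sketch of how the two catalog options above could be supplied from Flink SQL: the keytab path, the principal, and the warehouse placeholders below are hypothetical and must be replaced with your own values.

```sql
-- Kerberos credentials passed as catalog options.
-- The keytab file must exist at the same absolute path on every machine.
CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'hdfs://<namenode>:<port>/<path>',
    'security.kerberos.login.keytab' = '/path/to/user.keytab',
    'security.kerberos.login.principal' = 'user@EXAMPLE.COM'
);

-- java.security.krb5.conf is not a catalog option. It is a JVM system
-- property, typically passed to the program as
-- -Djava.security.krb5.conf=/path/to/krb5.conf
```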

HDFS HA

Ensure that hdfs-site.xml and core-site.xml contain the necessary HA configuration.

HDFS ViewFS

Ensure that hdfs-site.xml and core-site.xml contain the necessary ViewFs configuration.

OSS

Download paimon-oss-1.5-SNAPSHOT.jar.

Note: If you have already configured OSS access through Flink (via Flink FileSystem), you can skip the following configuration.

Put paimon-oss-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://<bucket>/<path>',
    'fs.oss.endpoint' = 'oss-cn-hangzhou.aliyuncs.com',
    'fs.oss.accessKeyId' = 'xxx',
    'fs.oss.accessKeySecret' = 'yyy'
);

If your environment already has the Jindo SDK dependencies, you can use Jindo FS to connect to OSS. Jindo has better read and write efficiency.

Download paimon-jindo-1.5-SNAPSHOT.jar.

S3

Download paimon-s3-1.5-SNAPSHOT.jar.

Note: If you have already configured S3 access through Flink (via Flink FileSystem), you can skip the following configuration.

Put paimon-s3-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://<bucket>/<path>',
    's3.endpoint' = 'your-endpoint-hostname',
    's3.access-key' = 'xxx',
    's3.secret-key' = 'yyy'
);

S3 Compliant Object Stores

The S3 file system also supports S3-compliant object stores such as MinIO, Tencent's COS, and IBM's Cloud Object Storage. Point the endpoint at your object store provider's service:

s3.endpoint: your-endpoint-hostname

Configure Path Style Access

Some S3-compliant object stores might not have virtual-host-style addressing enabled by default, for example a standalone MinIO used for testing. In such cases, you have to set the following property to enable path-style access.

s3.path.style.access: true
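Combined with the endpoint setting, a catalog for an S3-compliant store that needs path-style access might be declared as in the sketch below. It assumes a standalone MinIO instance; the catalog name is hypothetical, and the endpoint, bucket, and credentials are placeholders.

```sql
-- Catalog backed by an S3-compliant object store (for example MinIO)
-- that is addressed with path-style URLs instead of virtual hosts.
CREATE CATALOG minio_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://<bucket>/<path>',
    's3.endpoint' = 'your-endpoint-hostname',
    's3.access-key' = 'xxx',
    's3.secret-key' = 'yyy',
    's3.path.style.access' = 'true'
);
```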

S3A Performance

Tune Performance for S3AFileSystem.

If you encounter the following exception:

Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool.

Try setting the following in the catalog options: fs.s3a.connection.maximum=1000.
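In a Flink SQL catalog definition, that option would sit next to the other S3 settings, roughly as follows. This is a sketch with placeholder endpoint, bucket, and credentials; 1000 is the value suggested above, not a tuned recommendation for every workload.

```sql
-- Raise the S3A connection pool limit to avoid
-- ConnectionPoolTimeoutException under many concurrent reads and writes.
CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 's3://<bucket>/<path>',
    's3.endpoint' = 'your-endpoint-hostname',
    's3.access-key' = 'xxx',
    's3.secret-key' = 'yyy',
    'fs.s3a.connection.maximum' = '1000'
);
```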

Google Cloud Storage

Download paimon-gs-1.5-SNAPSHOT.jar.

Note: If you have already configured GCS access through Flink (via Flink FileSystem), you can skip the following configuration.

Put paimon-gs-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'gs://<bucket>/<path>',
    'fs.gs.auth.type' = 'SERVICE_ACCOUNT_JSON_KEYFILE',
    'fs.gs.auth.service.account.json.keyfile' = '/path/to/service-account-.json'
);

Microsoft Azure Storage

Download paimon-azure-1.5-SNAPSHOT.jar.

Note: If you have already configured Azure access through Flink (via Flink FileSystem), you can skip the following configuration.

Put paimon-azure-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'wasb://<container>@<account>.blob.core.windows.net/<path>',
    'fs.azure.account.key.<account>.blob.core.windows.net' = 'yyy'
);

OBS

Download paimon-obs-1.5-SNAPSHOT.jar.

Note: If you have already configured OBS access through Flink (via Flink FileSystem), you can skip the following configuration.

Put paimon-obs-1.5-SNAPSHOT.jar into the lib directory of your Flink home, and create a catalog:

CREATE CATALOG my_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'obs://<bucket>/<path>',
    'fs.obs.endpoint' = 'obs-endpoint-hostname',
    'fs.obs.access.key' = 'xxx',
    'fs.obs.secret.key' = 'yyy'
);