HDFS #

You don’t need any additional dependencies to access HDFS because you have already taken care of the Hadoop dependencies.

HDFS Configuration #

For HDFS, the most important thing is to be able to read your HDFS configuration.

If you are in a Hadoop environment, you may not have to do anything. Otherwise, pick one of the following ways to configure HDFS:

  1. Set environment variable HADOOP_HOME or HADOOP_CONF_DIR.
  2. Configure 'hadoop-conf-dir' in the Paimon catalog.
  3. Configure Hadoop options with the prefix 'hadoop.' in the Paimon catalog.

The first approach is recommended.

If you do not want the configuration to be loaded from the environment variables, you can set 'hadoop-conf-loader' to 'option'.

For Hive and Spark, the HDFS configuration is available directly through the computation cluster; see the cluster configuration of Hive and Spark for details.

Hadoop-compatible file systems (HCFS) #

All Hadoop file systems are automatically available when the Hadoop libraries are on the classpath.

This way, Paimon seamlessly supports all Hadoop file systems that implement the org.apache.hadoop.fs.FileSystem interface, as well as all Hadoop-compatible file systems (HCFS), for example:

  • HDFS
  • Alluxio (see configuration specifics below)
  • XtreemFS

The Hadoop configuration must contain an entry for the required file system implementation in the core-site.xml file.

For Alluxio support, add the following entry to the core-site.xml file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

Kerberos #

The recommended approach depends on the engine:

  • Flink: it is recommended to use the Flink Kerberos keytab.
  • Spark: it is recommended to use the Spark Kerberos keytab.
  • Hive: an intuitive approach is to configure Hive’s Kerberos authentication.

Configure the following three options in your catalog configuration:

  • security.kerberos.login.keytab: Absolute path to a Kerberos keytab file that contains the user credentials. Please make sure it is copied to each machine.
  • security.kerberos.login.principal: Kerberos principal name associated with the keytab.
  • security.kerberos.login.use-ticket-cache: true or false. Indicates whether to read from your Kerberos ticket cache.

For the Java API:

SecurityContext.install(catalogOptions);

HDFS HA #

Ensure that hdfs-site.xml and core-site.xml contain the necessary HA configuration.
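
As a rough illustration, a client-side HA setup usually comes down to a handful of standard Hadoop properties. The sketch below assumes a hypothetical nameservice 'mycluster' with two NameNodes 'nn1' and 'nn2' on placeholder hosts; these names, hosts, and ports are not from your cluster, so substitute your own values.

```xml
<!-- hdfs-site.xml (sketch; 'mycluster', 'nn1', 'nn2' and the hosts are placeholders) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <!-- Lets clients find the currently active NameNode -->
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- core-site.xml (sketch): point the default file system at the nameservice -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
```

With a configuration along these lines, a warehouse path would refer to the logical nameservice (for example hdfs://mycluster/...) rather than to a single NameNode host.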

HDFS ViewFS #

Ensure that hdfs-site.xml and core-site.xml contain the necessary ViewFs configuration.
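
For orientation, a ViewFs mount table is typically declared in core-site.xml. The sketch below assumes a hypothetical mount table 'clusterX' that maps a '/warehouse' path to a placeholder NameNode address; none of these names come from an actual deployment.

```xml
<!-- core-site.xml (sketch; 'clusterX', '/warehouse' and the host are placeholders) -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <!-- Mounts viewfs://clusterX/warehouse onto a directory of the underlying HDFS cluster -->
  <name>fs.viewfs.mounttable.clusterX.link./warehouse</name>
  <value>hdfs://namenode1.example.com:8020/warehouse</value>
</property>
```

Under these assumptions, paths below viewfs://clusterX/warehouse would resolve to the mounted HDFS directory.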

Copyright © 2024 The Apache Software Foundation. Apache Paimon, Paimon, and its feather logo are trademarks of The Apache Software Foundation.