This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

Local Cache #

When reading files from remote storage (S3, OSS, HDFS, etc.), each seek+read goes over the network. Paimon provides a block-level local cache that transparently caches file reads, significantly reducing remote I/O for repeated access patterns.

The cache supports two modes:

  • Disk cache: when local-cache.dir is configured, blocks are cached on local disk.
  • Memory cache: when local-cache.dir is not configured, blocks are cached in memory.

Cached File Types #

The cache classifies files by type. By default, only meta and global-index types are cached. You can customize this via the local-cache.whitelist option.

| File Type | Config Name | Examples | Cached by Default |
| --- | --- | --- | --- |
| META | meta | snapshot, schema, manifest, statistics, tag | Yes |
| GLOBAL_INDEX | global-index | BTree, Lumina, Tantivy index files | Yes |
| BUCKET_INDEX | bucket-index | Hash index, deletion vector index files | No |
| DATA | data | Data files (ORC, Parquet, etc.) | No |
| FILE_INDEX | file-index | Data-file level bloom filter, bitmap | No |

All file types can be added to the whitelist. The default whitelist is meta,global-index.
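For example, to additionally cache data files, extend the whitelist when creating the catalog (the option key and values are those listed above; which types are worth caching depends on your workload and local disk capacity):

```java
// Cache data files in addition to the default meta and global-index types
options.set("local-cache.whitelist", "meta,global-index,data");
```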

Enable Cache #

This is a catalog-level option. Configure it when creating the catalog:

Java:

import org.apache.paimon.catalog.Catalog;
import org.apache.paimon.catalog.CatalogContext;
import org.apache.paimon.catalog.CatalogFactory;
import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.options.Options;
import org.apache.paimon.table.Table;
Options options = new Options();
options.set("warehouse", "s3://my-bucket/warehouse");
options.set("local-cache.enabled", "true");
// optional: use disk cache by specifying a directory
options.set("local-cache.dir", "/tmp/paimon-cache");
// optional: customize limits
options.set("local-cache.max-size", "2gb");
options.set("local-cache.block-size", "1mb");

CatalogContext context = CatalogContext.create(options);
Catalog catalog = CatalogFactory.createCatalog(context);

// All tables from this catalog will use the cache
Table table = catalog.getTable(Identifier.create("my_db", "my_table"));

Python:
import pypaimon

options = {
    "warehouse": "s3://my-bucket/warehouse",
    "local-cache.enabled": "true",
    # optional: use disk cache by specifying a directory
    "local-cache.dir": "/tmp/paimon-cache",
    # optional: customize limits
    "local-cache.max-size": "2gb",
    "local-cache.block-size": "1mb",
}

catalog = pypaimon.create_catalog(options)

# All tables from this catalog will use the cache
table = catalog.get_table("db.my_table")

Cache Options #

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| local-cache.enabled | Boolean | false | Whether to enable the local block cache for file reads. |
| local-cache.dir | String | (none) | Directory for storing cached blocks on disk. If not configured, the memory cache is used. |
| local-cache.max-size | MemorySize | unlimited | Maximum total size of the cache. When exceeded, the least recently used blocks are evicted. |
| local-cache.block-size | MemorySize | 1 mb | Block size for caching. Files are logically divided into fixed-size blocks and cached independently. |
| local-cache.whitelist | String | meta,global-index | Comma-separated list of file types to cache. Supported values: meta, global-index, bucket-index, data, file-index. |

How It Works #

  • Files are logically divided into fixed-size blocks (default 1 MB).
  • On the first read, blocks are downloaded from remote storage and cached locally (on disk or in memory).
  • Subsequent reads of the same block are served from the local cache, skipping remote I/O.
  • When using disk cache, cache files are keyed by remote file path and block offset, so they persist across process restarts and can be reused.
  • When the cache exceeds max-size, the least recently used blocks are evicted automatically.
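
As an illustrative sketch (not Paimon's actual cache classes), the block addressing implied by the steps above can be computed like this, using the default 1 MB block size:

```java
public class BlockAddressing {
    // Default block size from local-cache.block-size (1 MB)
    static final long BLOCK_SIZE = 1024 * 1024;

    // Index of the block containing a given byte offset
    static long blockIndex(long offset) {
        return offset / BLOCK_SIZE;
    }

    // Starting byte offset of a block; together with the remote
    // file path, this identifies a cached block
    static long blockStart(long index) {
        return index * BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // A read at byte 3,500,000 falls into block 3,
        // which covers bytes [3145728, 4194304)
        System.out.println(blockIndex(3_500_000L)); // 3
        System.out.println(blockStart(3));          // 3145728
    }
}
```

A read spanning a block boundary simply touches two consecutive blocks, each fetched and cached independently.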

Cache Lifecycle #

The cache is created and managed by the Catalog. All tables obtained from the same catalog share a single cache instance. The cache lives as long as the Catalog object is reachable — no explicit close is needed.

In distributed computing frameworks (Flink, Spark), the FileIO is serialized and shipped to task managers. After deserialization, the cache is not recreated — reads fall through directly to the remote storage. This is by design: the cache lifecycle is bound to the Catalog that created it, and a deserialized FileIO is no longer managed by any Catalog.

If you need caching on task managers, create a new Catalog with cache options enabled on each worker node.
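
A hypothetical sketch of that setup, reusing the catalog API from the Enable Cache section (where exactly you place this depends on your framework's task setup hook, e.g. a Flink RichFunction's open()):

```java
// Rebuild a cache-enabled catalog on the worker; a deserialized
// FileIO alone will not cache, as described above
Options options = new Options();
options.set("warehouse", "s3://my-bucket/warehouse");
options.set("local-cache.enabled", "true");
options.set("local-cache.dir", "/tmp/paimon-cache"); // worker-local disk
Catalog workerCatalog = CatalogFactory.createCatalog(CatalogContext.create(options));
```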

Copyright © 2025 The Apache Software Foundation. Apache Paimon, Paimon, and its feather logo are trademarks of The Apache Software Foundation.