Local Cache #
When reading files from remote storage (S3, OSS, HDFS, etc.), each seek+read goes over the network. Paimon provides a block-level local cache that transparently caches file reads, significantly reducing remote I/O for repeated access patterns.
The cache supports two modes:
- Disk cache: when local-cache.dir is configured, blocks are cached on local disk.
- Memory cache: when local-cache.dir is not configured, blocks are cached in memory.
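As a minimal sketch (using the same Options API shown in the Enable Cache section below), the two modes differ only in whether local-cache.dir is set:
import org.apache.paimon.options.Options;
Options memoryCacheOptions = new Options();
memoryCacheOptions.set("local-cache.enabled", "true");         // no local-cache.dir: blocks cached in memory
Options diskCacheOptions = new Options();
diskCacheOptions.set("local-cache.enabled", "true");
diskCacheOptions.set("local-cache.dir", "/tmp/paimon-cache");  // directory set: blocks cached on local disk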
Cached File Types #
The cache classifies files by type. By default, only meta and global-index types are cached. You can customize this via the local-cache.whitelist option.
| File Type | Config Name | Examples | Default Cached |
|---|---|---|---|
| META | meta | snapshot, schema, manifest, statistics, tag | Yes |
| GLOBAL_INDEX | global-index | BTree, Lumina, Tantivy index files | Yes |
| BUCKET_INDEX | bucket-index | Hash, deletion vector index files | No |
| DATA | data | Data files (ORC, Parquet, etc.) | No |
| FILE_INDEX | file-index | Data-file level bloom filter, bitmap | No |
All file types can be added to the whitelist. The default whitelist is meta,global-index.
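For example, to also cache data files, extend the default whitelist (note that data files can be large, so size the cache accordingly):
import org.apache.paimon.options.Options;
Options options = new Options();
// default whitelist is meta,global-index; append data to cache data files as well
options.set("local-cache.whitelist", "meta,global-index,data");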
Enable Cache #
This is a catalog-level option. Configure it when creating the catalog:
Java:
import org.apache.paimon.catalog.Catalog;
import org.apache.paimon.catalog.CatalogContext;
import org.apache.paimon.catalog.CatalogFactory;
import org.apache.paimon.catalog.Identifier;
import org.apache.paimon.options.Options;
import org.apache.paimon.table.Table;
Options options = new Options();
options.set("warehouse", "s3://my-bucket/warehouse");
options.set("local-cache.enabled", "true");
// optional: use disk cache by specifying a directory
options.set("local-cache.dir", "/tmp/paimon-cache");
// optional: customize limits
options.set("local-cache.max-size", "2gb");
options.set("local-cache.block-size", "1mb");
CatalogContext context = CatalogContext.create(options);
Catalog catalog = CatalogFactory.createCatalog(context);
// All tables from this catalog will use the cache
Table table = catalog.getTable(Identifier.create("my_db", "my_table"));
Python:
import pypaimon
options = {
"warehouse": "s3://my-bucket/warehouse",
"local-cache.enabled": "true",
# optional: use disk cache by specifying a directory
"local-cache.dir": "/tmp/paimon-cache",
# optional: customize limits
"local-cache.max-size": "2gb",
"local-cache.block-size": "1mb",
}
catalog = pypaimon.create_catalog(options)
# All tables from this catalog will use the cache
table = catalog.get_table("db.my_table")
Cache Options #
| Option | Type | Default | Description |
|---|---|---|---|
| local-cache.enabled | Boolean | false | Whether to enable local block cache for file reads. |
| local-cache.dir | String | (none) | Directory for storing cached blocks on disk. If not configured, memory cache is used. |
| local-cache.max-size | MemorySize | unlimited | Maximum total size of the cache. When exceeded, the least recently used blocks are evicted. |
| local-cache.block-size | MemorySize | 1 mb | Block size for caching. Files are logically divided into fixed-size blocks and cached independently. |
| local-cache.whitelist | String | meta,global-index | Comma-separated list of file types to cache. Supported values: meta, global-index, bucket-index, data, file-index. |
How It Works #
- Files are logically divided into fixed-size blocks (default 1 MB).
- On the first read, blocks are downloaded from remote storage and cached locally (on disk or in memory).
- Subsequent reads of the same block are served from the local cache, skipping remote I/O.
- When using disk cache, cache files are keyed by remote file path and block offset, so they persist across process restarts and can be reused.
- When the cache exceeds max-size, the least recently used blocks are evicted automatically.
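The sketch below illustrates this behavior. The class and method names are hypothetical and do not correspond to Paimon's internal classes; it only shows how a positional read maps to a fixed-size block, how the cache key combines file path and block offset, and how least-recently-used blocks are evicted once a limit is reached (here approximated by a block count):
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only, not Paimon's internal implementation.
public class BlockCacheSketch {

    static final long BLOCK_SIZE = 1 << 20; // 1 MB, mirrors local-cache.block-size
    static final int MAX_BLOCKS = 2048;     // ~2 GB at 1 MB blocks, stands in for local-cache.max-size

    // access-order LinkedHashMap gives simple least-recently-used eviction
    private final Map<String, byte[]> blocks =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                    return size() > MAX_BLOCKS;
                }
            };

    // A read at an arbitrary position is served from the block containing it.
    public byte[] readBlock(String remotePath, long position) {
        long blockOffset = (position / BLOCK_SIZE) * BLOCK_SIZE;
        String key = remotePath + "@" + blockOffset; // keyed by remote path and block offset
        return blocks.computeIfAbsent(key, k -> downloadBlock(remotePath, blockOffset));
    }

    private byte[] downloadBlock(String remotePath, long blockOffset) {
        // placeholder for a remote range read, e.g. an object-store GET with a Range header
        return new byte[(int) BLOCK_SIZE];
    }
}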
Cache Lifecycle #
The cache is created and managed by the Catalog. All tables obtained from the same catalog share a single cache instance. The cache lives as long as the Catalog object is reachable — no explicit close is needed.
In distributed computing frameworks (Flink, Spark), the FileIO is serialized and shipped to task managers. After deserialization, the cache is not recreated — reads fall through directly to the remote storage. This is by design: the cache lifecycle is bound to the Catalog that created it, and a deserialized FileIO is no longer managed by any Catalog.
If you need caching on task managers, create a new Catalog with cache options enabled on each worker node.
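A minimal sketch of that pattern, reusing the imports from the Java example above; how the options reach the worker (for example through your framework's initialization hook) depends on the engine:
// Recreate a cache-enabled catalog on the worker node.
Options workerOptions = new Options();
workerOptions.set("warehouse", "s3://my-bucket/warehouse");
workerOptions.set("local-cache.enabled", "true");
workerOptions.set("local-cache.dir", "/tmp/paimon-cache"); // local path on the worker

Catalog workerCatalog = CatalogFactory.createCatalog(CatalogContext.create(workerOptions));
// Tables obtained from this catalog on the worker will use its local cache.
Table table = workerCatalog.getTable(Identifier.create("my_db", "my_table"));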