System Tables
PyPaimon exposes table$<name> system tables through the existing
catalog and read-builder APIs. The supported short names are:
snapshots, schemas, options, manifests, files, partitions,
tags, and branches. Global tables under the sys database
(sys.all_tables, sys.catalog_options, ...) and the streaming
audit_log / binlog family are not exposed yet.
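
The naming rule above can be sketched as a small helper. This is an illustration rather than part of PyPaimon: the function name `system_table_identifier` and the constant `SUPPORTED_SYSTEM_TABLES` are hypothetical, and the set only mirrors the short names listed above.

```python
# Hypothetical helper, not part of PyPaimon: builds a 'db.table$name' identifier.
SUPPORTED_SYSTEM_TABLES = frozenset({
    'snapshots', 'schemas', 'options', 'manifests',
    'files', 'partitions', 'tags', 'branches',
})


def system_table_identifier(database: str, table: str, name: str) -> str:
    """Return the identifier string to pass to catalog.get_table()."""
    if name not in SUPPORTED_SYSTEM_TABLES:
        raise ValueError(
            f"unsupported system table {name!r}; "
            f"expected one of {sorted(SUPPORTED_SYSTEM_TABLES)}"
        )
    return f"{database}.{table}${name}"


print(system_table_identifier('default', 'my_table', 'snapshots'))
# default.my_table$snapshots
```

Validating the short name up front gives a clear error for names such as audit_log that are not exposed yet.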
Basic Usage
Reuse a single read builder for both the scan and the read so that any projection or limit set on it is honoured by both sides:
```python
from pypaimon import CatalogFactory

catalog = CatalogFactory.create({'warehouse': '/path/to/warehouse'})
snapshots = catalog.get_table('default.my_table$snapshots')

read_builder = snapshots.new_read_builder()
splits = read_builder.new_scan().plan().splits()
print(read_builder.new_read().to_pandas(splits))
```
with_projection and with_limit chain on the same builder:
```python
read_builder = (
    snapshots.new_read_builder()
    .with_projection(['snapshot_id', 'commit_user', 'commit_time'])
    .with_limit(10)
)
splits = read_builder.new_scan().plan().splits()
arrow_table = read_builder.new_read().to_arrow(splits)
```
The returned object exposes the regular Table surface, so the same
read builder works with to_pandas, to_arrow, to_iterator,
to_record_batch_iterator, and to_duckdb. Writes raise
NotImplementedError — system tables are read-only.
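
The single-builder pattern described above could be wrapped in one function so that the scan and the read never drift apart. This is a minimal sketch: `read_system_table` is a hypothetical name, and it relies only on the calls already shown in this section (`get_table`, `new_read_builder`, `with_projection`, `with_limit`, `new_scan`, `new_read`, `to_arrow`).

```python
def read_system_table(catalog, identifier, columns=None, limit=None):
    """Read a system table through one shared read builder.

    Hypothetical convenience wrapper, not part of PyPaimon. `catalog` is
    any object exposing get_table(), such as the catalog created above.
    """
    read_builder = catalog.get_table(identifier).new_read_builder()
    if columns is not None:
        read_builder = read_builder.with_projection(columns)
    if limit is not None:
        read_builder = read_builder.with_limit(limit)
    # Plan and read from the same builder so projection and limit apply to both.
    splits = read_builder.new_scan().plan().splits()
    return read_builder.new_read().to_arrow(splits)


# Usage, assuming `catalog` was created as shown above:
# arrow_table = read_system_table(
#     catalog, 'default.my_table$snapshots',
#     columns=['snapshot_id', 'commit_time'], limit=10)
```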
Available Tables
Each system table is listed below with its column layout (including
nullability) and primary-key choice. Tables are listed in the order
they appear in SystemTableLoader.
$snapshots
One row per persisted snapshot.
| Column | Type | Notes |
|---|---|---|
| snapshot_id | BIGINT NOT NULL | Primary key |
| schema_id | BIGINT NOT NULL | |
| commit_user | STRING NOT NULL | |
| commit_identifier | BIGINT NOT NULL | |
| commit_kind | STRING NOT NULL | APPEND, COMPACT, ... |
| commit_time | TIMESTAMP(3) NOT NULL | |
| base_manifest_list | STRING NOT NULL | |
| delta_manifest_list | STRING NOT NULL | |
| changelog_manifest_list | STRING | |
| total_record_count | BIGINT | |
| delta_record_count | BIGINT | |
| changelog_record_count | BIGINT | |
| watermark | BIGINT | |
| next_row_id | BIGINT | |
$schemas
Every committed schema version, with fields / partition_keys /
primary_keys / options encoded as compact JSON strings.
| Column | Type | Notes |
|---|---|---|
| schema_id | BIGINT NOT NULL | Primary key |
| fields | STRING NOT NULL | JSON |
| partition_keys | STRING NOT NULL | JSON list |
| primary_keys | STRING NOT NULL | JSON list |
| options | STRING NOT NULL | JSON map |
| comment | STRING | |
| update_time | TIMESTAMP(3) NOT NULL | |
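
Because the JSON columns arrive as plain strings, they need decoding before use. The sketch below shows one way to do that with the standard `json` module. The sample row is made up for illustration, and the shape of the entries inside `fields` is an assumption, not something this page specifies.

```python
import json

# Hypothetical $schemas row; all values are invented for illustration.
schema_row = {
    'schema_id': 0,
    'fields': '[{"id":0,"name":"id","type":"BIGINT NOT NULL"}]',
    'partition_keys': '["dt"]',
    'primary_keys': '["dt","id"]',
    'options': '{"bucket":"2"}',
    'comment': None,
}

# The four columns documented above as JSON-encoded strings.
JSON_COLUMNS = ('fields', 'partition_keys', 'primary_keys', 'options')


def decode_schema_row(row):
    """Return a copy of the row with its JSON string columns parsed."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        decoded[column] = json.loads(row[column])
    return decoded


decoded = decode_schema_row(schema_row)
print(decoded['primary_keys'])       # ['dt', 'id']
print(decoded['options']['bucket'])  # 2
```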
$options
Two columns echoing the active table options.
| Column | Type | Notes |
|---|---|---|
| key | STRING NOT NULL | Primary key |
| value | STRING NOT NULL | |
$manifests
Manifest list for the latest snapshot.
| Column | Type | Notes |
|---|---|---|
| file_name | STRING NOT NULL | Primary key |
| file_size | BIGINT NOT NULL | |
| num_added_files | BIGINT NOT NULL | |
| num_deleted_files | BIGINT NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| min_partition_stats | STRING | Placeholder (see Limitations) |
| max_partition_stats | STRING | Placeholder (see Limitations) |
| min_row_id | BIGINT | |
| max_row_id | BIGINT | |
$files
One row per ADD entry surviving the latest snapshot. Stats columns are
compact JSON dictionaries keyed by column name. The wire name
deleteRowCount is intentionally camelCase.
| Column | Type | Notes |
|---|---|---|
| partition | STRING | pt=v/pt2=v2 |
| bucket | INT NOT NULL | |
| file_path | STRING NOT NULL | Primary key |
| file_format | STRING NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| level | INT NOT NULL | |
| record_count | BIGINT NOT NULL | |
| file_size_in_bytes | BIGINT NOT NULL | |
| min_key | STRING | JSON list (PK tables only) |
| max_key | STRING | JSON list (PK tables only) |
| null_value_counts | STRING NOT NULL | JSON map |
| min_value_stats | STRING NOT NULL | JSON map |
| max_value_stats | STRING NOT NULL | JSON map |
| min_sequence_number | BIGINT | |
| max_sequence_number | BIGINT | |
| creation_time | TIMESTAMP(3) | |
| deleteRowCount | BIGINT | camelCase wire name |
| file_source | STRING | |
| first_row_id | BIGINT | |
| write_cols | ARRAY | |
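
The three stats columns are JSON maps keyed by column name, so they can be merged into one per-column view. The following is a sketch under assumptions: the sample row and the file path are invented, and the value types inside the maps (numbers versus strings) are assumed rather than documented here.

```python
import json

# Hypothetical $files row; only the columns used below are included.
file_row = {
    'file_path': 'data-0.parquet',
    'null_value_counts': '{"id":0,"name":3}',
    'min_value_stats': '{"id":1,"name":"a"}',
    'max_value_stats': '{"id":9,"name":"z"}',
}


def column_stats(row):
    """Merge the three JSON stats maps into {column: {nulls, min, max}}."""
    nulls = json.loads(row['null_value_counts'])
    mins = json.loads(row['min_value_stats'])
    maxs = json.loads(row['max_value_stats'])
    names = sorted(set(nulls) | set(mins) | set(maxs))
    return {
        name: {
            'nulls': nulls.get(name),
            'min': mins.get(name),
            'max': maxs.get(name),
        }
        for name in names
    }


stats = column_stats(file_row)
print(stats['name'])  # {'nulls': 3, 'min': 'a', 'max': 'z'}
```

Using `.get()` keeps the merge tolerant of a column that appears in one map but not in another.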
$partitions
Aggregated partition statistics for the latest snapshot.
| Column | Type | Notes |
|---|---|---|
| partition | STRING | pt=v/pt2=v2; primary key |
| record_count | BIGINT NOT NULL | |
| file_size_in_bytes | BIGINT NOT NULL | |
| file_count | BIGINT NOT NULL | |
| last_update_time | TIMESTAMP(3) | |
| created_at | TIMESTAMP(3) | Filesystem path returns NULL |
| created_by | STRING | Filesystem path returns NULL |
| updated_by | STRING | Filesystem path returns NULL |
| options | STRING | Filesystem path returns NULL |
| total_buckets | INT NOT NULL | |
| done | BOOLEAN NOT NULL | Filesystem path returns False |
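
The `partition` column uses the pt=v/pt2=v2 form, which can be split back into key/value pairs on the client. This is a minimal sketch with a hypothetical helper name, `parse_partition`. It assumes partition values contain no `/` characters and that an empty or NULL value means the table is unpartitioned; this page does not state how special characters are escaped.

```python
def parse_partition(spec):
    """Split a 'pt=v/pt2=v2' partition string into an ordered dict.

    Hypothetical helper. Returns {} for None or '' (assumed to mean an
    unpartitioned table).
    """
    if not spec:
        return {}
    result = {}
    for part in spec.split('/'):
        key, sep, value = part.partition('=')
        if not sep:
            raise ValueError(f"malformed partition segment: {part!r}")
        result[key] = value
    return result


print(parse_partition('pt=v/pt2=v2'))  # {'pt': 'v', 'pt2': 'v2'}
```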
$tags
Snapshot metadata for every tag.
| Column | Type | Notes |
|---|---|---|
| tag_name | STRING NOT NULL | Primary key |
| snapshot_id | BIGINT NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| commit_time | TIMESTAMP(3) NOT NULL | |
| record_count | BIGINT | |
| create_time | TIMESTAMP(3) | Currently emitted as NULL |
| time_retained | STRING | Currently emitted as NULL |
$branches
Every named branch with the branch directory's modification time.
| Column | Type | Notes |
|---|---|---|
| branch_name | STRING NOT NULL | Primary key |
| create_time | TIMESTAMP(3) NOT NULL | |
Limitations
- Predicate pushdown is not yet implemented. Calling with_filter(...) is accepted, but invoking new_read() later will raise NotImplementedError rather than silently dropping the predicate. Filter the resulting Arrow table / DataFrame on the client side instead.
- min_partition_stats / max_partition_stats in $manifests are emitted as NULL. PyPaimon does not yet ship a helper that casts a partition row to its string form.
- tag.time_retained and tag.create_time are NULL. PyPaimon's Tag dataclass does not yet carry these fields, matching FileSystemCatalog.get_tag's current behaviour.
- branch.create_time falls back to epoch 0 when the underlying store cannot provide an mtime (some remote object stores via PyArrowFileIO). Local filesystem catalogs always populate the real time.
- partitions.created_at / created_by / updated_by / options / done are filled with placeholders for the filesystem path. REST-managed catalogs that expose those fields will be wired in a follow-up.
- list_tables does not enumerate system tables. System tables remain accessible through get_table('db.t$name').
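
Since predicates are not pushed down, filtering happens after the read. The sketch below works on plain row dictionaries, which is the shape PyArrow's `Table.to_pylist()` returns. The helper name `filter_rows` and the sample rows are hypothetical.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Hypothetical rows, shaped like arrow_table.to_pylist() for $snapshots.
rows = [
    {'snapshot_id': 1, 'commit_kind': 'APPEND'},
    {'snapshot_id': 2, 'commit_kind': 'COMPACT'},
    {'snapshot_id': 3, 'commit_kind': 'APPEND'},
]

appends = filter_rows(rows, lambda row: row['commit_kind'] == 'APPEND')
print([row['snapshot_id'] for row in appends])  # [1, 3]
```

For larger results it is likely cheaper to filter the Arrow table or DataFrame directly with that library's own filtering API than to materialise Python dictionaries first.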
Supported via Catalogs
- FileSystemCatalog: fully supported.
- RESTCatalog: fully supported; columns that depend on catalog metadata (such as $partitions.created_by) are populated via the REST API where the server exposes them.