System Tables

PyPaimon exposes table$<name> system tables through the existing catalog and read-builder APIs. The supported short names are: snapshots, schemas, options, manifests, files, partitions, tags, and branches. Global tables under the sys database (sys.all_tables, sys.catalog_options, ...) and the streaming audit_log / binlog family are not exposed yet.
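
The naming rule above can be sketched as a small helper that builds the identifier passed to get_table. The function name system_table_identifier and the constant SUPPORTED_SYSTEM_TABLES are hypothetical and not part of PyPaimon; only the list of short names comes from this page.

```python
# Short names listed in this section; anything else is rejected up front.
SUPPORTED_SYSTEM_TABLES = frozenset({
    "snapshots", "schemas", "options", "manifests",
    "files", "partitions", "tags", "branches",
})


def system_table_identifier(table_identifier: str, short_name: str) -> str:
    """Build 'db.table$name' for one of the supported system tables.

    Hypothetical helper for illustration; it is not part of PyPaimon.
    """
    if short_name not in SUPPORTED_SYSTEM_TABLES:
        raise ValueError(f"unsupported system table: {short_name!r}")
    if "$" in table_identifier:
        raise ValueError(f"already a system table identifier: {table_identifier!r}")
    return f"{table_identifier}${short_name}"


identifier = system_table_identifier("default.my_table", "snapshots")
```

The result can be handed to catalog.get_table(...). Rejecting unsupported names early gives a clearer error for names such as audit_log that are not exposed yet.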

Basic Usage

Reuse a single read builder for both the scan and the read so that any projection or limit set on it is honoured by both sides:

from pypaimon import CatalogFactory

catalog = CatalogFactory.create({'warehouse': '/path/to/warehouse'})
snapshots = catalog.get_table('default.my_table$snapshots')

read_builder = snapshots.new_read_builder()
splits = read_builder.new_scan().plan().splits()
print(read_builder.new_read().to_pandas(splits))

with_projection and with_limit chain on the same builder:

read_builder = (
    snapshots.new_read_builder()
    .with_projection(['snapshot_id', 'commit_user', 'commit_time'])
    .with_limit(10)
)
splits = read_builder.new_scan().plan().splits()
arrow_table = read_builder.new_read().to_arrow(splits)

The returned object exposes the regular Table surface, so the same read builder works with to_pandas, to_arrow, to_iterator, to_record_batch_iterator, and to_duckdb. Writes raise NotImplementedError — system tables are read-only.

Available Tables

Each system table is listed below with its column layout (including nullability) and primary-key choice. Tables are listed in the order they appear in SystemTableLoader.

$snapshots

One row per persisted snapshot.

| Column | Type | Notes |
| --- | --- | --- |
| snapshot_id | BIGINT NOT NULL | Primary key |
| schema_id | BIGINT NOT NULL | |
| commit_user | STRING NOT NULL | |
| commit_identifier | BIGINT NOT NULL | |
| commit_kind | STRING NOT NULL | APPEND, COMPACT, ... |
| commit_time | TIMESTAMP(3) NOT NULL | |
| base_manifest_list | STRING NOT NULL | |
| delta_manifest_list | STRING NOT NULL | |
| changelog_manifest_list | STRING | |
| total_record_count | BIGINT | |
| delta_record_count | BIGINT | |
| changelog_record_count | BIGINT | |
| watermark | BIGINT | |
| next_row_id | BIGINT | |

$schemas

Every committed schema version, with fields / partition_keys / primary_keys / options encoded as compact JSON strings.

| Column | Type | Notes |
| --- | --- | --- |
| schema_id | BIGINT NOT NULL | Primary key |
| fields | STRING NOT NULL | JSON |
| partition_keys | STRING NOT NULL | JSON list |
| primary_keys | STRING NOT NULL | JSON list |
| options | STRING NOT NULL | JSON map |
| comment | STRING | |
| update_time | TIMESTAMP(3) NOT NULL | |
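
Because the JSON-encoded columns arrive as plain strings, a reader usually decodes them before use. The following is a minimal sketch using only the standard library. The sample row is made up, and the exact shape of the JSON payloads (especially fields) is an assumption rather than something taken from PyPaimon.

```python
import json

# Made-up row for illustration; real payloads may have a different shape.
schema_row = {
    "schema_id": 0,
    "fields": '[{"id":0,"name":"id","type":"BIGINT NOT NULL"}]',
    "partition_keys": '["dt"]',
    "primary_keys": '["id","dt"]',
    "options": '{"bucket":"4"}',
    "comment": None,
}

# Columns documented above as JSON-encoded strings.
JSON_COLUMNS = ("fields", "partition_keys", "primary_keys", "options")


def decode_schema_row(row: dict) -> dict:
    """Return a copy of the row with the JSON string columns parsed."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        decoded[column] = json.loads(row[column])
    return decoded


decoded = decode_schema_row(schema_row)
```

Non-JSON columns such as comment pass through unchanged, so a NULL comment stays None.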

$options

One row per active table option, exposed as a key/value pair.

| Column | Type | Notes |
| --- | --- | --- |
| key | STRING NOT NULL | Primary key |
| value | STRING NOT NULL | |

$manifests

Manifest list for the latest snapshot.

| Column | Type | Notes |
| --- | --- | --- |
| file_name | STRING NOT NULL | Primary key |
| file_size | BIGINT NOT NULL | |
| num_added_files | BIGINT NOT NULL | |
| num_deleted_files | BIGINT NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| min_partition_stats | STRING | Placeholder (see Limitations) |
| max_partition_stats | STRING | Placeholder (see Limitations) |
| min_row_id | BIGINT | |
| max_row_id | BIGINT | |

$files

One row per ADD entry surviving the latest snapshot. Stats columns are compact JSON dictionaries keyed by column name. The wire name deleteRowCount is intentionally camelCase.

| Column | Type | Notes |
| --- | --- | --- |
| partition | STRING | pt=v/pt2=v2 |
| bucket | INT NOT NULL | |
| file_path | STRING NOT NULL | Primary key |
| file_format | STRING NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| level | INT NOT NULL | |
| record_count | BIGINT NOT NULL | |
| file_size_in_bytes | BIGINT NOT NULL | |
| min_key | STRING | JSON list (PK tables only) |
| max_key | STRING | JSON list (PK tables only) |
| null_value_counts | STRING NOT NULL | JSON map |
| min_value_stats | STRING NOT NULL | JSON map |
| max_value_stats | STRING NOT NULL | JSON map |
| min_sequence_number | BIGINT | |
| max_sequence_number | BIGINT | |
| creation_time | TIMESTAMP(3) | |
| deleteRowCount | BIGINT | camelCase wire name |
| file_source | STRING | |
| first_row_id | BIGINT | |
| write_cols | ARRAY | |
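
The partition string and the JSON stats columns both need a little decoding on the client. The sketch below shows one way to do it with the standard library. The helper names parse_partition and decode_file_stats are hypothetical, the sample row is made up, and the code assumes partition values contain no "/" and that every segment has the form key=value.

```python
import json
from typing import Optional

# Columns documented above as JSON maps keyed by column name.
STATS_COLUMNS = ("null_value_counts", "min_value_stats", "max_value_stats")


def parse_partition(partition: Optional[str]) -> dict:
    """Split 'pt=v/pt2=v2' into {'pt': 'v', 'pt2': 'v2'}; values stay strings."""
    if not partition:
        # The partition column is nullable; treat NULL or empty as "no keys".
        return {}
    return dict(part.split("=", 1) for part in partition.split("/"))


def decode_file_stats(row: dict) -> dict:
    """Parse the three JSON stats columns of a $files row."""
    return {column: json.loads(row[column]) for column in STATS_COLUMNS}


# Made-up row for illustration only.
file_row = {
    "partition": "dt=2024-01-01/region=eu",
    "null_value_counts": '{"id":0,"name":2}',
    "min_value_stats": '{"id":1,"name":"a"}',
    "max_value_stats": '{"id":9,"name":"z"}',
}

partition = parse_partition(file_row["partition"])
stats = decode_file_stats(file_row)
```

The same parse_partition helper applies to the partition column of $partitions, which uses the same pt=v/pt2=v2 form.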

$partitions

Aggregated partition statistics for the latest snapshot.

| Column | Type | Notes |
| --- | --- | --- |
| partition | STRING | pt=v/pt2=v2; primary key |
| record_count | BIGINT NOT NULL | |
| file_size_in_bytes | BIGINT NOT NULL | |
| file_count | BIGINT NOT NULL | |
| last_update_time | TIMESTAMP(3) | |
| created_at | TIMESTAMP(3) | Filesystem path returns NULL |
| created_by | STRING | Filesystem path returns NULL |
| updated_by | STRING | Filesystem path returns NULL |
| options | STRING | Filesystem path returns NULL |
| total_buckets | INT NOT NULL | |
| done | BOOLEAN NOT NULL | Filesystem path returns False |

$tags

Snapshot metadata for every tag.

| Column | Type | Notes |
| --- | --- | --- |
| tag_name | STRING NOT NULL | Primary key |
| snapshot_id | BIGINT NOT NULL | |
| schema_id | BIGINT NOT NULL | |
| commit_time | TIMESTAMP(3) NOT NULL | |
| record_count | BIGINT | |
| create_time | TIMESTAMP(3) | Currently emitted as NULL |
| time_retained | STRING | Currently emitted as NULL |

$branches

One row per named branch; create_time is the branch directory's modification time.

| Column | Type | Notes |
| --- | --- | --- |
| branch_name | STRING NOT NULL | Primary key |
| create_time | TIMESTAMP(3) NOT NULL | |
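
As the Limitations section notes, create_time falls back to epoch 0 when the store cannot provide a modification time, so a reader may want to treat that value as "unknown". The sketch below is one hedged way to do so. The helper name known_create_time is hypothetical, and the code assumes the fallback surfaces as 1970-01-01 00:00:00 (naive, or timezone-aware and equal to that instant in UTC).

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed rendering of the epoch-0 fallback as a naive timestamp.
EPOCH = datetime(1970, 1, 1)


def known_create_time(create_time: datetime) -> Optional[datetime]:
    """Return None for the epoch-0 fallback, otherwise the value unchanged."""
    if create_time.tzinfo is not None:
        # Normalise aware values to naive UTC before comparing.
        comparable = create_time.astimezone(timezone.utc).replace(tzinfo=None)
    else:
        comparable = create_time
    return None if comparable == EPOCH else create_time


fallback = known_create_time(datetime(1970, 1, 1))
real = known_create_time(datetime(2024, 5, 1, 12, 0))
```

Mapping the fallback to None avoids reporting a branch as created in 1970 when the real time is simply unavailable.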

Limitations

  • Predicate pushdown is not yet implemented. Calling with_filter(...) is accepted, but invoking new_read() later will raise NotImplementedError rather than silently dropping the predicate. Filter the resulting Arrow table / DataFrame on the client side instead.
  • min_partition_stats / max_partition_stats in $manifests are emitted as NULL. PyPaimon does not yet ship a helper that casts a partition row to its string form.
  • tag.time_retained and tag.create_time are NULL. PyPaimon's Tag dataclass does not yet carry these fields — matching FileSystemCatalog.get_tag's current behaviour.
  • branch.create_time falls back to epoch 0 when the underlying store cannot provide an mtime (some remote object stores via PyArrowFileIO). Local filesystem catalogs always populate the real time.
  • partitions.created_at / created_by / updated_by / options / done are filled with placeholders for the filesystem path. REST-managed catalogs that expose those fields will be wired in a follow-up.
  • list_tables does not enumerate system tables. System tables remain accessible through get_table('db.t$name').
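
The first limitation above recommends filtering on the client side. A minimal sketch of that approach follows, working on plain Python rows such as those returned by pyarrow's Table.to_pylist(). The sample rows are made up and the helper name filter_snapshots is hypothetical. With a real Arrow table or DataFrame, the library's own filtering facilities would be the more natural choice.

```python
from datetime import datetime

# Stand-in for arrow_table.to_pylist() on a $snapshots read; values are made up.
snapshot_rows = [
    {"snapshot_id": 1, "commit_kind": "APPEND", "commit_time": datetime(2024, 1, 1, 8, 0)},
    {"snapshot_id": 2, "commit_kind": "COMPACT", "commit_time": datetime(2024, 1, 1, 9, 0)},
    {"snapshot_id": 3, "commit_kind": "APPEND", "commit_time": datetime(2024, 1, 2, 8, 0)},
]


def filter_snapshots(rows, commit_kind, since):
    """Keep rows of the given commit kind committed at or after `since`."""
    return [
        row for row in rows
        if row["commit_kind"] == commit_kind and row["commit_time"] >= since
    ]


recent_appends = filter_snapshots(snapshot_rows, "APPEND", datetime(2024, 1, 2))
```

Since no predicate reaches the scan, the full system table is read first; combining this with with_projection keeps the transferred columns to a minimum.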

Supported Catalogs

  • FileSystemCatalog: fully supported.
  • RESTCatalog: fully supported; columns that depend on catalog metadata (such as $partitions.created_by) are populated via the REST API where the server exposes them.