CLI
Inspect Mosaic files from the terminal with the mosaic binary, a native, JVM-free toolkit built on the read-only MosaicReader API.
Install
# run from source
cargo run -p paimon-mosaic-cli -- schema data.mosaic
# install the `mosaic` binary
cargo install --path cli
mosaic schema data.mosaic
Commands
| Command | Shows | Reads |
|---|---|---|
| schema | column names, Arrow types, nullability, bucket | footer only |
| meta | row groups, rows, per-column stats | footer + index |
| footer | magic, version, buckets, compression | footer only |
| buckets | per-bucket layout and member columns | footer + index |
| pages | per-column encoding + slot size | bucket data |
| dictionary | dictionary entries of a dict column | bucket data |
| column-size | on-disk bytes per column | footer + index + paged directories |
| cat / head | cat: all rows by default (-n to limit); head: first 10; -c, --where | column data |
| count | total row count | footer + index |
| convert | import CSV or JSON lines into a new Mosaic file | writes file |
Inspection and query commands accept --json (convert writes a file). cat scans all rows by default (-n to limit); head prints 10 rows by default. cat/head/pages/column-size take -c a,b; dictionary takes -c <col>.
schema
Columns, Arrow types, nullability and bucket assignment, in original input order. Footer only.
$ mosaic schema data.mosaic
5 columns, 4 buckets
id: Int32 not null [bucket 0]
name: Utf8 [bucket 2]
kind: Utf8 [bucket 1]
score: Float64 [bucket 3]
flag: Int32 [bucket 0]
$ mosaic schema data.mosaic --json
{"columns":5,"buckets":4,"fields":[{"name":"id","type":"Int32","nullable":false,"bucket":0}, ...]}
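For scripting, the --json form is easier to consume than the text listing. The following is a minimal sketch that groups columns by bucket from that output. The sample string is an assumption: the source elides every field after `id`, so the remaining entries are reconstructed from the text listing above, and `columns_by_bucket` is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Shaped like the `mosaic schema --json` line above. Only the `id` entry is
# shown in the source; the other four are ASSUMED from the text listing.
SCHEMA_JSON = """
{"columns":5,"buckets":4,"fields":[
 {"name":"id","type":"Int32","nullable":false,"bucket":0},
 {"name":"name","type":"Utf8","nullable":true,"bucket":2},
 {"name":"kind","type":"Utf8","nullable":true,"bucket":1},
 {"name":"score","type":"Float64","nullable":true,"bucket":3},
 {"name":"flag","type":"Int32","nullable":true,"bucket":0}]}
"""


def columns_by_bucket(schema_text):
    """Map bucket index -> column names, in the order the schema lists them."""
    schema = json.loads(schema_text)
    grouped = defaultdict(list)
    for field in schema["fields"]:
        grouped[field["bucket"]].append(field["name"])
    return dict(grouped)


if __name__ == "__main__":
    for bucket, names in sorted(columns_by_bucket(SCHEMA_JSON).items()):
        print(f"bucket {bucket}: {', '.join(names)}")
```

In a real script the text would come from the captured standard output of `mosaic schema data.mosaic --json` rather than from a literal.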
meta
Total rows, row groups, and per-column stats (null count / min / max) for columns configured with stats.
$ mosaic meta data.mosaic
file: 200 rows, 5 columns, 4 buckets, 1 row groups
row group 0: 200 rows
id: nulls=0 min=0 max=199
score: nulls=0 min=0 max=298.5
footer
The 32-byte file footer: magic, format version, bucket count, row groups and compression.
$ mosaic footer data.mosaic
magic=MOSA version=1 buckets=4 row_groups=1 compression=zstd
buckets
Per row group, each bucket's layout (empty / monolithic / paged), on-disk size and member columns. Mosaic groups columns into buckets by name order. Monolithic buckets also report uncompressed size and ratio.
$ mosaic buckets data.mosaic
row group 0:
bucket 0: paged 373B [flag, id]
bucket 1: monolithic 27B (uncompressed 59 B, 2.19x) [kind]
bucket 2: paged 220B [name]
bucket 3: paged 542B [score]
pages
Per-column physical encoding (plain / const / dict / all_null) and on-disk slot size.
$ mosaic pages data.mosaic
row group 0:
flag: bucket 0 encoding=const slot=16B
id: bucket 0 encoding=plain slot=349B
kind: bucket 1 encoding=dict slot=28B
name: bucket 2 encoding=plain slot=216B
score: bucket 3 encoding=plain slot=538B
dictionary
Dump the dictionary of a dict-encoded column. Non-dict columns report as such.
$ mosaic dictionary data.mosaic -c kind
row group 0: 3 entries
0: a
1: b
2: c
$ mosaic dictionary data.mosaic -c kind --json
{"column":"kind","row_groups":[["a","b","c"]]}
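As a rough illustration of what a dict-encoded column holds, the sketch below reads the JSON shape shown above and demonstrates dictionary encoding in general: each distinct value is stored once, and rows refer to it by integer code. The `encode` and `decode` helpers and the code values are hypothetical; they illustrate the idea only and say nothing about Mosaic's on-disk layout.

```python
import json

# Exactly the line printed by `mosaic dictionary data.mosaic -c kind --json`.
DICT_JSON = '{"column":"kind","row_groups":[["a","b","c"]]}'


def dictionary_entries(text):
    """Return (column name, list of per-row-group entry lists)."""
    doc = json.loads(text)
    return doc["column"], doc["row_groups"]


def encode(values):
    """Dictionary-encode values: distinct entries in first-seen order, plus codes."""
    entries, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(entries)
            entries.append(value)
        codes.append(index[value])
    return entries, codes


def decode(codes, entries):
    """Turn integer codes back into values by looking them up in the dictionary."""
    return [entries[code] for code in codes]


if __name__ == "__main__":
    column, groups = dictionary_entries(DICT_JSON)
    for i, entries in enumerate(groups):
        print(f"{column}, row group {i}: {len(entries)} entries")
```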
column-size
On-disk bytes per column. Paged buckets give exact per-column sizes; a multi-column monolithic bucket is split evenly and marked (approx).
$ mosaic column-size data.mosaic -c id,name,kind
id: 349 B
name: 216 B
kind: 28 B
total: 593 B
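The attribution rule described above can be sketched as follows. This is an assumption-laden illustration, not the CLI's code: the bucket dictionaries are a hypothetical input shape, the byte counts in the usage are made up, and rounding the even split down to whole bytes is a guess.

```python
def column_sizes(buckets):
    """Return {column: (bytes, is_approx)} for a list of bucket descriptions.

    Hypothetical input shapes (not the CLI's own output):
      {"layout": "paged", "columns": {"x": 120, "y": 40}}          exact per-column bytes
      {"layout": "monolithic", "size": 60, "columns": ["a", "b"]}  one size for the bucket
      {"layout": "empty", "columns": []}
    """
    sizes = {}
    for bucket in buckets:
        if bucket["layout"] == "paged":
            # Paged buckets record each column's bytes, so sizes are exact.
            for name, nbytes in bucket["columns"].items():
                sizes[name] = (nbytes, False)
        elif bucket["layout"] == "monolithic":
            names = bucket["columns"]
            if not names:
                continue
            # One blob for all members: split evenly (rounded down, an
            # assumption) and flag as approximate when shared by several columns.
            share = bucket["size"] // len(names)
            approx = len(names) > 1
            for name in names:
                sizes[name] = (share, approx)
        # Empty buckets contribute nothing.
    return sizes


def format_sizes(sizes):
    """Render one 'name: N B' line per column, with the (approx) marker."""
    lines = []
    for name, (nbytes, approx) in sizes.items():
        suffix = " (approx)" if approx else ""
        lines.append(f"{name}: {nbytes} B{suffix}")
    return lines
```

A single-column monolithic bucket needs no split, so its size is reported without the marker.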
cat / head
cat scans all rows by default (-n to limit); head prints the first 10 rows by default. -c projects columns and --where filters rows. --where takes a single condition using one of = != > >= < <=. Integers and floats compare exactly, so =0.3 only matches a stored 0.3. Date32 accepts epoch-day or YYYY-MM-DD. Row groups whose stats exclude the predicate are skipped. --json emits newline-delimited JSON.
$ mosaic cat data.mosaic -n 2
+----+--------+------+-------+------+
| id | name | kind | score | flag |
+----+--------+------+-------+------+
| 0 | user_0 | a | 0 | 7 |
| 1 | user_1 | b | 1.5 | 7 |
+----+--------+------+-------+------+
$ mosaic cat data.mosaic -n 2 -c name,score # projection
$ mosaic cat data.mosaic -n 2 --json
{"id":0,"name":"user_0","kind":"a","score":0,"flag":7}
{"id":1,"name":"user_1","kind":"b","score":1.5,"flag":7}
$ mosaic cat data.mosaic --where "kind=a" # all matching rows
$ mosaic head data.mosaic --json # preview rows
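The two --where behaviours that tend to surprise, exact comparison and stats-based row-group skipping, can be sketched in a few lines. This is a simplified model under stated assumptions: row groups are plain dictionaries with hypothetical `stats` and `rows` keys, nulls never match, and a column without stats is always scanned. It is not the CLI's implementation.

```python
import operator

# The six operators --where accepts, mapped to exact Python comparisons.
OPS = {
    "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


def may_match(op, value, lo, hi):
    """False only when the [lo, hi] stats prove no row in the group can match."""
    if op == "=":
        return lo <= value <= hi
    if op == "!=":
        return not (lo == hi == value)  # every value equals `value`
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    raise ValueError(f"unsupported operator: {op}")


def scan(row_groups, column, op, value):
    """Return (matching rows, number of row groups skipped via stats)."""
    matches, skipped = [], 0
    for group in row_groups:
        stats = group["stats"].get(column)
        if stats is not None and not may_match(op, value, stats["min"], stats["max"]):
            skipped += 1  # stats exclude the predicate: do not read this group
            continue
        for row in group["rows"]:
            cell = row[column]
            if cell is not None and OPS[op](cell, value):
                matches.append(row)
    return matches, skipped


if __name__ == "__main__":
    groups = [
        {"stats": {"id": {"min": 0, "max": 2}}, "rows": [{"id": i} for i in range(0, 3)]},
        {"stats": {"id": {"min": 3, "max": 5}}, "rows": [{"id": i} for i in range(3, 6)]},
    ]
    print(scan(groups, "id", ">=", 4))
    # Exact comparison: a float computed as 0.1 + 0.2 is not equal to 0.3.
    print(OPS["="](0.1 + 0.2, 0.3))
```

Skipping only pays off for columns that have stats, which is why convert's --stats option matters for large files.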
count
Total row count across all row groups.
$ mosaic count data.mosaic
200
convert
Import a CSV (with header) or JSON lines (one object per line) into a new Mosaic file; the schema is inferred. --stats id,score builds min/max stats for those columns, which cat --where then uses to skip non-matching row groups. Refuses to replace an existing output unless --overwrite is given.
$ mosaic convert data.csv -o data.mosaic --stats id
wrote data.mosaic (200 rows, 5 columns)
$ mosaic convert data.ndjson -o data.mosaic
wrote data.mosaic (200 rows, 5 columns)
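To try convert without existing data, a small generator can produce both accepted input formats. The sketch below builds rows resembling the sample used throughout this page. The value rules are assumptions inferred from the outputs shown (score as 1.5 times id, a constant flag of 7, kind cycling through a, b, c), and the helper names are hypothetical.

```python
import csv
import io
import json
import os


def sample_rows(n=200):
    """Rows resembling the page's sample data (value rules are assumed)."""
    kinds = ["a", "b", "c"]
    return [
        {"id": i, "name": f"user_{i}", "kind": kinds[i % 3], "score": i * 1.5, "flag": 7}
        for i in range(n)
    ]


def to_csv(rows):
    """CSV text with a header row, as convert expects for CSV input."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_ndjson(rows):
    """JSON lines text: one object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def write_inputs(directory, n=200):
    """Write data.csv and data.ndjson into `directory` and return their paths."""
    rows = sample_rows(n)
    csv_path = os.path.join(directory, "data.csv")
    ndjson_path = os.path.join(directory, "data.ndjson")
    # newline="" keeps the csv module's own line endings intact.
    with open(csv_path, "w", newline="") as f:
        f.write(to_csv(rows))
    with open(ndjson_path, "w") as f:
        f.write(to_ndjson(rows))
    return csv_path, ndjson_path
```

After calling `write_inputs(".")`, the convert commands shown above can be run against the generated data.csv and data.ndjson.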
For programmatic access, use the ffi (mosaic.h) or jni crates rather than shelling out to this CLI.