CLI

Inspect Mosaic files from the terminal with the mosaic binary, a native, JVM-free toolkit built on the read-only MosaicReader API.

Install

# run from source
cargo run -p paimon-mosaic-cli -- schema data.mosaic

# install the `mosaic` binary
cargo install --path cli
mosaic schema data.mosaic

Commands

Command      Shows                                                                Reads
schema       column names, Arrow types, nullability, bucket                       footer only
meta         row groups, rows, per-column stats                                   footer + index
footer       magic, version, buckets, compression                                 footer only
buckets      per-bucket layout and member columns                                 footer + index
pages        per-column encoding + slot size                                      bucket data
dictionary   dictionary entries of a dict column                                  bucket data
column-size  on-disk bytes per column                                             footer + index + paged directories
cat / head   cat: all rows by default (-n to limit); head: first 10; -c, --where  column data
count        total row count                                                      footer + index
convert      import CSV or JSON lines into a new Mosaic file                      writes file

Every inspection and query command accepts --json; convert does not, because it writes a file instead of printing. cat scans all rows by default (-n to limit); head prints the first 10 rows by default. cat, head, pages and column-size take a column list (-c a,b); dictionary takes a single column (-c <col>).

schema

Columns, Arrow types, nullability and bucket assignment, in original input order. Footer only.

$ mosaic schema data.mosaic
5 columns, 4 buckets
  id: Int32 not null [bucket 0]
  name: Utf8 [bucket 2]
  kind: Utf8 [bucket 1]
  score: Float64 [bucket 3]
  flag: Int32 [bucket 0]

$ mosaic schema data.mosaic --json
{"columns":5,"buckets":4,"fields":[{"name":"id","type":"Int32","nullable":false,"bucket":0}, ...]}
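The --json form is convenient for scripting. As a minimal sketch, the Python below groups column names by bucket from a schema dump. Only the "id" entry appears verbatim in the JSON above; the remaining entries in SAMPLE are reconstructed from the human-readable listing and are assumptions about the exact JSON, and columns_by_bucket is a hypothetical helper, not part of the CLI.

```python
import json
from collections import defaultdict

# Sample shaped like the `mosaic schema --json` output. Only the "id" entry is
# shown verbatim in the docs; the others are filled in from the text listing
# and are assumptions about the exact JSON.
SAMPLE = json.dumps({
    "columns": 5,
    "buckets": 4,
    "fields": [
        {"name": "id", "type": "Int32", "nullable": False, "bucket": 0},
        {"name": "name", "type": "Utf8", "nullable": True, "bucket": 2},
        {"name": "kind", "type": "Utf8", "nullable": True, "bucket": 1},
        {"name": "score", "type": "Float64", "nullable": True, "bucket": 3},
        {"name": "flag", "type": "Int32", "nullable": True, "bucket": 0},
    ],
})


def columns_by_bucket(schema_json: str) -> dict[int, list[str]]:
    """Group column names by bucket index, keeping input order per bucket."""
    schema = json.loads(schema_json)
    grouped = defaultdict(list)
    for field in schema["fields"]:
        grouped[field["bucket"]].append(field["name"])
    return dict(sorted(grouped.items()))


if __name__ == "__main__":
    for bucket, names in columns_by_bucket(SAMPLE).items():
        print(f"bucket {bucket}: {', '.join(names)}")
```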

meta

Total rows, row groups, and per-column stats (null count / min / max) for columns configured with stats.

$ mosaic meta data.mosaic
file: 200 rows, 5 columns, 4 buckets, 1 row groups
row group 0: 200 rows
    id: nulls=0 min=0 max=199
    score: nulls=0 min=0 max=298.5

footer

The 32-byte file footer: magic, format version, bucket count, row groups and compression.

$ mosaic footer data.mosaic
magic=MOSA version=1 buckets=4 row_groups=1 compression=zstd

buckets

Per row group, each bucket's layout (empty / monolithic / paged), on-disk size and member columns. Mosaic groups columns into buckets by name order. Monolithic buckets also report uncompressed size and ratio.

$ mosaic buckets data.mosaic
row group 0:
    bucket 0: paged 373B [flag, id]
    bucket 1: monolithic 27B (uncompressed 59 B, 2.19x) [kind]
    bucket 2: paged 220B [name]
    bucket 3: paged 542B [score]

pages

Per-column physical encoding (plain / const / dict / all_null) and on-disk slot size.

$ mosaic pages data.mosaic
row group 0:
    flag: bucket 0 encoding=const slot=16B
    id: bucket 0 encoding=plain slot=349B
    kind: bucket 1 encoding=dict slot=28B
    name: bucket 2 encoding=plain slot=216B
    score: bucket 3 encoding=plain slot=538B

dictionary

Dump the dictionary of a dict-encoded column. Non-dict columns report as such.

$ mosaic dictionary data.mosaic -c kind
row group 0: 3 entries
    0: a
    1: b
    2: c

$ mosaic dictionary data.mosaic -c kind --json
{"column":"kind","row_groups":[["a","b","c"]]}
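For readers unfamiliar with dictionary encoding: the column stores each distinct value once, and each row refers to an entry by index. The sketch below illustrates that general idea using the JSON dump above. The per-row codes are hypothetical (the CLI does not print them), and nothing here describes Mosaic's actual on-disk layout.

```python
import json

# Output of `mosaic dictionary data.mosaic -c kind --json`, copied from above.
DUMP = '{"column":"kind","row_groups":[["a","b","c"]]}'


def decode(codes: list[int], entries: list[str]) -> list[str]:
    """Map per-row dictionary indexes back to their values."""
    return [entries[code] for code in codes]


if __name__ == "__main__":
    entries = json.loads(DUMP)["row_groups"][0]
    # Hypothetical per-row codes, made up for illustration.
    print(decode([0, 1, 2, 0], entries))
```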

column-size

On-disk bytes per column. Paged buckets give exact per-column sizes; a multi-column monolithic bucket is split evenly and marked (approx).

$ mosaic column-size data.mosaic -c id,name,kind
  id: 349 B
  name: 216 B
  kind: 28 B
  total: 593 B
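To make the (approx) figures concrete, here is a minimal sketch of an even split for a multi-column monolithic bucket. The function name and the handling of leftover bytes are assumptions; the CLI may round differently.

```python
def split_evenly(bucket_bytes: int, columns: list[str]) -> dict[str, int]:
    """Spread a bucket's on-disk size across its columns as evenly as possible."""
    if not columns:
        return {}
    share, remainder = divmod(bucket_bytes, len(columns))
    # Assumption: leftover bytes go to the first columns so the parts still
    # sum to the bucket total.
    return {
        name: share + (1 if i < remainder else 0)
        for i, name in enumerate(columns)
    }


if __name__ == "__main__":
    print(split_evenly(27, ["a", "b"]))
```

Because the bucket is compressed as a whole, any per-column figure derived this way is only an estimate, which is why such columns carry the (approx) marker.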

cat / head

cat scans all rows by default (-n to limit); head prints the first 10 rows by default. -c projects columns and --json emits newline-delimited JSON. --where filters rows with a single condition using one of = != > >= < <=. Integers and floats compare exactly, so =0.3 matches only a stored 0.3. Date32 values accept either an epoch-day number or YYYY-MM-DD. Row groups whose stats exclude the predicate are skipped.

$ mosaic cat data.mosaic -n 2
+----+--------+------+-------+------+
| id | name   | kind | score | flag |
+----+--------+------+-------+------+
| 0  | user_0 | a    | 0     | 7    |
| 1  | user_1 | b    | 1.5   | 7    |
+----+--------+------+-------+------+

$ mosaic cat data.mosaic -n 2 -c name,score   # projection

$ mosaic cat data.mosaic -n 2 --json
{"id":0,"name":"user_0","kind":"a","score":0,"flag":7}
{"id":1,"name":"user_1","kind":"b","score":1.5,"flag":7}
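Since --json emits one object per line, downstream scripts can parse the output line by line. A minimal sketch, with read_ndjson as a hypothetical helper and SAMPLE holding the two rows shown above:

```python
import json
from typing import Iterable, Iterator


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# The two rows printed by `mosaic cat data.mosaic -n 2 --json` above.
SAMPLE = """\
{"id":0,"name":"user_0","kind":"a","score":0,"flag":7}
{"id":1,"name":"user_1","kind":"b","score":1.5,"flag":7}
"""

if __name__ == "__main__":
    rows = list(read_ndjson(SAMPLE.splitlines()))
    print(sum(row["score"] for row in rows))
```

In a pipeline the same function could read from sys.stdin, for example when the script is fed by mosaic cat data.mosaic --json.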

$ mosaic cat data.mosaic --where "kind=a"   # all matching rows
$ mosaic head data.mosaic --json             # preview rows
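The row-group skipping behind --where can be sketched as follows. This is an illustration of min/max pruning in general, not the CLI's actual implementation: parse_condition and can_skip are hypothetical names, and null handling is ignored.

```python
# Two-character operators come first so ">=" is not read as ">" followed by "=".
OPERATORS = (">=", "<=", "!=", "=", ">", "<")


def parse_condition(text: str) -> tuple[str, str, str]:
    """Split a single condition such as "score>=1.5" into (column, op, value)."""
    for i in range(1, len(text)):
        for op in OPERATORS:
            if text.startswith(op, i):
                return text[:i].strip(), op, text[i + len(op):].strip()
    raise ValueError(f"no comparison operator in {text!r}")


def can_skip(op: str, value, lo, hi) -> bool:
    """Return True when no value in [lo, hi] can satisfy `column op value`."""
    if op == "=":
        return value < lo or value > hi
    if op == "!=":
        return lo == hi == value
    if op == ">":
        return hi <= value
    if op == ">=":
        return hi < value
    if op == "<":
        return lo >= value
    if op == "<=":
        return lo > value
    raise ValueError(f"unknown operator {op!r}")


if __name__ == "__main__":
    column, op, raw = parse_condition("id>199")
    # With stats min=0 max=199 (as in the meta example), this group is skipped.
    print(column, can_skip(op, int(raw), 0, 199))
```

Pruning of this kind only applies to columns that carry stats, which is why convert offers --stats.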

count

Total row count across all row groups.

$ mosaic count data.mosaic
200

convert

Import a CSV (with header) or JSON lines (one object per line) into a new Mosaic file; the schema is inferred. --stats id,score builds min/max stats for those columns, which cat --where then uses to skip non-matching row groups. Refuses to replace an existing output unless --overwrite is given.

$ mosaic convert data.csv -o data.mosaic --stats id
wrote data.mosaic (200 rows, 5 columns)
$ mosaic convert data.ndjson -o data.mosaic
wrote data.mosaic (200 rows, 5 columns)
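The inference rules convert applies are not spelled out here. As a rough illustration of what schema inference from CSV can look like, the sketch below tries integer, then float, then falls back to string. The type names are borrowed from the schema output above, but the rules themselves are assumptions and may differ from what convert actually does.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Pick the first of three illustrative types that parses every non-empty value."""
    present = [v for v in values if v]
    for name, parse in (("Int32", int), ("Float64", float)):
        try:
            for v in present:
                parse(v)
            return name
        except ValueError:
            continue
    return "Utf8"


def infer_schema(csv_text: str) -> dict[str, str]:
    """Infer a column -> type mapping from CSV text whose first row is a header."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    header = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in header}


if __name__ == "__main__":
    sample = "id,name,score\n0,user_0,0\n1,user_1,1.5\n"
    print(infer_schema(sample))
```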

Embedding instead

For C/C++ or Java callers, embed the format directly via the ffi (mosaic.h) or jni crates rather than shelling out to this CLI.