Mosaic
A columnar-bucket hybrid format optimized for wide tables.
Overview
Mosaic is a columnar-bucket hybrid format optimized for wide tables (10,000+ columns). Columns are sorted by name and evenly distributed into buckets using range-based assignment. Within each bucket, data is stored column-oriented, and each bucket is compressed independently. This enables projection pushdown at bucket granularity: reading 10 columns out of 10,000 decompresses only the buckets that contain those 10 columns. Because assignment is range-based, columns with similar name prefixes land in the same bucket, which improves both compression ratio and projection locality.
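The range-based assignment described above can be illustrated with a minimal sketch. It is not Mosaic's actual implementation: the function names `assign_buckets` and `buckets_for_projection` are hypothetical, and the rule that leftover columns go to the first buckets is an assumption made only to keep the split even.

```rust
use std::collections::BTreeSet;
use std::ops::Range;

/// Split `num_columns` (already sorted by name) into `num_buckets`
/// contiguous index ranges whose sizes differ by at most one.
/// Hypothetical helper: the remainder handling is an assumption.
fn assign_buckets(num_columns: usize, num_buckets: usize) -> Vec<Range<usize>> {
    assert!(num_buckets > 0, "need at least one bucket");
    let base = num_columns / num_buckets;
    let extra = num_columns % num_buckets;
    let mut ranges = Vec::with_capacity(num_buckets);
    let mut start = 0;
    for bucket in 0..num_buckets {
        // The first `extra` buckets take one additional column.
        let len = base + usize::from(bucket < extra);
        ranges.push(start..start + len);
        start += len;
    }
    ranges
}

/// Return the set of bucket indices that hold the projected columns.
/// Names that are not present in `sorted_names` are ignored.
fn buckets_for_projection(
    sorted_names: &[String],
    ranges: &[Range<usize>],
    projection: &[&str],
) -> BTreeSet<usize> {
    let mut needed = BTreeSet::new();
    for &name in projection {
        if let Ok(idx) = sorted_names.binary_search_by(|n| n.as_str().cmp(name)) {
            // Ranges are contiguous, so the first range ending after `idx` contains it.
            needed.insert(ranges.partition_point(|r| r.end <= idx));
        }
    }
    needed
}
```

In this sketch, projecting columns that share a prefix such as `user_` tends to touch a single bucket, because sorting places them next to each other before the ranges are cut.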
Mosaic is implemented as a Rust core library with bindings for Java (via JNI), Python (via ctypes FFI), and C/C++ (via FFI), enabling high-performance read and write access across multiple language ecosystems.
Key Features
Columnar-Bucket Hybrid
Columns sorted by name are distributed into buckets via range-based assignment, enabling projection pushdown at bucket granularity. Similar name prefixes land in the same bucket.
Adaptive Encoding
Each column is automatically encoded as ALL_NULL, CONST, DICT, or PLAIN based on its data distribution.
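As a rough illustration of how such a choice might be made, the sketch below picks one of the four encodings from a column's values. The selection rules are assumptions, not Mosaic's documented behavior: `MAX_DICT_SIZE`, the "distinct values at most half the rows" heuristic, and the requirement that CONST columns contain no nulls are all hypothetical.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Encoding {
    AllNull,
    Const,
    Dict,
    Plain,
}

/// Hypothetical upper bound on dictionary entries (assumption).
const MAX_DICT_SIZE: usize = 255;

/// Choose an encoding for one string column from its data distribution.
fn choose_encoding(values: &[Option<&str>]) -> Encoding {
    // Every row is null (or the column is empty): nothing to store.
    if values.iter().all(|v| v.is_none()) {
        return Encoding::AllNull;
    }
    let distinct: HashSet<Option<&str>> = values.iter().copied().collect();
    // Exactly one distinct entry that is not null: store the value once.
    if distinct.len() == 1 {
        return Encoding::Const;
    }
    // Assumed heuristic: a dictionary pays off when values repeat often.
    if distinct.len() <= MAX_DICT_SIZE && distinct.len() * 2 <= values.len() {
        Encoding::Dict
    } else {
        Encoding::Plain
    }
}
```

A real writer would likely compare estimated encoded sizes rather than use fixed thresholds; the thresholds here only make the decision order visible.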
Zstd Compression
Optional Zstandard compression per bucket and schema block, with configurable compression level. Each bucket is independently compressed.
BPE Name Compression
Byte Pair Encoding compresses column names in the schema block, reducing metadata overhead for wide tables.
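To show why Byte Pair Encoding suits column names with long shared prefixes, here is a small, generic BPE sketch. It is not Mosaic's schema-block layout: the token representation (`u16`, with values from 256 upward for merged pairs), the tie-breaking rule, and the function names are all assumptions for illustration.

```rust
use std::collections::HashMap;

/// Find the most frequent adjacent token pair that occurs at least twice.
/// Ties are broken in favor of the smaller pair so the result is deterministic.
fn most_frequent_pair(names: &[Vec<u16>]) -> Option<(u16, u16)> {
    let mut counts: HashMap<(u16, u16), usize> = HashMap::new();
    for name in names {
        for w in name.windows(2) {
            *counts.entry((w[0], w[1])).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .filter(|&(_, c)| c >= 2)
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
        .map(|(pair, _)| pair)
}

/// Replace each non-overlapping occurrence of `pair` with `new_token`.
fn merge_pair(tokens: &[u16], pair: (u16, u16), new_token: u16) -> Vec<u16> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_token);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Run up to `max_merges` merge steps; returns encoded names and the merge table.
fn bpe_compress(names: &[&str], max_merges: u16) -> (Vec<Vec<u16>>, Vec<(u16, u16)>) {
    let mut encoded: Vec<Vec<u16>> = names
        .iter()
        .map(|n| n.bytes().map(u16::from).collect())
        .collect();
    let mut merges = Vec::new();
    // Cap the step count so that `256 + step` cannot overflow u16.
    for step in 0..max_merges.min(u16::MAX - 256) {
        let pair = match most_frequent_pair(&encoded) {
            Some(p) => p,
            None => break,
        };
        let new_token = 256 + step;
        encoded = encoded.iter().map(|t| merge_pair(t, pair, new_token)).collect();
        merges.push(pair);
    }
    (encoded, merges)
}

/// Expand tokens back to the original bytes using the merge table.
fn bpe_decode(tokens: &[u16], merges: &[(u16, u16)]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut stack: Vec<u16> = tokens.iter().rev().copied().collect();
    while let Some(t) = stack.pop() {
        if t < 256 {
            out.push(t as u8);
        } else {
            let (a, b) = merges[(t - 256) as usize];
            stack.push(b);
            stack.push(a);
        }
    }
    out
}
```

With names such as `user_profile_name` and `user_profile_age`, repeated merges collapse the shared `user_profile_` prefix into a few tokens, which is where the metadata savings for wide tables would come from.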
Rich Type System
18 data types from Boolean to TimestampLtz, with support for fixed-width and variable-length encodings.
Multi-Language
Rust core with Java JNI bindings, Python ctypes bindings, and C/C++ FFI headers. Write once in Rust, use everywhere.
Supported Types
| Type | Width | Description |
|---|---|---|
| Boolean | 1 | true / false |
| TinyInt | 1 | Signed 8-bit integer |
| SmallInt | 2 | Signed 16-bit integer |
| Integer | 4 | Signed 32-bit integer |
| BigInt | 8 | Signed 64-bit integer |
| Float | 4 | 32-bit IEEE 754 |
| Double | 8 | 64-bit IEEE 754 |
| Date | 4 | Days since epoch |
| Time | 4 | Milliseconds since midnight |
| Char(n) | variable | Fixed-length string |
| VarChar(n) | variable | Variable-length string with max length |
| String | variable | Unbounded UTF-8 string |
| Binary(n) | variable | Fixed-length byte array |
| VarBinary(n) | variable | Variable-length byte array with max length |
| Bytes | variable | Unbounded byte array |
| Decimal(p, s) | 8 or variable | Exact numeric; compact (p≤18) or large |
| Timestamp(p) | 8 or 12 | Millis (p≤3), micros (p≤6), or millis + nanos (p>6) |
| TimestampLtz(p) | 8 or 12 | Same as Timestamp, with local timezone |
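
The precision-dependent widths in the table can be read as a simple mapping. The sketch below encodes that reading. The names `TimestampLayout`, `timestamp_layout`, and `decimal_is_compact` are hypothetical, and pairing the 12-byte width with the millis + nanos layout (8 bytes for the other two) is an inference from the table rather than a documented rule.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum TimestampLayout {
    Millis,
    Micros,
    MillisPlusNanos,
}

/// Map a Timestamp/TimestampLtz precision to its layout and width in bytes.
fn timestamp_layout(precision: u8) -> (TimestampLayout, usize) {
    match precision {
        0..=3 => (TimestampLayout::Millis, 8),
        4..=6 => (TimestampLayout::Micros, 8),
        // Assumed: the 12-byte width corresponds to millis plus a nanos part.
        _ => (TimestampLayout::MillisPlusNanos, 12),
    }
}

/// Decimal(p, s) uses the compact 8-byte form when p <= 18.
fn decimal_is_compact(precision: u8) -> bool {
    precision <= 18
}
```

The p≤18 cutoff for Decimal matches the largest precision whose unscaled value always fits in a signed 64-bit integer.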
Benchmark
Test setup: 10,000 columns (90% STRING, 10% INT), column names ~80 bytes each, Zstd compression (level 9).
File Size (10 rows)
| Format | Size | vs Mosaic |
|---|---|---|
| Parquet | 9,696 KB | 14.8x |
| ORC | 6,377 KB | 9.7x |
| Mosaic | 654 KB | 1x |
Projection Read (500 rows)
File size — Parquet: 57.4 MB, ORC: 95.4 MB, Mosaic: 11.5 MB
| Projected Columns | Parquet | ORC | Mosaic |
|---|---|---|---|
| 10 / 10,000 | 53,170 us | 72,729 us | 25,081 us |
| 1 / 10,000 | 50,919 us | 70,712 us | 2,374 us |
Projection Read (4,500 rows)
File size — Parquet: 458.4 MB, ORC: 827.9 MB, Mosaic: 100.2 MB
| Projected Columns | Parquet | ORC | Mosaic |
|---|---|---|---|
| 10 / 10,000 | 369,627 us | 89,344 us | 67,314 us |
| 1 / 10,000 | 360,458 us | 81,934 us | 26,924 us |
Status
Mosaic is under active development as part of the Apache Paimon ecosystem. Both the write path and read path are fully implemented with round-trip test coverage.