Design & Format Specification

Architecture, binary layout, and internal data structures of Mosaic v1.

File Format Layout

A Mosaic file consists of four sections, written sequentially:

Bucket Data
    Row Group 0:  Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N-1
    Row Group 1:  Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N-1

Schema Block
    4B uncompressed size | compressed schema bytes

Row Group Index
    numRows | nonEmpty | [bucketId, offset, compSize, uncompSize] ... | columnStats

Footer (32 bytes)
    indexOffset(8) | schemaOffset(8) | numBuckets(4) | numRowGroups(4)
    compression(1) | version(1) | reserved(2) | magic "MOSA"(4)

Reading starts from the footer (last 32 bytes), which provides absolute offsets to locate the schema block and row group index.

Columnar-Bucket Hybrid

Mosaic is a columnar-bucket hybrid format. Columns are sorted by name and evenly distributed into buckets using range-based assignment:

bucket_id = sorted_position * num_buckets / num_columns

Example: 8 columns, 3 buckets

Columns (sorted by name): amount, city, email, id, name, phone, score, zip

    Bucket 0: amount, city, email
    Bucket 1: id, name, phone
    Bucket 2: score, zip

Within each bucket, data is stored column-oriented and independently compressed. This design enables efficient projection pushdown at bucket granularity — reading 10 columns out of 10,000 only decompresses the buckets that contain those 10 columns.

Range-based assignment ensures that columns with similar name prefixes (e.g., sensor_temp_1, sensor_temp_2) land in the same bucket, improving both compression ratio and projection locality.

The default is 100 buckets, automatically clamped to min(num_columns, 100). The bucket assignment is deterministic and derived from the sorted column order — it is not stored in the file.
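
The assignment rule can be sketched in a few lines of Python. This is an illustration rather than the reference implementation: the function name `assign_buckets` and the `requested_buckets` parameter are hypothetical, and it assumes that the division truncates and that names are sorted in plain code-point order.

```python
def assign_buckets(column_names, requested_buckets=100):
    """Map each column name to a bucket id using range-based assignment."""
    ordered = sorted(column_names)  # assumption: plain code-point ordering
    num_columns = len(ordered)
    if num_columns == 0:
        return {}
    # Clamp the bucket count so it never exceeds the number of columns.
    num_buckets = min(num_columns, requested_buckets)
    # Integer division reproduces bucket_id = position * num_buckets / num_columns.
    return {
        name: position * num_buckets // num_columns
        for position, name in enumerate(ordered)
    }


columns = ["id", "name", "email", "phone", "city", "zip", "score", "amount"]
buckets = assign_buckets(columns, requested_buckets=3)
```

For the eight example columns and three buckets, this yields the same grouping as the example above. Because the mapping depends only on the sorted names and the bucket count, a reader can recompute it from the schema without any stored table.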

Encoding Strategy

Each column within a bucket is independently encoded. The writer selects the most compact encoding for each column:

| Encoding | Tag | When Used | Storage |
|----------|-----|-----------|---------|
| PLAIN | 0 | Fallback for everything else | Raw values (fixed-width or varint-prefixed) + null bitmap |
| CONST | 1 | All non-null values are identical | One value + null bitmap |
| DICT | 2 | Number of distinct values ≤ 255 and total dict size ≤ 32 KB | Dictionary + bit-packed indices + null bitmap |
| ALL_NULL | 3 | Every value in the column is null | Zero bytes (no data, no bitmap) |

Column Encoding Selection

The encoding for each column is chosen automatically during writing, based on value distribution and cost.

CONST detection is independent of dictionary tracking — it uses a lightweight byte comparison against the first non-null value, so it works for all types and value sizes (including long strings).

Dictionary encoding works for all data types including variable-width types (VARCHAR, VARBINARY, DECIMAL). Variable-width dictionary tracking is bounded by a configurable cumulative byte budget (default 32 KB) and abandoned when cardinality exceeds 255 or total dictionary entry bytes exceed the budget.
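
The selection rules can be summarized as a small decision function. The sketch below is an assumption-laden illustration: the name `choose_encoding`, the order in which the checks run, and the representation of a column as a list of already-serialized `bytes` values (with `None` for null) are all hypothetical. It also builds the distinct set in one pass, whereas the text describes CONST detection as a separate byte comparison.

```python
PLAIN, CONST, DICT, ALL_NULL = 0, 1, 2, 3
MAX_DICT_ENTRIES = 255          # cardinality limit stated in the spec
DICT_BYTE_BUDGET = 32 * 1024    # default cumulative byte budget (32 KB)


def choose_encoding(values, byte_budget=DICT_BYTE_BUDGET):
    """Pick an encoding tag for one column of serialized values (None = null)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return ALL_NULL
    first = non_null[0]
    # CONST: compare every non-null value against the first one.
    if all(v == first for v in non_null):
        return CONST
    # DICT: track distinct values until a limit is exceeded.
    distinct = set()
    dict_bytes = 0
    for v in non_null:
        if v not in distinct:
            distinct.add(v)
            dict_bytes += len(v)
            if len(distinct) > MAX_DICT_ENTRIES or dict_bytes > byte_budget:
                return PLAIN  # dictionary tracking abandoned
    return DICT
```

A column of 255 distinct one-byte values still qualifies for DICT, while a 256th distinct value or a dictionary larger than the byte budget falls back to PLAIN.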

Bit-packed Dictionary Indices

Dictionary indices are bit-packed using bitWidth = ceil(log2(numEntries)) bits per non-null cell, packed LSB-first within each byte. The reader derives bitWidth from numEntries (already stored in dict metadata).

Examples: 2 distinct values → 1 bit/cell, 4 → 2 bits, 16 → 4 bits, 255 (the maximum dictionary size) → 8 bits.

Note: Null rows do not consume any bits in the bit-packed index array. Only non-null rows have corresponding dictionary indices.
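
A minimal packing sketch follows. The helper names are hypothetical, and the treatment of indices that straddle a byte boundary (one continuous little-endian bit stream) is an assumption, since the text only states that packing is LSB-first within each byte.

```python
def bit_width(num_entries):
    """ceil(log2(num_entries)), computed with integer arithmetic."""
    return (num_entries - 1).bit_length() if num_entries > 1 else 0


def pack_indices(indices, num_entries):
    """Pack the dictionary indices of non-null cells, lowest bits first."""
    width = bit_width(num_entries)
    acc = 0
    for i, idx in enumerate(indices):
        acc |= idx << (i * width)  # cell i occupies bits [i*width, (i+1)*width)
    total_bits = len(indices) * width
    return acc.to_bytes((total_bits + 7) // 8, "little")


def unpack_indices(data, count, num_entries):
    """Inverse of pack_indices; count is the number of non-null cells."""
    width = bit_width(num_entries)
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


packed = pack_indices([1, 0, 3, 2], num_entries=4)  # 2 bits per cell, 1 byte total
```

Only the indices of non-null cells are passed in, so the reader needs the null bitmap to map the unpacked indices back to row positions.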

Bucket Internal Structure

Each bucket stores column data in one of two modes, chosen automatically based on the uncompressed data size. The mode determines how compression is applied.

Monolithic Mode

When the average column page size is smaller than 32 KB (configurable via page_size_threshold), the entire bucket is compressed as a single zstd block. Individual column pages that are too small yield poor zstd compression ratios, so monolithic compression is more efficient in this case.

Monolithic Bucket (single zstd block)
    Encoding Flags (2 bits/column)
    Has-Nulls Flags (1 bit/column)
    CONST Metadata
    DICT Metadata (entries per column)
    Null Bitmaps (columns with nulls, excluding ALL_NULL)
    Col 0 data (PLAIN)
    Col 1 data (DICT)
    Col 2 data (PLAIN)

All of the above are compressed together as one zstd block.

Paged Mode

When the average column page size is ≥ 32 KB, the bucket switches to paged mode. The bucket begins with a fixed-length page directory followed by self-describing, independently compressed column slots. The directory size is deterministic from the schema (num_columns_in_bucket × 4 bytes), enabling projection queries to read only the target columns' data with exactly 2 range-read operations on remote storage.

Paged Bucket

Page Directory (fixed-length, uncompressed)
    Col 0: size (u32 LE)
    Col 1: size (u32 LE)
    Col 2: 0 (ALL_NULL)
    Col 3: size (u32 LE)

Column Slots (each self-describing, independently zstd compressed)
    Slot 0 (Col A - PLAIN):  uncompressed_size (varint) + zstd(encoding | flags | bitmap | data)
    Slot 1 (Col B - DICT):   uncompressed_size (varint) + zstd(encoding | flags | dict | bitmap | indices)
    Col C (ALL_NULL):        size=0 in directory, no on-disk slot
    Slot 3 (Col D - CONST):  uncompressed_size (varint) + zstd(encoding | flags | const_value | bitmap)

Projection: SELECT col_A, col_D → read directory (fixed) + only Slot 0 & Slot 3.
2 range-reads on remote storage; all other columns are skipped entirely.

Page Directory

The directory is an array of num_columns_in_bucket entries, each a 4-byte u32 (little-endian) representing the total on-disk slot size for that column. A value of 0 means the column is ALL_NULL and has no on-disk data. The directory size is deterministic: num_columns_in_bucket × 4 bytes, computable from the schema alone.

Column Slot Format

Each non-ALL_NULL column has a slot on disk immediately after the directory:

On-disk slot:
    uncompressed_size  (varint, uncompressed prefix)
    compressed_data    (zstd compressed page_content)

page_content (after decompression):
    encoding           (1 byte: PLAIN=0, CONST=1, DICT=2)
    flags              (1 byte: bit 0 = has_nulls)
    [meta]             (encoding-specific, see below)
    [data]             (null bitmap if has_nulls, then column data)

Page Content by Encoding

| Encoding | On-Disk Slot? | page_content layout |
|----------|---------------|---------------------|
| ALL_NULL | No (size=0) | (none) |
| CONST (no nulls) | Yes (tiny) | encoding + flags + const_value |
| CONST (has nulls) | Yes | encoding + flags + const_value + null_bitmap |
| DICT | Yes | encoding + flags + dict_table + [null_bitmap] + bit-packed indices |
| PLAIN | Yes | encoding + flags + [null_bitmap] + raw column data |

Projected Read Path

  1. Compute dir_size = num_columns_in_bucket × 4 (known from schema)
  2. Range-read the directory from bucket_offset
  3. For each projected column, compute slot offset via prefix-sum of directory entries
  4. Range-read only the projected columns' slots (merge adjacent slots into a single IO)
  5. For each slot: parse uncompressed_size varint, then zstd::decompress
  6. Parse page_content: encoding, flags, meta, data → build column reader
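
Steps 1 to 4 amount to a prefix sum over the directory plus a merge of adjacent ranges. The sketch below illustrates that planning step only (no I/O, no decompression); the function names are hypothetical, and column indices are positions within the bucket.

```python
import struct


def directory_size(num_columns_in_bucket):
    """Step 1: the directory length follows from the schema alone."""
    return num_columns_in_bucket * 4


def parse_directory(raw, num_columns_in_bucket):
    """Step 2: decode the u32 little-endian slot sizes."""
    return list(struct.unpack(f"<{num_columns_in_bucket}I", raw))


def plan_slot_reads(slot_sizes, projected, bucket_offset):
    """Steps 3-4: return merged (offset, length) ranges for the projected slots."""
    wanted = set(projected)
    cursor = bucket_offset + directory_size(len(slot_sizes))  # first slot starts here
    ranges = []
    for col, size in enumerate(slot_sizes):
        if col in wanted and size > 0:  # size 0 means ALL_NULL: nothing to read
            if ranges and ranges[-1][0] + ranges[-1][1] == cursor:
                # Adjacent to the previous range: extend it into a single IO.
                ranges[-1] = (ranges[-1][0], ranges[-1][1] + size)
            else:
                ranges.append((cursor, size))
        cursor += size  # running prefix sum of directory entries
    return ranges


sizes = parse_directory(struct.pack("<4I", 100, 50, 0, 30), 4)
reads = plan_slot_reads(sizes, projected=[0, 3], bucket_offset=1000)
```

With a bucket at offset 1000 and four columns, the directory occupies 16 bytes, so column 0 starts at 1016 and column 3 at 1166. An ALL_NULL column between two projected columns does not break adjacency, because it occupies no bytes on disk.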

Monolithic vs Paged Signaling

Each bucket in the row group index is described by a pair (compressed_size, bulk_decompress_size). This pair encodes three layout variants with zero additional bytes:

| Condition | Layout | Meaning |
|-----------|--------|---------|
| compressed_size == 0 | Empty | No data on disk for this bucket; skip entirely. |
| compressed_size > 0 && bulk_decompress_size > 0 | Monolithic | The on-disk blob is a single compressed block. bulk_decompress_size is the decompressed size (used to allocate the output buffer before decompression). |
| compressed_size > 0 && bulk_decompress_size == 0 | Paged | The on-disk content is [directory (num_cols × u32le slot sizes)] followed by per-column compressed slots. Each slot is independently decompressible. |

This encoding is unambiguous: a non-empty monolithic bucket always has bulk_decompress_size > 0 because a decompressed payload cannot be zero bytes. The combination compressed_size == 0 && bulk_decompress_size != 0 is invalid and must be rejected by the reader.
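
A reader can turn the pair into a layout decision with a few comparisons. The function below is a sketch with a hypothetical name and string labels; it rejects the one invalid combination described above.

```python
def classify_bucket(compressed_size, bulk_decompress_size):
    """Return 'empty', 'monolithic', or 'paged' for one row group index entry."""
    if compressed_size == 0:
        if bulk_decompress_size != 0:
            # An empty bucket cannot claim a decompressed payload.
            raise ValueError("invalid entry: compressed_size == 0 with non-zero bulk_decompress_size")
        return "empty"
    # Non-empty bucket: a non-zero decompressed size marks a single zstd block.
    return "monolithic" if bulk_decompress_size > 0 else "paged"
```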

Validation Invariants

Compression

Both bucket data and the schema block support compression:

| ID | Name | Description |
|----|------|-------------|
| 0 | None | No compression |
| 1 | Zstd | Zstandard compression (default level 1) |

In monolithic mode, compression is applied to the entire bucket as one block. In paged mode, the page directory is uncompressed (fixed-length, enabling direct offset computation), while each column slot is independently zstd-compressed. Paged mode is only used when the compression method is Zstd.

Row Groups

Large files are split into row groups to bound memory usage during writing. Each row group contains up to row_group_max_size bytes of uncompressed bucket data (default: 256 MB). The row group index, which is located through the indexOffset field of the file footer, records offsets and sizes for each bucket in each row group, enabling random access to any row group.

Footer (32 bytes, big-endian)

| Offset | Size | Field | Description |
|--------|------|-------|-------------|
| 0 | 8 | indexOffset | Absolute offset of Row Group Index |
| 8 | 8 | schemaBlockOffset | Absolute offset of Schema Block |
| 16 | 4 | numBuckets | Total number of buckets |
| 20 | 4 | numRowGroups | Total number of row groups |
| 24 | 1 | compression | 0 = none, 1 = zstd |
| 25 | 1 | version | Format version (currently 1) |
| 26 | 2 | (reserved) | Padding, set to 0 |
| 28 | 4 | magic | MOSA (0x4D4F5341) |
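
The fixed layout maps directly onto a single `struct` format string. The sketch below is illustrative: the `Footer` tuple and `parse_footer` are hypothetical names, and treating the offsets and counts as unsigned integers is an assumption, since the table does not state signedness.

```python
import struct
from collections import namedtuple

# Big-endian, no padding: 8 + 8 + 4 + 4 + 1 + 1 + 2 + 4 = 32 bytes.
FOOTER_STRUCT = struct.Struct(">QQIIBBH4s")

Footer = namedtuple(
    "Footer",
    "index_offset schema_block_offset num_buckets num_row_groups "
    "compression version reserved magic",
)


def parse_footer(last_32_bytes):
    """Decode the footer from the final 32 bytes of a file."""
    if len(last_32_bytes) != FOOTER_STRUCT.size:
        raise ValueError("footer must be exactly 32 bytes")
    footer = Footer(*FOOTER_STRUCT.unpack(last_32_bytes))
    if footer.magic != b"MOSA":
        raise ValueError("bad magic: not a Mosaic file")
    return footer


# Round-trip a synthetic footer (values chosen only for illustration).
raw = FOOTER_STRUCT.pack(4096, 2048, 100, 2, 1, 1, 0, b"MOSA")
footer = parse_footer(raw)
```

A reader would typically seek to 32 bytes before the end of the file, call a function like this, and then use the two offsets to fetch the schema block and the row group index.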

Row Group Index

The index is mostly varint-encoded (bucket offsets are fixed 8-byte values), and only non-empty buckets are stored. For each row group:

varint   numRows
varint   nonEmptyCount
repeated nonEmptyCount times:
    varint    bucketId
    8 bytes   bucketOffset       (big-endian, absolute file offset)
    varint    compressedSize     (total bytes: monolithic blob or directory + column slots)
    varint    bulkDecompressSize (> 0 = monolithic, = 0 = paged)

--- Column Statistics (appended after bucket entries) ---
varint   numStats              (0 if no stats configured)
repeated numStats times:
    varint    columnIndex      (global column index)
    varint    nullCount
    [if nullCount < numRows]:
        value   minValue       (serialized using standard value encoding)
        value   maxValue       (serialized using standard value encoding)

Empty buckets (no data) are omitted entirely, saving space for sparse schemas.
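
The bucket entries of one row group can be decoded as sketched below. The function names are hypothetical, the offset is assumed to be unsigned, and the sketch stops before the column statistics, because decoding minValue and maxValue requires the column types from the schema.

```python
import struct


def read_varint(buf, pos):
    """Decode one LEB128 varint from buf at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_row_group_buckets(buf, pos=0):
    """Return (num_rows, entries, next_pos); next_pos points at numStats."""
    num_rows, pos = read_varint(buf, pos)
    non_empty, pos = read_varint(buf, pos)
    entries = []
    for _ in range(non_empty):
        bucket_id, pos = read_varint(buf, pos)
        (offset,) = struct.unpack_from(">Q", buf, pos)  # 8 bytes, big-endian
        pos += 8
        compressed_size, pos = read_varint(buf, pos)
        bulk_decompress_size, pos = read_varint(buf, pos)
        entries.append({
            "bucket_id": bucket_id,
            "offset": offset,
            "compressed_size": compressed_size,
            "bulk_decompress_size": bulk_decompress_size,
        })
    return num_rows, entries, pos


# One row group: 300 rows, a single paged bucket (id 2) at offset 4096, no stats.
sample = (bytes([0xAC, 0x02, 0x01, 0x02]) + struct.pack(">Q", 4096)
          + bytes([0xC8, 0x01, 0x00, 0x00]))
num_rows, entries, stats_pos = parse_row_group_buckets(sample)
```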

Column Statistics

Mosaic supports optional per-column min/max statistics at row group granularity, enabling filter pushdown: query engines can skip entire row groups whose value range does not overlap with a filter predicate.

Filter Pushdown

A row group can be skipped whenever its recorded [min, max] range for the filtered column cannot satisfy the predicate. For example, a filter age > 50 can skip any row group where max(age) ≤ 50. Row groups without statistics for the filtered column must always be read.
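
The pruning check for a greater-than predicate can be sketched as follows. The in-memory representation of the statistics (a dictionary per row group, keyed by column name) and the function name are hypothetical; the file itself stores statistics by column index.

```python
def may_contain_greater_than(stats, threshold):
    """Return False only when the row group provably has no value > threshold."""
    if stats is None or "max" not in stats:
        return True  # no usable statistics: the row group must be read
    return stats["max"] > threshold


row_groups = [
    {"id": 0, "stats": {"age": {"min": 18, "max": 45}}},
    {"id": 1, "stats": {"age": {"min": 30, "max": 72}}},
    {"id": 2, "stats": {}},  # statistics not configured for this column
]

# Filter: age > 50
to_read = [
    rg["id"]
    for rg in row_groups
    if may_contain_greater_than(rg["stats"].get("age"), 50)
]
```

Row group 0 is skipped because its maximum is below the threshold, while row group 2 is kept because nothing is known about it. The check is conservative: it can only eliminate row groups, so the engine still applies the filter to the rows it reads.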

Schema Block

The schema block is prefixed with a 4-byte big-endian integer holding the uncompressed size, followed by the schema data, which is compressed with the file's compression method.

Columns are serialized in name-sorted order. Column names are compressed using one of two encodings (front coding alone, or BPE combined with front coding); the writer picks whichever produces the smaller output.

Schema Block Layout

varint   numColumns
varint   numBuckets
1 byte   nameEncoding          (0 = front coding, 1 = BPE + front coding)

--- if nameEncoding == 1 (BPE) ---
varint   numRules
repeated numRules times:
    1 byte   left               (left token of merge rule)
    1 byte   right              (right token of merge rule)

--- per column (repeated numColumns times, name-sorted order) ---
varint   sharedPrefixLen       (bytes shared with previous column name)
varint   suffixLen             (bytes of new suffix)
bytes    suffix                (suffixLen bytes, raw or BPE-encoded)
TypeDescriptor

--- original column order (delta + zigzag encoded) ---
repeated numColumns times:
    zigzag_varint   delta     (sorted position delta from previous; first relative to 0)

The first column has sharedPrefixLen = 0. To reconstruct a column name, take the first sharedPrefixLen bytes from the previous name and append the suffix. If BPE is used, decode the reconstructed byte sequence by recursively expanding tokens ≥ 0x80 using the merge rules.

Columns are stored on disk in name-sorted order for front-coding compression. The original (user-defined) column order is preserved via a delta+zigzag-encoded permutation at the end of the schema block. When reading without an explicit projection, columns are returned in their original input order. The delta encoding produces long runs of +1 for locally-ordered column groups, which compress extremely well under zstd.
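
Name reconstruction and order recovery can be sketched as below. This is an illustration under stated assumptions: BPE expansion is omitted, the function names are hypothetical, and the i-th delta is taken to give the sorted position of the i-th column in the original order, relative to the previous one.

```python
def decode_front_coded(entries):
    """entries: (shared_prefix_len, suffix_bytes) pairs in name-sorted order."""
    names = []
    previous = b""
    for shared, suffix in entries:
        current = previous[:shared] + suffix  # reuse the prefix of the previous name
        names.append(current)
        previous = current
    return names


def zigzag_decode(n):
    """Map 0, 1, 2, 3, 4, ... back to 0, -1, 1, -2, 2, ..."""
    return (n >> 1) ^ -(n & 1)


def restore_original_order(sorted_names, zigzag_deltas):
    """Rebuild the user-defined column order from the delta-encoded permutation."""
    position = 0  # the first delta is relative to 0
    ordered = []
    for raw in zigzag_deltas:
        position += zigzag_decode(raw)
        ordered.append(sorted_names[position])
    return ordered


names = decode_front_coded([
    (0, b"sensor_temp_1"),
    (12, b"2"),      # shares "sensor_temp_" with the previous name
    (7, b"volt"),    # shares "sensor_"
])
# Sorted positions 3, 0, 2, 1 become deltas 3, -3, 2, -1, zigzag-encoded as 6, 5, 4, 1.
original = restore_original_order([b"amount", b"city", b"email", b"id"], [6, 5, 4, 1])
```

When the original order is already close to sorted, most deltas are +1 (zigzag value 2), which is the repetitive pattern that compresses well.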

TypeDescriptor

1 byte   typeId
1 byte   nullable      (0 = not null, 1 = nullable)
[type-specific params]

| typeId | Type | Params |
|--------|------|--------|
| 0 | BOOLEAN | (none) |
| 1 | TINYINT | (none) |
| 2 | SMALLINT | (none) |
| 3 | INTEGER | (none) |
| 4 | BIGINT | (none) |
| 5 | FLOAT | (none) |
| 6 | DOUBLE | (none) |
| 7 | DATE | (none) |
| 8 | CHAR | varint length |
| 9 | VARCHAR | varint length |
| 10 | STRING | (none); VARCHAR with MAX_LENGTH |
| 11 | BINARY | varint length |
| 12 | VARBINARY | varint length |
| 13 | BYTES | (none); VARBINARY with MAX_LENGTH |
| 14 | DECIMAL | varint precision, varint scale |
| 15 | TIME | varint precision |
| 16 | TIMESTAMP | varint precision |
| 17 | TIMESTAMP_LTZ | varint precision, varint timezoneLength, bytes timezone |

Complex types (ARRAY, MAP, ROW, etc.), VARIANT, and BLOB are not supported.

Value Serialization

Values are serialized in the same format for PLAIN data, CONST metadata, and DICT entries:

| Type | Encoding |
|------|----------|
| BOOLEAN | 1 byte (0 or 1) |
| TINYINT | 1 byte |
| SMALLINT | 2 bytes big-endian |
| INTEGER / DATE / TIME | 4 bytes big-endian |
| BIGINT | 8 bytes big-endian |
| FLOAT | 4 bytes IEEE 754 (big-endian) |
| DOUBLE | 8 bytes IEEE 754 (big-endian) |
| DECIMAL (compact, precision ≤ 18) | 8 bytes big-endian (unscaled long) |
| DECIMAL (large, precision > 18) | varint length + unscaled BigInteger bytes |
| TIMESTAMP (precision ≤ 3) | 8 bytes (epoch millis, big-endian) |
| TIMESTAMP (precision 4–6) | 8 bytes (epoch micros, big-endian) |
| TIMESTAMP (precision > 6) | 8 bytes (epoch millis) + 4 bytes (nanos of millis) |
| CHAR / VARCHAR / STRING | varint length + UTF-8 bytes |
| BINARY / VARBINARY / BYTES | varint length + raw bytes |
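
A few rows of this table translate to Python as sketched below. The function `serialize_value` and its string type names are hypothetical, only a handful of types are covered, and integers are assumed to be signed two's complement, which the table does not state explicitly.

```python
import struct


def encode_varint(value):
    """LEB128 encoding of a non-negative integer."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def serialize_value(type_name, value):
    """Serialize one non-null value for a small subset of the supported types."""
    if type_name == "BOOLEAN":
        return b"\x01" if value else b"\x00"
    if type_name == "INTEGER":
        return struct.pack(">i", value)   # 4 bytes big-endian
    if type_name == "BIGINT":
        return struct.pack(">q", value)   # 8 bytes big-endian
    if type_name == "DOUBLE":
        return struct.pack(">d", value)   # IEEE 754, big-endian
    if type_name == "VARCHAR":
        raw = value.encode("utf-8")
        return encode_varint(len(raw)) + raw  # length counts bytes, not characters
    raise NotImplementedError(type_name)
```

Because the same serialization is used for PLAIN data, CONST metadata, and DICT entries, one routine of this kind can serve all three encodings.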

Varint Encoding

Unsigned 32-bit integers are encoded as 1–5 bytes using LEB128. Each byte contributes 7 data bits; the high bit indicates whether more bytes follow (1 = more, 0 = last byte).

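| Value | Encoded bytes | Length |
|-------|---------------|--------|
| 0 | 0x00 | 1 byte |
| 127 | 0x7F | 1 byte |
| 128 | 0x80 0x01 | 2 bytes |
| 16383 | 0xFF 0x7F | 2 bytes |
| 16384 | 0x80 0x80 0x01 | 3 bytes |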
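
A minimal encoder and decoder that reproduce the examples above are sketched here. The function names are hypothetical, and the range check reflects the 32-bit limit stated in this section.

```python
def encode_varint(value):
    """Encode an unsigned 32-bit integer as 1 to 5 LEB128 bytes."""
    if not 0 <= value < 1 << 32:
        raise ValueError("value out of range for an unsigned 32-bit varint")
    out = bytearray()
    while True:
        byte = value & 0x7F      # low 7 data bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one varint from buf starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

The largest 32-bit value needs five bytes, because four bytes carry only 28 data bits.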

Limitations

  1. Complex types (ARRAY, MAP, MULTISET, ROW) are not supported.
  2. Mosaic format is designed for wide tables and may not be efficient for narrow tables with few columns.