PK Clustering Override
This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

PK Clustering Override #

By default, data files in a primary key table are physically sorted by the primary key. This is optimal for point lookups but can hurt scan performance when queries filter on non-primary-key columns.

PK Clustering Override mode changes the physical sort order of data files from the primary key to user-specified clustering columns. This significantly improves scan performance for queries that filter or group by clustering columns, while still maintaining primary key uniqueness through deletion vectors.

Quick Start #

CREATE TABLE my_table (
    id BIGINT,
    dt STRING,
    city STRING,
    amount DOUBLE,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'pk-clustering-override' = 'true',
    'clustering.columns' = 'city',
    'deletion-vectors.enabled' = 'true',
    'bucket' = '4'
);

After this, data files within each bucket will be physically sorted by city instead of id. Queries like SELECT * FROM my_table WHERE city = 'Beijing' can skip irrelevant data files by checking their min/max statistics on the clustering column.

How It Works #

PK Clustering Override replaces the default LSM compaction with a two-phase clustering compaction:

Phase 1 — Sort by Clustering Columns: Newly flushed (level 0) files are read, sorted by the configured clustering columns, and rewritten as sorted (level 1) files. A key index tracks each primary key’s file and row position to maintain uniqueness.

Phase 2 — Merge Overlapping Sections: Sorted files are grouped into sections based on clustering column range overlap. Overlapping sections are merged together. Adjacent small sections are also consolidated to reduce file count and IO amplification. Non-overlapping large files are left untouched.

During both phases, deduplication is handled via deletion vectors:

  • Deduplicate mode: When a key already exists in an older file, the old row is marked as deleted.
  • First-row mode: When a key already exists, the new row is marked as deleted, keeping the first-seen value.

When the number of files to merge exceeds sort-spill-threshold, smaller files are first spilled to row-based temporary files to reduce memory consumption, preventing OOM during multi-way merge.

Requirements #

Option Requirement
pk-clustering-override true
clustering.columns Must be set (one or more non-primary-key columns)
deletion-vectors.enabled Must be true
merge-engine deduplicate (default) or first-row only
sequence.fields Must not be set
record-level.expire-time Must not be set
Option Default Description
clustering.columns (none) Comma-separated column names used as the physical sort order for data files.
sort-spill-threshold (auto) When the number of merge readers exceeds this value, smaller files are spilled to row-based temp files to reduce memory usage.
sort-spill-buffer-size 64 mb Buffer size used for external sort during Phase 1 rewrite.

When to Use #

PK Clustering Override is beneficial when:

  • Analytical queries frequently filter or aggregate on non-primary-key columns (e.g., WHERE city = 'Beijing').
  • The table uses deduplicate or first-row merge engine.
  • You want data files physically co-located by a business dimension rather than the primary key.

It is not suitable when:

  • Point lookups by primary key are the dominant access pattern (default LSM sort is already optimal).
  • You need partial-update or aggregation merge engine.
  • sequence.fields or record-level.expire-time is required.
Edit This Page
Copyright © 2025 The Apache Software Foundation. Apache Paimon, Paimon, and its feather logo are trademarks of The Apache Software Foundation.