PK Clustering Override #
By default, data files in a primary key table are physically sorted by the primary key. This is optimal for point lookups but can hurt scan performance when queries filter on non-primary-key columns.
PK Clustering Override mode changes the physical sort order of data files from the primary key to user-specified clustering columns. This significantly improves scan performance for queries that filter or group by clustering columns, while still maintaining primary key uniqueness through deletion vectors.
Quick Start #
CREATE TABLE my_table (
id BIGINT,
dt STRING,
city STRING,
amount DOUBLE,
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'pk-clustering-override' = 'true',
'clustering.columns' = 'city',
'deletion-vectors.enabled' = 'true',
'bucket' = '4'
);
After this, data files within each bucket will be physically sorted by city instead of id. Queries like
SELECT * FROM my_table WHERE city = 'Beijing' can skip irrelevant data files by checking their min/max statistics
on the clustering column.
Requirements #
| Option | Requirement |
|---|---|
pk-clustering-override |
true |
clustering.columns |
Must be set (one or more non-primary-key columns) |
deletion-vectors.enabled |
Must be true |
merge-engine |
deduplicate (default) or first-row only |
When to Use #
PK Clustering Override is beneficial when:
- Analytical queries frequently filter or aggregate on non-primary-key columns (e.g.,
WHERE city = 'Beijing'). - The table uses
deduplicateorfirst-rowmerge engine.
Although data files are no longer sorted by the primary key, filtering on bucket-key fields (which default to the
primary key) still benefits from bucket pruning. The query engine can skip entire buckets that do not contain matching
values, so queries like WHERE id = 12345 remain efficient.
Unsupported modes:
- Merge engine:
partial-updateoraggregation. - Changelog producer:
lookuporfull-compaction. - Configue:
sequence.fieldsorrecord-level.expire-time.