Vector Index

Vector Index provides approximate nearest neighbor (ANN) search for vector similarity workloads such as recommendation systems, image retrieval, and retrieval-augmented generation (RAG) applications.

Supported vector index types:

| Index Type | Description |
| --- | --- |
| ivf-flat | IVF index with flat vector storage. |
| ivf-pq | IVF index with product quantization. |
| ivf-hnsw-flat | IVF index with HNSW flat quantizer. |
| ivf-hnsw-sq | IVF index with HNSW scalar quantizer. |
| lumina | Lumina DiskANN-based vector index. |

Choose the index type based on the trade-off you want:

| Index Type | Best For |
| --- | --- |
| ivf-flat | Highest recall among IVF variants when storage and memory are acceptable. |
| ivf-pq | Smaller index files and a balanced recall, latency, and storage trade-off. |
| ivf-hnsw-flat | Better recall inside IVF partitions with raw vector storage. |
| ivf-hnsw-sq | HNSW search quality with scalar quantization to reduce index size. |
| lumina | Large-scale ANN search with DiskANN graph indexing and configurable rawf32, sq8, or pq encodings. |

Build Vector Index

-- Create IVF-PQ vector index on 'embedding' column
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'ivf-pq',
  options => 'ivf-pq.distance.metric=cosine,ivf-pq.nlist=256,ivf-pq.pq.m=16'
);

-- Create Lumina DiskANN vector index on 'embedding' column
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'lumina',
  options => 'lumina.index.dimension=768,lumina.distance.metric=l2,lumina.encoding.type=sq8'
);

Use index_type => 'lumina' for new Lumina indexes. The legacy lumina-vector-ann identifier is kept only so existing tables can still load old indexes.

For ARRAY<FLOAT> vector columns, specify the vector dimension with <index-type>.dimension for IVF indexes or lumina.index.dimension for Lumina indexes. For VECTOR<FLOAT> columns, Paimon uses the dimension from the column type. When lumina.index.dimension is explicitly set for a VECTOR<FLOAT> column, it must match the vector type length.
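
As a minimal sketch of the ARRAY<FLOAT> case, the call below passes the dimension explicitly for an ivf-flat index. The table name, column name, and the values 768 and l2 are illustrative assumptions; replace them with your own.

```sql
-- Illustrative only: 'db.my_table', 'embedding', 768, and l2 are assumed values.
-- The 'embedding' column is assumed to be ARRAY<FLOAT>, so the dimension comes
-- from the option (the documented default of 128 applies if it is omitted).
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'ivf-flat',
  options => 'ivf-flat.dimension=768,ivf-flat.distance.metric=l2'
);
```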

Supported IVF vector index options:

| Option | Default | Description |
| --- | --- | --- |
| <index-type>.dimension | 128 | Vector dimension for ARRAY<FLOAT> columns. Ignored for VECTOR<FLOAT> columns. |
| <index-type>.distance.metric | inner_product | Distance metric. Supported values: l2, cosine, inner_product. |
| <index-type>.nlist | 256 | Number of IVF clusters used during index build. Higher values create more partitions and can improve recall for large datasets, but may increase build cost. |
| <index-type>.pq.m | 16 | Number of PQ sub-vectors for ivf-pq. The vector dimension must be divisible by this value. Higher values usually improve recall with larger index files. |
| <index-type>.pq.use-opq | false | Whether to enable OPQ for ivf-pq. |
| <index-type>.hnsw.m | 20 | HNSW graph out-degree for ivf-hnsw-flat and ivf-hnsw-sq. |
| <index-type>.hnsw.ef-construction | 150 | HNSW construction search width for ivf-hnsw-flat and ivf-hnsw-sq. |
| <index-type>.hnsw.max-level | 7 | Maximum HNSW level for ivf-hnsw-flat and ivf-hnsw-sq. |
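
The IVF options use the <index-type>.<option> naming pattern, with the index type substituted for the placeholder. The two calls below are alternatives, sketched with assumed values rather than tuning recommendations: the first enables OPQ for ivf-pq with a pq.m that divides the assumed dimension (768 / 32 = 24), and the second overrides the HNSW build options for ivf-hnsw-sq.

```sql
-- Illustrative only: table, column, and all option values are assumptions.

-- Alternative 1: ivf-pq with OPQ; 768 is divisible by pq.m = 32.
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'ivf-pq',
  options => 'ivf-pq.dimension=768,ivf-pq.nlist=512,ivf-pq.pq.m=32,ivf-pq.pq.use-opq=true'
);

-- Alternative 2: ivf-hnsw-sq with a denser HNSW graph than the defaults.
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'ivf-hnsw-sq',
  options => 'ivf-hnsw-sq.dimension=768,ivf-hnsw-sq.hnsw.m=32,ivf-hnsw-sq.hnsw.ef-construction=200'
);
```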

Supported Lumina vector index options:

| Option | Default | Description |
| --- | --- | --- |
| lumina.index.dimension | 128 | Vector dimension for ARRAY<FLOAT> columns. For VECTOR<FLOAT> columns, an explicitly configured value must match the type length. |
| lumina.distance.metric | inner_product | Distance metric. Supported values: l2, cosine, inner_product. |
| lumina.index.type | diskann | Lumina index type. Currently supports DiskANN. |
| lumina.encoding.type | pq | Vector encoding type. Supported values: rawf32, sq8, pq. |
| lumina.pretrain.sample_ratio | 0.2 | Sample ratio used for pretraining. |
| lumina.diskann.build.ef_construction | 1024 | Size of the dynamic candidate list during DiskANN graph construction. |
| lumina.diskann.build.neighbor_count | 64 | Maximum number of neighbors per node in the DiskANN graph. |
| lumina.diskann.build.thread_count | 32 | Number of threads used for DiskANN index building. |
| lumina.diskann.search.list_size | unset; search uses max(1.5x top_k, 16) | Default DiskANN search list size used when no query value is supplied. |
| lumina.diskann.search.beam_width | 4 | Beam width for DiskANN search. |
| lumina.encoding.pq.m | 64 | Number of sub-quantizers for PQ encoding. It is capped to the vector dimension when larger than the dimension. |
| lumina.search.parallel_number | 5 | Parallel number for Lumina search. |

Lumina PQ encoding does not support the cosine distance metric. Use rawf32 or sq8 encoding with cosine, or use l2 or inner_product with pq.
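
Because lumina.encoding.type defaults to pq, a Lumina index that uses cosine presumably needs an explicit non-PQ encoding. The two alternative calls below sketch one supported combination of each kind; the table name, column name, and dimension are assumed values.

```sql
-- Illustrative only: 'db.my_table', 'embedding', and dimension 768 are assumptions.

-- Alternative 1: cosine with raw float32 encoding (sq8 is the other documented choice).
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'lumina',
  options => 'lumina.index.dimension=768,lumina.distance.metric=cosine,lumina.encoding.type=rawf32'
);

-- Alternative 2: PQ encoding with inner_product (l2 is the other documented choice).
CALL sys.create_global_index(
  table => 'db.my_table',
  index_column => 'embedding',
  index_type => 'lumina',
  options => 'lumina.index.dimension=768,lumina.distance.metric=inner_product,lumina.encoding.type=pq,lumina.encoding.pq.m=64'
);
```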

The Lumina native library is currently available only on x86_64 (AMD64) architecture.

Per-Field Options

The options above can also be set at the table level (in TBLPROPERTIES), where they are shared by every vector column of the same index type. When a table has multiple vector columns, you can scope an option to a single column with fields.<field-name>.<option>. The field-level form takes precedence over the column-agnostic option for that column. Use the stored table column name exactly as <field-name>. Field-level vector options do not include the index-type prefix; for example, use fields.image_embedding.nlist to override the shared ivf-pq.nlist option for image_embedding:

CREATE TABLE my_table (
  id INT,
  title_embedding ARRAY<FLOAT>,
  image_embedding ARRAY<FLOAT>
) TBLPROPERTIES (
  'bucket' = '-1',
  'row-tracking.enabled' = 'true',
  'data-evolution.enabled' = 'true',
  'global-index.enabled' = 'true',
  -- per-column dimensions
  'fields.title_embedding.dimension' = '768',
  'fields.image_embedding.dimension' = '512',
  -- shared by every ivf-pq column, overridden only for 'image_embedding'
  'ivf-pq.nlist' = '256',
  'fields.image_embedding.nlist' = '512'
);

With the properties above, title_embedding is indexed with nlist=256 while image_embedding uses nlist=512.

Lumina uses the same field-level convention. For example, fields.image_embedding.distance.metric overrides lumina.distance.metric for image_embedding, and fields.image_embedding.index.dimension overrides lumina.index.dimension.
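
A Lumina variant of the table definition might look like the sketch below. The table name my_lumina_table and all option values are hypothetical, and the non-vector table properties are copied from the IVF example on the assumption that they apply unchanged.

```sql
-- Illustrative only: table name and option values are assumptions.
CREATE TABLE my_lumina_table (
  id INT,
  title_embedding ARRAY<FLOAT>,
  image_embedding ARRAY<FLOAT>
) TBLPROPERTIES (
  'bucket' = '-1',
  'row-tracking.enabled' = 'true',
  'data-evolution.enabled' = 'true',
  'global-index.enabled' = 'true',
  -- shared by every Lumina-indexed column
  'lumina.index.dimension' = '768',
  'lumina.distance.metric' = 'l2',
  -- overridden only for 'image_embedding'
  'fields.image_embedding.index.dimension' = '512',
  'fields.image_embedding.distance.metric' = 'inner_product'
);
```

Under the precedence rule described above, title_embedding would use the shared values (768, l2) while image_embedding would use its field-level values (512, inner_product).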

Search-time options are passed with each vector search request:

| Option | Default | Description |
| --- | --- | --- |
| ivf.nprobe | 16 | Number of IVF clusters to probe during search. Higher values usually improve recall but increase latency. |
| ivf.refine_factor | Disabled | Retrieves top_k * refine_factor IVF candidates and reranks them with the original vectors stored in the Paimon table. It is disabled by default for every IVF variant and is most useful for compressed indexes such as ivf-pq and ivf-hnsw-sq when recall is more important than latency. |
| hnsw.ef_search | 0 | HNSW search width during search. Higher values usually improve recall but increase latency. 0 uses the native library default. |
| diskann.search.list_size | max(1.5x top_k, 16) | Lumina DiskANN search list size. Higher values usually improve recall but increase latency. |
| diskann.search.beam_width | 4 | Lumina DiskANN search beam width. |
| search.parallel_number | 5 | Lumina search parallel number. |

Use the same distance metric at build time and query time. Search options can be passed per query, so you can use a larger ivf.nprobe or hnsw.ef_search for queries that need higher recall and a smaller value for latency-sensitive queries. Lumina query-time options use the native keys shown above; when the same options are configured as table or index options, add the lumina. prefix (for example, lumina.diskann.search.beam_width).

ivf.refine_factor also accepts the aliases refine_factor and rerank_factor, as well as hyphenated spellings such as ivf.refine-factor. For example, with top_k = 5 and ivf.refine_factor = 4, 20 IVF candidates are retrieved and reranked. Setting ivf.refine_factor=1 still performs the raw-vector rerank for the indexed candidates; leaving it unset skips the rerank stage.

-- Search for top-5 nearest neighbors
SELECT * FROM vector_search('my_table', 'embedding', array(1.0f, 2.0f, 3.0f), 5);