Vector Index
Vector Index provides approximate nearest neighbor (ANN) search for vector similarity workloads such as recommendation systems, image retrieval, and RAG (Retrieval-Augmented Generation) applications.
Supported vector index types:
| Index Type | Description |
|---|---|
| ivf-flat | IVF index with flat vector storage. |
| ivf-pq | IVF index with product quantization. |
| ivf-hnsw-flat | IVF index with HNSW flat quantizer. |
| ivf-hnsw-sq | IVF index with HNSW scalar quantizer. |
| lumina | Lumina DiskANN-based vector index. |
Choose the index type based on the trade-off you want:
| Index Type | Best For |
|---|---|
| ivf-flat | Highest recall among IVF variants when storage and memory are acceptable. |
| ivf-pq | Smaller index files and a balanced recall, latency, and storage trade-off. |
| ivf-hnsw-flat | Better recall inside IVF partitions with raw vector storage. |
| ivf-hnsw-sq | HNSW search quality with scalar quantization to reduce index size. |
| lumina | Large-scale ANN search with DiskANN graph indexing and configurable rawf32, sq8, or pq encodings. |
Build Vector Index
-- Create IVF-PQ vector index on 'embedding' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'embedding',
    index_type => 'ivf-pq',
    options => 'ivf-pq.distance.metric=cosine,ivf-pq.nlist=256,ivf-pq.pq.m=16'
);
-- Create Lumina DiskANN vector index on 'embedding' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'embedding',
    index_type => 'lumina',
    options => 'lumina.index.dimension=768,lumina.distance.metric=l2,lumina.encoding.type=sq8'
);
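The IVF-PQ call above sets ivf-pq.pq.m=16, and the vector dimension must be divisible by that value. The following is a minimal sketch of that check plus a rough per-vector size estimate. The helper name pq_layout is hypothetical and not part of Paimon, and the 8 bits per PQ code is an assumption made for illustration, not a documented default.

```python
def pq_layout(dimension: int, m: int, bits_per_code: int = 8) -> dict:
    """Check the pq.m divisibility rule and estimate per-vector sizes.

    bits_per_code=8 is an assumption for illustration only.
    """
    if dimension <= 0 or m <= 0:
        raise ValueError("dimension and m must be positive")
    if dimension % m != 0:
        raise ValueError(f"dimension {dimension} is not divisible by pq.m={m}")
    return {
        "sub_vector_dim": dimension // m,            # floats covered by each sub-vector
        "raw_bytes": dimension * 4,                  # float32 storage per vector
        "code_bytes": (m * bits_per_code + 7) // 8,  # PQ code size, rounded up to bytes
    }


layout = pq_layout(dimension=768, m=16)
print(layout)  # {'sub_vector_dim': 48, 'raw_bytes': 3072, 'code_bytes': 16}
```

Under these assumptions a 768-dimensional vector shrinks from 3072 bytes to a 16-byte code, which is why a larger pq.m trades index size for recall.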
Use index_type => 'lumina' for new Lumina indexes. The legacy lumina-vector-ann identifier is
kept only so existing tables can still load old indexes.
For ARRAY<FLOAT> vector columns, specify the vector dimension with <index-type>.dimension for
IVF indexes or lumina.index.dimension for Lumina indexes. For VECTOR<FLOAT> columns, Paimon uses
the dimension from the column type. When lumina.index.dimension is explicitly set for a
VECTOR<FLOAT> column, it must match the vector type length.
Supported IVF vector index options:
| Option | Default | Description |
|---|---|---|
| <index-type>.dimension | 128 | Vector dimension for ARRAY<FLOAT> columns. Ignored for VECTOR<FLOAT> columns. |
| <index-type>.distance.metric | inner_product | Distance metric. Supported values: l2, cosine, inner_product. |
| <index-type>.nlist | 256 | Number of IVF clusters used during index build. Higher values create more partitions and can improve recall for large datasets, but may increase build cost. |
| <index-type>.pq.m | 16 | Number of PQ sub-vectors for ivf-pq. The vector dimension must be divisible by this value. Higher values usually improve recall with larger index files. |
| <index-type>.pq.use-opq | false | Whether to enable OPQ for ivf-pq. |
| <index-type>.hnsw.m | 20 | HNSW graph out-degree for ivf-hnsw-flat and ivf-hnsw-sq. |
| <index-type>.hnsw.ef-construction | 150 | HNSW construction search width for ivf-hnsw-flat and ivf-hnsw-sq. |
| <index-type>.hnsw.max-level | 7 | Maximum HNSW level for ivf-hnsw-flat and ivf-hnsw-sq. |
Supported Lumina vector index options:
| Option | Default | Description |
|---|---|---|
| lumina.index.dimension | 128 | Vector dimension for ARRAY<FLOAT> columns. For VECTOR<FLOAT> columns, an explicitly configured value must match the type length. |
| lumina.distance.metric | inner_product | Distance metric. Supported values: l2, cosine, inner_product. |
| lumina.index.type | diskann | Lumina index type. Currently supports DiskANN. |
| lumina.encoding.type | pq | Vector encoding type. Supported values: rawf32, sq8, pq. |
| lumina.pretrain.sample_ratio | 0.2 | Sample ratio used for pretraining. |
| lumina.diskann.build.ef_construction | 1024 | Size of the dynamic candidate list during DiskANN graph construction. |
| lumina.diskann.build.neighbor_count | 64 | Maximum number of neighbors per node in the DiskANN graph. |
| lumina.diskann.build.thread_count | 32 | Number of threads used for DiskANN index building. |
| lumina.diskann.search.list_size | unset; search uses max(1.5x top_k, 16) | Default DiskANN search list size used when no query value is supplied. |
| lumina.diskann.search.beam_width | 4 | Beam width for DiskANN search. |
| lumina.encoding.pq.m | 64 | Number of sub-quantizers for PQ encoding. It is capped to the vector dimension when larger than the dimension. |
| lumina.search.parallel_number | 5 | Parallel number for Lumina search. |
Lumina PQ encoding does not support the cosine distance metric. Use rawf32 or sq8 encoding
with cosine, or use l2 or inner_product with pq.
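The two Lumina rules stated so far (no cosine with pq encoding, and lumina.encoding.pq.m capped to the vector dimension) can be sketched as a small pre-flight check. This is an illustrative helper only: effective_lumina_options is a hypothetical name, not a Paimon API, and the real validation may differ in wording and timing. The defaults are copied from the option table.

```python
# Defaults copied from the Lumina option table; the helper itself is hypothetical.
LUMINA_DEFAULTS = {
    "lumina.index.dimension": "128",
    "lumina.distance.metric": "inner_product",
    "lumina.encoding.type": "pq",
    "lumina.encoding.pq.m": "64",
}


def effective_lumina_options(user_options: dict) -> dict:
    """Merge user options over defaults and apply the documented rules."""
    opts = {**LUMINA_DEFAULTS, **user_options}
    metric = opts["lumina.distance.metric"]
    encoding = opts["lumina.encoding.type"]
    if metric not in ("l2", "cosine", "inner_product"):
        raise ValueError(f"unsupported metric: {metric}")
    if encoding not in ("rawf32", "sq8", "pq"):
        raise ValueError(f"unsupported encoding: {encoding}")
    if encoding == "pq" and metric == "cosine":
        raise ValueError("pq encoding does not support cosine; use rawf32 or sq8")
    if encoding == "pq":
        dimension = int(opts["lumina.index.dimension"])
        m = int(opts["lumina.encoding.pq.m"])
        # pq.m is capped to the vector dimension when it is larger.
        opts["lumina.encoding.pq.m"] = str(min(m, dimension))
    return opts


capped = effective_lumina_options({"lumina.index.dimension": "32"})
print(capped["lumina.encoding.pq.m"])  # 32
```

Because pq is the default encoding, a configuration that sets only lumina.distance.metric=cosine is rejected by this sketch; pairing cosine with an explicit rawf32 or sq8 encoding passes.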
The Lumina native library is currently available only on x86_64 (AMD64) architecture.
Per-Field Options
The options above can also be set at the table level (in TBLPROPERTIES), where they are shared
by every vector column of the same index type. When a table has multiple vector columns, you can
scope an option to a single column with fields.<field-name>.<option>. The field-level form takes
precedence over the column-agnostic option for that column. Use the stored table column name exactly
as <field-name>. Field-level vector options do not include the index-type prefix; for example,
use fields.image_embedding.nlist to override the shared ivf-pq.nlist option for
image_embedding:
CREATE TABLE my_table (
    id INT,
    title_embedding ARRAY<FLOAT>,
    image_embedding ARRAY<FLOAT>
) TBLPROPERTIES (
    'bucket' = '-1',
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'global-index.enabled' = 'true',
    -- per-column dimensions
    'fields.title_embedding.dimension' = '768',
    'fields.image_embedding.dimension' = '512',
    -- shared by every ivf-pq column, overridden only for 'image_embedding'
    'ivf-pq.nlist' = '256',
    'fields.image_embedding.nlist' = '512'
);
With the properties above, title_embedding is indexed with nlist=256 while image_embedding
uses nlist=512.
Lumina uses the same field-level convention. For example, fields.image_embedding.distance.metric
overrides lumina.distance.metric for image_embedding, and
fields.image_embedding.index.dimension overrides lumina.index.dimension.
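The precedence rule can be sketched as a simple lookup: try the field-level key first, then fall back to the shared index-type key. The function resolve_option is a hypothetical illustration of the documented behavior, not Paimon's internal implementation, and it is run here against the table properties from the example above.

```python
def resolve_option(props: dict, field: str, index_type: str, option: str, default=None):
    """Return the field-level value if present, else the shared index-type value."""
    field_key = f"fields.{field}.{option}"   # e.g. fields.image_embedding.nlist
    shared_key = f"{index_type}.{option}"    # e.g. ivf-pq.nlist
    if field_key in props:
        return props[field_key]
    return props.get(shared_key, default)


props = {
    "fields.title_embedding.dimension": "768",
    "fields.image_embedding.dimension": "512",
    "ivf-pq.nlist": "256",
    "fields.image_embedding.nlist": "512",
}

print(resolve_option(props, "title_embedding", "ivf-pq", "nlist"))  # 256
print(resolve_option(props, "image_embedding", "ivf-pq", "nlist"))  # 512
```

The same lookup covers the Lumina case: with option set to index.dimension and index_type set to lumina, the field key is fields.image_embedding.index.dimension and the shared key is lumina.index.dimension.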
Vector Search
Search-time options are passed with each vector search request:
| Option | Default | Description |
|---|---|---|
| ivf.nprobe | 16 | Number of IVF clusters to probe during search. Higher values usually improve recall but increase latency. |
| ivf.refine_factor | Disabled | Retrieves top_k * refine_factor IVF candidates and reranks them with the original vectors stored in the Paimon table. It is disabled by default for every IVF variant and is most useful for compressed indexes such as ivf-pq and ivf-hnsw-sq when recall is more important than latency. |
| hnsw.ef_search | 0 | HNSW search width during search. Higher values usually improve recall but increase latency. 0 uses the native library default. |
| diskann.search.list_size | max(1.5x top_k, 16) | Lumina DiskANN search list size. Higher values usually improve recall but increase latency. |
| diskann.search.beam_width | 4 | Lumina DiskANN search beam width. |
| search.parallel_number | 5 | Lumina search parallel number. |
Use the same distance metric at build time and query time. Search options can be passed per query,
so you can use a larger ivf.nprobe or hnsw.ef_search for higher recall queries and a smaller
value for latency-sensitive queries. Lumina query-time options use the native keys shown above; when
the same options are configured as table or index options, use the lumina. prefix.
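As a small worked example of two details above, the sketch below computes the default DiskANN search list size, max(1.5x top_k, 16), and maps a native query-time key to its table-level form by adding the lumina. prefix. Both function names are hypothetical, and rounding 1.5 x top_k up to an integer is an assumption, since the rounding rule is not stated here.

```python
import math


def default_search_list_size(top_k: int) -> int:
    """max(1.5 * top_k, 16); rounding up is an assumption for illustration."""
    return max(math.ceil(1.5 * top_k), 16)


def to_table_option(query_key: str) -> str:
    """Turn a native Lumina query-time key into its table or index option key."""
    return "lumina." + query_key


print(default_search_list_size(5))   # 16, the floor of 16 applies
print(default_search_list_size(20))  # 30
print(to_table_option("diskann.search.beam_width"))  # lumina.diskann.search.beam_width
```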
ivf.refine_factor also accepts the aliases refine_factor and rerank_factor, as well as hyphenated
spellings such as ivf.refine-factor. Setting ivf.refine_factor=1 still performs the raw-vector
rerank for the indexed candidates; leaving it unset skips the rerank stage.
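The two-stage behavior of ivf.refine_factor can be illustrated with a toy example: rank by lossy (compressed) vectors first, keep top_k * refine_factor candidates, then rerank those candidates with the original vectors. This sketch uses brute-force squared L2 over hand-made lossy copies in place of a real IVF index, and every name in it is hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_refine(query, raw_vectors, compressed_vectors, top_k, refine_factor=None):
    """Return row positions of the top_k results, optionally reranked with raw vectors."""
    n_candidates = top_k if refine_factor is None else top_k * refine_factor
    # Stage 1: rank every row by distance to its compressed representation.
    by_approx = sorted(
        range(len(compressed_vectors)),
        key=lambda i: squared_l2(query, compressed_vectors[i]),
    )
    candidates = by_approx[:n_candidates]
    if refine_factor is None:
        return candidates  # rerank stage skipped
    # Stage 2: rerank the candidates with the original vectors.
    reranked = sorted(candidates, key=lambda i: squared_l2(query, raw_vectors[i]))
    return reranked[:top_k]


raw_vectors = [[1.4, 0.0], [0.7, 0.0], [4.0, 0.0]]
compressed_vectors = [[1.0, 0.0], [0.5, 0.0], [4.0, 0.0]]  # made-up lossy copies
query = [0.8, 0.0]

print(search_with_refine(query, raw_vectors, compressed_vectors, top_k=1))                   # [0]
print(search_with_refine(query, raw_vectors, compressed_vectors, top_k=1, refine_factor=2))  # [1]
```

In this toy data the compressed vectors rank row 0 first, while the raw vectors show that row 1 is actually closer, so the rerank changes the result only when the candidate pool is larger than top_k.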
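Vector search is available through Spark SQL, Flink SQL (procedure), the Java API, and the Python SDK. In Spark SQL, call the vector_search table-valued function: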
-- Search for top-5 nearest neighbors
SELECT * FROM vector_search('my_table', 'embedding', array(1.0f, 2.0f, 3.0f), 5);
Unlike Spark's table-valued function, Flink uses a CALL procedure to perform vector search.
The procedure returns JSON-serialized rows as strings.
-- Search for top-5 nearest neighbors
CALL sys.vector_search(
    `table` => 'db.my_table',
    vector_column => 'embedding',
    query_vector => '1.0,2.0,3.0',
    top_k => 5
);
-- With projection (only return specific columns)
CALL sys.vector_search(
    `table` => 'db.my_table',
    vector_column => 'embedding',
    query_vector => '1.0,2.0,3.0',
    top_k => 5,
    projection => 'id,name'
);
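Because the Flink procedure returns each row as a JSON-serialized string, client code typically decodes the strings before using them. The sketch below shows that decoding step with Python's standard json module. The sample strings are made up for illustration; the actual field names and layout depend on the table schema and the projection, and parse_rows is a hypothetical helper.

```python
import json


def parse_rows(serialized_rows):
    """Decode JSON-serialized result rows into Python dicts."""
    return [json.loads(row) for row in serialized_rows]


# Made-up examples of returned strings for projection => 'id,name'.
returned = ['{"id": 1, "name": "a"}', '{"id": 7, "name": "b"}']

rows = parse_rows(returned)
ids = [row["id"] for row in rows]
print(ids)  # [1, 7]
```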
Java API:
Table table = catalog.getTable(identifier);
// Step 1: Build vector search
float[] queryVector = {1.0f, 2.0f, 3.0f};
GlobalIndexResult result = table.newVectorSearchBuilder()
        .withVector(queryVector)
        .withLimit(5)
        .withVectorColumn("embedding")
        .withOption("ivf.nprobe", "16")
        .executeLocal();
// Step 2: Read matching rows using the search result
ReadBuilder readBuilder = table.newReadBuilder();
TableScan.Plan plan = readBuilder.newScan().withGlobalIndexResult(result).plan();
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
    reader.forEachRemaining(row -> {
        System.out.println("id=" + row.getInt(0) + ", name=" + row.getString(1));
    });
}
For batch search, results are returned in the same order as the input vectors:
float[][] queryVectors = {
    {1.0f, 2.0f, 3.0f},
    {3.0f, 2.0f, 1.0f}
};
List<GlobalIndexResult> batchResults = table.newBatchVectorSearchBuilder()
        .withVectors(queryVectors)
        .withLimit(5)
        .withVectorColumn("embedding")
        .executeBatchLocal();
// batchResults.get(i) corresponds to queryVectors[i].
For Java, use Table.newVectorSearchBuilder() to produce a global index result, then pass
the result to TableScan.withGlobalIndexResult.
Python SDK:
table = catalog.get_table("db.my_table")
# Step 1: Build vector search
result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0])
    .with_limit(5)
    .with_option("ivf.nprobe", "16")
    .execute_local()
)
# Step 2: Read matching rows using the search result
read_builder = table.new_read_builder()
scan = read_builder.new_scan().with_global_index_result(result)
plan = scan.plan()
table_read = read_builder.new_read()
pa_table = table_read.to_arrow(plan.splits())
print(pa_table)
You can also add a scalar filter to pre-filter rows before vector search:
from pypaimon.common.predicate_builder import PredicateBuilder
predicate = (
    PredicateBuilder(table.fields)
    .equal("category", "electronics")
)
result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0])
    .with_limit(5)
    .with_filter(predicate)
    .execute_local()
)
The scalar filter is evaluated with matching scalar global indexes before vector search. Build a
BTree index for frequently used metadata filters, such as category, tenant_id, or event_time,
so vector search can restrict the candidate row ids before running ANN search.
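The order of operations described above (scalar filter first, nearest-neighbor ranking second) can be sketched in plain Python. This toy version substitutes a linear scan for the BTree index and an exact brute-force ranking for ANN search, so it only illustrates the pre-filter semantics. All names and rows in it are hypothetical.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_top_k(rows, query, top_k, predicate):
    """Apply the scalar predicate first, then rank only the surviving rows."""
    candidates = [row for row in rows if predicate(row)]
    ranked = sorted(candidates, key=lambda row: squared_l2(query, row["embedding"]))
    return [row["id"] for row in ranked[:top_k]]


rows = [
    {"id": 1, "category": "electronics", "embedding": [1.0, 2.0, 3.0]},
    {"id": 2, "category": "books", "embedding": [1.0, 2.0, 3.1]},
    {"id": 3, "category": "electronics", "embedding": [4.0, 4.0, 4.0]},
]

result_ids = filtered_top_k(
    rows, [1.0, 2.0, 3.0], top_k=2,
    predicate=lambda row: row["category"] == "electronics",
)
print(result_ids)  # [1, 3]
```

Row 2 is the second-closest vector overall, but it never becomes a candidate because the filter removes it before ranking.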