Vector Index

Vector Index provides approximate nearest neighbor (ANN) search for vector similarity workloads such as recommendation systems, image retrieval, and RAG (retrieval-augmented generation) applications.

Supported vector index types:

Index Type	Description
`ivf-flat`	IVF index with flat vector storage.
`ivf-pq`	IVF index with product quantization.
`ivf-hnsw-flat`	IVF index with HNSW flat quantizer.
`ivf-hnsw-sq`	IVF index with HNSW scalar quantizer.
`lumina`	Lumina DiskANN-based vector index.

Choose the index type based on the trade-off you want:

Index Type	Best For
`ivf-flat`	Highest recall among IVF variants when storage and memory are acceptable.
`ivf-pq`	Smaller index files and a balanced recall, latency, and storage trade-off.
`ivf-hnsw-flat`	Better recall inside IVF partitions with raw vector storage.
`ivf-hnsw-sq`	HNSW search quality with scalar quantization to reduce index size.
`lumina`	Large-scale ANN search with DiskANN graph indexing and configurable `rawf32`, `sq8`, or `pq` encodings.

Build Vector Index

-- Create IVF-PQ vector index on 'embedding' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'embedding',
    index_type => 'ivf-pq',
    options => 'ivf-pq.distance.metric=cosine,ivf-pq.nlist=256,ivf-pq.pq.m=16'
);

-- Create Lumina DiskANN vector index on 'embedding' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'embedding',
    index_type => 'lumina',
    options => 'lumina.index.dimension=768,lumina.distance.metric=l2,lumina.encoding.type=sq8'
);
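
The options argument is a single comma-separated string of key=value pairs. When the values come from application configuration, a small helper can assemble that string. The sketch below is illustrative only: format_index_options is a hypothetical helper, not part of Paimon, and it assumes that keys and values never contain a comma or an equals sign.

```python
def format_index_options(options):
    """Join a dict of index options into the 'k1=v1,k2=v2' string form."""
    parts = []
    for key, value in options.items():
        text = str(value)
        # Assumption: ',' and '=' act as separators, so reject them inside keys and values.
        if "," in key or "=" in key or "," in text or "=" in text:
            raise ValueError(f"unsupported character in option {key!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


ivf_pq_options = format_index_options({
    "ivf-pq.distance.metric": "cosine",
    "ivf-pq.nlist": 256,
    "ivf-pq.pq.m": 16,
})
print(ivf_pq_options)
```

The resulting string can be passed as the options argument of sys.create_global_index.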

Use index_type => 'lumina' for new Lumina indexes. The legacy lumina-vector-ann identifier is kept only so existing tables can still load old indexes.

For ARRAY<FLOAT> vector columns, specify the vector dimension with <index-type>.dimension for IVF indexes or lumina.index.dimension for Lumina indexes. For VECTOR<FLOAT> columns, Paimon uses the dimension from the column type. When lumina.index.dimension is explicitly set for a VECTOR<FLOAT> column, it must match the vector type length.
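
The dimension rules above can be summarized as a small decision function. This is a sketch of the documented behavior, not Paimon code: resolve_dimension and the "ivf" / "lumina" family labels are hypothetical names, and the fallback of 128 is the documented default of the dimension options.

```python
DEFAULT_DIMENSION = 128  # documented default of the dimension options


def resolve_dimension(index_family, vector_type_length=None, configured=None):
    """Return the dimension an index would use.

    index_family: "ivf" or "lumina" (hypothetical labels for this sketch).
    vector_type_length: length of a VECTOR<FLOAT> column, or None for ARRAY<FLOAT>.
    configured: explicitly configured dimension option, or None if unset.
    """
    if vector_type_length is None:
        # ARRAY<FLOAT>: the option decides, falling back to the default.
        return configured if configured is not None else DEFAULT_DIMENSION
    if (
        index_family == "lumina"
        and configured is not None
        and configured != vector_type_length
    ):
        raise ValueError(
            f"lumina.index.dimension={configured} does not match "
            f"VECTOR length {vector_type_length}"
        )
    # VECTOR<FLOAT>: the column type decides.
    return vector_type_length


print(resolve_dimension("ivf", configured=768))             # ARRAY<FLOAT> column
print(resolve_dimension("lumina", vector_type_length=512))  # VECTOR<FLOAT> column
```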

Supported IVF vector index options:

Option	Default	Description
`<index-type>.dimension`	`128`	Vector dimension for `ARRAY<FLOAT>` columns. Ignored for `VECTOR<FLOAT>` columns.
`<index-type>.distance.metric`	`inner_product`	Distance metric. Supported values: `l2`, `cosine`, `inner_product`.
`<index-type>.nlist`	`256`	Number of IVF clusters used during index build. Higher values create more partitions and can improve recall for large datasets, but may increase build cost.
`<index-type>.pq.m`	`16`	Number of PQ sub-vectors for `ivf-pq`. The vector dimension must be divisible by this value. Higher values usually improve recall with larger index files.
`<index-type>.pq.use-opq`	`false`	Whether to enable OPQ for `ivf-pq`.
`<index-type>.hnsw.m`	`20`	HNSW graph out-degree for `ivf-hnsw-flat` and `ivf-hnsw-sq`.
`<index-type>.hnsw.ef-construction`	`150`	HNSW construction search width for `ivf-hnsw-flat` and `ivf-hnsw-sq`.
`<index-type>.hnsw.max-level`	`7`	Maximum HNSW level for `ivf-hnsw-flat` and `ivf-hnsw-sq`.
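
Because the vector dimension must be divisible by pq.m, it can help to check candidate values before building an ivf-pq index. The following is a minimal sketch; both function names are hypothetical helpers and are not part of Paimon.

```python
def pq_subvector_length(dimension, m):
    """Return the length of each PQ sub-vector, or raise if m does not divide dimension."""
    if m <= 0:
        raise ValueError("pq.m must be positive")
    if dimension % m != 0:
        raise ValueError(f"dimension {dimension} is not divisible by pq.m={m}")
    return dimension // m


def valid_pq_m_values(dimension):
    """List every pq.m value that evenly divides the dimension."""
    return [m for m in range(1, dimension + 1) if dimension % m == 0]


print(pq_subvector_length(768, 16))  # each of the 16 sub-vectors covers 48 dimensions
print(valid_pq_m_values(12))
```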

Supported Lumina vector index options:

Option	Default	Description
`lumina.index.dimension`	`128`	Vector dimension for `ARRAY<FLOAT>` columns. For `VECTOR<FLOAT>` columns, an explicitly configured value must match the type length.
`lumina.distance.metric`	`inner_product`	Distance metric. Supported values: `l2`, `cosine`, `inner_product`.
`lumina.index.type`	`diskann`	Lumina index type. Currently supports DiskANN.
`lumina.encoding.type`	`pq`	Vector encoding type. Supported values: `rawf32`, `sq8`, `pq`.
`lumina.pretrain.sample_ratio`	`0.2`	Sample ratio used for pretraining.
`lumina.diskann.build.ef_construction`	`1024`	Size of the dynamic candidate list during DiskANN graph construction.
`lumina.diskann.build.neighbor_count`	`64`	Maximum number of neighbors per node in the DiskANN graph.
`lumina.diskann.build.thread_count`	`32`	Number of threads used for DiskANN index building.
`lumina.diskann.search.list_size`	unset; search uses `max(1.5x top_k, 16)`	Default DiskANN search list size used when no query value is supplied.
`lumina.diskann.search.beam_width`	`4`	Beam width for DiskANN search.
`lumina.encoding.pq.m`	`64`	Number of sub-quantizers for PQ encoding. It is capped to the vector dimension when larger than the dimension.
`lumina.search.parallel_number`	`5`	Parallel number for Lumina search.

Lumina PQ encoding does not support the cosine distance metric. Use rawf32 or sq8 encoding with cosine, or use l2 or inner_product with pq.
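
The metric and encoding constraint, together with the capping of lumina.encoding.pq.m, can be checked up front. The sketch below only mirrors the rules stated in this section; effective_lumina_encoding is a hypothetical helper, and its defaults are the documented option defaults.

```python
def effective_lumina_encoding(dimension, metric="inner_product", encoding="pq", pq_m=64):
    """Validate a Lumina metric/encoding pair and return the effective settings."""
    if metric not in ("l2", "cosine", "inner_product"):
        raise ValueError(f"unsupported metric: {metric}")
    if encoding not in ("rawf32", "sq8", "pq"):
        raise ValueError(f"unsupported encoding: {encoding}")
    if encoding == "pq" and metric == "cosine":
        raise ValueError("pq encoding does not support cosine; use rawf32 or sq8")
    settings = {"metric": metric, "encoding": encoding}
    if encoding == "pq":
        # lumina.encoding.pq.m is capped to the vector dimension.
        settings["pq_m"] = min(pq_m, dimension)
    return settings


print(effective_lumina_encoding(32))
print(effective_lumina_encoding(768, metric="cosine", encoding="sq8"))
```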

The Lumina native library is currently available only on x86_64 (AMD64) architecture.

Per-Field Options

The options above can also be set at the table level (in TBLPROPERTIES), where they are shared by every vector column of the same index type. When a table has multiple vector columns, scope an option to a single column with fields.<field-name>.<option>; for that column, the field-level value takes precedence over the shared option. Use the stored table column name exactly as <field-name>, and omit the index-type prefix from the option name. For example, use fields.image_embedding.nlist to override the shared ivf-pq.nlist option for image_embedding:

CREATE TABLE my_table (
    id INT,
    title_embedding ARRAY<FLOAT>,
    image_embedding ARRAY<FLOAT>
) TBLPROPERTIES (
    'bucket' = '-1',
    'row-tracking.enabled' = 'true',
    'data-evolution.enabled' = 'true',
    'global-index.enabled' = 'true',
    -- per-column dimensions
    'fields.title_embedding.dimension' = '768',
    'fields.image_embedding.dimension' = '512',
    -- shared by every ivf-pq column, overridden only for 'image_embedding'
    'ivf-pq.nlist' = '256',
    'fields.image_embedding.nlist' = '512'
);

With the properties above, title_embedding is indexed with nlist=256 while image_embedding uses nlist=512.

Lumina uses the same field-level convention. For example, fields.image_embedding.distance.metric overrides lumina.distance.metric for image_embedding, and fields.image_embedding.index.dimension overrides lumina.index.dimension.
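
The precedence rule can be sketched as a lookup over the table properties: the field-level key first, then the shared key with the index-type prefix, then a default. resolve_option is a hypothetical helper written for illustration; it is not how Paimon necessarily implements the lookup.

```python
def resolve_option(table_properties, index_type, field_name, option, default=None):
    """Resolve one vector index option for a column.

    Lookup order: fields.<field-name>.<option>, then <index-type>.<option>, then default.
    """
    field_key = f"fields.{field_name}.{option}"
    shared_key = f"{index_type}.{option}"
    if field_key in table_properties:
        return table_properties[field_key]
    return table_properties.get(shared_key, default)


props = {
    "fields.title_embedding.dimension": "768",
    "fields.image_embedding.dimension": "512",
    "ivf-pq.nlist": "256",
    "fields.image_embedding.nlist": "512",
}
print(resolve_option(props, "ivf-pq", "title_embedding", "nlist"))
print(resolve_option(props, "ivf-pq", "image_embedding", "nlist"))
```

The same function covers Lumina by passing "lumina" as the index type and an option such as "distance.metric" or "index.dimension".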

Vector Search

Search-time options are passed with each vector search request:

Option	Default	Description
`ivf.nprobe`	`16`	Number of IVF clusters to probe during search. Higher values usually improve recall but increase latency.
`ivf.refine_factor`	Disabled	Retrieves `top_k * refine_factor` IVF candidates and reranks them with the original vectors stored in the Paimon table. It is disabled by default for every IVF variant and is most useful for compressed indexes such as `ivf-pq` and `ivf-hnsw-sq` when recall is more important than latency.
`hnsw.ef_search`	`0`	HNSW search width during search. Higher values usually improve recall but increase latency. `0` uses the native library default.
`diskann.search.list_size`	`max(1.5x top_k, 16)`	Lumina DiskANN search list size. Higher values usually improve recall but increase latency.
`diskann.search.beam_width`	`4`	Lumina DiskANN search beam width.
`search.parallel_number`	`5`	Lumina search parallel number.
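
The default for diskann.search.list_size depends on top_k. As a rough illustration, the formula max(1.5x top_k, 16) can be evaluated as below. The rounding of fractional values is not specified in this document, so rounding up is an assumption of this sketch, and default_search_list_size is a hypothetical name.

```python
import math


def default_search_list_size(top_k):
    """Default DiskANN search list size: max(1.5 x top_k, 16).

    Assumption: fractional results are rounded up; the rounding rule is not documented here.
    """
    return max(math.ceil(1.5 * top_k), 16)


for top_k in (5, 20, 100):
    print(top_k, default_search_list_size(top_k))
```

For small top_k values the floor of 16 applies; for larger values the list grows in proportion to top_k.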

Use the same distance metric at build time and query time. Because search options are passed per query, you can use a larger ivf.nprobe or hnsw.ef_search for queries that need higher recall and a smaller value for latency-sensitive queries. Lumina query-time options use the unprefixed keys shown above; when the same options are configured as table or index options, add the lumina. prefix (for example, lumina.diskann.search.beam_width).
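
To see why the metric must be consistent, note that the three supported metrics can rank the same vectors differently. The following self-contained sketch uses plain Python rather than any Paimon API; the vectors are made-up values chosen to show the effect.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


def nearest(query, vectors, metric):
    """Return the key of the best match under the given metric."""
    if metric == "l2":
        # Smaller distance is better.
        return min(vectors, key=lambda k: l2(query, vectors[k]))
    # Larger similarity is better for inner_product and cosine.
    score = inner_product if metric == "inner_product" else cosine
    return max(vectors, key=lambda k: score(query, vectors[k]))


query = [1.0, 0.0]
vectors = {"a": [0.9, 0.0], "b": [3.0, 1.0]}
for metric in ("l2", "cosine", "inner_product"):
    print(metric, nearest(query, vectors, metric))
```

Here l2 and cosine pick "a", while inner_product picks "b" because of its larger magnitude.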

ivf.refine_factor also accepts the aliases refine_factor and rerank_factor, as well as hyphenated spellings such as ivf.refine-factor. Setting ivf.refine_factor=1 still reranks the retrieved candidates against the raw vectors; leaving it unset skips the rerank stage.
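
The difference between an unset refine factor and a value of 1 can be illustrated with a conceptual sketch. It is not Paimon's implementation: search_with_refine is a hypothetical function, the approximate distances are made-up values standing in for a compressed index, and exact L2 distance is used for the rerank.

```python
import math


def search_with_refine(approx_hits, raw_vectors, query, top_k, refine_factor=None):
    """approx_hits: list of (row_id, approx_distance), sorted by approx_distance ascending.

    With refine_factor unset, return the first top_k approximate hits unchanged.
    Otherwise take top_k * refine_factor candidates and rerank them by exact L2 distance.
    """
    if refine_factor is None:
        return [row_id for row_id, _ in approx_hits[:top_k]]
    candidates = [row_id for row_id, _ in approx_hits[: top_k * refine_factor]]
    candidates.sort(key=lambda row_id: math.dist(query, raw_vectors[row_id]))
    return candidates[:top_k]


raw_vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [0.2, 0.0], 4: [5.0, 5.0]}
# Quantization error puts row 2 ahead of rows 1 and 3 in the approximate ranking.
approx_hits = [(2, 0.10), (1, 0.15), (3, 0.30), (4, 7.00)]
query = [0.0, 0.0]

print(search_with_refine(approx_hits, raw_vectors, query, top_k=2))
print(search_with_refine(approx_hits, raw_vectors, query, top_k=2, refine_factor=1))
print(search_with_refine(approx_hits, raw_vectors, query, top_k=2, refine_factor=2))
```

With the factor unset the approximate order is returned as is; a factor of 1 reorders the same two candidates; a factor of 2 widens the candidate pool so that row 3 can displace row 2.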

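Vector search is available from Spark SQL, Flink SQL (as a procedure), the Java API, and the Python SDK. The examples below appear in that order.

Spark SQL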

-- Search for top-5 nearest neighbors
SELECT * FROM vector_search('my_table', 'embedding', array(1.0f, 2.0f, 3.0f), 5);

Flink SQL (Procedure)

Unlike Spark's table-valued function, Flink uses a CALL procedure to perform vector search. The procedure returns JSON-serialized rows as strings.

-- Search for top-5 nearest neighbors
CALL sys.vector_search(
    `table` => 'db.my_table',
    vector_column => 'embedding',
    query_vector => '1.0,2.0,3.0',
    top_k => 5
);

-- With projection (only return specific columns)
CALL sys.vector_search(
    `table` => 'db.my_table',
    vector_column => 'embedding',
    query_vector => '1.0,2.0,3.0',
    top_k => 5,
    projection => 'id,name'
);
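
On the client side, the Flink procedure needs the query vector as a comma-separated string and returns rows as JSON strings. A minimal sketch of both conversions follows. The helper names are hypothetical, and the sample rows are invented: the actual field names depend on the table schema and projection, and the sketch assumes each returned string holds one JSON object.

```python
import json


def to_query_vector_arg(vector):
    """Format a vector as the comma-separated string passed to query_vector."""
    return ",".join(repr(float(x)) for x in vector)


def parse_result_rows(rows):
    """Decode JSON-serialized result rows into dicts."""
    return [json.loads(row) for row in rows]


print(to_query_vector_arg([1, 2.0, 3.0]))

# Hypothetical output rows; real field names depend on the table and projection.
sample_rows = ['{"id": 1, "name": "alpha"}', '{"id": 7, "name": "beta"}']
for row in parse_result_rows(sample_rows):
    print(row["id"], row["name"])
```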

Java API

Table table = catalog.getTable(identifier);

// Step 1: Build vector search
float[] queryVector = {1.0f, 2.0f, 3.0f};
GlobalIndexResult result = table.newVectorSearchBuilder()
        .withVector(queryVector)
        .withLimit(5)
        .withVectorColumn("embedding")
        .withOption("ivf.nprobe", "16")
        .executeLocal();

// Step 2: Read matching rows using the search result
ReadBuilder readBuilder = table.newReadBuilder();
TableScan.Plan plan = readBuilder.newScan().withGlobalIndexResult(result).plan();
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
    reader.forEachRemaining(row -> {
        System.out.println("id=" + row.getInt(0) + ", name=" + row.getString(1));
    });
}

Batch search accepts several query vectors in one call. Results are returned in the same order as the input vectors.

float[][] queryVectors = {
    {1.0f, 2.0f, 3.0f},
    {3.0f, 2.0f, 1.0f}
};
List<GlobalIndexResult> batchResults = table.newBatchVectorSearchBuilder()
        .withVectors(queryVectors)
        .withLimit(5)
        .withVectorColumn("embedding")
        .executeBatchLocal();
// batchResults.get(i) corresponds to queryVectors[i].

For Java, use Table.newVectorSearchBuilder() to produce a global index result, then pass the result to TableScan.withGlobalIndexResult.

Python SDK

table = catalog.get_table("db.my_table")

# Step 1: Build vector search
result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0])
    .with_limit(5)
    .with_option("ivf.nprobe", "16")
    .execute_local()
)

# Step 2: Read matching rows using the search result
read_builder = table.new_read_builder()
scan = read_builder.new_scan().with_global_index_result(result)
plan = scan.plan()
table_read = read_builder.new_read()
pa_table = table_read.to_arrow(plan.splits())
print(pa_table)

You can also add a scalar filter to pre-filter rows before vector search:

from pypaimon.common.predicate_builder import PredicateBuilder

predicate = (
    PredicateBuilder(table.fields)
    .equal("category", "electronics")
)

result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0])
    .with_limit(5)
    .with_filter(predicate)
    .execute_local()
)

The scalar filter is evaluated with matching scalar global indexes before vector search. Build a BTree index for frequently used metadata filters, such as category, tenant_id, or event_time, so vector search can restrict the candidate row ids before running ANN search.
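
Conceptually, pre-filtering means the scalar predicate narrows the candidate row ids first, and the vector ranking only sees the survivors. The sketch below illustrates that order of operations in plain Python. It is not how Paimon executes the query: filtered_search is a hypothetical function, the row layout is invented, and exact L2 distance stands in for the ANN step.

```python
import math


def filtered_search(rows, query, top_k, category):
    """Pre-filter by a scalar column, then rank only the surviving rows.

    rows: list of dicts with 'id', 'category', and 'embedding' keys (hypothetical schema).
    """
    # Step 1: the scalar filter yields the candidate rows.
    candidates = [row for row in rows if row["category"] == category]
    # Step 2: the vector ranking only considers those candidates.
    candidates.sort(key=lambda row: math.dist(query, row["embedding"]))
    return [row["id"] for row in candidates[:top_k]]


rows = [
    {"id": 1, "category": "electronics", "embedding": [1.0, 2.0, 3.0]},
    {"id": 2, "category": "books", "embedding": [1.0, 2.0, 3.1]},
    {"id": 3, "category": "electronics", "embedding": [4.0, 4.0, 4.0]},
]
print(filtered_search(rows, [1.0, 2.0, 3.0], top_k=2, category="electronics"))
```

Row 2 is very close to the query vector but never reaches the ranking step, because the filter removes it first.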
