This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.

Global Index

PyPaimon supports querying global indexes built on Data Evolution (append) tables. Three index types are available:

BTree Index: B-tree-based index for scalar column lookups. Supports equality, IN, range, and combined predicates.
Vector Index (Lumina): Approximate nearest neighbor (ANN) index for vector similarity search.
Full-Text Index (Tantivy): Full-text search index for text retrieval with relevance scoring.

Global indexes must be built before they can be queried (e.g., via Spark or Flink). See the Global Index page for how to create them.

BTree Index

The BTree index is used automatically during a scan when a filter predicate references an indexed column. No special API is needed: set a filter on the read builder as usual.

import pypaimon

catalog = pypaimon.create_catalog(...)
table = catalog.get_table("db.my_table")

# BTree index is used automatically when filtering on indexed columns
read_builder = table.new_read_builder()
read_builder = read_builder.with_filter(
    pypaimon.PredicateBuilder(table.fields)
    .in_("name", ["a200", "a300"])
)

scan = read_builder.new_scan()
read = read_builder.new_read()
splits = scan.plan().splits
data = read.to_arrow(splits)

Supported predicates: equal, not_equal, less_than, less_or_equal, greater_than, greater_or_equal, in_, not_in, between, is_null, is_not_null.
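
To give a feel for why the predicate kinds listed above can be served by an ordered index, here is a minimal pure-Python sketch. It is not Paimon's BTree implementation: a sorted list with binary search stands in for the tree, and the data and the helper names (`lookup_range`, `lookup_equal`, `lookup_in`) are hypothetical.

```python
import bisect

# Toy data: list position is the row ID, value is the indexed "name" column.
# (Hypothetical data, not read from a Paimon table.)
names = ["a300", "a100", "a200", "a400", "a200"]

# The "index": (value, row_id) pairs kept in sorted order, similar in spirit
# to the ordered leaf level of a B-tree.
entries = sorted((value, row_id) for row_id, value in enumerate(names))
keys = [value for value, _ in entries]


def lookup_range(low, high):
    """Return row IDs whose value v satisfies low <= v <= high."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return sorted(row_id for _, row_id in entries[start:end])


def lookup_equal(value):
    """Equality is a range whose two bounds coincide."""
    return lookup_range(value, value)


def lookup_in(values):
    """IN is the union of one equality lookup per listed value."""
    row_ids = set()
    for value in values:
        row_ids.update(lookup_equal(value))
    return sorted(row_ids)


in_result = lookup_in(["a200", "a300"])
range_result = lookup_range("a350", "a500")
```

Each lookup touches only the slice of the sorted entries that can match, which is the property that lets an index skip unrelated data during a scan.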

Vector Index (Lumina)

Use VectorSearchBuilder to perform approximate nearest neighbor search on a vector column, then read the matched rows.

table = catalog.get_table("db.my_table")

# Step 1: vector search to get matching row IDs
builder = table.new_vector_search_builder()
index_result = (
    builder
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
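
The two-step shape above (search for row IDs, then read those rows) can be illustrated without PyPaimon. The sketch below is an assumption-laden stand-in: it uses an exact brute-force scan with Euclidean distance rather than an approximate index such as Lumina, and the table contents and the function names `search_row_ids` and `read_rows` are hypothetical.

```python
import heapq
import math

# Toy table: row ID -> (title, embedding). Hypothetical data.
rows = {
    0: ("red shoe", [1.0, 0.0, 0.0]),
    1: ("blue shoe", [0.9, 0.1, 0.0]),
    2: ("green hat", [0.0, 1.0, 0.0]),
    3: ("black bag", [0.0, 0.0, 1.0]),
}


def search_row_ids(query, limit):
    """Step 1: return the row IDs of the `limit` nearest vectors (exact scan)."""
    scored = ((math.dist(query, vec), row_id) for row_id, (_, vec) in rows.items())
    return [row_id for _, row_id in heapq.nsmallest(limit, scored)]


def read_rows(row_ids):
    """Step 2: fetch the actual row data for the matched row IDs."""
    return [rows[row_id][0] for row_id in row_ids]


matched = search_row_ids([1.0, 0.0, 0.0], limit=2)
titles = read_rows(matched)
```

Keeping the search result as a small set of row IDs means the second step only has to touch the rows that matched, rather than scanning every embedding again.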

You can also add a scalar filter to pre-filter rows before vector search:

predicate = (
    pypaimon.PredicateBuilder(table.fields)
    .equal("category", "electronics")
)

index_result = (
    table.new_vector_search_builder()
    .with_vector_column("embedding")
    .with_query_vector([1.0, 2.0, 3.0, ...])
    .with_limit(10)
    .with_filter(predicate)
    .execute_local()
)

read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
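
Why pre-filtering matters can be seen in a small pure-Python sketch. It compares restricting the candidates before the nearest-neighbor search with filtering the search result afterwards. This is only an illustration of the general idea, using an exact scan and hypothetical data, and makes no claim about how Paimon evaluates the filter internally.

```python
import math

# Toy rows: (row_id, category, embedding). Hypothetical data.
rows = [
    (0, "toys", [1.0, 0.0]),
    (1, "toys", [0.9, 0.1]),
    (2, "electronics", [0.5, 0.5]),
    (3, "electronics", [0.0, 1.0]),
]


def top_k(candidates, query, limit):
    """Return the row IDs of the `limit` candidates closest to `query`."""
    ranked = sorted(candidates, key=lambda row: math.dist(query, row[2]))
    return [row[0] for row in ranked[:limit]]


query = [1.0, 0.0]

# Pre-filter: keep only matching rows, then search among them.
pre_filtered = top_k([row for row in rows if row[1] == "electronics"], query, 2)

# Post-filter, for contrast: search all rows, then drop non-matching results.
nearest = set(top_k(rows, query, 2))
post_filtered = [row[0] for row in rows if row[0] in nearest and row[1] == "electronics"]
```

In this toy data the two nearest rows overall are both in the "toys" category, so filtering after the search leaves nothing, while filtering first still yields two "electronics" rows.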

Full-Text Index (Tantivy)

Use FullTextSearchBuilder to perform full-text search on a text column, then read the matched rows.

table = catalog.get_table("db.my_table")

# Step 1: full-text search to get matching row IDs
builder = table.new_full_text_search_builder()
index_result = (
    builder
    .with_text_column("content")
    .with_query_text("search keywords")
    .with_limit(20)
    .execute_local()
)

# Step 2: read actual data for matched rows
read_builder = table.new_read_builder()
scan = read_builder.new_scan()
scan.with_global_index_result(index_result)
read = read_builder.new_read()
data = read.to_arrow(scan.plan().splits)
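
As a rough illustration of what "search with relevance scoring, limited to the top N rows" means, the sketch below builds a tiny inverted index in pure Python. It is not how Tantivy works internally: whitespace tokenization and a plain term-count score stand in for a real analyzer and ranking function, and the documents and the `search` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Toy "content" column: row ID -> text. Hypothetical data.
docs = {
    0: "paimon global index for fast lookup",
    1: "full text search with relevance scoring",
    2: "text search and more text search",
}

# Inverted index: term -> {row_id: number of occurrences in that row}.
inverted = defaultdict(dict)
for row_id, text in docs.items():
    for term, count in Counter(text.lower().split()).items():
        inverted[term][row_id] = count


def search(query, limit):
    """Return up to `limit` row IDs, highest score first (ties by row ID)."""
    scores = Counter()
    for term in query.lower().split():
        for row_id, count in inverted.get(term, {}).items():
            scores[row_id] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [row_id for row_id, _ in ranked[:limit]]


top_rows = search("text search", limit=20)
```

As in the PyPaimon example, the search step yields only row IDs in ranked order; reading the row contents is a separate step.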

For better performance when reading from remote storage, consider enabling the Local Cache.