Skip to main content

Hybrid Search

Hybrid search queries multiple scored routes in one request and ranks the results before reading table rows. A route can be a vector search route or a full-text search route, so hybrid search can combine multiple vector indexes, multiple full-text indexes, or a mix of vector and full-text indexes. This is useful when one table stores several searchable representations for the same record, such as title embeddings, body embeddings, and content text.

Before running hybrid search, create a global index for every vector or text column used by the query:

CALL sys.create_global_index(
table => 'db.my_table',
index_column => 'title_embedding',
index_type => 'ivf-pq',
options => 'ivf-pq.distance.metric=cosine,ivf-pq.nlist=256,ivf-pq.pq.m=16'
);

CALL sys.create_global_index(
table => 'db.my_table',
index_column => 'body_embedding',
index_type => 'ivf-pq',
options => 'ivf-pq.distance.metric=cosine,ivf-pq.nlist=256,ivf-pq.pq.m=16'
);

CALL sys.create_global_index(
table => 'db.my_table',
index_column => 'content',
index_type => 'tantivy-fulltext'
);

For Spark SQL, use the hybrid_search(table_name, vector_routes, full_text_routes, limit[, ranker]) table-valued function. The fourth argument is the final number of ranked results to return. The optional fifth argument selects the ranker. Supported rankers are rrf, weighted_score, and mrr; the default is rrf.

Rankers combine route scores differently:

RankerBest ForDescription
rrfCombining routes with different score scalesReciprocal rank fusion uses each route's rank order, so vector and full-text scores do not need to be normalized.
weighted_scoreWeighting routes by normalized score, not just rankMin-max normalizes each route's scores to [0, 1], then sums them weighted by route weight, so weights (not raw score magnitude) control each route's influence. The exposed __paimon_search_score is therefore a per-query relative value in [0, sum of weights], not a raw similarity or BM25 score.
mrrEmphasizing top-ranked hits from each routeWeighted reciprocal-rank fusion sums weight / rank for each row returned by a route, where rank starts at 1.

The second argument is an array of vector route configs created by named_struct:

FieldRequiredDefaultDescription
field or vector_columnYesN/AVector column to search.
query_vectorYesN/AQuery vector for this route.
limitNoFinal limitTop K results to retrieve from this vector column before ranking.
weightNo1.0Weight for this route when ranking results.
optionsNoEmpty mapRoute-specific vector search options such as ivf.nprobe and hnsw.ef_search.

The third argument is an array of full-text route configs created by named_struct:

FieldRequiredDefaultDescription
queryYesN/ALanceDB-style full-text query JSON for this route.
limitNoFinal limitTop K results to retrieve from this text column before ranking.
weightNo1.0Weight for this route when ranking results.
optionsNoEmpty mapReserved for future full-text search options. Only an empty map is accepted.

Within each route array, every named_struct should use the same fields because Spark requires array elements to have the same struct type. Full-text route options is currently a reserved field; pass map() when the field is needed for struct-type consistency.

Use route limit values larger than the final limit when each route should contribute enough candidates for ranking. For example, with a final limit of 10, route limits such as 50 or 100 give the ranker more candidates to merge.

Vector and full-text routes can be searched together:

SELECT id, __paimon_search_score
FROM hybrid_search(
'my_table',
array(
named_struct(
'field', 'title_embedding',
'query_vector', array(1.0f, 0.0f, 0.0f),
'limit', 50,
'weight', 2.0f,
'options', map('ivf.nprobe', '32'))),
array(
named_struct(
'query', '{"match":{"column":"content","terms":"paimon search"}}',
'limit', 50,
'weight', 1.0f,
'options', map())),
10,
'rrf');

To search two vector indexes without a full-text route, pass array() as the third argument:

SELECT id, __paimon_search_score
FROM hybrid_search(
'my_table',
array(
named_struct(
'field', 'title_embedding',
'query_vector', array(1.0f, 0.0f, 0.0f),
'limit', 50,
'weight', 2.0f,
'options', map('ivf.nprobe', '32')),
named_struct(
'field', 'body_embedding',
'query_vector', array(0.0f, 1.0f, 0.0f),
'limit', 50,
'weight', 1.0f,
'options', map('ivf.nprobe', '32'))),
array(),
10,
'weighted_score');

Use mrr when you want weighted reciprocal-rank fusion:

SELECT id, __paimon_search_score
FROM hybrid_search(
'my_table',
array(
named_struct(
'field', 'title_embedding',
'query_vector', array(1.0f, 0.0f, 0.0f),
'limit', 50,
'weight', 2.0f,
'options', map('ivf.nprobe', '32'))),
array(),
10,
'mrr');

Spark SQL adds __paimon_search_score to expose the ranked score.