Full-Text Index

Full-Text Index provides text search capabilities powered by Tantivy. It is suitable for text retrieval scenarios such as document search, log analysis, and content-based filtering. Search results are scored by the full-text index and can be consumed directly or combined with vector routes in Hybrid Search.

Build Full-Text Index

-- Create a full-text index on the 'content' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext'
);

For content where users often search by short character fragments, build the index with Tantivy's ngram tokenizer:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=ngram,tantivy.ngram.min-gram=2,tantivy.ngram.max-gram=2'
);
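
To make the effect of the gram-length options concrete, the following is a minimal Python sketch of character n-gram generation. It illustrates the general technique only and is not Tantivy's implementation; the function name char_ngrams and its parameters are hypothetical, chosen to mirror tantivy.ngram.min-gram, tantivy.ngram.max-gram, and tantivy.ngram.prefix-only.

```python
def char_ngrams(text, min_gram=2, max_gram=2, prefix_only=False):
    """Illustrative character n-grams (hypothetical helper, not Tantivy code)."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 1 <= min_gram <= max_gram")
    # With prefix_only, grams may only start at the first character.
    starts = [0] if prefix_only else range(len(text))
    grams = []
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("paimon"))
# ['pa', 'ai', 'im', 'mo', 'on']
print(char_ngrams("lake", min_gram=2, max_gram=3, prefix_only=True))
# ['la', 'lak']
```

With both lengths set to 2, a two-character fragment such as "im" corresponds to an indexed token, which is why this tokenizer suits substring-like lookups.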

For Chinese word segmentation, build the index with the jieba tokenizer:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=jieba'
);

To customize text analysis, choose a base tokenizer and compose token filters:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=simple,tantivy.stem=true,tantivy.remove-stop-words=true'
);

Supported tokenizer options:

| Option | Default | Description |
| --- | --- | --- |
| tantivy.tokenizer | default | Tokenizer used by the full-text index. Supported values: default, simple, whitespace, raw, ngram, jieba. |
| tantivy.ngram.min-gram | 2 | Minimum gram length for the ngram tokenizer. |
| tantivy.ngram.max-gram | 2 | Maximum gram length for the ngram tokenizer. |
| tantivy.ngram.prefix-only | false | Whether the ngram tokenizer only emits prefix ngrams. |
| tantivy.lower-case | true | Whether configurable tokenizers lowercase emitted tokens. |
| tantivy.max-token-length | 40 | Maximum token length kept by configurable tokenizers. |
| tantivy.ascii-folding | false | Whether to normalize non-ASCII Latin characters to ASCII. |
| tantivy.stem | false | Whether to apply stemming to emitted tokens. |
| tantivy.language | english | Language used by stemming and built-in stop word filters. |
| tantivy.remove-stop-words | false | Whether to remove built-in stop words for the configured language. |
| tantivy.stop-words | | Semicolon-separated custom stop words to remove. |
| tantivy.with-position | true | Whether to store term positions for phrase queries. Disable it to reduce index size when phrase queries are not needed. |

Tokenizer settings are persisted in global index metadata. Existing index files keep using the tokenizer they were built with, even if later index builds use different options. Paimon does not load arbitrary Rust tokenizer plugins from configuration; custom analysis is provided by composing the supported tokenizer and filter options above. PyPaimon can query jieba indexes when the Python jieba package is installed.
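
Since custom analysis is expressed entirely through the comma-separated options string, it can help to assemble that string programmatically and reject unknown keys early. The following Python sketch does this under stated assumptions: build_options and create_index_sql are hypothetical helpers, not part of Paimon or PyPaimon, and the table and column names are assumed to be trusted identifiers.

```python
SUPPORTED_TOKENIZERS = {"default", "simple", "whitespace", "raw", "ngram", "jieba"}
SUPPORTED_OPTIONS = {
    "tantivy.tokenizer", "tantivy.ngram.min-gram", "tantivy.ngram.max-gram",
    "tantivy.ngram.prefix-only", "tantivy.lower-case", "tantivy.max-token-length",
    "tantivy.ascii-folding", "tantivy.stem", "tantivy.language",
    "tantivy.remove-stop-words", "tantivy.stop-words", "tantivy.with-position",
}


def build_options(options):
    """Render a dict as the 'key=value,key=value' options string (hypothetical helper)."""
    parts = []
    for key, value in options.items():
        if key not in SUPPORTED_OPTIONS:
            raise ValueError(f"unsupported option: {key}")
        if key == "tantivy.tokenizer" and value not in SUPPORTED_TOKENIZERS:
            raise ValueError(f"unsupported tokenizer: {value}")
        if isinstance(value, bool):
            value = "true" if value else "false"
        value = str(value)
        if "," in value:
            # Commas separate options; stop words use semicolons instead.
            raise ValueError(f"value for {key} must not contain a comma")
        parts.append(f"{key}={value}")
    return ",".join(parts)


def create_index_sql(table, column, options=None):
    """Build the CALL statement text; names are assumed to be trusted."""
    lines = [
        f"    table => '{table}'",
        f"    index_column => '{column}'",
        "    index_type => 'tantivy-fulltext'",
    ]
    if options:
        lines.append(f"    options => '{build_options(options)}'")
    return "CALL sys.create_global_index(\n" + ",\n".join(lines) + "\n);"


print(create_index_sql("db.my_table", "content", {
    "tantivy.tokenizer": "simple",
    "tantivy.stem": True,
    "tantivy.remove-stop-words": True,
}))
```

The sketch only checks option names and tokenizer values against the table above; it does not verify that an option is meaningful for the chosen tokenizer.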

Choose tokenizer settings based on the query pattern:

| Query Pattern | Suggested Options |
| --- | --- |
| Natural-language English text | Use the default tokenizer, or enable tantivy.stem=true and tantivy.remove-stop-words=true. |
| Short fragments or substring-like lookup | Use tantivy.tokenizer=ngram and tune tantivy.ngram.min-gram / tantivy.ngram.max-gram. |
| Chinese text | Use tantivy.tokenizer=jieba. |
| Exact token matching | Use tantivy.tokenizer=raw or tantivy.tokenizer=whitespace, depending on how the field is written. |

Set tantivy.with-position=false to reduce index size when phrase queries are not needed.

Full-text search accepts a JSON query DSL. The root JSON object should contain one query type. Paimon supports match, match_phrase / phrase, boost, multi_match, and boolean. The same JSON DSL is used by Spark SQL full_text_search(...), Java FullTextQuery.fromJson(...), Python FullTextQuery.from_json(...), and full-text routes in Hybrid Search.

-- Search for top-10 documents matching any query term. The default query operator is 'Or'.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format"}}',
    10
);

-- Search for top-10 documents matching all query terms.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format","operator":"And"}}',
    10
);

-- Structured query DSL. The JSON shape follows LanceDB full-text queries:
-- match, match_phrase, boost, multi_match, and boolean.
SELECT * FROM full_text_search(
    'my_table',
    '{"match_phrase":{"column":"content","terms":"paimon lake","slop":1}}',
    10
);

SELECT * FROM full_text_search(
    'my_table',
    '{"boolean":{"must":[{"match":{"column":"content","terms":"paimon"}},{"match":{"column":"content","terms":"format"}}],"must_not":[{"match":{"column":"content","terms":"vector"}}]}}',
    10
);

SELECT * FROM full_text_search(
    'my_table',
    '{"multi_match":{"query":"paimon","columns":["title","content"],"boost":[2.0,1.0],"operator":"Or"}}',
    10
);
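
Writing the JSON by hand inside a SQL string literal is error-prone, so one option is to generate the statement from a Python dict. The sketch below is a hypothetical helper (full_text_search_sql is not a Paimon API). It serializes the query compactly and, rather than guess at SQL escaping rules, refuses any payload that contains a single quote or a backslash.

```python
import json


def full_text_search_sql(table, query, limit):
    """Render a full_text_search(...) statement from a dict (hypothetical helper)."""
    if len(query) != 1:
        raise ValueError("the root JSON object should contain one query type")
    if limit <= 0:
        raise ValueError("limit must be positive")
    payload = json.dumps(query, separators=(",", ":"), ensure_ascii=False)
    # Assumption: escaping rules for string literals are left to the caller,
    # so this sketch rejects characters that would need escaping.
    if "'" in payload or "\\" in payload:
        raise ValueError("query contains characters that need SQL escaping")
    return f"SELECT * FROM full_text_search('{table}', '{payload}', {limit});"


print(full_text_search_sql(
    "my_table",
    {"match": {"column": "content", "terms": "paimon lake format"}},
    10,
))
# SELECT * FROM full_text_search('my_table', '{"match":{"column":"content","terms":"paimon lake format"}}', 10);
```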

Query DSL Reference

Match

Use match to search terms in one text column. By default, a row matches if any query term matches.

{
  "match": {
    "column": "content",
    "terms": "paimon lake format",
    "operator": "or",
    "boost": 2.0,
    "fuzziness": 1,
    "max_expansions": 50,
    "prefix_length": 0
  }
}

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| column | Yes | | Text column to search. |
| terms | Yes | | Query text. query is also accepted as an alias. |
| operator | No | or | How query terms are combined. Supported values are or and and. |
| boost | No | 1.0 | Positive score multiplier for this query. |
| fuzziness | No | 0 | Edit distance for fuzzy matching. Supported numeric values are 0, 1, and 2. Use null or auto to leave fuzziness unset. |
| max_expansions | No | 50 | Maximum fuzzy term expansions. maxExpansions is also accepted. |
| prefix_length | No | 0 | Number of leading characters that must match exactly for fuzzy matching. prefixLength is also accepted. |
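
Because most match fields have defaults, a query builder only needs to emit the fields that differ from them. The following Python sketch shows one way to do that; match_query is a hypothetical helper, and the case-insensitive operator check is an assumption based on the examples in this page using both "or" and "And".

```python
import json


def match_query(column, terms, operator="or", boost=1.0, fuzziness=None):
    """Build a match query dict, omitting fields left at their documented defaults."""
    # Assumption: operator case is ignored, since the examples use "or" and "And".
    if operator.lower() not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    body = {"column": column, "terms": terms}
    if operator.lower() != "or":
        body["operator"] = operator
    if boost != 1.0:
        body["boost"] = boost
    if fuzziness is not None:
        body["fuzziness"] = fuzziness
    return {"match": body}


print(json.dumps(match_query("content", "paimon lake format", operator="and", fuzziness=1)))
# {"match": {"column": "content", "terms": "paimon lake format", "operator": "and", "fuzziness": 1}}
```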

Phrase

Use match_phrase to match terms in order. Phrase queries require the index to store positions, so keep tantivy.with-position=true when building indexes that need phrase search.

{
  "match_phrase": {
    "column": "content",
    "terms": "paimon lake",
    "slop": 1
  }
}

phrase is accepted as an alias for match_phrase, and query is accepted as an alias for terms. Paimon serializes phrase queries as match_phrase.

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| column | Yes | | Text column to search. |
| terms | Yes | | Phrase text. query is also accepted as an alias. |
| slop | No | 0 | Maximum number of term-position moves allowed in the phrase. |
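
The alias handling described above can be sketched as a small normalization step that rewrites phrase to match_phrase and query to terms. This is an illustration of the documented behavior, not Paimon's parser; normalize_phrase is a hypothetical name, and rejecting a body that sets both terms and query is an assumption of this sketch.

```python
def normalize_phrase(query):
    """Rewrite a phrase query into its canonical match_phrase form (hypothetical helper)."""
    if len(query) != 1:
        raise ValueError("the root JSON object should contain one query type")
    (kind, body), = query.items()
    if kind not in ("match_phrase", "phrase"):
        raise ValueError(f"not a phrase query: {kind}")
    if "terms" in body and "query" in body:
        # Assumption: supplying both the field and its alias is treated as an error.
        raise ValueError("use either 'terms' or its alias 'query', not both")
    column = body.get("column")
    terms = body.get("terms", body.get("query"))
    slop = body.get("slop", 0)
    if not column or not terms:
        raise ValueError("column and phrase text cannot be empty")
    if slop < 0:
        raise ValueError("slop must be non-negative")
    return {"match_phrase": {"column": column, "terms": terms, "slop": slop}}


print(normalize_phrase({"phrase": {"column": "content", "query": "paimon lake"}}))
# {'match_phrase': {'column': 'content', 'terms': 'paimon lake', 'slop': 0}}
```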

Multi Match

Use multi_match to search the same query text across multiple columns in one full-text query. Column boosts are applied inside the full-text score before the final top-k is selected.

{
  "multi_match": {
    "query": "paimon lake",
    "columns": ["title", "content"],
    "boost": [2.0, 1.0],
    "operator": "or"
  }
}

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| query | Yes | | Query text searched in every listed column. |
| columns | Yes | | Non-empty array of text columns. |
| boost | No | all 1.0 | Per-column score multipliers. The array length must match columns. boosts is also accepted. |
| operator | No | or | How query terms are combined within each column. Supported values are or and and. |
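
The constraint that the boost array must line up with columns is easy to violate when columns are assembled dynamically. A minimal Python sketch of a builder that enforces it follows; multi_match_query is a hypothetical helper, not a Paimon or PyPaimon function.

```python
def multi_match_query(query, columns, boosts=None, operator="or"):
    """Build a multi_match query dict with per-column boosts (hypothetical helper)."""
    if not query:
        raise ValueError("query text cannot be empty")
    if not columns:
        raise ValueError("columns must be a non-empty array")
    if boosts is None:
        # Documented default: every column is weighted 1.0.
        boosts = [1.0] * len(columns)
    if len(boosts) != len(columns):
        raise ValueError("boost array length must match columns")
    if any(b <= 0 for b in boosts):
        raise ValueError("boost values must be positive")
    return {
        "multi_match": {
            "query": query,
            "columns": list(columns),
            "boost": list(boosts),
            "operator": operator,
        }
    }


q = multi_match_query("paimon lake", ["title", "content"], boosts=[2.0, 1.0])
print(q["multi_match"]["boost"])
# [2.0, 1.0]
```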

Boost

Use boost to reduce the score of documents that also match a negative query.

{
  "boost": {
    "positive": {
      "match": {
        "column": "content",
        "terms": "paimon"
      }
    },
    "negative": {
      "match": {
        "column": "content",
        "terms": "vector"
      }
    },
    "negative_boost": 0.3
  }
}

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| positive | Yes | | Query that contributes the main score. |
| negative | Yes | | Query used to down-rank matching rows. |
| negative_boost | No | 0.5 | Positive multiplier applied to rows that also match negative. negativeBoost is also accepted. |
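
To show how down-ranking changes the result order, the sketch below applies a negative boost to a set of already-computed scores. It assumes the multiplier is applied to the score from the positive query; the function apply_negative_boost, the row ids, and the scores are all hypothetical and do not reflect Paimon's internal scoring code.

```python
def apply_negative_boost(positive_scores, negative_matches, negative_boost=0.5):
    """Scale the score of rows that also match the negative query (illustrative only)."""
    if negative_boost <= 0:
        raise ValueError("negative_boost must be positive")
    return {
        row: score * negative_boost if row in negative_matches else score
        for row, score in positive_scores.items()
    }


# Hypothetical scores from the positive query, keyed by row id.
scores = {"row-a": 4.0, "row-b": 3.0}
rescored = apply_negative_boost(scores, negative_matches={"row-a"})
print(rescored)
# {'row-a': 2.0, 'row-b': 3.0}
print(sorted(rescored, key=rescored.get, reverse=True))
# ['row-b', 'row-a']
```

Rows matching the negative query stay in the result set; they are only ranked lower, which is what distinguishes boost from a must_not clause.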

Boolean

Use boolean to combine nested queries. must clauses restrict matches, should clauses add matches and score, and must_not clauses exclude matches.

{
  "boolean": {
    "must": [
      {
        "match": {
          "column": "content",
          "terms": "paimon"
        }
      }
    ],
    "should": [
      {
        "match_phrase": {
          "column": "content",
          "terms": "lake format"
        }
      }
    ],
    "must_not": [
      {
        "match": {
          "column": "content",
          "terms": "vector"
        }
      }
    ]
  }
}

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| must | No | empty | Array of nested queries that every result must match. |
| should | No | empty | Array of nested queries that can match and contribute score. |
| must_not | No | empty | Array of nested queries that matching rows must not satisfy. |

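boolean also accepts a queries array for explicit occurrence labels. The example uses both entry forms: an object with occur and query fields, and a two-element array of the occur value followed by the nested query.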

{
  "boolean": {
    "queries": [
      {
        "occur": "must",
        "query": {
          "match": {
            "column": "content",
            "terms": "paimon"
          }
        }
      },
      [
        "must_not",
        {
          "match": {
            "column": "content",
            "terms": "vector"
          }
        }
      ]
    ]
  }
}

Supported occur values are should, must, and must_not.
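
The relationship between the queries array and the must / should / must_not fields can be sketched as a grouping step. The Python below assumes the two shapes are equivalent, which the text implies but does not state outright; split_occurrences is a hypothetical helper, not Paimon's parser.

```python
def split_occurrences(queries):
    """Group a boolean 'queries' array into must / should / must_not lists."""
    buckets = {"must": [], "should": [], "must_not": []}
    for entry in queries:
        if isinstance(entry, dict):
            # Object form: {"occur": ..., "query": ...}
            occur, nested = entry["occur"], entry["query"]
        else:
            # Array form: [occur, query]
            occur, nested = entry
        if occur not in buckets:
            raise ValueError(f"unsupported occur value: {occur}")
        buckets[occur].append(nested)
    # Drop empty groups, since all three fields default to empty.
    return {"boolean": {name: items for name, items in buckets.items() if items}}


paimon = {"match": {"column": "content", "terms": "paimon"}}
vector = {"match": {"column": "content", "terms": "vector"}}
print(split_occurrences([{"occur": "must", "query": paimon}, ["must_not", vector]]))
# {'boolean': {'must': [{'match': {'column': 'content', 'terms': 'paimon'}}], 'must_not': [{'match': {'column': 'content', 'terms': 'vector'}}]}}
```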

Validation Notes

  • Query text and column names cannot be empty.
  • Boost values and negative_boost must be positive.
  • max_expansions must be positive.
  • prefix_length and slop must be non-negative.
  • fuzziness must be 0, 1, 2, null, or auto.
  • max_expansions and prefix_length are part of the DSL for LanceDB compatibility. The current Tantivy backend accepts the default values only: max_expansions=50 and prefix_length=0.
  • PyPaimon can parse the same DSL, but its local Tantivy reader does not support every advanced scoring option yet. Unsupported options fail fast instead of changing query semantics silently.
  • Multi-column queries such as multi_match require full-text indexes for every referenced column.
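
The rules above can be checked on the client before a query is submitted. The following Python sketch covers the rules that apply to a match body, including the current restriction to default max_expansions and prefix_length values. It is an illustration under stated assumptions: validate_match is a hypothetical helper, and the messages it returns are not Paimon's error messages.

```python
def validate_match(body):
    """Return a list of rule violations for a match query body (hypothetical helper)."""
    problems = []
    if not body.get("column"):
        problems.append("column cannot be empty")
    if not body.get("terms", body.get("query")):
        problems.append("query text cannot be empty")
    if body.get("boost", 1.0) <= 0:
        problems.append("boost must be positive")
    if body.get("fuzziness") not in (None, "auto", 0, 1, 2):
        problems.append("fuzziness must be 0, 1, 2, null, or auto")
    max_expansions = body.get("max_expansions", body.get("maxExpansions", 50))
    if max_expansions <= 0:
        problems.append("max_expansions must be positive")
    elif max_expansions != 50:
        problems.append("the current backend accepts only max_expansions=50")
    prefix_length = body.get("prefix_length", body.get("prefixLength", 0))
    if prefix_length < 0:
        problems.append("prefix_length must be non-negative")
    elif prefix_length != 0:
        problems.append("the current backend accepts only prefix_length=0")
    return problems


print(validate_match({"column": "content", "terms": "paimon"}))
# []
print(len(validate_match({"column": "content", "terms": "", "boost": 0,
                          "fuzziness": 3, "max_expansions": 10})))
# 4
```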