Full-Text Index
Full-Text Index provides text search capabilities powered by Tantivy. It is suitable for text retrieval scenarios such as document search, log analysis, and content-based filtering. Search results are scored by the full-text index and can be consumed directly or combined with vector routes in Hybrid Search.
Build Full-Text Index
-- Create full-text index on 'content' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext'
);
For content where users often search by short character fragments, build the
index with Tantivy's ngram tokenizer:
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=ngram,tantivy.ngram.min-gram=2,tantivy.ngram.max-gram=2'
);
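To get a feel for which tokens a 2-gram index contains, the following sketch enumerates character n-grams in plain Python. It is only an illustration of the idea: `char_ngrams` is a hypothetical helper, not part of Paimon or Tantivy, and the real tokenizer may emit grams in a different order or apply additional filters such as lowercasing.

```python
def char_ngrams(text, min_gram=2, max_gram=2, prefix_only=False):
    """Enumerate character n-grams of `text` (illustrative, not Tantivy's code)."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    # With prefix_only, grams may only start at the first character.
    starts = [0] if prefix_only else range(len(text))
    grams = []
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("lake"))
print(char_ngrams("lake", 1, 3, prefix_only=True))
```

With both gram lengths set to 2, a short query fragment such as `ak` corresponds to one of the indexed grams, which is why substring-like lookups become possible.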
For Chinese word segmentation, build the index with the jieba tokenizer:
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=jieba'
);
To customize text analysis, choose a base tokenizer and compose token filters:
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=simple,tantivy.stem=true,tantivy.remove-stop-words=true'
);
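The options argument is a single comma-separated string of key=value pairs, so it can be convenient to assemble it from a dictionary. The helper below is a hypothetical sketch (`build_index_options` is not a Paimon API). It assumes that no value contains a comma and joins list values with semicolons, the separator used by tantivy.stop-words.

```python
def build_index_options(options):
    """Render a dict as the comma-separated 'key=value' string passed as options."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"      # booleans in lowercase
        elif isinstance(value, (list, tuple)):
            value = ";".join(str(v) for v in value)   # e.g. custom stop words
        value = str(value)
        if "," in value or "=" in key:
            raise ValueError(f"cannot encode option {key!r}")
        parts.append(f"{key}={value}")
    return ",".join(parts)


opts = build_index_options({
    "tantivy.tokenizer": "simple",
    "tantivy.stem": True,
    "tantivy.remove-stop-words": True,
})
print(opts)
```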
Supported tokenizer options:
| Option | Default | Description |
|---|---|---|
| tantivy.tokenizer | default | Tokenizer used by the full-text index. Supported values: default, simple, whitespace, raw, ngram, jieba. |
| tantivy.ngram.min-gram | 2 | Minimum gram length for the ngram tokenizer. |
| tantivy.ngram.max-gram | 2 | Maximum gram length for the ngram tokenizer. |
| tantivy.ngram.prefix-only | false | Whether the ngram tokenizer only emits prefix ngrams. |
| tantivy.lower-case | true | Whether configurable tokenizers lowercase emitted tokens. |
| tantivy.max-token-length | 40 | Maximum token length kept by configurable tokenizers. |
| tantivy.ascii-folding | false | Whether to normalize non-ASCII Latin characters to ASCII. |
| tantivy.stem | false | Whether to apply stemming to emitted tokens. |
| tantivy.language | english | Language used by stemming and built-in stop word filters. |
| tantivy.remove-stop-words | false | Whether to remove built-in stop words for the configured language. |
| tantivy.stop-words | | Semicolon-separated custom stop words to remove. |
| tantivy.with-position | true | Whether to store term positions for phrase queries. Disable it to reduce index size when phrase queries are not needed. |
Tokenizer settings are persisted in global index metadata. Existing index files keep using the
tokenizer they were built with, even if later index builds use different options.
Paimon does not load arbitrary Rust tokenizer plugins from configuration; custom analysis is
provided by composing the supported tokenizer and filter options above. PyPaimon can query
jieba indexes when the Python jieba package is installed.
Choose tokenizer settings based on the query pattern:
| Query Pattern | Suggested Options |
|---|---|
| Natural-language English text | Use the default tokenizer, or enable tantivy.stem=true and tantivy.remove-stop-words=true. |
| Short fragments or substring-like lookup | Use tantivy.tokenizer=ngram and tune tantivy.ngram.min-gram / tantivy.ngram.max-gram. |
| Chinese text | Use tantivy.tokenizer=jieba. |
| Exact token matching | Use tantivy.tokenizer=raw or tantivy.tokenizer=whitespace, depending on how the field is written. |
Set tantivy.with-position=false to reduce index size when phrase queries are not needed.
Full-Text Search
Full-text search accepts a JSON query DSL. The root JSON object should contain one query type.
Paimon supports match, match_phrase / phrase, boost, multi_match, and boolean.
The same JSON DSL is used by Spark SQL full_text_search(...), Java FullTextQuery.fromJson(...),
Python FullTextQuery.from_json(...), and full-text routes in Hybrid Search.
The examples below use Spark SQL, the Java API, and the Python SDK, in that order.
-- Search for top-10 documents matching any query term. The default query operator is 'Or'.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format"}}',
    10
);
-- Search for top-10 documents matching all query terms.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format","operator":"And"}}',
    10
);
-- Structured query DSL. The JSON shape follows LanceDB full-text queries:
-- match, match_phrase, boost, multi_match, and boolean.
SELECT * FROM full_text_search(
    'my_table',
    '{"match_phrase":{"column":"content","terms":"paimon lake","slop":1}}',
    10
);
SELECT * FROM full_text_search(
    'my_table',
    '{"boolean":{"must":[{"match":{"column":"content","terms":"paimon"}},{"match":{"column":"content","terms":"format"}}],"must_not":[{"match":{"column":"content","terms":"vector"}}]}}',
    10
);
SELECT * FROM full_text_search(
    'my_table',
    '{"multi_match":{"query":"paimon","columns":["title","content"],"boost":[2.0,1.0],"operator":"Or"}}',
    10
);
Table table = catalog.getTable(identifier);
// Step 1: Build full-text search
GlobalIndexResult result = table.newFullTextSearchBuilder()
        .withQuery(FullTextQuery.phrase("paimon lake", "content", 1))
        .withLimit(10)
        .executeLocal();
// Step 2: Read matching rows using the search result
ReadBuilder readBuilder = table.newReadBuilder();
TableScan.Plan plan = readBuilder.newScan().withGlobalIndexResult(result).plan();
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
    reader.forEachRemaining(row -> {
        System.out.println("id=" + row.getInt(0) + ", content=" + row.getString(1));
    });
}
from pypaimon.globalindex.full_text_query import MatchQuery
table = catalog.get_table("db.my_table")
# Step 1: Build full-text search
result = (
    table.new_full_text_search_builder()
    .with_query(MatchQuery("paimon lake format", "content", operator="And"))
    .with_limit(10)
    .execute_local()
)
# Step 2: Read matching rows using the search result
read_builder = table.new_read_builder()
scan = read_builder.new_scan().with_global_index_result(result)
plan = scan.plan()
table_read = read_builder.new_read()
pa_table = table_read.to_arrow(plan.splits())
print(pa_table)
PyPaimon reads the Tantivy tokenizer settings stored in the global index metadata. Indexes built
with tantivy.tokenizer=ngram can be queried from Python when the installed tantivy-py package
provides custom tokenizer support. Indexes built with tantivy.tokenizer=jieba can be queried
when the jieba package is installed.
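Because every entry point takes the same JSON DSL, the query string for Spark SQL can be generated from a dictionary instead of being written by hand. The following is a minimal sketch: `full_text_search_sql` is a hypothetical helper, and rather than guessing at dialect-specific escaping rules it simply rejects input that contains single quotes or backslashes.

```python
import json


def full_text_search_sql(table, query, limit):
    """Build a full_text_search(...) statement from a query dict (illustrative)."""
    query_json = json.dumps(query, separators=(",", ":"), ensure_ascii=False)
    for text in (table, query_json):
        # Escaping inside SQL string literals depends on the SQL configuration,
        # so this sketch refuses such input instead of escaping it.
        if "'" in text or "\\" in text:
            raise ValueError("quotes and backslashes need dialect-specific escaping")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM full_text_search('{table}', '{query_json}', {limit})"


query = {"match": {"column": "content", "terms": "paimon lake format", "operator": "And"}}
sql = full_text_search_sql("my_table", query, 10)
print(sql)
```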
Query DSL Reference
Match
Use match to search terms in one text column. By default, a row matches if any query term
matches.
{
  "match": {
    "column": "content",
    "terms": "paimon lake format",
    "operator": "or",
    "boost": 2.0,
    "fuzziness": 1,
    "max_expansions": 50,
    "prefix_length": 0
  }
}
| Field | Required | Default | Description |
|---|---|---|---|
| column | Yes | | Text column to search. |
| terms | Yes | | Query text. query is also accepted as an alias. |
| operator | No | or | How query terms are combined. Supported values are or and and. |
| boost | No | 1.0 | Positive score multiplier for this query. |
| fuzziness | No | 0 | Edit distance for fuzzy matching. Supported numeric values are 0, 1, and 2. Use null or auto to leave fuzziness unset. |
| max_expansions | No | 50 | Maximum fuzzy term expansions. maxExpansions is also accepted. |
| prefix_length | No | 0 | Number of leading characters that must match exactly for fuzzy matching. prefixLength is also accepted. |
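The effect of operator and fuzziness can be illustrated with a small model in plain Python. This is a sketch under simplifying assumptions, not how Tantivy evaluates queries: text is lowercased and split on whitespace, fuzziness is plain Levenshtein distance, and no score is computed. The names `edit_distance` and `match` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match(doc_text, terms, operator="or", fuzziness=0):
    """Return True if the document satisfies the (simplified) match query."""
    doc_tokens = doc_text.lower().split()
    query_tokens = terms.lower().split()
    # One flag per query term: does any document token lie within the distance?
    hits = [
        any(edit_distance(q, d) <= fuzziness for d in doc_tokens)
        for q in query_tokens
    ]
    if not hits:
        return False
    return all(hits) if operator.lower() == "and" else any(hits)


doc = "Paimon is a lake format"
print(match(doc, "paimon vector"))                  # any term is enough
print(match(doc, "paimon vector", operator="and"))  # every term is required
print(match(doc, "paimen", fuzziness=1))            # one edit away from "paimon"
```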
Phrase
Use match_phrase to match terms in order. Phrase queries require the index to store positions,
so keep tantivy.with-position=true when building indexes that need phrase search.
{
  "match_phrase": {
    "column": "content",
    "terms": "paimon lake",
    "slop": 1
  }
}
phrase is accepted as an alias for match_phrase, and query is accepted as an alias for
terms. Paimon serializes phrase queries as match_phrase.
| Field | Required | Default | Description |
|---|---|---|---|
| column | Yes | | Text column to search. |
| terms | Yes | | Phrase text. query is also accepted as an alias. |
| slop | No | 0 | Maximum number of term-position moves allowed in the phrase. |
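To see why phrase queries need stored positions, and what slop loosens, consider the simplified model below. It is an assumption-laden sketch: `phrase_matches` is a hypothetical helper, tokens come from whitespace splitting, and slop is modeled only as the total number of extra positions allowed between in-order terms. A real engine may also count reordered terms against the slop budget.

```python
def phrase_matches(doc_tokens, phrase_tokens, slop=0):
    """Check whether phrase_tokens occur in order with at most `slop` extra gaps."""
    if not phrase_tokens:
        return False
    # Position lists per token: the information an index stores for phrases.
    positions = {}
    for index, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(index)

    def search(term_index, prev_pos, used_slop):
        if term_index == len(phrase_tokens):
            return True
        for pos in positions.get(phrase_tokens[term_index], []):
            if pos <= prev_pos:
                continue  # terms must keep their order in this model
            gap = pos - prev_pos - 1
            if used_slop + gap <= slop and search(term_index + 1, pos, used_slop + gap):
                return True
        return False

    return any(search(1, start, 0) for start in positions.get(phrase_tokens[0], []))


doc = "paimon is a lake format".split()
print(phrase_matches(doc, ["lake", "format"]))          # adjacent terms
print(phrase_matches(doc, ["paimon", "lake"], slop=1))  # two words in between
print(phrase_matches(doc, ["paimon", "lake"], slop=2))
```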
Multi Match
Use multi_match to search the same query text across multiple columns in one full-text query.
Column boosts are applied inside the full-text score before the final top-k is selected.
{
  "multi_match": {
    "query": "paimon lake",
    "columns": ["title", "content"],
    "boost": [2.0, 1.0],
    "operator": "or"
  }
}
| Field | Required | Default | Description |
|---|---|---|---|
| query | Yes | | Query text searched in every listed column. |
| columns | Yes | | Non-empty array of text columns. |
| boost | No | all 1.0 | Per-column score multipliers. The array length must match columns. boosts is also accepted. |
| operator | No | or | How query terms are combined within each column. Supported values are or and and. |
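The constraints on multi_match (non-empty columns, one boost per column, boosts defaulting to 1.0) can be checked before the query is sent. The builder below is a hypothetical sketch that produces the JSON structure as a Python dict; it is not a PyPaimon API.

```python
def multi_match(query, columns, boost=None, operator="or"):
    """Build a multi_match query dict, enforcing the documented constraints."""
    if not query:
        raise ValueError("query text cannot be empty")
    if not columns or any(not column for column in columns):
        raise ValueError("columns must be a non-empty list of column names")
    if boost is None:
        boost = [1.0] * len(columns)          # documented default: all 1.0
    if len(boost) != len(columns):
        raise ValueError("boost length must match columns")
    if any(value <= 0 for value in boost):
        raise ValueError("boost values must be positive")
    if operator.lower() not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {
        "multi_match": {
            "query": query,
            "columns": list(columns),
            "boost": list(boost),
            "operator": operator,
        }
    }


print(multi_match("paimon lake", ["title", "content"], [2.0, 1.0]))
```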
Boost
Use boost to reduce the score of documents that also match a negative query.
{
  "boost": {
    "positive": {
      "match": {
        "column": "content",
        "terms": "paimon"
      }
    },
    "negative": {
      "match": {
        "column": "content",
        "terms": "vector"
      }
    },
    "negative_boost": 0.3
  }
}
| Field | Required | Default | Description |
|---|---|---|---|
| positive | Yes | | Query that contributes the main score. |
| negative | Yes | | Query used to down-rank matching rows. |
| negative_boost | No | 0.5 | Positive multiplier applied to rows that also match negative. negativeBoost is also accepted. |
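As a rough mental model, a boost query keeps the score from positive and multiplies it by negative_boost for rows that also match negative. The sketch below applies that rule to made-up scores; `boosted_score` and the sample rows are hypothetical, and the actual scores come from the full-text index.

```python
def boosted_score(positive_score, matches_negative, negative_boost=0.5):
    """Down-rank a row that also matches the negative query."""
    if negative_boost <= 0:
        raise ValueError("negative_boost must be positive")
    return positive_score * negative_boost if matches_negative else positive_score


# (row id, score from the positive query, does the row match the negative query?)
rows = [("doc-1", 2.0, False), ("doc-2", 3.0, True)]
ranked = sorted(
    ((boosted_score(score, negative, 0.3), row_id) for row_id, score, negative in rows),
    reverse=True,
)
for score, row_id in ranked:
    print(row_id, round(score, 3))
```

In this example doc-2 starts with the higher score but ranks below doc-1 after the multiplier is applied. Rows matching negative are still returned, which is the difference from a must_not clause.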
Boolean
Use boolean to combine nested queries. must clauses restrict matches, should clauses add
matches and score, and must_not clauses exclude matches.
{
  "boolean": {
    "must": [
      {
        "match": {
          "column": "content",
          "terms": "paimon"
        }
      }
    ],
    "should": [
      {
        "match_phrase": {
          "column": "content",
          "terms": "lake format"
        }
      }
    ],
    "must_not": [
      {
        "match": {
          "column": "content",
          "terms": "vector"
        }
      }
    ]
  }
}
| Field | Required | Default | Description |
|---|---|---|---|
| must | No | empty | Array of nested queries that every result must match. |
| should | No | empty | Array of nested queries that can match and contribute score. |
| must_not | No | empty | Array of nested queries that matching rows must not satisfy. |
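The way the three clause types interact can be sketched with sets of row ids, one set per nested query. This is an illustrative model only: `combine_boolean` is a hypothetical helper, scoring is ignored, and the handling of a query with no must clause (here, the union of should matches) is an assumption rather than documented behavior.

```python
def combine_boolean(must=(), should=(), must_not=()):
    """Combine per-clause sets of matching row ids into the final match set."""
    if must:
        # Every must clause has to match, so intersect them.
        result = set.intersection(*map(set, must))
    else:
        # Assumed behavior without must clauses: any should clause may match.
        result = set().union(*map(set, should))
    for excluded in must_not:
        result -= set(excluded)
    return result


print(combine_boolean(must=[{1, 2, 3}, {2, 3, 4}], should=[{3}], must_not=[{2}]))
print(combine_boolean(should=[{1}, {2}]))
```

When must clauses are present, should clauses in this model do not change which rows match; in a real engine they would still raise the score of rows they match.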
boolean also accepts a queries array for explicit occurrence labels:
{
  "boolean": {
    "queries": [
      {
        "occur": "must",
        "query": {
          "match": {
            "column": "content",
            "terms": "paimon"
          }
        }
      },
      [
        "must_not",
        {
          "match": {
            "column": "content",
            "terms": "vector"
          }
        }
      ]
    ]
  }
}
Supported occur values are should, must, and must_not.
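Both entry shapes in the queries array (an object with occur and query, or a two-element array) carry the same information as the must / should / must_not keys. The sketch below normalizes one form into the other; `normalize_boolean` is a hypothetical helper, and whether Paimon accepts both forms mixed in a single query is not something this example assumes.

```python
def normalize_boolean(boolean_body):
    """Rewrite a boolean body so that it only uses must / should / must_not."""
    clauses = {"must": [], "should": [], "must_not": []}
    for key in clauses:
        clauses[key].extend(boolean_body.get(key, []))
    for entry in boolean_body.get("queries", []):
        if isinstance(entry, dict):
            occur, query = entry["occur"], entry["query"]   # object form
        else:
            occur, query = entry                            # [occur, query] form
        if occur not in clauses:
            raise ValueError(f"unsupported occur value: {occur!r}")
        clauses[occur].append(query)
    # Drop empty clause lists to keep the result compact.
    return {key: value for key, value in clauses.items() if value}


body = {
    "queries": [
        {"occur": "must", "query": {"match": {"column": "content", "terms": "paimon"}}},
        ["must_not", {"match": {"column": "content", "terms": "vector"}}],
    ]
}
print(normalize_boolean(body))
```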
Validation Notes
- Query text and column names cannot be empty.
- Boost values and negative_boost must be positive.
- max_expansions must be positive.
- prefix_length and slop must be non-negative.
- fuzziness must be 0, 1, 2, null, or auto.
- max_expansions and prefix_length are part of the DSL for LanceDB compatibility. The current Tantivy backend accepts the default values only: max_expansions=50 and prefix_length=0.
- PyPaimon can parse the same DSL, but its local Tantivy reader does not support every advanced scoring option yet. Unsupported options fail fast instead of changing query semantics silently.
- Multi-column queries such as multi_match require full-text indexes for every referenced column.
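Several of the rules above can be checked on the client before a query is submitted. The following sketch covers the body of a match query only; `validate_match` is a hypothetical helper, not part of Paimon, and it treats a parsed JSON null as Python None.

```python
def validate_match(body):
    """Return a list of problems found in the body of a match query."""
    errors = []
    terms = body.get("terms", body.get("query"))  # query is an alias for terms
    if not body.get("column") or not terms:
        errors.append("query text and column name cannot be empty")
    if body.get("boost", 1.0) <= 0:
        errors.append("boost must be positive")
    if body.get("fuzziness", 0) not in (0, 1, 2, None, "auto"):
        errors.append("fuzziness must be 0, 1, 2, null, or auto")
    max_expansions = body.get("max_expansions", body.get("maxExpansions", 50))
    prefix_length = body.get("prefix_length", body.get("prefixLength", 0))
    if max_expansions <= 0:
        errors.append("max_expansions must be positive")
    elif max_expansions != 50:
        errors.append("the current backend only accepts max_expansions=50")
    if prefix_length < 0:
        errors.append("prefix_length must be non-negative")
    elif prefix_length != 0:
        errors.append("the current backend only accepts prefix_length=0")
    return errors


print(validate_match({"column": "content", "terms": "paimon", "fuzziness": 1}))
print(validate_match({"column": "content", "terms": "paimon", "fuzziness": 3}))
```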