Full-Text Index

Full-Text Index provides text search capabilities powered by Tantivy. It is suitable for text retrieval scenarios such as document search, log analysis, and content-based filtering. Search results are scored by the full-text index and can be consumed directly or combined with vector routes in Hybrid Search.

Build Full-Text Index

-- Create full-text index on 'content' column
CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext'
);

For content where users often search by short character fragments, build the index with Tantivy's ngram tokenizer:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=ngram,tantivy.ngram.min-gram=2,tantivy.ngram.max-gram=2'
);
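
To see why a 2-gram index suits fragment lookup, the sketch below splits text into overlapping character n-grams. It illustrates the general n-gram technique in plain Python and is not Tantivy's implementation; the function `char_ngrams` and its parameters are hypothetical names that only mirror the `min-gram`, `max-gram`, and `prefix-only` options.

```python
def char_ngrams(text, min_gram=2, max_gram=2, prefix_only=False):
    """Return character n-grams of text (illustrative helper, not a Paimon or Tantivy API)."""
    if not 1 <= min_gram <= max_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    # With prefix_only, grams may start only at the first character.
    starts = [0] if prefix_only else range(len(text))
    grams = []
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("paimon"))
# ['pa', 'ai', 'im', 'mo', 'on']
print(char_ngrams("paimon", min_gram=1, max_gram=3, prefix_only=True))
# ['p', 'pa', 'pai']
```

Under this scheme, with both bounds set to 2, every two-character fragment of the text becomes its own term, so a two-character query fragment maps to a single indexed term.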

For Chinese word segmentation, build the index with the jieba tokenizer:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=jieba'
);

To customize text analysis, choose a base tokenizer and compose token filters:

CALL sys.create_global_index(
    table => 'db.my_table',
    index_column => 'content',
    index_type => 'tantivy-fulltext',
    options => 'tantivy.tokenizer=simple,tantivy.stem=true,tantivy.remove-stop-words=true'
);

Supported tokenizer options:

Option	Default	Description
`tantivy.tokenizer`	`default`	Tokenizer used by the full-text index. Supported values: `default`, `simple`, `whitespace`, `raw`, `ngram`, `jieba`.
`tantivy.ngram.min-gram`	`2`	Minimum gram length for the `ngram` tokenizer.
`tantivy.ngram.max-gram`	`2`	Maximum gram length for the `ngram` tokenizer.
`tantivy.ngram.prefix-only`	`false`	Whether the `ngram` tokenizer only emits prefix ngrams.
`tantivy.lower-case`	`true`	Whether configurable tokenizers lowercase emitted tokens.
`tantivy.max-token-length`	`40`	Maximum token length kept by configurable tokenizers.
`tantivy.ascii-folding`	`false`	Whether to normalize non-ASCII Latin characters to ASCII.
`tantivy.stem`	`false`	Whether to apply stemming to emitted tokens.
`tantivy.language`	`english`	Language used by stemming and built-in stop word filters.
`tantivy.remove-stop-words`	`false`	Whether to remove built-in stop words for the configured language.
`tantivy.stop-words`		Semicolon-separated custom stop words to remove.
`tantivy.with-position`	`true`	Whether to store term positions for phrase queries. Disable it to reduce index size when phrase queries are not needed.

Tokenizer settings are persisted in global index metadata. Existing index files keep using the tokenizer they were built with, even if later index builds use different options. Paimon does not load arbitrary Rust tokenizer plugins from configuration; custom analysis is provided by composing the supported tokenizer and filter options above. PyPaimon can query jieba indexes when the Python jieba package is installed.
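
When options are assembled programmatically, a small helper can keep the `options` string consistent with the format used in the examples above (comma-separated `key=value` pairs, with custom stop words separated by semicolons). The following is a minimal sketch; `build_index_options` and `create_index_sql` are hypothetical helpers, not part of Paimon, and the refusal of commas and quotes inside values is an assumption made to keep the generated string unambiguous.

```python
def build_index_options(options):
    """Join option key/value pairs into the comma-separated options string.

    Hypothetical helper for illustration; not part of Paimon.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Render Python booleans in the lowercase form used by the examples.
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Custom stop words are semicolon-separated.
            text = ";".join(str(item) for item in value)
        else:
            text = str(value)
        if "," in text or "'" in text:
            # Assumption: commas separate options and quotes end the SQL literal.
            raise ValueError(f"unsupported character in value for {key}: {text!r}")
        parts.append(f"{key}={text}")
    return ",".join(parts)


def create_index_sql(table, column, options):
    """Render a CALL statement in the shape shown in the examples above."""
    return (
        "CALL sys.create_global_index(\n"
        f"    table => '{table}',\n"
        f"    index_column => '{column}',\n"
        "    index_type => 'tantivy-fulltext',\n"
        f"    options => '{build_index_options(options)}'\n"
        ");"
    )


print(create_index_sql(
    "db.my_table",
    "content",
    {
        "tantivy.tokenizer": "simple",
        "tantivy.stem": True,
        "tantivy.stop-words": ["the", "a"],
    },
))
```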

Choose tokenizer settings based on the query pattern:

Query Pattern	Suggested Options
Natural-language English text	Use the default tokenizer, or enable `tantivy.stem=true` and `tantivy.remove-stop-words=true`.
Short fragments or substring-like lookup	Use `tantivy.tokenizer=ngram` and tune `tantivy.ngram.min-gram` / `tantivy.ngram.max-gram`.
Chinese text	Use `tantivy.tokenizer=jieba`.
Exact token matching	Use `tantivy.tokenizer=raw` or `tantivy.tokenizer=whitespace`, depending on how the field is written.

Set tantivy.with-position=false to reduce index size when phrase queries are not needed.

Full-Text Search

Full-text search accepts a JSON query DSL. The root JSON object should contain one query type. Paimon supports match, match_phrase / phrase, boost, multi_match, and boolean. The same JSON DSL is used by Spark SQL full_text_search(...), Java FullTextQuery.fromJson(...), Python FullTextQuery.from_json(...), and full-text routes in Hybrid Search.

The examples below cover Spark SQL, the Java API, and the Python SDK, in that order.

-- Search for top-10 documents matching any query term. The default query operator is 'Or'.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format"}}',
    10
);

-- Search for top-10 documents matching all query terms.
SELECT * FROM full_text_search(
    'my_table',
    '{"match":{"column":"content","terms":"paimon lake format","operator":"And"}}',
    10
);

-- Structured query DSL. The JSON shape follows LanceDB full-text queries:
-- match, match_phrase, boost, multi_match, and boolean.
SELECT * FROM full_text_search(
    'my_table',
    '{"match_phrase":{"column":"content","terms":"paimon lake","slop":1}}',
    10
);

SELECT * FROM full_text_search(
    'my_table',
    '{"boolean":{"must":[{"match":{"column":"content","terms":"paimon"}},{"match":{"column":"content","terms":"format"}}],"must_not":[{"match":{"column":"content","terms":"vector"}}]}}',
    10
);

SELECT * FROM full_text_search(
    'my_table',
    '{"multi_match":{"query":"paimon","columns":["title","content"],"boost":[2.0,1.0],"operator":"Or"}}',
    10
);

Table table = catalog.getTable(identifier);

// Step 1: Build full-text search
GlobalIndexResult result = table.newFullTextSearchBuilder()
        .withQuery(FullTextQuery.phrase("paimon lake", "content", 1))
        .withLimit(10)
        .executeLocal();

// Step 2: Read matching rows using the search result
ReadBuilder readBuilder = table.newReadBuilder();
TableScan.Plan plan = readBuilder.newScan().withGlobalIndexResult(result).plan();
try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
    reader.forEachRemaining(row -> {
        System.out.println("id=" + row.getInt(0) + ", content=" + row.getString(1));
    });
}

from pypaimon.globalindex.full_text_query import MatchQuery

table = catalog.get_table("db.my_table")

# Step 1: Build full-text search
result = (
    table.new_full_text_search_builder()
    .with_query(MatchQuery("paimon lake format", "content", operator="And"))
    .with_limit(10)
    .execute_local()
)

# Step 2: Read matching rows using the search result
read_builder = table.new_read_builder()
scan = read_builder.new_scan().with_global_index_result(result)
plan = scan.plan()
table_read = read_builder.new_read()
pa_table = table_read.to_arrow(plan.splits())
print(pa_table)

PyPaimon reads the Tantivy tokenizer settings stored in the global index metadata. Indexes built with tantivy.tokenizer=ngram can be queried from Python when the installed tantivy-py package provides custom tokenizer support. Indexes built with tantivy.tokenizer=jieba can be queried when the jieba package is installed.
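
Because the Spark SQL function takes the query as a JSON string inside a SQL literal, it can be less error-prone to build the JSON from a Python dict than to type it by hand. The sketch below shows one way to do that with the standard `json` module. `full_text_search_sql` is a hypothetical helper; only the SQL shape comes from the examples above. String-literal escaping rules differ between SQL engines, so the sketch refuses single quotes and backslashes rather than guessing at an escaping scheme.

```python
import json


def full_text_search_sql(table, query, limit):
    """Render a full_text_search(...) call for a query given as a Python dict.

    Hypothetical helper for illustration; not part of Paimon.
    """
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Compact separators keep the JSON on one line, as in the examples above.
    query_json = json.dumps(query, separators=(",", ":"), ensure_ascii=False)
    for text in (table, query_json):
        if "'" in text or "\\" in text:
            # Assumption: escaping is engine-specific, so fail instead of guessing.
            raise ValueError("single quotes and backslashes are not supported here")
    return (
        "SELECT * FROM full_text_search(\n"
        f"    '{table}',\n"
        f"    '{query_json}',\n"
        f"    {limit}\n"
        ");"
    )


query = {
    "match": {
        "column": "content",
        "terms": "paimon lake format",
        "operator": "And",
    }
}
print(full_text_search_sql("my_table", query, 10))
```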

Query DSL Reference

Match

Use match to search terms in one text column. By default, a row matches if any query term matches.

{
  "match": {
    "column": "content",
    "terms": "paimon lake format",
    "operator": "or",
    "boost": 2.0,
    "fuzziness": 1,
    "max_expansions": 50,
    "prefix_length": 0
  }
}

Field	Required	Default	Description
`column`	Yes		Text column to search.
`terms`	Yes		Query text. `query` is also accepted as an alias.
`operator`	No	`or`	How query terms are combined. Supported values are `or` and `and`.
`boost`	No	`1.0`	Positive score multiplier for this query.
`fuzziness`	No	`0`	Edit distance for fuzzy matching. Supported numeric values are `0`, `1`, and `2`. Use `null` or `auto` to leave fuzziness unset.
`max_expansions`	No	`50`	Maximum fuzzy term expansions. `maxExpansions` is also accepted.
`prefix_length`	No	`0`	Number of leading characters that must match exactly for fuzzy matching. `prefixLength` is also accepted.

Phrase

Use match_phrase to match terms in order. Phrase queries require the index to store positions, so keep tantivy.with-position=true when building indexes that need phrase search.

{
  "match_phrase": {
    "column": "content",
    "terms": "paimon lake",
    "slop": 1
  }
}

phrase is accepted as an alias for match_phrase, and query is accepted as an alias for terms. Paimon serializes phrase queries as match_phrase.

Field	Required	Default	Description
`column`	Yes		Text column to search.
`terms`	Yes		Phrase text. `query` is also accepted as an alias.
`slop`	No	`0`	Maximum number of term-position moves allowed in the phrase.

Multi Match

Use multi_match to search the same query text across multiple columns in one full-text query. Column boosts are applied inside the full-text score before the final top-k is selected.

{
  "multi_match": {
    "query": "paimon lake",
    "columns": ["title", "content"],
    "boost": [2.0, 1.0],
    "operator": "or"
  }
}

Field	Required	Default	Description
`query`	Yes		Query text searched in every listed column.
`columns`	Yes		Non-empty array of text columns.
`boost`	No	all `1.0`	Per-column score multipliers. The array length must match `columns`. `boosts` is also accepted.
`operator`	No	`or`	How query terms are combined within each column. Supported values are `or` and `and`.

Boost

Use boost to reduce the score of documents that also match a negative query.

{
  "boost": {
    "positive": {
      "match": {
        "column": "content",
        "terms": "paimon"
      }
    },
    "negative": {
      "match": {
        "column": "content",
        "terms": "vector"
      }
    },
    "negative_boost": 0.3
  }
}

Field	Required	Default	Description
`positive`	Yes		Query that contributes the main score.
`negative`	Yes		Query used to down-rank matching rows.
`negative_boost`	No	`0.5`	Positive multiplier applied to rows that also match `negative`. `negativeBoost` is also accepted.

Boolean

Use boolean to combine nested queries. must clauses restrict matches, should clauses add matches and score, and must_not clauses exclude matches.

{
  "boolean": {
    "must": [
      {
        "match": {
          "column": "content",
          "terms": "paimon"
        }
      }
    ],
    "should": [
      {
        "match_phrase": {
          "column": "content",
          "terms": "lake format"
        }
      }
    ],
    "must_not": [
      {
        "match": {
          "column": "content",
          "terms": "vector"
        }
      }
    ]
  }
}

Field	Required	Default	Description
`must`	No	empty	Array of nested queries that every result must match.
`should`	No	empty	Array of nested queries that can match and contribute score.
`must_not`	No	empty	Array of nested queries that matching rows must not satisfy.

boolean also accepts a queries array for explicit occurrence labels:

{
  "boolean": {
    "queries": [
      {
        "occur": "must",
        "query": {
          "match": {
            "column": "content",
            "terms": "paimon"
          }
        }
      },
      [
        "must_not",
        {
          "match": {
            "column": "content",
            "terms": "vector"
          }
        }
      ]
    ]
  }
}

Supported occur values are should, must, and must_not.
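
The two boolean shapes carry the same information. As an illustration, the sketch below collects clauses from either shape into `must`, `should`, and `must_not` lists. It is not Paimon's parser: `normalize_boolean` is a hypothetical name, and whether the two shapes may be mixed in one query is not stated here, so the sketch simply accepts whatever is present.

```python
def normalize_boolean(body):
    """Collect clauses from either boolean shape into must/should/must_not lists.

    Hypothetical helper for illustration; not part of Paimon.
    """
    clauses = {"must": [], "should": [], "must_not": []}
    # Shape 1: one array per occurrence type.
    for occur in clauses:
        clauses[occur].extend(body.get(occur, []))
    # Shape 2: a "queries" array of {"occur", "query"} objects or [occur, query] pairs.
    for entry in body.get("queries", []):
        if isinstance(entry, dict):
            occur, query = entry["occur"], entry["query"]
        else:
            occur, query = entry
        if occur not in clauses:
            raise ValueError(f"unsupported occur value: {occur!r}")
        clauses[occur].append(query)
    return clauses


paimon = {"match": {"column": "content", "terms": "paimon"}}
vector = {"match": {"column": "content", "terms": "vector"}}

explicit = {"queries": [{"occur": "must", "query": paimon}, ["must_not", vector]]}
grouped = {"must": [paimon], "must_not": [vector]}

# Both shapes normalize to the same clause lists.
print(normalize_boolean(explicit) == normalize_boolean(grouped))
```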

Validation Notes

Query text and column names cannot be empty.
Boost values and negative_boost must be positive.
max_expansions must be positive.
prefix_length and slop must be non-negative.
fuzziness must be 0, 1, 2, null, or auto.
max_expansions and prefix_length are part of the DSL for LanceDB compatibility. The current Tantivy backend accepts the default values only: max_expansions=50 and prefix_length=0.
PyPaimon can parse the same DSL, but its local Tantivy reader does not support every advanced scoring option yet. Unsupported options fail fast instead of changing query semantics silently.
Multi-column queries such as multi_match require full-text indexes for every referenced column.
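
The rules above can be checked on the client before a query is submitted. The following is a minimal sketch of such a check, written from the notes in this section rather than from Paimon's own validator. The function names are hypothetical, camelCase aliases and the `queries` array form of `boolean` are not handled, and the backend-specific limits on `max_expansions` and `prefix_length` are not enforced.

```python
def _text(value, name):
    if not isinstance(value, str) or not value.strip():
        raise ValueError(f"{name} cannot be empty")


def _number(value, name, allow_zero):
    ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not ok or value < 0 or (value == 0 and not allow_zero):
        kind = "non-negative" if allow_zero else "positive"
        raise ValueError(f"{name} must be {kind}")


def validate_query(query):
    """Raise ValueError if a query dict breaks one of the rules listed above.

    Illustrative only; this is not Paimon's validator.
    """
    if not isinstance(query, dict) or len(query) != 1:
        raise ValueError("root object must contain exactly one query type")
    (kind, body), = query.items()
    if kind in ("match", "match_phrase", "phrase"):
        _text(body.get("column"), "column")
        _text(body.get("terms", body.get("query")), "terms")
        _number(body.get("boost", 1.0), "boost", allow_zero=False)
        _number(body.get("max_expansions", 50), "max_expansions", allow_zero=False)
        _number(body.get("prefix_length", 0), "prefix_length", allow_zero=True)
        _number(body.get("slop", 0), "slop", allow_zero=True)
        if body.get("fuzziness", 0) not in (0, 1, 2, None, "auto"):
            raise ValueError("fuzziness must be 0, 1, 2, null, or auto")
    elif kind == "multi_match":
        _text(body.get("query"), "query")
        columns = body.get("columns") or []
        if not columns:
            raise ValueError("columns cannot be empty")
        for column in columns:
            _text(column, "column")
        boosts = body.get("boost", [1.0] * len(columns))
        if len(boosts) != len(columns):
            raise ValueError("boost length must match columns")
        for boost in boosts:
            _number(boost, "boost", allow_zero=False)
    elif kind == "boost":
        validate_query(body.get("positive"))
        validate_query(body.get("negative"))
        _number(body.get("negative_boost", 0.5), "negative_boost", allow_zero=False)
    elif kind == "boolean":
        for occur in ("must", "should", "must_not"):
            for nested in body.get(occur, []):
                validate_query(nested)
    else:
        raise ValueError(f"unsupported query type: {kind!r}")


validate_query({"match": {"column": "content", "terms": "paimon", "fuzziness": 1}})
try:
    validate_query({"multi_match": {"query": "paimon", "columns": ["title", "content"], "boost": [2.0]}})
except ValueError as error:
    print(error)
```

A check like this only catches malformed input early; the engine remains the authority on what it accepts.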
