SQL Query
PyPaimon supports executing SQL queries on Paimon tables, powered by pypaimon-rust and DataFusion.
Installation
SQL query support requires additional dependencies. Install them with:
pip install pypaimon[sql]
This will install pypaimon-rust (which bundles DataFusion).
Usage
Create a SQLContext, register one or more catalogs with their options, and run SQL queries.
Basic Query
from pypaimon_rust.datafusion import SQLContext
import pyarrow as pa
ctx = SQLContext()
ctx.register_catalog("paimon", {"warehouse": "/path/to/warehouse"})
ctx.set_current_catalog("paimon")
ctx.set_current_database("default")
# Execute SQL and get PyArrow RecordBatches
batches = ctx.sql("SELECT * FROM my_table")
table = pa.Table.from_batches(batches)
print(table)
# Convert to Pandas DataFrame
df = table.to_pandas()
print(df)
SQLContext can also be imported from pypaimon:
from pypaimon import SQLContext
Table Reference Format
The default catalog and default database can be configured via set_current_catalog() and set_current_database(), so you can reference tables in multiple ways:
# Direct table name (uses default database)
ctx.sql("SELECT * FROM my_table")
# Two-part: database.table
ctx.sql("SELECT * FROM mydb.my_table")
# Three-part: catalog.database.table
ctx.sql("SELECT * FROM paimon.mydb.my_table")
Multi-Catalog Query
SQLContext supports registering multiple catalogs for cross-catalog queries:
ctx = SQLContext()
ctx.register_catalog("a", {"warehouse": "/path/to/warehouse_a"})
ctx.register_catalog("b", {
"metastore": "rest",
"uri": "http://localhost:8080",
"warehouse": "warehouse_b",
})
ctx.set_current_catalog("a")
ctx.set_current_database("default")
# Cross-catalog join
batches = ctx.sql("""
SELECT a_users.name, b_orders.amount
FROM a.default.users AS a_users
JOIN b.default.orders AS b_orders ON a_users.id = b_orders.user_id
""")
Register Arrow Batches
You can register PyArrow RecordBatches as temporary tables:
batch = pa.record_batch([[1, 2], ["alice", "bob"]], names=["id", "name"])
ctx.register_batch("paimon.default.my_temp", batch)
batches = ctx.sql("SELECT * FROM paimon.default.my_temp")
Supported SQL Syntax
The SQL engine is powered by Apache DataFusion, which supports a rich set of SQL syntax. For the full SQL reference, see the paimon-rust SQL documentation which covers:
- DDL:
CREATE SCHEMA,CREATE TABLE(withPARTITIONED BY,PRIMARY KEY,WITHoptions),DROP TABLE,ALTER TABLE,CREATE TEMPORARY TABLE/VIEW - DML:
INSERT INTO,INSERT OVERWRITE(dynamic/static partitions),UPDATE,DELETE,MERGE INTO,TRUNCATE TABLE - Procedures:
CALL sys.create_tag,CALL sys.rollback_to, etc. - Queries:
SELECT, column projection, filter pushdown,COUNT(*)pushdown - Time Travel:
VERSION AS OF,TIMESTAMP AS OF - Vector Search:
vector_search()table function - Full-Text Search:
full_text_search()table function - Dynamic Options:
SET/RESET - System Tables:
$options,$schemas,$snapshots,$tags,$manifests
For the DataFusion query syntax (JOINs, aggregations, subqueries, CTEs, window functions, etc.), see the DataFusion SQL documentation.