Querying Tables

Querying Tables #

Just like all other tables, Paimon tables can be queried with SELECT statement.

Scan Mode #

By specifying the scan.mode table property, users can specify where and how Paimon sources should produce records.

Scan Mode Batch Source Behavior Streaming Source Behavior
default The default scan mode. Determines actual scan mode according to other table properties. If "scan.timestamp-millis" is set the actual scan mode will be "from-timestamp", and if "scan.snapshot-id" is set the actual startup mode will be "from-snapshot". Otherwise the actual scan mode will be "latest-full".
latest-full Produces the latest snapshot of table. Produces the latest snapshot on the table upon first startup, and continues to read the following changes.
compacted-full Produces the snapshot after the latest compaction. Produces the snapshot after the latest compaction on the table upon first startup, and continues to read the following changes.
latest Same as "latest-full" Continuously reads latest changes without producing a snapshot at the beginning.
from-timestamp Produces a snapshot earlier than or equals to the timestamp specified by "scan.timestamp-millis". Continuously reads changes starting from timestamp specified by "scan.timestamp-millis", without producing a snapshot at the beginning.
from-snapshot Produces a snapshot specified by "scan.snapshot-id". Continuously reads changes starting from a snapshot specified by "scan.snapshot-id", without producing a snapshot at the beginning.
from-snapshot-full Produces a snapshot specified by "scan.snapshot-id". Produces from snapshot specified by "scan.snapshot-id" on the table upon first startup, and continuously reads changes.

Users can also adjust changelog-producer table property to specify the pattern of produced changes. See changelog producer for details.

Time Travel #

Currently, Paimon supports time travel for Flink and Spark 3 (requires Spark 3.3+).


you can use dynamic table options to specify scan mode and from where to start:

-- travel to snapshot with id 1L
SELECT * FROM t /*+ OPTIONS('scan.snapshot-id' = '1') */;

-- travel to specified timestamp with a long value in milliseconds
SELECT * FROM t /*+ OPTIONS('scan.timestamp-millis' = '1678883047356') */;

you can use VERSION AS OF and TIMESTAMP AS OF in query to do time travel:

-- travel to snapshot with id 1L (use snapshot id as version)
SELECT * FROM t VERSION AS OF 1;

-- travel to specified timestamp 
SELECT * FROM t TIMESTAMP AS OF '2023-06-01 00:00:00.123';

-- you can also use a long value in seconds as timestamp
SELECT * FROM t TIMESTAMP AS OF 1678883047;

Consumer ID #

This is an experimental feature.

You can specify the consumer-id when streaming read table:

SELECT * FROM t /*+ OPTIONS('consumer-id' = 'myid') */;

When stream read Paimon tables, the next snapshot id to be recorded into the file system. This has several advantages:

  1. When previous job is stopped, the newly started job can continue to consume from the previous progress without resuming from the state. The newly reading will start reading from next snapshot id found in consumer files.
  2. When deciding whether a snapshot has expired, Paimon looks at all the consumers of the table in the file system, and if there are consumers that still depend on this snapshot, then this snapshot will not be deleted by expiration.
NOTE: If there is a consumer that will not be used anymore, please delete it, otherwise it will affect the expiration of the snapshot. The consumer file is in ${table-path}/consumer/consumer-${consumer-id}.

System Tables #

System tables contain metadata and information about each table, such as the snapshots created and the options in use. Users can access system tables with batch queries.

Currently, Flink, Spark and Trino supports querying system tables.

In some cases, the table name needs to be enclosed with back quotes to avoid syntax parsing conflicts, for example triple access mode:

SELECT * FROM my_catalog.my_db.`MyTable$snapshots`;

Snapshots Table #

You can query the snapshot history information of the table through snapshots table, including the record count occurred in the snapshot.

SELECT * FROM MyTable$snapshots;

/*
+--------------+------------+-----------------+-------------------+--------------+-------------------------+---------------------+---------------------+-------------------------+
|  snapshot_id |  schema_id |     commit_user | commit_identifier |  commit_kind |             commit_time |  total_record_count |  delta_record_count |  changelog_record_count |
+--------------+------------+-----------------+-------------------+--------------+-------------------------+---------------------+---------------------+-------------------------+
|            2 |          0 | 7ca4cd28-98e... |                 2 |       APPEND | 2022-10-26 11:44:15.600 |                   2 |                   2 |                       0 |
|            1 |          0 | 870062aa-3e9... |                 1 |       APPEND | 2022-10-26 11:44:15.148 |                   1 |                   1 |                       0 |
+--------------+------------+-----------------+-------------------+--------------+-------------------------+---------------------+---------------------+-------------------------+
2 rows in set
*/

By querying the snapshots table, you can know the commit and expiration information about that table and time travel through the data.

Schemas Table #

You can query the historical schemas of the table through schemas table.

SELECT * FROM MyTable$schemas;

/*
+-----------+--------------------------------+----------------+--------------+---------+---------+
| schema_id |                         fields | partition_keys | primary_keys | options | comment |
+-----------+--------------------------------+----------------+--------------+---------+---------+
|         0 | [{"id":0,"name":"word","typ... |             [] |     ["word"] |      {} |         |
|         1 | [{"id":0,"name":"word","typ... |             [] |     ["word"] |      {} |         |
|         2 | [{"id":0,"name":"word","typ... |             [] |     ["word"] |      {} |         |
+-----------+--------------------------------+----------------+--------------+---------+---------+
3 rows in set
*/

You can join the snapshots table and schemas table to get the fields of given snapshots.

SELECT s.snapshot_id, t.schema_id, t.fields 
    FROM MyTable$snapshots s JOIN MyTable$schemas t 
    ON s.schema_id=t.schema_id where s.snapshot_id=100;

Options Table #

You can query the table’s option information which is specified from the DDL through options table. The options not shown will be the default value. You can take reference to [Configuration].

SELECT * FROM MyTable$options;

/*
+------------------------+--------------------+
|         key            |        value       |
+------------------------+--------------------+
| snapshot.time-retained |         5 h        |
+------------------------+--------------------+
1 rows in set
*/

Audit log Table #

If you need to audit the changelog of the table, you can use the audit_log system table. Through audit_log table, you can get the rowkind column when you get the incremental data of the table. You can use this column for filtering and other operations to complete the audit.

There are four values for rowkind:

  • +I: Insertion operation.
  • -U: Update operation with the previous content of the updated row.
  • +U: Update operation with new content of the updated row.
  • -D: Deletion operation.
SELECT * FROM MyTable$audit_log;

/*
+------------------+-----------------+-----------------+
|     rowkind      |     column_0    |     column_1    |
+------------------+-----------------+-----------------+
|        +I        |      ...        |      ...        |
+------------------+-----------------+-----------------+
|        -U        |      ...        |      ...        |
+------------------+-----------------+-----------------+
|        +U        |      ...        |      ...        |
+------------------+-----------------+-----------------+
3 rows in set
*/

Files Table #

You can query the files of the table with specific snapshot.

-- Query the files of latest snapshot
SELECT * FROM MyTable$files;
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
| partition | bucket |                      file_path | file_format | schema_id | level | record_count | file_size_in_bytes | min_key | max_key |      null_value_counts |         min_value_stats |         max_value_stats |         creation_time |
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
|       [3] |      0 | data-8f64af95-29cc-4342-adc... |         orc |         0 |     0 |            1 |                593 |     [c] |     [c] | {cnt=0, val=0, word=0} | {cnt=3, val=33, word=c} | {cnt=3, val=33, word=c} |2023-02-24T16:06:21.166|
|       [2] |      0 | data-8b369068-0d37-4011-aa5... |         orc |         0 |     0 |            1 |                593 |     [b] |     [b] | {cnt=0, val=0, word=0} | {cnt=2, val=22, word=b} | {cnt=2, val=22, word=b} |2023-02-24T16:06:21.166|
|       [2] |      0 | data-83aa7973-060b-40b6-8c8... |         orc |         0 |     0 |            1 |                605 |     [d] |     [d] | {cnt=0, val=0, word=0} | {cnt=2, val=32, word=d} | {cnt=2, val=32, word=d} |2023-02-24T16:06:21.166|
|       [5] |      0 | data-3d304f4a-bcea-44dc-a13... |         orc |         0 |     0 |            1 |                593 |     [c] |     [c] | {cnt=0, val=0, word=0} | {cnt=5, val=51, word=c} | {cnt=5, val=51, word=c} |2023-02-24T16:06:21.166|
|       [1] |      0 | data-10abb5bc-0170-43ae-b6a... |         orc |         0 |     0 |            1 |                595 |     [a] |     [a] | {cnt=0, val=0, word=0} | {cnt=1, val=11, word=a} | {cnt=1, val=11, word=a} |2023-02-24T16:06:21.166|
|       [4] |      0 | data-2c9b7095-65b7-4013-a7a... |         orc |         0 |     0 |            1 |                593 |     [a] |     [a] | {cnt=0, val=0, word=0} | {cnt=4, val=12, word=a} | {cnt=4, val=12, word=a} |2023-02-24T16:06:21.166|
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
6 rows in set

-- You can also query the files with specific snapshot
SELECT * FROM MyTable$files /*+ OPTIONS('scan.snapshot-id'='1') */;
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
| partition | bucket |                      file_path | file_format | schema_id | level | record_count | file_size_in_bytes | min_key | max_key |      null_value_counts |         min_value_stats |         max_value_stats |         creation_time |
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
|       [3] |      0 | data-8f64af95-29cc-4342-adc... |         orc |         0 |     0 |            1 |                593 |     [c] |     [c] | {cnt=0, val=0, word=0} | {cnt=3, val=33, word=c} | {cnt=3, val=33, word=c} |2023-02-24T16:06:21.166|
|       [2] |      0 | data-8b369068-0d37-4011-aa5... |         orc |         0 |     0 |            1 |                593 |     [b] |     [b] | {cnt=0, val=0, word=0} | {cnt=2, val=22, word=b} | {cnt=2, val=22, word=b} |2023-02-24T16:06:21.166|
|       [1] |      0 | data-10abb5bc-0170-43ae-b6a... |         orc |         0 |     0 |            1 |                595 |     [a] |     [a] | {cnt=0, val=0, word=0} | {cnt=1, val=11, word=a} | {cnt=1, val=11, word=a} |2023-02-24T16:06:21.166|
+-----------+--------+--------------------------------+-------------+-----------+-------+--------------+--------------------+---------+---------+------------------------+-------------------------+-------------------------+-----------------------+
3 rows in set