
Table index

Table index files are stored in the index directory.

Dynamic Bucket Index

The dynamic bucket index stores the correspondence between the hash values of primary keys and buckets.

Its structure is very simple, only storing hash values in the file:

HASH_VALUE | HASH_VALUE | HASH_VALUE | HASH_VALUE | ...

Each HASH_VALUE is the hash value of a primary key, stored as 4 bytes in BIG_ENDIAN order.
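
The layout above can be read with a few lines of code. The following is a minimal sketch, not Paimon's own reader: the function names are hypothetical, and it assumes the hash values are signed 32-bit integers (as a Java int would be), which the format description does not state explicitly.

```python
import struct


def read_hash_values(data: bytes) -> list[int]:
    """Decode the body of a dynamic bucket index file into hash values."""
    if len(data) % 4 != 0:
        raise ValueError("file length is not a multiple of 4 bytes")
    count = len(data) // 4
    # ">" selects big-endian; "i" is a signed 32-bit integer (assumption).
    return list(struct.unpack(f">{count}i", data))


def write_hash_values(hashes: list[int]) -> bytes:
    """Encode hash values back to back, 4 bytes each, big-endian."""
    return struct.pack(f">{len(hashes)}i", *hashes)
```

Because the file has no header or length prefix, the number of hash values is simply the file size divided by 4.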

Deletion Vectors

A deletion file stores the positions of deleted records for each data file. For a primary key table, each bucket has one deletion file.

The deletion file is a binary file, and the format is as follows:

  • First, the version is recorded as a single byte. The current version is 1.
  • Then, <size of serialized bin, serialized bin, checksum of serialized bin> entries are recorded in sequence.
  • The size and checksum are BIG_ENDIAN integers.
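
The framing described above can be sketched as follows. This is an illustrative reader and writer, not Paimon's implementation: the function names are hypothetical, each serialized bin is treated as opaque bytes, and the checksum is assumed to be CRC-32, which the format description does not specify. Pass verify=False to skip that assumption.

```python
import struct
import zlib

VERSION = 1  # current version given in the format description


def write_deletion_file(bins: list[bytes]) -> bytes:
    """Frame each serialized bin as <size, bin, checksum>."""
    out = bytearray([VERSION])
    for payload in bins:
        out += struct.pack(">i", len(payload))          # size, big-endian int
        out += payload                                  # serialized bin
        out += struct.pack(">I", zlib.crc32(payload))   # checksum (CRC-32 assumed)
    return bytes(out)


def read_deletion_file(data: bytes, verify: bool = True) -> list[bytes]:
    """Split a deletion file into its serialized bins."""
    if not data:
        raise ValueError("empty deletion file")
    if data[0] != VERSION:
        raise ValueError(f"unsupported version: {data[0]}")
    bins = []
    pos = 1
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from(">i", data, pos)
        pos += 4
        if size < 0 or pos + size + 4 > len(data):
            raise ValueError("truncated entry")
        payload = data[pos:pos + size]
        pos += size
        (checksum,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if verify and zlib.crc32(payload) != checksum:
            raise ValueError("checksum mismatch")
        bins.append(payload)
    return bins
```

The sketch walks the file sequentially from the first entry; a reader that already knows the offset and length of a bin could jump to it directly instead.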

For each serialized bin, the serialization format is determined by deletion-vectors.bitmap64. By default, Paimon uses a 32-bit bitmap to store deleted records; if deletion-vectors.bitmap64 is set to true, a 64-bit bitmap is used instead. The two bitmaps are serialized differently. Note that only the 64-bit bitmap implementation is compatible with Iceberg.

Serialized bin for 32-bit bitmap (default):

  • First, a constant magic number is recorded as an int (BIG_ENDIAN). Currently, the magic number is 1581511376.
  • Then, a serialized 32-bit bitmap is recorded. It is a RoaringBitmap (org.roaringbitmap.RoaringBitmap).
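
As a rough illustration of the two steps above, the sketch below checks the magic number and returns the remaining bytes, which hold the serialized RoaringBitmap. The function name is hypothetical, and decoding the Roaring payload itself is left to a Roaring library.

```python
import struct

MAGIC_32 = 1581511376  # magic number for the 32-bit bitmap bin


def split_bitmap32_bin(serialized_bin: bytes) -> bytes:
    """Validate the magic number and return the RoaringBitmap bytes."""
    if len(serialized_bin) < 4:
        raise ValueError("bin is too short to hold a magic number")
    # The magic number is a big-endian int at the start of the bin.
    (magic,) = struct.unpack_from(">i", serialized_bin, 0)
    if magic != MAGIC_32:
        raise ValueError(f"unexpected magic number: {magic}")
    return serialized_bin[4:]
```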

Serialized bin for 64-bit bitmap:

  • First, a constant magic number is recorded as an int (LITTLE_ENDIAN). Currently, the magic number is 1681511377.
  • Then, a serialized 64-bit bitmap is recorded. It supports positive 64-bit positions (the most significant bit must be 0), but is optimized for cases where most positions fit in 32 bits by using an array of 32-bit Roaring bitmaps. The internal bitmap array grows as needed to accommodate the largest position. The 64-bit bitmap is serialized as follows:
    • First, the size of the bitmap array is recorded as a long (LITTLE_ENDIAN).
    • Then, for each bitmap in the array, its index as an int (LITTLE_ENDIAN) and its serialized bytes are recorded in sequence.
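
The 64-bit layout above can be sketched as a small writer. This is a hedged illustration, not Paimon's or Iceberg's code: the function names are hypothetical, each 32-bit Roaring bitmap is passed in as already-serialized opaque bytes, and split_position assumes that the high 32 bits of a position select the array slot while the low 32 bits are stored in that slot's bitmap.

```python
import struct

MAGIC_64 = 1681511377  # magic number for the 64-bit bitmap bin


def split_position(position: int) -> tuple[int, int]:
    """Assumed mapping of a position to (array index, 32-bit value)."""
    if not 0 <= position < 2**63:
        raise ValueError("the most significant bit must be 0")
    return position >> 32, position & 0xFFFFFFFF


def frame_bitmap64_bin(bitmaps: list[bytes]) -> bytes:
    """Frame serialized 32-bit bitmaps into a 64-bit bitmap bin."""
    out = bytearray(struct.pack("<i", MAGIC_64))   # magic, little-endian int
    out += struct.pack("<q", len(bitmaps))         # array size, little-endian long
    for index, payload in enumerate(bitmaps):
        out += struct.pack("<i", index)            # index, little-endian int
        out += payload                             # serialized 32-bit bitmap
    return bytes(out)
```

The sketch only writes the framing. The layout records no length for each bitmap's bytes, so a reader has to rely on the Roaring deserializer to find where one bitmap ends and the next index begins.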