This documentation is for an unreleased version of Apache Paimon. We recommend you use the latest stable version.
Vector Storage #
Overview #
With the explosive growth of AI scenarios, vector storage has become increasingly important.
Paimon provides optimized storage solutions specifically designed for vector data to meet the needs of various scenarios.
Vector Data Type #
Vector data comes in many types, among which dense vectors are the most commonly used. They are typically expressed as fixed-length, densely packed arrays, generally without null elements.
Paimon supports defining columns of type VECTOR<t, n>, which represents a fixed-length, dense vector column, where:
- t: The element type of the vector. Currently seven primitive types are supported: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE;
- n: The vector dimension, which must be a positive integer not exceeding 2,147,483,647;
- null constraint: A VECTOR column may be declared NOT NULL or left nullable (the default). However, if a specific VECTOR value itself is not null, its elements are not allowed to be null.
Compared to variable-length arrays, these features make dense vectors more concise in storage and memory representation, with benefits including:
- More natural semantic constraints, preventing mismatched lengths, null elements, and other anomalies at the data storage layer;
- Better point-lookup performance, eliminating offset array storage and access;
- Closer alignment with type representations in specialized vector engines, often avoiding memory copies and type conversions during queries.
Example: Define a table with a VECTOR column using Java API and write one row of data.
public class CreateTableWithVector {

    public static void main(String[] args) throws Exception {
        // Schema: a BIGINT id plus a 3-dimensional FLOAT vector column
        Schema.Builder schemaBuilder = Schema.newBuilder();
        schemaBuilder.column("id", DataTypes.BIGINT());
        schemaBuilder.column("embed", DataTypes.VECTOR(3, DataTypes.FLOAT()));
        schemaBuilder.option(CoreOptions.FILE_FORMAT.key(), "lance");
        schemaBuilder.option(CoreOptions.FILE_COMPRESSION.key(), "none");
        Schema schema = schemaBuilder.build();

        // Create catalog
        String database = "default";
        String tempPath = System.getProperty("java.io.tmpdir") + UUID.randomUUID();
        Path warehouse = new Path(TraceableFileIO.SCHEME + "://" + tempPath);
        Identifier identifier = Identifier.create(database, "my_table");
        try (Catalog catalog = CatalogFactory.createCatalog(CatalogContext.create(warehouse))) {
            // Create table
            catalog.createDatabase(database, true);
            catalog.createTable(identifier, schema, true);
            FileStoreTable table = (FileStoreTable) catalog.getTable(identifier);

            // Write one row of data
            BatchWriteBuilder builder = table.newBatchWriteBuilder();
            InternalVector vector = BinaryVector.fromPrimitiveArray(new float[] {1.0f, 2.0f, 3.0f});
            try (BatchTableWrite batchTableWrite = builder.newWrite()) {
                try (BatchTableCommit commit = builder.newCommit()) {
                    batchTableWrite.write(GenericRow.of(1L, vector));
                    commit.commit(batchTableWrite.prepareCommit());
                }
            }

            // Read data back
            ReadBuilder readBuilder = table.newReadBuilder();
            TableScan.Plan plan = readBuilder.newScan().plan();
            try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) {
                reader.forEachRemaining(row -> {
                    float[] readVector = row.getVector(1).toFloatArray();
                    System.out.println(Arrays.toString(readVector));
                });
            }
        }
    }
}
Notes:
- Columns of VECTOR type cannot be used as primary key columns, partition columns, or for sorting.
Engine-Level Representation #
Since engine layers typically don’t have dedicated vector types, to support the VECTOR type in engine SQL, Paimon provides a separate configuration that maps the engine’s ARRAY type to Paimon’s VECTOR type.
Usage:
- 'vector-field': Declares columns as VECTOR type; separate multiple columns with commas (,);
- 'field.{field-name}.vector-dim': Declares the dimension of the vector column.
Example: Define a table with a VECTOR column using Flink SQL.
CREATE TABLE IF NOT EXISTS ts_table (
id BIGINT,
embed1 ARRAY<FLOAT>,
embed2 ARRAY<FLOAT>
) WITH (
'file.format' = 'lance',
'vector-field' = 'embed1,embed2',
'field.embed1.vector-dim' = '128',
'field.embed2.vector-dim' = '768'
);
Notes:
- When defining vector-field columns, you must provide the vector dimension; otherwise, the CREATE TABLE statement will fail;
- Currently, only Flink SQL supports this configuration; other engines have not implemented it yet.
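To illustrate the dimension constraint above, here is a minimal write sketch through the ARRAY mapping. The table name, the small dimension (3, chosen for brevity), and the literal values are all illustrative; the CAST guards against Flink inferring ARRAY&lt;DOUBLE&gt; for the literal.

```sql
-- Illustrative table with a small dimension for brevity
CREATE TABLE IF NOT EXISTS demo_vectors (
    id BIGINT,
    embed ARRAY<FLOAT>
) WITH (
    'file.format' = 'lance',
    'vector-field' = 'embed',
    'field.embed.vector-dim' = '3'
);

-- The array value must contain exactly 3 non-null elements
INSERT INTO demo_vectors
VALUES (1, CAST(ARRAY[0.1, 0.2, 0.3] AS ARRAY<FLOAT>));
```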
Dedicated File Format for Vector #
When mapping the VECTOR type to the file format layer, the ideal storage format is FixedSizeList. Currently, this is only supported for certain file formats (such as lance) through the paimon-arrow integration. This means that to use the VECTOR type, you must specify a particular format via file.format, which affects the whole table. In particular, this may be unfavorable for scalar columns and multimodal (Blob) data.
Therefore, Paimon provides a solution to store vector columns separately within Data Evolution tables.
Layout:
table/
├── bucket-0/
│ ├── data-uuid-0.parquet # Contains id, name columns
│ ├── data-uuid-1.blob # Contains blob data
│ ├── data-uuid-2.vector.lance # Contains vector data using lance format
│ └── ...
├── manifest/
├── schema/
└── snapshot/
Usage:
- vector.file.format: Stores VECTOR type columns separately in the specified file format;
- vector.target-file-size: If stored separately, specifies the target file size for vector data, defaulting to 10 * 'target-file-size'.
Example: Store VECTOR columns separately using Flink SQL.
CREATE TABLE IF NOT EXISTS ts_table (
id BIGINT,
embed ARRAY<FLOAT>
) WITH (
'file.format' = 'parquet',
'vector.file.format' = 'lance',
'vector-field' = 'embed',
'field.embed.vector-dim' = '128',
'row-tracking.enabled' = 'true',
'data-evolution.enabled' = 'true'
);
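If the default of 10 * 'target-file-size' is not a good fit for the vector files, the option can also be adjusted after table creation. A sketch using Paimon's ALTER TABLE ... SET syntax in Flink SQL; the value shown is illustrative:

```sql
-- Illustrative value; the default is 10 * 'target-file-size'
ALTER TABLE ts_table SET ('vector.target-file-size' = '1 GB');
```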