PyTorch

Read

Reading into PyTorch requires the torch package to be installed.

You can read all the data into a torch.utils.data.Dataset or torch.utils.data.IterableDataset:

from torch.utils.data import DataLoader

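# read_builder and splits are assumed to have been created earlier
# from the table's read builder and scan plan.
table_read = read_builder.new_read()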
dataset = table_read.to_torch(splits, streaming=True)
dataloader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2,  # Number of worker processes used to read data
    shuffle=False
)

# Iterate over all batches from the dataloader and print each one
for batch_idx, batch_data in enumerate(dataloader):
    print(batch_data)

# output:
#   {'user_id': tensor([1, 2]), 'behavior': ['a', 'b']}
#   {'user_id': tensor([3, 4]), 'behavior': ['c', 'd']}
#   {'user_id': tensor([5, 6]), 'behavior': ['e', 'f']}
#   {'user_id': tensor([7, 8]), 'behavior': ['g', 'h']}

The streaming parameter controls how the data is loaded. When streaming=True, the data is read iteratively as the DataLoader consumes it; when streaming=False, the full data set is read into memory up front.
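
The difference between the two modes can be illustrated with a small plain-Python sketch. This is not how Paimon implements to_torch; it only models the behavior described above. The names SPLITS, read_split, read_streaming and read_full are hypothetical and exist only for this illustration.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]

# Hypothetical stand-in for table splits: each split holds a few rows.
SPLITS: List[List[Row]] = [
    [{"user_id": 1, "behavior": "a"}, {"user_id": 2, "behavior": "b"}],
    [{"user_id": 3, "behavior": "c"}, {"user_id": 4, "behavior": "d"}],
]

splits_loaded: List[int] = []  # records which splits have been read so far


def read_split(index: int) -> List[Row]:
    """Pretend to read one split from storage and record that it happened."""
    splits_loaded.append(index)
    return SPLITS[index]


def read_streaming() -> Iterator[Row]:
    """Iterative read: a split is only loaded when its rows are requested."""
    for index in range(len(SPLITS)):
        yield from read_split(index)


def read_full() -> List[Row]:
    """Full read: every split is loaded into memory before any row is used."""
    rows: List[Row] = []
    for index in range(len(SPLITS)):
        rows.extend(read_split(index))
    return rows


# Streaming: after taking the first row, only the first split has been read.
stream = read_streaming()
first_row = next(stream)
loaded_after_first_streamed_row = list(splits_loaded)

# Full: all splits are read up front.
splits_loaded.clear()
all_rows = read_full()
loaded_after_full_read = list(splits_loaded)

print(first_row)
print(loaded_after_first_streamed_row, loaded_after_full_read)
```

In this model the streaming reader keeps memory use bounded by one split at a time, while the full reader trades memory for random access to every row. If the streaming dataset is an IterableDataset, note that PyTorch's DataLoader rejects shuffle=True for iterable-style datasets, which is consistent with shuffle=False in the example above.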