PyTorch #
Read #
This requires torch to be installed.
You can read the data into a torch.utils.data.Dataset or torch.utils.data.IterableDataset:
from torch.utils.data import DataLoader

# read_builder and splits are created beforehand from the table's read builder
table_read = read_builder.new_read()
dataset = table_read.to_torch(splits, streaming=True)

dataloader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2,  # concurrency used to read data
    shuffle=False
)
# Iterate over all batches from the dataloader
for batch_idx, batch_data in enumerate(dataloader):
print(batch_data)
# output:
# {'user_id': tensor([1, 2]), 'behavior': ['a', 'b']}
# {'user_id': tensor([3, 4]), 'behavior': ['c', 'd']}
# {'user_id': tensor([5, 6]), 'behavior': ['e', 'f']}
# {'user_id': tensor([7, 8]), 'behavior': ['g', 'h']}
When the streaming parameter is True, the data is read iteratively as a torch.utils.data.IterableDataset;
when it is False, the full data is read into memory as a torch.utils.data.Dataset.
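For the non-streaming mode, a minimal sketch is shown below. It assumes that read_builder and splits are created as in the example above, and that the fully materialized Dataset supports random access, so the DataLoader can also shuffle it:

from torch.utils.data import DataLoader

table_read = read_builder.new_read()
# streaming=False reads the full data into memory as a map-style Dataset
dataset = table_read.to_torch(splits, streaming=False)

dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True  # random access allows shuffling across the whole dataset
)

for batch_idx, batch_data in enumerate(dataloader):
    print(batch_data)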