public class BucketShuffleStrategy extends Object implements ShuffleStrategy
BucketShuffleStrategy
defines a strategy for determining the target subtask id for a
given bucket id and join key hash. It also provides a method to retrieve the set of bucket ids
that are required to be cached by a specific subtask.Constructor and Description |
---|
BucketShuffleStrategy(int numBuckets) |
Modifier and Type | Method and Description |
---|---|
Set<Integer> |
getRequiredCacheBucketIds(int subtaskId,
int numSubtasks)
Retrieves the set of bucket IDs that are required to be cached by a specific subtask.
|
int |
getTargetSubtaskId(int bucketId,
int joinKeyHash,
int numSubtasks)
Determines the target subtask ID for a given bucket ID and join key hash.
|
public int getTargetSubtaskId(int bucketId, int joinKeyHash, int numSubtasks)
The method uses two different strategies based on the comparison between the number of buckets and the number of subtasks:
1. If the number of buckets is greater than or equal to the number of subtasks (numBuckets
>= numSubtasks), then the target subtask ID is determined by assigning buckets to subtasks in
a round-robin manner:
e.g., numBuckets = 5, numSubtasks = 3
Bucket 0 -> Subtask 0
Bucket 1 -> Subtask 1
Bucket 2 -> Subtask 2
Bucket 3 -> Subtask 0
Bucket 4 -> Subtask 1
2. If the number of buckets is less than the number of subtasks (numBuckets <
numSubtasks), then the target subtask ID is determined by distributing subtasks to buckets in
a round-robin manner:
e.g., numBuckets = 2, numSubtasks = 5
Bucket 0 -> Subtask 0,2,4
Bucket 1 -> Subtask 1,3
getTargetSubtaskId
in interface ShuffleStrategy
bucketId
- The ID of the bucket.joinKeyHash
- The hash of the join key.numSubtasks
- The total number of target subtasks.public Set<Integer> getRequiredCacheBucketIds(int subtaskId, int numSubtasks)
getTargetSubtaskId(int, int, int)
to ensure
consistency in bucket to subtask mapping.
The method uses two different strategies based on the comparison between the number of buckets and the number of subtasks:
1. If the number of buckets is greater than or equal to the number of subtasks (numBuckets
>= numSubtasks), then each subtask caches buckets in a round-robin manner.
e.g., numBuckets = 5, numSubtasks = 3
Subtask 0 -> Bucket 0,3
Subtask 1 -> Bucket 1,4
Subtask 2 -> Bucket 2
2. If the number of buckets is less than the number of subtasks (numBuckets <
numSubtasks), then each subtask caches buckets by polling from buckets in a round-robin
manner:
e.g., numBuckets = 2, numSubtasks = 5
Subtask 0 -> Bucket 0
Subtask 1 -> Bucket 1
Subtask 2 -> Bucket 0
Subtask 3 -> Bucket 1
Subtask 4 -> Bucket 0
getRequiredCacheBucketIds
in interface ShuffleStrategy
subtaskId
- The ID of the subtask.numSubtasks
- The total number of subtasks.Copyright © 2023–2025 The Apache Software Foundation. All rights reserved.