Class PointwiseVertexInputInfoComputer
Constructor Summary
Constructors
Constructor: PointwiseVertexInputInfoComputer()
Method Summary
Modifier and Type: Map<IntermediateDataSetID,JobVertexInputInfo>
Method: compute(List<BlockingInputInfo> inputInfos, int parallelism, int minParallelism, int maxParallelism, long dataVolumePerTask)
Description: Decides the parallelism and input infos so that, for POINTWISE inputs, data is distributed evenly and different downstream subtasks consume roughly the same amount of data.
Constructor Details
PointwiseVertexInputInfoComputer
public PointwiseVertexInputInfoComputer()
Method Details
compute
public Map<IntermediateDataSetID,JobVertexInputInfo> compute(List<BlockingInputInfo> inputInfos, int parallelism, int minParallelism, int maxParallelism, long dataVolumePerTask)
Decides the parallelism and input infos so that, for POINTWISE inputs, data is distributed evenly and different downstream subtasks consume roughly the same amount of data.
Assume that `inputInfo` has two partitions, each with three subpartitions, whose data bytes are {0->[1,2,1], 1->[2,1,2]}, and that the expected parallelism is 3. The calculation proceeds as follows:
1. Create subpartition slices for input which is composed of several subpartitions. The created slice list and its data bytes are: [1,2,1,2,1,2]
2. Distribute the subpartition slices array into n balanced parts (described by `IndexRange`, named SubpartitionSliceRanges) based on data volume: [0,1],[2,3],[4,5]
3. Reorganize the distributed results into a mapping of partition range to subpartition range: {0 -> [0,1]}, {0->[2,2],1->[0,0]}, {1->[1,2]}.
The final result is the `SubpartitionGroup` that each of the three parallel tasks needs to subscribe to.
Parameters:
inputInfos - The information of consumed blocking results
parallelism - The parallelism of the job vertex
minParallelism - The min parallelism
maxParallelism - The max parallelism
dataVolumePerTask - The proposed data volume per task for this set of inputInfo
Returns:
the parallelism and vertex input infos
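The three-step example in the description of compute can be reproduced with a small standalone sketch. This is not the Flink implementation: the class `PointwiseSketch` and its methods `flatten`, `balance`, `toPartitionRanges`, and `describe` are hypothetical names introduced for illustration. It assumes each slice is a single subpartition, every partition has the same number of subpartitions, and a simple greedy cut against cumulative volume targets is an acceptable stand-in for the real balancing logic. The parameters minParallelism, maxParallelism, and dataVolumePerTask are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names are hypothetical, not Flink API. */
class PointwiseSketch {

    /** Step 1: flatten per-partition subpartition sizes into one slice list. */
    static long[] flatten(long[][] bytes) {
        int n = 0;
        for (long[] p : bytes) n += p.length;
        long[] out = new long[n];
        int i = 0;
        for (long[] p : bytes) for (long b : p) out[i++] = b;
        return out;
    }

    /** Step 2: cut the slices into `parts` contiguous index ranges [start, end] of similar volume. */
    static List<int[]> balance(long[] slices, int parts) {
        if (parts <= 0 || slices.length < parts) {
            throw new IllegalArgumentException("need at least one slice per part");
        }
        long total = 0;
        for (long s : slices) total += s;
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        long consumed = 0;
        for (int k = 1; k <= parts; k++) {
            long target = total * k / parts; // cumulative volume wanted after part k
            int end = start;
            consumed += slices[end];
            // Grow the range while below target, keeping one slice for each remaining part.
            while (consumed < target && end + 1 <= slices.length - 1 - (parts - k)) {
                end++;
                consumed += slices[end];
            }
            if (k == parts) end = slices.length - 1; // last part takes any trailing empty slices
            ranges.add(new int[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    /** Step 3: turn a slice index range into partition -> [firstSubpartition, lastSubpartition]. */
    static Map<Integer, int[]> toPartitionRanges(int[] range, int subsPerPartition) {
        Map<Integer, int[]> result = new LinkedHashMap<>();
        for (int i = range[0]; i <= range[1]; i++) {
            int partition = i / subsPerPartition;
            int sub = i % subsPerPartition;
            int[] r = result.get(partition);
            if (r == null) {
                result.put(partition, new int[] {sub, sub});
            } else {
                r[1] = sub; // indices arrive in order, so only the upper bound moves
            }
        }
        return result;
    }

    /** Formats a mapping in the same notation as the description, e.g. {0->[2,2],1->[0,0]}. */
    static String describe(Map<Integer, int[]> m) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<Integer, int[]> e : m.entrySet()) {
            if (sb.length() > 1) sb.append(",");
            sb.append(e.getKey()).append("->[")
              .append(e.getValue()[0]).append(",").append(e.getValue()[1]).append("]");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        long[] slices = flatten(new long[][] {{1, 2, 1}, {2, 1, 2}});
        for (int[] range : balance(slices, 3)) {
            System.out.println(describe(toPartitionRanges(range, 3)));
        }
    }
}
```

Run on the example input, the sketch prints {0->[0,1]}, {0->[2,2],1->[0,0]}, and {1->[1,2]}, one per line, matching step 3 of the description. The cumulative target keeps rounding error from piling up in the last range, which is one plausible way to get "roughly the same amount of data" per subtask.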