Class VertexParallelismAndInputInfosDeciderUtils

java.lang.Object
org.apache.flink.runtime.scheduler.adaptivebatch.util.VertexParallelismAndInputInfosDeciderUtils

public class VertexParallelismAndInputInfosDeciderUtils extends Object
Utility class for VertexParallelismAndInputInfosDecider.
  • Constructor Details

    • VertexParallelismAndInputInfosDeciderUtils

      public VertexParallelismAndInputInfosDeciderUtils()
  • Method Details

    • adjustToClosestLegalParallelism

      public static Optional<List<IndexRange>> adjustToClosestLegalParallelism(long currentDataVolumeLimit, int currentParallelism, int minParallelism, int maxParallelism, long minLimit, long maxLimit, Function<Long,Integer> parallelismComputer, Function<Long,List<IndexRange>> subpartitionRangesComputer)
      Adjusts the parallelism to the closest legal parallelism and returns the computed subpartition ranges.
      Parameters:
      currentDataVolumeLimit - current data volume limit
      currentParallelism - current parallelism
      minParallelism - the min parallelism
      maxParallelism - the max parallelism
      minLimit - the minimum data volume limit
      maxLimit - the maximum data volume limit
      parallelismComputer - a function to compute the parallelism according to the data volume limit
      subpartitionRangesComputer - a function to compute the subpartition ranges according to the data volume limit
      Returns:
      the computed subpartition ranges, or Optional.empty() if no legal parallelism can be found
    • cartesianProduct

      public static <T> List<List<T>> cartesianProduct(List<List<T>> lists)
      Computes the Cartesian product of a list of lists.

      The Cartesian product is the set of all possible combinations formed by picking one element from each list. For example, given the input lists [[1, 2], [3, 4]], the result is [[1, 3], [1, 4], [2, 3], [2, 4]].

      Note: If the input list is empty or contains an empty list, the result will be an empty list.

      Type Parameters:
      T - the type of elements in the lists
      Parameters:
      lists - a list of lists for which the Cartesian product is to be computed
      Returns:
      a list of lists representing the Cartesian product, where each inner list is a combination
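
      The behaviour described above can be sketched as follows. This is a minimal illustration rather than the Flink implementation: the class name CartesianProductSketch is hypothetical, and the iterative prefix-extension approach is an assumption chosen because it reproduces the ordering of the documented example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in class; not part of Flink.
class CartesianProductSketch {

    static <T> List<List<T>> cartesianProduct(List<List<T>> lists) {
        List<List<T>> result = new ArrayList<>();
        // Documented rule: an empty input yields an empty result.
        if (lists.isEmpty()) {
            return result;
        }
        // Start from one empty combination and extend it list by list.
        result.add(new ArrayList<>());
        for (List<T> list : lists) {
            List<List<T>> next = new ArrayList<>();
            for (List<T> prefix : result) {
                for (T item : list) {
                    List<T> combination = new ArrayList<>(prefix);
                    combination.add(item);
                    next.add(combination);
                }
            }
            // If 'list' is empty, 'next' stays empty and so does the final result.
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [[1, 3], [1, 4], [2, 3], [2, 4]]
        System.out.println(cartesianProduct(List.of(List.of(1, 2), List.of(3, 4))));
    }
}
```

      Note that the result size is the product of the input list sizes, so it grows quickly as more lists are added.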
    • median

      public static long median(long[] nums)
      Calculates the median of a given array of long integers. If the calculated median is less than 1, it returns 1 instead.
      Parameters:
      nums - an array of long integers for which to calculate the median.
      Returns:
      the median value, which will be at least 1.
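
      A minimal sketch of the clamped median described above. The class name MedianSketch is hypothetical, and two details are assumptions that the description does not state: the input is non-empty, and for an even number of elements the median is taken as the integer mean of the two middle values.

```java
import java.util.Arrays;

// Hypothetical stand-in class; not part of Flink.
class MedianSketch {

    static long median(long[] nums) {
        // Sort a copy so the caller's array is left untouched.
        long[] sorted = nums.clone();
        Arrays.sort(sorted);
        int len = sorted.length;
        // Assumption: even-length input uses the integer mean of the two middle values.
        long mid = (len % 2 == 0)
                ? (sorted[len / 2 - 1] + sorted[len / 2]) / 2
                : sorted[len / 2];
        // Documented rule: never return less than 1.
        return Math.max(mid, 1L);
    }

    public static void main(String[] args) {
        System.out.println(median(new long[] {5, 1, 3}));  // prints 3
        System.out.println(median(new long[] {0, 0, 0}));  // prints 1 (clamped)
    }
}
```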
    • computeSkewThreshold

      public static long computeSkewThreshold(long medianSize, double skewedFactor, long defaultSkewedThreshold)
      Computes the skew threshold based on the given median size and skewed factor.

      The skew threshold is calculated as the product of the median size and the skewed factor. To ensure that the computed threshold does not fall below a specified default value, the method uses Math.max(long, long) to return the larger of the calculated threshold and the default threshold.

      Parameters:
      medianSize - the median size
      skewedFactor - a factor indicating the degree of skewness
      defaultSkewedThreshold - the default threshold to be used if the calculated threshold is less than this value
      Returns:
      the computed skew threshold, which is guaranteed to be at least the default skewed threshold.
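
      The calculation above amounts to a product followed by a lower bound. In the sketch below the class name SkewThresholdSketch is hypothetical, and truncating the double product with a cast to long is an assumption; the description only says that Math.max(long, long) is applied.

```java
// Hypothetical stand-in class; not part of Flink.
class SkewThresholdSketch {

    static long computeSkewThreshold(
            long medianSize, double skewedFactor, long defaultSkewedThreshold) {
        // Assumption: the double product is truncated to a long before comparison.
        long computed = (long) (medianSize * skewedFactor);
        // Documented rule: the result is at least the default threshold.
        return Math.max(computed, defaultSkewedThreshold);
    }

    public static void main(String[] args) {
        System.out.println(computeSkewThreshold(100L, 4.0, 256L)); // prints 400
        System.out.println(computeSkewThreshold(10L, 4.0, 256L));  // prints 256
    }
}
```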
    • computeTargetSize

      public static long computeTargetSize(long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask)
      Computes the target data size for each task based on the sizes of non-skewed subpartitions.

      The target size is the average size of the non-skewed subpartitions, raised to the specified data volume per task if the average is smaller.

      Parameters:
      subpartitionBytes - an array representing the data size of each subpartition
      skewedThreshold - skewed threshold in bytes
      dataVolumePerTask - the amount of data that should be allocated per task
      Returns:
      the computed target size for each task, which is the maximum of the average size of non-skewed subpartitions and the data volume per task.
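
      A minimal sketch of this computation. The class name TargetSizeSketch is hypothetical, and two points are assumptions not stated in the description: a subpartition counts as non-skewed when its size is less than or equal to the skewed threshold, and the method falls back to the data volume per task when every subpartition is skewed.

```java
import java.util.stream.LongStream;

// Hypothetical stand-in class; not part of Flink.
class TargetSizeSketch {

    static long computeTargetSize(
            long[] subpartitionBytes, long skewedThreshold, long dataVolumePerTask) {
        // Assumption: "non-skewed" means size <= skewedThreshold.
        long[] nonSkewed =
                LongStream.of(subpartitionBytes).filter(b -> b <= skewedThreshold).toArray();
        // Assumption: with no non-skewed subpartitions, use the per-task volume.
        if (nonSkewed.length == 0) {
            return dataVolumePerTask;
        }
        long average = LongStream.of(nonSkewed).sum() / nonSkewed.length;
        // Documented rule: the result is at least the data volume per task.
        return Math.max(average, dataVolumePerTask);
    }

    public static void main(String[] args) {
        long[] sizes = {10, 20, 30, 1000};
        // 1000 exceeds the threshold of 100, so the average is taken over 10, 20, 30.
        System.out.println(computeTargetSize(sizes, 100L, 5L));  // prints 20
        System.out.println(computeTargetSize(sizes, 100L, 50L)); // prints 50
    }
}
```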
    • getNonBroadcastInputInfos

      public static List<BlockingInputInfo> getNonBroadcastInputInfos(List<BlockingInputInfo> consumedResults)
    • hasSameNumPartitions

      public static boolean hasSameNumPartitions(List<BlockingInputInfo> inputInfos)
    • getMaxNumPartitions

      public static int getMaxNumPartitions(List<BlockingInputInfo> consumedResults)
    • checkAndGetSubpartitionNum

      public static int checkAndGetSubpartitionNum(List<BlockingInputInfo> consumedResults)
    • checkAndGetSubpartitionNumForAggregatedInputs

      public static int checkAndGetSubpartitionNumForAggregatedInputs(Collection<AggregatedBlockingInputInfo> inputInfos)
    • isLegalParallelism

      public static boolean isLegalParallelism(int parallelism, int minParallelism, int maxParallelism)
    • checkAndGetIntraCorrelation

      public static boolean checkAndGetIntraCorrelation(List<BlockingInputInfo> inputInfos)
    • checkAndGetParallelism

      public static int checkAndGetParallelism(Collection<JobVertexInputInfo> vertexInputInfos)
    • tryComputeSubpartitionSliceRange

      public static Optional<List<IndexRange>> tryComputeSubpartitionSliceRange(int minParallelism, int maxParallelism, long maxDataVolumePerTask, Map<Integer,List<SubpartitionSlice>> subpartitionSlices)
      Attempts to compute the subpartition slice ranges to ensure even distribution of data across downstream tasks.

      This method first tries to compute the subpartition slice ranges by evenly distributing the data volume. If that fails, it attempts to compute the ranges by evenly distributing the number of subpartition slices.

      Parameters:
      minParallelism - The minimum parallelism.
      maxParallelism - The maximum parallelism.
      maxDataVolumePerTask - The maximum data volume per task.
      subpartitionSlices - A map of lists of subpartition slices grouped by type or index number.
      Returns:
      An Optional containing a list of index ranges representing the subpartition slice ranges. Returns an empty Optional if no suitable ranges can be computed.
    • createJobVertexInputInfos

      public static Map<IntermediateDataSetID,JobVertexInputInfo> createJobVertexInputInfos(List<BlockingInputInfo> inputInfos, Map<Integer,List<SubpartitionSlice>> subpartitionSlices, List<IndexRange> subpartitionSliceRanges, Function<Integer,Integer> subpartitionSliceKeyResolver)
    • createdJobVertexInputInfoForBroadcast

      public static JobVertexInputInfo createdJobVertexInputInfoForBroadcast(BlockingInputInfo inputInfo, int parallelism)
    • createdJobVertexInputInfoForNonBroadcast

      public static JobVertexInputInfo createdJobVertexInputInfoForNonBroadcast(BlockingInputInfo inputInfo, List<IndexRange> subpartitionSliceRanges, List<SubpartitionSlice> subpartitionSlices)
    • calculateDataVolumePerTaskForInputsGroup

      public static long calculateDataVolumePerTaskForInputsGroup(long globalDataVolumePerTask, List<BlockingInputInfo> inputsGroup, List<BlockingInputInfo> allInputs)
    • calculateDataVolumePerTaskForInput

      public static long calculateDataVolumePerTaskForInput(long globalDataVolumePerTask, long inputsGroupBytes, long totalDataBytes)
    • logBalancedDataDistributionOptimizationResult

      public static void logBalancedDataDistributionOptimizationResult(org.slf4j.Logger logger, JobVertexID jobVertexId, BlockingInputInfo inputInfo, JobVertexInputInfo optimizedJobVertexInputInfo)
      Logs the data distribution optimization info when the balanced data distribution algorithm yields an effective improvement over the num-based data distribution algorithm.
      Parameters:
      logger - The logger instance used for logging output.
      jobVertexId - The id for the job vertex.
      inputInfo - The original input info.
      optimizedJobVertexInputInfo - The optimized job vertex input info.