Class CheckpointFailureManager

java.lang.Object
org.apache.flink.runtime.checkpoint.CheckpointFailureManager

public class CheckpointFailureManager extends Object
The checkpoint failure manager, which centralizes the logic for processing checkpoint failures.
  • Field Details

    • UNLIMITED_TOLERABLE_FAILURE_NUMBER

      public static final int UNLIMITED_TOLERABLE_FAILURE_NUMBER
    • EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE

      public static final String EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE
  • Constructor Details

  • Method Details

    • handleCheckpointException

      public void handleCheckpointException(@Nullable PendingCheckpoint pendingCheckpoint, CheckpointProperties checkpointProperties, CheckpointException exception, @Nullable ExecutionAttemptID executionAttemptID, org.apache.flink.api.common.JobID job, @Nullable PendingCheckpointStats pendingCheckpointStats, CheckpointStatsTracker statsTracker)
      Failures on the JobManager (JM):
      • all checkpoints - count against the failure counter.
      • any savepoints - do nothing; a savepoint is a manual action, and a failover would not help anyway.

      Failures on a TaskManager (TM):

      • all checkpoints - count against the failure counter (a failover might help, and we want to notify users).
      • sync savepoints - we must always fail; otherwise we risk a deadlock in which job cancellation waits for a savepoint that never finishes.
      • non-sync savepoints - count against the failure counter (a failover might help solve the problem).
      Parameters:
      pendingCheckpoint - the failed checkpoint if it was initialized already.
      checkpointProperties - the checkpoint properties, used to determine which handling strategy applies.
      exception - the checkpoint exception.
      executionAttemptID - the execution attempt ID, used as a safeguard.
      job - the JobID.
      pendingCheckpointStats - the pending checkpoint statistics.
      statsTracker - the tracker for checkpoint statistics.
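
      The failure-handling rules listed for handleCheckpointException can be read as a small decision table. The following is a minimal sketch of that table only, not Flink's implementation: the class FailurePolicySketch, its enums, and the decide method are hypothetical names introduced here for illustration.

```java
// Illustrative sketch only: every name below is hypothetical, not part of Flink's API.
final class FailurePolicySketch {

    /** Where the failure was raised. */
    enum Origin { JOB_MANAGER, TASK_MANAGER }

    /** What kind of snapshot failed. */
    enum SnapshotKind { CHECKPOINT, SYNC_SAVEPOINT, NON_SYNC_SAVEPOINT }

    /** What the rules above say should happen. */
    enum Action { COUNT_FAILURE, IGNORE, ALWAYS_FAIL }

    static Action decide(Origin origin, SnapshotKind kind) {
        // Checkpoints count against the failure counter on both JM and TM.
        if (kind == SnapshotKind.CHECKPOINT) {
            return Action.COUNT_FAILURE;
        }
        // Any savepoint failing on the JM is ignored: it was a manual action.
        if (origin == Origin.JOB_MANAGER) {
            return Action.IGNORE;
        }
        // On the TM, sync savepoints must always fail to avoid the deadlock;
        // non-sync savepoints count against the failure counter.
        return kind == SnapshotKind.SYNC_SAVEPOINT
                ? Action.ALWAYS_FAIL
                : Action.COUNT_FAILURE;
    }
}
```

      For example, decide(Origin.TASK_MANAGER, SnapshotKind.SYNC_SAVEPOINT) yields ALWAYS_FAIL, while the same savepoint failing on the JobManager yields IGNORE.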
    • checkFailureCounter

      public void checkFailureCounter(CheckpointException exception, long checkpointId)
    • handleCheckpointSuccess

      public void handleCheckpointSuccess(long checkpointId)
      Handles a successful checkpoint.
      Parameters:
      checkpointId - the ID of the successful checkpoint, used to track the number of continuous failures based on the checkpoint ID sequence.
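
      To illustrate how a continuous-failure count keyed on checkpoint IDs might behave, here is a minimal sketch under stated assumptions. It is not Flink's implementation: the class ContinuousFailureCounterSketch and its methods are hypothetical, and using Integer.MAX_VALUE as the "unlimited" sentinel is an assumption made for the example. Each checkpoint ID is counted at most once, and a success newer than the last recorded success resets the count.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: hypothetical names, not Flink's actual classes or logic.
final class ContinuousFailureCounterSketch {

    // Assumed sentinel meaning "tolerate any number of failures".
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final int tolerableFailures;
    private final Set<Long> countedIds = new HashSet<>();
    private int continuousFailures = 0;
    private long lastSucceededId = Long.MIN_VALUE;

    ContinuousFailureCounterSketch(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failure; returns true once the tolerable number is exceeded. */
    boolean onFailure(long checkpointId) {
        // Set.add returns false for an ID seen before, so each ID counts once.
        if (countedIds.add(checkpointId)) {
            continuousFailures++;
        }
        return tolerableFailures != UNLIMITED && continuousFailures > tolerableFailures;
    }

    /** Records a success; a success newer than the last one resets the count. */
    void onSuccess(long checkpointId) {
        if (checkpointId > lastSucceededId) {
            lastSucceededId = checkpointId;
            continuousFailures = 0;
            countedIds.clear();
        }
    }

    int continuousFailures() {
        return continuousFailures;
    }
}
```

      With a tolerance of 1, a failure of checkpoint 1 is tolerated, a repeated report for checkpoint 1 is not counted again, and a failure of checkpoint 2 exceeds the limit. A success for checkpoint 3 then resets the count to zero.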