Class CollectSinkFunction<IN>

java.lang.Object
org.apache.flink.api.common.functions.AbstractRichFunction
org.apache.flink.streaming.api.functions.sink.legacy.RichSinkFunction<IN>
org.apache.flink.streaming.api.operators.collect.CollectSinkFunction<IN>
Type Parameters:
IN - type of results to be written into the sink.
All Implemented Interfaces:
Serializable, org.apache.flink.api.common.functions.Function, org.apache.flink.api.common.functions.RichFunction, org.apache.flink.api.common.state.CheckpointListener, CheckpointedFunction, SinkFunction<IN>

@Internal public class CollectSinkFunction<IN> extends RichSinkFunction<IN> implements CheckpointedFunction, org.apache.flink.api.common.state.CheckpointListener
A sink function that collects query results and sends them back to the client.

This sink limits the number of results it buffers (the limit is configurable). When the buffer is full, it back-pressures the job until the client consumes some results.
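The back-pressure idea can be illustrated with a minimal sketch. This is not the Flink implementation: the class BoundedResultBuffer and its methods put, offer, release and bufferedBytes are hypothetical names invented for this example, and the byte-based limit is an assumption suggested by the maxBytesPerBatch constructor parameter. A writer that calls put is blocked while the buffer is full and resumes once the reader releases consumed results.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical byte-bounded buffer; an illustration only, not Flink code. */
final class BoundedResultBuffer {
    private final long maxBytes; // assumed capacity limit in bytes
    private final Deque<byte[]> buffer = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private long currentBytes = 0;

    BoundedResultBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // An empty buffer always accepts one record, so an oversized record cannot block forever.
    private boolean fits(byte[] record) {
        return buffer.isEmpty() || currentBytes + record.length <= maxBytes;
    }

    private void append(byte[] record) {
        buffer.addLast(record);
        currentBytes += record.length;
    }

    /** Blocks the writer while the buffer is full; this is the back-pressure. */
    void put(byte[] record) throws InterruptedException {
        lock.lock();
        try {
            while (!fits(record)) {
                notFull.await();
            }
            append(record);
        } finally {
            lock.unlock();
        }
    }

    /** Non-blocking variant: returns false where put would have blocked. */
    boolean offer(byte[] record) {
        lock.lock();
        try {
            if (!fits(record)) {
                return false;
            }
            append(record);
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Drops up to count consumed records from the head and wakes blocked writers. */
    int release(int count) {
        lock.lock();
        try {
            int released = 0;
            while (released < count && !buffer.isEmpty()) {
                currentBytes -= buffer.removeFirst().length;
                released++;
            }
            notFull.signalAll();
            return released;
        } finally {
            lock.unlock();
        }
    }

    long bufferedBytes() {
        lock.lock();
        try {
            return currentBytes;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch the record-writing thread would call put, while the thread that serves client requests would call release after the client acknowledges results.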

NOTE: When using this sink, make sure that its parallelism is 1, and make sure that it is used in a StreamTask.

Communication Protocol Explanation

We maintain the following variables in this communication protocol:

  1. version: This variable is set to a random value when the sink opens. The client detects that the sink has restarted when this value differs from the one it holds.
  2. offset: This indicates that the client has successfully received all results before this offset. The sink can safely throw these results away.
  3. lastCheckpointedOffset: This is the value of offset at the time of the checkpoint. It is restored from the checkpoint and assigned back to offset when the sink restarts. Clients that need exactly-once semantics rely on this value as the position to revert to when a failover happens.

The client puts version and offset into each request, indicating which version it believes is current and how many results it has received so far.

The sink checks the validity of the request. If the version does not match or the offset is smaller than expected, the sink sends back the current version and lastCheckpointedOffset with an empty result list.

If the request is valid, the sink prepares some results starting from offset and sends them back to the client together with lastCheckpointedOffset. If there are currently no results starting from offset, the sink does not wait; it sends back an empty result list instead.
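The sink-side handling described above can be sketched as follows. This is a simplified illustration, not the actual Flink code: the class SinkProtocolSketch, its nested Response type, and the methods add, checkpoint and handle are hypothetical, results are plain strings rather than serialized bytes, and the batch limit counts records rather than bytes. Treating an offset beyond the buffered range as invalid is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sink-side request handling; an illustration only, not Flink code. */
final class SinkProtocolSketch {

    /** Hypothetical response: current version, lastCheckpointedOffset and a result batch. */
    static final class Response {
        final String version;
        final long lastCheckpointedOffset;
        final List<String> results;

        Response(String version, long lastCheckpointedOffset, List<String> results) {
            this.version = version;
            this.lastCheckpointedOffset = lastCheckpointedOffset;
            this.results = results;
        }
    }

    private final String version;               // random per sink start in the real protocol
    private final int maxBatch;                 // assumed batch limit, counted in records
    private final List<String> buffer = new ArrayList<>();
    private long offset = 0;                    // offset of the first buffered result
    private long lastCheckpointedOffset = 0;

    SinkProtocolSketch(String version, int maxBatch) {
        this.version = version;
        this.maxBatch = maxBatch;
    }

    void add(String result) {
        buffer.add(result);
    }

    /** Records the current offset as the checkpointed one. */
    void checkpoint() {
        lastCheckpointedOffset = offset;
    }

    Response handle(String requestVersion, long requestOffset) {
        boolean invalid = !version.equals(requestVersion)
                || requestOffset < offset
                || requestOffset > offset + buffer.size(); // assumption of this sketch
        if (invalid) {
            // Invalid request: report the current state with an empty result list.
            return new Response(version, lastCheckpointedOffset, Collections.emptyList());
        }
        // Everything before requestOffset has been received by the client; drop it.
        int acked = (int) (requestOffset - offset);
        buffer.subList(0, acked).clear();
        offset = requestOffset;
        // Send what is available right now, possibly nothing, without waiting.
        int n = Math.min(maxBatch, buffer.size());
        return new Response(version, lastCheckpointedOffset, new ArrayList<>(buffer.subList(0, n)));
    }
}
```

A client that does not yet know the version can send any placeholder value; the invalid-request response then tells it the current version and lastCheckpointedOffset to use in its next request.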

For a client that wants exactly-once semantics, the client checks the following conditions when it receives a response:

  1. If the version does not match, the client knows that the sink has restarted. It throws away all uncheckpointed results after lastCheckpointedOffset.
  2. If lastCheckpointedOffset has increased, the client knows that a checkpoint has happened. It can now move all results before this offset to a user-visible buffer.
  3. If the response also contains new results, the client moves these new results into the uncheckpointed buffer.

Because

  1. the user can only see results before a lastCheckpointedOffset, and
  2. the client goes back to the latest lastCheckpointedOffset when the sink restarts,

the client never throws away results in the user-visible buffer. This communication protocol therefore achieves exactly-once semantics.
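The client-side rules for exactly-once semantics can be sketched as follows. This is a hedged illustration rather than the real Flink client: the class ExactlyOnceClientSketch and its methods onResponse, nextOffset and userVisible are hypothetical, and results are modelled as strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical exactly-once client buffer; an illustration only, not Flink code. */
final class ExactlyOnceClientSketch {
    private String version = "";          // placeholder until the sink reports its version
    private long offset = 0;              // next offset to request from the sink
    private long visibleOffset = 0;       // number of results already visible to the user
    private final Deque<String> uncheckpointed = new ArrayDeque<>();
    private final List<String> userVisible = new ArrayList<>();

    void onResponse(String responseVersion, long lastCheckpointedOffset, List<String> results) {
        // 1. Version mismatch: the sink restarted, so drop results after lastCheckpointedOffset.
        if (!version.equals(responseVersion)) {
            version = responseVersion;
            while (offset > lastCheckpointedOffset && !uncheckpointed.isEmpty()) {
                uncheckpointed.removeLast();
                offset--;
            }
        }
        // 2. Results before lastCheckpointedOffset are covered by a checkpoint: expose them.
        while (visibleOffset < lastCheckpointedOffset && !uncheckpointed.isEmpty()) {
            userVisible.add(uncheckpointed.removeFirst());
            visibleOffset++;
        }
        // 3. New results stay in the uncheckpointed buffer until a later checkpoint covers them.
        uncheckpointed.addAll(results);
        offset += results.size();
    }

    long nextOffset() {
        return offset;
    }

    List<String> userVisible() {
        return Collections.unmodifiableList(userVisible);
    }
}
```

In this sketch, results that were fetched but not yet covered by a checkpoint are dropped on a restart and fetched again, while results already in the user-visible list are never removed.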

To avoid blocking job completion or cancellation, any results still in the sink's buffer when the job terminates are sent back to the client through accumulators.

  • Constructor Details

    • CollectSinkFunction

      public CollectSinkFunction(org.apache.flink.api.common.typeutils.TypeSerializer<IN> serializer, long maxBytesPerBatch, String accumulatorName)
  • Method Details

    • getMaxBytesPerBatch

      public long getMaxBytesPerBatch()
    • initializeState

      public void initializeState(FunctionInitializationContext context) throws Exception
      Description copied from interface: CheckpointedFunction
      This method is called when the parallel function instance is created during distributed execution. Functions typically set up their state storing data structures in this method.
      Specified by:
      initializeState in interface CheckpointedFunction
      Parameters:
      context - the context for initializing the operator
      Throws:
      Exception - Thrown if state could not be created or restored.
    • snapshotState

      public void snapshotState(FunctionSnapshotContext context) throws Exception
      Description copied from interface: CheckpointedFunction
      This method is called when a snapshot for a checkpoint is requested. This acts as a hook to the function to ensure that all state is exposed by means previously offered through FunctionInitializationContext when the Function was initialized, or offered now by FunctionSnapshotContext itself.
      Specified by:
      snapshotState in interface CheckpointedFunction
      Parameters:
      context - the context for drawing a snapshot of the operator
      Throws:
      Exception - Thrown if state could not be created or restored.
    • open

      public void open(org.apache.flink.api.common.functions.OpenContext openContext) throws Exception
      Specified by:
      open in interface org.apache.flink.api.common.functions.RichFunction
      Overrides:
      open in class org.apache.flink.api.common.functions.AbstractRichFunction
      Throws:
      Exception
    • invoke

      public void invoke(IN value, SinkFunction.Context context) throws Exception
      Description copied from interface: SinkFunction
      Writes the given value to the sink. This function is called for every record.

      You have to override this method when implementing a SinkFunction. It is a default method only for backward compatibility with the old-style method.

      Specified by:
      invoke in interface SinkFunction<IN>
      Parameters:
      value - The input record.
      context - Additional context about the input record.
      Throws:
      Exception - This method may throw exceptions. Throwing an exception will cause the operation to fail and may trigger recovery.
    • close

      public void close() throws Exception
      Specified by:
      close in interface org.apache.flink.api.common.functions.RichFunction
      Overrides:
      close in class org.apache.flink.api.common.functions.AbstractRichFunction
      Throws:
      Exception
    • accumulateFinalResults

      public void accumulateFinalResults() throws Exception
      Throws:
      Exception
    • notifyCheckpointComplete

      public void notifyCheckpointComplete(long checkpointId)
      Specified by:
      notifyCheckpointComplete in interface org.apache.flink.api.common.state.CheckpointListener
    • notifyCheckpointAborted

      public void notifyCheckpointAborted(long checkpointId)
      Specified by:
      notifyCheckpointAborted in interface org.apache.flink.api.common.state.CheckpointListener
    • setOperatorEventGateway

      public void setOperatorEventGateway(OperatorEventGateway eventGateway)
    • serializeAccumulatorResult

      @VisibleForTesting public static byte[] serializeAccumulatorResult(long offset, String version, long lastCheckpointedOffset, List<byte[]> buffer) throws IOException
      Throws:
      IOException
    • deserializeAccumulatorResult

      public static org.apache.flink.api.java.tuple.Tuple2<Long,CollectCoordinationResponse> deserializeAccumulatorResult(byte[] serializedAccResults) throws IOException
      Throws:
      IOException