Class FileSource<T>
- Type Parameters:
T - The type of the events/records produced by this source.
- All Implemented Interfaces:
Serializable, org.apache.flink.api.connector.source.DynamicParallelismInference, org.apache.flink.api.connector.source.Source<T, FileSourceSplit, PendingSplitsCheckpoint<FileSourceSplit>>, org.apache.flink.api.connector.source.SourceReaderFactory<T, FileSourceSplit>, org.apache.flink.api.java.typeutils.ResultTypeQueryable<T>
This source supports all (distributed) file systems and object stores that can be accessed via Flink's FileSystem class.
Start building a file source via one of the static entry points, forRecordStreamFormat(StreamFormat, Path...) or forBulkFileFormat(BulkFormat, Path...). Either call creates a FileSource.FileSourceBuilder on which you can configure all the properties of the file source.
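For example, a sketch of the stream-format entry point (TextLineInputFormat is a simple StreamFormat shipped with the file connector; the path is illustrative):

```java
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Entry point: creates a FileSource.FileSourceBuilder, then builds the source.
// By default the resulting source is in bounded/batch mode.
FileSource<String> source =
        FileSource.forRecordStreamFormat(
                        new TextLineInputFormat(),     // reads files line by line
                        new Path("/path/to/input"))    // illustrative path
                .build();
```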
Batch and Streaming
This source supports both bounded/batch and continuous/streaming data inputs. For the bounded/batch case, the file source processes all files under the given path(s). In the continuous/streaming case, the source periodically checks the paths for new files and will start reading those.
When you start creating a file source (via the FileSource.FileSourceBuilder created
through one of the above-mentioned methods) the source is by default in bounded/batch mode. Call
AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration) to put the source into
continuous streaming mode.
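As a sketch, assuming line-oriented input via TextLineInputFormat and an illustrative path, switching into continuous mode looks like:

```java
import java.time.Duration;

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Without monitorContinuously(...) the source is bounded and reads only the
// files present at job start; with it, the source re-scans the path for new
// files at the given interval (here: every 10 seconds).
FileSource<String> source =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/path/to/dir"))
                .monitorContinuously(Duration.ofSeconds(10))
                .build();
```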
Format Types
The reading of each file happens through file readers defined by file formats, which define the parsing logic for the contents of the file. The source supports several format interfaces, which trade off simplicity of implementation against flexibility and efficiency.
- A StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement and provides many features out-of-the-box (like checkpointing logic), but is limited in the optimizations it can apply (such as object reuse, batching, etc.).
- A BulkFormat reads batches of records from a file at a time. It is the most "low level" format to implement, but offers the greatest flexibility to optimize the implementation.
Discovering / Enumerating Files
The way that the source lists the files to be processed is defined by the FileEnumerator. The FileEnumerator is responsible for selecting the relevant files (for example, filtering out hidden files) and for optionally splitting files into multiple regions (= file source splits) that can be read in parallel.
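The enumerator can be swapped on the builder; the following sketch assumes the setFileEnumerator(FileEnumerator.Provider) setter inherited from AbstractFileSource.AbstractFileSourceBuilder and the connector's NonSplittingRecursiveEnumerator, with an illustrative path:

```java
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Use an enumerator that recursively lists files but never splits them,
// so each file becomes exactly one file source split.
FileSource<String> source =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/path/to/dir"))
                .setFileEnumerator(NonSplittingRecursiveEnumerator::new)
                .build();
```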
Nested Class Summary
Nested Classes
static final class FileSource.FileSourceBuilder<T>: The builder for the FileSource, to configure the various behaviors.
Nested classes/interfaces inherited from class org.apache.flink.connector.file.src.AbstractFileSource:
AbstractFileSource.AbstractFileSourceBuilder<T, SplitT extends FileSourceSplit, SELF extends AbstractFileSource.AbstractFileSourceBuilder<T, SplitT, SELF>>
Nested classes/interfaces inherited from interface org.apache.flink.api.connector.source.DynamicParallelismInference:
org.apache.flink.api.connector.source.DynamicParallelismInference.Context
Field Summary
Fields
static final FileEnumerator.Provider DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR: The default file enumerator used for non-splittable formats.
static final FileSplitAssigner.Provider DEFAULT_SPLIT_ASSIGNER: The default split assigner, a lazy locality-aware assigner.
static final FileEnumerator.Provider DEFAULT_SPLITTABLE_FILE_ENUMERATOR: The default file enumerator used for splittable formats.
Method Summary
static <T> FileSource.FileSourceBuilder<T> forBulkFileFormat(BulkFormat<T, FileSourceSplit> bulkFormat, org.apache.flink.core.fs.Path... paths): Builds a new FileSource using a BulkFormat to read batches of records from files.
static <T> FileSource.FileSourceBuilder<T> forRecordStreamFormat(StreamFormat<T> streamFormat, org.apache.flink.core.fs.Path... paths): Builds a new FileSource using a StreamFormat to read record-by-record from a file stream.
org.apache.flink.core.io.SimpleVersionedSerializer<FileSourceSplit> getSplitSerializer()
int inferParallelism(org.apache.flink.api.connector.source.DynamicParallelismInference.Context dynamicParallelismContext)
Methods inherited from class org.apache.flink.connector.file.src.AbstractFileSource:
createEnumerator, createReader, getAssignerFactory, getBoundedness, getContinuousEnumerationSettings, getEnumeratorCheckpointSerializer, getEnumeratorFactory, getProducedType, restoreEnumerator
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.flink.api.connector.source.Source:
declareWatermarks
Field Details
DEFAULT_SPLIT_ASSIGNER
The default split assigner, a lazy locality-aware assigner.

DEFAULT_SPLITTABLE_FILE_ENUMERATOR
The default file enumerator used for splittable formats. The enumerator recursively enumerates files, splits files that consist of multiple distributed storage blocks into multiple splits, and filters out hidden files (files starting with '.' or '_'). Files with suffixes of common compression formats (for example '.gzip', '.bz2', '.xz', '.zip', ...) will not be split.

DEFAULT_NON_SPLITTABLE_FILE_ENUMERATOR
The default file enumerator used for non-splittable formats. The enumerator recursively enumerates files, creates one split per file, and filters out hidden files (files starting with '.' or '_').
Method Details
getSplitSerializer
public org.apache.flink.core.io.SimpleVersionedSerializer<FileSourceSplit> getSplitSerializer()
Specified by: getSplitSerializer in interface org.apache.flink.api.connector.source.Source<T, FileSourceSplit, PendingSplitsCheckpoint<FileSourceSplit>>
Specified by: getSplitSerializer in class AbstractFileSource<T, FileSourceSplit>
inferParallelism
public int inferParallelism(org.apache.flink.api.connector.source.DynamicParallelismInference.Context dynamicParallelismContext)
Specified by: inferParallelism in interface org.apache.flink.api.connector.source.DynamicParallelismInference
forRecordStreamFormat
public static <T> FileSource.FileSourceBuilder<T> forRecordStreamFormat(StreamFormat<T> streamFormat, org.apache.flink.core.fs.Path... paths)
Builds a new FileSource using a StreamFormat to read record-by-record from a file stream. When possible, stream-based formats are generally preferable to file-based formats, because they support better default behavior around I/O batching and progress tracking (checkpoints).
Stream formats also automatically decompress files based on the file extension. This supports files ending in ".deflate" (Deflate), ".xz" (XZ), ".bz2" (BZip2), ".gz", ".gzip" (GZip).
forBulkFileFormat
public static <T> FileSource.FileSourceBuilder<T> forBulkFileFormat(BulkFormat<T, FileSourceSplit> bulkFormat, org.apache.flink.core.fs.Path... paths)
Builds a new FileSource using a BulkFormat to read batches of records from files. Examples of bulk readers are compressed and vectorized formats such as ORC or Parquet.
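A usage sketch: MyParquetBulkFormat below is a hypothetical stand-in for a real BulkFormat implementation (concrete implementations for Parquet and ORC live in separate Flink format modules, not in this class); the path is illustrative.

```java
// Hypothetical: MyParquetBulkFormat stands in for a real BulkFormat<T, FileSourceSplit>.
FileSource<MyRecord> source =
        FileSource.forBulkFileFormat(
                        new MyParquetBulkFormat(),       // reads batches of records per call
                        new Path("/path/to/parquet"))    // illustrative path
                .build();
```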