class MultiFileParquetPartitionReader extends MultiFileCoalescingPartitionReaderBase with ParquetPartitionReaderBase
A PartitionReader that can read multiple Parquet files up to a certain size. It coalesces small files together and copies the block data in a separate thread pool to speed up processing of the small files before sending them down to the GPU.
Efficiently reading a Parquet split on the GPU requires reconstructing, in memory, a Parquet file that contains just the column chunks that are needed. This avoids sending unnecessary data to the GPU and saves GPU memory.
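As a rough illustration of that reconstruction, here is a minimal sketch of the size accounting for the fabricated in-memory file; all names are illustrative, not the plugin's internals:

```scala
// Hypothetical sketch of the fabricated file's size accounting.
object CoalescedParquetLayout {
  final case class ClippedBlock(sourcePath: String, compressedSize: Long)

  // A Parquet file starts with the 4-byte magic "PAR1"; the footer is
  // rewritten to describe the new block offsets, so only its size is
  // estimated up front.
  def estimatedSize(blocks: Seq[ClippedBlock], estimatedFooterSize: Long): Long =
    4L + blocks.map(_.compressedSize).sum + estimatedFooterSize

  def main(args: Array[String]): Unit = {
    val blocks = Seq(
      ClippedBlock("part-00000.parquet", 1L << 20),
      ClippedBlock("part-00001.parquet", 2L << 20))
    println(estimatedSize(blocks, estimatedFooterSize = 64L * 1024)) // 3211268
  }
}
```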
Linear Supertypes
- MultiFileParquetPartitionReader
- ParquetPartitionReaderBase
- MultiFileCoalescingPartitionReaderBase
- MultiFileReaderFunctions
- FilePartitionReaderBase
- Arm
- ScanWithMetrics
- Logging
- PartitionReader
- Closeable
- AutoCloseable
- AnyRef
- Any
Instance Constructors
- new MultiFileParquetPartitionReader(conf: Configuration, splits: Array[PartitionedFile], clippedBlocks: Seq[ParquetSingleDataBlockMeta], isSchemaCaseSensitive: Boolean, readDataSchema: StructType, debugDumpPrefix: String, maxReadBatchSizeRows: Integer, maxReadBatchSizeBytes: Long, execMetrics: Map[String, GpuMetric], partitionSchema: StructType, numThreads: Int, ignoreMissingFiles: Boolean, ignoreCorruptFiles: Boolean, useFieldId: Boolean)
- conf
the Hadoop configuration
- splits
the partitioned files to read
- clippedBlocks
the block metadata from the original Parquet file that has been clipped to only contain the column chunks to be read
- isSchemaCaseSensitive
whether the schema is case-sensitive
- readDataSchema
the Spark schema describing what will be read
- debugDumpPrefix
a path prefix to use for dumping the fabricated Parquet data or null
- maxReadBatchSizeRows
soft limit on the maximum number of rows the reader reads per batch
- maxReadBatchSizeBytes
soft limit on the maximum number of bytes the reader reads per batch
- execMetrics
metrics
- partitionSchema
schema of partitions
- numThreads
the size of the thread pool
- ignoreMissingFiles
whether to ignore missing files
- ignoreCorruptFiles
whether to ignore corrupt files
- useFieldId
whether to use Parquet field IDs (spark.sql.parquet.fieldId.read.enabled) when matching columns
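A hedged construction sketch follows, assuming the caller has already produced the clipped block metadata and the plugin types are in scope; the values below are illustrative defaults, not recommendations:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Minimal wiring sketch; inputs are assumed to come from the usual planning phase.
def makeCoalescingReader(
    hadoopConf: Configuration,
    files: Array[PartitionedFile],
    clipped: Seq[ParquetSingleDataBlockMeta],
    readSchema: StructType,
    partSchema: StructType,
    metrics: Map[String, GpuMetric]): MultiFileParquetPartitionReader =
  new MultiFileParquetPartitionReader(
    conf = hadoopConf,
    splits = files,
    clippedBlocks = clipped,
    isSchemaCaseSensitive = false,
    readDataSchema = readSchema,
    debugDumpPrefix = null,              // no debug dumps
    maxReadBatchSizeRows = Int.MaxValue, // soft per-batch row limit
    maxReadBatchSizeBytes = 2L << 30,    // soft per-batch byte limit (2 GiB)
    execMetrics = metrics,
    partitionSchema = partSchema,
    numThreads = 8,                      // copy thread pool size
    ignoreMissingFiles = false,
    ignoreCorruptFiles = false,
    useFieldId = false)
```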
Type Members
- class ParquetCopyBlocksRunner extends Callable[(Seq[DataBlockBase], Long)]
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##(): Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- implicit def ParquetSingleDataBlockMeta(in: ExtraInfo): ParquetExtraInfo
- def addPartitionValues(batch: Option[ColumnarBatch], inPartitionValues: InternalRow, partitionSchema: StructType): Option[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- var batch: Option[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
- def calculateEstimatedBlocksOutputSize(batchContext: BatchContext): Long
Calculate the output size according to the block chunks and the schema. The estimated output size is used as the initial size when allocating the HostMemoryBuffer.
Note that the estimated size should be at least equal to the size of header + blocks + footer.
- batchContext
the batch building context
- returns
Long, the estimated output size
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
- def calculateFinalBlocksOutputSize(footerOffset: Long, blocks: Seq[DataBlockBase], bContext: BatchContext): Long
Calculate the final block output size, which is used to decide whether to re-allocate the HostMemoryBuffer.
There is no need to re-calculate the block size; just calculate the footer size and add footerOffset.
If the size calculated by this function is bigger than the one calculated by calculateEstimatedBlocksOutputSize, the HostMemoryBuffer will be re-allocated, causing a performance hit.
- footerOffset
footer offset
- blocks
blocks to be evaluated
- returns
the output size
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
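The contract between the two size calculations can be summarized with a small sketch; the names are illustrative, not plugin code:

```scala
// If the final size (footerOffset plus the re-measured footer) exceeds the
// earlier estimate, the HostMemoryBuffer must be re-allocated; that is the
// performance hit the note above warns about.
def needsReallocation(estimatedSize: Long, footerOffset: Long, footerSize: Long): Boolean =
  footerOffset + footerSize > estimatedSize
```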
- def calculateParquetFooterSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- Annotations
- @nowarn()
- def calculateParquetOutputSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType, handleCoalesceFiles: Boolean): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- def checkIfNeedToSplitDataBlock(currentBlockInfo: SingleDataBlockInfo, nextBlockInfo: SingleDataBlockInfo): Boolean
Checks whether the next block should be split into another ColumnarBatch.
- currentBlockInfo
current SingleDataBlockInfo
- nextBlockInfo
next SingleDataBlockInfo
- returns
true if the next block should be split into another ColumnarBatch, false otherwise
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
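The exact criteria are implementation details; a hedged sketch of the shape such a check typically takes (an assumption, not the plugin's actual logic):

```scala
// Hedged sketch: coalescing readers typically start a new batch when the
// next file's clipped schema is not compatible with the current one.
// BlockInfoSketch and the criterion below are assumptions for illustration.
final case class BlockInfoSketch(schemaFingerprint: String)

def checkIfNeedToSplitSketch(current: BlockInfoSketch, next: BlockInfoSketch): Boolean =
  current.schemaFingerprint != next.schemaFingerprint // differing schemas cannot share a batch
```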
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native() @HotSpotIntrinsicCandidate()
- def close(): Unit
- Definition Classes
- FilePartitionReaderBase → Closeable → AutoCloseable
- def closeOnExcept[T <: AutoCloseable, V](r: Option[T])(block: (Option[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs
- Definition Classes
- Arm
- def closeOnExcept[T <: AutoCloseable, V](r: ArrayBuffer[T])(block: (ArrayBuffer[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs
- Definition Classes
- Arm
- def closeOnExcept[T <: AutoCloseable, V](r: Array[T])(block: (Array[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs
- Definition Classes
- Arm
- def closeOnExcept[T <: AutoCloseable, V](r: Seq[T])(block: (Seq[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs
- Definition Classes
- Arm
- def closeOnExcept[T <: AutoCloseable, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block, closing the resource only if an exception occurs
- Definition Classes
- Arm
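A typical usage sketch of the single-resource overload, written as if inside a class mixing in Arm and assuming the cudf HostMemoryBuffer API:

```scala
import ai.rapids.cudf.HostMemoryBuffer

// The buffer is closed only if the body throws; on success the still-open
// buffer is handed to the caller, who takes over ownership.
def allocateAndFill(size: Long): HostMemoryBuffer =
  closeOnExcept(HostMemoryBuffer.allocate(size)) { hmb =>
    // ... populate hmb; any exception here closes it automatically ...
    hmb
  }
```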
- def computeBlockMetaData(blocks: Seq[BlockMetaData], realStartOffset: Long, copyRangesToUpdate: Option[ArrayBuffer[CopyRange]] = None): Seq[BlockMetaData]
Computes new block metadata to reflect where the blocks and columns will appear in the computed Parquet file.
- blocks
block metadata from the original file(s) that will appear in the computed file
- realStartOffset
starting file offset of the first block
- copyRangesToUpdate
optional buffer to update with ranges of column data to copy
- returns
updated block metadata
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- Annotations
- @nowarn()
- val conf: Configuration
- Definition Classes
- MultiFileParquetPartitionReader → ParquetPartitionReaderBase
- def copyBlocksData(in: FSDataInputStream, out: HostMemoryOutputStream, blocks: Seq[BlockMetaData], realStartOffset: Long): Seq[BlockMetaData]
Copies the data corresponding to the clipped blocks in the original file and computes the block metadata for the output. The output blocks will contain the same column chunk metadata but with the file offsets updated to reflect the new position of the column data as written to the output.
- in
the input stream for the original Parquet file
- out
the output stream to receive the data
- blocks
block metadata from the original file that will appear in the computed file
- realStartOffset
starting file offset of the first block
- returns
updated block metadata corresponding to the output
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- val copyBufferSize: Int
- Definition Classes
- ParquetPartitionReaderBase
- def copyDataRange(range: CopyRange, in: FSDataInputStream, out: OutputStream, copyBuffer: Array[Byte]): Unit
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- def createBatchContext(chunkedBlocks: LinkedHashMap[Path, ArrayBuffer[DataBlockBase]], clippedSchema: SchemaBase): BatchContext
Return a batch context which will be shared during the process of building a memory file, i.e. across the following APIs:
- calculateEstimatedBlocksOutputSize
- writeFileHeader
- getBatchRunner
- calculateFinalBlocksOutputSize
- writeFileFooter
It is useful when something is needed by some or all of the above APIs. Children can override this to return a customized batch context.
- chunkedBlocks
mapping of file path to data blocks
- clippedSchema
schema info
- Attributes
- protected
- Definition Classes
- MultiFileCoalescingPartitionReaderBase
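A hedged sketch of the order in which these hooks are driven when building one in-memory file, written as if inside a subclass; simplified, not the real loop in MultiFileCoalescingPartitionReaderBase:

```scala
import java.util.LinkedHashMap
import scala.collection.mutable.ArrayBuffer
import ai.rapids.cudf.HostMemoryBuffer
import org.apache.hadoop.fs.Path

def buildMemoryFileSketch(
    chunkedBlocks: LinkedHashMap[Path, ArrayBuffer[DataBlockBase]],
    clippedSchema: SchemaBase): Unit = {
  val ctx = createBatchContext(chunkedBlocks, clippedSchema) // shared context
  val estimated = calculateEstimatedBlocksOutputSize(ctx)    // initial allocation size
  closeOnExcept(HostMemoryBuffer.allocate(estimated)) { hmb =>
    val headerBytes = writeFileHeader(hmb, ctx)              // e.g. the Parquet magic
    // ... submit one getBatchRunner(...) Callable per file to the thread
    //     pool; each copies its blocks into a slice of hmb past headerBytes
    //     and returns (adjusted block metadata, bytes read) ...
    // val finalSize = calculateFinalBlocksOutputSize(footerOffset, blocks, ctx)
    // val (outBuf, outSize) = writeFileFooter(hmb, finalSize, footerOffset, blocks, ctx)
  }
}
```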
- def currentMetricsValues(): Array[CustomTaskMetric]
- Definition Classes
- PartitionReader
- def dumpDataToFile(hmb: HostMemoryBuffer, dataLength: Long, splits: Array[PartitionedFile], debugDumpPrefix: Option[String] = None, format: Option[String] = None): Unit
Dump the data from a HostMemoryBuffer to a file named debugDumpPrefix + random + format.
- hmb
host data to be dumped
- dataLength
data size
- splits
PartitionedFile to be handled
- debugDumpPrefix
file name prefix; if None, nothing is dumped
- format
file name suffix; if None, nothing is dumped
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val execMetrics: Map[String, GpuMetric]
- Definition Classes
- MultiFileParquetPartitionReader → ParquetPartitionReaderBase
- def fileSystemBytesRead(): Long
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
- Annotations
- @nowarn()
- def freeOnExcept[T <: RapidsBuffer, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block, freeing the RapidsBuffer only if an exception occurs
- Definition Classes
- Arm
- def get(): ColumnarBatch
- Definition Classes
- FilePartitionReaderBase → PartitionReader
- def getBatchRunner(taskContext: TaskContext, file: Path, outhmb: HostMemoryBuffer, blocks: ArrayBuffer[DataBlockBase], offset: Long, batchContext: BatchContext): Callable[(Seq[DataBlockBase], Long)]
The sub-class must implement the real file reading logic in a Callable, which will be run in a thread pool.
- file
file to be read
- outhmb
the sliced HostMemoryBuffer to hold the blocks; the sub-class implementation is in charge of closing it
- blocks
block metadata specifying which blocks to read
- offset
the offset adjustment
- batchContext
the batch building context
- returns
a Callable[(Seq[DataBlockBase], Long)] that will be submitted to a thread pool; the Callable returns a tuple where _1 is the block metadata with offsets adjusted and _2 is the number of bytes read
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
- final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- final def getFileFormatShortName: String
File format short name used for logging and other things that need to uniquely identify which file format is being used.
- returns
the file format short name
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
- def getParquetOptions(clippedSchema: MessageType, useFieldId: Boolean): ParquetOptions
- Definition Classes
- ParquetPartitionReaderBase
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- var isDone: Boolean
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- val isSchemaCaseSensitive: Boolean
- Definition Classes
- MultiFileParquetPartitionReader → ParquetPartitionReaderBase
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- var maxDeviceMemory: Long
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
- val metrics: Map[String, GpuMetric]
- Definition Classes
- ScanWithMetrics
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def next(): Boolean
- Definition Classes
- MultiFileCoalescingPartitionReaderBase → PartitionReader
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @HotSpotIntrinsicCandidate()
- def populateCurrentBlockChunk(blockIter: BufferedIterator[BlockMetaData], maxReadBatchSizeRows: Int, maxReadBatchSizeBytes: Long): Seq[BlockMetaData]
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- def readBufferToTable(dataBuffer: HostMemoryBuffer, dataSize: Long, clippedSchema: SchemaBase, extraInfo: ExtraInfo): Table
Send host memory to the GPU to decode.
- dataBuffer
the data to be decoded on the GPU
- dataSize
data size
- clippedSchema
the clipped schema
- extraInfo
extra information for the specific file format
- returns
Table
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
- val readDataSchema: StructType
- Definition Classes
- MultiFileParquetPartitionReader → ParquetPartitionReaderBase
- def readPartFile(blocks: Seq[BlockMetaData], clippedSchema: MessageType, filePath: Path): (HostMemoryBuffer, Long)
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
- implicit def toBlockMetaData(block: DataBlockBase): BlockMetaData
- implicit def toBlockMetaDataSeq(blocks: Seq[DataBlockBase]): Seq[BlockMetaData]
- def toCudfColumnNames(readDataSchema: StructType, fileSchema: MessageType, isCaseSensitive: Boolean, useFieldId: Boolean): Seq[String]
Take case sensitivity into consideration when getting the data reading column names before sending the parquet-formatted buffer to cudf. Also clips the column names if useFieldId is true.
- readDataSchema
Spark schema to read
- fileSchema
the schema of the dumped parquet-formatted buffer, with unmatched fields already removed
- isCaseSensitive
if it is case sensitive
- useFieldId
whether spark.sql.parquet.fieldId.read.enabled is enabled
- returns
a sequence of column names following the order of readDataSchema
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
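A hedged sketch of the case-sensitivity handling this implies; the real implementation also handles field-ID matching, which is omitted here:

```scala
// Match each Spark read-schema name against the file schema's field names,
// honoring case sensitivity. Names and shape here are assumptions.
def resolveColumnNames(
    readNames: Seq[String],
    fileNames: Seq[String],
    isCaseSensitive: Boolean): Seq[String] =
  readNames.flatMap { name =>
    if (isCaseSensitive) fileNames.find(_ == name)
    else fileNames.find(_.equalsIgnoreCase(name))
  }
```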
- implicit def toDataBlockBase(blocks: Seq[BlockMetaData]): Seq[DataBlockBase]
- implicit def toMessageType(schema: SchemaBase): MessageType
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- def withResource[T <: AutoCloseable, V](h: CloseableHolder[T])(block: (CloseableHolder[T]) ⇒ V): V
Executes the provided code block and then closes the resource
- Definition Classes
- Arm
- def withResource[T <: AutoCloseable, V](r: ArrayBuffer[T])(block: (ArrayBuffer[T]) ⇒ V): V
Executes the provided code block and then closes the array buffer of resources
- Definition Classes
- Arm
- def withResource[T <: AutoCloseable, V](r: Array[T])(block: (Array[T]) ⇒ V): V
Executes the provided code block and then closes the array of resources
- Definition Classes
- Arm
- def withResource[T <: AutoCloseable, V](r: Seq[T])(block: (Seq[T]) ⇒ V): V
Executes the provided code block and then closes the sequence of resources
- Definition Classes
- Arm
- def withResource[T <: AutoCloseable, V](r: Option[T])(block: (Option[T]) ⇒ V): V
Executes the provided code block and then closes the Option[resource]
- Definition Classes
- Arm
- def withResource[T <: AutoCloseable, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block and then closes the resource
- Definition Classes
- Arm
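A typical usage sketch, written as if inside a class mixing in Arm and assuming the cudf Java API, showing how a decoded Table is guaranteed to be closed:

```scala
import ai.rapids.cudf.{HostMemoryBuffer, ParquetOptions, Table}

// The Table is closed when the block finishes, whether it returns normally
// or throws; only the extracted row count escapes.
def countRows(opts: ParquetOptions, buffer: HostMemoryBuffer, dataSize: Long): Long =
  withResource(Table.readParquet(opts, buffer, 0, dataSize)) { table =>
    table.getRowCount
  }
```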
- def withResourceIfAllowed[T, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block and then closes the value if it is AutoCloseable
- Definition Classes
- Arm
- def writeFileFooter(buffer: HostMemoryBuffer, bufferSize: Long, footerOffset: Long, blocks: Seq[DataBlockBase], bContext: BatchContext): (HostMemoryBuffer, Long)
Write a footer for a specific file format. If there is no footer for the file format, just return (hmb, offset).
Note that some file formats may re-allocate the HostMemoryBuffer because the estimated initial buffer size may be a little smaller than the actual size. In that case, the original hmb should be closed in the implementation.
- buffer
the buffer holding (header + data blocks)
- bufferSize
the total buffer size, which equals the size of (header + blocks + footer)
- footerOffset
where to begin writing the footer
- blocks
the data block metadata
- returns
the buffer and the buffer size
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
- def writeFileHeader(buffer: HostMemoryBuffer, bContext: BatchContext): Long
Write a header for a specific file format. If there is no header for the file format, just ignore it and return 0.
- buffer
where the header will be written
- returns
the number of bytes written
- Definition Classes
- MultiFileParquetPartitionReader → MultiFileCoalescingPartitionReaderBase
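For Parquet, the header is just the 4-byte magic "PAR1"; a minimal sketch of what an implementation could write, assuming the cudf HostMemoryBuffer API (an assumption, not the plugin's exact code):

```scala
import java.nio.charset.StandardCharsets
import ai.rapids.cudf.HostMemoryBuffer

// Write the Parquet magic at offset 0 and report the bytes written.
def writeParquetHeaderSketch(buffer: HostMemoryBuffer): Long = {
  val magic = "PAR1".getBytes(StandardCharsets.US_ASCII)
  buffer.setBytes(0, magic, 0, magic.length) // write at offset 0
  magic.length.toLong                        // bytes written
}
```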
- def writeFooter(out: OutputStream, blocks: Seq[BlockMetaData], schema: MessageType): Unit
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase