Class MutableHashTable<BT,PT>
java.lang.Object
org.apache.flink.runtime.operators.hash.MutableHashTable<BT,PT>
- Type Parameters:
BT - The type of records from the build side that are stored in the hash table.
PT - The type of records from the probe side that are stored in the hash table.
- All Implemented Interfaces:
org.apache.flink.core.memory.MemorySegmentSource
- Direct Known Subclasses:
ReOpenableMutableHashTable
public class MutableHashTable<BT,PT>
extends Object
implements org.apache.flink.core.memory.MemorySegmentSource
An implementation of a Hybrid Hash Join. The join starts operating in memory and gradually spills contents to disk when memory is not sufficient. It does not need to know a priori how large the input will be.
The design of this class follows in many parts the design presented in "Hash joins and hash teams in Microsoft SQL Server", by Goetz Graefe et al. In its current state, the implementation lacks features like dynamic role reversal, partition tuning, or histogram guided partitioning.
The layout of the buckets inside a memory segment is as follows:
+----------------------------- Bucket x ----------------------------
|Partition (1 byte) | Status (1 byte) | element count (2 bytes) |
| next-bucket-in-chain-pointer (8 bytes) | probedFlags (2 bytes) | reserved (2 bytes) |
|
|hashCode 1 (4 bytes) | hashCode 2 (4 bytes) | hashCode 3 (4 bytes) |
| ... hashCode n-1 (4 bytes) | hashCode n (4 bytes) |
|pointer 1 (8 bytes) | pointer 2 (8 bytes) | pointer 3 (8 bytes) |
| ... pointer n-1 (8 bytes) | pointer n (8 bytes) |
+---------------------------- Bucket x + 1 --------------------------
|Partition (1 byte) | Status (1 byte) | element count (2 bytes) |
| next-bucket-in-chain-pointer (8 bytes) | probedFlags (2 bytes) | reserved (2 bytes) |
|
|hashCode 1 (4 bytes) | hashCode 2 (4 bytes) | hashCode 3 (4 bytes) |
| ... hashCode n-1 (4 bytes) | hashCode n (4 bytes) |
|pointer 1 (8 bytes) | pointer 2 (8 bytes) | pointer 3 (8 bytes) |
| ... pointer n-1 (8 bytes) | pointer n (8 bytes)
+-------------------------------------------------------------------
| ... |
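From this layout, each bucket header occupies 16 bytes (1 + 1 + 2 + 8 + 2 + 2), and each stored record costs a 4-byte hash code plus an 8-byte pointer. The following sketch works out that arithmetic in Java; the 128-byte bucket size and the constant names are illustrative assumptions, not the class's actual fields:

// Derived from the bucket layout above; the bucket size is assumed.
static final int HEADER_BYTES = 1 + 1 + 2 + 8 + 2 + 2; // partition, status, count, chain pointer, probedFlags, reserved = 16
static final int BYTES_PER_ENTRY = 4 + 8;              // one hash code plus one pointer

static int entriesPerBucket(int bucketSizeBytes) {
    return (bucketSizeBytes - HEADER_BYTES) / BYTES_PER_ENTRY;
}
// e.g. entriesPerBucket(128) == (128 - 16) / 12 == 9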
-
Nested Class Summary
Nested Classes:
static class MutableHashTable.HashBucketIterator<BT,PT>
static final class MutableHashTable.ProbeIterator<PT>
static class MutableHashTable.UnmatchedBuildIterator<BT,PT> - Iterates over all the elements in memory which have not been probed during the probe phase.
-
Field Summary
Fields:
protected final List<org.apache.flink.core.memory.MemorySegment> availableMemory - The free memory segments currently available to the hash join.
protected org.apache.flink.core.memory.MemorySegment[] buckets - The array of memory segments that contain the buckets which form the actual hash-table of hash-codes and pointers to the elements.
protected final int bucketsPerSegmentBits - The number of bits that describe the position of a bucket in a memory segment.
protected final int bucketsPerSegmentMask - The number of hash table buckets in a single memory segment - 1.
protected final org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator - The utilities to hash and compare the build side data types.
protected final org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer - The utilities to serialize the build side data types.
protected AtomicBoolean closed - Flag indicating that the closing logic has been invoked.
protected FileIOChannel.Enumerator currentEnumerator - The channel enumerator that is used while processing the current partition to create channels for the spill partitions it requires.
protected int currentRecursionDepth - The recursion depth of the partition that is currently processed.
protected boolean furtherPartitioning
protected final IOManager ioManager - The I/O manager used to instantiate writers for the spilled partitions.
protected boolean keepBuildSidePartitions - If true, build side partitions are kept for multiple probe steps.
protected int numBuckets - The number of buckets in the current table.
protected final ArrayList<HashPartition<BT,PT>> partitionsBeingBuilt - The partitions that are built by processing the current partition.
protected MutableHashTable.ProbeIterator<PT> probeIterator - Iterator over the elements from the probe side.
protected final org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer - The utilities to serialize the probe side data types.
protected final int segmentSize - The size of the segments used by the hash join buckets.
protected final LinkedBlockingQueue<org.apache.flink.core.memory.MemorySegment> writeBehindBuffers - The queue of buffers that can be used for write-behind.
protected int writeBehindBuffersAvailable - The number of buffers in the write behind queue that are actually not write behind buffers, but regular buffers that only have not yet returned.
-
Constructor Summary
Constructors:
MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer, org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer, org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator, org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator, org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator, List<org.apache.flink.core.memory.MemorySegment> memorySegments, IOManager ioManager)
MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer, org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer, org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator, org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator, org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator, List<org.apache.flink.core.memory.MemorySegment> memorySegments, IOManager ioManager, boolean useBloomFilters)
MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer, org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer, org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator, org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator, org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator, List<org.apache.flink.core.memory.MemorySegment> memorySegments, IOManager ioManager, int avgRecordLen, boolean useBloomFilters)
-
Method Summary
Methods:
void abort()
static byte assignPartition(int bucket, byte numPartitions) - Assigns a partition to a bucket.
protected final void buildBloomFilterForBucketsInPartition(int partNum, HashPartition<BT,PT> partition)
protected void buildInitialTable(org.apache.flink.util.MutableObjectIterator<BT> input) - Creates the initial hash table.
protected void buildTableFromSpilledPartition(HashPartition<BT,PT> p)
protected void clearPartitions() - This method clears all partitions currently residing (partially) in memory.
void close() - Closes the hash table.
protected void createPartitions(int numPartitions, int recursionLevel)
org.apache.flink.util.MutableObjectIterator<BT> getBuildSideIterator()
PT getCurrentProbeRecord()
List<org.apache.flink.core.memory.MemorySegment> getFreedMemory()
static int getInitialTableSize(int numBuffers, int bufferSize, int numPartitions, int recordLenBytes)
MutableHashTable.HashBucketIterator<BT,PT> getMatchesFor(PT record)
protected HashPartition<BT,PT> getNewInMemoryPartition(int number, int recursionLevel) - Returns a new inMemoryPartition object.
static int getNumWriteBehindBuffers(int numBuffers) - Determines the number of buffers to be used for asynchronous write behind.
static int getPartitioningFanOutNoEstimates(int numBuffers) - Gets the number of partitions to be used for an initial hash-table, when no estimates are available.
org.apache.flink.api.common.typeutils.TypeComparator<PT> getProbeSideComparator()
static int hash(int code, int level) - The level parameter is needed so that we can have different hash functions when we recursively apply the partitioning, so that the working set eventually fits into memory.
protected void initTable(int numBuckets, byte numPartitions)
protected final void insertIntoTable(BT record, int hashCode)
boolean nextRecord()
org.apache.flink.core.memory.MemorySegment nextSegment() - This is the method called by the partitions to request memory to serialize records.
void open(org.apache.flink.util.MutableObjectIterator<BT> buildSide, org.apache.flink.util.MutableObjectIterator<PT> probeSide) - Opens the hash join.
void open(org.apache.flink.util.MutableObjectIterator<BT> buildSide, org.apache.flink.util.MutableObjectIterator<PT> probeSide, boolean buildOuterJoin) - Opens the hash join.
protected boolean prepareNextPartition()
protected boolean processProbeIter()
protected boolean processUnmatchedBuildIter()
protected void releaseTable() - Releases the table (the array of buckets) and returns the occupied memory segments to the list of free segments.
protected int spillPartition() - Selects a partition and spills it.
-
Field Details
-
buildSideSerializer
The utilities to serialize the build side data types. -
probeSideSerializer
The utilities to serialize the probe side data types. -
buildSideComparator
The utilities to hash and compare the build side data types. -
availableMemory
The free memory segments currently available to the hash join. -
writeBehindBuffers
The queue of buffers that can be used for write-behind. Any buffer that is written asynchronously to disk is returned through this queue. Hence, it may sometimes contain more buffers than the number originally dedicated to write-behind. -
ioManager
The I/O manager used to instantiate writers for the spilled partitions. -
segmentSize
protected final int segmentSize
The size of the segments used by the hash join buckets. All segments must be of equal size to ease offset computations. -
bucketsPerSegmentMask
protected final int bucketsPerSegmentMask
The number of hash table buckets in a single memory segment - 1. Because memory segments can be comparatively large, we fit multiple buckets into one memory segment. This variable is a mask that is 1 in the lower bits that define the number of a bucket in a segment. -
bucketsPerSegmentBits
protected final int bucketsPerSegmentBits
The number of bits that describe the position of a bucket in a memory segment. Computed as log2(bucketsPerSegment).
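Together with segmentSize, these two fields let the table resolve a global bucket number into a segment index and a byte offset using a shift and a mask instead of division and modulo. A minimal sketch of that arithmetic, assuming an illustrative bucket size constant (not the class's actual field names):

// Illustrative bucket addressing; NUM_BUCKET_BYTES is an assumed constant.
static final int NUM_BUCKET_BYTES = 128;

static int segmentForBucket(int bucket, int bucketsPerSegmentBits) {
    return bucket >>> bucketsPerSegmentBits;            // index into the buckets array
}

static int offsetForBucket(int bucket, int bucketsPerSegmentMask) {
    return (bucket & bucketsPerSegmentMask) * NUM_BUCKET_BYTES; // byte offset within that segment
}
-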
partitionsBeingBuilt
The partitions that are built by processing the current partition. -
probeIterator
Iterator over the elements from the probe side. -
currentEnumerator
The channel enumerator that is used while processing the current partition to create channels for the spill partitions it requires. -
buckets
protected org.apache.flink.core.memory.MemorySegment[] buckets
The array of memory segments that contain the buckets which form the actual hash-table of hash-codes and pointers to the elements. -
numBuckets
protected int numBuckets
The number of buckets in the current table. The bucket array is not necessarily fully used, since not all buckets that would fit into the last segment are necessarily in use. -
writeBehindBuffersAvailable
protected int writeBehindBuffersAvailable
The number of buffers in the write-behind queue that are not actually write-behind buffers, but regular buffers that simply have not yet been returned. This is part of an optimization: the spilling code need not wait until the partition is completely spilled before proceeding. -
currentRecursionDepth
protected int currentRecursionDepth
The recursion depth of the partition that is currently processed. The initial table has a recursion depth of 0. Partitions spilled from a table that is built for a partition with recursion depth n have a recursion depth of n+1. -
closed
Flag indicating that the closing logic has been invoked. -
keepBuildSidePartitions
protected boolean keepBuildSidePartitions
If true, build side partitions are kept for multiple probe steps. -
furtherPartitioning
protected boolean furtherPartitioning
-
-
Constructor Details
-
MutableHashTable
public MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer,
    org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer,
    org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator,
    org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator,
    org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator,
    List<org.apache.flink.core.memory.MemorySegment> memorySegments,
    IOManager ioManager) -
MutableHashTable
public MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer,
    org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer,
    org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator,
    org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator,
    org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator,
    List<org.apache.flink.core.memory.MemorySegment> memorySegments,
    IOManager ioManager,
    boolean useBloomFilters) -
MutableHashTable
public MutableHashTable(org.apache.flink.api.common.typeutils.TypeSerializer<BT> buildSideSerializer,
    org.apache.flink.api.common.typeutils.TypeSerializer<PT> probeSideSerializer,
    org.apache.flink.api.common.typeutils.TypeComparator<BT> buildSideComparator,
    org.apache.flink.api.common.typeutils.TypeComparator<PT> probeSideComparator,
    org.apache.flink.api.common.typeutils.TypePairComparator<PT, BT> comparator,
    List<org.apache.flink.core.memory.MemorySegment> memorySegments,
    IOManager ioManager,
    int avgRecordLen,
    boolean useBloomFilters)
-
-
Method Details
-
open
public void open(org.apache.flink.util.MutableObjectIterator<BT> buildSide, org.apache.flink.util.MutableObjectIterator<PT> probeSide) throws IOException
Opens the hash join. This method reads the build-side input and constructs the initial hash table, gradually spilling partitions that do not fit into memory.
- Parameters:
buildSide - Build side input.
probeSide - Probe side input.
- Throws:
IOException - Thrown, if an I/O problem occurs while spilling a partition.
-
open
public void open(org.apache.flink.util.MutableObjectIterator<BT> buildSide, org.apache.flink.util.MutableObjectIterator<PT> probeSide, boolean buildOuterJoin) throws IOException
Opens the hash join. This method reads the build-side input and constructs the initial hash table, gradually spilling partitions that do not fit into memory.
- Parameters:
buildSide - Build side input.
probeSide - Probe side input.
buildOuterJoin - Whether to perform an outer join on the build side.
- Throws:
IOException - Thrown, if an I/O problem occurs while spilling a partition.
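Taken together with nextRecord(), getCurrentProbeRecord(), getBuildSideIterator(), and close(), the open methods suggest the following driver loop. This is a hedged sketch of how a caller might drive the join; the class name, method name, and variables are illustrative:

import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.runtime.operators.hash.MutableHashTable;
import org.apache.flink.util.MutableObjectIterator;

import java.io.IOException;

class HashJoinLoopSketch {
    // Hedged driver loop; table, inputs, and serializer are assumed to be
    // constructed elsewhere with matching comparators and memory segments.
    static <BT, PT> void joinAll(MutableHashTable<BT, PT> table,
                                 MutableObjectIterator<BT> buildInput,
                                 MutableObjectIterator<PT> probeInput,
                                 TypeSerializer<BT> buildSerializer) throws IOException {
        table.open(buildInput, probeInput);
        try {
            while (table.nextRecord()) {
                PT probeRecord = table.getCurrentProbeRecord();
                MutableObjectIterator<BT> matches = table.getBuildSideIterator();
                BT reuse = buildSerializer.createInstance();
                while ((reuse = matches.next(reuse)) != null) {
                    // emit the joined pair (probeRecord, reuse) downstream
                }
            }
        } finally {
            table.close(); // also valid as a cancellation cleanup
        }
    }
}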
-
processProbeIter
- Throws:
IOException
-
processUnmatchedBuildIter
- Throws:
IOException
-
prepareNextPartition
- Throws:
IOException
-
nextRecord
- Throws:
IOException
-
getMatchesFor
- Throws:
IOException
-
getCurrentProbeRecord
-
getBuildSideIterator
-
close
public void close()
Closes the hash table. This effectively releases all internal structures, closes all open files, and removes them. The call to this method is valid both as a cleanup after the complete inputs were properly processed, and as a cancellation call, which cleans up all resources that are currently held by the hash join. -
abort
public void abort() -
getFreedMemory
-
buildInitialTable
protected void buildInitialTable(org.apache.flink.util.MutableObjectIterator<BT> input) throws IOException
Creates the initial hash table. This method sets up partitions, hash index, and inserts the data from the given iterator.
- Parameters:
input - The iterator with the build side data.
- Throws:
IOException - Thrown, if an element could not be fetched and deserialized from the iterator, or if serialization fails.
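In terms of the other protected methods listed on this page, the build phase can be outlined as: create partitions, initialize the bucket table, then insert every build-side record under its level-0 hash code. The following is a deliberately simplified sketch, written as if inside a subclass; the fan-out, table-size arguments, and the assumed average record length of 24 bytes are illustrative, not the actual implementation:

// Hedged outline of the build phase using the class's own protected methods.
protected void buildInitialTableSketch(org.apache.flink.util.MutableObjectIterator<BT> input)
        throws IOException {
    int fanOut = getPartitioningFanOutNoEstimates(this.availableMemory.size());
    createPartitions(fanOut, 0); // recursion level 0
    initTable(getInitialTableSize(this.availableMemory.size(), this.segmentSize,
                                  fanOut, 24 /* assumed avg record length */),
              (byte) fanOut);
    BT record = this.buildSideSerializer.createInstance();
    while ((record = input.next(record)) != null) {
        insertIntoTable(record, hash(this.buildSideComparator.hash(record), 0));
    }
}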
-
buildTableFromSpilledPartition
- Throws:
IOException
-
insertIntoTable
- Throws:
IOException
-
getNewInMemoryPartition
Returns a new inMemoryPartition object. This is required as an override hook for ReOpenableMutableHashTable. -
createPartitions
protected void createPartitions(int numPartitions, int recursionLevel) -
clearPartitions
protected void clearPartitions()
This method clears all partitions currently residing (partially) in memory. It releases all memory and deletes all spilled partitions. This method is intended for a hard cleanup in the case that the join is aborted.
-
initTable
protected void initTable(int numBuckets, byte numPartitions) -
releaseTable
protected void releaseTable()
Releases the table (the array of buckets) and returns the occupied memory segments to the list of free segments. -
spillPartition
Selects a partition and spills it. The number of the spilled partition is returned.
- Returns:
- The number of the spilled partition.
- Throws:
IOException
-
buildBloomFilterForBucketsInPartition
protected final void buildBloomFilterForBucketsInPartition(int partNum, HashPartition<BT, PT> partition) -
nextSegment
public org.apache.flink.core.memory.MemorySegment nextSegment()
This is the method called by the partitions to request memory to serialize records. It automatically spills partitions, if memory runs out.
- Specified by:
nextSegment in interface org.apache.flink.core.memory.MemorySegmentSource
- Returns:
- The next available memory segment.
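The spill-on-demand behavior can be pictured as below. This is a deliberately simplified sketch under assumed internals (a free list plus spillPartition()); the real method also coordinates with the write-behind buffers:

// Simplified sketch of memory handout with spilling (assumed internals).
MemorySegment nextSegmentSketch() throws IOException {
    if (this.availableMemory.isEmpty()) {
        spillPartition(); // frees the in-memory segments of one partition
    }
    return this.availableMemory.remove(this.availableMemory.size() - 1);
}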
-
getNumWriteBehindBuffers
public static int getNumWriteBehindBuffers(int numBuffers)
Determines the number of buffers to be used for asynchronous write behind. It is currently computed as the logarithm of the number of buffers to the base 4, rounded up, minus 2. The upper limit for the number of write-behind buffers, however, is six.
- Parameters:
numBuffers - The number of available buffers.
- Returns:
- The number of buffers to use for write-behind.
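Written out as code, the rule reads as follows. The cap of six comes from the description above; the floor of one buffer and the integer-based logarithm are assumptions of this sketch:

// min(6, ceil(log4(numBuffers)) - 2), clamped to at least 1 (assumed floor).
static int numWriteBehindBuffersSketch(int numBuffers) {
    int log2ceil = 32 - Integer.numberOfLeadingZeros(numBuffers - 1); // ceil(log2(n))
    int log4ceil = (log2ceil + 1) / 2;                                // ceil(log4(n))
    return Math.min(6, Math.max(1, log4ceil - 2));
}
// e.g. 64 buffers   -> ceil(log4(64))   = 3 -> 1 write-behind buffer
//      4096 buffers -> ceil(log4(4096)) = 6 -> 4 write-behind buffers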
-
getPartitioningFanOutNoEstimates
public static int getPartitioningFanOutNoEstimates(int numBuffers)
Gets the number of partitions to be used for an initial hash-table, when no estimates are available. The current logic makes sure that there are always between 10 and 127 partitions, and close to 0.1 of the number of buffers.
- Parameters:
numBuffers - The number of buffers available.
- Returns:
- The number of partitions to use.
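That policy is essentially a one-liner; here is a sketch consistent with the stated bounds (10 to 127 partitions, roughly a tenth of the buffers), offered as an illustration rather than the actual implementation:

static int fanOutNoEstimatesSketch(int numBuffers) {
    return Math.max(10, Math.min(127, numBuffers / 10));
}
// e.g. 50 buffers -> 10 partitions; 800 buffers -> 80; 5000 buffers -> 127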
-
getInitialTableSize
public static int getInitialTableSize(int numBuffers, int bufferSize, int numPartitions, int recordLenBytes) -
assignPartition
public static byte assignPartition(int bucket, byte numPartitions)
Assigns a partition to a bucket.
- Parameters:
bucket - The bucket to get the partition for.
numPartitions - The number of partitions.
- Returns:
- The partition for the bucket.
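The Javadoc does not spell out the mapping; a simple modulo, shown here purely as an assumption, satisfies the stated contract:

// Assumed mapping: spread buckets over partitions round-robin.
static byte assignPartitionSketch(int bucket, byte numPartitions) {
    return (byte) (bucket % numPartitions);
}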
-
hash
public static int hash(int code, int level)
The level parameter is needed so that we can have different hash functions when we recursively apply the partitioning, so that the working set eventually fits into memory.
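One way to realize a family of hash functions indexed by level is to permute the code by a level-dependent rotation before scrambling it. The rotation step and the mixing constants below are illustrative assumptions, not the class's actual choice:

static int levelHashSketch(int code, int level) {
    code = Integer.rotateLeft(code, (level * 11) % 32); // different permutation per level
    // Final avalanche mixing (MurmurHash3-style constants, used here for illustration).
    code = (code ^ (code >>> 16)) * 0x85ebca6b;
    code = (code ^ (code >>> 13)) * 0xc2b2ae35;
    return code ^ (code >>> 16);
}
-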
getProbeSideComparator
-