Class BloomFilter
java.lang.Object
org.apache.flink.runtime.operators.util.BloomFilter
BloomFilter is a probabilistic data structure for set membership check. BloomFilters are highly
space efficient when compared to using a HashSet. Because of the probabilistic nature of bloom
filter false positive (element not present in bloom filter but test() says true) are possible but
false negatives are not possible (if element is present then test() will never say false). The
false positive probability is configurable depending on which storage requirement may increase or
decrease. Lower the false positive probability greater is the space requirement. Bloom filters
are sensitive to number of elements that will be inserted in the bloom filter. During the
creation of bloom filter expected number of entries must be specified. If the number of
insertions exceed the specified initial number of entries then false positive probability will
increase accordingly.
Internally, this implementation of bloom filter uses MemorySegment to store BitSet, BloomFilter and BitSet are designed to be able to switch between different MemorySegments, so that Flink can share the same BloomFilter/BitSet object instance for different bloom filters.
Part of this class refers to the implementation from Apache Hive project https://github.com/apache/hive/blob/master/common/src/java/org/apache/hive/common/util/BloomFilter.java
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoidaddHash(int hash32) static doubleestimateFalsePositiveProbability(long inputEntries, int bitSize) Compute the false positive probability based on given input entries and bits size.static BloomFilterfromBytes(byte[] bytes) Deserializing bytes array to BloomFilter.static byte[]mergeSerializedBloomFilters(byte[] bf1Bytes, byte[] bf2Bytes) static intoptimalNumOfBits(long inputEntries, double fpp) Compute optimal bits number with given input entries and expected false positive probability.voidreset()voidsetBitsLocation(org.apache.flink.core.memory.MemorySegment memorySegment, int offset) booleantestHash(int hash32) static byte[]toBytes(BloomFilter filter) Serializing to bytes, note that only heap memory is currently supported.toString()
-
Field Details
-
bitSet
-
numHashFunctions
protected int numHashFunctions
-
-
Constructor Details
-
BloomFilter
public BloomFilter(int expectedEntries, int byteSize)
-
-
Method Details
-
setBitsLocation
public void setBitsLocation(org.apache.flink.core.memory.MemorySegment memorySegment, int offset) -
optimalNumOfBits
public static int optimalNumOfBits(long inputEntries, double fpp) Compute optimal bits number with given input entries and expected false positive probability.- Parameters:
inputEntries-fpp-- Returns:
- optimal bits number
-
estimateFalsePositiveProbability
public static double estimateFalsePositiveProbability(long inputEntries, int bitSize) Compute the false positive probability based on given input entries and bits size. Note: this is just the math expected value, you should not expect the fpp in real case would under the return value for certain.- Parameters:
inputEntries-bitSize-- Returns:
-
addHash
public void addHash(int hash32) -
testHash
public boolean testHash(int hash32) -
reset
public void reset() -
toString
-
toBytes
Serializing to bytes, note that only heap memory is currently supported. -
fromBytes
Deserializing bytes array to BloomFilter. Currently, only heap memory is supported. -
mergeSerializedBloomFilters
public static byte[] mergeSerializedBloomFilters(byte[] bf1Bytes, byte[] bf2Bytes)
-