Bloom Filters: An Efficient Approach to Lookup Optimization

A basic look into Guava Bloom Filter

Jul 18, 2023

This article will involve a look into Guava’s BloomFilter class. If you want to take a look at it, feel free to look here

What are Bloom Filters?

Bloom Filters are a type of probabilistic data structure, which offer an efficient approach to optimize lookup operations at the expense of a small degree of accuracy, using less memory space and achieving a constant-time query. They are commonly used for read optimization in LSM Trees and Cache systems.

Algorithm

We might recall how bitmaps are used for lookup operations. Essentially, bitmaps perform hash mapping for elements, using a single bit to determine if an element is present in a set.

An example of bitmaps being used in encoding and compression.

A Bloom Filter works on a similar concept, but it saves space at the cost of increased time complexity. This is achieved by using multiple hash mappings and a set of bits to determine if an element belongs to a set.

Bloom Filters do allow hash collisions, which means they come with a specific rate of false positives. So, we need to be watchful while implementing your application logic.

Put

The Guava implementation was adopted from Harvard’s paper "Less Hashing, Same Performance: Building a Better Bloom Filter" Guava. It states that just two 32-bit hash functions are needed to map to n bits.

Here's Guava implementation of the 'put' operation for a Bloom Filter:

  /**
   * Puts an element into this {@code BloomFilter}. Ensures that subsequent invocations of {@link
   * #mightContain(Object)} with the same element will always return {@code true}.
   *
   * @return true if the Bloom filter's bits changed as a result of this operation. If the bits
   *     changed, this is <i>definitely</i> the first time {@code object} has been added to the
   *     filter. If the bits haven't changed, this <i>might</i> be the first time {@code object} has
   *     been added to the filter. Note that {@code put(t)} always returns the <i>opposite</i>
   *     result to what {@code mightContain(t)} would have returned at the time it is called.
   * @since 12.0 (present in 11.0 with {@code void} return type})
   */
public <T extends @Nullable Object> boolean put(
        @ParametricNullness T object,
        Funnel<? super T> funnel,
        int numHashFunctions,
        LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
}

The put method is used to insert an element into the Bloom Filter. The intention is that subsequent calls to the mightContain method with the same element will invariably return true. If the Bloom filter's bits change as a result of this operation, it unequivocally suggests that the object is being added to the filter for the first time. If the bits remain unchanged, it could still possibly be the object's first addition to the filter.

Firstly, the 'object' is converted into a basic type through the 'funnel', then a 64-bit hash is calculated, with the higher and lower halves being used as the hash values. This approach only needs a single hash function call, significantly reducing time overhead.

Interestingly, this concept is similar to double hashing in open addressing, which the textbook "Introduction to Algorithms" claims has the least conflict. The idea here is to reduce hash collisions, which naturally leads to fewer false positives.

hi = (h(key) + i * h1(key)) % m

This implementation of the Bloom Filter essentially inserts the same object 'numHashFunctions' times into the hash table, which employs open addressing and double hashing.

MightContain

The 'mightContain' function is implemented as follows:

public <T extends @Nullable Object> boolean mightContain(
        @ParametricNullness T object,
        Funnel<? super T> funnel,
        int numHashFunctions,
        LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }
}

The logic is similar to the 'put' function. If any mapped bits are 0, the element is missing from the bloom filter. If all bits are 1, then the element might be present. However, due to hash collisions, the element's presence is not 100% guaranteed. This probability of hash collision can be reduced by using multiple hash functions.

For example, look below, where you can provide a number of hash functions through the constructor.

  /** Number of hashes per element */
  private final int numHashFunctions;
.....
  private BloomFilter(
      LockFreeBitArray bits, int numHashFunctions, Funnel<? super T> funnel, Strategy strategy) {
    checkArgument(numHashFunctions > 0, "numHashFunctions (%s) must be > 0", numHashFunctions);
    checkArgument(
        numHashFunctions <= 255, "numHashFunctions (%s) must be <= 255", numHashFunctions);
    this.bits = checkNotNull(bits);
    this.numHashFunctions = numHashFunctions;
    this.funnel = checkNotNull(funnel);
    this.strategy = checkNotNull(strategy);
  }

Hash Function Calculation

Additionally, Parameters for the Bloom Filter can be automatically calculated based on requirements.

Here, the number of hashes (k), bits per insertion (b) and total bits (m) can be computed based on expected insertions (n) and the desired false positive probability (p).

  // Cheat sheet:
  //
  // m: total bits
  // n: expected insertions
  // b: m/n, bits per insertion
  // p: expected false positive probability
  //
  // 1) Optimal k = b * ln2
  // 2) p = (1 - e ^ (-kn/m))^k
  // 3) For optimal k: p = 2 ^ (-k) ~= 0.6185^b
  // 4) For optimal k: m = -nlnp / ((ln2) ^ 2)
  /**
   * Computes the optimal k (number of hashes per element inserted in Bloom filter), given the
   * expected insertions and total number of bits in the Bloom filter.
   *
   * <p>See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
   *
   * @param n expected insertions (must be positive)
   * @param m total number of bits in Bloom filter (must be positive)
   */
  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

Bit calculation

Given the number of elements and the false positive rate, we calculate the number of Bloom filter bits.

/**
   * Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
   * expected insertions, the required false positive probability.
   *
   * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
   * formula.
   *
   * @param n expected insertions (must be positive)
   * @param p false positive rate (must be 0 < p < 1)
   */
  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

Compatibility

In the case of Bloom Filters with identical parameters, bit operations can be performed to combine the information from two sets. This is done by taking a union of the bit arrays of the two Bloom Filters.

 public boolean isCompatible(BloomFilter<T> that) {
    checkNotNull(that);
    return this != that
        && this.numHashFunctions == that.numHashFunctions
        && this.bitSize() == that.bitSize()
        && this.strategy.equals(that.strategy)
        && this.funnel.equals(that.funnel);
  }
  public void putAll(BloomFilter<T> that) {
    checkNotNull(that);
    checkArgument(this != that, "Cannot combine a BloomFilter with itself.");
    checkArgument(
        this.numHashFunctions == that.numHashFunctions,
        "BloomFilters must have the same number of hash functions (%s != %s)",
        this.numHashFunctions,
        that.numHashFunctions);
    checkArgument(
        this.bitSize() == that.bitSize(),
        "BloomFilters must have the same size underlying bit arrays (%s != %s)",
        this.bitSize(),
        that.bitSize());
    checkArgument(
        this.strategy.equals(that.strategy),
        "BloomFilters must have equal strategies (%s != %s)",
        this.strategy,
        that.strategy);
    checkArgument(
        this.funnel.equals(that.funnel),
        "BloomFilters must have equal funnels (%s != %s)",
        this.funnel,
        that.funnel);
    this.bits.putAll(that.bits);
  }

Finally, other aspects related to serialization are ignored in this article as they do not contribute significantly to understanding the basic implementation and operation of Bloom Filters.

References

https://15721.courses.cs.cmu.edu/spring2023/slides/05-compression.pdf

Venkat’s Substack

Discussion about this post