1. Overview
In this article, we’ll be looking at the Bloom filter construct from the Guava library. A Bloom filter is a memory-efficient, probabilistic data structure that we can use to answer the question of whether or not a given element is in a set.
There are no false negatives with a Bloom filter, so when it returns false, we can be 100% certain that the element is not in the set.
However, a Bloom filter can return false positives, so when it returns true, there is a high probability that the element is in the set, but we can not be 100% sure.
For a more in-depth analysis of how a Bloom filter works, you can go through this tutorial.
2. Maven Dependency
We will be using Guava’s implementation of the Bloom filter, so let’s add the guava dependency:
<dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>22.0</version> </dependency>
The latest version can be found on Maven Central.
3. Why Use Bloom Filter?
The Bloom filter is designed to be space-efficient and fast. When using it, we can specify the probability of false positive responses which we can accept and, according to that configuration, the Bloom filter will occupy as little memory as it can.
Due to this space-efficiency, the Bloom filter will easily fit in memory even for huge numbers of elements. Some databases, including Cassandra and Oracle, use this filter as the first check before going to disk or cache, for example, when a request for a specific ID comes in.
If the filter returns that the ID is not present, the database can stop further processing of the request and return to the client. Otherwise, it goes to the disk and returns the element if it is found on disk.
4. Creating a Bloom Filter
Suppose we want to create a Bloom filter for up to 500 Integers and that we can tolerate a one-percent (0.01) probability of false positives.
We can use the BloomFilter class from the Guava library to achieve this. We need to pass the number of elements that we expect to be inserted into the filter and the desired false positive probability:
BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 500, 0.01);
Now let’s add some numbers to the filter:
filter.put(1); filter.put(2); filter.put(3);
We added only three elements, and we defined that the maximum number of insertions will be 500, so our Bloom filter should yield very accurate results. Let’s test it using the mightContain() method:
assertThat(filter.mightContain(1)).isTrue(); assertThat(filter.mightContain(2)).isTrue(); assertThat(filter.mightContain(3)).isTrue(); assertThat(filter.mightContain(100)).isFalse();
As the name suggests, we cannot be 100% sure that a given element is actually in the filter when the method returns true.
When mightContain() returns true in our example, we can be 99% certain that the element is in the filter, and there is a one-percent probability that the result is a false positive. When the filter returns false, we can be 100% certain that the element is not present.
5. Over-Saturated Bloom Filter
When we design our Bloom filter, it is important that we provide a reasonably accurate value for the expected number of elements. Otherwise, our filter will return false positives at a much higher rate than desired. Let’s see an example.
Suppose that we created a filter with a desired false-positive probability of one percent and an expected some elements equal to five, but then we inserted 100,000 elements:
BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 5, 0.01); IntStream.range(0, 100_000).forEach(filter::put);
Because the expected number of elements is so small, the filter will occupy very little memory.
However, as we add more items than expected, the filter becomes over-saturated and has a much higher probability of returning false positive results than the desired one percent:
assertThat(filter.mightContain(1)).isTrue(); assertThat(filter.mightContain(2)).isTrue(); assertThat(filter.mightContain(3)).isTrue(); assertThat(filter.mightContain(1_000_000)).isTrue();
Note that the mightContatin() method returned true even for a value that we didn’t insert into the filter previously.
6. Conclusion
In this quick tutorial, we looked at the probabilistic nature of the Bloom filter data structure – making use of the Guava implementation.
You can find the implementation of all these examples and code snippets in the GitHub project.
This is a Maven project, so it should be easy to import and run as it is.