• Bloom filter on k-mers in biological sequence analysis. Source: Pellow et al. 2017, fig. 1.
• As hash area increases, errors and disk accesses decrease. Source: Bloom 1970, table I.
• Neural Bloom Filter architecture. Source: Rae et al. 2019, fig. 1.
• Bloom filter with m=12 and k=3. Source: Luo et al. 2019, fig. 2.
• Some applications of Bloom filter. Source: Patgiri et al. 2019, fig. 5.
• False positive rate (FPR) of the Bloom filter. Source: Tarkoma et al. 2012, fig. 3.
• Bloom filter variants and their features. Source: Tarkoma et al. 2012, table II.

# Bloom Filter

Created by arvindpdmn on 2020-05-11 06:23:50. Last updated by arvindpdmn on 2020-05-12 09:04:31.

## Summary

Given a set of elements, suppose we wish to know if a particular element is present in this set. There are many searching algorithms to do this. When faced with a huge set (millions of elements), even with an efficient search algorithm, storage is a problem. There's also latency due to disk access. Bloom filter is one possible solution.

Bloom filter is a data structure that stores the original set in a more compact form with support for set membership queries, that is, to query if an element is a member of the set. Bloom filter is a space-efficient probabilistic data structure.

With the rise of big data since the mid-2000s, there's been increased interest in Bloom filter. From the original Bloom filter (1970), many variants have been proposed for a diverse range of applications.

## Milestones

1965

In the construction of a symbol table in compiler design, Batson describes a method that enables faster search for symbols. Rather than do a slow linear lookup, an arithmetic transformation is applied to the symbol. The result gives the address where the symbol will be stored (or retrieved) in the table. The term hashing becomes common by the late 1960s for this sort of transformation on the input.

1970

Burton H. Bloom publishes a paper titled Space/Time Trade-offs in Hash Coding with Allowable Errors. In conventional hash-coding methods, many lookups are needed when an element doesn't belong to the set. He therefore proposes a novel method to reduce the reject time, the time taken to declare that an element is a non-member of the set. To allow for a smaller hash area, he accepts some fraction of errors. With a smaller hash area, hashes can be stored in core memory for faster access.

2000

Fan et al. propose the Counting Bloom Filter in the context of web proxy caching. Since caches have to be updated, they find it necessary to keep a count of how many times a bit position in the filter has been set. They propose a 4-bit counter per filter position, enough to keep the probability of counter overflow (and hence false negatives) low. When a request comes in, the proxy checks its own cache. If absent, it checks the Bloom filters of participating proxies. Based on the result, it decides whether to query another proxy. Bloom filters are updated among proxies based on defined thresholds.

2004

Shanmugasundaram et al. propose the Hierarchical Bloom Filter to address the problem of substring matching. In addition to the substrings, the filter also stores concatenations of substrings, thus creating a hierarchy. It's applied in the context of IP traffic to attribute payloads to source and destination hosts.

2006

Borrowing from standard hashing, Kirsch and Mitzenmacher propose how to simulate multiple hash functions for the Bloom filter from just two independent uniform random hash functions. Their $$k$$ hash functions are of the form $$g_i(x) = (h_1(x) + i \cdot h_2(x)) \bmod p$$, for $$i$$ in range $$[0,k-1]$$, where $$p$$ is prime, the filter has $$kp$$ bits, and each hash function has range $$[0,p-1]$$.

2007

Almeida et al. propose the Scalable Bloom Filter. Sometimes it's difficult to estimate upfront the maximum number of elements that need to be stored in a Bloom filter. Designers therefore tend to waste space by overestimating, sometimes by orders of magnitude. By allowing the filter size to scale dynamically, the scalable Bloom filter guarantees that false positives don't exceed a defined design threshold. It's parameterized by initial size, error probability, growth rate and error probability tightening rate.

2008

Bose et al. observe that the original analysis of false positive rate by Bloom and others is not exact. They give an exact formula for the false positive rate plus lower and upper bounds. The rate is slightly larger than originally calculated but the difference is negligible for large $$m$$ and small $$k$$.

2019

Rae et al. propose the Neural Bloom Filter that uses an artificial neural network with one-shot meta-learning. Softmax is computed on content-based attention between the query $$q$$ and a learned address matrix $$A$$. Thus the network learns where to store elements based on their content. This is more space efficient but at the cost of performance for a single query. It outperforms standard Bloom filter on batch queries parallelized on GPUs.

## Discussion

• Could you explain the Bloom filter with an example?

A Bloom filter is simply a bit array of length $$m$$ for storing elements of set $$S=\{x_1,x_2,\ldots,x_n\}$$. The filter starts with all zeros, meaning that the set is empty. When an element is added, it is hashed using $$k$$ independent hash functions. Each function will output a bit position in the filter, which is then set to one. Thus, each element will set $$k$$ bits of the filter.

When a query is made, the queried element is put through the $$k$$ hash functions. The resulting bit positions are checked in the filter. If at least one of these is a zero, we're certain that the element is not in the set. If all bits are ones, then it may be in the set. Thus, the filter is probabilistic. The reason it's probabilistic is that a query may check against bit positions set by more than one stored element.

In the figure, we can say that $$x_4$$ is definitely not in S but $$x_1$$ (true positive) and $$x_5$$ (false positive) are most probably in S.
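The add and query mechanics described above can be sketched in a few lines of Python. This is an illustrative toy, assuming SHA-256 salted with an index to simulate $$k$$ independent hash functions (not a recommendation for production use):

```python
import hashlib

m, k = 32, 3              # filter size in bits, number of hash functions
filter_bits = [0] * m     # all zeros: the empty set

def positions(element):
    # Simulate k independent hash functions by salting SHA-256 with the index.
    return [int(hashlib.sha256(f"{i}:{element}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def add(element):
    for pos in positions(element):
        filter_bits[pos] = 1          # each added element sets k bits

def query(element):
    # Any zero bit: definitely not in the set. All ones: possibly in the set.
    return all(filter_bits[pos] for pos in positions(element))

add("x1")
print(query("x1"))        # True: added elements are never reported absent
print(query("x4"))        # usually False, but a false positive is possible
```

The second query illustrates the one-sided error: a "no" is always correct, while a "yes" can occasionally be wrong.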

• Under what conditions can I use the Bloom filter?

Suppose the records of a large database table are stored on disk and need to be queried. To avoid slow disk access, a Bloom filter representation of the data is kept in fast memory. If the Bloom filter says that a record is absent, we're sure that it's absent and avoid disk access. If the Bloom filter says the record is present, we're not sure. We need to query the database to find out. False positive is when the filter says something is present when it's not. False positives should be at a level acceptable to the application. When most queried records are absent in the database, Bloom filter gives performance gains.

Back in 2003, Broder and Mitzenmacher stated the Bloom filter principle,

Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

Multiple elements can set the same bit in the filter. Therefore, elements once added can't be removed from the filter. Neither can we obtain the list of elements added. Standard Bloom filter is not suited for applications that need these features.
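The database-fronting pattern described above can be sketched as follows. TinyBloom and FilteredStore are hypothetical names, and a Python dict stands in for the slow disk-resident store:

```python
import hashlib

class TinyBloom:
    """Hypothetical minimal Bloom filter (illustrative, not production code)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _positions(self, key):
        # SHA-256 salted with the index simulates k hash functions.
        return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]
    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p
    def __contains__(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

class FilteredStore:
    """Hypothetical store that consults a Bloom filter before slow storage."""
    def __init__(self):
        self.bloom = TinyBloom()   # in-memory filter of all stored keys
        self.disk = {}             # a dict stands in for slow disk storage
    def put(self, key, value):
        self.bloom.add(key)
        self.disk[key] = value
    def get(self, key):
        if key not in self.bloom:  # definite miss: skip the slow lookup
            return None
        return self.disk.get(key)  # possible hit: must consult the store

store = FilteredStore()
store.put("user:42", "Alice")
print(store.get("user:42"))        # Alice
print(store.get("user:99"))        # None, almost always without a disk access
```

The gain comes from the last line: when most queried keys are absent, most lookups never reach the slow store.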

• How do I select the parameters of the Bloom filter?

There are essentially three parameters: filter size $$m$$, number of hash functions $$k$$ and number of elements to be stored $$n$$.

For lower false positives, use higher values of $$m$$ and $$k$$, though each comes with a tradeoff. Higher $$m$$ means more memory consumption. Higher $$k$$ means more computation. Since higher $$k$$ sets more bits, there's an optimal value $$k_{opt} = (m/n)\ln 2$$ beyond which false positives start to increase.

When adding elements beyond the design limit, false positives go up. To maintain a fixed false positive probability $$\epsilon$$, we should linearly scale $$m$$ as $$n$$ increases. In fact, the bit-to-element ratio is bounded as $$m/n \geqslant \log_2(e)\cdot\log_2(1/\epsilon) \approx 1.44\,\log_2(1/\epsilon)$$.

In practice, we estimate $$n$$ in advance for our application. We decide on the false positive probability we're willing to tolerate. We select $$m$$ and calculate $$k$$, or select $$k$$ and calculate $$m$$. A handy calculator is available online.

If parameters are not chosen correctly, an attacker can corrupt the filter with well-chosen inputs.
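The selection procedure above can be sketched in Python. The helper names are hypothetical; the formulas are the well-known approximations $$m = -n \ln p / (\ln 2)^2$$ and $$k = (m/n)\ln 2$$:

```python
from math import ceil, exp, log

def bloom_parameters(n, p):
    """Optimal size m (bits) and hash count k for n elements at target FPR p."""
    m = ceil(-n * log(p) / (log(2) ** 2))   # m = -n ln(p) / (ln 2)^2
    k = max(1, round((m / n) * log(2)))     # k_opt = (m/n) ln 2
    return m, k

def false_positive_rate(m, k, n):
    """Approximate FPR of a filter with m bits, k hashes, n stored elements."""
    return (1 - exp(-k * n / m)) ** k

m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k)   # roughly 9.6 million bits (about 1.2 MB) and k = 7
print(round(false_positive_rate(m, k, 1_000_000), 4))   # close to 0.01
```

Note the modest cost: a million elements at 1% false positives need only about 9.6 bits per element, regardless of element size.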

• Which hash functions are suitable for use in a Bloom filter?

Hash functions should ideally be independent, so that for the same input they tend to set different bits of the filter. Otherwise, hash collisions become more common, leading to higher false positives. Functions should also distribute inputs uniformly over the filter.

Hash functions should be fast. The use of cryptographic hash functions is possible but not really required. Among the non-cryptographic hash functions suited for Bloom filter are MurmurHash, Fowler–Noll–Vo (FNV) and Jenkins. One developer showed how replacing MD5 with MurmurHash brought performance improvements.

It's possible to generate $$k$$ hash functions from just two hash functions. One researcher pointed out that if the filter is addressable with 32 bits, then on a 64-bit machine a single hash function can produce two 32-bit hash values, thus simulating two hash functions.
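This double-hashing trick can be sketched by splitting one SHA-256 digest into two halves (an illustrative choice) and combining them as $$g_i(x) = h_1(x) + i \cdot h_2(x)$$, here reduced modulo the filter size $$m$$ rather than a prime as in the original scheme:

```python
import hashlib

def double_hash_positions(element, k, m):
    """Derive k bit positions from two hash values, following
    g_i(x) = h1(x) + i*h2(x), reduced modulo the filter size m."""
    digest = hashlib.sha256(element.encode()).digest()
    h1 = int.from_bytes(digest[:16], "big")
    h2 = int.from_bytes(digest[16:], "big") | 1   # odd stride avoids h2 = 0
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_positions("dog", k=4, m=100))   # four positions, one digest
```

Only one digest is computed per element, however many hash functions the filter is configured with.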

• Which are some real-world applications of the Bloom filter?

For network-related applications, Bloom filter is used to collaborate in peer-to-peer networks, resource routing, packet routing, and measurements. Content Delivery Networks (CDNs) use Bloom filters to avoid caching files seen only once.

For applications that use databases, Bloom filter enables efficient searches, privacy preservation, content synchronization, and duplicate detection.

Medium uses Bloom filter to deduplicate recommendations. In a streaming application processing 17 million events per day per partition, Bloom filter was used to deduplicate events at scale. The filter needed 108Gb across 1024 partitions. Reads and writes were 20x and 3x faster respectively.

Chrome browser uses the filter to represent a set of malicious URLs. When user requests a URL, if the filter indicates a probable match, the request is sent to a server to check if the URL is indeed safe.

In a web application, a counting variant of the Bloom filter could be used to keep a count of visitors from a specific city accessing a webpage.

• How's the performance of the Bloom filter?

A Bloom filter with one hash function is equivalent to ordinary hashing. Bloom filter is a generalization of hashing. It allows for more interesting design tradeoffs.

The complexity to add an element or to query the filter is fixed at $$O(k)$$. In other words, it's independent of how many items have been added to the filter.

Bloom filter simplifies some operations. Suppose we have two filters that represent sets $$S_1$$ and $$S_2$$. The union of these two sets is simply an OR operation of the two filters. Intersection of the two sets can also be approximated with the filters. Another trick is to halve the size of the filter with an OR operation of its first and second halves.
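These set operations can be illustrated with filters stored as integer bitmasks (an illustrative representation; real implementations use bit arrays):

```python
m = 16                          # filter size, a power of two

f1 = 0b0010_0100_0001_0000      # filter for set S1
f2 = 0b0000_0100_1000_0001      # filter for set S2

# Union: bitwise OR of the two filters gives the filter for S1 ∪ S2.
union = f1 | f2

# Halving: OR the upper half onto the lower half. Since m is a power of two,
# a bit at position p in the old filter maps to p mod (m/2) in the new one,
# so queries simply reduce their hash positions modulo the new size.
half = (f1 >> m // 2) | (f1 & ((1 << m // 2) - 1))

# Every position set in f1 is still set in the folded filter.
for p in range(m):
    if f1 >> p & 1:
        assert half >> (p % (m // 2)) & 1
```

Folding raises the false positive rate, since twice as many elements now share each bit, but it lets a filter shrink without rehashing its contents.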

• What are some variants of the Bloom filter?

Bloom filter has inspired dozens of variants. One survey paper from 2019 mentioned more than 60 variants, their capabilities, performance and application areas. A variant might aim to reduce false positives, improve performance or provide additional features.

Counting Bloom filter allows us to implement a multiset so that multiple instances of the same element can be stored. Some of these support deletions as well but at the cost of false negatives. Bitwise Bloom filter and spectral Bloom filter improve on the counting Bloom filter.
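A counting Bloom filter can be sketched by replacing each bit with a small counter. CountingBloom is a hypothetical toy class, using SHA-256 salted with an index to simulate the hash functions:

```python
import hashlib

class CountingBloom:
    """Toy counting Bloom filter: each slot holds a counter instead of a bit,
    so deletions of previously added items become possible."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.counts = [0] * m
    def _positions(self, item):
        return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]
    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1
    def remove(self, item):
        # Only safe for items previously added; removing an item that was
        # never added can zero counters of others, causing false negatives.
        for p in self._positions(item):
            if self.counts[p] > 0:
                self.counts[p] -= 1
    def __contains__(self, item):
        return all(self.counts[p] > 0 for p in self._positions(item))

cbf = CountingBloom()
cbf.add("dog"); cbf.add("dog")    # multiset: same element added twice
cbf.remove("dog")
print("dog" in cbf)               # True: one instance is still counted
```

The cost is space: with 4-bit counters as in Fan et al., the filter is four times larger than a plain bit array of the same length.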

When elements are added to a filter beyond its design capacity, false positives increase. Scalable Bloom filter allows us to grow the filter size as more items are added.

When filters have to be exchanged over a network (such as in web caching applications), Compressed Bloom filter saves on network bandwidth. By choosing the hash functions and thereby changing the way bits are distributed in the filter, better compression is achieved.

• Could you mention some useful resources on the Bloom filter?

Survey paper by Luo et al. (2019) is worth reading. Jason Davies has published online an interactive visualization of the Bloom filter. JavaScript code for this is also available.

C++ library libbf contains implementations of the Bloom filter and some of its variants.

A Java implementation of the filter (and other probabilistic data structures) can be found within stream-lib. Guava library in Java also has an implementation.

Packages offering the filter are available in Python as well as in Node.js.

## Sample Code

• # Source: https://www.kdnuggets.com/2016/08/gentle-introduction-bloom-filter.html
# Accessed 2020-05-11
# Requires third-party packages: pip install mmh3 bitarray

import mmh3
from bitarray import bitarray

class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # Each item sets hash_count bits, using MurmurHash with varying seeds
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1

        return self

    def __contains__(self, item):
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False

        return out

def main():
    bloom = BloomFilter(100, 10)
    animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
               'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
               'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']

    # First insertion of animals into the bloom filter
    for animal in animals:
        bloom.add(animal)

    # Membership existence for already inserted animals
    # There should not be any false negatives
    for animal in animals:
        if animal in bloom:
            print('{} is in bloom filter as expected'.format(animal))
        else:
            print('Something went terribly wrong for {}'.format(animal))
            print('FALSE NEGATIVE!')

    # Membership existence for not inserted animals
    # There could be false positives
    other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                     'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                     'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                     'hawk']
    for other_animal in other_animals:
        if other_animal in bloom:
            print('{} is not in the bloom, but a false positive'.format(other_animal))
        else:
            print('{} is not in the bloom filter as expected'.format(other_animal))

if __name__ == '__main__':
    main()

## Tags

• Probabilistic Data Structures
• Count-Min Sketch
• Hash Function
• Feature Hashing
• Cuckoo Hashing
• Bit Array


## Cite As

Devopedia. 2020. "Bloom Filter." Version 2, May 12. Accessed 2020-05-25. https://devopedia.org/bloom-filter