# Bloom Filter

## Summary

Given a set of elements, suppose we wish to know if a particular element is present in this set. There are many searching algorithms to do this. When faced with a huge set (millions of elements), even with an efficient search algorithm, storage is a problem. There's also latency due to disk access. Bloom filter is one possible solution.

Bloom filter is a data structure that stores the original set in a more compact form with support for **set membership queries**, that is, to query if an element is a member of the set. Bloom filter is a **space-efficient probabilistic data structure**.^{}

With the rise of big data since the mid-2000s, there's been increased interest in Bloom filter.^{} From the original Bloom filter (1970), many variants have been proposed for a diverse range of applications.^{}

## Milestones

## Discussion

Could you explain the Bloom filter with an example? A Bloom filter is simply a

**bit array**of length \(m\) for storing elements of set \(S=\{x_1,x_2,\ldots,x_n\}\). The filter starts with all zeros, meaning that the set is empty. When an element is added, it is hashed using \(k\)**independent hash functions**. Each function will output a bit position in the filter, which is then set to one. Thus, each element will set \(k\) bits of the filter.^{}When a query is made, the queried element is put through the \(k\) hash functions. The resulting bit positions are checked in the filter. If at least one of these is a zero, we're certain that the element is not in the set. If all bits are ones, then it

*may be*in the set. Thus, the filter is probabilistic. The reason it's probabilistic is that a query may check against bit positions set by more than one stored element.^{}In the figure, we can say that \(x_4\) is definitely not in S but \(x_1\) (

*true positive*) and \(x_5\) (*false positive*) are most probably in S.^{}Under what conditions can I use the Bloom filter? Suppose the records of a large database table are stored on disk and need to be queried. To avoid slow disk access, a Bloom filter representation of the data is kept in fast memory. If the Bloom filter says that a record is absent, we're sure that it's absent and avoid disk access. If the Bloom filter says the record is present, we're not sure. We need to query the database to find out.

**False positive**is when the filter says something is present when it's not. False positives should be at a level acceptable to the application.^{}When most queried records are absent in the database, Bloom filter gives performance gains.^{}Back in 2003, Broder and Mitzenmacher stated the

*Bloom filter principle*,^{}Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.

Multiple elements can set the same bit in the filter.

^{}Therefore, elements once added can't be removed from the filter. Neither can we obtain the list of elements added. Standard Bloom filter is not suited for applications that need these features.^{}^{}How do I select the parameters of the Bloom filter? There are essentially three parameters: filter size \(m\), number of hash functions \(k\) and number of elements to be stored \(n\).

For lower false positives, use higher values of \(m\) and \(k\) with a tradeoff. Higher \(m\) means more memory consumption. Higher \(k\) means more computation. Since higher \(k\) sets more bits, there's an optimal value \(k_{opt}\) beyond which false positives start to increase.

^{}When adding elements beyond the design limit, false positives go up. To achieve a fixed false positive probability, we should linearly scale \(m\) as \(n\) increases. In fact, the bit-to-element ratio is bounded as \(m/n\geqslant1/ln(2)\).

^{}In practice, we estimate \(n\) in advance for our application. We decide on the false positive probability we're willing to tolerate. We select \(m\) and calculate \(k\), or select \(k\) and calculate \(m\). A handy calculator is available online.

^{}If parameters are not chosen correctly, an attacker can corrupt the filter with well-chosen inputs.

^{}Which hash functions are suitable for use in a Bloom filter? Hash functions should ideally be independent, meaning that for the same input they set different bits of the filter. Otherwise, hash collisions become more common, leading to higher false positives. Functions should also distribute the input space uniformly.

^{}Hash functions should be fast. The use of cryptographic hash functions is possible but not really required.

^{}Among the non-cryptographic hash functions suited for Bloom filter are MurmurHash, Fowler–Noll–Vo (FNV) and Jenkins.^{}One developer showed how replacing MD5 with MurmurHash brought performance improvements.^{}It's possible to generate \(k\) hash functions from just two hash functions. One researcher pointed out that if the filter size is 32 bits, on a 64-bit machine a single hash function can give two 32-bit hash values, thus simulating two hash functions.

^{}Which are some real-world applications of the Bloom filter? For network-related applications, Bloom filter is used to collaborate in peer-to-peer networks, resource routing, packet routing, and measurements.

^{}Content Delivery Networks (CDNs) use Bloom filters to avoid caching files seen only once.^{}For applications that use databases, Bloom filter enables efficient searches, privacy preservation, content synchronization, and duplicate detection.

^{}Medium uses Bloom filter to deduplicate recommendations.

^{}In a streaming application processing 17 million events per day per partition, Bloom filter was used to deduplicate events at scale. The filter needed 108Gb across 1024 partitions. Reads and writes were 20x and 3x faster respectively.^{}Chrome browser uses the filter to represent a set of malicious URLs. When user requests a URL, if the filter indicates a probable match, the request is sent to a server to check if the URL is indeed safe.

^{}In a web application, Bloom filters could be used to keep count of visitors coming from a specific city accessing a webpage.

^{}How's the performance of the Bloom filter? A Bloom filter with one hash function is equivalent to ordinary hashing. Bloom filter is a generalization of hashing. It allows for more interesting design tradeoffs.

^{}The complexity to add an element or to query the filter is fixed at \(O(k)\). In other words, it's independent of how many items have been added to the filter.

^{}Bloom filter simplifies some operations. Suppose we have two filters to represents sets \(S_1\) and \(S_2\). The union of the these two sets is simply an OR operation of the two filters. Intersection of the two sets can also be approximated with the filters. Another trick is to half the size of the filter with an OR operation of the first and second halves of the filter.

^{}What are some variants of the Bloom filter? Bloom filter has inspired dozens of variants. One survey paper from 2019 mentioned more than 60 variants, their capabilities, performance and application areas.

^{}A variant might aim to reduce false positives, improve performance or provide additional features.^{}**Counting Bloom filter**allows us to implement a multiset so that multiple instances of the same element can be stored. Some of these support deletions as well but at the cost of false negatives. Bitwise Bloom filter and spectral Bloom filter improve on the counting Bloom filter.^{}When elements are added to a filter beyond it's designed capacity, false positives increase.

**Scalable Bloom filter**allows us to grow the filter size as more items are added.^{}When filters have to be exchanged over a network (such as in web caching applications),

**Compressed Bloom filter**saves on network bandwidth. By choosing the hash functions and thereby changing the way bits are distributed in the filter, better compression is achieved.^{}Could you mention some useful resources on the Bloom filter? Survey paper by Luo et al. (2019) is worth reading. Jason Davies has published online an interactive visualization of the Bloom filter. JavaScript code for this is also available.

C++ library libbf contains implementations of the Bloom filter and some of its variants.

^{}A Java implementation of the filter (and other probabilistic data structures) can be found within stream-lib.

^{}Guava library in Java also has an implementation.^{}Packages offering the filter are available in Python as well as in Node.js.

^{}

## Sample Code

## References

- Akyildiz, Bugra. 2016. "A Gentle Introduction to Bloom Filter." KDNuggets, August. Accessed 2020-05-11.
- Almeida, Paulo Sérgio, Carlos Baquero, Nuno Preguiça, and David Hutchison. 2007. "Scalable Bloom Filters." Information Processing Letters, Elsevier, vol. 101, no. 6, pp. 255-261, March 31. Accessed 2020-05-11.
- Batson, Alan. 1965. "The organization of symbol tables." Communications of the ACM, vol. 8, no. 2, pp. 111-112, February. Accessed 2020-05-12.
- Bloom, Burton H. 1970. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, pp. 422-426, July. Accessed 2020-05-11.
- Bose, Prosenjit, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, and Yihui Tang. 2008. "On the false-positive rate of Bloom filters." Information Processing Letters, Elsevier, vol. 108, no. 4, pp. 210-213, October 31. doi:10.1016/j.ipl.2008.05.018. Accessed 2020-05-11.
- Broder, Andrei, and Michael Mitzenmacher. 2003. "Network Applications of Bloom Filters: A Survey." Internet Mathematics, vol. 1, no. 4, pp. 485-509, November. Accessed 2020-05-11.
- Davies, Jason. 2020. "Bloom Filters." Accessed 2020-05-11.
- Fan, Li, Pei Cao, Jussara Almeida, and Andrei Z. Broder. 2000. "Summary cache: a scalable wide-area web cache sharing protocol." IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281-293, June. Accessed 2020-05-11.
- Fitzgerald, William. 2011. "Producing n hash functions by hashing only once." Will.Whim blog, September 3. Accessed 2020-05-11.
- GeeksforGeeks. 2017. "Bloom Filters – Introduction and Python Implementation." GeeksforGeeks, April 17. Updated 2018-02-08. Accessed 2020-05-11.
- Gerbet, Thomas, Amrit Kumar, and Cédric Lauradoux. 2014. "The Power of Evil Choices in Bloom Filters." Research Report RR-8627, INRIA Grenoble, hal-01082158v2, November. Accessed 2020-05-11.
- Hurst, Thomas. 2018. "Bloom Filter Calculator." October 15. Accessed 2020-05-11.
- Katsov, Ilya. 2012. "Probabilistic Data Structures for Web Analytics and Data Mining." Highly Scalable Blog, May 1. Accessed 2020-05-11.
- Kirsch, Adam, and Michael Mitzenmacher. 2006. "Less Hashing, Same Performance: Building a Better Bloom Filter." In Y. Azar and T. Erlebach (Eds.), Proc. of European Symposium on Algorithms (ESA), Springer-Verlag Berlin, Heidelberg, LNCS 4168, pp. 456–467, September 11-13. Accessed 2020-05-11.
- Luo, Lailong, Deke Guo, Richard T.B. Ma, Ori Rottenstreich, and Xueshan Luo. 2019. "Optimizing Bloom Filter: Challenges, Solutions, and Comparisons." arXiv, v2, January 7. Accessed 2020-05-11.
- Martí, Vicent. 2012. "Some performance tweaks." bitly/dablooms, on GitHub, August 5. Accessed 2020-05-11.
- Nath, Kousik. 2018. "Bloom Filter: A simple but interesting data structure." Data Driven Investor, on Medium, September 23. Accessed 2020-05-11.
- Nilsson, Stefan. 2018. "Your basic int: a most powerful data type." Yourbasic.org, February 21. Updated 2019-05-14. Accessed 2020-05-11.
- Patgiri, Ripon, Sabuzima Nayak, and Samir Kumar Borgohain. 2019. "Hunting the Pertinency of Bloom Filter in Computer Networking and Beyond: A Survey." Journal of Computer Networks and Communications, Hindawi, February 5. doi:10.1155/2019/2712417. Accessed 2020-05-11.
- Pellow, David, Darya Filippova, and Carl Kingsford. 2017. "Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters." Journal of Computational Biology, 24(6): 547–557, June. Accessed 2020-05-11.
- Rae, Jack W, Sergey Bartunov, and Timothy P Lillicrap. 2019. "Meta-Learning Neural Bloom Filters." Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97. Accessed 2020-05-11.
- Singh, Gurminder. 2019. "Deduplication at Scale." Amplitude Blog, on Medium, May 24. Accessed 2020-05-11.
- Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. 2012. "Theory and Practice of Bloom Filters for Distributed Systems." IEEE Communications Surveys & Tutorials, vol. 14, no. 1, pp. 131-155, March. Accessed 2020-05-11.
- Vallentin, Matthias. 2011. "A Garden Variety of Bloom Filters." Blog, June 14. Updated 2013-07-17. Accessed 2020-05-11.
- Wikipedia. 2020a. "Bloom filter." Wikipedia, April 29. Accessed 2020-05-11.
- Wikipedia. 2020b. "Hash function." Wikipedia, May 6. Accessed 2020-05-12.

## Milestones

## Tags

## See Also

- Probabilistic Data Structures
- Count-Min Sketch
- Hash Function
- Feature Hashing
- Cuckoo Hashing
- Bit Array

## Further Reading

- Bloom, Burton H. 1970. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, pp. 422-426, July. Accessed 2020-05-11.
- Luo, Lailong, Deke Guo, Richard T.B. Ma, Ori Rottenstreich, and Xueshan Luo. 2019. "Optimizing Bloom Filter: Challenges, Solutions, and Comparisons." arXiv, v2, January 7. Accessed 2020-05-11.
- Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. 2012. "Theory and Practice of Bloom Filters for Distributed Systems." IEEE Communications Surveys & Tutorials, vol. 14, no. 1, pp. 131-155, March. Accessed 2020-05-11.
- Hurst, Thomas. 2018. "Bloom Filter Calculator." October 15. Accessed 2020-05-11.
- Vallentin, Matthias. 2011. "A Garden Variety of Bloom Filters." Blog, June 14. Updated 2013-07-17. Accessed 2020-05-11.
- Mill, Bill. 2017. "Bloom Filters by Example." September 27. Accessed 2020-05-11.