Bloom Filter
 Summary

Discussion
 Could you explain the Bloom filter with an example?
 Under what conditions can I use the Bloom filter?
 How do I select the parameters of the Bloom filter?
 Which hash functions are suitable for use in a Bloom filter?
 Which are some realworld applications of the Bloom filter?
 How's the performance of the Bloom filter?
 What are some variants of the Bloom filter?
 Could you mention some useful resources on the Bloom filter?
 Milestones
 Sample Code
 References
 Further Reading
 Article Stats
 Cite As
Given a set of elements, suppose we wish to know if a particular element is present in this set. There are many searching algorithms to do this. When faced with a huge set (millions of elements), even with an efficient search algorithm, storage is a problem. There's also latency due to disk access. Bloom filter is one possible solution.
Bloom filter is a data structure that stores the original set in a more compact form with support for set membership queries, that is, to query if an element is a member of the set. Bloom filter is a spaceefficient probabilistic data structure.^{}
With the rise of big data since the mid2000s, there's been increased interest in Bloom filter.^{} From the original Bloom filter (1970), many variants have been proposed for a diverse range of applications.^{}
Discussion
Could you explain the Bloom filter with an example? A Bloom filter is simply a bit array of length \(m\) for storing elements of set \(S=\{x_1,x_2,\ldots,x_n\}\). The filter starts with all zeros, meaning that the set is empty. When an element is added, it is hashed using \(k\) independent hash functions. Each function will output a bit position in the filter, which is then set to one. Thus, each element will set \(k\) bits of the filter.^{}
When a query is made, the queried element is put through the \(k\) hash functions. The resulting bit positions are checked in the filter. If at least one of these is a zero, we're certain that the element is not in the set. If all bits are ones, then it may be in the set. Thus, the filter is probabilistic. The reason it's probabilistic is that a query may check against bit positions set by more than one stored element.^{}
In the figure, we can say that \(x_4\) is definitely not in S but \(x_1\) (true positive) and \(x_5\) (false positive) are most probably in S.^{}
Under what conditions can I use the Bloom filter? Suppose the records of a large database table are stored on disk and need to be queried. To avoid slow disk access, a Bloom filter representation of the data is kept in fast memory. If the Bloom filter says that a record is absent, we're sure that it's absent and avoid disk access. If the Bloom filter says the record is present, we're not sure. We need to query the database to find out. False positive is when the filter says something is present when it's not. False positives should be at a level acceptable to the application.^{} When most queried records are absent in the database, Bloom filter gives performance gains.^{}
Back in 2003, Broder and Mitzenmacher stated the Bloom filter principle,^{}
Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated.
Multiple elements can set the same bit in the filter.^{} Therefore, elements once added can't be removed from the filter. Neither can we obtain the list of elements added. Standard Bloom filter is not suited for applications that need these features.^{} ^{}
How do I select the parameters of the Bloom filter? There are essentially three parameters: filter size \(m\), number of hash functions \(k\) and number of elements to be stored \(n\).
For lower false positives, use higher values of \(m\) and \(k\) with a tradeoff. Higher \(m\) means more memory consumption. Higher \(k\) means more computation. Since higher \(k\) sets more bits, there's an optimal value \(k_{opt}\) beyond which false positives start to increase.^{}
When adding elements beyond the design limit, false positives go up. To achieve a fixed false positive probability, we should linearly scale \(m\) as \(n\) increases. In fact, the bittoelement ratio is bounded as \(m/n\geqslant1/ln(2)\).^{}
In practice, we estimate \(n\) in advance for our application. We decide on the false positive probability we're willing to tolerate. We select \(m\) and calculate \(k\), or select \(k\) and calculate \(m\). A handy calculator is available online.^{}
If parameters are not chosen correctly, an attacker can corrupt the filter with wellchosen inputs.^{}
Which hash functions are suitable for use in a Bloom filter? Hash functions should ideally be independent, meaning that for the same input they set different bits of the filter. Otherwise, hash collisions become more common, leading to higher false positives. Functions should also distribute the input space uniformly.^{}
Hash functions should be fast. The use of cryptographic hash functions is possible but not really required.^{} Among the noncryptographic hash functions suited for Bloom filter are MurmurHash, Fowler–Noll–Vo (FNV) and Jenkins.^{} One developer showed how replacing MD5 with MurmurHash brought performance improvements.^{}
It's possible to generate \(k\) hash functions from just two hash functions. One researcher pointed out that if the filter size is 32 bits, on a 64bit machine a single hash function can give two 32bit hash values, thus simulating two hash functions.^{}
Which are some realworld applications of the Bloom filter? For networkrelated applications, Bloom filter is used to collaborate in peertopeer networks, resource routing, packet routing, and measurements.^{} Content Delivery Networks (CDNs) use Bloom filters to avoid caching files seen only once.^{}
For applications that use databases, Bloom filter enables efficient searches, privacy preservation, content synchronization, and duplicate detection.^{}
Medium uses Bloom filter to deduplicate recommendations.^{} In a streaming application processing 17 million events per day per partition, Bloom filter was used to deduplicate events at scale. The filter needed 108Gb across 1024 partitions. Reads and writes were 20x and 3x faster respectively.^{}
Chrome browser uses the filter to represent a set of malicious URLs. When user requests a URL, if the filter indicates a probable match, the request is sent to a server to check if the URL is indeed safe.^{}
In a web application, Bloom filters could be used to keep count of visitors coming from a specific city accessing a webpage.^{}
How's the performance of the Bloom filter? A Bloom filter with one hash function is equivalent to ordinary hashing. Bloom filter is a generalization of hashing. It allows for more interesting design tradeoffs.^{}
The complexity to add an element or to query the filter is fixed at \(O(k)\). In other words, it's independent of how many items have been added to the filter.^{}
Bloom filter simplifies some operations. Suppose we have two filters to represents sets \(S_1\) and \(S_2\). The union of the these two sets is simply an OR operation of the two filters. Intersection of the two sets can also be approximated with the filters. Another trick is to half the size of the filter with an OR operation of the first and second halves of the filter.^{}
What are some variants of the Bloom filter? Bloom filter has inspired dozens of variants. One survey paper from 2019 mentioned more than 60 variants, their capabilities, performance and application areas.^{} A variant might aim to reduce false positives, improve performance or provide additional features.^{}
Counting Bloom filter allows us to implement a multiset so that multiple instances of the same element can be stored. Some of these support deletions as well but at the cost of false negatives. Bitwise Bloom filter and spectral Bloom filter improve on the counting Bloom filter.^{}
When elements are added to a filter beyond it's designed capacity, false positives increase. Scalable Bloom filter allows us to grow the filter size as more items are added.^{}
When filters have to be exchanged over a network (such as in web caching applications), Compressed Bloom filter saves on network bandwidth. By choosing the hash functions and thereby changing the way bits are distributed in the filter, better compression is achieved.^{}
Could you mention some useful resources on the Bloom filter? Survey paper by Luo et al. (2019) is worth reading. Jason Davies has published online an interactive visualization of the Bloom filter. JavaScript code for this is also available.
C++ library libbf contains implementations of the Bloom filter and some of its variants.^{}
A Java implementation of the filter (and other probabilistic data structures) can be found within streamlib.^{} Guava library in Java also has an implementation.^{}
Packages offering the filter are available in Python as well as in Node.js.^{}
Milestones
In the construction of a symbol table in compiler design, Batson describes a method that does faster search for symbols. Rather than do a slow linear lookup, an arithmetic transformation is done on the symbol. The result gives the address where the symbol will be stored (or retrieved) in the table.^{} The term hashing becomes common by late 1960s for this sort of transformation on the input.^{}
Burton H. Bloom publishes a paper titled Space/Time Tradeoffs in Hash Coding with Allowable Errors. In conventional hashcoding methods, many lookups are needed when an element doesn't belong to the set. He therefore proposes a novel method to reduce the reject time, time taken to declare that an element is a nonmember of the set. To allow for smaller hash area, he accepts some fraction of errors. With smaller hash area, hashes can be stored in core memory for faster access.^{}
Fan et al. propose the Counting Bloom Filter in the context web proxy caching. Since caches have to be updated, they find it necessary to keep a count of how many times a bit position in the filter has been set. They propose 4 bits per filter bit while keeping false negatives low. When a request comes in, the proxy checks its own cache. If absent, it checks the Bloom filters of participating proxies. Based on the result, it decides whether to query another proxy. Bloom filters are updated among proxies based on defined thresholds.^{}
Shanmugasundaram et al. propose the Hierarchical Bloom Filter to address the problem of substring matching. In addition to the substrings, the filter also stores concatenation of substrings, thus creating an hierarchy. It's applied in the context of IP traffic to attribute payloads to source and destination hosts.^{}
Borrowing from standard hashing, Kirsch and Mitzenmacher propose how to simulate multiple hash functions for the Bloom filter from just two independent uniform random hash functions. Their \(k\) hash functions are of the form \(g_i(x) = h_1(x) + i \cdot h_2(x)\,mod\,p\), for \(i\) in range \([0,k1]\) and where \(p\) is prime, the filter has \(kp\) bits and each function has range \([0,p1]\).^{}
Almeida et al propose the Scalable Bloom Filter. Sometimes it's difficult to estimate upfront the maximum number of elements that need to be stored in a Bloom filter. Designers therefore tend to waste space by overestimating, sometimes by orders of magnitude. By allowing the filter size to scale dynamically, the scalable Bloom filter guarantees that false positives don't exceed a defined design threshold. It's parameterized by initial size, error probability, growth rate and error probability tightening rate.^{}
Bose et al. observe that the original analysis of false positive rate by Bloom and others is not exact. The give an exact formula for the false positive rate plus lower and upper bounds. The rate is slightly larger than originally calculated but the difference is negligible for large \(m\) and small \(k\).^{}
Rae et al. propose the Neural Bloom Filter that uses an artificial neural network with oneshot metalearning. Softmax is computed on contentbased attention between the query q and a learned address matrix A. Thus the network learns where to store elements based on their content. This is more space efficient but at the cost of performance for a single query. It outperforms standard Bloom filter on batch queries parallelized on GPUs.^{}
Sample Code
References
 Akyildiz, Bugra. 2016. "A Gentle Introduction to Bloom Filter." KDNuggets, August. Accessed 20200511.
 Almeida, Paulo Sérgio, Carlos Baquero, Nuno Preguiça, and David Hutchison. 2007. "Scalable Bloom Filters." Information Processing Letters, Elsevier, vol. 101, no. 6, pp. 255261, March 31. Accessed 20200511.
 Batson, Alan. 1965. "The organization of symbol tables." Communications of the ACM, vol. 8, no. 2, pp. 111112, February. Accessed 20200512.
 Bloom, Burton H. 1970. "Space/Time Tradeoffs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, pp. 422426, July. Accessed 20200511.
 Bose, Prosenjit, Hua Guo, Evangelos Kranakis, Anil Maheshwari, Pat Morin, Jason Morrison, Michiel Smid, and Yihui Tang. 2008. "On the falsepositive rate of Bloom filters." Information Processing Letters, Elsevier, vol. 108, no. 4, pp. 210213, October 31. doi:10.1016/j.ipl.2008.05.018. Accessed 20200511.
 Broder, Andrei, and Michael Mitzenmacher. 2003. "Network Applications of Bloom Filters: A Survey." Internet Mathematics, vol. 1, no. 4, pp. 485509, November. Accessed 20200511.
 Davies, Jason. 2020. "Bloom Filters." Accessed 20200511.
 Fan, Li, Pei Cao, Jussara Almeida, and Andrei Z. Broder. 2000. "Summary cache: a scalable widearea web cache sharing protocol." IEEE/ACM Transactions on Networking, vol. 8, no. 3, pp. 281293, June. Accessed 20200511.
 Fitzgerald, William. 2011. "Producing n hash functions by hashing only once." Will.Whim blog, September 3. Accessed 20200511.
 GeeksforGeeks. 2017. "Bloom Filters – Introduction and Python Implementation." GeeksforGeeks, April 17. Updated 20180208. Accessed 20200511.
 Gerbet, Thomas, Amrit Kumar, and Cédric Lauradoux. 2014. "The Power of Evil Choices in Bloom Filters." Research Report RR8627, INRIA Grenoble, hal01082158v2, November. Accessed 20200511.
 Hurst, Thomas. 2018. "Bloom Filter Calculator." October 15. Accessed 20200511.
 Katsov, Ilya. 2012. "Probabilistic Data Structures for Web Analytics and Data Mining." Highly Scalable Blog, May 1. Accessed 20200511.
 Kirsch, Adam, and Michael Mitzenmacher. 2006. "Less Hashing, Same Performance: Building a Better Bloom Filter." In Y. Azar and T. Erlebach (Eds.), Proc. of European Symposium on Algorithms (ESA), SpringerVerlag Berlin, Heidelberg, LNCS 4168, pp. 456–467, September 1113. Accessed 20200511.
 Luo, Lailong, Deke Guo, Richard T.B. Ma, Ori Rottenstreich, and Xueshan Luo. 2019. "Optimizing Bloom Filter: Challenges, Solutions, and Comparisons." arXiv, v2, January 7. Accessed 20200511.
 Martí, Vicent. 2012. "Some performance tweaks." bitly/dablooms, on GitHub, August 5. Accessed 20200511.
 Nath, Kousik. 2018. "Bloom Filter: A simple but interesting data structure." Data Driven Investor, on Medium, September 23. Accessed 20200511.
 Nilsson, Stefan. 2018. "Your basic int: a most powerful data type." Yourbasic.org, February 21. Updated 20190514. Accessed 20200511.
 Patgiri, Ripon, Sabuzima Nayak, and Samir Kumar Borgohain. 2019. "Hunting the Pertinency of Bloom Filter in Computer Networking and Beyond: A Survey." Journal of Computer Networks and Communications, Hindawi, February 5. doi:10.1155/2019/2712417. Accessed 20200511.
 Pellow, David, Darya Filippova, and Carl Kingsford. 2017. "Improving Bloom Filter Performance on Sequence Data Using kmer Bloom Filters." Journal of Computational Biology, 24(6): 547–557, June. Accessed 20200511.
 Rae, Jack W, Sergey Bartunov, and Timothy P Lillicrap. 2019. "MetaLearning Neural Bloom Filters." Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97. Accessed 20200511.
 Singh, Gurminder. 2019. "Deduplication at Scale." Amplitude Blog, on Medium, May 24. Accessed 20200511.
 Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. 2012. "Theory and Practice of Bloom Filters for Distributed Systems." IEEE Communications Surveys & Tutorials, vol. 14, no. 1, pp. 131155, March. Accessed 20200511.
 Vallentin, Matthias. 2011. "A Garden Variety of Bloom Filters." Blog, June 14. Updated 20130717. Accessed 20200511.
 Wikipedia. 2020a. "Bloom filter." Wikipedia, April 29. Accessed 20200511.
 Wikipedia. 2020b. "Hash function." Wikipedia, May 6. Accessed 20200512.
Further Reading
 Bloom, Burton H. 1970. "Space/Time Tradeoffs in Hash Coding with Allowable Errors." Communications of the ACM, vol. 13, no. 7, pp. 422426, July. Accessed 20200511.
 Luo, Lailong, Deke Guo, Richard T.B. Ma, Ori Rottenstreich, and Xueshan Luo. 2019. "Optimizing Bloom Filter: Challenges, Solutions, and Comparisons." arXiv, v2, January 7. Accessed 20200511.
 Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. 2012. "Theory and Practice of Bloom Filters for Distributed Systems." IEEE Communications Surveys & Tutorials, vol. 14, no. 1, pp. 131155, March. Accessed 20200511.
 Hurst, Thomas. 2018. "Bloom Filter Calculator." October 15. Accessed 20200511.
 Vallentin, Matthias. 2011. "A Garden Variety of Bloom Filters." Blog, June 14. Updated 20130717. Accessed 20200511.
 Mill, Bill. 2017. "Bloom Filters by Example." September 27. Accessed 20200511.