A hash table is a foundational data structure known for its average constant-time insertion, search, and deletion, given a well-chosen hash function. However, hash tables are subject to collisions, which slow operations down and are typically mitigated by allocating extra memory.
A Bloom filter, a probabilistic data structure, reduces this memory usage in exchange for a small loss of accuracy. It is structured as a bit array, initially all zeroes, together with a set of pre-selected hash functions that map objects to indices in the array.
For example, a Bloom filter could be 13 bits wide with 3 hash functions, each mapping an input object to the range [0, 12]. To add an object, compute its three hash values and set the array elements at those indices to one. A one at an array element is partial evidence that some object hashing to that index has been inserted.
To check whether an object exists, compute its hash values. If any of the associated array elements is zero, the object definitely does not exist; if all of them are one, it probably exists. This “probably” is what makes the Bloom filter a probabilistic data structure: it can return false positives when a non-existent object’s hashes all land on bits set by existing objects. The false-positive rate rises as the number of inserted objects grows relative to the size of the array.
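The insertion and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it reuses the hypothetical 13-bit, 3-hash example, and simulates independent hash functions by salting SHA-256 with a counter.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a 13-bit array and 3 hash functions
    (simulated here by salting SHA-256 with a small integer)."""

    def __init__(self, size: int = 13, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _indices(self, item: str):
        # Each salt yields one index in [0, size - 1].
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item: str) -> bool:
        # Zero at any index: definitely absent. All ones: probably present.
        return all(self.bits[i] for i in self._indices(item))


bf = BloomFilter()
bf.add("apple")
print("apple" in bf)   # True
print("pear" in bf)    # False, unless its indices happen to collide with "apple"'s
```

Note that a lookup for an inserted object always succeeds; only lookups for absent objects can go wrong, and only in the false-positive direction.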
To calculate the probability of false positives, consider three factors: the array size m, the number of inserted objects n, and the number of hash functions k. The probability falls as the array grows or as fewer objects are inserted, while k has an optimal value, roughly (m/n) ln 2, that minimizes it for a given m and n. Before adding any objects, you can therefore choose the number of hash functions from the array size and an estimate of how many objects will be inserted.
To further minimize false positives, multiple independent Bloom filters can be combined, such that an object is only considered to exist if it appears in all filters. However, a Bloom filter has limitations: the standard form does not support deletion, and neither the number of hash functions nor the array size can be changed after setup.
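The combination idea can be sketched as follows, under the assumption that each filter uses its own independent hash family (simulated here with a per-filter seed); a combined lookup reports a hit only when every filter does, which, for independent filters, multiplies their individual false-positive rates.

```python
import hashlib


def bit_indices(item: str, seed: int, size: int = 64, hashes: int = 3):
    """Indices for `item` in a filter whose hashes are salted with `seed`."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{i}:{item}".encode()).digest(), "big") % size
        for i in range(hashes)
    ]


class SeededBloom:
    def __init__(self, seed: int, size: int = 64):
        self.seed, self.size, self.bits = seed, size, [0] * size

    def add(self, item: str) -> None:
        for i in bit_indices(item, self.seed, self.size):
            self.bits[i] = 1

    def query(self, item: str) -> bool:
        return all(self.bits[i] for i in bit_indices(item, self.seed, self.size))


# Two filters with independent hash families.
filters = [SeededBloom(seed=s) for s in (17, 42)]
for f in filters:
    f.add("apple")


def combined_query(item: str) -> bool:
    # An object "exists" only if every filter reports a hit.
    return all(f.query(item) for f in filters)


print(combined_query("apple"))   # True
```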
Despite these drawbacks, Bloom filters are widely used in large systems, including databases such as Apache HBase, Apache Cassandra, and PostgreSQL, where they quickly rule out non-existent rows or columns, a much faster check than a disk lookup. Medium uses Bloom filters to keep track of pages already recommended to users. Previously, Google Chrome implemented Bloom filters to identify safe URLs, reducing the need for full URL checks.
In short, a Bloom filter trades a slight loss of accuracy for substantial memory savings, making it a robust building block for many distributed systems. By tuning the number of hash functions and the size of the array, users can find the balance between precision and memory that suits their workload.