
System Design: Bloom Filter


Smartly transforming a hash table into a probabilistic data structure to trade a little accuracy for big memory gains

The hash table is one of the most widely known and used data structures. With a good choice of hash function, a hash table can serve insertion, search and deletion queries in constant time on average.

The principal drawback of the hash table is collisions. One of the standard ways to reduce them is to increase the hash table size. While this approach generally works well, we are sometimes still constrained by the amount of memory available.

It is worth recalling that a hash table always provides a correct response to any query. It may suffer from collisions and occasionally be slow, but it always guarantees a 100% correct answer. It turns out that in some systems we do not always need correct answers to queries. Such a small decrease in accuracy can be traded for improvements in other aspects of the system.

In this article, we will look at a data structure called a Bloom filter. In simple words, it is a modified version of a standard hash table which trades a small decrease in accuracy for significant memory savings.

A Bloom filter is organised as a boolean array of size m. Initially, all of its elements are set to 0 (false). In addition, k hash functions are chosen that take objects as input and map them to the range [0, m - 1]. Every output value corresponds to the array element at that index.

For better results, it is recommended that the hash functions produce values whose distribution is close to uniform.

In our example, we will use a Bloom filter of size m = 13 with k = 3 hash functions. Each of these functions maps an input object to the range [0, 12].
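To make the structure concrete, here is a minimal Python sketch of this setup. Deriving the k indexes by salting a single SHA-256 hash is purely an illustrative assumption; production filters typically use faster non-cryptographic hashes (murmur, xxhash) and a packed bit array.

```python
import hashlib

M = 13  # array size
K = 3   # number of hash functions

def hash_indexes(item: str, m: int = M, k: int = K, seed: int = 0) -> list[int]:
    """Map an item to k indexes in [0, m - 1] by salting a single SHA-256 hash.
    The seed parameter will let us build independent filters later."""
    return [
        int(hashlib.sha256(f"{seed}:{i}:{item}".encode()).hexdigest(), 16) % m
        for i in range(k)
    ]

bits = [False] * M  # the Bloom filter itself: m boolean cells, all initially 0
```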

Insertion

Each time a new object needs to be added, it is passed through the k predefined hash functions. For every output hash value, the array element at that index is set to 1 (true).

The “banana” object is added to the Bloom filter. The hash functions output the values 6, 2 and 9. The array elements at those indexes are set to 1.

If an array element at an index produced by a hash function is already set to 1, it simply stays 1.

The “apple” object is added to the Bloom filter. The array elements at indexes 10, 9 and 4 are set to 1. Even though the 9-th element of the array was already set to 1, its value does not change here.

Essentially, the presence of a 1 in any array element acts as partial evidence that an element hashing to that array index exists in the Bloom filter.
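Continuing the sketch above, insertion is just a matter of setting the addressed cells to 1. Note that the toy hash functions will generally not produce the exact indexes (6, 2, 9 and so on) shown in the illustrations.

```python
def bloom_add(bits: list[bool], item: str) -> None:
    """Set to 1 every cell addressed by one of the k hash functions."""
    for idx in hash_indexes(item, m=len(bits)):
        bits[idx] = True  # cells that are already 1 simply stay 1

bloom_add(bits, "banana")
bloom_add(bits, "apple")
```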

Search

To check whether an object exists, its k hash values are computed. There are two possible scenarios:

If there is at least one hash value for which the respective array element equals 0, the object does not exist.

During insertion, an object becomes associated with several array elements that are marked as 1. If the object had really been inserted into the filter, all of the hash functions would deterministically output the same sequence of indexes, all pointing to 1. Therefore, a hash value pointing to an array element equal to 0 clearly indicates that the object is not present in the data structure.

Checking whether the “orange” object is present in the Bloom filter. Since at least one hash function (exactly two in our case) outputs an index (7 and 12) whose array element equals 0, “orange” does not exist in the filter.

If for all hash values the respective array elements equal 1, the object probably exists (but not with 100% certainty).

This statement is exactly what makes the Bloom filter a probabilistic data structure. If an object was added before, then during a search the Bloom filter guarantees that its hash values will be the same, so the object will be found.

Checking whether the “banana” object is present in the Bloom filter. Since the hash functions are deterministic, they output exactly the same array positions that were used during the insertion of “banana”. As a result, “banana” is found in the filter.

However, the Bloom filter can produce a false positive response: the object does not actually exist, but the Bloom filter claims otherwise. This happens when all of the object's hash values point to array elements that were set to 1 by other, already inserted objects.

Example of a false positive response. Even though “cherry” was never added, the filter thinks it exists because all of the output hash values for “cherry” point to array elements whose values are 1.

False positive answers tend to occur when the number of inserted objects becomes relatively high compared to the size of the Bloom filter's array.
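Search, continuing the same sketch, mirrors insertion: a single 0 cell means a definite "no", while all 1s only mean a probable "yes".

```python
def bloom_might_contain(bits: list[bool], item: str) -> bool:
    """False if any addressed cell is 0 (definitely absent);
    True if all addressed cells are 1 (probably present)."""
    return all(bits[idx] for idx in hash_indexes(item, m=len(bits)))

print(bloom_might_contain(bits, "banana"))  # True: it was inserted earlier
print(bloom_might_contain(bits, "orange"))  # most likely False, though a false positive is possible
```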

Estimation of false positive errors

It is possible to estimate the probability of a false positive error given the Bloom filter's structure.

Image adapted by the author. Source: Bloom filter | Wikipedia
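For reference, the approximation shown in the image above, as given on the Wikipedia page, is

P(false positive) ≈ (1 − e^(−kn/m))^k,

where n is the number of inserted objects, m is the array size and k is the number of hash functions.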

The complete proof of this formula can be found on Wikipedia. Based on that expression, we can make a couple of interesting observations:

  • The FP probability decreases with an increase in the number of hash functions k (up to the optimal value discussed in the next observation), an increase in the array size m, and a decrease in the number of inserted objects n.
An increase in k, an increase in m or a decrease in n leads to a lower FP rate
  • Before inserting objects into the Bloom filter, we can find the optimal number of hash functions k that minimizes the FP probability, provided that we know the array size m and can estimate the number of objects n that will be inserted in the future (a small helper for computing it is sketched below this list).
The optimal number of hash functions k that minimizes the FP probability
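The optimum shown above is k = (m / n) · ln 2. A small helper, assuming m is fixed and n can be estimated in advance (the function names are illustrative):

```python
import math

def fp_probability(m: int, n: int, k: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> int:
    """Optimal number of hash functions, k = (m / n) * ln 2, rounded to an integer."""
    return max(1, round(m / n * math.log(2)))

# Example: a 10,000-cell filter expected to hold 1,000 objects
m, n = 10_000, 1_000
k = optimal_k(m, n)             # 7
print(fp_probability(m, n, k))  # roughly 0.008
```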

Another way to reduce the FP probability is to combine (AND conjunction) several independent Bloom filters. An element is ultimately considered present in the data structure only if it is reported as present by all of the Bloom filters.
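Using the seed parameter from the earlier sketch, the AND-combination of independently seeded filters can be illustrated like this:

```python
def combined_might_contain(item: str, filters: list[list[bool]], seeds: list[int]) -> bool:
    """AND-combination: report the item as present only if every
    independently seeded filter reports it as present."""
    return all(
        all(bits[idx] for idx in hash_indexes(item, m=len(bits), seed=seed))
        for bits, seed in zip(filters, seeds)
    )
```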

Constraints

  • Unlike hash tables, the standard implementation of a Bloom filter does not support deletion.
  • The number of hash functions k and the array size m chosen at the start cannot be modified later. If such a need arises, the only way to do it is to build another Bloom filter with the new settings and insert all of the previously stored objects into it.

According to the Wikipedia page, the Bloom filter is widely used in large systems:

  • Databases like Apache HBase, Apache Cassandra and PostgreSQL use the Bloom filter to check for non-existing rows or columns. This approach is considerably faster than performing disk lookups.
  • Medium uses the Bloom filter to filter out pages that have already been recommended to a user.
  • Google Chrome previously used the Bloom filter to identify malicious URLs. A URL was considered safe if the Bloom filter returned a negative response; otherwise, a full check was performed (a minimal sketch of this pattern follows below).
Google's algorithm for checking malicious URLs. Using the Bloom filter significantly reduced the number of computationally heavy full checks that would otherwise have been required for a large portion of safe links.
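A minimal sketch of that pattern, reusing the earlier toy filter; expensive_full_check is a hypothetical stand-in for whatever slow, authoritative lookup the system performs:

```python
def expensive_full_check(url: str) -> bool:
    """Hypothetical placeholder for the slow, authoritative lookup."""
    ...

def is_malicious(url: str, bits: list[bool]) -> bool:
    if not bloom_might_contain(bits, url):
        return False                   # a negative answer is always correct: the URL is safe
    return expensive_full_check(url)   # the slow check runs only for the few positive answers
```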

In this article, we have covered an alternative approach to building hash tables. When a small decrease in accuracy can be traded for more efficient memory usage, the Bloom filter turns out to be a robust solution in many distributed systems.

Varying the number of hash functions together with the Bloom filter's size allows us to find the most suitable balance between accuracy and performance requirements.

All images unless otherwise noted are by the author.
