Jan 24

Using a bloom filter to reduce expensive operations like disk IO

By kellabyte  //  Architecture, Databases  //  16 Comments

The best kind of optimizations are the ones that eliminate the need to do expensive but wasteful work. While analyzing the internals of some open source databases I’ve found that some of them spend a lot of time doing the wrong optimizations at the wrong times, resulting in wasteful work. The more interesting implementations try to detect when this expensive, wasteful work can be avoided and simply don’t do it.

You would be surprised at how much processing some databases do just to tell you that the thing you queried for doesn’t exist in the database.

A data structure that can sometimes help reduce wasteful work like this is a Bloom filter.

A Bloom filter (invented in 1970 by Burton Bloom) is a probabilistic data structure that answers whether an element is a member of a set, with a tunable level of accuracy. A Bloom filter can return false positives but never false negatives. If you ask it whether an element exists in the set, it may answer “yes, it’s a member of the set” even though it isn’t. If it answers no, however, you are guaranteed 100% that the element is not in the set. A Bloom filter can represent membership of a large set in a small number of bytes compared to holding an exact index, and because of this it is an effective data structure for many different purposes.

How a bloom filter works

A bloom filter is actually pretty simple. It is backed by a bit set whose length you choose (the length affects accuracy, but more on that later).

[Figure: an empty bloom filter bit set]

When a new element comes in, its value is hashed, and the hash decides which bit(s) in the bit set are set to record that this value has been seen.

[Figure: adding elements foo and bar to the bloom filter]

Now that the bits in the bit set have been set for foo and bar, we can query the bloom filter to tell us whether something has been seen before. The element is hashed as before, but instead of setting the bits, this time we check whether the bits that would have been set are already set; if they are, the bloom filter returns true, meaning the element has probably been seen before. With a small enough bit set and enough elements added, more and more bits get set and false positives increase.
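
To make the mechanics concrete, here is a minimal sketch of a bloom filter in Python. The class name, the choice of deriving the bit positions from two hash digests (double hashing over md5 and sha1), and the parameters are my own choices for illustration, not taken from any particular database.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: a bit set plus k derived bit positions per element."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the backing bit set, all zeros

    def _positions(self, item):
        # Derive k bit positions from two different hash digests (double hashing).
        data = item.encode("utf-8")
        h1 = int.from_bytes(hashlib.md5(data).digest(), "big")
        h2 = int.from_bytes(hashlib.sha1(data).digest(), "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # If any required bit is unset, the item was definitely never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=4)
bf.add("foo")
bf.add("bar")
print(bf.might_contain("foo"))   # True
print(bf.might_contain("quux"))  # False (or, rarely, a false positive)
```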

[Figure: querying the bloom filter for membership]

Bloom filter accuracy

The accuracy of a bloom filter can be adjusted through the bit set size and the number of hash functions used. If you have 1 million elements you want to add to the bloom filter, a bit set of 10 million bits with 4 hash functions gives roughly a 1.2% chance of returning a false positive, in a data structure you can hold in memory with 1.19 megabytes of RAM. The bloom filter will be much smaller than if you stored every element identifier in an index file.
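
Those figures come from the standard false positive approximation p ≈ (1 − e^(−kn/m))^k, where n is the number of elements added, m the number of bits, and k the number of hash functions. A quick sanity check of the numbers above:

```python
import math

n = 1_000_000    # elements added
m = 10_000_000   # bits in the bit set
k = 4            # hash functions

false_positive_rate = (1 - math.exp(-k * n / m)) ** k
memory_mb = m / 8 / 2**20  # bits -> bytes -> mebibytes

print(f"false positive rate: {false_positive_rate:.2%}")  # ~1.18%
print(f"memory: {memory_mb:.2f} MB")                      # ~1.19 MB
```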

Read here for more on the math behind bloom filter accuracy.

Example usage

If your application (like a database) stores data in files on disk, you can drastically reduce file IO by keeping a bloom filter that represents the membership of each file and testing the bloom filters before opening and reading the files.

[Figure: testing per-file bloom filters before reading files from disk]

If a file’s bloom filter says the element doesn’t exist in that file then there is no purpose in reading it at all. Depending on the use case, this can drastically reduce expensive IO operations. There is a chance you will still do some file IO by accident due to false positives, but tuning the accuracy keeps this minimal and the result is more efficient overall.
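
Here is a rough sketch of what such a read path could look like, reusing the BloomFilter class from the sketch above. The get and read_value_from_file names are hypothetical helpers for illustration, not an API from any real database.

```python
def read_value_from_file(path, key):
    # Placeholder for the expensive part: open the file and look for the key.
    with open(path, "rb") as f:
        ...  # real lookup logic (seek, block index, scan) would go here
    return None

def get(key, files_with_filters):
    """files_with_filters: list of (path, BloomFilter) pairs, one per data file."""
    for path, bloom in files_with_filters:
        if not bloom.might_contain(key):
            continue  # definitely not in this file: skip the disk read entirely
        value = read_value_from_file(path, key)  # possible false positive, so confirm
        if value is not None:
            return value
    return None  # not found in any file
```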

In a distributed data set with thousands of nodes, it’s not very efficient to query every node when in some cases you can use a similar optimization: test membership of the data on remote nodes with bloom filters and avoid unnecessary network hops.
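
The same shape of check works over the network. Assuming each node can share a bloom filter of the keys it holds, a coordinator can skip the nodes whose filters say no; the rpc_fetch function below is a hypothetical stand-in for the actual network request.

```python
def fetch_from_cluster(key, node_filters, rpc_fetch):
    """node_filters: dict of node address -> BloomFilter of keys stored on that node."""
    for node, bloom in node_filters.items():
        if not bloom.might_contain(key):
            continue  # the node definitely doesn't have the key: no network hop
        value = rpc_fetch(node, key)  # false positives still cost a request, but rarely
        if value is not None:
            return value
    return None
```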

Using a bloom filter in the right situations to do a quick check to avoid unnecessary work is definitely a nice option to have and can yield good results.

  • EJ

    It would be good to note that bloom filters work best for read only data.

  • EJ

    s/read only/read many/. From the twiki:

    Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for an element to be removed, clearing any of the bits would introduce the possibility for false negatives.

    One-time removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which may be undesirable. In this approach re-adding a previously removed item is not possible, as one would have to remove it from the “removed” filter.

  • http://www.csindepth.com Shahriar

The problem behind this method is the cost of hashing during inserts and updates, and updating the bloom bit set also has its own difficulties. Of course the bloom filter would increase the search speed; on the other hand it can be a bottleneck in CRUD operations.

  • http://muddypa.ws nportelli

    I really want to see a post on how you learn and research so many things so quickly. Do you sleep? Are you a robot? Have intense focus? Either way I’m jealous.

  • DarrenWang

    great post

  • Frans

    Thank you for this interesting article.
    One day I might need this.

  • Visar Uruqi

    Great article thanks for sharing this information

  • Patrick Baggett

    Sometimes I call myself a “Computer Scientist”, and then I get taken back to school by a 1970s algorithm! Awesome, thanks for the write-up; your explanation makes perfect sense!

  • Mel Green

    Excellent post! Clean and easy to understand, thank you for sharing. Are you familiar with any existing implementations of a bloom filter in popular languages such as .Net or javascript?

  • Ilka

    Very cool! Thank you : )

  • kjnilsson

    Interesting, clear and informative post. Thanks!

  • Shrivatsa Perur

    superb write up…… very easy to understand… thanks a lot
