The best kind of optimizations are the ones that eliminate expensive but wasteful work altogether. While analyzing the internals of some open source databases, I've found that some of them spend a lot of time applying the wrong optimizations at the wrong times, which results in wasted work. The more interesting implementations try to detect when this expensive work can be avoided and simply don't do it.
You would be surprised at how much processing some databases do just to tell you that the thing you queried for doesn’t exist in the database.
A data structure that sometimes can help reduce wasteful work like this is a Bloom Filter.
A Bloom Filter (invented in 1970 by Burton Bloom) is a probabilistic data structure that tells you whether something exists in a set, with a tunable level of accuracy. A bloom filter can return false positives but cannot return false negatives. If you ask the bloom filter whether an element exists in the set, it may answer "yes, it's a member of the set" even though it isn't. If the bloom filter returns false, however, you are guaranteed that the element is not in the set. Bloom filters can represent membership of a large set in a small number of bytes compared to holding an exact index, and because of this they are an effective data structure for many different purposes.
How a bloom filter works
A bloom filter is actually pretty simple. A bloom filter is backed by a bit set which can be set to whatever length you want (which affects accuracy, but more on that later).
When a new element comes in, its value is hashed, and the hash decides which bit(s) in the bit set need to be set to record that this value has been seen before.
Once bits have been set for some elements, say foo and bar, we can query the bloom filter to tell us if something has been seen before. The element is hashed in the same way, but instead of setting the bits, this time a check is done: if all the bits that would have been set are already set, the bloom filter returns true, meaning the element may have been seen before. If any of those bits is not set, the element has definitely not been seen. As more elements are added to a small bit set, more and more bits get set and the false positive rate increases.
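The add and query steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the choice of deriving all bit positions from a single SHA-256 digest (double hashing) are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A minimal bloom filter backed by a bytearray used as a bit set."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption for this sketch: derive every bit position from one
        # SHA-256 digest via double hashing, pos_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit the hashes point at.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True only if every bit is set; a single clear bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(num_bits=1024, num_hashes=4)
bf.add("foo")
bf.add("bar")
print(bf.might_contain("foo"))  # True
print(bf.might_contain("baz"))  # usually False, but a false positive is possible
```

The method is named `might_contain` rather than `contains` to make the asymmetry explicit: a `False` answer is definitive, a `True` answer is only probable.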
Bloom filter accuracy
The accuracy of a bloom filter can be adjusted through the bit set size and the number of hash functions used. If you have 1 million elements you want to add to the bloom filter, a bit set of 10 million bits with 4 hash functions has roughly a 1.2% chance of returning a false positive, in a data structure that fits in about 1.19 megabytes of RAM. The bloom filter will be much smaller than an index file that stores every element identifier.
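To check those numbers, here is a small sketch using the standard approximation for the false positive rate, (1 - e^(-kn/m))^k, where n is the number of elements, m the number of bits and k the number of hash functions. The function names are made up for this example, and the formula assumes the hash functions spread values uniformly.

```python
import math


def false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_num_hashes(num_items: int, num_bits: int) -> int:
    """Hash count that minimizes the approximation above: (m/n) * ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


n, m, k = 1_000_000, 10_000_000, 4
print(f"{false_positive_rate(n, m, k):.2%}")  # 1.18%
print(f"{m / 8 / 1024 / 1024:.2f} MiB")       # 1.19 MiB
print(optimal_num_hashes(n, m))               # 7
```

Under the same approximation, more hash functions would lower the false positive rate further for this bit set size, at the cost of more hashing work per operation.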
Read here for more on the math behind bloom filter accuracy.
If your application (like a database) stores files on disk, you can drastically reduce file IO by keeping a bloom filter that represents the membership of each file and testing the bloom filters before opening and reading files.
If the bloom filter says the element doesn’t exist in a file, then there is no point in reading that file. Depending on the use case, this could drastically reduce expensive IO operations. You may still do some unnecessary file IO because of false positives, but tuning the accuracy keeps this to a minimum, and the result is more efficient overall.
In a distributed data set with thousands of nodes, it’s not very efficient to query every node. In some cases you can use a similar optimization to test membership of data on remote nodes and avoid network hops.
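A rough sketch of that pattern is shown below. Everything here is hypothetical and only for illustration: `DataFile` stands in for an on-disk file (its `reads` counter simulates the expensive IO), and the small `BloomFilter` class uses the same SHA-256 based double hashing assumption as any simple implementation might.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class DataFile:
    """Stand-in for an on-disk file; `reads` counts simulated disk reads."""

    def __init__(self, records):
        self._records = dict(records)
        self.filter = BloomFilter()
        self.reads = 0
        for key in self._records:
            self.filter.add(key)

    def read(self, key):
        self.reads += 1  # in a real system this would be the expensive file IO
        return self._records.get(key)


def lookup(key, data_files):
    for data_file in data_files:
        if not data_file.filter.might_contain(key):
            continue  # definitely not in this file, so skip the read
        value = data_file.read(key)  # may still miss on a false positive
        if value is not None:
            return value
    return None


files = [DataFile({"a": 1, "b": 2}), DataFile({"c": 3})]
print(lookup("c", files))  # 3
```

A false positive only costs one wasted read before the loop moves on to the next file, so correctness never depends on the filter being exact.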
Using a bloom filter in the right situations to do a quick check to avoid unnecessary work is definitely a nice option to have and can yield good results.