The best kinds of optimizations are the ones that eliminate expensive, wasteful work entirely. While analyzing the internals of some open source databases, I've found that some of them spend a lot of time attempting the wrong optimizations at the wrong times, resulting in wasted work. The more interesting implementations detect when this expensive, wasteful work can be avoided and simply don't do it.
You would be surprised at how much processing some databases do just to tell you that the thing you queried for doesn’t exist in the database.
A data structure that sometimes can help reduce wasteful work like this is a Bloom Filter.
A Bloom filter (invented in 1970 by Burton Bloom) is a probabilistic data structure that answers set-membership queries with a tunable level of accuracy. A Bloom filter can return false positives but never false negatives. If you ask it whether an element exists in the set, it may answer "yes, it's a member" even though it isn't. If it returns false, however, you have a 100% guarantee that the element is not in the set. Bloom filters can represent membership of a large set in a small number of bytes compared to holding an exact index, which makes them an effective data structure for many different purposes.
How a bloom filter works
A bloom filter is actually pretty simple. It is backed by a bit set whose length you can choose (the length affects accuracy, but more on that later).
When a new element is added, its value is hashed, and the hash determines which bit(s) in the bit set are set to record that this value has been seen.
Suppose we have added two elements, foo and bar, and set their bits. We can now query the bloom filter to ask whether something has been seen before. The element is hashed, but instead of setting the bits, this time a check is done: if all the bits that would have been set are already set, the bloom filter returns true, meaning the element has probably been seen before. With a small enough bit set and enough elements added, more and more bits get set, and false positives increase.
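The add and query paths described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class name, the use of SHA-256, and the salting trick for deriving multiple hash functions are all my own choices, not anything prescribed by the filter itself.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k salted hashes."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=4)
bf.add("foo")
bf.add("bar")
print(bf.might_contain("foo"))  # True
print(bf.might_contain("baz"))  # almost certainly False
```

Note that `might_contain` only ever checks bits; nothing is ever un-set, which is why a plain Bloom filter supports insertion and lookup but not deletion.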
Bloom filter accuracy
The accuracy of a bloom filter can be tuned via the bit set size and the number of hash functions. If you have 1 million elements to add, a bit set of 10 million bits with 4 hash functions gives roughly a 1.2% chance of returning a false positive, in a data structure you can hold in memory with about 1.2 megabytes of RAM. That is much smaller than an index file storing every element identifier.
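The numbers above come from the standard approximation for the false positive rate with n inserted elements, m bits, and k hash functions: p ≈ (1 − e^(−kn/m))^k. A quick sanity check:

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate Bloom filter false positive rate for n inserted
    elements, m bits, and k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# 1 million elements, 10 million bits, 4 hash functions:
p = false_positive_rate(n=1_000_000, m=10_000_000, k=4)
print(f"{p:.3%}")  # roughly 1.2%
```

Increasing the bit set size (m) or re-tuning k for a given m/n ratio drives this rate down at the cost of memory.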
Read here for more on the math behind bloom filter accuracy.
If your application (like a database) stores files on disk, you can drastically reduce file IO by keeping a bloom filter representing the membership of each file and testing the filters before opening and reading the files.
If the bloom filter says the element doesn't exist in a file, there is no need to read that file at all. Depending on the use case, this can drastically reduce expensive IO operations. You may occasionally do some file IO due to a false positive, but tuning the accuracy keeps this minimal and more efficient overall.
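The lookup path might look like the sketch below. Everything here is illustrative: a plain set stands in for each file's bloom filter and a dict stands in for the file's contents, since the point is only where the membership check sits relative to the expensive read.

```python
def read_from_file(contents: dict, key: str):
    # Stand-in for an expensive disk read.
    return contents.get(key)

def lookup(key, files, filters):
    """files: {path: contents}; filters: {path: membership filter}."""
    for path, contents in files.items():
        if key not in filters[path]:
            continue  # filter says "definitely not here": skip the read
        value = read_from_file(contents, key)
        if value is not None:
            return value  # the filter can false-positive, so verify
    return None

files = {"data1": {"a": 1}, "data2": {"b": 2}}
filters = {"data1": {"a"}, "data2": {"b"}}
print(lookup("b", files, filters))  # 2
print(lookup("z", files, filters))  # None, without reading any file
```

The key design point is that a negative answer from the filter is trusted unconditionally, while a positive answer is always verified against the actual file.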
In a distributed data set spread across thousands of nodes, querying every node for a key is inefficient. The same optimization applies: test membership of the data held on remote nodes first, and avoid the network hops to nodes that cannot hold the key.
Using a bloom filter in the right situations as a quick check to avoid unnecessary work is a nice option to have and can yield good results.