Distributed systems that have relaxed consistency guarantees between replicas need ways to eventually converge on the same state. Doing this with a large set of data can be a real challenge. Processing and comparing full sets of data is expensive. If two replicas aren’t consistent you want to find out what pieces of data aren’t the same as efficiently as possible so that you can repair the inconsistent data.
Several eventually consistent databases such as Cassandra, Riak and Voldemort implement ideas introduced in the Amazon Dynamo paper. The Dynamo paper included the idea of using Merkle Tree’s (hash tree) for comparing data efficiently. Bit Torrent uses merkle tree’s as well.
The leaves in a merkle tree are hashes of data. Nodes further up in the tree are the hashes of their respective children. This allows you to compare in a top-down fashion until you find a mismatch. If the top level hash is the same, both nodes are consistent without having to check any further.
Anti-entropy is the process of detecting inconsistencies and repairing these inconsistencies. Sometimes this is simply describes as a “repair” that you may issue manually. In Riak there is support for what’s called active anti-entropy. Active anti-entropy takes a more automated approach by continuously in the background comparing merkle tree’s between replicas and repairing inconsistencies.
In my opinion the choice whether active anti-entropy makes sense depends on a lot of factors about your system and the workload characteristics.
Joseph Blomstedt has an excellent video describing how active anti-entropy works in Riak. The video is also a really great example of how merkle tree’s work so I suggest anyone using Cassandra or Riak or just interested in distributed systems to give it a watch. Or four.
- Responsible benchmarking
- Understanding hardware still matters in the cloud
- The “network partitions are rare” fallacy
- Messaging and event sourcing
- Further reducing memory allocations and use of string functions in Haywire
- HTTP response caching in Haywire
- Atomic sector writes and misdirected writes
- How memory mapped files, filesystems and cloud storage works
- Hello haywire
- Active Anti-Entropy
- Lightning Memory-Mapped Database
- Write amplification
- Amortizing de-duplication at read time instead of write time
- LevelDB was designed for mobile devices
- AMQP and wire format interopability
- Convergent Replicated Data Types
- Configuration is bad but what about operational flexibility?
- An alternative to Paxos, the RAFT consensus algorithm
- Version tolerance and accidental operation complexity
- Hardware configurations can introduce tight coupling and increase failure foot print
- November 2013
- October 2013
- August 2013
- July 2013
- June 2013
- May 2013
- April 2013
- March 2013
- January 2013
- October 2012
- September 2012
- August 2012
- May 2012
- April 2012
- February 2012
- January 2012
- December 2011
- September 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010