During a transaction discussion with Selena Marie and Howard Chu on Twitter, Howard mentioned routing protocols which reminded me of an idea I had several months ago that I would like to talk about in this post.
Partitions in distributed systems happen for many reasons, many of which unrelated to networking infrastructure failures. Anything that interrupts communication between nodes has effectively created a partition. One example I like to bring up is a CLR or JVM stop-the-world garbage collection that happens at the same time on multiple nodes (yes this can happen).
Figure 1. Created by Kyle Kingsbury
Kyle Kingsbury has done a great job in his Jepsen blog series testing and reporting how various systems handle the classical split brain scenario. Depending on the levels of consistency and availability your system is allowed to tolerate, you may allow the spit brain scenario to continue taking writes on both sides of the partition. Even though there is a minority of nodes N1 and N2, clients C1 and C2 as depicted in figure 1 are still allowed to successfully write to these nodes. Amazon Dynamo implementations like Cassandra, Riak and Voldemort allow this to increase availability.
Allowing writes on both sides of a partition is great for availability but can pose problems for consistency when the partition is healed. If a cell was updated on both sides of the partition, which one do you keep? There’s no clear answer for this because it depends on your application. There’s also a cascading correctness issue to consider. Is it acceptable to let readers make decisions from potentially incorrect data to create more new data? As Kyle points out, wall clock last write wins isn’t always a good solution because data can be lost. Vector clocks help by introducing a logical clock that removes the time sync problems and offers a way to record partial ordering of events in distributed systems that can detect causality violations. When a partition is healed, vector clocks allow nodes to automatically merge their changes and in the event of a conflict, a user-defined merge function can make the decision what to keep and what to discard.
This is all wonderful but like anything else, there are trade-offs made when using vector clocks. There are performance implications of storing all this meta data. It may not be desirable to have this performance impact if last write wins is acceptable for your system.
Riak supports both last write wins and vector clocks with client defined merge functions. Voldemort supports resolving via last write wins or vector clocks but you cannot disable the recording or transferring of vector clocks. Last write wins is used when the vector clock cannot resolve the conflict (Thanks to Jay Kreps for this correction). Cassandra only supports last write wins. In Cassandra, it is possible to effectively model vector clocks functionality in Cassandra columns by writing changes to a new column each time. This will lack built-in functionality for automatically merging of non-conflicting changes. A user-defined merge function in the client is required for both conflicting and non-conflicting merges. This is a pattern I’ve seen people model in Cassandra.
In a system that requires more constraints for consistency you may need to make some trade-offs and these options may not be acceptable.
Figure 2. Created by Kyle Kingsbury
Some systems design the concept of a master node replicating to slave nodes. All writes go to the master so that consistency can be guaranteed. If the master goes down, a new master will get elected and hopefully the only candidates are nodes that are up to date so that you don’t have data loss by giving a node missing data the role as a master. This is no simple problem to solve, it’s hard to get right. On the minority side of the partition, an election will fail because there’s not enough nodes to make a majority.
While a new master is being elected, writes can be interrupted as the cluster is sorting election out. Clients that may be on the side of the minority will have nowhere to write to. It’s clear this type of system has reduced availability but these are the trade-offs made to increase consistency.
The classical split brain scenario is discussed quite often but there are also much more complex partitions that can happen that throw off some of the algorithms implemented in distributed systems like the databases and queues you use today.
First thing I want to point out is that there is no all-knowing overseer of your distributed system. Each node has its own perspective of the cluster state. For example, N1 may think it’s in a minority side of a partition with N2 and can’t communicate to N3, N4 and N5 but N2 may not necessarily agree with N1. N2 may see a totally healthy cluster.
In these more complex scenarios, some nodes may have overlapping views that can really mess with distributed system algorithms. Depending on implementation, master elections can get hairy in these situations.
From the perspective of nodes N1 and N2 shown in Figure 3, it appears that N3 and N4 are unreachable. A majority can be reached with N1+N2+N5, an election is raised.
From the perspectives of N3 and N4 shown in Figure 4, it appears N1, N2 are unreachable. This also results in a majority because N3+N4+N5 can be reached and an election will get raised. Wait a minute! Two majorities are possible! Some implementations will protect against this and force N5 to pick a side but I’ve seen some elect duelling masters. When this happens you’re in for a real fun mess.
One solution to this is to add routing awareness to the election process similar to what we see in networking infrastructure like shortest path routing. Since N5 can see the whole cluster, it might be wise to discover that before raising an election and giving nodes with the healthiest perspectives a position as a candidate for election. Other nodes can be made aware that a full cluster is still possible. In the scenario above, this would mean N5 would be the only candidate possible but a more complicated situation may have multiple candidates but ultimately only one gets chosen. If the slaves need to communicate with each other, these requests and responses need to route through N5 because N5 is the only bridge across the partition.
The trade-off with adding routing awareness is that it will take longer to sort out what the right decision is and to take action on it. This means reduced availability during this process but once the new master comes on line you will have 4 replicas which may regain your capacity to handle read load if you’re allowing reads from slaves (another trade-off with consequences, which I’m not going to cover in this post).
Adding routing to distributed systems isn’t new. Chord and Pastry include routing. In the Amazon Dynamo paper, routing was explicitly excluded to lower latency.
Routing in distributed systems is not a new concept, I thought of it one day and like most things, yup, someone else already thought about this. There are many more examples than Chord and Pastry however I haven’t found many open source infrastructure projects that include routing awareness in the election process. If you know of any please leave a comment in this post! It’s impossible for me to know how they all work
- Batching and pipelining linearizable operations in replicated logs
- Trick to reduce allocations improves response latency in Haywire
- Improving the protocol parsing performance in Redis
- Mencius and Fast Mencius a high performance replicated state machine for WANs
- Tuning Paxos for high-throughput with batching and pipelining
- Scalable Eventually Consistent Counters
- Create benchmarks and results that have value
- Routing aware master elections
- My new test lab
- Responsible benchmarking
- Understanding hardware still matters in the cloud
- The “network partitions are rare” fallacy
- Messaging and event sourcing
- Further reducing memory allocations and use of string functions in Haywire
- HTTP response caching in Haywire
- Atomic sector writes and misdirected writes
- How memory mapped files, filesystems and cloud storage works
- Hello haywire
- Active Anti-Entropy
- Lightning Memory-Mapped Database
- September 2014
- May 2014
- April 2014
- March 2014
- February 2014
- January 2014
- November 2013
- October 2013
- August 2013
- July 2013
- June 2013
- May 2013
- April 2013
- March 2013
- January 2013
- October 2012
- September 2012
- August 2012
- May 2012
- April 2012
- February 2012
- January 2012
- December 2011
- September 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010