Routing aware master elections

By kellabyte  //  Distributed Systems  //  1 Comment

During a transaction discussion with Selena Marie and Howard Chu on Twitter, Howard mentioned routing protocols which reminded me of an idea I had several months ago that I would like to talk about in this post.

Partitions in distributed systems happen for many reasons, many of which unrelated to networking infrastructure failures. Anything that interrupts communication between nodes has effectively created a partition. One example I like to bring up is a CLR or JVM stop-the-world garbage collection that happens at the same time on multiple nodes (yes this can happen).

Split Brain

Figure 1. Created by Kyle Kingsbury

Kyle Kingsbury has done a great job in his Jepsen blog series testing and reporting how various systems handle the classical split brain scenario. Depending on the levels of consistency and availability your system is allowed to tolerate, you may allow the spit brain scenario to continue taking writes on both sides of the partition. Even though there is a minority of nodes N1 and N2, clients C1 and C2 as depicted in figure 1 are still allowed to successfully write to these nodes. Amazon Dynamo implementations like Cassandra, Riak and Voldemort allow this to increase availability.

Allowing writes on both sides of a partition is great for availability but can pose problems for consistency when the partition is healed. If a cell was updated on both sides of the partition, which one do you keep? There’s no clear answer for this because it depends on your application. There’s also a cascading correctness issue to consider. Is it acceptable to let readers make decisions from potentially incorrect data to create more new data? As Kyle points out, wall clock last write wins isn’t always a good solution because data can be lost. Vector clocks help by introducing a logical clock that removes the time sync problems and offers a way to record partial ordering of events in distributed systems that can detect causality violations. When a partition is healed, vector clocks allow nodes to automatically merge their changes and in the event of a conflict, a user-defined merge function can make the decision what to keep and what to discard.

This is all wonderful but like anything else, there are trade-offs made when using vector clocks. There are performance implications of storing all this meta data. It may not be desirable to have this performance impact if last write wins is acceptable for your system.

Riak supports both last write wins and vector clocks with client defined merge functions. Voldemort supports resolving via last write wins or vector clocks but you cannot disable the recording or transferring of vector clocks. Last write wins is used when the vector clock cannot resolve the conflict (Thanks to Jay Kreps for this correction). Cassandra only supports last write wins. In Cassandra, it is possible to effectively model vector clocks functionality in Cassandra columns by writing changes to a new column each time. This will lack built-in functionality for automatically merging of non-conflicting changes.  A user-defined merge function in the client is required for both conflicting and non-conflicting merges. This is a pattern I’ve seen people model in Cassandra.

In a system that requires more constraints for consistency you may need to make some trade-offs and these options may not be acceptable.

Figure 2. Created by Kyle Kingsbury

Some systems design the concept of a master node replicating to slave nodes. All writes go to the master so that consistency can be guaranteed. If the master goes down, a new master will get elected and hopefully the only candidates are nodes that are up to date so that you don’t have data loss by giving a node missing data the role as a master. This is no simple problem to solve, it’s hard to get right. On the minority side of the partition, an election will fail because there’s not enough nodes to make a majority.

While a new master is being elected, writes can be interrupted as the cluster is sorting election out. Clients that may be on the side of the minority will have nowhere to write to. It’s clear this type of system has reduced availability but these are the trade-offs made to increase consistency.

Complex partitions

The classical split brain scenario is discussed quite often but there are also much more complex partitions that can happen that throw off some of the algorithms implemented in distributed systems like the databases and queues you use today.

First thing I want to point out is that there is no all-knowing overseer of your distributed system. Each node has its own perspective of the cluster state. For example, N1 may think it’s in a minority side of a partition with N2 and can’t communicate to N3, N4 and N5 but N2 may not necessarily agree with N1. N2 may see a totally healthy cluster.

Overlapping views

In these more complex scenarios, some nodes may have overlapping views that can really mess with distributed system algorithms. Depending on implementation, master elections can get hairy in these situations.

Figure 3.

From the perspective of nodes N1 and N2 shown in Figure 3, it appears that N3 and N4 are unreachable. A majority can be reached with N1+N2+N5, an election is raised.


Figure 4.

From the perspectives of N3 and N4 shown in Figure 4, it appears N1, N2 are unreachable. This also results in a majority because N3+N4+N5 can be reached and an election will get raised. Wait a minute! Two majorities are possible! Some implementations will protect against this and force N5 to pick a side but I’ve seen some elect duelling masters. When this happens you’re in for a real fun mess.

Routing protocols

One solution to this is to add routing awareness to the election process similar to what we see in networking infrastructure like shortest path routing. Since N5 can see the whole cluster, it might be wise to discover that before raising an election and giving nodes with the healthiest perspectives a position as a candidate for election. Other nodes can be made aware that a full cluster is still possible. In the scenario above, this would mean N5 would be the only candidate possible but a more complicated situation may have multiple candidates but ultimately only one gets chosen. If the slaves need to communicate with each other, these requests and responses need to route through N5 because N5 is the only bridge across the partition.

The trade-off with adding routing awareness is that it will take longer to sort out what the right decision is and to take action on it. This  means reduced availability during this process but once the new master comes on line you will have 4 replicas which may regain your capacity to handle read load if you’re allowing reads from slaves (another trade-off with consequences, which I’m not going to cover in this post).

Adding routing to distributed systems isn’t new. Chord and Pastry include routing. In the Amazon Dynamo paper, routing was explicitly excluded to lower latency.


Routing in distributed systems is not a new concept, I thought of it one day and like most things, yup, someone else already thought about this. There are many more examples than Chord and Pastry however I haven’t found many open source infrastructure projects that include routing awareness in the election process. If you know of any please leave a comment in this post! It’s impossible for me to know how they all work :)

  • http://xzilla.net/ xzilla

    We were just discussing a number of these topics at the office last friday, specifically in the area of single primary node, multiple secondary node layouts. Most of the time when people lay these systems out, they focus on the various database nodes determining who is the primary, but one method I haven’t seen explored much was the idea of trying to run some type of zookeeper-esque cluster amongst the application nodes; ie. they are testing which databases they can connect to, and if a majority of the application servers agree, then they would all decide to switch to a secondary (this presumes they have a way of forcing a failover event at the database level).

    The main upside that I see from this is it reduces complexity; in most cases you have the issue of trying to determine what the database cluster should look like, and also how to notify your application of what it looks like and make sure the application servers don’t get confused about that in their own right. In this scenario, since the application servers make the decision, there isn’t a disconnect there. To the degree you can get your application servers to move cohesively to a new server, you don’t have to worry about getting mixed up writes. (Granted there are some other trade-offs here, but if you have a specific flow of data through your application, I think it’s at least valid to start with the idea that you don’t really care what your database instances think the cluster should look like, it’s really about what your applications can see).

    Anyway, just some random thoughts leftover from the weekend :-)