The “network partitions are rare” fallacy

By kellabyte  //  Uncategorized  //  3 Comments

I often hear that network partitions are rare in clusters of up to a couple hundred nodes, and that you should expect Gossip protocols to converge on cluster state quickly. Usually this is followed by the claim that network-related issues are uncommon.

I wish this fallacy would die, because experiencing partitions in a cluster often has very little to do with network infrastructure failures. Many of the possible causes can occur even when all the machines are on the same LAN, or even in the same rack. Partitions can happen even at small cluster sizes.

Gossip-based failure detectors are subject to a number of problems, beyond the scope of network infrastructure, that can cause Gossip messages to be dropped. These issues can impact really small clusters of even 2 nodes.

Let’s say 1 node out of 2 is experiencing JVM or CLR stop-the-world garbage collection pauses. Gossip messages will be dropped, and the healthy node will consider the GC’ing node down.
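To make this concrete, here is a minimal, hypothetical sketch of a timeout-based failure detector (the class and names are invented for illustration; real Gossip detectors such as phi-accrual are more sophisticated). The point it shows: the detector cannot distinguish a GC-paused process from a dead one, because all it sees is silence.

```python
# Hypothetical sketch of a timeout-based failure detector.
# Silence is silence, whatever caused it.

class FailureDetector:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heard = {}  # node -> timestamp of last gossip message

    def heartbeat(self, node, now):
        self.last_heard[node] = now

    def is_alive(self, node, now):
        last = self.last_heard.get(node)
        return last is not None and (now - last) <= self.timeout_s

fd = FailureDetector(timeout_s=1.0)
fd.heartbeat("node-b", now=0.0)        # node-b gossips normally
assert fd.is_alive("node-b", now=0.5)  # still considered up

# node-b now enters a multi-second stop-the-world GC pause and sends
# nothing. By t=2.0 the healthy node declares it down, even though the
# process and the network are both perfectly fine.
assert not fd.is_alive("node-b", now=2.0)
```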

What if you have a replication factor of 3 in a larger cluster, and an extremely bad query executes on one replica set, pegging its CPUs for a minute and dropping Gossip messages? One third of your nodes will be down from the perspective of the other two thirds, because that replica set will appear partitioned.

What if a virtual machine host is experiencing difficulties and all the guest machines are unreachable?

Using network protocols like Gossip for failure detection is not about detecting network infrastructure failures. It’s about detecting unreachable systems and marking them as unavailable. There are many more reasons why a system may become unreachable than I’ve pointed out in this post.

I work on a cluster with a 4-digit node count. Larger cluster sizes can certainly create chaotic environments, but even with all nodes within the same LAN, most partitions I’ve experienced have had nothing to do with network infrastructure failures.

Describing partitions in distributed systems as rare because network infrastructure failures are rare is not a realistic view of what can cause partitions.

Partitions in distributed systems are about availability, not about network infrastructure. Many things can cause a node to be unavailable to a segment of the other nodes.

I would also argue that nodes that cannot satisfy response-time SLAs should be considered unavailable, even though they are generally “up”.
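A hypothetical sketch of that last point (the SLA value and function name are invented for illustration): a node that answers, but slower than its SLA allows, is “up” in the liveness sense yet unavailable to its callers.

```python
SLA_MS = 200  # assumed response-time SLA, for illustration

def effectively_available(recent_latencies_ms, sla_ms=SLA_MS):
    if not recent_latencies_ms:
        return False  # no responses at all: clearly unavailable
    # Use the worst observed latency as a crude stand-in for a tail
    # percentile; a real system would track p99 over a sliding window.
    return max(recent_latencies_ms) <= sla_ms

assert effectively_available([50, 80, 120])       # within SLA: available
assert not effectively_available([50, 80, 5000])  # responding, but too slow
```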


  • Paul Banks

    Thanks for this post. In general I agree and it’s certainly something to be aware of.

    I do have one comment though. It seems to me that most of the things you talk about here are actually not “partitions” in terms of the CAP theorem. Your examples like GC pauses, CPU overload, etc. are all “availability” failures according to CAP, rather than partitions. At least in my understanding (correction welcome).

    In some cases this distinction doesn’t mean much and I’m sure you are right that in many cases people do not give necessary consideration to those factors.

    But perhaps part of the “fallacy” is due to some people genuinely referring to “partitions” in the CAP sense where one or more nodes are unable to communicate with others but *both* parts of cluster are available and otherwise working normally.

    Of course network errors are still much more common than people suggest, as you imply. In an OLTP system, just a handful of dropped packets, or a temporary burst in connections which exceeds some socket limit or buffer, can be a network failure between two machines for the purpose of recording that transaction, and may need to be treated as a genuine partition to maintain the desired CAP properties.

    Thanks again for the article. How do you define ‘partition’?

  • Rajiv

    Great point. We cannot assume that our system health is only tied to networking errors.
    What you describe, though, seems like availability issues and not partitions. A node knocked out by GC pressure or lack of CPU will appear unavailable to all other nodes, and will not result in a split-brain situation where certain nodes can communicate with it and certain others cannot.

    • John Tamplin

      The machines aren’t all synchronized in their checks, so when A pings C and decides it is dead, C may still respond fine when B pings it. Thus A and B have different ideas about the state of C.