I often hear that network partitions are rare in clusters that are up to a couple hundred nodes in size and that you should expect Gossip protocols to converge on cluster state quickly. Usually this is followed up by saying network related issues are not common.
I wish this fallacy would die because experiencing partitions in a cluster has very little to do with network infrastructure failures. Many of the possible causes can happen even when all the machines are in the same LAN or even in the same rack. Partitions can happen even in small cluster sizes.
Gossip based failure detectors can be subjected to a number of problems that can cause Gossip messages to be dropped that go beyond the scope of network infrastructure. These issues can impact really small clusters of even 2 nodes.
Let’s say 1 node out of 2 is experiencing JVM or CLR stop the world garbage collection pauses. Gossip messages will be dropped and the healthy node will consider the GC’ing node down.
What if you have replication factor of 3 in a larger cluster and an extremely bad query is executed on one replica set which causes their CPU’s to peg for a minute and drop Gossip messages? 1/3rd of your nodes will be down from the perspective of the other 2/3rd’s because that replica set will appear partitioned.
What if a virtual machine host is experiencing difficulties and all the guest machines are unreachable?
Using network protocols like Gossip for failure detection is not about detecting network infrastructure failures. It’s about detecting unreachable systems and marking them as unavailable. There are many more reasons why a system may become unreachable than I’ve pointed out in this post.
I work on a 4 digit node count cluster, certainly larger cluster sizes can create for chaotic environments, but even with all nodes within the same LAN, most partitions I’ve experienced have nothing to do with network infrastructure failures.
Describing partitions in distributed systems as rare because network infrastructure failure is rare is not a realistic view of what can cause partitions.
Partitions in distributed systems is about availability not about network infrastructure. Many things can cause a node to be unavailable to a segment of other nodes.
I would also argue that nodes that cannot satisfy response time SLA’s are considered unavailable even though they are generally “up”.
@kellabyte you can even cause a network partition with SQL Server mirroring just by rebuilding indexes on the primary.
— Jeremiah Peschka (@peschkaj) November 4, 2013
- The 99th percentile matters
- Batching and pipelining linearizable operations in replicated logs
- Trick to reduce allocations improves response latency in Haywire
- Improving the protocol parsing performance in Redis
- Mencius and Fast Mencius a high performance replicated state machine for WANs
- Tuning Paxos for high-throughput with batching and pipelining
- Scalable Eventually Consistent Counters
- Create benchmarks and results that have value
- Routing aware master elections
- My new test lab
- Responsible benchmarking
- Understanding hardware still matters in the cloud
- The “network partitions are rare” fallacy
- Messaging and event sourcing
- Further reducing memory allocations and use of string functions in Haywire
- HTTP response caching in Haywire
- Atomic sector writes and misdirected writes
- How memory mapped files, filesystems and cloud storage works
- Hello haywire
- Active Anti-Entropy
- October 2014
- September 2014
- May 2014
- April 2014
- March 2014
- February 2014
- January 2014
- November 2013
- October 2013
- August 2013
- July 2013
- June 2013
- May 2013
- April 2013
- March 2013
- January 2013
- October 2012
- September 2012
- August 2012
- May 2012
- April 2012
- February 2012
- January 2012
- December 2011
- September 2011
- July 2011
- June 2011
- May 2011
- April 2011
- March 2011
- February 2011
- December 2010
- November 2010
- October 2010
- September 2010
- August 2010
- July 2010
- June 2010
- May 2010