As engineers, I think we have a responsibility in how we publish benchmarks. Not everyone is as much of an expert as you are on the systems you are benchmarking.
We all use micro benchmarks to iterate quickly during development, but it rarely makes sense for users of our systems to make decisions based on them.
This benchmark in particular concerns me because in the comments is this quote (emphasis mine):
Greg, This is a short test run specifically to show how this behave under this scenario. Mostly, because we’re having users that use this micro benchmark as a way to base decisions. We’re doing longer & bigger tests, yes. We’ll post about them later on.
This benchmark doesn’t actually provide much useful information. It is too short, and it compares fully featured DBMSs to storage engines. I always stress that people should never make decisions based on benchmarks like this.
Benchmarks like this paint the fully featured DBMSs in a negative light, and the comparison isn’t fair: they are doing a LOT more work than a bare storage engine. I’m sure the FoundationDB folks will not be happy to know they were roped into an unfair comparison in a benchmark where the code is not even available.
Let me give you an example that tells a very different story about two storage engines included in the graphs from that blog post. OpenLDAP can use LMDB. Active Directory uses Esent. Those graphs show that Esent is faster than LMDB. However, if you benchmark Active Directory against OpenLDAP you will see a very different result: OpenLDAP is much faster than Active Directory in both random reads and random writes. Does that prove that LMDB is faster than Esent? No, it does not. I could certainly paint that picture if I wanted to and confuse people. But that would be wrong.
In contrast, a better example of a responsible benchmark is the LMDB HyperDex benchmarks. They run the same workload on the same fully featured DBMS on top of multiple storage engines. The resulting measurements can be compared in a meaningful way.
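To make the idea concrete, here is a minimal sketch of that kind of harness. It is not the HyperDex benchmark code: `HashEngine`, `SortedEngine`, `Database`, `make_workload` and `run` are all hypothetical names invented for illustration, and the two engines are toy in-memory stand-ins. The point is only the shape: one fixed, seeded workload and one shared upper layer, with the storage engine as the only thing that varies.

```python
import bisect
import json
import random
import time
from typing import Dict, List, Optional, Protocol, Tuple


class StorageEngine(Protocol):
    """Hypothetical minimal engine interface shared by every backend."""
    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...


class HashEngine:
    """Toy stand-in engine backed by a dict."""
    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


class SortedEngine:
    """Toy stand-in engine that keeps keys in sorted order."""
    def __init__(self) -> None:
        self._keys: List[bytes] = []
        self._values: List[bytes] = []

    def put(self, key: bytes, value: bytes) -> None:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite existing key
        else:
            self._keys.insert(i, key)        # insert at sorted position
            self._values.insert(i, value)

    def get(self, key: bytes) -> Optional[bytes]:
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


class Database:
    """The shared upper layer: does the same encoding work on any engine."""
    def __init__(self, engine: StorageEngine) -> None:
        self._engine = engine

    def write(self, key: str, doc: dict) -> None:
        self._engine.put(key.encode(), json.dumps(doc).encode())

    def read(self, key: str) -> Optional[dict]:
        raw = self._engine.get(key.encode())
        return None if raw is None else json.loads(raw)


def make_workload(n_ops: int, n_keys: int, seed: int) -> List[Tuple[str, str, int]]:
    """Build a reproducible 50/50 read/write mix from a fixed seed."""
    rng = random.Random(seed)
    ops = []
    for _ in range(n_ops):
        kind = "write" if rng.random() < 0.5 else "read"
        ops.append((kind, f"key-{rng.randrange(n_keys)}", rng.randrange(1_000_000)))
    return ops


def run(engine: StorageEngine, workload) -> Tuple[float, List[Optional[dict]]]:
    """Replay the workload through the shared layer; return (seconds, reads)."""
    db = Database(engine)
    reads: List[Optional[dict]] = []
    start = time.perf_counter()
    for kind, key, n in workload:
        if kind == "write":
            db.write(key, {"n": n})
        else:
            reads.append(db.read(key))
    return time.perf_counter() - start, reads


if __name__ == "__main__":
    workload = make_workload(20_000, 1_000, seed=42)
    for name, engine in [("hash", HashEngine()), ("sorted", SortedEngine())]:
        elapsed, _ = run(engine, workload)
        print(f"{name}: {elapsed:.3f}s for {len(workload)} ops")
```

Because both runs replay the identical operation list through the identical `Database` layer, any difference in the timings can be attributed to the engine. A real benchmark would also need longer runs, warm-up, repeated trials and latency percentiles, none of which this sketch attempts.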
Performance in a fully featured DBMS encompasses far more than what a storage engine is doing. It’s our responsibility to publish apples to apples comparisons. If we do not, we should make it crystal clear that the results are highly experimental and that users should not make decisions based on the published micro benchmarks.
I tweet my fair share of experimental benchmark results to get community feedback. If you ever feel I’m presenting them in a way that promotes basing decisions on them, please call me out on it. That’s one thing I like about the infrastructure communities. At times they self-police and keep each other honest.
- Open source your benchmark code, or else we can’t trust the results to be correct.
- Compare apples to apples.
- Be crystal clear how the results should be interpreted.
It’s our responsibility.
A response to my post was published. It doesn’t change my stance; if anything, it reaffirms it. I don’t care if you call it a benchmark or a performance test, or whether it’s on your personal blog or your company blog. An invalid test is an invalid test. Your users (customers) are reading the material that you publish. There’s a responsibility to not mislead your users. Not everyone is enough of an expert on these topics to catch the subtleties.
Oren does point out that the source code for his tests is available; I stand corrected on that. But it wasn’t linked in his original post as far as I could see.
One of the comments from bpm asks why it rubbed me the wrong way. That’s a great question! There are several reasons. I provide consultation services on several topics including (but not limited to) distributed systems, databases and large-scale systems. These kinds of posts land in my lap through those relationships, and I’m tired of spending time explaining why the results aren’t useful and why the decision a client was about to make shouldn’t factor in these highly flawed results. It’s a big waste of time for these organizations.
We must be careful not to manufacture hype based on invalid results. If you are writing performance tests to iterate fast in development and you want to publish the results in a social way, by all means do! But please make sure they are at least valid.
“I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.”
That quote admits that the results aren’t very useful, yet it still insists that one goal is to have invalid measurements sway decision making. Some database communities do not publish invalid results on any medium, but sadly some do. If you are going to publish micro benchmarks, at least make them valid ones.
I write these comments so I don’t have to keep repeating them. More importantly, I write these comments because I think it’s important to increase awareness.
In the end, I hope we can improve published information so that people can make better decisions, rather than catering to poor decisions as this benchmark does.