Responsible benchmarking

By kellabyte  //  Databases  //  8 Comments

As engineers, I think we have a responsibility in how we publish benchmarks. Not everyone is as much of an expert as you are on the things you are benchmarking.

We all run micro benchmarks to iterate quickly on things during development, but it rarely makes sense for users of our systems to make decisions based on these micro benchmarks.

This benchmark in particular concerns me because of this quote in the comments (emphasis mine):

Greg, This is a short test run specifically to show how this behave under this scenario. Mostly, because we’re having users that use this micro benchmark as a way to base decisions. We’re doing longer & bigger tests, yes. We’ll post about them later on.

This benchmark doesn’t actually provide much useful information. It is too short, and it compares fully featured DBMSs to bare storage engines. I always stress that people should never make decisions based on benchmarks like this.

Benchmarks like this paint the fully featured DBMSs in a negative light, and the comparison isn’t fair: they are doing a LOT more work. I’m sure the FoundationDB folks will not be happy to learn they were roped into an unfair comparison in a benchmark whose code is not even available.

Let me give you an example that tells a very different story about two of the storage engines included in the graphs from that blog post. OpenLDAP can use LMDB. Active Directory uses Esent. Those graphs show that Esent is faster than LMDB. However, if you benchmark Active Directory against OpenLDAP you will see a very different result. OpenLDAP is much faster than Active Directory in both random reads and random writes. Does that prove that LMDB is faster than Esent? No, it does not. I could certainly paint that picture if I wanted to and confuse people. But that would be wrong.

In contrast, a better example of a more responsible benchmark is the LMDB HyperDex benchmarks. They run the same workload on the same fully featured DBMS on top of multiple storage engines. This results in measurements that can actually be compared with any kind of usefulness.

Performance in a fully featured DBMS encompasses far more than what a storage engine is doing. It’s our responsibility to publish apples-to-apples comparisons. If we do not, we should make it crystal clear that the results are highly experimental and that users should not make decisions based on the published micro benchmarks.

I tweet my fair share of experimental benchmark results to get community feedback. If you ever feel I’m presenting them in a way that promotes basing decisions on them, please call me out on it. That’s one thing I like about the infrastructure communities: at times they self-police and keep each other honest.

  • Open source your benchmark code, or else we can’t trust the results to be correct.
  • Compare apples to apples.
  • Be crystal clear about how the results should be interpreted.

It’s our responsibility.


A response to my post was published that doesn’t change my stance; if anything, it reaffirms it. I don’t care whether you call it a benchmark or a performance test, or whether it’s on your personal blog or your company blog. An invalid test is an invalid test. Your users (customers) are reading the material that you publish. There’s a responsibility not to mislead your users. Not everyone is enough of an expert on these topics to catch the subtleties.

Oren does point out that the source code for his tests is available; I stand corrected on that. But it wasn’t linked in his original post as far as I could see.

One of the comments, from bpm, asks why it rubbed me the wrong way. That’s a great question! There are several reasons. I provide consulting services on several topics, including (but not limited to) distributed systems, databases, and large-scale systems. These kinds of posts land in my lap through those relationships, and I’m tired of them landing in my lap and of spending time explaining why the results aren’t useful and why the decision the client was going to make shouldn’t factor in these highly flawed results. It’s a big waste of time for these organizations.

We must be careful not to manufacture hype based on invalid results. If you are writing performance tests to iterate quickly during development and you want to publish the results socially, by all means do so! But please make sure they are at least valid.

I am trying to see how users will evaluate Voron down the road. A lot of the time, that means users doing micro benchmarks to see how good we are. Yes, those aren’t very useful, but they are a major way people make decisions. And I want to make sure that we come out in a good light under that scenario.

In that quote is the admission that the results aren’t very useful, yet it persists in the goal that invalid measurements sway decision making. Some database communities do not publish invalid results on any medium, but sadly some do. If you are going to publish micro benchmarks, at least make them valid ones.

I write these comments so I don’t have to keep repeating them. More importantly, I write them because I think it’s important to increase awareness.


In the end, I hope we can improve published information so that people can make better decisions, rather than catering to poor decision making as this does.

  • Rajiv

    Spot on. This is one of the worst ones, since it compares completely different things.

    • http://www.Marisic.Net/ dotnetchris

      Compares the completely different things of “Writing 100,000 sequential items (100 items per transaction in 1,000 transactions):” vs “And writing 100,000 random items:”.

      Sure looks like an apples to apples comparison to me.

  • http://www.Marisic.Net/ dotnetchris

    I think this is a valid comparison of specific kinds of performance for these products. The included technologies are products that are either a current storage engine for RavenDB or have been considered as possible storage engines.

    Ayende was not happy with the options, so he chose his own option: to build a new one, which he then directly compares against the others in specific areas.

    I find absolutely nothing misleading about these metrics. Anyone with even a basic understanding of technology and data analysis would see that these are very specific, and that it would be absolutely silly to make a business decision based on these small slivers of the bigger picture.

    • Kelly Sommers


      I understand RavenDB is your favourite database.

      “Which then he directly compares against others in specific areas”

      Actually, no, he doesn’t test specifics, because the fully featured databases include far more features and operations than what he is trying to prove in the test. This flat out makes it an invalid test. For example, SQL Server is parsing SQL and doing a dozen other things. This is far from a specific test, not even close.

      I’ve received overwhelming responses from infrastructure engineers from all over the industry that these kinds of benchmarks carry very little value and confuse people more than anything. His blog post even admits to the lack of value.

      I’m also confident if I did a benchmark and published something like this fake graph (to prove a point) you would be upset it’s not comparing apples to apples. Like the blog post I commented on, it’s an invalid and unfair test.

      You’re just defending for the sake of defending.

      • http://www.Marisic.Net/ dotnetchris

        Honestly I think you’re looking to rant just to rant.

        The graph you included leaves out what it’s actually comparing. Regardless, do I find that graph disingenuous? No. It makes plenty of sense to me to see orders-of-magnitude differences between a file system and a database.

  • Dupuydauby Cyrille


    We are on the same page here: beware of micro benchmarking.

  • D Kh

    I agree that the majority of available benchmarks are misleading, maybe not intentionally, but they are just wrong in the majority of cases.

    The reality is always very different from the “commonly believed” things.
    Here we did a benchmark suite for serializers in .NET, creating different data patterns and ser/deser patterns. We used models from the real business world, i.e. EDI X12 835 claims, RPC batching calls, conferences with participants, events, schedules, etc.

    The results are completely different from what one would have expected. For example, many serializers are very fast at processing “flat primitive” objects like “TypicalPerson” without any sub-references; the moment we turn that into an array of TypicalPerson[100], everything just falls apart, and it turns out that the commonly believed fastest serializers, like MsgPack or JIL, drop to 25% or even lower in the speed spectrum.

    Benchmarking real workloads requires significant investment. I have found that profilers also skew the real-time picture when micro-benchmarking.

  • Leena Davis

    We had to learn and use Redis. If you want a server like Redis to ‘fall to its knees’, just do char buffer[8192], then gen_rand( buffer, 8192 ) and hiredis.set( buffer, 8192 ), and run it 300,000 times. Writing large random payloads instead of memset( buffer, 'x', 8192 ) guarantees a slowdown. Or use the killer MySQL ‘ORDER BY RAND()’ or SQLite’s ‘ORDER BY RANDOM()’. Then our Amazon m4.large and Azure D4 operate at 250-1500 RPS, instead of the fine-tuned 97,000-125,000 ‘benchmarked’ requests per second.
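
The serializer effect D Kh describes, rankings measured on a flat object collapsing once the data becomes an array of 100 objects, can be sketched with a minimal timing harness. This is a hedged illustration only, not their benchmark suite: it uses Python's stdlib json module and a made-up TypicalPerson-style record, and it shows just the underlying mechanic, that per-call cost grows with the size of the object graph, which is exactly what reshuffles rankings between serializers.

```python
import json
import time

def person(i):
    # Hypothetical "TypicalPerson"-style record: flat primitives only.
    return {"id": i, "name": f"person-{i}", "age": 30 + i % 40}

flat = person(0)                           # a single flat object
nested = [person(i) for i in range(100)]   # TypicalPerson[100]

def bench(obj, n=2000):
    # Time n serializations of obj with the stdlib json encoder.
    t0 = time.perf_counter()
    for _ in range(n):
        json.dumps(obj)
    return time.perf_counter() - t0

t_flat = bench(flat)
t_nested = bench(nested)
print(f"nested/flat cost ratio: {t_nested / t_flat:.1f}x")
```

A library that tops the flat-object chart but handles object graphs poorly will show a much worse ratio here than its competitors, which is why a ranking taken from the flat case alone is misleading.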
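
The Redis scenario in the last comment hinges on payload entropy: 8 KB of random bytes per request versus a repeated character. A minimal sketch of the two payload generators follows; the actual write appears only as a comment, since it assumes the redis-py client and a live server, neither of which is part of the post, and the key names and counts are illustrative.

```python
import os

PAYLOAD = 8192

def random_payload() -> bytes:
    # Incompressible, cache-hostile data: the "slow" workload.
    return os.urandom(PAYLOAD)

def repeated_payload() -> bytes:
    # Equivalent of memset(buffer, 'x', 8192): the "fast" workload.
    return b"x" * PAYLOAD

# In the real test each payload would be written ~300,000 times,
# e.g. with redis-py (assumed client, not named in the post):
#   r = redis.Redis()
#   r.set(f"key:{i}", random_payload())
# Only the payload generator differs between the two runs.
if __name__ == "__main__":
    print(len(random_payload()), len(repeated_payload()))
```

Running both variants against the same instance and comparing requests per second reproduces the gap the commenter describes between tuned benchmark numbers and a hostile payload.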