Create benchmarks and results that have value

By kellabyte  //  Performance

Benchmarks come in many forms, from micro benchmarks to full system tests. A benchmark setup is a simulated environment, so it’s hard to build a truly perfect benchmark that nobody will scrutinize. If people are left with more questions after seeing the original benchmark, that’s a sign there may be more benchmarks you should run with different setups to answer the unknowns.

Some benchmarks are really bad and provide zero value. People trust published benchmarks far too much. Please be critical. Ask questions.

I seem to encounter 3 problems the most.

  1. The setup makes no sense: the code is tested in a config it will never have the privilege to run in.
  2. The benchmark is way too short to prove that this code/technology/whatever is a good idea.
  3. The published results offer statistics that have very little value and don’t reveal anything.

I see benchmarking as an exercise in answering a question and discovering new questions. The results should provide enough information to support the answers they claim to provide. This does not mean skewing the configuration to reach the results you want to prove.

An example of a bad benchmark

I don’t mean to pick on this particular benchmark, but I want to use it as an example because I see these kinds of benchmarks more often than I’d like.

There are a couple of problems with this benchmark.

  1. The benchmark code only ran 10 iterations.
  2. It publishes only the average statistic.

I can’t think of any useful takeaway from the published results. 10 iterations is few enough to avoid all kinds of runtime behavior (garbage collection, for one), and even with the raw results available, 10 iterations are not enough to derive any useful statistics. Averages don’t really tell us anything (more on that later). Luckily the code is available, so we can run it ourselves and get something useful from it by changing the configuration.

The benchmark code is useful, but the executed run and its published results are not. I personally would not have published results from this run.

We don’t know what the code under test is doing. Benchmarking helps to find what kind of trade-offs exist in the code and how they impact a workload. 10 iterations is not enough to learn anything. If we are benchmarking something running on the CLR, JVM or Go, when do I care about how fast my ORM is 10 times without the GC? Never. I always care about the GC. The GC is there, and this test finishes too quickly to get it involved. I have no idea if any ORMs in this test are creating a really bad GC situation that will result in major performance drops. There are a lot of things not visible in a test this short. I don’t know if any ORMs eat my RAM for lunch, or if some are extremely CPU intensive. It’s just too short.
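To make the point about run length concrete, here is a minimal sketch of a time-bounded benchmark loop instead of a fixed 10 iterations. It is written in Python purely for illustration, with Python's `gc` module standing in for the CLR, JVM or Go collectors. The names `run_for` and `allocate_rows` are hypothetical and not taken from the benchmark discussed above.

```python
import gc
import time


def run_for(fn, seconds, warmup_seconds=0.0):
    """Call fn repeatedly for `seconds` and record the latency of every call."""
    # Optional warm-up: run without recording so caches and lazy init settle.
    deadline = time.perf_counter() + warmup_seconds
    while time.perf_counter() < deadline:
        fn()

    collections_before = sum(s["collections"] for s in gc.get_stats())
    latencies = []
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    collections_after = sum(s["collections"] for s in gc.get_stats())

    # Return every sample (not just an average) plus how often the GC ran.
    return latencies, collections_after - collections_before


def allocate_rows():
    # Hypothetical stand-in for an ORM materializing a result set.
    return [{"id": i, "name": str(i)} for i in range(1000)]


if __name__ == "__main__":
    samples, gc_runs = run_for(allocate_rows, seconds=5.0, warmup_seconds=1.0)
    print(f"{len(samples)} iterations, {gc_runs} GC collections during the run")
```

Bounding the run by time rather than by iteration count gives the runtime a chance to show its behavior, and keeping every sample leaves room to compute whatever statistics turn out to matter.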

Extract useful statistics

Averages are pretty useless in my experience. This isn’t a new revelation either; I’m late to the party. I learned in production that my measuring methods were poor. The metrics I used to capture did not reflect the reality of what was going on. I was not aware of what I was putting in production until the phone rang at 3AM on a Sunday morning and some crazy shit was going on. After revising what statistics my benchmarks extract, I could see what had happened in production.

Here are some quotes from this article.

What if I told you that back in the mid-1980's at the University of North Carolina, the average starting salary of geography students was well over $100,000? Knowing that, would you have considered making a career change?

But what if I also told you that basketball great Michael Jordan—formerly the world’s highest paid athlete—graduated from UNC with a degree in geography? Now do you believe me?

Maybe the mean isn’t always a slam dunk.

The Mean Can Mislead

In the case of Michael Jordan and fellow UNC geography graduates, the average is not a good representation of the true center of the data. Jordan’s earnings from his athletic career raises the “average” salary for geography graduates in a way that doesn’t accurately convey what graduates are likely to earn. By almost any measure, Jordan’s earnings would be an outlier.

Averages can hide the nasty truths in results. Spikes get folded into a single number, which ends up presenting an overly positive or overly negative picture.
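The same effect shows up in latency data. The numbers below are hypothetical (nine ordinary requests and one spike), chosen only to show how a single outlier drags the mean away from what a typical request experienced.

```python
import statistics

# Hypothetical latencies in milliseconds: nine typical requests and one spike.
latencies_ms = [10, 10, 11, 11, 12, 12, 12, 13, 13, 500]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)

print(f"mean:   {mean_ms} ms")
print(f"median: {median_ms} ms")
print(f"max:    {max(latencies_ms)} ms")
```

The mean lands around 60 ms, a latency that none of the ten requests actually had: nine finished in 13 ms or less and one took 500 ms.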

Predictable performance

Choosing options that have predictable performance is a good path to running a successful production system. You can capacity plan with better accuracy and you get a whole lot fewer fire drill phone calls to fix something in the middle of the night. Sometimes something is very fast and predictable, and sometimes it’s very fast and unpredictable, and you need to know which. Averages will not surface spikes in a way you can identify. After learning some hard lessons in production, I’ll take something that’s a little bit slower with much more predictability over something that goes bat shit crazy once in a while.

Using percentiles

In practical terms, a percentile tells you the value that a given percentage of the measurements fall at or below. A 99% latency of 389us means 99% of requests completed in 389us or less. Reading several percentiles together shows where the data starts moving away from the average.

This is key because this is where you may see some code bases differentiate themselves from the others. Some may have very comparable averages while the percentiles tell a much different story.
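As a small illustration, the sketch below computes percentiles with the nearest-rank method (one of several common definitions) for two hypothetical sets of latencies that share exactly the same average. The data and the names `steady` and `spiky` are made up for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in microseconds, 100 requests each.
steady = [300] * 100
spiky = [200] * 90 + [1200] * 10

for name, data in (("steady", steady), ("spiky", spiky)):
    mean = sum(data) / len(data)
    print(name, "mean", mean,
          "p50", percentile(data, 50),
          "p90", percentile(data, 90),
          "p99", percentile(data, 99))
```

Both sets average 300us, but only the percentiles show that one of them makes a tenth of its requests wait four times longer.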

Here are benchmark results from 2 HTTP servers: ServerA first (port 8080), then ServerB (port 8000).


Running 10s test @ http://skynet1:8080
  24 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   292.17us   91.68us  16.32ms   98.52%
    Req/Sec     3.44k   204.55     4.89k    58.86%
  Latency Distribution
     50%  285.00us
     75%  323.00us
     90%  347.00us
     99%  389.00us
  781005 requests in 10.00s, 195.89MB read
Requests/sec:  78,120.88
Transfer/sec:     19.59MB


Running 10s test @ http://skynet1:8000
  24 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   323.34us  618.32us  52.81ms   95.99%
    Req/Sec     3.48k     2.00k    8.56k    63.08%
  Latency Distribution
     50%  172.00us
     75%  395.00us
     90%  660.00us
     99%    1.37ms
  785369 requests in 10.00s, 196.98MB read
Requests/sec:  78,556.37
Transfer/sec:     19.70MB

The averages between ServerA and ServerB are very close at 292us and 323us. We could publish a benchmark that looks like this.

Stat              HTTP ServerA    HTTP ServerB
Requests/second   78,120.88       78,556.37
Average           292.17us        323.34us

This looks wonderful. Both HTTP servers are relatively close in performance! Pretty close in requests/second and the average response time is also very close.

This representation is misleading because the two HTTP servers actually perform quite differently once we extract better statistics. One of the HTTP servers has much worse latency spikes than the other, and they affect 10% of your traffic. That is very important information to be aware of. Averages are not good enough.

Check out the Latency Distribution sections in the output above. The percentiles tell a very different story.

Stat              HTTP ServerA             HTTP ServerB
Requests/second   78,120.88                78,556.37
Average           292.17us                 323.34us
Max               16.32ms (56x Average)    52.81ms (163x Average)
90%               347.00us                 660.00us
99%               389.00us                 1.37ms

Once we put the percentiles of the latency distribution side by side we start to see how the HTTP servers differ. ServerB has a max response time 3.2x worse than ServerA’s max response time. The slowest 10% of ServerB’s responses take nearly 2x as long as ServerA’s (660us versus 347us at the 90th percentile), and the slowest 1% take about 3.5x as long (1.37ms versus 389us).

From these results it may not be a good idea to choose ServerB over ServerA. ServerB has a requests/second increase over ServerA but it’s really only a 0.557% increase. Is this worth the headaches that the max and 90% statistics reveal? If I choose ServerA I will have slightly less performance but it will be much more predictable performance. This makes it easier to manage and plan how many web servers I need as load grows. I’ll have less weird shit happen at night.
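The ratios quoted in this comparison can be reproduced directly from the two Wrk runs. The sketch below only restates the published figures, with latencies converted to microseconds; the dictionary layout is just one convenient way to hold them.

```python
# Figures copied from the two Wrk runs above (latencies in microseconds).
server_a = {"rps": 78120.88, "avg": 292.17, "p90": 347.0, "p99": 389.0, "max": 16320.0}
server_b = {"rps": 78556.37, "avg": 323.34, "p90": 660.0, "p99": 1370.0, "max": 52810.0}

# Relative throughput gain of ServerB over ServerA, in percent.
throughput_gain_pct = (server_b["rps"] - server_a["rps"]) / server_a["rps"] * 100

# How many times larger each ServerB latency statistic is than ServerA's.
ratios = {k: server_b[k] / server_a[k] for k in ("avg", "p90", "p99", "max")}

print(f"ServerB throughput gain: {throughput_gain_pct:.3f}%")
for stat, ratio in ratios.items():
    print(f"ServerB {stat} is {ratio:.2f}x ServerA's")
```

Putting the throughput gain and the tail latency ratios next to each other makes the trade-off explicit: a gain of well under 1% against a tail that is several times slower.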

That’s that. We pick ServerA right? What if I told you I’m actually going to pick ServerB? Why would I want a less predictable HTTP server with barely any requests/second gain? Well, when I benchmark I capture even more information than this :)

At what cost

Knowing the performance characteristics using better statistics is only part of the story. What did it cost us? We don’t know yet. ServerA looks more predictable with a very negligible requests/second trade-off. I happened to capture a whole array of hardware measurements during the benchmark runs. The most interesting results come from the CPU metrics.





ServerB uses approximately 10% less CPU to achieve a higher requests/second throughput than ServerA. The trade-off is somewhat worse latency in the 90%-99% range. That 10% of CPU usage is, in this case, nearly a whole CPU core free. If this HTTP server were embedded in a database to serve JSON documents and the database was going to be processing some very CPU intensive queries, this free CPU core could be really significant in the overall end-to-end response time.

How I benchmark

I get asked often how I do my benchmarking and what tools I use. This is changing all the time. I watch a lot of people on Twitter who do benchmarking a lot better than I do so these are some things I’ve picked up along the way. I have plans to improve much further.

Here are some things I try to keep in mind each time I benchmark.

  • Output statistics that bring value.
  • Run different workload configurations (threads, read/write mix, etc).
  • Capture hardware measurements.

Here’s a benchmark I ran of Go’s HTTP server versus Haywire. I ran 2 of the TechEmpower benchmarks on my bare metal server, with one run per connection count. For HTTP benchmarks I use Wrk, an excellent HTTP benchmarking tool. It produced the results with percentiles that you saw earlier in this post.

Every single run (one per connection count) has the following results captured.

  1. Wrk output with throughput and percentile latency distribution.
  2. Hardware measurements in csv format.
  3. Hardware measurements graphed.
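When there is one run per connection count, it helps to turn each saved Wrk output into numbers that can be tabulated or plotted. Below is a rough sketch of such a parser, assuming output in the same shape as the runs shown earlier in this post. The function name `parse_wrk` is hypothetical, and only the `us`, `ms` and `s` units are handled.

```python
import re

# Multipliers to convert the latency units handled here to microseconds.
UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}

# Matches latency distribution lines such as "     99%    1.37ms".
PERCENTILE_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\d+(?:\.\d+)?)(us|ms|s)\s*$")
# Matches the summary line such as "Requests/sec:  78,556.37".
RPS_LINE = re.compile(r"^Requests/sec:\s+([\d,.]+)")


def parse_wrk(text):
    """Extract the latency distribution (in microseconds) and requests/sec."""
    result = {"percentiles_us": {}, "requests_per_sec": None}
    for line in text.splitlines():
        match = PERCENTILE_LINE.match(line)
        if match:
            pct, value, unit = match.groups()
            result["percentiles_us"][float(pct)] = float(value) * UNIT_TO_US[unit]
            continue
        match = RPS_LINE.match(line)
        if match:
            result["requests_per_sec"] = float(match.group(1).replace(",", ""))
    return result


# The ServerB run from earlier in this post.
SAMPLE = """\
Running 10s test @ http://skynet1:8000
  24 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   323.34us  618.32us  52.81ms   95.99%
    Req/Sec     3.48k     2.00k    8.56k    63.08%
  Latency Distribution
     50%  172.00us
     75%  395.00us
     90%  660.00us
     99%    1.37ms
  785369 requests in 10.00s, 196.98MB read
Requests/sec:  78,556.37
Transfer/sec:     19.70MB
"""

if __name__ == "__main__":
    print(parse_wrk(SAMPLE))
```

Normalizing every latency to one unit matters here, because the output mixes `us` and `ms` within the same distribution.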

I capture the hardware measurements with dstat and plot them using gnuplot, but I’m going to be replacing gnuplot with R soon. I don’t have everything as automated as I want, but it’s evolving every day.

I measure hardware all the time when I benchmark. It reveals so many things that are important. For example, here are CPU measurements graphed during a 3 hour LevelDB benchmark.

One thing I’m missing right now are histograms. I need to add these because I can’t see whether operations/second dip or rise during any of these tests. I can see hints of it in the hardware measurements but that’s not conclusive enough evidence. I want to be able to identify when in the run those latency spikes in the 90% happened. This is important because I might notice a correlating blip in the hardware measurements.
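One possible shape for this is sketched below: group samples into fixed time windows and report a tail percentile per window, so a spike can be lined up against the hardware graphs. It assumes per-request `(timestamp, latency)` pairs are available, which the aggregate output shown in this post does not provide, so the load generator or the server would have to log them. The function name and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def tail_by_window(samples, window_seconds=1.0, pct=90):
    """Group (timestamp, latency) samples into time windows and return
    {window_start: (request_count, percentile_latency)} in time order."""
    windows = defaultdict(list)
    for timestamp, latency in samples:
        window_start = math.floor(timestamp / window_seconds) * window_seconds
        windows[window_start].append(latency)

    summary = {}
    for window_start in sorted(windows):
        ordered = sorted(windows[window_start])
        rank = max(1, math.ceil(pct * len(ordered) / 100))  # nearest-rank
        summary[window_start] = (len(ordered), ordered[rank - 1])
    return summary


# Hypothetical samples: (seconds since start, latency in microseconds).
samples = [(0.1, 300), (0.5, 310), (0.9, 290),
           (1.2, 305), (1.6, 4000), (1.8, 4200),
           (2.3, 295), (2.7, 300)]

for start, (count, p90) in tail_by_window(samples).items():
    print(f"t={start:.0f}s  requests={count}  p90={p90}us")
```

The request count per window doubles as an operations/second series, so dips in throughput and spikes in latency show up on the same time axis as the hardware measurements.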


For me, benchmarking is about answering questions. I want to run something useful that extracts statistics that tell me or my readers something meaningful. The results need to reveal or confirm something.

I don’t expect all benchmarks to be scientific quality, but I do think published results should clear some minimum bar. 10 iterations with an average statistic does not reveal anything meaningful in my opinion.

Every day I’m trying to improve how I do this.