Different types of benchmarks exist, whether it’s micro benchmarks or other kinds. A benchmark setup is itself a simulated environment, so it’s hard to build a truly perfect benchmark that nobody will scrutinize. If people are left with more questions after seeing the original benchmark, that’s a sign there may be more benchmarks you should run with different setups to answer the unknowns.
Some benchmarks are really bad and provide zero value. People trust published benchmarks far too much. Please be critical. Ask questions.
I seem to encounter 3 problems the most.
- The setup makes no sense and tests the code in a config it will never have the privilege to run in.
- The benchmark is way too short to prove that this code/technology/whatever is a good idea.
- The published results offer statistics that have very little value and don’t reveal anything.
I see benchmarking as an exercise in answering a question and discovering new questions. The results should provide enough information to support the answers they claim to provide. This does not mean skewing the configuration to reach the results you want to prove.
An example of a bad benchmark
I don’t mean to pick on this particular benchmark but I want to use it as an example because I see these kinds of benchmarks more often than I’d like.
There are a couple of problems with this benchmark.
- The benchmark code only ran 10 iterations.
- It publishes only the average statistic.
I can’t think of any useful takeaway from the published results. 10 iterations is few enough to avoid all kinds of runtime behavior, and even with the raw results available, 10 iterations are not enough to derive any useful statistics. Averages don’t really tell us anything (more on that later). Luckily the code is available, so we can run it ourselves and get something useful from it by changing the configuration.
The benchmark code is useful but the executed run and its published results are not. I personally would not have published results from this run.
We don’t know what the code under test is doing. Benchmarking helps find what kind of trade-offs exist in the code and how they impact a workload. 10 iterations is not enough to get anything from. If we are benchmarking something running on the CLR, JVM or Go, when do I care about how fast my ORM runs 10 times without the GC? Never. I always care about the GC. The GC is there, and this test is too short to get it involved. I have no idea if any ORMs in this test create a really bad GC situation that will result in major performance drops. There are a lot of things not visible in a test this short. I don’t know if any ORMs eat my RAM for lunch, or if some are extremely CPU intensive. It’s just too short.
Extract useful statistics
Averages are pretty useless in my experience. This isn’t a new revelation either; I’m late to the party. I learned in production that my measuring methods were poor. The metrics I used to capture did not reflect the reality of what was going on. I was not aware of what I was putting in production until the phone rang at 3AM on a Sunday morning and some crazy shit was going on. After revising what statistics my benchmarks extract, I could see what had happened in production.
Here are some quotes from this article.
What if I told you that back in the mid-1980′s at the University of North Carolina, the average starting salary of geography students was well over $100,000? Knowing that, would you have considered making a career change?
But what if I also told you that basketball great Michael Jordan—formerly the world’s highest paid athlete—graduated from UNC with a degree in geography? Now do you believe me?
Maybe the mean isn’t always a slam dunk.
The Mean Can Mislead
In the case of Michael Jordan and fellow UNC geography graduates, the average is not a good representation of the true center of the data. Jordan’s earnings from his athletic career raises the “average” salary for geography graduates in a way that doesn’t accurately convey what graduates are likely to earn. By almost any measure, Jordan’s earnings would be an outlier.
Averages can hide nasty truths in results: by smoothing over spikes, they present an overly positive or negative picture.
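To make that concrete, here’s a minimal Python sketch with two made-up sets of response times. The numbers are hypothetical and were picked so both runs have exactly the same mean; only the max gives the spikes away.

```python
import statistics

# Hypothetical response times in milliseconds, 100 requests per run.
steady = [10.0] * 100               # every request takes 10 ms
spiky = [5.0] * 95 + [105.0] * 5    # mostly fast, with five large spikes

for name, samples in (("steady", steady), ("spiky", spiky)):
    mean_ms = statistics.mean(samples)
    print(f"{name}: mean={mean_ms:.1f} ms  max={max(samples):.1f} ms")
```

Both runs report a mean of 10.0 ms, yet one never goes above 10 ms and the other hits 105 ms on 5% of its requests.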
Choosing options that have predictable performance is a good path to running a successful production system. You can capacity plan with better accuracy and you get a whole lot fewer fire drill phone calls to fix something in the middle of the night. Sometimes something is very fast and predictable, but sometimes it’s very fast and unpredictable, and you need to know this. Averages will not surface spikes in a way you can identify. After learning some hard lessons in production, I’ll take something that’s a little bit slower with much more predictability over something that goes bat shit crazy once in a while.
In practical terms, percentiles show how far, and for what share of the data, the results move away from the average.
This is key because this is where you may see some code bases differentiate themselves from the others. Some might have averages very comparable to each other while the percentiles tell a much different story.
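Here’s a rough sketch of how percentiles can be pulled out of raw latency samples in Python. It uses the nearest-rank method, which is one of several common definitions and not necessarily what any particular tool implements. The `percentile` helper and the sample latencies are made up for illustration.

```python
import math
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in microseconds: mostly fast, a slower tail, one spike.
latencies_us = [200] * 85 + [400] * 14 + [5000]

mean_us = statistics.mean(latencies_us)
summary = {pct: percentile(latencies_us, pct) for pct in (50, 75, 90, 99, 100)}

print(f"mean: {mean_us}us")
for pct, value in summary.items():
    print(f"{pct:>3}%: {value}us")
```

With this data the mean lands at 276us, a value no request actually took, while the 90% and 100% rows show the slower tail and the single 5000us spike.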
Here are benchmark results from 2 HTTP servers.
Running 10s test @ http://skynet1:8080
  24 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   292.17us   91.68us  16.32ms   98.52%
    Req/Sec     3.44k   204.55     4.89k    58.86%
  Latency Distribution
     50%  285.00us
     75%  323.00us
     90%  347.00us
     99%  389.00us
  781005 requests in 10.00s, 195.89MB read
Requests/sec: 78,120.88
Transfer/sec: 19.59MB

Running 10s test @ http://skynet1:8000
  24 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   323.34us  618.32us  52.81ms   95.99%
    Req/Sec     3.48k     2.00k    8.56k    63.08%
  Latency Distribution
     50%  172.00us
     75%  395.00us
     90%  660.00us
     99%    1.37ms
  785369 requests in 10.00s, 196.98MB read
Requests/sec: 78,556.37
Transfer/sec: 19.70MB
The average latencies of ServerA (the first run) and ServerB (the second run) are very close at 292us and 323us. We could publish a benchmark that looks like this.
|Stat|HTTP ServerA|HTTP ServerB|
|---|---|---|
|Average latency|292.17us|323.34us|
|Requests/sec|78,120.88|78,556.37|
This looks wonderful. Both HTTP servers are relatively close in performance! Pretty close in requests/second and the average response time is also very close.
This representation is misleading because both HTTP servers actually perform quite a bit differently if we extract better statistics. One of the HTTP servers has much worse spikes in latency than the other that will affect 10% of your traffic. That is very important information to be aware of. Averages are not good enough.
Check out the latency distribution in the results above. The percentiles tell a very different story.
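|Stat|HTTP ServerA|HTTP ServerB|
|---|---|---|
|50%|285.00us|172.00us|
|75%|323.00us|395.00us|
|90%|347.00us|660.00us|
|99%|389.00us|1.37ms|
|Max|16.32ms (about 56x average)|52.81ms (about 163x average)|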
Once we put the percentiles of the latency distribution side by side we start to see how the HTTP servers differentiate. ServerB has a max response time 3.2x worse than ServerA’s max response time. 10% of ServerB’s responses are nearly 2x worse than ServerA’s.
From these results it may not be a good idea to choose ServerB over ServerA. ServerB has a requests/second increase over ServerA but it’s really only a 0.557% increase. Is this worth the headaches that the max and 90% statistics suggest are coming? If I choose ServerA I will have slightly less throughput but much more predictable performance. This makes it easier to manage and plan how many web servers I need as load grows. I’ll have less weird shit happen at night.
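The ratios quoted above are simple arithmetic on the wrk numbers. Here’s a short Python sketch with the figures copied by hand from the two runs. The dictionary layout and key names are assumptions for illustration only.

```python
# Figures copied from the two wrk runs (latencies converted to microseconds).
server_a = {"p90": 347.0, "p99": 389.0, "max": 16_320.0, "rps": 78_120.88}
server_b = {"p90": 660.0, "p99": 1_370.0, "max": 52_810.0, "rps": 78_556.37}

# How many times worse ServerB is than ServerA at each point in the tail.
max_ratio = server_b["max"] / server_a["max"]
p90_ratio = server_b["p90"] / server_a["p90"]
p99_ratio = server_b["p99"] / server_a["p99"]

# Throughput gain of ServerB over ServerA, as a percentage.
rps_gain_pct = (server_b["rps"] - server_a["rps"]) / server_a["rps"] * 100

print(f"max latency:  {max_ratio:.1f}x worse")
print(f"90% latency:  {p90_ratio:.1f}x worse")
print(f"99% latency:  {p99_ratio:.1f}x worse")
print(f"requests/sec: {rps_gain_pct:.3f}% higher")
```

This reproduces the 3.2x max, the nearly 2x (1.9x) at 90% and the 0.557% throughput gain, and also shows the 99% latency is about 3.5x worse.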
That’s that. We pick ServerA, right? What if I told you I’m actually going to pick ServerB? Why would I want a less predictable HTTP server with barely any requests/second gain? Well, when I benchmark I capture even more information than this.
At what cost
Knowing the performance characteristics using better statistics is only part of the story. What did it cost us? We don’t know yet. ServerA looks more predictable with a very negligible requests/second trade-off. I happened to capture a whole array of hardware measurements during the benchmark runs. The most interesting results come from the CPU metrics.
ServerB uses approximately 10% less CPU to achieve a higher requests/second throughput than ServerA. The trade-off is some worse latency in the 90%-99% range. That 10% of CPU is, in this case, nearly a whole CPU core free. If this HTTP server were embedded in a database to serve JSON documents and the database were processing some very CPU intensive queries, this free CPU core could be really significant in the overall end-to-end response time.
How I benchmark
I get asked often how I do my benchmarking and what tools I use. This is changing all the time. I watch a lot of people on Twitter who do benchmarking a lot better than I do so these are some things I’ve picked up along the way. I have plans to improve much further.
Here are some things I try to keep in mind each time I benchmark.
- Output statistics that bring value.
- Run different workload configurations (threads, read/write mix, etc).
- Capture hardware measurements.
Every single run (one per connection count) has the following results captured.
- Wrk output with throughput and percentile latency distribution.
- Hardware measurements in csv format.
- Hardware measurements graphed.
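Keeping the wrk output from every run is easier to work with once it’s parsed into numbers. Here’s a minimal Python sketch that pulls the latency distribution and requests/sec out of wrk’s text output. The function names, the returned dictionary shape and the regular expressions are assumptions based on the output shown in this post, not an official wrk format specification, and only the `us`, `ms` and `s` units are handled.

```python
import re

UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}

def to_microseconds(text):
    """Convert a duration such as '285.00us' or '1.37ms' to microseconds."""
    match = re.fullmatch(r"([\d.]+)(us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_TO_US[unit]

def parse_wrk(output):
    """Extract the percentile lines and the Requests/sec figure."""
    percentiles = {
        int(pct): to_microseconds(value)
        for pct, value in re.findall(r"(\d+)%\s+([\d.]+(?:us|ms|s))", output)
    }
    rps = re.search(r"Requests/sec:\s+([\d.,]+)", output)
    return {
        "percentiles_us": percentiles,
        "requests_per_sec": float(rps.group(1).replace(",", "")) if rps else None,
    }

# Excerpt of the first run shown earlier in this post.
sample = """\
    Latency   292.17us   91.68us  16.32ms   98.52%
    Req/Sec     3.44k   204.55     4.89k    58.86%
  Latency Distribution
     50%  285.00us
     75%  323.00us
     90%  347.00us
     99%  389.00us
Requests/sec: 78,120.88
"""

result = parse_wrk(sample)
print(result)
```

With every run parsed into the same shape, the runs for each connection count can be compared side by side instead of by eye.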
I capture the hardware measurements with dstat and plot them using gnuplot, but I’m going to be replacing gnuplot with R soon. I don’t have everything as automated as I want, but it’s evolving every day.
I measure hardware all the time when I benchmark. It reveals so many things that are important. For example, here are CPU measurements graphed during a 3 hour LevelDB benchmark.
One thing I’m missing right now is histograms. I need to add these because I can’t see whether operations/second dip or rise throughout any of these tests. I can see hints of it in hardware measurements but that’s not conclusive enough evidence. I want to be able to identify when those latency spikes in the 90% happened in the run. This is important because I might notice a correlating blip in the hardware measurements.
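One way to place spikes in time is to bucket timestamped samples into fixed windows and look at the count and the worst latency in each window. This is a minimal Python sketch of that idea. It assumes the load generator can emit a `(timestamp, latency)` pair for every operation, which is not something the tooling described here does today, and both the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def summarize_windows(samples, window_s=1.0):
    """Group (timestamp_s, latency_us) samples into fixed-size time windows.

    Returns {window_start_s: (operation_count, max_latency_us)}, ordered by
    time, so dips in throughput and latency spikes can be located in the run.
    """
    buckets = defaultdict(list)
    for timestamp_s, latency_us in samples:
        window_start = int(timestamp_s // window_s) * window_s
        buckets[window_start].append(latency_us)
    return {
        start: (len(values), max(values))
        for start, values in sorted(buckets.items())
    }

# Hypothetical samples: seconds since the start of the run, latency in microseconds.
samples = [(0.1, 300), (0.6, 310), (1.2, 290), (1.7, 5200), (2.3, 305)]

per_window = summarize_windows(samples)
for start, (count, worst) in per_window.items():
    print(f"t={start:>4.1f}s  ops={count}  max={worst}us")
```

In this made-up data the 5200us spike falls in the window starting at 1.0s, which is the timestamp to look up in the hardware measurements.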
For me, benchmarking is about answering questions. I want to run something useful that extracts statistics that tell me or my readers something meaningful. The results need to reveal or confirm something.
I don’t expect all benchmarks to be scientific quality but I do think there’s some kind of low barrier that should be expected from published results. 10 iterations with an average statistic does not reveal anything meaningful in my opinion.
Every day I’m trying to improve how I do this.