Write amplification

Write amplification, or “write cost,” describes how much physical disk I/O is performed for each logical piece of data written to the database.

Typically (but not always), a write to a database lands in at least two different places:

  1. The commit log, for reliability, so nothing is lost if something fails in a later stage.
  2. A form that suits the orientation of the database’s data model.

For example, Log-Structured Merge-Tree (LSM) based storage engines store SSTables (sorted string tables) on disk.
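To make the two writes concrete, here is a minimal sketch of an LSM-style write path. It is purely illustrative (the class and thresholds are made up, not any real engine): each put appends to a commit log for durability, updates an in-memory memtable, and flushes the memtable as a sorted run (an SSTable) when it fills up.

```python
import json

class TinyLSM:
    """Toy illustration of an LSM write path; not a real storage engine."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []      # stand-in for the on-disk commit log
        self.memtable = {}        # in-memory buffer of recent writes
        self.sstables = []        # each flush produces one immutable sorted run
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # 1. append to the commit log for durability
        self.commit_log.append(json.dumps({"k": key, "v": value}))
        # 2. update the memtable (the data-model-oriented form)
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # SSTables are sorted by key, which keeps later merges cheap
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

db = TinyLSM()
for i in range(7):
    db.put(f"key{i}", i)
# 7 logical writes produced 7 commit-log entries plus 2 flushed SSTables
```

Every logical record here is written at least twice on disk, which is exactly where the write cost starts.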

Often we see database or storage engine benchmarks that claim superiority based on short runs of only a few minutes, but that isn’t a realistic view of what the technology will face in a production environment.


The graph above shows CPU usage; the green dots are measurements of CPU I/O wait time, meaning the CPU is waiting for I/O requests to complete. The portion visible in this image covers about 1 hour of the 3-hour test. It’s interesting how the iowaits change over time, right? How will this look 48 hours from now?

If we look at the disk throughput measurements and zoom in on the 30-minute mark of the test, we see another interesting story: the writes are pretty consistent, and that continues through most of the rest of the test.


The test I was running unfortunately doesn’t output records inserted per second, so I can’t graph that for comparison, but it does print the total record count on screen, and that count was growing more slowly as time went on while disk throughput stayed pretty consistent. I’ll be working on instrumenting the test to output these metrics so I can plot the trend more accurately.
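One hypothetical way to instrument this: sample the tool’s running record count at fixed intervals and derive records-per-second from the deltas. The `read_total_records` callable below is a stand-in for however the load tool exposes its total; everything here is an assumption, not the actual test harness.

```python
import time

def sample_insert_rate(read_total_records, interval_s=10, samples=6):
    """Poll a monotonically growing record counter and return the
    records-per-second observed in each sampling window."""
    rates = []
    prev = read_total_records()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_total_records()
        rates.append((cur - prev) / interval_s)  # records/sec this window
        prev = cur
    return rates
```

Logging these rates alongside disk throughput would make the degradation trend plottable instead of eyeballed.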

What’s this telling us? It hints that in the beginning, X records per second were being written using a disk throughput rate of Y. Y was sustained but X degraded, which suggests the write cost per record was going up the longer the test ran.
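The arithmetic behind that hint is simple: bytes written per record is just throughput divided by insert rate. The numbers below are made up purely to show the calculation, not measurements from the test.

```python
def write_cost_per_record(disk_bytes_per_s, records_per_s):
    """Physical bytes written per logical record in a sampling window."""
    return disk_bytes_per_s / records_per_s

throughput = 50 * 1024 * 1024            # Y: a steady 50 MiB/s (hypothetical)
insert_rates = [40_000, 30_000, 20_000]  # X: degrading records/sec (hypothetical)

for x in insert_rates:
    # steady Y with falling X means each record costs more physical bytes
    print(f"{x} rec/s -> {write_cost_per_record(throughput, x):.0f} bytes/record")
```

With Y constant and X halving, the per-record cost doubles, which is the degradation the graphs were hinting at.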

Write amplification was potentially getting worse over time.

The database measured uses an LSM-based storage engine, and it was compacting, which means re-writing bytes that had previously been written. Disk utilization was being spent on previously written records, which impacts new writes currently in progress. Compacting log segments isn’t necessarily bad, but it’s something you need to keep an eye on, because the compaction strategy may not align well with your production workload characteristics. Many databases offer multiple storage engines or compaction strategies that can help with this sort of thing.
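A toy example of why compaction amplifies writes: merging two sorted runs rewrites every surviving byte. This sketch is illustrative only, not any real compaction strategy; it just counts physical bytes written versus the logical bytes that survive.

```python
def compact(run_a, run_b):
    """Merge two sorted runs of (key, value) pairs; the later run wins on
    duplicate keys, like a newer SSTable shadowing an older one."""
    merged = dict(run_a)
    merged.update(run_b)
    return sorted(merged.items())

run1 = [("a", "1"), ("b", "2"), ("c", "3")]   # older flush
run2 = [("b", "9"), ("d", "4")]               # newer flush, overwrites "b"

out = compact(run1, run2)

size = lambda run: sum(len(k) + len(v) for k, v in run)
bytes_flushed = size(run1) + size(run2)   # physical bytes from the initial flushes
bytes_rewritten = size(out)               # compaction writes all of these again
amplification = (bytes_flushed + bytes_rewritten) / bytes_rewritten
```

Every compaction pass adds another round of re-writes on top of the original flushes, so the amplification factor grows with how often (and how much) the strategy rewrites.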

In many cases I would rather have a slower database and storage engine that behaves predictably over long periods of time than one that is really fast in the beginning but can degrade badly later in production. Getting a call at 2 AM is never fun.

  • Maxime Caron

    Hi Kelly,
    I am very curious how you generated the graphs for this post, especially the one with CPU I/O wait. Is there any chance you could share some of the code you used?

    I am very good with R and Python and could draw a pretty graph myself, but I am not sure exactly how you extracted the source data for the graph?

  • Gaurav Menghani

    Can you be explicit about what the green and red dots are in the graph, and what the X axis is? I think the green is the iowait dots, and red is the disk throughput (?). Also, you mention the number of writes going slower. Is there any graph that you can share?