Some databases use memory mapped files inside the database engine so that the operating system does the heavy lifting, which reduces the need to manage both an on-disk storage model and an in-memory cache model. An interesting discussion in the comments of this post argues that allocating a large memory mapped file in the cloud increases storage costs because you pay for the allocated space long before you actually use it.
This is not true, and I’m going to dig into the engineering of how memory mapped files, file systems and cloud storage work to explain why.
Memory mapped files
First let’s review what a memory mapped file is. A memory mapped file maps a segment of virtual memory to a file on disk and gives you an API where you can read or write any offset in this potentially large array of memory (terabytes even). The OS controls what to page in and out of RAM. If you map large files it can look at times like your system has no free RAM, but when another process demands memory the OS will evict pages from RAM and re-read them from disk if those offsets are requested again. Each operating system behaves differently in its paging algorithms.
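As a minimal sketch of that API, the following uses Python’s standard `mmap` module to map a file and then read and write at an arbitrary offset. The helper name `map_and_touch`, the file name and the sizes are made up for illustration, and which pages stay in RAM is still entirely up to the operating system.

```python
import mmap
import os
import tempfile

def map_and_touch(path, size, offset, payload):
    """Create a file of `size` bytes, map it, write `payload` at `offset`, read it back."""
    with open(path, "w+b") as f:
        # Set the logical file length without writing any data ourselves.
        f.truncate(size)
        with mmap.mmap(f.fileno(), size) as view:
            # The mapping behaves like one large mutable byte array.
            view[offset:offset + len(payload)] = payload
            view.flush()  # ask the OS to write dirty pages back to the file
            return bytes(view[offset:offset + len(payload)])

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "mapped.bin")  # hypothetical file name
    print(map_and_touch(demo_path, 16 * 1024 * 1024, 4096, b"hello"))
```

Only the pages that are actually touched need to be paged in; the rest of the 16 MB range is just address space.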
Amazon AWS, Rackspace, Windows Azure and most other cloud providers don’t support dynamic disks. There may be some exceptions but once we dig into how things work under the hood it won’t matter.
Cloud providers most commonly use one of two storage charging models.
- Charge fees based on the provisioned volume size. If you create a 100GB disk you pay for a 100GB disk even if it is unused.
- Charge fees based on consumed data size. You get charged for only the data you actually use. The way this is calculated can be fairly ambiguous (more on this later).
Some cloud providers charge additional fees, such as a fee per input/output operation, and in the case of Amazon AWS you can pay to provision guaranteed IOPS.
Below are some examples.
The architecture and technology choices behind EBS volumes aren’t clear to me, but the cost structure is documented.
Amazon EBS Standard volumes
- $ per GB-month of provisioned storage
- $ per 1 million I/O requests
Amazon EBS Provisioned IOPS volumes
- $ per GB-month of provisioned storage
- $ per provisioned IOPS-month
Rackspace Cloud Block Storage
Quoted from the Rackspace website.
With Rackspace, your monthly block storage bill is transparent and predictable because Cloud Block Storage comes with free I/O. Pay only for the amount of block storage that you provision.
Quoted from the Windows Azure pricing website.
Storage is charged based on storage volume (the amount of data in Blob, Table, Queue and Drive) and storage transactions (number of read and write operations to storage).
Transactions include both read and write operations to Storage.
In the Q&A section further details are explained.
Question: If I use Storage for only a few days in a month, is the cost prorated?
Answer: Yes. Storage capacity is billed in units of the average daily amount of data stored (in GB) over a monthly period. For example, if you consistently utilized 10 GB of storage for the first half of the month and none for the second half of the month, you would be billed for your average usage of 5 GB of storage.
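The averaging in that answer can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described in the quote, assuming a 30-day month; the function name is invented and it is not the provider’s actual billing code.

```python
def average_daily_gb(daily_usage_gb):
    """Average of the daily stored amounts (in GB) over the billing period."""
    return sum(daily_usage_gb) / len(daily_usage_gb)

# 10 GB stored for the first half of a 30-day month, nothing for the second half.
usage = [10] * 15 + [0] * 15
print(average_daily_gb(usage))  # 5.0
```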
Provisioned volume size based pricing
This is the pricing model Amazon AWS and Rackspace use.
Provisioned volume size based pricing charges you for the maximum size of the volume from the beginning, so what you do with the file system really doesn’t matter. You are paying the full price. There’s no reason to discuss this pricing model in relation to memory mapped files because nothing the memory mapped file does can change the bill. If you purchased 100GB and you’re using 1GB you still pay for 100GB.
Consumption based pricing
This is the pricing model Windows Azure uses.
With consumption based pricing you pay for what you consume. If you consume 1GB on a 100GB disk you get charged for 1GB of storage. Cloud providers can do some tricks under the hood like compression or sparse files (more on this later) to reduce their data footprint on physical media. Are they charging you for the compressed bytes or the uncompressed bytes? In the case of sparse files, are they charging you for the data you asked to store or the data they actually store after sparse file optimizations? Sometimes cloud providers don’t communicate clearly how they calculate “used bytes”.
In the consumption based pricing model, one reason people are afraid of memory mapped files is that when you memory map 100GB of space and then look at the free space of the drive, it will report 100GB less free. Holy crap! I’m paying for the whole drive just because I created a large memory mapped file, right? Wrong. You aren’t being charged, and once you understand the engineering underneath it will all make sense.
It’s extremely important to remember that consumption based pricing models are about storage at the host level, not storage inside the virtual disk image. Yes, there is a difference and it’s an extremely important one. Cloud providers do not mount or read your disk image to ask the file system how many bytes it reports as used. The cloud provider only cares about the bytes it has to physically store. More proof of this is that you can deploy compressed or encrypted file systems, or your own custom file system kernel modules. The cloud provider does not peek into the file system to get usage statistics.
One of the reasons cloud providers don’t peek into the file system to get usage statistics (besides the privacy and compatibility issues) is that the information is meaningless to them: it is not a reliable measure of actual physical media consumption.
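To make the contrast between the two models concrete, here is a small sketch using the 100GB volume and 1GB of data from the examples in this post. The price constant is a made-up number for illustration only and does not reflect any provider’s real rate.

```python
# Hypothetical price for illustration only; not a real provider rate.
PRICE_CENTS_PER_GB_MONTH = 10

def provisioned_cost_cents(provisioned_gb, price_cents=PRICE_CENTS_PER_GB_MONTH):
    """Provisioned model: billed on the volume size, regardless of use."""
    return provisioned_gb * price_cents

def consumption_cost_cents(stored_gb, price_cents=PRICE_CENTS_PER_GB_MONTH):
    """Consumption model: billed on the data the provider actually stores."""
    return stored_gb * price_cents

# A 100 GB volume that holds 1 GB of data.
print(provisioned_cost_cents(100))  # 1000
print(consumption_cost_cents(1))    # 10
```

In neither function does the size of a memory mapped file appear as an input, which is the point the rest of this post argues.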
To understand why this is we need to dig into how a file system works.
File systems can do some pretty amazing tricks. In the case of memory mapped files the interaction between the operating system and the file system can sometimes confuse the end user. I’m going to demystify some of the details using the NTFS file system as an example because it is one of the file systems that I find confuses people the most when using memory mapped files.
On Windows when you map a 100GB memory mapped file the operating system will report that the file is 100GB and that the drive has 100GB more used space. Although Windows reports this as used space, you can think of it more like reserved space. Nothing has been written to the disk except for some file table metadata and a bitmap index (a big array of 0s and 1s). This is a very small amount of data. The key to all the trickery is the bitmap index.
A file system allocates contiguous groups of sectors called clusters. In NTFS, inside the MFT (Master File Table), there is a special $BitMap file that keeps track of used and unused clusters on the volume. Every bit set to 1 in the $BitMap file represents a used cluster. When a file is deleted the bits in the $BitMap file are set back to 0. What many don’t realize is that when you memory map a file its clusters are set to used in the $BitMap file even though no data has been written to disk for these clusters. From now on we should think of the $BitMap file as tracking both used and reserved clusters.
When you look at your drive’s used (+ reserved!) and free space in Explorer it isn’t traversing the file table and counting all the bytes. It’s reading this small amount of data in the $BitMap file, counting how many clusters are set to 1 and doing the math on cluster size.
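The idea can be modelled with a toy allocation bitmap. This is an illustrative simulation only, not NTFS code: the class name, the method names and the 4096-byte cluster size are assumptions made for the example.

```python
CLUSTER_SIZE = 4096  # bytes per cluster; an assumed value for this sketch

class ClusterBitmap:
    """Toy model of an allocation bitmap: one bit per cluster on the volume."""

    def __init__(self, total_clusters):
        self.total = total_clusters
        self.bits = bytearray((total_clusters + 7) // 8)

    def reserve(self, first, count):
        # Mark clusters as used/reserved. No file data is written anywhere.
        for c in range(first, first + count):
            self.bits[c // 8] |= 1 << (c % 8)

    def release(self, first, count):
        # Clear the bits again, as happens when a file is deleted.
        for c in range(first, first + count):
            self.bits[c // 8] &= ~(1 << (c % 8)) & 0xFF

    def used_clusters(self):
        return sum(bin(b).count("1") for b in self.bits)

    def used_bytes(self, cluster_size=CLUSTER_SIZE):
        # "Used space" is derived purely from the bitmap, not from file contents.
        return self.used_clusters() * cluster_size

volume = ClusterBitmap(1024)
volume.reserve(0, 256)       # e.g. mapping a 1 MiB file
print(volume.used_bytes())   # 1048576
volume.release(0, 256)       # deleting the file
print(volume.used_bytes())   # 0
```

The reported used space jumps by the full size of the mapping even though the only thing that changed is a handful of bits.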
Using the FSCTL_GET_VOLUME_BITMAP API to retrieve the $BitMap from NTFS we can memory map a file and see how it changes.
Disk before running
Disk after memory mapping
As you can see above, there are 39,061,759 clusters on this disk, which means there are 39,061,759 bits in the $BitMap file. That is about 4.6 megabytes of memory uncompressed. It should start to become clear now that a bitmap is a great optimization for a lot of things because of how little memory it requires: 4.6 megabytes is enough to represent the used or free state of every cluster on a several hundred GB disk.
Also shown above, it took 3.5 milliseconds to map a 97 gigabyte file on the rusty spinning disk. A mechanical disk can’t write 27 GB/ms. Nothing got written to physical media besides the MFT and the $BitMap reservations. We can also see the additional clusters flagged as used in the $BitMap. After deleting the memory mapped file the $BitMap releases the used clusters.
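The two figures quoted above are easy to check. The sketch below recomputes the bitmap size from the cluster count and the write throughput that would be implied if the mapping had really written 97 gigabytes in 3.5 milliseconds; the variable names are mine.

```python
TOTAL_CLUSTERS = 39_061_759   # one bit per cluster in $BitMap
MAPPED_GB = 97                # size of the memory mapped file
ELAPSED_MS = 3.5              # time the mapping took

# Round up to whole bytes: 8 bits per byte.
bitmap_bytes = (TOTAL_CLUSTERS + 7) // 8
bitmap_mib = bitmap_bytes / 2**20

# Throughput a disk would need if the mapping had written every byte.
implied_gb_per_ms = MAPPED_GB / ELAPSED_MS

print(f"{bitmap_bytes} bytes = {bitmap_mib:.2f} MiB")   # 4882720 bytes = 4.66 MiB
print(f"{implied_gb_per_ms:.1f} GB/ms")                 # 27.7 GB/ms
```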
Memory mapping a file doesn’t actually write any bytes for the space we mapped to physical media. Only metadata is written to the MFT and $BitMap.
The description of how a file system like NTFS works is relevant to cloud fees because it shows why it doesn’t make sense for a cloud provider to peek at the file system’s used space, which would mean reading this $BitMap index. It also doesn’t make sense for them to traverse the MFT and count all the bytes. Neither approach would work with compressed, encrypted, custom or obscure file systems that the cloud provider may not handle.
Like most implementations using bitmaps, $BitMap is an optimization that lets the file system avoid expensive I/O. It makes no sense to base cloud fees on something derived from $BitMap. I’ll prove this a little later in this post.
What the cloud provider cares about is bytes on physical media. They don’t (and justifiably shouldn’t) care about file systems. Using file system optimizations like compression or sparse files can actually save you costs in cloud storage because they demand less storage on physical media.
I’m going to map a 9 gigabyte file in the cloud on Windows Azure, which uses a consumption based billing model, to prove to you this is how things work (if mapping 97 gigabytes in 3.5 milliseconds didn’t convince you).
Windows Azure uses fixed size VHDs, but its architecture information and diagrams describe the underlying storage as blob storage, which is used for replication.
The fixed format lays the logical disk out linearly within the file, such that disk offset X is stored at blob offset X. At the end of the blob, there is a small footer that describes the properties of the VHD. Often, the fixed format wastes space because most disks have large unused ranges in them. However, in Windows Azure, fixed .vhd files are stored in a sparse format, so you receive the benefits of both the fixed and dynamic disks at the same time.
Even if we pretend that memory mapping a new file did issue writes to the disk via the file system, we still wouldn’t see significant consumption on physical media, because the built-in sparse file support doesn’t write anything to the physical disk for runs of contiguous 0s. All it does is set the $BitMap bits.
Most modern file systems support sparse files, including those of most Unix variants.
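On a POSIX system you can see the difference between a file’s logical size and the space actually allocated for it with a short sketch like the one below. It assumes the file system supports sparse files and that `st_blocks` is reported in 512-byte units, as it is on Linux; `st_blocks` is not available on Windows, and the helper name and file name are invented for the example.

```python
import os
import tempfile

def sparse_file_sizes(path, logical_size):
    """Create a file of `logical_size` bytes without writing data.

    Returns (logical bytes, allocated bytes) as reported by the file system.
    """
    with open(path, "wb") as f:
        f.truncate(logical_size)  # extend the file; no data blocks are written
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units (true on Linux).
    return st.st_size, st.st_blocks * 512

with tempfile.TemporaryDirectory() as tmp:
    logical, allocated = sparse_file_sizes(os.path.join(tmp, "sparse.bin"), 64 * 1024 * 1024)
    print(f"logical: {logical} bytes, allocated on disk: {allocated} bytes")
```

On a file system with sparse file support the allocated figure should be a tiny fraction of the logical size, because nothing but metadata has been stored.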
On Thursday August 15th 2013 I created a 10GB storage volume on Windows Azure, then mounted and formatted the drive on a Windows Server 2012 virtual machine instance. Windows Azure updates billing information once a day, so I left this volume mounted for several days to get a few billing points in the graph that would tell me how much it cost simply to mount the empty volume.
On Sunday August 18th 2013 I created a 9GB memory mapped file on the 10GB storage drive.
As you can see in the billing history above, on the 15th I wasn’t charged for creating a 10GB disk and on the 18th I wasn’t charged for consuming 9GB of the disk. This was entirely expected because I didn’t write any data to the disk. Only when I write data to clusters on the disk will I start to get billed for the bytes I’ve consumed. It’s as simple as that.
Whether the cloud provider’s charging model is based on provisioned volume size or on consumption, using a memory mapped file does not affect your billing. The fear that memory mapped files cost you more is unfounded, and if someone tells you otherwise you should be sceptical.
There is no magic in the cloud that reads file systems to charge you money for bytes you never wrote to physical media in the first place. Understanding the engineering behind these systems makes that clear.