Mar
5

Using bitmap indexes in databases

By kellabyte  //  Architecture, Databases  //  9 Comments

There are many different data structures used in databases to create indexes used to quickly evaluate queries. Each one has different strengths and weaknesses based on the tradeoffs they make on memory, cpu and storage (if persisted). One of these types of indexes is called a bitmap index. For the purpose of this post I’m going to be using relational database terminology but the same techniques can apply to different database types like column oriented databases or many others non-relational databases.

Like a bloom filter, a bitmap index represents membership. A bitmap index is an array of bits, a bit represents each row where each bit set to 1 represents membership in the set.

To illustrate this better lets create an example table with 4 rows but consider it could be a table with 100 million rows.

TABLE USERS

UserId Name Country
100 Jane Canada
101 Joe USA
102 John Germany
103 Julie USA

Creating a bitmap index on the country column would give us 3 bitmap indexes (3 arrays of bits) that looks like the following.

COUNTRY INDEX

Canada
[1, 0, 0, 0]
USA
[0, 1, 0, 1]
Germany
[0, 0, 1, 0]

Now a bitmap index for the Name column which creates 4 bitmap indexes.

NAME INDEX

Jane
[1, 0, 0, 0]
Joe
[0, 1, 0, 0]
John
[0, 0, 1, 0]
Julie
[0, 0, 0, 1]

In addition to the bitmap indexes, we need a map that maps index positions to rows in the database (more on why later).

ROW MAP

[100, 101, 102, 103]

Types of data

Bitmap indexes do well at indexing categorial data with well defined values like countries, dates and ages as a few examples. Bitmap indexes don’t do well with numerical data with continuous values like prices or kilometres per hour that you may want to sum or average. A 1 in the array of bits represents membership in the set for a unique term which makes them suitable for WHERE A = B type evaluations and not WHERE A > B evaluations.

Memory and storage

If you have 100 million rows in your database, the storage for the Country column index would be 3 bitmap indexes with 100 million bits (12 megabytes) each taking a total 36MB.

Using bitmap indexes means managing a lot of indexes because you need a new array of bits per unique term. That’s one of the cons of a bitmap index. If you have 100 million rows and each has a unique term (say a timestamp) you would create bitmap indexes for each timestamp where only 1 bit is set out of 100 million bits.

Fortunately there are ways to tackle this problem called Compressed Bitmaps which I will cover later.

Evaluating a query

With the indexes and row map defined we can jump back to the high level and explain how a query like this works.

SELECT *
FROM USERS
WHERE Name = ‘Julie’ AND Country = ‘USA’

There’s a lot of material and research on how to optimize query operations such as choosing which order to do things based on cardinality and many other aspects. In these examples I’m not explaining the process of optimizing a query execution plan, just explaining the basics.

To evaluate which rows match the criteria we can do bitwise operations like bitwise AND and OR’s.

STEP 1: Evaluate Name = ‘Julie’ AND Country = ‘USA’

We take the 2 bitmap indexes that represent the terms “Julie” and “USA” and we bitwise AND them.

    0001 (Julie)
AND 0101 (USA)
  = 0001

STEP 2: Map results to row map

Take all the set bits and map the row key.

[0, 0, 0, 1]
= rowmap[3]
= rows["103"]

STEP 3: Select results

With the results from the where clause evaluated into row keys now we can use however our database storage works to fetch the rows and the columns selected in the query to return.

Deletes

Depending on the database implementation, deleting a row in a database may cause a bunch of index rebuilds to occur but luckily with bitmap indexes you can still respond to queries with responses that exclude the deleted rows of data while these indexes rebuild in the background. Create a new bitmap index that represents deleted rows then bitwise AND against this bitmap index with any query results. In a bitmap index that represents deletes, a bit set to 1 represents a valid row and a bit set to 0 represents a deleted row.

Using the query mentioned above, if we deleted Julie the operations would look something like this.

    0001 (Julie)
AND 0101 (USA)
AND 1110 (Deletes)
  = 0000 (No matching rows!)

The bitmap index representing deletes is temporary and can be removed once the indexes affected by the row deletion have been rebuilt.

Group By and Joins

Bitmap indexes can be used for evaluating a lot of different parts of a query such as GROUP BY and JOIN clauses. Evaluating a join is a multi-step process because you need to evaluate the bitmap indexes from the right side join table and then take the resulting bitmap and bitwise AND it against the query result on the left side like we did on the deletes example. One to many to one relationships get even more complicated because there is an intermediary table facilitating the relationship. Explaining joins is a much more involved example I will leave for another blog post.

Bitmap indexes in OLAP databases

An OLAP (online analytical processing) database is typically used for reporting and business intelligence where there is a need to query data in many different ways (dimensions) with quick response times and is sometimes called an “OLAP Cube”. OLAP data is usually stored with a star schema where there is a fact table and multiple dimension tables. If we were analyzing sales, the sales table would be the fact table. Country and day of week would be dimension tables. This star schema would allow us to do analytical queries answering questions like “top 10 sales by country on Saturday” giving you insight on which countries you may want to increase promotions.

Some OLAP databases use bitmap indexes. When the OLAP cube is being generated it creates pre-calculated results so that queries can be extremely fast. When the cube is being generated it’s going through a materialization stage where it is evaluating every possible bitmap index combination for the dimension tables and creates all the resulting bitmaps.

The nice thing about using bitmap indexes in an OLAP database, since everything is an array of bits, you can choose not to materialize everything. If a pre-calculated result explodes the cube due to the size of possible combinations you can do what is called partial materialization where you don’t materialize everything but you can still evaluate bitmaps from parts that are materialized with bitmaps that aren’t materialized.

Accelerating with SIMD & GPU’s

Included in many modern processors is SSE2 which provides SIMD (Single instruction, multiple data). Using SIMD to evaluate the bitwise AND or bitwise OR operations would give the ability to evaluate (depending on CPU) 128 bits at once. With Intel AVX (Advanced Vector Extensions) in Sandy Bridge and AMD BullDozer up to 256 bits is supported. This would be a significant increase in performance for the operations explained earlier.

Using GPGPU could also accelerate processing even further but this poses additional challenges around getting data in and out of the GPU VRAM. If your indexes are larger than the VRAM available you will need to swap data in and out of the GPU many times.

Compressed Bitmap Index

One of the cons to using bitmap indexes is the amount of bitmap indexes that get created. You need to create one bitmap index the size of the row count per unique term. Luckily run-length encoding fits this problem really well. If you have really high cardinality columns, you will likely have most bits set to 0 and these can be compressed very well. Compressed bitmaps allow you to do the bitwise AND and bitwise OR operations in the compressed form avoiding having to decompress these large bitmaps before using them. Sparsity of membership when using run-length encoding will dictate how effective the compression rate is.

You can find some really good compressed bitmap implementations at the following links.

C#
Java
C++
Ruby

Daniel Lemire wrote most of the above compressed bitmap implementations and he has a lot of amazing papers and blog posts on database topics. I highly recommend checking out his website and reading his work.

Conclusion

Compressed bitmap indexes can be really useful for evaluating queries on really large datasets quickly and offer a good compromise between memory efficiency and processing speed since in many cases they can be faster than uncompressed bitmaps. I only cover some basic uses of bitmap indexes, there are a lot of different ways to approach data indexing and query execution plans.

Special thanks to Cliff Moon for pointing me in the right direction when I was learning bitmap indexes.

  • drdamour

    isn’t a key feature of bitmap indexes that they can index a column with a lot of null values very efficiently of “IS NULL” type filters?

  • http://smallsoftlab.blogspot.com/ Vladimir Marinkovic

    Hello,

    First of all, I just want to say that this is one very well written blog post. I really enjoyed reading it.
    It covers all of different aspects nicely, so no additional reading is required in order to understand the main topic.

    I especially like the part where you mentioned how bitmap indexing works for OLAP multidimensional querying.
    Many people that work with BI solutions don’t pay much attention to it, but it can definitely help for better understanding of how OLAP processing is done.

    Thanks a lot!

  • http://Thisisn'tnewstuff meshuggenah

    You’re article makes it sound like this is relatively new technology. Model 204, a mainframe database, has been using bit map indexes for over 40 years. Nice of C# et al to finally catch up!

  • http://twitter.com/demius Alexis

    Thanks for the fascinating post K – damn interesting stuff

  • Michael Guyver

    Hi,

    Until AVX2 arrives POR operations can only operate on 128 bits as per the SSE instructions. The 256-bit “integer” instructions are due to arrive in AVX2.

    From the Intel Volume 2 manual, 4-204:

    VPOR xmm1, xmm2, xmm3/m128 — AVX — Bitwise OR of xmm2/m128 and xmm3.

    Regards

    Michael

  • http://lemire.me/en/ Daniel Lemire

    @kellybyte Thanks for the good words.

    @meshuggenah Bitmaps predate (slightly) model 204, however compressed bitmaps are more recent (probably due to Oracle).

  • http://www.symas.com Howard Chu

    Run Length encoding has certainly been around for ages. Another mechanism to consider is block encoding. Instead of a purely sequential stream of bits, a block has an address (giving its offset into the theoretical full stream of bits) followed by a chunk of the bitmap. The advantage over pure RLE is that chunks are of fixed size so you can locate any entry using binary search, whereas for a pure RLE stream you must start from the beginning and do a linear search.

    e.g., you could have blocks (0,1111000011110000) (16,0011001100110011) and so on. If the map is sparse and mostly zeros, you can omit large ranges of them entirely.
    (0,…)
    (16,…)
    (256,…)
    (272,…)
    (1024,…)

  • Vinod

    Concise and simple explanation with excellent examples.

    Thank you!

  • kendyshop

    Hey,

    i have a question. How can i determinate the size of a bitmap index in disk blocks? For example: i have a disk block size of 4KB with a header size of 96 Bytes. I have a bitmap on a column of type char(8) with 6 different value. The relation has 1000 tuples. Can somebody explains me, how can i calculate the size of the bitmap in disk block?

    Thank