Getting started using Apache Cassandra
Cassandra is an open source, scalable and highly available “NoSQL” distributed database management system. Cassandra claims to offer fault-tolerant, linear scalability with no single point of failure. Among the diverse solutions available under the “NoSQL” category (I don’t like this term, here’s why), such as ColumnFamily, Document, Graph and Key-Value databases, Cassandra sits in the ColumnFamily camp. Cassandra combines a Google BigTable style data model with an Amazon Dynamo like infrastructure and tuneable consistency. The Cassandra data model is designed for large scale distributed data and trades ACID compliance for performance and availability.
Cassandra is optimized for very fast and highly available writes. Cassandra writes first to a commit log on disk for durability, then writes to an in-memory structure called a memtable. A write is successful once both steps are complete. Writes are batched in memory and later flushed to disk in a table structure called an SSTable (sorted string table). Memtables and SSTables are created per column family. With this design Cassandra has minimal disk I/O and offers high speed write performance, because the commit log is append-only and Cassandra doesn’t seek on writes. If a fault occurs before the in-memory data has been written to an SSTable, Cassandra can simply replay the commit log. For more information on how Cassandra operates with writes read this article.
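To make the write path above a little more concrete, here is a toy sketch in Python. It is not Cassandra's actual implementation: the class name `TinyWritePath`, the `flush_threshold` parameter and the in-memory lists standing in for disk files are all assumptions made for illustration.

```python
class TinyWritePath:
    """Toy model of the write path for one column family (illustrative only)."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory structure: row key -> columns
        self.sstables = []     # each flush produces one sorted, immutable table
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, row_key, column, value):
        # 1. Append to the commit log first, for durability (no seek needed).
        self.commit_log.append((row_key, column, value))
        # 2. Then apply the write to the memtable.
        self.memtable.setdefault(row_key, {})[column] = value
        # Writes are batched in memory and flushed once enough rows pile up.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memtable as a table sorted by row key.
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}
        self.commit_log = []   # these entries now live in an SSTable

    def replay(self):
        # After a fault, rebuild the memtable from the commit log.
        self.memtable = {}
        for row_key, column, value in self.commit_log:
            self.memtable.setdefault(row_key, {})[column] = value
```

The point of the sketch is the ordering: every write hits the log before the memtable, so anything that has not yet been flushed can be recovered with `replay()`.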
If you need BIG data and fast writes, Cassandra is worth a look.
Data Model
The Cassandra data model has 4 main concepts which are cluster, keyspace, column family and super column. I’ll leave out super column in this post to keep things simple.
Clusters contain many nodes (machines) and can contain multiple keyspaces.
A keyspace is a namespace to group multiple column families, typically one per application.
A column contains a name, value and timestamp (left out in the diagram).
A column family contains multiple columns, grouped into rows that are each referenced by a row key.
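One way to picture these concepts is as nested maps. The Python sketch below is only a mental model, not how Cassandra stores data. The `blog` keyspace and `posts` column family match the ones created later in this post; the row key, column names and helper functions are made up for illustration.

```python
import time


def make_column(name, value):
    # A column is a name, a value and a timestamp.
    return {"name": name, "value": value, "timestamp": time.time()}


# cluster -> keyspace -> column family -> row key -> columns
cluster = {
    "blog": {          # keyspace, typically one per application
        "posts": {},   # column family
    },
}


def insert(keyspace, column_family, row_key, name, value):
    # Look up (or create) the row for this key, then set one column in it.
    row = cluster[keyspace][column_family].setdefault(row_key, {})
    row[name] = make_column(name, value)


# Hypothetical row key and column names, for illustration only.
insert("blog", "posts", "first-post", "title", "Hello Cassandra")
insert("blog", "posts", "first-post", "body", "My first post")
```

Reading a value back is then a chain of lookups, for example `cluster["blog"]["posts"]["first-post"]["title"]["value"]`.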
Installing
Cassandra is written in Java and can run on a vast array of operating systems and platforms. A lot of this post is relevant no matter what platform you are on. Since we are going to be jumping into how to use Cassandra in C#, we will install Cassandra on a Windows machine. Fortunately the great people at DataStax (amazing bunch over there!) have created a Windows installer for their version of Cassandra. The DataStax version of Cassandra comes bundled with a lot of handy things I’ll cover later. The Windows installer also installs the Java runtime and makes getting Cassandra up and running a breeze.
After installing you’ll find the following in your start menu.
Using the CLI
Run the Cassandra CLI Utility. Using the CLI we can create keyspaces, column families, etc. Let’s connect to our local node and create a blog keyspace.
Now let’s create the posts column family which will store our posts.
Now let’s add a couple sample blog posts.
Through the CLI we can also view what is stored in the column family.
Using DataStax OpsCenter
The DataStax version of Cassandra comes with an application that DataStax developed called OpsCenter. OpsCenter lets us view all kinds of metrics from our Cassandra cluster and also allows us to create keyspaces and column families.
Here’s what the main view of OpsCenter looks like.
On the data modeling view we can see our blog keyspace which also shows us our posts column family.
The keyspace and column family we built via the CLI could also have been built through the OpsCenter user interface. The standard version of Cassandra from Apache doesn’t come with OpsCenter, and there’s a lot of functionality there. I suggest taking a look at it and poking around.
Using the CQL Shell we can use a SQL-like query language called CQL (Cassandra Query Language) to query our keyspace.
Accessing Cassandra in C#
Now let’s write some code to access the posts we stored in the posts column family. There are 2 Cassandra libraries for .NET that I have used so far: Fluent Cassandra and Cassandraemon. I’m going to use Cassandraemon for this example.
There are client libraries for many languages available to access data from Cassandra, making it easy to use from almost anywhere.
One more thing…
As always, I love to think of unconventional things to try. To prove how portable Cassandra and the client C# code are, I ventured to make this work on a dual core 1 GHz ARM Cortex A9 embedded SBC (Single Board Computer). This device can be compared to the dual core platforms available on smart phones these days.
My goal was to get big data on a little device.
First I had to install a Java runtime for Linux on an ARM platform which I found here. After installing I downloaded the tarball for the DataStax version of Cassandra. I was able to get it running very easily.
After getting Cassandra up and running I needed a Mono runtime for an ARM platform to run my .NET executable. “apt-get mono” was sufficient to get what I needed.
With those installed I was ready to go! I copied my bin\Debug directory to the embedded device and ran the executable that was compiled via Visual Studio. That’s right, I didn’t need to recompile! The binary straight from the Intel box developed with Visual Studio was used and run with Mono on the ARM based embedded device.
This is a testament to how portable Mono is. I did not change or recompile Cassandraemon or my application code. It just worked.
With Cassandra, not only can we use high performance systems and scale to very big data, but we can also have big data on little devices such as this one, which did all this powered by only a 5 V USB cable.