Cassandra is a highly scalable, distributed database originally built at Facebook. It's now a top-level Apache open source project with widespread adoption, and is deployed at scale by Netflix, eBay, Spotify and Apple.
It's main strengths are:-
The diagram above illustrates a globally distributed system whereby users gain millisecond performance on a cluster of machines at each location, while results are automatically replicated globally. Using Amazon EC2, if the Singapore based system crashed, the workload could automatically resume using London and New York based servers at a reduced speed.
This benefit was illustrated in April 2011 when a major US-East outage on Amazon left many web sites down, but Netflix operations running Cassandra were unaffected because of cross-regional replication in place.
Like any specialised, highly tuned tool, Cassandra is built to solve a specific problem. High volume, low latency read/write lookups from a massive user base. It does however have significant limitations:-
An Oracle database supports immediate consistency as there's only one copy of the data. However, on a distributed database, the data is replicated to multiple independently running nodes, and to maximise throughput, we may relax the consistency rule, and settle for eventual consistency.
The diagram below illustrates the situation where an update is applied to one node, and eventually written to the two other replicas. In the interim (maybe a few microseconds), a reader can potentially read the old (stale) copy of the data. Eventually, all three replicas will have consistent results - hence the name.
While not the best solution in every case, the Netflix Benchmark demonstrates, the benefits of relaxing the consistency rule, as read throughput increased from 600,000 to nearly a million reads per second. This clearly illustrates the trade off of performance against consistency and resilience. If you don't need the guaranteed absolute latest entry, it's a significant performance gain.
Cassandra is also highly flexible, and supports 11 levels of consistency. The most commonly used are:-
Note: If an operation is written using ALL consistency, it can be safely read using ONE which maximises read throughput over writes. This gives huge flexibility (and opportunity for hard to identify bugs), but it's a powerful option to have available.
Although Cassandra has flexible support for consistency, this must not be confused with "read consistency" or support for transactions - a feature often taken for granted on traditional database platforms.
Effectively on Cassandra, every write operation is implicitly committed, and immediately visible to all readers without delay.
Take a simple example. An application which transfers money from a savings to a current account would typically involve two operations committed in a single transaction. One to deduct from the savings account, and another to credit the current account. In Cassandra, another user could read the balance of both accounts mid-way through with potentially catastrophic results.
Cassandra does support a lightweight test and set operation to set a value depending upon a value, and the ability to batch load multiple inserts, but this provides at most, a rather basic level of lightweight transaction support.
In conclusion, if you have a web or sensor streaming based application with a massive or highly unpredictable workload needing very low latency read/write operations using mainly unique keys Cassandra is an excellent option. If you can deploy a cloud based solution, and need potentially massive scalability, all the better.
If however you have a Data Warehouse workload, summarising billions of rows, a database like Vertica might be a better option. Likewise, if you need full transactional support you should look elsewhere - perhaps at VoltDB.
There are over 150 NoSQL databases available, many of which support massive throughput and scalability. However, where Cassandra is unique is in providing tuneable consistency along with high level of resilience and native replication across a potentially globally distributed system.
Provided you can work within the lack of transaction support and eventual consistency, it's a potentially powerful tool to deploy.
Thanks for reading my article. If you found it useful, do follow me or leave a comment, or even better - share with your colleagues.
After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.