In my previous post I described how Hadoop compares to a traditional database solution from Oracle. In this article I'll describe the so-called NoSQL databases, and explain when to deploy them.
While Spark SQL, Impala and Hive are potential solutions for interactive analytic queries across petabyte-sized data sets, they have a severe limitation: none of them is suited to fast random access or high-velocity transactional updates.
Where fast random access and OLTP-like functionality are needed, the most sensible solution is a traditional relational database; but when processing semi-structured data or internet-scale throughput, there's a new product class available: the so-called NoSQL database.
There are literally hundreds of options available, each with different strengths, but NoSQL databases tend to:

- Scale out horizontally across clusters of commodity hardware
- Relax the relational model and full ACID transaction guarantees
- Deliver very high throughput for simple, key-based access paths
Be aware that most NoSQL databases are open source, and many run on Hadoop HDFS, although some also run stand-alone and independent.
OK, before you nod off to sleep, bear with me. It's a simple concept with practical implications for any Big Data project.
Let's take a typical Big Data architecture problem. Assume you're designing a sales support system for a globally distributed workforce in London, New York and Singapore. You need to provide the latest sales totals by customer to help support your clients. You need sub-second response times on screen with a maximum transaction latency of 30 seconds. That means a transaction posted in London must be visible in Singapore within 30 seconds.
As we need sub-second response times, we need a cluster of machines at each of the three locations, and data replication between them.
We have three primary challenges:

- Consistency: every site must see the same sales totals
- Availability: the system must keep responding at every site
- Partition Tolerance: the system must survive a network failure between data centres
Can you see the problem with the requirements above? They're partly mutually exclusive.
For example, if the link to Singapore fails, you cannot ensure transactions in Singapore are correctly reflected in London: the consistency requirement is broken. Likewise, if you insist upon 100% data consistency, you cannot guarantee 100% availability if the link between data centres fails.
The CAP Theorem (first presented by Eric Brewer in 1998) states that a distributed system can deliver at most two of these three guarantees at once. Put simply, you must choose two and compromise on the third.
A typical relational database is deployed on a single server, and sacrifices Partition Tolerance in favour of Consistency and Availability. Likewise, any distributed NoSQL database must pick two of the three.
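The trade-off above can be made concrete with a toy simulation. This is a minimal sketch, not real database code: the `Replica` class, the `mode` flag and the behaviour during a partition are all invented for illustration. A "CP" system rejects writes it cannot replicate (sacrificing availability), while an "AP" system accepts them and lets the sites diverge (sacrificing consistency).

```python
# Toy illustration of the CAP trade-off (not a real database):
# two replicas of a sales total, plus a flag simulating a network partition.

class Replica:
    def __init__(self):
        self.total = 0

def write(primary, secondary, amount, partitioned, mode):
    """Apply a sale to the primary and replicate it to the secondary."""
    if partitioned and mode == "CP":
        return False            # unavailable: cannot confirm replication
    primary.total += amount
    if not partitioned:
        secondary.total = primary.total
    return True                 # AP mode: accepted even during a partition

london, singapore = Replica(), Replica()

write(london, singapore, 100, partitioned=False, mode="CP")
assert london.total == singapore.total == 100   # normal operation: consistent

ok = write(london, singapore, 50, partitioned=True, mode="CP")
# CP: the write is rejected (ok is False), so both sites stay at 100

write(london, singapore, 50, partitioned=True, mode="AP")
# AP: London now shows 150 while Singapore still shows 100
```

Note the choice only bites while the partition exists; once connectivity returns, an AP system can re-converge, which is exactly the "eventual consistency" discussed below.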
Based upon an original Google design (BigTable), Apache HBase is a column-oriented key-value store built for very low-latency processing on a massive scale. Random direct access is available via a single predefined key, and data changes can be automatically versioned as entries are updated.
Best suited to high-volume transaction processing on massive data sets, HBase is used by Facebook to process 80 billion transactions per day on its Messenger application, with a peak throughput of two million operations per second. Like most NoSQL solutions, HBase is an excellent choice for high-volume OLTP, provided access to the data is direct, via a single key.
Conversely, HBase is a poor fit for data warehouse applications, which require ad-hoc analysis over millions of rows, and where inserts tend to be batch operations with few updates.
In terms of CAP, HBase favours Consistency and Partition Tolerance over Availability: it relies upon a single master node, so a master failure briefly interrupts service. However, as data is hosted on HDFS there's almost zero risk of data loss, even after a hardware failure, and a master node failure is normally recovered within seconds via fail-over.
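The cell-versioning behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the HBase API: the `VersionedStore` class and its methods are invented for this sketch, but they mimic how HBase keeps a bounded number of timestamped versions per (row key, column) cell and serves the newest by default.

```python
import time
from collections import defaultdict

# Minimal sketch of HBase-style cell versioning (not the real HBase API):
# each (row_key, column) cell keeps timestamped versions; reads return
# the newest one, and only the most recent N versions are retained.

class VersionedStore:
    def __init__(self, max_versions=3):
        self.cells = defaultdict(list)      # (row, column) -> [(ts, value), ...]
        self.max_versions = max_versions

    def put(self, row, column, value, ts=None):
        versions = self.cells[(row, column)]
        versions.append((ts if ts is not None else time.time(), value))
        versions.sort()                     # oldest first
        del versions[:-self.max_versions]   # discard versions beyond the limit

    def get(self, row, column):
        versions = self.cells[(row, column)]
        return versions[-1][1] if versions else None

store = VersionedStore()
store.put("cust:1001", "sales:total", 500, ts=1)
store.put("cust:1001", "sales:total", 750, ts=2)   # newer version wins on read
assert store.get("cust:1001", "sales:total") == 750
```

Because every read and write is addressed by a single key, lookups stay O(1) regardless of table size, which is precisely why this model scales but cannot support ad-hoc analytic scans.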
It's ironic that in Greek mythology Cassandra had the gift of prophecy but was cursed by Apollo so that her true predictions would never be believed, because Apache Cassandra compromises on data Consistency.
Its primary strength is its guaranteed availability, and its ability to be deployed as a geographically distributed solution. By default, however, it provides only "eventual consistency" between nodes.
In the example of the sales system above, this means transactions posted in New York will eventually be replicated to London and Singapore, but there might be a slight delay. It also means that if connectivity to New York is temporarily lost, each site remains available, and consistency is automatically re-established once the problem is resolved.
Clearly, if you're running a bank payments system managing global financial transfers, this is not an appropriate solution, but most use-cases don't demand 100% consistency every time. There is the option to insist upon consistency for a particular operation (e.g. cash withdrawals), but those transactions will then fail if connectivity to the other machines is down. As above, the CAP rules apply.
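Eventual consistency is easy to picture with a toy model. Again, this is a hedged sketch rather than Cassandra's actual replication machinery: the `Site` class, the `pending` queue and the `sync` step are invented to show the idea that local writes succeed immediately while replication to peers is deferred, so a remote read can briefly return stale data.

```python
# Toy model of eventual consistency (an illustration, not Cassandra itself):
# each site applies writes locally at once, while replication to peers is
# deferred until a sync step runs, so remote reads can be briefly stale.

class Site:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.pending = []                  # writes not yet applied here

def write(site, peers, key, value):
    site.data[key] = value                 # local write succeeds immediately
    for peer in peers:
        peer.pending.append((key, value))  # replication happens later

def sync(site):
    for key, value in site.pending:       # e.g. when connectivity returns
        site.data[key] = value
    site.pending.clear()

new_york, london = Site("NY"), Site("LDN")
write(new_york, [london], "cust:1001", 750)

assert new_york.data["cust:1001"] == 750   # visible locally straight away
assert "cust:1001" not in london.data      # London is briefly stale
sync(london)
assert london.data["cust:1001"] == 750     # eventually consistent
```

A "strong consistency" option would amount to making `write` wait for every peer's `sync` before returning, which is exactly why such writes fail when a peer is unreachable.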
Similar to HBase, Cassandra is designed for high-velocity transaction processing, and data can be replicated across geographically dispersed clusters, which suits this kind of use-case. Cassandra has its own SQL-like interface, CQL (Cassandra Query Language), but can also be accessed directly from Apache Spark.
Unlike most open-source NoSQL databases, MongoDB is built and supported by a single company, so it benefits from full vendor support. Its name is a play on the word "humongous", and like many other NoSQL databases it scales to petabyte data sizes.
The unique strength of MongoDB, however, is its ability to handle unstructured and semi-structured data, for example text documents and JSON files. It's remarkably flexible, and is capable of supporting multiple indexes (including document indexes).
Like HBase, it favours Consistency and Partition Tolerance over Availability, as it uses a single master node. It is, however, a highly reliable solution: eBay uses it to deliver media metadata with 99.999% availability, and although it integrates well with Hadoop and Spark it can also be deployed standalone.
Like other NoSQL databases, it does not implement the relational model, but it can be queried through a SQL interface using Apache Drill.
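The combination of flexible, per-document schemas and secondary indexes can be illustrated with a minimal in-memory sketch. To be clear, this is not MongoDB's actual API or storage engine; the `DocumentStore` class and its methods are invented here purely to show why indexing a field turns a full scan into a direct lookup even when documents have different shapes.

```python
# Minimal sketch of a document store with a secondary index
# (illustrative only; not MongoDB's real API or storage engine).

class DocumentStore:
    def __init__(self):
        self.docs = {}          # _id -> document (an arbitrary dict)
        self.indexes = {}       # field -> {value: set of _ids}

    def create_index(self, field):
        index = {}
        for _id, doc in self.docs.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(_id)
        self.indexes[field] = index

    def insert(self, _id, doc):
        self.docs[_id] = doc
        for field, index in self.indexes.items():
            if field in doc:                # maintain every index on write
                index.setdefault(doc[field], set()).add(_id)

    def find(self, field, value):
        if field in self.indexes:           # indexed: direct lookup
            return [self.docs[i] for i in self.indexes[field].get(value, set())]
        return [d for d in self.docs.values() if d.get(field) == value]

store = DocumentStore()
store.create_index("region")
store.insert(1, {"name": "Acme", "region": "EMEA", "notes": "free-form text"})
store.insert(2, {"name": "Globex", "region": "APAC"})   # schema varies per doc
assert store.find("region", "APAC") == [{"name": "Globex", "region": "APAC"}]
```

Notice that the two documents carry different fields; nothing like a table definition constrains them, which is the flexibility the paragraph above describes.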
There are hundreds of potential database solutions available in the following broad categories:

- Key-value stores
- Column-family stores (e.g. HBase, Cassandra)
- Document stores (e.g. MongoDB)
- Graph databases
The question is, when to use a NoSQL database?
Clearly the traditional relational database solutions from Oracle and IBM have dominated the market for the past 30 years in both OLTP and Analytic workloads, with MPP solutions from Teradata and the like carving out a significant market niche for scalable data warehouse applications.
Recently, however, Hadoop has started to make inroads into the data warehouse marketplace, and Apache Spark and Cloudera Impala have demonstrated impressive performance summarising billions of rows on essentially commodity hardware, with massive scalability.
NoSQL solutions, however, are best considered when you have:

- Massive data volumes, approaching the petabyte scale
- Very high transaction throughput with simple, key-based access paths
- Semi-structured data such as text documents or JSON files
- A need for geographically distributed deployment and high availability
You could argue (and Oracle will) that a relational database will satisfy most if not all of the above requirements, and that the simplest solution is by far the best to adopt. However, it's no longer a case of "one size fits all", especially when your data sizes approach the petabyte scale.
Do keep in mind though - Keep it Simple.
Thanks for reading this post. If you found it helpful, do click on the link below to spread the knowledge. Even better, leave a message and say hello!
After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.