
NoSQL in Plain English

In my previous post I described how Hadoop compares to a traditional database solution from Oracle. In this article I will describe the so-called NoSQL databases, and explain when to deploy one.

Why not use Hadoop?

While Spark SQL, Impala and Hive are potential solutions for interactive analytic queries across petabyte-sized data sets, they have severe limitations:-

  • No Random Access to Data: Hadoop provides no indexes or random-access lookups on a single row. Instead it is designed to summarise and present massive data sets at scale.
  • Append Only: It's not possible to update or delete rows in place, although data can be appended to an existing data set. This means raw HDFS is not suitable for OLTP-like workloads.

Where fast random access and OLTP-like functionality are needed, the most sensible solution is a traditional relational database, but for semi-structured data or internet-scale throughput there's a new product class available: the so-called NoSQL database.

What’s different about NoSQL?

There are literally hundreds of options available, each with different strengths, but NoSQL databases tend to:-

  • Not use SQL: The clue is in the name. Instead they have a custom-built query language, although many are accessible via other SQL tools including Spark and Apache Drill.
  • Support High Velocity Workloads: They are built to process massive data velocity - tens of thousands of inserts per second across hundreds of nodes.
  • Support Massive Data Volumes: Many can automatically shard (spread out) data across a number of machines, and support very fast retrieval of massive (petabyte-sized) data sets with ease.
  • Are Implemented on a Distributed System: To achieve the high velocity and massive data volumes they must scale out.
  • Avoid a Relational Model: They tend not to support the highly flexible (but often performance-constrained) relational model. Database design therefore tends to be centred around efficient data access for a specific application rather than a general-purpose query solution, and designs tend to be highly denormalised. The gains in performance, scalability and throughput make this compromise worthwhile, but it does mean NoSQL databases will remain a relatively niche solution.

Be aware that NoSQL databases are, in most cases, open source, and many run on Hadoop HDFS, although some also run stand-alone and independent.

The CAP Theorem

OK, before you nod off to sleep, bear with me. It's a simple concept with practical implications for any Big Data project.

Let's take a typical Big Data architecture problem. Assume you're designing a sales support system for a globally distributed workforce in London, New York and Singapore. You need to provide the latest sales totals by customer to help support your clients. You need sub-second response times on screen with a maximum transaction latency of 30 seconds. That means a transaction posted in London must be visible in Singapore within 30 seconds.

As we need sub-second response times, we need a cluster of machines at each of the three locations, and data replication between them.

We have three primary challenges:-

  1. Consistency: Ideally data in all three locations are consistent with each other. This means a sale posted in London is visible at the exact same instant in New York and Singapore. If it fails to complete, it's rolled back everywhere at the same instant.
  2. Availability: This is a mission-critical system where downtime is not an option.
  3. Partition Tolerance: If the link between any two data-centres fails, the system must keep going. The entire Far East cannot stop accepting requests because the link to London fails.

Can you see the problem with the requirements above? They're partly mutually exclusive.

For example, if the link to Singapore fails, you cannot ensure transactions in Singapore are correctly reflected in London - the consistency requirement is broken. Likewise, if you insist upon 100% data consistency, you cannot guarantee 100% availability if the link between data centres fails.

The CAP Theorem (first presented by Eric Brewer in 1998) states that of these three requirements, you can fully satisfy only two. Put simply, you need to choose any two and compromise on the third.

A typical relational database is deployed on a single server, and sacrifices Partition Tolerance in favour of Consistency and Availability. Likewise, any distributed NoSQL database must choose which two of the three to favour.

Apache HBase

Based upon an original Google solution (BigTable), HBase is a column-oriented key-value store designed for very low-latency processing on a massive scale. Random direct access is available via a single predefined key, and data changes can be automatically versioned as entries are updated.

Best suited to high volume transaction processing on massive data sets, HBase is used by Facebook to process 80 billion transactions per day on the "Messenger" application, with a peak throughput of two million operations per second. Like most NoSQL solutions, HBase is an excellent solution for high volume OLTP processing, provided access paths to the data are direct lookups using a single key.
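To make that concrete, here's a minimal sketch of that single-key access pattern in Python, using the third-party happybase client, which talks to HBase via its Thrift server. The host, table and column family names are invented for the example, not taken from Facebook's system.

```python
# Minimal sketch of single-key random access against HBase using the
# third-party happybase package (pip install happybase); it requires the
# HBase Thrift server to be running. All names below are illustrative.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("customer_sales")

# Insert (or update) a single row in place - something raw HDFS cannot do.
table.put(b"customer#C042", {b"totals:ytd_sales": b"125000"})

# Random access by row key returns a single row, however large the table.
row = table.row(b"customer#C042")
print(row[b"totals:ytd_sales"])
```

The point is simply that every read and write is addressed by the row key; there's no ad-hoc scanning across millions of rows.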

Conversely, HBase is a poor solution for Data Warehouse applications which require ad-hoc analysis over millions of rows, and where inserts tend to be batch operations with few updates.

In terms of CAP, HBase favours Consistency and Partition Tolerance over Availability. HBase is a highly reliable solution, but it relies upon a single active master node, which is the potential weak point for availability. However, as data is hosted on HDFS, there's almost zero risk of data loss, even after a hardware failure, and even a master node failure is normally recovered within seconds by failing over to a standby master.

Apache Cassandra

It's ironic that in Greek mythology Cassandra had the gift of prophecy, but was cursed by Apollo so that her truthful predictions would never be believed, because Apache Cassandra compromises on data Consistency.

Its primary strength is its guaranteed availability, and its ability to be deployed in a geographically distributed solution. It does, however, default to "eventual consistency" between nodes.

In the example of the sales support system above, this means transactions posted in New York will eventually be replicated to London and Singapore, but there might be a slight delay. However, it also means that if connectivity to New York is temporarily lost, each site remains available, and consistency is automatically re-established once the problem is resolved.

Clearly if you're running a bank payments system, managing global financial transfers, this is not an appropriate solution, but most use-cases don't demand 100% consistency every time. There is the option to insist upon consistency for a particular operation (e.g. cash withdrawals), but this does mean those transactions will fail if connectivity to other machines is down. As above, CAP rules apply.

Similar to HBase, Cassandra is designed for high transaction processing velocity, and data can be replicated across geographically dispersed clusters, which suits this kind of use-case. Cassandra has its own SQL-like interface, CQL (the Cassandra Query Language), but it can also be accessed directly by Apache Spark.
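As a rough illustration, here's a minimal Python sketch using the open source cassandra-driver package: writes use the default (eventual) consistency, while a read that genuinely needs stronger guarantees opts in to QUORUM. The node, keyspace, table and column names are invented for the example.

```python
# Minimal sketch using the cassandra-driver package (pip install cassandra-driver).
# Host, keyspace, table and column names below are illustrative only.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["london-node1", "newyork-node1", "singapore-node1"])
session = cluster.connect("sales")

# Default behaviour: the write is acknowledged quickly and replicated to the
# other data centres eventually.
session.execute(
    "INSERT INTO sales_by_customer (customer_id, sale_id, amount) VALUES (%s, %s, %s)",
    ("C042", "S1001", 250.00),
)

# Opt-in stronger consistency: QUORUM needs a majority of replicas to respond,
# so this read can fail if connectivity between sites is lost - CAP rules apply.
query = SimpleStatement(
    "SELECT amount FROM sales_by_customer WHERE customer_id = %s AND sale_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(query, ("C042", "S1001")).one())
```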

MongoDB

Unlike most open source NoSQL databases, MongoDB is built and supported by a corporation, so it benefits from full vendor support. Its name is a play on the word "humongous", and like many other NoSQL databases it scales to petabyte data sizes.

The unique strength of MongoDB, however, is its ability to handle unstructured and semi-structured data, for example text documents and JSON. It's remarkably flexible, and is capable of supporting multiple indexes (including document indexes).
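As a quick illustration of that flexibility, here's a minimal sketch using the official pymongo driver; the connection string, database, collection and field names are invented for the example.

```python
# Minimal sketch using the official pymongo driver (pip install pymongo).
# The connection string, database, collection and fields are illustrative.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["sales"]["customers"]

# Store a semi-structured (JSON-like) document - no fixed schema is required,
# and nested fields and arrays are perfectly acceptable.
customers.insert_one({
    "customer_id": "C042",
    "name": "Acme Ltd",
    "contacts": [{"name": "Jane", "city": "London"}],
    "notes": "Prefers quarterly billing.",
})

# Secondary indexes (including on fields nested inside documents) support
# fast lookups without scanning the whole collection.
customers.create_index([("contacts.city", ASCENDING)])
print(customers.find_one({"contacts.city": "London"}))
```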

Like HBase, it favours Consistency and Partition Tolerance over Availability, as it uses a single master (primary) node. It is, however, a highly reliable solution: eBay uses it to deliver media metadata with an availability of 99.999%. Although it integrates well with Hadoop and Spark, it can also be deployed standalone.

Similar to other NoSQL databases, it does not implement a flexible relational model, but it can be queried through a SQL interface using Apache Drill.

Summary of Strengths

  • Apache HBase for consistency and high-volume, random-access transaction processing over Hadoop HDFS.
  • Apache Cassandra for a highly available solution, especially in geographically distributed systems using data replication between clusters.
  • MongoDB for its remarkably flexible data model - excellent for processing unstructured documents, especially text and JSON. It's also supported directly by the vendor.

When to consider a NoSQL database?

There are hundreds of potential database solutions available in the following categories:-

  • Traditional Relational databases. Including IBM DB2, Oracle and Microsoft SQL Server
  • MPP Warehouse Solutions. Including Teradata, Netezza, Vertica and Greenplum
  • SQL oriented column storage over Hadoop. Including Cloudera Impala, Apache Spark and Hive
  • NoSQL databases including HBase, Cassandra and MongoDB

The question is, when should you use a NoSQL database?

Clearly the traditional relational database solutions from Oracle and IBM have dominated the market for the past 30 years in both OLTP and Analytic workloads, with MPP solutions from Teradata and the like carving out a significant market niche for scalable data warehouse applications.

Recently, however, Hadoop has started to make inroads into the data warehouse marketplace, and Apache Spark and Cloudera Impala have demonstrated impressive performance in summarising billions of rows on essentially commodity hardware, with massive scalability.

NoSQL solutions however, are best considered when you have:-

  • Random Access OLTP Workloads: When you need to perform single-row insert or update operations using primarily random-access lookups, as opposed to batch or micro-batch data append operations.
  • Massive Data Volumes: When the volume exceeds the ability to store and process the data on a single machine.
  • Very High Data Velocity: When data arrives at a speed and volume that would overwhelm a single server.
  • Massive Scalability Requirements: When your workload (users or data volumes) doubles every few months, which implies the need for a scale-out solution.
  • Geographically Dispersed Users: When your users are globally dispersed, and need very fast (low-latency) access to data replicated across the globe.

You could argue (and Oracle will) that a relational database solution will satisfy most, if not all, of the above requirements, and that the simplest solution is (by far) the best to adopt. However, it's no longer a case of "one size fits all", especially when your data sizes approach the petabyte scale.

Do keep in mind though - Keep it Simple.

Thank you

Thanks for reading this post.  If you found it helpful, do click on the link below to spread the knowledge.  Even better, leave a message and say hello!

About the Author: John Ryan

After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.
