
Big Database Technology 2 – Architecture

In my previous article I gave a short summary of the meaning of Database Scalability.  In this article I’ll describe the most common architectures used to scale to Big Data sizes.

Database Hardware Architectures

Over the past 30 years, there have been three primary architectures used to deploy a database:

  1. Shared Memory: Effectively shared everything. A database running on a single machine with multiple CPUs sharing memory and access to disks. Contention across all three causes the scalability constraints. The majority of small to medium-sized databases use this architecture.
  2. Shared Disk: Uses a number of closely clustered machines to provide a single database service, all accessing a single shared copy of the data (e.g. Oracle RAC).
  3. Shared Nothing: Also known as MPP (Massively Parallel Processing), typically employed by large Data Warehouse systems. The machines are clustered but work largely independently, with the data spread across the available resources (e.g. Vertica or Greenplum).

Each option has different features and constraints and, depending upon the database solution, different scalability options.

Shared Memory Architecture

The diagram below illustrates the key hardware components of the simplest solution – Shared Memory.

In this solution, users connect to a single database server, typically with locally attached disks. This works well for many databases, and is the traditional (and sensible) option for most applications. Scalability options are, however, limited to Scale Up: adding CPUs, memory or disks.

The main benefits and drawbacks of a Shared Memory solution are:

  • Simplicity: This is a tried and tested, battle-hardened architecture that’s been proven to work well for over 30 years (I started with Oracle version 5 in 1986). It requires no special architecting, works on anything from a laptop to a high-end Unix machine, and is equally at home in the cloud or on the desktop.
  • Limited Scalability: Given that modern servers can host multiple processors, gigabytes of memory and terabytes of disk, there’s plenty of scope for growth, although there is eventually an upper limit.
  • High Availability at a Cost: Enterprise-class machines become increasingly expensive, as the hardware includes dual redundant disks, networks and power supplies. Likewise, for full HA you’ll need to factor in an off-site hot-standby system.

For most requirements, this may well be the simplest and cheapest solution. Increasingly, cloud deployment should be considered, even for small to medium-sized systems, as this can help control capital costs and provide a managed, relatively painless upgrade path.

Shared Disk Architecture

The diagram below illustrates the Shared Disk Architecture, in which users are transparently routed (perhaps using load balancing) to a database server in a cluster of machines. As the name implies, the disk system is shared by all nodes in the cluster (and is therefore a potential bottleneck), and is typically provided by a separate NAS or SAN disk system.

Under the hood, this is a relatively complex solution, as the nodes are closely networked to act as a single machine. This can create performance and scalability challenges when, for example, a data item is simultaneously changed on two different nodes, and the data is repeatedly “pinged” between the two. As a result, this deployment can require some application changes to sensibly partition the workload across the servers. At its extreme, it’s possible (for example) to deploy a data warehouse using one node for ETL processing and another dedicated to servicing user read queries.

The main benefits and drawbacks of a Shared Disk solution are:

  • Upgrade Path: It can provide an excellent upgrade path from a single-node system, provided it’s deployed sensibly.
  • The Risk of Pinging: Can limit performance, and if users can execute updates from any node, the risk of lock contention across the cluster will increasingly limit overall scalability. For this reason, many deployments of Oracle RAC don't exceed three nodes.
  • Cost of SAN Disks: The solution relies upon a fast shared disk system which on a high end SAN solution can prove expensive.
  • Inefficient Caching: As each node can potentially access the same data, it’s likely the data will be held (duplicated) in each cache, making less efficient use of memory.
  • High Availability: Is not built in, and dual redundant networks, power supplies, RAID disks and a standby database can add considerably to the overall cost.

As an on-premises upgrade path from a single database server, this provides a useful roadmap for growth. It’s sensible, however, to limit cluster size to around three nodes for most solutions before considering re-platforming to a larger machine.

Shared Nothing Architecture

The diagram below illustrates the Shared Nothing Architecture whereby data is spread (sharded) across a number of independently running machines that act as a single cluster. Unlike the Shared Disk solution, each node is responsible for only a part of the data. In effect, nothing is shared.

Unlike the Shared Disk solution, which relies upon hardware for high availability, this solution can automatically replicate data to other nodes, providing a significant element of built-in high availability. These principles are applied to a range of solutions including Hadoop HDFS, NoSQL databases like Cassandra, and MPP column data stores including Vertica and Greenplum.

Unlike the Shared Disk solution, which can be deployed with little upfront design, an MPP database needs careful attention to data sharding and co-location to maximise scalability and performance. If, for example, most queries are confined to a geographic region, it makes sense to shard data across the cluster by region rather than by business area. Once deployed, switching to a different sharding and data partitioning strategy is no small operation on a multi-terabyte system.
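As a rough illustration, here’s a minimal sketch of region-based sharding in Python. The shard map, node names and Row type are invented for the example; a real MPP database would manage the routing itself.

    # A minimal sketch of region-based sharding: each row is routed to a node
    # based on its region, so queries scoped to one region touch one shard.
    from dataclasses import dataclass

    # Hypothetical shard map: region -> owning node.
    SHARDS = {"EMEA": "node-1", "AMER": "node-2", "APAC": "node-3"}

    @dataclass
    class Row:
        customer_id: int
        region: str          # the sharding key
        business_area: str   # not part of the shard key

    def shard_for(row: Row) -> str:
        """Route a row (or a query filtered by region) to its owning node."""
        return SHARDS[row.region]

    print(shard_for(Row(42, "EMEA", "Retail")))   # node-1: one node does the work
    # A query filtered only by business_area must scan every shard instead,
    # because business_area plays no part in data placement.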

The main benefits and drawbacks of a Shared Nothing database are:

  • Need for Sharding: This solution relies upon sensible data placement and sharding to work well. Some (e.g. Key/Value lookup solutions) have built-in automatic sharding; others need the sharding strategy carefully designed into the application.
  • Cheaper Hardware: An MPP solution can be deployed on relatively inexpensive hardware with directly attached storage. In addition, scalability costs are incremental, typically incurred by adding nodes.
  • High Availability: If data is automatically replicated across nodes, the solution can provide built-in (inexpensive) high availability, as queries destined for a failed node can be re-executed on the others.
  • Fast Single Node Reads: Queries served by a single node run fast and leave other nodes free to satisfy other requests. Unlike the shared disk solution, each node caches only the data it’s responsible for, making efficient use of memory.
  • Write Scalability: As there’s no distributed lock manager, this can be a great solution for high throughput insert/update operations, provided they only affect a single data shard. From a write perspective it’s linearly scalable, as each node is responsible for only a sub-set of the data. Updates that affect multiple nodes, however, need a distributed two-phase commit, which can slow operations (see the sketch after this list).
  • Read Scalability: While both reads and writes work well when all the required data is on a single node, read queries spanning several shards (perhaps the entire database) set an absolute limit on scalability. Effectively, every node in the cluster can handle a given number of queries per second, and a complex analytic query across every node counts as one of those queries on each. There are ways to work around this (e.g. materialised views, summaries and data replication), but it’s a constraint to be aware of.
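To make the write scalability point concrete, below is a toy two-phase commit coordinator in Python. The Node class and its methods are hypothetical stand-ins rather than any particular database’s API; a production implementation would also need to handle coordinator failure, timeouts and recovery logging.

    # A toy two-phase commit: an update spanning several shards needs every
    # participating node to vote in a prepare round before any node commits,
    # adding network round trips that a single-shard write avoids.
    class Node:
        def __init__(self, name):
            self.name = name

        def prepare(self, txn) -> bool:
            # Acquire locks and log the pending change, then vote yes/no.
            return True

        def commit(self, txn):
            pass   # make the prepared change permanent

        def rollback(self, txn):
            pass   # discard the prepared change

    def two_phase_commit(nodes, txn) -> bool:
        # Phase 1 (prepare): all participants must vote yes - one round trip.
        if not all(node.prepare(txn) for node in nodes):
            for node in nodes:
                node.rollback(txn)
            return False
        # Phase 2 (commit): only now does every node apply the change - a second round trip.
        for node in nodes:
            node.commit(txn)
        return True

    # Two shards touched means two nodes held up by the protocol.
    two_phase_commit([Node("node-1"), Node("node-2")], txn={"order_id": 1})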

As I’ve described before, NoSQL databases can provide nearly unlimited linear scalability for high-velocity workloads on a Shared Nothing architecture.

Cassandra in particular has been demonstrated to scale to nearly 300 nodes, processing over 200,000 writes per second during a Netflix benchmark test.

Of course there are significant downsides.

NoSQL Scalability Limits

NoSQL databases address a very narrow use case: in many situations, single-key lookup operations where the data is guaranteed to be retrieved from a single node. If a Cassandra or Riak database is deployed on 50 nodes and the workload doubles, simply adding another 50 nodes will solve the problem, providing the same response time to double the user population.

This principle breaks down, however, if we use secondary indexes that cut across the sharding key. Access data via a secondary index on MongoDB, Cassandra or Riak, and you’ll potentially execute the query against every node in the cluster. Run several of these concurrently, and you’ll quickly hit the scalability limits of your entire system.
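A minimal sketch of the difference, assuming a hypothetical 50-node cluster with hash-based sharding (the node list and query_node function are invented for illustration):

    from concurrent.futures import ThreadPoolExecutor

    NODES = ["node-%d" % i for i in range(50)]

    def query_node(node, predicate):
        # Stand-in for a per-node shard scan returning that node's matching rows.
        return []

    def lookup_by_key(key):
        # The sharding key identifies exactly one node: one unit of work,
        # leaving the other 49 nodes free to serve other requests.
        node = NODES[hash(key) % len(NODES)]
        return query_node(node, "key = %r" % key)

    def lookup_by_secondary_index(email):
        # Scatter: every node must run the query. Gather: merge the results.
        # A single query like this consumes capacity on all 50 nodes at once.
        with ThreadPoolExecutor() as pool:
            results = pool.map(query_node, NODES, ["email = %r" % email] * len(NODES))
            return [row for shard in results for row in shard]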

In summary, when queries run across multiple nodes of a Shared Nothing cluster, scalability will drop.

NoSQL and Eventual Consistency

If you’ve ever seen an error page on LinkedIn apologising that your data is temporarily unavailable, you’ve experienced the limits of Eventual Consistency used by NoSQL databases. In short, you’re not guaranteed to find the data (or the latest version of it) when you need it.

Relational databases work within ACID transactions: once you press save, the server only responds when the change is durable and guaranteed to be consistently visible to everyone. NoSQL databases, on the other hand, tend to compromise on consistency in favour of speed and availability.

When data is written, the process can return control immediately, once it’s written to a majority of replica copies, or (slowest of all) only when it’s written to every copy. This introduces application complexity, as you need to handle these unexpected (sometimes temporary) data quality issues. You also need to accept that you cannot guarantee 100% data accuracy in the event of hardware or systems failure.
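As a simplified model of these write consistency levels (commonly called ONE, QUORUM and ALL in databases such as Cassandra), consider the sketch below. The Replica class is a stand-in for illustration, not a real driver API.

    # The client regains control once the required number of replicas have
    # acknowledged the write; any remaining replicas catch up asynchronously.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}   # stand-in for this replica's local storage

        def write(self, key, value) -> bool:
            self.data[key] = value
            return True      # acknowledgement

    def write_with_consistency(replicas, key, value, required_acks) -> bool:
        acks = 0
        for replica in replicas:
            if replica.write(key, value):
                acks += 1
            if acks >= required_acks:
                return True   # fast path: don't wait for the remaining replicas
        return acks >= required_acks

    replicas = [Replica("r%d" % i) for i in range(3)]
    write_with_consistency(replicas, "user:1", "Alice", required_acks=1)   # ONE: fastest, least safe
    write_with_consistency(replicas, "user:1", "Alice", required_acks=2)   # QUORUM: majority
    write_with_consistency(replicas, "user:1", "Alice", required_acks=3)   # ALL: slowest, safest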

Of course, the bigger the cluster, the more hardware failures. For example, given a hard disk mean time between failures (MTBF) of 10-50 years, a cluster with 10,000 disks should expect on the order of one failure per day.
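A quick back-of-the-envelope check of that arithmetic:

    # Expected disk failures per day = number of disks / MTBF in days.
    disks = 10_000
    for mtbf_years in (10, 50):
        failures_per_day = disks / (mtbf_years * 365)
        print("MTBF %d years -> ~%.1f failures/day" % (mtbf_years, failures_per_day))
    # MTBF 10 years -> ~2.7 failures/day
    # MTBF 50 years -> ~0.5 failures/day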

NewSQL Database Technology

NewSQL databases attempt to provide the high availability, massive scalability and millisecond latency of NoSQL databases, but with the flexibility of a relational database and full ACID transactions. For strict OLTP workloads (short-lived, fast transactions) with the potential for massive scale, these provide a compelling solution, with potential throughput reaching millions of transactions per second.

Some, like NuoDB, CockroachDB and Google Spanner, are designed for elastic cloud-based solutions and separate data processing from storage nodes. This brings the potentially significant benefit of being able to scale the two independently.

Finally, VoltDB, from the Michael Stonebraker stable (Ingres, Postgres and Vertica), is a horizontally scalable, fully ACID-compliant database built for streaming applications. Designed to capture, process and analyse sensor data in near real time (milliseconds), it provides strong transactional guarantees, and it even survived the remarkably challenging Jepsen test of distributed system safety in the face of hardware failure.

Historical Context and Conclusion

As little as ten years ago the Shared Memory/Disk/Nothing architectures were the only serious options available from the big three vendors – Oracle, Microsoft and IBM.

The massive explosion of data volumes handled by Google, Amazon and LinkedIn led to the development of NoSQL databases, which provide fast Key/Value lookup operations to complement the entirely batch-oriented processing available in Hadoop.

The subsequent explosion of open source development means there are nearly 300 databases available, including Key-Value, Document, Graph and text search engines, with specialised solutions for every use case.

It’s unlikely every database will survive, but it’s good to see the 30-year-old relational model completely reinvented in the NewSQL databases, while remaining largely transparent to the developer. The relational database is not dead yet.

Thank You

Thanks for reading this far - there was a lot to say. If you found this interesting, share it with colleagues or press the like button below. You may also be interested in the next article in the sequence.

About the Author: John Ryan

After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.

