Despite multiple challenger technologies (e.g. object-oriented databases), the relational databases from Oracle, IBM and Microsoft have reigned supreme for over 30 years.
This dominance has, however, been increasingly challenged by a new wave of technology solutions, which started with database appliances from Teradata and Netezza. Driven by Big Data requirements, these were followed by open-source NoSQL databases including HBase and Cassandra. Development continued with NewSQL databases including VoltDB and MemSQL, and finally Hadoop HDFS (although not a database) is providing a significant challenge as a potential data store.
In this article, I will summarise the traditional approaches available to provide database scalability, comparing the benefits and drawbacks of each. In part two, I’ll describe the different database architectures, and in the final article, I’ll describe how the new challenger solutions fit into the overall picture.
First, it’s sensible to define some terms.
Most systems don’t run at 100% capacity, and must build in headroom to absorb a temporary spike in traffic without a significant drop in performance. However, this does not make them scalable; it simply means there is spare capacity. Scalability refers to the options available to cope with a longer-term increase in traffic once that headroom is exhausted. The typical options are described as vertical (scale up) or horizontal (scale out).
System performance refers to throughput (transactions per minute) or average response time. Although closely correlated, performance and scalability are best treated separately, as they often require different approaches. In my experience, performance improvements are gained through efficient database design, while scalability is determined during the selection of an appropriate hardware architecture.
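To make the throughput/response-time distinction concrete, Little’s Law relates the two: the average number of in-flight transactions equals throughput multiplied by average response time. A minimal sketch (the figures are hypothetical, chosen purely for illustration):

```python
# Little's Law: concurrency = throughput * response_time.
# Hypothetical system: 1,200 transactions per minute with an
# average response time of 0.5 seconds.

def required_concurrency(tpm: float, response_secs: float) -> float:
    """Average number of transactions in flight (Little's Law)."""
    tps = tpm / 60.0          # convert transactions/minute to transactions/second
    return tps * response_secs

print(required_concurrency(1200, 0.5))  # 10.0 transactions in flight
```

This is why the two measures can move independently: a system can double its throughput (scalability) while individual response times (performance) stay flat or even degrade.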
In short, you need to build scalability into the hardware architecture and database selection, and can, for the most part, maximise performance later, during the database design and deployment phase. Do this in the wrong sequence, and you’ll find your scalability options severely limited.
Typically associated with cloud-based solutions (either on-premises or hosted externally), elasticity refers to the ability of a system to grow or shrink rapidly as processing demands change, often dynamically. This implies manual or automatic hardware allocation on a cluster to best match the available resources to the demands of a given task. Elasticity is one of the greatest benefits of cloud-based solutions, and can be used both to control costs and to make more efficient use of machine resources.
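In practice, an elastic scaling policy can be as simple as a pair of utilisation thresholds. The sketch below is a hypothetical policy (all thresholds and limits are assumptions for illustration), not any particular cloud provider’s API:

```python
# A minimal sketch of an elastic scaling policy (hypothetical thresholds):
# grow the cluster when average utilisation is high, shrink it when low.

def target_nodes(current_nodes: int, avg_utilisation: float,
                 scale_up_at: float = 0.75, scale_down_at: float = 0.30,
                 min_nodes: int = 2, max_nodes: int = 32) -> int:
    if avg_utilisation > scale_up_at:
        return min(current_nodes * 2, max_nodes)   # double under heavy load
    if avg_utilisation < scale_down_at:
        return max(current_nodes // 2, min_nodes)  # halve when mostly idle
    return current_nodes                           # within the comfort zone

print(target_nodes(4, 0.85))  # 8 - spike in demand, scale out
print(target_nodes(8, 0.10))  # 4 - demand drops, release capacity
```

The cost-control benefit follows directly: capacity you release when demand drops is capacity you stop paying for.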
Assuming you need to scale your system, there are two options: scaling up or scaling out.
The diagram above illustrates the situation where we add disk, memory or processing capacity to the system, eventually migrating to a larger hardware platform. Typically, however, neither the benefits nor the costs are linear: faster disks, processors and networks add significant cost. In addition, as most systems are constrained by a performance bottleneck, increasing capacity in one area often shifts the bottleneck to another, and doubling the size of the machine seldom doubles its capacity.
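Amdahl’s Law gives a useful rule of thumb for why doubling the machine seldom doubles capacity: only the parallelisable fraction of a workload benefits from extra capacity. A quick worked example (the 80% figure is an assumption for illustration):

```python
# Amdahl's Law: overall speedup is limited by the serial fraction
# of the workload, no matter how much capacity is added.

def speedup(parallel_fraction: float, capacity_multiple: float) -> float:
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / capacity_multiple)

# Hypothetical workload that is 80% parallelisable:
print(round(speedup(0.8, 2), 2))    # 1.67x from doubling capacity
print(round(speedup(0.8, 100), 2))  # 4.81x - even 100x hardware caps out below 5x
```

The serial 20% (lock contention, log writes, coordination) quickly dominates, which is exactly the “bottleneck shifting” effect described above.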
The scale-out option implies a distributed system, whereby additional machines are added to a cluster to provide extra capacity. This approach is more likely to yield a linear increase in scalability, although not necessarily increased performance.
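A common mechanism behind scale-out data stores is hash-based sharding: each row’s key determines which node owns it, so both data and load spread across the cluster. A minimal sketch (the key format and node counts are hypothetical):

```python
# Hash-based sharding: a key's hash, taken modulo the node count,
# decides which cluster node owns that row.
import hashlib

def owning_node(key: str, node_count: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % node_count

keys = [f"customer:{i}" for i in range(1000)]
for nodes in (4, 8):
    counts = [0] * nodes
    for k in keys:
        counts[owning_node(k, nodes)] += 1
    print(nodes, counts)  # keys land roughly evenly across the nodes
```

Note that this naive modulo scheme reshuffles most keys whenever a node is added; production systems typically use consistent hashing to limit that data movement, which is one source of the scale-out drawbacks discussed below.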
The arguments for/against scaling up include:

- For: architectural simplicity, as everything runs on a single machine with none of the complexity of a distributed system, and existing skills and software usually carry over unchanged.
- Against: high-end hardware carries a significant price premium, there is a hard ceiling on how large a single machine can grow, and the machine remains a single point of failure.
The arguments for/against scaling out include:

- For: capacity can grow incrementally using commodity hardware, costs tend to rise more linearly, and a well-designed cluster can survive the loss of individual nodes.
- Against: distributed systems add significant complexity, inter-node network traffic can become the new bottleneck, and adding nodes may require data to be redistributed across the cluster.
As in most aspects of IT, it’s important to start with a clear understanding of the problem, and to separate the challenge of performance from the bigger, architectural requirement of scalability. Many would love to try out new technologies, including NoSQL databases, but (as you’ll see in the next article) these come with significant drawbacks, even if they do offer near-unlimited scalability.
If expected database growth is more organic, it may be more sensible to consider scaling up the existing hardware platform, although if you’re already running an MPP cluster or have hit the hardware limits, then a scale-out architecture may be more appropriate. Again, you need to be aware of the potentially significant drawbacks.
The most obvious (but important) take-away, however, is that you can design and tune for performance, but once your platform is selected and installed, your options to scale have already been decided. Architect for scalability, then design and tune for maximum performance.
Thanks for reading this. If you found it interesting, do follow me or leave a comment. Even better, share it with colleagues.
After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.