Although Hadoop and Big Data (whatever that is) are the new kids on the block, don’t be too quick to write off relational database technology. In this article I’ll explain the differences between the two approaches and the benefits of each.
As much as the marketing hype would have us believe, Hadoop is NOT a database, but a collection of open source software built around a distributed storage framework (HDFS) to manage very large data sets. Its primary purpose is the storage, management and delivery of data for analytical purposes. It’s hard to talk about Hadoop without getting into keywords and jargon (for example, Impala, YARN, Parquet and Spark), so I’ll start by explaining the basics.
At the very core of Hadoop is HDFS (Hadoop Distributed File System). So it’s not a database after all: at its core it’s a file system, but a very powerful one.
It’s impossible to really understand Hadoop without understanding its underlying hardware architecture, which gives it two of its biggest strengths: scalability and massive parallel processing (MPP) capability.
To illustrate the difference, the diagram below shows a typical database architecture in which a user executes SQL queries against a single large database server. Despite sophisticated caching techniques, the biggest bottleneck for most Business Intelligence applications is still the ability to fetch data from disk into memory for processing. This limits both the system’s processing power and its ability to scale, that is, to grow quickly to deal with increasing data volumes.
As there’s a single server, it also needs expensive redundant hardware to guarantee availability. This includes dual redundant power supplies, network connections and disk mirroring, which on very large platforms can make this an expensive system to build and maintain.
Compare this to the Hadoop distributed architecture below. In this solution, the user executes SQL queries against a cluster of commodity servers, and the entire process runs in parallel. As the effort is distributed across several machines, the disk bottleneck is less of an issue, and as data volumes grow the solution can be extended with additional servers, to tens or even hundreds of nodes.
Hadoop has automatic recovery built in: if one server becomes unavailable, the work is automatically redistributed among the surviving nodes, which avoids the huge cost overhead of expensive, custom-made, dual redundant hardware. This can be a huge advantage for availability, as a single machine can be taken down for service, maintenance or an operating system upgrade with zero system downtime.
Hadoop has several other potential advantages over a traditional RDBMS, most often explained by the three (and increasing) Vs: Volume, Velocity and Variety.
The advent of The Cloud leads to an even greater advantage (although not another “V” in this case) – Elasticity.
That’s the ability to provide on-demand scalability using cloud-based servers to deal with unexpected or unpredictable workloads. This means entire networks of machines can be spun up as needed to deal with massive data processing challenges, while hardware costs are restrained by a pay-as-you-go model. Of course, in a highly regulated industry (e.g. Financial Services) with highly sensitive data, the cloud may well be treated with suspicion, in which case you may want to consider an “on-premises” cloud solution to secure your data.
As if the hardware advantages were not already compelling, Hadoop also natively supports column-based storage, which gives analytic queries a massive performance and compression advantage.
The diagram above illustrates the difference between the two methods. With traditional row-based storage, it’s quick to identify and fetch a single row, for example SALES row 5. This is great for transaction processing systems where we need to update the value for a single row.
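For example, a screen-based transaction system typically touches exactly one row at a time. The following is purely illustrative SQL, assuming a SALES table with a hypothetical sale_id key column:

-- Illustrative single-row update: the kind of work row-based storage excels at.
-- sale_id is an assumed key column, introduced here only for the example.
UPDATE sales
SET value = 250
WHERE sale_id = 5;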
However, analytic queries tend to fetch, summarise and process millions or billions of rows. Take the following simple query:-
SELECT team, SUM(value)
FROM sales
GROUP BY team;
On a row-based system, this query needs to fetch every column of every row into memory just to extract the team and value columns. On a table with 100 columns and a billion rows this could mean terabytes of processing (at, say, a kilobyte per row, a billion rows is already around a terabyte).
However, on a column-based solution, the same query would involve processing just two of the 100 columns, around 2% of the data, with enormous performance benefits. As an added bonus, you’ll also find significant compression advantages. For example, the repeating values in the TEAM column can be replaced using a simple dictionary encoding technique to compress the data. On a simple test of a billion rows of text this produced a roughly 50% saving, reducing a 56GB text file to 26GB using the Parquet data format.
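As a rough sketch of how you might try this yourself, assuming the data is already visible to Hive as a plain text table called sales, a CREATE TABLE ... AS SELECT statement can rewrite it in Parquet format, and the aggregate query then runs against the columnar copy:

-- Rewrite the (assumed) text-format sales table into Parquet.
CREATE TABLE sales_parquet
STORED AS PARQUET
AS SELECT * FROM sales;

-- The same aggregate now only needs to read the team and value columns.
SELECT team, SUM(value)
FROM sales_parquet
GROUP BY team;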
As a comparison of Hadoop and Oracle costs demonstrates, despite the increasing cost of scarce Hadoop skills, the benefits of Open Source software and inexpensive hardware mean it can be considerably cheaper to host a large Hadoop system than an Oracle database.
On a system storing 168 TB of data, and taking account of hardware, license costs, support and personnel, the study found Oracle cost around 200% more than the corresponding Hadoop solution. Of course that doesn’t take account of the cost of migrating data from the warehouse to Hadoop, although whether that’s sensible at all is a question for a later article.
To start, you need to choose the right tool for the job. Throughout this article I’ve repeatedly talked about Analytics, Data Warehousing and Business Intelligence. That’s because Hadoop is not a traditional database, and it is not suitable for transaction processing tasks, such as acting as the back-end data store for screen-based transaction systems.
This is because Hadoop is not ACID compliant. This means it does not guarantee the Atomicity, Consistency, Isolation and Durability properties that transactional systems rely on to keep every individual update safe.
In fact, Hadoop sacrifices ACID compliance in favour of throughput. It’s also designed to deal with large data volumes: the smallest typical unit of work is an HDFS block of around 128MB. Compare this to a typical Oracle data block of around 8KB.
All is not lost however, as most Data Warehouse processing is batch oriented, fetching, processing and storing massive data sets, and Hadoop is purpose built for this use-case.
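To illustrate that batch pattern, here’s a hedged HiveQL sketch (the sales detail table and a sales_summary target table are assumed to exist) in which a single set-based statement fetches, processes and stores the result in one pass:

-- Illustrative batch step: read the detail, aggregate it, store the summary.
INSERT OVERWRITE TABLE sales_summary
SELECT team, SUM(value)
FROM sales
GROUP BY team;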
Products like Cloudera Impala, Apache Hive and Spark SQL add low-latency, end-user SQL query and analysis tools over massive data sets. Again, ACID compliance tends to be less important in Business Intelligence systems, where 99.9% accuracy is a dreamed-of goal rather than a business-critical requirement.
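To give a feel for how these tools expose raw files in HDFS as queryable tables, here’s an illustrative HiveQL definition; the /data/sales path and the column layout are assumptions made purely for the example:

-- Expose delimited files already sitting in HDFS as a table that Hive,
-- Impala or Spark SQL can query in place.
CREATE EXTERNAL TABLE sales (
  sale_id BIGINT,
  team    STRING,
  value   DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales';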
If you do need massive OLTP throughput you might consider a NoSQL database, described in my article here.
The fact is Oracle is not going to go away any time soon. It’s been the core enterprise database platform for over 30 years, and that’s not going to change overnight. Indeed, Oracle is already adopting and adapting to the new challenger technologies with the Oracle Big Data Appliance, the Exadata Appliance and Oracle 12c In-Memory, which I’ll cover in separate articles.
I do however sense that the overall Data Warehouse architecture is changing, and that Hadoop and the plethora of technology products that come with it will each add a specialised capability to the overall stack. In the meantime, we need to be mindful of our approach, so that we deliver requirements-driven development rather than CV-driven solutions.
It’s certainly good to be living in interesting times.
Thanks for reading this far. If you found this article interesting, please share it via the links below.
After 30 years as a freelance Data Warehouse Architect, I'm turning my attention to Big Data including Hadoop, NoSQL and NewSQL.