Big data has moved from just being a buzzword to a necessity that executives need to figure out how to wrangle.
Today the adoption of big data technologies and tools have witnessed significant growth with over 40% of organizations implementing big data as forecasted by Forrester, while IDC predicts that the big data and business analytics market is set to hit an all-time high of $274.3 billion in 2022 from $189.1 billion it's expected to reach this year.
With this push for big data and big data analytics, finding the write system, best practices and data models that allow analysts and engineers access to the treasure troves of data can be difficult. Do you use traditional databases, columnar databases, or some other data storage system?
Let's start with the this discussion by comparing a traditional relational database to Hadoop(specifically Hadoop partnered with a layer like Presto or Hive).
Hadoop is a distributed file system with an open-source infrastructure that allows for the distributing and processing of Big data sets.
Hadoop is designed to scale up from single servers to lots of machines, offering local storage and computation to each server. Apache Hadoop comes with a distributed file system and other components like Mapreduce (framework for parallel computation using a key-value pair), Yarn and Hadoop common (Java Libraries).
Presto is a distributed SQL query engine that can be used to sit on top of data systems like HDFS, Hadoop, Cassandra, and even traditional relational databases. It allows analysts the ability to use the benefits of Hadoop with out having to understand the complexities and intricacies of what is going on underneath the hood. This allows engineers the ability to use abstractions such as tables to organize data in a more traditional data warehouse format.
Relational DB is formed from a set of described tables from which data can be reassembled or assessed in various ways without needing to reorganize the entire database tables. I know this kind of sounds weird, but in its simplest form, RDB is the basics for all SQL as well as all database management systems like Microsoft SQL Server, Oracle and MySQL.
RDB can also be called RDBMS, which stands for Relational database management system. RDB is a database management system that works with a relational model. RDBMS is the evolution of all databases; it's more like any typical database rather than a significant ban.
Unlike RDBMS, Hadoop is not a database, but rather a distributed file system that can store and process a massive amount of data clusters across computers. However, RDBMS is a structured database approach in which data is stored in rows and columns which can be updated with SQL and presented in different tables. This structured approach of RDB limits its capability to store and process a large amount of data. So Hadoop, with Mapreduce or Spark can handle large volumes of data.
Data variety is typically referred to as the type of data processed. For now, we have three main types of data types; Structured, unstructured, and semi-structured. Relational DB can only manage and process structured and semi-structured data in a limited volume. RDB is limited in managing unstructured data. However, Hadoop leverages its ability to manage and process all of the above data types; structured, unstructured, and semi-structured data. As a matter of fact, Hadoop is now the fastest known method for managing and processing huge volumes of unstructured data
As stated earlier. Hadoop on it's own isn't a database. However, thanks to open source projects like Hive and Presto you can abstract the file system into a table like format that is accesible with SQL.
This has allowed many companies to start switching over parts or all of their datawarehouses to Hadoop.
It is for accessibility, and hopes for performance on cheaper machines. Whether or not this is actually working depends from company to company and data management team to data management team.
Because although systems like hadoop promise better performance. There are a lot of downsides that don't always get discussed.
"Before we get started, it's going to seem as if I dislike Hadoop. This is not the case, I am just going to be pointing out some of the largest pitfalls and weaknesses.
We will get into the technical difficulties shortly. But before we cover the technical cons, we wanted to discuss the talent issue.
Both Hadoop and traditional relational databases require technical know how and getting this knowledge is expensive. Now, this might be up for debate, but generally speaking, most relational databases are arguably easier to use.
This is because there are so few moving pieces in comparison. With Hadoop you need to think about managing cluster, the Hadoop nodes, security, Presto or whatever interface you are using and really several other technically administrative tasks that take up lots of time and skill. These skills can get you upwards of 150k a year as a data engineer. This is great for employees..but expensive for companies.
In comparison, most relation database systems like SQL Server or Oracle are "some what" more straightforward. Security is built in, performance tuning is built in and in most importantly, there is a larger talent pool of people who understand how to manage and use standard DBs.
Why do you think interfaces like Presto and Hive that are both very SQL like exist? It's because data professionals needed a way to interface with Hadoop that was much more familiar.
So the biggest issue most companies face is not the complexity of Hadoop but the lack/cost of talent that can operate Hadoop correctly.
Unlike RDBMS, Hadoop faces a lot of security problems which can be challenging when managing complex applications. In fact, the original Hadoop releases had no authentication system set up under the assumption that the system would be running in a safe environment.
More recent releases do have access and permissions, authentication and encryption modules. They aren't that straight forward to use and usually require a decent amount of ramp up. This can make it difficult to support and scale if you are just using the Hadoop out of the box without any form of third-party like Hortonworks($$$).
Hadoop is designed with the concept of write once read many. Hadoop is not designed for write once and update many. So for data specialists who are used to have the ability to update, forget about it.
For those who aren't into data modeling, the issue this causes might not be apparent right away. Nor is it all that exhilarating to understand...
But not being able to run an update statement limits a lot of modeling that can be beneficial from the perspective of data volume.
For example (we are about to go granular). Let's say you want to track someone's promotions in a company. Traditionally, in a RDMS you can simply track the employee_id, position and start and end date of said position. You don't need to track all the days in between. When a position is switched you can update the end date, add a new row for that employee with a new position and the start date, leaving the end date null. It would look like the below.
This took two rows of data and we now have all the information required.
In comparison, there are a couple of ways you could store this data using Presto to save similar information
One method is to store a person's position every day in a date partition. The downside here is you will essentially need a row for everyday the person was in said position. The problem here is you will be storing a massive amount of data. If you have 10,000 employees, that is 10,000 rows daily.
Another method would be to use a similar data model as the RDBMS. That is only have one row for every employee_id and position combination with a start and end date. However, using this method will only work if you have access to the previous days information. This is a limiting factor.
In the end, you are probably storing much more data than is really necessary and or performing a lot of transactions that are unneeded. The point here is that although Hadoop can provide some advantages, it's not always the best tool.
If you enjoyed this post about software engineering then consider these posts as well!
4 Simple Python Ideas to Automate Your Workflow
Real Life Use Cases Of Dashboards With Actual ROI
Analyzing Medical Data With BigQuery And Python
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Improving Your Data Strategy