This post aims to provide a historical picture of the evolution of the typical data stack over a span of about 5 decades. A lot has happened, but I will try to keep it as simple and digestible as possible.
Fig 1: Evolution of the tools, concepts and technologies in the modern data stack
1970 - 1980s: The relational model, SQL, ACID and the RDBMS.
It all started with the relational model, introduced by Edgar F. Codd in his 1970 paper. Codd proposed that data can be structured as a set of tuples (rows), where each value belongs to a field/attribute (column), together forming a table identified by a primary key that can be referenced from other tables as a foreign key. He also proposed that users should not have to care about how the data is physically stored, and that changes to the storage layout should not break the applications using the data. Lastly, he introduced normalisation and the idea of a language through which users can consistently access and modify data. Within the next decade, specifically in 1983, Theo Härder and Andreas Reuter established the now-popular ACID (atomicity, consistency, isolation and durability) principles. These principles are important to mention as they are still relevant today and sit at the centre of many of the innovations happening in the data stack infrastructure space. The relational model and ACID principles found very high adoption and utility in OLTP databases and their transactional workloads. I have added these here briefly, as these concepts will be revisited later in this post in the Lakehouse section.
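To make the relational model and ACID a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are purely illustrative): two tables related by a primary/foreign key, and a transaction that either commits entirely or rolls back.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Relational model: rows (tuples), columns (attributes),
# a primary key, and a foreign key relating two tables.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")

# ACID in action: both inserts commit together or not at all (atomicity).
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
        conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")
except sqlite3.IntegrityError:
    print("transaction rolled back, database left consistent")
```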
1980 - 2000s: The proliferation of the traditional Data Warehouse.
OLTP databases performed very well in scenarios where fast retrieval, modification and deletion of individual records (row-by-row operations) was essential. Analytical operations, such as aggregations, were considered expensive because they were scan-heavy when they did not need to be. Take, for example, calculating the total revenue for a set of products: a row-oriented database would have to scan every product detail, such as product_name, product_id, product_category, order_number, amount, discount and quantity, even though the only fields actually required are amount and quantity (to compute amount * quantity) and product_category (for filtering).
Between 1980 and 2000, column-oriented architectures were among the innovations that helped with this. This architecture stores data by columns (fields) rather than by rows. Additionally, other data modelling techniques arose, such as Kimball dimensional modelling and One Big Table modelling. Column-oriented architectures, in conjunction with the new data modelling techniques, were found to yield performant results by scanning less data (less I/O), allowing vectorisation and yielding better compression as more and more data was being collected. Systems that combined these innovations were termed 'the data warehouse'. At the time they were new, but now they are referred to as the traditional data warehouse.
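Here is a toy sketch of why this matters (pure Python, illustrative data only): with a row layout, summing revenue touches every field of every record, while a columnar layout lets the query read just the two columns it needs.

```python
# Row-oriented: each record carries every field; an aggregation must walk them all.
rows = [
    {"product_id": 1, "product_name": "pen",  "product_category": "office",
     "order_number": "A1", "amount": 2.0, "discount": 0.0, "quantity": 10},
    {"product_id": 2, "product_name": "desk", "product_category": "furniture",
     "order_number": "A2", "amount": 150.0, "discount": 10.0, "quantity": 1},
]
revenue_row_store = sum(r["amount"] * r["quantity"] for r in rows)

# Column-oriented: each column is stored (and read) independently,
# so the same aggregation only touches 'amount' and 'quantity'.
columns = {
    "amount":   [2.0, 150.0],
    "quantity": [10, 1],
    # product_name, product_category, ... never need to be read
}
revenue_column_store = sum(a * q for a, q in zip(columns["amount"], columns["quantity"]))

assert revenue_row_store == revenue_column_store
```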
2003 - 2006: Distributed Systems, MapReduce and cheaper storage
In 2003, the Google File System was developed. The Google File System (GFS) is a distributed, fault-tolerant storage system that runs on commodity hardware, built to meet Google's rapidly growing data needs. GFS is optimised for handling massive datasets and batch processing. Although it is not really used today, having been replaced by Google's Colossus, GFS was pivotal to the big data revolution. It powered Google's search engine, allowing for efficient storage and fast data access, and it was also able to store multimedia files.
In 2004, Google developed MapReduce, a processing framework for large-scale data. It enables parallel operations on massive datasets by splitting the data into chunks, mapping tasks to their respective chunks and applying a reduce function to create a final output. These operations are carried out on clusters with a master-worker architecture, where the master coordinates the operations and orchestrates shuffle operations among worker nodes when necessary. GFS, being a distributed file system, had a profound impact on the success of MapReduce by providing a reliable and scalable storage infrastructure that enabled data locality and fault tolerance.
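As a rough illustration (not Google's implementation), here is the classic word-count example expressed in MapReduce terms in plain Python: the input is split into chunks, a map function emits key/value pairs per chunk, the pairs are shuffled by key, and a reduce function aggregates each group.

```python
from collections import defaultdict
from itertools import chain

# Illustrative input "chunks", standing in for blocks of a distributed file.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(chunk):
    # Emit (word, 1) for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Group values by key, as the framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_pairs):
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

mapped = [map_phase(c) for c in chunks]          # one map task per chunk
grouped = shuffle(mapped)                        # shuffle by key
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```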
Following GFS and MapReduce was the development of Hadoop in 2006. I have used just the word ‘Hadoop’ intentionally, as it is more like an ecosystem than just one thing. The Hadoop ecosystem consists mainly of 3 components: the Hadoop Distributed File System, MapReduce and YARN (Yet Another Resource Negotiator). Strongly inspired by both the GFS and Google's MapReduce, the HDFS and Hadoop MapReduce were created as open source implementations of these ideas, and YARN was developed to manage cluster resources.
Before we continue with the Hadoop ecosystem, here is a slight detour on data lakes and variety in big data.
2010: The Variety Big Data Problem
In the previous sections, the data around a subject was structured, predictable, kept roughly the same shape throughout its lifetime, and could be modelled to fit the tabular structure described in the relational model and the Kimball data model. Using the relational model and traditional databases and warehouses required designing and carefully modelling the data to accurately describe an entity such that it remained accessible and analysis-ready. These modelling exercises took time and required expertise. One of the outcomes of this modelling effort was schemas, and these schemas had to be known before any data was stored. From a database perspective, this is referred to as schema-on-write.
Big data is often described in terms of the V's: Volume, Variety, Velocity, Veracity and Value. Between 2000 and 2010, the variety of data being stored began to shift from data that could easily be described in a tabular structure to unstructured data, leading to the emergence of data lakes in 2010, a term coined by James Dixon. Simply put, a data lake is a central location that holds data in its native, raw format, whether structured or unstructured, and at any scale. Data lakes did not need any predefined schema for the data stored. In so doing, they deferred structuring the data from the point where it was stored to the point where it was accessed: the schema was defined, or rather inferred, when the data was read, hence schema-on-read.
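A small sketch of the difference, assuming nothing beyond Python and pandas (file name and fields are illustrative): schema-on-write would reject records that do not match a predefined table, whereas a lake-style approach simply stores the raw records and infers a schema only when they are read.

```python
import json
import pandas as pd

# Schema-on-read: dump raw, heterogeneous records as-is; nothing is enforced on write.
events = [
    {"user": "ada",  "action": "click", "ts": "2010-05-01T10:00:00"},
    {"user": "alan", "action": "purchase", "amount": 30.0},  # extra field, nobody complains
]
with open("events.json", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

# The schema is only inferred when the data is read back for analysis.
df = pd.read_json("events.json", lines=True)
print(df.dtypes)  # columns and types discovered at read time; missing values become NaN
```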
2006 - 2013: SQL-on-Hadoop - From batch to interactive analysis
Analysts were fond of SQL. SQL had already become very widely used by this time, especially with the proliferation of the RDBMS and the data warehouse discussed earlier in this post. As a result, Hive was developed to cater to the needs of analysts by enabling SQL-like queries on massive datasets in the Hadoop Distributed File System without having to write complex Java programs. It also provided a centralised metadata store for all datasets, in what is referred to today as the Hive Metastore. Development began at Facebook in 2006; Hive was open-sourced in 2008 and became a top-level Apache project, and Facebook published the paper introducing Hive in 2009. This development brought SQL-like queries to HDFS, but not the kind of interactive analysis that present-day analysts are used to, or that the typical data warehouse provided.
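As a sketch of what this looked like from the analyst's side (the host, credentials, table and query are placeholders, and it assumes the `pyhive` package plus a running HiveServer2), a SQL-like query could be issued against files sitting in HDFS:

```python
# Hypothetical connection details; requires the `pyhive` package and a HiveServer2 endpoint.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL, but early versions compiled this into MapReduce jobs over HDFS files.
cursor.execute("""
    SELECT product_category, SUM(amount * quantity) AS revenue
    FROM sales
    GROUP BY product_category
""")
for row in cursor.fetchall():
    print(row)
```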
While the initial version of Hive allowed for SQL-like operations, these operations were really only suitable for batch operations due to latency. Before going into what technology allowed for interactive analysis in the Hadoop ecosystem, it is worth introducing Dremel. Dremel revolutionised how SQL operations were run on object storage. It was developed by Google from about 2006 but made public in 2010. Therefore, it can be argued that Dremel pioneered the “interactive era”. It is worth noting that Dremel is sometimes referred to as a query engine only and sometimes as both a query engine and a specific columnar storage format. We will talk a bit more about formats shortly. Dremel is still widely used today as it powers Google's BigQuery.
As a response to Dremel, and strongly inspired by it, many developments occurred to allow interactive analysis in the Hadoop ecosystem. From 2013 to 2015, the focus was on efficiency, leading to the development of Apache Tez as a replacement for traditional MapReduce; Apache Tez reduced the time for many queries from minutes to seconds. In this same window, massively parallel processing engines such as Apache Impala and Presto ('SQL on Everything'), now known as Trino, were developed. Today, Trino is one of the core technologies powering AWS Athena.
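For flavour, here is a minimal sketch of an interactive query using the Trino Python client (the host, catalog, schema and table are placeholders; it assumes the `trino` package and a reachable cluster):

```python
# Hypothetical cluster details; interactive, low-latency SQL over data in the lake.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",     # read the same tables registered in the Hive Metastore
    schema="analytics",
)
cursor = conn.cursor()
cursor.execute("SELECT product_category, COUNT(*) FROM sales GROUP BY product_category")
print(cursor.fetchall())
```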
2013 and Ongoing: Columnar file formats on data lakes
As briefly introduced earlier, Dremel is sometimes referred to as both a query engine and a nested columnar data format. That data model was essential to the performance gains Google was able to achieve by eliminating processing overhead. Similarly, other columnar data formats began to emerge. One of them is the very popular Apache Parquet, released in 2013. Parquet is a column-oriented storage format designed by Twitter and Cloudera to improve on Hadoop's existing storage formats. To learn more about the Parquet file format, see this blog post. Another well-known columnar file format is ORC (Optimised Row Columnar), also built for the Hadoop ecosystem to improve storage and analysis efficiency.
Today there are more columnar formats; however, Parquet is still very widely used. These columnar file formats helped speed up queries, lent themselves excellently to data compression, supported parallel processing and improved storage efficiency. They also supported schema evolution.
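A small sketch with pyarrow (the file name and data are illustrative): writing a Parquet file and then reading back only the columns a query actually needs.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet; data is laid out column by column on disk.
table = pa.table({
    "product_category": ["office", "furniture", "office"],
    "amount":           [2.0, 150.0, 3.5],
    "quantity":         [10, 1, 4],
})
pq.write_table(table, "sales.parquet", compression="snappy")

# Column pruning: only the requested columns are read from disk.
subset = pq.read_table("sales.parquet", columns=["amount", "quantity"])
print(subset.num_rows, subset.column_names)
```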
2015: Streaming Semantics, Unification of Batch and Stream
It is worth calling out Apache Spark and Apache Flink. Apache Spark was originally developed in 2009 at UC Berkeley's AMPLab, and Flink in 2010 as part of a project named Stratosphere. Data processing can typically be categorised into streaming or batch, and this categorisation usually determined the tools and technologies used. For batch processing, Apache Spark was a go-to tool, while for streaming, Apache Flink was a common choice. As a result, teams maintained separate code bases for batch and stream processing; this architecture was referred to as the Lambda Architecture.
In 2015, Google published a research paper on the Dataflow Model. This research provided the theoretical foundation for treating batch and streaming as the same problem. The overarching idea is to treat continuous, messy data as permanently “incomplete” and provide simple building blocks so developers can decide, for each job, how much accuracy, speed, and cost they want, while letting the system handle ordering, windowing, and updates. In the same year, Google released a commercial product offering that was an implementation of their dataflow idea. Conveniently, the product was called Google Cloud Dataflow. In 2016, the Google Cloud Dataflow SDK was made open source and donated to the Apache Foundation; that open source SDK is today known as Apache Beam. This provided a standardised, engine-agnostic way to write unified pipelines that could run on multiple runners like Spark and Flink.
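Here is a minimal Apache Beam sketch of that unified model (element values and the window size are illustrative; it runs on the local DirectRunner by default, and the same pipeline could be submitted to runners such as Spark, Flink or Cloud Dataflow):

```python
# Requires the `apache-beam` package.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (
        p
        | "Create events" >> beam.Create([("ada", 1), ("alan", 1), ("ada", 1)])
        # Windowing is how the model slices an unbounded stream; on a bounded
        # (batch) source the very same code simply sees its elements in a window.
        | "Fixed windows" >> beam.WindowInto(window.FixedWindows(60))
        | "Count per key" >> beam.CombinePerKey(sum)
        | "Print"         >> beam.Map(print)
    )
```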
Within this same period, streaming in Apache Spark became a thing, and Apache Flink programs could be developed to treat a batch of data as a finite stream. By 2018, there was consolidation at the engine level, with the same internal logic handling both stream and batch processing. The unification of batch and stream is crucial for the Modern Data Lakehouse, which will be discussed in a later section of this post.
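A sketch of what that engine-level unification looks like from the user's side, using PySpark (the path, source options and aggregation are illustrative): the same DataFrame logic applied once to a static dataset and once to an unbounded stream.

```python
# Requires `pyspark`; a sketch of the "one engine, one API" idea.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# Batch: read a static Parquet dataset.
batch_df = spark.read.parquet("sales.parquet")

# Streaming: the same DataFrame API over an unbounded source (built-in rate source).
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The same transformation is written once and applied to either input.
def summarise(df):
    return df.groupBy().agg(F.count("*").alias("n"))

print(summarise(batch_df).collect())
query = summarise(stream_df).writeStream.outputMode("complete").format("console").start()
query.awaitTermination(10)  # run the streaming query briefly for the demo
query.stop()
```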
2016 - 2021: Open table formats - ACID on object stores
Remember the columnar file formats, the data lake and query engines such as Presto? Keep them in mind, as they are essential for the remaining sections: they are key parts of the system being put together in this part of the blog.
Data lakes still had one big challenge: maintaining transactional integrity, something traditional databases and data warehouses handled well. Another limitation was lock-in to specific data engine layers, such as Spark, Trino, etc. To address this, Open Table Formats (OTFs) were created. Open Table Formats brought database-like ACID operations to Open File Formats in data lakes.
It is key to distinguish Open File Formats from Open Table Formats. Open File Formats are file formats such as Parquet, while Open Table Formats are, simply put, a metadata layer over an Open File Format. Common Open Table Formats are Apache Hudi, developed by Uber in 2016, Apache Iceberg, developed by Netflix in 2017, and Delta Lake, built by Databricks. They are all open source and continue to be a critical component of the Lakehouse by enabling ACID operations, schema evolution, better partition management, in many cases time travel (point-in-time recovery), and interoperability across the numerous data processing engines mentioned in this blog, such as Spark, Flink and Trino.
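As a sketch of the database-like behaviour this adds on top of plain Parquet files, here is what appending to and time-travelling over a Delta Lake table can look like with the `deltalake` (delta-rs) Python package; the table path and data are illustrative:

```python
# Requires the `deltalake` and `pyarrow` packages.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

batch1 = pa.table({"product_category": ["office"], "revenue": [20.0]})
batch2 = pa.table({"product_category": ["furniture"], "revenue": [150.0]})

# Each write is an atomic, versioned commit recorded in the table's transaction log.
write_deltalake("sales_delta", batch1, mode="append")
write_deltalake("sales_delta", batch2, mode="append")

table = DeltaTable("sales_delta")
print(table.version())              # latest committed version
print(table.to_pyarrow_table())     # current state of the table

# Time travel: read the table as of an earlier commit.
old = DeltaTable("sales_delta", version=0)
print(old.to_pyarrow_table())
```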
In the next section, we will look at how all this innovation now integrates together to form what is called the Lakehouse.
2021 - Ongoing: The Lakehouse
The CIDR 2021 Lakehouse paper argued that implementing warehouse features (ACID tables, schema enforcement/evolution, indexing, caching, governance) on open data lake formats could meet or approach cloud-warehouse performance without duplicating data across lake and warehouse tiers. Lakehouses support SQL analytics and advanced machine learning directly, eliminate multiple ETL steps, reduce lock-in and staleness, and can reach competitive performance using new metadata layers. From the historical developments, one can argue that the time was right for all these technologies to be tightly integrated into what is effectively one big new product. To support this argument, we can look at the Databricks platform: a unified platform combining low-cost storage in an open file format, a metadata layer (the Delta Lake open table format), a performance layer known as the Delta Engine, declarative DataFrame APIs for machine learning and data science, and a multi-API support layer. The performance of the entire system was competitive with traditional data warehouses on the TPC-DS Power Test, making a very strong case for the viability of the data lakehouse.
The data lakehouse has seen success over the past few years. There have been other implementations from cloud providers, such as GCP BigLake, AWS Lake Formation, Microsoft Fabric and so on. Most of the cloud provider options are not natively lakehouse solutions, but they integrate some or most of the components. One major area of the lakehouse that is becoming more robust is its data governance features and capabilities.
TL;DR
Over the last five decades, data and analytics systems have evolved enormously, along with a great deal of tooling and technology. There have been pivotal moments, such as the relational model and the traditional OLTP databases. In that same period, ACID principles were established to maintain the integrity of those databases. Row-level access was good, but aggregations were expensive; this led to the OLAP data warehouse, optimised for analytical workloads.
Then came the great decoupling, with the invention of the Google File System, MapReduce and distributed systems. The once-inseparable components of a tightly knit data system became standalone, robust components. Distributed storage further allowed the compute layer to be conceived and built as a separate but integrable component, as seen in Hadoop, Dremel, Presto, etc.
Now we had storage and compute, but not efficiency. This led to a focus on performance, driving the development of columnar file formats, and then to the unification of batch and streaming semantics: the Dataflow Model shifted us away from the Lambda architecture towards a single engine whose internals handle both, instead of separate codebases.
I consider us to be in a consolidation stage, a stage where the best of all worlds is coming together to form a big, well-thought-out product that has the advanced performance of the tightly knit data warehouses and the flexibility and interoperability that came from the decoupling, as seen in data lakes and the query engines. This consolidation is seen in the Lakehouse.
Other Worthy Mentions.
Data Catalogs - Hive Catalog, Unity Catalog, etc.
Interoperability layers for open file formats - Apache XTable
