DEV Community

Pavol Z. Kutaj


Explaining the History of Data Lakehouse

Overview

This overview traces the brief history of the Data Lakehouse concept, starting with Data Warehouses, based on personal experience and the influential Databricks paper that continues to shape the industry three years after its publication.
But first: the data "lake" is part of a metaphorical structure that forces us to think of data as something "natural", something given. For a constructivist critique, see this great blog post by my boss:

Why we need to stop thinking of data as oil

1980-2010: The Data Warehouse (OLAP) Era

  • Early work focused on data warehouses, such as SAP BW (Business Warehouse).
  • Data warehouses, invented in the 1980s, promised:
    1. Homogeneous data
    2. Compressed/high-performance storage
    3. Historical data for decision support (reporting)

2010-2020: The Data Lake Revolution

  • Around 2010, a crisis in data warehousing emerged, triggered by the iPhone and cloud computing (the Big Data revolution).
  • Data Lakes emerged as a solution.
  • Storage and compute are decoupled, allowing multiple compute engines to connect to a single storage system (e.g., HDFS + Spark, Presto, Hive, etc.).
  • Metaphor: imagine a vast lake (of data) with a large warehouse nearby, but that warehouse is still a thousand times smaller in capacity than the lake itself.
  • Challenges of the "Data Warehouse on Lake" architecture:
    • Isolated environments
    • Error-prone data transfer from the lake to the warehouse
    • Slow, labor-intensive processes
    • Data often became stale by the time it reached the warehouse
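The decoupling of storage and compute above can be sketched in a few lines of plain Python: a shared store (standing in for HDFS or object storage) holds the files once, and any number of independent "engines" read them. All names here (`shared_store`, `filter_engine`, `agg_engine`) are illustrative, not a real API.

```python
import csv
import os
import tempfile

# A shared file store: the "lake". In reality this would be HDFS or S3.
shared_store = tempfile.mkdtemp()

# A producer lands a raw file in the lake once...
with open(os.path.join(shared_store, "events.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user", "amount"])
    w.writerows([["a", 10], ["b", 25], ["a", 5]])

def load(store):
    """Read the shared file; every engine uses the same storage, no copies."""
    with open(os.path.join(store, "events.csv"), newline="") as f:
        return list(csv.DictReader(f))

# ...and multiple independent compute "engines" attach to the same storage.
def filter_engine(store, min_amount):
    """Stands in for an engine like Presto running a filter query."""
    return [r for r in load(store) if int(r["amount"]) >= min_amount]

def agg_engine(store):
    """Stands in for a different engine (e.g., Spark) aggregating the same files."""
    totals = {}
    for r in load(store):
        totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])
    return totals

print(filter_engine(shared_store, 10))  # rows with amount >= 10
print(agg_engine(shared_store))         # {'a': 15, 'b': 25}
```

The point is architectural: neither engine owns the data, so engines can be added or swapped without moving a byte of storage.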

2020-Present: The Data Lakehouse Era

  • Key developments:
    • Hudi (2017, Uber)
    • Iceberg (2018, Netflix)
    • Delta (2019, Databricks)
  • These formats provide functionality previously exclusive to warehouses:
    • An abstract table layer over files with ACID transactions
    • Support for indexing
    • Relational semantics (tables, schemas) on top of file-based data
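As a rough, hedged sketch of the idea behind these formats (not any real format's API or on-disk layout): data files stay immutable, and a small commit log makes writes atomic, because readers scan the log rather than listing the directory. The file and function names below (`_commit_log`, `write_commit`, `read_table`) are invented for illustration.

```python
import json
import os
import tempfile

# A "table" is just a directory of immutable data files plus a commit log.
table = tempfile.mkdtemp()
log_path = os.path.join(table, "_commit_log")  # hypothetical log name

def write_commit(rows):
    """Write rows to a new immutable data file, then commit it atomically."""
    data_file = os.path.join(table, f"part-{os.urandom(4).hex()}.json")
    with open(data_file, "w") as f:
        json.dump(rows, f)
    # The commit itself: one appended log line makes the file visible.
    # Until this line lands, the data file simply does not exist to readers.
    with open(log_path, "a") as f:
        f.write(json.dumps({"add": os.path.basename(data_file)}) + "\n")

def read_table():
    """Readers scan the log, not the directory, so uncommitted files are invisible."""
    rows = []
    if not os.path.exists(log_path):
        return rows
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            with open(os.path.join(table, entry["add"])) as df:
                rows.extend(json.load(df))
    return rows

write_commit([{"id": 1, "city": "Prague"}])
write_commit([{"id": 2, "city": "Bratislava"}])
print(len(read_table()))  # -> 2 (both committed rows visible)
```

Real formats add much more (schema evolution, time travel, compaction), but the log-over-immutable-files pattern is the core trick that brings warehouse-style transactions to lake storage.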
  • Data Lakehouse concept:
    • Not just near the lake, but floating on it like an oil rig
    • Directly accesses and uses lake data, without dedicated connections
    • Cost-effective and easily replaceable
    • Multiple "houses" can coexist on the same lake
    • Rapid, reliable data access without extra engineering work
    • Immediate value from lake data available in the "house"

The Data Lakehouse combines the best features of data warehouses and data lakes, offering a flexible, scalable, and efficient solution for modern data management needs. This concept forms the foundation of products like Databricks' data platform.

For more detailed information, refer to:

  1. CIDR 2021 Paper
  2. Onehouse Blog
  3. YouTube Video

