DEV Community

Pavol Z. Kutaj


Explaining the History of Data Lakehouse

Overview

This overview traces the brief history of the Data Lakehouse concept, starting with Data Warehouses, based on personal experience and the influential Databricks paper that continues to shape the industry three years after its publication.
But first: the data "lake" is part of a metaphorical structure that forces us to think of data as something "natural", something given. For a constructivist critique, see this great blog post by my boss:

Why we need to stop thinking of data as oil

1980-2010: The Data Warehouse (OLAP) Era

  • Early work focused on data warehouses, such as SAP BW (Business Warehouse).
  • Data warehouses, invented in the 1980s, promised:
    1. Homogeneous data
    2. Compressed/high-performance storage
    3. Historical data for decision support (reporting)

2010-2020: The Data Lake Revolution

  • Around 2010, a crisis in data warehousing emerged, triggered by the iPhone and cloud computing (the Big Data revolution).
  • Data Lakes emerged as a solution.
  • Storage and compute are decoupled, allowing multiple compute engines to connect to a single storage system (e.g., HDFS + Spark, Presto, Hive, etc.).
  • Metaphor: imagine a vast lake (of data) with a large warehouse nearby, but that warehouse is still a thousand times smaller in capacity than the lake itself.
  • Challenges of the "Data Warehouse on Lake" architecture:
    • Isolated environments
    • Error-prone data transfer from the lake to the warehouse
    • Slow, labor-intensive processes
    • Data often became stale by the time it reached the warehouse
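The decoupling of storage and compute above can be sketched in a few lines of plain Python: a shared store (standing in for HDFS or object storage) holds the files once, and any number of independent "engines" read them. All names here (`shared_store`, `filter_engine`, `agg_engine`) are illustrative, not a real API.

```python
import csv
import os
import tempfile

# A shared file store: the "lake". In reality this would be HDFS or S3.
shared_store = tempfile.mkdtemp()

# A producer lands a raw file in the lake once...
with open(os.path.join(shared_store, "events.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["user", "amount"])
    w.writerows([["a", 10], ["b", 25], ["a", 5]])

def load(store):
    """Read the shared file; every engine uses the same storage, no copies."""
    with open(os.path.join(store, "events.csv"), newline="") as f:
        return list(csv.DictReader(f))

# ...and multiple independent compute "engines" attach to the same storage.
def filter_engine(store, min_amount):
    """Stands in for an engine like Presto running a filter query."""
    return [r for r in load(store) if int(r["amount"]) >= min_amount]

def agg_engine(store):
    """Stands in for a different engine (e.g., Spark) aggregating the same files."""
    totals = {}
    for r in load(store):
        totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])
    return totals

print(filter_engine(shared_store, 10))  # rows with amount >= 10
print(agg_engine(shared_store))         # {'a': 15, 'b': 25}
```

The point is architectural: neither engine owns the data, so engines can be added or swapped without moving a byte of storage.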

2020-Present: The Data Lakehouse Era

  • Key developments:
    • Hudi (2017, Uber)
    • Iceberg (2018, Netflix)
    • Delta (2019, Databricks)
  • These formats provide functionality previously exclusive to warehouses:
    • An abstract table layer over files with ACID transactions
    • Support for indexing
    • Relational semantics (tables, schemas) on top of file-based data
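As a rough, hedged sketch of the idea behind these formats (not any real format's API or on-disk layout): data files stay immutable, and a small commit log makes writes atomic, because readers scan the log rather than listing the directory. The file and function names below (`_commit_log`, `write_commit`, `read_table`) are invented for illustration.

```python
import json
import os
import tempfile

# A "table" is just a directory of immutable data files plus a commit log.
table = tempfile.mkdtemp()
log_path = os.path.join(table, "_commit_log")  # hypothetical log name

def write_commit(rows):
    """Write rows to a new immutable data file, then commit it atomically."""
    data_file = os.path.join(table, f"part-{os.urandom(4).hex()}.json")
    with open(data_file, "w") as f:
        json.dump(rows, f)
    # The commit itself: one appended log line makes the file visible.
    # Until this line lands, the data file simply does not exist to readers.
    with open(log_path, "a") as f:
        f.write(json.dumps({"add": os.path.basename(data_file)}) + "\n")

def read_table():
    """Readers scan the log, not the directory, so uncommitted files are invisible."""
    rows = []
    if not os.path.exists(log_path):
        return rows
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            with open(os.path.join(table, entry["add"])) as df:
                rows.extend(json.load(df))
    return rows

write_commit([{"id": 1, "city": "Prague"}])
write_commit([{"id": 2, "city": "Bratislava"}])
print(len(read_table()))  # -> 2 (both committed rows visible)
```

Real formats add much more (schema evolution, time travel, compaction), but the log-over-immutable-files pattern is the core trick that brings warehouse-style transactions to lake storage.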
  • Data Lakehouse concept:
    • Not just near the lake, but floating on it like an oil rig
    • Directly accesses and uses lake data, without dedicated connections
    • Cost-effective and easily replaceable
    • Multiple "houses" can coexist on the same lake
    • Rapid, reliable data access without extra engineering work
    • Immediate value from lake data available in the "house"

The Data Lakehouse combines the best features of data warehouses and data lakes, offering a flexible, scalable, and efficient solution for modern data management needs. This concept forms the foundation of products like Databricks' data platform.

For more detailed information, refer to:

  1. CIDR 2021 Paper
  2. Onehouse Blog
  3. YouTube Video

