DEV Community

Xin Xu

From Kimball to Lakehouse: The Evolution of Data Storage (with Python Demo)

Data Storage Architecture: Deconstructing Warehouse, Lake, and Lakehouse 🏛️

In modern data engineering, choosing the right storage architecture is critical. Based on the data_engineering_book, this guide breaks down the core differences between traditional Warehouses, Data Lakes, and the modern Lakehouse, while providing a hands-on Delta Lake demo.


1. Warehouse vs. Lake vs. Lakehouse

Understanding the core philosophy of each architecture is the first step toward a successful design.

| Architecture | Definition | Design Philosophy |
| --- | --- | --- |
| Warehouse (Kimball/Inmon) | Structured, integrated, non-volatile storage using Star/Snowflake schemas. | Schema-on-Write. Optimized for fast BI reporting and business logic. |
| Data Lake | A vast repository for raw data (structured and unstructured) with no strict schema. | Schema-on-Read. Optimized for data exploration, ML, and low-cost storage. |
| Data Lakehouse | A hybrid architecture that brings warehouse management to the data lake. | Best of both: retains lake flexibility while adding warehouse-level ACID transactions and governance. |
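To make the Schema-on-Write vs. Schema-on-Read distinction concrete, here is a minimal plain-Python sketch. The function names and the tiny schema are illustrative only, not from any library:

```python
import json

# Schema-on-Write (warehouse style): validate before storing.
SCHEMA = {"user_id": str, "amount": float}

def write_to_warehouse(record: dict, table: list) -> None:
    """Reject any record that does not match the schema at write time."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"Schema violation on field '{field}'")
    table.append(record)

# Schema-on-Read (lake style): store raw data now, interpret at query time.
def write_to_lake(raw: str, bucket: list) -> None:
    bucket.append(raw)  # no validation: anything goes in

def read_from_lake(bucket: list) -> list:
    """Apply structure while reading; malformed rows are skipped, not rejected."""
    rows = []
    for raw in bucket:
        try:
            rows.append(json.loads(raw))
        except json.JSONDecodeError:
            pass  # tolerate junk that a warehouse would have refused on write
    return rows
```

The trade-off is visible in where the failure happens: the warehouse refuses bad data at ingest, while the lake accepts everything and pushes the cost of interpretation to read time.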

2. Core Storage Design Principles

According to the Data Engineering Book, a robust storage layer must balance maintainability, performance, and cost.

A. Layering Strategy (The Medallion Architecture)

  • Raw (Bronze/ODS): Stores data in its original form. Enables reprocessing if logic changes.
  • Clean (Silver/CDM): Deduplicated, standardized, and filtered data.
  • Integrated (Gold/DWD): Themed data organized by business subjects (User, Order, etc.).
  • Aggregated (Platinum/DM): Summarized data ready for BI dashboards.
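As a toy illustration of the layering above, here is a plain-Python sketch standing in for real Spark jobs. The data and function names are made up; each function plays the role of one layer transition:

```python
from collections import defaultdict

# Raw (Bronze/ODS): events exactly as ingested, duplicates and all.
bronze = [
    {"user_id": "u1", "action": "click", "dt": "2024-05-20"},
    {"user_id": "u1", "action": "click", "dt": "2024-05-20"},  # duplicate
    {"user_id": "u2", "action": "PURCHASE", "dt": "2024-05-20"},
]

def to_silver(raw_rows):
    """Clean (Silver/CDM): deduplicate and standardize (lowercase actions)."""
    seen, clean = set(), []
    for row in raw_rows:
        key = (row["user_id"], row["action"].lower(), row["dt"])
        if key not in seen:
            seen.add(key)
            clean.append({**row, "action": row["action"].lower()})
    return clean

def to_data_mart(clean_rows):
    """Aggregated (Platinum/DM): event counts per user, dashboard-ready."""
    counts = defaultdict(int)
    for row in clean_rows:
        counts[row["user_id"]] += 1
    return dict(counts)
```

The key property the layering buys you: if the aggregation logic in `to_data_mart` changes, you re-run it from Silver; if the cleaning rules change, you re-run from Bronze, because the raw layer was never mutated.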

B. Partitioning Strategy

Partitions reduce the amount of data scanned, directly boosting query performance.

  • Partition Keys: Choose high-frequency filter fields (e.g., dt, region).
  • Granularity: Avoid the "small file problem" by keeping partitions coarse enough (e.g., partition by day rather than by second).
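A rough sketch of why partition keys cut scan cost: with day-level `dt` partitions laid out Hive-style (one directory per day), a filter on `dt` can skip whole directories instead of opening every file. The paths and the `prune` helper below are illustrative, not a real engine API:

```python
# Hive-style partition layout: one directory per day.
partitions = {
    "dt=2024-05-18": ["part-000.parquet", "part-001.parquet"],
    "dt=2024-05-19": ["part-000.parquet"],
    "dt=2024-05-20": ["part-000.parquet", "part-001.parquet"],
}

def prune(partitions: dict, dt_filter: str) -> list:
    """Return only files in the matching partition; all other
    directories are skipped without ever being read."""
    wanted = f"dt={dt_filter}"
    return [f"{p}/{f}" for p, files in partitions.items()
            for f in files if p == wanted]
```

A query filtering on one day touches 1 of 5 files here; without partitioning it would scan all 5. The same logic also shows the granularity trade-off: partitioning by second would explode the directory count and shrink each file toward the "small file problem."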

C. Data Lifecycle Management (DLM)

  • Hot Data (<7 days): High-performance storage (SSD / Delta Lake active partitions).
  • Warm Data (7 days - 3 months): Standard object storage (S3 Standard).
  • Cold Data (>3 months): Archival storage (S3 Glacier) to minimize costs.
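The tiering rules above can be sketched as a simple age-based router. The thresholds follow the table (with "3 months" approximated as 90 days); the function itself is hypothetical:

```python
from datetime import date

def storage_tier(partition_dt: date, today: date) -> str:
    """Map a partition date to hot/warm/cold per the lifecycle policy."""
    age_days = (today - partition_dt).days
    if age_days < 7:
        return "hot"    # SSD / active Delta Lake partitions
    if age_days <= 90:
        return "warm"   # S3 Standard
    return "cold"       # S3 Glacier
```

In practice you would rarely move objects yourself; a function like this is more useful for tagging partitions so that an S3 lifecycle rule (or a scheduled compaction job) can act on them.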

3. Hands-on: Building a Lakehouse with Delta Lake

Delta Lake is the backbone of the Lakehouse architecture, providing ACID transactions and Schema Enforcement. Here is a Python/PySpark snippet to get you started:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# 1. Initialize Spark with Delta support
#    (requires the delta-spark package on the classpath)
spark = SparkSession.builder \
    .appName("DeltaLakehouseDemo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# 2. Ingest to the Raw layer (ODS), partitioned by date
ods_path = "./lakehouse/ods/user_behavior"
data = [("user1", "2024-05-20", "click"), ("user2", "2024-05-20", "purchase")]
df = spark.createDataFrame(data, ["user_id", "dt", "action"])

df.write.format("delta").partitionBy("dt").mode("overwrite").save(ods_path)

# 3. Time travel: read a specific historical version of the table
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(ods_path)

# 4. ACID transaction: atomic update of a single row
delta_table = DeltaTable.forPath(spark, ods_path)
delta_table.update(
    condition="user_id = 'user1'",
    set={"action": "'view'"}  # SQL expression string, hence the nested quotes
)

print("Updated Lakehouse Data:")
spark.read.format("delta").load(ods_path).show()
```


4. Decision Tree: Choosing Your Architecture

Not every project needs a full Lakehouse. Use this decision tree from our handbook to decide:

  1. Do you only need structured data for fixed BI reports?
     • Yes: Traditional Data Warehouse (Snowflake/Redshift).
     • No: Proceed.
  2. Do you need to store unstructured data (logs, videos, JSON)?
     • Yes: Proceed to the next question.
  3. Do you need ACID transactions and Schema Enforcement?
     • No: Pure Data Lake (S3 + Hive/Glue).
     • Yes: Data Lakehouse (Delta Lake / Iceberg).


Conclusion

The evolution from Warehouses to Lakehouses represents a shift toward balancing agility with governance. By implementing layering, partitioning, and lifecycle management, you can build a storage layer that scales with your business.

For the full architectural guide and more hands-on demos, visit our repository:

👉 GitHub: datascale-ai/data_engineering_book

Are you still using a traditional Data Warehouse, or have you migrated to a Lakehouse? Share your migration stories below! 👇

