Data Storage Architecture: Deconstructing Warehouse, Lake, and Lakehouse 🏛️
In modern data engineering, choosing the right storage architecture is critical. Drawing on the data_engineering_book, this guide breaks down the core differences between traditional Warehouses, Data Lakes, and the modern Lakehouse, and walks through a hands-on Delta Lake demo.
1. Warehouse vs. Lake vs. Lakehouse
Understanding the core philosophy of each architecture is the first step toward a successful design.
| Architecture | Definition | Design Philosophy |
|---|---|---|
| Warehouse (Kimball/Inmon) | Structured, integrated, non-volatile storage using Star/Snowflake schemas. | Schema-on-Write. Optimized for fast BI reporting and business logic. |
| Data Lake | A vast repository for raw data (Structured/Unstructured) with no strict schema. | Schema-on-Read. Optimized for data exploration, ML, and low-cost storage. |
| Data Lakehouse | A hybrid architecture bringing warehouse management to the data lake. | Best of both. Retains lake flexibility with warehouse-level ACID transactions and governance. |
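The schema-on-write vs. schema-on-read split in the table above can be illustrated with a minimal, framework-agnostic Python sketch (the record layout and `SCHEMA` here are hypothetical, not from any particular engine):

```python
# Schema-on-Write (warehouse style): validate before storing; bad rows never land.
# Schema-on-Read (lake style): store anything raw; apply the schema at query time.
SCHEMA = {"user_id": str, "amount": float}  # hypothetical schema

def write_validated(store, record):
    """Warehouse-style ingest: reject records that violate the schema."""
    if set(record) != set(SCHEMA) or not all(
        isinstance(record[k], t) for k, t in SCHEMA.items()
    ):
        raise ValueError(f"schema violation: {record}")
    store.append(record)

def read_with_schema(raw_store):
    """Lake-style read: coerce to the schema while reading, skipping bad rows."""
    for record in raw_store:
        try:
            yield {k: SCHEMA[k](record[k]) for k in SCHEMA}
        except (KeyError, TypeError, ValueError):
            continue  # malformed raw data is tolerated, just not returned

warehouse = []
lake = [{"user_id": "u1", "amount": "9.5"}, {"oops": 1}]
write_validated(warehouse, {"user_id": "u1", "amount": 9.5})
print(list(read_with_schema(lake)))  # only the parseable row survives
```

The trade-off is visible in miniature: the warehouse path pays the validation cost up front, while the lake path defers it to every reader.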
2. Core Storage Design Principles
According to the Data Engineering Book, a robust storage layer must balance maintainability, performance, and cost.
A. Layering Strategy (The Medallion Architecture)
- Raw (Bronze/ODS): Stores data in its original form. Enables reprocessing if logic changes.
- Clean (Silver/CDM): Deduplicated, standardized, and filtered data.
- Integrated (Gold/DWD): Themed data organized by business subjects (User, Order, etc.).
- Aggregated (Platinum/DM): Summarized data ready for BI dashboards.
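The four layers above can be sketched in plain Python, independent of any engine (the event fields and layer contents are illustrative, not from the book's dataset):

```python
from collections import defaultdict

# Bronze: raw events exactly as received (duplicates and all), so the pipeline
# can always be replayed from here if downstream logic changes.
bronze = [
    {"user_id": "u1", "dt": "2024-05-20", "action": "click"},
    {"user_id": "u1", "dt": "2024-05-20", "action": "click"},    # duplicate
    {"user_id": "u2", "dt": "2024-05-20", "action": "PURCHASE"},
]

# Silver: deduplicated and standardized (lowercase actions).
silver, seen = [], set()
for e in bronze:
    key = (e["user_id"], e["dt"], e["action"].lower())
    if key not in seen:
        seen.add(key)
        silver.append({**e, "action": e["action"].lower()})

# Gold: organized by business subject (here: per user).
gold = defaultdict(list)
for e in silver:
    gold[e["user_id"]].append(e["action"])

# Platinum: aggregates ready for a BI dashboard.
platinum = {"actions_per_user": {u: len(a) for u, a in gold.items()}}
print(platinum)  # {'actions_per_user': {'u1': 1, 'u2': 1}}
```

Each layer only reads from the one before it, which is what makes reprocessing from Bronze safe.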
B. Partitioning Strategy
Partitions reduce the amount of data scanned, directly boosting query performance.
- Partition Keys: Choose high-frequency filter fields (e.g., `dt`, `region`).
- Granularity: Avoid the "Small File Problem" by ensuring partitions aren't too granular (e.g., use `day` instead of `second`).
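The granularity point can be made concrete with a small sketch (the `partition_path` helper and base path are hypothetical): three events in the same day land in one day-level partition, but three second-level partitions.

```python
from datetime import datetime

def partition_path(base, ts, granularity="day"):
    """Build a Hive-style partition path for an event timestamp.

    Day granularity keeps the partition (and file) count manageable;
    second granularity would create one tiny partition per event.
    """
    if granularity == "day":
        return f"{base}/dt={ts:%Y-%m-%d}"
    if granularity == "second":
        return f"{base}/dt={ts:%Y-%m-%d %H:%M:%S}"
    raise ValueError(f"unknown granularity: {granularity}")

events = [datetime(2024, 5, 20, 10, 0, s) for s in range(3)]
day_parts = {partition_path("/lake/ods", ts) for ts in events}
sec_parts = {partition_path("/lake/ods", ts, "second") for ts in events}
print(len(day_parts), len(sec_parts))  # 1 3
```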
C. Data Lifecycle Management (DLM)
- Hot Data (<7 days): High-performance storage (SSD / Delta Lake active partitions).
- Warm Data (7 days - 3 months): Standard object storage (S3 Standard).
- Cold Data (>3 months): Archival storage (S3 Glacier) to minimize costs.
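The tiering policy above is simple enough to encode directly; here is a minimal sketch (the function name and tier labels are illustrative, and a real DLM job would act on object-store lifecycle rules rather than return strings):

```python
from datetime import date

def storage_tier(partition_date, today):
    """Map a partition's age onto the hot/warm/cold tiers described above."""
    age_days = (today - partition_date).days
    if age_days < 7:
        return "hot"    # SSD / active Delta Lake partitions
    if age_days <= 90:
        return "warm"   # S3 Standard
    return "cold"       # S3 Glacier

today = date(2024, 5, 20)
print(storage_tier(date(2024, 5, 18), today))  # hot
print(storage_tier(date(2024, 4, 1), today))   # warm
print(storage_tier(date(2024, 1, 1), today))   # cold
```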
3. Hands-on: Building a Lakehouse with Delta Lake
Delta Lake is the backbone of the Lakehouse architecture, providing ACID transactions and Schema Enforcement. Here is a Python/PySpark snippet to get you started:
```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# 1. Initialize Spark with Delta support (requires the delta-spark package)
spark = (
    SparkSession.builder
    .appName("DeltaLakehouseDemo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# 2. Ingest to the Raw layer (ODS), partitioned by date
ods_path = "./lakehouse/ods/user_behavior"
data = [("user1", "2024-05-20", "click"), ("user2", "2024-05-20", "purchase")]
df = spark.createDataFrame(data, ["user_id", "dt", "action"])
df.write.format("delta").partitionBy("dt").mode("overwrite").save(ods_path)

# 3. Time travel: read a specific historical version of the table
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load(ods_path)

# 4. ACID transaction: an atomic, conditional update
delta_table = DeltaTable.forPath(spark, ods_path)
delta_table.update(
    condition="user_id = 'user1'",
    set={"action": "'view'"}
)

print("Updated Lakehouse Data:")
spark.read.format("delta").load(ods_path).show()
```
4. Decision Tree: Choosing Your Architecture
Not every project needs a full Lakehouse. Use this decision tree from our handbook to decide:
- Do you only need structured data for fixed BI reports?
  - Yes: Traditional Data Warehouse (Snowflake/Redshift).
  - No: Proceed.
- Do you need to store unstructured data (logs, videos, JSON)?
  - Yes: Proceed to Lakehouse/Data Lake.
- Do you need ACID transactions and Schema Enforcement?
  - No: Pure Data Lake (S3 + Hive/Glue).
  - Yes: Data Lakehouse (Delta Lake / Iceberg).
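The tree above collapses to a few branches, which a short sketch makes explicit (the function and its parameter names are illustrative, not from the handbook):

```python
def choose_architecture(fixed_bi_reports_only, needs_acid):
    """A minimal encoding of the decision tree above."""
    if fixed_bi_reports_only:
        return "Traditional Data Warehouse (Snowflake/Redshift)"
    if needs_acid:
        return "Data Lakehouse (Delta Lake / Iceberg)"
    return "Pure Data Lake (S3 + Hive/Glue)"

print(choose_architecture(True, False))   # warehouse
print(choose_architecture(False, True))   # lakehouse
print(choose_architecture(False, False))  # pure lake
```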
Conclusion
The evolution from Warehouses to Lakehouses represents a shift toward balancing agility with governance. By implementing layering, partitioning, and lifecycle management, you can build a storage layer that scales with your business.
For the full architectural guide and more hands-on demos, visit our repository:
👉 GitHub: datascale-ai/data_engineering_book
Are you still using a traditional Data Warehouse, or have you migrated to a Lakehouse? Share your migration stories below! 👇
