(1) Emerging Data Lakehouse Handbook (2025): Concepts and Design of Data Warehouse Layering

The “Emerging Data Lakehouse Design and Practice Handbook · From Layered Architecture to Data Lakehouse Architecture Design (2025)” series focuses on the design and practice of moving from traditional data warehouse layering to modern data lakehouse architectures. This handbook explains the core value of data warehouse layers, common layer types, the ETL architecture and data transformation processes under each layer, and the corresponding technical architecture, and provides a deep dive into layered design using the Operational Data Store (ODS), Data Warehouse (DW), and Data Warehouse Summary/Service (DWS) layers as examples. Finally, it explores trends in data warehouse technology and provides a summary.

This article is the first in the series, providing a detailed analysis of the concept and design of data warehouse layering, offering a systematic guide for practitioners in data lakehouse design and implementation.

💓 Friendly reminder: purely practical content and quite long; we recommend bookmarking it for later study~


Core Value of Data Warehouse Layering

A layered structure in a data warehouse ensures that each layer serves specific purposes and applications, making the data hierarchy clearer and the structure more explicit. Properly classified table layers help users quickly understand and locate relevant data, reducing query complexity.

  1. Optimize data relationships: cohesion within layers, separation between layers
    Layered design ensures that models within each layer are highly focused and unified, concentrating on specific data processing tasks, which simplifies maintenance and optimization. Operations such as data cleaning, integration, and transformation are centralized in the same layer, improving data quality and making it easier to locate and resolve issues.
    Models between different layers are relatively independent, reducing inter-layer coupling. Changes in one layer do not affect others, enhancing system flexibility and scalability. This design better accommodates business changes and system upgrades, ensuring stability and efficiency.

  2. Hide underlying system complexity and shield downstream from upstream changes
    When business systems are updated, the layering mechanism can effectively shield downstream users from underlying changes. The warehouse layer masks the complex naming and structures of source systems, ensuring standardization and consistency. Even if source systems change, the warehouse layer handles the necessary adjustments, preventing frequent modifications to downstream code and logic, thereby maintaining data continuity and stability.

  3. Enhance data reuse and reduce redundant work
    Layering allows data to be progressively refined at each stage, producing high-quality data for direct use in subsequent layers. This reduces the need for repeated extraction and processing across different tasks or projects. Once standardized, cleaned, and transformed at the lower layer, data can be reused at higher layers, avoiding redundant development and computation.
    For example, different business teams or analytical applications can rely on the same source data without re-extracting or reprocessing each time, improving development efficiency and reducing potential errors.

  4. Enable traceability of data sources and avoid “spaghetti” dependencies
    Layered design divides data processing into stages, each focusing on specific tasks, ensuring that data at each layer has a clear lineage. Processing steps, transformation rules, and change history can be fully tracked across layers. Data lineage is particularly clear in layered data warehouses.
    Each layer preserves the path from raw data collection to final usage. If a data issue occurs, engineers can quickly trace its source and identify the processing step causing the problem. This mechanism ensures traceability of changes, helping maintain data quality and reliability.

  5. Simplify complex problems and resolve business logic step by step
    Layered design breaks complex tasks into manageable steps, each focusing on a single task, simplifying management and maintenance. When issues arise, fixes can start from the affected step without adjusting the entire data chain, ensuring accuracy and maintainability.

  6. Simplify maintenance and ensure system stability
    Layered design makes maintenance easier—issues in one layer are resolved within that layer without affecting others. This balance of performance, cost, efficiency, and quality supports stable and efficient operation of large-scale data systems.

By applying layering and modeling, big data systems can achieve structured and efficient storage and management, maximizing data value while balancing performance and cost.

Common Data Layers

Layering is not arbitrary but aims to solve development, ETL organization, data storage, and permission management challenges.

Common layers include:
ODS (Operational Data Store), STG/SDATA (Staging/Data Mirror Layer), DWD/SOR/PDATA (Detail/Atomic Layer), DWS/SMA/MID (Summary/Intermediate Layer), ADS/IDX (Application/Index Layer), DM (Data Mart Layer), etc.

In practice, warehouses often include ODS, DWD, DWS, DWM, DM, and ADS layers, but boundaries may not be strictly distinct. Complex business needs or real-world scenarios may mix or simplify layers to support real-time analysis or cross-departmental requirements, making the architecture more flexible but also more complex.

Initial layer definitions:

  • ODS (Operational Data Store): Stores raw or near-real-time data from multiple sources. Table names closely reflect source systems.
  • STG/SDATA (Staging Area/Data Mirror Layer): Receives data exported from ODS or source systems, typically partitioned by day, e.g., STG_Table1_20241108.
  • DWD/SOR/PDATA (Data Warehouse Detail/Source of Record): Stores high-granularity cleaned, integrated, and transformed data. Core of the warehouse, usually designed by subject, shielding downstream users from source system complexities. Historically, Teradata and IBM accumulated expertise here, e.g., 10-domain or 11-domain models, though evolving business requires flexible approaches.
  • DWS/SMA/MID (Data Warehouse Summary/Service Layer): Stores aggregated and computed summary data, supporting common queries and analysis. Enhances data reuse, unifies business logic, and reduces warehouse pressure.
  • DWM (Data Warehouse Modeling): Traditional OLAP modeling layer with star/snowflake schemas. With modern OLAP engines, this layer often merges with ADS/IDX layers for application analysis.
  • DM (Data Mart): Custom subsets of the warehouse for specific departments or business lines, providing flexible analysis.
  • ADS/IDX (Application Data Store/Index Layer): Final application-facing layer for efficient queries and analytics. Can integrate with BI tools, dashboards, or directly feed large models for natural language query.

In general, the three-layer structure (ODS, DWD, DWS) forms the basic framework, with further DW subdivisions depending on company needs and scenarios.
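
To make the basic three-layer flow concrete, here is a minimal SQL sketch in Hive/Spark-style syntax. All table names, columns, and cleaning rules below are hypothetical illustrations, not a prescribed model:

```sql
-- ODS: order records landed from the source system with minimal changes, partitioned by day.
CREATE TABLE IF NOT EXISTS ods_orders (
  order_id    STRING,
  user_id     STRING,
  amount      STRING,        -- source types kept; heavy cleaning deferred to DWD
  status      STRING,
  update_time STRING
) PARTITIONED BY (dt STRING);

-- DWD: cleaned, typed, subject-oriented detail records.
CREATE TABLE IF NOT EXISTS dwd_trade_order_detail (
  order_id    BIGINT,
  user_id     BIGINT,
  amount      DECIMAL(18, 2),
  status      STRING,
  update_time TIMESTAMP
) PARTITIONED BY (dt STRING);

INSERT OVERWRITE TABLE dwd_trade_order_detail PARTITION (dt = '2024-11-08')
SELECT CAST(order_id AS BIGINT),
       CAST(user_id AS BIGINT),
       CAST(amount AS DECIMAL(18, 2)),
       UPPER(TRIM(status)),
       CAST(update_time AS TIMESTAMP)
FROM ods_orders
WHERE dt = '2024-11-08'
  AND order_id IS NOT NULL;           -- basic data-quality rule

-- DWS: reusable daily summary built on the detail layer
-- (dws_trade_order_daily is assumed to exist with user_id/order_cnt/order_amount columns).
INSERT OVERWRITE TABLE dws_trade_order_daily PARTITION (dt = '2024-11-08')
SELECT user_id,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS order_amount
FROM dwd_trade_order_detail
WHERE dt = '2024-11-08'
GROUP BY user_id;
```

In this shape, downstream teams query dws_trade_order_daily directly instead of re-aggregating raw orders, which is exactly the reuse benefit described earlier.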

ETL Architecture under Data Warehouse Layers

ETL (Extraction, Transformation, Loading) bridges source systems and the warehouse, moving heterogeneous data into temporary intermediate layers for cleaning, transformation, and integration before loading into the warehouse or marts.

ETL often occupies 60–80% of implementation effort. With larger datasets and the rise of real-time and semi-/unstructured data, ETL evolves into EtLT (Extraction → lightweight transformation → Loading → in-warehouse Transformation) to handle diverse sources and streaming scenarios.

  1. Extraction
    Extracts raw data from sources. Traditional methods include an initial full load (for building dimension and fact tables) and incremental refreshes (via scheduled jobs or triggers). Semi-structured and unstructured sources (APIs, XML) and binlog/CDC streams require parsing and formatting into formats the processing engine (e.g., WhaleTunnel/SeaTunnel) can consume. A SQL sketch of the extraction, lightweight transformation, and loading steps appears after this list.

  2. Lightweight Transformation / Data Cleaning
    Cleans and standardizes data, removing duplicates, inconsistencies, or rule violations. Ensures consistent formats across sources (e.g., unifying id vs ids fields). Complex business logic is usually applied within the warehouse via SQL/procedures rather than external tools.
    For example, WhaleTunnel/SeaTunnel supports UI/script-based lightweight cleaning, adding fields, modifying data types/names, and filtering unnecessary data.

  3. Loading
    Loads cleaned/transformed data into target storage (HDFS, Doris, Hive, Hudi, Iceberg, Greenplum, etc.) using bulk load or high-speed write interfaces rather than row-by-row JDBC inserts, which quickly become a bottleneck. Tasks are managed via platforms like Apache DolphinScheduler or WhaleScheduler, with layer-specific DAGs for efficiency.
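
For illustration, the EtLT steps above can be sketched in SQL. In a real pipeline the extraction and loading would be performed by an ingestion tool such as WhaleTunnel/SeaTunnel rather than hand-written queries, and every table and column name below (source_db.orders, stg_orders_20241108, ods_orders) is a hypothetical example:

```sql
-- Extraction: pull only the rows changed since the last successful watermark.
-- In practice the ingestion tool reads these rows via JDBC queries or CDC/binlog;
-- this query only illustrates the logic.
SELECT order_id, uid, amount, status, update_time
FROM source_db.orders
WHERE update_time > '2024-11-07 00:00:00';    -- watermark from the previous run

-- Lightweight transformation + loading: unify field names, drop exact duplicates,
-- and bulk-write the result into the day-partitioned ODS table.
INSERT OVERWRITE TABLE ods_orders PARTITION (dt = '2024-11-08')
SELECT DISTINCT
       order_id,
       uid AS user_id,                        -- unify naming across sources
       amount,
       status,
       update_time
FROM stg_orders_20241108                      -- staging table landed by the ingestion job
WHERE order_id IS NOT NULL;                   -- drop obviously broken rows before landing
```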

Data Transformation

Large volumes of data enter warehouses/lakehouses via batch or streaming. Modern warehouses leverage SQL for complex business transformations rather than relying on ETL tools (Informatica, DataStage, Talend), simplifying migration, personnel management, and Agile/DataOps workflows.

EtLT architecture is now mainstream, with SQL, Python, or Shell scripts commonly used for transformations, orchestrated by scheduling tools (Apache DolphinScheduler, WhaleScheduler) to efficiently manage task flows.
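
As a small illustration of this kind of in-warehouse transformation, the sketch below shows the sort of SQL a daily node in Apache DolphinScheduler or WhaleScheduler might execute to build an application-layer table from the summary layer; the table names and the ${bizdate} parameter are hypothetical placeholders filled in by the scheduler:

```sql
-- Daily ADS build: aggregate the DWS summary into an application-facing KPI table.
-- ${bizdate} is a hypothetical date parameter substituted by the scheduler at run time.
INSERT OVERWRITE TABLE ads_trade_kpi_daily PARTITION (dt = '${bizdate}')
SELECT '${bizdate}'              AS stat_date,
       COUNT(DISTINCT user_id)   AS active_buyers,
       SUM(order_cnt)            AS total_orders,
       SUM(order_amount)         AS gmv
FROM dws_trade_order_daily
WHERE dt = '${bizdate}';
```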

💓 Next article preview: (2) Technical Architecture under Data Warehouse Layers
