(I) Principles of Data Model Architecture: Four Layers and Seven Stages

Handbook of Emerging Data Lake and Warehouse Design and Practice · Data Lake and Warehouse Modeling and Model Naming Specifications (2025) consists of four progressive guides. Organized along the main line of "model architecture - common specifications - layering specifications - naming specifications", it systematically builds a modern data lake and warehouse that can be evolved, governed, and shared.

The first article, "Principles of Data Model Architecture", proposes a four-layer data layering architecture of "ODS - DW - APP" (with DWD/DWM/DWS inside the DW layer) and focuses on four principles: theme domain division, high cohesion and low coupling, sinking of common logic, and cost-performance balance, laying a unified and extensible design foundation for dimensional modeling in the integrated lake-warehouse.

The following three articles will, within this framework, implement common design patterns, detailed specifications for each layer, and a unified naming system, helping enterprises complete the full-link construction from data ingestion into the lake to value realization with a single methodology. Stay tuned for the full version.

Steering the Course: Top-level Principles of Data Model Architecture that Keep the Data Lake and Warehouse Afloat for a Decade

1. How Many Layers Are Enough?

A solid, reliable data warehouse needs a clear layering structure that keeps each layer stable, shields downstream consumers from upstream changes, and avoids overly long processing chains. So the question is: we keep talking about data warehouse layering, but how many layers is best?

The layering schemes in common use today vary widely. We should not look only at the surface, but at the underlying principles, and we should not layer for the sake of layering. There is no one-size-fits-all solution; only the one that fits your situation matters.

Layering aims to support the current business with data quickly, abstract a common framework that future business lines can reuse, provide stable and accurate data support for business growth, and use the existing models to suggest directions for new business, in other words, to be data-driven and data-enabling.

A good layering architecture has the following advantages:

  1. Clear data structure;
  2. Data lineage tracking;
  3. Reduced duplicate design;
  4. Organized data relationships;
  5. Shielding the impact of raw data.

Data warehouse layering should be designed around the company's business, with the responsibilities of each layer clearly defined. The structure generally adopted is the one described below: an ODS layer, a DW layer (DWD/DWM/DWS), an APP layer, and, where needed, a separate dimension table layer.

2. Panoramic View of Four Layers and Seven Stages

At which layer should data warehouse modeling be carried out? Taking dimensional modeling as an example, modeling happens at the layer directly above the data source layer, that is, the DW layer. The DW layer is therefore the core layer of data warehouse construction.

The following elaborates on the construction specifications for each layer (they differ slightly from the high-level layering sketched above):

1. Data Source Layer ODS

The ODS layer is the layer closest to the source data. Since it may be necessary to trace data issues later, it is not recommended to do much cleaning at this layer; just land the raw data intact. Denoising, deduplication, outlier handling, and similar processing can be done in the subsequent DWD layer.
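
A minimal sketch of this idea, assuming a PySpark batch job (the source path and the table name ods.ods_order_raw are hypothetical), is to land the extract unchanged and only add a partition column:

```python
# ODS landing sketch: write the raw extract as-is; no cleaning or dedup here.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ods_landing").enableHiveSupport().getOrCreate()

raw = spark.read.json("s3a://landing/orders/2025-01-01/")  # untouched source extract

(raw
 .withColumn("dt", F.lit("2025-01-01"))      # only an audit/partition column is added
 .write.mode("overwrite")
 .partitionBy("dt")
 .saveAsTable("ods.ods_order_raw"))          # full fidelity preserved for later tracing
```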

2. Data Warehouse Layer DW

The data warehouse layer is the core layer to be designed when building a data warehouse. Here, the data obtained from the ODS layer is used to establish various data models according to themes.
The DW layer is further divided into the DWD (Data Warehouse Detail) layer, the DWM (Data Warehouse Middle) layer, and the DWS (Data Warehouse Service) layer.

1) Data Detail Layer: DWD (Data Warehouse Detail)
This layer generally keeps the same data granularity as the ODS layer and provides a certain level of data quality assurance. The DWD layer cleans, integrates, and standardizes the data: dirty data, garbage data, inconsistent specifications, inconsistent state definitions, and non-standard naming are all handled here.
At the same time, to improve the usability of the detail layer, this layer applies dimension degeneration, folding selected dimension attributes into the fact table to reduce joins between the fact table and dimension tables.
In addition, some aggregation is also done at this layer, gathering data of the same theme into one table to improve data availability.
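
As a hedged sketch of these DWD steps (PySpark; all table and column names are assumptions for illustration), the job below cleans, standardizes, deduplicates, and degenerates a small shop dimension into the fact table:

```python
# DWD sketch: clean, standardize, deduplicate, and degenerate a dimension.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dwd_order").enableHiveSupport().getOrCreate()

ods = spark.table("ods.ods_order_raw").where(F.col("dt") == "2025-01-01")
dim_shop = spark.table("dim.dim_shop")          # assumed small dimension table

dwd = (ods
       .where(F.col("order_id").isNotNull())                                # drop dirty rows
       .withColumn("order_status", F.upper(F.trim(F.col("order_status"))))  # unify state codes
       .dropDuplicates(["order_id"])                                        # dedup at order grain
       .join(F.broadcast(dim_shop.select("shop_id", "shop_name", "shop_city")),
             "shop_id", "left"))                                            # degenerate dimension

dwd.write.mode("overwrite").partitionBy("dt").saveAsTable("dwd.dwd_trade_order_di")
```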

2) Data Intermediate Layer: DWM (Data Warehouse Middle)
This layer performs light aggregation on DWD data to generate a series of intermediate tables, improving the reusability of common indicators and reducing duplicate processing.
Intuitively, this means aggregating on the common core dimensions and computing the corresponding statistical indicators.
In practice, if the statistical indicators of a wide table are computed directly from DWD or ODS data, the computation is heavy and the reusable dimensions are few. The usual approach is therefore to first build several small intermediate tables in the DWM layer and then splice them into the DWS wide table. Since the boundary between wide and narrow tables is hard to define precisely, the DWM layer can also be dropped, keeping only the DWS layer and placing all of this data there.
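
A sketch of such a light aggregation (PySpark; the tables and metrics are illustrative assumptions), keyed by the common core dimensions of user and day:

```python
# DWM sketch: light aggregation on DWD detail into a reusable intermediate table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dwm_user_order").enableHiveSupport().getOrCreate()

dwd = spark.table("dwd.dwd_trade_order_di").where(F.col("dt") == "2025-01-01")

dwm = (dwd.groupBy("dt", "user_id")                       # common core dimensions
          .agg(F.count("order_id").alias("order_cnt_1d"),
               F.sum("order_amount").alias("order_amount_1d")))

dwm.write.mode("overwrite").partitionBy("dt").saveAsTable("dwm.dwm_trade_user_order_1d")
```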

3) Data Service Layer: DWS (Data Warehouse Service)
The DWS layer is the public summary layer. It performs light summarization at a granularity slightly coarser than the detail data, integrating and summarizing DWD data into service-oriented tables for analyzing a given theme domain, usually as wide tables. The DWS layer should cover roughly 80% of application scenarios; it is also referred to as the data mart or wide-table layer.
Divided by business, for theme domains such as traffic, orders, and users, field-rich wide tables are generated for subsequent business queries, OLAP analysis, data distribution, and so on.
Generally, this layer contains relatively few tables, and each table covers a broad slice of business content; because of their many fields, these tables are commonly called wide tables.
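
A sketch of splicing DWM intermediate tables into a DWS wide table (PySpark; both input tables and the output name are hypothetical):

```python
# DWS sketch: splice DWM intermediate tables into one user-level wide table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dws_user_wide").enableHiveSupport().getOrCreate()

dt = "2025-01-01"
order_1d = spark.table("dwm.dwm_trade_user_order_1d").where(F.col("dt") == dt)
visit_1d = spark.table("dwm.dwm_traffic_user_visit_1d").where(F.col("dt") == dt)

dws = (order_1d.join(visit_1d.drop("dt"), "user_id", "full_outer")
               .withColumn("dt", F.lit(dt)))              # restore the partition column

dws.write.mode("overwrite").partitionBy("dt").saveAsTable("dws.dws_user_topic_wide_1d")
```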

3. Data Application Layer APP

This layer mainly provides data to data products and data analysis. For online systems, the data is usually stored in ES, PostgreSQL, Redis, and similar systems; for data analysis and data mining, it may also be stored in Hive or Druid. The report data we often talk about, for example, generally lives here.
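
A sketch of pushing one such report table to an online store (a PySpark JDBC write to PostgreSQL; the connection details, table names, and metrics are assumptions):

```python
# APP sketch: aggregate a DWS wide table into a report and push it to PostgreSQL.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("app_report").enableHiveSupport().getOrCreate()

report = (spark.table("dws.dws_user_topic_wide_1d")
               .where(F.col("dt") == "2025-01-01")
               .groupBy("dt")
               .agg(F.countDistinct("user_id").alias("active_users"),
                    F.sum("order_amount_1d").alias("gmv")))

(report.write.format("jdbc")                          # requires the PostgreSQL JDBC driver
       .option("url", "jdbc:postgresql://reporting-db:5432/app")   # assumed endpoint
       .option("dbtable", "app.daily_trade_report")
       .option("user", "etl").option("password", "***")
       .mode("append")
       .save())
```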

4. Dimension Table Layer

If there are many dimension tables, a separate layer can be designed for them. The dimension table layer mainly contains two kinds of data (a small example follows the list below):

  • High-cardinality dimension data: typically tables such as user profile or product profile tables, whose row counts may reach tens of millions or even hundreds of millions.
  • Low-cardinality dimension data: typically configuration tables, such as the Chinese labels corresponding to enumeration values, or date dimension tables, whose row counts range from a handful of rows to tens of thousands.
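
As an example of the low-cardinality case, a date dimension table can be generated directly (PySpark; the date range and the table name dim.dim_date are assumptions):

```python
# Dimension layer sketch: build a small date dimension table from a date range.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dim_date").enableHiveSupport().getOrCreate()

dates = spark.sql("""
    SELECT explode(sequence(to_date('2025-01-01'), to_date('2025-12-31'))) AS date_key
""")

dim_date = (dates
            .withColumn("year",        F.year("date_key"))
            .withColumn("month",       F.month("date_key"))
            .withColumn("day_of_week", F.dayofweek("date_key"))       # 1 = Sunday
            .withColumn("is_weekend",  F.dayofweek("date_key").isin(1, 7)))

dim_date.write.mode("overwrite").saveAsTable("dim.dim_date")
```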

3. Theme Domain Division Principles

1) Division by Business or Business Process

Business is easy to understand. It refers to functional modules/business lines.

Business process: It refers to a business activity event of an enterprise; for example, placing an order, making a payment, and issuing a refund are all business processes. Note that a business process is an indivisible behavioral event.

2) Division by Data Domain

A data domain is an abstraction that groups business processes or dimensions for business analysis. A business process can be summarized as an indivisible behavioral event, under which indicators are defined; a dimension is the context in which a measurement is taken, for example, in the event of a buyer placing an order, the buyer is a dimension. To keep the whole system viable, data domains need to be abstracted and refined, then maintained and updated over the long term without being changed lightly. When dividing data domains, the division should cover all current business needs and also absorb new business without disruption, either by fitting it into an existing data domain or by adding a new one.
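
One way to make this division explicit (the domains and process names below are purely illustrative, not a prescribed taxonomy) is to keep it as configuration, so a new business process can be slotted into an existing domain or a new domain added without touching the rest:

```python
# Illustrative data-domain registry: business processes are grouped into domains.
DATA_DOMAINS = {
    "trade":   ["place_order", "pay_order", "refund_order"],
    "traffic": ["page_view", "click", "search"],
    "user":    ["register", "login", "profile_update"],
}

def domain_of(business_process: str) -> str:
    """Return the data domain that owns a given business process."""
    for domain, processes in DATA_DOMAINS.items():
        if business_process in processes:
            return domain
    raise KeyError(f"Unassigned business process: {business_process}")

print(domain_of("pay_order"))  # -> trade
```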

4. Data Model Design Principles: "Five Golden Rules, None Can Be Missed"

1) High Cohesion, Low Coupling

That is, high cohesion within a theme and low coupling between themes. The detail layer divides themes by business process, the summary layer divides analysis themes by "entity + activity", and the application layer divides application themes by application need.

2) Separation of Core Model and Extended Model

Establish a core model and extended model system. The fields in the core model support common, core business; the fields in the extended model support personalized or niche application needs. Extended-model fields should not intrude excessively into the core model, so as not to compromise the simplicity and maintainability of the core model's architecture.
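
A sketch of how this separation can look in practice (table names are hypothetical): the extension table shares the grain and key of the core table, and only the few applications that need personalized fields join it in:

```python
# Core/extended model sketch: personalized fields live in a same-grain extension table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core_ext_demo").enableHiveSupport().getOrCreate()

core = spark.table("dws.dws_trade_order_core_1d")   # stable fields for common core business
ext  = spark.table("dws.dws_trade_order_ext_1d")    # personalized / niche fields

# Most consumers read only the core table; the few that need extended fields
# join on the shared key instead of bloating the core model.
personalized_view = core.join(ext, ["order_id", "dt"], "left")
```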

3) Sinking and Singularity of Common Processing Logic

The more widely shared a piece of processing logic is, the further down the data scheduling dependency chain it should be encapsulated and implemented. Do not push common processing logic out to the application layer, and do not let the same logic live in multiple places at once.
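
A minimal sketch of sinking such logic (the function name is hypothetical): the status-normalization rule is defined once in a shared function that lower-layer jobs call, rather than being re-implemented in each application-layer script:

```python
# Shared, single-definition transformation used by every downstream job.
from pyspark.sql import DataFrame, functions as F

def normalize_order_status(df: DataFrame) -> DataFrame:
    """The one authoritative definition of the order-status mapping."""
    return df.withColumn(
        "order_status",
        F.when(F.col("order_status").isin("PAID", "PAY_SUCCESS"), "PAID")
         .when(F.col("order_status").isin("CANCELLED", "CANCELED"), "CANCELLED")
         .otherwise(F.col("order_status")))

# DWD/DWM jobs call normalize_order_status(df); no APP-layer job redefines the rule.
```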

4) Cost - Performance Balance

Appropriate data redundancy can be traded for query and refresh performance, but excessive redundancy and wholesale data replication should be avoided.

5) Data Rollbackability

Given unchanged processing logic, running the job multiple times at different points in time must produce identical results.
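
A sketch of making a job re-runnable in this sense (assuming deterministic logic and an existing target table; names are illustrative): always overwrite exactly the job's own partition, so repeated runs converge to the same result instead of appending duplicates:

```python
# Re-runnable write: dynamic partition overwrite replaces only this job's partition.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("dwd_rerun_safe")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .enableHiveSupport().getOrCreate())

dt = "2025-01-01"
result = (spark.table("ods.ods_order_raw")
               .where(F.col("dt") == dt)
               .dropDuplicates(["order_id"]))       # deterministic transformation assumed

# insertInto with overwrite mode replaces only the dt partition of the existing target
# table, so rerunning the job yields the same data, with no duplicates and no side
# effects on other partitions.
result.write.mode("overwrite").insertInto("dwd.dwd_trade_order_di")
```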

Next Article Preview: "Data Warehouse Public Design Specifications"
