A Complete Guide to Core Data Warehouse Design Standards: From Layers, Types to Lifecycle
“Emerging Data Lakehouse Design and Practice Handbook · Data Lakehouse Modeling and Naming Standards (2025)” consists of four progressive guides, structured along the main line of Model Architecture – Common Standards – Layering Standards – Naming Standards. Together, they systematically build an evolvable, governable, and shareable modern data lakehouse.
The first article “(I) Principles of Data Model Architecture: Four Layers and Seven Stages, the 'First Cornerstone' of Data Lake and Warehouse Modeling” introduced the four-layer (ODS–DW–APP, with DWD/DWM/DWS inside DW) data layering architecture. Around four core principles—domain partitioning, high cohesion & low coupling, common logic sinking, and cost-performance balance—it laid a unified and extensible foundation for dimensional modeling in a lakehouse architecture.
This article is the second in the series. It details the common design standards followed in data warehouses, including hierarchy invocation standards, data type standards, string handling, and other warehouse design specifications.
The remaining two articles will further analyze detailed specifications for each data warehouse layer and the unified naming system within this framework, helping enterprises complete the full chain from data ingestion to value realization with one methodology. Stay tuned for the full series.
1. Hierarchy Invocation Standards: Controlling Data Warehouse Flow and Reference Principles
🚀 Business data flow design and layering reference essentials
Stable business should follow the standard flow: ODS → DWD → DWS → APP. Unstable or exploratory requirements may use ODS → DWD → APP or ODS → DWD → DWM → APP instead.
Beyond ensuring data flow rationality, reference principles must also be followed:
- Normal flow: ODS → DWD → DWM → DWS → APP. If a path ODS → DWD → DWS → APP appears, it indicates the domain's middle layer is incomplete; the DWD data involved should first be landed in DWM. For rarely used tables, a direct DWD → DWS reference is allowed.
- Avoid having a DWS wide table reference both DWD and the DWM of the same domain.
- Within the same domain, generating DWM from DWM should be avoided to maintain ETL efficiency.
- DWM, DWS, and APP must not directly use ODS tables. Only DWD can reference ODS.
- Reverse dependencies are forbidden, e.g., DWM tables depending on DWS tables.
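A minimal sketch of how these reference rules could be checked programmatically; the layer names follow the article, but the `ALLOWED` table and the checker itself are illustrative assumptions, not part of any real tool:

```python
# Which source layers each target layer may read from, per the rules above.
ALLOWED = {
    "DWD": {"ODS"},                 # only DWD may reference ODS directly
    "DWM": {"DWD"},                 # no DWM -> DWM within a domain
    "DWS": {"DWD", "DWM"},          # DWD -> DWS allowed for rarely used tables
    "APP": {"DWD", "DWM", "DWS"},   # APP must never read ODS
}

def check_reference(target_layer: str, source_layer: str) -> bool:
    """Return True if a table in target_layer may read from source_layer."""
    return source_layer in ALLOWED.get(target_layer, set())

# Flag rule violations in a list of (target, source) dependencies.
violations = [
    (t, s)
    for t, s in [("DWM", "ODS"), ("DWS", "DWD"), ("DWM", "DWS")]
    if not check_reference(t, s)
]
print(violations)  # [('DWM', 'ODS'), ('DWM', 'DWS')]
```

The first violation is a non-DWD layer reading ODS; the second is the forbidden reverse dependency of DWM on DWS.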
2. Data Type Standards: Unifying Data Warehouse Type Settings
🔍 Precise type definitions for various data
Different data types must be standardized and strictly enforced:
- Amount: double, or decimal(28,6) where precision matters; always state whether the unit is cents or yuan.
- String: string.
- ID fields: bigint.
- Time: string.
- Status: string.
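The preference for an exact decimal type over double for amounts comes down to binary floating-point rounding. A small Python illustration (using the standard-library decimal module as a stand-in for a warehouse decimal(28,6) column):

```python
# Why exact decimals are preferred over double for money.
from decimal import Decimal

# Binary floats accumulate rounding error on currency values:
float_total = sum([0.1] * 10)                           # not exactly 1.0
exact_total = sum([Decimal("0.1")] * 10, Decimal("0"))  # exactly 1.0

print(float_total == 1.0)            # False
print(exact_total == Decimal("1"))   # True

# Storing amounts as integer cents (bigint) is another common option;
# either way, document whether a column holds cents or yuan.
cents = 12345
yuan = Decimal(cents) / Decimal(100)  # Decimal('123.45')
```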
3. Data Redundancy Standards: Reasonable Control of Wide Table Redundancy
🤔 Considering frequency, latency, and duplication rate
Redundant fields in wide tables must ensure:
- Redundant fields must be frequently used, with at least 3 downstream references.
- Redundant fields should not introduce significant data latency.
- Redundant fields should not overly duplicate existing fields; overlap should generally not exceed 60%. Otherwise, use a join or extend the original table.
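The three checks above can be collapsed into a single gate. A hypothetical helper (the latency threshold is an assumed value, not from the standard):

```python
# Hypothetical gate applying the three wide-table redundancy rules.
def redundancy_allowed(downstream_refs: int,
                       added_latency_min: float,
                       overlap_ratio: float,
                       max_latency_min: float = 30.0) -> bool:
    """True if a redundant field may be added to a wide table.

    overlap_ratio: fraction of the candidate fields already present in the
    wide table (rule: must not exceed 0.6).
    max_latency_min: assumed cutoff for "significant" latency.
    """
    return (downstream_refs >= 3
            and added_latency_min <= max_latency_min
            and overlap_ratio <= 0.6)

print(redundancy_allowed(5, 10, 0.4))   # True
print(redundancy_allowed(2, 10, 0.4))   # False: fewer than 3 consumers
print(redundancy_allowed(5, 10, 0.8))   # False: >60% overlap, use a join
```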
4. NULL Field Handling Standards: Strategies for Dimensions and Metrics
❓ Why set NULL values this way
- Dimension fields: set to -1 (a sentinel that maps to an "unknown" dimension member, so joins still match).
- Metric fields: set to 0 (so aggregations such as SUM are unaffected by missing values).
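A minimal sketch of this rule applied to a single row; the column sets are assumptions for illustration:

```python
# NULL-handling per the standard: -1 for dimensions, 0 for metrics.
DIM_FIELDS = {"city_id", "channel_id"}        # assumed dimension columns
METRIC_FIELDS = {"order_cnt", "pay_amount"}   # assumed metric columns

def fill_nulls(row: dict) -> dict:
    """Replace NULL (None) values according to the field's role."""
    out = {}
    for col, val in row.items():
        if val is None:
            out[col] = -1 if col in DIM_FIELDS else 0
        else:
            out[col] = val
    return out

row = {"city_id": None, "channel_id": 7, "order_cnt": None, "pay_amount": 9.5}
print(fill_nulls(row))
# {'city_id': -1, 'channel_id': 7, 'order_cnt': 0, 'pay_amount': 9.5}
```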
5. Metric Caliber Standards: Ensuring Consistency of Metrics
🧩 Specific methods for metric collation and management
Consistency must be ensured within domains, without ambiguity.
Through layering, unified data outputs are provided to ensure consistent external metric definitions, avoiding “same metric, different definition.”
1) Metric Collation
Inconsistent definitions increase data usage costs, leading to disputes and repeated verification. In governance, all metrics collected from requirements are further refined to clarify their definitions. If two metrics share a name but differ in definition, determine whether to merge. If both must exist, their names must clearly distinguish them.
2) Metric Management
Metric management includes atomic metrics and derived metrics.
Atomic metrics:
- Assign to production line, business unit, domain, and business process.
- Define source data within the business process.
- Record English/Chinese name, description.
- Fill in the metric function.
- System auto-generates definition expressions.
- System auto-generates SQL from the definition and the source table.
Derived metrics:
- Built on atomic metrics with additional dimensions or qualifiers.
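The relationship between atomic and derived metrics, and the system's auto-generation of expressions and SQL, can be sketched as follows. All class and field names here are illustrative assumptions, not a real metric-management system:

```python
# Illustrative model of atomic vs. derived metrics.
from dataclasses import dataclass

@dataclass
class AtomicMetric:
    name: str            # English name
    description: str
    function: str        # metric function, e.g. SUM, COUNT
    source_column: str   # source data within the business process
    source_table: str

    def expression(self) -> str:
        """Auto-generated definition expression."""
        return f"{self.function}({self.source_column})"

    def sql(self) -> str:
        """Auto-generated SQL from the definition and the source table."""
        return f"SELECT {self.expression()} FROM {self.source_table}"

@dataclass
class DerivedMetric:
    base: AtomicMetric
    dimension: str       # additional dimension or qualifier

    def sql(self) -> str:
        return (f"SELECT {self.dimension}, {self.base.expression()} "
                f"FROM {self.base.source_table} GROUP BY {self.dimension}")

pay = AtomicMetric("pay_amount", "total paid amount", "SUM",
                   "amount", "dwd_trade_order")
daily_pay = DerivedMetric(pay, "dt")    # add a date dimension
print(pay.sql())        # SELECT SUM(amount) FROM dwd_trade_order
print(daily_pay.sql())  # SELECT dt, SUM(amount) FROM dwd_trade_order GROUP BY dt
```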
6. Data Table Handling Standards: Characteristics of Different Table Types
⚡ Differences between incremental, full, snapshot, and zipper tables
1) Incremental Table
New data since the last extraction.
- Records changes only, not totals.
- Reports changes only; no change, no report.
- One partition per day.
2) Full Table
All the latest state data each day.
- Reports regardless of change.
- Each report includes all data (changed + unchanged).
- Only one partition.
3) Snapshot Table
Partitioned daily, recording full data as of that date.
- Reports regardless of change.
- Each report includes all data (changed + unchanged).
- One partition per day.
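The differences between incremental, full, and snapshot tables can be seen in a toy two-day example; the entity states and dates are made up for illustration:

```python
# id -> state of one entity table on two consecutive days.
day1 = {1: "created", 2: "created"}
day2 = {1: "paid", 2: "created", 3: "created"}   # id 1 changed, id 3 is new

# Incremental table: one partition per day, changed/new rows only.
incr = {
    "d1": day1,
    "d2": {k: v for k, v in day2.items() if day1.get(k) != v},
}

# Full table: a single partition overwritten with the latest state.
full = day2

# Snapshot table: one partition per day, each holding the complete state.
snap = {"d1": day1, "d2": day2}

print(incr["d2"])   # {1: 'paid', 3: 'created'} -- only the changes
print(len(full))    # 3 rows in the single partition
print(snap["d2"])   # the complete state as of day 2
```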
4) Zipper Table
Holds the full history of the data up to the present.
- Records every change of an entity from its first appearance to the present.
- Each load yields the complete history plus the current state.
- The current record holds the state since the most recent change.
- Only one partition.
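A minimal zipper-table update sketch: close the open record of each changed key and append a new open record. The start_date/end_date columns and the '9999-12-31' open-record sentinel are common conventions, assumed here for illustration:

```python
# Minimal zipper-table (slowly-changing history) update.
OPEN = "9999-12-31"   # sentinel end_date marking the current record

def apply_changes(zipper: list, changes: dict, dt: str) -> list:
    """Close the open record of each changed id and append a new one."""
    out = []
    changed = set(changes)
    for rec in zipper:
        if rec["end_date"] == OPEN and rec["id"] in changed:
            rec = {**rec, "end_date": dt}   # close the current record
        out.append(rec)
    for key, state in changes.items():
        out.append({"id": key, "state": state,
                    "start_date": dt, "end_date": OPEN})
    return out

zipper = [{"id": 1, "state": "created",
           "start_date": "2025-01-01", "end_date": OPEN}]
zipper = apply_changes(zipper, {1: "paid"}, "2025-01-02")

# The open records are the current state; closed records are the history.
current = [r for r in zipper if r["end_date"] == OPEN]
print(current)  # one open record per id, holding the latest state
```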
7. Table Lifecycle Management: Strategies Based on Data History and Table Type
⏳ Lifecycle matrix via historical data grading and table type categorization
1) Historical Data Grading
Four levels: P0, P1, P2, P3.
- P0: Critical domain and app data, non-recoverable, e.g., transactions, logs, group KPIs, IPO-related tables.
- P1: Important business and app data, non-recoverable, e.g., key product data.
- P2: Important but recoverable business/app data, e.g., intermediate ETL process data.
- P3: Non-critical, recoverable data, e.g., SNS product reports.
2) Table Types
- Event stream table (incremental): non-key or non-duplicate data like logs.
- Event mirror table (incremental): process data with primary key and slowly changing attributes, e.g., transactions, order states.
- Dimension tables: users, products.
- Merge full table: process/dimension data with possible duplicates; merge keeps the latest by key, older states in previous partitions.
- ETL temporary tables: generated during processing, not recommended to keep, max 7 days.
- TT temporary data: data from TT pulls or produced by DbSync; it flows into the DS layer while ODS keeps the original copy; retention defaults to 93 days.
- Regular full tables: small business/product data, often pulled directly; long retention, strategy by historical level.
By grading historical data and categorizing table types, a lifecycle management matrix is generated.
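Such a matrix can be represented as a simple lookup keyed by (grade, table type). The specific retention values below are illustrative assumptions, except for the 7-day ETL-temp and 93-day TT-temp limits stated above:

```python
# Hypothetical lifecycle matrix: retention days by (data grade, table type).
RETENTION_DAYS = {
    ("P0", "event_stream"): -1,   # -1 = keep permanently (assumed)
    ("P0", "zipper"): -1,
    ("P1", "event_stream"): 730,  # assumed value
    ("P2", "merge_full"): 365,    # assumed value
    ("P3", "etl_temp"): 7,        # ETL temp tables: at most 7 days
    ("P3", "tt_temp"): 93,        # TT temp data: default 93 days
}

def retention(grade: str, table_type: str, default: int = 365) -> int:
    """Look up retention days; fall back to an assumed default."""
    return RETENTION_DAYS.get((grade, table_type), default)

print(retention("P3", "etl_temp"))   # 7
print(retention("P1", "snapshot"))   # falls back to the assumed default, 365
```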
- Previous article: “(I) Principles of Data Model Architecture: Four Layers and Seven Stages, the 'First Cornerstone' of Data Lake and Warehouse Modeling”
- Next article preview: Detailed design standards for each data warehouse layer