From third-normal-form warehouses to hash-key vaults and one-big-table lakehouses
1. Why bother with data modeling at all?
A data model is the contract between raw information and every application, query, and dashboard that touches it. Good modeling:
- Tames complexity – It gives your team a common vocabulary (orders, products, patients).
- Protects data quality – Constraints stop bad keys from sneaking in.
- Speeds up queries – Indexes and partitions only work when the model is explicit.
- Enables governance – Lineage, audit, PII masking, and retention rules ride on model metadata.
Poor modeling has the opposite effect: convoluted ETL, expensive re-work, and a queue of frustrated analysts.
2. A taxonomy of modeling patterns
Below is a “periodic table” of the approaches you’ll meet in modern analytics. Each sits somewhere on two axes:
- Normalization ←→ Denormalization (how many joins vs. how many duplicates)
- Raw-capture first ←→ Serve-layer first (ingest everything now vs. design for consumption up-front)
2.1 Highly-normalized, top-down
Pattern | Creator | Essence | Strengths | Drawbacks |
---|---|---|---|---|
3NF Enterprise Data Warehouse (Inmon / CIF) | Bill Inmon | 3rd-normal-form EDW at the core, marts downstream | Single source of truth, minimal redundancy, rigorous data governance | Heavy initial modeling, slower to onboard new sources |
Anchor Modeling | Lars Rönnbäck | Split every attribute into its own “attribute table” hanging off an anchor | Schema changes are add-only; great fit for columnar engines | Very many joins; not intuitive for business users |
Focal Modeling | Roelant Vos | Groups Anchor attributes into “Focal” tables per theme | Fewer joins, keeps Anchor’s agility | Newer, smaller community |
2.2 Ensemble (history-friendly, ELT-first)
Pattern | Creator | Essence | Strengths | Drawbacks |
---|---|---|---|---|
Data Vault 1.0 / 2.0 | Dan Linstedt | Hubs (business keys) + Links (relationships) + Satellites (context) in insert-only tables | Handles many volatile sources, full audit trail, parallel ELT loads | Raw vault is not query-friendly; extra layer needed |
Hybrid Vault | Community | Raw Vault feeding Kimball stars or wide tables | Combines agility with query simplicity | Two layers to maintain |
2.3 Denormalized, serve-layer first
Pattern | Creator | Essence | Strengths | Drawbacks |
---|---|---|---|---|
Dimensional (Kimball) | Ralph Kimball | Fact tables (measures) + Dimensions (context) arranged in star/snowflake schemas | Intuitive for analysts, performant in BI tools, clear grain & SCD handling | Up-front design effort; schema drift requires remodel |
One Big Table (OBT) | Data-lake crowd | Dump facts + dimension attributes into a single very wide table | Zero joins; engines prune unused columns | Explodes in width; hard to manage slowly changing history |
Super-Marts / Aggregate Tables | — | Pre-calculated roll-ups & joins stored as materialized views | Blazing fast dashboards; isolates hot queries | Needs refresh jobs; duplicates data |
2.4 Semantic or conceptual first
Pattern | Essence | Use case |
---|---|---|
Entity-Relationship (ER) / UML | White-board entities & relationships, then map to tables | Requirements workshops, data contracts |
Fact-Based / ORM (Object-Role) | Capture assertions in natural language; tool generates schema | Complex business rules, metadata-driven generation |
2.5 Domain-driven & federated
Pattern | Essence | Use case |
---|---|---|
Data Mesh | Each domain owns its model (could be star, vault, etc.) and publishes it as a data product with contracts | Large orgs, federated governance, cross-domain self-service |
2.6 NoSQL-specific designs
Store type | Modeling focus | Example tactic |
---|---|---|
Document | Embed child objects you read together; reference on 1-to-many | Put order lines inside an order document |
Key–Value | Design keys for access path; keep values opaque | Use tenant:customer:123 as key |
Column-family | Row key decides partition; create one wide row per entity | IoT sensor ID + timestamp |
Graph | Label nodes & relationships; think traversals |
(Customer)-[:PURCHASED]->(Product) patterns |
3. How people mix & match in the real world
A common cloud pattern:
Sources ─► Raw landing zone (PARQUET/JSON) -- schema-on-read
│
▼
Data Vault → PIT / Bridge tables -- auditable, historized
│
▼
Kimball Star / Wide Tables in “gold” layer -- BI friendly
│
▼
Dashboards, APIs, ML notebooks
Land quickly, model slowly has become the mantra: ELT to a vault or anchor model for resiliency, then transform into stars or OBTs to keep analysts happy.
4. Choosing the right pattern – decision matrix
Primary driver | Favors | Rationale |
---|---|---|
Dozens of changing source systems | Data Vault / Anchor | Insert-only, hash-key joins absorb drift |
Self-service dashboards tomorrow | Kimball Star | Clear grain, predictable joins |
Regulatory audit & history | Data Vault | Satellites keep every version with load datetime |
Minimal infra & headcount | OBT | Simplest to build; engine optimizes column scans |
Enterprise canonical model | 3NF EDW (Inmon) | Deduplicate, enforce referential integrity |
Federated teams | Data Mesh | Domain ownership, shared contracts (e.g., OpenAPI, JSON Schema) |
5. Practical tips for any chosen model
- Start with workload stories – List your ten most valuable queries and ingest paths; model to make them trivial.
- Version your schema – Use migration files, semantic-version tags, and column-level comments.
- Automate documentation – dbt docs, DataHub, OpenMetadata, or ERD generators keep humans in sync.
- Separate raw ↔ curated – Never let business users query the ingest layer directly.
- Monitor & refactor – Track slow queries and unused tables; models are living artifacts.
6. Take-aways
- There is no one-size-fits-all. Modeling is a continuum of agility ↔ simplicity and normalization ↔ duplication.
- Layering beats silver bullets – Raw ensemble models safeguard lineage; downstream stars or wide tables serve speed.
- People & process matter – Clear ownership, automated tests, and good documentation count as much as the pattern you pick.
Whether you are designing a green-field lakehouse, untangling a decade-old EDW, or federating dozens of domain teams, understanding these modeling options lets you pick—and combine—the right tools for today’s needs without boxing in tomorrow’s growth.
Top comments (0)
Some comments may only be visible to logged-in visitors. Sign in to view all comments.