DatanestDigital

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

Medallion Architecture Accelerator: Medallion Architecture Guide

#databricks #azure #dataengineering #devops

Medallion Architecture Guide

Medallion Architecture Accelerator by Datanest Digital

Overview

The medallion architecture organises a data lakehouse into three logical layers, each providing progressively higher data quality and business value:

Layer	Purpose	Data State	Write Pattern
Bronze	Raw ingestion, source-of-truth archive	As-is from source	Append / Streaming
Silver	Cleansed, conformed, deduplicated	Type-safe, validated	Merge (Upsert)
Gold	Business aggregations, KPIs, models	Aggregated / Dimensional	Overwrite / Merge

Bronze Layer -- Patterns

What belongs here

Exact copies of source data with ingestion metadata.
Rescue columns for malformed records.
Schema evolution history (Auto Loader).

Recommended metadata columns

Column	Description
`_source_system`	Upstream system identifier
`_source_name`	Logical source name
`_source_file`	Input file path or URI
`_ingestion_timestamp`	When the row was ingested
`_ingestion_date`	Date partition key
`_row_hash`	SHA-256 hash for change detection
`_batch_id`	Batch run identifier
`_rescue_data`	Malformed record rescue column

Patterns

Streaming Append -- Use Auto Loader with availableNow trigger for incremental file ingestion. Provides exactly-once semantics via checkpointing.
JDBC Watermark -- Track the MAX(watermark_column) in a control table. On each run, extract only rows above the watermark.
Schema Evolution -- Enable cloudFiles.schemaEvolutionMode = addNewColumns so new fields in the source automatically appear in the bronze table.

Anti-patterns

Transforming data at bronze. Bronze should be a raw archive. Push all transformations to silver.
Overwriting bronze tables. Always append. You lose audit history if you overwrite.
Ignoring rescue columns. Malformed records silently disappear without them.
No metadata columns. Without lineage columns, debugging production issues becomes guesswork.

Silver Layer -- Patterns

What belongs here

Deduplicated records with primary key enforcement.
Type-cast, validated, null-handled columns.
Conformance (naming conventions, date formats, currency normalisation).

Patterns

Deduplication -- Use ROW_NUMBER windowed over primary keys, ordered by ingestion timestamp descending, to keep only the latest version.
Merge (Upsert) -- Write with Delta MERGE to handle both inserts and updates in a single atomic operation.
SCD Type 2 -- For slowly changing dimensions, expire old rows and insert new versions with effective date ranges.
Quarantine -- Route rows that fail validation rules to a separate table for investigation, rather than silently dropping them.
Column Mapping -- Declaratively map source columns to target columns with type casting and default values.

Anti-patterns

Skipping deduplication. Bronze data frequently contains duplicates from replays, late-arriving data, or schema evolution. Always deduplicate.
No validation. Write garbage to silver and it propagates to gold. Use quality gates.
Monolithic transforms. Break complex logic into composable steps (map -> dedup -> validate -> write).
Full table scans every run. Use incremental reads (streaming or watermark-based) to avoid re-processing.

Gold Layer -- Patterns

What belongs here

Pre-aggregated KPIs for dashboards and reports.
Star-schema dimensional models (fact + dimension tables).
Materialized views for BI tool consumption.

Patterns

Time-grain aggregations -- Truncate timestamps to hour/day/week/month and group by the truncated column plus business dimensions.
Fact tables with conformed dimensions -- Join silver event tables with SCD Type 2 dimensions, resolving surrogate keys.
Computed measures -- Derive new metrics from base columns (e.g. revenue = quantity * unit_price * (1 - discount)).
Multi-grain aggregations -- Build summary tables at different granularities (daily, weekly, monthly) for different reporting needs.

Anti-patterns

Raw data in gold. Gold should always contain aggregated or joined data. If you're just copying silver, you don't need a gold table.
Too many small gold tables. Consolidate related metrics into fewer, wider tables.
No surrogate keys. Natural keys change. Use surrogate keys in fact tables for stable joins.
Aggregating directly from bronze. Skip the bronze-to-gold shortcut. Quality and deduplication happen at silver for a reason.

Quality Gates

Place quality gates between layers:

Bronze --> [Quality Gate] --> Silver --> [Quality Gate] --> Gold

Recommended checks by layer

Transition	Checks
Bronze -> Silver	Not-null on key columns, row count, data freshness
Silver -> Gold	Primary key uniqueness, referential integrity, value ranges

Severity levels

BLOCK -- Halt the pipeline and alert. Use for critical data integrity issues.
WARN -- Log a warning but continue processing. Use for expected anomalies.
LOG -- Record silently. Use for informational monitoring.

Delta Optimization by Layer

Setting	Bronze	Silver	Gold
Target file size	256 MB	128 MB	64 MB
VACUUM retention	7 days	7 days	3 days
Auto-compact	Enabled	Enabled	Enabled
Optimized writes	Enabled	Enabled	Enabled
ZORDER columns	`_ingestion_date`	Business keys	Query filter cols
Liquid clustering	Optional	Recommended	Recommended

When to use ZORDER vs Liquid Clustering

ZORDER: Legacy approach. Requires explicit OPTIMIZE ... ZORDER BY maintenance jobs. Good when you have 1-3 well-known filter columns.
Liquid Clustering: (DBR 13.3+) Self-maintaining. No explicit OPTIMIZE needed for layout. Replaces both partitioning and ZORDER. Preferred for new tables.

Naming Conventions

Object	Convention	Example
Bronze database	`raw`	`main.raw`
Silver database	`cleansed` or `conformed`	`main.cleansed`
Gold database	`analytics` or `curated`	`main.analytics`
Bronze table	Source name (lowercase)	`main.raw.orders`
Silver table	Source entity name	`main.cleansed.orders`
Gold dimension	`dim_` prefix	`main.analytics.dim_customer`
Gold fact table	`fact_` prefix	`main.analytics.fact_order_lines`
Gold aggregate	Descriptive name with grain	`main.analytics.daily_revenue`
Checkpoint path	`/checkpoints/{source}/layer`	`/checkpoints/orders/bronze`

Pipeline Orchestration

Recommended flow

1. Bronze ingestion    (parallel per source)
2. Quality gate        (bronze -> silver)
3. Silver transform    (parallel per source)
4. Quality gate        (silver -> gold)
5. Gold build          (depends on silver tables)
6. Delta optimisation  (post-pipeline maintenance)

Orchestration options

Databricks Workflows -- Native job orchestration with dependency graphs. Best for Databricks-native teams.
Apache Airflow -- Full DAG orchestration. Use the Databricks operator for triggering notebook jobs.
Azure Data Factory / AWS Step Functions -- Cloud-native orchestrators for multi-service pipelines.

Incremental Processing Strategies

Strategy	When to use	Mechanism
Structured Streaming	Continuous or near-real-time file ingestion	Auto Loader + checkpoint
Available Now	Scheduled incremental batch from files	`trigger(availableNow=True)`
JDBC Watermark	Relational database CDC	High-watermark control table
Change Data Feed	Delta-to-Delta incremental reads	`readChangeData` with version
Merge with timestamp	Silver/Gold tables without streaming	Filter on `_ingestion_timestamp`

Common Pitfalls

Mixing concerns across layers. Each layer has a single responsibility. Don't validate at bronze or aggregate at silver.
No idempotency. Every pipeline step should produce the same result if re-run. Use MERGE, checkpoints, or watermarks.
Schema drift without tracking. Auto Loader's addNewColumns mode handles this. For JDBC, compare schemas between runs and alert on changes.
Ignoring small files. Streaming workloads create many small files. Configure auto-compaction and run OPTIMIZE regularly.
No testing. Use the included testing framework to validate each layer before promoting to production.

Built by Datanest Digital

This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Medallion Architecture Accelerator] with all files, templates, and documentation for $79.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

DEV Community

Medallion Architecture Accelerator: Medallion Architecture Guide

Medallion Architecture Guide

Overview

Bronze Layer -- Patterns

What belongs here

Recommended metadata columns

Patterns

Anti-patterns

Silver Layer -- Patterns

What belongs here

Patterns

Anti-patterns

Gold Layer -- Patterns

What belongs here

Patterns

Anti-patterns

Quality Gates

Recommended checks by layer

Severity levels

Delta Optimization by Layer

When to use ZORDER vs Liquid Clustering

Naming Conventions

Pipeline Orchestration

Recommended flow

Orchestration options

Incremental Processing Strategies

Common Pitfalls

Related Articles

Top comments (0)