Medallion Architecture Guide
Medallion Architecture Accelerator by Datanest Digital
Overview
The medallion architecture organises a data lakehouse into three logical layers, each providing progressively higher data quality and business value:
| Layer | Purpose | Data State | Write Pattern |
|---|---|---|---|
| Bronze | Raw ingestion, source-of-truth archive | As-is from source | Append / Streaming |
| Silver | Cleansed, conformed, deduplicated | Type-safe, validated | Merge (Upsert) |
| Gold | Business aggregations, KPIs, models | Aggregated / Dimensional | Overwrite / Merge |
Bronze Layer -- Patterns
What belongs here
- Exact copies of source data with ingestion metadata.
- Rescue columns for malformed records.
- Schema evolution history (Auto Loader).
Recommended metadata columns
| Column | Description |
|---|---|
_source_system |
Upstream system identifier |
_source_name |
Logical source name |
_source_file |
Input file path or URI |
_ingestion_timestamp |
When the row was ingested |
_ingestion_date |
Date partition key |
_row_hash |
SHA-256 hash for change detection |
_batch_id |
Batch run identifier |
_rescue_data |
Malformed record rescue column |
Patterns
Streaming Append -- Use Auto Loader with
availableNowtrigger for incremental file ingestion. Provides exactly-once semantics via checkpointing.JDBC Watermark -- Track the
MAX(watermark_column)in a control table. On each run, extract only rows above the watermark.Schema Evolution -- Enable
cloudFiles.schemaEvolutionMode = addNewColumnsso new fields in the source automatically appear in the bronze table.
Anti-patterns
- Transforming data at bronze. Bronze should be a raw archive. Push all transformations to silver.
- Overwriting bronze tables. Always append. You lose audit history if you overwrite.
- Ignoring rescue columns. Malformed records silently disappear without them.
- No metadata columns. Without lineage columns, debugging production issues becomes guesswork.
Silver Layer -- Patterns
What belongs here
- Deduplicated records with primary key enforcement.
- Type-cast, validated, null-handled columns.
- Conformance (naming conventions, date formats, currency normalisation).
Patterns
Deduplication -- Use ROW_NUMBER windowed over primary keys, ordered by ingestion timestamp descending, to keep only the latest version.
Merge (Upsert) -- Write with Delta MERGE to handle both inserts and updates in a single atomic operation.
SCD Type 2 -- For slowly changing dimensions, expire old rows and insert new versions with effective date ranges.
Quarantine -- Route rows that fail validation rules to a separate table for investigation, rather than silently dropping them.
Column Mapping -- Declaratively map source columns to target columns with type casting and default values.
Anti-patterns
- Skipping deduplication. Bronze data frequently contains duplicates from replays, late-arriving data, or schema evolution. Always deduplicate.
- No validation. Write garbage to silver and it propagates to gold. Use quality gates.
- Monolithic transforms. Break complex logic into composable steps (map -> dedup -> validate -> write).
- Full table scans every run. Use incremental reads (streaming or watermark-based) to avoid re-processing.
Gold Layer -- Patterns
What belongs here
- Pre-aggregated KPIs for dashboards and reports.
- Star-schema dimensional models (fact + dimension tables).
- Materialized views for BI tool consumption.
Patterns
Time-grain aggregations -- Truncate timestamps to hour/day/week/month and group by the truncated column plus business dimensions.
Fact tables with conformed dimensions -- Join silver event tables with SCD Type 2 dimensions, resolving surrogate keys.
Computed measures -- Derive new metrics from base columns (e.g.
revenue = quantity * unit_price * (1 - discount)).Multi-grain aggregations -- Build summary tables at different granularities (daily, weekly, monthly) for different reporting needs.
Anti-patterns
- Raw data in gold. Gold should always contain aggregated or joined data. If you're just copying silver, you don't need a gold table.
- Too many small gold tables. Consolidate related metrics into fewer, wider tables.
- No surrogate keys. Natural keys change. Use surrogate keys in fact tables for stable joins.
- Aggregating directly from bronze. Skip the bronze-to-gold shortcut. Quality and deduplication happen at silver for a reason.
Quality Gates
Place quality gates between layers:
Bronze --> [Quality Gate] --> Silver --> [Quality Gate] --> Gold
Recommended checks by layer
| Transition | Checks |
|---|---|
| Bronze -> Silver | Not-null on key columns, row count, data freshness |
| Silver -> Gold | Primary key uniqueness, referential integrity, value ranges |
Severity levels
- BLOCK -- Halt the pipeline and alert. Use for critical data integrity issues.
- WARN -- Log a warning but continue processing. Use for expected anomalies.
- LOG -- Record silently. Use for informational monitoring.
Delta Optimization by Layer
| Setting | Bronze | Silver | Gold |
|---|---|---|---|
| Target file size | 256 MB | 128 MB | 64 MB |
| VACUUM retention | 7 days | 7 days | 3 days |
| Auto-compact | Enabled | Enabled | Enabled |
| Optimized writes | Enabled | Enabled | Enabled |
| ZORDER columns | _ingestion_date |
Business keys | Query filter cols |
| Liquid clustering | Optional | Recommended | Recommended |
When to use ZORDER vs Liquid Clustering
-
ZORDER: Legacy approach. Requires explicit
OPTIMIZE ... ZORDER BYmaintenance jobs. Good when you have 1-3 well-known filter columns. - Liquid Clustering: (DBR 13.3+) Self-maintaining. No explicit OPTIMIZE needed for layout. Replaces both partitioning and ZORDER. Preferred for new tables.
Naming Conventions
| Object | Convention | Example |
|---|---|---|
| Bronze database | raw |
main.raw |
| Silver database |
cleansed or conformed
|
main.cleansed |
| Gold database |
analytics or curated
|
main.analytics |
| Bronze table | Source name (lowercase) | main.raw.orders |
| Silver table | Source entity name | main.cleansed.orders |
| Gold dimension |
dim_ prefix |
main.analytics.dim_customer |
| Gold fact table |
fact_ prefix |
main.analytics.fact_order_lines |
| Gold aggregate | Descriptive name with grain | main.analytics.daily_revenue |
| Checkpoint path | /checkpoints/{source}/layer |
/checkpoints/orders/bronze |
Pipeline Orchestration
Recommended flow
1. Bronze ingestion (parallel per source)
2. Quality gate (bronze -> silver)
3. Silver transform (parallel per source)
4. Quality gate (silver -> gold)
5. Gold build (depends on silver tables)
6. Delta optimisation (post-pipeline maintenance)
Orchestration options
- Databricks Workflows -- Native job orchestration with dependency graphs. Best for Databricks-native teams.
- Apache Airflow -- Full DAG orchestration. Use the Databricks operator for triggering notebook jobs.
- Azure Data Factory / AWS Step Functions -- Cloud-native orchestrators for multi-service pipelines.
Incremental Processing Strategies
| Strategy | When to use | Mechanism |
|---|---|---|
| Structured Streaming | Continuous or near-real-time file ingestion | Auto Loader + checkpoint |
| Available Now | Scheduled incremental batch from files | trigger(availableNow=True) |
| JDBC Watermark | Relational database CDC | High-watermark control table |
| Change Data Feed | Delta-to-Delta incremental reads |
readChangeData with version |
| Merge with timestamp | Silver/Gold tables without streaming | Filter on _ingestion_timestamp
|
Common Pitfalls
Mixing concerns across layers. Each layer has a single responsibility. Don't validate at bronze or aggregate at silver.
No idempotency. Every pipeline step should produce the same result if re-run. Use MERGE, checkpoints, or watermarks.
Schema drift without tracking. Auto Loader's
addNewColumnsmode handles this. For JDBC, compare schemas between runs and alert on changes.Ignoring small files. Streaming workloads create many small files. Configure auto-compaction and run OPTIMIZE regularly.
No testing. Use the included testing framework to validate each layer before promoting to production.
Built by Datanest Digital
This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Medallion Architecture Accelerator] with all files, templates, and documentation for $79.
Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.
Top comments (0)