DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Medallion Architecture Accelerator: Medallion Architecture Guide


Medallion Architecture Accelerator by Datanest Digital


Overview

The medallion architecture organises a data lakehouse into three logical layers, each providing progressively higher data quality and business value:

| Layer  | Purpose                                | Data State               | Write Pattern      |
|--------|----------------------------------------|--------------------------|--------------------|
| Bronze | Raw ingestion, source-of-truth archive | As-is from source        | Append / Streaming |
| Silver | Cleansed, conformed, deduplicated      | Type-safe, validated     | Merge (Upsert)     |
| Gold   | Business aggregations, KPIs, models    | Aggregated / Dimensional | Overwrite / Merge  |

Bronze Layer -- Patterns

What belongs here

  • Exact copies of source data with ingestion metadata.
  • Rescue columns for malformed records.
  • Schema evolution history (Auto Loader).

Recommended metadata columns

| Column                 | Description                       |
|------------------------|-----------------------------------|
| `_source_system`       | Upstream system identifier        |
| `_source_name`         | Logical source name               |
| `_source_file`         | Input file path or URI            |
| `_ingestion_timestamp` | When the row was ingested         |
| `_ingestion_date`      | Date partition key                |
| `_row_hash`            | SHA-256 hash for change detection |
| `_batch_id`            | Batch run identifier              |
| `_rescue_data`         | Malformed record rescue column    |
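The `_row_hash` column can be computed by concatenating the row's values in a fixed column order and hashing the result. A minimal pure-Python sketch (the column names are illustrative; in PySpark you would typically use the built-in `sha2` and `concat_ws` functions instead):

```python
import hashlib

def row_hash(row: dict, columns: list[str]) -> str:
    """SHA-256 over the row's values in a fixed column order.

    A fixed order and an explicit null marker keep the hash stable
    across runs, so it can be compared for change detection.
    """
    parts = ["\\N" if row.get(c) is None else str(row.get(c)) for c in columns]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

cols = ["order_id", "status", "amount"]  # illustrative business columns
a = row_hash({"order_id": 1, "status": "NEW", "amount": 9.99}, cols)
b = row_hash({"order_id": 1, "status": "PAID", "amount": 9.99}, cols)
assert a != b  # any changed value produces a different hash
```

Hashing only the business columns (never the `_ingestion_*` metadata) is what makes the hash usable for detecting real upstream changes.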

Patterns

  1. Streaming Append -- Use Auto Loader with availableNow trigger for incremental file ingestion. Provides exactly-once semantics via checkpointing.

  2. JDBC Watermark -- Track the MAX(watermark_column) in a control table. On each run, extract only rows above the watermark.

  3. Schema Evolution -- Enable cloudFiles.schemaEvolutionMode = addNewColumns so new fields in the source automatically appear in the bronze table.
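The watermark pattern (2) reduces to a small amount of bookkeeping. A minimal pure-Python sketch, with the JDBC result set and the control table mocked as in-memory structures (all names illustrative):

```python
def extract_incremental(rows, control, source, watermark_col):
    """High-watermark extraction: pull only rows above the stored
    watermark, then advance the watermark to the max value seen.

    `rows` stands in for the JDBC result set; `control` stands in for
    the control table, keyed by source name.
    """
    last = control.get(source)  # None on the first run -> full extract
    new_rows = [r for r in rows if last is None or r[watermark_col] > last]
    if new_rows:
        control[source] = max(r[watermark_col] for r in new_rows)
    return new_rows

control = {}
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
first = extract_incremental(rows, control, "orders", "updated_at")   # full extract
rows.append({"id": 3, "updated_at": 30})
second = extract_incremental(rows, control, "orders", "updated_at")  # only the new row
```

In production the `control` update must be committed only after the bronze write succeeds, otherwise a failed run silently skips rows on the next attempt.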

Anti-patterns

  • Transforming data at bronze. Bronze should be a raw archive. Push all transformations to silver.
  • Overwriting bronze tables. Always append. You lose audit history if you overwrite.
  • Ignoring rescue columns. Malformed records silently disappear without them.
  • No metadata columns. Without lineage columns, debugging production issues becomes guesswork.

Silver Layer -- Patterns

What belongs here

  • Deduplicated records with primary key enforcement.
  • Type-cast, validated, null-handled columns.
  • Conformance (naming conventions, date formats, currency normalisation).

Patterns

  1. Deduplication -- Use ROW_NUMBER windowed over primary keys, ordered by ingestion timestamp descending, to keep only the latest version.

  2. Merge (Upsert) -- Write with Delta MERGE to handle both inserts and updates in a single atomic operation.

  3. SCD Type 2 -- For slowly changing dimensions, expire old rows and insert new versions with effective date ranges.

  4. Quarantine -- Route rows that fail validation rules to a separate table for investigation, rather than silently dropping them.

  5. Column Mapping -- Declaratively map source columns to target columns with type casting and default values.
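Pattern 1 is the SQL idiom `ROW_NUMBER() OVER (PARTITION BY <keys> ORDER BY _ingestion_timestamp DESC) = 1`. A pure-Python equivalent, useful for seeing exactly what the window function keeps (column names illustrative):

```python
def deduplicate(rows, key_cols, order_col="_ingestion_timestamp"):
    """Keep only the latest version per primary key -- the pure-Python
    equivalent of ROW_NUMBER() partitioned by the keys, ordered by
    ingestion timestamp descending, filtered to row number 1.
    """
    latest = {}
    for r in rows:
        k = tuple(r[c] for c in key_cols)
        if k not in latest or r[order_col] > latest[k][order_col]:
            latest[k] = r
    return list(latest.values())

rows = [
    {"order_id": 1, "status": "NEW",  "_ingestion_timestamp": 100},
    {"order_id": 1, "status": "PAID", "_ingestion_timestamp": 200},
    {"order_id": 2, "status": "NEW",  "_ingestion_timestamp": 150},
]
deduped = deduplicate(rows, ["order_id"])  # order 1 keeps its PAID version
```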

Anti-patterns

  • Skipping deduplication. Bronze data frequently contains duplicates from replays, late-arriving data, or schema evolution. Always deduplicate.
  • No validation. Write garbage to silver and it propagates to gold. Use quality gates.
  • Monolithic transforms. Break complex logic into composable steps (map -> dedup -> validate -> write).
  • Full table scans every run. Use incremental reads (streaming or watermark-based) to avoid re-processing.
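The "composable steps" anti-pattern fix above (map -> dedup -> validate -> write) can be sketched as small single-purpose functions chained together. Everything here is illustrative, including the validation rule:

```python
def map_columns(rows):
    # Cast and rename source columns to the target schema (illustrative).
    return [{"order_id": int(r["id"]), "amount": float(r["amt"]),
             "_ingestion_timestamp": r["_ingestion_timestamp"]} for r in rows]

def dedup(rows):
    # Keep only the latest version per order_id.
    latest = {}
    for r in rows:
        k = r["order_id"]
        if k not in latest or r["_ingestion_timestamp"] > latest[k]["_ingestion_timestamp"]:
            latest[k] = r
    return list(latest.values())

def validate(rows):
    # Split into (good, quarantined) instead of silently dropping bad rows.
    good = [r for r in rows if r["amount"] >= 0]
    bad = [r for r in rows if r["amount"] < 0]
    return good, bad

raw = [{"id": "1", "amt": "9.99",  "_ingestion_timestamp": 100},
       {"id": "1", "amt": "12.50", "_ingestion_timestamp": 200},
       {"id": "2", "amt": "-1.00", "_ingestion_timestamp": 150}]
good, quarantined = validate(dedup(map_columns(raw)))
```

Each step can be unit-tested in isolation, which a single monolithic transform cannot.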

Gold Layer -- Patterns

What belongs here

  • Pre-aggregated KPIs for dashboards and reports.
  • Star-schema dimensional models (fact + dimension tables).
  • Materialized views for BI tool consumption.

Patterns

  1. Time-grain aggregations -- Truncate timestamps to hour/day/week/month and group by the truncated column plus business dimensions.

  2. Fact tables with conformed dimensions -- Join silver event tables with SCD Type 2 dimensions, resolving surrogate keys.

  3. Computed measures -- Derive new metrics from base columns (e.g. revenue = quantity * unit_price * (1 - discount)).

  4. Multi-grain aggregations -- Build summary tables at different granularities (daily, weekly, monthly) for different reporting needs.
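Patterns 1 and 3 combine naturally: truncate the timestamp to the grain, then sum the computed measure. A day-grain sketch in pure Python, using the guide's own revenue formula (field names illustrative):

```python
from collections import defaultdict

def daily_revenue(order_lines):
    """Day-grain aggregation with a computed measure:
    revenue = quantity * unit_price * (1 - discount)."""
    totals = defaultdict(float)
    for line in order_lines:
        day = line["event_ts"][:10]  # truncate ISO timestamp to the day grain
        totals[day] += line["quantity"] * line["unit_price"] * (1 - line["discount"])
    return dict(totals)

lines = [
    {"event_ts": "2024-05-01T10:00:00", "quantity": 2, "unit_price": 10.0, "discount": 0.0},
    {"event_ts": "2024-05-01T18:30:00", "quantity": 1, "unit_price": 10.0, "discount": 0.5},
    {"event_ts": "2024-05-02T09:00:00", "quantity": 3, "unit_price": 5.0,  "discount": 0.0},
]
by_day = daily_revenue(lines)
```

In Spark SQL the same shape is a `date_trunc` plus `GROUP BY`; the point is that the measure is computed once, in gold, rather than re-derived in every dashboard.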

Anti-patterns

  • Raw data in gold. Gold should always contain aggregated or joined data. If you're just copying silver, you don't need a gold table.
  • Too many small gold tables. Consolidate related metrics into fewer, wider tables.
  • No surrogate keys. Natural keys change. Use surrogate keys in fact tables for stable joins.
  • Aggregating directly from bronze. Skip the bronze-to-gold shortcut. Quality and deduplication happen at silver for a reason.

Quality Gates

Place quality gates between layers:

Bronze --> [Quality Gate] --> Silver --> [Quality Gate] --> Gold

Recommended checks by layer

| Transition       | Checks                                                       |
|------------------|--------------------------------------------------------------|
| Bronze -> Silver | Not-null on key columns, row count, data freshness           |
| Silver -> Gold   | Primary key uniqueness, referential integrity, value ranges  |

Severity levels

  • BLOCK -- Halt the pipeline and alert. Use for critical data integrity issues.
  • WARN -- Log a warning but continue processing. Use for expected anomalies.
  • LOG -- Record silently. Use for informational monitoring.
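The three severity levels map to three behaviours: raise, warn, record. A minimal sketch of a gate runner (check names, predicates, and the exception type are all illustrative):

```python
import logging

class QualityGateError(Exception):
    """Raised when a BLOCK-severity check fails, halting the pipeline."""

def run_quality_gate(checks, rows):
    """Run (name, predicate, severity) checks over a batch.

    BLOCK failures raise; WARN failures are logged but processing
    continues; LOG failures are recorded silently.
    """
    results = {}
    for name, predicate, severity in checks:
        passed = all(predicate(r) for r in rows)
        results[name] = passed
        if not passed:
            if severity == "BLOCK":
                raise QualityGateError(f"check failed: {name}")
            elif severity == "WARN":
                logging.warning("quality check failed: %s", name)
            else:
                logging.info("quality check failed: %s", name)
    return results

rows = [{"order_id": 1, "amount": 9.99}, {"order_id": None, "amount": 5.0}]
checks = [("order_id_not_null", lambda r: r["order_id"] is not None, "WARN")]
results = run_quality_gate(checks, rows)  # logs a warning, keeps going
```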

Delta Optimization by Layer

| Setting           | Bronze            | Silver        | Gold                 |
|-------------------|-------------------|---------------|----------------------|
| Target file size  | 256 MB            | 128 MB        | 64 MB                |
| VACUUM retention  | 7 days            | 7 days        | 3 days               |
| Auto-compact      | Enabled           | Enabled       | Enabled              |
| Optimized writes  | Enabled           | Enabled       | Enabled              |
| ZORDER columns    | `_ingestion_date` | Business keys | Query filter columns |
| Liquid clustering | Optional          | Recommended   | Recommended          |

When to use ZORDER vs Liquid Clustering

  • ZORDER: Legacy approach. Requires explicit OPTIMIZE ... ZORDER BY maintenance jobs. Good when you have 1-3 well-known filter columns.
  • Liquid Clustering (DBR 13.3+): Self-maintaining; no explicit OPTIMIZE is needed for layout. Replaces both partitioning and ZORDER. Preferred for new tables.
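Assuming an active Databricks `spark` session on DBR 13.3+, the two approaches look roughly like this (table and column names are illustrative):

```python
# ZORDER: a periodic maintenance job against well-known filter columns.
spark.sql("OPTIMIZE main.cleansed.orders ZORDER BY (customer_id, order_date)")

# Liquid clustering: declared once on the table; layout is then self-maintaining.
spark.sql("ALTER TABLE main.analytics.fact_order_lines CLUSTER BY (order_date)")
```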

Naming Conventions

| Object          | Convention                      | Example                           |
|-----------------|---------------------------------|-----------------------------------|
| Bronze database | `raw`                           | `main.raw`                        |
| Silver database | `cleansed` or `conformed`       | `main.cleansed`                   |
| Gold database   | `analytics` or `curated`        | `main.analytics`                  |
| Bronze table    | Source name (lowercase)         | `main.raw.orders`                 |
| Silver table    | Source entity name              | `main.cleansed.orders`            |
| Gold dimension  | `dim_` prefix                   | `main.analytics.dim_customer`     |
| Gold fact table | `fact_` prefix                  | `main.analytics.fact_order_lines` |
| Gold aggregate  | Descriptive name with grain     | `main.analytics.daily_revenue`    |
| Checkpoint path | `/checkpoints/{source}/{layer}` | `/checkpoints/orders/bronze`      |

Pipeline Orchestration

Recommended flow

1. Bronze ingestion    (parallel per source)
2. Quality gate        (bronze -> silver)
3. Silver transform    (parallel per source)
4. Quality gate        (silver -> gold)
5. Gold build          (depends on silver tables)
6. Delta optimisation  (post-pipeline maintenance)

Orchestration options

  • Databricks Workflows -- Native job orchestration with dependency graphs. Best for Databricks-native teams.
  • Apache Airflow -- Full DAG orchestration. Use the Databricks operator for triggering notebook jobs.
  • Azure Data Factory / AWS Step Functions -- Cloud-native orchestrators for multi-service pipelines.

Incremental Processing Strategies

| Strategy             | When to use                                 | Mechanism                                      |
|----------------------|---------------------------------------------|------------------------------------------------|
| Structured Streaming | Continuous or near-real-time file ingestion | Auto Loader + checkpoint                       |
| Available Now        | Scheduled incremental batch from files      | `trigger(availableNow=True)`                   |
| JDBC Watermark       | Relational database CDC                     | High-watermark control table                   |
| Change Data Feed     | Delta-to-Delta incremental reads            | `readChangeFeed` option with starting version  |
| Merge with timestamp | Silver/Gold tables without streaming        | Filter on `_ingestion_timestamp`               |

Common Pitfalls

  1. Mixing concerns across layers. Each layer has a single responsibility. Don't validate at bronze or aggregate at silver.

  2. No idempotency. Every pipeline step should produce the same result if re-run. Use MERGE, checkpoints, or watermarks.

  3. Schema drift without tracking. Auto Loader's addNewColumns mode handles this. For JDBC, compare schemas between runs and alert on changes.

  4. Ignoring small files. Streaming workloads create many small files. Configure auto-compaction and run OPTIMIZE regularly.

  5. No testing. Use the included testing framework to validate each layer before promoting to production.
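Pitfall 2 (idempotency) is worth making concrete: a keyed MERGE-style write produces the same target state however many times the same batch is replayed. A pure-Python sketch of that property (the in-memory `target` dict stands in for a Delta table):

```python
def merge_upsert(target: dict, batch: list[dict], key: str) -> dict:
    """Keyed MERGE-style upsert: matching keys are updated, new keys
    inserted. Replaying the same batch yields the same target state,
    which is exactly what makes the step idempotent."""
    for row in batch:
        target[row[key]] = row
    return target

target = {}
batch = [{"order_id": 1, "status": "NEW"}, {"order_id": 2, "status": "NEW"}]
merge_upsert(target, batch, "order_id")
once = dict(target)
merge_upsert(target, batch, "order_id")  # replay the same batch
assert target == once                    # idempotent: replay changes nothing
```

A blind append, by contrast, would double the rows on replay, which is why the guide pairs MERGE with checkpoints and watermarks.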


Built by Datanest Digital


This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete Medallion Architecture Accelerator with all files, templates, and documentation for $79.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

