Medallion Architecture: Designing Bronze, Silver, and Gold Layers
Before you write a single transformation, you need a blueprint.
The Medallion Architecture is the design pattern used by data teams worldwide to organize data in a Lakehouse. It's the blueprint Databricks themselves recommend, and it's the foundation of everything we'll build in the final article of this series.
Understanding it deeply — not just the names of the layers, but why they exist — is what separates pipelines that scale from pipelines that become a maintenance nightmare six months later.
What is the Medallion Architecture?
The Medallion Architecture organizes data into three progressive layers, each named after a metal:
Raw Sources
↓
🥉 BRONZE — Raw data, exactly as received
↓
🥈 SILVER — Cleaned, validated, enriched data
↓
🥇 GOLD — Aggregated, business-ready data
↓
Consumers (BI tools, ML models, APIs)
Each layer has a clear purpose. Data flows in one direction — from raw to refined — and each step adds value.
The core idea is simple: never transform in place. Instead of overwriting raw data with cleaned data, you keep every layer. Raw data is always preserved. If something breaks downstream, you can reprocess from the source without re-ingesting from external systems.
The Bronze Layer: Raw Data As-Is
What it is
Bronze is your landing zone. Data arrives here exactly as it came from the source — no cleaning, no filtering, no transformation.
Think of Bronze as your archive. It's the single source of truth for what you received, when you received it, and from where.
What goes in
- Files dropped by upstream systems (CSV, JSON, XML, Avro, Parquet)
- API responses stored as-is
- Database snapshots
- Streaming event data (Kafka, Kinesis)
What it looks like
/mnt/bronze/
├── sales/
│   ├── year=2024/month=01/day=01/
│   │   └── sales_20240101_143022.csv
│   └── year=2024/month=01/day=02/
│       └── sales_20240102_143055.csv
├── customers/
│   └── customers_full_20240101.json
└── products/
    └── products_20240101.parquet
The rules of Bronze
1. Never delete raw data. Raw files are your safety net. Storage is cheap — losing source data is not.
2. Add ingestion metadata. When you land data in Bronze, enrich it with columns that help you debug later:
from pyspark.sql.functions import current_timestamp, lit, input_file_name

df_bronze = df_raw \
    .withColumn("_ingested_at", current_timestamp()) \
    .withColumn("_source_file", input_file_name()) \
    .withColumn("_source_system", lit("sales_db"))
3. Store as Delta. Even raw data should be Delta. You get transaction safety, history, and the ability to query it reliably; see the time-travel sketch after this list.
4. No business logic here. Bronze is not the place to rename columns, apply rules, or make assumptions about the data. That's Silver's job.
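Storing Bronze as Delta pays off quickly because the transaction log is queryable. A short sketch of what rule 3 buys you, reusing the sales path from the example above:
# Inspect the Bronze table's transaction history
spark.sql("DESCRIBE HISTORY delta.`/mnt/bronze/sales/`").show(truncate=False)

# Time travel: read the table exactly as it looked at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/bronze/sales/")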
Ingestion Patterns
Full load: Ingest the entire dataset every run. Simple, but expensive for large tables.
df = spark.read.csv("/mnt/landing/sales/", header=True)
df.write.format("delta").mode("overwrite").save("/mnt/bronze/sales/")
Incremental load: Ingest only new or changed records. More efficient, but it requires a watermark (a timestamp or ID that marks what was last loaded).
last_ingested = "2024-01-14"
df = spark.read.csv("/mnt/landing/sales/", header=True) \
    .filter(f"file_date > '{last_ingested}'")
df.write.format("delta").mode("append").save("/mnt/bronze/sales/")
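The snippet hardcodes last_ingested; in a real pipeline you would persist the watermark between runs. A minimal sketch, assuming a small bookkeeping Delta table at a hypothetical /mnt/meta/watermarks/sales/ path:
from pyspark.sql import Row
from pyspark.sql.functions import max as spark_max

# Compute the new watermark from the batch just ingested
new_watermark = df.agg(spark_max("file_date").alias("wm")).collect()[0]["wm"]

# Persist it so the next run knows where to resume (path and schema are illustrative)
spark.createDataFrame([Row(table="sales", last_ingested=str(new_watermark))]) \
    .write.format("delta").mode("overwrite").save("/mnt/meta/watermarks/sales/")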
The Silver Layer: Clean, Validated, Trusted Data
What it is
Silver is your cleaned and conformed layer. This is where raw data becomes trustworthy.
If Bronze answers "what did we receive?", Silver answers "what is actually true?"
What happens in Silver
| Operation | Example |
|---|---|
| Deduplication | Remove duplicate order records |
| Null handling | Fill missing regions with 'Unknown' |
| Type casting | Cast string dates to DateType |
| Standardization | Uppercase all product names, normalize phone formats |
| Validation | Reject rows where amount < 0 |
| Enrichment | Join with a reference table to add region names |
| Schema enforcement | Ensure all expected columns are present |
What it looks like
from pyspark.sql.functions import col, upper, round, when, to_date, year, month, current_timestamp

# Read from Bronze
bronze = spark.read.format("delta").load("/mnt/bronze/sales/")

# Transform to Silver
silver = bronze \
    .dropDuplicates(["order_id"]) \
    .dropna(subset=["order_id", "customer_id", "amount"]) \
    .filter(col("amount") > 0) \
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
    .withColumn("product", upper(col("product"))) \
    .withColumn("amount", round(col("amount"), 2)) \
    .withColumn("region",
        when(col("region").isNull(), "Unknown")
        .otherwise(col("region"))
    ) \
    .withColumn("tier",
        when(col("amount") >= 5000, "Premium")
        .when(col("amount") >= 1000, "Standard")
        .otherwise("Basic")
    ) \
    .withColumn("year", year(col("order_date"))) \
    .withColumn("month", month(col("order_date"))) \
    .withColumn("_processed_at", current_timestamp()) \
    .drop("_source_file", "_ingested_at")  # Drop Bronze ingestion metadata

# Write to Silver, partitioned by the year/month columns derived above
silver.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .save("/mnt/silver/sales/")
The rules of Silver
1. One Silver table per source entity. sales_silver, customers_silver, products_silver — clean versions of your source tables.
2. Still row-level granularity. Silver doesn't aggregate. Every row still represents one event, one order, one customer record.
3. Idempotent transformations. Running the Silver pipeline twice should produce the same result as running it once. No side effects. See the MERGE sketch after this list for one way to achieve this.
4. Document your cleaning rules. Silver is where business decisions are baked in (amount < 0 is invalid). These rules should be documented and reviewed by the business.
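On rule 3: a blind append breaks idempotence, because a re-run duplicates rows. A common fix is a Delta MERGE keyed on the entity's ID. A sketch, assuming the Silver table already exists at the path used above and that order_id uniquely identifies a row:
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/silver/sales/")

# Upsert: update rows that already exist, insert the rest
target.alias("t") \
    .merge(silver.alias("s"), "t.order_id = s.order_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
Re-running this with the same input leaves the table unchanged, which is exactly the property rule 3 asks for.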
The Gold Layer: Business-Ready Data
What it is
Gold is your serving layer. It contains aggregated, pre-joined, business-meaningful tables — built specifically for consumption by BI tools, dashboards, ML models, or APIs.
If Silver is clean data, Gold is useful data. The difference is intent.
What goes in Gold
Gold tables are shaped by consumer needs. Common patterns:
Aggregated metrics
from pyspark.sql.functions import sum, count, avg, countDistinct

monthly_revenue = silver_sales \
    .groupBy("region", "year", "month") \
    .agg(
        sum("amount").alias("total_revenue"),
        count("order_id").alias("total_orders"),
        avg("amount").alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    )
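To make this consumable, persist the result as its own Delta table; the target path below is an assumption that mirrors the folder layout shown later in this article:
# Write the aggregate out as a Gold Delta table
monthly_revenue.write.format("delta").mode("overwrite").save("/mnt/gold/monthly_revenue/")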
Wide denormalized tables (flat joins)
# Join sales + customers + products into one wide table
# BI tools love these: no runtime joins needed
wide_sales = silver_sales \
    .join(silver_customers, on="customer_id", how="left") \
    .join(silver_products, on="product_id", how="left") \
    .select(
        "order_id", "order_date", "amount",
        "customer_name", "customer_email", "customer_country",
        "product_name", "product_category"
    )
KPI summary tables
%sql
CREATE OR REPLACE TABLE gold.kpi_daily AS
SELECT
order_date,
region,
SUM(amount) AS daily_revenue,
COUNT(order_id) AS daily_orders,
COUNT(DISTINCT customer_id) AS active_customers,
SUM(amount) / COUNT(order_id) AS avg_order_value
FROM silver.sales
GROUP BY order_date, region;
The rules of Gold
1. Named for the consumer, not the source. gold.monthly_revenue_by_region, gold.customer_360, gold.kpi_daily — the name should tell a business person exactly what's inside.
2. Optimized for reads. Use OPTIMIZE and ZORDER on the columns most commonly filtered:
%sql
OPTIMIZE gold.kpi_daily ZORDER BY (order_date, region);
3. Multiple Gold tables are normal. Finance, Marketing, and Operations may each have their own Gold tables from the same Silver source. That's fine — it's the point. One clean Silver layer, many business-specific Gold views.
Folder Structure and Naming Conventions
Consistency here pays dividends. A structure that works for teams of all sizes:
/mnt/
├── bronze/
│   ├── sales/
│   ├── customers/
│   └── products/
│
├── silver/
│   ├── sales/
│   ├── customers/
│   └── products/
│
└── gold/
    ├── monthly_revenue/
    ├── customer_360/
    └── kpi_daily/
Naming rules worth following:
| Convention | Example |
|---|---|
| Lowercase everywhere | sales_silver not Sales_Silver |
| Layer suffix on table names | sales_bronze, sales_silver, sales_gold |
| Business names on Gold | monthly_revenue, customer_360, kpi_daily |
| Partition folders as key=value | year=2024/month=01/ |
| Timestamps in ingestion file names | sales_20240115_143022.csv |
How Data Flows Between Layers
Let's visualize the full picture:
External Source (DB, API, SaaS)
    ↓  [Ingestion: Fivetran, Airbyte, custom script]
🥉 BRONZE
    Raw Delta tables
    Append-only or full refresh
    Preserves source data exactly
    ↓  [Transformation: PySpark / SQL notebook]
🥈 SILVER
    Clean Delta tables
    Row-level, validated, enriched
    One table per entity
    ↓  [Aggregation: PySpark / SQL notebook]
🥇 GOLD
    Business Delta tables
    Aggregated, wide, KPI-ready
    One table per use case
    ↓  [Serving: SQL endpoint, BI connector]
Consumers
    Tableau / Power BI / Looker
    ML models / APIs / reports
Each arrow is a pipeline step — a notebook or a Workflow job. In the next article, we'll wire all of this together into a running end-to-end system.
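As a rough preview of that wiring, a driver notebook can chain the layer notebooks with dbutils.notebook.run. The notebook paths here are assumptions; in production, a Databricks Workflows job with task dependencies is the more robust equivalent:
# Run each layer in sequence; a failure in any step raises and stops the chain
# (the second argument is the timeout in seconds; paths are hypothetical)
dbutils.notebook.run("/pipelines/01_bronze_ingest", 3600)
dbutils.notebook.run("/pipelines/02_silver_transform", 3600)
dbutils.notebook.run("/pipelines/03_gold_aggregate", 3600)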
Common Mistakes to Avoid
❌ Skipping Bronze and writing straight to Silver
If your source changes or a bug is introduced, you have no raw data to reprocess from. Always land data in Bronze first.
❌ Putting business logic in Bronze
Bronze is a mirror of the source. The moment you rename a column or apply a rule in Bronze, you've hidden what the source actually sent you.
❌ Aggregating in Silver
Silver should always be row-level. Aggregations belong in Gold. Mixing them makes Silver harder to reuse for multiple consumers.
❌ Building Gold tables no one uses
Gold tables have a cost — storage, compute, maintenance. Build them for real consumers. Don't pre-aggregate everything "just in case."
❌ Not partitioning Silver and Gold tables
Large tables without partitioning mean full scans on every query. Partition by date at minimum.
Wrapping Up
Here's what to take away from this article:
- The Medallion Architecture organizes data into Bronze (raw), Silver (clean), and Gold (aggregated) layers
- Bronze is your archive — raw, immutable, with ingestion metadata
- Silver is your trusted layer — deduplicated, validated, enriched, row-level
- Gold is your serving layer — aggregated, consumer-shaped, BI-ready
- Data flows one direction only: Bronze → Silver → Gold
- Consistent naming, folder structure, and partitioning are not optional — they're the foundation of maintainable pipelines
In the final article, we put all ten chapters together and build a complete end-to-end data warehouse in Databricks from scratch.