Vinicius Fagundes
Medallion Architecture: Designing Bronze, Silver, and Gold Layers


Before you write a single transformation, you need a blueprint.

The Medallion Architecture is the design pattern used by data teams worldwide to organize data in a Lakehouse. It's the blueprint Databricks themselves recommend, and it's the foundation of everything we'll build in the final article of this series.

Understanding it deeply — not just the names of the layers, but why they exist — is what separates pipelines that scale from pipelines that become a maintenance nightmare six months later.


What is the Medallion Architecture?

The Medallion Architecture organizes data into three progressive layers, each named after a metal:

Raw Sources
    ↓
 🥉 BRONZE  — Raw data, exactly as received
    ↓
 🥈 SILVER  — Cleaned, validated, enriched data
    ↓
 🥇 GOLD    — Aggregated, business-ready data
    ↓
Consumers (BI tools, ML models, APIs)

Each layer has a clear purpose. Data flows in one direction — from raw to refined — and each step adds value.

The core idea is simple: never transform in place. Instead of overwriting raw data with cleaned data, you keep every layer. Raw data is always preserved. If something breaks downstream, you can reprocess from the source without re-ingesting from external systems.


The Bronze Layer: Raw Data As-Is

What it is

Bronze is your landing zone. Data arrives here exactly as it came from the source — no cleaning, no filtering, no transformation.

Think of Bronze as your archive. It's the single source of truth for what you received, when you received it, and from where.

What goes in

  • Files dropped by upstream systems (CSV, JSON, XML, Avro, Parquet)
  • API responses stored as-is
  • Database snapshots
  • Streaming event data (Kafka, Kinesis)

What it looks like

/mnt/bronze/
  ├── sales/
  │   ├── year=2024/month=01/day=01/
  │   │   └── sales_20240101_143022.csv
  │   └── year=2024/month=01/day=02/
  │       └── sales_20240102_143055.csv
  ├── customers/
  │   └── customers_full_20240101.json
  └── products/
      └── products_20240101.parquet

The rules of Bronze

1. Never delete raw data. Raw files are your safety net. Storage is cheap — losing source data is not.

2. Add ingestion metadata. When you land data in Bronze, enrich it with columns that help you debug later:

from pyspark.sql.functions import current_timestamp, lit, input_file_name

df_bronze = df_raw \
    .withColumn("_ingested_at", current_timestamp()) \
    .withColumn("_source_file", input_file_name()) \
    .withColumn("_source_system", lit("sales_db"))

3. Store as Delta. Even raw data should be Delta. You get transaction safety, history, and the ability to query it reliably.

4. No business logic here. Bronze is not the place to rename columns, apply rules, or make assumptions about the data. That's Silver's job.

Ingestion Patterns

Full load: Ingest the entire dataset every run. Simple, but expensive for large tables.

df = spark.read.csv("/mnt/landing/sales/", header=True)
df.write.format("delta").mode("overwrite").save("/mnt/bronze/sales/")

Incremental load: Ingest only new or changed records. More efficient, requires a watermark (timestamp or ID).

last_ingested = "2024-01-14"

df = spark.read.csv("/mnt/landing/sales/", header=True) \
    .filter(f"file_date > '{last_ingested}'")

df.write.format("delta").mode("append").save("/mnt/bronze/sales/")

The Silver Layer: Clean, Validated, Trusted Data

What it is

Silver is your cleaned and conformed layer. This is where raw data becomes trustworthy.

If Bronze answers "what did we receive?", Silver answers "what is actually true?"

What happens in Silver

  • Deduplication: remove duplicate order records
  • Null handling: fill missing regions with 'Unknown'
  • Type casting: cast string dates to DateType
  • Standardization: uppercase all product names, normalize phone formats
  • Validation: reject rows where amount < 0
  • Enrichment: join with a reference table to add region names
  • Schema enforcement: ensure all expected columns are present

What it looks like

from pyspark.sql.functions import col, upper, round, when, to_date, year, month, current_timestamp

# Read from Bronze
bronze = spark.read.format("delta").load("/mnt/bronze/sales/")

# Transform to Silver
silver = bronze \
    .dropDuplicates(["order_id"]) \
    .dropna(subset=["order_id", "customer_id", "amount"]) \
    .filter(col("amount") > 0) \
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
    .withColumn("product", upper(col("product"))) \
    .withColumn("amount", round(col("amount"), 2)) \
    .withColumn("region",
        when(col("region").isNull(), "Unknown")
        .otherwise(col("region"))
    ) \
    .withColumn("tier",
        when(col("amount") >= 5000, "Premium")
        .when(col("amount") >= 1000, "Standard")
        .otherwise("Basic")
    ) \
    .withColumn("year", year(col("order_date"))) \
    .withColumn("month", month(col("order_date"))) \
    .withColumn("_processed_at", current_timestamp()) \
    .drop("_source_file", "_ingested_at")  # Drop Bronze metadata

# Write to Silver, partitioned by the derived year/month columns
silver.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .save("/mnt/silver/sales/")

The rules of Silver

1. One Silver table per source entity. sales_silver, customers_silver, products_silver — clean versions of your source tables.

2. Still row-level granularity. Silver doesn't aggregate. Every row still represents one event, one order, one customer record.

3. Idempotent transformations. Running the Silver pipeline twice should produce the same result as running it once. No side effects.

4. Document your cleaning rules. Silver is where business decisions are baked in (amount < 0 is invalid). These rules should be documented and reviewed by the business.


The Gold Layer: Business-Ready Data

What it is

Gold is your serving layer. It contains aggregated, pre-joined, business-meaningful tables — built specifically for consumption by BI tools, dashboards, ML models, or APIs.

If Silver is clean data, Gold is useful data. The difference is intent.

What goes in Gold

Gold tables are shaped by consumer needs. Common patterns:

Aggregated metrics

from pyspark.sql.functions import sum, count, avg, countDistinct

monthly_revenue = silver_sales \
    .groupBy("region", "year", "month") \
    .agg(
        sum("amount").alias("total_revenue"),
        count("order_id").alias("total_orders"),
        avg("amount").alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    )

Wide denormalized tables (flat joins)

# Join sales + customers + products into one wide table
# BI tools love these — no runtime joins needed
wide_sales = silver_sales \
    .join(silver_customers, on="customer_id", how="left") \
    .join(silver_products, on="product_id", how="left") \
    .select(
        "order_id", "order_date", "amount",
        "customer_name", "customer_email", "customer_country",
        "product_name", "product_category"
    )

KPI summary tables

%sql

CREATE OR REPLACE TABLE gold.kpi_daily AS
SELECT
    order_date,
    region,
    SUM(amount)                 AS daily_revenue,
    COUNT(order_id)             AS daily_orders,
    COUNT(DISTINCT customer_id) AS active_customers,
    SUM(amount) / COUNT(order_id) AS avg_order_value
FROM silver.sales
GROUP BY order_date, region;

The rules of Gold

1. Named for the consumer, not the source. gold.monthly_revenue_by_region, gold.customer_360, gold.kpi_daily — the name should tell a business person exactly what's inside.

2. Optimized for reads. Use OPTIMIZE and ZORDER on the columns most commonly filtered:

%sql
OPTIMIZE gold.kpi_daily ZORDER BY (order_date, region);

3. Multiple Gold tables are normal. Finance, Marketing, and Operations may each have their own Gold tables from the same Silver source. That's fine — it's the point. One clean Silver layer, many business-specific Gold views.


Folder Structure and Naming Conventions

Consistency here pays dividends. A structure that works for teams of all sizes:

/mnt/
  ├── bronze/
  │   ├── sales/
  │   ├── customers/
  │   └── products/
  │
  ├── silver/
  │   ├── sales/
  │   ├── customers/
  │   └── products/
  │
  └── gold/
      ├── monthly_revenue/
      ├── customer_360/
      └── kpi_daily/

Naming rules worth following:

  • Lowercase everywhere: sales_silver, not Sales_Silver
  • Layer suffix on table names: sales_bronze, sales_silver, sales_gold
  • Business names on Gold tables: monthly_revenue, customer_360, kpi_daily
  • Partition folders as key=value: year=2024/month=01/
  • Timestamps in ingestion file names: sales_20240115_143022.csv

How Data Flows Between Layers

Let's visualize the full picture:

External Source (DB, API, SaaS)
          ↓  [Ingestion: Fivetran, Airbyte, custom script]
     🥉 BRONZE
     Raw Delta tables
     Append-only or full refresh
     Preserves source data exactly
          ↓  [Transformation: PySpark / SQL notebook]
     🥈 SILVER
     Clean Delta tables
     Row-level, validated, enriched
     One table per entity
          ↓  [Aggregation: PySpark / SQL notebook]
     🥇 GOLD
     Business Delta tables
     Aggregated, wide, KPI-ready
     One table per use case
          ↓  [Serving: SQL endpoint, BI connector]
     Consumers
     Tableau / Power BI / Looker
     ML models / APIs / reports

Each arrow is a pipeline step — a notebook or a Workflow job. In the next article, we'll wire all of this together into a running end-to-end system.


Common Mistakes to Avoid

❌ Skipping Bronze and writing straight to Silver
If your source changes or a bug is introduced, you have no raw data to reprocess from. Always land data in Bronze first.

❌ Putting business logic in Bronze
Bronze is a mirror of the source. The moment you rename a column or apply a rule in Bronze, you've hidden what the source actually sent you.

❌ Aggregating in Silver
Silver should always be row-level. Aggregations belong in Gold. Mixing them makes Silver harder to reuse for multiple consumers.

❌ Building Gold tables no one uses
Gold tables have a cost — storage, compute, maintenance. Build them for real consumers. Don't pre-aggregate everything "just in case."

❌ Not partitioning Silver and Gold tables
Large tables without partitioning mean full scans on every query. Partition by date at minimum.


Wrapping Up

Here's what to take away from this article:

  • The Medallion Architecture organizes data into Bronze (raw), Silver (clean), and Gold (aggregated) layers
  • Bronze is your archive — raw, immutable, with ingestion metadata
  • Silver is your trusted layer — deduplicated, validated, enriched, row-level
  • Gold is your serving layer — aggregated, consumer-shaped, BI-ready
  • Data flows one direction only: Bronze → Silver → Gold
  • Consistent naming, folder structure, and partitioning are not optional — they're the foundation of maintainable pipelines

In the final article, we put all ten chapters together and build a complete end-to-end data warehouse in Databricks from scratch.
