Medallion Architecture: Designing Bronze, Silver, and Gold Layers
Before you write a single transformation, you need a blueprint.
The Medallion Architecture is the design pattern used by data teams worldwide to organize data in a Lakehouse. It's the blueprint Databricks themselves recommend, and it's the foundation of everything we'll build in the final article of this series.
Understanding it deeply — not just the names of the layers, but why they exist — is what separates pipelines that scale from pipelines that become a maintenance nightmare six months later.
What is the Medallion Architecture?
The Medallion Architecture organizes data into three progressive layers, each named after a metal:
Raw Sources
↓
🥉 BRONZE — Raw data, exactly as received
↓
🥈 SILVER — Cleaned, validated, enriched data
↓
🥇 GOLD — Aggregated, business-ready data
↓
Consumers (BI tools, ML models, APIs)
Each layer has a clear purpose. Data flows in one direction — from raw to refined — and each step adds value.
The core idea is simple: never transform in place. Instead of overwriting raw data with cleaned data, you keep every layer. Raw data is always preserved. If something breaks downstream, you can reprocess from the source without re-ingesting from external systems.
The Bronze Layer: Raw Data As-Is
What it is
Bronze is your landing zone. Data arrives here exactly as it came from the source — no cleaning, no filtering, no transformation.
Think of Bronze as your archive. It's the single source of truth for what you received, when you received it, and from where.
What goes in
- Files dropped by upstream systems (CSV, JSON, XML, Avro, Parquet)
- API responses stored as-is
- Database snapshots
- Streaming event data (Kafka, Kinesis)
What it looks like
/mnt/bronze/
├── sales/
│   ├── year=2024/month=01/day=01/
│   │   └── sales_20240101_143022.csv
│   └── year=2024/month=01/day=02/
│       └── sales_20240102_143055.csv
├── customers/
│   └── customers_full_20240101.json
└── products/
    └── products_20240101.parquet
The rules of Bronze
1. Never delete raw data. Raw files are your safety net. Storage is cheap — losing source data is not.
2. Add ingestion metadata. When you land data in Bronze, enrich it with columns that help you debug later:
from pyspark.sql.functions import current_timestamp, lit, input_file_name

df_bronze = df_raw \
    .withColumn("_ingested_at", current_timestamp()) \
    .withColumn("_source_file", input_file_name()) \
    .withColumn("_source_system", lit("sales_db"))
3. Store as Delta. Even raw data should be Delta. You get transaction safety, history, and the ability to query it reliably; see the time-travel sketch after this list.
4. No business logic here. Bronze is not the place to rename columns, apply rules, or make assumptions about the data. That's Silver's job.
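Storing Bronze as Delta pays off quickly because the transaction log is queryable. A short sketch of what rule 3 buys you, reusing the sales path from the example above:
# Inspect the Bronze table's transaction history
spark.sql("DESCRIBE HISTORY delta.`/mnt/bronze/sales/`").show(truncate=False)

# Time travel: read the table exactly as it looked at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/bronze/sales/")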
Ingestion Patterns
Full load: Ingest the entire dataset every run. Simple, but expensive for large tables.
df = spark.read.csv("/mnt/landing/sales/", header=True)
df.write.format("delta").mode("overwrite").save("/mnt/bronze/sales/")
Incremental load: Ingest only new or changed records. More efficient, but it requires a watermark (a timestamp or ID that marks what was last loaded).
last_ingested = "2024-01-14"
df = spark.read.csv("/mnt/landing/sales/", header=True) \
    .filter(f"file_date > '{last_ingested}'")
df.write.format("delta").mode("append").save("/mnt/bronze/sales/")
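The snippet hardcodes last_ingested; in a real pipeline you would persist the watermark between runs. A minimal sketch, assuming a small bookkeeping Delta table at a hypothetical /mnt/meta/watermarks/sales/ path:
from pyspark.sql import Row
from pyspark.sql.functions import max as spark_max

# Compute the new watermark from the batch just ingested
new_watermark = df.agg(spark_max("file_date").alias("wm")).collect()[0]["wm"]

# Persist it so the next run knows where to resume (path and schema are illustrative)
spark.createDataFrame([Row(table="sales", last_ingested=str(new_watermark))]) \
    .write.format("delta").mode("overwrite").save("/mnt/meta/watermarks/sales/")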
The Silver Layer: Clean, Validated, Trusted Data
What it is
Silver is your cleaned and conformed layer. This is where raw data becomes trustworthy.
If Bronze answers "what did we receive?", Silver answers "what is actually true?"
What happens in Silver
| Operation | Example |
|---|---|
| Deduplication | Remove duplicate order records |
| Null handling | Fill missing regions with 'Unknown' |
| Type casting | Cast string dates to DateType |
| Standardization | Uppercase all product names, normalize phone formats |
| Validation | Reject rows where amount < 0 |
| Enrichment | Join with a reference table to add region names |
| Schema enforcement | Ensure all expected columns are present |
What it looks like
from pyspark.sql.functions import col, upper, round, when, to_date, year, month, current_timestamp

# Read from Bronze
bronze = spark.read.format("delta").load("/mnt/bronze/sales/")

# Transform to Silver
silver = bronze \
    .dropDuplicates(["order_id"]) \
    .dropna(subset=["order_id", "customer_id", "amount"]) \
    .filter(col("amount") > 0) \
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \
    .withColumn("product", upper(col("product"))) \
    .withColumn("amount", round(col("amount"), 2)) \
    .withColumn("region",
        when(col("region").isNull(), "Unknown")
        .otherwise(col("region"))
    ) \
    .withColumn("tier",
        when(col("amount") >= 5000, "Premium")
        .when(col("amount") >= 1000, "Standard")
        .otherwise("Basic")
    ) \
    .withColumn("year", year(col("order_date"))) \
    .withColumn("month", month(col("order_date"))) \
    .withColumn("_processed_at", current_timestamp()) \
    .drop("_source_file", "_ingested_at")  # Drop Bronze ingestion metadata

# Write to Silver, partitioned by the year/month columns derived above
silver.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .save("/mnt/silver/sales/")
The rules of Silver
1. One Silver table per source entity. sales_silver, customers_silver, products_silver — clean versions of your source tables.
2. Still row-level granularity. Silver doesn't aggregate. Every row still represents one event, one order, one customer record.
3. Idempotent transformations. Running the Silver pipeline twice should produce the same result as running it once. No side effects. See the MERGE sketch after this list for one way to achieve this.
4. Document your cleaning rules. Silver is where business decisions are baked in (amount < 0 is invalid). These rules should be documented and reviewed by the business.
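On rule 3: a blind append breaks idempotence, because a re-run duplicates rows. A common fix is a Delta MERGE keyed on the entity's ID. A sketch, assuming the Silver table already exists at the path used above and that order_id uniquely identifies a row:
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/silver/sales/")

# Upsert: update rows that already exist, insert the rest
target.alias("t") \
    .merge(silver.alias("s"), "t.order_id = s.order_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
Re-running this with the same input leaves the table unchanged, which is exactly the property rule 3 asks for.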
The Gold Layer: Business-Ready Data
What it is
Gold is your serving layer. It contains aggregated, pre-joined, business-meaningful tables — built specifically for consumption by BI tools, dashboards, ML models, or APIs.
If Silver is clean data, Gold is useful data. The difference is intent.
What goes in Gold
Gold tables are shaped by consumer needs. Common patterns:
Aggregated metrics
from pyspark.sql.functions import sum, count, avg, countDistinct

monthly_revenue = silver_sales \
    .groupBy("region", "year", "month") \
    .agg(
        sum("amount").alias("total_revenue"),
        count("order_id").alias("total_orders"),
        avg("amount").alias("avg_order_value"),
        countDistinct("customer_id").alias("unique_customers")
    )
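To make this consumable, persist the result as its own Delta table; the target path below is an assumption that mirrors the folder layout shown later in this article:
# Write the aggregate out as a Gold Delta table
monthly_revenue.write.format("delta").mode("overwrite").save("/mnt/gold/monthly_revenue/")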
Wide denormalized tables (flat joins)
# Join sales + customers + products into one wide table
# BI tools love these: no runtime joins needed
wide_sales = silver_sales \
    .join(silver_customers, on="customer_id", how="left") \
    .join(silver_products, on="product_id", how="left") \
    .select(
        "order_id", "order_date", "amount",
        "customer_name", "customer_email", "customer_country",
        "product_name", "product_category"
    )
KPI summary tables
%sql
CREATE OR REPLACE TABLE gold.kpi_daily AS
SELECT
order_date,
region,
SUM(amount) AS daily_revenue,
COUNT(order_id) AS daily_orders,
COUNT(DISTINCT customer_id) AS active_customers,
SUM(amount) / COUNT(order_id) AS avg_order_value
FROM silver.sales
GROUP BY order_date, region;
The rules of Gold
1. Named for the consumer, not the source. gold.monthly_revenue_by_region, gold.customer_360, gold.kpi_daily — the name should tell a business person exactly what's inside.
2. Optimized for reads. Use OPTIMIZE and ZORDER on the columns most commonly filtered:
%sql
OPTIMIZE gold.kpi_daily ZORDER BY (order_date, region);
3. Multiple Gold tables are normal. Finance, Marketing, and Operations may each have their own Gold tables from the same Silver source. That's fine — it's the point. One clean Silver layer, many business-specific Gold views.
Folder Structure and Naming Conventions
Consistency here pays dividends. A structure that works for teams of all sizes:
/mnt/
├── bronze/
│   ├── sales/
│   ├── customers/
│   └── products/
│
├── silver/
│   ├── sales/
│   ├── customers/
│   └── products/
│
└── gold/
    ├── monthly_revenue/
    ├── customer_360/
    └── kpi_daily/
Naming rules worth following:
| Convention | Example |
|---|---|
| Lowercase everywhere | sales_silver not Sales_Silver |
| Layer suffix on table names | sales_bronze, sales_silver, sales_gold |
| Business names on Gold | monthly_revenue, customer_360, kpi_daily |
| Partition folders as key=value | year=2024/month=01/ |
| Timestamps in ingestion file names | sales_20240115_143022.csv |
How Data Flows Between Layers
Let's visualize the full picture:
External Source (DB, API, SaaS)
    ↓  [Ingestion: Fivetran, Airbyte, custom script]
🥉 BRONZE
    Raw Delta tables
    Append-only or full refresh
    Preserves source data exactly
    ↓  [Transformation: PySpark / SQL notebook]
🥈 SILVER
    Clean Delta tables
    Row-level, validated, enriched
    One table per entity
    ↓  [Aggregation: PySpark / SQL notebook]
🥇 GOLD
    Business Delta tables
    Aggregated, wide, KPI-ready
    One table per use case
    ↓  [Serving: SQL endpoint, BI connector]
Consumers
    Tableau / Power BI / Looker
    ML models / APIs / reports
Each arrow is a pipeline step — a notebook or a Workflow job. In the next article, we'll wire all of this together into a running end-to-end system.
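As a rough preview of that wiring, a driver notebook can chain the layer notebooks with dbutils.notebook.run. The notebook paths here are assumptions; in production, a Databricks Workflows job with task dependencies is the more robust equivalent:
# Run each layer in sequence; a failure in any step raises and stops the chain
# (the second argument is the timeout in seconds; paths are hypothetical)
dbutils.notebook.run("/pipelines/01_bronze_ingest", 3600)
dbutils.notebook.run("/pipelines/02_silver_transform", 3600)
dbutils.notebook.run("/pipelines/03_gold_aggregate", 3600)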
Common Mistakes to Avoid
❌ Skipping Bronze and writing straight to Silver
If your source changes or a bug is introduced, you have no raw data to reprocess from. Always land data in Bronze first.
❌ Putting business logic in Bronze
Bronze is a mirror of the source. The moment you rename a column or apply a rule in Bronze, you've hidden what the source actually sent you.
❌ Aggregating in Silver
Silver should always be row-level. Aggregations belong in Gold. Mixing them makes Silver harder to reuse for multiple consumers.
❌ Building Gold tables no one uses
Gold tables have a cost — storage, compute, maintenance. Build them for real consumers. Don't pre-aggregate everything "just in case."
❌ Not partitioning Silver and Gold tables
Large tables without partitioning mean full scans on every query. Partition by date at minimum.
Wrapping Up
Here's what to take away from this article:
- The Medallion Architecture organizes data into Bronze (raw), Silver (clean), and Gold (aggregated) layers
- Bronze is your archive — raw, immutable, with ingestion metadata
- Silver is your trusted layer — deduplicated, validated, enriched, row-level
- Gold is your serving layer — aggregated, consumer-shaped, BI-ready
- Data flows one direction only: Bronze → Silver → Gold
- Consistent naming, folder structure, and partitioning are not optional — they're the foundation of maintainable pipelines
In the final article, we put all ten chapters together and build a complete end-to-end data warehouse in Databricks from scratch.