# ETL Patterns Guide — Spark ETL Framework

A practical guide to building reliable, scalable data pipelines with the
medallion architecture pattern.
## Medallion Architecture

The medallion (multi-hop) architecture organises data into three layers:
| Layer | Purpose | Data Quality | Schema |
|---|---|---|---|
| Bronze | Raw ingestion — land data as-is | Unvalidated | Inferred |
| Silver | Cleaned, conformed, deduplicated | Validated | Enforced |
| Gold | Business-level aggregates | Trusted | Optimised |
### Why three layers?

- **Auditability** — Bronze retains the original data for replay and debugging.
- **Decoupling** — consumers read from Gold, so ingestion changes don't break dashboards.
- **Quality escalation** — each layer adds more trust, enforced by quality gates between layers.
## Idempotency

Every pipeline step should be safe to re-run without producing duplicates
or corrupted state.

### Strategies
| Strategy | When to use |
|---|---|
| MERGE (upsert) | SCD Type 1 — overwrite on natural key |
| SCD Type 2 merge | Need full history of changes |
| Overwrite partition | Gold aggregates partitioned by date |
| Append + dedup window | Bronze landing where merge is too slow |
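The MERGE row above is, at heart, a keyed upsert whose re-runs are no-ops. A minimal plain-Python sketch of the SCD Type 1 semantics (a dict keyed on the natural key stands in for the target table; names are illustrative):

```python
def upsert(target: dict, updates: list, key: str) -> dict:
    """SCD Type 1: last write wins on the natural key, so re-runs are no-ops."""
    for row in updates:
        target[row[key]] = row  # overwrite if the key exists, insert otherwise
    return target

table = {1: {"id": 1, "name": "old"}}
batch = [{"id": 1, "name": "new"}, {"id": 2, "name": "x"}]
upsert(table, batch, "id")
upsert(table, batch, "id")  # second run changes nothing: idempotent
```

In Delta Lake the same semantics come from `MERGE INTO` matched on the natural key, with `WHEN MATCHED THEN UPDATE` / `WHEN NOT MATCHED THEN INSERT`.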
### Example: idempotent partition overwrite

```python
# Re-running this job for the same target_date rewrites only that
# partition, so repeated runs leave the table in the same state.
(
    df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"date = '{target_date}'")
        .saveAsTable("gold.daily_metrics")
)
```
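The append + dedup strategy from the table keeps only the latest record per key; in Spark this is typically `row_number()` over a window ordered by ingestion time, keeping row number 1. The core logic in plain Python (field names are illustrative):

```python
def dedup_latest(rows: list, key: str, ts: str) -> list:
    """Keep only the most recent row per key, mirroring row_number() == 1."""
    latest = {}
    for row in rows:
        if row[key] not in latest or row[ts] > latest[row[key]][ts]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"event_id": "a", "ingested_at": 1, "v": "stale"},
    {"event_id": "a", "ingested_at": 2, "v": "fresh"},
    {"event_id": "b", "ingested_at": 1, "v": "only"},
]
deduped = dedup_latest(rows, "event_id", "ingested_at")
```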
## Partitioning Strategies

| Use Case | Partition Column(s) | Notes |
|---|---|---|
| Time-series ingestion | `_ingested_at` (date) | Aligns with daily schedules |
| Customer analytics | `region`, `date` | Avoid too many small files |
| Event streaming | `event_date`, `hour` | Balance read & write speed |
### Rules of thumb

- Target ~1 GB per partition for Delta Lake.
- Avoid partitioning on high-cardinality columns (e.g. `user_id`).
- Use `OPTIMIZE` and `ZORDER` for query-time performance instead.
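The 1 GB rule of thumb translates directly into a quick sizing calculation before a repartition or compaction. A sketch (the target size is the rule above, not a hard limit):

```python
def target_partition_count(total_bytes: int, target_bytes: int = 1024**3) -> int:
    """Rough partition count aiming at ~1 GB per partition."""
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

# A 10 GB table would be written as ~10 partitions:
n = target_partition_count(10 * 1024**3)
```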
## Error Handling

### Retry with exponential backoff

The `ETLBase` class tracks errors via metrics. Wrap orchestration with retries:
```python
import time

MAX_RETRIES = 3

for attempt in range(1, MAX_RETRIES + 1):
    try:
        result = etl.run()
        if result is not None:
            break
    except Exception as exc:  # orchestration-level catch so a raise still retries
        print(f"Attempt {attempt} raised: {exc}")
    wait = 2 ** attempt  # exponential backoff: 2s, 4s, 8s
    print(f"Attempt {attempt} failed — retrying in {wait}s")
    time.sleep(wait)
else:
    raise RuntimeError("Pipeline exhausted all retries")
```
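One refinement worth considering is randomised jitter, so that many pipelines retrying at once don't hammer a shared service in lockstep. A sketch of the full-jitter variant (function name and defaults are illustrative):

```python
import random

def backoff_with_jitter(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Return a random wait up to the capped exponential bound (full jitter)."""
    return random.uniform(0, min(cap, base ** attempt))

wait = backoff_with_jitter(3)  # somewhere in [0, 8] seconds
```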
### Dead-letter pattern

Route malformed rows to a quarantine table instead of failing the whole batch:

```python
from pyspark.sql import functions as F

# _corrupt_record is populated by Spark's PERMISSIVE mode when reading JSON/CSV
good_df = df.filter(F.col("_corrupt_record").isNull())
bad_df = df.filter(F.col("_corrupt_record").isNotNull())

# Write bad rows for later inspection
bad_df.write.format("delta").mode("append").saveAsTable("quarantine.bad_records")
```
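Quarantining should not silently swallow a systemic failure: if most of a batch is malformed, something upstream broke. A small guard on the quarantine ratio helps (threshold and names are illustrative):

```python
def check_quarantine_rate(bad: int, total: int, max_ratio: float = 0.05) -> float:
    """Raise if the share of quarantined rows exceeds the threshold."""
    ratio = bad / total if total else 0.0
    if ratio > max_ratio:
        raise RuntimeError(f"{bad}/{total} rows quarantined (> {max_ratio:.0%})")
    return ratio

ratio = check_quarantine_rate(bad=2, total=100)  # 2% is under the 5% threshold
```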
## Data Quality Between Layers

Quality gates should be automated and blocking:

- Run quality checks after each layer's `load()` step.
- If critical checks fail, halt the pipeline and alert.
- Store check results in a Delta table for trending and SLA tracking.

### Common checks by layer
| Layer | Check |
|---|---|
| Bronze | Row count > 0, no schema drift |
| Silver | PK unique, required cols not null, type correctness |
| Gold | Aggregate sanity (totals > 0), row count consistent |
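A blocking gate can be as simple as a list of named check results where any critical failure halts the run. A minimal sketch (the `Check` shape is illustrative, not the framework's API):

```python
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    passed: bool
    critical: bool = True

def run_gate(checks: list) -> list:
    """Raise on any failed critical check; return names of passing checks."""
    blockers = [c.name for c in checks if c.critical and not c.passed]
    if blockers:
        raise RuntimeError(f"Blocking quality checks failed: {blockers}")
    return [c.name for c in checks if c.passed]

passed = run_gate([
    Check("row_count_positive", True),
    Check("no_schema_drift", True),
    Check("freshness_within_sla", True, critical=False),
])
```

Persisting each `Check` result to a Delta table gives the trending and SLA history mentioned above.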
## Performance Tips

- **Use Delta Lake caching** — `spark.catalog.cacheTable()` for hot Silver tables.
- **Broadcast small dimensions** — `F.broadcast(dim_df)` in joins.
- **Adaptive Query Execution** — enabled by default in DBR 13.3+.
- **OPTIMIZE + ZORDER** — run nightly on frequently-queried Gold tables.
- **Avoid `.count()` in production** — use `isEmpty()` or streaming metrics instead.
## Scheduling & Orchestration
| Tool | Best for |
|---|---|
| Databricks Jobs | Simple single-notebook pipelines |
| Apache Airflow | Multi-system DAGs with dependencies |
| dbt | SQL-centric transform + test workflows |
For Airflow integration, see the companion product *Airflow DAG Templates*.
By Datanest Digital | Version 1.0.0
This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete [Spark ETL Framework] with all files, templates, and documentation for $59.
Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.