DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

ETL Patterns Guide — Spark ETL Framework

A practical guide to building reliable, scalable data pipelines with the
medallion architecture pattern.

By Datanest Digital


Medallion Architecture

The medallion (multi-hop) architecture organises data into three layers:

| Layer | Purpose | Data Quality | Schema |
|---|---|---|---|
| Bronze | Raw ingestion — land data as-is | Unvalidated | Inferred |
| Silver | Cleaned, conformed, deduplicated | Validated | Enforced |
| Gold | Business-level aggregates | Trusted | Optimised |

Why three layers?

  1. Auditability — Bronze retains the original data for replay or debugging.
  2. Decoupling — Consumers read from Gold; ingestion changes don't break dashboards.
  3. Quality escalation — each layer adds trust, with regressions caught by quality gates.

Idempotency

Every pipeline step should be safe to re-run without producing duplicates
or corrupted state.

Strategies

| Strategy | When to use |
|---|---|
| MERGE (upsert) | SCD Type 1 — overwrite on natural key |
| SCD Type 2 merge | Need full history of changes |
| Overwrite partition | Gold aggregates partitioned by date |
| Append + dedup window | Bronze landing where merge is too slow |

Example: idempotent partition overwrite

(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"date = '{target_date}'")
    .saveAsTable("gold.daily_metrics")
)
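The MERGE (upsert) strategy from the table above can be expressed as a Delta `MERGE` statement. A minimal sketch of a helper that assembles the SQL for an SCD Type 1 upsert — table and column names are illustrative, and the resulting string would be executed via `spark.sql(...)`:

```python
def build_scd1_merge(target: str, source: str, keys: list[str], cols: list[str]) -> str:
    """Build a Delta MERGE statement that upserts on the natural key (SCD Type 1)."""
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    sets = ", ".join(f"t.{c} = s.{c}" for c in cols)
    inserts = ", ".join(keys + cols)
    values = ", ".join(f"s.{c}" for c in keys + cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({inserts}) VALUES ({values})"
    )


# Hypothetical tables: upsert a bronze batch into a silver customer table.
sql = build_scd1_merge("silver.customers", "bronze.customers_batch",
                       keys=["customer_id"], cols=["email", "updated_at"])
```

Because the natural key appears in the `ON` clause, re-running the same batch updates rows in place rather than duplicating them, which is what makes the step idempotent.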

Partitioning Strategies

| Use Case | Partition Column(s) | Notes |
|---|---|---|
| Time-series ingestion | _ingested_at (date) | Aligns with daily schedules |
| Customer analytics | region, date | Avoid too many small files |
| Event streaming | event_date, hour | Balance read & write speed |

Rules of thumb

  • Target 1 GB per partition for Delta Lake.
  • Avoid partitioning on high-cardinality columns (e.g. user_id).
  • Use OPTIMIZE and ZORDER for query-time performance instead.
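The 1 GB rule of thumb translates into a quick sizing check. A small sketch, assuming `total_bytes` comes from your own table metrics (the 1 GB target is just the guideline above):

```python
def target_partition_count(total_bytes: int, target_bytes: int = 1 << 30) -> int:
    """Estimate how many partitions hit the ~1 GB-per-partition guideline."""
    # Ceiling division without floats; always at least one partition.
    return max(1, -(-total_bytes // target_bytes))


# A 250 GB table lands at ~250 partitions; a tiny table stays at 1.
```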

Error Handling

Retry with exponential backoff

The ETLBase class tracks errors via metrics. Wrap orchestration with retries:

import time

MAX_RETRIES = 3

for attempt in range(1, MAX_RETRIES + 1):
    result = etl.run()
    if result is not None:
        break
    if attempt < MAX_RETRIES:  # don't sleep after the final attempt
        wait = 2 ** attempt
        print(f"Attempt {attempt} failed — retrying in {wait}s")
        time.sleep(wait)
else:
    raise RuntimeError("Pipeline exhausted all retries")

Dead-letter pattern

Route malformed rows to a quarantine table instead of failing the whole batch:

from pyspark.sql import functions as F

good_df = df.filter(F.col("_corrupt_record").isNull())
bad_df  = df.filter(F.col("_corrupt_record").isNotNull())

# Write bad rows for later inspection
bad_df.write.format("delta").mode("append").saveAsTable("quarantine.bad_records")
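Note that `_corrupt_record` is only populated when the reader is configured for it. A sketch of the relevant options, assuming a JSON source (for JSON and CSV, the column must also be declared in the explicit schema you pass to the reader):

```python
# Spark reader options that route unparseable rows into _corrupt_record
# instead of failing the read (PERMISSIVE is Spark's default parse mode).
corrupt_record_opts = {
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt_record",
}

# Usage sketch: spark.read.options(**corrupt_record_opts).schema(schema).json(path)
```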

Data Quality Between Layers

Quality gates should be automated and blocking:

  1. Run quality checks after each layer's load() step.
  2. If critical checks fail, halt the pipeline and alert.
  3. Store check results in a Delta table for trending and SLA tracking.
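The three steps above can be sketched as a small gate function. Check names and the critical set are illustrative; in practice the results dict would be produced after each layer's load():

```python
def enforce_quality_gate(results: dict[str, bool], critical: set[str]) -> list[str]:
    """Return names of failed checks; raise if any critical check failed."""
    failed = [name for name, ok in results.items() if not ok]
    blocking = [name for name in failed if name in critical]
    if blocking:
        # Step 2: a critical failure halts the pipeline (alerting hooks go here).
        raise RuntimeError(f"Critical quality checks failed: {blocking}")
    # Non-critical failures are returned so they can still be recorded.
    return failed
```

The returned results could then be appended to a Delta table for the trending and SLA tracking mentioned in step 3.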

Common checks by layer

| Layer | Check |
|---|---|
| Bronze | Row count > 0, no schema drift |
| Silver | PK unique, required cols not null, type correctness |
| Gold | Aggregate sanity (totals > 0), row count consistent |

Performance Tips

  1. Use Delta Lake caching — spark.catalog.cacheTable() for hot Silver tables.
  2. Broadcast small dimensions — F.broadcast(dim_df) in joins.
  3. Adaptive Query Execution — enabled by default in DBR 13.3+.
  4. OPTIMIZE + ZORDER — run nightly on frequently-queried Gold tables.
  5. Avoid .count() in production — use isEmpty() or streaming metrics instead.

Scheduling & Orchestration

| Tool | Best for |
|---|---|
| Databricks Jobs | Simple single-notebook pipelines |
| Apache Airflow | Multi-system DAGs with dependencies |
| dbt | SQL-centric transform + test workflows |

For Airflow integration, see the companion product
Airflow DAG Templates.
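For the Databricks Jobs row in the table above, a pipeline is typically described as a JSON spec against the Jobs API. An illustrative, Jobs API 2.1-style spec for the two-hop pipeline discussed here — the job name, notebook paths, and cron expression are all placeholders:

```python
# Illustrative Databricks Jobs spec; paths and schedule are placeholders.
job_spec = {
    "name": "daily-medallion-pipeline",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "bronze_to_silver",
            "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver"},
            "max_retries": 3,
        },
        {
            "task_key": "silver_to_gold",
            "depends_on": [{"task_key": "bronze_to_silver"}],
            "notebook_task": {"notebook_path": "/pipelines/silver_to_gold"},
            "max_retries": 3,
        },
    ],
}
```

The `depends_on` edge mirrors the medallion flow: Gold aggregates only run after the Silver load succeeds.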


By Datanest Digital | Version 1.0.0


This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete [Spark ETL Framework] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

