DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

ETL Patterns Guide — Spark ETL Framework

A practical guide to building reliable, scalable data pipelines with the
medallion architecture pattern.

By Datanest Digital


Medallion Architecture

The medallion (multi-hop) architecture organises data into three layers:

| Layer | Purpose | Data Quality | Schema |
|---|---|---|---|
| Bronze | Raw ingestion — land data as-is | Unvalidated | Inferred |
| Silver | Cleaned, conformed, deduplicated | Validated | Enforced |
| Gold | Business-level aggregates | Trusted | Optimised |

Why three layers?

  1. Auditability — Bronze retains the original data for replay or debugging.
  2. Decoupling — Consumers read from Gold; ingestion changes don't break dashboards.
  3. Quality escalation — each layer adds trust, with regressions caught by quality gates.

Idempotency

Every pipeline step should be safe to re-run without producing duplicates
or corrupted state.

Strategies

| Strategy | When to use |
|---|---|
| MERGE (upsert) | SCD Type 1 — overwrite on natural key |
| SCD Type 2 merge | Need full history of changes |
| Overwrite partition | Gold aggregates partitioned by date |
| Append + dedup window | Bronze landing where merge is too slow |

Example: idempotent partition overwrite

(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"date = '{target_date}'")
    .saveAsTable("gold.daily_metrics")
)
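The MERGE (upsert) strategy from the table above can be expressed as a Delta `MERGE` statement. A minimal sketch of a helper that assembles the SQL for an SCD Type 1 upsert — table and column names are illustrative, and the resulting string would be executed via `spark.sql(...)`:

```python
def build_scd1_merge(target: str, source: str, keys: list[str], cols: list[str]) -> str:
    """Build a Delta MERGE statement that upserts on the natural key (SCD Type 1)."""
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    sets = ", ".join(f"t.{c} = s.{c}" for c in cols)
    inserts = ", ".join(keys + cols)
    values = ", ".join(f"s.{c}" for c in keys + cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED THEN INSERT ({inserts}) VALUES ({values})"
    )


# Hypothetical tables: upsert a bronze batch into a silver customer table.
sql = build_scd1_merge("silver.customers", "bronze.customers_batch",
                       keys=["customer_id"], cols=["email", "updated_at"])
```

Because the natural key appears in the `ON` clause, re-running the same batch updates rows in place rather than duplicating them, which is what makes the step idempotent.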

Partitioning Strategies

| Use Case | Partition Column(s) | Notes |
|---|---|---|
| Time-series ingestion | _ingested_at (date) | Aligns with daily schedules |
| Customer analytics | region, date | Avoid too many small files |
| Event streaming | event_date, hour | Balance read & write speed |

Rules of thumb

  • Target 1 GB per partition for Delta Lake.
  • Avoid partitioning on high-cardinality columns (e.g. user_id).
  • Use OPTIMIZE and ZORDER for query-time performance instead.
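The 1 GB rule of thumb translates into a quick sizing check. A small sketch, assuming `total_bytes` comes from your own table metrics (the 1 GB target is just the guideline above):

```python
def target_partition_count(total_bytes: int, target_bytes: int = 1 << 30) -> int:
    """Estimate how many partitions hit the ~1 GB-per-partition guideline."""
    # Ceiling division without floats; always at least one partition.
    return max(1, -(-total_bytes // target_bytes))


# A 250 GB table lands at ~250 partitions; a tiny table stays at 1.
```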

Error Handling

Retry with exponential backoff

The ETLBase class tracks errors via metrics. Wrap orchestration with retries:

import time

MAX_RETRIES = 3

for attempt in range(1, MAX_RETRIES + 1):
    result = etl.run()
    if result is not None:
        break
    if attempt < MAX_RETRIES:  # don't sleep after the final attempt
        wait = 2 ** attempt
        print(f"Attempt {attempt} failed — retrying in {wait}s")
        time.sleep(wait)
else:
    raise RuntimeError("Pipeline exhausted all retries")

Dead-letter pattern

Route malformed rows to a quarantine table instead of failing the whole batch:

from pyspark.sql import functions as F

good_df = df.filter(F.col("_corrupt_record").isNull())
bad_df  = df.filter(F.col("_corrupt_record").isNotNull())

# Write bad rows for later inspection
bad_df.write.format("delta").mode("append").saveAsTable("quarantine.bad_records")
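Note that `_corrupt_record` is only populated when the reader is configured for it. A sketch of the relevant options, assuming a JSON source (for JSON and CSV, the column must also be declared in the explicit schema you pass to the reader):

```python
# Spark reader options that route unparseable rows into _corrupt_record
# instead of failing the read (PERMISSIVE is Spark's default parse mode).
corrupt_record_opts = {
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt_record",
}

# Usage sketch: spark.read.options(**corrupt_record_opts).schema(schema).json(path)
```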

Data Quality Between Layers

Quality gates should be automated and blocking:

  1. Run quality checks after each layer's load() step.
  2. If critical checks fail, halt the pipeline and alert.
  3. Store check results in a Delta table for trending and SLA tracking.
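The three steps above can be sketched as a small gate function. Check names and the critical set are illustrative; in practice the results dict would be produced after each layer's load():

```python
def enforce_quality_gate(results: dict[str, bool], critical: set[str]) -> list[str]:
    """Return names of failed checks; raise if any critical check failed."""
    failed = [name for name, ok in results.items() if not ok]
    blocking = [name for name in failed if name in critical]
    if blocking:
        # Step 2: a critical failure halts the pipeline (alerting hooks go here).
        raise RuntimeError(f"Critical quality checks failed: {blocking}")
    # Non-critical failures are returned so they can still be recorded.
    return failed
```

The returned results could then be appended to a Delta table for the trending and SLA tracking mentioned in step 3.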

Common checks by layer

| Layer | Check |
|---|---|
| Bronze | Row count > 0, no schema drift |
| Silver | PK unique, required cols not null, type correctness |
| Gold | Aggregate sanity (totals > 0), row count consistent |

Performance Tips

  1. Use Delta Lake caching — spark.catalog.cacheTable() for hot Silver tables.
  2. Broadcast small dimensions — F.broadcast(dim_df) in joins.
  3. Adaptive Query Execution — enabled by default in DBR 13.3+.
  4. OPTIMIZE + ZORDER — run nightly on frequently-queried Gold tables.
  5. Avoid .count() in production — use isEmpty() or streaming metrics instead.

Scheduling & Orchestration

| Tool | Best for |
|---|---|
| Databricks Jobs | Simple single-notebook pipelines |
| Apache Airflow | Multi-system DAGs with dependencies |
| dbt | SQL-centric transform + test workflows |

For Airflow integration, see the companion product
Airflow DAG Templates.
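For the Databricks Jobs row in the table above, a pipeline is typically described as a JSON spec against the Jobs API. An illustrative, Jobs API 2.1-style spec for the two-hop pipeline discussed here — the job name, notebook paths, and cron expression are all placeholders:

```python
# Illustrative Databricks Jobs spec; paths and schedule are placeholders.
job_spec = {
    "name": "daily-medallion-pipeline",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
        "timezone_id": "UTC",
    },
    "tasks": [
        {
            "task_key": "bronze_to_silver",
            "notebook_task": {"notebook_path": "/pipelines/bronze_to_silver"},
            "max_retries": 3,
        },
        {
            "task_key": "silver_to_gold",
            "depends_on": [{"task_key": "bronze_to_silver"}],
            "notebook_task": {"notebook_path": "/pipelines/silver_to_gold"},
            "max_retries": 3,
        },
    ],
}
```

The `depends_on` edge mirrors the medallion flow: Gold aggregates only run after the Silver load succeeds.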


By Datanest Digital | Version 1.0.0


This is 1 of 11 resources in the Data Pipeline Pro toolkit. Get the complete [Spark ETL Framework] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire Data Pipeline Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

