You have a customer record from a legacy database. The name field contains "JOHN SMITH " with extra spaces. The phone field has "(555) 123-4567" in a format your system does not accept. The email field is "NULL" as a literal string. The birth date is "0000-00-00".
You need to extract this record, fix all these issues, and load it into your target system. The question is: where in your pipeline does each fix happen? And when something breaks, how do you know which fix failed?
This is where the traditional 3-phase ETL model fails. "Extract, Transform, Load" bundles too much into "Transform." The 6-phase pattern unbundles it into distinct responsibilities, so when something breaks at 3 AM, you know exactly where to look.
Why 3 Phases Are Not Enough
The classic ETL model looks simple:
Extract → Transform → Load
But "Transform" is doing too much work. It handles field renaming, type conversion, data cleaning, business logic, and enrichment. When the pipeline fails with "Invalid date format," you are left asking: Was it a mapping issue? A type conversion? A business rule? A data quality problem?
| 3-Phase ETL | Problem |
|---|---|
| Extract | Clear responsibility — no issue here |
| Transform | Field renaming + type conversion + cleaning + business logic + enrichment — all bundled together |
| Load | Clear responsibility — no issue here |
The problem is not that "Transform" does too many things. The problem is that when it fails, you cannot tell which thing failed.
The 6-Phase ETL Pipeline Pattern
Let us trace what actually happens to that messy customer record as it flows through all six phases.
Extract → Map → Transform → Clean → Refine → Load
Starting record from source:
```json
{
  "cust_id": 12345,
  "cust_nm": "JOHN SMITH ",
  "cust_phone": "(555) 123-4567",
  "cust_email": "NULL",
  "birth_dt": "0000-00-00"
}
```
Phase 1: Extract — Get Raw Data
The extractor pulls the record exactly as it exists in the source. No modifications. No cleaning. Just faithful extraction.
```jsonc
{
  "cust_id": 12345,
  "cust_nm": "JOHN SMITH ",       // Extra spaces? Still there.
  "cust_phone": "(555) 123-4567", // Parentheses? Still there.
  "cust_email": "NULL",           // Literal string? Still there.
  "birth_dt": "0000-00-00",       // Invalid date? Still there.
  "_meta": {
    "extracted_at": "2024-01-15T10:30:00Z",
    "source": "legacy_crm"
  }
}
```
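A minimal sketch of this idea (the function and field names here are illustrative, not from any real codebase): the extractor yields each source row verbatim and attaches the metadata envelope shown above.

```python
from datetime import datetime, timezone

def extract(rows, source_name):
    """Yield every source row exactly as-is, wrapped with extraction metadata."""
    for row in rows:
        yield {
            **row,  # raw values untouched -- no trimming, no parsing, no fixing
            "_meta": {
                "extracted_at": datetime.now(timezone.utc).isoformat(),
                "source": source_name,
            },
        }

raw = {"cust_id": 12345, "cust_nm": "JOHN SMITH ", "cust_email": "NULL"}
record = next(extract([raw], "legacy_crm"))
```

Keeping extraction this dumb is deliberate: the raw values survive into later phases, so you can always diff against what the source actually sent.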
Phase 2: Map — Rename and Restructure
Field names change to match the target schema. No data values change, only the structure.
Mapping rules:
```
cust_id    → customer_id
cust_nm    → full_name
cust_phone → phone
cust_email → email
birth_dt   → birth_date
```
If this phase fails, you know immediately: the source schema changed.
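A mapping step can be sketched as a pure rename over a config dict (names hypothetical). Note that it fails loudly on an unmapped field, which is exactly the "source schema changed" signal:

```python
FIELD_MAP = {
    "cust_id": "customer_id",
    "cust_nm": "full_name",
    "cust_phone": "phone",
    "cust_email": "email",
    "birth_dt": "birth_date",
}

def map_fields(record, mapping):
    """Rename fields to the target schema; values pass through untouched."""
    mapped = {}
    for key, value in record.items():
        if key.startswith("_"):
            mapped[key] = value  # metadata envelope passes through
        elif key in mapping:
            mapped[mapping[key]] = value
        else:
            # An unknown field means the source schema changed --
            # fail here, not three phases later.
            raise KeyError(f"unmapped source field: {key!r}")
    return mapped

mapped = map_fields({"cust_id": 12345, "cust_nm": "JOHN SMITH "}, FIELD_MAP)
```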
Phase 3: Transform — Convert Types
Data types change. Strings become integers. Dates get parsed into proper date objects. No business logic yet.
```jsonc
{
  "customer_id": 12345,
  "full_name": "JOHN SMITH ",
  "phone": "(555) 123-4567",
  "email": "NULL",
  "birth_date": null  // "0000-00-00" → null (unparseable date)
}
```
If this phase fails, the source sent data in an unexpected format.
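One way to sketch this as per-field converters (the converter table and helper names are assumptions for illustration): each field gets exactly one type rule, and an unparseable date degrades to `None` rather than crashing the record.

```python
from datetime import datetime

def to_date(value):
    """Parse 'YYYY-MM-DD'; unparseable values like '0000-00-00' become None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None

TYPE_RULES = {"customer_id": int, "birth_date": to_date}

def convert_types(record, rules):
    """Apply a per-field converter; fields without a rule pass through."""
    return {key: rules[key](value) if key in rules else value
            for key, value in record.items()}

transformed = convert_types(
    {"customer_id": "12345", "full_name": "JOHN SMITH ", "birth_date": "0000-00-00"},
    TYPE_RULES,
)
```

Notice the extra spaces and the literal `"NULL"` string would still be present after this phase: those are data quality problems, and they belong to Clean.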
Phase 4: Clean — Fix Data Quality
Data quality issues get fixed. Extra whitespace trimmed. Invalid phone formats normalized. Placeholder values like "NULL" become actual nulls.
```jsonc
{
  "customer_id": 12345,
  "full_name": "JOHN SMITH",  // Extra spaces removed
  "phone": "5551234567",      // Normalized to digits only
  "email": null,              // "NULL" string → actual null
  "birth_date": null
}
```
Phase 5: Refine — Apply Business Logic
Business rules and enrichment. Calculated fields. Lookups from reference tables.
```jsonc
{
  "customer_id": 12345,
  "full_name": "JOHN SMITH",
  "phone": "5551234567",
  "phone_formatted": "(555) 123-4567",
  "email": null,
  "email_status": "missing",   // Business rule: flag missing emails
  "birth_date": null,
  "age_verified": false,       // Business rule: needs birth date
  "customer_tier": "standard"  // Lookup from tier rules
}
```
Phase 6: Load — Write to Destination
The final record gets inserted or updated in the target system with proper transaction handling.
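One way to sketch the load step is an upsert inside a transaction (SQLite syntax shown for a self-contained example; the table and column names are assumptions, and a real target would use its own upsert dialect):

```python
import sqlite3

def load(record, conn):
    """Upsert one record in a transaction: commit on success, roll back on error."""
    with conn:  # the connection context manager handles commit/rollback
        conn.execute(
            """
            INSERT INTO customers (customer_id, full_name, phone, email)
            VALUES (:customer_id, :full_name, :phone, :email)
            ON CONFLICT (customer_id) DO UPDATE SET
                full_name = excluded.full_name,
                phone = excluded.phone,
                email = excluded.email
            """,
            record,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,"
    " full_name TEXT, phone TEXT, email TEXT)"
)
row = {"customer_id": 12345, "full_name": "JOHN SMITH",
       "phone": "5551234567", "email": None}
load(row, conn)
load({**row, "phone": "5559876543"}, conn)  # re-loading updates, never duplicates
```

The upsert makes the load idempotent, so replaying a batch after a failure is safe.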
Why This Pattern Matters
When a pipeline fails in production, the error message tells you which phase failed. This is the difference between a 5-minute fix and a 3-hour investigation.
| Phase | Error Example | Root Cause | Fix Time |
|---|---|---|---|
| Extract | "Connection refused" | Source system down | Minutes |
| Map | "Unknown field 'customer_name'" | Schema changed at source | Minutes |
| Transform | "Cannot parse '2024/13/45' as date" | Unexpected format | Minutes |
| Clean | "Phone validation: value is all dashes" | Data quality problem | Minutes |
| Refine | "No tier for 'PREMIUM_PLUS'" | New business tier type | Minutes |
| Load | "FK constraint violation" | Dependency not loaded | Minutes |
With a 3-phase pipeline, the four middle failures would all surface as "Transform failed." You would spend hours reading through code trying to figure out what went wrong.
The Assembly Line Mental Model
Think of a car manufacturing plant. Raw materials pass through stations, each with a single responsibility.
| Station | Phase | What Happens | Failure Means |
|---|---|---|---|
| Receiving | Extract | Raw materials arrive | Supplier did not deliver |
| Sorting | Map | Parts sorted into bins | Part labels changed |
| Machining | Transform | Parts cut to spec | Wrong dimensions |
| QC | Clean | Defects caught | Material quality declined |
| Assembly | Refine | Parts become components | Design spec has a gap |
| Delivery | Load | Car rolls off the line | Customer garage is full |
When a car has a problem, you know exactly which station to investigate.
Common Anti-Patterns to Avoid
| Anti-Pattern | Why It Fails | Do This Instead |
|---|---|---|
| Cleaning in Extract | You lose raw source data for comparison | Extract faithfully, clean in Phase 4 |
| Business logic in Clean | Different change frequencies and owners | Clean in Phase 4, business rules in Phase 5 |
| Mapping in Transform | Schema errors look like format errors | Map in Phase 2, convert types in Phase 3 |
| Skipping phases | When source changes, you edit transform code instead of config | Keep all 6 phases |
| Phase coupling | Phases cannot be tested independently | Each phase depends only on record structure |
Implementation
Each phase is a function that takes a record in and returns a record out:
```python
for record in extract(source):
    mapped = map_fields(record, config.mappings)
    transformed = convert_types(mapped, config.types)
    cleaned = clean_fields(transformed, config.cleaners)
    refined = apply_rules(cleaned, config.rules)
    load(refined, destination)
```
Each function:
- Takes one record as input
- Returns one record as output (or null to skip)
- Has no side effects on other records
- Emits events for observability
- Can be tested with a single record
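The observability point can be sketched as a generic wrapper (names hypothetical): each phase reports success or failure tagged with its own name, which is what makes the failing phase obvious in the event trail.

```python
def with_events(phase_name, fn, events):
    """Wrap a phase function so every record emits an event for that phase."""
    def wrapped(record, *args):
        try:
            result = fn(record, *args)
            events.append({"phase": phase_name, "status": "ok"})
            return result
        except Exception as exc:
            events.append({"phase": phase_name, "status": "error", "detail": str(exc)})
            raise
    return wrapped

events = []
trim = with_events("clean", lambda r: {**r, "full_name": r["full_name"].strip()}, events)
out = trim({"full_name": "JOHN SMITH "})
```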
Key Takeaways
- Extract faithfully: Never modify data during extraction
- Map separately: Field renaming is configuration, not code
- Transform types explicitly: Type conversion has its own failure mode
- Clean with reporting: Track what changed so you detect upstream problems
- Refine for business: Business rules change independently from data quality rules
- Load with transactions: Use upserts, batch inserts, proper error handling
- Keep phases independent: Testable, skippable, debuggable
- Emit events at every phase: The event trail turns 3-hour investigations into 5-minute fixes
Six phases might seem like overhead until you are debugging a production failure. Then it seems like the bare minimum.
This is part of the ETL Pipeline Series on DECYON — real engineering patterns from 20+ years of building production systems.
Read the full version with interactive diagrams at decyon.com