Amit Kumar Singh

Posted on Jun 21

From DataStage and Informatica to Databricks Medallion Architecture: Why Migration Is More Than Code Conversion

#ai #python #machinelearning #automation

Legacy ETL modernization is often described as a technology migration.

Move DataStage jobs to Databricks.
Convert Informatica mappings into PySpark.
Replace legacy workflows with notebooks and Delta tables.

But that description misses the hardest part.

The real challenge is not converting syntax.

The challenge is understanding years of hidden transformation logic, reconstructing data lineage, separating technical processing from business logic, and deciding where each responsibility belongs in a modern architecture.

A DataStage job or Informatica mapping may contain raw ingestion, data cleansing, lookups, joins, business rules, aggregations, error handling, and reporting logic in one workflow.

A Databricks Medallion architecture expects something different.

It separates data processing into clearer layers:

Bronze
Raw ingestion and source preservation
Silver
Cleansing, standardization, enrichment, conformance, and quality controls
Gold
Business-ready models, aggregates, KPIs, reporting datasets, and semantic outputs

That means a successful migration cannot be a blind one-to-one conversion.

It needs to become a metadata and architecture exercise.

⸻

Why One-to-One Conversion Fails

A traditional legacy ETL job often looks like this:

Read source data
→ Filter records
→ Lookup reference data
→ Cleanse values
→ Deduplicate
→ Apply business calculations
→ Aggregate
→ Write reporting output

The problem is that all these responsibilities may exist inside one job, mapping, sequence, or workflow.

For example, a single DataStage job might:

ingest from Oracle
remove cancelled records
trim whitespace
standardize status values
join customer master data
calculate net order amount
aggregate sales by month
write a reporting table

If that entire job is converted directly into one Databricks notebook, the organization may simply recreate the old architecture in a new platform.

The code may run in Databricks, but the design remains difficult to maintain, test, govern, and scale.

The goal should not be:

Convert one legacy job into one notebook.

The goal should be:

Understand what each transformation is doing and place it in the right modern data layer.

⸻

The First Step: Extract Metadata, Not Just Code

A legacy ETL migration should begin by extracting structured metadata from the existing platform.

For DataStage, Informatica, SSIS, Talend, stored procedures, or other ETL tools, useful metadata may include:

job or mapping name
workflow dependencies
source tables, files, and APIs
target tables and files
source-to-target field mappings
joins and lookup logic
filters and conditions
transformation expressions
aggregations
surrogate key generation
reject handling
parameter values
schedules and sequencing
pre-SQL and post-SQL
restart or recovery logic
error-handling behavior

The purpose is to create a structured representation of the legacy job.

Legacy ETL Export
→ Metadata Parser
→ Canonical Metadata Model
→ Transformation Graph
→ Migration Blueprint

This is much more valuable than simply reading transformation code line by line.

⸻

Reconstructing the Transformation Graph

Once the metadata is extracted, the next step is to reconstruct the data lineage and transformation graph.

Consider this fictional example:

orders.csv
↓
filter cancelled orders
↓
lookup customer master
↓
standardize customer status
↓
deduplicate by order_id
↓
calculate order_amount
↓
aggregate monthly sales
↓
monthly_sales_summary

This graph reveals several different kinds of work:

raw ingestion
filtering
enrichment
standardization
deduplication
business calculation
reporting aggregation

These should not all be treated as one technical unit.

The transformation graph helps identify where the data changes, why it changes, and which downstream outputs depend on those changes.

It also makes hidden business logic visible.

⸻

Mapping Legacy ETL Logic to Bronze, Silver, and Gold

The Medallion architecture is useful because it separates responsibilities.

Here is a practical way to classify legacy ETL logic.

Legacy ETL Pattern Meaning Likely Medallion Layer
File, API, database, or CDC extraction Raw source ingestion Bronze
Source preservation and ingestion metadata Capture original source state Bronze
Basic schema enforcement Standardized ingestion Bronze or Silver
Trim, cast, rename, null cleanup Cleansing and standardization Silver
Deduplication Record normalization Silver
Lookup and reference joins Enrichment and conformance Silver
SCD handling Historical dimensional processing Silver
Business calculations Curated business logic Gold
Aggregation and KPI creation Reporting-ready metrics Gold
Dashboard/report output Consumption-ready dataset Gold

The important point is that a legacy component type does not automatically determine the Medallion layer.

For example, a DataStage Transformer stage might perform:

string trimming
null handling
a business calculation
a customer lookup
a reporting aggregation

Those are not all Silver transformations.

The migration process needs to inspect the intent of the logic.

⸻

Example: One Legacy Job Becomes Multiple Databricks Layers

Imagine this fictional legacy ETL workflow:

Oracle Orders
→ Transformer: trim strings and standardize status
→ Lookup: customer master
→ Transformer: calculate net_amount
→ Aggregator: monthly sales by customer
→ Reporting table

A modern Databricks Medallion proposal could look like this:

Bronze Layer
bronze_orders_raw

Ingest raw Oracle orders
Preserve source fields
Add ingestion timestamp
Add source identifier
Add load date
Retain raw records for traceability Silver Layer silver_orders
Trim and standardize string fields
Standardize status values
Validate schema
Apply null-handling rules
Deduplicate order records silver_orders_enriched
Join customer master data
Resolve customer keys
Apply standardized enrichment logic
Calculate normalized net_amount Gold Layer gold_customer_monthly_sales
Aggregate net sales by customer and month
Apply approved reporting definitions
Produce a curated business-ready output

This creates clearer ownership.

Bronze preserves the source.

Silver prepares trusted, reusable data.

Gold provides business-facing outputs.

⸻

What AI Can Assist With

AI can make this migration process faster and more structured.

For example, an AI-assisted migration workflow can help:

summarize legacy job purpose
parse transformation expressions
identify source and target dependencies
reconstruct lineage
classify transformations by intent
detect embedded business logic
suggest Bronze, Silver, and Gold placement
draft PySpark or Spark SQL
generate Delta table DDL
propose data-quality checks
generate reconciliation logic
create migration documentation
identify unclear or risky rules

Suppose a legacy rule says:

IF status_code = 'C' THEN 'Closed' ELSE 'Open'

An AI system can suggest:

Likely classification:
Silver-layer standardization rule
Potential concern:
Confirm whether status_code = 'C' means Closed across all source systems.
Recommended action:
Human review required before finalizing the standardization rule.

That is useful because the system is not pretending to know the business definition.

It is surfacing the decision that must be made.

⸻

What Still Requires Human Review

AI can accelerate analysis and drafting, but human accountability remains essential.

Humans should continue to make final decisions about:

business definitions
source-of-truth selection
financial logic
regulatory calculations
data-retention policies
exception handling
data-quality thresholds
reporting metrics
production deployment approval

For example, a legacy aggregation may calculate:

SUM(revenue) BY region, month

The technical migration system may recommend Gold.

But a human must still answer:

Is revenue gross or net?
Are refunds included?
Does month use calendar month or fiscal month?
Is region derived from customer, store, or sales territory?
Is this metric reusable across reports?

Those are business and governance questions, not merely coding questions.

⸻

The Role of a Canonical Metadata Model

A Canonical Metadata Model can become the bridge between legacy ETL and modern data architecture.

It can represent:

sources
targets
columns
transformations
joins
keys
data types
quality expectations
lineage
business definitions
approval status
assumptions
migration decisions

Once metadata is normalized, multiple outputs can be generated from the same source of truth.

Canonical Metadata Model
→ Databricks Medallion Architecture Proposal
→ PySpark / Spark SQL
→ Delta Table DDL
→ Data Quality Rules
→ Reconciliation Checks
→ Lineage Documentation
→ Migration Specification
→ Human Review Queue

This is more powerful than isolated code conversion because it creates reusable engineering intelligence.

⸻

How Data Engineering Copilot Could Support Legacy ETL Migration

A future Data Engineering Copilot capability could act as a Legacy ETL Migration Copilot.

Inputs could include:

DataStage export files
Informatica mapping exports
workflow metadata
SQL procedures
ETL job documentation
source-to-target mappings
data model documentation

The workflow could be:

Legacy ETL Export
→ Parse Job Metadata
→ Build Transformation Graph
→ Identify Dependencies
→ Classify Transformation Intent
→ Propose Bronze / Silver / Gold Layers
→ Generate Migration Artifacts
→ Flag Ambiguity
→ Route for Human Review

Potential outputs could include:

Medallion architecture recommendation
Bronze, Silver, and Gold pipeline design
Databricks notebook structure
PySpark code drafts
Spark SQL transformations
Delta table definitions
data-quality rules
reconciliation checks
migration documentation
dependency analysis
lineage diagrams
review questions for unresolved logic

The key is not automatic migration without oversight.

The key is to turn hidden legacy ETL logic into a reviewable modernization blueprint.

⸻

Migration Is a Metadata and Architecture Problem

Many legacy ETL modernization efforts fail because they focus only on tool replacement.

But old ETL jobs often contain years of accumulated business knowledge.

That knowledge may be undocumented.

It may be hidden inside transformations, lookups, stored procedures, filters, sequencing rules, and exception logic.

A successful migration must preserve that knowledge while improving the architecture.

That means the migration process should:

Extract metadata
→ Reconstruct lineage
→ Identify transformation intent
→ Separate technical and business responsibilities
→ Propose Medallion layers
→ Generate reviewable artifacts
→ Capture assumptions
→ Require human approval

The future of ETL modernization is not simply translating one tool into another.

It is making legacy data logic visible, structured, governed, and reusable.

⸻

Closing Thought

DataStage and Informatica jobs were often built in an era when ingestion, cleansing, business logic, and reporting were tightly combined.

Databricks Medallion architecture gives teams an opportunity to separate those responsibilities and create cleaner, more maintainable data products.

But that opportunity is lost when organizations perform blind one-to-one conversion.

The better approach is to treat legacy ETL modernization as a metadata-driven architecture exercise.

Do not just convert legacy jobs into new code.
Convert hidden transformation logic into a reviewable modernization blueprint.

That is where AI-assisted metadata platforms can create real value for enterprise data engineering teams.

⸻

Data Engineering Copilot is a personal product initiative focused on metadata-driven engineering and governed delivery workflows.

Illustrative examples in this article use fictional metadata only. No client, employer, production, or proprietary information is included.

Top comments (1)

Nazar Boyko • Jun 21

The one-to-one port warning is the part more teams need to hear. Moving a DataStage job straight into a single notebook just rebuilds the old tangle on a shinier platform, and you've still got ingestion and business rules fused in one place. What makes the piece work is starting from metadata, since that drags the hidden lineage into the open before anyone writes a line of PySpark. I also liked that your AI example flags the "C means Closed" rule for a human instead of guessing it. That's the honest use of it, surfacing the decision rather than pretending to know what the business meant.