DEV Community

Amit Kumar Singh
Amit Kumar Singh

Posted on

From DataStage and Informatica to Databricks Medallion Architecture: Why Migration Is More Than Code Conversion

Legacy ETL modernization is often described as a technology migration.

Move DataStage jobs to Databricks.
Convert Informatica mappings into PySpark.
Replace legacy workflows with notebooks and Delta tables.

But that description misses the hardest part.

The real challenge is not converting syntax.

The challenge is understanding years of hidden transformation logic, reconstructing data lineage, separating technical processing from business logic, and deciding where each responsibility belongs in a modern architecture.

A DataStage job or Informatica mapping may contain raw ingestion, data cleansing, lookups, joins, business rules, aggregations, error handling, and reporting logic in one workflow.

A Databricks Medallion architecture expects something different.

It separates data processing into clearer layers:

Bronze
Raw ingestion and source preservation
Silver
Cleansing, standardization, enrichment, conformance, and quality controls
Gold
Business-ready models, aggregates, KPIs, reporting datasets, and semantic outputs

That means a successful migration cannot be a blind one-to-one conversion.

It needs to become a metadata and architecture exercise.

Why One-to-One Conversion Fails

A traditional legacy ETL job often looks like this:

Read source data
→ Filter records
→ Lookup reference data
→ Cleanse values
→ Deduplicate
→ Apply business calculations
→ Aggregate
→ Write reporting output

The problem is that all these responsibilities may exist inside one job, mapping, sequence, or workflow.

For example, a single DataStage job might:

  • ingest from Oracle
  • remove cancelled records
  • trim whitespace
  • standardize status values
  • join customer master data
  • calculate net order amount
  • aggregate sales by month
  • write a reporting table

If that entire job is converted directly into one Databricks notebook, the organization may simply recreate the old architecture in a new platform.

The code may run in Databricks, but the design remains difficult to maintain, test, govern, and scale.

The goal should not be:

Convert one legacy job into one notebook.

The goal should be:

Understand what each transformation is doing and place it in the right modern data layer.

The First Step: Extract Metadata, Not Just Code

A legacy ETL migration should begin by extracting structured metadata from the existing platform.

For DataStage, Informatica, SSIS, Talend, stored procedures, or other ETL tools, useful metadata may include:

  • job or mapping name
  • workflow dependencies
  • source tables, files, and APIs
  • target tables and files
  • source-to-target field mappings
  • joins and lookup logic
  • filters and conditions
  • transformation expressions
  • aggregations
  • surrogate key generation
  • reject handling
  • parameter values
  • schedules and sequencing
  • pre-SQL and post-SQL
  • restart or recovery logic
  • error-handling behavior

The purpose is to create a structured representation of the legacy job.

Legacy ETL Export
→ Metadata Parser
→ Canonical Metadata Model
→ Transformation Graph
→ Migration Blueprint

This is much more valuable than simply reading transformation code line by line.

Reconstructing the Transformation Graph

Once the metadata is extracted, the next step is to reconstruct the data lineage and transformation graph.

Consider this fictional example:

orders.csv

filter cancelled orders

lookup customer master

standardize customer status

deduplicate by order_id

calculate order_amount

aggregate monthly sales

monthly_sales_summary

This graph reveals several different kinds of work:

  • raw ingestion
  • filtering
  • enrichment
  • standardization
  • deduplication
  • business calculation
  • reporting aggregation

These should not all be treated as one technical unit.

The transformation graph helps identify where the data changes, why it changes, and which downstream outputs depend on those changes.

It also makes hidden business logic visible.

Mapping Legacy ETL Logic to Bronze, Silver, and Gold

The Medallion architecture is useful because it separates responsibilities.

Here is a practical way to classify legacy ETL logic.

Legacy ETL Pattern Meaning Likely Medallion Layer
File, API, database, or CDC extraction Raw source ingestion Bronze
Source preservation and ingestion metadata Capture original source state Bronze
Basic schema enforcement Standardized ingestion Bronze or Silver
Trim, cast, rename, null cleanup Cleansing and standardization Silver
Deduplication Record normalization Silver
Lookup and reference joins Enrichment and conformance Silver
SCD handling Historical dimensional processing Silver
Business calculations Curated business logic Gold
Aggregation and KPI creation Reporting-ready metrics Gold
Dashboard/report output Consumption-ready dataset Gold

The important point is that a legacy component type does not automatically determine the Medallion layer.

For example, a DataStage Transformer stage might perform:

  • string trimming
  • null handling
  • a business calculation
  • a customer lookup
  • a reporting aggregation

Those are not all Silver transformations.

The migration process needs to inspect the intent of the logic.

Example: One Legacy Job Becomes Multiple Databricks Layers

Imagine this fictional legacy ETL workflow:

Oracle Orders
→ Transformer: trim strings and standardize status
→ Lookup: customer master
→ Transformer: calculate net_amount
→ Aggregator: monthly sales by customer
→ Reporting table

A modern Databricks Medallion proposal could look like this:

Bronze Layer
bronze_orders_raw

  • Ingest raw Oracle orders
  • Preserve source fields
  • Add ingestion timestamp
  • Add source identifier
  • Add load date
  • Retain raw records for traceability Silver Layer silver_orders
  • Trim and standardize string fields
  • Standardize status values
  • Validate schema
  • Apply null-handling rules
  • Deduplicate order records silver_orders_enriched
  • Join customer master data
  • Resolve customer keys
  • Apply standardized enrichment logic
  • Calculate normalized net_amount Gold Layer gold_customer_monthly_sales
  • Aggregate net sales by customer and month
  • Apply approved reporting definitions
  • Produce a curated business-ready output

This creates clearer ownership.

Bronze preserves the source.

Silver prepares trusted, reusable data.

Gold provides business-facing outputs.

What AI Can Assist With

AI can make this migration process faster and more structured.

For example, an AI-assisted migration workflow can help:

  • summarize legacy job purpose
  • parse transformation expressions
  • identify source and target dependencies
  • reconstruct lineage
  • classify transformations by intent
  • detect embedded business logic
  • suggest Bronze, Silver, and Gold placement
  • draft PySpark or Spark SQL
  • generate Delta table DDL
  • propose data-quality checks
  • generate reconciliation logic
  • create migration documentation
  • identify unclear or risky rules

Suppose a legacy rule says:

IF status_code = 'C' THEN 'Closed' ELSE 'Open'

An AI system can suggest:

Likely classification:
Silver-layer standardization rule
Potential concern:
Confirm whether status_code = 'C' means Closed across all source systems.
Recommended action:
Human review required before finalizing the standardization rule.

That is useful because the system is not pretending to know the business definition.

It is surfacing the decision that must be made.

What Still Requires Human Review

AI can accelerate analysis and drafting, but human accountability remains essential.

Humans should continue to make final decisions about:

  • business definitions
  • source-of-truth selection
  • financial logic
  • regulatory calculations
  • data-retention policies
  • exception handling
  • data-quality thresholds
  • reporting metrics
  • production deployment approval

For example, a legacy aggregation may calculate:

SUM(revenue) BY region, month

The technical migration system may recommend Gold.

But a human must still answer:

  • Is revenue gross or net?
  • Are refunds included?
  • Does month use calendar month or fiscal month?
  • Is region derived from customer, store, or sales territory?
  • Is this metric reusable across reports?

Those are business and governance questions, not merely coding questions.

The Role of a Canonical Metadata Model

A Canonical Metadata Model can become the bridge between legacy ETL and modern data architecture.

It can represent:

  • sources
  • targets
  • columns
  • transformations
  • joins
  • keys
  • data types
  • quality expectations
  • lineage
  • business definitions
  • approval status
  • assumptions
  • migration decisions

Once metadata is normalized, multiple outputs can be generated from the same source of truth.

Canonical Metadata Model
→ Databricks Medallion Architecture Proposal
→ PySpark / Spark SQL
→ Delta Table DDL
→ Data Quality Rules
→ Reconciliation Checks
→ Lineage Documentation
→ Migration Specification
→ Human Review Queue

This is more powerful than isolated code conversion because it creates reusable engineering intelligence.

How Data Engineering Copilot Could Support Legacy ETL Migration

A future Data Engineering Copilot capability could act as a Legacy ETL Migration Copilot.

Inputs could include:

  • DataStage export files
  • Informatica mapping exports
  • workflow metadata
  • SQL procedures
  • ETL job documentation
  • source-to-target mappings
  • data model documentation

The workflow could be:

Legacy ETL Export
→ Parse Job Metadata
→ Build Transformation Graph
→ Identify Dependencies
→ Classify Transformation Intent
→ Propose Bronze / Silver / Gold Layers
→ Generate Migration Artifacts
→ Flag Ambiguity
→ Route for Human Review

Potential outputs could include:

  • Medallion architecture recommendation
  • Bronze, Silver, and Gold pipeline design
  • Databricks notebook structure
  • PySpark code drafts
  • Spark SQL transformations
  • Delta table definitions
  • data-quality rules
  • reconciliation checks
  • migration documentation
  • dependency analysis
  • lineage diagrams
  • review questions for unresolved logic

The key is not automatic migration without oversight.

The key is to turn hidden legacy ETL logic into a reviewable modernization blueprint.

Migration Is a Metadata and Architecture Problem

Many legacy ETL modernization efforts fail because they focus only on tool replacement.

But old ETL jobs often contain years of accumulated business knowledge.

That knowledge may be undocumented.

It may be hidden inside transformations, lookups, stored procedures, filters, sequencing rules, and exception logic.

A successful migration must preserve that knowledge while improving the architecture.

That means the migration process should:

Extract metadata
→ Reconstruct lineage
→ Identify transformation intent
→ Separate technical and business responsibilities
→ Propose Medallion layers
→ Generate reviewable artifacts
→ Capture assumptions
→ Require human approval

The future of ETL modernization is not simply translating one tool into another.

It is making legacy data logic visible, structured, governed, and reusable.

Closing Thought

DataStage and Informatica jobs were often built in an era when ingestion, cleansing, business logic, and reporting were tightly combined.

Databricks Medallion architecture gives teams an opportunity to separate those responsibilities and create cleaner, more maintainable data products.

But that opportunity is lost when organizations perform blind one-to-one conversion.

The better approach is to treat legacy ETL modernization as a metadata-driven architecture exercise.

Do not just convert legacy jobs into new code.
Convert hidden transformation logic into a reviewable modernization blueprint.

That is where AI-assisted metadata platforms can create real value for enterprise data engineering teams.

Data Engineering Copilot is a personal product initiative focused on metadata-driven engineering and governed delivery workflows.

Illustrative examples in this article use fictional metadata only. No client, employer, production, or proprietary information is included.

Top comments (0)