DEV Community

Amit Kumar Singh
Amit Kumar Singh

Posted on

From Informatica XML to Snowflake: Why ETL Migration Needs a Governed Delivery Workflow

Legacy ETL modernization is often described as a conversion exercise:

Informatica mapping in. Snowflake SQL out.

That framing is incomplete.

A real migration is not only about translating expressions. It is about preserving transformation intent, identifying what is missing, documenting assumptions, validating target behavior, and ensuring that someone is accountable for decisions before generated artifacts are released.

I have been building a prototype called Data Engineering Copilot around that idea.

The latest capability starts from an Informatica PowerCenter XML export and produces a governed Snowflake migration delivery packet.

The workflow is:

Informatica PowerCenter XML
        ↓
Metadata and Lineage Extraction
        ↓
Canonical Metadata Model
        ↓
Snowflake Artifact Generation
        ↓
Validation and Migration Risk Assessment
        ↓
Human Review and Approval
        ↓
Governed Release Package
Enter fullscreen mode Exit fullscreen mode

The problem with simple code conversion

An Informatica mapping can contain far more than a direct field-to-field relationship.

A typical mapping may include:

  • source definitions and target definitions
  • source qualifiers and filters
  • expression transformations
  • reusable transformations
  • lookups
  • constants and default values
  • mapping parameters
  • target load order
  • connector-level lineage
  • update strategy or sequence-generation behavior
  • target fields with no visible incoming connector

A generator that only reads source and target columns may produce SQL that looks valid but does not preserve the original delivery intent.

That is risky.

For example, imagine a target field that has no visible source column. It may still be populated through:

  • a constant such as 'SOURCE_A'
  • a default such as 'XNA'
  • a surrogate-key lookup
  • a runtime parameter
  • a load timestamp
  • a sequence generator
  • a business decision that was never documented in the mapping

If the tool silently inserts NULL, the SQL may compile while the migration is functionally wrong.

The prototype approach

The Data Engineering Copilot prototype accepts two starting points:

  1. Business Requirement / Source-to-Target Mapping
  2. Legacy ETL Mapping

For the legacy path, the first supported adapter is Informatica PowerCenter XML.

The important design principle is that both paths converge into the same canonical metadata model.

Business Requirement / STTM ─┐
                             ├─ Canonical Metadata Model
Informatica XML ─────────────┘
                                      ↓
                             Artifact Factory
                                      ↓
                       Validation and Review Gate
                                      ↓
                          Human Approval and Export
Enter fullscreen mode Exit fullscreen mode

This means the product is not just an Informatica parser.

It is a governed metadata-to-delivery platform that can accept multiple sources of truth.

What the Informatica adapter extracts

For the initial version, the adapter reads metadata from PowerCenter XML such as:

  • SOURCE and SOURCEFIELD
  • TARGET and TARGETFIELD
  • TRANSFORMATION and TRANSFORMFIELD
  • INSTANCE
  • CONNECTOR
  • TABLEATTRIBUTE
  • source filters
  • lookup table names and conditions
  • transformation expressions
  • explicit default values
  • mapping parameters

From this, the platform builds a field-level canonical model with information such as:

Canonical field Example value
Source table L0_VLE_NACE
Source column CD_NACE
Target table L1_D_NACE
Target column CD_NACE
Transformation type Expression
Transformation logic TRIM(src.CD_NACE)
Filter condition business date predicate
Lookup table reference/surrogate-key table
Lineage path source → qualifier → expression → target expression → target
Migration status Supported with Review / Manual Decision Required

Translating common legacy patterns

The first version supports a transparent subset of common Informatica patterns.

Expression transformations

An Informatica expression such as:

ltrim(rtrim(CD_NACE_in))
Enter fullscreen mode Exit fullscreen mode

can become:

TRIM(src.CD_NACE)
Enter fullscreen mode Exit fullscreen mode

A custom defaulting rule such as:

:UDF.DEFAULTSTRINGNULL(T_NAME_in)
Enter fullscreen mode Exit fullscreen mode

can become:

COALESCE(NULLIF(TRIM(src.T_NAME), ''), 'XNA')
Enter fullscreen mode Exit fullscreen mode

A constant value such as:

'VLE'
Enter fullscreen mode Exit fullscreen mode

can become:

'VLE' AS CD_SOURCE_SYSTEM
Enter fullscreen mode Exit fullscreen mode

A numeric default such as:

-1
Enter fullscreen mode Exit fullscreen mode

can become:

-1 AS ID_NACE_PARENT
Enter fullscreen mode Exit fullscreen mode

The platform keeps these as explicit derived values in the canonical model rather than pretending they came from a physical source column.

Source filters and runtime parameters

A Source Qualifier may contain a filter similar to:

edw_business_date = to_date('$$BUSINESS_DATE','YYYYMMDDHH24MISS')
Enter fullscreen mode Exit fullscreen mode

The target Snowflake pattern can preserve that intent using a runtime parameter or session-variable approach:

WHERE src.EDW_BUSINESS_DATE =
      TO_TIMESTAMP_NTZ(:BUSINESS_DATE, 'YYYYMMDDHH24MISS')
Enter fullscreen mode Exit fullscreen mode

The exact runtime parameter implementation still needs to be confirmed for the target deployment framework. That is a deployment decision, not something a metadata generator should silently invent.

Lookup conversion is not always automatic

Lookups are a good example of why governed delivery matters.

An Informatica Lookup Procedure may include:

  • a lookup table
  • a lookup condition
  • a source filter
  • cache behavior
  • multiple-match behavior
  • dynamic or static lookup semantics

A basic Snowflake translation may propose a LEFT JOIN.

But that does not prove the join is semantically equivalent.

The migration still needs review for questions such as:

  • Is the lookup table current, historical, or slowly changing?
  • What happens when multiple matches exist?
  • Does the lookup require effective-date logic?
  • Is the lookup output a surrogate key?
  • Was cache behavior masking duplicate or late-arriving records?
  • Should the target use a join, a MERGE, or a separate key-resolution process?

The prototype therefore generates a reviewable join candidate but creates a migration finding:

Status: Needs Review
Reason: Lookup conversion requires confirmation of join semantics,
duplicate-match behavior, and reference-table ownership.
Enter fullscreen mode Exit fullscreen mode

The governed Release Gate

This is the part that matters most to me.

The platform does not stop at generated SQL.

It creates a validation and review workflow with statuses such as:

Draft
Under Review
Approved with Conditions
Approved
Rejected
Blocked
Enter fullscreen mode Exit fullscreen mode

The release gate can identify findings such as:

Finding Example action
Unmapped target field Confirm source, approved default, or explicit exclusion
Missing target datatype Confirm datatype before DDL release
Lookup conversion Validate join semantics and test results
Unsupported transformation Record manual migration decision
Missing date population rule Select source field, runtime parameter, timestamp, or nullable target decision
Complex expression Add unit test and business approval

For unresolved fields, the SQL intentionally remains visible:

NULL /* REVIEW REQUIRED: target field has no approved source/default */
Enter fullscreen mode Exit fullscreen mode

That is not a failure of the product.

It is the product preventing a false sense of automation.

Why human review remains necessary

AI and rule-based conversion can accelerate the mechanical parts of migration:

  • metadata extraction
  • connector tracing
  • expression inventory
  • type translation
  • SQL drafting
  • DQ rule suggestions
  • lineage documentation
  • risk classification

But a migration still requires decisions that depend on business meaning and target-state architecture.

For example, an unmapped effective-date field could mean very different things:

Use source business date
Use current timestamp
Use target load timestamp
Populate from a configuration parameter
Allow nulls and revise DDL
Exclude the column after SME approval
Enter fullscreen mode Exit fullscreen mode

A tool can surface the decision, propose options, and preserve the evidence.

A human should approve the final choice.

The generated delivery packet

Once review is complete, the prototype generates a delivery package containing:

  • canonical metadata model
  • source-to-target lineage
  • Snowflake DDL
  • Snowflake transformation SQL
  • data dictionary
  • technical specification
  • data quality rules
  • migration risk assessment
  • review decision history
  • deployment manifest
  • audit trail

The package should only be marked deployment-ready when high-risk findings have documented resolutions.

That is the next improvement I am working on: making approval decisions directly update release readiness and the exported findings package.

What this changes

The goal is not to claim that Informatica can be replaced by a single AI prompt.

The goal is to make migration delivery more reliable.

Instead of this:

Legacy Mapping
      ↓
Manual interpretation
      ↓
Spreadsheet updates
      ↓
SQL generation
      ↓
Late discovery of missing logic
Enter fullscreen mode Exit fullscreen mode

the target workflow becomes:

Legacy Mapping
      ↓
Structured metadata extraction
      ↓
Canonical representation
      ↓
Generated artifacts
      ↓
Visible assumptions and risks
      ↓
Human approval
      ↓
Traceable release package
Enter fullscreen mode Exit fullscreen mode

That is the difference between generating code and governing a migration.

Closing thought

Data migration programs rarely fail because a team cannot write SQL.

They fail because business logic, defaults, lookup behavior, data quality expectations, and ownership decisions are hidden across mappings, emails, spreadsheets, and tribal knowledge.

A governed metadata model gives those decisions a place to live.

That is the direction I am building toward with Data Engineering Copilot: start from business intent or legacy implementation metadata, generate delivery artifacts, and make every important assumption reviewable before release.

DataEngineering #Informatica #Snowflake #ETLModernization #DataMigration #MetadataDrivenDevelopment #DataGovernance #DataArchitecture #AIEngineering

Top comments (1)

Collapse
 
alexshev profile image
Alex Shev

This is the right migration framing. Translating ETL syntax is the easy-looking part; preserving intent is the real work. I would want every generated mapping to carry assumptions, validation cases, missing semantics, and an owner decision before it reaches Snowflake. Otherwise automation just moves ambiguity faster.