Amit Kumar Singh

Posted on Jun 27

From Informatica XML to Snowflake: Why ETL Migration Needs a Governed Delivery Workflow

#ai #devops #opensource #python

Legacy ETL modernization is often described as a conversion exercise:

Informatica mapping in. Snowflake SQL out.

That framing is incomplete.

A real migration is not only about translating expressions. It is about preserving transformation intent, identifying what is missing, documenting assumptions, validating target behavior, and ensuring that someone is accountable for decisions before generated artifacts are released.

I have been building a prototype called Data Engineering Copilot around that idea.

The latest capability starts from an Informatica PowerCenter XML export and produces a governed Snowflake migration delivery packet.

The workflow is:

Informatica PowerCenter XML
        ↓
Metadata and Lineage Extraction
        ↓
Canonical Metadata Model
        ↓
Snowflake Artifact Generation
        ↓
Validation and Migration Risk Assessment
        ↓
Human Review and Approval
        ↓
Governed Release Package

The problem with simple code conversion

An Informatica mapping can contain far more than a direct field-to-field relationship.

A typical mapping may include:

source definitions and target definitions
source qualifiers and filters
expression transformations
reusable transformations
lookups
constants and default values
mapping parameters
target load order
connector-level lineage
update strategy or sequence-generation behavior
target fields with no visible incoming connector

A generator that only reads source and target columns may produce SQL that looks valid but does not preserve the original delivery intent.

That is risky.

For example, imagine a target field that has no visible source column. It may still be populated through:

a constant such as 'SOURCE_A'
a default such as 'XNA'
a surrogate-key lookup
a runtime parameter
a load timestamp
a sequence generator
a business decision that was never documented in the mapping

If the tool silently inserts NULL, the SQL may compile while the migration is functionally wrong.

The prototype approach

The Data Engineering Copilot prototype accepts two starting points:

Business Requirement / Source-to-Target Mapping
Legacy ETL Mapping

For the legacy path, the first supported adapter is Informatica PowerCenter XML.

The important design principle is that both paths converge into the same canonical metadata model.

Business Requirement / STTM ─┐
                             ├─ Canonical Metadata Model
Informatica XML ─────────────┘
                                      ↓
                             Artifact Factory
                                      ↓
                       Validation and Review Gate
                                      ↓
                          Human Approval and Export

This means the product is not just an Informatica parser.

It is a governed metadata-to-delivery platform that can accept multiple sources of truth.

What the Informatica adapter extracts

For the initial version, the adapter reads metadata from PowerCenter XML such as:

SOURCE and SOURCEFIELD
TARGET and TARGETFIELD
TRANSFORMATION and TRANSFORMFIELD
INSTANCE
CONNECTOR
TABLEATTRIBUTE
source filters
lookup table names and conditions
transformation expressions
explicit default values
mapping parameters

From this, the platform builds a field-level canonical model with information such as:

Canonical field	Example value
Source table	`L0_VLE_NACE`
Source column	`CD_NACE`
Target table	`L1_D_NACE`
Target column	`CD_NACE`
Transformation type	Expression
Transformation logic	`TRIM(src.CD_NACE)`
Filter condition	business date predicate
Lookup table	reference/surrogate-key table
Lineage path	source → qualifier → expression → target expression → target
Migration status	Supported with Review / Manual Decision Required

Translating common legacy patterns

The first version supports a transparent subset of common Informatica patterns.

Expression transformations

An Informatica expression such as:

ltrim(rtrim(CD_NACE_in))

can become:

TRIM(src.CD_NACE)

A custom defaulting rule such as:

:UDF.DEFAULTSTRINGNULL(T_NAME_in)

can become:

COALESCE(NULLIF(TRIM(src.T_NAME), ''), 'XNA')

A constant value such as:

'VLE'

can become:

'VLE' AS CD_SOURCE_SYSTEM

A numeric default such as:

-1

can become:

-1 AS ID_NACE_PARENT

The platform keeps these as explicit derived values in the canonical model rather than pretending they came from a physical source column.

Source filters and runtime parameters

A Source Qualifier may contain a filter similar to:

edw_business_date = to_date('$$BUSINESS_DATE','YYYYMMDDHH24MISS')

The target Snowflake pattern can preserve that intent using a runtime parameter or session-variable approach:

WHERE src.EDW_BUSINESS_DATE =
      TO_TIMESTAMP_NTZ(:BUSINESS_DATE, 'YYYYMMDDHH24MISS')

The exact runtime parameter implementation still needs to be confirmed for the target deployment framework. That is a deployment decision, not something a metadata generator should silently invent.

Lookup conversion is not always automatic

Lookups are a good example of why governed delivery matters.

An Informatica Lookup Procedure may include:

a lookup table
a lookup condition
a source filter
cache behavior
multiple-match behavior
dynamic or static lookup semantics

A basic Snowflake translation may propose a LEFT JOIN.

But that does not prove the join is semantically equivalent.

The migration still needs review for questions such as:

Is the lookup table current, historical, or slowly changing?
What happens when multiple matches exist?
Does the lookup require effective-date logic?
Is the lookup output a surrogate key?
Was cache behavior masking duplicate or late-arriving records?
Should the target use a join, a MERGE, or a separate key-resolution process?

The prototype therefore generates a reviewable join candidate but creates a migration finding:

Status: Needs Review
Reason: Lookup conversion requires confirmation of join semantics,
duplicate-match behavior, and reference-table ownership.

The governed Release Gate

This is the part that matters most to me.

The platform does not stop at generated SQL.

It creates a validation and review workflow with statuses such as:

Draft
Under Review
Approved with Conditions
Approved
Rejected
Blocked

The release gate can identify findings such as:

Finding	Example action
Unmapped target field	Confirm source, approved default, or explicit exclusion
Missing target datatype	Confirm datatype before DDL release
Lookup conversion	Validate join semantics and test results
Unsupported transformation	Record manual migration decision
Missing date population rule	Select source field, runtime parameter, timestamp, or nullable target decision
Complex expression	Add unit test and business approval

For unresolved fields, the SQL intentionally remains visible:

NULL /* REVIEW REQUIRED: target field has no approved source/default */

That is not a failure of the product.

It is the product preventing a false sense of automation.

Why human review remains necessary

AI and rule-based conversion can accelerate the mechanical parts of migration:

metadata extraction
connector tracing
expression inventory
type translation
SQL drafting
DQ rule suggestions
lineage documentation
risk classification

But a migration still requires decisions that depend on business meaning and target-state architecture.

For example, an unmapped effective-date field could mean very different things:

Use source business date
Use current timestamp
Use target load timestamp
Populate from a configuration parameter
Allow nulls and revise DDL
Exclude the column after SME approval

A tool can surface the decision, propose options, and preserve the evidence.

A human should approve the final choice.

The generated delivery packet

Once review is complete, the prototype generates a delivery package containing:

canonical metadata model
source-to-target lineage
Snowflake DDL
Snowflake transformation SQL
data dictionary
technical specification
data quality rules
migration risk assessment
review decision history
deployment manifest
audit trail

The package should only be marked deployment-ready when high-risk findings have documented resolutions.

That is the next improvement I am working on: making approval decisions directly update release readiness and the exported findings package.

What this changes

The goal is not to claim that Informatica can be replaced by a single AI prompt.

The goal is to make migration delivery more reliable.

Instead of this:

Legacy Mapping
      ↓
Manual interpretation
      ↓
Spreadsheet updates
      ↓
SQL generation
      ↓
Late discovery of missing logic

the target workflow becomes:

Legacy Mapping
      ↓
Structured metadata extraction
      ↓
Canonical representation
      ↓
Generated artifacts
      ↓
Visible assumptions and risks
      ↓
Human approval
      ↓
Traceable release package

That is the difference between generating code and governing a migration.

Closing thought

Data migration programs rarely fail because a team cannot write SQL.

They fail because business logic, defaults, lookup behavior, data quality expectations, and ownership decisions are hidden across mappings, emails, spreadsheets, and tribal knowledge.

A governed metadata model gives those decisions a place to live.

That is the direction I am building toward with Data Engineering Copilot: start from business intent or legacy implementation metadata, generate delivery artifacts, and make every important assumption reviewable before release.

DataEngineering #Informatica #Snowflake #ETLModernization #DataMigration #MetadataDrivenDevelopment #DataGovernance #DataArchitecture #AIEngineering

Top comments (1)

Alex Shev • Jun 27

This is the right migration framing. Translating ETL syntax is the easy-looking part; preserving intent is the real work. I would want every generated mapping to carry assumptions, validation cases, missing semantics, and an owner decision before it reaches Snowflake. Otherwise automation just moves ambiguity faster.