<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amit Kumar Singh</title>
    <description>The latest articles on DEV Community by Amit Kumar Singh (@amising6).</description>
    <link>https://dev.to/amising6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3983416%2F9c88a36c-9ccd-4dc8-94dc-c427c5252ff4.png</url>
      <title>DEV Community: Amit Kumar Singh</title>
      <link>https://dev.to/amising6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amising6"/>
    <language>en</language>
    <item>
      <title>From Informatica XML to Snowflake: Why ETL Migration Needs a Governed Delivery Workflow</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Sat, 27 Jun 2026 12:38:13 +0000</pubDate>
      <link>https://dev.to/amising6/from-informatica-xml-to-snowflake-why-etl-migration-needs-a-governed-delivery-workflow-6kn</link>
      <guid>https://dev.to/amising6/from-informatica-xml-to-snowflake-why-etl-migration-needs-a-governed-delivery-workflow-6kn</guid>
      <description>&lt;p&gt;Legacy ETL modernization is often described as a conversion exercise:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Informatica mapping in. Snowflake SQL out.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That framing is incomplete.&lt;/p&gt;

&lt;p&gt;A real migration is not only about translating expressions. It is about preserving transformation intent, identifying what is missing, documenting assumptions, validating target behavior, and ensuring that someone is accountable for decisions before generated artifacts are released.&lt;/p&gt;

&lt;p&gt;I have been building a prototype called &lt;strong&gt;Data Engineering Copilot&lt;/strong&gt; around that idea.&lt;/p&gt;

&lt;p&gt;The latest capability starts from an Informatica PowerCenter XML export and produces a governed Snowflake migration delivery packet.&lt;/p&gt;

&lt;p&gt;The workflow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Informatica PowerCenter XML
        ↓
Metadata and Lineage Extraction
        ↓
Canonical Metadata Model
        ↓
Snowflake Artifact Generation
        ↓
Validation and Migration Risk Assessment
        ↓
Human Review and Approval
        ↓
Governed Release Package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The problem with simple code conversion
&lt;/h2&gt;

&lt;p&gt;An Informatica mapping can contain far more than a direct field-to-field relationship.&lt;/p&gt;

&lt;p&gt;A typical mapping may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source definitions and target definitions&lt;/li&gt;
&lt;li&gt;source qualifiers and filters&lt;/li&gt;
&lt;li&gt;expression transformations&lt;/li&gt;
&lt;li&gt;reusable transformations&lt;/li&gt;
&lt;li&gt;lookups&lt;/li&gt;
&lt;li&gt;constants and default values&lt;/li&gt;
&lt;li&gt;mapping parameters&lt;/li&gt;
&lt;li&gt;target load order&lt;/li&gt;
&lt;li&gt;connector-level lineage&lt;/li&gt;
&lt;li&gt;update strategy or sequence-generation behavior&lt;/li&gt;
&lt;li&gt;target fields with no visible incoming connector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A generator that only reads source and target columns may produce SQL that looks valid but does not preserve the original delivery intent.&lt;/p&gt;

&lt;p&gt;That is risky.&lt;/p&gt;

&lt;p&gt;For example, imagine a target field that has no visible source column. It may still be populated through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a constant such as &lt;code&gt;'SOURCE_A'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a default such as &lt;code&gt;'XNA'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;a surrogate-key lookup&lt;/li&gt;
&lt;li&gt;a runtime parameter&lt;/li&gt;
&lt;li&gt;a load timestamp&lt;/li&gt;
&lt;li&gt;a sequence generator&lt;/li&gt;
&lt;li&gt;a business decision that was never documented in the mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the tool silently inserts &lt;code&gt;NULL&lt;/code&gt;, the SQL may compile while the migration is functionally wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prototype approach
&lt;/h2&gt;

&lt;p&gt;The Data Engineering Copilot prototype accepts two starting points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Business Requirement / Source-to-Target Mapping&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Legacy ETL Mapping&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the legacy path, the first supported adapter is Informatica PowerCenter XML.&lt;/p&gt;

&lt;p&gt;The important design principle is that both paths converge into the same canonical metadata model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Business Requirement / STTM ─┐
                             ├─ Canonical Metadata Model
Informatica XML ─────────────┘
                                      ↓
                             Artifact Factory
                                      ↓
                       Validation and Review Gate
                                      ↓
                          Human Approval and Export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the product is not just an Informatica parser.&lt;/p&gt;

&lt;p&gt;It is a governed metadata-to-delivery platform that can accept multiple sources of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Informatica adapter extracts
&lt;/h2&gt;

&lt;p&gt;For the initial version, the adapter reads metadata from PowerCenter XML such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SOURCE&lt;/code&gt; and &lt;code&gt;SOURCEFIELD&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TARGET&lt;/code&gt; and &lt;code&gt;TARGETFIELD&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TRANSFORMATION&lt;/code&gt; and &lt;code&gt;TRANSFORMFIELD&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;INSTANCE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CONNECTOR&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TABLEATTRIBUTE&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;source filters&lt;/li&gt;
&lt;li&gt;lookup table names and conditions&lt;/li&gt;
&lt;li&gt;transformation expressions&lt;/li&gt;
&lt;li&gt;explicit default values&lt;/li&gt;
&lt;li&gt;mapping parameters&lt;/li&gt;
&lt;/ul&gt;
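
&lt;p&gt;As a rough illustration of connector tracing, the sketch below builds a tiny mapping in memory and reports target fields that have no incoming &lt;code&gt;CONNECTOR&lt;/code&gt;. It is not the prototype's code. The instance names, the mapping name, and the helper functions are hypothetical, and the attribute names (&lt;code&gt;FROMINSTANCE&lt;/code&gt;, &lt;code&gt;FROMFIELD&lt;/code&gt;, &lt;code&gt;TOINSTANCE&lt;/code&gt;, &lt;code&gt;TOFIELD&lt;/code&gt;) are assumed to follow a typical PowerCenter export and should be checked against a real file.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build a tiny, hypothetical mapping in memory. A real PowerCenter export
# is far larger; only the CONNECTOR attributes used here are assumed.
mapping = ET.Element("MAPPING", NAME="m_load_nace")
hops = [
    ("L0_VLE_NACE", "CD_NACE", "SQ_L0_VLE_NACE", "CD_NACE"),
    ("SQ_L0_VLE_NACE", "CD_NACE", "EXP_CLEAN", "CD_NACE_in"),
    ("EXP_CLEAN", "CD_NACE_out", "L1_D_NACE", "CD_NACE"),
]
for from_inst, from_field, to_inst, to_field in hops:
    ET.SubElement(
        mapping, "CONNECTOR",
        FROMINSTANCE=from_inst, FROMFIELD=from_field,
        TOINSTANCE=to_inst, TOFIELD=to_field,
    )


def incoming_connectors(mapping_element):
    """Index connectors by the (instance, field) pair they feed into."""
    index = {}
    for conn in mapping_element.iter("CONNECTOR"):
        key = (conn.get("TOINSTANCE"), conn.get("TOFIELD"))
        index[key] = (conn.get("FROMINSTANCE"), conn.get("FROMFIELD"))
    return index


def unconnected_target_fields(mapping_element, target, fields):
    """Return the target fields that have no incoming connector."""
    index = incoming_connectors(mapping_element)
    return [f for f in fields if (target, f) not in index]


missing = unconnected_target_fields(
    mapping, "L1_D_NACE", ["CD_NACE", "ID_NACE_PARENT"]
)
print(missing)  # ['ID_NACE_PARENT']
```

&lt;p&gt;A field reported here is a candidate for review, not proof of a gap: it may still be populated by a constant, a default, or a lookup that connectors alone do not reveal.&lt;/p&gt;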

&lt;p&gt;From this, the platform builds a field-level canonical model with information such as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Canonical field&lt;/th&gt;
&lt;th&gt;Example value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;L0_VLE_NACE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source column&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CD_NACE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;L1_D_NACE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target column&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CD_NACE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformation type&lt;/td&gt;
&lt;td&gt;Expression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformation logic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TRIM(src.CD_NACE)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filter condition&lt;/td&gt;
&lt;td&gt;business date predicate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lookup table&lt;/td&gt;
&lt;td&gt;reference/surrogate-key table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lineage path&lt;/td&gt;
&lt;td&gt;source → qualifier → expression → target expression → target&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration status&lt;/td&gt;
&lt;td&gt;Supported with Review / Manual Decision Required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Translating common legacy patterns
&lt;/h2&gt;

&lt;p&gt;The first version supports a transparent subset of common Informatica patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expression transformations
&lt;/h3&gt;

&lt;p&gt;An Informatica expression such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ltrim(rtrim(CD_NACE_in))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CD_NACE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A custom defaulting rule such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:UDF.DEFAULTSTRINGNULL(T_NAME_in)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T_NAME&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'XNA'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A constant value such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'VLE'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="s1"&gt;'VLE'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CD_SOURCE_SYSTEM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A numeric default such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can become:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;ID_NACE_PARENT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The platform keeps these as explicit derived values in the canonical model rather than pretending they came from a physical source column.&lt;/p&gt;
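
&lt;p&gt;The patterns above could be expressed as a small rule table. The sketch below is illustrative only: &lt;code&gt;RULES&lt;/code&gt; and &lt;code&gt;translate&lt;/code&gt; are hypothetical names, the &lt;code&gt;_in&lt;/code&gt; port suffix is assumed from the examples, and the &lt;code&gt;'XNA'&lt;/code&gt; default for the UDF is taken from the example above rather than from the UDF definition. Anything that matches no rule is returned without SQL so that it becomes a manual decision.&lt;/p&gt;

```python
import re

# Hypothetical rule table: (Informatica pattern, Snowflake template).
# Only the patterns shown in this article are covered.
RULES = [
    (r"ltrim\(rtrim\((\w+)_in\)\)", "TRIM(src.{0})"),
    (r":UDF\.DEFAULTSTRINGNULL\((\w+)_in\)",
     "COALESCE(NULLIF(TRIM(src.{0}), ''), 'XNA')"),
]


def translate(expression):
    """Return (sql, status); sql is None when no rule matches."""
    text = expression.strip()
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            return template.format(match.group(1)), "Supported with Review"
    # String or integer literals pass through as explicit derived values.
    if re.fullmatch(r"'[^']*'|-?\d+", text):
        return text, "Supported with Review"
    return None, "Manual Decision Required"


print(translate("ltrim(rtrim(CD_NACE_in))"))
print(translate("DECODE(TRUE, X_in = 1, 'A', 'B')"))
```

&lt;p&gt;Returning no SQL for unknown expressions is deliberate in this sketch: the gap is surfaced instead of being filled with a guess.&lt;/p&gt;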

&lt;h2&gt;
  
  
  Source filters and runtime parameters
&lt;/h2&gt;

&lt;p&gt;A Source Qualifier may contain a filter similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;edw_business_date = to_date('$$BUSINESS_DATE','YYYYMMDDHH24MISS')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The target Snowflake pattern can preserve that intent using a runtime parameter or session-variable approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EDW_BUSINESS_DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
      &lt;span class="n"&gt;TO_TIMESTAMP_NTZ&lt;/span&gt;&lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="n"&gt;BUSINESS_DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'YYYYMMDDHH24MISS'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact runtime parameter implementation still needs to be confirmed for the target deployment framework. That is a deployment decision, not something a metadata generator should silently invent.&lt;/p&gt;
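
&lt;p&gt;One way to keep that decision visible is to rewrite &lt;code&gt;$$NAME&lt;/code&gt; tokens as named placeholders and list them for review. This is a minimal sketch under assumptions: &lt;code&gt;bind_parameters&lt;/code&gt; is a hypothetical helper, the &lt;code&gt;:NAME&lt;/code&gt; placeholder style may not suit every execution framework, and function translation such as &lt;code&gt;to_date&lt;/code&gt; is left to a separate step.&lt;/p&gt;

```python
import re

# Matches $$NAME with optional surrounding single quotes.
PARAM = re.compile(r"'?\$\$(\w+)'?")


def bind_parameters(filter_text):
    """Rewrite $$NAME tokens as :NAME and return the names found."""
    names = sorted(set(PARAM.findall(filter_text)))
    rewritten = PARAM.sub(lambda m: ":" + m.group(1), filter_text)
    return rewritten, names


sql, params = bind_parameters(
    "edw_business_date = to_date('$$BUSINESS_DATE','YYYYMMDDHH24MISS')"
)
print(sql)     # edw_business_date = to_date(:BUSINESS_DATE,'YYYYMMDDHH24MISS')
print(params)  # ['BUSINESS_DATE']
```

&lt;p&gt;The returned parameter list can then be attached to the migration findings so that each parameter gets an explicit owner and source.&lt;/p&gt;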

&lt;h2&gt;
  
  
  Lookup conversion is not always automatic
&lt;/h2&gt;

&lt;p&gt;Lookups are a good example of why governed delivery matters.&lt;/p&gt;

&lt;p&gt;An Informatica Lookup Procedure may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a lookup table&lt;/li&gt;
&lt;li&gt;a lookup condition&lt;/li&gt;
&lt;li&gt;a source filter&lt;/li&gt;
&lt;li&gt;cache behavior&lt;/li&gt;
&lt;li&gt;multiple-match behavior&lt;/li&gt;
&lt;li&gt;dynamic or static lookup semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A basic Snowflake translation may propose a &lt;code&gt;LEFT JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But that does not prove the join is semantically equivalent.&lt;/p&gt;

&lt;p&gt;The migration still needs review for questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the lookup table current, historical, or slowly changing?&lt;/li&gt;
&lt;li&gt;What happens when multiple matches exist?&lt;/li&gt;
&lt;li&gt;Does the lookup require effective-date logic?&lt;/li&gt;
&lt;li&gt;Is the lookup output a surrogate key?&lt;/li&gt;
&lt;li&gt;Was cache behavior masking duplicate or late-arriving records?&lt;/li&gt;
&lt;li&gt;Should the target use a join, a &lt;code&gt;MERGE&lt;/code&gt;, or a separate key-resolution process?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The prototype therefore generates a reviewable join candidate and records a migration finding alongside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Status: Needs Review
Reason: Lookup conversion requires confirmation of join semantics,
duplicate-match behavior, and reference-table ownership.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The governed Release Gate
&lt;/h2&gt;

&lt;p&gt;This is the part that matters most to me.&lt;/p&gt;

&lt;p&gt;The platform does not stop at generated SQL.&lt;/p&gt;

&lt;p&gt;It creates a validation and review workflow with statuses such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Draft
Under Review
Approved with Conditions
Approved
Rejected
Blocked
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The release gate can identify findings such as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Example action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unmapped target field&lt;/td&gt;
&lt;td&gt;Confirm source, approved default, or explicit exclusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing target datatype&lt;/td&gt;
&lt;td&gt;Confirm datatype before DDL release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lookup conversion&lt;/td&gt;
&lt;td&gt;Validate join semantics and test results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsupported transformation&lt;/td&gt;
&lt;td&gt;Record manual migration decision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing date population rule&lt;/td&gt;
&lt;td&gt;Select source field, runtime parameter, timestamp, or nullable target decision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex expression&lt;/td&gt;
&lt;td&gt;Add unit test and business approval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For unresolved fields, the generated SQL keeps the gap visible instead of hiding it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="cm"&gt;/* REVIEW REQUIRED: target field has no approved source/default */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is not a failure of the product.&lt;/p&gt;

&lt;p&gt;It is the product preventing a false sense of automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why human review remains necessary
&lt;/h2&gt;

&lt;p&gt;AI and rule-based conversion can accelerate the mechanical parts of migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metadata extraction&lt;/li&gt;
&lt;li&gt;connector tracing&lt;/li&gt;
&lt;li&gt;expression inventory&lt;/li&gt;
&lt;li&gt;type translation&lt;/li&gt;
&lt;li&gt;SQL drafting&lt;/li&gt;
&lt;li&gt;DQ rule suggestions&lt;/li&gt;
&lt;li&gt;lineage documentation&lt;/li&gt;
&lt;li&gt;risk classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But a migration still requires decisions that depend on business meaning and target-state architecture.&lt;/p&gt;

&lt;p&gt;For example, an unmapped effective-date field could mean very different things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use source business date
Use current timestamp
Use target load timestamp
Populate from a configuration parameter
Allow nulls and revise DDL
Exclude the column after SME approval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A tool can surface the decision, propose options, and preserve the evidence.&lt;/p&gt;

&lt;p&gt;A human should approve the final choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The generated delivery packet
&lt;/h2&gt;

&lt;p&gt;Once review is complete, the prototype generates a delivery package containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;canonical metadata model&lt;/li&gt;
&lt;li&gt;source-to-target lineage&lt;/li&gt;
&lt;li&gt;Snowflake DDL&lt;/li&gt;
&lt;li&gt;Snowflake transformation SQL&lt;/li&gt;
&lt;li&gt;data dictionary&lt;/li&gt;
&lt;li&gt;technical specification&lt;/li&gt;
&lt;li&gt;data quality rules&lt;/li&gt;
&lt;li&gt;migration risk assessment&lt;/li&gt;
&lt;li&gt;review decision history&lt;/li&gt;
&lt;li&gt;deployment manifest&lt;/li&gt;
&lt;li&gt;audit trail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The package should only be marked deployment-ready when high-risk findings have documented resolutions.&lt;/p&gt;
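
&lt;p&gt;That readiness rule is simple enough to sketch. The code below is an assumption-laden illustration, not the prototype's implementation: the &lt;code&gt;Finding&lt;/code&gt; class, its severity scale, and &lt;code&gt;deployment_ready&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    severity: str          # "high", "medium", or "low" (assumed scale)
    resolution: str = ""   # empty until a reviewer documents a decision


def deployment_ready(findings):
    """Return (ready, blocking_titles) for a list of findings."""
    blocking = [
        f.title for f in findings
        if f.severity == "high" and not f.resolution.strip()
    ]
    return not blocking, blocking


findings = [
    Finding("Unmapped target field", "high"),
    Finding("Lookup conversion", "high", "Join validated by SME"),
    Finding("Complex expression", "medium"),
]
ready, blocking = deployment_ready(findings)
print(ready, blocking)  # False ['Unmapped target field']
```

&lt;p&gt;Returning the blocking titles along with the flag keeps the reason for a blocked release next to the decision itself.&lt;/p&gt;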

&lt;p&gt;The next improvement I am working on is making approval decisions directly update release readiness and the exported findings package.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changes
&lt;/h2&gt;

&lt;p&gt;The goal is not to claim that Informatica can be replaced by a single AI prompt.&lt;/p&gt;

&lt;p&gt;The goal is to make migration delivery more reliable.&lt;/p&gt;

&lt;p&gt;Instead of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Legacy Mapping
      ↓
Manual interpretation
      ↓
Spreadsheet updates
      ↓
SQL generation
      ↓
Late discovery of missing logic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the target workflow becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Legacy Mapping
      ↓
Structured metadata extraction
      ↓
Canonical representation
      ↓
Generated artifacts
      ↓
Visible assumptions and risks
      ↓
Human approval
      ↓
Traceable release package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between generating code and governing a migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;Data migration programs rarely fail because a team cannot write SQL.&lt;/p&gt;

&lt;p&gt;They fail because business logic, defaults, lookup behavior, data quality expectations, and ownership decisions are hidden across mappings, emails, spreadsheets, and tribal knowledge.&lt;/p&gt;

&lt;p&gt;A governed metadata model gives those decisions a place to live.&lt;/p&gt;

&lt;p&gt;That is the direction I am building toward with Data Engineering Copilot: start from business intent or legacy implementation metadata, generate delivery artifacts, and make every important assumption reviewable before release.&lt;/p&gt;

&lt;p&gt;#DataEngineering #Informatica #Snowflake #ETLModernization #DataMigration #MetadataDrivenDevelopment #DataGovernance #DataArchitecture #AIEngineering&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why Enterprise AI Needs Structured Dissent, Not Just More Agents</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Sat, 27 Jun 2026 00:53:39 +0000</pubDate>
      <link>https://dev.to/amising6/why-enterprise-ai-needs-structured-dissent-not-just-more-agents-5cn</link>
      <guid>https://dev.to/amising6/why-enterprise-ai-needs-structured-dissent-not-just-more-agents-5cn</guid>
      <description>&lt;p&gt;Many AI projects today are presented as multi-agent systems.&lt;/p&gt;

&lt;p&gt;One agent investigates. Another agent analyzes risk. A third agent checks compliance. A fourth agent gives a recommendation.&lt;/p&gt;

&lt;p&gt;It sounds advanced.&lt;/p&gt;

&lt;p&gt;But in a bank, adding more agents does not automatically make a workflow safe.&lt;/p&gt;

&lt;p&gt;A bank cannot freeze a customer account, block a payment, file a regulatory report, or label a transaction as fraud simply because an AI system produced a confident answer.&lt;/p&gt;

&lt;p&gt;The real question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many AI agents are involved?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the system show evidence, challenge its own conclusion, apply deterministic rules, and stop for human approval when the decision is high impact?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the difference between an interesting multi-agent demo and an enterprise-ready AI workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  A banking example: suspicious wire transfer
&lt;/h2&gt;

&lt;p&gt;Imagine a bank detects a wire transfer for $250,000.&lt;/p&gt;

&lt;p&gt;The payment is unusual because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The customer has never sent a transfer of this size.&lt;/li&gt;
&lt;li&gt;The destination account is in a new country.&lt;/li&gt;
&lt;li&gt;The transaction happens outside the customer’s normal business hours.&lt;/li&gt;
&lt;li&gt;The beneficiary was added only a few minutes before the transfer.&lt;/li&gt;
&lt;li&gt;The customer recently changed their phone number and email address.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple AI chatbot might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“This transaction looks suspicious. Consider blocking it.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not enough.&lt;/p&gt;

&lt;p&gt;A bank needs to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which transaction patterns triggered the concern?&lt;/li&gt;
&lt;li&gt;Does the transaction actually cross a known risk threshold?&lt;/li&gt;
&lt;li&gt;Is there a sanctions or AML issue?&lt;/li&gt;
&lt;li&gt;Could this be a legitimate business payment?&lt;/li&gt;
&lt;li&gt;What policy applies?&lt;/li&gt;
&lt;li&gt;Should the payment be blocked, held, or released?&lt;/li&gt;
&lt;li&gt;Who is allowed to make that decision?&lt;/li&gt;
&lt;li&gt;Can the bank explain the decision later to auditors, compliance teams, and the customer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where structured multi-agent design matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  A better design: a banking fraud decision room
&lt;/h2&gt;

&lt;p&gt;Instead of letting one model make a decision, the bank can create a controlled workflow with specialized agents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transaction Alert
      ↓
Fraud Detection Agent
      ↓
Customer Behavior Agent
      ↓
AML / Sanctions Agent
      ↓
Policy and Risk Agent
      ↓
Decision Reviewer
      ↓
Human Compliance Officer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each agent has a limited responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fraud Detection Agent
&lt;/h3&gt;

&lt;p&gt;This agent analyzes transaction behavior.&lt;/p&gt;

&lt;p&gt;It may identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unusual payment amount&lt;/li&gt;
&lt;li&gt;New beneficiary&lt;/li&gt;
&lt;li&gt;New country&lt;/li&gt;
&lt;li&gt;Unusual transaction time&lt;/li&gt;
&lt;li&gt;Sudden profile changes&lt;/li&gt;
&lt;li&gt;Prior fraud indicators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its job is not to freeze the transaction.&lt;/p&gt;

&lt;p&gt;Its job is to create a structured fraud signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FRAUD_SIGNAL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TXN-784921"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUST-10048"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_indicators"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"new_beneficiary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"amount_12x_customer_average"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"unusual_country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"recent_contact_change"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the next stage a reviewable artifact instead of a paragraph generated by an LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Customer Behavior Agent
&lt;/h3&gt;

&lt;p&gt;A transaction may look suspicious but still be legitimate.&lt;/p&gt;

&lt;p&gt;For example, a corporate customer may be making a valid acquisition payment or paying a new overseas vendor.&lt;/p&gt;

&lt;p&gt;The Customer Behavior Agent looks at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Historical payment behavior&lt;/li&gt;
&lt;li&gt;Customer segment&lt;/li&gt;
&lt;li&gt;Typical payment ranges&lt;/li&gt;
&lt;li&gt;Known business relationships&lt;/li&gt;
&lt;li&gt;Recent support interactions&lt;/li&gt;
&lt;li&gt;Whether the customer informed the bank about a major payment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This agent produces a context record that can either counter or corroborate the fraud signal. In this case it corroborates it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CUSTOMER_CONTEXT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TXN-784921"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"historical_pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Outside normal range"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"known_business_event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No supporting event found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer_contacted_bank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assessment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Transaction behavior remains inconsistent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is important because the system should not treat every unusual payment as fraud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured dissent is necessary
&lt;/h2&gt;

&lt;p&gt;Now imagine the fraud agent recommends blocking the payment.&lt;/p&gt;

&lt;p&gt;A good enterprise workflow should not simply accept that recommendation.&lt;/p&gt;

&lt;p&gt;It should require another role to challenge it.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Fraud Detection Agent says: “High fraud risk.”&lt;/li&gt;
&lt;li&gt;The Customer Behavior Agent says: “No evidence of a legitimate business event.”&lt;/li&gt;
&lt;li&gt;The AML / Sanctions Agent says: “Beneficiary has elevated geographic risk.”&lt;/li&gt;
&lt;li&gt;The Policy and Risk Agent says: “The bank’s hold threshold is met.”&lt;/li&gt;
&lt;li&gt;The Decision Reviewer says: “Human approval required before blocking.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is structured dissent.&lt;/p&gt;

&lt;p&gt;It is not about making agents argue for entertainment.&lt;/p&gt;

&lt;p&gt;It is about making assumptions visible before the bank takes action.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In high-stakes workflows, disagreement is not a weakness. Hidden disagreement is the real risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The LLM should not make the final decision alone
&lt;/h2&gt;

&lt;p&gt;LLMs are useful for many parts of the workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summarizing transaction history&lt;/li&gt;
&lt;li&gt;Explaining why a transaction appears unusual&lt;/li&gt;
&lt;li&gt;Reading customer notes&lt;/li&gt;
&lt;li&gt;Interpreting investigation findings&lt;/li&gt;
&lt;li&gt;Drafting a case narrative&lt;/li&gt;
&lt;li&gt;Generating a compliance-review summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But an LLM should not control deterministic rules.&lt;/p&gt;

&lt;p&gt;For example, these should come from governed systems and rules engines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Daily transaction thresholds&lt;/li&gt;
&lt;li&gt;Sanctions screening results&lt;/li&gt;
&lt;li&gt;AML policy conditions&lt;/li&gt;
&lt;li&gt;Regulatory filing timelines&lt;/li&gt;
&lt;li&gt;Customer account restrictions&lt;/li&gt;
&lt;li&gt;Approval authority limits&lt;/li&gt;
&lt;li&gt;Payment-hold policies&lt;/li&gt;
&lt;li&gt;Risk score calculations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A safe architecture looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI Layer
- Investigates
- Summarizes
- Explains
- Recommends

Rules Layer
- Calculates thresholds
- Applies risk policies
- Checks sanctions lists
- Enforces approval limits
- Determines required escalation

Human Layer
- Approves
- Rejects
- Overrides
- Requests further investigation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This distinction matters.&lt;/p&gt;

&lt;p&gt;The AI can explain why a payment looks suspicious.&lt;/p&gt;

&lt;p&gt;The rules engine can determine whether the bank’s fraud-hold threshold has been crossed.&lt;/p&gt;

&lt;p&gt;The compliance officer can decide whether the payment should actually be blocked.&lt;/p&gt;
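&lt;p&gt;The separation between the three layers can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the threshold value, the function names, and the returned status strings are all assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical threshold; a real value would come from a governed policy store.
FRAUD_HOLD_THRESHOLD = 100_000


@dataclass
class Assessment:
    ai_summary: str           # free-text explanation from the AI layer
    hold_threshold_met: bool  # deterministic result from the rules layer
    needs_human: bool         # whether a person must decide


def rules_layer(amount: float, sanctions_clear: bool) -> bool:
    """Deterministic check: no model output is consulted here."""
    return amount >= FRAUD_HOLD_THRESHOLD or not sanctions_clear


def assess(amount: float, sanctions_clear: bool, ai_summary: str) -> Assessment:
    threshold_met = rules_layer(amount, sanctions_clear)
    # The AI summary is carried along as evidence; it never changes the rule result.
    return Assessment(ai_summary, threshold_met, needs_human=threshold_met)


def human_layer(assessment: Assessment, approved: bool) -> str:
    """Only an explicit human decision can turn a recommendation into a block."""
    if not assessment.needs_human:
        return "released"
    return "blocked" if approved else "released after review"
```

&lt;p&gt;The design choice worth noting is that the AI text is passed through as evidence only. The rule result depends on the amount and the sanctions flag, and the final status depends on the human input.&lt;/p&gt;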

&lt;h2&gt;
  
  
  An evidence panel is more important than a chatbot answer
&lt;/h2&gt;

&lt;p&gt;The final decision should not be a black-box score.&lt;/p&gt;

&lt;p&gt;A compliance officer should see an evidence panel like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transaction:
TXN-784921

Customer:
Corporate customer — existing account for 4 years

Amount:
$250,000

Risk indicators:
- New beneficiary
- New destination country
- Payment amount is 12x normal average
- Contact information changed within past 24 hours
- No matching historical vendor relationship

Policy checks:
- Enhanced review threshold: Triggered
- Manual compliance approval: Required
- Sanctions screening: Clear
- AML monitoring alert: Triggered

AI assessment:
High-risk transaction requiring manual review

Human decision:
Payment placed on temporary hold

Approved by:
Compliance Officer

Decision timestamp:
2026-06-26 14:22 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what enterprise AI should produce.&lt;/p&gt;

&lt;p&gt;Not just an answer.&lt;/p&gt;

&lt;p&gt;A decision record.&lt;/p&gt;
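&lt;p&gt;One way to make such a record durable is to store it as an immutable structure and serialize it deterministically. The sketch below assumes a hypothetical &lt;code&gt;DecisionRecord&lt;/code&gt; class whose field names are invented for illustration; the values are taken from the evidence panel above.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecisionRecord:
    transaction_id: str
    amount_usd: int
    risk_indicators: tuple
    policy_checks: dict
    ai_assessment: str
    human_decision: str
    approved_by: str
    decided_at_utc: str


def to_audit_json(record: DecisionRecord) -> str:
    """Serialize with sorted keys so identical records produce identical text."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)


record = DecisionRecord(
    transaction_id="TXN-784921",
    amount_usd=250000,
    risk_indicators=("New beneficiary", "New destination country"),
    policy_checks={"Sanctions screening": "Clear", "AML monitoring alert": "Triggered"},
    ai_assessment="High-risk transaction requiring manual review",
    human_decision="Payment placed on temporary hold",
    approved_by="Compliance Officer",
    decided_at_utc="2026-06-26T14:22:00Z",
)
audit_text = to_audit_json(record)
```

&lt;p&gt;Because the dataclass is frozen, the fields cannot be reassigned after the decision is recorded, which is the property an audit trail usually needs.&lt;/p&gt;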

&lt;h2&gt;
  
  
  Human approval is part of the architecture
&lt;/h2&gt;

&lt;p&gt;Human approval should not be added as an afterthought.&lt;/p&gt;

&lt;p&gt;In banking, some actions can be automated, while others should always wait for a person.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;AI / system role&lt;/th&gt;
&lt;th&gt;Human role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Summarize alert&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Review if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Identify unusual transaction patterns&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Review exceptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create investigation case&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Monitor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Place temporary low-risk review hold&lt;/td&gt;
&lt;td&gt;Rule-based&lt;/td&gt;
&lt;td&gt;Review later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Freeze account&lt;/td&gt;
&lt;td&gt;Recommend only&lt;/td&gt;
&lt;td&gt;Explicit approval required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File SAR or regulatory report&lt;/td&gt;
&lt;td&gt;Draft supporting evidence&lt;/td&gt;
&lt;td&gt;Compliance approval required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Close customer account&lt;/td&gt;
&lt;td&gt;Never autonomous&lt;/td&gt;
&lt;td&gt;Senior human decision&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system should know when to proceed, when to pause, and when to escalate.&lt;/p&gt;

&lt;p&gt;That is not a limitation.&lt;/p&gt;

&lt;p&gt;That is good enterprise design.&lt;/p&gt;
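&lt;p&gt;The table above can be expressed as a small routing policy. The sketch below is an assumption about how such a policy might be encoded: the action identifiers and the three route names are hypothetical, and the low-risk hold is treated as "act now, review later".&lt;/p&gt;

```python
from enum import Enum


class Route(Enum):
    PROCEED = "proceed"    # the system may act on its own
    PAUSE = "pause"        # act, then queue for later human review
    ESCALATE = "escalate"  # wait for explicit human approval


# Mirrors the table above; the keys are hypothetical action identifiers.
ACTION_POLICY = {
    "summarize_alert": Route.PROCEED,
    "identify_unusual_patterns": Route.PROCEED,
    "create_investigation_case": Route.PROCEED,
    "place_low_risk_hold": Route.PAUSE,
    "freeze_account": Route.ESCALATE,
    "file_sar": Route.ESCALATE,
    "close_account": Route.ESCALATE,
}


def route(action: str) -> Route:
    # Unknown actions escalate by default, so a missing entry fails safe.
    return ACTION_POLICY.get(action, Route.ESCALATE)
```

&lt;p&gt;Defaulting unknown actions to escalation means a newly added capability cannot run autonomously until someone explicitly classifies it.&lt;/p&gt;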

&lt;h2&gt;
  
  
  What this means for data engineering teams
&lt;/h2&gt;

&lt;p&gt;This same pattern applies directly to data engineering.&lt;/p&gt;

&lt;p&gt;A data-engineering copilot should not only generate SQL or YAML from a source-to-target mapping document.&lt;/p&gt;

&lt;p&gt;It should operate as a governed workflow.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STTM / DDL / Source Metadata
          ↓
Metadata Extraction Agent
          ↓
Mapping Validation Agent
          ↓
Transformation Logic Agent
          ↓
SQL / YAML Generator
          ↓
Reviewer Agent
          ↓
Data Engineer Approval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reviewer should validate things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the source column exist?&lt;/li&gt;
&lt;li&gt;Is the target data type compatible?&lt;/li&gt;
&lt;li&gt;Is the join supported by the mapping?&lt;/li&gt;
&lt;li&gt;Is the transformation rule documented?&lt;/li&gt;
&lt;li&gt;Is a sign rule missing?&lt;/li&gt;
&lt;li&gt;Is a derived metric using an unapproved assumption?&lt;/li&gt;
&lt;li&gt;Are there duplicate or unused YAML objects?&lt;/li&gt;
&lt;li&gt;Has an engineer approved the generated output?&lt;/li&gt;
&lt;/ul&gt;
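&lt;p&gt;Several of these checks are deterministic and can be written as plain code. The following is a minimal sketch covering four of them; the &lt;code&gt;MappingRow&lt;/code&gt; structure and the type-compatibility table are assumptions for illustration, not the format used by any particular tool.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical compatibility table: source type -> target types it may load into.
COMPATIBLE_TYPES = {
    "NUMBER": {"NUMBER", "FLOAT"},
    "VARCHAR": {"VARCHAR"},
    "DATE": {"DATE", "TIMESTAMP"},
}


@dataclass
class MappingRow:
    source_column: str
    source_type: str
    target_column: str
    target_type: str
    rule_documented: bool
    engineer_approved: bool


def review(row: MappingRow, source_columns: set) -> list:
    """Return a list of findings; an empty list means the row passed every check."""
    findings = []
    if row.source_column not in source_columns:
        findings.append(f"source column {row.source_column} does not exist")
    if row.target_type not in COMPATIBLE_TYPES.get(row.source_type, set()):
        findings.append(f"{row.source_type} is not compatible with {row.target_type}")
    if not row.rule_documented:
        findings.append("transformation rule is not documented")
    if not row.engineer_approved:
        findings.append("engineer approval is missing")
    return findings
```

&lt;p&gt;Returning findings instead of a single pass or fail flag keeps every objection visible to the engineer who has to approve the output.&lt;/p&gt;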

&lt;p&gt;Then every generated artifact should include traceability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target Column:
PROFIT_AMT

Source:
sales.PROFIT_AMT

Transformation:
CASE WHEN SALES_TYPE = 'CANCEL'
THEN PROFIT_AMT * -1
ELSE PROFIT_AMT
END

Business Rule:
Cancellation transactions must store Profit as negative.

Source Reference:
STTM row 42

Validation:
- Source column exists
- Transformation approved
- Target data type compatible
- Human review status: Approved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how generated code becomes a governed engineering artifact.&lt;/p&gt;
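&lt;p&gt;A traceable rule is also a testable rule. As a rough sketch, the CASE expression from the record above can be run against an in-memory SQLite table to confirm its behavior before anyone approves it. The table layout, the &lt;code&gt;ORDER_ID&lt;/code&gt; column, and the sample rows are assumptions made for the example.&lt;/p&gt;

```python
import sqlite3

# The CASE expression from the traceability record, applied to a sample table.
SIGN_RULE = """
SELECT SALES_TYPE,
       CASE WHEN SALES_TYPE = 'CANCEL'
            THEN PROFIT_AMT * -1
            ELSE PROFIT_AMT
       END AS PROFIT_AMT
FROM sales
ORDER BY ORDER_ID
"""


def apply_sign_rule(rows):
    """Load rows into an in-memory table and apply the rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (ORDER_ID INTEGER, SALES_TYPE TEXT, PROFIT_AMT REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(SIGN_RULE).fetchall()
    conn.close()
    return result


# Hypothetical sample data: one normal sale and one cancellation.
result = apply_sign_rule([(1, "SALE", 120.0), (2, "CANCEL", 40.0)])
```

&lt;p&gt;Note that the rule assumes cancellations arrive with a positive amount. If the source already stores them as negative, the same expression would flip them to positive, which is exactly the kind of assumption a reviewer should confirm.&lt;/p&gt;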

&lt;h2&gt;
  
  
  A practical checklist for enterprise AI
&lt;/h2&gt;

&lt;p&gt;Before calling a multi-agent system enterprise-ready, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does each agent have a clear responsibility?&lt;/li&gt;
&lt;li&gt;Are handoffs structured instead of free-text only?&lt;/li&gt;
&lt;li&gt;Can one agent challenge another agent’s conclusion?&lt;/li&gt;
&lt;li&gt;Are critical calculations and policy checks deterministic?&lt;/li&gt;
&lt;li&gt;Can every recommendation be traced to source evidence?&lt;/li&gt;
&lt;li&gt;Does the system show assumptions and confidence levels?&lt;/li&gt;
&lt;li&gt;Is there a clear escalation path for uncertainty?&lt;/li&gt;
&lt;li&gt;Can a human approve, reject, or override the decision?&lt;/li&gt;
&lt;li&gt;Can the organization reconstruct the full decision later?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to any of these questions is no, the solution may still be a useful prototype.&lt;/p&gt;

&lt;p&gt;But it is not ready for high-stakes enterprise use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The future of enterprise AI is not one intelligent assistant making every decision.&lt;/p&gt;

&lt;p&gt;It is also not a collection of agents talking continuously.&lt;/p&gt;

&lt;p&gt;The future is a governed decision system where AI helps teams investigate faster, compare perspectives, identify risk, and prepare recommendations.&lt;/p&gt;

&lt;p&gt;But evidence remains visible.&lt;/p&gt;

&lt;p&gt;Rules remain enforceable.&lt;/p&gt;

&lt;p&gt;Disagreement remains allowed.&lt;/p&gt;

&lt;p&gt;And people remain accountable.&lt;/p&gt;

&lt;p&gt;That is how AI becomes useful in banking, finance, data engineering, and other enterprise workflows where trust matters as much as speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dataengineeringcopilot.com" rel="noopener noreferrer"&gt;https://dataengineeringcopilot.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/amising6/data-engineering-copilot" rel="noopener noreferrer"&gt;https://github.com/amising6/data-engineering-copilot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/amit-singh-57980030" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/amit-singh-57980030&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>devops</category>
    </item>
    <item>
      <title>From DataStage and Informatica to Databricks Medallion Architecture: Why Migration Is More Than Code Conversion</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Sun, 21 Jun 2026 13:43:00 +0000</pubDate>
      <link>https://dev.to/amising6/from-datastage-and-informatica-to-databricks-medallion-architecture-why-migration-is-more-than-2cnd</link>
      <guid>https://dev.to/amising6/from-datastage-and-informatica-to-databricks-medallion-architecture-why-migration-is-more-than-2cnd</guid>
      <description>&lt;p&gt;Legacy ETL modernization is often described as a technology migration.&lt;/p&gt;

&lt;p&gt;Move DataStage jobs to Databricks.&lt;br&gt;
Convert Informatica mappings into PySpark.&lt;br&gt;
Replace legacy workflows with notebooks and Delta tables.&lt;/p&gt;

&lt;p&gt;But that description misses the hardest part.&lt;/p&gt;

&lt;p&gt;The real challenge is not converting syntax.&lt;/p&gt;

&lt;p&gt;The challenge is understanding years of hidden transformation logic, reconstructing data lineage, separating technical processing from business logic, and deciding where each responsibility belongs in a modern architecture.&lt;/p&gt;

&lt;p&gt;A DataStage job or Informatica mapping may contain raw ingestion, data cleansing, lookups, joins, business rules, aggregations, error handling, and reporting logic in one workflow.&lt;/p&gt;

&lt;p&gt;A Databricks Medallion architecture expects something different.&lt;/p&gt;

&lt;p&gt;It separates data processing into clearer layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bronze: raw ingestion and source preservation&lt;/li&gt;
&lt;li&gt;Silver: cleansing, standardization, enrichment, conformance, and quality controls&lt;/li&gt;
&lt;li&gt;Gold: business-ready models, aggregates, KPIs, reporting datasets, and semantic outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means a successful migration cannot be a blind one-to-one conversion.&lt;/p&gt;

&lt;p&gt;It needs to become a metadata and architecture exercise.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why One-to-One Conversion Fails&lt;/p&gt;

&lt;p&gt;A traditional legacy ETL job often looks like this:&lt;/p&gt;

&lt;p&gt;Read source data&lt;br&gt;
→ Filter records&lt;br&gt;
→ Lookup reference data&lt;br&gt;
→ Cleanse values&lt;br&gt;
→ Deduplicate&lt;br&gt;
→ Apply business calculations&lt;br&gt;
→ Aggregate&lt;br&gt;
→ Write reporting output&lt;/p&gt;

&lt;p&gt;The problem is that all these responsibilities may exist inside one job, mapping, sequence, or workflow.&lt;/p&gt;

&lt;p&gt;For example, a single DataStage job might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest from Oracle&lt;/li&gt;
&lt;li&gt;remove cancelled records&lt;/li&gt;
&lt;li&gt;trim whitespace&lt;/li&gt;
&lt;li&gt;standardize status values&lt;/li&gt;
&lt;li&gt;join customer master data&lt;/li&gt;
&lt;li&gt;calculate net order amount&lt;/li&gt;
&lt;li&gt;aggregate sales by month&lt;/li&gt;
&lt;li&gt;write a reporting table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that entire job is converted directly into one Databricks notebook, the organization may simply recreate the old architecture in a new platform.&lt;/p&gt;

&lt;p&gt;The code may run in Databricks, but the design remains difficult to maintain, test, govern, and scale.&lt;/p&gt;

&lt;p&gt;The goal should not be:&lt;/p&gt;

&lt;p&gt;Convert one legacy job into one notebook.&lt;/p&gt;

&lt;p&gt;The goal should be:&lt;/p&gt;

&lt;p&gt;Understand what each transformation is doing and place it in the right modern data layer.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The First Step: Extract Metadata, Not Just Code&lt;/p&gt;

&lt;p&gt;A legacy ETL migration should begin by extracting structured metadata from the existing platform.&lt;/p&gt;

&lt;p&gt;For DataStage, Informatica, SSIS, Talend, stored procedures, or other ETL tools, useful metadata may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;job or mapping name&lt;/li&gt;
&lt;li&gt;workflow dependencies&lt;/li&gt;
&lt;li&gt;source tables, files, and APIs&lt;/li&gt;
&lt;li&gt;target tables and files&lt;/li&gt;
&lt;li&gt;source-to-target field mappings&lt;/li&gt;
&lt;li&gt;joins and lookup logic&lt;/li&gt;
&lt;li&gt;filters and conditions&lt;/li&gt;
&lt;li&gt;transformation expressions&lt;/li&gt;
&lt;li&gt;aggregations&lt;/li&gt;
&lt;li&gt;surrogate key generation&lt;/li&gt;
&lt;li&gt;reject handling&lt;/li&gt;
&lt;li&gt;parameter values&lt;/li&gt;
&lt;li&gt;schedules and sequencing&lt;/li&gt;
&lt;li&gt;pre-SQL and post-SQL&lt;/li&gt;
&lt;li&gt;restart or recovery logic&lt;/li&gt;
&lt;li&gt;error-handling behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The purpose is to create a structured representation of the legacy job.&lt;/p&gt;

&lt;p&gt;Legacy ETL Export&lt;br&gt;
→ Metadata Parser&lt;br&gt;
→ Canonical Metadata Model&lt;br&gt;
→ Transformation Graph&lt;br&gt;
→ Migration Blueprint&lt;/p&gt;

&lt;p&gt;This is much more valuable than simply reading transformation code line by line.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Reconstructing the Transformation Graph&lt;/p&gt;

&lt;p&gt;Once the metadata is extracted, the next step is to reconstruct the data lineage and transformation graph.&lt;/p&gt;

&lt;p&gt;Consider this fictional example:&lt;/p&gt;

&lt;p&gt;orders.csv&lt;br&gt;
     ↓&lt;br&gt;
filter cancelled orders&lt;br&gt;
     ↓&lt;br&gt;
lookup customer master&lt;br&gt;
     ↓&lt;br&gt;
standardize customer status&lt;br&gt;
     ↓&lt;br&gt;
deduplicate by order_id&lt;br&gt;
     ↓&lt;br&gt;
calculate order_amount&lt;br&gt;
     ↓&lt;br&gt;
aggregate monthly sales&lt;br&gt;
     ↓&lt;br&gt;
monthly_sales_summary&lt;/p&gt;

&lt;p&gt;This graph reveals several different kinds of work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raw ingestion&lt;/li&gt;
&lt;li&gt;filtering&lt;/li&gt;
&lt;li&gt;enrichment&lt;/li&gt;
&lt;li&gt;standardization&lt;/li&gt;
&lt;li&gt;deduplication&lt;/li&gt;
&lt;li&gt;business calculation&lt;/li&gt;
&lt;li&gt;reporting aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These should not all be treated as one technical unit.&lt;/p&gt;

&lt;p&gt;The transformation graph helps identify where the data changes, why it changes, and which downstream outputs depend on those changes.&lt;/p&gt;

&lt;p&gt;It also makes hidden business logic visible.&lt;/p&gt;
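&lt;p&gt;The example graph is small enough to model directly. The sketch below stores each step with the steps it reads from, then uses the standard-library &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9 or later) to derive an execution order. The helper names are hypothetical.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it reads from (its predecessors).
GRAPH = {
    "orders.csv": set(),
    "filter cancelled orders": {"orders.csv"},
    "lookup customer master": {"filter cancelled orders"},
    "standardize customer status": {"lookup customer master"},
    "deduplicate by order_id": {"standardize customer status"},
    "calculate order_amount": {"deduplicate by order_id"},
    "aggregate monthly sales": {"calculate order_amount"},
    "monthly_sales_summary": {"aggregate monthly sales"},
}


def execution_order(graph):
    """Return the steps in an order where every step follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(graph, step):
    """Collect every step that directly or indirectly depends on `step`."""
    affected = set()
    frontier = [step]
    while frontier:
        current = frontier.pop()
        for node, inputs in graph.items():
            if current in inputs and node not in affected:
                affected.add(node)
                frontier.append(node)
    return affected
```

&lt;p&gt;Calling &lt;code&gt;downstream_of(GRAPH, "calculate order_amount")&lt;/code&gt; answers the impact question directly: a change to that calculation affects the monthly aggregation and the summary output, and nothing upstream.&lt;/p&gt;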

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Mapping Legacy ETL Logic to Bronze, Silver, and Gold&lt;/p&gt;

&lt;p&gt;The Medallion architecture is useful because it separates responsibilities.&lt;/p&gt;

&lt;p&gt;Here is a practical way to classify legacy ETL logic.&lt;/p&gt;

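&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Legacy ETL pattern&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Likely Medallion layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;File, API, database, or CDC extraction&lt;/td&gt;
&lt;td&gt;Raw source ingestion&lt;/td&gt;
&lt;td&gt;Bronze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source preservation and ingestion metadata&lt;/td&gt;
&lt;td&gt;Capture original source state&lt;/td&gt;
&lt;td&gt;Bronze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Basic schema enforcement&lt;/td&gt;
&lt;td&gt;Standardized ingestion&lt;/td&gt;
&lt;td&gt;Bronze or Silver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trim, cast, rename, null cleanup&lt;/td&gt;
&lt;td&gt;Cleansing and standardization&lt;/td&gt;
&lt;td&gt;Silver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;Record normalization&lt;/td&gt;
&lt;td&gt;Silver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lookup and reference joins&lt;/td&gt;
&lt;td&gt;Enrichment and conformance&lt;/td&gt;
&lt;td&gt;Silver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SCD handling&lt;/td&gt;
&lt;td&gt;Historical dimensional processing&lt;/td&gt;
&lt;td&gt;Silver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business calculations&lt;/td&gt;
&lt;td&gt;Curated business logic&lt;/td&gt;
&lt;td&gt;Gold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregation and KPI creation&lt;/td&gt;
&lt;td&gt;Reporting-ready metrics&lt;/td&gt;
&lt;td&gt;Gold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard/report output&lt;/td&gt;
&lt;td&gt;Consumption-ready dataset&lt;/td&gt;
&lt;td&gt;Gold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;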

&lt;p&gt;The important point is that a legacy component type does not automatically determine the Medallion layer.&lt;/p&gt;

&lt;p&gt;For example, a DataStage Transformer stage might perform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;string trimming&lt;/li&gt;
&lt;li&gt;null handling&lt;/li&gt;
&lt;li&gt;a business calculation&lt;/li&gt;
&lt;li&gt;a customer lookup&lt;/li&gt;
&lt;li&gt;a reporting aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not all Silver transformations.&lt;/p&gt;

&lt;p&gt;The migration process needs to inspect the intent of the logic.&lt;/p&gt;
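&lt;p&gt;Classifying by intent instead of by component type can be sketched as a simple lookup with a safe fallback. The intent labels and the &lt;code&gt;Operation&lt;/code&gt; structure below are hypothetical; in practice the intent would come from analysis of the expression, not from a hand-written field.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical intent labels, following the classification in this article.
LAYER_BY_INTENT = {
    "ingestion": "Bronze",
    "cleansing": "Silver",
    "deduplication": "Silver",
    "enrichment": "Silver",
    "business_calculation": "Gold",
    "aggregation": "Gold",
}


@dataclass
class Operation:
    component: str    # legacy component type, e.g. "Transformer"
    description: str
    intent: str       # assigned by analysis, not derived from the component type


def propose_layer(op: Operation) -> str:
    # Unknown intents are routed to a person instead of being guessed.
    return LAYER_BY_INTENT.get(op.intent, "Needs human review")


# All five operations live in the same legacy component type.
transformer_ops = [
    Operation("Transformer", "string trimming", "cleansing"),
    Operation("Transformer", "customer lookup", "enrichment"),
    Operation("Transformer", "business calculation", "business_calculation"),
    Operation("Transformer", "reporting aggregation", "aggregation"),
    Operation("Transformer", "custom routine", "unknown"),
]
proposals = {op.description: propose_layer(op) for op in transformer_ops}
```

&lt;p&gt;The same component type ends up spread across Silver, Gold, and a review queue, which is the point: the component name alone carries no layer information.&lt;/p&gt;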

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Example: One Legacy Job Becomes Multiple Databricks Layers&lt;/p&gt;

&lt;p&gt;Imagine this fictional legacy ETL workflow:&lt;/p&gt;

&lt;p&gt;Oracle Orders&lt;br&gt;
→ Transformer: trim strings and standardize status&lt;br&gt;
→ Lookup: customer master&lt;br&gt;
→ Transformer: calculate net_amount&lt;br&gt;
→ Aggregator: monthly sales by customer&lt;br&gt;
→ Reporting table&lt;/p&gt;

&lt;p&gt;A modern Databricks Medallion proposal could look like this:&lt;/p&gt;

&lt;p&gt;Bronze Layer&lt;br&gt;
bronze_orders_raw&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingest raw Oracle orders&lt;/li&gt;
&lt;li&gt;Preserve source fields&lt;/li&gt;
&lt;li&gt;Add ingestion timestamp&lt;/li&gt;
&lt;li&gt;Add source identifier&lt;/li&gt;
&lt;li&gt;Add load date&lt;/li&gt;
&lt;li&gt;Retain raw records for traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Silver Layer&lt;br&gt;
silver_orders&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trim and standardize string fields&lt;/li&gt;
&lt;li&gt;Standardize status values&lt;/li&gt;
&lt;li&gt;Validate schema&lt;/li&gt;
&lt;li&gt;Apply null-handling rules&lt;/li&gt;
&lt;li&gt;Deduplicate order records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;silver_orders_enriched&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join customer master data&lt;/li&gt;
&lt;li&gt;Resolve customer keys&lt;/li&gt;
&lt;li&gt;Apply standardized enrichment logic&lt;/li&gt;
&lt;li&gt;Calculate normalized net_amount&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gold Layer&lt;br&gt;
gold_customer_monthly_sales&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate net sales by customer and month&lt;/li&gt;
&lt;li&gt;Apply approved reporting definitions&lt;/li&gt;
&lt;li&gt;Produce a curated business-ready output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates clearer ownership.&lt;/p&gt;

&lt;p&gt;Bronze preserves the source.&lt;/p&gt;

&lt;p&gt;Silver prepares trusted, reusable data.&lt;/p&gt;

&lt;p&gt;Gold provides business-facing outputs.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What AI Can Assist With&lt;/p&gt;

&lt;p&gt;AI can make this migration process faster and more structured.&lt;/p&gt;

&lt;p&gt;For example, an AI-assisted migration workflow can help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarize legacy job purpose&lt;/li&gt;
&lt;li&gt;parse transformation expressions&lt;/li&gt;
&lt;li&gt;identify source and target dependencies&lt;/li&gt;
&lt;li&gt;reconstruct lineage&lt;/li&gt;
&lt;li&gt;classify transformations by intent&lt;/li&gt;
&lt;li&gt;detect embedded business logic&lt;/li&gt;
&lt;li&gt;suggest Bronze, Silver, and Gold placement&lt;/li&gt;
&lt;li&gt;draft PySpark or Spark SQL&lt;/li&gt;
&lt;li&gt;generate Delta table DDL&lt;/li&gt;
&lt;li&gt;propose data-quality checks&lt;/li&gt;
&lt;li&gt;generate reconciliation logic&lt;/li&gt;
&lt;li&gt;create migration documentation&lt;/li&gt;
&lt;li&gt;identify unclear or risky rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suppose a legacy rule says:&lt;/p&gt;

&lt;p&gt;IF status_code = 'C' THEN 'Closed' ELSE 'Open'&lt;/p&gt;

&lt;p&gt;An AI system can suggest:&lt;/p&gt;

&lt;p&gt;Likely classification:&lt;br&gt;
Silver-layer standardization rule&lt;br&gt;
Potential concern:&lt;br&gt;
Confirm whether status_code = 'C' means Closed across all source systems.&lt;br&gt;
Recommended action:&lt;br&gt;
Human review required before finalizing the standardization rule.&lt;/p&gt;

&lt;p&gt;That is useful because the system is not pretending to know the business definition.&lt;/p&gt;

&lt;p&gt;It is surfacing the decision that must be made.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;What Still Requires Human Review&lt;/p&gt;

&lt;p&gt;AI can accelerate analysis and drafting, but human accountability remains essential.&lt;/p&gt;

&lt;p&gt;Humans should continue to make final decisions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;business definitions&lt;/li&gt;
&lt;li&gt;source-of-truth selection&lt;/li&gt;
&lt;li&gt;financial logic&lt;/li&gt;
&lt;li&gt;regulatory calculations&lt;/li&gt;
&lt;li&gt;data-retention policies&lt;/li&gt;
&lt;li&gt;exception handling&lt;/li&gt;
&lt;li&gt;data-quality thresholds&lt;/li&gt;
&lt;li&gt;reporting metrics&lt;/li&gt;
&lt;li&gt;production deployment approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, a legacy aggregation may calculate:&lt;/p&gt;

&lt;p&gt;SUM(revenue) BY region, month&lt;/p&gt;

&lt;p&gt;The technical migration system may recommend placing this aggregation in the Gold layer.&lt;/p&gt;

&lt;p&gt;But a human must still answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is revenue gross or net?&lt;/li&gt;
&lt;li&gt;Are refunds included?&lt;/li&gt;
&lt;li&gt;Does month use calendar month or fiscal month?&lt;/li&gt;
&lt;li&gt;Is region derived from customer, store, or sales territory?&lt;/li&gt;
&lt;li&gt;Is this metric reusable across reports?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are business and governance questions, not merely coding questions.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Role of a Canonical Metadata Model&lt;/p&gt;

&lt;p&gt;A Canonical Metadata Model can become the bridge between legacy ETL and modern data architecture.&lt;/p&gt;

&lt;p&gt;It can represent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sources&lt;/li&gt;
&lt;li&gt;targets&lt;/li&gt;
&lt;li&gt;columns&lt;/li&gt;
&lt;li&gt;transformations&lt;/li&gt;
&lt;li&gt;joins&lt;/li&gt;
&lt;li&gt;keys&lt;/li&gt;
&lt;li&gt;data types&lt;/li&gt;
&lt;li&gt;quality expectations&lt;/li&gt;
&lt;li&gt;lineage&lt;/li&gt;
&lt;li&gt;business definitions&lt;/li&gt;
&lt;li&gt;approval status&lt;/li&gt;
&lt;li&gt;assumptions&lt;/li&gt;
&lt;li&gt;migration decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once metadata is normalized, multiple outputs can be generated from the same source of truth.&lt;/p&gt;

&lt;p&gt;Canonical Metadata Model&lt;br&gt;
→ Databricks Medallion Architecture Proposal&lt;br&gt;
→ PySpark / Spark SQL&lt;br&gt;
→ Delta Table DDL&lt;br&gt;
→ Data Quality Rules&lt;br&gt;
→ Reconciliation Checks&lt;br&gt;
→ Lineage Documentation&lt;br&gt;
→ Migration Specification&lt;br&gt;
→ Human Review Queue&lt;/p&gt;

&lt;p&gt;This is more powerful than isolated code conversion because it creates reusable engineering intelligence.&lt;/p&gt;
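&lt;p&gt;To illustrate generating several outputs from one model, here is a minimal sketch that renders both a Delta table definition and not-null checks from the same structure. The &lt;code&gt;Column&lt;/code&gt; and &lt;code&gt;Table&lt;/code&gt; classes, the column list, and the business definitions are assumptions for the example and do not describe an existing product.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    data_type: str
    nullable: bool
    definition: str


@dataclass
class Table:
    name: str
    columns: list


def to_ddl(table: Table) -> str:
    """Render a CREATE TABLE statement from the model."""
    lines = []
    for col in table.columns:
        suffix = "" if col.nullable else " NOT NULL"
        lines.append(f"  {col.name} {col.data_type}{suffix}")
    return f"CREATE TABLE {table.name} (\n" + ",\n".join(lines) + "\n) USING DELTA"


def to_quality_checks(table: Table) -> list:
    """Derive one not-null check per non-nullable column."""
    return [
        f"SELECT COUNT(*) FROM {table.name} WHERE {col.name} IS NULL"
        for col in table.columns
        if not col.nullable
    ]


# Hypothetical model entry for the Silver orders table.
silver_orders = Table("silver_orders", [
    Column("order_id", "BIGINT", False, "Unique order identifier"),
    Column("net_amount", "DECIMAL(18,2)", True, "Order amount after discounts"),
])
ddl = to_ddl(silver_orders)
checks = to_quality_checks(silver_orders)
```

&lt;p&gt;Because both generators read the same &lt;code&gt;nullable&lt;/code&gt; flag, the table constraint and the quality check cannot drift apart the way two hand-written artifacts can.&lt;/p&gt;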

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;How Data Engineering Copilot Could Support Legacy ETL Migration&lt;/p&gt;

&lt;p&gt;A future Data Engineering Copilot capability could act as a Legacy ETL Migration Copilot.&lt;/p&gt;

&lt;p&gt;Inputs could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DataStage export files&lt;/li&gt;
&lt;li&gt;Informatica mapping exports&lt;/li&gt;
&lt;li&gt;workflow metadata&lt;/li&gt;
&lt;li&gt;SQL procedures&lt;/li&gt;
&lt;li&gt;ETL job documentation&lt;/li&gt;
&lt;li&gt;source-to-target mappings&lt;/li&gt;
&lt;li&gt;data model documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow could be:&lt;/p&gt;

&lt;p&gt;Legacy ETL Export&lt;br&gt;
→ Parse Job Metadata&lt;br&gt;
→ Build Transformation Graph&lt;br&gt;
→ Identify Dependencies&lt;br&gt;
→ Classify Transformation Intent&lt;br&gt;
→ Propose Bronze / Silver / Gold Layers&lt;br&gt;
→ Generate Migration Artifacts&lt;br&gt;
→ Flag Ambiguity&lt;br&gt;
→ Route for Human Review&lt;/p&gt;

&lt;p&gt;Potential outputs could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Medallion architecture recommendation&lt;/li&gt;
&lt;li&gt;Bronze, Silver, and Gold pipeline design&lt;/li&gt;
&lt;li&gt;Databricks notebook structure&lt;/li&gt;
&lt;li&gt;PySpark code drafts&lt;/li&gt;
&lt;li&gt;Spark SQL transformations&lt;/li&gt;
&lt;li&gt;Delta table definitions&lt;/li&gt;
&lt;li&gt;data-quality rules&lt;/li&gt;
&lt;li&gt;reconciliation checks&lt;/li&gt;
&lt;li&gt;migration documentation&lt;/li&gt;
&lt;li&gt;dependency analysis&lt;/li&gt;
&lt;li&gt;lineage diagrams&lt;/li&gt;
&lt;li&gt;review questions for unresolved logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is not automatic migration without oversight.&lt;/p&gt;

&lt;p&gt;The key is to turn hidden legacy ETL logic into a reviewable modernization blueprint.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Migration Is a Metadata and Architecture Problem&lt;/p&gt;

&lt;p&gt;Many legacy ETL modernization efforts fail because they focus only on tool replacement.&lt;/p&gt;

&lt;p&gt;But old ETL jobs often contain years of accumulated business knowledge.&lt;/p&gt;

&lt;p&gt;That knowledge may be undocumented.&lt;/p&gt;

&lt;p&gt;It may be hidden inside transformations, lookups, stored procedures, filters, sequencing rules, and exception logic.&lt;/p&gt;

&lt;p&gt;A successful migration must preserve that knowledge while improving the architecture.&lt;/p&gt;

&lt;p&gt;That means the migration process should:&lt;/p&gt;

&lt;p&gt;Extract metadata&lt;br&gt;
→ Reconstruct lineage&lt;br&gt;
→ Identify transformation intent&lt;br&gt;
→ Separate technical and business responsibilities&lt;br&gt;
→ Propose Medallion layers&lt;br&gt;
→ Generate reviewable artifacts&lt;br&gt;
→ Capture assumptions&lt;br&gt;
→ Require human approval&lt;/p&gt;

&lt;p&gt;The future of ETL modernization is not simply translating one tool into another.&lt;/p&gt;

&lt;p&gt;It is making legacy data logic visible, structured, governed, and reusable.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Closing Thought&lt;/p&gt;

&lt;p&gt;DataStage and Informatica jobs were often built in an era when ingestion, cleansing, business logic, and reporting were tightly combined.&lt;/p&gt;

&lt;p&gt;Databricks Medallion architecture gives teams an opportunity to separate those responsibilities and create cleaner, more maintainable data products.&lt;/p&gt;

&lt;p&gt;But that opportunity is lost when organizations perform blind one-to-one conversion.&lt;/p&gt;

&lt;p&gt;The better approach is to treat legacy ETL modernization as a metadata-driven architecture exercise.&lt;/p&gt;

&lt;p&gt;Do not just convert legacy jobs into new code.&lt;br&gt;
Convert hidden transformation logic into a reviewable modernization blueprint.&lt;/p&gt;

&lt;p&gt;That is where AI-assisted metadata platforms can create real value for enterprise data engineering teams.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Data Engineering Copilot is a personal product initiative focused on metadata-driven engineering and governed delivery workflows.&lt;/p&gt;

&lt;p&gt;Illustrative examples in this article use fictional metadata only. No client, employer, production, or proprietary information is included.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>automation</category>
    </item>
    <item>
      <title>From Legacy Data Platforms to Modern Data Stacks: Why Metadata Matters More Than Technology</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Sun, 21 Jun 2026 08:16:02 +0000</pubDate>
      <link>https://dev.to/amising6/from-legacy-data-platforms-to-modern-data-stacks-why-metadata-matters-more-than-technology-33oj</link>
      <guid>https://dev.to/amising6/from-legacy-data-platforms-to-modern-data-stacks-why-metadata-matters-more-than-technology-33oj</guid>
      <description>&lt;p&gt;Introduction&lt;/p&gt;

&lt;p&gt;Organizations spend millions of dollars modernizing data platforms.&lt;/p&gt;

&lt;p&gt;They migrate from on-premise databases to cloud warehouses. They replace legacy ETL tools with Spark and cloud-native orchestration. They introduce modern observability platforms, data catalogs, semantic layers, and AI-powered analytics.&lt;/p&gt;

&lt;p&gt;Yet many modernization programs struggle despite adopting the latest technology.&lt;/p&gt;

&lt;p&gt;The reason is surprisingly simple:&lt;/p&gt;

&lt;p&gt;Technology changes.&lt;/p&gt;

&lt;p&gt;Metadata remains.&lt;/p&gt;

&lt;p&gt;Most modernization projects focus on moving code. Few focus on understanding and preserving the metadata that defines the business.&lt;/p&gt;

&lt;p&gt;This is where metadata-driven engineering changes the conversation.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Traditional Modernization Approach&lt;/p&gt;

&lt;p&gt;A typical legacy modernization initiative looks something like this:&lt;/p&gt;

&lt;p&gt;Legacy Environment&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Oracle&lt;/li&gt;
&lt;li&gt;Teradata&lt;/li&gt;
&lt;li&gt;Netezza&lt;/li&gt;
&lt;li&gt;Informatica&lt;/li&gt;
&lt;li&gt;DataStage&lt;/li&gt;
&lt;li&gt;SSIS&lt;/li&gt;
&lt;li&gt;Stored Procedures&lt;/li&gt;
&lt;li&gt;Excel-Based Documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target Environment&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;li&gt;Airflow&lt;/li&gt;
&lt;li&gt;Monte Carlo&lt;/li&gt;
&lt;li&gt;Power BI&lt;/li&gt;
&lt;li&gt;Sigma&lt;/li&gt;
&lt;li&gt;Cloud Storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The migration process usually involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reverse engineering legacy pipelines&lt;/li&gt;
&lt;li&gt;Understanding business logic&lt;/li&gt;
&lt;li&gt;Rewriting transformations&lt;/li&gt;
&lt;li&gt;Rebuilding data models&lt;/li&gt;
&lt;li&gt;Recreating documentation&lt;/li&gt;
&lt;li&gt;Reimplementing data quality checks&lt;/li&gt;
&lt;li&gt;Validating outputs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The challenge is that every artifact is treated as a separate deliverable.&lt;/p&gt;

&lt;p&gt;Engineers repeatedly translate the same business requirements into different technical formats.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Real Asset Is Not The Code&lt;/p&gt;

&lt;p&gt;Most organizations assume the code is the asset.&lt;/p&gt;

&lt;p&gt;In reality, the most valuable asset is the metadata that describes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source systems&lt;/li&gt;
&lt;li&gt;Business entities&lt;/li&gt;
&lt;li&gt;Data definitions&lt;/li&gt;
&lt;li&gt;Transformation logic&lt;/li&gt;
&lt;li&gt;Relationships&lt;/li&gt;
&lt;li&gt;Data quality rules&lt;/li&gt;
&lt;li&gt;Ownership&lt;/li&gt;
&lt;li&gt;Governance policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technology platforms evolve every few years.&lt;/p&gt;

&lt;p&gt;Business definitions often survive for decades.&lt;/p&gt;

&lt;p&gt;A customer is still a customer.&lt;/p&gt;

&lt;p&gt;A policy is still a policy.&lt;/p&gt;

&lt;p&gt;A claim is still a claim.&lt;/p&gt;

&lt;p&gt;What changes is how those concepts are implemented.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Metadata Problem&lt;/p&gt;

&lt;p&gt;Consider a simple customer field.&lt;/p&gt;

&lt;p&gt;In a legacy platform it might appear as:&lt;/p&gt;

&lt;p&gt;CUSTOMER_ID&lt;/p&gt;

&lt;p&gt;In Snowflake it becomes:&lt;/p&gt;

&lt;p&gt;CUSTOMER_KEY&lt;/p&gt;

&lt;p&gt;In Power BI it appears as:&lt;/p&gt;

&lt;p&gt;Customer Identifier&lt;/p&gt;

&lt;p&gt;In a data catalog it appears as:&lt;/p&gt;

&lt;p&gt;Business Customer Reference&lt;/p&gt;

&lt;p&gt;The technology changes.&lt;/p&gt;

&lt;p&gt;The meaning remains the same.&lt;/p&gt;

&lt;p&gt;Modernization projects spend enormous effort rediscovering and translating metadata that already exists somewhere in the organization.&lt;/p&gt;

&lt;p&gt;This creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivery delays&lt;/li&gt;
&lt;li&gt;Documentation drift&lt;/li&gt;
&lt;li&gt;Inconsistent implementations&lt;/li&gt;
&lt;li&gt;Increased testing effort&lt;/li&gt;
&lt;li&gt;Knowledge dependency on SMEs&lt;/li&gt;
&lt;/ul&gt;
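&lt;p&gt;The customer field example can be captured as one canonical attribute with platform-specific names attached. The sketch below is only an illustration: the canonical name &lt;code&gt;customer_identifier&lt;/code&gt;, the platform keys, and the helper functions are all hypothetical.&lt;/p&gt;

```python
# One canonical attribute, with the platform-specific names from the example above.
CANONICAL_ATTRIBUTES = {
    "customer_identifier": {
        "legacy": "CUSTOMER_ID",
        "snowflake": "CUSTOMER_KEY",
        "power_bi": "Customer Identifier",
        "data_catalog": "Business Customer Reference",
    }
}


def resolve(platform_name: str):
    """Find the canonical attribute that a platform-specific name refers to."""
    for canonical, aliases in CANONICAL_ATTRIBUTES.items():
        if platform_name in aliases.values():
            return canonical
    return None


def name_on(canonical: str, platform: str) -> str:
    """Look up how a canonical attribute is named on a given platform."""
    return CANONICAL_ATTRIBUTES[canonical][platform]
```

&lt;p&gt;With a mapping like this, the meaning is recorded once and each platform name becomes a lookup instead of something an SME has to rediscover.&lt;/p&gt;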

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;A Metadata-Driven Modernization Strategy&lt;/p&gt;

&lt;p&gt;Instead of migrating code directly, organizations can first create a standardized metadata representation.&lt;/p&gt;

&lt;p&gt;This becomes a Canonical Metadata Model.&lt;/p&gt;

&lt;p&gt;The Canonical Metadata Model acts as an abstraction layer between business metadata and technology platforms.&lt;/p&gt;

&lt;p&gt;Legacy Sources&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;STTM Documents&lt;/li&gt;
&lt;li&gt;Data Dictionaries&lt;/li&gt;
&lt;li&gt;Data Models&lt;/li&gt;
&lt;li&gt;Legacy ETL Jobs&lt;/li&gt;
&lt;li&gt;Database Schemas&lt;/li&gt;
&lt;li&gt;Business Rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;Canonical Metadata Model&lt;/p&gt;

&lt;p&gt;Standardized representation of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Entities&lt;/li&gt;
&lt;li&gt;Attributes&lt;/li&gt;
&lt;li&gt;Relationships&lt;/li&gt;
&lt;li&gt;Transformations&lt;/li&gt;
&lt;li&gt;Data Quality Rules&lt;/li&gt;
&lt;li&gt;Lineage&lt;/li&gt;
&lt;li&gt;Governance&lt;/li&gt;
&lt;li&gt;Business Definitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;Modern Outputs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake DDL&lt;/li&gt;
&lt;li&gt;Databricks Notebooks&lt;/li&gt;
&lt;li&gt;dbt Models&lt;/li&gt;
&lt;li&gt;Airflow DAGs&lt;/li&gt;
&lt;li&gt;Monte Carlo Configurations&lt;/li&gt;
&lt;li&gt;ER Diagrams&lt;/li&gt;
&lt;li&gt;Data Dictionaries&lt;/li&gt;
&lt;li&gt;Technical Specifications&lt;/li&gt;
&lt;li&gt;Power BI Semantic Models&lt;/li&gt;
&lt;li&gt;Sigma Semantic Models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build Once. Generate Everywhere.&lt;/p&gt;
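
&lt;p&gt;A minimal sketch of "build once, generate everywhere", in Python. The class names, the type mapping, and the DIM_CUSTOMER table are assumptions made for illustration; this is not DE Copilot's actual model. One canonical entity feeds two generators, so the DDL and the data dictionary cannot drift apart.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One canonical attribute. All field names are illustrative."""
    canonical_name: str      # platform-neutral column name
    data_type: str           # logical type, mapped per platform below
    definition: str          # business definition
    nullable: bool = True


@dataclass
class Entity:
    name: str
    attributes: list


# Hypothetical logical-to-Snowflake type mapping.
SNOWFLAKE_TYPES = {"integer": "NUMBER(38,0)", "string": "VARCHAR", "date": "DATE"}


def snowflake_ddl(entity):
    """Render a CREATE TABLE statement from the canonical entity."""
    cols = []
    for attr in entity.attributes:
        null_sql = "" if attr.nullable else " NOT NULL"
        cols.append(f"    {attr.canonical_name} {SNOWFLAKE_TYPES[attr.data_type]}{null_sql}")
    return f"CREATE TABLE {entity.name} (\n" + ",\n".join(cols) + "\n);"


def data_dictionary(entity):
    """Render data dictionary rows from the same canonical entity."""
    return [
        {"table": entity.name, "column": a.canonical_name, "definition": a.definition}
        for a in entity.attributes
    ]


# Hypothetical entity built from the customer field example above.
customer = Entity(
    name="DIM_CUSTOMER",
    attributes=[
        Attribute("CUSTOMER_KEY", "integer", "Business Customer Reference", nullable=False),
        Attribute("CUSTOMER_NAME", "string", "Customer display name"),
    ],
)

ddl = snowflake_ddl(customer)
dictionary = data_dictionary(customer)
print(ddl)
```

&lt;p&gt;Adding another output, such as a dbt model, would mean adding another function that reads the same Entity, not another translation of the source documents.&lt;/p&gt;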

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;How DE Copilot Approaches Modernization&lt;/p&gt;

&lt;p&gt;DE Copilot is built around this concept.&lt;/p&gt;

&lt;p&gt;Instead of generating individual artifacts independently, the platform converts enterprise metadata into a Canonical Metadata Model.&lt;/p&gt;

&lt;p&gt;The Canonical Metadata Model becomes the single source of truth.&lt;/p&gt;

&lt;p&gt;Once standardized, generators can produce multiple technology-specific outputs.&lt;/p&gt;

&lt;p&gt;Current Capabilities&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake DDL Generation&lt;/li&gt;
&lt;li&gt;Snowflake SQL Generation&lt;/li&gt;
&lt;li&gt;Data Dictionary Generation&lt;/li&gt;
&lt;li&gt;Technical Specification Generation&lt;/li&gt;
&lt;li&gt;Data Quality Rule Generation&lt;/li&gt;
&lt;li&gt;AI Metadata Analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Future Roadmap&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ER Diagram Generation&lt;/li&gt;
&lt;li&gt;dbt Model Generation&lt;/li&gt;
&lt;li&gt;Databricks Notebook Generation&lt;/li&gt;
&lt;li&gt;Airflow DAG Generation&lt;/li&gt;
&lt;li&gt;Monte Carlo Configuration Generation&lt;/li&gt;
&lt;li&gt;Power BI Semantic Model Generation&lt;/li&gt;
&lt;li&gt;Sigma Semantic Model Generation&lt;/li&gt;
&lt;li&gt;Knowledge Discovery Copilot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;Why This Matters&lt;/p&gt;

&lt;p&gt;Modernization projects often fail because organizations rebuild the same knowledge repeatedly.&lt;/p&gt;

&lt;p&gt;Every new platform requires another translation exercise.&lt;/p&gt;

&lt;p&gt;A metadata-driven approach changes that.&lt;/p&gt;

&lt;p&gt;Instead of rewriting business logic for every technology, organizations standardize metadata once and generate multiple implementations.&lt;/p&gt;

&lt;p&gt;The focus shifts from technology migration to metadata preservation.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;The Future of Data Engineering&lt;/p&gt;

&lt;p&gt;For decades, data engineering has been centered around code.&lt;/p&gt;

&lt;p&gt;The next generation of platforms will be centered around metadata.&lt;/p&gt;

&lt;p&gt;Engineers will spend less time translating spreadsheets into code and more time solving business problems.&lt;/p&gt;

&lt;p&gt;The winning organizations will not be the ones with the newest technology stack.&lt;/p&gt;

&lt;p&gt;They will be the ones that understand their metadata best.&lt;/p&gt;

&lt;p&gt;Because technology changes.&lt;/p&gt;

&lt;p&gt;Metadata endures.&lt;/p&gt;

&lt;p&gt;And when metadata becomes the product, modernization becomes dramatically simpler.&lt;/p&gt;

&lt;p&gt;⸻&lt;/p&gt;

&lt;p&gt;About DE Copilot&lt;/p&gt;

&lt;p&gt;DE Copilot is a metadata-driven engineering platform that transforms enterprise Source-to-Target Mapping (STTM) documents into production-ready engineering artifacts through a Canonical Metadata Model.&lt;/p&gt;

&lt;p&gt;Learn more:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dataengineeringcopilot.com" rel="noopener noreferrer"&gt;https://dataengineeringcopilot.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read:&lt;/p&gt;

&lt;p&gt;The Canonical Metadata Model: The Engine Behind DE Copilot&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>python</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What I Learned After Reviewing Many AI and Developer Projects as a Hackathon Judge</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:12:59 +0000</pubDate>
      <link>https://dev.to/amising6/what-i-learned-after-reviewing-many-ai-and-developer-projects-as-a-hackathon-judge-2g06</link>
      <guid>https://dev.to/amising6/what-i-learned-after-reviewing-many-ai-and-developer-projects-as-a-hackathon-judge-2g06</guid>
      <description>&lt;p&gt;Over the last few days, I had the opportunity to review a large number of submissions across developer and AI-focused hackathon challenges.&lt;/p&gt;

&lt;p&gt;It was a very different experience from building a project myself.&lt;/p&gt;

&lt;p&gt;When you are building, you mostly think about your own idea, your own code, and your own constraints.&lt;/p&gt;

&lt;p&gt;When you are judging, you start seeing patterns across many builders.&lt;/p&gt;

&lt;p&gt;Some projects had beautiful interfaces but limited technical depth.&lt;/p&gt;

&lt;p&gt;Some had very strong engineering but needed better documentation.&lt;/p&gt;

&lt;p&gt;Some were simple ideas, but solved a real problem clearly.&lt;/p&gt;

&lt;p&gt;Some were ambitious platforms, but still needed stronger proof of usability, reliability, or completion.&lt;/p&gt;

&lt;p&gt;A few lessons stood out to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. A good project is not only about the idea
&lt;/h2&gt;

&lt;p&gt;Many submissions had interesting ideas.&lt;/p&gt;

&lt;p&gt;But the stronger ones clearly showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what problem they were solving&lt;/li&gt;
&lt;li&gt;what existed before&lt;/li&gt;
&lt;li&gt;what was improved&lt;/li&gt;
&lt;li&gt;what technical choices were made&lt;/li&gt;
&lt;li&gt;what the user can actually do now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference between “interesting” and “strong” was usually execution clarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Completion matters
&lt;/h2&gt;

&lt;p&gt;In a finish-up style challenge, the best projects were not always the flashiest.&lt;/p&gt;

&lt;p&gt;The best ones showed a real before-and-after story.&lt;/p&gt;

&lt;p&gt;Examples of strong completion signals included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;broken workflows fixed&lt;/li&gt;
&lt;li&gt;apps deployed publicly&lt;/li&gt;
&lt;li&gt;documentation improved&lt;/li&gt;
&lt;li&gt;tests added&lt;/li&gt;
&lt;li&gt;security gaps reduced&lt;/li&gt;
&lt;li&gt;onboarding improved&lt;/li&gt;
&lt;li&gt;production-readiness increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shipping matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Documentation is part of engineering
&lt;/h2&gt;

&lt;p&gt;Some technically strong projects were harder to evaluate because the documentation was thin.&lt;/p&gt;

&lt;p&gt;A clear README, architecture diagram, demo video, screenshots, setup steps, and known limitations can significantly improve how a project is understood.&lt;/p&gt;

&lt;p&gt;Good documentation does not replace good engineering.&lt;/p&gt;

&lt;p&gt;But it helps people trust the engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. AI-assisted development still needs human judgment
&lt;/h2&gt;

&lt;p&gt;Many projects used AI tools like GitHub Copilot.&lt;/p&gt;

&lt;p&gt;The stronger submissions were honest about how AI helped.&lt;/p&gt;

&lt;p&gt;They did not claim that AI magically built the entire project.&lt;/p&gt;

&lt;p&gt;Instead, they explained how AI helped with boilerplate, debugging, refactoring, documentation, test cases, UI polish, or repetitive implementation work.&lt;/p&gt;

&lt;p&gt;That is a realistic and mature use of AI-assisted development.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Real-world thinking stands out
&lt;/h2&gt;

&lt;p&gt;The projects that stood out most often had practical engineering judgment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;security considerations&lt;/li&gt;
&lt;li&gt;user onboarding&lt;/li&gt;
&lt;li&gt;error handling&lt;/li&gt;
&lt;li&gt;observability&lt;/li&gt;
&lt;li&gt;privacy&lt;/li&gt;
&lt;li&gt;reliability&lt;/li&gt;
&lt;li&gt;deployment readiness&lt;/li&gt;
&lt;li&gt;maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the things that turn a demo into a product.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Simple but complete can beat ambitious but unclear
&lt;/h2&gt;

&lt;p&gt;A focused project with a working demo, clear use case, and thoughtful finishing work can be stronger than a large idea with missing proof.&lt;/p&gt;

&lt;p&gt;Clarity matters.&lt;/p&gt;

&lt;p&gt;Completeness matters.&lt;/p&gt;

&lt;p&gt;Evidence matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Judging these projects reminded me how much energy and creativity exist in the developer community.&lt;/p&gt;

&lt;p&gt;It also reinforced something I strongly believe:&lt;/p&gt;

&lt;p&gt;Building software is not only about writing code.&lt;/p&gt;

&lt;p&gt;It is about solving a problem, explaining the solution, making it usable, and finishing the work well enough that someone else can understand it, trust it, and use it.&lt;/p&gt;

&lt;p&gt;That is where real engineering maturity starts.&lt;/p&gt;

</description>
      <category>hackathon</category>
      <category>devchallenge</category>
      <category>ai</category>
      <category>githubchallenge</category>
    </item>
    <item>
      <title>From Metadata to Knowledge Discovery: Why I Am Not Starting With a Chatbot</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Tue, 16 Jun 2026 03:44:37 +0000</pubDate>
      <link>https://dev.to/amising6/-from-metadata-to-knowledge-discovery-why-i-am-not-starting-with-a-chatbot-5282</link>
      <guid>https://dev.to/amising6/-from-metadata-to-knowledge-discovery-why-i-am-not-starting-with-a-chatbot-5282</guid>
      <description>&lt;p&gt;A lot of AI products today start with the same idea:&lt;/p&gt;

&lt;p&gt;Upload documents.&lt;br&gt;
Ask questions.&lt;br&gt;
Get answers.&lt;/p&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chat with your documents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a powerful pattern.&lt;/p&gt;

&lt;p&gt;But for enterprise data engineering, I do not think every AI product needs to start as a chatbot.&lt;/p&gt;

&lt;p&gt;In fact, starting with a chatbot can make the first version unnecessarily complex.&lt;/p&gt;

&lt;p&gt;The moment we create an open-ended chatbot, we also need to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG&lt;/li&gt;
&lt;li&gt;permissions&lt;/li&gt;
&lt;li&gt;citations&lt;/li&gt;
&lt;li&gt;hallucinations&lt;/li&gt;
&lt;li&gt;evaluation&lt;/li&gt;
&lt;li&gt;guardrails&lt;/li&gt;
&lt;li&gt;scope control&lt;/li&gt;
&lt;li&gt;user intent&lt;/li&gt;
&lt;li&gt;knowledge freshness&lt;/li&gt;
&lt;li&gt;answer traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these are important.&lt;/p&gt;

&lt;p&gt;But they may not be the first problems to solve.&lt;/p&gt;

&lt;p&gt;For the first version of &lt;strong&gt;Data Engineering Copilot&lt;/strong&gt;, I am thinking differently.&lt;/p&gt;

&lt;p&gt;The current MVP is not a chatbot.&lt;/p&gt;

&lt;p&gt;It is a workflow application.&lt;/p&gt;

&lt;p&gt;The flow is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Upload STTM
    ↓
Generate SQL
Generate DQ Rules
Generate Data Dictionary
    ↓
Download Artifacts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may look simple.&lt;/p&gt;

&lt;p&gt;But I think that simplicity is the strength.&lt;/p&gt;

&lt;p&gt;The application is not trying to answer every possible question.&lt;/p&gt;

&lt;p&gt;It is focused on one clear data engineering workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Take structured metadata as input and generate useful engineering artifacts as output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this model, the UI itself becomes a form of scope control.&lt;/p&gt;

&lt;p&gt;The user cannot ask the system to write a Python game.&lt;/p&gt;

&lt;p&gt;The user cannot ask random questions outside the product boundary.&lt;/p&gt;

&lt;p&gt;The user cannot force the system into unrelated tasks.&lt;/p&gt;

&lt;p&gt;The user can only do what the workflow allows:&lt;/p&gt;

&lt;p&gt;Upload metadata.&lt;br&gt;
Validate it.&lt;br&gt;
Generate artifacts.&lt;br&gt;
Download output.&lt;/p&gt;

&lt;p&gt;For an early AI product, that is a powerful design choice.&lt;/p&gt;

&lt;p&gt;It reduces risk.&lt;/p&gt;

&lt;p&gt;It reduces ambiguity.&lt;/p&gt;

&lt;p&gt;It makes evaluation easier.&lt;/p&gt;

&lt;p&gt;It also makes the product easier to explain.&lt;/p&gt;

&lt;p&gt;Instead of saying:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“This is a chatbot for data engineering.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The product can say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“This is a metadata-driven artifact generation engine for data engineering teams.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;Because in data engineering, many tasks are not open-ended conversations.&lt;/p&gt;

&lt;p&gt;They are repeatable workflows.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate Snowflake SQL from STTM&lt;/li&gt;
&lt;li&gt;Generate PySpark transformation logic&lt;/li&gt;
&lt;li&gt;Generate DQ rules&lt;/li&gt;
&lt;li&gt;Generate reconciliation checks&lt;/li&gt;
&lt;li&gt;Generate data dictionaries&lt;/li&gt;
&lt;li&gt;Generate technical specifications&lt;/li&gt;
&lt;li&gt;Validate mappings&lt;/li&gt;
&lt;li&gt;Identify missing metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tasks do not always require a chatbot.&lt;/p&gt;

&lt;p&gt;They require structured input, business rules, validation, and controlled generation.&lt;/p&gt;

&lt;p&gt;That is why I believe the first version of an enterprise AI copilot does not need to be overly complicated.&lt;/p&gt;

&lt;p&gt;It can start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metadata In
    ↓
Artifacts Out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once that foundation is working, the product can evolve.&lt;/p&gt;

&lt;p&gt;Later versions can add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask questions about STTM&lt;/li&gt;
&lt;li&gt;Ask questions about data lineage&lt;/li&gt;
&lt;li&gt;Ask questions about DQ rules&lt;/li&gt;
&lt;li&gt;Ask questions about business definitions&lt;/li&gt;
&lt;li&gt;Ask questions about downstream impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, RAG, citations, permissions, and knowledge discovery become more important.&lt;/p&gt;

&lt;p&gt;But starting with a controlled workflow allows the product to build trust first.&lt;/p&gt;

&lt;p&gt;This is also where guardrails become practical.&lt;/p&gt;

&lt;p&gt;In this MVP, guardrails are not abstract AI safety concepts.&lt;/p&gt;

&lt;p&gt;They are simple engineering checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the STTM file have required columns?&lt;/li&gt;
&lt;li&gt;Are source and target columns populated?&lt;/li&gt;
&lt;li&gt;Are transformation rules present?&lt;/li&gt;
&lt;li&gt;Are data types valid?&lt;/li&gt;
&lt;li&gt;Are target tables defined?&lt;/li&gt;
&lt;li&gt;Can the generated SQL compile?&lt;/li&gt;
&lt;li&gt;Are DQ rules generated for mapped fields?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple validation rule may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;required_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source_Table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source_Column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Target_Table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Target_Column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transformation_Rule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required column: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
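
&lt;p&gt;The column check covers only the first guardrail in the list. A second one, "Are source and target columns populated?", works at row level. The sketch below is one possible shape for it, using plain dictionaries instead of a DataFrame so it runs without extra libraries. The function name and the sample rows are hypothetical.&lt;/p&gt;

```python
REQUIRED_FIELDS = [
    "Source_Table",
    "Source_Column",
    "Target_Table",
    "Target_Column",
    "Transformation_Rule",
]


def find_blank_fields(rows):
    """Return (row_number, field) pairs where a required field is missing or blank."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            # Whitespace-only values count as blank.
            if value is None or str(value).strip() == "":
                problems.append((row_number, field))
    return problems


# Hypothetical STTM rows, shaped like the output of csv.DictReader.
sttm_rows = [
    {"Source_Table": "CUSTOMER", "Source_Column": "CUSTOMER_ID",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_KEY",
     "Transformation_Rule": "Direct move"},
    {"Source_Table": "CUSTOMER", "Source_Column": "",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_NAME",
     "Transformation_Rule": "   "},
]

problems = find_blank_fields(sttm_rows)
for row_number, field in problems:
    print(f"Row {row_number}: {field} is blank")
```

&lt;p&gt;Returning every problem instead of stopping at the first one lets the user fix the whole file in one pass.&lt;/p&gt;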



&lt;p&gt;This is not glamorous.&lt;/p&gt;

&lt;p&gt;But it is real.&lt;/p&gt;

&lt;p&gt;And in enterprise systems, real usually wins.&lt;/p&gt;

&lt;p&gt;Many AI demos look impressive because they allow open-ended conversation.&lt;/p&gt;

&lt;p&gt;But enterprise products survive when they are controlled, testable, traceable, and useful.&lt;/p&gt;

&lt;p&gt;That is why I believe the first step for Data Engineering Copilot should not be:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chat with everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Understand metadata
Generate trusted artifacts
Create repeatable value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chatbot can come later.&lt;/p&gt;

&lt;p&gt;The knowledge discovery layer can come later.&lt;/p&gt;

&lt;p&gt;The agentic workflow can come later.&lt;/p&gt;

&lt;p&gt;The foundation should be simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STTM
    ↓
Canonical Metadata
    ↓
SQL / DQ / Data Dictionary / Specs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the direction I am exploring.&lt;/p&gt;

&lt;p&gt;Not because chatbots are bad.&lt;/p&gt;

&lt;p&gt;But because data engineering teams often need something more specific.&lt;/p&gt;

&lt;p&gt;They need tools that reduce repetitive work.&lt;/p&gt;

&lt;p&gt;They need systems that understand metadata.&lt;/p&gt;

&lt;p&gt;They need outputs that can be reviewed, validated, and improved.&lt;/p&gt;

&lt;p&gt;And eventually, they need AI that can move beyond document retrieval toward evidence-based knowledge discovery.&lt;/p&gt;

&lt;p&gt;That journey starts with a small workflow.&lt;/p&gt;

&lt;p&gt;Upload metadata.&lt;/p&gt;

&lt;p&gt;Generate artifacts.&lt;/p&gt;

&lt;p&gt;Validate output.&lt;/p&gt;

&lt;p&gt;Build trust.&lt;/p&gt;

&lt;p&gt;Then expand.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>dataengineering</category>
      <category>ai</category>
      <category>metadata</category>
    </item>
    <item>
      <title>From RAG to Knowledge Discovery: What Comes Next for Enterprise AI</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Mon, 15 Jun 2026 02:34:55 +0000</pubDate>
      <link>https://dev.to/amising6/from-rag-to-knowledge-discovery-what-comes-next-for-enterprise-ai-49i0</link>
      <guid>https://dev.to/amising6/from-rag-to-knowledge-discovery-what-comes-next-for-enterprise-ai-49i0</guid>
      <description>&lt;p&gt;Over the past two years, Retrieval-Augmented Generation (RAG) has become one of the most widely adopted patterns in enterprise AI.&lt;/p&gt;

&lt;p&gt;The reason is simple.&lt;/p&gt;

&lt;p&gt;Large Language Models are powerful, but they don’t know your company’s internal knowledge.&lt;/p&gt;

&lt;p&gt;RAG solved that problem.&lt;/p&gt;

&lt;p&gt;Instead of relying solely on what a model learned during training, organizations could connect enterprise documents, retrieve relevant information, and provide additional context at runtime.&lt;/p&gt;

&lt;p&gt;The architecture looked something like this:&lt;/p&gt;

&lt;p&gt;Enterprise Documents&lt;br&gt;
        ↓&lt;br&gt;
Chunking&lt;br&gt;
        ↓&lt;br&gt;
Embeddings&lt;br&gt;
        ↓&lt;br&gt;
Vector Database&lt;br&gt;
        ↓&lt;br&gt;
Retrieval&lt;br&gt;
        ↓&lt;br&gt;
LLM&lt;br&gt;
        ↓&lt;br&gt;
Answer&lt;/p&gt;
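
&lt;p&gt;A toy version of that pipeline, as a sketch only. Word counts stand in for real embeddings, a Python list stands in for the vector database, and the final LLM call is left out; the sample sentences are made up for illustration.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical enterprise documents.
documents = [
    "Daily Sales is the sum of SALE_AMOUNT from POS_TRANSACTIONS.",
    "Cancelled transactions are excluded by a business rule.",
]

# The "vector database": a list of (chunk, embedding) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

question = "How is Daily Sales calculated?"
top = retrieve(question, index)

# The retrieved context would be sent to an LLM together with the question.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
print(prompt)
```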

&lt;p&gt;For many use cases, this works extremely well.&lt;/p&gt;

&lt;p&gt;Employee assistants, HR chatbots, IT support copilots, policy search, document Q&amp;amp;A, and internal knowledge assistants are all examples of successful RAG applications.&lt;/p&gt;

&lt;p&gt;But as organizations scale their AI initiatives, a new challenge begins to emerge.&lt;/p&gt;

&lt;p&gt;The Problem with Enterprise Knowledge&lt;/p&gt;

&lt;p&gt;The issue is not that information is missing.&lt;/p&gt;

&lt;p&gt;The issue is that information is fragmented.&lt;/p&gt;

&lt;p&gt;Consider a simple retail question:&lt;/p&gt;

&lt;p&gt;How is Daily Sales calculated?&lt;/p&gt;

&lt;p&gt;The answer may exist across multiple artifacts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Dictionary&lt;/li&gt;
&lt;li&gt;Source-to-Target Mapping (STTM)&lt;/li&gt;
&lt;li&gt;Business Rules&lt;/li&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;Data Quality Specifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional RAG system may retrieve some of these documents.&lt;/p&gt;

&lt;p&gt;However, no single document contains the complete answer.&lt;/p&gt;

&lt;p&gt;The knowledge itself is distributed.&lt;/p&gt;

&lt;p&gt;This creates a fundamental challenge.&lt;/p&gt;

&lt;p&gt;RAG retrieves documents.&lt;/p&gt;

&lt;p&gt;Enterprise users need knowledge.&lt;/p&gt;

&lt;p&gt;Why Better Retrieval Isn’t Always Enough&lt;/p&gt;

&lt;p&gt;The industry has already introduced several improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid Search&lt;/li&gt;
&lt;li&gt;Reranking&lt;/li&gt;
&lt;li&gt;Citations&lt;/li&gt;
&lt;li&gt;Confidence Scoring&lt;/li&gt;
&lt;li&gt;Agentic RAG&lt;/li&gt;
&lt;li&gt;Multi-Step Retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These innovations significantly improve retrieval quality.&lt;/p&gt;

&lt;p&gt;However, they still operate primarily at the document level.&lt;/p&gt;

&lt;p&gt;The underlying assumption remains:&lt;/p&gt;

&lt;p&gt;Find the right documents and the answer will emerge.&lt;/p&gt;

&lt;p&gt;In practice, enterprise knowledge is often spread across multiple systems, documents, and teams.&lt;/p&gt;

&lt;p&gt;The challenge becomes connecting the pieces.&lt;/p&gt;

&lt;p&gt;Enter Knowledge Discovery&lt;/p&gt;

&lt;p&gt;What if we stopped thinking about documents as the primary source of truth?&lt;/p&gt;

&lt;p&gt;Instead of retrieving documents, what if we extracted knowledge from documents and connected it together?&lt;/p&gt;

&lt;p&gt;Imagine converting enterprise artifacts into a Canonical Knowledge Model.&lt;/p&gt;

&lt;p&gt;For the Daily Sales example:&lt;/p&gt;

&lt;p&gt;Business Term: Daily Sales&lt;br&gt;
Source System: POS&lt;br&gt;
Source Table: POS_TRANSACTIONS&lt;br&gt;
Attribute: SALE_AMOUNT&lt;br&gt;
Business Rule: Exclude Cancelled Transactions&lt;br&gt;
DQ Rule: Value &amp;gt;= 0&lt;br&gt;
Target: Sales Mart&lt;/p&gt;

&lt;p&gt;Now we are no longer working with isolated files.&lt;/p&gt;

&lt;p&gt;We are working with connected knowledge.&lt;/p&gt;

&lt;p&gt;The Shift from Retrieval to Discovery&lt;/p&gt;

&lt;p&gt;Traditional RAG:&lt;/p&gt;

&lt;p&gt;Question&lt;br&gt;
    ↓&lt;br&gt;
Retrieve Documents&lt;br&gt;
    ↓&lt;br&gt;
LLM&lt;br&gt;
    ↓&lt;br&gt;
Answer&lt;/p&gt;

&lt;p&gt;Knowledge Discovery:&lt;/p&gt;

&lt;p&gt;Question&lt;br&gt;
    ↓&lt;br&gt;
Identify Business Concept&lt;br&gt;
    ↓&lt;br&gt;
Discover Relationships&lt;br&gt;
    ↓&lt;br&gt;
Assemble Evidence&lt;br&gt;
    ↓&lt;br&gt;
LLM&lt;br&gt;
    ↓&lt;br&gt;
Trusted Answer&lt;/p&gt;
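
&lt;p&gt;The Knowledge Discovery flow can be sketched in a few lines, assuming the knowledge has already been extracted into connected facts. The tuple layout, the function names, and the artifact each fact is attributed to are all assumptions for illustration. The LLM step is omitted; the point is what gets assembled before it.&lt;/p&gt;

```python
# Hypothetical facts as (subject, relation, object, source_artifact) tuples.
FACTS = [
    ("Daily Sales", "source_system", "POS", "Data Dictionary"),
    ("Daily Sales", "source_table", "POS_TRANSACTIONS", "STTM"),
    ("Daily Sales", "attribute", "SALE_AMOUNT", "STTM"),
    ("Daily Sales", "business_rule", "Exclude Cancelled Transactions", "Business Rules"),
    ("Daily Sales", "dq_rule", "SALE_AMOUNT must be zero or greater", "Data Quality Specifications"),
    ("Daily Sales", "target", "Sales Mart", "Architecture Diagram"),
]


def identify_concept(question, facts):
    """Return the first known business term mentioned in the question, else None."""
    terms = {subject for subject, _, _, _ in facts}
    for term in sorted(terms):
        if term.lower() in question.lower():
            return term
    return None


def assemble_evidence(concept, facts):
    """Collect every fact about the concept, keeping the artifact it came from."""
    return [
        {"relation": rel, "value": obj, "source": src}
        for subj, rel, obj, src in facts
        if subj == concept
    ]


question = "How is Daily Sales calculated?"
concept = identify_concept(question, FACTS)
evidence = assemble_evidence(concept, FACTS)

for item in evidence:
    print(f"{item['relation']}: {item['value']} (from {item['source']})")
```

&lt;p&gt;Each piece of evidence carries its source artifact, so an answer built from it can be traced back to where the knowledge came from.&lt;/p&gt;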

&lt;p&gt;The focus shifts from:&lt;/p&gt;

&lt;p&gt;Which document should I retrieve?&lt;/p&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;p&gt;What knowledge do I need to assemble?&lt;/p&gt;

&lt;p&gt;Why This Matters&lt;/p&gt;

&lt;p&gt;Enterprise users rarely ask document-centric questions.&lt;/p&gt;

&lt;p&gt;They ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where does this metric originate?&lt;/li&gt;
&lt;li&gt;Which systems contribute to this KPI?&lt;/li&gt;
&lt;li&gt;What business rules are applied?&lt;/li&gt;
&lt;li&gt;What data quality validations exist?&lt;/li&gt;
&lt;li&gt;What transformations occur before loading?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Answering these questions requires understanding relationships.&lt;/p&gt;

&lt;p&gt;Not just retrieving text.&lt;/p&gt;

&lt;p&gt;RAG Isn’t Going Away&lt;/p&gt;

&lt;p&gt;I don’t view Knowledge Discovery as a replacement for RAG.&lt;/p&gt;

&lt;p&gt;RAG remains a foundational capability.&lt;/p&gt;

&lt;p&gt;In fact, RAG will likely continue to play an important role in retrieval.&lt;/p&gt;

&lt;p&gt;The difference is that retrieval becomes one component within a larger knowledge architecture.&lt;/p&gt;

&lt;p&gt;A future enterprise AI stack may look like:&lt;/p&gt;

&lt;p&gt;Documents&lt;br&gt;
    ↓&lt;br&gt;
Metadata Extraction&lt;br&gt;
    ↓&lt;br&gt;
Canonical Knowledge Model&lt;br&gt;
    ↓&lt;br&gt;
Knowledge Graph&lt;br&gt;
    ↓&lt;br&gt;
RAG Retrieval&lt;br&gt;
    ↓&lt;br&gt;
Evidence Assembly&lt;br&gt;
    ↓&lt;br&gt;
Trusted Answers&lt;/p&gt;

&lt;p&gt;Final Thoughts&lt;/p&gt;

&lt;p&gt;The evolution of enterprise AI can be viewed as a progression:&lt;/p&gt;

&lt;p&gt;Era 1: LLM&lt;br&gt;
Era 2: RAG&lt;br&gt;
Era 3: Advanced RAG (Hybrid Search, Reranking, Citations)&lt;br&gt;
Era 4: Knowledge Discovery (Metadata, Relationships, Evidence)&lt;/p&gt;

&lt;p&gt;The goal is no longer simply retrieving documents.&lt;/p&gt;

&lt;p&gt;The goal is connecting fragmented enterprise knowledge and surfacing trusted evidence when it’s needed.&lt;/p&gt;

&lt;p&gt;Perhaps the next generation of enterprise copilots won’t be document assistants.&lt;/p&gt;

&lt;p&gt;They’ll be knowledge discovery systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>dataengineeringcopilot</category>
      <category>python</category>
    </item>
    <item>
      <title>From STTM to Snowflake SQL: Building a Metadata-Driven Data Engineering Copilot</title>
      <dc:creator>Amit Kumar Singh</dc:creator>
      <pubDate>Sun, 14 Jun 2026 05:33:55 +0000</pubDate>
      <link>https://dev.to/amising6/from-sttm-to-snowflake-sql-building-a-metadata-driven-data-engineering-copilot-n4</link>
      <guid>https://dev.to/amising6/from-sttm-to-snowflake-sql-building-a-metadata-driven-data-engineering-copilot-n4</guid>
      <description>&lt;p&gt;Most data engineering teams do not struggle because they lack smart people.&lt;/p&gt;

&lt;p&gt;They struggle because too much of the delivery process is still repetitive.&lt;/p&gt;

&lt;p&gt;A source-to-target mapping document comes in.&lt;/p&gt;

&lt;p&gt;Then someone has to manually create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;target table DDL&lt;/li&gt;
&lt;li&gt;transformation SQL&lt;/li&gt;
&lt;li&gt;data dictionary&lt;/li&gt;
&lt;li&gt;technical specification&lt;/li&gt;
&lt;li&gt;data quality rules&lt;/li&gt;
&lt;li&gt;reconciliation checks&lt;/li&gt;
&lt;li&gt;test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For one or two tables, this is manageable.&lt;/p&gt;

&lt;p&gt;For a real enterprise program with many tables, changing requirements, multiple source systems, and repeated delivery cycles, this becomes a major productivity problem.&lt;/p&gt;

&lt;p&gt;That is the problem I am exploring with &lt;strong&gt;Data Engineering Copilot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Website: &lt;a href="https://dataengineeringcopilot.com" rel="noopener noreferrer"&gt;https://dataengineeringcopilot.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
text
Upload STTM
   ↓
Parse metadata
   ↓
Normalize into a canonical metadata model
   ↓
Generate engineering artifacts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
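
&lt;p&gt;A rough sketch of those steps in Python. The STTM column names, the sample rows, and the rule that an empty transformation means a direct column move are assumptions for illustration, not the product's actual behavior.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical parsed STTM rows; the column names are assumptions.
sttm_rows = [
    {"Source_Table": "CUSTOMER", "Source_Column": "CUSTOMER_ID",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_KEY",
     "Transformation_Rule": ""},
    {"Source_Table": "CUSTOMER", "Source_Column": "CUST_NAME",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_NAME",
     "Transformation_Rule": "UPPER(TRIM(CUST_NAME))"},
]


def normalize(rows):
    """Group rows into {(source_table, target_table): [(target_column, expression)]}."""
    model = defaultdict(list)
    for row in rows:
        # Assumption: an empty rule is treated as a direct column move.
        expression = row["Transformation_Rule"].strip() or row["Source_Column"]
        model[(row["Source_Table"], row["Target_Table"])].append(
            (row["Target_Column"], expression)
        )
    return dict(model)


def generate_sql(model):
    """Render one INSERT ... SELECT statement per source/target pair."""
    statements = []
    for (source, target), columns in model.items():
        targets = ", ".join(col for col, _ in columns)
        selects = ",\n    ".join(f"{expr} AS {col}" for col, expr in columns)
        statements.append(
            f"INSERT INTO {target} ({targets})\nSELECT\n    {selects}\nFROM {source};"
        )
    return statements


sql = generate_sql(normalize(sttm_rows))
print(sql[0])
```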

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>snowflake</category>
      <category>etl</category>
    </item>
  </channel>
</rss>
