DEV Community: Amit Kumar Singh

AI Technical Debt in Data Engineering: Why Generated Code Still Needs Metadata, Review, and Governance

Amit Kumar Singh — Wed, 08 Jul 2026 01:33:48 +0000

AI-assisted coding is changing how data engineering teams work.

A developer can now generate SQL, PySpark, dbt models, data-quality checks, documentation, and test cases much faster than before. What previously took hours can sometimes be drafted in minutes.

That is a real productivity gain.

But there is also a risk.

In data engineering, code that looks correct is not always correct. A pipeline can compile, run, and even pass a basic test while still violating business rules, missing edge cases, using the wrong source attribute, applying an incomplete transformation, or creating downstream reporting issues.

This is where AI technical debt begins.

AI technical debt does not always look like messy code. Sometimes it looks like clean code that nobody fully understands, nobody validated against the business requirement, and nobody can confidently support in production.

Data Engineering Is Not Just Code Generation

A data pipeline is not only a technical artifact. It represents business meaning.

When a data engineer writes transformation logic, they are not simply moving data from one table to another. They are interpreting business requirements, source behavior, target expectations, data-quality rules, reconciliation logic, and downstream consumption needs.

For example, a source-to-target mapping may define how a customer status, policy state, transaction type, or claim indicator should be derived. A simple generated CASE statement may look correct, but the real question is:

Did it use the right source column?
Did it handle nulls correctly?
Did it account for late-arriving records?
Did it preserve the agreed business definition?
Did it match the data contract?
Did it align with downstream reporting expectations?
Did it include the required data-quality checks?
Did it create traceable lineage?

AI can generate code quickly, but it does not automatically understand the full enterprise context unless that context is structured, governed, and reviewed.

That is why generated code still needs metadata, review, and governance.

How AI Technical Debt Enters Data Engineering

AI technical debt often enters quietly.

It may start with a developer asking an AI tool to generate a SQL transformation from a mapping document. The generated code looks clean. The syntax is valid. The joins look reasonable. The test case passes for a sample dataset.

But later, the team discovers issues:

a source-column name was interpreted incorrectly
a business rule was partially implemented
a join condition caused duplicate records
an incremental load missed late-arriving updates
a null-handling rule was missing
a data-quality check was too generic
a downstream report started showing mismatched numbers
documentation no longer matched the implementation

None of these problems are unusual in data engineering. The difference is that AI can accelerate the creation of artifacts before the underlying engineering intent is fully validated.

That creates a new kind of debt: not just technical debt in the code, but intent debt.

Intent debt happens when the code exists, but the reasoning behind it is incomplete, undocumented, or disconnected from the approved requirement.

Metadata Is the Missing Control Layer

The solution is not to stop using AI.

The solution is to stop treating prompts as the source of truth.

For enterprise data engineering, the source of truth should be structured engineering metadata.

Metadata should define the intent behind the implementation. It should capture:

source systems
source tables and columns
target tables and columns
business definitions
transformation rules
join logic
data-quality expectations
incremental processing rules
reconciliation rules
ownership
approval status
lineage relationships
effective dates and version history

When this metadata is structured, AI-generated artifacts can be reviewed against something concrete.

Instead of asking, “Does this code look good?” the team can ask:

Does this code match the approved transformation rule?
Does it use the correct source and target attributes?
Does it follow the approved data-quality expectation?
Does it preserve lineage?
Does it support rerun and recovery?
Does it match the incremental-load design?
Does it align with business approval?

That is a much stronger review process.
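
A review against metadata can be partly mechanical. The following is a minimal sketch, assuming both the approved metadata and the generated artifact's declared mapping are available as dictionaries. The names `APPROVED_MAPPING` and `review_mapping`, and all table and column names, are hypothetical and used only for illustration.

```python
# Hypothetical approved metadata: target column -> approved source and rule.
APPROVED_MAPPING = {
    "customer_status": {"source": "crm.status_cd", "rule": "map_status_code"},
    "customer_id": {"source": "crm.cust_id", "rule": "direct"},
}


def review_mapping(generated, approved):
    """Return findings where the generated artifact departs from approved metadata."""
    findings = []
    for target, spec in approved.items():
        actual = generated.get(target)
        if actual is None:
            findings.append(f"{target}: missing from generated artifact")
        elif actual["source"] != spec["source"]:
            findings.append(
                f"{target}: uses {actual['source']}, approved source is {spec['source']}"
            )
        elif actual["rule"] != spec["rule"]:
            findings.append(f"{target}: rule {actual['rule']} is not the approved rule")
    # Columns the generator invented without any approved definition.
    for target in sorted(generated.keys() - approved.keys()):
        findings.append(f"{target}: not present in approved metadata")
    return findings


# What a generated artifact declares it does (also hypothetical).
generated = {
    "customer_status": {"source": "crm.state_cd", "rule": "map_status_code"},
    "customer_id": {"source": "crm.cust_id", "rule": "direct"},
}
findings = review_mapping(generated, APPROVED_MAPPING)
```

A check like this does not replace the reviewer. It only guarantees that the obvious question, whether the code uses the approved source attribute, is asked every time.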

The Wrong Way to Use AI in Data Engineering

A risky AI-assisted workflow looks like this:

Requirement document → AI prompt → generated code → developer copy/paste → deployment

This creates speed, but it also creates risk.

The AI may fill in missing assumptions. The developer may trust the output because it looks polished. Reviewers may focus on syntax instead of business intent. Documentation may be generated after the fact. Data-quality checks may be generic rather than tied to actual business rules.

This is how teams end up with pipelines that work technically but fail operationally.

In data engineering, that is dangerous because downstream consumers often rely on the data for reporting, analytics, regulatory processes, financial decisions, customer insights, or operational workflows.

Fast code is useful only when it is also correct, traceable, and supportable.

The Better Way: Metadata-Driven AI Assistance

A better workflow looks like this:

Business Requirement → Source-to-Target Mapping → Canonical Metadata Model → Generated Artifacts → Human Review → Approval → CI/CD → Monitoring

In this model, AI is not removed. It is placed inside a governed process.

AI can help generate:

SQL transformation templates
PySpark or Spark SQL logic
dbt model drafts
DDL scripts
data-quality checks
test scenarios
data dictionary entries
technical specifications
lineage summaries
reconciliation rules
runbook starter content

But those outputs should be treated as drafts.

The metadata model remains the control layer. Engineers review the generated artifacts against approved metadata and business expectations before anything moves toward production.

This changes the role of the engineer.

The engineer is not just writing repetitive code. The engineer is validating engineering intent, edge cases, performance, quality, maintainability, and production readiness.

That is where experienced engineering judgment matters most.

Human Review Is Not Optional

AI-generated code should go through a review process that checks more than syntax.

A strong review should include:

Business-rule validation
Does the implementation reflect the actual approved rule?

Source-to-target validation
Are the correct source and target columns used?

Data-quality validation
Are completeness, validity, uniqueness, referential integrity, timeliness, and reconciliation expectations covered?

Incremental-load validation
Does the code handle inserts, updates, deletes, late-arriving records, and reruns?

Performance validation
Will the logic scale for expected data volume?

Lineage validation
Can the team explain where the data came from and how it changed?

Operational validation
Can this be monitored, restarted, supported, and explained during production issues?

Security and access validation
Does the implementation respect access controls and data-handling requirements?

This is the difference between AI-assisted coding and AI-assisted engineering.

AI Can Generate Tests, But It Cannot Define Trust Alone

Many AI tools can generate test cases. That is useful, but test generation alone does not solve the trust problem.

Tests are only as good as the assumptions behind them.

If the business rule is incomplete, the generated test may also be incomplete. If the AI misunderstood the transformation, the test may validate the wrong behavior. If edge cases are missing from the metadata, the test may not cover them.

That is why data-quality rules and test scenarios should be tied back to approved metadata.

For example, if a target field is required to be non-null, that expectation should exist in metadata. If a status field allows only specific values, that should exist in metadata. If a target table must reconcile to a source count within a threshold, that rule should exist in metadata.

AI can help draft the checks, but the expectations must come from governed engineering intent.
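
The three expectations mentioned above (non-null, accepted values, reconciliation threshold) could be evaluated from metadata roughly as follows. This is a sketch under stated assumptions: the `EXPECTATIONS` structure, the `run_checks` function, and the 1% threshold are hypothetical, and a real platform would read them from its governed metadata store.

```python
# Hypothetical metadata for one target table.
EXPECTATIONS = {
    "not_null": ["customer_id"],
    "accepted_values": {"customer_status": {"Active", "Inactive", "Pending"}},
    "max_count_variance": 0.01,  # assumed: target may differ from source by at most 1%
}


def run_checks(rows, source_count, expectations):
    """Evaluate rows (a list of dicts) against metadata-defined expectations."""
    failures = []
    for column in expectations["not_null"]:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            failures.append(f"{column}: {nulls} null value(s)")
    for column, allowed in expectations["accepted_values"].items():
        bad = sorted({row.get(column) for row in rows} - allowed, key=str)
        if bad:
            failures.append(f"{column}: unexpected value(s) {bad}")
    # Reconcile the target row count against the source row count.
    if source_count == 0:
        variance = 0.0 if not rows else 1.0
    else:
        variance = abs(len(rows) - source_count) / source_count
    if variance > expectations["max_count_variance"]:
        failures.append(f"row count variance {variance:.2%} exceeds threshold")
    return failures


rows = [
    {"customer_id": 1, "customer_status": "Active"},
    {"customer_id": None, "customer_status": "Closed"},
]
failures = run_checks(rows, source_count=2, expectations=EXPECTATIONS)
```

The checks themselves are generic. What makes them trustworthy is that the column names, allowed values, and threshold come from the approved definition rather than from the generator's guess.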

Practical Example

Consider a simple customer status mapping.

A source system has a field called status_cd with values such as:

A
I
P
blank
unexpected codes

The target platform needs a standardized field called customer_status.

A basic AI-generated transformation might create:

A = Active
I = Inactive
P = Pending

That looks fine at first.

But a production-ready data engineering process needs more questions answered:

What happens when status_cd is null?
What happens when a new unexpected code appears?
Should unknown values fail the pipeline or be flagged?
Is the mapping approved by the business owner?
Does the target field allow nulls?
Is there a DQ rule for accepted values?
Should rejected records go to an exception table?
Does this field drive downstream reporting?
Does the data dictionary reflect the same definition?
Is lineage captured from source to target?

A metadata-driven approach defines those expectations once. Then the platform can generate a transformation draft, a DQ rule, a test case, documentation, and lineage from the same approved definition.

That reduces inconsistency.
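
As an illustration, one approved definition can drive the SQL draft, a reference implementation for tests, and the accepted-values rule. This is a minimal sketch: `STATUS_DEFINITION` and the three functions are hypothetical names, and the null policy (`Unknown`) and unexpected-code policy (`REJECT`) are assumptions chosen for the example, not decisions stated in the mapping.

```python
# Hypothetical approved definition for customer_status. The A/I/P labels come
# from the example above; on_null and on_unexpected are assumed policies.
STATUS_DEFINITION = {
    "source_column": "status_cd",
    "target_column": "customer_status",
    "mapping": {"A": "Active", "I": "Inactive", "P": "Pending"},
    "on_null": "Unknown",
    "on_unexpected": "REJECT",  # marker for routing to an exception table
}


def render_case(defn):
    """Draft a SQL CASE expression from the approved definition."""
    src = defn["source_column"]
    lines = ["CASE"]
    lines.append(f"  WHEN {src} IS NULL OR TRIM({src}) = '' THEN '{defn['on_null']}'")
    for code, label in defn["mapping"].items():
        lines.append(f"  WHEN {src} = '{code}' THEN '{label}'")
    lines.append(f"  ELSE '{defn['on_unexpected']}'")
    lines.append(f"END AS {defn['target_column']}")
    return "\n".join(lines)


def map_status(value, defn):
    """Reference implementation of the same rule, usable in test cases."""
    if value is None or value.strip() == "":
        return defn["on_null"]
    return defn["mapping"].get(value, defn["on_unexpected"])


def accepted_values(defn):
    """Accepted target values for a DQ rule, derived from the same definition."""
    return set(defn["mapping"].values()) | {defn["on_null"]}


print(render_case(STATUS_DEFINITION))
```

Because the SQL, the test oracle, and the DQ rule all read the same dictionary, changing the approved mapping changes all three together.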

Governance Does Not Mean Slowing Everything Down

Some engineers hear the word governance and think it means more process, more approvals, and slower delivery.

That does not have to be true.

Good governance reduces confusion. It makes the delivery path clearer. It prevents teams from debating the same rules repeatedly. It gives reviewers something concrete to validate. It helps new engineers understand the system faster. It reduces production support issues because the intent is documented up front.

From Informatica XML to Snowflake: Why ETL Migration Needs a Governed Delivery Workflow

Amit Kumar Singh — Sat, 27 Jun 2026 12:38:13 +0000

Legacy ETL modernization is often described as a conversion exercise:

Informatica mapping in. Snowflake SQL out.

That framing is incomplete.

A real migration is not only about translating expressions. It is about preserving transformation intent, identifying what is missing, documenting assumptions, validating target behavior, and ensuring that someone is accountable for decisions before generated artifacts are released.

I have been building a prototype called Data Engineering Copilot around that idea.

The latest capability starts from an Informatica PowerCenter XML export and produces a governed Snowflake migration delivery packet.

The workflow is:

Informatica PowerCenter XML
        ↓
Metadata and Lineage Extraction
        ↓
Canonical Metadata Model
        ↓
Snowflake Artifact Generation
        ↓
Validation and Migration Risk Assessment
        ↓
Human Review and Approval
        ↓
Governed Release Package

The problem with simple code conversion

An Informatica mapping can contain far more than a direct field-to-field relationship.

A typical mapping may include:

source definitions and target definitions
source qualifiers and filters
expression transformations
reusable transformations
lookups
constants and default values
mapping parameters
target load order
connector-level lineage
update strategy or sequence-generation behavior
target fields with no visible incoming connector

A generator that only reads source and target columns may produce SQL that looks valid but does not preserve the original delivery intent.

That is risky.

For example, imagine a target field that has no visible source column. It may still be populated through:

a constant such as 'SOURCE_A'
a default such as 'XNA'
a surrogate-key lookup
a runtime parameter
a load timestamp
a sequence generator
a business decision that was never documented in the mapping

If the tool silently inserts NULL, the SQL may compile while the migration is functionally wrong.

The prototype approach

The Data Engineering Copilot prototype accepts two starting points:

Business Requirement / Source-to-Target Mapping
Legacy ETL Mapping

For the legacy path, the first supported adapter is Informatica PowerCenter XML.

The important design principle is that both paths converge into the same canonical metadata model.

Business Requirement / STTM ─┐
                             ├─ Canonical Metadata Model
Informatica XML ─────────────┘
                                      ↓
                             Artifact Factory
                                      ↓
                       Validation and Review Gate
                                      ↓
                          Human Approval and Export

This means the product is not just an Informatica parser.

It is a governed metadata-to-delivery platform that can accept more than one starting point and converge them into a single canonical model.

What the Informatica adapter extracts

For the initial version, the adapter reads metadata from PowerCenter XML such as:

SOURCE and SOURCEFIELD
TARGET and TARGETFIELD
TRANSFORMATION and TRANSFORMFIELD
INSTANCE
CONNECTOR
TABLEATTRIBUTE
source filters
lookup table names and conditions
transformation expressions
explicit default values
mapping parameters

From this, the platform builds a field-level canonical model with information such as:

Canonical field	Example value
Source table	`L0_VLE_NACE`
Source column	`CD_NACE`
Target table	`L1_D_NACE`
Target column	`CD_NACE`
Transformation type	Expression
Transformation logic	`TRIM(src.CD_NACE)`
Filter condition	business date predicate
Lookup table	reference/surrogate-key table
Lineage path	source → qualifier → expression → target expression → target
Migration status	Supported with Review / Manual Decision Required
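
To make the extraction step concrete, here is a minimal sketch that reads a simplified XML fragment and lists target fields with no incoming connector. The fragment is hand-written for illustration and is much smaller than a real PowerCenter export. The sketch also assumes the target instance name equals the target name, which a real adapter would resolve through the INSTANCE elements. The function name `unconnected_target_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of a PowerCenter export.
# Real exports carry many more elements, attributes, and nesting levels.
SAMPLE_XML = """
<MAPPING NAME="m_load_nace">
  <TARGET NAME="L1_D_NACE">
    <TARGETFIELD NAME="CD_NACE"/>
    <TARGETFIELD NAME="DT_EFFECTIVE"/>
  </TARGET>
  <CONNECTOR FROMINSTANCE="exp_clean" FROMFIELD="CD_NACE_out"
             TOINSTANCE="L1_D_NACE" TOFIELD="CD_NACE"/>
</MAPPING>
"""


def unconnected_target_fields(xml_text):
    """Return (target, field) pairs that no CONNECTOR writes to."""
    root = ET.fromstring(xml_text)
    connected = {
        (c.get("TOINSTANCE"), c.get("TOFIELD")) for c in root.iter("CONNECTOR")
    }
    missing = []
    for target in root.iter("TARGET"):
        for field in target.iter("TARGETFIELD"):
            key = (target.get("NAME"), field.get("NAME"))
            if key not in connected:
                missing.append(key)
    return missing


missing = unconnected_target_fields(SAMPLE_XML)
```

Each pair returned here is a candidate for a "Manual Decision Required" status rather than a silent NULL.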

Translating common legacy patterns

The first version supports a transparent subset of common Informatica patterns.

Expression transformations

An Informatica expression such as:

ltrim(rtrim(CD_NACE_in))

can become:

TRIM(src.CD_NACE)

A custom defaulting rule such as:

:UDF.DEFAULTSTRINGNULL(T_NAME_in)

can become:

COALESCE(NULLIF(TRIM(src.T_NAME), ''), 'XNA')

A constant value such as:

'VLE'

can become:

'VLE' AS CD_SOURCE_SYSTEM

A numeric default such as:

-1

can become:

-1 AS ID_NACE_PARENT

The platform keeps these as explicit derived values in the canonical model rather than pretending they came from a physical source column.
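
A transparent subset can be implemented as a small set of explicit rules that return nothing when a pattern is not recognized. The sketch below covers only the trim and constant examples above. The function name `translate`, the regular expressions, and the `_in` port-suffix convention are assumptions for illustration. The custom UDF is deliberately left unsupported here because its semantics are project-specific.

```python
import re

# Matches ltrim(rtrim(PORT)) with an optional "_in" input-port suffix.
_TRIM_PATTERN = re.compile(
    r"^\s*ltrim\(\s*rtrim\(\s*(\w+?)(?:_in)?\s*\)\s*\)\s*$", re.IGNORECASE
)
# Matches a single-quoted string constant or a signed integer constant.
_CONSTANT_PATTERN = re.compile(r"^\s*('[^']*'|-?\d+)\s*$")


def translate(expression, target_column):
    """Return (sql, origin) for supported patterns, or None when unsupported."""
    match = _TRIM_PATTERN.match(expression)
    if match:
        return f"TRIM(src.{match.group(1).upper()})", "source_column"
    match = _CONSTANT_PATTERN.match(expression)
    if match:
        # Recorded as a derived value, not as a physical source column.
        return f"{match.group(1)} AS {target_column}", "constant"
    return None  # unsupported: surface as a finding instead of guessing


trimmed = translate("ltrim(rtrim(CD_NACE_in))", "CD_NACE")
constant = translate("'VLE'", "CD_SOURCE_SYSTEM")
unsupported = translate(":UDF.DEFAULTSTRINGNULL(T_NAME_in)", "T_NAME")
```

Returning `None` for anything outside the known patterns keeps the subset honest: the caller has to record a manual migration decision rather than receive a plausible-looking guess.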

Source filters and runtime parameters

A Source Qualifier may contain a filter similar to:

edw_business_date = to_date('$$BUSINESS_DATE','YYYYMMDDHH24MISS')

The target Snowflake pattern can preserve that intent using a runtime parameter or session-variable approach:

WHERE src.EDW_BUSINESS_DATE =
      TO_TIMESTAMP_NTZ(:BUSINESS_DATE, 'YYYYMMDDHH24MISS')

The exact runtime parameter implementation still needs to be confirmed for the target deployment framework. That is a deployment decision, not something a metadata generator should silently invent.

Lookup conversion is not always automatic

Lookups are a good example of why governed delivery matters.

An Informatica Lookup Procedure may include:

a lookup table
a lookup condition
a source filter
cache behavior
multiple-match behavior
dynamic or static lookup semantics

A basic Snowflake translation may propose a LEFT JOIN.

But that does not prove the join is semantically equivalent.

The migration still needs review for questions such as:

Is the lookup table current, historical, or slowly changing?
What happens when multiple matches exist?
Does the lookup require effective-date logic?
Is the lookup output a surrogate key?
Was cache behavior masking duplicate or late-arriving records?
Should the target use a join, a MERGE, or a separate key-resolution process?

The prototype therefore generates a reviewable join candidate and, alongside it, a migration finding:

Status: Needs Review
Reason: Lookup conversion requires confirmation of join semantics,
duplicate-match behavior, and reference-table ownership.
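
One piece of evidence a reviewer can ask for is whether the lookup key is unique in the reference data. The following sketch counts keys that match more than one row. It assumes the reference rows are available as a list of dictionaries (for example, a sample pulled for review). The function name and the sample data are hypothetical.

```python
from collections import Counter


def lookup_key_finding(reference_rows, key_columns):
    """Build a review finding that records duplicate lookup keys, if any."""
    counts = Counter(
        tuple(row[column] for column in key_columns) for row in reference_rows
    )
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    reason = (
        "Lookup key matches multiple rows; a LEFT JOIN would multiply records."
        if duplicates
        else "Keys are unique in this sample; join semantics still need confirmation."
    )
    # The status stays "Needs Review" either way: a clean sample is evidence,
    # not proof of semantic equivalence.
    return {"status": "Needs Review", "reason": reason, "duplicate_keys": duplicates}


reference_rows = [
    {"CD_NACE": "A01", "ID_NACE": 1},
    {"CD_NACE": "A01", "ID_NACE": 7},
    {"CD_NACE": "B02", "ID_NACE": 2},
]
finding = lookup_key_finding(reference_rows, ["CD_NACE"])
```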

The governed release gate

This is the part that matters most to me.

The platform does not stop at generated SQL.

It creates a validation and review workflow with statuses such as:

Draft
Under Review
Approved with Conditions
Approved
Rejected
Blocked

The release gate can identify findings such as:

Finding	Example action
Unmapped target field	Confirm source, approved default, or explicit exclusion
Missing target datatype	Confirm datatype before DDL release
Lookup conversion	Validate join semantics and test results
Unsupported transformation	Record manual migration decision
Missing date population rule	Select source field, runtime parameter, timestamp, or nullable target decision
Complex expression	Add unit test and business approval

For unresolved fields, the generated SQL intentionally keeps the gap visible:

NULL /* REVIEW REQUIRED: target field has no approved source/default */

That is not a failure of the product.

It is the product preventing a false sense of automation.
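
A generator can enforce this by construction. In the sketch below, a target column whose expression is unknown is rendered with the review marker shown above instead of a bare NULL. The function names and the `DT_EFFECTIVE` column are hypothetical.

```python
REVIEW_MARKER = "REVIEW REQUIRED: target field has no approved source/default"


def render_select(columns):
    """Render a SELECT list; unresolved targets stay visible in the SQL."""
    parts = []
    for target, expression in columns:
        if expression is None:
            parts.append(f"NULL /* {REVIEW_MARKER} */ AS {target}")
        else:
            parts.append(f"{expression} AS {target}")
    return "SELECT\n  " + ",\n  ".join(parts)


def has_unresolved(sql):
    """A release check can refuse SQL that still carries the marker."""
    return "REVIEW REQUIRED" in sql


sql = render_select([
    ("CD_NACE", "TRIM(src.CD_NACE)"),
    ("DT_EFFECTIVE", None),  # no approved source or default yet
])
```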

Why human review remains necessary

AI and rule-based conversion can accelerate the mechanical parts of migration:

metadata extraction
connector tracing
expression inventory
type translation
SQL drafting
DQ rule suggestions
lineage documentation
risk classification

But a migration still requires decisions that depend on business meaning and target-state architecture.

For example, an unmapped effective-date field could mean very different things:

Use source business date
Use current timestamp
Use target load timestamp
Populate from a configuration parameter
Allow nulls and revise DDL
Exclude the column after SME approval

A tool can surface the decision, propose options, and preserve the evidence.

A human should approve the final choice.

The generated delivery packet

Once review is complete, the prototype generates a delivery package containing:

canonical metadata model
source-to-target lineage
Snowflake DDL
Snowflake transformation SQL
data dictionary
technical specification
data quality rules
migration risk assessment
review decision history
deployment manifest
audit trail

The package should only be marked deployment-ready when high-risk findings have documented resolutions.
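
That readiness rule is simple enough to express directly. This is a sketch only, with a hypothetical `Finding` structure and an assumed three-level risk scale. It is not the prototype's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    risk: str              # "high", "medium", or "low" (assumed scale)
    resolution: str = ""   # documented decision; empty means unresolved


def is_deployment_ready(findings):
    """Ready only when every high-risk finding has a documented resolution."""
    return all(f.resolution.strip() for f in findings if f.risk == "high")


findings = [
    Finding("Unmapped target field", "high"),
    Finding("Complex expression", "medium"),
]
ready_before = is_deployment_ready(findings)

findings[0].resolution = "Approved default confirmed by data owner"
ready_after = is_deployment_ready(findings)
```

Note that an empty findings list also evaluates as ready, so a real gate would likely also require that validation actually ran.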

That is the next improvement I am working on: making approval decisions directly update release readiness and the exported findings package.

What this changes

The goal is not to claim that Informatica can be replaced by a single AI prompt.

The goal is to make migration delivery more reliable.

Instead of this:

Legacy Mapping
      ↓
Manual interpretation
      ↓
Spreadsheet updates
      ↓
SQL generation
      ↓
Late discovery of missing logic

the target workflow becomes:

Legacy Mapping
      ↓
Structured metadata extraction
      ↓
Canonical representation
      ↓
Generated artifacts
      ↓
Visible assumptions and risks
      ↓
Human approval
      ↓
Traceable release package

That is the difference between generating code and governing a migration.

Closing thought

Data migration programs rarely fail because a team cannot write SQL.

They fail because business logic, defaults, lookup behavior, data quality expectations, and ownership decisions are hidden across mappings, emails, spreadsheets, and tribal knowledge.

A governed metadata model gives those decisions a place to live.

That is the direction I am building toward with Data Engineering Copilot: start from business intent or legacy implementation metadata, generate delivery artifacts, and make every important assumption reviewable before release.

#DataEngineering #Informatica #Snowflake #ETLModernization #DataMigration #MetadataDrivenDevelopment #DataGovernance #DataArchitecture #AIEngineering

Why Enterprise AI Needs Structured Dissent, Not Just More Agents

Amit Kumar Singh — Sat, 27 Jun 2026 00:53:39 +0000

Many AI projects today are presented as multi-agent systems.

One agent investigates. Another agent analyzes risk. A third agent checks compliance. A fourth agent gives a recommendation.

It sounds advanced.

But in a bank, adding more agents does not automatically make a workflow safe.

A bank cannot freeze a customer account, block a payment, file a regulatory report, or label a transaction as fraud simply because an AI system produced a confident answer.

The real question is not:

How many AI agents are involved?

The real question is:

Can the system show evidence, challenge its own conclusion, apply deterministic rules, and stop for human approval when the decision is high impact?

That is the difference between an interesting multi-agent demo and an enterprise-ready AI workflow.

A banking example: suspicious wire transfer

Imagine a bank detects a wire transfer for $250,000.

The payment is unusual because:

The customer has never sent a transfer of this size.
The destination account is in a new country.
The transaction happens outside the customer’s normal business hours.
The beneficiary was added only a few minutes before the transfer.
The customer recently changed their phone number and email address.

A simple AI chatbot might say:

“This transaction looks suspicious. Consider blocking it.”

That is not enough.

A bank needs to know:

Which transaction patterns triggered the concern?
Is the customer actually violating a known risk threshold?
Is there a sanctions or AML issue?
Could this be a legitimate business payment?
What policy applies?
Should the payment be blocked, held, or released?
Who is allowed to make that decision?
Can the bank explain the decision later to auditors, compliance teams, and the customer?

This is where structured multi-agent design matters.

A better design: a banking fraud decision room

Instead of letting one model make a decision, the bank can create a controlled workflow with specialized agents.

Transaction Alert
      ↓
Fraud Detection Agent
      ↓
Customer Behavior Agent
      ↓
AML / Sanctions Agent
      ↓
Policy and Risk Agent
      ↓
Decision Reviewer
      ↓
Human Compliance Officer

Each agent has a limited responsibility.

1. Fraud Detection Agent

This agent analyzes transaction behavior.

It may identify:

Unusual payment amount
New beneficiary
New country
Unusual transaction time
Sudden profile changes
Prior fraud indicators

Its job is not to freeze the transaction.

Its job is to create a structured fraud signal.

{
  "event_type": "FRAUD_SIGNAL",
  "transaction_id": "TXN-784921",
  "customer_id": "CUST-10048",
  "risk_indicators": [
    "new_beneficiary",
    "amount_12x_customer_average",
    "unusual_country",
    "recent_contact_change"
  ],
  "risk_score": 82,
  "confidence": 0.88
}

This gives the next stage a reviewable artifact instead of a paragraph generated by an LLM.
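
A structured handoff is only useful if the receiving stage rejects malformed signals. The following sketch validates the payload shown above. The field list mirrors that example, while the 0 to 100 score range and 0 to 1 confidence range are assumptions, as is the function name `validate_signal`.

```python
import json

# Field name -> accepted Python type(s) after JSON decoding.
REQUIRED_FIELDS = {
    "event_type": str,
    "transaction_id": str,
    "customer_id": str,
    "risk_indicators": list,
    "risk_score": int,
    "confidence": (int, float),
}


def validate_signal(payload):
    """Return a list of validation errors; an empty list means the handoff is usable."""
    signal = json.loads(payload)
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in signal:
            errors.append(f"missing field: {name}")
        elif not isinstance(signal[name], expected):
            errors.append(f"wrong type for field: {name}")
    if errors:
        return errors
    if signal["event_type"] != "FRAUD_SIGNAL":
        errors.append("unexpected event_type")
    if not 0 <= signal["risk_score"] <= 100:      # assumed scale
        errors.append("risk_score out of range")
    if not 0.0 <= signal["confidence"] <= 1.0:    # assumed scale
        errors.append("confidence out of range")
    return errors


payload = """{
  "event_type": "FRAUD_SIGNAL",
  "transaction_id": "TXN-784921",
  "customer_id": "CUST-10048",
  "risk_indicators": ["new_beneficiary", "amount_12x_customer_average",
                      "unusual_country", "recent_contact_change"],
  "risk_score": 82,
  "confidence": 0.88
}"""
errors = validate_signal(payload)
```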

2. Customer Behavior Agent

A transaction may look suspicious but still be legitimate.

For example, a corporate customer may be making a valid acquisition payment or paying a new overseas vendor.

The Customer Behavior Agent looks at:

Historical payment behavior
Customer segment
Typical payment ranges
Known business relationships
Recent support interactions
Whether the customer informed the bank about a major payment

This agent can either support or counter the fraud signal. In this example it finds no mitigating context:

{
  "event_type": "CUSTOMER_CONTEXT",
  "transaction_id": "TXN-784921",
  "historical_pattern": "Outside normal range",
  "known_business_event": "No supporting event found",
  "customer_contacted_bank": false,
  "assessment": "Transaction behavior remains inconsistent",
  "confidence": 0.76
}

This is important because the system should not treat every unusual payment as fraud.

Structured dissent is necessary

Now imagine the fraud agent recommends blocking the payment.

A good enterprise workflow should not simply accept that recommendation.

It should require another role to challenge it.

For example:

The Fraud Detection Agent says: “High fraud risk.”
The Customer Behavior Agent says: “No evidence of a legitimate business event.”
The AML / Sanctions Agent says: “Beneficiary has elevated geographic risk.”
The Policy and Risk Agent says: “The bank’s hold threshold is met.”
The Decision Reviewer says: “Human approval required before blocking.”

That is structured dissent.

It is not about making agents argue for entertainment.

It is about making assumptions visible before the bank takes action.

In high-stakes workflows, disagreement is not a weakness. Hidden disagreement is the real risk.

The LLM should not make the final decision alone

LLMs are useful for many parts of the workflow:

Summarizing transaction history
Explaining why a transaction appears unusual
Reading customer notes
Interpreting investigation findings
Drafting a case narrative
Generating a compliance-review summary

But an LLM should not control deterministic rules.

For example, these should come from governed systems and rules engines:

Daily transaction thresholds
Sanctions screening results
AML policy conditions
Regulatory filing timelines
Customer account restrictions
Approval authority limits
Payment-hold policies
Risk score calculations

A safe architecture looks like this:

AI Layer
- Investigates
- Summarizes
- Explains
- Recommends

Rules Layer
- Calculates thresholds
- Applies risk policies
- Checks sanctions lists
- Enforces approval limits
- Determines required escalation

Human Layer
- Approves
- Rejects
- Overrides
- Requests further investigation

This distinction matters.

The AI can explain why a payment looks suspicious.

The rules engine can determine whether the bank’s fraud-hold threshold has been crossed.

The compliance officer can decide whether the payment should actually be blocked.
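
The rules layer can be as plain as a pure function over governed parameters. The sketch below is illustrative only: `HoldPolicy`, its 10x multiple, and `evaluate_rules` are hypothetical, and real thresholds would come from the bank's policy systems rather than from code defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoldPolicy:
    # Hypothetical value; a real threshold comes from the governed policy store.
    amount_multiple: float = 10.0


def evaluate_rules(amount, customer_average, sanctions_clear, policy):
    """Deterministic checks: the same inputs always produce the same result."""
    enhanced_review = amount >= policy.amount_multiple * customer_average
    return {
        "enhanced_review_triggered": enhanced_review,
        "sanctions_clear": sanctions_clear,
        # Escalation is decided by rule, never by the model's wording.
        "manual_approval_required": enhanced_review or not sanctions_clear,
    }


result = evaluate_rules(
    amount=250_000,
    customer_average=20_000,
    sanctions_clear=True,
    policy=HoldPolicy(),
)
```

Nothing in this function calls a model. The AI layer can explain the result, but it cannot change it.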

An evidence panel is more important than a chatbot answer

The final decision should not be a black-box score.

A compliance officer should see an evidence panel like this:

Transaction:
TXN-784921

Customer:
Corporate customer — existing account for 4 years

Amount:
$250,000

Risk indicators:
- New beneficiary
- New destination country
- Payment amount is 12x normal average
- Contact information changed within past 24 hours
- No matching historical vendor relationship

Policy checks:
- Enhanced review threshold: Triggered
- Manual compliance approval: Required
- Sanctions screening: Clear
- AML monitoring alert: Triggered

AI assessment:
High-risk transaction requiring manual review

Human decision:
Payment placed on temporary hold

Approved by:
Compliance Officer

Decision timestamp:
2026-06-26 14:22 UTC

This is what enterprise AI should produce.

Not just an answer.

A decision record.

Human approval is part of the architecture

Human approval should not be added as an afterthought.

In banking, some actions can be automated, while others must wait for explicit human approval.

For example:

Action	AI / system role	Human role
Summarize alert	Automatic	Review if needed
Identify unusual transaction patterns	Automatic	Review exceptions
Create investigation case	Automatic	Monitor
Place temporary low-risk review hold	Rule-based	Review later
Freeze account	Recommend only	Explicit approval required
File SAR or regulatory report	Draft supporting evidence	Compliance approval required
Close customer account	Never autonomous	Senior human decision

The system should know when to proceed, when to pause, and when to escalate.

That is not a limitation.

That is good enterprise design.
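
The table above can be encoded as an explicit policy map so that the orchestration code, not the model, decides whether an action may run. This is a sketch: the action keys, the `Autonomy` levels, and `may_execute` are hypothetical names derived from the table for illustration.

```python
from enum import Enum


class Autonomy(Enum):
    AUTOMATIC = "automatic"
    RULE_BASED = "rule_based"
    RECOMMEND_ONLY = "recommend_only"
    NEVER_AUTONOMOUS = "never_autonomous"


# Hypothetical action keys mirroring the rows of the table above.
ACTION_POLICY = {
    "summarize_alert": Autonomy.AUTOMATIC,
    "identify_unusual_patterns": Autonomy.AUTOMATIC,
    "create_investigation_case": Autonomy.AUTOMATIC,
    "place_low_risk_review_hold": Autonomy.RULE_BASED,
    "freeze_account": Autonomy.RECOMMEND_ONLY,
    "file_regulatory_report": Autonomy.RECOMMEND_ONLY,
    "close_customer_account": Autonomy.NEVER_AUTONOMOUS,
}


def may_execute(action, human_approved=False):
    """Return True when the action may proceed under the policy map."""
    # Unknown actions fall back to the most restrictive level.
    level = ACTION_POLICY.get(action, Autonomy.NEVER_AUTONOMOUS)
    if level in (Autonomy.AUTOMATIC, Autonomy.RULE_BASED):
        return True
    return human_approved
```

For example, `may_execute("freeze_account")` is false until a human approval is recorded, while `may_execute("summarize_alert")` proceeds without one.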

What this means for data engineering teams

This same pattern applies directly to data engineering.

A data-engineering copilot should not only generate SQL or YAML from a source-to-target mapping document.

It should operate as a governed workflow.

For example:

STTM / DDL / Source Metadata
          ↓
Metadata Extraction Agent
          ↓
Mapping Validation Agent
          ↓
Transformation Logic Agent
          ↓
SQL / YAML Generator
          ↓
Reviewer Agent
          ↓
Data Engineer Approval

The reviewer should validate things such as:

Does the source column exist?
Is the target data type compatible?
Is the join supported by the mapping?
Is the transformation rule documented?
Is a sign rule missing?
Is a derived metric using an unapproved assumption?
Are there duplicate or unused YAML objects?
Has an engineer approved the generated output?

Then every generated artifact should include traceability.

Target Column:
PROFIT_AMT

Source:
sales.PROFIT_AMT

Transformation:
CASE WHEN SALES_TYPE = 'CANCEL'
THEN PROFIT_AMT * -1
ELSE PROFIT_AMT
END

Business Rule:
Cancellation transactions must store Profit as negative.

Source Reference:
STTM row 42

Validation:
- Source column exists
- Transformation approved
- Target data type compatible
- Human review status: Approved

This is how generated code becomes a governed engineering artifact.
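
A reviewer step can check both the rule and its traceability record. In the sketch below, `signed_profit` mirrors the CASE expression above, and `trace_gaps` reports missing traceability fields. The field names and function names are hypothetical.

```python
from decimal import Decimal

REQUIRED_TRACE_FIELDS = (
    "target_column", "source", "transformation",
    "business_rule", "source_reference", "review_status",
)


def signed_profit(sales_type, profit_amt):
    """Mirror of the CASE expression: flip the sign for cancellations."""
    return profit_amt * -1 if sales_type == "CANCEL" else profit_amt


def trace_gaps(record):
    """List traceability fields that are missing or not yet approved."""
    gaps = [name for name in REQUIRED_TRACE_FIELDS if not record.get(name)]
    if record.get("review_status") and record["review_status"] != "Approved":
        gaps.append("review_status is not Approved")
    return gaps


record = {
    "target_column": "PROFIT_AMT",
    "source": "sales.PROFIT_AMT",
    "transformation": "CASE WHEN SALES_TYPE = 'CANCEL' THEN PROFIT_AMT * -1 ELSE PROFIT_AMT END",
    "business_rule": "Cancellation transactions must store Profit as negative.",
    "source_reference": "STTM row 42",
    "review_status": "Approved",
}
gaps = trace_gaps(record)
```

One edge case worth raising in review: as written, the rule flips the sign, so a cancellation that already carries a negative amount would become positive. Whether that can occur is a business question, not something the generator should decide.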

A practical checklist for enterprise AI

Before calling a multi-agent system enterprise-ready, ask:

Does each agent have a clear responsibility?
Are handoffs structured instead of free-text only?
Can one agent challenge another agent’s conclusion?
Are critical calculations and policy checks deterministic?
Can every recommendation be traced to source evidence?
Does the system show assumptions and confidence levels?
Is there a clear escalation path for uncertainty?
Can a human approve, reject, or override the decision?
Can the organization reconstruct the full decision later?

If the answer to any of these is no, the solution may still be a useful prototype.

But it is not ready for high-stakes enterprise use.

Final thought

The future of enterprise AI is not one intelligent assistant making every decision.

It is also not a collection of agents talking continuously.

The future is a governed decision system where AI helps teams investigate faster, compare perspectives, identify risk, and prepare recommendations.

But evidence remains visible.

Rules remain enforceable.

Disagreement remains allowed.

And people remain accountable.

That is how AI becomes useful in banking, finance, data engineering, and other enterprise workflows where trust matters as much as speed.

https://dataengineeringcopilot.com

https://github.com/amising6/data-engineering-copilot

https://www.linkedin.com/in/amit-singh-57980030

From DataStage and Informatica to Databricks Medallion Architecture: Why Migration Is More Than Code Conversion

Amit Kumar Singh — Sun, 21 Jun 2026 13:43:00 +0000

Legacy ETL modernization is often described as a technology migration.

Move DataStage jobs to Databricks.
Convert Informatica mappings into PySpark.
Replace legacy workflows with notebooks and Delta tables.

But that description misses the hardest part.

The real challenge is not converting syntax.

The challenge is understanding years of hidden transformation logic, reconstructing data lineage, separating technical processing from business logic, and deciding where each responsibility belongs in a modern architecture.

A DataStage job or Informatica mapping may contain raw ingestion, data cleansing, lookups, joins, business rules, aggregations, error handling, and reporting logic in one workflow.

A Databricks Medallion architecture expects something different.

It separates data processing into clearer layers:

Bronze: Raw ingestion and source preservation
Silver: Cleansing, standardization, enrichment, conformance, and quality controls
Gold: Business-ready models, aggregates, KPIs, reporting datasets, and semantic outputs

That means a successful migration cannot be a blind one-to-one conversion.

It needs to become a metadata and architecture exercise.

⸻

Why One-to-One Conversion Fails

A traditional legacy ETL job often looks like this:

Read source data
→ Filter records
→ Lookup reference data
→ Cleanse values
→ Deduplicate
→ Apply business calculations
→ Aggregate
→ Write reporting output

The problem is that all these responsibilities may exist inside one job, mapping, sequence, or workflow.

For example, a single DataStage job might:

ingest from Oracle
remove cancelled records
trim whitespace
standardize status values
join customer master data
calculate net order amount
aggregate sales by month
write a reporting table

If that entire job is converted directly into one Databricks notebook, the organization may simply recreate the old architecture on a new platform.

The code may run in Databricks, but the design remains difficult to maintain, test, govern, and scale.

The goal should not be:

Convert one legacy job into one notebook.

The goal should be:

Understand what each transformation is doing and place it in the right modern data layer.

⸻

The First Step: Extract Metadata, Not Just Code

A legacy ETL migration should begin by extracting structured metadata from the existing platform.

For DataStage, Informatica, SSIS, Talend, stored procedures, or other ETL tools, useful metadata may include:

job or mapping name
workflow dependencies
source tables, files, and APIs
target tables and files
source-to-target field mappings
joins and lookup logic
filters and conditions
transformation expressions
aggregations
surrogate key generation
reject handling
parameter values
schedules and sequencing
pre-SQL and post-SQL
restart or recovery logic
error-handling behavior

The purpose is to create a structured representation of the legacy job.

Legacy ETL Export
→ Metadata Parser
→ Canonical Metadata Model
→ Transformation Graph
→ Migration Blueprint

This is much more valuable than simply reading transformation code line by line.
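As a rough illustration of what such a structured representation might hold, here is a minimal Python sketch. The class names (`FieldMapping`, `LegacyJob`), the field layout, and the sample values are hypothetical; they do not reflect the export format of DataStage, Informatica, or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMapping:
    """One source-to-target field mapping recovered from a legacy job."""
    source: str
    target: str
    expression: str = ""  # empty string means a straight column move


@dataclass
class LegacyJob:
    """Tool-neutral description of one legacy job or mapping (hypothetical shape)."""
    name: str
    sources: list[str] = field(default_factory=list)
    targets: list[str] = field(default_factory=list)
    mappings: list[FieldMapping] = field(default_factory=list)
    filters: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)

    def targets_with_logic(self) -> list[str]:
        """Target fields whose mapping carries a transformation expression."""
        return [m.target for m in self.mappings if m.expression.strip()]


# Fictional job, loosely based on the order example in this article.
job = LegacyJob(
    name="load_monthly_sales",
    sources=["ORDERS", "CUSTOMER_MASTER"],
    targets=["RPT_MONTHLY_SALES"],
    mappings=[
        FieldMapping("ORDERS.ORDER_ID", "RPT_MONTHLY_SALES.ORDER_ID"),
        FieldMapping(
            "ORDERS.STATUS_CD",
            "RPT_MONTHLY_SALES.STATUS",
            "IF STATUS_CD = 'C' THEN 'Closed' ELSE 'Open'",
        ),
    ],
    filters=["STATUS_CD <> 'X'"],
)

print(job.targets_with_logic())
```

Once jobs are held in a shape like this, questions such as "which target fields carry embedded logic?" become simple queries instead of manual code reading.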

⸻

Reconstructing the Transformation Graph

Once the metadata is extracted, the next step is to reconstruct the data lineage and transformation graph.

Consider this fictional example:

orders.csv
↓
filter cancelled orders
↓
lookup customer master
↓
standardize customer status
↓
deduplicate by order_id
↓
calculate order_amount
↓
aggregate monthly sales
↓
monthly_sales_summary

This graph reveals several different kinds of work:

raw ingestion
filtering
enrichment
standardization
deduplication
business calculation
reporting aggregation

These should not all be treated as one technical unit.

The transformation graph helps identify where the data changes, why it changes, and which downstream outputs depend on those changes.

It also makes hidden business logic visible.
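One way to hold such a graph in code is a plain dictionary of upstream dependencies. The sketch below uses the fictional steps from the example above and Python's standard `graphlib` module; the step identifiers and the helper names `execution_order` and `downstream_of` are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it reads from (its upstream dependencies).
upstream = {
    "orders.csv": set(),
    "filter_cancelled_orders": {"orders.csv"},
    "lookup_customer_master": {"filter_cancelled_orders"},
    "standardize_customer_status": {"lookup_customer_master"},
    "deduplicate_by_order_id": {"standardize_customer_status"},
    "calculate_order_amount": {"deduplicate_by_order_id"},
    "aggregate_monthly_sales": {"calculate_order_amount"},
    "monthly_sales_summary": {"aggregate_monthly_sales"},
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(graph, node):
    """Return every step that directly or indirectly depends on `node`."""
    affected = set()
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for step, parents in graph.items():
            if current in parents and step not in affected:
                affected.add(step)
                frontier.append(step)
    return affected


print(execution_order(upstream))
print(downstream_of(upstream, "calculate_order_amount"))
```

The second helper is a small impact analysis: it lists which outputs would be affected if the rule behind `calculate_order_amount` changed.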

⸻

Mapping Legacy ETL Logic to Bronze, Silver, and Gold

The Medallion architecture is useful because it separates responsibilities.

Here is a practical way to classify legacy ETL logic.

| Legacy ETL Pattern | Meaning | Likely Medallion Layer |
| --- | --- | --- |
| File, API, database, or CDC extraction | Raw source ingestion | Bronze |
| Source preservation and ingestion metadata | Capture original source state | Bronze |
| Basic schema enforcement | Standardized ingestion | Bronze or Silver |
| Trim, cast, rename, null cleanup | Cleansing and standardization | Silver |
| Deduplication | Record normalization | Silver |
| Lookup and reference joins | Enrichment and conformance | Silver |
| SCD handling | Historical dimensional processing | Silver |
| Business calculations | Curated business logic | Gold |
| Aggregation and KPI creation | Reporting-ready metrics | Gold |
| Dashboard/report output | Consumption-ready dataset | Gold |

The important point is that a legacy component type does not automatically determine the Medallion layer.

For example, a DataStage Transformer stage might perform:

string trimming
null handling
a business calculation
a customer lookup
a reporting aggregation

Those are not all Silver transformations.

The migration process needs to inspect the intent of the logic.
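A minimal sketch of intent-based classification follows. It assumes each operation has already been tagged with an intent label (by a person or by an AI-assisted step); the labels, the `LAYER_BY_INTENT` mapping, and the function name are hypothetical and simply mirror the table above.

```python
# Hypothetical intent labels and their default layer, following the table above.
LAYER_BY_INTENT = {
    "ingestion": "Bronze",
    "cleansing": "Silver",
    "deduplication": "Silver",
    "enrichment": "Silver",
    "business_calculation": "Gold",
    "aggregation": "Gold",
}


def propose_layers(operations):
    """Group (description, intent) pairs by proposed Medallion layer.

    Intents that are not in the mapping are routed to "Needs review"
    rather than being forced into a layer.
    """
    proposal = {}
    for description, intent in operations:
        layer = LAYER_BY_INTENT.get(intent, "Needs review")
        proposal.setdefault(layer, []).append(description)
    return proposal


# The work done by the single Transformer stage described above.
transformer_stage = [
    ("string trimming", "cleansing"),
    ("null handling", "cleansing"),
    ("a business calculation", "business_calculation"),
    ("a customer lookup", "enrichment"),
    ("a reporting aggregation", "aggregation"),
    ("a region override", "unknown"),
]

for layer, items in propose_layers(transformer_stage).items():
    print(layer, "->", items)
```

One stage ends up split across Silver, Gold, and a review bucket, which is the point: the component type alone does not decide the layer.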

⸻

Example: One Legacy Job Becomes Multiple Databricks Layers

Imagine this fictional legacy ETL workflow:

Oracle Orders
→ Transformer: trim strings and standardize status
→ Lookup: customer master
→ Transformer: calculate net_amount
→ Aggregator: monthly sales by customer
→ Reporting table

A modern Databricks Medallion proposal could look like this:

Bronze Layer
bronze_orders_raw

Ingest raw Oracle orders
Preserve source fields
Add ingestion timestamp
Add source identifier
Add load date
Retain raw records for traceability

Silver Layer
silver_orders

Trim and standardize string fields
Standardize status values
Validate schema
Apply null-handling rules
Deduplicate order records

silver_orders_enriched

Join customer master data
Resolve customer keys
Apply standardized enrichment logic
Calculate normalized net_amount

Gold Layer
gold_customer_monthly_sales

Aggregate net sales by customer and month
Apply approved reporting definitions
Produce a curated business-ready output

This creates clearer ownership.

Bronze preserves the source.

Silver prepares trusted, reusable data.

Gold provides business-facing outputs.
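To make the separation concrete, here is a small plain-Python stand-in for the three layers. It is not PySpark and not a Databricks implementation; it only shows each layer as its own function with its own responsibility. The function names, the sample rows, the keep-first deduplication rule, and the definition of `net_amount` as gross amount minus discount are all assumptions made for illustration.

```python
from collections import defaultdict


def to_bronze(raw_rows, source_id, load_date):
    """Bronze: keep source fields untouched and add ingestion metadata."""
    return [{**row, "_source": source_id, "_load_date": load_date} for row in raw_rows]


def to_silver(bronze_rows):
    """Silver: trim strings, standardize status, deduplicate, derive net_amount."""
    seen, cleaned = set(), []
    for row in bronze_rows:
        if row["order_id"] in seen:
            continue  # assumption: keep the first record per order_id
        seen.add(row["order_id"])
        cleaned.append({
            **row,
            "customer_id": row["customer_id"].strip(),
            "status": row["status"].strip().upper(),
            # assumption: net amount is gross amount minus discount
            "net_amount": row["gross_amount"] - row["discount"],
        })
    return cleaned


def to_gold(silver_rows):
    """Gold: aggregate net sales by customer and month."""
    totals = defaultdict(float)
    for row in silver_rows:
        month = row["order_date"][:7]  # "YYYY-MM" from an ISO date string
        totals[(row["customer_id"], month)] += row["net_amount"]
    return dict(totals)


raw_orders = [
    {"order_id": 1, "customer_id": " C1 ", "status": "open ",
     "order_date": "2026-01-05", "gross_amount": 100.0, "discount": 10.0},
    {"order_id": 1, "customer_id": " C1 ", "status": "open ",
     "order_date": "2026-01-05", "gross_amount": 100.0, "discount": 10.0},
    {"order_id": 2, "customer_id": "C1", "status": "Closed",
     "order_date": "2026-01-20", "gross_amount": 50.0, "discount": 0.0},
]

bronze = to_bronze(raw_orders, source_id="oracle_orders", load_date="2026-02-01")
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because each layer is a separate function, each can be tested, reviewed, and owned independently, which is harder when the same logic lives in a single notebook.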

⸻

What AI Can Assist With

AI can make this migration process faster and more structured.

For example, an AI-assisted migration workflow can help:

summarize legacy job purpose
parse transformation expressions
identify source and target dependencies
reconstruct lineage
classify transformations by intent
detect embedded business logic
suggest Bronze, Silver, and Gold placement
draft PySpark or Spark SQL
generate Delta table DDL
propose data-quality checks
generate reconciliation logic
create migration documentation
identify unclear or risky rules

Suppose a legacy rule says:

IF status_code = 'C' THEN 'Closed' ELSE 'Open'

An AI system can suggest:

Likely classification:
Silver-layer standardization rule
Potential concern:
Confirm whether status_code = 'C' means Closed across all source systems.
Recommended action:
Human review required before finalizing the standardization rule.

That is useful because the system is not pretending to know the business definition.

It is surfacing the decision that must be made.
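A short sketch shows why this rule deserves a question rather than a silent conversion. In a literal Python translation, the `ELSE` branch turns every value other than `'C'` into `'Open'`, including missing and unexpected codes. How the legacy tool itself treats nulls may differ, so that behavior is an assumption here, and the `KNOWN_CODES` lookup is hypothetical.

```python
def legacy_status(status_code):
    """Literal translation: anything other than 'C' becomes 'Open'."""
    return "Closed" if status_code == "C" else "Open"


# Hypothetical reference mapping that a reviewer would have to confirm.
KNOWN_CODES = {"C": "Closed", "O": "Open"}


def reviewed_status(status_code, known_codes):
    """Stricter variant: unmapped or missing codes are surfaced, not defaulted."""
    if status_code in known_codes:
        return known_codes[status_code]
    return "NEEDS_REVIEW"


for code in ["C", "O", "X", None]:
    print(repr(code), legacy_status(code), reviewed_status(code, KNOWN_CODES))
```

Both versions agree on `'C'` and `'O'`. They diverge on `'X'` and on a missing value, and that divergence is exactly the business decision a human reviewer needs to make.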

⸻

What Still Requires Human Review

AI can accelerate analysis and drafting, but human accountability remains essential.

Humans should continue to make final decisions about:

business definitions
source-of-truth selection
financial logic
regulatory calculations
data-retention policies
exception handling
data-quality thresholds
reporting metrics
production deployment approval

For example, a legacy aggregation may calculate:

SUM(revenue) BY region, month

The technical migration system may recommend Gold.

But a human must still answer:

Is revenue gross or net?
Are refunds included?
Does month use calendar month or fiscal month?
Is region derived from customer, store, or sales territory?
Is this metric reusable across reports?

Those are business and governance questions, not merely coding questions.

⸻

The Role of a Canonical Metadata Model

A Canonical Metadata Model can become the bridge between legacy ETL and modern data architecture.

It can represent:

sources
targets
columns
transformations
joins
keys
data types
quality expectations
lineage
business definitions
approval status
assumptions
migration decisions

Once metadata is normalized, multiple outputs can be generated from the same source of truth.

Canonical Metadata Model
→ Databricks Medallion Architecture Proposal
→ PySpark / Spark SQL
→ Delta Table DDL
→ Data Quality Rules
→ Reconciliation Checks
→ Lineage Documentation
→ Migration Specification
→ Human Review Queue

This is more powerful than isolated code conversion because it creates reusable engineering intelligence.
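The idea of several outputs from one model can be sketched in a few lines. The column list below is a hypothetical, heavily simplified stand-in for a canonical model, and the two generators are illustrative rather than a description of any existing product.

```python
# Hypothetical column metadata for one target table.
COLUMNS = [
    {"name": "order_id", "type": "BIGINT", "nullable": False},
    {"name": "customer_id", "type": "STRING", "nullable": False},
    {"name": "net_amount", "type": "DECIMAL(18,2)", "nullable": True},
]


def delta_ddl(table, columns):
    """Draft a Delta table definition from column metadata."""
    lines = []
    for col in columns:
        null_clause = "" if col["nullable"] else " NOT NULL"
        lines.append(f"  {col['name']} {col['type']}{null_clause}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n) USING DELTA"


def not_null_checks(table, columns):
    """Draft one null-count query per column that must not be null."""
    return [
        f"SELECT COUNT(*) FROM {table} WHERE {col['name']} IS NULL"
        for col in columns
        if not col["nullable"]
    ]


print(delta_ddl("silver_orders", COLUMNS))
for check in not_null_checks("silver_orders", COLUMNS):
    print(check)
```

Both outputs read the same `nullable` flag, so the table definition and the quality checks cannot drift apart the way hand-written copies can.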

⸻

How Data Engineering Copilot Could Support Legacy ETL Migration

A future Data Engineering Copilot capability could act as a Legacy ETL Migration Copilot.

Inputs could include:

DataStage export files
Informatica mapping exports
workflow metadata
SQL procedures
ETL job documentation
source-to-target mappings
data model documentation

The workflow could be:

Legacy ETL Export
→ Parse Job Metadata
→ Build Transformation Graph
→ Identify Dependencies
→ Classify Transformation Intent
→ Propose Bronze / Silver / Gold Layers
→ Generate Migration Artifacts
→ Flag Ambiguity
→ Route for Human Review

Potential outputs could include:

Medallion architecture recommendation
Bronze, Silver, and Gold pipeline design
Databricks notebook structure
PySpark code drafts
Spark SQL transformations
Delta table definitions
data-quality rules
reconciliation checks
migration documentation
dependency analysis
lineage diagrams
review questions for unresolved logic

The key is not automatic migration without oversight.

The key is to turn hidden legacy ETL logic into a reviewable modernization blueprint.

⸻

Migration Is a Metadata and Architecture Problem

Many legacy ETL modernization efforts fail because they focus only on tool replacement.

But old ETL jobs often contain years of accumulated business knowledge.

That knowledge may be undocumented.

It may be hidden inside transformations, lookups, stored procedures, filters, sequencing rules, and exception logic.

A successful migration must preserve that knowledge while improving the architecture.

That means the migration process should:

Extract metadata
→ Reconstruct lineage
→ Identify transformation intent
→ Separate technical and business responsibilities
→ Propose Medallion layers
→ Generate reviewable artifacts
→ Capture assumptions
→ Require human approval

The future of ETL modernization is not simply translating one tool into another.

It is making legacy data logic visible, structured, governed, and reusable.

⸻

Closing Thought

DataStage and Informatica jobs were often built in an era when ingestion, cleansing, business logic, and reporting were tightly combined.

Databricks Medallion architecture gives teams an opportunity to separate those responsibilities and create cleaner, more maintainable data products.

But that opportunity is lost when organizations perform blind one-to-one conversion.

The better approach is to treat legacy ETL modernization as a metadata-driven architecture exercise.

Do not just convert legacy jobs into new code.
Convert hidden transformation logic into a reviewable modernization blueprint.

That is where AI-assisted metadata platforms can create real value for enterprise data engineering teams.

⸻

Data Engineering Copilot is a personal product initiative focused on metadata-driven engineering and governed delivery workflows.

Illustrative examples in this article use fictional metadata only. No client, employer, production, or proprietary information is included.

From Legacy Data Platforms to Modern Data Stacks: Why Metadata Matters More Than Technology

Amit Kumar Singh — Sun, 21 Jun 2026 08:16:02 +0000

Introduction

Organizations spend millions of dollars modernizing data platforms.

They migrate from on-premise databases to cloud warehouses. They replace legacy ETL tools with Spark and cloud-native orchestration. They introduce modern observability platforms, data catalogs, semantic layers, and AI-powered analytics.

Yet many modernization programs struggle despite adopting the latest technology.

The reason is surprisingly simple:

Technology changes.

Metadata remains.

Most modernization projects focus on moving code. Few focus on understanding and preserving the metadata that defines the business.

This is where metadata-driven engineering changes the conversation.

⸻

The Traditional Modernization Approach

A typical legacy modernization initiative looks something like this:

Legacy Environment

Oracle
Teradata
Netezza
Informatica
DataStage
SSIS
Stored Procedures
Excel-Based Documentation

Target Environment

Snowflake
Databricks
dbt
Airflow
Monte Carlo
Power BI
Sigma
Cloud Storage

The migration process usually involves:

Reverse engineering legacy pipelines
Understanding business logic
Rewriting transformations
Rebuilding data models
Recreating documentation
Reimplementing data quality checks
Validating outputs

The challenge is that every artifact is treated as a separate deliverable.

Engineers repeatedly translate the same business requirements into different technical formats.

⸻

The Real Asset Is Not The Code

Most organizations assume the code is the asset.

In reality, the most valuable asset is the metadata that describes:

Source systems
Business entities
Data definitions
Transformation logic
Relationships
Data quality rules
Ownership
Governance policies

Technology platforms evolve every few years.

Business definitions often survive for decades.

A customer is still a customer.

A policy is still a policy.

A claim is still a claim.

What changes is how those concepts are implemented.

⸻

The Metadata Problem

Consider a simple customer field.

In a legacy platform it might appear as:

CUSTOMER_ID

In Snowflake it becomes:

CUSTOMER_KEY

In Power BI it appears as:

Customer Identifier

In a data catalog it appears as:

Business Customer Reference

The technology changes.

The meaning remains the same.
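A minimal sketch of how one business concept could carry its platform-specific names is shown below. The physical names come from the example above; the class, the platform keys, and the definition text are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CanonicalAttribute:
    """One business attribute plus the name it goes by on each platform."""
    canonical_name: str
    definition: str
    aliases: dict[str, str] = field(default_factory=dict)

    def name_on(self, platform):
        """Platform-specific name, falling back to the canonical one."""
        return self.aliases.get(platform, self.canonical_name)


def resolve(attributes, platform, physical_name):
    """Find the canonical attribute behind a platform-specific name."""
    for attribute in attributes:
        if attribute.aliases.get(platform) == physical_name:
            return attribute
    return None


customer_id = CanonicalAttribute(
    canonical_name="customer_id",
    definition="Unique reference for a customer (illustrative wording).",
    aliases={
        "legacy": "CUSTOMER_ID",
        "snowflake": "CUSTOMER_KEY",
        "power_bi": "Customer Identifier",
        "catalog": "Business Customer Reference",
    },
)

match = resolve([customer_id], "snowflake", "CUSTOMER_KEY")
print(match.canonical_name, "->", match.name_on("power_bi"))
```

With a record like this, the translation between platforms is stored once instead of being rediscovered on every project.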

Modernization projects spend enormous effort rediscovering and translating metadata that already exists somewhere in the organization.

This creates:

Delivery delays
Documentation drift
Inconsistent implementations
Increased testing effort
Knowledge dependency on SMEs

⸻

A Metadata-Driven Modernization Strategy

Instead of migrating code directly, organizations can first create a standardized metadata representation.

This becomes a Canonical Metadata Model.

The Canonical Metadata Model acts as an abstraction layer between business metadata and technology platforms.

Legacy Sources

STTM Documents
Data Dictionaries
Data Models
Legacy ETL Jobs
Database Schemas
Business Rules

↓

Canonical Metadata Model

Standardized representation of:

Entities
Attributes
Relationships
Transformations
Data Quality Rules
Lineage
Governance
Business Definitions

↓

Modern Outputs

Snowflake DDL
Databricks Notebooks
dbt Models
Airflow DAGs
Monte Carlo Configurations
ER Diagrams
Data Dictionaries
Technical Specifications
Power BI Semantic Models
Sigma Semantic Models

Build Once. Generate Everywhere.

⸻

How DE Copilot Approaches Modernization

DE Copilot is built around this concept.

Instead of generating individual artifacts independently, the platform converts enterprise metadata into a Canonical Metadata Model.

The Canonical Metadata Model becomes the single source of truth.

Once standardized, generators can produce multiple technology-specific outputs.

Current Capabilities

Snowflake DDL Generation
Snowflake SQL Generation
Data Dictionary Generation
Technical Specification Generation
Data Quality Rule Generation
AI Metadata Analysis

Future Roadmap

ER Diagram Generation
dbt Model Generation
Databricks Notebook Generation
Airflow DAG Generation
Monte Carlo Configuration Generation
Power BI Semantic Model Generation
Sigma Semantic Model Generation
Knowledge Discovery Copilot

⸻

Why This Matters

Modernization projects often fail because organizations rebuild the same knowledge repeatedly.

Every new platform requires another translation exercise.

A metadata-driven approach changes that.

Instead of rewriting business logic for every technology, organizations standardize metadata once and generate multiple implementations.

The focus shifts from technology migration to metadata preservation.

⸻

The Future of Data Engineering

For decades, data engineering has been centered around code.

The next generation of platforms will be centered around metadata.

Engineers will spend less time translating spreadsheets into code and more time solving business problems.

The winning organizations will not be the ones with the newest technology stack.

They will be the ones that understand their metadata best.

Because technology changes.

Metadata endures.

And when metadata becomes the product, modernization becomes dramatically simpler.

⸻

About DE Copilot

DE Copilot is a metadata-driven engineering platform that transforms enterprise Source-to-Target Mapping (STTM) documents into production-ready engineering artifacts through a Canonical Metadata Model.

Learn more:

https://dataengineeringcopilot.com

Read:

The Canonical Metadata Model: The Engine Behind DE Copilot

What I Learned After Reviewing Many AI and Developer Projects as a Hackathon Judge

Amit Kumar Singh — Thu, 18 Jun 2026 11:12:59 +0000

Over the last few days, I had the opportunity to review a large number of submissions across developer and AI-focused hackathon challenges.

It was a very different experience from building a project myself.

When you are building, you mostly think about your own idea, your own code, and your own constraints.

When you are judging, you start seeing patterns across many builders.

Some projects had beautiful interfaces but limited technical depth.

Some had very strong engineering but needed better documentation.

Some were simple ideas, but solved a real problem clearly.

Some were ambitious platforms, but still needed stronger proof of usability, reliability, or completion.

A few lessons stood out to me.

1. A good project is not only about the idea

Many submissions had interesting ideas.

But the stronger ones clearly showed:

what problem they were solving
what existed before
what was improved
what technical choices were made
what the user can actually do now

The difference between “interesting” and “strong” was usually execution clarity.

2. Completion matters

In a finish-up style challenge, the best projects were not always the flashiest.

The best ones showed a real before-and-after story.

Examples of strong completion signals included:

broken workflows fixed
apps deployed publicly
documentation improved
tests added
security gaps reduced
onboarding improved
production-readiness increased

Shipping matters.

3. Documentation is part of engineering

Some technically strong projects were harder to evaluate because the documentation was thin.

A clear README, architecture diagram, demo video, screenshots, setup steps, and known limitations can significantly improve how a project is understood.

Good documentation does not replace good engineering.

But it helps people trust the engineering.

4. AI-assisted development still needs human judgment

Many projects used AI tools like GitHub Copilot.

The stronger submissions were honest about how AI helped.

They did not claim that AI magically built the entire project.

Instead, they explained how AI helped with boilerplate, debugging, refactoring, documentation, test cases, UI polish, or repetitive implementation work.

That is a realistic and mature use of AI-assisted development.

5. Real-world thinking stands out

The projects that stood out most often had practical engineering judgment:

security considerations
user onboarding
error handling
observability
privacy
reliability
deployment readiness
maintainability

These are the things that turn a demo into a product.

6. Simple but complete can beat ambitious but unclear

A focused project with a working demo, clear use case, and thoughtful finishing work can be stronger than a large idea with missing proof.

Clarity matters.

Completeness matters.

Evidence matters.

Final Thought

Judging these projects reminded me how much energy and creativity exist in the developer community.

It also reinforced something I strongly believe:

Building software is not only about writing code.

It is about solving a problem, explaining the solution, making it usable, and finishing the work well enough that someone else can understand it, trust it, and use it.

That is where real engineering maturity starts.

From Metadata to Knowledge Discovery: Why I Am Not Starting With a Chatbot

Amit Kumar Singh — Tue, 16 Jun 2026 03:44:37 +0000

A lot of AI products today start with the same idea:

Upload documents.
Ask questions.
Get answers.

In other words:

Chat with your documents.

That is a powerful pattern.

But for enterprise data engineering, I do not think every AI product needs to start as a chatbot.

In fact, starting with a chatbot can make the first version unnecessarily complex.

The moment we create an open-ended chatbot, we also need to think about:

RAG
permissions
citations
hallucinations
evaluation
guardrails
scope control
user intent
knowledge freshness
answer traceability

All of these are important.

But they may not be the first problems to solve.

For the first version of Data Engineering Copilot, I am thinking differently.

The current MVP is not a chatbot.

It is a workflow application.

The flow is simple:

Upload STTM
    ↓
Generate SQL
Generate DQ Rules
Generate Data Dictionary
    ↓
Download Artifacts

That may look simple.

But I think that simplicity is the strength.

The application is not trying to answer every possible question.

It is focused on one clear data engineering workflow:

Take structured metadata as input and generate useful engineering artifacts as output.

In this model, the UI itself becomes a form of scope control.

The user cannot ask the system to write a Python game.

The user cannot ask random questions outside the product boundary.

The user cannot force the system into unrelated tasks.

The user can only do what the workflow allows:

Upload metadata.
Validate it.
Generate artifacts.
Download output.

For an early AI product, that is a powerful design choice.

It reduces risk.

It reduces ambiguity.

It makes evaluation easier.

It also makes the product easier to explain.

Instead of saying:

“This is a chatbot for data engineering.”

The product can say:

“This is a metadata-driven artifact generation engine for data engineering teams.”

That distinction matters.

Because in data engineering, many tasks are not open-ended conversations.

They are repeatable workflows.

For example:

Generate Snowflake SQL from STTM
Generate PySpark transformation logic
Generate DQ rules
Generate reconciliation checks
Generate data dictionaries
Generate technical specifications
Validate mappings
Identify missing metadata

These tasks do not always require a chatbot.

They require structured input, business rules, validation, and controlled generation.
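As a sketch of what controlled generation can mean, the function below drafts an `INSERT ... SELECT` statement from mapping rows. It assumes the STTM has already been parsed into dictionaries, that `Transformation_Rule` holds either a SQL expression or an empty string, and that all rows share one source and one target table. The function name and the sample table and column names are hypothetical.

```python
def build_insert_select(rows):
    """Draft one INSERT ... SELECT from STTM rows sharing a source and target table."""
    sources = {row["Source_Table"] for row in rows}
    targets = {row["Target_Table"] for row in rows}
    if len(sources) != 1 or len(targets) != 1:
        raise ValueError("This sketch handles exactly one source and one target table")

    select_items = []
    for row in rows:
        # Assumption: an empty rule means the column is moved unchanged.
        expression = row["Transformation_Rule"].strip() or row["Source_Column"]
        select_items.append(f"    {expression} AS {row['Target_Column']}")

    target_columns = ", ".join(row["Target_Column"] for row in rows)
    return "\n".join([
        f"INSERT INTO {targets.pop()} ({target_columns})",
        "SELECT",
        ",\n".join(select_items),
        f"FROM {sources.pop()}",
    ])


sttm_rows = [
    {"Source_Table": "STG_CUSTOMER", "Source_Column": "CUST_ID",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_KEY",
     "Transformation_Rule": ""},
    {"Source_Table": "STG_CUSTOMER", "Source_Column": "CUST_NAME",
     "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_NAME",
     "Transformation_Rule": "TRIM(CUST_NAME)"},
]

sql = build_insert_select(sttm_rows)
print(sql)
```

The output is fully determined by the input rows, which makes it straightforward to test and review. The generated statement is still a draft: the rule text is copied as-is and needs the same human validation as hand-written SQL.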

That is why I believe the first version of an enterprise AI copilot does not need to be overly complicated.

It can start with:

Metadata In
    ↓
Artifacts Out

Once that foundation is working, the product can evolve.

Later versions can add:

Ask questions about STTM
Ask questions about data lineage
Ask questions about DQ rules
Ask questions about business definitions
Ask questions about downstream impact

At that point, RAG, citations, permissions, and knowledge discovery become more important.

But starting with a controlled workflow allows the product to build trust first.

This is also where guardrails become practical.

In this MVP, guardrails are not abstract AI safety concepts.

They are simple engineering checks:

Does the STTM file have required columns?
Are source and target columns populated?
Are transformation rules present?
Are data types valid?
Are target tables defined?
Can the generated SQL compile?
Are DQ rules generated for mapped fields?

A simple validation rule may look like this:

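import streamlit as st  # `st` is assumed to be Streamlit

# `df` is assumed to be the uploaded STTM sheet, already loaded as a pandas DataFrame.
required_columns = [
    "Source_Table",
    "Source_Column",
    "Target_Table",
    "Target_Column",
    "Transformation_Rule",
]

# Stop at the first missing column so later steps never see an incomplete STTM.
for col in required_columns:
    if col not in df.columns:
        st.error(f"Missing required column: {col}")
        st.stop()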

This is not glamorous.

But it is real.

And in enterprise systems, real usually wins.
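The column check can be extended to the row-level questions in the list above, such as whether source and target columns are populated. The sketch below works on plain dictionaries so it runs without any UI framework; the function name, the choice of required fields, and the sample rows are assumptions.

```python
# Hypothetical choice of fields that must be filled in on every mapping row.
REQUIRED_VALUES = ["Source_Column", "Target_Table", "Target_Column"]


def find_row_issues(rows):
    """Return one message per required cell that is missing or blank."""
    issues = []
    for row_number, row in enumerate(rows, start=1):
        for column in REQUIRED_VALUES:
            value = row.get(column)
            if value is None or not str(value).strip():
                issues.append(f"Row {row_number}: {column} is empty")
    return issues


rows = [
    {"Source_Column": "CUST_ID", "Target_Table": "DIM_CUSTOMER", "Target_Column": "CUSTOMER_KEY"},
    {"Source_Column": "CUST_NAME", "Target_Table": "DIM_CUSTOMER", "Target_Column": "  "},
    {"Source_Column": None, "Target_Table": "DIM_CUSTOMER", "Target_Column": "STATUS"},
]

issues = find_row_issues(rows)
for issue in issues:
    print(issue)
```

Collecting every issue before stopping lets the user fix the whole file in one pass. Missing values read from a spreadsheet may arrive as NaN rather than `None`, so a real check would need to handle that case as well.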

Many AI demos look impressive because they allow open-ended conversation.

But enterprise products survive when they are controlled, testable, traceable, and useful.

That is why I believe the first step for Data Engineering Copilot should not be:

Chat with everything.

It should be:

Understand metadata
Generate trusted artifacts
Create repeatable value

The chatbot can come later.

The knowledge discovery layer can come later.

The agentic workflow can come later.

The foundation should be simple:

STTM
    ↓
Canonical Metadata
    ↓
SQL / DQ / Data Dictionary / Specs

This is the direction I am exploring.

Not because chatbots are bad.

But because data engineering teams often need something more specific.

They need tools that reduce repetitive work.

They need systems that understand metadata.

They need outputs that can be reviewed, validated, and improved.

And eventually, they need AI that can move beyond document retrieval toward evidence-based knowledge discovery.

That journey starts with a small workflow.

Upload metadata.

Generate artifacts.

Validate output.

Build trust.

Then expand.

From RAG to Knowledge Discovery: What Comes Next for Enterprise AI

Amit Kumar Singh — Mon, 15 Jun 2026 02:34:55 +0000

Over the past two years, Retrieval-Augmented Generation (RAG) has become one of the most widely adopted patterns in enterprise AI.

The reason is simple.

Large Language Models are powerful, but they don’t know your company’s internal knowledge.

RAG solved that problem.

Instead of relying solely on what a model learned during training, organizations could connect enterprise documents, retrieve relevant information, and provide additional context at runtime.

The architecture looked something like this:

Enterprise Documents
↓
Chunking
↓
Embeddings
↓
Vector Database
↓
Retrieval
↓
LLM
↓
Answer

For many use cases, this works extremely well.

Employee assistants, HR chatbots, IT support copilots, policy search, document Q&A, and internal knowledge assistants are all examples of successful RAG applications.
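The pipeline above can be sketched end to end in plain Python. This is a toy: word counts stand in for real embeddings, an in-memory list stands in for a vector database, and the final LLM call is left out, so the sketch stops at the assembled prompt. All function names and sample documents are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Rank stored chunks by similarity to the question and keep the best ones."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


documents = [
    "Daily Sales is the sum of SALE_AMOUNT from POS_TRANSACTIONS.",
    "Vacation requests must be approved by a manager.",
]
store = [piece for doc in documents for piece in chunk(doc)]

question = "How is Daily Sales calculated?"
top = retrieve(question, store, top_k=1)
print(build_prompt(question, top))
```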

But as organizations scale their AI initiatives, a new challenge begins to emerge.

The Problem with Enterprise Knowledge

The issue is not that information is missing.

The issue is that information is fragmented.

Consider a simple retail question:

How is Daily Sales calculated?

The answer may exist across multiple artifacts:

Data Dictionary
Source-to-Target Mapping (STTM)
Business Rules
Architecture Diagram
Data Quality Specifications

A traditional RAG system may retrieve some of these documents.

However, no single document contains the complete answer.

The knowledge itself is distributed.

This creates a fundamental challenge.

RAG retrieves documents.

Enterprise users need knowledge.

Why Better Retrieval Isn’t Always Enough

The industry has already introduced several improvements:

Hybrid Search
Reranking
Citations
Confidence Scoring
Agentic RAG
Multi-Step Retrieval

These innovations significantly improve retrieval quality.

However, they still operate primarily at the document level.

The underlying assumption remains:

Find the right documents and the answer will emerge.

In practice, enterprise knowledge is often spread across multiple systems, documents, and teams.

The challenge becomes connecting the pieces.

Enter Knowledge Discovery

What if we stopped thinking about documents as the primary source of truth?

Instead of retrieving documents, what if we extracted knowledge from documents and connected it together?

Imagine converting enterprise artifacts into a Canonical Knowledge Model.

For the Daily Sales example:

Business Term:
Daily Sales
Source System:
POS
Source Table:
POS_TRANSACTIONS
Attribute:
SALE_AMOUNT
Business Rule:
Exclude Cancelled Transactions
DQ Rule:
Value >= 0
Target:
Sales Mart

Now we are no longer working with isolated files.

We are working with connected knowledge.
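In code, such a record could be as simple as a typed structure. The sketch below mirrors the fields of the Daily Sales example; the class name, field names, and the `lineage` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeRecord:
    """One business term with the metadata that explains it."""
    business_term: str
    source_system: str
    source_table: str
    attribute: str
    business_rule: str
    dq_rule: str
    target: str


def lineage(record):
    """One-line summary of where the term comes from and where it lands."""
    return (
        f"{record.source_system}.{record.source_table}.{record.attribute}"
        f" -> {record.target}"
    )


daily_sales = KnowledgeRecord(
    business_term="Daily Sales",
    source_system="POS",
    source_table="POS_TRANSACTIONS",
    attribute="SALE_AMOUNT",
    business_rule="Exclude Cancelled Transactions",
    dq_rule="Value >= 0",
    target="Sales Mart",
)

print(lineage(daily_sales))
```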

The Shift from Retrieval to Discovery

Traditional RAG:

Question
↓
Retrieve Documents
↓
LLM
↓
Answer

Knowledge Discovery:

Question
↓
Identify Business Concept
↓
Discover Relationships
↓
Assemble Evidence
↓
LLM
↓
Trusted Answer

The focus shifts from:

Which document should I retrieve?

to:

What knowledge do I need to assemble?
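The discovery steps can be sketched over a small set of connected facts. Each fact below records which artifact it came from, so the assembled evidence stays traceable. The relation names, the assignment of facts to artifacts, and the function names are illustrative assumptions, and the final LLM step is left out.

```python
# (subject, relation, object, source_artifact); all values are illustrative.
FACTS = [
    ("Daily Sales", "calculated_from", "SALE_AMOUNT", "STTM"),
    ("SALE_AMOUNT", "stored_in", "POS_TRANSACTIONS", "Data Dictionary"),
    ("Daily Sales", "excludes", "Cancelled Transactions", "Business Rules"),
    ("SALE_AMOUNT", "must_satisfy", "Value >= 0", "Data Quality Specifications"),
]


def identify_concept(question, facts):
    """Pick the longest known subject that appears in the question text."""
    subjects = {subject for subject, _, _, _ in facts}
    matches = [s for s in subjects if s.lower() in question.lower()]
    return max(matches, key=len) if matches else None


def discover(concept, facts):
    """Follow relations outward from the concept, breadth-first."""
    found, queue, visited = [], [concept], {concept}
    while queue:
        current = queue.pop(0)
        for subject, relation, obj, source in facts:
            if subject == current:
                found.append((subject, relation, obj, source))
                if obj not in visited:
                    visited.add(obj)
                    queue.append(obj)
    return found


def assemble_evidence(found):
    """Turn discovered facts into cited evidence lines."""
    return [
        f"{s} {r.replace('_', ' ')} {o} [source: {src}]"
        for s, r, o, src in found
    ]


concept = identify_concept("How is Daily Sales calculated?", FACTS)
evidence = assemble_evidence(discover(concept, FACTS))
for line in evidence:
    print(line)
```

The answer draws on four different artifacts, none of which contains the complete picture on its own.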

Why This Matters

Enterprise users rarely ask document-centric questions.

They ask:

Where does this metric originate?
Which systems contribute to this KPI?
What business rules are applied?
What data quality validations exist?
What transformations occur before loading?

Answering these questions requires understanding relationships.

Not just retrieving text.

RAG Isn’t Going Away

I don’t view Knowledge Discovery as a replacement for RAG.

RAG remains a foundational capability.

In fact, retrieval will likely remain a core step in most enterprise AI systems.

The difference is that retrieval becomes one component within a larger knowledge architecture.

A future enterprise AI stack may look like:

Documents
↓
Metadata Extraction
↓
Canonical Knowledge Model
↓
Knowledge Graph
↓
RAG Retrieval
↓
Evidence Assembly
↓
Trusted Answers

Final Thoughts

The evolution of enterprise AI can be viewed as a progression:

Era 1
LLM
Era 2
RAG
Era 3
Advanced RAG
(Hybrid Search, Reranking, Citations)
Era 4
Knowledge Discovery
(Metadata, Relationships, Evidence)

The goal is no longer simply retrieving documents.

The goal is connecting fragmented enterprise knowledge and surfacing trusted evidence when it’s needed.

Perhaps the next generation of enterprise copilots won’t be document assistants.

They’ll be knowledge discovery systems.

From STTM to Snowflake SQL: Building a Metadata-Driven Data Engineering Copilot

Amit Kumar Singh — Sun, 14 Jun 2026 05:33:55 +0000

Most data engineering teams do not struggle because they lack smart people.

They struggle because too much of the delivery process is still repetitive.

A source-to-target mapping document comes in.

Then someone has to manually create:

target table DDL
transformation SQL
data dictionary
technical specification
data quality rules
reconciliation checks
test cases

For one or two tables, this is manageable.

For a real enterprise program with many tables, changing requirements, multiple source systems, and repeated delivery cycles, this becomes a major productivity problem.

That is the problem I am exploring with Data Engineering Copilot.

Website: https://dataengineeringcopilot.com

The idea

The idea is simple:


Upload STTM
   ↓
Parse metadata
   ↓
Normalize into a canonical metadata model
   ↓
Generate engineering artifacts