We had a slightly reckless idea: what if we let AI do most of our data engineering work?
Not "help with a query here and there," but actually build real pipelines.
Azure, Databricks, Delta Lake, the whole thing.
Real enterprise data, messy schemas, and stakeholders who will definitely shout if numbers look wrong.
I'm a Senior Data Engineer, I work on this stack every day, and I still wanted to see how far we could push AI into my own job.
This is what happened when we tried.
The Experiment: Letting AI Touch Real Pipelines
The setup will look familiar to a lot of people:
- Azure as the platform
- Databricks as the main compute environment
- Delta Lake as the storage layer, with a Bronze, Silver, Gold medallion layout
- Unity Catalog for governance and access control
Most of the transformation work lives in PySpark, with SQL on top for reporting and BI layers.
The experiment was simple to describe and painful to watch:
- Give an LLM and Copilot style tools the job of:
- Writing PySpark transformations for a new Silver layer
- Generating SQL for aggregation and reporting tables
- Suggesting schemas and data models for a new feature set
- Proposing fixes for slow or failing jobs
We fed it:
- Plain language descriptions of the business logic
- Table schemas copied from `DESCRIBE` and `SHOW COLUMNS`
- A few existing notebooks as "examples of our style"
All of this happened in a safe Databricks workspace with test data and separate storage. No chance of breaking production, but our question was serious. Could we realistically replace most of the day to day data engineering work on a new pipeline?
Where AI Actually Helped
To be fair, AI did a few things well enough that I now use it on purpose.
Boilerplate PySpark
Whenever I needed yet another "read, filter, transform, write" notebook, the model saved a bit of time:
- Reading from Delta tables
- Simple filters and column selections
- Casting and basic feature engineering
- Writing back to Delta with a reasonable partition strategy
For example, I asked it something close to:
"Read from `bronze.orders`, filter cancelled orders, cast `order_ts` to timestamp, add `order_date`, then write to `silver.orders_clean` as Delta, partitioned by `order_date`."
The generated PySpark looked like this:
```python
from pyspark.sql import functions as F

df = (
    spark.read.table("bronze.orders")
    .filter(F.col("status") != "CANCELLED")
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
)

(df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("silver.orders_clean"))
```
Could I have written this faster by hand? On a good day, yes.
But over dozens of similar notebooks, the time saved adds up.
Quick SQL starting points
For straightforward reporting queries, Copilot in a SQL editor was handy.
- It auto completed `SELECT` lists once it saw the schema
- It filled in `GROUP BY` and `ORDER BY` clauses correctly most of the time
- It often proposed reasonable aggregates to start from
I still had to adjust conditions and add proper filters, but I was no longer staring at a blank editor. That alone reduces friction.
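To make that concrete, here is the kind of reporting aggregate Copilot drafts well, sketched with SQLite and a hypothetical `orders_clean` schema standing in for a real Gold-layer Delta table:

```python
import sqlite3

# Toy stand-in for a Gold-layer table; schema and values are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders_clean (customer_id INT, amount REAL, order_date TEXT)")
con.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01"),
    (1, 5.0, "2024-01-02"),
    (2, 7.5, "2024-01-01"),
])

# The shape of query an assistant will happily autocomplete once it sees the schema.
rows = con.execute("""
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders_clean
    GROUP BY customer_id
    ORDER BY revenue DESC
""").fetchall()
```

The structure is trivially right; the business-specific filters (which statuses count, which dates are in scope) are exactly the part I still had to supply.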
Documentation and "glue text"
The part that surprised me most was how useful AI was for the boring bits:
- Drafting docstrings and short comments
- Converting bullet points into a light design doc
- Writing a high level description of a pipeline for internal docs
None of this replaces real architectural decisions, but it lets me stay in "technical thinking" mode while an assistant fills in the prose.
Where AI Failed, In Ways That Matter
Now for the part that actually matters, especially if you work in a production environment.
Joins that compile but lie
We tried a customer 360 style pipeline that combined:
- `bronze.customers`
- `bronze.orders`
- `bronze.events` for clickstream data
- `bronze.subscriptions`
We told the model something like:
"Join these to build a customer centric table with basic attributes, last activity date, and current subscription status."
It produced:
- A join from `customers` to `orders` on `customer_id`, which was fine
- A join from `orders` to `subscriptions` on `customer_id`, which was wrong: in our world, the real join key for subscriptions is `account_id`
- An aggregation of events using `MAX(event_ts)` per `customer_id`, ignoring the fact that some event types should not count as "activity"
The result:
- Subscription states merged incorrectly
- Trial vs paid blurred together
- "Last seen" dates inflated by internal or noise events
All of this ran without any schema error. Nothing crashed. The table "looked" fine at a glance.
But the logic was off in exactly the way that breaks trust with downstream users.
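This failure mode is easy to reproduce outside Spark. A toy sketch with plain Python dicts (all names and rows hypothetical) shows how joining subscriptions on the wrong key runs cleanly and simply produces a quietly wrong result:

```python
# Hypothetical toy rows; in reality these are Delta tables.
customers = [{"customer_id": 1, "account_id": "A-10"}]
subscriptions = [{"account_id": "A-10", "subscription_status": "paid"}]

def join_on(left, right, key):
    """Naive join: enrich each left row with the matching right row, if any."""
    index = {r[key]: r for r in right if key in r}
    return [{**row, **index.get(row.get(key), {})} for row in left]

# Joining on customer_id raises no error; it just never matches,
# so subscription status silently goes missing:
wrong = join_on(customers, subscriptions, "customer_id")

# Joining on the real key, account_id, attaches the status:
right = join_on(customers, subscriptions, "account_id")
```

Nothing crashes either way, which is precisely why this class of bug only surfaces when a downstream user notices the numbers are off.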
Columns and tables that never existed
When our prompt was slightly vague, the model started inventing things:
- New boolean columns like `is_active` that were not in any table
- Table names that looked plausible, for example `orders_clean`, which only existed in someone's head, not in our catalog
- Column names that were close to reality, but not exact, such as `customer_email` instead of `email_address`
This is a known issue with LLM generated code in general, often called hallucination.
In a chat window it looks clever. In a notebook connected to real data, it is just a bug factory.
You either fix the code and bend it back toward reality, or you start renaming actual tables and columns to match the hallucination. I saw both instincts on the team.
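One cheap guardrail we added afterwards: validate every table the model references against the catalog before running anything. A minimal sketch, with a hard-coded set standing in for a real Unity Catalog or `information_schema` lookup:

```python
import re

# Stand-in for a real catalog query; in practice, fetch this from
# information_schema.tables or the Unity Catalog API.
KNOWN_TABLES = {"bronze.customers", "bronze.orders", "silver.orders_clean"}

def unknown_tables(sql: str) -> set:
    """Return table references in generated SQL that the catalog has never heard of."""
    refs = set(re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE))
    return refs - KNOWN_TABLES

# A generated query referencing a plausible-sounding table that does not exist:
generated = "SELECT * FROM bronze.orders JOIN bronze.orders_enriched USING (order_id)"
```

A crude regex will not parse every SQL dialect, but even this level of checking catches the "plausible table that only exists in someone's head" class of hallucination before it reaches a cluster.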
Performance "help" that makes things worse
We pointed AI at a slow query on a large Delta table and asked for tuning suggestions.
It happily suggested changes that:
- Removed filters that were actually highly selective
- Rewrote predicates in ways that broke partition pruning
- Introduced joins that would obviously cause huge shuffles
In practice, I spent more time validating each suggestion with `EXPLAIN` and the Databricks query profile than it would have taken to reason through the original plan myself.
Edge cases and dirty data
We set up some non ideal data on purpose:
- Null keys in join columns
- Late arriving events
- Dirty reference data with conflicting keys
- Out of order timestamps
The model did not:
- Add defensive joins or explicit null handling
- Propose data quality checks
- Think about slowly changing dimensions or history
- Distinguish between "missing because it never existed" and "missing because it will arrive later"
The code it generated assumed clean, static, relational textbook data.
That world does not exist in any real enterprise I have worked in.
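For contrast, here is the defensive shape we expect a join to have, again sketched in plain Python with hypothetical rows: null keys are handled explicitly and routed aside, instead of being left to whatever the engine's default match semantics happen to be.

```python
def safe_join(left, right, key):
    """Inner join that refuses to match on null keys and quarantines them."""
    index = {}
    for row in right:
        k = row.get(key)
        if k is not None:
            index.setdefault(k, []).append(row)
    joined, quarantined = [], []
    for row in left:
        k = row.get(key)
        if k is None:
            quarantined.append(row)  # in a real pipeline: write to a DQ table
            continue
        for match in index.get(k, []):
            joined.append({**row, **match})
    return joined, quarantined

orders = [{"customer_id": 1, "amount": 10}, {"customer_id": None, "amount": 5}]
customers = [{"customer_id": 1, "segment": "B2B"}]
joined, quarantined = safe_join(orders, customers, "customer_id")
```

The quarantine list is the point: "missing because it never existed" and "missing because it will arrive later" both end up somewhere visible, instead of silently vanishing from the join.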
Why It Fails: Context, Lineage and Ownership
This is not only about wrong code. There is a deeper gap that explains most of the failures we saw.
No real sense of lineage
The model only sees what we paste into the prompt or make available through a narrow integration. It does not naturally see:
- The end to end flow from Bronze through to Gold
- Which downstream reports and ML models depend on a field
- Where a column definition came from originally
Lineage is technical and social. It lives partly in tools, partly in tribal knowledge, and partly in old Slack threads. AI only sees a slice of that picture unless you build a very deliberate context layer around it.
Business rules live outside the schema
Here is a real kind of rule you will not get from a table definition:
"A customer is active if they had a paid transaction in the last 90 days, except in region X and contract type Y, where the window is 180 days."
Pieces of that rule live in:
- Product requirements
- Email threads
- A teammate's memory
- Old reconciliation docs in some forgotten folder
If you ask AI to "mark active customers," it will give you a clean definition that fits a generic pattern. That pattern is almost guaranteed to differ from your actual rulebook.
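Encoded explicitly, the rule quoted above is only a few lines; the hard part is knowing it exists at all. A sketch, where the region and contract values are placeholders for whatever your rulebook actually says:

```python
from datetime import date, timedelta

def is_active(last_paid_txn: date, today: date, region: str, contract_type: str) -> bool:
    """Hypothetical encoding of the activity rule: 90-day window by default,
    widened to 180 days for region X with contract type Y."""
    window = 180 if (region == "X" and contract_type == "Y") else 90
    return last_paid_txn >= today - timedelta(days=window)
```

A generic "mark active customers" prompt will give you the 90-day branch and nothing else, because the exception lives in an email thread, not in the schema.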
No accountability
When I ship a pipeline:
- My name is on the PR
- I get pinged if a CFO dashboard looks wrong
- I have to answer questions from auditors or risk teams
AI has no skin in the game. It can be wrong with confidence and nothing bad happens to it. That changes how much you trust it, and it should change how you design your review and test processes.
What Actually Worked: AI As a Copilot
After a few weeks, we stopped trying to make AI "do" data engineering and started treating it like an extra pair of very fast hands.
Acceleration for well defined tasks
We now use AI to:
- Turn a clear description into a first draft of PySpark or SQL
- Suggest alternative ways to express the same logic
- Refactor slightly messy code into something cleaner
We still own the logic. We still write tests. The model just gets us to the first draft faster.
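"We still write tests" can be as light as a handful of assertions against the model's draft logic. A minimal, framework-free sketch using plain rows as a stand-in for a DataFrame (names are illustrative):

```python
# Stand-in for the generated "filter cancelled orders" step.
def drop_cancelled(rows):
    return [r for r in rows if r["status"] != "CANCELLED"]

sample = [
    {"order_id": 1, "status": "CANCELLED"},
    {"order_id": 2, "status": "SHIPPED"},
    {"order_id": 3, "status": "RETURNED"},
]
cleaned = drop_cancelled(sample)
```

In the real pipeline the same idea runs as small PySpark fixture tests, but the discipline is identical: every AI-drafted transformation gets at least one known-input, known-output check before it goes anywhere near a merge.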
Debugging assistant, not performance engineer
AI is useful when:
- You paste in a stack trace and ask "what is this error really telling me"
- You show an `EXPLAIN` plan and want a plain language description
- You forget the exact syntax for some obscure Spark function
It is not our primary performance tuner, but it makes the feedback loop a bit shorter.
Glue work and writing
We let AI start the boring writing:
- Basic README files
- First drafts of design docs
- Short explanations for internal wikis
Engineers review and correct the details, but they are not starting from an empty page.
Real Enterprise Scenarios
Two concrete scenarios from the kind of environment many of us work in:
- Marketing attribution pipeline
  - Good: generate window functions, sketch event aggregations, build starter queries
  - Bad: handle the messy attribution rules that change per region and per campaign type
- Finance reconciliation layer
  - Good: boilerplate mapping logic, standard transformations, basic QC checks
  - Bad: interpret accounting rules, satisfy audit requirements, reason about exceptions
In both cases, AI makes small pieces of the job faster. The responsibility and the judgement stay with the human engineers.
Final Verdict
After trying quite hard to let AI handle a big chunk of my data engineering work on Azure Databricks, my view is pretty clear.
- AI will not replace solid data engineers any time soon
- AI will change how those engineers work and what they spend time on
The engineers who learn to:
- Use AI for drafts and helpers, not as an oracle
- Wrap AI generated code in tests, monitoring, and review
- Keep ownership of business rules, data quality and performance
will move faster than those who either ignore these tools or trust them blindly.
AI will not replace data engineers.
Data engineers who use AI well will replace the ones who do not.