Practical post for engineers who've hit the wall where an AI proof-of-concept works on clean data but can't connect to the legacy systems that hold actual production data.
Disclosure: I work at Ailoitte, which builds AI integration layers connecting legacy infrastructure to production AI. Sharing what the engineering actually looks like.
Why does AI work in the demo but break on production data?
AI models expect structured, consistently formatted data. Legacy systems — ERPs, mainframes, proprietary CRMs, on-premise databases — store data in formats built for the system's internal logic, not for external consumption by modern APIs.
The demo works because test data is clean and pre-formatted. Production data is messy, inconsistently structured, and often accessible only through interfaces that predate REST.
This is an integration problem, not an AI problem.
The three layers of a legacy AI integration
Layer 1: Data extraction
Getting data out of the legacy system. The options depend entirely on what the system exposes:
- Direct database connections (SQL over JDBC/ODBC)
- Flat file exports on a schedule
- Existing system APIs — even legacy SOAP ones
- CDC (change data capture) for real-time needs
- Screen scraping as a genuine last resort
The method is dictated by the legacy system, not by preference.
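For the common case where the legacy system allows a read-only database connection, the extraction layer can be as thin as a scheduled batch pull. A minimal sketch, assuming an ODBC DSN and an illustrative customer table (`TB_CUST`, `CUST_ID_03` and friends are placeholders, not a real schema):

```python
import os
import pyodbc

# Read-only ODBC connection to the legacy database; DSN name is a placeholder.
LEGACY_DSN = os.environ.get("LEGACY_DSN", "DSN=legacy_erp;UID=readonly;PWD=changeme")

def extract_customers(batch_size: int = 500):
    """Pull customer rows from the legacy table in batches of dicts."""
    conn = pyodbc.connect(LEGACY_DSN)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT CUST_ID_03, CUST_NM, LAST_UPD_TS FROM TB_CUST")
        columns = [col[0] for col in cursor.description]
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            yield [dict(zip(columns, row)) for row in rows]
    finally:
        conn.close()
```

If the system only offers flat-file exports or a SOAP endpoint, the same generator signature can sit over a file parse or a SOAP client instead. The point is that downstream layers never see the raw legacy interface.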
Layer 2: Transformation
Converting legacy data formats into something the AI model can consume. This is where most of the engineering effort goes.
Legacy schemas were designed for the system's business logic — not for LLM context windows or vector embeddings. Transformation handles:
- Denormalisation
- Field mapping (often manual; legacy field names like `CUST_ID_03` need interpretation before they're useful to anything)
- Type conversion
- Chunking for retrieval
This layer is almost always harder and slower than the actual AI work. Most projects underestimate it significantly.
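To make the field-mapping and chunking work concrete, here is a minimal transformation sketch. The field map, timestamp format, and chunk sizes are assumptions for illustration; the real mapping comes out of sitting with whoever still understands the legacy schema.

```python
from datetime import datetime

# Mapping from opaque legacy column names to names an LLM (and a human) can use.
# This map is illustrative; building the real one is the manual part.
FIELD_MAP = {
    "CUST_ID_03": "customer_id",
    "CUST_NM": "customer_name",
    "LAST_UPD_TS": "last_updated",
}

def transform(legacy_row: dict) -> dict:
    """Rename fields and normalise types on a single extracted row."""
    record = {new: legacy_row.get(old) for old, new in FIELD_MAP.items()}
    # Legacy timestamps often arrive as packed strings; normalise to ISO 8601.
    if isinstance(record["last_updated"], str):
        record["last_updated"] = datetime.strptime(
            record["last_updated"], "%Y%m%d%H%M%S"
        ).isoformat()
    return record

def chunk_for_retrieval(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking; real pipelines usually split on semantic boundaries."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```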
Layer 3: Middleware
The ongoing bridge between the legacy system and the AI. Handles:
- Update propagation — when legacy data changes, the AI's knowledge base needs to stay current
- Latency management — legacy systems are often slow; AI responses need to feel fast
- Error handling — the impedance mismatch between a 1990s database and a modern LLM API creates failure modes that need graceful recovery
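A sketch of the update-propagation and error-handling pieces, assuming a CDC-style change event and a vector store with upsert/delete operations. `embed` and `VectorStore` here are placeholders for whatever embedding model and vector database the stack actually uses, and the retry policy is illustrative.

```python
import time

def embed(text: str) -> list[float]:
    """Placeholder: call whatever embedding model the stack uses."""
    raise NotImplementedError

class VectorStore:
    """Placeholder interface; swap in the real vector database client."""
    def upsert(self, doc_id: str, vector: list[float], metadata: dict) -> None: ...
    def delete(self, doc_id: str) -> None: ...

def propagate_change(event: dict, store: VectorStore, retries: int = 3) -> None:
    """Apply one change event (insert/update/delete) to the AI's knowledge base."""
    for attempt in range(1, retries + 1):
        try:
            if event["op"] == "delete":
                store.delete(event["id"])
            else:
                store.upsert(event["id"], embed(event["text"]), event.get("meta", {}))
            return
        except Exception:
            if attempt == retries:
                raise  # hand off to a dead-letter queue in a real pipeline
            # Exponential backoff: legacy sources and embedding APIs are both slow and flaky.
            time.sleep(2 ** attempt)
```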
When is rip-and-replace actually necessary?
For the AI integration use case specifically: rarely.
Replace if:
- The legacy system is a genuine security liability
- It's causing operational problems beyond the integration challenge
- It's undocumented to the point of being unworkable
- The organisation is already mid-modernisation for other reasons
Don't replace just to enable AI integration.
An API wrapper and transformation layer is almost always faster, cheaper, and lower risk. It also preserves institutional knowledge baked into the legacy system's data model — knowledge that gets lost in a full replacement and has to be re-learned the hard way.
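As a rough illustration of what "wrapper" means in practice: a thin, read-only HTTP facade that queries the legacy database and returns transformed JSON, leaving the legacy system itself untouched. This sketch assumes FastAPI plus the `LEGACY_DSN` and `transform` helpers from the earlier sketches; the endpoint and query are illustrative.

```python
from fastapi import FastAPI, HTTPException
import pyodbc

app = FastAPI()

@app.get("/customers/{customer_id}")
def get_customer(customer_id: str) -> dict:
    """Read-only lookup: query the legacy table, return a transformed record."""
    conn = pyodbc.connect(LEGACY_DSN)
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT CUST_ID_03, CUST_NM, LAST_UPD_TS FROM TB_CUST WHERE CUST_ID_03 = ?",
            customer_id,
        )
        row = cursor.fetchone()
        if row is None:
            raise HTTPException(status_code=404, detail="customer not found")
        columns = [col[0] for col in cursor.description]
        return transform(dict(zip(columns, row)))
    finally:
        conn.close()
```

In production the wrapper would usually read from a cache or replica rather than hitting the legacy database on every request, which is part of the latency management the middleware layer handles.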
What's the realistic timeline and cost?
A scoped AI integration layer connecting a legacy system to production AI:
| Component | Detail |
|---|---|
| Timeline | 6–10 weeks |
| Cost | $40K–$80K depending on legacy system complexity and number of data sources |
The timeline driver is almost always the transformation layer — mapping the legacy schema to a format the AI can use takes longer than the actual AI work.
The bit nobody budgets for
Extraction gets scoped. The AI model gets scoped. Middleware gets scoped.
Then the team discovers that mapping a legacy schema designed for a 15-year-old ERP's internal business logic into something an LLM can reason over is 40% of total project effort.
That surprise is the most common reason legacy AI integration projects run over time and budget.
What legacy integration challenges are you running into? Specifically interested in what extraction approach teams are using when there's no modern API — and whether anyone has found a good pattern for keeping embeddings current when the underlying legacy data updates.