Mark Thorn

Posted on May 20

Integrating LLMs with Legacy Enterprise Systems: What Actually Works

#llm #legacy #distributedsystems #backenddevelopment

Most LLM integration articles assume you are starting from scratch. Clean microservices. Modern APIs. A greenfield codebase your team controls end to end.

That is not where most enterprises live.

The real world is SAP instances from 2009, Oracle ERP deployments that cost more to migrate than to maintain, COBOL batch jobs that run payroll for Fortune 500 companies, and ODBC connections that nobody wants to touch because the one engineer who understood them retired in 2021.

If you are trying to bring LLM capabilities into that environment, the playbook looks completely different from what most tutorials cover. This post is about what actually works.

Why the standard advice breaks here

Every LLM integration guide will tell you to expose your data through clean REST endpoints, chunk your documents, stuff them into a vector database, and wire up a RAG pipeline. That advice is correct. It is also written for teams that have clean data to begin with.

Legacy enterprise systems have four properties that make standard LLM integration genuinely hard.

The model has never seen your data format. SAP stores business data in tables with field names like VBELN, MATNR, and WERKS. Oracle EBS schemas span thousands of tables with naming conventions that only make sense to people who were in the room when those conventions were chosen. The models you are working with were trained on web text, GitHub repositories, and public documentation. Research from SAP published in late 2025 found that LLMs performing well on public benchmarks collapsed to near-zero accuracy when applied to real SAP customer column data, especially once customer-defined table extensions entered the picture. The gap is not a quirk. It is structural. Your enterprise data looks nothing like training data.

Your documentation is a liability, not an asset. The institutional knowledge about why a particular table is structured a certain way often lives entirely in the heads of people who left the company years ago. When you build a RAG pipeline and your source documents are 2014 spec sheets with broken links, handwritten margin notes scanned into PDFs, and six slightly different versions of the same schema sitting in different SharePoint folders, retrieval quality degrades in ways that are nearly impossible to debug from the model side. According to a February 2025 Gartner survey of 1,203 data management leaders, 63% of organizations either do not have or are unsure whether they have the right data management practices for AI. That same research projects that through 2026, organizations will abandon 60% of AI projects due to lack of AI-ready data. The bottleneck is not model capability. It is source data readiness.

You cannot give the model a database connection. Giving an LLM direct access to a production ERP is not a conversation any enterprise security team will have. Access controls, audit requirements, and compliance mandates require a controlled layer between the model and underlying systems. The EU AI Act, whose obligations phase in from 2025 onwards, imposes record-keeping requirements on high-risk AI systems, including automatic logging of events. In practice, auditors will want to reconstruct what actions were taken, when, why, and by whose authority. You need that architecture before deployment, not retrofitted after.

Nothing in legacy environments exists in isolation. This is where teams run into the class of failure described well in "The Code Didn't Break, The Assumptions Did": the system behaves exactly as designed, but the design was built on assumptions that no longer hold. A label print triggers an inventory write. An invoice update touches six downstream processes. A status field change propagates across reporting. When an LLM starts interacting with these systems, even read-only queries can surface data that crosses compliance boundaries you did not anticipate. The assumptions baked into the original integration are the landmines you inherit.

The middleware layer is not optional

The pattern that consistently reaches production is a proper middleware layer between your LLM and everything behind it. Not a thin shim. A genuine service with its own API, its own access controls, and its own observability stack.

This layer does several distinct jobs:

Translates natural language intent into the specific query patterns your legacy systems understand
Enforces the data the model is authorized to see, down to row-level access where required
Normalizes field names, date formats, and data types before context reaches the model
Logs every interaction for audit purposes with decision lineage, not just request and response pairs
Returns structured, sanitized responses rather than raw database outputs

The LLM gateway pattern has emerged as the production standard for this architecture. Your application sends a request to the gateway. The gateway handles routing, authentication, rate limiting, and prompt assembly. It calls downstream systems through controlled interfaces. The model sees clean, contextualized input and never touches raw infrastructure directly.
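
To make the gateway's jobs concrete, here is a minimal sketch of two of them: field-level access enforcement and audit logging. Everything named here (`Gateway`, `ALLOWED_FIELDS`, `fake_erp`, the role names) is hypothetical and only for illustration. A real gateway would sit behind proper authentication and call the ERP through a controlled interface.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical policy table: logical fields each role is allowed to see.
ALLOWED_FIELDS = {
    "planner": {"material_number", "plant", "quantity"},
    "auditor": {"material_number", "plant", "quantity", "unit_cost"},
}


@dataclass
class Gateway:
    # Stand-in for a controlled ERP interface; never a raw database cursor.
    fetch_rows: Callable[[str], list]
    audit_log: list = field(default_factory=list)

    def handle(self, user: str, role: str, query_name: str, reason: str) -> dict:
        allowed = ALLOWED_FIELDS.get(role, set())  # unknown roles see nothing
        rows = self.fetch_rows(query_name)
        # Strip every field the role is not cleared for before the model sees it.
        context = [{k: v for k, v in row.items() if k in allowed} for row in rows]
        request_id = uuid.uuid4().hex
        # Record who asked, why, and what was released, not just request/response.
        self.audit_log.append({
            "request_id": request_id,
            "timestamp": time.time(),
            "user": user,
            "role": role,
            "query": query_name,
            "reason": reason,
            "fields_released": sorted(allowed & {k for r in rows for k in r}),
            "row_count": len(rows),
        })
        return {"request_id": request_id, "context": context}


def fake_erp(query_name: str) -> list:
    # Canned data so the sketch runs without any real system behind it.
    return [{"material_number": "MAT-1", "plant": "1000",
             "quantity": 40, "unit_cost": 12.5}]


gateway = Gateway(fetch_rows=fake_erp)
reply = gateway.handle("jdoe", "planner", "open_stock", "operator stock question")
print(reply["context"])  # unit_cost is absent for the planner role
```

The design choice worth noting is that filtering happens before prompt assembly, so a prompt-injection attempt cannot talk the model into revealing a field it never received.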

IBM's documentation on AI gateways describes this pattern clearly: with RAG enabled at the gateway layer, the system automatically retrieves relevant context from enterprise knowledge bases and injects it into the prompt before generation, bridging the gap between static training data and your live internal data. The gateway becomes the translation layer between two worlds that were never designed to communicate.

MuleSoft articulates the same principle from the integration side in their piece on connecting enterprise APIs to LLMs: enterprises have already invested years building APIs to expose data from ERP, CRM, and legacy systems, and those existing APIs form the foundation for real-time AI, not something that needs to be rebuilt. The future of AI is not about starting over. It is about building on integration work that already exists.

This adds latency and engineering overhead. Both are worth it. The teams that skip this step and build direct integrations spend months debugging failures that are actually access control edge cases or field mapping inconsistencies they did not anticipate.

RAG over legacy documents: where it actually fails

Most enterprise environments have enormous volumes of documents. Technical manuals, compliance specifications, customer guides, support ticket histories, training materials. The instinct is to index everything and let retrieval handle it.

The problem is that retrieval quality is a direct function of index quality, and legacy enterprise documents degrade retrieval in several specific ways.

Dense acronym usage that differs by department, region, and decade. The same three-letter code can mean different things in European logistics documentation versus North American manufacturing specs. An embedding model produces similar vectors for both because the strings are identical. The retrieved context is wrong in a way that is very difficult to detect.

Scanned document noise. When your primary knowledge base is scanned PDFs of printed documents, optical character recognition introduces errors that survive into the vector index. Retrieval can pull in chunks with OCR artifacts that look plausible but contain corrupted field names or numbers.

Version fragmentation. Five slightly different versions of the same spec exist in different folders, SharePoint sites, or legacy file servers. Without explicit version management and deduplication before indexing, all five versions compete for retrieval. The model may synthesize across them and produce something that never existed in any single version.

Cross-references that break. Document references to part numbers, table IDs, or internal codes become broken when those identifiers change across system migrations. The retrieved context refers to a thing that no longer exists under that name.

Research on enterprise RAG accepted to the 2026 IEEE Conference on Artificial Intelligence found that metadata-enriched indexing approaches consistently outperform content-only baselines, with recursive chunking paired with TF-IDF weighted embeddings yielding 82.5% precision on enterprise document sets. More directly: a Pryon medical RAG study found that when the system was restricted to curated, high-quality content, hallucinations dropped to near zero. With unvetted baseline data, the same retrieval architecture fabricated responses for 52% of questions.

The practical implication is that document pre-processing discipline is not optional infrastructure. It is load-bearing architecture. That means canonical naming conventions, deduplication, metadata tagging by system of record, explicit version management, and quality triage before anything enters the index. This work takes longer than building the RAG pipeline itself. Teams that skip it spend months debugging what look like model failures but are actually retrieval failures.
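
As a rough illustration of the deduplication and version-management step, the sketch below drops copies with identical normalized text and keeps only the highest version per document before anything is indexed. The `Doc` fields and the idea that each document already carries a canonical `doc_id` and an integer `version` are assumptions; in a real environment, establishing those is most of the work.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str            # canonical identifier (hypothetical, e.g. a spec number)
    version: int           # assumed to be known and comparable
    system_of_record: str  # metadata tag carried into the index
    text: str


def content_hash(text: str) -> str:
    # Collapse whitespace and case so trivially different copies hash the same.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select_for_index(docs: list) -> list:
    seen_hashes = set()
    latest = {}
    for doc in docs:
        digest = content_hash(doc.text)
        if digest in seen_hashes:
            continue  # exact duplicate of something already considered
        seen_hashes.add(digest)
        current = latest.get(doc.doc_id)
        if current is None or doc.version > current.version:
            latest[doc.doc_id] = doc  # keep only the newest version per doc_id
    return sorted(latest.values(), key=lambda d: d.doc_id)


docs = [
    Doc("SPEC-1", 1, "sharepoint", "Label width is 100 mm."),
    Doc("SPEC-1", 2, "sharepoint", "Label width is 102 mm."),
    Doc("SPEC-1", 2, "fileserver", "Label  width is 102 mm.\n"),  # duplicate copy
    Doc("SPEC-2", 1, "sharepoint", "Barcode must be GS1-128."),
]
for doc in select_for_index(docs):
    print(doc.doc_id, doc.version, doc.system_of_record)
```

Hash-based matching only catches copies that differ in whitespace or case. Near-duplicates with OCR noise need fuzzier matching, which this sketch does not attempt.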

A concrete example: the supply chain labeling world

Consider how this plays out in enterprise label and barcode management. It is an instructive case precisely because it is unglamorous, deeply embedded in legacy ERP environments, and has been dealing with the ERP integration problem for thirty years.

A manufacturer running SAP or Oracle holds product data, lot numbers, shipping addresses, compliance specifications, and regulatory identifiers scattered across dozens of tables. Their label printing system needs to pull the exact right fields for each label type, for each regulatory environment, across multiple facilities and jurisdictions. The ERP and labeling integration pattern that works in this industry relies on universal, low-code connectors that watch for specific database records or file outputs, trigger print jobs, and write status back to the ERP without requiring custom development every time the underlying system gets upgraded. The reason this matters is the upgrade problem: custom integrations break on every SAP version bump. Universal integration survives them.

Now layer an LLM on top of that environment. The useful tasks are not replacing label printing. They are adjacent: parsing new regulatory requirements to identify which label fields are affected, generating audit-ready summaries of label change history, answering operator questions about why a specific label variant was approved. Every one of those tasks requires the model to reason over data living in systems it has no native understanding of.

The middleware layer earns its cost here. A well-designed integration surface translates MATNR into "material number," normalizes date formats from SAP's internal representation, resolves organizational unit codes into human-readable names, and presents the model with context it can reason over. Without that layer, you are asking the model to work with raw ERP output that looks like noise to anything trained on public data.
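
A minimal sketch of that translation step follows. The field labels for VBELN, MATNR, and WERKS come from the text above. The plant lookup table, the choice of ERDAT as an example date field, and the output key names are assumptions for illustration. The date handling assumes SAP's eight-character YYYYMMDD date representation, with all zeros meaning "no date".

```python
from datetime import datetime
from typing import Optional

# Technical field name -> label the model can reason over.
FIELD_LABELS = {
    "VBELN": "sales_document_number",
    "MATNR": "material_number",
    "WERKS": "plant",
    "ERDAT": "created_on",  # assumed example of a date field
}
DATE_FIELDS = {"ERDAT"}
# Hypothetical organizational lookup; real values come from your own master data.
PLANT_NAMES = {"1000": "Hamburg plant", "2000": "Chicago plant"}


def parse_sap_date(value: str) -> Optional[str]:
    # Assumes YYYYMMDD; an all-zero value is treated as an empty date.
    if not value or value == "00000000":
        return None
    return datetime.strptime(value, "%Y%m%d").date().isoformat()


def normalize_record(raw: dict) -> dict:
    out = {}
    for key, value in raw.items():
        label = FIELD_LABELS.get(key, key)  # unknown fields pass through unchanged
        if key in DATE_FIELDS:
            value = parse_sap_date(value)
        elif key == "WERKS":
            value = PLANT_NAMES.get(value, value)  # fall back to the raw code
        out[label] = value
    return out


row = {"VBELN": "0000012345", "MATNR": "MAT-1", "WERKS": "1000", "ERDAT": "20240315"}
print(normalize_record(row))
```

Passing unknown fields through unchanged is a deliberate choice for a sketch. A stricter gateway would drop anything without a mapping, so unmapped columns never reach the model.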

The function calling trap

Agentic LLM patterns are appealing. The model decides what to query, calls the right tool, processes the result, and takes the next step. In greenfield environments with well-designed APIs this pattern works reliably. In legacy enterprise environments it creates problems that are difficult to anticipate and expensive to fix.

Legacy systems were not designed for the interaction patterns LLMs produce. A model exploring an ERP schema through function calls can generate an enormous number of queries in a short time. If those queries touch tables that generate audit log entries, you now have compliance events from AI activity mixed with human activity, which creates regulatory problems in industries where those logs are reviewed for human-initiated actions. If the model attempts a query that crosses a data boundary it was not supposed to reach, the access control failure surfaces as a model error rather than a security event, which is harder to catch.

According to a 2025 ISACA industry report on agentic AI auditing, agentic AI systems create a growing audit challenge because their decision-making processes often lack traceability, weakening accountability and complicating regulatory compliance. The report notes that logs must capture not just what action was taken, but why, and by whose authority. When an agent autonomously chains function calls through a legacy system, reconstructing that decision lineage after the fact is rarely possible.

The safer pattern is constrained function calling: a small, explicit set of tools the model can use, each with a defined schema specifying exactly what it does and what data it returns. No open database cursors. No free-form query interfaces. The reduction in flexibility is real. The reduction in unexpected blast radius is worth it. PwC's 2024 AI Governance Survey found that 78% of enterprise leaders cite auditability as the most important technical governance feature for building regulatory confidence in AI deployments. Constrained function surfaces make auditability possible. Open-ended ones make it aspirational.
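
One way to picture a constrained function surface is an allowlist of tools with declared parameters, where anything outside the list is refused before it reaches a backend. The sketch below is illustrative only: `Tool`, `ToolRegistry`, and `get_label_status` are hypothetical names, and the in-memory dictionary stands in for a controlled ERP read.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical in-memory data standing in for a controlled ERP read.
_LABELS = {"LBL-1": {"label_id": "LBL-1", "status": "approved"}}


def get_label_status(label_id: str) -> dict:
    return dict(_LABELS.get(label_id, {"label_id": label_id, "status": "unknown"}))


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    params: dict  # parameter name -> expected Python type
    handler: Callable[..., Any]


class ToolRegistry:
    def __init__(self, tools: list):
        self._tools = {tool.name: tool for tool in tools}

    def call(self, name: str, args: dict) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            # Not on the allowlist: treat as a security event, not a model error.
            raise PermissionError(f"tool not allowed: {name}")
        if set(args) != set(tool.params):
            raise ValueError(f"{name} expects exactly {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        return tool.handler(**args)


registry = ToolRegistry([
    Tool(
        name="get_label_status",
        description="Return the approval status of one label variant.",
        params={"label_id": str},
        handler=get_label_status,
    ),
])

print(registry.call("get_label_status", {"label_id": "LBL-1"}))
```

Raising `PermissionError` for an unknown tool, rather than returning an error string to the model, keeps boundary violations visible to whatever monitors the gateway.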

Context engineering is the new prompt engineering

The terminology has shifted. Gartner flagged in July 2025 that "context engineering" is displacing "prompt engineering" as the discipline that actually determines production LLM quality. The distinction matters in legacy integration contexts.

Prompt engineering is ad-hoc. Someone figures out a wording that works, pastes it into the system, and moves on. That works during prototyping. It does not work when you are maintaining a production system where every model update is a potential regression, every wording change by a junior engineer is a potential incident, and every cost-driven model swap is a multi-week migration.

An H1 2026 retrospective from Digital Applied found that the shift from craft prompting to what practitioners now call "prompt operations" was the defining change of that period: treating prompts as production artifacts with versioning, ownership, and eval suites from day one. Teams that wrote prompts in 2024 and added evals later spent 2026 inverting the order at significant cost. Prompt-library discipline, with catalog, versioning, and owner-per-prompt, crossed from over-engineering to table stakes by April 2026.

In legacy system integrations specifically, the context assembled for the model must carry domain knowledge it does not have from pretraining:

Field name mappings. The model needs to know that VBELN is a sales order number, that WERKS is a plant code, that your organization's plant codes map to specific geographic locations.
Abbreviation glossaries. Your company has thirty years of internal shorthand. None of it is in the model's training data.
Business rules. Which data relationships are semantically meaningful. Which fields are populated only under certain conditions. Which codes are deprecated and what replaced them.
Regulatory terminology. GHS, UDI, FDA 21 CFR Part 11, GS1-128. If your domain has compliance vocabulary, the model needs it explicitly, not assumed.

Atlan's analysis of enterprise prompt engineering describes this as "domain knowledge embedding": providing AI systems with specialized context that cannot be inferred from general training. Structured prompt processes have been shown to reduce AI errors by up to 76% compared to ad-hoc approaches. The mechanism is not magic. It is the systematic elimination of ambiguity about what the model is supposed to do with your specific data.
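
A small sketch of what assembling that domain context might look like: include only the field mappings and abbreviations the question actually mentions, and always attach the business rules. The function name `build_context`, the matching by simple token lookup, and the sample business rule are assumptions for illustration, not a recommended retrieval strategy.

```python
import re

FIELD_MAP = {
    "VBELN": "sales order number",
    "WERKS": "plant code",
    "MATNR": "material number",
}
GLOSSARY = {
    "UDI": "Unique Device Identification",
    "GHS": "Globally Harmonized System of Classification and Labelling of Chemicals",
}
# Hypothetical rule; real rules come from your own data owners.
BUSINESS_RULES = ["Plant code 0999 is deprecated; records were moved to plant 1000."]


def build_context(question: str, field_map=FIELD_MAP,
                  glossary=GLOSSARY, rules=BUSINESS_RULES) -> str:
    # Match on whole uppercase tokens so only terms in the question are included.
    tokens = set(re.findall(r"[A-Z0-9]+", question.upper()))
    lines = ["Field names:"]
    lines += [f"- {code}: {meaning}" for code, meaning in field_map.items()
              if code in tokens]
    lines.append("Abbreviations:")
    lines += [f"- {abbr}: {meaning}" for abbr, meaning in glossary.items()
              if abbr in tokens]
    lines.append("Business rules:")
    lines += [f"- {rule}" for rule in rules]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


print(build_context("Which WERKS values appear on UDI labels?"))
```

Keeping the mappings in versioned data rather than inside prompt strings means a glossary change can be reviewed and tested like any other change.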

On-premises versus cloud deployment

Many legacy enterprise environments have constraints that make cloud-hosted LLM APIs non-viable. Regulated industries, government contractors, and organizations with data residency requirements cannot route internal enterprise data to external model endpoints. This is not a hypothetical concern. It is a hard architectural constraint that eliminates an entire class of solutions before you start.

On-premises LLM deployment has become substantially more viable since 2024. Quantized versions of models like Llama and Mistral variants can run on enterprise hardware with acceptable performance for many production use cases. Smaller fine-tuned models handling specific, well-scoped tasks can outperform much larger general models on those tasks while running entirely within your infrastructure perimeter.

The operational tradeoff is real. You become responsible for model versioning, hardware provisioning, scaling, inference optimization, and monitoring. For teams already running on-premises infrastructure this is incremental overhead. For teams that have moved entirely to SaaS, it represents a meaningful shift back toward infrastructure ownership.

The hybrid approach that most enterprises settle on is on-premises deployment for workflows that touch sensitive internal data, with cloud APIs for tasks that can operate on sanitized or anonymized information. This requires a routing layer that makes the right call consistently, which is another reason the gateway pattern earns its architectural complexity. The routing decision is a security boundary, not a performance optimization.
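
Treating the routing decision as a security boundary suggests a rule that fails closed: anything not clearly safe stays on-premises. The sketch below assumes a three-level sensitivity classification and a `sanitized` flag set by an upstream step. Both are hypothetical, as is the policy that restricted data never leaves the perimeter even after sanitization.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


ON_PREM = "on_prem"
CLOUD = "cloud"


def route(sensitivity, sanitized: bool) -> str:
    # Only two cases are allowed out; everything else stays inside the perimeter.
    if sensitivity is Sensitivity.PUBLIC:
        return CLOUD
    if sensitivity is Sensitivity.INTERNAL and sanitized:
        return CLOUD
    # RESTRICTED data, unsanitized INTERNAL data, and unclassified input fail closed.
    return ON_PREM


print(route(Sensitivity.INTERNAL, sanitized=False))  # on_prem
print(route(Sensitivity.INTERNAL, sanitized=True))   # cloud
print(route(None, sanitized=True))                   # on_prem
```

Writing the cloud cases as the explicit exceptions means a missing or new classification defaults to the on-premises path rather than leaking to an external endpoint.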

Phasing the integration: what actually ships

Legacy system LLM integration does not ship as a single project. The teams doing it well treat it as a phased program with explicit exit criteria between phases.

Phase one is read-only access. The model answers questions, summarizes documents, flags anomalies, and generates draft content for human review. It writes to nothing. The purpose of this phase is not just to deliver value, which it does, but to learn how the model actually behaves against your specific data. Every enterprise has edge cases in their data model that no amount of upfront analysis will surface. Phase one exposes them in a controlled environment where the blast radius of unexpected behavior is bounded.

Phase two is constrained write access. Specific, explicit actions with defined schemas. Update a status field. Generate a document draft. Trigger a workflow that a human approves before it executes. Human-in-the-loop is not a workaround in this phase. It is load-bearing architecture. According to Gartner, by 2029, 70% of enterprises will deploy agentic AI as part of IT infrastructure operations, up from less than 5% in 2025. The governance gap between autonomous agent actions and human-approved ones grows with scale. Phase two builds the governance infrastructure while the scale is still manageable.
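
A minimal sketch of the human-approval gate in phase two: the model can only propose an action, and nothing executes until a named reviewer approves it. The class names, the in-memory storage, and the `fake_executor` stand-in are all hypothetical. A production version would persist proposals and decisions to the audit store.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class ProposedAction:
    tool: str
    args: dict
    proposed_by: str
    status: Status = Status.PENDING
    decided_by: Optional[str] = None
    action_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ApprovalQueue:
    def __init__(self, executor: Callable[[str, dict], Any]):
        self._executor = executor  # the only path to a real write
        self._actions = {}

    def propose(self, tool: str, args: dict, proposed_by: str) -> ProposedAction:
        action = ProposedAction(tool, dict(args), proposed_by)
        self._actions[action.action_id] = action
        return action

    def decide(self, action_id: str, reviewer: str, approve: bool) -> ProposedAction:
        action = self._actions[action_id]
        if action.status is not Status.PENDING:
            raise ValueError("action has already been decided")
        action.status = Status.APPROVED if approve else Status.REJECTED
        action.decided_by = reviewer  # records by whose authority it ran
        return action

    def execute(self, action_id: str) -> Any:
        action = self._actions[action_id]
        if action.status is not Status.APPROVED:
            raise PermissionError("action is not approved by a human reviewer")
        result = self._executor(action.tool, action.args)
        action.status = Status.EXECUTED  # an approval can be used only once
        return result


def fake_executor(tool: str, args: dict) -> str:
    return f"{tool} applied with {args}"


queue = ApprovalQueue(fake_executor)
action = queue.propose("update_status", {"label_id": "LBL-1", "status": "hold"}, "llm-agent")
queue.decide(action.action_id, reviewer="qa.lead", approve=True)
print(queue.execute(action.action_id))
```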

Phase three is selective automation. Applied only to workflows where phase two has demonstrated reliability and where the cost of errors is manageable. This is the phase most early-stage demos are built toward. It is also the phase where teams that skipped phases one and two discover they cannot answer the audit question "why did the system do that" in a way that satisfies a compliance team.

The mistake is trying to build phase three first. It is the most impressive to demonstrate, which creates organizational pressure to reach it before the governance infrastructure exists to support it. The teams that resist that pressure and build the foundation first are the ones whose deployments are still running eighteen months later.

Legacy system integration has been a hard problem for thirty years. LLMs make parts of it more tractable. They do not eliminate the fundamentals. The data quality problem is still a data quality problem. The access control problem is still an access control problem. The audit trail requirement is still a compliance requirement. What changes is what you can build on top of solved infrastructure, and how fast you can build it once that infrastructure exists.

If you are working through this in a specific ERP environment, drop a comment. Particularly interested in what teams are finding in SAP S/4HANA contexts where data model complexity and compliance requirements tend to collide hardest.