Zeeshan Ghazanfar

RAG patterns that work for structured data vs ones that fail

Most RAG failures in enterprise systems do not come from the embedding model.

They come from using the same retrieval pattern for every kind of data.

A policy document, a support article, and a 12-year-old ERP schema are not the same problem. Treating them the same is how teams end up with demos that look useful and production systems that quietly return wrong answers.

At BrainPack, we see this most often when teams try to use naive RAG over structured enterprise data.

The pattern usually looks like this (sketched in code after the list):

  1. Export tables or reports into text chunks
  2. Embed those chunks
  3. Retrieve the top 5 matches
  4. Ask the model to answer from them
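
In code, the naive pipeline is roughly the following. This is a rough sketch; `embed`, `vector_index`, and `llm_complete` are illustrative stand-ins, not any particular library's API.

```python
# Naive RAG over structured data: flatten rows to text, embed, retrieve, answer.
# `embed`, `vector_index`, and `llm_complete` are illustrative stand-ins.

def export_rows_as_chunks(rows):
    # Flatten each row into a text blob; joins and column semantics are lost here.
    return [", ".join(f"{col}={val}" for col, val in row.items()) for row in rows]


def naive_rag_answer(question, chunks, embed, vector_index, llm_complete, k=5):
    # Embed the chunks once and index them.
    vector_index.add([embed(c) for c in chunks], payloads=chunks)

    # Retrieve the top-k chunks most similar to the question.
    top_chunks = vector_index.search(embed(question), top_k=k)

    # Ask the model to answer from whatever was retrieved, complete or not.
    prompt = "Answer using only the context below.\n\n" + "\n".join(top_chunks)
    prompt += f"\n\nQuestion: {question}"
    return llm_complete(prompt)
```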

It works in a demo because the question is usually close to the text that was embedded.

It fails in production because enterprise questions are rarely simple lookup questions.

Where naive RAG breaks

One deployment involved a legacy ERP with roughly 180 tables, inconsistent column names, and several business concepts spread across multiple modules.

A simple semantic RAG baseline gave acceptable answers for definition-style questions, but it broke badly on operational questions.

Examples:

  • “Which customers had delayed payments last quarter?”
  • “Show vendors with duplicate tax IDs.”
  • “Which orders were approved but not dispatched?”
  • “What changed in receivables compared to the previous period?”

The issue was not that the model could not reason.

The issue was that the retrieved context was incomplete.

For one receivables query, the retriever pulled chunks describing invoice status, but missed payment allocation rules stored in a separate table. The answer looked confident, but the number was wrong by about 18%.

That is the dangerous failure mode with structured data RAG.

The model does not say, “I am missing the join path.”

It answers with the partial context it has.

Pattern that failed: embedding raw table exports

This is the most common failed pattern.

Teams dump rows or reports into text, embed them, and expect semantic search to behave like a database engine.

It does not.

Embedding raw rows loses too much structure:

  • Joins are not explicit
  • Column meanings are ambiguous
  • Filters are applied after retrieval, not before
  • Aggregations are guessed instead of computed
  • Similar-looking records compete with the correct records

In one benchmark, raw row-level RAG answered simple entity lookup questions reasonably well, but failed most aggregation questions.

The failure rate was highest when the answer required joins across three or more tables.

That is expected. Vector search is not designed to discover relational execution plans.

Pattern that failed: embedding schema docs only

Another pattern is embedding schema documentation and asking the model to infer the query.

This helps with table discovery, but it is not enough.

The model may identify the right table family, but still miss business rules like:

  • Which status values count as active
  • Which date field represents the business event
  • Whether cancelled records should be excluded
  • Whether amounts are gross, net, posted, pending, or reversed

We tested this on ERP-style reporting questions.

Schema-doc RAG improved table selection, but still produced unstable answers because it did not enforce execution rules.

The system knew where to look, but not how to calculate.

That difference matters.

Pattern that works: schema-aware retrieval plus SQL execution

For structured data, the strongest pattern is usually not pure RAG.

It is schema-aware retrieval feeding a constrained query-generation layer.

The retrieval layer should find:

  • Relevant tables
  • Column definitions
  • Join paths
  • Business rules
  • Known query examples
  • Validation checks

Then the model generates SQL or an intermediate query plan.

The database computes the result.

The model explains it.

That separation is important.

The model should not calculate totals from retrieved text when the database can calculate them exactly.
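
A minimal sketch of that separation, assuming hypothetical helpers for schema retrieval, SQL generation, and explanation plus a standard DB-API connection; none of the names below are a specific product's API.

```python
# Schema-aware retrieval feeding constrained SQL generation.
# `retrieve_schema_context`, `generate_sql`, and `llm_explain` are hypothetical
# helpers; the point is the split of work, not any specific API.

ALLOWED_TABLES = {"invoices", "payments", "payment_allocations", "customers"}

def answer_structured_question(question, db_conn,
                               retrieve_schema_context, generate_sql, llm_explain):
    # 1. Retrieve tables, column definitions, join paths, business rules,
    #    and known query examples relevant to the question.
    context = retrieve_schema_context(question)

    # 2. Refuse to proceed if the plan strays outside approved tables.
    if not set(context["tables"]) <= ALLOWED_TABLES:
        raise ValueError("Retrieved plan references unapproved tables")

    # 3. The model generates SQL from the retrieved context.
    sql = generate_sql(question, context)

    # 4. The database computes the result exactly.
    cursor = db_conn.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()

    # 5. The model only explains the computed result; it never recomputes it.
    return llm_explain(question, sql, rows)
```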

In one internal evaluation, moving from raw text RAG to schema-aware retrieval plus SQL execution improved accuracy on structured reporting questions from the low 40% range to the high 70% range.

The remaining failures were not random. Most came from ambiguous business definitions, especially when the same metric had different meanings across departments.

That is fixable with evaluation sets and explicit metric definitions.

Pattern that works: business concept maps

Enterprise users do not ask questions in table names.

They ask in business language.

“Delayed payment” might map to:

  • Invoice due date
  • Payment posting date
  • Allocation status
  • Partial payment rules
  • Customer credit terms
  • Reversal handling

If the system only retrieves by schema similarity, it misses this.

We maintain business concept maps for production systems.

A concept map links business terms to tables, fields, filters, joins, and known edge cases.
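
As a rough illustration, a single concept map entry can be a plain structured record; the table and column names below are invented for the example.

```python
# One concept map entry: a business term linked to the tables, fields, joins,
# filters, and edge cases that define it. All names here are invented.
CONCEPT_MAP = {
    "delayed payment": {
        "tables": ["invoices", "payments", "payment_allocations"],
        "date_field": "invoices.due_date",  # the business event date, not created_at
        "joins": [
            "invoices.id = payment_allocations.invoice_id",
            "payment_allocations.payment_id = payments.id",
        ],
        "filters": [
            "invoices.status NOT IN ('cancelled', 'reversed')",
            "payments.posted_date > invoices.due_date",
        ],
        "edge_cases": ["partial payments", "customer credit terms override due date"],
    },
}
```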

This reduces one of the most common failure modes: the model choosing the right-looking field instead of the correct business field.

For example, “order date” may exist in multiple places:

  • Created date
  • Approved date
  • Confirmed date
  • Dispatch date
  • Posted date

A generic model will often choose the most obvious one.

A managed AI layer should choose the one that matches the business definition.

Pattern that works: retrieval by task type

Not every enterprise question should trigger the same pipeline.

We usually separate structured-data questions into task types:

  1. Lookup
  2. Aggregation
  3. Comparison
  4. Exception detection
  5. Explanation
  6. Policy or SOP reference

A lookup question can use a lighter path.

An aggregation question should usually go through SQL execution.

An explanation question may need both data and policy context.

This routing matters because the wrong pipeline can produce an answer that sounds correct but is operationally useless.
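
The routing layer itself can be as small as a lookup table in front of the pipelines. In this sketch, `classify_task` stands in for whatever classifier or LLM call does the triage, and the pipeline names are placeholders.

```python
# Route each question to a pipeline by task type. `classify_task` stands in for
# a small classifier or an LLM call; `run_pipeline` dispatches to real pipelines.

PIPELINES = {
    "lookup": "light_retrieval",
    "aggregation": "sql_execution",
    "comparison": "sql_execution",
    "exception_detection": "rules_and_baselines",
    "explanation": "sql_plus_policy_rag",
    "policy_reference": "document_rag",
}

def route_question(question, classify_task, run_pipeline):
    task_type = classify_task(question)                   # e.g. "aggregation"
    pipeline = PIPELINES.get(task_type, "sql_execution")  # default to the computed path
    return run_pipeline(pipeline, question)
```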

A common example is exception detection.

If a user asks, “Which invoices look abnormal this month?” naive RAG retrieves invoices containing similar language.

That is not anomaly detection.

The system needs computed baselines, thresholds, historical comparison, and sometimes human-defined rules.
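
A hedged sketch of what "computed" means here, using invented table and column names and an arbitrary three-standard-deviation threshold that would normally come from human-defined rules.

```python
# "Which invoices look abnormal?" as a computed check, not a similarity search.
# Table and column names are invented; the 3-sigma threshold is arbitrary.
ABNORMAL_INVOICES_SQL = """
WITH vendor_baseline AS (
    SELECT vendor_id,
           AVG(amount)    AS avg_amount,
           STDDEV(amount) AS std_amount
    FROM invoices
    WHERE posted_date < DATE '2025-01-01'      -- historical comparison window
    GROUP BY vendor_id
)
SELECT i.id, i.vendor_id, i.amount
FROM invoices i
JOIN vendor_baseline b ON b.vendor_id = i.vendor_id
WHERE i.posted_date >= DATE '2025-01-01'       -- current period
  AND i.amount > b.avg_amount + 3 * b.std_amount;
"""
```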

Pattern that works: answer validation before response

For production systems, we do not treat the first model answer as final.

We validate it.

Useful checks include (a few are sketched in code after the list):

  • Did the generated query use approved tables?
  • Did it include required filters?
  • Did it avoid deprecated fields?
  • Did row counts look plausible?
  • Did totals reconcile with known reports?
  • Did the answer cite the executed query or source record?
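
Here is a rough sketch of a few of those checks as pre-response gates; the metric rules, table names, and field names are invented for illustration and would, in practice, come from the concept map.

```python
# A few of the checks above as pre-response gates. Metric rules, table names,
# and field names are invented; in practice they come from the concept map.

METRIC_RULES = {
    "receivables": {
        "allowed_date_fields": ["posted_date"],  # not created_at: prevents date leakage
        "required_filters": ["status != 'cancelled'"],
        "deprecated_fields": ["legacy_amount"],
    },
}

def validate_query(metric, sql, approved_tables, referenced_tables):
    rules = METRIC_RULES[metric]
    errors = []
    if not set(referenced_tables) <= set(approved_tables):
        errors.append("query references unapproved tables")
    if any(field in sql for field in rules["deprecated_fields"]):
        errors.append("query uses deprecated fields")
    if not any(field in sql for field in rules["allowed_date_fields"]):
        errors.append("query does not use the approved date field for this metric")
    if not all(f in sql for f in rules["required_filters"]):
        errors.append("query is missing required filters")
    return errors  # an empty list means the query may be executed and answered
```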

One failure we saw repeatedly was date leakage.

The model would use invoice creation date when the metric required posting date.

The answer was syntactically valid and semantically plausible, but financially wrong.

A validation rule caught it because that metric was only allowed to use posting date.

That is the difference between a chatbot and a managed AI operation.

What we learned

For structured enterprise data, RAG is useful, but only when it is not asked to do the database’s job.

Naive RAG works for:

  • Definitions
  • SOP references
  • Field descriptions
  • Simple record lookup
  • Explaining existing reports

Naive RAG fails for:

  • Multi-table joins
  • Aggregations
  • Period comparisons
  • Financial calculations
  • Exception detection
  • Metrics with department-specific definitions

The better architecture is usually hybrid:

  • RAG for context
  • SQL or APIs for computation
  • Business concept maps for meaning
  • Evaluations for regression detection
  • Validation rules for production safety

This is why we describe BrainPack as Enterprise AI Operating Infrastructure, not a one-time chatbot deployment.

The work does not end when the agent answers the first question.

The real work starts when users ask the 500th question, the schema changes, a department redefines a metric, and the model still has to answer safely.

That only happens when the AI layer is monitored, evaluated, re-prompted, and maintained continuously.
