Why PDF-Style RAG Fails on Structured Enterprise Data

#ai #database #dataengineering #rag

Many teams apply document RAG patterns to structured enterprise data.

That usually breaks.

PDF RAG and structured-data RAG are not the same problem.

With PDF RAG, the system usually retrieves text chunks and asks the model to answer from them.

With ERP or CRM data, the problem is different:

Which table contains the answer?
Which fields are reliable?
Which joins are allowed?
Which filters map to the user’s business language?
Which rows are stale, duplicated, or operationally invalid?
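One way to make these questions answerable is to keep them in an explicit catalog that the system consults before any query is written. The following is a minimal sketch, not our production schema: the `Table` and `Catalog` classes, the table names, the filter fragments, and the `row_validity` clauses are all hypothetical and only illustrate the shape of the metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    reliable_fields: set[str]            # fields considered safe to answer from
    row_validity: str = ""               # SQL fragment that excludes stale or invalid rows
    business_filters: dict[str, str] = field(default_factory=dict)  # user term -> SQL fragment


@dataclass
class Catalog:
    tables: dict[str, Table]
    allowed_joins: set[frozenset[str]]   # unordered pairs of table names

    def can_join(self, left: str, right: str) -> bool:
        return frozenset({left, right}) in self.allowed_joins

    def tables_for_term(self, term: str) -> list[str]:
        # Which tables know how to interpret this piece of business language?
        return sorted(t.name for t in self.tables.values() if term in t.business_filters)


# Hypothetical example data, for illustration only.
CATALOG = Catalog(
    tables={
        "invoices": Table(
            name="invoices",
            reliable_fields={"id", "customer_id", "amount", "status", "due_date"},
            row_validity="is_deleted = 0",
            business_filters={"overdue": "status = 'open' AND due_date < :today"},
        ),
        "customers": Table(
            name="customers",
            reliable_fields={"id", "name", "region", "segment"},
            row_validity="merged_into_id IS NULL",
            business_filters={"enterprise": "segment = 'ENT'"},
        ),
        "tickets": Table(name="tickets", reliable_fields={"id", "customer_id", "opened_at"}),
    },
    allowed_joins={
        frozenset({"invoices", "customers"}),
        frozenset({"tickets", "customers"}),
    },
)
```

With this in place, `CATALOG.can_join("invoices", "tickets")` is `False`, so a generated query that joins those two tables directly can be rejected before it runs.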

We tested a basic vector-only RAG setup over structured records.

It looked fine in demos.

In production-style evals, it failed on multi-step questions: the retriever found semantically similar records but missed the required relational constraints.
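A toy illustration of that failure, not our actual eval setup: token overlap (Jaccard) stands in for embedding similarity here, and the records and question are made up. The point is only that a similarity score has no notion of "amount greater than 10000", while a structured filter does.

```python
# Hypothetical invoice records, for illustration only.
RECORDS = [
    {"id": 1, "customer": "Acme", "status": "open", "amount": 12000},
    {"id": 2, "customer": "Acme", "status": "open", "amount": 900},
    {"id": 3, "customer": "Acme", "status": "paid", "amount": 15000},
    {"id": 4, "customer": "Globex", "status": "open", "amount": 20000},
]


def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def record_text(record: dict) -> str:
    # How a record might be flattened into text before embedding.
    return f"invoice {record['customer']} {record['status']} {record['amount']}"


def similarity(query: str, record: dict) -> float:
    # Jaccard overlap as a crude stand-in for vector similarity.
    q, r = tokens(query), tokens(record_text(record))
    return len(q & r) / len(q | r)


def top_k_similar(query: str, k: int) -> list[int]:
    ranked = sorted(RECORDS, key=lambda rec: similarity(query, rec), reverse=True)
    return [rec["id"] for rec in ranked[:k]]


def structured_lookup(customer: str, status: str, min_amount: int) -> list[int]:
    # The same question expressed as explicit constraints.
    return [
        rec["id"]
        for rec in RECORDS
        if rec["customer"] == customer
        and rec["status"] == status
        and rec["amount"] > min_amount
    ]


QUESTION = "open invoice Acme over 10000"
similar_ids = top_k_similar(QUESTION, k=2)           # includes record 2 (amount 900)
exact_ids = structured_lookup("Acme", "open", 10000)  # only record 1
```

Record 2 scores as high as record 1 because both share the words "invoice", "acme", and "open" with the question. The amount constraint never enters the ranking.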

The fix was not “better embeddings”.

The fix was schema grounding.

We moved to a hybrid pattern:

1. Classify the user intent.
2. Map terms to business entities.
3. Retrieve schema and field definitions.
4. Generate constrained SQL or API calls.
5. Validate outputs against business rules.
6. Only then pass the final result to the model for explanation.
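The steps above can be sketched end to end with SQLite. This is a simplified illustration under stated assumptions, not our implementation: the keyword-based `classify_intent`, the `SCHEMA`, `TERM_FILTERS`, and `INTENT_TABLES` dictionaries, and the business rule in `validate` are all hypothetical stand-ins, and the final explanation step is left as a returned dict rather than a model call.

```python
import sqlite3

# Hypothetical schema metadata; in practice this would come from a catalog.
SCHEMA = {"invoices": {"columns": {"id", "customer", "status", "amount"}}}
# Hypothetical mapping from business language to (column, operator, value).
TERM_FILTERS = {"open": ("status", "=", "open"), "large": ("amount", ">", 10000)}
INTENT_TABLES = {"invoice_lookup": "invoices"}
ALLOWED_OPS = {"=", ">", "<"}


def classify_intent(question: str) -> str:
    # Stand-in for a real classifier (rules, a small model, or an LLM call).
    if "invoice" in question.lower():
        return "invoice_lookup"
    raise ValueError("unsupported question")


def map_terms(question: str) -> list[tuple]:
    words = set(question.lower().split())
    return [TERM_FILTERS[w] for w in sorted(words) if w in TERM_FILTERS]


def build_sql(table: str, filters: list[tuple]) -> tuple[str, list]:
    # Only whitelisted tables, columns, and operators reach the SQL string;
    # values are always passed as bound parameters.
    columns = SCHEMA[table]["columns"]
    clauses, params = [], []
    for column, op, value in filters:
        if column not in columns or op not in ALLOWED_OPS:
            raise ValueError(f"filter not allowed: {column} {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    where = " AND ".join(clauses) or "1 = 1"
    sql = f"SELECT id, customer, status, amount FROM {table} WHERE {where} ORDER BY id"
    return sql, params


def validate(rows: list[tuple]) -> list[tuple]:
    # Example business rule (an assumption): invoice amounts must be positive.
    if any(row[3] <= 0 for row in rows):
        raise ValueError("business rule violated: non-positive amount")
    return rows


def answer(conn: sqlite3.Connection, question: str) -> dict:
    intent = classify_intent(question)
    table = INTENT_TABLES[intent]
    sql, params = build_sql(table, map_terms(question))
    rows = validate(conn.execute(sql, params).fetchall())
    # Only this validated result would be handed to the model for explanation.
    return {"intent": intent, "sql": sql, "rows": rows}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [(1, "Acme", "open", 12000), (2, "Acme", "open", 900), (3, "Acme", "paid", 15000)],
)
result = answer(conn, "Show large open invoices")
```

In this sketch the model never sees raw rows it could misread. It sees the query that ran and the rows that passed validation, which is what makes the explanation step checkable.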

Accuracy improved because the model stopped guessing from loose chunks and started operating against the real data model.

One failure mode we still monitor closely:

The model can produce a correct-looking answer from incomplete data.

That is worse than an obvious error.

For structured enterprise systems, the hard part is not retrieval.

The hard part is knowing when the retrieved data is not enough.
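One way to approach this is an explicit coverage check that runs before any answer is produced. The sketch below is an assumption about how such a check could look, not a description of our system: `check_coverage`, `summarize_revenue`, and the monthly `period` field are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Coverage:
    complete: bool
    missing: list[str]


def check_coverage(required_periods: list[str], rows: list[dict]) -> Coverage:
    # Compare what the question needs with what was actually retrieved.
    present = {row["period"] for row in rows}
    missing = [p for p in required_periods if p not in present]
    return Coverage(complete=not missing, missing=missing)


def summarize_revenue(required_periods: list[str], rows: list[dict]) -> dict:
    coverage = check_coverage(required_periods, rows)
    if not coverage.complete:
        # Refuse to produce a plausible-looking partial total.
        return {"status": "insufficient_data", "missing": coverage.missing}
    total = sum(r["revenue"] for r in rows if r["period"] in required_periods)
    return {"status": "ok", "total": total}


# Hypothetical data: the question asks about Q3, but September is not loaded yet.
Q3 = ["2024-07", "2024-08", "2024-09"]
partial_rows = [
    {"period": "2024-07", "revenue": 100},
    {"period": "2024-08", "revenue": 120},
]
full_rows = partial_rows + [{"period": "2024-09", "revenue": 90}]

partial_result = summarize_revenue(Q3, partial_rows)
full_result = summarize_revenue(Q3, full_rows)
```

Without the check, the partial rows would sum to 220 and the model could present that as the Q3 total. With it, the result names the missing period instead.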
