Retrieval-Augmented Generation (RAG) is usually introduced as a clever AI pattern: take an LLM, bolt on a vector database, retrieve relevant documents, and voilà—your model is now “grounded” in private data. This framing is seductive because it makes RAG feel like an inference-time concern. Pick a good embedding model, tune top_k, write a better prompt, and the system improves.
In production, this mental model collapses almost immediately.
What actually determines whether a RAG system works over time has very little to do with prompt engineering or model choice. The dominant failure modes are mundane, unglamorous, and painfully familiar to anyone who has built large-scale data systems: stale data, broken pipelines, schema drift, inconsistent backfills, and the absence of contracts between producers and consumers.
RAG does not fail because LLMs hallucinate.
RAG fails because data systems drift.
Once you accept this, the architecture of a “good” RAG system changes completely.
From Toy RAG to Production Reality
Let’s start with a simplified RAG pipeline that appears in most tutorials:
- Load documents
- Split them into chunks
- Generate embeddings
- Store them in a vector database
- Retrieve top-k chunks at query time
- Send them to an LLM
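In code, the whole tutorial flow fits in a few dozen lines. The sketch below (stopping short of the final LLM call) assumes an OpenAI embedding model and uses a plain in-memory list as a stand-in for the vector database; chunk, ingest, and retrieve are illustrative names, not any library's API.

from openai import OpenAI

client = OpenAI()
vector_store = []  # stand-in for a real vector database

def chunk(text, size=800):
    # naive fixed-size chunking
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(documents):
    for doc in documents:
        chunks = chunk(doc)
        response = client.embeddings.create(
            model="text-embedding-3-large", input=chunks
        )
        for text, item in zip(chunks, response.data):
            vector_store.append({"text": text, "vector": item.embedding})

def retrieve(query, top_k=5):
    query_vec = client.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding
    # dot product as a crude similarity score, for brevity
    scored = sorted(
        vector_store,
        key=lambda row: sum(a * b for a, b in zip(row["vector"], query_vec)),
        reverse=True,
    )
    return [row["text"] for row in scored[:top_k]]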
This pipeline assumes something critical but rarely stated: that documents are static.
In real systems, documents change. Policies are updated. Knowledge bases are corrected retroactively. Records are deleted for compliance reasons. Meanings shift even when text does not. If your embedding store does not reflect these changes, retrieval quality degrades silently. Worse, it degrades confidently.
The LLM is not aware that its context is stale. It will happily synthesize an authoritative answer from outdated information.
This is the first sign that RAG is not an inference problem. It is a derived data problem.
Embeddings Are a Materialized View
A useful reframing is to think of embeddings as a materialized view over raw data.
They are:
- Derived from source data
- Expensive to compute
- Immutable once written
- Queried at high frequency
- Assumed to be correct by downstream consumers
This should immediately trigger familiar data-engineering questions:
- What is the source of truth?
- How do changes propagate?
- How do we handle deletes?
- How do we backfill safely?
- How do we know the data is fresh?
Most RAG systems answer none of these explicitly.
Data Freshness and Embedding Invalidation
Consider a simple example: a policy document stored in S3 that is updated weekly. A naïve RAG pipeline embeds the document once and stores the vectors in OpenSearch. A week later, the policy changes, but the embeddings remain untouched.
Your system is now guaranteed to return incorrect answers.
The dangerous part is that nothing breaks. Queries still work. Latency looks fine. Retrieval scores look reasonable. There is no exception to catch.
To prevent this, embedding invalidation must be explicit.
At minimum, each embedding must be associated with:
- A stable source identifier
- A source version or checksum
- A timestamp
For example, a simple metadata schema might look like this:
{
  "document_id": "policy_123",
  "document_version": "2024-11-18",
  "chunk_id": 7,
  "embedding_model": "text-embedding-3-large",
  "created_at": "2024-11-18T10:42:00Z"
}
At query time, retrieval should filter embeddings based on freshness constraints, not blindly trust the vector store.
This already moves RAG closer to a data system: freshness is now a first-class concept.
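Concretely, with that metadata stored next to each vector, a freshness-aware query against OpenSearch might look like the sketch below. The index name and function are illustrative, and efficient filtering inside a k-NN query depends on the OpenSearch version and engine, so treat the exact DSL as a sketch; a post-filter on the same metadata fields is the fallback.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_fresh(query_vector, min_version="2024-11-01", top_k=5):
    body = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": top_k,
                    # only consider chunks derived from a recent source version
                    "filter": {
                        "range": {"document_version": {"gte": min_version}}
                    },
                }
            }
        },
    }
    response = client.search(index="policy_chunks", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]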
Change Data Capture → Incremental Re-Embedding
The next failure point appears at scale. Once you have thousands or millions of documents, re-embedding everything on every change becomes infeasible. Cost explodes, pipelines miss SLAs, and backfills become terrifying.
This is where Change Data Capture (CDC) becomes essential.
Instead of treating embeddings as batch artifacts, treat them as incrementally updated derived data.
A Practical AWS Pattern
Assume your source data lives in Aurora PostgreSQL and is periodically updated.
- Enable CDC using AWS DMS or logical replication.
- Stream changes into an S3 landing zone.
- Trigger re-embedding only for changed records.
A simplified Lambda-based embedding consumer might look like this:
import json

import boto3
from openai import OpenAI

# chunk_document, delete_embeddings, and index_vector are helpers assumed to be
# defined elsewhere. Note that boto3's "opensearch" client is control-plane only
# (domain management); the actual index writes inside those helpers would go
# through a data-plane client such as opensearch-py.
client = OpenAI()
opensearch = boto3.client("opensearch")


def handler(event, context):
    for record in event["Records"]:
        change = json.loads(record["body"])

        # Deletes must propagate to the vector store, or retrieval will keep
        # surfacing records that no longer exist at the source.
        if change["op"] == "DELETE":
            delete_embeddings(change["document_id"])
            continue

        # Re-chunk and re-embed only the changed document.
        text_chunks = chunk_document(change["content"])
        embeddings = client.embeddings.create(
            model="text-embedding-3-large",
            input=text_chunks
        )

        # Store each chunk with the source version so stale vectors can be
        # filtered out or invalidated later.
        for i, vector in enumerate(embeddings.data):
            index_vector(
                document_id=change["document_id"],
                version=change["version"],
                chunk_id=i,
                vector=vector.embedding
            )
This code is not interesting from an ML perspective. It is interesting from a data perspective because it makes embeddings reactive to change.
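For concreteness, assuming the CDC stream is delivered to the Lambda through an SQS queue, a single change record in the message body might look like the following. The field names simply mirror what the handler reads; the exact shape depends on how the DMS output is transformed upstream.

# Hypothetical CDC change record as it arrives in the SQS message body.
change = {
    "op": "UPDATE",                 # INSERT | UPDATE | DELETE
    "document_id": "policy_123",
    "version": "2024-11-25",        # bumped on every source change
    "content": "Full text of the updated policy document...",
}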
Now embeddings behave like any other downstream table in a CDC-driven architecture.
Schema Evolution in “Unstructured” Data
The phrase “unstructured data” is one of the most damaging ideas in modern data systems. PDFs, tickets, chats, and documents are not unstructured—they have implicit schemas.
A policy document might look like prose, but it encodes structure:
- Definitions
- Scope
- Exceptions
- Effective dates
When these structures change, retrieval quality changes too. Chunking strategies that worked before may now split semantically related sections. Old embeddings may no longer align with new meanings.
This is why schema evolution must be modeled explicitly, even for text.
A practical approach is to version:
- Chunking logic
- Section detection
- Metadata extraction
For example:
def chunk_document_v2(document):
    # extract_sections is assumed to return section objects with .text and .type
    sections = extract_sections(document)
    for section in sections:
        yield {
            "text": section.text,
            "section_type": section.type,
            "schema_version": "v2"
        }
By tagging embeddings with a schema_version, you gain the ability to:
- Compare retrieval quality across versions
- Backfill selectively
- Roll back safely
This is standard practice in feature stores. RAG systems should be no different.
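As a sketch of the "backfill selectively" case: find the documents that still have chunks on an older schema version and re-embed only those. The index name, field mappings, and the re_embed_document helper are illustrative.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def documents_needing_backfill(index="policy_chunks", current="v2"):
    # Aggregate document_ids that still have chunks on an older schema version
    # (assumes document_id and schema_version are mapped as keywords).
    body = {
        "size": 0,
        "query": {"bool": {"must_not": [{"term": {"schema_version": current}}]}},
        "aggs": {"docs": {"terms": {"field": "document_id", "size": 10000}}},
    }
    response = client.search(index=index, body=body)
    return [bucket["key"] for bucket in response["aggregations"]["docs"]["buckets"]]

for document_id in documents_needing_backfill():
    re_embed_document(document_id, schema_version="v2")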
Data Contracts for LLM Inputs
In mature data platforms, producers and consumers agree on contracts. LLMs are consumers too, even if they speak natural language.
Without contracts, retrieval layers return “whatever is close enough,” and prompts are expected to fix the rest. This is backwards.
A data contract for RAG might specify:
- Required metadata fields
- Maximum document age
- Allowed document types
- Minimum chunk completeness
Enforcement belongs in the retrieval layer, not the prompt.
def retrieve_context(query_embedding):
    # vector_search is assumed to push these filters down to the vector store,
    # rather than filtering results after the fact.
    results = vector_search(
        embedding=query_embedding,
        filters={
            "document_type": "policy",
            "document_version": ">=2024-10-01"
        }
    )
    return results
The LLM should never see context that violates these guarantees. If no context satisfies the contract, the system should abstain or escalate.
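A minimal sketch of what "abstain or escalate" can look like in code, building on retrieve_context above; ContractViolation, the metadata shape, and call_llm are illustrative names.

REQUIRED_FIELDS = {"document_id", "document_version", "chunk_id"}

class ContractViolation(Exception):
    pass

def answer(query, query_embedding):
    results = retrieve_context(query_embedding)

    # Enforce the contract on what actually came back, not just on the query filters.
    valid = [
        r for r in results
        if REQUIRED_FIELDS.issubset(r.get("metadata", {}))
    ]

    if not valid:
        # Abstain rather than let the LLM improvise from bad context.
        raise ContractViolation(
            f"No context satisfied the retrieval contract for query: {query!r}"
        )

    return call_llm(query, context=valid)  # call_llm assumed to be defined elsewhere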
This is how you prevent hallucinations systemically, not cosmetically.
Backfills: The Moment of Truth
Eventually, you will need to:
- Change embedding models
- Fix broken chunking
- Correct historical data
This requires backfills, and backfills expose architectural weaknesses brutally.
A robust backfill strategy on AWS typically involves:
- Writing new embeddings to a versioned index
- Validating retrieval quality offline
- Atomically switching traffic
Step Functions are ideal for this:
{
  "StartAt": "BatchDocuments",
  "States": {
    "BatchDocuments": {
      "Type": "Map",
      "ItemsPath": "$.documents",
      "Iterator": {
        "StartAt": "EmbedBatch",
        "States": {
          "EmbedBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:embed",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
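The "atomically switching traffic" step usually comes down to an alias flip rather than any data movement. A sketch with opensearch-py, using illustrative index and alias names:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Point the serving alias at the newly backfilled index in a single call.
# Queries always hit the alias, so the cutover is atomic from the reader's view.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "policy_chunks_v1", "alias": "policy_chunks"}},
        {"add": {"index": "policy_chunks_v2", "alias": "policy_chunks"}},
    ]
})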
If backfills are terrifying, your system is not production-ready.
The LLM Is the Least Interesting Part
Once you view RAG through a data-engineering lens, something surprising happens: the LLM becomes interchangeable.
You can swap models. You can change prompts. You can even replace RAG with fine-tuning in some cases.
What you cannot replace easily is:
- Data lineage
- Freshness guarantees
- Versioned embeddings
- Deterministic retrieval
These are the real assets of a production RAG system.
Conclusion: Build RAG Like a Data Platform
RAG systems do not fail because LLMs are probabilistic.
They fail because data systems are treated casually.
If you build RAG like:
- a batch job,
- a demo pipeline,
- or a prompt experiment,
it will collapse under real-world change.
If you build it like:
- a CDC-driven system,
- with contracts, versioning, and backfills,
- using boring, well-understood data engineering principles,
it will scale—and more importantly, it will stay correct.
RAG is a data engineering problem disguised as AI.
Treat it that way, and the AI part becomes easy.