Retrieval-Augmented Generation (RAG) is usually introduced as a clever AI pattern: take an LLM, bolt on a vector database, retrieve relevant documents, and voilà—your model is now “grounded” in private data. This framing is seductive because it makes RAG feel like an inference-time concern. Pick a good embedding model, tune top_k, write a better prompt, and the system improves.
In production, this mental model collapses almost immediately.
What actually determines whether a RAG system works over time has very little to do with prompt engineering or model choice. The dominant failure modes are mundane, unglamorous, and painfully familiar to anyone who has built large-scale data systems: stale data, broken pipelines, schema drift, inconsistent backfills, and the absence of contracts between producers and consumers.
RAG does not fail because LLMs hallucinate.
RAG fails because data systems drift.
Once you accept this, the architecture of a “good” RAG system changes completely.
From Toy RAG to Production Reality
Let’s start with a simplified RAG pipeline that appears in most tutorials:
- Load documents
- Split them into chunks
- Generate embeddings
- Store them in a vector database
- Retrieve top-k chunks at query time
- Send them to an LLM
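In code, the whole tutorial flow fits in a few dozen lines. The sketch below (stopping short of the final LLM call) assumes an OpenAI embedding model and uses a plain in-memory list as a stand-in for the vector database; chunk, ingest, and retrieve are illustrative names, not any library's API.

from openai import OpenAI

client = OpenAI()
vector_store = []  # stand-in for a real vector database

def chunk(text, size=800):
    # naive fixed-size chunking
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(documents):
    for doc in documents:
        chunks = chunk(doc)
        response = client.embeddings.create(
            model="text-embedding-3-large", input=chunks
        )
        for text, item in zip(chunks, response.data):
            vector_store.append({"text": text, "vector": item.embedding})

def retrieve(query, top_k=5):
    query_vec = client.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding
    # dot product as a crude similarity score, for brevity
    scored = sorted(
        vector_store,
        key=lambda row: sum(a * b for a, b in zip(row["vector"], query_vec)),
        reverse=True,
    )
    return [row["text"] for row in scored[:top_k]]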
This pipeline assumes something critical but rarely stated: that documents are static.
In real systems, documents change. Policies are updated. Knowledge bases are corrected retroactively. Records are deleted for compliance reasons. Meanings shift even when text does not. If your embedding store does not reflect these changes, retrieval quality degrades silently. Worse, it degrades confidently.
The LLM is not aware that its context is stale. It will happily synthesize an authoritative answer from outdated information.
This is the first sign that RAG is not an inference problem. It is a derived data problem.
Embeddings Are a Materialized View
A useful reframing is to think of embeddings as a materialized view over raw data.
They are:
- Derived from source data
- Expensive to compute
- Immutable once written
- Queried at high frequency
- Assumed to be correct by downstream consumers
This should immediately trigger familiar data-engineering questions:
- What is the source of truth?
- How do changes propagate?
- How do we handle deletes?
- How do we backfill safely?
- How do we know the data is fresh?
Most RAG systems answer none of these explicitly.
Data Freshness and Embedding Invalidation
Consider a simple example: a policy document stored in S3 that is updated weekly. A naïve RAG pipeline embeds the document once and stores the vectors in OpenSearch. A week later, the policy changes, but the embeddings remain untouched.
Your system is now guaranteed to return incorrect answers.
The dangerous part is that nothing breaks. Queries still work. Latency looks fine. Retrieval scores look reasonable. There is no exception to catch.
To prevent this, embedding invalidation must be explicit.
At minimum, each embedding must be associated with:
- A stable source identifier
- A source version or checksum
- A timestamp
For example, a simple metadata schema might look like this:
{
  "document_id": "policy_123",
  "document_version": "2024-11-18",
  "chunk_id": 7,
  "embedding_model": "text-embedding-3-large",
  "created_at": "2024-11-18T10:42:00Z"
}
At query time, retrieval should filter embeddings based on freshness constraints, not blindly trust the vector store.
This already moves RAG closer to a data system: freshness is now a first-class concept.
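Concretely, with that metadata stored next to each vector, a freshness-aware query against OpenSearch might look like the sketch below. The index name and function are illustrative, and efficient filtering inside a k-NN query depends on the OpenSearch version and engine, so treat the exact DSL as a sketch; a post-filter on the same metadata fields is the fallback.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_fresh(query_vector, min_version="2024-11-01", top_k=5):
    body = {
        "size": top_k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": top_k,
                    # only consider chunks derived from a recent source version
                    "filter": {
                        "range": {"document_version": {"gte": min_version}}
                    },
                }
            }
        },
    }
    response = client.search(index="policy_chunks", body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]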
Change Data Capture → Incremental Re-Embedding
The next failure point appears at scale. Once you have thousands or millions of documents, re-embedding everything on every change becomes infeasible. Cost explodes, pipelines miss SLAs, and backfills become terrifying.
This is where Change Data Capture (CDC) becomes essential.
Instead of treating embeddings as batch artifacts, treat them as incrementally updated derived data.
A Practical AWS Pattern
Assume your source data lives in Aurora PostgreSQL and is periodically updated.
- Enable CDC using AWS DMS or logical replication.
- Stream changes into an S3 landing zone.
- Trigger re-embedding only for changed records.
A simplified Lambda-based embedding consumer might look like this:
import json

import boto3
from openai import OpenAI

# chunk_document, delete_embeddings, and index_vector are helpers assumed to be
# defined elsewhere. Note that boto3's "opensearch" client is control-plane only
# (domain management); the actual index writes inside those helpers would go
# through a data-plane client such as opensearch-py.
client = OpenAI()
opensearch = boto3.client("opensearch")


def handler(event, context):
    for record in event["Records"]:
        change = json.loads(record["body"])

        # Deletes must propagate to the vector store, or retrieval will keep
        # surfacing records that no longer exist at the source.
        if change["op"] == "DELETE":
            delete_embeddings(change["document_id"])
            continue

        # Re-chunk and re-embed only the changed document.
        text_chunks = chunk_document(change["content"])
        embeddings = client.embeddings.create(
            model="text-embedding-3-large",
            input=text_chunks
        )

        # Store each chunk with the source version so stale vectors can be
        # filtered out or invalidated later.
        for i, vector in enumerate(embeddings.data):
            index_vector(
                document_id=change["document_id"],
                version=change["version"],
                chunk_id=i,
                vector=vector.embedding
            )
This code is not interesting from an ML perspective. It is interesting from a data perspective because it makes embeddings reactive to change.
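For concreteness, assuming the CDC stream is delivered to the Lambda through an SQS queue, a single change record in the message body might look like the following. The field names simply mirror what the handler reads; the exact shape depends on how the DMS output is transformed upstream.

# Hypothetical CDC change record as it arrives in the SQS message body.
change = {
    "op": "UPDATE",                 # INSERT | UPDATE | DELETE
    "document_id": "policy_123",
    "version": "2024-11-25",        # bumped on every source change
    "content": "Full text of the updated policy document...",
}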
Now embeddings behave like any other downstream table in a CDC-driven architecture.
Schema Evolution in “Unstructured” Data
The phrase “unstructured data” is one of the most damaging ideas in modern data systems. PDFs, tickets, chats, and documents are not unstructured—they have implicit schemas.
A policy document might look like prose, but it encodes structure:
- Definitions
- Scope
- Exceptions
- Effective dates
When these structures change, retrieval quality changes too. Chunking strategies that worked before may now split semantically related sections. Old embeddings may no longer align with new meanings.
This is why schema evolution must be modeled explicitly, even for text.
A practical approach is to version:
- Chunking logic
- Section detection
- Metadata extraction
For example:
def chunk_document_v2(document):
    # extract_sections is assumed to return section objects with .text and .type
    sections = extract_sections(document)
    for section in sections:
        yield {
            "text": section.text,
            "section_type": section.type,
            "schema_version": "v2"
        }
By tagging embeddings with a schema_version, you gain the ability to:
- Compare retrieval quality across versions
- Backfill selectively
- Roll back safely
This is standard practice in feature stores. RAG systems should be no different.
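As a sketch of the "backfill selectively" case: find the documents that still have chunks on an older schema version and re-embed only those. The index name, field mappings, and the re_embed_document helper are illustrative.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def documents_needing_backfill(index="policy_chunks", current="v2"):
    # Aggregate document_ids that still have chunks on an older schema version
    # (assumes document_id and schema_version are mapped as keywords).
    body = {
        "size": 0,
        "query": {"bool": {"must_not": [{"term": {"schema_version": current}}]}},
        "aggs": {"docs": {"terms": {"field": "document_id", "size": 10000}}},
    }
    response = client.search(index=index, body=body)
    return [bucket["key"] for bucket in response["aggregations"]["docs"]["buckets"]]

for document_id in documents_needing_backfill():
    re_embed_document(document_id, schema_version="v2")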
Data Contracts for LLM Inputs
In mature data platforms, producers and consumers agree on contracts. LLMs are consumers too, even if they speak natural language.
Without contracts, retrieval layers return “whatever is close enough,” and prompts are expected to fix the rest. This is backwards.
A data contract for RAG might specify:
- Required metadata fields
- Maximum document age
- Allowed document types
- Minimum chunk completeness
Enforcement belongs in the retrieval layer, not the prompt.
def retrieve_context(query_embedding):
    # vector_search is assumed to push these filters down to the vector store,
    # rather than filtering results after the fact.
    results = vector_search(
        embedding=query_embedding,
        filters={
            "document_type": "policy",
            "document_version": ">=2024-10-01"
        }
    )
    return results
The LLM should never see context that violates these guarantees. If no context satisfies the contract, the system should abstain or escalate.
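A minimal sketch of what "abstain or escalate" can look like in code, building on retrieve_context above; ContractViolation, the metadata shape, and call_llm are illustrative names.

REQUIRED_FIELDS = {"document_id", "document_version", "chunk_id"}

class ContractViolation(Exception):
    pass

def answer(query, query_embedding):
    results = retrieve_context(query_embedding)

    # Enforce the contract on what actually came back, not just on the query filters.
    valid = [
        r for r in results
        if REQUIRED_FIELDS.issubset(r.get("metadata", {}))
    ]

    if not valid:
        # Abstain rather than let the LLM improvise from bad context.
        raise ContractViolation(
            f"No context satisfied the retrieval contract for query: {query!r}"
        )

    return call_llm(query, context=valid)  # call_llm assumed to be defined elsewhere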
This is how you prevent hallucinations systemically, not cosmetically.
Backfills: The Moment of Truth
Eventually, you will need to:
- Change embedding models
- Fix broken chunking
- Correct historical data
This requires backfills, and backfills expose architectural weaknesses brutally.
A robust backfill strategy on AWS typically involves:
- Writing new embeddings to a versioned index
- Validating retrieval quality offline
- Atomically switching traffic
Step Functions are ideal for this:
{
  "StartAt": "BatchDocuments",
  "States": {
    "BatchDocuments": {
      "Type": "Map",
      "ItemsPath": "$.documents",
      "Iterator": {
        "StartAt": "EmbedBatch",
        "States": {
          "EmbedBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:embed",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
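The "atomically switching traffic" step usually comes down to an alias flip rather than any data movement. A sketch with opensearch-py, using illustrative index and alias names:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Point the serving alias at the newly backfilled index in a single call.
# Queries always hit the alias, so the cutover is atomic from the reader's view.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "policy_chunks_v1", "alias": "policy_chunks"}},
        {"add": {"index": "policy_chunks_v2", "alias": "policy_chunks"}},
    ]
})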
If backfills are terrifying, your system is not production-ready.
The LLM Is the Least Interesting Part
Once you view RAG through a data-engineering lens, something surprising happens: the LLM becomes interchangeable.
You can swap models. You can change prompts. You can even replace RAG with fine-tuning in some cases.
What you cannot replace easily is:
- Data lineage
- Freshness guarantees
- Versioned embeddings
- Deterministic retrieval
These are the real assets of a production RAG system.
Conclusion: Build RAG Like a Data Platform
RAG systems do not fail because LLMs are probabilistic.
They fail because data systems are treated casually.
If you build RAG like:
- a batch job,
- a demo pipeline,
- or a prompt experiment,
it will collapse under real-world change.
If you build it like:
- a CDC-driven system,
- with contracts, versioning, and backfills,
- using boring, well-understood data engineering principles,
it will scale—and more importantly, it will stay correct.
RAG is a data engineering problem disguised as AI.
Treat it that way, and the AI part becomes easy.