What It Actually Takes to Run a RAG System in Production

Parbhat Kapila

RAG systems are easy to demo.

They’re difficult to operate.

A typical prototype retrieves a few documents, sends them to an LLM, and returns a reasonable answer. It works with a small dataset and no real traffic. The problems start when the system is exposed to live usage, larger document sets, and strict latency expectations.

When I moved a retrieval system into production with 10,000+ documents and real users, the first constraint was latency. Users don’t tolerate slow responses. Retrieval needed to stay under 200ms consistently, even under concurrent load. That ruled out naive chunking and default indexing strategies.

Instead of splitting documents purely by token length, chunking was structured around semantic boundaries such as paragraphs and section breaks. This reduced irrelevant matches and improved retrieval precision without increasing context size. On the database side, approximate nearest-neighbor (ANN) indexing was tuned carefully to balance recall against query time. Search parameters (probe counts, candidate-list sizes) were not left at defaults; they were adjusted until latency became predictable rather than variable.
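As a minimal sketch of boundary-aware chunking (not the exact pipeline described above): split on paragraph breaks first, then pack paragraphs into chunks up to a budget. Word counts stand in for real tokenizer counts here.

```python
import re

def semantic_chunks(text, max_tokens=300):
    """Split on paragraph boundaries instead of raw token counts.

    Token counts are approximated by whitespace-separated words; a
    real pipeline would use the embedding model's tokenizer.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        # Flush the current chunk before it would exceed the budget.
        if current and count + words > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks end at paragraph boundaries, a match rarely starts or ends mid-thought, which is what drives the precision gain.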

Caching was introduced not as an optimization, but as a requirement. Repeated queries occur frequently in real systems. Storing embeddings and high-frequency retrieval results in Redis eliminated redundant computation and stabilized response times.
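The caching pattern can be sketched like this, using an in-memory dict as a stand-in for Redis (in production the same hash-keyed get/set logic backs onto Redis, typically with a TTL). The `embed_fn` callable is a placeholder for whatever embedding API is in use.

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a content hash of the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}      # stand-in for Redis
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        vec = self.embed_fn(text)   # only computed on a miss
        self.store[key] = vec
        return vec
```

Keying by content hash rather than raw text keeps keys fixed-length and means identical text always hits the same entry, which is what stabilizes response times for repeated queries.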

Cost became the second constraint.

Embedding large documents repeatedly is expensive. Early versions of the pipeline reprocessed identical content and passed excessive context to the model. That works at low volume. It breaks at scale. Deduplication via hashing, batched embedding requests, and dynamic context sizing significantly reduced token consumption. The goal was not theoretical efficiency; it was preventing costs from compounding with growth.
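Deduplication and batching combine naturally: hash each input, embed only the unique texts in batches, then map vectors back to the original order. A sketch under the assumption that `embed_batch` takes a list of strings and returns a list of vectors:

```python
import hashlib

def embed_deduplicated(texts, embed_batch, batch_size=32):
    """Embed only unique texts, in batches, then restore input order."""
    order = []    # content hash for each input, in order
    unique = {}   # hash -> text, first occurrence wins
    for t in texts:
        h = hashlib.sha256(t.encode()).hexdigest()
        order.append(h)
        unique.setdefault(h, t)

    keys = list(unique)
    vectors = {}
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]
        batch_vecs = embed_batch([unique[k] for k in batch_keys])
        vectors.update(zip(batch_keys, batch_vecs))

    # Duplicates in the input reuse the same computed vector.
    return [vectors[h] for h in order]
```

With heavily duplicated corpora this cuts embedding calls roughly in proportion to the duplication rate, which is where the cost savings compound.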

Reliability exposed another layer of complexity. A production system cannot depend on a single model provider. Rate limits and outages are not hypothetical; they happen. Introducing a provider abstraction layer with automatic fallback and retry logic made failures manageable. Instead of hard downtime, the system degraded gracefully.
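A provider abstraction with fallback can be as simple as an ordered list of callables: retry each provider with backoff on transient errors, then fall through to the next. `ProviderError` and the provider callables here are hypothetical names, not a specific SDK's API.

```python
import time

class ProviderError(Exception):
    """Transient failure: rate limit, timeout, outage."""

def call_with_fallback(providers, prompt, retries=2, backoff=0.5):
    """Try each provider in order, retrying transient failures
    with exponential backoff before moving to the next provider."""
    last_err = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)
            except ProviderError as err:
                last_err = err
                time.sleep(backoff * (2 ** attempt))
    # Every provider exhausted its retries.
    raise last_err
```

The key design choice is that only errors classified as transient trigger fallback; a malformed request should fail fast rather than burn retries across every provider.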

Accuracy required tradeoffs. Increasing top-k retrieval improves recall but increases latency and token usage. Reducing it improves speed but risks missing context. A static configuration wasn’t sufficient. Retrieval depth became dynamic, adjusted based on similarity spread and query characteristics. That preserved response quality without sacrificing performance targets.
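One simple way to make retrieval depth dynamic (a sketch, not necessarily the exact heuristic used above) is to keep results until similarity drops sharply relative to the best match:

```python
def dynamic_top_k(scores, min_k=3, max_k=12, drop_ratio=0.7):
    """Choose k from the similarity spread.

    scores are assumed sorted descending (e.g. cosine similarity).
    Always keep at least min_k results; stop early once a score
    falls below drop_ratio * best, indicating a relevance cliff.
    """
    if not scores:
        return 0
    best = scores[0]
    k = 0
    for s in scores[:max_k]:
        if k >= min_k and s < best * drop_ratio:
            break
        k += 1
    return k
```

Queries with a tight cluster of strong matches get a small k (less latency, fewer tokens); queries with a flat spread keep more context, trading speed for recall only when it is likely to matter.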

Some issues don’t show up in tutorials. Updating embedding models invalidates stored vectors. Schema migrations affect vector dimensions. Prompt changes quietly increase token usage. These are operational problems, not academic ones. They required versioned embeddings, strict schema control, and continuous token auditing.
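Versioned embeddings can be as lightweight as namespacing every stored vector key by the model version, so an upgrade invalidates old vectors instead of silently mixing incompatible dimensions. The key scheme below is illustrative, not a standard:

```python
def embedding_key(doc_id, chunk_index, model_version):
    """Namespace stored vectors by embedding-model version.

    Reads always target the current version's namespace; vectors
    written under an older model simply miss and get re-embedded.
    """
    return f"emb:{model_version}:{doc_id}:{chunk_index}"
```

The same idea applies to schema control: treating the embedding model as part of the schema means a model change is an explicit migration, not a silent corruption.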

A prototype proves that something is possible.

Production proves that it’s sustainable.

The real engineering work in AI systems isn’t generating correct answers. It’s managing latency ceilings, cost growth, provider instability, and system evolution without breaking live workflows.

That’s the difference between experimenting with AI and running it as infrastructure.
