DEV Community: Parbhat Kapila

What It Actually Takes to Run a RAG System in Production

Parbhat Kapila — Thu, 12 Feb 2026 15:24:30 +0000

RAG systems are easy to demo.

They’re difficult to operate.

A typical prototype retrieves a few documents, sends them to an LLM, and returns a reasonable answer. It works with a small dataset and no real traffic. The problems start when the system is exposed to live usage, larger document sets, and strict latency expectations.

When I moved a retrieval system into production with 10,000+ documents and real users, the first constraint was latency. Users don’t tolerate slow responses. Retrieval needed to stay under 200ms consistently, even under concurrent load. That ruled out naive chunking and default indexing strategies.

Instead of splitting documents purely by token length, chunking was structured around semantic boundaries. This reduced irrelevant matches and improved retrieval precision without increasing context size. On the database side, approximate nearest-neighbor indexing was tuned carefully to balance recall and query time. Search parameters were not left at defaults: they were adjusted until latency became predictable rather than variable.
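
As a rough illustration of chunking on semantic boundaries, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed token count. It is not the pipeline described above: the function name `chunk_by_paragraphs`, the use of blank lines as the boundary, and word count as a stand-in for tokens are all assumptions made for the example.

```python
import re


def chunk_by_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept as its own
    oversized chunk rather than being cut mid-thought.
    """
    # Blank lines are treated as the semantic boundary (an assumption).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real system would likely use headings, sentence boundaries, or a tokenizer-aware limit, but the shape is the same: boundaries first, size limit second.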

Caching was introduced not as an optimization, but as a requirement. Repeated queries occur frequently in real systems. Storing embeddings and high-frequency retrieval results in Redis eliminated redundant computation and stabilized response times.
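
The caching idea can be sketched as a lookup keyed on a hash of the normalized query. The production system used Redis; the example below swaps in a small in-memory TTL cache so it runs without a server. `TTLCache`, `cache_key`, `cached_retrieve`, and the `retrieval:` key prefix are hypothetical names, not taken from the original system.

```python
import hashlib
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a Redis-style cache with expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[str]]] = {}

    def get(self, key: str) -> Optional[list[str]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: list[str]) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cache_key(query: str) -> str:
    # Normalize case and whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    return "retrieval:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_retrieve(query: str,
                    retrieve: Callable[[str], list[str]],
                    cache: TTLCache) -> list[str]:
    key = cache_key(query)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = retrieve(query)  # only runs on a cache miss
    cache.set(key, result)
    return result
```

The TTL matters as much as the cache itself: cached retrieval results go stale when the underlying documents change, so expiry (or explicit invalidation) has to be part of the design.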

Cost became the second constraint.

Embedding large documents repeatedly is expensive. Early versions of the pipeline reprocessed identical content and passed excessive context to the model. That works at low volume. It breaks at scale. Deduplication via hashing, batched embedding requests, and dynamic context sizing significantly reduced token consumption. The goal was not theoretical efficiency; it was preventing costs from compounding with growth.
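
Deduplication via hashing and batched embedding requests can be sketched together. In the example below, `embed_new_chunks` is a hypothetical helper, `embed_batch` stands in for whatever embedding API is in use, and `store` is a plain dict playing the role of the vector store. None of these names come from the original pipeline.

```python
import hashlib
from typing import Callable, Sequence

Vector = list[float]


def embed_new_chunks(chunks: Sequence[str],
                     embed_batch: Callable[[list[str]], list[Vector]],
                     store: dict[str, Vector],
                     batch_size: int = 64) -> int:
    """Embed only chunks whose content hash is not already stored.

    Returns the number of chunks that were actually sent for embedding.
    """
    # Map content hash -> text for chunks not seen before, which also
    # collapses duplicates within this call.
    pending: dict[str, str] = {}
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in store and digest not in pending:
            pending[digest] = chunk

    digests = list(pending)
    # Send the remaining chunks in batches rather than one request each.
    for start in range(0, len(digests), batch_size):
        batch = digests[start:start + batch_size]
        vectors = embed_batch([pending[d] for d in batch])
        for digest, vector in zip(batch, vectors):
            store[digest] = vector

    return len(digests)
```

Keying on a content hash means re-ingesting an unchanged document costs nothing, which is what keeps embedding spend from growing with every re-run.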

Reliability exposed another layer of complexity. A production system cannot depend on a single model provider. Rate limits and outages are not hypothetical; they happen. Introducing a provider abstraction layer with automatic fallback and retry logic made failures manageable. Instead of hard downtime, the system degraded gracefully.
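
A provider abstraction with retry and fallback might look roughly like the sketch below. Each provider is modeled as a plain callable, and `complete_with_fallback` and `AllProvidersFailed` are hypothetical names. The retry counts and backoff values are placeholders, not the settings used in production.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its retries."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    retries_per_provider: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying with exponential backoff."""
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return call(prompt)
            except Exception as exc:
                # Assumption: every error is treated as retryable here.
                # Real code would catch only rate-limit and transient errors.
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries_per_provider - 1:
                    sleep(base_delay * 2 ** attempt)
        # Retries exhausted for this provider: fall through to the next one.
    raise AllProvidersFailed("; ".join(errors))
```

Passing `sleep` in as a parameter keeps the backoff testable without real delays.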

Accuracy required tradeoffs. Increasing top-k retrieval improves recall but increases latency and token usage. Reducing it improves speed but risks missing context. A static configuration wasn’t sufficient. Retrieval depth became dynamic, adjusted based on similarity spread and query characteristics. That preserved response quality without sacrificing performance targets.
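
One way to make retrieval depth depend on similarity spread is to keep every candidate whose score sits within a margin of the best match, bounded by a minimum and maximum k. The sketch below illustrates that idea only. `select_dynamic_k`, the margin rule, and the default values are assumptions, and the actual heuristic (including how query characteristics were used) is not specified in this post.

```python
from typing import Sequence


def select_dynamic_k(scores: Sequence[float],
                     min_k: int = 2,
                     max_k: int = 10,
                     margin: float = 0.1) -> int:
    """Pick how many results to keep from a list of similarity scores.

    Tightly clustered scores keep more results; a sharp drop after the
    top match keeps fewer.
    """
    ranked = sorted(scores, reverse=True)[:max_k]
    if not ranked:
        return 0
    cutoff = ranked[0] - margin
    # Count candidates that are close enough to the best match.
    k = sum(1 for s in ranked if s >= cutoff)
    # Never go below min_k, and never exceed what is available.
    return min(len(ranked), max(min_k, k))
```

For example, scores of 0.9, 0.88, 0.6, and 0.5 with the default margin keep two results, because only the first two fall within 0.1 of the top score.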

Some issues don’t show up in tutorials. Updating embedding models invalidates stored vectors. Schema migrations affect vector dimensions. Prompt changes quietly increase token usage. These are operational problems, not academic ones. They required versioned embeddings, strict schema control, and continuous token auditing.
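
Versioned embeddings can be as simple as storing the model name and dimension next to every vector, so stale vectors can be found and re-embedded after a model change. The record layout below is a hypothetical sketch. `EmbeddingRecord`, `stale_records`, and the model names in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EmbeddingRecord:
    doc_id: str
    model: str              # which embedding model produced the vector
    dim: int                # expected vector dimension for that model
    vector: tuple[float, ...]

    def __post_init__(self) -> None:
        # Reject vectors that do not match their declared dimension.
        if len(self.vector) != self.dim:
            raise ValueError(
                f"{self.doc_id}: expected {self.dim} dims, got {len(self.vector)}"
            )


def stale_records(records: Sequence[EmbeddingRecord],
                  active_model: str,
                  active_dim: int) -> list[EmbeddingRecord]:
    """Return records produced by a different model or dimension."""
    return [
        r for r in records
        if r.model != active_model or r.dim != active_dim
    ]
```

With records tagged as `model-v1` and `model-v2`, calling `stale_records(records, "model-v2", 3)` returns only the `model-v1` entries, which is the re-embedding backlog after an upgrade.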

A prototype proves that something is possible.

Production proves that it’s sustainable.

The real engineering work in AI systems isn’t generating correct answers. It’s managing latency ceilings, cost growth, provider instability, and system evolution without breaking live workflows.

That’s the difference between experimenting with AI and running it as infrastructure.

I’m building an AI video system that resolves the entire editing pipeline from one sentence

Parbhat Kapila — Wed, 04 Feb 2026 19:34:00 +0000

Most AI video tools still treat editing as a user responsibility.

You give them a prompt, and then you’re dropped into timelines, templates, and a pile of small decisions. At that point, the “AI” is just helping you operate the UI.

I’m building CUTLINE with a different goal: resolve the entire video pipeline from a single sentence.

One line goes in. The system decides the script, visual pacing, voice, captions, and music, then exports a finished 1080p video. No timelines. No templates. No watermark.

What I care about isn’t how clever the output looks in a demo. It’s whether the system stays predictable when used repeatedly, under real constraints, and with messy inputs.

Most of the hard work hasn’t been prompts. It’s been orchestration, defaults, and deciding where the system should not ask the user anything.

Still early, but I’m curious how others here think about building AI systems versus AI tools, especially in creative workflows.

If you want to see what I’m working on, it’s here:
www.parbhat.dev

Building the video tool I wish existed while shipping SaaS products

Parbhat Kapila — Tue, 03 Feb 2026 09:02:21 +0000

I’ve been building something I wish existed when I was shipping and scaling SaaS products.

CUTLINE is an AI-directed video editing system that turns a single sentence into a finished short video — handling visuals, pacing, and voice automatically.

It’s built as a real system: predictable output, repeatable workflows, and behavior that holds up as usage scales.

Not templates. Not generators. Real edited videos.

Going live soon. Happy to share details or walk through how it works under the hood.

www.parbhat.dev

What building real repositories taught me

Parbhat Kapila — Mon, 26 Jan 2026 18:21:03 +0000

Building RepoDoc has shifted my focus from features to behavior over time.

Large repositories introduce history, ambiguity, and edge cases that don’t show up in small examples. The challenge isn’t producing output, but keeping the system stable and understandable as complexity grows.

That constraint has ended up shaping the product more than any individual feature.

Good tools don’t just work once. They keep working as things get messy.


Why RepoDoc treats documents as immutable

Parbhat Kapila — Sat, 24 Jan 2026 08:26:53 +0000

In RepoDoc, documents are facts.

Everything else (summaries, diffs, embeddings, and confidence scores) is an interpretation.

That separation gives us:

safe re-processing

explainable AI outputs

long-term trust in the system

If your product explains change, your system can’t change the past.
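
The split between facts and interpretations can be sketched with frozen dataclasses: a document never changes, and every derived artifact points back to it by content hash and records which producer version created it. This is an illustrative model, not RepoDoc's actual schema. `Document`, `Interpretation`, and `reinterpret` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Document:
    """A fact: once created, it cannot be modified."""
    path: str
    content: str

    @property
    def content_hash(self) -> str:
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Interpretation:
    """A derived view of a document (summary, diff, embedding, score)."""
    document_hash: str     # which fact this was derived from
    kind: str              # e.g. "summary"
    producer_version: str  # which prompt or model version produced it
    value: str


def reinterpret(doc: Document,
                kind: str,
                producer_version: str,
                produce: Callable[[str], str]) -> Interpretation:
    # Re-processing creates a new interpretation and leaves the document alone.
    return Interpretation(doc.content_hash, kind, producer_version,
                          produce(doc.content))
```

Because interpretations are separate records, an improved summarizer can be run over old documents without rewriting history, and any output can be traced to the exact input and producer version behind it.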