DEV Community

Weilu Wang
Weilu Wang

Posted on

RAG in production: three levels from search to conversation

DEVELOPER. So, you want to build a website that thinks?

USER. I want it to answer questions. Not just search.

DEVELOPER. Then let’s walk the path. First stop: the quiet room.

(He gestures slowly, almost lazily.)

DEVELOPER. Level One. RAG as a librarian. You chop up your help pages – articles, product specs, manuals – into chunks. Run each chunk through an embedding model. Vectors. No AI generation. No talking back. Just a silent index.

USER. Like a smarter search bar?

DEVELOPER. Exactly. You search for “lightweight running shoes.” It finds “super-light marathon trainers.” Same meaning, different words. But the user still gets a list of snippets. Scroll. Click. Read three paragraphs. Patch the answer together. It’s… polite. But dull.

USER. So where’s that useful?

DEVELOPER. “You might also like” recommendations. Back-office content scanning. Batch processing where no human waits for an answer. Not for real conversations.

(His tone quickens.)

DEVELOPER. Level Two. Now we add the voice. The AI.

(He snaps his fingers.)

DEVELOPER. User asks: “How do I reset my password?” The vector library hunts for relevant chunks. Then we stuff the question plus those chunks into a large language model – DeepSeek V4-Flash, GPT-4o-mini, whatever. And the model writes an answer. Complete. Natural. No fluff.

USER. So they don’t have to read ten pages.

DEVELOPER. They get one answer. And they can follow up: “What about the admin reset?” The model remembers the conversation. That’s the jump. From search to answer.

USER. How much does that cost?

DEVELOPER. Pennies. DeepSeek V4-Flash costs ~0.87 per million output tokens. A typical Q&A uses maybe 2000 tokens. That’s 0.0017 per query. For a thousand queries a day?
1.70.Runamediume‑commercesiteforamonth–maybe10.

USER. That’s nothing.

DEVELOPER. Exactly. That’s why Level Two is the sweet spot today. Perfect for internal knowledge bases, product help centers, e‑commerce Q&A, legal summaries, news article deep dives. Anywhere you want a guide, not a search box.

(He raises a finger.)

DEVELOPER. Level Three. CaaS. Conversation as a Service.

(His pace accelerates, words running together.)

DEVELOPER. Now you don’t just index documents. You index APIs. Function specs, endpoint descriptions, parameter lists, call examples. The user says: “Pause my subscription for next week.”

The vector store retrieves the relevant API: PUT /subscriptions/{id}/pause with start_date and duration. The model understands intent and outputs a structured call – JSON, a curl command, whatever your system expects. Then your backend either suggests the action (“Click to confirm”) or executes it directly, depending on safety.

USER. That’s dangerous.

DEVELOPER. It’s powerful. You design a confirmation step when needed. But the leap is clear: the user talks, the website acts. No menus. No forms. For SaaS, CRM, IoT dashboards – any place where talking is faster than clicking.

USER. Where do I start?

(He slows down, leans back.)

DEVELOPER. Start simple. Take your existing help docs. Your FAQ. Your user manual. Throw them into a vector database (Milvus Lite, Qdrant, Chroma). Write a tiny endpoint that: (1) embeds the user’s question, (2) retrieves top‑k chunks, (3) calls an LLM, (4) returns the answer. Add a chat bubble on every page. Two weeks. That’s your first step.

USER. And then?

DEVELOPER. Then you evolve. Add more documents. Tune chunk size. Switch from “retrieve then generate” to a proper RAG pipeline with reranking. And only after that – when you have real APIs – move to Level Three. From “tell me how” to “do it for me.”

(He smiles, pauses.)

DEVELOPER. One sentence: Don’t automate everything at once. Just put your existing help into RAG, add a layer of AI, and you get an assistant that answers “how do I use this feature?” instantly. Happier users. Fewer support tickets. Later, plug in the real APIs. That’s the path.

(Exit.)

Top comments (0)