VedX Group

Posted on May 24

How We Built a Production RAG Chatbot for a Client in 72 Hours (Full Stack Breakdown)

#showdev #ai #rag #architecture

A client messaged us on a Tuesday night. By Friday afternoon, their customer support chatbot was live — handling real queries, citing their actual documentation, and escalating edge cases to humans. No hallucinations. No generic GPT responses. No guesswork.

This is the breakdown of how we did it: the architecture, the stack, the mistakes, and the two decisions that saved the timeline.

We're a small two-person AI development team at VedX — we specialize in building RAG systems, voicebots, and full-stack AI products for businesses. This particular project is one we're allowed to share the technical details of, so here's the full walkthrough.

What the Client Actually Needed

The client ran a SaaS product in the legal compliance space. They had 200+ pages of documentation, an FAQ library, and a Notion workspace full of internal SOPs. Their support team was spending 60% of time answering questions that already existed in their docs — just not findably.

They did not want a generic ChatGPT wrapper. They wanted a bot that:

Only answered from their documents (no hallucinations about things not in scope)
Cited the exact source document when answering
Escalated to a human when confidence was below a threshold
Integrated with their existing React frontend via a simple chat widget

Classic RAG problem. Solvable in 72 hours with the right stack.

The Architecture (The Part That Matters)

Before writing a single line of code, we spent 3 hours on architecture. This is the part most freelancers skip — and it's exactly why most RAG prototypes fail in production.

User query
    ↓
Query preprocessing (intent classification + query expansion)
    ↓
Embedding model (text-embedding-3-small)
    ↓
Vector search → ChromaDB (cosine similarity, top-k=5)
    ↓
Reranker (cross-encoder) → filters to top-k=2
    ↓
LLM (GPT-4o-mini) with retrieved context + source metadata
    ↓
Response + citation + confidence score
    ↓
If confidence < 0.7 → human escalation flag

Two decisions in this architecture made all the difference:

1. Query expansion before embedding

Raw user queries are often short, ambiguous, and poorly formed. "What's the penalty?" gives a terrible embedding. We added an LLM step that rewrites the query into 3 variations before embedding — dramatically improving retrieval recall.

2. Cross-encoder reranking after vector search

Vector similarity is fast but imprecise. A cross-encoder (we used cross-encoder/ms-marco-MiniLM-L-6-v2 from Hugging Face) reads each retrieved chunk alongside the original query and scores actual relevance. This step alone reduced hallucinations by ~40% in our testing.

The Stack

Component	Tool	Why
Backend API	FastAPI (Python)	Async, fast, easy to deploy
Vector DB	ChromaDB (local)	Zero infra for MVP, portable to cloud
Embeddings	OpenAI text-embedding-3-small	Best price/performance ratio
Reranker	HuggingFace cross-encoder	Free, runs locally, < 200ms
LLM	GPT-4o-mini	4x cheaper than GPT-4, sufficient for Q&A
Document parsing	LangChain loaders + custom chunker	Handles PDF, Notion export, DOCX
Frontend widget	React + Tailwind	Embedded via iframe in client's app
Hosting	Railway (API) + Vercel (widget)	Ships in 10 minutes

Total inference cost: ~$0.003 per query at production load.

Day 1: Document Ingestion Pipeline (0–24 hours)

The first bottleneck was always going to be document quality. 200 pages of legal documentation is not a clean dataset — it had tables, footnotes, embedded PDFs, inconsistent heading structures, and duplicate content across Notion pages.

We built a custom chunking strategy rather than using LangChain's default RecursiveCharacterTextSplitter:

def smart_chunk(document: str, max_tokens: int = 400) -> list[str]:
    """
    Split by semantic boundaries: headings > paragraphs > sentences.
    Preserve heading context in each chunk via a sliding header stack.
    """
    chunks = []
    header_stack = []

    for block in parse_blocks(document):
        if block.type == "heading":
            # Update header context for subsequent chunks
            level = block.heading_level
            header_stack = header_stack[:level-1] + [block.text]
        elif block.type == "paragraph":
            context_prefix = " > ".join(header_stack) + "\n\n"
            chunk_text = context_prefix + block.text

            if token_count(chunk_text) <= max_tokens:
                chunks.append(chunk_text)
            else:
                # Sentence-level split with context preserved
                for sentence_chunk in split_by_sentences(block.text, max_tokens, context_prefix):
                    chunks.append(sentence_chunk)

    return chunks

The key insight: prepend the heading path to every chunk. When the user asks "what's the penalty for late filing?" and the answer is buried under Section 4 > Subsection 2 > Penalties, the chunk now contains "Section 4 > Subsection 2 > Penalties" in its text — making vector similarity much more accurate.

This took most of Day 1. We ingested ~4,200 chunks total.

Day 2: Retrieval + Reranking + LLM Prompt (24–48 hours)

Retrieval was straightforward once the chunks were clean. The reranking step is where we spent most of Day 2 tuning.

The prompt we landed on for the final LLM call:

SYSTEM_PROMPT = """
You are a precise assistant for {company_name}. 
Answer ONLY using the provided context documents.
If the answer is not clearly in the context, say: 
"I don't have enough information to answer that — I'll connect you with the team."

Rules:
- Cite your source document in every response: [Source: {doc_name}]
- Never speculate or add information not in the context
- Keep answers under 150 words unless the question requires more detail
- If multiple documents disagree, surface both answers and note the discrepancy
"""

The confidence scoring was the trickiest part. We built a heuristic:

def confidence_score(query: str, retrieved_chunks: list, response: str) -> float:
    # 1. Did the reranker score the top chunk above threshold?
    reranker_score = retrieved_chunks[0].rerank_score  # 0–1

    # 2. Did the LLM response include a source citation?
    has_citation = "[Source:" in response

    # 3. Is the top chunk semantically close to the response?
    response_embedding = embed(response)
    top_chunk_embedding = retrieved_chunks[0].embedding
    semantic_overlap = cosine_similarity(response_embedding, top_chunk_embedding)

    raw_score = (reranker_score * 0.5) + (semantic_overlap * 0.4) + (0.1 if has_citation else 0)
    return raw_score

Anything below 0.70 triggered the "I'll connect you with the team" path. In production testing, this caught every hallucination we induced deliberately.

Day 3: Frontend, Integration, Deploy (48–72 hours)

The React chat widget was 3 hours of work. The actual time sink was the client's CSP (Content Security Policy) blocking our iframe — a problem we've hit before and now check for in every project kickoff call.

We shipped to Railway in 22 minutes using a single Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

ChromaDB data was persisted to a Railway volume. Total infrastructure cost: $5/month.

What We Got Wrong (and Fixed Fast)

Mistake 1: Chunk size too large initially.

We started with 800-token chunks because we wanted more context per chunk. Retrieval precision dropped significantly — the cross-encoder had too much noise to rank. Dropping to 400 tokens and increasing top-k from 3 to 5 fixed it.

Mistake 2: No query preprocessing on Day 1.

The first version of the pipeline embedded the raw user query directly. Short queries like "late filing?" returned garbage. We added the query expansion step on Day 2 and recall improved dramatically.

Mistake 3: Overlooking document deduplication.

The Notion export had significant duplicate content across pages. Without deduplication, the same chunk appeared multiple times in top-k results — wasting context window and making citations redundant. A simple hash-based dedup before ingestion fixed it.

Results After 2 Weeks in Production

68% of queries resolved by the bot without human escalation
Average response time: 1.2 seconds end-to-end
0 hallucinations detected in human review of 500 sampled responses
Support ticket volume: down 41% in the first two weeks
Inference cost: $23 for 7,600 queries (first two weeks)

What We'd Change for a Larger Scale Version

Switch ChromaDB → Qdrant or Pinecone for multi-tenant isolation and horizontal scaling
Add a caching layer (Redis) for frequently asked questions — same query, instant response
Fine-tune the embedding model on domain-specific vocabulary for legal/compliance terminology
Streaming responses via SSE — the 1.2s wait feels long when you're used to ChatGPT's streaming UX

The Actual Takeaway

RAG is not a new idea. The reason most RAG implementations fail in production is not the LLM — it's the retrieval pipeline. Clean chunking, query expansion, and cross-encoder reranking are the unglamorous work that determines whether your chatbot answers correctly or confidently makes things up.

If you're building something similar, the 72-hour timeline is realistic for an MVP with clean documents. For enterprise-grade accuracy on messy document sets, budget 2–3 weeks for the ingestion pipeline alone.

We build these systems at VedX — if you're working on a RAG project and want to talk through the architecture, feel free to reach out.

Tags: #RAG #LLM #AI #Python #FastAPI #freelancing #webdev #chatbot

Author Bio (paste this into dev.to/Hashnode profile):

Divyanshu Purohit — Co-founder at VedX, a freelance AI and full-stack development agency. We build RAG systems, voicebots, and animated web products for startups and SMEs. Based in Jaipur, India. Building things that actually work in production.

Top comments (1)

VedX Group • May 24

good