A client messaged us on a Tuesday night. By Friday afternoon, their customer support chatbot was live — handling real queries, citing their actual documentation, and escalating edge cases to humans. No hallucinations. No generic GPT responses. No guesswork.
This is the breakdown of how we did it: the architecture, the stack, the mistakes, and the two decisions that saved the timeline.
We're a two-person AI development team at VedX, specializing in RAG systems, voicebots, and full-stack AI products for businesses. We have permission to share this project's technical details, so here's the full walkthrough.
What the Client Actually Needed
The client ran a SaaS product in the legal compliance space. They had 200+ pages of documentation, an FAQ library, and a Notion workspace full of internal SOPs. Their support team was spending 60% of its time answering questions whose answers already existed in the docs but were hard to find.
They did not want a generic ChatGPT wrapper. They wanted a bot that:
- Only answered from their documents (no hallucinations about things not in scope)
- Cited the exact source document when answering
- Escalated to a human when confidence was below a threshold
- Integrated with their existing React frontend via a simple chat widget
Classic RAG problem. Solvable in 72 hours with the right stack.
The Architecture (The Part That Matters)
Before writing a single line of code, we spent 3 hours on architecture. This is the part most freelancers skip — and it's exactly why most RAG prototypes fail in production.
User query
↓
Query preprocessing (intent classification + query expansion)
↓
Embedding model (text-embedding-3-small)
↓
Vector search → ChromaDB (cosine similarity, top-k=5)
↓
Reranker (cross-encoder) → filters to top-k=2
↓
LLM (GPT-4o-mini) with retrieved context + source metadata
↓
Response + citation + confidence score
↓
If confidence < 0.7 → human escalation flag
Two decisions in this architecture made all the difference:
1. Query expansion before embedding
Raw user queries are often short, ambiguous, and poorly formed. "What's the penalty?" gives a terrible embedding. We added an LLM step that rewrites the query into 3 variations before embedding — dramatically improving retrieval recall.
2. Cross-encoder reranking after vector search
Vector similarity is fast but imprecise. A cross-encoder (we used cross-encoder/ms-marco-MiniLM-L-6-v2 from Hugging Face) reads each retrieved chunk alongside the original query and scores actual relevance. This step alone reduced hallucinations by ~40% in our testing.
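The two decisions can be sketched together as one small retrieval function. This is a minimal illustration, not the project's code: `expand`, `search`, and `rerank` are hypothetical callables standing in for the LLM rewrite step, the ChromaDB query, and the cross-encoder, and the toy implementations exist only so the example runs without any of them.

```python
from typing import Callable

Hit = tuple[str, str]  # (chunk_id, chunk_text); hypothetical shape for this sketch


def retrieve(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str, int], list[Hit]],
    rerank: Callable[[str, str], float],
    search_k: int = 5,
    final_k: int = 2,
) -> list[tuple[float, Hit]]:
    """Expand the query, search once per variation, then rerank the merged hits."""
    # Always search the original query as well as its rewrites.
    variations = [query] + expand(query)

    # Merge hits across variations, keeping each chunk only once.
    merged: dict[str, Hit] = {}
    for variation in variations:
        for hit in search(variation, search_k):
            merged.setdefault(hit[0], hit)

    # Score every candidate against the ORIGINAL query, not the rewrites.
    scored = [(rerank(query, hit[1]), hit) for hit in merged.values()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:final_k]


# Toy stand-ins (assumptions for illustration only).
DOCS = {
    "c1": "Penalties: late filing incurs a fee per day.",
    "c2": "Filing deadlines fall at the end of each quarter.",
    "c3": "Contact support to reset your password.",
}


def toy_expand(query: str) -> list[str]:
    return [f"{query} late filing", f"{query} fee"]


def toy_search(query: str, k: int) -> list[Hit]:
    words = set(query.lower().split())
    hits = [(cid, text) for cid, text in DOCS.items()
            if words & set(text.lower().split())]
    return hits[:k]


def toy_rerank(query: str, text: str) -> float:
    # Word-overlap ratio; a real cross-encoder reads query and chunk jointly.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)


results = retrieve("penalty for late filing", toy_expand, toy_search, toy_rerank)
for score, (chunk_id, _) in results:
    print(chunk_id, score)
```

Reranking against the original query rather than the rewrites is a design choice in this sketch: the rewrites widen recall, but the user's own wording decides final relevance.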
The Stack
| Component | Tool | Why |
|---|---|---|
| Backend API | FastAPI (Python) | Async, fast, easy to deploy |
| Vector DB | ChromaDB (local) | Zero infra for MVP, portable to cloud |
| Embeddings | OpenAI text-embedding-3-small | Best price/performance ratio |
| Reranker | HuggingFace cross-encoder | Free, runs locally, < 200ms |
| LLM | GPT-4o-mini | More than 10x cheaper per token than GPT-4, sufficient for Q&A |
| Document parsing | LangChain loaders + custom chunker | Handles PDF, Notion export, DOCX |
| Frontend widget | React + Tailwind | Embedded via iframe in client's app |
| Hosting | Railway (API) + Vercel (widget) | Ships in 10 minutes |
Total inference cost: ~$0.003 per query at production load.
Day 1: Document Ingestion Pipeline (0–24 hours)
The first bottleneck was always going to be document quality. The 200+ pages of legal documentation were not a clean dataset: they contained tables, footnotes, embedded PDFs, inconsistent heading structures, and duplicate content across Notion pages.
We built a custom chunking strategy rather than using LangChain's default RecursiveCharacterTextSplitter:
def smart_chunk(document: str, max_tokens: int = 400) -> list[str]:
    """
    Split by semantic boundaries: headings > paragraphs > sentences.
    Preserve heading context in each chunk via a sliding header stack.
    """
    # parse_blocks, token_count, and split_by_sentences are project helpers (not shown here).
    chunks = []
    header_stack = []
    for block in parse_blocks(document):
        if block.type == "heading":
            # Update header context for subsequent chunks:
            # keep the ancestors above this level, then push the new heading.
            level = block.heading_level
            header_stack = header_stack[:level - 1] + [block.text]
        elif block.type == "paragraph":
            # Skip the prefix entirely when no heading has been seen yet.
            context_prefix = " > ".join(header_stack) + "\n\n" if header_stack else ""
            chunk_text = context_prefix + block.text
            if token_count(chunk_text) <= max_tokens:
                chunks.append(chunk_text)
            else:
                # Sentence-level split with context preserved
                for sentence_chunk in split_by_sentences(block.text, max_tokens, context_prefix):
                    chunks.append(sentence_chunk)
    return chunks
The key insight: prepend the heading path to every chunk. When the user asks "what's the penalty for late filing?" and the answer is buried under Section 4 > Subsection 2 > Penalties, the chunk now contains "Section 4 > Subsection 2 > Penalties" in its text — making vector similarity much more accurate.
This took most of Day 1. We ingested ~4,200 chunks total.
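The `split_by_sentences` helper called in `smart_chunk` is not shown in the post. One plausible shape is sketched here, under two loud assumptions: a whitespace word count stands in for a real tokenizer, and a naive regex stands in for proper sentence segmentation (it will mis-split abbreviations such as "Sec. 4", which matters in legal text).

```python
import re


def token_count(text: str) -> int:
    """Rough proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())


def split_by_sentences(text: str, max_tokens: int, context_prefix: str) -> list[str]:
    """Greedily pack sentences into chunks, each starting with the heading path."""
    # Naive boundary: split after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # The prefix is repeated in every chunk, so it eats into each chunk's budget.
    budget = max_tokens - token_count(context_prefix)

    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for sentence in sentences:
        size = token_count(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence larger than the budget still gets its own chunk.
        if current and used + size > budget:
            chunks.append(context_prefix + " ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += size
    if current:
        chunks.append(context_prefix + " ".join(current))
    return chunks


prefix = "Section 4 > Penalties\n\n"
body = "Late filing incurs a fee. The fee grows daily. Appeals take ten days."
chunks = split_by_sentences(body, max_tokens=14, context_prefix=prefix)
for chunk in chunks:
    print(repr(chunk))
```

With a 14-token limit and a 4-word prefix, the first two sentences share one chunk and the third gets its own, and both chunks carry the heading path.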
Day 2: Retrieval + Reranking + LLM Prompt (24–48 hours)
Retrieval was straightforward once the chunks were clean. The reranking step is where we spent most of Day 2 tuning.
The prompt we landed on for the final LLM call:
SYSTEM_PROMPT = """
You are a precise assistant for {company_name}.
Answer ONLY using the provided context documents.
If the answer is not clearly in the context, say:
"I don't have enough information to answer that — I'll connect you with the team."
Rules:
- Cite your source document in every response: [Source: {doc_name}]
- Never speculate or add information not in the context
- Keep answers under 150 words unless the question requires more detail
- If multiple documents disagree, surface both answers and note the discrepancy
"""
The confidence scoring was the trickiest part. We built a heuristic:
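def confidence_score(query: str, retrieved_chunks: list, response: str) -> float:
    # embed and cosine_similarity are project helpers (not shown here).
    # Nothing retrieved means nothing to ground the answer in: force escalation.
    if not retrieved_chunks:
        return 0.0
    # 1. Did the reranker score the top chunk above threshold?
    # Assumes rerank_score is already normalized to the 0 to 1 range
    # (e.g. a sigmoid applied to raw cross-encoder logits).
    reranker_score = retrieved_chunks[0].rerank_score
    # 2. Did the LLM response include a source citation?
    has_citation = "[Source:" in response
    # 3. Is the top chunk semantically close to the response?
    response_embedding = embed(response)
    top_chunk_embedding = retrieved_chunks[0].embedding
    semantic_overlap = cosine_similarity(response_embedding, top_chunk_embedding)
    # Weighted sum: reranker 0.5, semantic overlap 0.4, citation bonus 0.1
    raw_score = (reranker_score * 0.5) + (semantic_overlap * 0.4) + (0.1 if has_citation else 0)
    return raw_score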
Anything below 0.70 triggered the "I'll connect you with the team" path. In production testing, this caught every hallucination we induced deliberately.
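To make the threshold concrete, here is the same weighting applied to two sets of made-up inputs. The weights and the 0.70 cutoff come from the heuristic above; the input scores are illustrative, not measurements from the project.

```python
ESCALATION_THRESHOLD = 0.70


def weighted_confidence(reranker_score: float, semantic_overlap: float, has_citation: bool) -> float:
    """Same weights as the heuristic: 0.5 reranker, 0.4 overlap, 0.1 citation bonus."""
    return reranker_score * 0.5 + semantic_overlap * 0.4 + (0.1 if has_citation else 0.0)


def should_escalate(score: float) -> bool:
    """Hand off to a human when the score falls below the threshold."""
    return score < ESCALATION_THRESHOLD


# Strong retrieval, cited answer: 0.45 + 0.32 + 0.10 = 0.87
strong = weighted_confidence(0.9, 0.8, True)
# Weak retrieval, no citation: 0.30 + 0.28 + 0.00 = 0.58
weak = weighted_confidence(0.6, 0.7, False)

print(round(strong, 2), should_escalate(strong))  # 0.87 False
print(round(weak, 2), should_escalate(weak))      # 0.58 True
```

Two properties follow from the weights: a perfect reranker score contributes only 0.5, so it cannot clear the threshold on its own, and an uncited answer tops out at 0.9, so a missing citation lowers the score without forcing escalation.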
Day 3: Frontend, Integration, Deploy (48–72 hours)
The React chat widget was 3 hours of work. The actual time sink was the client's CSP (Content Security Policy) blocking our iframe — a problem we've hit before and now check for in every project kickoff call.
We shipped to Railway in 22 minutes using a single Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
ChromaDB data was persisted to a Railway volume. Total infrastructure cost: $5/month.
What We Got Wrong (and Fixed Fast)
Mistake 1: Chunk size too large initially.
We started with 800-token chunks because we wanted more context per chunk. Retrieval precision dropped significantly — the cross-encoder had too much noise to rank. Dropping to 400 tokens and increasing top-k from 3 to 5 fixed it.
Mistake 2: No query preprocessing on Day 1.
The first version of the pipeline embedded the raw user query directly. Short queries like "late filing?" returned garbage. We added the query expansion step on Day 2 and recall improved dramatically.
Mistake 3: Overlooking document deduplication.
The Notion export had significant duplicate content across pages. Without deduplication, the same chunk appeared multiple times in top-k results — wasting context window and making citations redundant. A simple hash-based dedup before ingestion fixed it.
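A hash-based dedup pass of this kind can be very small. The sketch below is one way to do it, assuming chunks are plain strings; the whitespace and case normalization is an added assumption, since the post does not say how duplicates were matched.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """Hash a chunk after collapsing whitespace and case differences."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


pages = [
    "Late filing incurs a fee.",
    "late  filing incurs a fee.\n",  # same text, different spacing and case
    "Appeals take ten days.",
]
deduped = dedupe_chunks(pages)
print(deduped)
```

Hashing only catches exact matches after normalization. Near-duplicates, such as a paragraph copied with one sentence edited, would pass through and need a similarity-based check instead.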
Results After 2 Weeks in Production
- 68% of queries resolved by the bot without human escalation
- Average response time: 1.2 seconds end-to-end
- 0 hallucinations detected in human review of 500 sampled responses
- Support ticket volume: down 41% in the first two weeks
- Inference cost: $23 for 7,600 queries (first two weeks)
What We'd Change for a Larger Scale Version
- Switch ChromaDB → Qdrant or Pinecone for multi-tenant isolation and horizontal scaling
- Add a caching layer (Redis) for frequently asked questions — same query, instant response
- Fine-tune the embedding model on domain-specific vocabulary for legal/compliance terminology
- Streaming responses via SSE — the 1.2s wait feels long when you're used to ChatGPT's streaming UX
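The caching idea from the list above could look roughly like this. It is a sketch only: a dictionary stands in for Redis, and `AnswerCache`, `answer_with_cache`, and `fake_pipeline` are hypothetical names, not part of the shipped system.

```python
import hashlib
import time
from typing import Callable, Optional


class AnswerCache:
    """In-memory stand-in for Redis: normalized query -> (expiry time, answer)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Collapse case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.casefold().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            # Expired: drop it so stale answers are never served.
            del self._store[key]
            return None
        return answer

    def set(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (self._clock() + self._ttl, answer)


def answer_with_cache(query: str, cache: AnswerCache,
                      pipeline: Callable[[str], str]) -> str:
    """Return a cached answer if present, otherwise run the full RAG pipeline."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    answer = pipeline(query)
    cache.set(query, answer)
    return answer


calls: list[str] = []


def fake_pipeline(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"


cache = AnswerCache(ttl_seconds=60)
first = answer_with_cache("What is the late filing penalty?", cache, fake_pipeline)
second = answer_with_cache("  what is the LATE filing penalty? ", cache, fake_pipeline)
print(first == second, len(calls))
```

This only matches queries that are identical after normalization. In a real deployment it would likely make sense to skip caching for escalated or low-confidence answers and to invalidate entries when the source documents change.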
The Actual Takeaway
RAG is not a new idea. The reason most RAG implementations fail in production is not the LLM — it's the retrieval pipeline. Clean chunking, query expansion, and cross-encoder reranking are the unglamorous work that determines whether your chatbot answers correctly or confidently makes things up.
If you're building something similar, the 72-hour timeline is realistic for an MVP with clean documents. For enterprise-grade accuracy on messy document sets, budget 2–3 weeks for the ingestion pipeline alone.
We build these systems at VedX — if you're working on a RAG project and want to talk through the architecture, feel free to reach out.
Tags: #RAG #LLM #AI #Python #FastAPI #freelancing #webdev #chatbot
Author Bio
Divyanshu Purohit — Co-founder at VedX, a freelance AI and full-stack development agency. We build RAG systems, voicebots, and animated web products for startups and SMEs. Based in Jaipur, India. Building things that actually work in production.