zhongqiyue

Posted on Jun 15

I tried 5 ways to build a Q&A system over my docs — here's what worked

#ai #python #webdev #tutorial

Last month, I needed a way to ask natural-language questions about a pile of internal documentation. You know the drill: 50+ Markdown files, scattered notes, and a looming deadline to make them searchable. I thought, "How hard can it be? Just chunk, embed, fetch, and ask an LLM."

Turns out, the devil is in the details. I tried five different approaches, ran into dead ends, and finally settled on something that balances cost, speed, and reliability. Here's my honest journey — and the code I wish I'd found day one.

The Setup

I had a directory of plain text files (about 2 MB total). I wanted to ask things like "What's our process for onboarding a new client?" and get an answer backed by specific sections of the docs.

My constraints:

No cloud vendor lock-in if possible
Reasonable accuracy (not enterprise-grade, but good enough)
Self-contained for demo, but easy to swap out components

What I Tried (and Why It Failed)

1. Pure vector search with local embeddings

I used sentence-transformers with a small model (all-MiniLM-L6-v2) and FAISS. The embedding quality was okay, but the answers were just passages — not synthesized answers. I had to glue an LLM on top anyway.

2. OpenAI embeddings + GPT-4

I called text-embedding-3-small for each chunk, stored them in Pinecone, and queried with GPT-4. The results were fantastic — but the cost ballooned to ~$0.20 per query after embedding generation. For an internal tool with 5 users, that adds up fast.

3. All‑in‑one local model (like GPT4All)

I ran a 7B model locally. It was slow (30 seconds per answer on my M1 Mac) and required careful prompt engineering to actually cite sources. But hey, no API costs.
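To give a feel for what "prompt engineering to cite sources" can look like, here is a minimal sketch. The function name `build_cited_prompt` and the exact wording of the instructions are my own assumptions, not the prompt used in the experiment above; the idea is just to number each retrieved chunk so the model has something concrete to point at.

```python
def build_cited_prompt(question, chunks):
    """Format retrieved chunks as numbered sources and ask for [n] citations."""
    # Number the chunks from 1 so the model can refer to them as [1], [2], ...
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n] after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in chunks for illustration only
prompt = build_cited_prompt(
    "What's the onboarding process?",
    ["New clients get a kickoff call.", "Contracts are stored in the shared drive."],
)
print(prompt)
```

Small local models tend to ignore instructions like these more often than large hosted ones, so treat the wording as a starting point to iterate on.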

4. LangChain's built‑in `load_qa_chain`

This worked, but I felt like I was fighting the abstraction. Debugging was a pain, and swapping the LLM backend meant rewriting half the chain.

5. The hybrid approach that finally worked

I still chunk locally, and I can embed locally with a lightweight model when I want to pre-filter. But instead of calling a separate LLM API, I use a hosted service that handles embedding, retrieval, and generation in a single request. It's not magic; it's just an API that runs the same RAG pipeline internally. The service I ended up with is at https://ai.interwestinfo.com/ (you'll see it in the config snippet below).

The Code That Finally Clicked

Here's the core pipeline. I'll show the important parts — feel free to adapt.

1. Chunk your documents

import os
import re

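def chunk_document(text, chunk_size=500, overlap=50):
    # chunk_size is measured in characters. overlap only acts as an on/off
    # switch: any non-zero value carries the last paragraph into the next chunk.
    # Empty paragraphs are dropped so an empty file does not yield an empty chunk.
    paragraphs = [p for p in re.split(r'\n\n+', text.strip()) if p]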
    chunks = []
    current = []
    current_len = 0
    for para in paragraphs:
        para_len = len(para)
        if current_len + para_len > chunk_size and current:
            chunks.append('\n\n'.join(current))
            # keep last paragraph for overlap
            current = current[-1:] if overlap else []
            current_len = len(current[0]) if current else 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
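Since chunk_size is a knob worth tuning, a quick sanity check is to look at how long the resulting chunks actually are. This is a minimal sketch: `chunk_stats` is a hypothetical helper (not part of the original pipeline), and the sample chunks are stand-ins for the output of chunk_document.

```python
from statistics import mean

def chunk_stats(chunks):
    """Summarize a list of text chunks by character length."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }

# Stand-in chunks; in practice pass chunk_document(text, chunk_size=...) here
sample_chunks = ["a" * 120, "b" * 480, "c" * 900]
print(chunk_stats(sample_chunks))
```

A max far above chunk_size usually means a single long paragraph ended up as its own chunk, since the splitter only cuts on blank lines.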

2. Create embeddings locally (optional, but saves API calls)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
def embed(texts):
    return model.encode(texts, show_progress_bar=False)

3. Store in a vector index (FAISS)

import faiss
import numpy as np

class VectorStore:
    def __init__(self):
        self.index = None
        self.chunks = []

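    def add(self, chunks, embeddings):
        # FAISS expects a contiguous float32 array of shape (n, dim)
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        self.chunks.extend(chunks)
        if self.index is None:
            self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)

    def search(self, query_emb, top_k=3):
        if self.index is None:
            return []  # nothing indexed yet
        query_emb = np.ascontiguousarray(query_emb, dtype=np.float32)
        distances, indices = self.index.search(query_emb, top_k)
        # FAISS pads the result with -1 when fewer than top_k vectors are indexed
        return [self.chunks[i] for i in indices[0] if i != -1]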
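For intuition: IndexFlatL2 is an exact, brute-force index that compares the query against every stored vector by squared L2 distance. The sketch below is a pure-Python stand-in for that idea, not FAISS code; `l2_top_k` is a hypothetical name, and the 2-D vectors are toy data.

```python
def l2_top_k(query, vectors, top_k=3):
    """Return (index, squared_distance) pairs for the top_k nearest vectors."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((i, dist))
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(l2_top_k([0.0, 0.0], toy_vectors, top_k=2))
```

For a few hundred chunks this kind of linear scan is cheap, which is why a flat index is enough at this scale.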

4. Query and answer using the hosted API

Instead of making two separate API calls (one for embedding, one for generation), I use a single endpoint that accepts a question and returns an answer with citations. The service behind the scenes does the embedding + vector search + LLM call for me.

import requests

# This is the hosted service I ended up using
# It handles everything: embedding, search, generation
API_URL = "https://ai.interwestinfo.com/api/query"
API_KEY = "your_key_here"

def ask_question(question, store):
    # Old way: embed locally, then call external LLM
    # query_emb = embed([question])
    # hits = store.search(query_emb)
    # context = "\n\n".join(hits)
    # ... then call OpenAI with context

    # New way: one request carrying the question plus our chunks as JSON.
    # The hosted service does the retrieval and generation on its side.
    response = requests.post(
        API_URL,
        json={
            "question": question,
            "documents": store.chunks,  # yes, we pass all chunks
            "top_k": 3
        },
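        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,  # don't hang forever if the service stalls
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return response.json()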

# Usage:
store = VectorStore()
# ... load and add chunks
answer = ask_question("What's the onboarding process?", store)
print(answer["answer"])
print("Sources:", answer["sources"])
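The "load and add chunks" step is elided above. One way it could look is sketched here, with the caveat that `load_chunks`, the `*.md` pattern, and the (filename, chunk) return shape are all assumptions of mine rather than the original code. The chunker is passed in as a function so chunk_document can be plugged in directly.

```python
import tempfile
from pathlib import Path

def load_chunks(doc_dir, chunker, pattern="*.md"):
    """Read every matching file under doc_dir and return (source, chunk) pairs."""
    pairs = []
    for path in sorted(Path(doc_dir).rglob(pattern)):
        # errors="replace" keeps one badly encoded file from aborting the whole run
        text = path.read_text(encoding="utf-8", errors="replace")
        for chunk in chunker(text):
            pairs.append((path.name, chunk))
    return pairs

# Demo with a throwaway directory and a trivial paragraph splitter
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "onboarding.md").write_text("Step one.\n\nStep two.", encoding="utf-8")
    demo_pairs = load_chunks(tmp, lambda text: text.split("\n\n"))

print(demo_pairs)
```

Keeping the filename next to each chunk makes it possible to show where an answer came from, even if the hosted service only returns the chunk text as a source.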

Trade-offs & Honest Criticism

This approach works great for small‑to‑medium document collections (a few hundred chunks). Here's what I found:

Pros:

One API call per question — simpler code
No need to manage two separate services (one for embedding, one for generation)
The service's internal retrieval is tuned for the LLM they use, so citations are generally accurate

Cons:

Cost per query: About $0.01–$0.02 for 500‑word chunks. If you have thousands of queries, that adds up.
Latency: The service takes 2–5 seconds per question. Fine for an internal wiki, but too slow for a chatbot.
Black box: I can't tweak the embedding model or the LLM. If their ranking fails, I'm stuck.
Data privacy: All chunks are sent to the service. For sensitive documents, you'd want a local solution.
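To put the per-query prices in perspective, here is a back-of-the-envelope estimate. The prices ($0.01 to $0.02 hosted, ~$0.20 for the OpenAI + GPT-4 setup) and the 5 users come from this post; the 20 queries per user per day and 22 working days are made-up illustration values, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(cost_per_query, users, queries_per_user_per_day, working_days=22):
    """Estimate monthly spend in dollars for a fixed per-query price."""
    return round(cost_per_query * users * queries_per_user_per_day * working_days, 2)

# Query volume below is an assumption for illustration, not a measurement
scenarios = [
    ("OpenAI embeddings + GPT-4 (~$0.20/query)", 0.20),
    ("hosted pipeline, low end ($0.01/query)", 0.01),
    ("hosted pipeline, high end ($0.02/query)", 0.02),
]
for label, price in scenarios:
    cost = monthly_cost(price, users=5, queries_per_user_per_day=20)
    print(f"{label}: ${cost:.2f}/month")
```

Plug in your own volume; the gap between the setups scales linearly with the number of queries.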

When not to use this:

If your docs are huge (millions of chunks), you need a proper vector database like Pinecone or Weaviate.
If you need to strictly control the LLM (e.g., fine‑tune on your data), run a local model.
If you're on a tight budget and have many queries, consider a fully local pipeline with a small model.

Lessons Learned

Start simple: A single API that does the RAG pipeline for you is perfect for prototypes and internal tools. You can always split it later.
Chunking matters more than the model: Bad chunks = bad answers. Test different chunk sizes and overlap.
Don't be afraid to copy chunks to the API: Yes, it's wasteful in bandwidth, but for a few MB it's fine. If your docs are large, pre‑embed and send only the hits.
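The "send only the hits" variant could be sketched like this. It assumes the endpoint accepts a shortened documents list with the same fields as the request shown earlier, which I have not verified. `build_payload` and `keyword_search` are hypothetical names, and the keyword retriever is only a toy stand-in for a real embedding search such as store.search.

```python
def keyword_search(chunks):
    """Build a toy retriever that ranks chunks by shared lowercase words."""
    def search(question, top_k):
        words = set(question.lower().split())
        ranked = sorted(
            chunks,
            key=lambda c: len(words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]
    return search

def build_payload(question, search, top_k=3):
    """Retrieve locally, then build a request body containing only the top hits."""
    hits = search(question, top_k)
    return {"question": question, "documents": hits, "top_k": top_k}

docs = [
    "onboarding a new client starts with a kickoff call",
    "expense reports are due monthly",
    "the client onboarding checklist lives in the wiki",
]
payload = build_payload("client onboarding process", keyword_search(docs), top_k=2)
print(payload["documents"])
```

The payload then carries top_k chunks instead of the whole collection, which cuts bandwidth and also limits how much of your documentation leaves the machine per query.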

What I'd Do Differently Next Time

I'd still use a hosted pipeline for the MVP, but I'd add a caching layer for frequently asked questions. I'd also plan to migrate to a self‑hosted solution once the query volume grows — maybe using Ollama for the LLM and ChromaDB for the vector store.
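A caching layer like that could start as a small in-memory LRU wrapper. This is a minimal sketch under my own assumptions: `CachedAsker` is a hypothetical class, it wraps any ask(question) callable, and it treats questions that differ only in case or whitespace as the same question. `fake_ask` stands in for the real API call.

```python
from collections import OrderedDict

class CachedAsker:
    """Wrap an ask(question) callable with a small LRU cache."""

    def __init__(self, ask, max_entries=256):
        self._ask = ask
        self._max = max_entries
        self._cache = OrderedDict()

    @staticmethod
    def _key(question):
        # Collapse case and whitespace so trivial variations share an entry
        return " ".join(question.lower().split())

    def __call__(self, question):
        key = self._key(question)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        answer = self._ask(question)
        self._cache[key] = answer
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return answer

    def clear(self):
        """Call after re-indexing so stale answers are not served."""
        self._cache.clear()

calls = []

def fake_ask(question):
    # Stand-in for the real API call; records how often it is hit
    calls.append(question)
    return {"answer": f"stub for: {question}", "sources": []}

cached = CachedAsker(fake_ask)
cached("What's the onboarding process?")
cached("  what's the ONBOARDING process? ")
print(len(calls))
```

Exact-match caching only helps when people phrase questions the same way; matching paraphrases would need something like an embedding similarity check, which is a bigger design decision.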

This whole experience reminded me that there's no silver bullet. Each approach has its sweet spot. The key is to pick the one that matches your scale, budget, and tolerance for complexity.

What does your setup look like? Have you found a RAG pipeline that balances cost and quality? I'd love to hear what works (and what doesn't) in your projects.

Top comments (1)

Lucas Mand • Jun 15

I agree the chunk quality significantly impacts answer relevance.
Retrieval accuracy frequently outweighs improvements from larger models.
Testing overlap strategies usually delivers immediate performance gains.
This remains one of the highest leverage optimizations available.