Two months ago, I was knee-deep in a project that sounded simple: build a system that could answer questions from our company’s internal documentation. We had hundreds of PDFs, Confluence pages, and READMEs. The goal was to let junior developers ask natural language questions and get accurate answers instantly.
I thought, “How hard can it be? I’ll just fine-tune a small LLM on our documents.”
Spoiler: it was that hard, and then some.
The Dead End: Fine-Tuning a Model
I spent two weeks collecting, cleaning, and chunking our documentation. I wrote a Hugging Face training script, rented a GPU, and fine-tuned a 7B parameter model. The result? A model that could recite our API docs verbatim but couldn’t answer a question like “Why does our auth flow fail for expired tokens?” without hallucinating.
Fine-tuning taught the model patterns in the text, but it didn’t give it the ability to retrieve specific facts. Plus, every time a document changed, I’d have to retrain. It was unsustainable.
Second Dead End: Pure Keyword Search
Next, I tried Elasticsearch with a BM25 scorer. I’d split documents into chunks and search for keywords from the user’s question. The problem: natural language questions don’t map well to keywords. “How do I reset my password?” would match chunks about “reset” and “password”, but miss the critical steps for multi-factor auth. Recall was terrible.
The Lightbulb: Retrieval-Augmented Generation (RAG)
After reading about RAG, I realized the solution wasn’t to train the model on my data — it was to give the model a way to look up the right data at query time. The core idea:
- Split documents into smaller chunks.
- Generate an embedding (vector) for each chunk using an embedding model.
- Store the embeddings in a vector database.
- When a question comes in, embed the question and find the most similar chunks.
- Feed those chunks + the question to an LLM to synthesize an answer.
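To make the retrieval steps concrete, here is a toy sketch with no external dependencies. The `embed` function is a made-up stand-in that just counts words. A real embedding model captures meaning rather than word overlap, so treat this only as an illustration of the "embed, compare, pick the top k" flow. All function names here are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    question_vec = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "To reset your password open the account settings page",
    "The deploy pipeline runs on every merge to main",
    "Expired tokens are rejected by the auth service",
]
top = retrieve("how do I reset my password", chunks, k=1)
print(top)
```

In the real pipeline, the vector database does the ranking for you, and the vectors come from an embedding model instead of word counts.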
I’ll walk you through a working prototype using Python, OpenAI embeddings, and ChromaDB.
Building a Minimal RAG Pipeline (code included)
Step 1: Install dependencies
pip install chromadb openai tiktoken langchain langchain-community
Step 2: Load and split documents
For this example, I’ll use a small text file. In practice, you’d use a document loader from LangChain.
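from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the raw text file as a list of LangChain Document objects
loader = TextLoader("my_docs.txt")
documents = loader.load()
# Split on paragraphs first, then lines, then sentences.
# The trailing " " and "" are fallbacks so that long runs without
# punctuation still get cut down to chunk_size.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", " ", ""],
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")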
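The overlap reduces the chance that a sentence or idea gets cut in half at a chunk boundary, though it can't guarantee it. Note that chunk_size and chunk_overlap are measured in characters here, not tokens.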
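If the overlap idea feels abstract, here is a simplified sketch of a fixed-size sliding window. This is not how RecursiveCharacterTextSplitter works internally (it splits on separators first), and `split_with_overlap` is a hypothetical helper, but it shows how neighbouring chunks end up sharing text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats
    the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the one before it, so a fact that straddles a boundary has a better chance of appearing whole in at least one chunk.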
Step 3: Embed and store in ChromaDB
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Requires the OPENAI_API_KEY environment variable to be set
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
# Persist the database so we don't re-embed every time
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
)
# Only needed on older Chroma versions (before 0.4); newer ones persist automatically
vectordb.persist()
Step 4: Query and answer
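from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOpenAI
# temperature=0 keeps the answers as repeatable as possible
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # put all retrieved chunks into a single prompt
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),  # top 4 chunks
)
question = "How do I reset my password if I'm on a VPN?"
# invoke() returns a dict; the generated answer is under the "result" key
answer = qa_chain.invoke({"query": question})
print(answer["result"])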
And that’s it. A working Q&A system in under 50 lines of code.
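For intuition about what chain_type="stuff" does, here is a rough sketch of the idea: all retrieved chunks get concatenated into one prompt alongside the question. The prompt wording and the `stuff_prompt` helper are my own assumptions for illustration. LangChain's actual default template is worded differently.

```python
def stuff_prompt(question: str, chunks: list[str]) -> str:
    """Build a single prompt that contains every retrieved chunk."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = stuff_prompt(
    "How do I reset my password if I'm on a VPN?",
    [
        "Password resets are done from the account settings page.",
        "VPN users must disconnect before changing credentials.",
    ],
)
print(prompt)
```

This is also why k and chunk size interact: the more chunks you stuff in and the larger they are, the more of the context window you spend on every query.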
Lessons Learned and Trade-offs
RAG isn’t magic — it has its own pain points:
- Chunk size matters. Too small and context is lost; too large and irrelevant info dilutes the answer. I settled on 300 to 600 tokens per chunk, depending on the document type.
- Embedding cost. For a large corpus, embedding thousands of chunks using OpenAI’s API can add up. You can switch to free models like all-MiniLM-L6-v2 from Sentence Transformers, but they’re less accurate.
- Latency. Every query requires embedding the question, searching the vector DB, and then calling the LLM. That’s 2–3 seconds on average. For real-time chat, you might need caching or a faster LLM.
- Hallucinations still happen. If the retrieved chunks are irrelevant, the LLM can still make up an answer. Adding a “confidence check” or a fallback is essential in production.
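Here is a minimal sketch of the kind of confidence check I mean for the last point. It assumes your retriever gives you (chunk, score) pairs where the score is a similarity between 0 and 1 and higher is better. Some vector stores return distances instead, in which case the comparison flips. The 0.75 threshold, the fallback message, and the `generate` callable are all placeholders you would tune or replace.

```python
FALLBACK = "I couldn't find this in the docs. Try rephrasing or ask a teammate."


def select_context(scored_chunks, min_score=0.75, max_chunks=4):
    """Keep only chunks whose similarity clears the threshold, best first."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept[:max_chunks]]


def answer_or_fallback(question, scored_chunks, generate, min_score=0.75):
    """Call the LLM only when at least one chunk looks relevant enough."""
    context = select_context(scored_chunks, min_score=min_score)
    if not context:
        return FALLBACK
    return generate(question, context)


def fake_generate(question, context):
    """Stand-in for the LLM call; a real one would send a prompt to the model."""
    return f"Answer based on {len(context)} chunk(s)"


good = answer_or_fallback(
    "How do I reset my password?",
    [("Reset steps ...", 0.91), ("Deploy notes ...", 0.42)],
    fake_generate,
)
bad = answer_or_fallback(
    "What is the office wifi password?",
    [("Deploy notes ...", 0.38)],
    fake_generate,
)
print(good)
print(bad)
```

A fixed threshold is crude, but it is cheap, and it turns "confidently wrong" into "I don't know" for the worst retrieval misses.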
What I’d Do Differently Next Time
First, I’d start with a managed service that handles the embedding and retrieval infrastructure. For example, a platform like Interwest Info AI (https://ai.interwestinfo.com/) abstracts away the vector DB and chunking strategies — you just upload documents and get an API. That would have saved me two weeks of fiddling with ChromaDB quirks and scaling issues.
Second, I’d invest more time in evaluating retrieval quality before building the RAG pipeline. Create a small test set of 20 questions and manually verify which chunks should be retrieved. That tells you if your chunking and embedding model are up to par.
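A small retrieval test like that can be scored with a few lines of Python. The sketch below computes a hit rate at k: the fraction of questions for which at least one expected chunk shows up in the top k results. The chunk IDs, the test questions, and `fake_retrieve` are invented for illustration. In practice you would plug in your real retriever and your own labelled questions.

```python
def hit_rate_at_k(test_set, retrieve, k=4):
    """test_set: list of (question, set of expected chunk ids).
    retrieve(question, k) must return a list of chunk ids."""
    hits = 0
    for question, expected_ids in test_set:
        retrieved_ids = set(retrieve(question, k))
        # Count the question as a hit if any expected chunk was retrieved
        if retrieved_ids & expected_ids:
            hits += 1
    return hits / len(test_set)


# Hypothetical labelled questions and a canned retriever for the demo
test_set = [
    ("How do I reset my password?", {"auth-3"}),
    ("How do I rotate API keys?", {"keys-1"}),
]
fake_index = {
    "How do I reset my password?": ["auth-3", "auth-1"],
    "How do I rotate API keys?": ["deploy-2", "deploy-5"],
}


def fake_retrieve(question, k):
    return fake_index[question][:k]


score = hit_rate_at_k(test_set, fake_retrieve, k=2)
print(f"hit rate@2 = {score:.2f}")
```

Re-running this after every change to chunk size or embedding model tells you whether the change helped retrieval, before the LLM gets a chance to hide the problem.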
When NOT to Use This Approach
- If you have fewer than 50 documents and can manually curate them, a simple keyword search with a bit of NLP might suffice.
- If your questions are all about structured data (e.g., “What is the value of X?”), a traditional SQL query is faster and more reliable.
- If you need answers in under 500ms, consider a hosted solution with optimized inference.
Your Turn
Building a document Q&A system from scratch taught me more about the trade-offs in retrieval than any blog post ever could. But now I’m curious: What’s your go-to approach for building a knowledge base chatbot? Are you DIY with LangChain, or do you use a SaaS platform? Let’s discuss in the comments.