How I stopped dumping PDFs and started chatting with my documentation

#python #webdev #ai #tutorial

A few months ago I was drowning in documentation. My team had written hundreds of pages about our internal microservices, configuration guides, and deployment procedures. Great, right? Except that nobody read them. The same questions popped up in Slack every week. "How do I reset the staging DB?" "What's the syntax for that webhook?"

I tried throwing a basic search index on top of the wiki. It was terrible. People would type "reset staging database" and get back a page about resetting production credentials. Context? Gone. Synonyms? Useless.

So I did what any developer would do: I spent two weekends building a RAG (Retrieval-Augmented Generation) system from scratch. Here’s what I learned, including the dead ends that wasted my time.

The naïve approach: dump PDFs into a vector database

I started with the classic recipe: PDFs → text splitter → OpenAI embeddings → Pinecone. Simple. It worked... for one question. For everything else it returned irrelevant junk.

The problem was chunking. I used a fixed 512-token chunk size with no overlap. Sentences got chopped in half. Code blocks were ripped apart. The retrieval step returned fragments that sat close to the query in embedding space but made no sense to the LLM out of context.

What didn't work: semantic search alone

I tried switching to a larger embedding model (text-embedding-3-large) and adding metadata filters. Still not great. The issue is that a question like "How do I reset the staging DB?" has to match both an action (reset) and a target (staging DB) against the relevant procedure, and a single chunk rarely contained both.

I also experimented with sliding window overlap and larger chunk sizes (1024 tokens). That helped a bit, but then the LLM would get distracted by too much context.
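To make those knobs concrete, here is a minimal sketch of a token-window chunker with optional overlap. It is an illustration only: `window_chunks` is a hypothetical helper, and it splits on whitespace as a stand-in for a real model tokenizer. With `overlap=0` it reproduces the naive fixed-window behavior described above.

```python
def window_chunks(text, size=512, overlap=0):
    """Split text into windows of `size` tokens; consecutive windows share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    doc = "To reset the staging DB run the reset script and then restart the API pods"
    for chunk in window_chunks(doc, size=6, overlap=2):
        print(repr(chunk))
```

Even with overlap, the window boundaries ignore sentence and code-block structure, which is why this alone only helped a bit.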

What finally worked: hierarchical chunking + hybrid search

After reading a dozen blog posts and papers, I settled on a two-layer index (summaries plus fine-grained chunks) with hybrid retrieval on top:

Document-level summary: Each document gets a short summary (via LLM) stored as its own chunk.
Fine-grained chunks: Inside each document I split by logical sections (headings) and keep chunks small (256 tokens) with 50-token overlap.
Hybrid retrieval: First search over summaries using dense embeddings, then within the top documents do a combined dense + sparse (BM25) search.
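The two-stage idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production code: `two_stage_search` and its data shapes are hypothetical, the vectors are assumed to come from whatever embedding model you use, and stage 2 scores chunks with dense similarity only to keep the example short (the BM25 half is left out here).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query_vec, summaries, chunks, top_docs=3, top_chunks=5):
    """Route the query to documents via their summaries, then rank chunks inside them.

    summaries: {doc_id: summary_vector}
    chunks:    list of (chunk_id, doc_id, chunk_vector)
    """
    # Stage 1: rank documents by similarity between the query and each summary
    ranked_docs = sorted(
        summaries, key=lambda d: cosine(query_vec, summaries[d]), reverse=True
    )
    shortlist = set(ranked_docs[:top_docs])

    # Stage 2: score only the chunks that belong to shortlisted documents
    scored = [
        (chunk_id, cosine(query_vec, vec))
        for chunk_id, doc_id, vec in chunks
        if doc_id in shortlist
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_chunks]
```

The point of stage 1 is that a chunk from the wrong document never competes, even if its text happens to look similar to the query.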

Here's the core retrieval function I ended up with:

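import chromadb  # the collection passed in below is a chromadb Collection
from sentence_transformers import CrossEncoder


def _normalize(scores):
    """Min-max scale a {doc_id: score} dict to [0, 1] so both retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


class HybridRetriever:
    def __init__(self, collection, bm25_index):
        self.collection = collection  # Chroma collection
        self.bm25 = bm25_index  # must expose search(query, k) -> [(doc_id, score), ...]
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def retrieve(self, query, top_k=5):
        # dense retrieval: Chroma returns a dict of parallel lists, one inner list per query
        res = self.collection.query(query_texts=[query], n_results=top_k * 2)
        # Chroma reports distances (lower is better), so negate them to get "higher is better"
        dense = {doc_id: -dist for doc_id, dist in zip(res['ids'][0], res['distances'][0])}

        # sparse retrieval (BM25), where a higher score is already better
        sparse = dict(self.bm25.search(query, top_k * 2))

        # combine and deduplicate with a weighted sum of normalized scores
        combined = {}
        for doc_id, score in _normalize(dense).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + 0.7 * score
        for doc_id, score in _normalize(sparse).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + 0.3 * score

        # keep the best 2 * top_k fused candidates for the reranker
        candidates = sorted(combined, key=combined.get, reverse=True)[:top_k * 2]

        # fetch the chunk texts; map by id rather than relying on the order get() returns
        fetched = self.collection.get(ids=candidates) if candidates else {'ids': [], 'documents': []}
        text_by_id = dict(zip(fetched['ids'], fetched['documents']))
        candidates = [doc_id for doc_id in candidates if doc_id in text_by_id]
        if not candidates:
            return []

        # rerank with cross-encoder, then return the top_k ids
        cross_scores = self.reranker.predict([(query, text_by_id[d]) for d in candidates])
        final = sorted(zip(candidates, cross_scores), key=lambda x: x[1], reverse=True)
        return [doc_id for doc_id, _ in final[:top_k]]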

This hybrid approach finally gave me consistently relevant chunks. The cross-encoder reranker is slow, but I only run it on the top 10 candidates, so it's tolerable.
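The `bm25_index` passed to the retriever only needs a `search(query, k)` method that returns `(doc_id, score)` pairs. As a rough sketch of what could sit behind that interface, here is a tiny in-memory Okapi BM25 index. The class name `BM25Index`, the lowercase whitespace tokenizer, and the `k1`/`b` defaults are all assumptions for illustration; a real deployment would likely use a maintained library and a better tokenizer.

```python
import math
from collections import Counter


class BM25Index:
    """Minimal in-memory Okapi BM25 index (illustrative, not optimized)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # docs: {doc_id: text}; tokenization is a plain lowercase whitespace split
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
        self.length = {doc_id: sum(tf.values()) for doc_id, tf in self.tf.items()}
        self.avg_len = sum(self.length.values()) / max(len(docs), 1)
        # document frequency: in how many documents each term appears
        df = Counter(term for tf in self.tf.values() for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def search(self, query, k):
        """Return up to k (doc_id, score) pairs, best match first."""
        scores = {}
        for term in query.lower().split():
            idf = self.idf.get(term)
            if idf is None:
                continue  # term never seen in the corpus
            for doc_id, tf in self.tf.items():
                freq = tf.get(term, 0)
                if not freq:
                    continue
                # length normalization: long documents get less credit per match
                norm = 1 - self.b + self.b * self.length[doc_id] / self.avg_len
                gain = idf * freq * (self.k1 + 1) / (freq + self.k1 * norm)
                scores[doc_id] = scores.get(doc_id, 0.0) + gain
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]
```

Rare terms such as "staging" get a higher IDF weight than common ones such as "reset", which is exactly the keyword signal that embeddings tend to blur.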

Lessons learned (and trade-offs)

Chunking is the hardest part. Don't underestimate it. Logical splits (by markdown headings, by function definitions) beat any fixed token window.
Metadata is your friend. Store document title, section, and URL so you can cite sources.
Embedding models matter less than retrieval strategy. I tried OpenAI, Cohere, and local models. The difference was tiny compared to the chunking + reranking pipeline.
Never rely on vector search alone. BM25 catches exact keyword matches that embeddings miss.
Hosted solutions exist – services like Interwest Info (https://ai.interwestinfo.com/) wrap all of this into an API, but if you want to understand the internals, build your own first.
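The first two lessons fit together in one small sketch: split at headings, and attach the title, section, and URL to every chunk so the answer can cite its source. Treat this as an illustration under assumptions rather than the production splitter: `chunk_by_headings` and `cite` are hypothetical names, the input is assumed to be markdown, and long sections would still need a second pass to cut them into small windows.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_headings(markdown, title, url):
    """Split a markdown document at headings and attach citation metadata to each chunk."""
    chunks, section, body, in_fence = [], "", [], False

    def flush():
        # emit the section collected so far, skipping empty ones
        text = "\n".join(body).strip()
        if text:
            chunks.append({"title": title, "section": section, "url": url, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # inside a fence, '# comment' is code, not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            section, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


def cite(chunk):
    """Format a human-readable source line for an answer."""
    return f"{chunk['title']} > {chunk['section']} ({chunk['url']})"
```

Because code fences are tracked, a shell comment inside a code block stays attached to its procedure instead of starting a bogus section.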

What I'd do differently next time

I'd start with LangChain or LlamaIndex instead of rolling my own pipeline. They handle lots of edge cases (like splitting code blocks, handling tables) that I spent days debugging. Also, I'd invest earlier in a good evaluation set – without at least a dozen test queries with known correct sources, you'll never know if your changes are actually improving things.
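An evaluation set does not need to be fancy. As a minimal sketch, assuming each test query is paired with the id of the chunk that should come back, two standard retrieval metrics are enough to compare pipeline changes. The function names and the `(query, expected_id)` format are assumptions for illustration.

```python
def hit_rate_at_k(retrieve, eval_set, k=5):
    """Fraction of queries whose expected chunk id appears in the top-k results.

    retrieve: callable (query, top_k) -> list of chunk ids, best first
    eval_set: list of (query, expected_id) pairs
    """
    if not eval_set:
        return 0.0
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query, k)[:k])
    return hits / len(eval_set)


def mean_reciprocal_rank(retrieve, eval_set, k=5):
    """Average of 1/rank of the expected id (0 when it is missing from the top-k)."""
    if not eval_set:
        return 0.0
    total = 0.0
    for query, expected in eval_set:
        results = retrieve(query, k)[:k]
        if expected in results:
            total += 1.0 / (results.index(expected) + 1)
    return total / len(eval_set)
```

Any function with the same shape as `HybridRetriever.retrieve(query, top_k)` can be passed in as `retrieve`, so the same test queries can score the pipeline before and after each chunking or reranking change.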

The system is now running in production for our team of 20. We get about 50 questions per day, and I'm still tweaking the reranker threshold. It's not perfect – it fails on really vague questions – but it cut the repeated questions in our Slack by 70%.

What chunking strategies have you found effective for technical documentation? I'm still learning.