I searched for "Fredrik Haard" and got back a conversation about exploratory testing philosophy.
Not wrong, exactly. Fredrik and I talk about testing a lot. But I wasn't looking for a conversation about testing. I was looking for Fredrik. His contact card, his Slack ID, the email thread where he suggested we move the weekly sync to Mondays. Specific things about a specific person.
The knowledge base had over a hundred thousand chunks of my digital life indexed. And it couldn't find someone by name.
That's when I realised that building a search system is its own kind of engineering problem. And it took five distinct phases of "this doesn't work, let me try something else" to get it right. Each phase was driven by a concrete failure. Each fix revealed the next problem.
This is the story of how I taught my knowledge base to search.
Phase 1: Pure vectors, pure confidence
When I first built the knowledge base (the full story is in I Gave My AI a Memory), search was simple. Every document gets turned into a vector embedding, a numerical representation of its meaning. When you search, your query gets embedded the same way, and the system finds documents whose vectors are closest in meaning.
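Under the hood, "closest in meaning" usually means cosine similarity between vectors. A toy sketch (the three-dimensional vectors here are stand-ins for real model embeddings, which run to hundreds of dimensions; the document names and numbers are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings"; real models emit hundreds of dimensions.
docs = {
    "conference-planning-notes": [0.9, 0.1, 0.0],
    "testing-philosophy-chat": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # the embedded query "conference planning"

best = max(docs, key=lambda name: cosine(query, docs[name]))
```

The query vector lands closest to the conference-planning document, so that's what comes back first.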
Think of it like a librarian who organises books by vibes. You walk in and say "I need something about conference planning" and she knows exactly which shelf to point you to. The topics, the feel of the content, the general subject matter. For broad queries, this worked great. "What do I know about conference planning?" pulled up the right notes. "Testing philosophy discussions" surfaced the right conversations.
But names aren't semantic. "Fredrik Haard" doesn't mean anything in vector space. It's just two tokens that could land anywhere relative to other tokens. Same problem with project codes, invoice numbers, technical terms. Anything where you need an exact match, not a vibe match. You wouldn't walk up to a librarian and ask for "the feeling of being named Fredrik." You'd want her to check the author index.
I had multiple Fredriks in the system. Fredrik Haard and Fredrik Thuresson. A pure vector search for "Fredrik" would rank results by how much the surrounding context resembled... what, exactly? The concept of being named Fredrik? It returned whatever document happened to be semantically closest to the query embedding. Sometimes that was the right Fredrik. Sometimes it wasn't.
I didn't discover this by accident. I'd been building an evaluation framework (a test harness for the knowledge base itself) and one of the tasks was: "What is Fredrik's Slack ID?" Simple question. The system couldn't reliably answer it.
That was the first lesson. Semantic search is great until you need something specific.
Phase 2: Adding keywords to the mix
The fix was obvious once I saw the problem. Don't replace vector search, add keyword search alongside it. Give the librarian a second tool. She already knows which shelf to point you to by topic. Now she also gets an alphabetical index, so she can look up names, codes, and specific terms directly. BM25 (the algorithm behind most traditional search engines) is that index. It finds "Fredrik Haard" by matching those exact words, weighted by how rare they are in the corpus.
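BM25 itself is compact enough to sketch. A minimal single-document scorer, assuming pre-tokenised documents and the conventional k1 and b defaults (real engines precompute the corpus statistics rather than rescanning them per query):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenised document against a query."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # docs containing the term
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rarer terms count more
        tf = doc_terms.count(term)
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    "fredrik haard suggested moving the sync".split(),
    "notes on exploratory testing philosophy".split(),
    "fredrik thuresson invoice follow up".split(),
]
scores = [bm25_score(["fredrik", "haard"], doc, corpus) for doc in corpus]
```

Note the disambiguation: "haard" appears in exactly one document, so its rarity outweighs the shared "fredrik" and the right Fredrik wins.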
So now the system runs both searches in parallel. Vector search finds documents that are semantically similar to the query. BM25 finds documents that contain the actual words. The results get combined using Reciprocal Rank Fusion, a merging algorithm that combines ranked lists by position rather than raw score. If a document ranks highly in both lists, it surfaces near the top. If it only matches one signal, it still appears but lower.
You search for "Fredrik Haard," the alphabetical index puts the right documents at the top. You search for "that conversation about exploratory testing approaches," the topic knowledge handles it even if those exact words never appear in the document. And when both signals agree, you get the most confident results.
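Reciprocal Rank Fusion is only a few lines. A sketch, with invented document IDs; k=60 is the conventional constant from the original RRF paper:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked lists by position, not raw score."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["testing-chat", "fredrik-card", "sync-email"]   # semantic ranking
keyword_hits = ["fredrik-card", "sync-email", "invoice-note"]  # BM25 ranking

fused = rrf_fuse([vector_hits, keyword_hits])
```

Because "fredrik-card" appears high in both lists, it rises to the top of the fused ranking even though neither list put it first with certainty.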
The Fredrik problem was solved. But a new one appeared immediately.
The system was now returning better results. A lot more of them. And every result was a full document. When I was using Claude Code or a local LLM, the search would return fifty relevant chunks and try to stuff them all into the context window. With local models running 4K or 8K token windows, that blew things up completely. Even with Claude's larger context, loading fifty full documents just to answer a question about one of them was wasteful.
Better recall, but overwhelming the consumer of the results.
Phase 3: The librarian trick
I kept hitting context limits, so I started thinking about how a librarian actually works. They don't read every book in the library to answer your question. They find twenty books by topic (fast, rough), then read each one carefully to decide which three actually answer what you asked (slow, precise). I know a real librarian won't read every book they pull for your question about quantum physics in the '80s, but it's a good analogy, and I stuck with it.
So I built doc cards. A compressed summary of each document containing its title, key facts, main topics, and mentioned entities. About a tenth the size of the full document. Think of them as generated index cards with the key details about each book (the full document), which the librarian can use when pulling those first twenty books. The search now works in two stages. First, scan the doc cards to find what's relevant; they're small enough to fit lots of them in a context window. Second, load the full document only for the ones that actually matter.
This was born directly from the eval framework. I had a test task where the model needed to find and synthesize information about a person from scattered sources (emails, meeting notes, Slack conversations). With full documents, the local models choked on context. With doc cards as the first stage, they could scan ten summaries, pick the three most relevant, and then load just those.
The eval scores jumped. Not because the search was returning different results, but because the consumer (the LLM) could actually use what it got.
Two-stage retrieval. It sounds like an optimisation trick. It's actually a design philosophy. Give the system just enough context to make a good decision about what to look at more closely.
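The two-stage flow fits in a few lines of Python. Everything here is illustrative: the card fields, the crude term-overlap scoring, and the loader are stand-ins for the real pipeline, not its actual code:

```python
def two_stage_search(query_terms, doc_cards, load_full_doc, top_n=3):
    """Stage 1: scan compact doc cards. Stage 2: load only the winners."""
    def card_score(card):
        # Crude relevance for the demo: how many query terms appear on the card.
        text = (card["title"] + " " + " ".join(card["entities"])).lower()
        return sum(term in text for term in query_terms)

    ranked = sorted(doc_cards, key=card_score, reverse=True)
    return [load_full_doc(card["doc_id"]) for card in ranked[:top_n]]

cards = [
    {"doc_id": "a", "title": "Weekly sync notes", "entities": ["fredrik haard"]},
    {"doc_id": "b", "title": "Invoice follow-up", "entities": ["fredrik thuresson"]},
    {"doc_id": "c", "title": "Exploratory testing", "entities": []},
]
full_docs = {"a": "full text of a", "b": "full text of b", "c": "full text of c"}

hits = two_stage_search(["fredrik", "haard"], cards, full_docs.get, top_n=1)
```

The important property is in the last line of the function: only the top-N documents are ever loaded in full, so the context window holds many small cards but few big documents.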
Phase 4: The expensive second opinion
At this point I had hybrid search producing candidates and doc cards keeping context manageable. But the ranking still felt rough sometimes. The BM25 and vector signals disagreed often enough that the merged results weren't always in the best order.
So I added a cross-encoder reranker. Back to the librarian. So far she's been finding books two ways (by topic and by index lookup) and scanning their index cards. But she hasn't actually sat down with your question and a book at the same time to judge how well it answers what you asked. That's what a cross-encoder does. It takes the query and a document together as a single input and produces a relevance score. Not "is this document about the right topic?" but "does this specific document answer this specific question?"
Much more accurate than the earlier stages. But expensive, because it has to process each query-document pair individually. You can't ask the librarian to sit down and carefully read all 68,000 books against your question every time you ask a new one. That doesn't scale.
The economics work because of the previous phases. By the time the cross-encoder sees the results, hybrid search and doc cards have already narrowed it down to maybe twenty candidates. Asking the librarian to carefully evaluate twenty books against your question? That's a Tuesday afternoon. Twenty instead of sixty-eight thousand is a completely different cost equation.
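The reranking pattern itself is simple. In this sketch, `score_pair` stands in for a real cross-encoder model (which would jointly encode the query and document); the word-overlap scorer and the example documents are stand-ins for the demo:

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Cross-encoder pattern: score every (query, document) PAIR individually.
    Cost scales with the number of pairs, which is why only the ~20
    candidates from the cheaper stages ever reach this step."""
    return sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)[:top_n]

# Stand-in scorer for the demo: word overlap. A real cross-encoder would
# jointly process the query and the document and output a relevance score.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

docs = [
    "fredrik slack channel archive",
    "notes about exploratory testing",
    "fredrik haard slack id thread",
]
top = rerank("what is fredrik haard slack id", docs, overlap, top_n=1)
```

The shape is the point: one scoring call per candidate, so the cost is linear in how many candidates you let through.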
But raw fusion scores aren't the whole picture. Not all books in the library are equally trustworthy. A curated project note I wrote deliberately is more likely to be what I'm looking for than one of 38,000 Gmail threads. So before the cross-encoder even sees the results, I adjust scores by source type. Project docs and area notes get a boost (I wrote those on purpose; they're the reference shelf). Slack messages and emails get a slight penalty (there are tens of thousands of them, and most are noise; they're the periodicals bin). And everything decays over time, because a meeting note from last week is usually more relevant than one from two years ago. Different source types decay at different rates (a project README stays relevant longer than a Slack thread).
The final score blends 60% cross-encoder with 40% of that domain-adjusted score. The cross-encoder gets the biggest vote because it's the most accurate signal. But the domain adjustments keep it honest about what kind of document you're probably looking for.
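Sketched in Python, with the caveat that the specific boost factors and half-lives below are illustrative guesses; only the 60/40 blend and the general shape (source boosts plus time decay) come from the system described above:

```python
# Illustrative numbers: only the 60/40 blend is from the actual system.
SOURCE_BOOST = {"project_doc": 1.2, "area_note": 1.15, "email": 0.9, "slack": 0.85}
HALF_LIFE_DAYS = {"project_doc": 365, "area_note": 365, "email": 90, "slack": 30}

def domain_adjusted(fusion_score, source, age_days):
    """Boost curated sources, penalise noisy ones, decay everything over time."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS[source])
    return fusion_score * SOURCE_BOOST[source] * decay

def final_score(cross_encoder_score, fusion_score, source, age_days):
    """Blend: 60% cross-encoder relevance, 40% domain-adjusted fusion score."""
    return 0.6 * cross_encoder_score + 0.4 * domain_adjusted(fusion_score, source, age_days)

# A week-old project doc outranks a six-month-old Slack message
# even when their raw search scores are identical.
fresh = final_score(0.8, 0.7, "project_doc", age_days=7)
stale = final_score(0.8, 0.7, "slack", age_days=180)
```

The exponential half-life form means a Slack thread loses half its adjusted score every 30 days in this sketch, while a project doc takes a year to do the same.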
The eval framework confirmed it. The task that asked about Fredrik? Cross-encoder reranking moved the right answer from position four to position one. On broader queries, the quality improvement was consistent enough that I could see it across multiple eval runs.
That's measurable progress. Not vibes. Scores.
Phase 5: The trust problem
And then ChromaDB corrupted my index.
I'd been using ChromaDB as the vector database from the start. It worked fine for months. Then one day, search results for a topic I knew I'd written about came back empty. Not wrong results. No results. The librarian showed up for work and books were missing from the shelves. The catalog said they existed. They didn't. Chunks had silently disappeared from the index after what looked like a process crash. The Python-FFI bridge between ChromaDB's Python wrapper and its underlying storage had a stability problem that only showed up under certain conditions.
Memory corruption in a memory system. The irony wasn't lost on me.
I migrated to LanceDB over that same week. One migration script, 117,000 chunks preserved with all their embeddings intact, zero data loss. LanceDB gave me a Rust core (no more segfaults from a Python bridge), native BM25 support (hybrid search built into the database instead of bolted on), and about a third of the storage footprint (2.2 GB down from the bloated ChromaDB files).
That migration wasn't about search quality. It was about trust. If you can't trust the storage layer, nothing you build on top matters. The fanciest reranking in the world is useless if chunks silently vanish.
That's a foundation decision. Not a feature decision.
What I actually learned
Five phases, and each one was driven by a failure I couldn't have predicted in advance. Vector search failed on names. Hybrid search overwhelmed context windows. Full documents were too expensive to process. Initial ranking wasn't precise enough. And the database itself couldn't be trusted.
I couldn't have designed this system on a whiteboard. I had to build it, test it, watch it fail, and fix the failure. The eval framework was the thing that made each phase's improvement visible and gave me confidence to ship each change. Without those test tasks (find Fredrik's Slack ID, aggregate person context, extract specific invoice numbers), I'd have been guessing about whether the search was actually getting better.
The system today handles 163,000 chunks from ten data sources. It runs three scoring stages (fusion, domain adjustment, cross-encoder reranking) and supports two-stage retrieval for context-constrained consumers. And it sits on a storage layer I trust.
But the architecture isn't the point. The point is that every piece exists because something broke. Not because I planned it, not because a best practices guide said so. Because I was using the system daily, hitting real problems, and fixing them one at a time.
That's what building a knowledge system actually looks like. Not a design doc. A trail of solved failures.
This is part of a series about building a personal knowledge base. Previously: I Gave My AI a Memory, I Removed the Friction. That Was the Problem., and I Called Them Suggestions. There Was a Reason.