Felipe Araújo

Posted on Jun 4

Building a production RAG across a Book series: Retrieval, Reranking, and Hard Lessons

#ai #architecture #llm #rag

I built a search and Q&A system over the entire A Song of Ice and Fire series, all 10 books, ~66,000 paragraphs. The project is called Uma Busca de Gelo e Fogo, and it's live at buscadegeloefogo.vercel.app.

The system has two modes: a classic full-text search engine and a RAG-powered chat that lets you ask questions in natural language and get answers grounded in the actual text. This article is about the second part, the retrieval pipeline, the decisions behind it, and the embarrassing amount of time I spent fixing things that I thought were obviously correct from the start.

The System at a Glance

Three independent microservices:

Component	Role	Stack	Deploy
Backend	Full-text search engine + RAG proxy	Fastify + SQLite FTS5 + TypeScript	Render (Docker)
RAG	Retrieval + generation	FastAPI + ChromaDB + Groq	Hugging Face Spaces (Docker)
Frontend	Search and chat UI	Next.js + Tailwind	Vercel

The backend handles lexical search and also acts as a proxy between the frontend and the RAG microservice. The RAG service lives separately, it's compute-heavy and needs to fail independently from the rest. If the RAG is down, the search engine still works. That isolation saved me more than once during development.

This article focuses entirely on the RAG service.

Why Not Just FTS5?

I have a strong opinion here: people massively underestimate lexical retrieval. For a corpus this size, SQLite FTS5 with a unicode61 tokenizer is absurdly good, it handles diacritics, multi-term proximity queries via NEAR, and snippet() highlighting, all inside a ~50MB file with zero infrastructure overhead. I think too many RAG projects reach for vector databases before seriously asking whether a well-configured full-text search engine would already solve their problem.

For this project, it solves most of the problem. If you search for "Dracarys", FTS5 finds every relevant paragraph instantly. Filter by book, by POV character, expand context, done.

But there's a hard ceiling. If you ask "Why did Jon Snow's brothers betray him?", there's no query term that maps cleanly to the relevant passages. The answer is distributed across chapters, framed in different ways, never stated explicitly in a single paragraph. FTS5 has nothing to offer there.

That's the problem RAG solves. Not as a replacement, as a complementary layer for a different class of questions.

The Retrieval Pipeline

My first version was embarrassingly naive: embed all chunks, store in ChromaDB, cosine similarity lookup, done. It looked fine in early testing because I was asking simple questions. The moment I tried anything with indirect phrasing, questions where the answer wasn't literally stated in a single chunk, the quality collapsed. I was getting chunks that were topically adjacent but factually irrelevant, and the model was confidently synthesizing wrong answers from them.

I spent longer than I'd like to admit staring at retrieval outputs before accepting that cosine similarity alone wasn't going to cut it. The pipeline I ended up with:

User question
  │
  ├─ 1. Dense retrieval    → bge-m3 embedding → ChromaDB (cosine, top 60)
  ├─ 2. Sparse retrieval   → BM25Okapi → top 60
  ├─ 3. Fusion             → Reciprocal Rank Fusion (K=60) → top 40
  ├─ 4. Reranking          → bge-reranker-v2-m3 (cross-encoder) → top 20
  └─ 5. Generation         → Llama 3.3 70B via Groq

Dense Retrieval: bge-m3

The embedding model is BAAI/bge-m3. Multilingual support was non-negotiable — the corpus is in Portuguese, but users ask questions in English, Portuguese, and sometimes both in the same sentence. bge-m3 handles that well.

One thing I only discovered after reading the BGE documentation carefully: these models support instruction-tuned embeddings. For retrieval, the query should use the prefix:

"Represent this sentence for searching relevant passages: {question}"

This isn't cosmetic. It tells the model the embedding should be optimized for document retrieval specifically, not generic semantic similarity. I originally skipped this because it looked like boilerplate. It isn't, dropping the prefix measurably degrades retrieval alignment.

Sparse Retrieval: BM25

Dense retrieval is good at paraphrase and semantic similarity. It's bad at exact matching for rare or proper nouns. In a fantasy series, this is a serious problem. "Casterly Rock", "Daenerys Stormborn", "R'hllor" — these are not concepts a bi-encoder generalizes to gracefully. BM25 handles them exactly, and at essentially zero cost.

Running both in parallel is covering for the obvious weaknesses of each method.

Fusion: Reciprocal Rank Fusion

RRF merges two ranked lists without requiring score normalization. The formula:

score(doc) = Σ 1 / (K + rank(doc))

With K=60, documents ranked highly by either method get a strong boost. Documents ranked poorly by both get filtered out. The reason to use rank rather than raw score is that BM25 scores and cosine similarities live on completely different scales — you can't just add them. RRF sidesteps that entirely.

I initially tried a weighted linear combination of normalized scores. It was worse and much harder to tune. RRF is simpler and more robust.

Reranking: Cross-Encoder

The bi-encoder computes embeddings for query and document independently and compares them via cosine similarity. It's fast because you compute document embeddings once and index them. It's also a lossy approximation, there's no direct interaction between query and document tokens during scoring.

A cross-encoder is different. It takes the concatenated query and document as input and scores them with full attention between both. It's meaningfully more accurate. It's also orders of magnitude slower, you can't run it over 66,000 documents.

The solution is to run it only over the top 40 candidates from RRF. At that scale it's fast enough; at corpus scale it would be unusable. The model is BAAI/bge-reranker-v2-m3, the multilingual cross-encoder from the same family as bge-m3.

After reranking, the top 20 chunks go into the generation prompt.

Chunking: Where I Lost the Most Time

The embedding pipeline runs over ~66,000 paragraphs using a sliding window: 5 sentences per chunk, stride of 3. Adjacent chunks share 2 sentences of overlap.

I did not start here. I started with fixed character splits because that's what most tutorials show, and tutorials are written to be simple, not correct. Fixed character splits routinely cut sentences in half. When your chunk ends mid-sentence, the embedding captures the beginning of a thought with no resolution, and the retrieval degrades in ways that are genuinely hard to diagnose because the chunks look fine when you print them.

Switching to sentence-based splitting with NLTK's sent_tokenize fixed a class of retrieval failures I had been blaming on the embedding model. That was a humbling moment.

The overlapping window is there because a single sentence that answers the user's question might land exactly at the boundary of a non-overlapping chunk. Overlap reduces that risk by ensuring each sentence appears in multiple chunks with different surrounding context. The tradeoff is redundancy, the same content appears more than once in ChromaDB. For this corpus size, that's fine.

Prompt Engineering: The Mistake I Was Confident About

My original system prompt:

"Answer based solely on the provided context. If you don't know, say you don't know."

This is standard advice, repeated everywhere. The reasoning is sound: strict grounding prevents hallucination. In practice, it made the system look dumber than it actually was.

The problem is that "answer only from context" is a retrieval quality guarantee disguised as a generation quality guarantee. If the retrieval pipeline surfaces the right chunks, it works great. If retrieval fails, wrong chunk boundaries, embedding misalignment, a question phrased in a way the model didn't handle well, the LLM sees a context that doesn't contain the answer and dutifully says "I don't know."

I was so confident this was correct that I spent time looking for bugs in the retrieval pipeline when the real issue was that I had made the model incapable of compensating for retrieval failures. The model had relevant knowledge. I had told it to pretend otherwise.

The corrected prompt:

"Use the context as your primary source. You may supplement with your own knowledge if necessary. If you use your own knowledge, say so explicitly."

The model stays grounded in retrieved text, falls back gracefully when retrieval misses, and is transparent about when it does so. The contract is more honest about what the system actually guarantees.

Evaluation

The system has an evaluation script that measures four metrics using LLM-as-Judge:

Metric	What it measures
Context Precision	What fraction of retrieved chunks are actually relevant?
Context Recall	Does the retrieved context contain enough to answer the question?
Faithfulness	Is the generated answer consistent with the retrieved context?
Answer Relevancy	Does the answer actually address what was asked?

LLM-as-Judge is the right choice here because there's no ground truth corpus. These are open-ended questions about a book series, there's no single correct answer to compute BLEU against. N-gram overlap metrics would be meaningless for this task.

I'll be honest: I don't have polished benchmark numbers to share. The evaluation script exists and runs, but I've been using it more as a diagnostic tool than as a rigorous benchmark. That's on the list of things to make more systematic.

Fallback: When ChromaDB Is Down

Hugging Face Spaces has cold starts. If ChromaDB is unavailable when a request comes in, the system automatically falls back to direct FTS5 queries on the SQLite database. The answer won't be LLM-generated, but the user gets relevant text instead of a 500 error.

Designing this fallback in from the beginning, rather than adding it after the first production incident, is one of the few things I did in the right order.

What I'd Do Differently

Adaptive chunking. Sliding window is a reasonable default but it ignores narrative structure entirely. A paragraph break in a fantasy novel often marks a meaningful boundary. Chunking by scene or narrative unit would likely improve context coherence more than any retrieval tweak.

Query expansion. Some questions come in English, some in Portuguese. A translation or synonym expansion step before retrieval would help recall for cross-language queries without requiring a multilingual retrieval overhaul.

HyDE. Instead of embedding the raw question, ask the LLM to generate a hypothetical passage that would answer it, then embed that. The resulting embedding is often much better aligned with the document space than the question embedding directly. I haven't implemented this yet, but I expect it would meaningfully improve retrieval for indirect or abstract questions.

BM25 persistence. The BM25 index is rebuilt from the full corpus on every service startup. For 66,000 paragraphs it's fast, but it's unnecessary work. Persisting it would shave startup time for no real cost.

Streaming. The full response is returned at once. SSE streaming would make the perceived latency dramatically better for longer answers.

Closing

The system is live at buscadegeloefogo.vercel.app. Ask it something that requires actual reasoning across the books, not just keyword lookup, and see how the retrieval holds up.

The main thing I learned building this is that RAG quality is determined by the weakest link in the pipeline, and the weakest link is usually not the LLM. It's the chunk boundaries. It's the retrieval strategy. It's the prompt contract. None of those are obvious until they're broken in production.

Happy to discuss any of it in the comments.

DEV Community