DEV Community: Felipe Araújo

Replacing Cross-Encoder Reranking with a Weighted Hybrid Score

Felipe Araújo — Fri, 19 Jun 2026 23:44:41 +0000

My RAG pipeline had a bottleneck, and the fix turned out to be just simple math.

The problem

My retrieval pipeline for Uma Busca de Gelo e Fogo, a RAG system over the full A Song of Ice and Fire corpus (~66k paragraphs), follows a fairly standard hybrid retrieval setup:

Dense retrieval (ChromaDB, bge-m3)
    → BM25 sparse (BM25Okapi)
        → RRF fusion
            → Cross-encoder rerank (bge-reranker-v2-m3)
                → Top chunks → LLM

The cross-encoder (bge-reranker-v2-m3) was doing its job, reordering the fused candidates by genuine semantic relevance. The problem was the cost: on CPU, reranking just 10 chunks took 8.6 seconds. The full search pipeline averaged 12.57 seconds per query. That's before the LLM even starts generating a response.

In a chat interface, 12 seconds feels painfully slow. Nobody wants to wait 12 seconds for a search step.

The insight: I already had the signals, I just wasn't using them

Before the cross-encoder ever runs, the pipeline already computes three independent relevance signals for every candidate chunk:

bm25_score — lexical relevance from BM25Okapi
dense_cosine — semantic similarity from the dense retrieval step (1 − cosine distance from ChromaDB)
rrf_score — the fused rank score from Reciprocal Rank Fusion

All three were being computed, used internally for fusion, and then discarded before reranking. The cross-encoder was reading the raw chunk text and recomputing relevance from scratch — expensive and, in part, redundant with signals already sitting in memory.

So the question became: what if, instead of running a second neural model, I just combined the signals I already had?

The solution: reranking as a weighted sum

The replacement is genuinely this simple. Each signal is min-max normalized first (BM25 and cosine live on very different scales), then the three are combined into one score per chunk:

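def normalize(values):
    # Min-max scaling to [0, 1]. The original helper isn't shown, so this is an
    # assumed implementation; a constant list maps to all zeros.
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def lightweight_rerank(chunks, weights):
    # Normalize each signal across the candidate set so the weights are comparable.
    norm_bm25 = normalize([c["bm25_score"] for c in chunks])
    norm_dense = normalize([c["dense_cosine"] for c in chunks])
    norm_rrf = normalize([c["rrf_score"] for c in chunks])

    for chunk, b, d, r in zip(chunks, norm_bm25, norm_dense, norm_rrf):
        chunk["final_score"] = (
            weights["bm25"] * b +
            weights["dense"] * d +
            weights["rrf"] * r
        )

    # Highest combined score first.
    return sorted(chunks, key=lambda c: c["final_score"], reverse=True)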

Default weights: bm25=0.3, dense=0.5, rrf=0.2. No transformer forward pass. No GPU. No 1.5GB model loaded into memory. Just a weighted sum of numbers that were already sitting in the pipeline.

I kept the original cross-encoder code completely intact, gated behind an environment variable (RERANKER_MODE=cross_encoder|lightweight), so I could A/B test instead of betting the whole pipeline on a hunch.
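The switch itself can be as small as a dictionary lookup keyed on the environment variable. This is a minimal sketch, not the project's actual code: the two reranker functions are hypothetical stubs standing in for the real implementations, and defaulting to `lightweight` when the variable is unset is my assumption.

```python
import os

def rerank_with_cross_encoder(query, chunks):
    # Hypothetical stub for the original bge-reranker-v2-m3 path.
    return list(chunks)

def rerank_with_weighted_sum(query, chunks):
    # Hypothetical stub for the weighted-sum path.
    return sorted(chunks, key=lambda c: c["final_score"], reverse=True)

RERANKERS = {
    "cross_encoder": rerank_with_cross_encoder,
    "lightweight": rerank_with_weighted_sum,
}

def get_reranker(env=None):
    """Pick the reranker from RERANKER_MODE; unknown values fail loudly."""
    env = os.environ if env is None else env
    mode = env.get("RERANKER_MODE", "lightweight")  # default is an assumption
    if mode not in RERANKERS:
        raise ValueError(
            f"RERANKER_MODE must be one of {sorted(RERANKERS)}, got {mode!r}"
        )
    return RERANKERS[mode]
```

Failing on an unknown value, rather than silently falling back, keeps an A/B run from quietly measuring the wrong reranker.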

The result: ~13x faster

Metric	Cross-Encoder	Lightweight	Improvement
Reranking step (10 chunks)	8.6s	~0.01s	~860x faster
Full search pipeline	12.57s	0.96s	13.1x faster
Model RAM footprint	~1.5GB	0	—

The reranking step itself went from being the dominant cost in the pipeline to being essentially free: a handful of arithmetic operations on already-computed numbers. The full pipeline now responds in under a second instead of nearly 13.

But does the ranking still make sense?

Speed alone doesn't matter if the lightweight reranker is just putting irrelevant chunks first. So I compared the actual ordering it produces against the cross-encoder's ordering, across 18 test queries:

Metric	Value
Overlap@10	1.000
NDCG@10	0.889
MRR	0.458

Overlap@10 = 1.000 means both methods return the exact same 10 candidate chunks; the only difference is the order they're placed in. NDCG@10 = 0.889 confirms that the order is, overall, quite close to what the cross-encoder would produce. MRR = 0.458 is the more honest number: the cross-encoder's top pick isn't always the lightweight reranker's top pick, though it usually lands in the top 2-3.
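For readers who want to reproduce this kind of comparison, here is one way the three metrics can be computed when the cross-encoder ordering is treated as the reference. The exact definitions used for the numbers above aren't shown in this post, so everything below is an assumption, in particular the graded relevance in NDCG (derived from the reference rank) and the reading of MRR as "where does the reference's top pick land".

```python
import math

def overlap_at_k(reference, candidate, k=10):
    """Fraction of the reference top-k that also appears in the candidate top-k."""
    return len(set(reference[:k]) & set(candidate[:k])) / k

def reciprocal_rank(reference, candidate):
    """1 / position of the reference's top item in the candidate ordering (0 if absent)."""
    top = reference[0]
    return 1.0 / (candidate.index(top) + 1) if top in candidate else 0.0

def ndcg_at_k(reference, candidate, k=10):
    """NDCG@k with graded relevance taken from the reference rank (assumption)."""
    # Reference position 0 gets relevance k, position 1 gets k - 1, and so on.
    relevance = {doc: k - i for i, doc in enumerate(reference[:k])}

    def dcg(order):
        return sum(
            relevance.get(doc, 0) / math.log2(pos + 2)
            for pos, doc in enumerate(order[:k])
        )

    ideal = dcg(reference)
    return dcg(candidate) / ideal if ideal else 0.0

# Toy example: the same three chunks, with the top two swapped.
cross_encoder_order = ["c1", "c2", "c3"]
lightweight_order = ["c2", "c1", "c3"]
print(overlap_at_k(cross_encoder_order, lightweight_order, k=3))  # 1.0
print(reciprocal_rank(cross_encoder_order, lightweight_order))    # 0.5
```

MRR is then the mean of `reciprocal_rank` over all test queries.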

That gap matters, and I'm not going to pretend it doesn't, which is part of why this isn't the end of the story (more on that below).

Why this works at all

The honest technical reason this isn't crazy: BM25, dense cosine similarity, and RRF score are already decent relevance signals on their own; that's the whole premise of hybrid retrieval. The cross-encoder's job was to refine an already-reasonable ordering, not to find relevance from nothing. When Overlap@10 is 1.000, the heavy lifting (deciding which 10 chunks matter) was already done upstream by retrieval and fusion. The cross-encoder, in this setup, was mostly fine-tuning an ordering that was largely correct already, and a weighted sum of existing signals can approximate that fine-tuning at a fraction of the cost.

I also tried something more rigorous than guessing the weights: I used the cross-encoder's own scores as a training signal and fit a linear regression (numpy.linalg.lstsq) over the three normalized signals. The result: R² = 0.128. In plain terms, the three signals combined linearly explain only about 13% of the variance in what the cross-encoder actually scores. The cross-encoder is doing something genuinely non-linear, picking up on relationships between query and text that a weighted sum of BM25/cosine/RRF can't represent, no matter how the weights are tuned.
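The fit can be sketched as follows. This is not the project's script: the data is synthetic, the intercept column is my assumption, and `fit_linear_weights` is a hypothetical helper name. It only shows how `numpy.linalg.lstsq` and R² fit together.

```python
import numpy as np

def fit_linear_weights(signals, target):
    """Least-squares fit of target ≈ signals @ w + b; returns (w, b, r_squared)."""
    X = np.column_stack([signals, np.ones(len(signals))])  # add intercept column
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    predicted = X @ coef
    ss_res = np.sum((target - predicted) ** 2)       # unexplained variation
    ss_tot = np.sum((target - target.mean()) ** 2)   # total variation
    return coef[:-1], coef[-1], 1.0 - ss_res / ss_tot

# Synthetic data only: 200 fake candidates with three normalized signals each.
rng = np.random.default_rng(0)
signals = rng.random((200, 3))  # columns: bm25, dense, rrf
# A deliberately non-linear target, standing in for cross-encoder scores.
target = np.sin(6 * signals[:, 1]) * signals[:, 0] + 0.1 * rng.standard_normal(200)

weights, intercept, r2 = fit_linear_weights(signals, target)
print(weights, intercept, round(r2, 3))
```

A low R² here measures how well a linear model reproduces the target scores, which is a stricter test than reproducing the target ordering.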

That's a useful negative result. It suggests that tuning the weights further won't buy much: no linear combination of these three signals reproduces the cross-encoder's scores well. If I want to close the remaining gap with the cross-encoder, a linear combination of these three signals isn't the path.

What this means in practice

For my use case, a chat interface where someone asks questions about ASOIAF lore and expects a fast, conversational response, this trade was worth it. Going from 12.57s to 0.96s per search step is the difference between "usable in a live chat" and "noticeably broken." And the lightweight reranker isn't reordering chunks randomly; it's a reasonable approximation that gets the same candidate set, mostly in the same order.

What I'm not claiming: that this is a drop-in replacement for a cross-encoder in every RAG system. If your bottleneck isn't latency, or if you need maximum precision regardless of cost, the cross-encoder is still doing real work that a weighted sum can't fully replicate (that R²=0.128 result makes that explicit).

What's next

This change was isolated to the reranking step specifically so I could measure it cleanly. But the lightweight reranker isn't the end of the optimization work. It's a starting point for a few things I'm actively testing now:

Switching the generation model. I'm currently experimenting with Qwen3.6 instead of Llama 3.3 70B for the generation step, to see if it handles the retrieved context more reliably.
Re-embedding with optimized chunking. I'm rebuilding the corpus embeddings with a revised chunking strategy, which should change the quality of what dense retrieval surfaces in the first place, upstream of anything the reranker does.
Prompt adjustments for the generation step, to make sure the LLM anchors its answers more tightly to the retrieved context, independent of which reranker is feeding it chunks.

Each of those is a separate variable, and I'm keeping them isolated rather than changing everything at once — which is exactly how I caught the reranker's actual impact in the first place. The next article will cover what happens when those land.

I'm doing this rebuild in public. The project is live at buscadegeloefogo.vercel.app, and the source is on GitHub.

Gaussian Elimination: the algorithm hiding inside NumPy that I was doing by hand

Felipe Araújo — Thu, 18 Jun 2026 13:06:42 +0000

There's a specific moment in studying math that hits different as an engineer: when you realize the "academic exercise" you're grinding through is literally running inside production software you've used for years.

That moment happened to me recently. I've been pivoting from backend engineering (TypeScript, NestJS, distributed systems) into AI Engineering, and I decided I wasn't going to fake my way through the math. No skipping the foundations. So I went back to Gilbert Strang's MIT 18.06 and started solving linear systems by hand. And then it clicked.

The Setup

I was working through a 3×3 system:

x  + 2y - z  = 3
2x +  y + z  = 7
3x -  y + 2z = 8

Which becomes an augmented matrix:

[ 1  2 -1 | 3 ]
[ 2  1  1 | 7 ]
[ 3 -1  2 | 8 ]

The goal: zero out everything below the diagonal. Pivot by pivot.

First pivot (column 1):

L2 ← L2 - 2·L1  →  [ 0  -3   3 |  1 ]
L3 ← L3 - 3·L1  →  [ 0  -7   5 | -1 ]

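Second pivot (column 2), scaling L3 by 3 first to avoid fractions, which is the same elimination as using the multiplier 7/3 and then multiplying the row by 3: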

L3 ← 3·L3 - 7·L2  →  [ 0  0  -6 | -10 ]

Upper triangular form:

[ 1  2  -1 |   3 ]
[ 0 -3   3 |   1 ]
[ 0  0  -6 | -10 ]

Back-substitution from bottom to top gives:

z = 5/3,  y = 4/3,  x = 2
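The back-substitution step can be checked mechanically. Below is a small sketch using exact fractions so the result matches the hand computation; `back_substitute` is a name I made up for illustration.

```python
from fractions import Fraction

def back_substitute(U):
    """Solve an upper-triangular augmented matrix [U | c] from the last row up."""
    n = len(U)
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        # Subtract the already-solved unknowns, then divide by the pivot.
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (Fraction(U[i][n]) - known) / U[i][i]
    return x

upper = [
    [1, 2, -1, 3],
    [0, -3, 3, 1],
    [0, 0, -6, -10],
]
x, y, z = back_substitute(upper)
print(x, y, z)  # 2 4/3 5/3
```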

Standard stuff. Nothing fancy. Or so I thought.

The multiplier `m`

Every elimination step computes a multiplier:

m = element_to_zero / pivot

So when zeroing out L2[0] using L1 as pivot row:

m = 2/1 = 2  →  L2 ← L2 - 2·L1

For L3[0]:

m = 3/1 = 3  →  L3 ← L3 - 3·L1

I was doing this mechanically, column by column, treating each operation as:

A[i][j] = A[i][j] - m * A[pivot][j]

That's it. That's the whole thing. And that's when I looked at an algorithm and went quiet for a second.

This is literally the code

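import numpy as np

# Augmented matrix [A | b] from the system above, as floats so rows can be updated.
A = np.array([[1, 2, -1, 3], [2, 1, 1, 7], [3, -1, 2, 8]], dtype=float)
n = A.shape[0]

for pivot in range(n):
    for row in range(pivot + 1, n):
        m = A[row, pivot] / A[pivot, pivot]  # multiplier; assumes a nonzero pivot
        A[row] = A[row] - m * A[pivot]       # row update across the whole row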

The exact sequence I was doing by hand (pivot selection, multiplier computation, row update) is the algorithm. Not a simplification of it. Not a conceptual analogy. The actual algorithm.

And when you call np.linalg.solve(A, b), you're running a production-grade and optimized version of this. The math is the same. The performance engineering around it is what makes it fast.

Where it goes from here

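NumPy doesn't literally run Gaussian elimination in the naive textbook form. np.linalg.solve hands the system to LAPACK's gesv routine, which computes an LU decomposition with partial pivoting (row swaps for numerical stability): a factorization of the matrix into two triangular pieces, where U is essentially what we produced with elimination and L stores the multipliers m along the way. (L here is the lower-triangular matrix, not the row labels used above.)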

I haven't gone deep into LU yet. But understanding that the elimination I was doing by hand is the entry point to that decomposition changed how I see the abstraction. It's not magic. It's the same loop, formalized.
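To see the "L stores the multipliers" idea concretely, here is a sketch of LU without row swaps on the same 3×3 matrix, using exact fractions. It is the textbook Doolittle form, not what LAPACK runs (which pivots), and `lu_no_pivot` is a hypothetical name.

```python
from fractions import Fraction

def lu_no_pivot(A):
    """Doolittle LU without row swaps: returns (L, U) with A = L·U.

    Assumes every pivot is nonzero; production code pivots for stability.
    """
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for pivot in range(n):
        for row in range(pivot + 1, n):
            m = U[row][pivot] / U[pivot][pivot]
            L[row][pivot] = m  # the multiplier is exactly what L records
            U[row] = [u - m * p for u, p in zip(U[row], U[pivot])]
    return L, U

A = [[1, 2, -1], [2, 1, 1], [3, -1, 2]]
L, U = lu_no_pivot(A)
print([[str(v) for v in row] for row in L])
# [['1', '0', '0'], ['2', '1', '0'], ['3', '7/3', '1']]
```

The entries below the diagonal of L are the multipliers 2, 3, and 7/3 from the elimination steps.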

What this study session actually changed

I came in thinking I was filling a gap in my math background. I came out understanding something structural: the linear algebra I'm studying isn't background knowledge for ML; it is the substrate of ML.

Backprop is the chain rule applied to matrix operations. Attention in transformers is matrix multiplication with a softmax. Embeddings live in vector spaces where distance and similarity are defined by inner products. The gradient descent step is a vector subtraction.

When Gilbert Strang says "the key ideas of linear algebra" he's not being poetic. Those ideas are load-bearing walls in almost every ML system.

I'm still early in this path, backend engineer moving into AI Engineering, currently building and studying simultaneously. But I'm increasingly convinced that the engineers who understand what's happening inside np.linalg.solve will make better decisions than the ones who only know how to call it.

I'm documenting this pivot publicly. My RAG project is live at buscadegeloefogo.vercel.app, and the Linear Algebra visualizer I built as a study tool is at github.com/FelipeAraujoBS/LA-Canva-Playground. More posts incoming as I go deeper.

Building a production RAG across a book series: Retrieval, Reranking, and Hard Lessons

Felipe Araújo — Thu, 04 Jun 2026 06:01:43 +0000

I built a search and Q&A system over the entire A Song of Ice and Fire series, all 10 books, ~66,000 paragraphs. The project is called Uma Busca de Gelo e Fogo, and it's live at buscadegeloefogo.vercel.app.

The system has two modes: a classic full-text search engine and a RAG-powered chat that lets you ask questions in natural language and get answers grounded in the actual text. This article is about the second part, the retrieval pipeline, the decisions behind it, and the embarrassing amount of time I spent fixing things that I thought were obviously correct from the start.

The System at a Glance

Three independent microservices:

Component	Role	Stack	Deploy
Backend	Full-text search engine + RAG proxy	Fastify + SQLite FTS5 + TypeScript	Render (Docker)
RAG	Retrieval + generation	FastAPI + ChromaDB + Groq	Hugging Face Spaces (Docker)
Frontend	Search and chat UI	Next.js + Tailwind	Vercel

The backend handles lexical search and also acts as a proxy between the frontend and the RAG microservice. The RAG service lives separately because it's compute-heavy and needs to fail independently from the rest. If the RAG is down, the search engine still works. That isolation saved me more than once during development.

This article focuses entirely on the RAG service.

Why Not Just FTS5?

I have a strong opinion here: people massively underestimate lexical retrieval. For a corpus this size, SQLite FTS5 with a unicode61 tokenizer is absurdly good: it handles diacritics, multi-term proximity queries via NEAR, and snippet() highlighting, all inside a ~50MB file with zero infrastructure overhead. I think too many RAG projects reach for vector databases before seriously asking whether a well-configured full-text search engine would already solve their problem.

For this project, it solves most of the problem. If you search for "Dracarys", FTS5 finds every relevant paragraph instantly. Filter by book, by POV character, expand context, done.

But there's a hard ceiling. If you ask "Why did Jon Snow's brothers betray him?", there's no query term that maps cleanly to the relevant passages. The answer is distributed across chapters, framed in different ways, never stated explicitly in a single paragraph. FTS5 has nothing to offer there.

That's the problem RAG solves. Not as a replacement, but as a complementary layer for a different class of questions.

The Retrieval Pipeline

My first version was embarrassingly naive: embed all chunks, store in ChromaDB, cosine similarity lookup, done. It looked fine in early testing because I was asking simple questions. The moment I tried anything with indirect phrasing, questions where the answer wasn't literally stated in a single chunk, the quality collapsed. I was getting chunks that were topically adjacent but factually irrelevant, and the model was confidently synthesizing wrong answers from them.

I spent longer than I'd like to admit staring at retrieval outputs before accepting that cosine similarity alone wasn't going to cut it. The pipeline I ended up with:

User question
  │
  ├─ 1. Dense retrieval    → bge-m3 embedding → ChromaDB (cosine, top 60)
  ├─ 2. Sparse retrieval   → BM25Okapi → top 60
  ├─ 3. Fusion             → Reciprocal Rank Fusion (K=60) → top 40
  ├─ 4. Reranking          → bge-reranker-v2-m3 (cross-encoder) → top 20
  └─ 5. Generation         → Llama 3.3 70B via Groq

Dense Retrieval: bge-m3

The embedding model is BAAI/bge-m3. Multilingual support was non-negotiable — the corpus is in Portuguese, but users ask questions in English, Portuguese, and sometimes both in the same sentence. bge-m3 handles that well.

One thing I only discovered after reading the BGE documentation carefully: these models support instruction-prefixed queries. For retrieval, the query should use the prefix:

"Represent this sentence for searching relevant passages: {question}"

This isn't cosmetic. It tells the model the embedding should be optimized for document retrieval specifically, not generic semantic similarity. I originally skipped this because it looked like boilerplate. It isn't: in my tests, dropping the prefix measurably degraded retrieval alignment. One caveat: the prefix is documented for the English BGE v1.5 models, and the bge-m3 model card says no query instruction is needed, so treat it as something to measure on your own data rather than a rule.

Sparse Retrieval: BM25

Dense retrieval is good at paraphrase and semantic similarity. It's bad at exact matching for rare or proper nouns. In a fantasy series, this is a serious problem. "Casterly Rock", "Daenerys Stormborn", "R'hllor" — these are not concepts a bi-encoder generalizes to gracefully. BM25 handles them exactly, and at essentially zero cost.

Running both in parallel lets each method cover the other's obvious weaknesses.

Fusion: Reciprocal Rank Fusion

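RRF merges two ranked lists without requiring score normalization. Each document's score is a sum over the ranked lists it appears in:

score(doc) = Σ_r 1 / (K + rank_r(doc)), summed over the ranked lists r (here: dense and BM25)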

With K=60, documents ranked highly by either method get a strong boost. Documents ranked poorly by both get filtered out. The reason to use rank rather than raw score is that BM25 scores and cosine similarities live on completely different scales — you can't just add them. RRF sidesteps that entirely.
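The formula above fits in a few lines. This is a sketch, not the service's code: it assumes 1-based ranks and treats a document missing from a list as contributing nothing from that list; the paragraph ids are invented.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse ranked lists of doc ids; ranks are 1-based, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n] if top_n is not None else fused

dense = ["p7", "p2", "p9"]   # hypothetical paragraph ids from dense retrieval
sparse = ["p2", "p4", "p7"]  # hypothetical paragraph ids from BM25
print(reciprocal_rank_fusion([dense, sparse]))  # ['p2', 'p7', 'p4', 'p9']
```

Documents found by both retrievers (p2, p7) outrank documents found by only one, regardless of how different the underlying score scales are.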

I initially tried a weighted linear combination of normalized scores. It was worse and much harder to tune. RRF is simpler and more robust.

Reranking: Cross-Encoder

The bi-encoder computes embeddings for query and document independently and compares them via cosine similarity. It's fast because you compute document embeddings once and index them. It's also a lossy approximation: there's no direct interaction between query and document tokens during scoring.

A cross-encoder is different. It takes the concatenated query and document as input and scores them with full attention between both. It's meaningfully more accurate. It's also orders of magnitude slower, so you can't run it over 66,000 documents.

The solution is to run it only over the top 40 candidates from RRF. At that scale it's fast enough; at corpus scale it would be unusable. The model is BAAI/bge-reranker-v2-m3, the multilingual cross-encoder from the same family as bge-m3.

After reranking, the top 20 chunks go into the generation prompt.

Chunking: Where I Lost the Most Time

The embedding pipeline runs over ~66,000 paragraphs using a sliding window: 5 sentences per chunk, stride of 3. Adjacent chunks share 2 sentences of overlap.

I did not start here. I started with fixed character splits because that's what most tutorials show, and tutorials are written to be simple, not correct. Fixed character splits routinely cut sentences in half. When your chunk ends mid-sentence, the embedding captures the beginning of a thought with no resolution, and the retrieval degrades in ways that are genuinely hard to diagnose because the chunks look fine when you print them.

Switching to sentence-based splitting with NLTK's sent_tokenize fixed a class of retrieval failures I had been blaming on the embedding model. That was a humbling moment.

The overlapping window is there because a single sentence that answers the user's question might land exactly at the boundary of a non-overlapping chunk. Overlap reduces that risk by ensuring each sentence appears in multiple chunks with different surrounding context. The tradeoff is redundancy: the same content appears more than once in ChromaDB. For this corpus size, that's fine.
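A sketch of the 5-sentence window with a stride of 3 described above. To keep it dependency-free, the sentence splitter is a naive regex stand-in for NLTK's sent_tokenize, and both function names are mine, not the project's.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize so the sketch has no dependencies.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sliding_window_chunks(sentences, size=5, stride=3):
    """Group sentences into windows of `size`, advancing `stride` each time."""
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the window already reached the last sentence
    return chunks

sentences = [f"S{i}." for i in range(1, 9)]  # eight dummy sentences
for chunk in sliding_window_chunks(sentences):
    print(chunk)
# S1. S2. S3. S4. S5.
# S4. S5. S6. S7. S8.
```

The two chunks share S4 and S5, which is the 2-sentence overlap that size 5 and stride 3 imply.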

Prompt Engineering: The Mistake I Was Confident About

My original system prompt:

"Answer based solely on the provided context. If you don't know, say you don't know."

This is standard advice, repeated everywhere. The reasoning is sound: strict grounding prevents hallucination. In practice, it made the system look dumber than it actually was.

The problem is that "answer only from context" is a retrieval quality guarantee disguised as a generation quality guarantee. If the retrieval pipeline surfaces the right chunks, it works great. If retrieval fails (wrong chunk boundaries, embedding misalignment, a question phrased in a way the model didn't handle well), the LLM sees a context that doesn't contain the answer and dutifully says "I don't know."

I was so confident this was correct that I spent time looking for bugs in the retrieval pipeline when the real issue was that I had made the model incapable of compensating for retrieval failures. The model had relevant knowledge. I had told it to pretend otherwise.

The corrected prompt:

"Use the context as your primary source. You may supplement with your own knowledge if necessary. If you use your own knowledge, say so explicitly."

The model stays grounded in retrieved text, falls back gracefully when retrieval misses, and is transparent about when it does so. The contract is more honest about what the system actually guarantees.

Evaluation

The system has an evaluation script that measures four metrics using LLM-as-Judge:

Metric	What it measures
Context Precision	What fraction of retrieved chunks are actually relevant?
Context Recall	Does the retrieved context contain enough to answer the question?
Faithfulness	Is the generated answer consistent with the retrieved context?
Answer Relevancy	Does the answer actually address what was asked?

LLM-as-Judge is the right choice here because there's no ground-truth answer set. These are open-ended questions about a book series; there's no single correct answer to compute BLEU against. N-gram overlap metrics would be meaningless for this task.

I'll be honest: I don't have polished benchmark numbers to share. The evaluation script exists and runs, but I've been using it more as a diagnostic tool than as a rigorous benchmark. That's on the list of things to make more systematic.

Fallback: When ChromaDB Is Down

Hugging Face Spaces has cold starts. If ChromaDB is unavailable when a request comes in, the system automatically falls back to direct FTS5 queries on the SQLite database. The answer won't be LLM-generated, but the user gets relevant text instead of a 500 error.
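The shape of that fallback, reduced to its control flow. This is a sketch under assumptions: `rag_answer` and `fts_search` are hypothetical callables injected by the caller, and which exception types count as "ChromaDB is unavailable" depends on the client, so the tuple below is only a placeholder.

```python
def answer_with_fallback(question, rag_answer, fts_search,
                         recoverable=(ConnectionError, TimeoutError)):
    """Try the RAG path first; on a recoverable failure, return raw full-text matches."""
    try:
        return {"mode": "rag", "answer": rag_answer(question)}
    except recoverable as exc:
        # No LLM-generated answer, but the user still gets relevant passages.
        return {"mode": "fts5", "passages": fts_search(question), "reason": str(exc)}

def vector_store_down(question):
    # Simulates the vector store being unreachable during a cold start.
    raise ConnectionError("vector store unavailable")

result = answer_with_fallback(
    "Who holds Casterly Rock?",
    rag_answer=vector_store_down,
    fts_search=lambda q: ["(matching paragraph text)"],
)
print(result["mode"])  # fts5
```

Returning the mode alongside the payload lets the frontend show the user that they are looking at search results, not a generated answer.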

Designing this fallback in from the beginning, rather than adding it after the first production incident, is one of the few things I did in the right order.

What I'd Do Differently

Adaptive chunking. Sliding window is a reasonable default but it ignores narrative structure entirely. A paragraph break in a fantasy novel often marks a meaningful boundary. Chunking by scene or narrative unit would likely improve context coherence more than any retrieval tweak.

Query expansion. Some questions come in English, some in Portuguese. A translation or synonym expansion step before retrieval would help recall for cross-language queries without requiring a multilingual retrieval overhaul.

HyDE. Instead of embedding the raw question, ask the LLM to generate a hypothetical passage that would answer it, then embed that. The resulting embedding is often much better aligned with the document space than the question embedding directly. I haven't implemented this yet, but I expect it would meaningfully improve retrieval for indirect or abstract questions.

BM25 persistence. The BM25 index is rebuilt from the full corpus on every service startup. For 66,000 paragraphs it's fast, but it's unnecessary work. Persisting it would shave startup time for no real cost.

Streaming. The full response is returned at once. SSE streaming would make the perceived latency dramatically better for longer answers.

Closing

The system is live at buscadegeloefogo.vercel.app. Ask it something that requires actual reasoning across the books, not just keyword lookup, and see how the retrieval holds up.

The main thing I learned building this is that RAG quality is determined by the weakest link in the pipeline, and the weakest link is usually not the LLM. It's the chunk boundaries. It's the retrieval strategy. It's the prompt contract. None of those are obvious until they're broken in production.

Happy to discuss any of it in the comments.

I built a RAG pipeline from scratch, and one wrong answer made me dive even deeper into AI Engineering

Felipe Araújo — Sat, 30 May 2026 02:53:17 +0000

A backend engineer's first step into AI Engineering: embeddings, vector search, and the chunking bug that made everything click.

Why I decided to pivot toward AI Engineering

I have been a backend engineer for a while now: TypeScript, NestJS, distributed systems, APIs in production. I like that work. But at some point I started paying attention to a specific career trajectory I came across: someone with a background almost identical to mine who had moved into AI Engineering. Not abandoned backend, extended it.

That reframed everything for me. This wasn't a pivot away from what I knew. It was a direction to grow into. And I decided to start from the fundamentals, not from the tooling.

So instead of installing LangChain and following a tutorial, I built a RAG pipeline from scratch, no abstractions, no magic. Just Python, the Gemini API, and ChromaDB. Here is what I learned.

What RAG actually is

Before writing a line of code, I needed a mental model that made sense to me as an engineer.

RAG stands for Retrieval-Augmented Generation. The idea is simple: LLMs have frozen knowledge (their training cutoff) and a limited context window. You cannot feed an entire codebase or document library into a single prompt. RAG solves this by fetching only the relevant fragments at query time and injecting them into the context before the LLM responds.

Think of it as hiring a brilliant consultant who knows nothing about your company. Instead of retraining them from scratch, you hand them the relevant documents before each meeting. That is RAG.

The pipeline has two phases:

INDEXING (runs once):
Document → chunking → embeddings → vector database

QUERYING (runs on every question):
Question → embedding → similarity search → top K chunks → LLM → answer

Embeddings: meaning as coordinates

The concept that unlocked everything for me was embeddings. An embedding is a vector, nothing more than a list of numbers, that represents the semantic meaning of a piece of text. Similar meanings produce similar vectors. Dissimilar meanings produce distant vectors.

This is not keyword matching. It is geometry. When you search a vector database, you are finding the nearest neighbors in a high-dimensional space. A question about "payment processing failures" can match a chunk that talks about "error handling in transactions", even if they share no words.

The model learned these relationships from co-occurrence patterns across billions of sentences. It never "saw" what a dog looks like, but it learned that "dog" and "cat" appear in similar contexts (pet care articles, veterinary advice, adoption stories), while "car" appears in entirely different ones. That contrast is encoded into their vector coordinates: dog and cat end up geometrically close, car ends up far away.

In my project, each chunk produced a vector with 3072 dimensions using gemini-embedding-001.
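The "geometry" is easy to see with toy numbers. The vectors below are made up and have 3 dimensions instead of 3072; the similarity measure is plain cosine similarity, which is what a nearest-neighbor lookup over normalized embeddings boils down to.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk ids whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]

# Made-up 3-dimensional vectors; real embeddings have thousands of dimensions.
chunks = {"dog": [0.9, 0.8, 0.1], "cat": [0.8, 0.9, 0.2], "car": [0.1, 0.2, 0.9]}
print(top_k([0.85, 0.85, 0.1], chunks))  # ['dog', 'cat']
```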

The architecture

rag-project/
├── src/
│   ├── chunking.py      # text splitting logic
│   ├── embeddings.py    # embedding generation via Gemini API
│   ├── vector_store.py  # ChromaDB setup
│   └── llm.py           # prompt construction and response generation
├── main.py              # orchestrates the full pipeline
└── .env                 # API keys

Each module exports only functions. No logic runs on import. main.py is the only place that decides what executes and in what order.

Chunking: the step most tutorials skip

Chunking is dividing your document into fragments before generating embeddings. The size matters more than I expected.

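def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not terminate.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoids a trailing chunk that only repeats the overlap
        start = end - overlap  # step back so adjacent chunks overlap
    return chunks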

The bug that taught me the most

I asked the system (in Portuguese): "O que são controllers no NestJS?" — "What are controllers in NestJS?"

The response (in Portuguese): "Não sabe." — "Does not know".

The LLM was Gemini. Gemini absolutely knows what NestJS controllers are. I had explicitly instructed it to answer only from the provided context — so when the context was wrong, it answered honestly that it did not know.

I inspected the context being sent to the model:

Controllers no NestJS são responsáveis  os controllers via injeção de dependência. ("Controllers in NestJS are responsible the controllers via dependency injection.")

The chunk had been cut in the middle of a sentence. The fix was increasing the chunk size from 200 to 400 characters. The system then answered correctly.

This is the failure mode that matters in production RAG. The pipeline does not crash. It runs perfectly and produces a wrong answer. The actual problem was upstream, in the chunking strategy.

Chunk size directly affects answer quality. Too small: the embedding captures a fragment without enough semantic content. Too large: the embedding averages over too much content and loses specificity.

What I understand now that I did not before

RAG is simpler to implement than I expected. The hard part is not the code; it is the judgment. Knowing when a chunk is too small. Knowing when retrieved context is semantically close but factually irrelevant. Knowing when to restrict the LLM to context and when to let it reason freely.

The libraries abstract the mechanics. The engineering is in the decisions around them.

Retrieval quality determines answer quality. The LLM is the last step. If the chunks going in are wrong, no model in the world will produce a correct answer.

What comes next

This was a minimal implementation on purpose. The next version will index a real corpus, the parsed books of A Song of Ice and Fire, with structure-aware chunking by chapter, metadata filters by POV character and book, and conversation history for a proper chatbot experience.

After that: evals. Measuring whether the system actually answers correctly at scale is what separates a working demo from a production system.

If you are a backend engineer considering a move toward AI Engineering: start here. Build it without the frameworks first. The abstractions make much more sense once you know what they are hiding.