<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Felipe Araújo</title>
    <description>The latest articles on DEV Community by Felipe Araújo (@felipearaujobs).</description>
    <link>https://dev.to/felipearaujobs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3959297%2F78d4f4e7-29ce-43f0-800a-cdd33d6b502e.jpeg</url>
      <title>DEV Community: Felipe Araújo</title>
      <link>https://dev.to/felipearaujobs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/felipearaujobs"/>
    <language>en</language>
    <item>
      <title>Gaussian Elimination: the algorithm hiding inside NumPy that I was doing by hand</title>
      <dc:creator>Felipe Araújo</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:06:42 +0000</pubDate>
      <link>https://dev.to/felipearaujobs/gaussian-elimination-the-algorithm-hiding-inside-numpy-that-i-was-doing-by-hand-1ahn</link>
      <guid>https://dev.to/felipearaujobs/gaussian-elimination-the-algorithm-hiding-inside-numpy-that-i-was-doing-by-hand-1ahn</guid>
      <description>&lt;p&gt;There's a specific moment in studying math that hits different as an engineer: when you realize the "academic exercise" you're grinding through is literally running inside production software you've used for years.&lt;/p&gt;

&lt;p&gt;That moment happened to me recently. I've been pivoting from backend engineering (TypeScript, NestJS, distributed systems) into AI Engineering, and I decided I wasn't going to fake my way through the math. No skipping the foundations. So I went back to Gilbert Strang's MIT 18.06 and started solving linear systems by hand. And then it clicked.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was working through a 3×3 system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x  + 2y - z  = 3
2x +  y + z  = 7
3x -  y + 2z = 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which becomes an augmented matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ 1  2 -1 | 3 ]
[ 2  1  1 | 7 ]
[ 3 -1  2 | 8 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal: zero out everything below the diagonal. Pivot by pivot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First pivot (column 1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L2 ← L2 - 2·L1  →  [ 0  -3   3 |  1 ]
L3 ← L3 - 3·L1  →  [ 0  -7   5 | -1 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Second pivot (column 2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L3 ← 3·L3 - 7·L2  →  [ 0  0  -6 | -10 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upper triangular form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ 1  2  -1 |   3 ]
[ 0 -3   3 |   1 ]
[ 0  0  -6 | -10 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back-substitution from bottom to top gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;z = 5/3,  y = 4/3,  x = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard stuff. Nothing fancy. Or so I thought.&lt;/p&gt;




&lt;h2&gt;
  
  
  The multiplier &lt;code&gt;m&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Every elimination step computes a multiplier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;element_to_zero&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pivot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So when zeroing out &lt;code&gt;L2[0]&lt;/code&gt; using &lt;code&gt;L1&lt;/code&gt; as pivot row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;L2&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;L2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;L1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For &lt;code&gt;L3[0]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;L3&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;L3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;·&lt;/span&gt;&lt;span class="n"&gt;L1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was doing this mechanically, column by column, treating each operation as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole thing. And that's when I looked at an algorithm and went quiet for a second.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is literally the code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pivot&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pivot&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact sequence I was doing by hand (pivot selection, multiplier computation, row update) is the algorithm. Not a simplification of it. Not a conceptual analogy. The actual algorithm.&lt;/p&gt;

&lt;p&gt;And when you call &lt;code&gt;np.linalg.solve(A, b)&lt;/code&gt;, you're running a production-grade and optimized version of this. The math is the same. The performance engineering around it is what makes it fast.&lt;/p&gt;
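
&lt;p&gt;To check that the loop really reproduces the hand calculation, here is a minimal runnable sketch. It assumes NumPy is installed and that every pivot is nonzero (no row swaps), and the name &lt;code&gt;gaussian_solve&lt;/code&gt; is mine, not a NumPy function. It adds the back-substitution step and compares the result with &lt;code&gt;np.linalg.solve&lt;/code&gt;.&lt;/p&gt;

```python
import numpy as np

def gaussian_solve(A, b):
    """Solve Ax = b by forward elimination and back-substitution (no pivoting)."""
    # Work on a float copy of the augmented matrix [A | b].
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    M = np.hstack([A, b])
    n = M.shape[0]

    # Forward elimination: zero out everything below each pivot.
    for pivot in range(n):
        for row in range(pivot + 1, n):
            m = M[row, pivot] / M[pivot, pivot]  # the multiplier
            M[row] = M[row] - m * M[pivot]

    # Back-substitution: solve from the last row up.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (M[i, n] - M[i, i + 1:n] @ x[i + 1:n]) / M[i, i]
    return x

A = [[1, 2, -1], [2, 1, 1], [3, -1, 2]]
b = [3, 7, 8]

x = gaussian_solve(A, b)
print(x)                      # the hand-rolled result
print(np.linalg.solve(A, b))  # NumPy's answer, for comparison
```

&lt;p&gt;Both calls should agree with the hand-derived x = 2, y = 4/3, z = 5/3 up to floating-point rounding.&lt;/p&gt;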




&lt;h2&gt;
  
  
  Where it goes from here
&lt;/h2&gt;

&lt;p&gt;NumPy doesn't literally run Gaussian Elimination in the naive textbook form. Under the hood, &lt;code&gt;np.linalg.solve&lt;/code&gt; calls LAPACK's &lt;code&gt;gesv&lt;/code&gt; routine, which computes an LU decomposition with partial pivoting: a factorization of the (row-permuted) matrix into two triangular pieces, where U is essentially what we produced with elimination and L stores the multipliers &lt;code&gt;m&lt;/code&gt; along the way.&lt;/p&gt;

&lt;p&gt;I haven't gone deep into LU yet. But understanding that the elimination I was doing by hand is the entry point to that decomposition changed how I see the abstraction. It's not magic. It's the same loop, formalized.&lt;/p&gt;
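
&lt;p&gt;As a rough illustration of that idea, the same loop can be made to record each multiplier as it goes. This is a sketch without pivoting (LAPACK also swaps rows for numerical stability, which is left out here), and &lt;code&gt;lu_no_pivot&lt;/code&gt; is a hypothetical helper name, not a NumPy function.&lt;/p&gt;

```python
import numpy as np

def lu_no_pivot(A):
    """Factor A into L @ U by elimination, storing each multiplier in L."""
    U = np.array(A, dtype=float)  # float copy that becomes upper triangular
    n = U.shape[0]
    L = np.eye(n)                 # ones on the diagonal, multipliers below it
    for pivot in range(n):
        for row in range(pivot + 1, n):
            m = U[row, pivot] / U[pivot, pivot]
            L[row, pivot] = m               # remember the multiplier
            U[row] = U[row] - m * U[pivot]  # same row update as before
    return L, U

A = np.array([[1, 2, -1], [2, 1, 1], [3, -1, 2]])
L, U = lu_no_pivot(A)
print(L)  # multipliers 2, 3 and 7/3 below the diagonal
print(U)  # upper triangular
print(np.allclose(L @ U, A))
```

&lt;p&gt;Here the last row of U comes out as [0, 0, -2] rather than the [0, 0, -6] from the hand calculation, because by hand I scaled L3 by 3 to avoid fractions; the multiplier form uses m = 7/3 instead.&lt;/p&gt;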




&lt;h2&gt;
  
  
  What this study session actually changed
&lt;/h2&gt;

&lt;p&gt;I came in thinking I was filling a gap in my math background. I came out understanding something structural: the linear algebra I'm studying isn't background knowledge for ML; it &lt;em&gt;is&lt;/em&gt; the substrate of ML.&lt;/p&gt;

&lt;p&gt;Backprop is the chain rule applied to matrix operations. Attention in transformers is matrix multiplication with a softmax. Embeddings live in vector spaces where distance and similarity are defined by inner products. The gradient descent step is a vector subtraction.&lt;/p&gt;

&lt;p&gt;When Gilbert Strang says "the key ideas of linear algebra" he's not being poetic. Those ideas are load-bearing walls in almost every ML system.&lt;/p&gt;

&lt;p&gt;I'm still early in this path, backend engineer moving into AI Engineering, currently building and studying simultaneously. But I'm increasingly convinced that the engineers who understand what's happening inside &lt;code&gt;np.linalg.solve&lt;/code&gt; will make better decisions than the ones who only know how to call it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm documenting this pivot publicly. My RAG project is live at &lt;a href="https://buscadegeloefogo.vercel.app" rel="noopener noreferrer"&gt;buscadegeloefogo.vercel.app&lt;/a&gt;, the Linear Algebra visualizer I built as a study tool is at &lt;a href="https://github.com/FelipeAraujoBS/LA-Canva-Playground" rel="noopener noreferrer"&gt;github.com/FelipeAraujoBS/LA-Canva-Playground&lt;/a&gt;. More posts incoming as I go deeper.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>learning</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a production RAG across a Book series: Retrieval, Reranking, and Hard Lessons</title>
      <dc:creator>Felipe Araújo</dc:creator>
      <pubDate>Thu, 04 Jun 2026 06:01:43 +0000</pubDate>
      <link>https://dev.to/felipearaujobs/building-a-production-rag-across-a-book-series-retrieval-reranking-and-hard-lessons-4jfa</link>
      <guid>https://dev.to/felipearaujobs/building-a-production-rag-across-a-book-series-retrieval-reranking-and-hard-lessons-4jfa</guid>
      <description>&lt;p&gt;I built a search and Q&amp;amp;A system over the entire &lt;em&gt;A Song of Ice and Fire&lt;/em&gt; series, all 10 books, ~66,000 paragraphs. The project is called &lt;strong&gt;Uma Busca de Gelo e Fogo&lt;/strong&gt;, and it's live at &lt;a href="https://buscadegeloefogo.vercel.app" rel="noopener noreferrer"&gt;buscadegeloefogo.vercel.app&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The system has two modes: a classic full-text search engine and a RAG-powered chat that lets you ask questions in natural language and get answers grounded in the actual text. This article is about the second part: the retrieval pipeline, the decisions behind it, and the embarrassing amount of time I spent fixing things that I thought were obviously correct from the start.&lt;/p&gt;




&lt;h2&gt;
  
  
  The System at a Glance
&lt;/h2&gt;

&lt;p&gt;Three independent microservices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Deploy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full-text search engine + RAG proxy&lt;/td&gt;
&lt;td&gt;Fastify + SQLite FTS5 + TypeScript&lt;/td&gt;
&lt;td&gt;Render (Docker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval + generation&lt;/td&gt;
&lt;td&gt;FastAPI + ChromaDB + Groq&lt;/td&gt;
&lt;td&gt;Hugging Face Spaces (Docker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search and chat UI&lt;/td&gt;
&lt;td&gt;Next.js + Tailwind&lt;/td&gt;
&lt;td&gt;Vercel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The backend handles lexical search and also acts as a proxy between the frontend and the RAG microservice. The RAG service lives separately because it's compute-heavy and needs to fail independently from the rest. If the RAG is down, the search engine still works. That isolation saved me more than once during development.&lt;/p&gt;

&lt;p&gt;This article focuses entirely on the RAG service.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just FTS5?
&lt;/h2&gt;

&lt;p&gt;I have a strong opinion here: people massively underestimate lexical retrieval. For a corpus this size, SQLite FTS5 with a unicode61 tokenizer is absurdly good: it handles diacritics, multi-term proximity queries via &lt;code&gt;NEAR&lt;/code&gt;, and &lt;code&gt;snippet()&lt;/code&gt; highlighting, all inside a ~50 MB file with zero infrastructure overhead. I think too many RAG projects reach for vector databases before seriously asking whether a well-configured full-text search engine would already solve their problem.&lt;/p&gt;

&lt;p&gt;For this project, it solves most of the problem. If you search for &lt;em&gt;"Dracarys"&lt;/em&gt;, FTS5 finds every relevant paragraph instantly. Filter by book, by POV character, expand context, done.&lt;/p&gt;

&lt;p&gt;But there's a hard ceiling. If you ask &lt;em&gt;"Why did Jon Snow's brothers betray him?"&lt;/em&gt;, there's no query term that maps cleanly to the relevant passages. The answer is distributed across chapters, framed in different ways, never stated explicitly in a single paragraph. FTS5 has nothing to offer there.&lt;/p&gt;

&lt;p&gt;That's the problem RAG solves. Not as a replacement, but as a complementary layer for a different class of questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retrieval Pipeline
&lt;/h2&gt;

&lt;p&gt;My first version was embarrassingly naive: embed all chunks, store in ChromaDB, cosine similarity lookup, done. It looked fine in early testing because I was asking simple questions. The moment I tried anything with indirect phrasing (questions where the answer wasn't literally stated in a single chunk), the quality collapsed. I was getting chunks that were topically adjacent but factually irrelevant, and the model was confidently synthesizing wrong answers from them.&lt;/p&gt;

&lt;p&gt;I spent longer than I'd like to admit staring at retrieval outputs before accepting that cosine similarity alone wasn't going to cut it. The pipeline I ended up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User question
  │
  ├─ 1. Dense retrieval    → bge-m3 embedding → ChromaDB (cosine, top 60)
  ├─ 2. Sparse retrieval   → BM25Okapi → top 60
  ├─ 3. Fusion             → Reciprocal Rank Fusion (K=60) → top 40
  ├─ 4. Reranking          → bge-reranker-v2-m3 (cross-encoder) → top 20
  └─ 5. Generation         → Llama 3.3 70B via Groq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dense Retrieval: bge-m3
&lt;/h3&gt;

&lt;p&gt;The embedding model is &lt;code&gt;BAAI/bge-m3&lt;/code&gt;. Multilingual support was non-negotiable — the corpus is in Portuguese, but users ask questions in English, Portuguese, and sometimes both in the same sentence. bge-m3 handles that well.&lt;/p&gt;

&lt;p&gt;One thing I only discovered after reading the BGE documentation carefully: these models support &lt;em&gt;instruction-tuned&lt;/em&gt; embeddings. For retrieval, the query should use the prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Represent this sentence for searching relevant passages: {question}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



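&lt;p&gt;This isn't cosmetic. It tells the model the embedding should be optimized for document retrieval specifically, not generic semantic similarity. I originally skipped this because it looked like boilerplate. It isn't: in this pipeline, dropping the prefix measurably degrades retrieval alignment. One caveat: the prefix comes from the earlier BGE models, and the bge-m3 model card says bge-m3 itself no longer requires a query instruction, so measure the effect on your own corpus before relying on it.&lt;/p&gt;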

&lt;h3&gt;
  
  
  Sparse Retrieval: BM25
&lt;/h3&gt;

&lt;p&gt;Dense retrieval is good at paraphrase and semantic similarity. It's bad at exact matching for rare or proper nouns. In a fantasy series, this is a serious problem. &lt;em&gt;"Casterly Rock"&lt;/em&gt;, &lt;em&gt;"Daenerys Stormborn"&lt;/em&gt;, &lt;em&gt;"R'hllor"&lt;/em&gt; — these are not concepts a bi-encoder generalizes to gracefully. BM25 handles them exactly, and at essentially zero cost.&lt;/p&gt;

&lt;p&gt;Running both in parallel covers the obvious weaknesses of each method.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fusion: Reciprocal Rank Fusion
&lt;/h3&gt;

&lt;p&gt;RRF merges two ranked lists without requiring score normalization. The formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score(doc) = Σ 1 / (K + rank(doc))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With K=60, documents ranked highly by either method get a strong boost. Documents ranked poorly by both get filtered out. The reason to use rank rather than raw score is that BM25 scores and cosine similarities live on completely different scales — you can't just add them. RRF sidesteps that entirely.&lt;/p&gt;

&lt;p&gt;I initially tried a weighted linear combination of normalized scores. It was worse and much harder to tune. RRF is simpler and more robust.&lt;/p&gt;
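
&lt;p&gt;A minimal sketch of that formula in plain Python, assuming each retriever returns a list of chunk ids ordered best-first. The function name, the ids, and the &lt;code&gt;top_n&lt;/code&gt; parameter are illustrative, not taken from the project's code.&lt;/p&gt;

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best document in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; top_n=None keeps every document.
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

dense = ["c7", "c2", "c9", "c4"]   # hypothetical ids from the vector search
sparse = ["c2", "c5", "c7", "c1"]  # hypothetical ids from BM25

fused = reciprocal_rank_fusion([dense, sparse], k=60, top_n=3)
print(fused)
```

&lt;p&gt;Chunks that both retrievers rank near the top (here c2 and c7) end up ahead of chunks that only one retriever found, and no score normalization is needed at any point.&lt;/p&gt;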

&lt;h3&gt;
  
  
  Reranking: Cross-Encoder
&lt;/h3&gt;

&lt;p&gt;The bi-encoder computes embeddings for query and document independently and compares them via cosine similarity. It's fast because you compute document embeddings once and index them. It's also a lossy approximation: there's no direct interaction between query and document tokens during scoring.&lt;/p&gt;

&lt;p&gt;A cross-encoder is different. It takes the concatenated query and document as input and scores them with full attention between both. It's meaningfully more accurate. It's also orders of magnitude slower, so you can't run it over 66,000 documents.&lt;/p&gt;

&lt;p&gt;The solution is to run it only over the top 40 candidates from RRF. At that scale it's fast enough; at corpus scale it would be unusable. The model is &lt;code&gt;BAAI/bge-reranker-v2-m3&lt;/code&gt;, the multilingual cross-encoder from the same family as bge-m3.&lt;/p&gt;

&lt;p&gt;After reranking, the top 20 chunks go into the generation prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking: Where I Lost the Most Time
&lt;/h2&gt;

&lt;p&gt;The embedding pipeline runs over ~66,000 paragraphs using a sliding window: 5 sentences per chunk, stride of 3. Adjacent chunks share 2 sentences of overlap.&lt;/p&gt;

&lt;p&gt;I did not start here. I started with fixed character splits because that's what most tutorials show, and tutorials are written to be simple, not correct. Fixed character splits routinely cut sentences in half. When your chunk ends mid-sentence, the embedding captures the beginning of a thought with no resolution, and the retrieval degrades in ways that are genuinely hard to diagnose because the chunks look fine when you print them.&lt;/p&gt;

&lt;p&gt;Switching to sentence-based splitting with NLTK's &lt;code&gt;sent_tokenize&lt;/code&gt; fixed a class of retrieval failures I had been blaming on the embedding model. That was a humbling moment.&lt;/p&gt;

&lt;p&gt;The overlapping window is there because a single sentence that answers the user's question might land exactly at the boundary of a non-overlapping chunk. Overlap reduces that risk: with a window of 5 and a stride of 3, most sentences (two out of every three, away from the edges) appear in two chunks, each time with different surrounding context. The tradeoff is redundancy, since the same content appears more than once in ChromaDB. For this corpus size, that's fine.&lt;/p&gt;
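
&lt;p&gt;The windowing itself is only a few lines. This is a simplified illustration, not the project's actual code: the sentences are hard-coded placeholders standing in for the output of &lt;code&gt;sent_tokenize&lt;/code&gt;, the function name is hypothetical, and keeping a shorter final chunk is an assumption made for the example.&lt;/p&gt;

```python
def sliding_window_chunks(sentences, window=5, stride=3):
    """Group sentences into overlapping chunks of up to `window` sentences."""
    chunks = []
    for start in range(0, len(sentences), stride):
        end = min(start + window, len(sentences))
        chunks.append(" ".join(sentences[start:end]))
        if end == len(sentences):  # this window reached the end of the text
            break
    return chunks

sentences = [f"S{i}." for i in range(1, 10)]  # nine placeholder sentences
chunks = sliding_window_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

&lt;p&gt;With nine sentences this yields three chunks: S1 to S5, S4 to S8, and S7 to S9. S4 and S5 are shared by the first two chunks, which is the 2-sentence overlap that a window of 5 and a stride of 3 produces.&lt;/p&gt;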




&lt;h2&gt;
  
  
  Prompt Engineering: The Mistake I Was Confident About
&lt;/h2&gt;

&lt;p&gt;My original system prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Answer based solely on the provided context. If you don't know, say you don't know."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is standard advice, repeated everywhere. The reasoning is sound: strict grounding prevents hallucination. In practice, it made the system look dumber than it actually was.&lt;/p&gt;

&lt;p&gt;The problem is that "answer only from context" is a retrieval quality guarantee disguised as a generation quality guarantee. If the retrieval pipeline surfaces the right chunks, it works great. If retrieval fails (wrong chunk boundaries, embedding misalignment, a question phrased in a way the model didn't handle well), the LLM sees a context that doesn't contain the answer and dutifully says &lt;em&gt;"I don't know."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I was so confident this was correct that I spent time looking for bugs in the retrieval pipeline when the real issue was that I had made the model incapable of compensating for retrieval failures. The model had relevant knowledge. I had told it to pretend otherwise.&lt;/p&gt;

&lt;p&gt;The corrected prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Use the context as your primary source. You may supplement with your own knowledge if necessary. If you use your own knowledge, say so explicitly."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model stays grounded in retrieved text, falls back gracefully when retrieval misses, and is transparent about when it does so. The contract is more honest about what the system actually guarantees.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;The system has an evaluation script that measures four metrics using LLM-as-Judge:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What fraction of retrieved chunks are actually relevant?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does the retrieved context contain enough to answer the question?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the generated answer consistent with the retrieved context?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer Relevancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does the answer actually address what was asked?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LLM-as-Judge is the right choice here because there's no ground truth corpus. These are open-ended questions about a book series, so there's no single correct answer to compute BLEU against. N-gram overlap metrics would be meaningless for this task.&lt;/p&gt;

&lt;p&gt;I'll be honest: I don't have polished benchmark numbers to share. The evaluation script exists and runs, but I've been using it more as a diagnostic tool than as a rigorous benchmark. That's on the list of things to make more systematic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fallback: When ChromaDB Is Down
&lt;/h2&gt;

&lt;p&gt;Hugging Face Spaces has cold starts. If ChromaDB is unavailable when a request comes in, the system automatically falls back to direct FTS5 queries on the SQLite database. The answer won't be LLM-generated, but the user gets relevant text instead of a 500 error.&lt;/p&gt;

&lt;p&gt;Designing this fallback in from the beginning, rather than adding it after the first production incident, is one of the few things I did in the right order.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Adaptive chunking.&lt;/strong&gt; Sliding window is a reasonable default but it ignores narrative structure entirely. A paragraph break in a fantasy novel often marks a meaningful boundary. Chunking by scene or narrative unit would likely improve context coherence more than any retrieval tweak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query expansion.&lt;/strong&gt; Some questions come in English, some in Portuguese. A translation or synonym expansion step before retrieval would help recall for cross-language queries without requiring a multilingual retrieval overhaul.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HyDE.&lt;/strong&gt; Instead of embedding the raw question, ask the LLM to generate a hypothetical passage that would answer it, then embed that. The resulting embedding is often much better aligned with the document space than the question embedding directly. I haven't implemented this yet, but I expect it would meaningfully improve retrieval for indirect or abstract questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BM25 persistence.&lt;/strong&gt; The BM25 index is rebuilt from the full corpus on every service startup. For 66,000 paragraphs it's fast, but it's unnecessary work. Persisting it would shave startup time for no real cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming.&lt;/strong&gt; The full response is returned at once. SSE streaming would make the perceived latency dramatically better for longer answers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The system is live at &lt;a href="https://buscadegeloefogo.vercel.app" rel="noopener noreferrer"&gt;buscadegeloefogo.vercel.app&lt;/a&gt;. Ask it something that requires actual reasoning across the books, not just keyword lookup, and see how the retrieval holds up.&lt;/p&gt;

&lt;p&gt;The main thing I learned building this is that RAG quality is determined by the weakest link in the pipeline, and the weakest link is usually not the LLM. It's the chunk boundaries. It's the retrieval strategy. It's the prompt contract. None of those are obvious until they're broken in production.&lt;/p&gt;

&lt;p&gt;Happy to discuss any of it in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>I built a RAG pipeline from scratch, and one wrong answer made me dive even deeper into AI Engineering</title>
      <dc:creator>Felipe Araújo</dc:creator>
      <pubDate>Sat, 30 May 2026 02:53:17 +0000</pubDate>
      <link>https://dev.to/felipearaujobs/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-deeper-into-ai-4npg</link>
      <guid>https://dev.to/felipearaujobs/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-deeper-into-ai-4npg</guid>
      <description>&lt;p&gt;A backend engineer's first step into AI Engineering: embeddings, vector search, and the chunking bug that made everything click.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I decided to pivot toward AI Engineering
&lt;/h2&gt;

&lt;p&gt;I have been a backend engineer for a while now: TypeScript, NestJS, distributed systems, APIs in production. I like that work. But at some point I started paying attention to a specific career trajectory I came across: someone with a background almost identical to mine who had moved into AI Engineering. Not abandoned backend, extended it.&lt;/p&gt;

&lt;p&gt;That reframed everything for me. This wasn't a pivot away from what I knew. It was a direction to grow into. And I decided to start from the fundamentals, not from the tooling.&lt;/p&gt;

&lt;p&gt;So instead of installing LangChain and following a tutorial, I built a RAG pipeline from scratch, no abstractions, no magic. Just Python, the Gemini API, and ChromaDB. Here is what I learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  What RAG actually is
&lt;/h2&gt;

&lt;p&gt;Before writing a line of code, I needed a mental model that made sense to me as an engineer.&lt;/p&gt;

&lt;p&gt;RAG stands for Retrieval-Augmented Generation. The idea is simple: LLMs have frozen knowledge (their training cutoff) and a limited context window. You cannot feed an entire codebase or document library into a single prompt. RAG solves this by fetching only the relevant fragments at query time and injecting them into the context before the LLM responds.&lt;/p&gt;

&lt;p&gt;Think of it as hiring a brilliant consultant who knows nothing about your company. Instead of retraining them from scratch, you hand them the relevant documents before each meeting. That is RAG.&lt;/p&gt;

&lt;p&gt;The pipeline has two phases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INDEXING (runs once):
Document → chunking → embeddings → vector database

QUERYING (runs on every question):
Question → embedding → similarity search → top K chunks → LLM → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Embeddings: meaning as coordinates
&lt;/h2&gt;

&lt;p&gt;The concept that unlocked everything for me was embeddings. An embedding is a vector, nothing more than a list of numbers, that represents the semantic meaning of a piece of text. Similar meanings produce similar vectors. Dissimilar meanings produce distant vectors.&lt;/p&gt;

&lt;p&gt;This is not keyword matching. It is geometry. When you search a vector database, you are finding the nearest neighbors in a high-dimensional space. A question about "payment processing failures" can match a chunk that talks about "error handling in transactions", even if they share no words.&lt;/p&gt;
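
&lt;p&gt;To make the geometry concrete, here is a toy nearest-neighbor search using cosine similarity. The 3-dimensional vectors are invented for illustration (real embeddings come from a model and have far more dimensions), and the helper name is hypothetical.&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of `matrix`."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

# Toy "embeddings", made up by hand rather than produced by a model.
chunks = [
    "error handling in transactions",
    "office holiday schedule",
    "retry logic for failed charges",
]
vectors = np.array([
    [0.9, 0.1, 0.2],
    [0.1, 0.9, 0.1],
    [0.8, 0.2, 0.3],
])
query_vector = np.array([0.85, 0.1, 0.25])  # stands in for "payment processing failures"

scores = cosine_similarity(query_vector, vectors)
top_k = np.argsort(scores)[::-1][:2]  # indices of the two nearest chunks
for i in top_k:
    print(chunks[i], round(float(scores[i]), 3))
```

&lt;p&gt;In this toy setup the two transaction-related chunks score far higher than the unrelated one, even though neither shares a word with the query. A vector database does the same comparison, just with an index that avoids scanning every vector.&lt;/p&gt;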

&lt;p&gt;The model learned these relationships from co-occurrence patterns across billions of sentences. It never "saw" what a dog looks like, but it learned that "dog" and "cat" appear in similar contexts (pet care articles, veterinary advice, adoption stories), while "car" appears in entirely different ones. That contrast is encoded into their vector coordinates: dog and cat end up geometrically close, car ends up far away.&lt;/p&gt;

&lt;p&gt;In my project, each chunk was embedded with gemini-embedding-001, which produced a vector with 3072 dimensions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rag-project/
├── src/
│   ├── chunking.py      &lt;span class="c"&gt;# text splitting logic&lt;/span&gt;
│   ├── embeddings.py    &lt;span class="c"&gt;# embedding generation via Gemini API&lt;/span&gt;
│   ├── vector_store.py  &lt;span class="c"&gt;# ChromaDB setup&lt;/span&gt;
│   └── llm.py           &lt;span class="c"&gt;# prompt construction and response generation&lt;/span&gt;
├── main.py              &lt;span class="c"&gt;# orchestrates the full pipeline&lt;/span&gt;
└── .env                 &lt;span class="c"&gt;# API keys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module exports only functions. No logic runs on import. main.py is the only place that decides what executes and in what order.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking: the step most tutorials skip
&lt;/h2&gt;

&lt;p&gt;Chunking is dividing your document into fragments before generating embeddings. The size matters more than I expected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
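&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Guard: the window must move forward, so overlap has to be smaller than chunk_size.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"overlap must be non-negative and smaller than chunk_size"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# text is fully covered; skip a tail chunk made only of overlap&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;  &lt;span class="c1"&gt;# step back so neighboring chunks share context&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;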
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
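
&lt;p&gt;To see what the overlap does, here is a tiny sketch that writes the sliding window as a stride of chunk_size minus overlap. The name sliding_chunks and the small numbers are assumptions picked so the output fits on screen; the parameters are keyword arguments with no defaults on purpose.&lt;/p&gt;

```python
def sliding_chunks(text, chunk_size, overlap):
    """Character windows of chunk_size, each starting (chunk_size - overlap) after the previous one."""
    step = chunk_size - overlap  # must be positive, otherwise range() cannot advance
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

&lt;p&gt;Each chunk repeats the last three characters of the previous one ("hij", "opq", "vwx"). That shared tail is what gives a sentence cut at one boundary a chance to appear whole in the next chunk.&lt;/p&gt;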






&lt;h2&gt;
  
  
  The bug that taught me the most
&lt;/h2&gt;

&lt;p&gt;I asked the system, in Portuguese: "O que são controllers no NestJS?" ("What are controllers in NestJS?")&lt;/p&gt;

&lt;p&gt;The response, also in Portuguese: "Não sabe." ("Does not know.")&lt;/p&gt;

&lt;p&gt;The LLM was Gemini. Gemini absolutely knows what NestJS controllers are. But I had explicitly instructed it to answer only from the provided context, so when the retrieved context was broken, it answered honestly that it did not know.&lt;/p&gt;
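
&lt;p&gt;A context-only instruction of this kind can be sketched as a small prompt builder. The wording of the instruction, the build_prompt name and the sample chunk are all assumptions for illustration, not the exact prompt from my project.&lt;/p&gt;

```python
def build_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Illustrative chunk; in a real run this comes from the similarity search.
prompt = build_prompt(
    "What are controllers in NestJS?",
    ["Controllers handle incoming requests and return responses."],
)
print(prompt)  # printing the final prompt is the quickest way to inspect the context
```

&lt;p&gt;With an instruction like this, the model's answer can only be as good as the chunks placed under "Context".&lt;/p&gt;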

&lt;p&gt;I inspected the context being sent to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Controllers no NestJS são responsáveis  os controllers via injeção de dependência. ("Controllers in NestJS are responsible the controllers via dependency injection.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chunk had been cut in the middle of a sentence. The fix was increasing the chunk size from 200 to 400 characters. The system then answered correctly.&lt;/p&gt;

&lt;p&gt;This is the failure mode that matters in production RAG. The pipeline does not crash. It runs perfectly and produces a wrong answer. The actual problem was upstream, in the chunking strategy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Chunk size directly affects answer quality. Too small: the embedding captures a fragment without enough semantic content. Too large: the embedding averages over too much content and loses specificity.&lt;/p&gt;
&lt;/blockquote&gt;
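
&lt;p&gt;One cheap way to catch the "too small" case before it reaches the LLM is to check whether a sentence you know must be retrievable survives whole in at least one chunk. The sketch below is a diagnostic under stated assumptions: fixed_chunks and survives_whole are hypothetical helpers, the sample document is made up, and the sizes are scaled down from the 200 and 400 characters in my project.&lt;/p&gt;

```python
def fixed_chunks(text, chunk_size):
    """Non-overlapping character windows, the simplest possible chunker."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def survives_whole(sentence, chunks):
    """True if at least one chunk contains the full sentence."""
    return any(sentence in chunk for chunk in chunks)


# Made-up sample document; the key sentence is 58 characters long.
key_sentence = "Controllers handle incoming requests and return responses."
document = "NestJS is a Node.js framework. " + key_sentence + " Providers are injected into them."

for size in (40, 120):
    print(size, survives_whole(key_sentence, fixed_chunks(document, size)))
# 40 False
# 120 True
```

&lt;p&gt;At 40 characters no chunk can hold the sentence, so every embedding near it describes a fragment. At 120 the sentence fits inside a single chunk and can be retrieved intact.&lt;/p&gt;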




&lt;h2&gt;
  
  
  What I understand now that I did not before
&lt;/h2&gt;

&lt;p&gt;RAG is simpler to implement than I expected. The hard part is not the code; it is the judgment. Knowing when a chunk is too small. Knowing when retrieved context is semantically close but factually irrelevant. Knowing when to restrict the LLM to context and when to let it reason freely.&lt;/p&gt;

&lt;p&gt;The libraries abstract the mechanics. The engineering is in the decisions around them.&lt;/p&gt;

&lt;p&gt;Retrieval quality determines answer quality. The LLM is the last step. If the chunks going in are wrong, no model in the world will produce a correct answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;This was a minimal implementation on purpose. The next version will index a real corpus, the parsed books of A Song of Ice and Fire, with structure-aware chunking by chapter, metadata filters by POV character and book, and conversation history for a proper chatbot experience.&lt;/p&gt;

&lt;p&gt;After that: evals. Measuring whether the system actually answers correctly at scale is what separates a working demo from a production system.&lt;/p&gt;

&lt;p&gt;If you are a backend engineer considering a move toward AI Engineering: start here. Build it without the frameworks first. The abstractions make much more sense once you know what they are hiding.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>softwareengineering</category>
      <category>python</category>
    </item>
  </channel>
</rss>
