<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gautam Vhavle</title>
    <description>The latest articles on DEV Community by Gautam Vhavle (@gautamvhavle).</description>
    <link>https://dev.to/gautamvhavle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F952519%2Ffd42c240-6dc4-42a4-8a3f-b423cae107de.jpeg</url>
      <title>DEV Community: Gautam Vhavle</title>
      <link>https://dev.to/gautamvhavle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gautamvhavle"/>
    <language>en</language>
    <item>
      <title>My JSON Was Too Big for My AI, So I Built an MCP Server to Fix It</title>
      <dc:creator>Gautam Vhavle</dc:creator>
      <pubDate>Mon, 16 Mar 2026 21:29:42 +0000</pubDate>
      <link>https://dev.to/gautamvhavle/my-json-was-too-big-for-my-ai-so-i-built-an-mcp-server-to-fix-it-43m2</link>
      <guid>https://dev.to/gautamvhavle/my-json-was-too-big-for-my-ai-so-i-built-an-mcp-server-to-fix-it-43m2</guid>
      <description>&lt;p&gt;It was late night. My eyes were burning. I had a JSON file staring back at me. A space missions database (example). 1620+ lines. 7 missions, each nested 5 levels deep. Personnel records inside crew arrays inside missions inside a database meta object. Spacecraft specifications with Ion-Plasma Drive stats. Payloads marked as "Class-IV Radiation Hazard." Budgets in the &lt;em&gt;billions&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It was the most beautifully structured chaos I'd ever seen.&lt;/p&gt;

&lt;p&gt;I thought, &lt;em&gt;"Let me just ask my AI to analyze this."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So I pasted the whole thing into the chat. Hit enter. Waited. And then, the message I'd been dreading:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Context limit exceeded.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I was running a small open-source LLM, and it couldn't even &lt;em&gt;see&lt;/em&gt; the whole file. And the worst part? I didn't need all of it. I just wanted one number: the total budget across all 7 missions. One number, buried 4 layers deep inside &lt;code&gt;missions[*].budget_credits&lt;/code&gt;. That's it.&lt;/p&gt;

&lt;p&gt;I sat there for a second. Then I thought about all the times I'd hit this exact wall before. Huge API responses from clients, MongoDB exports with 10,000 nested documents, config mega-files that no human should ever have to scroll through. Every time, the same story: JSON too big, AI too small.&lt;/p&gt;

&lt;p&gt;I stared at the screen. Then I did what any sane developer would do.&lt;/p&gt;

&lt;p&gt;I built something.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Couldn't Just "Write a Script"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8mu19ysym3l8wy52qxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8mu19ysym3l8wy52qxj.png" alt="The Frustration Moment" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, I &lt;em&gt;know&lt;/em&gt; what you're thinking. "Just write a Python script, bro."&lt;/p&gt;

&lt;p&gt;And sure, I could:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missions.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget_credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total budget: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six lines. Done. Right?&lt;/p&gt;

&lt;p&gt;Except... I don't always know &lt;em&gt;what&lt;/em&gt; I'm looking for. Sometimes I'm exploring. Sometimes I get a JSON from a client with zero documentation and I just need to poke around. "What keys exist? What's nested under &lt;code&gt;config&lt;/code&gt;? How many items have &lt;code&gt;status: active&lt;/code&gt;?" That exploratory, conversational flow is exactly what AI assistants are &lt;em&gt;supposed&lt;/em&gt; to be good at.&lt;/p&gt;

&lt;p&gt;I could use &lt;code&gt;jq&lt;/code&gt;. I could fire up &lt;code&gt;pandas&lt;/code&gt;. I could write a new throwaway script every single time. But all of those require me to already know the structure, already know the question, and essentially do the AI's job &lt;em&gt;for&lt;/em&gt; the AI.&lt;/p&gt;

&lt;p&gt;That defeated the whole purpose. I didn't want to write code to understand my data. I wanted to &lt;em&gt;talk&lt;/em&gt; to it.&lt;/p&gt;

&lt;p&gt;So the question became: &lt;strong&gt;What if my AI didn't need to read the entire file?&lt;/strong&gt; What if, instead of choking on 620 lines of JSON, it could surgically extract exactly the pieces it needed?&lt;/p&gt;

&lt;p&gt;That's when I realized: this was a job for MCP.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter MCP — Giving Your AI Hands, Not Just Eyes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b5g2q94it9hlguohl6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b5g2q94it9hlguohl6n.png" alt="The MCP Concept" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you haven't heard of &lt;strong&gt;MCP (Model Context Protocol)&lt;/strong&gt; yet, here's the simplest way I can explain it:&lt;/p&gt;

&lt;p&gt;Instead of making your AI read an entire library, &lt;strong&gt;you give it a librarian.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MCP lets you give your AI &lt;em&gt;tools&lt;/em&gt;. Small, focused functions it can call on its own. The AI decides which tool to use, calls it, reads the result, and decides its next move. It's like upgrading your assistant from a speed reader (who runs out of pages) to a detective with the perfect toolkit.&lt;/p&gt;

&lt;p&gt;Here's the paradigm shift that changed everything for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;❌ Without MCP:&lt;/strong&gt; Paste 620 lines of JSON → &lt;em&gt;"Context limit exceeded"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ With MCP:&lt;/strong&gt; AI calls &lt;code&gt;load_json&lt;/code&gt; → gets metadata (47KB, 3 top-level keys) → calls &lt;code&gt;sum_values&lt;/code&gt; on &lt;code&gt;$.missions[*].budget_credits&lt;/code&gt; → &lt;strong&gt;"$18,250,000,000.50"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI never saw the full file. It loaded it into memory, used a surgical tool to extract the number I needed, and gave me the answer. Two tool calls. Zero context overflow. That was my lightbulb moment.&lt;/p&gt;
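&lt;p&gt;To make that flow concrete, here's a minimal sketch of the two calls in plain Python. This is illustrative, not the actual server code: the tool names match the article, but the path handling is simplified to the one shape used here.&lt;/p&gt;

```python
import json

# Illustrative sketch, not the real server: the model never sees the raw
# document, only small tool results like the dicts returned below.
STORE = {}

def load_json(name, text):
    """Parse the document into the store; hand back only metadata."""
    data = json.loads(text)
    STORE[name] = data
    return {"root_type": type(data).__name__, "top_level_keys": list(data)}

def sum_values(name, path):
    """Sum a wildcard path of the simplified form '$.key[*].field'."""
    head, field = path.split("[*].")
    key = head.removeprefix("$.")
    return sum(item[field] for item in STORE[name][key])

doc = '{"missions": [{"budget_credits": 4500000000}, {"budget_credits": 3200000000}]}'
print(load_json("missions", doc))   # metadata only, a few dozen bytes
print(sum_values("missions", "$.missions[*].budget_credits"))  # 7700000000
```

&lt;p&gt;The model reads the tiny metadata dict, then asks for one aggregated number. The 620 lines never enter its context.&lt;/p&gt;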

&lt;p&gt;And that's exactly what I built.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built: Universal JSON Agent MCP
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Talk to your JSON files using natural language.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;26 tools.&lt;/strong&gt; 9 categories. One install command. Zero config headaches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;universal-json-agent-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. You now have an MCP server that works with &lt;strong&gt;VS Code / GitHub Copilot&lt;/strong&gt;, &lt;strong&gt;Claude Desktop&lt;/strong&gt;, &lt;strong&gt;Cursor&lt;/strong&gt;, or literally any MCP-compatible client.&lt;/p&gt;

&lt;p&gt;But numbers don't tell the story. Let me show you &lt;em&gt;what it actually feels like&lt;/em&gt; to use this. Remember that space missions database I was struggling with?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You:   Load missions.json and tell me what's inside.

AI:    ✅ Loaded "missions" — object with 3 top-level keys,
       file size 47.2 KB.
       Keys: database_meta, missions, global_notes

You:   What's the structure look like?

AI:    database_meta (object)
         ├─ version (string)
         ├─ total_missions (number)
         ├─ contact (object)
         │    ├─ director (string)
         │    └─ hotline (string)
       missions (array of 7 objects)
         ├─ codename (string)
         ├─ status (string)
         ├─ budget_credits (number)
         ├─ spacecraft (object) → designation, specifications…
         └─ personnel (object) → commander, crew[]…
       (Used: get_structure)

You:   How many missions are there?

AI:    7 missions total.
       (Used: count)

You:   What's the total budget across all missions?

AI:    The total budget across all missions is $18,250,000,000.50
       (Used: sum_values on $.missions[*].budget_credits)

You:   Show me only the in-progress ones, sorted by priority.

AI:    Found 3 missions with status "in_progress":
       1. Stellar Vanguard  — priority 1, budget $4.5B
       2. Deep Core Echo    — priority 2, budget $3.2B
       3. Nebula Drift      — priority 4, budget $1.8B
       (Used: filter_objects → sort_by → pick_fields)

You:   Export that to CSV.

AI:    ✅ Exported 3 missions to in_progress_missions.csv
       (Used: export_csv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six natural language questions. Zero scripts. The AI figured out which of the 26 tools to chain together (load, explore, count, aggregate, filter, sort, pick fields, and export) all by itself.&lt;/p&gt;

&lt;p&gt;That's the magic. I didn't build one smart tool. I built 26 small, dumb tools and let the AI be the smart one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpcpe6kwbkuqnsw7tqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhpcpe6kwbkuqnsw7tqj.png" alt="Query Screenshot" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Toolkit — A Quick Tour
&lt;/h2&gt;

&lt;p&gt;I'm not going to bore you with a feature table (that's what the &lt;a href="https://github.com/GautamVhavle/universal-json-agent" rel="noopener noreferrer"&gt;README&lt;/a&gt; is for). Instead, let me show you &lt;em&gt;when&lt;/em&gt; you'd reach for each of these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔓 Load it&lt;/strong&gt; — You've got a JSON file. Maybe it's on your desktop, maybe it's in a project folder. You tell the AI to load it. Boom, it's in memory with full metadata: file size, root type, how many top-level keys. You can load multiple files and switch between them mid-conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Explore it&lt;/strong&gt; — You just received a 50MB JSON dump from a client. No documentation. No schema. You don't even know what keys exist. This is where &lt;code&gt;get_structure&lt;/code&gt; becomes your flashlight in the dark. It shows you a clean skeleton of the entire file: keys, types, nesting depth. All without loading a single data value into your chat. &lt;code&gt;distinct_paths&lt;/code&gt; goes even further and maps out &lt;em&gt;every single leaf path&lt;/em&gt; in the document, even inside nested arrays. Think of it like turning on all the lights in a building you've never been in.&lt;/p&gt;
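&lt;p&gt;If you're curious what producing a structure skeleton takes, here's a rough sketch in plain Python (illustrative only; the real &lt;code&gt;get_structure&lt;/code&gt; output is richer and uses JSON type names rather than Python ones):&lt;/p&gt;

```python
def get_structure(node, depth=0, max_depth=4):
    """Render a type skeleton of the document, with no data values."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = []
        for key, value in node.items():
            lines.append(f"{pad}{key} ({type(value).__name__})")
            if depth + 1 < max_depth and isinstance(value, (dict, list)):
                lines.append(get_structure(value, depth + 1, max_depth))
        return "\n".join(lines)
    if isinstance(node, list):
        head = f"{pad}array of {len(node)} items"
        if node and isinstance(node[0], (dict, list)):
            # Assume the first element is representative of the array.
            return head + "\n" + get_structure(node[0], depth + 1, max_depth)
        return head
    return f"{pad}{type(node).__name__}"

sample = {"missions": [{"codename": "Stellar Vanguard", "budget_credits": 4500000000}]}
print(get_structure(sample))
```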

&lt;p&gt;&lt;strong&gt;🎯 Query it&lt;/strong&gt; — Now you know what's inside. Time to ask questions. Full JSONPath support (&lt;code&gt;$.missions[*].codename&lt;/code&gt;), regex-powered text search across all string values, and smart filtering with 8 operators (equals, greater than, contains, regex match, and more). You can say "find all personnel whose role contains 'engineer'" and get precise results, without ever reading the parts of the file you don't care about.&lt;/p&gt;
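&lt;p&gt;Operator-based filtering is simpler than it sounds. A hypothetical sketch with four of the operators (the real tool ships eight; the operator names here are my own):&lt;/p&gt;

```python
import re

# Four illustrative operators; the real tool exposes eight.
OPS = {
    "eq":       lambda value, target: value == target,
    "gt":       lambda value, target: value > target,
    "contains": lambda value, target: target in value,
    "regex":    lambda value, target: re.search(target, value) is not None,
}

def filter_objects(items, field, op, target):
    """Keep only objects whose `field` satisfies the chosen operator."""
    return [obj for obj in items if field in obj and OPS[op](obj[field], target)]

crew = [{"role": "flight engineer"}, {"role": "pilot"}, {"role": "systems engineer"}]
print(filter_objects(crew, "role", "contains", "engineer"))
```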

&lt;p&gt;&lt;strong&gt;📊 Crunch it&lt;/strong&gt; — This is where it gets really fun. Sum all budgets. Count active missions. Get min/max priority levels. Run a full statistical breakdown (mean, median, standard deviation, percentiles) on any numeric field. Even get a frequency table, like &lt;code&gt;value_counts()&lt;/code&gt; in pandas, but through a conversation. All the number-crunching happens server-side; your context window stays clean.&lt;/p&gt;
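&lt;p&gt;The statistical side needs nothing beyond the standard library. An illustrative sketch, not the package's actual implementation:&lt;/p&gt;

```python
import statistics
from collections import Counter

def stats_summary(values):
    """Numeric breakdown computed server-side; only this small dict
    ever enters the model's context."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values), "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "p25": q1, "p75": q3,
    }

def value_counts(values):
    """Frequency table, like pandas' value_counts(), most common first."""
    return Counter(values).most_common()
```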

&lt;p&gt;&lt;strong&gt;🔄 Transform it&lt;/strong&gt; — Flatten nested objects into dot-notation key-value pairs. Pick only the fields you care about. Group missions by status. Sort by priority, descending. Reshape the data into exactly the view you need before asking the next question.&lt;/p&gt;
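&lt;p&gt;Flattening is the easiest of these to picture. A minimal sketch:&lt;/p&gt;

```python
def flatten(node, prefix=""):
    """Collapse nested dicts/lists into dot-notation leaf paths."""
    if isinstance(node, dict):
        pairs = node.items()
    elif isinstance(node, list):
        pairs = ((str(i), v) for i, v in enumerate(node))
    else:
        return {prefix: node}
    out = {}
    for key, value in pairs:
        path = f"{prefix}.{key}" if prefix else str(key)
        out.update(flatten(value, path))
    return out

print(flatten({"spacecraft": {"designation": "SV-1", "crew": ["Ada", "Lin"]}}))
# {'spacecraft.designation': 'SV-1', 'spacecraft.crew.0': 'Ada', 'spacecraft.crew.1': 'Lin'}
```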

&lt;p&gt;&lt;strong&gt;📤 Export it&lt;/strong&gt; — Happy with the filtered, sorted, transformed result? Export to CSV or JSON with one sentence. Done. Ship it.&lt;/p&gt;

&lt;p&gt;Here's the design philosophy I kept coming back to: &lt;strong&gt;each tool does one small thing.&lt;/strong&gt; The AI chains them together to build complex analyses. Kind of like UNIX pipes, but instead of &lt;code&gt;|&lt;/code&gt;, it's intelligence connecting the tools. &lt;/p&gt;

&lt;p&gt;That idea (small tools, composed by an AI) turned out to be way more powerful than any "one mega-tool that does everything" approach I could have built.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: The Design Decisions I'm Proud Of
&lt;/h2&gt;

&lt;p&gt;I'll keep this section focused. This is a story, not a textbook. But there are a few engineering choices that made this whole thing work, and I think they're worth sharing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart Truncation — The Core Innovation
&lt;/h3&gt;

&lt;p&gt;This was the &lt;em&gt;entire point&lt;/em&gt; of the project, so let me explain why it matters.&lt;/p&gt;

&lt;p&gt;When you ask &lt;code&gt;get_value&lt;/code&gt; on the root of a 620-line JSON, you could get 620 lines dumped into the AI's context. That's exactly the problem we're solving, right? So instead, every single tool response is automatically capped at ~10KB. If the result is too large, it gets cleanly truncated:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Array with 347 items — showing first 50. Use filters or paths to narrow results."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI reads that message, understands it needs to be more specific, and adjusts its next query. It never gets overwhelmed, no matter how enormous the underlying data is. This one design decision, aggressive truncation with helpful hints, is what makes it possible to work with JSON files of &lt;em&gt;any&lt;/em&gt; size through a context-limited LLM.&lt;/p&gt;
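&lt;p&gt;The mechanism itself is only a few lines. A simplified sketch (the real cap and message wording may differ slightly):&lt;/p&gt;

```python
import json

MAX_BYTES = 10_000  # roughly the ~10KB cap described above

def cap_response(result):
    """Return the full result if it's small; otherwise a preview plus a
    hint the model can act on in its next call."""
    text = json.dumps(result)
    if len(text.encode("utf-8")) <= MAX_BYTES:
        return text
    if isinstance(result, list):
        return (f"Array with {len(result)} items — showing first 50. "
                "Use filters or paths to narrow results. "
                f"Preview: {json.dumps(result[:50])[:MAX_BYTES]}")
    return text[:MAX_BYTES] + " [truncated: query a narrower path]"
```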

&lt;h3&gt;
  
  
  The Store Pattern
&lt;/h3&gt;

&lt;p&gt;When you load a JSON file, it goes into an in-memory store, not into the AI's chat. The AI gets back metadata (file size, root type, number of keys, estimated memory usage), and from that point on, every tool works against the stored data. Load once, query a hundred times. You can even load multiple files and run comparisons across them.&lt;/p&gt;
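&lt;p&gt;A stripped-down sketch of the pattern (illustrative; the metadata field names are my own, not necessarily the package's):&lt;/p&gt;

```python
import json

class JsonStore:
    """Documents live here, server-side; the AI only ever sees metadata."""
    def __init__(self):
        self._docs = {}

    def load_file(self, name, path):
        # Parse once; every later tool call works against the stored object.
        with open(path) as f:
            raw = f.read()
        data = json.loads(raw)
        self._docs[name] = data
        return {
            "name": name,
            "file_bytes": len(raw.encode("utf-8")),
            "root_type": type(data).__name__,
            "top_level_keys": len(data) if isinstance(data, (dict, list)) else 1,
        }

    def get(self, name):
        """Every other tool resolves its target document through here."""
        return self._docs[name]
```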

&lt;h3&gt;
  
  
  Fail-Safe Tool Responses
&lt;/h3&gt;

&lt;p&gt;LLMs use tools in loops. They call a tool, read the result, decide the next step. If a tool throws an unhandled exception, the entire reasoning loop can break. So every tool in this project catches errors gracefully and returns a readable error message instead of crashing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI sees &lt;code&gt;"Error: Key 'budgt' not found. Available keys: budget_credits, budget_cycle"&lt;/code&gt;, adapts, corrects the typo, and continues. No crash. No broken loop. The conversation just keeps going.&lt;/p&gt;
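&lt;p&gt;The other half of that recovery loop is making the error message itself actionable. A hypothetical example of a tool body doing exactly that:&lt;/p&gt;

```python
def get_key(doc, key):
    """Hypothetical tool body: on a miss, return a hint instead of raising."""
    try:
        return doc[key]
    except KeyError:
        available = ", ".join(sorted(doc))
        return f"Error: Key '{key}' not found. Available keys: {available}"

mission = {"budget_credits": 4_500_000_000, "budget_cycle": "FY-2026"}
print(get_key(mission, "budgt"))
# Error: Key 'budgt' not found. Available keys: budget_credits, budget_cycle
```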

&lt;h3&gt;
  
  
  The UNIX Philosophy, Applied to AI
&lt;/h3&gt;

&lt;p&gt;This is the architectural insight I keep coming back to. I didn't try to build one omniscient JSON-processing mega-tool. Instead, I built &lt;strong&gt;26 tiny tools that each do exactly one thing.&lt;/strong&gt; The AI figures out how to chain them together.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;load_json&lt;/code&gt; loads. &lt;code&gt;get_structure&lt;/code&gt; explores. &lt;code&gt;filter_objects&lt;/code&gt; filters. &lt;code&gt;sum_values&lt;/code&gt; sums. &lt;code&gt;export_csv&lt;/code&gt; exports. That's it. Each one is stupid simple. But when an AI strings together &lt;code&gt;load → structure → filter → sort → pick_fields → export&lt;/code&gt;? That's when the magic happens.&lt;/p&gt;

&lt;p&gt;The composability is the whole point. And it makes the codebase dead simple to extend. Adding a new tool is about 50 lines of code.&lt;/p&gt;




&lt;h2&gt;
  
  
  My First PyPI Package — The Terrifying, Beautiful, Amazing Part 🎉
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1sm4uw6us883re0q0ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1sm4uw6us883re0q0ho.png" alt="PyPI Launch Celebration" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, can we just... pause for a second?&lt;/p&gt;

&lt;p&gt;This was my &lt;strong&gt;very first Python package&lt;/strong&gt;. Ever. Published. On &lt;a href="https://pypi.org/project/universal-json-agent-mcp/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;. Where actual developers go to install actual packages. And now mine is sitting there next to &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;flask&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; (okay, maybe not &lt;em&gt;next to them&lt;/em&gt;, but on the same website and that counts).&lt;/p&gt;

&lt;p&gt;There's something weirdly emotional about naming a package. You're not just picking a variable name that lives in one file. You're picking a name that will live on the internet. Forever. People will &lt;em&gt;type&lt;/em&gt; this name. Into their terminals. With their own fingers.&lt;/p&gt;

&lt;p&gt;I went back and forth. Too generic? Someone already took it. Too clever? No one will find it. I finally landed on &lt;code&gt;universal-json-agent-mcp&lt;/code&gt; and thought, "Yeah. That's the one. That says exactly what it does."&lt;/p&gt;

&lt;p&gt;And then I hit publish.&lt;/p&gt;

&lt;p&gt;I remember staring at the &lt;a href="https://pypi.org/project/universal-json-agent-mcp/" rel="noopener noreferrer"&gt;PyPI page&lt;/a&gt; after it went live, just... refreshing it. Like a kid checking if their YouTube video got views yet. The badges were there. The description was there. My name was there. &lt;em&gt;My name.&lt;/em&gt; On PyPI.&lt;/p&gt;

&lt;p&gt;Then I opened a fresh terminal on a completely clean machine. Typed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;universal-json-agent-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watched the progress bar. It downloaded. It installed. I ran it. It &lt;em&gt;worked&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I'm not going to pretend I was cool about it. I wasn't. If you've been sitting on a project, thinking "it's not ready" or "no one will use it"... just ship it. Seriously. The first version doesn't have to be perfect. It just has to exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Works Everywhere — Zero Lock-In
&lt;/h2&gt;

&lt;p&gt;One of my non-negotiables from day one was &lt;strong&gt;universal compatibility&lt;/strong&gt;. I didn't want to build a VS Code-only tool or a Claude-only plugin. If your AI editor speaks MCP, this should just work.&lt;/p&gt;

&lt;p&gt;Here's the entire setup for VS Code, and it's similarly simple for Claude and Cursor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;.vscode/mcp.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"universal-json-agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"universal-json-agent-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Eight lines. Restart your editor, and every JSON file on your machine is now queryable through natural language. No API keys needed for the MCP server itself, no cloud dependency, no vendor lock-in. It runs locally, on your machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus Chapter: The Web Server
&lt;/h2&gt;

&lt;p&gt;But what about when you're &lt;em&gt;not&lt;/em&gt; in an editor?&lt;/p&gt;

&lt;p&gt;What if you want to query JSON from a dashboard, a CI pipeline, or a custom internal tool? That's where the bonus &lt;strong&gt;FastAPI + LangChain&lt;/strong&gt; web server comes in.&lt;/p&gt;

&lt;p&gt;It's an optional subproject bundled in the same repo. Upload a JSON file, ask a question in plain English, get an answer back, along with which tools were used. Under the hood, it runs a full &lt;strong&gt;LangChain ReAct agent&lt;/strong&gt;. The same reasoning loop that Copilot and Claude use, but wrapped in a REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"file=@data/missions.json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"query=What's the total budget?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The total budget across all missions is $18,250,000,000.50"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools_used"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"load_json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sum_values"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's system prompt is specifically tuned to use bulk extraction patterns. It knows to use JSONPath wildcards like &lt;code&gt;$.missions[*].budget_credits&lt;/code&gt; instead of iterating one-by-one. Same 26 tools, same core logic, zero code duplication. Just a different interface.&lt;/p&gt;

&lt;p&gt;Perfect for building quick dashboards or integrating JSON analysis into automated workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who Is This For?
&lt;/h2&gt;

&lt;p&gt;Honestly? If you've ever stared at a JSON file and thought &lt;em&gt;"this is too much,"&lt;/em&gt; it's for you.&lt;/p&gt;

&lt;p&gt;But if you want specifics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend developers&lt;/strong&gt; debugging massive API responses at 11 PM, trying to figure out why that one nested field isn't what the docs say it should be&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data engineers&lt;/strong&gt; who just got a JSON export with zero documentation and need to figure out its structure before writing a pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone&lt;/strong&gt; who's ever pasted a JSON into ChatGPT, Claude, or Copilot and hit the context limit (never again)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Students and learners&lt;/strong&gt; who want to explore datasets conversationally without writing boilerplate ETL code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams building internal tools&lt;/strong&gt; who need a quick JSON analysis layer without spinning up a whole data stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you work with JSON (and let's be honest, &lt;em&gt;who doesn't in 2026&lt;/em&gt;), this tool has a place in your workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;That late night frustration didn't just produce a tool. It produced my first PyPI package.&lt;/p&gt;

&lt;p&gt;I built this because I needed it. No one asked me to. I had a problem. My JSON was too big for my AI, and I refused to accept that as a limitation. What started as a late-night hack turned into something with proper architecture, hundreds of tests, a published package, and support for every major AI editor.&lt;/p&gt;

&lt;p&gt;If there's one thing I've learned, it's that the best tools come from scratching your own itch. And now this one's out there for anyone who's ever stared at a monster JSON and thought, &lt;em&gt;"There has to be a better way."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your JSON is too big? &lt;strong&gt;Not anymore.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⭐ If this resonated with you, give the repo a star.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/GautamVhavle/universal-json-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://pypi.org/project/universal-json-agent-mcp/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/strong&gt; · &lt;strong&gt;&lt;a href="https://github.com/GautamVhavle/universal-json-agent/issues" rel="noopener noreferrer"&gt;Report Issues&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Reverse Engineered ChatGPT's UI Into an OpenAI Compatible API and Here's Why You Shouldn't</title>
      <dc:creator>Gautam Vhavle</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:21:04 +0000</pubDate>
      <link>https://dev.to/gautamvhavle/i-reverse-engineered-chatgpts-ui-into-an-openai-compatible-api-and-heres-why-you-shouldnt-ch</link>
      <guid>https://dev.to/gautamvhavle/i-reverse-engineered-chatgpts-ui-into-an-openai-compatible-api-and-heres-why-you-shouldnt-ch</guid>
      <description>&lt;p&gt;&lt;em&gt;A weekend project that wasn't supposed to work. But it did. And now we need to talk about it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me be upfront: &lt;strong&gt;this project exists for educational purposes only.&lt;/strong&gt; What you're about to read shouldn't be replicated in production, shouldn't be used to skirt terms of service, and, honestly, shouldn't work as well as it does.&lt;/p&gt;

&lt;p&gt;But here we are.&lt;/p&gt;

&lt;p&gt;I took ChatGPT's entire web frontend, the same UI you and I use every day, and reverse-engineered it into my backend. Browser automation, stealth patches, Cloudflare bypass, clipboard hijacking, DOM scraping, virtual displays. The result? A fully &lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; that any SDK, any LangChain agent, any curl command can hit as a drop-in replacement. Tool calling, image generation, file uploads, vision, all of it using your own account and subscription with the latest model.&lt;/p&gt;

&lt;p&gt;The project is called &lt;strong&gt;CatGPT-Gateway&lt;/strong&gt; (I'll attach the GitHub link below for reference). I need to tell you how it works.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Because understanding what's possible is the first step to building better defenses against it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdoodsfuu9r6tpyf9uzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdoodsfuu9r6tpyf9uzh.png" alt="Hero" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Before we go any further:&lt;/strong&gt; This project is built purely for &lt;strong&gt;educational purposes&lt;/strong&gt; and cybersecurity research. I'm not a hacker. This isn't a bug. I'm just a curious developer who wanted to understand the limits of browser automation. Stick around, we'll talk ethics at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Origin Story: A Developer's Frustration
&lt;/h2&gt;

&lt;p&gt;Here’s what happened…&lt;/p&gt;

&lt;p&gt;I signed into OpenAI using my regular account while building an application that required API calls for testing. I needed to review pricing, generate API keys, and wire everything into a fairly complex LangChain workflow I was developing.&lt;/p&gt;

&lt;p&gt;At the same time, I was working on another project that required repeated image generation. Not just one or two images, several iterations for testing and refinement.&lt;/p&gt;

&lt;p&gt;So I paused.&lt;/p&gt;

&lt;p&gt;ChatGPT could already generate images directly inside the chat interface. It could also produce structured responses to my prompts with context awareness.&lt;/p&gt;

&lt;p&gt;That raised a technical question.&lt;/p&gt;

&lt;p&gt;What if the chat interface itself could be automated? Not as a workaround, and not to avoid API usage or pricing, but as a pure engineering experiment. Could a conversational UI be programmatically driven? Could it behave like an interaction layer that mimics API semantics? What architectural differences would emerge between UI automation and formal API integration?&lt;/p&gt;

&lt;p&gt;The curiosity wasn’t financial. It was structural.&lt;/p&gt;

&lt;p&gt;I wanted to understand system behavior, automation mechanics, and the boundaries between user-facing interfaces and backend-accessible endpoints.&lt;/p&gt;

&lt;p&gt;It was supposed to be a quick weekend hack, the kind you start at 11 PM on a Friday, fully expecting it to collapse within the hour.&lt;/p&gt;

&lt;p&gt;It didn’t. And that’s where things got interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglhk741zoxp590q6pse4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglhk741zoxp590q6pse4.png" alt="Eureka Moment" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Every great terrible idea starts at 11 PM with too much coffee.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  How Is This Even Possible?
&lt;/h2&gt;

&lt;p&gt;Let me break it down. The architecture is simple on paper, absolutely unhinged in execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;External Clients (curl, Python, LangChain)
          │
          ▼
    ┌───────────┐
    │  FastAPI  │ ← OpenAI-compatible API (port 8000)
    │  Server   │
    └─────┬─────┘
          │
          ▼
   ┌──────────────┐
   │ ChatGPTClient│ ← Sends messages, waits for responses,
   │              │   extracts text via clipboard/DOM
   └──────┬───────┘
          │
          ▼
  ┌───────────────┐
  │BrowserManager │ ← Patchright (stealth Playwright fork)
  │+ Stealth      │   controlling a real Chrome instance
  │+ Human Sim    │   with anti-detection patches
  └───────┬───────┘
          │
          ▼
  ┌───────────────┐
  │  Xvfb + VNC   │ ← Virtual display + VNC viewer
  │  (port 6080)  │   (the magic trick — more on this later)
  └───────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what's happening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You hit the API&lt;/strong&gt; — standard OpenAI-format request, &lt;code&gt;POST /v1/chat/completions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI receives it&lt;/strong&gt; and hands it to the ChatGPT client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The client literally types your message&lt;/strong&gt; into ChatGPT's chat input box using a headful Chrome browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It waits for the response&lt;/strong&gt; by watching for ChatGPT's copy button to appear (that's how it knows the response is complete — clever, right?)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It clicks the copy button&lt;/strong&gt;, grabs the text from the clipboard, and returns it as a proper OpenAI-format JSON response&lt;/li&gt;
&lt;/ol&gt;
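
&lt;p&gt;The last step, wrapping scraped text in an OpenAI-shaped payload, is mostly bookkeeping. A minimal sketch of that envelope (names hypothetical, not CatGPT's actual code):&lt;/p&gt;

```python
import time
import uuid

def to_chat_completion(scraped_text, model="catgpt-browser"):
    """Wrap text scraped out of the ChatGPT UI in an OpenAI-style
    chat.completion envelope so any OpenAI SDK can parse it."""
    return {
        "id": "chatcmpl-" + uuid.uuid4().hex[:24],
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": scraped_text},
            "finish_reason": "stop",
        }],
        # A UI scrape has no real token counts; a gateway can only
        # estimate them (e.g. roughly one token per four characters).
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```

&lt;p&gt;Everything downstream only ever sees this dict.&lt;/p&gt;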

&lt;p&gt;Your LangChain app, your agent framework, your curl script — they all think they're talking to OpenAI. They're actually talking to a cat controlling a browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnw7j2twe2nmff3j0kj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnw7j2twe2nmff3j0kj1.png" alt="curl" width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cloudflare Boss Fight
&lt;/h2&gt;

&lt;p&gt;Okay, here's where it gets interesting. You can't just &lt;code&gt;puppeteer.launch()&lt;/code&gt; and waltz into ChatGPT. OpenAI uses &lt;strong&gt;Cloudflare's human verification&lt;/strong&gt; — and modern Cloudflare is &lt;em&gt;really&lt;/em&gt; good at detecting bots.&lt;/p&gt;

&lt;p&gt;I want to genuinely appreciate OpenAI and Cloudflare here. Their security gave me &lt;strong&gt;so many blockers and roadblocks&lt;/strong&gt; that I almost gave up multiple times. We're talking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;navigator.webdriver&lt;/code&gt; detection&lt;/strong&gt; — the first thing any anti-bot checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canvas fingerprinting&lt;/strong&gt; — headless browsers have different rendering signatures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebGL fingerprinting&lt;/strong&gt; — GPU-level identification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin enumeration&lt;/strong&gt; — headless Chrome reports different plugins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral analysis&lt;/strong&gt; — instant typing? Pixel-perfect clicks? That's not human&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every single one of these had to be defeated. Not by exploiting a vulnerability — but by making the browser &lt;strong&gt;more human&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7h9q8ok7e85z9zw1z6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv7h9q8ok7e85z9zw1z6y.png" alt="Cloudflare vs Me" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Actual footage of me vs. Cloudflare's bot detection (dramatized).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's what CatGPT does to pass as human:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stealth Patches&lt;/strong&gt; — Using &lt;a href="https://github.com/AjaxMultiCommentary/patchright" rel="noopener noreferrer"&gt;Patchright&lt;/a&gt; (a Playwright fork built for stealth) combined with &lt;a href="https://pypi.org/project/playwright-stealth/" rel="noopener noreferrer"&gt;playwright-stealth&lt;/a&gt;, CatGPT patches the browser fingerprint at every level. &lt;code&gt;navigator.webdriver&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt;, canvas renders match real browsers, WebGL reports are spoofed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human Simulation&lt;/strong&gt; — Messages aren't typed character-by-character (that's detectable too). Instead, CatGPT uses clipboard-paste injection with randomized delays. Mouse movements happen during "thinking" pauses — idle cursor drifts to random positions with 5-15 intermediate steps, like a bored human.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viewport Jitter&lt;/strong&gt; — Every launch randomizes the viewport by ±20px from the base 1280×720. No two sessions have exactly the same fingerprint.&lt;/p&gt;
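
&lt;p&gt;That jitter is only a few lines. A sketch of the idea (function name hypothetical):&lt;/p&gt;

```python
import random

BASE_WIDTH, BASE_HEIGHT = 1280, 720

def jittered_viewport(max_jitter=20):
    """Pick a viewport within max_jitter pixels of the 1280x720 base,
    so no two sessions share an identical size fingerprint."""
    return {
        "width": BASE_WIDTH + random.randint(-max_jitter, max_jitter),
        "height": BASE_HEIGHT + random.randint(-max_jitter, max_jitter),
    }
```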

&lt;p&gt;&lt;strong&gt;The Docker DNS Bug Discovery&lt;/strong&gt; — This one's wild. I found that calling Playwright's &lt;code&gt;add_init_script()&lt;/code&gt; — even with just a &lt;code&gt;console.log("")&lt;/code&gt; — completely &lt;strong&gt;breaks Chrome's DNS resolution&lt;/strong&gt; inside Docker containers. Every navigation after that returns &lt;code&gt;ERR_NAME_NOT_RESOLVED&lt;/code&gt;. The fix? Inject all stealth JavaScript via &lt;code&gt;page.evaluate()&lt;/code&gt; at runtime and re-inject on every &lt;code&gt;framenavigated&lt;/code&gt; event instead. Days of debugging. A one-line fix. Classic.&lt;/p&gt;


&lt;h2&gt;
  
  
  The VNC Trick — Schrödinger's Browser
&lt;/h2&gt;

&lt;p&gt;This is the part I'm most proud of. This is what makes CatGPT fundamentally different from every other browser automation project I've seen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Cloudflare can detect headless Chrome. So you need headful Chrome (a real browser window). But servers don't have monitors. So how do you run a "headed" browser on a headless server?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; You give it a &lt;strong&gt;fake monitor&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chrome (headful mode)
    renders to →  Xvfb (Virtual Framebuffer — fake display :99)
                      captured by →  x11vnc (VNC server)
                                        served via →  noVNC (WebSocket, port 6080)
                                                          viewable in →  Your browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
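
&lt;p&gt;In shell terms, the chain above boils down to three long-lived processes (invocations illustrative, not CatGPT's exact flags):&lt;/p&gt;

```shell
# Managed by supervisord inside the container:
Xvfb :99 -screen 0 1280x720x24               # the fake monitor (display :99)
x11vnc -display :99 -forever -rfbport 5900   # VNC server capturing it (port 5900)
websockify --web /usr/share/novnc 6080 localhost:5900  # noVNC bridge (port 6080)
# Chrome is then launched headful with DISPLAY=:99 set, so it renders
# to a screen that exists only in memory.
```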



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqquvl2avbbbitmftuk86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqquvl2avbbbitmftuk86.png" alt="Shocking" width="800" height="456"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;This was the most shocking part for me too.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the beautiful paradox:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;To Cloudflare&lt;/strong&gt;: It's a real, headed, GPU-rendering Chrome browser with a proper display. Nothing to see here, human user browsing normally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To your server&lt;/strong&gt;: It's a headless process. No monitor needed. Runs in Docker. Fully automated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;To you&lt;/strong&gt;: You can open &lt;code&gt;http://localhost:6080&lt;/code&gt; in your browser and &lt;em&gt;watch the automation happen in real-time&lt;/em&gt; through VNC. Debug visually. Handle CAPTCHAs manually if needed. Sign in through the actual browser.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's &lt;strong&gt;headed AND headless at the same time&lt;/strong&gt;. Schrödinger's browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgubga7m6dn265tq2fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9sgubga7m6dn265tq2fq.png" alt="Headed for Cloudflare. Headless for your server" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Headed for Cloudflare. Headless for your server. Both at the same time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Four processes run in the Docker container, managed by supervisord:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Process&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Xvfb&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtual framebuffer — the "fake monitor"&lt;/td&gt;
&lt;td&gt;Display &lt;code&gt;:99&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;x11vnc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VNC server — captures the virtual display&lt;/td&gt;
&lt;td&gt;5900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;noVNC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WebSocket bridge — makes VNC browser-accessible&lt;/td&gt;
&lt;td&gt;6080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The actual API server&lt;/td&gt;
&lt;td&gt;8000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One &lt;code&gt;docker compose up&lt;/code&gt; and you've got the whole stack.&lt;/p&gt;


&lt;h2&gt;
  
  
  OpenAI-Compatible Endpoints — Drop-In Replacement
&lt;/h2&gt;

&lt;p&gt;This is where the software engineer in me got excited. The API isn't just "some endpoint that returns text." It's a &lt;strong&gt;full OpenAI-compatible API&lt;/strong&gt;. Pydantic schemas matching OpenAI's spec exactly. You can point any OpenAI SDK at it and it just works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# It's literally the same format as OpenAI's API&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-token"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "catgpt-browser",
    "messages": [{"role": "user", "content": "Explain quantum computing in 3 sentences"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you get back:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-abc123..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chat.completion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Quantum computing uses quantum bits (qubits)..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stop"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Standard OpenAI response format. &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;choices&lt;/code&gt;, &lt;code&gt;usage&lt;/code&gt; — everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool / Function Calling
&lt;/h3&gt;

&lt;p&gt;Yeah, it supports tool calling too. Since we're automating a browser (not hitting an API with native function calling), CatGPT uses a clever technique: it builds a &lt;strong&gt;system prompt with tool definitions and few-shot examples&lt;/strong&gt;, instructing ChatGPT to output structured JSON. Then it parses that JSON with regex and returns proper &lt;code&gt;tool_calls&lt;/code&gt; in the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Works with LangChain / LangGraph out of the box
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;catgpt-browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Bind tools, create agents — it all works
&lt;/span&gt;&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculate&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
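
&lt;p&gt;The parsing half of that trick fits in a few lines. A sketch of the JSON extraction (regex and payload shape hypothetical; the real parser is more defensive):&lt;/p&gt;

```python
import json
import re

# Matches a JSON object whose top-level key is "tool_call", the shape
# the system prompt instructs ChatGPT to emit (one nesting level deep).
TOOL_CALL_RE = re.compile(
    r'\{\s*"tool_call"\s*:\s*\{(?:[^{}]|\{[^{}]*\})*\}\s*\}', re.DOTALL
)

def extract_tool_call(text):
    """Return the tool-call dict if the model emitted structured JSON,
    or None so the reply is treated as ordinary text."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return payload.get("tool_call")
```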



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5dneo67v20cmd3o9heb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5dneo67v20cmd3o9heb.png" alt="Langchain" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any modern agent framework — LangChain, LangGraph, CrewAI, AutoGen — just point the &lt;code&gt;base_url&lt;/code&gt; at CatGPT and you're golden.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wait, It Can DALL-E Too?!
&lt;/h2&gt;

&lt;p&gt;This was the feature that started the whole project, remember? I needed image generation for testing, and there was no straightforward API route for a developer signed in with a regular OpenAI account.&lt;/p&gt;

&lt;p&gt;So how does CatGPT handle &lt;code&gt;POST /v1/images/generations&lt;/code&gt;?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your request comes in&lt;/strong&gt; — &lt;code&gt;"prompt": "A sunset over mountains"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CatGPT sends it to ChatGPT&lt;/strong&gt; as a chat message, with hints about size and quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT invokes DALL-E internally&lt;/strong&gt; and renders the image in the chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The detector watches the DOM&lt;/strong&gt; for &lt;code&gt;img[alt="Generated image"]&lt;/code&gt; or images inside &lt;code&gt;div[id^="image-"]&lt;/code&gt; containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The image is downloaded&lt;/strong&gt; using the browser's own &lt;code&gt;fetch()&lt;/code&gt; API — this is key because it preserves the authentication cookies. No separate auth needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Returned to you&lt;/strong&gt; as either base64 JSON or a URL, matching OpenAI's image response format
&lt;/li&gt;
&lt;/ol&gt;
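
&lt;p&gt;Step 6 is, once again, mostly envelope-building. A minimal sketch (function name hypothetical):&lt;/p&gt;

```python
import base64
import time

def to_image_response(image_bytes, revised_prompt, response_format="b64_json"):
    """Wrap raw bytes fetched from the chat DOM in the shape of
    OpenAI's images/generations response."""
    entry = {"revised_prompt": revised_prompt}
    if response_format == "b64_json":
        entry["b64_json"] = base64.b64encode(image_bytes).decode("ascii")
    # (URL mode omitted here; the real gateway can also return a URL.)
    return {"created": int(time.time()), "data": [entry]}
```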

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate an image — same format as OpenAI's DALL-E API&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-token"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "prompt": "An adorable orange tabby kitten astronaut floating in space",
    "size": "1024x1024",
    "quality": "hd"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes a &lt;code&gt;revised_prompt&lt;/code&gt; — the actual prompt DALL-E used internally — and the image data. Works with the OpenAI Python SDK too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dall-e-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A cat hacking into a computer, cyberpunk style&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024x1024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47x8b7ljl7wiu7mw2odr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47x8b7ljl7wiu7mw2odr.jpeg" alt="Working" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I never thought this would actually work. Detecting dynamically generated images in someone else's DOM, downloading them through the browser's auth context, and returning them in a standard API format? It felt impossible until it wasn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cyberpunk TUI — Because CLIs Should Look Cool
&lt;/h2&gt;

&lt;p&gt;I couldn't just ship an API. I had to build a terminal UI. You know, for vibes.&lt;/p&gt;

&lt;p&gt;CatGPT comes with a full-screen &lt;strong&gt;Textual-based TUI&lt;/strong&gt; — a cyberpunk-themed chat interface running right in your terminal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Dark color scheme&lt;/strong&gt; — deep &lt;code&gt;#0d1117&lt;/code&gt; backgrounds, &lt;code&gt;#58a6ff&lt;/code&gt; accent blues, &lt;code&gt;#3fb950&lt;/code&gt; greens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASCII cat splash screen&lt;/strong&gt; — because every good project needs one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Color-coded messages&lt;/strong&gt; — blue borders for your messages, green for the assistant, purple for images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Markdown rendering&lt;/strong&gt; — code blocks, tables, lists, all rendered beautifully&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slash commands&lt;/strong&gt; — &lt;code&gt;/new&lt;/code&gt;, &lt;code&gt;/threads&lt;/code&gt;, &lt;code&gt;/images&lt;/code&gt;, &lt;code&gt;/status&lt;/code&gt;, &lt;code&gt;/help&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard shortcuts&lt;/strong&gt; — &lt;code&gt;Ctrl+N&lt;/code&gt; for new chat, &lt;code&gt;Ctrl+T&lt;/code&gt; for threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwj8wc0erwwwxidr8vj8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwj8wc0erwwwxidr8vj8.jpeg" alt="TUI" width="800" height="462"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;CATGPT TUI: Actual functional conversation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem6ysz2mt59jmz4tohd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fem6ysz2mt59jmz4tohd6.png" alt="Sci-fi" width="800" height="369"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The TUI: because talking to AI should feel like you're in a sci-fi movie.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Simplicity — Step by Step
&lt;/h2&gt;

&lt;p&gt;Here's what amazes me about how this came together. Despite all the complexity under the hood — stealth patches, virtual displays, DOM observers, clipboard extraction — &lt;strong&gt;using it is dead simple&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone it&lt;/span&gt;
git clone https://github.com/GautamVhavle/CatGPT-Gateway.git &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;CatGPT-Gateway

&lt;span class="c"&gt;# 2. Start everything&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# 3. Sign in once via VNC&lt;/span&gt;
&lt;span class="c"&gt;#    Open http://localhost:6080 → Log into ChatGPT → Done&lt;/span&gt;

&lt;span class="c"&gt;# 4. Hit the API&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer dummy123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"catgpt-browser","messages":[{"role":"user","content":"Hello!"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four steps. One Docker container. No API keys. Your browser session persists across restarts via a Docker volume.&lt;/p&gt;

&lt;p&gt;The project is structured cleanly too — every component has its own module:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;src/browser/&lt;/code&gt;&lt;/strong&gt; — Browser lifecycle, stealth, human simulation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;src/chatgpt/&lt;/code&gt;&lt;/strong&gt; — ChatGPT client, response detection, image handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;src/api/&lt;/code&gt;&lt;/strong&gt; — FastAPI routes, OpenAI-compatible schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;src/cli/&lt;/code&gt;&lt;/strong&gt; — Terminal UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;src/selectors.py&lt;/code&gt;&lt;/strong&gt; — All DOM selectors centralized in ONE file. When ChatGPT updates their UI, you update one file. That's it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The selector fallback system is one of my favorite design decisions. Every selector (chat input, send button, copy button, etc.) is a &lt;strong&gt;list of fallbacks&lt;/strong&gt;. If ChatGPT changes a &lt;code&gt;data-testid&lt;/code&gt;, the next selector in the list catches it. Resilient by design.&lt;/p&gt;
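&lt;p&gt;As a sketch of that idea (a Playwright-style &lt;code&gt;query_selector&lt;/code&gt; call is assumed here, and the selector strings are hypothetical; the real list lives in &lt;code&gt;src/selectors.py&lt;/code&gt;):&lt;/p&gt;

```python
# Hypothetical fallback list; the real selectors live in src/selectors.py.
SEND_BUTTON = [
    '[data-testid="send-button"]',       # primary
    'button[aria-label="Send prompt"]',  # fallback if the testid changes
]

def query_with_fallbacks(page, selectors):
    # Try each selector in order and return the first element that matches.
    for sel in selectors:
        el = page.query_selector(sel)
        if el is not None:
            return el
    return None
```

&lt;p&gt;When the first selector stops matching after a UI update, the lookup quietly falls through to the next one instead of crashing the whole request.&lt;/p&gt;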




&lt;h2&gt;
  
  
  The Response Detection — Three Strategies Deep
&lt;/h2&gt;

&lt;p&gt;How do you know when ChatGPT is done generating its response? This was one of the hardest problems.&lt;/p&gt;

&lt;p&gt;CatGPT uses a &lt;strong&gt;three-strategy detection system&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Copy Button Detection (primary)&lt;/strong&gt; — ChatGPT only shows the copy button after the full response is generated. Count the copy buttons before sending, wait for a new one. Elegant and reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stop Button Detection (fallback)&lt;/strong&gt; — Watch the stop/generation button. When it appears, streaming started. When it disappears, response is done.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Stability (last resort)&lt;/strong&gt; — Poll the response text every second. If it's identical for 5 consecutive polls, we're done.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three layers of fallback. Because when you're automating someone else's frontend, you plan for &lt;em&gt;everything&lt;/em&gt; to break.&lt;/p&gt;
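&lt;p&gt;The text-stability fallback, for instance, fits in a few lines of asyncio. This is a sketch, with &lt;code&gt;get_text&lt;/code&gt; standing in for however you read the response text out of the DOM:&lt;/p&gt;

```python
import asyncio

async def wait_for_stable_text(get_text, polls_needed=5, interval=1.0):
    # Strategy 3: consider the response finished once the text has been
    # identical (and non-empty) for `polls_needed` consecutive polls.
    # A real implementation would also enforce an overall timeout.
    last = None
    stable = 0
    while stable != polls_needed:
        await asyncio.sleep(interval)
        current = get_text()
        if current == last and current:
            stable += 1
        else:
            stable = 0
        last = current
    return last
```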

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F944cdyjqzgh64fwli1pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F944cdyjqzgh64fwli1pa.png" alt="Schema" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Three strategies. Because one is never enough when you're parsing someone else's DOM.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's Talk About Ethics
&lt;/h2&gt;

&lt;p&gt;Okay, real talk time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This project is for educational purposes only.&lt;/strong&gt; Full stop.&lt;/p&gt;

&lt;p&gt;I built CatGPT because I was genuinely curious about the limits of browser automation. Could you control a modern web app's frontend so completely that it becomes your backend? What security measures exist to prevent this? How good is Cloudflare's bot detection really?&lt;/p&gt;

&lt;p&gt;The answers are fascinating — and that's the whole point. This is a &lt;strong&gt;learning exercise&lt;/strong&gt;, not a production tool.&lt;/p&gt;

&lt;p&gt;Here's what I want to be crystal clear about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚫 &lt;strong&gt;Don't use this in production.&lt;/strong&gt; It's a single-browser, single-session gateway. There's an &lt;code&gt;asyncio.Lock()&lt;/code&gt; serializing every request. It's not built for scale and it's not meant to be.&lt;/li&gt;
&lt;li&gt;🚫 &lt;strong&gt;Don't use this to circumvent OpenAI's terms of service.&lt;/strong&gt; Respect the platform you're using.&lt;/li&gt;
&lt;li&gt;🚫 &lt;strong&gt;Don't use this to build commercial products.&lt;/strong&gt; Get a proper API key for that.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Do use this to learn&lt;/strong&gt; about browser automation, stealth techniques, API design, and cybersecurity.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Do use this to understand&lt;/strong&gt; how modern anti-bot systems work and why they're important.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Do use this for testing and prototyping&lt;/strong&gt; when you need a quick way to interact with ChatGPT programmatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not a hacker. I didn't find a security vulnerability. This isn't a bug — ChatGPT works exactly as designed. I'm just controlling a browser, the same way any user does. The difference is, my "user" is a Python script.&lt;/p&gt;

&lt;p&gt;OpenAI's security is &lt;strong&gt;legit&lt;/strong&gt;. Cloudflare's human verification gave me more headaches than any coding challenge I've faced. The fact that I had to build viewport jitter, human-like mouse movement, clipboard-based text injection, and a triple-layer detection system just to reliably interact with a website — that's a testament to how good their security is.&lt;/p&gt;




&lt;h2&gt;
  
  
  It's Open Source — Come Build With Me
&lt;/h2&gt;

&lt;p&gt;CatGPT-Gateway is fully open source on GitHub:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://github.com/GautamVhavle/CatGPT-Gateway" rel="noopener noreferrer"&gt;github.com/GautamVhavle/CatGPT-Gateway&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The codebase is clean, well-documented, and modular. Want to contribute? Here's what I'd love help with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More resilient selectors&lt;/strong&gt; as ChatGPT's UI evolves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming support&lt;/strong&gt; (&lt;code&gt;stream: true&lt;/code&gt; in the API)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-session support&lt;/strong&gt; — multiple browser instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better error recovery&lt;/strong&gt; — auto-retry on Cloudflare challenges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests&lt;/strong&gt; — there's always room for more tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm &lt;strong&gt;actively maintaining this project&lt;/strong&gt;. If you have feedback, ideas, bug reports, or just want to say hi — open an issue, submit a PR, or drop a comment below.&lt;/p&gt;

&lt;p&gt;This started as a Friday night "what if?" and turned into one of the most fun projects I've ever built. If you're a developer who's curious about how far browser automation can go, or a cybersecurity enthusiast who wants to understand the cat-and-mouse game between bots and detection systems — this project is for you.&lt;/p&gt;

&lt;p&gt;Star it. Fork it. Break it. Fix it. Let's learn together.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ttphr7z0hnfb22obmp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ttphr7z0hnfb22obmp5.png" alt="Star" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CatGPT-Gateway is out. Star the repo. Don't do anything I wouldn't do.&lt;/em&gt; 😼&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;If you made it this far&lt;/strong&gt; — thanks for reading. Drop a 🦄 or a 💬 and let me know what you think. If you actually run it and your ChatGPT session starts talking to your LangChain agents... don't blame me. I warned you. Also, congratulations: you can now point any agentic framework, like LangChain, AutoGen, or CrewAI, at your OpenAI-compatible endpoint.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>openai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why uv Refused to Upgrade My Package (And Why It Was Actually Right)</title>
      <dc:creator>Gautam Vhavle</dc:creator>
      <pubDate>Sat, 10 Jan 2026 17:25:33 +0000</pubDate>
      <link>https://dev.to/gautamvhavle/why-uv-refused-to-upgrade-my-package-and-why-it-was-actually-right-2ci4</link>
      <guid>https://dev.to/gautamvhavle/why-uv-refused-to-upgrade-my-package-and-why-it-was-actually-right-2ci4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I thought uv was being annoying. Turns out, it saved me from breaking my entire project. Here's the story.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb6xmh93o5u4wj1x6w67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb6xmh93o5u4wj1x6w67.png" alt="image of confused developer" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vibe Check: What This Article Is About
&lt;/h2&gt;

&lt;p&gt;So there I was, vibing with my Python backend, when a wild upgrade notification appeared. Seemed simple enough — just upgrade a package, right?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Narrator: It was not simple.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What started as a "quick 5-minute fix" turned into a deep dive into how &lt;strong&gt;uv&lt;/strong&gt; (the blazingly fast Python package manager) actually thinks. And honestly? It completely changed how I approach dependency management.&lt;/p&gt;

&lt;p&gt;Here's the journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔥 The problem that looked like a bug&lt;/li&gt;
&lt;li&gt;🧠 The mental model shift that clicked&lt;/li&gt;
&lt;li&gt;✅ The actual fix (spoiler: it wasn't forcing an upgrade)&lt;/li&gt;
&lt;li&gt;💡 The lessons that'll save you hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting the Scene: My Stack
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0azxkkpnb8drl1eemui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0azxkkpnb8drl1eemui.png" alt="relavant image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was working on a FastAPI-based backend — you know, the classic data processing API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI + Uvicorn&lt;/strong&gt; (the speedy bois)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pandas &amp;amp; numpy&lt;/strong&gt; (data wrangling essentials)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scikit-learn&lt;/strong&gt; (ML predictions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;scipy&lt;/strong&gt; (scientific computing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uv&lt;/strong&gt; for dependency management (because who has time for slow installs?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything was humming along nicely. Dependencies locked in &lt;code&gt;pyproject.toml&lt;/code&gt;, reproducible builds via &lt;code&gt;uv.lock&lt;/code&gt;. Chef's kiss. 🤌&lt;/p&gt;

&lt;p&gt;Then I saw this warning pop up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A newer version of numpy is available: 1.24.3 → 2.0.0
(pip install -U numpy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My brain: &lt;em&gt;"Cool, let's upgrade."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My project: &lt;em&gt;"lol no."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  First Attempt: The Obvious Thing
&lt;/h2&gt;

&lt;p&gt;Since I'm using uv (not pip like a caveman), I ran the proper command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv lock &lt;span class="nt"&gt;--upgrade-package&lt;/span&gt; numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And uv said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;❌ &lt;strong&gt;No solution found.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait, what? The package exists. The version is right there. Why won't you just... install it?&lt;/p&gt;

&lt;p&gt;I tried again. Same error. Cleared cache. Same error. Started questioning my life choices. &lt;em&gt;Same. Error.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lbx1e8kbtajdecig57j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lbx1e8kbtajdecig57j.png" alt="relavant image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Plot Twist: uv Isn't Being Difficult — It's Being Smart
&lt;/h2&gt;

&lt;p&gt;Here's where I had to sit down and actually &lt;em&gt;understand&lt;/em&gt; what uv is doing under the hood.&lt;/p&gt;

&lt;p&gt;See, &lt;strong&gt;uv is not pip&lt;/strong&gt;. It's built different (literally).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pip&lt;/th&gt;
&lt;th&gt;uv&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Yeets packages into your env immediately&lt;/td&gt;
&lt;td&gt;Resolves the &lt;em&gt;entire&lt;/em&gt; dependency graph first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hopes for the best&lt;/td&gt;
&lt;td&gt;Demands logical consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"It compiled, ship it" energy&lt;/td&gt;
&lt;td&gt;"This must make mathematical sense" energy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;uv doesn't upgrade packages. It upgrades &lt;strong&gt;compatibility sets&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If even &lt;em&gt;one&lt;/em&gt; dependency conflicts with another, uv refuses to proceed. It won't give you a broken environment and call it a day.&lt;/p&gt;

&lt;p&gt;This isn't a limitation. It's a feature.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Issue: I Was Living a Lie
&lt;/h2&gt;

&lt;p&gt;After actually reading the error (revolutionary, I know), the problem became crystal clear.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;pyproject.toml&lt;/code&gt; was depending on a bunch of scientific Python packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;numpy
pandas
scipy
scikit-learn
numba
matplotlib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? Just importing what I need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WRONG.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a490alkbueey8wvylz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a490alkbueey8wvylz8.png" alt="relavant Image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hidden Drama: Two Generations, One Project
&lt;/h3&gt;

&lt;p&gt;Here's what I didn't know: the scientific Python ecosystem had just gone through a &lt;strong&gt;major version transition&lt;/strong&gt; with numpy 2.0, and not everyone was on board yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔷 Legacy-Compatible Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;numba&lt;/code&gt; (JIT compiler)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scipy &amp;lt; 1.13&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Older &lt;code&gt;scikit-learn&lt;/code&gt; versions&lt;/li&gt;
&lt;li&gt;Requires: &lt;code&gt;numpy &amp;gt;= 1.21, &amp;lt; 2.0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔶 Modern Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;numpy &amp;gt;= 2.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pandas &amp;gt;= 2.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Newer &lt;code&gt;scipy &amp;gt;= 1.13&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Newer &lt;code&gt;scikit-learn &amp;gt;= 1.5&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the problem?&lt;/p&gt;

&lt;p&gt;By having both old and new packages in my project, I was basically telling uv:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Hey, I need &lt;code&gt;numpy&lt;/code&gt; to be less than 2.0 AND greater than or equal to 2.0."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;uv, being mathematically literate: &lt;em&gt;"That's... not how numbers work, bestie."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No version of &lt;code&gt;numpy&lt;/code&gt; can be both &lt;code&gt;&amp;lt; 2.0&lt;/code&gt; AND &lt;code&gt;&amp;gt;= 2.0&lt;/code&gt;. It's impossible. That's why uv said no solution exists — because &lt;strong&gt;no solution actually exists&lt;/strong&gt;.&lt;/p&gt;
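&lt;p&gt;You can see why the resolver gives up with a toy check. The version tuples below are illustrative, not real dependency metadata, but the two ranges are the ones described above:&lt;/p&gt;

```python
from operator import le, lt

def in_range(version, lo, hi):
    # True when `version` is at least `lo` and strictly below `hi`.
    return le(lo, version) and lt(version, hi)

# The two constraints the project was effectively demanding:
def legacy_ok(v):
    return in_range(v, (1, 21), (2, 0))   # what the numba side needed

def modern_ok(v):
    return in_range(v, (2, 0), (3, 0))    # what the modern stack needed

candidates = [(1, 24, 3), (1, 26, 4), (2, 0, 0), (2, 1, 0)]
solutions = [v for v in candidates if legacy_ok(v) and modern_ok(v)]
# solutions is empty: no numpy satisfies both ranges, so "No solution
# found" is the only honest answer.
```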




&lt;h2&gt;
  
  
  But Wait, pip Would've Worked!
&lt;/h2&gt;

&lt;p&gt;Yeah, about that...&lt;/p&gt;

&lt;p&gt;If I had just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pip would've:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Upgraded &lt;code&gt;numpy&lt;/code&gt; to 2.0 (yay!)&lt;/li&gt;
&lt;li&gt;😬 Silently left &lt;code&gt;numba&lt;/code&gt; with an incompatible numpy&lt;/li&gt;
&lt;li&gt;💀 Caused cryptic runtime errors when numba tried to JIT compile&lt;/li&gt;
&lt;li&gt;🎲 Left me debugging "why does my code randomly crash" for hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;pip's approach: &lt;em&gt;"You asked for this package? Here it is. Good luck with everything else lmaooo"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;uv's approach: &lt;em&gt;"I'm not letting you shoot yourself in the foot."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2c6j3bis2lbxoa5h2xv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2c6j3bis2lbxoa5h2xv.png" alt="relavant image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Making an Actual Decision
&lt;/h2&gt;

&lt;p&gt;The solution wasn't to force an upgrade or delete my lockfile or any of that hacky stuff.&lt;/p&gt;

&lt;p&gt;The solution was to &lt;strong&gt;choose a lane&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since I was building a data API and needed the latest pandas features, the &lt;strong&gt;modern numpy 2.0 stack&lt;/strong&gt; made more sense. But that meant I had to deal with &lt;code&gt;numba&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🗑️ What I Changed
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Removed the legacy-locked package&lt;/span&gt;
&lt;span class="err"&gt;numba&lt;/span&gt;  &lt;span class="c"&gt;# ← goodbye old friend, you're holding us back&lt;/span&gt;

&lt;span class="c"&gt;# Or alternatively, waited for numba's numpy 2.0 compatible release&lt;/span&gt;
&lt;span class="c"&gt;# and pinned to that specific version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ What I Kept (with updated constraints)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;numpy&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;
&lt;span class="py"&gt;pandas&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;
&lt;span class="py"&gt;scipy&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="py"&gt;scikit-learn&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="py"&gt;matplotlib&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;4.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Moment of Truth
&lt;/h2&gt;

&lt;p&gt;With a logically consistent dependency graph, I ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv lock
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And just like that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ Resolved dependencies
✓ Locked 127 packages
✓ Synced environment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No force flags. No hacks. No deleting lockfiles. No stackoverflow copypasta.&lt;/p&gt;

&lt;p&gt;It. Just. Worked. ✨&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkbrkquqnv6na3dcf6qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkbrkquqnv6na3dcf6qi.png" alt="relavant image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Brain Takeaways
&lt;/h2&gt;

&lt;p&gt;This experience taught me some real ones:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. uv's strictness is a feature, not a bug
&lt;/h3&gt;

&lt;p&gt;When uv says "no," it's because you're asking for the impossible. Listen to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Errors are signals, not obstacles
&lt;/h3&gt;

&lt;p&gt;That error message wasn't uv being annoying — it was uv telling me my project had a fundamental design issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. pip's "flexibility" is actually a trap
&lt;/h3&gt;

&lt;p&gt;Silent breakage &amp;gt; Loud error? Nah. I'll take the loud error every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Dependencies are relationships
&lt;/h3&gt;

&lt;p&gt;You can't just import everything and hope they get along. Some packages are fundamentally incompatible.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Lockfiles are your friend
&lt;/h3&gt;

&lt;p&gt;They're not friction. They're reproducibility. They're the reason your code works the same way on every machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pro Tips for the Road
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny3se4bzmyxds5i2nots.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fny3se4bzmyxds5i2nots.png" alt="random image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to avoid my pain, here's the cheat code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Do This&lt;/th&gt;
&lt;th&gt;Not This&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Treat &lt;code&gt;pyproject.toml&lt;/code&gt; as a policy document&lt;/td&gt;
&lt;td&gt;Treat it like a wish list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research if packages are compatible&lt;/td&gt;
&lt;td&gt;Just add everything you might need&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upgrade related packages together&lt;/td&gt;
&lt;td&gt;Upgrade one random package and pray&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use version ranges (&lt;code&gt;&amp;gt;=1.0,&amp;lt;2.0&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Over-pin to exact versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trust the resolver&lt;/td&gt;
&lt;td&gt;Fight the resolver&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The One-Liner That Changed My Perspective
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;If uv refuses to resolve, it's because your requirements don't describe a world that can actually exist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that again.&lt;/p&gt;

&lt;p&gt;uv isn't being stubborn. It's being honest. Your dependency graph is a specification of reality, and if that specification is contradictory, no tool can make it work — they can only pretend to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;What felt like uv being difficult was actually uv doing its job: &lt;strong&gt;protecting me from myself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The problem wasn't the tool refusing to upgrade.&lt;/p&gt;

&lt;p&gt;The problem was that &lt;strong&gt;my dependency graph was mathematically impossible&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once I fixed that — once I made an actual architectural decision instead of importing everything — it all just worked. Cleanly. Predictably. Reproducibly.&lt;/p&gt;

&lt;p&gt;That's the real value of uv. It doesn't let you live a lie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vlkq0vkvqx3jl648ttv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vlkq0vkvqx3jl648ttv.png" alt="relavant image" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Got questions? Hit me up. Happy to chat about dependency management, uv, or why pip gives me trust issues.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Stay consistent, fam.&lt;/em&gt; ✌️&lt;/p&gt;

</description>
      <category>uv</category>
      <category>python</category>
      <category>dependencyinversion</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building RAG Systems: From Zero to Hero</title>
      <dc:creator>Gautam Vhavle</dc:creator>
      <pubDate>Wed, 24 Dec 2025 17:45:50 +0000</pubDate>
      <link>https://dev.to/gautamvhavle/building-production-rag-systems-from-zero-to-hero-2f1i</link>
      <guid>https://dev.to/gautamvhavle/building-production-rag-systems-from-zero-to-hero-2f1i</guid>
      <description>&lt;p&gt;&lt;em&gt;What I learned building RAG systems from scratch—and how you can too&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Journey That Changed How I Think About AI
&lt;/h2&gt;

&lt;p&gt;Recently, I finished a comprehensive RAG &lt;a href="https://www.coursera.org/learn/retrieval-augmented-generation-rag" rel="noopener noreferrer"&gt;course&lt;/a&gt; from DeepLearning.AI, taught by Zain Hasan. Before that, I'd been learning from scattered tutorials and blog posts—completely unstructured. I thought I understood retrieval-augmented generation. I knew the theory: embeddings, vector databases, semantic search.&lt;/p&gt;

&lt;p&gt;But like most things in engineering, theory and practice are worlds apart.&lt;/p&gt;

&lt;p&gt;Since then, I've been building RAG systems as standalone personal and course projects: a customer support chatbot, a documentation search engine, and an internal knowledge assistant. Each one taught me something that reading theory alone couldn't: the messy, fascinating reality of production AI.&lt;/p&gt;

&lt;p&gt;Here's what I wish someone had told me before I started, and what I've learned along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why RAG? The Problem I Kept Running Into
&lt;/h2&gt;

&lt;p&gt;During my course, the instructor kept hammering home one point: LLMs are amazing at reasoning, terrible at remembering. I nodded along, but I didn't &lt;em&gt;really&lt;/em&gt; get it until my first project.&lt;/p&gt;

&lt;p&gt;I was building a chatbot for a company's internal documentation. Simple, right? Feed GPT-4 a question, get an answer. Except:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It hallucinated constantly.&lt;/strong&gt; Made up API endpoints that didn't exist. Confidently cited documentation sections that were never written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;It didn't know about the latest updates.&lt;/strong&gt; We could have shipped a major feature the week before. The model? Clueless.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's when RAG clicked. Instead of expecting the model to memorize everything, I'd give it a search engine. When someone asks a question, search the docs first, then feed the relevant content to the model.&lt;/p&gt;

&lt;p&gt;Suddenly: no hallucinations, always up-to-date answers, and optimized token usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the power of RAG.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is RAG? How I Explain It Now
&lt;/h2&gt;

&lt;p&gt;After building a few systems, here's how I think about RAG:&lt;/p&gt;

&lt;p&gt;Instead of asking an LLM to answer from memory (which leads to hallucinations), you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Store your documents&lt;/strong&gt; in a database that understands meaning (vector database)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When someone asks a question&lt;/strong&gt;, search for relevant documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hand those documents to the LLM&lt;/strong&gt; along with the question&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let the LLM answer&lt;/strong&gt; based on what it just read&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's like the difference between asking someone to recite a textbook from memory versus letting them look it up first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The breakthrough:&lt;/strong&gt; You separate "knowing facts" from "reasoning about facts." Update your documents, and your AI instantly knows the new information. No retraining needed, no stale knowledge, no made-up answers.&lt;/p&gt;
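&lt;p&gt;The four steps above can be sketched end to end. This is a deliberately toy version: word overlap stands in for real embeddings, and the "LLM call" is just prompt assembly — but the shape of the pipeline is the same:&lt;/p&gt;

```python
import re

def embed(text):
    # Toy "embedding" for illustration: a bag of lowercase words.
    # A real system would call an embedding model here instead.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    # Jaccard overlap as a crude stand-in for cosine similarity.
    return len(a.intersection(b)) / max(1, len(a.union(b)))

def retrieve(query, docs, k=2):
    # Step 2: rank the stored documents against the question.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=2):
    # Steps 3 and 4: hand the retrieved text to the LLM with the question.
    context = "\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our API rate limit is 100 requests per minute.",
    "Support is available by email around the clock.",
]
prompt = build_prompt("When are refunds issued?", docs, k=1)
```

&lt;p&gt;Swap the toy pieces for a real embedding model and vector database and you have the skeleton of every system described below.&lt;/p&gt;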




&lt;h2&gt;
  
  
  Foundation: Understanding the Building Blocks
&lt;/h2&gt;

&lt;p&gt;Before we build, let's ensure we're on the same page about three key concepts:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Embeddings: GPS Coordinates for Meaning
&lt;/h3&gt;

&lt;p&gt;Embeddings convert text into arrays of numbers (vectors) that capture semantic meaning. Words with similar meanings sit close together in this mathematical space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"dog" → [0.32, 0.89, -0.45, ...]
"puppy" → [0.34, 0.87, -0.43, ...]
"car" → [-0.12, 0.15, 0.78, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Embeddings let us compute "semantic similarity" mathematically. "Dog" and "puppy" are geometrically close; "dog" and "car" are far apart.&lt;/p&gt;
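&lt;p&gt;Using the toy vectors above (truncated to three dimensions just for illustration), that geometric closeness can be computed directly:&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 = same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog = np.array([0.32, 0.89, -0.45])
puppy = np.array([0.34, 0.87, -0.43])
car = np.array([-0.12, 0.15, 0.78])

print(cosine_similarity(dog, puppy))  # ~0.9995: nearly identical direction
print(cosine_similarity(dog, car))    # ~-0.30: far apart in meaning
```

&lt;p&gt;Real embeddings have hundreds of dimensions, but the math is exactly this.&lt;/p&gt;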

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh7xaq9i4k7prtyv05e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh7xaq9i4k7prtyv05e6.png" alt="Image about RAG" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Vector Similarity: Finding the Needle
&lt;/h3&gt;

&lt;p&gt;When a user asks "What's your refund policy?", we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert the question into an embedding&lt;/li&gt;
&lt;li&gt;Find documents with similar embeddings (using cosine similarity or dot product)&lt;/li&gt;
&lt;li&gt;Return the top matches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is wildly faster than reading every document. A vector database can search millions of documents in milliseconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Context Windows: The LLM's Short-Term Memory
&lt;/h3&gt;

&lt;p&gt;LLMs have limited context windows (think RAM for conversation):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-3.5: 4K tokens (~3,000 words)&lt;/li&gt;
&lt;li&gt;GPT-4: 8K-128K tokens&lt;/li&gt;
&lt;li&gt;Claude 3: Up to 200K tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; More context = slower response + higher cost. RAG is about finding the &lt;em&gt;right&lt;/em&gt; context, not &lt;em&gt;all&lt;/em&gt; context.&lt;/p&gt;
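&lt;p&gt;The word counts above come from the rough English heuristic of about 0.75 words per token. A back-of-envelope estimator (a real tokenizer like tiktoken gives exact counts):&lt;/p&gt;

```python
def estimate_tokens(text):
    # rough heuristic for English prose: ~0.75 words per token
    return int(len(text.split()) / 0.75)

doc = "word " * 3000  # a ~3,000-word document
print(estimate_tokens(doc))  # 4000: roughly a full GPT-3.5 context window
```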




&lt;h2&gt;
  
  
  The RAG Pipeline: End-to-End Architecture
&lt;/h2&gt;

&lt;p&gt;Here's how a production RAG system works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Ingestion (Building Your Knowledge Base)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Documents → Chunking → Embedding → Vector DB Storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: Collect Your Data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation (Markdown, PDFs)&lt;/li&gt;
&lt;li&gt;Internal wikis&lt;/li&gt;
&lt;li&gt;Customer support tickets&lt;/li&gt;
&lt;li&gt;Product databases&lt;/li&gt;
&lt;li&gt;Code repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Chunk It&lt;/strong&gt;&lt;br&gt;
Break large documents into smaller pieces (chunks). Why? LLMs need focused context, not entire manuals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Embed It&lt;/strong&gt;&lt;br&gt;
Convert each chunk into a vector using an embedding model (OpenAI Ada, Sentence-BERT, etc.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Store It&lt;/strong&gt;&lt;br&gt;
Index vectors in a vector database with metadata (source, timestamp, category)&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 2: Retrieval (Finding Relevant Knowledge)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Embed Query → Search Vector DB → Retrieve Top-K Chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When a user asks a question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert their question into an embedding&lt;/li&gt;
&lt;li&gt;Search your vector DB for similar chunks&lt;/li&gt;
&lt;li&gt;Retrieve the top 3-10 most relevant pieces&lt;/li&gt;
&lt;li&gt;(Optional) Rerank results for precision&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Phase 3: Generation (Creating the Answer)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query + Retrieved Context → LLM → Grounded Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Construct a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context: [Retrieved chunks]
Question: [User query]
Instructions: Answer based only on the context provided.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM generates a response grounded in your actual data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjypval488xir5tt7q68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjypval488xir5tt7q68.png" alt="Image about RAG" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My First RAG System (The One That Actually Worked)
&lt;/h2&gt;

&lt;p&gt;After the course, I wanted to build the simplest possible RAG system to prove I understood it. Here's what I came up with—about 50 lines of Python (which ran really slowly on my MacBook):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# requirements: sentence-transformers, faiss-cpu, openai
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Prepare documents
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our refund policy: 30 days, full refund with receipt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shipping takes 3-5 business days for domestic orders.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We accept Visa, Mastercard, and PayPal.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support: support@example.com or call 1-800-HELP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Create embeddings
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 384-dim embeddings
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Build FAISS index
&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Step 4: Retrieval function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 5: RAG function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve relevant docs
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create prompt
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question based only on this context:

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;chr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate response
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Test it
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rag_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long does shipping take?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Output: "Shipping takes 3-5 business days for domestic orders."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What just happened?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I embedded 4 documents using a lightweight model I could run locally (22MB download)&lt;/li&gt;
&lt;li&gt;Stored them in FAISS—this took me 10 minutes to figure out from the docs&lt;/li&gt;
&lt;li&gt;When asked about shipping, the system found the right document&lt;/li&gt;
&lt;li&gt;Fed it to GPT-3.5 to generate a natural answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt; A fraction of the token usage. Zero hallucinations. I was hooked.&lt;/p&gt;

&lt;p&gt;This tiny example taught me more than hours of coursework. Seeing retrieval work in real-time made everything click.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chunking: The Part That Took Me the Longest to Get Right
&lt;/h2&gt;

&lt;p&gt;The course covered chunking in one lecture. In practice, it took me three weeks of experimentation. Here's what I learned the hard way:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fixed-Size Chunking (Beginner-Friendly)
&lt;/h3&gt;

&lt;p&gt;Split every N tokens (e.g., 512 tokens) with optional overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_fixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Simple, predictable&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; May split mid-sentence, breaks semantic units&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; You have clean, uniform text (articles, docs)&lt;/p&gt;
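&lt;p&gt;To see what the overlap actually buys you, here's the function above re-declared with tiny numbers (word-based, as in the original; a tokenizer-based version would count tokens instead):&lt;/p&gt;

```python
def chunk_fixed(text, chunk_size=512, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))
    return chunks

text = ' '.join(f"w{i}" for i in range(10))
# chunk_size=4, overlap=1: each chunk starts with the previous chunk's last word
print(chunk_fixed(text, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9', 'w9']
```

&lt;p&gt;Note the final one-word chunk: the last window can be degenerate, which is worth filtering out before embedding.&lt;/p&gt;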
&lt;h3&gt;
  
  
  2. Semantic Chunking (Intermediate)
&lt;/h3&gt;

&lt;p&gt;Split at natural boundaries (paragraphs, sections, sentences):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Try these in order
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;long_document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Respects semantic boundaries&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Variable chunk sizes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; You have structured documents (PDFs, articles)&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Hybrid Chunking (Advanced)
&lt;/h3&gt;

&lt;p&gt;Combine approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use section headers to define chunk boundaries&lt;/li&gt;
&lt;li&gt;Keep chunks within token limits&lt;/li&gt;
&lt;li&gt;Add metadata (section title, page number)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdi0y98iy0xz2ili615x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdi0y98iy0xz2ili615x.png" alt="Chunking" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What Actually Worked for Me:
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Always use overlap&lt;/strong&gt; (I settled on 15%)—this fixed so many "half-answer" problems&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Metadata is your friend&lt;/strong&gt;—I can filter by date, source, document type&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Start with 512 tokens&lt;/strong&gt;—then adjust. I went up to 768 for technical docs&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Test with real queries&lt;/strong&gt;—what looks good in theory often fails in practice&lt;br&gt;&lt;br&gt;
⚠️ &lt;strong&gt;My biggest mistake:&lt;/strong&gt; Chunks too small (200 tokens). Context disappeared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My advice:&lt;/strong&gt; Don't overthink it at first. Use 512 tokens with 50 token overlap, then iterate based on what your users actually search for.&lt;/p&gt;
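&lt;p&gt;The metadata point deserves a concrete shape. A minimal sketch (the record layout here is hypothetical, not tied to any particular vector DB) of filtering candidates on metadata before similarity scoring:&lt;/p&gt;

```python
records = [
    {"text": "Refund policy: 30 days.", "meta": {"source": "policy.md", "year": 2025}},
    {"text": "Shipping takes 3-5 days.", "meta": {"source": "faq.md", "year": 2024}},
]

def filter_by(records, **conditions):
    # keep only records whose metadata matches every condition
    return [r for r in records
            if all(r["meta"].get(k) == v for k, v in conditions.items())]

print([r["text"] for r in filter_by(records, source="policy.md")])
# ['Refund policy: 30 days.']
```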


&lt;h2&gt;
  
  
  Retrieval: What I Wish I'd Known Earlier
&lt;/h2&gt;

&lt;p&gt;In the course, we learned semantic search (embeddings + vector similarity). In my projects, I discovered that wasn't always enough.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Dense Retrieval (Semantic Search)
&lt;/h3&gt;

&lt;p&gt;This is where I started: convert everything to vectors, find similar vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best models (as of 2025):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;text-embedding-3-large&lt;/code&gt; (OpenAI) - 3072 dims, excellent quality&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; (open source) - 384 dims, fast, good enough&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bge-large-en-v1.5&lt;/code&gt; (BAAI) - 1024 dims, top open source option&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Captures semantic meaning, handles synonyms&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Computationally intensive&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Sparse Retrieval (BM25/TF-IDF)
&lt;/h3&gt;

&lt;p&gt;Traditional keyword search. Fast, simple, explainable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_bm25&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;

&lt;span class="n"&gt;corpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;bm25&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BM25Okapi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Fast, deterministic, good for exact keyword matches&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Misses semantic similarity ("car" won't match "automobile")&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Hybrid Search (Best of Both)
&lt;/h3&gt;

&lt;p&gt;Combine dense and sparse retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get semantic results
&lt;/span&gt;    &lt;span class="n"&gt;semantic_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get BM25 results
&lt;/span&gt;    &lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bm25&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Merge with weighted scores
&lt;/span&gt;    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;merge_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;semantic_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;  &lt;span class="c1"&gt;# 0.5 = equal weight
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Top 10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
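&lt;p&gt;The &lt;code&gt;merge_results&lt;/code&gt; call above is where the blending happens. One concrete scheme (my choice here; reciprocal rank fusion is a common alternative) is to min-max normalize each retriever's scores onto the same scale, then take a weighted sum:&lt;/p&gt;

```python
def merge_results(semantic, bm25, alpha=0.5):
    # semantic, bm25: dicts of doc_id -> raw score from each retriever
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {d: (s - lo) / span for d, s in scores.items()}

    sem, kw = normalize(semantic), normalize(bm25)
    # weighted sum; a doc missing from one retriever scores 0 there
    combined = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
                for d in set(sem) | set(kw)}
    return sorted(combined, key=combined.get, reverse=True)

sem_scores = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
bm25_scores = {"a": 2.0, "b": 8.0, "c": 1.0}  # BM25 scores (different scale)
print(merge_results(sem_scores, bm25_scores))  # ['b', 'a', 'c']
```

&lt;p&gt;Normalization matters because cosine similarities and BM25 scores live on completely different scales; without it, one retriever silently dominates.&lt;/p&gt;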



&lt;p&gt;&lt;strong&gt;When to use hybrid:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users search with specific keywords (product names, codes)&lt;/li&gt;
&lt;li&gt;Domain with technical jargon&lt;/li&gt;
&lt;li&gt;You need explainable results&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dense (Semantic)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Natural language queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sparse (BM25)&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Keyword search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Production systems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Reranking: The Game-Changer I Almost Skipped
&lt;/h3&gt;

&lt;p&gt;I almost didn't implement reranking. "Initial retrieval is good enough," I thought. Then I tried it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrossEncoder&lt;/span&gt;

&lt;span class="n"&gt;reranker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrossEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cross-encoder/ms-marco-MiniLM-L-6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initial retrieval: Get top 20
&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rerank: Score each candidate against query
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Return top 5 after reranking
&lt;/span&gt;&lt;span class="n"&gt;reranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;My results:&lt;/strong&gt; Accuracy jumped from 73% to 89% on my test queries against the course's test dataset. I noticed better answers immediately.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Tradeoff:&lt;/strong&gt; Reranking added some latency, but the accuracy gain made it worth it.&lt;/p&gt;

&lt;p&gt;Reranking will be non-negotiable in my next projects.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5qmup7jleketa7lcj8l.png" alt="Retriever" width="800" height="436"&gt;
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Vector Databases: My Journey From FAISS to Production
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Where I Started: Local and Simple
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. FAISS (Facebook AI Similarity Search)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was my first choice after the course. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dead simple:&lt;/strong&gt; Got it running in 30 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free:&lt;/strong&gt; Important when you're learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast enough:&lt;/strong&gt; For my 10K documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The catch:&lt;/strong&gt; No persistence out of the box. I had to save/load the index manually. Fine for prototyping, annoying for production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatL2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Brute force, exact search
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. ChromaDB (local)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Simple, embedded mode, good for beginners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Not optimized for large scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Side projects, MVPs, local development
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where I Moved for Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;3. Qdrant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Rust-based (fast), rich metadata filtering, open source, good docs; a strong fit for latency-sensitive applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Smaller community than others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Production, performance-critical apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Milvus/Zilliz&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros:&lt;/strong&gt; Built for massive scale (billions of vectors), battle-tested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Complex setup, steeper learning curve&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use for:&lt;/strong&gt; Enterprise scale, billions of documents&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Database&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;th&gt;Scalability&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAISS&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Learning, prototypes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChromaDB&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;MVPs, small apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Free/$&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Performance-critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Milvus&lt;/td&gt;
&lt;td&gt;⭐⭐&lt;/td&gt;
&lt;td&gt;⭐⭐⭐⭐⭐&lt;/td&gt;
&lt;td&gt;Free/$$$&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Enterprise scale&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  My Recommendation Based on What I've Built:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your first RAG project?&lt;/strong&gt; → FAISS. Get something working in an afternoon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building a side project?&lt;/strong&gt; → ChromaDB. Easy persistence, good docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serious about production?&lt;/strong&gt; → Milvus or Qdrant. I've tried both; both are solid.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  RAG vs Fine-Tuning: When to Use What
&lt;/h2&gt;

&lt;p&gt;This is the million-dollar question.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;Fine-Tuning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (inference only)&lt;/td&gt;
&lt;td&gt;High ($10K-$100K+)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Requires retraining&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Days&lt;/td&gt;
&lt;td&gt;Weeks/months&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accuracy on facts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent (grounded)&lt;/td&gt;
&lt;td&gt;Good (can hallucinate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Behavior modification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interpretability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (see sources)&lt;/td&gt;
&lt;td&gt;Low (black box)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slightly higher&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use RAG when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data changes frequently&lt;/li&gt;
&lt;li&gt;You need factual accuracy with citations&lt;/li&gt;
&lt;li&gt;Budget is limited&lt;/li&gt;
&lt;li&gt;You need to explain answers (show sources)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Fine-Tuning when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to change model &lt;em&gt;behavior&lt;/em&gt; (tone, format, style)&lt;/li&gt;
&lt;li&gt;Data is static&lt;/li&gt;
&lt;li&gt;Latency is critical&lt;/li&gt;
&lt;li&gt;Budget allows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best approach?&lt;/strong&gt; Hybrid: Fine-tune for behavior, RAG for knowledge.&lt;/p&gt;
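&lt;p&gt;To make the table concrete, here's how I'd encode it as a quick decision helper. This is a sketch of a rule of thumb, not an established framework; the criteria names are my own:&lt;/p&gt;

```python
# A sketch of the RAG vs fine-tuning decision table as a rule of thumb.
# The criteria and their interpretation are assumptions, not a standard.
def choose_approach(data_changes_often: bool,
                    needs_citations: bool,
                    needs_style_control: bool,
                    budget_for_training: bool) -> str:
    """Return 'rag', 'fine-tune', or 'hybrid' from the criteria above."""
    wants_rag = data_changes_often or needs_citations
    wants_ft = needs_style_control and budget_for_training
    if wants_rag and wants_ft:
        return "hybrid"   # fine-tune for behavior, RAG for knowledge
    if wants_rag:
        return "rag"
    if wants_ft:
        return "fine-tune"
    return "rag"          # cheapest default when nothing forces a choice
```

&lt;p&gt;For fast-moving data that also needs strict style control (and a training budget), it lands on the hybrid answer above.&lt;/p&gt;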

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79dqcnrnicwxm86edpzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79dqcnrnicwxm86edpzr.png" alt="Decision" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Surprise: Small Models + RAG Beat GPT-4
&lt;/h2&gt;

&lt;p&gt;This wasn't in the course, but it's the most important thing I've learned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A well-tuned 7B model with RAG beats GPT-4 for domain-specific tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I didn't believe it until I tried it in my demo project. Here's why it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Specialized Retrieval beats General Knowledge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 knows a little about everything&lt;/li&gt;
&lt;li&gt;Your RAG knows &lt;em&gt;everything&lt;/em&gt; about your domain&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Smaller Models are Faster&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama-3 8B: ~50ms inference&lt;/li&gt;
&lt;li&gt;GPT-4: ~500ms inference&lt;/li&gt;
&lt;li&gt;10x speed improvement&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost Savings are Dramatic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   GPT-4: $0.03 per 1K tokens
   Llama-3 8B (self-hosted): $0.0001 per 1K tokens
   300x cheaper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;You Control the Infrastructure&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;No vendor lock-in&lt;/li&gt;
&lt;li&gt;Data privacy guaranteed&lt;/li&gt;
&lt;li&gt;Custom optimizations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  My Experience:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Project 1: Documentation Chatbot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Started with GPT-4: Great answers&lt;/li&gt;
&lt;li&gt;Switched to Llama-3 8B + RAG: Better answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The difference:&lt;/strong&gt; GPT-4 would paraphrase incorrectly. Llama-3 + RAG quoted exact docs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project 2: Customer Support Bot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tried GPT-4 first: Only 67% of queries handled correctly&lt;/li&gt;
&lt;li&gt;Moved to Llama-3.2 8B + RAG: 91% accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why:&lt;/strong&gt; RAG retrieved the exact support article. Model just had to summarize it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Project 3: Internal Knowledge Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Phi-3 3.8B (tiny model) + aggressive hybrid-search RAG&lt;/li&gt;
&lt;li&gt;Responses in 120ms on average&lt;/li&gt;
&lt;li&gt;I prefer it over a standalone GPT-4 system&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where I Think This Is Heading
&lt;/h2&gt;

&lt;p&gt;After three projects and countless experiments, here's what I believe:&lt;/p&gt;

&lt;p&gt;As smaller models get better (and they are—fast), RAG becomes the great equalizer. We're moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specialized beats generalized&lt;/strong&gt; for most business use cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source + RAG&lt;/strong&gt; is the default architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per query&lt;/strong&gt; drops from dollars to fractions of cents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every company&lt;/strong&gt; runs their own domain-expert AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Emerging Trends: Agentic RAG
&lt;/h3&gt;

&lt;p&gt;The next evolution is &lt;strong&gt;Agentic RAG&lt;/strong&gt; - systems that don't just retrieve and generate, but reason about &lt;em&gt;what&lt;/em&gt; to retrieve and &lt;em&gt;when&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Agentic RAG works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query Analysis:&lt;/strong&gt; Agent determines if it needs more information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Step Retrieval:&lt;/strong&gt; Performs multiple retrieval rounds, refining based on initial results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Use:&lt;/strong&gt; Can call external APIs, run code, or query structured databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Reflection:&lt;/strong&gt; Evaluates its own answers and retrieves more if unsure
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Agentic RAG flow
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agentic_rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Analyze query complexity
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;needs_multi_step_reasoning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 2: Break down into sub-questions
&lt;/span&gt;        &lt;span class="n"&gt;sub_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decompose_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Retrieve for each sub-question
&lt;/span&gt;        &lt;span class="n"&gt;contexts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sub_queries&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 4: Synthesize and verify
&lt;/span&gt;        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_with_verification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 5: If confidence is low, retrieve more
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;additional_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_with_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_final&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contexts&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;additional_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Simple single-step RAG
&lt;/span&gt;        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;simple_rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Agentic RAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better accuracy&lt;/strong&gt; on complex queries requiring multi-hop reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive retrieval&lt;/strong&gt; - only retrieves what's needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainable reasoning&lt;/strong&gt; - can show the step-by-step process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficient&lt;/strong&gt; - avoids over-retrieving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt; Agentic RAG systems have shown 30-40% improvement over traditional RAG on complex question-answering benchmarks like HotpotQA and MultiHop-RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Most companies don't need the latest GPT-5. They need their own data, smart retrieval, and a well-implemented RAG system. That's 90% of the value at 10% of the cost.&lt;/p&gt;

&lt;p&gt;This realization changed how I think about AI engineering entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Lessons: What I'm Learning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Monitoring Saved Me
&lt;/h3&gt;

&lt;p&gt;I didn't add monitoring in my first project. Big mistake. In my second project I tracked the retrieved context, and from other sources I've learned to track:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recall@K:&lt;/strong&gt; Are the right docs in top K results?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision@K:&lt;/strong&gt; What % of retrieved docs are relevant?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRR (Mean Reciprocal Rank):&lt;/strong&gt; How far down is the first relevant result?&lt;/li&gt;
&lt;/ul&gt;
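&lt;p&gt;All three retrieval metrics fit in a few lines of plain Python. A minimal sketch (function names are mine; evaluation libraries ship hardened versions):&lt;/p&gt;

```python
# Minimal retrieval metrics. `retrieved` is a ranked list of doc ids,
# `relevant` is the set of ground-truth relevant ids for the query.
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k results."""
    hits = sum(1 for doc_id in relevant if doc_id in retrieved[:k])
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0.0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

&lt;p&gt;For example, if the system returns &lt;code&gt;["d3", "d1", "d7"]&lt;/code&gt; and the relevant set is &lt;code&gt;{"d1", "d2"}&lt;/code&gt;, Recall@3 is 0.5 and MRR is 0.5 (first hit at rank 2).&lt;/p&gt;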

&lt;p&gt;&lt;strong&gt;Generation Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness:&lt;/strong&gt; Does answer align with retrieved context?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance:&lt;/strong&gt; Does answer address the question?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Time from query to response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User satisfaction (thumbs up/down)&lt;/li&gt;
&lt;li&gt;Resolution rate (for support chatbots)&lt;/li&gt;
&lt;li&gt;Cost per query
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple evaluation framework
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_rag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ground_truth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rag_system&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieval
&lt;/span&gt;    &lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ground_truth_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generation
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_faithfulness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recall&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Handling Context Overflow
&lt;/h3&gt;

&lt;p&gt;What if the retrieved context exceeds the LLM's context window?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summarize chunks&lt;/strong&gt; before passing to LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use longer-context models&lt;/strong&gt; (e.g., Claude 3's 200K window)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement multi-hop retrieval&lt;/strong&gt; (iterative refinement)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compress context&lt;/strong&gt; with extractive summarization&lt;/li&gt;
&lt;/ul&gt;
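&lt;p&gt;The simplest of these to start with is a greedy token-budget packer: keep the highest-ranked chunks that fit, drop the rest. A sketch that approximates tokens with whitespace words (a real system would count with the model's tokenizer):&lt;/p&gt;

```python
def pack_context(chunks, max_tokens):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    `chunks` must be ordered best-first; token counts are approximated
    by whitespace word count, which is a deliberate simplification.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            continue  # skip chunks that would blow the budget
        packed.append(chunk)
        used += cost
    return packed
```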

&lt;h3&gt;
  
  
  3. Cost Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Embedding costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache embeddings (don't recompute for same text)&lt;/li&gt;
&lt;li&gt;Use cheaper models for preliminary retrieval&lt;/li&gt;
&lt;li&gt;Batch embed operations&lt;/li&gt;
&lt;/ul&gt;
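&lt;p&gt;The embedding cache is a one-dict change. A sketch where &lt;code&gt;embed_fn&lt;/code&gt; stands in for whatever embedding call you use; the hashing scheme is my choice, not a library feature:&lt;/p&gt;

```python
import hashlib

_cache = {}

def cached_embed(text, embed_fn):
    """Embed `text`, reusing the stored result for identical input."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)  # only pay for new or changed text
    return _cache[key]
```

&lt;p&gt;In production you'd back the dict with Redis or a table so the cache survives restarts, but the idea is identical.&lt;/p&gt;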

&lt;p&gt;&lt;strong&gt;LLM costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use smaller models where accuracy permits&lt;/li&gt;
&lt;li&gt;Implement caching for common queries&lt;/li&gt;
&lt;li&gt;Set max_tokens to avoid runaway generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-host the embedding model (one-time cost)&lt;/li&gt;
&lt;li&gt;Use spot instances for batch processing&lt;/li&gt;
&lt;li&gt;Implement request throttling&lt;/li&gt;
&lt;/ul&gt;
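&lt;p&gt;Request throttling from that list can be as small as a token bucket. A sketch with illustrative parameters:&lt;/p&gt;

```python
import time

class TokenBucket:
    """Allow about `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should queue or reject the request
```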

&lt;h3&gt;
  
  
  4. Multi-Tenancy Patterns
&lt;/h3&gt;

&lt;p&gt;For SaaS products:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Namespace approach
&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenant_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query with filter
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenant_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Incremental Updates
&lt;/h3&gt;

&lt;p&gt;Don't rebuild your entire index daily:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add new documents
&lt;/span&gt;&lt;span class="n"&gt;new_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_new_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;last_update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;new_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Update existing documents
&lt;/span&gt;&lt;span class="n"&gt;updated_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_updated_documents&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Delete old versions, add new versions
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;updated_docs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;old_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplle9rpdhjuj5tj1irx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplle9rpdhjuj5tj1irx5.png" alt="Prod" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistakes I Made, or Almost Made (So You Don't Have To)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ⚠️ Mistake 1: Chunking Too Small
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt; Started with 128-token chunks to "maximize precision"&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What happened:&lt;/strong&gt; Retrieval found fragments without enough context. Answers were incomplete.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Bumped to 512 tokens with 15% overlap. Immediately better.&lt;/p&gt;
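&lt;p&gt;For reference, here's a sketch of chunking with overlap. I use whitespace words as a stand-in for tokens; a production version would operate on real token ids:&lt;/p&gt;

```python
def chunk_words(words, chunk_size=512, overlap_ratio=0.15):
    """Split a word list into `chunk_size` chunks with fractional overlap."""
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

&lt;p&gt;Each chunk starts before the previous one ends, so a sentence split at a boundary still appears whole in at least one chunk.&lt;/p&gt;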
&lt;h3&gt;
  
  
  ⚠️ Mistake 2: Ignoring Metadata
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt; Pure semantic search, no filters&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What happened:&lt;/strong&gt; Retrieved old documentation when new versions existed&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Added timestamp and version filters. Game changer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$gte&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚠️ Mistake 3: Not Profiling Latency
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt; Assumed "fast enough" without measuring&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What happened:&lt;/strong&gt; Users complained about 3-second response times&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profiled every step: embedding (50ms), retrieval (80ms), reranking (200ms), generation (2.1s)&lt;/li&gt;
&lt;li&gt;Optimized generation by switching models&lt;/li&gt;
&lt;li&gt;Got down to 800ms total&lt;/li&gt;
&lt;/ul&gt;
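Those per-stage numbers come from timing each step individually. A minimal sketch of that kind of profiling using only the standard library; the `time.sleep` calls are placeholders for real pipeline stages (embed, retrieve, rerank, generate):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time (ms) for one pipeline stage."""
    start = time.perf_counter()
    yield
    timings[stage] = (time.perf_counter() - start) * 1000

# Placeholder stages -- substitute your real embedding/retrieval calls.
with timed("embedding"):
    time.sleep(0.005)
with timed("retrieval"):
    time.sleep(0.008)

total_ms = sum(timings.values())
report = ", ".join(f"{k}: {v:.0f}ms" for k, v in timings.items())
print(f"{report} | total: {total_ms:.0f}ms")
```

Wrapping each stage this way makes it obvious which step dominates; in my case it was generation, which is why swapping models was the fix.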
&lt;h3&gt;
  
  
  ⚠️ Mistake 4: Trusting Retrieval Blindly
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt; Always passed top results to LLM, no quality check&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What happened:&lt;/strong&gt; When retrieval failed, LLM made stuff up&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Added confidence thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if top result is confident
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Low confidence
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have enough information to answer this.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚠️ Mistake 5: "Set It and Forget It"
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did:&lt;/strong&gt; Built the system, deployed it, moved on&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What happened:&lt;/strong&gt; After adding 5000 more documents, retrieval quality dropped 15%&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; My plan is to run evaluation tests on a regular schedule and catch degradation early.&lt;/p&gt;
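One lightweight way to catch this kind of drift is a "golden" query set checked against a recall baseline after every ingest. A sketch, where `retrieve_ids` is a stub standing in for a real retriever, and every id, query, and threshold is invented for illustration:

```python
# Hypothetical golden set: query -> id of the doc that should be retrieved.
golden_set = {
    "how do I reset my password": "doc_17",
    "what is the refund policy": "doc_42",
    "supported file formats": "doc_08",
}

def retrieve_ids(query, k=3):
    """Stub for the real retriever; returns top-k document ids."""
    canned = {
        "how do I reset my password": ["doc_17", "doc_03", "doc_99"],
        "what is the refund policy": ["doc_41", "doc_42", "doc_07"],
        "supported file formats": ["doc_55", "doc_21", "doc_13"],
    }
    return canned[query][:k]

def recall_at_k(golden, k=3):
    hits = sum(1 for q, doc in golden.items() if doc in retrieve_ids(q, k))
    return hits / len(golden)

BASELINE = 0.9  # recall@3 measured at deploy time
current = recall_at_k(golden_set)
if current < BASELINE - 0.05:
    print(f"ALERT: recall@3 dropped to {current:.2f} (baseline {BASELINE})")
```

Run it as a cron job or CI step; a drop past the tolerance band is the early-warning signal I wish I'd had.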




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do I need a vector database or can I use a traditional DB?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: For &amp;lt;10K documents, you can get away with FAISS or even numpy arrays. Beyond that, a proper vector DB gives you scalability, filtering, and performance.&lt;/p&gt;
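For the small-corpus case, brute-force cosine similarity really is enough, and it fits in a few lines of pure Python. A toy sketch with hand-made 3-dimensional "embeddings" (real vectors would come from your embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings"; in practice these come from an embedding model.
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "password reset": [0.1, 0.9, 0.1],
    "shipping times": [0.0, 0.2, 0.9],
}

def search(query_vec, k=2):
    """Score every document against the query and return the top-k names."""
    scored = sorted(doc_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]

top = search([0.85, 0.15, 0.05])
print(top)  # most similar documents first
```

This is O(N) per query, which is perfectly fine until N gets into the tens of thousands; that is the point where an ANN index (FAISS, or a vector DB) starts paying for itself.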

&lt;p&gt;&lt;strong&gt;Q: What's the minimum viable RAG system?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: 50 lines of Python (see the example above), a free embedding model, and FAISS. Total cost: $0 to start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I handle PDF extraction and preprocessing?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Use libraries like &lt;code&gt;pymupdf&lt;/code&gt;, &lt;code&gt;pdfplumber&lt;/code&gt;, or &lt;code&gt;unstructured&lt;/code&gt;. Watch out for table extraction—it's tricky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I do RAG with completely private/offline models?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Absolutely. Use &lt;code&gt;sentence-transformers&lt;/code&gt; for embeddings and &lt;code&gt;llama.cpp&lt;/code&gt; or &lt;code&gt;ollama&lt;/code&gt; for local LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about structured data (databases, spreadsheets)?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Convert to text descriptions or use hybrid approaches (SQL + RAG). Example: Generate natural language descriptions of database rows.&lt;/p&gt;
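The row-to-text idea can be as simple as a formatting function run over each row before embedding. A sketch with a hypothetical `orders` row (all column names invented for illustration):

```python
# Hypothetical row from an "orders" table.
row = {"order_id": 1042, "customer": "Acme Corp", "status": "shipped",
       "total_usd": 1299.00, "shipped_on": "2025-11-03"}

def describe_order(r):
    """Render one database row as a sentence an embedding model can index."""
    return (f"Order {r['order_id']} for {r['customer']} totals "
            f"${r['total_usd']:.2f}, is currently {r['status']}, "
            f"and shipped on {r['shipped_on']}.")

text = describe_order(row)
print(text)
```

Each generated sentence becomes one chunk, with the row's primary key stored as metadata so answers can link back to the source record.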

&lt;p&gt;&lt;strong&gt;Q: How do I know if my chunking strategy is working?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Measure retrieval metrics. If recall is low, experiment with different chunk sizes and overlap.&lt;/p&gt;
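A chunk-size experiment usually boils down to a sweep: re-chunk, re-index, re-measure. A skeleton of that loop using simple word-based chunking (the re-embedding and recall measurement are elided; here we only count the chunks each configuration produces):

```python
def chunk(text, size, overlap):
    """Split text into word-based chunks of `size` words, sharing
    `overlap` words between neighbors. Requires overlap < size."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# Toy 20-word document; real sweeps run over your actual corpus.
doc = " ".join(f"w{i}" for i in range(20))

results = {}
for size, overlap in [(4, 0), (8, 2), (8, 4)]:
    chunks = chunk(doc, size, overlap)
    # In a real sweep you would re-embed these chunks and measure
    # recall@k against a held-out query set for each configuration.
    results[(size, overlap)] = len(chunks)

print(results)
```

More overlap means more chunks (and more storage/embedding cost), so the sweep is really a recall-versus-cost trade-off, not a free lunch.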

&lt;p&gt;&lt;strong&gt;Q: Should I use multiple embedding models?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: For specialized domains (code, legal, medical), domain-specific embeddings often outperform general-purpose ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What about multi-modal RAG (images, tables, charts)?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A: Use multi-modal embedding models like CLIP (for images) or convert tables to text. It's an active area of research.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Wish I Knew Before Starting
&lt;/h2&gt;

&lt;p&gt;If I could go back and tell myself these things before building my first RAG system:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Insights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RAG separates reasoning from knowledge&lt;/strong&gt; → LLMs handle reasoning, databases handle facts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pipeline is simple:&lt;/strong&gt; Ingest → Chunk → Embed → Store → Retrieve → Generate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunking matters:&lt;/strong&gt; Start with 512 tokens, add overlap, respect semantic boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search beats pure semantic:&lt;/strong&gt; Combine BM25 + vector search for best results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small models + RAG can beat GPT-4&lt;/strong&gt; on domain-specific tasks at 1/100th the cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production is about monitoring:&lt;/strong&gt; Track retrieval quality, latency, and cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start simple, optimize later:&lt;/strong&gt; FAISS + OpenAI embeddings + GPT-3.5 gets you 80% there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic RAG is the future:&lt;/strong&gt; Multi-step reasoning and adaptive retrieval unlock new capabilities&lt;/li&gt;
&lt;/ol&gt;
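On point 4, a common way to combine BM25 and vector rankings is reciprocal rank fusion (RRF), which needs only the two ranked lists, not comparable scores. A self-contained sketch with toy document ids:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per doc.
    k=60 is the conventional constant that dampens top-rank dominance."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists from two retrievers over the same corpus.
bm25_ranked = ["d3", "d1", "d7", "d2"]
vector_ranked = ["d1", "d4", "d3", "d9"]

fused = rrf([bm25_ranked, vector_ranked])
print(fused[:3])
```

Note how `d1` and `d3`, which appear high in both lists, outrank documents that only one retriever liked; that agreement effect is exactly why hybrid search tends to beat either retriever alone.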

&lt;h3&gt;
  
  
  Why I'm Excited About This:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility:&lt;/strong&gt; I'm building production AI without a deep ML background, just curiosity and determination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Economics:&lt;/strong&gt; RAG systems cost 1-5% of what pure LLM solutions would cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; Users trust answers because they see sources. No more "did the AI make this up?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agility:&lt;/strong&gt; I can update knowledge in minutes. No retraining, no waiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Future of RAG:
&lt;/h3&gt;

&lt;p&gt;We're moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive RAG:&lt;/strong&gt; Systems that adjust retrieval strategy based on query complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic RAG:&lt;/strong&gt; Multi-step reasoning with dynamic retrieval and tool use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fusion models:&lt;/strong&gt; Architectures that blend parametric and non-parametric knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller, smarter retrievers:&lt;/strong&gt; Specialized models optimized for retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; RAG isn't just a technique—it's the architecture that makes practical, affordable, trustworthy AI possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Next Steps
&lt;/h2&gt;

&lt;p&gt;Choose your adventure based on experience level:&lt;/p&gt;

&lt;h3&gt;
  
  
  🌱 Beginners:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;✅ Implement the 50-line minimal RAG example above&lt;/li&gt;
&lt;li&gt;✅ Experiment with different chunk sizes (256, 512, 1024)&lt;/li&gt;
&lt;li&gt;✅ Try different embedding models (compare results)&lt;/li&gt;
&lt;li&gt;✅ Build a chatbot for a small document set (10-100 docs)&lt;/li&gt;
&lt;li&gt;📚 Read: &lt;a href="https://python.langchain.com/docs/use_cases/question_answering/" rel="noopener noreferrer"&gt;LangChain RAG Tutorial&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  🚀 Intermediate (I still have to implement a few of these):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;✅ Implement evaluation metrics (recall@K, faithfulness)&lt;/li&gt;
&lt;li&gt;✅ Set up a production vector DB (Weaviate or Qdrant)&lt;/li&gt;
&lt;li&gt;✅ Add hybrid search (BM25 + semantic)&lt;/li&gt;
&lt;li&gt;✅ Implement reranking&lt;/li&gt;
&lt;li&gt;✅ Experiment with smaller models (Llama-3 8B, Mistral 7B)&lt;/li&gt;
&lt;li&gt;📚 Read: &lt;a href="https://arxiv.org/abs/2312.10997" rel="noopener noreferrer"&gt;RAG Evaluation Best Practices&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ⚡ Advanced (I'm aiming for this next):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;✅ Build multi-tenant RAG system with namespace isolation&lt;/li&gt;
&lt;li&gt;✅ Implement adaptive retrieval strategies&lt;/li&gt;
&lt;li&gt;✅ Optimize for sub-100ms latency&lt;/li&gt;
&lt;li&gt;✅ Run cost analysis and optimize for $0.0001/query&lt;/li&gt;
&lt;li&gt;✅ Build agentic RAG with multi-step reasoning&lt;/li&gt;
&lt;li&gt;✅ Contribute to open source RAG frameworks&lt;/li&gt;
&lt;li&gt;📚 Read: &lt;a href="https://arxiv.org/abs/2401.15884" rel="noopener noreferrer"&gt;Advanced RAG Techniques&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  🎯 RAG Journey Checklist:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Understand embeddings and vector similarity&lt;/li&gt;
&lt;li&gt;[ ] Build minimal RAG system (FAISS + OpenAI)&lt;/li&gt;
&lt;li&gt;[ ] Implement chunking strategy for your use case&lt;/li&gt;
&lt;li&gt;[ ] Set up production vector database&lt;/li&gt;
&lt;li&gt;[ ] Add evaluation metrics&lt;/li&gt;
&lt;li&gt;[ ] Implement hybrid search&lt;/li&gt;
&lt;li&gt;[ ] Optimize for cost and latency&lt;/li&gt;
&lt;li&gt;[ ] Deploy to production&lt;/li&gt;
&lt;li&gt;[ ] Monitor and iterate&lt;/li&gt;
&lt;li&gt;[ ] Explore agentic RAG patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Resources and Further Reading
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Essential Links:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vector Databases
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Primary Options (Recommended)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://milvus.io/docs" rel="noopener noreferrer"&gt;Milvus Documentation&lt;/a&gt;&lt;/strong&gt; - Distributed vector DB with GPU support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://qdrant.tech/documentation/" rel="noopener noreferrer"&gt;Qdrant Documentation&lt;/a&gt;&lt;/strong&gt; - Production-ready vector DB with cloud SaaS option&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.trychroma.com/" rel="noopener noreferrer"&gt;ChromaDB Documentation&lt;/a&gt;&lt;/strong&gt; - Lightweight, Python-first vector database&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Embedding Models &amp;amp; Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  MTEB Leaderboard &amp;amp; Model Selection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/mteb/leaderboard" rel="noopener noreferrer"&gt;MTEB Leaderboard (English)&lt;/a&gt;&lt;/strong&gt; - Compare 100+ embedding models across 56+ tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://huggingface.co/spaces/mteb/leaderboard-multilingual" rel="noopener noreferrer"&gt;MTEB Leaderboard (Multi-lingual)&lt;/a&gt;&lt;/strong&gt; - Multilingual model comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Top Open-Source Embedding Models
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-Embedding-8B&lt;/strong&gt; - State-of-the-art (Dec 2024), multilingual support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jina-embeddings-v3&lt;/strong&gt; - 570M params, 8K context length, task-specific LoRA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NV-Embed-v2&lt;/strong&gt; - NVIDIA's top performer, #1 on MTEB (Aug 2024)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;bge-m3&lt;/strong&gt; - BAAI's versatile model, dense + sparse + multi-vector retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;arctic-embed-l&lt;/strong&gt; - Open-source, outperforms Cohere embed-v3&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Framework-Specific Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.sbert.net/" rel="noopener noreferrer"&gt;HuggingFace Sentence Transformers&lt;/a&gt;&lt;/strong&gt; - Python library for semantic embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;OpenAI Embeddings API&lt;/a&gt;&lt;/strong&gt; - Production embedding service&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Domain-Specific Embeddings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical&lt;/strong&gt;: PubMedBERT (biomedical literature), BioLORD&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finance&lt;/strong&gt;: Finance-Embeddings, BGE-Financial, Voyage Finance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt;: Code embeddings from GitHub, CodeBERT&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  RAG Frameworks &amp;amp; Orchestration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Frameworks (Production-Ready)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="noopener noreferrer"&gt;LangGraph Documentation&lt;/a&gt;&lt;/strong&gt; - Graph-based agentic workflows with stateful execution (2024+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.llamaindex.ai" rel="noopener noreferrer"&gt;LlamaIndex Documentation&lt;/a&gt;&lt;/strong&gt; - Agent-powered context augmentation with LlamaParse document parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://haystack.deepset.ai" rel="noopener noreferrer"&gt;Haystack Documentation&lt;/a&gt;&lt;/strong&gt; - v2.0+ with explicit RAG pipelines and evaluation tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://python.langchain.com/docs/" rel="noopener noreferrer"&gt;LangChain Documentation&lt;/a&gt;&lt;/strong&gt; - Rapid prototyping and experimentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Emerging &amp;amp; Specialized
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://pathway.com/developers/user-guide/llm-xpack/llm-app-pathway/" rel="noopener noreferrer"&gt;Pathway Real-time RAG&lt;/a&gt;&lt;/strong&gt; - Streaming data processing for live RAG updates (2025)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/stanfordnlp/dspy" rel="noopener noreferrer"&gt;DSPy Documentation&lt;/a&gt;&lt;/strong&gt; - Declarative pipeline programming for LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.cohere.com/docs/agents" rel="noopener noreferrer"&gt;Cohere Agent Framework&lt;/a&gt;&lt;/strong&gt; - Multi-agent orchestration with built-in tools&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Advanced RAG Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Graph-Enhanced &amp;amp; Hierarchical RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://microsoft.github.io/graphrag/" rel="noopener noreferrer"&gt;GraphRAG (Microsoft Research)&lt;/a&gt;&lt;/strong&gt; - LLM-derived knowledge graphs with v1.0 release (Dec 2024)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latest: LazyGraphRAG (Nov 2024) - cost-efficient variant without pre-summarization&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/graphrag" rel="noopener noreferrer"&gt;GraphRAG GitHub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/HKUDS/LightRAG" rel="noopener noreferrer"&gt;LightRAG Framework&lt;/a&gt;&lt;/strong&gt; - Graph-enhanced with dual-level retrieval &amp;amp; incremental updates (2024)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paper: &lt;a href="https://arxiv.org/abs/2410.05779" rel="noopener noreferrer"&gt;LightRAG: Simple and Fast Retrieval-Augmented Generation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/pdf/2502.09891.pdf" rel="noopener noreferrer"&gt;ArchRAG&lt;/a&gt;&lt;/strong&gt; - Attributed community-based hierarchical RAG (Feb 2025)&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hierarchical &amp;amp; Long-Context
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2311.04253" rel="noopener noreferrer"&gt;RAPTOR Framework&lt;/a&gt;&lt;/strong&gt; - Recursive abstractive processing for tree-organized retrieval&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implementation: &lt;a href="https://ragflow.io/blog/long-context-rag-raptor" rel="noopener noreferrer"&gt;RAGFlow's RAPTOR Implementation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/html/2510.12323" rel="noopener noreferrer"&gt;RAG-Anything&lt;/a&gt;&lt;/strong&gt; - Multimodal RAG supporting text, images, tables, equations (2024)&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Self-Correcting &amp;amp; Agentic RAG
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2310.11511" rel="noopener noreferrer"&gt;Self-RAG: Learning to Retrieve, Generate, and Critique&lt;/a&gt;&lt;/strong&gt; - Adaptive retrieval with quality scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://alphacorp.ai/top-5-rag-frameworks-november-2025/" rel="noopener noreferrer"&gt;Agentic RAG Frameworks (2025)&lt;/a&gt;&lt;/strong&gt; - Overview of LangGraph, Haystack, LlamaIndex, Pathway, DSPy&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  RAG Evaluation &amp;amp; Benchmarking
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Evaluation Frameworks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/explodinggradients/ragas" rel="noopener noreferrer"&gt;RAGAS (Retrieval-Augmented Generation Assessment)&lt;/a&gt;&lt;/strong&gt; - Reference-free RAG evaluation with metrics like context precision, entity recall, faithfulness&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;RAGAS Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=-69Fx8F9ma4" rel="noopener noreferrer"&gt;Evaluating RAG with RAGAS Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/beir-cellar/beir" rel="noopener noreferrer"&gt;BEIR Benchmark&lt;/a&gt;&lt;/strong&gt; - Retrieval evaluation across heterogeneous datasets&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Research Papers &amp;amp; Benchmarks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Original RAG Paper: Retrieval-Augmented Generation&lt;/a&gt;&lt;/strong&gt; - Foundational work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2002.08909" rel="noopener noreferrer"&gt;REALM: Retrieval-Augmented Language Model Pre-Training&lt;/a&gt;&lt;/strong&gt; - Pre-training with retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2004.04906" rel="noopener noreferrer"&gt;Dense Passage Retrieval&lt;/a&gt;&lt;/strong&gt; - Dense vector retrieval fundamentals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://arxiv.org/abs/2407.12883" rel="noopener noreferrer"&gt;BRIGHT Benchmark&lt;/a&gt;&lt;/strong&gt; - Reasoning-intensive retrieval evaluation (2024)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Research Papers (Foundational &amp;amp; Recent)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core RAG Papers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;RAG: Retrieval-Augmented Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2002.08909" rel="noopener noreferrer"&gt;REALM: Retrieval-Augmented Language Model Pre-Training&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2004.04906" rel="noopener noreferrer"&gt;Dense Passage Retrieval&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2310.11511" rel="noopener noreferrer"&gt;Self-RAG: Learning to Retrieve, Generate, and Critique&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recent Advances (2024-2025)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2410.05779" rel="noopener noreferrer"&gt;LightRAG: Simple and Fast Retrieval-Augmented Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/html/2510.12323" rel="noopener noreferrer"&gt;RAG-Anything: All-in-One RAG Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2501.11216" rel="noopener noreferrer"&gt;TigerVector: Vector Search in Graph Databases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2502.09891.pdf" rel="noopener noreferrer"&gt;ArchRAG: Attributed Community-based Hierarchical RAG&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.04253" rel="noopener noreferrer"&gt;RAPTOR: Recursive Abstractive Processing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Embedding Model Papers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.17428" rel="noopener noreferrer"&gt;NV-Embed: Improved Techniques for Training LLMs as Embedding Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2405.05374" rel="noopener noreferrer"&gt;Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2409.10173" rel="noopener noreferrer"&gt;jina-embeddings-v3: Multilingual Embeddings With Task LoRA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2509.12539" rel="noopener noreferrer"&gt;LEAF: Knowledge Distillation of Text Embedding Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Vector Database Papers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.11322" rel="noopener noreferrer"&gt;Vector Database Management Systems: Concepts &amp;amp; Challenges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2404.10413" rel="noopener noreferrer"&gt;VDTuner: Automated Performance Tuning for Vector Data Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2401.07119" rel="noopener noreferrer"&gt;Curator: Efficient Indexing for Multi-Tenant Vector Databases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Community &amp;amp; Support
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Forums &amp;amp; Discussion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.reddit.com/r/LocalLLaMA/" rel="noopener noreferrer"&gt;r/LocalLLaMA&lt;/a&gt;&lt;/strong&gt; - Open-source LLM and RAG community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://forum.langchain.com/" rel="noopener noreferrer"&gt;LangChain Fourm&lt;/a&gt;&lt;/strong&gt; - Active community support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Blogs &amp;amp; Learning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.microsoft.com/en-us/research/project/graphrag/" rel="noopener noreferrer"&gt;Microsoft GraphRAG Blog&lt;/a&gt;&lt;/strong&gt; - Official GraphRAG research updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://blog.langchain.dev/" rel="noopener noreferrer"&gt;LangChain Blog&lt;/a&gt;&lt;/strong&gt; - Framework updates and tutorials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://milvus.io/blog" rel="noopener noreferrer"&gt;Milvus Blog&lt;/a&gt;&lt;/strong&gt; - Vector DB best practices&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts from the Trenches
&lt;/h2&gt;

&lt;p&gt;A few months ago, I knew nothing about RAG. Today, having worked through the core concepts, I'm aiming to build a production RAG system that delivers accurate, grounded answers.&lt;/p&gt;

&lt;p&gt;The biggest lesson? &lt;strong&gt;RAG isn't just a technique—it's a different way of thinking about AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of "how can I make the model smarter," I now ask "how can I give it better information?" That mental shift unlocked everything.&lt;/p&gt;

&lt;p&gt;The course taught me the foundations. Building real systems taught me the craft. The gap between those two was wider than I expected, but crossing it was incredibly rewarding.&lt;/p&gt;

&lt;p&gt;If you're where I was three months ago—course completed, wondering what's next—my advice is simple: &lt;strong&gt;Build something. Ship it. Learn from real users.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your first RAG system won't be perfect. Mine wasn't. But it'll teach you more than any tutorial ever could.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The best time to start was yesterday. The second best time is now.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you're building RAG systems, I'd love to hear about your experience. What surprised you? What worked? What didn't? Drop a comment—let's learn from each other.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was written in December 2025, based on my hands-on experience building RAG systems. The field moves fast—always test and validate for your specific use case.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>rag</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
