DEV Community: Gautam Vhavle

I Ran a 2-Billion Parameter AI Model in a Browser Tab. No Server.

Gautam Vhavle — Mon, 25 May 2026 00:57:29 +0000

I ran a 2-billion-parameter language model entirely inside a browser tab.

No server. No API key. No cloud. Completely offline, Just Chrome, WebGPU, and my laptop's GPU generating tokens locally.

This was not a frontend talking to a hidden backend. The model loaded on my machine and replied at around 20+ tokens/sec on an M1 MacBook Pro.

That means private AI, no inference bill, and no waiting on someone else's backend.

The first time it worked, it felt slightly illegal.

I had spent enough time with local LLM setups to know the usual shape of the problem. Download a model. Load it through a native runtime. Watch your fans spin up. Accept that the browser is just the nice front end sitting on top of the real machine learning system.

Then I hit a question I could not let go of: if my laptop can already run the model locally, why can't the browser talk to the same GPU directly?

That question became BrowserAI. It started as a local experiment and ended up as a public app that runs 100+ models from 13+ families, entirely client-side, with no backend inference at all.

It also runs on modern phones. The UI is responsive, the app is installable as a PWA, and smaller models are surprisingly usable on mobile hardware.

If you want to try the product before reading the internals:

Live demo: BrowserAI
Code: GitHub

My machine snapshot
Device: MacBook Pro (Apple M1, 8 GB RAM)
Browser: Chrome/Brave
Model: Qwen 3.5 9B q4f16_1 (4-bit quantized)
Observed speed: ~20+ tokens/sec
Caveat: this is one personal result, not a universal benchmark

This article is the real technical story behind it: what made it possible, what was harder than expected, and how the whole stack works from first page load to streamed token.

It started with a very specific kind of curiosity

I was already deep in the usual local AI rabbit hole: llama.cpp, Ollama, quantized model files, token speed comparisons, context window tuning, all of it. Once you start running models on your own machine, your brain immediately begins asking annoying questions.

Mine was simple:

If the GPU is already here, why am I still treating the browser like a spectator?

At first I assumed the answer was some boring platform limitation. Browsers do UI. Native apps do heavy compute. End of story.

But that stopped being true the moment I started reading about WebGPU.

Then I found WebLLM from MLC-AI. That was the real turning point. WebLLM takes open models, compiles them through Apache TVM, and runs them on top of WebGPU with an API that feels suspiciously familiar if you've used OpenAI's SDK before.

I pointed it at Qwen 3.5 2B, opened Chrome, typed a message, and watched the reply stream back locally. No hidden server call. No magic API endpoint. Just my browser pushing work to my own GPU.

That was the moment the project stopped being an experiment and started becoming a product.

The whole project is really about solving three problems

Once I had the first prototype working, the architecture became much clearer. Running LLMs in the browser is not one problem. It is three separate problems stacked on top of each other:

Compute: How do you get browser code close enough to the GPU to make transformer inference practical?
Memory: How do you fit billion-parameter models into the memory a browser can realistically use?
Responsiveness: How do you do all that without freezing the UI and making the app feel broken?

BrowserAI exists because the modern browser finally has answers to all three.

Problem 1: getting real GPU compute in the browser

The foundation is WebGPU.

If WebGL was about teaching browsers how to draw, WebGPU is about teaching browsers how to compute. That difference matters a lot for AI. Large language models are mostly giant piles of matrix multiplications, and matrix multiplications are exactly the kind of work GPUs are absurdly good at.

What made this click for me was realizing that I did not need the browser to become PyTorch. I only needed it to get close enough to the GPU that a compiled runtime could use the hardware efficiently.

That is what WebGPU enables.

BrowserAI does not hand-write raw WebGPU kernels for inference. The heavy lifting is handled by @mlc-ai/web-llm, which in turn relies on Apache TVM to compile model execution for the browser. But BrowserAI does directly use the WebGPU capability layer to figure out what the device can handle before the user ever downloads a model.

Here is the key starting point:

if (!("gpu" in navigator)) {
  throw new Error("WebGPU is not available in this browser.");
}

const adapter = await navigator.gpu.requestAdapter();
const adapterInfo = adapter.info;
const features = adapter.features;
const limits = adapter.limits;

From that adapter, BrowserAI pulls the information that actually matters for model selection:

GPU vendor and architecture
Support for shader-f16
maxStorageBufferBindingSize
compute limits like workgroup size and bind group counts

The most interesting part is VRAM estimation.

WebGPU does not simply tell you, "here is your VRAM budget." So the app estimates it from maxStorageBufferBindingSize, then applies a correction for unified-memory systems like Apple Silicon. That correction matters because an M-series Mac shares RAM between CPU and GPU, so a naive estimate can be way too optimistic.

In practice, BrowserAI caps unified-memory machines to about 65% of reported system memory for GPU work, then keeps another chunk in reserve so the model has room for the KV cache during generation. That is the difference between a model that loads reliably and one that crashes after making the user wait through a 2 GB download.

This hardware-aware step is why BrowserAI can sort machines into useful tiers instead of just pretending every GPU is the same.

Tier	Approx VRAM	Realistic model range
Low	under 2.5 GB	0.5B to 1B
Medium	2.5 GB to 6 GB	1B to 4B
High	6 GB to 12 GB	4B to 9B
Ultra	12 GB+	9B and above

That is also why the default recommendation in BrowserAI is Qwen 3.5 2B q4f16_1. It is small enough to fit on a wide range of modern machines, but strong enough to feel useful immediately.

On my machine, that first setup streamed at around 20+ tokens per second. That is not a universal benchmark and I would never present it as one, but it was the moment I knew browser inference had crossed the line from novelty to product-worthy.

Problem 2: fitting a giant into browser memory

Even with GPU compute available, there is a very annoying reality check: model weights are huge.

A 2B model in full 32-bit precision is roughly 8 GB. That is already too big for a lot of browser contexts. A 9B model at fp32 is way beyond what most users can realistically load.

This is where quantization stops being an optimization and becomes the entire reason the app exists.

The core idea is simple. At inference time, you usually do not need the full precision the model had during training. So instead of storing every weight as a 32-bit float, you store compressed approximations and recover enough fidelity during computation to keep quality high.

In BrowserAI, the most important formats are:

q4f16_1: 4-bit weights with float16 accumulation
q4f32_1: 4-bit weights with float32 accumulation
q0f16: unquantized float16, used only for small models where the trade-off makes sense

The practical effect is massive.

Model	q4f16_1	q4f32_1	q0f16
Qwen 3.5 0.8B	1.6 GB	1.9 GB	2.7 GB
Qwen 3.5 2B	2.2 GB	2.6 GB	-
Qwen 3.5 4B	3.9 GB	4.7 GB	-
Qwen 3.5 9B	6.4 GB	-	-

That is the trick. Not a small trick. The trick.

Without quantization, browser inference is mostly an academic flex. With good 4-bit formats, billion-parameter models become realistic for normal hardware.

The easiest way to think about it is image compression. A raw image can be huge. A compressed image can be tiny while still looking almost identical to your eye. Quantization does something similar for model weights. It compresses the numbers hard enough to make deployment possible without wrecking the behavior that actually matters to the user.

There is still a trade-off, of course. Smaller is not free. But for a browser chat app, this is one of the best deals in modern ML engineering: dramatic memory savings, minor practical quality loss, and a product people can actually use.

One more subtle point matters here too: q4f16_1 is usually the best default, but it depends on shader-f16 support. If a machine cannot do that efficiently, BrowserAI can fall back to broader-compatibility options like q4f32_1. That is why hardware detection and quantization strategy are not separate topics. They are the same decision from two angles.

Problem 3: making it feel like an actual app instead of a frozen tab

Now comes the part most demos get wrong.

Even if the model can run locally, the experience still falls apart if the browser locks up every time the user sends a message. A responsive chat app cannot block the main thread while it downloads gigabytes of weights or generates tokens one by one.

This is where WebLLM, Apache TVM, and Web Workers all click together.

The runtime stack

At a high level, the stack looks like this:

Apache TVM compiles model execution for the browser
WebAssembly handles orchestration and control flow
WebGPU handles the heavy numerical work on the GPU
WebLLM wraps all of that in a developer-friendly runtime
Web Workers keep the inference work off the UI thread

This is what makes the project feel less like a hack and more like a real runtime.

WebLLM exposes an API that is intentionally familiar:

const chunks = await engine.chat.completions.create({
  messages: chatHistory,
  stream: true,
  stream_options: { include_usage: true },
  temperature: 0.7,
  max_tokens: 1024,
});

for await (const chunk of chunks) {
  const delta = chunk.choices[0]?.delta?.content || "";
  fullReply += delta;
  onToken(fullReply);
}

That surface is deceptively simple. Underneath it, the runtime is tokenizing input, dispatching compiled GPU work, sampling outputs, tracking usage, and streaming token deltas back to the UI.

Why the worker architecture matters

All active inference in BrowserAI runs inside a dedicated worker. The worker file is almost absurdly small:

import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();

self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};

That tiny file is enough because WebLLM does the hard orchestration. The important design choice is not the code size. It is the separation of concerns.

The main thread handles UI, input, rendering, chat state, and progress feedback.

The worker handles model loading and token generation.

That means the user can keep typing, scrolling, switching chats, or watching progress indicators while the browser is doing serious inference work in the background.

BrowserAI also uses the same pattern for background downloads. If the user wants to pre-cache another model while chatting with the current one, the app spins up a separate worker, downloads and initializes the model just long enough to get it into cache, then unloads it. That one detail makes the app feel much more intentional because users can prepare larger models without interrupting their current session.

There are even model-specific runtime quirks hidden in the plumbing. For example, Gemma 3 needs a special sliding-window attention configuration during initialization. That kind of detail does not show up in screenshots, but it is exactly the sort of thing you have to handle if you want a broad model catalog instead of a single hero demo.

The part that turns a demo into a product: caching and persistence

The first load is the tax you pay for local AI.

If a model is 2.2 GB, there is no clever copywriting trick that makes that small. The first visit can be slow. The download can be heavy. On a weak connection, it can be annoying.

The only acceptable answer is to make that pain happen once.

BrowserAI uses three storage layers for that:

Cache API for model files and weight shards
localStorage for chats, preferences, and active session metadata
sessionStorage for remembering the currently loaded model across refreshes

The model cache is the most interesting part. WebLLM stores the downloaded artifacts in browser caches, and BrowserAI inspects those caches to figure out which models are complete. It does not just look for a random file and assume success. It checks for model-specific cache entries, finds the manifest file such as ndarray-cache.json or tensor-cache.json, and verifies that the expected shards are present. If the manifest is missing, it falls back to a heuristic.

That sounds like a tiny detail, but it solves a real product problem: partial downloads. If a user closes the tab halfway through a large model, you do not want the app pretending that the model is fully ready the next time they visit.

Then there is the Service Worker.

The Service Worker in BrowserAI uses a simple, practical strategy:

navigation requests are network-first
static assets are cache-first

That means the app shell loads quickly on repeat visits, but the HTML can still stay fresh when the user is online. The result is that once the app and a model are cached, BrowserAI feels much closer to a local desktop app than a website.

There is also a nice quality-of-life piece here: when a model is successfully loaded, the app saves that model ID in session storage. Refresh the tab, and BrowserAI can auto-load the same model from cache instead of making the user start over.

This is the part many technical writeups skip, but it matters enormously. Users do not experience your stack as WebGPU plus TVM plus quantization. They experience it as: "I opened the page, downloaded a model once, and next time it was just there."

This is edge AI, just in a browser tab

At this point it is worth naming what BrowserAI actually is.

This is edge AI.

The model runs where the user already is, on the local device, close to the data and close to the interface. There is no round trip to a remote inference server for every prompt. The browser is acting as the deployment layer, but the execution model is the same core idea driving edge AI everywhere else: move intelligence to the edge instead of forcing every interaction back through the cloud.

What makes the browser version interesting is that it combines two things that usually do not coexist.

Edge execution: inference happens on the user's own hardware
Web distribution: the app is still one URL away, with no install flow, no native packaging, and no platform-specific client build

That combination changes the trade-offs in ways that are both technical and product-facing.

From a technical perspective, edge AI improves four things immediately:

Privacy and data locality. Prompts, drafts, notes, and conversations can stay on-device instead of being shipped to a vendor API by default.
Latency. You remove network round trips, server queueing, and cold-start behavior. The remaining bottleneck is local compute, which is often the better bottleneck to have for interactive work.
Offline resilience. Once the model and app shell are cached, the system can keep working on a plane, in a tunnel, or behind a flaky enterprise network.
Cost structure. There is no per-token bill and no inference fleet to operate for each user session. You pay in client hardware constraints instead of server spend.

That does not mean edge AI is automatically better. It means the bottlenecks move.

On the edge, you stop worrying as much about API throughput and start worrying about VRAM limits, thermal throttling, battery drain, storage quotas, browser support, and model footprint. Debugging also gets more complicated because the hardware matrix is messy. A fast Apple Silicon laptop, an older Windows machine, and a modern phone can all support the same product in principle while delivering very different ceilings in practice.

That is why BrowserAI spends so much effort on detection, model filtering, quantization, caching, and worker isolation. Those are not just implementation details. They are the product surface area of edge AI. If you are going to run models on user devices, you need to adapt to the device you actually got, not the one you wish you had.

I also think edge AI changes what kinds of products become worth building.

If inference can happen locally, you can justify AI features in places where cloud inference feels awkward or too expensive: private note tools, internal enterprise assistants, offline field software, education apps, developer tools, on-device copilots, kiosks, and mobile experiences that should not fall apart when connectivity does. Even when a product eventually uses a hybrid architecture, pushing some tasks to the edge can reduce cloud cost, improve responsiveness, and narrow the privacy boundary.

The bigger shift is philosophical. For the last few years, most AI products trained users to think of intelligence as something that lives in somebody else's data center. Edge AI pushes in the other direction. It says a surprising amount of useful intelligence can live on your laptop, your phone, your workstation, or your browser session.

I do not think the future is "edge everywhere, servers nowhere." Large models, shared memory, enterprise coordination, and heavy retrieval workflows will keep plenty of AI in the cloud. But I do think a lot of current cloud usage is there because the tooling was easier, not because the architecture was inherently right.

That is why projects like this matter to me. They make edge AI feel less like a research demo and more like a normal software design option.

What actually happens when you send a message

The full flow is simpler than it sounds once you flatten it:

You type a message in the chat UI.
BrowserAI appends it to local chat state and sends the formatted history to the WebLLM engine.
The engine call is proxied into a worker.
WebLLM tokenizes the prompt, runs the compiled model through WASM plus WebGPU, and streams token deltas back.
The UI updates the assistant bubble in real time and stores stats like tokens per second, total tokens, and generation time.
The finished response and its metadata get persisted to localStorage.

That last bit is one of my favorite parts of the product. BrowserAI does not just stream text. It surfaces runtime information too. After a generation finishes, the app can show tokens generated, prompt size, generation speed, and context usage. That makes the whole system feel transparent instead of mystical.

And that transparency matters, because browser AI still feels unbelievable to a lot of people the first time they see it.

The constraints are real, and saying that makes the project stronger

One thing I wanted this project to avoid was fake magic.

Browser AI is real. It is also constrained.

Here are the honest limits:

The first download is heavy. A local model that feels instant on later visits still has to arrive somehow.
WebGPU support is not universal. Modern Chrome, Edge, and recent Safari builds are the target. Older browsers are a hard stop.
It runs on mobile too, but mobile still has tighter limits. Modern phones can run the app, especially with smaller models, but thermals, memory ceilings, and browser support still make laptops and desktops the better experience for heavier use.
Bigger models still need real VRAM. Quantization helps a lot, but physics still wins.
Performance is hardware-dependent. Tokens per second varies wildly across machines.

In my experience, being explicit about these limits does not make the project less impressive. It makes it more credible.

The exciting part is not that local browser AI is magically free of constraints. It is that those constraints are now reasonable enough to build a product around.

Why I think this matters

What started as a curiosity project became a much bigger idea for me.

If the browser can do real GPU compute, if open models can be quantized aggressively enough to fit consumer hardware, and if runtimes like WebLLM can package all of that into something usable, then the browser is no longer just a thin client for AI.

It becomes a real runtime.

That changes the product equation in a few important ways:

private inference becomes easier
API costs disappear for many use cases
offline AI starts feeling practical
distribution gets much simpler because the browser is already the platform

That does not mean servers go away. It means we finally get to choose which workloads belong on the server and which ones are perfectly happy living on the device.

BrowserAI is my attempt to push that idea into something people can actually touch.

If you try it, start with Qwen 3.5 2B. It is the best first impression of the stack: small enough to load on ordinary hardware, fast enough to feel alive, and strong enough to make the whole approach click.

Open Chrome. Visit BrowserAI. Pick a model. Let it download once. Then send a prompt and watch your own machine do the work.

That feeling is the whole point.

If you try it, tell me your setup: chip or GPU, browser, model, and tokens/sec. I'm especially curious which low-end machines surprise me. The code is open source if you want to inspect the internals or build on top of it.

⭐ GitHub Repo

My JSON Was Too Big for My AI, So I Built an MCP Server to Fix It

Gautam Vhavle — Mon, 16 Mar 2026 21:29:42 +0000

It was late night. My eyes were burning. I had a JSON file staring back at me. A space missions database (example). 1620+ lines. 7 missions, each nested 5 levels deep. Personnel records inside crew arrays inside missions inside a database meta object. Spacecraft specifications with Ion-Plasma Drive stats. Payloads marked as "Class-IV Radiation Hazard." Budgets in the billions.

It was the most beautifully structured chaos I'd ever seen.

I thought, "Let me just ask my AI to analyze this."

So I pasted the whole thing into the chat. Hit enter. Waited. And then, the message I'd been dreading:

⚠️ Context limit exceeded. I was using a small open-source LLM

My LLM couldn't even see the whole file. And the worst part? I didn't need all of it. I just wanted one number: the total budget across all 7 missions. One number, buried 4 layers deep inside missions[*].budget_credits. That's it.

I sat there for a second. Then I thought about all the times I'd hit this exact wall before. Huge API responses from clients, MongoDB exports with 10,000 nested documents, config mega-files that no human should ever have to scroll through. Every time, the same story: JSON too big, AI too small.

I stared at the screen. Then I did what any sane developer would do.

I built something.

Why I Couldn't Just "Write a Script"

Now, I know what you're thinking. "Just write a Python script, bro."

And sure, I could:

import json

with open("missions.json") as f:
    data = json.load(f)

total = sum(m["budget_credits"] for m in data["missions"])
print(f"Total budget: ${total:,.2f}")

Six lines. Done. Right?

Except... I don't always know what I'm looking for. Sometimes I'm exploring. Sometimes I get a JSON from a client with zero documentation and I just need to poke around. "What keys exist? What's nested under config? How many items have status: active?" That exploratory, conversational flow is exactly what AI assistants are supposed to be good at.

I could use jq. I could fire up pandas. I could write a new throwaway script every single time. But all of those require me to already know the structure, already know the question, and essentially do the AI's job for the AI.

That defeated the whole purpose. I didn't want to write code to understand my data. I wanted to talk to it.

So the question became: What if my AI didn't need to read the entire file? What if, instead of choking on 620 lines of JSON, it could surgically extract exactly the pieces it needed?

I figured out, we have to use a MCP solution here.

Enter MCP — Giving Your AI Hands, Not Just Eyes

If you haven't heard of MCP (Model Context Protocol) yet, here's the simplest way I can explain it:

Instead of making your AI read an entire library, you give it a librarian.

MCP lets you give your AI tools. Small, focused functions it can call on its own. The AI decides which tool to use, calls it, reads the result, and decides its next move. It's like upgrading your assistant from a speed reader (who runs out of pages) to a detective with the perfect toolkit.

Here's the paradigm shift that changed everything for me:

❌ Without MCP: Paste 620 lines of JSON → "Context limit exceeded"

✅ With MCP: AI calls load_json → gets metadata (47KB, 3 top-level keys) → calls sum_values on $.missions[*].budget_credits → "$18,250,000,000.50"

The AI never saw the full file. It loaded it into memory, used a surgical tool to extract the number I needed, and gave me the answer. Four tool calls. Zero context overflow. That was my lightbulb moment.

And that's exactly what I built.

What I Built: Universal JSON Agent MCP

Talk to your JSON files using natural language.

26 tools. 9 categories. One install command. Zero config headaches.

uv tool install universal-json-agent-mcp

That's it. You now have an MCP server that works with VS Code / GitHub Copilot, Claude Desktop, Cursor, or literally any MCP-compatible client.

But numbers don't tell the story. Let me show you what it actually feels like to use this. Remember that space missions database I was struggling with?

You:   Load missions.json and tell me what's inside.

AI:    ✅ Loaded "missions" — object with 3 top-level keys,
       file size 47.2 KB.
       Keys: database_meta, missions, global_notes

You:   What's the structure look like?

AI:    database_meta (object)
         ├─ version (string)
         ├─ total_missions (number)
         ├─ contact (object)
         │    ├─ director (string)
         │    └─ hotline (string)
       missions (array of 7 objects)
         ├─ codename (string)
         ├─ status (string)
         ├─ budget_credits (number)
         ├─ spacecraft (object) → designation, specifications…
         └─ personnel (object) → commander, crew[]…
       (Used: get_structure)

You:   How many missions are there?

AI:    7 missions total.
       (Used: count)

You:   What's the total budget across all missions?

AI:    The total budget across all missions is $18,250,000,000.50
       (Used: sum_values on $.missions[*].budget_credits)

You:   Show me only the in-progress ones, sorted by priority.

AI:    Found 3 missions with status "in_progress":
       1. Stellar Vanguard  — priority 1, budget $4.5B
       2. Deep Core Echo    — priority 2, budget $3.2B
       3. Nebula Drift      — priority 4, budget $1.8B
       (Used: filter_objects → sort_by → pick_fields)

You:   Export that to CSV.

AI:    ✅ Exported 3 missions to in_progress_missions.csv
       (Used: export_csv)

Six natural language questions. Zero scripts. The AI figured out which of the 26 tools to chain together (load, explore, count, aggregate, filter, sort, pick fields, and export) all by itself.

That's the magic. I didn't build one smart tool. I built 26 small, dumb tools and let the AI be the smart one.

The Toolkit — A Quick Tour

I'm not going to bore you with a feature table (that's what the README is for). Instead, let me show you when you'd reach for each of these:

🔓 Load it — You've got a JSON file. Maybe it's on your desktop, maybe it's in a project folder. You tell the AI to load it. Boom, it's in memory with full metadata: file size, root type, how many top-level keys. You can load multiple files and switch between them mid-conversation.

🔍 Explore it — You just received a 50MB JSON dump from a client. No documentation. No schema. You don't even know what keys exist. This is where get_structure becomes your flashlight in the dark. It shows you a clean skeleton of the entire file: keys, types, nesting depth. All without loading a single data value into your chat. distinct_paths goes even further and maps out every single leaf path in the document, even inside nested arrays. Think of it like turning on all the lights in a building you've never been in.

🎯 Query it — Now you know what's inside. Time to ask questions. Full JSONPath support ($.missions[*].codename), regex-powered text search across all string values, and smart filtering with 8 operators (equals, greater than, contains, regex match, and more). You can say "find all personnel whose role contains 'engineer'" and get precise results, without ever reading the parts of the file you don't care about.

📊 Crunch it — This is where it gets really fun. Sum all budgets. Count active missions. Get min/max priority levels. Run a full statistical breakdown (mean, median, standard deviation, percentiles) on any numeric field. Even get a frequency table, like value_counts() in pandas, but through a conversation. All the number-crunching happens server-side; your context window stays clean.

🔄 Transform it — Flatten nested objects into dot-notation key-value pairs. Pick only the fields you care about. Group missions by status. Sort by priority, descending. Reshape the data into exactly the view you need before asking the next question.

📤 Export it — Happy with the filtered, sorted, transformed result? Export to CSV or JSON with one sentence. Done. Ship it.

Here's the design philosophy I kept coming back to: each tool does one small thing. The AI chains them together to build complex analyses. Kind of like UNIX pipes, but instead of |, it's intelligence connecting the tools.

That idea (small tools, composed by an AI) turned out to be way more powerful than any "one mega-tool that does everything" approach I could have built.

Under the Hood: The Design Decisions I'm Proud Of

I'll keep this section focused. This is a story, not a textbook. But there are a few engineering choices that made this whole thing work, and I think they're worth sharing.

Smart Truncation — The Core Innovation

This was the entire point of the project, so let me explain why it matters.

When you ask get_value on the root of a 620-line JSON, you could get 620 lines dumped into the AI's context. That's exactly the problem we're solving, right? So instead, every single tool response is automatically capped at ~10KB. If the result is too large, it gets cleanly truncated:

"Array with 347 items — showing first 50. Use filters or paths to narrow results."

The AI reads that message, understands it needs to be more specific, and adjusts its next query. It never gets overwhelmed, no matter how enormous the underlying data is. This one design decision, aggressive truncation with helpful hints, is what makes it possible to work with JSON files of any size through a context-limited LLM.

The Store Pattern

When you load a JSON file, it goes into an in-memory store, not into the AI's chat. The AI gets back metadata (file size, root type, number of keys, estimated memory usage), and from that point on, every tool works against the stored data. Load once, query a hundred times. You can even load multiple files and run comparisons across them.

Fail-Safe Tool Responses

LLMs use tools in loops. They call a tool, read the result, decide the next step. If a tool throws an unhandled exception, the entire reasoning loop can break. So every tool in this project catches errors gracefully and returns a readable error message instead of crashing:

def _safe(fn, *args, **kwargs) -> str:
    try:
        return fn(store, *args, **kwargs)
    except Exception as exc:
        return f"Error: {exc}"

The AI sees "Error: Key 'budgt' not found. Available keys: budget_credits, budget_cycle", adapts, corrects the typo, and continues. No crash. No broken loop. The conversation just keeps going.

The UNIX Philosophy, Applied to AI

This is the architectural insight I keep coming back to. I didn't try to build one omniscient JSON-processing mega-tool. Instead, I built 26 tiny tools that each do exactly one thing. The AI figures out how to chain them together.

load_json loads. get_structure explores. filter_objects filters. sum_values sums. export_csv exports. That's it. Each one is stupid simple. But when an AI strings together load → structure → filter → sort → pick_fields → export? That's when the magic happens.

The composability is the whole point. And it makes the codebase dead simple to extend. Adding a new tool is about 50 lines of code.

My First PyPI Package — The Terrifying, Beautiful, Amazing Part 🎉

Okay, can we just... pause for a second?

This was my very first Python package. Ever. Published. On PyPI. Where actual developers go to install actual packages. And now mine is sitting there next to requests and flask and numpy (okay, maybe not next to them, but on the same website and that counts).

There's something weirdly emotional about naming a package. You're not just picking a variable name that lives in one file. You're picking a name that will live on the internet. Forever. People will type this name. Into their terminals. With their own fingers.

I went back and forth. Too generic? Someone already took it. Too clever? No one will find it. I finally landed on universal-json-agent-mcp and thought, "Yeah. That's the one. That says exactly what it does."

And then I hit publish.

I remember staring at the PyPI page after it went live, just... refreshing it. Like a kid checking if their YouTube video got views yet. The badges were there. The description was there. My name was there. My name. On PyPI. On the same platform where requests and numpy live.

Then I opened a fresh terminal on a completely clean machine. Typed:

pip install universal-json-agent-mcp

Watched the progress bar. It downloaded. It installed. I ran it. It worked.

I'm not going to pretend I was cool about it. I wasn't. If you've been sitting on a project, thinking "it's not ready" or "no one will use it"... just ship it. Seriously. The first version doesn't have to be perfect. It just has to exist.

Works Everywhere — Zero Lock-In

One of my non-negotiables from day one was universal compatibility. I didn't want to build a VS Code-only tool or a Claude-only plugin. If your AI editor speaks MCP, this should just work.

Here's the entire setup for VS Code, and it's similarly simple for Claude and Cursor:

// .vscode/mcp.json
{
  "servers": {
    "universal-json-agent": {
      "type": "stdio",
      "command": "universal-json-agent-mcp"
    }
  }
}

That's it. Eight lines. Restart your editor, and every JSON file on your machine is now queryable through natural language. No API keys needed for the MCP server itself, no cloud dependency, no vendor lock-in. It runs locally, on your machine.

Bonus Chapter: The Web Server

But what about when you're not in an editor?

What if you want to query JSON from a dashboard, a CI pipeline, or a custom internal tool? That's where the bonus FastAPI + LangChain web server comes in.

It's an optional subproject bundled in the same repo. Upload a JSON file, ask a question in plain English, get an answer back, along with which tools were used. Under the hood, it runs a full LangChain ReAct agent. The same reasoning loop that Copilot and Claude use, but wrapped in a REST API:

curl -X POST http://localhost:8000/query \
  -F "file=@data/missions.json" \
  -F "query=What's the total budget?"

{
  "answer": "The total budget across all missions is $18,250,000,000.50",
  "tools_used": ["load_json", "sum_values"]
}

The agent's system prompt is specifically tuned to use bulk extraction patterns. It knows to use JSONPath wildcards like $.missions[*].budget_credits instead of iterating one-by-one. Same 26 tools, same core logic, zero code duplication. Just a different interface.

Perfect for building quick dashboards or integrating JSON analysis into automated workflows.

Who Is This For?

Honestly? If you've ever stared at a JSON file and thought "this is too much"? It's for you.

But if you want specifics:

Backend developers debugging massive API responses at 11 PM, trying to figure out why that one nested field isn't what the docs say it should be
Data engineers who just got a JSON export with zero documentation and need to figure out its structure before writing a pipeline
Anyone who's ever pasted a JSON into ChatGPT, Claude, or Copilot and hit the context limit? Never again
Students and learners who want to explore datasets conversationally without writing boilerplate ETL code
Teams building internal tools who need a quick JSON analysis layer without spinning up a whole data stack

If you work with JSON (and let's be honest, who doesn't in 2026), this tool has a place in your workflow.

Final Thoughts

That late night frustration didn't just produce a tool. It produced my first PyPI package.

I built this because I needed it. No one asked me to. I had a problem. My JSON was too big for my AI, and I refused to accept that as a limitation. What started as a late-night hack turned into something with proper architecture, hundreds of tests, a published package, and support for every major AI editor.

If there's one thing I've learned, it's that the best tools come from scratching your own itch. And now this one's out there for anyone who's ever stared at a monster JSON and thought, "There has to be a better way."

Your JSON is too big? Not anymore.

⭐ If this resonated with you, give the repo a star.

GitHub · PyPI · Report Issues

I Reverse Engineered ChatGPT's UI Into an OpenAI Compatible API and Here's Why You Shouldn't

Gautam Vhavle — Thu, 19 Feb 2026 14:21:04 +0000

A weekend project that wasn't supposed to work. But it did. And now we need to talk about it.

Let me be upfront: this project exists for educational purposes only. What you're about to read shouldn't be replicated in production, shouldn't be used to skirt terms of service, and honestly shouldn't work as well as it does.

But here we are.

I took ChatGPT's entire web frontend the same UI you and I use every day and reverse-engineered it into my backend. Browser automation, stealth patches, Cloudflare bypass, clipboard hijacking, DOM scraping, virtual displays. The result? A fully OpenAI-compatible REST API that any SDK, any LangChain agent, any curl command can hit as a drop-in replacement. Tool calling, image generation, file uploads, vision, all of it using your own account and subscription with the latest Model.

The project is called CatGPT-Gateway I'll attach the Github link below for reference. I need to tell you how it works.

because understanding what's possible is the first step to building better defenses against it.

⚠️ Before we go any further: This project is built purely for educational purposes and cybersecurity research. I'm not a hacker. This isn't a bug. I'm just a curious developer who wanted to understand the limits of browser automation. Stick around, we'll talk ethics at the end.

The Origin Story: A Developer's Frustration

Here’s what happened.....

I signed into OpenAI using my regular account while building an application that required API calls for testing. I needed to review pricing, generate API keys, and wire everything into a fairly complex LangChain workflow I was developing.

At the same time, I was working on another project that required repeated image generation. Not just one or two images, several iterations for testing and refinement.

So I paused.

ChatGPT could already generate images directly inside the chat interface. It could also produce structured responses to my prompts with context awareness.

That raised a technical question.

What if the chat interface itself could be automated? Not as a workaround, and not to avoid API usage or pricing, but as a pure engineering experiment. Could a conversational UI be programmatically driven? Could it behave like an interaction layer that mimics API semantics? What architectural differences would emerge between UI automation and formal API integration?

The curiosity wasn’t financial. It was structural.

I wanted to understand system behavior, automation mechanics, and the boundaries between user-facing interfaces and backend-accessible endpoints.

It was supposed to be a quick weekend hack, the kind you start at 11 PM on a Friday, fully expecting it to collapse within the hour.

It didn’t. And that’s where things got interesting.

Every great terrible idea starts at 11 PM with too much coffee.

How Is This Even Possible?

Let me break it down. The architecture is simple on paper, absolutely unhinged in execution.

External Clients (curl, Python, LangChain)
          │
          ▼
    ┌───────────┐
    │  FastAPI  │ ← OpenAI-compatible API (port 8000)
    │  Server   │
    └─────┬─────┘
          │
          ▼
   ┌──────────────┐
   │ ChatGPTClient│ ← Sends messages, waits for responses,
   │              │   extracts text via clipboard/DOM
   └──────┬───────┘
          │
          ▼
  ┌───────────────┐
  │BrowserManager │ ← Patchright (stealth Playwright fork)
  │+ Stealth      │   controlling a real Chrome instance
  │+ Human Sim    │   with anti-detection patches
  └───────┬───────┘
          │
          ▼
  ┌───────────────┐
  │  Xvfb + VNC   │ ← Virtual display + VNC viewer
  │  (port 6080)  │   (the magic trick — more on this later)
  └───────────────┘

Here's what's happening:

You hit the API — standard OpenAI-format request, POST /v1/chat/completions
FastAPI receives it and hands it to the ChatGPT client
The client literally types your message into ChatGPT's chat input box using a headful Chrome browser
It waits for the response by watching for ChatGPT's copy button to appear (that's how it knows the response is complete — clever, right?)
It clicks the copy button, grabs the text from the clipboard, and returns it as a proper OpenAI-format JSON response

Your LangChain app, your agent framework, your curl script — they all think they're talking to OpenAI. They're actually talking to a cat controlling a browser.

The Cloudflare Boss Fight

Okay, here's where it gets interesting. You can't just puppeteer.launch() and waltz into ChatGPT. OpenAI uses Cloudflare's human verification — and modern Cloudflare is really good at detecting bots.

I want to genuinely appreciate OpenAI and Cloudflare here. Their security gave me so many blockers and roadblocks that I almost gave up multiple times. We're talking:

navigator.webdriver detection — the first thing any anti-bot checks
Canvas fingerprinting — headless browsers have different rendering signatures
WebGL fingerprinting — GPU-level identification
Plugin enumeration — headless Chrome reports different plugins
Behavioral analysis — instant typing? Pixel-perfect clicks? That's not human

Every single one of these had to be defeated. Not by exploiting a vulnerability — but by making the browser more human.

Actual footage of me vs. Cloudflare's bot detection (dramatized).

Here's what CatGPT does to pass as human:

Stealth Patches — Using Patchright (a Playwright fork built for stealth) combined with playwright-stealth, CatGPT patches the browser fingerprint at every level. navigator.webdriver returns false, canvas renders match real browsers, WebGL reports are spoofed.

Human Simulation — Messages aren't typed character-by-character (that's detectable too). Instead, CatGPT uses clipboard-paste injection with randomized delays. Mouse movements happen during "thinking" pauses — idle cursor drifts to random positions with 5-15 intermediate steps, like a bored human.

Viewport Jitter — Every launch randomizes the viewport by ±20px from the base 1280×720. No two sessions have exactly the same fingerprint.

The Docker DNS Bug Discovery — This one's wild. I found that calling Playwright's add_init_script() — even with just a console.log("") — completely breaks Chrome's DNS resolution inside Docker containers. Every navigation after that returns ERR_NAME_NOT_RESOLVED. The fix? Inject all stealth JavaScript via page.evaluate() at runtime and re-inject on every framenavigated event instead. Days of debugging. One line fix. Classic.

The VNC Trick — Schrödinger's Browser

This is the part I'm most proud of. This is what makes CatGPT fundamentally different from every other browser automation project I've seen.

The problem: Cloudflare can detect headless Chrome. So you need headful Chrome (a real browser window). But servers don't have monitors. So how do you run a "headed" browser on a headless server?

The solution: You give it a fake monitor.

Chrome (headful mode)
    renders to →  Xvfb (Virtual Framebuffer — fake display :99)
                      captured by →  x11vnc (VNC server)
                                        served via →  noVNC (WebSocket, port 6080)
                                                          viewable in →  Your browser

This was the most shocking part for me too.

Here's the beautiful paradox:

To Cloudflare: It's a real, headed, GPU-rendering Chrome browser with a proper display. Nothing to see here, human user browsing normally.
To your server: It's a headless process. No monitor needed. Runs in Docker. Fully automated.
To you: You can open http://localhost:6080 in your browser and watch the automation happen in real-time through VNC. Debug visually. Handle CAPTCHAs manually if needed. Sign in through the actual browser.

It's headed AND headless at the same time. Schrödinger's browser.

Headed for Cloudflare. Headless for your server. Both at the same time.

Four processes run in the Docker container, managed by supervisord:

Process	What It Does	Port
Xvfb	Virtual framebuffer — the "fake monitor"	Display `:99`
x11vnc	VNC server — captures the virtual display	5900
noVNC	WebSocket bridge — makes VNC browser-accessible	6080
FastAPI	The actual API server	8000

One docker compose up and you've got the whole stack.

OpenAI-Compatible Endpoints — Drop-In Replacement

This is where the software engineer in me got excited. The API isn't just "some endpoint that returns text." It's a full OpenAI-compatible API. Pydantic schemas matching OpenAI's spec exactly. You can point any OpenAI SDK at it and it just works.

# It's literally the same format as OpenAI's API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-token" \
  -d '{
    "model": "catgpt-browser",
    "messages": [{"role": "user", "content": "Explain quantum computing in 3 sentences"}]
  }'

What you get back:

{
  "id": "chatcmpl-abc123...",
  "object": "chat.completion",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing uses quantum bits (qubits)..."
    },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 12, "completion_tokens": 45, "total_tokens": 57 }
}

Standard OpenAI response format. id, choices, usage — everything.

Tool / Function Calling

Yeah, it supports tool calling too. Since we're automating a browser (not hitting an API with native function calling), CatGPT uses a clever technique: it builds a system prompt with tool definitions and few-shot examples, instructing ChatGPT to output structured JSON. Then it parses that JSON with regex and returns proper tool_calls in the response.

# Works with LangChain / LangGraph out of the box
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-token",
    model="catgpt-browser"
)

# Bind tools, create agents — it all works
llm_with_tools = llm.bind_tools([get_weather, calculate])

Any modern agent framework — LangChain, LangGraph, CrewAI, AutoGen — just point the base_url at CatGPT and you're golden.

Wait, It Can DALL-E Too?!

This was the feature that started the whole project, remember? I needed image generation for testing, and there was no straightforward API button for developers who sign in with OpenAI.

So how does CatGPT handle POST /v1/images/generations?

Your request comes in — "prompt": "A sunset over mountains"
CatGPT sends it to ChatGPT as a chat message, with hints about size and quality
ChatGPT invokes DALL-E internally and renders the image in the chat
The detector watches the DOM for img[alt="Generated image"] or images inside div[id^="image-"] containers
The image is downloaded using the browser's own fetch() API — this is key because it preserves the authentication cookies. No separate auth needed.
Returned to you as either base64 JSON or a URL, matching OpenAI's image response format

# Generate an image — same format as OpenAI's DALL-E API
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-token" \
  -d '{
    "prompt": "An adorable orange tabby kitten astronaut floating in space",
    "size": "1024x1024",
    "quality": "hd"
  }'

The response includes a revised_prompt — the actual prompt DALL-E used internally — and the image data. Works with the OpenAI Python SDK too:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-token")
response = client.images.generate(
    model="dall-e-3",
    prompt="A cat hacking into a computer, cyberpunk style",
    size="1024x1024"
)

I never thought this would actually work. Detecting dynamically-generated images in someone else's DOM, downloading them through the browser's auth context, and returning them in a standard API format? It felt impossible until it wasn't.

The Cyberpunk TUI — Because CLIs Should Look Cool

I couldn't just ship an API. I had to build a terminal UI. You know, for vibes.

CatGPT comes with a full-screen Textual-based TUI — a cyberpunk-themed chat interface running right in your terminal:

GitHub Dark color scheme — deep #0d1117 backgrounds, #58a6ff accent blues, #3fb950 greens
ASCII cat splash screen — because every good project needs one
Color-coded messages — blue borders for your messages, green for the assistant, purple for images
Rich Markdown rendering — code blocks, tables, lists, all rendered beautifully
Slash commands — /new, /threads, /images, /status, /help
Keyboard shortcuts — Ctrl+N for new chat, Ctrl+T for threads

CATGPT TUI: Actual functional conversation.

The TUI: because talking to AI should feel like you're in a sci-fi movie.

The Simplicity — Step by Step

Here's what amazes me about how this came together. Despite all the complexity under the hood — stealth patches, virtual displays, DOM observers, clipboard extraction — using it is dead simple:

# 1. Clone it
git clone https://github.com/GautamVhavle/CatGPT-Gateway.git && cd CatGPT-Gateway

# 2. Start everything
docker compose up --build -d

# 3. Sign in once via VNC
#    Open http://localhost:6080 → Log into ChatGPT → Done

# 4. Hit the API
curl -H "Authorization: Bearer dummy123" \
     http://localhost:8000/v1/chat/completions \
     -d '{"model":"catgpt-browser","messages":[{"role":"user","content":"Hello!"}]}'

Four steps. One Docker container. No API keys. Your browser session persists across restarts via a Docker volume.

The project is structured cleanly too — every component has its own module:

src/browser/ — Browser lifecycle, stealth, human simulation
src/chatgpt/ — ChatGPT client, response detection, image handling
src/api/ — FastAPI routes, OpenAI-compatible schemas
src/cli/ — Terminal UI
src/selectors.py — All DOM selectors centralized in ONE file. When ChatGPT updates their UI, you update one file. That's it.

The selector fallback system is one of my favorite design decisions. Every selector (chat input, send button, copy button, etc.) is a list of fallbacks. If ChatGPT changes a data-testid, the next selector in the list catches it. Resilient by design.

The Response Detection — Three Strategies Deep

How do you know when ChatGPT is done generating its response? This was one of the hardest problems.

CatGPT uses a three-strategy detection system:

Copy Button Detection (primary) — ChatGPT only shows the copy button after the full response is generated. Count the copy buttons before sending, wait for a new one. Elegant and reliable.
Stop Button Detection (fallback) — Watch the stop/generation button. When it appears, streaming started. When it disappears, response is done.
Text Stability (last resort) — Poll the response text every second. If it's identical for 5 consecutive polls, we're done.

Three layers of fallback. Because when you're automating someone else's frontend, you plan for everything to break.

Three strategies. Because one is never enough when you're parsing someone else's DOM.

Let's Talk About Ethics

Okay, real talk time.

This project is for educational purposes only. Full stop.

I built CatGPT because I was genuinely curious about the limits of browser automation. Could you control a modern web app's frontend so completely that it becomes your backend? What security measures exist to prevent this? How good is Cloudflare's bot detection really?

The answers are fascinating — and that's the whole point. This is a learning exercise, not a production tool.

Here's what I want to be crystal clear about:

🚫 Don't use this in production. It's a single-browser, single-session gateway. There's an asyncio.Lock() serializing every request. It's not built for scale and it's not meant to be.
🚫 Don't use this to circumvent OpenAI's terms of service. Respect the platform you're using.
🚫 Don't use this to build commercial products. Get a proper API key for that.
✅ Do use this to learn about browser automation, stealth techniques, API design, and cybersecurity.
✅ Do use this to understand how modern anti-bot systems work and why they're important.
✅ Do use this for testing and prototyping when you need a quick way to interact with ChatGPT programmatically.

I'm not a hacker. I didn't find a security vulnerability. This isn't a bug — ChatGPT works exactly as designed. I'm just controlling a browser, the same way any user does. The difference is, my "user" is a Python script.

OpenAI's security is legit. Cloudflare's human verification gave me more headaches than any coding challenge I've faced. The fact that I had to build viewport jitter, human-like mouse movement, clipboard-based text injection, and a triple-layer detection system just to reliably interact with a website — that's a testament to how good their security is.

It's Open Source — Come Build With Me

CatGPT-Gateway is fully open source on GitHub:

👉 github.com/GautamVhavle/CatGPT-Gateway

The codebase is clean, well-documented, and modular. Want to contribute? Here's what I'd love help with:

More resilient selectors as ChatGPT's UI evolves
Streaming support (stream: true in the API)
Multi-session support — multiple browser instances
Better error recovery — auto-retry on Cloudflare challenges
Tests — there's always room for more tests

I'm actively maintaining this project. If you have feedback, ideas, bug reports, or just want to say hi — open an issue, submit a PR, or drop a comment below.

This started as a Friday night "what if?" and turned into one of the most fun projects I've ever built. If you're a developer who's curious about how far browser automation can go, or a cybersecurity enthusiast who wants to understand the cat-and-mouse game between bots and detection systems — this project is for you.

Star it. Fork it. Break it. Fix it. Let's learn together.

CatGPT-Gateway is out. Star the repo. Don't do anything I wouldn't do. 😼

If you made it this far — thanks for reading. Drop a 🦄 or a 💬 and let me know what you think. If you actually run it and your ChatGPT session starts talking to your LangChain agents... don't blame me. I warned you. Also Congratulations you can run your OpenAI Endpoint with any Agentic Framework like LangChain, AutoGen or CrewAI.

Why uv Refused to Upgrade My Package (And Why It Was Actually Right)

Gautam Vhavle — Sat, 10 Jan 2026 17:25:33 +0000

I thought uv was being annoying. Turns out, it saved me from breaking my entire project. Here's the story.

The Vibe Check: What This Article Is About

So there I was, vibing with my Python backend, when a wild upgrade notification appeared. Seemed simple enough — just upgrade a package, right?

Narrator: It was not simple.

What started as a "quick 5-minute fix" turned into a deep dive into how uv (the blazingly fast Python package manager) actually thinks. And honestly? It completely changed how I approach dependency management.

Here's the journey:

🔥 The problem that looked like a bug
🧠 The mental model shift that clicked
✅ The actual fix (spoiler: it wasn't forcing an upgrade)
💡 The lessons that'll save you hours

Let's get into it.

Setting the Scene: My Stack

I was working on a FastAPI-based backend — you know, the classic data processing API:

FastAPI + Uvicorn (the speedy bois)
pandas & numpy (data wrangling essentials)
scikit-learn (ML predictions)
scipy (scientific computing)
uv for dependency management (because who has time for slow installs?)

Everything was humming along nicely. Dependencies locked in pyproject.toml, reproducible builds via uv.lock. Chef's kiss. 🤌

Then I saw this warning pop up:

A newer version of numpy is available: 1.24.3 → 2.0.0
(pip install -U numpy)

My brain: "Cool, let's upgrade."

My project: "lol no."

First Attempt: The Obvious Thing

Since I'm using uv (not pip like a caveman), I ran the proper command:

uv lock --upgrade-package numpy

And uv said:

❌ No solution found.

Wait, what? The package exists. The version is right there. Why won't you just... install it?

I tried again. Same error. Cleared cache. Same error. Started questioning my life choices. Same. Error.

The Plot Twist: uv Isn't Being Difficult — It's Being Smart

Here's where I had to sit down and actually understand what uv is doing under the hood.

See, uv is not pip. It's built different (literally).

pip	uv
Yeets packages into your env immediately	Resolves the entire dependency graph first
Hopes for the best	Demands logical consistency
"It compiled, ship it" energy	"This must make mathematical sense" energy

The key insight:

uv doesn't upgrade packages. It upgrades compatibility sets.

If even one dependency conflicts with another, uv refuses to proceed. It won't give you a broken environment and call it a day.

This isn't a limitation. It's a feature.

The Real Issue: I Was Living a Lie

After actually reading the error (revolutionary, I know), the problem became crystal clear.

My pyproject.toml was depending on a bunch of scientific Python packages:

numpy
pandas
scipy
scikit-learn
numba
matplotlib

Looks fine, right? Just importing what I need?

WRONG.

The Hidden Drama: Two Generations, One Project

Here's what I didn't know: the scientific Python ecosystem had just gone through a major version transition with numpy 2.0, and not everyone was on board yet.

🔷 Legacy-Compatible Stack

numba (JIT compiler)
scipy < 1.13
Older scikit-learn versions
Requires: numpy >= 1.21, < 2.0

🔶 Modern Stack

numpy >= 2.0
pandas >= 2.2
Newer scipy >= 1.13
Newer scikit-learn >= 1.5

See the problem?

By having both old and new packages in my project, I was basically telling uv:

"Hey, I need numpy to be less than 2.0 AND greater than or equal to 2.0."

uv, being mathematically literate: "That's... not how numbers work, bestie."

No version of numpy can be both < 2.0 AND >= 2.0. It's impossible. That's why uv said no solution exists — because no solution actually exists.

But Wait, pip Would've Worked!

Yeah, about that...

If I had just run:

pip install -U numpy

pip would've:

✅ Upgraded numpy to 2.0 (yay!)
😬 Silently left numba with an incompatible numpy
💀 Caused cryptic runtime errors when numba tried to JIT compile
🎲 Left me debugging "why does my code randomly crash" for hours

pip's approach: "You asked for this package? Here it is. Good luck with everything else lmaooo"

uv's approach: "I'm not letting you shoot yourself in the foot."

The Fix: Making an Actual Decision

The solution wasn't to force an upgrade or delete my lockfile or any of that hacky stuff.

The solution was to choose a lane.

Since I was building a data API and needed the latest pandas features, the modern numpy 2.0 stack made more sense. But that meant I had to deal with numba.

🗑️ What I Changed

# Removed the legacy-locked package
numba  # ← goodbye old friend, you're holding us back

# Or alternatively, waited for numba's numpy 2.0 compatible release
# and pinned to that specific version

✅ What I Kept (with updated constraints)

numpy>=2.0,<3.0
pandas>=2.2,<3.0
scipy>=1.13,<2.0
scikit-learn>=1.5,<2.0
matplotlib>=3.8,<4.0

The Moment of Truth

With a logically consistent dependency graph, I ran:

uv lock
uv sync

And just like that:

✓ Resolved dependencies
✓ Locked 127 packages
✓ Synced environment

No force flags. No hacks. No deleting lockfiles. No stackoverflow copypasta.

It. Just. Worked. ✨

The Big Brain Takeaways

This experience taught me some real ones:

1. uv's strictness is a feature, not a bug

When uv says "no," it's because you're asking for the impossible. Listen to it.

2. Errors are signals, not obstacles

That error message wasn't uv being annoying — it was uv telling me my project had a fundamental design issue.

3. pip's "flexibility" is actually a trap

Silent breakage > Loud error? Nah. I'll take the loud error every time.

4. Dependencies are relationships

You can't just import everything and hope they get along. Some packages are fundamentally incompatible.

5. Lockfiles are your friend

They're not friction. They're reproducibility. They're the reason your code works the same way on every machine.

Pro Tips for the Road

If you want to avoid my pain, here's the cheat code:

Do This	Not This
Treat `pyproject.toml` as a policy document	Treat it like a wish list
Research if packages are compatible	Just add everything you might need
Upgrade related packages together	Upgrade one random package and pray
Use version ranges (`>=1.0,<2.0`)	Over-pin to exact versions
Trust the resolver	Fight the resolver

The One-Liner That Changed My Perspective

If uv refuses to resolve, it's because your requirements don't describe a world that can actually exist.

Read that again.

uv isn't being stubborn. It's being honest. Your dependency graph is a specification of reality, and if that specification is contradictory, no tool can make it work — they can only pretend to.

Wrapping Up

What felt like uv being difficult was actually uv doing its job: protecting me from myself.

The problem wasn't the tool refusing to upgrade.

The problem was that my dependency graph was mathematically impossible.

Once I fixed that — once I made an actual architectural decision instead of importing everything — it all just worked. Cleanly. Predictably. Reproducibly.

That's the real value of uv. It doesn't let you live a lie.

Got questions? Hit me up. Happy to chat about dependency management, uv, or why pip gives me trust issues.

Stay consistent, fam. ✌️

Building RAG Systems: From Zero to Hero

Gautam Vhavle — Wed, 24 Dec 2025 17:45:50 +0000

What I learned building RAG systems from scratch—and how you can too

The Journey That Changed How I Think About AI

Recently, I finished a comprehensive RAG course from DeepLearning.AI and instructor was Zain Hasan. Before that, I'd been learning from scattered tutorials and blog posts—completely unstructured. I thought I understood retrieval-augmented generation. I knew the theory: embeddings, vector databases, semantic search.

But like most things in engineering, theory and practice are worlds apart.

Since then, I've been building RAG systems for standalone projects personal/course, a customer support chatbot, a documentation search engine, and an internal knowledge assistant. Each one taught me something which just reading theory couldn't: the messy, fascinating reality of production AI.

Here's what I wish someone had told me before I started, and what I've learned along the way.

Why RAG? The Problem I Kept Running Into

During my course, the instructor kept hammering home one point: LLMs are amazing at reasoning, terrible at remembering. I nodded along, but I didn't really get it until my first project.

I was building a chatbot for a company's internal documentation. Simple, right? Feed GPT-4 a question, get an answer. Except:

It hallucinated constantly. Made up API endpoints that didn't exist. Confidently cited documentation sections that were never written.
It didn't know about latest updates. We could have shipped a major feature last week. The model? Clueless.

That's when RAG clicked. Instead of expecting the model to memorize everything, I'd give it a search engine. When someone asks a question, search the docs first, then feed the relevant content to the model.

Suddenly: no hallucinations, always up-to-date, and token utilization also was optimized.

That's the power of RAG.

What is RAG? How I Explain It Now

After building a few systems, here's how I think about RAG:

Instead of asking an LLM to answer from memory (which leads to hallucinations), you:

Store your documents in a database that understands meaning (vector database)
When someone asks a question, search for relevant documents
Hand those documents to the LLM along with the question
Let the LLM answer based on what it just read

It's like the difference between asking someone to recite a textbook from memory versus letting them look it up first.

The breakthrough: You separate "knowing facts" from "reasoning about facts." Update your documents, and your AI instantly knows the new information. No retraining needed, no stale knowledge, no made-up answers.

Foundation: Understanding the Building Blocks

Before we build, let's ensure we're on the same page about three key concepts:

1. Embeddings: GPS Coordinates for Meaning

Embeddings convert text into arrays of numbers (vectors) that capture semantic meaning. Words with similar meanings sit close together in this mathematical space.

"dog" → [0.32, 0.89, -0.45, ...]
"puppy" → [0.34, 0.87, -0.43, ...]
"car" → [-0.12, 0.15, 0.78, ...]

Key Insight: Embeddings let us compute "semantic similarity" mathematically. "Dog" and "puppy" are geometrically close; "dog" and "car" are far apart.

2. Vector Similarity: Finding the Needle

When a user asks "What's your refund policy?", we:

Convert the question into an embedding
Find documents with similar embeddings (using cosine similarity or dot product)
Return the top matches

This is wildly faster than reading every document. A vector database can search millions of documents in milliseconds.

3. Context Windows: The LLM's Short-Term Memory

LLMs have limited context windows (think RAM for conversation):

GPT-3.5: 4K tokens (~3,000 words)
GPT-4: 8K-128K tokens
Claude 3: Up to 200K tokens

The catch: More context = slower response + higher cost. RAG is about finding the right context, not all context.

The RAG Pipeline: End-to-End Architecture

Here's how a production RAG system works:

Phase 1: Ingestion (Building Your Knowledge Base)

Documents → Chunking → Embedding → Vector DB Storage

Step 1: Collect Your Data

Documentation (Markdown, PDFs)
Internal wikis
Customer support tickets
Product databases
Code repositories

Step 2: Chunk It
Break large documents into smaller pieces (chunks). Why? LLMs need focused context, not entire manuals.

Step 3: Embed It
Convert each chunk into a vector using an embedding model (OpenAI Ada, Sentence-BERT, etc.)

Step 4: Store It
Index vectors in a vector database with metadata (source, timestamp, category)

Phase 2: Retrieval (Finding Relevant Knowledge)

User Query → Embed Query → Search Vector DB → Retrieve Top-K Chunks

When a user asks a question:

Convert their question into an embedding
Search your vector DB for similar chunks
Retrieve the top 3-10 most relevant pieces
(Optional) Rerank results for precision

Phase 3: Generation (Creating the Answer)

Query + Retrieved Context → LLM → Grounded Answer

Construct a prompt:

Context: [Retrieved chunks]
Question: [User query]
Instructions: Answer based only on the context provided.

The LLM generates a response grounded in your actual data.

My First RAG System (The One That Actually Worked)

After the course, I wanted to build the simplest possible RAG system to prove I understood it. Here's what I came up with—about 50 lines of Python (Which ran really slow in my Macbook):

# requirements: sentence-transformers, faiss-cpu, openai

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import openai

# Step 1: Prepare documents
documents = [
    "Our refund policy: 30 days, full refund with receipt.",
    "Shipping takes 3-5 business days for domestic orders.",
    "We accept Visa, Mastercard, and PayPal.",
    "Customer support: support@example.com or call 1-800-HELP"
]

# Step 2: Create embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim embeddings
embeddings = model.encode(documents)

# Step 3: Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

# Step 4: Retrieval function
def retrieve(query, k=2):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [documents[i] for i in indices[0]]

# Step 5: RAG function
def rag_query(question):
    # Retrieve relevant docs
    context = retrieve(question)

    # Create prompt
    prompt = f"""Answer the question based only on this context:

Context:
{chr(10).join(context)}

Question: {question}

Answer:"""

    # Generate response
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

# Test it
print(rag_query("How long does shipping take?"))
# Output: "Shipping takes 3-5 business days for domestic orders."

What just happened?

I embedded 4 documents using a lightweight model I could run locally (22MB download)
Stored them in FAISS—this took me 10 minutes to figure out from the docs
When asked about shipping, the system found the right document
Fed it to GPT-3.5 to generate a natural answer

Results: Optimized token usage. Zero hallucinations. I was hooked.

This tiny example taught me more than hours of coursework. Seeing retrieval work in real-time made everything click.

Chunking: The Part That Took Me the Longest to Get Right

The course covered chunking in one lecture. In practice, it took me three weeks of experimentation. Here's what I learned the hard way:

1. Fixed-Size Chunking (Beginner-Friendly)

Split every N tokens (e.g., 512 tokens) with optional overlap:

def chunk_fixed(text, chunk_size=512, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

Pros: Simple, predictable

Cons: May split mid-sentence, breaks semantic units

Use when: You have clean, uniform text (articles, docs)

2. Semantic Chunking (Intermediate)

Split at natural boundaries (paragraphs, sections, sentences):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " "]  # Try these in order
)

chunks = splitter.split_text(long_document)

Pros: Respects semantic boundaries

Cons: Variable chunk sizes

Use when: You have structured documents (PDFs, articles)

3. Hybrid Chunking (Advanced)

Combine approaches:

Use section headers to define chunk boundaries
Keep chunks within token limits
Add metadata (section title, page number)

What Actually Worked for Me:

✅ Always use overlap (I settled on 15%)—this fixed so many "half-answer" problems

✅ Metadata is your friend—I can filter by date, source, document type

✅ Start with 512 tokens—then adjust. I went up to 768 for technical docs

✅ Test with real queries—what looks good in theory often fails in practice

⚠️ My biggest mistake: Chunks too small (200 tokens). Context disappeared.

My advice: Don't overthink it at first. Use 512 tokens with 50 token overlap, then iterate based on what your users actually search for.

Retrieval: What I Wish I'd Known Earlier

In the course, we learned semantic search (embeddings + vector similarity). In my projects, I discovered that wasn't always enough.

1. Dense Retrieval (Semantic Search)

This is where I started: convert everything to vectors, find similar vectors.

Best models (as of 2025):

text-embedding-3-large (OpenAI) - 1536 dims, excellent quality
all-MiniLM-L6-v2 (open source) - 384 dims, fast, good enough
bge-large-en-v1.5 (BAAI) - 1024 dims, top open source option

Pros: Captures semantic meaning, handles synonyms

Cons: Computationally intensive

2. Sparse Retrieval (BM25/TF-IDF)

Traditional keyword search. Fast, simple, explainable.

from rank_bm25 import BM25Okapi

corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)

query = "refund policy".split()
scores = bm25.get_scores(query)
top_doc = documents[np.argmax(scores)]

Pros: Fast, deterministic, good for exact keyword matches

Cons: Misses semantic similarity ("car" won't match "automobile")

3. Hybrid Search (Best of Both)

Combine dense and sparse retrieval:

def hybrid_search(query, alpha=0.5):
    # Get semantic results
    semantic_results = vector_db.search(query, k=20)

    # Get BM25 results
    bm25_results = bm25.search(query, k=20)

    # Merge with weighted scores
    combined = merge_results(
        semantic_results, 
        bm25_results, 
        alpha=alpha  # 0.5 = equal weight
    )

    return combined[:10]  # Top 10

When to use hybrid:

Users search with specific keywords (product names, codes)
Domain with technical jargon
You need explainable results

Method	Speed	Accuracy	Best For
Dense (Semantic)	Medium	High	Natural language queries
Sparse (BM25)	Fast	Medium	Keyword search
Hybrid	Medium	Highest	Production systems

4. Reranking: The Game-Changer I Almost Skipped

I almost didn't implement reranking. "Initial retrieval is good enough," I thought. Then I tried it:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Initial retrieval: Get top 20
candidates = vector_db.search(query, k=20)

# Rerank: Score each candidate against query
scores = reranker.predict([(query, doc) for doc in candidates])

# Return top 5 after reranking
reranked = [candidates[i] for i in np.argsort(scores)[-5:]]

My results: Accuracy jumped from 73% to 89% on my test queries on test dataset in the course. I immediately noticed better answers.

Tradeoff: Added some latency, which is totally worth it.

Reranking should became non-negotiable in my next projects.

Vector Databases: My Journey From FAISS to Production

Where I Started: Local and Simple

1. FAISS (Facebook AI Similarity Search)

This was my first choice after the course. Why?

Dead simple: Got it running in 30 minutes
Free: Important when you're learning
Fast enough: For my 10K documents

The catch: No persistence out of the box. I had to save/load the index manually. Fine for prototyping, annoying for production.

import faiss
index = faiss.IndexFlatL2(dimension)  # Brute force, exact search
index.add(embeddings)

2. LocalChromaDB

Pros: Simple, embedded mode, good for beginners
Cons: Not optimized for large scale
Use for: Side projects, MVPs, local development

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(documents=texts, embeddings=embeddings, ids=ids)

Where I Moved for Production

3. Qdrant

Pros: Rust-based (fast), filtering, open source, good docs, for speech applications where latency matters
Cons: Smaller community than others
Use for: Production, performance-critical apps

4. Milvus/Zilliz

Pros: Built for massive scale (billions of vectors), battle-tested
Cons: Complex setup, steeper learning curve
Use for: Enterprise scale, billions of documents

Database	Ease of Use	Scalability	Cost	Open Source	Best For
FAISS	⭐⭐⭐⭐⭐	⭐⭐	Free	✅	Learning, prototypes
ChromaDB	⭐⭐⭐⭐⭐	⭐⭐⭐	Free	✅	MVPs, small apps
Qdrant	⭐⭐⭐⭐	⭐⭐⭐⭐	Free/$	✅	Performance-critical
Milvus	⭐⭐	⭐⭐⭐⭐⭐	Free/$$$	✅	Enterprise scale

My Recommendation Based on What I've Built:

Your first RAG project? → FAISS. Get something working in an afternoon.
Building a side project? → ChromaDB. Easy persistence, good docs.
Serious about production? → Milvus or Qdrant. I've tried and tested both, both are solid.

RAG vs Fine-Tuning: When to Use What

This is the million-dollar question.

Criterion	RAG	Fine-Tuning
Cost	Low (inference only)	High ($10K-$100K+)
Update frequency	Real-time	Requires retraining
Setup time	Days	Weeks/months
Accuracy on facts	Excellent (grounded)	Good (can hallucinate)
Behavior modification	Limited	Excellent
Interpretability	High (see sources)	Low (black box)
Latency	Slightly higher	Lower

Use RAG when:

Your data changes frequently
You need factual accuracy with citations
Budget is limited
You need to explain answers (show sources)

Use Fine-Tuning when:

You need to change model behavior (tone, format, style)
Data is static
Latency is critical
Budget allows

Best approach? Hybrid: Fine-tune for behavior, RAG for knowledge.

The Surprise: Small Models + RAG Beat GPT-4

This wasn't in the course, but it's the most important thing I've learned:

A well-tuned 7B model with RAG beats GPT-4 for domain-specific tasks.

I didn't believe it until I tried it in my demo project. Here's why it works:

Specialized Retrieval beats General Knowledge
- GPT-4 knows a little about everything
- Your RAG knows everything about your domain
Smaller Models are Faster
- Llama-3 8B: ~50ms inference
- GPT-4: ~500ms inference
- 10x speed improvement
Cost Savings are Dramatic

   GPT-4: $0.03 per 1K tokens
   Llama-3 8B (self-hosted): $0.0001 per 1K tokens
   300x cheaper

You Control the Infrastructure
- No vendor lock-in
- Data privacy guaranteed
- Custom optimizations

My Experience:

Project 1: Documentation Chatbot

Started with GPT-4: Great answers
Switched to Llama-3 8B + RAG: Better answer
The difference: GPT-4 would paraphrase incorrectly. Llama-3 + RAG quoted exact docs.

Project 2: Customer Support Bot

Tried GPT-4 first: Some of queries handled correctly (67%)
Moved to Llama-3.2 8B + RAG: 91% accuracy
Why: RAG retrieved the exact support article. Model just had to summarize it.

Project 3: Internal Knowledge Assistant

Using Phi-3 3.8B (tiny model) + aggressive Hybrid search RAG
Responses in 120ms average
I prefer it over standalone GPT-4 system

Where I Think This Is Heading

After three projects and countless experiments, here's what I believe:

As smaller models get better (and they are—fast), RAG becomes the great equalizer. We're moving toward:

Specialized beats generalized for most business use cases
Open source + RAG is the default architecture
Cost per query drops from dollars to fractions of cents
Every company runs their own domain-expert AI

Emerging Trends: Agentic RAG

The next evolution is Agentic RAG - systems that don't just retrieve and generate, but reason about what to retrieve and when:

How Agentic RAG works:

Query Analysis: Agent determines if it needs more information
Multi-Step Retrieval: Performs multiple retrieval rounds, refining based on initial results
Tool Use: Can call external APIs, run code, or query structured databases
Self-Reflection: Evaluates its own answers and retrieves more if unsure

# Example: Agentic RAG flow
def agentic_rag(query):
    # Step 1: Analyze query complexity
    if needs_multi_step_reasoning(query):
        # Step 2: Break down into sub-questions
        sub_queries = decompose_query(query)

        # Step 3: Retrieve for each sub-question
        contexts = [retrieve(q) for q in sub_queries]

        # Step 4: Synthesize and verify
        answer = generate_with_verification(query, contexts)

        # Step 5: If confidence is low, retrieve more
        if answer.confidence < 0.8:
            additional_context = retrieve_with_feedback(query, answer)
            answer = generate_final(query, contexts + additional_context)
    else:
        # Simple single-step RAG
        answer = simple_rag(query)

    return answer

Benefits of Agentic RAG:

Better accuracy on complex queries requiring multi-hop reasoning
Adaptive retrieval - only retrieves what's needed
Explainable reasoning - can show the step-by-step process
Cost-efficient - avoids over-retrieving

Real-world impact: Agentic RAG systems have shown 30-40% improvement over traditional RAG on complex question-answering benchmarks like HotpotQA and MultiHop-RAG.

My take: Most companies don't need latest GPT-5. They need their own data, smart retrieval, and a well-implemented RAG system. That's 90% of the value at 10% of the cost.

This realization changed how I think about AI engineering entirely.

Production Lessons: What I'm Learning

1. Monitoring Saved Me

I didn't add monitoring in my first project. Big mistake. In my second project, I tracked the context, but from other souces I've learned that:

Retrieval Metrics:

Recall@K: Are the right docs in top K results?
Precision@K: What % of retrieved docs are relevant?
MRR (Mean Reciprocal Rank): How far down is the first relevant result?

Generation Metrics:

Faithfulness: Does answer align with retrieved context?
Relevance: Does answer address the question?
Latency: Time from query to response

Business Metrics:

User satisfaction (thumbs up/down)
Resolution rate (for support chatbots)
Cost per query

# Simple evaluation framework
def evaluate_rag(query, ground_truth, rag_system):
    # Retrieval
    retrieved_docs = rag_system.retrieve(query)
    recall = calculate_recall(retrieved_docs, ground_truth_docs)

    # Generation
    answer = rag_system.generate(query, retrieved_docs)
    faithfulness = check_faithfulness(answer, retrieved_docs)

    return {"recall": recall, "faithfulness": faithfulness}

2. Handling Context Overflow

What if retrieved context exceeds LLM's window?

Solutions:

Summarize chunks before passing to LLM
Use longer context models (Claude 3 200K)
Implement multi-hop retrieval (iterative refinement)
Compress context with extractive summarization

3. Cost Optimization

Embedding costs:

Cache embeddings (don't recompute for same text)
Use cheaper models for preliminary retrieval
Batch embed operations

LLM costs:

Use smaller models where accuracy permits
Implement caching for common queries
Set max_tokens to avoid runaway generation

Infrastructure:

Self-host embeddings model (one-time cost)
Use spot instances for batch processing
Implement request throttling

4. Multi-Tenancy Patterns

For SaaS products:

# Namespace approach
collection.add(
    documents=docs,
    embeddings=embeddings,
    metadata=[{"tenant_id": "customer_123"}]
)

# Query with filter
results = collection.query(
    query_embedding=query_emb,
    filter={"tenant_id": "customer_123"}
)

5. Incremental Updates

Don't rebuild your entire index daily:

# Add new documents
new_docs = fetch_new_documents(since=last_update)
new_embeddings = model.encode(new_docs)
index.add(new_embeddings)

# Update existing documents
updated_docs = fetch_updated_documents()
# Delete old versions, add new versions
for doc in updated_docs:
    index.remove(doc.old_id)
    index.add(doc.new_embedding, doc.new_id)

Mistakes I Made or would have made (So You Don't Have To)

⚠️ Mistake 1: Chunking Too Small

What I did: Started with 128-token chunks to "maximize precision"

What happened: Retrieval found fragments without enough context. Answers were incomplete.

Fix: Bumped to 512 tokens with 15% overlap. Immediately better.

⚠️ Mistake 2: Ignoring Metadata

What I did: Pure semantic search, no filters

What happened: Retrieved old documentation when new versions existed

Fix: Added timestamp and version filters. Game changer.

results = vector_db.search(
    query=query,
    filter={"category": "product_docs", "date": {"$gte": "2024-01-01"}}
)

⚠️ Mistake 3: Not Profiling Latency

What I did: Assumed "fast enough" without measuring

What happened: Users complained about 3-second response times

Fix:

Profiled every step: embedding (50ms), retrieval (80ms), reranking (200ms), generation (2.1s)
Optimized generation by switching models
Got down to 800ms total

⚠️ Mistake 4: Trusting Retrieval Blindly

What I did: Always passed top results to LLM, no quality check

What happened: When retrieval failed, LLM made stuff up

Fix: Added confidence thresholds:

def rag_with_fallback(query):
    results = retrieve(query, k=3)

    # Check if top result is confident
    if results[0].score < 0.7:  # Low confidence
        return "I don't have enough information to answer this."

    return generate(query, results)

⚠️ Mistake 5: "Set It and Forget It"

What I did: Built the system, deployed it, moved on

What happened: After adding 5000 more documents, retrieval quality dropped 15%

Fix: My plan will be to run evaluation tests occasionally. Catch degradation early.

Frequently Asked Questions

Q: Do I need a vector database or can I use a traditional DB?

A: For <10K documents, you can get away with FAISS or even numpy arrays. Beyond that, a proper vector DB gives you scalability, filtering, and performance.

Q: What's the minimum viable RAG system?

A: 50 lines of Python (see the example above), a free embedding model, and FAISS. Total cost: $0 to start.

Q: How do I handle PDF extraction and preprocessing?

A: Use libraries like pymupdf, pdfplumber, or unstructured. Watch out for table extraction—it's tricky.

Q: Can I do RAG with completely private/offline models?

A: Absolutely. Use sentence-transformers for embeddings and llama.cpp or ollama for local LLM inference.

Q: What about structured data (databases, spreadsheets)?

A: Convert to text descriptions or use hybrid approaches (SQL + RAG). Example: Generate natural language descriptions of database rows.

Q: How do I know if my chunking strategy is working?

A: Measure retrieval metrics. If recall is low, experiment with different chunk sizes and overlap.

Q: Should I use multiple embedding models?

A: For specialized domains (code, legal, medical), domain-specific embeddings often outperform general-purpose ones.

Q: What about multi-modal RAG (images, tables, charts)?

A: Use multi-modal embedding models like CLIP (for images) or convert tables to text. It's an active area of research.

What I Wish I Knew Before Starting

If I could go back and tell myself these things before building my first RAG system:

The Core Insights:

RAG separates reasoning from knowledge → LLMs handle reasoning, databases handle facts
The pipeline is simple: Ingest → Chunk → Embed → Store → Retrieve → Generate
Chunking matters: Start with 512 tokens, add overlap, respect semantic boundaries
Hybrid search beats pure semantic: Combine BM25 + vector search for best results
Small models + RAG can beat GPT-4 on domain-specific tasks at 1/100th the cost
Production is about monitoring: Track retrieval quality, latency, and cost
Start simple, optimize later: FAISS + OpenAI embeddings + GPT-3.5 gets you 80% there
Agentic RAG is the future: Multi-step reasoning and adaptive retrieval unlock new capabilities

Why I'm Excited About This:

Accessibility: I am trying to build production AI with decent ML background. Just a curiosity and determination.
Economics: RAG systems cost 1-5% of what pure LLM solutions would cost
Reliability: Users trust answers because they see sources. No more "did the AI make this up?"
Agility: I can update knowledge in minutes. No retraining, no waiting.

The Future of RAG:

We're moving toward:

Adaptive RAG: Systems that adjust retrieval strategy based on query complexity
Agentic RAG: Multi-step reasoning with dynamic retrieval and tool use
Fusion models: Architectures that blend parametric and non-parametric knowledge
Smaller, smarter retrievers: Specialized models optimized for retrieval

Bottom line: RAG isn't just a technique—it's the architecture that makes practical, affordable, trustworthy AI possible.

Our Next Steps

Choose your adventure based on experience level:

🌱 Beginners:

✅ Implement the 50-line minimal RAG example above
✅ Experiment with different chunk sizes (256, 512, 1024)
✅ Try different embedding models (compare results)
✅ Build a chatbot for a small document set (10-100 docs)
📚 Read: LangChain RAG Tutorial

🚀 Intermediate (I still have to implement few of em):

✅ Implement evaluation metrics (recall@K, faithfulness)
✅ Set up a production vector DB (Weaviate or Qdrant)
✅ Add hybrid search (BM25 + semantic)
✅ Implement reranking
✅ Experiment with smaller models (Llama-3 8B, Mistral 7B)
📚 Read: RAG Evaluation Best Practices

⚡ Advanced (I'm aiming for this nex):

✅ Build multi-tenant RAG system with namespace isolation
✅ Implement adaptive retrieval strategies
✅ Optimize for sub-100ms latency
✅ Run cost analysis and optimize for $0.0001/query
✅ Build agentic RAG with multi-step reasoning
✅ Contribute to open source RAG frameworks
📚 Read: Advanced RAG Techniques

🎯 RAG Journey Checklist:

[ ] Understand embeddings and vector similarity
[ ] Build minimal RAG system (FAISS + OpenAI)
[ ] Implement chunking strategy for your use case
[ ] Set up production vector database
[ ] Add evaluation metrics
[ ] Implement hybrid search
[ ] Optimize for cost and latency
[ ] Deploy to production
[ ] Monitor and iterate
[ ] Explore agentic RAG patterns

Resources and Further Reading

Essential Links:

Vector Databases

Primary Options (Recommended)

Milvus Documentation - Distributed vector DB with GPU support
Qdrant Documentation - Production-ready vector DB with cloud SaaS option
ChromaDB Documentation - Lightweight, Python-first vector database

Embedding Models & Evaluation

MTEB Leaderboard & Model Selection

MTEB Leaderboard (English) - Compare 100+ embedding models across 56+ tasks
MTEB Leaderboard (Multi-lingual) - Multilingual model comparison

Top Opensource Embedding Models

Qwen3-Embedding-8B - State-of-the-art (Dec 2024), multilingual support
jina-embeddings-v3 - 570M params, 8K context length, task-specific LoRA
NV-Embed-v2 - NVIDIA's top performer, #1 on MTEB (Aug 2024)
bge-m3 - BAAI's versatile model, dense + sparse + multi-vector retrieval
arctic-embed-l - Open-source, outperforms Cohere embed-v3

Framework-Specific Resources

HuggingFace Sentence Transformers - Python library for semantic embeddings
OpenAI Embeddings API - Production embedding service

Domain-Specific Embeddings

Medical: PubMedBERT (biomedical literature), BioLORD
Finance: Finance-Embeddings, BGE-Financial, Voyage Finance
Code: Code embeddings from GitHub, CodeBERT

RAG Frameworks & Orchestration

Core Frameworks (Production-Ready)

LangGraph Documentation - Graph-based agentic workflows with stateful execution (2024+)
LlamaIndex Documentation - Agent-powered context augmentation with LlamaParse document parsing
Haystack Documentation - v2.0+ with explicit RAG pipelines and evaluation tools
LangChain Documentation - Rapid prototyping and experimentation

Emerging & Specialized

Pathway Real-time RAG - Streaming data processing for live RAG updates (2025)
DSPy Documentation - Declarative pipeline programming for LLMs
Cohere Agent Framework - Multi-agent orchestration with built-in tools

Advanced RAG Approaches

Graph-Enhanced & Hierarchical RAG

GraphRAG (Microsoft Research) - LLM-derived knowledge graphs with v1.0 release (Dec 2024)
- Latest: LazyGraphRAG (Nov 2024) - cost-efficient variant without pre-summarization
- GraphRAG GitHub
LightRAG Framework - Graph-enhanced with dual-level retrieval & incremental updates (2024)
- Paper: LightRAG: Simple and Fast Retrieval-Augmented Generation
ArchRAG - Attributed community-based hierarchical RAG (Feb 2025)

Hierarchical & Long-Context

RAPTOR Framework - Recursive abstractive processing for tree-organized retrieval
- Implementation: RAGFlow's RAPTOR Implementation
RAG-Anything - Multimodal RAG supporting text, images, tables, equations (2024)

Self-Correcting & Agentic RAG

Self-RAG: Learning to Retrieve, Generate, and Critique - Adaptive retrieval with quality scoring
Agentic RAG Frameworks (2025) - Overview of LangGraph, Haystack, LlamaIndex, Pathway, DSPy

RAG Evaluation & Benchmarking

Evaluation Frameworks

RAGAS (Retrieval-Augmented Generation Assessment) - Reference-free RAG evaluation with metrics like context precision, entity recall, faithfulness
- RAGAS Paper
- Evaluating RAG with RAGAS Tutorial
BEIR Benchmark - Retrieval evaluation across heterogeneous datasets

Research Papers & Benchmarks

Original RAG Paper: Retrieval-Augmented Generation - Foundational work
REALM: Retrieval-Augmented Language Model Pre-Training - Pre-training with retrieval
Dense Passage Retrieval - Dense vector retrieval fundamentals
BRIGHT Benchmark - Reasoning-intensive retrieval evaluation (2024)

Research Papers (Foundational & Recent)

Core RAG Papers

Recent Advances (2024-2025)

Embedding Model Papers

Vector Database Papers

Community & Support

Forums & Discussion

r/LocalLLaMA - Open-source LLM and RAG community
LangChain Fourm - Active community support

Blogs & Learning

Microsoft GraphRAG Blog - Official GraphRAG research updates
LangChain Blog - Framework updates and tutorials
Milvus Blog - Vector DB best practices

Final Thoughts from the Trenches

Few months ago, I know nothing about RAG and started to learn more about it. Today, I'm iming to build production RAG after understanding the RAG concepts which will aims to provide accurate answers.

The biggest lesson? RAG isn't just a technique—it's a different way of thinking about AI.

Instead of "how can I make the model smarter," I now ask "how can I give it better information?" That mental shift unlocked everything.

The course taught me the foundations. Building real systems taught me the craft. The gap between those two was wider than I expected, but crossing it was incredibly rewarding.

If you're where I was three months ago—course completed, wondering what's next—my advice is simple: Build something. Ship it. Learn from real users.

Your first RAG system won't be perfect. Mine wasn't. But it'll teach you more than any tutorial ever could.

The best time to start was yesterday. The second best time is now.

If you're building RAG systems, I'd love to hear about your experience. What surprised you? What worked? What didn't? Drop a comment—let's learn from each other.

This Article written on December 2025, based on my hands-on experience building RAG systems. The field moves fast—always test and validate for your specific use case.