martin brice

How a Non-Developer Finally Understood RAG (And You Can Too)

Tags: RAG, AI, LLM, beginners, machinelearning


About the author: Hi, I’m Martin Brice. Not a developer. Just someone who got way too deep into local LLMs and somehow ended up here. Haha.


Okay so. I’m not a developer.

But I’ve been obsessing over local LLMs — VS Code, Continue extension, Ollama, the whole thing. And everyone kept throwing around this word: RAG.

Like it was obvious. Like I should just know.

I didn’t.

So I started asking questions. Really basic ones. And somewhere between “wait, embedding is just… turning words into numbers?” and “hold on, this is literally just Homebrew” — it all made sense.

Here’s how I got there. No CS degree required.


First — What Problem Does RAG Even Solve?

Your local LLM is smart. But it doesn’t know your stuff.

Not your codebase. Not your internal docs. Not anything you made.

RAG = giving your LLM the right context before it answers.

That’s literally it. Retrieval Augmented Generation sounds fancy. It’s not. Find relevant stuff → hand it to the LLM → get a way better answer.


The 3 Parts. In Human.

🔢 Embedding — I Call It “Numberization”

Before anything can be searched, your text needs to be converted into numbers. Why? Because numbers can capture meaning in a form you can compare, and comparing numbers is way faster than comparing words.

I kept calling this numberization in my head and honestly? More accurate than “embedding.”

"memory leak fix"     [0.2, 0.8, 0.1, 0.5 ...]
"gc.collect usage"    [0.21, 0.79, 0.11, 0.48 ...]
"today's weather"     [0.9, 0.1, 0.7, 0.2 ...]

Similar meanings → similar numbers. So when you search, you’re not matching words. You’re matching meaning. Wild, right?

And with Ollama, this runs fully local. Your code never leaves your machine.

ollama pull nomic-embed-text  # that's it
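To see what “similar meanings → similar numbers” actually means without running a model, here’s a toy comparison. The vectors are made up (real embeddings from a model like nomic-embed-text have hundreds of dimensions), but the math is the real thing: cosine similarity, the standard way to measure how close two embeddings are.

```python
import math

def cosine_similarity(a, b):
    # How closely two number-lists point in the same direction:
    # close to 1.0 means "similar meaning", lower means "unrelated"
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings
memory_leak = [0.2, 0.8, 0.1, 0.5]
gc_collect  = [0.21, 0.79, 0.11, 0.48]
weather     = [0.9, 0.1, 0.7, 0.2]

print(cosine_similarity(memory_leak, gc_collect))  # very close to 1.0
print(cosine_similarity(memory_leak, weather))     # noticeably lower
```

Searching by meaning is just “find the vectors with the highest similarity score”. That’s the whole trick.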

🗄️ ChromaDB — The Warehouse

So now you’ve got all these numbers. You need somewhere to put them.

ChromaDB stores them. But here’s the key part — it doesn’t sort them when they go in. It just dumps everything in the warehouse.

The smart part happens at retrieval:

Question comes in
       ↓
Convert question to numbers (same embedding process)
       ↓
Compare against everything in the warehouse
       ↓
Pull out the closest matches

Think of it as a warehouse where nothing is organized — but the librarian can instantly find anything similar to what you’re looking for. That’s the vibe.
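Here’s that disorganized-warehouse-with-a-smart-librarian idea as a toy sketch. This is not the real ChromaDB API, just a plain-Python stand-in showing the retrieval steps from the diagram above: embed the question the same way as the documents, compare against everything, pull the closest matches.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# The "warehouse": nothing sorted, just (text, vector) pairs dumped in
warehouse = [
    ("memory leak fix",  [0.2, 0.8, 0.1, 0.5]),
    ("gc.collect usage", [0.21, 0.79, 0.11, 0.48]),
    ("today's weather",  [0.9, 0.1, 0.7, 0.2]),
]

def retrieve(question_vector, top_k=2):
    # The smart part: compare the question against everything, keep the closest
    scored = [(cosine_similarity(question_vector, vec), text) for text, vec in warehouse]
    scored.sort(reverse=True)
    return [text for score, text in scored[:top_k]]

# A question embedded with the same process as the documents (toy vector)
question = [0.19, 0.81, 0.12, 0.5]
print(retrieve(question))  # the two memory-related entries come back first
```

Real vector databases add indexing so this comparison stays fast with millions of entries, but the mental model is exactly this.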


⚖️ Reranker — The Curator

Vector search is great but it casts a wide net. You might get 20 results. Maybe 3 are actually useful.

That’s what the Reranker is for. It reads each result carefully and re-orders by actual relevance.

Before Reranker:
1st → "memory concepts overview"   ← sounds related, not helpful
2nd → "memory leak debug code"     ← THIS is what you need
3rd → "memory optimization tips"

After Reranker:
1st → "memory leak debug code"     ✅
2nd → "memory optimization tips"
3rd → "memory concepts overview"

My favorite analogy:

Vector search = casting a fishing net (catch a lot)
Reranker = chef picking only the best catch (keep what matters)
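In code, the chef-picking-the-catch step looks something like this. Real rerankers are small neural models that read the query and each document together; this toy version just scores by word overlap, but the reordering behavior it demonstrates is the same.

```python
def toy_rerank(query, results):
    # Stand-in for a real reranker model: score each result by how many
    # query words it actually contains, then re-order by that score
    query_words = set(query.lower().split())

    def score(text):
        return len(query_words & set(text.lower().split()))

    return sorted(results, key=score, reverse=True)

# The vector search's wide net, in the order it came back
results = [
    "memory concepts overview",
    "memory leak debug code",
    "memory optimization tips",
]

print(toy_rerank("debug memory leak", results))
# "memory leak debug code" moves to the top
```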


The “Homebrew Moment” — LangChain

This is when it clicked for me.

LangChain is a Python library. And just like Homebrew on Mac — it’s not doing the hard work itself. It’s just coordinating everything else.

LangChain (the general contractor) 🏢
  ├── Chunking    → does this itself
  ├── Embedding   → calls Ollama
  ├── Storage     → calls ChromaDB
  └── Answer      → calls local LLM

Install the parts separately. Use them all through one interface.

pip install langchain chromadb   # like brew install
ollama pull nomic-embed-text

Before LangChain → wire everything manually, 100 lines of code
After LangChain → connect the pieces, done

It’s a PM that outsources everything and just manages the pipeline. Lightweight. Smart. Kinda lazy in the best way. Haha.
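The general-contractor idea can be sketched in plain Python. To be clear: this is not the real LangChain API, just stub functions standing in for each subcontractor, so you can see what “coordinating everything through one interface” means structurally.

```python
# Toy "general contractor" pipeline. Each step is a stub standing in for
# the real subcontractor (Ollama for embedding, ChromaDB for storage, a local LLM).

def chunk(document):
    # The contractor does this part itself: cut the document into pieces
    return [p for p in document.split("\n\n") if p.strip()]

def embed(text):
    # Stand-in for a call to an embedding model: real ones return hundreds of numbers
    return [float(len(word)) for word in text.split()][:4]

def store(chunks):
    # Stand-in for ChromaDB: dump (text, vector) pairs into the warehouse
    return [(c, embed(c)) for c in chunks]

def answer(question, context):
    # Stand-in for the local LLM
    return f"Answering {question!r} using {len(context)} stored chunks"

def rag_pipeline(document, question):
    # The contractor's whole job: wire the subcontractors together in order
    warehouse = store(chunk(document))
    return answer(question, warehouse)

print(rag_pipeline("chunk one\n\nchunk two", "what is RAG?"))
```

Swap each stub for the real library call and you have roughly what LangChain does for you behind one interface.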


The Thing Nobody Explains Well: Chunking

Before any of the above happens, your documents get cut into pieces. This is chunking. And the cutting strategy matters a lot.

For code, the rule is: cut by function or class, not by character count.

# Good chunk ✅ — complete function
def calculate_memory(data):
    result = data * 2
    return result

# Bad chunk ❌ — cut mid-function
def calculate_memory(da
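“Cut by function, not by character count” is easy to do for Python code using the standard library’s ast module, which knows where every function and class starts and ends. A minimal sketch:

```python
import ast

def chunk_python_by_function(source):
    # Cut at function/class boundaries so every chunk is a complete unit,
    # never a function sliced in half mid-signature
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

source = """\
def calculate_memory(data):
    result = data * 2
    return result

def free_memory(data):
    return None
"""

for piece in chunk_python_by_function(source):
    print(piece)
    print("---")
```

Each chunk comes out as one whole function, which is exactly the “good chunk” shape from the example above.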

You also want metadata on every chunk:

{
  "file": "engine_core.py",
  "type": "function",
  "name": "calculate_memory",
  "line": 42
}

With this, your LLM can say:

“The issue is in engine_core.py, line 42, inside calculate_memory”

Instead of:

“Somewhere in your code maybe?”

Big difference. Haha.


What Gets Chunked vs What Doesn’t

This confused me early on:

Source          Chunk it?
Your codebase   ✅ Yes — by function
Long docs       ✅ Yes — by section
Q&A pairs       ❌ No — already the right size

Q&A pairs go straight to embedding. Splitting them would destroy their meaning.


The Full Flow (My Actual Setup)

VS Code + Continue
        ↓
Proxy Server intercepts the prompt
        ↓
RAG kicks in:
  → ChromaDB searched
  → Reranker filters best results
  → Context injected into prompt
        ↓
Local LLM (Ollama) answers
        ↓
Q&A pair saved back to ChromaDB

That last step is the part I love most. Every question and answer gets stored. The system gets smarter the more you use it — no retraining, no API costs, no data leaving your machine.
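That self-improving loop can be sketched end to end in a few lines. Again, the embed and answer functions here are toy stand-ins for Ollama and the local LLM, not the real calls; the point is the last step, where every Q&A pair goes back into the warehouse.

```python
# Toy version of the full flow: retrieve context, answer, save the Q&A pair back.

warehouse = []  # stands in for ChromaDB

def embed(text):
    # Stand-in for the embedding model (real ones return hundreds of numbers)
    return [float(len(text)), float(text.count(" "))]

def retrieve(question, top_k=3):
    # Closest stored entries by squared distance between toy vectors
    q = embed(question)
    scored = sorted(warehouse,
                    key=lambda item: sum((a - b) ** 2 for a, b in zip(item[1], q)))
    return [text for text, vec in scored[:top_k]]

def ask(question):
    context = retrieve(question)
    answer = f"answer to {question!r} (context: {len(context)} chunks)"  # LLM stand-in
    # The favorite step: save the Q&A pair back, so the warehouse grows with use
    warehouse.append((f"Q: {question} A: {answer}", embed(question)))
    return answer

ask("why does gc.collect help?")
ask("what is a memory leak?")
print(len(warehouse))  # → 2
```

No retraining anywhere in that loop: the model never changes, only the warehouse it draws context from.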


Plain English Summary

Fancy term   What it actually means
Embedding    Numberization — text → numbers
Vector DB    Warehouse that retrieves by number similarity
Chunking     Cutting docs into meaningful pieces
Reranker     Curator that picks the best results
LangChain    PM/contractor coordinating everything
RAG          Giving the LLM the right context before answering

The jargon made this feel impossible. Once I stopped trying to understand “vector embeddings” and started thinking about numberization, warehouses, and general contractors — it took about an hour.

If you’re a non-developer trying to make sense of this stuff, I hope this saves you that hour. You don’t need to write the code to understand the architecture. And understanding the architecture helps you ask better questions when you do start building.

Which is kind of the whole point of RAG, isn’t it? Haha.


Stack: VS Code + Continue + Ollama + ChromaDB + LangChain
All local. No API keys. No data leaving the machine.


— Martin Brice, a non-developer who got too curious
