Sabi Mantock

Posted on Jun 16

From "I Understood Nothing" to Building a RAG App

#learning #nlp #python #ai

Yesterday, I took the time to go through 31 pages of my own notes on NLP, notes I had carefully written during an AI engineering course. I found that I understood very little of them.

It felt odd and disheartening to have written something down yet not grasp it. I initially thought the issue was with me. But that wasn't the case. The real issue was that simply reading notes does not equate to true learning. My notes were packed with information intended for a version of me that already had some understanding, and you can't develop comprehension just by rereading a summary. So, I decided to change my approach. Instead of reading, I had the concepts explained to me one question at a time, starting from the basics, using only simple examples. No technical terms were permitted until I had proven I could understand them.

By the end of the day, I had reconstructed the four main concepts of practical NLP from the ground up, and I used them to create a functioning app that answers questions based on my own notes. Here's the complete journey, with the same examples that finally helped me understand, along with the code for each step so you can follow along.

Start with the dumbest possible question

Imagine you want a computer to sort emails into "spam" and "not spam." Here's the catch: the computer has never read a word in its life. The only thing it can do, the only thing it has ever been able to do, is math on numbers.

So before it can sort a single email, one problem comes first: the words have to become numbers. That sentence turned out to be the foundation of everything. This whole branch of NLP comes down to one job: turn text into numbers without losing the meaning that matters.

My first instinct was to turn each letter into a number. It's a reasonable guess, and it's actually how computers store text underneath. But it captures spelling, not meaning. "win" and "bin" look almost identical as numbers, yet mean totally different things. The meaning of an email lives in its words, not its letters.
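Here's a quick sketch of that dead end, using Python's built-in `ord` to get each letter's code point. The helper name `letter_codes` is mine, purely for illustration.

```python
# Turn each letter into its Unicode code point (roughly how text is stored underneath).
def letter_codes(word):
    return [ord(ch) for ch in word]

win = letter_codes("win")
bin_ = letter_codes("bin")

print(win)   # [119, 105, 110]
print(bin_)  # [98, 105, 110]

# Count the positions where the two words share the same number.
shared = sum(a == b for a, b in zip(win, bin_))
print(shared)  # 2
```

Two of the three numbers match, yet the words have nothing to do with each other. The numbers track spelling, not meaning.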

Bag of Words: count, don't spell

So we work with whole words. Take three emails:

"win money now"
"money money money"
"see you tomorrow"

List every distinct word (that's the vocabulary), give each a column, and for each email count how many times each word appears. "win money now" becomes 1, 1, 1, 0, 0, 0. "money money money" becomes 0, 3, 0, 0, 0, 0. Suddenly every email is a row of pure numbers, and a computer can compare them.
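If you want to do the counting by hand before reaching for a library, a minimal sketch in plain Python could look like this. It assumes words are separated by spaces and keeps the vocabulary in order of first appearance, so the rows match the ones worked out above.

```python
emails = ["win money now", "money money money", "see you tomorrow"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for email in emails:
    for word in email.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row per email: how many times each vocabulary word occurs in it.
rows = [[email.split().count(word) for word in vocabulary] for email in emails]

print(vocabulary)  # ['win', 'money', 'now', 'see', 'you', 'tomorrow']
for row in rows:
    print(row)
# [1, 1, 1, 0, 0, 0]
# [0, 3, 0, 0, 0, 0]
# [0, 0, 0, 1, 1, 1]
```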

scikit-learn does exactly this in three lines:

from sklearn.feature_extraction.text import CountVectorizer

emails = ["win money now", "money money money", "see you tomorrow"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

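print(vectorizer.get_feature_names_out())  # the vocabulary (the columns), sorted alphabetically
print(X.toarray())                         # each row = one email as word-counts
# Column order is money, now, see, tomorrow, win, you, so "win money now"
# prints as [1 1 0 0 1 0] rather than the hand-worked 1, 1, 1, 0, 0, 0.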

That's it. That's the intimidating-sounding "Bag of Words." I had read about it and bounced off it; here I did it by hand in two minutes, then in three lines of code.

It's called a bag for a reason, and the reason is also its flaw. Imagine tossing a sentence's words into a paper bag and shaking it. You still know which words are inside and how many, but the order is gone. "win money now" and "now win money" produce the identical row. Usually harmless. But consider "dog bites man" versus "man bites dog": same three words, one is a boring Tuesday and the other is front-page news. To Bag of Words they're identical. That weakness is exactly why fancier methods had to be invented later.
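The shaken bag is easy to check with `collections.Counter` from the standard library. This is only a sketch of the idea; the helper name `bag` is made up for the example.

```python
from collections import Counter

def bag(sentence):
    # A bag keeps each word and its count, and nothing about position.
    return Counter(sentence.split())

boring = bag("dog bites man")
headline = bag("man bites dog")

print(boring == headline)  # True
```

Once the words are in the bag, the two sentences cannot be told apart.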

TF-IDF: not all words deserve equal weight

Bag of Words has a second problem: it treats every word as equally important. The word "the" is in nearly every email, so knowing an email contains "the" tells you nothing. Meanwhile a rare word like "viagra" is a screaming signal. Yet Bag of Words counts them the same.

TF-IDF fixes this with two dials multiplied together:

TF (term frequency): how often a word appears in this one email. If a message says "money money money money," it's clearly about money. Repetition means importance.
IDF (inverse document frequency): how rare the word is across the whole pile, measured by how many separate emails contain it. Common word → tiny score. Rare word → big score. ("Inverse" just means flipped: more documents, smaller score.)

Multiply them, and something beautiful happens. Take "the": it appears a lot (high TF) but it's everywhere (near-zero IDF), so high × almost-zero ≈ zero. The word silences itself. Nobody handed the computer a list of words to ignore. The math removed the filler on its own. Meanwhile "viagra" in a spam email is repeated (high TF) and rare (high IDF), so it lights up bright.
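To watch the two dials multiply, here's a minimal sketch using the plain textbook formulas: raw count for TF and log(N / df) for IDF. The three emails are invented for the example, and real libraries tweak these formulas, so treat the exact numbers as illustrative.

```python
import math

# Hypothetical mini-corpus, invented for illustration.
toy_emails = [
    "the viagra viagra offer",
    "the meeting is tomorrow",
    "the report is attached",
]
tokenized = [email.split() for email in toy_emails]

def tf(word, doc):
    # Term frequency: how many times the word appears in this one email.
    return doc.count(word)

def idf(word, docs):
    # Inverse document frequency: log(total emails / emails containing the word).
    # Assumes the word appears in at least one email.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

spam = tokenized[0]
print(tfidf("the", spam, tokenized))               # 0.0
print(round(tfidf("viagra", spam, tokenized), 2))  # 2.2
```

"the" is in all three emails, so its IDF is log(3 / 3) = 0 and the product vanishes. "viagra" is in one email out of three and appears twice there, so it scores 2 × log(3).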

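Same API, one word swapped. One caveat: by default scikit-learn smooths IDF and adds 1 to it, so a word that appears in every email is down-weighted rather than driven all the way to zero: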

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

print(X.toarray().round(2))  # same shape, but now weighted by rarity

I tested the idea on "win win win money the the" and correctly predicted "win" would score highest, repeated and rare, while "the" collapsed to nothing. The method had taught me to think the way it thinks.

Embeddings: giving words a place on a map

Both methods above share a blindness: to them, "money" and "cash" are as unrelated as "money" and "banana." Each word is just its own separate column, with nothing connecting them. They have no concept that two different words can mean nearly the same thing.

Embeddings fix this with one gorgeous idea: put every word on a map of meaning. Picture a giant map where each word is a dot. We place words that mean similar things near each other, and unrelated words far apart. "cash" sits right next to "money"; "banana" is way over by "apple." A word's coordinates on that map, written as numbers, are its embedding. "king" lands closer to "queen" than to "bicycle," because the map is organized by meaning.

But how does a computer that can't read decide where to put each word? Through the company a word keeps. It reads millions of sentences and notices that "money" and "cash" are both surrounded by the same neighbors: borrow, pay, bank, withdraw. Same company, same neighborhood. "banana" keeps different company (eat, peel, ripe, fruit), so it lands far away. The computer never knows what any word means; it just matches patterns of context. (The real map isn't flat with two directions; it has hundreds, which is why an embedding is a long list of numbers. The idea is identical: close = similar, far = different.)
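The "company a word keeps" idea can be sketched with nothing but counting. To be clear, this is not how a real embedding model is trained (those learn dense vectors with neural networks); it's a simple co-occurrence count, swapped in to show the intuition. The six sentences and both helper functions are invented for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
sentences = [
    "borrow money from the bank",
    "withdraw cash from the bank",
    "pay with money",
    "pay with cash",
    "peel the ripe banana",
    "eat a ripe banana",
]

def context_counts(target):
    # Count every other word that shares a sentence with the target word.
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

def cosine(a, b):
    # Cosine similarity between two count vectors (missing words count as 0).
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

money, cash, banana = (context_counts(w) for w in ("money", "cash", "banana"))

print(round(cosine(money, cash), 2))    # 0.83
print(round(cosine(money, banana), 2))  # 0.14
```

"money" and "cash" share most of their neighbors, so they score close to 1. "banana" only shares the word "the" with "money", so it lands far away.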

A pre-trained model hands you those coordinates directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["money", "cash", "banana"])

print(vecs.shape)  # (3, 384) → three words, each a dot with 384 coordinates

The kicker: it's not only single words. A whole sentence, or a whole note, can become a dot too, and that's the hinge the entire next step swings on.

RAG: I accidentally invented it

Here's where it all paid off. Imagine every note in my wiki is a dot on the meaning map. Then I type a question, and that question becomes a dot on the same map. To find the note that answers it, I just grab the nearest dots.

That's the entire idea behind RAG (Retrieval-Augmented Generation):

1. Turn the question into a dot (embed it).
2. Find the nearest note-dots on the map (retrieval).
3. Hand those notes to a language model and say "answer using only these."
4. The model writes an answer grounded in my actual notes instead of guessing.
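The steps above can be sketched on a toy map before any real model is involved. Everything here is made up for illustration: the three notes, their 2-D coordinates, and the question's coordinates. A real embedding model would supply the coordinates, and the final prompt would be sent to a language model rather than printed.

```python
import math

# Hypothetical notes with made-up 2-D coordinates; a real model would produce these.
toy_notes = {
    "Stemming chops word endings off.": (0.9, 0.1),
    "Lemmatization maps words to dictionary forms.": (0.8, 0.3),
    "Bananas are ripe when yellow.": (0.1, 0.9),
}

question = "How is stemming different from lemmatization?"
question_dot = (0.85, 0.15)  # step 1: the question as a dot (also made up)

# Step 2: retrieval = the notes whose dots sit closest to the question's dot.
nearest = sorted(toy_notes, key=lambda text: math.dist(toy_notes[text], question_dot))[:2]

# Step 3: build the instruction that would be handed to a language model.
prompt = (
    "Answer using ONLY these notes:\n"
    + "\n".join(nearest)
    + f"\n\nQUESTION: {question}"
)
print(prompt)
```

The banana note is far from the question's dot, so it never reaches the prompt.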

Most people learn RAG by copying a tutorial and have no idea what's happening underneath. I arrived at it by reasoning, which meant I understood every layer before I wrote a line of code.

Synapse: making it real

So I built it, in a Jupyter notebook, and named it Synapse, because a synapse is the connection between two neurons, which is exactly what embeddings do: connect related notes by meaning.

The pipeline is just four short steps built on the ideas above.

1. Load every note:

from pathlib import Path

WIKI = Path("path/to/my/wiki")
notes = [{"path": str(f), "text": f.read_text(encoding="utf-8")}
         for f in WIKI.rglob("*.md")]

print(f"Loaded {len(notes)} notes")

2. Embed each note into a dot on the map:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
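embeddings = model.encode([n["text"] for n in notes])  # shape (N, 384)
# Caveat: this model truncates long inputs (256 word pieces by default),
# so only the start of a long note ends up in its embedding.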

3. Retrieve the nearest notes to a question:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

question = "what is the difference between stemming and lemmatization?"
q_emb = model.encode([question])

scores = cosine_similarity(q_emb, embeddings)[0]  # "near = similar", as a number
top = np.argsort(scores)[::-1][:3]                # the 3 closest notes
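The `np.argsort(scores)[::-1][:3]` line packs three moves into one, so here's a small sketch that unpacks it. The five scores and file names are invented stand-ins for real similarity scores and note paths.

```python
import numpy as np

# Made-up similarity scores for five hypothetical notes.
toy_scores = np.array([0.12, 0.80, 0.35, 0.91, 0.05])
toy_paths = ["a.md", "b.md", "c.md", "d.md", "e.md"]

ascending = np.argsort(toy_scores)  # note indices, lowest score first
descending = ascending[::-1]        # flipped: highest score first
top3 = descending[:3]               # keep the three best matches

print(ascending.tolist())  # [4, 0, 2, 1, 3]
print(top3.tolist())       # [3, 1, 2]

# Show each retrieved note next to its score.
for i in top3:
    print(f"{toy_scores[i]:.2f}  {toy_paths[i]}")
```

Printing the scores next to the paths like this is also a cheap way to sanity-check what the retriever actually picked.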

4. Generate a grounded answer from those notes:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

context = "\n\n---\n\n".join(notes[i]["text"] for i in top)
prompt = f"""Answer the question using ONLY the notes below.
If the answer isn't in the notes, say you don't know.

NOTES:
{context}

QUESTION: {question}"""

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)

That last instruction ("if the answer isn't in the notes, say you don't know") is the anti-hallucination guard. It's the difference between a toy and something you can trust.

I asked it: "What's the difference between stemming and lemmatization?" It pulled up exactly the right notes and wrote an answer that reproduced my own comparison table: the tools, the "studies → studi" example, all of it. Not from the model's training. From my brain, retrieved and read back to me.

What actually made it click

An hour into all this, I'd gone from "I understand nothing" to deriving RAG on my own. The lesson wasn't really about NLP. It was about how to learn:

Reading is not understanding. I'd mistaken having notes for having knowledge. The fix was active discovery: being asked a question and having to reach for the answer, one small step at a time, every abstract idea anchored to something concrete (a paper bag, a map, a spam email). And then building, because nothing reveals what you don't understand faster than making it run.

If you're staring at material that won't stick, try this: stop reading it. Have it asked of you instead. Start from the dumbest possible question and refuse to move on until each piece genuinely makes sense. You can get further in a day than you'd believe.

Next up for Synapse: chunking notes for sharper retrieval, citing its sources, a real frontend, and a local-model version. The journey continues. I'll write that one up too.