Navas Herbert

Posted on Jun 30

Day 1 of 3: We Built the Retrieval Half of RAG - Before Touching a Single AI Model

#python #ai #fastapi #beginners

I opened today's session with a sentence I wanted everyone to hold onto for the next three days:

"By the end of today, you will be able to upload a real document through an API endpoint, and have it automatically split into chunks, turned into vectors, and stored so it can be searched by meaning, not just by keyword. We will not touch any AI model today - that is on purpose. We want you to see retrieval working on its own first."

This is Day 1 of a 3-day Document Intelligence API build with the cohort 6 interns. Same FastAPI skills they used last week building the Grade Tracker (Depends, response models, endpoints), but a new kind of database underneath. Today was retrieval only. No LLM, no generation, no API key required. Just proving that the search half of the system actually works.

Here's how it went.

Three Ideas, Whiteboard Only, Ten Minutes

Before any code, I drew three ideas on the board and told the room: "I want you to be able to redraw this from memory by the end of today."

Embedding: a piece of text turned into a list of numbers, a vector, such that text with similar meaning ends up close together in that number space. "Cat" and "dog" land near each other. "Cat" and "spreadsheet" land far apart. The model was never told what a cat is - it learned this closeness from patterns in huge amounts of text.

Vector search: just nearest-neighbour search in that number space. Take a question, embed it the same way, find which stored vectors sit closest. That closeness is what "relevant" means here.

RAG (Retrieval-Augmented Generation): retrieve the relevant chunks first, then paste them into the LLM's prompt, so it answers using your data instead of guessing from what it memorised during training.

Question --> [embed] --> [search vectors] --> top-k chunks --> [stuff into prompt] --> LLM --> Answer + sources
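
To make the "embed, then search" steps of that diagram concrete, here is a minimal sketch of nearest-neighbour search in plain Python. The three-number vectors are invented for illustration and are not real embeddings; the names TOY_VECTORS, distance and nearest are hypothetical and not part of the project.

```python
import math

# Hand-made 3-number "embeddings", invented for illustration only.
# A real embedding model produces hundreds of numbers per piece of text.
TOY_VECTORS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "spreadsheet": [0.1, 0.2, 0.9],
}


def distance(a: list[float], b: list[float]) -> float:
    """Straight-line (Euclidean) distance between two vectors."""
    return math.dist(a, b)


def nearest(query_vec: list[float], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k stored items closest to query_vec, closest first."""
    scored = [(name, distance(query_vec, vec)) for name, vec in TOY_VECTORS.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Pretend this is what the question "kitten" looks like after embedding.
kitten = [0.85, 0.75, 0.15]
for name, dist in nearest(kitten):
    print(f"{name}: {dist:.3f}")
```

The only design choice here is "sort by distance, keep the first k". A vector database does the same job, just with far more vectors and a faster index.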

I said the one sentence I repeat every time I teach this, slowly, twice: "The LLM never sees your documents unless retrieval puts the right chunk into the prompt." That single idea, if it clicks on Day 1, makes the rest of the week make sense. Everything else is implementation detail around that one fact.

Setting Up: Real Folders, Built Live in the Terminal

No starter repo. No zip file. We built the project structure from nothing, together, in the terminal.

mkdir interdoc && cd interdoc
mkdir app && cd app
touch ingestion.py vectorstore.py main.py
cd ..  # back to interdoc/, where uvicorn and scratch.py are run later

Three empty files. I wrote their names on the board before we touched any of them - ingestion.py, vectorstore.py, main.py - and pointed at each one as we built it. This cohort already knows this rhythm from last week's database.py, models.py, schemas.py, crud.py, main.py build. One file, one job. Files talk to each other through clean function calls, not by reaching into each other's internals.

I also had the Day 1 goals written out as a checklist before we started - explain what an embedding is and why similar meanings end up as nearby vectors, explain the difference between keyword and vector search, upload a real file through your own endpoint and have it chunked and stored, explain why chunk overlap matters, run a search query with zero AI involved, and know exactly what's still missing to call this a real product.

That last line on the checklist - "hio ni kesho" (that's tomorrow) - was a deliberate boundary. Today is retrieval. Generation is Day 2's problem, not today's.

File 1: ingestion.py - Extract Text, Then Chunk It

"This file has exactly one job: take raw bytes from an uploaded file and turn them into a list of clean text chunks, ready to be embedded. It never talks to ChromaDB and it never talks to FastAPI."

import io
from pypdf import PdfReader

def extract_text(filename: str, file_bytes: bytes) -> str:
    if filename.lower().endswith(".pdf"):
        reader = PdfReader(io.BytesIO(file_bytes))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return file_bytes.decode("utf-8", errors="ignore")

io.BytesIO lets us treat raw bytes sitting in memory as if they were a file, without ever saving anything to disk. The uploaded file arrives as bytes - PdfReader expects something that behaves like a file - so this bridges the two. I made a point of telling the class: "Notice we are not saving the uploaded file anywhere. We pull the bytes straight out of memory, extract the text, and the original file is gone. For this project that's fine - we only care about the text inside, not the file itself."
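
A small standalone sketch of that bridge, using only the standard library. The sample bytes are made up; the point is that a BytesIO object supports the same read and seek calls as an open file, with nothing written to disk.

```python
import io

# Stand-in for the bytes an upload would hand us (made-up content).
file_bytes = "Chunking splits long text.\nOverlap keeps context.".encode("utf-8")

# Wrap the bytes so they can be read like an open file, entirely in memory.
buffer = io.BytesIO(file_bytes)

first_line = buffer.readline()  # reads up to and including the first newline
rest = buffer.read()            # reads whatever is left
buffer.seek(0)                  # rewind, exactly as with a real file

print(first_line.decode("utf-8").strip())
print(rest.decode("utf-8"))
```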

Someone asked: "What about Word documents? Scanned PDFs?" Honest answer: scanned PDFs need OCR - a separate tool that reads text out of images - and Word docs need a different library entirely. Today we keep it to PDF and plain text on purpose, so the lesson stays about retrieval, not file-format archaeology.

Then the function the whole session would orbit around:

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        if chunk.strip():
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

"This is the function I want you to be able to explain to a friend tonight. We cannot embed an entire document as one giant vector - it would be too vague, mixing twenty different ideas into one point in space. So we split the text into smaller pieces and embed each chunk separately."

We did it by hand on the board with tiny numbers - chunk_size=5, overlap=2, on a 12-word sentence. First chunk: words[0:5], which is words 0 through 4. Next start: 0 + 5 - 2 = 3. Second chunk: words[3:8], which is words 3 through 7. Words 3 and 4 appear in both chunks - that's the overlap. Why does that matter? If the answer to someone's question sits right at the boundary between two chunks, then without overlap you'd cut that sentence in half, and neither chunk would contain it whole.
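
The whiteboard arithmetic can be replayed in a few lines. This is a sketch only: chunk_windows is a hypothetical helper (not in ingestion.py) that returns the slice bounds chunk_text would use, and the twelve "words" are placeholders. Note that the end of each slice is exclusive.

```python
def chunk_windows(n_words: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) slice bounds a sliding window would produce.

    Assumes overlap < chunk_size, otherwise the loop would never advance.
    """
    windows = []
    start = 0
    while start < n_words:
        windows.append((start, min(start + chunk_size, n_words)))
        start += chunk_size - overlap
    return windows


words = [f"w{i}" for i in range(12)]  # placeholder 12-word "sentence"
for start, end in chunk_windows(len(words), chunk_size=5, overlap=2):
    print(f"words[{start}:{end}] ->", " ".join(words[start:end]))
```

Each window starts three words after the previous one (5 - 2), so every neighbouring pair of chunks shares two words.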

The REPL Moment - And a Real Typo

We tested chunk_text() live, in the interactive Python shell, before wiring up anything else.

from app.ingestion import chunk_text

text = "An embedding is a piece of text turned into a list of numbers (vector) - such that text with similar MEANING ends up close together in that number space"
chunks = chunk_text(text, chunk_size=15, overlap=5)
len(chunks)

I typed chunk[0] to look at the first piece - forgetting the s. Python answered immediately with NameError: name 'chunk' is not defined. Did you mean: 'chunks'?

I left it on screen and let the class read it out loud. Then corrected it: chunks[0].

'An embedding is a piece of text turned into a list of numbers (vector) -'

chunks[1]

'list of numbers (vector) - such that text with similar MEANING ends up close together'

I pointed directly at the repeated words - "list of numbers (vector) -" sitting at the start of chunks[1], the same five words that closed out chunks[0]. That's the overlap (overlap=5), working exactly as designed, visible in real output rather than just described in theory.

I told the room: "That NameError is going to happen to every single one of you this week, probably more than once. Python tells you exactly what it thinks you meant. Read the error before you panic." A typo in a live REPL is one of the most useful teaching moments available - it shows that errors are information, not failure.

We also deliberately walked through what happens if overlap >= chunk_size: start never moves forward (when the two are equal) or moves backward (when overlap is larger), so the while loop never terminates. I had a volunteer try it on a short string and watch it hang, then Ctrl+C out. A great bug to see once, on purpose, in a safe sandbox, rather than discover by accident at midnight.
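
Once you've seen the hang, one way to rule it out is to validate the arguments before looping. This is a sketch of that idea, not what we wrote in class: chunk_text_safe is a hypothetical name, and it uses range() with a fixed step in place of the while loop.

```python
def chunk_text_safe(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Same sliding window as chunk_text, but refuses settings that cannot advance."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # guaranteed to be >= 1 by the checks above
    return [" ".join(words[start:start + chunk_size]) for start in range(0, len(words), step)]


# The setting that hung the REPL is now rejected immediately.
try:
    chunk_text_safe("one two three four", chunk_size=2, overlap=2)
except ValueError as err:
    print("rejected:", err)

print(chunk_text_safe("a b c d e f g", chunk_size=3, overlap=1))
```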

File 2: vectorstore.py - Where Chunks Become Vectors

"File two is where chunks actually become vectors and get saved somewhere we can search later. Two new ideas live here: sentence-transformers, which turns text into numbers, and ChromaDB, which stores those numbers and knows how to search them."

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./data/chroma_db")

embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="documents",
    embedding_function=embedding_fn,
)

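def add_chunks(doc_id: str, chunks: list[str]) -> int:
    # Nothing to store (for example, a scanned PDF with no text layer):
    # return early rather than pass empty lists to collection.add().
    if not chunks:
        return 0
    # One ID and one metadata dict per chunk, so results can be cited later.
    ids = [f"{doc_id}_{i}" for i in range(len(chunks))]
    metadatas = [{"source": doc_id, "chunk_index": i} for i in range(len(chunks))]
    collection.add(documents=chunks, metadatas=metadatas, ids=ids)
    return len(chunks)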

def search_chunks(query: str, top_k: int = 4):
    return collection.query(query_texts=[query], n_results=top_k)

PersistentClient(path="./data/chroma_db") means Chroma writes everything to disk, in that folder. Stop the server, restart it tomorrow, your uploaded documents are still there. I made a point of this line specifically: "This single line is the entire 'database connection' for our vector store. There's no separate server running in the background like there would be with PostgreSQL. Chroma is a library that reads and writes a folder on disk. That's the whole infrastructure."

The embedding function deserved its own callout: "This model runs locally. It downloads once, the first time you use it, then every embedding after that happens on your own laptop, for free, forever. Compare that to the LLM we plug in tomorrow, which calls out to a paid API. Embeddings are cheap and local. Generation is the part that costs money."

I pointed directly at metadatas in add_chunks() - "this is the detail that turns a toy demo into a real product. Without it, we'd have vectors with no idea which document or which part of a document they came from, and no way to show a citation when we answer a question later."
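
To see what that bookkeeping looks like without touching ChromaDB, here is a small sketch that builds the same ids and metadatas lists as add_chunks(). The helper names build_records and cite, and the citation format, are assumptions made up for this example.

```python
def build_records(doc_id: str, chunks: list[str]) -> tuple[list[str], list[dict]]:
    """Build the ids and metadatas lists the same way add_chunks() does."""
    ids = [f"{doc_id}_{i}" for i in range(len(chunks))]
    metadatas = [{"source": doc_id, "chunk_index": i} for i in range(len(chunks))]
    return ids, metadatas


def cite(metadata: dict) -> str:
    """Format one metadata dict as a readable citation (the format is made up)."""
    return f"{metadata['source']}, chunk {metadata['chunk_index']}"


ids, metadatas = build_records("sample_notes.txt", ["first chunk", "second chunk"])
print(ids)
print([cite(m) for m in metadatas])
```

Because every stored vector carries its source and chunk_index, any search hit can be traced back to a specific place in a specific document.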

And search_chunks() got the moment of the file: "Sit with this for a second. One line of code is doing embed-the-question, then nearest-neighbour search across however many chunks we have stored, and returning the best matches with their distances. Everything before this was preparation. This line is the actual retrieval in Retrieval-Augmented Generation."

When someone asked why ChromaDB and not Pinecone, Weaviate, or pgvector - honest answer: those are excellent production choices, but every one of them needs an account, an API key, or a running server process. Chroma needs none of that, which is exactly why it's the right teaching tool this week. The concepts transfer directly the day you need a production-grade option.

File 3: main.py - Small on Purpose

"Notice how small this file is. ingestion.py and vectorstore.py do all the real work; main.py just connects an HTTP request to the right function calls."

from fastapi import FastAPI, UploadFile, File
from app.ingestion import extract_text, chunk_text
from app.vectorstore import add_chunks

app = FastAPI(title="Document Intelligence API")

@app.get("/")
def root():
    return {"status": "alive"}

@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    contents = await file.read()
    text = extract_text(file.filename, contents)
    chunks = chunk_text(text)
    n = add_chunks(doc_id=file.filename, chunks=chunks)
    return {"filename": file.filename, "chunks_created": n}

Same separation-of-concerns rule as crud.py last week - main.py doesn't extract text, doesn't embed anything, doesn't touch the database directly. It receives the HTTP request and hands the work to the right specialist file.

"Look at how this endpoint reads almost like a sentence: read the file, extract the text, chunk it, add it to the store, confirm it worked. Five lines, five clear steps. Every endpoint we write this week follows this same 'receive, delegate, confirm' shape."

Starting the Server - Another Honest Typo

uvicorn app.main:app --reload

First attempt, I fat-fingered it: --relaod. Uvicorn didn't recognise the flag and complained. I caught it, fixed the spelling, and reran it.

I didn't hide that moment from the class either. "This is what live coding actually looks like. I've typed this command hundreds of times and I still mistype it occasionally. The terminal tells you exactly what went wrong. Read it, fix it, move on." Once corrected, the server started cleanly with auto-reload watching for file changes.

We opened http://127.0.0.1:8000/docs, expanded POST /upload, clicked Try it out, chose a short sample text file, and hit Execute.

{
  "filename": "sample_notes.txt",
  "chunks_created": 4
}

A real number greater than zero. The whole pipeline - upload, extract, chunk, embed, store - had just run, end to end, through an actual HTTP request.

I warned the room ahead of time: if an upload produces zero chunks, the most common cause is a scanned PDF with no extractable text layer. pypdf's page.extract_text() finds nothing on any page in that case, so extract_text() hands back only blank lines and chunk_text() has nothing to split. Always keep a backup plain-text file ready for exactly this situation.
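
The zero-chunk case is easy to reproduce without a server or a PDF. Below is a minimal sketch of the same "receive, delegate, confirm" steps with FastAPI and ChromaDB swapped out: InMemoryStore and ingest are hypothetical stand-ins, chunk_text is a condensed version of the one in ingestion.py, and only the plain-text branch of extraction is shown.

```python
class InMemoryStore:
    """Hypothetical stand-in for vectorstore.py: keeps chunks in a dict, no embeddings."""

    def __init__(self) -> None:
        self.chunks: dict[str, str] = {}

    def add_chunks(self, doc_id: str, chunks: list[str]) -> int:
        for i, chunk in enumerate(chunks):
            self.chunks[f"{doc_id}_{i}"] = chunk
        return len(chunks)


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Condensed sliding-window chunker (assumes overlap < chunk_size)."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def ingest(filename: str, file_bytes: bytes, store: InMemoryStore) -> dict:
    """The body of the upload endpoint, minus HTTP: extract, chunk, store, confirm."""
    text = file_bytes.decode("utf-8", errors="ignore")  # plain-text branch only
    chunks = chunk_text(text)
    n = store.add_chunks(doc_id=filename, chunks=chunks)
    return {"filename": filename, "chunks_created": n}


store = InMemoryStore()
sample = ("word " * 700).encode("utf-8")  # 700 placeholder words
print(ingest("sample_notes.txt", sample, store))
print(ingest("empty.txt", b"", store))  # no text in, zero chunks out
```

Because the endpoint only delegates, the same pipeline can be exercised from a script or a test with nothing but bytes in hand.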

The Payoff: Proving Retrieval Works, Zero LLM Involved

"This next part is the payoff for today. We are going to prove that our system can already find the right information, before we ever connect an AI model."

A throwaway script, scratch.py, in the project root:

from app.vectorstore import search_chunks

results = search_chunks("what is this document about?", top_k=3)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"[{meta['source']} chunk {meta['chunk_index']}] (distance={dist:.3f})")
    print(doc[:150], "...")
    print()

python scratch.py

Real chunks came back, each with a distance score attached. I read the output out loud and pointed straight at the distance numbers: "Lower means more similar. That's it. That's the whole mystery of 'similarity search': it's just a number, and we are sorting by it."
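
One detail in scratch.py that trips people up is the [0] after each key. The query takes a list of questions, so each key in the result holds one inner list per question, and we asked only one. The sketch below uses a hand-written dictionary in that shape; the texts and distances are invented, and flatten_hits is a hypothetical helper.

```python
# Hand-written stand-in for the result of ONE query.
# The texts and distances are invented; only the nesting matters here.
results = {
    "ids": [["notes.txt_0", "notes.txt_2"]],
    "documents": [["First chunk text", "Third chunk text"]],
    "metadatas": [[
        {"source": "notes.txt", "chunk_index": 0},
        {"source": "notes.txt", "chunk_index": 2},
    ]],
    "distances": [[0.412, 0.87]],
}


def flatten_hits(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists for one query into a list of hit dicts."""
    return [
        {
            "text": doc,
            "source": meta["source"],
            "chunk_index": meta["chunk_index"],
            "distance": dist,
        }
        for doc, meta, dist in zip(
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


# Smallest distance first: the most similar chunk leads the list.
hits = sorted(flatten_hits(results), key=lambda hit: hit["distance"])
for hit in hits:
    print(f"{hit['distance']:.3f}  {hit['source']} chunk {hit['chunk_index']}")
```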

Then the checkpoint question that ties the whole day together: change the question in scratch.py to something completely unrelated to the uploaded document, rerun it, and watch what happens to the distance values. When the question has nothing to do with what's stored, the distances jump noticeably higher - visible, numeric proof that the system genuinely doesn't know things it was never given.
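
If you want to turn that observation into a rough check, one option is a distance cutoff. This is a sketch under loud assumptions: the distances below are invented, the cutoff value is hypothetical, and looks_answerable is not part of the project. A sensible cutoff depends on your documents and embedding model, so pick it by comparing real on-topic and off-topic runs.

```python
def looks_answerable(distances: list[float], max_distance: float) -> bool:
    """True if at least one retrieved chunk is closer than max_distance."""
    return any(d <= max_distance for d in distances)


# Invented distances standing in for two runs of scratch.py.
on_topic = [0.62, 0.71, 0.95]
off_topic = [1.58, 1.63, 1.71]

CUTOFF = 1.2  # hypothetical value, chosen only to separate the two lists above

print(looks_answerable(on_topic, CUTOFF))
print(looks_answerable(off_topic, CUTOFF))
```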

What Exists Right Now, With Zero AI Generation Involved

I said this out loud at the end of the session, slowly:

"A real API endpoint that accepts any PDF or text file, automatically splits it into overlapping chunks, turns those chunks into vectors using a local embedding model, and stores them so they can be found again by meaning, not by exact keyword match. That's the entire retrieval half of RAG - working, and testable, right now."

Tomorrow we build the other half: taking those retrieved chunks, handing them to an LLM, and getting back an answer with citations. That's when this stops being a search engine and becomes a product.

What I Noticed Teaching This Session

1. The "redraw it from memory" bar works. Telling students up front that they'll need to reproduce the embeddings diagram themselves changes how closely they listen to the first ten minutes. Theory with an explicit accountability check lands differently than theory alone.

2. Real typos teach better than clean demos. The chunk[0] NameError and the --relaod flag mistake weren't planned, but I didn't edit either one out of how I narrated the session. Seeing an instructor make and fix a real mistake, live, removes a huge amount of fear from a beginner's own inevitable errors.

3. Distance numbers demystify "AI search" instantly. The moment students see an actual float next to each retrieved chunk, and watch that number change when the question stops being relevant, "semantic search" stops being a buzzword and becomes something they can reason about directly.

4. Drawing the line at "no LLM today" was the right call. It would have been tempting to wire in generation on Day 1 for a flashier demo. Keeping retrieval isolated meant every student could see, concretely, exactly what part of the system was doing the finding - before any model got a chance to paper over a weak retrieval step with a confident-sounding answer.

Homework Before Day 2

Upload 2–3 real documents - your own notes, a resume, a public FAQ page saved as text, anything
Edit scratch.py and try at least 5 different questions against your own documents
Try a question that should not be answerable from your documents, and look closely at the distance values compared to one that should be

Next: generation, citations, and the moment this stops being a search engine.

I'm a data trainer in Nairobi running a full data programme: Python foundations → Data Science or Data Engineering specialisations.

Follow along or drop your questions in the comments.
