Marko Frei

Posted on Jun 12

Build a RAG Chatbot From Scratch in About 40 Lines of Python

#ai #machinelearning #python #tutorial

Large language models are confidently wrong about anything they were not trained on: your internal docs, last week's release notes, that niche product you built. RAG (Retrieval-Augmented Generation) is a common fix. Instead of fine-tuning, you fetch the relevant text at question time and hand it to the model as context.

In this tutorial we will build a small but real RAG chatbot that answers questions about a private knowledge base. No heavy frameworks, so you can see every moving part. By the end you will have roughly 40 lines of Python that you can point at your own data.

How RAG works

The whole pipeline is five steps:

your docs --> chunk --> embed --> store
                                    |
question --> embed --> search ------+--> top matches --> LLM --> answer

In plain words: you break your documents into chunks, turn each chunk into a vector (an embedding), and keep them. When a question comes in, you embed it too, find the chunks whose vectors are closest, and paste those chunks into the prompt so the model answers from real information instead of guessing.

Setup

You need Python 3.9 or newer and three packages:

pip install sentence-transformers numpy anthropic

Embeddings will run locally through sentence-transformers, so that part is free and needs no API key. The only API call is the final answer generation. I am using Claude here, so grab a key and set it:

export ANTHROPIC_API_KEY=your_key_here

If you would rather use a different model, you only have to change one function at the end, and I will point out exactly where.
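
Before running anything, it can help to confirm that Python actually sees the key. The helper below is a small sketch of that check. The name require_api_key and its env parameter are my own for illustration, not part of the anthropic package.

```python
import os

def require_api_key(name="ANTHROPIC_API_KEY", env=None):
    """Return the key's value, or raise a readable error if it is unset or blank."""
    # env is injectable only so the check can be tried without touching os.environ.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Run: export {name}=your_key_here")
    return value
```

Calling require_api_key() at the top of the script turns a missing key into one clear message instead of an error deep inside the first API call.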

Step 1: Your knowledge base

For the demo I am using facts about a made-up product called Nimbus. The point is that no model was trained on this, so any correct answer has to come from retrieval.

documents = [
    "Nimbus is a cloud file storage service founded in 2021. The free plan includes 5 GB of storage and works on up to two devices.",
    "The Nimbus Pro plan costs $8 per month and includes 2 TB of storage, unlimited devices, and 90 days of version history.",
    "Nimbus supports automatic photo backup on iOS and Android. Backups run only on Wi-Fi by default, but you can turn on cellular backup in Settings.",
    "To share a file in Nimbus, right click it and choose Share, then set the link to view-only or edit. Shared links expire after 30 days unless you are on the Pro plan.",
]

Later you would swap this for your own files, a database dump, scraped pages, whatever.
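
For the "your own files" case, one possible loader is sketched below. It assumes your knowledge base is a folder of plain-text or Markdown files in UTF-8. The function name load_documents and the default patterns are my own choices for illustration.

```python
from pathlib import Path

def load_documents(folder, patterns=("*.txt", "*.md")):
    """Read every matching file under folder (recursively) into a list of strings."""
    root = Path(folder)
    docs = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            # errors="replace" keeps one badly encoded file from stopping the load.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if text:  # skip empty files
                docs.append(text)
    return docs
```

With that in place, documents = load_documents("my_notes") would stand in for the hard-coded list, where my_notes is whatever folder you point it at.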

Step 2: Chunk the text

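Retrieval works better when text is split into small, focused pieces rather than giant blobs, because each embedding then describes one topic instead of a blend of many. Here is a simple word-based chunker with a little overlap, so the words around each boundary appear in both neighboring chunks and a short sentence cut off by one chunk still shows up whole in the next.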

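def chunk_text(text, chunk_size=100, overlap=20):
    # A step of zero or less would make the loop below run forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window reached the end; a further chunk would only repeat its tail
        start += chunk_size - overlap
    return chunks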

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

Our sample docs are short, so each becomes one chunk. With real documents this is where the splitting earns its keep.
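
To see the overlap on something longer, here is a small demo. It restates the sliding-window idea in a compact helper (window_chunks is a name I made up for this snippet) and runs it on 240 numbered placeholder words, so the boundaries are easy to read.

```python
def window_chunks(words, chunk_size=100, overlap=20):
    """Slide a chunk_size-word window forward by chunk_size - overlap words."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# 240 numbered placeholder words: w0, w1, ..., w239.
demo_words = [f"w{i}" for i in range(240)]
demo_chunks = window_chunks(demo_words)

for n, chunk in enumerate(demo_chunks):
    print(n, chunk[0], "...", chunk[-1], f"({len(chunk)} words)")

# The last 20 words of one chunk are the first 20 words of the next.
assert demo_chunks[0][-20:] == demo_chunks[1][:20]
```

The three chunks start at w0, w80, and w160, so every boundary region is covered twice.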

Step 3: Embed the chunks

An embedding is a list of numbers that captures meaning, so that similar text ends up with similar vectors. We load a small open model and encode every chunk once, up front.

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks)

all-MiniLM-L6-v2 is tiny, fast on a laptop, and produces 384-dimensional vectors. Good enough to learn with and surprisingly capable.
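
Since the chunks are encoded once up front, you may want to keep the result on disk so restarting the script does not re-encode everything. Below is one possible sketch. The names load_or_encode and chunks_fingerprint and the embedding_cache folder are hypothetical. The encode argument is any callable that maps a list of strings to a 2-D array, such as embedder.encode.

```python
import hashlib
import json
from pathlib import Path

import numpy as np

def chunks_fingerprint(chunks):
    """Short, stable hash of the chunk texts, used to detect a stale cache."""
    payload = json.dumps(chunks, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

def load_or_encode(chunks, encode, cache_dir="embedding_cache"):
    """Return embeddings for chunks, reusing a saved .npy file if the text is unchanged."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{chunks_fingerprint(chunks)}.npy"
    if path.exists():
        return np.load(path)                 # cache hit: no encoding needed
    embeddings = np.asarray(encode(chunks))  # cache miss: encode once and save
    np.save(path, embeddings)
    return embeddings
```

In the tutorial's script this would read chunk_embeddings = load_or_encode(chunks, embedder.encode). Any change to the chunk text produces a new fingerprint, so edited documents are re-encoded automatically.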

Step 4: Retrieve the closest chunks

To find relevant chunks we compare the question's vector to every chunk vector using cosine similarity, then keep the top matches.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=3):
    q = embedder.encode([query])[0]
    scores = [cosine(q, e) for e in chunk_embeddings]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

This brute-force loop is fine for a few thousand chunks. Past that you would reach for a real vector store, but the idea is identical: find the nearest vectors.
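
Before reaching for a vector store, the same search can be done without the Python loop. This is a sketch, not the code used elsewhere in this post: if every chunk vector is scaled to unit length once, cosine similarity becomes a single matrix-vector product. The names normalize_rows and top_k are mine, and the 2-D vectors are toy stand-ins for real embeddings.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # clip avoids dividing by zero

def top_k(query_vec, unit_matrix, k=3):
    """Return (index, score) pairs for the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = unit_matrix @ q              # all similarities in one product
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors: row 0 points along x, row 1 along y, row 2 in between.
toy = normalize_rows([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k([2.0, 0.1], toy, k=2))
```

The query points almost straight along x, so row 0 ranks first and the diagonal row 2 second. Returning the scores alongside the indices also lets you notice when even the best match is weak.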

Step 5: Generate the answer

Now we stuff the retrieved chunks into the prompt and ask the model to answer from them only. That last instruction does most of the work of keeping it honest: it cuts down on made-up answers, though it does not eliminate them.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from your environment

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

This is the one function to change if you want a different provider. Swap the client and the create call for OpenAI, a local model through Ollama, or anything else, and the rest of the pipeline stays the same.
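
One way to make that swap explicit is to pass the generation step in as a function. This is a sketch of the idea rather than how the script above is wired. build_prompt reuses the prompt text from answer, while answer_with, fake_retrieve, and fake_generate are names I made up so the snippet runs without a network call or an API key.

```python
from typing import Callable, List

def build_prompt(query: str, context_chunks: List[str]) -> str:
    """Assemble the same grounded prompt that answer() builds."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer_with(query: str,
                retrieve: Callable[[str], List[str]],
                generate: Callable[[str], str]) -> str:
    """Retrieve, build the prompt, then hand generation to whichever provider you pass in."""
    return generate(build_prompt(query, retrieve(query)))

# Stand-ins for the real retriever and the real model call.
def fake_retrieve(query: str) -> List[str]:
    return ["The Nimbus Pro plan costs $8 per month."]

def fake_generate(prompt: str) -> str:
    return f"(a model would receive {len(prompt)} characters here)"

print(answer_with("How much is the Pro plan?", fake_retrieve, fake_generate))
```

Each provider then only needs a small generate(prompt) wrapper around its own client, and the fake versions double as a quick way to test the pipeline offline.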

Step 6: Talk to it

if __name__ == "__main__":
    while True:
        q = input("\nAsk about Nimbus (or 'quit'): ").strip()
        if q.lower() == "quit":
            break
        if not q:
            continue  # ignore empty input instead of sending it to the model
        print("\n" + answer(q))

Run it and try 'How much is the Pro plan?' or 'Do photo backups use cellular data?'. The bot pulls the right chunk and answers from it. Ask something not in the docs, like 'Does Nimbus have a desktop app?', and it should tell you it does not know, which is exactly what you want.

The whole thing

import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

documents = [
    "Nimbus is a cloud file storage service founded in 2021. The free plan includes 5 GB of storage and works on up to two devices.",
    "The Nimbus Pro plan costs $8 per month and includes 2 TB of storage, unlimited devices, and 90 days of version history.",
    "Nimbus supports automatic photo backup on iOS and Android. Backups run only on Wi-Fi by default, but you can turn on cellular backup in Settings.",
    "To share a file in Nimbus, right click it and choose Share, then set the link to view-only or edit. Shared links expire after 30 days unless you are on the Pro plan.",
]

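def chunk_text(text, chunk_size=100, overlap=20):
    # A step of zero or less would make the loop below run forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window reached the end; a further chunk would only repeat its tail
        start += chunk_size - overlap
    return chunks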

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=3):
    q = embedder.encode([query])[0]
    scores = [cosine(q, e) for e in chunk_embeddings]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

client = Anthropic()

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    while True:
        q = input("\nAsk about Nimbus (or 'quit'): ").strip()
        if q.lower() == "quit":
            break
        if not q:
            continue  # ignore empty input instead of sending it to the model
        print("\n" + answer(q))

Where to go from here

This is the real shape of RAG, just minimal. To take it toward production:

Swap the numpy loop for a vector database like Chroma, FAISS, or pgvector once you have a lot of chunks.
Improve chunking. Splitting on sentences or headings usually beats a fixed word count.
Add citations by returning which chunk each answer came from, so users can verify.
Evaluate it. Write a handful of question and answer pairs and check retrieval is actually pulling the right chunks before you blame the model.
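
For the evaluation point, a retrieval check can be very small. The sketch below assumes a simple scoring rule of my own choosing: a question counts as a hit if an expected phrase appears in any of the top k retrieved chunks. hit_rate and fake_retrieve are hypothetical names, and the question and phrase pairs come from the Nimbus facts above.

```python
def hit_rate(eval_set, retrieve, k=3):
    """Fraction of questions whose expected phrase appears in the top-k chunks."""
    hits = 0
    for question, expected_phrase in eval_set:
        retrieved = retrieve(question)[:k]
        if any(expected_phrase.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0

# Question / expected-phrase pairs taken from the Nimbus documents.
eval_set = [
    ("How much is the Pro plan?", "$8 per month"),
    ("Do photo backups use cellular data?", "cellular backup"),
    ("When do shared links expire?", "30 days"),
]

# Stand-in retriever so the snippet runs on its own; pass the real retrieve() instead.
def fake_retrieve(question):
    return ["The Nimbus Pro plan costs $8 per month and includes 2 TB of storage."]

print(f"hit rate: {hit_rate(eval_set, fake_retrieve):.2f}")
```

The stand-in always returns the pricing chunk, so it only scores one hit out of three. With the real retrieve function a low number here points at chunking or embeddings rather than at the model.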

Your turn

That is a working RAG chatbot you can point at your own notes or docs today.

What would you feed it first? And if you have built RAG before, what tripped you up most: the chunking, the retrieval quality, or keeping the model from wandering off the context? Curious to hear in the comments.
Feel free to join our Discord server to discuss AI:
https://discord.gg/nWctKNRM

Top comments (5)

Alex Shev • Jun 12

The 40-line version is a useful teaching tool because it exposes the actual moving parts: chunk, embed, retrieve, prompt, answer. That is the right mental model before adding frameworks.

The production jump is where the hard problems start: chunk quality, freshness, eval sets, citations, permissions, and knowing when retrieval returned weak evidence. Small demos are great as long as teams do not confuse the skeleton with the full reliability layer.

Marko Frei • Jun 12

Thanks, Alex. That's exactly the distinction I hoped readers would take away. The example focuses on exposing the core mechanics, while production systems require additional layers for retrieval quality, freshness, permissions, citations, evaluation, and monitoring. Appreciate the thoughtful addition.

Alex Shev • Jun 13

Exactly. That is why I liked the 40-line version: it makes the retrieval loop inspectable before the production concerns pile on. Once people can see the moving parts clearly, the next discussion about freshness, permissions, citations, and evals becomes much more concrete.

Gesner Deslandes • Jun 13

Great post, Marko – congratulations on such a clear and practical breakdown of RAG. You nailed the core insight: giving the model the right context at query time is the most efficient path to accurate answers.

I completely agree with your approach. These days, even if we can train a new AI from scratch, doing so often wastes a huge amount of time, compute, and data preparation. That’s why I prefer to use a Groq API key (or any fast inference endpoint) to make my websites and software interactively smart. When a code snippet or an attribution lacks something, I integrate the API directly – no heavy retraining, no endless fine‑tuning.

To me, it’s about calling a cat a cat, a dog a dog. When we deal with an eagle, we already know it’s a bird – it will fly. In other words, we should recognise the tool for what it is: a fast, reliable way to add intelligence without reinventing the wheel. The same RAG principle you explained so well is exactly how I keep my own projects lean, production‑ready, and easy to adapt.

Thanks again for sharing this. Looking forward to more of your tutorials.
