Marko Frei

Build a RAG Chatbot From Scratch in About 40 Lines of Python

Large language models are often confidently wrong about anything they were not trained on: your internal docs, last week's release notes, that niche product you built. RAG (Retrieval-Augmented Generation) is a common fix. Instead of fine-tuning the model, you fetch the relevant text at question time and hand it to the model as context.

In this tutorial we will build a small but real RAG chatbot that answers questions about a private knowledge base. No heavy frameworks, so you can see every moving part. By the end you will have roughly 40 lines of Python that you can point at your own data.

How RAG works

The whole pipeline is five steps:

your docs --> chunk --> embed --> store
                                    |
question --> embed --> search ------+--> top matches --> LLM --> answer

In plain words: you break your documents into chunks, turn each chunk into a vector (an embedding), and keep them. When a question comes in, you embed it too, find the chunks whose vectors are closest, and paste those chunks into the prompt so the model answers from real information instead of guessing.
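To make the diagram concrete before any real model is involved, here is a toy sketch of the same flow. It swaps the embedding model for plain word counts, which is an assumption made purely for illustration: real embeddings capture meaning, while word counts only capture shared words. The names toy_embed, toy_cosine, and store are hypothetical and are not used in the rest of the tutorial.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def toy_cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

toy_chunks = [
    "The free plan includes 5 GB of storage.",
    "Photo backup runs on Wi-Fi by default.",
]

# chunk --> embed --> store
store = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

# question --> embed --> search
question = "How much storage does the free plan include?"
q_vec = toy_embed(question)
best_chunk, _ = max(store, key=lambda item: toy_cosine(q_vec, item[1]))

print(best_chunk)  # The free plan includes 5 GB of storage.
```

The winning chunk is what would be pasted into the prompt. Everything that follows is this same loop with a real embedding model and a real LLM.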

Setup

You need Python 3.9 or newer and three packages:

pip install sentence-transformers numpy anthropic

Embeddings will run locally through sentence-transformers, so that part is free and needs no API key. The only API call is the final answer generation. I am using Claude here, so grab a key and set it:

export ANTHROPIC_API_KEY=your_key_here

If you would rather use a different model, you only have to change one function at the end, and I will point out exactly where.

Step 1: Your knowledge base

For the demo I am using facts about a made-up product called Nimbus. The point is that no model was trained on this, so any correct answer has to come from retrieval.

documents = [
    "Nimbus is a cloud file storage service founded in 2021. The free plan includes 5 GB of storage and works on up to two devices.",
    "The Nimbus Pro plan costs $8 per month and includes 2 TB of storage, unlimited devices, and 90 days of version history.",
    "Nimbus supports automatic photo backup on iOS and Android. Backups run only on Wi-Fi by default, but you can turn on cellular backup in Settings.",
    "To share a file in Nimbus, right click it and choose Share, then set the link to view-only or edit. Shared links expire after 30 days unless you are on the Pro plan.",
]

Later you would swap this for your own files, a database dump, scraped pages, whatever.

Step 2: Chunk the text

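Retrieval works better when the text is split into small, focused pieces rather than giant blobs, because each embedding then represents one topic. Here is a simple word-based chunker. Consecutive chunks overlap by a few words, so text cut at a chunk boundary is repeated at the start of the next chunk and keeps some of its context.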

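def chunk_text(text, chunk_size=100, overlap=20):
    # Split text into windows of chunk_size words; neighbours share `overlap` words.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end; skip a redundant tail chunk
        start += chunk_size - overlap
    return chunks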

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

Our sample docs are short, so each becomes one chunk. With real documents this is where the splitting earns its keep.
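Because the sample docs never split, the overlap is invisible in the demo. The sketch below runs a compact copy of the word-window chunker on a hypothetical twelve-word string with deliberately tiny settings (chunk_size=5, overlap=2), just to show how neighbouring chunks share words. The placeholder words w1 to w12 are made up for the illustration.

```python
def chunk_text(text, chunk_size=100, overlap=20):
    # Compact copy of the word-window chunker so this snippet runs on its own.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
        start += chunk_size - overlap
    return chunks

sample = " ".join(f"w{i}" for i in range(1, 13))  # "w1 w2 ... w12"
pieces = chunk_text(sample, chunk_size=5, overlap=2)
for piece in pieces:
    print(piece)
# w1 w2 w3 w4 w5
# w4 w5 w6 w7 w8
# w7 w8 w9 w10 w11
# w10 w11 w12
```

Each chunk starts with the last two words of the previous one. With the defaults, that shared region is 20 words out of every 100.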

Step 3: Embed the chunks

An embedding is a list of numbers that captures meaning, so that similar text ends up with similar vectors. We load a small open model and encode every chunk once, up front.

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks)

all-MiniLM-L6-v2 is tiny, fast on a laptop, and produces 384-dimensional vectors. Good enough to learn with and surprisingly capable.

Step 4: Retrieve the closest chunks

To find relevant chunks we compare the question's vector to every chunk vector using cosine similarity, then keep the top matches.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=3):
    q = embedder.encode([query])[0]
    scores = [cosine(q, e) for e in chunk_embeddings]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

This brute-force loop is fine for a few thousand chunks. Past that you would reach for a real vector store, but the idea is identical: find the nearest vectors.
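Before reaching for a vector store, a cheaper intermediate step is to let numpy do the whole comparison in one go. The sketch below is one possible way to do that, not part of the tutorial's script: it normalises every vector to unit length so that a single matrix-vector product yields all cosine similarities. The function name top_k_cosine and the tiny 2-dimensional demo vectors are hypothetical, and the code assumes no vector is all zeros.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    # Assumes no zero-length vectors (they would cause a division by zero).
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Unit-length rows and query: their dot product equals cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query_vec / np.linalg.norm(query_vec)
    scores = unit_rows @ unit_query          # all similarities in one product
    order = np.argsort(scores)[::-1][:k]     # indices of the highest scores first
    return order, scores[order]

# Made-up 2-d vectors standing in for chunk embeddings.
demo_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_k_cosine(np.array([2.0, 0.0]), demo_vectors, k=2)
print(indices, scores)  # row 0 (same direction) ranks first, then row 2
```

In the tutorial's setting, matrix would be chunk_embeddings and query_vec the encoded question. The row normalisation could also be done once up front instead of on every query.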

Step 5: Generate the answer

Now we stuff the retrieved chunks into the prompt and ask the model to answer from them only. That last instruction helps keep it honest and cuts down on made-up answers.

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from your environment

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

This is the one function to change if you want a different provider. Swap the client and the create call for OpenAI, a local model through Ollama, or anything else, and the rest of the pipeline stays the same.
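One way to make that swap even smaller is to separate the prompt building from the model call. The sketch below is an optional refactor under that assumption: build_prompt, answer_with, and the two stand-in functions are hypothetical names, and the stand-ins exist only so the snippet runs without a model or an API key.

```python
def build_prompt(query, context_chunks):
    # Same prompt text as in answer(), built from a list of chunk strings.
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer_with(query, retrieve_fn, generate_fn):
    # retrieve_fn(query) -> list of chunk strings
    # generate_fn(prompt) -> answer string from whichever provider you use
    return generate_fn(build_prompt(query, retrieve_fn(query)))

# Hypothetical stand-ins for the retriever and the model.
def fake_retrieve(query):
    return ["The Nimbus Pro plan costs $8 per month."]

def echo_generate(prompt):
    return f"(a model would receive {len(prompt)} characters here)"

print(answer_with("How much is the Pro plan?", fake_retrieve, echo_generate))
```

With this shape, the Claude version of generate_fn would wrap the client.messages.create call shown above, and any other provider only needs its own function that takes a prompt string and returns text.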

Step 6: Talk to it

if __name__ == "__main__":
    while True:
        q = input("\nAsk about Nimbus (or 'quit'): ")
        if q.lower() == "quit":
            break
        print("\n" + answer(q))

Run it and try 'How much is the Pro plan?' or 'Do photo backups use cellular data?'. The bot retrieves the three closest chunks (k=3), which should include the right one, and answers from them. Ask something not in the docs, like 'Does Nimbus have a desktop app?', and it should tell you it does not know, which is exactly what you want.
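One caveat: retrieve always returns k chunks, even when nothing in the knowledge base is relevant, so the "I do not know" behavior rests entirely on the prompt instruction. A possible extra guard is to drop chunks whose similarity falls under a cutoff. The sketch below assumes you have (chunk, score) pairs available; the function name filter_by_score, the demo scores, and the 0.3 cutoff are all made up for illustration, and a sensible cutoff would have to be tuned on your own data.

```python
def filter_by_score(scored_chunks, min_score=0.3, k=3):
    # scored_chunks: iterable of (chunk_text, similarity) pairs.
    # min_score=0.3 is an arbitrary illustrative cutoff, not a recommended value.
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept[:k]]

# Invented scores, only to show the behavior.
demo_scores = [("pricing chunk", 0.62), ("sharing chunk", 0.18), ("backup chunk", 0.35)]

print(filter_by_score(demo_scores))       # ['pricing chunk', 'backup chunk']
print(filter_by_score(demo_scores, 0.9))  # []
```

When the filtered list comes back empty, the bot could answer "I do not know" directly and skip the model call.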

The whole thing

import numpy as np
from sentence_transformers import SentenceTransformer
from anthropic import Anthropic

documents = [
    "Nimbus is a cloud file storage service founded in 2021. The free plan includes 5 GB of storage and works on up to two devices.",
    "The Nimbus Pro plan costs $8 per month and includes 2 TB of storage, unlimited devices, and 90 days of version history.",
    "Nimbus supports automatic photo backup on iOS and Android. Backups run only on Wi-Fi by default, but you can turn on cellular backup in Settings.",
    "To share a file in Nimbus, right click it and choose Share, then set the link to view-only or edit. Shared links expire after 30 days unless you are on the Pro plan.",
]

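def chunk_text(text, chunk_size=100, overlap=20):
    # Split text into windows of chunk_size words; neighbours share `overlap` words.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end; skip a redundant tail chunk
        start += chunk_size - overlap
    return chunks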

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, k=3):
    q = embedder.encode([query])[0]
    scores = [cosine(q, e) for e in chunk_embeddings]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

client = Anthropic()

def answer(query):
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    while True:
        q = input("\nAsk about Nimbus (or 'quit'): ")
        if q.lower() == "quit":
            break
        print("\n" + answer(q))

Where to go from here

This is the real shape of RAG, just minimal. To take it toward production:

  • Swap the numpy loop for a vector store or index such as Chroma, FAISS, or pgvector once you have a lot of chunks.
  • Improve chunking. Splitting on sentences or headings usually beats a fixed word count.
  • Add citations by returning which chunk each answer came from, so users can verify.
  • Evaluate it. Write a handful of question and answer pairs and check retrieval is actually pulling the right chunks before you blame the model.
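As a starting point for the evaluation idea, here is a minimal sketch of a retrieval check. It counts how often a retriever returns a chunk containing an expected piece of text. The function name retrieval_hit_rate, the keyword retriever, and the test cases are hypothetical; the keyword retriever is only a stand-in so the snippet runs without an embedding model.

```python
def retrieval_hit_rate(retrieve_fn, test_cases, k=3):
    # test_cases: list of (question, text that must appear in a retrieved chunk).
    hits = 0
    for question, expected in test_cases:
        retrieved = retrieve_fn(question)[:k]
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(test_cases) if test_cases else 0.0

# Hypothetical stand-in retriever: returns chunks sharing any word with the question.
demo_chunks = [
    "The Nimbus Pro plan costs $8 per month.",
    "Backups run only on Wi-Fi by default.",
]

def keyword_retrieve(question):
    words = set(question.lower().split())
    return [c for c in demo_chunks if words & set(c.lower().split())]

cases = [
    ("What does the Pro plan cost", "$8 per month"),
    ("Do backups run on Wi-Fi", "Wi-Fi by default"),
    ("Is there a desktop app", "desktop app"),  # not in the docs, so a miss
]

rate = retrieval_hit_rate(keyword_retrieve, cases)
print(f"{rate:.2f}")  # 0.67 (two of the three cases are found)
```

In practice you would pass the tutorial's retrieve function instead of keyword_retrieve. A low hit rate points at chunking or embeddings rather than at the model.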

Your turn

That is a working RAG chatbot you can point at your own notes or docs today.

What would you feed it first? And if you have built RAG before, what tripped you up most: the chunking, the retrieval quality, or keeping the model from wandering off the context? Curious to hear in the comments.
Feel free to join our Discord server to discuss AI:
https://discord.gg/nWctKNRM
