Practical RAG, Part 1: The Simplest RAG That Actually Works

#ai #llm #rag #python

By Suman. Part 1 of the **Practical RAG** series. All code is in a runnable notebook: https://www.kaggle.com/code/sumannath88/ep01-simple-rag

Everyone talks about RAG. Far fewer people have built the simplest version end to end and looked at exactly where it falls over.

That's what this series does. We start with the most naive RAG pipeline that actually works, understand it completely, and then — one concrete problem at a time — make it better. No frameworks hiding the moving parts. Just Python you can read.

By the end of this post you'll have a working pipeline in about 40 lines that answers questions correctly — and you'll understand exactly why that success is misleading. Those hidden weaknesses are the roadmap for the rest of the series.

What RAG actually is

RAG — Retrieval-Augmented Generation — is one idea: before you ask the model a question, go find relevant text and paste it into the prompt. That's it. The "retrieval" finds the text; the "generation" is the LLM answering with that text in front of it.

Why bother? Because it lets a model answer questions about your data — documents it was never trained on — without fine-tuning, and it grounds answers in real sources instead of the model's memory.

The naive pipeline has five steps:

1. Load your documents
2. Chunk them into pieces
3. Embed each chunk into a vector
4. Retrieve the chunks most similar to the question
5. Generate an answer with those chunks as context

Let's build each one.

Setup

We'll use local embeddings (via sentence-transformers) so retrieval is free and needs no API key, and OpenRouter for generation because it exposes an OpenAI-compatible API across many models.

pip install sentence-transformers openai numpy

import os
import numpy as np

# On Kaggle, store OPENROUTER_API_KEY as a notebook Secret; elsewhere use an
# env var or paste it inline.
try:
    from kaggle_secrets import UserSecretsClient
    os.environ.setdefault(
        "OPENROUTER_API_KEY",
        UserSecretsClient().get_secret("OPENROUTER_API_KEY"),
    )
except ModuleNotFoundError:
    os.environ.setdefault("OPENROUTER_API_KEY", "sk-or-...")  # your key

LLM_MODEL   = "deepseek/deepseek-v4-flash"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
TOP_K = 3

The notebook runs on Kaggle, Colab, or locally. Embeddings are computed locally, so only generation touches the network.

1 & 2. Load and chunk

To keep everything self-contained, our "corpus" is a handful of short passages about planets. And our chunking strategy is the simplest one imaginable: one chunk per document.

DOCUMENTS = [
    "Mercury is the smallest planet ... no moons ...",
    "Venus is the hottest planet ... 465 degrees Celsius.",
    "Earth ... the only known world with liquid water and life ...",
    "Mars ... two small moons, Phobos and Deimos.",
    "Jupiter is the largest planet ... at least 95 known moons.",
    "Saturn ... famous for its prominent ring system ...",
]
chunks = DOCUMENTS  # naive: each doc is one chunk

This is fine because the passages are already short. Hold onto that caveat — it's the first thing that breaks on real data.
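
To make that caveat concrete, here is a minimal sketch of what splitting a longer document could look like. It is not what this post's pipeline does (we keep one chunk per document), and the helper name `chunk_words` and its `size` and `overlap` parameters are assumptions made up for illustration.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return windows

# Twelve placeholder "words" (w0 .. w11) stand in for a long document.
long_text = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(long_text, size=5, overlap=2)
print(pieces)
```

The overlap is there so a sentence that straddles a boundary still appears whole in at least one window. Whether word windows are the right unit at all is a separate question.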

3. Embed

An embedding turns text into a vector of numbers such that texts with similar meanings land near each other in the vector space. We compute one vector per chunk, once, up front.

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer(EMBED_MODEL)
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

We normalize the vectors so that cosine similarity — the standard measure of "how close are these two meanings" — collapses to a plain dot product.
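
If that claim feels abstract, a quick check with plain NumPy shows it. The two vectors below are made-up 3-dimensional stand-ins for real embeddings, and `cosine_similarity` and `normalize` are helper names invented for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Textbook cosine similarity: dot product divided by both vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v):
    """Scale a vector to unit length (L2 normalization)."""
    return v / np.linalg.norm(v)

# Made-up vectors; real embeddings have hundreds of dimensions.
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

full_cosine = cosine_similarity(a, b)
dot_of_units = float(np.dot(normalize(a), normalize(b)))

# Both numbers agree up to floating-point rounding.
print(round(full_cosine, 4), round(dot_of_units, 4))
```

Once every vector has length 1, the division in the cosine formula is a division by 1, so only the dot product is left. That is why a single matrix-vector product is enough to score every chunk.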

4. Retrieve

To answer a question, embed the question the same way, score it against every chunk, and keep the top k.

def retrieve(question, k=TOP_K):
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_embeddings @ q_emb        # cosine similarity
    top_idx = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_idx]

Ask "Which planet has the most moons?" and the Jupiter chunk comes back on top. No LLM involved yet — this is pure vector search.
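
To see the ranking step without loading a model, here is the same scoring logic on hand-made vectors. The 2-dimensional vectors are hypothetical stand-ins for embeddings, and `top_k` is a name chosen for this sketch rather than part of the notebook.

```python
import numpy as np

def top_k(chunk_vectors, query_vector, k):
    """Return (row_index, score) pairs for the k highest dot-product scores."""
    scores = chunk_vectors @ query_vector
    order = np.argsort(scores)[::-1][:k]  # sort ascending, reverse, keep first k
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical unit-length vectors standing in for three chunk embeddings.
toy_chunks = np.array([
    [1.0, 0.0],  # chunk 0
    [0.0, 1.0],  # chunk 1
    [0.6, 0.8],  # chunk 2
])
toy_query = np.array([0.0, 1.0])  # points the same way as chunk 1

results = top_k(toy_chunks, toy_query, k=2)
print(results)  # [(1, 1.0), (2, 0.8)]
```

Chunk 1 points in exactly the query's direction and scores 1.0, chunk 2 is close and scores 0.8, and chunk 0 is orthogonal to the query and drops out.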

5. Generate

Now stitch the retrieved chunks into a prompt and ask the model, instructing it to answer only from the provided context. That instruction is the heart of RAG discipline: it pushes the model to stay grounded in the retrieved text instead of guessing from memory, though it is a strong nudge rather than a guarantee.

from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def answer(question, k=TOP_K):
    retrieved = retrieve(question, k)
    context = "\n\n".join(f"[{i+1}] {c}" for i, (c, _) in enumerate(retrieved))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content, retrieved

answer("Which planet has the most moons?")[0]
# -> "Jupiter, with at least 95 known moons."

That's a complete RAG system. Load → chunk → embed → retrieve → generate.
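
When an answer looks off, it helps to see the exact text the model received. The sketch below pulls the prompt assembly into a standalone helper so it can be printed without a network call. `build_prompt` is a hypothetical name, and the chunk strings are placeholders rather than passages from the corpus.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble the grounded prompt from a question and a list of chunk strings."""
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Placeholder chunk texts, not the real passages from the corpus.
sample_chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_prompt("What does the first passage say?", sample_chunks)
print(prompt)
```

The numbered `[1]`, `[2]` markers cost nothing and make it easy to see which chunk an answer leaned on.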

It works — and that's the trap

Here's the twist: this pipeline handles the hard-looking questions just fine.

A question outside the corpus:

answer("How far is Pluto from the Sun?")[0]
# -> "I don't know."

Pluto isn't in our documents, and the model correctly refuses to invent an answer. Grounding is doing its job.

A comparison spanning two chunks:

answer("Which is hotter, Venus or Mercury, and why?")[0]
# -> "Venus is hotter (~465°C) because its thick CO2 atmosphere traps heat,
#     while Mercury has almost no atmosphere."

The answer lives across two chunks, and top-k retrieval pulls both. Correct, and even well-reasoned.

So naive RAG works. On these three questions it works flawlessly. And that is exactly the problem, because it's working on six clean, short, hand-picked paragraphs, and with TOP_K = 3 every query pulls in half the corpus. A small, tidy corpus hides every weakness the technique has.

The weaknesses hiding behind the demo — and the roadmap

Clean answers on toy data prove almost nothing. Each weakness below surfaces the moment you point naive RAG at real documents, and each is exactly what a later part of the series fixes:

Chunking is naive. One-chunk-per-document collapses when documents are long — the right passage gets buried in noise or split apart.
Retrieval is purely semantic. Exact keywords — names, IDs, error codes — can slip past vector similarity. Hybrid (keyword + vector) search helps.
No reranking. With hundreds of chunks, the top k by cosine similarity aren't reliably the most useful k.
No evaluation. We're eyeballing two answers. Without numbers, we can't tell whether any "improvement" actually improved anything.
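
As a taste of what "numbers" could mean for the last point, here is a minimal sketch of one common retrieval metric: the fraction of questions whose expected chunk shows up in the top k. This is an illustration under assumptions, not the harness the series builds. `hit_rate_at_k` and `fake_retrieve` are made-up names, and the canned results are invented so the block runs on its own.

```python
def hit_rate_at_k(retrieve_fn, labeled_questions, k=3):
    """Fraction of questions whose expected chunk appears in the top-k results."""
    if not labeled_questions:
        raise ValueError("need at least one labeled question")
    hits = 0
    for question, expected_chunk in labeled_questions:
        retrieved_texts = [text for text, _score in retrieve_fn(question, k)]
        if expected_chunk in retrieved_texts:
            hits += 1
    return hits / len(labeled_questions)

# Stand-in retriever returning (text, score) pairs, the same shape that
# retrieve() in this post returns. The data is invented for the example.
def fake_retrieve(question, k):
    canned = {
        "q1": [("chunk A", 0.9), ("chunk B", 0.5)],
        "q2": [("chunk C", 0.7), ("chunk A", 0.4)],
    }
    return canned[question][:k]

labeled = [("q1", "chunk A"), ("q2", "chunk B")]
score = hit_rate_at_k(fake_retrieve, labeled, k=2)
print(score)  # 0.5
```

Passing the retriever in as a function means the same measurement can be rerun unchanged after any later tweak to chunking or search.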

Part 2 takes on chunking and retrieval quality — and adds a small evaluation harness so every change from here on is measurable.

The full runnable notebook for this part is here: https://www.kaggle.com/code/sumannath88/ep01-simple-rag

If this was useful, follow along — the series gets more interesting as the naive version starts to hurt.

Next: Part 2 — Better chunks, hybrid retrieval, and how to actually measure RAG.