Building a RAG Application from Scratch: A Step-by-Step Guide

In this guide, we'll build a RAG pipeline from scratch in Python, with no LangChain and no LlamaIndex, so you actually understand every moving part before you reach for a framework. By the end you'll have a working system that can answer questions about your own documents.

What We're Building

A simple but complete pipeline:

Ingest documents and split them into chunks
Embed those chunks into vectors
Store the vectors in a searchable index
Retrieve the most relevant chunks for a given question
Generate an answer using an LLM, grounded in the retrieved context

[Documents] → [Chunking] → [Embeddings] → [Vector Store]
                                                  ↓
[User Question] → [Embed Query] → [Retrieve Top-K] → [LLM] → [Answer]

Prerequisites

pip install openai numpy tiktoken

You'll need an OpenAI API key (or swap in any embedding/chat model; the logic is the same). Set it as an environment variable:

export OPENAI_API_KEY="your-key-here"

Step 1: Chunking Your Documents

LLMs and embedding models have context limits, and stuffing an entire document into one embedding loses precision. You want chunks small enough to be specific but large enough to retain context.

import tiktoken

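def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by token count."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunks.append(encoding.decode(chunk_tokens))
        if end >= len(tokens):
            break  # this chunk reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # overlap keeps context across boundaries

    return chunks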

The overlap matters more than it looks. Without it, a sentence that explains a key fact can get cut in half across two chunks, and neither half retrieves well on its own.
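
To see what the overlap buys you, here is a minimal sketch of the same sliding-window idea applied to a plain list of integers standing in for token IDs. The helper name `sliding_windows` is illustrative and not part of the pipeline; it only assumes that the overlap is smaller than the window size.

```python
def sliding_windows(items: list[int], size: int, overlap: int) -> list[list[int]]:
    """Split items into windows of `size` that share `overlap` items with the next one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    for start in range(0, len(items), step):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break  # the last window already reaches the end
    return windows


tokens = list(range(10))  # stand-in for token IDs 0..9
windows = sliding_windows(tokens, size=4, overlap=2)
for window in windows:
    print(window)
```

Each window repeats the last two items of the one before it, so a fact that straddles a boundary still appears whole in at least one window.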

sample_doc = """
RAG stands for Retrieval-Augmented Generation. It combines a retrieval system
with a generative language model. Instead of relying solely on what the model
learned during training, RAG fetches relevant information from an external
knowledge source at query time and feeds it into the model's context window...
"""

chunks = chunk_text(sample_doc, chunk_size=50, overlap=10)
print(f"Created {len(chunks)} chunks")

Step 2: Generating Embeddings

Embeddings turn text into vectors of numbers that capture semantic meaning: similar concepts end up close together in vector space, even if the wording differs.
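
To make "close together" concrete, here is a small sketch that measures closeness with cosine similarity. The three vectors are made up by hand for illustration. Real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional vectors, not real model output
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
invoice = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))   # close to 1: nearly the same direction
print(cosine_similarity(cat, invoice))  # much lower: different direction
```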

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

For production use, batch your embedding calls instead of looping one at a time. It's significantly faster, because many chunks share a single network round trip:

def get_embeddings_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    texts = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
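
Embedding endpoints generally cap how much a single request may carry, so for a large corpus a common pattern is to send fixed-size groups. The sketch below is one way to do that. `embed_in_batches` and the default `batch_size` of 100 are hypothetical choices, and the embedding function is passed in so the grouping logic can be tried without an API key.

```python
from typing import Callable

EmbedFn = Callable[[list[str]], list[list[float]]]


def embed_in_batches(texts: list[str], embed_fn: EmbedFn, batch_size: int = 100) -> list[list[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings


# Stand-in embedder for the demo: records request sizes, returns fake 1-d vectors
calls: list[int] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(calls)    # sizes of the requests that were made
print(vectors)  # one vector per input, in input order
```

In the pipeline you would pass `get_embeddings_batch` as `embed_fn`. Check your provider's documentation for the actual per-request limits.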

Step 3: Building a Simple Vector Store

You don't need a full vector database to get started. For learning purposes (and even small production use cases), an in-memory store with cosine similarity works fine.

import numpy as np

class SimpleVectorStore:
    def __init__(self):
        self.chunks: list[str] = []
        self.embeddings: list[list[float]] = []

    def add(self, chunks: list[str], embeddings: list[list[float]]):
        self.chunks.extend(chunks)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        if not self.embeddings:
            return []

        query_vec = np.array(query_embedding)
        doc_matrix = np.array(self.embeddings)

        # Cosine similarity between query and every stored chunk
        similarities = doc_matrix @ query_vec / (
            np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
        )

        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [(self.chunks[i], float(similarities[i])) for i in top_indices]

This is the part frameworks abstract away, but seeing it written out matters: retrieval is just "find the vectors closest to my query vector." Everything else (Pinecone, Weaviate, pgvector, FAISS) is a more scalable, more optimized version of this same idea.
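
One common refinement of that idea, sketched here as an assumption rather than something the store above does: if every vector is scaled to unit length when it is added, cosine similarity reduces to a plain dot product, so the stored norms never need recomputing at search time. The names `normalize` and `top_k` are illustrative.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query: list[float], unit_vectors: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (index, similarity) for the k stored vectors closest to the query."""
    q = normalize(query)
    # For unit vectors, the dot product equals cosine similarity
    scores = [(sum(a * b for a, b in zip(q, vec)), i) for i, vec in enumerate(unit_vectors)]
    return [(i, score) for score, i in heapq.nlargest(k, scores)]


# Normalize once at "ingest" time
stored = [normalize(v) for v in ([1.0, 0.0], [0.0, 2.0], [3.0, 3.0])]
matches = top_k([2.0, 1.9], stored, k=2)
print(matches)
```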

Step 4: Putting Ingestion Together

def ingest_document(text: str, store: SimpleVectorStore):
    chunks = chunk_text(text)
    embeddings = get_embeddings_batch(chunks)
    store.add(chunks, embeddings)
    print(f"Ingested {len(chunks)} chunks")

store = SimpleVectorStore()
ingest_document(sample_doc, store)

In a real application, this is where you'd loop over a folder of PDFs, Markdown files, or scraped pages, extracting raw text from each before chunking.
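
For plain-text formats, that loop can be quite short. The sketch below collects Markdown and text files from a folder. `load_text_files` and its default extensions are hypothetical, and PDFs or HTML would first need a separate text-extraction step that is not shown here.

```python
import tempfile
from pathlib import Path


def load_text_files(folder: str, extensions: tuple[str, ...] = (".md", ".txt")) -> list[tuple[str, str]]:
    """Return (file name, text) pairs for every matching file under folder."""
    documents = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            documents.append((path.name, path.read_text(encoding="utf-8")))
    return documents


# Demo on a throwaway folder so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.md").write_text("# Notes\nRAG basics.", encoding="utf-8")
    Path(tmp, "image.png").write_bytes(b"\x89PNG")  # ignored: not a text extension
    loaded = load_text_files(tmp)

print([name for name, _ in loaded])
```

Each returned text can then be passed to `ingest_document(text, store)`.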

Step 5: Retrieval + Generation

This is the "RAG" part: retrieve relevant chunks, then hand them to the LLM as context.

def answer_question(question: str, store: SimpleVectorStore, top_k: int = 3) -> str:
    query_embedding = get_embedding(question)
    results = store.search(query_embedding, top_k=top_k)

    context = "\n\n---\n\n".join([chunk for chunk, score in results])

    prompt = f"""Answer the question using only the context below.
If the context doesn't contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )

    return response.choices[0].message.content

The "answer using only the context" instruction is doing real work here: it's what keeps the model grounded instead of falling back on its own training data when the retrieved chunks don't actually contain the answer.

answer = answer_question("What is RAG and how does it work?", store)
print(answer)
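
One way to back up the grounding instruction in code, sketched here under assumptions: drop retrieved chunks whose similarity falls below a cutoff, and skip the LLM call entirely when nothing survives. The `min_score` of 0.3 is an arbitrary placeholder that would need tuning for your embedding model, and `select_context` and `build_prompt` are hypothetical helpers, not part of the code above.

```python
from typing import Optional


def select_context(results: list[tuple[str, float]], min_score: float = 0.3) -> list[str]:
    """Keep only chunks whose similarity score reaches the cutoff."""
    return [chunk for chunk, score in results if score >= min_score]


def build_prompt(question: str, chunks: list[str]) -> Optional[str]:
    """Return a grounded prompt, or None when there is no usable context."""
    if not chunks:
        return None
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )


# Made-up search results in the (chunk, score) shape that store.search returns
results = [
    ("RAG combines retrieval with generation.", 0.62),
    ("Unrelated changelog entry.", 0.08),
]
kept = select_context(results)
prompt = build_prompt("What is RAG?", kept)
print(prompt)
```

When `build_prompt` returns `None`, the caller can return the "I don't have enough information" message directly instead of spending an LLM call.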

Putting It All Together

def build_rag_pipeline(documents: list[str]) -> SimpleVectorStore:
    store = SimpleVectorStore()
    for doc in documents:
        ingest_document(doc, store)
    return store

documents = [sample_doc, "Another document's text...", "..."]
store = build_rag_pipeline(documents)

while True:
    question = input("\nAsk a question (or 'quit'): ")
    if question.lower() == "quit":
        break
    print(answer_question(question, store))

Where This Breaks Down (and What to Do About It)

A pipeline this simple will work for a demo, but a few things will bite you at real scale:

Chunking is naive. Splitting purely by token count ignores document structure: it'll happily cut a chunk in the middle of a table or a code block. Better approaches split on semantic boundaries (paragraphs, sections, headers) and use libraries like langchain.text_splitter or custom logic that respects Markdown/HTML structure.
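
As a rough sketch of the custom-logic route, the function below packs whole paragraphs into chunks instead of cutting at a fixed offset. It counts words as a stand-in for tokens, which is an approximation, and `chunk_by_paragraph` is a hypothetical name. A paragraph longer than the budget is kept whole rather than split.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words where possible."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "One two three.\n\nFour five.\n\nSix seven eight nine."
paragraph_chunks = chunk_by_paragraph(text, max_words=5)
print(paragraph_chunks)
```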

Linear search doesn't scale. SimpleVectorStore.search compares your query against every stored vector, which is fine for a few thousand chunks but painfully slow at millions. At that scale you want an approximate nearest neighbor index (HNSW, IVF) via something like FAISS, Pinecone, Qdrant, or pgvector.

Retrieval quality matters more than people expect. Pure vector similarity sometimes pulls back chunks that are topically close but not actually useful. Hybrid search (combining vector similarity with keyword/BM25 search) and reranking (passing retrieved chunks through a smaller model that re-scores relevance) both noticeably improve answer quality.
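
One widely used way to combine a vector ranking with a keyword ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not comparable scores. The sketch below assumes you already have two ranked lists of chunk IDs. The IDs are made up, and `k=60` is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each appearance adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["c2", "c7", "c1"]   # hypothetical result of the similarity search
keyword_ranking = ["c7", "c4", "c2"]  # hypothetical result of a BM25/keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([doc_id for doc_id, _ in fused])
```

Chunks that rank well in both lists rise to the top, while chunks found by only one method still stay in the candidate pool.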

No metadata filtering. Real systems usually need to filter by source, date, user permissions, etc., before or alongside the similarity search, not just "find the closest vectors" in a single unfiltered pool.
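
A minimal sketch of filter-then-search, assuming each chunk is stored with a small metadata dictionary. The `Record` class, the `source` key, and the toy unit-length vectors are all hypothetical. Because the vectors are unit length, the dot product stands in for cosine similarity.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    text: str
    embedding: list[float]  # assumed to be unit length
    metadata: dict[str, str] = field(default_factory=dict)


def filtered_search(
    query: list[float], records: list[Record], filters: dict[str, str], top_k: int = 3
) -> list[tuple[str, float]]:
    """Keep records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in filters.items())
    ]
    scored = [(r.text, sum(a * b for a, b in zip(query, r.embedding))) for r in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


records = [
    Record("HR policy", [1.0, 0.0], {"source": "hr"}),
    Record("Eng runbook", [0.6, 0.8], {"source": "eng"}),
    Record("Eng postmortem", [0.0, 1.0], {"source": "eng"}),
]
hits = filtered_search([1.0, 0.0], records, filters={"source": "eng"})
print(hits)
```

The HR record is the closest vector to the query, but it never reaches the ranking step because the filter removes it first.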

No evaluation loop. It's easy to ship a RAG system that feels like it's working and is quietly hallucinating or retrieving the wrong chunks. Track retrieval precision and answer faithfulness, even informally, before trusting it in production.
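
Even an informal check helps on the retrieval side. The sketch below computes precision@k and recall@k over a tiny hand-labeled set. The chunk IDs and relevance labels are invented for illustration. Answer faithfulness needs a separate judgment step, by a human or a model, and is not covered here.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Each entry: (IDs the retriever returned, IDs a human marked as relevant)
eval_set = [
    (["c1", "c5", "c9"], {"c1", "c9"}),
    (["c3", "c4", "c8"], {"c2"}),
]
mean_precision = sum(precision_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
mean_recall = sum(recall_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
print(f"precision@3={mean_precision:.2f} recall@3={mean_recall:.2f}")
```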

Wrapping Up

The core idea behind RAG is simpler than the ecosystem around it suggests: embed your content, store the vectors, find the closest ones to a query, and feed them to an LLM as context. Everything else (vector databases, rerankers, hybrid search, agentic retrieval) is refinement on top of that same loop.

Building it from scratch once, even a version this minimal, makes it much easier to reason about what a framework like LangChain or LlamaIndex is actually doing under the hood, and where to look first when retrieval quality isn't good enough.

If you want to take this further, good next steps are swapping the in-memory store for FAISS or pgvector, adding a reranking step, and experimenting with chunk size/overlap on your own documents to see how much it affects answer quality.

Read the Complete Guide

This article walks you through building a RAG application from scratch. If you're new to Retrieval-Augmented Generation and want to understand the fundamentals, including what RAG is, how it works, its architecture, benefits, and real-world use cases, check out our complete guide.

📖 What Is Retrieval-Augmented Generation (RAG) in AI and How Does It Work?

Questions or improvements? Drop them in the comments. Happy to dig into any part of this in more depth.