Abdullah Sheikh

Posted on Jun 7

How to Build a RAG System with Your Own Documents in 7 Simple Steps

#rag #aiassistants #nocode #knowledgemanagement

Create a Retrieval‑Augmented Generation pipeline that answers questions from your private data with only a little code

Before We Start: What You'll Walk Away With

By the end of this tutorial you’ll be able to point a chatbot at any collection of PDFs, Word docs, or wiki pages and get accurate answers in seconds.

First, you’ll grasp the three moving parts of a RAG system—the retriever, the generator, and the glue that stitches them together—just like understanding the driver, the map, and the GPS signal when you order a ride.

Next, you’ll spin up a fully functional pipeline using only free or freemium services, so there’s no need to hire a data‑science team or rent pricey GPU servers.

Finally, you’ll have a ready‑to‑plug component you can drop into a Slack bot or a lightweight web UI, turning your knowledge base into an instant assistant.

Component view: Retriever (searches your docs), Generator (writes the answer), Orchestrator (passes the query).
Toolchain: Vector store (e.g., Pinecone free tier), LLM API (OpenAI’s chat model with free credits), No‑code connector (Zapier or Make).
Deployment: Wrap the endpoint with a small FastAPI app or use a serverless function, then hook it to Slack or a static HTML page.

Think of it like packing a suitcase: you pick the right items (documents), arrange them efficiently (vector index), and add a travel guide (LLM) that tells you exactly where everything is when you need it.

When you finish, you’ll have a RAG workflow you can replicate for any new project, with little code beyond a short script and a few configuration values.

Ready to start building?

What Retrieval‑Augmented Generation Actually Is (No Jargon)

Think of a RAG system as a two‑person team: one side is a chatty, well‑read friend (the generator) and the other side is a tidy filing cabinet (the retriever) that holds all the docs you care about.

The friend knows how to string words together fluently, but they don’t remember the latest policy updates or product specs. When you ask a question, the friend first asks the filing cabinet for the most relevant pages, then crafts an answer that blends that fresh info with their natural language skill.

In practice, the retriever runs a quick search across your PDFs, Word files, or wiki, returning a handful of snippets. Those snippets are fed to the generator, which writes a response that feels like a conversation yet stays grounded in your actual documents.

Imagine ordering a custom sandwich. The chef (generator) can make any combination, but they need to know which ingredients you have in the fridge (retriever). They glance at the inventory, pick the freshest lettuce and the exact cheese you stocked, then assemble a sandwich that tastes right and matches what you asked for.

The payoff is simple: the AI assistant answers with up‑to‑date, company‑specific facts while keeping the flow of a casual chat. No massive model training, no data‑science squad—just a searchable index and a language model working together.
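To make the two roles concrete, here is a minimal sketch of the retrieve-then-generate flow. It is a toy under stated assumptions: the snippets are made up, retrieve ranks by shared words as a stand-in for real vector search, and build_prompt stops at the prompt instead of calling an LLM.

```python
import re
from collections import Counter

# Hypothetical mini knowledge base; in a real system these are chunks of your documents.
SNIPPETS = [
    "Refunds are processed within 14 days of the return request.",
    "Enterprise plans include single sign-on and a dedicated support channel.",
    "The office is closed on public holidays.",
]

def words(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, snippets, k=2):
    """Rank snippets by how many question words they share (a stand-in for vector search)."""
    q = Counter(words(question))
    scored = []
    for s in snippets:
        overlap = sum((q & Counter(words(s))).values())
        scored.append((overlap, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop snippets that share no words with the question.
    return [s for score, s in scored[:k] if score > 0]

def build_prompt(question, snippets):
    """Combine retrieved snippets and the question into one prompt for the generator."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

hits = retrieve("How long do refunds take?", SNIPPETS)
print(build_prompt("How long do refunds take?", hits))
```

The filing cabinet is retrieve, and the friend is whatever model receives the string from build_prompt. Swapping word overlap for embeddings changes the ranking quality, not the shape of the flow.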

The 3 Mistakes Everyone Makes With RAG

Most people hit a wall early because they treat a RAG system like a generic search engine.

Using a generic web‑search index instead of a vector store. Imagine ordering a pizza and getting a sushi menu – the results are technically “searchable” but completely useless. A vector store embeds your own PDFs and Word files, so similarity matches stay on topic.
Feeding the whole document into the LLM at once. This is like trying to stuff an entire suitcase into a carry‑on; you’ll exceed the size limit and end up with broken zippers. Large language models have a fixed context window measured in tokens; go past it and the request is rejected or the text gets cut off, and the model is left guessing about whatever it never saw.
Ignoring chunk‑size and overlap. Picture Google Maps without street‑level detail: you see the city, but you can’t navigate the alleys. If you split text into chunks that are too big or have no overlap, the similarity search loses the context needed for accurate answers.
Fix #1: Set up FAISS or Pinecone as your vector store; index only the embeddings from your own docs.
Fix #2: Chunk documents into 300‑500 token pieces before embedding; keep each piece well under the model’s limit.
Fix #3: Add a 50‑token overlap between consecutive chunks so concepts that span sections stay linked.

Skip these pitfalls and your RAG system will actually deliver the answers you need.
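Fixes #2 and #3 can be sketched in a few lines. This is a rough illustration, not a tuned splitter: chunk_words is a hypothetical helper, and it counts whitespace-separated words as a stand-in for model tokens, so real token counts will run somewhat higher.

```python
def chunk_words(text, size=400, overlap=50):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Tiny demo with size=4 and overlap=1 so the shared word is easy to spot.
sample = " ".join(f"w{i}" for i in range(10))
for c in chunk_words(sample, size=4, overlap=1):
    print(c)
```

Each chunk starts with the last word of the previous one, which is what keeps a sentence that straddles a boundary searchable from both sides.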

How to Build a RAG System: Step‑by‑Step

Grab your docs, clean them up, and you’re ready to feed them to the AI.

Gather & clean – collect every PDF, DOCX, or Markdown file. If any PDFs are scanned, run an OCR pass so they become searchable plain text. Think of it like washing dishes before you start cooking; you don’t want crumbs in the sauce.
Chunk the text – split the cleaned text into overlapping pieces, about 500 tokens each with a 100‑token overlap. Overlap works like a puzzle border: it gives the model context from the neighboring piece.
Create embeddings – send each chunk to a lightweight model such as text-embedding-ada-002 or a HuggingFace sentence‑transformers model. The output is a dense vector that captures the meaning of the chunk.
Store vectors – push the embeddings into a vector database like Pinecone, Weaviate, or Qdrant. The DB acts as a fast “Google Maps” for similarity: it finds the nearest points (chunks) to a query.
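If "finds the nearest points" feels abstract, here is roughly what a similarity lookup does, shown as a brute-force sketch in plain Python. The 3-dimensional vectors and chunk names are invented for the demo, and real vector databases use approximate indexes rather than scanning every vector like this.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to query_vec, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional embeddings; real models return hundreds or thousands of dimensions.
toy_index = {
    "pricing": [0.9, 0.1, 0.0],
    "security": [0.0, 0.2, 0.9],
    "refunds": [0.7, 0.6, 0.1],
}
print(top_k([1.0, 0.2, 0.0], toy_index))
```

The query vector points mostly along the first axis, so the pricing and refunds chunks rank above security. A hosted vector store does the same comparison, just at scale and with metadata attached.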

Build the retrieval API – write a tiny endpoint that:

accepts a user query,
asks the vector store for the top‑k similar chunks,
concatenates the query with those chunks, and
calls an LLM to generate the answer.

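# Written for openai>=1.0 and pinecone>=3.0; older client versions use different calls.
import os

from fastapi import FastAPI
from openai import OpenAI
from pinecone import Pinecone
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-index")

class Question(BaseModel):
    question: str

@app.post("/ask")
def ask(req: Question):
    # Embed the question with the same model that embedded the chunks.
    q_emb = client.embeddings.create(model="text-embedding-ada-002", input=req.question).data[0].embedding
    # Fetch the five nearest chunks; each match keeps its source text in metadata["text"].
    results = index.query(vector=q_emb, top_k=5, include_metadata=True)
    context = "\n\n".join(m.metadata["text"] for m in results.matches)
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only this context:\n\n" + context},
            {"role": "user", "content": req.question},
        ],
    )
    return {"answer": answer.choices[0].message.content}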
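One detail the endpoint glosses over is how much retrieved text to pass along. A minimal sketch of a prompt builder with a size budget follows; build_messages and the 6000-character default are assumptions for illustration, and characters are only a rough proxy for tokens.

```python
def build_messages(question, chunks, max_context_chars=6000):
    """Build chat messages, adding chunks in rank order until the character budget is used up."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # stop before the context grows past the budget
        selected.append(entry)
        used += len(entry)
    system = (
        "Answer the question using only the numbered context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        + "\n\n".join(selected)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# The second chunk is deliberately oversized, so only the first one makes it in.
msgs = build_messages("What is the refund window?", ["Refunds take 14 days.", "x" * 10000])
print(msgs[0]["content"])
```

Numbering the chunks also makes it easy to ask the model to cite which snippet it used, which helps when users flag a wrong answer.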

Wrap it in a UI – expose the API through a Slack bot, a Streamlit dashboard, or a Teams connector. This is the “serving plate” that lets users ask questions without seeing the code.
Test & monitor – run sample queries, measure latency, and add a relevance feedback button so users can flag wrong answers. Over time, you can tune chunk size or top‑k to improve results.
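For the testing step, a small harness is usually enough to start. The sketch below assumes you can write down, for each sample question, the file that should show up in the results; evaluate, fake_retriever, and the file names are all hypothetical stand-ins for your own retriever and documents.

```python
import time

def evaluate(retriever, cases, k=5):
    """Run each (question, expected_source) case and report hit rate and average latency."""
    hits, total_seconds = 0, 0.0
    for question, expected_source in cases:
        start = time.perf_counter()
        sources = retriever(question, k)
        total_seconds += time.perf_counter() - start
        if expected_source in sources:
            hits += 1
    n = len(cases)
    return {
        "hit_rate": hits / n if n else 0.0,
        "avg_latency_s": total_seconds / n if n else 0.0,
    }

def fake_retriever(question, k):
    """Hypothetical stand-in for a real retriever: maps a keyword to the files it would return."""
    table = {"refund": ["billing.pdf", "faq.md"], "sso": ["security.pdf"]}
    for keyword, sources in table.items():
        if keyword in question.lower():
            return sources[:k]
    return []

cases = [
    ("How do refunds work?", "billing.pdf"),
    ("Is SSO supported?", "security.pdf"),
    ("What is the vacation policy?", "hr.pdf"),
]
report = evaluate(fake_retriever, cases)
print(report)
```

Re-running the same cases after changing chunk size or top‑k gives you a before-and-after number instead of a gut feeling.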

Follow these seven steps and you’ll have a functional RAG system ready for real users.

A Real Example: Turning a Marketing Playbook Into an AI Coach

Maya, a SaaS marketing manager, wants her team to pull answers from the 2023 Playbook as fast as ordering a coffee.

She drops the 42 PDFs into a shared Google Drive folder and runs the free pdf2txt CLI to pull raw text into a folder called playbook_raw.
Using a tiny Python script she slices each document into 600‑token chunks with a 150‑token overlap—think packing a suitcase so nothing gets cramped.
Each chunk is sent to OpenAI’s ada‑002 embedding endpoint; the resulting vectors are saved locally.
She spins up a free Qdrant cloud instance, creates a collection, and pushes the embeddings along with their source IDs.
A lightweight FastAPI service reads a query, fetches the top 4 closest chunks from Qdrant, and returns them as JSON.
The API is wired to a Slack slash command /playbook. When a teammate types a question, Slack forwards it to the FastAPI endpoint, which then calls gpt‑4‑turbo with the retrieved snippets to generate the final answer.

The whole flow feels like using Google Maps: you input a destination (the question), the service finds the best roads (relevant chunks), and then gives you turn‑by‑turn directions (the AI‑crafted answer).
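The Slack hop is mostly plumbing. Slack sends slash commands as a form-encoded POST, and a reply is a small JSON payload. The sketch below shows that translation with the standard library only; parse_slash_command and slack_reply are hypothetical helper names, and the sample body is made up.

```python
from urllib.parse import parse_qs

def parse_slash_command(body: bytes) -> dict:
    """Extract the fields a RAG endpoint needs from Slack's form-encoded request body."""
    form = parse_qs(body.decode("utf-8"))

    def first(key: str) -> str:
        # parse_qs returns a list per key; missing keys fall back to an empty string.
        return form.get(key, [""])[0]

    return {
        "question": first("text").strip(),
        "user_id": first("user_id"),
        "response_url": first("response_url"),
    }

def slack_reply(answer: str, public: bool = False) -> dict:
    """Build the JSON payload Slack expects in reply to a slash command."""
    return {"response_type": "in_channel" if public else "ephemeral", "text": answer}

body = b"command=%2Fplaybook&text=What+is+the+email+cadence%3F&user_id=U123"
print(parse_slash_command(body))
print(slack_reply("Weekly for the first month."))
```

Slack expects an acknowledgement within about three seconds, so for slower LLM calls a common pattern is to reply right away with a short "working on it" message and post the real answer to response_url afterwards.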

Tools: pdf2txt (CLI), Python 3.10+, FastAPI, Qdrant Cloud (free tier), Slack API.
Tips: Keep chunks under about 800 tokens so the top few retrieved chunks fit comfortably in the prompt; use a 150‑token overlap to preserve context across boundaries.

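# Written for openai>=1.0 and a recent qdrant-client; the "playbook" collection must already exist.
import pathlib
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="YOUR_QDRANT_URL", api_key="YOUR_QDRANT_KEY")

def embed(text):
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def chunkify(txt, size=600, overlap=150):
    # Whitespace-separated words stand in for tokens here, so sizes are approximate.
    words = txt.split()
    for i in range(0, len(words), size - overlap):
        yield " ".join(words[i:i + size])

for file in pathlib.Path("playbook_raw").glob("*.txt"):
    for n, chunk in enumerate(chunkify(file.read_text(encoding="utf-8"))):
        # Qdrant IDs must be unsigned integers or UUIDs; derive a stable UUID per chunk.
        point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file.name}#{n}"))
        point = PointStruct(id=point_id, vector=embed(chunk), payload={"text": chunk, "source": file.name})
        qdrant.upsert(collection_name="playbook", points=[point])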

Now any team member can ask, “What’s the recommended email cadence for enterprise leads?” and get a spot‑on answer in under 30 seconds.

The Tools That Make This Easier

Grab the right toolbox and the whole process feels like ordering a meal—you pick the ingredients, the kitchen does the work, and you get a finished dish.

pdf2txt – a free CLI that pulls plain text from PDFs. The widely used version is pdf2txt.py from pdfminer.six, which reads embedded text only, so scanned PDFs need a separate OCR pass first (OCRmyPDF is one option). Think of it as a copier that turns the printed page into searchable words.

pdf2txt.py report.pdf --outfile report.txt

LangChain – open‑source library that chops documents into chunks, builds retrieval logic, and talks to the LLM. It’s like a travel planner that breaks a long trip into stops, then finds the best route between them.
Qdrant Cloud – vector database with a free tier (check Qdrant’s pricing page for the current capacity limits). It stores embeddings and returns the nearest matches in milliseconds, similar to how Google Maps instantly shows the closest coffee shop.
OpenAI API – use text-embedding-ada-002 for cheap, high‑quality embeddings and gpt-4-turbo for completions. It’s the reliable delivery service that brings your answers right to the table without breaking the budget.
Streamlit Community Cloud – spin up a web UI for your RAG endpoint in minutes, free of charge. Imagine packing a suitcase: Streamlit provides the bag, you just drop in the components.

These five tools cover extraction, chunking, storage, reasoning, and presentation, giving you a complete RAG stack without writing a single line of infrastructure code.

With them in place, the next step is wiring everything together.

Quick Reference: RAG System Cheat Sheet

Here’s the whole process at a glance, ready to copy‑paste into your notes.

Gather → clean → plain‑text all docs. Treat it like gathering groceries: pick the items, throw away the wilted ones, then lay everything on the counter in a single list.
Chunk: 500‑600 tokens, 20‑30% overlap. Think of packing a suitcase; each box (chunk) holds a manageable amount, and a little overlap ensures nothing gets left behind.
Embed with Ada‑002 or sentence‑transformers. It’s like converting a photo into a fingerprint—choose a fast, reliable tool to turn text into vectors.
Store in Qdrant / Pinecone / Weaviate. Pick a pantry that keeps your fingerprints organized and searchable.
API flow: query → retrieve top‑k → LLM → answer. Imagine Sam, a product manager, typing a question in Slack. The system pulls the best matches, feeds them to the LLM, and returns a concise reply.
UI options: Slack slash command, Streamlit, Teams bot. Pick the front‑door that your team already uses, just like adding a new lane to an existing road.
Test with 5 real questions, monitor latency.

Best‑practice tips:
Keep chunk size consistent; irregular pieces cause uneven search results.
Use a small top_k (e.g., 5) to stay under the latency budget.
Log each query’s response time; alerts on >2 s keep the experience smooth.
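The last tip can be sketched as a decorator. This is one possible shape, not a required one: log_latency, the logger name, and answer_question are hypothetical, and the 2-second threshold simply mirrors the tip above.

```python
import functools
import logging
import time

logger = logging.getLogger("rag.latency")

def log_latency(threshold_s=2.0):
    """Decorator that logs how long a call took and warns when it exceeds threshold_s."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call succeeded or raised, so failures are timed too.
                elapsed = time.perf_counter() - start
                level = logging.WARNING if elapsed > threshold_s else logging.INFO
                logger.log(level, "%s took %.3f s", func.__name__, elapsed)
        return wrapper
    return decorator

@log_latency(threshold_s=2.0)
def answer_question(question):
    # Hypothetical placeholder for the retrieve-then-generate call.
    return f"echo: {question}"

logging.basicConfig(level=logging.INFO)
print(answer_question("What is our refund policy?"))
```

Routing the WARNING level to whatever alerting you already use is enough to catch slow queries without building a dashboard first.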

Keep this cheat sheet handy; it’s your RAG shortcut.

What to Do Next

Kick the tires on your new RAG system with three bite‑size actions, then decide how far you want to push it.

Easy: Grab the run_one_click.py script, point it at a folder of a few PDFs, and hit python run_one_click.py. Within minutes you’ll see vectors land in Qdrant—think of it as watching ingredients pop onto a kitchen counter before you start cooking.
Medium: Hook the FastAPI endpoint to a Slack bot (slack-bot.py in the repo) and invite a teammate to ask five real questions. This is like setting up a walk‑up window: you get instant feedback on whether the assistant is serving the right answers.
Hard: Build a relevance-feedback loop. Store thumbs-up/down in a feedback table, re-rank chunks with ranker = CrossEncoder(...), and periodically fine-tune a lightweight embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2). It’s the equivalent of training a personal barista who remembers your exact coffee preferences.
Tool tip: Use docker-compose up -d qdrant to keep the vector store running locally while you experiment.
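For the Hard option, the feedback table is the easiest piece to start with. Here is a minimal sketch using SQLite from the standard library; the table layout, function names, and chunk IDs are assumptions for illustration, and it covers only the storage step, not the cross-encoder re-ranking.

```python
import sqlite3

def open_feedback_db(path=":memory:"):
    """Create the feedback table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feedback ("
        "chunk_id TEXT NOT NULL, "
        "question TEXT NOT NULL, "
        "vote INTEGER NOT NULL CHECK (vote IN (-1, 1)))"
    )
    return conn

def record_vote(conn, chunk_id, question, thumbs_up):
    """Store one thumbs-up (+1) or thumbs-down (-1) for a chunk that was shown to the user."""
    conn.execute(
        "INSERT INTO feedback (chunk_id, question, vote) VALUES (?, ?, ?)",
        (chunk_id, question, 1 if thumbs_up else -1),
    )
    conn.commit()

def net_votes(conn):
    """Return {chunk_id: net score}, usable as one input to a later re-ranking step."""
    rows = conn.execute("SELECT chunk_id, SUM(vote) FROM feedback GROUP BY chunk_id")
    return dict(rows.fetchall())

conn = open_feedback_db()
record_vote(conn, "playbook-12", "email cadence?", True)
record_vote(conn, "playbook-12", "email cadence?", True)
record_vote(conn, "playbook-40", "email cadence?", False)
print(net_votes(conn))
```

Pass a file path instead of ":memory:" to keep votes between restarts. Chunks with a strongly negative net score are good candidates to inspect before you invest in fine-tuning.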

Cheat sheet:

python run_one_click.py – ingest PDFs
uvicorn main:app --reload – launch API
python slack-bot.py – start Slack bridge

Got stuck or discovered a shortcut? Drop a comment below – I love hearing how you’ve customized the workflow!

About the Author

Abdullah Sheikh is the Founder & CEO at Exteed, where he leads a team of skilled developers specializing in Web2 and Web3 applications, Custom Smart Contracts, and Blockchain solutions.

With 6+ years of experience, Abdullah has built CRMs, Crypto Wallets, DeFi Exchanges, E-Commerce Stores, HIPAA Compliant EMR Systems, and AI-powered systems that drive business efficiency and innovation.

His expertise spans Blockchain, Crypto & Tokenomics, Artificial Intelligence, and Web Applications; building reliable and smooth web apps that fit the client’s goals and requirements.

📧 info@abdullah-sheikh.com · 🔗 LinkedIn · 🌐 abdullah-sheikh.com