We’ve all faced this situation: you’re buried in a massive wall of documentation, desperately trying to track down one tiny use case. It feels exactly like hunting for a needle in a haystack.
So, how do we make this easier?
That’s where Retrieval-Augmented Generation (RAG) powered by LLMs comes in. Instead of endlessly scrolling, you can query the docs in plain English and get precise, contextual answers back—just like chatting with a subject-matter expert.
And here’s the best part:
- You don’t have to stick to OpenAI’s SDK.
- You can even spin this up with an on-prem LLM for full control and data privacy.
If you’re curious about setting up an on-prem LLM with Llama, check out this detailed guide: Setting up RAG locally with Ollama – A Beginner-Friendly Guide.
In this post, I’ll walk you through how I built a local RAG-powered chatbot on top of the Python Command Line Documentation. By the end, you’ll see how documentation can transform from a static reference into an interactive, intelligent assistant.
High-Level Architecture
At a glance, here’s the end-to-end flow we’ll be building:
- Scrape Documentation — Extract raw text directly from the official Python docs.
- Chunk the Text — Split large documents into smaller, context-friendly segments.
- Generate Embeddings — Convert text chunks into vector representations using an embedding model (e.g., OpenAI or any on-prem alternative).
- Store in FAISS — Persist embeddings locally in a FAISS vector store for efficient semantic search.
- RAG Pipeline — Retrieve the most relevant chunks and feed them into the LLM to generate accurate, context-aware answers.
- Expose API with FastAPI — Wrap everything into a lightweight API so the system can be queried just like a chatbot.
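Before diving in, here is a rough outline of how those pieces will connect, using the function names and placeholder module names that appear in the sections below (treat it as a sketch of the flow, not a finished script):
# End-to-end outline (each step is implemented in a later section)
from <SCRAP_SCRIPT> import scrape_cmdline   # step 1: scrape the docs page
from <CHUNK_SCRIPT> import chunk_text       # step 2: chunk the text
from <RAG_FILE> import ask_cmdline          # steps 3-5: embed, index, retrieve, generate

text = scrape_cmdline()
chunks = chunk_text(text, chunk_size=800)   # the embeddings + FAISS index are built once, offline
print(ask_cmdline("How do I run a module with python -m?"))  # step 6 wraps this call in FastAPI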
Note
It’s always a good practice to pen down your idea and flow first before jumping into implementation.
A clear sketch or outline helps avoid confusion later and keeps your architecture consistent.
Let’s Implement the Process
Now that we understand the big picture, let’s roll up our sleeves and implement the chatbot step by step.
We’ll start with scraping the documentation content from the official Python website.
Prerequisites
Before you start, make sure you have Python 3.10+ installed and the following libraries available.
You can install them using pip:
# Core libraries
pip install openai faiss-cpu numpy
# Web scraping (if using requests/BeautifulSoup)
pip install requests beautifulsoup4 lxml
# FastAPI for serving the chatbot API
pip install fastapi uvicorn
- faiss-cpu can be swapped for the GPU-enabled build (faiss-gpu) if you have a compatible GPU; see the FAISS documentation for details. (pickle is part of the Python standard library, so it doesn't need a separate install.)
1. Scraping the Content
The Python documentation is written in HTML and hosted online.
To use it in our chatbot, we need to extract the text from the web page so we can later process it into embeddings.
We’ll use two handy libraries:
- requests: fetches the webpage HTML.
- BeautifulSoup: parses the HTML and extracts only the meaningful text.
Here’s the code:
import requests
from bs4 import BeautifulSoup
# URL of the Python command-line documentation page
URL = "https://docs.python.org/3/using/cmdline.html"
def scrape_cmdline():
    """
    Scrape the Python command line documentation page.
    Steps:
    1. Make a GET request to the documentation URL.
    2. Parse the HTML content using BeautifulSoup.
    3. Extract only the main section of the page (role="main"),
       which contains the actual documentation (excluding headers, nav, footers).
    4. Return the extracted text as a clean string.
    """
    # Fetch the HTML page
    res = requests.get(URL)
    # Parse HTML
    soup = BeautifulSoup(res.text, "html.parser")
    # Extract the main content div
    content = soup.find("div", {"role": "main"}).get_text(separator="\n")
    return content

if __name__ == "__main__":
    # Preview the first 500 characters of the scraped content (best practice)
    text = scrape_cmdline()
    print(text[:500])
What’s happening here?
- We request the Python docs page using requests.get().
- BeautifulSoup parses the HTML so we can search inside it.
- We specifically grab the div with role="main", because that's where the actual documentation content lives. This avoids scraping menus, sidebars, or footers.
- We return the cleaned text, and in the main block we print a preview of the first 500 characters to confirm it's working. Testing each module on its own like this is a good practice when modules are fairly independent of each other (more on the development pattern later). A slightly more defensive version of the scraper is sketched below.
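As an optional hardening step (not part of the original script; the function name scrape_cmdline_safe is my own), you can fail fast on HTTP errors and guard against the main div going missing:
import requests
from bs4 import BeautifulSoup

URL = "https://docs.python.org/3/using/cmdline.html"

def scrape_cmdline_safe():
    """Variant of scrape_cmdline() with basic error handling."""
    # Fail fast on network issues and HTTP error codes
    res = requests.get(URL, timeout=30)
    res.raise_for_status()

    soup = BeautifulSoup(res.text, "html.parser")
    main = soup.find("div", {"role": "main"})
    if main is None:
        # The page layout changed or the wrong URL was fetched
        raise ValueError("Could not find the main content div on the page")

    return main.get_text(separator="\n")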
1.5. Normalizing the Text
Before we move to chunking, it’s useful to normalize the content a bit.
Documentation often contains extra line breaks, inconsistent spacing, or formatting artifacts.
By collapsing whitespace and trimming redundant line breaks, we ensure cleaner input for chunking and embeddings.
import re

def normalize_text(text: str) -> str:
    """
    Normalize scraped documentation text.
    - Collapse runs of spaces/tabs into a single space
    - Trim redundant blank lines (but keep single newlines,
      since the chunker below splits on line breaks)
    - Strip leading/trailing whitespace
    """
    text = re.sub(r"[ \t]+", " ", text)     # collapse spaces/tabs
    text = re.sub(r"\n\s*\n+", "\n", text)  # drop redundant blank lines
    return text.strip()
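If you want to wire normalization into the pipeline, a natural place is right after scraping and before chunking (the later scripts in this post call scrape_cmdline() directly, so treat this as an optional variation):
# Optional: normalize the scraped text before chunking
text = normalize_text(scrape_cmdline())
print(text[:300])  # quick sanity check of the cleaned text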
2. Chunking the Scraped Content
Large documents are too big to embed in one go, so we'll split them into chunks of roughly 800 characters.
def chunk_text(text, chunk_size=800):
    """
    Split text into smaller chunks of roughly `chunk_size` characters.
    This helps embeddings capture semantic meaning without hitting token limits.
    """
    lines = text.split("\n")
    chunks, current = [], []
    current_len = 0
    for line in lines:
        current.append(line)
        current_len += len(line)
        if current_len >= chunk_size:
            chunks.append("\n".join(current))
            current, current_len = [], 0
    if current:
        chunks.append("\n".join(current))
    return chunks
In this function we:
- Use a default chunk size of 800 characters, balancing context and token efficiency.
- Split the text by line first, then aggregate lines into chunks.
- Close a chunk once the running length reaches the limit, keeping segments close to the target size.
- Preserve semantic continuity by joining the lines back together.
A quick sanity check of the chunker is shown below.
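As a quick sanity check (assuming the scraping and chunking functions above live in the placeholder modules used later in this post), you can inspect how many chunks you get and how large they are:
from <SCRAP_SCRIPT> import scrape_cmdline
from <CHUNK_SCRIPT> import chunk_text

text = scrape_cmdline()
chunks = chunk_text(text, chunk_size=800)

print(f"Total chunks: {len(chunks)}")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} characters")
print(chunks[0][:200])  # preview the first chunk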
3. Generate Embeddings & Build FAISS Index
Now that we have clean, chunked text, the next step is to make it searchable by meaning rather than just raw keywords.
To achieve this, we’ll:
- Generate vector embeddings — each chunk of text is transformed into a high-dimensional vector using OpenAI's text-embedding-3-small model.
- Build a FAISS index — these vectors are stored in a FAISS index, enabling fast semantic similarity search.
- Persist artifacts — both the document chunks and the FAISS index are saved locally so we don't need to recompute them on every run.
import pickle
import numpy as np
import faiss
from openai import OpenAI
from <SCRAP_SCRIPT> import scrape_cmdline
from <CHUNK_SCRIPT> import chunk_text
# Initialize OpenAI client
client = OpenAI()
# Scrape documentation
text = scrape_cmdline()
# Chunk documentation
docs = chunk_text(text, chunk_size=800)
# Generate embeddings for each chunk
embeddings = [
    client.embeddings.create(
        model="text-embedding-3-small",
        input=doc
    ).data[0].embedding
    for doc in docs
]
# Convert embeddings into NumPy array (FAISS requires float32)
emb_np = np.array(embeddings, dtype="float32")
# Create and populate FAISS index
index = faiss.IndexFlatL2(emb_np.shape[1])
index.add(emb_np)
# Save both docs and index locally
with open("cmdline_docs.pkl", "wb") as f:
    pickle.dump(docs, f)
faiss.write_index(index, "cmdline.index")
print("Embedding index built and stored locally!")
Why FAISS?
- Optimized for similarity search: handles millions of embeddings efficiently.
- Local & lightweight: no external dependency once built.
- Scalable: you can swap in approximate nearest-neighbor (ANN) indexes for larger datasets (a sketch follows below).
More on FAISS in a future post.
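For illustration, here is a rough sketch of swapping the flat index for an IVF (inverted file) index, which trades a little recall for much faster search on large collections; emb_np is the float32 embedding matrix built earlier, and nlist/nprobe are values you would tune:
import faiss

d = emb_np.shape[1]                      # embedding dimensionality
nlist = 100                              # number of clusters (tune to your dataset size)

quantizer = faiss.IndexFlatL2(d)         # coarse quantizer used to assign vectors to clusters
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)

ivf_index.train(emb_np)                  # IVF indexes must be trained before adding vectors
ivf_index.add(emb_np)

ivf_index.nprobe = 10                    # clusters to visit per query (speed/recall trade-off)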
4. Query with RAG Pipeline
Now let’s write a script that takes a user question, retrieves relevant chunks, and generates an answer using GPT.
import pickle
import numpy as np
import faiss
from openai import OpenAI
client = OpenAI()
# Load FAISS index + document chunks
index = faiss.read_index("cmdline.index")
docs = pickle.load(open("cmdline_docs.pkl", "rb"))
def ask_cmdline(question, k=3):
    """
    Retrieve relevant doc chunks and answer the user's question
    using Retrieval-Augmented Generation (RAG).
    """
    # Embed the query
    q_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    # Search in FAISS for the top-k similar chunks
    D, I = index.search(np.array([q_emb], dtype="float32"), k)
    context = "\n\n".join([docs[i] for i in I[0]])

    # Construct prompt for GPT
    prompt = f"""
    You are a Python documentation assistant.
    Use the following docs context to answer the question.
    Cite specific flags or examples when possible.

    Context:
    {context}

    Question: {question}
    """

    # Generate response
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # choose a model of your liking
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Testing block to exercise the RAG process with a static prompt
if __name__ == "__main__":
    print(ask_cmdline("What does the -m option do in Python?"))
- We can run this script using
python <rag_file>.py
- This lets us verify the answer the LLM generates from the chunks retrieved via our FAISS index.
Exposing an API for Chatbot Simulation Using FastAPI
To make this interactive, we can expose it through a FastAPI interface and integrate it with UI components if we want:
from fastapi import FastAPI, Query
from <RAG_FILE> import ask_cmdline
app = FastAPI()
@app.get("/ask")
def ask_docs(q: str = Query(..., description="Ask about Python cmdline")):
"""
API endpoint to query the chatbot.
"""
return {"answer": ask_cmdline(q)}
- Run this command to make your API live:
uvicorn <rag_file>:app --reload
- Now you can use curl to ask any question about the documentation and fetch the answer.
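For example, assuming uvicorn's default address of 127.0.0.1:8000 (curl's --data-urlencode handles the spaces and punctuation in the question):
curl -G "http://127.0.0.1:8000/ask" --data-urlencode "q=What does the -m option do in Python?"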
Ideas for a Production-Grade Chatbot
Once the basic pipeline is working, here are a few ways to enhance it for real-world usage:
- Store metadata with chunks
  - If your content spans multiple pages or external links, enhance your chunking strategy to include metadata such as page numbers, URLs, or section titles (a minimal sketch follows this list).
  - When the chatbot retrieves a chunk, you can return this metadata alongside the answer, enabling users to jump directly to the source of the information.
- Frontend & contextual memory
  - Integrate a frontend UI using an existing chatbot component to make interactions more user-friendly and visually appealing.
  - For production-grade usage, maintain a memory of previous messages so the bot can provide context-aware responses and carry on multi-turn conversations naturally.
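As a minimal sketch of the metadata idea (the helper and field names below are illustrative, not part of the code above; it assumes chunk_text from earlier is importable), you could have the chunker return dictionaries instead of plain strings and keep the metadata alongside each embedding:
def chunk_text_with_metadata(text, source_url, section_title, chunk_size=800):
    """Like chunk_text, but attach source information to every chunk."""
    enriched = []
    for chunk in chunk_text(text, chunk_size=chunk_size):
        enriched.append({
            "text": chunk,             # the content that gets embedded
            "url": source_url,         # where the chunk came from
            "section": section_title,  # e.g. the page or heading title
        })
    return enriched

# At answer time, return the metadata of the retrieved chunks next to the
# generated answer so users can jump straight back to the source docs.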
With these improvements, you now have a chatbot capable of answering “needle-in-a-haystack” queries from massive, often messy technical documentation.
Even when the content is dense or full of jargon, your RAG pipeline ensures that the chatbot can provide accurate, contextually relevant answers efficiently.