How I built a RAG agent to eliminate operational interruptions at work
Open source project using Python, LangChain, ChromaDB, FastAPI and Discord — from a real problem to production deployment.
Every company has a silent cycle that drains time without anyone noticing.
An employee has a question about a procedure. They can't find the answer in the documentation. They interrupt a more experienced colleague. That person stops what they're doing, answers, and goes back to work — focus already broken. Multiply that by 10, 20, 50 times a week.
Watching that pattern is what led me to build POPS AI: a RAG (Retrieval-Augmented Generation) agent capable of answering questions about a company's Standard Operating Procedures, directly through Discord or via a REST API.
The problem that motivated the project
The company had dozens of SOPs documented in PDF format. The problem wasn't a lack of documentation — it was the friction in accessing it. Nobody opens a network folder, hunts for the right file, and reads 15 pages just to answer a quick question.
The question I asked myself was simple: what if the documentation could answer questions on its own?
The architecture in three stages
The system works in three distinct phases, each with a clear responsibility.
1. Extraction
The extrair_texto.py script reads PDFs from the pops_originais/ folder, extracts the full text using PyMuPDF, and saves it as .txt. Page images are also extracted for future use.
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    # Concatenate the text layer of every page in the PDF
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text
Simple, but important: extraction quality determines response quality. Scanned PDFs without OCR are enemy number one here.
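One cheap way to catch scanned PDFs early is to flag pages whose extracted text is suspiciously short. The sketch below is not taken from the project: `find_pages_needing_ocr` and its `min_chars` threshold are hypothetical, and it assumes you have already collected one string per page (for example from `page.get_text()`).

```python
from typing import List


def find_pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose text layer looks empty.

    min_chars is an arbitrary threshold (an assumption); tune it per corpus.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so stray newlines don't hide an empty page
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


pages = ["Step 1: turn on the printer and wait for the green light.", "  \n ", ""]
print(find_pages_needing_ocr(pages))  # [2, 3]
```

Pages flagged this way are candidates for an OCR pass before they are indexed.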
2. Embedding generation
With the extracted texts, gerar_embeddings.py splits the content into chunks using LangChain's RecursiveCharacterTextSplitter, generates the vectors, and persists them in ChromaDB.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(text)
The chunk_overlap=200 was a deliberate choice: it ensures context isn't cut off abruptly between chunks, which visibly improved response coherence.
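To see what the overlap does, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences). `chunk_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.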
The project supports two embedding models via config.py:
- Gemini (models/embedding-001): high quality, requires API key, cost scales with volume
- Local SBERT (paraphrase-multilingual-mpnet-base-v2): runs offline, great for avoiding costs or rate limits
This flexibility was one of the design decisions that added the most value, especially for anyone who wants to experiment with the project at zero cost.
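A switch like this can be kept in one place so that indexing and querying always use the same model. The sketch below is an assumption about how such a config could look, not the contents of the project's config.py: the `EmbeddingConfig` class, `load_embedding_config`, and the environment variable names `EMBEDDING_PROVIDER` and `GOOGLE_API_KEY` are all hypothetical. Only the two model names come from the article.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str
    model_name: str
    needs_api_key: bool


PROVIDERS = {
    "gemini": EmbeddingConfig("gemini", "models/embedding-001", True),
    "sbert": EmbeddingConfig("sbert", "paraphrase-multilingual-mpnet-base-v2", False),
}


def load_embedding_config(env: Optional[Mapping[str, str]] = None) -> EmbeddingConfig:
    """Pick the embedding backend from environment-style settings."""
    env = os.environ if env is None else env
    # Default to the local model so the project runs without any API key
    provider = env.get("EMBEDDING_PROVIDER", "sbert").lower()
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown embedding provider: {provider!r}")
    config = PROVIDERS[provider]
    if config.needs_api_key and not env.get("GOOGLE_API_KEY"):
        raise RuntimeError("This provider needs an API key, but none was set")
    return config


print(load_embedding_config({}).model_name)
```

Vectors from different models are not comparable, so switching providers means rebuilding the index with the new model.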
3. Query (RAG)
When a user asks a question, the system:
- Converts the question into a vector using the same embedding model
- Searches for the most semantically similar chunks in ChromaDB
- Builds a prompt with the retrieved excerpts as context
- Sends it to Gemini 2.0 Flash to generate the final answer
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5
)
context = "\n\n".join(results['documents'][0])
prompt = f"""You are an assistant specialized in the company's SOPs.
Use only the information below to answer.
Context:
{context}
Question: {question}
"""
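If each chunk is stored with metadata naming its file, the prompt can label every excerpt so the model is able to cite it. The sketch below assumes a result dictionary shaped like a ChromaDB query result (`documents` and `metadatas` as lists of lists) and a metadata key called `source`. That key name and the `build_prompt` helper are assumptions, not the project's code.

```python
from typing import Any, Dict


def build_prompt(results: Dict[str, Any], question: str) -> str:
    """Build a prompt whose context blocks carry their source file name."""
    documents = results["documents"][0]
    # "metadatas" may be missing or contain None entries, so fall back safely
    metadatas = (results.get("metadatas") or [[None] * len(documents)])[0]
    blocks = []
    for text, meta in zip(documents, metadatas):
        source = (meta or {}).get("source", "unknown")  # "source" key is assumed
        blocks.append(f"[Source: {source}]\n{text}")
    context = "\n\n".join(blocks)
    return (
        "You are an assistant specialized in the company's SOPs.\n"
        "Use only the information below to answer, and cite the source you used.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


sample = {
    "documents": [["Turn on the printer.", "Open the scan menu."]],
    "metadatas": [[{"source": "SOP-ScannerSetup.txt"}, None]],
}
print(build_prompt(sample, "How do I configure the scanner?"))
```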
The interfaces: Discord and API
The project exposes the knowledge base in two ways.
Discord bot with slash commands:
- /pop <question>: queries the vector database and returns the answer
- /addpop <file.txt>: lets admins add new SOPs in real time, without reprocessing the entire base
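Adding one SOP without rebuilding the base is simpler when every chunk gets a deterministic ID, so that uploading the same file again overwrites its chunks instead of duplicating them. How /addpop assigns IDs is not described here, so treat the following as a hypothetical sketch: `prepare_upsert`, the ID format, and the `source` and `chunk_index` metadata keys are all assumptions.

```python
from typing import Any, Dict, List


def prepare_upsert(filename: str, chunks: List[str]) -> Dict[str, List[Any]]:
    """Build ids, documents and metadatas for one SOP file.

    IDs depend only on the file name and chunk position, so re-adding
    the same file replaces the chunk stored at each position.
    """
    ids = [f"{filename}#{index:04d}" for index in range(len(chunks))]
    metadatas = [
        {"source": filename, "chunk_index": index} for index in range(len(chunks))
    ]
    return {"ids": ids, "documents": list(chunks), "metadatas": metadatas}


payload = prepare_upsert("SOP-ScannerSetup.txt", ["Turn on the printer.", "Open the menu."])
print(payload["ids"])  # ['SOP-ScannerSetup.txt#0000', 'SOP-ScannerSetup.txt#0001']
```

A payload like this could be passed to ChromaDB's `collection.upsert` together with the chunk embeddings. If a revised file produces fewer chunks than before, the leftover ones would need to be deleted first, for example by filtering on the `source` metadata.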
FastAPI REST API with a POST /ask endpoint, designed for integration with other internal systems:
// Request
{ "question": "How do I configure the scanner on the Samsung printer?" }
// Response
{
"answer": "To configure the scanner, follow these steps:\n1. Turn on the printer...\n[Source: SOP-ScannerSetup.txt]"
}
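From another internal system, calling the endpoint needs nothing beyond the standard library. This is a minimal client sketch under stated assumptions: the base URL `http://localhost:8000` is a guess based on the port used in the deployment config, and `build_ask_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_ask_request(base_url: str, question: str) -> urllib.request.Request:
    """Create the POST /ask request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, question: str, timeout: float = 30.0) -> str:
    """Send the question and return the "answer" field of the JSON response."""
    request = build_ask_request(base_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


request = build_ask_request("http://localhost:8000", "How do I configure the scanner?")
print(request.get_method(), request.full_url)  # POST http://localhost:8000/ask
```

Calling `ask("http://localhost:8000", "...")` would perform the actual request once the API is running.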
The challenge nobody talks about: token costs
Building the RAG was the fun part. The real challenge came after: how do you control costs in production?
A few decisions that made a real difference:
Using SBERT for embeddings instead of the Gemini API brings the API cost of indexing down to zero, because the model runs locally. Cost is then incurred only at response generation, which is where the actual value is.
Limiting the vector search to n_results=5 avoids passing unnecessary context to the model. More context means more tokens and more cost, without necessarily improving the answer.
Gemini 2.0 Flash was chosen intentionally over Pro: for objective questions about procedures, the quality difference is minimal while the cost difference is significant.
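The effect of n_results on prompt size can be estimated before any request is sent. The sketch below uses the common rule of thumb of roughly four characters per token. That ratio is an assumption that varies by tokenizer and language, and `estimate_prompt_tokens` is a hypothetical helper, so treat the numbers as orders of magnitude rather than billing figures.

```python
import math
from typing import List


def estimate_prompt_tokens(
    chunks: List[str],
    question: str,
    instructions: str = "",
    chars_per_token: float = 4.0,
) -> int:
    """Roughly estimate how many input tokens a RAG prompt will use."""
    total_chars = len(instructions) + len(question) + sum(len(c) for c in chunks)
    return math.ceil(total_chars / chars_per_token)


question = "q" * 60
chunk = "x" * 1000  # same size as chunk_size=1000 in the splitter
print(estimate_prompt_tokens([chunk] * 5, question))   # 1265
print(estimate_prompt_tokens([chunk] * 10, question))  # 2515
```

Doubling n_results roughly doubles the input tokens per question, which is why the retrieval limit is one of the first knobs to check when costs grow.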
Deployment: one container, two processes
One decision that cost me a few hours was running the Discord bot and the FastAPI server in the same Docker container. The solution was Supervisor, which manages both processes in a lightweight, self-recovering way.
# supervisord.conf
[program:api]
command=uvicorn api_bot:app --host 0.0.0.0 --port 8000
autostart=true
autorestart=true

[program:discord]
command=python bot_discord.py
autostart=true
autorestart=true
The result is a single, lightweight container that starts both services in parallel and automatically restarts either one if it fails. On an entry-level VPS, this matters a lot.
What I learned that wasn't in the plan
Chunking is an art. In my tests, chunk size and overlap affected response quality more than the choice of model did. I spent more time tuning this than anything else.
Security from day one. The .gitignore had to be configured before the first public commit to ensure no confidential company PDFs ended up in the repository. A mistake here is hard to undo.
The real problem wasn't technical. The most complex part was understanding what kind of questions users would actually ask and how to structure the SOPs so the model could retrieve the right information. Garbage in, garbage out applies twice as hard in RAG.
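On the security point, a .gitignore can be backed by a small check that refuses risky paths before they are committed. The sketch below is one possible guard, not something the project ships: `find_confidential_paths` is a hypothetical helper, and the blocked folder and suffix are assumptions based on the pops_originais/ folder and the PDF format mentioned earlier.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def find_confidential_paths(
    paths: Iterable[str],
    blocked_dirs: Tuple[str, ...] = ("pops_originais",),
    blocked_suffixes: Tuple[str, ...] = (".pdf",),
) -> List[str]:
    """Return the paths that should never reach a public repository."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Flag anything inside a blocked folder, at any depth
        in_blocked_dir = any(part in blocked_dirs for part in path.parts)
        if in_blocked_dir or path.suffix.lower() in blocked_suffixes:
            flagged.append(raw)
    return flagged


staged = ["README.md", "pops_originais/scanner.txt", "docs/Manual.PDF"]
print(find_confidential_paths(staged))  # ['pops_originais/scanner.txt', 'docs/Manual.PDF']
```

The list of staged files can come from `git diff --cached --name-only`, for instance inside a pre-commit hook that aborts when the result is not empty.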
The project is open source
POPS AI is available on GitHub with a full README, .env.example, configured Docker Compose, and step-by-step setup instructions for both local and container-based deployment.
You can clone it, adapt it to your own knowledge base, and use it with your own documents — whether for SOPs, internal wikis, product manuals, or any PDF-based documentation.
Stack
Python 3.10, LangChain, ChromaDB, FastAPI, Discord.py, Google Gemini 2.0 Flash, SBERT, Docker, Supervisor, PyMuPDF
If you made it this far and are curious about any architectural decision, token cost management in production, or how to adapt this to a different use case — drop a comment. Happy to discuss.
Top comments (1)
RAG for SOPs has a specific failure mode worth designing against from day one: the retriever pulls the right chunk, the generator paraphrases it, and the paraphrase quietly changes a step in a procedure that's load-bearing for compliance. Embedding similarity gets the right neighborhood; faithfulness to the exact wording of the SOP is a different axis.
For SOP-style content the higher-confidence pattern is retrieval plus extractive answering — the agent quotes the procedure verbatim with a citation, then offers context around it, rather than synthesizing a new version. Generation-side flexibility is what people love about LLMs but it's exactly the wrong default for regulated procedural content. Worth being deliberate about which mode the system runs in per content type.