How I built a RAG agent to eliminate operational interruptions at work
Open source project using Python, LangChain, ChromaDB, FastAPI and Discord — from a real problem to production deployment.
Every company has a silent cycle that drains time without anyone noticing.
An employee has a question about a procedure. They can't find the answer in the documentation. They interrupt a more experienced colleague. That person stops what they're doing, answers, and goes back to work — focus already broken. Multiply that by 10, 20, 50 times a week.
Watching that pattern is what led me to build POPS AI: a RAG (Retrieval-Augmented Generation) agent capable of answering questions about a company's Standard Operating Procedures, directly through Discord or via a REST API.
The problem that motivated the project
The company had dozens of SOPs documented in PDF format. The problem wasn't a lack of documentation — it was the friction in accessing it. Nobody opens a network folder, hunts for the right file, and reads 15 pages just to answer a quick question.
The question I asked myself was simple: what if the documentation could answer questions on its own?
The architecture in three stages
The system works in three distinct phases, each with a clear responsibility.
1. Extraction
The extrair_texto.py script reads PDFs from the pops_originais/ folder, extracts the full text using PyMuPDF, and saves it as .txt. Page images are also extracted for future use.
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    return full_text
Simple, but important: extraction quality determines response quality. Scanned PDFs without OCR are enemy number one here.
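One way to catch scanned PDFs early is to flag pages whose extracted text is empty or nearly empty. This is a minimal sketch, assuming the per-page strings come from page.get_text() as in the function above. The helper name and the 20-character threshold are hypothetical and not part of the project.

```python
def find_pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text looks empty.

    A page with fewer than `min_chars` non-whitespace characters is
    treated as a probable scan that needs OCR. The threshold is an
    illustrative assumption, not a value taken from the project.
    """
    suspicious = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join(text.split())  # drop all whitespace
        if len(visible) < min_chars:
            suspicious.append(number)
    return suspicious


# Example: the second page yields no usable text, so it is flagged.
pages = [
    "1. Turn on the printer and wait for the green light.",
    "  \n",
    "2. Open the scanner lid and place the document face down.",
]
print(find_pages_without_text(pages))
```

Running a check like this before generating embeddings makes it possible to send only the flagged files through OCR, rather than discovering the gap later through bad answers.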
2. Embedding generation
With the extracted texts, gerar_embeddings.py splits the content into chunks using LangChain's RecursiveCharacterTextSplitter, generates the vectors, and persists them in ChromaDB.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(text)
The chunk_overlap=200 was a deliberate choice: it ensures context isn't cut off abruptly between chunks, which visibly improved response coherence.
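To see what the overlap does, here is a simplified fixed-width sliding window in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name is hypothetical and the sketch only illustrates how consecutive chunks share their boundary text.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
demo = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo)
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.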
The project supports two embedding models via config.py:
- Gemini models/embedding-001: high quality, requires an API key, cost scales with volume
- Local SBERT (paraphrase-multilingual-mpnet-base-v2): runs offline, great for avoiding costs or rate limits
This flexibility was one of the design decisions that added the most value, especially for anyone who wants to experiment with the project at zero cost.
3. Query (RAG)
When a user asks a question, the system:
- Converts the question into a vector using the same embedding model
- Searches for the most semantically similar chunks in ChromaDB
- Builds a prompt with the retrieved excerpts as context
- Sends it to Gemini 2.0 Flash to generate the final answer
results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5
)
context = "\n\n".join(results['documents'][0])
prompt = f"""You are an assistant specialized in the company's SOPs.
Use only the information below to answer.
Context:
{context}
Question: {question}
"""
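Conceptually, the retrieval step ranks stored chunks by how close their vectors are to the question vector. The sketch below does this by brute force with cosine similarity over toy three-dimensional vectors. It only illustrates the idea: ChromaDB uses an approximate nearest-neighbor index and a configurable distance metric, and the function names and sample data here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vec, indexed_chunks, k=5):
    """Return the k chunk texts most similar to the question.

    `indexed_chunks` is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(question_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy index: real embeddings have hundreds of dimensions.
index = [
    ("Reset the router by holding the button for 10 seconds.", [0.9, 0.1, 0.0]),
    ("Vacation requests go through the HR portal.", [0.0, 0.2, 0.9]),
    ("Restart the modem before calling the ISP.", [0.8, 0.3, 0.1]),
]
question_vec = [1.0, 0.0, 0.0]
context = "\n\n".join(top_k_chunks(question_vec, index, k=2))
print(context)
```

With k=2 the two networking chunks are selected and the HR chunk is left out, which is the same filtering effect n_results has on the prompt.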
The interfaces: Discord and API
The project exposes the knowledge base in two ways.
Discord bot with slash commands:
- /pop <question>: queries the vector database and returns the answer
- /addpop <file.txt>: lets admins add new SOPs in real time, without reprocessing the entire base
FastAPI REST API with a POST /ask endpoint, designed for integration with other internal systems:
// Request
{ "question": "How do I configure the scanner on the Samsung printer?" }
// Response
{
  "answer": "To configure the scanner, follow these steps:\n1. Turn on the printer...\n[Source: SOP-ScannerSetup.txt]"
}
The challenge nobody talks about: token costs
Building the RAG was the fun part. The real challenge came after: how do you control costs in production?
A few decisions that made a real difference:
Using SBERT for embeddings instead of the Gemini API brings the API cost of indexing down to zero, because the model runs locally. Cost is incurred only at response generation, which is where the actual value is.
Limiting n_results=5 in the vector search avoids passing unnecessary context to the model. More context = more tokens = more cost, without necessarily improving the answer.
Gemini 2.0 Flash was chosen intentionally over Pro: for objective questions about procedures, the quality difference is minimal while the cost difference is significant.
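A back-of-the-envelope estimate shows how n_results drives prompt size. The sketch assumes roughly four characters per token, which is only a rule of thumb for English text; real counts depend on the model's tokenizer and the language of the SOPs. The function name is hypothetical.

```python
def estimate_context_tokens(n_results, chunk_size, chars_per_token=4):
    """Rough upper bound on context tokens contributed by retrieved chunks.

    Assumes every chunk is full-size and that one token covers about
    `chars_per_token` characters (a heuristic, not a tokenizer).
    """
    return (n_results * chunk_size) // chars_per_token


# Compare a few retrieval depths at the article's chunk_size of 1000.
for n in (3, 5, 10):
    print(n, estimate_context_tokens(n, chunk_size=1000))
```

Under these assumptions, going from 5 to 10 results doubles the context from about 1,250 to about 2,500 tokens per question, before counting the instructions and the question itself.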
Deployment: one container, two processes
One decision that cost me a few hours was running the Discord bot and the FastAPI server in the same Docker container. The solution was Supervisor, which manages both processes in a lightweight, self-recovering way.
# supervisord.conf
[program:api]
command=uvicorn api_bot:app --host 0.0.0.0 --port 8000
autostart=true
autorestart=true

[program:discord]
command=python bot_discord.py
autostart=true
autorestart=true
The result is a single, lightweight container that starts both services in parallel and automatically restarts either one if it fails. On an entry-level VPS, this matters a lot.
What I learned that wasn't in the plan
Chunking is an art. In this project, chunk size and overlap affected response quality more than the choice of model did. I spent more time tuning them than anything else.
Security from day one. The .gitignore had to be configured before the first public commit to ensure no confidential company PDFs ended up in the repository. A mistake here is hard to undo.
The real problem wasn't technical. The most complex part was understanding what kind of questions users would actually ask and how to structure the SOPs so the model could retrieve the right information. Garbage in, garbage out applies twice as hard in RAG.
The project is open source
POPS AI is available on GitHub with a full README, .env.example, configured Docker Compose, and step-by-step setup instructions for both local and container-based deployment.
You can clone it, adapt it to your own knowledge base, and use it with your own documents — whether for SOPs, internal wikis, product manuals, or any PDF-based documentation.
Stack
Python 3.10, LangChain, ChromaDB, FastAPI, Discord.py, Google Gemini 2.0 Flash, SBERT, Docker, Supervisor, PyMuPDF
If you made it this far and are curious about any architectural decision, token cost management in production, or how to adapt this to a different use case — drop a comment. Happy to discuss.
Top comments (6)
Really enjoyed reading this. I like that the project comes from a real operational pain point instead of just being another “AI demo.” The architecture is also explained very clearly, especially the tradeoffs around chunking, embeddings, and token costs.
The point about chunking being more impactful than the model itself is something many people only realize after building their first RAG system. I’m also curious how you evaluate retrieval quality in production. Are you using any automated metrics or mostly relying on manual testing right now?
Thanks for the comment! Actually, we're performing the tests manually. I hadn't thought of that, but it's a very important point. I'd like to implement these tests in the POPS Agent. Do you have any tools you could recommend for these tests?
Unfortunately, not really yet. I'm currently trying to build something like that for myself, but a ready-made solution or a framework for it would be really helpful.
The repository is open, and the base is ready to be modified into a tool like that, if it helps. It was built to respond via Discord or the API, but reading files and creating chunks and embeddings already works well. It just needs a few revisions. Keep in touch, maybe I can help you with that.
Here is my LinkedIn: linkedin.com/in/devcleberlucas/
RAG for SOPs has a specific failure mode worth designing against from day one: the retriever pulls the right chunk, the generator paraphrases it, and the paraphrase quietly changes a step in a procedure that's load-bearing for compliance. Embedding similarity gets the right neighborhood; faithfulness to the exact wording of the SOP is a different axis.
For SOP-style content the higher-confidence pattern is retrieval plus extractive answering — the agent quotes the procedure verbatim with a citation, then offers context around it, rather than synthesizing a new version. Generation-side flexibility is what people love about LLMs but it's exactly the wrong default for regulated procedural content. Worth being deliberate about which mode the system runs in per content type.
That's a really good point. Thanks for bringing this up.
To be honest, my initial focus was much more on making sure the right information could be retrieved than on the difference between retrieving the correct chunk and staying faithful to the original wording. And you're absolutely right: in procedural content, even a small paraphrase can unintentionally change the meaning of a critical step.
I really like the idea of having a more extractive mode, quoting the procedure exactly as it appears in the source document and using the LLM only to provide additional context when appropriate. I think that's a very interesting direction for the next iterations of the project.
This is exactly why I enjoy sharing what I build publicly. Someone always brings a perspective I hadn't considered and helps turn a proof of concept into something more robust.
Thanks again for the thoughtful feedback. I'll definitely take it into account for future versions of POPS AI.
Thank you!