Cleber Lucas


En:Building a RAG Agent for SOPs

How I built a RAG agent to eliminate operational interruptions at work

Open source project using Python, LangChain, ChromaDB, FastAPI and Discord — from a real problem to production deployment.


Every company has a silent cycle that drains time without anyone noticing.

An employee has a question about a procedure. They can't find the answer in the documentation. They interrupt a more experienced colleague. That person stops what they're doing, answers, and goes back to work — focus already broken. Multiply that by 10, 20, 50 times a week.

Watching that pattern is what led me to build POPS AI: a RAG (Retrieval-Augmented Generation) agent capable of answering questions about a company's Standard Operating Procedures, directly through Discord or via a REST API.


The problem that motivated the project

The company had dozens of SOPs documented in PDF format. The problem wasn't a lack of documentation — it was the friction in accessing it. Nobody opens a network folder, hunts for the right file, and reads 15 pages just to answer a quick question.

The question I asked myself was simple: what if the documentation could answer questions on its own?


The architecture in three stages

The system works in three distinct phases, each with a clear responsibility.

1. Extraction

The extrair_texto.py script reads PDFs from the pops_originais/ folder, extracts the full text using PyMuPDF, and saves it as .txt. Page images are also extracted for future use.

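import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF as one string."""
    # The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as doc:
        # join() avoids rebuilding the string on every page.
        return "".join(page.get_text() for page in doc)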

Simple, but important: extraction quality determines response quality. Scanned PDFs without OCR are enemy number one here.
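
One way to catch scanned PDFs early is to flag pages whose extracted text is empty or nearly empty. The sketch below is an illustration and not part of the project: the function name find_pages_needing_ocr and the 20-character threshold are assumptions, and the input is a list of per-page strings such as the ones page.get_text() returns.

```python
def find_pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text looks empty.

    min_chars is an arbitrary illustrative threshold, not a project setting.
    """
    suspicious = []
    for number, text in enumerate(page_texts, start=1):
        # Whitespace alone does not count as content.
        if len(text.strip()) < min_chars:
            suspicious.append(number)
    return suspicious


pages = [
    "1. Turn on the printer and wait for the status light.",
    "   \n",  # typical result for a scanned page with no text layer
    "",
]
print(find_pages_needing_ocr(pages))  # prints [2, 3]
```

Pages flagged this way could be routed to an OCR step or reported to whoever maintains the SOP, before they reach the embedding stage.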

2. Embedding generation

With the extracted texts, gerar_embeddings.py splits the content into chunks using LangChain's RecursiveCharacterTextSplitter, generates the vectors, and persists them in ChromaDB.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(text)

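The chunk_overlap=200 was a deliberate choice: consecutive chunks share up to 200 characters, so a sentence or step near a chunk boundary keeps some of its surrounding context instead of being cut off abruptly. In practice, this visibly improved response coherence.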
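
To make the effect of the overlap concrete, here is a simplified fixed-width chunker. It is an illustration only: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph breaks, so its chunk boundaries will not match these exactly. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)
# The tail of each chunk is repeated at the head of the next one.
print(len(chunks), chunks[0][-200:] == chunks[1][:200])  # prints 3 True
```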

The project supports two embedding models via config.py:

  • Gemini models/embedding-001 — high quality, requires API key, cost scales with volume
  • Local SBERT (paraphrase-multilingual-mpnet-base-v2) — runs offline, great for avoiding costs or rate limits

This flexibility was one of the design decisions that added the most value, especially for anyone who wants to experiment with the project at zero cost.

3. Query (RAG)

When a user asks a question, the system:

  1. Converts the question into a vector using the same embedding model
  2. Searches for the most semantically similar chunks in ChromaDB
  3. Builds a prompt with the retrieved excerpts as context
  4. Sends it to Gemini 2.0 Flash to generate the final answer

results = collection.query(
    query_embeddings=[question_embedding],
    n_results=5
)

context = "\n\n".join(results['documents'][0])

prompt = f"""You are an assistant specialized in the company's SOPs.
Use only the information below to answer.

Context:
{context}

Question: {question}
"""
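
Step 2 is the part ChromaDB handles inside collection.query. Conceptually, it ranks the stored chunks by how close their vectors are to the question vector. The sketch below shows that idea with cosine similarity over a plain Python list. The three-dimensional vectors and the top_k helper are made up for illustration, and the index structure and distance metric ChromaDB actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question_vec, indexed_chunks, k=5):
    """Return the k chunk texts most similar to the question vector."""
    scored = [(cosine_similarity(question_vec, vec), text)
              for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index of (chunk text, embedding) pairs; real embeddings have
# hundreds of dimensions.
toy_index = [
    ("Scanner setup steps", [0.9, 0.1, 0.0]),
    ("Vacation request policy", [0.0, 0.2, 0.9]),
    ("Printer driver install", [0.7, 0.6, 0.1]),
]
question_vec = [1.0, 0.2, 0.0]
print(top_k(question_vec, toy_index, k=2))
# prints ['Scanner setup steps', 'Printer driver install']
```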

The interfaces: Discord and API

The project exposes the knowledge base in two ways.

Discord bot with slash commands:

  • /pop <question> — queries the vector database and returns the answer
  • /addpop <file.txt> — lets admins add new SOPs in real time, without reprocessing the entire base
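
One way to keep an incremental command like /addpop safe to repeat is to give every chunk a stable ID derived from its file name and position. The sketch below is an assumption about how this could be done, not a description of the project's code, and make_chunk_ids is a hypothetical helper.

```python
import hashlib


def make_chunk_ids(filename, chunks):
    """Build one stable ID per chunk from the file name and chunk position."""
    file_key = hashlib.sha256(filename.encode("utf-8")).hexdigest()[:12]
    return [f"{file_key}-{index:04d}" for index, _ in enumerate(chunks)]


chunks = ["Step 1: turn on the printer.", "Step 2: open the scanner app."]
ids = make_chunk_ids("SOP-ScannerSetup.txt", chunks)
print(ids)
# With ChromaDB, these IDs could be passed alongside the chunks and their
# embeddings, for example via collection.upsert(ids=..., documents=...,
# embeddings=...), so only the new file's chunks are written.
```

Because the IDs are deterministic, uploading the same file twice would overwrite its chunks rather than duplicate them.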

FastAPI REST API with a POST /ask endpoint, designed for integration with other internal systems:

// Request
{ "question": "How do I configure the scanner on the Samsung printer?" }

// Response
{
  "answer": "To configure the scanner, follow these steps:\n1. Turn on the printer...\n[Source: SOP-ScannerSetup.txt]"
}
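
For another internal system, calling the endpoint takes only the standard library. This is a sketch under assumptions: the base URL http://localhost:8000 is a guess based on the port used in the deployment section, and the helper names are hypothetical. Only the /ask path and the question and answer fields come from the example above.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:8000"):
    """Create the POST /ask request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000", timeout=30):
    """Send the question and return the 'answer' field of the JSON response."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]


request = build_ask_request("How do I configure the scanner on the Samsung printer?")
print(request.get_method(), request.full_url)
# prints POST http://localhost:8000/ask
```

Calling ask(...) requires the API to be running; build_ask_request can be inspected on its own, which makes the payload easy to check.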

The challenge nobody talks about: token costs

Building the RAG was the fun part. The real challenge came after: how do you control costs in production?

A few decisions that made a real difference:

Using SBERT for embeddings instead of the Gemini API brings the API cost of indexing down to zero, because the model runs locally. The only remaining API cost is response generation, which is where the actual value is.

Limiting the vector search to n_results=5 avoids passing unnecessary context to the model. More context means more tokens and more cost, without necessarily improving the answer.

Gemini 2.0 Flash was chosen intentionally over Pro: for objective questions about procedures, the quality difference is minimal while the cost difference is significant.
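
A back-of-the-envelope estimate shows why n_results matters. The sketch assumes roughly 4 characters per token, which is a common rule of thumb for English text and not Gemini's actual tokenizer. The price passed in is a placeholder to replace with the current rate for your model, and the function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not the real tokenizer


def estimate_context_tokens(n_results, chunk_size=1000):
    """Upper bound on the tokens contributed by the retrieved chunks."""
    return (n_results * chunk_size) // CHARS_PER_TOKEN


def estimate_input_cost(n_results, queries, price_per_million_tokens):
    """Approximate input cost of the retrieved context over many queries."""
    tokens = estimate_context_tokens(n_results) * queries
    return tokens * price_per_million_tokens / 1_000_000


for n in (5, 10, 20):
    print(n, estimate_context_tokens(n))
# prints:
# 5 1250
# 10 2500
# 20 5000
```

With 1000-character chunks, doubling n_results doubles the context sent on every question, so the setting scales the input bill linearly.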


Deployment: one container, two processes

One decision that cost me a few hours was running the Discord bot and the FastAPI server in the same Docker container. The solution was Supervisor, which manages both processes in a lightweight, self-recovering way.

# supervisord.conf
[program:api]
command=uvicorn api_bot:app --host 0.0.0.0 --port 8000
autostart=true
autorestart=true

[program:discord]
command=python bot_discord.py
autostart=true
autorestart=true

The result is a single, lightweight container that starts both services in parallel and automatically restarts either one if it fails. On an entry-level VPS, this matters a lot.


What I learned that wasn't in the plan

Chunking is an art. In this project, chunk size and overlap affected response quality more than the choice of model. I spent more time tuning this than anything else.

Security from day one. The .gitignore had to be configured before the first public commit to ensure no confidential company PDFs ended up in the repository. A mistake here is hard to undo.

The real problem wasn't technical. The most complex part was understanding what kind of questions users would actually ask and how to structure the SOPs so the model could retrieve the right information. Garbage in, garbage out applies twice as hard in RAG.


The project is open source

POPS AI is available on GitHub with a full README, .env.example, configured Docker Compose, and step-by-step setup instructions for both local and container-based deployment.

You can clone it, adapt it to your own knowledge base, and use it with your own documents — whether for SOPs, internal wikis, product manuals, or any PDF-based documentation.

🔗 github.com/obelucca/POPS_AI


Stack

Python 3.10, LangChain, ChromaDB, FastAPI, Discord.py, Google Gemini 2.0 Flash, SBERT, Docker, Supervisor, PyMuPDF


If you made it this far and are curious about any architectural decision, token cost management in production, or how to adapt this to a different use case — drop a comment. Happy to discuss.

Top comments (1)

Theo Valmis

RAG for SOPs has a specific failure mode worth designing against from day one: the retriever pulls the right chunk, the generator paraphrases it, and the paraphrase quietly changes a step in a procedure that's load-bearing for compliance. Embedding similarity gets the right neighborhood; faithfulness to the exact wording of the SOP is a different axis.

For SOP-style content the higher-confidence pattern is retrieval plus extractive answering — the agent quotes the procedure verbatim with a citation, then offers context around it, rather than synthesizing a new version. Generation-side flexibility is what people love about LLMs but it's exactly the wrong default for regulated procedural content. Worth being deliberate about which mode the system runs in per content type.