<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cleber Lucas</title>
    <description>The latest articles on DEV Community by Cleber Lucas (@obelucca__).</description>
    <link>https://dev.to/obelucca__</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3394098%2F13817e67-756b-4669-9e4a-7ca813a0573b.jpg</url>
      <title>DEV Community: Cleber Lucas</title>
      <link>https://dev.to/obelucca__</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/obelucca__"/>
    <language>en</language>
    <item>
      <title>En:Building a RAG Agent for SOPs</title>
      <dc:creator>Cleber Lucas</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:53:55 +0000</pubDate>
      <link>https://dev.to/obelucca__/enbuilding-a-rag-agent-for-sops-5hj1</link>
      <guid>https://dev.to/obelucca__/enbuilding-a-rag-agent-for-sops-5hj1</guid>
      <description>&lt;h1&gt;
  
  
  How I built a RAG agent to eliminate operational interruptions at work
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Open source project using Python, LangChain, ChromaDB, FastAPI and Discord — from a real problem to production deployment.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Every company has a silent cycle that drains time without anyone noticing.&lt;/p&gt;

&lt;p&gt;An employee has a question about a procedure. They can't find the answer in the documentation. They interrupt a more experienced colleague. That person stops what they're doing, answers, and goes back to work — focus already broken. Multiply that by 10, 20, 50 times a week.&lt;/p&gt;

&lt;p&gt;Watching that pattern is what led me to build &lt;strong&gt;POPS AI&lt;/strong&gt;: a RAG &lt;em&gt;(Retrieval-Augmented Generation)&lt;/em&gt; agent capable of answering questions about a company's Standard Operating Procedures, directly through Discord or via a REST API.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem that motivated the project
&lt;/h2&gt;

&lt;p&gt;The company had dozens of SOPs documented in PDF format. The problem wasn't a lack of documentation — it was the friction in accessing it. Nobody opens a network folder, hunts for the right file, and reads 15 pages just to answer a quick question.&lt;/p&gt;

&lt;p&gt;The question I asked myself was simple: &lt;strong&gt;what if the documentation could answer questions on its own?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture in three stages
&lt;/h2&gt;

&lt;p&gt;The system works in three distinct phases, each with a clear responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extraction
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;extrair_texto.py&lt;/code&gt; script reads PDFs from the &lt;code&gt;pops_originais/&lt;/code&gt; folder, extracts the full text using PyMuPDF, and saves it as &lt;code&gt;.txt&lt;/code&gt;. Page images are also extracted for future use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;  &lt;span class="c1"&gt;# PyMuPDF
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_text_from_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;full_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;full_text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;full_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but important: extraction quality determines response quality. Scanned PDFs without OCR are enemy number one here.&lt;/p&gt;
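
&lt;p&gt;A cheap guard against scanned pages is to check whether each page produced any text at all before indexing it. The sketch below assumes the per-page strings returned by &lt;code&gt;get_text()&lt;/code&gt; have been collected into a list; the helper name &lt;code&gt;pages_missing_text&lt;/code&gt; and the sample data are hypothetical and not part of the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pages_missing_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()  # whitespace-only output usually means no text layer
    ]


# Hypothetical per-page output for a three-page PDF (page 2 is a scan).
sample_pages = ["1. Turn on the printer.", "   \n", "2. Open the scanner menu."]
print(pages_missing_text(sample_pages))  # prints [2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pages flagged this way are candidates for an OCR pass before they reach the embedding step.&lt;/p&gt;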

&lt;h3&gt;
  
  
  2. Embedding generation
&lt;/h3&gt;

&lt;p&gt;With the extracted texts, &lt;code&gt;gerar_embeddings.py&lt;/code&gt; splits the content into chunks using LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, generates the vectors, and persists them in ChromaDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;chunk_overlap=200&lt;/code&gt; was a deliberate choice: consecutive chunks share up to 200 characters, so context isn't cut off abruptly at a chunk boundary, which visibly improved response coherence.&lt;/p&gt;
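
&lt;p&gt;To see what the overlap does, here is a deliberately simplified fixed-window splitter in plain Python. It is not LangChain's algorithm (which also tries to break on separators such as paragraphs and sentences); the function name &lt;code&gt;split_with_overlap&lt;/code&gt; is hypothetical and only illustrates how neighboring chunks end up sharing text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows that share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    # step must be between 1 and chunk_size, i.e. the overlap is
    # non-negative and strictly smaller than the chunk size.
    if step not in range(1, chunk_size + 1):
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    # Stop once a window already reaches the end of the text.
    last_start_bound = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start_bound, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # prints ['abcd', 'cdef', 'efgh', 'ghij']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each chunk starts with the last two characters of the previous one. With the real settings, that shared region is what keeps a sentence that straddles a boundary retrievable from either side.&lt;/p&gt;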

&lt;p&gt;The project supports two embedding models via &lt;code&gt;config.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini &lt;code&gt;models/embedding-001&lt;/code&gt;&lt;/strong&gt; — high quality, requires API key, cost scales with volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local SBERT (&lt;code&gt;paraphrase-multilingual-mpnet-base-v2&lt;/code&gt;)&lt;/strong&gt; — runs offline, great for avoiding costs or rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility was one of the design decisions that added the most value, especially for anyone who wants to experiment with the project at zero cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Query (RAG)
&lt;/h3&gt;

&lt;p&gt;When a user asks a question, the system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converts the question into a vector using the same embedding model&lt;/li&gt;
&lt;li&gt;Searches for the most semantically similar chunks in ChromaDB&lt;/li&gt;
&lt;li&gt;Builds a prompt with the retrieved excerpts as context&lt;/li&gt;
&lt;li&gt;Sends it to Gemini 2.0 Flash to generate the final answer
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;question_embedding&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an assistant specialized in the company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s SOPs.
Use only the information below to answer.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
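
&lt;p&gt;Steps 1 to 3 can be sketched without any vector database. The example below ranks chunks by cosine similarity, a common choice that is assumed here; the metric ChromaDB actually applies depends on how the collection is configured. The toy three-dimensional vectors and the names &lt;code&gt;cosine_similarity&lt;/code&gt; and &lt;code&gt;top_k_chunks&lt;/code&gt; are hypothetical stand-ins for real embeddings and for the &lt;code&gt;collection.query&lt;/code&gt; call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vector, indexed_chunks, k=5):
    """indexed_chunks is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors stand in for real embeddings.
index = [
    ("Scanner setup steps", [0.9, 0.1, 0.0]),
    ("Vacation request policy", [0.0, 0.2, 0.9]),
    ("Printer driver install", [0.7, 0.6, 0.1]),
]
context = "\n\n".join(top_k_chunks([1.0, 0.0, 0.0], index, k=2))
print(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this toy index, the scanner and printer chunks rank above the vacation policy, so only those two end up in the context passed to the model.&lt;/p&gt;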






&lt;h2&gt;
  
  
  The interfaces: Discord and API
&lt;/h2&gt;

&lt;p&gt;The project exposes the knowledge base in two ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discord bot&lt;/strong&gt; with slash commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/pop &amp;lt;question&amp;gt;&lt;/code&gt; — queries the vector database and returns the answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/addpop &amp;lt;file.txt&amp;gt;&lt;/code&gt; — lets admins add new SOPs in real time, without reprocessing the entire base&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FastAPI REST API&lt;/strong&gt; with a &lt;code&gt;POST /ask&lt;/code&gt; endpoint, designed for integration with other internal systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Request&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How do I configure the scanner on the Samsung printer?"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"To configure the scanner, follow these steps:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;1. Turn on the printer...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;[Source: SOP-ScannerSetup.txt]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
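
&lt;p&gt;Calling the endpoint from another internal system needs nothing beyond the standard library. This is a minimal client sketch, assuming the API is reachable at &lt;code&gt;http://localhost:8000&lt;/code&gt; (the host is an assumption) and that the response carries an &lt;code&gt;answer&lt;/code&gt; field as in the example above. The function names are hypothetical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:8000"):
    """Build the POST /ask request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000"):
    """Send the question and return the "answer" field of the JSON response."""
    request = build_ask_request(question, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping request construction separate from the network call makes the client easy to test without a running server.&lt;/p&gt;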






&lt;h2&gt;
  
  
  The challenge nobody talks about: token costs
&lt;/h2&gt;

&lt;p&gt;Building the RAG was the fun part. The real challenge came after: how do you control costs in production?&lt;/p&gt;

&lt;p&gt;A few decisions that made a real difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using SBERT for embeddings instead of the Gemini API&lt;/strong&gt; brings the API cost of indexing down to zero, because the model runs locally. The only paid call left is response generation, which is where the actual value is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limiting the vector search to &lt;code&gt;n_results=5&lt;/code&gt;&lt;/strong&gt; avoids passing unnecessary context to the model. More context means more tokens and more cost, without necessarily improving the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; was chosen intentionally over Pro: for objective questions about procedures, the quality difference is minimal while the cost difference is significant.&lt;/p&gt;
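
&lt;p&gt;The effect of &lt;code&gt;n_results&lt;/code&gt; on prompt size is easy to estimate on the back of an envelope. The sketch below assumes roughly four characters per token, a rule of thumb rather than the real tokenizer, and treats every retrieved chunk as a full 1000 characters, so the numbers are upper bounds for the context portion of the prompt only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def estimate_context_tokens(n_results, chunk_size, chars_per_token=4):
    """Rough upper bound: retrieved characters divided by characters per token."""
    return (n_results * chunk_size) // chars_per_token


for n in (5, 10, 20):
    print(n, estimate_context_tokens(n, chunk_size=1000))
# prints:
# 5 1250
# 10 2500
# 20 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Context grows linearly with &lt;code&gt;n_results&lt;/code&gt;, so doubling the number of retrieved chunks roughly doubles the input tokens billed on every question.&lt;/p&gt;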




&lt;h2&gt;
  
  
  Deployment: one container, two processes
&lt;/h2&gt;

&lt;p&gt;One decision that cost me a few hours was running the Discord bot and the FastAPI server in the same Docker container. The solution was &lt;strong&gt;Supervisor&lt;/strong&gt;, which manages both processes in a lightweight, self-recovering way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# supervisord.conf
&lt;/span&gt;&lt;span class="nn"&gt;[program:api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;uvicorn api_bot:app --host 0.0.0.0 --port 8000&lt;/span&gt;

&lt;span class="nn"&gt;[program:discord]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;python bot_discord.py&lt;/span&gt;

&lt;span class="py"&gt;autostart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;autorestart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a single, lightweight container that starts both services in parallel and automatically restarts either one if it fails. On an entry-level VPS, this matters a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned that wasn't in the plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Chunking is an art.&lt;/strong&gt; Chunk size and overlap affect response quality more than the model itself. I spent more time tuning this than anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security from day one.&lt;/strong&gt; The &lt;code&gt;.gitignore&lt;/code&gt; had to be configured before the first public commit to ensure no confidential company PDFs ended up in the repository. A mistake here is hard to undo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real problem wasn't technical.&lt;/strong&gt; The most complex part was understanding what kind of questions users would actually ask and how to structure the SOPs so the model could retrieve the right information. Garbage in, garbage out applies twice as hard in RAG.&lt;/p&gt;




&lt;h2&gt;
  
  
  The project is open source
&lt;/h2&gt;

&lt;p&gt;POPS AI is available on GitHub with a full README, &lt;code&gt;.env.example&lt;/code&gt;, configured Docker Compose, and step-by-step setup instructions for both local and container-based deployment.&lt;/p&gt;

&lt;p&gt;You can clone it, adapt it to your own knowledge base, and use it with your own documents — whether for SOPs, internal wikis, product manuals, or any PDF-based documentation.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/obelucca/POPS_AI" rel="noopener noreferrer"&gt;github.com/obelucca/POPS_AI&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Python 3.10&lt;/code&gt; &lt;code&gt;LangChain&lt;/code&gt; &lt;code&gt;ChromaDB&lt;/code&gt; &lt;code&gt;FastAPI&lt;/code&gt; &lt;code&gt;Discord.py&lt;/code&gt; &lt;code&gt;Google Gemini 2.0 Flash&lt;/code&gt; &lt;code&gt;SBERT&lt;/code&gt; &lt;code&gt;Docker&lt;/code&gt; &lt;code&gt;Supervisor&lt;/code&gt; &lt;code&gt;PyMuPDF&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you made it this far and are curious about any architectural decision, token cost management in production, or how to adapt this to a different use case — drop a comment. Happy to discuss.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>langchain</category>
    </item>
    <item>
      <title>From PDF to Discord with RAG: How I built a RAG agent to eliminate operational interruptions at the company</title>
      <dc:creator>Cleber Lucas</dc:creator>
      <pubDate>Tue, 09 Jun 2026 13:51:49 +0000</pubDate>
      <link>https://dev.to/obelucca__/do-pdf-ao-discord-com-rag-como-construi-um-agente-rag-para-eliminar-interrupcoes-operacionais-na-48n5</link>
      <guid>https://dev.to/obelucca__/do-pdf-ao-discord-com-rag-como-construi-um-agente-rag-para-eliminar-interrupcoes-operacionais-na-48n5</guid>
      <description>&lt;h1&gt;
  
  
  How I built a RAG agent to eliminate operational interruptions at the company
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;Open source project using Python, LangChain, ChromaDB, FastAPI and Discord, from a real problem to production deployment.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Every company has that silent cycle that drains time without anyone noticing.&lt;/p&gt;

&lt;p&gt;An employee has a question about a procedure. They can't find the answer in the documents. They interrupt someone more experienced. That person stops what they were doing, answers, and goes back to work with their train of thought already broken. Multiply that by 10, 20, 50 times a week.&lt;/p&gt;

&lt;p&gt;Watching that pattern is what made me decide to build &lt;strong&gt;POPS AI&lt;/strong&gt;: a RAG &lt;em&gt;(Retrieval-Augmented Generation)&lt;/em&gt; agent capable of answering questions about a company's Standard Operating Procedures, directly through Discord or via a REST API.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem that motivated the project
&lt;/h2&gt;

&lt;p&gt;The company had dozens of SOPs documented as PDFs. The problem wasn't a lack of documentation; it was the friction in accessing it. Nobody opens a network folder, looks for the right file, and reads 15 pages to answer a one-off question.&lt;/p&gt;

&lt;p&gt;The question I asked myself was simple: &lt;strong&gt;what if the documentation could answer on its own?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture in three stages
&lt;/h2&gt;

&lt;p&gt;The system works in three distinct phases, each with a clear responsibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Extraction
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;extrair_texto.py&lt;/code&gt; script reads the PDFs from the &lt;code&gt;pops_originais/&lt;/code&gt; folder, extracts the full text with PyMuPDF, and saves it as &lt;code&gt;.txt&lt;/code&gt;. Page images are also extracted for future use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;  &lt;span class="c1"&gt;# PyMuPDF
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extrair_texto_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caminho_pdf&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fitz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;caminho_pdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;texto_completo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pagina&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;texto_completo&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;pagina&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;texto_completo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but important: extraction quality determines response quality. Scanned PDFs without OCR are enemy number one here.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Embedding generation
&lt;/h3&gt;

&lt;p&gt;With the texts extracted, &lt;code&gt;gerar_embeddings.py&lt;/code&gt; splits the content into chunks using LangChain's &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt;, generates the vectors, and persists them in ChromaDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;chunk_overlap=200&lt;/code&gt; was a deliberate choice: it ensures context isn't cut off abruptly between one chunk and the next, which visibly improved response coherence.&lt;/p&gt;

&lt;p&gt;The project supports two embedding models via &lt;code&gt;config.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini &lt;code&gt;models/embedding-001&lt;/code&gt;&lt;/strong&gt;: high quality, requires an API key, and its cost scales with volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local SBERT (&lt;code&gt;paraphrase-multilingual-mpnet-base-v2&lt;/code&gt;)&lt;/strong&gt;: runs offline, great for avoiding costs or rate limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility was one of the design decisions that added the most value, especially for anyone who wants to try the project without spending anything.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Query (RAG)
&lt;/h3&gt;

&lt;p&gt;When the user asks a question, the system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converts the question into a vector using the same embedding model&lt;/li&gt;
&lt;li&gt;Searches ChromaDB for the most semantically similar chunks&lt;/li&gt;
&lt;li&gt;Builds a prompt with the retrieved excerpts as context&lt;/li&gt;
&lt;li&gt;Sends it to Gemini 2.0 Flash to generate the final answer
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resultados&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;embedding_pergunta&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;contexto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resultados&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Você é um assistente especializado nos POPs da empresa.
Use apenas as informações abaixo para responder.

Contexto:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;contexto&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Pergunta: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pergunta&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The interfaces: Discord and API
&lt;/h2&gt;

&lt;p&gt;The project exposes the knowledge base in two ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discord bot&lt;/strong&gt; with slash commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/pop &amp;lt;question&amp;gt;&lt;/code&gt;: queries the vector database and returns the answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/addpop &amp;lt;file.txt&amp;gt;&lt;/code&gt;: lets administrators add new SOPs in real time, without having to reprocess the entire base&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FastAPI API&lt;/strong&gt; with a &lt;code&gt;POST /ask&lt;/code&gt; endpoint, designed for integration with other internal systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Request&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Como configurar o scanner da impressora Samsung?"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Para configurar o scanner, siga os passos:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;1. Ligue a impressora...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;[Fonte: POP-ConfiguraçãoScanner.txt]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The challenge nobody mentions: token costs
&lt;/h2&gt;

&lt;p&gt;Building the RAG was the fun part. The real challenge came afterwards: how do you control cost in production?&lt;/p&gt;

&lt;p&gt;A few decisions that made a difference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using SBERT for embeddings instead of the Gemini API&lt;/strong&gt; brings the indexing cost down to zero, because the model runs locally. Cost only exists at response generation, which is where the real value is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limiting the vector search to &lt;code&gt;n_results=5&lt;/code&gt;&lt;/strong&gt; avoids passing unnecessary context to the model. More context means more tokens and more cost, without necessarily improving the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.0 Flash&lt;/strong&gt; was chosen intentionally over Pro: for objective questions about procedures, the quality difference is minimal and the cost difference is significant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment: one container, two processes
&lt;/h2&gt;

&lt;p&gt;One decision that cost me a few hours was running the Discord bot and the FastAPI API in the same Docker container. The solution was &lt;strong&gt;Supervisor&lt;/strong&gt;, which manages both processes in a lightweight, self-recovering way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# supervisord.conf
&lt;/span&gt;&lt;span class="nn"&gt;[program:api]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;uvicorn api_bot:app --host 0.0.0.0 --port 8000&lt;/span&gt;

&lt;span class="nn"&gt;[program:discord]&lt;/span&gt;
&lt;span class="py"&gt;command&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;python bot_discord.py&lt;/span&gt;

&lt;span class="py"&gt;autostart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;autorestart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a single, lightweight container that brings up both services in parallel and automatically restarts whichever one fails. On an entry-level VPS, that makes all the difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned that wasn't in the plan
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Chunking is an art.&lt;/strong&gt; Chunk size and overlap affect response quality more than the model itself. I spent more time tuning this than anything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security from the start.&lt;/strong&gt; The &lt;code&gt;.gitignore&lt;/code&gt; had to be configured before the first public commit to ensure that no PDF with confidential company data ended up in the repository. A mistake here is hard to reverse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real problem wasn't technical.&lt;/strong&gt; The most complex part was understanding what kind of questions users would ask and how to structure the SOPs so the model could retrieve the right information. Garbage in, garbage out counts double in RAG.&lt;/p&gt;




&lt;h2&gt;
  
  
  The project is open source
&lt;/h2&gt;

&lt;p&gt;POPS AI is available on GitHub with a complete README, &lt;code&gt;.env.example&lt;/code&gt;, a configured Docker Compose file, and step-by-step installation instructions for both local and container-based setups.&lt;/p&gt;

&lt;p&gt;You can clone it, adapt it to your own knowledge base, and use it with your own documents, whether for SOPs, internal wikis, product manuals, or any PDF-based documentation.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/obelucca/POPS_AI" rel="noopener noreferrer"&gt;github.com/obelucca/POPS_AI&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack used
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Python 3.10&lt;/code&gt; &lt;code&gt;LangChain&lt;/code&gt; &lt;code&gt;ChromaDB&lt;/code&gt; &lt;code&gt;FastAPI&lt;/code&gt; &lt;code&gt;Discord.py&lt;/code&gt; &lt;code&gt;Google Gemini 2.0 Flash&lt;/code&gt; &lt;code&gt;SBERT&lt;/code&gt; &lt;code&gt;Docker&lt;/code&gt; &lt;code&gt;Supervisor&lt;/code&gt; &lt;code&gt;PyMuPDF&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you made it this far and are curious about any architecture decision, token costs in production, or how to adapt this to a different use case, leave a comment. Let's trade ideas.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
