<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Muzammil</title>
    <description>The latest articles on DEV Community by Muhammad Muzammil (@muzammil_endevsols).</description>
    <link>https://dev.to/muzammil_endevsols</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898635%2Fc1665cd8-af8b-4b0e-8683-db2bb1cec273.png</url>
      <title>DEV Community: Muhammad Muzammil</title>
      <link>https://dev.to/muzammil_endevsols</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muzammil_endevsols"/>
    <language>en</language>
    <item>
      <title>LongTrainer: The Production-Ready Python RAG Framework That Replaces 500 Lines of LangChain Boilerplate</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Thu, 07 May 2026 04:21:56 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/longtrainer-the-production-ready-python-rag-framework-that-replaces-500-lines-of-langchain-1ggp</link>
      <guid>https://dev.to/muzammil_endevsols/longtrainer-the-production-ready-python-rag-framework-that-replaces-500-lines-of-langchain-1ggp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Build multi-tenant AI chatbots with persistent memory, streaming, tool calling, and 9 vector DB providers — in 10 lines of Python.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;&lt;strong&gt;The RAG Boilerplate Problem Nobody Talks About&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Every developer building a production RAG chatbot eventually faces the same wall.&lt;/p&gt;

&lt;p&gt;You start with a LangChain tutorial. You connect an LLM. You load a PDF. You get a response. It works — and then reality hits.&lt;/p&gt;

&lt;p&gt;You need multiple bots for multiple customers. You need their conversation history to survive a server restart. You need real-time streaming responses. You need your bot to call external APIs when documents don’t have the answer. You need to store vectors somewhere other than RAM. You need encryption. You need a REST API so the frontend team can actually use this thing.&lt;/p&gt;

&lt;p&gt;What started as a weekend prototype turns into hundreds of lines of infrastructure glue — and none of it is the actual product you are building.&lt;/p&gt;

&lt;p&gt;This is the problem LongTrainer was designed to solve.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;What Is LongTrainer?&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;LongTrainer is a production-ready, open-source Python RAG (Retrieval-Augmented Generation) framework published under the MIT License. It is an opinionated, batteries-included abstraction layer on top of LangChain and LangGraph that handles the full production chatbot lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document ingestion from 15+ sources&lt;/li&gt;
&lt;li&gt;Vector embedding and retrieval across 9 vector database providers&lt;/li&gt;
&lt;li&gt;Multi-tenant bot isolation with per-bot LLM, embeddings, and config&lt;/li&gt;
&lt;li&gt;Persistent conversation memory backed by MongoDB&lt;/li&gt;
&lt;li&gt;Streaming responses — sync and async&lt;/li&gt;
&lt;li&gt;Tool calling and agent reasoning via LangGraph&lt;/li&gt;
&lt;li&gt;Vision and multimodal chat&lt;/li&gt;
&lt;li&gt;Chat encryption at rest&lt;/li&gt;
&lt;li&gt;A built-in FastAPI REST server with zero configuration
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer
&lt;span class="c"&gt;# With optional agent/tool-calling support&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="s2"&gt;"longtrainer[agent]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full documentation is available at &lt;a href="https://endevsols.github.io/Long-Trainer"&gt;endevsols.github.io/Long-Trainer&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;Why LongTrainer Over Raw LangChain or LlamaIndex?&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Here is an honest comparison of what a production RAG system requires you to build yourself versus what LongTrainer provides:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Concern&lt;/th&gt;&lt;th&gt;Roll Your Own&lt;/th&gt;&lt;th&gt;LongTrainer&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Multi-bot management&lt;/td&gt;&lt;td&gt;Manage state dictionaries per tenant&lt;/td&gt;&lt;td&gt;&lt;code&gt;initialize_bot_id()&lt;/code&gt; → fully isolated bot&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Persistent memory&lt;/td&gt;&lt;td&gt;Wire MongoDB or Redis manually&lt;/td&gt;&lt;td&gt;Built-in MongoDB-backed history&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Document ingestion&lt;/td&gt;&lt;td&gt;Assemble loaders + splitters&lt;/td&gt;&lt;td&gt;&lt;code&gt;add_document_from_path(path, bot_id)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Streaming&lt;/td&gt;&lt;td&gt;Implement astream callbacks&lt;/td&gt;&lt;td&gt;&lt;code&gt;get_response(stream=True)&lt;/code&gt; yields chunks&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tool calling / Agent&lt;/td&gt;&lt;td&gt;Build LangGraph graph from scratch&lt;/td&gt;&lt;td&gt;&lt;code&gt;add_tool(my_tool)&lt;/code&gt; + &lt;code&gt;agent_mode=True&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Web search augmentation&lt;/td&gt;&lt;td&gt;Find, integrate, and maintain&lt;/td&gt;&lt;td&gt;&lt;code&gt;web_search=True&lt;/code&gt; flag&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Vision/multimodal&lt;/td&gt;&lt;td&gt;Complex multi-modal pipeline&lt;/td&gt;&lt;td&gt;&lt;code&gt;get_vision_response()&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Self-improvement&lt;/td&gt;&lt;td&gt;Not a standard concept&lt;/td&gt;&lt;td&gt;&lt;code&gt;train_chats()&lt;/code&gt; feeds Q&amp;amp;A back into KB&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Encryption at rest&lt;/td&gt;&lt;td&gt;Implement Fernet yourself&lt;/td&gt;&lt;td&gt;&lt;code&gt;encrypt_chats=True&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;REST API&lt;/td&gt;&lt;td&gt;Build FastAPI server yourself&lt;/td&gt;&lt;td&gt;&lt;code&gt;longtrainer serve&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The framework operates in two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Mode (LCEL Chain):&lt;/strong&gt; Fast, deterministic document Q&amp;amp;A using LangChain Expression Language. Best for knowledge base chatbots and document assistants where the document is the authoritative source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Mode (LangGraph):&lt;/strong&gt; A full agentic reasoning loop. The bot decides when to query documents, when to invoke tools, and how to chain multi-step reasoning. Best for workflows that require acting on external data.&lt;/p&gt;
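&lt;p&gt;The difference is easiest to see as control flow. The sketch below is a toy in plain Python, not LongTrainer or LangGraph internals, and every name in it is hypothetical; it only contrasts the single-pass RAG pipeline with the step-by-step agent loop:&lt;/p&gt;

```python
# Toy contrast of the two modes' control flow (illustrative only).

def rag_mode(question, retrieve, llm):
    """RAG mode: one retrieval, one generation. Deterministic pipeline."""
    context = retrieve(question)
    return llm(f"Context: {context}\nQuestion: {question}")

def agent_mode(question, retrieve, tools, llm, max_steps=5):
    """Agent mode: the model chooses an action each step until it answers."""
    scratchpad = []
    for _ in range(max_steps):
        # The model picks "answer", "retrieve", or one of its tools.
        action = llm(question, scratchpad)
        if action["type"] == "answer":
            return action["text"]
        elif action["type"] == "retrieve":
            scratchpad.append(retrieve(question))
        else:
            scratchpad.append(tools[action["tool"]](action["input"]))
    # Step budget exhausted: force a final answer from what was gathered.
    return llm(question, scratchpad, force_answer=True)["text"]
```

In RAG mode the retrieval always happens exactly once; in agent mode the model may retrieve zero, one, or several times and interleave tool calls, which is why it suits workflows that act on external data.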

&lt;h2&gt;&lt;strong&gt;Quickstart: From Zero to a Working RAG Bot&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System Dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Linux (Ubuntu/Debian)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;libmagic-dev poppler-utils tesseract-ocr qpdf libreoffice pandoc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;macOS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;libmagic poppler tesseract qpdf libreoffice pandoc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_token_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;encrypt_chats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
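&lt;p&gt;If the &lt;code&gt;chunk_size&lt;/code&gt; and &lt;code&gt;chunk_overlap&lt;/code&gt; numbers are unfamiliar: documents are split into windows of at most &lt;code&gt;chunk_size&lt;/code&gt; characters, and consecutive windows share &lt;code&gt;chunk_overlap&lt;/code&gt; characters so no sentence is cut off without context. LongTrainer delegates the real splitting to LangChain; this naive sketch only shows what the two parameters mean:&lt;/p&gt;

```python
def split_text(text, chunk_size=2048, chunk_overlap=200):
    """Naive character-level splitter: each chunk starts
    chunk_size - chunk_overlap characters after the previous one,
    so adjacent chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("x" * 5000, chunk_size=2048, chunk_overlap=200)
# Every chunk is at most 2048 chars; neighbours overlap by 200.
```

A larger overlap improves answer continuity at the cost of more embedded tokens; 200 on a 2048 chunk (about 10%) is a common starting point.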



&lt;p&gt;&lt;strong&gt;Load Documents&lt;/strong&gt;&lt;br&gt;
LongTrainer supports an extensive range of ingestion sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Local files — PDF, DOCX, CSV, HTML, Markdown, TXT
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contracts/agreement.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Web URLs and YouTube transcripts
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_link&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.yourapp.com/api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Amazon S3
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_aws_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;folder/data.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Google Drive
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_google_drive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1abc...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Confluence wiki
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_confluence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://yourco.atlassian.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you@yourco.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;space_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# GitHub repository
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/you/repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Dynamic injection — any LangChain document loader
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_dynamic_loader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MyCustomLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the Bot and Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Answer only from the provided context. {context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the termination clauses in section 4?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Synchronous streaming
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the key points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Async streaming — for FastAPI and other async frameworks
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aget_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain section 7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
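&lt;p&gt;The async variant is an ordinary async generator, so it drops straight into any asyncio code path. A self-contained toy of the consumption pattern, where &lt;code&gt;fake_stream&lt;/code&gt; stands in for &lt;code&gt;trainer.aget_response()&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

async def fake_stream(answer):
    """Stand-in for trainer.aget_response(): yields the reply in chunks."""
    for word in answer.split():
        yield word + " "

async def consume():
    # Accumulate chunks exactly as a UI or SSE endpoint would.
    parts = []
    async for chunk in fake_stream("streaming works the same way"):
        parts.append(chunk)
    return "".join(parts).strip()

result = asyncio.run(consume())
```

Inside an already-running event loop (a FastAPI handler, for example) you would iterate with `async for` directly instead of calling `asyncio.run`.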



&lt;p&gt;&lt;strong&gt;Multi-Tenancy: Built for SaaS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every bot created via &lt;code&gt;initialize_bot_id()&lt;/code&gt; receives a unique identifier. All associated data — documents, vector embeddings, conversation history, tool registrations, and per-bot configuration — is fully isolated to that ID.&lt;/p&gt;

&lt;p&gt;You can run hundreds of bots on a single LongTrainer instance with no cross-contamination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Customer A — Legal documents, GPT-4o-mini
&lt;/span&gt;&lt;span class="n"&gt;bot_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_a_contracts.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Customer B — Technical docs, Claude, custom embedding
&lt;/span&gt;&lt;span class="n"&gt;bot_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_b_api_docs.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bots persist across server restarts. Restore any previous bot with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
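&lt;p&gt;Conceptually, persistence is keying every artifact by &lt;code&gt;bot_id&lt;/code&gt; in durable storage (MongoDB in LongTrainer's case) so a restarted process can rebuild the bot from that key alone. A toy model of the idea, using a JSON file in place of MongoDB; all names here are illustrative, not LongTrainer internals:&lt;/p&gt;

```python
import json
import os
import tempfile
import uuid

# Durable store standing in for MongoDB.
store_path = os.path.join(tempfile.gettempdir(), "bots.json")

def save_bot(store, bot_id, config):
    """Write a bot's config under its ID so it survives a restart."""
    store[bot_id] = config
    with open(store_path, "w") as f:
        json.dump(store, f)

def load_bot(bot_id):
    """Rebuild a bot's config from the durable store by ID alone."""
    with open(store_path) as f:
        return json.load(f)[bot_id]

bot_id = str(uuid.uuid4())
save_bot({}, bot_id, {"llm": "gpt-4o-mini", "num_k": 5})
restored = load_bot(bot_id)  # a "restarted" process only needs bot_id
```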



&lt;p&gt;&lt;strong&gt;Agent Mode and Tool Calling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When retrieval alone is not enough — when your bot needs to act, not just answer — agent mode enables a full LangGraph reasoning loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Tool Loading (Zero Code)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tavily_search_results_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wikipedia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arxiv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PythonREPLTool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yahoo_finance_news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LongTrainer dynamically imports and initializes any string-based tool from &lt;code&gt;langchain.agents.load_tools&lt;/code&gt;; no manual wiring is required.&lt;/p&gt;
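&lt;p&gt;Resolving a tool from a bare string is ordinary dynamic import under the hood. A simplified illustration of the general mechanism (not LongTrainer's actual resolver), using a stdlib module in place of a LangChain tool:&lt;/p&gt;

```python
import importlib

def resolve_by_name(module_path, attr_name):
    """Import a module by its dotted path and pull an attribute off it;
    this is the general pattern behind loading tools from strings."""
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)

# e.g. resolving "sqrt" from the stdlib instead of a LangChain tool:
sqrt = resolve_by_name("math", "sqrt")
```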

&lt;p&gt;&lt;strong&gt;Custom Tool Registration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currency_pair&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch the real-time exchange rate for a currency pair like USD/EUR.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_rate_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currency_pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the current EUR/USD rate and what does the latest Fed statement say about it?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent autonomously decides when to query documents, when to call web search, and when to invoke your custom tool — all within a single turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database Support&lt;/strong&gt;&lt;br&gt;
LongTrainer treats vector store portability as a first-class concern, supporting nine providers out of the box:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Provider&lt;/th&gt;&lt;th&gt;Type&lt;/th&gt;&lt;th&gt;Best For&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;FAISS&lt;/td&gt;&lt;td&gt;Local / In-memory&lt;/td&gt;&lt;td&gt;Development, small scale&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pinecone&lt;/td&gt;&lt;td&gt;Cloud-native&lt;/td&gt;&lt;td&gt;Serverless, large scale&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Chroma&lt;/td&gt;&lt;td&gt;Open-source&lt;/td&gt;&lt;td&gt;Self-hosted, fast prototyping&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Qdrant&lt;/td&gt;&lt;td&gt;Open-source&lt;/td&gt;&lt;td&gt;High-performance filtering&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;PGVector&lt;/td&gt;&lt;td&gt;PostgreSQL extension&lt;/td&gt;&lt;td&gt;Existing Postgres infrastructure&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;MongoDB Atlas&lt;/td&gt;&lt;td&gt;Cloud&lt;/td&gt;&lt;td&gt;Unified database + vector search&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Milvus&lt;/td&gt;&lt;td&gt;Open-source&lt;/td&gt;&lt;td&gt;Billion-vector scale&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Weaviate&lt;/td&gt;&lt;td&gt;Open-source&lt;/td&gt;&lt;td&gt;Multi-modal, GraphQL&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Elasticsearch&lt;/td&gt;&lt;td&gt;Enterprise&lt;/td&gt;&lt;td&gt;Existing ES infrastructure&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Each bot can use a different vector store — a meaningful advantage in multi-tenant architectures where different customers may have different infrastructure requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Provider Support&lt;/strong&gt;&lt;br&gt;
LongTrainer’s Dynamic Model Factory accepts any BaseChatModel implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# OpenAI
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Anthropic Claude
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Google Gemini
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_google_vertexai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatVertexAI&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatVertexAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# AWS Bedrock
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatBedrock&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatBedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Groq
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatGroq&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.1-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Ollama (local / air-gapped inference)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOllama&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOllama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-bot LLM configuration makes LongTrainer well-suited for architectures where different customers or use cases warrant different models — GPT-4o for enterprise users, Ollama for on-premise deployments with strict data residency requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision and Multimodal Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vision_chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_vision_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vision_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What defects are visible in this manufacturing photo?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inspection_001.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inspection_002.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vision_chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vision_chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Self-Improving Memory:&lt;/strong&gt; &lt;code&gt;train_chats()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After a bot accumulates conversation history, you can feed that history back into its knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_chats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework extracts high-quality Q&amp;amp;A pairs from past sessions and re-ingests them as documents. Over time, the bot gets better at answering the specific questions your users are actually asking — a continuous improvement loop that raw LangChain pipelines do not provide out of the box.&lt;/p&gt;
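&lt;p&gt;To make this a true loop, &lt;code&gt;train_chats()&lt;/code&gt; can be scheduled on an interval. The sketch below uses only the standard library; the &lt;code&gt;FakeTrainer&lt;/code&gt; stub is a stand-in for a real &lt;code&gt;LongTrainer&lt;/code&gt; instance, and in production you would drive this from a background worker (cron, Celery beat, APScheduler) rather than a sleep loop.&lt;/p&gt;

```python
import time

def retrain_periodically(trainer, bot_id, interval_s, iterations):
    """Call trainer.train_chats(bot_id) on a fixed interval so the
    bot keeps re-ingesting high-quality Q&A pairs from its own history."""
    for _ in range(iterations):
        trainer.train_chats(bot_id)
        time.sleep(interval_s)

# Stub standing in for a real LongTrainer instance (illustrative only).
class FakeTrainer:
    def __init__(self):
        self.calls = 0

    def train_chats(self, bot_id):
        self.calls += 1

stub = FakeTrainer()
retrain_periodically(stub, "bot-123", interval_s=0.01, iterations=3)
print(stub.calls)  # 3
```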

&lt;p&gt;&lt;strong&gt;Zero-Code CLI and FastAPI Server&lt;/strong&gt;&lt;br&gt;
LongTrainer 1.2.1 ships with a production-ready CLI and REST API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize project&lt;/span&gt;
longtrainer init
&lt;span class="c"&gt;# Create a bot&lt;/span&gt;
longtrainer bot create &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"You are a helpful customer support agent."&lt;/span&gt;
&lt;span class="c"&gt;# Add a document&lt;/span&gt;
longtrainer add-doc &amp;lt;bot_id&amp;gt; /path/to/faq.pdf
&lt;span class="c"&gt;# Start an interactive chat session&lt;/span&gt;
longtrainer chat &amp;lt;bot_id&amp;gt;
&lt;span class="c"&gt;# Start the REST API server&lt;/span&gt;
longtrainer serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starts a FastAPI server at &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt; with 18 REST endpoints covering full CRUD for bots, document ingestion, chat session management, and streaming. The Swagger UI is auto-generated at &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /health&lt;/code&gt;: Health check&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /bots&lt;/code&gt;: Create bot&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /bots/{id}/documents/path&lt;/code&gt;: Ingest file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /bots/{id}/chats&lt;/code&gt;: Create chat session&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /bots/{id}/chats/{chat_id}&lt;/code&gt;: Chat with streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The server is Docker-ready and suitable for production deployment behind a reverse proxy.&lt;/p&gt;
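&lt;p&gt;For client code, the endpoint paths listed above can be centralized in small helpers. This is an illustrative sketch only — the helper names are invented here, and the auto-generated Swagger UI remains the authoritative source for request and response schemas.&lt;/p&gt;

```python
# Base URL for a locally running `longtrainer serve` instance.
BASE = "http://localhost:8000"

def document_ingest_path(bot_id: str) -> str:
    """POST target for ingesting a file into a bot."""
    return f"{BASE}/bots/{bot_id}/documents/path"

def chat_path(bot_id: str, chat_id: str) -> str:
    """POST target for chatting with a session (supports streaming)."""
    return f"{BASE}/bots/{bot_id}/chats/{chat_id}"

print(chat_path("b1", "c7"))  # http://localhost:8000/bots/b1/chats/c7
```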

&lt;h2&gt;
  
  
  &lt;strong&gt;Complete API Reference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Constructor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Default: ChatOpenAI(model="gpt-4o-2024-08-06")
&lt;/span&gt;    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Default: OpenAIEmbeddings()
&lt;/span&gt;    &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_token_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ensemble&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Multi-query ensemble retrieval
&lt;/span&gt;    &lt;span class="n"&gt;encrypt_chats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Fernet encryption at rest
&lt;/span&gt;    &lt;span class="n"&gt;encryption_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Auto-generated if None
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production Tuning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Query Ensemble Retrieval&lt;/strong&gt; Enable with &lt;code&gt;ensemble=True&lt;/code&gt;. Generates multiple reformulations of each user query and merges the retrieval results, which significantly improves recall for ambiguous or conversational queries at the cost of additional LLM calls per turn.&lt;/p&gt;
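&lt;p&gt;The merge step behind ensemble retrieval can be pictured as follows. This is not LongTrainer's internal implementation, just the general shape of the technique: retrieve top-k results for each reformulation, then deduplicate while preserving first-seen rank order.&lt;/p&gt;

```python
def ensemble_retrieve(queries, retrieve_fn, k=3):
    """Retrieve top-k docs for each query reformulation and merge them,
    deduplicating while keeping the first-seen rank order."""
    merged = []
    seen = set()
    for q in queries:
        for doc in retrieve_fn(q, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Toy retriever over a hard-coded corpus (illustrative only).
def toy_retrieve(query, k):
    corpus = {
        "refund policy": ["doc_refunds", "doc_terms", "doc_faq"],
        "money back": ["doc_refunds", "doc_billing", "doc_faq"],
    }
    return corpus.get(query, [])[:k]

result = ensemble_retrieve(["refund policy", "money back"], toy_retrieve)
print(result)  # ['doc_refunds', 'doc_terms', 'doc_faq', 'doc_billing']
```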

&lt;p&gt;&lt;strong&gt;Chunk Strategy&lt;/strong&gt; The default &lt;code&gt;chunk_size=2048&lt;/code&gt; with &lt;code&gt;chunk_overlap=200&lt;/code&gt; works well for general prose documents. For structured content — tables, code, legal clauses — reduce &lt;code&gt;chunk_size&lt;/code&gt; and increase &lt;code&gt;chunk_overlap&lt;/code&gt; to avoid splitting semantic units across boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;num_k Tuning&lt;/strong&gt; Start with &lt;code&gt;num_k=3&lt;/code&gt; for focused Q&amp;amp;A. Increase to a &lt;code&gt;num_k&lt;/code&gt; of 7–10 for synthesis tasks where broader context improves answer quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB Indexing&lt;/strong&gt; For deployments with hundreds of bots and thousands of conversations, index your MongoDB collections on the &lt;code&gt;bot_id&lt;/code&gt; and &lt;code&gt;chat_id&lt;/code&gt; fields to maintain consistent query performance at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Budget&lt;/strong&gt; &lt;code&gt;max_token_limit=32000&lt;/code&gt; controls the conversation context window. For models with 128K+ context windows, this value can be increased substantially. Monitor document sizes in the memory collection as conversations grow.&lt;/p&gt;
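&lt;p&gt;A quick back-of-envelope calculation helps when tuning the chunking parameters: with a sliding-window splitter, each new chunk advances by &lt;code&gt;chunk_size&lt;/code&gt; minus &lt;code&gt;chunk_overlap&lt;/code&gt; characters, so the chunk count for a document can be estimated up front.&lt;/p&gt;

```python
import math

def estimate_chunks(doc_len, chunk_size=2048, chunk_overlap=200):
    """Estimate chunk count for a sliding-window splitter where each
    chunk overlaps the previous one by chunk_overlap characters."""
    if doc_len > chunk_size:
        stride = chunk_size - chunk_overlap  # net advance per chunk
        return math.ceil((doc_len - chunk_overlap) / stride)
    return 1

# A 10,000-character document with the LongTrainer defaults:
print(estimate_chunks(10000))  # 6
```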

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SaaS Multi-Tenant Document Assistant&lt;/strong&gt; Each customer gets an isolated bot seeded with their own uploaded documents. Conversation history persists across sessions. LongTrainer’s &lt;code&gt;bot_id&lt;/code&gt; / &lt;code&gt;chat_id&lt;/code&gt; isolation model makes this architecture a few lines of code rather than an engineering project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Internal Knowledge Base&lt;/strong&gt; Load Confluence wikis, GitHub repos, internal PDFs, and S3 buckets into a single bot. Enable &lt;code&gt;ensemble=True&lt;/code&gt; for better recall on ambiguous queries. Enable &lt;code&gt;encrypt_chats=True&lt;/code&gt; for compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Customer Support Agent&lt;/strong&gt; Use agent mode with web search and a CRM lookup tool. The bot retrieves from product documentation, checks live ticket status via tool calls, and returns grounded answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research Assistant with Continuous Improvement&lt;/strong&gt; Feed academic PDFs into a bot. Run &lt;code&gt;train_chats()&lt;/code&gt; periodically to re-ingest high-quality Q&amp;amp;A pairs from past sessions. The bot improves incrementally without retraining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Premise Deployment for Data Residency&lt;/strong&gt; Use &lt;code&gt;ChatOllama&lt;/code&gt; as the LLM with a local FAISS store. No data leaves the premises. &lt;code&gt;longtrainer serve&lt;/code&gt; provides the REST interface for internal applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;PyPI: &lt;code&gt;pip install longtrainer&lt;/code&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;github.com/ENDEVSOLS/Long-Trainer&lt;/a&gt;&lt;br&gt;
Documentation: &lt;a href="https://endevsols.github.io/Long-Trainer" rel="noopener noreferrer"&gt;endevsols.github.io/Long-Trainer&lt;/a&gt;&lt;br&gt;
Open Source Tools: &lt;a href="https://endevsols.com/open-source/longtrainer" rel="noopener noreferrer"&gt;endevsols.com/open-source/longtrainer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If LongTrainer saves you meaningful engineering time, consider starring the repository and sharing it with your team.&lt;/p&gt;

&lt;p&gt;Tags: #Python #MachineLearning #LangChain #RAG #AI #ChatBot #OpenSource #LLM #NLP #GenerativeAI #LangGraph #VectorDatabase #ArtificialIntelligence #MLOps&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Beyond Chatbot Wrappers: Designing ‘Velocity Architecture’ for Production Multi-Agent Systems</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Wed, 06 May 2026 05:34:53 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/beyond-chatbot-wrappers-designing-velocity-architecture-for-production-multi-agent-systems-22dp</link>
      <guid>https://dev.to/muzammil_endevsols/beyond-chatbot-wrappers-designing-velocity-architecture-for-production-multi-agent-systems-22dp</guid>
      <description>&lt;p&gt;The tech landscape is currently flooded with “AI fatigue.” Every day, another startup launches a thin wrapper around a foundational LLM API, calling it a revolutionary product. But as any backend engineer operating in the real world knows: stringing together a few prompts behind a UI doesn’t survive contact with enterprise production.&lt;/p&gt;

&lt;p&gt;Monolithic prompts are brittle. Context windows get polluted. And when the system hallucinates or fails, debugging an opaque API call is a nightmare.&lt;/p&gt;

&lt;p&gt;To build high-ROI applications that actually solve complex problems, we need to stop building wrappers and start designing Velocity Architecture: infrastructure optimized for multi-agent orchestration, state persistence, and scalable execution.&lt;/p&gt;

&lt;p&gt;Here is a blueprint for designing backend systems where AI agents do actual work, not just chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Monolithic Prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The typical v1 approach to an AI feature is a single, massive prompt containing instructions, user input, and retrieved context (RAG).&lt;/p&gt;

&lt;p&gt;This fails at scale for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Degradation:&lt;/strong&gt; As you shove more retrieved data into the prompt, the LLM loses focus on the actual instructions (the “lost in the middle” phenomenon).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero Fault Tolerance:&lt;/strong&gt; If the model misunderstands one sub-task, the entire output fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Latency:&lt;/strong&gt; Processing massive monolithic prompts takes time and burns tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Multi-Agent Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of one monolithic LLM call doing everything, a multi-agent system breaks down complex workflows into discrete, specialized nodes. Think of it less like a brain, and more like a microservices architecture for AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Supervisor Pattern&lt;/strong&gt;&lt;br&gt;
In a production environment, you need a deterministic routing mechanism. We typically implement a Supervisor Node.&lt;/p&gt;

&lt;p&gt;The Supervisor doesn’t generate the final answer; it evaluates the user’s intent and routes the payload to specialized worker agents (e.g., a “Code Review Agent,” a “Data Extraction Agent,” or a “SQL Generation Agent”).&lt;/p&gt;

&lt;p&gt;By constraining each worker agent to a single, narrow system prompt, accuracy skyrockets, and hallucinations drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Infrastructure Stack&lt;/strong&gt;&lt;br&gt;
To build this orchestration layer effectively, your underlying stack matters. Here is a battle-tested architecture pattern for multi-agent MVPs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Asynchronous Engine: FastAPI&lt;/strong&gt;&lt;br&gt;
Multi-agent workflows are inherently asynchronous. Agents need to pause execution to call external APIs, query databases, or wait for another agent’s output. Python’s FastAPI is the ideal orchestration layer here due to its native asyncio support and high throughput. It allows the system to manage multiple concurrent agent graphs without blocking the main event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State Management &amp;amp; Vector Storage: PostgreSQL + pgvector&lt;/strong&gt;&lt;br&gt;
When agents hand off tasks to one another, they need a shared “memory” or state. Relying entirely on the LLM’s context window for this state is expensive and unreliable.&lt;/p&gt;

&lt;p&gt;Instead of juggling a separate vector database and a relational database, consolidate. Using PostgreSQL with the pgvector extension allows you to store your agent state (JSONB), relational user data, and embedding vectors in a single, ACID-compliant environment.&lt;/p&gt;
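&lt;p&gt;As an illustrative sketch of that consolidation, the DDL below shows one way such a schema could look. The table and column names are invented here, and the 1536 embedding dimension depends on which embedding model you use.&lt;/p&gt;

```python
# Illustrative DDL for a consolidated Postgres + pgvector store:
# agent state as JSONB, embeddings as a pgvector column, all in one table.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE agent_state (
    run_id     UUID PRIMARY KEY,
    state      JSONB NOT NULL,   -- shared agent memory / hand-off payload
    embedding  vector(1536),     -- dimension is embedding-model dependent
    updated_at TIMESTAMPTZ DEFAULT now()
);
"""
print(DDL)
```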

&lt;p&gt;&lt;strong&gt;3. The Orchestration Framework (e.g., LangGraph)&lt;/strong&gt;&lt;br&gt;
Rather than writing messy while loops to handle agent routing, use a graph-based state machine. Frameworks like LangGraph allow you to define agents as nodes and their interactions as edges. This makes the execution flow highly observable. If an agent loops infinitely, you can catch it at the graph level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Minimal Routing Example&lt;/strong&gt;&lt;br&gt;
Instead of giant code blocks, let’s look at the core routing logic. The secret to multi-agent stability is keeping the routing strict.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A conceptual look at how a Supervisor routes state
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;supervisor_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;routing_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a supervisor. Review the task and route to the correct worker.
    Available workers: [researcher, coder, reviewer]
    If the task is complete, route to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FINISH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The LLM outputs a structured JSON response dictating the next node
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routing_prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route_to&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By forcing the LLM to output a strict schema (using function calling or structured output), the graph framework knows exactly which Python function to trigger next. The LLM handles the logic, while standard Python code handles the execution.&lt;/p&gt;
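&lt;p&gt;A minimal sketch of that validation step, assuming the model returns JSON with a &lt;code&gt;route_to&lt;/code&gt; field (the field name and worker names mirror the supervisor example above):&lt;/p&gt;

```python
import json

# Routes must match nodes registered in the graph, plus the FINISH sentinel.
ALLOWED_ROUTES = {"researcher", "coder", "reviewer", "FINISH"}

def parse_route(raw):
    """Validate the supervisor's structured output before the graph
    acts on it; reject anything outside the known nodes."""
    payload = json.loads(raw)
    route = payload.get("route_to")
    if route not in ALLOWED_ROUTES:
        raise ValueError(f"unknown route: {route!r}")
    return route

print(parse_route('{"route_to": "coder"}'))  # coder
```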

&lt;p&gt;&lt;strong&gt;Why This Matters for Production&lt;/strong&gt;&lt;br&gt;
Building “Velocity Architecture” means establishing a foundation where new capabilities can be added simply by wiring a new agent into the graph.&lt;/p&gt;

&lt;p&gt;If you want to add a web-scraping feature, you don’t rewrite your massive master prompt. You create a simple Web Scraper Agent, define its input/output schema, and tell the Supervisor it exists.&lt;/p&gt;

&lt;p&gt;This decoupling is what separates hobbyist AI projects from enterprise-grade infrastructure. It allows for modular testing, independent scaling, and most importantly, predictable system behavior.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>python</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>Building Production-Ready RAG is Harder Than You Think (Here's How to Fix It)</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Mon, 04 May 2026 11:56:05 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/building-production-ready-rag-is-harder-than-you-think-heres-how-to-fix-it-47me</link>
      <guid>https://dev.to/muzammil_endevsols/building-production-ready-rag-is-harder-than-you-think-heres-how-to-fix-it-47me</guid>
      <description>&lt;p&gt;Building a RAG chatbot in a tutorial takes a weekend.&lt;br&gt;
Making it production-ready takes months, and most teams don't realize the complexity&lt;br&gt;
until they're already dealing with frustrated users and crashing servers.&lt;/p&gt;

&lt;p&gt;When building for enterprise, you have to optimize for iteration speed and&lt;br&gt;
rock-solid reliability. Here is what real-world production RAG actually requires&lt;br&gt;
that basic tutorials skip over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant isolation:&lt;/strong&gt; Ensuring Client A can never access Client B's vector data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory:&lt;/strong&gt; Session histories that survive server restarts, backed by MongoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses:&lt;/strong&gt; Handling heavy LLM loads without timing out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Knowing exactly &lt;em&gt;why&lt;/em&gt; the AI retrieved a specific chunk or gave a wrong answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection:&lt;/strong&gt; Catching fabrications before the end-user sees them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;LongTrainer&lt;/a&gt; to handle all of&lt;br&gt;
this out of the box. It sits on top of LangChain, so you don't have to wire the&lt;br&gt;
infrastructure together yourself.&lt;/p&gt;

&lt;p&gt;With over 39,000 downloads, it is actively powering deployments from FinTech to Healthcare.&lt;/p&gt;


&lt;h2&gt;
  
  
  Deploying a Multi-Tenant RAG Bot in 5 Lines
&lt;/h2&gt;

&lt;p&gt;Instead of writing custom session management, vector routing, and database wrappers,&lt;br&gt;
here is all you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize with persistent MongoDB memory
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Generate a fully isolated bot instance per client
&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Ingest documents into the bot's secure, isolated vector space
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/your/data.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Spin up the bot — embeddings and indexing handled automatically
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Create a persistent chat session
&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Route queries securely — bot_id and chat_id enforce strict isolation
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sources are returned alongside the answer for auditability
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call is routed through &lt;code&gt;bot_id&lt;/code&gt; and &lt;code&gt;chat_id&lt;/code&gt;. There is no shared state between&lt;br&gt;
clients - the vector index, chat history, and document context are all strictly isolated&lt;br&gt;
per bot instance.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Black Box Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo1jt11gb6bwmcnly8w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo1jt11gb6bwmcnly8w6.png" alt="Bar chart showing LongTrainer v1.3.0 increasing RAG accuracy from roughly 70 percent to a 95 percent accuracy rate, alongside a metric showing improved document retrieval accuracy." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an AI gives a wrong answer in production, you are usually debugging blind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the vector database retrieve the wrong document chunk?&lt;/li&gt;
&lt;li&gt;Did the LLM hallucinate beyond what the context supported?&lt;/li&gt;
&lt;li&gt;Was the prompt silently truncated due to token limits?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, you cannot answer any of these questions. You are waiting for&lt;br&gt;
a user complaint instead of catching the failure yourself.&lt;/p&gt;

&lt;p&gt;This is the core problem v1.3.0 addresses.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's New in v1.3.0: Native LongTracer Integration
&lt;/h2&gt;

&lt;p&gt;Install with the tracer extras:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer[tracer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it with a single flag at initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_tracer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Activate full observability
&lt;/span&gt;    &lt;span class="n"&gt;tracer_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Store traces in MongoDB
&lt;/span&gt;    &lt;span class="n"&gt;tracer_verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Enable NLI hallucination detection
&lt;/span&gt;    &lt;span class="n"&gt;tracer_verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Print span logs to console
&lt;/span&gt;    &lt;span class="n"&gt;tracer_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;     &lt;span class="c1"&gt;# Strictness for hallucination flagging (0.0–1.0)
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, two things happen automatically on every query:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Granular Observability
&lt;/h3&gt;

&lt;p&gt;LongTracer captures a hierarchical trace for every interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Every call to get_response() automatically generates a trace:
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the compliance section&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What gets captured behind the scenes:
# - Retrieval span: which documents were fetched, similarity scores, latency in ms
# - LLM span: exact prompt sent, token count (prompt + completion), generation latency
# - Agent spans (if agent_mode=True): every tool call, input, output, and execution time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All traces are stored in MongoDB and queryable at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;

&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longtracer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Pull all traces for a specific bot, ordered by timestamp
&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs.bot_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved docs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Real-Time Hallucination Detection
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;tracer_verify=True&lt;/code&gt; is set, every response goes through &lt;code&gt;CitationVerifier&lt;/code&gt;&lt;br&gt;
before being returned to the user.&lt;/p&gt;

&lt;p&gt;It works in two stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Claim extraction&lt;/strong&gt;&lt;br&gt;
The AI's response is split into atomic, independently verifiable claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: NLI cross-referencing&lt;/strong&gt;&lt;br&gt;
Each claim is checked against the retrieved source documents using a Natural Language&lt;br&gt;
Inference model. A claim fails if the source documents do not logically entail it.&lt;br&gt;
&lt;/p&gt;
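The two stages can be sketched in a few lines of Python. This is an illustrative toy, not LongTrainer's actual internals: `split_into_claims` naively treats each sentence as one claim, and `nli_entailment_prob` is a word-overlap stand-in for a real NLI model's entailment score.

```python
# Hypothetical sketch of the two-stage check (not LongTrainer's real code).
# A production system would replace nli_entailment_prob with a trained
# Natural Language Inference model.

def split_into_claims(response: str) -> list[str]:
    # Stage 1 (simplified): treat each sentence as one atomic claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def nli_entailment_prob(premise: str, hypothesis: str) -> float:
    # Stand-in scorer: word-overlap ratio instead of a trained NLI model.
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

def verify_response(response: str, sources: list[str], threshold: float = 0.5):
    failed = []
    for claim in split_into_claims(response):
        # Stage 2: a claim passes if at least one source entails it.
        best = max(nli_entailment_prob(src, claim) for src in sources)
        if best < threshold:
            failed.append(claim)
    return {"is_hallucinated": bool(failed), "failed_claims": failed}

report = verify_response(
    "The policy covers data retention. Refunds are issued in gold bars.",
    sources=["The policy covers data retention for five years."],
)
# The unsupported second claim is flagged; the grounded first claim passes.
```

The `threshold` here plays the same role as `tracer_threshold` in the initialization example: lower values tolerate weaker entailment, higher values flag more aggressively.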

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query hallucination records for a specific bot
&lt;/span&gt;&lt;span class="n"&gt;hallucinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs.bot_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs.is_hallucinated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hallucinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hallucinated response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed claims: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed_claims&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source docs used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are no longer waiting for a user to report an error. You have a systematic,&lt;br&gt;
queryable record of every point where the AI broke from its source material.&lt;/p&gt;
&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;If you want span and latency logging without the overhead of NLI evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_tracer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tracer_verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# Observability on, hallucination detection off
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;longtrainer[tracer]&lt;/code&gt; is not installed, LongTrainer bypasses the tracer&lt;br&gt;
entirely without raising an exception — no breaking changes to existing deployments.&lt;/p&gt;
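One common way to implement this kind of graceful degradation is the optional-dependency pattern sketched below. This is illustrative only, not LongTrainer's actual source: it probes for the extra with `importlib.util.find_spec` (so no API from the package has to be assumed) and falls back to a no-op object so calling code never branches.

```python
import importlib.util

# Sketch of the optional-extra pattern (illustrative, not LongTrainer's
# actual code): detect the package without importing any of its API.
TRACER_AVAILABLE = importlib.util.find_spec("longtracer") is not None

class NoOpTracer:
    """Stands in when the extra is missing: every call quietly does nothing."""
    def log_span(self, name, **fields):
        pass  # drop trace data instead of raising an ImportError later

def load_tracer():
    if not TRACER_AVAILABLE:
        # Graceful degradation: no exception, no behavior change.
        return NoOpTracer()
    # With the extra installed, the real tracer would be constructed here;
    # this sketch keeps the no-op so it stays self-contained and runnable.
    return NoOpTracer()

tracer = load_tracer()
tracer.log_span("retrieval", latency_ms=12)  # safe whether or not installed
```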


&lt;h2&gt;
  
  
  Also in v1.3.0: Lazy Loading at Scale
&lt;/h2&gt;

&lt;p&gt;Previous versions eagerly loaded all chat histories into RAM on server startup.&lt;br&gt;
At 100,000+ sessions, this caused startup times measured in minutes and significant&lt;br&gt;
memory pressure.&lt;/p&gt;

&lt;p&gt;v1.3.0 flips this entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before v1.3.0: all sessions loaded at startup → memory spike
# After v1.3.0: zero sessions loaded at startup
&lt;/span&gt;
&lt;span class="c1"&gt;# When a user sends a message:
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# LongTrainer fetches only *this* conversation thread from MongoDB on demand
# All other sessions remain unloaded until requested
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production environments with large user bases, startup time drops from&lt;br&gt;
minutes to milliseconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Standard install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer

&lt;span class="c"&gt;# With observability and hallucination detection&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"longtrainer[tracer]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported LLM providers: OpenAI, Anthropic, Gemini, AWS Bedrock, HuggingFace,&lt;br&gt;
Groq, Ollama, and any LangChain-compatible LLM.&lt;/p&gt;

&lt;p&gt;Supported vector stores: FAISS, Pinecone, Qdrant, PGVector, Chroma.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;github.com/ENDEVSOLS/Long-Trainer&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://endevsols.github.io/Long-Trainer" rel="noopener noreferrer"&gt;endevsols.github.io/Long-Trainer&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/longtrainer" rel="noopener noreferrer"&gt;pypi.org/project/longtrainer&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;For those of you already running RAG in production: what is the biggest&lt;br&gt;
infrastructure bottleneck you are currently hitting?&lt;/p&gt;

</description>
      <category>python</category>
      <category>langchain</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
    <item>
      <title>How We Automated Hallucination Detection in Enterprise RAG Pipelines</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:34:16 +0000</pubDate>
      <link>https://dev.to/endevsols/how-we-automated-hallucination-detection-in-enterprise-rag-pipelines-42ca</link>
      <guid>https://dev.to/endevsols/how-we-automated-hallucination-detection-in-enterprise-rag-pipelines-42ca</guid>
      <description>&lt;p&gt;Your RAG isn't broken. It's just lying quietly.&lt;/p&gt;

&lt;p&gt;Retrieval works. The LLM sounds confident. Your users get an answer.&lt;/p&gt;

&lt;p&gt;But somewhere in that response, a claim contradicts the source document it was supposed to be grounded in. No error thrown. No flag raised. Just a confident, wrong answer, delivered at scale.&lt;/p&gt;

&lt;p&gt;This is the hallucination problem that doesn't get talked about enough. Not the obvious failures. The subtle ones.&lt;/p&gt;

&lt;p&gt;We've seen it across enterprise RAG deployments: in legal tools, internal knowledge bases, and customer-facing assistants. The retrieval pipeline performs. The LLM performs. And still, trust erodes the moment a user catches one bad answer.&lt;/p&gt;

&lt;p&gt;We're open sourcing &lt;a href="https://endevsols.com/open-source/longtracer" rel="noopener noreferrer"&gt;LongTracer&lt;/a&gt;, our answer to this problem.&lt;/p&gt;

&lt;p&gt;LongTracer sits at the output layer of any RAG pipeline and verifies every claim in an LLM response against your source documents. It uses a hybrid STS + NLI approach: first finding the most semantically relevant source sentence per claim, then classifying whether that source actually supports, contradicts, or is neutral to what the LLM said.&lt;/p&gt;

&lt;p&gt;The result: a trust score, a verdict, and a clear list of exactly which claims hallucinated and why.&lt;/p&gt;

&lt;p&gt;No LLM calls. No vector store required. No new infrastructure. It works with LangChain, LlamaIndex, Haystack, LangGraph, or any pipeline that gives you a response and source chunks.&lt;/p&gt;

&lt;p&gt;MIT licensed. Built from real implementation experience.&lt;/p&gt;

&lt;p&gt;If you're running RAG in production, your users deserve answers you can actually stand behind.&lt;/p&gt;

&lt;p&gt;Try:&lt;br&gt;
&lt;code&gt;pip install longtracer&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>RAG vs. Fine-Tuning vs. Prompting: 2026 Strategic Guide</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Sun, 26 Apr 2026 11:40:41 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/rag-vs-fine-tuning-vs-prompting-2026-strategic-guide-169l</link>
      <guid>https://dev.to/muzammil_endevsols/rag-vs-fine-tuning-vs-prompting-2026-strategic-guide-169l</guid>
      <description>&lt;p&gt;As we navigate the landscape of 2026, the initial era of generative AI experimentation has yielded to a period of industrial-grade Enterprise LLM Implementation. For technical founders and CTOs, the fundamental challenge is no longer just selecting a foundational model, but architecting a system that safely bridges the 'Enterprise Data Gap' - the distance between a model's public training weights and your organization's proprietary intelligence.&lt;/p&gt;

&lt;p&gt;In our internal analysis of scaling enterprise AI systems, we found that optimizing data retrieval pipelines can reduce hallucination rates by up to 85% compared to baseline models. The decision between Retrieval-Augmented Generation (RAG), Fine-Tuning, and Prompt Engineering is no longer a theoretical debate; it is a critical infrastructure choice that dictates your compute costs, latency, and system scalability.&lt;br&gt;
This guide provides a practitioner's framework for architecting Large Language Models (LLMs) for maximum ROI, security, and production-grade accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Reality: Moving Beyond Base Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Base models are essentially 'polymaths with amnesia.' They possess vast general knowledge and reasoning capabilities but lack access to your internal databases, real-time analytics, and secure corporate data.&lt;br&gt;
To transform these models into production-ready assets, engineering teams must leverage one of three primary optimization levers. A common mistake is assuming that adjusting model weights (Fine-Tuning) is the default solution for poor performance. In reality, the most resilient architectures today are hybrid systems that utilize multi-agent workflows for routing, RAG for factual grounding, and fine-tuning exclusively for deep stylistic or logical specialization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Advanced Prompting &amp;amp; Multi-Agent Routing (The Agility Play)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt engineering has evolved far beyond basic text instructions. In 2026, it involves programmatic prompt construction and multi-agent orchestration frameworks like LangGraph. Instead of relying on a single zero-shot prompt, we design stateful, multi-actor systems where agents dynamically construct prompts based on the user's intent before routing the query to the appropriate LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Near-zero infrastructure overhead; instantaneous iteration; highly effective when combined with stateful agentic workflows.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Strictly bounded by the model's context window limits; highly susceptible to prompt injection attacks; prone to 'mode collapse' when instructions become too complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
Best utilized as the routing layer of an AI application. For example, using a lightweight model to classify an incoming query and dynamically inject the correct system prompt before passing it to a heavier model for execution.&lt;/p&gt;
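A minimal version of that routing layer might look like the following sketch. Everything here is hypothetical: `classify_intent` stands in for the lightweight classifier model (reduced to keyword matching so the example is self-contained), and the returned dict represents the request the heavier model would receive.

```python
# Hypothetical sketch of an intent-routing layer. classify_intent is a
# keyword-based stand-in for a real lightweight classifier model.

SYSTEM_PROMPTS = {
    "billing": "You are a billing support agent. Cite policy sections.",
    "technical": "You are a senior engineer. Answer with code where helpful.",
    "general": "You are a helpful assistant.",
}

def classify_intent(query: str) -> str:
    # Stand-in for the lightweight classifier: route on simple keywords.
    q = query.lower()
    if any(w in q for w in ("invoice", "refund", "charge")):
        return "billing"
    if any(w in q for w in ("error", "api", "traceback")):
        return "technical"
    return "general"

def route(query: str) -> dict:
    # Dynamically inject the matching system prompt, then hand off
    # the assembled request to the heavier model for execution.
    intent = classify_intent(query)
    return {"system": SYSTEM_PROMPTS[intent], "user": query}

request = route("Why was my card charged twice?")
```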

&lt;p&gt;&lt;strong&gt;Option B: Retrieval-Augmented Generation (The Contextual Powerhouse)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG is the industry standard for bridging LLMs with proprietary data. Instead of baking knowledge into the model's weights, RAG relies on a high-speed semantic search pipeline.&lt;br&gt;
When dealing with large-scale vectorization projects, often 300-400 GB of enterprise data, a naive RAG approach fails. Production RAG requires a robust pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingestion &amp;amp; Chunking: Parsing raw data and applying semantic chunking strategies to preserve context.&lt;/li&gt;
&lt;li&gt;Embedding: Passing chunks through an embedding model to create dense vector representations.&lt;/li&gt;
&lt;li&gt;Vector Store: Storing these embeddings in a high-performance vector database.&lt;/li&gt;
&lt;li&gt;Retrieval &amp;amp; Generation: Intercepting a user query, converting it to a vector, retrieving the Top-K nearest neighbors, and injecting that context into the LLM's prompt via a scalable backend (typically built on FastAPI).&lt;/li&gt;
&lt;/ol&gt;
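The four steps above can be sketched end to end in miniature. This is a toy, in-memory illustration only: the bag-of-words `embed` stands in for a real embedding model, and a plain list stands in for the vector database; a production pipeline would swap both out.

```python
import math

# Toy in-memory sketch of the four-step pipeline (illustrative only).

def embed(text: str) -> dict:
    # Step 2 stand-in: bag-of-words counts instead of a dense embedding model.
    vec = {}
    for tok in text.lower().split():
        tok = tok.strip(".,?!")
        if tok:
            vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: ingestion & chunking (here: already-chunked strings).
chunks = [
    "Refund requests are accepted within 30 days of purchase.",
    "The API rate limit is 100 requests per minute.",
]
# Step 3: the "vector store" is just a list of (chunk, vector) pairs.
store = [(c, embed(c)) for c in chunks]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Step 4: embed the query, rank by similarity, return the top-k chunks.
    qv = embed(query)
    ranked = sorted(store, key=lambda cv: cosine(qv, cv[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

context = retrieve("How many days for a refund?")
prompt = f"Answer using only this context:\n{context[0]}\n\nQ: How many days for a refund?"
```

In production the retrieved context would be injected into the LLM call behind a scalable backend (e.g. FastAPI), as described above.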

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Absolute data freshness; highly auditable (you can trace exact source documents); inherently secure through document-level access controls.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Introduces latency during the retrieval step; requires maintaining separate infrastructure (Vector DBs, embedding pipelines).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
RAG is the definitive architecture for systems requiring factual accuracy and real-time updates, such as medical clinical assistants parsing dynamic guidelines or financial chatbots querying live internal knowledge bases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option C: Fine-Tuning (The Deep Expertise Specialization)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning permanently alters the internal parameters (weights) of a pre-trained model. Rather than providing context at runtime, you are retraining the model on a highly curated, domain-specific dataset. Modern Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA and QLoRA, allow teams to freeze the base model and only update a small subset of weights, drastically reducing compute requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Unmatched performance in niche logical tasks; highly effective at forcing models to output specific structural formats (like proprietary code or strict JSON); reduces runtime latency compared to heavy RAG prompts.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; High risk of 'Knowledge Obsolescence' (data is frozen at training time); expensive data curation process; difficult to enforce user-level data security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
Reserved for tasks where reasoning style, format, and domain jargon outweigh the need for real-time data. Ideal for proprietary code generation, strict regulatory compliance parsing, or altering the inherent 'voice' of an open-source model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG vs Fine-Tuning vs Prompting: The Infrastructure Matrix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When architecting a solution, evaluate these critical dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Freshness: RAG provides real-time access. Fine-tuning is static.&lt;/li&gt;
&lt;li&gt;Hallucination Mitigation: RAG grounds outputs in provided facts. Fine-tuning can actually increase confident hallucinations if the training data is flawed.&lt;/li&gt;
&lt;li&gt;Security &amp;amp; Access Control: RAG allows for Role-Based Access Control (RBAC) at the database level. Fine-tuning bakes data into the weights, making it accessible to anyone who queries the model.&lt;/li&gt;
&lt;li&gt;Infrastructure Load: RAG shifts the load to memory and database I/O. Fine-tuning shifts the load to heavy GPU compute.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Strategic Recommendation for AI Architecture in 2026&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For engineering leaders, the optimal architecture is a RAG-First Strategy wrapped in Agentic Routing.&lt;br&gt;
By building a robust RAG architecture, you create a system that is grounded, auditable, and secure. Utilize frameworks like LangGraph to orchestrate prompt-based agents that handle logic and routing, and reserve fine-tuning strictly as a surgical tool for edge cases where the LLM struggles to grasp domain-specific formatting.&lt;/p&gt;

&lt;p&gt;Choosing the right path for LLM optimization is the difference between an AI product that scales efficiently and a fragile system that becomes a technical liability.&lt;/p&gt;

&lt;p&gt;At EnDevSols, we specialize in architecting production-grade multi-agent workflows and high-capacity RAG pipelines for enterprise clients. If you are a CTO or technical founder looking to transition from AI prototypes to scalable infrastructure, explore our &lt;a href="https://endevsols.com/services/generative-ai" rel="noopener noreferrer"&gt;Generative AI Development Services&lt;/a&gt; to see how we build resilient AI systems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
