<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Muzammil</title>
    <description>The latest articles on DEV Community by Muhammad Muzammil (@muzammil_endevsols).</description>
    <link>https://dev.to/muzammil_endevsols</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898635%2Fc1665cd8-af8b-4b0e-8683-db2bb1cec273.png</url>
      <title>DEV Community: Muhammad Muzammil</title>
      <link>https://dev.to/muzammil_endevsols</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muzammil_endevsols"/>
    <language>en</language>
    <item>
      <title>LongTracer: Open-Source RAG Hallucination Detection Without LLM-as-a-Judge</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Thu, 04 Jun 2026 11:08:12 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/longtracer-open-source-rag-hallucination-detection-without-llm-as-a-judge-39eg</link>
      <guid>https://dev.to/muzammil_endevsols/longtracer-open-source-rag-hallucination-detection-without-llm-as-a-judge-39eg</guid>
      <description>&lt;p&gt;Stop paying to evaluate your LLM outputs. Stop tolerating non-deterministic quality gates. LongTracer is the MIT-licensed Python library that catches RAG hallucinations at inference time — no API calls, no cloud dependency, no per-verification cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hallucination Problem Is Now a Production Engineering Problem&lt;/strong&gt;&lt;br&gt;
Retrieval-Augmented Generation (RAG) has become the dominant architecture for enterprise AI in 2025–2026. Legal research tools, medical Q&amp;amp;A systems, financial advisory bots, and customer-support agents all run the same core loop: retrieve context from a knowledge base, pass it to an LLM, return the response.&lt;/p&gt;

&lt;p&gt;The failure mode is well-documented: hallucination — the LLM generating confident, plausible-sounding output that directly contradicts the very source documents it was given.&lt;/p&gt;

&lt;p&gt;A legal assistant that cites a case that doesn’t exist.&lt;br&gt;
A medical chatbot that states the wrong drug dosage.&lt;br&gt;
A customer-support agent that invents a return policy.&lt;br&gt;
These are not edge cases. They are the daily operational reality for any team running RAG at scale.&lt;/p&gt;

&lt;p&gt;The engineering community has largely accepted the reframing: hallucination is not a model bug you patch once. It is a systems engineering discipline you manage continuously. That shift has spawned an entire category of LLM observability tooling — and the market is now crowded.&lt;/p&gt;

&lt;p&gt;This article does two things: gives you an honest map of the observability landscape as it stands today, and makes the technical case for LongTracer — a focused, open-source Python library built by EnDevSols that takes a fundamentally different approach to the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 2025–2026 LLM Observability Landscape: An Honest Map&lt;/strong&gt;&lt;br&gt;
Before evaluating any specific tool, it helps to understand what the market actually offers. As of mid-2026, the major players fall into four distinct categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General-Purpose Trace Platforms&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; (MIT-licensed, self-hostable) has become the default open-source choice for teams that need prompt management, session tracing, and evaluation harnesses. Its breadth is its strength — it integrates with LangChain, LlamaIndex, and custom pipelines, supports prompt versioning, and has a human annotation queue. Its fundamental limitation in the RAG verification space: it is an observability tool. It tells you what happened. It does not automatically verify whether the response was grounded in the retrieved documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arize Phoenix&lt;/strong&gt; brings a mature MLOps heritage. Built natively on OpenTelemetry, it excels at embedding drift detection, retrieval quality metrics, and evaluation pipelines. Teams with a traditional ML background will find the paradigm familiar. Like Langfuse, it is primarily a tracing and post-hoc evaluation platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; is the native observability layer for LangChain/LangGraph. Tightly integrated and excellent for graph visualization and annotation — but creates significant vendor lock-in and is less useful for teams using other frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Guardrail Platforms&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Galileo&lt;/strong&gt; differentiates through proprietary SLM models purpose-built for real-time evaluation. Its Luna-2 models are widely regarded as state-of-the-art for blocking harmful or hallucinated outputs before they reach users. The tradeoff: enterprise-only pricing, cloud-only deployment, and LLM-calls-to-evaluate-LLM-calls — compounding both cost and latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; takes an entirely different approach — acting as a transparent proxy between your application and LLM providers. The “one-line” setup is its headline feature. It excels at cost tracking and caching but is not a semantic verification system in any meaningful sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Gap These Tools Leave&lt;/strong&gt;&lt;br&gt;
Every solution above falls into one of two categories:&lt;/p&gt;

&lt;p&gt;Passive and post-hoc — observes and reports after the fact but does not verify claim-level grounding at inference time.&lt;br&gt;
Expensive and locked — real-time guardrails require enterprise contracts, cloud connectivity, and LLM calls to evaluate LLM calls.&lt;br&gt;
This is precisely the gap LongTracer is designed to fill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is LongTracer?&lt;/strong&gt;&lt;br&gt;
LongTracer is an open-source Python SDK (MIT license, available on PyPI) built by EnDevSols for one specific job: verify that every claim in an LLM response is actually supported by the source documents used to generate it.&lt;/p&gt;

&lt;p&gt;It achieves this using a hybrid STS + NLI pipeline — two lightweight encoder models that run entirely locally, with no external API calls, no internet dependency, and no per-verification cost.&lt;/p&gt;

&lt;p&gt;As of v0.2.0 (released May 18, 2026), it ships with a complete observability suite: a built-in web dashboard, OpenTelemetry export for Grafana/Datadog/Jaeger, active alerting via Slack/Discord/webhooks, and a production-grade REST API server. But the core mission has never changed:&lt;/p&gt;

&lt;p&gt;“RAG hallucination detection, multi-project tracing, and pluggable backends — all batteries included.”&lt;/p&gt;

&lt;p&gt;Install it in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;longtracer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supports Python 3.10, 3.11, and 3.12.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How LongTracer Works: The STS + NLI Pipeline&lt;/strong&gt;&lt;br&gt;
This is LongTracer’s core technical differentiator. Understanding the architecture is essential to understanding why it solves problems that other tools don’t.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with LLM-as-a-Judge&lt;/strong&gt;&lt;br&gt;
Most RAG evaluation approaches use an LLM-as-a-judge strategy: send the original response and the source context to a capable model (GPT-4o, Claude, Gemini) and ask it to score faithfulness. This approach is intuitive but introduces three serious production problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; An additional LLM call adds 1–5 seconds per inference.&lt;br&gt;
Cost: At scale, paying for an evaluation call on every response becomes substantial.&lt;br&gt;
&lt;strong&gt;Non-determinism:&lt;/strong&gt; The same inputs can produce different scores on consecutive runs, making CI/CD integration unreliable. You cannot write a test that will not flake.&lt;br&gt;
LongTracer’s design decision is direct: replace the LLM judge with a deterministic two-stage encoder pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Claim Splitting&lt;/strong&gt;&lt;br&gt;
The LLM response is broken into individual atomic claims using a regex-based sentence splitter tuned for LLM output patterns. Key behaviors:&lt;/p&gt;

&lt;p&gt;Decimal numbers (98.6°F) are not split at their period&lt;br&gt;
Standard abbreviations (Dr., Inc., e.g.) are handled correctly&lt;br&gt;
Meta-statements — honest uncertainty phrases like “the documents do not contain…” — are detected and never flagged as hallucinations, even if no source explicitly supports them&lt;br&gt;
Hallucination-signaling phrases — statements like “based on my general knowledge…” — are flagged regardless of downstream NLI score, because they explicitly indicate the model is drawing on training data rather than the retrieved context&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2A — STS Evidence Selection (&amp;lt; 10ms per claim)&lt;/strong&gt;&lt;br&gt;
For each atomic claim, the bi-encoder all-MiniLM-L6-v2 computes cosine similarity between the claim embedding and every sentence in the provided source documents. The highest-scoring sentence is selected as the candidate evidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gating logic:&lt;/strong&gt; If the best similarity score is below 0.25, the NLI stage is skipped entirely. There is no value in running a cross-encoder on a claim that has no plausible source match — this saves compute and avoids false positives on topics genuinely absent from the retrieved context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2B — NLI Verification (~150ms per claim)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cross-encoder nli-deberta-v3-xsmall takes the (claim, best_source_sentence) pair and outputs three probabilities:&lt;/p&gt;

&lt;p&gt;LabelMeaningActionentailmentSource text supports the claim✅ Claim passesneutralSource neither confirms nor contradicts⚠️ Claim is unverifiedcontradictionSource directly contradicts the claim❌ Hallucination flagged&lt;/p&gt;

&lt;p&gt;A claim is flagged as a hallucination when contradiction_score &amp;gt; 0.5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust Score&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;trust_score = supported_claims / total_claims&lt;br&gt;
A score of 1.0 means every claim in the response is supported by retrieved documents. A score of 0.0 means none are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SLM Fallback for Numeric and Temporal Claims (v0.1.4+)&lt;/strong&gt;&lt;br&gt;
Standard NLI models are known to underperform on fine-grained numeric and date comparisons — distinguishing “330 meters” from “303 meters” is a semantic task NLI encoders were not optimized for. LongTracer v0.1.4 addressed this with an optional SLM fallback verifier using Qwen2.5-1.5B-Instruct-GGUF. This model is invoked automatically only when NLI confidence is low and the claim contains numeric or temporal content. The gating logic ensures the baseline verification path stays under 150ms for the vast majority of real-world claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The One-Liner API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zero configuration. No account. No API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Eiffel Tower is 330 meters tall and located in Berlin.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# "FAIL"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# 0.5
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hallucination_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 1  ("Berlin" contradicts "Paris, France")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or from the terminal, with no Python code at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;longtracer check &lt;span class="s2"&gt;"The Eiffel Tower is in Berlin."&lt;/span&gt; &lt;span class="s2"&gt;"The Eiffel Tower is in Paris."&lt;/span&gt;
&lt;span class="c"&gt;# ✗ FAIL  trust=0.50  hallucinations=1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Framework Integrations: Every Major RAG Stack&lt;/strong&gt;&lt;br&gt;
One of LongTracer’s most practical competitive advantages is the breadth of its native adapters. As of v0.2.0, it supports seven major frameworks with minimal integration code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instrument_langchain&lt;/span&gt;
&lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;instrument_langchain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;your_chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Every chain.invoke() now auto-verifies responses against retrieved context
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instrument_llamaindex&lt;/span&gt;
&lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;instrument_llamaindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;your_query_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;LangGraph&lt;/span&gt; &lt;span class="n"&gt;Agents&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instrument_langgraph&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;instrument_langgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;callbacks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LangGraph adapter accumulates sources across multi-step tool calls and runs verification once at agent completion, not after every intermediate step. This means the final answer — not intermediate reasoning — is what gets verified, avoiding noisy per-step false positives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Haystack v2&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer.adapters.haystack_handler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTracerVerifier&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LongTracerVerifier&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generator.replies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verifier.response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verifier.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI Assistants API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instrument_openai_assistant&lt;/span&gt;
&lt;span class="nf"&gt;instrument_openai_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Automatically verifies assistant responses against file_search citations
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instrument_crewai&lt;/span&gt;
&lt;span class="nf"&gt;instrument_crewai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Wraps kickoff() to verify each task output against its context sources
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AutoGen (≥ 0.4)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;instrument_autogen&lt;/span&gt;
&lt;span class="nf"&gt;instrument_autogen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Direct API — Any Framework&lt;/strong&gt;&lt;br&gt;
For custom pipelines or frameworks not yet listed, the CitationVerifier accepts plain strings with no dependencies on vector stores, LLMs, or external services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer.guard.verifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CitationVerifier&lt;/span&gt;
&lt;span class="n"&gt;verifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CitationVerifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM said this...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk 1 text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk 2 text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;source_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multi-Project Tracing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Production teams rarely run a single RAG application. LongTracer’s multi-project architecture allows you to trace multiple applications — a customer chatbot, an internal search API, a document Q&amp;amp;A service — under a single backend while keeping traces tagged and independently filterable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTracer&lt;/span&gt;
&lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chatbot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LongTracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chatbot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_root&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is your cancellation policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each project’s traces are independently browsable via the CLI and the web dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pluggable Storage Backends&lt;/strong&gt;&lt;br&gt;
LongTracer stores verification traces in configurable backends suited to every deployment scenario:&lt;/p&gt;

&lt;p&gt;BackendInstallBest ForSQLiteBuilt-in (default)Local development, single-serverMemoryBuilt-inTesting, ephemeral runsMongoDBpip install "longtracer[mongo]"Production, distributedPostgreSQLpip install "longtracer[postgres]"Production, relationalRedispip install "longtracer[redis]"High-throughput, ephemeral&lt;/p&gt;

&lt;p&gt;Configuration is a single block in pyproject.toml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.longtracer]&lt;/span&gt;
&lt;span class="py"&gt;project&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my-rag-app"&lt;/span&gt;
&lt;span class="py"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"sqlite"&lt;/span&gt;
&lt;span class="py"&gt;threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="py"&gt;verbose&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or via environment variables, following the configuration priority chain:&lt;/p&gt;

&lt;p&gt;Learn about Medium’s values&lt;br&gt;
Code arguments → Environment variables → pyproject.toml → Built-in defaults&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;v0.2.0: The Observability and Analytics Suite&lt;/strong&gt;&lt;br&gt;
The most significant release in LongTracer’s history shipped on May 18, 2026. Version 0.2.0 transforms LongTracer from a standalone guardrail library into a full observability platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-In Web Dashboard&lt;/strong&gt;&lt;br&gt;
longtracer serve&lt;/p&gt;
&lt;h1&gt;
  
  
  Open &lt;a href="http://localhost:8000/dashboard" rel="noopener noreferrer"&gt;http://localhost:8000/dashboard&lt;/a&gt;
&lt;/h1&gt;

&lt;p&gt;Browse all verified traces across every project, view hallucination rates over time, and drill into individual trace spans. The dashboard is authenticated via HTTP-only cookies with timing-safe digest comparison — production-grade security out of the box, no configuration required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aggregated Metrics API&lt;/strong&gt;&lt;br&gt;
Two new endpoints provide programmatic access to verification metrics:&lt;/p&gt;

&lt;p&gt;GET /api/v1/metrics/summary — total traces, average trust score, total hallucinations across all projects&lt;br&gt;
GET /api/v1/metrics/timeseries — trend data for dashboarding or alerting integrations&lt;br&gt;
OpenTelemetry Export&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"longtracer[otel]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LongTracer emits standard OTLP spans (longtracer.verify) with the following attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;longtracer.trust_score&lt;/li&gt;
&lt;li&gt;longtracer.hallucination_count&lt;/li&gt;
&lt;li&gt;longtracer.verdict&lt;/li&gt;
&lt;li&gt;longtracer.project&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are fully compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, and any OTLP-compliant backend. A pre-configured Grafana Dashboard Template (grafana/longtracer.json) is included in the repository for instant visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critically:&lt;/strong&gt; if OTel packages are not installed, the integration fails gracefully as a zero-overhead no-op. No crashes, no warnings, no behavior change in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active Alerting System&lt;/strong&gt;&lt;br&gt;
LongTracer’s alerting runs in a background daemon thread — it never blocks the verification pipeline. When a trust score drops below a configured threshold, notifications are dispatched to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack&lt;/li&gt;
&lt;li&gt;Discord&lt;/li&gt;
&lt;li&gt;Email&lt;/li&gt;
&lt;li&gt;Custom Webhooks — HMAC-SHA256 signed, Stripe-style, with 5 retries and exponential backoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configuration is a single environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;LONGTRACER_ALERT_THRESHOLD&lt;/span&gt;=&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;LONGTRACER_SLACK_WEBHOOK_URL&lt;/span&gt;=&lt;span class="n"&gt;https&lt;/span&gt;://&lt;span class="n"&gt;hooks&lt;/span&gt;.&lt;span class="n"&gt;slack&lt;/span&gt;.&lt;span class="n"&gt;com&lt;/span&gt;/...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The webhook implementation uses dead-letter logging after maximum retries, ensuring no silent alert failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The CLI: Full Observability Without Writing Code&lt;/strong&gt;&lt;br&gt;
The longtracer CLI provides complete trace access from the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
longtracer view                        &lt;span class="c"&gt;# List recent traces&lt;/span&gt;
longtracer view &lt;span class="nt"&gt;--last&lt;/span&gt;                 &lt;span class="c"&gt;# View most recent trace&lt;/span&gt;
longtracer view &lt;span class="nt"&gt;--id&lt;/span&gt; &amp;lt;trace_id&amp;gt;        &lt;span class="c"&gt;# View specific trace&lt;/span&gt;
longtracer view &lt;span class="nt"&gt;--project&lt;/span&gt; chatbot-prod &lt;span class="c"&gt;# Filter by project&lt;/span&gt;
longtracer view &lt;span class="nt"&gt;--export&lt;/span&gt; &amp;lt;trace_id&amp;gt;    &lt;span class="c"&gt;# Export trace to JSON&lt;/span&gt;
longtracer view &lt;span class="nt"&gt;--html&lt;/span&gt; &amp;lt;trace_id&amp;gt;      &lt;span class="c"&gt;# Export to self-contained&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HTML report&lt;br&gt;
The HTML export is particularly useful for cross-functional teams. It is a zero-dependency, self-contained single file with:&lt;/p&gt;

&lt;p&gt;Color-coded per-claim verdict rows&lt;br&gt;
Side-by-side diff of the LLM claim versus the best matching source evidence&lt;br&gt;
A summary stats bar showing pass/fail/hallucination breakdown&lt;br&gt;
Click-to-expand claim detail with STS score, entailment score, and contradiction score&lt;br&gt;
Send an HTML trace file to a product manager, QA engineer, or non-technical stakeholder and they can immediately see exactly which claims were hallucinated and which source sentence was evaluated against each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The REST API Server Mode&lt;/strong&gt;&lt;br&gt;
For polyglot environments or microservice architectures, LongTracer can operate as a standalone HTTP verification service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;longtracer serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts a FastAPI-based server with:&lt;/p&gt;

&lt;p&gt;POST /api/v1/verify — verify a single response&lt;br&gt;
POST /api/v1/verify/batch — bulk verification in a single call&lt;br&gt;
GET /api/v1/health — health check (no authentication required)&lt;br&gt;
GET /api/v1/traces — list recent traces&lt;br&gt;
GET /api/v1/traces/{trace_id} — retrieve a specific trace&lt;br&gt;
Security features included by default:&lt;/p&gt;

&lt;p&gt;API key authentication via x-api-key header (LangSmith-standard) with Authorization: Bearer fallback&lt;br&gt;
Timing-safe key comparison via secrets.compare_digest&lt;br&gt;
CORS middleware with configurable origins&lt;br&gt;
Token bucket rate limiter (60 req/min per IP, configurable)&lt;br&gt;
Pydantic input validation with max-length and max-items constraints&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Determinism Matters for CI/CD Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most practically important properties of LongTracer is determinism. Because verification uses fixed encoder weights rather than a generative LLM, the same inputs always produce the same output on the same hardware.&lt;/p&gt;

&lt;p&gt;This is a prerequisite for integrating hallucination detection into CI/CD pipelines. Teams can write regression tests that assert specific trust scores — and those tests will not flake due to model stochasticity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# In your test suite
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_rag_response_is_grounded&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_retrieved_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RAG response grounding degraded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hallucination_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of deterministic quality gate is simply not possible with LLM-as-a-judge tools, where the same prompt can score 0.9 on one run and 0.7 on the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async Support and Batch Processing&lt;/strong&gt;&lt;br&gt;
Modern Python applications run on asyncio. LongTracer supports fully async verification:&lt;/p&gt;

&lt;p&gt;result = await verifier.verify_parallel_async(response, sources)&lt;br&gt;
For bulk evaluation workloads — running evaluations over a dataset of historical traces, or benchmarking a new retrieval configuration — the batch API parallelizes claim verification internally using ThreadPoolExecutor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtracer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_batch&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P is NP.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It is not known if P equals NP.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Water boils at 100°C.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Water boils at 100°C at standard atmospheric pressure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Who Should Use LongTracer?&lt;/strong&gt;&lt;br&gt;
LongTracer is the right choice if:&lt;/p&gt;

&lt;p&gt;You are building a RAG application and need to know, at inference time, whether the LLM’s response is grounded in the retrieved documents&lt;br&gt;
You want hallucination detection without paying for additional LLM API calls on every inference&lt;br&gt;
You need CI/CD-compatible, deterministic quality gates for your RAG pipeline&lt;br&gt;
You are using LangChain, LlamaIndex, LangGraph, Haystack, CrewAI, AutoGen, or the OpenAI Assistants API&lt;br&gt;
You want a fully self-hosted, data-private solution with no external dependencies&lt;br&gt;
You need to monitor multiple RAG projects under a single backend&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LongTracer is not the primary choice if:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your primary need is LLM cost tracking and response caching (Helicone is optimized for this)&lt;br&gt;
You need enterprise-grade real-time safety guardrails with SLA guarantees and dedicated support (Galileo is the leader here)&lt;br&gt;
You are deeply invested in the LangChain ecosystem and need native graph visualization and annotation queues (LangSmith serves this niche well)&lt;/p&gt;

&lt;p&gt;Getting Started in Under 5 Minutes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# 1. Install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;longtracer

&lt;span class="c"&gt;# 2. Run your first verification - no config required&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from longtracer import check
result = check(
    'The Eiffel Tower is located in Berlin.',
    ['The Eiffel Tower is located in Paris, France.']
)
print(f'Verdict: {result.verdict}')
print(f'Trust Score: {result.trust_score}')
print(f'Hallucinations: {result.hallucination_count}')
"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Or use the CLI&lt;/span&gt;
longtracer check &lt;span class="s2"&gt;"The Eiffel Tower is in Berlin."&lt;/span&gt; &lt;span class="s2"&gt;"The Eiffel Tower is in Paris."&lt;/span&gt;

&lt;span class="c"&gt;# 4. Start the dashboard&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"longtracer[server]"&lt;/span&gt;

longtracer serve

&lt;span class="c"&gt;# Visit http://localhost:8000/dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The LLM observability market is mature and well-funded. Most tools in the space are still solving the wrong problem — they tell you that something went wrong after the fact, or they use an LLM to evaluate an LLM, adding cost and non-determinism to an already uncertain pipeline.&lt;/p&gt;

&lt;p&gt;LongTracer takes a fundamentally different bet: that a carefully engineered two-stage encoder pipeline — STS for evidence selection, NLI for semantic verification — can catch the majority of real-world RAG hallucinations with near-zero latency, zero external API cost, and complete determinism.&lt;/p&gt;

&lt;p&gt;That bet has held up in practice. Since its initial release in April 2025, LongTracer has shipped adapters for seven major frameworks, a production-grade REST API server, a complete observability suite with OTel integration, and a web dashboard — all while maintaining its core constraint: no vector store dependency, no LLM dependency, just strings in and verification out.&lt;/p&gt;

&lt;p&gt;For teams that have accepted hallucination as an inevitable tax on AI-powered applications, LongTracer offers a different path: treat every LLM response as innocent until proven grounded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/ENDEVSOLS/LongTracer" rel="noopener noreferrer"&gt;github.com/ENDEVSOLS/LongTracer&lt;/a&gt;&lt;br&gt;
Documentation: &lt;a href="https://endevsols.github.io/LongTracer" rel="noopener noreferrer"&gt;endevsols.github.io/LongTracer&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/longtracer" rel="noopener noreferrer"&gt;pypi.org/project/longtracer&lt;/a&gt;&lt;br&gt;
Quick Start: &lt;a href="https://endevsols.github.io/LongTracer/getting-started/quickstart" rel="noopener noreferrer"&gt;endevsols.github.io/LongTracer/getting-started/quickstart&lt;/a&gt;&lt;br&gt;
EnDevSols Open-Source Projects: &lt;a href="https://endevsols.com/open-source/longtracer&amp;lt;br&amp;gt;%0AChangelog" rel="noopener noreferrer"&gt;CHANGELOG.md&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
    <item>
      <title>LongTrainer: The Production-Ready Python RAG Framework That Replaces 500 Lines of LangChain Boilerplate</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Thu, 07 May 2026 04:21:56 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/longtrainer-the-production-ready-python-rag-framework-that-replaces-500-lines-of-langchain-1ggp</link>
      <guid>https://dev.to/muzammil_endevsols/longtrainer-the-production-ready-python-rag-framework-that-replaces-500-lines-of-langchain-1ggp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Build multi-tenant AI chatbots with persistent memory, streaming, tool calling, and 9 vector DB providers — in 10 lines of Python.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The RAG Boilerplate Problem Nobody Talks About&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every developer building a production RAG chatbot eventually faces the same wall.&lt;/p&gt;

&lt;p&gt;You start with a LangChain tutorial. You connect an LLM. You load a PDF. You get a response. It works — and then reality hits.&lt;/p&gt;

&lt;p&gt;You need multiple bots for multiple customers. You need their conversation history to survive a server restart. You need real-time streaming responses. You need your bot to call external APIs when documents don’t have the answer. You need to store vectors somewhere other than RAM. You need encryption. You need a REST API so the frontend team can actually use this thing.&lt;/p&gt;

&lt;p&gt;What started as a weekend prototype turns into hundreds of lines of infrastructure glue — and none of it is the actual product you are building.&lt;/p&gt;

&lt;p&gt;This is the problem LongTrainer was designed to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is LongTrainer?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LongTrainer is a production-ready, open-source Python RAG (Retrieval-Augmented Generation) framework published under the MIT License. It is an opinionated, batteries-included abstraction layer on top of LangChain and LangGraph that handles the full production chatbot lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document ingestion from 15+ sources&lt;/li&gt;
&lt;li&gt;Vector embedding and retrieval across 9 vector database providers&lt;/li&gt;
&lt;li&gt;Multi-tenant bot isolation with per-bot LLM, embeddings, and config&lt;/li&gt;
&lt;li&gt;Persistent conversation memory backed by MongoDB&lt;/li&gt;
&lt;li&gt;Streaming responses — sync and async&lt;/li&gt;
&lt;li&gt;Tool calling and agent reasoning via LangGraph&lt;/li&gt;
&lt;li&gt;Vision and multimodal chat&lt;/li&gt;
&lt;li&gt;Chat encryption at rest&lt;/li&gt;
&lt;li&gt;A built-in FastAPI REST server with zero configuration
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer
&lt;span class="c"&gt;# With optional agent/tool-calling support&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer[agent]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full documentation is available at endevsols.github.io/Long-Trainer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why LongTrainer Over Raw LangChain or LlamaIndex?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here is an honest comparison of what a production RAG system requires you to build yourself versus what LongTrainer provides:&lt;/p&gt;

&lt;p&gt;Concern Roll Your Own LongTrainer Multi-bot management Manage state dictionaries per tenant initialize_bot_id() → fully isolated bot Persistent memory Wire MongoDB or Redis manually Built-in MongoDB-backed history Document ingestion Assemble loaders + splitters add_document_from_path(path, bot_id) Streaming Implement astream callbacks get_response(stream=True) yields chunks Tool calling / Agent Build LangGraph graph from scratch add_tool(my_tool) + agent_mode=True Web search augmentation Find, integrate, and maintain web_search=True flag Vision/multimodal Complex multi-modal pipeline get_vision_response() Self-improvement Not a standard concept train_chats() feeds Q&amp;amp;A back into KB Encryption at rest Implement Fernet yourself encrypt_chats=True REST API Build FastAPI server yourself longtrainer serve&lt;/p&gt;

&lt;p&gt;The framework operates in two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Mode (LCEL Chain):&lt;/strong&gt; Fast, deterministic document Q&amp;amp;A using LangChain Expression Language. Best for knowledge base chatbots and document assistants where the document is the authoritative source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Mode (LangGraph):&lt;/strong&gt; A full agentic reasoning loop. The bot decides when to query documents, when to invoke tools, and how to chain multi-step reasoning. Best for workflows that require acting on external data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quickstart:&lt;/strong&gt; From Zero to a Working RAG Bot
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System Dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Linux (Ubuntu/Debian)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;libmagic-dev poppler-utils tesseract-ocr qpdf libreoffice pandoc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;macOS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;libmagic poppler tesseract qpdf libreoffice pandoc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_token_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;encrypt_chats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Load Documents&lt;/strong&gt;&lt;br&gt;
LongTrainer supports an extensive range of ingestion sources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Local files — PDF, DOCX, CSV, HTML, Markdown, TXT
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contracts/agreement.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Web URLs and YouTube transcripts
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_link&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://docs.yourapp.com/api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Amazon S3
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_aws_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;folder/data.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Google Drive
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_google_drive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1abc...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Confluence wiki
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_confluence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://yourco.atlassian.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you@yourco.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;space_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# GitHub repository
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://github.com/you/repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Dynamic injection — any LangChain document loader
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_dynamic_loader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MyCustomLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the Bot and Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Answer only from the provided context. {context}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the termination clauses in section 4?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Synchronous streaming
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the key points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Async streaming — for FastAPI and other async frameworks
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aget_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain section 7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multi-Tenancy: Built for SaaS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every bot created via initialize_bot_id() receives a unique identifier. All associated data — documents, vector embeddings, conversation history, tool registrations, and per-bot configuration — is fully isolated to that ID.&lt;/p&gt;

&lt;p&gt;You can run hundreds of bots on a single LongTrainer instance with no cross-contamination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Customer A — Legal documents, GPT-4o-mini
&lt;/span&gt;&lt;span class="n"&gt;bot_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_a_contracts.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Customer B — Technical docs, Claude, custom embedding
&lt;/span&gt;&lt;span class="n"&gt;bot_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_b_api_docs.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bots persist across server restarts. Restore any previous bot with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Agent Mode and Tool Calling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When retrieval alone is not enough — when your bot needs to act, not just answer — agent mode enables a full LangGraph reasoning loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Tool Loading (Zero Code)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tavily_search_results_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wikipedia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arxiv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PythonREPLTool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yahoo_finance_news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LongTrainer dynamically imports and initializes any string-based tool &lt;/p&gt;

&lt;p&gt;&lt;code&gt;from langchain.agents.load_tools —&lt;/code&gt; no manual wiring required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom Tool Registration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;web_search&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currency_pair&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch the real-time exchange rate for a currency pair like USD/EUR.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_rate_from_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currency_pair&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_exchange_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the current EUR/USD rate and what does the latest Fed statement say about it?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent autonomously decides when to query documents, when to call web search, and when to invoke your custom tool — all within a single turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database Support&lt;/strong&gt;&lt;br&gt;
LongTrainer treats vector store portability as a first-class concern, supporting nine providers out of the box:&lt;/p&gt;

&lt;p&gt;Provider Type Best For FAISS Local / In-memory Development, small scale Pinecone Cloud-native Serverless, large scale Chroma Open-source Self-hosted, fast prototyping Qdrant Open-source High-performance filtering PGVector PostgreSQL extension Existing Postgres infrastructure MongoDB Atlas Cloud Unified database + vector search Milvus Open-source Billion-vector scale Weaviate Open-source Multi-modal, GraphQL Elasticsearch Enterprise Existing ES infrastructure&lt;/p&gt;

&lt;p&gt;Each bot can use a different vector store — a meaningful advantage in multi-tenant architectures where different customers may have different infrastructure requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Provider Support&lt;/strong&gt;&lt;br&gt;
LongTrainer’s Dynamic Model Factory accepts any BaseChatModel implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# OpenAI
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Anthropic Claude
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet-20241022&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Google Gemini
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_google_vertexai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatVertexAI&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatVertexAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# AWS Bedrock
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatBedrock&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatBedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Groq
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatGroq&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatGroq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.1-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Ollama (local / air-gapped inference)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOllama&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOllama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama3.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-bot LLM configuration makes LongTrainer well-suited for architectures where difforent customers or use cases warrant different models — GPT-4o for enterprise users, Ollama for on-premise deployments with strict data residency requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision and Multimodal Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vision_chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_vision_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vision_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What defects are visible in this manufacturing photo?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inspection_001.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inspection_002.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vision_chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vision_chat_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Self-Improving Memory:&lt;/strong&gt; &lt;code&gt;train_chats()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After a bot accumulates conversation history, you can feed that history back into its knowledge base:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train_chats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework extracts high-quality Q&amp;amp;A pairs from past sessions and re-ingests them as documents. Over time, the bot gets better at answering the specific questions your users are actually asking — a continuous improvement loop that raw LangChain pipelines do not provide out of the box.&lt;/p&gt;

&lt;p&gt;Zero-Code CLI and FastAPI Server&lt;br&gt;
LongTrainer 1.2.1 ships with a production-ready CLI and REST API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Chat&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize project&lt;/span&gt;
longtrainer init
&lt;span class="c"&gt;# Create a bot&lt;/span&gt;
longtrainer bot create &lt;span class="nt"&gt;--prompt&lt;/span&gt; &lt;span class="s2"&gt;"You are a helpful customer support agent."&lt;/span&gt;
&lt;span class="c"&gt;# Add a document&lt;/span&gt;
longtrainer add-doc &amp;lt;bot_id&amp;gt; /path/to/faq.pdf
&lt;span class="c"&gt;# Start an interactive chat session&lt;/span&gt;
longtrainer chat &amp;lt;bot_id&amp;gt;
REST API Server
longtrainer serve

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starts a FastAPI server at &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt; with 18 REST endpoints covering full CRUD for bots, document ingestion, chat session management, and streaming. The Swagger UI is auto-generated at &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Endpoint Method Description /health GET Health check /bots POST Create bot /bots/{id}/documents/path POST Ingest file /bots/{id}/chats POST Create chat session /bots/{id}/chats/{chat_id} POST Chat with streaming&lt;/p&gt;

&lt;p&gt;The server is Docker-ready and suitable for production deployment behind a reverse proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Complete API Reference&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Constructor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# Default: ChatOpenAI(model="gpt-4o-2024-08-06")
&lt;/span&gt;    &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Default: OpenAIEmbeddings()
&lt;/span&gt;    &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_token_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ensemble&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Multi-query ensemble retrieval
&lt;/span&gt;    &lt;span class="n"&gt;encrypt_chats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Fernet encryption at rest
&lt;/span&gt;    &lt;span class="n"&gt;encryption_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Auto-generated if None
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production Tuning&lt;/strong&gt;&lt;br&gt;
Multi-Query Ensemble Retrieval Enable with ensemble=True. Generates multiple reformulations of each user query and merges the retrieval results — significantly improves recall for ambiguous or conversational queries at the cost of additional LLM calls per turn.&lt;/p&gt;

&lt;p&gt;Chunk Strategy The default chunk_size=2048 with chunk_overlap=200 works well for general prose documents. For structured content — tables, code, legal clauses — reduce chunk_size and increase chunk_overlap to avoid splitting semantic units across boundaries.&lt;/p&gt;

&lt;p&gt;num_k Tuning Start with num_k=3 for focused Q&amp;amp;A. Increase to num_k=7–10 for synthesis tasks where broader context improves answer quality.&lt;/p&gt;

&lt;p&gt;MongoDB Indexing For deployments with hundreds of bots and thousands of conversations, index your MongoDB collections on bot_id and chat_id fields to maintain consistent query performance at scale.&lt;/p&gt;

&lt;p&gt;Token Budget max_token_limit=32000 controls the conversation context window. For models with 128K+ context windows, this value can be increased substantially. Monitor document sizes in the memory collection as conversations grow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SaaS Multi-Tenant Document Assistant&lt;/strong&gt; Each customer gets an isolated bot seeded with their own uploaded documents. Conversation history persists across sessions. LongTrainer’s bot_id / chat_id isolation model makes this architecture a few lines of code rather than an engineering project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Internal Knowledge Base&lt;/strong&gt; Load Confluence wikis, GitHub repos, internal PDFs, and S3 buckets into a single bot. Enable ensemble=True for better recall on ambiguous queries. Enable encrypt_chats=True for compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Customer Support Agent&lt;/strong&gt; Use agent mode with web search and a CRM lookup tool. The bot retrieves from product documentation, checks live ticket status via tool calls, and returns grounded answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research Assistant&lt;/strong&gt; with Continuous Improvement Feed academic PDFs into a bot. Run train_chats() periodically to re-ingest high-quality Q&amp;amp;A pairs from past sessions. The bot improves incrementally without retraining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Premise Deployment for Data Residency&lt;/strong&gt; Use ChatOllama as the LLM with a local FAISS store. No data leaves the premises. longtrainer serve provides the REST interface for internal applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;PyPI: &lt;code&gt;pip install longtrainer&lt;/code&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;github.com/ENDEVSOLS/Long-Trainer&lt;/a&gt;&lt;br&gt;
Documentation: &lt;a href="https://endevsols.github.io/Long-Trainer" rel="noopener noreferrer"&gt;endevsols.github.io/Long-Trainer&lt;/a&gt;&lt;br&gt;
Open Source Tools: &lt;a href="https://endevsols.com/open-source/longtrainer" rel="noopener noreferrer"&gt;endevsols.com/open-source/longtrainer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If LongTrainer saves you meaningful engineering time, consider starring the repository and sharing it with your team.&lt;/p&gt;

&lt;p&gt;Tags: #Python #MachineLearning #LangChain #RAG #AI #ChatBot #OpenSource #LLM #NLP #GenerativeAI #LangGraph #VectorDatabase #ArtificialIntelligence #MLOps&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Beyond Chatbot Wrappers: Designing ‘Velocity Architecture’ for Production Multi-Agent Systems</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Wed, 06 May 2026 05:34:53 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/beyond-chatbot-wrappers-designing-velocity-architecture-for-production-multi-agent-systems-22dp</link>
      <guid>https://dev.to/muzammil_endevsols/beyond-chatbot-wrappers-designing-velocity-architecture-for-production-multi-agent-systems-22dp</guid>
      <description>&lt;p&gt;The tech landscape is currently flooded with “AI fatigue.” Every day, another startup launches a thin wrapper around a foundational LLM API, calling it a revolutionary product. But as any backend engineer operating in the real world knows: stringing together a few prompts behind a UI doesn’t survive contact with enterprise production.&lt;/p&gt;

&lt;p&gt;Monolithic prompts are brittle. Context windows get polluted. And when the system hallucinates or fails, debugging an opaque API call is a nightmare.&lt;/p&gt;

&lt;p&gt;To build high-ROI applications that actually solve complex problems, we need to stop building wrappers and start designing Velocity Architecture infrastructure optimized for multi-agent orchestration, state persistence, and scalable execution.&lt;/p&gt;

&lt;p&gt;Here is a blueprint for designing backend systems where AI agents do actual work, not just chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Monolithic Prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The typical v1 approach to an AI feature is a single, massive prompt containing instructions, user input, and retrieved context (RAG).&lt;/p&gt;

&lt;p&gt;This fails at scale for three reasons:&lt;/p&gt;

&lt;p&gt;Context Degradation: As you shove more retrieved data into the prompt, the LLM loses focus on the actual instructions (the “lost in the middle” phenomenon).&lt;br&gt;
Zero Fault Tolerance: If the model misunderstands one sub-task, the entire output fails.&lt;br&gt;
High Latency: Processing massive monolithic prompts takes time and burns tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Multi-Agent Orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of one monolithic LLM call doing everything, a multi-agent system breaks down complex workflows into discrete, specialized nodes. Think of it less like a brain, and more like a microservices architecture for AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Supervisor Pattern&lt;/strong&gt;&lt;br&gt;
In a production environment, you need a deterministic routing mechanism. We typically implement a Supervisor Node.&lt;/p&gt;

&lt;p&gt;The Supervisor doesn’t generate the final answer; it evaluates the user’s intent and routes the payload to specialized worker agents (e.g., a “Code Review Agent,” a “Data Extraction Agent,” or a “SQL Generation Agent”).&lt;/p&gt;

&lt;p&gt;By constraining each worker agent to a single, narrow system prompt, accuracy skyrockets, and hallucinations drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Core Infrastructure Stack&lt;/strong&gt;&lt;br&gt;
To build this orchestration layer effectively, your underlying stack matters. Here is a battle-tested architecture pattern for multi-agent MVPs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Asynchronous Engine: FastAPI&lt;/strong&gt;&lt;br&gt;
Multi-agent workflows are inherently asynchronous. Agents need to pause execution to call external APIs, query databases, or wait for another agent’s output. Python’s FastAPI is the ideal orchestration layer here due to its native asyncio support and high throughput. It allows the system to manage multiple concurrent agent graphs without blocking the main event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. State Management &amp;amp; Vector Storage: PostgreSQL + pgvector&lt;/strong&gt;&lt;br&gt;
When agents hand off tasks to one another, they need a shared “memory” or state. Relying entirely on the LLM’s context window for this state is expensive and unreliable.&lt;/p&gt;

&lt;p&gt;Instead of juggling a separate vector database and a relational database, consolidate. Using PostgreSQL with the pgvector extension allows you to store your agent state (JSONB), relational user data, and embedding vectors in a single, ACID-compliant environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Orchestration Framework (e.g., LangGraph)&lt;/strong&gt;&lt;br&gt;
Rather than writing messy while loops to handle agent routing, use a graph-based state machine. Frameworks like LangGraph allow you to define agents as nodes and their interactions as edges. This makes the execution flow highly observable. If an agent loops infinitely, you can catch it at the graph level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Minimal Routing Example&lt;/strong&gt;&lt;br&gt;
Instead of giant code blocks, let’s look at the core routing logic. The secret to multi-agent stability is keeping the routing strict.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A conceptual look at how a Supervisor routes state
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;supervisor_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;routing_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a supervisor. Review the task and route to the correct worker.
    Available workers: [researcher, coder, reviewer]
    If the task is complete, route to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FINISH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The LLM outputs a structured JSON response dictating the next node
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routing_prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route_to&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By forcing the LLM to output a strict schema (using function calling or structured output), the graph framework knows exactly which Python function to trigger next. The LLM handles the logic, while standard Python code handles the execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters for Production&lt;/strong&gt;&lt;br&gt;
Building “Velocity Architecture” means establishing a foundation where new capabilities can be added simply by wiring a new agent into the graph.&lt;/p&gt;

&lt;p&gt;If you want to add a web-scraping feature, you don’t rewrite your massive master prompt. You create a simple Web Scraper Agent, define its input/output schema, and tell the Supervisor it exists.&lt;/p&gt;

&lt;p&gt;This decoupling is what separates hobbyist AI projects from enterprise-grade infrastructure. It allows for modular testing, independent scaling, and most importantly, predictable system behavior.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>python</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>Building Production-Ready RAG is Harder Than You Think (Here's How to Fix It)</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Mon, 04 May 2026 11:56:05 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/building-production-ready-rag-is-harder-than-you-think-heres-how-to-fix-it-47me</link>
      <guid>https://dev.to/muzammil_endevsols/building-production-ready-rag-is-harder-than-you-think-heres-how-to-fix-it-47me</guid>
      <description>&lt;p&gt;Building a RAG chatbot in a tutorial takes a weekend.&lt;br&gt;
Making it production-ready takes months, and most teams don't realize the complexity&lt;br&gt;
until they're already dealing with frustrated users and crashing servers.&lt;/p&gt;

&lt;p&gt;When building for enterprise, you have to optimize for iteration speed and&lt;br&gt;
rock-solid reliability. Here is what real-world production RAG actually requires&lt;br&gt;
that basic tutorials skip over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant isolation:&lt;/strong&gt; Ensuring Client A can never access Client B's vector data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory:&lt;/strong&gt; Session histories that survive server restarts, backed by MongoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses:&lt;/strong&gt; Handling heavy LLM loads without timing out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Knowing exactly &lt;em&gt;why&lt;/em&gt; the AI retrieved a specific chunk or gave a wrong answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination detection:&lt;/strong&gt; Catching fabrications before the end-user sees them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We built &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;LongTrainer&lt;/a&gt; to handle all of&lt;br&gt;
this out of the box. It sits on top of LangChain, so you don't have to wire the&lt;br&gt;
infrastructure together yourself.&lt;/p&gt;

&lt;p&gt;With over 39,000 downloads, it is actively powering deployments from FinTech to Healthcare.&lt;/p&gt;


&lt;h2&gt;
  
  
  Deploying a Multi-Tenant RAG Bot in 5 Lines
&lt;/h2&gt;

&lt;p&gt;Instead of writing custom session management, vector routing, and database wrappers,&lt;br&gt;
here is all you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize with persistent MongoDB memory
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Generate a fully isolated bot instance per client
&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize_bot_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Ingest documents into the bot's secure, isolated vector space
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_document_from_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/your/data.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Spin up the bot — embeddings and indexing handled automatically
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_bot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Create a persistent chat session
&lt;/span&gt;&lt;span class="n"&gt;chat_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Route queries securely — bot_id and chat_id enforce strict isolation
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sources are returned alongside the answer for auditability
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every call is routed through &lt;code&gt;bot_id&lt;/code&gt; and &lt;code&gt;chat_id&lt;/code&gt;. There is no shared state between&lt;br&gt;
clients - the vector index, chat history, and document context are all strictly isolated&lt;br&gt;
per bot instance.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Black Box Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo1jt11gb6bwmcnly8w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo1jt11gb6bwmcnly8w6.png" alt="Bar chart showing LongTrainer v1.3.0 increasing RAG accuracy from roughly 70 percent to a 95 percent accuracy rate, alongside a metric showing improved document retrieval accuracy." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When an AI gives a wrong answer in production, you are usually debugging blind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the vector database retrieve the wrong document chunk?&lt;/li&gt;
&lt;li&gt;Did the LLM hallucinate beyond what the context supported?&lt;/li&gt;
&lt;li&gt;Was the prompt silently truncated due to token limits?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, you cannot answer any of these questions. You are waiting for&lt;br&gt;
a user complaint instead of catching the failure yourself.&lt;/p&gt;

&lt;p&gt;This is the core problem v1.3.0 addresses.&lt;/p&gt;


&lt;h2&gt;
  
  
  What's New in v1.3.0: Native LongTracer Integration
&lt;/h2&gt;

&lt;p&gt;Install with the tracer extras:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer[tracer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it with a single flag at initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;longtrainer.trainer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LongTrainer&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_tracer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Activate full observability
&lt;/span&gt;    &lt;span class="n"&gt;tracer_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Store traces in MongoDB
&lt;/span&gt;    &lt;span class="n"&gt;tracer_verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Enable NLI hallucination detection
&lt;/span&gt;    &lt;span class="n"&gt;tracer_verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Print span logs to console
&lt;/span&gt;    &lt;span class="n"&gt;tracer_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;     &lt;span class="c1"&gt;# Strictness for hallucination flagging (0.0–1.0)
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, two things happen automatically on every query:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Granular Observability
&lt;/h3&gt;

&lt;p&gt;LongTracer captures a hierarchical trace for every interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Every call to get_response() automatically generates a trace:
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the compliance section&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# What gets captured behind the scenes:
# - Retrieval span: which documents were fetched, similarity scores, latency in ms
# - LLM span: exact prompt sent, token count (prompt + completion), generation latency
# - Agent spans (if agent_mode=True): every tool call, input, output, and execution time
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All traces are stored in MongoDB and queryable at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;

&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longtracer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Pull all traces for a specific bot, ordered by timestamp
&lt;/span&gt;&lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs.bot_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved docs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Real-Time Hallucination Detection
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;tracer_verify=True&lt;/code&gt; is set, every response goes through &lt;code&gt;CitationVerifier&lt;/code&gt;&lt;br&gt;
before being returned to the user.&lt;/p&gt;

&lt;p&gt;It works in two stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 - Claim extraction:&lt;/strong&gt;&lt;br&gt;
The AI's response is split into atomic, independently verifiable claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 — NLI cross-referencing:&lt;/strong&gt;&lt;br&gt;
Each claim is checked against the retrieved source documents using a Natural Language&lt;br&gt;
Inference model. A claim fails if the source documents do not logically entail it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query hallucination records for a specific bot
&lt;/span&gt;&lt;span class="n"&gt;hallucinations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs.bot_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bot-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs.is_hallucinated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hallucinations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hallucinated response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed claims: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;failed_claims&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Source docs used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrieved_docs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are no longer waiting for a user to report an error. You have a systematic,&lt;br&gt;
queryable record of every point where the AI broke from its source material.&lt;/p&gt;
&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;If you want span and latency logging without the overhead of NLI evaluation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LongTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;mongo_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://localhost:27017/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enable_tracer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tracer_verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# Observability on, hallucination detection off
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;longtrainer[tracer]&lt;/code&gt; is not installed, LongTrainer bypasses the tracer&lt;br&gt;
entirely without raising an exception — no breaking changes to existing deployments.&lt;/p&gt;


&lt;h2&gt;
  
  
  Also in v1.3.0: Lazy Loading at Scale
&lt;/h2&gt;

&lt;p&gt;Previous versions eagerly loaded all chat histories into RAM on server startup.&lt;br&gt;
At 100,000+ sessions, this caused startup times measured in minutes and significant&lt;br&gt;
memory pressure.&lt;/p&gt;

&lt;p&gt;v1.3.0 flips this entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before v1.3.0: all sessions loaded at startup → memory spike
# After v1.3.0: zero sessions loaded at startup
&lt;/span&gt;
&lt;span class="c1"&gt;# When a user sends a message:
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# LongTrainer fetches only *this* conversation thread from MongoDB on demand
# All other sessions remain unloaded until requested
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production environments with large user bases, startup time drops from&lt;br&gt;
minutes to milliseconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Standard install&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer

&lt;span class="c"&gt;# With observability and hallucination detection&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;longtrainer[tracer]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported LLM providers: OpenAI, Anthropic, Gemini, AWS Bedrock, HuggingFace,&lt;br&gt;
Groq, Ollama, and any LangChain-compatible LLM.&lt;/p&gt;

&lt;p&gt;Supported vector stores: FAISS, Pinecone, Qdrant, PGVector, Chroma.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ENDEVSOLS/Long-Trainer" rel="noopener noreferrer"&gt;github.com/ENDEVSOLS/Long-Trainer&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://endevsols.github.io/Long-Trainer" rel="noopener noreferrer"&gt;endevsols.github.io/Long-Trainer&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/longtrainer" rel="noopener noreferrer"&gt;pypi.org/project/longtrainer&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;For those of you already running RAG in production: what is the biggest&lt;br&gt;
infrastructure bottleneck you are currently hitting?&lt;/p&gt;

</description>
      <category>python</category>
      <category>langchain</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
    <item>
      <title>How We Automated Hallucination Detection in Enterprise RAG Pipelines</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Wed, 29 Apr 2026 05:34:16 +0000</pubDate>
      <link>https://dev.to/endevsols/how-we-automated-hallucination-detection-in-enterprise-rag-pipelines-42ca</link>
      <guid>https://dev.to/endevsols/how-we-automated-hallucination-detection-in-enterprise-rag-pipelines-42ca</guid>
      <description>&lt;p&gt;Your RAG isn't broken. It's just lying quietly.&lt;/p&gt;

&lt;p&gt;Retrieval works. The LLM sounds confident. Your users get an answer.&lt;/p&gt;

&lt;p&gt;But somewhere in that response, a claim contradicts the source document it was supposed to be grounded in. No error thrown. No flag raised. Just a confident, wrong answer, delivered at scale.&lt;/p&gt;

&lt;p&gt;This is the hallucination problem that doesn't get talked about enough. Not the obvious failures. The subtle ones.&lt;/p&gt;

&lt;p&gt;We've seen it across enterprise RAG deployments  in legal tools, internal knowledge bases, customer-facing assistants. The retrieval pipeline performs. The LLM performs. And still, trust erodes the moment a user catches one bad answer.&lt;/p&gt;

&lt;p&gt;We're open sourcing &lt;a href="https://endevsols.com/open-source/longtracer" rel="noopener noreferrer"&gt;LongTracer&lt;/a&gt;, our answer to this problem.&lt;/p&gt;

&lt;p&gt;LongTracer sits at the output layer of any RAG pipeline and verifies every claim in an LLM response against your source documents. It uses a hybrid STS + NLI approach: first finding the most semantically relevant source sentence per claim, then classifying whether that source actually supports, contradicts, or is neutral to what the LLM said.&lt;/p&gt;

&lt;p&gt;The result: a trust score, a verdict, and a clear list of exactly which claims hallucinated and why.&lt;/p&gt;

&lt;p&gt;No LLM calls. No vector store required. No new infrastructure. It works with LangChain, LlamaIndex, Haystack, LangGraph, or any pipeline that gives you a response and source chunks.&lt;/p&gt;

&lt;p&gt;MIT licensed. Built from real implementation experience.&lt;/p&gt;

&lt;p&gt;If you're running RAG in production, your users deserve answers you can actually stand behind.&lt;/p&gt;

&lt;p&gt;Try:&lt;br&gt;
&lt;code&gt;pip install longtracer&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>RAG vs. Fine-Tuning vs. Prompting: 2026 Strategic Guide</title>
      <dc:creator>Muhammad Muzammil</dc:creator>
      <pubDate>Sun, 26 Apr 2026 11:40:41 +0000</pubDate>
      <link>https://dev.to/muzammil_endevsols/rag-vs-fine-tuning-vs-prompting-2026-strategic-guide-169l</link>
      <guid>https://dev.to/muzammil_endevsols/rag-vs-fine-tuning-vs-prompting-2026-strategic-guide-169l</guid>
      <description>&lt;p&gt;As we navigate the landscape of 2026, the initial era of generative AI experimentation has yielded to a period of industrial-grade Enterprise LLM Implementation. For technical founders and CTOs, the fundamental challenge is no longer just selecting a foundational model, but architecting a system that safely bridges the 'Enterprise Data Gap' - the distance between a model's public training weights and your organization's proprietary intelligence.&lt;/p&gt;

&lt;p&gt;In our internal analysis of scaling enterprise AI systems, we found that optimizing data retrieval pipelines can reduce hallucination rates by up to 85% compared to baseline models. The decision between Retrieval-Augmented Generation (RAG), Fine-Tuning, and Prompt Engineering is no longer a theoretical debate; it is a critical infrastructure choice that dictates your compute costs, latency, and system scalability.&lt;br&gt;
This guide provides a practitioner's framework for architecting Large Language Models (LLMs) for maximum ROI, security, and production-grade accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Reality: Moving Beyond Base Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Base models are essentially 'polymaths with amnesia.' They possess vast general knowledge and reasoning capabilities but lack access to your internal databases, real-time analytics, and secure corporate data.&lt;br&gt;
To transform these models into production-ready assets, engineering teams must leverage one of three primary optimization levers. A common mistake is assuming that adjusting model weights (Fine-Tuning) is the default solution for poor performance. In reality, the most resilient architectures today are hybrid systems that utilize multi-agent workflows for routing, RAG for factual grounding, and fine-tuning exclusively for deep stylistic or logical specialization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Advanced Prompting &amp;amp; Multi-Agent Routing (The Agility Play)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt engineering has evolved far beyond basic text instructions. In 2026, it involves programmatic prompt construction and multi-agent orchestration frameworks like LangGraph. Instead of relying on a single zero-shot prompt, we design stateful, multi-actor systems where agents dynamically construct prompts based on the user's intent before routing the query to the appropriate LLM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Near-zero infrastructure overhead; instantaneous iteration; highly effective when combined with stateful agentic workflows.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Strictly bounded by the model's context window limits; highly susceptible to prompt injection attacks; prone to 'mode collapse' when instructions become too complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
Best utilized as the routing layer of an AI application. For example, using a lightweight model to classify an incoming query and dynamically inject the correct system prompt before passing it to a heavier model for execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B: Retrieval-Augmented Generation (The Contextual Powerhouse)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG is the industry standard for bridging LLMs with proprietary data. Instead of baking knowledge into the model's weights, RAG relies on a high-speed semantic search pipeline.&lt;br&gt;
When dealing with large-scale vectorization projects - often scaling up to 300-400GB of enterprise data, a naive RAG approach fails. Production RAG requires a robust pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingestion &amp;amp; Chunking: Parsing raw data and applying semantic chunking strategies to preserve context.&lt;/li&gt;
&lt;li&gt;Embedding: Passing chunks through an embedding model to create dense vector representations.&lt;/li&gt;
&lt;li&gt;Vector Store: Storing these embeddings in a high-performance vector database.&lt;/li&gt;
&lt;li&gt;Retrieval &amp;amp; Generation: Intercepting a user query, converting it to a vector, retrieving the Top-K nearest neighbors, and injecting that context into the LLM's prompt via a scalable backend (typically built on FastAPI).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Absolute data freshness; highly auditable (you can trace exact source documents); inherently secure through document-level access controls.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Introduces latency during the retrieval step; requires maintaining separate infrastructure (Vector DBs, embedding pipelines).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
RAG is the definitive architecture for systems requiring factual accuracy and real-time updates, such as medical clinical assistants parsing dynamic guidelines or financial chatbots querying live internal knowledge bases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option C: Fine-Tuning (The Deep Expertise Specialization)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural Overview&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fine-tuning permanently alters the internal parameters (weights) of a pre-trained model. Rather than providing context at runtime, you are retraining the model on a highly curated, domain-specific dataset. Modern Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA and QLoRA, allow teams to freeze the base model and only update a small subset of weights, drastically reducing compute requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Engineering Trade-offs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Unmatched performance in niche logical tasks; highly effective at forcing models to output specific structural formats (like proprietary code or strict JSON); reduces runtime latency compared to heavy RAG prompts.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; High risk of 'Knowledge Obsolescence' (data is frozen at training time); expensive data curation process; difficult to enforce user-level data security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Use Case&lt;/strong&gt;&lt;br&gt;
Reserved for tasks where reasoning style, format, and domain jargon outweigh the need for real-time data. Ideal for proprietary code generation, strict regulatory compliance parsing, or altering the inherent 'voice' of an open-source model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG vs Fine-Tuning vs Prompting: The Infrastructure Matrix&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When architecting a solution, evaluate these critical dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Freshness: RAG provides real-time access. Fine-tuning is static.&lt;/li&gt;
&lt;li&gt;Hallucination Mitigation: RAG grounds outputs in provided facts. Fine-tuning can actually increase confident hallucinations if the training data is flawed.&lt;/li&gt;
&lt;li&gt;Security &amp;amp; Access Control: RAG allows for Role-Based Access Control (RBAC) at the database level. Fine-tuning bakes data into the weights, making it accessible to anyone who queries the model.&lt;/li&gt;
&lt;li&gt;Infrastructure Load: RAG shifts the load to memory and database I/O. Fine-tuning shifts the load to heavy GPU compute.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Strategic Recommendation for AI Architecture in 2026&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For engineering leaders, the optimal architecture is a RAG-First Strategy wrapped in Agentic Routing.&lt;br&gt;
By building a robust RAG architecture, you create a system that is grounded, auditable, and secure. Utilize frameworks like LangGraph to orchestrate prompt-based agents that handle logic and routing, and reserve fine-tuning strictly as a surgical tool for edge cases where the LLM struggles to grasp domain-specific formatting.&lt;/p&gt;

&lt;p&gt;Choosing the right path for LLM optimization is the difference between an AI product that scales efficiently and a fragile system that becomes a technical liability.&lt;/p&gt;

&lt;p&gt;At EnDevSols, we specialize in architecting production-grade multi-agent workflows and high-capacity RAG pipelines for enterprise clients. If you are a CTO or technical founder looking to transition from AI prototypes to scalable infrastructure, explore our &lt;a href="https://endevsols.com/services/generative-ai" rel="noopener noreferrer"&gt;Generative AI Development Services&lt;/a&gt; to see how we build resilient AI systems.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
