DEV Community

PAWAN YADAV  (AI Engineer)

Building a Self-Hosted RAG Chatbot with a Dual-Agent LLM Pipeline (and Automatic LLM Failover)

Over the past few weeks I built a Retrieval-Augmented Generation (RAG) chatbot from the ground up: one that answers strictly from a knowledge base I control, supports role-based access for regular users vs. admins, and is instructed never to make things up. In this post I want to walk through the architecture, the design decisions, and one piece that took real engineering effort: automatically switching to a backup LLM when the primary one is busy or rate-limited, so a query doesn’t fail just because one provider is having a bad moment.

This isn’t a toy demo — it’s a full pipeline with authentication, a vector database, a two-stage reasoning process, and an admin console for managing the knowledge base. Here’s how it all fits together.

The Problem I Was Solving
Most “just call an LLM API” chatbots have two issues:

They hallucinate. Ask a generic LLM a domain-specific question and it will confidently make something up if it doesn’t actually know the answer.
They have a single point of failure. If your one LLM provider is rate-limited, down, or just slow at that moment, your whole app stalls.

I wanted a system where:

Every answer is grounded in real, embedded documents — no hallucination.
Two different roles (regular users and admins) get different capabilities.
If the LLM serving a request is unavailable, the system transparently retries on a different model instead of erroring out.

High-Level Architecture
The system has three user-facing surfaces and one processing core:

Signup / Identification Page — the user enters an email; the backend checks it against a list of approved admin addresses and routes accordingly.
User View — query-only. Regular users can ask questions but cannot touch the knowledge base.
Admin View — everything a user can do, plus the ability to upload documents/URLs, view what’s been embedded, and trigger re-indexing.

Here’s the high-level flow:

flowchart TD
A[Signup Page - Enter Email] --> B{Role Check}
B -->|Regular Email| C[User View - Query Only]
B -->|Admin Email| D[Admin View - Manage + Query]
C --> E[Query Pipeline]
D --> E
E --> F[Agent 1: Tool & Argument Selection]
F --> G[Execute Retrieval Tool]
G --> H[Raw Retrieved Chunks]
H --> I[Agent 2: Refinement]
I --> J[Final Answer Returned]

If your blogging platform doesn’t render Mermaid diagrams (Medium, for instance, won’t render this natively), here’s the same flow as plain text:

                Signup Page
                     |
                     v
        Role Check (User or Admin?)
            |                 |
            v                 v
        User View         Admin View
       (Query Only)    (Manage + Query)
            |                 |
            +--------+--------+
                     |
                     v
              Query Pipeline
                     |
                     v
     Agent 1: Tool & Argument Selection
                     |
                     v
          Execute Retrieval Tool
                     |
                     v
           Raw Retrieved Chunks
                     |
                     v
            Agent 2: Refinement
                     |
                     v
          Final Answer Returned

Why Two Agents Instead of One?
A single LLM call trying to both decide what to retrieve and how to phrase the final answer tends to produce messier output. So I split the reasoning into two sequential agents, orchestrated as a small graph of steps rather than one big prompt:

Agent 1: Tool Decision. Takes the raw user query and decides which retrieval tool to call and with what arguments. It returns a small, strict structure like:

[{"name": "retrieval_tool", "arguments": {"query": "your parsed query here"}}]
A lightweight Python handler parses this output, extracts the tool name and arguments, and manually executes the retrieval step — no need to trust the LLM to “just run the function,” which keeps things deterministic and debuggable.
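
Here’s a minimal sketch of what such a handler can look like. It is not the project’s actual code: the TOOLS registry, the run_tool_calls name, and the stubbed retrieval_tool are all illustrative assumptions. The idea is to parse the JSON, reject anything that isn’t a known tool, and call the function yourself.

```python
import json


def retrieval_tool(query: str) -> list:
    # Stand-in for the real similarity search; returns a fake chunk.
    return [f"chunk matching '{query}'"]


# Only tools listed here can ever be executed (hypothetical registry).
TOOLS = {"retrieval_tool": retrieval_tool}


def run_tool_calls(raw_output: str) -> list:
    """Parse Agent 1's JSON output and execute each requested tool."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Agent 1 returned invalid JSON: {exc}") from exc
    if not isinstance(calls, list):
        raise ValueError("Expected a JSON list of tool calls")

    results = []
    for call in calls:
        if not isinstance(call, dict):
            raise ValueError("Each tool call must be a JSON object")
        name = call.get("name")
        arguments = call.get("arguments", {})
        if name not in TOOLS:
            raise ValueError(f"Unknown tool requested: {name!r}")
        if not isinstance(arguments, dict):
            raise ValueError("Tool arguments must be a JSON object")
        results.append(TOOLS[name](**arguments))
    return results
```

Because the handler looks tools up in an explicit registry, a malformed or unexpected tool name fails loudly instead of silently running something you didn’t intend.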

Agent 2: Refinement. Takes whatever raw chunks came back from retrieval and turns them into a clear, well-formatted, professional answer. This is also the layer that keeps responses grounded: it’s instructed to work only with what was retrieved, not to add outside knowledge.
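
To make the “work only with what was retrieved” instruction concrete, here is one way the refinement prompt could be assembled. Treat it as a sketch under assumptions: the function name, the prompt wording, and the chunk shape (a dict with "text" and "source" keys) are mine, not the exact prompt used in the project.

```python
def build_refinement_prompt(question: str, chunks: list) -> str:
    """Assemble Agent 2's prompt from the retrieved chunks only."""
    if not chunks:
        context = "(no relevant passages were retrieved)"
    else:
        # Number each passage and keep its source so answers stay traceable.
        context = "\n\n".join(
            f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
            for i, chunk in enumerate(chunks, start=1)
        )
    return (
        "Answer the question using ONLY the passages below. "
        "If they do not contain the answer, say you don't know.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}"
    )
```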

sequenceDiagram

participant U as User/Admin
participant A1 as Agent 1 (Tool Decision)
participant T as Retrieval Tool
participant V as Vector Store
participant A2 as Agent 2 (Refinement)

U->>A1: Submits query
A1->>A1: Decide tool + arguments
A1->>T: Call retrieval tool
T->>V: Similarity search
V-->>T: Top-K relevant chunks
T-->>A2: Raw chunks + sources
A2->>A2: Refine into polished answer
A2-->>U: Final answer

The Retrieval Layer
Documents and URLs go through a standard but carefully tuned pipeline:

Loading — PDFs, Word docs, plain text, and Markdown files are parsed with format-specific loaders; web pages are scraped and cleaned.
Chunking — content is split into ~400-character chunks using a recursive character splitter, with each chunk tagged with its source (filename or URL) for traceability.
Embedding — each chunk is converted into a vector using an embedding model.
Storage — vectors go into a local vector index that supports fast similarity search and incremental updates (you don’t need to rebuild the whole index every time you add a new file).
Retrieval — when a query comes in, the index returns the top-K most similar chunks, filtered by a minimum similarity score threshold so irrelevant matches get dropped before they ever reach the LLM.
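
The chunking step can be sketched in a few lines of plain Python. Note that this is a simplified stand-in, not a real recursive character splitter: it just cuts at the last paragraph, line, sentence, or word boundary that fits in the window. The chunk_text name and the dict shape are assumptions for illustration.

```python
def chunk_text(text: str, source: str, chunk_size: int = 400) -> list:
    """Split text into chunks of at most chunk_size characters, each tagged with its source."""
    separators = ["\n\n", "\n", ". ", " "]  # preferred break points, best first
    chunks = []
    remaining = text.strip()
    while len(remaining) > chunk_size:
        window = remaining[:chunk_size]
        cut = -1
        for sep in separators:
            pos = window.rfind(sep)
            if pos > 0:
                cut = pos + len(sep)
                break
        if cut <= 0:
            cut = chunk_size  # no separator found: fall back to a hard cut
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    # Tag every chunk with where it came from, for traceability.
    return [{"text": chunk, "source": source} for chunk in chunks if chunk]
```

Each returned dict carries its source, so whatever gets embedded and later retrieved can always be traced back to a filename or URL.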

Two tunable parameters matter a lot here:

score_threshold — the minimum similarity score a chunk needs to be considered relevant. Raise it for stricter, more precise retrieval; lower it if the bot is being too conservative and saying "I don't know" too often.
k — how many top chunks get passed into the refinement agent. Too low and you risk missing context; too high and you risk diluting the answer with noise (and burning more tokens).
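
To see how the two parameters interact, here is a small brute-force sketch of thresholded top-K retrieval using cosine similarity. A real vector index does this far more efficiently; the function names, the in-memory index format, and the default values of k and score_threshold are illustrative assumptions, not the project’s settings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=4, score_threshold=0.5):
    """Return up to k (score, chunk) pairs whose similarity meets the threshold."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in index]
    # Drop weak matches first, so they never reach the refinement agent.
    relevant = [(score, chunk) for score, chunk in scored if score >= score_threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return relevant[:k]
```

Raising score_threshold shrinks the candidate pool before k is applied, which is why a strict threshold can return fewer than k chunks, or none at all.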

Admin Controls
Admins get a dedicated panel to manage the knowledge base directly:

Upload new files (PDF, Word, text, Markdown) for embedding.
Submit URLs to scrape and embed.
View every currently embedded source and a running count.
Run test queries against the live index.
Trigger a manual reload of the retriever so newly embedded content becomes searchable immediately, without restarting the whole service.
Regular users never see any of this — they only get a query box. Role detection is based on a simple, server-side check against a list of approved admin emails at signup time, and every admin-only route re-checks that role before doing anything destructive or data-modifying.
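
A rough sketch of that check, assuming the approved list is a simple in-memory set: the ADMIN_EMAILS value, the function names, and the use of PermissionError are all hypothetical, and a real app would tie require_admin into its session handling.

```python
# Hypothetical allowlist; in practice this would come from config or the database.
ADMIN_EMAILS = {"admin@example.com"}


def resolve_role(email: str, admin_emails=ADMIN_EMAILS) -> str:
    """Decide the role on the server from the normalized email address."""
    normalized = email.strip().lower()
    return "admin" if normalized in admin_emails else "user"


def require_admin(session_email: str) -> None:
    """Guard for admin-only routes: re-check the role on every request."""
    if resolve_role(session_email) != "admin":
        raise PermissionError("admin role required")
```

Calling require_admin at the top of every upload, scrape, or reload handler means the role is never trusted from the client side.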

Now, the Interesting Part: Automatic LLM Failover
Here’s the piece I want to focus on, since it’s the part most tutorials skip. If you’ve ever called a single LLM provider in production, you’ve hit this: rate limits, timeouts, momentary outages, or a model that’s just slow to respond under load. If your whole app depends on one model, one bad moment takes everything down with it.

The fix is a failover-aware LLM router sitting in front of both agents. Instead of hardcoding “always call Model A,” every LLM call goes through a small dispatcher that:

Tries the primary model first.
If that call fails, times out, or comes back with a rate-limit / “server busy” type error, it automatically retries the same request on the next model in a prioritized list — no user-facing error, no manual restart.
Logs which model actually served the request (useful for debugging and for tracking cost/usage per provider).
Optionally applies a short backoff before retrying the original model on the next query, so it isn’t hammered immediately after a failure.

How to actually manage this on the server
Practically, here’s the setup that works well:

  1. Define a prioritized list of models, not a single model.

LLM_PRIORITY = [
    {"name": "primary_llm", "provider": "provider_a", "model": "model-a-large"},
    {"name": "secondary_llm", "provider": "provider_b", "model": "model-b-large"},
    {"name": "fallback_llm", "provider": "provider_c", "model": "model-c-small"},
]
Order them by quality/cost first, reliability second. The first entry is your “ideal” answer quality; the rest exist purely so a request never just dies.

  2. Wrap every call in a router function that catches specific failure types.

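class RateLimitError(Exception):
    """Placeholder: raise this when the provider reports a rate limit."""

class ServerBusyError(Exception):
    """Placeholder: raise this when the provider reports it is overloaded."""

class AllModelsUnavailableError(Exception):
    """Raised when every model in the priority list has failed."""

# call_provider() and log_model_usage() stand in for your own provider
# client and logging helper; they are not defined here.
def call_llm_with_failover(prompt, models=LLM_PRIORITY, max_retries_per_model=1):
    last_error = None
    for model_config in models:
        # One initial attempt plus up to max_retries_per_model retries
        for attempt in range(max_retries_per_model + 1):
            try:
                response = call_provider(
                    provider=model_config["provider"],
                    model=model_config["model"],
                    prompt=prompt,
                    timeout=15,  # seconds; don't let one model hang the whole pipeline
                )
                # Success: record which model actually answered
                log_model_usage(model_config["name"])
                return response
            except RateLimitError:
                last_error = "rate_limited"
                break  # don't retry a rate-limited model, move to the next one
            except TimeoutError:
                last_error = "timeout"
                continue  # retry the same model while attempts remain
            except ServerBusyError:
                last_error = "busy"
                break
    raise AllModelsUnavailableError(f"All models exhausted. Last error: {last_error}")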

  3. Use this wrapper for both agents, not just one. Agent 1 (tool decision) and Agent 2 (refinement) should each go through the same failover router independently, because the model serving Agent 1 can be busy while the model serving Agent 2 is fine, or vice versa.

  4. Add a circuit breaker so you don’t keep hammering a model that’s clearly down. A simple in-memory counter works for small-to-medium traffic:

from collections import defaultdict
from time import time

failure_counts = defaultdict(list)  # model name -> timestamps of recent failures
COOLDOWN_SECONDS = 60
FAILURE_THRESHOLD = 3

def is_model_in_cooldown(model_name):
    now = time()
    # Keep only failures that happened inside the cooldown window
    recent_failures = [t for t in failure_counts[model_name] if now - t < COOLDOWN_SECONDS]
    failure_counts[model_name] = recent_failures
    return len(recent_failures) >= FAILURE_THRESHOLD

def record_failure(model_name):
    failure_counts[model_name].append(time())
Check is_model_in_cooldown() before attempting a model in the priority list — if it's in cooldown, skip straight to the next one instead of wasting a request and a timeout window on a model you already know is struggling.
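
Here’s a self-contained sketch of how the cooldown check and the failover loop can fit together. It is one possible arrangement, not the project’s code: the CircuitBreaker class, the call_with_breaker function, and the injected call_model callable are assumptions, and it uses time.monotonic with an injectable clock so the behaviour can be tested without waiting.

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Tracks recent failures per model and reports when one should be skipped."""

    def __init__(self, threshold=3, cooldown_seconds=60, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = defaultdict(list)

    def in_cooldown(self, name):
        now = self.clock()
        recent = [t for t in self.failures[name] if now - t < self.cooldown_seconds]
        self.failures[name] = recent
        return len(recent) >= self.threshold

    def record_failure(self, name):
        self.failures[name].append(self.clock())


def call_with_breaker(prompt, model_names, call_model, breaker):
    """Try each model in order, skipping any that are cooling down."""
    last_error = None
    for name in model_names:
        if breaker.in_cooldown(name):
            continue  # known to be struggling: don't spend a timeout on it
        try:
            return name, call_model(name, prompt)
        except Exception as exc:  # sketch only: catch specific error types in real code
            breaker.record_failure(name)
            last_error = exc
    raise RuntimeError(f"No model available. Last error: {last_error!r}")
```

Returning the model name alongside the response makes it easy to log which model actually served the request.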

  5. Make the model list configurable, not hardcoded. Keep it in an environment variable or a small config file so you can reorder priority, swap providers, or add a new model without touching code:

.env

LLM_PRIORITY_ORDER=primary_llm,secondary_llm,fallback_llm
PRIMARY_LLM_PROVIDER=provider_a
SECONDARY_LLM_PROVIDER=provider_b
FALLBACK_LLM_PROVIDER=provider_c
REQUEST_TIMEOUT_SECONDS=15
COOLDOWN_SECONDS=60
This is the difference between “I have to redeploy to change providers” and “I edit one line in .env and restart the service.”
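
Reading that configuration back could look roughly like this. The sketch assumes the variables are already present in the process environment (for example, loaded by your process manager or a dotenv loader); the load_llm_config name and the returned dict shape are illustrative, and only the variable names shown above are used.

```python
import os


def load_llm_config(env=os.environ):
    """Build the prioritized model list and timing settings from environment variables."""
    order = [n.strip() for n in env.get("LLM_PRIORITY_ORDER", "").split(",") if n.strip()]
    if not order:
        raise ValueError("LLM_PRIORITY_ORDER must list at least one model name")

    models = []
    for name in order:
        key = f"{name.upper()}_PROVIDER"  # primary_llm -> PRIMARY_LLM_PROVIDER
        provider = env.get(key)
        if provider is None:
            raise ValueError(f"Missing {key} for model {name!r}")
        models.append({"name": name, "provider": provider})

    return {
        "models": models,
        "timeout": float(env.get("REQUEST_TIMEOUT_SECONDS", "15")),
        "cooldown": float(env.get("COOLDOWN_SECONDS", "60")),
    }
```

Failing fast on a missing provider variable at startup is cheaper than discovering the gap on the first failover.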

Why this matters in production
With this in place, a query no longer hangs or errors out just because one provider happened to be under load at that exact second. The user experience stays consistent: users get an answer, possibly from a slightly different underlying model, and they only see an error if every model in the list is unavailable. And because every fallback event is logged, you get visibility into how often your primary model is actually struggling, which is useful data for deciding whether to renegotiate rate limits, add a fourth fallback, or just accept the current setup.

flowchart TD
Q[Incoming Query] --> P[Try Primary LLM]
P -->|Success| R[Return Answer]
P -->|Rate Limited / Busy / Timeout| S[Try Secondary LLM]
S -->|Success| R
S -->|Failure| T[Try Fallback LLM]
T -->|Success| R
T -->|Failure| U[Raise Controlled Error]

Technology Choices
For anyone curious about the stack, at a glance:

Backend: Python web framework with async support, SQL-backed database for user/role data, secure session handling, password hashing for any stored credentials.
Agent orchestration: a graph-based orchestration library to define the two-agent sequence cleanly, with a tool/document-processing framework underneath.
Embeddings & retrieval: open embedding models for vectorization, a local vector index for similarity search, recursive text chunking, and a set of format-specific document loaders (PDF, Word, text, Markdown, web pages).
Frontend: plain HTML/CSS/JS with a utility-first CSS framework, server-rendered templates.
Backups: periodic backup of the embedding store to cloud object storage.
Concurrency: thread pools and async I/O so embedding jobs and queries don’t block each other.

What’s Next
A few things on the roadmap:

Letting admins delete individual files or specific vectors from the index, instead of only adding to it.
Tightening retrieval accuracy by experimenting with chunk size and embedding model choice.
Possibly collapsing the two-agent pipeline into a single, faster agent once latency becomes a bigger priority than the clean separation of concerns.
Expanding the failover router to support weighted load balancing across models even when all of them are healthy, not just failover when one is down.

Closing Thoughts
The biggest lesson from this project: a RAG chatbot is “easy” to build to a demo-quality bar, and genuinely hard to build to a doesn’t-fall-over-in-production bar. The retrieval and refinement pipeline gets you accurate, grounded answers. The failover router is what keeps the lights on when your LLM provider has a bad five minutes. Both matter — and most tutorials only show you the first one.

If you’re building something similar, start with the dual-agent retrieval pipeline to get accuracy right, then treat your LLM calls as a resource pool rather than a single dependency from day one. Retrofitting failover later is a lot more painful than designing for it up front.
