How I Built a Multi-LLM AI Agent System for Hospital Management
When your LLM provider goes down, the hospital can't wait. Here's how I designed a system that never stops.
Every AI demo you see on Twitter works with one LLM provider. But what happens when that provider hits rate limits at 8 AM — exactly when your hospital staff needs their morning revenue report?
I learned this the hard way. After my AI agent failed three mornings in a row because OpenRouter was rate-limited, I rebuilt the entire system from scratch. Today, HISDashboard runs 10 specialized AI agents across 4 LLM providers with automatic fallback — and it hasn't missed a morning report in months.
Here's the architecture that made it possible.
1. The Problem: One Agent Can't Do Everything
My first attempt was simple: one ReAct agent, one LLM, all tools loaded. It failed spectacularly:
- Context window overflow — 40+ tool descriptions consumed most of the tokens
- Wrong tool selection — asked about HR staffing, agent queried financial data
- Single point of failure — OpenRouter down = everything down
I needed a fundamentally different architecture.
2. The Solution: Router → Specialist → Reflection
2.1 Router Agent — Intent Classification
The Router doesn't use regex or keyword matching. It's a lightweight LLM call with structured output via Pydantic:
from pydantic import BaseModel, Field

class IntentResult(BaseModel):
    """Structured intent classification result."""
    intent: str = Field(
        description="One of: clinical, booking, analysis, hr_dispatch..."
    )
    confidence: float = Field(
        default=0.8,
        description="Confidence score 0.0 to 1.0",
    )
When confidence drops below 0.4, instead of guessing wrong, it asks the user to clarify:
LOW_CONFIDENCE = 0.4  # The threshold mentioned above

if confidence < LOW_CONFIDENCE:
    return "clarify", confidence  # Better to ask than to guess wrong
And when the LLM itself fails? Three layers of fallback ensure the router never crashes:
Layer 1: LLM structured output (Pydantic schema)
↓ (parse error)
Layer 2: Text response → keyword extraction
↓ (LLM down)
Layer 3: Regex keyword matching (always works, no API needed)
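In code, the chain looks roughly like this. This is a condensed sketch, not the production implementation: call_llm is a hypothetical provider wrapper, and the keyword map is a tiny illustrative subset.

import json
import re

# Illustrative keyword map; the real table covers far more intents
KEYWORDS = {"revenue": "financial", "appointment": "booking", "staff": "hr_dispatch"}

def route(query: str) -> str:
    # Layer 1: LLM structured output, validated against the Pydantic schema
    try:
        raw = call_llm(query)  # hypothetical wrapper returning JSON text
        return IntentResult(**json.loads(raw)).intent
    except Exception:
        pass  # parse error or provider failure: fall through
    # Layer 2: plain-text LLM answer, mined for a known intent keyword
    try:
        text = call_llm(query).lower()
        for word, intent in KEYWORDS.items():
            if word in text:
                return intent
    except Exception:
        pass  # LLM down: fall through
    # Layer 3: regex keyword matching on the query itself (no API call needed)
    for word, intent in KEYWORDS.items():
        if re.search(rf"\b{word}\b", query.lower()):
            return intent
    return "clarify"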
Key insight: Your router is the single most critical component. If it fails, everything fails. Over-engineer it.
2.2 Specialist Agents — Right Tool, Right Model
Each agent gets only the tools it needs. No context window waste:
AGENT_CONFIGS = {
    # Financial agent: complex model, 20+ tools, 10 reasoning iterations
    "financial": {
        "is_complex": True,           # GPT-4 class model
        "toolkit": financial_tools,   # 20+ tools (revenue, insurance, forecast...)
        "max_iters": 10,              # More reasoning steps for complex analysis
        "parallel_tools": True,       # Call multiple APIs simultaneously
    },
    # Booking agent: simple model, 2 tools, 3 iterations
    "booking": {
        "is_complex": False,          # GPT-3.5 class model (cheaper, faster)
        "toolkit": booking_tools,     # Just create_appointment + cancel_appointment
        "max_iters": 3,               # Simple task, fewer steps needed
    },
}
The tradeoff: complex agents use expensive models with more iterations. Simple agents use cheap, fast models. This saves ~60% on API costs while maintaining quality where it matters.
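Wiring a routed intent to its specialist is then a one-line lookup. A minimal sketch, assuming the AGENT_CONFIGS dict above and a hypothetical get_simple_model() counterpart to get_complex_model(); the ReActAgent kwargs mirror the config keys and are assumptions:

def build_agent(intent: str):
    """Instantiate the specialist agent for a routed intent."""
    cfg = AGENT_CONFIGS[intent]
    # Complex intents pay for the big model; everything else runs cheap and fast
    model = get_complex_model() if cfg["is_complex"] else get_simple_model()
    return ReActAgent(
        model=model,
        toolkit=cfg["toolkit"],
        max_iters=cfg["max_iters"],  # assumed ReActAgent kwarg, mirroring the config
    )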
2.3 Reflection Layer — Self-Correction Before Responding
Before any response reaches the user, the Reflection layer runs 5 quality checks:
def evaluate_response(query, response, intent, tool_results):
    issues = []
    issues.extend(_check_empty_or_short(response))       # Agent gave up?
    issues.extend(_check_tool_failures(tool_results))    # All APIs failed?
    issues.extend(_check_ungrounded_medical(response, intent, tool_results))
    issues.extend(_check_repetition(response))           # LLM stuck in loop?
    issues.extend(_check_generic_nonanswer(response, intent))
    return issues
| Check | What it catches | Why it matters |
|---|---|---|
| Empty/Short | Agent returned minimal text | Users expect detailed analysis |
| Tool Failures | All API calls failed silently | Agent should acknowledge errors |
| Ungrounded Medical | Dosages without RAG source | Patient safety — can't hallucinate medicine |
| Repetition | Same sentence 3+ times | LLM generation loop detected |
| Generic Non-answer | "I can't help" when tools exist | Agent should try harder |
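To make one of these concrete, a repetition check can be as simple as counting duplicate sentences. A sketch, not the production heuristic:

import re
from collections import Counter

def _check_repetition(response: str, threshold: int = 3) -> list:
    """Flag a likely generation loop: the same sentence appearing 3+ times."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", response) if s.strip()]
    counts = Counter(sentences)
    if counts and max(counts.values()) >= threshold:
        return ["repetition_loop"]
    return []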
If quality score < 0.4 or a critical issue is found, the system automatically retries with improved instructions:
if "all_tools_failed" in issues:
retry_hint = "Tools failed. Try again, or answer from knowledge "
"and NOTE clearly this is not real-time data."
improved_prompt = f"{retry_hint}\n\nOriginal question: {query}"
# Agent gets a second chance with better guidance
3. Multi-LLM: The Insurance Policy
The core of reliability — a 4-provider fallback chain:
Primary: OpenRouter (DeepSeek-V4) — Best cost/performance ratio
↓ (429 rate limit or timeout)
Fallback 1: Google Gemini — Free tier, solid quality
↓ (quota exhausted)
Fallback 2: Groq — Fastest inference, limited context
↓ (all cloud providers down)
Fallback 3: Ollama (local) — Runs on our server, always available
The factory pattern makes switching invisible to agents:
# Agents don't know which provider they're using
model = get_complex_model() # Returns first available provider
agent = ReActAgent(model=model, toolkit=toolkit)
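Inside the factory, it's just a walk down the chain. A simplified sketch (build_client and ping() are illustrative stand-ins for the per-provider client setup and health check):

class ProviderError(Exception):
    """Stand-in for rate-limit, timeout, and connection errors from a provider SDK."""

PROVIDER_CHAIN = ["openrouter", "gemini", "groq", "ollama"]

def get_complex_model():
    """Return a client for the first provider in the chain that responds."""
    for name in PROVIDER_CHAIN:
        try:
            client = build_client(name)  # hypothetical per-provider constructor
            client.ping()                # cheap liveness check
            return client
        except ProviderError:
            continue  # fall through to the next provider
    raise RuntimeError("All LLM providers unavailable")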
The secret weapon: force_new=True rotates API keys on 429 errors without restarting the system:
async def get_agent_async(agent_key, force_new=False):
    if force_new:
        _agents_async.pop(cache_key, None)
        # Rebuilds with fresh model config → new API key
When one key hits rate limits, the system seamlessly switches to a backup key, then to the next provider. The user never notices.
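Put together, the caller's retry path looks roughly like this. RateLimitError and agent.run are stand-ins for the actual SDK types, not the production code:

class RateLimitError(Exception):
    """Stand-in for the provider SDK's 429 error type."""

async def ask_agent(agent_key: str, query: str) -> str:
    agent = await get_agent_async(agent_key)
    try:
        return await agent.run(query)
    except RateLimitError:
        # Drop the cached agent and rebuild: the fresh model config picks up
        # a backup API key, or the next provider in the fallback chain
        agent = await get_agent_async(agent_key, force_new=True)
        return await agent.run(query)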
4. MCP Protocol: Standardizing 40+ Tools
With 40+ tool functions spread across 12 files (finance, HR, diagnostics, RAG, booking...), maintaining consistent interfaces was a nightmare. MCP (Model Context Protocol) solved this.
Before MCP: Each tool had its own calling convention, error format, and response structure.
After MCP: One protocol, one schema, one way to discover and call tools.
# Build a hybrid toolkit: local tools + MCP-discovered tools
toolkit = await build_hybrid_toolkit(
    local_functions=local_fns,    # Our FastAPI wrappers (high priority)
    mcp_category=mcp_category,    # MCP tools discovered at runtime
    mcp_max_tools=20,             # Cap to prevent context overflow
)
The hybrid approach:
- Local tools (REST wrappers around our FastAPI backend) → fast, reliable, battle-tested
- MCP tools (auto-discovered from tool server) → extensible without code changes
- Priority system → local tools always take precedence; MCP supplements
This means I can add new capabilities by registering tools on the MCP server — no agent code changes needed.
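The priority merge itself is tiny. A sketch assuming a hypothetical discover_mcp_tools helper for the runtime discovery step:

async def build_hybrid_toolkit(local_functions, mcp_category, mcp_max_tools=20):
    """Local FastAPI wrappers register first; MCP tools fill in behind them."""
    toolkit = {fn.__name__: fn for fn in local_functions}
    mcp_tools = await discover_mcp_tools(mcp_category)  # hypothetical discovery call
    for tool in mcp_tools[:mcp_max_tools]:
        toolkit.setdefault(tool.name, tool)  # local tools always win on a name clash
    return toolkit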
5. Results & Lessons Learned
The numbers
| Metric | Before (Single Agent) | After (Multi-Agent) |
|---|---|---|
| Routing accuracy | ~60% | ~92% |
| Average response time | 8-15s | 3-6s |
| Failed morning reports | 3/week | 0/month |
| Monthly LLM cost | ~$80 | ~$35 |
| Code maintainability | 🔴 Monolith | 🟢 Modular |
What worked
- ✅ Zero morning report failures since implementing Multi-LLM fallback
- ✅ 60% cost reduction by routing simple queries to cheap models
- ✅ < 2 second failover between LLM providers
- ✅ Medical safety net — Reflection catches ungrounded claims before doctors see them
What I'd do differently
- Start with the Router from day 1 — Don't build "one agent to rule them all" first. You'll waste weeks.
- Log every routing decision — When something goes wrong, you need the trace. Every intent classification, tool call, and reflection check gets logged.
- Keyword fallback is not optional — When your LLM router fails, regex literally saves the day.
- Simple agents are underrated — Booking agent with 2 tools and 3 iterations handles 90% of requests perfectly. Not everything needs GPT-4.
Conclusion
Building AI agents for production is fundamentally different from building demos. The three patterns that made the biggest difference:
- Route, don't bloat — Specialized agents with focused toolkits beat one god-agent every time
- Fail gracefully — Multi-LLM fallback + keyword backup + reflection retries = zero downtime
- Self-correct — Never send an AI response to a user without checking it first
The full system powers a real hospital dashboard serving staff daily. Architecture & documentation: Ai-Healthcare-Dashboard
Source code is private (enterprise healthcare). The showcase repo contains full architecture docs, diagrams, and technical decisions.
About the author: I'm Duc Thanh, an AI Engineer specializing in Healthcare IT. I build production AI systems that hospitals actually use. Connect with me on LinkedIn or GitHub.
