How I Built a Multi-LLM AI Agent System for Hospital Management

When your LLM provider goes down, the hospital can't wait. Here's how I designed a system that never stops.


Every AI demo you see on Twitter works with one LLM provider. But what happens when that provider hits rate limits at 8 AM — exactly when your hospital staff needs their morning revenue report?

I learned this the hard way. After my AI agent failed three mornings in a row because OpenRouter was rate-limited, I rebuilt the entire system from scratch. Today, HISDashboard runs 10 specialized AI agents across 4 LLM providers with automatic fallback — and it hasn't missed a morning report in months.

Here's the architecture that made it possible.


1. The Problem: One Agent Can't Do Everything

My first attempt was simple: one ReAct agent, one LLM, all tools loaded. It failed spectacularly:

  • Context window overflow — 40+ tool descriptions consumed most of the tokens
  • Wrong tool selection — asked about HR staffing, agent queried financial data
  • Single point of failure — OpenRouter down = everything down

I needed a fundamentally different architecture.


2. The Solution: Router → Specialist → Reflection

2.1 Router Agent — Intent Classification

The Router doesn't use regex or keyword matching. It's a lightweight LLM call with structured output via Pydantic:

from pydantic import BaseModel, Field

class IntentResult(BaseModel):
    """Structured intent classification result."""
    intent: str = Field(
        description="One of: clinical, booking, analysis, hr_dispatch..."
    )
    confidence: float = Field(
        default=0.8,
        description="Confidence score 0.0 to 1.0",
    )

When confidence drops below 0.4, instead of guessing wrong, it asks the user to clarify:

LOW_CONFIDENCE = 0.4  # threshold below which we stop guessing

if confidence < LOW_CONFIDENCE:
    return "clarify", confidence  # Better to ask than to guess wrong

And when the LLM itself fails? Three layers of fallback ensure the router never crashes:

Layer 1: LLM structured output (Pydantic schema)
    ↓ (parse error)
Layer 2: Text response → keyword extraction
    ↓ (LLM down)
Layer 3: Regex keyword matching (always works, no API needed)
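
To make the three layers concrete, here is a minimal sketch of how they might chain together. The names (route, llm_structured, llm_text, KEYWORD_PATTERNS) and the keyword table itself are illustrative, not the production implementation:

import re
from typing import Callable, Tuple

# Hypothetical keyword table for the layer-3 regex fallback
KEYWORD_PATTERNS = {
    "financial":   r"revenue|invoice|insurance",
    "booking":     r"appointment|book|schedule",
    "hr_dispatch": r"staff|shift|nurse",
}

def route(query: str, llm_structured: Callable, llm_text: Callable) -> Tuple[str, float]:
    """Three-layer intent classification: degrade gracefully, never crash."""
    # Layer 1: LLM call with Pydantic structured output (IntentResult)
    try:
        result = llm_structured(query)
        return result.intent, result.confidence
    except Exception:
        pass  # schema parse error or provider failure -> layer 2

    # Layer 2: free-text LLM answer, pull an intent keyword out of it
    try:
        text = llm_text(query).lower()
        for intent, pattern in KEYWORD_PATTERNS.items():
            if re.search(pattern, text):
                return intent, 0.5
    except Exception:
        pass  # LLM completely down -> layer 3

    # Layer 3: plain regex over the user's query, no API call, never fails
    for intent, pattern in KEYWORD_PATTERNS.items():
        if re.search(pattern, query, re.IGNORECASE):
            return intent, 0.3
    return "clarify", 0.0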

Key insight: Your router is the single most critical component. If it fails, everything fails. Over-engineer it.

2.2 Specialist Agents — Right Tool, Right Model

Each agent gets only the tools it needs. No context window waste:

# Financial agent: complex model, 20+ tools, 10 reasoning iterations
"financial": {
    "is_complex": True,         # GPT-4 class model
    "toolkit": financial_tools,  # 20+ tools (revenue, insurance, forecast...)
    "max_iters": 10,            # More reasoning steps for complex analysis
    "parallel_tools": True,     # Call multiple APIs simultaneously
}

# Booking agent: simple model, 2 tools, 3 iterations
"booking": {
    "is_complex": False,        # GPT-3.5 class model (cheaper, faster)
    "toolkit": booking_tools,   # Just create_appointment + cancel_appointment
    "max_iters": 3,             # Simple task, fewer steps needed
}

The tradeoff: complex agents use expensive models with more iterations. Simple agents use cheap, fast models. This saves ~60% on API costs while maintaining quality where it matters.
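
To show how such a config can drive agent construction, here is a sketch. get_complex_model and ReActAgent match names used later in this post; AGENT_SPECS, get_simple_model, and the max_iterations parameter are assumptions for illustration:

AGENT_SPECS = {
    "financial": {"is_complex": True,  "toolkit": financial_tools, "max_iters": 10},
    "booking":   {"is_complex": False, "toolkit": booking_tools,   "max_iters": 3},
}

def build_agent(agent_key: str):
    spec = AGENT_SPECS[agent_key]
    # Expensive model only where the task needs deep reasoning
    model = get_complex_model() if spec["is_complex"] else get_simple_model()
    return ReActAgent(
        model=model,
        toolkit=spec["toolkit"],
        max_iterations=spec["max_iters"],
    )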

2.3 Reflection Layer — Self-Correction Before Responding

Before any response reaches the user, the Reflection layer runs 5 quality checks:

def evaluate_response(query, response, intent, tool_results):
    issues = []
    issues.extend(_check_empty_or_short(response))      # Agent gave up?
    issues.extend(_check_tool_failures(tool_results))    # All APIs failed?
    issues.extend(_check_ungrounded_medical(response, intent, tool_results))
    issues.extend(_check_repetition(response))           # LLM stuck in loop?
    issues.extend(_check_generic_nonanswer(response, intent))
    return issues

Check              | What it catches                 | Why it matters
Empty/Short        | Agent returned minimal text     | Users expect detailed analysis
Tool Failures      | All API calls failed silently   | Agent should acknowledge errors
Ungrounded Medical | Dosages without RAG source      | Patient safety — can't hallucinate medicine
Repetition         | Same sentence 3+ times          | LLM generation loop detected
Generic Non-answer | "I can't help" when tools exist | Agent should try harder
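
As an example of what one of these checks might look like, here is a sketch of a repetition detector; the real _check_repetition is not shown in this post, so treat this as illustrative:

from collections import Counter

def _check_repetition(response: str, threshold: int = 3) -> list:
    """Flag the response if any sentence repeats `threshold` or more times."""
    sentences = [s.strip().lower() for s in response.split(".") if s.strip()]
    counts = Counter(sentences)
    if counts and max(counts.values()) >= threshold:
        return ["repetition_loop"]  # issue code consumed by the retry logic
    return []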

If quality score < 0.4 or a critical issue is found, the system automatically retries with improved instructions:

if "all_tools_failed" in issues:
    retry_hint = "Tools failed. Try again, or answer from knowledge "
                 "and NOTE clearly this is not real-time data."
improved_prompt = f"{retry_hint}\n\nOriginal question: {query}"
# Agent gets a second chance with better guidance
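
End to end, the reflection step is roughly one guarded retry. In this sketch, get_agent_async and evaluate_response come from the snippets above; run_agent, build_retry_hint, and the assumption that evaluate_response returns a score alongside the issue list are illustrative:

async def respond_with_reflection(query: str, intent: str) -> str:
    agent = await get_agent_async(intent)
    response, tool_results = await run_agent(agent, query)   # assumed helper

    # Assumed to return a quality score plus the list of issue codes
    score, issues = evaluate_response(query, response, intent, tool_results)
    if score >= 0.4 and "all_tools_failed" not in issues:
        return response

    # One retry with improved instructions, then return whatever we get
    retry_hint = build_retry_hint(issues)                     # assumed helper
    improved_prompt = f"{retry_hint}\n\nOriginal question: {query}"
    retry_response, _ = await run_agent(agent, improved_prompt)
    return retry_response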

3. Multi-LLM: The Insurance Policy

The core of reliability — a 4-provider fallback chain:

Primary:    OpenRouter (DeepSeek-V4) — Best cost/performance ratio
    ↓ (429 rate limit or timeout)
Fallback 1: Google Gemini — Free tier, solid quality
    ↓ (quota exhausted)
Fallback 2: Groq — Fastest inference, limited context
    ↓ (all cloud providers down)
Fallback 3: Ollama (local) — Runs on our server, always available

The factory pattern makes switching invisible to agents:

# Agents don't know which provider they're using
model = get_complex_model()  # Returns first available provider
agent = ReActAgent(model=model, toolkit=toolkit)
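
One way get_complex_model could walk the provider chain is sketched below; build_model and the ping health check are illustrative placeholders, not the actual factory code:

PROVIDER_CHAIN = ["openrouter", "gemini", "groq", "ollama"]  # order from the diagram above

def get_complex_model():
    """Return a model from the first provider that passes a cheap health check."""
    for provider in PROVIDER_CHAIN:
        try:
            model = build_model(provider, tier="complex")  # assumed constructor
            model.ping()                                   # assumed lightweight health check
            return model
        except Exception:
            continue  # provider down, rate-limited, or misconfigured -> next in chain
    raise RuntimeError("All LLM providers are unavailable")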

The secret weapon: force_new=True rotates API keys on 429 errors without restarting the system:

async def get_agent_async(agent_key, force_new=False):
    if force_new:
        _agents_async.pop(cache_key, None)
        # Rebuilds with fresh model config → new API key

When one key hits rate limits, the system seamlessly switches to a backup key, then to the next provider. The user never notices.
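
The call site might look something like this sketch, where RateLimitError and agent.reply stand in for whatever exception and method the actual agent framework exposes:

try:
    answer = await agent.reply(query)
except RateLimitError:  # e.g. an HTTP 429 surfaced by the provider SDK
    # Drop the cached agent; the factory rebuilds it with the next key/provider
    agent = await get_agent_async(agent_key, force_new=True)
    answer = await agent.reply(query)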


4. MCP Protocol: Standardizing 40+ Tools

With 40+ tool functions spread across 12 files (finance, HR, diagnostics, RAG, booking...), maintaining consistent interfaces was a nightmare. MCP (Model Context Protocol) solved this.

Before MCP: Each tool had its own calling convention, error format, and response structure.

After MCP: One protocol, one schema, one way to discover and call tools.

# Build a hybrid toolkit: local tools + MCP-discovered tools
toolkit = await build_hybrid_toolkit(
    local_functions=local_fns,     # Our FastAPI wrappers (high priority)
    mcp_category=mcp_category,     # MCP tools discovered at runtime
    mcp_max_tools=20,              # Cap to prevent context overflow
)

The hybrid approach:

  • Local tools (REST wrappers around our FastAPI backend) → fast, reliable, battle-tested
  • MCP tools (auto-discovered from tool server) → extensible without code changes
  • Priority system → local tools always take precedence; MCP supplements

This means I can add new capabilities by registering tools on the MCP server — no agent code changes needed.
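
The priority merge itself fits in a few lines. Here is a sketch of the idea behind build_hybrid_toolkit, with discover_mcp_tools standing in for whatever MCP client call performs discovery:

async def build_hybrid_toolkit(local_functions, mcp_category, mcp_max_tools=20):
    """Merge local REST-wrapper tools with MCP-discovered tools; local names win on clashes."""
    toolkit = {fn.__name__: fn for fn in local_functions}  # local tools first (high priority)
    mcp_tools = await discover_mcp_tools(mcp_category)     # assumed MCP client helper
    for tool in mcp_tools[:mcp_max_tools]:                 # cap to protect the context window
        toolkit.setdefault(tool.name, tool)                # never shadow a local tool
    return toolkit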


5. Results & Lessons Learned

The numbers

Metric                 | Before (Single Agent) | After (Multi-Agent)
Routing accuracy       | ~60%                  | ~92%
Average response time  | 8-15s                 | 3-6s
Failed morning reports | 3/week                | 0/month
Monthly LLM cost       | ~$80                  | ~$35
Code maintainability   | 🔴 Monolith           | 🟢 Modular

What worked

  • Zero morning report failures since implementing Multi-LLM fallback
  • 60% cost reduction by routing simple queries to cheap models
  • < 2 second failover between LLM providers
  • Medical safety net — Reflection catches ungrounded claims before doctors see them

What I'd do differently

  1. Start with the Router from day 1 — Don't build "one agent to rule them all" first. You'll waste weeks.
  2. Log every routing decision — When something goes wrong, you need the trace. Every intent classification, tool call, and reflection check gets logged (see the sketch after this list).
  3. Keyword fallback is not optional — When your LLM router fails, regex literally saves the day.
  4. Simple agents are underrated — Booking agent with 2 tools and 3 iterations handles 90% of requests perfectly. Not everything needs GPT-4.
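
As a minimal example of lesson 2, one structured log line per routing decision is enough to reconstruct what happened; the field names here are illustrative:

import json
import logging
import time

logger = logging.getLogger("router")

def log_routing_decision(query: str, intent: str, confidence: float, layer: str) -> None:
    """Emit one structured line per routing decision so failures can be traced later."""
    logger.info(json.dumps({
        "ts": time.time(),
        "query": query[:200],  # truncate; avoid logging huge prompts
        "intent": intent,
        "confidence": confidence,
        "layer": layer,        # which fallback layer produced the decision
    }))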

Conclusion

Building AI agents for production is fundamentally different from building demos. The three patterns that made the biggest difference:

  1. Route, don't bloat — Specialized agents with focused toolkits beat one god-agent every time
  2. Fail gracefully — Multi-LLM fallback + keyword backup + reflection retries = zero downtime
  3. Self-correct — Never send an AI response to a user without checking it first

The full system powers a real hospital dashboard serving staff daily. Architecture & documentation: Ai-Healthcare-Dashboard

Source code is private (enterprise healthcare). The showcase repo contains full architecture docs, diagrams, and technical decisions.


About the author: I'm Duc Thanh, an AI Engineer specializing in Healthcare IT. I build production AI systems that hospitals actually use. Connect with me on LinkedIn or GitHub.
