<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DucThanh</title>
    <description>The latest articles on DEV Community by DucThanh (@ducthanh1810).</description>
    <link>https://dev.to/ducthanh1810</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904547%2F7c645f91-066a-43cd-8fac-c2ca7bef5615.jpg</url>
      <title>DEV Community: DucThanh</title>
      <link>https://dev.to/ducthanh1810</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ducthanh1810"/>
    <language>en</language>
    <item>
      <title>How I Built a Multi-LLM AI Agent System for Hospital Management</title>
      <dc:creator>DucThanh</dc:creator>
      <pubDate>Fri, 01 May 2026 09:47:33 +0000</pubDate>
      <link>https://dev.to/ducthanh1810/how-i-built-a-multi-llm-ai-agent-system-for-hospital-management-4ool</link>
      <guid>https://dev.to/ducthanh1810/how-i-built-a-multi-llm-ai-agent-system-for-hospital-management-4ool</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Multi-LLM AI Agent System for Hospital Management
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;When your LLM provider goes down, the hospital can't wait. Here's how I designed a system that never stops.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Every AI demo you see on Twitter works with one LLM provider. But what happens when that provider hits rate limits at 8 AM — exactly when your hospital staff needs their morning revenue report?&lt;/p&gt;

&lt;p&gt;I learned this the hard way. After my AI agent failed three mornings in a row because OpenRouter was rate-limited, I rebuilt the entire system from scratch. Today, &lt;strong&gt;HISDashboard&lt;/strong&gt; runs 10 specialized AI agents across 4 LLM providers with automatic fallback — and it hasn't missed a morning report in months.&lt;/p&gt;

&lt;p&gt;Here's the architecture that made it possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem: One Agent Can't Do Everything
&lt;/h2&gt;

&lt;p&gt;My first attempt was simple: one ReAct agent, one LLM, all tools loaded. It failed spectacularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window overflow&lt;/strong&gt; — 40+ tool descriptions consumed most of the tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong tool selection&lt;/strong&gt; — when asked about HR staffing, the agent queried financial data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure&lt;/strong&gt; — OpenRouter down = everything down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed a fundamentally different architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Solution: Router → Specialist → Reflection
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhi2rav06j9tnlwpykp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhi2rav06j9tnlwpykp8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Router Agent — Intent Classification
&lt;/h3&gt;

&lt;p&gt;The Router doesn't use regex or keyword matching. It's a lightweight LLM call with &lt;strong&gt;structured output&lt;/strong&gt; via Pydantic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured intent classification result.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;One of: clinical, booking, analysis, hr_dispatch...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence score 0.0 to 1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When confidence drops below 0.4, the router asks the user to clarify rather than risk a wrong guess:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;LOW_CONFIDENCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;  &lt;span class="c1"&gt;# Better to ask than to guess wrong
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when the LLM itself fails? Three layers of fallback ensure the router &lt;strong&gt;never&lt;/strong&gt; crashes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: LLM structured output (Pydantic schema)
    ↓ (parse error)
Layer 2: Text response → keyword extraction
    ↓ (LLM down)
Layer 3: Regex keyword matching (always works, no API needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Your router is the single most critical component. If it fails, everything fails. Over-engineer it.&lt;/p&gt;
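&lt;p&gt;The three layers can be sketched as ordinary Python. Everything below is illustrative rather than the production code: the keyword table, &lt;code&gt;classify_by_keywords&lt;/code&gt;, and the injected &lt;code&gt;llm_classify&lt;/code&gt; callable are stand-ins for the real Layer 1/2 client.&lt;/p&gt;

```python
import re

# Illustrative keyword table for Layer 3; the real table is larger.
KEYWORD_PATTERNS = {
    "booking": re.compile(r"\b(book|appointment|schedule|cancel)\b", re.I),
    "financial": re.compile(r"\b(revenue|invoice|insurance|cost)\b", re.I),
    "hr_dispatch": re.compile(r"\b(staff|shift|roster|nurse)\b", re.I),
}

def classify_by_keywords(query: str) -> str:
    """Layer 3: pure regex, needs no API and so can never be 'down'."""
    for intent, pattern in KEYWORD_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "clarify"  # nothing matched: ask rather than guess

def route(query: str, llm_classify=None) -> str:
    """Try the structured LLM call first, degrade layer by layer."""
    if llm_classify is not None:
        try:
            return llm_classify(query)  # Layers 1-2 live inside this call
        except Exception:
            pass  # parse error or provider outage: fall through
    return classify_by_keywords(query)  # Layer 3 always answers
```

&lt;p&gt;Because the last layer is pure regex, the worst case degrades to a coarse but deterministic answer instead of an exception.&lt;/p&gt;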

&lt;h3&gt;
  
  
  2.2 Specialist Agents — Right Tool, Right Model
&lt;/h3&gt;

&lt;p&gt;Each agent gets only the tools it needs. No context window waste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Financial agent: complex model, 20+ tools, 10 reasoning iterations
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# GPT-4 class model
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolkit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;financial_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 20+ tools (revenue, insurance, forecast...)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_iters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# More reasoning steps for complex analysis
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parallel_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Call multiple APIs simultaneously
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Booking agent: simple model, 2 tools, 3 iterations
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;booking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_complex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# GPT-3.5 class model (cheaper, faster)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolkit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;booking_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Just create_appointment + cancel_appointment
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_iters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# Simple task, fewer steps needed
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The tradeoff&lt;/strong&gt;: complex agents use expensive models with more iterations. Simple agents use cheap, fast models. This saves &lt;strong&gt;~60% on API costs&lt;/strong&gt; while maintaining quality where it matters.&lt;/p&gt;
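&lt;p&gt;A minimal sketch of how such a registry can drive model selection. The registry literal and &lt;code&gt;pick_model_tier&lt;/code&gt; are illustrative, not the actual factory:&lt;/p&gt;

```python
# Hypothetical per-agent registry mirroring the config above.
AGENT_REGISTRY = {
    "financial": {"is_complex": True, "max_iters": 10},
    "booking": {"is_complex": False, "max_iters": 3},
}

def pick_model_tier(agent_key: str) -> str:
    """Map the is_complex flag to a model tier name, so the expensive
    model is only ever requested for agents that actually need it."""
    cfg = AGENT_REGISTRY[agent_key]
    return "complex" if cfg["is_complex"] else "simple"
```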

&lt;h3&gt;
  
  
  2.3 Reflection Layer — Self-Correction Before Responding
&lt;/h3&gt;

&lt;p&gt;Before any response reaches the user, the Reflection layer runs 5 quality checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_check_empty_or_short&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;      &lt;span class="c1"&gt;# Agent gave up?
&lt;/span&gt;    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_check_tool_failures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;    &lt;span class="c1"&gt;# All APIs failed?
&lt;/span&gt;    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_check_ungrounded_medical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_check_repetition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;           &lt;span class="c1"&gt;# LLM stuck in loop?
&lt;/span&gt;    &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_check_generic_nonanswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Empty/Short&lt;/td&gt;
&lt;td&gt;Agent returned minimal text&lt;/td&gt;
&lt;td&gt;Users expect detailed analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Failures&lt;/td&gt;
&lt;td&gt;All API calls failed silently&lt;/td&gt;
&lt;td&gt;Agent should acknowledge errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ungrounded Medical&lt;/td&gt;
&lt;td&gt;Dosages without RAG source&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Patient safety&lt;/strong&gt; — can't hallucinate medicine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repetition&lt;/td&gt;
&lt;td&gt;Same sentence 3+ times&lt;/td&gt;
&lt;td&gt;LLM generation loop detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic Non-answer&lt;/td&gt;
&lt;td&gt;"I can't help" when tools exist&lt;/td&gt;
&lt;td&gt;Agent should try harder&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If quality score &amp;lt; 0.4 or a critical issue is found, the system &lt;strong&gt;automatically retries&lt;/strong&gt; with improved instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all_tools_failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;retry_hint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tools failed. Try again, or answer from knowledge &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and NOTE clearly this is not real-time data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;improved_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retry_hint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Original question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# Agent gets a second chance with better guidance
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Multi-LLM: The Insurance Policy
&lt;/h2&gt;

&lt;p&gt;The core of reliability — a 4-provider fallback chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Primary:    OpenRouter (DeepSeek-V4) — Best cost/performance ratio
    ↓ (429 rate limit or timeout)
Fallback 1: Google Gemini — Free tier, solid quality
    ↓ (quota exhausted)
Fallback 2: Groq — Fastest inference, limited context
    ↓ (all cloud providers down)
Fallback 3: Ollama (local) — Runs on our server, always available
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The factory pattern makes switching &lt;strong&gt;invisible&lt;/strong&gt; to agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Agents don't know which provider they're using
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_complex_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Returns first available provider
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ReActAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The secret weapon&lt;/strong&gt;: &lt;code&gt;force_new=True&lt;/code&gt; rotates API keys on 429 errors without restarting the system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;force_new&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;force_new&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_agents_async&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Rebuilds with fresh model config → new API key
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When one key hits rate limits, the system seamlessly switches to a backup key, then to the next provider. The user never notices.&lt;/p&gt;
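&lt;p&gt;Stripped to its essence, the failover is a loop over the provider chain in order; the provider callables here are stand-ins for the real clients:&lt;/p&gt;

```python
class AllProvidersDown(Exception):
    """Raised only when every provider in the chain has failed."""

def complete_with_fallback(prompt: str, providers: list) -> str:
    """Walk the chain in order; the first success wins. Each entry is
    a (name, call) pair where call(prompt) returns text or raises."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # 429, timeout, quota exhausted, ...
            errors.append(f"{name}: {exc}")
    raise AllProvidersDown("; ".join(errors))
```

&lt;p&gt;Key rotation slots into the same loop: a 429 on one key simply becomes another failed entry before the chain moves on.&lt;/p&gt;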




&lt;h2&gt;
  
  
  4. MCP Protocol: Standardizing 40+ Tools
&lt;/h2&gt;

&lt;p&gt;With 40+ tool functions spread across 12 files (finance, HR, diagnostics, RAG, booking...), maintaining consistent interfaces was a nightmare. MCP (Model Context Protocol) solved this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before MCP&lt;/strong&gt;: Each tool had its own calling convention, error format, and response structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After MCP&lt;/strong&gt;: One protocol, one schema, one way to discover and call tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build a hybrid toolkit: local tools + MCP-discovered tools
&lt;/span&gt;&lt;span class="n"&gt;toolkit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;build_hybrid_toolkit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;local_functions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;local_fns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Our FastAPI wrappers (high priority)
&lt;/span&gt;    &lt;span class="n"&gt;mcp_category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# MCP tools discovered at runtime
&lt;/span&gt;    &lt;span class="n"&gt;mcp_max_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# Cap to prevent context overflow
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hybrid approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local tools&lt;/strong&gt; (REST wrappers around our FastAPI backend) → fast, reliable, battle-tested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tools&lt;/strong&gt; (auto-discovered from tool server) → extensible without code changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority system&lt;/strong&gt; → local tools always take precedence; MCP supplements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means I can add new capabilities by registering tools on the MCP server — no agent code changes needed.&lt;/p&gt;
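&lt;p&gt;The precedence rule boils down to merge order plus a cap on the MCP side. A sketch, assuming tools are keyed by name (&lt;code&gt;merge_toolkits&lt;/code&gt; is illustrative, not the real &lt;code&gt;build_hybrid_toolkit&lt;/code&gt;):&lt;/p&gt;

```python
def merge_toolkits(local_tools: dict, mcp_tools: dict,
                   mcp_max_tools: int = 20) -> dict:
    """Local tools always win on name clashes; MCP tools supplement,
    capped to keep the context window in check."""
    capped_mcp = dict(list(mcp_tools.items())[:mcp_max_tools])
    merged = dict(capped_mcp)
    merged.update(local_tools)  # local overrides any MCP duplicate
    return merged
```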




&lt;h2&gt;
  
  
  5. Results &amp;amp; Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The numbers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (Single Agent)&lt;/th&gt;
&lt;th&gt;After (Multi-Agent)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Routing accuracy&lt;/td&gt;
&lt;td&gt;~60%&lt;/td&gt;
&lt;td&gt;~92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average response time&lt;/td&gt;
&lt;td&gt;8-15s&lt;/td&gt;
&lt;td&gt;3-6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed morning reports&lt;/td&gt;
&lt;td&gt;3/week&lt;/td&gt;
&lt;td&gt;0/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly LLM cost&lt;/td&gt;
&lt;td&gt;~$80&lt;/td&gt;
&lt;td&gt;~$35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code maintainability&lt;/td&gt;
&lt;td&gt;🔴 Monolith&lt;/td&gt;
&lt;td&gt;🟢 Modular&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What worked
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Zero morning report failures&lt;/strong&gt; since implementing Multi-LLM fallback&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;60% cost reduction&lt;/strong&gt; by routing simple queries to cheap models&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;&amp;lt; 2 second failover&lt;/strong&gt; between LLM providers&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Medical safety net&lt;/strong&gt; — Reflection catches ungrounded claims before doctors see them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What I'd do differently
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the Router from day 1&lt;/strong&gt; — Don't build "one agent to rule them all" first. You'll waste weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log every routing decision&lt;/strong&gt; — When something goes wrong, you need the trace. Every intent classification, tool call, and reflection check gets logged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword fallback is not optional&lt;/strong&gt; — When your LLM router fails, regex literally saves the day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple agents are underrated&lt;/strong&gt; — Booking agent with 2 tools and 3 iterations handles 90% of requests perfectly. Not everything needs GPT-4.&lt;/li&gt;
&lt;/ol&gt;
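&lt;p&gt;Point 2 costs almost nothing with stdlib logging. A minimal sketch that emits one JSON line per routing decision (the field names are assumptions):&lt;/p&gt;

```python
import json
import logging

logger = logging.getLogger("router")

def log_routing_decision(query: str, intent: str,
                         confidence: float, agent_key: str) -> str:
    """Emit one structured JSON line per routing decision so a bad
    answer can be traced back to the classification that caused it."""
    record = {
        "query": query,
        "intent": intent,
        "confidence": confidence,
        "agent": agent_key,
    }
    line = json.dumps(record, ensure_ascii=False)
    logger.info(line)
    return line
```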




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building AI agents for production is &lt;strong&gt;fundamentally different&lt;/strong&gt; from building demos. The three patterns that made the biggest difference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Route, don't bloat&lt;/strong&gt; — Specialized agents with focused toolkits beat one god-agent every time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail gracefully&lt;/strong&gt; — Multi-LLM fallback + keyword backup + reflection retries = zero downtime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-correct&lt;/strong&gt; — Never send an AI response to a user without checking it first&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full system powers a real hospital dashboard serving staff daily. Architecture &amp;amp; documentation: &lt;a href="https://github.com/ducthanh1810/ai-healthcare-dashboard" rel="noopener noreferrer"&gt;Ai-Healthcare-Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Source code is private (enterprise healthcare). The showcase repo contains full architecture docs, diagrams, and technical decisions.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;About the author&lt;/strong&gt;: I'm Duc Thanh, an AI Engineer specializing in Healthcare IT. I build production AI systems that hospitals actually use. Connect with me on &lt;a href="https://www.linkedin.com/in/ducthanh1810" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://github.com/ducthanh1810" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentarchitecture</category>
      <category>healthcare</category>
      <category>python</category>
    </item>
  </channel>
</rss>
