
Nathaniel Hamlett

Posted on • Originally published at nathanhamlett.com

Building a Multi-Model AI Agent: Automatic Fallback When Your Primary LLM Refuses


If you're building an AI agent that does real work -- sending emails, filling forms, generating documents -- you've hit the wall. Your primary model refuses a perfectly reasonable task. Not because the task is harmful. Because the model's safety filter is miscalibrated for your use case.

This isn't a theoretical problem. I run an autonomous agent system with 41 skills that handles everything from job applications to outreach emails to resume generation. It runs on Claude as the primary brain. Claude is the best model I've used for complex reasoning. It's also the model most likely to refuse to write a cold outreach email because it "might be perceived as spam."

The fix isn't prompt engineering. The fix is architecture.

The Problem: Single-Model Fragility

Most agent tutorials show you how to connect to one LLM. That works for demos. In production, a single-model agent has a single point of failure -- and that failure mode isn't "the API is down." It's "the model decided your legitimate task violates some policy it invented."

Here's what this looks like in practice:

  • Ask the model to write a personalized outreach message. Refused -- "I can't help with unsolicited contact."
  • Ask it to tailor a resume to emphasize certain skills. Refused -- "I don't want to misrepresent your qualifications."
  • Ask it to extract text from a competitor's public webpage. Refused -- "I can't assist with scraping."

Each refusal kills an entire pipeline run. If your agent runs on a cron schedule, that's a wasted cycle. If a human is waiting, that's a frustrated user.

The Solution: A Fallback Chain

The architecture is simple: try your preferred model first. If it fails (refusal, rate limit, API error), try the next one. Keep going until something works.

MODELS = {
    'gemini': {
        'fn': _gemini,
        'requires': 'GEMINI_API_KEY',
    },
    'openrouter': {
        'fn': _openrouter,
        'requires': 'OPENROUTER_API_KEY',
    },
    'deepseek': {
        'fn': _deepseek,
        'requires': 'DEEPSEEK_API_KEY',
    },
    'ollama': {
        'fn': _ollama,
        'requires': None,  # local, no key needed
    },
}

Each backend is a function with the same signature: (prompt, system=None) -> str. The registry tracks what's available at runtime by checking for API keys in the environment.
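The article doesn't show `get_available_models`, so here's a minimal sketch of what that runtime check could look like. `DEMO_MODELS` is a stand-in registry with the same shape as `MODELS` above; only the availability logic is the point:

```python
import os

# Hypothetical registry mirroring the MODELS shape above: each entry
# names the env var it requires, or None for local backends.
DEMO_MODELS = {
    'deepseek': {'requires': 'DEEPSEEK_API_KEY'},
    'ollama':   {'requires': None},  # local, always available
}

def get_available_models(models=DEMO_MODELS):
    """Return names of backends whose required API key (if any) is set."""
    return {
        name for name, spec in models.items()
        if spec['requires'] is None or os.environ.get(spec['requires'])
    }
```

Because availability is computed from the environment on every call, revoking a key immediately drops that backend out of the chain.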

The Router

The core routing function is under 40 lines:

def route_prompt(prompt, system=None, prefer=None, fallback_chain=None):
    if fallback_chain is None:
        fallback_chain = ['deepseek', 'gemini', 'openrouter', 'ollama']

    if prefer:
        fallback_chain = [prefer] + [m for m in fallback_chain if m != prefer]

    available = get_available_models()
    errors = []

    for model_name in fallback_chain:
        if model_name not in available:
            errors.append(f"{model_name}: not available")
            continue
        try:
            response = MODELS[model_name]['fn'](prompt, system)
            return {
                'response': response,
                'model_used': model_name,
                'error': None,
            }
        except Exception as e:
            errors.append(f"{model_name}: {str(e)[:200]}")
            continue

    return {
        'response': None,
        'model_used': None,
        'error': f"All models failed: {'; '.join(errors)}",
    }
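From the caller's side, consuming the result dict looks like this. A sketch: `summarize` is a hypothetical skill, and the router is passed in as a parameter only so the example stands alone:

```python
# Sketch: how a calling skill consumes the router's result dict.
def summarize(text, route_prompt):
    result = route_prompt(
        f"Summarize in one sentence:\n\n{text}",
        system="You are a terse summarizer.",
        prefer="gemini",
    )
    if result['error']:
        # Only reached when every backend in the chain failed.
        raise RuntimeError(result['error'])
    print(f"[model={result['model_used']}]")
    return result['response']
```

The caller states a preference once and never handles per-provider errors.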

Key design decisions:

  1. Caller picks the preference, router picks the fallback. The calling code says "I'd prefer Gemini for this" and the router handles everything else.
  2. Availability is checked at call time. Models come and go (API keys expire, local models get uninstalled). Don't assume your config from startup is still valid.
  3. Errors are collected, not swallowed. If everything fails, you get a diagnostic string showing exactly what went wrong at each step.
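One wrinkle the loop above doesn't handle on its own: a refusal is not an exception. The API call succeeds and returns apologetic text. To make refusals trigger fallback, each wrapper can raise when the output looks like one. A rough heuristic sketch; the marker list is illustrative, not exhaustive:

```python
# Illustrative refusal openers; real refusals vary, tune for your models.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm not able to help",
)

class ModelRefusal(Exception):
    """Raised so the fallback loop moves on to the next backend."""

def check_refusal(text: str) -> str:
    """Pass response text through, raising if it looks like a refusal."""
    head = text.strip().lower()[:200]  # refusals usually open the reply
    if any(marker in head for marker in REFUSAL_MARKERS):
        raise ModelRefusal(f"refusal detected: {text[:80]!r}")
    return text
```

A wrapper would `return check_refusal(response_text)` instead of returning the text directly, and the router's `except Exception` clause does the rest.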

Backend Wrappers

Each backend is a thin wrapper. Here's the pattern for any OpenAI-compatible API (which covers most providers):

def _deepseek(prompt, system=None, model="deepseek-chat"):
    import openai
    client = openai.OpenAI(
        base_url="https://api.deepseek.com",
        api_key=os.environ['DEEPSEEK_API_KEY'],
    )
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=4096,
    )
    return response.choices[0].message.content

OpenRouter, DeepSeek, Together, Groq -- they all use the OpenAI client library with a different base_url. One wrapper function covers all of them.

For Google's Gemini, the SDK is different but the wrapper stays the same shape:

def _gemini(prompt, system=None, model="gemini-2.5-pro"):
    from google import genai
    client = genai.Client(api_key=os.environ['GEMINI_API_KEY'])
    config = {}
    if system:
        config['system_instruction'] = system
    response = client.models.generate_content(
        model=model, contents=prompt, config=config or None,
    )
    return response.text

For local models via Ollama, shell out to the CLI:

def _ollama(prompt, system=None, model="llama3.2:3b"):
    import subprocess
    full_prompt = f"[System: {system}]\n\n{prompt}" if system else prompt
    result = subprocess.run(
        ['ollama', 'run', model],
        input=full_prompt,
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout.strip()

What I Learned Running This in Production

After several weeks of running this across thousands of calls, here's what surprised me:

1. Temperature should vary by task, not by model.

Scoring a resume against a job description? Temperature 0.2. Writing a cover letter? Temperature 0.7. Extracting structured data from a webpage? Temperature 0.1. I learned this from studying how ApplyPilot handles their resume pipeline -- they tune temperature per pipeline stage, not globally.
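In code, that means temperature lives in a per-task table that gets threaded through to whichever wrapper wins. A sketch; the task names are hypothetical and the values mirror the examples above:

```python
# Temperature keyed by task, not by model.
TASK_TEMPERATURE = {
    'score_resume':   0.2,  # deterministic comparisons
    'cover_letter':   0.7,  # some creative variance is desirable
    'extract_fields': 0.1,  # structured extraction, keep it rigid
}

def temperature_for(task: str, default: float = 0.5) -> float:
    """Look up the tuned temperature for a pipeline stage."""
    return TASK_TEMPERATURE.get(task, default)
```

Each backend wrapper then accepts a `temperature` kwarg and passes it to its SDK call, so the same task gets the same behavior no matter which model answers.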

2. JSON extraction needs a fallback chain of its own.

LLMs wrap JSON in markdown fences, add preamble text, or return partial objects. Every script that parses LLM output needs this:

import json

def extract_json(raw: str) -> dict:
    # 1. Happy path: the whole response is valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in markdown fences: try each fenced block.
    if "```" in raw:
        for part in raw.split("```")[1::2]:
            candidate = part.strip()
            if candidate.startswith("json"):  # strip the language tag
                candidate = candidate[len("json"):].strip()
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue
    # 3. Last resort: the outermost brace-delimited span.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            pass
    raise ValueError("No valid JSON found")

This one function eliminated about 30% of my pipeline failures.

3. Local models are the escape hatch, not the default.

Ollama on a consumer GPU (I use an RTX 5070 Ti) handles 3B-7B models fine. That's enough for simple extraction, classification, and form-filling. But for anything requiring reasoning -- scoring, strategy, personalization -- cloud models win by a wide margin.

4. Cost is almost irrelevant at agent scale.

My entire model routing setup costs under $5/month. DeepSeek is $0.27 per million input tokens. Gemini has a generous free tier. OpenRouter offers free access to Llama 3.3 70B. The expensive part of running an agent isn't the LLM calls -- it's the engineering time debugging why your pipeline broke at 3am.

When to Build This vs. Use a Framework

LangChain has model routing. So does LiteLLM. You don't need to write your own if you're already in one of those ecosystems.

Build your own when:

  • You want zero framework dependencies (my router is stdlib + the provider SDKs)
  • You need custom fallback logic (e.g., "use the uncensored model only for outreach tasks")
  • You're running in a cron job where startup time matters (importing LangChain adds seconds)

Use a framework when:

  • You need streaming, tool calling, structured output, and other features the providers expose differently
  • You're building a product, not an internal tool
  • You have a team and want standardized patterns

The Architecture Diagram

Caller (any skill/script)
    |
    v
route_prompt(prompt, prefer="gemini")
    |
    +---> gemini (try first)
    |       |-- success? return response
    |       |-- fail? continue
    +---> deepseek (try second)
    |       |-- success? return response
    |       |-- fail? continue
    +---> openrouter (try third)
    |       |-- success? return response
    |       |-- fail? continue
    +---> ollama (last resort, local)
            |-- success? return response
            |-- fail? return all errors

The caller doesn't know or care which model answered. It gets back a response and a model_used field for logging.
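The `model_used` field is what makes the router observable. A minimal logging shim, sketched here with an invented log format; `logged_route` takes the router as a parameter only so the example is self-contained:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("router")

def logged_route(route_prompt, prompt, **kwargs):
    """Wrap route_prompt to record which backend answered and how fast."""
    start = time.monotonic()
    result = route_prompt(prompt, **kwargs)
    elapsed = time.monotonic() - start
    log.info("model=%s elapsed=%.2fs error=%s",
             result['model_used'], elapsed, result['error'])
    return result
```

Grepping those lines over a week tells you whether your preferred model is actually handling traffic or quietly falling back.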

Start Here

  1. Pick two providers minimum. I'd suggest Gemini (free tier, strong) + DeepSeek (cheap, fast, permissive).
  2. Write wrapper functions with identical signatures.
  3. Build the fallback loop.
  4. Add model_used to your logging so you can see which models are actually handling your traffic.

The whole thing is about 100 lines of meaningful code. It took me an afternoon to build and has saved dozens of hours of debugging since.


Nathaniel Hamlett builds autonomous AI agent systems. More at nathanhamlett.com.
