Devon

Posted on • Originally published at kalibr.systems

Stop Hardcoding Model Fallbacks: Let Production Data Pick Your Paths

You've seen the pattern. Maybe you've written it:

def call_model(prompt: str) -> str:
    try:
        return call_gpt4o(prompt)
    except Exception:
        try:
            return call_claude(prompt)
        except Exception:
            return call_gpt4o_mini(prompt)

Nested try/except blocks. A fallback chain. It feels like resilience. It's not.

This pattern has three problems that compound in production: it's static, it's exception-only, and it never learns.


Why Hardcoded Fallbacks Break Down

Problem 1: They only catch exceptions.

The most common production failure mode for LLM calls is not an exception — it's a valid API response with bad output. The model returns HTTP 200 with a response that doesn't parse, doesn't match your schema, or gives an answer that's technically formed but factually wrong.

Your try/except doesn't catch any of that. The broken response flows downstream as if it succeeded.
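To see why, here's a minimal sketch with a hypothetical stand-in for the model call — the call raises no exception, so the fallback never fires, even though the payload is unusable:

```python
import json

# Hypothetical stand-in for an LLM call that "succeeds" at the HTTP level
# but returns text your pipeline can't actually parse.
def flaky_model_call(prompt: str) -> str:
    return "Sure! Here is the JSON you asked for: {not: valid"

def call_with_fallback(prompt: str) -> str:
    try:
        return flaky_model_call(prompt)  # no exception is raised
    except Exception:
        return "fallback response"       # never reached

result = call_with_fallback("Summarize this document as JSON.")

# The fallback never fired, so the unparseable payload flows downstream:
try:
    json.loads(result)
    parsed = True
except json.JSONDecodeError:
    parsed = False
```

The exception handler is looking in the wrong place: the failure happens at parse time, two layers away from the try/except.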

Problem 2: They're static.

You wrote the fallback order once, based on your intuitions at that moment. gpt-4o → claude-3-5-sonnet → gpt-4o-mini. That hierarchy doesn't update. If Claude starts consistently outperforming GPT-4o on your specific task next month, your code still tries GPT-4o first, every time.

This isn't hypothetical. Model behavior changes with every update. The model that was best for your use case six months ago might not be best today.

Problem 3: They don't distribute load intelligently.

Your fallback chain means the primary model absorbs 100% of initial requests. The fallback sees only overflow traffic from failures. If the primary model is 90% reliable and the fallback is 95% reliable for your task, a static chain never exploits that. It just retries failures on the better model.
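A quick back-of-envelope sketch of that claim, under assumed numbers (90% primary, 95% fallback, and — as an assumption — half of primary failures being silent bad outputs rather than exceptions):

```python
p_primary_ok = 0.90    # assumed: primary succeeds 90% of the time
p_fallback_ok = 0.95   # assumed: fallback succeeds 95% of the time
silent_fraction = 0.5  # assumed: half of primary failures raise no exception

# Static chain: only *loud* failures (exceptions) ever reach the fallback;
# silent bad outputs pass through and count as "successes" that aren't.
loud_failures = (1 - p_primary_ok) * (1 - silent_fraction)
chain_success = p_primary_ok + loud_failures * p_fallback_ok  # 0.9475

# Just sending all traffic to the more reliable path already beats the chain —
# and an outcome-based router would discover that split on its own.
routed_success = p_fallback_ok  # 0.95
```

The exact numbers are illustrative, but the shape of the problem isn't: the chain can't recover failures it never sees, and it can't shift first-attempt traffic to the path that's actually winning.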


What Thompson Sampling Actually Does (Plain English)

Thompson Sampling is a Bayesian decision algorithm for choosing between options when you want to balance exploration (trying all options) with exploitation (using the best-known option).

In plain terms: it keeps a probability distribution for each path representing "how likely is this path to succeed?" It samples from those distributions to make routing decisions, weighted toward paths with better outcomes. When a path succeeds, its distribution shifts to reflect that; when it fails, it shifts the other way.

The result: a path that's working gets more traffic. A path that's degrading gets less, automatically.

It's not magic. It's weighted random selection with learning. But the key insight is that the weighting updates in real time, based on real outcomes, with no human in the loop.

Here's a simplified mental model of what's happening under the hood:

# Conceptual — not actual Kalibr internals
import random

class SimpleBandit:
    def __init__(self, n_paths: int):
        # Beta distribution parameters: alpha=successes+1, beta=failures+1
        self.alphas = [1] * n_paths
        self.betas = [1] * n_paths

    def choose(self) -> int:
        """Sample from each path's success probability distribution, pick the best."""
        samples = [
            random.betavariate(self.alphas[i], self.betas[i])
            for i in range(len(self.alphas))
        ]
        return samples.index(max(samples))

    def update(self, path_index: int, success: bool):
        """Update the distribution based on observed outcome."""
        if success:
            self.alphas[path_index] += 1
        else:
            self.betas[path_index] += 1

In early deployment with equal priors, all paths get roughly equal traffic. As outcomes accumulate, traffic shifts to better-performing paths. If a previously good path starts degrading, the algorithm detects it within dozens of requests and redistributes traffic.
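As a sanity check on that claim, here's a small simulation of the conceptual bandit sketched above (the class is reproduced so this snippet runs standalone): one weak path and one strong path, with assumed true success rates. Exact counts vary with the seed, but traffic reliably concentrates on the better path:

```python
import random

class SimpleBandit:
    """Same conceptual bandit as above, reproduced for a standalone run."""
    def __init__(self, n_paths: int):
        self.alphas = [1] * n_paths  # successes + 1
        self.betas = [1] * n_paths   # failures + 1

    def choose(self) -> int:
        samples = [random.betavariate(self.alphas[i], self.betas[i])
                   for i in range(len(self.alphas))]
        return samples.index(max(samples))

    def update(self, path_index: int, success: bool):
        if success:
            self.alphas[path_index] += 1
        else:
            self.betas[path_index] += 1

random.seed(0)
bandit = SimpleBandit(2)
true_rates = [0.3, 0.9]  # assumed: path 1 is clearly better
picks = [0, 0]
for _ in range(500):
    i = bandit.choose()
    picks[i] += 1
    bandit.update(i, random.random() < true_rates[i])

# After 500 requests, the large majority of traffic is on path 1 —
# nobody told the bandit which path was better; the outcomes did.
```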


The Before/After Comparison

Before: Manual fallback chain

import openai
import anthropic
import json
import logging

logger = logging.getLogger(__name__)

def summarize_with_gpt4o(text: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize in 2-3 sentences."},
            {"role": "user", "content": text}
        ],
        timeout=20
    )
    return response.choices[0].message.content

def summarize_with_claude(text: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize in 2-3 sentences: {text}"}]
    )
    return response.content[0].text

def summarize_with_mini(text: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize in 2-3 sentences."},
            {"role": "user", "content": text}
        ],
        timeout=20
    )
    return response.choices[0].message.content

def summarize(text: str) -> str:
    """Hardcoded fallback chain."""
    try:
        result = summarize_with_gpt4o(text)
        if result and len(result.strip()) > 20:
            return result
    except Exception as e:
        logger.warning(f"GPT-4o failed: {e}")

    try:
        result = summarize_with_claude(text)
        if result and len(result.strip()) > 20:
            return result
    except Exception as e:
        logger.warning(f"Claude failed: {e}")

    return summarize_with_mini(text)  # last resort, no try/except

Problems with this:

  • Success check (len > 20) is a proxy, not a real success signal
  • Hierarchy is frozen — Claude is always the backup, never the primary
  • If GPT-4o is returning low-quality outputs (not exceptions), they pass through
  • No learning — you'll be writing this same fallback chain in two years
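To make the first bullet concrete: a refusal message easily clears a length proxy, so the chain counts it as a success and returns it to the user:

```python
def looks_successful(result) -> bool:
    # The proxy check from the chain above: truthy and longer than 20 chars
    return bool(result and len(result.strip()) > 20)

# A refusal that no length check can tell apart from a real summary:
bad_output = "I'm sorry, but I can't summarize that document."
proxy_passes = looks_successful(bad_output)  # True — counted as a success
```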

After: Outcome-based routing with Kalibr

import kalibr  # Must be imported before OpenAI/Anthropic
import openai
import anthropic
from typing import Optional

def summarize_with_gpt4o(text: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Summarize in 2-3 sentences."},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        return response.choices[0].message.content
    except Exception:
        return None

def summarize_with_claude(text: str) -> Optional[str]:
    try:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Summarize in 2-3 sentences: {text}"}]
        )
        return response.content[0].text
    except Exception:
        return None

def summarize_with_mini(text: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Summarize in 2-3 sentences."},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        return response.choices[0].message.content
    except Exception:
        return None

def is_good_summary(result: Optional[str]) -> bool:
    """Real success function — define what 'good' means for your use case."""
    if not result:
        return False
    stripped = result.strip()
    # Must be at least 2 sentences, under 500 chars, not an error message
    sentences = [s.strip() for s in stripped.split('.') if s.strip()]
    return len(sentences) >= 2 and len(stripped) < 500 and "error" not in stripped.lower()

router = kalibr.Router(
    paths=[summarize_with_gpt4o, summarize_with_claude, summarize_with_mini],
    success_fn=is_good_summary,
    task="document-summarization"
)

def summarize(text: str) -> Optional[str]:
    return router.run(text)

The difference: the success function is explicit and meaningful. The routing algorithm adapts based on which paths actually pass that check. If Claude consistently produces better summaries next week, it'll see more traffic next week — without anyone touching the code.


CrewAI Integration

If you're using CrewAI, you can wrap Kalibr at the tool or LLM call level without restructuring your agents.

import kalibr  # First
from crewai import Agent, Task, Crew
from crewai.tools import BaseTool
import openai
import anthropic
from typing import Optional, Type
from pydantic import BaseModel, Field

class ResearchInput(BaseModel):
    query: str = Field(description="The research query to answer")

def research_with_gpt4o(query: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a research assistant. Answer thoroughly with sources when possible."},
                {"role": "user", "content": query}
            ],
            timeout=30
        )
        return response.choices[0].message.content
    except Exception:
        return None

def research_with_claude(query: str) -> Optional[str]:
    try:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Research and answer: {query}"}]
        )
        return response.content[0].text
    except Exception:
        return None

def is_research_complete(result: Optional[str]) -> bool:
    if not result:
        return False
    # Research should be substantive — at least 3 sentences, not an error
    sentences = [s for s in result.split('.') if len(s.strip()) > 10]
    return len(sentences) >= 3

research_router = kalibr.Router(
    paths=[research_with_gpt4o, research_with_claude],
    success_fn=is_research_complete,
    task="research"
)

class ResearchTool(BaseTool):
    name: str = "research_tool"
    description: str = "Research a topic or answer a question thoroughly"
    args_schema: Type[BaseModel] = ResearchInput

    def _run(self, query: str) -> str:
        result = research_router.run(query)
        return result or "Research failed — no result available"

# Build your CrewAI agent normally — routing is transparent
researcher = Agent(
    role="Research Analyst",
    goal="Research topics thoroughly and accurately",
    backstory="Expert at finding and synthesizing information",
    tools=[ResearchTool()],
    verbose=False
)

research_task = Task(
    description="Research the current state of AI agent frameworks",
    expected_output="A comprehensive overview of major AI agent frameworks",
    agent=researcher
)

crew = Crew(agents=[researcher], tasks=[research_task])
result = crew.kickoff()

The CrewAI agent doesn't know about routing. It calls a tool. The tool routes between models. Outcomes feed back to Kalibr. If one model starts underperforming, the router adjusts — with no CrewAI configuration changes needed.


LangChain Integration

For LangChain, the cleanest integration point is a custom LLM wrapper:

import kalibr  # First
from langchain_core.language_models import BaseLLM
from langchain_core.outputs import LLMResult, Generation
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from typing import Optional, List, Any

def call_openai_lc(prompt: str) -> Optional[str]:
    try:
        llm = ChatOpenAI(model="gpt-4o", timeout=20)
        result = llm.invoke([HumanMessage(content=prompt)])
        return result.content
    except Exception:
        return None

def call_anthropic_lc(prompt: str) -> Optional[str]:
    try:
        llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", max_tokens=1024)
        result = llm.invoke([HumanMessage(content=prompt)])
        return result.content
    except Exception:
        return None

def is_valid_response(result: Optional[str]) -> bool:
    return bool(result and len(result.strip()) > 10)

router = kalibr.Router(
    paths=[call_openai_lc, call_anthropic_lc],
    success_fn=is_valid_response,
    task="langchain-completion"
)

class KalibrRoutedLLM(BaseLLM):
    """LangChain-compatible LLM backed by Kalibr routing."""

    @property
    def _llm_type(self) -> str:
        return "kalibr-routed"

    def _generate(self, prompts: List[str], **kwargs) -> LLMResult:
        generations = []
        for prompt in prompts:
            result = router.run(prompt)
            generations.append([Generation(text=result or "")])
        return LLMResult(generations=generations)

    def _call(self, prompt: str, **kwargs) -> str:
        return router.run(prompt) or ""

# Use as a drop-in replacement anywhere LangChain expects an LLM
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

llm = KalibrRoutedLLM()
prompt_template = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)
chain = LLMChain(llm=llm, prompt=prompt_template)

result = chain.run("Thompson Sampling")

This approach lets you swap in the routed LLM anywhere in your LangChain pipeline without changing chain logic.


The Benchmark Numbers

This isn't theoretical. The Kalibr team ran controlled degradation benchmarks — simulating real model/tool degradation events — comparing hardcoded fallback systems to outcome-based routing.

Results during degradation events:

  • Hardcoded systems (including ones with fallbacks): 16-36% success rate
  • Outcome-based routing (Kalibr): 88-100% success rate

The gap is that large because hardcoded fallbacks only trigger on exceptions. Most degradation shows up as bad outputs, not errors. The router catches those because it's tracking outcomes, not just exceptions.

Full methodology and results: kalibr.systems/docs/benchmark


When This Approach Makes Sense

Use outcome-based routing when:

  • Your agent is in production with real traffic
  • You can define success programmatically (even roughly)
  • You have at least two viable execution paths
  • You want the system to adapt to model changes without manual intervention

Don't use it when:

  • You're still figuring out the basic approach — routing needs working paths to route between
  • Every output requires human judgment — you need a programmatic signal
  • Your failure modes are catastrophic — routing reduces failure rate, doesn't eliminate it

The Core Shift

Hardcoded fallbacks are a snapshot of your intuition at one point in time. They don't update. They don't learn. They treat all failures the same (exception only). They never exploit information about which path is actually working right now.

Outcome-based routing is adaptive. It treats your success function as ground truth and distributes traffic based on what that function says is working. It handles the failures that don't raise exceptions. It finds the best path as conditions change.

Your production agent will face model updates, API degradation, and input distribution shifts that you can't predict. Hardcoding paths is betting that nothing will change. Routing on outcomes is accepting that things will change and building the system to handle it.

Start with pip install kalibr. Docs at kalibr.systems/docs.


See also: Why Your AI Agent Works in Dev and Silently Fails in Production for the detection side of this problem, and the Production Agent Checklist for the full pre-flight list.
