DEV Community: Devon

Using GPT-4o-mini for Simple Tasks and GPT-4o for Complex Ones - Automatically

Devon — Mon, 30 Mar 2026 23:00:00 +0000

You are paying gpt-4o prices for tasks gpt-4o-mini handles just as well. If your application sends every request to your most capable model, you are not being safe - you are leaving money on the table and paying a reliability tax for headroom you rarely need.

This post shows how to use gpt-4o-mini for simple tasks and gpt-4o for complex ones automatically, with three working approaches ranked by sophistication.

The Cost Math

First, the numbers. As of early 2025:

gpt-4o-mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
gpt-4o: ~$2.50 per 1M input tokens, ~$10.00 per 1M output tokens

That is roughly a 16.7x difference on both input and output.

Now model a realistic workload:

# Classification task: label an email as spam/not-spam
classification_input_tokens = 200
classification_output_tokens = 10

# Synthesis task: summarize a 10-page document into executive memo
synthesis_input_tokens = 2000
synthesis_output_tokens = 400

# Cost per request (in dollars)
mini_input_rate = 0.15 / 1_000_000
mini_output_rate = 0.60 / 1_000_000
gpt4o_input_rate = 2.50 / 1_000_000
gpt4o_output_rate = 10.00 / 1_000_000

classification_cost_mini = (classification_input_tokens * mini_input_rate) + (classification_output_tokens * mini_output_rate)
classification_cost_gpt4o = (classification_input_tokens * gpt4o_input_rate) + (classification_output_tokens * gpt4o_output_rate)

synthesis_cost_mini = (synthesis_input_tokens * mini_input_rate) + (synthesis_output_tokens * mini_output_rate)
synthesis_cost_gpt4o = (synthesis_input_tokens * gpt4o_input_rate) + (synthesis_output_tokens * gpt4o_output_rate)

print(f"Classification - mini: ${classification_cost_mini:.6f}, gpt-4o: ${classification_cost_gpt4o:.6f}")
print(f"Synthesis      - mini: ${synthesis_cost_mini:.6f}, gpt-4o: ${synthesis_cost_gpt4o:.6f}")
print(f"Ratio (synthesis gpt-4o vs classification mini): {synthesis_cost_gpt4o / classification_cost_mini:.1f}x")

Output:

Classification - mini: $0.000036, gpt-4o: $0.000600
Synthesis      - mini: $0.000540, gpt-4o: $0.009000
Ratio (synthesis gpt-4o vs classification mini): 250.0x

The gap between "cheap model, cheap task" and "expensive model, expensive task" is 250x. The gap between sending a classification task to gpt-4o vs gpt-4o-mini is about 17x. At scale, that is not rounding error.
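
To see what that means at scale, here is a minimal sketch that blends the two per-request costs for the classification task above. The traffic volume (1M requests per day) and the 90% share routed to gpt-4o-mini are assumptions chosen for illustration, not measurements.

```python
# Prices per 1M tokens, copied from the figures quoted above.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def monthly_cost(requests_per_day: int, mini_share: float,
                 input_tokens: int = 200, output_tokens: int = 10,
                 days: int = 30) -> float:
    """Monthly spend when `mini_share` of requests go to gpt-4o-mini."""
    mini = request_cost("gpt-4o-mini", input_tokens, output_tokens)
    full = request_cost("gpt-4o", input_tokens, output_tokens)
    per_request = mini_share * mini + (1 - mini_share) * full
    return requests_per_day * days * per_request


# Assumed workload: 1M classification requests per day.
all_gpt4o = monthly_cost(1_000_000, mini_share=0.0)
routed = monthly_cost(1_000_000, mini_share=0.9)
print(f"All gpt-4o:         ${all_gpt4o:,.2f}/month")
print(f"90% routed to mini: ${routed:,.2f}/month")
```

Change `mini_share` and the token counts to match your own traffic. The point is that the blended cost is dominated by whatever fraction still goes to the expensive model.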

Approach 1: Rule-Based Heuristics

The simplest approach. Inspect the request before sending it, and route based on observable properties.

from openai import OpenAI

client = OpenAI()

def classify_complexity(prompt: str, task_type: str = "general") -> str:
    """Returns 'simple' or 'complex' based on heuristics."""

    simple_signals = 0

    # Short input
    if len(prompt.split()) < 100:
        simple_signals += 1

    # Extractive task types
    if task_type in ("classification", "extraction", "yes_no", "label"):
        simple_signals += 2

    # Short expected output (keywords suggest brief answers)
    brief_keywords = ["classify", "label", "extract", "identify", "is this", "yes or no", "true or false"]
    if any(kw in prompt.lower() for kw in brief_keywords):
        simple_signals += 1

    # No multi-step reasoning required
    reasoning_keywords = ["analyze", "synthesize", "compare", "evaluate", "explain why", "write a", "generate"]
    if not any(kw in prompt.lower() for kw in reasoning_keywords):
        simple_signals += 1

    return "simple" if simple_signals >= 3 else "complex"


def route_completion(prompt: str, task_type: str = "general", **kwargs):
    complexity = classify_complexity(prompt, task_type)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )

    return response, model


# Example
response, model_used = route_completion(
    "Classify this email as spam or not spam: 'You won $1,000,000! Click here!'",
    task_type="classification"
)
print(f"Used: {model_used}")
print(response.choices[0].message.content)

This works and costs nothing extra. The downside: you write the rules at deploy time, and they encode your assumptions. When traffic patterns shift, the rules do not.

Approach 2: Lightweight Classifier Call

Use a cheap model to judge whether the task needs the expensive model. The classifier call itself costs almost nothing.

import json
from openai import OpenAI

client = OpenAI()

CLASSIFIER_PROMPT = """You are a task complexity classifier. Given a user prompt, determine whether it requires:
- SIMPLE: Short, factual, extractive, or classification tasks. Single-step. Verifiable output.
- COMPLEX: Multi-step reasoning, synthesis, generation, analysis, or tasks requiring deep knowledge.

Respond with JSON only: {"complexity": "SIMPLE" | "COMPLEX", "reason": "one sentence"}"""


def classify_with_llm(user_prompt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Always use cheap model to classify
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        response_format={"type": "json_object"},
        max_tokens=100
    )
    return json.loads(response.choices[0].message.content)


def route_with_classifier(prompt: str, **kwargs):
    classification = classify_with_llm(prompt)

    if classification["complexity"] == "SIMPLE":
        model = "gpt-4o-mini"
    else:
        model = "gpt-4o"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )

    return response, model, classification


# Example
response, model_used, classification = route_with_classifier(
    "Write a detailed analysis of how transformer attention mechanisms enable in-context learning, "
    "with specific reference to the induction head hypothesis."
)
print(f"Classification: {classification}")
print(f"Used: {model_used}")

The economics work as long as your classifier saves more than it costs. A gpt-4o-mini classification call with about 100 input tokens costs roughly $0.000015 (the system prompt above adds to that, so treat it as a lower bound). If it correctly routes one request away from gpt-4o (saving ~$0.005), it pays for itself 333 times over.
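
A rough way to check those economics for your own traffic is to compute the expected saving per request and the break-even share of simple traffic. This is a sketch under an optimistic assumption: the classifier is always right. The function names are made up for illustration, and the example costs are the classification figures from the cost math section.

```python
def net_saving_per_request(classifier_cost: float, simple_share: float,
                           cost_mini: float, cost_gpt4o: float) -> float:
    """Expected saving per request versus sending everything to gpt-4o.

    Assumes a perfect classifier, so real savings will be somewhat lower.
    """
    routed = (classifier_cost
              + simple_share * cost_mini
              + (1 - simple_share) * cost_gpt4o)
    return cost_gpt4o - routed


def break_even_simple_share(classifier_cost: float, cost_mini: float,
                            cost_gpt4o: float) -> float:
    """Share of simple traffic at which the classifier call pays for itself."""
    return classifier_cost / (cost_gpt4o - cost_mini)


# Per-request costs for the spam classification task from the cost math section.
share = break_even_simple_share(0.000015, 0.000036, 0.0006)
print(f"Break-even share of simple requests: {share:.1%}")
print(f"Saving per request at 50% simple: "
      f"${net_saving_per_request(0.000015, 0.5, 0.000036, 0.0006):.6f}")
```

If only a few percent of your traffic is simple, the extra call is not worth it. Above that, the classifier pays for itself quickly.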

The problem: the classifier is also static. It learns nothing from whether your users were actually satisfied with the routed response.

Approach 3: Outcome-Based Routing with Thompson Sampling

The most robust approach. Instead of encoding rules or running a classifier, the router observes what actually works and shifts allocations based on real outcomes.

Thompson Sampling is a Bayesian bandit algorithm. Each model gets a Beta distribution representing its estimated success rate. The router samples from those distributions and picks the model that looks most promising - balancing exploitation (use what works) with exploration (try the other option occasionally to keep learning).

The key difference from approaches 1 and 2: the router updates its beliefs every time you report an outcome. It learns your specific workload.
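
To make the mechanism concrete, here is a minimal Beta-Bernoulli Thompson Sampling sketch in plain Python. It is not Kalibr's implementation. The class name `ThompsonRouter` and the simulated success rates are invented for the demo, and the reward here is success only, with cost ignored.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson Sampling over a fixed set of models."""

    def __init__(self, models, seed=None):
        # Beta(1, 1) is a uniform prior: no opinion about any model yet.
        self.alpha = {m: 1.0 for m in models}
        self.beta = {m: 1.0 for m in models}
        self.rng = random.Random(seed)

    def select(self) -> str:
        # Draw one plausible success rate per model, then pick the highest draw.
        draws = {m: self.rng.betavariate(self.alpha[m], self.beta[m])
                 for m in self.alpha}
        return max(draws, key=draws.get)

    def update(self, model: str, success: bool) -> None:
        # A success raises alpha, a failure raises beta.
        if success:
            self.alpha[model] += 1
        else:
            self.beta[model] += 1


# Simulated workload. These true success rates are made up for the demo.
TRUE_RATES = {"gpt-4o-mini": 0.90, "gpt-4o": 0.70}
router = ThompsonRouter(TRUE_RATES, seed=0)
world = random.Random(1)
picks = {m: 0 for m in TRUE_RATES}

for _ in range(2000):
    model = router.select()
    picks[model] += 1
    router.update(model, world.random() < TRUE_RATES[model])

print(picks)
```

Early on the picks are close to even. As evidence accumulates, the model with the higher observed success rate gets most of the traffic, while the other still gets an occasional exploratory request. A cost-aware variant would fold price into the reward or filter candidates by budget first.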

Using Kalibr for Approach 3

Kalibr implements Thompson Sampling routing out of the box. You define your models, set a success condition, and call router.completion(). The SDK handles the rest.

import kalibr  # Must import first
from kalibr import Router

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini", "weight": 0.8},
        {"model": "openai/gpt-4o",      "weight": 0.2},
    ],
    success_when="response.choices[0].finish_reason == 'stop' and len(response.choices[0].message.content) > 10",
    goal_id="email_classification"
)

def classify_email(email_body: str) -> str:
    response = router.completion(
        messages=[
            {"role": "system", "content": "Classify as SPAM or NOT_SPAM. Reply with one word only."},
            {"role": "user", "content": email_body}
        ]
    )
    return response.choices[0].message.content.strip()


result = classify_email("Congratulations! You've been selected for a free iPhone. Claim now!")
print(result)  # SPAM

The starting weights (0.8 mini, 0.2 gpt-4o) reflect your prior belief that mini handles most cases. As you accumulate outcomes, Thompson Sampling shifts those weights based on actual success rates per model.

You can also report explicit quality signals:

from kalibr import Router, Outcome

router = Router(
    paths=[
        {"model": "openai/gpt-4o-mini"},
        {"model": "openai/gpt-4o"},
    ],
    goal_id="customer_support_routing"
)

def handle_support_ticket(ticket: str, user_id: str) -> dict:
    response, request_id = router.completion(
        messages=[{"role": "user", "content": ticket}],
        return_request_id=True
    )

    answer = response.choices[0].message.content

    # Report outcome based on your quality check
    # (e.g., did the customer escalate? did they mark resolved?)
    router.report_outcome(
        request_id=request_id,
        outcome=Outcome.SUCCESS  # or Outcome.FAILURE
    )

    return {"answer": answer, "request_id": request_id}

Comparing the Three Approaches

Approach	Setup cost	Adapts over time	Extra latency	Requires labeling
Rule-based heuristics	Low	No	None	No
Classifier call	Medium	No	~200ms	No
Thompson Sampling (Kalibr)	Low	Yes	None	Optional

For a greenfield system with no traffic data, start with heuristics. They are good enough and cost nothing.

For a system with clear task types and a budget for a second API call, the classifier approach is more accurate and still simple.

For production systems where you care about long-term cost efficiency and your traffic mix changes over time, outcome-based routing with Thompson Sampling is the right answer. It requires no rule maintenance and gets better with use.

Summary

Using gpt-4o-mini for simple tasks and gpt-4o for complex ones automatically is not a one-time config change. It is a routing problem. The tools exist to solve it properly:

Heuristics if you want something working in 30 minutes
Classifier call if your task types are well-defined and stable
Thompson Sampling via Kalibr if you want the router to learn and maintain itself

The cost difference between getting this right and sending everything to gpt-4o is real. At 10,000 requests per day, the gap between full gpt-4o and a well-routed mix is often $200-500/month or more - and it compounds as you scale.

Why Your Agent's Eval Suite Won't Catch Production Failures

Devon — Fri, 27 Mar 2026 02:14:49 +0000

Your eval suite passed. Your agent is degrading in production. These two facts are not contradictory - they're the expected outcome when you treat offline evaluation as a sufficient signal for production reliability.

Offline evals and production outcome tracking solve different problems. Conflating them is how you end up with green CI checks and a support queue full of AI-generated nonsense.

What Evals Are Actually Measuring

A typical eval setup looks like this: you have a dataset of input/expected-output pairs, a harness that runs your agent against them, and a set of metrics (accuracy, BLEU score, LLM-as-judge ratings). You run this before deploying. If it passes, you ship.

This is useful. It catches regressions when you change your prompt, swap models, or restructure your agent logic. It gives you a baseline for comparison across configurations.

But the eval suite is measuring a fixed distribution. Your labeled dataset reflects the traffic patterns, model behaviors, and user intent distributions at the time it was created. Production traffic is a live distribution that shifts continuously.

Three failure modes that evals reliably miss:

Model drift: The model provider updates the underlying model weights. Your eval dataset was labeled against the previous behavior. The new behavior is subtly different in ways that don't trigger your existing test cases but do degrade real user outcomes. This happened after several GPT-4 updates in 2023-2024 - evals passed, production quality dropped.

Distribution shift: Your users change how they phrase requests, or new user segments start using the feature, or an upstream system change alters input format. Your eval dataset doesn't cover the new distribution. Success rates drop on inputs you've never tested.

Unknown failure modes: Evals catch what you know to test for. They don't catch failure modes you haven't encountered yet. A new adversarial pattern, an edge case in a niche use case, a prompt injection in user-supplied content - these are invisible in labeled datasets until after they've already caused problems in production.

The Point-in-Time Problem

An eval suite is a point-in-time measurement. You run it, get a score, and that score reflects the state of your system against a fixed dataset at a specific moment. The score doesn't update when the model provider changes something. It doesn't update when your user behavior shifts. It doesn't update when a downstream system introduces data quality issues.

Production is continuous. Your agent is making decisions right now, against real inputs, with real consequences. The question that matters is not "what was my accuracy score on the eval dataset last Tuesday?" - it's "what is my outcome rate on real traffic right now, and how has it changed in the last 24 hours?"

The gap between these two questions is where production failures live.

Eval Suite:                  Production Reality:
- Fixed dataset              - Live traffic stream
- Labeled ground truth       - Implicit outcome signals
- Run before deploy          - Continuous measurement
- Catches known regressions  - Catches unexpected degradation
- Measures capability        - Measures actual outcomes

A Minimal Eval Harness

Before we get to production monitoring, the eval harness still matters. Run it before every deploy. It's your regression net.

import json
from dataclasses import dataclass
from typing import Callable, Optional
import openai

@dataclass
class EvalCase:
    input: str
    expected_output: str
    metadata: Optional[dict] = None

@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    passed: bool
    score: float
    failure_reason: Optional[str] = None

def run_eval_suite(
    agent_fn: Callable[[str], str],
    dataset: list[EvalCase],
    judge_fn: Callable[[str, str], float] = None
) -> dict:
    """Run offline eval suite. Call this in CI before deploying."""
    results = []

    for case in dataset:
        try:
            actual = agent_fn(case.input)
            score = judge_fn(actual, case.expected_output) if judge_fn else _exact_match_score(actual, case.expected_output)
            results.append(EvalResult(
                case=case,
                actual_output=actual,
                passed=score >= 0.7,
                score=score
            ))
        except Exception as e:
            results.append(EvalResult(
                case=case,
                actual_output="",
                passed=False,
                score=0.0,
                failure_reason=str(e)
            ))

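    # Guard against an empty dataset, which would otherwise divide by zero.
    if not results:
        raise ValueError("dataset is empty; nothing to evaluate")

    pass_rate = sum(1 for r in results if r.passed) / len(results)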
    avg_score = sum(r.score for r in results) / len(results)
    failures = [r for r in results if not r.passed]

    return {
        "pass_rate": pass_rate,
        "avg_score": avg_score,
        "total_cases": len(results),
        "failed_cases": len(failures),
        "failure_details": [
            {"input": r.case.input[:100], "reason": r.failure_reason or f"score:{r.score:.2f}"}
            for r in failures[:10]
        ]
    }


def llm_judge(actual: str, expected: str) -> float:
    """Use GPT-4o as a judge for subjective quality evaluation."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Score how well the actual response matches the expected response. "
                           "Return a JSON object with a single key 'score' between 0.0 and 1.0. "
                           "1.0 = semantically equivalent, 0.0 = completely wrong or unrelated."
            },
            {
                "role": "user",
                "content": f"Expected:\n{expected}\n\nActual:\n{actual}"
            }
        ],
        response_format={"type": "json_object"}
    )
    # Coerce to float in case the judge returns the score as a string or int.
    return float(json.loads(response.choices[0].message.content).get("score", 0.0))


def _exact_match_score(actual: str, expected: str) -> float:
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0

This runs before deploy. It catches regressions. It is not sufficient for production reliability.

Production Outcome Tracking

In production, you don't have labeled ground truth. You have signals: did the user accept the output, did the downstream system consume it successfully, did a human reviewer approve it, did the action triggered by the output succeed?

These signals are noisier than eval scores. They're also real.

import kalibr
from kalibr import Router
import openai
import uuid

client = openai.OpenAI()

def run_agent_with_outcome_tracking(user_input: str, session_id: str) -> dict:
    goal_id = f"goal_{uuid.uuid4().hex[:12]}"

    router = Router(
        goal_id=goal_id,
        task_type="user_query_response",
        session_id=session_id
    )
    policy = router.get_policy()

    response = client.chat.completions.create(
        model=policy.model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_input}
        ]
    )
    output = response.choices[0].message.content

    # Return goal_id so the caller can record outcome when it's known
    return {
        "output": output,
        "goal_id": goal_id,
        "model": policy.model
    }


def record_user_feedback(goal_id: str, was_helpful: bool, feedback_text: str = None):
    """Called when user gives thumbs up/down or when downstream system reports result."""
    router = Router(goal_id=goal_id)
    router.record_outcome(
        success=was_helpful,
        quality_score=1.0 if was_helpful else 0.0,
        metadata={"feedback": feedback_text} if feedback_text else {}
    )


def record_downstream_result(goal_id: str, system_accepted: bool, error: str = None):
    """Called when a downstream system reports whether it could use the agent's output."""
    router = Router(goal_id=goal_id)
    router.record_outcome(
        success=system_accepted,
        error=error
    )

Now you have two streams of quality signal running in parallel:

Offline evals against your labeled dataset (run in CI, before every deploy)
Production outcomes from real user signals (continuous, updates the routing model)

The production stream feeds Kalibr's Thompson Sampling. Models that perform well on real traffic get higher selection probability. Models that degrade get deprioritized automatically, before you've written a new eval case for the failure mode.

Detecting Model Drift in Production

Model drift is the failure mode that evals are worst at catching because it happens after you deploy. Your eval suite passed against the model behavior at time T. The provider updates the model at time T+30 days. Eval still passes on re-run because your dataset was labeled against the old behavior. Production outcome rate drops.

With continuous outcome tracking, the degradation shows up as a change in the success rate time series:

import kalibr

# Check for recent performance change
drift_report = kalibr.get_insights(
    task_type="user_query_response",
    lookback_hours=48,
    compare_to_baseline_hours=168  # Compare last 48h to the 7-day baseline
)

if drift_report.performance_delta < -0.05:  # 5% relative degradation
    print(f"Performance degradation detected:")
    print(f"  Baseline success rate: {drift_report.baseline_success_rate:.1%}")
    print(f"  Current success rate: {drift_report.current_success_rate:.1%}")
    print(f"  Delta: {drift_report.performance_delta:+.1%}")
    print(f"  Affected models: {drift_report.degraded_models}")
    print(f"  Recommended: {drift_report.routing_recommendation}")

This surfaces drift without requiring you to have anticipated the specific failure mode. You don't need to add eval cases for behavior you didn't know would change. The outcome signal is model-agnostic - it measures whether the output was useful, not whether it matched your labeled expectations.
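
If you want to see what such a check amounts to without any SDK, here is a plain-Python sketch that compares a recent window of outcomes to a baseline window. The `DriftCheck` type, the `check_drift` function, and the default thresholds are assumptions for illustration and are not part of Kalibr.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriftCheck:
    baseline_rate: float
    current_rate: float
    relative_delta: float
    degraded: bool


def check_drift(baseline: list[bool], recent: list[bool],
                threshold: float = -0.05,
                min_samples: int = 50) -> Optional[DriftCheck]:
    """Compare recent success rate to the baseline success rate.

    Returns None when there is too little data to say anything.
    `threshold` is a relative change: -0.05 means a 5% relative drop.
    """
    if len(baseline) < min_samples or len(recent) < min_samples:
        return None
    baseline_rate = sum(baseline) / len(baseline)
    if baseline_rate == 0:
        return None
    current_rate = sum(recent) / len(recent)
    relative_delta = (current_rate - baseline_rate) / baseline_rate
    return DriftCheck(baseline_rate, current_rate, relative_delta,
                      degraded=relative_delta < threshold)


# Synthetic outcomes: 90% success in the baseline, 80% in the recent window.
baseline = [True] * 90 + [False] * 10
recent = [True] * 80 + [False] * 20
report = check_drift(baseline, recent)
print(report)
```

A real system would also want a significance test or a minimum effect duration so that a noisy hour does not trigger an alert.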

Complementary, Not Competing

The framing of "evals vs production monitoring" can be misleading. They're complementary tools with different jobs.

Evals: Run before deploy. Catch prompt regressions, validate model swaps against known test cases, measure capability on your labeled distribution. If eval fails, don't deploy.

Production outcome tracking: Run after deploy, continuously. Catch distribution shift, model drift, novel failure modes. If production outcomes degrade, route around the failing configuration automatically.

The workflow:

1. Write eval cases as you discover failure modes
2. Run eval suite in CI against every commit
3. Block deploy if eval pass rate drops below threshold
4. Deploy with production outcome tracking active
5. Monitor outcome rate time series for degradation
6. When degradation appears, check which models/configs are affected
7. Add the new failure mode to your eval suite so it's caught at deploy time next time

Evals are your regression net. Production tracking is your early warning system. The new failure mode you catch in production today becomes an eval case for tomorrow.
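
Step 3 of that workflow can be a very small script. The sketch below turns a report shaped like the output of run_eval_suite into a pass/block decision. The `gate_deploy` name and the 95% default threshold are assumptions, so pick a threshold that matches your own baseline.

```python
def gate_deploy(report: dict, min_pass_rate: float = 0.95) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    if report["total_cases"] == 0:
        print("No eval cases ran; blocking deploy.")
        return 1
    if report["pass_rate"] < min_pass_rate:
        print(f"Pass rate {report['pass_rate']:.1%} is below "
              f"{min_pass_rate:.1%}; blocking deploy.")
        for failure in report.get("failure_details", []):
            print(f"  {failure['input']!r}: {failure['reason']}")
        return 1
    print(f"Pass rate {report['pass_rate']:.1%}; deploy allowed.")
    return 0


# Hand-written example report with the same keys run_eval_suite returns.
example_report = {
    "pass_rate": 0.90,
    "avg_score": 0.88,
    "total_cases": 10,
    "failed_cases": 1,
    "failure_details": [{"input": "refund for order 123", "reason": "score:0.40"}],
}
exit_code = gate_deploy(example_report)
print(f"exit code: {exit_code}")
```

In CI you would pass the returned code to `sys.exit` so that a blocked deploy fails the job.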

The failure modes your evals don't cover yet are not a gap in your process - they're inevitable. No labeled dataset covers the full distribution of what real users will send. The question is whether you have a production monitoring layer that catches the failures you didn't anticipate, or whether you find out about them from your users.

For how production routing decisions work when failures are detected, see Stop Hardcoding Your AI Model Selection. For how these signals work in multi-agent pipelines where failures compound across hops, see Multi-Agent Systems Break Differently Than Single Agents.

The Real Cost of Your AI Agent (It's Not What You Think)

Devon — Fri, 27 Mar 2026 02:13:53 +0000

Your OpenAI invoice shows token counts. What it doesn't show is how many of those tokens produced nothing useful. Failed calls, retries, model over-provisioning, and calls that returned an answer nobody used - these are where agent costs actually go, and none of them appear in standard billing dashboards.

Cost Per Call vs Cost Per Successful Outcome

The metric that matters isn't cost per call. It's cost per successful outcome.

If your agent costs $0.008 per call but succeeds 60% of the time, your real cost per successful outcome is $0.013. If a different configuration costs $0.012 per call but succeeds 90% of the time, the real cost is $0.013. They're equivalent on a per-outcome basis, but only one of those configurations surfaces as "expensive" in a naive cost analysis.

This is the measurement problem: optimizing for cost per call without tracking success rate will push you toward cheaper models that fail more, which can increase cost per successful outcome while appearing to save money.
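
The arithmetic behind that comparison is a single division. A minimal sketch using the two configurations from the example above (the function name and config labels are illustrative):

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Average spend needed to get one successful outcome."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate


configs = {
    "cheaper, less reliable": (0.008, 0.60),
    "pricier, more reliable": (0.012, 0.90),
}
for name, (cost, rate) in configs.items():
    print(f"{name}: ${cost_per_success(cost, rate):.4f} per successful outcome")
```

Both lines print $0.0133, which is why per-call cost alone cannot tell these two configurations apart.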

The hidden cost drivers that inflate cost-per-outcome:

Retries: A call that retries 3 times costs 4x the nominal price. If 15% of your calls each retry 3 times, your effective cost is 1.45x the nominal per-call figure.

Failed calls that still consume tokens: Partial responses, malformed JSON that fails schema validation, outputs that pass syntactic checks but fail downstream use - these all bill full token counts.

Model over-provisioning: Using gpt-4o for a task that gpt-4o-mini handles reliably is a 25-33x cost premium at the list prices used in this post. Over-provisioning is often a conservative choice made during development that never gets revisited.

Context window bloat: System prompts that grew organically, conversation history that accumulates without pruning, tool definitions included in every call whether needed or not.
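
The retry figure is easy to reproduce. This sketch assumes every retrying call uses the same number of retries and that each attempt bills the full nominal cost. Both are simplifications, and the function name is made up.

```python
def effective_cost_multiplier(retry_share: float, retries_per_retrying_call: int) -> float:
    """Average billed attempts per logical call.

    Calls that do not retry bill once. Calls that retry bill
    1 + retries_per_retrying_call times.
    """
    no_retry = (1 - retry_share) * 1
    with_retry = retry_share * (1 + retries_per_retrying_call)
    return no_retry + with_retry


# 15% of calls retry 3 times each.
print(f"{effective_cost_multiplier(0.15, 3):.2f}x")
```

With partial responses or exponential backoff that gives up early, the real multiplier will differ, which is a reason to measure it rather than estimate it.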

The Token Cost Math

Let's run actual numbers on a classification task that runs 10,000 times per day.

The task: classify a support ticket into one of 12 categories. Input is typically 200-400 tokens (ticket text + system prompt). Output is a JSON blob with the category and confidence score - roughly 40 tokens.

With gpt-4o:

Input: 300 tokens avg * $5.00/1M = $0.0015 per call
Output: 40 tokens * $15.00/1M = $0.0006 per call
Per call: $0.0021
Daily (10k calls): $21.00
Monthly: ~$630

With gpt-4o-mini:

Input: 300 tokens * $0.15/1M = $0.000045 per call
Output: 40 tokens * $0.60/1M = $0.000024 per call
Per call: $0.000069
Daily (10k calls): $0.69
Monthly: ~$21

That's a 30x cost difference. The question is whether gpt-4o-mini is reliable enough for your classification task. If it succeeds 95% of the time and gpt-4o succeeds 97%, the per-outcome cost still favors gpt-4o-mini by roughly 30x.

Now layer in a synthesis task: summarizing escalated tickets for a human reviewer. Input is 800-1200 tokens, output is 200-300 tokens. This is where model quality meaningfully affects output usefulness.

With gpt-4o-mini for synthesis:

The summaries are functional but miss nuance. Human reviewers request rewrites 18% of the time.
Effective cost per good summary: higher than face value due to rework.

With gpt-4o for synthesis:

Higher per-call cost, but rewrite rate drops to 4%.
Once reviewer rework time is counted, cost per usable summary is lower despite the higher model price.

Classification and synthesis are different task types with different model sensitivity profiles. A static routing strategy that uses the same model for both is almost certainly wrong.
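
The synthesis comparison only comes out in gpt-4o's favor once human rework is priced in, so it helps to make that explicit. In this sketch the token counts (1,000 in, 250 out) are midpoints of the ranges above, and the $2.00 reviewer cost per rewrite is a placeholder assumption. Substitute your own figure.

```python
# Prices per 1M tokens, as used in this post.
PRICES = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def cost_per_usable_summary(model: str, rewrite_rate: float,
                            input_tokens: int = 1000, output_tokens: int = 250,
                            reviewer_cost_per_rewrite: float = 2.00) -> float:
    """API cost plus the expected human cost of requested rewrites.

    reviewer_cost_per_rewrite is an assumed dollar value of reviewer time.
    """
    p = PRICES[model]
    api_cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return api_cost + rewrite_rate * reviewer_cost_per_rewrite


mini = cost_per_usable_summary("gpt-4o-mini", rewrite_rate=0.18)
full = cost_per_usable_summary("gpt-4o", rewrite_rate=0.04)
print(f"gpt-4o-mini: ${mini:.4f} per usable summary")
print(f"gpt-4o:      ${full:.4f} per usable summary")
```

Set `reviewer_cost_per_rewrite=0` and the ordering flips: on API spend alone, gpt-4o-mini is still far cheaper. The case for the larger model rests on the cost of the people downstream.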

Cost-Constrained Routing

The goal is to route by expected outcome quality first, with cost as a secondary constraint. This is different from routing by cost first - a cheaper model that fails more doesn't save money, it shifts costs from the API bill to operational overhead and user churn.

import kalibr
from kalibr import Router

def classify_ticket(ticket_text: str, goal_id: str) -> dict:
    router = Router(
        goal_id=goal_id,
        task_type="ticket_classification"
    )

    # Cost constraint: classification is simple, budget accordingly
    policy = router.get_policy(
        constraints={"max_cost_usd": 0.002}  # Hard ceiling per call
    )

    import openai
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model=policy.model,  # Router selects from eligible models under cost ceiling
        messages=[
            {
                "role": "system",
                "content": "Classify the support ticket into one of these categories: "
                           "billing, technical, account, feature_request, complaint, "
                           "shipping, returns, warranty, installation, data, security, other. "
                           "Respond with JSON: {\"category\": \"...\", \"confidence\": 0.0-1.0}"
            },
            {"role": "user", "content": ticket_text}
        ],
        response_format={"type": "json_object"}
    )

    result = response.choices[0].message.content
    cost = _calculate_cost(policy.model, response.usage)

    router.record_outcome(
        success=_is_valid_classification(result),
        quality_score=_parse_confidence(result),
        cost_usd=cost
    )

    return {"result": result, "model": policy.model, "cost": cost}


def synthesize_summary(ticket_text: str, history: list[str], goal_id: str) -> str:
    router = Router(
        goal_id=goal_id,
        task_type="ticket_synthesis"
    )

    # Higher budget for synthesis - quality directly affects human reviewer time
    policy = router.get_policy(
        constraints={"max_cost_usd": 0.05}
    )

    import openai
    client = openai.OpenAI()

    messages = [
        {
            "role": "system",
            "content": "Summarize this support ticket and its history for a senior reviewer. "
                       "Include: core issue, customer sentiment, previous resolution attempts, "
                       "recommended next action."
        },
        {"role": "user", "content": f"Ticket: {ticket_text}\n\nHistory:\n" + "\n".join(history)}
    ]

    response = client.chat.completions.create(
        model=policy.model,
        messages=messages
    )

    content = response.choices[0].message.content
    cost = _calculate_cost(policy.model, response.usage)

    router.record_outcome(
        success=_is_useful_summary(content),
        cost_usd=cost
    )

    return content


def _calculate_cost(model: str, usage) -> float:
    # Prices per 1M tokens (update as pricing changes)
    pricing = {
        "gpt-4o": {"input": 5.00, "output": 15.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    }
    p = pricing.get(model, {"input": 5.00, "output": 15.00})
    return (usage.prompt_tokens * p["input"] + usage.completion_tokens * p["output"]) / 1_000_000

The max_cost_usd constraint tells the router which models are eligible. Within the eligible set, Thompson Sampling selects based on historical outcome rates for that task type. A model that costs $0.0018 per call but succeeds 91% of the time beats a model that costs $0.0015 per call and succeeds 80% of the time on outcome rate, which is what the router ranks by - but both beat the unconstrained gpt-4o at $0.0021 per call if your classification task doesn't need that capability.

The Trust Invariant

Cost constraints work correctly only if the routing system is already optimizing for outcomes. If the router is outcome-first - selecting models based on historical success rates, then filtering by cost - then a cost ceiling safely removes expensive models without degrading reliability below the constraint.

If cost is the primary optimization target, the system will find the cheapest model that passes some minimum bar. That minimum bar is usually defined by your development-time evals, which is a different distribution than production traffic. The result: cost goes down in testing, cost-per-outcome goes up in production.

Kalibr's router is outcome-first. The max_cost_usd constraint is a filter applied after candidate models are ranked by expected outcome quality. You're not asking "what's the cheapest model that might work?" You're asking "among the models that are likely to work, which ones fit my budget?"

This is the difference between routing and price shopping.
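
The ordering of those two steps is the whole idea, so here it is as a few lines of plain Python. This is a simplified sketch, not Kalibr's code: it ranks by a fixed success-rate estimate instead of sampling, and the candidate names `model-x` and `model-y` and their numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    model: str
    est_cost_usd: float   # estimated cost per call
    success_rate: float   # historical success rate for this task type


def pick_model(candidates: list[Candidate], max_cost_usd: float) -> Optional[Candidate]:
    """Rank by expected outcome first, then take the best one that fits the budget."""
    ranked = sorted(candidates, key=lambda c: c.success_rate, reverse=True)
    for candidate in ranked:
        if candidate.est_cost_usd <= max_cost_usd:
            return candidate
    return None  # nothing fits; the caller decides whether to raise the ceiling


candidates = [
    Candidate("model-x", est_cost_usd=0.0018, success_rate=0.91),
    Candidate("model-y", est_cost_usd=0.0015, success_rate=0.80),
    Candidate("gpt-4o", est_cost_usd=0.0021, success_rate=0.97),
]
choice = pick_model(candidates, max_cost_usd=0.002)
print(choice)
```

A price-shopping router would sort by `est_cost_usd` instead and return `model-y`. The outcome-first version returns `model-x` under the same ceiling.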

Measuring the Hidden Costs

To actually reduce cost per outcome, you need to measure the components that your billing dashboard hides:

import kalibr

# After a week of tracked outcomes
cost_report = kalibr.get_insights(
    lookback_hours=168,
    include_cost_breakdown=True
)

print(f"Total spend: ${cost_report.total_cost_usd:.2f}")
print(f"Successful outcomes: {cost_report.successful_outcomes}")
print(f"Failed calls (billed): {cost_report.failed_calls_count}")
print(f"Cost on failed calls: ${cost_report.failed_call_cost_usd:.2f}")
print(f"Retry overhead: {cost_report.retry_multiplier:.2f}x")
print(f"Cost per successful outcome: ${cost_report.cost_per_success:.4f}")

for task_type in cost_report.by_task:
    print(f"\n{task_type.name}:")
    print(f"  Success rate: {task_type.success_rate:.1%}")
    print(f"  Avg cost/call: ${task_type.avg_cost_per_call:.5f}")
    print(f"  Cost/success: ${task_type.cost_per_success:.5f}")
    print(f"  Over-provisioning flag: {task_type.uses_higher_tier_than_needed}")

The uses_higher_tier_than_needed flag is set when the router's historical data shows that cheaper models perform equivalently well on a task type but a more expensive model is being selected. This surfaces optimization opportunities that aren't visible in aggregate cost numbers.

What Changes When You Measure Correctly

The companies that reduce LLM costs meaningfully aren't the ones that found a cheaper API. They're the ones that measured cost per successful outcome, found where the hidden costs were concentrated, and optimized those. Typically:

20-30% of spend is on calls that failed but billed full tokens
15-25% is retry overhead from transient failures that weren't rate-limited correctly
30-50% is model over-provisioning on task types where a cheaper model performs equivalently

None of these appear as line items on your OpenAI invoice. All of them are measurable if you're tracking outcomes alongside costs.
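If you aren't tracking outcomes through a library yet, you can approximate the same numbers from your own call logs. The sketch below assumes a hypothetical CallRecord with a goal_id, a billed cost, and a success flag that comes from your own outcome check rather than the HTTP status. None of these names come from Kalibr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    goal_id: str     # hypothetical: one goal may span several attempts
    cost_usd: float  # what the provider billed for this call
    succeeded: bool  # result of your own outcome check, not the HTTP status


def cost_report(calls):
    """Summarize spend per call and per successful goal."""
    if not calls:
        raise ValueError("no calls to report on")
    total = sum(c.cost_usd for c in calls)
    goals = {c.goal_id for c in calls}
    successful_goals = {c.goal_id for c in calls if c.succeeded}
    return {
        "total_cost_usd": total,
        "failed_call_cost_usd": sum(c.cost_usd for c in calls if not c.succeeded),
        "retry_multiplier": len(calls) / len(goals),
        "cost_per_call": total / len(calls),
        "cost_per_success": (
            total / len(successful_goals) if successful_goals else float("inf")
        ),
    }


calls = [
    CallRecord("g1", 0.010, True),
    CallRecord("g2", 0.010, False),  # failed, then retried
    CallRecord("g2", 0.010, True),
    CallRecord("g3", 0.010, False),  # failed twice, never succeeded
    CallRecord("g3", 0.010, False),
]
report = cost_report(calls)
print(f"Cost per call:    ${report['cost_per_call']:.4f}")
print(f"Cost per success: ${report['cost_per_success']:.4f}")
```

In this toy log every call costs one cent, but each successful goal costs two and a half cents once the failed and retried calls are counted against it.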

See When Your AI Agent Should Fail Fast Instead of Retry for how retry decisions interact with cost.

Multi-Agent Systems Break Differently Than Single Agents

Devon — Fri, 27 Mar 2026 02:13:52 +0000

A single agent failing is a tractable problem. You have a bad output, a traceback, maybe a timeout. You fix the prompt or swap the model. Multi-agent pipelines fail differently: one agent produces plausible-looking garbage, the next agent consumes it without complaint, and by the time the third agent produces the final output it's confidently wrong in ways that are nearly impossible to trace back to the root cause.

This post covers the mechanics of how failures compound across agent hops, the context propagation problem, and how to instrument a pipeline so you can actually diagnose failures when they happen.

The Compounding Failure Problem

In a single-agent system, the failure surface is contained. Bad input produces bad output and you can observe both.

In a multi-agent pipeline:

Agent A → output_A → Agent B → output_B → Agent C → final_output

If Agent A produces subtly wrong output, Agent B receives it as ground truth. Agent B may produce output that looks internally consistent but is built on a flawed foundation. Agent C synthesizes a final answer from Agent B's compromised output.

The final output can fail in three distinct ways:

Hard failure - Agent C raises an exception or returns an empty result
Soft failure - Agent C returns a plausible but wrong answer with high confidence
Compounding degradation - Each hop degrades quality slightly; the final output is below the threshold for usefulness even though no individual agent "failed"

Soft failures and compounding degradation are far harder to catch. They don't surface in error logs. They surface in user complaints, downstream data quality issues, or silent business logic failures.
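A rough way to see why compounding degradation slips past per-hop checks: assume, as a simplification, that each hop preserves a fixed fraction of the quality it receives and that those fractions multiply. The 0.9 figures and the 0.75 usefulness threshold below are illustrative assumptions, not measurements.

```python
import math


def end_to_end_quality(per_hop_fidelity):
    """Multiply per-hop fidelity: each hop keeps only a fraction of what it received."""
    return math.prod(per_hop_fidelity)


hops = [0.9, 0.9, 0.9]  # hypothetical fidelity of Agents A, B, C
threshold = 0.75        # hypothetical bar for a useful answer

pipeline = end_to_end_quality(hops)
every_hop_passes = all(h >= threshold for h in hops)

print(f"every hop passes on its own: {every_hop_passes}")
print(f"end-to-end quality: {pipeline:.3f}, passes: {pipeline >= threshold}")
```

An arithmetic mean of the same per-hop scores would still read 0.9, so the aggregate you choose to track determines whether this kind of degradation is visible at all.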

A Concrete 3-Agent Pipeline

Here's a realistic pipeline: research agent pulls context, analysis agent synthesizes findings, writer agent produces the final output.

import kalibr
from kalibr import Router
from dataclasses import dataclass, field
from typing import Optional
import uuid
import time

@dataclass
class TraceCapsule:
    """Propagate context and quality signals across agent hops."""
    goal_id: str
    pipeline_id: str
    hop: int = 0
    quality_scores: list[float] = field(default_factory=list)
    failure_flags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def advance(self, quality_score: float, hop_metadata: Optional[dict] = None) -> "TraceCapsule":
        """Return a new capsule for the next hop."""
        return TraceCapsule(
            goal_id=self.goal_id,
            pipeline_id=self.pipeline_id,
            hop=self.hop + 1,
            quality_scores=self.quality_scores + [quality_score],
            failure_flags=self.failure_flags.copy(),
            metadata={**self.metadata, **(hop_metadata or {})}
        )

    def flag_failure(self, reason: str) -> "TraceCapsule":
        return TraceCapsule(
            goal_id=self.goal_id,
            pipeline_id=self.pipeline_id,
            hop=self.hop,
            quality_scores=self.quality_scores.copy(),
            failure_flags=self.failure_flags + [reason],
            metadata=self.metadata.copy()
        )

    @property
    def cumulative_quality(self) -> float:
        if not self.quality_scores:
            return 1.0
        return sum(self.quality_scores) / len(self.quality_scores)

    @property
    def has_failures(self) -> bool:
        return len(self.failure_flags) > 0

Now the three agents, each instrumenting the TraceCapsule:

import openai

client = openai.OpenAI()

def research_agent(query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 0: Retrieve and summarize relevant context."""
    router = Router(
        goal_id=trace.goal_id,
        task_type="research",
        hop=trace.hop
    )
    policy = router.get_policy()

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "You are a research assistant. Return structured findings."},
                {"role": "user", "content": f"Research this topic and return key facts:\n\n{query}"}
            ],
            temperature=0.2
        )
        content = response.choices[0].message.content

        # Assess output quality before passing downstream
        quality = _assess_research_quality(content)
        if quality < 0.4:
            trace = trace.flag_failure(f"hop_0_low_quality:{quality:.2f}")

        router.record_outcome(
            success=quality >= 0.4,
            quality_score=quality,
            tokens_used=response.usage.total_tokens
        )

        return content, trace.advance(quality, {"hop_0_model": policy.model})

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_0_exception:{type(e).__name__}")
        return "", trace.advance(0.0)


def analysis_agent(research_output: str, query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 1: Analyze the research output and extract insights."""
    # Pass upstream quality to the router; the checks below skip the call when upstream output is unusable
    router = Router(
        goal_id=trace.goal_id,
        task_type="analysis",
        hop=trace.hop,
        upstream_quality=trace.cumulative_quality
    )
    policy = router.get_policy()

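    if trace.has_failures and trace.cumulative_quality < 0.3:
        # Upstream quality is too low to invest in expensive synthesis
        trace = trace.flag_failure("hop_1_skipped_upstream_quality_too_low")
        # Record the skip, as the empty-input branch below does
        router.record_outcome(success=False, error="upstream_quality_too_low")
        return "", trace.advance(0.0)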

    if not research_output.strip():
        trace = trace.flag_failure("hop_1_empty_upstream_input")
        router.record_outcome(success=False, error="empty_input")
        return "", trace.advance(0.0)

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "Analyze the provided research. Identify key insights and gaps."},
                {"role": "user", "content": f"Original question: {query}\n\nResearch findings:\n{research_output}\n\nProvide structured analysis."}
            ],
            temperature=0.3
        )
        content = response.choices[0].message.content
        quality = _assess_analysis_quality(content)

        router.record_outcome(
            success=quality >= 0.5,
            quality_score=quality,
            tokens_used=response.usage.total_tokens
        )

        return content, trace.advance(quality, {"hop_1_model": policy.model})

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_1_exception:{type(e).__name__}")
        return "", trace.advance(0.0)


def writer_agent(analysis_output: str, query: str, trace: TraceCapsule) -> tuple[str, TraceCapsule]:
    """Hop 2: Synthesize final response."""
    router = Router(
        goal_id=trace.goal_id,
        task_type="synthesis",
        hop=trace.hop,
        upstream_quality=trace.cumulative_quality
    )
    policy = router.get_policy()

    if not analysis_output.strip() or trace.cumulative_quality < 0.25:
        trace = trace.flag_failure("hop_2_cannot_synthesize")
        router.record_outcome(success=False, error="insufficient_upstream_quality")
        return _fallback_response(query, trace), trace.advance(0.0)

    try:
        response = client.chat.completions.create(
            model=policy.model,
            messages=[
                {"role": "system", "content": "Write a clear, direct response based on the analysis provided."},
                {"role": "user", "content": f"Question: {query}\n\nAnalysis:\n{analysis_output}"}
            ],
            temperature=0.5
        )
        content = response.choices[0].message.content
        quality = _assess_synthesis_quality(content, query)

        # Final outcome for the goal - this is what matters
        router.record_goal_outcome(
            goal_id=trace.goal_id,
            success=quality >= 0.6 and not trace.has_failures,
            quality_score=quality,
            pipeline_quality=trace.cumulative_quality,
            failure_flags=trace.failure_flags
        )

        return content, trace.advance(quality)

    except Exception as e:
        router.record_outcome(success=False, error=str(e))
        trace = trace.flag_failure(f"hop_2_exception:{type(e).__name__}")
        return _fallback_response(query, trace), trace.advance(0.0)

And the pipeline orchestrator:

def run_pipeline(query: str) -> dict:
    goal_id = f"goal_{uuid.uuid4().hex[:12]}"
    pipeline_id = f"pipe_{int(time.time())}"

    trace = TraceCapsule(goal_id=goal_id, pipeline_id=pipeline_id)

    # Hop 0: Research
    research_output, trace = research_agent(query, trace)

    # Hop 1: Analysis (receives trace with hop 0 quality baked in)
    analysis_output, trace = analysis_agent(research_output, query, trace)

    # Hop 2: Synthesis
    final_output, trace = writer_agent(analysis_output, query, trace)

    return {
        "output": final_output,
        "goal_id": goal_id,
        "pipeline_quality": trace.cumulative_quality,
        "failure_flags": trace.failure_flags,
        "success": not trace.has_failures and trace.cumulative_quality >= 0.5
    }

Why Per-Request Logging Isn't Enough

Most observability setups track individual requests. OpenAI gives you token counts and latencies per call. LangSmith traces individual chain steps. That's necessary but not sufficient for multi-agent systems.

The problem: you need to know whether the goal succeeded, not just whether each LLM call returned a 200.

Consider this scenario: Agent A returns a 200 with 450 tokens used. Agent B returns a 200 with 380 tokens used. Agent C returns a 200 with 520 tokens used. Your per-request logging shows three successful calls. The user got a wrong answer.

Per-goal outcome tracking means recording a single success/failure signal against the original intent, not against each intermediate step. The TraceCapsule pattern carries that goal ID through every hop so that when Agent C records the final outcome, it's attributable to the goal that initiated the pipeline.
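To make the gap concrete, here's a small sketch that computes both views from the same data. The log records and the goal_outcomes mapping are hypothetical. In practice the goal outcome would come from whatever check you run on the final answer.

```python
from collections import defaultdict

# Hypothetical request log: every call returned HTTP 200
call_log = [
    {"goal_id": "goal_a", "hop": 0, "http_status": 200},
    {"goal_id": "goal_a", "hop": 1, "http_status": 200},
    {"goal_id": "goal_a", "hop": 2, "http_status": 200},
    {"goal_id": "goal_b", "hop": 0, "http_status": 200},
    {"goal_id": "goal_b", "hop": 1, "http_status": 200},
    {"goal_id": "goal_b", "hop": 2, "http_status": 200},
]

# Hypothetical per-goal verdicts: goal_a produced a wrong answer
goal_outcomes = {"goal_a": False, "goal_b": True}

per_request_rate = sum(c["http_status"] == 200 for c in call_log) / len(call_log)
per_goal_rate = sum(goal_outcomes.values()) / len(goal_outcomes)

# Group calls by goal so wasted calls can be attributed to failed goals
calls_by_goal = defaultdict(list)
for call in call_log:
    calls_by_goal[call["goal_id"]].append(call)
wasted_calls = {
    goal: len(calls) for goal, calls in calls_by_goal.items() if not goal_outcomes[goal]
}

print(f"per-request success: {per_request_rate:.0%}")
print(f"per-goal success:    {per_goal_rate:.0%}")
print(f"calls spent on failed goals: {wasted_calls}")
```

The request-level view reports a perfect success rate. Only the goal-level view shows that half the pipelines failed and which calls were spent on them.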

This is also how Kalibr's Thompson Sampling works at the pipeline level. Each execution path through your pipeline (which model at each hop, which retry strategy, which fallback) is a bandit arm. Outcomes recorded against the goal feed the sampler, which updates the probability distributions that determine which path gets selected next time. Per-request logs can't feed this because they don't know whether the goal succeeded.

See Why Your AI Agent Retries Are Making Things Worse for how retry decisions at individual hops interact with pipeline-level outcomes.

The Context Propagation Problem

The TraceCapsule isn't just about tracking quality scores. It solves a structural problem: agents in a pipeline have no shared memory by default. Each agent call is stateless. The capsule is the shared state.

This matters when you need to make routing decisions based on upstream quality. Without the capsule, Agent B doesn't know that Agent A produced borderline output. It will happily spend tokens on expensive synthesis of low-quality input.

With the capsule pattern:

Agent B checks trace.cumulative_quality before deciding how much to invest
The router at each hop can use upstream_quality as a feature for model selection
Failed hops are propagated forward so Agent C can decide between synthesis and fallback

The alternative, without explicit context propagation, is that each agent call is made with full model capacity regardless of upstream state. You spend the same tokens whether the pipeline is on track or already compromised.
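One way to picture "deciding how much to invest" is a small planning function that reads the capsule's state before any tokens are spent. This is a sketch, not Kalibr's routing logic: plan_hop is a hypothetical helper, the 0.3 skip threshold mirrors the analysis agent above, and the 0.6 cutoff and the model choices are assumptions.

```python
from typing import Optional, TypedDict


class HopPlan(TypedDict):
    action: str            # "skip" or "run"
    model: Optional[str]   # None when the hop is skipped


def plan_hop(upstream_quality: float, has_failures: bool) -> HopPlan:
    """Decide how much to invest in the next hop, given upstream state."""
    if has_failures and upstream_quality < 0.3:
        # Upstream is already compromised: spend nothing on this hop
        return {"action": "skip", "model": None}
    if upstream_quality < 0.6:
        # Borderline input: run, but don't pay for the most capable model
        return {"action": "run", "model": "gpt-4o-mini"}
    # Pipeline is on track: full investment is justified
    return {"action": "run", "model": "gpt-4o"}


print(plan_hop(upstream_quality=0.2, has_failures=True))
print(plan_hop(upstream_quality=0.5, has_failures=False))
print(plan_hop(upstream_quality=0.9, has_failures=False))
```

Whether borderline input deserves a cheaper model or a more capable one is a judgment call for your workload. The point is that the decision becomes possible once upstream quality travels with the request.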

Failure Modes by Hop

Each hop in the pipeline has characteristic failure modes:

Hop 0 (Research/Retrieval)

Empty retrieval: no relevant context found, agent fabricates
Partial retrieval: some context but missing key facts, downstream analysis has gaps
Hallucinated structure: agent returns well-formatted JSON with fabricated values

Hop 1 (Analysis/Synthesis)

Uncritical acceptance: agent treats Hop 0 output as ground truth regardless of quality
Over-extraction: agent finds patterns in noise, produces confident-looking analysis of garbage
Context loss: agent summarizes away the specific facts that were actually needed

Hop 2 (Output/Writer)

Confident wrongness: high-quality prose built on flawed analysis
Compounding hedges: if upstream agents hedged their outputs, the writer produces vague output
Format compliance masking failure: output passes schema validation but fails on content

The quality assessment functions (_assess_research_quality, etc.) in the example above are where you encode your domain-specific checks. They don't have to be sophisticated. A research output with fewer than 100 tokens probably failed. An analysis output with no structured sections probably failed. These heuristics, combined with per-goal outcome tracking, give you enough signal to route intelligently.

Diagnostics with get_insights()

Once you have outcome data flowing through Kalibr, you can query it to understand where your pipeline is degrading:

import kalibr

insights = kalibr.get_insights(
    goal_prefix="goal_",
    lookback_hours=24,
    group_by="hop"
)

for hop_data in insights.by_hop:
    print(f"Hop {hop_data.hop}: {hop_data.success_rate:.1%} success, "
          f"avg quality {hop_data.avg_quality:.2f}, "
          f"top failures: {hop_data.top_failure_flags[:3]}")

# Outputs something like:
# Hop 0: 94.2% success, avg quality 0.71, top failures: ['low_quality:0.38', 'timeout', 'empty_output']
# Hop 1: 88.7% success, avg quality 0.65, top failures: ['skipped_upstream_quality_too_low', 'empty_upstream_input']
# Hop 2: 91.3% success, avg quality 0.73, top failures: ['cannot_synthesize', 'hop_2_exception:RateLimitError']

This surfaces where in the pipeline quality is degrading, which failure flags are most common, and which models at which hops are performing best. Without per-goal tracking, you'd have to reconstruct this from disparate request logs.

Key Takeaways

Multi-agent pipelines require observability primitives that don't exist in single-agent setups:

TraceCapsule or equivalent - explicit context propagation across hops
Per-goal outcome tracking - success recorded against the original intent, not each LLM call
Upstream quality as a routing input - don't spend tokens synthesizing bad input
Hop-level failure flags - propagate failure signals forward so downstream agents can decide

The failure mode you really want to avoid is the pipeline that looks fine in your request logs and costs you full token spend but produces wrong answers at a 30% rate. That failure is invisible without goal-level tracking.

For more on how Thompson Sampling applies to routing decisions at each hop, see Stop Hardcoding Your AI Model Selection.

Making OpenClaw Use the Right Model for Each Task

Devon — Thu, 26 Mar 2026 23:00:01 +0000

OpenClaw picks a default model and uses it for everything - heartbeat checks, complex synthesis, quick status lookups, deep analysis. Every task costs the same. That's expensive and unnecessary.

This post covers how to wire Kalibr into an OpenClaw agent so it routes each task to the right model automatically. If you run an OpenClaw deployment, this is probably the highest-ROI change you can make to your token spend.

Why OpenClaw Defaults This Way

OpenClaw is configured at the session level, not the task level. Your CLAUDE.md or session config sets one model, and that model handles whatever comes in. This makes setup simple, but it means:

A heartbeat status check costs the same as a codebase analysis
A simple "is this service up?" poll runs on the same model as "refactor this module"
There's no mechanism to say "use cheap for low-stakes, use capable for high-stakes"

Kalibr adds that mechanism. You query it before each task to get a routing recommendation, then pass that model to the OpenAI/Anthropic client. The routing adapts based on task type and recent outcome data.

The Basic Pattern: get_policy() Before Each Task

import time

import kalibr  # imported before openai, per Kalibr's import-order requirement
import openai

# initialize Kalibr before creating the OpenAI client
kalibr.init()

client = openai.OpenAI()

def run_agent_task(
    task_type: str,
    prompt: str,
    quality_priority: float = 0.5  # 0.0 = optimize cost, 1.0 = optimize quality
) -> str:
    """
    OpenClaw agent task runner with Kalibr routing.
    task_type: "heartbeat", "analysis", "synthesis", "extraction", etc.
    """
    # get routing recommendation before the call
    policy = kalibr.get_policy(task_context={
        "task_type": task_type,
        "quality_priority": quality_priority,
    })

    # time the call so latency_ms reports latency, not token count
    started = time.perf_counter()
    response = client.chat.completions.create(
        model=policy.recommended_model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency_ms = (time.perf_counter() - started) * 1000

    content = response.choices[0].message.content

    # report back so the router learns from this outcome;
    # success=True only says the call returned, so swap in a real output check
    kalibr.record_outcome(
        policy_id=policy.id,
        success=True,
        latency_ms=latency_ms
    )

    return content

Now you have two levers: task_type tells Kalibr what kind of work this is, and quality_priority expresses how much you care about output quality vs cost for this specific call. A heartbeat check is quality_priority=0.1. A code review is quality_priority=0.9.

Wiring Into the Heartbeat

OpenClaw agents typically run a heartbeat - periodic status checks, health pings, watching for events. These are almost always low-complexity tasks that don't need a capable model.

Here's how to wire Kalibr's get_insights() into your heartbeat loop:

import kalibr
import openai
import time
import logging

kalibr.init()
client = openai.OpenAI()
logger = logging.getLogger(__name__)

def heartbeat_check(services: list[str]) -> dict:
    """
    Low-cost heartbeat: route to cheapest model that can handle status checks.
    """
    policy = kalibr.get_policy(task_context={
        "task_type": "heartbeat",
        "quality_priority": 0.1,  # cost-optimize
        "latency_budget_ms": 3000
    })

    prompt = f"""
    Check status for these services and return JSON:
    {services}

    Format: {{"service_name": "ok|degraded|down", ...}}
    """

    response = client.chat.completions.create(
        model=policy.recommended_model,  # will be mini or equivalent
        messages=[
            {"role": "system", "content": "Return only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

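    import json

    # record success only if the reply actually parses as JSON
    try:
        status = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        kalibr.record_outcome(policy_id=policy.id, success=False)
        raise

    kalibr.record_outcome(policy_id=policy.id, success=True)
    return status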


def get_routing_insights() -> dict:
    """
    Pull Kalibr insights to surface routing anomalies in your heartbeat.
    Useful for: detecting when a model is degrading, cost spikes, etc.
    """
    insights = kalibr.get_insights(
        lookback_hours=1,
        include_cost_breakdown=True,
        include_model_performance=True
    )

    anomalies = []

    # flag if cost per call jumped significantly
    if insights.cost_per_call_delta_pct > 20:
        anomalies.append(
            f"Cost per call up {insights.cost_per_call_delta_pct:.0f}% in last hour"
        )

    # flag if a model's success rate dropped
    for model, perf in insights.model_performance.items():
        if perf.success_rate < 0.85:
            anomalies.append(
                f"{model} success rate: {perf.success_rate:.0%} (below threshold)"
            )

    return {
        "anomalies": anomalies,
        "total_cost_1h": insights.total_cost_usd,
        "calls_1h": insights.total_calls,
        "primary_model": insights.most_used_model
    }


def run_heartbeat_loop(interval_seconds: int = 60):
    """Main heartbeat loop with integrated Kalibr monitoring."""
    services = ["api.service.com", "db.service.com", "queue.service.com"]

    cycle = 0
    while True:
        # status check on cheap model
        status = heartbeat_check(services)

        # routing insights every 5 cycles (counted, so clock drift can't skip or repeat a check)
        if cycle % 5 == 0:
            insights = get_routing_insights()
            for anomaly in insights["anomalies"]:
                logger.warning(f"[Kalibr] {anomaly}")

        logger.info(f"Heartbeat: {status}")
        cycle += 1
        time.sleep(interval_seconds)

The get_insights() call surfaces routing-level signals you wouldn't otherwise see: is gpt-4o-mini suddenly returning worse results? Is your cost per call drifting up because more tasks are getting classified as complex? This is the data that tells you whether your routing is working before your bill tells you it isn't.

The CLAUDE.md Pattern for Router-Native Code

If you want every coding session to generate code that uses Kalibr routing by default, add this to your CLAUDE.md:

## LLM Call Pattern

Any code that calls an LLM must use Kalibr routing. Pattern:

import kalibr  # MUST be first import before openai/anthropic
import openai

kalibr.init()
client = openai.OpenAI()

# before each LLM call:
policy = kalibr.get_policy(task_context={
    "task_type": "",            # heartbeat | extraction | synthesis | analysis | generation
    "quality_priority": <0-1>   # 0.1 for cheap tasks, 0.9 for quality-critical
})

response = client.chat.completions.create(
    model=policy.recommended_model,  # never hardcode model names
    messages=[...]
)

kalibr.record_outcome(policy_id=policy.id, success=True)

Never hardcode model names like "gpt-4o" or "gpt-4o-mini". Always use `policy.recommended_model`.
Import order is critical: kalibr must be imported before openai.

With this in CLAUDE.md, every time you ask your OpenClaw agent to write code that calls an LLM, it generates Router-native code by default. You don't have to remember to add routing - the pattern is baked into the session context.

Classifying Tasks in an OpenClaw Agent

The routing is only as good as the task classification. Here's a simple taxonomy that maps well to what OpenClaw agents actually do:

from enum import Enum

class AgentTaskType(str, Enum):
    # cheap - route to mini
    HEARTBEAT = "heartbeat"           # status checks, health pings
    EXTRACTION = "extraction"         # pull structured data from text
    CLASSIFICATION = "classification" # categorize input
    FORMATTING = "formatting"         # convert format, clean text

    # moderate - route based on recent performance
    SUMMARIZATION = "summarization"   # condense content
    SEARCH_QUERY = "search_query"     # generate search queries

    # expensive - route to capable model
    SYNTHESIS = "synthesis"           # combine multiple sources
    CODE_REVIEW = "code_review"       # review and critique code
    CODE_GENERATION = "code_generation"  # write new code
    ANALYSIS = "analysis"             # deep reasoning over data
    ARCHITECTURE = "architecture"     # system design decisions

TASK_QUALITY_PRIORITY = {
    AgentTaskType.HEARTBEAT: 0.1,
    AgentTaskType.EXTRACTION: 0.2,
    AgentTaskType.CLASSIFICATION: 0.2,
    AgentTaskType.FORMATTING: 0.1,
    AgentTaskType.SUMMARIZATION: 0.5,
    AgentTaskType.SEARCH_QUERY: 0.4,
    AgentTaskType.SYNTHESIS: 0.85,
    AgentTaskType.CODE_REVIEW: 0.9,
    AgentTaskType.CODE_GENERATION: 0.9,
    AgentTaskType.ANALYSIS: 0.85,
    AgentTaskType.ARCHITECTURE: 0.95,
}

def agent_call(task_type: AgentTaskType, prompt: str) -> str:
    priority = TASK_QUALITY_PRIORITY[task_type]

    policy = kalibr.get_policy(task_context={
        "task_type": task_type.value,
        "quality_priority": priority
    })

    response = client.chat.completions.create(
        model=policy.recommended_model,
        messages=[{"role": "user", "content": prompt}]
    )

    kalibr.record_outcome(policy_id=policy.id, success=True)
    return response.choices[0].message.content

What This Gets You

For a typical OpenClaw agent running:

100 heartbeat checks per day
50 extraction tasks per day
20 synthesis tasks per day
10 code generation tasks per day

If you were previously running everything on gpt-4o, routing heartbeat and extraction tasks to gpt-4o-mini alone cuts roughly 60-70% of your token spend on those task types. The synthesis and code generation calls still run on the capable model. Your output quality doesn't change for the tasks that require it.

The get_insights() integration in your heartbeat loop gives you visibility into whether the routing is actually working - not just "is the model returning a response" but "are the routing weights optimized for your actual workload."

This is the only post on the internet about OpenClaw model optimization, so if you're here, you found it. The pattern is simple: get_policy() before each task, record_outcome() after. Everything else is just wiring it into the right call sites.

Stop Hardcoding Model Fallbacks: Let Production Data Pick Your Paths

Devon — Thu, 26 Mar 2026 04:21:14 +0000

You've seen the pattern. Maybe you've written it:

def call_model(prompt: str) -> str:
    try:
        return call_gpt4o(prompt)
    except Exception:
        try:
            return call_claude(prompt)
        except Exception:
            return call_gpt4o_mini(prompt)

Nested try/excepts. A fallback chain. It feels like resilience. It's not.

This pattern has three problems that compound in production: it's static, it's exception-only, and it never learns.

Why Hardcoded Fallbacks Break Down

Problem 1: They only catch exceptions.

The most common production failure mode for LLM calls is not an exception. It's a valid API response with bad output: the model returns HTTP 200 with a response that doesn't parse, doesn't match your schema, or gives an answer that's well-formed but factually wrong.

Your try/except doesn't catch any of that. The broken response flows downstream as if it succeeded.

Problem 2: They're static.

You wrote the fallback order once, based on your intuitions at that moment. gpt-4o → claude-3-5-sonnet → gpt-4o-mini. That hierarchy doesn't update. If Claude starts consistently outperforming GPT-4o on your specific task next month, your code still tries GPT-4o first, every time.

This isn't hypothetical. Model behavior changes with every update. The model that was best for your use case six months ago might not be best today.

Problem 3: They don't distribute load intelligently.

Your fallback chain means the primary model absorbs 100% of initial requests. The fallback sees only overflow traffic from failures. If the primary model is 90% reliable and the fallback is 95% reliable for your task, a static chain never exploits that. It just retries failures on the better model.
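The arithmetic behind that claim is short enough to check. The sketch below uses the 90% and 95% figures from the paragraph above and assumes failures are independent and always detected, which is generous to the static chain.

```python
def chain_stats(reliabilities):
    """Expected calls per request and overall success rate for a static fallback chain.

    Assumes independent failures and that every failure is detected.
    """
    expected_calls = 0.0
    p_reach = 1.0  # probability a request gets as far as this position in the chain
    for reliability in reliabilities:
        expected_calls += p_reach
        p_reach *= 1.0 - reliability
    return expected_calls, 1.0 - p_reach


primary_first = chain_stats([0.90, 0.95])  # the hardcoded order
better_first = chain_stats([0.95, 0.90])   # the order the data would suggest

print(f"hardcoded order: {primary_first[0]:.2f} calls/request, {primary_first[1]:.1%} success")
print(f"better-first:    {better_first[0]:.2f} calls/request, {better_first[1]:.1%} success")
```

Both orders end at the same overall success rate, but the hardcoded order pays for 1.10 calls per request instead of 1.05 and fails on the first attempt twice as often. A static chain has no way to notice that and swap.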

What Thompson Sampling Actually Does (Plain English)

Thompson Sampling is a Bayesian decision algorithm for choosing between options when you want to balance exploration (trying all options) with exploitation (using the best-known option).

In plain terms: it keeps a probability distribution for each path representing "how likely is this path to succeed?" It samples from those distributions to make routing decisions, which weights traffic toward paths with better outcomes. When a path succeeds, its distribution shifts upward. When it fails, it shifts downward.

The result: a path that's working gets more traffic. A path that's degrading gets less, automatically.

It's not magic. It's weighted random selection with learning. But the key insight is that the weighting updates in real time, based on real outcomes, with no human in the loop.

Here's a simplified mental model of what's happening under the hood:

# Conceptual — not actual Kalibr internals
import random

class SimpleBandit:
    def __init__(self, n_paths: int):
        # Beta distribution parameters: alpha=successes+1, beta=failures+1
        self.alphas = [1] * n_paths
        self.betas = [1] * n_paths

    def choose(self) -> int:
        """Sample from each path's success probability distribution, pick the best."""
        samples = [
            random.betavariate(self.alphas[i], self.betas[i])
            for i in range(len(self.alphas))
        ]
        return samples.index(max(samples))

    def update(self, path_index: int, success: bool):
        """Update the distribution based on observed outcome."""
        if success:
            self.alphas[path_index] += 1
        else:
            self.betas[path_index] += 1

In early deployment with equal priors, all paths get roughly equal traffic. As outcomes accumulate, traffic shifts to better-performing paths. If a previously good path starts degrading, the algorithm can detect it within dozens of requests and redistribute traffic, though in the simplified version above a path with a long success history takes longer to fall out of favor.
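You can watch the traffic shift in a quick simulation. This is the same conceptual bandit as above with an injectable random generator so runs are repeatable. The two "true" success rates are invented for the demo, and none of this reflects Kalibr internals.

```python
import random


class SimpleBandit:
    """Conceptual Beta-Bernoulli bandit, with an injectable RNG for repeatable runs."""

    def __init__(self, n_paths: int, rng: random.Random):
        self.alphas = [1] * n_paths  # successes + 1
        self.betas = [1] * n_paths   # failures + 1
        self.rng = rng

    def choose(self) -> int:
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alphas, self.betas)]
        return samples.index(max(samples))

    def update(self, path_index: int, success: bool) -> None:
        if success:
            self.alphas[path_index] += 1
        else:
            self.betas[path_index] += 1


def simulate(true_success_rates, rounds, seed=0):
    """Route `rounds` requests and return how many each path received."""
    rng = random.Random(seed)
    bandit = SimpleBandit(len(true_success_rates), rng)
    pulls = [0] * len(true_success_rates)
    for _ in range(rounds):
        path = bandit.choose()
        pulls[path] += 1
        # Simulated outcome: succeed with the path's hidden true rate
        bandit.update(path, rng.random() < true_success_rates[path])
    return pulls


pulls = simulate(true_success_rates=[0.60, 0.90], rounds=2000)
print(f"requests per path: {pulls}")
```

Nothing in the code names a primary or a fallback. The path with the higher hidden success rate ends up with most of the traffic purely because its recorded outcomes are better.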

The Before/After Comparison

Before: Manual fallback chain

import openai
import anthropic
import json
import logging

logger = logging.getLogger(__name__)

def summarize_with_gpt4o(text: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize in 2-3 sentences."},
            {"role": "user", "content": text}
        ],
        timeout=20
    )
    return response.choices[0].message.content

def summarize_with_claude(text: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{"role": "user", "content": f"Summarize in 2-3 sentences: {text}"}]
    )
    return response.content[0].text

def summarize_with_mini(text: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize in 2-3 sentences."},
            {"role": "user", "content": text}
        ],
        timeout=20
    )
    return response.choices[0].message.content

def summarize(text: str) -> str:
    """Hardcoded fallback chain."""
    try:
        result = summarize_with_gpt4o(text)
        if result and len(result.strip()) > 20:
            return result
    except Exception as e:
        logger.warning(f"GPT-4o failed: {e}")

    try:
        result = summarize_with_claude(text)
        if result and len(result.strip()) > 20:
            return result
    except Exception as e:
        logger.warning(f"Claude failed: {e}")

    return summarize_with_mini(text)  # last resort, no try/except

Problems with this:

Success check (len > 20) is a proxy, not a real success signal
Hierarchy is frozen — Claude is always the backup, never the primary
If GPT-4o is returning low-quality outputs (not exceptions), they pass through
No learning — you'll be writing this same fallback chain in two years

After: Outcome-based routing with Kalibr

import kalibr  # Must be imported before OpenAI/Anthropic
import openai
import anthropic
from typing import Optional

def summarize_with_gpt4o(text: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Summarize in 2-3 sentences."},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        return response.choices[0].message.content
    except Exception:
        return None

def summarize_with_claude(text: str) -> Optional[str]:
    try:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Summarize in 2-3 sentences: {text}"}]
        )
        return response.content[0].text
    except Exception:
        return None

def summarize_with_mini(text: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Summarize in 2-3 sentences."},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        return response.choices[0].message.content
    except Exception:
        return None

def is_good_summary(result: Optional[str]) -> bool:
    """Real success function — define what 'good' means for your use case."""
    if not result:
        return False
    stripped = result.strip()
    # Must be at least 2 sentences, under 500 chars, not an error message
    sentences = [s.strip() for s in stripped.split('.') if s.strip()]
    return len(sentences) >= 2 and len(stripped) < 500 and "error" not in stripped.lower()

router = kalibr.Router(
    paths=[summarize_with_gpt4o, summarize_with_claude, summarize_with_mini],
    success_fn=is_good_summary,
    task="document-summarization"
)

def summarize(text: str) -> Optional[str]:
    return router.run(text)

The difference: the success function is explicit and meaningful. The routing algorithm adapts based on which paths actually pass that check. If Claude consistently produces better summaries next week, it'll see more traffic next week — without anyone touching the code.

CrewAI Integration

If you're using CrewAI, you can wrap Kalibr at the tool or LLM call level without restructuring your agents.

import kalibr  # First
from crewai import Agent, Task, Crew
from crewai.tools import BaseTool
import openai
import anthropic
from typing import Optional, Type
from pydantic import BaseModel, Field

class ResearchInput(BaseModel):
    query: str = Field(description="The research query to answer")

def research_with_gpt4o(query: str) -> Optional[str]:
    try:
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a research assistant. Answer thoroughly with sources when possible."},
                {"role": "user", "content": query}
            ],
            timeout=30
        )
        return response.choices[0].message.content
    except Exception:
        return None

def research_with_claude(query: str) -> Optional[str]:
    try:
        client = anthropic.Anthropic()
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Research and answer: {query}"}]
        )
        return response.content[0].text
    except Exception:
        return None

def is_research_complete(result: Optional[str]) -> bool:
    if not result:
        return False
    # Research should be substantive — at least 3 sentences, not an error
    sentences = [s for s in result.split('.') if len(s.strip()) > 10]
    return len(sentences) >= 3

research_router = kalibr.Router(
    paths=[research_with_gpt4o, research_with_claude],
    success_fn=is_research_complete,
    task="research"
)

class ResearchTool(BaseTool):
    name: str = "research_tool"
    description: str = "Research a topic or answer a question thoroughly"
    args_schema: Type[BaseModel] = ResearchInput

    def _run(self, query: str) -> str:
        result = research_router.run(query)
        return result or "Research failed — no result available"

# Build your CrewAI agent normally — routing is transparent
researcher = Agent(
    role="Research Analyst",
    goal="Research topics thoroughly and accurately",
    backstory="Expert at finding and synthesizing information",
    tools=[ResearchTool()],
    verbose=False
)

research_task = Task(
    description="Research the current state of AI agent frameworks",
    expected_output="A comprehensive overview of major AI agent frameworks",
    agent=researcher
)

crew = Crew(agents=[researcher], tasks=[research_task])
result = crew.kickoff()

The CrewAI agent doesn't know about routing. It calls a tool. The tool routes between models. Outcomes feed back to Kalibr. If one model starts underperforming, the router adjusts — with no CrewAI configuration changes needed.

LangChain Integration

For LangChain, the cleanest integration point is a custom LLM wrapper:

import kalibr  # First
from langchain_core.language_models import BaseLLM
from langchain_core.outputs import LLMResult, Generation
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from typing import Optional, List, Any

def call_openai_lc(prompt: str) -> Optional[str]:
    try:
        llm = ChatOpenAI(model="gpt-4o", timeout=20)
        result = llm.invoke([HumanMessage(content=prompt)])
        return result.content
    except Exception:
        return None

def call_anthropic_lc(prompt: str) -> Optional[str]:
    try:
        llm = ChatAnthropic(model="claude-3-5-sonnet-20241022", max_tokens=1024)
        result = llm.invoke([HumanMessage(content=prompt)])
        return result.content
    except Exception:
        return None

def is_valid_response(result: Optional[str]) -> bool:
    return bool(result and len(result.strip()) > 10)

router = kalibr.Router(
    paths=[call_openai_lc, call_anthropic_lc],
    success_fn=is_valid_response,
    task="langchain-completion"
)

class KalibrRoutedLLM(BaseLLM):
    """LangChain-compatible LLM backed by Kalibr routing."""

    @property
    def _llm_type(self) -> str:
        return "kalibr-routed"

    def _generate(self, prompts: List[str], **kwargs) -> LLMResult:
        generations = []
        for prompt in prompts:
            result = router.run(prompt)
            generations.append([Generation(text=result or "")])
        return LLMResult(generations=generations)

    def _call(self, prompt: str, **kwargs) -> str:
        return router.run(prompt) or ""

# Use as a drop-in replacement anywhere LangChain expects an LLM
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

llm = KalibrRoutedLLM()
prompt_template = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)
chain = LLMChain(llm=llm, prompt=prompt_template)

result = chain.run("Thompson Sampling")

This approach lets you swap in the routed LLM anywhere in your LangChain pipeline without changing chain logic.

The Benchmark Numbers

This isn't theoretical. The Kalibr team ran controlled degradation benchmarks — simulating real model/tool degradation events — comparing hardcoded fallback systems to outcome-based routing.

Results during degradation events:

Hardcoded systems (including ones with fallbacks): 16-36% success rate
Outcome-based routing (Kalibr): 88-100% success rate

The gap is that large because hardcoded fallbacks only trigger on exceptions. Most degradation shows up as bad outputs, not errors. The router catches those because it's tracking outcomes, not just exceptions.

Full methodology and results: kalibr.systems/docs/benchmark

When This Approach Makes Sense

Use outcome-based routing when:

Your agent is in production with real traffic
You can define success programmatically (even roughly)
You have at least two viable execution paths
You want the system to adapt to model changes without manual intervention

Don't use it when:

You're still figuring out the basic approach — routing needs working paths to route between
Every output requires human judgment — you need a programmatic signal
Your failure modes are catastrophic — routing reduces failure rate, doesn't eliminate it

The Core Shift

Hardcoded fallbacks are a snapshot of your intuition at one point in time. They don't update. They don't learn. They treat all failures the same (exception only). They never exploit information about which path is actually working right now.

Outcome-based routing is adaptive. It treats your success function as ground truth and distributes traffic based on what that function says is working. It handles the failures that don't raise exceptions. It finds the best path as conditions change.
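
To make "distributes traffic based on what that function says is working" concrete, here is a minimal sketch of Thompson Sampling over two paths. It is a toy Beta-Bernoulli bandit written for illustration, not Kalibr's implementation; `ThompsonRouter`, `good_path`, and `degraded_path` are hypothetical names.

```python
import random
from typing import Callable, Dict, List, Optional


class ThompsonRouter:
    """Toy Beta-Bernoulli Thompson Sampling over named paths (illustrative only)."""

    def __init__(
        self,
        paths: Dict[str, Callable[[str], Optional[str]]],
        success_fn: Callable[[Optional[str]], bool],
        seed: Optional[int] = None,
    ) -> None:
        self.paths = paths
        self.success_fn = success_fn
        # One [successes, failures] pair per path; the +1 in choose() is a Beta(1, 1) prior.
        self.stats: Dict[str, List[int]] = {name: [0, 0] for name in paths}
        self.rng = random.Random(seed)

    def choose(self) -> str:
        # Sample a plausible success rate for each path and pick the highest draw.
        draws = {
            name: self.rng.betavariate(s + 1, f + 1)
            for name, (s, f) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def run(self, text: str) -> Optional[str]:
        name = self.choose()
        try:
            result = self.paths[name](text)
        except Exception:
            result = None
        ok = self.success_fn(result)
        # Bad outputs count as failures even though nothing raised.
        self.stats[name][0 if ok else 1] += 1
        return result if ok else None


def good_path(text: str) -> Optional[str]:
    return f"summary of {text}"


def degraded_path(text: str) -> Optional[str]:
    return ""  # no exception, just an unusable output


router = ThompsonRouter(
    {"good": good_path, "degraded": degraded_path}, success_fn=bool, seed=0
)
for _ in range(200):
    router.run("doc")
```

After a handful of failed draws the degraded path's sampled success rate is almost always lower than the healthy path's, so it stops receiving traffic without any exception ever being raised.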

Your production agent will face model updates, API degradation, and input distribution shifts that you can't predict. Hardcoding paths is betting that nothing will change. Routing on outcomes is accepting that things will change and building the system to handle it.

Start with pip install kalibr. Docs at kalibr.systems/docs.

See also: Why Your AI Agent Works in Dev and Silently Fails in Production for the detection side of this problem, and the Production Agent Checklist for the full pre-flight list.

The Production Agent Checklist: What Every AI Agent Needs Before It Touches Real Users

Devon — Thu, 26 Mar 2026 04:21:11 +0000

Most AI agents that reach production aren't ready for it. They work in demos. They pass the tests the developer wrote. Then they hit real users and start failing in ways that are hard to detect and harder to debug.

This is a practical checklist. Not "10 tips to improve your AI," not a sales pitch — a real pre-flight list for teams shipping Python agents to production. Work through it before you flip the traffic switch.

1. Error Handling That Actually Handles Errors

The wrong version:

def call_llm(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

This crashes on rate limits, network errors, and API outages. It also returns empty strings or None if the model returns an unexpected response format — which happens more than you'd think.

The right version:

import logging
import time
import openai
from typing import Optional

logger = logging.getLogger(__name__)

def call_llm(prompt: str, max_retries: int = 3) -> Optional[str]:
    last_error = None

    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                timeout=30
            )
            content = response.choices[0].message.content
            if not content or not content.strip():
                logger.warning(f"Empty response on attempt {attempt + 1}")
                continue
            return content

        except openai.RateLimitError as e:
            wait = 2 ** attempt  # exponential backoff
            logger.warning(f"Rate limited, waiting {wait}s (attempt {attempt + 1})")
            time.sleep(wait)
            last_error = e

        except openai.APITimeoutError as e:
            logger.warning(f"Timeout on attempt {attempt + 1}")
            last_error = e

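        except openai.APIConnectionError as e:
            logger.warning(f"Connection error on attempt {attempt + 1}: {e}")
            last_error = e

        except openai.APIStatusError as e:
            logger.error(f"API error {e.status_code}: {e}")
            last_error = e
            if e.status_code < 500:
                break  # Don't retry on 4xx errors; 5xx falls through to the next attempt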

    logger.error(f"All attempts failed. Last error: {last_error}")
    return None

Checklist items here:

[ ] Rate limit errors trigger exponential backoff, not immediate re-raise
[ ] Timeout is set explicitly — don't rely on the SDK default (some have none)
[ ] Empty/null responses are handled, not silently passed downstream
[ ] 4xx errors (bad request, auth failure) are not retried
[ ] All failures are logged with enough context to debug
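
The "retry or not" decision from this checklist can be isolated into one small function. This is a sketch under assumptions: `should_retry` and `RETRYABLE_STATUS` are hypothetical names, and the exact set of retryable codes is something to tune for your provider. With the OpenAI SDK, the status code would typically come from the `status_code` attribute of `openai.APIStatusError`.

```python
from typing import Optional

# Client-side codes that are usually transient: request timeout and rate limit.
# Assumption: adjust this set to match your provider's documented behavior.
RETRYABLE_STATUS = {408, 429}


def should_retry(status_code: Optional[int]) -> bool:
    """Return True when a failed HTTP call is worth retrying.

    None means no HTTP response arrived at all (connection error or timeout).
    """
    if status_code is None:
        return True
    if status_code in RETRYABLE_STATUS:
        return True
    # Server-side errors may clear up; remaining 4xx errors will not.
    return status_code >= 500
```

Keeping this in one place means a bad request or an auth failure fails fast everywhere, instead of each call site deciding differently.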

2. Retry Logic With Jitter

Exponential backoff without jitter causes thundering herd: all your retrying clients hit the API at the same time, get rate limited again, back off the same amount, and pile up again.

import random
import time

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: random value between 0 and min(cap, base * 2^attempt)"""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

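# Usage (assumes `import openai` and a call_llm variant that lets
# RateLimitError propagate instead of catching it internally)
max_retries = 5
result = None
for attempt in range(max_retries):
    try:
        result = call_llm(prompt)
        break
    except openai.RateLimitError:
        if attempt < max_retries - 1:
            sleep_time = backoff_with_jitter(attempt)
            time.sleep(sleep_time)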

The tenacity library handles this well if you don't want to roll it yourself:

import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type

@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(5)
)
def call_llm_with_retry(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

Checklist:

[ ] Retries use jitter, not pure exponential backoff
[ ] Max retry count is set (don't retry forever)
[ ] Total retry budget (max wait time) is bounded
[ ] Retry logic is not duplicated across call sites — centralize it
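
The "total retry budget" item is the one the snippets above don't show. A minimal sketch, assuming a hypothetical helper named `retry_with_budget`: it combines a max attempt count with a wall-clock deadline, and clips each jittered sleep so it never overshoots the remaining budget. The `sleep` and `clock` parameters exist only to make the helper testable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_budget(
    func: Callable[[], T],
    max_attempts: int = 5,
    budget_seconds: float = 30.0,
    base: float = 1.0,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call func until it succeeds, attempts run out, or the time budget is spent."""
    deadline = clock() + budget_seconds
    last_error: Optional[Exception] = None
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return func()
        except Exception as e:  # narrow this to retryable errors in real code
            last_error = e
        remaining = deadline - clock()
        if attempts >= max_attempts or remaining <= 0:
            break
        # Full jitter, clipped so one sleep cannot exceed what is left of the budget.
        delay = random.uniform(0, min(cap, base * (2 ** (attempts - 1))))
        sleep(min(delay, remaining))
    raise RuntimeError(f"gave up after {attempts} attempt(s)") from last_error


calls = {"n": 0}


def flaky() -> str:
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = retry_with_budget(flaky, sleep=lambda seconds: None)
```

Centralizing retries in one helper like this also covers the last checklist item: call sites pass a function in rather than re-implementing the loop.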

3. Fallback Paths

A fallback is a different execution path you switch to when the primary path fails. This is distinct from retrying — retrying hits the same path again; fallbacks try something different.

Common fallback patterns:

from typing import Optional

def extract_with_gpt4o(text: str) -> Optional[dict]:
    # Primary path
    ...

def extract_with_claude(text: str) -> Optional[dict]:
    # Fallback path
    ...

def extract_with_regex(text: str) -> Optional[dict]:
    # Last-resort deterministic fallback
    import re
    # Simple pattern matching — less capable but always works
    ...

def extract_order(text: str) -> dict:
    result = extract_with_gpt4o(text)
    if result:
        return result

    logger.warning("GPT-4o extraction failed, trying Claude")
    result = extract_with_claude(text)
    if result:
        return result

    logger.warning("Claude extraction failed, trying regex fallback")
    result = extract_with_regex(text)
    if result:
        return result

    raise ValueError("All extraction paths failed")

This is better than no fallback. It has a serious problem though: the fallback selection is static. You wrote it once, it stays that way forever. If Claude starts outperforming GPT-4o in production, your code still tries GPT-4o first every time.

We'll address this in Post 3 on dynamic routing, but the checklist item here is simply: do you have a fallback at all?

Checklist:

[ ] Every LLM call has at least one fallback path
[ ] The fallback is tested independently — don't assume it works because the primary did
[ ] There's a final fallback that always returns something (even if degraded)
[ ] Fallback activation is logged and visible in your metrics
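
One way to satisfy these items together is a generic chain that reports which path produced the result and never raises. This is a sketch, not the article's `extract_order`: `run_with_fallbacks` and the three demo paths are hypothetical, and the success check is passed in so bad outputs trigger the fallback just like exceptions do.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger(__name__)

Path = Tuple[str, Callable[[str], Optional[dict]]]


def run_with_fallbacks(
    text: str,
    paths: List[Path],
    is_success: Callable[[Optional[dict]], bool],
    default: dict,
) -> Tuple[dict, str]:
    """Try each path in order and return (result, path_name). Never raises."""
    for position, (name, func) in enumerate(paths):
        try:
            result = func(text)
        except Exception as e:
            logger.warning("path %s raised: %s", name, e)
            result = None
        if is_success(result):
            if position > 0:
                # Make fallback activation visible; emit a metric here as well.
                logger.warning("fallback activated: %s", name)
            return result, name
    logger.error("all paths failed, returning degraded default")
    return default, "default"


def primary(text: str) -> Optional[dict]:
    raise TimeoutError("upstream timeout")


def secondary(text: str) -> Optional[dict]:
    return {"item": "", "quantity": 1}  # bad output, no exception


def regex_path(text: str) -> Optional[dict]:
    return {"item": "widget", "quantity": 2}


result, used = run_with_fallbacks(
    "2 widgets please",
    [("primary", primary), ("secondary", secondary), ("regex", regex_path)],
    is_success=lambda r: bool(r and r.get("item")),
    default={"item": None, "quantity": 0},
)
```

Returning the path name alongside the result gives you the `path_used` value needed for outcome tracking in the next section.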

4. Outcome Tracking

This is the one most teams skip, and it's the one that matters most for long-term reliability.

Logging requests and responses is not outcome tracking. Outcome tracking is recording whether the agent achieved its goal for each request.

import time
from dataclasses import dataclass
from typing import Optional, Any

@dataclass
class AgentOutcome:
    request_id: str
    task: str
    success: bool
    path_used: str  # which model/tool combination
    latency_ms: float
    input_tokens: Optional[int]
    output_tokens: Optional[int]
    error: Optional[str]
    metadata: dict

def track_outcome(outcome: AgentOutcome):
    # Send to your metrics system
    # Could be Datadog, Prometheus, a database, whatever
    metrics.increment(
        "agent.outcome",
        tags=[
            f"task:{outcome.task}",
            f"success:{outcome.success}",
            f"path:{outcome.path_used}"
        ]
    )
    if outcome.latency_ms > 5000:
        metrics.increment("agent.slow_request", tags=[f"task:{outcome.task}"])

The key is defining "success" programmatically. For every agent task, you need to be able to answer: did this work?

def is_extraction_successful(result: Optional[dict]) -> bool:
    if not result:
        return False
    required_fields = {"item", "quantity", "address"}
    return required_fields.issubset(result.keys()) and all(result[f] for f in required_fields)

# After every extraction:
success = is_extraction_successful(result)
track_outcome(AgentOutcome(
    request_id=request_id,
    task="order-extraction",
    success=success,
    path_used="gpt-4o",
    latency_ms=elapsed_ms,
    ...
))

Checklist:

[ ] Every agent task has a programmatic success function
[ ] Success/failure is recorded per request, not just per error
[ ] You can query: "what's the success rate for this task in the last hour?"
[ ] Outcome data includes which path was used (model, tool, params)
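
The "success rate in the last hour" query does not need a metrics backend to prototype. Here is a minimal in-memory sketch; `OutcomeWindow` is a hypothetical class, it assumes outcomes are recorded in time order, and a real deployment would push the same data to whatever metrics system you already run.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class OutcomeWindow:
    """In-memory success-rate tracker over a sliding time window, keyed by task."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[Tuple[float, bool]]] = defaultdict(deque)

    def record(self, task: str, success: bool, now: Optional[float] = None) -> None:
        ts = time.time() if now is None else now
        self._events[task].append((ts, success))

    def success_rate(self, task: str, now: Optional[float] = None) -> Optional[float]:
        ts = time.time() if now is None else now
        events = self._events[task]
        # Drop events that have aged out of the window.
        while events and events[0][0] < ts - self.window_seconds:
            events.popleft()
        if not events:
            return None  # "no data" is different from 0% success
        return sum(1 for _, ok in events if ok) / len(events)


window = OutcomeWindow(window_seconds=3600)
window.record("order-extraction", True, now=0)
window.record("order-extraction", False, now=1800)
window.record("order-extraction", True, now=3000)
rate_recent = window.success_rate("order-extraction", now=3600)  # all three in window
rate_later = window.success_rate("order-extraction", now=4000)   # first event aged out
```

Returning `None` for an empty window matters for alerting: a task with no traffic should not page anyone for a 0% success rate.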

5. Cost Monitoring

LLM costs are variable and can spike unexpectedly. An agent bug that causes excessive retrying or unusually long prompts can cost you serious money before you notice.

import logging
import openai

logger = logging.getLogger(__name__)

# Rough cost per 1K tokens (check current pricing)
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.0025, "output": 0.010},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet-20241022": {"input": 0.003, "output": 0.015},
    "claude-3-haiku-20240307": {"input": 0.00025, "output": 0.00125},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model not in COST_PER_1K_TOKENS:
        return 0.0
    rates = COST_PER_1K_TOKENS[model]
    return (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])

def call_with_cost_tracking(prompt: str, model: str = "gpt-4o") -> tuple[str, float]:
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    usage = response.usage
    cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)

    # Alert if single call is unusually expensive
    if cost > 0.10:  # $0.10 threshold — tune for your use case
        logger.warning(f"Expensive LLM call: ${cost:.4f} ({usage.prompt_tokens} input, {usage.completion_tokens} output)")

    return response.choices[0].message.content, cost

Checklist:

[ ] Token usage is recorded for every LLM call
[ ] Cost is estimated per call and aggregated per task type
[ ] Alert thresholds exist for abnormal cost spikes
[ ] You know your expected cost per 1000 requests before launch
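
Per-call estimates become useful once they are aggregated by task. A minimal sketch, assuming a hypothetical `CostLedger` class; the demo numbers reuse the gpt-4o rates from the table above with a 200-token input and 10-token output.

```python
from collections import defaultdict
from typing import Dict


class CostLedger:
    """Aggregates estimated spend and call counts per task type."""

    def __init__(self) -> None:
        self.total_cost: Dict[str, float] = defaultdict(float)
        self.calls: Dict[str, int] = defaultdict(int)

    def record(self, task: str, cost: float) -> None:
        self.total_cost[task] += cost
        self.calls[task] += 1

    def cost_per_1000(self, task: str) -> float:
        """Projected cost of 1000 requests at the observed average."""
        if self.calls[task] == 0:
            return 0.0
        return self.total_cost[task] / self.calls[task] * 1000


ledger = CostLedger()
# 200 input tokens and 10 output tokens at the gpt-4o per-1K rates listed above.
call_cost = 200 / 1000 * 0.0025 + 10 / 1000 * 0.010
ledger.record("classification", call_cost)
ledger.record("classification", call_cost)
projected = ledger.cost_per_1000("classification")
```

Running a representative sample of traffic through a ledger like this before launch gives you the "expected cost per 1000 requests" figure to alert against.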

6. Observability vs. Reliability — Don't Confuse Them

This is where teams make a category error.

Observability tools (LangSmith, Langfuse, Helicone, Weights & Biases) give you visibility into what's happening. Traces, spans, prompt logs, output comparison. They're genuinely useful for debugging and evaluation. Use them.

Reliability tools ensure the agent keeps working when things go wrong. Retries, fallbacks, circuit breakers, outcome-based routing. These operate at request time, not review time.

The difference: observability tells you your agent is failing. Reliability keeps it from failing, or recovers it automatically.

Here's an honest comparison of tools that often get conflated:

	Kalibr	LangSmith	OpenRouter
Primary purpose	Outcome-based path routing	Tracing, evaluation, debugging	Model gateway (cost/latency)
Adapts at runtime?	Yes — reroutes based on outcomes	No — dashboards for humans	Partial — routes by cost/latency, not outcomes
Success signal	Your programmatic success function	Human eval / labeled data	None (cost and latency only)
When it helps	Model degrades, tool fails, path breaks in production	Debugging why something failed, evaluating prompt quality	Reducing cost, hitting multiple providers
Requires human?	No — adapts automatically	Yes — someone looks at the dashboard	No
Learning mechanism	Thompson Sampling on outcome signals	N/A	Static rules or weighted routing

These are not competing tools. A production agent might legitimately use all three:

LangSmith for tracing and offline evaluation
OpenRouter for provider flexibility and cost management
Kalibr for outcome-based routing that adapts when things degrade

See Kalibr's docs for how the SDK fits into an existing stack.

7. Output Validation

Never trust LLM output directly. Validate it before passing it to anything downstream.

import json
from pydantic import BaseModel, ValidationError
from typing import Optional

class OrderData(BaseModel):
    item: str
    quantity: int
    address: str
    notes: Optional[str] = None

def parse_and_validate_order(llm_output: str) -> Optional[OrderData]:
    # Clean up common formatting issues
    content = llm_output.strip()

    # Strip markdown code fences
    if content.startswith("```"):
        lines = content.split("\n")
        content = "\n".join(lines[1:-1] if lines[-1] == "```" else lines[1:])

    try:
        data = json.loads(content)
        return OrderData(**data)
    except json.JSONDecodeError as e:
        logger.warning(f"JSON parse failed: {e}. Raw: {content[:200]}")
        return None
    except ValidationError as e:
        logger.warning(f"Schema validation failed: {e}")
        return None

Checklist:

[ ] LLM outputs are validated against an expected schema before use
[ ] JSON parsing failures are handled gracefully (logged, not raised)
[ ] Pydantic or equivalent schema validation is in the request path
[ ] Partial/empty outputs don't propagate as valid results
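
The fence-stripping step is worth pulling into its own helper, since every JSON-returning path needs it. This sketch uses only the standard library; `extract_json` and `FENCE` are hypothetical names, and the regex assumes the fence wraps the whole response rather than appearing mid-text.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # markdown code fence delimiter

# Optional language tag after the opening fence, payload, then the closing fence.
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def extract_json(llm_output: str) -> Optional[Any]:
    """Strip a surrounding markdown code fence, if any, then parse JSON."""
    content = llm_output.strip()
    match = _FENCE_RE.match(content)
    if match:
        content = match.group(1).strip()
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None  # caller decides whether to log, retry, or fall back


fenced = FENCE + "json\n" + '{"item": "widget", "quantity": 2}' + "\n" + FENCE
parsed_fenced = extract_json(fenced)
parsed_plain = extract_json('{"item": "widget", "quantity": 2}')
parsed_garbage = extract_json("Sure! Here is your order.")
```

The parsed value still needs schema validation afterwards; this helper only gets you from raw model text to a Python object or `None`.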

8. Rate Limiting and Circuit Breakers

Your agent should protect the APIs it calls, not just itself.

import logging
import time
from threading import Lock

logger = logging.getLogger(__name__)

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = 0
        self.state = "closed"  # closed = normal, open = blocking, half-open = testing
        self._lock = Lock()

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == "open":
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = "half-open"
                else:
                    raise Exception("Circuit breaker open — service unavailable")

        try:
            result = func(*args, **kwargs)
            with self._lock:
                self.failures = 0  # count consecutive failures only
                if self.state == "half-open":
                    self.state = "closed"
            return result
        except Exception as e:
            with self._lock:
                self.failures += 1
                self.last_failure_time = time.time()
                if self.failures >= self.failure_threshold:
                    self.state = "open"
                    logger.error(f"Circuit breaker opened after {self.failures} failures")
            raise

# Usage
openai_breaker = CircuitBreaker(failure_threshold=5, timeout=30)

def call_openai_safe(prompt: str) -> str:
    return openai_breaker.call(call_llm, prompt)

Checklist:

[ ] Circuit breakers prevent cascading failures to downstream APIs
[ ] Rate limiting is applied at the application level, not just relied on from the API
[ ] Breaker state is monitored — an open circuit breaker is an alert condition
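
The circuit breaker above covers the first item but not application-level rate limiting. A common approach is a token bucket; the sketch below is a minimal thread-safe version, with `TokenBucket` as a hypothetical class and the `clock` parameter injected only so the behavior can be demonstrated deterministically.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """Thread-safe token bucket: bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, never beyond capacity.
            self._tokens = min(
                self.capacity, self._tokens + (now - self._last) * self.rate
            )
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False


fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # two allowed, third rejected
fake_now[0] = 0.5                                  # half a second refills one token
after_refill = bucket.try_acquire()
```

Call `try_acquire()` before each LLM request and queue or shed the request when it returns `False`, so a retry storm in your own code never reaches the provider.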

9. Timeouts Everywhere

This is short because it's simple: set explicit timeouts on everything.

import asyncio
import logging
from typing import Optional

logger = logging.getLogger(__name__)

async def call_with_timeout(prompt: str, timeout_seconds: float = 30) -> Optional[str]:
    try:
        # Run the blocking call_llm on the default thread pool. A per-call
        # `with ThreadPoolExecutor()` would block on exit until the worker
        # finished, so the caller would still wait out a hung request.
        return await asyncio.wait_for(
            asyncio.to_thread(call_llm, prompt),
            timeout=timeout_seconds
        )
    except asyncio.TimeoutError:
        # The worker thread cannot be cancelled; the SDK-level timeout inside
        # call_llm is what eventually stops it.
        logger.warning(f"LLM call timed out after {timeout_seconds}s")
        return None

Checklist:

[ ] Every external call (LLM, API, database) has an explicit timeout
[ ] Timeouts are appropriate for the operation (not all 30s — fast ops should timeout faster)
[ ] Timeout failures are counted separately from other failures in metrics
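
The last two items can be sketched together: a per-operation timeout table and a counter that keeps timeouts apart from other failures. The names `TIMEOUTS`, `failure_counts`, and `with_timeout` are hypothetical, and the budgets shown are placeholders to replace with measured latencies.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical per-operation budgets in seconds; tune to measured latencies.
TIMEOUTS = {"cache_lookup": 0.05, "llm_call": 30.0}
failure_counts: Counter = Counter()


async def with_timeout(op: str, awaitable: Awaitable[T]) -> Optional[T]:
    """Await with the operation's own timeout and count timeouts separately."""
    try:
        return await asyncio.wait_for(awaitable, timeout=TIMEOUTS[op])
    except asyncio.TimeoutError:
        failure_counts[f"{op}.timeout"] += 1
        return None
    except Exception:
        failure_counts[f"{op}.error"] += 1
        return None


async def slow_lookup() -> str:
    await asyncio.sleep(0.2)  # exceeds the 50 ms cache budget
    return "value"


async def fast_lookup() -> str:
    return "value"


slow_result = asyncio.run(with_timeout("cache_lookup", slow_lookup()))
fast_result = asyncio.run(with_timeout("cache_lookup", fast_lookup()))
```

Separate counters let you tell "the provider is slow" apart from "the provider is returning errors", which usually call for different responses.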

Putting It Together: The Minimum Viable Production Agent

Here's what a minimal production-ready agent looks like, integrating the items above:

import kalibr  # First — before any model SDK imports
import openai
import time
import logging
import json
from typing import Optional
from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class ExtractionResult(BaseModel):
    item: str
    quantity: int
    address: str

def success_fn(result: Optional[ExtractionResult]) -> bool:
    return result is not None

def extract_gpt4o(text: str) -> Optional[ExtractionResult]:
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract order fields as JSON: item, quantity, address"},
                {"role": "user", "content": text}
            ],
            timeout=20
        )
        content = response.choices[0].message.content.strip()
        content = content.strip("```json").strip("```").strip()
        return ExtractionResult(**json.loads(content))
    except Exception as e:
        logger.warning(f"GPT-4o extraction error: {e}")
        return None

def extract_claude(text: str) -> Optional[ExtractionResult]:
    try:
        import anthropic
        ac = anthropic.Anthropic()
        response = ac.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": f"Extract as JSON (item, quantity, address): {text}"}]
        )
        content = response.content[0].text.strip()
        content = content.strip("```json").strip("```").strip()
        return ExtractionResult(**json.loads(content))
    except Exception as e:
        logger.warning(f"Claude extraction error: {e}")
        return None

# Kalibr router: outcome-based routing between paths
router = kalibr.Router(
    paths=[extract_gpt4o, extract_claude],
    success_fn=success_fn,
    task="order-extraction"
)

def process_order(text: str) -> Optional[ExtractionResult]:
    start = time.time()
    result = router.run(text)
    elapsed_ms = (time.time() - start) * 1000

    logger.info(f"Extraction {'succeeded' if result else 'failed'} in {elapsed_ms:.0f}ms")
    return result

This isn't complete production code — you'd add cost tracking, circuit breakers, and proper metrics. But it covers the core: validated output, multiple paths, outcome-aware routing that adapts automatically.

The Checklist, Condensed

Error handling:

[ ] Specific exception types caught and handled differently
[ ] Empty/null outputs handled before returning
[ ] All errors logged with context

Retries:

[ ] Exponential backoff with jitter
[ ] Max retry count bounded
[ ] 4xx errors not retried

Fallbacks:

[ ] Every LLM call has at least one fallback
[ ] Fallback activation is logged

Outcome tracking:

[ ] Success function defined per task
[ ] Success/failure recorded per request
[ ] Path used recorded with each outcome

Cost monitoring:

[ ] Token usage tracked per call
[ ] Alert thresholds for cost spikes

Validation:

[ ] Schema validation on all LLM outputs
[ ] JSON parsing errors handled

Infrastructure:

[ ] Circuit breakers on external calls
[ ] Explicit timeouts everywhere
[ ] Metrics and alerting in place

If you can check every box, your agent is ready for production. Most teams can't check them all on day one — that's fine. Work through it in priority order.

Related: Why Your AI Agent Works in Dev and Silently Fails in Production covers the detection problem in more depth. Stop Hardcoding Model Fallbacks covers outcome-based routing in detail.