Devon

Posted on • Originally published at kalibr.systems

Making OpenClaw Use the Right Model for Each Task

OpenClaw picks a default model and uses it for everything - heartbeat checks, complex synthesis, quick status lookups, deep analysis. Every task costs the same. That's expensive and unnecessary.

This post covers how to wire Kalibr into an OpenClaw agent so it routes each task to the right model automatically. If you run an OpenClaw deployment, this is probably the highest-ROI change you can make to your token spend.

Why OpenClaw Defaults This Way

OpenClaw is configured at the session level, not the task level. Your CLAUDE.md or session config sets one model, and that model handles whatever comes in. This makes setup simple, but it means:

  • A heartbeat status check costs the same as a codebase analysis
  • A simple "is this service up?" poll runs on the same model as "refactor this module"
  • There's no mechanism to say "use cheap for low-stakes, use capable for high-stakes"

Kalibr adds that mechanism. You query it before each task to get a routing recommendation, then pass that model to the OpenAI/Anthropic client. The routing adapts based on task type and recent outcome data.


The Basic Pattern: get_policy() Before Each Task

import time

import kalibr  # kalibr must be imported before openai so its instrumentation takes effect
import openai

kalibr.init()

client = openai.OpenAI()

def run_agent_task(
    task_type: str,
    prompt: str,
    quality_priority: float = 0.5  # 0.0 = optimize cost, 1.0 = optimize quality
) -> str:
    """
    OpenClaw agent task runner with Kalibr routing.
    task_type: "heartbeat", "analysis", "synthesis", "extraction", etc.
    """
    # get routing recommendation before the call
    policy = kalibr.get_policy(task_context={
        "task_type": task_type,
        "quality_priority": quality_priority,
    })

    start = time.perf_counter()
    response = client.chat.completions.create(
        model=policy.recommended_model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency_ms = int((time.perf_counter() - start) * 1000)

    content = response.choices[0].message.content

    # report back so the router learns from this outcome
    kalibr.record_outcome(
        policy_id=policy.id,
        success=True,
        latency_ms=latency_ms
    )

    return content

Now you have two levers: task_type tells Kalibr what kind of work this is, and quality_priority expresses how much you care about output quality vs cost for this specific call. A heartbeat check is quality_priority=0.1. A code review is quality_priority=0.9.


Wiring Into the Heartbeat

OpenClaw agents typically run a heartbeat - periodic status checks, health pings, watching for events. These are almost always low-complexity tasks that don't need a capable model.

Here's how to wire Kalibr's get_insights() into your heartbeat loop:

import json
import logging
import time

import kalibr
import openai

kalibr.init()
client = openai.OpenAI()
logger = logging.getLogger(__name__)

def heartbeat_check(services: list[str]) -> dict:
    """
    Low-cost heartbeat: route to the cheapest model that can handle status checks.
    """
    policy = kalibr.get_policy(task_context={
        "task_type": "heartbeat",
        "quality_priority": 0.1,  # cost-optimize
        "latency_budget_ms": 3000
    })

    prompt = f"""
    Check status for these services and return JSON:
    {services}

    Format: {{"service_name": "ok|degraded|down", ...}}
    """

    response = client.chat.completions.create(
        model=policy.recommended_model,  # typically a mini-class model
        messages=[
            {"role": "system", "content": "Return only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )

    kalibr.record_outcome(policy_id=policy.id, success=True)

    return json.loads(response.choices[0].message.content)


def get_routing_insights() -> dict:
    """
    Pull Kalibr insights to surface routing anomalies in your heartbeat.
    Useful for: detecting when a model is degrading, cost spikes, etc.
    """
    insights = kalibr.get_insights(
        lookback_hours=1,
        include_cost_breakdown=True,
        include_model_performance=True
    )

    anomalies = []

    # flag if cost per call jumped significantly
    if insights.cost_per_call_delta_pct > 20:
        anomalies.append(
            f"Cost per call up {insights.cost_per_call_delta_pct:.0f}% in last hour"
        )

    # flag if a model's success rate dropped
    for model, perf in insights.model_performance.items():
        if perf.success_rate < 0.85:
            anomalies.append(
                f"{model} success rate: {perf.success_rate:.0%} (below threshold)"
            )

    return {
        "anomalies": anomalies,
        "total_cost_1h": insights.total_cost_usd,
        "calls_1h": insights.total_calls,
        "primary_model": insights.most_used_model
    }


def run_heartbeat_loop(interval_seconds: int = 60):
    """Main heartbeat loop with integrated Kalibr monitoring."""
    services = ["api.service.com", "db.service.com", "queue.service.com"]
    cycle = 0

    while True:
        # status check on cheap model
        status = heartbeat_check(services)

        # routing insights every 5 cycles (a counter is drift-proof,
        # unlike deriving the cycle from wall-clock time)
        if cycle % 5 == 0:
            insights = get_routing_insights()
            for anomaly in insights["anomalies"]:
                logger.warning(f"[Kalibr] {anomaly}")

        logger.info(f"Heartbeat: {status}")
        cycle += 1
        time.sleep(interval_seconds)

The get_insights() call surfaces routing-level signals you wouldn't otherwise see: is gpt-4o-mini suddenly returning worse results? Is your cost per call drifting up because more tasks are getting classified as complex? This is the data that tells you whether your routing is working before your bill tells you it isn't.
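If you want an independent check on what get_insights() reports, you can also track cost-per-call locally and compute the drift yourself. This is a sketch, not part of Kalibr: `CostTracker` is a name I made up, and the 20% threshold mirrors the example above.

```python
from collections import deque

class CostTracker:
    """Local rolling window of per-call costs; compare the newest half
    of the window against the oldest half to spot drift."""
    def __init__(self, window: int = 100):
        self.costs = deque(maxlen=window)

    def record(self, cost_usd: float) -> None:
        self.costs.append(cost_usd)

    def delta_pct(self) -> float:
        # percentage change: newest half of the window vs the oldest half
        n = len(self.costs)
        if n < 4:
            return 0.0  # not enough data to compare halves
        half = n // 2
        old = sum(list(self.costs)[:half]) / half
        new = sum(list(self.costs)[n - half:]) / half
        return (new - old) / old * 100 if old else 0.0
```

Feed it the per-call cost after each completion, and alert when `delta_pct()` crosses the same 20% threshold you use for `cost_per_call_delta_pct`.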


The CLAUDE.md Pattern for Router-Native Code

If you want every coding session to generate code that uses Kalibr routing by default, add this to your CLAUDE.md:

## LLM Call Pattern

Any code that calls an LLM must use Kalibr routing. Pattern:

```python
import kalibr  # MUST be imported before openai/anthropic
import openai

kalibr.init()
client = openai.OpenAI()

# before each LLM call:
policy = kalibr.get_policy(task_context={
    "task_type": "<task_type>",   # heartbeat | extraction | synthesis | analysis | generation
    "quality_priority": <0-1>,    # 0.1 for cheap tasks, 0.9 for quality-critical
})

response = client.chat.completions.create(
    model=policy.recommended_model,  # never hardcode model names
    messages=[...]
)

kalibr.record_outcome(policy_id=policy.id, success=True)
```

Never hardcode model names like "gpt-4o" or "gpt-4o-mini". Always use `policy.recommended_model`.
Import order is critical: kalibr must be imported before openai.
With this in CLAUDE.md, every time you ask your OpenClaw agent to write code that calls an LLM, it generates Router-native code by default. You don't have to remember to add routing - the pattern is baked into the session context.


Classifying Tasks in an OpenClaw Agent

The routing is only as good as the task classification. Here's a simple taxonomy that maps well to what OpenClaw agents actually do:

from enum import Enum

class AgentTaskType(str, Enum):
    # cheap - route to mini
    HEARTBEAT = "heartbeat"           # status checks, health pings
    EXTRACTION = "extraction"         # pull structured data from text
    CLASSIFICATION = "classification" # categorize input
    FORMATTING = "formatting"         # convert format, clean text

    # moderate - route based on recent performance
    SUMMARIZATION = "summarization"   # condense content
    SEARCH_QUERY = "search_query"     # generate search queries

    # expensive - route to capable model
    SYNTHESIS = "synthesis"           # combine multiple sources
    CODE_REVIEW = "code_review"       # review and critique code
    CODE_GENERATION = "code_generation"  # write new code
    ANALYSIS = "analysis"             # deep reasoning over data
    ARCHITECTURE = "architecture"     # system design decisions

TASK_QUALITY_PRIORITY = {
    AgentTaskType.HEARTBEAT: 0.1,
    AgentTaskType.EXTRACTION: 0.2,
    AgentTaskType.CLASSIFICATION: 0.2,
    AgentTaskType.FORMATTING: 0.1,
    AgentTaskType.SUMMARIZATION: 0.5,
    AgentTaskType.SEARCH_QUERY: 0.4,
    AgentTaskType.SYNTHESIS: 0.85,
    AgentTaskType.CODE_REVIEW: 0.9,
    AgentTaskType.CODE_GENERATION: 0.9,
    AgentTaskType.ANALYSIS: 0.85,
    AgentTaskType.ARCHITECTURE: 0.95,
}

def agent_call(task_type: AgentTaskType, prompt: str) -> str:
    priority = TASK_QUALITY_PRIORITY[task_type]

    policy = kalibr.get_policy(task_context={
        "task_type": task_type.value,
        "quality_priority": priority
    })

    response = client.chat.completions.create(
        model=policy.recommended_model,
        messages=[{"role": "user", "content": prompt}]
    )

    kalibr.record_outcome(policy_id=policy.id, success=True)
    return response.choices[0].message.content

What This Gets You

For a typical OpenClaw agent running:

  • 100 heartbeat checks per day
  • 50 extraction tasks per day
  • 20 synthesis tasks per day
  • 10 code generation tasks per day

If you were previously running everything on gpt-4o, routing heartbeat and extraction tasks to gpt-4o-mini alone cuts roughly 60-70% of your token spend on those task types. The synthesis and code generation calls still run on the capable model. Your output quality doesn't change for the tasks that require it.
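To sanity-check that estimate, here's a back-of-envelope cost model for the workload above. Every number in it is an illustrative assumption, not a measured price or token count; swap in your own rates.

```python
# Illustrative cost model -- all prices and token counts are assumptions
# for the sketch, not measured values.
PRICE_PER_1K_TOKENS = {"gpt-4o": 0.005, "gpt-4o-mini": 0.00015}  # assumed USD

DAILY_TASKS = {
    # task: (calls/day, assumed avg tokens/call, model after routing)
    "heartbeat":       (100,  500, "gpt-4o-mini"),
    "extraction":      (50,  1500, "gpt-4o-mini"),
    "synthesis":       (20,  4000, "gpt-4o"),
    "code_generation": (10,  6000, "gpt-4o"),
}

def daily_cost(model_for_task) -> float:
    """Sum daily cost, letting the caller decide which model each task runs on."""
    return sum(
        calls * tokens / 1000 * PRICE_PER_1K_TOKENS[model_for_task(task, routed)]
        for task, (calls, tokens, routed) in DAILY_TASKS.items()
    )

before = daily_cost(lambda task, routed: "gpt-4o")  # everything on gpt-4o
after = daily_cost(lambda task, routed: routed)     # with routing applied
savings_pct = (before - after) / before * 100
```

Under these assumptions the expensive tasks cost exactly the same before and after; the entire saving comes from the heartbeat and extraction calls moving to the mini-class model.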

The get_insights() integration in your heartbeat loop gives you visibility into whether the routing is actually working - not just "is the model returning a response" but "are the routing weights optimized for your actual workload."

The pattern is simple: get_policy() before each task, record_outcome() after. Everything else is just wiring it into the right call sites.
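One way to keep those call sites clean is to package the pattern as a decorator. This is a sketch, not part of Kalibr: the `routed` name is mine, and it assumes only the get_policy()/record_outcome() shapes used in the examples above (any object with those two methods works).

```python
import time
from functools import wraps

def routed(router, task_type: str, quality_priority: float = 0.5):
    """Wrap an LLM-calling function with the get_policy/record_outcome
    pattern. The wrapped function receives the recommended model as its
    first argument; failures are reported as unsuccessful outcomes."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            policy = router.get_policy(task_context={
                "task_type": task_type,
                "quality_priority": quality_priority,
            })
            start = time.perf_counter()
            try:
                result = fn(policy.recommended_model, *args, **kwargs)
            except Exception:
                # record the failure so the router can learn from it
                router.record_outcome(policy_id=policy.id, success=False)
                raise
            router.record_outcome(
                policy_id=policy.id,
                success=True,
                latency_ms=int((time.perf_counter() - start) * 1000),
            )
            return result
        return wrapper
    return decorator
```

With this in place, an agent task is just `@routed(kalibr, "extraction", quality_priority=0.2)` over a function that takes `(model, prompt)` and makes the completion call.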
