Abstract
Model routing and prompt caching are well-established, separate techniques for reducing LLM API costs. Routing directs simple tasks to cheaper models (40-85% savings). Anthropic's prompt caching cuts input token costs by up to 90% on repeated prefixes. Every existing tool treats these as independent optimizations.
This post proposes HotSwap, a pattern that keeps a persistent cached Claude session as the stateful backbone while offloading read-only exploration turns to a cheaper provider. The motivation is cache economics: cached turns on Anthropic are cheap, so you want to keep complex work there while routing lightweight exploration elsewhere. The mechanism is simpler than you'd expect -- task-type classification (exploration vs. action), not real-time cost calculation. And a self-tuning model selector adapts which cheap model handles exploration based on observed success rates in your specific workload.
To be clear about what's novel and what isn't: multi-model routing exists (LiteLLM, OpenRouter). Prompt caching is well-documented. The contributions here are: (1) using cache economics as the motivating insight for a hybrid architecture that keeps the primary session warm; (2) a guardrail mechanism that lets cheap models explore freely but prevents them from taking irreversible actions; (3) a self-tuning model selector that promotes or demotes the exploration model based on observed outcomes (no pre-trained classifier, no static rules); and (4) a cross-provider message format translation layer that makes the primary model see a seamless conversation history regardless of which provider handled which turn.
Status: HotSwap is implemented and under active benchmarking. Early data is directional, not definitive. This post describes the architecture and methodology; formal benchmark results will follow.
A note on terminology
"Cache-aware routing" is an existing term in infrastructure, but it refers to something completely different. Projects like llm-d (IBM/Google/Red Hat) use KV cache-aware routing to direct requests to GPU pods that already hold relevant context in memory -- that's routing requests to the right server. HotSwap operates at the application layer for API consumers, routing turns to the right provider based on task type, with cache economics as the motivating reason. Different layer, different audience, different problem.
The problem: two optimizations that don't talk to each other
Today, if you want to reduce your LLM API spend, you have two main levers:
Prompt caching. Anthropic lets you cache prompt prefixes and read them back at 10% of the input token cost. For long system prompts or multi-turn conversations, this is transformative -- but you're still paying full output token price on every call, even for turns that are just reading files and exploring code.
Model routing. Tools like LiteLLM, OpenRouter, and Bifrost let you send simple requests to cheaper models. This works well, but each call is independent -- there's no concept of a persistent session, and no awareness of what's already cached.
The gap: exploration turns (reading files, searching code, summarizing) don't need a frontier model, but existing routers don't know which turns are exploration and which are action. And caching tools don't influence routing at all.
The economic insight that motivates HotSwap: even with Anthropic's 90% cache discount, a cached Sonnet exploration turn costs ~$0.0045. An uncached gpt-4.1-nano turn costs ~$0.001. The hybrid isn't just preserving the cache -- it's cheaper even when the cache is working perfectly.
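The comparison above can be reproduced with back-of-envelope arithmetic. A minimal sketch, assuming illustrative price snapshots (Sonnet at $3.00/1M input tokens, a 90% cache discount, gpt-4.1-nano at $0.10/1M) and a ~15k-token conversation context; this counts input tokens only and ignores output pricing:

```python
# Illustrative per-1M-token input prices -- verify current rates.
SONNET_INPUT_PER_M = 3.00
NANO_INPUT_PER_M = 0.10
CACHE_DISCOUNT = 0.90  # cached reads at 10% of base input price

context_tokens = 15_000  # assumed mid-size agentic conversation

# Cached Sonnet exploration turn: full context read from cache.
cached_sonnet_turn = context_tokens / 1_000_000 * SONNET_INPUT_PER_M * (1 - CACHE_DISCOUNT)
# Uncached nano exploration turn: full context at nano's base price.
uncached_nano_turn = context_tokens / 1_000_000 * NANO_INPUT_PER_M

print(f"cached Sonnet: ${cached_sonnet_turn:.4f}")  # ~$0.0045
print(f"uncached nano: ${uncached_nano_turn:.4f}")  # ~$0.0015
```

Even under a perfect cache hit, the nano turn is cheaper by roughly 3x on input alone, which is the point of the hybrid.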
A recent paper (Khraishi et al., 2026, arXiv:2603.03111) studied what happens when you switch models mid-conversation, finding that handoffs can swing outcomes by -8 to +13 percentage points depending on direction. However, that paper studied per-conversation-turn switching (swapping the model that continues a dialogue), not per-agentic-turn switching (routing discrete tool-calling turns within a single execution). HotSwap addresses the drift risk with a guardrail mechanism (see below).
The HotSwap architecture
HotSwap separates your LLM usage into two channels based on task type, with cache economics as the motivating reason for the split.
The cached backbone (Claude)
A single Claude session acts as the persistent source of truth. The system prompt, tool definitions, and growing conversation history are cached using Anthropic's cache_control parameter. Every subsequent call to this session reads from the cache at 10% of the base input price. This session handles all turns that involve action: writing code, editing files, rebuilding, planning, and synthesis.
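A minimal sketch of the backbone request shape, assuming Anthropic's documented Messages API payload with a `cache_control` breakpoint on the system prompt (the model name and helper function are illustrative; no network call is made here):

```python
def build_backbone_request(system_prompt, tools, messages):
    """Build a Messages API payload with a cache breakpoint on the system
    prompt. Everything up to and including the cache_control block is
    cached after the first call; later calls read it at ~10% of base price."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # marks the cache prefix
            }
        ],
        "tools": tools,
        "messages": messages,
    }

req = build_backbone_request(
    "You are a coding agent.",
    [],
    [{"role": "user", "content": "Fix the failing build."}],
)
```

The key property HotSwap relies on: later turns append to `messages`, leaving the cached prefix byte-identical.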
The cheap sidecar (OpenAI)
Exploration turns are offloaded to a cheaper model. The sidecar receives the full message history, translated to OpenAI's native format -- not a summarized version. The sidecar needs full context to make good exploration decisions ("which file should I read next?" depends on what's already been read). Sending a self-contained prompt would require summarizing the conversation, adding complexity and token cost. The savings come from cheaper per-token pricing, not smaller inputs.
OpenAI applies prompt caching automatically on its side, and each exploration call is independent, so there is no warm sidecar cache to lose -- switching sidecar models carries no cache penalty.
The routing logic
The routing decision is task-type-based, not real-time cost calculation:
- First turn always goes to the primary model -- this creates the cache.
- After each turn, check: were all tools in the previous turn read-only? If yes, the next turn routes to the cheap model. If no, it stays on the primary.
- If the cheap model requests action tools (edit_code, write_file, rebuild), the response is discarded and the turn is re-routed to the primary model. The cache is still warm.
- If the cheap model call fails entirely (auth error, timeout, bad model name), the system falls back to the primary model gracefully. No hang, no crash.
The cache economics are the motivation -- cached turns are cheap, so you want the expensive model handling the work that matters. The mechanism is straightforward task classification. Real-time cache-economics-driven routing (factoring in TTL, marginal cost comparisons) is a direction for future work.
How it actually works
A condensed sketch (helpers such as `get_tools_called`, `get_best_exploration_model`, and the translation functions are elided; the turn counter is passed in explicitly):

```python
def handle_turn(turn, messages, tools, primary_model):
    # First turn always goes to primary (creates cache)
    if turn == 1:
        return primary_model.call(messages, tools)

    # Check: was the PREVIOUS turn exploration-only?
    last_tools = get_tools_called(messages[-1])
    all_exploration = all(t in EXPLORATION_TOOLS for t in last_tools)
    if not all_exploration or not openai_client:
        return primary_model.call(messages, tools)  # cache hit

    # Route to cheap model for exploration
    cheap_model = get_best_exploration_model(stats)  # self-tuning
    try:
        response = cheap_model.call(
            translate_to_openai(messages),  # full history, translated
            translate_tools(tools),
        )
    except Exception:
        # OpenAI failed -- fall back to primary, cache still warm
        return primary_model.call(messages, tools)

    # GUARDRAIL: if cheap model requests action tools, discard and re-route
    if any(tool in ACTION_TOOLS for tool in response.tool_calls):
        log("Guardrail: re-routing to primary")
        return primary_model.call(messages, tools)  # cache still warm

    # Translate response back to primary format and inject into history;
    # primary's cache stays warm -- the next primary call gets a cache hit
    return translate_to_anthropic(response)
```
The guardrail: explore freely, never act
This is how HotSwap addresses the context drift problem from arXiv:2603.03111.
The cheap model can call any read-only tool: read files, search code, summarize, list directories. But when the cheap model's response includes a call to an action tool (edit_code, write_file, rebuild), the entire response is discarded and the turn is re-routed to the primary model. The primary's cache is still warm -- Anthropic's cache TTL is 5 minutes, and exploration turns take seconds.
This means the cheap model can explore freely but cannot take irreversible actions. The worst case is a wasted sidecar call (~$0.001 on nano) -- not a bad edit.
The guardrail itself is zero-cost on success: a string comparison against a set of tool names. No LLM call, no API round-trip. The only cost is when it triggers.
Guardrail triggers are not counted in exploration stats, which prevents pollution of the success rate data used for self-tuning.
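The check itself is small enough to show in full. A minimal sketch, with illustrative tool names (the real sets come from the agent's tool registry):

```python
# Illustrative tool classification -- real sets come from the tool registry.
ACTION_TOOLS = {"edit_code", "write_file", "rebuild"}

def should_reroute(requested_tools):
    """Return True if any requested tool is an action tool.

    A True result means the sidecar response is discarded and the turn
    is re-routed to the primary model, whose cache is still warm.
    """
    return any(name in ACTION_TOOLS for name in requested_tools)
```

On the success path this is one set-membership test per requested tool name -- no LLM call, no round-trip -- which is what makes the guardrail zero-cost when it doesn't trigger.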
Self-tuning exploration model selection
Existing model routers like RouteLLM train classifiers on fixed preference datasets, and FrugalGPT learns LLM cascades from labeled examples -- both require upfront training data and don't adapt at runtime. HotSwap's routing adapts based on observed outcomes within your specific workload, with zero training data.
How it works:
- Starts with the cheapest model -- gpt-4.1-nano ($0.10/1M input tokens at time of writing; verify current pricing).
- Tracks success rate: did this exploration turn lead to a successful edit on the next primary turn?
- If success rate drops below 60% after 5+ exploration turns, promotes to the next cheapest model (e.g., gpt-4o-mini, then gpt-4.1-mini, then gpt-4o).
- Stats persist across sessions in a local JSON file.
- Zero LLM cost for the routing decision itself -- pure bookkeeping.
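The mechanism above can be sketched as a small bookkeeping class. This is a hypothetical implementation under the stated assumptions -- the ladder, the 60%/5-turn thresholds, and the class/file names are illustrative, not HotSwap's actual code:

```python
import json
from pathlib import Path

# Illustrative promotion ladder, cheapest first.
LADDER = ["gpt-4.1-nano", "gpt-4o-mini", "gpt-4.1-mini", "gpt-4o"]
MIN_TURNS = 5          # don't judge a model on fewer than 5 exploration turns
PROMOTE_BELOW = 0.60   # promote when observed success rate drops under 60%

class ExplorationSelector:
    def __init__(self, stats_path="hotswap_stats.json"):
        self.path = Path(stats_path)
        # stats[model] = {"successes": int, "turns": int}
        self.stats = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.index = 0  # always start the walk at the cheapest model

    def current_model(self):
        # Walk up the ladder past any model whose observed success rate
        # has fallen below the threshold (with enough data to judge).
        while self.index < len(LADDER) - 1:
            s = self.stats.get(LADDER[self.index], {"successes": 0, "turns": 0})
            if s["turns"] >= MIN_TURNS and s["successes"] / s["turns"] < PROMOTE_BELOW:
                self.index += 1  # promote to the next model up
            else:
                break
        return LADDER[self.index]

    def record(self, model, success):
        # "Success" = the exploration turn led to a successful edit
        # on the next primary turn.
        s = self.stats.setdefault(model, {"successes": 0, "turns": 0})
        s["turns"] += 1
        s["successes"] += int(success)
        self.path.write_text(json.dumps(self.stats))  # persist across sessions
```

The routing decision is a dictionary lookup and a division -- no LLM call -- and because the stats file persists, a workload that burned nano once stays promoted in the next session.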
Known limitation: the success metric is conservative. Only the last exploration turn before a successful edit gets credit. If three exploration turns preceded the edit, only the third is marked successful. This deflates the success rate, meaning the system is more cautious about staying on cheap models than the data may warrant. We chose conservative over optimistic -- it's safer to promote to a better model too early than to stay on a bad one too long.
This means HotSwap gets smarter over time for your specific workload. A codebase where nano handles exploration fine stays on nano. A codebase where nano's exploration leads to bad edits gets automatically promoted to a better model. No manual tuning required.
Message format translation
A less glamorous but essential piece: the translation layer that makes cross-provider routing seamless.
Anthropic uses tool_use/tool_result content blocks. OpenAI represents tool use as tool_calls entries on assistant messages, with results returned as role: "tool" messages. HotSwap translates the full Anthropic message history into OpenAI's format before sending to the sidecar, and translates the OpenAI response back into Anthropic ContentBlocks before injecting it into the conversation history.
The primary model sees a seamless history -- it doesn't know some turns were handled by a different provider. This is what makes the cache reinjection work without invalidating the cached prefix.
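One direction of the translation can be sketched as follows, assuming the documented shapes of Anthropic content blocks and OpenAI chat messages (the helper name is illustrative; multi-block edge cases and error handling are omitted):

```python
import json

def anthropic_turn_to_openai(content_blocks):
    """Translate one Anthropic assistant turn (a list of content blocks)
    into an OpenAI-style assistant message with tool_calls."""
    text_parts, tool_calls = [], []
    for block in content_blocks:
        if block["type"] == "text":
            text_parts.append(block["text"])
        elif block["type"] == "tool_use":
            tool_calls.append({
                "id": block["id"],
                "type": "function",
                "function": {
                    "name": block["name"],
                    # OpenAI expects arguments as a JSON string;
                    # Anthropic provides a parsed dict.
                    "arguments": json.dumps(block["input"]),
                },
            })
    msg = {"role": "assistant", "content": "".join(text_parts) or None}
    if tool_calls:
        msg["tool_calls"] = tool_calls
    return msg
```

The reverse direction (OpenAI response back into Anthropic ContentBlocks) mirrors this mapping, which is what lets the sidecar's turn be appended to the history without disturbing the cached prefix.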
What already exists vs. what HotSwap adds
The building blocks are mature. The combination is new.
| Capability | Existing tools | HotSwap |
|---|---|---|
| Multi-model routing | LiteLLM, OpenRouter, Bifrost | Uses existing |
| Prompt caching | Anthropic API | Uses existing |
| Orchestrator / worker split | Various | Uses existing |
| Task-type routing motivated by cache economics | None | Novel |
| Guardrail: explore freely, never act | None | Novel |
| Self-tuning model selector based on observed outcomes | None | Novel |
| Cross-provider message format translation with cache reinjection | None | Novel |
| Graceful fallback on sidecar failure | None | Novel |
Known risks and mitigations
Context drift. arXiv:2603.03111 found that switching models mid-conversation can swing outcomes by -8 to +13 percentage points. HotSwap mitigates this with the guardrail: the cheap model can explore but cannot act. If it tries to act, the turn is discarded and re-routed. Additionally, the arXiv paper studied conversational handoffs (one model continuing another's dialogue); HotSwap's exploration turns produce factual tool results (file contents, search matches), which carry less drift risk than open-ended reasoning.
Cache invalidation. Reinjected sidecar results append to the end of the message history but do not modify the cached prefix (system prompt, tools, early history). The prefix stays immutable. Risk: if the message history grows large enough to push past cache boundaries, hits may degrade.
Latency overhead. The sidecar call adds a network round-trip, plus translation time. For latency-sensitive applications, this may not be worth it. Mitigation: measure P50/P95 latency per path and adjust routing thresholds.
Translation fidelity. The format translation between Anthropic and OpenAI is not lossless -- edge cases in tool call formatting, multi-block responses, or provider-specific features could cause subtle issues. Mitigation: test translation on your specific tool set before deploying.
Sidecar availability. If the OpenAI API is down or the model name is invalid, the sidecar call fails. Mitigation: HotSwap catches all sidecar errors and falls back to the primary model. This was added after discovering a real hang caused by a sidecar failure during development.
Early data (benchmarking in progress)
From actual sessions -- directional, not definitive:
| Configuration | Cost | Turns | Outcome | Notes |
|---|---|---|---|---|
| Sonnet with caching | $0.38 | 9 | Completed | One measured session |
| Sonnet with caching | $0.70 | 17 | Completed | One measured session (more complex task) |
| Opus with caching | $0.88 | 9 | Completed | One measured session |
| Cache discount (Sonnet) | 22% | -- | -- | One measured session vs. uncached baseline |
Hybrid routing data (OpenAI exploration turns) is being collected. Formal benchmarks across the scenarios below are actively in progress.
Benchmark methodology
To validate HotSwap, metrics are being collected across three configurations: (A) pure Claude without caching, (B) pure Claude with caching, and (C) the HotSwap hybrid approach.
Primary metrics
| Metric | What to measure | Why it matters |
|---|---|---|
| Cost per task | Total API spend across all providers for one complete task | The headline number |
| Cost per turn by type | Break down by exploration vs. action turns | Shows where offloading saves money |
| Cache hit rate | Percentage of primary model calls that hit the cache | Validates the architecture's core assumption |
| End-to-end duration | Wall-clock time per task | Reveals whether sidecar round-trips add unacceptable delay |
| Task completion rate | Did the task succeed across configurations? | Proves offloading exploration doesn't degrade results |
Secondary metrics
| Metric | What to measure | Why it matters |
|---|---|---|
| Guardrail trigger rate | How often the cheap model tries to act and gets re-routed | High rate means exploration model is too aggressive |
| Self-tuning promotions | How often the system promotes to a more expensive exploration model | Indicates when the cheapest model isn't good enough |
| Sidecar fallback rate | How often sidecar calls fail, requiring a primary model retry | Measures reliability of the cross-provider bridge |
| Exploration success rate | Did the exploration turn lead to a successful action on the next turn? | The metric that drives self-tuning |
Test scenarios (in progress)
Simple UI change. Change one CSS property. Minimal exploration, one edit. Tests baseline cost across models.
Find and fix. Locate a specific UI element and modify it. Moderate exploration (search + read), one edit. Tests exploration routing.
Add a feature. Add a new UI element with behavior. Heavy exploration (multiple files), multiple edits. Tests the full exploration-to-action cycle.
Investigation only. Answer a question about the codebase. All exploration, no edits. Tests whether HotSwap saves on pure-exploration tasks.
Multi-file refactor. Extract duplicated code into a shared utility. Heavy exploration + heavy action. Tests the architecture under sustained load.
What comes next
HotSwap is a hypothesis under active testing, not a proven solution. The architecture is working, but the key question -- does the added complexity pay for itself, and under what conditions? -- is still open.
Directions being explored:
- Real-time cache-economics-driven routing -- using TTL remaining and marginal cost comparisons to make routing decisions dynamically, instead of pure task-type classification. This is the aspirational version of the pattern.
- More granular tool classification -- the current exploration/action split is binary. Some tools are borderline (e.g., running tests is read-only but computationally meaningful). Smarter classification could improve routing accuracy.
- Broader provider support -- the translation layer currently handles Anthropic <> OpenAI. Adding Google Vertex or local models as sidecar options would test provider-agnosticism.
Larry Kang - March 2026