DEV Community

Ken Imoto

Posted on • Originally published at doi.org

When the Free Executor Cost More: 40 Trials on Opus + Local Qwen Ended Up the Most Expensive Cloud Arm

Per-arm cumulative token volume. Even with Qwen's tokens billed at $0, the Opus + Qwen arm (B) has Opus reading 1.4–5.3× more tokens than Opus solo, because the orchestrator re-reads the executor's returned summaries on every iteration.

"Use a strong model to orchestrate, a cheap model to execute." This is now the default cost-aware recipe for agentic coding.

I believed it and ran the experiment. Opus 4.7 as the orchestrator, locally-hosted Qwen 3.5-9B (zero token cost) as the executor. This should beat running Opus alone on cost. Has to.

It did the opposite.

The supposedly "free" configuration (Opus + Qwen) came out as the most expensive cloud arm on all three of the code-repair tasks I ran. Higher than Opus solo. Higher than Opus + Haiku. And of course much higher than Haiku solo. As someone who actually built a GPU PC believing "local means cheap," I find this somewhat inconvenient.

I wrote up the 40 trials' worth of numbers and the mechanism as a paper, archived on Zenodo: DOI 10.5281/zenodo.20978074 / GitHub repo.

This post walks through what happened across the 40 trials and why "free" turned out to be the most expensive option — all from real measurements.

TL;DR

40 trials across 4 configurations × 3 tasks, judged by a deterministic harness (mypy + ruff + pytest exit codes only). No LLM-as-judge anywhere in the loop.

  1. The Opus-orchestrates, Qwen-executes arm is the most expensive cloud arm on every task, more expensive even than Opus solo.
  2. The cause is not the executor's tokens — it's the orchestrator's prompt-cache re-reads. Opus keeps reading Qwen's returned summaries on every turn, and its own input volume grows to 1.4–5.3× that of Opus running alone.
  3. Haiku solo is 5.5× cheaper than the next-cheapest arm (Opus + Haiku) on the largest task, T1, but it fails 25% of the time within the per-arm iteration cap. Among cloud-only options, Opus + Haiku is the most balanced.

If the intuition "the executor's tokens are free, therefore this is cheap" feels obvious, this post is about why that intuition breaks.

What I Measured

The four arms

arm  orchestrator      executor                            role split
A    Opus 4.7 (solo)   -                                   one model does everything
B    Opus 4.7          Qwen 3.5-9B (local / Ollama)        Opus plans + verifies, Qwen edits
C    Opus 4.7          Haiku 4.5 (Anthropic SDK sub-loop)  Opus plans + verifies, Haiku edits
D    Haiku 4.5 (solo)  -                                   one cheap model does everything

All four arms use the Anthropic SDK with the same tool surface: str_replace_editor (view/create/str_replace/insert) and a bash tool with a 120-second timeout. The orchestrator arms (B, C) get one extra tool: delegate_to_executor.

Anthropic prompt caching is enabled identically on every call — system, tool definitions, and the most recent user message are marked with cache_control: ephemeral. No temperature or seed is set, so trial-to-trial variance reflects sampling noise.
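A minimal sketch of what "three cache breakpoints" could look like when building a Messages API request. The function name `build_cached_request` and its parameters are hypothetical, not taken from the repo; only the `cache_control: {"type": "ephemeral"}` marker itself is the documented API field. The sketch builds the request as a plain dict and makes no network call.

```python
from typing import Any

EPHEMERAL = {"type": "ephemeral"}


def build_cached_request(
    model: str,
    system_prompt: str,
    tools: list[dict[str, Any]],
    messages: list[dict[str, Any]],
    max_tokens: int = 4096,
) -> dict[str, Any]:
    """Return keyword arguments for a Messages API call with three cache breakpoints."""
    # Breakpoint 1: the system prompt, passed as a list of text blocks.
    system = [{"type": "text", "text": system_prompt, "cache_control": EPHEMERAL}]

    # Breakpoint 2: marking the last tool caches the whole tool list up to it.
    cached_tools = [dict(tool) for tool in tools]
    if cached_tools:
        cached_tools[-1]["cache_control"] = EPHEMERAL

    # Breakpoint 3: the most recent user message.
    cached_messages = [dict(msg) for msg in messages]
    for msg in reversed(cached_messages):
        if msg["role"] != "user":
            continue
        content = msg["content"]
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        else:
            content = [dict(block) for block in content]
        if content:
            content[-1]["cache_control"] = EPHEMERAL
        msg["content"] = content
        break

    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "tools": cached_tools,
        "messages": cached_messages,
    }
```

With the official SDK, the result would be passed along the lines of `anthropic.Anthropic().messages.create(**request)`, with the model id supplied by the caller.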

The three tasks

All three operate on the typer repository at commit b210c0e (v0.26.8, MIT license). Each trial starts with git checkout -- . && git clean -fd to restore the base state.

  • T1 — Breakage recovery: 25 errors injected via AST (10 mypy + 10 ruff + 5 pytest collection failures). The agent has to return the harness to fully green.
  • T2 — Refactor: Move get_params_from_function from typer/utils.py to a new module typer/_param_extractor.py. Update every import site. All tests still passing.
  • T3 — Feature-add: Implement get_version_banner(prefix, uppercase) -> str, re-export from typer/__init__.py, and pass a SHA-256-fingerprinted test file.

The judge

mypy + ruff check + pytest — exit code 0 = success, anything else = failure. Per-task verifiers (verify-T2.sh / verify-T3.sh) add structural checks (function actually moved to the new module, fingerprinted test unmodified, etc.).

No LLM is ever asked "is this OK?". The judgment is deterministic and reproducible.
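As a rough illustration of an exit-code-only judge, here is a minimal sketch. The function name `judge`, the default command arguments, and the timeout are assumptions for illustration; the repo's harness and its verify scripts may invoke the tools differently.

```python
import subprocess
from typing import Sequence

# Assumed invocations; the real harness may pass different arguments.
DEFAULT_CHECKS: tuple[tuple[str, ...], ...] = (
    ("mypy", "."),
    ("ruff", "check", "."),
    ("pytest", "-q"),
)


def judge(
    repo_dir: str,
    checks: Sequence[Sequence[str]] = DEFAULT_CHECKS,
    timeout_s: float = 600.0,
) -> bool:
    """Return True only if every check exits with code 0."""
    for cmd in checks:
        try:
            result = subprocess.run(
                list(cmd), cwd=repo_dir, capture_output=True, timeout=timeout_s
            )
        except (FileNotFoundError, subprocess.TimeoutExpired):
            # A missing tool or a hung check counts as a failure.
            return False
        if result.returncode != 0:
            return False
    return True
```

The design point is that nothing in the verdict depends on model output: the same working tree always produces the same answer.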

Results (success-only medians, n=3 per cell)

arm           task  n_succ/total  wall (s)  iters  cost ($)  success rate
A Opus solo   T1    3/3           253       36     1.74      1.00
A Opus solo   T2    3/4           233       26     1.11      0.75
A Opus solo   T3    3/3           69        6      0.17      1.00
B Opus+Qwen   T1    3/4           484       38     2.27      0.75
B Opus+Qwen   T2    3/3           443       27     1.38      1.00
B Opus+Qwen   T3    3/3           348       12     0.42      1.00
C Opus+Haiku  T1    3/3           400       28     1.67      1.00
C Opus+Haiku  T2    3/3           275       20     0.92      1.00
C Opus+Haiku  T3    3/3           145       11     0.38      1.00
D Haiku solo  T1    3/4           758       89     0.30      0.75
D Haiku solo  T2    3/4           507       70     0.23      0.75
D Haiku solo  T3    3/3           208       29     0.08      1.00

Bold = per-column winner. Total Anthropic spend across 40 trials: $35.98 — cheap for a paper.

The rows worth staring at are arm B's. On all three tasks, it has the highest cost of any arm ($2.27 / $1.38 / $0.42). Qwen's tokens cost zero. Opus + Qwen is more expensive than Opus alone anyway.

T3 (feature-add) Pareto frontier. X-axis = cost in USD, Y-axis = wall time in seconds. Arm B (orange) is dominated by both arm A (red) and arm C (green) — it is neither cheaper nor faster.

Why "Free" Cost the Most

Compare Opus-side token consumption (input + cache_read_input) across arms:

arm (role)        T1 tokens  T2 tokens  T3 tokens
A (Opus solo)     534,586    226,474    13,320
B (Opus + Qwen)   733,142    313,914    62,864
C (Opus + Haiku)  421,622    159,640    44,016

B-over-A ratio (Opus-side only): 1.38× on T1, 1.39× on T2, 5.26× on T3.

Qwen's tokens are free. But Opus itself is reading 1.4–5.3× more tokens than it would running alone.
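The ratios can be recomputed directly from the Opus-side token table. The sketch below is just that arithmetic; the names `OPUS_SIDE_TOKENS` and `ratio_vs_solo` are made up for illustration.

```python
# Opus-side tokens (input + cache_read_input), copied from the table above.
OPUS_SIDE_TOKENS = {
    "A": {"T1": 534_586, "T2": 226_474, "T3": 13_320},
    "B": {"T1": 733_142, "T2": 313_914, "T3": 62_864},
    "C": {"T1": 421_622, "T2": 159_640, "T3": 44_016},
}


def ratio_vs_solo(arm: str, baseline: str = "A") -> dict[str, float]:
    """Per-task ratio of an arm's Opus-side tokens to the baseline arm's."""
    return {
        task: OPUS_SIDE_TOKENS[arm][task] / OPUS_SIDE_TOKENS[baseline][task]
        for task in OPUS_SIDE_TOKENS[baseline]
    }


for arm in ("B", "C"):
    print(arm, {task: round(r, 2) for task, r in ratio_vs_solo(arm).items()})
```

Run on the table as printed, arm B comes out at roughly 1.37, 1.39, and 4.72. That is close to the quoted ratios but not identical, most visibly on T3; the quoted figures may have been derived from per-trial data rather than from these aggregates.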

The mechanism. When Opus calls delegate_to_executor, Qwen returns a stdout summary (capped at 4000 chars in my implementation). That summary lands in Opus's context. Anthropic prompt caching marks the most recent message for cache_write, and the next turn reads it back via cache_read. Across 30–80 turns, Opus ends up reading the "what Qwen did" summary over and over and over.

Each re-read is billed at the cache_read rate ($1.50 per million tokens, 10% of the Opus input rate). The executor is free; the orchestrator is not. Which sounds obvious in hindsight, except the word "free" in a sentence tends to short-circuit human reasoning. Mine, anyway.

Stated correctly: the orchestrator's cost is proportional to how many times it re-reads the executor's returned summaries, not to the executor's raw token count. This reads more like a middle-management observation than an LLM finding, but the data says what it says.

Free-Executor Paradox mechanism. Orchestrator (Opus) delegates to Executor (Qwen) via delegate_to_executor → Executor returns a stdout summary → the summary accumulates in the Orchestrator's context and gets re-read via cache_read on every subsequent turn. Even with Qwen's tokens free, the Orchestrator's cache_read keeps accumulating.
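The re-read mechanism can be made concrete with a toy model. The sketch below assumes every turn after the first re-reads the entire prior context from cache and that each turn appends a fixed number of tokens. Both are simplifications, and the function names are hypothetical. Only the $1.50 per million cache_read rate comes from the text above.

```python
CACHE_READ_USD_PER_MTOK = 1.50  # rate quoted in the post


def cumulative_cache_read(turns: int, base_tokens: int, tokens_added_per_turn: int) -> int:
    """Total tokens read from cache over a run under the toy model."""
    total = 0
    context = base_tokens  # written to cache on the first turn
    for _ in range(1, turns):  # every later turn re-reads the cached prefix
        total += context
        context += tokens_added_per_turn
    return total


def cache_read_cost_usd(tokens: int) -> float:
    """Dollar cost of reading `tokens` tokens from cache."""
    return tokens * CACHE_READ_USD_PER_MTOK / 1_000_000


# Illustrative inputs only: a solo run adding 500 tokens per turn versus a
# delegating run that also appends a ~1000-token executor summary each turn.
solo = cumulative_cache_read(turns=30, base_tokens=5_000, tokens_added_per_turn=500)
delegated = cumulative_cache_read(turns=30, base_tokens=5_000, tokens_added_per_turn=1_500)
print(solo, delegated, round(delegated / solo, 2))
```

Because the context grows every turn and is re-read every turn, the cumulative total grows roughly with the square of the turn count, which is why the per-turn summary size matters more than it first appears.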

Why T3 Blew Up to 5.3×

The most extreme case is T3, the smallest task: about 6 iterations for Opus solo.

Same mechanism, different ratio. The base context (system + tools + initial prompt) is cache_write-ed once on the first turn and cache_read cheaply thereafter. On long tasks (T1, T2), that base is a small fraction of the cumulative input. On a short task, it's a big fraction. Arm B also needs about twice as many iterations on T3 as Opus solo (12 vs 6 in the results table), so everything is re-read roughly twice as often. The "base re-read every turn + executor summary re-read every turn" overhead therefore dominates everything else, and T3's B/A ratio spikes to 5.3×.

Conversely, arm C (Opus + Haiku) has a smaller cache_read footprint than arm A on T1 and T2 (0.79× and 0.70× of A). Haiku does substantive work that Opus would have otherwise had to do itself, and the substance translates into useful summaries instead of dead weight. Which is the opposite end of the Qwen-summary-bloat story.

When Orchestration Would Win (cases I deliberately excluded)

The "strong orchestrator + cheap executor" recipe falters in iterative tool-loops because, over dozens of turns, the orchestrator's cache_read becomes the dominant cost line. One-shot routing has no such problem.

The experiment was, in that sense, designed against arm B:

  • Executor returns are free-form (Qwen stdout summary up to 4000 chars). If you constrain returns to "one structured diff and nothing else," the orchestrator's accumulated context shrinks.
  • Tasks are sequential (T1/T2/T3 cannot be parallelized within a single trial). Tasks where the orchestrator can dispatch "go edit these three places at once" might pay for orchestration overhead.
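One way to constrain executor returns structurally is to render a fixed-shape report with a hard character budget instead of passing free-form stdout back. The sketch below is hypothetical: `ExecutorReturn`, `render_for_orchestrator`, and the 1000-character default are illustrative choices, not the repo's implementation (which caps free-form stdout at 4000 characters).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorReturn:
    ok: bool
    files_changed: tuple[str, ...]
    diff: str


def render_for_orchestrator(result: ExecutorReturn, max_chars: int = 1000) -> str:
    """Render a fixed-shape, size-bounded report with no free-form narration."""
    status = "ok" if result.ok else "failed"
    header = f"status={status} files={','.join(result.files_changed)}"
    budget = max(0, max_chars - len(header) - 1)  # 1 for the newline
    diff = result.diff
    if len(diff) > budget:
        marker = "\n[diff truncated]"
        diff = diff[: max(0, budget - len(marker))] + marker
    # Final slice guarantees the bound even if the header alone is too long.
    return f"{header}\n{diff}"[:max_chars]
```

Whatever the executor prints, the orchestrator's context then grows by at most `max_chars` per delegation, which puts an upper bound on what gets re-read each turn.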

Re-running arm B with tightly-bounded executor returns is the next experiment on my list. I expect T3 to invert. T1 is harder to call.

Practical Takeaways

Sitting with these numbers, here is how my own agentic coding setup changed:

  1. For tasks that finish in a handful of iterations, Opus solo is the cheapest Opus-based option and the fastest arm overall. T3: $0.17, 69 seconds, 6 iterations. Only Haiku solo is cheaper ($0.08), at about three times the wall time (208 seconds). "Opus is expensive" is a one-shot framing. Across an iterative loop, Opus's per-iteration efficiency pays.
  2. For tasks that need dozens of iterations, the model with the lowest per-iteration cost wins on dollars. T1: Haiku solo at $0.30 is 5.5× cheaper than the next-cheapest arm (Opus + Haiku at $1.67). But it fails 25% of the time, so retry-adjusted expected cost narrows the gap to 4.2×.
  3. For cloud-only setups, Opus + Haiku is the most balanced. It roughly ties Opus solo on T1 ($1.67 vs $1.74), beats it on T2 ($0.92 vs $1.11), and loses to it on T3 ($0.38 vs $0.17). The safe pick if you don't want Haiku-solo's failure rate.
  4. If you're going to use a local Qwen "for free," constrain the executor return size structurally. Free-form stdout returns just shift the cost to the orchestrator's cache_read line.
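The retry-adjusted figure in takeaway 2 can be reproduced with a simple expected-cost model. This is a sketch under stated assumptions: attempts are independent, you retry until one succeeds, and a failed attempt costs the same as the success-only median. Failed runs that hit the iteration cap may well cost more, so treat the result as optimistic for Haiku solo. The function name is hypothetical.

```python
def retry_adjusted_cost(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success: cost times expected attempts (1/p)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# T1 values from the results table.
haiku_solo = retry_adjusted_cost(0.30, 0.75)
opus_haiku = retry_adjusted_cost(1.67, 1.00)
print(round(haiku_solo, 2), round(opus_haiku / haiku_solo, 1))
```

Under these assumptions Haiku solo's expected T1 cost rises from $0.30 to $0.40, and the gap to Opus + Haiku shrinks to about 4.2×.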

"Strong + cheap" composition has a narrower design surface than it seems. Unless you also specify what and how much the executor is allowed to return, you regenerate the "orchestrator-becomes-expensive" pattern. I regenerated it three times suspecting measurement error before finally accepting it.

Limitations

The honesty section:

  • n=3 per cell. Mann-Whitney U p-values use a normal approximation where 0.050 is the small-sample floor — it means "as different as this sample size can show." Trust the Cliff's delta effect sizes; don't over-read p-value differences.
  • All three tasks are on the typer repo. Generalization needs other codebases. The harness, breakage injector, runner, and analysis are all MIT-licensed in the repo, so reproducing this on your own codebase is cheap.
  • The orchestrator system prompt is asymmetric. It instructs "do not edit directly, delegate instead." This mirrors a real deployment shape but is a real confounder in the results.
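To see where the 0.050 floor in the first bullet comes from, here is a minimal pure-Python sketch of Cliff's delta and a normal-approximation Mann-Whitney p-value. It assumes no tie correction and no continuity correction, which may differ from the analysis scripts in the repo; the function names are hypothetical.

```python
import math
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def mann_whitney_p_normal(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two-sided p-value via the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(xs), len(ys)
    # U statistic for xs: wins count 1, ties count 0.5.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = abs(u - mean) / sd
    return math.erfc(z / math.sqrt(2))
```

With three samples per arm and complete separation, the delta is ±1 and this approximation bottoms out at about 0.0495. No amount of additional separation can push the p-value lower at n=3, so the effect size is the more informative number.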

Reproducing

Tested on Ubuntu 22.04 with Python 3.10+, uv 0.4+, and anthropic Python SDK 0.83+. Arm B uses Ollama 0.4+ running qwen3.5:9b. If you skip arm B, Ollama is not needed.

git clone https://github.com/kenimo49/free-executor-paradox
cd free-executor-paradox
# run arm A on T3, one trial
python scripts/runners/runner.py --arm A --task T3 --trial 1

The repo README and paper PDF have the full reproducibility setup — harness, breakage injection, runner, and analysis scripts are all included.

Closing

The cost debate around agentic coding tends to fixate on what the executor costs per token. The dominant term is actually what the orchestrator re-reads, and how often. Qwen here is just one instance of the pattern — every "free local executor" that comes next will hit the same issue. Free executor tokens don't make orchestrator cache_read free.

I wrote it up as a paper because numbers are harder to argue with than vibes. The most satisfying outcome would be someone replying "got the same thing on my codebase" or "actually got the opposite, here's why."

