DEV Community

Amit
Amit

Posted on • Originally published at artificialcuriositylabs.ai

The Cross-App Workflow Cost $5.44. Here's How We Got It to $0.17.

TL;DR

  • A hybrid agent running browser research + desktop spreadsheet creation cost $5.44 baseline. The same task, both legs optimized, costs $0.17 — a 32× reduction.
  • Fixing only the browser leg (Playwright instead of browser-use) dropped cost from $5.44 to $0.16. The browser leg was responsible for most of the original cost.
  • Fixing only the desktop leg (CLI-first prompt) dropped cost to $0.29. Helpful, but the browser was doing more damage.
  • The CLI-first desktop prompt saved ~$5 with Opus 4.8. It saved $0.01 with Sonnet 4.6 — because Sonnet found the CLI path on its own. Prompt instructions compensate for model behavior; better models need less compensating.

Post 1 established the hierarchy and the cost data for single-tier runs. Post 2 showed how to optimize within a tier. This post is about what happens when your task genuinely spans both — and how to bring that cross-app cost down to something you'd actually ship.

The hybrid task: find 3 highly-rated Mexican restaurants near 525 Market Street (browser), then create a real spreadsheet inside a desktop container with the results (desktop). One workflow, two tools, neither optional.

Baseline cost: $5.44. The original hybrid runs used a generic LLM browser agent (~60 Sonnet calls for the browser leg) and an unoptimized computer use loop with a base prompt for the desktop leg. It worked — including self-recovering from a desktop tool failure — but $5.44 per run scales badly.

The question: if we apply what we learned from the single-tier optimization runs, how far does this go?

The runs

Run Browser leg Desktop leg Wall Cost
Baseline browser-use (~60 LLM calls) Base prompt 21:52 $5.44
Browser fixed only Playwright (0 LLM calls) Base prompt 2:07 $0.16
Desktop fixed only browser-use + cache CLI-first prompt 12:10 $0.29
Both optimized Playwright (0 LLM calls) CLI-first prompt 2:12 $0.17

All runs used Sonnet 4.6 as the outer orchestration model. The baseline Opus 4.8 runs are from Post 1 — these are the Sonnet-equivalent comparisons run through a swappable test harness.

Isolating the legs

The isolation runs tell the clearest story.

Fixing only the browser leg ($0.16): Swap browser-use for Playwright on the browser research step. The desktop step continues to use the base prompt and computer use loop. Cost drops from $5.44 to $0.16 — almost the entire saving comes from the browser leg alone. The desktop leg, even unoptimized, contributes relatively little to the total when the browser leg is no longer making 60 model calls.

Fixing only the desktop leg ($0.29): Keep browser-use for the browser step, add the CLI-first prompt to the desktop step. Cost drops from $5.44 to $0.29. This is meaningful — but the browser leg still dominates because browser-use is still running its LLM loop.

The lesson: profile which leg is expensive before optimizing both. In a hybrid workflow with one expensive leg and one cheap leg, fixing the expensive leg first gives you most of the savings. The second fix is marginal by comparison.

The CLI-first prompt finding

The CLI-first desktop prompt instructs the computer use sub-agent to skip the GUI and go directly to libreoffice --headless for spreadsheet creation:

"For spreadsheets/documents: use libreoffice --headless FIRST. Write a CSV with bash, then convert: libreoffice --headless --convert-to ods file.csv. Never open LibreOffice through the GUI unless headless explicitly fails."

In the original Opus 4.8 hybrid runs, this instruction would have prevented a $2+ failure loop where the agent wrestled with the GUI, hit max_turns, and required the outer model to recover by suggesting the CLI approach itself.

In the Sonnet 4.6 runs, the same instruction saved $0.01 — because Sonnet found the CLI path on its own with the base prompt. 6 clean turns straight to libreoffice --headless, no GUI attempt.

This is a general pattern: prompt optimizations designed for one model tier are sometimes redundant at another. Opus needed the explicit instruction; Sonnet reasoned its way there without it. This doesn't mean the CLI-first prompt is useless — it's insurance against the failure mode, and it's two lines in a system prompt. But it's worth knowing that the savings are model-dependent.

What the optimized hybrid loses

The baseline hybrid demonstrated something the optimized hybrid doesn't: self-recovery from tool failure.

In the original runs, the desktop sub-agent failed twice (different failure modes each time). The outer Opus 4.8 model read the failure summary as text, diagnosed the root cause, and issued a completely different instruction on the retry — switching from GUI to CLI. That recovery behavior is what Post 3 is about.

The optimized hybrid never triggers that recovery, because the CLI-first prompt eliminates the failure mode entirely. 6 clean turns, task done, no retry needed.

For production: the optimized hybrid is the right default. $0.17, 2 minutes, task complete. For demonstrating self-recovery architecture: use the baseline. The recovery behavior is real and valuable — you just can't see it when the prompt prevents the failure.

The full picture

Across all configurations, the hierarchy holds and the cost ordering is consistent:

Approach Cost What's expensive
Playwright (browser only) $0.03 Almost nothing
Optimized hybrid $0.17 Desktop computer use loop (~$0.14)
Desktop only, tightened $1.17 Vision loop, even with convergence
Hybrid baseline $5.44 Browser-use LLM loop + desktop vision loop

The optimized hybrid ($0.17) sits above Playwright alone ($0.03) because the desktop leg requires computer use — there's no Playwright equivalent for LibreOffice. The browser leg is already at its floor; the desktop leg is what it is.

If the task could be restructured to avoid the desktop leg entirely (generate the ODS via Python or a CLI tool on the host, rather than inside a container), cost would drop further. But that changes the task. For genuinely cross-app workflows where the desktop leg is necessary, $0.17 is close to the floor with current tooling.

So what

If you're running hybrid agents that span browser and desktop:

  1. Profile first. Run the task and check CloudWatch per leg. The expensive leg is usually obvious — fix that one first.
  2. The browser leg is likely the bigger problem. browser-use is powerful and general, but its LLM loop is expensive. If the browser task has a stable DOM, a Playwright script eliminates that cost entirely.
  3. CLI-first prompts are cheap insurance. Two lines in a system prompt. They prevent the expensive GUI failure mode. Whether they save $5 or $0.01 depends on your model tier, but they're worth adding regardless.
  4. Self-recovery is still there if you need it. Eliminating the prompt-preventable failure mode doesn't remove the recovery architecture. New, unpredictable failures will still surface; the outer model will still handle them. Optimization and resilience aren't in conflict.

Part 4 of the Agent Tooling series.
← Part 3: Router or agent under failure

Top comments (0)