TL;DR
- Computer use is 45× more expensive than a structured API for the same task — not a tuning problem, it's the architecture. Vision agents pay for every screenshot of every intermediate state.
- The hierarchy holds: Native API → Existing connector → Browser automation → Computer use → Human last resort. But "browser automation" means a purpose-built Playwright script, not a generic LLM browser agent. The difference is $0.03 vs $3.41 for the same task.
- Eight runs across browser, desktop, and hybrid configurations: cost ranged from $0.03 to $5.44. The spread came entirely from implementation choices, not the model.
- The optimized hybrid (Playwright browser + computer use with CLI-first prompt) ran the cross-app task for $0.17 — 32× cheaper than the $5.44 baseline. Both legs optimized independently; the browser leg dominated the savings.
- Only reach for computer use when nothing above it covers the task — legacy desktop apps, third-party SaaS with no usable endpoints, genuine cross-app workflows.
I run a fleet of autonomous agents for my own daily operations — research, scheduling, knowledge management, recurring workflows. Most of those agents only need APIs or MCP connectors to do their work. But some tasks don't have structured endpoints — and the same is true at enterprise scale. Verify patient eligibility on an insurer's web portal, then update the record in a legacy desktop EHR. Pull supplier pricing from a vendor site, then enter it into an on-premise ERP that predates REST APIs. Extract data from a web-based reporting dashboard, then reconcile it in a desktop compliance tool. The pattern is the same whether you're building for yourself or for thousands of tenants: one half of the workflow lives in a browser, the other half lives in a desktop app that was never going to ship an API. Either tool alone leaves the other half stranded.
So I ran experiments. Not to benchmark a model release — Opus 4.8 happened to drop during this work and I used it — but to answer a practical question: when an agent needs to interact with a UI, what does that actually cost to run, and what's the right architecture? Computer use? Browser automation? Both, somehow?
The answer turned out to be sharper than I expected, and it's the headline of this post.
This post lays out the hierarchy and the experimental data behind it. The posts that follow go deeper into specific findings.
The hierarchy you should actually use:
┌─────────────────────────────────────────────────────┐
│ Tier 1: Native API                         cheapest │
│      ↓ (only if no API exists)                      │
│ Tier 2: Existing connector (MCP, vendor SDK)        │
│      ↓ (only if no connector)                       │
│ Tier 3: Browser-based automation                    │
│      ↓ (only if web UI not enough)                  │
│ Tier 4: Computer use (vision + mouse/keyboard)      │
│      ↓ (only if model can't complete it)            │
│ Tier 5: Human                                most $ │
└─────────────────────────────────────────────────────┘
Walk down it in order. Stop at the first tier that fits the task. Don't skip ahead because computer use is more general or because the demo was impressive: generality carries a 45× cost penalty against a structured API. In a Reflex.dev benchmark on an admin-panel task, vision-based computer use needed 53 steps, 551K input tokens, and 17 minutes. The equivalent through structured API calls took 8 calls, 12K tokens, and 20 seconds. That works out to 45× cheaper and 51× faster, with the same correctness. That's not a tuning problem; it's the architecture. Vision agents pay for every screenshot of every intermediate state.
This isn't a contrarian take — it's roughly the hierarchy Anthropic itself recommends. What's missing from most marketing copy is how steep the cost gap is between tiers.
I confirmed it from the other direction with eight harness configurations on Bedrock. Cost ranged from $0.03 to $5.44 on the same task. The 180× spread came entirely from implementation choices — not the model. A purpose-built Playwright script at $0.03 proved that browser automation done right is structurally cheaper than even the best-tuned computer use run ($1.17). And the same optimization logic applied to the hybrid brought that cross-app task from $5.44 down to $0.17.
For a SaaS tenant doing 1,000 lookups a month (12,000 a year), that gap is about $14K per year at $1.17 per lookup versus about $65K at $5.44, and the API path would be ~$400. Same outcome.
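The annual figures are per-lookup cost multiplied by 12,000 lookups a year. A minimal sketch of that arithmetic follows. It assumes the $65K figure corresponds to the $5.44 baseline run, the $14K figure to the $1.17 tuned computer-use run, and the ~$400 figure to a $0.03 structured path; that mapping is my reading of the numbers, not something the text states outright.

```python
LOOKUPS_PER_MONTH = 1_000

# Per-lookup costs from the run tables. Which run backs which annual
# figure is an assumption made for this illustration.
COST_PER_LOOKUP = {
    "baseline hybrid": 5.44,
    "tuned computer use": 1.17,
    "structured path": 0.03,
}


def annual_cost(cost_per_lookup: float, lookups_per_month: int = LOOKUPS_PER_MONTH) -> float:
    """Yearly spend in USD for a fixed monthly lookup volume."""
    return round(cost_per_lookup * lookups_per_month * 12, 2)


if __name__ == "__main__":
    for name, cost in COST_PER_LOOKUP.items():
        print(f"{name}: ${annual_cost(cost):,.0f} per year")
```

Under those assumptions the three results are $65,280, $14,040, and $360, which line up with the rounded figures quoted above.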
The architectural hierarchy is the buried lede. Computer use isn't bad — it's powerful in the cases it covers. But the default reach for it (because it's general, because Anthropic markets it well, because it works for demos) is wrong for most production loops. The rest of this post is the data behind that claim.
What tiers 3 and 4 actually look like
If you're at tier 3 or 4 — browser tool, computer use, or some hybrid of the two — there are three architectural shapes worth knowing apart. They're not interchangeable, and confusing them is a common mistake.
| Loop | What it is | How it works | Strengths | Weaknesses |
|---|---|---|---|---|
| Tool-calling loop (Playwright, browser-use, AgentCore Browser) | Reliable execution backend with structured actions | Agent calls click(selector), type(field, text), screenshot() against a browser via CDP | Fast, cheap, precise. 89-92% success on common browser tasks. | Browser-only. Needs selectors. |
| Vision loop (Anthropic computer use, Opus 4.8) | Vision-based intelligence controlling a desktop | Screenshot → LLM reasons over pixels → mouse/keyboard actions | General purpose. Works on any app. 84% on Online-Mind2Web for Opus 4.8. | Slower. More expensive. Lower precision. |
| Hybrid | Tool-calling backend + vision as supervisor/fallback | Outer agent decides per turn which tool to call | Best of both for genuinely cross-app tasks. | More moving parts. Most expensive. |
Mapping this back to the hierarchy: the tool-calling loop is what tier 3 (browser-based automation) actually is. The vision loop is what tier 4 (computer use) actually is. Hybrid is when your task spans both — and you should only reach for it when no API or single-tool path covers the workflow.
These success rates come from public benchmark reports: Playwright-driven agent frameworks land in the high 80s to low 90s (percent) on common browser tasks, while vision-based computer use lands around 84%. Hybrids outperform pure vision on web tasks because selectors beat pixel guessing.
When to use which
The industry shorthand is "80% Playwright, 20% browser-use." That's incomplete — it leaves out computer use entirely. Here's the actual decision tree:
Can the task be done with stable DOM selectors?
├── YES → Playwright (code, 0 LLM calls in loop)
└── NO: does it require a browser?
    ├── YES, DOM accessible → browser-use (LLM adapts per page)
    └── NO browser / need desktop → Computer use (vision loop)
In practice:
| Situation | Right tool | Why |
|---|---|---|
| Stable DOM, predictable navigation | Playwright | No LLM needed in the loop |
| Dynamic UI, novel pages, site changes | browser-use | LLM adapts, but ~60 calls per task |
| Desktop apps (LibreOffice, Excel, legacy software) | Computer use | No browser alternative exists |
| Site blocks CDP / Playwright connections | Computer use | Browser path unreachable |
| Cross-app: stable web + desktop | Playwright + Computer use | Each leg optimized independently |
| Cross-app: dynamic web + desktop | browser-use + Computer use | More expensive, but covers both |
The mistake most builders make is treating computer use as a more powerful browser-use. It isn't. It's for tasks the browser can't reach — desktop apps, sites that block CDP, or workflows that need both. If the DOM is accessible, computer use is always the more expensive option.
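The decision tree can be written down as a small routing function. This is a sketch only: the function name, the enum, and the three boolean inputs are illustrative choices, and a real router would have to derive those booleans from the task somehow.

```python
from enum import Enum


class Tool(Enum):
    PLAYWRIGHT = "Playwright"
    BROWSER_USE = "browser-use"
    COMPUTER_USE = "Computer use"


def choose_tool(stable_selectors: bool, needs_browser: bool, dom_accessible: bool) -> Tool:
    """Pick the tool for one leg of a task, following the tree top to bottom."""
    if stable_selectors:
        # Stable DOM: plain code, no LLM calls in the loop.
        return Tool.PLAYWRIGHT
    if needs_browser and dom_accessible:
        # Dynamic pages that are still reachable through the DOM.
        return Tool.BROWSER_USE
    # Desktop apps, or sites where the browser path is unreachable.
    return Tool.COMPUTER_USE


if __name__ == "__main__":
    # Cross-app workflow: route each leg independently.
    web_leg = choose_tool(stable_selectors=True, needs_browser=True, dom_accessible=True)
    desktop_leg = choose_tool(stable_selectors=False, needs_browser=False, dom_accessible=False)
    print(web_leg.value, "+", desktop_leg.value)
```

Routing per leg rather than per task is what produces the "Playwright + Computer use" row in the table: the web half and the desktop half each get the cheapest tool that can reach them.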
If you've been treating these three as interchangeable, stop. The choice within tiers 3-4 is the most important implementation decision you'll make. The data shows the cost spread within these two tiers is larger than the cost spread between most models.
Test setup
Every run was fully autonomous — no human in the loop between task prompt and final output.
The stack: a Strands outer agent on Amazon Bedrock received the task and decided which tools to call. For browser work, it called a tool that provisioned a fresh AgentCore Browser session — AWS-managed Chromium running in an isolated container — and drove it either via browser-use (an LLM sub-agent that reasons over screenshots and generates click/type actions) or via a purpose-built Playwright script (code that navigates by URL and queries the DOM directly). For desktop work, it called a tool that booted an ephemeral Docker container with Xvfb, LibreOffice, and a terminal, then ran Anthropic's computer-use loop against it. Each tool call was one round trip with a fresh sandbox; no state persisted across calls.
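To make the "purpose-built Playwright script" concrete, here is a minimal sketch of the shape: navigate by URL, read the DOM with selectors, and rank the results in plain code, with no model call anywhere in the loop. It is not the script used in these runs. The URL and CSS selectors are hypothetical parameters, and it launches a local headless Chromium for simplicity, whereas the runs described here drove an AgentCore Browser session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    rating: float


def top_rated(names: list[str], ratings: list[str], k: int = 3) -> list[Place]:
    """Pair scraped name/rating strings and keep the k highest-rated."""
    places = []
    for name, raw in zip(names, ratings):
        try:
            places.append(Place(name.strip(), float(raw.strip())))
        except ValueError:
            continue  # skip cards whose rating text is not a number
    return sorted(places, key=lambda p: p.rating, reverse=True)[:k]


def to_markdown(places: list[Place]) -> str:
    """Render the shortlist as a markdown table."""
    lines = ["| Name | Rating |", "|---|---|"]
    lines += [f"| {p.name} | {p.rating:.1f} |" for p in places]
    return "\n".join(lines)


def scrape(url: str, name_selector: str, rating_selector: str) -> list[Place]:
    """Fetch one results page and extract listings via DOM queries only."""
    # Imported lazily so the pure helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            names = page.locator(name_selector).all_inner_texts()
            ratings = page.locator(rating_selector).all_inner_texts()
        finally:
            browser.close()
    return top_rated(names, ratings)
```

The trade-off is visible in the signature: the selectors have to be known in advance, which is exactly the "stable DOM" precondition for choosing this path.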
All costs are real token counts pulled from CloudWatch (AWS/Bedrock namespace, InputTokenCount, OutputTokenCount, cache read/write metrics) and converted to USD using public Bedrock pricing. Wall times are measured end-to-end including container boot, browser provisioning, and formatting.
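The conversion from token counts to dollars is a weighted sum. A minimal sketch, assuming four token buckets (input, output, cache read, cache write) priced per million tokens. The class names are illustrative, and the rates in the example are placeholders rather than real Bedrock prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens, supplied by the caller."""
    input: float
    output: float
    cache_read: float
    cache_write: float


def cost_usd(usage: Usage, pricing: Pricing) -> float:
    """Sum each token bucket at its own rate and convert to USD."""
    total = (
        usage.input_tokens * pricing.input
        + usage.output_tokens * pricing.output
        + usage.cache_read_tokens * pricing.cache_read
        + usage.cache_write_tokens * pricing.cache_write
    )
    return total / 1_000_000


# Placeholder rates for illustration only; not taken from any price list.
EXAMPLE_PRICING = Pricing(input=1.00, output=5.00, cache_read=0.10, cache_write=1.25)

if __name__ == "__main__":
    run = Usage(input_tokens=1_000_000, output_tokens=100_000)
    print(f"${cost_usd(run, EXAMPLE_PRICING):.2f}")
```

Keeping cache reads in their own bucket matters for the cached runs: those tokens are still counted, just at a different rate.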
         Task prompt
              │
              ▼
┌─────────────────────────────────────┐
│   Outer agent (Strands + Bedrock)   │
│   Decides which tool to call        │
└─────────────┬───────────────────────┘
              │
     ┌────────┴────────┐
     │                 │
     ▼                 ▼
┌─────────┐       ┌──────────┐
│ Browser │       │ Desktop  │
│ tool    │       │ tool     │
└────┬────┘       └────┬─────┘
     │                 │
     ▼                 ▼
┌─────────┐   ┌───────────────────┐
│AgentCore│   │ Docker container  │
│ Browser │   │ Xvfb + LibreOffice│
│(managed │   │ + terminal        │
│ Chrome) │   └────────┬──────────┘
└────┬────┘            │
     │           computer-use
     │          loop (Bedrock)
browser-use            │
OR Playwright          │
     │                 │
     └────────┬────────┘
              │
              ▼
        Final output
(costs logged to CloudWatch)
Each tool call gets a fresh sandbox — new browser session or new container — torn down after the call. No shared state between calls.
The tests
Two benchmark tasks, all costs from CloudWatch.
Simple task — browser research only:
"It's Thursday at lunchtime. I'm at 525 Market Street. Find me 3 highly-rated Mexican lunch options within 5 minutes' walk. Save as markdown."
A deliberately messy task — fuzzy criteria, a site that blocks scrapers, location-dependent results with no structured endpoint. Chosen because it forces real decision-making rather than form-filling.
| Run | What drives the browser | Wall | Cost |
|---|---|---|---|
| Playwright structured | Code — 0 LLM calls in browser loop | 1:09 | $0.03 |
| browser-use + cache | LLM (~60 Sonnet calls, cached) | 17:48 | $2.43 |
| browser-use vanilla | LLM (~60 Sonnet calls, uncached) | 20:11 | $3.41 |
| Computer use, tightened prompt | Vision loop, convergence rules | 5:11 | $1.17 |
| Computer use, base prompt | Vision loop, no stopping criteria | 20:30 (timeout) | $2.13 |
Note: Playwright and browser-use runs used Sonnet 4.6 as inner model; computer use runs used Opus 4.8. The ordering holds at comparable model tiers but the gaps narrow.
Hybrid task — browser research + desktop spreadsheet creation:
"Find 3 highly-rated Mexican restaurants near 525 Market Street. Then create a spreadsheet at /tmp/lunch.ods inside the desktop container with columns Name | Rating | Walking time."
One half lives in a browser; the other half lives in a desktop app. Neither tool alone finishes the job.
| Run | Browser leg | Desktop leg | Wall | Cost |
|---|---|---|---|---|
| Optimized | Playwright (0 LLM calls) | CLI-first prompt | 2:12 | $0.17 |
| Browser fixed only | Playwright (0 LLM calls) | Base prompt | 2:07 | $0.16 |
| Desktop fixed only | browser-use + cache | CLI-first prompt | 12:10 | $0.29 |
| Baseline | browser-use (LLM-driven) | Base prompt | 21:52 | $5.44 |
Reading the tables
Simple task table — three findings:
The Playwright row changes the browser tier argument. The browser-use runs ($3.41/$2.43) made tier 3 look expensive compared to tuned computer use ($1.17). But browser-use is a general LLM vision agent that makes ~60 model calls per task. A purpose-built Playwright script with zero LLM calls in the browser loop ran the same task for $0.03. That's what tier 3 looks like at its best. The hierarchy holds; the browser-use numbers were measuring the wrong implementation.
Within the browser tier, implementation dominates. $3.41 (generic LLM agent) → $2.43 (cached) → $0.03 (Playwright). Most of the $3.38 savings came from removing the LLM from the loop entirely, not from caching it. Caching is a 29% improvement; eliminating the calls is a 99% improvement.
Tier 3 done right beats tier 4 done right by 39×. Playwright at $0.03 vs computer use at $1.17. The structural reason is the same as the API vs computer use gap: when you remove LLM calls from the execution loop, cost collapses.
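The percentages and multiples in these findings are plain ratios of the costs in the simple-task table. A short sketch that recomputes them (the helper names are illustrative):

```python
# Costs in USD from the simple-task table.
SIMPLE_TASK = {
    "Playwright structured": 0.03,
    "browser-use + cache": 2.43,
    "browser-use vanilla": 3.41,
    "Computer use, tightened prompt": 1.17,
    "Computer use, base prompt": 2.13,
}


def pct_saved(before: float, after: float) -> int:
    """Percentage reduction going from `before` to `after`, rounded."""
    return round(100 * (before - after) / before)


def times_cheaper(expensive: float, cheap: float) -> int:
    """How many times cheaper `cheap` is than `expensive`, rounded."""
    return round(expensive / cheap)


if __name__ == "__main__":
    vanilla = SIMPLE_TASK["browser-use vanilla"]
    print("caching:", pct_saved(vanilla, SIMPLE_TASK["browser-use + cache"]), "%")
    print("no LLM in loop:", pct_saved(vanilla, SIMPLE_TASK["Playwright structured"]), "%")
    print(
        "Playwright vs tuned computer use:",
        times_cheaper(SIMPLE_TASK["Computer use, tightened prompt"], SIMPLE_TASK["Playwright structured"]),
        "x",
    )
```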
Hybrid task table — two findings:
The browser leg dominates. Fixing only the browser leg (Playwright, base desktop) dropped the hybrid from $5.44 to $0.16 — almost all of the savings. Fixing only the desktop leg ($0.29) helped, but the browser leg was doing most of the work in the original cost.
CLI-first prompt is a model-dependent optimization. Fixing both legs ($0.17) is nearly identical to fixing just the browser leg ($0.16). The CLI-first desktop instruction barely moved the cost, because Sonnet 4.6 with the base prompt was already efficient and found the headless CLI path on its own. The CLI-first instruction was designed to fix Opus 4.8, which would fight the GUI before recovering. With Sonnet, it changed the cost by $0.01, and the run with the instruction was the slightly more expensive one. The lesson: prompt instructions compensate for model behavior; better models need less compensating.
The hybrid runs worth highlighting
The baseline hybrid ($5.44) surfaced something the single-tier runs didn't: when the inner desktop sub-agent failed (twice, different failure modes), the outer model noticed the text summary of the failure, diagnosed the problem, and issued a completely new instruction — switching from driving LibreOffice through the GUI to using libreoffice --headless from the command line. The second attempt succeeded. That self-recovery behavior is the subject of the third post.
The optimized hybrid ($0.17) never triggered recovery — because the CLI-first prompt eliminated the failure mode entirely. Six clean desktop turns, straight to libreoffice --headless, no failed GUI attempt. The cross-app task completed for $0.17 vs $5.44. Same capability, 32× cheaper.
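For readers who haven't used the headless path, here is one way a spreadsheet with those columns can be produced without touching the GUI: write a CSV, then let LibreOffice convert it. This is a sketch of the general technique, not the commands the agent actually issued, which aren't reproduced here. The function names and the sample row are hypothetical.

```python
import csv
import subprocess
from pathlib import Path

HEADER = ["Name", "Rating", "Walking time"]


def write_csv(rows: list[list[str]], csv_path: Path) -> None:
    """Write the header plus data rows as a plain CSV file."""
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


def convert_command(csv_path: Path, out_dir: Path) -> list[str]:
    """Build the headless LibreOffice invocation that converts CSV to ODS."""
    return [
        "libreoffice", "--headless",
        "--convert-to", "ods",
        "--outdir", str(out_dir),
        str(csv_path),
    ]


def build_spreadsheet(rows: list[list[str]], out_dir: Path = Path("/tmp"), stem: str = "lunch") -> Path:
    """Create <out_dir>/<stem>.ods; requires LibreOffice on PATH."""
    csv_path = out_dir / f"{stem}.csv"
    write_csv(rows, csv_path)
    subprocess.run(convert_command(csv_path, out_dir), check=True)
    return out_dir / f"{stem}.ods"


if __name__ == "__main__":
    # Hypothetical sample row, for illustration only.
    print(build_spreadsheet([["Example Taqueria", "4.6", "4 min"]]))
```

No screenshots are involved at any point, which is why the desktop leg gets so much cheaper once the model goes straight to this route.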
The two runs together show the tradeoff: if you want to observe self-recovery behavior, run the baseline. If you want the cheapest production path, run the optimized version. For production, $0.17 is the right number — and the recovery architecture is still there if a different failure mode appears.
What the experiments showed
Three patterns across eight runs.
1. "Browser automation" is not one thing — and the difference is $3.38.
A generic LLM browser agent costs $3.41. A purpose-built Playwright script costs $0.03. Same AWS infrastructure, same task, same output quality. The difference is whether an LLM decides every click or code does. Most browser automation benchmarks don't measure this gap because they test the agent, not the implementation.
2. The browser leg dominates hybrid cost — fix it first.
In the hybrid runs, fixing only the browser leg (Playwright) dropped cost from $5.44 to $0.16. Fixing only the desktop prompt dropped it to $0.29. The browser was doing most of the damage. When optimizing a multi-leg workflow, profile which leg is expensive before tuning both.
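Profiling which leg to fix first can be as simple as comparing what the run costs after each single fix. A sketch using the hybrid table's numbers, with an illustrative helper name:

```python
BASELINE = 5.44

# Cost in USD after applying exactly one fix, from the hybrid table.
# Note: the desktop-only run also enabled caching on the browser leg,
# so it is not a strictly single-variable comparison.
SINGLE_FIX_COST = {
    "browser leg (Playwright)": 0.16,
    "desktop leg (CLI-first prompt)": 0.29,
}


def fix_first(baseline: float, single_fix_cost: dict[str, float]) -> tuple[str, float, float]:
    """Return (leg, remaining_cost, saved) for the fix that leaves the lowest cost."""
    leg = min(single_fix_cost, key=single_fix_cost.get)
    remaining = single_fix_cost[leg]
    return leg, remaining, round(baseline - remaining, 2)


if __name__ == "__main__":
    leg, remaining, saved = fix_first(BASELINE, SINGLE_FIX_COST)
    print(f"fix first: {leg} (saves ${saved:.2f}, leaves ${remaining:.2f})")
```

With more than two legs, the same comparison works as long as each leg has one run where it alone was changed.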
3. Prompt instructions compensate for model behavior; better models need less compensating.
The CLI-first desktop prompt saved ~$5 with Opus 4.8 (by preventing a GUI fight) and moved the cost by only $0.01 with Sonnet 4.6 (which found the CLI path on its own). This is a general pattern: optimization techniques designed for one model tier may be redundant or irrelevant at another.
The hierarchy isn't theoretical. The cost gaps are structural. Vision-based loops pay for every screenshot of every intermediate state. Structured paths don't.
The thing I'm still unsure about
In both hybrid runs, the inner desktop agent saved files inside the ephemeral container and did not proactively return them. The outer model noticed and offered to copy them, but treated persistence as optional.
This raises a real architectural question: When a sub-agent produces artifacts, should the system assume they need to be explicitly returned, or should there be a convention for automatic (but safe) propagation?
I chose explicit for now. At small scale this is fine. At larger scale it may create either lost work or excessive cognitive load on the outer model. I haven't decided which risk is worse.
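One possible shape for the explicit convention: the sub-agent declares which paths it produced, and the harness copies only those out before teardown, reporting anything declared but missing. This is a hypothetical sketch, not the harness used here. `ToolResult`, `collect_declared`, and the dict standing in for the container filesystem are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    summary: str
    # Path -> file contents, copied out of the sandbox before teardown.
    artifacts: dict[str, bytes] = field(default_factory=dict)
    # Declared paths that did not exist when the sandbox was inspected.
    missing: list[str] = field(default_factory=list)


def collect_declared(
    summary: str,
    declared_paths: list[str],
    sandbox_files: dict[str, bytes],
) -> ToolResult:
    """Return only the artifacts the sub-agent explicitly declared."""
    result = ToolResult(summary=summary)
    for path in declared_paths:
        if path in sandbox_files:
            result.artifacts[path] = sandbox_files[path]
        else:
            result.missing.append(path)
    return result


if __name__ == "__main__":
    # The dict stands in for the ephemeral container's filesystem.
    sandbox = {"/tmp/lunch.ods": b"<ods bytes>", "/tmp/scratch.csv": b"a,b"}
    res = collect_declared("spreadsheet written", ["/tmp/lunch.ods"], sandbox)
    print(sorted(res.artifacts), res.missing)
```

Undeclared files such as the scratch CSV stay behind. That is the safety property of the explicit approach, and also where the lost-work risk comes from.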
So what
The hierarchy that actually holds up in practice is:
Native API → Existing connector → Browser-based automation → Computer use → Human last resort.
Walk down it in order. Stop at the first tier that can do the job. Only reach for computer use when nothing above it works — legacy desktop apps, third-party SaaS with no usable endpoints, or genuine cross-app workflows that no single structured surface covers.
The experiments confirmed what the Reflex benchmark suggested: the cost gaps are structural, not a tuning problem. Vision agents pay for rendering every intermediate state. Structured paths give the model the data directly. Even well-tuned computer use is dramatically more expensive than the tiers above it for most work.
Computer use is genuinely powerful in the narrow set of cases it uniquely enables. Treating it as the default because the demos look impressive is one of the fastest ways to turn a reasonable automation project into an expensive recurring cost.
If you only read this post, the takeaway is simple: pick the cheapest tier that can actually complete the task. The model is the easy part.
If you want to go deeper
The series digs into each finding separately:
- Harness optimization — exactly where the money went in the $3.41 → $1.17 runs, and why convergence rules were the highest-leverage change.
- Router or agent under failure — the self-recovery behavior the baseline hybrid showed, why it only appears in layered architectures, and the test most systems sold as "agents" still fail.
Part 1 of the Agent Tooling series.
Part 2: Harness optimization →