Amit

Posted on Jun 6 • Originally published at artificialcuriositylabs.ai

API → Connector → Browser → Computer Use → Human: A Cost-Justified Hierarchy for Agent Tooling

#agents #patterns #ainative

TL;DR

Computer use is 45× more expensive than a structured API for the same task — not a tuning problem, it's the architecture. Vision agents pay for every screenshot of every intermediate state.
The hierarchy holds: Native API → Existing connector → Browser automation → Computer use → Human last resort. But "browser automation" means a purpose-built Playwright script, not a generic LLM browser agent. The difference is $0.03 vs $3.41 for the same task.
Eight runs across browser, desktop, and hybrid configurations: cost ranged from $0.03 to $5.44. The spread came entirely from implementation choices, not the model.
The optimized hybrid (Playwright browser + computer use with CLI-first prompt) ran the cross-app task for $0.17 — 32× cheaper than the $5.44 baseline. Both legs optimized independently; the browser leg dominated the savings.
Only reach for computer use when nothing above it covers the task — legacy desktop apps, third-party SaaS with no usable endpoints, genuine cross-app workflows.

I run a fleet of autonomous agents for my own daily operations — research, scheduling, knowledge management, recurring workflows. Most of those agents only need APIs or MCP connectors to do their work. But some tasks don't have structured endpoints — and the same is true at enterprise scale. Verify patient eligibility on an insurer's web portal, then update the record in a legacy desktop EHR. Pull supplier pricing from a vendor site, then enter it into an on-premise ERP that predates REST APIs. Extract data from a web-based reporting dashboard, then reconcile it in a desktop compliance tool. The pattern is the same whether you're building for yourself or for thousands of tenants: one half of the workflow lives in a browser, the other half lives in a desktop app that was never going to ship an API. Either tool alone leaves the other half stranded.

So I ran experiments. Not to benchmark a model release — Opus 4.8 happened to drop during this work and I used it — but to answer a practical question: when an agent needs to interact with a UI, what does that actually cost to run, and what's the right architecture? Computer use? Browser automation? Both, somehow?

The answer turned out to be sharper than I expected, and it's the headline of this post.

This post lays out the hierarchy and the experimental data behind it. The posts that follow go deeper into specific findings.

The hierarchy you should actually use:

┌─────────────────────────────────────────────────────┐
│  Tier 1: Native API                       cheapest  │
│         ↓ (only if no API exists)                   │
│  Tier 2: Existing connector (MCP, vendor SDK)       │
│         ↓ (only if no connector)                    │
│  Tier 3: Browser-based automation                   │
│         ↓ (only if web UI not enough)               │
│  Tier 4: Computer use (vision + mouse/keyboard)     │
│         ↓ (only if model can't complete it)         │
│  Tier 5: Human                            most $    │
└─────────────────────────────────────────────────────┘

Walk down it in order. Stop at the first tier that fits the task. Don't skip ahead to computer use because it is more general or because the demo was impressive: generality carries a 45× cost penalty against a structured API. In a Reflex.dev benchmark on an admin-panel task, vision-based computer use needed 53 steps, 551K input tokens, and 17 minutes. The equivalent through structured API calls took 8 calls, 12K tokens, and 20 seconds: 45× cheaper, 51× faster, same correctness. That's not a tuning problem; it's the architecture. Vision agents pay for every screenshot of every intermediate state.
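The ratios quoted from the Reflex.dev benchmark can be checked directly from the reported figures. A quick sketch, assuming input-token count is a fair proxy for cost (that holds only if both paths are billed at the same per-token price, which the text does not state):

```python
# Figures as quoted from the Reflex.dev benchmark in the paragraph above.
vision = {"steps": 53, "input_tokens": 551_000, "seconds": 17 * 60}
api = {"steps": 8, "input_tokens": 12_000, "seconds": 20}

# Token ratio as a cost proxy (assumption: same per-token price on both paths).
token_ratio = vision["input_tokens"] / api["input_tokens"]
time_ratio = vision["seconds"] / api["seconds"]

print(f"input tokens: {token_ratio:.1f}x, wall time: {time_ratio:.0f}x")
```

The token ratio lands just under 46, which the post rounds down to 45×.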

This isn't a contrarian take — it's roughly the hierarchy Anthropic itself recommends. What's missing from most marketing copy is how steep the cost gap is between tiers.

I confirmed it from the other direction with eight harness configurations on Bedrock. Cost ranged from $0.03 to $5.44 across two closely related tasks: a browser-only lookup and a cross-app variant of it. The roughly 180× spread came from implementation choices, not the model. A purpose-built Playwright script at $0.03 proved that browser automation done right is structurally cheaper than even the best-tuned computer use run ($1.17). And the same optimization logic applied to the hybrid brought the cross-app task from $5.44 down to $0.17.

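At a SaaS tenant doing 1,000 lookups a month (12,000 a year), that gap is about $14K per year at the tuned computer-use price ($1.17 per task) vs $65K at the baseline hybrid price ($5.44 per task), and the API path would be ~$400. Same outcome.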
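The annual figures can be reproduced from the per-task costs measured in this post. A minimal sketch, assuming the $14K and $65K figures correspond to the $1.17 tuned computer-use run and the $5.44 baseline hybrid (the arithmetic matches); the ~$400 API figure is not derived here, and the Playwright row is included only for comparison:

```python
LOOKUPS_PER_MONTH = 1_000

# Per-task costs measured in this post's runs (USD).
per_task_usd = {
    "playwright_script": 0.03,
    "tuned_computer_use": 1.17,
    "baseline_hybrid": 5.44,
}


def annual_cost(per_task: float, per_month: int = LOOKUPS_PER_MONTH) -> float:
    """Yearly spend for a fixed monthly task volume."""
    return round(per_task * per_month * 12, 2)


annual = {name: annual_cost(cost) for name, cost in per_task_usd.items()}
for name, usd in annual.items():
    print(f"{name}: ${usd:,.0f}/year")
```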

The architectural hierarchy is the buried lede. Computer use isn't bad — it's powerful in the cases it covers. But the default reach for it (because it's general, because Anthropic markets it well, because it works for demos) is wrong for most production loops. The rest of this post is the data behind that claim.

What tiers 3 and 4 actually look like

If you're at tier 3 or 4 — browser tool, computer use, or some hybrid of the two — there are three architectural shapes worth knowing apart. They're not interchangeable, and confusing them is a common mistake.

| Loop | What it is | How it works | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Tool-calling loop (Playwright, browser-use, AgentCore Browser) | Reliable execution backend with structured actions | Agent calls `click(selector)`, `type(field, text)`, `screenshot()` against a browser via CDP | Fast, cheap, precise. 89-92% success on common browser tasks. | Browser-only. Needs selectors. |
| Vision loop (Anthropic computer use, Opus 4.8) | Vision-based intelligence controlling a desktop | Screenshot → LLM reasons over pixels → mouse/keyboard actions | General purpose. Works on any app. 84% on Online-Mind2Web for Opus 4.8. | Slower. More expensive. Lower precision. |
| Hybrid | Tool-calling backend + vision as supervisor/fallback | Outer agent decides per turn which tool to call | Best of both for genuinely cross-app tasks. | More moving parts. Most expensive. |

Mapping this back to the hierarchy: the tool-calling loop is what tier 3 (browser-based automation) actually is. The vision loop is what tier 4 (computer use) actually is. Hybrid is when your task spans both — and you should only reach for it when no API or single-tool path covers the workflow.

These numbers come from public benchmark reports: Playwright-driven agent frameworks land in the high 80s to low 90s (percent success) on common browser tasks, while vision-based computer use lands around 84%. Hybrids outperform pure vision on web tasks because selectors beat pixel guessing.

When to use which

The industry shorthand is "80% Playwright, 20% browser-use." That's incomplete — it leaves out computer use entirely. Here's the actual decision tree:

Can the task be done with stable DOM selectors?
    ├── YES → Playwright (code, 0 LLM calls in loop)
    └── NO: does it require a browser?
            ├── YES, DOM accessible → browser-use (LLM adapts per page)
            └── NO browser / need desktop → Computer use (vision loop)
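The decision tree above can be written as a small function. This is a sketch of the tree only; the `Tool` enum, the `pick_tool` name, and the three boolean inputs are illustrative names, not part of any library or of the harness used in the experiments:

```python
from enum import Enum


class Tool(Enum):
    PLAYWRIGHT = "Playwright (code, 0 LLM calls in loop)"
    BROWSER_USE = "browser-use (LLM adapts per page)"
    COMPUTER_USE = "Computer use (vision loop)"


def pick_tool(stable_selectors: bool, needs_browser: bool, dom_accessible: bool) -> Tool:
    """Walk the tree top to bottom and stop at the first branch that fits."""
    if stable_selectors:
        return Tool.PLAYWRIGHT
    if needs_browser and dom_accessible:
        return Tool.BROWSER_USE
    # No browser involved, or the DOM cannot be reached (for example, CDP is blocked).
    return Tool.COMPUTER_USE


print(pick_tool(stable_selectors=False, needs_browser=True, dom_accessible=True).value)
```

Ordering the checks from cheapest to most expensive means the vision loop is only ever a fall-through, never a first choice.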

In practice:

| Situation | Right tool | Why |
| --- | --- | --- |
| Stable DOM, predictable navigation | Playwright | No LLM needed in the loop |
| Dynamic UI, novel pages, site changes | browser-use | LLM adapts, but ~60 calls per task |
| Desktop apps (LibreOffice, Excel, legacy software) | Computer use | No browser alternative exists |
| Site blocks CDP / Playwright connections | Computer use | Browser path unreachable |
| Cross-app: stable web + desktop | Playwright + Computer use | Each leg optimized independently |
| Cross-app: dynamic web + desktop | browser-use + Computer use | More expensive, but covers both |

The mistake most builders make is treating computer use as a more powerful browser-use. It isn't. It's for tasks the browser can't reach — desktop apps, sites that block CDP, or workflows that need both. If the DOM is accessible, computer use is always the more expensive option.

If you've been treating these three as interchangeable, stop. The choice within tiers 3-4 is the most important implementation decision you'll make. The data shows the cost spread within these two tiers is larger than the cost spread between most models.

Test setup

Every run was fully autonomous — no human in the loop between task prompt and final output.

The stack: a Strands outer agent on Amazon Bedrock received the task and decided which tools to call. For browser work, it called a tool that provisioned a fresh AgentCore Browser session — AWS-managed Chromium running in an isolated container — and drove it either via browser-use (an LLM sub-agent that reasons over screenshots and generates click/type actions) or via a purpose-built Playwright script (code that navigates by URL and queries the DOM directly). For desktop work, it called a tool that booted an ephemeral Docker container with Xvfb, LibreOffice, and a terminal, then ran Anthropic's computer-use loop against it. Each tool call was one round trip with a fresh sandbox; no state persisted across calls.
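To make "navigates by URL and queries the DOM directly" concrete, here is a rough sketch of what a purpose-built Playwright script can look like. It is not the script used in these runs: the URL, the CSS selectors, and the helper names are hypothetical, and it launches a local Chromium where the experiments used a managed AgentCore Browser session. The point is only that no model call appears anywhere in the loop.

```python
def parse_rating(text: str) -> float:
    """Pull the leading number out of a rating label such as '4.6 stars'."""
    return float(text.strip().split()[0])


def top_rated(rows: list[tuple[str, str]], n: int = 3) -> list[tuple[str, float]]:
    """Sort (name, rating label) pairs by numeric rating, highest first."""
    parsed = [(name, parse_rating(label)) for name, label in rows]
    return sorted(parsed, key=lambda r: r[1], reverse=True)[:n]


def scrape(url: str, name_sel: str, rating_sel: str) -> list[tuple[str, str]]:
    """Navigate by URL and read the DOM directly; no LLM decides any click."""
    # Imported lazily so the pure helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # assumption: local browser, not AgentCore
        page = browser.new_page()
        page.goto(url)
        names = page.locator(name_sel).all_inner_texts()
        ratings = page.locator(rating_sel).all_inner_texts()
        browser.close()
    return list(zip(names, ratings))
```

A caller would combine the two, for example `top_rated(scrape(url, ".name", ".rating"))`, with selectors written for the specific site. The trade-off is the one the post describes: the script is cheap because it is rigid, and it breaks when the selectors change.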

All costs are real token counts pulled from CloudWatch (AWS/Bedrock namespace, InputTokenCount, OutputTokenCount, cache read/write metrics) and converted to USD using public Bedrock pricing. Wall times are measured end-to-end including container boot, browser provisioning, and formatting.
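The conversion from token counts to dollars is simple arithmetic. A minimal sketch, assuming the four token totals have already been summed per run from CloudWatch; the `Pricing` class, the function name, and the example prices are illustrative placeholders, not values quoted from the Bedrock price list:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1,000 tokens. Supplied by the caller; not real prices."""
    input_per_1k: float
    output_per_1k: float
    cache_read_per_1k: float = 0.0
    cache_write_per_1k: float = 0.0


def run_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing,
                 cache_read_tokens: int = 0, cache_write_tokens: int = 0) -> float:
    """Convert summed token counts for one run into dollars."""
    total = (
        input_tokens * pricing.input_per_1k
        + output_tokens * pricing.output_per_1k
        + cache_read_tokens * pricing.cache_read_per_1k
        + cache_write_tokens * pricing.cache_write_per_1k
    ) / 1000
    return round(total, 4)


# Placeholder prices chosen for illustration; check the current price list.
example = Pricing(input_per_1k=0.003, output_per_1k=0.015)
print(run_cost_usd(input_tokens=200_000, output_tokens=10_000, pricing=example))
```

Keeping cache reads and writes as separate terms matters for the cached browser-use runs, where a large share of input tokens is billed at the cache rate rather than the standard input rate.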

Task prompt
    │
    ▼
┌─────────────────────────────────────┐
│  Outer agent (Strands + Bedrock)    │
│  Decides which tool to call         │
└──────────┬──────────────────────────┘
           │
     ┌─────┴─────┐
     │           │
     ▼           ▼
┌─────────┐  ┌──────────┐
│ Browser │  │ Desktop  │
│  tool   │  │  tool    │
└────┬────┘  └────┬─────┘
     │            │
     ▼            ▼
┌─────────┐  ┌───────────────────┐
│AgentCore│  │ Docker container  │
│ Browser │  │ Xvfb + LibreOffice│
│(managed │  │ + terminal        │
│ Chrome) │  └────────┬──────────┘
└────┬────┘           │
     │           computer-use
     │           loop (Bedrock)
  browser-use         │
  OR Playwright       │
     │                │
     └────────┬───────┘
              │
              ▼
         Final output
    (costs logged to CloudWatch)

Each tool call gets a fresh sandbox — new browser session or new container — torn down after the call. No shared state between calls.

The tests

Two benchmark tasks, all costs from CloudWatch.

Simple task — browser research only:

"It's Thursday at lunchtime. I'm at 525 Market Street. Find me 3 highly-rated Mexican lunch options within 5 minutes' walk. Save as markdown."

A deliberately messy task — fuzzy criteria, a site that blocks scrapers, location-dependent results with no structured endpoint. Chosen because it forces real decision-making rather than form-filling.

| Run | What drives the browser | Wall time (m:ss) | Cost (USD) |
| --- | --- | --- | --- |
| Playwright structured | Code, 0 LLM calls in browser loop | 1:09 | $0.03 |
| browser-use + cache | LLM (~60 Sonnet calls, cached) | 17:48 | $2.43 |
| browser-use vanilla | LLM (~60 Sonnet calls, uncached) | 20:11 | $3.41 |
| Computer use, tightened prompt | Vision loop, convergence rules | 5:11 | $1.17 |
| Computer use, base prompt | Vision loop, no stopping criteria | 20:30 (timeout) | $2.13 |

Note: Playwright and browser-use runs used Sonnet 4.6 as inner model; computer use runs used Opus 4.8. The ordering holds at comparable model tiers but the gaps narrow.

Hybrid task — browser research + desktop spreadsheet creation:

"Find 3 highly-rated Mexican restaurants near 525 Market Street. Then create a spreadsheet at /tmp/lunch.ods inside the desktop container with columns Name | Rating | Walking time."

One half lives in a browser; the other half lives in a desktop app. Neither tool alone finishes the job.

| Run | Browser leg | Desktop leg | Wall time (m:ss) | Cost (USD) |
| --- | --- | --- | --- | --- |
| Optimized | Playwright (0 LLM calls) | CLI-first prompt | 2:12 | $0.17 |
| Browser fixed only | Playwright (0 LLM calls) | Base prompt | 2:07 | $0.16 |
| Desktop fixed only | browser-use + cache | CLI-first prompt | 12:10 | $0.29 |
| Baseline | browser-use (LLM-driven) | Base prompt | 21:52 | $5.44 |

Reading the tables

Simple task table — three findings:

The Playwright row changes the browser tier argument. The browser-use runs ($3.41/$2.43) made tier 3 look expensive compared to tuned computer use ($1.17). But browser-use is a general LLM vision agent that makes ~60 model calls per task. A purpose-built Playwright script with zero LLM calls in the browser loop ran the same task for $0.03. That's what tier 3 looks like at its best. The hierarchy holds; the browser-use numbers were measuring the wrong implementation.

Within the browser tier, implementation dominates. $3.41 (generic LLM agent) → $2.43 (cached) → $0.03 (Playwright). Most of the $3.38 savings came from removing the LLM from the loop entirely, not from caching it. Caching is a 29% improvement; eliminating the calls is a 99% improvement.

Tier 3 done right beats tier 4 done right by 39×. Playwright at $0.03 vs computer use at $1.17. The structural reason is the same as the API vs computer use gap: when you remove LLM calls from the execution loop, cost collapses.
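The percentages and the 39× figure in these three findings come straight from the simple-task table. A short check, using only the costs reported there (the `pct_saved` helper is an illustrative name):

```python
# Costs from the simple-task table (USD).
simple_task_usd = {
    "browser-use vanilla": 3.41,
    "browser-use + cache": 2.43,
    "Playwright structured": 0.03,
    "Computer use, tightened prompt": 1.17,
}


def pct_saved(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return round(100 * (before - after) / before, 1)


baseline = simple_task_usd["browser-use vanilla"]
caching = pct_saved(baseline, simple_task_usd["browser-use + cache"])
no_llm = pct_saved(baseline, simple_task_usd["Playwright structured"])
tier_gap = (simple_task_usd["Computer use, tightened prompt"]
            / simple_task_usd["Playwright structured"])

print(f"caching: {caching}%, removing LLM calls: {no_llm}%, tier 4 vs tier 3: {tier_gap:.0f}x")
```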

Hybrid task table — two findings:

The browser leg dominates. Fixing only the browser leg (Playwright, base desktop) dropped the hybrid from $5.44 to $0.16 — almost all of the savings. Fixing only the desktop leg ($0.29) helped, but the browser leg was doing most of the work in the original cost.

CLI-first prompt is a model-dependent optimization. Fixing both legs ($0.17) is nearly identical to fixing just the browser leg ($0.16). The CLI-first desktop instruction barely moved the cost — because Sonnet 4.6 with the base prompt was already efficient, finding the headless CLI path on its own. The CLI-first instruction was designed to fix Opus 4.8, which would fight the GUI before recovering. With Sonnet, it solved $0.01 worth of problem. The lesson: prompt instructions compensate for model behavior; better models need less compensating.

The hybrid runs worth highlighting

The baseline hybrid ($5.44) surfaced something the single-tier runs didn't: when the inner desktop sub-agent failed (twice, different failure modes), the outer model noticed the text summary of the failure, diagnosed the problem, and issued a completely new instruction — switching from driving LibreOffice through the GUI to using libreoffice --headless from the command line. The second attempt succeeded. That self-recovery behavior is the subject of the third post.

The optimized hybrid ($0.17) never triggered recovery — because the CLI-first prompt eliminated the failure mode entirely. Six clean desktop turns, straight to libreoffice --headless, no failed GUI attempt. The cross-app task completed for $0.17 vs $5.44. Same capability, 32× cheaper.
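For readers who have not used the headless path, one way to build the spreadsheet without touching the GUI is to write a CSV and let LibreOffice convert it. This is a sketch under assumptions: the post does not say which commands the agent actually ran, the helper names are hypothetical, and the binary may be called `soffice` rather than `libreoffice` on some systems.

```python
import csv
import shutil
import subprocess
from pathlib import Path


def write_csv(rows, path: Path) -> Path:
    """Write the three columns from the task prompt, plus the data rows."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Rating", "Walking time"])
        writer.writerows(rows)
    return path


def convert_cmd(csv_path: Path, out_dir: Path) -> list[str]:
    """Command line for a headless CSV-to-ODS conversion."""
    return ["libreoffice", "--headless", "--convert-to", "ods",
            "--outdir", str(out_dir), str(csv_path)]


def build_spreadsheet(rows, out_dir: Path = Path("/tmp")) -> list[str]:
    """Write lunch.csv, then convert it to lunch.ods if LibreOffice is installed."""
    csv_path = write_csv(rows, out_dir / "lunch.csv")
    cmd = convert_cmd(csv_path, out_dir)
    if shutil.which("libreoffice"):  # skip quietly when the binary is missing
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `build_spreadsheet([("Example Taqueria", "4.5", "3 min")])` inside the container would leave `/tmp/lunch.ods` next to the CSV. No screenshot is taken at any point, which is why this path costs so little compared with driving the GUI.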

The two runs together show the tradeoff: if you want to observe self-recovery behavior, run the baseline. If you want the cheapest production path, run the optimized version. For production, $0.17 is the right number — and the recovery architecture is still there if a different failure mode appears.

What the experiments showed

Three patterns across eight runs.

1. "Browser automation" is not one thing — and the difference is $3.38.

A generic LLM browser agent costs $3.41. A purpose-built Playwright script costs $0.03. Same AWS infrastructure, same task, same output quality. The difference is whether an LLM decides every click or code does. Most browser automation benchmarks don't measure this gap because they test the agent, not the implementation.

2. The browser leg dominates hybrid cost — fix it first.

In the hybrid runs, fixing only the browser leg (Playwright) dropped cost from $5.44 to $0.16. Fixing only the desktop prompt dropped it to $0.29. The browser was doing most of the damage. When optimizing a multi-leg workflow, profile which leg is expensive before tuning both.

3. Prompt instructions compensate for model behavior; better models need less compensating.

The CLI-first desktop prompt saved ~$5 with Opus 4.8 (by preventing a GUI fight) and $0.01 with Sonnet 4.6 (which found the CLI path on its own). This is a general pattern: optimization techniques designed for one model tier may be redundant or irrelevant at another.

The hierarchy isn't theoretical. The cost gaps are structural. Vision-based loops pay for every screenshot of every intermediate state. Structured paths don't.

The thing I'm still unsure about

In both hybrid runs, the inner desktop agent saved files inside the ephemeral container and did not proactively return them. The outer model noticed and offered to copy them, but treated persistence as optional.

This raises a real architectural question: When a sub-agent produces artifacts, should the system assume they need to be explicitly returned, or should there be a convention for automatic (but safe) propagation?

I chose explicit for now. At small scale this is fine. At larger scale it may create either lost work or excessive cognitive load on the outer model. I haven't decided which risk is worse.
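One possible shape for the explicit option, sketched here as an illustration rather than a description of the harness: the sub-agent declares which files it wants returned, and the tool wrapper returns exactly those and fails loudly if one is missing. `Artifact`, `ToolResult`, and `finish_tool_call` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    name: str
    content: bytes


@dataclass
class ToolResult:
    summary: str
    artifacts: list[Artifact] = field(default_factory=list)


def finish_tool_call(summary: str, produced: dict[str, bytes],
                     declared: list[str]) -> ToolResult:
    """Return only the artifacts the sub-agent explicitly declared."""
    missing = [name for name in declared if name not in produced]
    if missing:
        # Surface lost work instead of silently dropping it with the sandbox.
        raise FileNotFoundError(f"declared but not produced: {missing}")
    return ToolResult(summary, [Artifact(n, produced[n]) for n in declared])


result = finish_tool_call(
    "spreadsheet created",
    produced={"lunch.ods": b"...", "scratch.tmp": b"..."},
    declared=["lunch.ods"],
)
print([a.name for a in result.artifacts])
```

Undeclared files such as `scratch.tmp` stay in the container and are torn down with it. That keeps propagation safe, at the cost of requiring the sub-agent to remember to declare its outputs.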

So what

The hierarchy that actually holds up in practice is:

Native API → Existing connector → Browser-based automation → Computer use → Human last resort.

Walk down it in order. Stop at the first tier that can do the job. Only reach for computer use when nothing above it works — legacy desktop apps, third-party SaaS with no usable endpoints, or genuine cross-app workflows that no single structured surface covers.

The experiments confirmed what the Reflex benchmark suggested: the cost gaps are structural, not a tuning problem. Vision agents pay for rendering every intermediate state. Structured paths give the model the data directly. Even well-tuned computer use is dramatically more expensive than the tiers above it for most work.

Computer use is genuinely powerful in the narrow set of cases it uniquely enables. Treating it as the default because the demos look impressive is one of the fastest ways to turn a reasonable automation project into an expensive recurring cost.

If you only read this post, the takeaway is simple: pick the cheapest tier that can actually complete the task. The model is the easy part.

If you want to go deeper

The series digs into each finding separately:

Harness optimization — exactly where the money went in the $3.41 → $1.17 runs, and why convergence rules were the highest-leverage change.
Router or agent under failure — the self-recovery behavior the baseline hybrid showed, why it only appears in layered architectures, and the test most systems sold as "agents" still fail.

Part 1 of the Agent Tooling series.
Part 2: Harness optimization →