Every browser agent pays a hidden tax: tokens.
When an agent visits a webpage, it dumps the DOM into an LLM. The LLM reads thousands of elements, reasons about which button to click, and generates a tool call. Then it does it again. And again.
For a 10-step workflow, that's 25+ LLM round trips. Context grows with each step because conversation history accumulates. By step 10, you're sending 175,000 tokens per action.
At frontier model pricing, that's roughly $4 for one workflow execution. Run it 1,000 times a day and you're burning $4,000 daily — on clicking buttons.
## The compounding problem
The issue isn't that LLMs are expensive. It's that agent architectures make each step cost more than the last: context grows linearly per step, so the total cost of a workflow grows quadratically with its length:
Step 1: Inspect DOM (4,000 tokens) → Reason → Act
Step 2: Inspect DOM + step 1 context (6,000 tokens) → Reason → Act
Step 3: Inspect DOM + steps 1-2 context (8,000 tokens) → Reason → Act
By step 10, your context window is carrying the entire conversation history. Each action costs more than the last.
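Using the toy numbers above (a 4,000-token DOM snapshot plus roughly 2,000 tokens of accumulated history per prior step; both figures are illustrative, not a billing quote), the growth is easy to model:

```typescript
// Toy model of context growth, following the per-step numbers above.
const baseTokens = 4_000;      // DOM snapshot sent on every step
const historyPerStep = 2_000;  // conversation history added per prior step

function tokensAtStep(step: number): number {
  return baseTokens + historyPerStep * (step - 1);
}

// Total tokens sent across a 10-step workflow.
const totalTokens = Array.from({ length: 10 }, (_, i) => tokensAtStep(i + 1))
  .reduce((sum, t) => sum + t, 0);

console.log(tokensAtStep(10)); // 22000 tokens for step 10 alone
console.log(totalTokens);      // 130000 tokens for the whole workflow
```

The per-step cost is linear, but the sum over the workflow is quadratic, which is why long workflows blow up so much faster than short ones.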
This is the fundamental problem with using general-purpose reasoning for repetitive browser tasks.
## What if agents could skip the reasoning?
Consider a different approach: what if the first agent to figure out how to search on Amazon shared that knowledge with every other agent?
The CSS selector for Amazon's search box doesn't change between requests. Neither does Google's search button or GitHub's login form. These are solved problems being re-solved millions of times a day.
That's the idea behind collective intelligence for agents. One agent figures out the selectors and steps. Every subsequent agent reuses that knowledge via a single API call — no DOM inspection, no LLM reasoning, zero tokens.
The result: a 10-step workflow drops from $4 and 50 seconds to $0.0006 and 178 milliseconds.
## The three-step pattern
The pattern is simple — browse, execute, report:
- Browse: Ask what's possible on a domain. Get back a list of capabilities with confidence scores and pre-verified selectors.
- Execute: Request the optimal execution path for a specific capability. Get back CSS selectors, API fast-paths, or macro steps.
- Report: After executing, report what worked. This closes the learning loop — successful patterns become verified macros for every other agent.
Every report makes the system smarter. Agents that use this pattern aren't just consuming intelligence — they're producing it.
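The loop can be sketched with an in-memory knowledge store. Everything below (`browse`, `execute`, `report`, the `Capability` shape, the moving-average confidence update) is a hypothetical illustration of the pattern, not the AIR SDK's actual API:

```typescript
// Hypothetical sketch of the browse → execute → report loop.
// Names and data shapes are illustrative, not the real SDK interface.
interface Capability {
  name: string;
  selector: string;   // pre-verified CSS selector
  confidence: number; // success rate observed by prior agents
}

const knowledge = new Map<string, Capability[]>(); // domain → capabilities

// Browse: list what prior agents have learned about a domain.
function browse(domain: string): Capability[] {
  return knowledge.get(domain) ?? [];
}

// Execute: pick the highest-confidence path for a capability.
function execute(domain: string, name: string): Capability | undefined {
  return browse(domain)
    .filter(c => c.name === name)
    .sort((a, b) => b.confidence - a.confidence)[0];
}

// Report: feed the outcome back so the next agent benefits.
function report(domain: string, cap: Capability, success: boolean): void {
  const caps = knowledge.get(domain) ?? [];
  const existing = caps.find(c => c.selector === cap.selector);
  if (existing) {
    // Exponential moving average over reported outcomes.
    existing.confidence = 0.9 * existing.confidence + 0.1 * (success ? 1 : 0);
  } else {
    caps.push({ ...cap, confidence: success ? 1 : 0 });
  }
  knowledge.set(domain, caps);
}

// One agent reports a discovered selector ("#search-input" is made up)...
report("example.com", { name: "search", selector: "#search-input", confidence: 1 }, true);
// ...and every later agent gets it back without any DOM inspection.
console.log(execute("example.com", "search")?.selector); // "#search-input"
```

The key property is that only the first agent pays the LLM reasoning cost; everyone after that does a cheap lookup.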
## The math that matters
If you're building browser agents at scale, the cost equation is what determines viability:
- Traditional approach: cost grows linearly (or worse) with workflow complexity. 10 actions = 25 LLM calls = ~$4.
- Collective approach: cost is constant regardless of workflow complexity. 10 actions = 1 API call = $0.0006.
The gap widens with every additional step. A 50-step workflow with traditional LLM reasoning could cost $20+. With pre-verified macros, it's still $0.0006.
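A quick sanity check on the daily figures, using the article's rough cost estimates (these are ballpark numbers, not a price quote):

```typescript
// Back-of-envelope comparison using the article's figures.
const llmCostPerWorkflow = 4.0;      // ~$4 for a 10-step LLM-reasoned workflow
const macroCostPerWorkflow = 0.0006; // one cached-knowledge API call
const runsPerDay = 1_000;

const llmDaily = llmCostPerWorkflow * runsPerDay;
const macroDaily = macroCostPerWorkflow * runsPerDay;
const savingsFactor = llmCostPerWorkflow / macroCostPerWorkflow;

console.log(llmDaily);       // $4000 per day
console.log(macroDaily);     // ~$0.60 per day
console.log(savingsFactor);  // roughly 6,600x cheaper per workflow
```

At 1,000 runs a day, the difference is $4,000 versus well under a dollar, and the ratio only grows with workflow length.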
## Who benefits
This pattern is relevant if you're building agents that interact with the web repeatedly. E-commerce bots, data extraction pipelines, testing automation, form-filling services — anywhere agents do the same browser tasks thousands of times.
If your agent visits a site once and never returns, LLM reasoning is fine. If it visits the same site daily, you're paying a recurring tax that compounds over time.
## Try it
The AIR SDK implements this pattern as an MCP server. Install, point your agent at it, and the three-step browse→execute→report pattern replaces DOM reasoning automatically.
```
npm install @arcede/air-sdk
```
GitHub: ArcedeDev/air-sdk
Building browser agents? What's your cost-per-action looking like? Curious to hear how others are handling the token economics.