We tested browser agents on 20 real websites — here's where they break
Browser agents (browser-use, Stagehand, Skyvern, Playwright-based tools) promise to automate web interactions: log in to a site, search for a product, add to cart, all autonomously.
But how reliable are they actually? We measured it.
The Setup
We built a benchmark suite that tests whether agents correctly identify interactive elements on real production websites:
- 20 websites: GitHub, Amazon, Airbnb, Booking.com, eBay, LinkedIn, Stripe, Hacker News, Wikipedia, Google, Zalando, Shopify, Target, and more
- Ground truth: Manually annotated endpoints per site — what a human would identify as login forms, search bars, checkout buttons, navigation menus
- Metrics: Precision, Recall, and F1 score
We didn't test agent execution (clicking, typing). We tested something more fundamental: Does the agent understand what's on the page before it acts?
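For readers less familiar with the metrics, here is a minimal sketch of how precision, recall, and F1 can be computed from a list of detected endpoints and a manually annotated ground truth. The `Endpoint` shape and the exact-match rule (same site, type, and selector) are assumptions for illustration; the benchmark's actual matching logic may differ.

```typescript
// Hypothetical endpoint record, used only for this sketch.
interface Endpoint {
  site: string;
  type: string; // e.g. "auth", "search", "checkout"
  selector: string;
}

interface Scores {
  precision: number;
  recall: number;
  f1: number;
}

// Assumption: two endpoints are "the same" if site, type, and selector all match.
const endpointKey = (e: Endpoint): string => `${e.site}|${e.type}|${e.selector}`;

function score(predicted: Endpoint[], groundTruth: Endpoint[]): Scores {
  const truthKeys = new Set(groundTruth.map(endpointKey));
  const predictedKeys = new Set(predicted.map(endpointKey));

  // True positives: predicted endpoints that also appear in the ground truth.
  let truePositives = 0;
  predictedKeys.forEach((k) => {
    if (truthKeys.has(k)) truePositives++;
  });

  const precision = predictedKeys.size === 0 ? 0 : truePositives / predictedKeys.size;
  const recall = truthKeys.size === 0 ? 0 : truePositives / truthKeys.size;
  // F1 is the harmonic mean of precision and recall.
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Example: four annotated endpoints, three detections, two of them correct.
const annotated: Endpoint[] = [
  { site: "shop.example", type: "auth", selector: "#login" },
  { site: "shop.example", type: "search", selector: "#q" },
  { site: "shop.example", type: "navigation", selector: "nav" },
  { site: "shop.example", type: "checkout", selector: "#cart" },
];
const detected: Endpoint[] = [
  annotated[0],
  annotated[1],
  { site: "shop.example", type: "auth", selector: "#newsletter" }, // false positive
];
console.log(score(detected, annotated));
```

Here precision is 2/3 (one false positive) and recall is 2/4 (two misses), which shows why a single F1 number can hide quite different failure patterns.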
The Results
| Category | Miss rate | What goes wrong |
|---|---|---|
| Login/Auth | ~30% | Agent can't find SSO buttons, misses "Sign In" links in e-commerce headers |
| Search | ~25% | Confused by category dropdowns, misses Cmd+K search overlays |
| E-commerce | ~40% | Cart icons misidentified, "Add to Cart" vs. navigation confusion |
| Cookie consent | ~50% | Banners ignored or misclassified as forms |
| Navigation | ~20% | Footer vs. header nav confusion, mega-menus not understood |
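Overall F1 score: 66%. F1 combines precision and recall (it is their harmonic mean), so this is not a literal per-interaction failure rate, but it does indicate that a substantial share of detections either pointed at the wrong element or missed the right one.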
4 Things That Surprised Us
1. Login links on e-commerce sites are invisible to agents
eBay, Amazon, Zalando — the "Sign In" link in the header looks identical to navigation links in the DOM. Without semantic analysis (password fields nearby? auth-related URL?), agents can't distinguish them.
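To make the "semantic analysis" idea concrete, here is a rough sketch of scoring a header link by its visible text and its URL. The `LinkInfo` shape, the keyword lists, and the 0.5 weights are all made up for illustration; this is not how balage-core or any particular agent does it.

```typescript
// Hypothetical minimal view of a link: visible text plus href.
interface LinkInfo {
  text: string;
  href: string;
}

// Illustrative keyword lists; a real system would need many more variants and languages.
const AUTH_TEXT = /\b(sign\s?in|log\s?in)\b/i;
const AUTH_PATH = /\/(signin|sign-in|login|log-in|auth|sso)\b/i;

// Returns 0, 0.5, or 1 depending on how many signals point to an auth link.
function authLinkScore(link: LinkInfo): number {
  let score = 0;
  if (AUTH_TEXT.test(link.text)) score += 0.5; // text says "Sign in" / "Log in"
  if (AUTH_PATH.test(link.href)) score += 0.5; // URL path looks auth-related
  return score;
}

const headerLinks: LinkInfo[] = [
  { text: "Women", href: "https://shop.example/women" },
  { text: "Sale", href: "https://shop.example/sale" },
  { text: "Sign in", href: "https://shop.example/signin?return=/" },
];

// Pick the link with the highest score as the most likely auth entry point.
const best = headerLinks.reduce((a, b) => (authLinkScore(b) > authLinkScore(a) ? b : a));
console.log(best.text, authLinkScore(best));
```

Two weak signals that agree are more trustworthy than either alone, which is why the sketch adds them up instead of returning on the first match.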
2. Search is harder than it looks
Many modern sites use keyboard shortcuts (Cmd+K) or overlay-based search. The search input doesn't exist in the initial DOM; it only appears after a user action. Agents that scan only for `<input type="search">` miss these entirely.
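One way to think about this is to separate "a search field is in the markup right now" from "something on the page suggests search opens later". The sketch below does that with plain regular expressions over an HTML string. The function name, the `SearchSignals` shape, and the specific patterns are assumptions for illustration, and regex matching on HTML is deliberately crude; a real implementation would work on a parsed DOM.

```typescript
interface SearchSignals {
  staticInput: boolean;     // a search field or search landmark exists in the delivered markup
  deferredTrigger: boolean; // a button, link, or shortcut hint suggests search opens on demand
}

function findSearchSignals(html: string): SearchSignals {
  const staticInput =
    /<input\b[^>]*\btype\s*=\s*["']?search\b/i.test(html) ||
    /\brole\s*=\s*["']?search\b/i.test(html);

  const deferredTrigger =
    // A button or link whose aria-label mentions search.
    /<(button|a)\b[^>]*\baria-label\s*=\s*["'][^"']*search[^"']*["']/i.test(html) ||
    // A keyboard hint such as <kbd>K</kbd> or <kbd>⌘K</kbd>.
    /<kbd\b[^>]*>\s*(?:(?:⌘|cmd|ctrl)\s*\+?\s*)?k\s*<\/kbd>/i.test(html);

  return { staticInput, deferredTrigger };
}

const classicSite = `<form role="search"><input type="search" name="q"></form>`;
const overlaySite = `<button aria-label="Open search"><kbd>⌘</kbd><kbd>K</kbd></button>`;

console.log(findSearchSignals(classicSite));
console.log(findSearchSignals(overlaySite));
```

When only `deferredTrigger` is true, an agent would need to activate the trigger and re-analyze the page before it can type a query. Overlays wired up purely in JavaScript, with no hint in the markup, would still slip past a check like this.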
3. The gap between "easy" and "hard" sites is massive
Same analysis pipeline, wildly different results:
| Site | F1 Score | Difficulty |
|---|---|---|
| Google Accounts | 91% | Easy — clean, semantic HTML |
| Zalando | 89% | Medium — but well-structured |
| Hacker News | 80% | Easy — minimal DOM |
| Amazon | 75% | Hard — thousands of DOM elements, dynamic loading |
| Trello | 29% | Hard — multi-step auth, redirects |
4. Cookie banners are the silent killer
Almost every European site has a GDPR cookie banner that blocks the page. Agents either ignore it entirely (and fail on the blocked page) or misclassify it as a form. It's the most common failure mode we found.
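As a rough illustration of telling a consent banner apart from a regular form, the sketch below combines three cues: consent-related wording, consent-style buttons, and the absence of text inputs. The `BannerCandidate` shape and keyword lists are hypothetical and English-only; real banners vary a lot more than this.

```typescript
// Hypothetical summary of a container element that might be a cookie banner.
interface BannerCandidate {
  text: string;            // visible text of the container
  buttonLabels: string[];  // labels of the buttons inside it
  hasTextInputs: boolean;  // true if it contains text, email, or password fields
}

// Illustrative keyword lists only.
const CONSENT_TEXT = /\b(cookies?|consent|gdpr)\b/i;
const CONSENT_BUTTON = /\b(accept|agree|allow|reject|decline|manage)\b/i;

function looksLikeCookieBanner(c: BannerCandidate): boolean {
  const mentionsConsent = CONSENT_TEXT.test(c.text);
  const hasConsentButton = c.buttonLabels.some((label) => CONSENT_BUTTON.test(label));
  // Real forms collect typed input; consent banners usually only offer buttons.
  return mentionsConsent && hasConsentButton && !c.hasTextInputs;
}

const banner: BannerCandidate = {
  text: "We use cookies to improve your experience.",
  buttonLabels: ["Accept all", "Manage preferences"],
  hasTextInputs: false,
};
const loginForm: BannerCandidate = {
  text: "Sign in. By continuing you accept our cookie policy.",
  buttonLabels: ["Sign in"],
  hasTextInputs: true,
};

console.log(looksLikeCookieBanner(banner), looksLikeCookieBanner(loginForm));
```

An agent that gets a positive result here can dismiss the banner first and only then analyze the rest of the page, instead of treating the banner as the form it was asked to fill.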
What We Built
To address this, we built balage-core, a semantic page analysis library for browser agents (MIT licensed, source on GitHub).
```bash
npm install balage-core
```

```js
import { analyzeFromHTML } from "balage-core";

const result = await analyzeFromHTML(`
<form action="/login">
<input type="email" placeholder="Email">
<input type="password" placeholder="Password">
<button type="submit">Sign In</button>
</form>
`);

console.log(result.endpoints);
// [{type: "auth", label: "Login / Sign-In Form", confidence: 0.90,
// selector: 'form[action="/login"]', affordances: ["fill", "submit", "click"]}]
// (heuristic mode, no API key needed)
```
It works with raw HTML — no browser needed, no API key, ~4ms response time (heuristic mode, no LLM call).
What it does:
- Detects login forms, search bars, checkout flows, cookie banners, navigation
- Returns confidence scores (0-1) for every detection
- Generates CSS selectors you can use to target elements
- Detects web frameworks (React, Next.js, WordPress, Shopify, Angular, Vue)
- Optional LLM mode (OpenAI/Anthropic) for higher accuracy
What it doesn't do: It's not a browser agent. It doesn't click or type. It tells your agent what's on the page so it can make better decisions.
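Here is one way an agent could use those confidence scores to decide whether to act or fall back. The `DetectedEndpoint` interface simply mirrors the fields visible in the example output above; the `decide` helper, the `Decision` type, and the 0.8 threshold are hypothetical and not part of balage-core.

```typescript
// Mirrors the fields shown in the example output; the library's real type may differ.
interface DetectedEndpoint {
  type: string;
  label: string;
  confidence: number;
  selector: string;
  affordances: string[];
}

type Decision =
  | { action: "act"; selector: string }
  | { action: "fallback"; reason: string };

// Hypothetical helper: pick the most confident endpoint of the wanted type,
// and only act on it if it clears the threshold.
function decide(
  endpoints: DetectedEndpoint[],
  wantedType: string,
  affordance: string,
  minConfidence = 0.8,
): Decision {
  const candidates = endpoints
    .filter((e) => e.type === wantedType && e.affordances.indexOf(affordance) !== -1)
    .sort((a, b) => b.confidence - a.confidence);

  const top = candidates[0];
  if (!top) {
    return { action: "fallback", reason: `no "${wantedType}" endpoint supports "${affordance}"` };
  }
  if (top.confidence < minConfidence) {
    return { action: "fallback", reason: `confidence ${top.confidence} is below ${minConfidence}` };
  }
  return { action: "act", selector: top.selector };
}

const sampleEndpoints: DetectedEndpoint[] = [
  {
    type: "auth",
    label: "Login / Sign-In Form",
    confidence: 0.9,
    selector: 'form[action="/login"]',
    affordances: ["fill", "submit", "click"],
  },
];

console.log(decide(sampleEndpoints, "auth", "submit"));
console.log(decide(sampleEndpoints, "search", "fill"));
```

What "fallback" means is up to the agent: re-analyzing in LLM mode, taking a screenshot for a vision model, or asking a human are all plausible choices.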
The Benchmark Data
Per-site results for 10 of the 20 websites in our benchmark:
| Site | F1 | Notes |
|---|---|---|
| Google Accounts | 91% | Auth detected at 100% |
| Zalando | 89% | Cart, Auth, Search all found |
| Typeform | 83% | Clean structure helps |
| Hacker News | 80% | Minimal DOM, easy |
| eBay | 78% | Auth + Cart detected |
| StackOverflow | 77% | Search + Auth found |
| Amazon | 75% | Complex DOM but core endpoints found |
| GitHub | 67% | Login at 93% confidence |
| Booking.com | 63% | Cookie banner still tricky |
| Trello | 29% | Multi-step auth breaks detection |
What's Next
There's also an MCP server (npx -y balage-mcp) so Claude Desktop, ChatGPT, and Cursor can use this directly. And we're looking for browser-agent developers who want to integrate this into their workflow.
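For MCP clients that are configured through a JSON file with an `mcpServers` section (Claude Desktop works this way), the entry might look like what the sketch below prints. The server name `balage` is an arbitrary label picked for this example, and whether balage-mcp needs extra arguments or environment variables is not covered here, so treat it as a starting point and check the project's own docs.

```typescript
// Assumption: the client reads MCP servers from an "mcpServers" object,
// where each entry gives a command and its arguments.
const mcpConfig = {
  mcpServers: {
    // "balage" is a hypothetical, user-chosen name for this server entry.
    balage: {
      command: "npx",
      args: ["-y", "balage-mcp"], // same command as quoted in the post
    },
  },
};

// Print the JSON to paste into the client's MCP configuration file.
console.log(JSON.stringify(mcpConfig, null, 2));
```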
If you run browser agents in production, I'd love to hear:
- What's your biggest reliability challenge?
- How do you handle sites that change their UI?
- Would confidence scores before each action help?
Drop a comment or DM me. I'll share the full benchmark dataset (all 20 sites, per-endpoint breakdown) with anyone who's interested.
Built by Julius at Sortexai. The benchmark suite and library are on GitHub.