I benchmarked an AI browser agent on 19 real sites. Here's what I learned.

#ai #automation #browserbase #discuss

Everyone's building AI agents that browse the web for you. Almost nobody's measuring how they actually do out there. So I did.

Recently at PixieBrix, we've been building Agent Browser Shield, an open-source layer that sits between a browser agent and the web.

It does three things before a page reaches the model: strips the noise (cookie banners, nav, ads, footers), masks PII and credentials, and blocks prompt injection hidden in invisible text and HTML comments.

I wanted real numbers on whether any of that helps reduce token costs, so I ran some benchmark test and wanted to share.

The setup

Same agent, same tasks, run twice: once with the shield off (baseline), once on (guarded). The agent is gpt-5-mini running on Browserbase cloud browsers. An LLM judge (claude-sonnet-4-6) grades each run pass/fail against the task's success criteria, and the harness tracks tokens and cost. I ran the clean set 3 times per cell to cut down on noise, because n=1 can be flukes.

Finding 1: agents waste a ton of tokens on page junk

The shield cut tokens about 11% on average (2.33M → 2.07M across all the tests). But the average hides the real story. On noisy pages it's actually more dramatic:

weather.gov: an agent burns ~84% of its tokens on the stuff around the forecast
Target: −51%
IKEA: −52%
Etsy: −37%
Amazon: −19%

Every one of those is tokens you pay for that do nothing for the task.

Finding 2: cleaning the page made the agent more accurate

This one I didn't expect. Task success went from 81% to 91% with the shield on. When the agent isn't wading through ads, cookie banners, and chat widgets, it trips over fewer things and finishes the job more often. The accuracy bump is smaller and noisier than the token savings, so I won't oversell it, but it showed up consistently on the clean set.

Run it yourself

The whole harness is open source. You bring a task list (a CSV) and your own model, and it'll run the off-vs-on comparison and grade it for you.

uv run scripts/benchmark_run.py \
  --scenarios benchmark/scenarios.example.yaml \
  --tasks benchmark/tasks.csv -n 3
uv run scripts/benchmark_report.py --run-id <run_id> --open

Fair warnings: this is gpt-5-mini only, n=3, on the slice of sites that actually load. It's a directional signal, not a published paper. Per-task numbers are noisy. The aggregate is the trustworthy claim.

If you're building browser agents, I'd love for you to point this at your own task list and tell me whether the numbers hold. Repo's here: https://github.com/pixiebrix/agent-browser-shield . A star helps more people find it and tells us we should keep working on it!