Last month I was building a browser-automation pipeline for an insurance-quote aggregator — a freelance gig that needed to fill forms across 20+ provider websites. I picked Stagehand because the `act()` / `extract()` API looked clean.
Two days in, the Gemini bill was at $60 and I'd only automated three sites.
I dug into the token accounting. Stagehand was sending the entire accessibility tree of every page to the LLM on every action. For an Amazon search-results page that's ~50,000 tokens per decision. For a Booking.com hotel listing, ~45,000.
This post is about the filtering heuristic I ended up writing, a head-to-head benchmark against Stagehand, and the open-source framework (Sentinel) that came out of it.
## The problem: brute-forcing the accessibility tree
When you ask an LLM to click a button, it needs to know what interactive elements exist on the page. The standard approach is:
- Parse the browser's accessibility tree (AOM)
- Serialize every interactive element (button, link, textbox, checkbox…)
- Send the whole serialization to the LLM along with your instruction
- LLM picks an element ID and returns an action
For a simple page (login form, 5 inputs) this is fine. Maybe 500 tokens. The LLM has no trouble picking the right field.
For a real-world page — Amazon search results, a form on durchblicker.at, the Gmail inbox — you easily hit 300+ interactive elements per page. Serialized, that's 30,000–50,000 tokens. Every. Single. Action.
Three problems compound:
- Cost: GPT-4o at $2.50/M input tokens = ~$0.12 per action. A 15-step task costs $1.80. Scale that to 10k users and you're bankrupt.
- Latency: More tokens = slower responses, sometimes 15–30 seconds per decision.
- Accuracy: LLMs get worse at picking the right element when there are 300 candidates. The signal-to-noise ratio tanks.
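The cost bullet is just arithmetic on the numbers above, but it's worth spelling out because it's the whole motivation:

```typescript
// Per-action cost: ~50k input tokens per decision at GPT-4o's
// $2.50 per million input tokens.
function actionCost(inputTokens: number, pricePerMillionTokens: number): number {
  return (inputTokens / 1_000_000) * pricePerMillionTokens;
}

const perAction = actionCost(50_000, 2.5); // $0.125
const perTask = perAction * 15;            // $1.875 for a 15-step task
```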
## The insight: you don't need the full tree
Here's the thing — when a user says "click the Add to Cart button", there are maybe 10 elements on the page that could plausibly match. The other 290 are header nav, footer links, unrelated product cards, cookie banners, modal stubs.
If you could filter to the top-50 most relevant elements before sending to the LLM, you'd have:
- 5–10× fewer tokens → 5–10× cheaper
- Shorter prompts → faster responses
- Less noise → better element picks
The question is: how do you rank "relevance" without an LLM call (which would defeat the purpose)?
## The filter: keyword-overlap scoring
The approach I ended up with is embarrassingly simple:
```typescript
function filterRelevantElements(
  elements: UIElement[],
  instruction: string,
  maxElements: number = 50
): UIElement[] {
  // Tokenize the instruction (handle multiple languages with \p{L}\p{N})
  const instructionTokens = new Set(
    instruction
      .toLowerCase()
      .match(/[\p{L}\p{N}]+/gu) ?? []
  );

  // Score each element by keyword overlap
  const scored = elements.map(el => {
    const elementText = `${el.role} ${el.name} ${el.value ?? ''}`.toLowerCase();
    const elementTokens = elementText.match(/[\p{L}\p{N}]+/gu) ?? [];
    const score = elementTokens.filter(t => instructionTokens.has(t)).length;
    return { el, score };
  });

  // Keep top-N by score; Array.prototype.sort is stable, so tied scores
  // preserve original document order
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, maxElements)
    .map(s => s.el);
}
```
That's the core. ~15 lines of code, zero dependencies, runs in <5ms for 500 elements.
Three refinements that matter in practice:
1. Always keep form fields. If the instruction is "fill out the form", the keyword "form" doesn't appear on any input label. Always preserve `role=textbox/combobox/checkbox/radio/slider` elements regardless of score.
2. Always keep buttons near form fields. The submit button often has a generic label like "Continue" that doesn't match the instruction keywords. Keep buttons that are positionally close to form fields.
3. Unicode tokenization. `/[a-z0-9]+/` breaks on German umlauts, Turkish dotted-i, Czech diacritics. Use `/[\p{L}\p{N}]+/gu` to handle Latin-script languages correctly.
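Refinement 1 is the one that saves the most runs, so here is a minimal sketch of it as a post-filter pass. The `UIElement` shape and the `withFormFields` name are illustrations for this post, not Sentinel's actual internals:

```typescript
interface UIElement { role: string; name: string; value?: string }

// Roles that survive regardless of keyword score (the "always keep
// form fields" rule from refinement 1).
const FORM_ROLES = new Set(['textbox', 'combobox', 'checkbox', 'radio', 'slider']);

function withFormFields(
  ranked: UIElement[],       // output of the keyword-overlap filter
  all: UIElement[],          // the unfiltered element list
  maxElements: number = 50
): UIElement[] {
  const formFields = all.filter(el => FORM_ROLES.has(el.role));
  // Merge with form fields first so they survive the cap; dedupe by
  // object identity while preserving order.
  const seen = new Set<UIElement>();
  const merged: UIElement[] = [];
  for (const el of [...formFields, ...ranked]) {
    if (!seen.has(el)) { seen.add(el); merged.push(el); }
  }
  return merged.slice(0, maxElements);
}
```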
## The benchmark
I ran the same task with Stagehand and Sentinel using the exact same model — Gemini 3 Flash — and identical browser config (viewport 1920×1080, `domSettleTimeoutMs: 3000`).
Task: Amazon.de → search "bluetooth kopfhörer over-ear" → filter by brand Sony → sort by customer rating → extract first 3 products (name, price, rating).
### Sentinel
```
Step 1: ✅ Fill search field with "bluetooth kopfhörer over-ear"
Step 2: ✅ Click "Los" (submit)
Step 3: ✅ Click Sony brand filter in sidebar
Step 4: ✅ Select "Durchschn. Kundenrezension" from sort dropdown
Step 5: 🔍 Extract top 3 products

Result: ✅ Goal achieved in 5 steps
Time: 100s
Tokens: 33k total
Cost: $0.003
```
Extracted:

```json
[
  { "product_name": "Sony WH-1000XM3…", "price": null, "star_rating": "4,6 von 5 Sternen" },
  { "product_name": "Sony WH-CH520…", "price": "25,20 €", "star_rating": "4,5 von 5 Sternen" },
  { "product_name": "Sony WH-CH720N…", "price": "60,62 €", "star_rating": "4,5 von 5 Sternen" }
]
```
### Stagehand
```
Action 1:  ariaTree (page observation)
Action 2:  click "Akzeptieren" (cookie banner)
Action 3:  type search query
Action 4:  click "Los"
Action 5:  ariaTree
Action 6:  click "Weitere" button in Marken filter
Action 7:  ariaTree
Action 8:  type "Sony" into search box (misread the filter UI)
Action 9:  scroll 50% down
Action 10: ariaTree
Action 11: scroll 50% down
Action 12: click "Sony" filter (finally)
```
⏱ Timed out at 300s. One LLM call near the end used:
- 147,158 input tokens
- 62,914 reasoning tokens
- 75 output tokens
(= 210k tokens for a single decision)
💥 Crashed: `Cannot read properties of null (reading 'awaitActivePage')`
Stagehand never got to sorting or extraction.
The 210k-token single-decision call is the headline number. That's not a typical call, but it happens when the model keeps re-observing the page without narrowing down. The average call was still in the 30–50k range.
## Honest limitations
The filter approach is not magic. Failure modes I've hit:
1. Index-based sliders. Amazon's price filter is an `<input type="range">` with `min=0 max=145` — the values are positions, not Euros. Setting `value=100` puts the slider at bucket 100, which maps to ~1,200 EUR, not 100 EUR. The real value is in `aria-valuetext`. Both Sentinel and Stagehand fail here. Fix on my roadmap: `aria-valuetext` binary search.
2. Non-Latin scripts. The tokenizer is Unicode-aware but untested on CJK. I don't think keyword overlap works for Chinese or Japanese at all: there are no spaces, so `\p{L}+` swallows a whole sentence as one giant token and overlap never fires. Probably need to swap to embedding-based scoring for those languages.
3. Synonyms. If the user says "submit" and the button says "Apply", keyword overlap scores zero. I mitigate this with the "keep nearby buttons" rule, but it's not a complete fix.
4. Vision-only pages. Canvas games, interactive maps, WebGL dashboards — the accessibility tree is empty. You need vision grounding. Sentinel has a `mode: 'vision'` fallback, but it's slower and more expensive.
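The planned `aria-valuetext` fix from point 1 reduces to an ordinary binary search once you can read the rendered value at a given slider position. A sketch, assuming the bucket-to-price mapping is monotonic; `readValueAt` is a placeholder for "set the slider to this position, then parse `aria-valuetext`" (the browser plumbing is omitted):

```typescript
// Find the smallest slider position whose rendered value reaches the
// target. min/max are the slider's position bounds (e.g. 0..145 on
// Amazon's price filter), not currency values.
function findSliderPosition(
  min: number,
  max: number,
  target: number,
  readValueAt: (pos: number) => number
): number {
  let lo = min, hi = max;
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (readValueAt(mid) < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}
```

With ~145 buckets this is at most 8 probes, each a cheap DOM read rather than an LLM call.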
## Other things I built along the way

If you've read this far, you might care about these. Quick list:
- `fillForm(json)` — declarative form filling. Pass a JSON object of fields, Sentinel figures out which input is which.
- `intercept(pattern, trigger)` — capture the raw API response during a browser action instead of scraping the rendered DOM.
- TOTP/MFA — auto-generate 2FA codes during login (`mfa: { type: 'totp', secret: '...' }`).
- Planner/executor model split — use Gemini 3 Pro for planning decisions, Gemini Flash for action execution. Cheaper than all-Pro, smarter than all-Flash.
- Click-target verification — before every click, verify the element at the target coordinates matches the intended target. Catches stale-coordinate bugs in dynamic dropdowns.
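The click-target check in the last bullet boils down to comparing the element actually resolved at the click coordinates against the element the planner chose. A minimal sketch of just the comparison; the names here are illustrative, and the actual at-point lookup (`document.elementFromPoint` in the page context) is omitted:

```typescript
interface ClickTarget { role: string; name: string }

// True when the element currently under the click coordinates still
// matches what the planner decided to click. Comparison is loose on
// whitespace and case, since accessible names vary in both.
function matchesIntendedTarget(
  atPoint: ClickTarget | null,   // element resolved at (x, y) right now
  intended: ClickTarget          // element the LLM chose earlier
): boolean {
  if (!atPoint) return false;
  const norm = (s: string) => s.trim().toLowerCase();
  return atPoint.role === intended.role && norm(atPoint.name) === norm(intended.name);
}
```

If the check fails, the safe move is to re-observe the page rather than click blind — a dropdown that re-rendered between observation and action is exactly the stale-coordinate case.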
## Links
- GitHub (MIT): https://github.com/ArasHuseyin/sentinel.ai
- npm: `@isoldex/sentinel`
- Full docs and live benchmark comparison: https://isoldex.ai
- The E2E test from this benchmark: `src/__tests__/e2e/amazon-filter-sort.test.ts` — run with your own Gemini key.
If you've worked on LLM-driven browser automation and have a better scoring heuristic than keyword overlap, I'd love to hear about it. The current filter works but feels primitive. Embeddings would probably be better if latency allowed.
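For the record, the embedding variant I have in mind would swap the overlap count for cosine similarity over precomputed vectors. This is a sketch of only the ranking math; how you obtain the vectors (which embedding model, batching, caching) is left open, and nothing here is a specific library's API:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Rank items by similarity of their (precomputed) embedding to the
// instruction's embedding, keeping the top-N — a drop-in replacement
// for the keyword-overlap score.
function rankByEmbedding<T>(
  items: { item: T; vec: number[] }[],
  instructionVec: number[],
  maxElements: number
): T[] {
  return items
    .map(x => ({ item: x.item, score: cosine(instructionVec, x.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxElements)
    .map(x => x.item);
}
```

The catch is latency: one embedding call per element per page is a lot of round trips unless you batch aggressively or embed locally.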