Most AI browser automation tools fall into two camps: vision-based (screenshot the page, send it to a model, get click coordinates) or selector-based (CSS/XPath targeting). Both have fundamental problems.
Vision is slow and expensive — every action requires a screenshot round-trip to a vision model. Selectors are brittle and meaningless to an LLM — div.container > ul > li:nth-child(3) > a tells the AI nothing about what it's clicking.
There's a third approach: text snapshots with numbered refs. The browser's accessibility tree is already a structured, semantic representation of the page. Give it to the AI as text, let the AI pick a ref, execute the action. No vision, no selectors, deterministic targeting.
Here's a complete AI browser agent using browserclaw and Claude:
import Anthropic from "@anthropic-ai/sdk";
import { BrowserClaw } from "browserclaw";
const anthropic = new Anthropic();
const SYSTEM = `You are a browser automation agent.
You receive a text snapshot of a web page with numbered refs (e1, e2, etc.).
Respond with a single JSON action:
{"action": "click", "ref": "e1", "reasoning": "..."}
{"action": "type", "ref": "e3", "text": "value", "reasoning": "..."}
{"action": "done", "reasoning": "..."} — task is complete
{"action": "fail", "reasoning": "..."} — task cannot be completed`;
async function runAgent(task: string, url: string) {
  const browser = await BrowserClaw.launch({ headless: false });
  const page = await browser.open(url);
  const history: string[] = [];
  try {
    for (let step = 0; step < 20; step++) {
      const { snapshot } = await page.snapshot();
      const response = await anthropic.messages.create({
        model: "claude-opus-4-6",
        max_tokens: 512,
        system: SYSTEM,
        messages: [{
          role: "user",
          content: `Task: ${task}\n\n${history.length ? `Previous actions:\n${history.join("\n")}\n\n` : ""}Current page:\n${snapshot}`,
        }],
      });
      const text = response.content[0].type === "text" ? response.content[0].text : "";
      const action = JSON.parse(text.match(/\{[\s\S]*\}/)![0]);
      console.log(`Step ${step + 1}: ${action.action} ${action.ref || ""} — ${action.reasoning}`);
      history.push(`${action.action} ${action.ref || ""}: ${action.reasoning}`);
      switch (action.action) {
        case "click":
          await page.click(action.ref);
          break;
        case "type":
          await page.type(action.ref, action.text, { submit: action.submit });
          break;
        case "done":
          console.log("Task complete:", action.reasoning);
          return;
        case "fail":
          console.error("Task failed:", action.reasoning);
          return;
      }
    }
  } finally {
    await browser.stop();
  }
}
runAgent("Find the top post on Hacker News and click into it", "https://news.ycombinator.com");
That's it. Launch a browser, loop: snapshot → ask the LLM → execute the action. No framework, no config, no hosted platform.
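The one fragile point in the loop is the `JSON.parse(text.match(...)![0])` line: models occasionally wrap the JSON in prose or emit an action the switch doesn't handle. A stricter parser is a small addition; this is a sketch, and the `AgentAction` type and `parseAction` helper are illustrative, not part of browserclaw:

```typescript
// Illustrative types and helper, not part of browserclaw's API.
type AgentAction =
  | { action: "click"; ref: string; reasoning: string }
  | { action: "type"; ref: string; text: string; submit?: boolean; reasoning: string }
  | { action: "done"; reasoning: string }
  | { action: "fail"; reasoning: string };

function parseAction(raw: string): AgentAction {
  // Extract the first {...} block; models sometimes wrap JSON in prose.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error(`No JSON object in model output: ${raw}`);
  const parsed = JSON.parse(match[0]);
  switch (parsed.action) {
    case "click":
      if (typeof parsed.ref !== "string") throw new Error("click requires a ref");
      return parsed;
    case "type":
      if (typeof parsed.ref !== "string" || typeof parsed.text !== "string")
        throw new Error("type requires ref and text");
      return parsed;
    case "done":
    case "fail":
      return parsed;
    default:
      throw new Error(`Unknown action: ${parsed.action}`);
  }
}
```

On a parse failure you can feed the error message back to the model on the next step and retry, instead of crashing the run.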
What the AI actually sees
When you call page.snapshot(), browserclaw returns a text representation of the page's accessibility tree with numbered refs. Here's a simplified example of what a Hacker News snapshot looks like:
- navigation:
  - link "Hacker News" [e1]
  - link "new" [e2]
  - link "past" [e3]
  - link "comments" [e4]
- table:
  - row:
    - cell "1."
    - link "Show HN: I built a real-time collaborative spreadsheet" [e5]
    - text "(github.com)"
  - row:
    - link "284 comments" [e6]
The AI reads this, understands the page structure instantly (it's just text — what LLMs are best at), and returns {"action": "click", "ref": "e5"}. browserclaw resolves e5 to a Playwright locator and clicks the exact element. One ref, one element, no guessing.
Compare this to vision-based approaches where the AI sees a 1280x720 screenshot and tries to output pixel coordinates. Or selector-based approaches where the AI has to guess a.storylink and hope the site hasn't changed its CSS classes.
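To make the ref mechanism concrete, here is a toy sketch of how a snapshot and its ref table could be generated from an accessibility tree. It is illustrative only, not browserclaw's actual implementation:

```typescript
// Toy model of snapshot generation. Illustrative, not browserclaw's code.
interface A11yNode {
  role: string;          // "link", "button", "text", ...
  name?: string;         // accessible name
  children?: A11yNode[];
}

const INTERACTIVE = new Set(["link", "button", "textbox", "combobox"]);

function snapshotTree(root: A11yNode): { text: string; refs: Map<string, A11yNode> } {
  const refs = new Map<string, A11yNode>();
  const lines: string[] = [];
  let counter = 0;

  const walk = (node: A11yNode, depth: number) => {
    let line = `${"  ".repeat(depth)}- ${node.role}`;
    if (node.name) line += ` "${node.name}"`;
    // Only interactive elements get refs; those are what the AI can act on.
    if (INTERACTIVE.has(node.role)) {
      const ref = `e${++counter}`;
      refs.set(ref, node);
      line += ` [${ref}]`;
    }
    lines.push(line);
    node.children?.forEach((c) => walk(c, depth + 1));
  };

  walk(root, 0);
  return { text: lines.join("\n"), refs };
}
```

When the model answers `{"ref": "e2"}`, the executor looks the node up in `refs` and drives the real element: one ref, one node.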
Why this works better than you'd expect
It's fast. A text snapshot is ~1-2KB. A screenshot is ~200KB+ and needs a vision model round-trip. For a 20-step workflow, you're looking at seconds vs. minutes.
It's cheap. Text in, text out. No vision API calls. A 20-step task costs roughly ~150K tokens with snapshots vs. ~600K+ with screenshots. Scale that to hundreds of runs per day and the difference is enormous.
It's deterministic. The AI picks e5, browserclaw clicks exactly e5. No coordinate interpolation, no "close enough" matching.
It's resilient. Accessibility trees are semantic — they describe what elements are (button, link, input), not how they look. A site redesign that changes every CSS class but keeps the same form fields? The snapshot still captures them by role and name, and the AI still knows what to do.
One caveat: refs are scoped to the snapshot that created them. After navigation or significant DOM changes, old refs become invalid — always re-snapshot before acting on a changed page. That's why the loop snapshots at the top of every iteration.
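The caveat can also be handled mechanically: treat a failed action as a signal to re-snapshot and let the model decide again. A minimal sketch, where `PageLike` mirrors the calls used in the example and the error-means-stale assumption is mine, not browserclaw's documented behavior:

```typescript
// Sketch: retry an action once after re-snapshotting, in case the ref went stale.
interface PageLike {
  snapshot(): Promise<{ snapshot: string }>;
  click(ref: string): Promise<void>;
}

async function clickWithRefresh(
  page: PageLike,
  chooseRef: (snapshot: string) => string, // normally: ask the LLM again
  ref: string,
): Promise<void> {
  try {
    await page.click(ref);
  } catch {
    // Ref likely stale after navigation or a DOM change:
    // re-snapshot, re-decide, retry once.
    const { snapshot } = await page.snapshot();
    await page.click(chooseRef(snapshot));
  }
}
```

In the main loop this collapses into "snapshot at the top of every iteration", which is why the example never acts on a ref from a previous step.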
How it looks
┌─────────────┐    snapshot()    ┌─────────────────────────────────┐
│  Web Page   │ ───────────────► │  AI-readable text tree          │
│             │                  │                                 │
│  [buttons]  │                  │  - heading "Example Domain"     │
│  [links]    │                  │  - paragraph "This domain..."   │
│  [inputs]   │                  │  - link "More information" [e1] │
└─────────────┘                  └──────────────┬──────────────────┘
                                                │
                                       AI reads snapshot,
                                       decides: click e1
                                                │
┌─────────────┐    click('e1')   ┌──────────────▼──────────────────┐
│  Web Page   │ ◄─────────────── │  Ref "e1" resolves to a         │
│ (navigated) │                  │  Playwright locator — one ref,  │
│             │                  │  one exact element              │
└─────────────┘                  └─────────────────────────────────┘
In production: paying bills and making donations
This same loop — snapshot → LLM → action → repeat — powers two production systems handling real money.
beelz.ai — Automated bill payment
beelz.ai pays your bills automatically. Electric, internet, insurance — any provider, any website. No per-provider API integrations.
The core is a browserclaw agent loop wrapped in a Temporal workflow. The AI navigates to your provider's payment page, identifies form fields from the snapshot, fills in your payment details, and submits. Every successful payment generates a "biller skill" — a playbook that makes subsequent payments to the same provider faster and more reliable.
The system handles MFA prompts, CAPTCHAs (by pausing and asking the user), form validation errors, and multi-step checkout flows. All from snapshots and refs.
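The pause-and-ask pattern falls out of the snapshot naturally: challenges show up in the accessibility tree as text and inputs like everything else. A hedged sketch of the detection side (the keyword list and function are illustrative; beelz.ai's actual heuristics are not public):

```typescript
// Illustrative heuristic. Real systems combine text hints with iframe
// origins (e.g. recaptcha/hcaptcha embeds) and provider-specific signals.
const CAPTCHA_HINTS = [/captcha/i, /i'?m not a robot/i];
const MFA_HINTS = [/verification code/i, /one[- ]time (code|password)/i, /two[- ]factor/i];

function detectChallenge(snapshot: string): "captcha" | "mfa" | null {
  if (CAPTCHA_HINTS.some((re) => re.test(snapshot))) return "captcha";
  if (MFA_HINTS.some((re) => re.test(snapshot))) return "mfa";
  return null;
}
```

When this fires, the loop emits a "needs human" event and blocks until the user resolves the challenge, then re-snapshots and continues.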
tovli.ai — Automated charitable donations
tovli.ai automates charitable giving. Donors build a portfolio of causes, and the system executes donations on their behalf across thousands of different NGO websites.
Here's what their production agent loop looks like (simplified):
for (let step = 0; step < MAX_STEPS; step++) {
  const { snapshot, refs } = await page.snapshot();

  // Send snapshot to Claude, get back a structured action
  const action = await askLLM(task, snapshot, refs, history, lastError);
  console.log(`Step ${step + 1}: ${action.action} on ${action.ref} — ${action.reasoning}`);
  emit({ type: "progress", message: action.user_message });

  switch (action.action) {
    case "click":
      await page.click(action.ref);
      break;
    case "type":
      await page.type(action.ref, action.text);
      break;
    case "select":
      await page.select(action.ref, ...action.values);
      break;
    case "fill_payment_iframe":
      // Card details come from a PCI-compliant vault, never stored on the server
      await fillPaymentFields(page, cardDetails);
      break;
    case "done":
      await captureConfirmation(page);
      return;
  }
}
Same pattern as the 50-line example, with production additions: error recovery, CAPTCHA detection, cross-origin iframe handling for payment forms (Stripe, Square, etc.), and SSE streaming for real-time progress updates to the donor.
The key insight: browserclaw's evaluateInAllFrames() lets the agent fill payment fields inside cross-origin iframes — something most browser automation tools can't do because they're blocked by the same-origin policy. browserclaw uses CDP to bypass it.
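evaluateInAllFrames() is the browserclaw side; the agent still has to decide which frames actually host card inputs. A toy sketch of that selection logic, where the origin list is illustrative and not authoritative:

```typescript
// Illustrative: pick out frames that likely host hosted payment fields.
// The hosts below are examples, not an exhaustive list.
const PAYMENT_FRAME_HOSTS = [
  "js.stripe.com",
  "web.squarecdn.com",
  "checkout.paypal.com",
];

function isLikelyPaymentFrame(frameUrl: string): boolean {
  try {
    const { hostname } = new URL(frameUrl);
    return PAYMENT_FRAME_HOSTS.some(
      (h) => hostname === h || hostname.endsWith(`.${h}`),
    );
  } catch {
    return false; // unparseable frame URLs
  }
}
```

Inside the matching frames, the evaluate call can fill the card inputs directly; because it runs via CDP rather than as in-page script, the same-origin policy doesn't block it.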
The architectural advantage
Most tools in this space are complete AI agents that happen to control a browser. They own the intelligence layer. You give them a task and they run.
browserclaw is different. It's just the eyes and hands. It takes a snapshot and returns refs. It executes actions. The reasoning, the task planning, the error handling — that lives in your code. You control the LLM calls, the prompts, the retry logic, the workflow orchestration.
This distinction matters when you're building a product, not a script. beelz.ai wraps browserclaw in Temporal workflows with durable retries. tovli.ai uses the Anthropic SDK directly with SSE streaming. Both use browserclaw for the browser part and nothing else. You can't compose an agent-first tool into a system that already has an agent — you end up with two brains fighting over who's in charge.
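One way to picture the split: the browser layer implements a small "actuator" interface, and everything above it is your code. A sketch, with names that are mine for illustration rather than browserclaw's actual surface:

```typescript
// Illustrative separation of concerns. Names are mine, not browserclaw's API.

// Eyes and hands: what the browser layer provides.
interface Actuator {
  observe(): Promise<string>;  // snapshot text
  act(action: { action: string; ref?: string; text?: string }): Promise<void>;
}

// Brain: what you own. Any function from observation to action qualifies.
type Policy = (snapshot: string, history: string[]) => Promise<{ action: string; ref?: string }>;

async function drive(actuator: Actuator, policy: Policy, maxSteps = 20): Promise<string[]> {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const snapshot = await actuator.observe();
    const action = await policy(snapshot, history);
    history.push(action.action);
    if (action.action === "done" || action.action === "fail") break;
    await actuator.act(action);
  }
  return history;
}
```

Because `Policy` is just a function, it can be a Claude call, a GPT call, a recorded "biller skill" replay, or a Temporal activity; the browser layer doesn't care.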
Try it
npm install browserclaw
The 50-line example runs against any website with a system Chrome — no Playwright browser install needed. Fork it, swap Claude for GPT, point it at your own use case.
If you build something with it, I'd love to hear about it.
Idan Rubin is the creator of browserclaw, extracted and refined from OpenClaw's browser automation module.