Browser automation has a dirty secret: the code you write today is already dying.
You spend two hours reverse-engineering a login form. You figure out the exact XPath. You write //div[@class='auth-container']//input[@data-testid='email-field'] and it works. You ship it. You forget about it.
Three weeks later it breaks. The frontend team ran an A/B test and swapped data-testid for id. Or they rolled out a redesign. Or they just changed a class name.
This happens constantly. Most browser automation isn't hard to write — it's hard to maintain.
The Selector Problem
Here's the actual issue: CSS selectors and XPath describe structure, not intent.
When you write #signup-form > .input-wrapper:nth-child(2) > input, you're not saying "fill the email field." You're saying "find an input element that is the second child of a wrapper inside the signup form." The machine knows nothing about what that input does — only where it lives in the DOM.
The moment the DOM changes — and it will — your selector is invalid.
Dynamic IDs make this worse. React, Angular, and Vue all generate class names like _3xR9s or css-1q5f3g. There's no stable handle to grab. You end up chaining 8 selectors together hoping one of them survives the next deploy.
A Different Approach: Describe What You See
What if, instead of describing the DOM structure, you described what you see on screen?
Instead of: //input[@id='email_input_63h2']
You say: "Fill the email address field"
Instead of: button.btn-primary:nth-child(3)
You say: "Click the blue Submit button at the bottom of the form"
This is how vision-based browser automation works. Take a screenshot. Send it to a vision model. Ask it to locate and interact with the element you described in plain English. The model sees exactly what a human sees — layout, labels, colors, context.
The DOM changes. The visual structure mostly doesn't. A blue Submit button is still a blue Submit button after a redesign.
How It Works in Practice
Here's the core of a vision-based interaction module built with Claude's vision API:
from vision_browser import VisionBrowser
from patchright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto("https://example.com/signup")
vb = VisionBrowser(page)
# Understand the page state first
state = await vb.understand_page()
print(state)
# -> {"title": "Sign Up", "forms": ["email", "username", "password"],
# "buttons": ["Sign Up", "Already have an account?"]}
# Fill fields by description, not selector
await vb.smart_fill_field("email address field", "user@example.com")
await vb.smart_fill_field("username or display name", "henry_knight")
await vb.smart_fill_field("password", "my-secure-pass-123")
# Click by description
await vb.smart_click("the Sign Up button")
No selectors. No XPath. No DOM inspection. The agent takes a screenshot at each step, sends it to Claude, and uses the visual response to determine where to click or type.
The Agent Layer
For full browser tasks, wrap this in an agent loop:
// browser-agent.js — simplified
const task = "sign up for an account on example.com with email user@test.com";
// Before starting, check what failed last time
const failureHistory = buildFailureHistory(task);
const prompt = `
Complete this browser task: ${task}
Past failures to avoid:
${failureHistory}
You have access to: take_screenshot, navigate_page, click, fill, evaluate_script.
Screenshot first. Describe what you see. Then act.
`;
spawn('claude', ['-p', prompt, '--permission-mode', 'bypassPermissions']);
The key part is failureHistory — before each run, the agent queries a SQLite database of past attempts, reads what failed and why, and starts with that context. Each run gets smarter.
Handling CAPTCHAs
Vision-based interaction also changes how you handle CAPTCHAs. With traditional selectors, detection usually means looking for a specific iframe ID or class. Fragile. With vision, you ask Claude to describe the page state:
state = await vb.understand_page()
if "captcha" in state.get("obstacles", []):
captcha_type = state["captcha_type"] # "press_hold", "image_select", "text", "turnstile"
if captcha_type == "press_hold":
await solve_press_hold(page) # timing algorithm, no API needed
elif captcha_type == "text":
answer = await vb.ask_claude_vision(screenshot, "Read and solve this text CAPTCHA")
await vb.smart_fill_field("captcha answer field", answer)
else:
await solve_via_api(page, captcha_type) # EzCaptcha / 2captcha
Text CAPTCHAs are free — Claude reads them directly. Press-and-hold CAPTCHAs are free — solved with a timing algorithm. Only the hard ones (Arkose Labs, some image grids) require a paid API.
The Memory Layer
The last piece that makes this production-ready is persistence. Every attempt gets logged:
mem = BrowserMemory('/path/to/my.db')
# Before attempting
history = mem.get_site_history('example.com', limit=10)
strategy = mem.suggest_next_strategy('example.com', history)
# After outcome
mem.log_attempt(
site='example.com',
strategy_name=strategy,
outcome='success',
reasoning='Vision locate worked for all 3 fields. No CAPTCHA encountered.'
)
When the agent runs again against the same site, it reads this history first. It knows which strategies worked, which failed, and why. It doesn't repeat known dead ends.
Getting the Starter Kit
I packaged all of this — vision_browser.py, captcha_brain.py, browser_memory.py, proxy_manager.py, the agent runner, and 3 ready-to-run examples — into a starter kit.
If you're building browser agents and want to skip the parts I already figured out the hard way:
Get the Claude Browser Agent Starter Kit →
$27. Python + Node.js. Production-ready. No XPath required.
Questions or feedback? Drop a comment below — happy to explain any part in more detail.
Top comments (1)
Some comments may only be visible to logged-in visitors. Sign in to view all comments.