Henry Knight

Posted on Jun 7 • Edited on Jun 13

I built a Claude browser agent that automates Playwright tasks — here's the starter kit

#automation #ai #claude #playwright

Browser automation has a dirty secret: the code you write today is already dying.

You spend two hours reverse-engineering a login form. You figure out the exact XPath. You write //div[@class='auth-container']//input[@data-testid='email-field'] and it works. You ship it. You forget about it.

Three weeks later it breaks. The frontend team ran an A/B test and swapped data-testid for id. Or they rolled out a redesign. Or they just changed a class name.

This happens constantly. Most browser automation isn't hard to write — it's hard to maintain.

The Selector Problem

Here's the actual issue: CSS selectors and XPath describe structure, not intent.

When you write #signup-form > .input-wrapper:nth-child(2) > input, you're not saying "fill the email field." You're saying "find an input element that is the second child of a wrapper inside the signup form." The machine knows nothing about what that input does — only where it lives in the DOM.

The moment the DOM changes — and it will — your selector is invalid.

Dynamic IDs make this worse. React, Angular, and Vue all generate class names like _3xR9s or css-1q5f3g. There's no stable handle to grab. You end up chaining 8 selectors together hoping one of them survives the next deploy.

A Different Approach: Describe What You See

What if, instead of describing the DOM structure, you described what you see on screen?

Instead of: //input[@id='email_input_63h2']

You say: "Fill the email address field"

Instead of: button.btn-primary:nth-child(3)

You say: "Click the blue Submit button at the bottom of the form"

This is how vision-based browser automation works. Take a screenshot. Send it to a vision model. Ask it to locate and interact with the element you described in plain English. The model sees exactly what a human sees — layout, labels, colors, context.

The DOM changes. The visual structure mostly doesn't. A blue Submit button is still a blue Submit button after a redesign.

How It Works in Practice

Here's the core of a vision-based interaction module built with Claude's vision API:

from vision_browser import VisionBrowser
from patchright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)
    page = await browser.new_page()
    await page.goto("https://example.com/signup")

    vb = VisionBrowser(page)

    # Understand the page state first
    state = await vb.understand_page()
    print(state)
    # -> {"title": "Sign Up", "forms": ["email", "username", "password"],
    #     "buttons": ["Sign Up", "Already have an account?"]}

    # Fill fields by description, not selector
    await vb.smart_fill_field("email address field", "user@example.com")
    await vb.smart_fill_field("username or display name", "henry_knight")
    await vb.smart_fill_field("password", "my-secure-pass-123")

    # Click by description
    await vb.smart_click("the Sign Up button")

No selectors. No XPath. No DOM inspection. The agent takes a screenshot at each step, sends it to Claude, and uses the visual response to determine where to click or type.

The Agent Layer

For full browser tasks, wrap this in an agent loop:

// browser-agent.js — simplified
const task = "sign up for an account on example.com with email user@test.com";

// Before starting, check what failed last time
const failureHistory = buildFailureHistory(task);

const prompt = `
Complete this browser task: ${task}

Past failures to avoid:
${failureHistory}

You have access to: take_screenshot, navigate_page, click, fill, evaluate_script.
Screenshot first. Describe what you see. Then act.
`;

spawn('claude', ['-p', prompt, '--permission-mode', 'bypassPermissions']);

The key part is failureHistory — before each run, the agent queries a SQLite database of past attempts, reads what failed and why, and starts with that context. Each run gets smarter.

Handling CAPTCHAs

Vision-based interaction also changes how you handle CAPTCHAs. With traditional selectors, detection usually means looking for a specific iframe ID or class. Fragile. With vision, you ask Claude to describe the page state:

state = await vb.understand_page()

if "captcha" in state.get("obstacles", []):
    captcha_type = state["captcha_type"]  # "press_hold", "image_select", "text", "turnstile"

    if captcha_type == "press_hold":
        await solve_press_hold(page)      # timing algorithm, no API needed
    elif captcha_type == "text":
        answer = await vb.ask_claude_vision(screenshot, "Read and solve this text CAPTCHA")
        await vb.smart_fill_field("captcha answer field", answer)
    else:
        await solve_via_api(page, captcha_type)   # EzCaptcha / 2captcha

Text CAPTCHAs are free — Claude reads them directly. Press-and-hold CAPTCHAs are free — solved with a timing algorithm. Only the hard ones (Arkose Labs, some image grids) require a paid API.

The Memory Layer

The last piece that makes this production-ready is persistence. Every attempt gets logged:

mem = BrowserMemory('/path/to/my.db')

# Before attempting
history = mem.get_site_history('example.com', limit=10)
strategy = mem.suggest_next_strategy('example.com', history)

# After outcome
mem.log_attempt(
    site='example.com',
    strategy_name=strategy,
    outcome='success',
    reasoning='Vision locate worked for all 3 fields. No CAPTCHA encountered.'
)

When the agent runs again against the same site, it reads this history first. It knows which strategies worked, which failed, and why. It doesn't repeat known dead ends.

Getting the Starter Kit

I packaged all of this — vision_browser.py, captcha_brain.py, browser_memory.py, proxy_manager.py, the agent runner, and 3 ready-to-run examples — into a starter kit.

If you're building browser agents and want to skip the parts I already figured out the hard way:

Get the Claude Browser Agent Starter Kit →

$27. Python + Node.js. Production-ready. No XPath required.

Questions or feedback? Drop a comment below — happy to explain any part in more detail.

Top comments (2)

Todd Schiller • Jun 7

The selector decay problem is real. I wrote about reliable selectors for browser automation back in 2020 and most of the list still holds: avoid structural :nth-child chains, anchor on ARIA attributes and data-testid, target text with :contains/:has, prefer input[name] when you can. toddschiller.com/blog/reliable-web...

What's new is the agent can re-derive the selector when the page shifts. Two failure modes worth budgeting for: silent regressions where the selector still matches but matches the wrong element, and retry-the-same-locator-harder loops where the agent never notices the structure changed

Some comments may only be visible to logged-in visitors. Sign in to view all comments.