Henry Knight

How I Built a Claude Browser Agent That Actually Works (Starter Kit Inside)

Most browser automation tutorials are lying to you.

They show you a clean 20-line script that clicks a button and fills a form. They don't show you what happens when the site loads a shadow DOM, the button is inside a cross-origin iframe, or a CAPTCHA intercepts you mid-flow. They definitely don't show you what happens at 3am when your agent silently loops and burns $200 in API calls before you notice.

I spent weeks building a Claude-powered browser agent that actually survives real-world conditions. Here's what I learned—and the three things that made the difference.

1. CDP Direct > Playwright Abstractions (for agent control)

Playwright and Puppeteer are great for scripted automation. They're terrible for AI-driven agents.

The problem: when Claude decides what to do next based on what it sees on screen, you need raw accessibility tree data, not a DOM dump and not a screenshot. The Chrome DevTools Protocol (CDP) gives you a11y tree snapshots in a structured format: each node carries a role, an accessible name, and IDs (`nodeId`, `backendDOMNodeId`) that you can hand to the model as element UIDs. You can ask Claude "click the submit button" and it can identify the exact element from the snapshot, not guess at CSS selectors.

My setup:

// Get structured a11y snapshot instead of dumping innerHTML
const { nodes } = await cdp.send('Accessibility.getFullAXTree');
// Feed nodes to Claude with element UIDs attached
const decision = await claude.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: formatNodes(nodes) + '\nWhat element should I click?' }]
});

This is the scaffolding most tutorials skip. Without it, your agent is navigating blind.
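The snippet above calls `formatNodes` without showing it. Here's a minimal sketch of what such a helper might look like, assuming the `nodes` array has the AXNode shape that `Accessibility.getFullAXTree` returns (`ignored`, `role.value`, `name.value`, `backendDOMNodeId`). The function body, the `maxNodes` cap, and the output format are my own illustration, not a fixed API.

```javascript
// Hypothetical helper: turn CDP AXNode objects into compact, one-line-per-element text.
function formatNodes(nodes, maxNodes = 200) {
  const lines = [];
  for (const node of nodes) {
    // Nodes the browser marks as ignored are not exposed to assistive tech.
    if (node.ignored) continue;
    const role = node.role?.value ?? 'unknown';
    const name = String(node.name?.value ?? '').trim();
    // Unnamed generic containers cost tokens without giving the model anything to act on.
    if (!name && (role === 'generic' || role === 'none')) continue;
    // Prefer backendDOMNodeId as the UID; fall back to the AX nodeId.
    const uid = node.backendDOMNodeId ?? node.nodeId;
    lines.push(`[${uid}] ${role} "${name}"`);
    // Cap the snapshot so a huge page cannot blow up the prompt.
    if (lines.length >= maxNodes) break;
  }
  return lines.join('\n');
}
```

A button named "Submit" with `backendDOMNodeId` 42 comes out as `[42] button "Submit"`. Using `backendDOMNodeId` as the UID is a design choice: CDP methods such as `DOM.getBoxModel` accept a `backendNodeId`, so the ID Claude picks can be turned back into click coordinates.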

2. Prompt Caching Cuts Costs 90%—Wire It From Day One

Every agent loop re-sends the same system prompt. If you're running 50 browser actions in a session, you're paying for 50 full system prompt inputs. With Claude's native prompt caching, you mark your system context as cacheable and the API reuses it.

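In practice this cuts input token costs on the cached portion by up to ~90% (cache reads are billed at about a tenth of the base input price) and can reduce latency by up to ~80% on repeated calls. Two caveats: the first call pays a 25% premium to write the cache, and prompts shorter than a model-dependent minimum (1,024 tokens or more) are not cached at all. For a long browser session this can be the difference between a $0.50 task and a $5 task.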

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' }  // This is all it takes
    }
  ],
  messages: conversationHistory
});

Add this before you do anything else. It compounds across every agent call you ever make.
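To see where the savings come from, here's a rough back-of-the-envelope sketch. It uses Anthropic's published multipliers for the default 5-minute cache (cache writes at 1.25x the base input price, cache reads at 0.1x). The function name, the 4,000-token prompt, and the $3 per million tokens price are placeholder assumptions for illustration; check current pricing for your model.

```javascript
// Multipliers relative to the base input price (default 5-minute cache).
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Hypothetical estimator: cost of the system prompt alone across one session.
// Assumes calls >= 1 and that every call lands inside the cache lifetime.
function estimateSystemPromptCost({ promptTokens, calls, pricePerMTok }) {
  const perCall = (promptTokens / 1e6) * pricePerMTok;
  // Without caching, every call pays full price for the prompt.
  const uncached = perCall * calls;
  // With caching, the first call writes the cache and the rest read it.
  const cached =
    perCall * CACHE_WRITE_MULTIPLIER +
    perCall * CACHE_READ_MULTIPLIER * (calls - 1);
  return { uncached, cached, savings: 1 - cached / uncached };
}

const estimate = estimateSystemPromptCost({
  promptTokens: 4000, // assumed system prompt size
  calls: 50,          // browser actions in one session
  pricePerMTok: 3,    // placeholder price in USD per million input tokens
});
console.log(estimate);
```

With these numbers the system prompt costs about $0.60 uncached versus about $0.07 cached, a saving of roughly 88%. A single call actually costs 25% more with caching, so the benefit only appears on repeated calls. The estimate covers the cached system prompt only; the growing conversation history is still billed at the normal rate unless you place cache breakpoints there too.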

3. Hard Termination Conditions Are Non-Negotiable

This is the one that bites people hardest. Uncontrolled agents in a feedback loop can burn $5k in API costs in four hours—this is documented, it happens in production, and it will happen to you if you don't build exit conditions from the start.

The pattern that works: a loop counter, a state hash to detect if the agent is cycling on the same page, and a max-retry gate per action type.

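const MAX_LOOPS = 25;
const MAX_RETRIES_PER_ACTION = 3;
let loopCount = 0;
let retryCount = 0;
let lastStateHash = null;

while (!taskComplete && loopCount < MAX_LOOPS) {
  const stateHash = hashPageState(a11ySnapshot);
  // Count consecutive steps that left the page unchanged; reset on progress
  retryCount = stateHash === lastStateHash ? retryCount + 1 : 0;
  if (retryCount >= MAX_RETRIES_PER_ACTION) throw new Error('Agent stuck: terminating');
  lastStateHash = stateHash;
  loopCount++;
  // ... agent step (must refresh a11ySnapshot and set taskComplete)
}
// Hitting the loop cap is a failure, not a silent exit
if (!taskComplete) throw new Error('Loop cap reached before task completed');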
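The loop above relies on a `hashPageState` helper that isn't shown. One possible shape, assuming the snapshot is the array of AXNode objects from `Accessibility.getFullAXTree`, is sketched below. Hashing only role, name, and value (and skipping node IDs) is my own assumption, made so that two snapshots of a visually identical page produce the same hash even if IDs differ between snapshots.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: fingerprint the visible a11y state of a page.
function hashPageState(nodes) {
  const fingerprint = nodes
    .filter((node) => !node.ignored)
    // Deliberately exclude nodeId/backendDOMNodeId from the fingerprint.
    .map((node) =>
      [node.role?.value ?? '', node.name?.value ?? '', node.value?.value ?? ''].join('|')
    )
    .join('\n');
  return createHash('sha256').update(fingerprint).digest('hex');
}
```

Because field values such as a text input's `value` are part of the fingerprint, typing into a form counts as progress, while clicking a button that does nothing does not.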

Pair this with a hard timeout (I use 900 seconds for complex flows) and you've capped your blast radius.
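Here's a minimal sketch of how a wall-clock deadline could sit alongside the loop cap. The `runWithDeadline` name, the `step` callback contract (resolve to `true` when the task is done), and the injectable `now` clock are all assumptions for illustration.

```javascript
// Hypothetical runner: stops on task completion, loop cap, or wall-clock deadline.
async function runWithDeadline(
  step,
  { maxLoops = 25, timeoutMs = 900_000, now = Date.now } = {}
) {
  const deadline = now() + timeoutMs;
  for (let i = 0; i < maxLoops; i++) {
    // Check the clock before each step so no new action starts past the deadline.
    if (now() >= deadline) {
      throw new Error(`Agent timed out after ${timeoutMs} ms`);
    }
    const done = await step(i);
    if (done) return i + 1; // number of steps actually taken
  }
  throw new Error(`Agent hit the ${maxLoops}-loop cap`);
}
```

Usage would look like `await runWithDeadline(async () => agentStep(), { timeoutMs: 900_000 })`, where `agentStep` stands in for your own loop body. The deadline is only checked between steps, so a single hung API or CDP call can still overrun it; those calls need their own per-request timeouts.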

What I Packaged Into the Starter Kit

These three patterns—CDP a11y snapshots for agent-legible context, prompt caching from the first call, and hard termination gates—are the core scaffolding every browser agent needs and almost no tutorial covers.

I packaged the full working setup into a starter kit: https://knightops.gumroad.com/l/claude-browser-agent-starter-kit

It includes the full agent loop, the CDP snapshot formatter, the caching layer, and the termination logic—ready to drop into your own project. If you've been fighting browser automation and want a scaffold that actually works in production, that's the fastest path there.

Drop a comment if you're building something with Claude agents—I'm always up for comparing notes on what's working.
