I Let an AI Agent Use My Browser Tool Unsupervised. It Found 3 Bugs in 20 Minutes.

I build Charlotte, an open-source MCP server for browser automation. It renders web pages into structured, token-efficient representations that AI agents can read and act on.

I've spent weeks benchmarking it, optimizing response sizes, writing docs. But I'd never done the obvious thing: point an agent at a real task, give it Charlotte as its only browser tool, and just watch what happens.

So I did. No guidance, no hints, no "use this tool in this way." Just: "Test this UI feature in the browser."

The agent found three bugs in about twenty minutes. Not through any testing framework. Just by trying to get its job done and hitting walls.

The Setup

I had a locally running web app (a code review tool called Crit) with a new feature: comment template chips that appear when you click a line gutter to open a comment form. I wanted to verify that the feature worked across light mode, dark mode, chip insertion, and cursor positioning.

I added Charlotte to the project's MCP config and told the agent to test the template feature. That's it. No instructions about which tools to use or how.

What Worked

Navigation and observation were solid. charlotte:navigate with detail: "summary" gave the agent a clean structural overview of the page. It could see headings, interactive elements, content blocks. Good enough to confirm the page loaded and orient itself.

Screenshots were the MVP. The agent used them twice (light mode, dark mode) and they were the definitive "does it look right?" check. Fast, clear, exactly what a visual verification workflow needs.

Tool profile switching worked smoothly. The agent started on the default browse profile, realized it needed JavaScript evaluation capabilities, ran charlotte:tools enable evaluate, and kept going. No friction.

Bug 1: evaluate silently ate multi-statement code

The agent needed to query the DOM to find gutter elements. It wrote reasonable JavaScript:

var blocks = document.querySelectorAll('[data-line]');
var gutters = document.querySelectorAll('.gutter');
'dataLine=' + blocks.length + ' gutter=' + gutters.length;

Charlotte returned {value: null, type: "undefined"}. No error. The agent tried variations and got the same silent null three times in a row. On the fourth attempt it got a different failure: Unexpected token 'var'.

The agent eventually discovered that wrapping code in an IIFE worked: (() => { ... return value; })(). But it burned four tool calls figuring that out.

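Root cause: Charlotte's evaluate tool was using new Function('return ' + expr) to execute code. When the submitted code starts on a new line, JavaScript's Automatic Semicolon Insertion ends the return statement at the line break, so multi-line input effectively became: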

return          // ASI inserts semicolon here
var blocks = ...  // dead code, never reached

So the function silently returned undefined. The fourth attempt threw an actual error because its code began on the same line as the prepended return, producing return var g = ..., which is a syntax error. Same underlying issue, two different failure modes depending on whitespace.
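
Both failure modes are easy to reproduce outside Charlotte. The sketch below uses a hypothetical naiveEvaluate helper that mirrors the 'return ' + expr pattern described above. It is a stand-in for illustration, not Charlotte's actual source.

```javascript
// Hypothetical stand-in for the old wrapper (assumption: not Charlotte's real code).
function naiveEvaluate(expr) {
  return new Function('return ' + expr)();
}

// A single expression works as intended.
const single = naiveEvaluate('1 + 2');

// Code that starts on a new line: ASI ends `return` at the line break,
// so everything after it is dead code and the call yields undefined.
const multiLine = naiveEvaluate(`
var a = 1;
var b = 2;
a + b;
`);

// Code that starts with a declaration on the same line as `return`
// becomes `return var g = 1; g`, which fails to parse.
let sameLineError = null;
try {
  naiveEvaluate('var g = 1; g');
} catch (err) {
  sameLineError = err;
}

console.log(single, multiLine, sameLineError && sameLineError.name);
// prints: 3 undefined SyntaxError
```

The only difference between the silent undefined and the thrown error is whether a line break follows the prepended return.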

The fix: Replace new Function('return ' + expr) with CDP's Runtime.evaluate, which evaluates JavaScript as a program and returns the completion value of the last expression-statement. Handles single expressions, multi-statement code, var declarations, and IIFEs naturally. Charlotte already has CDP sessions open throughout the codebase, so this was a clean swap, not a new dependency.
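
A minimal sketch of what such a swap could look like, assuming client is a Puppeteer CDPSession (for example from page.createCDPSession()). The evaluateViaCdp name and the returned shape are illustrative assumptions, not Charlotte's actual implementation.

```javascript
// Hypothetical helper; `client` is assumed to expose send(method, params)
// the way a Puppeteer CDPSession does.
async function evaluateViaCdp(client, expression) {
  const { result, exceptionDetails } = await client.send('Runtime.evaluate', {
    expression,          // evaluated as a script, so statements are allowed
    returnByValue: true, // ask for a JSON-serializable value instead of a remote handle
    awaitPromise: true,  // resolve promises before returning
  });

  // Surface parse and runtime errors instead of returning a silent null.
  if (exceptionDetails) {
    const message = exceptionDetails.exception?.description ?? exceptionDetails.text;
    throw new Error(message);
  }

  return { value: result.value, type: result.type };
}
```

The key design point is the exceptionDetails check: a failure is reported as an error rather than collapsing into an undefined result.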

Bug 2: No way to click without an element ID

The agent could see gutter line numbers in the screenshot. It knew approximately where they were on the page. It tried:

charlotte:click({ x: 38, y: 215 })

Error: element_id is required.

The gutter elements are custom <div> elements with no ARIA role. They don't appear in the accessibility tree, so charlotte:find can't locate them either. The agent was stuck: it could see the element visually but had no way to interact with it through Charlotte's standard tools.

The workaround was ugly. The agent enabled evaluate, wrote inline JavaScript to query the DOM, and manually dispatched mouse events. And even that didn't work on the first try, because the app uses mousedown to start a drag selection and attaches a mouseup listener on document to finalize it. The agent had to read the app's source code, understand the event delegation pattern, and dispatch events in the right sequence on the right targets:

// Assumption: gutter references one gutter element, located with a selector such as:
const gutter = document.querySelector('.line-comment-gutter');

// This didn't work (mouseup on wrong target):
gutter.dispatchEvent(new MouseEvent('mousedown', {bubbles: true}));
gutter.dispatchEvent(new MouseEvent('mouseup', {bubbles: true}));

// This worked (matching the app's listener pattern):
gutter.dispatchEvent(new MouseEvent('mousedown', {bubbles: true}));
document.dispatchEvent(new MouseEvent('mouseup', {bubbles: true}));

What should have been one tool call became twelve steps of trial and error.

The fix: charlotte:click_at({ x, y }). The plumbing already existed. Charlotte's internal click implementation resolves element IDs to pixel coordinates and then calls Puppeteer's page.mouse.click(centerX, centerY). The new tool just skips the element resolution step and goes straight to the CDP-level mouse dispatch. Real input events that bubble through the DOM naturally, not synthetic JS events.
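
As a rough sketch of the idea, a coordinate-based click can be a thin wrapper around Puppeteer's mouse API. The clickAt function, its option names, and its return shape are assumptions for illustration; page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical handler for a coordinate-based click (not Charlotte's actual code).
async function clickAt(page, { x, y, button = 'left', modifiers = [] }) {
  if (!Number.isFinite(x) || !Number.isFinite(y)) {
    throw new TypeError('x and y must be finite numbers');
  }

  // Hold modifier keys (e.g. 'Shift', 'Control') for the duration of the click.
  for (const key of modifiers) await page.keyboard.down(key);
  try {
    // Puppeteer moves the pointer, then sends press and release at these viewport coordinates.
    await page.mouse.click(x, y, { button });
  } finally {
    // Release modifiers in reverse order, even if the click failed.
    for (const key of [...modifiers].reverse()) await page.keyboard.up(key);
  }

  return { clicked: { x, y, button } };
}
```

With something like this in place, the failed call from earlier becomes a single step: clickAt(page, { x: 38, y: 215 }).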

Bug 3: find can't see non-semantic elements

charlotte:find({ text: "Do the thing" })  → []
charlotte:find({ type: "button", text: "1" })  → []

The rendered content and line numbers exist in the DOM but aren't exposed through the accessibility tree. Charlotte's find tool filters the accessibility representation, which is the right default for browsing standard websites. But for testing custom UIs, it means any element without a semantic role is invisible.

The fix: A new selector parameter on charlotte:find that queries the DOM directly via CSS selector. When you pass find({ selector: ".line-comment-gutter" }), Charlotte runs DOM.querySelectorAll, extracts basic info (tag, text, bounds), and registers each matched element with its ID system. The returned IDs work with click, hover, drag, and every other interaction tool.

The IDs use a dom- prefix so agents can tell these came from a DOM query rather than the accessibility tree. The semantic observation model stays unchanged. The selector mode is a parallel path that produces compatible element IDs.
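
The selector path can be sketched with a handful of CDP DOM calls. This is an assumption-laden illustration: findBySelector, the registry Map, and the exact dom-N ID format are hypothetical, text extraction is omitted, and client is assumed to be a CDP session with a send(method, params) method.

```javascript
// Hypothetical sketch of a selector-based find (not Charlotte's actual code).
async function findBySelector(client, selector, registry = new Map()) {
  const { root } = await client.send('DOM.getDocument');
  const { nodeIds } = await client.send('DOM.querySelectorAll', {
    nodeId: root.nodeId,
    selector,
  });

  const matches = [];
  for (const nodeId of nodeIds) {
    const { node } = await client.send('DOM.describeNode', { nodeId });

    let bounds = null;
    try {
      const { model } = await client.send('DOM.getBoxModel', { nodeId });
      const [x, y] = model.border; // first point of the border quad
      bounds = { x, y, width: model.width, height: model.height };
    } catch {
      // Nodes that are not rendered have no box model; leave bounds as null.
    }

    // The dom- prefix marks IDs that came from a DOM query, not the accessibility tree.
    const id = `dom-${registry.size + 1}`;
    const entry = { id, tag: node.localName, bounds, nodeId };
    registry.set(id, entry);
    matches.push(entry);
  }
  return matches;
}
```

Because each match is stored in the same registry the interaction tools would read from, an ID returned here can be resolved to coordinates later in the same way as a semantic element ID.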

The Fixes: Charlotte v0.4.1

All three shipped the same day:

charlotte:evaluate now uses CDP Runtime.evaluate directly. Multi-statement code, var declarations, IIFEs, single expressions all work naturally. No more silent nulls.
charlotte:click_at takes x/y coordinates and dispatches CDP-level mouse events. Supports left/right/double click and modifier keys.
CSS selector mode for charlotte:find accepts a selector parameter that queries the DOM directly, returning elements with Charlotte IDs usable by all interaction tools.

What I Learned

Dogfooding AI tools requires AI dogfooding. I'd used Charlotte myself dozens of times. I'd read every line of the codebase. I never would have found the evaluate ASI bug by hand because I instinctively write IIFEs. The agent doesn't have those instincts. It writes the code a reasonable developer would write, hits the wall, and shows you exactly where the wall is.

Twelve steps to one. The click_at gap turned a single interaction into a twelve-step workaround involving DOM queries, source code reading, and manual event dispatch. Watching an agent burn tokens on a workaround you can eliminate is a very effective way to prioritize your backlog.

The accessibility tree is necessary but not sufficient for testing. Charlotte's structured, semantic observation model is the right foundation for browsing and auditing. But testing custom UIs means interacting with elements that aren't semantically exposed. The selector parameter on find bridges that gap without compromising the default experience. Browse profile users get the clean semantic world. Users who need raw DOM access can reach it.

Watch the raw session, not just the results. The agent's writeup at the end was useful. But the real signal was in the session transcript: three silent nulls in a row before an error, twelve steps of increasingly creative workarounds for a missing feature, the exact sequence where it went from "I'll click this" to "I need to dispatch synthetic mouse events on two different DOM targets." That's where the bugs live.