Your AI Doesn't Need Screenshots. It Needs DevTools.
Most AI coding agents are still debugging web apps in the dumbest possible way.
They ask for a screenshot.
Meanwhile the real answer is usually sitting in the browser console, the Network tab, the request payload, or the response body.
That is not really an AI problem. It is a tooling problem.
I got tired of watching agents guess from screenshots while I had DevTools open right next to them, showing the exact reason something failed. So I built mare-browser-mcp, a browser MCP designed less like "remote control for a webpage" and more like "give the model the same debugging signals I actually use."
That changed the loop from this:
AI writes code -> I test it -> I describe the bug -> AI guesses a fix -> repeat
to this:
AI writes code -> AI tests it -> AI reads the failure -> AI fixes it
That difference matters more than people think.
The Screenshot Trap
Most browser MCPs start from the same assumption: if the model can see the page, it can debug the app.
That sounds reasonable until you try to debug anything real.
A screenshot does not tell the model:
- that login failed with a 401
- that the frontend sent `user_email` instead of `email`
- that a button click triggered a JS exception
- that the UI is fine but the API returned invalid JSON
- that the missing data is actually below the fold in a scrollable grid
- that the context menu only appears on right-click
A screenshot is expensive, unstructured, and often the least useful artifact in the whole debugging flow.
Language models are good at reasoning over structured text. We keep handing them pixels and asking them to guess.
What I Wanted Instead
I wanted a browser MCP that makes the model think more like a developer with DevTools open.
That means a few things:
- It should read console logs and page errors directly
- It should see network requests, payloads, response bodies, and timing
- It should batch actions instead of making one tiny tool call per click
- It should query the DOM as structured data
- It should handle hover, drag, right-click, and nested scroll containers
- It should only use screenshots when the problem is actually visual
That is the core idea behind mare-browser-mcp.
It is a lean MCP server built with Playwright and the MCP SDK, and the whole design is biased toward structured debugging data first.
The One Tool I Wish More Browser MCPs Had
The most important tool in the server is browser_debug.
One call returns:
- current URL
- page title
- console logs
- page errors
- dialogs
- network request history
And the network history is the useful part. The server captures things like:
- method
- URL
- parsed query params
- request headers
- request body
- status code
- JSON response body
- duration in milliseconds
So instead of the model seeing "submit button clicked and page still looks wrong," it can see:
{
"method": "POST",
"url": "https://app.example.com/api/session",
"requestBody": {
"user_email": "user@example.com",
"password": "secret"
},
"status": 401,
"responseBody": {
"error": "missing field: email"
},
"duration_ms": 23
}
That is a completely different debugging experience.
At that point the model is not guessing anymore. It can actually reason from evidence.
There are a few guardrails in the implementation too:
- static assets are skipped
- `OPTIONS` preflights are ignored
- bodies are capped at 4 KB
- auth headers are masked
- cookies are reduced to `"[present]"`
- failed requests are logged with the failure reason
So the data stays useful without turning into junk.
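Those guardrails are mostly small, pure helpers. Here is a hedged sketch of what they might look like; the names (`shouldRecord`, `maskHeaders`, `capBody`) are illustrative, not the server's actual identifiers:

```typescript
// Illustrative sketch of the capture guardrails described above.
const MAX_BODY_BYTES = 4096; // "bodies are capped at 4 KB"

// Skip noise: static assets and CORS preflights never enter the history.
function shouldRecord(url: string, method: string): boolean {
  if (method === "OPTIONS") return false;
  return !/\.(png|jpe?g|svg|gif|woff2?|css|js|ico)(\?|$)/.test(url);
}

// Mask secrets so captured headers stay safe to hand to the model.
function maskHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key === "authorization") out[name] = "[masked]";
    else if (key === "cookie" || key === "set-cookie") out[name] = "[present]";
    else out[name] = value;
  }
  return out;
}

// Truncate oversized bodies instead of flooding the model's context.
function capBody(body: string): string {
  if (body.length <= MAX_BODY_BYTES) return body;
  return body.slice(0, MAX_BODY_BYTES) + `… [truncated, ${body.length} bytes]`;
}
```

The point of helpers like these is that every record stays bounded and shareable, so the network history can be long without being heavy.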
Batching Matters More Than It Sounds
Another thing that slows agents down is one-action-per-call design.
Click. Wait. Screenshot. Fill. Wait. Screenshot.
That is not just annoying. It makes the whole debugging loop slower and dumber.
browser_act fixes that by accepting an ordered list of actions in one call:
[
{ "action": "fill", "selector": "#email", "value": "user@example.com" },
{ "action": "fill", "selector": "#password", "value": "secret" },
{ "action": "click", "selector": "button[type=submit]" }
]
The current action set covers a lot of what real apps actually need:
`click`, `hover`, `drag`, `clicklink`, `fill`, `select`, `keypress`, `waitfor`, `scrollto`, `wait`, `clearconsole`
That means the model is not stuck the moment the UI uses hover states, resize handles, sortable grids, or context menus.
And yes, right-click works through click with button: "right", which turns out to matter more often than you might expect.
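A batched runner does not need to be complicated. Here is a minimal sketch with a stub `Driver` interface standing in for Playwright; the type and field names are assumptions, not the server's real ones:

```typescript
// Illustrative sketch of a batched action runner, covering two of the
// action types from the list above.
type Action =
  | { action: "fill"; selector: string; value: string }
  | { action: "click"; selector: string; button?: "left" | "right" };

// Stand-in for the real Playwright page; only what this sketch needs.
interface Driver {
  fill(selector: string, value: string): Promise<void>;
  click(selector: string, button: "left" | "right"): Promise<void>;
}

// Run the list in order and stop at the first failure, so the result
// tells the model exactly which step broke.
async function runActions(driver: Driver, actions: Action[]): Promise<string[]> {
  const log: string[] = [];
  for (const [i, a] of actions.entries()) {
    try {
      if (a.action === "fill") await driver.fill(a.selector, a.value);
      else await driver.click(a.selector, a.button ?? "left");
      log.push(`${i}: ${a.action} ${a.selector} ok`);
    } catch (err) {
      log.push(`${i}: ${a.action} ${a.selector} failed: ${err}`);
      break;
    }
  }
  return log;
}
```

Stopping at the first failure matters: a per-step result list is itself debugging evidence, while blindly continuing after a failed fill just produces misleading downstream errors.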
Structured DOM Queries Beat Giant Dumps
Another thing I did not want was a browser tool that just vomits the whole DOM back at the model.
browser_query is useful because it lets the model ask narrower questions:
- How many rows are there?
- Is this button visible?
- What text do these badges have?
- Is this input disabled?
Examples:
{ "selector": ".ag-row", "count_only": true }
{ "selector": "button", "all": true, "visible_only": true, "limit": 10 }
{ "selector": ".badge", "fields": ["text", "className", "visible"] }
The supported fields are deliberately practical:
`text`, `value`, `visible`, `disabled`, `className`, `href`, `innerHTML`
That keeps the output small enough to reason about and avoids wasting context on useless noise.
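The filter-and-project logic behind a query like that is easy to picture. A sketch, assuming the server first collects candidate elements in the page and then narrows them; `ElementLike` is a plain stand-in for real DOM nodes so the idea is testable outside a browser:

```typescript
// Illustrative sketch of browser_query's narrowing step.
interface ElementLike {
  text: string;
  value?: string;
  visible: boolean;
  disabled?: boolean;
  className: string;
}

interface QueryOpts {
  visible_only?: boolean;
  limit?: number;
  count_only?: boolean;
  fields?: (keyof ElementLike)[];
}

function query(els: ElementLike[], opts: QueryOpts = {}) {
  let matched = opts.visible_only ? els.filter((e) => e.visible) : els;
  if (opts.count_only) return { count: matched.length };
  if (opts.limit !== undefined) matched = matched.slice(0, opts.limit);
  const fields = opts.fields ?? ["text", "visible"];
  // Project only the requested fields to keep the output small.
  return matched.map((el) => Object.fromEntries(fields.map((f) => [f, el[f]])));
}
```

Whatever the real implementation looks like, the shape is the point: the model asks a narrow question and gets back a few small objects, not a DOM dump.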
Screenshots Are Still There. They Just Aren't First.
I did keep browser_screenshot.
But I made the tool description explicitly say it should be the last resort, not the default.
That sounds minor, but it really changes how an agent behaves. If the tool surface itself tells the model:
- use `browser_debug` first
- use `browser_query` second
- use screenshots only for visual problems
then the model starts spending its attention on data instead of appearance.
That is the behavior I wanted.
Real Apps Need More Than Page Scroll
One of the easiest ways browser tools fall apart is scrolling.
Scrolling the page is not enough for modern apps. A lot of important UI is inside:
- data grid viewports
- chat panels
- sidebars
- overflow containers
- modals
browser_scroll can scroll the page, scroll an element into view, or scroll inside a specific container.
Example:
{
"container": ".ag-body-viewport",
"direction": "down",
"pixels": 500
}
And because it returns scroll position data, the model can tell where it is instead of just flailing around.
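The position math itself is just clamping to the container's scrollable range, the same thing the browser does natively with `scrollTop`. A tiny sketch with illustrative names:

```typescript
// Hedged sketch of container scroll clamping; the real server drives a
// live container via Playwright and reports the resulting position.
function nextScrollTop(
  current: number,
  direction: "up" | "down",
  pixels: number,
  maxScrollTop: number // scrollHeight - clientHeight
): number {
  const delta = direction === "down" ? pixels : -pixels;
  return Math.min(maxScrollTop, Math.max(0, current + delta));
}
```

Returning the clamped position is what lets the model notice "I asked for 500px but only moved 200px, so I've hit the bottom of the grid."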
There Is Also an Escape Hatch
Even with all of that, fixed tools never cover every case.
So there is browser_eval, which runs JavaScript in the page context.
That is useful for things like:
- reading computed styles
- inspecting app state on `window`
- appending input text without clearing it
- simulating more custom interactions
- calling `fetch()` directly
- checking CSS visibility details
I do not think the escape hatch should be the first thing the model reaches for. But it absolutely should exist.
What This Looks Like in Practice
A realistic flow is:
1. browser_navigate({ url: "https://myapp.com", clear_logs: true })
2. browser_act({
commands: [
{ action: "fill", selector: "#email", value: "user@example.com" },
{ action: "fill", selector: "#password", value: "secret" },
{ action: "click", selector: "button[type=submit]" }
]
})
3. browser_wait_for_network({ url_pattern: "/api/session", method: "POST" })
4. browser_debug({ console_types: ["error"], last_n: 20 })
From there the model can see:
- whether the request fired
- what payload got sent
- what response came back
- whether the browser threw an error
- whether the failure is frontend, backend, or both
That is the kind of loop where an agent becomes genuinely useful.
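The frontend/backend/both call at the end of that list can even be written down as a rule. This is not part of the server, just a hypothetical sketch of the kind of reasoning the structured output enables:

```typescript
// Hypothetical triage rule over browser_debug output: HTTP status of the
// watched request (null if it never fired) plus any console errors.
function triage(status: number | null, consoleErrors: string[]): string {
  if (status === null) return "request never fired (frontend)";
  const apiFailed = status >= 400;
  const jsFailed = consoleErrors.length > 0;
  if (apiFailed && jsFailed) return "both frontend and backend";
  if (apiFailed) return "backend (or the payload the frontend sent)";
  if (jsFailed) return "frontend";
  return "no obvious failure";
}
```

A model with only a screenshot cannot apply a rule like this; a model with the request record and the console log can.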
Why I Think This Matters
I think a lot of people are measuring browser MCPs by the wrong thing.
The question is not "can the AI click buttons?"
The real question is: when something breaks, can the AI see the same evidence a good developer would look at?
If the answer is yes, the quality of debugging jumps fast.
If the answer is no, you end up with a very expensive screenshot viewer.
That is why I built mare-browser-mcp the way I did. Not to make agents better at looking at pages, but to make them better at understanding what happened in the browser.
Not screenshots first. Telemetry first.
Try It
git clone https://github.com/emadklenka/mare_browser_mcp
cd mare_browser_mcp
pnpm install
npx playwright install chromium
pnpm run setup
It also supports global install:
pnpm add -g mare-browser-mcp
npx playwright install chromium
The repo has setup instructions for Claude Code and OpenCode, and the current server exposes:
`browser_navigate`, `browser_act`, `browser_debug`, `browser_query`, `browser_screenshot`, `browser_eval`, `browser_scroll`, `browser_restart`, `browser_upload`, `browser_wait_for_network`
If you are building web apps with AI, that combination is much closer to giving your agent DevTools than giving it a screenshot and hoping for the best.