I was building an AI agent that needed to browse real websites — fill out forms, click buttons, navigate multi-step flows. Pretty standard stuff for an AI agent in 2025.
Every tool I tried eventually broke.
The Playwright MCP server worked fine at first. Then a site redesigned its nav and all the CSS selectors went stale. The agent started failing on button.submit-btn-v2 when the class had changed to button.cta-primary. I'd fix it, it'd break again next week. I was playing whack-a-mole with selectors.
The deeper problem: CSS selectors are a human abstraction that AI agents were never meant to use. They assume you know the structure of the DOM ahead of time. An AI agent doesn't. It's exploring.
The insight that changed everything
Browsers already have a built-in representation of a page that doesn't depend on CSS classes — the accessibility tree. It's the structure screen readers use. It describes what's on the page in terms of roles, labels, and relationships, not implementation details.
Instead of button.submit-btn-v2, you get [button] "Submit".
That button will still be [button] "Submit" after a redesign. The label might change, but the semantic meaning stays stable far longer than CSS class names.
How Interact MCP works
I built Interact MCP around this idea. Every interaction is two steps:
Step 1: Snapshot the page
interact_snapshot()
Returns something like:
@e1 [heading] "Sign in to GitHub"
@e2 [textbox] "Username or email address"
@e3 [textbox] "Password"
@e4 [button] "Sign in"
@e5 [link] "Forgot password?"
Step 2: Act on refs
interact_fill({ ref: "@e2", value: "myuser@example.com" })
interact_fill({ ref: "@e3", value: "mypassword" })
interact_click({ ref: "@e4" })
That's it. No CSS selectors. No XPath. No fragile locators. The agent sees a clean map of the page and acts on stable refs.
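To make the two-step flow concrete, here's a minimal sketch of how a snapshot like the one above could be produced from an accessibility tree. The AXNode shape and snapshotFromTree function are illustrative assumptions, not Interact MCP's actual internals:

```typescript
// Sketch: turning an accessibility tree into "@ref [role] name" lines.
// AXNode and snapshotFromTree() are hypothetical, for illustration only.
interface AXNode {
  role: string;
  name?: string;
  children?: AXNode[];
}

// Walk the tree depth-first, assigning a sequential @eN ref to each
// node that has an accessible name and a meaningful role.
function snapshotFromTree(root: AXNode): string[] {
  const lines: string[] = [];
  let counter = 0;
  const visit = (node: AXNode) => {
    if (node.name && node.role !== "generic") {
      counter += 1;
      lines.push(`@e${counter} [${node.role}] "${node.name}"`);
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return lines;
}

// Example tree resembling the sign-in page above.
const tree: AXNode = {
  role: "generic",
  children: [
    { role: "heading", name: "Sign in to GitHub" },
    { role: "textbox", name: "Username or email address" },
    { role: "button", name: "Sign in" },
  ],
};

console.log(snapshotFromTree(tree).join("\n"));
```

The key property: refs are derived from roles and labels, so they survive CSS refactors that would break any selector-based locator.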
The performance problem I also solved
The other issue with existing tools: cold starts. Many browser automation MCP servers spin up a fresh Chromium instance for every request. That's 1-3 seconds of startup before the action itself even runs.
Interact MCP keeps a persistent Chromium instance alive between calls. The browser is always ready. Result: 5-50ms per action instead of 500ms-2000ms.
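The core of the fix is a lazy singleton: launch once, share the instance across every subsequent tool call. A minimal sketch of the pattern, with a stand-in factory so it's self-contained (in the real server the factory would be something like Playwright's chromium.launch()):

```typescript
// Sketch of the persistent-instance pattern, not Interact MCP's code.
// Caching the *promise* rather than the resolved value means concurrent
// callers during startup all share the single in-flight launch.
function lazy<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => (cached ??= factory());
}

// Stand-in for an expensive browser launch; a counter proves it runs once.
let launches = 0;
const getBrowser = lazy(async () => {
  launches += 1;
  return { name: "chromium", id: launches };
});
```

Every tool handler calls getBrowser() and pays the launch cost only on the very first request; after that it's just a resolved promise.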
For an agent running a 20-step task, that's 10-40 seconds of browser overhead versus a tenth of a second to one second. It compounds.
Other things I added
Snapshot diffing — interact_snapshot_diff shows exactly what changed after an action. The agent doesn't need to re-parse the entire page to understand what happened.
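Conceptually the diff is a set comparison over snapshot lines: anything gone from the old snapshot is a removal, anything new is an addition. A simplified sketch (diffSnapshots is illustrative, not the server's real implementation):

```typescript
// Sketch: diff two snapshots so the agent only reads what changed.
// Snapshots are modeled as arrays of "@ref [role] name" lines.
function diffSnapshots(before: string[], after: string[]): string[] {
  const prev = new Set(before);
  const next = new Set(after);
  const changes: string[] = [];
  for (const line of before) {
    if (!next.has(line)) changes.push(`- ${line}`); // disappeared
  }
  for (const line of after) {
    if (!prev.has(line)) changes.push(`+ ${line}`); // appeared
  }
  return changes;
}

const before = ['@e1 [heading] "Sign in to GitHub"', '@e4 [button] "Sign in"'];
const after = ['@e1 [heading] "Sign in to GitHub"', '@e4 [button] "Signing in..."'];
```

After clicking @e4, the agent would see only the two changed lines instead of re-reading the whole page.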
Cookie migration — Import cookies from your real Chrome, Arc, or Brave browser. Your agent can pick up authenticated sessions without you setting up auth flows manually.
Handoff mode — Some sites block headless browsers. When that happens, interact_handoff opens a visible Chrome window so you can solve the CAPTCHA or complete the OAuth flow, then interact_resume hands control back to the agent.
AI-friendly errors — Error messages are written for LLMs, not humans. Instead of ElementNotFound: no element matches .submit-btn, you get Element @e4 not found. The page may have changed. Call interact_snapshot() to get fresh refs. The agent can self-correct without human intervention.
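The pattern behind this is simple: every error carries a concrete recovery action the model can execute. A sketch of what that might look like (the helper name is hypothetical; the wording mirrors the example above):

```typescript
// Sketch: an error message built for an LLM, not a stack trace reader.
// It states what failed, a likely cause, and the next tool call to make.
function staleRefError(ref: string): string {
  return [
    `Element ${ref} not found.`,
    "The page may have changed.",
    "Call interact_snapshot() to get fresh refs.",
  ].join(" ");
}
```

Because the recovery step is spelled out as a tool call, the agent can retry the snapshot-then-act loop on its own instead of surfacing the failure to a human.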
Works with any MCP client
Claude Code, Cursor, Claude Desktop — anything that speaks MCP. Built on Playwright and the MCP SDK. MIT licensed.
{
"mcpServers": {
"interact": {
"command": "npx",
"args": ["-y", "interact-mcp"]
}
}
}
Try it
GitHub: https://github.com/TacosyHorchata/interact-mcp
If you're building AI agents that touch the web, I'd love to hear what you think. What tools are you using for browser automation right now? What keeps breaking?