DEV Community

Ralph van der Horst
Selenium AI Agent 2.3.0: AI-Powered Browser Automation with 74 Tools

What is Selenium AI Agent?

selenium-ai-agent is an MCP (Model Context Protocol) server that gives your AI assistant full control over a real browser.

It is driven by Selenium WebDriver, controlled by your AI. Once installed, your AI can navigate pages, fill forms, click elements, take screenshots, run tests, heal broken locators, and explore your entire app, all from a single prompt.

npm install -g selenium-ai-agent
npx selenium-ai-agent

It works with Claude Desktop, Claude Code, Cursor, Cline, GitHub Copilot (VS Code 1.99+), and Windsurf — any AI client that supports MCP.

{
  "mcpServers": {
    "selenium-mcp": {
      "command": "npx",
      "args": ["selenium-ai-agent"]
    }
  }
}

Requirements: Node.js 18+, Chrome/Firefox/Edge. ChromeDriver is auto-managed, so no manual setup is needed. If you've used Playwright MCP or similar browser automation agents, Selenium AI Agent will feel immediately familiar: it's the same concept and the same prompt-driven workflow, but built on Selenium WebDriver. That means you get the battle-tested cross-browser engine that teams have relied on for years, now supercharged with AI.

Unlike most browser automation MCP servers that run a single local browser, Selenium AI Agent has first-class Selenium Grid support built in. Spin up a full Grid with Docker Compose in one command and run parallel sessions across Chrome and Firefox simultaneously. You can even link it to BrowserStack or any other server-side Grid. Your AI agent can explore multiple URLs at the same time, run cross-browser tests in parallel, and merge results into a single report, all without leaving your prompt.

grid_start  # Launches 4 Chrome + 1 Firefox nodes via Docker Compose

What Can It Do? (74 Tools)

Navigate & Interact

Navigate to URLs, click elements, type text, hover, drag and drop, press keys, upload files, handle browser dialogs, manage tabs — everything a real user can do.

Capture & Verify

Take viewport or full-page screenshots, capture the accessibility tree snapshot, verify elements are visible, verify text on page, wait for conditions, monitor network requests, collect console logs.

Test Planner

Your AI walks through your app, understands its structure, and produces a structured markdown test plan — ready to hand off to the generator.

planner_setup_page → planner_explore_page → planner_save_plan

Test Generator

AI interacts with your app, records actions with element locators, validates selectors against the live DOM, and writes framework-ready test files. A .test-manifest.json is created so the healer knows exactly how to run them later.

Supported: Playwright, WebdriverIO, Selenium Python/Java, Robot Framework, and more; generation is framework- and programming-language-agnostic.

generator_setup_page → [interact] → stop_recording → generator_write_test

Self-Healing Tests

When tests break due to UI changes, the healer pipeline finds the drift and fixes it automatically:

healer_run_tests → healer_inspect_page → healer_fix_test → healer_run_tests

healer_inspect_page compares your expected locators against the live page, spots UI drift, and suggests fixes. healer_fix_test validates selectors before writing — no more broken locators silently committed.
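To make the drift-matching idea concrete, here is a minimal sketch in TypeScript. All of these names and shapes are hypothetical illustrations, not the agent's real API: the core fallback is that when a recorded selector no longer exists, the most drift-resistant signal left is the element's role plus accessible name.

```typescript
// Hypothetical shapes; the real healer's data model is not shown in this post.
interface LiveElement {
  selector: string; // current CSS selector on the live page
  role: string;     // computed ARIA role
  name: string;     // accessible name
}

interface ExpectedLocator {
  selector: string; // selector recorded at generation time
  role: string;
  name: string;
}

// Suggest a replacement selector when the recorded one no longer matches:
// first try the exact selector, then fall back to role + accessible name.
function suggestFix(expected: ExpectedLocator, live: LiveElement[]): string | null {
  const exact = live.find(el => el.selector === expected.selector);
  if (exact) return exact.selector; // no drift detected
  const byRoleAndName = live.find(
    el => el.role === expected.role && el.name === expected.name,
  );
  return byRoleAndName ? byRoleAndName.selector : null; // null → needs review
}
```

For example, if a button's id changed from `#submit` to `#submit-v2` but its role and label stayed the same, the role-and-name fallback would still locate it.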

Selenium Grid — Parallel Execution

First-class Grid support with Docker Compose. Explore multiple URLs simultaneously or run the same test on Chrome and Firefox at the same time. The exploration_merge tool deduplicates results and builds a unified map of your app across all sessions.
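The merge step can be pictured with a small TypeScript sketch. The shapes and names below are illustrative only, not the `exploration_merge` tool's real schema: results from parallel sessions are keyed by URL, and elements seen by more than one browser are deduplicated.

```typescript
// Hypothetical result shape for one Grid session's exploration.
interface SessionResult {
  browser: string; // e.g. "chrome", "firefox"
  pages: { url: string; elements: string[] }[];
}

// Merge per-browser exploration results into one map keyed by URL,
// deduplicating elements that multiple sessions discovered.
function mergeExplorations(sessions: SessionResult[]): Map<string, Set<string>> {
  const merged = new Map<string, Set<string>>();
  for (const session of sessions) {
    for (const page of session.pages) {
      const seen = merged.get(page.url) ?? new Set<string>();
      page.elements.forEach(el => seen.add(el));
      merged.set(page.url, seen);
    }
  }
  return merged;
}
```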

BiDi Cross-Browser Features

Full-page and element screenshots, PDF generation across Chrome, Firefox, and Edge, real-time console events via BiDi LogInspector, network request monitoring, and stealth mode via BiDi preload script injection.

What's New in 2.3.0 — Accessibility Tree Discovery

The biggest change in this release: how the agent sees a page.

Before v2.3.0, capturing a page returned a flat list of interactive elements:

Interactive Elements:
  [e1] button: Play
  [e2] a: English 7,141,000+ articles
  [e3] a: 日本語 1,491,000+ articles
  ...

No hierarchy. No context. A lot of noise. On a complex page, 100+ elements would fill the entire budget — leaving the AI blind to other parts of the page. A nav link and a skip link looked identical. The agent had no idea what region of the page an element belonged to.

Now it looks like this:

- main [e21]
  - button "Play" [e1]
  - heading "Wikipedia — 25 years..." [level=1] [e2]
  - navigation "Main menu" [e3]
    - link "Contents" [e4]
    - link "Current events" [e5]
  - search [e9]
    - searchbox "Search Wikipedia" [e6]
    - button "Search" [e7]

The agent now understands where things are and what they mean — not just that they exist.

The new discovery engine walks the DOM recursively from `<body>`, computes the ARIA role for each element using the implicit role map, resolves accessible names following the W3C algorithm (`aria-label` → `aria-labelledby` → `alt` → text content), and prunes non-semantic wrappers: `<div>` and `<span>` with no role are collapsed, promoting their meaningful children up the tree. Refs are only assigned to nodes that have both a role AND a name (or are structural landmarks), keeping the list tight.
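The walk-and-prune idea can be sketched in a few lines of TypeScript. This operates on a plain-object tree rather than a real DOM, and the role map and name resolution are heavily simplified illustrations, not the agent's actual code:

```typescript
// Simplified node shape standing in for a DOM element.
interface RawNode {
  tag: string;
  ariaLabel?: string;
  text?: string;
  children: RawNode[];
}

interface A11yNode { role: string; name: string; children: A11yNode[] }

// Tiny subset of the implicit ARIA role map.
const IMPLICIT_ROLE: Record<string, string> = {
  button: "button", a: "link", nav: "navigation", main: "main", h1: "heading",
};

// Simplified accessible-name resolution: aria-label first, then text content.
const accName = (n: RawNode): string => n.ariaLabel ?? n.text ?? "";

// Walk recursively; role-less wrappers (e.g. bare div/span) are collapsed,
// promoting their meaningful children up the tree.
function discover(node: RawNode): A11yNode[] {
  const childResults = node.children.flatMap(discover);
  const role = IMPLICIT_ROLE[node.tag];
  if (!role) return childResults; // non-semantic wrapper: promote children
  return [{ role, name: accName(node), children: childResults }];
}
```

A `<div><div><button>Play</button></div></div>` subtree, for instance, collapses to a single `button "Play"` node because both wrappers carry no role.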

On a real Wikipedia page: 100 elements → 46, a 54% noise reduction — with full structural context preserved.

Also in 2.3.0

New scroll_page tool — directional scrolling (up/down/left/right) with configurable pixel amounts, plus scroll-into-view by CSS selector. Previously scrolling was jammed into other tools; now it's a first-class citizen.

Element ref budget doubled — from 100 to 200 refs. Combined with smarter pruning, agents can now navigate dense pages without hitting the ceiling.

ElementInfo now carries role and level fields, and discoverElements() returns { elements, tree } instead of just a flat map.

Bug fixes — deeply nested non-semantic wrappers are now correctly flattened, element resolution order fixed in drag/hover tools, trailing comma consistency across tool schema definitions.

Full Changelog

New: Accessibility tree discovery with ARIA role computation · Hierarchical snapshot output · AccessibilityNode type · formatAccessibilityTree() utility · ScrollPageTool

Changed: discoverElements() returns { elements, tree } · PageSnapshot includes tree: AccessibilityNode · Ref budget 100→200 · Improved text-matching fallback · Grid session and exploration coordinator updated for new discovery API

Fixed: Recursive flattening of __promote nodes for deeply nested non-semantic wrappers · Trailing commas in tool schemas · Ref resolution order in drag/hover tools

It's still in beta, but it's open source. If you'd like to help with the project, contributions are very welcome!

📦 npm · 🐙 GitHub · 📖 Full release notes
