Nomadev for CAMEL-AI

Posted on Oct 1

What Building a Hybrid Browser Toolkit Taught Us About the Web

#webdev #javascript #ai #automation

If you’ve ever tried browser automation, you know the drill:
You spin up Selenium, Playwright, or Puppeteer, point it at a page, and suddenly you’re wrestling with flaky selectors, weird screenshots, or the dreaded “element not found” even though it’s right there.

I’ve been there. It feels like teaching a robot to surf the web by giving it a pair of oven mitts. Sure, it clicks and scrolls, but half the time it’s guessing.

At CAMEL-AI, we ran into this wall too. Our original Camel BrowserToolkit was a first attempt at solving it. It did the basics — take screenshots, inject custom IDs, and click things. But it was… let’s say, not elegant. It worked more like asking an AI to click on pictures instead of actually understanding the page.

That got us thinking:
What if the toolkit could “see” the page like a human and understand the structure like a dev?

From Monolith to Hybrid

The big shift came when we re-architected things. Instead of one heavy Python process, we now have a Hybrid setup using Python and TypeScript.

Python is still your scripting layer. That means you can write automation in a language most of us are comfortable with.
TypeScript is the engine under the hood. It runs Playwright natively, handles async operations, and talks directly to the browser.

The two communicate over WebSockets. So Python gives high-level commands, while TypeScript executes them efficiently.
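To make the split concrete, here is a minimal sketch of what a command envelope between the two layers could look like. The post doesn't describe the actual wire format, so the field names (`id`, `action`, `params`, `ok`) and both helper functions are assumptions for illustration only; the one idea it shows is that each request carries an ID so the Python side can pair replies with the commands that caused them.

```python
import itertools
import json

# Hypothetical message shape: the real toolkit's protocol may differ.
_next_id = itertools.count(1)


def make_command(action: str, **params) -> str:
    """Serialize one high-level command as a JSON text frame."""
    return json.dumps({"id": next(_next_id), "action": action, "params": params})


def match_response(sent: str, received: str) -> dict:
    """Return the decoded reply if its id matches the request, else raise."""
    request = json.loads(sent)
    response = json.loads(received)
    if response.get("id") != request["id"]:
        raise ValueError(
            f"reply id {response.get('id')!r} does not match request id {request['id']!r}"
        )
    return response


request = make_command("click", ref="5")
# A made-up reply of the kind the TypeScript side might send back.
reply = json.dumps({"id": json.loads(request)["id"], "ok": True, "result": "Clicked"})
print(match_response(request, reply)["result"])
```

In a real client these strings would travel over a WebSocket connection; the sketch leaves the transport out so it runs with the standard library alone.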

Introducing the CAMEL Hybrid Browser Toolkit

Enter the Hybrid Browser Toolkit. We've rebuilt the toolkit from the ground up as a TypeScript–Python hybrid. In this new design, TypeScript (running on Node.js) handles the browser directly via Playwright's fast native APIs, and Python remains your friendly front-end interface.

What does that buy you? Faster performance, access to all the latest Playwright features (like the new _snapshotForAI), and true async event-driven power – without sacrificing the ease of Python scripting.

The result is a layered architecture: your Python code talks to a TypeScript server over WebSockets. The TypeScript layer manages browser instances, DOM queries, screenshots, etc., all in the same high-performance JavaScript environment. Python just sends commands and gets structured results.

This split means lower latency and better concurrency. As one example, Node's Playwright doesn't spawn a fresh process for every browser window like the Python version did, so it can manage many tabs with far less CPU and memory overhead.

In short, Python becomes the brain giving high-level instructions, and TypeScript is the muscle doing the work efficiently.

What's Different Under the Hood

In the legacy toolkit, every action that needed to find or click an element typically involved injecting a random ID into the page via a script, then querying it. That worked, but it felt hacky.

In the hybrid toolkit, we leverage standard accessibility (ARIA) selectors and Playwright's new tools. Now you can do things like:

await page.locator('[aria-label="Submit"]').click();
await page.getByRole('button', { name: 'Submit' }).click();
const snapshot = await page._snapshotForAI();
// snapshot now has structured data on all elements and their ARIA roles

Playwright's _snapshotForAI() (an internal API) lets us get a rich DOM snapshot: every interactive element, its role (like button, link, textbox), labels, etc. We assign each element a ref ID and use those for all interactions. This replaces the old random-ID trick with a semantic mapping.

It also means the same snapshot data fuels both text mode and the visual "set-of-marks" screenshots.

Set-of-Marks Screenshots

Speaking of screenshots, the new toolkit's SoM (Set-of-Marks) screenshots are crisp and clever. We inject a small script into the page that outlines every clickable element with a little numbered marker (their ref ID).

This isn't just a dumb screenshot – it knows about element overlap and tries not to mark hidden elements. If a button has an icon and text, it merges them into one mark. It even picks good positions for labels so they don't scribble over each other. (This injection-based approach in the browser is more reliable than our old memory-only screenshots.)
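The "merge icon and text into one mark" idea can be sketched with plain bounding boxes. This is not the toolkit's injected script (that runs in the browser); it is a small Python illustration under the assumption that each candidate mark has a rectangle, and that any mark sitting fully inside another one should be folded into its container. The `Mark` class and `merge_nested_marks` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    ref: str
    x: float
    y: float
    width: float
    height: float


def contains(outer: Mark, inner: Mark) -> bool:
    """True if inner's rectangle lies entirely within outer's rectangle."""
    return (
        outer.x <= inner.x
        and outer.y <= inner.y
        and inner.x + inner.width <= outer.x + outer.width
        and inner.y + inner.height <= outer.y + outer.height
    )


def merge_nested_marks(marks: list[Mark]) -> list[Mark]:
    """Keep only marks that are not strictly nested inside another mark.

    Two marks with identical rectangles are both kept in this sketch.
    """
    kept = []
    for mark in marks:
        nested = any(
            other is not mark and contains(other, mark) and not contains(mark, other)
            for other in marks
        )
        if not nested:
            kept.append(mark)
    return kept


marks = [
    Mark("1", 0, 0, 100, 40),    # a button
    Mark("2", 5, 5, 20, 20),     # icon inside the button
    Mark("3", 30, 10, 60, 20),   # text inside the button
    Mark("4", 200, 0, 50, 20),   # an unrelated link
]
print([m.ref for m in merge_nested_marks(marks)])
```

A production version would also need visibility checks and label placement, which this sketch deliberately skips.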

Enhanced Stealth Mode

We've also beefed up stealth mode. By default, Playwright can be detected by many sites; indeed, "stock" Playwright is often blocked by modern anti-bot measures.

The new toolkit launches browsers with a full suite of anti-detection flags, customizable user agents, headers, etc. You can tweak a StealthConfig object to set exactly which flags or headers to use. And we maintain this even across persistent contexts or CDP connections.

The bottom line: you get a much more human-like browser fingerprint without extra work.

Memory-Efficient Screenshots

Other small but nice improvements include how we handle screenshots and images. In the old toolkit, screenshots were held entirely in memory and passed around as objects. Now we save screenshots to disk and only pass around file paths.

This keeps memory usage low, especially when you take many screenshots in a run. The agent can still request the image (and even run vision-based analysis on it), but the heavy data lives on disk.
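The pattern is simple enough to sketch in a few lines: write the bytes once, then hand around a small dictionary that only holds the path. The helper names below are hypothetical and the bytes are a stand-in for a real capture; only the `screenshotPath` key is taken from the toolkit's results as described in this post.

```python
import tempfile
from pathlib import Path


def save_screenshot(png_bytes: bytes, directory: Path, name: str) -> Path:
    """Write the image to disk and return where it landed."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.png"
    path.write_bytes(png_bytes)
    return path


def build_result(message: str, path: Path) -> dict:
    """Return a lightweight result that references the file instead of embedding it."""
    return {"result": message, "screenshotPath": str(path)}


# Stand-in bytes (just the PNG file signature), not a real screenshot.
fake_png = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_screenshot(fake_png, Path(tmp) / "screenshots", "page123_som")
    result = build_result("Screenshot captured", saved)
    print(result["screenshotPath"].endswith("page123_som.png"))
```

Whoever needs the pixels (a vision model, a human reviewer) opens the file on demand, so a run with hundreds of screenshots never holds them all in memory at once.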

Smarter Form Filling

We also made form-filling smarter. You can now send multiple inputs in one command, and the toolkit will try to find the right input fields (even if you accidentally point at a container).

It watches for dropdowns appearing after you type and will return just the new options (a "diff" snapshot), so you don't get overwhelmed by the whole page again. If something goes wrong, the tool tries simple recovery steps too.
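To give a feel for the "diff" idea, here is a rough sketch that compares two text snapshots line by line and keeps only what is new. The toolkit computes its diff on the TypeScript side and may well do it differently; `diff_snapshot` is a hypothetical helper, and the snapshot lines follow the `- role "name" [ref=N]` format used throughout this post.

```python
def diff_snapshot(before: str, after: str) -> str:
    """Return the lines of `after` that did not appear in `before`, in order."""
    seen = set(before.splitlines())
    new_lines = [line for line in after.splitlines() if line not in seen]
    return "\n".join(new_lines)


before = '- textbox "City" [ref=3]\n- button "Search" [ref=4]'
after = (
    before
    + '\n- option "San Francisco" [ref=23]'
    + '\n- option "San Diego" [ref=24]'
)
print(diff_snapshot(before, after))
```

The payoff is in token count: an agent that just typed three letters gets two short lines back instead of the whole page.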

Key Features at a Glance

Multi-Mode Operation: The toolkit has three modes:

Text Mode: DOM-based automation, returning textual snapshots of element lists.
Visual Mode: Screenshot-based, with interactive elements highlighted.
Hybrid Mode: Smart switching between text and visual as needed.

TypeScript Core: All browser work is done in a Node.js/TypeScript server. That means native Playwright calls (no bridging) and full async/await support. We get TypeScript's compile-time checks and the latest APIs instantly.

Better Element Handling: Use real ARIA selectors and Playwright locators instead of injected IDs. E.g. click by aria-label or role. Plus, _snapshotForAI returns structured data with semantic roles.

Instant Snapshots: Every action (click/type/etc.) that changes the page returns an updated snapshot by default, so you see the new state immediately in text mode.

Advanced Screenshot (SoM): Annotated screenshots with numbered marks for each element. Optionally, an AI can analyze the image (like "find all sign-up buttons").

Intelligent Typing: Typing into fields automatically detects dropdowns (autocomplete) and only returns the new suggestions (diff snapshot). If you point to a container, it will find the actual input inside and type there.

Powerful Stealth: Multiple Chrome flags, custom user agent/headers, persistent context, etc., to reduce bot detection. (After all, many sites try to fingerprint automation.)

Flexible Connections: You can launch a fresh browser via Playwright, attach to an existing Chrome/Edge via CDP (Chrome DevTools Protocol), or even hook into an AI agent via the Model Context Protocol (MCP).

Tool Registry: The toolkit neatly separates "tools" (actions) from the core. Screenshots go to files, not memory, so you can handle them in custom agents or pipelines without huge overhead.

Try It: Session & Navigation Tools

Let's see some examples. First, create a toolkit instance and open the browser:

from camel.toolkits import HybridBrowserToolkit

# Launch a real browser (non-headless for debugging)
toolkit = HybridBrowserToolkit(headless=False)
result = await toolkit.browser_open()

print(result['result'])    # "Browser opened."
print(f"Tabs: {result['total_tabs']}, Active: {result['current_tab']}")
print("Initial Snapshot:", result['snapshot'])

Your first call must be browser_open(). That spins up Chromium/Chrome/Edge and returns a snapshot of whatever the default page is (typically about:blank or your start URL). You'll get something like:

Browser opened.
Tabs: 1, Active: 0
Initial Snapshot:
- link "Get Started" [ref=1]
- link "Documentation" [ref=2]
- link "GitHub" [ref=3]
- ...

Now navigation:

# Open a new tab and navigate to example.com
result = await toolkit.browser_visit_page("https://example.com")
print(f"Visiting example.com: {result['result']}")
print("Snapshot:", result['snapshot'])
print(f"Tabs now: {result['total_tabs']}, Active: {result['current_tab']}")

# Go back and forward
await toolkit.browser_back()      # go back in history
await toolkit.browser_forward()   # then forward again

browser_visit_page(url) opens the URL in a new tab and switches to it. Each call makes a new tab.

browser_back() and browser_forward() move in the history of the current tab. They both return the updated page snapshot and tab info.

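For example, after visiting a couple of pages (keep in mind that each visit opens its own tab, so browser_back() only moves within the current tab's history):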

await toolkit.browser_visit_page("https://example.com")
await toolkit.browser_visit_page("https://example.com/about")
result = await toolkit.browser_back()
print(f"Back: {result['result']}, now at {result['snapshot']}")

Page Inspection Tools

To see what's on the page without doing anything, use:

snapshot = await toolkit.browser_get_page_snapshot()
print(snapshot)

This returns a textual list of all interactive elements in the current tab (links, buttons, inputs, etc.), each with a [ref=id]. By default it lists the full page, but you can initialize with viewport_limit=True to only see elements visible on screen. E.g.:

- link "Home" [ref=1]
- button "Sign In" [ref=2]
- textbox "Search..." [ref=3]
- link "Products" [ref=4]
- ...
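Because the snapshot is plain text, you can turn it into structured data yourself and look refs up by role and name instead of hardcoding them. The sketch below assumes every element line follows the `- role "name" [ref=N]` shape shown above; real snapshots may contain other line shapes, which this parser simply skips. `parse_snapshot` and `find_ref` are hypothetical helpers, not toolkit functions.

```python
import re
from typing import Optional

# Matches lines such as: - button "Sign In" [ref=2]
LINE_RE = re.compile(r'^\s*-\s+(?P<role>\w+)\s+"(?P<name>[^"]*)"\s+\[ref=(?P<ref>[^\]]+)\]')


def parse_snapshot(snapshot: str) -> list[dict]:
    """Return one dict (role, name, ref) per recognizable element line."""
    elements = []
    for line in snapshot.splitlines():
        match = LINE_RE.match(line)
        if match:
            elements.append(match.groupdict())
    return elements


def find_ref(snapshot: str, role: str, name: str) -> Optional[str]:
    """Return the ref of the first element with this role and exact name, if any."""
    for element in parse_snapshot(snapshot):
        if element["role"] == role and element["name"] == name:
            return element["ref"]
    return None


sample = """- link "Home" [ref=1]
- button "Sign In" [ref=2]
- textbox "Search..." [ref=3]
- link "Products" [ref=4]
- ..."""

print(find_ref(sample, "button", "Sign In"))
```

The ref returned here is what you would pass to a call like browser_click(ref=...).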

For a visual view, try:

result = await toolkit.browser_get_som_screenshot()
print(result['result'])
# e.g. "Screenshot captured with 12 interactive elements (saved to: ./screenshots/page123_som.png)"

This takes a screenshot of the page and marks every interactive element. You can also ask the toolkit to analyze it with an AI, e.g.:

result = await toolkit.browser_get_som_screenshot(
    read_image=True, 
    instruction="Find all buttons for submitting forms"
)
print(result['result'])
# e.g. "Screenshot captured... Agent analysis: Found 3 form buttons: [ref=5], [ref=9], [ref=12]"

Behind the scenes, it saved an image file and ran an agent (if requested) to look at it. The raw image path is in result['screenshotPath'] if you need it.
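If you want to act on the analysis programmatically, you can pull the refs out of the result text. This is a small sketch that assumes the analysis mentions elements in the `[ref=N]` form shown in the example output above; the wording of a real analysis is free text, so treat the result as a best effort. `extract_refs` is a hypothetical helper.

```python
import re

REF_RE = re.compile(r"\[ref=([^\]]+)\]")


def extract_refs(text: str) -> list[str]:
    """Return every ref mentioned in the text, in order, without duplicates."""
    return list(dict.fromkeys(REF_RE.findall(text)))


analysis = (
    "Screenshot captured... Agent analysis: "
    "Found 3 form buttons: [ref=5], [ref=9], [ref=12]"
)
print(extract_refs(analysis))
```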

To inspect tabs, use:

tab_info = await toolkit.browser_get_tab_info()
print(f"Total tabs: {tab_info['total_tabs']}")
for tab in tab_info['tabs']:
    status = " (current)" if tab['is_current'] else ""
    print(f"- {tab['title']} @ {tab['url']}{status}")

Each entry carries the tab's ID, title, and URL, which is handy when you need to pick a tab to switch to:

# Switch to tab by ID (the 'id' field from tab_info)
await toolkit.browser_switch_tab(tab_id=some_tab_id)
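Picking the tab is usually a lookup by URL or title. Here is a minimal sketch of that lookup, assuming tab_info has the fields used above (`tabs`, each with `id`, `title`, `url`, and `is_current`). The sample data and the `find_tab_id` helper are made up for illustration.

```python
from typing import Optional


def find_tab_id(tab_info: dict, url_part: str) -> Optional[str]:
    """Return the id of the first tab whose URL contains url_part, or None."""
    for tab in tab_info["tabs"]:
        if url_part in tab["url"]:
            return tab["id"]
    return None


# Made-up data in the shape described above.
sample_tab_info = {
    "total_tabs": 2,
    "tabs": [
        {"id": "tab-001", "title": "Example", "url": "https://example.com", "is_current": False},
        {"id": "tab-002", "title": "About", "url": "https://example.com/about", "is_current": True},
    ],
}

print(find_tab_id(sample_tab_info, "/about"))
```

The returned ID is what you would hand to browser_switch_tab(tab_id=...); handle the None case before calling it.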

Interaction Tools

Now for real interactions:

Click an Element

Click an element by its ref:

result = await toolkit.browser_click(ref="5")
print(result['result'])   # e.g. "Clicked on button 'Submit'"

If the click opened a new tab, result will include newTabId, and current_tab/total_tabs will update accordingly. You can then browser_switch_tab to it.

Type into Input Fields

Type into an input:

# Single input
await toolkit.browser_type(ref="3", text="hello world")

If the element with ref=3 triggers an autocomplete dropdown, the toolkit will detect it. Instead of returning the full page again, it gives you result['diffSnapshot'] containing just the new options (this is the "intelligent dropdown detection"). For example, typing "San" might return:

- option "San Francisco" [ref=23]
- option "San Diego" [ref=24]
- option "San Antonio" [ref=25]

Then you can click one of those by ref. If you have multiple fields to fill, just pass a list:

inputs = [
    {'ref': '3', 'text': 'John'},
    {'ref': '4', 'text': 'Doe'},
    {'ref': '5', 'text': 'john.doe@example.com'}
]
result = await toolkit.browser_type(inputs=inputs)
print(result['details'])  # shows success/failure per field

Select Dropdowns

Select (for <select> dropdowns):

await toolkit.browser_select(ref="country-select", value="US")

You must provide the option's value attribute, not visible text. (If needed, you can browser_get_page_snapshot() first to see element refs.)

Enter Key

Enter key (submit form etc.):

await toolkit.browser_enter()

This simulates pressing Enter in the currently focused field. It's handy after typing search terms.

Scroll

Scroll the page:

await toolkit.browser_scroll(direction="down", amount=600)

Use "up" or "down", with optional pixel amount. It returns the new snapshot. You can loop scrolls to load more content:

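import asyncio

prev = ""
while True:
    res = await toolkit.browser_scroll(direction="down", amount=800)
    if res['snapshot'] == prev:
        break  # snapshot unchanged, so no new content loaded
    prev = res['snapshot']
    await asyncio.sleep(1)  # give lazy-loaded content time to appear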
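On pages with truly endless feeds, a loop like this never stops on its own, so it can be worth adding an upper bound. Below is a sketch of a bounded variant. It takes any async callable that performs one scroll and returns the new snapshot, which lets it run here against a fake scroller instead of a live browser; `scroll_until_stable` and `make_fake_scroller` are hypothetical helpers.

```python
import asyncio
from typing import Awaitable, Callable


async def scroll_until_stable(
    scroll_once: Callable[[], Awaitable[str]],
    max_rounds: int = 20,
    pause_sec: float = 0.0,
) -> int:
    """Scroll until the snapshot stops changing or max_rounds is reached.

    Returns the number of scroll calls that were made.
    """
    previous = None
    for rounds in range(1, max_rounds + 1):
        snapshot = await scroll_once()
        if snapshot == previous:
            return rounds  # nothing new appeared on this scroll
        previous = snapshot
        await asyncio.sleep(pause_sec)
    return max_rounds


def make_fake_scroller(pages: list[str]) -> Callable[[], Awaitable[str]]:
    """Build a stand-in scroller that replays snapshots, repeating the last one."""
    state = {"calls": 0}

    async def scroll_once() -> str:
        snapshot = pages[min(state["calls"], len(pages) - 1)]
        state["calls"] += 1
        return snapshot

    return scroll_once


calls = asyncio.run(scroll_until_stable(make_fake_scroller(["a", "a\nb", "a\nb\nc"])))
print(calls)
```

With the real toolkit, scroll_once would be a small async function that awaits toolkit.browser_scroll(direction="down", amount=800) and returns the 'snapshot' entry of the result.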

Mouse Control

Mouse control by coordinates:

await toolkit.browser_mouse_control(control="click", x=350.5, y=200)
await toolkit.browser_mouse_control(control="dblclick", x=123.4, y=456.7)
await toolkit.browser_mouse_control(control="right_click", x=400, y=300)

Useful for canvas or image-map interactions.

Drag and Drop

Mouse drag-and-drop:

await toolkit.browser_mouse_drag(from_ref="item-5", to_ref="trash-bin")

Drag the element with ref="item-5" onto ref="trash-bin". Handy for reordering or file moves in web UIs.

Press Keys

Press keys/combinations:

await toolkit.browser_press_key(keys=["Tab"])
await toolkit.browser_press_key(keys=["Control+a"])  # select all
await toolkit.browser_press_key(keys=["Alt+Left"])   # back in history
await toolkit.browser_press_key(keys=["F5"])         # refresh

Send any key or combo. The toolkit uses Playwright's key syntax.

Tab Management

Working with multiple tabs is easy:

Switch Tab

Switch tab by ID (from browser_get_tab_info):

await toolkit.browser_switch_tab(tab_id=some_tab_id)

This activates that tab and returns its snapshot.

Close Tab

Close a tab:

await toolkit.browser_close_tab(tab_id=some_tab_id)

After closing, it returns info on the remaining tabs.

You can, for instance, close every tab except the current one by iterating through them:

tab_info = await toolkit.browser_get_tab_info()
for tab in tab_info['tabs']:
    if not tab['is_current']:
        await toolkit.browser_close_tab(tab_id=tab['id'])

Console Commands

You can execute arbitrary JavaScript on the page:

result = await toolkit.browser_console_exec("return window.location.href")
print("Current URL:", result['result'])

And view console logs:

logs = await toolkit.browser_console_view()
for msg in logs['console_messages']:
    print(f"[{msg['type']}] {msg['text']}")

Advanced & Utility

Wait for Manual Step

Sometimes you need a human in the loop (e.g. to solve a CAPTCHA). Use:

res = await toolkit.browser_wait_user(timeout_sec=60)
if "completed" in res['result']:
    print("User resumed, snapshot after:")
    print(res['snapshot'])
else:
    print("Wait timed out.")

This pauses execution and shows the last snapshot. When the user presses Enter (or timeout), it returns control.

Combine It All

Here's a mini example putting a few tools together:

toolkit = HybridBrowserToolkit(headless=False)
try:
    await toolkit.browser_open()
    await toolkit.browser_visit_page("https://example.com")
    # Look for a product link and click it
    snap = await toolkit.browser_get_page_snapshot()
    # Suppose ref=7 is "Products"
    await toolkit.browser_click(ref="7")
    # Now add to cart and checkout
    await toolkit.browser_click(ref="add-to-cart")
    await toolkit.browser_click(ref="checkout")
    # Fill checkout form
    inputs = [
        {'ref': 'name', 'text': 'Alice'},
        {'ref': 'email', 'text': 'alice@example.com'},
        {'ref': 'address', 'text': '1 Developer Way'}
    ]
    await toolkit.browser_type(inputs=inputs)
    await toolkit.browser_select(ref="shipping", value="standard")
    await toolkit.browser_console_exec("return document.querySelector('form').checkValidity()")
    await toolkit.browser_click(ref="place-order")
finally:
    await toolkit.browser_close()

This was just a taste. The Hybrid Browser Toolkit provides all the basic navigation and interaction tools you'd expect, plus some powerful extras (like smart screenshots and AI-assisted analysis) to help you automate complex tasks smoothly.

Operating Modes: Text vs. Visual vs. Hybrid

Text Mode is the default: every action returns a text snapshot. It's lightweight and great for pure data tasks (like scraping or filling forms). Each element is listed with a [ref=ID] and a label. If you initialize with full_visual_mode=True, then actions don't auto-return snapshots (fast mode); you can still call browser_get_page_snapshot() manually when you need it.

Visual Mode uses screenshots. The browser_get_som_screenshot() tool we saw is the core of this mode. It's ideal for verifying layouts, catching visual glitches, or when a human needs to see something. You'll often toggle visual mode on when you need to confirm that a button is visible, or to show the agent exactly what's on screen.

Hybrid Mode is smart: it uses text mode by default, but seamlessly takes and interprets screenshots when needed (or as requested). For example, you might click through forms in text mode, then do one final screenshot with AI analysis to "spot check" the result.

A good rule of thumb:

Use Text Mode for most automation (fast, headless, easy parsing).
Switch to Visual Mode when you need the UI context (e.g. for CAPTCHAs, complex UIs, or human verification).
Combine Both as needed. E.g., click by refs in text mode, then verify with a screenshot.

Connection Modes: Playwright vs CDP vs MCP

Finally, how do we connect to the browser?

Standard Playwright (default)

The toolkit launches and manages its own browser instance. Just HybridBrowserToolkit() and call browser_open(). You can set headless=True/False, user_data_dir for persistence, timeouts, etc. Use this when you just want an isolated browser.

Chrome DevTools Protocol (CDP)

This lets you attach to an already running browser (Chrome/Edge/Chromium) that was started with --remote-debugging-port. For example, start Chrome manually:

google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-profile

Then in Python:

import requests
resp = requests.get('http://localhost:9222/json/version')
ws = resp.json()['webSocketDebuggerUrl']

toolkit_cdp = HybridBrowserToolkit(cdp_url=ws)
# No need to call browser_open(); it's already running
tab_info = await toolkit_cdp.browser_get_tab_info()
print(f"Connected to {tab_info['total_tabs']} tabs")

CDP is the same protocol Chrome DevTools uses to talk to the browser (documented at chromedevtools.github.io), so any Chromium-based browser with remote debugging enabled can be controlled. You can even set cdp_keep_current_page=True to make the toolkit use the current page instead of opening a new one.
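If you'd rather not depend on requests, the same lookup works with the standard library, and it helps to fail with a clear message when the debugging port isn't answering as expected. This is a sketch under those assumptions: `webSocketDebuggerUrl` is the field Chrome returns from /json/version, while the function names and the sample payload are made up for illustration.

```python
import json
import urllib.request


def ws_url_from_payload(payload: str) -> str:
    """Extract the browser-level WebSocket URL from a /json/version response body."""
    data = json.loads(payload)
    url = data.get("webSocketDebuggerUrl")
    if not url:
        raise ValueError(
            "no webSocketDebuggerUrl in response; "
            "was the browser started with --remote-debugging-port?"
        )
    return url


def fetch_ws_url(host: str = "localhost", port: int = 9222, timeout: float = 3.0) -> str:
    """Ask a running browser for its debugger URL (requires the port to be open)."""
    with urllib.request.urlopen(f"http://{host}:{port}/json/version", timeout=timeout) as resp:
        return ws_url_from_payload(resp.read().decode("utf-8"))


# Made-up payload so the parsing step can be tried without a running browser.
sample_payload = json.dumps(
    {
        "Browser": "Chrome/0.0.0.0",
        "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc123",
    }
)
print(ws_url_from_payload(sample_payload))
```

The string returned by fetch_ws_url() is what you would pass as cdp_url.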

MCP (Model Context Protocol)

This is for connecting the toolkit to an AI assistant (like Claude) so the AI can call these browser tools as if they were native functions. Here's how to set it up:

1. Install the MCP Server

git clone https://github.com/camel-ai/browser_agent.git
cd browser_agent
pip install -e .

2. Configure Claude Desktop

Add to your Claude configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "hybrid-browser": {
      "command": "python",
      "args": ["-m", "hybrid_browser_mcp.server"]
    }
  }
}

3. Restart Claude Desktop

After adding the configuration, completely restart Claude Desktop. The browser tools will appear when you click the 🔌 icon in the chat interface.
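If your Claude configuration file already lists other MCP servers, hand-editing JSON is an easy place to drop a comma or overwrite an entry. The configuration step above can be scripted; here is a sketch that merges the hybrid-browser entry into an existing file and leaves everything else alone. `add_mcp_server` is a hypothetical helper, and the demo writes to a temporary directory rather than your real config.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or update one entry under "mcpServers", preserving all other settings."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "claude_desktop_config.json"
    # Pretend another server is already configured.
    demo_path.write_text(
        json.dumps({"mcpServers": {"other": {"command": "node", "args": []}}}),
        encoding="utf-8",
    )
    merged = add_mcp_server(
        demo_path, "hybrid-browser", "python", ["-m", "hybrid_browser_mcp.server"]
    )
    print(sorted(merged["mcpServers"]))
```

Point config_path at the macOS or Windows location listed above once you're happy with the result, and keep a backup of the original file.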

Available Browser Tools

Once connected, you'll have access to:

Navigation: browser_open, browser_visit_page, browser_back, browser_forward
Interaction: browser_click, browser_type, browser_select, browser_scroll
Screenshots: browser_get_som_screenshot (captures page with clickable elements marked)
Tab Management: browser_switch_tab, browser_close_tab
Advanced: browser_console_exec, browser_mouse_control

Basic Usage Example

# Claude can now control browsers with simple commands:
await browser_open()
await browser_visit_page("https://example.com")
await browser_type(ref="search", text="AI automation")
await browser_click(ref="submit-button")
await browser_get_som_screenshot()
await browser_close()

Customization

Modify browser behavior in browser_agent/config.py:

BROWSER_CONFIG = {
    "headless": False,    # Show browser window
    "stealth": True,      # Avoid bot detection
    "enabled_tools": [...] # Specify which tools to enable
}

Closing Thoughts

In summary, the Hybrid Browser Toolkit is a major upgrade over the old screenshot-only BrowserToolkit. We still give you a friendly Python API to work with, but under the hood we're speaking the browser's native language via TypeScript.

That means faster, more reliable interactions and access to shiny new features like Playwright's accessibility snapshots. Whether you need lightning-fast DOM scraping or human-like visual checks (or both!), this toolkit handles it.

It also plays well with modern workflows. Want to connect to an existing Chrome? No problem (thanks to CDP). Want your AI agent to browse the web? Check out MCP integration.

From practical navigation (click, type, scroll) to advanced tricks (Set-of-Marks screenshots, smart autocomplete typing, multi-tab management), everything's here.

Give it a spin, and let us know what you build with it. Welcome to the new era of browser automation with CAMEL's Hybrid Browser Toolkit: it's like taking off those oven mitts and driving with all the precision you wanted, at full speed.

Happy automating!
