Jon Retting
vscreen deep dive: how 63 MCP tools let AI agents actually use the internet

My first post introduced vscreen — a Rust service that gives AI agents a real Chromium browser, streamed live over WebRTC. A few people asked for more detail on what the AI can actually do with it. So here's the breakdown, starting with the stuff I haven't seen anywhere else.

What makes this different

Screenshot history and session memory

AI agents forget. They take a screenshot, act, and the previous state is gone. vscreen keeps a ring buffer of the last 20 screenshots with full metadata. The AI can call vscreen_history_get(3) and see exactly what the page looked like 3 actions ago. "Wait, what was on that page before I clicked?" — now it can check.
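The history mechanism can be pictured as a plain ring buffer; here's a minimal sketch in Python (the class and field names are illustrative, not vscreen's actual Rust internals):

```python
from collections import deque
from dataclasses import dataclass, field
import time

@dataclass
class Frame:
    image: bytes        # encoded screenshot bytes
    url: str            # page URL at capture time
    action: str         # the action that preceded this capture
    ts: float = field(default_factory=time.time)

class ScreenshotHistory:
    """Keeps the last `capacity` frames; older frames fall off automatically."""
    def __init__(self, capacity: int = 20):
        self.frames: deque[Frame] = deque(maxlen=capacity)

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)

    def get(self, steps_back: int) -> Frame:
        # steps_back=0 is the latest frame; 3 means "3 actions ago"
        return self.frames[-(steps_back + 1)]
```

The `deque(maxlen=...)` does all the work: pushing the 21st frame silently evicts the oldest, so the buffer never grows past 20.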

There's also vscreen_session_log — a timestamped record of every single action taken during the session. Navigate, click, type, screenshot — all logged. When an agent gets confused mid-task, it can call vscreen_session_summary and get a condensed "here's everything you've done so far" to re-orient itself.

Annotated screenshots

Instead of just a raw screenshot, vscreen_screenshot_annotated overlays numbered bounding boxes on every interactive element and returns a legend. The AI sees "button 4 is the Submit button" and can just reference it. No CSS selectors, no guessing coordinates from pixels.

Full-page capture with automatic coordinate translation

vscreen_screenshot(full_page=true) resizes the viewport to the entire document height and captures everything in one image — the AI sees the whole page, not just the visible viewport. The clever part: when the AI clicks on coordinates from that full-page image, vscreen automatically scrolls to the right position and dispatches the event. The AI just says "click at (450, 3200)" from the full-page screenshot and it works. No scroll math.
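The translation itself is simple arithmetic; a sketch of the idea, assuming a fixed viewport height (the function name and centering strategy are illustrative, not vscreen's exact algorithm):

```python
def translate_full_page_click(y_full: int, viewport_h: int, doc_h: int) -> tuple[int, int]:
    """Map a y-coordinate from a full-page screenshot to (scroll offset, viewport-local y).

    Scroll so the target is in view (centered where possible), clamping at the
    bottom of the document, then express the click relative to the viewport.
    """
    # Furthest we can scroll without running past the end of the document
    max_scroll = max(0, doc_h - viewport_h)
    # Try to center the target in the viewport
    scroll = min(max(0, y_full - viewport_h // 2), max_scroll)
    return scroll, y_full - scroll

# A click at y=3200 on a full-page capture of a 5000px document, 900px viewport:
scroll, local_y = translate_full_page_click(3200, 900, 5000)
```

The clamping matters near the bottom of the page: a target at y=4900 can't be centered, so the viewport pins at max scroll and the local coordinate lands lower in the frame.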

Automated captcha solving

vscreen_solve_captcha handles reCAPTCHA v2 end-to-end using a vision LLM. It clicks the checkbox, identifies which tiles match the target, clicks them, hits verify, handles multi-round challenges, and retries on failure. All in one call. Needs --vision-url pointing at an Ollama or OpenAI-compatible vision model.

Cookie banner and ad dismissal

The real internet is full of overlays blocking everything. vscreen_dismiss_dialogs detects and dismisses OneTrust, CookieBot, Didomi, Quantcast, TrustArc, and generic consent patterns across multiple languages. vscreen_dismiss_ads handles video ad overlays like YouTube skip buttons. One call each, done.

Multi-agent coordination

Multiple agents can work in parallel on separate instances, but sometimes they share one. Lease-based locking prevents collisions — one agent takes an exclusive lock, others get observer access (read-only screenshots and queries). A background reaper expires stale leases every 10 seconds, so a crashed agent never blocks the instance forever. Waiting agents can use acquire_or_wait to automatically pick up the lock when it frees up.

Task planning

Not sure which tools to use? vscreen_plan("fill out the contact form and submit it") returns a step-by-step tool sequence with parameters. The AI doesn't need to memorize 63 tools — it can ask.

The fundamentals

On top of the novel features, all the basics are covered:

Finding elements — by CSS selector, visible text, accessibility tree, placeholder/aria-label/role, or by asking a vision LLM to identify unlabeled icons. All work across cross-origin iframes.

Interacting — click, double-click, type, fill, key press, key combos, scroll, drag, hover, dropdown select. Plus vscreen_batch_click for hitting multiple coordinates in one call and vscreen_click_element for clicking by selector or text with auto scroll-into-view and retries.

Waiting — for page idle, specific text appearing, a CSS selector matching, URL changes (SPA navigation), or network idle. The real internet is slow — these keep the AI from acting before the page is ready.
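All of these waits reduce to the same poll-until-predicate loop; a generic sketch (timeout and interval values are illustrative):

```python
import time

def wait_for(predicate, timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll `predicate` until it returns truthy or `timeout` elapses.

    Returns True on success, False on timeout, so the agent can decide
    whether to retry, re-screenshot, or bail out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

Swap the predicate for "text appeared", "selector matched", or "URL changed" and you get each of the wait tools above.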

Navigation — go to a URL, back, forward, reload, or vscreen_click_and_navigate which clicks an element and waits for the URL to change (handles SPAs where clicks trigger pushState).

Data access — read/write cookies and localStorage/sessionStorage, capture console.log/warn/error messages, execute arbitrary JavaScript.

The recommended workflow

  1. vscreen_navigate to the URL
  2. vscreen_wait_for_idle for the page to settle
  3. vscreen_dismiss_dialogs to clear consent banners
  4. vscreen_screenshot to see what's there
  5. vscreen_find_elements or vscreen_find_by_text to locate targets
  6. vscreen_click / vscreen_type / vscreen_fill to interact
  7. vscreen_screenshot to verify
  8. Repeat

Or just call vscreen_plan("your task here") and let it tell you.
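The loop above maps naturally onto code; a sketch using a hypothetical `call(tool, **params)` MCP dispatcher (a stand-in for whatever MCP client your agent framework provides, not vscreen's own API):

```python
def run_form_task(call):
    """Walk the recommended workflow with a generic MCP tool dispatcher.

    `call` is any function taking (tool_name, **params); the selectors
    and values here are placeholder examples.
    """
    call("vscreen_navigate", url="https://example.com/contact")
    call("vscreen_wait_for_idle")
    call("vscreen_dismiss_dialogs")   # clear consent banners first
    call("vscreen_screenshot")        # see the initial state
    call("vscreen_find_by_text", text="Submit")
    call("vscreen_fill", selector="#email", value="agent@example.com")
    call("vscreen_click_element", text="Submit")
    call("vscreen_screenshot")        # verify the result
```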

Try it

Pre-built binaries for Linux are on the releases page. Or run it with Docker:

docker run -p 8450:8450 vscreen

GitHub: github.com/jameswebb68/vscreen
