My first post introduced vscreen — a Rust service that gives AI agents a real Chromium browser, streamed live over WebRTC. A few people asked for more detail on what the AI can actually do with it. So here's the breakdown, starting with the stuff I haven't seen anywhere else.
What makes this different
Screenshot history and session memory
AI agents forget. They take a screenshot, act, and the previous state is gone. vscreen keeps a ring buffer of the last 20 screenshots with full metadata. The AI can call vscreen_history_get(3) and see exactly what the page looked like 3 actions ago. "Wait, what was on that page before I clicked?" — now it can check.
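The history mechanism behaves like a fixed-size ring buffer. A minimal sketch of that idea, assuming screenshots carry a URL and timestamp as metadata (this models the behavior, not vscreen's actual internals):

```python
from collections import deque
from dataclasses import dataclass, field
import time

@dataclass
class Snapshot:
    image: bytes
    url: str
    taken_at: float = field(default_factory=time.time)

class ScreenshotHistory:
    """Ring buffer of the most recent screenshots (illustrative model)."""

    def __init__(self, capacity: int = 20):
        self._buf = deque(maxlen=capacity)  # oldest entries evicted automatically

    def push(self, snap: Snapshot) -> None:
        self._buf.append(snap)

    def get(self, steps_back: int) -> Snapshot:
        """get(3) -> the screenshot taken 3 actions ago; get(0) -> the latest."""
        if steps_back < 0 or steps_back >= len(self._buf):
            raise IndexError("no screenshot that far back")
        return self._buf[-1 - steps_back]
```

The `deque(maxlen=20)` does the eviction for free: push the 21st screenshot and the 1st silently disappears, so memory stays bounded no matter how long the session runs.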
There's also vscreen_session_log — a timestamped record of every single action taken during the session. Navigate, click, type, screenshot — all logged. When an agent gets confused mid-task, it can call vscreen_session_summary and get a condensed "here's everything you've done so far" to re-orient itself.
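The log-plus-summary pair can be modeled as an append-only list with a condensing view over its tail. A sketch under that assumption (the real summary format is not specified in the post):

```python
import time

class SessionLog:
    """Timestamped action log with a condensed summary (illustrative model)."""

    def __init__(self):
        self.entries = []  # (timestamp, action, detail)

    def record(self, action: str, detail: str = "") -> None:
        self.entries.append((time.time(), action, detail))

    def summary(self, last_n: int = 10) -> str:
        """Condense the most recent actions into a numbered recap."""
        tail = self.entries[-last_n:]
        return "\n".join(
            f"{i}. {action} {detail}".rstrip()
            for i, (_, action, detail) in enumerate(tail, start=1)
        )
```

An agent that has lost the thread calls `summary()` and gets back something like "1. navigate https://… / 2. click #login / 3. type #email …" — enough context to re-orient without replaying the whole session.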
Annotated screenshots
Instead of just a raw screenshot, vscreen_screenshot_annotated overlays numbered bounding boxes on every interactive element and returns a legend. The AI sees "button 4 is the Submit button" and can just reference it. No CSS selectors, no guessing coordinates from pixels.
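The legend is essentially a number-to-description map over the page's interactive elements. A minimal sketch, assuming each element exposes a tag and a label (the actual return schema may differ):

```python
def build_legend(elements):
    """Number each interactive element and return {number: description},
    the shape an annotated-screenshot legend might take (assumed format)."""
    return {
        n: f"{el['tag']}: {el['label']}"
        for n, el in enumerate(elements, start=1)
    }
```

With that map in hand, "click button 4" is an unambiguous instruction: the number resolves to one concrete element, with no selector or pixel guessing involved.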
Full-page capture with automatic coordinate translation
vscreen_screenshot(full_page=true) resizes the viewport to the entire document height and captures everything in one image — the AI sees the whole page, not just the visible viewport. The clever part: when the AI clicks on coordinates from that full-page image, vscreen automatically scrolls to the right position and dispatches the event. The AI just says "click at (450, 3200)" from the full-page screenshot and it works. No scroll math.
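The translation step reduces to a little arithmetic: pick a scroll offset that brings the target into the viewport, then convert the full-page y coordinate into a viewport-local one. A sketch of that math (centering the target where possible, which is one reasonable policy — not necessarily the one vscreen uses):

```python
def translate_fullpage_click(y_full: int, viewport_h: int, doc_h: int):
    """Map a y coordinate from a full-page screenshot to
    (scroll_offset, viewport_local_y)."""
    max_scroll = max(doc_h - viewport_h, 0)
    # Try to center the target vertically, clamped to the scrollable range.
    scroll = min(max(y_full - viewport_h // 2, 0), max_scroll)
    return scroll, y_full - scroll
```

For a click at y=3200 on a 5000 px document with a 1080 px viewport, this scrolls to 2660 and dispatches the click at local y=540 — the agent never sees any of that bookkeeping.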
Automated captcha solving
vscreen_solve_captcha handles reCAPTCHA v2 end-to-end using a vision LLM. It clicks the checkbox, identifies which tiles match the target, clicks them, hits verify, handles multi-round challenges, and retries on failure. All in one call. Needs --vision-url pointing at an Ollama or OpenAI-compatible vision model.
Cookie banner and ad dismissal
The real internet is full of overlays blocking everything. vscreen_dismiss_dialogs detects and dismisses OneTrust, CookieBot, Didomi, Quantcast, TrustArc, and generic consent patterns across multiple languages. vscreen_dismiss_ads handles video ad overlays like YouTube skip buttons. One call each, done.
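At its core, multilingual consent detection is matching button text against a set of known accept phrases. A toy sketch with a handful of assumed patterns (the real list covers far more vendors and languages, plus DOM-structure checks for the named CMPs):

```python
import re

# Assumed sample phrases only; illustrative, not vscreen's actual pattern set.
CONSENT_PATTERNS = [
    r"accept all",        # English
    r"i agree",           # English
    r"tout accepter",     # French
    r"alle akzeptieren",  # German
    r"aceptar todo",      # Spanish
]

def looks_like_consent_button(text: str) -> bool:
    """True if the button text exactly matches a known consent phrase."""
    t = text.strip().lower()
    return any(re.fullmatch(p, t) for p in CONSENT_PATTERNS)
```

The full-match (rather than substring) check matters: "Accept all" should trigger a click, but a headline containing the word "accept" in running text should not.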
Multi-agent coordination
Multiple agents can work in parallel on separate instances, but sometimes they share one. Lease-based locking prevents collisions — one agent takes an exclusive lock, others get observer access (read-only screenshots and queries). A background reaper expires stale leases every 10 seconds, so a crashed agent never blocks the instance forever. Waiting agents can use acquire_or_wait to automatically pick up the lock when it frees up.
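The lease scheme is a lock with a time-to-live instead of an owner that must explicitly release. A sketch of that core idea, with the reaper folded into `acquire` for brevity (in vscreen it runs as a separate background task every 10 seconds; this is a model, not the actual Rust code):

```python
import time

class Lease:
    """Lease-based exclusive lock with TTL expiry (illustrative model)."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self.holder = None      # agent id, or None when free
        self.expires_at = 0.0

    def _reap(self, now: float) -> None:
        # What the background reaper does: drop leases past their expiry,
        # so a crashed holder never blocks the instance forever.
        if self.holder is not None and now >= self.expires_at:
            self.holder = None

    def acquire(self, agent, now=None) -> bool:
        now = time.time() if now is None else now
        self._reap(now)
        if self.holder is None or self.holder == agent:
            self.holder = agent
            self.expires_at = now + self.ttl  # fresh lease, or renewal
            return True
        return False  # lock held by someone else: caller gets observer access
```

An `acquire_or_wait`-style helper is then just this `acquire` in a retry loop: poll until the holder releases or its lease expires, then take over.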
Task planning
Not sure which tools to use? vscreen_plan("fill out the contact form and submit it") returns a step-by-step tool sequence with parameters. The AI doesn't need to memorize 63 tools — it can ask.
The fundamentals
On top of the novel features, all the basics are covered:
Finding elements — by CSS selector, visible text, accessibility tree, placeholder/aria-label/role, or by asking a vision LLM to identify unlabeled icons. All work across cross-origin iframes.
Interacting — click, double-click, type, fill, key press, key combos, scroll, drag, hover, dropdown select. Plus vscreen_batch_click for hitting multiple coordinates in one call and vscreen_click_element for clicking by selector or text with auto scroll-into-view and retries.
Waiting — for page idle, specific text appearing, a CSS selector matching, URL changes (SPA navigation), or network idle. The real internet is slow — these keep the AI from acting before the page is ready.
Navigation — go to a URL, back, forward, reload, or vscreen_click_and_navigate which clicks an element and waits for the URL to change (handles SPAs where clicks trigger pushState).
Data access — read/write cookies and localStorage/sessionStorage, capture console.log/warn/error messages, execute arbitrary JavaScript.
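All of the wait tools share one shape: poll a condition until it holds or a timeout expires. A generic sketch of that loop (the real tools likely watch CDP events rather than polling blindly, but the contract is the same):

```python
import time

def wait_for(condition, timeout: float = 10.0, interval: float = 0.05) -> bool:
    """Poll `condition` (a zero-arg callable) until it returns True
    or `timeout` seconds elapse. Returns the final truth value."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return bool(condition())  # one last check at the deadline
```

Swap in `lambda: "Order confirmed" in page_text()` or `lambda: current_url() != start_url` and you have wait-for-text and wait-for-navigation, respectively.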
The recommended workflow
- vscreen_navigate to the URL
- vscreen_wait_for_idle for the page to settle
- vscreen_dismiss_dialogs to clear consent banners
- vscreen_screenshot to see what's there
- vscreen_find_elements or vscreen_find_by_text to locate targets
- vscreen_click / vscreen_type / vscreen_fill to interact
- vscreen_screenshot to verify
- Repeat
Or just call vscreen_plan("your task here") and let it tell you.
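The loop above can be sketched as an agent driving a client. Here the client is a recording stub so the example is self-contained; the tool names come from this post, but the argument shapes are assumptions:

```python
class StubClient:
    """Stand-in for a real MCP/HTTP client; records which tools were called."""

    def __init__(self):
        self.calls = []

    def call(self, tool: str, **args):
        self.calls.append(tool)
        return {"ok": True}  # a real client would return tool output

def fill_contact_form(c: StubClient, url: str) -> None:
    """One pass through the recommended workflow (argument shapes assumed)."""
    c.call("vscreen_navigate", url=url)
    c.call("vscreen_wait_for_idle")
    c.call("vscreen_dismiss_dialogs")
    c.call("vscreen_screenshot")
    c.call("vscreen_find_by_text", text="Contact")
    c.call("vscreen_fill", selector="#email", value="agent@example.com")
    c.call("vscreen_click", selector="#submit")
    c.call("vscreen_screenshot")  # verify the result before moving on
```

The final screenshot is the important habit: act, then look, rather than assuming the click landed.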
Try it
Pre-built binaries for Linux are on the releases page. Or run it with Docker:
docker run -p 8450:8450 vscreen
GitHub: github.com/jameswebb68/vscreen