Jon Retting

I built a tool that lets AI agents browse the real internet — and you can watch them do it

AI agents can write code and analyze data, but they can't browse a website, click a button, or fill out a form. They don't have a browser.

So I built one.

Deep Dive Article

What is vscreen?

A Rust service that gives AI agents a real Chromium browser and streams it to you live over WebRTC. You see what the AI sees in real-time — video, audio, everything. Mouse and keyboard relay back bidirectionally, so you can take over at any time.

AI agents connect via MCP (Model Context Protocol) with 63 automation tools: navigate, screenshot, click, type, find elements, wait for page loads, dismiss cookie banners, solve CAPTCHAs, manage cookies, and more.
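Under the hood these are ordinary MCP `tools/call` requests. A navigation call might look roughly like this (the tool name and argument shape here are illustrative, not vscreen's exact schema):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "vscreen_navigate",
    "arguments": {
      "url": "https://example.com",
      "wait_until": "load"
    }
  }
}
```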

Spin up multiple isolated instances — different agents working different tasks in parallel, with lease-based locking so they don't step on each other.

The tech

Written in Rust from scratch. Not a Puppeteer wrapper. A purpose-built media pipeline: tokio, axum, webrtc-rs, openh264/libvpx, Opus audio.

  • unsafe_code = "forbid" across the workspace
  • unwrap() denied, panic denied — every error path handled
  • Clippy pedantic + nursery enforced
  • 510+ tests, 3 fuzz targets, supply chain auditing via cargo-deny
  • ~31,000 lines of Rust across 8 crates

Get started

Pre-built binaries for Linux are available on the releases page. Or build from source and run:

vscreen --dev

One command. Spins up Xvfb, PulseAudio, and Chromium. Or use Docker:

docker run -p 8450:8450 vscreen

Source-available, non-commercial license.

GitHub: github.com/jonretting/vscreen

Top comments (16)

CloakHQ

the lease-based locking and live WebRTC stream are both solid design choices. the trust problem you're describing (black box agent = can't verify what it did) is real and the live view solves it cleanly.

one question that comes up at scale though: when you spin up multiple isolated instances, how much are they actually isolated at the browser fingerprint level? the session isolation handles the cookie/storage layer, but if all instances are running on the same host they're likely sharing canvas fingerprint, WebGL renderer, screen resolution, timezone, and a bunch of navigator properties. detection systems pick that up — 10 "isolated" sessions that all report the same GPU and the same screen dimensions look like siblings from the same host, not distinct users.

curious whether that's in scope for vscreen or if the use case is more internal tooling where detection isn't part of the threat model. for agent workflows that hit sites with bot detection, it becomes the main bottleneck pretty fast — we ran into this building CloakBrowser (github.com/CloakHQ/CloakBrowser) which focuses on exactly that isolation layer.

Jon Retting • Edited

Great question — this is something I've been thinking about and have scoped for future work.

Currently vscreen handles the first tier of bot detection: navigator.webdriver removal, AutomationControlled blink feature disabled, realistic window.chrome runtime injection, spoofed plugins/languages/permissions, and WebGL renderer masking. That gets past commodity checks like reCAPTCHA and basic Cloudflare challenges, which covers the agent workflow use cases I'm focused on right now.

You're absolutely right about the sibling correlation problem though. Multiple instances on the same host currently share canvas fingerprint, screen resolution, timezone, hardware concurrency, device memory, and a static user-agent — any detection system doing cross-session comparison would spot them.

The architecture handles the hard part already — each instance runs in its own Xvfb display with its own Chromium process and a fully isolated user-data-dir, so per-instance differentiation is a natural next step. I have plans for configurable fingerprint isolation covering the vectors you mentioned (canvas, WebGL, screen dimensions, timezone, hardware properties, UA), with novel approaches rather than just bolting on the standard techniques.

For now, the current stealth layer handles normal websites well. Adversarial fingerprint isolation is part of the roadmap.

Cool project with CloakBrowser — different focus but same underlying problem. vscreen is more about giving AI agents a full observable browser with live video feedback, tool orchestration, and the ability to dynamically compose what they find into something useful beyond just scraping. Always good to see people tackling the fingerprint layer seriously though!

CloakHQ

good breakdown of the tiers. the commodity checks (webdriver, AutomationControlled) are table stakes now and you're right that they're relatively easy to clear.

curious what you mean by "novel approaches" for the canvas/WebGL layer specifically. the standard path is noise injection (randomize the last bits of the canvas readback, vary WebGL precision), but that introduces its own problems: detection systems have gotten good at spotting injected noise patterns, and if your noise function is deterministic it's actually a stronger fingerprint than the original. the harder problem is generating fingerprints that are internally consistent across all the vectors that get correlated — canvas, WebGL, font rendering, audio context, hardware concurrency. most naive implementations nail canvas but leave the others matching, which is almost worse than doing nothing.
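The deterministic-noise trap is easy to demonstrate. Here is a minimal Rust sketch (my illustration, not either project's implementation) of seed-based canvas perturbation: the same profile seed reproduces the same noise byte-for-byte, which is itself a stable identifier.

```rust
// Sketch of the deterministic-noise trap: "noise" that flips the low bit
// of each canvas byte, seeded per profile. The function names and the
// tiny LCG are illustrative only.

fn lcg(state: &mut u64) -> u8 {
    // Minimal linear congruential generator so no external crates are needed.
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 56) as u8
}

/// Perturb the low bit of each pixel byte, deterministically from a seed.
fn noisy_readback(pixels: &[u8], seed: u64) -> Vec<u8> {
    let mut s = seed;
    pixels.iter().map(|&p| p ^ (lcg(&mut s) & 1)).collect()
}

fn main() {
    let canvas = vec![200u8; 64]; // stand-in for a real canvas readback
    let a = noisy_readback(&canvas, 42);
    let b = noisy_readback(&canvas, 42);
    let c = noisy_readback(&canvas, 43);
    // Same seed => byte-identical output across sessions: the noise
    // pattern is now a *stronger* fingerprint than the clean canvas.
    assert_eq!(a, b);
    // Different seeds diverge, but each profile's pattern is stable.
    assert_ne!(a, c);
    println!("noise reproducible: {}", a == b);
}
```

The point of the sketch is the assertion, not the noise function: any per-profile deterministic perturbation survives across sessions and becomes trackable on its own.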

the per-Xvfb-display architecture is actually a really clean foundation for doing this properly — each virtual display can have its own injected GPU config. did you find the Xvfb approach introduces any timing artifacts? we've seen virtual displays leave traces in animation frame timing that don't match what a real GPU would produce.

Jon Retting

Really appreciate the depth here — the point about deterministic noise functions becoming stronger fingerprints than the original is exactly the kind of trap that makes naive approaches dangerous. And the internal consistency problem across correlated vectors is something I haven't seen many people articulate that clearly. This is genuinely helpful context as I scope this out.

Still in the design phase, but the general direction involves turning the detection problem inside out — using our own fingerprint detection against ourselves to validate that instances actually look distinct rather than just hoping the noise is good enough.

The Xvfb timing artifact question is a great one. I haven't specifically profiled requestAnimationFrame timing against real GPU baselines yet — that's now on my list to investigate. If you have any references or data on what the detectable patterns look like I'd be interested to see them.

Thanks for the thoughtful exchange — it's clear you've gone deep on this problem space. Jealous.

Jon Retting

For right now, the best I can say is... there are emergent properties that appear when you build agent identity infrastructure correctly.

CloakHQ

the rAF timing gap is well-documented but hard to find in one place. the short version: real GPU renders show ~16.6ms intervals with sub-millisecond jitter that follows GPU vsync drift. Xvfb doesn't have a physical vsync source, so intervals tend to cluster too tightly - variance is artificially low in a way that GPU-based detection can flag.
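That interval-variance gap can be checked in a few lines. This sketch flags frame-interval streams whose jitter is implausibly low for hardware vsync (the threshold is illustrative, not taken from any real detection system):

```rust
// Heuristic sketch: real GPU vsync shows ~16.6 ms intervals with
// sub-millisecond drift; a virtual display's software timer clusters
// too tightly. Threshold below is invented for illustration.

fn jitter_ms(intervals: &[f64]) -> f64 {
    // Standard deviation of the frame intervals, in milliseconds.
    let mean = intervals.iter().sum::<f64>() / intervals.len() as f64;
    let var = intervals.iter().map(|i| (i - mean).powi(2)).sum::<f64>()
        / intervals.len() as f64;
    var.sqrt()
}

/// Flag interval streams whose variance is implausibly low for real vsync.
fn looks_virtual(intervals: &[f64]) -> bool {
    jitter_ms(intervals) < 0.05 // illustrative threshold
}

fn main() {
    // Real GPU: ~16.6 ms with a few hundred microseconds of drift.
    let real = [16.61, 16.72, 16.55, 16.68, 16.59, 16.75];
    // Virtual display: a software timer firing with machine precision.
    let virt = [16.667, 16.666, 16.667, 16.667, 16.666, 16.667];
    assert!(!looks_virtual(&real));
    assert!(looks_virtual(&virt));
    println!("real jitter {:.3} ms, virtual jitter {:.3} ms",
             jitter_ms(&real), jitter_ms(&virt));
}
```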

one reference worth reading is Tor Browser's treatment of timing APIs - they explicitly reduce timer resolution to prevent this class of attack. the irony is that reducing resolution is itself a fingerprint if your population doesn't do it uniformly.

the "validate against ourselves" direction is exactly right. we do something similar - running our own detection battery as a build-time check rather than hoping the patches hold.

Jon Retting

Appreciate the rAF data, adding that to the research. The more identity infrastructure I build, the less I think about camouflage.

CapeStart

Really interesting build. The live visibility piece is what makes this stand out, because trust is still one of the biggest gaps in agent workflows.

Leo Pechnicki

The decision to build this in Rust from scratch instead of wrapping Puppeteer/Playwright is a bold move, but it makes sense once you think about the real-time streaming requirement. Most browser automation tools treat the browser as a black box you send commands to and get results from — the live WebRTC feed fundamentally changes that relationship. You're not just automating, you're co-piloting.

The lease-based locking is the kind of detail that separates a real tool from a demo. I've seen so many agent setups break silently when two tasks collide on the same session. The fact that leases auto-expire and waiting agents get promoted without polling is clean.

One thing I'm curious about — with the bidirectional mouse/keyboard relay, how do you handle the handoff moment when a human takes over mid-task? Does the agent get notified that it lost control, or does it keep sending commands that get ignored? Seems like there could be interesting race conditions there, especially if the agent is in the middle of a multi-step flow.

Jon Retting

Great observation. Right now the human input relay is intentionally uncoordinated — DataChannel events (mouse, keyboard, scroll, clipboard) dispatch straight to CDP alongside whatever the agent is doing. The MCP lease system handles agent-to-agent coordination but doesn't gate human input, by design at this stage.

The thinking was to get the full input pipeline working end-to-end first and let the right coordination model emerge from real usage rather than over-engineering a handoff protocol before knowing what the actual interaction patterns look like. The lease system already has the primitives (exclusive/observer modes, TTL, auto-promotion), so wiring in human-aware priority — something like last-active-wins with agent notification — is a natural next step once the patterns are clearer. Good eye on the mid-flow race condition though, that's exactly the kind of scenario that will shape how the handoff gets designed.
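To make "last-active-wins with agent notification" concrete, here is a minimal sketch of what such an arbiter could look like (my assumption about a possible design, not vscreen's actual protocol; all names are invented):

```rust
// Sketch: whichever side produced input most recently holds control,
// and the agent checks before each step whether its commands will land.
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
enum Controller { Agent, Human }

struct InputArbiter {
    holder: Controller,
    last_human_input: Option<Instant>,
    /// Human control lapses back to the agent after this idle window.
    human_idle_timeout: Duration,
}

impl InputArbiter {
    fn new() -> Self {
        Self {
            holder: Controller::Agent,
            last_human_input: None,
            human_idle_timeout: Duration::from_secs(5),
        }
    }

    /// Called for every DataChannel mouse/keyboard event from the human.
    fn on_human_input(&mut self, now: Instant) {
        self.holder = Controller::Human;
        self.last_human_input = Some(now);
    }

    /// Agent asks before each step; also handles the human going idle.
    fn may_agent_act(&mut self, now: Instant) -> bool {
        if let Some(t) = self.last_human_input {
            if now.duration_since(t) < self.human_idle_timeout {
                return false; // human took over mid-flow; agent pauses
            }
        }
        self.holder = Controller::Agent;
        true
    }
}

fn main() {
    let mut arb = InputArbiter::new();
    let t0 = Instant::now();
    assert!(arb.may_agent_act(t0));      // agent starts in control
    arb.on_human_input(t0);              // human clicks mid-task
    assert!(!arb.may_agent_act(t0));     // agent must back off
    let later = t0 + Duration::from_secs(6);
    assert!(arb.may_agent_act(later));   // human went idle; agent resumes
    println!("final holder: {:?}", arb.holder);
}
```

The `may_agent_act` return value doubles as the notification: a `false` tells the agent it lost control, so it can checkpoint its multi-step flow instead of firing commands into the void.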

Bhavin Sheth

This is honestly very interesting. One big limitation I’ve felt with AI agents is that they can’t handle real websites properly, especially things like login flows, buttons, or dynamic pages.

Being able to see what the AI is doing live and take control if needed solves a huge trust problem. It makes automation feel safer and more practical.

Jon Retting

Thanks! That trust problem was a big motivation — if you can't see what the agent is doing, you can't trust it with anything important. Live streaming the session changes the whole dynamic.

Matthew Hou

The live WebRTC stream is a great design choice. The biggest trust issue with agent browser automation is the black box problem — you give it a task, it runs for 2 minutes, and you either get a result or an error with no visibility into what happened in between.

Being able to watch and take over at any time changes the failure mode from "debug after the fact" to "intervene when you see it going wrong." That's a fundamentally different experience.

The lease-based locking for parallel instances is smart too. I've run into exactly that problem — two agent tasks trying to interact with the same page, stepping on each other's state. How does the lease system handle timeouts? If an agent stalls mid-task, does the lease expire and another agent can pick it up, or does it require manual intervention?

Jon Retting

Good question! Leases expire on their own — no babysitting needed.

Every lock has a TTL (default 120 seconds). A background reaper runs every 10 seconds and cleans up anything that's expired. If an agent stalls or crashes mid-task, the lease simply times out and the instance opens up for the next agent automatically. If an agent is waiting in the queue, it gets promoted right away.

Healthy agents extend their lease with vscreen_instance_lock_renew as a heartbeat. And if the agent's session disconnects entirely (like the MCP connection dropping), all its locks are cleaned up immediately — no need to wait for the TTL.

There's also an acquire_or_wait mode where an agent can say "give me the lock when it's free" and block until the stalled agent's lease expires, then grab it. No polling needed.
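The lifecycle above can be modeled in a few lines. This is a condensed sketch of the TTL bookkeeping only (not vscreen's actual implementation; no reaper task, wait queue, or session tracking here):

```rust
// Lease lifecycle sketch: acquire succeeds if the instance is free or
// its lease has lapsed; renew is the heartbeat that extends the TTL.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Lease { agent: String, expires_at: Instant }

struct LeaseTable {
    ttl: Duration,
    leases: HashMap<String, Lease>, // instance id -> current lease
}

impl LeaseTable {
    fn new(ttl: Duration) -> Self {
        Self { ttl, leases: HashMap::new() }
    }

    /// Acquire succeeds if the instance is free or its lease has expired.
    fn acquire(&mut self, instance: &str, agent: &str, now: Instant) -> bool {
        match self.leases.get(instance) {
            Some(l) if l.expires_at > now => false, // still held
            _ => {
                self.leases.insert(instance.into(), Lease {
                    agent: agent.into(),
                    expires_at: now + self.ttl,
                });
                true
            }
        }
    }

    /// Heartbeat, analogous to vscreen_instance_lock_renew.
    fn renew(&mut self, instance: &str, agent: &str, now: Instant) -> bool {
        match self.leases.get_mut(instance) {
            Some(l) if l.agent == agent => {
                l.expires_at = now + self.ttl;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut table = LeaseTable::new(Duration::from_secs(120));
    let t0 = Instant::now();
    assert!(table.acquire("browser-1", "agent-a", t0));
    assert!(!table.acquire("browser-1", "agent-b", t0)); // held by a
    assert!(table.renew("browser-1", "agent-a", t0));    // heartbeat
    // agent-a stalls: 130 s later the lease has lapsed and b takes over.
    let later = t0 + Duration::from_secs(130);
    assert!(table.acquire("browser-1", "agent-b", later));
    println!("holder: {}", table.leases["browser-1"].agent);
}
```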

klement Gunndu

The lease-based locking for parallel browser instances is a smart design choice — most MCP browser tools I have seen skip the concurrency problem entirely and break silently when two agents fight over one session.

Jon Retting

Yeah, that silent breakage is brutal — two agents clicking on different things in the same tab with no idea the other exists. The lease system was born out of hitting exactly that problem early on.