My AI agents kept telling me “Done!” while the page was blank and the console had 47 errors.
The problem is simple: agents write UI code but never open a browser to check. They can’t see if the layout is broken, if buttons overlap, or if the console is full of red.
So I built ProofShot: a CLI that gives any AI agent a browser, records the session, and bundles the proof of the agent's work.
We hit 280 stars overnight after posting on HN — would love more feedback.
How it works
Three commands:
proofshot start --run "npm run dev" --port 3000
(the agent navigates, clicks, and takes screenshots via proofshot exec)
proofshot stop
start opens a headless browser, begins video recording, and pipes your dev server logs. The agent interacts with the page through proofshot exec — clicking, typing, navigating. stop collects console errors, trims the video, and generates a self-contained HTML viewer with everything synced to a timeline.
The viewer is a single file — video playback with clickable action markers, console and server log tabs, screenshot gallery. Works offline, no dependencies. You can drop it on a PR and any reviewer sees exactly what happened.
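The timeline sync in the viewer boils down to simple offset math. A minimal sketch, assuming events are stamped with epoch milliseconds (ProofShot's actual on-disk format isn't documented in this post, and the timestamps below are made up):

```shell
#!/bin/sh
# Hypothetical sketch of aligning a logged event to the recorded video.
# Assumption: both the recording start and each console/server event carry
# an epoch-millisecond timestamp.
RECORD_START_MS=1700000000000   # when video recording began
ERROR_TS_MS=1700000012500       # when the console error was logged

# The viewer marker is just the event time minus the recording start:
OFFSET_MS=$((ERROR_TS_MS - RECORD_START_MS))
echo "place error marker at ${OFFSET_MS}ms into the video"   # 12500ms
```

The same subtraction works for clicks, navigations, and screenshots, which is why one recording-start timestamp is enough to sync every tab in the viewer.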
proofshot pr uploads the whole bundle to a GitHub PR comment automatically.
What it catches
ProofShot collects console errors and matches server log errors across 10+ languages (JS, Python, Ruby, Go, Rust, Java, etc.). So if the agent's code throws a React error, a Python traceback, or a Go panic, it shows up in the viewer with timestamps.
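The matching itself can be sketched with a few anchored regexes. These patterns are illustrative only; ProofShot's actual pattern set isn't published in this post, and the log lines are invented samples:

```shell
#!/bin/sh
# Hedged sketch of cross-language server-log error matching.
# Write some sample log lines a dev server might emit:
cat > /tmp/sample-server.log <<'EOF'
Traceback (most recent call last):
panic: runtime error: index out of range [3]
Error: Cannot read properties of undefined (reading 'map')
GET /health 200 OK
EOF

# One anchored pattern per runtime: Python tracebacks, Go panics,
# Node/JS errors. The healthy request line is left alone.
grep -E '^(Traceback \(most recent call last\):|panic: |Error: )' /tmp/sample-server.log
```

Each matched line gets a timestamp, so the error lands on the same timeline as the video and the browser actions.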
Agent-agnostic
It’s just shell commands. Works with Claude Code, Cursor, Codex, Gemini CLI, Windsurf — anything that can call a terminal. Built on agent-browser from Vercel Labs, which uses compact element references instead of the full accessibility tree (~93% smaller than Playwright MCP’s output).
Why not Playwright?
Playwright is great for writing test scripts. ProofShot solves a different problem: the agent doesn’t write tests, it just operates the browser and records everything. No scripting, no assertions. The human reviews the evidence.
Think of it as the difference between automated testing and a screen recording of a QA session.
What it’s not
It’s not a testing framework. The agent doesn’t decide pass/fail. It gives you the recording, the errors, and the screenshots — you decide if the feature is correct.
Open source, MIT licensed.