Machina Tools

Posted on Jun 21 • Originally published at machina.chat

How I Built BugCapture — From Screen Recording to AI-Ready Bug Report in One Click

#ai #opensource #devtools #webdev

I was debugging a form alignment issue on a client's production server. Remote machine, no local environment. The kind of problem where you know exactly what you're seeing but translating it into words for your AI agent takes longer than finding the bug yourself.

"The second column in the input group is a few pixels wider than the first, but only when the browser is at an intermediate viewport width — somewhere between 768px and 900px — and only after a user has interacted with the first field. The offset appears to be about 12px..."

By the time you've written that, you've already lost the time you were trying to save. And the AI's first three responses are clarifying questions, because even that description is ambiguous.

This is the problem BugCapture solves: it turns a 47-second screen recording into a structured file your AI agent can act on immediately — with no text description from you, no manual screenshots, no copy-pasting error messages.

The insight: bugs have a natural format

Modern AI agents — Claude, Copilot, GPT-4o — are multimodal. They can look at screenshots. They can read transcripts. The question isn't whether they can understand a bug from visual evidence; they clearly can. The question is: what format packages that evidence in a way that maximizes AI understanding?

The answer, after a lot of iteration, is a Markdown file with:

A voice transcript from the developer reproducing the bug (what you're thinking while you click)
Sequential screenshots at regular intervals (what the screen looked like over time)
Optional SSH log capture (what the server was doing at the same time)

This combination gives the AI three independent channels of information about the same event. The transcript explains intent. The screenshots show the visual state. The logs show the runtime state. An AI reading all three can build a more accurate model of the bug than it could from any one source alone.

How it works

The workflow is exactly three steps:

1. Record: click Record in the BugCapture browser interface. The page requests screen and microphone access. You reproduce the bug while narrating what you see. The browser's MediaRecorder API captures audio and video together, entirely on your machine.
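A minimal browser-side sketch of the Record step, assuming the standard getDisplayMedia, getUserMedia, and MediaRecorder APIs. The function names (pickMimeType, startRecording) and the list of candidate MIME types are illustrative assumptions, not taken from BugCapture's source.

```javascript
// Candidate container/codec strings, most preferred first (an assumption).
const CANDIDATE_TYPES = [
  'video/webm;codecs=vp9,opus',
  'video/webm;codecs=vp8,opus',
  'video/webm',
];

// Return the first type the predicate accepts, or '' to let the browser choose.
function pickMimeType(isSupported, candidates = CANDIDATE_TYPES) {
  return candidates.find((type) => isSupported(type)) ?? '';
}

// Hypothetical helper: record screen video plus microphone audio.
// Runs only in a browser; the data stays in the page until you send the Blob.
async function startRecording() {
  const screen = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Merge the screen's video track(s) with the microphone's audio track(s).
  const stream = new MediaStream([
    ...screen.getVideoTracks(),
    ...mic.getAudioTracks(),
  ]);

  const mimeType = pickMimeType((t) => MediaRecorder.isTypeSupported(t));
  const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : {});
  const chunks = [];
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) chunks.push(event.data);
  };

  // Resolves with the finished recording once stop() has been called.
  const done = new Promise((resolve) => {
    recorder.onstop = () => {
      stream.getTracks().forEach((track) => track.stop());
      resolve(new Blob(chunks, { type: recorder.mimeType }));
    };
  });

  recorder.start();
  return { stop: () => recorder.stop(), done };
}
```

A caller would keep the returned object, call stop() when the user clicks Stop, and await done to get the Blob to hand to the server.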

2. Process: when you click Stop, the recording goes to the local server, which runs two pipelines in parallel:

Frame extraction: ffmpeg extracts one screenshot every 3 seconds (configurable) and saves each as a JPEG at 85% quality, up to 20 frames per recording.
Transcription: @xenova/transformers runs the Whisper base.en model on the audio — fully offline, no API key, no data upload. On a modern laptop, a 47-second recording transcribes in about 8 seconds.
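The Process step could be sketched roughly as follows. This is an illustration under assumptions, not BugCapture's actual code: the function names, the output file pattern, and the injected extractFrames and transcribe functions are hypothetical. ffmpeg's -q:v scale does not map directly to a percentage, so the value 3 stands in for "85% quality" as a placeholder.

```javascript
// Build ffmpeg arguments for fixed-interval frame extraction.
// Defaults mirror the article: one frame every 3 seconds, at most 20 frames.
function buildFfmpegArgs(inputPath, outDir, options = {}) {
  const { intervalSec = 3, maxFrames = 20, jpegQ = 3 } = options;
  return [
    '-i', inputPath,
    '-vf', `fps=1/${intervalSec}`,   // one output frame per intervalSec seconds
    '-frames:v', String(maxFrames),  // stop after maxFrames frames
    '-q:v', String(jpegQ),           // JPEG quality scale; lower means better
    `${outDir}/frame-%03d.jpg`,      // hypothetical output naming pattern
  ];
}

// Run both stages concurrently. extractFrames and transcribe are injected
// (hypothetical) async functions, for example a child_process wrapper around
// ffmpeg and a Whisper call.
async function processRecording(recordingPath, { extractFrames, transcribe }) {
  const [frames, transcript] = await Promise.all([
    extractFrames(recordingPath),
    transcribe(recordingPath),
  ]);
  return { frames, transcript };
}
```

The argument array can be passed to child_process.spawn('ffmpeg', args). Because the two stages share no state, Promise.all lets total processing time approach that of the slower stage rather than the sum of both.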

3. Export: you get a .md file with the screenshots embedded as base64 JPEG images and the full transcript, structured for AI consumption.
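To make the Export step concrete, here is a minimal sketch of how such a file could be assembled. The section headings, field names, and the buildReport function are assumptions for illustration; the layout BugCapture actually writes may differ.

```javascript
// Three backticks, built indirectly so this example can sit inside a fence.
const FENCE = '`'.repeat(3);

// Hypothetical report builder.
// frames: [{ timeSec: number, base64: string }]; logs is optional plain text.
function buildReport({ title, transcript, frames, logs = '' }) {
  const lines = [
    `# ${title}`, '',
    '## Transcript', '',
    transcript.trim(), '',
    '## Screenshots', '',
  ];

  for (const frame of frames) {
    // Each frame is embedded inline as a data URI, so the file is standalone.
    lines.push(`### t = ${frame.timeSec}s`, '');
    lines.push(`![Frame at ${frame.timeSec}s](data:image/jpeg;base64,${frame.base64})`, '');
  }

  // The log section is emitted only when a capture is present.
  if (logs.trim()) {
    lines.push('## Server logs', '', `${FENCE}text`, logs.trim(), FENCE, '');
  }
  return lines.join('\n');
}
```

Embedding the images as data URIs keeps the report to a single file, at the cost of size: base64 adds roughly a third to each JPEG.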

Drop that into Claude's context or Copilot Agent's workspace, and the AI has everything it needs. No text description from you. No manual screenshot upload.

LogLens: adding the server side

BugCapture has an optional LogLens mode: enable it before recording, and the server opens an SSH connection to your remote machine and tails the configured log files in parallel with the screen capture. When you export, the .md includes a timestamped log capture alongside the visual evidence.
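A rough sketch of how a LogLens-style capture might work with the ssh2 package. The helper names and the choice of tail -n 0 -F are assumptions, not BugCapture's confirmed implementation. The ssh2 Client class is passed in by the caller (for example via import { Client } from 'ssh2'), so the helpers themselves have no dependencies.

```javascript
// Quote a path for a POSIX shell so spaces and quotes survive.
function shellQuote(value) {
  return `'${String(value).replace(/'/g, `'\\''`)}'`;
}

// Start at the current end of each file (-n 0) and keep following it (-F).
// -F is available in GNU and BSD tail; assumed present on the remote host.
function buildTailCommand(paths) {
  return `tail -n 0 -F ${paths.map(shellQuote).join(' ')}`;
}

// Prefix each non-empty line in a chunk with an ISO timestamp. A line split
// across two chunks is stamped as two pieces; a fuller version would buffer.
function stampLines(chunk, now = new Date()) {
  const stamp = now.toISOString();
  return String(chunk)
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => `[${stamp}] ${line}`);
}

// Client: the class exported by ssh2. config: passed to connect(), for
// example { host, username, privateKey }. Returns a function that stops tailing.
function tailRemoteLogs(Client, config, paths, onLine) {
  const conn = new Client();
  conn.on('ready', () => {
    conn.exec(buildTailCommand(paths), (err, stream) => {
      if (err) {
        console.error('exec failed:', err.message);
        conn.end();
        return;
      }
      stream.on('data', (chunk) => stampLines(chunk).forEach(onLine));
      stream.on('close', () => conn.end());
    });
  });
  conn.on('error', (err) => console.error('SSH error:', err.message));
  conn.connect(config);
  return () => conn.end();
}
```

Stamping lines on arrival uses the local clock, which makes it straightforward to line log entries up with frame times even when the remote server's clock differs.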

The real test

The form alignment bug I mentioned: I recorded 47 seconds of screen and voice, exported the .md, and dropped it into Copilot Agent.

Copilot identified a conflicting width rule in a child theme stylesheet that was being applied conditionally after the first user interaction triggered a re-render. The fix was three lines of CSS. Total time from "I see the bug" to "fix deployed": under two minutes.

Key strengths

Completely offline. The Whisper model runs locally via ONNX. No transcription API, no upload, no account.
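For reference, local transcription with @xenova/transformers might look like the sketch below. It assumes the audio has already been decoded to mono 16 kHz PCM (for instance by ffmpeg), and the helper names are hypothetical. By default the library fetches the model weights on first use and caches them, so inference runs without network access only once the weights are present locally.

```javascript
// Convert 16-bit PCM samples to the Float32 range [-1, 1) used as model input.
function pcm16ToFloat32(samples) {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] / 32768;
  return out;
}

// Transcribe mono 16 kHz audio given as a Float32Array.
// The dynamic import keeps this file loadable when the package is absent.
async function transcribeLocally(float32Audio) {
  const { pipeline } = await import('@xenova/transformers');
  const transcriber = await pipeline(
    'automatic-speech-recognition',
    'Xenova/whisper-base.en',
  );
  // Long audio is processed in 30-second windows; timestamps are per segment.
  const result = await transcriber(float32Audio, {
    chunk_length_s: 30,
    return_timestamps: true,
  });
  return result; // { text, chunks: [{ timestamp: [start, end], text }] }
}
```

The per-segment timestamps are what would let a transcript be aligned with frames and log lines from the same recording.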

AI-agnostic output. The .md file works with any AI that accepts text: Claude, Copilot, GPT-4, Gemini, local models via Ollama.

Zero configuration for basic use. Install Node.js, clone the repo, and run node server.mjs. No API keys required.

Technical stack

| Component | Technology |
| --- | --- |
| Screen + audio capture | Web MediaRecorder API |
| Frame extraction | ffmpeg |
| Transcription | `@xenova/transformers` + Whisper ONNX |
| SSH log capture | `ssh2` |
| Output format | Markdown with base64 JPEG |
| Server | Node.js ESM, no framework |

Try it

BugCapture is part of Machina — an open source suite of tools that close the gap between "I see the bug" and "AI fixes the bug."

git clone https://github.com/machina-tools/machina.git
cd machina
bash setup.sh
cd tools/bugcapture && node server.mjs

Then open tools/bugcapture/index.html in your browser.

→ GitHub | machina.chat

Top comments (1)

Nazar Boyko • Jun 21

The idea of three channels is what makes this click for me. Transcript for intent, frames for the visual state, logs for the runtime state, all timestamped against the same moment. That's a lot more than I usually hand an agent when I paste one screenshot and a sentence. My question is about the frames. One every 3 seconds is simple, but a fast visual glitch can happen between samples. Did you consider grabbing frames on events like clicks or DOM changes instead of a fixed interval, or did the fixed cadence just turn out good enough in practice?