Manuel Bruña

Posted on Jun 9

Video Docs Builder: Turning Web App Flows Into Narrated MP4 Documentation

#ai #automation #tutorial #opensource

I built Video Docs Builder because product documentation has a strange failure mode: the most useful docs are usually the first ones to become stale.

Text docs get outdated when the UI changes. Screenshots get outdated even faster. Product videos are worse, because recording them manually takes enough time that teams avoid updating them unless there is a launch, a customer escalation, or a deadline.

Video Docs Builder tries to make recording and re-recording those videos agent-friendly.

Repository:

https://github.com/tecnomanu/video-docs-builder

What it does

Video Docs Builder is an agent skill for generating narrated videos from web app flows.

It combines:

Playwright for browser automation and recording
TTS narration through Piper, ElevenLabs, or OpenAI
FFmpeg for final audio/video assembly
optional React docs site generation with embedded videos

The output is a normal MP4. The source of truth is a flow file that lives inside the target project.

The pipeline looks like this:

Playwright -> TTS narration -> FFmpeg assembly -> MP4 documentation video

That shape matters. The final video is not magic. It is built from structured steps, selectors, narration, timing, and browser interactions that can be reviewed and changed.

Why make it an agent skill?

The obvious version of this project would be a CLI that records a browser. That is useful, but it still leaves a lot of work to the human:

decide which flows matter
inspect the app
find selectors
write step narration
tune timing
re-record after UI changes
assemble output
build a docs page

An agent can help with those parts if the workflow is explicit enough.

The skill gives the agent a process:

initialize a .video-docs folder in the client repo
analyze the app with screenshots and selectors
write one or more flow JSON files
generate narration
record browser actions
assemble the video
optionally generate a docs site

Install:

npx skills add https://github.com/tecnomanu/video-docs-builder

Then ask an agent:

Document my app at http://localhost:3000

The goal is not to remove judgment. The goal is to remove the repetitive production work around a clear documentation task.

Project layout

The generated files live inside the project being documented:

your-project/
└── .video-docs/
    ├── config.json
    ├── flows/
    │   └── 01-login.json
    ├── analysis/
    ├── docs/
    └── output/
        └── 01-login/
            ├── audio/
            ├── raw/
            └── final/
                └── 01-login.mp4

That layout is deliberate.

The flow files are the durable part. They describe what to record and what to say. Generated screenshots, raw browser recordings, and audio files are build artifacts.
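To make the layout concrete, here is a minimal sketch of how the artifact locations for one flow could be derived. It assumes the output folder and the final MP4 are both named after the flow's output_name, which is inferred from the tree above rather than confirmed by the project. The helper flowOutputPaths is hypothetical.

```typescript
// Hypothetical helper, not part of Video Docs Builder.
// Assumption: output folders are named after the flow's output_name.
interface FlowOutputPaths {
  audioDir: string;
  rawDir: string;
  finalVideo: string;
}

function flowOutputPaths(videoDocsDir: string, outputName: string): FlowOutputPaths {
  // Strip trailing slashes so the joined paths never contain "//".
  const root = videoDocsDir.replace(/\/+$/, "");
  const base = `${root}/output/${outputName}`;
  return {
    audioDir: `${base}/audio`,
    rawDir: `${base}/raw`,
    finalVideo: `${base}/final/${outputName}.mp4`,
  };
}

const loginPaths = flowOutputPaths("your-project/.video-docs/", "01-login");
console.log(loginPaths.finalVideo);
// your-project/.video-docs/output/01-login/final/01-login.mp4
```

Everything under output/ can be rebuilt from the flow file, which is why the flow is the part worth keeping.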

In practice, I would commit:

useful flow JSON files
docs site source, if the team wants generated docs in the repo
config templates without secrets

I would not commit:

raw recordings
generated audio
final MP4s unless the repo intentionally stores media
analysis screenshots if they are just temporary agent input
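The two lists above can be read as a small policy. The sketch below encodes it as a function over paths relative to .video-docs/. The function name commitAdvice and the three labels are illustrative assumptions, and a team that intentionally stores media would change the "ignore" cases.

```typescript
// Hypothetical policy helper; labels and name are illustrative only.
type CommitAdvice = "commit" | "ignore" | "review";

// Classifies a path relative to .video-docs/ using the layout shown earlier.
function commitAdvice(relativePath: string): CommitAdvice {
  const top = relativePath.split("/")[0];
  switch (top) {
    case "flows":
      return "commit"; // flow JSON is the durable source of truth
    case "output":
    case "analysis":
      return "ignore"; // generated audio, raw recordings, MP4s, screenshots
    default:
      // config.json may hold secrets and docs/ is optional,
      // so both are left to a human decision.
      return "review";
  }
}

console.log(commitAdvice("flows/01-login.json")); // commit
console.log(commitAdvice("output/01-login/raw")); // ignore
```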

Flow JSON example

A flow is small enough to read, but structured enough to replay:

{
  "project": "admin-panel",
  "title": "Invite a new teammate",
  "category": "Team Management",
  "description": "A short walkthrough showing how an admin invites a teammate.",
  "output_name": "02-invite-teammate",
  "viewport": { "width": 1280, "height": 800 },
  "use_setup_login": true,
  "steps": [
    {
      "id": "open_team",
      "action": "navigate",
      "value": "http://localhost:3000/team",
      "narration": "We start in the Team section, where admins manage access.",
      "action_ms": 2000,
      "wait_for": "#invite-user-btn"
    },
    {
      "id": "explain_invite",
      "action": "wait",
      "narration": "Next, we open the invite form and enter the teammate details.",
      "action_ms": 900
    },
    {
      "id": "click_invite",
      "action": "click",
      "selector": "#invite-user-btn",
      "action_ms": 500,
      "wait_for": "form[data-testid='invite-form']"
    },
    {
      "id": "fill_email",
      "action": "fill",
      "selector": "input[name='email']",
      "value": "alex@example.com",
      "narration": "The email field defines who receives the invitation.",
      "action_ms": 700
    }
  ]
}

The important field is not just the selector. It is the narration.

Video docs are only useful when the spoken explanation matches what the viewer is about to see.
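For readers who prefer types, the example flow can be described roughly as follows. The field names are copied from the JSON above, but which fields are optional, and the list of actions, are assumptions based only on what this post shows. The real tool may accept more. The checkFlow function is a hypothetical pre-flight check, not part of the project.

```typescript
// Shape inferred from the example flow; optionality rules are assumptions.
type StepAction = "navigate" | "click" | "fill" | "wait";

interface FlowStep {
  id?: string;
  action: StepAction;
  selector?: string;
  value?: string;
  narration?: string;
  action_ms: number;
  wait_for?: string;
  wait_for_url?: string;
}

interface Flow {
  project: string;
  title: string;
  category: string;
  description: string;
  output_name: string;
  viewport: { width: number; height: number };
  use_setup_login: boolean;
  steps: FlowStep[];
}

// Returns human-readable problems; an empty array means nothing was flagged.
function checkFlow(flow: Flow): string[] {
  const problems: string[] = [];
  flow.steps.forEach((step, index) => {
    const label = step.id ?? `step ${index + 1}`;
    if ((step.action === "click" || step.action === "fill") && !step.selector) {
      problems.push(`${label}: "${step.action}" needs a selector`);
    }
    if ((step.action === "navigate" || step.action === "fill") && !step.value) {
      problems.push(`${label}: "${step.action}" needs a value`);
    }
    if (step.action_ms <= 0) {
      problems.push(`${label}: action_ms must be positive`);
    }
  });
  return problems;
}

const inviteFlow: Flow = {
  project: "admin-panel",
  title: "Invite a new teammate",
  category: "Team Management",
  description: "A short walkthrough showing how an admin invites a teammate.",
  output_name: "02-invite-teammate",
  viewport: { width: 1280, height: 800 },
  use_setup_login: true,
  steps: [
    { id: "click_invite", action: "click", selector: "#invite-user-btn", action_ms: 500 },
    { id: "fill_email", action: "fill", selector: "input[name='email']", value: "alex@example.com", action_ms: 700 },
  ],
};

console.log(checkFlow(inviteFlow)); // no problems reported for this flow
```

A check like this catches the cheap mistakes before a browser is launched. It says nothing about whether the selectors still match the live UI.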

The timing rule

The README has a rule I think is worth repeating:

Narration should describe what is about to happen, not what already happened.

Bad timing:

{
  "action": "click",
  "selector": "#login-btn",
  "narration": "We click Login",
  "action_ms": 4000
}

Why is that bad? Because the UI may change immediately after the click while narration is still explaining the click. The viewer sees the dashboard before the voice finishes saying what caused it.

Better:

[
  {
    "action": "wait",
    "narration": "We click Login to authenticate.",
    "action_ms": 600
  },
  {
    "action": "click",
    "selector": "#login-btn",
    "action_ms": 500,
    "wait_for_url": "/dashboard"
  }
]

This is the difference between a video that feels intentional and a video that feels like a screen recording with audio glued on top.
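The rewrite from the bad version to the better one is mechanical enough to sketch. The function below moves narration off a click step into a preceding wait step. It is a hypothetical helper, not something the project ships, and the default durations (600 ms and 500 ms) are simply the values from the example above. How the recorder stretches a narrated step to fit the audio is not covered here.

```typescript
interface TimedStep {
  action: string;
  selector?: string;
  narration?: string;
  action_ms: number;
  wait_for?: string;
  wait_for_url?: string;
}

// Hypothetical helper: splits every narrated click into
// "wait + narration" followed by a short, silent click.
function narrateBeforeClick(steps: TimedStep[], leadMs = 600, clickMs = 500): TimedStep[] {
  const result: TimedStep[] = [];
  for (const step of steps) {
    if (step.action !== "click" || !step.narration) {
      result.push(step); // nothing to fix
      continue;
    }
    const { narration, ...click } = step;
    // The voice goes first, before anything changes on screen.
    result.push({ action: "wait", narration, action_ms: leadMs });
    // The click itself stays silent and short.
    result.push({ ...click, action_ms: Math.min(click.action_ms, clickMs) });
  }
  return result;
}

const fixed = narrateBeforeClick([
  { action: "click", selector: "#login-btn", narration: "We click Login", action_ms: 4000 },
]);
console.log(JSON.stringify(fixed, null, 2));
```

Steps that are not clicks, or clicks without narration, pass through unchanged, so running the helper on an already fixed flow changes nothing.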

Manual commands

The skill can guide an agent, but the pieces are also available manually:

# Initialize .video-docs/ in a project
npm run init-project /path/to/your-project

# Analyze a running app
npx tsx scripts/analyze-app.ts /path/to/your-project/.video-docs

# Run the full pipeline for one flow
bash scripts/run-all.sh /path/to/your-project/.video-docs/flows/01-login.json

# Re-record without regenerating audio
bash scripts/run-all.sh /path/to/your-project/.video-docs/flows/01-login.json --skip-audio

# Generate a React docs site
npx tsx scripts/generate-docs-site.ts /path/to/your-project/.video-docs

The --skip-audio flag is useful in real work. If copy and narration are already approved, but the UI changed, you should not need to regenerate the voice. Re-recording the browser layer is enough.
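If you drive these commands from a script instead of typing them, the only moving parts are the flow path and the --skip-audio flag. A small sketch, where runAllCommand is a hypothetical wrapper and the script path and flag are the ones listed above:

```typescript
interface RunOptions {
  skipAudio?: boolean;
}

// Hypothetical wrapper: builds the argument list for scripts/run-all.sh.
// It only assembles the command; it does not execute anything.
function runAllCommand(flowPath: string, options: RunOptions = {}): string[] {
  const args = ["bash", "scripts/run-all.sh", flowPath];
  if (options.skipAudio) {
    args.push("--skip-audio"); // reuse approved narration, re-record the browser only
  }
  return args;
}

const rerecord = runAllCommand(
  "/path/to/your-project/.video-docs/flows/01-login.json",
  { skipAudio: true },
);
console.log(rerecord.join(" "));
```

The resulting array can be handed to something like Node's child_process.spawnSync(rerecord[0], rerecord.slice(1)). Because the script path is relative, this presumably needs to run from the Video Docs Builder checkout.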

TTS choices

Video Docs Builder supports three narration providers:

Piper: local and free
ElevenLabs: high quality, remote
OpenAI TTS: remote, good quality

For internal docs, Piper can be enough. For public onboarding or polished demos, a remote provider may be worth it.

The point is that the TTS provider should be a configuration detail, not the whole architecture.
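One way to picture "a configuration detail" is a lookup keyed by provider name. This is only a sketch: the config key tts_provider, the default of Piper, and the ProviderInfo shape are all assumptions made for illustration, not the project's actual config schema.

```typescript
type ProviderName = "piper" | "elevenlabs" | "openai";

interface ProviderInfo {
  name: ProviderName;
  runsLocally: boolean;
  needsApiKey: boolean;
}

const PROVIDERS: Record<ProviderName, ProviderInfo> = {
  piper: { name: "piper", runsLocally: true, needsApiKey: false },
  elevenlabs: { name: "elevenlabs", runsLocally: false, needsApiKey: true },
  openai: { name: "openai", runsLocally: false, needsApiKey: true },
};

// `tts_provider` is a hypothetical config key used only for this sketch.
// Falling back to Piper when it is missing is also an assumption.
function resolveProvider(config: { tts_provider?: string }): ProviderInfo {
  const requested = (config.tts_provider ?? "piper").toLowerCase();
  if (Object.prototype.hasOwnProperty.call(PROVIDERS, requested)) {
    return PROVIDERS[requested as ProviderName];
  }
  throw new Error(`Unknown TTS provider: ${requested}`);
}

console.log(resolveProvider({ tts_provider: "ElevenLabs" }).needsApiKey); // true
console.log(resolveProvider({}).runsLocally); // true
```

The rest of the pipeline only needs an audio file per narrated step, so swapping providers should not touch flows, recording, or assembly.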

Where this helps

I see this being useful in a few places:

onboarding flows
QA handoff videos
customer support docs
release notes with visual walkthroughs
internal admin panel documentation
demo environments that change often

The best use case is not a one-off marketing video. It is a repeated flow that changes over time and should stay documented.

What I would like agents to do with it

My ideal agent workflow looks like this:

The user points the agent at a running local app.
The agent inspects the app and proposes flows.
The user chooses the important ones.
The agent writes flow JSON.
The agent runs the pipeline.
The user reviews the MP4.
The agent fixes selectors, copy, or timing.
The final flows become part of the repo.

That gives documentation a maintenance path. When the UI changes, the agent can re-run the same flow and adjust the small broken pieces.

Repository

https://github.com/tecnomanu/video-docs-builder

If you are building tools for agents, this project is a good example of the direction I like: not a giant autonomous system, but a structured workflow where an agent can do useful production work because the steps are explicit.
