Manuel Bruña

Video Docs Builder: Turning Web App Flows Into Narrated MP4 Documentation


I built Video Docs Builder because product documentation has a strange failure mode: the most useful docs are usually the first ones to become stale.

Text docs get outdated when the UI changes. Screenshots get outdated even faster. Product videos are worse, because recording them manually takes enough time that teams avoid updating them unless there is a launch, a customer escalation, or a deadline.

Video Docs Builder tries to make recording and re-recording those videos agent-friendly.

Repository:

https://github.com/tecnomanu/video-docs-builder

What it does

Video Docs Builder is an agent skill for generating narrated videos from web app flows.

It combines:

  • Playwright for browser automation and recording
  • TTS narration through Piper, ElevenLabs, or OpenAI
  • FFmpeg for final audio/video assembly
  • optional React docs site generation with embedded videos

The output is a normal MP4. The source of truth is a flow file that lives inside the target project.

The pipeline looks like this:

Playwright -> TTS narration -> FFmpeg assembly -> MP4 documentation video

That shape matters. The final video is not magic. It is built from structured steps, selectors, narration, timing, and browser interactions that can be reviewed and changed.

Why make it an agent skill?

The obvious version of this project would be a CLI that records a browser. That is useful, but it still leaves a lot of work to the human:

  • decide which flows matter
  • inspect the app
  • find selectors
  • write step narration
  • tune timing
  • re-record after UI changes
  • assemble output
  • build a docs page

An agent can help with those parts if the workflow is explicit enough.

The skill gives the agent a process:

  1. initialize a .video-docs folder in the client repo
  2. analyze the app with screenshots and selectors
  3. write one or more flow JSON files
  4. generate narration
  5. record browser actions
  6. assemble the video
  7. optionally generate a docs site

Install:

npx skills add https://github.com/tecnomanu/video-docs-builder

Then ask an agent:

Document my app at http://localhost:3000

The goal is not to remove judgment. The goal is to remove the repetitive production work around a clear documentation task.

Project layout

The generated files live inside the project being documented:

your-project/
└── .video-docs/
    ├── config.json
    ├── flows/
    │   └── 01-login.json
    ├── analysis/
    ├── docs/
    └── output/
        └── 01-login/
            ├── audio/
            ├── raw/
            └── final/
                └── 01-login.mp4

That layout is deliberate.

The flow files are the durable part. They describe what to record and what to say. Generated screenshots, raw browser recordings, and audio files are build artifacts.
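To make the layout concrete, here is a minimal sketch of how the output locations could be derived from a flow. It assumes the output folder is named after the flow's output_name, which is what the tree above suggests; the helper name outputPathsFor is hypothetical and not part of the tool.

```typescript
// Hypothetical helper: maps a flow's output_name to the folders shown in the layout.
interface FlowOutputPaths {
  audioDir: string;
  rawDir: string;
  finalVideo: string;
}

function outputPathsFor(videoDocsDir: string, outputName: string): FlowOutputPaths {
  // Strip trailing slashes so the joined paths never contain "//".
  const root = videoDocsDir.replace(/\/+$/, "");
  // Assumption: one output folder per flow, named after output_name.
  const base = `${root}/output/${outputName}`;
  return {
    audioDir: `${base}/audio`,
    rawDir: `${base}/raw`,
    finalVideo: `${base}/final/${outputName}.mp4`,
  };
}

const paths = outputPathsFor("your-project/.video-docs/", "01-login");
console.log(paths.finalVideo);
// your-project/.video-docs/output/01-login/final/01-login.mp4
```

Everything under output/ can be derived again from the flow file, which is why the flow is the part worth keeping.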

In practice, I would commit:

  • useful flow JSON files
  • docs site source, if the team wants generated docs in the repo
  • config templates without secrets

I would not commit:

  • raw recordings
  • generated audio
  • final MP4s unless the repo intentionally stores media
  • analysis screenshots if they are just temporary agent input
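The split above can be expressed as a small path check. This is only an illustration of the two lists, not a rule the tool enforces. The prefixes assume the layout shown earlier, and the file names in the usage example are made up.

```typescript
// Folders treated as regenerable build output, following the two lists above.
// Assumption: paths are relative to .video-docs/ and use forward slashes.
const ARTIFACT_PREFIXES = ["output/", "analysis/"];

function isBuildArtifact(relativePath: string): boolean {
  return ARTIFACT_PREFIXES.some((prefix) => relativePath.startsWith(prefix));
}

// Hypothetical file names, used only to exercise the check.
const candidates = [
  "flows/01-login.json",
  "output/01-login/raw/recording.webm",
  "analysis/home.png",
];

const toCommit = candidates.filter((path) => !isBuildArtifact(path));
console.log(toCommit); // only the flow file remains
```

A repo that intentionally stores final MP4s would need a narrower rule, for example treating only the audio/ and raw/ subfolders as artifacts.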

Flow JSON example

A flow is small enough to read, but structured enough to replay:

{
  "project": "admin-panel",
  "title": "Invite a new teammate",
  "category": "Team Management",
  "description": "A short walkthrough showing how an admin invites a teammate.",
  "output_name": "02-invite-teammate",
  "viewport": { "width": 1280, "height": 800 },
  "use_setup_login": true,
  "steps": [
    {
      "id": "open_team",
      "action": "navigate",
      "value": "http://localhost:3000/team",
      "narration": "We start in the Team section, where admins manage access.",
      "action_ms": 2000,
      "wait_for": "#invite-user-btn"
    },
    {
      "id": "explain_invite",
      "action": "wait",
      "narration": "Next, we open the invite form and enter the teammate details.",
      "action_ms": 900
    },
    {
      "id": "click_invite",
      "action": "click",
      "selector": "#invite-user-btn",
      "action_ms": 500,
      "wait_for": "form[data-testid='invite-form']"
    },
    {
      "id": "fill_email",
      "action": "fill",
      "selector": "input[name='email']",
      "value": "alex@example.com",
      "narration": "The email field defines who receives the invitation.",
      "action_ms": 700
    }
  ]
}

The important field is not just the selector. It is the narration.

Video docs are only useful when the spoken explanation matches what the viewer is about to see.
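For readers who prefer types, the step shape in that example can be sketched in TypeScript. The field names come from the JSON above. Which fields are optional, the restriction to four actions, and the checkStep helper are all assumptions for illustration; the real tool may accept more actions and apply different rules.

```typescript
// Only the actions that appear in the example flow; the tool may support others.
type StepAction = "navigate" | "wait" | "click" | "fill";

interface FlowStep {
  id?: string;
  action: StepAction;
  selector?: string;
  value?: string;
  narration?: string;
  action_ms: number;
  wait_for?: string;
  wait_for_url?: string;
}

// Hypothetical pre-flight check: returns readable problems, or an empty list
// when the step has the fields its action needs.
function checkStep(step: FlowStep): string[] {
  const label = step.id ?? "(unnamed step)";
  const problems: string[] = [];
  if ((step.action === "click" || step.action === "fill") && !step.selector) {
    problems.push(`${label}: "${step.action}" needs a selector`);
  }
  if ((step.action === "navigate" || step.action === "fill") && !step.value) {
    problems.push(`${label}: "${step.action}" needs a value`);
  }
  if (step.action_ms < 0) {
    problems.push(`${label}: action_ms must not be negative`);
  }
  return problems;
}

// The first step is missing its selector; the second is complete.
const steps: FlowStep[] = [
  { id: "click_invite", action: "click", action_ms: 500 },
  { id: "fill_email", action: "fill", selector: "input[name='email']", value: "alex@example.com", action_ms: 700 },
];
for (const step of steps) {
  console.log(step.id, checkStep(step));
}
```

A check like this catches structural mistakes before a recording run starts. It cannot tell whether the narration matches what happens on screen, so that part still needs review.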

The timing rule

The README has a rule I think is worth repeating:

Narration should describe what is about to happen, not what already happened.

Bad timing:

{
  "action": "click",
  "selector": "#login-btn",
  "narration": "We click Login",
  "action_ms": 4000
}

Why is that bad? Because the UI may change immediately after the click while narration is still explaining the click. The viewer sees the dashboard before the voice finishes saying what caused it.

Better:

[
  {
    "action": "wait",
    "narration": "We click Login to authenticate.",
    "action_ms": 600
  },
  {
    "action": "click",
    "selector": "#login-btn",
    "action_ms": 500,
    "wait_for_url": "/dashboard"
  }
]

This is the difference between a video that feels intentional and a video that feels like a screen recording with audio glued on top.
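The same rewrite can be applied mechanically. Below is a minimal sketch, assuming the step fields from the examples above. The helper splitNarratedClicks and the two timing constants are hypothetical; the constants are simply copied from the "Better" example.

```typescript
interface Step {
  id?: string;
  action: string;
  selector?: string;
  narration?: string;
  action_ms: number;
  wait_for_url?: string;
}

// Assumed timings, taken from the "Better" example.
const NARRATION_PAUSE_MS = 600;
const SILENT_CLICK_MS = 500;

// Splits every narrated click into a narration-only "wait" step followed by
// a silent click. All other steps pass through unchanged.
function splitNarratedClicks(steps: Step[]): Step[] {
  const result: Step[] = [];
  for (const step of steps) {
    if (step.action !== "click" || !step.narration) {
      result.push(step);
      continue;
    }
    const { narration, ...silentClick } = step;
    result.push({ action: "wait", narration, action_ms: NARRATION_PAUSE_MS });
    // The click no longer has to cover the narration, so it can be short.
    result.push({ ...silentClick, action_ms: Math.min(step.action_ms, SILENT_CLICK_MS) });
  }
  return result;
}

const fixed = splitNarratedClicks([
  { action: "click", selector: "#login-btn", narration: "We click Login", action_ms: 4000 },
]);
console.log(fixed.map((s) => s.action)); // [ 'wait', 'click' ]
```

The helper only moves the narration earlier. It does not rewrite the copy, so a line that describes a finished action still needs a human or an agent to rephrase it.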

Manual commands

The skill can guide an agent, but the pieces are also available manually:

# Initialize .video-docs/ in a project
npm run init-project /path/to/your-project

# Analyze a running app
npx tsx scripts/analyze-app.ts /path/to/your-project/.video-docs

# Run the full pipeline for one flow
bash scripts/run-all.sh /path/to/your-project/.video-docs/flows/01-login.json

# Re-record without regenerating audio
bash scripts/run-all.sh /path/to/your-project/.video-docs/flows/01-login.json --skip-audio

# Generate a React docs site
npx tsx scripts/generate-docs-site.ts /path/to/your-project/.video-docs

The --skip-audio flag is useful in real work. If the copy and narration are already approved but the UI changed, you should not need to regenerate the voice. Re-recording the browser layer is enough.
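One way to decide whether skipping audio is safe is to compare the narration of the old and new versions of a flow. This sketch is not how the tool decides anything; the helper names are hypothetical. It assumes audio depends only on the narration text and step order, which would not hold if the voice or provider also changed.

```typescript
// Only the fields this check reads; real flow steps carry more.
interface NarratedStep {
  narration?: string;
  selector?: string;
}

// Serializes the narration of every step in order, with null for silent steps.
// Selectors and timing are left out on purpose.
function narrationFingerprint(steps: NarratedStep[]): string {
  return JSON.stringify(steps.map((step) => step.narration ?? null));
}

// True when both versions of a flow would speak exactly the same lines.
function narrationUnchanged(before: NarratedStep[], after: NarratedStep[]): boolean {
  return narrationFingerprint(before) === narrationFingerprint(after);
}

const approved: NarratedStep[] = [
  { narration: "We click Login to authenticate." },
  { selector: "#login-btn" },
];
// Same narration, but the button selector changed after a UI update.
const updated: NarratedStep[] = [
  { narration: "We click Login to authenticate." },
  { selector: "[data-testid='login']" },
];

console.log(narrationUnchanged(approved, updated)); // true
```

The check is deliberately conservative: adding or removing a step changes the fingerprint even when no spoken line changed.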

TTS choices

Video Docs Builder supports several narration providers:

  • Piper: local and free
  • ElevenLabs: high quality, remote
  • OpenAI TTS: remote, good quality

For internal docs, Piper can be enough. For public onboarding or polished demos, a remote provider may be worth it.

The point is that the TTS provider should be a configuration detail, not the whole architecture.
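As a sketch of what "configuration detail" can mean in code, the provider can sit behind one small interface and be chosen by name. Everything here is hypothetical: the tts_provider config key, the interface, and the stub implementations are stand-ins, not the project's actual code, and no real TTS API is called.

```typescript
type ProviderName = "piper" | "elevenlabs" | "openai";

interface TtsProvider {
  name: ProviderName;
  local: boolean;
  synthesize(text: string): Promise<Uint8Array>;
}

// Stand-in provider: a real one would return encoded audio for the text.
function stubProvider(name: ProviderName, local: boolean): TtsProvider {
  return {
    name,
    local,
    synthesize: async (text: string) => new Uint8Array(text.length),
  };
}

const PROVIDERS = new Map<string, TtsProvider>([
  ["piper", stubProvider("piper", true)],
  ["elevenlabs", stubProvider("elevenlabs", false)],
  ["openai", stubProvider("openai", false)],
]);

// Assumption: the config names a provider and falls back to the local one.
function providerFromConfig(config: { tts_provider?: string }): TtsProvider {
  const name = config.tts_provider ?? "piper";
  const provider = PROVIDERS.get(name);
  if (!provider) {
    throw new Error(`Unknown TTS provider: ${name}`);
  }
  return provider;
}

console.log(providerFromConfig({}).name); // piper
console.log(providerFromConfig({ tts_provider: "openai" }).local); // false
```

With this shape, the rest of the pipeline only ever calls synthesize, so switching from a local voice to a remote one is a one-line config change.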

Where this helps

I see this being useful in a few places:

  • onboarding flows
  • QA handoff videos
  • customer support docs
  • release notes with visual walkthroughs
  • internal admin panel documentation
  • demo environments that change often

The best use case is not a one-off marketing video. It is a repeated flow that changes over time and should stay documented.

What I would like agents to do with it

My ideal agent workflow looks like this:

  1. The user points the agent at a running local app.
  2. The agent inspects the app and proposes flows.
  3. The user chooses the important ones.
  4. The agent writes flow JSON.
  5. The agent runs the pipeline.
  6. The user reviews the MP4.
  7. The agent fixes selectors, copy, or timing.
  8. The final flows become part of the repo.

That gives documentation a maintenance path. When the UI changes, the agent can re-run the same flow and adjust the small broken pieces.

Repository

https://github.com/tecnomanu/video-docs-builder

If you are building tools for agents, this project is a good example of the direction I like: not a giant autonomous system, but a structured workflow where an agent can do useful production work because the steps are explicit.
