Manuel Bruña

FrameVOX: A Video Production CLI for Agent-Made Social Videos

FrameVOX is my new video production project.

Repository:

https://github.com/tecnomanu/framevox

The short version: FrameVOX is a CLI for creating publish-ready videos with HTML compositions, voice generation, templates, and rendering.

The more accurate version: FrameVOX is a thin practical layer around HyperFrames plus TTS providers. It does not replace HyperFrames. It packages the parts that agents and humans usually trip over when turning a video idea into an actual MP4.

The problem

HTML-to-video tooling is powerful, but the production workflow has many small failure points.

If you do it manually, a typical path looks like this:

mkdir my-promo
cd my-promo

# Create index.html, style.css, assets, timing, scenes, voice script.
# Wire data attributes correctly.
# Generate or record voice.
# Convert PCM to MP3 if a provider returns raw audio.
# Check whether the voice file is valid.
# Run lint.
# Render with headless Chrome.
# Fix timing.
# Render again.

None of these steps are impossible. But together they create enough friction that the workflow becomes fragile, especially when an AI agent is doing the work.
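As an illustration of how small these steps are in isolation, here is a minimal sketch of the PCM-to-MP3 step from the list above. It only builds and runs an ffmpeg command. The function names are hypothetical, and the defaults (headerless 16-bit little-endian PCM, 24 kHz, mono) are assumptions about what a provider returns, not something FrameVOX documents here.

```python
import subprocess
from pathlib import Path


def pcm_to_mp3_command(pcm_path, mp3_path, sample_rate=24000, channels=1):
    """Build an ffmpeg command that encodes raw 16-bit little-endian PCM as MP3.

    The sample rate and channel count are assumed defaults; check what your
    TTS provider actually returns before relying on them.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-f", "s16le",             # input is headerless signed 16-bit LE PCM
        "-ar", str(sample_rate),   # input sample rate in Hz
        "-ac", str(channels),      # input channel count
        "-i", str(pcm_path),
        str(mp3_path),
    ]


def convert_pcm_to_mp3(pcm_path, mp3_path, **kwargs):
    """Run the conversion; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(pcm_to_mp3_command(pcm_path, mp3_path, **kwargs), check=True)
    return Path(mp3_path)
```

Raw PCM has no header, so ffmpeg cannot guess the format. If the sample rate is wrong, the voice plays back too fast or too slow, and that kind of failure only shows up late in the workflow.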

Agents are good at writing files. They are less reliable when a workflow depends on hidden setup, missing keys, unclear templates, manual voice conversion, and render failures that only surface late.

FrameVOX tries to make the path explicit:

npx framevox init my-promo --template minimal-mobile
npx framevox add-key gemini YOUR_GEMINI_KEY
npx framevox voice
npx framevox render

That creates one project folder, one voice file, and one rendered MP4.

What FrameVOX wraps

FrameVOX uses HyperFrames for the HTML-to-video layer.

HyperFrames is still the rendering engine. It provides the composition model, preview, linting, inspection, and rendering pipeline. FrameVOX adds a production wrapper around it:

  • project scaffolding from templates
  • TTS key storage
  • Gemini, Piper, and ElevenLabs voice generation
  • MD5 sanity checks for generated audio
  • lint-before-render defaults
  • template discovery
  • generated recipes and project docs
  • agent skill setup
  • update and status commands

That split is important.

FrameVOX is not trying to become a full design app. It is trying to make a repeatable project workflow around a real renderer.
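To make the "MD5 sanity checks" item above more concrete, here is a rough sketch of one thing such a check could do. This is not FrameVOX's actual implementation, which is not described in this post. It assumes the goal is to flag generated audio files that are empty or byte-identical to another file, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large audio files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_audio(paths):
    """Return (path, reason) pairs for files that are empty or duplicate an earlier file."""
    seen = {}       # checksum -> first path that produced it
    suspects = []
    for path in map(Path, paths):
        if path.stat().st_size == 0:
            suspects.append((path, "empty file"))
            continue
        checksum = md5_of_file(path)
        if checksum in seen:
            suspects.append((path, f"same bytes as {seen[checksum].name}"))
        else:
            seen[checksum] = path
    return suspects
```

MD5 is used here only as a cheap fingerprint, not for security. Two scenes with different text should never hash to the same audio.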

Requirements

FrameVOX currently expects:

Node.js >= 22
ffmpeg
Chrome or Chromium
piper binary, only if using Piper

The package exposes:

{
  "name": "framevox",
  "version": "0.2.0",
  "bin": {
    "framevox": "bin/framevox.js"
  }
}

HyperFrames is bundled as a dependency, so the common render path does not require manually installing it first.

Project structure

Once init, voice, and render have all run, a project looks like this:

my-promo/
├── index.html
├── voice.json
├── DESIGN.md
├── assets/
├── voice.mp3
├── output.mp4
├── RECIPE.md
└── .framevox/
    └── config.json

The files have different responsibilities:

  • index.html is the actual video composition.
  • voice.json is the voice script.
  • DESIGN.md is where product, brand, colors, and copy should be clarified before editing.
  • assets/ holds logos and media.
  • .framevox/config.json stores provider and render metadata.
  • RECIPE.md documents how the project was produced.

For agent workflows, that structure matters. The agent can inspect DESIGN.md, edit the composition, adjust voice.json, regenerate audio, and render again without guessing where things belong.
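Because the layout is fixed, an agent (or a human) can check where a project stands before doing anything else. The sketch below is a hypothetical helper, not part of FrameVOX. The file names come from the tree above, but treating index.html, voice.json, DESIGN.md, and .framevox/config.json as the files that must already exist is my assumption.

```python
from pathlib import Path

# File names come from the project tree; which of them exist right after
# `init` (as opposed to after `voice` or `render`) is an assumption.
SOURCE_FILES = ["index.html", "voice.json", "DESIGN.md", ".framevox/config.json"]


def project_state(project_dir):
    """Summarize which expected files exist in a project folder."""
    root = Path(project_dir)
    return {
        "missing_sources": [name for name in SOURCE_FILES if not (root / name).is_file()],
        "voice_ready": (root / "voice.mp3").is_file(),
        "rendered": (root / "output.mp4").is_file(),
    }
```

A result with an empty missing_sources list and voice_ready set to False suggests the next step is generating the voice, not rendering.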

Templates

FrameVOX ships templates grouped by family.

Current built-in families include:

minimal
promo
studio
immersive

Each family can have mobile and desktop variants. For example:

templates/promo/
├── style.css
├── family.json
├── mobile/
└── desktop/

That lets a family share visual identity while still supporting multiple output formats.

Useful commands:

framevox templates
framevox templates --json
framevox templates add promo
framevox templates install promo
framevox init my-reel --template promo-mobile

The template lookup order is:

project .framevox/templates/
user ~/.framevox/templates/
builtin templates

That gives you a clean path from experimentation to reusable brand templates. Start with a built-in family. Copy it into a project. Customize it for a brand. Install it globally when it becomes reusable.
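The lookup order above amounts to a "first match wins" search. Here is a small sketch of that idea, not FrameVOX's real resolver. It assumes that project and user template folders mirror the built-in family/variant layout, and that a name like promo-mobile splits into family and variant at the last dash. The function names and the builtin_dir parameter are hypothetical.

```python
from pathlib import Path


def split_template_name(name):
    """Split 'promo-mobile' into ('promo', 'mobile').

    Assumes the variant is the last dash-separated part of the name.
    """
    family, _, variant = name.rpartition("-")
    if not family:
        raise ValueError(f"expected '<family>-<variant>', got {name!r}")
    return family, variant


def resolve_template(name, project_dir, home_dir, builtin_dir):
    """Return the first matching template folder: project, then user, then built-in."""
    family, variant = split_template_name(name)
    search_roots = [
        Path(project_dir) / ".framevox" / "templates",
        Path(home_dir) / ".framevox" / "templates",
        Path(builtin_dir),
    ]
    for root in search_roots:
        candidate = root / family / variant
        if candidate.is_dir():
            return candidate
    return None  # no root contains this family/variant
```

Putting the project folder first is what makes per-project customization safe: a brand-specific copy shadows the built-in family without modifying it.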

Voice generation

FrameVOX supports several TTS paths:

framevox add-key gemini YOUR_GEMINI_KEY
framevox add-key elevenlabs YOUR_ELEVENLABS_KEY
framevox add-key piper-voice en_US-lessac-medium
framevox keys

Keys live in ~/.framevox/.env. They should never be committed.

The voice file is simple:

{
  "prompt": "Read in a warm, conversational tone:",
  "text": "Meet Crewdesk... the scheduling tool that keeps field teams aligned."
}

For longer videos, you can use scenes:

{
  "prompt": "Read with an energetic product launch tone:",
  "gap": 0.3,
  "scenes": [
    { "id": "hook", "text": "Your team schedule changed again." },
    { "id": "problem", "text": "Now three people are looking at three different plans." },
    { "id": "solution", "text": "Crewdesk keeps every shift, route, and update in one place." },
    { "id": "cta", "text": "Launch your first schedule today." }
  ]
}

You can regenerate a single scene, by id or by number:

framevox voice --scene hook
framevox voice --scene 2

After generation, FrameVOX writes a voice timeline. That matters because video timing should be based on measured audio, not guessed text length.
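To show why measured durations matter, here is a minimal sketch of building a timeline from them. The format of the timeline file FrameVOX writes is not shown in this post, so the output shape below is invented for illustration. It also assumes that gap in voice.json means seconds of silence between consecutive scenes, and that durations have already been measured per scene id.

```python
def build_timeline(voice, durations):
    """Compute start/end times for each scene from measured audio durations.

    voice:     parsed voice.json in the "scenes" form, with an optional "gap"
    durations: {scene_id: seconds}, measured from the generated audio
    """
    gap = voice.get("gap", 0.0)  # assumed: silence between scenes, in seconds
    timeline = []
    cursor = 0.0
    for scene in voice["scenes"]:
        start = cursor
        end = start + durations[scene["id"]]
        timeline.append({"id": scene["id"], "start": round(start, 3), "end": round(end, 3)})
        cursor = end + gap
    return timeline
```

With a gap of 0.3 and measured durations of 2.0 and 3.5 seconds, the second scene starts at 2.3 and ends at 5.8. Visual scene changes in index.html can then be aligned to those numbers instead of to an estimate based on word count.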

Emotion tags for Gemini

Gemini TTS supports inline delivery changes through tags:

{
  "prompt": "Read in Spanish rioplatense, warm and conversational:",
  "text": "Bueno che, mirá esto... [excited]Esto acelera todo.[/excited] ... [whisper]Y no tenés que tocar ffmpeg a mano.[/whisper]"
}

Tags are delivery instructions, not words to speak.

Examples:

  • [eloquent]...[/eloquent] for stronger theatrical emphasis
  • [sad]...[/sad] for subdued delivery
  • [whisper]...[/whisper] for quiet delivery
  • [excited]...[/excited] for higher energy

For short promos, a single text field with ellipsis pauses often works better than many separate API calls. For longer scripts, scenes give better control and easier regeneration.

Render workflow

The main render path is intentionally boring:

framevox lint
framevox preview
framevox render

Render can take options:

framevox render --out output/product-launch.mp4
framevox render --quality high
framevox render --no-lint

By default, render lints first, so obvious composition mistakes are caught before any time is spent rendering. Pass --no-lint to skip that check.
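An agent driving the CLI from a script can apply the same "cheap check first" idea explicitly. The sketch below is a hypothetical wrapper using only the commands listed in this post. It assumes framevox lint exits with a non-zero status when it finds problems, which this post does not confirm.

```python
import subprocess


def lint_then_render(run=subprocess.run, out=None):
    """Run `framevox lint`, and render only if lint succeeded.

    `run` is injectable so the flow can be exercised without the CLI installed.
    Returns True only when both steps exit with status 0.
    """
    lint = run(["framevox", "lint"])
    if lint.returncode != 0:  # assumed: non-zero exit means lint found problems
        return False
    # Lint has just passed, so skip the lint that render would run by default.
    render = ["framevox", "render", "--no-lint"]
    if out:
        render += ["--out", out]
    return run(render).returncode == 0
```

Keeping lint as a separate step lets the agent read the lint output and fix the composition before starting a render.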

Agent setup

FrameVOX is also an agent workflow.

Run:

npx framevox setup

It detects supported agent apps such as Claude Code, Cursor, Codex, Antigravity, and OpenCode, then installs the FrameVOX skill where it makes sense.

Check the installed state and update when needed:

framevox status
framevox update --check
framevox update

That means the CLI and the agent instructions can stay in sync. This is important because video composition has many rules that are easier to follow when the agent has explicit guidance.

Where I expect to use it

FrameVOX is useful when the output should be short and publishable:

  • product launch reels
  • feature announcements
  • social demos
  • quick explainers
  • AI-generated news-style updates
  • branded templates for repeated campaigns

Video Docs Builder is more about documenting real app flows. FrameVOX is more about producing social videos from a script, brand, assets, and a template.

They are related, but they solve different jobs.

Full command map

framevox init [name] [--template T]
framevox voice [--provider P] [--voice V] [--scene id]
framevox render [--out file] [--quality Q] [--no-lint]
framevox lint
framevox preview
framevox recipe [title]
framevox add-key <name> <value>
framevox keys
framevox templates
framevox templates add <name>
framevox templates install <name>
framevox status
framevox setup
framevox setup --skip-hf-skills
framevox update
framevox update --check

Why I built it

I keep building tools in the same direction: make agents useful by giving them a real workflow, not just a prompt.

FrameVOX is part of that. It gives an agent a project structure, templates, voice generation, validation, and a render command. The agent can still be creative in copy, layout, timing, and composition, but the production path is stable.

That is the kind of AI tooling I want more of: less mystery, more repeatable output.

Repository:

https://github.com/tecnomanu/framevox