Manuel Bruña

Posted on Jun 9

FrameVOX: A Video Production CLI for Agent-Made Social Videos

#ai #video #opensource #tutorial

FrameVOX is my new video production project.

Repository:

https://github.com/tecnomanu/framevox

The short version: FrameVOX is a CLI for creating publish-ready videos with HTML compositions, voice generation, templates, and rendering.

The more accurate version: FrameVOX is a thin, practical layer around HyperFrames plus TTS providers. It does not replace HyperFrames. It packages the parts that agents and humans usually trip over when turning a video idea into an actual MP4.

The problem

HTML-to-video tooling is powerful, but the production workflow has many small failure points.

If you do it manually, a typical path looks like this:

mkdir my-promo
cd my-promo

# Create index.html, style.css, assets, timing, scenes, voice script.
# Wire data attributes correctly.
# Generate or record voice.
# Convert PCM to MP3 if a provider returns raw audio.
# Check whether the voice file is valid.
# Run lint.
# Render with headless Chrome.
# Fix timing.
# Render again.

None of these steps is hard on its own. But together they create enough friction that the workflow becomes fragile, especially when an AI agent is doing the work.
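To make one of those manual steps concrete, here is a minimal Python sketch that builds an ffmpeg command for the PCM-to-MP3 conversion. The sample format (signed 16-bit little-endian, 24 kHz, mono) and the helper name `pcm_to_mp3_command` are assumptions for illustration, so check what your TTS provider actually returns. It is not a description of how FrameVOX converts audio internally.

```python
from pathlib import Path


def pcm_to_mp3_command(pcm_path, mp3_path, sample_rate=24000, channels=1):
    """Build an ffmpeg argv that reads headerless PCM and encodes it as MP3.

    Assumes signed 16-bit little-endian samples (ffmpeg format "s16le").
    Raw PCM has no header, so rate and channel count must be given explicitly.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-f", "s16le",             # input is raw signed 16-bit little-endian PCM
        "-ar", str(sample_rate),   # input sample rate in Hz
        "-ac", str(channels),      # input channel count
        "-i", str(Path(pcm_path)),
        str(Path(mp3_path)),       # output format is inferred from the .mp3 suffix
    ]


if __name__ == "__main__":
    print(" ".join(pcm_to_mp3_command("voice.pcm", "voice.mp3")))
```

Passing the list to `subprocess.run(..., check=True)` would run the conversion and raise if ffmpeg exits with an error.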

Agents are good at writing files. They are less reliable when a workflow depends on hidden setup, missing keys, unclear templates, manual voice conversion, and render failures that only show up late.

FrameVOX tries to make the path explicit:

npx framevox init my-promo --template minimal-mobile
npx framevox add-key gemini YOUR_GEMINI_KEY
npx framevox voice
npx framevox render

That creates one project folder, one voice file, and one rendered MP4.

What FrameVOX wraps

FrameVOX uses HyperFrames for the HTML-to-video layer.

HyperFrames is still the rendering engine. It provides the composition model, preview, linting, inspection, and rendering pipeline. FrameVOX adds a production wrapper around it:

project scaffolding from templates
TTS key storage
Gemini, Piper, and ElevenLabs voice generation
MD5 sanity checks for generated audio
lint-before-render defaults
template discovery
generated recipes and project docs
agent skill setup
update and status commands

That split is important.

FrameVOX is not trying to become a full design app. It is trying to make a repeatable project workflow around a real renderer.
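The "MD5 sanity checks" item is not described in more detail in this post. As a rough idea of what such a check could look like, the Python sketch below hashes an audio file and compares it with a digest from a previous run. The function names and the "non-empty and changed" rule are assumptions for illustration, not FrameVOX's actual logic.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audio_looks_fresh(path, previous_md5=None):
    """Hypothetical check: file exists, is non-empty, and differs from the last run."""
    audio = Path(path)
    if not audio.is_file() or audio.stat().st_size == 0:
        return False
    return previous_md5 is None or md5_of_file(audio) != previous_md5
```

A check like this catches the case where a provider call silently fails and an empty or stale file would otherwise be rendered into the video.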

Requirements

FrameVOX currently expects:

Node.js >= 22
ffmpeg
Chrome or Chromium
piper binary, only if using Piper

The package exposes:

{
  "name": "framevox",
  "version": "0.2.0",
  "bin": {
    "framevox": "bin/framevox.js"
  }
}

HyperFrames is bundled as a dependency, so the common render path does not require manually installing it first.

Project structure

After init, voice, and render, a project looks like this:

my-promo/
├── index.html
├── voice.json
├── DESIGN.md
├── assets/
├── voice.mp3
├── output.mp4
├── RECIPE.md
└── .framevox/
    └── config.json

The files have different responsibilities:

index.html is the actual video composition.
voice.json is the voice script.
DESIGN.md is where product, brand, colors, and copy should be clarified before editing.
assets/ holds logos and media.
.framevox/config.json stores provider and render metadata.
RECIPE.md documents how the project was produced.

For agent workflows, that structure matters. The agent can inspect DESIGN.md, edit the composition, adjust voice.json, regenerate audio, and render again without guessing where things belong.
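An agent (or a pre-flight script) could verify that layout before doing any work. The Python sketch below reports which entries from the tree above are missing. Treating these five entries as required before voice generation is an assumption of this example, and `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Entries named in the project tree above. Treating all of them as required
# before voice generation is an assumption of this sketch.
EXPECTED_BEFORE_VOICE = [
    "index.html",
    "voice.json",
    "DESIGN.md",
    "assets",
    ".framevox/config.json",
]


def missing_entries(project_dir, expected=EXPECTED_BEFORE_VOICE):
    """Return the expected files or folders that do not exist in project_dir."""
    root = Path(project_dir)
    return [name for name in expected if not (root / name).exists()]
```

An empty list means the project has the shape the rest of the workflow expects.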

Templates

FrameVOX ships templates grouped by family.

Current built-in families include:

minimal
promo
studio
immersive

Each family can have mobile and desktop variants. For example:

templates/promo/
├── style.css
├── family.json
├── mobile/
└── desktop/

That lets a family share visual identity while still supporting multiple output formats.

Useful commands:

framevox templates
framevox templates --json
framevox templates add promo
framevox templates install promo
framevox init my-reel --template promo-mobile

The template lookup order is:

project .framevox/templates/
user ~/.framevox/templates/
builtin templates

That gives you a clean path from experimentation to reusable brand templates. Start with a built-in family. Copy it into a project. Customize it for a brand. Install it globally when it becomes reusable.
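The lookup order can be sketched in a few lines of Python. This assumes each template family lives in a directory named after it, and `resolve_template` is a hypothetical helper. The real CLI may resolve names such as promo-mobile differently.

```python
from pathlib import Path


def resolve_template(name, project_dir, home_dir, builtin_dir):
    """Return the first directory that contains the template, or None.

    Candidates are checked in the order project, user, builtin.
    """
    candidates = [
        Path(project_dir) / ".framevox" / "templates" / name,  # project
        Path(home_dir) / ".framevox" / "templates" / name,     # user
        Path(builtin_dir) / name,                               # builtin
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Because the project directory is checked first, a project-local copy of a family shadows both the user-level and the built-in version with the same name.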

Voice generation

FrameVOX supports several TTS paths:

framevox add-key gemini YOUR_GEMINI_KEY
framevox add-key elevenlabs YOUR_ELEVENLABS_KEY
framevox add-key piper-voice en_US-lessac-medium
framevox keys

Keys live in ~/.framevox/.env. They should never be committed.
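If you want a script to confirm which keys are configured without printing them, a sketch like the following could work. It assumes the file uses plain KEY=VALUE lines. The variable name `GEMINI_API_KEY` in the usage note is hypothetical, since this post does not show the actual key names.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mask(secret, visible=4):
    """Show only the last few characters, for safe logging."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

For example, `mask(parse_env(text)["GEMINI_API_KEY"])` would log only the last four characters of that (hypothetically named) key.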

The voice file is simple:

{
  "prompt": "Read in a warm, conversational tone:",
  "text": "Meet Crewdesk... the scheduling tool that keeps field teams aligned."
}

For longer videos, you can use scenes:

{
  "prompt": "Read with an energetic product launch tone:",
  "gap": 0.3,
  "scenes": [
    { "id": "hook", "text": "Your team schedule changed again." },
    { "id": "problem", "text": "Now three people are looking at three different plans." },
    { "id": "solution", "text": "Crewdesk keeps every shift, route, and update in one place." },
    { "id": "cta", "text": "Launch your first schedule today." }
  ]
}
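Before spending API calls, it can help to validate the script file. The Python sketch below checks the two shapes shown above. The rules it enforces (exactly one of "text" or "scenes", unique scene ids, non-negative "gap") are assumptions drawn from these examples, not a published schema.

```python
def validate_voice(doc):
    """Return a list of problems found in a parsed voice.json document."""
    problems = []
    has_text = isinstance(doc.get("text"), str) and doc["text"].strip() != ""
    scenes = doc.get("scenes")
    if has_text == bool(scenes):
        problems.append("expected exactly one of 'text' or 'scenes'")

    seen = set()
    for index, scene in enumerate(scenes or []):
        scene_id = scene.get("id")
        if not scene_id:
            problems.append(f"scene {index} has no id")
        elif scene_id in seen:
            problems.append(f"duplicate scene id: {scene_id}")
        seen.add(scene_id)
        if not str(scene.get("text", "")).strip():
            problems.append(f"scene {index} has empty text")

    gap = doc.get("gap", 0)
    if not isinstance(gap, (int, float)) or gap < 0:
        problems.append("gap must be a non-negative number")
    return problems
```

Load the file with `json.load` and treat an empty result as "safe to generate".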

You can regenerate one part:

framevox voice --scene hook
framevox voice --scene 2

After generation, FrameVOX writes a voice timeline. That matters because video timing should be based on measured audio, not guessed text length.
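To show why measured durations matter, here is a small Python sketch that turns per-scene audio lengths into start and end offsets. The output shape and the reading of "gap" as silence between consecutive scenes are assumptions. The timeline file FrameVOX writes may look different.

```python
def build_timeline(durations, gap=0.0):
    """Turn (scene_id, seconds) pairs into start/end offsets.

    `durations` are measured from the generated audio; `gap` is treated as
    silence between consecutive scenes (an assumption about the "gap" field).
    """
    timeline = []
    cursor = 0.0
    for scene_id, seconds in durations:
        start = cursor
        end = start + seconds
        timeline.append({"id": scene_id, "start": round(start, 3), "end": round(end, 3)})
        cursor = end + gap
    return timeline
```

Scene animations in the composition can then be keyed to these offsets instead of to a guess based on word count.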

Emotion tags for Gemini

Gemini TTS supports inline delivery changes through tags:

{
  "prompt": "Read in Spanish rioplatense, warm and conversational:",
  "text": "Bueno che, mirá esto... [excited]Esto acelera todo.[/excited] ... [whisper]Y no tenés que tocar ffmpeg a mano.[/whisper]"
}

Tags are delivery instructions, not words to speak.

Examples:

[eloquent]...[/eloquent] for stronger theatrical emphasis
[sad]...[/sad] for subdued delivery
[whisper]...[/whisper] for quiet delivery
[excited]...[/excited] for higher energy
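Since tags are not spoken, anything that works from the script text (captions, word counts, rough duration estimates) should strip them first. The Python sketch below does that and also flags tags that were opened but never closed. It assumes tag names are lowercase letters in square brackets, as in the examples above. Both helper names are hypothetical.

```python
import re

# Matches opening and closing tags such as [excited] and [/excited].
TAG_PATTERN = re.compile(r"\[/?[a-z]+\]")


def spoken_text(text):
    """Remove delivery tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def unbalanced_tags(text):
    """Return tag names whose open and close counts differ."""
    opened = re.findall(r"\[([a-z]+)\]", text)
    closed = re.findall(r"\[/([a-z]+)\]", text)
    names = set(opened) | set(closed)
    return sorted(name for name in names if opened.count(name) != closed.count(name))
```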

For short promos, a single text field with ellipsis pauses often works better than many separate API calls. For longer scripts, scenes give better control and easier regeneration.

Render workflow

The main render path is intentionally boring:

framevox lint
framevox preview
framevox render

Render can take options:

framevox render --out output/product-launch.mp4
framevox render --quality high
framevox render --no-lint

By default, render lints the composition first, so obvious mistakes are caught before you spend time rendering. Pass --no-lint to skip that check.
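For a script or agent that drives the CLI, the lint-then-render order can be written down as an explicit plan. The Python sketch below uses only the flags shown in this post. Running lint as a separate step is an illustration of the ordering, not a description of how the CLI is built, and `render_plan` is a hypothetical helper.

```python
def render_plan(out=None, quality=None, no_lint=False):
    """Return the framevox invocations for one render, in order."""
    steps = []
    if not no_lint:
        steps.append(["framevox", "lint"])
    render = ["framevox", "render"]
    if out:
        render += ["--out", out]
    if quality:
        render += ["--quality", quality]
    steps.append(render)
    return steps
```

Running each step with `subprocess.run(step, check=True)` stops at the first failure, so a lint error never reaches the slower render step.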

Agent setup

FrameVOX is also an agent workflow.

Run:

npx framevox setup

It detects supported agent apps such as Claude Code, Cursor, Codex, Antigravity, and OpenCode, then installs the FrameVOX skill where it makes sense.

Check state:

framevox status
framevox update --check
framevox update

That means the CLI and the agent instructions can stay in sync. This is important because video composition has many rules that are easier to follow when the agent has explicit guidance.

Where I expect to use it

FrameVOX is useful when the output should be short and publishable:

product launch reels
feature announcements
social demos
quick explainers
AI-generated news-style updates
branded templates for repeated campaigns

Video Docs Builder is more about documenting real app flows. FrameVOX is more about producing social videos from a script, brand, assets, and a template.

They are related, but they solve different jobs.

Full command map

framevox init [name] [--template T]
framevox voice [--provider P] [--voice V] [--scene id]
framevox render [--out file] [--quality Q] [--no-lint]
framevox lint
framevox preview
framevox recipe [title]
framevox add-key <name> <value>
framevox keys
framevox templates
framevox templates add <name>
framevox templates install <name>
framevox status
framevox setup
framevox setup --skip-hf-skills
framevox update
framevox update --check

Why I built it

I keep building tools in the same direction: make agents useful by giving them a real workflow, not just a prompt.

FrameVOX is part of that. It gives an agent a project structure, templates, voice generation, validation, and a render command. The agent can still be creative in copy, layout, timing, and composition, but the production path is stable.

That is the kind of AI tooling I want more of: less mystery, more repeatable output.

Repository:

https://github.com/tecnomanu/framevox
