정상록
HeyGen HyperFrames: An Open-Source Video Framework Built for AI Agents (Not Humans)

TL;DR

HeyGen open-sourced HyperFrames under Apache 2.0 in 2026. Instead of expressing programmable video as React components (as Remotion does), you write plain HTML with data-* attributes and GSAP timelines. The design goal is explicit: AI coding agents are the primary users, not humans.

```shell
npx skills add heygen-com/hyperframes
```

This single command installs five slash commands into Claude Code / Cursor / Codex / Gemini CLI and turns your agent into a video editor.

Why Another Video Framework?

The homepage headline is the thesis statement: "Now Claude Code can edit videos."

Content automation pipelines have agent-friendly tools for research, writing, and image generation. Video was the missing piece. The question HyperFrames answers is: "What abstraction level do AI agents handle best?"

The answer, according to HeyGen: HTML. Not JSX, not imperative timeline APIs, just HTML.

The Core Primitive

```html
<div id="root" data-composition-id="root"
     data-start="0" data-width="1920" data-height="1080">
  <video id="clip-1" data-start="0" data-duration="5" data-track-index="0"
         src="intro.mp4" muted playsinline></video>
  <img id="overlay" class="clip" data-start="2" data-duration="3"
       data-track-index="1" src="logo.png" />
  <audio id="bg-music" data-start="0" data-duration="9" data-track-index="2"
         data-volume="0.5" src="music.wav"></audio>
</div>
```

That is the full mental model. Four clip types:

  • <video> — must be muted
  • <img> — static visuals
  • <audio> — separated from video
  • <div data-composition-id> — nested compositions

Five attributes cover timing (data-start, data-duration), layering (data-track-index), composition identity (data-composition-id), and optional volume (data-volume). Adding class="clip" tells the framework to honor the data-start/data-duration window.

Determinism Is Non-Negotiable

One of the seven official "must follow" rules caught my eye:

Math.random() is forbidden. If you need randomness, use a seeded PRNG like mulberry32.

That level of commitment to determinism is rare in video tooling. The reasoning is clear: agent-driven pipelines need the same input to produce identical bytes every time, otherwise you cannot put rendering in CI.
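mulberry32 is a well-known public-domain 32-bit seeded PRNG, small enough to paste inline. A minimal version, to show why it satisfies the rule (the framework itself does not ship this exact code; it merely recommends the algorithm):

```javascript
// mulberry32: a tiny 32-bit seeded PRNG — same seed, same sequence, every run.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators seeded identically produce identical streams,
// so a render in CI reproduces exactly what was rendered locally.
const rngA = mulberry32(42);
const rngB = mulberry32(42);
console.log(rngA() === rngB()); // true
```

Seed it from something stable (a scene ID, a clip index), never from Date.now(), and the "identical bytes every time" property holds.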

Other non-negotiables:

  1. Every timeline must register into window.__timelines
  2. <video> elements must be muted (audio goes into <audio> tags)
  3. GSAP timeline construction must be synchronous (no async/await/fetch)
  4. Timed elements require class="clip"
  5. Never call video.play() or audio.currentTime from scripts — the framework owns media control
  6. Every scene needs an entrance animation
  7. Scenes need transitions between them
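Putting several of these rules together, a minimal scene might look like the sketch below. Only window.__timelines, class="clip", the muted requirement, and synchronous construction come from the rules above; the IDs, selectors, and values are illustrative, and gsap is assumed to be loaded globally:

```html
<!-- Muted video clip (rule 2) with a timed window (rule 4). -->
<video id="clip-1" class="clip" data-start="0" data-duration="5"
       data-track-index="0" src="intro.mp4" muted playsinline></video>
<h1 id="title" class="clip" data-start="0" data-duration="5">Hello</h1>

<script>
  // Synchronous timeline construction (rule 3): no async, no fetch.
  const tl = gsap.timeline();
  tl.from('#title', { opacity: 0, y: 40, ease: 'power2.out', duration: 1 }); // entrance (rule 6)
  // Never call video.play() here — the framework owns media control (rule 5).
  window.__timelines = window.__timelines || [];
  window.__timelines.push(tl); // registration (rule 1)
</script>
```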

Natural Language → Technical Mapping

The prompting guide includes a mapping table that does most of the work:

| Natural language | GSAP easing |
| --- | --- |
| smooth | power2.out |
| snappy | power4.out |
| bouncy | back.out |
| springy | elastic.out |
| dramatic | expo.out |
| dreamy | sine.inOut |

The same approach for caption tones maps "Hype / Corporate / Tutorial / Storytelling / Social" to specific font weights, entrance animations, and size ranges. The user describes a feeling; the framework resolves to technique.
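In code, such a mapping is just a lookup with a sensible fallback. This is a hypothetical sketch of the idea, not HyperFrames' internal implementation; the ease strings themselves are standard GSAP names:

```javascript
// Mood adjective → GSAP ease string, per the prompting guide's table.
const MOOD_TO_EASE = {
  smooth: 'power2.out',
  snappy: 'power4.out',
  bouncy: 'back.out',
  springy: 'elastic.out',
  dramatic: 'expo.out',
  dreamy: 'sine.inOut',
};

// Unknown adjectives fall back to a neutral ease rather than erroring.
function resolveEase(word) {
  return MOOD_TO_EASE[word.toLowerCase()] ?? 'power1.out';
}

console.log(resolveEase('Bouncy')); // back.out
```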

Two Prompt Modes

Cold Start

```
10-second product intro, fade-in title, dark background, BGM, corporate mood
```

Recommended structure:

  • Duration
  • Aspect ratio (16:9 / 9:16 / 1:1)
  • Mood (energetic / calm / premium / playful)
  • Key elements

Warm Start

This is where HyperFrames shines:

```
Turn this GitHub repo into a 45-second pitch video
Turn this PDF into a 30-second summary video
```

The agent handles both research and production in a single prompt. The /website-to-hyperframes slash command is a first-class pipeline for URL → video.

Common Mistakes (The Debugging Cheat Sheet)

From the official Common Mistakes doc, here are the failure modes I would not have guessed:

1. Animating video element dimensions

```js
// ❌ Freezes frame rendering
gsap.to('#video1', { width: 1920, duration: 1 });

// ✅ Animate a wrapper div instead
gsap.to('.video-wrapper', { width: 1920, duration: 1 });
```

2. Timeline shorter than video

```js
// Extend the timeline with a zero-duration set at the target position
tl.set({}, {}, 283);
```

3. Oversized images

A 7000×5000 PNG decodes to ~140 MB of raw pixel data. Keep images at 2× canvas size max.
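The 140 MB figure follows directly from 4 bytes per RGBA pixel, regardless of how small the compressed file is on disk. A quick check:

```javascript
// Decoded bitmap cost: width × height × 4 bytes (RGBA).
const decodedBytes = (w, h) => w * h * 4;

console.log(decodedBytes(7000, 5000) / 1e6); // 140 (MB) — the oversized PNG
console.log(decodedBytes(3840, 2160) / 1e6); // ≈33 MB at 2× a 1080p canvas
```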

4. Backdrop-filter stacks

Sixteen stacked layers of backdrop-filter: blur(), recalculated every frame, will kill render performance. Cap it at 2-3 layers.

Architecture

Monorepo with clean separation:

| Package | Responsibility |
| --- | --- |
| hyperframes | CLI (create, preview, lint, render) |
| @hyperframes/core | Types, parser, linter, runtime, frame adapter |
| @hyperframes/engine | Page → video capture (Puppeteer + FFmpeg) |
| @hyperframes/producer | Full pipeline (capture + encode + audio mix) |
| @hyperframes/studio | Browser-based composition editor |
| @hyperframes/player | Embeddable <hyperframes-player> web component |
| @hyperframes/shader-transitions | WebGL shader transitions |

The Frame Adapter pattern is the extensibility story. Adapters can bring GSAP, Lottie, CSS animations, or Three.js into the render pipeline. First-mover adapters will probably shape the ecosystem.
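To make the pattern concrete: an adapter's core job is to put the page into the exact visual state for a given time, deterministically, so the engine can screenshot frame by frame. The sketch below is entirely hypothetical — registerAdapter and seekTo are names I invented for illustration, not HyperFrames' real API:

```javascript
// Hypothetical frame-adapter registry. The key contract: seekTo(t) must be
// a pure function of t, so capturing frame N always yields the same pixels.
const adapters = new Map();

function registerAdapter(name, adapter) {
  adapters.set(name, adapter);
}

registerAdapter('gsap-like', {
  seekTo(timeline, seconds) {
    timeline.time = seconds; // stand-in for a real tl.seek(seconds) call
    return timeline.time;
  },
});

const tl = { time: 0 };
console.log(adapters.get('gsap-like').seekTo(tl, 2.5)); // 2.5
```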

TTS Is Built-In

Kokoro TTS runs locally, no API key required:

```shell
npx hyperframes tts --text "Hello world" --voice af_heart --output narration.wav
```

Recommended voices by use case:

  • Product demos: af_heart, af_nova
  • Tutorials: am_adam, bf_emma
  • Marketing: af_sky, am_michael

The Component Registry

Over 50 blocks are registered and installable via CLI:

```shell
npx hyperframes add flash-through-white
npx hyperframes add instagram-follow
npx hyperframes add data-chart
```

Categories include social overlays, shader transitions, data visualizations, and cinematic effects.

Workflow I Would Adopt

  1. npx hyperframes init my-video (installs skill automatically)
  2. Open in Claude Code / Cursor / Codex
  3. /hyperframes with a warm start prompt pointing to source material
  4. npx hyperframes preview for browser live reload
  5. Small, targeted follow-up prompts: "make the title 2x larger", "add a fade-out at the end"
  6. npx hyperframes lint to catch structural issues
  7. npx hyperframes render --preset high --output final.mp4

Anti-Patterns to Avoid

From the prompting guide:

  • Asking for React/Vue components — adds a translation layer
  • Requesting 4K/60fps — 1920×1080 30fps is the sweet spot for speed
  • Skipping the slash command — the agent will fall back to generic HTML video conventions
  • Giant monolithic prompts — targeted, iterative edits beat one-shot mega-prompts

Requirements

  • Node.js 22+
  • FFmpeg

That is the entire system requirement list.

Why This Matters

The design signals a specific bet: the future of content tooling is agent-primary, human-secondary. Most frameworks treat agent support as a retrofit; HyperFrames treats it as the foundational design constraint. Whether or not that bet pays off, the engineering choices (HTML-first markup, deterministic rendering, slash-command integration) are worth studying whichever tool you end up using.
