DEV Community

Preecha
How to Edit Video with an AI Agent Using HyperFrames

TL;DR

AI agents can write code, call APIs, and run multi-step workflows. Until now, one capability kept eluding them: editing video. Professional tools like After Effects and DaVinci Resolve use layered timelines and JSON scene graphs that LLMs weren’t trained on. HeyGen’s open-source project, HyperFrames, flips the approach: agents compose video with HTML, CSS, and JavaScript, then render the result to MP4, MOV, or WebM. Install it as a Claude Code skill with one command, and your agent becomes a video editor.

Introduction

Video is one of the most engaging formats on the web, but it has been harder for AI agents to produce than text, code, images, or charts.

Prompt-to-video tools like Sora, Veo, and Runway can generate a full clip from text, but the output is usually monolithic. You can’t easily compose scenes, iterate on motion graphics, overlay precise brand animations, or ask the agent to “redo scene 3 with a slower fade.”

HeyGen shipped HyperFrames on April 17, 2026 to close this gap. Instead of teaching agents traditional video software, HyperFrames gives them a format they already know: HTML.

This guide covers:

  • Why traditional video editing is hard for agents
  • How HyperFrames maps HTML to video timelines
  • How to install and render your first composition
  • Where API testing and orchestration with Apidog fit into the workflow

Why AI agents couldn’t edit video before

Traditional video tools were built for people clicking on timelines, not agents generating source code.

The main blockers:

  1. Timeline UIs don’t map cleanly to code

Tools like After Effects, Premiere, and DaVinci Resolve use proprietary project formats or deeply nested scene graphs. Even if an agent can inspect those files, there is very little public training data for models to learn the structure.

  2. Motion graphics require visual composition

Keyframes, easing, layer blending, and transitions are usually tuned by eye. Agents need a text-first representation they can reason about, modify, diff, and regenerate.

  3. Automation APIs are limited

Scripting tools like ExtendScript can automate some editor workflows, but they are narrow and fragile compared with a normal web development stack.

The result: agents could call ffmpeg, stitch clips, and add basic overlays. Anything more advanced usually required a human editor.

The HTML-for-video insight

HeyGen’s core observation is practical: LLMs already know the web stack.

Models have seen massive amounts of HTML, CSS, JavaScript, SVG, Canvas, GSAP, Lottie, and animation examples. If you ask a strong model to create an animated landing page section, it can usually write working front-end code.

That means agents already know how to:

  • Position elements with CSS
  • Animate with CSS keyframes or GSAP
  • Render SVG paths
  • Layer scenes with z-index and opacity
  • Tween between visual states
  • Use Canvas, D3, Three.js, and Lottie

HyperFrames uses those browser primitives as the authoring format, then renders them into video frames.

How HyperFrames works

HyperFrames adds timeline metadata to normal HTML using data-* attributes.

Attribute                  Purpose
-------------------------  --------------------------------------
data-composition-id        Unique ID for the video composition
data-width / data-height   Output resolution in pixels
data-start                 Scene start time in seconds
data-duration              Scene duration in seconds
data-track-index           Layering order for overlapping scenes

The workflow is:

  1. The agent writes a normal HTML file.
  2. HyperFrames reads the timeline attributes.
  3. It runs the page in a headless browser.
  4. It captures frames at the target frame rate.
  5. It encodes the output with FFmpeg.

There is no new DSL, proprietary scene graph, or keyframe editor. Your animation logic can stay in GSAP, CSS animations, SVG, Canvas, or other browser-native tools.
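The timing math behind steps 4 and 5 is straightforward. A small sketch (function and field names are illustrative, not the actual HyperFrames API) of how scene metadata maps to frame indices at a target frame rate:

```javascript
// Sketch of the frame-capture timing math. Given a scene's data-start and
// data-duration (in seconds) plus a target frame rate, compute which frames
// the scene occupies in the rendered output.
function sceneFrameRange(startSec, durationSec, fps) {
  const firstFrame = Math.round(startSec * fps);
  const frameCount = Math.round(durationSec * fps);
  return { firstFrame, lastFrame: firstFrame + frameCount - 1, frameCount };
}

// A 5-second composition at 30 fps yields 150 frames.
console.log(sceneFrameRange(0, 5, 30)); // { firstFrame: 0, lastFrame: 149, frameCount: 150 }
```

Each captured frame is then handed to FFmpeg in order, so the encoder never needs to know anything about the browser side.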

A minimal HyperFrames composition

Here is a 5-second video composition with two scenes:

  • Scene 1: title card fades in
  • Scene 2: closing card appears through a blur crossfade
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <script src="https://cdn.jsdelivr.net/npm/gsap@3.14.2/dist/gsap.min.js"></script>
  <style>
    body {
      margin: 0;
      width: 1920px;
      height: 1080px;
      overflow: hidden;
      background: #0D1B2A;
      font-family: system-ui, sans-serif;
      color: white;
    }

    .scene {
      position: absolute;
      inset: 0;
      width: 1920px;
      height: 1080px;
      overflow: hidden;
      background: #0D1B2A;
    }

    #scene2 {
      z-index: 2;
      opacity: 0;
    }

    .s1 {
      display: flex;
      flex-direction: column;
      justify-content: center;
      padding: 120px 160px;
      gap: 20px;
    }

    .s2 {
      display: flex;
      flex-direction: column;
      justify-content: center;
      align-items: center;
      padding: 100px 160px;
      gap: 32px;
    }

    .s1-title,
    .s2-title {
      font-size: 96px;
      font-weight: 800;
      letter-spacing: -0.04em;
    }

    .s1-sub {
      font-size: 40px;
      opacity: 0.8;
    }
  </style>
</head>
<body>
  <div
    id="root"
    data-composition-id="hyperframes-intro"
    data-width="1920"
    data-height="1080"
    data-start="0"
    data-duration="5"
  >
    <div id="scene1" class="scene">
      <div class="s1">
        <div class="s1-title">HTML is Video</div>
        <div class="s1-sub">Compose. Animate. Render.</div>
      </div>
    </div>

    <div id="scene2" class="scene">
      <div class="s2">
        <div class="s2-title">Start composing.</div>
      </div>
    </div>
  </div>

  <script>
    window.__timelines = window.__timelines || {};

    const tl = gsap.timeline({ paused: true });

    // Scene 1: title entrance
    tl.from(".s1-title", {
      x: -40,
      opacity: 0,
      duration: 0.5,
      ease: "power3.out"
    }, 0.25);

    tl.from(".s1-sub", {
      y: 15,
      opacity: 0,
      duration: 0.4,
      ease: "power2.out"
    }, 0.5);

    // Blur crossfade transition
    const T = 2.2;

    tl.to("#scene1", {
      filter: "blur(8px)",
      scale: 1.03,
      opacity: 0,
      duration: 0.35,
      ease: "power2.inOut"
    }, T);

    tl.fromTo("#scene2",
      {
        filter: "blur(8px)",
        scale: 0.97,
        opacity: 0
      },
      {
        filter: "blur(0px)",
        scale: 1,
        opacity: 1,
        duration: 0.35,
        ease: "power2.inOut"
      },
      T + 0.08
    );

    window.__timelines["hyperframes-intro"] = tl;
  </script>
</body>
</html>

Two implementation details matter:

  • The animation logic is plain GSAP.
  • The HyperFrames-specific part is only the data-* metadata on the root element.
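The `window.__timelines` registration is what makes rendering deterministic: a renderer can pause the timeline and seek it to each frame's timestamp before capturing. A minimal sketch of that loop (the timeline here is a stub so the example is self-contained, and the capture call is a stand-in for a headless-browser screenshot, not the real HyperFrames internals):

```javascript
// Sketch: step a paused timeline frame by frame. In the real page this would
// be the GSAP timeline registered on window.__timelines; here it is a stub
// that just records the time it was seeked to.
const timelines = { "hyperframes-intro": { time: 0, seek(t) { this.time = t; } } };

function captureFrames(compositionId, durationSec, fps, capture) {
  const tl = timelines[compositionId];
  const total = Math.round(durationSec * fps);
  for (let frame = 0; frame < total; frame++) {
    tl.seek(frame / fps);     // advance the animation to this frame's timestamp
    capture(frame, tl.time);  // stand-in for a screenshot of the page
  }
  return total;
}

const captured = [];
captureFrames("hyperframes-intro", 1, 30, (f, t) => captured.push(t));
// captured holds 30 timestamps; the last is 29/30 of a second
```

Because the timeline is seeked rather than played in real time, render speed is independent of wall-clock playback and every frame lands exactly on its timestamp.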

To customize it, edit the HTML like any other front-end file:

  • Change copy
  • Swap colors
  • Add a logo
  • Use a web font
  • Add SVG illustrations
  • Add more scenes
  • Adjust animation timing
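Because the composition is plain text, those edits can also be scripted rather than made by hand. A rough sketch (regex-based string editing, for illustration only; a real pipeline would want a proper HTML parser) that changes the declared duration before re-rendering:

```javascript
// Sketch: bump the composition's data-duration in the HTML source.
// Regex editing is fragile for arbitrary HTML, but it shows the point:
// a composition is ordinary text an agent or script can rewrite.
function setDuration(html, seconds) {
  return html.replace(/data-duration="[\d.]+"/, `data-duration="${seconds}"`);
}

const src = '<div data-composition-id="demo" data-duration="5">...</div>';
const longer = setDuration(src, 7);
// `longer` now declares data-duration="7"
```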

What agents can use inside a composition

Because HyperFrames renders through a browser, your agent can use standard web technologies:

  • CSS animations and transitions
  • GSAP timelines
  • SVG shapes and path animations
  • Canvas animations
  • Three.js scenes
  • D3.js visualizations
  • Lottie animations
  • Google Fonts or custom web fonts
  • Background videos with <video>
  • Images with <img>

There is no separate plugin architecture for the agent to learn. It writes browser code and HyperFrames renders it.

Install HyperFrames with Claude Code

HyperFrames ships as a Claude Code skill.

If you use Claude Code, install it with:

npx skills add heygen-com/hyperframes

This fetches the skill from HeyGen’s GitHub repository, installs the toolchain, and registers video editing as an agent capability.

Then prompt your agent like this:

Build me a 10-second product explainer video for a new API.

Use a dark gradient background. Start with the product name sliding up
from the bottom with a fade. Then show three bullet points with icons.
End on a call-to-action card.

The agent can then:

  1. Generate the HTML composition
  2. Preview it locally
  3. Revise timing and styling
  4. Render the final MP4

According to the original article, this runs locally without API keys or external rendering services.

Set up HyperFrames without Claude Code

HyperFrames is framework-agnostic. Any agent that can run shell commands and edit files can use it.

Clone and install:

git clone https://github.com/heygen-com/hyperframes
cd hyperframes
npm install

Render a composition:

npx hyperframes render my-video.html --output my-video.mp4

Preview a composition:

npx hyperframes preview my-video.html

Use preview before rendering the final output. It lets you inspect timing, scene transitions, and frame-level behavior before encoding the full video.

Practical agent workflow

A useful development loop looks like this:

  1. Generate a storyboard

Ask the agent for a scene-by-scene plan:

   Create a 15-second video storyboard for a product release note.
   Include scene duration, visual layout, motion, and copy.
  2. Generate the first HTML composition

Ask for one HTML file using HyperFrames metadata and GSAP timelines.

  3. Preview locally
   npx hyperframes preview release-video.html
  4. Iterate on specific problems

Instead of asking for broad changes, give targeted instructions:

   Make scene 2 last 1 second longer.
   Slow the bullet point entrance animation.
   Keep the same colors and layout.
  5. Render the final output
   npx hyperframes render release-video.html --output release-video.mp4
  6. Post-process if needed

Use FFmpeg for advanced audio mixing, compression, or format conversion.

Developer use cases

HyperFrames is most useful when video generation is part of an automated workflow.

Automated product marketing

An agent can read release notes, generate a short product teaser, render it, and ship the MP4 to your CDN.

Personalized video responses

A webhook can trigger an agent to generate a video for a user-specific event, such as:

  • Welcome messages
  • Receipts
  • Milestone celebrations
  • Renewal reminders
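A webhook handler for this pattern can be as simple as filling a composition template with user data before rendering. A sketch, assuming a hypothetical `{{name}}`-style marker convention (this is not a HyperFrames feature, just ordinary string templating):

```javascript
// Sketch: fill a per-user composition template. The resulting HTML would
// then be passed to `npx hyperframes render` by the orchestration layer.
function fillTemplate(template, data) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => data[key] ?? "");
}

const template = '<div class="s1-title">Congrats, {{name}}! {{milestone}} 🎉</div>';
const html = fillTemplate(template, { name: "Ada", milestone: "100 renders" });
// html: '<div class="s1-title">Congrats, Ada! 100 renders 🎉</div>'
```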

Data storytelling

An agent can take metrics, generate D3 visualizations, wrap them in HyperFrames scenes, and render a narrated dashboard summary.

Podcast and long-form content graphics

An agent can read a transcript, identify key points, and generate animated B-roll or motion graphics to layer over audio.

API documentation videos

An agent can parse an OpenAPI spec and generate endpoint walkthroughs with animated request/response diagrams.
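The spec-to-storyboard step can be sketched as a pure transformation: walk the OpenAPI `paths` object and emit one scene spec per operation. The scene shape and timing values below are arbitrary assumptions, not part of any HyperFrames schema:

```javascript
// Sketch: turn an OpenAPI paths object into per-endpoint scene specs that a
// composition generator could consume downstream.
function endpointScenes(paths, secondsPerScene = 3) {
  const scenes = [];
  for (const [route, methods] of Object.entries(paths)) {
    for (const [verb, op] of Object.entries(methods)) {
      scenes.push({
        title: `${verb.toUpperCase()} ${route}`,
        summary: op.summary ?? "",
        start: scenes.length * secondsPerScene,  // scenes play back to back
        duration: secondsPerScene,
      });
    }
  }
  return scenes;
}

const spec = {
  "/videos": { get: { summary: "List videos" }, post: { summary: "Create a video" } },
};
console.log(endpointScenes(spec).length); // 2
```

Each scene spec then becomes one `data-start`/`data-duration` block in the generated composition.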

Testing the orchestration layer with Apidog

HyperFrames handles rendering. The rest of the system is orchestration:

  • Agent loops
  • LLM API requests
  • Tool calls
  • Asset retrieval
  • Webhooks
  • Brand kit lookups
  • Render job triggers
  • Error handling

That is where production failures usually happen. A malformed tool payload, timed-out LLM request, incorrect tool_use_id, or mismatched message schema can break the pipeline before HyperFrames renders a single frame.

Apidog helps test those API-driven parts.

Mock LLM endpoints

Create dummy Claude or OpenAI-compatible endpoints in Apidog with the schema your agent expects.

Use mocks to test:

  • Valid responses
  • Malformed responses
  • Delayed responses
  • Empty tool calls
  • Unexpected message structures

This lets you validate pipeline behavior before spending real LLM tokens.

Validate tool-use payloads

If your agent calls APIs for assets, stock footage, user data, or brand kits, define those endpoints in Apidog and test the request/response contracts.

Check that the agent sends:

  • Required fields
  • Correct JSON structure
  • Valid IDs
  • Expected headers
  • Proper authentication format

Track token usage

Large video compositions can include long prompts, CSS, scene descriptions, and JavaScript timelines.

The original article notes that Claude Opus 4.7 uses a tokenizer that can produce up to 35% more tokens than Opus 4.6. For agent-generated video, token sizing matters because HTML and animation code can grow quickly.

Use Apidog’s usage tracking to estimate prompt size and avoid cost surprises.
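For a quick pre-flight check before any API call, the common ~4-characters-per-token heuristic gives a rough lower bound. This is an approximation, not any model's real tokenizer, so treat it only as a way to flag compositions that are getting large:

```javascript
// Rough token estimate using the ~4 chars/token rule of thumb. Real
// tokenizers vary by model, so use this only as an early-warning signal.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

const composition = "<html>...</html>".repeat(100); // stand-in for a large file
console.log(estimateTokens(composition)); // 400 for 1600 characters
```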

Replay multi-turn agent flows

A full video generation flow often takes several turns:

  1. Plan the video
  2. Generate scenes
  3. Write HTML
  4. Preview
  5. Fix layout
  6. Adjust animation timing
  7. Render
  8. Finalize output

Apidog can help replay and inspect those API conversations when the agent goes off track.

Why HTML is a strong video format for agents

HeyGen’s broader argument is that HTML is not just convenient for agent-generated video. It may be a better interchange format for many video workflows.

Compared with proprietary video project files, HTML is:

  • Open
  • Versionable
  • Searchable
  • Editable with standard tools
  • Familiar to LLMs
  • Compatible with the browser runtime

That means HTML-based video compositions can be:

  • Diffable in Git

    You can review exactly what changed between revisions.

  • Componentized

    A title card can become a reusable component. A motion graphic can become an importable module.

  • Responsive

    The same composition can be adapted for 1080p, 4K, or vertical 9:16 layouts.

  • Accessible

    Text and semantic structure exist in source form instead of being flattened into pixels.

  • Searchable

    Text inside the video starts as actual text, not OCR output.

HyperFrames bridges browser-native content and rendered video output.

Limitations to plan for

HyperFrames is version 1, so plan around these constraints:

  • Render speed depends on complexity

    A simple GSAP text animation renders faster than a composition with Three.js particles or Canvas shaders.

  • Live video input is limited

    You can embed <video> tags, but real-time camera feeds or streaming sources need additional glue code.

  • Audio support is basic

    You can add audio tracks, but advanced mixing, ducking, EQ, or noise reduction still requires FFmpeg post-processing.

  • Output quality depends on the model

    The original article notes that Opus 4.6 and Gemini 3 were the first models to produce consistent, aesthetically strong output from plain prompts, and that Opus 4.7 is currently best for this workflow.

These are not blockers, but they matter if you are building a production pipeline.

Getting started checklist

Use this checklist to try HyperFrames:

  • [ ] Install Claude Code or prepare another agent that can run shell commands
  • [ ] Run npx skills add heygen-com/hyperframes
  • [ ] Ask the agent to create a simple 5-second video
  • [ ] Preview the generated HTML
  • [ ] Render the MP4
  • [ ] Iterate on layout, copy, timing, and scene count
  • [ ] For API-driven workflows, define LLM and tool endpoints in Apidog
  • [ ] Build one real video, such as a product teaser, data story, or release note summary
  • [ ] Star the GitHub repo at github.com/heygen-com/hyperframes

Conclusion

HyperFrames gives AI agents a practical way to edit video by using the stack they already understand: HTML, CSS, and JavaScript.

Instead of forcing agents into timeline-based desktop tools, it turns browser-native compositions into rendered video files. That makes video generation easier to automate, inspect, version, and integrate into developer workflows.

If your system generates marketing videos, personalized clips, documentation walkthroughs, or data stories, HyperFrames can handle the composition and render step. For the surrounding API workflow, test your agent conversations, tool calls, and LLM requests with Apidog before scaling.
