DEV Community

Max Quimby

Originally published at agentconn.com

Write HTML, Not JSON: HeyGen's Visual-Grounding Trick

Read the full version with screenshots and embedded sources on AgentConn: agentconn.com/blog/heygen-hyperframes-html-visual-grounding-video-ui-agents-2026

Every agent framework in 2026 tells you to return structured JSON. Schema-validated, type-safe, parseable. And for most tasks, that's correct — structured output gives agents 95–99% action success rates versus 70–85% for unstructured text.

But here's the problem nobody talks about: JSON has no visual semantics. An agent can produce a perfectly valid JSON config describing a video timeline — correct schema, valid keyframes, legal property values — and the rendered output looks like garbage. The agent wrote "correct" instructions for something it can't see.

HeyGen figured this out. Their open-source framework HyperFrames doesn't use JSON configs. It uses HTML.

Rohan Paul on X — HeyGen just open-sourced HyperFrames

The Visual-Grounding Problem

When an agent generates a JSON video config, it's working blind. A schema can confirm that the config is well-formed, but it says nothing about how the frame will look, and the agent can't reason about spatial layout in a non-visual format. The result can pass every validation check and still render badly.
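A minimal sketch of that gap, in Python. The scene schema, the field names, and the 1920x1080 canvas are all invented for illustration; they are not taken from HyperFrames or any real video tool. The point is only that a type-level check and a "can anyone see it" check answer different questions.

```python
import json

# Hypothetical scene schema, invented for this illustration only.
SCHEMA = {"text": str, "x": (int, float), "y": (int, float), "font_size": (int, float)}
CANVAS_WIDTH, CANVAS_HEIGHT = 1920, 1080  # assumed output resolution


def is_schema_valid(scene: dict) -> bool:
    """Check only what a schema can express: required keys and value types."""
    return all(
        key in scene and isinstance(scene[key], expected)
        for key, expected in SCHEMA.items()
    )


def is_on_canvas(scene: dict) -> bool:
    """Check something the schema above does not: is the text origin inside the frame?"""
    return 0 <= scene["x"] < CANVAS_WIDTH and 0 <= scene["y"] < CANVAS_HEIGHT


# Well-formed JSON with legal types, but x=5000 places the text far off-screen.
scene = json.loads('{"text": "Hello World", "x": 5000, "y": 200, "font_size": 48}')
schema_ok = is_schema_valid(scene)
visible = is_on_canvas(scene)
print(f"schema valid: {schema_ok}, visible: {visible}")
```

A stricter schema could add a range on `x`, but overlap, contrast, and composition are much harder to express as per-field constraints.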

The SeeAct-V research confirms what practitioners already know: visual grounding is a fundamental capability gap for language models working in non-visual formats.

HyperFrames: HTML as the Agent's Canvas

HyperFrames launched April 17, 2026, and hit 30,100 stars in two months. Instead of JSON configs, agents write standard HTML with CSS and data-* attributes for timing:

<div data-scene="intro" data-duration="3s">
  <h1 style="font-size: 48px; text-align: center; margin-top: 20vh;">
    Hello World
  </h1>
</div>

The agent can reason about this. It knows what text-align: center looks like. It knows margin-top: 20vh pushes the heading down. It understands CSS layout.
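The split between layout (CSS) and timing (data-* attributes) also keeps the timing easy to read back out. Here is a rough sketch using Python's standard `html.parser`. It assumes durations are always written as a number followed by `s`, as in the snippet above, and it is not how HyperFrames itself parses scenes.

```python
from html.parser import HTMLParser


class SceneParser(HTMLParser):
    """Collect (scene name, duration in seconds) from data-scene / data-duration."""

    def __init__(self):
        super().__init__()
        self.scenes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "data-scene" in attributes:
            # Assumption: durations look like "3s"; other units are not handled.
            seconds = float(attributes.get("data-duration", "0s").rstrip("s"))
            self.scenes.append((attributes["data-scene"], seconds))


markup = """
<div data-scene="intro" data-duration="3s">
  <h1 style="font-size: 48px; text-align: center; margin-top: 20vh;">
    Hello World
  </h1>
</div>
"""

parser = SceneParser()
parser.feed(markup)
scenes = parser.scenes
print(scenes)  # [('intro', 3.0)]
```

Everything visual stays in the `style` attribute, where the model's CSS knowledge applies; the parser only needs the two timing attributes.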

The architecture: headless Chrome (Puppeteer) for deterministic frame capture, FFmpeg for encoding. Supports GSAP 3, Lottie, Three.js, Anime.js, and WebGL shaders — any animation library that runs in a browser works without adapters.
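To make "deterministic frame capture" concrete: one plausible approach is to step through fixed timestamps instead of recording in real time, then hand the numbered frames to FFmpeg. The sketch below shows only the bookkeeping. The function names, the 30 fps rate, and the file paths are assumptions for illustration, and the browser-driving step is omitted; HyperFrames' actual capture loop may work differently. The FFmpeg flags are standard ones for encoding an image sequence.

```python
def frame_timestamps(duration_s: float, fps: int) -> list[float]:
    """Return the time of every frame, so each render samples the same instants."""
    total_frames = round(duration_s * fps)
    return [frame / fps for frame in range(total_frames)]


def ffmpeg_command(fps: int, pattern: str, output: str) -> list[str]:
    """Build an FFmpeg invocation that encodes numbered PNG frames to H.264."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate of the image sequence
        "-i", pattern,            # printf-style pattern, e.g. frame_00000.png
        "-c:v", "libx264",        # H.264 video encoder
        "-pix_fmt", "yuv420p",    # pixel format most players accept
        output,
    ]


# A 3-second scene at an assumed 30 fps.
timestamps = frame_timestamps(duration_s=3.0, fps=30)
command = ffmpeg_command(30, "frames/frame_%05d.png", "intro.mp4")
print(len(timestamps), "frames")
print(" ".join(command))
```

Because every frame is tied to a computed timestamp and not to wall-clock time, a slow machine produces the same frames as a fast one, just later.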

Why HTML Wins Over JSON

The HN discussion put it plainly: "It's just a superset of HTML, and agents know how to write HTML + GSAP by default."

LLMs are trained on billions of web pages. They know what display: flex looks like, that border-radius: 50% makes a circle, that font-size: 72px is large. This visual intuition doesn't exist for arbitrary JSON coordinate systems.

The Agent Skill Architecture

HyperFrames includes dedicated skills for Claude Code, Cursor, Gemini CLI, and Codex. HeyGen's own launch video was made 100% with Claude Code + HyperFrames. Nous Research's Hermes agent has an official HyperFrames skill — the first major agent framework to integrate video production natively.

The Pattern Beyond Video

The insight is general: match the output format to the model's strongest reasoning modality.

| Domain | Low-Grounding | High-Grounding |
| --- | --- | --- |
| Video | JSON config | HTML + CSS + data-* |
| Diagrams | DOT/Graphviz | SVG |
| Dashboards | Chart.js JSON | HTML grid + components |
| Presentations | Slide JSON | HTML slides + CSS |

HeyGen bet that agents think better in HTML than in JSON. Thirty thousand stars in two months suggests they were right.

