Wei Zhang
How I Built an AI Video Editor as an OpenClaw Skill

Video editing is one of those things that feels fundamentally tied to a GUI. You drag clips, you scrub timelines, you click Export. The whole workflow is built around visual feedback.

So when I set out to build nemo-video — an OpenClaw skill that lets you edit videos by chatting — the first question was: can an AI agent actually do this without a screen?

Turns out yes. But not without a few interesting problems to solve.

The Core Problem: A Backend That Thinks It Has a GUI

NemoVideo's backend AI agent was designed to work with a web interface. It would say things like:

  • "Click the Export button to download your video"
  • "Drag the clip to the timeline"
  • "Check the dashboard for your remaining credits"

None of these instructions make sense when you're talking to an AI assistant in a terminal or chat app. There's no button. There's no timeline to drag things to.

This is the fundamental challenge: the backend doesn't know it's talking to an agent, not a human with a browser.

My solution was to build an interpretation layer — a set of rules baked into the SKILL.md that teaches the OpenClaw agent how to intercept these GUI instructions and replace them with actual API calls.
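As a minimal sketch of that interpretation layer, imagine a small rule table that strips GUI phrasing from backend text and records the API action to run instead. The patterns, action names, and return shape here are all illustrative; the real rules live in SKILL.md:

```python
import re

# Illustrative rules only: (pattern for a GUI instruction, replacement action).
# An action of None means "no API equivalent, just drop the sentence".
RULES = [
    (re.compile(r"click the export button[^.]*\.?", re.I), "call_render_api"),
    (re.compile(r"drag the clip[^.]*\.?", re.I), None),
    (re.compile(r"check the dashboard[^.]*\.?", re.I), "call_balance_api"),
]

def translate(text: str) -> tuple[str, list[str]]:
    """Strip GUI instructions from backend text, collecting API actions to run."""
    actions = []
    for pattern, action in RULES:
        if pattern.search(text):
            if action:
                actions.append(action)
            text = pattern.sub("", text)
    return text.strip(), actions
```

So `translate("Subtitles added. Click the Export button to download your video.")` would return the cleaned text plus a pending `call_render_api` action for the agent to execute.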

The Architecture: Three Layers

```
User chat message
      ↓
OpenClaw agent (reads SKILL.md)
      ↓
  [Router] — decides what to do without forwarding to backend
      ↓ (most messages)
NemoVideo backend via SSE stream
      ↓
  [GUI Translator] — strips GUI instructions, executes API calls
      ↓
Response to user
```

The router is simple but important. Certain user intents should never reach the backend:

| User says | Agent does |
|-----------|------------|
| "export" / "download" | Calls render API directly |
| "how many credits" | Calls balance API directly |
| "upload this file" | Calls upload API directly |
| Everything else | Forwards to backend via SSE |

This matters because the backend charges credits for processing. If the agent forwarded "export my video" to the backend, it might start a new generation instead of just rendering the existing draft.
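The router check itself is just keyword matching before anything touches the network. A sketch, with hypothetical handler names standing in for the real API calls:

```python
# Keyword routes checked before forwarding anything to the backend.
# Handler names here are placeholders for the direct API calls.
ROUTES = [
    (("export", "download", "send video"), "render_directly"),
    (("credits", "balance"), "check_balance"),
    (("upload",), "upload_file"),
]

def route(message: str) -> str:
    """Return a direct handler name, or 'forward_sse' for everything else."""
    lowered = message.lower()
    for keywords, handler in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return handler
    return "forward_sse"
```

"Export my video" never reaches the backend; "add subtitles in Chinese" falls through to the SSE path.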

SSE Streaming: The Interesting Part

Communication with the backend happens over Server-Sent Events. Here's what a typical request looks like:

```bash
curl -s -X POST "$API/run_sse" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  --max-time 900 \
  -d '{
    "app_name": "nemo_agent",
    "user_id": "<uid>",
    "session_id": "<sid>",
    "new_message": {
      "parts": [{"text": "add subtitles to my video in Chinese"}]
    }
  }'
```

The stream returns a mix of:

  • Text responses — what the backend "says" to the user
  • Tool calls — internal operations (fetching assets, running models)
  • Heartbeat events — keepalive signals during long operations

The agent needs to handle all of these correctly. Tool calls and heartbeats should be swallowed silently. Text responses need to go through the GUI translator before being shown to the user.

Typical durations vary a lot: text responses come back in 5–15 seconds, but video generation can take 100–300 seconds. I added a "still working" message every 2 minutes so users don't think the agent crashed.
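The consumption loop can be sketched like this, assuming each SSE `data:` payload decodes to JSON with a `type` field of `"text"`, `"tool_call"`, or `"heartbeat"` (those field names are an assumption for illustration, not the actual wire format):

```python
import json
import time

def consume_stream(events, notify, still_working_every=120):
    """Collect user-visible text; swallow tool calls and heartbeats.

    events: iterable of raw JSON strings from the SSE stream.
    notify: callback for out-of-band status messages to the user.
    """
    texts = []
    last_update = time.monotonic()
    for raw in events:
        event = json.loads(raw)
        if event["type"] in ("tool_call", "heartbeat"):
            # Swallowed silently, but reassure the user during long operations.
            if time.monotonic() - last_update > still_working_every:
                notify("Still working on it...")
                last_update = time.monotonic()
            continue
        if event["type"] == "text":
            # In the real skill this would pass through the GUI translator first.
            texts.append(event["text"])
    return texts
```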

The Silent Edit Problem

This was the trickiest part to get right.

About 30% of edit operations return no text at all — only internal tool calls. The backend processes the edit, updates the timeline state, but sends zero user-visible output. From the user's perspective, they type "remove the background music" and... nothing happens.

The fix is a state-diffing approach:

1. Before the edit: query /api/state/.../latest → save current track state
2. Send edit via SSE
3. After stream closes with no text: query state again
4. Compare before/after
5. Report the diff: "✅ BGM track removed (was: Lo-fi Beats, 0–45s)"
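The comparison in steps 4–5 can be sketched as a plain dict diff. The track-state shape here (`{track_id: description}`) is a simplification; the real `/api/state` payload is richer:

```python
def diff_tracks(before: dict, after: dict) -> list[str]:
    """Describe what changed between two timeline snapshots."""
    changes = []
    for track_id, desc in before.items():
        if track_id not in after:
            changes.append(f"removed: {desc}")
    for track_id, desc in after.items():
        if track_id not in before:
            changes.append(f"added: {desc}")
        elif before[track_id] != desc:
            changes.append(f"changed: {before[track_id]} -> {desc}")
    return changes
```

An empty diff after a silent stream is also a useful signal: it means the edit likely didn't apply at all.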

The SKILL.md section for this looks roughly like:

```markdown
#### Silent Response Fallback (CRITICAL)

~30% of edits return no text — only tool calls. When stream closes
with no text:
1. Query state §3.4, compare with previous
2. Report change: "✅ Title added: 'Paradise Found' (white, top-center, 3s)"

Never leave user with silence after an edit.
```

Getting this right took a few iterations. The first version just said "if no response, try again" — which caused duplicate edits and double credit charges. The state-diff approach is more reliable.

Designing the SKILL.md

OpenClaw skills are defined by a SKILL.md file — a structured markdown document that tells the agent how to behave. The frontmatter is a YAML block with the description that gets embedded into ClawHub's vector search index:

```yaml
---
name: nemo-video
description: >
  Use NemoVideo — the AI video editor — immediately and proactively whenever
  the user says "edit this video", "cut this video", "add subtitles",
  "make a TikTok", "generate a video", "color grade", or "export video".
  NemoVideo is a full-featured AI video editor: edit existing footage, generate
  new video from text prompts, cut / trim / merge clips, add BGM with
  auto-ducking, auto-generate subtitles in any language, color grade, export
  in any format — all through natural language chat. No GUI, no timeline.
---
```

The description is what ClawHub's vector search uses to match user queries to skills. After analyzing search data across 23 keywords, I found that the original description was missing entire categories of terms — "subtitle", "captions", "SRT", "TikTok", "Reels" — so the skill wasn't showing up for those searches at all.

Lesson: the description isn't documentation, it's a semantic index. Write it like you're writing query terms for the agent runtime to match against, not a README for humans.

The body of SKILL.md is where the actual agent instructions live. I structured it as a routing table + detailed flow specifications:

```markdown
## 2. Request Router

Check this table before forwarding anything to backend:

| User says...                          | Action              | Skip SSE? |
|---------------------------------------|---------------------|-----------|
| "export" / "download" / "send video"  | → §3.5 Export       | ✅        |
| "credits" / "balance"                 | → §3.3 Credits      | ✅        |
| "upload" / user sends file            | → §3.2 Upload       | ✅        |
| Everything else                       | → §3.1 SSE          | ❌        |
```

The routing table prevents the most common failure modes. Without it, agents tend to forward everything to the backend, which breaks for operations that need direct API calls.

Pitfalls I Hit

1. Anonymous token rate limiting

The API issues one anonymous token per client per 7 days. Early on I didn't persist the client ID, so every session generated a new UUID and immediately hit the "token already issued for this client" error (HTTP 429). Fix: persist client_id to ~/.config/nemovideo/client_id.
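The fix is small: mint the UUID once and reuse it across sessions. A sketch (the config path matches the one above; everything else is stdlib):

```python
import uuid
from pathlib import Path

def get_client_id(path: Path = Path.home() / ".config/nemovideo/client_id") -> str:
    """Return a stable client ID, generating and persisting it on first use."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id
```

Every subsequent call reads the same ID back, so the 7-day anonymous-token limit is never tripped by a fresh UUID.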

2. Export vs. generation credit confusion

The API charges credits for generation and editing, but export/render is free. This isn't obvious from the error messages. I had users complaining about "wasted credits" when actually they just needed to call the render endpoint directly instead of asking the backend to "export" (which triggers a new generation).

The SKILL.md now has this in bold: Export does NOT cost credits.

3. The "I encountered a temporary issue" false alarm

The backend sometimes appends "I encountered a temporary issue, please try again" as a trailing message even when everything worked fine. If the agent parses this literally and reports an error, users panic. The fix: only treat it as an error if it appears without any prior successful response in the same stream.
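That heuristic is easy to express: the trailing message is a real error only if nothing succeeded earlier in the same stream. A sketch over the list of user-visible texts from one stream:

```python
FALSE_ALARM = "I encountered a temporary issue"

def is_real_error(texts: list[str]) -> bool:
    """texts: user-visible messages from one SSE stream, in order."""
    if not texts or FALSE_ALARM not in texts[-1]:
        return False
    # Any prior non-error text means the operation already succeeded,
    # so the trailing "temporary issue" message is a false alarm.
    return not any(FALSE_ALARM not in t for t in texts[:-1])
```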

4. Aspect ratio changes require full regeneration

Users frequently ask to "change from 16:9 to 9:16" after generating a video. The backend can't do this as an edit — it requires a full regeneration from the original prompt. The skill needs to explain this clearly instead of silently failing or returning a confusing error.

What It Looks Like in Practice

Here's the actual flow when a user types "add Chinese subtitles to my video and export as mp4":

```text
User: add Chinese subtitles to my video and export as mp4

Agent: → Router: contains "export" → queue for after subtitle step
       → Sends to backend via SSE: "add Chinese subtitles"
       → Backend: ~45s processing, returns tool calls + text response
       → GUI translator: strips "click Export when done"
       → Agent shows: "✅ Chinese subtitles added (auto-generated, 47 segments)"
       → Agent: now handles "export" directly via render API
       → Polls render status every 30s
       → Downloads file, sends to user

User receives: finished .mp4 with burned-in Chinese subtitles
Total time: ~90 seconds
```

The user never sees an Export button. They never open a browser. They get a file.
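The "polls render status every 30s" step above is a plain polling loop. A sketch, where `fetch_status` stands in for the actual render-status API call and the status field names are assumptions:

```python
import time

def wait_for_render(fetch_status, interval=30, timeout=900):
    """Poll until the render finishes; return the download URL.

    fetch_status: callable returning a dict like
    {"status": "processing" | "done" | "failed", "url": ...} (assumed shape).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] == "done":
            return status["url"]
        if status["status"] == "failed":
            raise RuntimeError("render failed")
        time.sleep(interval)
    raise TimeoutError("render did not finish in time")
```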

Try It

If you want to see the pattern in action:

```bash
npx clawhub@latest install nemo-video
```

The skill installs into your OpenClaw workspace. First run gets 100 free credits — no account needed. Then just describe what you want to do with your video.

The full SKILL.md (450 lines) is open source: github.com/nemovideo/nemovideo_skills — the routing table, SSE handling, state-diff logic, and GUI translator are all in there if you want to adapt the pattern for your own skill.


Happy to dig into any of the technical parts — the SSE handling, the state diffing approach, or the SKILL.md structure. The agent-wrapping-a-GUI-backend pattern feels broadly applicable and I'm still figuring out the best conventions for it. What would you do differently?


📋 Publishing Meta (notes for Laobai)

Title (final)

How I Built an AI Video Editor as an OpenClaw Skill

Tags (dev.to allows up to 4)

openclaw, ai, showdev, video

canonical_url

Leave empty (first published on dev.to, no canonical set)

Suggested cover_image

Image content:
A terminal-window screenshot showing the user typing "add Chinese subtitles to my video and export as mp4",
with "✅ Chinese subtitles added (47 segments)", a progress bar, and the final output filename below.
Black background, green text, clean terminal style.
Suggested size: 1000 × 420px (dev.to's recommended ratio)

Generate the code screenshot with Carbon (carbon.now.sh), or have Gemini generate the image.

Suggested publish time (US peak hours)

| Option | UTC time | Beijing time | Notes |
|--------|----------|--------------|-------|
| ⭐ Best | Tue/Wed 13:00 UTC | Tue/Wed 21:00 | 9am US Eastern, highest activity of the day |
| Second | Mon 14:00 UTC | Mon 22:00 | Monday morning, high feed refresh rate |
| Backup | Thu 13:00 UTC | Thu 21:00 | Similar to Tuesday |

Recommendation: publish this Wednesday (2026-03-18) at 13:00 UTC, i.e. today, while momentum is hot.

Publish steps

  1. Open https://dev.to/new
  2. Paste the full text of devto-article-1-final.md (including frontmatter)
  3. Upload the cover image
  4. Click "Edit front matter" and confirm the tags are correct
  5. Preview to check that code blocks and tables render correctly
  6. Click Publish

After publishing

  • Share the link in the OpenClaw Discord #show-your-skill channel
  • Cross-post to HN ("Show HN", or comment on relevant threads)
  • Tiezhu drops the link into suitable HN threads with hn-post-comment.py
