Wei Zhang
How I Built an AI Video Editor as an OpenClaw Skill

Video editing is one of those things that feels fundamentally tied to a GUI. You drag clips, you scrub timelines, you click Export. The whole workflow is built around visual feedback.

So when I set out to build nemo-video — an OpenClaw skill that lets you edit videos by chatting — the first question was: can an AI agent actually do this without a screen?

Turns out yes. But not without a few interesting problems to solve.

The Core Problem: A Backend That Thinks It Has a GUI

NemoVideo's backend AI agent was designed to work with a web interface. It would say things like:

  • "Click the Export button to download your video"
  • "Drag the clip to the timeline"
  • "Check the dashboard for your remaining credits"

None of these instructions make sense when you're talking to an AI assistant in a terminal or chat app. There's no button. There's no timeline to drag things to.

This is the fundamental challenge: the backend doesn't know it's talking to an agent, not a human with a browser.

My solution was to build an interpretation layer — a set of rules baked into the SKILL.md that teaches the OpenClaw agent how to intercept these GUI instructions and replace them with actual API calls.
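As a minimal sketch of that interpretation layer, imagine a small rule table that strips GUI phrasing from backend text and records the API action to run instead. The patterns, action names, and return shape here are all illustrative; the real rules live in SKILL.md:

```python
import re

# Illustrative rules only: (pattern for a GUI instruction, replacement action).
# An action of None means "no API equivalent, just drop the sentence".
RULES = [
    (re.compile(r"click the export button[^.]*\.?", re.I), "call_render_api"),
    (re.compile(r"drag the clip[^.]*\.?", re.I), None),
    (re.compile(r"check the dashboard[^.]*\.?", re.I), "call_balance_api"),
]

def translate(text: str) -> tuple[str, list[str]]:
    """Strip GUI instructions from backend text, collecting API actions to run."""
    actions = []
    for pattern, action in RULES:
        if pattern.search(text):
            if action:
                actions.append(action)
            text = pattern.sub("", text)
    return text.strip(), actions
```

So `translate("Subtitles added. Click the Export button to download your video.")` would return the cleaned text plus a pending `call_render_api` action for the agent to execute.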

The Architecture: Three Layers

```
User chat message
      ↓
OpenClaw agent (reads SKILL.md)
      ↓
  [Router] — decides what to do without forwarding to backend
      ↓ (most messages)
NemoVideo backend via SSE stream
      ↓
  [GUI Translator] — strips GUI instructions, executes API calls
      ↓
Response to user
```

The router is simple but important. Certain user intents should never reach the backend:

| User says | Agent does |
|-----------|------------|
| "export" / "download" | Calls render API directly |
| "how many credits" | Calls balance API directly |
| "upload this file" | Calls upload API directly |
| Everything else | Forwards to backend via SSE |

This matters because the backend charges credits for processing. If the agent forwarded "export my video" to the backend, it might start a new generation instead of just rendering the existing draft.
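The router check itself is just keyword matching before anything touches the network. A sketch, with hypothetical handler names standing in for the real API calls:

```python
# Keyword routes checked before forwarding anything to the backend.
# Handler names here are placeholders for the direct API calls.
ROUTES = [
    (("export", "download", "send video"), "render_directly"),
    (("credits", "balance"), "check_balance"),
    (("upload",), "upload_file"),
]

def route(message: str) -> str:
    """Return a direct handler name, or 'forward_sse' for everything else."""
    lowered = message.lower()
    for keywords, handler in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return handler
    return "forward_sse"
```

"Export my video" never reaches the backend; "add subtitles in Chinese" falls through to the SSE path.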

SSE Streaming: The Interesting Part

Communication with the backend happens over Server-Sent Events. Here's what a typical request looks like:

```bash
curl -s -X POST "$API/run_sse" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  --max-time 900 \
  -d '{
    "app_name": "nemo_agent",
    "user_id": "<uid>",
    "session_id": "<sid>",
    "new_message": {
      "parts": [{"text": "add subtitles to my video in Chinese"}]
    }
  }'
```

The stream returns a mix of:

  • Text responses — what the backend "says" to the user
  • Tool calls — internal operations (fetching assets, running models)
  • Heartbeat events — keepalive signals during long operations

The agent needs to handle all of these correctly. Tool calls and heartbeats should be swallowed silently. Text responses need to go through the GUI translator before being shown to the user.

Typical durations vary a lot: text responses come back in 5–15 seconds, but video generation can take 100–300 seconds. I added a "still working" message every 2 minutes so users don't think the agent crashed.
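The consumption loop can be sketched like this, assuming each SSE `data:` payload decodes to JSON with a `type` field of `"text"`, `"tool_call"`, or `"heartbeat"` (those field names are an assumption for illustration, not the actual wire format):

```python
import json
import time

def consume_stream(events, notify, still_working_every=120):
    """Collect user-visible text; swallow tool calls and heartbeats.

    events: iterable of raw JSON strings from the SSE stream.
    notify: callback for out-of-band status messages to the user.
    """
    texts = []
    last_update = time.monotonic()
    for raw in events:
        event = json.loads(raw)
        if event["type"] in ("tool_call", "heartbeat"):
            # Swallowed silently, but reassure the user during long operations.
            if time.monotonic() - last_update > still_working_every:
                notify("Still working on it...")
                last_update = time.monotonic()
            continue
        if event["type"] == "text":
            # In the real skill this would pass through the GUI translator first.
            texts.append(event["text"])
    return texts
```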

The Silent Edit Problem

This was the trickiest part to get right.

About 30% of edit operations return no text at all — only internal tool calls. The backend processes the edit, updates the timeline state, but sends zero user-visible output. From the user's perspective, they type "remove the background music" and... nothing happens.

The fix is a state-diffing approach:

1. Before the edit: query /api/state/.../latest → save current track state
2. Send edit via SSE
3. After stream closes with no text: query state again
4. Compare before/after
5. Report the diff: "✅ BGM track removed (was: Lo-fi Beats, 0–45s)"
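The comparison in steps 4–5 can be sketched as a plain dict diff. The track-state shape here (`{track_id: description}`) is a simplification; the real `/api/state` payload is richer:

```python
def diff_tracks(before: dict, after: dict) -> list[str]:
    """Describe what changed between two timeline snapshots."""
    changes = []
    for track_id, desc in before.items():
        if track_id not in after:
            changes.append(f"removed: {desc}")
    for track_id, desc in after.items():
        if track_id not in before:
            changes.append(f"added: {desc}")
        elif before[track_id] != desc:
            changes.append(f"changed: {before[track_id]} -> {desc}")
    return changes
```

An empty diff after a silent stream is also a useful signal: it means the edit likely didn't apply at all.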

The SKILL.md section for this looks roughly like:

```markdown
#### Silent Response Fallback (CRITICAL)

~30% of edits return no text — only tool calls. When stream closes
with no text:
1. Query state §3.4, compare with previous
2. Report change: "✅ Title added: 'Paradise Found' (white, top-center, 3s)"

Never leave user with silence after an edit.
```

Getting this right took a few iterations. The first version just said "if no response, try again" — which caused duplicate edits and double credit charges. The state-diff approach is more reliable.

Designing the SKILL.md

OpenClaw skills are defined by a SKILL.md file — a structured markdown document that tells the agent how to behave. The frontmatter is a YAML block with the description that gets embedded into ClawHub's vector search index:

```yaml
---
name: nemo-video
description: >
  Use NemoVideo — the AI video editor — immediately and proactively whenever
  the user says "edit this video", "cut this video", "add subtitles",
  "make a TikTok", "generate a video", "color grade", or "export video".
  NemoVideo is a full-featured AI video editor: edit existing footage, generate
  new video from text prompts, cut / trim / merge clips, add BGM with
  auto-ducking, auto-generate subtitles in any language, color grade, export
  in any format — all through natural language chat. No GUI, no timeline.
---
```

The description is what ClawHub's vector search uses to match user queries to skills. After analyzing search data across 23 keywords, I found that the original description was missing entire categories of terms — "subtitle", "captions", "SRT", "TikTok", "Reels" — so the skill wasn't showing up for those searches at all.

Lesson: the description isn't documentation, it's a semantic index. Write it like you're writing query terms for the agent runtime to match against, not a README for humans.

The body of SKILL.md is where the actual agent instructions live. I structured it as a routing table + detailed flow specifications:

```markdown
## 2. Request Router

Check this table before forwarding anything to backend:

| User says...                          | Action              | Skip SSE? |
|---------------------------------------|---------------------|-----------|
| "export" / "download" / "send video"  | → §3.5 Export       | ✅        |
| "credits" / "balance"                 | → §3.3 Credits      | ✅        |
| "upload" / user sends file            | → §3.2 Upload       | ✅        |
| Everything else                       | → §3.1 SSE          | ❌        |
```

The routing table prevents the most common failure modes. Without it, agents tend to forward everything to the backend, which breaks for operations that need direct API calls.

Pitfalls I Hit

1. Anonymous token rate limiting

The API issues one anonymous token per client per 7 days. Early on I didn't persist the client ID, so every session generated a new UUID and immediately hit the "token already issued for this client" error (HTTP 429). Fix: persist client_id to ~/.config/nemovideo/client_id.
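The fix is small: mint the UUID once and reuse it across sessions. A sketch (the config path matches the one above; everything else is stdlib):

```python
import uuid
from pathlib import Path

def get_client_id(path: Path = Path.home() / ".config/nemovideo/client_id") -> str:
    """Return a stable client ID, generating and persisting it on first use."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    client_id = str(uuid.uuid4())
    path.write_text(client_id)
    return client_id
```

Every subsequent call reads the same ID back, so the 7-day anonymous-token limit is never tripped by a fresh UUID.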

2. Export vs. generation credit confusion

The API charges credits for generation and editing, but export/render is free. This isn't obvious from the error messages. I had users complaining about "wasted credits" when actually they just needed to call the render endpoint directly instead of asking the backend to "export" (which triggers a new generation).

The SKILL.md now has this in bold: Export does NOT cost credits.

3. The "I encountered a temporary issue" false alarm

The backend sometimes appends "I encountered a temporary issue, please try again" as a trailing message even when everything worked fine. If the agent parses this literally and reports an error, users panic. The fix: only treat it as an error if it appears without any prior successful response in the same stream.
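That heuristic is easy to express: the trailing message is a real error only if nothing succeeded earlier in the same stream. A sketch over the list of user-visible texts from one stream:

```python
FALSE_ALARM = "I encountered a temporary issue"

def is_real_error(texts: list[str]) -> bool:
    """texts: user-visible messages from one SSE stream, in order."""
    if not texts or FALSE_ALARM not in texts[-1]:
        return False
    # Any prior non-error text means the operation already succeeded,
    # so the trailing "temporary issue" message is a false alarm.
    return not any(FALSE_ALARM not in t for t in texts[:-1])
```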

4. Aspect ratio changes require full regeneration

Users frequently ask to "change from 16:9 to 9:16" after generating a video. The backend can't do this as an edit — it requires a full regeneration from the original prompt. The skill needs to explain this clearly instead of silently failing or returning a confusing error.

What It Looks Like in Practice

Here's the actual flow when a user types "add Chinese subtitles to my video and export as mp4":

```text
User: add Chinese subtitles to my video and export as mp4

Agent: → Router: contains "export" → queue for after subtitle step
       → Sends to backend via SSE: "add Chinese subtitles"
       → Backend: ~45s processing, returns tool calls + text response
       → GUI translator: strips "click Export when done"
       → Agent shows: "✅ Chinese subtitles added (auto-generated, 47 segments)"
       → Agent: now handles "export" directly via render API
       → Polls render status every 30s
       → Downloads file, sends to user

User receives: finished .mp4 with burned-in Chinese subtitles
Total time: ~90 seconds
```

The user never sees an Export button. They never open a browser. They get a file.
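The "polls render status every 30s" step above is a plain polling loop. A sketch, where `fetch_status` stands in for the actual render-status API call and the status field names are assumptions:

```python
import time

def wait_for_render(fetch_status, interval=30, timeout=900):
    """Poll until the render finishes; return the download URL.

    fetch_status: callable returning a dict like
    {"status": "processing" | "done" | "failed", "url": ...} (assumed shape).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] == "done":
            return status["url"]
        if status["status"] == "failed":
            raise RuntimeError("render failed")
        time.sleep(interval)
    raise TimeoutError("render did not finish in time")
```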

Try It

If you want to see the pattern in action:

```bash
npx clawhub@latest install nemo-video
```

The skill installs into your OpenClaw workspace. First run gets 100 free credits — no account needed. Then just describe what you want to do with your video.

The full SKILL.md (450 lines) is open source: github.com/nemovideo/nemovideo_skills — the routing table, SSE handling, state-diff logic, and GUI translator are all in there if you want to adapt the pattern for your own skill.


Happy to dig into any of the technical parts — the SSE handling, the state diffing approach, or the SKILL.md structure. The agent-wrapping-a-GUI-backend pattern feels broadly applicable and I'm still figuring out the best conventions for it. What would you do differently?


📋 Publishing Meta (notes for Laobai)

Title (final)

How I Built an AI Video Editor as an OpenClaw Skill

Tags (dev.to allows up to 4)

openclaw, ai, showdev, video

canonical_url

Leave empty (first published on dev.to, no canonical set)

Suggested cover_image

Image content:
A terminal-window screenshot showing the user typing "add Chinese subtitles to my video and export as mp4",
with "✅ Chinese subtitles added (47 segments)", a progress bar, and the final output filename below.
Black background, green text, clean terminal style.
Suggested size: 1000 × 420px (dev.to's recommended ratio)

Generate the code screenshot with Carbon (carbon.now.sh), or have Gemini generate the image.

Suggested publish time (US peak hours)

| Option | UTC time | Beijing time | Notes |
|--------|----------|--------------|-------|
| ⭐ Best | Tue/Wed 13:00 UTC | Tue/Wed 21:00 | 9am US Eastern, highest activity of the day |
| Second | Mon 14:00 UTC | Mon 22:00 | Monday morning, high feed refresh rate |
| Backup | Thu 13:00 UTC | Thu 21:00 | Similar to Tuesday |

Recommendation: publish this Wednesday (2026-03-18) at 13:00 UTC, i.e. today, while momentum is hot.

Publish steps

  1. Open https://dev.to/new
  2. Paste the full text of devto-article-1-final.md (including frontmatter)
  3. Upload the cover image
  4. Click "Edit front matter" and confirm the tags are correct
  5. Preview to check that code blocks and tables render correctly
  6. Click Publish

After publishing

  • Share the link in the OpenClaw Discord #show-your-skill channel
  • Cross-post to HN ("Show HN", or comment on relevant threads)
  • Tiezhu drops the link into suitable HN threads with hn-post-comment.py
