Most AI agent tutorials assume you control both ends of the stack. You define the tools, you define the responses, everything is designed for programmatic access from the start.
Real-world integrations are messier. The useful APIs — the ones that actually do something valuable — were usually built for humans with browsers. They return things like "click the Export button" and "drag the clip to the timeline." They assume someone is watching a screen.
I spent the last few months building nemo-video, an OpenClaw skill that wraps a video editing backend originally designed for a web UI. Here's what I learned.
Lesson 1: You Need an Interception Layer, Not Just a Wrapper
The naive approach is to forward everything from the user to the backend and return whatever comes back. This breaks immediately when the backend responds with GUI instructions.
The backend I was working with would say things like:
"Your video is ready! Click the Export button in the top right
corner to download it."
If you pass this directly to the user, they'll ask "what Export button?" The skill has no UI. There is no button.
The fix isn't prompt engineering ("please don't mention buttons"); it's architectural: you need a translation layer that sits between the backend response and the user.
backend response → [GUI Translator] → user-visible response
                        ↓
             if "click X" detected:
                 execute X directly via API
                 replace with: "✅ Done. Here's your file."
The translator needs to handle two cases:
- Cosmetic GUI references ("check the dashboard") — strip and ignore
- Actionable GUI references ("click Export") — intercept and execute the actual API call
The second case is critical. If "click Export" reaches the user as text, nothing happens. If your translator catches it and calls the render endpoint, the user gets their file.
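As a concrete sketch of that interception logic: the phrase-to-endpoint mapping and the `api` handlers below are hypothetical illustrations, not the actual nemo-video implementation.

```python
import re

# Hypothetical mapping from GUI phrases to API actions.
# A value of None marks a cosmetic reference: strip it, call nothing.
GUI_ACTIONS = {
    r"click (the )?export button": "render",
    r"check (the )?dashboard": None,
}

def translate_gui_response(text, api):
    """Intercept GUI instructions in a backend response.

    Actionable references trigger the mapped API call; cosmetic
    references are stripped so the user never sees them.
    """
    for pattern, action in GUI_ACTIONS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if not match:
            continue
        if action is None:
            # Cosmetic reference: remove the instruction, keep the rest
            text = text.replace(match.group(0), "").strip()
        else:
            # Actionable reference: execute the real API call instead
            result = api[action]()
            return f"✅ Done. Here's your file: {result}"
    return text
```

With `api = {"render": lambda: "video.mp4"}`, a backend message telling the user to click Export comes back as a completed file delivery instead of a dead-end instruction.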
Lesson: Design for interception from day one. Map every GUI action in the backend's vocabulary to an API call.
Lesson 2: Silent Edits Are Real and You Must Handle Them
This one took me the longest to figure out.
About 30% of edit operations in the backend return no user-visible text at all. The backend processes the edit, updates internal state, sends a stream of tool calls — and then closes the connection without saying anything.
From the user's perspective: they type "remove the background music." Nothing happens. No confirmation, no error. Just silence.
The first instinct is to retry. That's wrong. The edit succeeded. Retrying it will apply the same edit twice, which in a credit-based system means you've now charged the user twice for one action.
The correct approach is state diffing:
# Before sending the edit:
state_before = get_current_state(session_id)

# Send edit via SSE stream
response_text = send_to_backend(user_message)

# If stream closes with no text:
if not response_text:
    state_after = get_current_state(session_id)
    diff = compute_diff(state_before, state_after)
    if diff:
        return format_diff_as_confirmation(diff)
        # "✅ BGM track removed (was: Lo-fi Beats, 0–45s)"
    else:
        return "The edit didn't seem to take effect. Want to try again?"
The diff-based confirmation also makes the UX much better than a generic "done." Users want to know what changed, not just that something happened.
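The `compute_diff` helper can be as simple as a key-by-key comparison of two snapshots. A minimal sketch, assuming state is a flat dict; the field names are made up for illustration:

```python
def compute_diff(before, after):
    """Return {field: (old, new)} for every field that changed."""
    changed = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed

def format_diff_as_confirmation(diff):
    """Render a diff as a human-readable confirmation line."""
    parts = [f"{field}: {old!r} → {new!r}"
             for field, (old, new) in sorted(diff.items())]
    return "✅ Edit applied: " + "; ".join(parts)
```

Real session state is usually nested, so a production version would recurse or flatten first, but the principle is the same: the diff, not the backend's (absent) reply, is the source of truth for the confirmation.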
Lesson: Never assume a silent response means failure. Check state before and after any operation. Build state diffing before you build anything else.
Lesson 3: Billing and Credits Belong to the Agent, Not the Backend
This is a UX problem that becomes a trust problem quickly.
The backend I was wrapping charged credits for processing operations. It also had an export/render endpoint that was completely free. But from the user's perspective — talking to an agent — this distinction is invisible.
Early in development, users would ask "how many credits do I have left?" The agent would forward this to the backend, which would respond: "You can check your credit balance on the dashboard under Account Settings."
Two problems:
- There is no dashboard. The skill is the interface.
- Even if there were, this is exactly the kind of round-trip that should never hit the backend.
The solution is a pre-flight router:
incoming message
       ↓
    [Router]
       ├── "credits" / "balance" / "how much left" → call balance API directly
       ├── "export" / "download" / "send me the file" → call render API directly
       ├── "upload" / user attaches a file → call upload API directly
       └── everything else → forward to backend via SSE
The router catches intent before it reaches the backend. It means:
- Credit checks are instant (no SSE stream overhead)
- Exports never accidentally trigger new generations
- Upload flow is deterministic
The rule of thumb: any operation with a known, fixed API endpoint should never go through the conversational backend. The backend is for things that require interpretation. Credit checks don't require interpretation.
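A minimal version of that router is just ordered keyword matching ahead of the SSE path. The handler names below are illustrative, not the real nemo-video endpoints:

```python
# Deterministic intents checked before anything reaches the backend.
# Order matters: the first match wins.
ROUTES = [
    (("credits", "balance", "how much left"), "balance"),
    (("export", "download", "send me the file"), "render"),
    (("upload",), "upload"),
]

def route(message, handlers, fallback):
    """Dispatch deterministic intents directly; forward the rest via SSE."""
    lowered = message.lower()
    for keywords, handler_name in ROUTES:
        if any(kw in lowered for kw in keywords):
            return handlers[handler_name](message)
    return fallback(message)  # conversational backend (SSE stream)
```

Keyword matching is crude; in practice you might layer a lightweight classifier on top, but the design point stands: deterministic operations get a deterministic path.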
Lesson: Map your API surface area before you write any prompt logic. Identify which operations are deterministic (route them directly) vs. which require the backend's reasoning (route them through SSE).
Lesson 4: Backend Error Messages Are Written for Humans, Not Agents
Error handling in GUI-first APIs is designed for a support workflow, not programmatic consumption.
Typical backend error message:
"I encountered a temporary issue processing your request.
Please try again or contact support at support@example.com
if the problem persists."
This message is useless to an agent for three reasons:
- It doesn't say what failed
- "Try again" is dangerous if the failure was a credit deduction
- "Contact support" is a dead end in an automated flow
Worse: this exact message sometimes appears as a trailing message after a successful operation. The backend completes the edit, sends the result, then appends a generic error epilogue as a separate SSE event. If your agent treats this as an error state, you get false negatives.
The pattern that works:
def parse_sse_stream(events):
    has_success = False
    final_text = []
    for event in events:
        if looks_like_success(event):
            has_success = True
            final_text.append(event.text)
        elif looks_like_error(event):
            if has_success:
                # Trailing error after success = ignore
                continue
            else:
                # Genuine error = surface to user
                final_text.append(translate_error(event.text))
    return "\n".join(final_text)

def translate_error(backend_message):
    # Map backend error vocabulary to actionable user messages
    if "temporary issue" in backend_message:
        return "The backend is busy — try again in 30 seconds."
    if "insufficient credits" in backend_message:
        return "You've run out of credits. Get more at [link]."
    # ... etc
    return "Something went wrong. Here's the raw error: " + backend_message
The key insight: don't surface backend error messages directly. Translate them. Your agent knows the context (what was attempted, what state things are in) that the backend doesn't.
Lesson: Build an error translation table early. Expect that the same error string from the backend can mean different things depending on when in the flow it appears.
Lesson 5: Test With Transcripts, Not Unit Tests
Standard unit testing doesn't map well to conversational agent skills. You can't easily mock a 300-second SSE stream, and the interesting failure modes only surface in real multi-turn conversations.
What actually works: transcript testing.
I ended up with a library of ~110 conversation transcripts — real interactions that exposed bugs, edge cases, or just confusing UX. Each transcript is a sequence of user messages and expected agent behaviors:
# transcript: double-export.yaml
description: "User asks to export immediately after generation"
turns:
  - user: "create a 30 second video about ocean waves"
    expect:
      - type: confirmation
        contains: "video created"
  - user: "export it"
    expect:
      - type: file_delivered
        not_contains: "generating"  # should NOT start a new generation
      - type: credits_unchanged     # export is free, credits should not decrease
This approach catches:
- Regression bugs: does fixing the silent edit problem break the export flow?
- UX issues: does the phrasing of confirmations actually make sense in context?
- Edge cases: what happens if the user asks to export a video that's still generating?
The transcripts also serve as documentation. New contributors can read them to understand how the skill is supposed to behave in specific scenarios — something a SKILL.md instruction file can't fully capture.
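Running the transcripts doesn't need a framework. A sketch of a runner, assuming the YAML has been parsed into dicts and `agent` is any callable from user message to response text (both are hypothetical interfaces, and this version only checks the `contains`/`not_contains` expectations):

```python
def run_transcript(transcript, agent):
    """Replay a transcript against an agent and collect expectation failures."""
    failures = []
    for i, turn in enumerate(transcript["turns"]):
        response = agent(turn["user"])
        for expect in turn.get("expect", []):
            if "contains" in expect and expect["contains"] not in response:
                failures.append(f"turn {i}: missing {expect['contains']!r}")
            if "not_contains" in expect and expect["not_contains"] in response:
                failures.append(f"turn {i}: forbidden {expect['not_contains']!r}")
    return failures
```

Checks like `credits_unchanged` need before/after state hooks rather than string matching, which is a natural extension of the same loop.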
Lesson: Start collecting transcripts from your first real user session. Every surprising or broken interaction is a test case. By the time you have 20 transcripts, you'll have a regression suite that catches most of the things that matter.
The Common Thread
All five of these lessons come back to the same root problem: GUI-first backends communicate in a vocabulary designed for human visual processing, and agents operate in a vocabulary designed for text and function calls.
The translation work is non-trivial, but it's also reusable. The patterns above — interception layers, state diffing, pre-flight routers, error translation, transcript testing — apply to any GUI-first API you want to expose to an agent runtime.
If you're building on OpenClaw and want to see the full implementation, nemo-video's SKILL.md is open source:
npx clawhub@latest install nemo-video --force
Source: github.com/nemovideo/nemovideo_skills
What GUI-first APIs have you tried to wrap for agent use? Curious what other translation patterns people have run into.