Most AI agent tutorials assume you control both ends of the stack. You define the tools, you define the responses, everything is designed for programmatic access from the start.
Real-world integrations are messier. The useful APIs — the ones that actually do something valuable — were usually built for humans with browsers. They return things like "click the Export button" and "drag the clip to the timeline." They assume someone is watching a screen.
I spent the last few months building nemo-video, an OpenClaw skill that wraps a video editing backend originally designed for a web UI. Here's what I learned.
Lesson 1: You Need an Interception Layer, Not Just a Wrapper
The naive approach is to forward everything from the user to the backend and return whatever comes back. This breaks immediately when the backend responds with GUI instructions.
The backend I was working with would say things like:
"Your video is ready! Click the Export button in the top right
corner to download it."
If you pass this directly to the user, they'll ask "what Export button?" The skill has no UI. There is no button.
The fix isn't prompt engineering ("please don't mention buttons"); it's architectural: you need a translation layer that sits between the backend response and the user.
backend response → [GUI Translator] → user-visible response
                        ↓
             if "click X" detected:
                 execute X directly via API
                 replace with: "✅ Done. Here's your file."
The translator needs to handle two cases:
- Cosmetic GUI references ("check the dashboard") — strip and ignore
- Actionable GUI references ("click Export") — intercept and execute the actual API call
The second case is critical. If "click Export" reaches the user as text, nothing happens. If your translator catches it and calls the render endpoint, the user gets their file.
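As a concrete sketch of that interception logic: the phrase-to-endpoint mapping and the `api` handlers below are hypothetical illustrations, not the actual nemo-video implementation.

```python
import re

# Hypothetical mapping from GUI phrases to API actions.
# A value of None marks a cosmetic reference: strip it, call nothing.
GUI_ACTIONS = {
    r"click (the )?export button": "render",
    r"check (the )?dashboard": None,
}

def translate_gui_response(text, api):
    """Intercept GUI instructions in a backend response.

    Actionable references trigger the mapped API call; cosmetic
    references are stripped so the user never sees them.
    """
    for pattern, action in GUI_ACTIONS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if not match:
            continue
        if action is None:
            # Cosmetic reference: remove the instruction, keep the rest
            text = text.replace(match.group(0), "").strip()
        else:
            # Actionable reference: execute the real API call instead
            result = api[action]()
            return f"✅ Done. Here's your file: {result}"
    return text
```

With `api = {"render": lambda: "video.mp4"}`, a backend message telling the user to click Export comes back as a completed file delivery instead of a dead-end instruction.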
Lesson: Design for interception from day one. Map every GUI action in the backend's vocabulary to an API call.
Lesson 2: Silent Edits Are Real and You Must Handle Them
This one took me the longest to figure out.
About 30% of edit operations in the backend return no user-visible text at all. The backend processes the edit, updates internal state, sends a stream of tool calls — and then closes the connection without saying anything.
From the user's perspective: they type "remove the background music." Nothing happens. No confirmation, no error. Just silence.
The first instinct is to retry. That's wrong. The edit succeeded. Retrying it will apply the same edit twice, which in a credit-based system means you've now charged the user twice for one action.
The correct approach is state diffing:
# Before sending the edit:
state_before = get_current_state(session_id)

# Send edit via SSE stream
response_text = send_to_backend(user_message)

# If stream closes with no text:
if not response_text:
    state_after = get_current_state(session_id)
    diff = compute_diff(state_before, state_after)
    if diff:
        return format_diff_as_confirmation(diff)
        # "✅ BGM track removed (was: Lo-fi Beats, 0–45s)"
    else:
        return "The edit didn't seem to take effect. Want to try again?"
The diff-based confirmation also makes the UX much better than a generic "done." Users want to know what changed, not just that something happened.
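The `compute_diff` helper can be as simple as a key-by-key comparison of two snapshots. A minimal sketch, assuming state is a flat dict; the field names are made up for illustration:

```python
def compute_diff(before, after):
    """Return {field: (old, new)} for every field that changed."""
    changed = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed

def format_diff_as_confirmation(diff):
    """Render a diff as a human-readable confirmation line."""
    parts = [f"{field}: {old!r} → {new!r}"
             for field, (old, new) in sorted(diff.items())]
    return "✅ Edit applied: " + "; ".join(parts)
```

Real session state is usually nested, so a production version would recurse or flatten first, but the principle is the same: the diff, not the backend's (absent) reply, is the source of truth for the confirmation.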
Lesson: Never assume a silent response means failure. Check state before and after any operation. Build state diffing before you build anything else.
Lesson 3: Billing and Credits Belong to the Agent, Not the Backend
This is a UX problem that becomes a trust problem quickly.
The backend I was wrapping charged credits for processing operations. It also had an export/render endpoint that was completely free. But from the user's perspective — talking to an agent — this distinction is invisible.
Early in development, users would ask "how many credits do I have left?" The agent would forward this to the backend, which would respond: "You can check your credit balance on the dashboard under Account Settings."
Two problems:
- There is no dashboard. The skill is the interface.
- Even if there were, this is exactly the kind of round-trip that should never hit the backend.
The solution is a pre-flight router:
incoming message
       ↓
    [Router]
       ├── "credits" / "balance" / "how much left" → call balance API directly
       ├── "export" / "download" / "send me the file" → call render API directly
       ├── "upload" / user attaches a file → call upload API directly
       └── everything else → forward to backend via SSE
The router catches intent before it reaches the backend. It means:
- Credit checks are instant (no SSE stream overhead)
- Exports never accidentally trigger new generations
- Upload flow is deterministic
The rule of thumb: any operation with a known, fixed API endpoint should never go through the conversational backend. The backend is for things that require interpretation. Credit checks don't require interpretation.
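A minimal version of that router is just ordered keyword matching ahead of the SSE path. The handler names below are illustrative, not the real nemo-video endpoints:

```python
# Deterministic intents checked before anything reaches the backend.
# Order matters: the first match wins.
ROUTES = [
    (("credits", "balance", "how much left"), "balance"),
    (("export", "download", "send me the file"), "render"),
    (("upload",), "upload"),
]

def route(message, handlers, fallback):
    """Dispatch deterministic intents directly; forward the rest via SSE."""
    lowered = message.lower()
    for keywords, handler_name in ROUTES:
        if any(kw in lowered for kw in keywords):
            return handlers[handler_name](message)
    return fallback(message)  # conversational backend (SSE stream)
```

Keyword matching is crude; in practice you might layer a lightweight classifier on top, but the design point stands: deterministic operations get a deterministic path.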
Lesson: Map your API surface area before you write any prompt logic. Identify which operations are deterministic (route them directly) vs. which require the backend's reasoning (route them through SSE).
Lesson 4: Backend Error Messages Are Written for Humans, Not Agents
Error handling in GUI-first APIs is designed for a support workflow, not programmatic consumption.
Typical backend error message:
"I encountered a temporary issue processing your request.
Please try again or contact support at support@example.com
if the problem persists."
This message is useless to an agent for three reasons:
- It doesn't say what failed
- "Try again" is dangerous if the failure was a credit deduction
- "Contact support" is a dead end in an automated flow
Worse: this exact message sometimes appears as a trailing message after a successful operation. The backend completes the edit, sends the result, then appends a generic error epilogue as a separate SSE event. If your agent treats this as an error state, you get false negatives.
The pattern that works:
def parse_sse_stream(events):
    has_success = False
    final_text = []
    for event in events:
        if looks_like_success(event):
            has_success = True
            final_text.append(event.text)
        elif looks_like_error(event):
            if has_success:
                # Trailing error after success = ignore
                continue
            else:
                # Genuine error = surface to user
                final_text.append(translate_error(event.text))
    return "\n".join(final_text)

def translate_error(backend_message):
    # Map backend error vocabulary to actionable user messages
    if "temporary issue" in backend_message:
        return "The backend is busy — try again in 30 seconds."
    if "insufficient credits" in backend_message:
        return "You've run out of credits. Get more at [link]."
    # ... etc
    return "Something went wrong. Here's the raw error: " + backend_message
The key insight: don't surface backend error messages directly. Translate them. Your agent knows the context (what was attempted, what state things are in) that the backend doesn't.
Lesson: Build an error translation table early. Expect that the same error string from the backend can mean different things depending on when in the flow it appears.
Lesson 5: Test With Transcripts, Not Unit Tests
Standard unit testing doesn't map well to conversational agent skills. You can't easily mock a 300-second SSE stream, and the interesting failure modes only surface in real multi-turn conversations.
What actually works: transcript testing.
I ended up with a library of ~110 conversation transcripts — real interactions that exposed bugs, edge cases, or just confusing UX. Each transcript is a sequence of user messages and expected agent behaviors:
# transcript: double-export.yaml
description: "User asks to export immediately after generation"
turns:
  - user: "create a 30 second video about ocean waves"
    expect:
      - type: confirmation
        contains: "video created"
  - user: "export it"
    expect:
      - type: file_delivered
        not_contains: "generating"  # should NOT start a new generation
      - type: credits_unchanged     # export is free, credits should not decrease
This approach catches:
- Regression bugs: does fixing the silent edit problem break the export flow?
- UX issues: does the phrasing of confirmations actually make sense in context?
- Edge cases: what happens if the user asks to export a video that's still generating?
The transcripts also serve as documentation. New contributors can read them to understand how the skill is supposed to behave in specific scenarios — something a SKILL.md instruction file can't fully capture.
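Running the transcripts doesn't need a framework. A sketch of a runner, assuming the YAML has been parsed into dicts and `agent` is any callable from user message to response text (both are hypothetical interfaces, and this version only checks the `contains`/`not_contains` expectations):

```python
def run_transcript(transcript, agent):
    """Replay a transcript against an agent and collect expectation failures."""
    failures = []
    for i, turn in enumerate(transcript["turns"]):
        response = agent(turn["user"])
        for expect in turn.get("expect", []):
            if "contains" in expect and expect["contains"] not in response:
                failures.append(f"turn {i}: missing {expect['contains']!r}")
            if "not_contains" in expect and expect["not_contains"] in response:
                failures.append(f"turn {i}: forbidden {expect['not_contains']!r}")
    return failures
```

Checks like `credits_unchanged` need before/after state hooks rather than string matching, which is a natural extension of the same loop.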
Lesson: Start collecting transcripts from your first real user session. Every surprising or broken interaction is a test case. By the time you have 20 transcripts, you'll have a regression suite that catches most of the things that matter.
The Common Thread
All five of these lessons come back to the same root problem: GUI-first backends communicate in a vocabulary designed for human visual processing, and agents operate in a vocabulary designed for text and function calls.
The translation work is non-trivial, but it's also reusable. The patterns above — interception layers, state diffing, pre-flight routers, error translation, transcript testing — apply to any GUI-first API you want to expose to an agent runtime.
If you're building on OpenClaw and want to see the full implementation, nemo-video's SKILL.md is open source:
npx clawhub@latest install nemo-video --force
Source: github.com/nemovideo/nemovideo_skills
What GUI-first APIs have you tried to wrap for agent use? Curious what other translation patterns people have run into.