<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wei Zhang</title>
    <description>The latest articles on DEV Community by Wei Zhang (@weizhang_dev).</description>
    <link>https://dev.to/weizhang_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830370%2Fb0f862f4-fe45-4d43-ae9e-43cc220f2540.png</url>
      <title>DEV Community: Wei Zhang</title>
      <link>https://dev.to/weizhang_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/weizhang_dev"/>
    <language>en</language>
    <item>
      <title>The Best Camera Is the One You Have — And the Same Goes for AI Tools</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 09:52:25 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/the-best-camera-is-the-one-you-have-and-the-same-goes-for-ai-tools-ki</link>
      <guid>https://dev.to/weizhang_dev/the-best-camera-is-the-one-you-have-and-the-same-goes-for-ai-tools-ki</guid>
      <description>&lt;h1&gt;
  
  
  The Best Camera Is the One You Have — And the Same Goes for AI Tools
&lt;/h1&gt;

&lt;p&gt;The most upvoted post on r/filmmakers this week wasn't about a new camera release or a festival win. It was someone asking: "Why aren't we all just buying our dream cameras from seven years ago?"&lt;/p&gt;

&lt;p&gt;228 upvotes. 73 comments. And almost every single reply said some version of the same thing: the gear doesn't make the film, you do. The best camera is the one you already own.&lt;/p&gt;

&lt;p&gt;I've been thinking about that post all week because I'm watching the exact same anxiety play out in AI video tools — and it's costing creators the same thing it always has.&lt;/p&gt;

&lt;h2&gt;
  
  
  The camera version of this story
&lt;/h2&gt;

&lt;p&gt;The top comment on that post, with 137 upvotes, was from someone who bought a Blackmagic URSA Mini 4K in 2016. Ten years later, still using it. Still happy with the footage. Still getting hired.&lt;/p&gt;

&lt;p&gt;Another comment, 92 upvotes: "All the employers are gear snobs. I went into an interview and they asked what camera I shoot on before they asked to see my reel."&lt;/p&gt;

&lt;p&gt;The thread is full of people who spent years chasing the next body, the next lens, the next codec — and then realized the footage from their "outdated" camera was already good enough. The gap was never the sensor. It was lighting, composition, and knowing what story they were trying to tell.&lt;/p&gt;

&lt;p&gt;Nobody in that thread said "I upgraded my camera and my films got better." Not one person.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now replace "camera" with "AI tool"
&lt;/h2&gt;

&lt;p&gt;Every week there's a new AI video tool that's supposedly the one that changes everything. I've watched this cycle repeat for over a year now.&lt;/p&gt;

&lt;p&gt;Sora was going to replace editors. Then it shut down. Runway Gen-3 was the breakthrough. Then Gen-4 came out and everyone moved on. Kling got hyped for a month. Seedance had its moment. Pika keeps iterating. Each launch follows the same pattern: stunning demo, breathless coverage, creators scrambling to learn it, and then three weeks later everyone's waiting for the next one.&lt;/p&gt;

&lt;p&gt;I know because I did this myself. In a three-month stretch last year I tested four different tools seriously — installed them, watched the tutorials, tried them on real projects. At the end of those three months I'd spent more time learning tools than making content. My actual output dropped. I was so busy chasing the "best" AI tool that I stopped editing.&lt;/p&gt;

&lt;p&gt;The parallel to camera gear anxiety is almost embarrassing once you see it. Different technology, identical behavior. Chase the new thing, feel behind when the next new thing drops, repeat until you've spent all your creative energy on tool selection instead of creative work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually worked when I stopped switching
&lt;/h2&gt;

&lt;p&gt;About two months ago I made a decision that felt boring at the time: I picked my tools and stopped looking at new ones.&lt;/p&gt;

&lt;p&gt;For AI-assisted editing I settled on NemoVideo for transcript-based rough cuts and assembly. Not because it's the flashiest tool — because I already knew how to use it and it saved me about three hours per project on the mechanical work. For AI-generated footage I use one tool for B-roll and atmospheric shots. That's it. When a new tool launches, I read about it and move on.&lt;/p&gt;

&lt;p&gt;The result was immediate. My output nearly doubled in the first month. Not because the tools got better, but because I stopped spending half my week evaluating tools and started spending it on the work itself. More time on pacing decisions. More time on sound design choices. More time on the parts that actually make a video good — the parts no AI tool handles for you regardless of which one you pick.&lt;/p&gt;

&lt;p&gt;It turns out the bottleneck was never the tool. It was me, constantly resetting instead of building momentum.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math nobody talks about
&lt;/h2&gt;

&lt;p&gt;Every time you switch tools, you lose about two weeks of productive work. A week learning the new interface, a week getting back to the speed you had with the old one. Switch four times a year and you've burned two months. Switch every time Twitter tells you something is a game-changer and you've burned half your year on evaluation instead of execution. I've seen creators with beautiful demo reels of five different AI tools and zero finished client projects.&lt;/p&gt;

&lt;p&gt;Meanwhile, the person who picked a "good enough" tool in January and just kept using it has shipped twelve projects while you shipped four and tested six tools.&lt;/p&gt;

&lt;p&gt;This is the same math that applies to cameras. The filmmaker still shooting on that 2016 Blackmagic has a decade of muscle memory with that body. They know exactly how it behaves in low light, how the colors grade, what its limitations are and how to work around them. That familiarity is worth more than any spec sheet upgrade. You can't shortcut it by switching to the "better" model every quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uncomfortable conclusion
&lt;/h2&gt;

&lt;p&gt;The AI video tool space wants you to believe you need the newest thing. Every launch is positioned as the moment everything changes. And some of them are genuinely impressive — the quality improvements I wrote about last week are real.&lt;/p&gt;

&lt;p&gt;But the quality of the tool has never been what separates good work from bad. Taste does. Judgment does. Knowing when to cut, when to hold, when to let a moment breathe — that's the actual skill, and no tool upgrade gives it to you.&lt;/p&gt;

&lt;p&gt;My workflow now is deliberately boring: NemoVideo for the assembly grunt work, one AI tool for B-roll, and the rest is me in the timeline making decisions. It's not exciting. But I'm finishing things, which is more than I could say when I was tool-hopping.&lt;/p&gt;

&lt;p&gt;That r/filmmakers thread had one reply that stuck with me. Someone wrote: "I stopped upgrading and started shooting more."&lt;/p&gt;

&lt;p&gt;That's it. That's the whole lesson. Stop upgrading. Start editing. The best tool is the one that ships.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>video</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>AI Video Just Crossed the 'I'd Watch It' Line — What Changed and What Still Hasn't</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 08:48:41 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/ai-video-just-crossed-the-id-watch-it-line-what-changed-and-what-still-hasnt-o88</link>
      <guid>https://dev.to/weizhang_dev/ai-video-just-crossed-the-id-watch-it-line-what-changed-and-what-still-hasnt-o88</guid>
      <description>&lt;h1&gt;
  
  
  AI Video Just Crossed the "I'd Watch It" Line — What Changed and What Still Hasn't
&lt;/h1&gt;

&lt;p&gt;Something happened on r/aivideo this week that I almost missed.&lt;/p&gt;

&lt;p&gt;Three days in a row, AI-generated video posts crossed 1,000 upvotes. A kung fu fight scene hit 1,400. A water physics clip broke 1,800. A Spider-Man homage reached 3,200. That's not unusual by itself — viral AI clips have been around for a year. What caught me off guard was the comments.&lt;/p&gt;

&lt;p&gt;The top comment on the kung fu clip, with 263 upvotes, wasn't "cool tech demo." It was "I'd watch it."&lt;/p&gt;

&lt;p&gt;That's a line I didn't expect to see crossed this soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually changed
&lt;/h2&gt;

&lt;p&gt;I've been testing AI video tools on real projects for about 14 months now, and three things are genuinely different from where we were six months ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Motion doesn't look like jello anymore.&lt;/strong&gt; The water physics in that diving clip would have been impossible in mid-2025. Back then every fluid simulation looked like someone poured syrup in zero gravity. Now the splash timing, the way light refracts through moving water — it's not perfect, but it's past the uncanny valley for short clips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faces hold together within a single shot.&lt;/strong&gt; I used to get warped jawlines and drifting eye spacing in maybe 60% of generations. That's dropped to maybe 20-30% depending on the tool and the angle. Front-facing medium shots are pretty reliable now. Profile views and wide angles still break down. But for the most common framing in short-form content — someone talking to camera, waist up — it works more often than it fails now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Style transfer got real.&lt;/strong&gt; This one surprised me the most. That Spider-Man clip isn't just "a person in a spider suit." The tool understood what a Sam Raimi Spider-Man looks like versus what a Miles Morales animation looks like. A year ago, asking for a specific visual style gave you vague approximations. Now it's getting close enough that people are debating copyright instead of laughing at the output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What hasn't changed at all
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody in those comment sections is talking about: every single one of those viral clips is under 15 seconds. Not one is over 30.&lt;/p&gt;

&lt;p&gt;And that's not a coincidence — it's the wall.&lt;/p&gt;

&lt;p&gt;Character consistency across shots is still broken. I wrote about this last week after testing Runway, Kling, Seedance, and Pika on the same project. Same character, two consecutive shots, 15-20 regenerations to get something close. Skin tone drifts between cuts. Hair changes length. Clothing shifts color. One tool gave me a character who aged about ten years between shot one and shot two.&lt;/p&gt;

&lt;p&gt;Controllability is the other gap nobody mentions. You can generate "a person walking down a street" but you can't say "turn their head left, pause, then look down at their phone." The tools generate motion, they don't follow direction. For a 10-second mood clip that's fine. For anything with narrative structure, it's a dealbreaker.&lt;/p&gt;

&lt;p&gt;So we've got tools that can produce stunning individual moments but can't string two of them together coherently. That's a very specific kind of progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this actually matters right now
&lt;/h2&gt;

&lt;p&gt;Last week I tested something I'd been putting off. I needed B-roll for a client project — atmospheric city shots, some abstract transitions, a couple of moody interior clips. Instead of digging through stock footage for an hour, I generated five clips with AI. Three of them made it into the final cut.&lt;/p&gt;

&lt;p&gt;The client didn't notice. Didn't ask. The footage just worked because B-roll doesn't need character consistency or precise controllability. It needs mood, texture, and motion that doesn't distract. AI can do that now. Six months ago, the motion artifacts would have been immediately obvious — that strange wobble in the highlights, the way shadows would jump between frames. That's mostly gone now, at least for atmospheric stuff.&lt;/p&gt;

&lt;p&gt;But here's what I didn't use AI for: the interview cuts, the narrative pacing, the emotional arc of the piece. That was still me in the timeline, making decisions about when to cut, how long to hold a reaction shot, where to let silence breathe. I used NemoVideo to handle the assembly and rough cuts from the transcript, which saved me a few hours on the mechanical work. But the creative decisions were mine.&lt;/p&gt;

&lt;p&gt;That split — AI for raw material, editing tools for assembly and judgment — is where the real workflow is forming. Not AI replacing the editor, but AI changing what the editor spends time on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap between "I'd watch it" and "I'd ship it"
&lt;/h2&gt;

&lt;p&gt;Here's what I keep coming back to.&lt;/p&gt;

&lt;p&gt;"I'd watch it" means: this is visually impressive and entertaining for 12 seconds on my phone while I'm scrolling.&lt;/p&gt;

&lt;p&gt;"I'd ship it" means: I would put this in a project with my name on it, send it to a client, and stake my professional reputation on it.&lt;/p&gt;

&lt;p&gt;Those are completely different bars. "I'd watch it" requires quality within a single continuous shot. "I'd ship it" requires consistency across shots, controllable direction, and reliable reproduction — generate the same character twice and get the same character.&lt;/p&gt;

&lt;p&gt;Crossing the "I'd watch it" line took about two years from Sora's first demo to this week's r/aivideo posts. Crossing the "I'd ship it" line is going to take longer because the problems are architecturally harder. Current models don't maintain persistent identity across generations. That's not a training data problem — it's a fundamental gap in how these systems represent characters and scenes.&lt;/p&gt;

&lt;p&gt;My practical advice for anyone editing video right now: start using AI-generated B-roll where it fits. Mood shots, transitions, establishing shots, abstract textures. Save yourself the stock footage hunt. For everything else, use tools that help you edit faster — NemoVideo for transcript-based rough cuts, traditional NLEs for the fine work. Let AI handle the parts where "close enough" is good enough, and keep your hands on the parts where it isn't.&lt;/p&gt;

&lt;p&gt;The 1,000-upvote clips are real progress. But until I can generate shot two and have it match shot one, "I'd watch it" is as far as we go.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>video</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Character Consistency Problem: Why Every AI Video Tool Still Fails at the One Thing That Matters Most</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:24:04 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/the-character-consistency-problem-why-every-ai-video-tool-still-fails-at-the-one-thing-that-3386</link>
      <guid>https://dev.to/weizhang_dev/the-character-consistency-problem-why-every-ai-video-tool-still-fails-at-the-one-thing-that-3386</guid>
      <description>&lt;h1&gt;
  
  
  The Character Consistency Problem: Why Every AI Video Tool Still Fails at the One Thing That Matters Most
&lt;/h1&gt;

&lt;p&gt;Every AI video demo you've ever seen has something in common: it's a single shot. One clip. Five seconds of a character doing something impressive, posted with a caption like "this changes everything."&lt;/p&gt;

&lt;p&gt;What you never see in those demos is the same character in a second shot. Or a third. Or eight shots across a 60-second video where they need to look like the same person wearing the same clothes in the same lighting. That's not an oversight — it's because the tools can't do it yet. And this single problem is the reason AI video hasn't crossed over from impressive tech demo to production tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happens When You Try
&lt;/h2&gt;

&lt;p&gt;I tested this properly last month. A client wanted a 60-second explainer with a spokesperson character appearing in 8 different scenes — office, warehouse, outdoor, conference room, and so on. Standard corporate video stuff. The brief specifically said "consistent character throughout" because they wanted it to feel like a real person presenting.&lt;/p&gt;

&lt;p&gt;I tried four tools: Runway, Kling, Seedance, and Pika. Same reference images, same prompts adjusted for each platform's syntax. Here's what happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runway&lt;/strong&gt; gave me the most photorealistic output but the character drifted noticeably between scenes. Hair length changed, skin tone shifted warmer in outdoor scenes, and the face structure was slightly different at wider angles. I generated about 18 takes per scene before getting something passable, and even then the warehouse scene and the office scene looked like siblings rather than the same person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kling&lt;/strong&gt; handled face consistency better than the others, especially with their reference image pinning. But the clothing was a problem — the character's jacket changed shade between nearly every scene and completely changed style in two of them. I spent an afternoon trying different prompt combinations and got it to about 80% consistent. Close, but a client would absolutely notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seedance&lt;/strong&gt; had the best motion quality but the worst consistency. The character looked like a completely different person in 3 out of 8 scenes. I gave up after 22 regenerations on the outdoor shot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pika&lt;/strong&gt; was somewhere in the middle. Decent face consistency if I kept the angles similar, but the moment I needed a different camera position — like switching from a medium shot to a close-up — the character shifted enough to break continuity.&lt;/p&gt;

&lt;p&gt;Average across all four tools: about 15-20 regenerations per scene to get something that was &lt;em&gt;close enough&lt;/em&gt;, and even then I wouldn't call any of them truly consistent across the full sequence.&lt;/p&gt;

&lt;p&gt;The client ended up hiring a real person for the shoot. Half a day of filming, done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workarounds That Sort of Work
&lt;/h2&gt;

&lt;p&gt;The AI video community has come up with some creative hacks for this. None of them fully solve the problem, but some get you closer than others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference image pinning&lt;/strong&gt; is the most straightforward. You feed the tool a set of images of your character from multiple angles and it tries to match them. Kling does this best right now. The limitation is that pinning works well for similar poses but falls apart when you need the character doing very different things across scenes — sitting versus walking versus gesturing at a whiteboard. The more the pose diverges from your reference images, the more the character drifts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA fine-tuning&lt;/strong&gt; is the technical option. You train a lightweight model on 20-30 images of your character and it learns their specific features. This produces the most consistent results I've seen, but the barrier is real: you need to understand model training, you need compute resources, and each character takes a few hours to train. For a freelancer with a one-off project, the setup time kills the value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate character and background generation&lt;/strong&gt; is a workflow I saw on r/aivideo recently that's clever. You generate your character in a neutral environment, extract them, generate backgrounds separately, and composite them together. It's essentially doing in post what the AI tools fail to do in one pass. More work, but you get much better control over each element. The tradeoff is that integration — matching lighting, shadows, perspective between the character and background — becomes a manual compositing job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just shoot real footage and use AI for editing.&lt;/strong&gt; This sounds like a cop-out but it's honestly the most reliable approach for anything that needs character consistency. Real footage gives you perfect consistency by default because it's the same actual person. The AI part moves to editing — rough cuts, subtitles, color matching, reformatting for different platforms. Tools like NemoVideo handle that workflow well: you feed it real footage and use chat-based commands to edit, rather than asking AI to generate everything from scratch. The character consistency problem simply doesn't exist when you start with real footage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Problem Is Fundamentally Hard
&lt;/h2&gt;

&lt;p&gt;Character consistency isn't a bug that will be fixed in the next update. It's a core architectural challenge.&lt;/p&gt;

&lt;p&gt;Current video models work by predicting frames based on text descriptions and latent representations. Each generation pass is essentially independent — the model doesn't have a persistent understanding of "this is Character A and they should always look exactly like this." It's approximating from a description every single time.&lt;/p&gt;

&lt;p&gt;This is why single shots look great. Within one continuous generation, the model maintains local coherence. The character stays consistent for 3-5 seconds because each frame is predicted from the adjacent frames. But start a new generation — different scene, different prompt — and you're rolling the dice on whether the model's interpretation of "30-year-old woman with brown hair in a blue jacket" matches its interpretation from the last generation. Usually it's close. Close isn't consistent enough for professional work.&lt;/p&gt;
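&lt;p&gt;A toy sketch makes the independence point concrete. To be clear, this is not how any real model is implemented (they sample in latent space, not over named attributes, and every name below is invented for illustration), but it captures the failure mode: each generation pass rolls fresh dice on the character's appearance, while frames inside one pass share the same roll.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Toy model only -- real systems sample in latent space, and these
// attribute names are invented for illustration.
type Appearance = { hairLength: number; skinWarmth: number; jacketShade: number };

// Each generation pass re-interprets the prompt from scratch: the text
// pins the *distribution* of appearances, not the exact values.
function sampleAppearance(prompt: string, jitter = 0.15): Appearance {
  const noise = () =&amp;gt; (Math.random() * 2 - 1) * jitter;
  return {
    hairLength: 0.5 + noise(),
    skinWarmth: 0.5 + noise(),
    jacketShade: 0.5 + noise(),
  };
}

// Within one pass, every frame inherits the same sampled identity,
// which is why a single 3-5 second shot holds together.
function generateShot(prompt: string, frames: number): Appearance[] {
  const identity = sampleAppearance(prompt);
  return Array.from({ length: frames }, () =&amp;gt; identity);
}

const prompt = "30-year-old woman with brown hair in a blue jacket";
const shotOne = generateShot(prompt, 120); // scene one
const shotTwo = generateShot(prompt, 120); // scene two: a fresh dice roll

// Frames within shotOne match exactly; shotOne and shotTwo only match
// in expectation. That gap is the drift you see between cuts.
console.log(shotOne[0], shotTwo[0]);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;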

&lt;p&gt;Building true persistent character identity into these models would require something closer to a structured scene graph — an explicit representation of what each character looks like that persists across generations. Some research papers are exploring this but I haven't seen it in any production tool. My honest estimate is 12-18 months before any major platform solves this well enough for consistent use in client work, and even that might be optimistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means Right Now
&lt;/h2&gt;

&lt;p&gt;If your project needs character consistency across multiple shots — and most real projects do — you have two practical options today.&lt;/p&gt;

&lt;p&gt;Option one: shoot real footage and use AI for editing assistance. NemoVideo, Descript, DaVinci Resolve with AI plugins — these tools accelerate the editing workflow without introducing the consistency problem. You trade the dream of fully AI-generated video for the reality of AI-assisted production, which is less exciting but actually works.&lt;/p&gt;

&lt;p&gt;Option two: plan your AI-generated content around single shots. Social media clips, thumbnail generation, concept visualization, mood boards. Anything where each piece stands alone and doesn't need to match anything else. That's where current tools genuinely deliver.&lt;/p&gt;

&lt;p&gt;The day AI video tools solve character consistency is the day they become real production tools. We're not there yet, and pretending otherwise is how you end up regenerating the same scene 22 times and hiring a real person anyway.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>video</category>
      <category>machinelearning</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Turned a 2-Hour Podcast into 20 Short Clips: Here's the Full Workflow (and Where AI Actually Helped)</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:15:11 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/i-turned-a-2-hour-podcast-into-20-short-clips-heres-the-full-workflow-and-where-ai-actually-156b</link>
      <guid>https://dev.to/weizhang_dev/i-turned-a-2-hour-podcast-into-20-short-clips-heres-the-full-workflow-and-where-ai-actually-156b</guid>
      <description>&lt;h1&gt;
  
  
  I Turned a 2-Hour Podcast into 20 Short Clips: Here's the Full Workflow (and Where AI Actually Helped)
&lt;/h1&gt;

&lt;p&gt;A client sent me a 2-hour podcast episode last month and asked for 20 short clips — Reels, Shorts, TikToks, the whole spread. Different aspect ratios, subtitles, branded templates, color-matched, ready to publish across five platforms.&lt;/p&gt;

&lt;p&gt;A year ago this would have been a full week of work. I wanted to see how much AI could actually cut that down. Not in theory — in practice, with real client footage, real deadlines, and real quality standards.&lt;/p&gt;

&lt;p&gt;Final answer: 3 days instead of 7. But not in the way I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Transcription and Finding the Good Parts
&lt;/h2&gt;

&lt;p&gt;The first job was turning two hours of audio into text and finding the 20 moments worth clipping. Transcription used to be the most tedious part of this entire process — I'd spend 20 minutes per hour of footage just fixing auto-generated captions. Now it takes about 45 seconds to get a transcript that's maybe 95% accurate. I ran the episode through Descript, fixed a handful of proper nouns and technical terms, and had a clean transcript in under 10 minutes.&lt;/p&gt;
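&lt;p&gt;Descript is point-and-click, but if you want this step scriptable, a Whisper API call is the closest equivalent. To be clear, this is a stand-in for illustration rather than what I actually ran, and a two-hour episode needs splitting first because the API caps uploads at 25 MB.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Scriptable stand-in for the transcription step (not the tool used in
// this article). Assumes OPENAI_API_KEY is set and the episode has been
// split into chunks under the API's 25 MB upload limit.
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("episode-part1.mp3"),
  model: "whisper-1",
  // Word-level timestamps make the clip-selection pass below much
  // easier to script against.
  response_format: "verbose_json",
  timestamp_granularities: ["word"],
});

console.log(transcription.text);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;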

&lt;p&gt;Finding the actual clip-worthy moments was where things got interesting. I asked the AI to flag "high-engagement segments" and it returned 30 candidates. About 20 of them were fine. The other 10 were technically interesting quotes but had zero standalone value — the kind of thing that makes sense in context but means nothing as a 45-second clip on Instagram.&lt;/p&gt;

&lt;p&gt;This is the pattern I keep seeing: AI is excellent at identifying &lt;em&gt;what was said&lt;/em&gt; but terrible at judging &lt;em&gt;what will perform&lt;/em&gt;. It flagged a detailed technical explanation about microphone placement as a "high-engagement moment" because it had lots of specific information. Meanwhile it missed a 30-second story about a guest's worst interview experience that was obviously the most shareable moment in the entire episode. I had to manually review all 30 candidates and make the final picks myself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time: 40 minutes (AI-assisted) vs ~3.5 hours (fully manual)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Rough Cuts and Going Vertical
&lt;/h2&gt;

&lt;p&gt;With 20 segments identified, I needed to cut them from the timeline and reformat everything for vertical. This is where batch processing made the biggest difference.&lt;/p&gt;

&lt;p&gt;I used NemoVideo for the rough cuts — described each clip by timestamp and target duration in plain language, and it handled the extraction and initial framing. "Cut from 34:12 to 35:45, crop to 9:16, keep the speaker centered" repeated twenty times is exactly the kind of repetitive work where chat-based editing shines. What would have been an hour of timeline scrubbing took about 15 minutes.&lt;/p&gt;
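&lt;p&gt;The commands are mechanical enough to generate from a list rather than typing twenty times. The ClipSpec shape and helper below are my own scaffolding for illustration, not part of any tool's API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical scaffolding for the twenty repetitive chat commands.
// Nothing here is a real tool API -- it just formats the strings.
type ClipSpec = {
  start: string;                   // "34:12"
  end: string;                     // "35:45"
  aspect: "9:16" | "1:1" | "16:9"; // target platform ratio
};

const clips: ClipSpec[] = [
  { start: "34:12", end: "35:45", aspect: "9:16" },
  { start: "52:08", end: "52:51", aspect: "9:16" },
  // ...the other 18 segments picked in Step 1
];

const buildCommand = ({ start, end, aspect }: ClipSpec): string =&amp;gt;
  `Cut from ${start} to ${end}, crop to ${aspect}, keep the speaker centered`;

// Paste each line into the chat editor instead of scrubbing the timeline.
for (const clip of clips) {
  console.log(buildCommand(clip));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;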

&lt;p&gt;The vertical reframing worked well for single-speaker segments. The tool tracked the active speaker and kept them centered, which saved me from manually keyframing a crop for each clip.&lt;/p&gt;

&lt;p&gt;But the two-person conversation clips were a mess. Every time the speakers talked over each other or one person gestured into the other's frame, the tracking would jump between them or settle on the wrong person entirely. I ended up manually fixing the framing on about 6 of the 20 clips. Not a dealbreaker, but worth knowing if your source footage has multiple speakers on camera.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time: 45 minutes (AI-assisted) vs ~2.5 hours (manual crop and cut)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Subtitles and Brand Packaging
&lt;/h2&gt;

&lt;p&gt;Subtitles were the smoothest part of the entire process. The AI-generated captions from the transcript were already 92% accurate, and since I'd cleaned up the transcript in Step 1, the subtitle timing was nearly perfect out of the box. I spent maybe 20 minutes fixing edge cases — words that broke across lines awkwardly, a few timing misalignments where the speaker paused mid-sentence.&lt;/p&gt;

&lt;p&gt;Brand packaging was more mixed. The client had specific colors, fonts, a lower-third template, and an intro bumper that needed to go on every clip. Applying the template across all 20 clips in batch worked great for the simple stuff — colors, fonts, logo placement. But the lower-third positioning needed adjustment on about half the clips because the speaker's head was in a different spot in each one. Batch automation got me 60% of the way there, manual tweaking handled the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time: 1.5 hours (AI-assisted) vs ~5 hours (manual subtitling + templating)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Color Matching and Final Review
&lt;/h2&gt;

&lt;p&gt;The podcast was shot with two cameras that clearly had different white balance settings. AI color matching got both cameras to a consistent baseline in about 8 minutes across all 20 clips. I tweaked contrast and skin tones manually after that, but the initial match eliminated what used to be 30-40 minutes of scope-matching between cameras.&lt;/p&gt;

&lt;p&gt;Final review is the step that AI cannot help with at all, and I don't think it will anytime soon. I watched every single clip start to finish, checking for subtitle errors, awkward cuts, branding consistency, and whether each clip actually made sense as a standalone piece. This took about 2 hours and I caught problems in almost a third of them — a subtitle that said "their" instead of "there," a clip that started mid-sentence because the AI timestamp was off by two seconds, a lower-third that covered the guest's face in one shot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time: 2.5 hours (unavoidable manual review)&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Math
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;AI-Assisted&lt;/th&gt;
&lt;th&gt;Fully Manual&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transcription + clip selection&lt;/td&gt;
&lt;td&gt;40 min&lt;/td&gt;
&lt;td&gt;3.5 hrs&lt;/td&gt;
&lt;td&gt;2 hrs 50 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rough cuts + vertical reframe&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;td&gt;2.5 hrs&lt;/td&gt;
&lt;td&gt;1 hr 45 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subtitles + brand templates&lt;/td&gt;
&lt;td&gt;1.5 hrs&lt;/td&gt;
&lt;td&gt;5 hrs&lt;/td&gt;
&lt;td&gt;3 hrs 30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Color match + final review&lt;/td&gt;
&lt;td&gt;2.5 hrs&lt;/td&gt;
&lt;td&gt;3.5 hrs&lt;/td&gt;
&lt;td&gt;1 hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~5.5 hrs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~14.5 hrs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~9 hrs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Spread across three working days with client feedback loops, that's 3 days instead of 7. Real savings, not theoretical.&lt;/p&gt;

&lt;p&gt;The tools that did the heavy lifting: Descript for transcription, NemoVideo for batch rough cuts and reframing, and DaVinci Resolve for final color and export.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Still Can't Do
&lt;/h2&gt;

&lt;p&gt;I could write another article about what worked, but the more useful list is what didn't.&lt;/p&gt;

&lt;p&gt;AI cannot tell you which moments will perform on social media. It can find quotes and highlight reels, but it has no sense of what makes someone stop scrolling. That judgment is still entirely yours.&lt;/p&gt;

&lt;p&gt;AI cannot handle multi-speaker framing reliably. The second you have two people in frame with overlapping dialogue, every auto-framing tool I've tested gets confused. Plan on manual fixes for 25-30% of your clips if your source has multiple speakers.&lt;/p&gt;

&lt;p&gt;AI cannot do your final QA pass. I've tried trusting the output without a manual review exactly once. The client found a subtitle error in the first clip they watched. Never again.&lt;/p&gt;

&lt;p&gt;Three days instead of seven is genuinely useful. But the three days that remain are the ones that require actual editorial judgment — and those aren't going anywhere.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>video</category>
      <category>productivity</category>
      <category>podcast</category>
    </item>
    <item>
      <title>Every AI Video Demo Is Lying to You (And What Actually Works in Production)</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Thu, 26 Mar 2026 00:48:18 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/every-ai-video-demo-is-lying-to-you-and-what-actually-works-in-production-59jl</link>
      <guid>https://dev.to/weizhang_dev/every-ai-video-demo-is-lying-to-you-and-what-actually-works-in-production-59jl</guid>
      <description>&lt;h1&gt;
  
  
  Every AI Video Demo Is Lying to You (And What Actually Works in Production)
&lt;/h1&gt;

&lt;p&gt;I've been a video editor for about five years. In the last twelve months, I've tested pretty much every AI video tool that's crossed my feed — and I'm tired of the gap between what these tools promise and what they actually deliver when a client is waiting.&lt;/p&gt;

&lt;p&gt;This isn't a listicle of "top 10 AI tools." This is what happened when I tried to use them for real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Demo Problem
&lt;/h2&gt;

&lt;p&gt;Scroll through r/aivideo on any given day and you'll see incredible clips. A diver gliding through impossible underwater geometry. Cinematic one-shots that look like they cost $50K to produce. "Absolute cinema," the comments say.&lt;/p&gt;

&lt;p&gt;And they're right — as isolated clips, the output is stunning. The problem starts when you try to make something &lt;em&gt;with&lt;/em&gt; these tools rather than just &lt;em&gt;from&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;Here's what I mean. A client needed a 90-second product video last month. Simple brief: show the product in three environments, maintain consistent branding, match their existing color palette. I figured I'd try the AI route since the demos made it look effortless.&lt;/p&gt;

&lt;p&gt;Three days later I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;47 generated clips, none with consistent lighting&lt;/li&gt;
&lt;li&gt;A product that changed shape slightly between every shot&lt;/li&gt;
&lt;li&gt;Zero usable footage that matched the brand guidelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ended up shooting it traditionally in half a day. The AI "shortcut" cost me three days and a very awkward status update to the client.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Demos Always Look Better Than Products
&lt;/h2&gt;

&lt;p&gt;There's a simple reason every AI video demo looks incredible: demos have no constraints.&lt;/p&gt;

&lt;p&gt;Nobody's demo reel needs to match a brand book. Nobody's demo needs consistent characters across 15 shots. Nobody's demo gets reviewed by a client who notices the logo is slightly different in frame 847.&lt;/p&gt;

&lt;p&gt;Professional video work is almost entirely about consistency and control — exactly the two things current AI generation is worst at. You can get one great frame. Getting 2,700 frames (that's 90 seconds at 30fps) that all look like they belong together? That's where everything falls apart.&lt;/p&gt;

&lt;p&gt;The Sora shutdown this week proved this at scale. OpenAI reportedly burned through compute generating clips that looked great as Twitter posts but couldn't sustain a two-minute coherent scene. If they couldn't make it work with billions in infrastructure, your $20/month subscription probably isn't getting there either.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works (The Boring Stuff Nobody Demos)
&lt;/h2&gt;

&lt;p&gt;Here's the thing — AI &lt;em&gt;is&lt;/em&gt; genuinely useful in video production. Just not the way the demos suggest. The tools that have stuck in my daily workflow are all on the editing side, not the creation side:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-transcription and subtitle generation.&lt;/strong&gt; This used to take me the better part of an hour for a ten-minute interview. Now it takes about 45 seconds with decent accuracy. I still fix errors but the baseline is good enough that it's a net time save every single project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rough cut from scripts.&lt;/strong&gt; I write markers in my script, the tool matches them to footage based on transcript and visual content, and I get a rough cut that's maybe 60% of the way there. Still needs heavy manual work but it kills the blank-NLE-project problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Color matching between cameras.&lt;/strong&gt; Shot a conference last month with three different camera setups. The AI color matcher got all three to a consistent baseline in about ten minutes. Manual grading from there but it killed the tedious first pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noise reduction and upscaling.&lt;/strong&gt; Old footage, badly lit interviews, phone recordings from clients — AI handles these better than any manual approach I've tried. Topaz has been in my toolkit for two years and it's one of the few AI tools I'd genuinely miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chat-based editing interfaces.&lt;/strong&gt; This one surprised me. Tools like NemoVideo let you tell it what you want in plain language instead of hunting through menus. "Trim the first 15 seconds, add a fade, match the color to the previous clip" — and it just does it. Not flashy, but it's cut my editing time on routine projects by roughly a third.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;The AI video industry has a positioning problem. All the money and attention goes toward generation — making something from nothing. That's the sexy demo. That's what gets 1,000 upvotes on Reddit.&lt;/p&gt;

&lt;p&gt;But the actual market need is editing assistance — making existing footage better, faster, cheaper. The editors who are saving real time with AI aren't posting demos. They're just quietly getting projects done in three days instead of five.&lt;/p&gt;

&lt;p&gt;I think we're going to see this split widen. The generation side will keep producing incredible demos and keep failing in production. The editing side will keep being boring and keep actually working.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Actual Toolkit (March 2026)
&lt;/h2&gt;

&lt;p&gt;For anyone who's curious what I'm actually using day-to-day, here's the honest list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DaVinci Resolve&lt;/strong&gt; — Primary NLE, free tier handles 90% of my needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topaz Video AI&lt;/strong&gt; — Noise reduction and upscaling, $199 one-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descript&lt;/strong&gt; — Transcription-based editing for interview content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NemoVideo&lt;/strong&gt; — Chat-based editing for routine projects, genuinely surprised by how much time it saves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After Effects&lt;/strong&gt; — Motion graphics, some things still need manual keyframing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what's not on the list: any AI generation tool. Not because they can't make cool clips. Because I can't use cool clips that change the product's shape between shots.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you're evaluating AI video tools for actual production work, ignore the demos. Ask three questions instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can it maintain consistency across an entire project? (Not one clip — an entire project.)&lt;/li&gt;
&lt;li&gt;Does it save time on tasks I already do, or does it create new tasks I didn't have before?&lt;/li&gt;
&lt;li&gt;Would I trust this output enough to send it to a client without manually checking every frame?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to any of those is no, you're looking at a toy, not a tool. And right now, most AI generation products are still toys. The editing tools? Some of those are genuinely ready.&lt;/p&gt;





</description>
      <category>ai</category>
      <category>video</category>
      <category>productivity</category>
      <category>tools</category>
    </item>
    <item>
      <title>Why Sora Failed: What Actually Works in AI Video Editing Right Now</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Wed, 25 Mar 2026 21:29:36 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/why-sora-failed-what-actually-works-in-ai-video-editing-right-now-2ko</link>
      <guid>https://dev.to/weizhang_dev/why-sora-failed-what-actually-works-in-ai-video-editing-right-now-2ko</guid>
      <description>&lt;h1&gt;
  
  
  Why Sora Failed: What Actually Works in AI Video Editing Right Now
&lt;/h1&gt;

&lt;p&gt;OpenAI shut down Sora this week. Disney pulled out of their deal. And honestly? I'm not surprised at all.&lt;/p&gt;

&lt;p&gt;I've been editing video professionally for about five years, and I spent the last twelve months testing every AI video tool I could get my hands on. Sora included. Here's what I learned — and what I think actually matters going forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Demo-to-Reality Gap
&lt;/h2&gt;

&lt;p&gt;Sora's launch demos were incredible. Photorealistic cityscapes, smooth camera movements, consistent lighting. The problem was that none of that translated to actual production work.&lt;/p&gt;

&lt;p&gt;When I tried using Sora for a client's product demo back in January, the results were unusable. Character faces morphed between shots. Lighting shifted randomly mid-scene. I generated the same 10-second clip maybe 40 times trying to get two consecutive shots where the main character looked like the same person. Never got there.&lt;/p&gt;

&lt;p&gt;This wasn't just a Sora problem — Runway, Kling, Pika, they all have it. Text-to-video generation sounds revolutionary until you need to produce something a client will actually pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Killed Sora Specifically
&lt;/h2&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute costs were brutal.&lt;/strong&gt; Every generation burned through GPU time that OpenAI needed for their core language model business. When you're spending millions on inference for a product that most users treat as a toy, the math doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No moat.&lt;/strong&gt; Google's Veo, Runway's Gen-3, Kling 3.0 — the space got crowded fast. Sora had first-mover hype but not first-mover advantage. By the time it launched publicly, cheaper alternatives existed that produced comparable output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The use case was wrong.&lt;/strong&gt; Sora targeted the "type a sentence, get a video" market. But professional editors don't want to type sentences. They have footage already. They need help with the tedious parts of working with that footage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI Actually Saves Me Time
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody talks about in the AI video hype cycle: the boring applications work. They've been working for over a year now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-transcription and captioning.&lt;/strong&gt; I used to spend 45 minutes manually transcribing a 10-minute interview. Now it takes 30 seconds and the accuracy is above 95%. This alone changed my workflow more than any generation tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rough cuts from script markers.&lt;/strong&gt; I work on interview-driven content. Being able to feed in a script and have the tool pull matching segments from 3 hours of raw footage — that saves me an entire afternoon per project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Color matching between cameras.&lt;/strong&gt; Multi-cam shoots where one camera is slightly warmer than the other used to mean 20 minutes of manual adjustment per scene. AI handles this in seconds and gets it right maybe 85% of the time. The remaining 15% still needs manual tweaking, but 85% automation on a tedious task is genuinely useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart audio cleanup.&lt;/strong&gt; Background noise removal has gotten scary good. I had a client shoot an interview next to a construction site — two years ago that footage would've been unusable. Ran it through AI noise removal and it sounded like a studio recording.&lt;/p&gt;

&lt;p&gt;Tools like NemoVideo have been leaning into this practical direction — focusing on the editing workflow rather than generation from scratch. You tell it what you want done to existing footage instead of trying to conjure something from a text prompt. It's less flashy than "here's a video of a cat riding a skateboard through Tokyo" but it's what actually ships to clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Goes Next
&lt;/h2&gt;

&lt;p&gt;I think we're about to see a real split in the AI video space:&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;generation side&lt;/strong&gt; (text-to-video, image-to-video) will keep improving but stay limited to social media content, prototyping, and creative experimentation. It won't replace professional production workflows for at least another 3-5 years. The consistency problem is that fundamental.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;editing assistance side&lt;/strong&gt; will quietly become standard. Within 18 months, I expect auto-transcription, smart rough cuts, and AI color grading to be built into every major NLE. The standalone tools that got there first — the ones that focused on making editors faster rather than replacing them — will either get acquired or become the new standard.&lt;/p&gt;

&lt;p&gt;Sora's failure isn't proof that AI in video doesn't work. It's proof that the industry was building the wrong thing. The editors I know don't want AI to make their videos for them. They want AI to handle the 40% of their job that's repetitive so they can spend more time on the 60% that's creative.&lt;/p&gt;

&lt;p&gt;That's a less exciting pitch than "generate anything from text." But it's the one that actually works.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>video</category>
      <category>openai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>ClawHub Just Changed Its Search Algorithm — Here's What I Found in the Source Code</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Mon, 23 Mar 2026 03:01:27 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/clawhub-just-changed-its-search-algorithm-heres-what-i-found-in-the-source-code-1cpe</link>
      <guid>https://dev.to/weizhang_dev/clawhub-just-changed-its-search-algorithm-heres-what-i-found-in-the-source-code-1cpe</guid>
      <description>&lt;p&gt;My skill rankings crashed at 4 AM.&lt;/p&gt;

&lt;p&gt;Twelve #1 positions the night before. I checked at 4:18 AM and ten of them were gone. Two survived. The pattern was obvious once I saw it: the two survivors had their keyword directly in the slug. Everything relying on description text had vanished.&lt;/p&gt;

&lt;p&gt;I spent the next two hours in the &lt;a href="https://github.com/openclaw/clawhub" rel="noopener noreferrer"&gt;ClawHub source code&lt;/a&gt; figuring out what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two commits, one explanation
&lt;/h2&gt;

&lt;p&gt;On March 22 at 18:03 UTC, ClawHub pushed commit &lt;a href="https://github.com/openclaw/clawhub/commit/801cc550ab16" rel="noopener noreferrer"&gt;&lt;code&gt;801cc55&lt;/code&gt;&lt;/a&gt; — "fix: narrow skill package catalog search." One minute later, &lt;a href="https://github.com/openclaw/clawhub/commit/9ea750852f1a" rel="noopener noreferrer"&gt;&lt;code&gt;9ea7508&lt;/code&gt;&lt;/a&gt; landed to stabilize the change.&lt;/p&gt;

&lt;p&gt;The key diff is in &lt;code&gt;convex/skills.ts&lt;/code&gt;. Before the change, search worked like this: take a query string, scan up to 200 pages of the &lt;code&gt;skillSearchDigest&lt;/code&gt; index, score every result against the query vector, return the best matches.&lt;/p&gt;

&lt;p&gt;After the change, there's a new first step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exactSkill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;resolveSkillBySlugOrAlias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;queryText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactSkill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exactDigest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;skillSearchDigest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;by_skill&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;skillId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;exactSkill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;skill&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactDigest&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;skillCatalogMatchesFilters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactDigest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exactScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scoreSkillCatalogResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactDigest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;queryText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactScore&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactDigest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;skillId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;exactScore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;package&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;toPublicSkillCatalogItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exactDigest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Search now tries to resolve the query as an exact slug first. If it finds a match, that skill gets scored and added to results before any vector scanning happens.&lt;/p&gt;

&lt;p&gt;The second change: the old paginated scan loop (up to &lt;code&gt;MAX_SKILL_CATALOG_SEARCH_SCAN_PAGES&lt;/code&gt; iterations) got replaced with a single page fetch. The constant was deleted entirely.&lt;/p&gt;
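&lt;p&gt;In paraphrase, the scan change looks roughly like the sketch below. This is my simplification of the diff, not the literal repo code; the names match the excerpt above where they can, and &lt;code&gt;PAGE_SIZE&lt;/code&gt; is a placeholder for whatever page size the real query uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// BEFORE 801cc55, paraphrased (same scope as the excerpt above):
// walk up to MAX_SKILL_CATALOG_SEARCH_SCAN_PAGES pages of the digest
// index, scoring everything against the query.
let cursor: string | null = null;
for (let page = 0; page &amp;lt; MAX_SKILL_CATALOG_SEARCH_SCAN_PAGES; page++) {
  const batch = await ctx.db
    .query("skillSearchDigest")
    .paginate({ cursor, numItems: PAGE_SIZE }); // PAGE_SIZE: my placeholder
  for (const digest of batch.page) {
    const score = scoreSkillCatalogResult(digest, queryText);
    if (score &amp;gt; 0) {
      matches.push({ score, package: toPublicSkillCatalogItem(digest) });
    }
  }
  if (batch.isDone) break;
  cursor = batch.continueCursor;
}

// AFTER: the exact slug/alias lookup from the excerpt runs first, and
// this loop collapses to a single page fetch. Anything sitting past
// page one of the index simply never gets scored.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;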

&lt;h2&gt;
  
  
  What this actually means
&lt;/h2&gt;

&lt;p&gt;Three things changed for skill authors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slug match is now guaranteed.&lt;/strong&gt; If someone searches "auto-caption" and your slug is &lt;code&gt;auto-caption&lt;/code&gt;, you're in the results. Period. Before this, you depended on vector similarity catching you during the page scan. Usually it did. Sometimes it didn't — which is exactly what &lt;a href="https://github.com/openclaw/openclaw/issues/52034" rel="noopener noreferrer"&gt;issue #52034&lt;/a&gt; reported. This is genuinely good news. If you picked a descriptive slug when you first published, you just got a safety net for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector-only skills got fragile.&lt;/strong&gt; My skills with keywords only in the description were scoring 1.7-1.9 through vector similarity. When the index rebuilt during the deploy, those embeddings temporarily disappeared. Slug-match skills didn't care because they hit the new exact path directly. I had a few skills that relied entirely on description keywords for certain search terms — every one of them dropped out overnight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scan window narrowed.&lt;/strong&gt; Removing the multi-page loop means vector search covers fewer candidates per query. If your skill's embedding sits deep in the index, it might not get scanned at all. This probably won't affect most skills, but if you're ranking on a long-tail keyword with a lot of competition, your margin just got thinner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix took 30 minutes
&lt;/h2&gt;

&lt;p&gt;I published a version bump on one of the affected skills — no code changes, just a patch version increment. Within 30 minutes, its rankings came back. The publish triggered a fresh &lt;code&gt;skillSearchDigest&lt;/code&gt; entry, which the new single-page scan picked up immediately.&lt;/p&gt;

&lt;p&gt;I then bumped the rest. By 5:30 AM, all twelve #1 positions were restored. Some scores actually came back slightly higher than before — video editing went from 3.217 to 3.242. Not sure if that's the new scoring function being more generous or just normal variance.&lt;/p&gt;

&lt;p&gt;One thing worth mentioning: I also noticed that skills from a newer account I'd been testing with disappeared from search entirely. Not just dropped in rank — gone. Even exact slug searches returned nothing. Still not sure if that's related to the algorithm change or a separate account-trust issue. Something to watch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell other skill authors
&lt;/h2&gt;

&lt;p&gt;Pick your slug carefully. Before this update, slug was already the highest-weight signal (I wrote about that in &lt;a href="https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh"&gt;my previous post&lt;/a&gt;). Now it's even more important because it has a dedicated resolution path that bypasses vector search entirely.&lt;/p&gt;

&lt;p&gt;If your rankings suddenly drop, try publishing a patch version. You're not fixing a bug — you're forcing a digest rebuild under the new scoring logic.&lt;/p&gt;
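
&lt;p&gt;Since the bump is a one-line edit to the SKILL.md frontmatter, it's easy to script. A minimal sketch, assuming your frontmatter carries a semver &lt;code&gt;version:&lt;/code&gt; line like mine does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pathlib
import re

# Hypothetical zero-change patch bump: rewrite the first `version:` line in
# SKILL.md so a republish forces ClawHub to rebuild the search digest.
def bump_patch(skill_md="SKILL.md"):
    path = pathlib.Path(skill_md)
    text = path.read_text()

    def bump(match):
        major, minor, patch = match.groups()
        return f"version: {major}.{minor}.{int(patch) + 1}"

    path.write_text(re.sub(r"version:\s*(\d+)\.(\d+)\.(\d+)", bump, text, count=1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
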

&lt;p&gt;And if you want to check where you stand right now, the search API is public:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://clawhub.ai/api/search?q=your+keyword&amp;amp;limit=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No auth needed. The &lt;code&gt;score&lt;/code&gt; field in the response is what determines your position.&lt;/p&gt;
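
&lt;p&gt;Here's the kind of one-shot checker I run against it. The endpoint and the &lt;code&gt;score&lt;/code&gt; field are real; the exact response shape (a JSON list of results with &lt;code&gt;slug&lt;/code&gt; keys) is my assumption, so adjust the parsing if yours differs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.parse
import urllib.request

# One-shot position check against the public search API. The response is
# assumed to be a JSON list of {slug, score} objects; adjust if it differs.
MY_SLUGS = {"ai-video-editing", "auto-caption"}   # example slugs to watch

def positions(query, limit=10):
    url = "https://clawhub.ai/api/search?" + urllib.parse.urlencode(
        {"q": query, "limit": limit}
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.load(resp)
    for rank, item in enumerate(results, start=1):
        mine = "  (mine)" if item.get("slug") in MY_SLUGS else ""
        print(f"#{rank}  {item.get('slug')}  score={item.get('score')}{mine}")

positions("video editing")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
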

&lt;p&gt;The search code is &lt;a href="https://github.com/openclaw/clawhub/blob/main/convex/skills.ts" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. Reading it took me two hours. It saved me from thinking my skills were broken when it was just the platform rebuilding its index.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on building AI video tools with OpenClaw. Previous posts: &lt;a href="https://dev.to/weizhang_dev/how-i-built-an-ai-video-editor-as-an-openclaw-skill-103j"&gt;How I Built an AI Video Editor&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep"&gt;What Broke When I Wrapped a Video API&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od"&gt;Automating TikTok and Reels&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24"&gt;Reverse-Engineering ClawHub's Top Video Skills&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh"&gt;12 #1 Rankings in 5 Days&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>showdev</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Got 12 Number One Rankings on ClawHub in 5 Days</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:20:51 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh</link>
      <guid>https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh</guid>
      <description>&lt;p&gt;I published 13 video editing skills on ClawHub over the span of a week. For the first three days, only two appeared in search results. By day five, twelve of them held the #1 spot for their target keywords.&lt;/p&gt;

&lt;p&gt;Nothing changed about the skills themselves. Same API, same functionality, same code. What changed was how I named and described them.&lt;/p&gt;

&lt;p&gt;Here's everything I learned about ClawHub's search ranking — with real numbers from my testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The slug is everything
&lt;/h2&gt;

&lt;p&gt;ClawHub uses vector search for skill discovery. I spent two weeks querying the search API (&lt;code&gt;/api/search?q=keyword&lt;/code&gt;) with different keywords and recording scores. The pattern was consistent across 30+ queries:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your slug contains the search keyword, you score 3.0+. If it doesn't, your ceiling is about 2.0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Slug&lt;/th&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;auto-caption&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"auto caption"&lt;/td&gt;
&lt;td&gt;3.147&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ai-video-editing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"video editing"&lt;/td&gt;
&lt;td&gt;3.217&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;nemo-subtitle&lt;/code&gt; (brand name)&lt;/td&gt;
&lt;td&gt;"add subtitles"&lt;/td&gt;
&lt;td&gt;1.859&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;nemo-shorts&lt;/code&gt; (brand name)&lt;/td&gt;
&lt;td&gt;"shorts maker"&lt;/td&gt;
&lt;td&gt;1.757&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two slugs contain the exact search term. They score 3.0+. The bottom two have brand-name slugs — they rank well because of description optimization, but they'll never break 2.0.&lt;/p&gt;

&lt;p&gt;I tested this by publishing a new skill with the slug &lt;code&gt;auto-caption&lt;/code&gt; for the keyword "auto caption." Within 6 hours it hit #1 at 3.099. A different skill covering the exact same feature, with a brand-name slug, had been stuck at 1.7 for days.&lt;/p&gt;

&lt;p&gt;The lesson hit hard: if you're serious about a keyword, put it in your slug. Not your description, not your displayName — your slug. Everything else is supplementary.&lt;/p&gt;

&lt;p&gt;There's a catch, though. Slugs are permanent. You can't rename them after publishing. So if you picked a vanity name like I did for my first batch (&lt;code&gt;nemo-video&lt;/code&gt;, &lt;code&gt;nemo-edit&lt;/code&gt;, &lt;code&gt;nemo-subtitle&lt;/code&gt;), you're locked into the description-optimization game with a lower ceiling. I ended up publishing separate skills with keyword-rich slugs to cover the gaps.&lt;/p&gt;

&lt;h2&gt;
  
  
  DisplayName: powerful but fragile
&lt;/h2&gt;

&lt;p&gt;After discovering the slug effect, I figured displayName was just cosmetic. Wrong.&lt;/p&gt;

&lt;p&gt;I ran an accidental experiment. While cleaning up my skills, I shortened a displayName from 70 characters to 30 — cut out keyword suffixes like "for TikTok, Reels, and YouTube Shorts" to make it look neater.&lt;/p&gt;

&lt;p&gt;Within hours, the skill dropped out of the top 15 results for every keyword it had ranked for. Not a gradual decline — gone. I restored the original long name and the rankings came back within an hour. No other changes.&lt;/p&gt;

&lt;p&gt;My read: displayName feeds into the same vector embedding as description. Every keyword you remove from it shrinks your footprint in the search space. The "cleaner" name was literally invisible to the queries that used to find it.&lt;/p&gt;

&lt;p&gt;The rule I follow now: stuff every relevant keyword into your displayName, readability second. &lt;code&gt;Video Caption Tool - Burn Captions, AI Subtitles and SRT Export&lt;/code&gt; is ugly. It also ranks.&lt;/p&gt;

&lt;h2&gt;
  
  
  First sentence of your description is disproportionately weighted
&lt;/h2&gt;

&lt;p&gt;This one came from debugging a specific failure. My subtitle skill wasn't ranking for "add subtitles" even though those exact words appeared in the description — in the third sentence.&lt;/p&gt;

&lt;p&gt;I moved "Add subtitles" to the very first word of the description. Next index cycle, the skill jumped from outside top 10 to #1 for that query.&lt;/p&gt;

&lt;p&gt;The practical takeaway: open your description with the exact keyword phrase you want to rank for. Not a paraphrase, not a synonym — the literal words someone would type into the search box. Save the creative writing for sentence two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 AM crash
&lt;/h2&gt;

&lt;p&gt;This is the part nobody warns you about.&lt;/p&gt;

&lt;p&gt;Last night I checked rankings before bed — 12 keywords holding #1. At 4:18 AM I ran a routine scan. Ten of those twelve had vanished from the results entirely. Not dropped to #5 or #8. Gone.&lt;/p&gt;

&lt;p&gt;I spent the next hour figuring out what happened. The pattern was clear: skills with keyword-in-slug (like &lt;code&gt;auto-caption&lt;/code&gt;) were still ranked. Skills that relied on description keywords for their ranking had disappeared. The scores for competitor skills hadn't changed — our skills had simply been removed from the index.&lt;/p&gt;

&lt;p&gt;The fix was dumb. I bumped the version number on two affected skills (no code changes, just a version bump in the SKILL.md frontmatter) and republished. Within 30 minutes, both were back at #1 with scores slightly higher than before.&lt;/p&gt;

&lt;p&gt;My best guess: ClawHub periodically rebuilds its vector index, and description-derived embeddings are more volatile during rebuilds than slug-derived ones. Slug matches are probably handled by a separate scoring path that survives reindexing.&lt;/p&gt;

&lt;p&gt;The practical defense: monitor your rankings, and if something falls off a cliff overnight, try republishing. A version bump with zero changes was enough to re-enter the index.&lt;/p&gt;
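
&lt;p&gt;If you want that monitoring automated, here's a cron-able sketch: it snapshots each keyword's result list and flags slugs that vanish between runs. The response shape (a JSON list of objects with &lt;code&gt;slug&lt;/code&gt; keys) is an assumption on my part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import pathlib
import urllib.parse
import urllib.request

# Drop detector: compare this run's results against the last snapshot and
# flag slugs that vanished. The {slug, score} response shape is assumed.
SNAPSHOT = pathlib.Path("rank-snapshot.json")
KEYWORDS = ["video editing", "auto caption", "add subtitles"]

def fetch(query):
    url = "https://clawhub.ai/api/search?" + urllib.parse.urlencode(
        {"q": query, "limit": 10}
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return [item["slug"] for item in json.load(resp)]

previous = json.loads(SNAPSHOT.read_text()) if SNAPSHOT.exists() else {}
current = {kw: fetch(kw) for kw in KEYWORDS}
for kw, slugs in previous.items():
    for slug in set(slugs) - set(current.get(kw, [])):
        print(f"{slug} fell out of results for {kw!r}; consider a version bump")
SNAPSHOT.write_text(json.dumps(current))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
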

&lt;h2&gt;
  
  
  Where this leaves things
&lt;/h2&gt;

&lt;p&gt;Right now, 12 of my 13 skills hold a #1 ranking for at least one keyword. The 13th has a brand-name slug and a crowded keyword — it ranks #3, which is fine.&lt;/p&gt;

&lt;p&gt;Other skill authors are hitting the same walls. There's a thread on the OpenClaw repo (&lt;a href="https://github.com/openclaw/openclaw/issues/50090" rel="noopener noreferrer"&gt;#50090&lt;/a&gt;) where several of us have been sharing data on what we've been calling "invisible trigger failures": skills that load fine but never get selected because the search ranking is opaque.&lt;/p&gt;

&lt;p&gt;What would actually fix this: a simple dashboard in ClawHub showing skill authors which queries match their skill and where they rank. The search API exists. The data is there. It just isn't surfaced to the people who need it most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Update (March 23)
&lt;/h2&gt;

&lt;p&gt;Found the root cause. ClawHub &lt;a href="https://github.com/openclaw/clawhub/commits/main" rel="noopener noreferrer"&gt;pushed two commits&lt;/a&gt; on March 22 at 18:03 UTC: "fix: narrow skill package catalog search" (801cc55) and "stabilize package catalog search." The first added an exact-slug-match priority path to the search function (&lt;code&gt;resolveSkillBySlugOrAlias&lt;/code&gt; runs first, then vector search fills the remaining slots). It also deleted the &lt;code&gt;MAX_SKILL_CATALOG_SEARCH_SCAN_PAGES&lt;/code&gt; constant (previously 200) and restructured the pagination loop, narrowing the vector scan range.&lt;/p&gt;

&lt;p&gt;The overnight crash happened about 2 hours after this deploy. Skills relying on description-keyword matches fell out of the index during the transition. Slug-match skills stayed because they now hit the new exact-match path directly.&lt;/p&gt;

&lt;p&gt;The "republish to force reindex" workaround still works — but now I understand why. You're not fixing a random glitch. You're forcing the new search function to rebuild your skill's search digest entry under the updated scoring logic.&lt;/p&gt;

&lt;p&gt;Worth noting: the new exact-match path is actually good for skill authors long-term. If your slug matches the query, you're now guaranteed to appear in results. That wasn't true before (which is exactly what &lt;a href="https://github.com/openclaw/openclaw/issues/52034" rel="noopener noreferrer"&gt;#52034&lt;/a&gt; reported).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building AI video tools with OpenClaw. Previous posts: &lt;a href="https://dev.to/weizhang_dev/how-i-built-an-ai-video-editor-as-an-openclaw-skill-103j"&gt;How I Built an AI Video Editor&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep"&gt;What Broke When I Wrapped a Video API&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od"&gt;Automating TikTok and Reels&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24"&gt;Reverse-Engineering ClawHub's Top Video Skills&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>showdev</category>
      <category>ai</category>
      <category>video</category>
    </item>
    <item>
      <title>I Analyzed ClawHub's 4 Most-Downloaded Video Skills — The Top One Has 65 Installs and Here's Why</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:08:39 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24</link>
      <guid>https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24</guid>
      <description>&lt;p&gt;We shipped a video editing skill on ClawHub earlier this month. Downloads ticked up to about 200. But installs? Zero.&lt;/p&gt;

&lt;p&gt;That number bugged me. So I did what any obsessive developer would do: I downloaded the SKILL.md files from every video-related skill I could find and started reading them line by line.&lt;/p&gt;

&lt;p&gt;Four stood out. Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lineup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;API Key needed?&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pexoai-agent&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Yes (PEXO_API_KEY)&lt;/td&gt;
&lt;td&gt;AI video production, 5-60s clips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ffmpeg-video-editor&lt;/td&gt;
&lt;td&gt;393&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Natural language → FFmpeg commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video-subtitles&lt;/td&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;SRT generation + burn-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;video-frames&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Extract frames from video&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Already, something jumps out. The two skills with the most real-world traction — video-subtitles and ffmpeg-video-editor — need zero external API keys. You install them and they just work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Downloads aren't installs
&lt;/h2&gt;

&lt;p&gt;Pexo has a polished SKILL.md. Good structure, clear workflow, even a clever "delivery worker" metaphor for how the AI should behave. But it requires &lt;code&gt;PEXO_API_KEY&lt;/code&gt; and &lt;code&gt;PEXO_BASE_URL&lt;/code&gt; before anything happens. That's a signup, a dashboard visit, and a copy-paste before your first video.&lt;/p&gt;

&lt;p&gt;Meanwhile video-frames is 29 lines long. It needs ffmpeg (which most dev machines already have) and nothing else. First frame extraction works in one command.&lt;/p&gt;

&lt;p&gt;The friction difference is enormous. Every step between "install" and "first result" costs you users. We had the same problem — our skill needed a token setup flow that, while automatic, still felt like a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: 67 lines beats 300
&lt;/h2&gt;

&lt;p&gt;video-subtitles does one thing well: transcribe audio, generate SRT, optionally burn subtitles into the video. The entire SKILL.md is 67 lines. There's a Quick Start section with five copy-paste examples right at the top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Plain transcript&lt;/span&gt;
./scripts/generate_srt.py video.mp4

&lt;span class="c"&gt;# Burn subtitles into video&lt;/span&gt;
./scripts/generate_srt.py video.mp4 &lt;span class="nt"&gt;--srt&lt;/span&gt; &lt;span class="nt"&gt;--burn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare that to our skill at the time — over 200 lines of API documentation, session management flows, token refresh logic. All necessary for our architecture, but the AI agent reading that file has to parse through a lot before it knows what to do.&lt;/p&gt;

&lt;p&gt;The lesson isn't "write less." It's that the first 20 lines matter more than the remaining 180. If your Quick Start doesn't give the agent a working command in under 10 lines, you've already lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 3: The language rule nobody thinks about
&lt;/h2&gt;

&lt;p&gt;Pexo's SKILL.md has a section I'd never seen before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## ⚠️ LANGUAGE RULE (highest priority)&lt;/span&gt;

You MUST reply to the user in the SAME language they use.
This is non-negotiable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. Obvious in hindsight. If your skill works globally — and ClawHub skills do — the AI should respond in whatever language the user speaks. We never specified this. Our skill defaulted to English regardless of input, which probably confused every non-English user who tried it.&lt;/p&gt;

&lt;p&gt;One line in your SKILL.md fixes this. Pexo marks it as "highest priority," above the actual workflow. That tells me they learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 4: FFmpeg skills win because FFmpeg is already there
&lt;/h2&gt;

&lt;p&gt;ffmpeg-video-editor is basically a prompt template. It doesn't call any API. It doesn't upload anything. It translates "trim this video from 1:21 to 1:35" into an ffmpeg command and runs it locally.&lt;/p&gt;

&lt;p&gt;That's it. And it works because ffmpeg is already installed on most machines that would run OpenClaw. Zero network latency, zero API costs, zero auth.&lt;/p&gt;
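
&lt;p&gt;To make that concrete, here's a toy version of the pattern. The real skill lets the model do the translation; this hard-coded sketch only understands one phrasing, but it shows why no upload, auth, or API is involved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
import subprocess

# Toy natural-language-to-ffmpeg translation; handles only one phrasing.
def trim(request, src, dst="out.mp4"):
    match = re.search(r"from (\d+):(\d+) to (\d+):(\d+)", request)
    if match is None:
        raise ValueError("no timestamps found in request")
    m1, s1, m2, s2 = (int(g) for g in match.groups())
    start, end = m1 * 60 + s1, m2 * 60 + s2
    # stream copy: no re-encode, runs in seconds on a local file
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
         "-c", "copy", dst],
        check=True,
    )

trim("trim this video from 1:21 to 1:35", "input.mp4")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
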

&lt;p&gt;There's a ceiling to this approach: you can't do AI-generated scenes or text-to-video with local ffmpeg. But for the 80% of editing tasks that are just "cut, crop, convert," it's hard to beat.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;After this analysis, we rewrote our SKILL.md with three things in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Quick Start first.&lt;/strong&gt; The agent should know how to make a basic edit within the first 10 lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce the auth wall.&lt;/strong&gt; Anonymous tokens that auto-generate on first use — no signup required for basic edits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add the language rule.&lt;/strong&gt; One paragraph, borrowed directly from Pexo's approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We also split our monolithic skill into focused ones — a subtitle tool, a shorts maker, a color grading tool — each with a tight description that matches how people actually search.&lt;/p&gt;

&lt;p&gt;Still early. Still zero installs. But the SKILL.md reads like something an AI agent can actually follow now, and that feels like the right foundation.&lt;/p&gt;

&lt;p&gt;If you're building OpenClaw skills, go read the SKILL.md files of what's already working. The patterns are right there. You can find our video editing skills on &lt;a href="https://clawhub.ai" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt; by searching "video editing" or "subtitles."&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is part of a series on building AI video tools with OpenClaw. Previous posts: &lt;a href="https://dev.to/weizhang_dev/how-i-built-an-ai-video-editor-as-an-openclaw-skill-103j"&gt;How I Built an AI Video Editor as an OpenClaw Skill&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep"&gt;I Wrapped a Video Editing API for AI - Here is What Broke&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od"&gt;I use OpenClaw to automate my entire TikTok and Reels workflow&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>showdev</category>
      <category>ai</category>
      <category>video</category>
    </item>
    <item>
      <title>I use OpenClaw to automate my entire TikTok and Reels workflow</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Sat, 21 Mar 2026 21:27:51 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od</link>
      <guid>https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od</guid>
      <description>&lt;p&gt;Let me describe my Tuesday evenings three months ago.&lt;/p&gt;

&lt;p&gt;I'd shoot four to five short clips throughout the week — product demos, quick tutorials, a behind-the-scenes moment. Good content. Then Tuesday would arrive and I'd spend three to four hours staring at a timeline in DaVinci Resolve, doing the same things I'd done the Tuesday before: trim the dead air, add subtitles, resize to 9:16, drop in background music at a sane volume, export three versions for TikTok, Reels, and Shorts.&lt;/p&gt;

&lt;p&gt;I wasn't editing. I was operating a conveyor belt.&lt;/p&gt;

&lt;p&gt;The actual creative decisions took maybe twenty minutes. The rest was clicking through dialogs and watching export progress bars. I kept thinking: &lt;em&gt;this should not require a human.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Found
&lt;/h2&gt;

&lt;p&gt;I've been an &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; user for about a year — I use it to manage my calendar, draft emails, handle some light scripting for my projects. It's basically a personal AI that runs locally and you can extend with community-built skills.&lt;/p&gt;

&lt;p&gt;A few months ago I noticed a video editing skill on &lt;a href="https://clawhub.ai" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt;. The description was straightforward: edit videos by chatting with your AI assistant.&lt;/p&gt;

&lt;p&gt;I was skeptical. I've used enough AI video tools to know they usually mean "we'll describe what edits you should make" not "we'll actually make the edits." But I had a boring Tuesday coming up, so I tried it.&lt;/p&gt;

&lt;p&gt;The short version: it actually makes the edits. You install the skill, drop a video file into your OpenClaw workspace, and start talking.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Workflow, Step by Step
&lt;/h2&gt;

&lt;p&gt;I've settled into a pretty consistent routine. Here's exactly how a typical video goes from raw footage to three exported files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Drop the file
&lt;/h3&gt;

&lt;p&gt;I record on my phone or mirrorless, AirDrop to my laptop, and drag the file into &lt;code&gt;~/.openclaw/workspace/inbox/&lt;/code&gt;. That's the only manual step that stays manual.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Open OpenClaw and start a session
&lt;/h3&gt;

&lt;p&gt;I type one message to kick things off:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"New video in inbox: &lt;code&gt;demo-march-14.mp4&lt;/code&gt;. It's a 4-minute product walkthrough. I need it trimmed to under 90 seconds, subtitled, and exported in three sizes. Let's go."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenClaw picks up the file, calls the video editing skill, and we're off.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Let it run
&lt;/h3&gt;

&lt;p&gt;From here it's a back-and-forth, but it goes fast. The skill knows how to handle the heavy lifting — it sends the video to the processing pipeline, streams back status updates, and when it's done it tells me exactly where the output landed.&lt;/p&gt;

&lt;p&gt;The full session usually takes 8–12 minutes per video depending on length. I keep my laptop open but I'm not watching it. I'm making coffee.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompts I Actually Use
&lt;/h2&gt;

&lt;p&gt;This is the part I wish someone had given me when I started. Here are my most-used prompts, copy-pasteable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For trimming + compression:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Trim the first 8 seconds and the last 5 seconds from [filename]. Then compress to under 50MB without dropping below 1080p.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For subtitles:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add auto-generated subtitles to [filename]. Burn them in at the bottom third, white text with a subtle black shadow, no background box.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For multi-platform export:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take [filename] and export three versions: one at original aspect ratio for YouTube, one cropped to 9:16 at 1080x1920 for TikTok and Reels, one at 1080x1920 with max 60 seconds for Shorts. Name them [basename]-youtube, [basename]-vertical, [basename]-shorts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For background music:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add [music-file.mp3] as background music to [filename]. Set the music volume to 15% and duck it to 5% whenever speech is detected.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For a full batch:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Process all .mp4 files in the inbox folder: trim 5 seconds from the start of each, add auto-subtitles, export as 9:16 vertical. Move originals to /inbox/processed when done.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last one is my favorite. I drop five clips, send one message, go to sleep. In the morning there are fifteen files (three versions each) waiting for me.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers: Before vs After
&lt;/h2&gt;

&lt;p&gt;I tracked my time for six weeks before switching and six weeks after. This isn't marketing copy — this is what I logged.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Before (per video)&lt;/th&gt;
&lt;th&gt;After (per video)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Trimming + cleanup&lt;/td&gt;
&lt;td&gt;12 min&lt;/td&gt;
&lt;td&gt;~0 (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subtitle generation + review&lt;/td&gt;
&lt;td&gt;18 min&lt;/td&gt;
&lt;td&gt;4 min (review only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resize + format variants&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;~0 (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Music + audio balance&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;td&gt;~0 (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export management&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;td&gt;~0 (automated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per video&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~63 min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~8 min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With four videos a week, that's roughly 220 minutes back every week. About 3.5 hours.&lt;/p&gt;

&lt;p&gt;Cost-wise: the skill runs on credits. I'm paying roughly $12–15 a month at my current volume. Before I was paying in time, which at any reasonable rate is worth way more than fifteen dollars.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;I'd be lying if I said this was perfect. Here's what still trips me up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Subtitle accuracy on technical jargon.&lt;/strong&gt; If I'm talking about specific software terms or product names, the auto-generated subtitles will sometimes mangle them. I still do a quick manual review on anything that's going to be public-facing. Takes about 3–4 minutes per video.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Music ducking isn't always perfect.&lt;/strong&gt; The speech detection for auto-ducking works well on clean recordings. If I shot in a noisy environment or there's significant background noise in my source, it can duck at weird moments. I've learned to specify stricter ducking parameters when this is a risk: &lt;code&gt;duck to 3% when speech confidence is above 80%&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large files and timeouts.&lt;/strong&gt; Anything over about 800MB can sometimes time out during upload, depending on my connection. I've started transcoding very long source files to H.264 with a quick local ffmpeg pass before sending them. Not ideal, but it's a one-liner.&lt;/p&gt;
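
&lt;p&gt;In script form, the pre-pass looks something like this. The 800MB threshold and the encoder settings are my defaults, not anything the skill requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pathlib
import subprocess

# Pre-pass for big sources: re-encode to H.264 before upload. The threshold
# and the CRF/preset values are my defaults, nothing the skill requires.
def pre_transcode(src, limit_bytes=800 * 1024 * 1024):
    src = pathlib.Path(src)
    if src.stat().st_size &amp;lt;= limit_bytes:
        return src                       # small enough to upload as-is
    dst = src.with_name(src.stem + "-h264.mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-preset", "fast", "-crf", "23",
         "-c:a", "aac", str(dst)],
        check=True,
    )
    return dst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
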

&lt;p&gt;&lt;strong&gt;The skill can't watch your screen.&lt;/strong&gt; This is OpenClaw operating on files, not a GUI agent. If your workflow depends on real-time visual feedback during editing (color grading, precise subtitle positioning, etc.), you'll still need to do that part yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Get Started
&lt;/h2&gt;

&lt;p&gt;If you're already an OpenClaw user, setup is about five minutes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://clawhub.ai/imo14reifey/video-editor-ai" rel="noopener noreferrer"&gt;ClawHub&lt;/a&gt; and search for a video editing skill&lt;/li&gt;
&lt;li&gt;Install the skill:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx clawhub@latest &lt;span class="nb"&gt;install &lt;/span&gt;video-editor-ai &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;You'll be prompted to register for an API key when you first use it (there are free credits to test with)&lt;/li&gt;
&lt;li&gt;Drop a video in your workspace and start talking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Source code is on &lt;a href="https://github.com/nemovideo/nemovideo_skills" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; if you want to see how the skill is structured.&lt;/p&gt;

&lt;p&gt;If you're not yet an OpenClaw user — it's a local AI assistant you run yourself, kind of like having a personal Claude or GPT that also has access to your filesystem, your apps, and community-built skills. You can find it at &lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;openclaw.ai&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;I still enjoy video editing when it's actually editing — when I'm making a real creative decision about pacing or structure or what to cut. What I don't enjoy is the mechanical repetition that comes after.&lt;/p&gt;

&lt;p&gt;OpenClaw with a good video editing skill drew a clear line between those two things. The creative stuff stays with me. The repetitive stuff goes to the machine.&lt;/p&gt;

&lt;p&gt;Three and a half hours a week adds up. I've spent that time shooting better content instead.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've got questions about specific prompts or use cases, drop them in the comments — happy to share what's worked.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;em&gt;This is part of a series on building AI video tools with OpenClaw. Previous posts: &lt;a href="https://dev.to/weizhang_dev/how-i-built-an-ai-video-editor-as-an-openclaw-skill-103j"&gt;How I Built an AI Video Editor&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep"&gt;What Broke When I Wrapped a Video API&lt;/a&gt; | Next: &lt;a href="https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24"&gt;Reverse-Engineering Top Video Skills&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh"&gt;12 #1 Rankings in 5 Days&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>showdev</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Wrapped a Video Editing API for AI - Here is What Broke</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Fri, 20 Mar 2026 06:27:30 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep</link>
      <guid>https://dev.to/weizhang_dev/5-things-i-learned-wrapping-a-gui-first-api-for-ai-agents-3gep</guid>
      <description>&lt;p&gt;Most AI agent tutorials assume you control both ends of the stack. You define the tools, you define the responses, everything is designed for programmatic access from the start.&lt;/p&gt;

&lt;p&gt;Real-world integrations are messier. The useful APIs — the ones that actually do something valuable — were usually built for humans with browsers. They return things like "click the Export button" and "drag the clip to the timeline." They assume someone is watching a screen.&lt;/p&gt;

&lt;p&gt;I spent the last few months building &lt;a href="https://clawhub.ai/imo14reifey/video-editor-ai" rel="noopener noreferrer"&gt;video-editor-ai&lt;/a&gt;, an OpenClaw skill that wraps a video editing backend originally designed for a web UI. Here's what I learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 1: You Need an Interception Layer, Not Just a Wrapper
&lt;/h2&gt;

&lt;p&gt;The naive approach is to forward everything from the user to the backend and return whatever comes back. This breaks immediately when the backend responds with GUI instructions.&lt;/p&gt;

&lt;p&gt;The backend I was working with would say things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Your video is ready! Click the Export button in the top right 
corner to download it."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you pass this directly to the user, they'll ask "what Export button?" The skill has no UI. There is no button.&lt;/p&gt;

&lt;p&gt;The fix isn't prompt engineering ("please don't mention buttons"). It's an architectural one: you need a &lt;strong&gt;translation layer&lt;/strong&gt; that sits between the backend response and the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;backend response → [GUI Translator] → user-visible response
                          ↓
                   if "click X" detected:
                   execute X directly via API
                   replace with: "✅ Done. Here's your file."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The translator needs to handle two cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cosmetic GUI references&lt;/strong&gt; ("check the dashboard") — strip and ignore&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actionable GUI references&lt;/strong&gt; ("click Export") — intercept and execute the actual API call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The second case is critical. If "click Export" reaches the user as text, nothing happens. If your translator catches it and calls the render endpoint, the user gets their file.&lt;/p&gt;
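
&lt;p&gt;Reduced to a sketch, the translator looks something like this. The verb table and the &lt;code&gt;api&lt;/code&gt; object are stand-ins; the real mapping covers far more of the backend's vocabulary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Minimal GUI-translator sketch. ACTIONABLE maps backend verbs to real API
# endpoints; COSMETIC phrases get stripped. Both tables are illustrative.
ACTIONABLE = {r"click (the )?export( button)?": "render"}
COSMETIC = [r"check (the )?dashboard"]

def translate(backend_text, api):
    for pattern, endpoint in ACTIONABLE.items():
        if re.search(pattern, backend_text, re.IGNORECASE):
            url = api.call(endpoint)      # execute the real call instead
            return f"Done. Here's your file: {url}"
    for pattern in COSMETIC:
        # drop the sentence containing the GUI reference, keep the rest
        backend_text = re.sub(
            rf"[^.!?]*{pattern}[^.!?]*[.!?]", "", backend_text,
            flags=re.IGNORECASE,
        )
    return backend_text.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
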

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Design for interception from day one. Map every GUI action in the backend's vocabulary to an API call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 2: Silent Edits Are Real and You Must Handle Them
&lt;/h2&gt;

&lt;p&gt;This one took me the longest to figure out.&lt;/p&gt;

&lt;p&gt;About 30% of edit operations in the backend return no user-visible text at all. The backend processes the edit, updates internal state, sends a stream of tool calls — and then closes the connection without saying anything.&lt;/p&gt;

&lt;p&gt;From the user's perspective: they type "remove the background music." Nothing happens. No confirmation, no error. Just silence.&lt;/p&gt;

&lt;p&gt;The first instinct is to retry. That's wrong. The edit succeeded. Retrying it will apply the same edit twice, which in a credit-based system means you've now charged the user twice for one action.&lt;/p&gt;

&lt;p&gt;The correct approach is &lt;strong&gt;state diffing&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before sending the edit:
&lt;/span&gt;&lt;span class="n"&gt;state_before&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_current_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Send edit via SSE stream
&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;send_to_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# If stream closes with no text:
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;state_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_current_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_before&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;format_diff_as_confirmation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# "✅ BGM track removed (was: Lo-fi Beats, 0–45s)"
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The edit didn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t seem to take effect. Want to try again?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The diff-based confirmation also makes the UX much better than a generic "done." Users want to know &lt;em&gt;what&lt;/em&gt; changed, not just that something happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Never assume a silent response means failure. Check state before and after any operation. Build state diffing before you build anything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 3: Billing and Credits Belong to the Agent, Not the Backend
&lt;/h2&gt;

&lt;p&gt;This is a UX problem that becomes a trust problem quickly.&lt;/p&gt;

&lt;p&gt;The backend I was wrapping charged credits for processing operations. It also had an export/render endpoint that was completely free. But from the user's perspective — talking to an agent — this distinction is invisible.&lt;/p&gt;

&lt;p&gt;Early in development, users would ask "how many credits do I have left?" The agent would forward this to the backend, which would respond: "You can check your credit balance on the dashboard under Account Settings."&lt;/p&gt;

&lt;p&gt;Two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There is no dashboard. The skill is the interface.&lt;/li&gt;
&lt;li&gt;Even if there were, this is exactly the kind of round-trip that should never hit the backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is a &lt;strong&gt;pre-flight router&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;incoming message
      ↓
  [Router]
  ├── "credits" / "balance" / "how much left" → call balance API directly
  ├── "export" / "download" / "send me the file" → call render API directly  
  ├── "upload" / user attaches a file → call upload API directly
  └── everything else → forward to backend via SSE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router catches intent before it reaches the backend. It means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credit checks are instant (no SSE stream overhead)&lt;/li&gt;
&lt;li&gt;Exports never accidentally trigger new generations&lt;/li&gt;
&lt;li&gt;Upload flow is deterministic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule of thumb: &lt;strong&gt;any operation with a known, fixed API endpoint should never go through the conversational backend&lt;/strong&gt;. The backend is for things that require interpretation. Credit checks don't require interpretation.&lt;/p&gt;
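
&lt;p&gt;In code, the router is barely more than a keyword table. A sketch, with example keyword lists and endpoint names rather than the skill's actual tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Pre-flight router sketch: deterministic intents never reach the backend.
# The keyword lists and endpoint names are examples, not the real tables.
DIRECT_ROUTES = {
    ("credits", "balance", "how much left"): "get_balance",
    ("export", "download", "send me the file"): "render",
    ("upload",): "upload",
}

def route(message):
    lowered = message.lower()
    for keywords, endpoint in DIRECT_ROUTES.items():
        if any(k in lowered for k in keywords):
            return ("direct", endpoint)   # fixed API call, no SSE overhead
    return ("backend", None)              # needs interpretation: forward via SSE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
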

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Map your API surface area before you write any prompt logic. Identify which operations are deterministic (route them directly) vs. which require the backend's reasoning (route them through SSE).&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 4: Backend Error Messages Are Written for Humans, Not Agents
&lt;/h2&gt;

&lt;p&gt;Error handling in GUI-first APIs is designed for a support workflow, not programmatic consumption.&lt;/p&gt;

&lt;p&gt;Typical backend error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I encountered a temporary issue processing your request. 
Please try again or contact support at support@example.com 
if the problem persists."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This message is useless to an agent for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It doesn't say what failed&lt;/li&gt;
&lt;li&gt;"Try again" is dangerous if the failure was a credit deduction&lt;/li&gt;
&lt;li&gt;"Contact support" is a dead end in an automated flow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Worse: this exact message sometimes appears as a &lt;em&gt;trailing message&lt;/em&gt; after a successful operation. The backend completes the edit, sends the result, then appends a generic error epilogue as a separate SSE event. If your agent treats this as an error state, you get false negatives.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_sse_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;has_success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;final_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;looks_like_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;has_success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;final_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;looks_like_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;has_success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Trailing error after success = ignore
&lt;/span&gt;                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Genuine error = surface to user
&lt;/span&gt;                &lt;span class="n"&gt;final_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;translate_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;translate_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Map backend error vocabulary to actionable user messages
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temporary issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;backend_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The backend is busy — try again in 30 seconds.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;backend_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ve run out of credits. Get more at [link].&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# ... etc
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Something went wrong. Here&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the raw error: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;backend_message&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;don't surface backend error messages directly&lt;/strong&gt;. Translate them. Your agent knows the context (what was attempted, what state things are in) that the backend doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Build an error translation table early. Expect that the same error string from the backend can mean different things depending on when in the flow it appears.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lesson 5: Test With Transcripts, Not Unit Tests
&lt;/h2&gt;

&lt;p&gt;Standard unit testing doesn't map well to conversational agent skills. You can't easily mock a 300-second SSE stream, and the interesting failure modes only surface in real multi-turn conversations.&lt;/p&gt;

&lt;p&gt;What actually works: &lt;strong&gt;transcript testing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I ended up with a library of ~110 conversation transcripts — real interactions that exposed bugs, edge cases, or just confusing UX. Each transcript is a sequence of user messages and expected agent behaviors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# transcript: double-export.yaml&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;asks&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;export&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;immediately&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;after&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;generation"&lt;/span&gt;
&lt;span class="na"&gt;turns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;second&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;about&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ocean&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;waves"&lt;/span&gt;
    &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confirmation&lt;/span&gt;
        &lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;created"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it"&lt;/span&gt;
    &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file_delivered&lt;/span&gt;
        &lt;span class="na"&gt;not_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generating"&lt;/span&gt;  &lt;span class="c1"&gt;# should NOT start a new generation&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credits_unchanged&lt;/span&gt;     &lt;span class="c1"&gt;# export is free, credits should not decrease&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach catches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regression bugs&lt;/strong&gt;: does fixing the silent edit problem break the export flow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UX issues&lt;/strong&gt;: does the phrasing of confirmations actually make sense in context?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge cases&lt;/strong&gt;: what happens if the user asks to export a video that's still generating?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transcripts also serve as documentation. New contributors can read them to understand how the skill is supposed to behave in specific scenarios — something a SKILL.md instruction file can't fully capture.&lt;/p&gt;
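
&lt;p&gt;The runner doesn't need to be fancy either. A stripped-down sketch, where &lt;code&gt;run_agent&lt;/code&gt; stands in for whatever drives a real session and only the &lt;code&gt;contains&lt;/code&gt;/&lt;code&gt;not_contains&lt;/code&gt; checks are implemented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml  # pip install pyyaml

# Minimal transcript runner: replay user turns, assert expectations.
# `run_agent` stands in for whatever drives a real session.
def run_transcript(path, run_agent):
    with open(path) as f:
        spec = yaml.safe_load(f)
    for turn in spec["turns"]:
        reply = run_agent(turn["user"])
        for exp in turn.get("expect", []):
            if "contains" in exp:
                assert exp["contains"] in reply, f"missing {exp['contains']!r}"
            if "not_contains" in exp:
                assert exp["not_contains"] not in reply, "unexpected text in reply"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
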

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Start collecting transcripts from your first real user session. Every surprising or broken interaction is a test case. By the time you have 20 transcripts, you'll have a regression suite that catches most of the things that matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Common Thread
&lt;/h2&gt;

&lt;p&gt;All five of these lessons come back to the same root problem: &lt;strong&gt;GUI-first backends communicate in a vocabulary designed for human visual processing, and agents operate in a vocabulary designed for text and function calls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The translation work is non-trivial, but it's also reusable. The patterns above — interception layers, state diffing, pre-flight routers, error translation, transcript testing — apply to any GUI-first API you want to expose to an agent runtime.&lt;/p&gt;

&lt;p&gt;If you're building on OpenClaw and want to see the full implementation, video-editor-ai's SKILL.md is open source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx clawhub@latest &lt;span class="nb"&gt;install &lt;/span&gt;video-editor-ai &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://github.com/nemovideo/nemovideo_skills" rel="noopener noreferrer"&gt;github.com/nemovideo/nemovideo_skills&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What GUI-first APIs have you tried to wrap for agent use? Curious what other translation patterns people have run into.&lt;/em&gt;&lt;/p&gt;







&lt;p&gt;&lt;em&gt;This is part of a series on building AI video tools with OpenClaw. Previous: &lt;a href="https://dev.to/weizhang_dev/how-i-built-an-ai-video-editor-as-an-openclaw-skill-103j"&gt;How I Built an AI Video Editor&lt;/a&gt; | Next posts: &lt;a href="https://dev.to/weizhang_dev/i-use-openclaw-to-automate-my-entire-tiktok-and-reels-workflow-16od"&gt;Automating TikTok and Reels&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/i-reverse-engineered-4-top-video-skills-on-clawhub-heres-what-actually-drives-installs-2g24"&gt;Reverse-Engineering Top Video Skills&lt;/a&gt; | &lt;a href="https://dev.to/weizhang_dev/how-i-got-12-number-one-rankings-on-clawhub-in-5-days-7jh"&gt;12 #1 Rankings in 5 Days&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Descript's Lyrebird API vs Building a Video Editing Skill for OpenClaw — A Developer's Comparison</title>
      <dc:creator>Wei Zhang</dc:creator>
      <pubDate>Wed, 18 Mar 2026 07:10:26 +0000</pubDate>
      <link>https://dev.to/weizhang_dev/descripts-lyrebird-api-vs-building-a-video-editing-skill-for-openclaw-a-developers-comparison-50ek</link>
      <guid>https://dev.to/weizhang_dev/descripts-lyrebird-api-vs-building-a-video-editing-skill-for-openclaw-a-developers-comparison-50ek</guid>
      <description>&lt;p&gt;Descript shipped their Lyrebird enterprise API in January 2026. Upload raw footage, get back an edited project file or rendered video. They've also made Claude Sonnet 4.5 the default model inside Underlord, their AI editing layer.&lt;/p&gt;

&lt;p&gt;I've spent the past few months building &lt;a href="https://clawhub.ai/imo14reifey/video-editor-ai" rel="noopener noreferrer"&gt;video-editor-ai&lt;/a&gt;, an OpenClaw Skill that does roughly the same thing: give it footage, describe what you want, get back a finished file. Same surface area. Very different architecture underneath.&lt;/p&gt;

&lt;p&gt;Here's an honest comparison from someone who's shipped both approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Descript's Lyrebird API Actually Does
&lt;/h2&gt;

&lt;p&gt;The Lyrebird API is what Descript calls "lightweight, focused on workflow handoffs." That's accurate. The core model is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You upload raw footage via a signed URL&lt;/li&gt;
&lt;li&gt;You send an edit job with parameters&lt;/li&gt;
&lt;li&gt;You poll (or receive a webhook) when the render is done&lt;/li&gt;
&lt;li&gt;You download the output file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's a well-designed batch processing API. Each call is stateless — the API doesn't remember what you did in the previous call. If you want to apply three sequential edits (cut → subtitle → color grade), you either chain three separate API calls yourself, or you express the full edit spec upfront in a single job.&lt;/p&gt;
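
&lt;p&gt;To make the chaining concrete, here's roughly what that orchestration looks like. Lyrebird is enterprise-gated, so the client and method names below are hypothetical; it's the shape of the flow that matters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical batch-API flow; client, upload_via_signed_url, and
# download are illustrative stand-ins, not the real Lyrebird SDK.
import time

def run_job(client, input_url, params):
    job = client.create_edit_job(input_url=input_url, params=params)
    while True:  # or register a webhook instead of polling
        status = client.get_job(job.id)
        if status.state == "done":
            return status.output_url
        if status.state == "failed":
            raise RuntimeError(status.error)
        time.sleep(5)

# Chaining three sequential edits means threading each output URL
# into the next call yourself; the API remembers nothing between jobs.
url = upload_via_signed_url("raw_footage.mp4")
url = run_job(client, url, {"op": "cut", "range": [0, 10]})
url = run_job(client, url, {"op": "subtitle", "lang": "zh"})
url = run_job(client, url, {"op": "color_grade", "preset": "warm"})
download(url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;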

&lt;p&gt;This is the right design for a lot of use cases: automated pipelines, batch processing, CI/CD video workflows, anything where you have a complete spec before you start.&lt;/p&gt;

&lt;p&gt;It's not the right design for conversational editing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Session State Problem
&lt;/h2&gt;

&lt;p&gt;Conversational video editing looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "add Chinese subtitles"
Agent: [processes, returns file with subtitles]

User: "actually make them white with a black outline"
Agent: [processes, returns updated file]

User: "now trim the first 10 seconds"
Agent: [processes, returns final file]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each turn depends on the output of the previous turn. The agent needs to know: which file are we working on right now? What edits have already been applied? Where are we in the sequence?&lt;/p&gt;

&lt;p&gt;With Descript's API, you manage this state yourself. Between turns, you're responsible for tracking the "current file," persisting the session context, and constructing the next API call with the right input file. The API itself is stateless.&lt;/p&gt;

&lt;p&gt;With an OpenClaw Skill, session state is built into the runtime. The Skill runs inside the agent's conversation context — it can read prior turns, track which file is "active," and construct downstream calls without you building a state machine around the API.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. In a Skill, you don't write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Your orchestration code
&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_session_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;current_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;descript_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trim first 10 seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;current_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_url&lt;/span&gt;
&lt;span class="nf"&gt;save_session_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# In the Skill's routing logic&lt;/span&gt;
If the user refers to "it" or "the video" without specifying a filename,
use the most recently processed file from this session.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Skill runtime handles persistence. The agent handles reference resolution. You just describe the policy.&lt;/p&gt;

&lt;p&gt;This isn't a minor difference. In a multi-turn video editing workflow, managing session state is a significant chunk of the application logic. The Skill model moves that responsibility into the runtime; the API model leaves it with you.&lt;/p&gt;




&lt;h2&gt;
  
  
  GUI-First vs Agent-Native Intent
&lt;/h2&gt;

&lt;p&gt;The deeper architectural difference is in what the API exposes.&lt;/p&gt;

&lt;p&gt;Descript's API reflects Descript's product. It talks in terms of Descript's edit operations: compositions, sequences, layers, transcript-based edits. These are the right concepts for Descript's GUI. They're not necessarily the right concepts for an agent.&lt;/p&gt;

&lt;p&gt;When a user says "remove the background music," an agent doesn't want to know which audio track index to zero out. It wants to express an intent — "remove BGM" — and have the editing layer figure out the implementation.&lt;/p&gt;

&lt;p&gt;This is the GUI-first problem that shows up in every video editing API I've worked with. The API surface reflects the GUI's data model, not the agent's vocabulary. The result is a translation layer you have to build yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You end up writing this
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_remove_bgm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_project&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Figure out which tracks are BGM vs dialog
&lt;/span&gt;    &lt;span class="n"&gt;tracks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;descript_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;bgm_tracks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tracks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;classify_track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;music&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Zero them out
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;track&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;bgm_tracks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;descript_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mute_track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;track&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;descript_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In video-editor-ai's architecture, this translation lives in the Skill itself — the SKILL.md contains a routing table that maps natural language intents to API calls. The calling agent doesn't need to know anything about the underlying edit operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# In SKILL.md&lt;/span&gt;
If the user asks to "remove background music", "mute BGM", "take out the music":
→ Call /edit/audio with {"action": "mute", "track_type": "music"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Skill is the translation layer. You write it once, and every agent that installs the Skill gets the translation for free.&lt;/p&gt;




&lt;h2&gt;
  
  
  Error Handling: Who's Responsible?
&lt;/h2&gt;

&lt;p&gt;One more difference worth understanding before you choose an approach.&lt;/p&gt;

&lt;p&gt;With a direct API integration, errors from the video processing backend land in your application code. A failed render, a timeout on a long export, a quota-exceeded response — you handle all of these.&lt;/p&gt;

&lt;p&gt;With a Skill, error handling can be layered into the Skill's instruction set. The Skill knows the semantics of the errors (a 429 means quota, not a bug; a 0-byte output means the upstream service returned empty, not that the write failed) and can surface them to the user appropriately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Error handling in SKILL.md&lt;/span&gt;
If the render API returns a 0-byte file:
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT retry automatically (avoid double-charging credits)
&lt;span class="p"&gt;-&lt;/span&gt; Check session state: did the previous turn confirm the edit completed?
&lt;span class="p"&gt;  -&lt;/span&gt; If yes: the edit succeeded silently, run state diff and confirm to user
&lt;span class="p"&gt;  -&lt;/span&gt; If no: surface error and ask user to try again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of contextual error handling is hard to encode in API client code. It requires knowing the application semantics, not just the HTTP status codes. In a Skill, it lives naturally alongside the rest of the routing logic.&lt;/p&gt;
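
&lt;p&gt;For contrast, here's a sketch of what carrying that same policy in client code looks like; every name below is hypothetical, and you'd still have to persist the confirmation flag yourself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical client-side version of the same policy. The HTTP
# handling is easy; the hard part is the application semantics
# (compute_state_diff, save_output, the confirmation flag) that
# you now have to build and maintain around it.
def handle_render_result(resp, session):
    if resp.status_code == 429:
        return "Render quota exceeded; try again later."  # quota, not a bug
    if len(resp.content) == 0:
        # 0-byte output: never retry blindly, it double-charges credits
        if session.get("last_edit_confirmed"):
            diff = compute_state_diff(session)  # assumed helper
            return f"Your edit succeeded; here's what changed: {diff}"
        return "The render came back empty. Please try again."
    return save_output(resp.content, session)  # assumed helper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;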




&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;p&gt;These aren't competing approaches for the same use case. They're right for different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Descript's Lyrebird API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building a batch pipeline (process 50 videos overnight)&lt;/li&gt;
&lt;li&gt;You have a complete edit spec before you start (no user interaction mid-edit)&lt;/li&gt;
&lt;li&gt;You're integrating into an existing product that already manages state&lt;/li&gt;
&lt;li&gt;You need Descript's specific editing capabilities (transcript-based cuts, screen recording tools, their audio cleanup pipeline)&lt;/li&gt;
&lt;li&gt;You have an enterprise contract and budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use an OpenClaw Skill when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building a conversational agent workflow&lt;/li&gt;
&lt;li&gt;Users will iterate on edits across multiple turns&lt;/li&gt;
&lt;li&gt;You want the editing capability to work across any OpenClaw-compatible agent without integration work&lt;/li&gt;
&lt;li&gt;You want a free tier to prototype with (100 credits, no account required)&lt;/li&gt;
&lt;li&gt;You want the translation layer between natural language and video operations handled for you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The clearest signal: if you're thinking about session state management before you've written a single line of feature code, you probably want the Skill model. If you have a complete spec and just need to ship files through a processing pipeline, you probably want the API model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part Descript Got Right That We Copied
&lt;/h2&gt;

&lt;p&gt;One thing Lyrebird does well that informed video-editor-ai's design: the output is a real file, not a streaming blob that disappears.&lt;/p&gt;

&lt;p&gt;Early versions of video-editor-ai returned a temporary URL that expired in 15 minutes. Users would come back to a conversation 20 minutes later, ask for the file again, and get a 404. Descript delivers permanent project files. We moved video-editor-ai to the same model — the Skill stores a reference to the output in session state, and the user can retrieve it any time in the conversation.&lt;/p&gt;
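
&lt;p&gt;The pattern itself is small. A sketch of what video-editor-ai roughly does now, with hypothetical names standing in for the actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the permanent-reference pattern; storage and the session
# dict shape are illustrative, not the real code.
def record_output(session, file_id):
    # Store a durable ID, not a pre-signed URL that expires.
    session.setdefault("outputs", []).append(file_id)
    session["current_file"] = file_id

def fetch_output(session, storage):
    # Mint a fresh URL each time the user asks, instead of caching
    # one that 404s after 15 minutes.
    return storage.signed_url(session["current_file"], expires_in=900)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;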

&lt;p&gt;Good API design is good API design regardless of where it comes from.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Starting Point
&lt;/h2&gt;

&lt;p&gt;If you want to compare the approaches hands-on:&lt;/p&gt;

&lt;p&gt;For Descript's API: &lt;a href="https://docs.descriptapi.com" rel="noopener noreferrer"&gt;https://docs.descriptapi.com&lt;/a&gt; (enterprise access required)&lt;/p&gt;

&lt;p&gt;For the OpenClaw Skill approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx clawhub@latest &lt;span class="nb"&gt;install &lt;/span&gt;video-editor-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;100 free credits, no account needed. The SKILL.md is open source at &lt;a href="https://github.com/nemovideo/nemovideo_skills" rel="noopener noreferrer"&gt;github.com/nemovideo/nemovideo_skills&lt;/a&gt; — the routing table and error handling logic are all readable.&lt;/p&gt;

&lt;p&gt;The two approaches aren't mutually exclusive. A production system could use Descript's API for batch jobs and an OpenClaw Skill for the conversational editing interface. But if you're starting from scratch and building for an agent-first workflow, the Skill model saves you a significant amount of state management and translation layer work.&lt;/p&gt;







&lt;h2&gt;
  
  
  One More Thing: The Pricing Model Signal
&lt;/h2&gt;

&lt;p&gt;Descript's Lyrebird API is enterprise-only. That's not a criticism — enterprise pricing makes sense for a company with Descript's support costs and customer profile. But it does signal something about who each approach is designed for.&lt;/p&gt;

&lt;p&gt;OpenClaw Skills are distributed through ClawHub, which is free to publish on and has a free tier for end users. The economics are different because the model is different: Skills are closer to open-source libraries than to API products. You install them, they run in your agent, and the cost is per-operation on the underlying processing backend (in video-editor-ai's case, that's the video processing credits).&lt;/p&gt;

&lt;p&gt;If you're evaluating video editing APIs and cost structure matters, that's a real difference worth factoring in.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building something at the intersection of agent runtimes and video editing? I'd like to hear what API design decisions you're running into — the session state problem comes up in almost every media workflow I've seen.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Related skills from the same backend:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://clawhub.ai/nemovideonemo/shorts-editor" rel="noopener noreferrer"&gt;Shorts Editor&lt;/a&gt;&lt;/strong&gt; - vertical video for TikTok, Reels and Shorts. Install: &lt;code&gt;npx clawhub@latest install shorts-editor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://clawhub.ai/nemovideonemo/video-caption-tool" rel="noopener noreferrer"&gt;Video Caption Tool&lt;/a&gt;&lt;/strong&gt; - auto subtitles, 50+ language translation, SRT export. Install: &lt;code&gt;npx clawhub@latest install video-caption-tool&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>webdev</category>
      <category>video</category>
    </item>
  </channel>
</rss>
