
Wei Zhang

I Turned a 2-Hour Podcast into 20 Short Clips: Here's the Full Workflow (and Where AI Actually Helped)

A client sent me a 2-hour podcast episode last month and asked for 20 short clips — Reels, Shorts, TikToks, the whole spread. Different aspect ratios, subtitles, branded templates, color-matched, ready to publish across five platforms.

A year ago this would have been a full week of work. I wanted to see how much AI could actually cut that down. Not in theory — in practice, with real client footage, real deadlines, and real quality standards.

Final answer: 3 days instead of 7. But not in the way I expected.

Step 1: Transcription and Finding the Good Parts

The first job was turning two hours of audio into text and finding the 20 moments worth clipping. Transcription used to be the most tedious part of this entire process — I'd spend 20 minutes per hour of footage just fixing auto-generated captions. Now it takes about 45 seconds to get a transcript that's maybe 95% accurate. I ran the episode through Descript, fixed a handful of proper nouns and technical terms, and had a clean transcript in under 10 minutes.
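
If you'd rather script this step than use Descript's GUI, the open-source whisper package gets you a comparable transcript. A minimal sketch, assuming the episode audio is exported as episode.mp3; the model choice is a placeholder:

```python
# pip install openai-whisper  (also needs ffmpeg on PATH)
import whisper

# "medium" trades speed for accuracy; larger models are slower but
# closer to what a commercial tool produces on clean podcast audio.
model = whisper.load_model("medium")

# word_timestamps=True gives per-word timing, which matters later
# when you cut clips and need subtitles to line up.
result = model.transcribe("episode.mp3", word_timestamps=True)

for segment in result["segments"]:
    print(f"[{segment['start']:7.1f} - {segment['end']:7.1f}] {segment['text'].strip()}")
```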

Finding the actual clip-worthy moments was where things got interesting. I asked the AI to flag "high-engagement segments" and it returned 30 candidates. About 20 of them were fine. The other 10 were technically interesting quotes but had zero standalone value — the kind of thing that makes sense in context but means nothing as a 45-second clip on Instagram.

This is the pattern I keep seeing: AI is excellent at identifying what was said but terrible at judging what will perform. It flagged a detailed technical explanation about microphone placement as a "high-engagement moment" because it had lots of specific information. Meanwhile it missed a 30-second story about a guest's worst interview experience that was obviously the most shareable moment in the entire episode. I had to manually review all 30 candidates and make the final picks myself.
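
For reference, the flagging step boils down to a prompt over the transcript. Here's a hedged sketch with the OpenAI Python client; the model name and prompt wording are illustrative, not what Descript or any other tool runs internally. Even with "standalone value" spelled out in the prompt, I'd still review every candidate by hand.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are picking short-form clips from a podcast transcript.
Return up to 30 candidate segments as JSON:
[{"start": "HH:MM:SS", "end": "HH:MM:SS", "hook": "..."}].
Only pick moments that make sense with ZERO surrounding context,
because the viewer sees nothing before or after the clip. Reject
technically interesting quotes that depend on earlier setup."""

with open("transcript.txt") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any strong model works here
    messages=[
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
```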

Time: 40 minutes (AI-assisted) vs ~3.5 hours (fully manual)

Step 2: Rough Cuts and Going Vertical

With 20 segments identified, I needed to cut them from the timeline and reformat everything for vertical. This is where batch processing made the biggest difference.

I used NemoVideo for the rough cuts — described each clip by timestamp and target duration in plain language, and it handled the extraction and initial framing. "Cut from 34:12 to 35:45, crop to 9:16, keep the speaker centered" repeated twenty times is exactly the kind of repetitive work where chat-based editing shines. What would have been an hour of timeline scrubbing took about 15 minutes.
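
If you want the same batch without a chat interface, it maps directly onto ffmpeg. A rough sketch, with illustrative timestamps and a static center crop standing in for NemoVideo's speaker tracking:

```python
# Batch rough cuts with ffmpeg: extract each segment, then center-crop
# to 9:16. Requires ffmpeg on PATH. Clip times are illustrative.
import subprocess

def to_seconds(ts: str) -> int:
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

clips = [
    ("00:34:12", "00:35:45", "clip_01.mp4"),
    ("00:41:03", "00:41:48", "clip_02.mp4"),
    # ...18 more
]

for start, end, outfile in clips:
    duration = to_seconds(end) - to_seconds(start)
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", start,               # fast input seek
        "-i", "episode.mp4",
        "-t", str(duration),
        # crop width = height * 9/16, keeping the center of the frame;
        # video must re-encode because stream copy can't apply a filter
        "-vf", "crop=ih*9/16:ih",
        "-c:a", "copy",
        outfile,
    ], check=True)
```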

The vertical reframing worked well for single-speaker segments. The tool tracked the active speaker and kept them centered, which saved me from manually keyframing a crop for each clip.

But the two-person conversation clips were a mess. Every time the speakers talked over each other or one person gestured into the other's frame, the tracking would jump between them or settle on the wrong person entirely. I ended up manually fixing the framing on about 6 of the 20 clips. Not a dealbreaker, but worth knowing if your source footage has multiple speakers on camera.

Time: 45 minutes (AI-assisted) vs ~2.5 hours (manual crop and cut)

Step 3: Subtitles and Brand Packaging

Subtitles were the smoothest part of the entire process. The AI-generated captions from the transcript were already 92% accurate, and since I'd cleaned up the transcript in Step 1, the subtitle timing was nearly perfect out of the box. I spent maybe 20 minutes fixing edge cases — words that broke across lines awkwardly, a few timing misalignments where the speaker paused mid-sentence.
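
For the burn-in itself, ffmpeg's subtitles filter takes an SRT per clip. A sketch, assuming one .srt sits next to each clip; the style string is a placeholder for your client's caption spec:

```python
# Burn per-clip SRT captions into each vertical clip.
# force_style values are placeholders for the real caption style.
import subprocess
from pathlib import Path

STYLE = "Fontname=Montserrat,Fontsize=14,Outline=1,MarginV=60"

for clip in sorted(Path("clips").glob("clip_*.mp4")):
    srt = clip.with_suffix(".srt")
    subprocess.run([
        "ffmpeg", "-y", "-i", str(clip),
        "-vf", f"subtitles={srt}:force_style='{STYLE}'",
        "-c:a", "copy",
        str(clip.with_name(clip.stem + "_sub.mp4")),
    ], check=True)
```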

Brand packaging was more mixed. The client had specific colors, fonts, a lower-third template, and an intro bumper that needed to go on every clip. Applying the template across all 20 clips in batch worked great for the simple stuff — colors, fonts, logo placement. But the lower-third positioning needed adjustment on about half the clips because the speaker's head was in a different spot in each one. Batch automation got me 60% of the way there, manual tweaking handled the rest.
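
That 60/40 split is easy to see in script form: the batch overlay is one loop, and the per-clip position fixes become a little override table you maintain by hand. A sketch with hypothetical filenames and offsets:

```python
# Composite the client's lower-third PNG onto each clip. The default
# position works for most clips; the override dict is the manual
# per-clip tweaking for when the speaker's head sits lower in frame.
import subprocess
from pathlib import Path

DEFAULT_Y = "main_h-overlay_h-120"  # 120 px above the bottom edge
Y_OVERRIDES = {"clip_07": "main_h-overlay_h-40", "clip_12": "80"}

for clip in sorted(Path("clips").glob("clip_*_sub.mp4")):
    y = Y_OVERRIDES.get(clip.stem.replace("_sub", ""), DEFAULT_Y)
    subprocess.run([
        "ffmpeg", "-y",
        "-i", str(clip), "-i", "lower_third.png",
        "-filter_complex", f"[0:v][1:v]overlay=x=(main_w-overlay_w)/2:y={y}",
        "-c:a", "copy",
        str(clip.with_name(clip.stem.replace("_sub", "_brand") + ".mp4")),
    ], check=True)
```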

Time: 1.5 hours (AI-assisted) vs ~5 hours (manual subtitling + templating)

Step 4: Color Matching and Final Review

The podcast was shot with two cameras that clearly had different white balance settings. AI color matching got both cameras to a consistent baseline in about 8 minutes across all 20 clips. I tweaked contrast and skin tones manually after that, but the initial match eliminated what used to be 30-40 minutes of manually matching the two cameras shot by shot on the scopes.
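
If you're curious what a scriptable stand-in looks like, histogram matching is a crude version of the same idea: pick a reference frame from one camera and pull the other camera's frames toward its color distribution. This is not what Resolve does internally, just a sketch with scikit-image:

```python
# pip install scikit-image imageio
# Crude stand-in for camera matching: conform one camera's frame to
# the other's color distribution. Real grading still happens in Resolve.
import imageio.v3 as iio
from skimage.exposure import match_histograms

reference = iio.imread("camera_a_frame.png")  # the "look" to match
source = iio.imread("camera_b_frame.png")     # the off-balance camera

matched = match_histograms(source, reference, channel_axis=-1)
iio.imwrite("camera_b_matched.png", matched.astype(source.dtype))
```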

Final review is the step that AI cannot help with at all, and I don't think it will anytime soon. I watched every single clip start to finish, checking for subtitle errors, awkward cuts, branding consistency, and whether each clip actually made sense as a standalone piece. This took about 2 hours and I caught problems in almost a third of them — a subtitle that said "their" instead of "there," a clip that started mid-sentence because the AI timestamp was off by two seconds, a lower-third that covered the guest's face in one shot.
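
One caveat: the editorial half of review has to be manual, but the mechanical half can be scripted as a pre-check. A hypothetical ffprobe pass that catches export-level mistakes, wrong duration or wrong aspect ratio, before you sit down to watch; the expected values come from whatever your clip list says:

```python
# Mechanical pre-checks before the manual watch-through: duration and
# aspect ratio per clip. This catches a bad cut or export setting,
# not a "their"/"there" subtitle error; that still needs eyes.
import json
import subprocess
from pathlib import Path

EXPECTED = {"clip_01_brand.mp4": 93, "clip_02_brand.mp4": 45}  # seconds

for name, want in EXPECTED.items():
    info = json.loads(subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", str(Path("clips") / name)],
        capture_output=True, text=True, check=True,
    ).stdout)
    duration = float(info["format"]["duration"])
    video = next(s for s in info["streams"] if s["codec_type"] == "video")
    assert abs(duration - want) < 1.0, f"{name}: duration {duration:.1f}s"
    assert video["width"] * 16 == video["height"] * 9, f"{name}: not 9:16"
```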

Time: 2.5 hours (unavoidable manual review)

The Honest Math

| Step | AI-Assisted | Fully Manual | Saved |
| --- | --- | --- | --- |
| Transcription + clip selection | 40 min | 3.5 hrs | 2 hrs 50 min |
| Rough cuts + vertical reframe | 45 min | 2.5 hrs | 1 hr 45 min |
| Subtitles + brand templates | 1.5 hrs | 5 hrs | 3 hrs 30 min |
| Color match + final review | 2.5 hrs | 3.5 hrs | 1 hr |
| **Total** | **~5.5 hrs** | **~14.5 hrs** | **~9 hrs** |

Spread across three working days with client feedback loops, that's 3 days instead of 7. Real savings, not theoretical.

The tools that did the heavy lifting: Descript for transcription, NemoVideo for batch rough cuts and reframing, and DaVinci Resolve for final color and export.

What AI Still Can't Do

I could write another article about what worked, but the more useful list is what didn't.

AI cannot tell you which moments will perform on social media. It can find quotes and highlight reels, but it has no sense of what makes someone stop scrolling. That judgment is still entirely yours.

AI cannot handle multi-speaker framing reliably. The second you have two people in frame with overlapping dialogue, every auto-framing tool I've tested gets confused. Plan on manual fixes for 25-30% of your clips if your source has multiple speakers.

AI cannot do your final QA pass. I've tried trusting the output without a manual review exactly once. The client found a subtitle error in the first clip they watched. Never again.

Three days instead of seven is genuinely useful. But the three days that remain are the ones that require actual editorial judgment — and those aren't going anywhere.
