
Navin Varma

Posted on • Originally published at nvarma.com

Vibecoding a Video Editing Pipeline


AI-generated collage of Golden Gate Bridge, crashing waves, and sea lions on a coastal cliff at sunset
Canva's AI-generated hero image for this post. It crammed every location into one impossible geography. CLIP would score this a 10.

These are my personal thoughts, experiences, and opinions, and they do not reflect the views of the company I work for.

Last week I wrote about driving the California coast and finding clarity. That was a reflective post. This one is about a weekend of nerding out on AI video editing, LLMs, and iMovie.

I came home from that trip with 27 video files across two cameras, about 17.5 GB of footage. My daily-driver phone is a Samsung Galaxy S24 Ultra, and last Thanksgiving I got an Xtra Muse vlog camera for 4K scenic shots. Our first stop was my favorite, the Pier 39 sea lions, after which we caught a couple of different views of the Golden Gate Bridge on a rare fog-free day. The next day we went on to 17-Mile Drive and Pebble Beach. This was the California coast in all its glory, and my chance to capture some of the most amazing scenery I have seen, unedited.

I'd been thinking about how many people use AI tools to build content creation pipelines, and now that I had footage to experiment with, my curiosity got the better of me. My requirement was simple: a 90-second highlight reel and a couple of YouTube Shorts. Since I'd been exploring local AI for edge-app prototyping, my guess was that editing these videos would be a breeze. I don't have OpenClaw just yet, so this was also my deep dive into traditional video editing in case things didn't work out. I've done sound recording and mixing semi-professionally, so how hard can video be? It turns out it's quite the skill to learn.

Defaulting to ComfyUI

My first instinct was ComfyUI. I've written about using it recently and it felt like the right tool. I thought I could use image-to-video workflows to enhance the footage, or at least use it for some creative transitions.

I opened Google AI search and started brainstorming. I realized quickly that this wasn't a generation problem. I didn't need to create new footage. I needed to sift through 17 GB of existing footage and find the good parts. ComfyUI is a Swiss Army knife for image and video generation, but for selection and editing? This was the wrong tool.

Next Up: Gemma 4 and TurboQuant

Google had just released TurboQuant alongside Gemma 4, and I was itching to try local multimodal inference. The idea I had was to extract frames from every video into a folder, feed them through Gemma 4's vision model locally on my Mac Mini, and have it pick the most scenic shots.

I spent an evening chatting with Gemini about this approach. I even wrote a director.py script that loaded gemma-4-E4B-it-4bit through mlx-vlm, fed it batches of 35 frames resized to 112x112, and asked it to pick the best ones. It sort of worked. But "sort of" is generous for a script that was slow, hallucinated filenames, and couldn't reliably distinguish a parking lot from Pebble Beach at thumbnail resolution.

The core problem was that I was using a vision language model for a task that didn't need language. I didn't need the model to describe what it saw. I needed it to score how scenic each frame was. That's a similarity-matching problem, not a conversation problem.

Pivoting to a Claude Code project

I haven't really tried out Google's Vertex AI in depth, or explored Antigravity's capabilities. I'd been going back and forth with a Gemini 3.1 Pro chat, learning a lot about Gemma 4, quantization, and multimodal inference. But I kept running into hurdles. There were model loading quirks. There was a 35-image batch limit beyond which my Mac Mini M4 Pro (24GB) would go bust. The 112px resolution was killing all the detail I needed. After a couple hours of this, I decided to pause and rethink.

What if I just pivoted to load Claude Code in the folder of raw videos and described what I wanted?

I opened a new Claude Code session with Opus 4.6, gave it access to ~/Movies/SFO_PebbleBeach/raw/, and explained my goal. Within the first exchange, Claude corrected a typo as I was describing what I needed. It pointed out that mlx-lm is text-only, that for vision I needed mlx-vlm, and that for "find nature/scenery frames" CLIP is the right primary tool, not a VLM: "CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning."

That was the moment I discovered a lot of prior art (pun intended). I'd dabbled a bit in the early days of this tech, but I always reached for cloud AI providers to experiment; this was the first time I had a real-world use case to run locally.

Discovering CLIP

If you haven't used CLIP for something like this, it's quite simple: give it a frame and a text prompt, and it tells you how similar they are.

So instead of asking a language model "is this scenic?", you encode the frame and compare it against prompts like "dramatic Pacific Ocean cliffs" and "Monterey cypress trees on rocky shore", the same kind of thing I'd add as positive prompts in ComfyUI LoRA nodes. Then you compare against negative prompts like "blurry photo" and "car dashboard". The difference between the positive and negative similarity scores gives you a single number: how scenic is this frame?

Running on Apple Silicon MPS, CLIP processes frames so fast it makes VLM inference look like it's stuck in an infinite loop (at least on my machine).
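A minimal sketch of that positive-minus-negative scoring, using Hugging Face transformers CLIP rather than my pipeline's exact setup; the model name and prompts here are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative model and prompts, not the exact ones from my pipeline
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

POSITIVE = ["dramatic Pacific Ocean cliffs",
            "Monterey cypress trees on rocky shore"]
NEGATIVE = ["blurry photo", "car dashboard"]

def scenic_score(frame: Image.Image) -> float:
    """Mean similarity to positive prompts minus mean similarity to negative ones."""
    inputs = processor(text=POSITIVE + NEGATIVE, images=frame,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        # logits_per_image holds the frame's (scaled) similarity to each prompt
        sims = model(**inputs).logits_per_image.squeeze(0)
    return (sims[: len(POSITIVE)].mean() - sims[len(POSITIVE):].mean()).item()
```

Because the text prompts never change, a real pipeline would encode them once and only run the image encoder per frame, which is most of the speedup.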

Custom rigging a 10-Stage Pipeline

Claude didn't just suggest CLIP. It designed a full 10-stage pipeline and wrote every script. I vibed all the way, nodding like I knew what I was doing (I did not :>|). I was just describing what I wanted and iterating on the output.

Here's the pipeline as Claude described it:

  1. Probe every raw file with ffprobe to inventory codecs, resolutions, and framerates
  2. Detect shots using PySceneDetect's adaptive detector, so clips align with actual camera cuts
  3. Extract 3 representative frames per shot (10%, 50%, 90%) instead of the brute-force 1-fps approach
  4. Score frames with CLIP against scenic prompts, plus classical CV signals (Laplacian sharpness, HSV colorfulness, brightness, face detection)
  5. Filter for shake and blur using optical flow variance
  6. Tag locations using GPS from the Samsung files and CLIP zero-shot for the Xtra Muse (which had no GPS)
  7. Select clips for a 90s reel and two 30s shorts, balancing score, location diversity, and chronological order
  8. Normalize each clip to a common format (4K30 HEVC for the reel, 1080x1920 H.264 for shorts)
  9. Concatenate into final outputs
  10. Verify specs, durations, and that the raw folder was never modified

The scoring formula was described as a weighted combination:

frame_score = 0.35 * clip_scenic
            + 0.20 * sharpness
            + 0.15 * colorfulness
            + 0.15 * nature_ratio
            + 0.10 * brightness
            - 0.40 * has_face * (1 - nature_ratio)

That last term is clever. It penalizes faces, but waives the penalty when the scenery dominates. So a wide shot of the Golden Gate Bridge with a few tourists in the corner keeps its score. A selfie gets rejected.
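As a plain function (inputs assumed normalized to [0, 1] upstream; the names mirror the formula above, not my exact script):

```python
def frame_score(clip_scenic: float, sharpness: float, colorfulness: float,
                nature_ratio: float, brightness: float, has_face: float) -> float:
    """Weighted frame score; the face penalty fades as nature_ratio approaches 1."""
    return (0.35 * clip_scenic
            + 0.20 * sharpness
            + 0.15 * colorfulness
            + 0.15 * nature_ratio
            + 0.10 * brightness
            - 0.40 * has_face * (1 - nature_ratio))
```

With nature_ratio at 1.0, a detected face costs nothing; with nature_ratio at 0.0 (a selfie), the full 0.40 penalty applies and the frame sinks to the bottom of the ranking.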

Some surprising learnings

Python's scientific ecosystem is really good for this. I didn't need any exotic AI tooling for most of the pipeline. Claude chose OpenCV's Laplacian variance for blur detection, Hasler-Süsstrunk colorfulness, HSV color masks for nature detection, Farneback optical flow for shake. These are computer vision techniques from the 2000s and they worked perfectly. The AI involvement (CLIP) was just one stage out of ten.

Location tagging with CLIP zero-shot actually worked. The Xtra Muse camera had no GPS data, so Claude suggested using CLIP to match frames against location-specific prompts like "Golden Gate Bridge" and "Pebble Beach golf coastline". It wasn't perfect (Pier 39 and Pier 41 look similar from the water), but it was good enough to ensure location diversity in the reel.
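The zero-shot tagging idea is just CLIP classification over a fixed label list; here's a sketch with transformers, where the label set is hypothetical and the real prompts were more specific:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical label set for illustration
LOCATIONS = ["Golden Gate Bridge", "Pier 39 sea lions on wooden docks",
             "Pebble Beach golf coastline", "17-Mile Drive cypress trees"]

def tag_location(frame: Image.Image) -> str:
    """Return the most probable location label for a frame."""
    inputs = processor(text=LOCATIONS, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Softmax over the frame-to-label similarities gives a probability per label
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return LOCATIONS[probs.argmax().item()]
```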

The EXIF correction was its own mini-adventure. The Xtra Muse had timestamps from June 2000 instead of March 2026: off by 25 years, 9 months, and 6 days. Claude walked me through the exiftool date shift, since I'd forgotten the exact format after spending most of my college days tagging photos on my MacBook Pro. It caught me when I accidentally ran a write instead of a dry run, and helped me recover from the backup. Somehow my trip footage is from the future and the past simultaneously.
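For reference, the shift itself is the part worth remembering. Here it is wrapped in Python to keep the post in one language; the real fix ran straight in the shell, and while -AllDates is exiftool's standard shortcut for the three date tags, my exact invocation is from memory:

```python
import subprocess

def shift_cmd(path: str) -> list:
    # exiftool shift syntax is "Y:M:D H:M:S"; this adds 25 years, 9 months, 6 days
    return ["exiftool", "-AllDates+=25:9:6 0:0:0", path]

def shift_dates(path: str) -> None:
    """Apply the shift. exiftool rewrites files in place but keeps a
    *_original backup of each one, which is what saved me here."""
    subprocess.run(shift_cmd(path), check=True)
```

Reading the tags first with `exiftool -AllDates <file>` is the closest thing to a dry run, and it's the step I skipped.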

The final edit involving a human and a robot

The pipeline output a 90-second reel and two shorts. Quite impressive for a vibe-coded adventure. It had its flaws though, and I definitely wanted to edit it a bit more.

I imported everything into iMovie for the final pass. I trimmed a couple of clips where the pipeline chose a good scenic shot but caught the tail end of a camera movement. I removed one clip of Golden Gate Bridge that was technically scenic but felt out of place in the flow. This is the kind of editorial judgment that's hard to score numerically.

For the soundtrack, I turned to Gemini. My last YouTube video (a Waymo ride through SF) got hit with a copyright notice for the music, and I didn't want to deal with that again. Gemini generated "The Road and the Sea", a track that fit the mood without any licensing headaches.

The final step was to put it all together, upload to YouTube, and declare victory (yay!).

Learning through pain

I went into this wanting to edit a video. I came out thinking differently about when to reach for which AI tool.

Gemma 4 and local VLMs are impressive, but they're conversational tools. I found that they are great at describing and reasoning about images but terrible at scoring thousands of them quickly. I spent an evening learning this the hard way, and I don't regret it. I understand quantization and multimodal inference better now than I did before.

CLIP is the right tool when you need similarity, not understanding. If your question is "does this match my description?", CLIP answers it faster and more reliably than any chat model. I think a lot of people reach for an LLM when a simpler model would do the job better. I know I did.

The other thing that surprised me was how much of the pipeline didn't need AI at all. Laplacian blur detection, optical flow for shake, HSV color masks for nature: these are well-established computer vision techniques, and they were perfect for this job. The right answer often isn't the newest model. Sometimes it's just knowing the prior art that came before all the new hotness.

Vibecoding works for pipelines, at least for a weekend project. I described what I wanted; Claude built the stages, which I reviewed, iterated on, and ran until I got my output. The result was ten Python scripts, an orchestrator, CLIP prompts, and ffmpeg encoding flags, all from a single session. I couldn't have written optical flow shake detection from scratch that quickly. But I could describe what a good highlight reel looks like, and honestly, that was enough.

https://www.youtube.com/watch?v=GjoOt8SQe3w

The video's up on YouTube now: almost ninety seconds of California coastline, some of it selected by math, some by neural nets, and the rest hand-finished by a human. It's a bit like mixing a track in GarageBand: the tools lay down the structure, but you still have to trust your ear for the final cut. I'm already thinking about what to automate next time.

This was not a perfect journey, but it was satisfying to see it through and finish on a high. Just like my ride down Pebble Beach on a clear, sunny March weekend.


This post was originally published on nvarma.com. Follow me there for more on software architecture, engineering leadership, and the craft of building things that last.
