Last month, I was juggling 12 different AI subscriptions. OpenAI for GPT. Anthropic for Claude. ElevenLabs for voice. Runway for video. Replicate for music. Google for Gemini. The list kept growing.
My credit card statement looked like a SaaS graveyard. And every time I wanted to build something that combined multiple AI capabilities? I'd spend more time wrangling API keys than writing actual code.
Then I built something that changed everything.
## The Problem Nobody Talks About
The AI ecosystem is fragmented by design. Every provider wants you locked into their platform. Want to generate an image with Flux, turn it into a video with Veo, add voiceover with ElevenLabs, and deploy the result to the web? That's 4 API keys, 4 billing dashboards, 4 different SDKs, and 4 different authentication flows.
For a solo developer, this is death by a thousand paper cuts.
## What I Built

I built an automated YouTube video factory. One command. Zero manual editing. Here's what it does:

```shell
node video-workflow.js https://youtube.com/watch?v=xyz
```
That single command triggers a 7-phase pipeline that:
- Extracts video metadata and transcripts (no download needed)
- Analyzes content using Gemini Vision
- Generates a world-class script with hook, insights, and CTA
- Creates TTS voiceover, AI video clips, thumbnail, and background music
- Assembles everything with cinematic transitions, word-by-word subtitles, and Ken Burns effects
- Uploads to YouTube with optimized title, description, and tags
- Self-evaluates the output and generates improvements for the next run
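The phases above run as a simple sequential pipeline over a shared context object. A minimal sketch with stub phases (the `runPipeline` helper and the phase bodies here are illustrative, not the actual implementation):

```javascript
// Minimal phase-runner sketch: each phase reads the shared context and
// contributes new fields. Real phases call AI APIs; these are stubs.
function runPipeline(url, phases) {
  let ctx = { url };
  for (const [name, phase] of phases) {
    ctx = { ...ctx, ...phase(ctx) }; // merge each phase's output
    console.log(`done: ${name}`);
  }
  return ctx;
}

const phases = [
  ["gather",   (ctx) => ({ transcript: `transcript of ${ctx.url}` })],
  ["script",   (ctx) => ({ segments: [{ narration: "hook", seconds: 8 }] })],
  ["assemble", (ctx) => ({ output: "final.mp4" })],
];

const result = runPipeline("https://youtube.com/watch?v=xyz", phases);
// result now carries the transcript, segments, and output path
```

The real pipeline is the same shape with seven phases; the shared context is what lets later phases (assembly, upload, self-evaluation) see everything the earlier ones produced.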
Before: 4-6 hours per video, manually switching between 8 tools.
After: 3 minutes. Fully automated. Consistently scoring 85+ on my quality rubric.
## The Secret Sauce: One Gateway, 100+ Models

The magic isn't in any single AI model — it's in the orchestration layer. I use SkillBoss as a unified API gateway that gives me access to 100+ AI models through a single endpoint:

```shell
curl -fsSL https://skillboss.co/install.sh | bash
```
One install. One API key. One billing dashboard. Here's what I get:
| Capability | Models Available |
|---|---|
| Chat/Reasoning | Claude 4.5, GPT-5, Gemini 3, DeepSeek R1 |
| Image Generation | Gemini 3 Ultra, Flux Pro, DALL-E 3 |
| Video Generation | Veo 3.1, Sora Turbo |
| Text-to-Speech | ElevenLabs, MiniMax, OpenAI TTS |
| Music Generation | ElevenLabs Music via Replicate |
| Web Search | Perplexity Sonar Pro (with citations) |
| Web Scraping | Firecrawl, ScrapingDog |
| Deployment | Cloudflare Workers + R2 |
The best part? Pay-as-you-go pricing. No monthly subscriptions. I went from ~$180/month in AI subscriptions to spending only what I actually use.
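In practice, "one endpoint" means every capability is just a different `model` string in the same OpenAI-compatible request shape. A sketch of what that looks like (the `buildRequest` helper and the model identifiers are illustrative; only the base URL is from the article):

```javascript
// Build an OpenAI-compatible chat request. Only the model name changes
// per capability; the helper is a sketch, not the SkillBoss SDK.
const BASE_URL = "https://api.skillboss.co/v1";

function buildRequest(model, prompt) {
  return {
    url: `${BASE_URL}/chat/completions`,
    body: {
      model,
      messages: [{ role: "user", content: prompt }],
    },
  };
}

// Same shape whether you want reasoning or image prompting:
const chat = buildRequest("claude-4.5", "Write a 60-second script hook");
const img  = buildRequest("flux-pro", "Thumbnail: neon cityscape, bold text");

// Sending either is one fetch() with a single API key:
// fetch(chat.url, {
//   method: "POST",
//   headers: { Authorization: `Bearer ${API_KEY}` },
//   body: JSON.stringify(chat.body),
// });
```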
## Technical Deep Dive: The 7 Phases

### Phase 1: Intelligence Gathering (No Video Download)

YouTube's SABR protocol blocks most video downloaders. Instead of fighting it, I extract only what I need:

```javascript
// Metadata only — no video download needed
// (exec here is the promisified child_process.exec)
const metadata = await exec(`yt-dlp --dump-json --skip-download "${url}"`);
const transcript = await exec(`python3 -m youtube_transcript_api "${videoId}"`);
```
This gives me the title, description, duration, view count, and full transcript — everything needed for content analysis.
### Phase 3: World-Class Script Generation
This is where the magic happens. I don't just "summarize" the source video. I use Claude to generate scripts that follow YouTube retention science:
- Hook formula: Specific number + bold claim + curiosity gap
- 8-10 segments, each 7-10 seconds (scene changes every ~8s = high retention signal)
- CTA: Specific benefit, never generic "like and subscribe"
The prompt engineering alone took weeks to perfect. Every script gets structured as JSON with narration text, visual descriptions, and timing metadata.
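Because malformed scripts would poison every downstream phase, it's worth validating the JSON before asset generation. A small sketch of that check (the field names `narration`, `visual`, and `seconds` are illustrative, not the exact production schema):

```javascript
// Validate a generated script against the retention rules above:
// 8-10 segments, each 7-10 seconds, with narration and a visual description.
function validateScript(script) {
  const errors = [];
  if (script.segments.length < 8 || script.segments.length > 10) {
    errors.push(`expected 8-10 segments, got ${script.segments.length}`);
  }
  script.segments.forEach((seg, i) => {
    if (!seg.narration) errors.push(`segment ${i}: missing narration`);
    if (!seg.visual) errors.push(`segment ${i}: missing visual description`);
    if (seg.seconds < 7 || seg.seconds > 10) {
      errors.push(`segment ${i}: ${seg.seconds}s outside the 7-10s window`);
    }
  });
  return errors;
}

const script = {
  segments: Array.from({ length: 9 }, (_, i) => ({
    narration: `line ${i}`, visual: `shot ${i}`, seconds: 8,
  })),
};
// validateScript(script) returns [] when the script obeys the rules
```

When the model returns something off-spec, the error list can be fed straight back into a retry prompt instead of failing the run.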
### Phase 4: Parallel Asset Generation
This is where having a unified API gateway pays off. I fire off multiple AI calls simultaneously:
```javascript
const [ttsResults, bgmResult, thumbnailResult, videoClips] = await Promise.all([
  generateTTS(segments),          // ElevenLabs or MiniMax
  generateBGM(bgmPrompt),         // Replicate
  generateThumbnail(thumbPrompt), // Gemini 3 Ultra Image
  generateVideoClips(segments),   // Veo 3.1 with fallback
]);
```

No juggling API keys. No switching SDKs. The same `apiCall()` function handles everything — just different model names.
### Phase 5: FFmpeg Wizardry
This phase assembles everything into a polished video. Some hard-won lessons:
Word-by-word subtitles using ASS format with karaoke tags:
```
{\kf80}Every {\kf60}word {\kf70}highlights {\kf90}individually
```
Ken Burns effect on static images:
```
zoompan=z='min(zoom+0.0015,1.5)':d=180:s=1920x1080
```
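Filter strings like this are easy to typo by hand, so I find it helps to build them from readable options. A hedged sketch (the `kenBurns` helper and its defaults are mine; the defaults just reproduce the `zoompan` filter above):

```javascript
// Build an FFmpeg zoompan (Ken Burns) filter string from named options.
// The helper is illustrative; defaults match the filter shown above.
function kenBurns({ zoomStep = 0.0015, maxZoom = 1.5, frames = 180,
                    width = 1920, height = 1080 } = {}) {
  return `zoompan=z='min(zoom+${zoomStep},${maxZoom})'` +
         `:d=${frames}:s=${width}x${height}`;
}

const filter = kenBurns();
// → zoompan=z='min(zoom+0.0015,1.5)':d=180:s=1920x1080
```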
BGM mixing — keep it simple:
```
// DON'T use loudnorm + sidechaincompress (hangs indefinitely)
// DO use simple volume control
amix=inputs=2:duration=first [mixed]; [mixed] volume=0.12
```
That last one cost me 2 hours of debugging. The `sidechaincompress` filter in FFmpeg can deadlock with `loudnorm` in certain filter graph configurations. A simple `volume=0.12` on the background music track works perfectly.
### Phase 7: Self-Evaluation Loop
After upload, the pipeline evaluates itself across 5 dimensions:
- Hook strength (does it grab attention in 3 seconds?)
- Content value density (insight per second)
- SEO optimization (title, description, tags)
- Audio-visual quality (transitions, subtitle timing, BGM balance)
- Conversion potential (CTA effectiveness)
Score below 85? It generates specific improvements and saves them to `next_run_improvements.json` for the next run. The system literally gets better every time it runs.
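A minimal sketch of the scoring step (the five dimension names come from the list above; the equal weighting, the `evaluate` function, and the message format are my illustrative assumptions):

```javascript
// Average the five rubric dimensions; anything under the 85 bar becomes
// an improvement note for the next run. Structure is illustrative.
function evaluate(scores) {
  const dims = ["hook", "density", "seo", "audioVisual", "conversion"];
  const total = dims.reduce((sum, d) => sum + scores[d], 0) / dims.length;
  const improvements = dims
    .filter((d) => scores[d] < 85)
    .map((d) => `raise ${d} (scored ${scores[d]})`);
  return { total, pass: total >= 85, improvements };
}

const report = evaluate({
  hook: 90, density: 80, seo: 88, audioVisual: 92, conversion: 85,
});
// report.pass is true (average 87), but "raise density (scored 80)" is
// still queued — in the real pipeline it would land in
// next_run_improvements.json for the next run to consume.
```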
## Real Results
Here's a video this pipeline produced — fully automated, zero manual editing:
1080p, Veo 3.1 video clips, professional transitions, word-by-word subtitles, background music, and auto-generated thumbnail.
## What I Learned
1. Model fallbacks are essential. Every AI model has rate limits and outages. My pipeline has automatic fallback chains (Veo 3.1 → Sora Turbo → image-to-video). It never fails completely.
2. Binary data handling matters. Early on, I was concatenating audio buffers as strings (`raw += chunk`). The audio came out corrupted. Switching to `Buffer.concat(chunks)` fixed it instantly. Seems obvious in hindsight.
3. Don't over-engineer audio mixing. FFmpeg's `sidechaincompress` and `loudnorm` are powerful but can deadlock in complex filter graphs. Sometimes the simple solution (`volume=0.12`) is the correct one.
4. The unified API gateway was the breakthrough. Not because any individual model was special, but because the orchestration became trivial. When adding a new capability means changing one model name instead of integrating a new SDK, you build things you never would have attempted.
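Lesson 1's fallback chain is conceptually just ordered generators tried in sequence. A simplified synchronous sketch (the real calls are async API requests; the `withFallback` helper and the stub generators are illustrative):

```javascript
// Try each generator in order until one succeeds. The model order matches
// the chain in the text; the generators are stubs standing in for API calls.
function withFallback(generators) {
  const errors = [];
  for (const [model, generate] of generators) {
    try {
      return { model, clip: generate() };
    } catch (err) {
      errors.push(`${model}: ${err.message}`); // record and fall through
    }
  }
  throw new Error(`all models failed: ${errors.join("; ")}`);
}

const result = withFallback([
  ["veo-3.1",        () => { throw new Error("rate limited"); }],
  ["sora-turbo",     () => { throw new Error("outage"); }],
  ["image-to-video", () => "clip.mp4"],
]);
// result → { model: "image-to-video", clip: "clip.mp4" }
```

The pipeline only throws when the whole chain is exhausted, which in practice means a single provider outage never kills a run.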
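The buffer bug from lesson 2 is easy to reproduce: coercing a Buffer to a string round-trips it through UTF-8, which mangles any bytes that aren't valid UTF-8 sequences (common in raw audio):

```javascript
// Binary chunks with high bytes survive Buffer.concat but are mangled
// by string concatenation.
const chunks = [Buffer.from([0xff, 0xd8]), Buffer.from([0x00, 0x9c])];

// Wrong: += implicitly calls toString(), and invalid UTF-8 bytes are
// replaced with U+FFFD, so converting back changes the byte count.
let raw = "";
for (const chunk of chunks) raw += chunk;
const corrupted = Buffer.from(raw);

// Right: concatenate at the byte level.
const intact = Buffer.concat(chunks);

console.log(intact.length);    // 4: the original bytes, untouched
console.log(corrupted.length); // larger, because bytes became U+FFFD
```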
## Get Started

If you want to try the multi-model approach:

```shell
# Install SkillBoss
curl -fsSL https://skillboss.co/install.sh | bash

# Use with Claude Code, Cursor, or any AI coding assistant
# Or use the OpenAI-compatible endpoint directly:
#   https://api.skillboss.co/v1
```
New accounts get $2 in free credits. No subscription needed.
What's the most complex AI pipeline you've built? I'd love to hear about multi-model workflows in the comments.