Last month, I was juggling 12 different AI subscriptions. OpenAI for GPT. Anthropic for Claude. ElevenLabs for voice. Runway for video. Replicate for music. Google for Gemini. The list kept growing.
My credit card statement looked like a SaaS graveyard. And every time I wanted to build something that combined multiple AI capabilities? I'd spend more time wrangling API keys than writing actual code.
Then I built something that changed everything.
## The Problem Nobody Talks About
The AI ecosystem is fragmented by design. Every provider wants you locked into their platform. Want to generate an image with Flux, turn it into a video with Veo, add voiceover with ElevenLabs, and deploy the result to the web? That's 4 API keys, 4 billing dashboards, 4 different SDKs, and 4 different authentication flows.
For a solo developer, this is death by a thousand paper cuts.
## What I Built

I built an automated YouTube video factory. One command. Zero manual editing. Here's what it does:

```shell
node video-workflow.js https://youtube.com/watch?v=xyz
```
That single command triggers a 7-phase pipeline that:
- Extracts video metadata and transcripts (no download needed)
- Analyzes content using Gemini Vision
- Generates a world-class script with hook, insights, and CTA
- Creates TTS voiceover, AI video clips, thumbnail, and background music
- Assembles everything with cinematic transitions, word-by-word subtitles, and Ken Burns effects
- Uploads to YouTube with optimized title, description, and tags
- Self-evaluates the output and generates improvements for the next run
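The phases above run as a simple sequential pipeline over a shared context object. A minimal sketch with stub phases (the `runPipeline` helper and the phase bodies here are illustrative, not the actual implementation):

```javascript
// Minimal phase-runner sketch: each phase reads the shared context and
// contributes new fields. Real phases call AI APIs; these are stubs.
function runPipeline(url, phases) {
  let ctx = { url };
  for (const [name, phase] of phases) {
    ctx = { ...ctx, ...phase(ctx) }; // merge each phase's output
    console.log(`done: ${name}`);
  }
  return ctx;
}

const phases = [
  ["gather",   (ctx) => ({ transcript: `transcript of ${ctx.url}` })],
  ["script",   (ctx) => ({ segments: [{ narration: "hook", seconds: 8 }] })],
  ["assemble", (ctx) => ({ output: "final.mp4" })],
];

const result = runPipeline("https://youtube.com/watch?v=xyz", phases);
// result now carries the transcript, segments, and output path
```

The real pipeline is the same shape with seven phases; the shared context is what lets later phases (assembly, upload, self-evaluation) see everything the earlier ones produced.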
Before: 4-6 hours per video, manually switching between 8 tools.
After: 3 minutes. Fully automated. Consistently scoring 85+ on my quality rubric.
## The Secret Sauce: One Gateway, 100+ Models

The magic isn't in any single AI model — it's in the orchestration layer. I use SkillBoss as a unified API gateway that gives me access to 100+ AI models through a single endpoint:

```shell
curl -fsSL https://skillboss.co/install.sh | bash
```
One install. One API key. One billing dashboard. Here's what I get:
| Capability | Models Available |
|---|---|
| Chat/Reasoning | Claude 4.5, GPT-5, Gemini 3, DeepSeek R1 |
| Image Generation | Gemini 3 Ultra, Flux Pro, DALL-E 3 |
| Video Generation | Veo 3.1, Sora Turbo |
| Text-to-Speech | ElevenLabs, MiniMax, OpenAI TTS |
| Music Generation | ElevenLabs Music via Replicate |
| Web Search | Perplexity Sonar Pro (with citations) |
| Web Scraping | Firecrawl, ScrapingDog |
| Deployment | Cloudflare Workers + R2 |
The best part? Pay-as-you-go pricing. No monthly subscriptions. I went from ~$180/month in AI subscriptions to spending only what I actually use.
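In practice, "one endpoint" means every capability is just a different `model` string in the same OpenAI-compatible request shape. A sketch of what that looks like (the `buildRequest` helper and the model identifiers are illustrative; only the base URL is from the article):

```javascript
// Build an OpenAI-compatible chat request. Only the model name changes
// per capability; the helper is a sketch, not the SkillBoss SDK.
const BASE_URL = "https://api.skillboss.co/v1";

function buildRequest(model, prompt) {
  return {
    url: `${BASE_URL}/chat/completions`,
    body: {
      model,
      messages: [{ role: "user", content: prompt }],
    },
  };
}

// Same shape whether you want reasoning or image prompting:
const chat = buildRequest("claude-4.5", "Write a 60-second script hook");
const img  = buildRequest("flux-pro", "Thumbnail: neon cityscape, bold text");

// Sending either is one fetch() with a single API key:
// fetch(chat.url, {
//   method: "POST",
//   headers: { Authorization: `Bearer ${API_KEY}` },
//   body: JSON.stringify(chat.body),
// });
```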
## Technical Deep Dive: The 7 Phases

### Phase 1: Intelligence Gathering (No Video Download)

YouTube's SABR protocol blocks most video downloaders. Instead of fighting it, I extract only what I need:

```javascript
// Metadata only — no video download needed
// (exec here is the promisified child_process.exec)
const metadata = await exec(`yt-dlp --dump-json --skip-download "${url}"`);
const transcript = await exec(`python3 -m youtube_transcript_api "${videoId}"`);
```
This gives me the title, description, duration, view count, and full transcript — everything needed for content analysis.
### Phase 3: World-Class Script Generation
This is where the magic happens. I don't just "summarize" the source video. I use Claude to generate scripts that follow YouTube retention science:
- Hook formula: Specific number + bold claim + curiosity gap
- 8-10 segments, each 7-10 seconds (scene changes every ~8s = high retention signal)
- CTA: Specific benefit, never generic "like and subscribe"
The prompt engineering alone took weeks to perfect. Every script gets structured as JSON with narration text, visual descriptions, and timing metadata.
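Because malformed scripts would poison every downstream phase, it's worth validating the JSON before asset generation. A small sketch of that check (the field names `narration`, `visual`, and `seconds` are illustrative, not the exact production schema):

```javascript
// Validate a generated script against the retention rules above:
// 8-10 segments, each 7-10 seconds, with narration and a visual description.
function validateScript(script) {
  const errors = [];
  if (script.segments.length < 8 || script.segments.length > 10) {
    errors.push(`expected 8-10 segments, got ${script.segments.length}`);
  }
  script.segments.forEach((seg, i) => {
    if (!seg.narration) errors.push(`segment ${i}: missing narration`);
    if (!seg.visual) errors.push(`segment ${i}: missing visual description`);
    if (seg.seconds < 7 || seg.seconds > 10) {
      errors.push(`segment ${i}: ${seg.seconds}s outside the 7-10s window`);
    }
  });
  return errors;
}

const script = {
  segments: Array.from({ length: 9 }, (_, i) => ({
    narration: `line ${i}`, visual: `shot ${i}`, seconds: 8,
  })),
};
// validateScript(script) returns [] when the script obeys the rules
```

When the model returns something off-spec, the error list can be fed straight back into a retry prompt instead of failing the run.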
### Phase 4: Parallel Asset Generation
This is where having a unified API gateway pays off. I fire off multiple AI calls simultaneously:
```javascript
const [ttsResults, bgmResult, thumbnailResult, videoClips] = await Promise.all([
  generateTTS(segments),          // ElevenLabs or MiniMax
  generateBGM(bgmPrompt),         // Replicate
  generateThumbnail(thumbPrompt), // Gemini 3 Ultra Image
  generateVideoClips(segments),   // Veo 3.1 with fallback
]);
```

No juggling API keys. No switching SDKs. The same `apiCall()` function handles everything — just different model names.
### Phase 5: FFmpeg Wizardry
This phase assembles everything into a polished video. Some hard-won lessons:
Word-by-word subtitles using ASS format with karaoke tags:
```
{\kf80}Every {\kf60}word {\kf70}highlights {\kf90}individually
```
Ken Burns effect on static images:
```
zoompan=z='min(zoom+0.0015,1.5)':d=180:s=1920x1080
```
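Filter strings like this are easy to typo by hand, so I find it helps to build them from readable options. A hedged sketch (the `kenBurns` helper and its defaults are mine; the defaults just reproduce the `zoompan` filter above):

```javascript
// Build an FFmpeg zoompan (Ken Burns) filter string from named options.
// The helper is illustrative; defaults match the filter shown above.
function kenBurns({ zoomStep = 0.0015, maxZoom = 1.5, frames = 180,
                    width = 1920, height = 1080 } = {}) {
  return `zoompan=z='min(zoom+${zoomStep},${maxZoom})'` +
         `:d=${frames}:s=${width}x${height}`;
}

const filter = kenBurns();
// → zoompan=z='min(zoom+0.0015,1.5)':d=180:s=1920x1080
```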
BGM mixing — keep it simple:
```
// DON'T use loudnorm + sidechaincompress (hangs indefinitely)
// DO use simple volume control
amix=inputs=2:duration=first [mixed]; [mixed] volume=0.12
```
That last one cost me 2 hours of debugging. The `sidechaincompress` filter in FFmpeg can deadlock with `loudnorm` in certain filter graph configurations. A simple `volume=0.12` on the background music track works perfectly.
### Phase 7: Self-Evaluation Loop
After upload, the pipeline evaluates itself across 5 dimensions:
- Hook strength (does it grab attention in 3 seconds?)
- Content value density (insight per second)
- SEO optimization (title, description, tags)
- Audio-visual quality (transitions, subtitle timing, BGM balance)
- Conversion potential (CTA effectiveness)
Score below 85? It generates specific improvements and saves them to `next_run_improvements.json` for the next run. The system literally gets better every time it runs.
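A minimal sketch of the scoring step (the five dimension names come from the list above; the equal weighting, the `evaluate` function, and the message format are my illustrative assumptions):

```javascript
// Average the five rubric dimensions; anything under the 85 bar becomes
// an improvement note for the next run. Structure is illustrative.
function evaluate(scores) {
  const dims = ["hook", "density", "seo", "audioVisual", "conversion"];
  const total = dims.reduce((sum, d) => sum + scores[d], 0) / dims.length;
  const improvements = dims
    .filter((d) => scores[d] < 85)
    .map((d) => `raise ${d} (scored ${scores[d]})`);
  return { total, pass: total >= 85, improvements };
}

const report = evaluate({
  hook: 90, density: 80, seo: 88, audioVisual: 92, conversion: 85,
});
// report.pass is true (average 87), but "raise density (scored 80)" is
// still queued — in the real pipeline it would land in
// next_run_improvements.json for the next run to consume.
```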
## Real Results
Here's a video this pipeline produced — fully automated, zero manual editing:
1080p, Veo 3.1 video clips, professional transitions, word-by-word subtitles, background music, and auto-generated thumbnail.
## What I Learned
1. Model fallbacks are essential. Every AI model has rate limits and outages. My pipeline has automatic fallback chains (Veo 3.1 → Sora Turbo → image-to-video). It never fails completely.
2. Binary data handling matters. Early on, I was concatenating audio buffers as strings (`raw += chunk`). The audio came out corrupted. Switching to `Buffer.concat(chunks)` fixed it instantly. Seems obvious in hindsight.
3. Don't over-engineer audio mixing. FFmpeg's `sidechaincompress` and `loudnorm` are powerful but can deadlock in complex filter graphs. Sometimes the simple solution (`volume=0.12`) is the correct one.
4. The unified API gateway was the breakthrough. Not because any individual model was special, but because the orchestration became trivial. When adding a new capability means changing one model name instead of integrating a new SDK, you build things you never would have attempted.
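Lesson 1's fallback chain is conceptually just ordered generators tried in sequence. A simplified synchronous sketch (the real calls are async API requests; the `withFallback` helper and the stub generators are illustrative):

```javascript
// Try each generator in order until one succeeds. The model order matches
// the chain in the text; the generators are stubs standing in for API calls.
function withFallback(generators) {
  const errors = [];
  for (const [model, generate] of generators) {
    try {
      return { model, clip: generate() };
    } catch (err) {
      errors.push(`${model}: ${err.message}`); // record and fall through
    }
  }
  throw new Error(`all models failed: ${errors.join("; ")}`);
}

const result = withFallback([
  ["veo-3.1",        () => { throw new Error("rate limited"); }],
  ["sora-turbo",     () => { throw new Error("outage"); }],
  ["image-to-video", () => "clip.mp4"],
]);
// result → { model: "image-to-video", clip: "clip.mp4" }
```

The pipeline only throws when the whole chain is exhausted, which in practice means a single provider outage never kills a run.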
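The buffer bug from lesson 2 is easy to reproduce: coercing a Buffer to a string round-trips it through UTF-8, which mangles any bytes that aren't valid UTF-8 sequences (common in raw audio):

```javascript
// Binary chunks with high bytes survive Buffer.concat but are mangled
// by string concatenation.
const chunks = [Buffer.from([0xff, 0xd8]), Buffer.from([0x00, 0x9c])];

// Wrong: += implicitly calls toString(), and invalid UTF-8 bytes are
// replaced with U+FFFD, so converting back changes the byte count.
let raw = "";
for (const chunk of chunks) raw += chunk;
const corrupted = Buffer.from(raw);

// Right: concatenate at the byte level.
const intact = Buffer.concat(chunks);

console.log(intact.length);    // 4: the original bytes, untouched
console.log(corrupted.length); // larger, because bytes became U+FFFD
```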
## Get Started

If you want to try the multi-model approach:

```shell
# Install SkillBoss
curl -fsSL https://skillboss.co/install.sh | bash

# Use with Claude Code, Cursor, or any AI coding assistant
# Or use the OpenAI-compatible endpoint directly:
#   https://api.skillboss.co/v1
```
New accounts get $2 in free credits. No subscription needed.
What's the most complex AI pipeline you've built? I'd love to hear about multi-model workflows in the comments.