<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ycbing</title>
    <description>The latest articles on DEV Community by ycbing (@ycbing).</description>
    <link>https://dev.to/ycbing</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4000158%2Fa1034b6e-8eb7-4867-ac77-6391988f740b.jpg</url>
      <title>DEV Community: ycbing</title>
      <link>https://dev.to/ycbing</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ycbing"/>
    <language>en</language>
    <item>
      <title>I Built an AI Studio That Turns 1 Line of Text Into a 5-Minute Short Drama</title>
      <dc:creator>ycbing</dc:creator>
      <pubDate>Wed, 24 Jun 2026 08:52:21 +0000</pubDate>
      <link>https://dev.to/ycbing/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama-j01</link>
      <guid>https://dev.to/ycbing/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama-j01</guid>
      <description>&lt;h1&gt;
  
  
  I Built an AI Studio That Turns 1 Line of Text Into a 5-Minute Short Drama
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;From idea to 1080p video — AI scriptwriting, storyboarding, multi-voice dubbing, and video compositing in one pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent the past month building &lt;a href="https://github.com/ycbing/Shortify-AI" rel="noopener noreferrer"&gt;Shortify AI&lt;/a&gt;, an open-source platform that takes a creative prompt like "a time-traveling maid from ancient China lands in a modern office" and outputs a full short drama episode — script, illustrated storyboard, multi-character voiceover, and a complete 1080p video.&lt;/p&gt;

&lt;p&gt;Here's how it works under the hood, the architecture decisions I made, and the full pipeline that ties it together.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;China's short drama market hit $70B in 2025. These are vertical, fast-paced 1–5 minute episodes, essentially TikTok meets a TV series. A traditional production pipeline looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Script&lt;/strong&gt; → human screenwriter (days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storyboard&lt;/strong&gt; → illustrator + director (days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voiceover&lt;/strong&gt; → recording studio + voice actors (days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video&lt;/strong&gt; → editor + effects (days)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total: 5–10 people, 1–2 weeks per episode.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to compress this to: &lt;strong&gt;1 person, 5 minutes.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the end-to-end pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input ("A female general who time-travels to the present day")
       │
       ▼
┌─────────────────────┐
│   AI Scriptwriter   │  ← GLM-4-Flash / DeepSeek / Qwen
│  (characters +      │
│   scenes + shots)   │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Storyboard Images  │  ← Wan2.7-image / CogView-3-Plus
│  (1 image per shot) │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Multi-voice TTS    │  ← iFlytek WebSocket / Edge-TTS
│  (male / female /   │
│   narrator per role)│
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Video Compositing  │  ← FFmpeg (Ken Burns + AI video)
│  (1080p, subtitles, │
│   background music) │
└─────────┬───────────┘
          │
          ▼
    COS Storage + Share URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage runs independently and can be swapped — more on that below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1: AI Scriptwriting
&lt;/h2&gt;

&lt;p&gt;The LLM acts as a screenwriting assistant. Given a creative prompt, it generates a structured script with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Characters&lt;/strong&gt; (name, gender, voice type, appearance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scenes&lt;/strong&gt; (location, atmosphere, time of day)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shots&lt;/strong&gt; (character, dialogue, narration, camera direction)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The prompt engineering was the hardest part. Early versions produced flat narration-style scripts. The breakthrough was switching to &lt;strong&gt;dialogue-centric formatting&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified prompt structure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
Generate a short drama script in this format:
{
  "characters": [{ "name": "...", "gender": "male|female", "voiceId": "..." }],
  "shots": [
    {
      "shotNumber": 1,
      "character": "...",
      "dialogue": "...",
      "sceneDescription": "...",
      "cameraDirection": "close-up|wide|over-shoulder"
    }
  ]
}
Rules:
- Each shot = one line of dialogue
- Include scene descriptions for every shot
- Mark characters explicitly so we can assign voice models
- Total length: 7-12 shots per episode
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;API:&lt;/strong&gt; GLM-4-Flash (free tier) by default, but any OpenAI-compatible LLM works via the model resolver layer.&lt;/p&gt;
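&lt;p&gt;LLM replies do not always match the requested format, so it helps to check them before the later stages run. Below is a minimal sketch of such a check, assuming the JSON shape from the prompt above. The type names and &lt;code&gt;validateScript&lt;/code&gt; are hypothetical and not taken from the Shortify AI codebase, and the rule that every speaker must be a declared character is an assumption of this sketch.&lt;/p&gt;

```typescript
// Hypothetical types mirroring the JSON format requested in the prompt.
type Character = { name: string; gender: string; voiceId: string };
type Shot = {
  shotNumber: number;
  character: string;
  dialogue: string;
  sceneDescription: string;
  cameraDirection: string;
};
type Script = { characters: Character[]; shots: Shot[] };
type ValidationResult = { script: Script | null; problems: string[] };

// Parse the raw LLM reply and collect problems; an empty list means usable.
function validateScript(raw: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { script: null, problems: ["reply is not valid JSON"] };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { script: null, problems: ["reply is not a JSON object"] };
  }
  const candidate = parsed as { characters?: unknown; shots?: unknown };
  if (!Array.isArray(candidate.characters) || !Array.isArray(candidate.shots)) {
    return { script: null, problems: ["missing characters or shots array"] };
  }
  // Field-level shapes are trusted here; a schema library could check them too.
  const script = parsed as Script;
  const problems: string[] = [];
  const shotCount = script.shots.length;
  if (7 > shotCount || shotCount > 12) {
    problems.push(`expected 7-12 shots, got ${shotCount}`);
  }
  const knownNames = new Set(script.characters.map((c) => c.name));
  for (const shot of script.shots) {
    if (!knownNames.has(shot.character)) {
      problems.push(`shot ${shot.shotNumber}: unknown character "${shot.character}"`);
    }
    if (!shot.dialogue) {
      problems.push(`shot ${shot.shotNumber}: empty dialogue`);
    }
  }
  return { script, problems };
}
```

&lt;p&gt;A non-empty &lt;code&gt;problems&lt;/code&gt; list could be fed back to the model in a retry prompt instead of failing the whole episode.&lt;/p&gt;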




&lt;h2&gt;
  
  
  Stage 2: AI Storyboard Images
&lt;/h2&gt;

&lt;p&gt;For each shot, we generate an illustration. The challenge wasn't the image generation itself — it was &lt;strong&gt;managing cost and consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The model chain:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wan2.7-image (DashScope, ~$0.03/image)  → 2K resolution, synchronous
  └── fallback → Wanx-v1 (older, cheaper) → 720p, async polling
    └── fallback → CogView-3-Plus (Zhipu) → last resort, different provider and API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
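&lt;p&gt;The chain above boils down to "try each provider in order and keep the first success". Here is a rough sketch of that control flow. The &lt;code&gt;ImageProvider&lt;/code&gt; shape, &lt;code&gt;generateWithFallback&lt;/code&gt;, and the stand-in providers are hypothetical. Real provider calls are network requests and would be awaited; the sketch is synchronous only to keep it short.&lt;/p&gt;

```typescript
// Hypothetical provider shape: a name plus a function that returns an image
// URL or throws. Real providers would be async network calls.
type ImageProvider = { name: string; generate: (prompt: string) => string };
type ChainResult = { url: string; provider: string; errors: string[] };

// Try providers in order; return the first success plus the errors seen so far.
function generateWithFallback(providers: ImageProvider[], prompt: string): ChainResult {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const url = provider.generate(prompt);
      return { url, provider: provider.name, errors };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`all image providers failed: ${errors.join("; ")}`);
}

// Stand-in providers for illustration only; the URLs are placeholders.
const imageChain: ImageProvider[] = [
  { name: "wan2.7-image", generate: () => { throw new Error("rate limited"); } },
  { name: "wanx-v1", generate: (prompt) => `https://example.invalid/wanx/${encodeURIComponent(prompt)}.png` },
  { name: "cogview-3-plus", generate: (prompt) => `https://example.invalid/cogview/${encodeURIComponent(prompt)}.png` },
];

const imageResult = generateWithFallback(imageChain, "office at night");
console.log(imageResult.provider, imageResult.errors);
```

&lt;p&gt;Keeping the collected errors alongside the result makes it easy to log which tier actually produced each image.&lt;/p&gt;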



&lt;p&gt;For character consistency across shots, we inject appearance descriptors into every prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildAppearancePrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;shotChars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;characters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;character&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;appearanceDesc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;shotChars&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;appearance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;. &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;shot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sceneDescription&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;. &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;appearanceDesc&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;. 
          Cinematic lighting, 16:9 widescreen, photorealistic.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Images are uploaded to Tencent Cloud COS (private bucket) with signed URLs for secure access.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Multi-Voice Dubbing
&lt;/h2&gt;

&lt;p&gt;This was the most fun to build. The LLM script tells us which character speaks in each shot, and we assign voice IDs accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;voiceMap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;VoiceConfig&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;male-lead&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;edgeTTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zh-CN-YunxiNeural&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;iFlytek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x4_yehaoyun_oral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;female-lead&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;edgeTTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zh-CN-XiaoxiaoNeural&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;iFlytek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x4_shisan_oral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;narrator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;edgeTTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zh-CN-YunjianNeural&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;iFlytek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x4_yunbai_oral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;child&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;edgeTTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zh-CN-XiaoxuanNeural&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;iFlytek&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x4_yunxiaoyan_oral&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure handling:&lt;/strong&gt; The primary TTS (iFlytek WebSocket) is sometimes rate-limited. When that happens, the pipeline automatically falls back to Edge-TTS, which is free and needs no API key (it still calls Microsoft's online speech service, so it requires network access). The voiceover stage runs at ~3 seconds per shot, so a 12-shot episode takes ~36 seconds to dub.&lt;/p&gt;
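&lt;p&gt;Because each role stores a voice ID per provider, the fallback has to swap the voice ID along with the engine. A minimal sketch of that lookup follows. &lt;code&gt;resolveVoice&lt;/code&gt;, &lt;code&gt;dubLine&lt;/code&gt;, and the &lt;code&gt;Synthesizer&lt;/code&gt; signature are hypothetical, and defaulting unknown roles to the narrator voice is an assumption of this sketch.&lt;/p&gt;

```typescript
type VoiceConfig = { edgeTTS: string; iFlytek: string };
type TtsProvider = "iFlytek" | "edgeTTS";
// Hypothetical synthesizer signature: returns the path of the audio file it
// wrote, or throws (for example when rate-limited).
type Synthesizer = (text: string, voiceId: string) => string;

const narratorVoice: VoiceConfig = { edgeTTS: "zh-CN-YunjianNeural", iFlytek: "x4_yunbai_oral" };
const roleVoices: { [role: string]: VoiceConfig } = {
  "male-lead": { edgeTTS: "zh-CN-YunxiNeural", iFlytek: "x4_yehaoyun_oral" },
  "female-lead": { edgeTTS: "zh-CN-XiaoxiaoNeural", iFlytek: "x4_shisan_oral" },
  narrator: narratorVoice,
};

// Unknown roles fall back to the narrator voice (an assumption of this sketch).
function resolveVoice(role: string, provider: TtsProvider): string {
  const config = roleVoices[role] ?? narratorVoice;
  return config[provider];
}

// Try the primary engine first; on any error, retry with the fallback engine
// using that engine's own voice ID for the same role.
function dubLine(text: string, role: string, primary: Synthesizer, fallback: Synthesizer) {
  try {
    return { provider: "iFlytek", audio: primary(text, resolveVoice(role, "iFlytek")) };
  } catch {
    return { provider: "edgeTTS", audio: fallback(text, resolveVoice(role, "edgeTTS")) };
  }
}
```

&lt;p&gt;In the real pipeline both synthesizers would be async calls, but the role-to-voice lookup stays the same.&lt;/p&gt;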




&lt;h2&gt;
  
  
  Stage 4: Video Compositing — The Hard Part
&lt;/h2&gt;

&lt;p&gt;This is where most of the engineering effort went. The compositing pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For each shot:
  1. Generate AI video (Wan2.7-t2v / CogVideoX) → or fallback to Ken Burns
  2. Mix voiceover audio → sync with video
  3. Speed-ramp video to match audio duration
  4. Apply fade-in/fade-out
  5. Generate SRT subtitles

Then:
  6. Concat all shot videos → episode
  7. Burn subtitles into final video
  8. Upload to COS
  9. Generate share URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
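&lt;p&gt;Step 5 is mostly timestamp arithmetic. The sketch below builds an SRT file for a whole episode, assuming shots play back to back and that each shot's final duration is already known. &lt;code&gt;TimedLine&lt;/code&gt;, &lt;code&gt;toSrtTime&lt;/code&gt;, and &lt;code&gt;buildSrt&lt;/code&gt; are hypothetical names, not the project's own.&lt;/p&gt;

```typescript
// One subtitle line with the duration of the shot it belongs to (seconds).
type TimedLine = { text: string; durationSec: number };

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(totalSec: number): string {
  const totalMs = Math.round(totalSec * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Lay cues end to end: each cue starts where the previous one stopped.
function buildSrt(lines: TimedLine[]): string {
  let cursor = 0;
  const cues = lines.map((line, i) => {
    const start = cursor;
    cursor += line.durationSec;
    return `${i + 1}\n${toSrtTime(start)} --> ${toSrtTime(cursor)}\n${line.text}\n`;
  });
  return cues.join("\n");
}

console.log(buildSrt([
  { text: "Where am I?", durationSec: 2.5 },
  { text: "Welcome to the office.", durationSec: 3 },
]));
```

&lt;p&gt;Computing cue times from the shot durations keeps the subtitles aligned with the concatenated video, as long as the durations are taken after speed-ramping.&lt;/p&gt;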



&lt;h3&gt;
  
  
  Ken Burns Camera Effects
&lt;/h3&gt;

&lt;p&gt;When AI video generation is disabled (or fails), we fall back to static images with simulated camera motion. FFmpeg's &lt;code&gt;zoompan&lt;/code&gt; filter drives 10 different effects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: slow zoom-in with fade&lt;/span&gt;
ffmpeg &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;-loop&lt;/span&gt; 1 &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"image.jpg"&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"audio.mp3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-filter_complex&lt;/span&gt; &lt;span class="s2"&gt;"
    [0:v]scale=1920:1080:force_original_aspect_ratio=decrease,
          pad=1920:1080:(ow-iw)/2:(oh-ih)/2:color=black,
          zoompan=z='min(zoom+0.002,1.8)':d=240:s=1920x1080:fps=24,
          fade=t=in:st=0:d=0.4,
          fade=t=out:st=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;(dur-0.4).toFixed(1)&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:d=0.4[v];
    [1:a]adelay=1|1[a]
  "&lt;/span&gt; &lt;span class="nt"&gt;-map&lt;/span&gt; &lt;span class="s2"&gt;"[v]"&lt;/span&gt; &lt;span class="nt"&gt;-map&lt;/span&gt; &lt;span class="s2"&gt;"[a]"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt;:v libx264 &lt;span class="nt"&gt;-crf&lt;/span&gt; 20 &lt;span class="nt"&gt;-preset&lt;/span&gt; fast &lt;span class="nt"&gt;-shortest&lt;/span&gt; &lt;span class="s2"&gt;"output.mp4"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;10 camera effects: zoom-in, zoom-out, pan-left, pan-right, pan-up, pan-down, zoom-in-left, zoom-in-right, zoom-out-left, zoom-out-right. Each shot picks one in round-robin, making static images feel dynamic.&lt;/p&gt;
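&lt;p&gt;The round-robin pick and the fade timing are both small enough to sketch. &lt;code&gt;pickEffect&lt;/code&gt; and &lt;code&gt;buildFadeFilters&lt;/code&gt; are hypothetical helper names. The 0.4-second fade and the filter syntax come from the command above, while clamping the fade-out start at zero for very short shots is an assumption of this sketch.&lt;/p&gt;

```typescript
// The ten effect names listed above, in the order they are cycled through.
const CAMERA_EFFECTS: string[] = [
  "zoom-in", "zoom-out", "pan-left", "pan-right", "pan-up",
  "pan-down", "zoom-in-left", "zoom-in-right", "zoom-out-left", "zoom-out-right",
];

// Round-robin: shot 0 gets the first effect, shot 10 wraps back to it.
function pickEffect(shotIndex: number): string {
  return CAMERA_EFFECTS[shotIndex % CAMERA_EFFECTS.length] ?? "zoom-in";
}

// Build the fade-in / fade-out part of the filter chain for one shot.
// The fade-out starts fadeSec before the end, never before time zero.
function buildFadeFilters(durationSec: number, fadeSec = 0.4): string {
  const fadeOutStart = Math.max(0, durationSec - fadeSec).toFixed(1);
  return `fade=t=in:st=0:d=${fadeSec},fade=t=out:st=${fadeOutStart}:d=${fadeSec}`;
}

for (let i = 0; 3 > i; i++) {
  console.log(i, pickEffect(i), buildFadeFilters(5));
}
```

&lt;p&gt;Picking by index rather than at random keeps renders reproducible, so re-running a failed episode produces the same camera moves.&lt;/p&gt;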

&lt;h3&gt;
  
  
  The "Never Black" Fallback
&lt;/h3&gt;

&lt;p&gt;AI video generation has a ~10-15% failure rate (API timeouts, rate limits, content filters). The original pipeline simply showed a black screen for failed shots. The fix was a three-tier defense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tier 1: AI video generation → succeeds ~85-90% of the time
Tier 2: Ken Burns zoompan → catches most of the remaining ~10-15%
Tier 3: Static image + text overlay → last resort, 100% reliability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2 was the most fragile&lt;/strong&gt; — the zoompan filter chain with complex expressions often failed on edge cases (very long/short audio, special characters in subtitles). Tier 3 uses ffmpeg's &lt;code&gt;drawtext&lt;/code&gt; to render the subtitle over a dark gradient background — it literally never fails.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Pipeline Script
&lt;/h2&gt;

&lt;p&gt;We abstracted the entire end-to-end flow into a single CLI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tsx scripts/full-pipeline.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads the drama metadata from PostgreSQL, iterates over every episode and shot, runs the 4-stage pipeline, and uploads everything to cloud storage. The same script powers the web app's backend API.&lt;/p&gt;

&lt;p&gt;How long a run takes depends mostly on whether AI video generation is enabled:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (4 AI video generations):&lt;/strong&gt; ~20 minutes per episode (5 min/shot × 4 shots + 1080p encoding)&lt;br&gt;
&lt;strong&gt;After (Ken Burns only):&lt;/strong&gt; ~2 minutes per episode&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Reliability beats quality.
&lt;/h3&gt;

&lt;p&gt;Early on, I chased the best AI video models (CogVideoX, Kling, Jimeng). But they all have ~10% failure rates, and a single failed shot ruins the entire episode. Investing in a &lt;strong&gt;rock-solid fallback chain&lt;/strong&gt; that guaranteed no black screens was worth more than a 10% quality bump.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Model diversity matters.
&lt;/h3&gt;

&lt;p&gt;No single AI provider covers everything. We use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zhipu GLM&lt;/strong&gt; for script generation (best Chinese LLM for creative writing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DashScope Wan&lt;/strong&gt; for image/video generation (best price/quality ratio)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iFlytek&lt;/strong&gt; for TTS (WebSocket, low latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tencent COS&lt;/strong&gt; for storage (CDN edge nodes in China)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model config system lets users bring their own API keys for any stage.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Video compositing is the bottleneck.
&lt;/h3&gt;

&lt;p&gt;AI content generation is getting fast and cheap. The bottleneck is now ffmpeg. A 12-shot episode with Ken Burns processing takes ~3 minutes just for the encoding. If you're building an AI video platform, &lt;strong&gt;invest in your compositing pipeline first&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Here's a 5-episode drama generated entirely by the pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Episode&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Sample / Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Episode 1&lt;/td&gt;
&lt;td&gt;26s&lt;/td&gt;
&lt;td&gt;5.8MB&lt;/td&gt;
&lt;td&gt;&lt;a href="https://craftmind.cn/api/uploads/cos/ca5472e9-f9ea-4c61-9d29-9e7cbc69aaf0%2Fvideos%2Fepisode-1.mp4" rel="noopener noreferrer"&gt;Watch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episode 2&lt;/td&gt;
&lt;td&gt;22s&lt;/td&gt;
&lt;td&gt;4.0MB&lt;/td&gt;
&lt;td&gt;&lt;a href="https://craftmind.cn/api/uploads/cos/ca5472e9-f9ea-4c61-9d29-9e7cbc69aaf0%2Fvideos%2Fepisode-2.mp4" rel="noopener noreferrer"&gt;Watch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Episode 3&lt;/td&gt;
&lt;td&gt;23s&lt;/td&gt;
&lt;td&gt;5.2MB&lt;/td&gt;
&lt;td&gt;&lt;a href="https://craftmind.cn/api/uploads/cos/ca5472e9-f9ea-4c61-9d29-9e7cbc69aaf0%2Fvideos%2Fepisode-3.mp4" rel="noopener noreferrer"&gt;Watch&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;All 5&lt;/td&gt;
&lt;td&gt;1m47s&lt;/td&gt;
&lt;td&gt;21.2MB&lt;/td&gt;
&lt;td&gt;Generated in 7 minutes (Ken Burns) or ~1 hour (with AI video)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The project is open-source under the MIT license:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ycbing/Shortify-AI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Shortify-AI
npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;span class="c"&gt;# Configure .env.local with at least DATABASE_URL and GLM_API_KEY&lt;/span&gt;
npm run db:push
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or run the full pipeline directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx tsx scripts/full-pipeline.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The only API key you need is GLM's&lt;/strong&gt; (plus a PostgreSQL database for &lt;code&gt;DATABASE_URL&lt;/code&gt;). Everything else has a free tier or falls back gracefully.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advanced character consistency:&lt;/strong&gt; IP-Adapter / reference image injection for cross-shot face consistency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-prod editing UI:&lt;/strong&gt; Drag-and-drop shot reordering, manual image replacement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical mode (9:16):&lt;/strong&gt; TikTok/Reels/Shorts format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sound effects library:&lt;/strong&gt; Automatic ambient audio (footsteps, rain, doors)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with Next.js 16, FFmpeg, PostgreSQL, and a lot of API calls. &lt;a href="https://github.com/ycbing/Shortify-AI" rel="noopener noreferrer"&gt;Star on GitHub&lt;/a&gt; if you found this interesting.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>video</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
