Kazutaka Sugiyama

How I Built an AI Pipeline That Turns GitHub Repos into Demo Videos

You know that moment when you finish building something cool, push it to GitHub, and then realize... nobody's going to read your README?

I've been there too many times. So I built RepoClip — a tool that takes a GitHub URL, analyzes the source code with AI, and generates a 60-second promo video with narration, images, and music. Automatically.

In this post, I'll walk through the architecture: how I orchestrate 6+ AI services in a single pipeline, the technical decisions that actually mattered, and the mistakes I made along the way.

The Pipeline at a Glance

GitHub URL → Code Fetch → Gemini Analysis → Parallel Asset Gen → Remotion Render → MP4

Simple on paper. In practice, it's 13 orchestrated steps, each of which can fail in creative ways.

Why Inngest for Orchestration

The first decision was how to manage a pipeline where each step depends on the last, calls to external APIs can take 30+ seconds, and failures need targeted retries.

I went with Inngest over a simple queue for one reason: step-level isolation. Each step.run() is independently retried and logged.

export const generateVideo = inngest.createFunction(
  { id: "generate-video", retries: 1 },
  { event: "video/generate.requested" },
  async ({ event, step }) => {
    // Everything the pipeline needs rides in on the event payload
    const { githubUrl, accessToken, customPrompt, aspectRatio, visualStyle, projectId } =
      event.data;

    const { code, commitHash } = await step.run("fetch-github-code", async () => {
      return await fetchGitHubCode(githubUrl, accessToken);
    });

    const videoConfig = await step.run("analyze-with-gemini", async () => {
      return await analyzeWithGemini(code, githubUrl, commitHash, customPrompt);
    });

    const assets = await step.run("generate-assets", async () => {
      const [images, narrations] = await Promise.all([
        generateImages(videoConfig.scenes, aspectRatio, visualStyle),
        generateNarrations(videoConfig.scenes, projectId, videoConfig.voice),
      ]);
      return { images, narrations };
    });

    // ... render step
  }
);

If Gemini fails, it retries that step — not the whole pipeline. If rendering fails after assets are already generated, those assets are preserved.

This also enabled a selective refund strategy: credits are refunded only if the render step fails. By that point we've already consumed the upstream API calls, but the user still has no video, so they get their credits back; earlier-stage failures are simply retried.
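That rule can be sketched as a predicate over stage names. The stage ids mirror the step names in the pipeline, but the helper itself is my illustration, not RepoClip's actual code:

```typescript
// Only a failure at the render stage returns the user's credits;
// earlier stages (fetch, analysis, asset generation) are retried instead.
const REFUNDABLE_STAGES = new Set(["render-video"]);

function shouldRefund(failedStage: string): boolean {
  return REFUNDABLE_STAGES.has(failedStage);
}
```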

Teaching Gemini to Think Like a Video Producer

The hardest part wasn't calling the API — it was prompt engineering for consistent, structured output.

Gemini 2.5 Flash analyzes the repository code and returns a VideoConfig JSON: title, scenes, narration scripts, image prompts, voice selection, and styling. Here's what I learned:

The Sandwich Technique

Custom user instructions (like "make it sound enthusiastic" or "use anime style visuals") get placed both before and after the code content in the prompt. This "sandwich" structure significantly improved instruction adherence — placing them only at the beginning caused Gemini to "forget" them after processing thousands of lines of code.
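A minimal sketch of the sandwich layout (`buildPrompt` and its arguments are illustrative, not RepoClip's actual prompt):

```typescript
// Place the user's custom instructions both before and after the code,
// so the model still "remembers" them after thousands of lines of input.
function buildPrompt(systemRules: string, userInstructions: string, code: string): string {
  return [
    systemRules,
    `User instructions: ${userInstructions}`,
    "--- REPOSITORY CODE ---",
    code,
    "--- END CODE ---",
    `Remember the user instructions: ${userInstructions}`,
  ].join("\n\n");
}
```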

Language Detection ≠ Prompt Language

A Japanese developer writing a custom prompt in Japanese might still have an English-language repository. The system detects the repository's language from README content, code comments, and UI strings — not from the custom prompt language. This was a subtle but important distinction.
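As a toy sketch of the idea (the real detector also weighs code comments and UI strings; the character-ratio heuristic and threshold here are my own):

```typescript
// Detect the repository's language from its content, ignoring whatever
// language the user's custom prompt happens to be written in.
function detectRepoLanguage(readme: string): "ja" | "en" {
  // Ranges cover hiragana, katakana, and common CJK ideographs.
  const japaneseChars = (readme.match(/[\u3040-\u30ff\u4e00-\u9faf]/g) ?? []).length;
  return japaneseChars / Math.max(readme.length, 1) > 0.1 ? "ja" : "en";
}
```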

Image Prompts Stay English

Regardless of the detected language, all imagePrompt fields are generated in English. Why? Flux.2 (our image model) performs significantly better with English prompts. The narration and titles are in the detected language, but image generation always goes through English.

Parallel Asset Generation

Once Gemini returns the video structure, we generate images and narrations in parallel for all scenes:

const [images, narrations] = await Promise.all([
  // All 5-8 images generated simultaneously via Fal.ai
  Promise.all(scenes.map(scene => generateImage(scene.imagePrompt))),
  // All narrations generated simultaneously via OpenAI TTS
  Promise.all(scenes.map(scene => generateNarration(scene.narration))),
]);

This cuts asset generation time from ~60s (sequential) to ~12s. The tradeoff is higher burst API usage, but for a credit-based SaaS the speed improvement is worth it.

Audio Duration: The Surprisingly Hard Problem

To compose the video, I need to know exactly how long each narration clip is. My approach:

  1. Primary: Parse the actual MP3 with music-metadata to get precise duration
  2. Fallback: Estimate from word count at 130 words/minute

The fallback exists because music-metadata occasionally fails on certain MP3 encodings. A 3-second minimum prevents empty scenes.
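The fallback path reduces to a pure function (constants from the list above; the primary path would read `format.duration` from music-metadata's `parseBuffer` and only fall through to this on failure):

```typescript
// Estimate narration length from word count at 130 words/minute,
// clamped to a 3-second minimum so a scene is never empty.
function estimateDurationSec(narration: string, wpm = 130, minSec = 3): number {
  const words = narration.trim().split(/\s+/).filter(Boolean).length;
  return Math.max((words / wpm) * 60, minSec);
}
```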

Remotion for Programmatic Video

Remotion lets you write video compositions as React components. Each scene is a <Sequence> with calculated frame offsets:

{scenes.map((scene, index) => {
  const startFrame = calculateStartFrame(index);
  const durationFrames = scene.audioDuration * fps;

  return (
    <Sequence key={scene.id} from={startFrame} durationInFrames={durationFrames}>
      <SceneSlide scene={scene} style={style} />
      <Audio src={scene.audioUrl} />
    </Sequence>
  );
})}
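The `calculateStartFrame` helper isn't shown above; one plausible implementation (my sketch, assuming per-scene audio durations in seconds) accumulates the durations of all preceding scenes:

```typescript
// Each scene starts on the frame where the previous scenes' audio ends.
function makeCalculateStartFrame(audioDurationsSec: number[], fps: number) {
  return (index: number): number =>
    Math.round(audioDurationsSec.slice(0, index).reduce((sum, d) => sum + d, 0) * fps);
}
```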

Ken Burns Effect in React

Every scene uses a slow zoom (1x → 1.08x) over its duration to keep static images visually engaging:

const scale = interpolate(frame, [0, durationInFrames], [1, 1.08], {
  extrapolateRight: "clamp",
});

Small detail, but it makes the difference between "slideshow" and "video."

Responsive Across Aspect Ratios

RepoClip supports 16:9, 9:16 (Reels/Shorts), and 1:1 (social). Font sizes and layout scale based on viewport dimensions:

const isPortrait = height > width;
const fontSize = Math.round(width * (isPortrait ? 0.056 : 0.033));

This approach avoids maintaining separate templates per ratio.

Error Handling: The 80% of the Work

Every external API fails differently. Here's my retry strategy:

Stage                   Retries   Backoff
GitHub Fetch            3         Exponential (1s → 2s → 4s)
Gemini Analysis         2         Fixed 5s
Image Gen (per image)   3         Fixed 2s
TTS (per scene)         2         Fixed 2s
Video Render            1         None

Why exponential for GitHub but fixed for AI services? GitHub rate limits respond well to backoff. AI services either work or they don't — waiting longer rarely helps.
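Both policies fit behind one retry wrapper; only the backoff function changes. This is a sketch with my own helper names, not RepoClip's actual implementation:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn, retrying up to `retries` times; backoffMs decides the wait
// before retry attempt n (1-based).
async function withRetry<T>(
  fn: () => Promise<T>,
  retries: number,
  backoffMs: (attempt: number) => number,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await sleep(backoffMs(attempt + 1));
    }
  }
}

// GitHub: exponential backoff (1s, 2s, 4s). AI services: fixed delay.
const exponential = (attempt: number) => 1000 * 2 ** (attempt - 1);
const fixed = (ms: number) => (_attempt: number) => ms;
```

With this shape, the GitHub row of the table is `withRetry(fetch, 3, exponential)` and the image row is `withRetry(gen, 3, fixed(2000))`.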

Each failure is wrapped in a PipelineError that captures the stage name, a user-friendly message, and the raw error for Sentry. This lets me show users "Image generation failed, retrying..." instead of a cryptic 500.
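A sketch of that wrapper (field names are my assumptions, not RepoClip's exact shape):

```typescript
class PipelineError extends Error {
  readonly stage: string;       // pipeline stage that failed, e.g. "image-generation"
  readonly userMessage: string; // friendly text surfaced in the UI
  readonly rawError: unknown;   // original error, forwarded to Sentry

  constructor(stage: string, userMessage: string, rawError: unknown) {
    super(`[${stage}] ${userMessage}`);
    this.name = "PipelineError";
    this.stage = stage;
    this.userMessage = userMessage;
    this.rawError = rawError;
  }
}
```

A catch block can then show `err.userMessage` to the user while reporting `err.rawError` with full stage context.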

Things I'd Do Differently

1. Cache more aggressively. I cache Gemini analysis by repo+commit hash, but I should also cache at the asset level. Re-generating images for a retry is wasteful.

2. Start with webhooks, not polling. Remotion Lambda supports webhook notifications, but I started with polling (every 3s, max 200 attempts). Polling is simpler to implement but wasteful. I'm migrating to webhooks now.

3. Don't underestimate prompt testing. I spent more time tuning Gemini prompts than writing actual application code. If you're building AI features, budget 40% of development time for prompt iteration.

The Stack

For anyone curious about the full stack:

  • Framework: Next.js (App Router) + TypeScript
  • Orchestration: Inngest
  • Code Analysis: Gemini 2.5 Flash
  • Image Generation: Flux.2 via Fal.ai
  • Narration: OpenAI TTS
  • Video Rendering: Remotion Lambda
  • Database/Auth/Storage: Supabase
  • Payments: Lemon Squeezy
  • Deployment: Vercel

Try It Out

If you want to see what this produces, try it on your own repo: repoclip.io

The free tier gives you 2 videos/month — enough to test it on your side projects and see how the pipeline handles your codebase.

I'd love feedback from the dev.to community, especially on:

  • What kind of repos produce the best/worst results?
  • Is 60 seconds the right length, or would shorter/longer be better?
  • Any feature ideas for the roadmap?

Drop a comment or find me on GitHub. Happy to discuss the architecture in more detail.


Built as a solo dev project. The entire pipeline from URL input to rendered MP4 takes about 2-3 minutes. Still feels like magic every time.
