You know that moment when you finish building something cool, push it to GitHub, and then realize... nobody's going to read your README?
I've been there too many times. So I built RepoClip — a tool that takes a GitHub URL, analyzes the source code with AI, and generates a 60-second promo video with narration, images, and music. Automatically.
In this post, I'll walk through the architecture: how I orchestrate 6+ AI services in a single pipeline, the technical decisions that actually mattered, and the mistakes I made along the way.
The Pipeline at a Glance
GitHub URL → Code Fetch → Gemini Analysis → Parallel Asset Gen → Remotion Render → MP4
Simple on paper. In practice, it's 13 orchestrated steps, each of which can fail in creative ways.
Why Inngest for Orchestration
The first decision was how to manage a pipeline where each step depends on the last, external API calls can take 30+ seconds, and failures need targeted retries.
I went with Inngest over a simple queue for one reason: step-level isolation. Each step.run() is independently retried and logged.
export const generateVideo = inngest.createFunction(
{ id: "generate-video", retries: 1 },
{ event: "video/generate.requested" },
async ({ event, step }) => {
// Pull the job parameters out of the triggering event
const { githubUrl, accessToken, repoUrl, commitHash, customPrompt, projectId, aspectRatio, visualStyle } = event.data;
const code = await step.run("fetch-github-code", async () => {
return await fetchGitHubCode(githubUrl, accessToken);
});
const videoConfig = await step.run("analyze-with-gemini", async () => {
return await analyzeWithGemini(code, repoUrl, commitHash, customPrompt);
});
const assets = await step.run("generate-assets", async () => {
const [images, narrations] = await Promise.all([
generateImages(videoConfig.scenes, aspectRatio, visualStyle),
generateNarrations(videoConfig.scenes, projectId, videoConfig.voice),
]);
return { images, narrations };
});
// ... render step
}
);
If Gemini fails, it retries that step — not the whole pipeline. If rendering fails after assets are already generated, those assets are preserved.
This also enabled a selective refund strategy: credits are only refunded if the render step fails, because at that point we've already consumed API calls.
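As a minimal sketch of that refund rule (the stage name and helper are illustrative, not the actual implementation):

```typescript
// Credits come back only when the render step fails, i.e. after the
// upstream API spend has already happened. Stage names are assumptions.
const REFUNDABLE_STAGES = new Set(["render-video"]);

function shouldRefundCredits(failedStage: string): boolean {
  return REFUNDABLE_STAGES.has(failedStage);
}
```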
Teaching Gemini to Think Like a Video Producer
The hardest part wasn't calling the API — it was prompt engineering for consistent, structured output.
Gemini 2.5 Flash analyzes the repository code and returns a VideoConfig JSON: title, scenes, narration scripts, image prompts, voice selection, and styling. Here's what I learned:
The Sandwich Technique
Custom user instructions (like "make it sound enthusiastic" or "use anime style visuals") get placed both before and after the code content in the prompt. This "sandwich" structure significantly improved instruction adherence — placing them only at the beginning caused Gemini to "forget" them after processing thousands of lines of code.
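Roughly, the sandwich looks like this (the helper and prompt wording are illustrative; the real prompt is more elaborate):

```typescript
// Sketch of the "sandwich" prompt layout: custom instructions appear
// both before and after the repository code.
function buildAnalysisPrompt(codeContent: string, customInstructions?: string): string {
  const parts: string[] = [
    "You are a video producer. Analyze the repository below and return a VideoConfig JSON.",
  ];
  if (customInstructions) {
    // First slice of the sandwich: instructions before the code
    parts.push(`User instructions: ${customInstructions}`);
  }
  parts.push("--- REPOSITORY CODE ---", codeContent, "--- END REPOSITORY CODE ---");
  if (customInstructions) {
    // Second slice: repeat them after thousands of lines of code
    parts.push(`Remember the user instructions: ${customInstructions}`);
  }
  return parts.join("\n\n");
}
```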
Language Detection ≠ Prompt Language
A Japanese developer writing a custom prompt in Japanese might still have an English-language repository. The system detects the repository's language from README content, code comments, and UI strings — not from the custom prompt language. This was a subtle but important distinction.
Image Prompts Stay English
Regardless of the detected language, all imagePrompt fields are generated in English. Why? Flux.2 (our image model) performs significantly better with English prompts. The narration and titles are in the detected language, but image generation always goes through English.
Parallel Asset Generation
Once Gemini returns the video structure, we generate images and narrations in parallel for all scenes:
const [images, narrations] = await Promise.all([
// All 5-8 images generated simultaneously via Fal.ai
Promise.all(scenes.map(scene => generateImage(scene.imagePrompt))),
// All narrations generated simultaneously via OpenAI TTS
Promise.all(scenes.map(scene => generateNarration(scene.narration))),
]);
This cuts asset generation time from ~60s (sequential) to ~12s. The tradeoff is higher burst API usage, but for a credit-based SaaS the speed improvement is worth it.
Audio Duration: The Surprisingly Hard Problem
To compose the video, I need to know exactly how long each narration clip is. My approach:
- Primary: Parse the actual MP3 with `music-metadata` to get the precise duration
- Fallback: Estimate from word count at 130 words/minute
The fallback exists because music-metadata occasionally fails on certain MP3 encodings. A 3-second minimum prevents empty scenes.
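The fallback is simple enough to sketch in a few lines (the function name is illustrative; the constants are the ones from the text):

```typescript
// Fallback duration estimate when MP3 parsing fails: word count at
// 130 words/minute, with a 3-second floor so no scene flashes by.
const WORDS_PER_MINUTE = 130;
const MIN_SCENE_SECONDS = 3;

function estimateNarrationSeconds(narration: string): number {
  const words = narration.trim().split(/\s+/).filter(Boolean).length;
  const estimated = (words / WORDS_PER_MINUTE) * 60;
  return Math.max(MIN_SCENE_SECONDS, estimated);
}
```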
Remotion for Programmatic Video
Remotion lets you write video compositions as React components. Each scene is a <Sequence> with calculated frame offsets:
{scenes.map((scene, index) => {
const startFrame = calculateStartFrame(index);
const durationFrames = Math.round(scene.audioDuration * fps); // Remotion expects an integer frame count
return (
<Sequence key={scene.id} from={startFrame} durationInFrames={durationFrames}>
<SceneSlide scene={scene} style={style} />
<Audio src={scene.audioUrl} />
</Sequence>
);
})}
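A plausible implementation of `calculateStartFrame` is a running sum of the preceding scenes' durations. This sketch threads the scene list and fps in explicitly rather than closing over them as the snippet above does:

```typescript
// Each scene starts where the previous ones end, so the offset is a
// cumulative sum of audio durations converted to frames.
interface SceneTiming {
  audioDuration: number; // seconds, from the metadata/fallback step
}

function calculateStartFrame(scenes: SceneTiming[], index: number, fps: number): number {
  return scenes
    .slice(0, index)
    .reduce((frames, scene) => frames + Math.round(scene.audioDuration * fps), 0);
}
```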
Ken Burns Effect in React
Every scene uses a slow zoom (1x → 1.08x) over its duration to keep static images visually engaging:
const scale = interpolate(frame, [0, durationInFrames], [1, 1.08], {
extrapolateRight: "clamp",
});
Small detail, but it makes the difference between "slideshow" and "video."
Responsive Across Aspect Ratios
RepoClip supports 16:9, 9:16 (Reels/Shorts), and 1:1 (social). Font sizes and layout scale based on viewport dimensions:
const isPortrait = height > width;
const fontSize = Math.round(width * (isPortrait ? 0.056 : 0.033));
This approach avoids maintaining separate templates per ratio.
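Pulled into a standalone helper (the function name is mine), the same rule covers all three canvases:

```typescript
// One scaling rule for 16:9, 9:16, and 1:1 — portrait gets a larger
// width fraction because the canvas is narrower.
function titleFontSize(width: number, height: number): number {
  const isPortrait = height > width;
  return Math.round(width * (isPortrait ? 0.056 : 0.033));
}
```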
Error Handling: The 80% of the Work
Every external API fails differently. Here's my retry strategy:
| Stage | Retries | Backoff |
|---|---|---|
| GitHub Fetch | 3 | Exponential (1s → 2s → 4s) |
| Gemini Analysis | 2 | Fixed 5s |
| Image Gen (per image) | 3 | Fixed 2s |
| TTS (per scene) | 2 | Fixed 2s |
| Video Render | 1 | None |
Why exponential for GitHub but fixed for AI services? GitHub rate limits respond well to backoff. AI services either work or they don't — waiting longer rarely helps.
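The table translates into a small policy object; this is a sketch under my own naming (the real pipeline wires these values into the per-step retry logic):

```typescript
// Retry schedule per stage. Stage keys and the helper are illustrative.
type Backoff =
  | { kind: "exponential"; baseMs: number }
  | { kind: "fixed"; delayMs: number };

const RETRY_POLICY: Record<string, { retries: number; backoff: Backoff }> = {
  githubFetch: { retries: 3, backoff: { kind: "exponential", baseMs: 1000 } },
  geminiAnalysis: { retries: 2, backoff: { kind: "fixed", delayMs: 5000 } },
  imageGen: { retries: 3, backoff: { kind: "fixed", delayMs: 2000 } },
  tts: { retries: 2, backoff: { kind: "fixed", delayMs: 2000 } },
};

// Delay before retry attempt N (1-based)
function delayForAttempt(backoff: Backoff, attempt: number): number {
  return backoff.kind === "exponential"
    ? backoff.baseMs * 2 ** (attempt - 1)
    : backoff.delayMs;
}
```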
Each failure is wrapped in a PipelineError that captures the stage name, a user-friendly message, and the raw error for Sentry. This lets me show users "Image generation failed, retrying..." instead of a cryptic 500.
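A minimal version of that wrapper might look like this (field names are assumptions, not the actual class):

```typescript
// PipelineError carries the failing stage, a message safe to show in
// the UI, and the raw error for Sentry.
class PipelineError extends Error {
  constructor(
    public readonly stage: string,        // e.g. "image-generation"
    public readonly userMessage: string,  // shown to the user
    public readonly rawError: unknown     // forwarded to Sentry
  ) {
    super(`[${stage}] ${userMessage}`);
    this.name = "PipelineError";
  }
}
```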
Things I'd Do Differently
1. Cache more aggressively. I cache Gemini analysis by repo+commit hash, but I should also cache at the asset level. Re-generating images for a retry is wasteful.
2. Start with webhooks, not polling. Remotion Lambda supports webhook notifications, but I started with polling (every 3s, max 200 attempts). Polling is simpler to implement but wasteful. I'm migrating to webhooks now.
3. Don't underestimate prompt testing. I spent more time tuning Gemini prompts than writing actual application code. If you're building AI features, budget 40% of development time for prompt iteration.
The Stack
For anyone curious about the full stack:
- Framework: Next.js (App Router) + TypeScript
- Orchestration: Inngest
- Code Analysis: Gemini 2.5 Flash
- Image Generation: Flux.2 via Fal.ai
- Narration: OpenAI TTS
- Video Rendering: Remotion Lambda
- Database/Auth/Storage: Supabase
- Payments: Lemon Squeezy
- Deployment: Vercel
Try It Out
If you want to see what this produces, try it on your own repo: repoclip.io
The free tier gives you 2 videos/month — enough to test it on your side projects and see how the pipeline handles your codebase.
I'd love feedback from the dev.to community, especially on:
- What kind of repos produce the best/worst results?
- Is 60 seconds the right length, or would shorter/longer be better?
- Any feature ideas for the roadmap?
Drop a comment or find me on GitHub. Happy to discuss the architecture in more detail.
Built as a solo dev project. The entire pipeline from URL input to rendered MP4 takes about 2-3 minutes. Still feels like magic every time.