You know what's worse than a bug in production?
A bug you introduced while fixing another bug — deployed at midnight, right when your biggest traffic spike ever is happening.
The Setup: Our Best Day Ever
I run RepoClip, an AI-powered SaaS that generates promotional videos from GitHub repositories. On March 19, 2026, console.dev featured us in their newsletter.
Traffic exploded:
| Day | Active Users | New Users |
|---|---|---|
| March 19 (feature day) | 448 | 445 |
| March 20 (day after) | 154 | 139 |
448 users. For a solo indie SaaS, that felt massive.
But there was a problem I wouldn't discover until 24 hours later.
The "Fix" That Broke Everything
Earlier on March 19, I noticed that some video generations were failing because Remotion Lambda (our video renderer on AWS) couldn't download images from fal.ai's temporary CDN URLs fast enough. The URLs were expiring or timing out.
So at midnight (00:20 JST, March 20), I shipped what I thought was a solid improvement:
- Pre-fetch all media from fal.ai CDN → Supabase Storage before rendering
- Add retry logic — 3 attempts with exponential backoff
- Throw early if all clips failed to prefetch, instead of sending unreliable URLs to the renderer
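The retry part looked roughly like this. This is a simplified sketch, not the actual RepoClip code; `withRetry` and its parameters are hypothetical names, and `fn` stands in for the real fal.media-to-Supabase copy step:

```typescript
// Hypothetical sketch of the retry step: 3 attempts with exponential
// backoff. `fn` stands in for the real fal.media -> Supabase copy.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Delay doubles each attempt: 500ms, 1000ms, 2000ms, ...
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** i),
        );
      }
    }
  }
  // All attempts exhausted: surface the last error to the caller.
  throw lastError;
}
```

On its own, this part was fine. Retrying a flaky CDN fetch is cheap and occasionally rescues a clip.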
That third point was the killer. Here's the new check:
```typescript
// video-prefetch.ts — the "improvement"
const cached = results.filter((r, i) => r.url !== videos[i].url).length;
if (cached === 0 && videos.length > 0) {
  throw new Error(
    `All ${videos.length} video clips failed to prefetch from CDN`
  );
}
```
My reasoning: "If we can't cache any clips locally, Remotion will probably fail anyway. Let's fail fast and give the user a clear error instead of wasting 5 minutes on a doomed render."
It sounded so reasonable.
The Data I Should Have Checked First
The next evening, I ran a query against our projects table:
```sql
SELECT status, content_mode, COUNT(*) AS cnt
FROM projects
WHERE created_at >= '2026-03-19T15:00:00+00:00' -- after deploy
  AND created_at <  '2026-03-20T15:00:00+00:00'
GROUP BY status, content_mode;
```
| Status | Mode | Count |
|---|---|---|
| failed | video_short | 10 |
| completed | image | 3 |
Zero. Not a single video succeeded after my deploy.
Every single failure had the same message:
```
All 3 video clips failed to prefetch from CDN — rendering would likely fail
```
Meanwhile, before my fix on the same day:
| Status | Mode | Count |
|---|---|---|
| completed | video_short | 22 |
| failed | video_short | 7 |
| completed | image | 6 |
| failed | image | 8 |
76% video success rate before. 0% after. My "safety check" didn't just fail to help — it made things infinitely worse.
Why My Assumption Was Wrong
Here's what I didn't understand about my own infrastructure:
The prefetch path (broken):

```
Inngest (Vercel Edge) → fal.media CDN → frequently times out
```

The render path (working fine):

```
Remotion Lambda (AWS us-east-1) → fal.media CDN → usually succeeds
```
The prefetch runs on Vercel's serverless infrastructure. Remotion Lambda runs on AWS in us-east-1. The network path from AWS to fal.ai's CDN was far more reliable than from Vercel.
Before my fix, when prefetch failed, the code silently fell back to the original fal.media URLs. Remotion Lambda would then download them directly — and it usually worked. The 7 pre-fix failures were mostly unrelated issues (prompt rejections, Lambda crashes), not CDN problems.
By adding the throw, I cut off the fallback path entirely. The pipeline would die at the prefetch step without ever giving Remotion a chance to try.
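The pre-throw behavior can be sketched like this. Names here are hypothetical stand-ins, not the actual pipeline code; `prefetchClip` represents the fal.media-to-Supabase copy step:

```typescript
// Sketch of the original (pre-"fix") silent-fallback behavior.
// `prefetchClip` is a hypothetical stand-in for the fal.media ->
// Supabase Storage copy; it returns the cached URL or throws.
interface Clip {
  url: string;
}

async function prefetchAll(
  videos: Clip[],
  prefetchClip: (url: string) => Promise<string>,
): Promise<Clip[]> {
  return Promise.all(
    videos.map(async (clip) => {
      try {
        // Happy path: hand the renderer the cached Supabase copy.
        return { url: await prefetchClip(clip.url) };
      } catch {
        // Silent fallback: keep the original fal.media URL and let
        // Remotion Lambda download it directly from us-east-1.
        return { url: clip.url };
      }
    }),
  );
}
```

The key property: a prefetch failure degrades to the old behavior instead of aborting the run. That is exactly what my throw statement deleted.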
The Human Cost
6 unique users were affected, all on the free plan, all trying video generation for the first time.
These were people who came from console.dev, signed up, and tried to generate a video — our core product experience. One user tried 3 times before giving up.
The silver lining: our pipeline deducts credits only on success (Step 13, after status: "completed"), and failed projects are excluded from the free allowance count. So no one lost credits, and they could retry. But the first impression was ruined.
The Fix: One Line
```diff
 if (cached === 0 && videos.length > 0) {
-  throw new Error(
-    `All ${videos.length} video clips failed to prefetch from CDN`
+  console.warn(
+    `[video-prefetch] All ${videos.length} clips failed to prefetch — falling back to original CDN URLs`
   );
 }
```
That's it. Keep the retry logic (it helps when it works), but don't block the pipeline when it doesn't. Let Remotion Lambda try the direct download path.
The Uncomfortable Lesson
I've seen this pattern before, and I'll probably see it again:
"Fail fast" is not always the right answer.
In a pipeline with multiple fallback paths, throwing early can be worse than doing nothing. My prefetch-to-Supabase step was an optimization, not a requirement. The system had a perfectly good fallback — Remotion downloading directly from fal.media — and my throw statement eliminated it.
The mental model I had:

```
prefetch fails → render will fail → fail fast = better UX
```

The reality:

```
prefetch fails → render might succeed via different network path → fail fast = 0% success
```
Rules I'm Taking Away
Measure the fallback before removing it. I never checked how often Remotion Lambda could download from fal.media directly. If I had, I'd have seen it was ~90%+ reliable.
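The sanity check I skipped takes minutes. Here's a toy calculation over the status counts from this post (not production code; the row shape just mimics the projects-table query result):

```typescript
// Toy success-rate calculation over rows shaped like the
// projects-table query result earlier in the post.
interface StatusRow {
  status: "completed" | "failed";
  cnt: number;
}

function successRate(rows: StatusRow[]): number {
  const total = rows.reduce((sum, r) => sum + r.cnt, 0);
  const completed = rows
    .filter((r) => r.status === "completed")
    .reduce((sum, r) => sum + r.cnt, 0);
  // Avoid division by zero on empty result sets.
  return total === 0 ? 0 : completed / total;
}

// Pre-fix videos that day: 22 completed, 7 failed -> ~0.76
// Post-fix: 0 completed, 10 failed -> 0
```

Running the same calculation per delivery path (prefetched URL vs. direct fal.media download) would have shown the fallback was the reliable half of the system.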
"Probably will fail" ≠ "will fail." My error message literally said "rendering would likely fail." Likely isn't certainly. Don't throw on likely.
Deploy ≠ done. I shipped at midnight and went to sleep. The next morning I was celebrating the traffic spike, not checking if video generation still worked.
Your biggest traffic day is your worst day for bugs. Murphy's law for SaaS: the feature that brought all those users? They tried the thing that was broken.
Timeline
| Time (JST) | Event |
|---|---|
| Mar 19, 17:00 | console.dev features RepoClip — traffic spikes |
| Mar 19, ~23:00 | I notice some CDN-related render failures |
| Mar 20, 00:20 | Deploy prefetch retry + throw on failure |
| Mar 20, 02:16 | First post-deploy video generation fails |
| Mar 20, 22:27 | Last failure before I investigate |
| Mar 20, 23:30 | I discover 0% video success rate |
| Mar 21, 00:15 | One-line fix deployed |
24 hours of broken video generation. During our highest-traffic period ever.
If you're building a media pipeline with external CDN dependencies, don't assume your serverless function's network path is the same as your renderer's. And if you're adding safety checks — make sure you're not removing a working fallback.
RepoClip turns GitHub repos into promotional videos. It works again now. Probably. 😅
Have you ever shipped a "fix" that made things worse? I'd love to hear your story in the comments.