Kazutaka Sugiyama

My Safety Check Killed 100% of Video Generations — Right When Traffic Spiked 3x

You know what's worse than a bug in production?

A bug you introduced while fixing another bug — deployed at midnight, right when your biggest traffic spike ever is happening.

The Setup: Our Best Day Ever

I run RepoClip, an AI-powered SaaS that generates promotional videos from GitHub repositories. On March 19, 2026, console.dev featured us in their newsletter.

Traffic exploded:

| Day | Active Users | New Users |
|---|---|---|
| March 19 (feature day) | 448 | 445 |
| March 20 (day after) | 154 | 139 |

448 users. For a solo indie SaaS, that felt massive.

But there was a problem I wouldn't discover until 24 hours later.

The "Fix" That Broke Everything

Earlier on March 19, I noticed that some video generations were failing because Remotion Lambda (our video renderer on AWS) couldn't download images from fal.ai's temporary CDN URLs fast enough. The URLs were expiring or timing out.

So at midnight (00:20 JST, March 20), I shipped what I thought was a solid improvement:

  1. Pre-fetch all media from fal.ai CDN → Supabase Storage before rendering
  2. Add retry logic — 3 attempts with exponential backoff
  3. Throw early if all clips failed to prefetch, instead of sending unreliable URLs to the renderer
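The retry half of that change (point 2) looked roughly like this — a sketch only, with hypothetical helper names (`prefetchWithRetry`, `fetchToStorage`), not the actual RepoClip code:

```typescript
// Sketch of prefetch-with-retry. The helper names are illustrative,
// not RepoClip's real implementation.

async function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Try to copy one CDN asset into durable storage, retrying with
// exponential backoff. Returns the cached URL, or null if every
// attempt failed.
async function prefetchWithRetry(
  url: string,
  fetchToStorage: (url: string) => Promise<string>,
  attempts = 3,
  baseDelayMs = 500
): Promise<string | null> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fetchToStorage(url);
    } catch {
      // back off: 500ms, 1s, 2s, ... before the next attempt
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  return null;
}
```

Returning `null` instead of throwing is the important design choice here — the caller gets to decide what a total prefetch failure means.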

That third point was the killer. Here's the diff:

```typescript
// video-prefetch.ts — the "improvement"

// Count how many clips were successfully swapped for a cached URL
const cached = results.filter((r, i) => r.url !== videos[i].url).length;

if (cached === 0 && videos.length > 0) {
  throw new Error(
    `All ${videos.length} video clips failed to prefetch from CDN — rendering would likely fail`
  );
}
```

My reasoning: "If we can't cache any clips locally, Remotion will probably fail anyway. Let's fail fast and give the user a clear error instead of wasting 5 minutes on a doomed render."

It sounded so reasonable.

The Data I Should Have Checked First

The next evening, I ran a query against our projects table:

```sql
SELECT status, content_mode, COUNT(*) AS cnt
FROM projects
WHERE created_at >= '2026-03-19T15:00:00+00:00'  -- after deploy
  AND created_at < '2026-03-20T15:00:00+00:00'
GROUP BY status, content_mode;
```
| Status | Mode | Count |
|---|---|---|
| failed | video_short | 10 |
| completed | image | 3 |

Zero. Not a single video succeeded after my deploy.

Every single failure had the same message:

All 3 video clips failed to prefetch from CDN — rendering would likely fail

Meanwhile, before my fix on the same day:

| Status | Mode | Count |
|---|---|---|
| completed | video_short | 22 |
| failed | video_short | 7 |
| completed | image | 6 |
| failed | image | 8 |

76% video success rate before. 0% after. My "safety check" didn't just fail to help — it made things infinitely worse.

Why My Assumption Was Wrong

Here's what I didn't understand about my own infrastructure:

```
The prefetch path (broken):
  Inngest (Vercel Edge) → fal.media CDN → frequently times out

The render path (working fine):
  Remotion Lambda (AWS us-east-1) → fal.media CDN → usually succeeds
```

The prefetch runs on Vercel's serverless infrastructure. Remotion Lambda runs on AWS in us-east-1. The network path from AWS to fal.ai's CDN was far more reliable than from Vercel.

Before my fix, when prefetch failed, the code silently fell back to the original fal.media URLs. Remotion Lambda would then download them directly — and it usually worked. The 7 pre-fix failures were mostly unrelated issues (prompt rejections, Lambda crashes), not CDN problems.

By adding the throw, I cut off the fallback path entirely. The pipeline would die at the prefetch step without ever giving Remotion a chance to try.
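In code terms, the pre-fix behavior looked roughly like this — a sketch under assumed names (`Clip`, `prefetchClip`, `resolveClipUrls`), not the real pipeline. Each clip that failed to prefetch silently kept its original fal.media URL:

```typescript
// Sketch of the fallback path the throw removed. Names are
// illustrative, not RepoClip's actual code.

type Clip = { url: string };

// For each clip, prefer the cached copy; if caching failed, fall back
// to the original CDN URL and let the renderer (on a different,
// more reliable network path) download it directly.
async function resolveClipUrls(
  clips: Clip[],
  prefetchClip: (url: string) => Promise<string | null>
): Promise<string[]> {
  const resolved: string[] = [];
  for (const clip of clips) {
    const cached = await prefetchClip(clip.url);
    resolved.push(cached ?? clip.url); // null → keep the original URL
  }
  return resolved;
}
```

The throw turned this per-clip fallback into an all-or-nothing gate: zero cached clips meant zero renders, even though the renderer could usually fetch the originals itself.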

The Human Cost

6 unique users were affected. All on the free plan, all trying video generation for the first time:

| User | Failed Attempts |
|---|---|
| rafael@... | 3 |
| gabomaldi@... | 2 |
| ale@... | 2 |
| enzo@... | 1 |
| moussan@... | 1 |
| tthorjen@... | 1 |

These were people who came from console.dev, signed up, and tried to generate a video — our core product experience. One user tried 3 times before giving up.

The silver lining: our pipeline deducts credits only on success (Step 13, after status: "completed"), and failed projects are excluded from the free allowance count. So no one lost credits, and they could retry. But the first impression was ruined.
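That billing rule can be sketched in a few lines — the `Project` shape and function name here are illustrative, not RepoClip's actual schema:

```typescript
// Sketch of the deduct-on-success rule described above. The Project
// type and helper name are hypothetical.

type Project = { id: string; status: "completed" | "failed" | "processing" };

// Only completed renders cost credits; failed projects are free,
// retryable, and excluded from the free-tier allowance count.
function creditsToDeduct(projects: Project[], costPerVideo = 1): number {
  return projects.filter((p) => p.status === "completed").length * costPerVideo;
}
```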

The Fix: One Line

```diff
  if (cached === 0 && videos.length > 0) {
-   throw new Error(
-     `All ${videos.length} video clips failed to prefetch from CDN — rendering would likely fail`
+   console.warn(
+     `[video-prefetch] All ${videos.length} clips failed to prefetch — falling back to original CDN URLs`
    );
  }
```

That's it. Keep the retry logic (it helps when it works), but don't block the pipeline when it doesn't. Let Remotion Lambda try the direct download path.

The Uncomfortable Lesson

I've seen this pattern before, and I'll probably see it again:

"Fail fast" is not always the right answer.

In a pipeline with multiple fallback paths, throwing early can be worse than doing nothing. My prefetch-to-Supabase step was an optimization, not a requirement. The system had a perfectly good fallback — Remotion downloading directly from fal.media — and my throw statement eliminated it.

The mental model I had:

```
prefetch fails → render will fail → fail fast = better UX
```

The reality:

```
prefetch fails → render might succeed via different network path → fail fast = 0% success
```

Rules I'm Taking Away

  1. Measure the fallback before removing it. I never checked how often Remotion Lambda could download from fal.media directly. If I had, I'd have seen it was ~90%+ reliable.

  2. "Probably will fail" ≠ "will fail." My error message literally said "rendering would likely fail." Likely isn't certainly. Don't throw on likely.

  3. Deploy ≠ done. I shipped at midnight and went to sleep. The next morning I was celebrating the traffic spike, not checking if video generation still worked.

  4. Your biggest traffic day is your worst day for bugs. Murphy's law for SaaS: the feature that brought all those users? They tried the thing that was broken.

Timeline

| Time (JST) | Event |
|---|---|
| Mar 19, 17:00 | console.dev features RepoClip — traffic spikes |
| Mar 19, ~23:00 | I notice some CDN-related render failures |
| Mar 20, 00:20 | Deploy prefetch retry + throw on failure |
| Mar 20, 02:16 | First post-deploy video generation fails |
| Mar 20, 22:27 | Last failure before I investigate |
| Mar 20, 23:30 | I discover 0% video success rate |
| Mar 21, 00:15 | One-line fix deployed |

24 hours of broken video generation. During our highest-traffic period ever.


If you're building a media pipeline with external CDN dependencies, don't assume your serverless function's network path is the same as your renderer's. And if you're adding safety checks — make sure you're not removing a working fallback.

RepoClip turns GitHub repos into promotional videos. It works again now. Probably. 😅


Have you ever shipped a "fix" that made things worse? I'd love to hear your story in the comments.
