Kazutaka Sugiyama
Our Videos Silently Failed for a Week — How a Stale Env Var Cost Us $60 and 12 Unhappy Users

You know that sinking feeling when you check your billing dashboard and something doesn't add up?

That's how this started. I noticed RepoClip — my AI video generation SaaS — was burning through $20/day on fal.ai credits for three consecutive days. My first thought was optimistic: "Great, more users are trying the product!"

I was wrong. The money was being spent. The videos were being generated. But not a single user was getting their finished video.

The Symptom: $60 Burned, Zero Videos Delivered

RepoClip generates promotional videos from GitHub repos. The pipeline looks like this:

```
GitHub URL → Gemini Analysis → Kling Video Clips (fal.ai) → Remotion Lambda Render → Done
```

Each "Video Short" generates 5 AI video clips using Kling 3.0 Pro on fal.ai, then stitches them together with narration using Remotion on AWS Lambda.

When I checked the fal.ai dashboard, I saw $20/day being consumed on March 7, 8, and 9. That's roughly 2–4 video generations per day. Seemed plausible for organic traffic.

But then I queried the database:

```sql
SELECT
  created_at::date AS date,
  status,
  COUNT(*) AS count
FROM projects
WHERE created_at >= '2026-03-01'
GROUP BY date, status
ORDER BY date;
```

The result was alarming:

| Date  | Status    | Count |
|-------|-----------|------:|
| Mar 1 | completed |     3 |
| Mar 1 | failed    |     5 |
| Mar 1 | rendering |     1 |
| Mar 2 | rendering |     2 |
| Mar 5 | rendering |     1 |
| Mar 6 | rendering |     2 |
| Mar 7 | rendering |     2 |
| Mar 8 | rendering |     4 |

After March 1, not a single project reached "completed". Every video was stuck in "rendering" — the step right after fal.ai finishes generating clips.

The fal.ai credits were being consumed successfully. The Kling clips were generated and saved. Then the pipeline just... stopped.

The Investigation: Following the Money

The video pipeline runs on Inngest, which breaks the work into discrete steps. Each step is memoized — if the function retries, completed steps aren't re-executed. The relevant steps:

  1. Fetch GitHub code
  2. Analyze with Gemini
  3. Generate video clips (fal.ai) ← money spent here
  4. Trigger Remotion Lambda render ← failure here
  5. Poll for render completion
  6. Update project to "completed"

Step 3 succeeded (fal.ai charged us), but Step 4 was clearly failing. The project status gets set to "rendering" just before Step 4 runs, which explains why everything was stuck there.
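
Inngest's memoization is also what kept the $60 from doubling on every retry. Here's a minimal sketch of the idea — not Inngest itself, just a toy in-memory step runner — showing why the clips were paid for only once even while the render step failed and retried:

```typescript
// Toy step runner: completed steps are cached by id, so a retry
// re-executes only the step that failed (Inngest persists this cache
// durably; a Map is enough to illustrate the semantics).
class StepRunner {
  private cache = new Map<string, unknown>();

  run<T>(id: string, fn: () => T): T {
    if (this.cache.has(id)) return this.cache.get(id) as T; // memoized: skip
    const result = fn();
    this.cache.set(id, result); // only successful steps are cached
    return result;
  }
}

const runner = new StepRunner();
let clipGenerations = 0;

function attempt(renderWorks: boolean): string {
  runner.run("generate-clips", () => {
    clipGenerations++; // the fal.ai charge happens here
    return ["clip1.mp4", "clip2.mp4"];
  });
  return runner.run("trigger-render", () => {
    if (!renderWorks) throw new Error("ResourceNotFoundException");
    return "render-id-123";
  });
}

try {
  attempt(false); // first attempt: clips succeed, render throws
} catch {
  /* swallowed, as in a background retry */
}
const renderId = attempt(true); // retry: clips are NOT regenerated
```

After both attempts, `clipGenerations` is still 1 — which is exactly why the stuck projects could later be recovered without new fal.ai charges.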

I checked the Remotion Lambda configuration:

```shell
# What our env var said:
REMOTION_LAMBDA_FUNCTION_NAME=remotion-render-4-0-414-mem2048mb-disk2048mb-600sec

# What actually exists in AWS:
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `remotion`)]'
```

```
remotion-render-4-0-429-mem2048mb-disk2048mb-600sec
remotion-render-4-0-429-mem3008mb-disk4096mb-900sec
```

There it was. The Lambda function 4-0-414 didn't exist anymore. We had upgraded Remotion from v4.0.414 to v4.0.429, deployed new Lambda functions, deleted the old ones — and never updated the environment variable on Vercel.

Every renderMediaOnLambda() call was throwing ResourceNotFoundException, silently, inside an Inngest background job that no user ever sees.
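
One way to make this class of failure loud rather than silent is to persist a terminal status when the render trigger throws. The sketch below is hypothetical — an in-memory `projects` map stands in for the database, and `triggerRender` stands in for `renderMediaOnLambda` — but it shows the shape of the fix:

```typescript
// Hypothetical sketch: if triggering the render fails, record "failed"
// instead of leaving the project in "rendering" forever.
type Project = { id: string; status: "rendering" | "completed" | "failed"; error?: string };

const projects = new Map<string, Project>();

// Stub for renderMediaOnLambda: only the new function name exists.
function triggerRender(functionName: string): string {
  if (functionName !== "remotion-render-4-0-429-mem2048mb-disk2048mb-600sec") {
    throw new Error("ResourceNotFoundException: Function not found");
  }
  return "render-abc";
}

function safeTriggerRender(projectId: string, functionName: string): string | null {
  projects.set(projectId, { id: projectId, status: "rendering" });
  try {
    return triggerRender(functionName);
  } catch (err) {
    // Surface the failure: persist it so dashboards and users can see it,
    // rather than letting background retries swallow the exception.
    projects.set(projectId, {
      id: projectId,
      status: "failed",
      error: err instanceof Error ? err.message : String(err),
    });
    return null;
  }
}

// Calling with the stale env var value fails visibly, not silently.
const result = safeTriggerRender("p1", "remotion-render-4-0-414-mem2048mb-disk2048mb-600sec");
```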

Why No One Noticed

This is the insidious part. The failure was completely invisible:

  • No client-side errors — the API returned a project ID successfully. Users saw "Rendering..." and waited.
  • No GA4 events — we tracked video_generate_start but had no video_generate_complete event. The absence of completions was invisible in analytics.
  • No monitoring — we had Sentry for errors, but the Inngest function's retry logic was eating the exceptions before they surfaced meaningfully.
  • fal.ai charges looked normal — if anything, increasing spend looked like a good sign.

The only signal was the billing anomaly — and even that was ambiguous.

The Recovery: Don't Re-generate, Re-render

Here's the silver lining: all 12 stuck projects had their assets fully saved in the database. The Kling video clips, the narration audio, the video config — everything was persisted in the assets JSONB column before the render step.

Re-triggering the full pipeline would have re-generated all the video clips on fal.ai, costing another $60+. Instead, I wrote a targeted retry script that:

  1. Queries all projects with status = 'rendering'
  2. Reads their saved video_config and assets
  3. Calls renderMediaOnLambda() with the correct (new) function name
  4. Polls for completion
  5. Updates the project to completed
  6. Sends the completion email

```typescript
// The key insight: assets are already saved, just re-render
const { renderId, bucketName } = await renderMediaOnLambda({
  region: REGION,
  functionName: FUNCTION_NAME, // now pointing to the correct function
  serveUrl: SERVE_URL,
  composition: "ProductVideo",
  inputProps, // built from saved assets
  codec: "h264",
  // ...
});
```

Result: 12 out of 12 videos recovered (9 on first attempt, 3 on retry after transient network timeouts). Every user got a completion email with their finished video. Zero additional fal.ai charges.
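
Steps 4 and 5 of the script are a render-then-poll loop. Here's a self-contained sketch of the polling half — the `getProgress` callback is a stub standing in for Remotion's `getRenderProgress`, and the interval is shortened for illustration (the real script would poll every few seconds):

```typescript
// Generic poll loop: keep checking progress until done, a fatal error,
// or a timeout. The callback abstracts over getRenderProgress().
type Progress = { done: boolean; outputFile?: string; fatalError?: string };

async function pollRender(
  getProgress: () => Promise<Progress>,
  intervalMs = 10, // illustrative; a real poll interval would be seconds
  maxAttempts = 50
): Promise<string> {
  for (let i = 0; i < maxAttempts; i++) {
    const p = await getProgress();
    if (p.fatalError) throw new Error(p.fatalError);
    if (p.done && p.outputFile) return p.outputFile;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error("Render timed out");
}

// Stub progress source: reports done on the third poll.
let calls = 0;
const stub = async (): Promise<Progress> =>
  ++calls < 3 ? { done: false } : { done: true, outputFile: "s3://bucket/out.mp4" };
```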

The Prevention: Three Layers of Defense

Layer 1: Deploy Script Auto-Updates Env Vars

The root cause was a manual step that was easy to forget. The old deploy script ended with:

```shell
echo "Set the following environment variables:"
echo "  REMOTION_LAMBDA_FUNCTION_NAME=<function name from above>"
```

Now it automatically extracts and updates:

```shell
# Extract function name from deploy output
FUNC_NAME=$(echo "$FUNC_OUTPUT" | grep -oE 'remotion-render-[a-zA-Z0-9-]+' | head -1)

# Verify the function actually exists before touching any env vars
aws lambda get-function --function-name "$FUNC_NAME" --region "$REGION"

# Auto-update Vercel + local env
npx vercel env rm REMOTION_LAMBDA_FUNCTION_NAME production -y
echo -n "$FUNC_NAME" | npx vercel env add REMOTION_LAMBDA_FUNCTION_NAME production
sed -i '' "s|^REMOTION_LAMBDA_FUNCTION_NAME=.*|REMOTION_LAMBDA_FUNCTION_NAME=$FUNC_NAME|" .env.local
```

No more "please update manually" — the script handles it end-to-end.

Layer 2: Inngest Cron Alert for Stuck Renders

An hourly cron job checks for projects stuck in "rendering" for more than 30 minutes:

```typescript
export const monitorStuckRendersFunction = inngest.createFunction(
  { id: "monitor-stuck-renders" },
  { cron: "0 * * * *" },
  async ({ step }) => {
    const stuckProjects = await step.run("check-stuck-projects", async () => {
      const threshold = new Date(Date.now() - 30 * 60 * 1000).toISOString();
      const { data } = await supabase
        .from("projects")
        .select("id, repo_name, content_mode, updated_at")
        .eq("status", "rendering")
        .lt("updated_at", threshold);
      return data ?? [];
    });

    if (stuckProjects.length > 0) {
      // Send alert email with project details
    }
  }
);
```

If this had existed a week ago, we'd have known within an hour instead of seven days.
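
The heart of that cron is the staleness filter. Extracted as a pure function over in-memory rows (the real version runs the equivalent query in Supabase), the logic is easy to reason about and test:

```typescript
// Pure version of the 30-minute staleness check: a project is "stuck"
// if it is still rendering and hasn't been updated since the cutoff.
type Row = { id: string; status: string; updated_at: string };

function findStuck(rows: Row[], now: Date, thresholdMinutes = 30): Row[] {
  const cutoff = new Date(now.getTime() - thresholdMinutes * 60 * 1000);
  return rows.filter(
    (r) => r.status === "rendering" && new Date(r.updated_at) < cutoff
  );
}

const now = new Date("2026-03-09T12:00:00Z");
const stuck = findStuck(
  [
    { id: "a", status: "rendering", updated_at: "2026-03-09T11:00:00Z" }, // 60 min old: stuck
    { id: "b", status: "rendering", updated_at: "2026-03-09T11:45:00Z" }, // 15 min old: fine
    { id: "c", status: "completed", updated_at: "2026-03-08T09:00:00Z" }, // done: ignore
  ],
  now
);
```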

Layer 3: GA4 Start/Complete Event Comparison

We added video_generate_complete and video_generate_fail events that fire when the user's browser sees the status change via Supabase Realtime:

```typescript
// ProjectStatusListener.tsx
const channel = supabase
  .channel(`project-${projectId}`)
  .on("postgres_changes", { /* ... */ }, (payload) => {
    if (payload.new?.status === "completed") {
      gaEvent("video_generate_complete", { project_id: projectId });
    } else if (payload.new?.status === "failed") {
      gaEvent("video_generate_fail", { project_id: projectId });
    }
  })
  .subscribe();
```

Now our BigQuery funnel query shows the start-to-complete ratio:

```sql
SELECT
  event_date,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_start'
    THEN user_pseudo_id END) AS start_users,
  COUNT(DISTINCT CASE WHEN event_name = 'video_generate_complete'
    THEN user_pseudo_id END) AS complete_users
FROM events_*
GROUP BY event_date
```

A sudden drop in the complete/start ratio is now a visible signal.
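
That alert condition can also live in code. A minimal sketch — the 0.5 threshold here is an arbitrary illustration, not a tuned value:

```typescript
// Flag any day whose complete/start ratio falls below a minimum.
// During the outage described above, the ratio went to exactly 0.
type Day = { date: string; starts: number; completes: number };

function flagLowCompletionDays(days: Day[], minRatio = 0.5): string[] {
  return days
    .filter((d) => d.starts > 0 && d.completes / d.starts < minRatio)
    .map((d) => d.date);
}

const flagged = flagLowCompletionDays([
  { date: "2026-03-01", starts: 8, completes: 6 }, // 0.75: healthy
  { date: "2026-03-07", starts: 4, completes: 0 }, // 0.0: the outage
]);
```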

Bonus: The Cost Optimization That Came From This

While investigating, I realized every free-tier user was getting the same Kling 3.0 Pro clips as paying customers. At ~$5.60 per video, with a ~3% conversion rate, the customer acquisition cost was unsustainable.

The fix:

  • Free plan: Kling 3.0 Standard (3 clips, ~15s) — $2.52/video
  • Paid plans: Kling 3.0 Pro (5 clips, ~25s) — $5.60/video

This turns "Kling 3.0 Pro quality" into a tangible upgrade incentive while cutting free-tier costs by 55%.
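
The tier split is just a lookup table plus arithmetic. A sketch using the per-clip costs implied by the numbers above ($5.60 / 5 Pro clips, $2.52 / 3 Standard clips — these exact figures are my back-of-envelope reconstruction, not RepoClip's actual pricing config):

```typescript
// Plan-to-model mapping with illustrative per-clip costs.
type Plan = "free" | "paid";

const TIER = {
  free: { model: "kling-3.0-standard", clips: 3, costPerClip: 0.84 },
  paid: { model: "kling-3.0-pro", clips: 5, costPerClip: 1.12 },
} as const;

function videoCost(plan: Plan): number {
  const t = TIER[plan];
  return +(t.clips * t.costPerClip).toFixed(2);
}

// Free-tier cost relative to paid: 2.52 / 5.60, i.e. a ~55% reduction.
const savings = 1 - videoCost("free") / videoCost("paid");
```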

Lessons Learned

1. Env vars are a silent single point of failure. Automate their lifecycle. If a deploy script creates a resource, it should update the env var that references it — in the same script, in the same run.

2. Background job failures are invisible by default. If your pipeline runs asynchronously (Inngest, Bull, SQS), you need explicit monitoring for "things that should have finished but didn't." A simple cron checking for stale statuses catches an entire class of bugs.

3. Track completion, not just initiation. We had video_generate_start in GA4 from day one. We never added video_generate_complete. The absence of data is the hardest signal to notice.

4. Persist intermediate results. The only reason we recovered without re-incurring $60+ in fal.ai charges is that every pipeline step saves its output to the database before moving on. This turned a potential re-generation into a simple re-render.

5. Billing anomalies are monitoring signals. The first hint of trouble wasn't an error log or a user complaint — it was an unexpected spend pattern. If you're running a SaaS with external API costs, set up billing alerts.


RepoClip generates AI-powered promotional videos from GitHub repositories. If you want to try it out, paste any public repo URL and get a video in minutes — free, no credit card required.
