DEV Community

nareshipme
Two billing bugs that looked fine until production proved otherwise

Billing code is the most dangerous place to have subtle bugs. It rarely crashes — it just silently does the wrong thing. Here are two we found and fixed in ClipCrafter, an AI video clip extraction tool.


Bug 1: The usage counter that lost data under load

We track how many seconds of video each user processes per day to enforce plan limits. The original increment looked like this:

const { data } = await db
  .from("users")
  .select("daily_usage_seconds")
  .eq("clerk_id", clerkUserId)
  .single();

const next = (data.daily_usage_seconds ?? 0) + seconds;

await db
  .from("users")
  .update({ daily_usage_seconds: next })
  .eq("clerk_id", clerkUserId);

This is a textbook read-modify-write race. Here's the scenario that breaks it:

  1. Request A reads daily_usage_seconds = 120
  2. Request B reads daily_usage_seconds = 120 (before A has written)
  3. Request A writes 120 + 300 = 420
  4. Request B writes 120 + 180 = 300, overwriting A's write

The user processed 480 seconds of video but the counter shows 300. They get more usage than they're supposed to. At scale — multiple clips rendering simultaneously — this happens constantly.

The code looks correct. It reads, adds, writes. The problem is invisible in single-user testing.
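The lost update is easy to reproduce deterministically in a few lines, with an in-memory variable standing in for the users row (no database needed; the names are illustrative):

```typescript
// In-memory reproduction of the race; `counter` stands in for
// the users.daily_usage_seconds column.
let counter = 120;

const readA = counter; // 1. Request A reads 120
const readB = counter; // 2. Request B reads 120 before A writes
counter = readA + 300; // 3. Request A writes 420
counter = readB + 180; // 4. Request B writes 300, clobbering A's write

console.log(counter);  // 300, though the correct total is 600
```

The explicit interleaving makes the bug visible in a way that single-user testing never will: both "requests" take their snapshot before either one writes.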

The fix: push the increment into Postgres

CREATE OR REPLACE FUNCTION increment_daily_usage(
  p_clerk_id TEXT,
  p_seconds  INTEGER
)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;

A single UPDATE ... SET col = col + n is atomic. Postgres takes a row-level lock on the row and holds it until the transaction commits, so two concurrent updates cannot interleave: the second simply waits, then applies its increment on top of the first. The application becomes:

await supabaseAdmin.rpc("increment_daily_usage", {
  p_clerk_id: clerkUserId,
  p_seconds: seconds,
});

One network round-trip instead of two, and no race condition.
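As a contrast to the broken interleaving shown earlier, the in-memory analogue of UPDATE ... SET col = col + n applies each delta in a single step, so there is no stale snapshot to write back (illustrative stand-in only; the real atomicity comes from the row lock in Postgres):

```typescript
let counter = 120;

// Each request applies only its delta; read, add, and write happen
// as one operation, like UPDATE ... SET col = col + n.
function atomicIncrement(seconds: number): void {
  counter += seconds;
}

atomicIncrement(300); // request A
atomicIncrement(180); // request B
console.log(counter); // 600: both increments survive
```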

Why SECURITY DEFINER? It lets the function run with the privileges of its creator (the DB owner), bypassing row-level security for this specific operation. Since the app calls it via a service-role client that already bypasses RLS, this is consistent — but it's worth knowing the tradeoff.


Bug 2: The webhook handler that processed events twice

Payment providers retry webhook deliveries when they don't receive a 200 response quickly enough — network blip, slow cold start, anything. Our handler wasn't idempotent:

export async function POST(request: Request) {
  // ... verify signature ...
  const { event, clerkUserId, subscriptionId } = parseEvent(body);

  if (event === "subscription.activated") {
    // This runs twice if the webhook is retried
    await activateSubscription(clerkUserId, subscriptionId);
  }
}

If the provider retried after a slow response, the user's subscription could be activated twice — not catastrophic here, but the pattern generalises to things like crediting accounts or sending emails.

The fix: a deduplication table

CREATE TABLE IF NOT EXISTS webhook_events (
  id           TEXT        PRIMARY KEY,
  event        TEXT        NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Every webhook payload has a unique event ID. We try to insert it; a primary key conflict means we've already processed it:

async function isDuplicate(payload: Record<string, unknown>): Promise<boolean> {
  const eventId = payload.id as string | undefined;
  if (!eventId) return false;

  const { error } = await supabaseAdmin
    .from("webhook_events")
    .insert({ id: eventId, event: (payload.event as string) ?? "unknown" });

  // 23505 = unique_violation in Postgres
  return error?.code === "23505";
}

export async function POST(request: Request) {
  // ...
  if (await isDuplicate(rawPayload)) {
    return Response.json({ received: true }); // 200 so provider stops retrying
  }
  // ... process event ...
}

Key details:

  • Return 200 on duplicate — returning an error would cause the provider to retry again
  • The insert-and-check-conflict is itself atomic — no TOCTOU gap
  • The table doubles as an audit log of every event ever received
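The check-and-record shape can be seen in miniature with an in-memory Set standing in for the table (names are hypothetical). Note that, unlike the Postgres insert, a Set is only safe within a single process, which is exactly why the real fix lives in the database:

```typescript
// `processedEvents` stands in for webhook_events; has/add mimic
// the insert-or-conflict check.
const processedEvents = new Set<string>();
let activations = 0;

function handleDelivery(eventId: string): boolean {
  if (processedEvents.has(eventId)) return false; // the "23505" path
  processedEvents.add(eventId);                   // record the event id
  activations += 1;                               // side effect runs once
  return true;
}

handleDelivery("evt_abc"); // first delivery: processed
handleDelivery("evt_abc"); // provider retry: skipped
console.log(activations);  // 1
```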

Shipping both as one migration

Both fixes are database objects, so they went into a single migration file:

-- 016_billing_fixes.sql

CREATE OR REPLACE FUNCTION increment_daily_usage(p_clerk_id TEXT, p_seconds INTEGER)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;

CREATE TABLE IF NOT EXISTS webhook_events (
  id           TEXT        PRIMARY KEY,
  event        TEXT        NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

They land together or not at all. Code that depends on both objects was deployed in the same release.


The pattern to remember

Both bugs share the same root cause: state mutation across multiple round-trips is never safe under concurrency. The fix in both cases was the same: push the mutation into the database where it can be made atomic.

  • Use UPDATE ... SET col = col + n instead of read-then-write
  • Use an insert-with-conflict as an atomic check-and-record for idempotency

Postgres is very good at this. Let it do the work.


We're building ClipCrafter — AI-powered short clip extraction from long videos. If you're working on similar problems, I'd love to hear how you handle billing correctness at scale.
