DEV Community

nareshipme
Two billing bugs that looked fine until production proved otherwise

Billing code is the most dangerous place to have subtle bugs. It rarely crashes — it just silently does the wrong thing. Here are two we found and fixed in ClipCrafter, an AI video clip extraction tool.


Bug 1: The usage counter that lost data under load

We track how many seconds of video each user processes per day to enforce plan limits. The original increment looked like this:

const { data } = await db
  .from("users")
  .select("daily_usage_seconds")
  .eq("clerk_id", clerkUserId)
  .single();

const next = (data.daily_usage_seconds ?? 0) + seconds;

await db
  .from("users")
  .update({ daily_usage_seconds: next })
  .eq("clerk_id", clerkUserId);

This is a textbook read-modify-write race. Here's the scenario that breaks it:

  1. Request A reads daily_usage_seconds = 120
  2. Request B reads daily_usage_seconds = 120 (before A has written)
  3. Request A writes 120 + 300 = 420
  4. Request B writes 120 + 180 = 300, overwriting A's write

The user processed 480 seconds of video but the counter shows 300. They get more usage than they're supposed to. At scale — multiple clips rendering simultaneously — this happens constantly.

The code looks correct. It reads, adds, writes. The problem is invisible in single-user testing.
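The lost update is easy to reproduce deterministically in a few lines, with an in-memory variable standing in for the users row (no database needed; the names are illustrative):

```typescript
// In-memory reproduction of the race; `counter` stands in for
// the users.daily_usage_seconds column.
let counter = 120;

const readA = counter; // 1. Request A reads 120
const readB = counter; // 2. Request B reads 120 before A writes
counter = readA + 300; // 3. Request A writes 420
counter = readB + 180; // 4. Request B writes 300, clobbering A's write

console.log(counter);  // 300, though the correct total is 600
```

The explicit interleaving makes the bug visible in a way that single-user testing never will: both "requests" take their snapshot before either one writes.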

The fix: push the increment into Postgres

CREATE OR REPLACE FUNCTION increment_daily_usage(
  p_clerk_id TEXT,
  p_seconds  INTEGER
)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;

A single UPDATE ... SET col = col + n is atomic. Postgres takes a row-level lock on the row and holds it until the transaction commits, so two concurrent updates cannot interleave: the second simply waits, then applies its increment on top of the first. The application becomes:

await supabaseAdmin.rpc("increment_daily_usage", {
  p_clerk_id: clerkUserId,
  p_seconds: seconds,
});

One network round-trip instead of two, and no race condition.
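As a contrast to the broken interleaving shown earlier, the in-memory analogue of UPDATE ... SET col = col + n applies each delta in a single step, so there is no stale snapshot to write back (illustrative stand-in only; the real atomicity comes from the row lock in Postgres):

```typescript
let counter = 120;

// Each request applies only its delta; read, add, and write happen
// as one operation, like UPDATE ... SET col = col + n.
function atomicIncrement(seconds: number): void {
  counter += seconds;
}

atomicIncrement(300); // request A
atomicIncrement(180); // request B
console.log(counter); // 600: both increments survive
```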

Why SECURITY DEFINER? It lets the function run with the privileges of its creator (the DB owner), bypassing row-level security for this specific operation. Since the app calls it via a service-role client that already bypasses RLS, this is consistent — but it's worth knowing the tradeoff.


Bug 2: The webhook handler that processed events twice

Payment providers retry webhook deliveries when they don't receive a 200 response quickly enough — network blip, slow cold start, anything. Our handler wasn't idempotent:

export async function POST(request: Request) {
  // ... verify signature ...
  const { event, clerkUserId, subscriptionId } = parseEvent(body);

  if (event === "subscription.activated") {
    // This runs twice if the webhook is retried
    await activateSubscription(clerkUserId, subscriptionId);
  }
}

If the provider retried after a slow response, the user's subscription could be activated twice — not catastrophic here, but the pattern generalises to things like crediting accounts or sending emails.

The fix: a deduplication table

CREATE TABLE IF NOT EXISTS webhook_events (
  id           TEXT        PRIMARY KEY,
  event        TEXT        NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Every webhook payload has a unique event ID. We try to insert it; a primary key conflict means we've already processed it:

async function isDuplicate(payload: Record<string, unknown>): Promise<boolean> {
  const eventId = payload.id as string | undefined;
  if (!eventId) return false;

  const { error } = await supabaseAdmin
    .from("webhook_events")
    .insert({ id: eventId, event: (payload.event as string) ?? "unknown" });

  // 23505 = unique_violation in Postgres
  return error?.code === "23505";
}

export async function POST(request: Request) {
  // ...
  if (await isDuplicate(rawPayload)) {
    return Response.json({ received: true }); // 200 so provider stops retrying
  }
  // ... process event ...
}

Key details:

  • Return 200 on duplicate — returning an error would cause the provider to retry again
  • The insert-and-check-conflict is itself atomic — no TOCTOU gap
  • The table doubles as an audit log of every event ever received
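The check-and-record shape can be seen in miniature with an in-memory Set standing in for the table (names are hypothetical). Note that, unlike the Postgres insert, a Set is only safe within a single process, which is exactly why the real fix lives in the database:

```typescript
// `processedEvents` stands in for webhook_events; has/add mimic
// the insert-or-conflict check.
const processedEvents = new Set<string>();
let activations = 0;

function handleDelivery(eventId: string): boolean {
  if (processedEvents.has(eventId)) return false; // the "23505" path
  processedEvents.add(eventId);                   // record the event id
  activations += 1;                               // side effect runs once
  return true;
}

handleDelivery("evt_abc"); // first delivery: processed
handleDelivery("evt_abc"); // provider retry: skipped
console.log(activations);  // 1
```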

Shipping both as one migration

Both fixes are database objects, so they went into a single migration file:

-- 016_billing_fixes.sql

CREATE OR REPLACE FUNCTION increment_daily_usage(p_clerk_id TEXT, p_seconds INTEGER)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;

CREATE TABLE IF NOT EXISTS webhook_events (
  id           TEXT        PRIMARY KEY,
  event        TEXT        NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

They land together or not at all. Code that depends on both objects was deployed in the same release.


The pattern to remember

Both bugs share the same root cause: state mutation across multiple round-trips is never safe under concurrency. The fix in both cases was the same: push the mutation into the database where it can be made atomic.

  • Use UPDATE ... SET col = col + n instead of read-then-write
  • Use an insert-with-conflict as an atomic check-and-record for idempotency

Postgres is very good at this. Let it do the work.


We're building ClipCrafter — AI-powered short clip extraction from long videos. If you're working on similar problems, I'd love to hear how you handle billing correctness at scale.
