Billing code is the most dangerous place to have subtle bugs. It rarely crashes — it just silently does the wrong thing. Here are two we found and fixed in ClipCrafter, an AI video clip extraction tool.
Bug 1: The usage counter that lost data under load
We track how many seconds of video each user processes per day to enforce plan limits. The original increment looked like this:
```typescript
const { data } = await db
  .from("users")
  .select("daily_usage_seconds")
  .eq("clerk_id", clerkUserId)
  .single();

const next = (data.daily_usage_seconds ?? 0) + seconds;

await db
  .from("users")
  .update({ daily_usage_seconds: next })
  .eq("clerk_id", clerkUserId);
```
This is a textbook read-modify-write race. Here's the scenario that breaks it:
- Request A reads `daily_usage_seconds = 120`
- Request B reads `daily_usage_seconds = 120` (before A has written)
- Request A writes `120 + 300 = 420`
- Request B writes `120 + 180 = 300`, overwriting A's write
The user processed 480 seconds of video but the counter shows 300. They get more usage than they're supposed to. At scale — multiple clips rendering simultaneously — this happens constantly.
The code looks correct. It reads, adds, writes. The problem is invisible in single-user testing.
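To make the losing schedule concrete, here's a minimal sketch that forces the exact interleaving above on a plain in-memory counter — no database involved, just the read-modify-write shape:

```typescript
// In-memory stand-in for the users row.
let dailyUsageSeconds = 0;

const read = async (): Promise<number> => dailyUsageSeconds;
const write = async (value: number): Promise<void> => {
  dailyUsageSeconds = value;
};

// Forces the losing schedule: both requests read before either writes.
async function lostUpdateDemo(): Promise<number> {
  dailyUsageSeconds = 120;      // starting usage, as in the scenario above
  const seenByA = await read(); // A reads 120
  const seenByB = await read(); // B reads 120; A hasn't written yet
  await write(seenByA + 300);   // A writes 420
  await write(seenByB + 180);   // B writes 300, clobbering A's 420
  return dailyUsageSeconds;
}

lostUpdateDemo().then((final) => {
  console.log(final); // 300, although 480 seconds were actually processed
});
```

In production the interleaving depends on timing, which is exactly why the bug never shows up in single-user testing.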
The fix: push the increment into Postgres
```sql
CREATE OR REPLACE FUNCTION increment_daily_usage(
  p_clerk_id TEXT,
  p_seconds INTEGER
)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;
```
A single `UPDATE ... SET col = col + n` is atomic. Postgres takes a row-level lock for the duration of the update, so no two concurrent updates can interleave. The application code becomes:
```typescript
await supabaseAdmin.rpc("increment_daily_usage", {
  p_clerk_id: clerkUserId,
  p_seconds: seconds,
});
```
One network round-trip instead of two, and no race condition.
Why SECURITY DEFINER? It lets the function run with the privileges of its creator (the DB owner), bypassing row-level security for this specific operation. Since the app calls it via a service-role client that already bypasses RLS, this is consistent — but it's worth knowing the tradeoff.
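If you use `SECURITY DEFINER`, it's also worth restricting who can call the function: Postgres grants `EXECUTE` to `PUBLIC` by default. A sketch of the lockdown — the `service_role` name here is an assumption (it's the Supabase convention; adjust to your own roles):

```sql
-- SECURITY DEFINER functions are executable by PUBLIC by default;
-- revoke that and grant only the backend's role.
REVOKE EXECUTE ON FUNCTION increment_daily_usage(TEXT, INTEGER) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION increment_daily_usage(TEXT, INTEGER) TO service_role;
```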
Bug 2: The webhook handler that processed events twice
Payment providers retry webhook deliveries when they don't receive a 200 response quickly enough — network blip, slow cold start, anything. Our handler wasn't idempotent:
```typescript
export async function POST(request: Request) {
  // ... verify signature ...
  const { event, clerkUserId, subscriptionId } = parseEvent(body);

  if (event === "subscription.activated") {
    // This runs twice if the webhook is retried
    await activateSubscription(clerkUserId, subscriptionId);
  }

  return Response.json({ received: true });
}
```
If the provider retried after a slow response, the user's subscription could be activated twice — not catastrophic here, but the pattern generalises to things like crediting accounts or sending emails.
The fix: a deduplication table
```sql
CREATE TABLE IF NOT EXISTS webhook_events (
  id TEXT PRIMARY KEY,
  event TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
Every webhook payload has a unique event ID. We try to insert it; a primary key conflict means we've already processed it:
```typescript
async function isDuplicate(payload: Record<string, unknown>): Promise<boolean> {
  const eventId = payload.id as string | undefined;
  if (!eventId) return false;

  const { error } = await supabaseAdmin
    .from("webhook_events")
    .insert({ id: eventId, event: (payload.event as string) ?? "unknown" });

  // 23505 = unique_violation in Postgres
  return error?.code === "23505";
}

export async function POST(request: Request) {
  // ...
  if (await isDuplicate(rawPayload)) {
    return Response.json({ received: true }); // 200 so the provider stops retrying
  }
  // ... process event ...
}
```
Key details:
- Return `200` on duplicate; returning an error would cause the provider to retry again
- The insert-and-check-conflict is itself atomic, so there's no TOCTOU gap
- The table doubles as an audit log of every event ever received
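If you'd rather not match on the `23505` error code, Postgres can do the dedup check itself with `ON CONFLICT`. A hypothetical variant (the function name is made up for illustration):

```sql
-- Returns true exactly when this is the first time the event ID has
-- been seen; on a duplicate, no row is inserted and the result is NULL.
CREATE OR REPLACE FUNCTION record_webhook_event(p_id TEXT, p_event TEXT)
RETURNS boolean LANGUAGE sql AS $$
  INSERT INTO webhook_events (id, event)
  VALUES (p_id, p_event)
  ON CONFLICT (id) DO NOTHING
  RETURNING true;
$$;
```

The caller would then treat a `null` result as "already processed". Same atomicity guarantee, but the duplicate check travels through the data path instead of the error path.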
Shipping both as one migration
Both fixes are database objects, so they went into a single migration file:
```sql
-- 016_billing_fixes.sql
CREATE OR REPLACE FUNCTION increment_daily_usage(p_clerk_id TEXT, p_seconds INTEGER)
RETURNS void LANGUAGE plpgsql SECURITY DEFINER AS $$
BEGIN
  UPDATE users
  SET daily_usage_seconds = COALESCE(daily_usage_seconds, 0) + p_seconds
  WHERE clerk_id = p_clerk_id;
END;
$$;

CREATE TABLE IF NOT EXISTS webhook_events (
  id TEXT PRIMARY KEY,
  event TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
They land together or not at all. Code that depends on both objects was deployed in the same release.
The pattern to remember
Both bugs share the same root cause: state mutation across multiple round-trips is never safe under concurrency. The fix in both cases was the same: push the mutation into the database where it can be made atomic.
- Use `UPDATE ... SET col = col + n` instead of read-then-write
- Use an insert that surfaces a unique-constraint conflict as an atomic check-and-record for idempotency
Postgres is very good at this. Let it do the work.
We're building ClipCrafter — AI-powered short clip extraction from long videos. If you're working on similar problems, I'd love to hear how you handle billing correctness at scale.