Exactly-once billing for LLM apps with one DynamoDB transaction

#h0 #v0 #dynamodb #awschallenge

If your product sells credits that users spend on AI calls, there's a decent chance you've already shipped this bug without noticing. It never shows up in a demo. It shows up in production, on your provider invoice.

Here's the bug, and the database pattern that makes it impossible instead of just unlikely.

The bug everyone ships first

Every AI product sells usage somehow: credit packs, token bundles, plan quotas. And most of them enforce it the same wrong way.

const balance = await getBalance(userId)   // read
if (balance < cost) return reject()        // compare
await charge(userId, cost)                 // write

Read, compare, write. It looks fine in review and works in every demo. Then two requests land in the same millisecond, both read the same balance, both pass the check, and both charge. A user with 5 credits fires 20 parallel requests and gets 20 answers. You eat the provider bill for the other 15.
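The race is easy to reproduce without a database. Here is a minimal sketch, assuming an in-memory balance as a stand-in for the wallet store and a resolved promise as a stand-in for each network round trip. The helper names are illustrative and not part of any real SDK.

```typescript
// In-memory stand-in for the wallet store (assumption: no real database here).
let balance = 5;

// Each await yields to the event loop, the way a network round trip would.
const roundTrip = (): Promise<void> => Promise.resolve();

async function getBalance(): Promise<number> {
  await roundTrip();
  return balance;
}

async function charge(cost: number): Promise<void> {
  await roundTrip();
  balance -= cost;
}

// The read-compare-write pattern from the snippet above, unchanged.
async function naiveSpend(cost: number): Promise<boolean> {
  const current = await getBalance(); // read
  if (current < cost) return false;   // compare
  await charge(cost);                 // write
  return true;
}

// Fire 20 parallel 1-credit requests against a 5-credit wallet.
async function runNaiveRace(): Promise<{ served: number; finalBalance: number }> {
  balance = 5;
  const results = await Promise.all(
    Array.from({ length: 20 }, () => naiveSpend(1)),
  );
  return { served: results.filter(Boolean).length, finalBalance: balance };
}
```

In this model runNaiveRace() resolves with 20 served requests and a final balance of -15, because every request reads the balance before any charge lands.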

This is the default failure mode of "budget" features in LLM gateways, because the check and the write are two separate operations with a gap between them. You can paper over it with locks, but a lock on a hot per-user row brings its own latency and contention problems.

The fix isn't a better check. It's removing the gap.

Billing enforcement is a transaction, not middleware

Card networks solved this a long time ago with authorize, capture, release. Before the goods ship, you place a hold for the estimated maximum. When the real cost is known, you capture it and release the remainder. The authorization is atomic, so there's no window where two charges both think the money is there.

The same three steps map cleanly onto an LLM request:

Estimate a max cost (input tokens plus max_tokens, priced at the model's rate) and place a hold.
Stream the response.
Capture the actual cost from the provider's reported usage, and release the difference back to the wallet.
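The estimate in step 1 might look like the sketch below. The price shape (milli-credits per million tokens), the 10% input padding, and the function name are assumptions made for illustration; only the "input tokens plus max_tokens at the model's rate" rule comes from the steps above.

```typescript
// Hypothetical price shape: milli-credits charged per million tokens.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

const TOKENS_PER_MTOK = 1000000;

// Upper bound on what a request can cost, rounded up to a whole milli-credit.
// padPercent inflates the input count to absorb tokenizer estimation error.
function estimateHold(
  price: ModelPrice,
  estimatedInputTokens: number,
  maxTokens: number,
  padPercent: number = 10,
): number {
  // Integer arithmetic for the padding avoids floating-point drift before ceil.
  const paddedInput = Math.ceil((estimatedInputTokens * (100 + padPercent)) / 100);
  const worstCase =
    paddedInput * price.inputPerMTok + maxTokens * price.outputPerMTok;
  return Math.ceil(worstCase / TOKENS_PER_MTOK);
}

// 1,000 input tokens (padded to 1,100) plus max_tokens = 500.
const exampleHold = estimateHold(
  { inputPerMTok: 300000, outputPerMTok: 1500000 },
  1000,
  500,
);
```

With these example prices exampleHold comes out to 1080 milli-credits. Rounding up at every step keeps the hold an upper bound, which is what makes the later refund non-negative.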

The important part is step 1. Make it a single atomic, conditional operation and concurrency can't slip through. DynamoDB does this natively.

The reserve: one TransactWriteItems

A wallet is one item, keyed by user. The reserve is a single TransactWriteItems with a conditional decrement plus an idempotency record.

await ddb.send(new TransactWriteCommand({
  TransactItems: [
    {
      Update: {
        TableName: TABLE, // TABLE is your table name; every transact item needs one
        Key: { PK: `T#${tid}`, SK: `U#${uid}` },
        // Decrement only if the balance covers the hold, the user is active,
        // and the period quota isn't exhausted, all in one expression.
        ConditionExpression:
          "bal >= :hold AND #st = :active AND (attribute_not_exists(#q) OR #q < :qmax)",
        UpdateExpression: "SET bal = bal - :hold ADD #q :one",
        ExpressionAttributeNames: { "#st": "status", "#q": `q_${period}_fast` },
        ExpressionAttributeValues: {
          ":hold": hold, ":active": "active", ":qmax": qmax, ":one": 1,
        },
        ReturnValuesOnConditionCheckFailure: "ALL_OLD",
      },
    },
    {
      // Exactly-once: the idempotency key can be written at most once.
      Put: {
        TableName: TABLE, // same table as the wallet item
        Item: { PK: `T#${tid}`, SK: `IDEM#${idemKey}`, reqId, ttl },
        ConditionExpression: "attribute_not_exists(PK)",
      },
    },
  ],
}))

The condition and the write are the same operation. Under any number of concurrent requests, DynamoDB serializes the conditional updates on that item: the first N that fit succeed, and the rest fail the ConditionExpression. No lock, no read-modify-write loop in your app. And because it's a transaction, the wallet decrement and the idempotency Put either both happen or neither does.

When the transaction is cancelled, you read CancellationReasons to map the failure to the right HTTP status:

condition failed on the wallet item maps to 402 insufficient_credits (or 429 if it was the quota counter; inspect ALL_OLD to tell them apart)
condition failed on the idempotency item maps to 409 duplicate_request: return the original result and charge nothing
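That mapping could be sketched as a pure function. It assumes CancellationReasons arrives in the same order as TransactItems (wallet first, idempotency second) and that the old item has already been converted to plain values, so check what your client actually returns. The 403, 500, and 503 responses are additions of this sketch, not part of the mapping above.

```typescript
// Assumed shape of one cancellation reason, with Item already unmarshalled.
interface CancellationReason {
  Code?: string;
  Item?: Record<string, unknown>;
}

interface ReserveFailure {
  status: number;
  error: string;
  retryable: boolean;
}

function classifyCancellation(
  reasons: CancellationReason[],
  hold: number,
  qmax: number,
  quotaAttr: string, // e.g. the attribute name behind "#q"
): ReserveFailure {
  const [wallet, idem] = reasons;

  // A duplicate wins over everything else: replay the original, charge nothing.
  if (idem?.Code === "ConditionalCheckFailed") {
    return { status: 409, error: "duplicate_request", retryable: false };
  }

  if (wallet?.Code === "ConditionalCheckFailed") {
    const old = wallet.Item ?? {}; // populated because of ALL_OLD
    const used = old[quotaAttr];
    if (typeof used === "number" && used >= qmax) {
      return { status: 429, error: "quota_exceeded", retryable: false };
    }
    const bal = old["bal"];
    if (typeof bal === "number" && bal < hold) {
      return { status: 402, error: "insufficient_credits", retryable: false };
    }
    // Remaining clause: status is not active, or the wallet item is missing.
    return { status: 403, error: "account_inactive", retryable: false };
  }

  // Conflicts mean the conditions were never evaluated, so the caller may retry.
  if (reasons.some((r) => r.Code === "TransactionConflict")) {
    return { status: 503, error: "transaction_conflict", retryable: true };
  }
  return { status: 500, error: "reserve_failed", retryable: false };
}
```

Checking the idempotency item first is a design choice: a replayed request should get its original answer even if the wallet has since run dry.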

Capture and release: no dangling money

The hold is deliberately conservative. It caps output at max_tokens and pads the input estimate. On completion you capture the real cost and refund the rest.

const actual = creditsForUsage(price, billedOutputTokens, inputTokens)
const refund = Math.max(0, hold - actual)         // >= 0 by construction
await ddb.send(new UpdateCommand({
  TableName: TABLE, // TABLE is your table name; UpdateCommand requires it
  Key: walletKey,
  UpdateExpression: "SET bal = bal + :refund",
  ExpressionAttributeValues: { ":refund": refund },
}))

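Every error path releases the full hold. The invariant is that no code path exits with money still held. As a backstop, the hold and idempotency items carry a TTL, so an abandoned request expires itself. DynamoDB deletes expired items asynchronously rather than at the exact expiry time, so this is cleanup, not a timely release. TTL as a financial primitive.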
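One way to make that invariant hard to break is to route every request through a single wrapper. A minimal sketch follows, in which withHold, run, and release are hypothetical names: run performs the provider call and reports the actual cost, and release returns credits to the wallet, for example with the UpdateCommand shown earlier.

```typescript
// Runs the work under an existing hold and always settles it afterwards.
async function withHold<T>(
  hold: number,
  run: () => Promise<{ result: T; actual: number }>,
  release: (refund: number) => Promise<void>,
): Promise<T> {
  let refund = hold; // default: an error path gives the whole hold back
  try {
    const { result, actual } = await run();
    refund = Math.max(0, hold - actual); // success: keep only the actual cost
    return result;
  } finally {
    // Runs on both success and throw, before the caller sees the outcome.
    if (refund > 0) await release(refund);
  }
}
```

If release itself fails, the hold stays in place, which is where a retry or the TTL backstop would have to take over.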

One LLM-specific gotcha: some providers bill reasoning ("thinking") tokens without counting them in completion_tokens. Charge max(completion_tokens, total_tokens - prompt_tokens) so reasoning gets captured no matter how the provider buckets it.
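As a small sketch of that rule, assuming a usage object with the three OpenAI-style fields named above (the function name is illustrative):

```typescript
// Usage fields as named in the paragraph above.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Bill whichever is larger: the reported completion count, or everything
// in the total that was not prompt (which picks up hidden reasoning tokens).
function billedOutputTokens(usage: Usage): number {
  return Math.max(
    usage.completion_tokens,
    usage.total_tokens - usage.prompt_tokens,
  );
}

// Reasoning reported outside completion_tokens: 400 - 100 = 300 billed.
const withHiddenReasoning = billedOutputTokens({
  prompt_tokens: 100,
  completion_tokens: 50,
  total_tokens: 400,
});

// No hidden tokens: both sides agree on 50.
const plain = billedOutputTokens({
  prompt_tokens: 100,
  completion_tokens: 50,
  total_tokens: 150,
});
```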

The proof: 50 parallel requests, exactly 5 succeed

The test that matters: seed a wallet with 5 credits, fire 50 parallel reserves of 1 credit each, and assert that exactly 5 succeed and the balance lands on exactly 0.

firing 50 parallel reserves of 1000 milli-credits each…
  insufficient_credits: 45
  ok: 5
  final balance: 0 milli-credits
RACE TEST PASSED

No oversell, no undersell, no negative balance. It holds across any number of stateless gateway instances, because the correctness lives in the database, not in process memory.
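The shape of that test can be sketched against an in-memory stand-in. This models the property that matters (check and decrement happen as one indivisible step on the item) and is not a DynamoDB client, so a real test would swap reserve for the transaction shown earlier. All names here are illustrative.

```typescript
type Outcome = "ok" | "insufficient_credits";

// In-memory stand-in for one wallet item.
class Wallet {
  private bal: number;
  constructor(initial: number) {
    this.bal = initial;
  }
  // Check and decrement in one synchronous step, so nothing can interleave.
  reserve(hold: number): Outcome {
    if (this.bal < hold) return "insufficient_credits";
    this.bal -= hold;
    return "ok";
  }
  get balance(): number {
    return this.bal;
  }
}

// The awaits model the request and response travelling over the network.
async function reserveOverNetwork(wallet: Wallet, hold: number): Promise<Outcome> {
  await Promise.resolve();
  const outcome = wallet.reserve(hold);
  await Promise.resolve();
  return outcome;
}

async function runRaceTest(): Promise<{ ok: number; rejected: number; finalBalance: number }> {
  const wallet = new Wallet(5000); // 5 credits, in milli-credits
  const outcomes = await Promise.all(
    Array.from({ length: 50 }, () => reserveOverNetwork(wallet, 1000)),
  );
  const ok = outcomes.filter((o) => o === "ok").length;
  return { ok, rejected: outcomes.length - ok, finalBalance: wallet.balance };
}
```

In this model runRaceTest() resolves with 5 ok, 45 rejected, and a final balance of 0, matching the run above.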

A practical note from running this against real DynamoDB: under heavy same-item contention you'll see TransactionConflict cancellations. Those mean "the conditions weren't evaluated," not "the condition failed." Retry them with jittered backoff, and only treat ConditionalCheckFailed as a verdict. The SDK does not auto-retry TransactionConflict for you.
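A retry loop along those lines might look like the sketch below. It assumes the caller has already reduced each attempt to one of three outcomes, and the base delay, cap, and attempt count are placeholder values to tune, not recommendations.

```typescript
type ReserveOutcome = "ok" | "condition_failed" | "conflict";

// Full-jitter exponential backoff: a random wait in [0, min(cap, base * 2^attempt)).
function backoffMs(
  attempt: number,
  baseMs: number = 25,
  capMs: number = 1000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function reserveWithRetry(
  attemptReserve: () => Promise<ReserveOutcome>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<ReserveOutcome> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const outcome = await attemptReserve();
    // "ok" and "condition_failed" are verdicts; only a conflict is retried.
    if (outcome !== "conflict") return outcome;
    if (attempt < maxAttempts - 1) await sleep(backoffMs(attempt));
  }
  return "conflict"; // still contended after maxAttempts; the caller decides
}
```

Because every retry reuses the same idempotency key, an attempt that did commit before its response was lost surfaces as a duplicate instead of a second charge.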

Why DynamoDB for this, and Aurora DSQL for the ledger

The money-critical operation is a conditional update on a single, highly contended item: one user's wallet under parallel fire. That's DynamoDB's native workload. You get flat single-digit-millisecond conditional writes regardless of table size, plus TransactWriteItems to bundle the decrement, the quota increment, and the idempotency record into one atomic unit.

The financial record is a different shape. A double-entry ledger, config history, and an audit log want SQL, with joins and time-range queries. That lives in Aurora DSQL (Postgres, strong consistency, optimistic concurrency). OCC is basically free for an append-only ledger, since appends never conflict. It would abort-storm on a hot wallet row, though, which is exactly why the wallet stays in DynamoDB. Each store gets the workload it was built for, and the ledger write happens off the request latency path (via Vercel's after()), with a DynamoDB-backed retry enqueue if it ever fails.

This billing core is one piece of a larger project I'm building, Switchboard: a bring-your-own-key LLM gateway where the integration is two lines (point your SDK at the gateway, swap the model name for a flag), and everything else lives in a dashboard: pricing, routing, A/B tests, kill switches, and refills via Stripe and RevenueCat webhooks. The billing pattern above is the part I'd reach for in any product that sells usage, gateway or not.

So that's the idea in one line: don't enforce budgets in middleware that reads then writes. Make the authorization a conditional transaction, and overspend stops being a bug you catch after the fact. It becomes a state the database refuses to enter.

Built for the H0 Hackathon (Hack the Zero Stack with Vercel v0 and AWS Databases). Switchboard runs on Amazon DynamoDB and Aurora DSQL, deployed on Vercel, scaffolded with v0. #H0Hackathon