Recently, Anthropic users ran into a frustrating pattern. Usage limits hit faster than expected. Credits appeared late. In some cases, the same request was billed twice. The forums and GitHub issues filled up fast.
But stepping back from the frustration, have you ever thought about what it actually takes to build billing infrastructure for an AI API?
It sounds simple. Count tokens, charge money. But the moment you add streaming responses, concurrent users, prepaid credits, multiple token types, and an async pipeline underneath, it becomes one of the harder problems a platform team will face. And when it breaks, it breaks visibly. Users notice billing errors faster than almost any other kind of bug.
This article is about what that system looks like under the hood, where it tends to fail, and what engineers can do about it.
The Anatomy of a Billing System
If I were to break down how a billing system is structured, I would anchor it around three core layers.
Event is the starting point. When a request completes, the API server emits a usage event that contains information about who made the request, which model, how many tokens, and a unique ID. That unique ID does quiet but important work. It is the foundation for both auditability and idempotency, which we will get to shortly.
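A usage event can be sketched as a small immutable record. The field names here are illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UsageEvent:
    event_id: str       # unique per request; the idempotency key downstream
    org_id: str
    model: str
    input_tokens: int
    output_tokens: int

# Example event emitted when a request completes
evt = UsageEvent(
    event_id="evt_abc123",
    org_id="org_A",
    model="model-x",    # placeholder model name
    input_tokens=1200,
    output_tokens=450,
)
```

The record is frozen on purpose: once emitted, a usage event is a fact, not something downstream systems should mutate.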
Meter is the core of billing and probably the hardest part to get right. It takes the stream of raw events and answers one question: how much has this account consumed, and do they have enough balance to continue? Doing this correctly at scale, for millions of concurrent users, across multiple token types, with partial streaming responses in flight, is where most of the interesting failures live.
Ledger is the authoritative record of every transaction that touched a balance. It is append-only by design. You never edit a past entry. You only add new ones. The current balance is the sum of all entries for that account. This gives you two things a single mutable balance column cannot: a full audit trail for disputes, and a natural place to enforce idempotency.
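A minimal in-memory sketch of that idea, with the unique-ID check standing in for a database constraint:

```python
# Append-only ledger sketch: balance is derived, never stored.
ledger = []  # list of (event_id, amount) entries; inserts only, no updates

def append_entry(event_id, amount):
    # Duplicate event_id is a no-op, mirroring a unique constraint in a real DB.
    if any(eid == event_id for eid, _ in ledger):
        return False
    ledger.append((event_id, amount))
    return True

def balance():
    # Current balance is simply the sum of all entries.
    return sum(amount for _, amount in ledger)

append_entry("evt_grant_1", 5.00)   # credit grant
append_entry("evt_abc123", -0.03)   # usage deduction
append_entry("evt_abc123", -0.03)   # redelivered duplicate: ignored
print(round(balance(), 2))          # 4.97
```

The duplicate entry at the end silently does nothing, which is exactly the behavior an at-least-once pipeline needs from its ledger.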
These three layers are not just organizational. They represent distinct consistency boundaries. The event layer needs to be fast. The meter layer needs to be correct. The ledger layer needs to be durable. Mixing them into a single system is where things start to go wrong.
Token Counting Is Harder Than It Looks
With the anatomy established, I want to talk about how AI labs actually meter usage. Token counting sounds mechanical. In practice, it has edge cases that are easy to underestimate.
Input and output tokens are not symmetric
Token counting happens on both sides of the request. Input tokens are known before the model runs. Output tokens are not. The response is a streaming, non-deterministic outcome. The system does not know how long it will be until it finishes.
One way AI labs handle this is by metering the chunked stream as it is produced. As each chunk arrives, the meter tracks cumulative output tokens. As long as quota remains, the stream continues. Once quota is exhausted, the stream is cut. This means the billing decision and the response generation are happening at the same time, not in sequence.
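A sketch of that loop, assuming per-chunk token counts are already available from the tokenizer:

```python
# Meter a stream chunk-by-chunk against a simple in-memory quota.
def stream_with_metering(chunks, quota_tokens):
    consumed = 0
    for chunk_text, chunk_tokens in chunks:
        if consumed + chunk_tokens > quota_tokens:
            break  # quota exhausted mid-stream: cut the response here
        consumed += chunk_tokens
        yield chunk_text
    # 'consumed' is the confirmed output token count for the ledger write

chunks = [("Hello", 2), (" world", 2), ("!", 1)]
delivered = list(stream_with_metering(chunks, quota_tokens=4))
print(delivered)  # ['Hello', ' world']
```

Note that the final chunk is dropped, not truncated: the billing decision happens at chunk boundaries, interleaved with generation.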
Metering is opaque to the client
The token count is finalized on the server. The client cannot see it until the response is done. Given the non-deterministic nature of the output, users have no reliable way to predict what a request will cost before sending it. That opacity is a real trust problem, and it gets worse when users are on prepaid credits with hard limits.
Not all tokens are the same
There are input tokens, output tokens, thinking tokens, and cached tokens. Each has a different price and, in some cases, a different metering path. All of them draw from the same pool: the account's available balance.
If a user has multiple parallel sessions running, each one is consuming from that shared pool at the same time. That is a concurrency problem on the balance. At the scale of millions of users, it is a serious one. This is where the next section begins.
The Two Hardest Problems: Idempotency & The Credit Race
These two problems sit at the intersection of distributed systems and financial correctness. Getting either one wrong produces charges that are duplicated or inaccurate. Both erode user trust fast.
Idempotency
Idempotency means that performing the same operation multiple times produces the same result as performing it once. In billing, this is not a nice-to-have. It is load-bearing.
Usage events travel through an async pipeline, typically a message queue like Kafka. These queues guarantee at-least-once delivery. That means a consumer will sometimes receive the same event more than once during retries, consumer restarts, or network hiccups.
Without an idempotency guard on the ledger write, a redelivered event produces a second charge for the same request. This is the pattern behind the Anthropic dual-billing bug. Two write paths, one for API billing and one for prepaid credits, both fired for the same request with no shared dedup check between them.
The fix at the ledger level is a single SQL statement:
```sql
INSERT INTO ledger (event_id, org_id, amount, type, timestamp)
VALUES ('evt_abc123', 'org_A', -0.03, 'usage_deduct', NOW())
ON CONFLICT (event_id) DO NOTHING;
```
The event_id column has a unique constraint. A duplicate event does nothing. The ledger's append-only structure makes this natural. You are inserting a row, not updating one, and the database enforces uniqueness for you.
The event_id here is the same unique ID emitted with the usage event back in the event layer. Carrying it end to end through the pipeline is what makes idempotency enforceable at the ledger.
The Credit Race
The credit race is a concurrency problem. Two requests arrive at the same time for an account with $0.05 remaining. Each request costs $0.03. Both are individually affordable. Together they are not.
Without an atomic check-and-deduct, both requests read the balance as sufficient, both proceed, and the account ends at -$0.01. That is a silent overdraft.
A plain Redis DECRBY that subtracts a value from a key atomically does not protect against this. It has no floor. It will go negative without complaint.
The correct approach is an atomic check-and-deduct. In Redis, this is done with a Lua script that runs as a single uninterruptible operation:
```lua
-- Balance is stored in an integer unit (e.g. cents), since DECRBY is integer-only.
-- A missing key is treated as a zero balance.
local balance = tonumber(redis.call('GET', KEYS[1]) or 0)
local cost = tonumber(ARGV[1])
if balance >= cost then
  return redis.call('DECRBY', KEYS[1], cost)
else
  return -1 -- insufficient balance, reject the request
end
```
There is no gap between the read and the write. Another thread cannot slip in between them.
In practice, a production system typically uses both: Redis for the fast atomic balance gate on the hot path, Postgres as the durable ledger for audit and idempotency. They solve different problems. Redis buys speed. Postgres buys correctness.
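A toy sketch of the combined design, with a lock standing in for the Lua script's atomicity and a dict standing in for the ledger's unique constraint. In-memory only, for illustration:

```python
import threading

class BillingGate:
    def __init__(self, balance_cents):
        self._balance = balance_cents
        self._lock = threading.Lock()  # plays the role of Lua's single-operation atomicity
        self._ledger = {}              # event_id -> amount (unique-constraint stand-in)

    def check_and_deduct(self, event_id, cost_cents):
        with self._lock:
            if event_id in self._ledger:    # duplicate delivery: no-op
                return True
            if self._balance < cost_cents:  # atomic floor check
                return False
            self._balance -= cost_cents
            self._ledger[event_id] = -cost_cents
            return True

gate = BillingGate(balance_cents=5)       # $0.05 remaining
print(gate.check_and_deduct("evt_1", 3))  # True  -> $0.02 left
print(gate.check_and_deduct("evt_2", 3))  # False -> second request rejected
```

This is exactly the $0.05 scenario from above: because check and deduct happen under one lock, the two $0.03 requests cannot both succeed, and the account can never go negative.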
Failure Modes Taxonomy
Most billing failures are not random. They cluster around a small number of root causes. Recognizing the pattern matters more than patching individual bugs.
| Failure Class | Real Example | What Actually Went Wrong |
|---|---|---|
| Double charge | Anthropic API + credit billed for same request | Two write paths, no shared idempotency check |
| Ghost block | Credits exist, API returns 402 anyway | Meter reads a stale cached balance that disagrees with the ledger |
| Silent overdraft | More tokens consumed than remaining balance | Non-atomic check-then-deduct under concurrency |
| Credit destruction | Gift codes wiping each other on redemption | Stripe proration logic unaware of credit stacking rules |
| Consumption spike | 3–5x faster depletion after a model update | No baseline drift detection on per-account meter |
| Measurement loss | Streaming request interrupted mid-response | Token count finalized at stream end, partial streams need a separate accounting path |
Three of these six failures share one root cause: billing state living in more than one place without a single source of truth.
- Double charge happens when two systems both think they own the write.
- Ghost block happens when the cache disagrees with the ledger.
- Credit destruction happens when Stripe's model and the internal credit model diverge.
The fix is not patching each bug one by one. It is deciding once which system owns each piece of billing state, and making everything else read from it.
Design Principles
Idempotency is load-bearing, not defensive.
Engineers sometimes treat idempotency as a safety net added after the system works. In billing, it belongs in the design before the first line of code. Every ledger write needs a dedup key. Every event needs a stable, unique ID. At-least-once delivery will eventually betray you. The only safe response is to design for it upfront.
Credits are not money. Model them accordingly.
A balance column feels sufficient until you need expiry dates, stacking rules for different grant types, priority ordering between purchased and promotional credits, and shared wallets across teams. At that point, a single number is not enough. Credits need to be a ledger of typed entries, each with an amount, a source, an expiry, and a priority. The Anthropic gift code bug, where redeeming multiple codes destroyed existing credit value, is a direct result of treating credits as a simple balance rather than a structured set of grants.
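A sketch of what structured grants might look like. The fields and the spend ordering (priority first, then soonest expiry) are illustrative assumptions, not any vendor's actual rules:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Grant:
    amount: float
    source: str      # e.g. "purchased", "promo", "gift_code"
    expires: date
    priority: int    # lower spends first

def spend(grants, cost):
    # Spend promotional / soonest-expiring credit before purchased credit.
    for g in sorted(grants, key=lambda g: (g.priority, g.expires)):
        take = min(g.amount, cost)
        g.amount -= take
        cost -= take
        if cost == 0:
            break
    return cost  # nonzero means insufficient credit

grants = [
    Grant(5.00, "purchased", date(2026, 12, 31), priority=1),
    Grant(2.00, "gift_code", date(2025, 6, 30), priority=0),
]
spend(grants, 3.00)
print([g.amount for g in grants])  # [4.0, 0.0]
```

With this structure, redeeming a second gift code appends a new grant; it has no way to overwrite an existing one, which is the failure mode a single balance column invites.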
Measure first, charge second.
This is easy to say and consistently violated under deadline pressure. The metering pipeline should finalize the token count before any charge is recorded. Charging on estimated or in-flight counts, especially for streaming, creates a class of errors that are nearly impossible to audit after the fact. The accounting path and the response path can run in parallel. But the ledger write should always wait for a confirmed count.
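A sketch of that ordering, with an illustrative integer price in microdollars to keep the arithmetic exact:

```python
PRICE_MICROD_PER_TOKEN = 10  # illustrative: 10 microdollars per output token

def finalize_and_charge(stream_chunks, ledger_write):
    total_tokens = 0
    for text, tokens in stream_chunks:
        total_tokens += tokens  # metering runs alongside streaming
    # Only now, with a confirmed final count, do we touch the ledger.
    ledger_write(amount=-total_tokens * PRICE_MICROD_PER_TOKEN)
    return total_tokens

charged = []
n = finalize_and_charge([("a", 10), ("b", 5)], lambda amount: charged.append(amount))
print(n, charged)  # 15 [-150]
```

The single ledger write at the end is the discipline: one confirmed count, one charge, nothing recorded mid-flight.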
Conclusion
The Anthropic billing incidents were not exotic failures. They were textbook distributed systems problems that surfaced in a financial context. Duplicate writes without idempotency guards. Concurrent balance checks without atomic deduction. Credit state split across systems with no clear owner.
What makes them worth studying is not that mistakes were made. Every system at scale makes mistakes. It is that these failure classes are predictable. They have names. They have known fixes. And they tend to appear in roughly the same order as a billing system grows.
If you are building or reviewing a billing system, the most useful question to ask is not whether the happy path works. It is whether you have decided which system owns each piece of billing state, and whether everything else actually reads from it.
Billing infrastructure rarely gets a design doc. It rarely gets a dedicated team until something goes wrong. It sits in the corner of the codebase, quietly doing its job, until one day it doesn't, and suddenly it's the only thing anyone is talking about.
The Anthropic incidents are a good reminder of that dynamic. The failures weren't in the model. They weren't in the API. They were in the plumbing that sits between a user's wallet and a completed request. The part nobody thought was interesting enough to design carefully.
That's the thing about boring infrastructure. It doesn't announce itself when it's working. But when it breaks, it breaks in the most visible way possible on someone's credit card statement.
Token counting, idempotency, credit ledgers, atomic deductions, none of this is glamorous work. But it is the work that determines whether users trust your platform. And trust, once lost over a billing error, is genuinely hard to get back.
Build the boring parts like they matter. Because to your users, they matter more than almost anything else.


