Takayuki Kawazoe

Posted on May 1

Why we built our own credit ledger instead of using Stripe metered billing

#stripe #billing #postgres #architecture

A few months in, an agent I was running reserved about 1,000 credits' worth of tokens for a long-running task, then crashed mid-flight. The credits were sitting in a "reserved" bucket on the org. The user's available balance had gone down. The work hadn't actually happened.

If your billing layer is Stripe metered, you have an awkward conversation with yourself at this point. stripe.SubscriptionItem.create_usage_record(...) doesn't have a quantity=-1000 "actually never mind" path that respects your prior idempotency keys. Negative quantities exist, but they aren't the same primitive as a release; they show up as a separate usage record and they don't cancel the original reservation, because there was no reservation — Stripe metered doesn't have one. The model is: emit a usage event, it accrues, you bill at period end.

That moment is when I started writing my own credit ledger.

I'm building a SaaS that touches this problem (Codens, an AI dev harness with a few specialized agents that share an org-wide credit pool — happy to talk about it but it's not the point of this post). The point of this post is the architectural reason why Stripe metered, which is genuinely great at what it's designed for, doesn't fit a multi-product credit-pool product, and what the homegrown alternative actually looks like.

I'll keep this builder-to-builder. No pricing math, no markup talk, just the primitives.

What Stripe metered is great at

Stripe metered is designed around a clean model:

One product (or one subscription item per product).
A customer is on a subscription.
You report usage events as they happen.
At the end of the billing period, Stripe converts events into an invoice.
Pricing tiers (graduated, volume) are configured on the price object.

If you sell a single product priced per API call, per minute, per GB — this is the right answer. You should use it. You'll be done in a day, you'll get hosted invoices and dunning for free, and you'll never reimplement tax rates.

The problems start the moment you violate any of those assumptions.

Where it stops being a good fit

In my case, five assumptions were violated at once.

1. The pool is org-wide, not per-product

I have multiple agents (PRD writing, error auto-fix, test gen, activity ledger, and a workflow harness) that all draw from one shared credit balance attached to the customer's organization. There is no "Subscription for the test-gen agent" — there's an organization, and an org-wide pool that any agent can pull from.

You can model this in Stripe two ways and they both hurt. One subscription item per product means N usage streams to keep in sync, and the customer doesn't have a single balance — they have N usages that get reconciled at invoice time. One subscription item with a "credits" SKU means you've already started building a homegrown ledger; you've just put a thin Stripe wrapper on it.

The org-wide entity is the unit I actually want to lock and update atomically:

# billing-control-plane/backend/src/domain/entities/organization_credit.py
@dataclass
class OrganizationCredit:
    auth_org_id: UUID
    balance: Decimal
    reserved_balance: Decimal
    billing_paused: bool
    paused_reason: str | None = None
    updated_at: datetime | None = None

    @property
    def available_balance(self) -> Decimal:
        return self.balance - self.reserved_balance

Two columns: balance and reserved_balance. available_balance is just their difference. Every consume, reserve, refund, chargeback takes a SELECT ... FOR UPDATE on this row before doing anything. There is exactly one row per organization, and it's the truth.

You cannot get SELECT FOR UPDATE semantics out of Stripe. Stripe's API is eventually consistent from your perspective; you submit a usage event, you trust their backend to process it, you read it back via the dashboard or their API. That's fine for billing-period accounting. It's not fine for "did this agent's request just push the org past its available balance, yes or no, right now."
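
To make the race concrete, here is a minimal in-memory sketch, not the production code. A threading.Lock stands in for the row lock that SELECT ... FOR UPDATE takes on the org's row; InMemoryCredit and try_reserve are hypothetical names invented for the illustration, and the numbers are made up.

```python
import threading
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class InMemoryCredit:
    """Hypothetical stand-in for one organization_credit row."""

    balance: Decimal
    reserved_balance: Decimal = Decimal(0)
    # Plays the role of the row lock taken by SELECT ... FOR UPDATE.
    row_lock: threading.Lock = field(default_factory=threading.Lock)

    @property
    def available_balance(self) -> Decimal:
        return self.balance - self.reserved_balance


def try_reserve(credit: InMemoryCredit, amount: Decimal) -> bool:
    """Check and update under the lock so no other caller can interleave."""
    with credit.row_lock:
        if credit.available_balance < amount:
            return False
        credit.reserved_balance += amount
        return True


credit = InMemoryCredit(balance=Decimal(1000))
results: list[bool] = []
results_lock = threading.Lock()


def worker() -> None:
    ok = try_reserve(credit, Decimal(300))
    with results_lock:
        results.append(ok)


# Ten concurrent requests of 300 each against a pool of 1000.
threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sum(results)  # number of reservations that succeeded
```

Because the check and the update happen under the same lock, only three of the ten requests can be granted and the other seven fail before doing any work. Without the lock, two threads could both read the same available_balance and both proceed.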

2. Long-running tasks need reservations

Several of my agent operations take 30 seconds to 30 minutes. The naive "consume after the fact" pattern blows up: two concurrent requests both pass the balance check, both run, the second one was always going to overspend, and now I owe somebody an apology.

The fix is reservations. Before the work starts, reserve the credits. When the work finishes, consume the actual amount and release the rest.

# reserve_credit_use_case.py — abridged
credit = await self.credit_repo.get_by_auth_org_id_for_update(req.auth_org_id)
if credit is None:
    raise NotFound(f"organization {req.auth_org_id} not initialized")
if credit.billing_paused:
    raise BillingPaused(f"billing paused: {credit.paused_reason}")
if credit.available_balance < req.amount:
    raise InsufficientCredits(
        f"available={credit.available_balance} required={req.amount}"
    )

expires_at = datetime.now(UTC) + timedelta(seconds=req.expires_in_seconds)
reservation = Reservation(
    auth_org_id=req.auth_org_id,
    amount=req.amount,
    expires_at=expires_at,
    reserved_by_service=req.service_id,
    metadata=req.metadata or {},
)
reservation = await self.reservation_repo.create(reservation)

# balance is NOT changed; only reserved_balance moves
await self.credit_repo.update_balance(
    auth_org_id=req.auth_org_id,
    balance_delta=Decimal(0),
    reserved_balance_delta=req.amount,
)

The trick is that balance doesn't move during a reservation. Only reserved_balance does. So the org's total credit balance is unchanged ("the customer didn't lose money yet, the work hasn't been done"), but their available_balance (balance - reserved_balance) drops, which is what the next concurrent request sees. If a second agent tries to reserve more than what's left available, it fails before doing any work.

Then on success the reservation gets consumed for the actual amount, and the difference goes back:

# consume_reservation_use_case.py — abridged
new_balance = credit.balance - req.actual_amount
await self.credit_repo.update_balance(
    auth_org_id=reservation.auth_org_id,
    balance_delta=-req.actual_amount,
    reserved_balance_delta=-reservation.amount,
)

credit_back = reservation.amount - req.actual_amount

balance drops by the actual consumed amount. reserved_balance drops by the full reservation amount (so what got over-reserved is returned to available). One row update, atomic.
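
A worked example of that arithmetic, with made-up numbers. The Credit class and its apply method are hypothetical; apply only mirrors the two deltas that the snippets above pass to update_balance.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Credit:
    balance: Decimal
    reserved_balance: Decimal

    @property
    def available_balance(self) -> Decimal:
        return self.balance - self.reserved_balance

    def apply(self, balance_delta: Decimal, reserved_balance_delta: Decimal) -> None:
        """Apply both deltas together, like a single row update."""
        self.balance += balance_delta
        self.reserved_balance += reserved_balance_delta


credit = Credit(balance=Decimal(5000), reserved_balance=Decimal(0))

# Reserve 1000 up front: only reserved_balance moves.
reserved_amount = Decimal(1000)
credit.apply(Decimal(0), reserved_amount)
after_reserve = (credit.balance, credit.reserved_balance, credit.available_balance)

# The task finishes having used 640: charge the actual amount and
# drop the whole reservation in one update.
actual_amount = Decimal(640)
credit.apply(-actual_amount, -reserved_amount)
after_consume = (credit.balance, credit.reserved_balance, credit.available_balance)

credit_back = reserved_amount - actual_amount
```

After the reserve the state is balance 5000, reserved 1000, available 4000. After the consume it is balance 4360, reserved 0, available 4360. Available rose by 360, which is exactly credit_back.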

If the agent crashed instead, the matching call is release_reservation:

# release_reservation_use_case.py — abridged
await self.credit_repo.update_balance(
    auth_org_id=reservation.auth_org_id,
    balance_delta=Decimal(0),
    reserved_balance_delta=-reservation.amount,
)

balance unchanged, reserved_balance drops back. The credits are available again.
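
A crashed agent cannot call release itself, which is presumably what the expires_at field on the reservation is for. A minimal sketch of an expiry sweep, under assumptions: the status field, the release_expired function, and the in-memory list are all hypothetical, and a real version would run inside the same locked transaction as the other use cases.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from decimal import Decimal


@dataclass
class Reservation:
    reservation_id: str
    amount: Decimal
    expires_at: datetime
    status: str = "active"  # hypothetical states: "active" / "released"


def release_expired(
    reservations: list[Reservation],
    reserved_balance: Decimal,
    now: datetime,
) -> Decimal:
    """Release every active reservation past its expiry.

    Returns the new reserved_balance; balance itself is never touched.
    """
    for r in reservations:
        if r.status == "active" and r.expires_at <= now:
            r.status = "released"
            reserved_balance -= r.amount
    return reserved_balance


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
reservations = [
    Reservation("crashed-task", Decimal(1000), now - timedelta(minutes=5)),
    Reservation("still-running", Decimal(200), now + timedelta(minutes=25)),
]

new_reserved = release_expired(reservations, Decimal(1200), now)
# Running the sweep again changes nothing: released rows are skipped.
reserved_after_second_sweep = release_expired(reservations, new_reserved, now)
```

The status check is what makes the sweep safe to run repeatedly: a reservation that was already released (or consumed) never gives its credits back twice.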

There is no Stripe metered primitive for this. The closest you can build is a "we'll send you a usage event later, maybe" pattern, which means the customer's available balance is computed in your code anyway, which means you're keeping a parallel ledger anyway, which means... you've built this.

3. Idempotency on retries needs payload checking, not just key checking

This is the one that bit me hardest in early prototypes. Every agent retries. If the network blips during a reserve call, the agent retries with the same idempotency key. That retry must never reserve a second time.

Stripe's Idempotency-Key header gets you a long way for Stripe API calls. But the key alone isn't enough for a credit ledger — you also need to verify that the payload is the same. If the agent reuses a key but with a different amount (almost always a bug), you should reject it loudly, not return the cached response for the wrong amount.

That's what the idempotency manager looks like:

# idempotency_manager.py — the core branching
async def acquire(
    self,
    auth_org_id: UUID,
    service_id: str,
    idempotency_key: str,
    payload_hash: str,
    lease_seconds: int = DEFAULT_LEASE_SECONDS,
    retention_days: int = DEFAULT_RETENTION_DAYS,
) -> tuple[IdempotencyOutcome, IdempotencyKey | None]:
    existing = await self.repo.get_for_update(
        auth_org_id=auth_org_id,
        service_id=service_id,
        idempotency_key=idempotency_key,
    )
    now = datetime.now(UTC)

    if existing is not None:
        if existing.payload_hash != payload_hash:
            raise IdempotencyConflict(
                f"idempotency_key={idempotency_key} payload_hash mismatch"
            )
        if existing.status == "completed":
            return (IdempotencyOutcome.REPLAY, existing)
        if existing.status == "failed_terminal":
            return (IdempotencyOutcome.FAILED_REPLAY, existing)
        # in_flight
        lease = existing.in_flight_lease_until
        if lease is not None and lease > now:
            retry_after = max(1, int((lease - now).total_seconds()))
            raise IdempotencyInFlight(retry_after_seconds=retry_after)

The four states matter:

Completed with matching payload hash: return the cached response, do not touch balance. This is a successful retry of a prior success.
Mismatched payload hash, whatever the stored status: throw. Same key, different payload is a bug.
In-flight, lease still valid: another worker is doing this right now. Tell the caller to retry after N seconds. This prevents two workers from racing on the same key.
Failed terminal: cached error response. This stops retry storms on permanently-bad requests (insufficient credits, malformed payload).

The key observation: payload hash is computed by the use case and includes only the fields that should make the request unique. For reserve, that's {amount, expires_in_seconds}. For consume, that's {service_id, operation_type, input_tokens, output_tokens, model_name}. If the caller resubmits with different content but the same key, the hash comparison catches it before any balance changes.

def compute_payload_hash(payload: dict[str, Any]) -> str:
    """Deterministic SHA-256 hash of the canonical JSON payload."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
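
A short usage sketch of the function above, repeated here so the example runs on its own. The payload values are made up.

```python
import hashlib
import json
from decimal import Decimal
from typing import Any


def compute_payload_hash(payload: dict[str, Any]) -> str:
    """Deterministic SHA-256 hash of the canonical JSON payload."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = compute_payload_hash(
    {"amount": Decimal("1000"), "expires_in_seconds": 1800}
)
# Key order does not matter because of sort_keys=True.
reordered = compute_payload_hash(
    {"expires_in_seconds": 1800, "amount": Decimal("1000")}
)
# A different amount under the same key must hash differently.
different_amount = compute_payload_hash(
    {"amount": Decimal("1200"), "expires_in_seconds": 1800}
)
# Gotcha: default=str turns Decimal("1000.0") into "1000.0", not "1000",
# so numerically equal amounts can still produce different hashes.
trailing_zero = compute_payload_hash(
    {"amount": Decimal("1000.0"), "expires_in_seconds": 1800}
)
```

The last case is worth a test of its own: if one caller sends 1000 and a retry path sends 1000.0, this hashing scheme treats them as different payloads. Normalizing amounts to one scale before hashing is one way to avoid a spurious conflict; whether that is needed depends on how callers build their requests.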

The whole acquire → mutate balance → complete sequence runs inside one DB transaction. If the use case crashes between acquire and complete, the row is left in_flight with a lease. Lease expires, next attempt takes over. No double-spend.
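
The branching can be exercised end to end with an in-memory sketch. This is an illustration under assumptions, not the real manager: the dict store, the Outcome names NEW and TAKEOVER, and the handling of a brand-new key and of an expired lease are guesses based on the prose above, since the abridged snippet stops at the in-flight branch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Outcome(Enum):
    NEW = "new"                      # assumption: first time this key is seen
    REPLAY = "replay"
    FAILED_REPLAY = "failed_replay"
    TAKEOVER = "takeover"            # assumption: expired lease is taken over


class Conflict(Exception):
    pass


class InFlight(Exception):
    def __init__(self, retry_after_seconds: int) -> None:
        super().__init__(f"retry after {retry_after_seconds}s")
        self.retry_after_seconds = retry_after_seconds


@dataclass
class KeyRecord:
    payload_hash: str
    status: str  # "in_flight" | "completed" | "failed_terminal"
    lease_until: datetime


def acquire(store: dict[str, KeyRecord], key: str, payload_hash: str,
            now: datetime, lease_seconds: int = 30) -> Outcome:
    existing = store.get(key)
    new_lease = now + timedelta(seconds=lease_seconds)
    if existing is None:
        store[key] = KeyRecord(payload_hash, "in_flight", new_lease)
        return Outcome.NEW
    if existing.payload_hash != payload_hash:
        raise Conflict(f"idempotency_key={key} payload_hash mismatch")
    if existing.status == "completed":
        return Outcome.REPLAY
    if existing.status == "failed_terminal":
        return Outcome.FAILED_REPLAY
    if existing.lease_until > now:
        wait = max(1, int((existing.lease_until - now).total_seconds()))
        raise InFlight(wait)
    existing.lease_until = new_lease  # lease expired: this caller takes over
    return Outcome.TAKEOVER


t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
store: dict[str, KeyRecord] = {}

first = acquire(store, "k1", "hash-a", t0)
try:
    acquire(store, "k1", "hash-a", t0 + timedelta(seconds=10))
    retry_after = None
except InFlight as exc:
    retry_after = exc.retry_after_seconds
takeover = acquire(store, "k1", "hash-a", t0 + timedelta(seconds=31))
store["k1"].status = "completed"
replay = acquire(store, "k1", "hash-a", t0 + timedelta(seconds=40))
try:
    acquire(store, "k1", "hash-b", t0 + timedelta(seconds=41))
    conflict = False
except Conflict:
    conflict = True
```

The sequence walks one key through its life: first attempt acquires it, a retry ten seconds later is told to wait 20 seconds, a retry after the 30-second lease takes over, and once the record is completed a matching retry replays while a mismatched one is rejected.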

I genuinely cannot see how to express this in terms of Stripe usage records. They aren't the right primitive — they're append-only, conditionally cancellable, and the cancellation isn't tied to a payload hash.

4. Refunds at the credit level vs the invoice level

In Stripe-metered land, a refund is an invoice-level concept. You issued an invoice for $X based on accumulated usage; you can refund some or all of $X. That's fine for "we billed you wrong, here's your money back."

In credit-pool land, two different things happen and they're not the same:

Real money refund. The customer paid for credits, decided not to use the product, charges back or asks for a refund. The money flows back via Stripe. The credits should be removed from the balance. But if they already spent some of them, you have an AR situation.
Operational credit-back. A reservation was bigger than the actual consumption, so credits go back to available. No money moves at all. This happens hundreds of times a day per active org.

These are different transaction types in the ledger:

# credit_transaction.py
VALID_TRANSACTION_TYPES = frozenset(
    {
        "consume",
        "grant",
        "purchase",
        "refund",
        "chargeback",
        "reservation_consume",
        "reservation_release",
        "ar_settlement",
        "migration_apply",
        "migration_rollback",
    }
)

reservation_consume and reservation_release are operational. refund and chargeback involve real money. Stripe metered conflates these because Stripe is invoice-shaped, not credit-shaped.

The refund use case interestingly does not change balance at all — it logs the Stripe refund event for accounting and audit, because the refund itself happens in Stripe:

# refund_credit_use_case.py — abridged
# Refund: balance NOT changed, affects_balance=FALSE
tx = CreditTransaction(
    auth_org_id=req.auth_org_id,
    service_id=req.service_id,
    transaction_type="refund",
    amount=-req.amount,
    balance_after=credit.balance,
    reserved_balance_after=credit.reserved_balance,
    affects_balance=False,
    related_transaction_id=req.original_transaction_id,
    metadata={
        "stripe_refund_id": req.stripe_refund_id,
        "reason": req.reason,
    },
)

That affects_balance=False flag is a small thing that does a lot of work — it lets the audit trail include the refund event without double-counting it against the balance, because the chargeback/refund recovery happens via a separate flow.
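
One way to see what the flag buys: rebuild a balance by replaying the transaction log. This is a sketch with hypothetical names (LedgerRow, replay_balance) and made-up amounts, assuming signed deltas where consumption is recorded as a negative amount.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerRow:
    transaction_type: str
    amount: Decimal  # signed delta as recorded
    affects_balance: bool = True


def replay_balance(rows: list[LedgerRow], opening: Decimal = Decimal(0)) -> Decimal:
    """Rebuild balance from the log, skipping audit-only rows."""
    return opening + sum(
        (row.amount for row in rows if row.affects_balance), Decimal(0)
    )


rows = [
    LedgerRow("purchase", Decimal(10000)),
    LedgerRow("reservation_consume", Decimal(-640)),
    # Audit-only: the money moved in Stripe, so this row must not be summed.
    LedgerRow("refund", Decimal(-2000), affects_balance=False),
]

rebuilt = replay_balance(rows)
# Summing every row regardless of the flag counts the refund against balance.
naive = sum((row.amount for row in rows), Decimal(0))
```

Here rebuilt is 9360 and naive is 7360. The 2000 difference is the refund row, which stays visible in the audit trail but is left for the separate recovery flow to apply.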

5. The audit trail you want is row-level, not invoice-level

When a customer asks "why did our balance drop by 4,000 last Tuesday," what you want to be able to answer is: this reservation, by this service, for this operation, with this idempotency key, that resolved into this consume transaction. Then you want to be able to replay the exact event sequence.

I write every state-changing operation through three things in the same transaction:

The balance update on organization_credit.
A credit_transaction row that records the delta, the type, the related reservation/transaction, and metadata.
An outbox_event that downstream consumers (notifications, analytics, the customer's audit dashboard) can replay.

# from reserve_credit_use_case.py
await self.outbox_repo.create(
    OutboxEvent(
        aggregate_id=req.auth_org_id,
        event_type="credit.reserved",
        payload={
            "reservation_id": str(reservation.reservation_id),
            "auth_org_id": str(req.auth_org_id),
            "amount": str(req.amount),
            "expires_at": expires_at.isoformat(),
            "service_id": req.service_id,
        },
    )
)

The transactional outbox pattern is the boring right answer for "I need to atomically update state and emit an event." If event delivery fails, the state still moved consistently and the event will be redelivered. If state failed, no event went out. You can rebuild any downstream view by replaying the outbox.
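
The all-or-nothing property can be demonstrated with the standard library's sqlite3 module. This is a sketch under assumptions: the table layouts are trimmed to a few columns, credits are integers for brevity, the event name "credit.granted" and the grant function are hypothetical, and the real service runs on Postgres.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organization_credit (
        auth_org_id TEXT PRIMARY KEY,
        balance INTEGER NOT NULL,
        reserved_balance INTEGER NOT NULL
    );
    CREATE TABLE credit_transaction (
        id INTEGER PRIMARY KEY,
        auth_org_id TEXT NOT NULL,
        transaction_type TEXT NOT NULL,
        amount INTEGER NOT NULL
    );
    CREATE TABLE outbox_event (
        id INTEGER PRIMARY KEY,
        aggregate_id TEXT NOT NULL,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    INSERT INTO organization_credit VALUES ('org-1', 1000, 0);
    """
)


def grant(org_id: str, amount: int, crash_before_commit: bool = False) -> None:
    # "with conn" commits on success and rolls back if anything raises,
    # so the three writes land together or not at all.
    with conn:
        conn.execute(
            "UPDATE organization_credit SET balance = balance + ? "
            "WHERE auth_org_id = ?",
            (amount, org_id),
        )
        conn.execute(
            "INSERT INTO credit_transaction (auth_org_id, transaction_type, amount) "
            "VALUES (?, 'grant', ?)",
            (org_id, amount),
        )
        conn.execute(
            "INSERT INTO outbox_event (aggregate_id, event_type, payload) "
            "VALUES (?, 'credit.granted', ?)",
            (org_id, json.dumps({"auth_org_id": org_id, "amount": amount})),
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash before commit")


def scalar(sql: str) -> int:
    return conn.execute(sql).fetchone()[0]


try:
    grant("org-1", 500, crash_before_commit=True)
except RuntimeError:
    pass
balance_after_crash = scalar("SELECT balance FROM organization_credit")
events_after_crash = scalar("SELECT COUNT(*) FROM outbox_event")

grant("org-1", 500)
balance_after_grant = scalar("SELECT balance FROM organization_credit")
events_after_grant = scalar("SELECT COUNT(*) FROM outbox_event")
```

After the simulated crash the balance is still 1000 and the outbox is empty. After the successful call the balance is 1500 and exactly one event is waiting for a relay process to deliver it.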

Stripe's webhooks are excellent, but they're Stripe's events, not yours. They tell you "an invoice was paid" or "a usage record was created." They don't tell you "service yellow-codens reserved 1200 credits at 14:02 on behalf of operation X for org Y." That's the granularity an org admin actually wants in their audit log.

What we kept Stripe for: the actual money

The thing I am very deliberately not building is a payment processor. I am not handling card numbers, I am not implementing 3DS, I am not arguing with chargebacks. Stripe is excellent at all of that.

So Stripe still owns the seam where money becomes credits. The package-purchase flow has two stages:

# purchase_package_use_case.py — abridged
session = await self.stripe.create_checkout_session(
    stripe_customer_id=org.stripe_customer_id,
    stripe_price_id=package.stripe_price_id,
    success_url=req.success_url,
    cancel_url=req.cancel_url,
    metadata={
        "auth_org_id": str(req.auth_org_id),
        "package_id": req.package_id,
        "credits": str(package.credits),
    },
)

That's stage one — PreparePurchaseUseCase creates a Stripe Checkout session for a credit package. No balance changes. The customer gets sent to Stripe's hosted checkout.

Stage two runs from the Stripe webhook handler when payment actually clears:

# from ApplyPurchasedCreditsUseCase
settlement = await self._settle.execute(
    auth_org_id=req.auth_org_id,
    incoming_credit=req.credits,
    service_id=req.service_id,
)
net_to_balance = settlement.remaining_credit
new_balance = credit.balance + net_to_balance

await self.credit_repo.update_balance(
    auth_org_id=req.auth_org_id,
    balance_delta=net_to_balance,
    reserved_balance_delta=Decimal(0),
)

Stripe webhook arrives, dedup'd by stripe_event_id. AR settlement happens first (if the org owed money from a chargeback, that gets paid down before new credits land in the balance). Whatever's left becomes available balance.
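
The ordering described above (dedup first, then AR paydown, then balance) can be sketched in a few lines. Everything here is hypothetical: OrgState, apply_purchase, the ar_owed field, and the event ID "evt_123" are invented for the illustration, and the in-memory set stands in for a unique constraint on stripe_event_id.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class OrgState:
    balance: Decimal = Decimal(0)
    ar_owed: Decimal = Decimal(0)  # credits owed after a chargeback
    seen_event_ids: set[str] = field(default_factory=set)


def apply_purchase(org: OrgState, stripe_event_id: str, credits: Decimal) -> Decimal:
    """Return the credits that actually landed in balance (0 for a duplicate)."""
    if stripe_event_id in org.seen_event_ids:
        return Decimal(0)  # webhook redelivery: already applied
    org.seen_event_ids.add(stripe_event_id)

    settled = min(org.ar_owed, credits)  # pay down AR before anything else
    org.ar_owed -= settled
    net_to_balance = credits - settled
    org.balance += net_to_balance
    return net_to_balance


org = OrgState(balance=Decimal(50), ar_owed=Decimal(300))
first = apply_purchase(org, "evt_123", Decimal(1000))
redelivered = apply_purchase(org, "evt_123", Decimal(1000))
```

With 300 owed, a 1000-credit purchase adds only 700 to the balance and clears the debt. The redelivered webhook is a no-op, so the balance stays at 750.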

Note the seam: purchase amounts go through Stripe, internal pool movements go through the Billing Control Plane (BCP), the homegrown ledger service. Stripe is the source of truth for "did this customer pay." BCP is the source of truth for "what's the org's available balance right now and what is each agent allowed to spend." Neither tries to be the other.

What it cost to build

This is the honest part. The credit-pool service is around 4-6 weeks of focused work to do well enough to trust in production:

Domain modeling: organizations, credits, reservations, transactions, AR, idempotency keys, outbox events.
The five core use cases (reserve, consume, consume-reservation, release, refund) plus the edge cases (chargeback, grant, AR settlement, expired-reservation cleanup).
Tests, especially concurrency tests for the lock + reservation paths.
The Stripe webhook surface — checkout success, payment intent, dispute, refund, all idempotent on stripe_event_id.
An admin surface to inspect balances and reservations during incidents.
A migration path from a previous, simpler "just deduct" model.

You are not going to ship this in a weekend, even with the schema fully designed. The concurrency tests alone took me longer than I expected, because deadlocks under SELECT FOR UPDATE ordering are subtle and the bugs only show up under load.

If this isn't core to what your product is, don't build it.

When Stripe metered is fine

To be specific about when the build vs buy answer flips back the other way — Stripe metered is the right answer if all of these are true:

One product, one usage stream per customer. Per-seat or per-API-call pricing.
You're OK with usage being consumed as it's reported (no reservations).
You're OK with the customer's "balance" being computed at invoice time, not in real-time.
Your retries can be made strictly idempotent by the Stripe Idempotency-Key header alone, without payload-hash protection.
You don't need to atomically move money between sub-products on the same account.
Your refunds are invoice-level ("we billed you wrong"), not credit-level ("here's some make-good credits for the outage").

If even one of those isn't true, you're looking at a hybrid where you keep some ledger state on your side anyway. At which point: the ledger you're keeping is the actual source of truth, and Stripe is the payment layer. Be honest about which one's the system of record.

Three things I wish I'd known sooner

Reservations are the load-bearing primitive, not consumes. I started by modeling consumption as a single atomic deduct. Reservations got bolted on later when the long-running-task problem hit. If I were starting again, I would build reserve/consume/release as the only three primitives from day one, and treat "synchronous consume" as a sugar over "reserve then immediately consume."
Payload-hash idempotency is non-negotiable. Key-only idempotency lets bugs through that you'll only catch in customer-support tickets. The day you accidentally retry the same key with a different actual_amount and silently return the cached result, you've corrupted a balance. Make it a hard error.
Keep the money seam thin and explicit. It's tempting to let your ledger know about Stripe customers, prices, invoices, taxes. Don't. The boundary is: Stripe collects money and tells you (via webhooks dedup'd by event ID); your ledger turns money-in into credits and bookkeeps every internal movement. The simpler that seam stays, the easier the next payment provider integration is when it inevitably comes up.
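
The first of those points, synchronous consume as sugar over reserve then consume, fits in a short sketch. Pool and its methods are hypothetical names for the illustration; there is no locking, idempotency, or persistence here.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from itertools import count


class InsufficientCredits(Exception):
    pass


@dataclass
class Pool:
    balance: Decimal
    reserved_balance: Decimal = Decimal(0)
    reservations: dict[int, Decimal] = field(default_factory=dict)
    _ids: count = field(default_factory=count)

    def reserve(self, amount: Decimal) -> int:
        if self.balance - self.reserved_balance < amount:
            raise InsufficientCredits(f"required={amount}")
        rid = next(self._ids)
        self.reservations[rid] = amount
        self.reserved_balance += amount
        return rid

    def consume_reservation(self, rid: int, actual: Decimal) -> None:
        reserved = self.reservations.pop(rid)
        self.balance -= actual
        self.reserved_balance -= reserved

    def release(self, rid: int) -> None:
        self.reserved_balance -= self.reservations.pop(rid)

    def consume(self, amount: Decimal) -> None:
        """Sugar: reserve, then immediately consume the same amount."""
        self.consume_reservation(self.reserve(amount), amount)


pool = Pool(balance=Decimal(100))
pool.consume(Decimal(40))          # synchronous path
rid = pool.reserve(Decimal(50))    # long-running path that then crashes
pool.release(rid)
try:
    pool.consume(Decimal(100))     # more than the 60 that is left
    rejected = False
except InsufficientCredits:
    rejected = True
```

With only three primitives, the synchronous path gets the same balance check and the same bookkeeping as the long-running one without separate code.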

I'm building this credit ledger inside a larger AI dev harness called Codens. The homegrown billing layer is the "Billing Control Plane," and it's what lets the agents share an org-wide pool with the reservation/release semantics described above. Landing page: https://www.codens.ai/

If you've built a credit ledger that survived contact with real customers and a few chargebacks, I'd love to hear which corners you cut and which ones bit you. Or if you're staring down this build vs buy decision right now and want to compare notes — also good.

DEV Community