- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You ship a multi-tenant SaaS with an AI feature. One Anthropic key fans out to every customer (swap in OpenAI, Bedrock, Vertex; the shape is the same). On Tuesday morning a single tenant burns through your minute-bucket. Usually it is the one running a backfill nobody told you about. The whole platform starts returning 429s. Your enterprise customer's CEO demo at 10 AM hits the rate limit two minutes in, because a free-tier tenant is replaying their support inbox through your summarizer.
A team I talked to had a worse version of this. They added "20 requests per minute per tenant" the first time it happened. Then a tenant with a legitimate end-of-month report burst to 40, got starved, complained loudly. They raised the limit to 100. A different tenant ran a script. Repeat. By the third iteration the per-tenant rate limit was high enough that twenty tenants together could still take down the upstream key. The naive limit had become a cargo-culted comment in the YAML.
Token budgets are the layer above the rate limit. They answer a different question: given a finite supply of upstream tokens per minute, how do I divide them across tenants in a way that the noisy ones cannot starve the quiet ones, and the paid ones cannot be queued behind the free ones. The patterns below are what survives when traffic gets bursty and the upstream is shared.
The four patterns, in the order you reach for them
Walk down the list top to bottom. Each one solves a problem the one above it created.
- Token bucket per tenant. Burst-tolerant, easy to reason about.
- Tier-based budgets. Different bucket sizes for different plans.
- Soft and hard caps. Two thresholds: shape, then deny.
- Priority queues. Pay-tier tenants jump the line when supply is tight.
A naive requests-per-minute limit is none of these. It punishes burst, ignores token cost, and treats a 200-token classification call the same as a 40k-token document summary. Throw it out.
Pattern 1: token bucket sized in upstream tokens, not requests
The unit of scarcity for an LLM SaaS is upstream tokens per minute, not requests per minute. A single 32k-context summarisation request is sixty classification calls in one. If you bill, queue, or limit on requests, your bucket is lying about the actual load.
The shape of a token bucket: each tenant has a counter tokens_remaining and a refill_rate. Every request charges its expected token cost against the bucket; the bucket refills at refill_rate tokens per second up to a capacity ceiling.
```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: int         # burst ceiling, in upstream tokens
    refill_per_sec: float # steady-state rate
    tokens: float         # current balance
    last_refill: float    # monotonic timestamp of last refill

def take(b: TokenBucket, cost: int) -> bool:
    # Refill lazily based on elapsed time, capped at capacity.
    now = time.monotonic()
    elapsed = now - b.last_refill
    b.tokens = min(
        b.capacity,
        b.tokens + elapsed * b.refill_per_sec,
    )
    b.last_refill = now
    if b.tokens >= cost:
        b.tokens -= cost
        return True
    return False
```
The cost you pass in is the expected token cost for the request: prompt size plus your max_tokens. The expected cost is wrong by definition. The model returns fewer output tokens most of the time, but the error averages out across requests and the bucket is the right shape.
A tenant who issues one 30k-token request per minute and a tenant who issues a hundred 300-token requests per minute consume the same budget. That is the property you wanted.
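To see that property concretely, here is a minimal sketch reusing the bucket shape above, with the refill rate set to zero so only the debits matter:

```python
import time
from dataclasses import dataclass

# Same bucket and take() shape as above; refill is zero here so only
# the debits matter for the comparison.
@dataclass
class TokenBucket:
    capacity: int
    refill_per_sec: float
    tokens: float
    last_refill: float

def take(b: TokenBucket, cost: int) -> bool:
    now = time.monotonic()
    b.tokens = min(b.capacity, b.tokens + (now - b.last_refill) * b.refill_per_sec)
    b.last_refill = now
    if b.tokens >= cost:
        b.tokens -= cost
        return True
    return False

now = time.monotonic()
heavy = TokenBucket(50_000, 0.0, 50_000.0, now)   # one 30k-token summary
chatty = TokenBucket(50_000, 0.0, 50_000.0, now)  # a hundred 300-token calls

take(heavy, 30_000)
for _ in range(100):
    take(chatty, 300)

print(int(heavy.tokens), int(chatty.tokens))  # 20000 20000 — identical drain
```

A requests-per-minute limit would have scored these two tenants 1 and 100; the token bucket scores them the same.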
The reservation pattern matters here. If you debit the bucket before the call and the call fails, you must refund. The cleanest way is a with block:
```python
class Reservation:
    def __init__(self, b: TokenBucket, cost: int):
        self.b = b
        self.cost = cost
        self.consumed = False

    def commit(self, actual: int) -> None:
        delta = actual - self.cost
        if delta > 0:
            # Used more than reserved: charge the difference.
            self.b.tokens -= delta
        elif delta < 0:
            # Used less than reserved: refund, capped at capacity.
            self.b.tokens = min(
                self.b.capacity,
                self.b.tokens - delta,
            )
        self.consumed = True

    def __enter__(self):
        return self

    def __exit__(self, *a):
        if not self.consumed:
            # Call never completed: refund the full reservation.
            self.b.tokens = min(
                self.b.capacity,
                self.b.tokens + self.cost,
            )
```
`commit(actual)` reconciles the reservation against the real usage: the `usage.input_tokens + usage.output_tokens` Anthropic returns. If the call never makes it (network error before the API), `__exit__` refunds the reservation. If you skip this step, every retry double-charges the bucket, and a flaky upstream silently halves every tenant's effective quota.
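The intended flow at the call site — debit, call, reconcile, with the refund kicking in on failure — looks like this. A minimal sketch with the bucket manipulated directly in place of `take()`, and a condensed `commit` that folds both branches into one `min()`:

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: int
    refill_per_sec: float
    tokens: float
    last_refill: float

class Reservation:
    def __init__(self, b: TokenBucket, cost: int):
        self.b, self.cost, self.consumed = b, cost, False

    def commit(self, actual: int) -> None:
        # Condensed version of the commit above: one min() covers both
        # the extra-charge and the refund branch.
        delta = actual - self.cost
        self.b.tokens = min(self.b.capacity, self.b.tokens - delta)
        self.consumed = True

    def __enter__(self):
        return self

    def __exit__(self, *a):
        if not self.consumed:
            self.b.tokens = min(self.b.capacity, self.b.tokens + self.cost)

b = TokenBucket(100_000, 0.0, 100_000.0, time.monotonic())

b.tokens -= 31_000  # take(): the expected cost is debited up front
try:
    with Reservation(b, 31_000):
        raise ConnectionError("upstream unreachable")  # call never happened
except ConnectionError:
    pass
# __exit__ refunded the full reservation: bucket is back to 100_000

b.tokens -= 31_000  # take() again for a second request
with Reservation(b, 31_000) as r:
    r.commit(24_500)  # the model returned fewer output tokens than reserved
print(int(b.tokens))  # 75500 — only the actual usage stays charged
```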
Pattern 2: tier-based budgets that map to your pricing page
The token bucket is per-tenant. The bucket parameters are per-tier. A free tenant and an enterprise tenant share the upstream key but not the budget that draws from it.
The two numbers that define a tier are capacity (the burst ceiling) and refill_per_sec (the steady-state rate). The relationship between them is the time-to-recover after a full burst: capacity / refill_per_sec seconds. Pick that number from the product side, not the engineering side.
```python
TIERS = {
    "free": {
        "capacity": 50_000,
        "refill_per_sec": 100.0,
    },
    "pro": {
        "capacity": 500_000,
        "refill_per_sec": 1_000.0,
    },
    "enterprise": {
        "capacity": 5_000_000,
        "refill_per_sec": 10_000.0,
    },
}
```
A free tenant gets a 50k burst that recovers in about eight minutes. A pro tenant gets 500k that recovers in the same eight minutes (same shape, ten times the size). An enterprise tenant gets ten times that again.
The numbers are illustrative, not load-bearing. The shape is. Two rules to keep:
- Tier limits should be a small integer multiple of your actual upstream rate divided across tenants. If your upstream is 80k tokens per minute and you have 200 free tenants, no individual free tier should be allowed to reserve more than a small fraction of that.
- The sum of all per-tenant `refill_per_sec` should be less than the upstream's steady-state rate, with headroom for bursts. If it is more, you have oversold and the system will queue behind the upstream rate limiter on every busy hour.
The second rule is what makes the difference between a quota system and a wishlist. If your tier table sums to more than upstream supply, every burst becomes a 429.
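Rule two is cheap to check in CI. A sketch under assumed numbers — the upstream limit and tenant counts below are invented, not real Anthropic figures:

```python
# Illustrative oversell check. UPSTREAM_TPM is an assumed tokens-per-minute
# limit on the shared key; TENANT_COUNTS is an assumed fleet.
UPSTREAM_TPM = 2_000_000
HEADROOM = 0.8  # keep 20% of upstream free for bursts

TIERS = {
    "free":       {"capacity": 50_000,    "refill_per_sec": 100.0},
    "pro":        {"capacity": 500_000,   "refill_per_sec": 1_000.0},
    "enterprise": {"capacity": 5_000_000, "refill_per_sec": 10_000.0},
}
TENANT_COUNTS = {"free": 200, "pro": 30, "enterprise": 2}

# Total steady-state tokens per minute the tier table has promised.
committed_per_min = sum(
    TIERS[t]["refill_per_sec"] * 60 * n for t, n in TENANT_COUNTS.items()
)
budget = UPSTREAM_TPM * HEADROOM
print(committed_per_min, budget, committed_per_min <= budget)
# → 4200000.0 1600000.0 False — this fleet is oversold
```

For the assumed fleet above the check fails: the tier table promises 4.2M tokens per minute against a 1.6M budget, which is exactly the wishlist failure mode.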
Pattern 3: soft caps that shape, hard caps that deny
A single threshold is too blunt. At 99% of the bucket the tenant is fine; at 101% the tenant is denied. Real workloads do not need that cliff.
Two thresholds give you a usable middle:
- Soft cap: at 80% of the bucket consumed in the current window, start shaping. Reject low-priority requests, queue high-priority ones, page the tenant's billing contact if you have one, surface a banner in the product. The point is to slow the bleed, not to stop it.
- Hard cap: at 100%, deny. The tenant gets a clean structured error with a `Retry-After` header and a link to the billing page.
The shaping logic at the soft cap is where the design earns its keep:
```python
def admit(
    bucket: TokenBucket,
    cost: int,
    priority: int,
) -> str:
    used = 1 - (bucket.tokens / bucket.capacity)
    if used < 0.8:
        return "ok"
    if used < 1.0:
        # Soft-cap zone: admit high-priority, shed the rest.
        return "ok" if priority >= 5 else "shed"
    return "deny"
```
priority is the request's importance on a 0–10 scale. Interactive user-facing calls get high priority; background batch jobs get low priority. A tenant running a nightly digest at priority 1 sheds first; the same tenant's interactive chat call at priority 9 keeps going until the hard cap.
This needs cooperation from the request originator. If every caller hard-codes priority=10, you get the cliff back. The fix is a sane default tied to the entry point: API-key requests default to priority 5, the in-product chat path defaults to 8, the cron-triggered digest path defaults to 2. Make the defaults reflect the truth of the traffic and you do not need to trust callers to be honest.
Pattern 4: priority queues for paid-tier tenants when supply is tight
Buckets are fair when supply is plentiful. They are unfair when supply is genuinely scarce: when the upstream key is at its own rate limit and every tenant's request has to queue. At that point "first-come-first-served" punishes the enterprise customer who arrived two milliseconds after a free tenant's batch job.
A priority queue keyed on tier solves this. The queue is the gate between admission control (the buckets above) and the upstream call. When the call rate exceeds upstream supply, the queue grows; pulls happen in tier order, not arrival order.
```python
import heapq

class TierQueue:
    def __init__(self):
        self.h = []
        self.seq = 0

    def push(self, tier_weight: int, item):
        # Lower weight = higher priority; seq keeps FIFO within a tier.
        heapq.heappush(
            self.h,
            (tier_weight, self.seq, item),
        )
        self.seq += 1

    def pop(self):
        if not self.h:
            return None
        return heapq.heappop(self.h)[2]

TIER_WEIGHT = {
    "enterprise": 0,
    "pro": 1,
    "free": 2,
}
```
tier_weight is the priority: enterprise = 0 (front of the queue), free = 2 (back). The seq counter breaks ties by FIFO inside a tier, so free tenants do not starve each other.
Two foot-guns:
- Bounded queue depth. An unbounded priority queue under sustained overload turns into a memory leak. Cap the queue at a number that fits your latency SLO. A hundred requests at the upstream's worst-case latency is a usable starting number. Shed the lowest-priority items past that cap.
- Starvation guard. Pure tier-priority will starve free tenants for as long as enterprise traffic exists. Add a max-wait timer per item: if a free request has been queued for more than thirty seconds, promote it to the front of its tier. The system stays loosely fair under sustained pressure.
A team I talked to shipped this without the starvation guard and watched their free-tier dashboards go quiet for an entire afternoon during an enterprise batch run. The fix was twelve lines.
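A sketch of that guard on the same heap shape, with the max wait as a tunable constructor argument rather than a hard-coded thirty seconds. The linear scan in `pop()` is fine at the bounded queue depths recommended above:

```python
import heapq
import time

class GuardedTierQueue:
    def __init__(self, max_wait: float = 30.0):
        self.h = []
        self.seq = 0
        self.max_wait = max_wait  # promotion threshold, seconds

    def push(self, tier_weight: int, item):
        # Each entry carries its enqueue timestamp for the guard.
        heapq.heappush(self.h, (tier_weight, self.seq, time.monotonic(), item))
        self.seq += 1

    def pop(self):
        if not self.h:
            return None
        now = time.monotonic()
        # Anything queued longer than max_wait jumps the tier order.
        starving = [e for e in self.h if now - e[2] > self.max_wait]
        if starving:
            oldest = min(starving, key=lambda e: e[1])  # FIFO among starving
            self.h.remove(oldest)
            heapq.heapify(self.h)
            return oldest[3]
        return heapq.heappop(self.h)[3]
```

A free request that has aged past the threshold now wins the pop even while enterprise items sit in the heap, which keeps the queue loosely fair without giving up tier ordering in the common case.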
Observability: per-tenant gauges, exhaustion events, $/req attribution
The patterns above only work if you can see them. Three signals belong on the dashboard from day one.
Per-tenant token-spend gauge. A live gauge per tenant, sampled every minute, showing tokens consumed against bucket capacity. Useful for both incident response (which tenant is the noisy neighbor) and product (which tenant is about to upgrade or churn). The shape on OTel:
```python
from opentelemetry import metrics

m = metrics.get_meter("llm_quota")

# BUCKETS and TENANT_TIER are the per-tenant registries from the
# patterns above: tenant_id -> TokenBucket, tenant_id -> tier name.
def observe(options):
    for tenant_id, b in BUCKETS.items():
        used = b.capacity - b.tokens
        yield metrics.Observation(
            used,
            attributes={
                "tenant_id": tenant_id,
                "tier": TENANT_TIER[tenant_id],
            },
        )

tokens_used = m.create_observable_gauge(
    name="llm.tenant.tokens_used",
    callbacks=[observe],  # the SDK polls this on each collection cycle
    description="Bucket tokens consumed",
)
```
Quota-exhaustion events. Every hard-cap denial is a structured event, not a log line. The fields you want: tenant_id, tier, priority, cost_requested, tokens_remaining, recovery_seconds. Wire them to a counter so the rate of exhaustion per tier becomes a chart.
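A minimal sketch of the event, assuming the bucket shape from pattern 1; `on_hard_cap` and the `Bucket` stand-in are illustrative, the field names are the ones listed above:

```python
from dataclasses import dataclass, asdict

# Minimal stand-in for the TokenBucket defined earlier.
@dataclass
class Bucket:
    capacity: int
    refill_per_sec: float
    tokens: float

@dataclass
class QuotaExhausted:
    tenant_id: str
    tier: str
    priority: int
    cost_requested: int
    tokens_remaining: float
    recovery_seconds: float

def on_hard_cap(bucket: Bucket, tenant_id: str, tier: str,
                priority: int, cost: int) -> dict:
    return asdict(QuotaExhausted(
        tenant_id=tenant_id,
        tier=tier,
        priority=priority,
        cost_requested=cost,
        tokens_remaining=bucket.tokens,
        # Seconds until the bucket refills enough to cover this request.
        recovery_seconds=max(0.0, (cost - bucket.tokens) / bucket.refill_per_sec),
    ))
```

The returned dict is what you hand to your event pipeline and increment the per-tier counter with; `recovery_seconds` doubles as the `Retry-After` value for the denial response.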
Dollars-per-request attribution. The token bucket is the control surface; the cost surface is the matching report. For each call, multiply usage.input_tokens by the model's input price and usage.output_tokens by output price (read the live numbers from the Anthropic pricing page and re-check before forecasting; the rates move). Pricing is set by the vendor and can change; treat any number derived from this post as a sketch, not a quote. Sum per tenant per day. Match against the tenant's MRR. The ratio is your cost-of-goods per tenant.
You want $/req specifically. Two tenants on the same model can have wildly different output-to-input ratios. A tenant who summarizes legal docs has a tiny output relative to input. A tenant who drafts marketing copy has output that rivals input. The token gauge hides that. The dollar attribution does not.
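The attribution arithmetic, sketched with placeholder prices (the model name and dollar figures below are invented, not real Anthropic rates):

```python
# Placeholder price table in dollars per million tokens — NOT real rates,
# read the live pricing page before using numbers like these.
PRICE_PER_MTOK = {
    "model-a": {"input": 3.00, "output": 15.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Two calls with the same 91k total tokens, very different output ratios.
summarizer = call_cost("model-a", 90_000, 1_000)   # legal-doc summary
drafter    = call_cost("model-a", 45_500, 45_500)  # marketing copy

print(round(summarizer, 4), round(drafter, 4))  # 0.285 0.819
```

Same token gauge reading, nearly three times the dollar cost: that gap is what $/req surfaces and the token gauge hides.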
What to ship on Monday
If you have one Anthropic key behind a multi-tenant product and no quota layer, the order is:
- Add a per-tenant token bucket today, with one tier and conservative numbers. The shape matters more than the numbers; you tune the numbers in week two.
- Add the soft-cap shaping and a priority field on every request. Default priorities by entry point, not by caller honesty.
- Wire the per-tenant token-spend gauge and the exhaustion-event counter. Plot them next to your upstream 429 rate. The right month-over-month is "buckets full, exhaustions flat, 429s near zero."
- When two tenants of different tiers start contending for the same upstream slot, add the tier queue. They will.
A noisy neighbor on a shared upstream key is a load-shedding problem dressed up as a billing problem. The four patterns are how you get it back to looking like one.
If this was useful
The LLM Observability Pocket Guide covers the rest of the multi-tenant cost stack: the OTel span attributes that make per-tenant attribution honest, the dashboard shape that surfaces a noisy neighbor before the support ticket lands, and the failure-mode catalog for shared upstream keys. The chapters on cost attribution and per-tenant SLOs pair directly with the bucket-and-queue design above.
