DEV Community: Jasmine Park

One tenant, one polling loop, and $9,100 we didn't budget for

Jasmine Park — Tue, 21 Jul 2026 17:28:17 +0000

The page that mattered never fired. Finance sent it instead, on a Thursday, in Slack: "Projected LLM spend this month is 41% over. Is that expected?" It was not expected. It took me four hours to find out why, and every one of those hours was a monitoring failure, not a model failure.

Here is the short version. One customer shipped a change that put an LLM call inside a polling loop. Their integration went from roughly 2,000 calls a day to about 140,000 calls a day, overnight. We paid for all of it. It ran for six days before anyone looked, because every dashboard I owned was aggregate, and the aggregate looked boring.

The hot take I keep repeating in incident reviews: cost blowups are a metering gap. The bill breaks down per tenant, and if you only watch the total you miss the tenant that is on fire.

What happened

The customer's feature polled a downstream job for completion. Someone refactored it and, instead of polling the job status, the loop re-issued the full LLM summarization call on every tick. Tick interval was five seconds. Do the math on a handful of concurrent sessions running most of the day and you land around 140k calls. Their code, our endpoint, our invoice.

None of it errored. That is the part that stings. Every one of those calls returned a clean 200. Latency was fine. Our SLO burn was zero. The system was, by every signal we alerted on, perfectly healthy. It was also setting money on fire at a steady rate.

What it cost

Our average cost per call for that feature is about $0.011 (short prompt, capped output). Normal footprint for this tenant:

2,000 calls/day
about $22/day

After the loop bug:

about 140,000 calls/day
about $1,540/day

The delta is about $1,518 a day. It ran six days before finance flagged it. That is roughly $9,100 of spend we had no budget line for, on one tenant, on one feature, for output nobody consumed (they were throwing away 69 of every 70 responses).
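The arithmetic above fits in a few lines. This is a minimal sketch using only the figures quoted in this post; the variable names are illustrative and do not come from any billing system.

```python
# Figures quoted in the post; names are illustrative.
COST_PER_CALL = 0.011      # dollars, average for this feature
NORMAL_CALLS = 2_000       # calls/day before the loop bug
RUNAWAY_CALLS = 140_000    # calls/day after it
TICK_SECONDS = 5           # polling interval
DAYS_UNNOTICED = 6

normal_spend = NORMAL_CALLS * COST_PER_CALL
runaway_spend = RUNAWAY_CALLS * COST_PER_CALL
daily_delta = runaway_spend - normal_spend
unbudgeted = daily_delta * DAYS_UNNOTICED

# One session polling around the clock issues 86,400 / 5 calls a day.
calls_per_session_day = 86_400 // TICK_SECONDS
session_days = RUNAWAY_CALLS / calls_per_session_day

print(f"normal ${normal_spend:,.0f}/day, runaway ${runaway_spend:,.0f}/day")
print(f"delta ${daily_delta:,.0f}/day, ${unbudgeted:,.0f} over {DAYS_UNNOTICED} days")
print(f"volume multiplier {RUNAWAY_CALLS / NORMAL_CALLS:.0f}x")
print(f"about {session_days:.1f} always-on polling sessions explain the volume")
```

The last line is the "handful of concurrent sessions" claim made concrete: at a five-second tick, roughly eight sessions polling all day are enough to reach 140,000 calls.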

For context, our whole platform's LLM spend was averaging around $1,800/day. This one tenant nearly doubled it. And I still did not see it, because our daily total already swung 30% on a normal week (weekend dips, batch backfills, onboarding spikes). A jump from $1,800 to $3,300 read like "big customer doing something legitimate." We had genuinely had 2x days before that were fine. So the signal sat inside the noise band, and the noise band was wide because I had never bothered to narrow it per tenant.

What it missed at scale

The scary math is not the $9,100. It is the slope. At $1,518/day this was a $45k/month leak from a single misbehaving integration, and we have hundreds of tenants. Any one of them can do this to us on any day. Our exposure was never "how expensive is the model." It was "how long can one runaway tenant run before a human happens to look at a bill." Six days, apparently. That is the actual SLO I had, and I had never written it down or defended it.

The fix

The refactor was easy: tag every call, attribute cost per tenant and per feature, and alert on spend rate the same way I alert on error rate. We already emitted tenant_id and feature on the request span. We just were not multiplying tokens by price and summing by tenant anywhere a human or an alert would see it.

The core alert is a spend-rate anomaly per tenant against that tenant's own recent baseline. Rolled up hourly, it looks like this:

-- hourly spend per tenant vs its trailing 7-day median
WITH hourly AS (
  SELECT tenant_id,
         date_trunc('hour', ts) AS hr,
         SUM(calls * cost_per_call) AS spend
  FROM llm_calls
  WHERE ts > now() - interval '8 days'
  GROUP BY 1, 2
),
baseline AS (
  SELECT tenant_id,
         percentile_cont(0.5) WITHIN GROUP (ORDER BY spend) AS med_spend
  FROM hourly
  WHERE hr < date_trunc('hour', now())      -- exclude current window
  GROUP BY tenant_id
)
SELECT h.tenant_id, h.spend, b.med_spend,
       round((h.spend / nullif(b.med_spend, 0))::numeric, 1) AS ratio  -- numeric cast so round(x, 1) resolves on PostgreSQL
FROM hourly h
JOIN baseline b USING (tenant_id)
WHERE h.hr = date_trunc('hour', now())
  AND h.spend > b.med_spend * 3            -- 3x its own median
  AND h.spend > 5                          -- ignore trivial tenants
ORDER BY ratio DESC;

Two guards matter. Compare a tenant to itself, not to the fleet (a big customer is allowed to be big, they just are not allowed to suddenly 70x). And put a floor on absolute spend so you do not page at 3am because a tiny tenant went from 4 cents to 15. On our data, this alert would have fired inside the first hour of the loop bug, at a ratio near 70x, six days before finance did.
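For anyone whose spend data lives outside SQL, the same two guards can be sketched in plain Python. This is a minimal sketch, not the production alert: the function name, the dictionary shapes, and the sample tenants are all hypothetical, and the hourly figures are back-calculated from the per-day numbers earlier in this post.

```python
from statistics import median

RATIO_THRESHOLD = 3.0   # page at 3x the tenant's own median
SPEND_FLOOR = 5.0       # dollars/hour; below this, stay quiet


def spend_anomalies(history, current):
    """history: {tenant_id: [hourly spend, ...]} for the trailing window.
    current: {tenant_id: spend in the current hour}.
    Returns (tenant_id, spend, median, ratio) rows, worst ratio first."""
    rows = []
    for tenant, spend in current.items():
        past = history.get(tenant)
        if not past:
            continue                      # no baseline yet; handle separately
        base = median(past)
        if base <= 0:
            continue                      # a zero baseline has no meaningful ratio
        if spend > base * RATIO_THRESHOLD and spend > SPEND_FLOOR:
            rows.append((tenant, spend, base, round(spend / base, 1)))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Hypothetical data: $22/day is about $0.92/hour, $1,540/day is about $64/hour.
history = {
    "big-co": [0.90, 0.95, 0.92, 0.88, 0.91],
    "tiny": [0.04, 0.04, 0.05],
}
current = {"big-co": 64.17, "tiny": 0.15}
flagged = spend_anomalies(history, current)
print(flagged)
```

The tiny tenant more than triples and still stays quiet because of the floor. The runaway tenant clears both guards at about 70 times its own median.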

I also wired a projection: current month-to-date spend, extrapolated to month end, compared against budget. That is the check that would have caught it even if the per-tenant alert had a gap. It is cheap and it maps directly to the number the business actually cares about.
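A linear projection of this kind is short enough to show. This is a sketch under assumptions: the function names, the 15% tolerance default, and the sample budget are illustrative, and it treats the current day as complete.

```python
import calendar
from datetime import date


def project_month_end(mtd_spend, today):
    """Linear projection: average daily spend so far, times days in the month.
    Counts `today` as a full day, so a mid-day run undershoots slightly."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month


def over_budget(mtd_spend, budget, today, tolerance=0.15):
    """True when the projection exceeds budget by more than `tolerance`."""
    return project_month_end(mtd_spend, today) > budget * (1 + tolerance)


# Hypothetical month: budget of 31 days at $1,800, with six days of a
# $1,518/day leak inside the first sixteen days.
budget = 31 * 1_800
mtd = 16 * 1_800 + 6 * 1_518
today = date(2026, 7, 16)
print(f"projected ${project_month_end(mtd, today):,.0f} against ${budget:,.0f}")
print("page" if over_budget(mtd, budget, today) else "quiet")
```

A linear projection overreacts early in the month, when a single expensive day dominates the average, so it is worth pairing with a minimum number of elapsed days before it is allowed to page.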

What I'd page on

This is the dashboard and alert set I now run for anything that spends money per request. If you take one thing, take the fact that none of these are latency or error signals. Cost needs its own alerts, the same way latency and errors do.

Per-tenant spend rate. Page when any tenant's hourly spend exceeds 3x its trailing 7-day median, with a floor (we use $5/hour) so trivial tenants stay quiet. (The current hour is partial, so this catches a 60-70x spike fast and lags a subtle drift; pair it with the projection below.)
Top-tenant concentration. Page if a single non-whitelisted tenant crosses 40% of total hourly spend. One customer owning the bill is an incident until proven otherwise.
Calls/min per (tenant, feature). Page on step changes (greater than 5x hour over hour). This catches the loop before the dollars pile up, because request-count moves before the invoice does.
Cost per call, p99, per feature. Page when it climbs. Rising cost per call with flat volume means prompt or context bloat, a different leak with the same symptom.
Month-end spend projection vs budget. Page if the linear projection exceeds budget by more than 15%. This is the backstop that speaks finance's language.
New entrants in the top-10 spenders. Not a page, a daily digest. A (tenant, feature) pair you have never seen in the top 10 is worth 30 seconds of a human's attention.
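Two of the checks in that list, top-tenant concentration and the calls-per-minute step change, are simple enough to sketch. Everything here is illustrative: the function names, the `min_calls` floor, and the sample tenants are assumptions, not part of the alerting described above.

```python
def top_tenant_share(hourly_spend, allowlist=frozenset()):
    """hourly_spend: {tenant_id: dollars this hour}.
    Returns (tenant_id, share of total) for the largest tenant not on the
    allowlist, or None when there is nothing to compare."""
    total = sum(hourly_spend.values())
    candidates = {t: s for t, s in hourly_spend.items() if t not in allowlist}
    if total <= 0 or not candidates:
        return None
    tenant = max(candidates, key=candidates.get)
    return tenant, candidates[tenant] / total


def step_changes(prev_hour, this_hour, factor=5.0, min_calls=100):
    """Call counts keyed by (tenant_id, feature). Flags keys whose count grew
    by more than `factor` hour over hour. `min_calls` is an assumed floor, in
    the same spirit as the spend floor, so tiny pairs stay quiet."""
    flagged = []
    for key, now in this_hour.items():
        before = prev_hour.get(key, 0)
        if now >= min_calls and now > factor * max(before, 1):
            flagged.append((key, before, now))
    return flagged


# Hypothetical hour during the loop bug: about 5,833 calls and $64 from one tenant.
spend = {"big-co": 64.17, "a": 30.00, "b": 24.00, "c": 20.08}
tenant, share = top_tenant_share(spend)
print(f"{tenant} holds {share:.0%} of hourly spend")

key = ("big-co", "summarize")
print(step_changes({key: 83}, {key: 5_833}))
```

The step-change check works on request counts alone, so it needs no price table and can fire before the cost pipeline has caught up.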

The uncomfortable lesson: a green dashboard is not proof that nothing is wrong. It is proof that nothing you decided to measure is wrong. I measured latency and errors because those page loudly and customers complain. Nobody complains about a bill that is quietly too high, so nobody measured it, so it ran for six days.

Our LLM service had no backpressure. The provider got 7x slower and our p99 got 25x worse.

Jasmine Park — Fri, 17 Jul 2026 19:21:07 +0000

TL;DR. Our summarization endpoint holds a p99 under 3 seconds. On 21/05 our provider degraded: median call time went from 620ms to about 4.3 seconds, roughly 7x. Our p99 went to 35 seconds, roughly 25x, and it stayed there for 52 minutes. The provider was slow for 12 of those minutes. The other 40 were us draining a queue we had never bounded and could not see. Every arrival became an in-flight task, in-flight peaked near 9,000 across 12 pods, and the latency budget went on queue wait rather than on the provider. CPU sat at 7%, so the autoscaler never moved: the workers were I/O-bound, parked on a socket, burning no CPU at all. More pods were never the answer. What worked: a bounded queue, a concurrency semaphore, a wait budget that drops requests whose caller already left, and two metrics where we previously had one. Queue wait and provider latency are different numbers. Report them as one and you will blame your provider for your own queue.

Our summarization endpoint has one SLO anyone cares about: p99 under 3 seconds, 99% of the month. It held for two quarters. On 21/05 it missed for 52 minutes and burned about 13% of a 28-day error budget in an afternoon.

The provider was slow for 12 minutes.

That gap is the whole post. A 12-minute problem upstream became a 52-minute problem for us, and the extra 40 minutes were self-inflicted. All numbers here are ours, rounded, from our incident. The code at the bottom runs.

The shape of it

14:14, latency alert: p99 over 3s for five minutes. I opened the provider's status page, which was green, and our dashboard, which said p99 31s and climbing.

Nothing had deployed since 09:40. Error rate was 0.02%, which is normal. The endpoint was returning 200s. Slowly, but returning them.

Our provider-latency graph had gone from 620ms median to about 4.3s. Real degradation, and 7x. I filed a ticket with them and started the incident note with "upstream provider degradation" as the cause, because that is exactly what it looked like.

Then the arithmetic stopped working.

7x on a 620ms call gets you 4.3s. Our p99 was 35s. Even if every request paid the full degraded latency twice, that is 9s, not 35. Something was adding twenty-five seconds that was not the provider.

At 14:24 the provider recovered. Median went back to 640ms. I watched our p99 and it did not move. It sat at 35s for another forty minutes, on a healthy provider, with a green status page, while the alert kept firing.

At 15:06 I restarted the deployment. That dropped every queued request on the floor, and p99 was back under 3s within ninety seconds. A rolling restart is a crude, indiscriminate load-shed, and it is the only thing that ended the incident. I did not enjoy learning that.

A service still broken forty minutes after its dependency is healthy is not suffering from its dependency. It is suffering from what it did while its dependency was unhealthy. We had built a backlog, and the backlog had to clear before anybody got a fast answer. The provider's 12 bad minutes bought us 40 of our own.

Why the autoscaler sat still

First thing I checked was the HPA. CPU: 7%. Target: 70%. Nothing was computing, so the autoscaler had never had a reason to act and by its own logic it was right: every pod was sitting on await, holding a socket open to a provider taking four seconds to answer.

I wrote up the general version of this a couple of weeks ago, the week we found our autoscaler tracking request rate while the bill tracked tokens, so I will spare you the re-derivation and give you the line that transfers: scaling on a signal your incident cannot move is the same as not scaling at all. Here it would also have hurt, because more pods means more concurrent calls into a provider that was already saturating for us.

The saturation signal for an I/O-bound LLM service is in-flight requests against a concurrency limit, and queue depth against a queue bound. We had neither number, because we had neither limit.

The queue I could not see

Here is what I shipped, simplified to the shape that matters:

@app.post("/summarize")
async def summarize(req: Request):
    body = await req.json()
    resp = await client.post(PROVIDER_URL, json=build(body))   # httpx.AsyncClient
    return parse(resp)

No queue in that code. That is the problem: there is no queue in that code that I can name, measure, or bound. There are three, and I wrote none of them.

The event loop's task list is the first. Uvicorn accepts a connection, schedules a coroutine. At 40 requests per second with each parked for 4.3 seconds, you are holding about 172 at once. Nothing in that handler stops the number growing.

The second lives inside httpx. A default AsyncClient carries Limits(max_connections=100, max_keepalive_connections=20) (documented in the httpx resource-limits docs; I also checked httpx._config.DEFAULT_LIMITS on 0.28.1 to be sure). The 101st concurrent request does not fail and does not reach the provider. It waits for a free connection, inside the pool, with no metric on it. We were timing await client.post(...), which includes that wait. So our "provider latency" graph was never measuring the provider. It measured the provider plus however long we sat in our own connection pool, and during the incident the second term dominated.

Third is the kernel accept queue, which I will not pretend I looked at.

None of these are bugs. Each is a sensible default doing exactly what it documents. Together, with 40 rps arriving and no admission control anywhere, they add up to a queue that grows without bound and reports itself to you as "the provider is slow."

What it cost

Close to nothing on the invoice, so this section is short: the provider bills tokens, and a slow call costs the same as a fast one. Financially this incident was a rounding error. That is precisely why it ran for 52 minutes without anyone escalating on cost.

It was expensive in the budget that applied. At 40 rps, a 28-day window is roughly 97 million requests, so a 99% latency SLO permits about 970,000 breaches of the 3-second line. We put roughly 125,000 requests over it in 52 minutes: about 13% of a month's error budget, in under an hour, caused by a dependency that was unhealthy for 12 minutes.

Around 100,000 of those had already timed out at 30 seconds and gone. We computed responses in full and wrote them to closed sockets.
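The error-budget arithmetic can be reproduced directly. This is a small sketch with the post's own figures; it assumes, as the text does, that every request arriving during the 52 minutes breached the 3-second line.

```python
RPS = 40
WINDOW_DAYS = 28
SLO_TARGET = 0.99          # fraction of requests that must finish under 3 s
INCIDENT_MINUTES = 52

window_requests = RPS * 86_400 * WINDOW_DAYS
allowed_breaches = window_requests * (1 - SLO_TARGET)
incident_breaches = RPS * 60 * INCIDENT_MINUTES   # assumes all of them breached
burned = incident_breaches / allowed_breaches

print(f"requests in window   {window_requests:,}")
print(f"breaches allowed     {allowed_breaches:,.0f}")
print(f"breaches in incident {incident_breaches:,}")
print(f"budget burned        {burned:.1%}")
```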

Little's Law, and the number I did not have

The arithmetic that explains all of this is 65 years old and fits on one line.

L = λW. The average number of items in a queuing system equals the average arrival rate times the average time each item spends in it. John Little published the proof in 1961 (A Proof for the Queuing Formula: L = λW, Operations Research 9(3), 383-387) and wrote a genuinely readable retrospective on its fiftieth anniversary: Little's Law as Viewed on Its 50th Anniversary, Operations Research 59(3), 2011. It assumes almost nothing. No distribution, no independence, no particular queue discipline, and it holds when arrivals are nonstationary, which is exactly what an incident is. It describes your service whether or not you have thought about it.

Healthy: λ = 40 rps, W = 0.62s, so L = 25 concurrent calls. Comfortable. It is why nobody had ever needed a concurrency limit to exist.

Degraded: W = 4.3s, so holding 40 rps requires L = 172 concurrent calls. Twelve pods at 100 pooled connections each gave us room for 1,200, so we could physically open 172. We did. And that is what did the damage, because the provider would not serve 172 of our calls at once. Their effective throughput for us fell to about 27 rps.

Now the law runs the wrong way. Arrivals 40. Departures 27. The queue grows at 13 per second, 780 per minute, and over 12 minutes that is about 9,400. We measured a peak near 9,000 in-flight. That agreement is the only reason I trust the reconstruction at all.

Then invert it. L = 9,000, λ = 40, so W = L/λ = 225 seconds. Our clients time out at 30. At the peak we were producing answers that were, on average, 195 seconds too late to be wanted, and we kept producing them for forty minutes after the provider was fine. Nothing recovers from that on its own, which is why a restart was the only lever left.
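All of the numbers in this section come from three one-line functions. This is a sketch of that arithmetic with the figures from the incident; the function names are illustrative.

```python
def in_flight(arrival_rps, seconds_in_system):
    """Little's Law, L = lambda * W."""
    return arrival_rps * seconds_in_system


def backlog_after(arrival_rps, departure_rps, seconds):
    """Backlog built while arrivals outpace departures."""
    return max(arrival_rps - departure_rps, 0) * seconds


def time_in_system(items, arrival_rps):
    """Little's Law inverted, W = L / lambda."""
    return items / arrival_rps


CLIENT_TIMEOUT = 30  # seconds

print(f"healthy in-flight   {in_flight(40, 0.62):.0f}")
print(f"degraded in-flight  {in_flight(40, 4.3):.0f}")
print(f"backlog after 12m   {backlog_after(40, 27, 12 * 60):,.0f}")
late = time_in_system(9_000, 40) - CLIENT_TIMEOUT
print(f"average lateness    {late:.0f}s past the client timeout")
```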

The fix: a queue you can name

More pods would not have helped. A longer timeout is the other reflex, and it is strictly worse: a longer timeout means the caller waits longer before leaving, which raises W, which raises L. Timeouts are not a capacity strategy.

Four things went in that week. Roughly in the order I would put them back:

A concurrency semaphore. A hard cap on simultaneous provider calls, set below the point where the provider's latency curve bends. This is the one that stops 172 from ever happening. (asyncio.Semaphore is the whole implementation.)
A bounded queue with fail-fast admission. Past the bound, refuse immediately: 429 plus Retry-After. That is the same contract as the token-budget admission gate from that same write-up, moved down a layer and re-denominated. That one bounded tokens in flight, because tokens were what saturated the GPU. This one bounds calls in flight, because concurrency is what saturates a provider you do not own. asyncio.Queue(maxsize=N) raises QueueFull from put_nowait the moment it is full, which is the behaviour you want (see the asyncio.Queue docs). A 429 in one millisecond is a kinder answer than a 200 in 225 seconds.
A wait budget. When a worker picks a job up, check how long it sat. If it sat longer than the caller will wait, drop it and never call the provider. Calling on behalf of someone who has already hung up spends provider capacity that the callers still waiting need.
Two timers instead of one. One extra perf_counter() call. Cheapest thing on this list and the reason I understood any of the rest, which is why I have given it room further down rather than a bullet.

The whole thing, runnable, no dependencies:

"""Bounded concurrency + admission control for LLM calls.  python3 pool.py"""
import asyncio
import time
from dataclasses import dataclass, field

CONCURRENCY = 8    # simultaneous provider calls
QUEUE_MAX = 11     # admission buffer: drain rate x WAIT_BUDGET, not a vibe
WAIT_BUDGET = 2.0  # sec. waited longer than this and the caller is gone


class Shed(Exception):
    """Queue full at admission. -> 429 + Retry-After."""


class Stale(Exception):
    """Waited past budget. -> 503. Never reaches the provider."""


@dataclass
class Job:
    rid: int
    fut: asyncio.Future
    enqueued_at: float = field(default_factory=time.perf_counter)


class BoundedPool:
    def __init__(self, call, concurrency=CONCURRENCY, queue_max=QUEUE_MAX,
                 wait_budget=WAIT_BUDGET):
        self.call, self.wait_budget, self._n = call, wait_budget, concurrency
        self.q = asyncio.Queue(maxsize=queue_max)
        self.sem = asyncio.Semaphore(concurrency)
        self.wait_ms, self.provider_ms = [], []
        self.ok = self.shed = self.stale = 0

    async def __aenter__(self):
        self._workers = [asyncio.create_task(self._worker()) for _ in range(self._n)]
        return self

    async def __aexit__(self, *exc):
        for t in self._workers:
            t.cancel()
        await asyncio.gather(*self._workers, return_exceptions=True)

    async def submit(self, rid):
        job = Job(rid, asyncio.get_running_loop().create_future())
        try:
            self.q.put_nowait(job)        # QueueFull once maxsize is reached
        except asyncio.QueueFull:
            self.shed += 1
            raise Shed(f"req {rid}: {self.q.qsize()} already queued, refusing")
        return await job.fut

    async def _worker(self):
        while True:
            job = await self.q.get()
            try:
                async with self.sem:
                    # Everything above this line is our fault, not theirs.
                    waited = time.perf_counter() - job.enqueued_at
                    self.wait_ms.append(waited * 1000)
                    if waited > self.wait_budget:
                        self.stale += 1
                        job.fut.set_exception(Stale(f"req {job.rid}: {waited:.1f}s"))
                        continue
                    t0 = time.perf_counter()
                    try:
                        result = await self.call(job.rid)
                    except Exception as e:
                        job.fut.set_exception(e)
                    else:
                        self.ok += 1
                        job.fut.set_result(result)
                    finally:
                        self.provider_ms.append((time.perf_counter() - t0) * 1000)
            finally:
                self.q.task_done()


def pct(xs, p):
    s = sorted(xs)
    return s[min(int(round(p / 100 * (len(s) - 1))), len(s) - 1)] if s else 0.0


# A stand-in provider with a real ceiling: serves 12 at a time, never errors,
# and goes 7x slower two seconds in. Swap in your own client here.
_sem, START = None, 0.0


async def provider(rid):
    async with _sem:
        await asyncio.sleep(1.40 if time.perf_counter() - START > 2.0 else 0.20)
        return f"resp-{rid}"


async def main():
    global _sem, START
    _sem, START = asyncio.Semaphore(12), time.perf_counter()
    async with BoundedPool(provider) as pool:
        async def one(rid):
            try:
                await pool.submit(rid)
            except (Shed, Stale):
                pass  # in a request handler, return 429 / 503 here

        tasks = []
        for rid in range(400):
            tasks.append(asyncio.create_task(one(rid)))
            await asyncio.sleep(1 / 40)   # 40 rps of arrivals
        await asyncio.gather(*tasks, return_exceptions=True)

        print(f"served={pool.ok}  shed={pool.shed}  stale={pool.stale}")
        print(f"queue wait  p50={pct(pool.wait_ms, 50):6.0f}ms  p99={pct(pool.wait_ms, 99):6.0f}ms")
        print(f"provider    p50={pct(pool.provider_ms, 50):6.0f}ms  p99={pct(pool.provider_ms, 99):6.0f}ms")


if __name__ == "__main__":
    asyncio.run(main())

The stand-in provider has a ceiling (12 concurrent) and goes 7x slower two seconds in. It is not our incident in miniature, and I want to be precise about the gap: the stand-in never recovers, where ours did after 12 minutes and the backlog outlived it by forty. So the demo shows a backlog outliving its arrivals. It does not show one outliving the degradation that caused it, and that second thing is what made 21/05 confusing enough that restarting pods looked like the only lever. On my machine:

served=136  shed=255  stale=9
queue wait  p50=     0ms  p99=  2574ms
provider    p50=   201ms  p99=  1401ms

Run the same 400 arrivals with no pool, one task per request, the way we had it, and you get this instead, again on my machine:

served=400/400  peak in-flight=263  drained in 40s
end-to-end  p50=  12273ms   p99=  29696ms

Compare the provider numbers. The bounded run says the provider's p99 is 1,401ms, which is the truth: 1.4s is literally the constant inside the sleep. The unbounded run says 29,696ms. Same provider, same degradation, and a 21x difference in the number you would paste into a support ticket.

Watch the drain, too. Arrivals stop at 10 seconds; the unbounded run does not finish until 40. Three quarters of that run is backlog burning down after the last caller has arrived. Ours was forty minutes of drain on twelve minutes of degradation.
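The code for the unbounded run is not shown above. What follows is a reconstruction under assumptions: one task per arrival against the same kind of stand-in provider (a ceiling of 12, 0.2s healthy, 1.4s degraded), with the function name and return shape invented for illustration. It is not the exact script behind the figures quoted, so expect numbers in the same neighbourhood rather than identical ones.

```python
import asyncio
import time


async def run_unbounded(n=400, rps=40, ceiling=12, fast=0.20, slow=1.40,
                        slow_after=2.0):
    """One task per arrival, no admission control.
    Returns (peak_in_flight, latencies_in_seconds, total_seconds)."""
    provider_sem = asyncio.Semaphore(ceiling)   # the stand-in provider's ceiling
    start = time.perf_counter()
    in_flight = peak = 0
    latencies = []

    async def one():
        nonlocal in_flight, peak
        t0 = time.perf_counter()
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            async with provider_sem:            # waiting here is the unnamed queue
                degraded = time.perf_counter() - start > slow_after
                await asyncio.sleep(slow if degraded else fast)
        finally:
            in_flight -= 1
            latencies.append(time.perf_counter() - t0)   # end-to-end, wait included

    tasks = []
    for _ in range(n):
        tasks.append(asyncio.create_task(one()))
        await asyncio.sleep(1 / rps)            # steady arrivals
    await asyncio.gather(*tasks)
    return peak, latencies, time.perf_counter() - start
```

Calling `asyncio.run(run_unbounded())` with the defaults takes roughly as long as the drain itself, about 40 seconds. The single timer in `one()` is the point: it cannot distinguish time spent waiting for the semaphore from time spent in the sleep, which is the same blind spot the handler had.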

Timing our wait separately from their call

The highest-leverage line in that file is where the second timer starts.

async with self.sem:
    waited = time.perf_counter() - job.enqueued_at   # ours
    ...
    t0 = time.perf_counter()
    result = await self.call(job.rid)                # theirs

Two timers, split at the moment we actually begin talking to the provider. Everything before is our queueing: admission, queue, semaphore. Everything after is theirs.

We had been exporting one histogram, llm_request_duration_seconds, wrapped around the whole handler. That histogram is worse than having none, because it is confidently wrong. It read 35 seconds while pointing at a provider that was answering in 4.3. I took that graph into a support ticket and asked a vendor to explain a number my own service had manufactured. That is embarrassing, and one extra perf_counter() call would have prevented it.

We now export three: llm_queue_wait_seconds, llm_provider_duration_seconds, and end-to-end. The first two should roughly sum to the third, and when they stop summing, the difference is time being spent somewhere neither timer covers. That divergence is its own signal.

The rule I would hand my past self: if a request waits inside your process before you do the thing, that wait is a metric. Any queue you are not measuring is unbounded, because you cannot bound what you cannot see.
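The "should roughly sum" check can be made mechanical. This is a sketch only: the function names and both thresholds are assumptions. It has to be applied per request or to means, because percentiles do not add (the p99 of a sum is not the sum of the p99s).

```python
def unaccounted_seconds(end_to_end, queue_wait, provider):
    """Time covered by neither timer. Near zero when the two timers tile the request."""
    return end_to_end - (queue_wait + provider)


def timers_diverged(end_to_end, queue_wait, provider, tolerance=0.05, floor=0.010):
    """True when the gap exceeds both `tolerance` of end-to-end and `floor` seconds.
    Both thresholds are assumptions; tune them against your own noise."""
    gap = unaccounted_seconds(end_to_end, queue_wait, provider)
    return gap > max(tolerance * end_to_end, floor)


# A request whose timers tile it, and one with 30 seconds nobody measured.
print(timers_diverged(2.0, 0.6, 1.4))
print(timers_diverged(35.0, 0.5, 4.3))
```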

What it misses at scale

Four things still wrong, in descending order of how much they bother me.

I cannot defend the queue depth. I derived QUEUE_MAX = 11 from Little's Law: useful depth = drain rate x wait budget, and the degraded drain rate is 8 permits / 1.4s = 5.7 rps, so 5.7 x 2.0 ≈ 11. But drain rate is not a constant, so the derivation is thinner than it looks. Healthy, it is 8/0.2 = 40 rps, so the right depth would be 80. A fixed number is wrong in one direction at all times. I ran the demo at both depths, several times each, on my machine: served never moved outside noise (135 to 141 at depth 32, 131 to 141 at depth 11) while stale dropped from the seventies to around ten. Your absolute numbers will move with load; the gap between the two columns does not. Depth buys you no throughput whatsoever. It only decides whether you refuse a request in one millisecond or waste two seconds of its life first. The wait budget is what actually protects the caller. The depth only decides how much memory the backlog is allowed to occupy while it waits.
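The depth derivation in that paragraph is one multiplication, and writing it as a function makes the problem with a fixed number obvious. This is a sketch with an illustrative function name and the demo's own constants.

```python
def useful_depth(permits, service_seconds, wait_budget):
    """Jobs that can drain inside the wait budget: drain rate x budget."""
    drain_rps = permits / service_seconds
    return drain_rps * wait_budget


degraded = useful_depth(permits=8, service_seconds=1.4, wait_budget=2.0)
healthy = useful_depth(permits=8, service_seconds=0.2, wait_budget=2.0)
print(f"degraded: {degraded:.1f} jobs, healthy: {healthy:.0f} jobs")
```

The same pool with the same budget wants a depth of about 11 when the provider is slow and 80 when it is healthy, which is the sense in which any constant is wrong in one direction.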

The SRE book disagrees with my ratio and may well be right. Chapter 22 of the Google SRE book, Addressing Cascading Failures, recommends keeping queue size small relative to pool size, on the order of 50% or less, so that a server rejects early under sustained overload. Mine is 11 against 8 permits, about 1.4x. I picked that deliberately for bursty arrivals: a short burst that clears inside the wait budget should be absorbed rather than refused. If your traffic is steady rather than bursty, take their ratio over mine.

The concurrency limit is static. Ours is a number I chose by watching where the provider's latency curve bends, then re-chose twice. The honest version measures it continuously, because a provider's capacity for you is neither constant nor published. That is real work, and we have not done it.

Shedding is not free, and the demo is blunt about the bill: 255 shed, 9 stale, 136 served out of 400. A 66% rejection rate. No amount of engineering fixes that, because a pool draining at 5.7 rps cannot absorb 40 rps of arrivals. Physics gets a vote. The only decision available is who finds out, and how quickly. Shedding does not rescue the requests it drops. It moves the failure to a moment you picked in advance, while you can still answer in one millisecond with a status code the caller can act on. If you read one thing before this post, read Handling Overload, chapter 21 of the same book.

What I'd page on

I had four dashboards for this service on 21/05. Every one of them was green while the SLO burned, because all four watched CPU, error rate, request count, and provider latency: the two that cannot see queueing, and two that actively lie about it. Here is what replaced them. Copy the metrics. The thresholds are ours.

Queue wait p99, alone, as its own series. Page when it exceeds half the wait budget. Not end-to-end, not provider latency: the wait, by itself. Had this existed on 21/05 it would have read 200 seconds while the provider graph read 4.3, and nobody would have spent nineteen minutes reading a status page.
Shed rate and stale rate, separately, never summed. They mean opposite things. Shed climbing means admission control is working and arrivals exceed capacity: expected, page only if sustained past five minutes. Stale climbing means we admitted work we could not finish in time, which is an admission-control bug: the queue is too deep for the budget. That distinction is what QUEUE_MAX = 11 is for, and stale is the metric that proves the number is right.
Provider latency timed around the call and nothing else. Not the handler. Not the pool acquire. The call. This is the only number worth taking to a vendor, and until 21/05 we did not have it, which is why my support ticket was fiction.
Backlog drain time, derived: queue depth divided by observed drain rate. L/λ, on a graph. It answers "if arrivals stopped right now, how long until we are clear," and it is the number that tells you a restart is the only remaining lever, roughly forty minutes before you work it out by hand.
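Two of those signals reduce to a division and a comparison. This is a minimal sketch with illustrative function names. The inputs in the example are the post's own figures: the 9,000 peak against 40 per second, and the demo's 2,574ms queue-wait p99 against its 2-second budget.

```python
def drain_seconds(queue_depth, drain_rps):
    """If arrivals stopped now, how long until the backlog clears."""
    if drain_rps <= 0:
        return float("inf")      # nothing is leaving, so the backlog never clears
    return queue_depth / drain_rps


def should_page_on_wait(wait_p99_seconds, wait_budget):
    """Page when queue wait p99 exceeds half the wait budget."""
    return wait_p99_seconds > wait_budget / 2


print(f"drain time at peak: {drain_seconds(9_000, 40):.0f}s")
print(f"page on demo p99:   {should_page_on_wait(2.574, 2.0)}")
```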

I have written before about a green dashboard hiding a real problem, so rather than run that argument again I will mark what is different here. That one was a bill: every run passed, nothing errored, and the damage showed up on an invoice. This one is time. Our error rate on 21/05 sat at 0.02%, normal for us, and the provider's status page stayed green from the first alert to the last. Neither number was wrong. Neither number was about where the 195 seconds went, because we were measuring how much time a request took and never which queue it spent it in.

Three questions decide whether you are running my 21/05 service. What happens to the 173rd concurrent call? How long did the last request sit before we dialed anyone? If arrivals stopped this second, when would the backlog be clear? We could not answer any of the three, and it took a provider having a bad twelve minutes to show us.

Self-hosted LLM observability: six stacks, weighed by what they cost you to run

Jasmine Park — Tue, 14 Jul 2026 21:24:23 +0000

TL;DR: If your LLM traffic sends personal data, or your obs invoice scales with token volume, self-hosting the trace pipeline is worth costing out. Storage is the cheap part (roughly 60 GB and about 18 dollars a month for a million spans a day at 30-day retention). The expensive part is that you now run it: retention, cardinality, ingest lag, and the pager. Here are six open-source stacks you can run yourself, what each is actually good at, and what each costs you to operate.

Why self-host at all

Two reasons show up in real incidents. The first is data residency: if a trace payload contains a prompt with a customer's PII, sending it to someone else's cloud is a compliance conversation you do not want to have after the fact. Self-hosting keeps the payload inside your VPC. The second is cost shape. Managed LLM observability tends to price on events or token volume, so the invoice grows exactly when your product succeeds. I inherited a managed bill that had quietly tripled over two quarters because traffic tripled. We moved the high-volume traces in-house. The invoice went down. The number of dashboards I now own went up.

That is the trade. You are swapping a predictable invoice for operational surface area. Before you make it, cost out the operational side honestly, because "open-source" is a license, not a free lunch. The question is never "is it free," it is "what does it take to keep it green at 3x the traffic."

What "self-hostable" has to mean here

I only included stacks you can actually run on your own infrastructure without a sales call, that speak a standard trace format (most of these are built on OpenTelemetry), and that a small team can operate. Every option below is open-source and self-hostable today. I have put them in no particular ranking order, because the right pick depends on whether you want pure tracing, a full platform, or just instrumentation you point at a backend you already run.

The six stacks

Langfuse. The one most teams reach for first. It is open-source and self-hostable, and it combines tracing with evals and prompt management, so it is more than a trace viewer. Strong at the prompt-iteration loop: versioned prompts, scores, and a clean trace UI. What it costs to run: a Postgres plus a ClickHouse-backed deployment for the self-hosted stack, which is a real database to operate at volume. What it can miss at scale: high-cardinality trace attributes will punish your storage if you do not sample. github.com/langfuse/langfuse

Arize Phoenix. Open-source, self-hostable, and built on OpenTelemetry. Like Langfuse it is more than tracing: it pairs trace collection with an evaluation layer, and it is notebook-friendly, which makes it strong for offline analysis and debugging a RAG pipeline span by span. What it costs to run: light to stand up for a single team, heavier once you want durable multi-tenant storage rather than an analysis session. What it can miss at scale: it started life analysis-first, so treat the production, always-on deployment as the part you validate under load. github.com/Arize-ai/phoenix

Helicone. Proxy-first. You route model calls through it and get logging, cost, latency, and caching almost immediately, which is the fastest path to a spend dashboard. Open-source and self-hostable. What it costs to run: low to start, but a proxy in the request path is now a thing on your critical path, so its availability is your availability. What it can miss at scale: proxy-based capture is excellent for cost and latency and less focused on deep multi-span agent traces than the OTel-native tools. github.com/Helicone/helicone

Future AGI. Open-source and self-hostable, and OpenTelemetry-native for tracing through its traceAI framework. It sits at the platform end of this list rather than the pure-tracing end: the same stack also carries evaluation, simulation, and a model gateway, so the draw is running one self-hosted system instead of several. Honest placement: it is younger at pure tracing than Langfuse or Phoenix, so if all you want is the most battle-tested trace viewer, it is not your first pick. What it costs to run: a full platform, so you operate more surface than a single-purpose tracer. What it can miss at scale: validate the tracing path at your volume before you retire the incumbent. github.com/future-agi/future-agi

SigNoz. The general-purpose APM option. It is open-source, self-hostable, and OpenTelemetry-native, and it treats LLM spans as spans inside your broader application traces. Strong if you already want one observability backend for services and models rather than an LLM-only tool. What it costs to run: a ClickHouse-backed APM, which your infra team may already know how to operate, which is a point in its favor. What it can miss at scale: it is not LLM-specialized, so prompt-level ergonomics (diffing prompt versions, judge scores) are not its focus. github.com/SigNoz/signoz

Traceloop OpenLLMetry. Not a backend at all, and that is the point. It is an open-source set of OpenTelemetry instrumentations for LLM apps that you point at whatever OTel-compatible backend you already run. Strong if you have an observability stack and just want your model calls to show up in it without vendor lock-in on the storage side. What it costs to run: almost nothing on its own, because it is instrumentation, but you still need a backend, so its true cost is whatever you send the data to. What it can miss at scale: it gives you spans, not opinions, so the eval and prompt-management layers are on you. github.com/traceloop/openllmetry

Cost out the storage before the on-call

The invoice you escaped was the easy number. Here is the one people skip. This is a rough hot-storage estimate, not a full TCO, but it anchors the conversation.

def trace_storage_gb(spans_per_day, bytes_per_span, retention_days):
    return spans_per_day * bytes_per_span * retention_days / 1e9

def monthly_cost(spans_per_day, bytes_per_span=2_000, retention_days=30,
                 usd_per_gb_month=0.10, replication=3):
    gb = trace_storage_gb(spans_per_day, bytes_per_span, retention_days)
    return gb, gb * usd_per_gb_month * replication

for spd in (1_000_000, 20_000_000):
    gb, cost = monthly_cost(spd)
    print(f"{spd:>12,} spans/day -> {gb:8.1f} GB hot, ~${cost:7.2f}/mo storage (x3 repl)")

   1,000,000 spans/day ->     60.0 GB hot, ~$  18.00/mo storage (x3 repl)
  20,000,000 spans/day ->   1200.0 GB hot, ~$ 360.00/mo storage (x3 repl)

Storage is cheap. Eighteen dollars a month for a million spans a day is not what makes self-hosting expensive. What makes it expensive is the ClickHouse or Postgres you are now running, the retention job you have to get right, the cardinality that blows up when someone puts a UUID in a span attribute, and the fact that when ingest lags, that is your page. Plug your real numbers in. If the storage line is trivial and the operator line is not, you have learned the actual shape of the decision.

What I'd page on

If you self-host any of these, do not page on "the dashboard is down." Page on the things that mean you are silently losing data:

Ingest lag over threshold (your traces are minutes behind reality, so every incident you debug is stale).
Dropped or refused spans greater than zero (the exporter queue is full, and you are blind to exactly the traffic spike you most need to see).
Storage growth rate outrunning your retention budget, or trace attribute cardinality spiking (a UUID landed in a label, and your index is about to hurt).
Sampling rate drift (someone changed the sampler, and your tail-latency traces quietly stopped being collected).
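A minimal sketch of those four conditions as one check. Every field name and threshold here is hypothetical and not taken from any of the six stacks; the idea is only that each page condition becomes a comparison on a number your pipeline already exports.

```python
from dataclasses import dataclass

@dataclass
class PipelineSnapshot:
    """One scrape of a self-hosted trace pipeline (field names are illustrative)."""
    ingest_lag_seconds: float
    dropped_spans: int
    storage_growth_gb_per_day: float
    distinct_attribute_values: int
    configured_sample_rate: float
    observed_sample_rate: float

def pages_for(s: PipelineSnapshot,
              max_lag_seconds: float = 120.0,
              budget_gb_per_day: float = 40.0,
              max_distinct_values: int = 10_000,
              sample_rate_tolerance: float = 0.05) -> list[str]:
    """Return the names of the silent-data-loss conditions that should page."""
    pages = []
    if s.ingest_lag_seconds > max_lag_seconds:
        pages.append("ingest_lag")
    if s.dropped_spans > 0:
        pages.append("dropped_spans")
    if (s.storage_growth_gb_per_day > budget_gb_per_day
            or s.distinct_attribute_values > max_distinct_values):
        pages.append("storage_or_cardinality")
    if abs(s.observed_sample_rate - s.configured_sample_rate) > sample_rate_tolerance:
        pages.append("sampling_drift")
    return pages
```

The thresholds are placeholders. The dropped-spans check is the only one with a natural default, because the text above treats any value greater than zero as a page.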

A $3,900 overnight bill from our LLM eval suite: the incident, and the spend guard I shipped after

Jasmine Park — Tue, 14 Jul 2026 21:21:13 +0000

TL;DR. Our LLM-judge eval suite had no cost ceiling. It ran the full judge over 1,200 cases on every CI trigger. On 08/07 a dependency bot opened 41 pull requests between 01:00 and 04:00, and our merge queue re-ran the whole suite on every push and every rebase: roughly 270 full runs at about $14.40 each. That one window cost $3,900 against a $1,730 monthly eval budget. No dashboard fired, because ours watched request rate and 5xx, not tokens or dollars. We found out from a provider cost-anomaly notice that finance forwarded, because nothing we monitored watched spend. What I shipped was four guards under the suite: a pre-flight cost cap (estimate tokens with tiktoken, multiply by price, refuse the run if it would breach a daily ceiling), a result cache keyed on the candidate answer, sampling on non-main branches, and an alert on token-spend rate. Code is below.

I run reliability for a small ML platform team. We ship an LLM feature and gate it with an offline eval suite: 1,200 graded cases, each scored by a separate judge model against a rubric. Standard setup. It has caught real regressions. It also carried, for four months, a failure mode I built and did not see until it cost us more than twice a monthly budget in three hours.

This is the writeup. Numbers are from our incident, rounded. The prices are the ones we paid at the time, not a benchmark.

How I found out

The first signal came in as an email.

At 09:40 the next morning, finance forwarded a cost-anomaly notice from our model provider: the prior day's spend on one API key was 68x its trailing average. My first reaction was that the anomaly detector was wrong. We had shipped nothing overnight. No incident channel, no 5xx, no latency alarm. Green board.

Then I read which key. It was ci-eval, not prod. I had no dashboard for ci-eval, because eval traffic never paged anyone, so I had never built one. I pulled the provider's usage export for that key and sorted by hour. Between 01:00 and 04:00 it had billed just over a billion tokens. A normal day for that key is about 14 million.

I checked the deploy log twice, expecting a runaway retry loop or a stuck worker hammering the API. There was neither. The requests were clean, sequential, and successful. Whatever had done this had done it deliberately, at 200 OK, which meant the provider was behaving correctly and the bug was in my own cost arithmetic.

That is how I learned my eval suite had a cost bug: about twelve hours late, from a finance email, in dollars rather than in the tokens or request counts I was actually watching.

What it cost

The arithmetic is the whole story, so here it is.

One judged case, at our rubric size, costs:

input: about 2,600 tokens (rubric, question, candidate answer, reference answer)
output: about 550 tokens (a score plus a short justification)

At the prices we paid ($2.50 per million input, $10.00 per million output), that is $0.0065 plus $0.0055, so roughly $0.012 per case. The full suite is 1,200 cases, so one run is 1,200 x $0.012 = $14.40. On a normal day CI fires it three or four times: a couple of merges to main, a manual rerun or two. Call it $57 a day, about $1,730 a month. That budget line had been flat for a quarter. So I stopped looking at it. That was mistake one.

On the night of 08/07, a dependency bot opened 41 pull requests in three hours. Each PR triggered the full suite on its first push. Each then got a lockfile follow-up commit, which triggered it again. Then the merge queue rebased each PR onto main before merging and ran the suite once more. Six to seven full runs per PR. About 270 runs total.

270 x $14.40 = $3,888. Just over a billion tokens. In one overnight window we spent 2.25x our entire monthly eval budget, and every single run passed. Nothing was broken. That is the part that still bothers me. The suite did exactly what I had configured it to do, 270 times, and nothing malfunctioned. The entire bill came from correct runs firing far more often than they ever should have.

Split the billion tokens and it maps straight back to the bill: about 842 million input tokens at $2.50 per million is $2,106, and about 178 million output tokens at $10.00 per million is $1,782. Sum $3,888. Output was 17% of the tokens and 46% of the cost, which is worth internalizing. On a rubric-heavy judge, the short justification you ask it to write is nearly half the spend. Trimming the justification moves the bill more than trimming the rubric does.
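The arithmetic above can be reproduced in a few lines. This is only a check of the figures quoted in the text (token counts per case, prices, 1,200 cases, 270 runs); the function and constant names are illustrative.

```python
# Figures quoted in the write-up; names are illustrative.
PRICE_IN_PER_M = 2.50       # USD per 1M input tokens
PRICE_OUT_PER_M = 10.00     # USD per 1M output tokens
IN_TOKENS_PER_CASE = 2_600
OUT_TOKENS_PER_CASE = 550
CASES_PER_RUN = 1_200

def storm_bill(runs: int) -> dict:
    """Token and dollar totals for a given number of full suite runs."""
    in_tokens = runs * CASES_PER_RUN * IN_TOKENS_PER_CASE
    out_tokens = runs * CASES_PER_RUN * OUT_TOKENS_PER_CASE
    in_usd = in_tokens / 1e6 * PRICE_IN_PER_M
    out_usd = out_tokens / 1e6 * PRICE_OUT_PER_M
    total_usd = in_usd + out_usd
    return {
        "in_tokens": in_tokens,
        "out_tokens": out_tokens,
        "in_usd": in_usd,
        "out_usd": out_usd,
        "total_usd": total_usd,
        "out_token_share": out_tokens / (in_tokens + out_tokens),
        "out_cost_share": out_usd / total_usd,
    }

bill = storm_bill(runs=270)
print(f"${bill['total_usd']:,.0f} total, "
      f"output = {bill['out_token_share']:.0%} of tokens, "
      f"{bill['out_cost_share']:.0%} of cost")
```

Calling `storm_bill(1)` gives the cost of a single run, which is a useful number to keep next to the CI trigger configuration.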

Root cause, and my part in it

Two settings did this. Both were mine.

Our CI ran the eval job on the GitHub synchronize event, which fires on every push to an open PR. I wrote that trigger in March so a PR's eval status stayed fresh as you pushed fixes. Reasonable for a human pushing three commits to one branch. Pathological for a bot pushing 41 branches with lockfile follow-ups.

The merge queue had "re-run required checks after rebase" turned on. I enabled that in May, after a stale check let a bad merge through. Also reasonable in isolation. Combined with the synchronize trigger and a bot storm, it meant every PR paid for the full suite at least three times before it even cleared the queue.

Neither setting is wrong on its own. Together, with no cost ceiling underneath them, they are a spend amplifier. I built both halves, two months apart, and never once did the multiplication. That combination is the actual root cause. The dependency bot was only the trigger: it exercised a gap I had left open, 41 times overnight.

What the dashboards missed

Three things were watching this system. All three were structurally blind to a cost spike.

The provider billing console. Aggregated by calendar day, updated on a lag of roughly 24 hours. At 04:00, while the spend was actually happening, it still showed the previous day's $58. It is accurate but far too slow for a spike that starts and finishes inside one night.

Our Grafana panels. Built on request count and error rate, scoped to the prod key. Two separate problems. First, count is the wrong unit. 270 eval runs is not a remarkable request volume; one prod minute makes more requests than that. The signal was never in the count, it was in tokens-per-request times price. Second, the panels were scoped to prod, and the runaway was on ci-eval, which had no panel at all.

The one alert we did have. A 5xx and rate-limit alarm on the provider. It never fired, because the provider was perfectly happy. It served all billion tokens without a single error. This is the trap: a cost runaway is not an error condition. The provider will keep billing at 200 OK and never surface an error for you to alert on.

So the real failure was not that an alert broke. It is that I had never expressed "dollars per hour on the eval key" as a number that anything watched. I monitored availability and errors, which are what page you at 3am. Spend was only ever reviewed in a monthly budget meeting, so nothing watched it continuously.

The fix

Four changes, ordered by how much they mattered.

A pre-flight cost cap. Before any eval run, estimate its token cost with tiktoken, multiply by the price, and refuse to run if it would push the key past a daily ceiling. This is the guard that stops run number nine, not run number 270.

import tiktoken

# Judge pricing, USD per 1M tokens (from the provider's public pricing page).
PRICE_IN_PER_M = 2.50
PRICE_OUT_PER_M = 10.00
DAILY_BUDGET_USD = 120.00  # hard ceiling for the ci-eval key

def encoding_for(model: str):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Newer judge models aren't in tiktoken's registry yet; fall back.
        return tiktoken.get_encoding("o200k_base")

def projected_cost(prompts: list[str], est_output_tokens: int, model: str) -> float:
    enc = encoding_for(model)
    in_tokens = sum(len(enc.encode(p)) for p in prompts)
    out_tokens = est_output_tokens * len(prompts)
    return (in_tokens / 1e6) * PRICE_IN_PER_M + (out_tokens / 1e6) * PRICE_OUT_PER_M

def guard(prompts, est_output_tokens, model, spent_today):
    run_cost = projected_cost(prompts, est_output_tokens, model)
    if spent_today + run_cost > DAILY_BUDGET_USD:
        raise SystemExit(
            f"eval blocked: projected ${run_cost:,.2f} would push today's "
            f"ci-eval spend past the ${DAILY_BUDGET_USD:,.2f} ceiling"
        )
    return run_cost

The edge case that bit us while testing this: encoding_for_model raises KeyError for judge models tiktoken has not shipped a mapping for yet. If you let that propagate, the guard crashes, and depending on how you wired it that either blocks all evals or (worse) gets swallowed by a broad except so the cap silently stops applying. Fall back to a known encoding: o200k_base for recent models, cl100k_base for older ones. The estimate is then approximate, which is fine here. This is a budget fuse, not an invoice. tiktoken is the actual tokenizer for this model family and it is open source (github.com/openai/tiktoken), so the estimate lands within a couple of percent of billed tokens in practice.

A result cache. Key the judge result on (rubric_version, case_id, sha256(candidate_answer)). On a dependency-bump PR, the model under test emits byte-identical answers, so the candidate hash does not change and the judge never runs twice on the same input. Rerunning 270 times over the same 1,200 unchanged candidates should have been about 1,200 judge calls and 322,800 cache hits: a hit rate of about 99.6%. Cost with the cache in place: about $14, paid once.
Sampling off main. The full 1,200-case suite runs on merges to main and in the merge queue's final gate. Every other trigger (feature pushes, draft PRs, bot PRs) runs a fixed 10% stratified sample, 120 cases, about $1.44. You still catch gross regressions on a branch. You stop paying full freight for a lockfile bump.
Spend-rate alerting. This one gets its own section below, because it is the part I most want you to copy.
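A minimal sketch of the result cache, assuming an in-memory dict as the store and a `judge` callable that stands in for the paid judge call. Both are placeholders; only the key tuple comes from the text above.

```python
import hashlib

def judge_cache_key(rubric_version: str, case_id: str, candidate_answer: str) -> str:
    """Cache key from the tuple in the text: rubric version, case id, answer hash."""
    answer_hash = hashlib.sha256(candidate_answer.encode("utf-8")).hexdigest()
    return f"{rubric_version}:{case_id}:{answer_hash}"

class JudgeCache:
    """In-memory stand-in for whatever store backs the real cache."""

    def __init__(self):
        self._scores = {}
        self.hits = 0
        self.misses = 0

    def score(self, rubric_version, case_id, candidate_answer, judge):
        key = judge_cache_key(rubric_version, case_id, candidate_answer)
        if key in self._scores:
            self.hits += 1
            return self._scores[key]
        self.misses += 1
        self._scores[key] = judge(candidate_answer)  # the only paid call
        return self._scores[key]
```

Because the answer is hashed byte for byte, a single changed character produces a new key and a fresh judge call. If the judge model can change independently of the rubric, adding its identifier to the key is one way to avoid serving stale scores; that would be an extension, not part of the key described above.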

We shipped all four in two days. The cap and the cache did the heavy lifting. Together they take the worst case of a storm like this from $3,900 down to about $120, because once the ceiling is hit the cap simply stops starting new runs.

What it misses at scale

I am not going to pretend this is finished.

The pre-flight estimate uses a fixed est_output_tokens. Judges that write longer justifications on hard cases will be under-estimated, so the fuse is loosest on exactly the runs that cost the most. We set the constant to our 90th-percentile output length, which over-estimates most runs on purpose. Safer, less precise. That is the trade, and it is deliberate.
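One way to derive that constant from logged judge outputs is a nearest-rank percentile. This is a sketch under the assumption that you have a list of observed output token counts; the function name is hypothetical.

```python
import math

def est_output_tokens_from(observed: list[int], pct: int = 90) -> int:
    """Nearest-rank percentile of observed judge output lengths, in tokens."""
    if not observed:
        raise ValueError("need at least one observed output length")
    ordered = sorted(observed)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Illustrative lengths only: nine ordinary justifications and one long outlier.
lengths = [400, 450, 500, 550, 600, 650, 700, 750, 800, 1200]
print(est_output_tokens_from(lengths))
```

With these made-up lengths the p90 is 800 tokens, so the 1,200-token outlier is still under-estimated. That is the gap the paragraph above describes.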

The cache is only correct while the rubric and the judge model are pinned. Bump either one and every cache key changes, so the first run after a judge upgrade pays full price, and 270x full price if the same bot storm lands that same morning. The cap catches that case. The cache does not.

Sampling trades cost for coverage. A 10% branch sample will miss a regression living in the 90% you skipped, and you only see it at the main-merge gate. For us that is acceptable, because branch evals are advisory and the main gate is the one that blocks release. For a team that ships straight off branches, it is not, and they should not copy the 10%.
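A sketch of what a fixed stratified sample could look like. It assumes each case is a dict with a `case_id` and a `stratum` label (both hypothetical field names), and it orders cases by a hash of the id so the same cases are picked on every run.

```python
import hashlib
from collections import defaultdict

def stratified_sample(cases: list[dict], fraction: float = 0.10) -> list[dict]:
    """Pick the same fraction of cases from every stratum, deterministically."""
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[case["stratum"]].append(case)
    picked = []
    for stratum in sorted(by_stratum):
        # Hash order is stable across runs and independent of input order.
        group = sorted(
            by_stratum[stratum],
            key=lambda c: hashlib.sha256(c["case_id"].encode("utf-8")).hexdigest(),
        )
        k = max(1, round(len(group) * fraction))  # never skip a stratum entirely
        picked.extend(group[:k])
    return picked
```

A fixed sample keeps branch results comparable from push to push, but it also means the skipped 90% is always the same 90%. Rotating the sample, for example by mixing a date into the hash, would spread coverage at the cost of that comparability.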

And a daily ceiling is a blunt instrument. Set it too low and you block legitimate work at 16:00 on a heavy release day. Set it too high and it would not have stopped this incident. We landed on 2x a normal day and accept that a genuinely large legitimate day needs a human to raise it by hand. The ongoing work here is tuning that threshold, not writing more code.

What I'd page on

The lesson was not "eval is expensive." It was that I had been monitoring the wrong quantities. I watched request counts and error rates; the failure only ever showed up in tokens and dollars, which nothing tracked. So here is the dashboard I built afterward, and the four things it pages on. Copy the metrics, not the numbers. The numbers are ours.

Token-spend rate, per API key, per hour. The primary signal. Alert when any key crosses 3x its own trailing 7-day hourly median. This is the one that would have paged me at 01:20 instead of routing through finance at 09:40. Per-key matters: ci-eval and prod have different baselines and have to alarm independently, or the loud one masks the quiet one.
Cost per eval run. Emit it as a metric from the harness itself, tagged by trigger (main, merge-queue, branch, bot). Page if a single run exceeds $20 (our full run is $14.40) or if runs-per-hour exceeds 12. A run-count spike is the earliest sign of a trigger loop, and it shows up before the dollars land in any billing export.
Budget burn, daily and month-to-date. Track spend against the daily ceiling as a percentage: warn at 70%, page at 100% (by which point the cap has already blocked runs, so the page is really telling a human to decide whether to raise it). Track month-to-date against the $1,730 line and warn at 80% before the 20th, so an expensive first half of the month is visible while there is still time to react.
Cache hit rate on the judge. A drop below 60% means either a rubric or model change invalidated the cache (expected, transient, no page) or the cache key is broken (not expected, page). On a normal week roughly 40% of lookups miss because the candidates are genuinely new, so the hit rate sits near 60%. A sudden collapse to single digits during a PR storm is the tell that the cache is not absorbing what it should, which is the early warning I did not have on 08/07.
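A minimal sketch of the first signal, the per-key spend-rate check. It assumes you already have hourly dollar totals per API key. The `floor_usd` parameter is an added assumption: a key like ci-eval is idle most hours, so its trailing median can be zero, and without a floor the first run of the day would trip the alert.

```python
from statistics import median

def spend_rate_alert(hourly_usd_history: list[float], current_hour_usd: float,
                     multiplier: float = 3.0, floor_usd: float = 15.0) -> bool:
    """True when this hour's spend exceeds multiplier x the trailing hourly median.

    hourly_usd_history: the key's own trailing window (for example 7 days = 168 hours).
    floor_usd: minimum baseline, here roughly one full eval run (an assumption).
    """
    if hourly_usd_history:
        baseline = max(median(hourly_usd_history), floor_usd)
    else:
        baseline = floor_usd
    return current_hour_usd > multiplier * baseline

# Evaluate each key against its own history so one baseline cannot mask another.
histories = {
    "ci-eval": [0.0] * 150 + [14.40] * 18,  # mostly idle, a few runs (illustrative)
    "prod": [40.0] * 168,                   # steady traffic (illustrative)
}
this_hour = {"ci-eval": 1296.0, "prod": 42.0}
alerts = {key: spend_rate_alert(histories[key], this_hour[key]) for key in histories}
print(alerts)
```

The $1,296 figure is what 90 full runs in one hour would cost at $14.40 each, which is the storm's average rate over its three-hour window.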

None of these watch availability. Availability was fine the entire night. The provider returned 200 to every one of a billion tokens' worth of requests, and that was precisely the problem. Reliability for an LLM system has to include a spend SLO, and a spend SLO needs continuous monitoring rather than a monthly budget review.

If you run an LLM-judge eval suite and you cannot answer, from a dashboard, right now, "what does one run cost, and what stops it running a thousand times tonight," that is your next on-call ticket. We paid $3,900 to find that out.

We alerted on errors. The silent failure was a truncated answer.

Jasmine Park — Wed, 08 Jul 2026 18:31:42 +0000

Every monitor was green. HTTP 200s, latency inside SLO, zero exceptions. Meanwhile users were getting half an answer and cut-off JSON, and nothing on our side had noticed, because a truncated completion is not an error.

The page never fired. That was the whole problem. Support forwarded a handful of complaints that answers were "getting cut off," and I opened the dashboards expecting to see the cause. Error rate flat at zero. Latency well under budget. No exceptions in the logs, no failed calls, no 5xx. By every signal we watched, the service was healthy. The signals were wrong. It was handing back broken output with a 200 stamped on it, and none of our monitors were built to notice.

What the monitors were watching, and what they missed

We had inherited the standard web-service monitoring shape: watch HTTP status, watch latency, watch exceptions. That shape assumes failure looks like an error. For a normal API that holds. For an LLM call it does not, because the most common user-facing failure throws nothing at all.

The failure was truncation. When a completion hits the max_tokens ceiling, the model stops mid-thought and hands back what it had so far. The provider returns a perfectly valid response object with a 200. The only tell is a field in the payload: finish_reason comes back as length instead of stop. No exception, no error code, no latency anomaly. A response that stopped because it ran out of room looks identical, at the transport layer, to one that finished naturally.

So the answer that got cut in half was a 200. The JSON blob that ended after {"status": "app was a 200. Answers were failing on completeness, and our monitors were structurally blind to it because they only counted failures that raise an error.

What it cost before we caught it

This ran for the better part of two weeks before the pattern was clear enough to act on. Best I can reconstruct, somewhere around 3 to 4 percent of completions were truncating, and a big share of those were the long-output ones: multi-step answers, long summaries, and the structured JSON responses another service consumed downstream.

The JSON case was the expensive one. A downstream service parsed those completions. A truncated blob is invalid JSON, so that service caught a parse exception, swallowed it, and fell back to a default. No page there either. So one silent failure (truncation with a 200) fed a second silent failure (a swallowed parse error and a default value), and the only place reality surfaced was a slow trickle of user complaints. Two weeks of that is a lot of quietly wrong answers.
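A sketch of what the consuming side could do instead: keep the fallback, but count it. The default payload and the counter names are hypothetical, since the downstream service's real code is not shown here.

```python
import json

fallback_counts = {"ok": 0, "fallback": 0}
DEFAULT_PAYLOAD = {"status": "unknown"}  # placeholder default

def parse_or_fallback(completion_text: str) -> dict:
    """Parse a completion as JSON; on failure, fall back but record that it happened."""
    try:
        obj = json.loads(completion_text)
    except json.JSONDecodeError:
        # Same fallback as before, but now it moves a number someone can alert on.
        fallback_counts["fallback"] += 1
        return dict(DEFAULT_PAYLOAD)
    fallback_counts["ok"] += 1
    return obj

def fallback_rate() -> float:
    total = fallback_counts["ok"] + fallback_counts["fallback"]
    return fallback_counts["fallback"] / total if total else 0.0
```

The behavior the user sees is unchanged. The difference is that the second silent failure now has a rate, so the chain of swallowed errors is visible from either end.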

Treat finish_reason as a first-class signal

The fix starts with promoting finish_reason from a field nobody reads to a metric you alert on. Every completion, check why it stopped, and emit that. A rising rate of length finishes is a truncation incident in progress, and it will show up here long before support forwards the first complaint.

from collections import Counter

_finish_reasons = Counter()

def record_finish_reason(response) -> str:
    # OpenAI-style: response.choices[0].finish_reason
    # "stop" = natural end, "length" = hit max_tokens (truncated), plus others
    reason = response.choices[0].finish_reason or "unknown"
    _finish_reasons[reason] += 1
    return reason

def truncation_rate() -> float:
    total = sum(_finish_reasons.values())
    return _finish_reasons["length"] / total if total else 0.0

That is the signal we never had. truncation_rate() is the number that should have paged us on day one.

Monitor completeness, not just finish_reason

Finish reason tells you the model ran out of room. It does not tell you whether the answer the user got was usable. For that, check the output itself. For structured responses, the sharpest signal is whether the payload parses and carries the fields you require. A valid-JSON-parse rate and an expected-fields-present rate turn "the answer looks complete" into a number you can alert on.

import json

_completeness = {"parse_ok": 0, "parse_fail": 0, "fields_ok": 0, "fields_missing": 0}
REQUIRED_FIELDS = ("status", "summary", "items")

def check_completeness(text: str, structured: bool) -> dict:
    result = {"truncated_json": False, "missing_fields": []}
    if not structured:
        return result
    try:
        obj = json.loads(text)
        _completeness["parse_ok"] += 1
    except json.JSONDecodeError:
        # a cut-off JSON blob lands here: valid 200, unusable payload
        _completeness["parse_fail"] += 1
        result["truncated_json"] = True
        return result
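    # a payload that parses but is not a JSON object cannot carry the fields
    missing = [f for f in REQUIRED_FIELDS if not isinstance(obj, dict) or f not in obj]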
    if missing:
        _completeness["fields_missing"] += 1
        result["missing_fields"] = missing
    else:
        _completeness["fields_ok"] += 1
    return result

def json_parse_rate() -> float:
    total = _completeness["parse_ok"] + _completeness["parse_fail"]
    return _completeness["parse_ok"] / total if total else 1.0

Now a truncated JSON response is not a downstream default swallowed in silence. It is a parse_fail, counted, and it moves a rate you can wake someone up over.

Set max_tokens from the data, not a guess

The last piece is the setting that caused it. Our max_tokens was a round number somebody picked once and never revisited. It was too low for the long-output traffic, which is exactly the traffic that truncates. The right value is not a guess, it is the measured distribution of actual output lengths with headroom on top.

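def recommend_max_tokens(observed_output_lengths, headroom=1.2, pct=99):
    # size the ceiling off a high percentile of real output (p99 by default),
    # not a round number
    if not observed_output_lengths:
        raise ValueError("need at least one observed output length")
    s = sorted(observed_output_lengths)
    # index-based percentile, clamped to the last element
    ceiling = s[min(len(s) - 1, int(pct / 100 * len(s)))]
    return int(ceiling * headroom)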

Feed it a few days of real completion lengths and it hands back a ceiling that clears your p99 output with room to spare, instead of a number that clips your longest and most important answers. Reserving output budget you rarely use costs a little on the concurrency side, so this is a trade to make with eyes open, not a free lever. But clipping real answers costs correctness, and correctness is not a thing to save money on quietly.

Once truncation rate, JSON parse rate, and expected-fields-present were on the wall, the incident stopped being a support-ticket archaeology exercise. The next time a prompt change pushed outputs longer and truncation started climbing, the rate moved within minutes and we caught it before a single user did.

What I'd page on

Different checklist from the token-cost write-up. This one is about a failure that never throws.

Truncation rate (finish_reason == "length"). The headline. Fraction of completions that stopped because they hit max_tokens. Warn at a low single-digit percent, page on a sharp climb. This fires on the exact failure that green error dashboards hide.
JSON parse-success rate. For any structured output, the fraction that parses. A drop is truncated or malformed payloads reaching consumers. Page on it, because the downstream service will swallow the failure and you will not hear about it otherwise.
Expected-fields-present rate. Of the payloads that do parse, the fraction carrying every required field. Catches the answer that parsed but came back half-filled.
Output-length distribution vs max_tokens. Plot p50 / p95 / p99 output length against the ceiling. When p99 marches toward the ceiling, truncation is about to start. This is the leading indicator.
Downstream default-fallback rate. How often the consuming service fell back to a default because it could not use the response. A silent failure feeding a silent failure is the worst case, so surface the second one too.
Completeness by route. Truncation is not uniform. Break the truncation rate out by endpoint or prompt template so the long-output routes, the ones that actually clip, are not averaged into invisibility by the short ones.
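A sketch of the per-route breakdown from the last item. The route label is whatever you attach per endpoint or prompt template, and the traffic numbers in the usage lines are invented to show the averaging effect.

```python
from collections import defaultdict

# route -> counts of truncated ("length") and total completions
_route_counts = defaultdict(lambda: {"length": 0, "total": 0})

def record_route_finish(route: str, finish_reason: str) -> None:
    counts = _route_counts[route]
    counts["total"] += 1
    if finish_reason == "length":
        counts["length"] += 1

def truncation_rate_by_route() -> dict:
    return {route: c["length"] / c["total"] for route, c in _route_counts.items()}

def overall_truncation_rate() -> float:
    total = sum(c["total"] for c in _route_counts.values())
    truncated = sum(c["length"] for c in _route_counts.values())
    return truncated / total if total else 0.0

# Invented traffic: a busy short-output route and a small long-output route.
for _ in range(960):
    record_route_finish("classify", "stop")
for i in range(40):
    record_route_finish("summarize", "length" if i < 30 else "stop")

print(overall_truncation_rate(), truncation_rate_by_route())
```

In this made-up mix the overall rate is 3 percent while the long-output route truncates 75 percent of the time, which is the kind of gap an aggregate panel hides.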

Our cache hit rate was 90 percent and the bill still climbed

Jasmine Park — Wed, 08 Jul 2026 18:22:36 +0000

The dashboard said we were serving nine of every ten requests from cache. The invoice said we were paying more every week. Both numbers were correct. The hit rate was counting requests and the bill was counting dollars, and those are not the same thing once your traffic is lopsided.

The response cache went in for exactly the reason you would expect. Repeated prompts were hitting the model over and over, so we keyed on the prompt, stored the completion, and served the stored answer on a match. Within a week the hit-rate panel settled around 90 percent and stayed there. Nice flat line. Everyone moved on.

Then the monthly spend kept creeping. Not a spike, a slope. Up a bit, up a bit more, up again. The cache was clearly working (90 percent of requests were free), so nobody looked at it as the culprit. That was the mistake. The cache was doing exactly what the panel measured. We had picked the wrong thing to measure.

What the hit rate was actually counting

A hit rate is a ratio of requests. Ninety percent hit means nine of ten requests were served from the store and one went to the model. What that number never told us is which requests were in which bucket.

Our traffic was not uniform. The bulk of it was short, cheap, repetitive calls: the same handful of classification and lookup prompts, hammered thousands of times an hour. Those cached beautifully. They were also nearly free per call. Meanwhile a thin slice of traffic was long-context work, tens of thousands of tokens of document stuffed into the prompt, and those were mostly unique. They missed the cache almost every time.

So the 90 percent hit rate was 90 percent of the cheap requests and almost none of the expensive ones. The cache was serving the traffic that barely cost anything and passing through the traffic that cost real money. Counted by request, the cache looked like it was carrying us. Counted by dollars spent, it was barely touching the bill. The panel only knew how to count requests, so it never showed us the second number.

What it actually cost

Put rough numbers on it. Say a cheap call is 200 tokens and a long-context call is 30,000 tokens, so the expensive one costs roughly 150 times more. If 95 percent of requests are cheap and 5 percent are long-context, and the cheap ones hit cache 94 percent of the time while the long ones hit 8 percent of the time, the request-weighted hit rate still lands around 90 percent (0.95 x 0.94 + 0.05 x 0.08 = 0.897). The dollar-weighted hit rate, dollars served from cache over total dollars, comes out under 18 percent.

That is the whole gap. Ninety percent of requests free, and we were still on the hook for something like 82 percent of the money. The panel we trusted only counted requests, and requests were not what showed up on the invoice.
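The two rates can be worked out from the illustrative mix above. This sketch uses token count as a stand-in for cost, which is an assumption: it ignores the price difference between input and output tokens.

```python
def blended_hit_rates(segments):
    """segments: (share_of_requests, cost_per_request, hit_rate) tuples.

    Returns (request-weighted hit rate, dollar-weighted hit rate).
    """
    request_rate = sum(share * hit for share, _, hit in segments)
    cost_total = sum(share * cost for share, cost, _ in segments)
    cost_from_cache = sum(share * cost * hit for share, cost, hit in segments)
    return request_rate, cost_from_cache / cost_total

SEGMENTS = [
    (0.95, 200, 0.94),     # cheap, repetitive calls
    (0.05, 30_000, 0.08),  # long-context, mostly unique calls
]

request_rate, dollar_rate = blended_hit_rates(SEGMENTS)
print(f"request-weighted: {request_rate:.1%}, dollar-weighted: {dollar_rate:.1%}")
```

With exactly these round inputs the request-weighted rate is about 89.7 percent and the dollar-weighted rate is a little under 18 percent. Real traffic will give different figures, but the gap between the two is the point.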

Two more things the cache was quietly doing wrong

Once we started looking, two smaller problems fell out of the same investigation.

The keys were too strict. We keyed on the raw prompt string. A trailing space, a reordered pair of retrieved chunks, a timestamp injected into the system prompt, any of these produced a different key and a forced miss on what was semantically the same request. We were busting the cache on prompts that should have collided. Normalizing the key (strip whitespace, sort the parts that are order-insensitive, drop the volatile fields that do not change the answer) recovered a real chunk of hits on the traffic that mattered.
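A sketch of that normalization, assuming the prompt is still available as separate parts (system prompt, retrieved chunks, user message) when the key is built. The function name, the argument split, and the "Current time:" line are all hypothetical.

```python
import hashlib
import re

# Hypothetical volatile field: a timestamp line injected into the system prompt.
_TIMESTAMP_LINE = re.compile(r"^Current time:.*$", re.MULTILINE)

def _squash(text: str) -> str:
    """Strip and collapse runs of whitespace."""
    return " ".join(text.split())

def normalized_cache_key(system_prompt: str, retrieved_chunks: list[str],
                         user_message: str) -> str:
    system = _TIMESTAMP_LINE.sub("", system_prompt)  # drop the volatile field
    parts = [
        _squash(system),
        *sorted(_squash(c) for c in retrieved_chunks),  # chunk order does not matter
        _squash(user_message),
    ]
    # The unit separator keeps part boundaries unambiguous in the hashed string.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
```

Collapsing internal whitespace goes further than stripping the ends, and it is only safe where whitespace cannot change the answer. For prompts that carry code or formatted text, stripping the ends alone is the more conservative choice. Sorting chunks is likewise only valid if the model's answer does not depend on their order.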

The cheap calls were inflating the number we bragged about. Because the tiny repetitive calls dominated the request count and cached almost perfectly, they dragged the headline hit rate up toward 90 no matter what the expensive traffic did. The one metric on the wall was structurally incapable of showing us the miss cost, because the misses were rare by count and huge by dollar.

The metric we should have had from day one

The fix is to weight the hit rate by cost instead of by request. Every lookup carries a dollar estimate. Sum the dollars you served from cache, sum the total dollars you would have spent, and divide. That number moves when an expensive request misses, which is exactly when you want it to move.

from dataclasses import dataclass, field

# price per 1k tokens for the model you are caching in front of.
# use your real numbers; these are placeholders.
INPUT_PER_1K = 0.0005
OUTPUT_PER_1K = 0.0015

def dollar_cost(prompt_tokens: int, output_tokens: int) -> float:
    return (prompt_tokens / 1000) * INPUT_PER_1K + (output_tokens / 1000) * OUTPUT_PER_1K

@dataclass
class CostWeightedCacheStats:
    dollars_from_cache: float = 0.0   # cost we avoided by serving a hit
    dollars_total: float = 0.0        # cost we would have paid with no cache
    hits: int = 0
    misses: int = 0
    miss_costs: list = field(default_factory=list)  # per-miss dollar cost

    def record_hit(self, prompt_tokens: int, output_tokens: int) -> None:
        c = dollar_cost(prompt_tokens, output_tokens)
        self.dollars_from_cache += c
        self.dollars_total += c
        self.hits += 1

    def record_miss(self, prompt_tokens: int, output_tokens: int) -> None:
        c = dollar_cost(prompt_tokens, output_tokens)
        self.dollars_total += c
        self.miss_costs.append(c)
        self.misses += 1

    @property
    def request_hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def dollar_hit_rate(self) -> float:
        # the number that actually tracks the bill
        return self.dollars_from_cache / self.dollars_total if self.dollars_total else 0.0

    def miss_cost_percentiles(self) -> dict:
        # where the missed money lives: is it the long tail?
        if not self.miss_costs:
            return {}
        s = sorted(self.miss_costs)
        def pct(p: float) -> float:
            # index-based percentile, clamped so a high p cannot run off the end
            return s[min(len(s) - 1, int(p / 100 * len(s)))]
        return {"p50": pct(50), "p90": pct(90), "p99": pct(99), "max": s[-1]}

Wire it around the cache lookup and both numbers fall out of the same code path.

stats = CostWeightedCacheStats()

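# normalize_key and call_model are your own helpers; req is assumed to carry
# .prompt and .prompt_tokens, and cached/model responses to carry .output_tokens
def cached_call(req, cache):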
    key = normalize_key(req.prompt)     # strip, sort order-insensitive parts, drop volatile fields
    hit = cache.get(key)
    if hit is not None:
        stats.record_hit(req.prompt_tokens, hit.output_tokens)
        return hit
    resp = call_model(req)
    stats.record_miss(req.prompt_tokens, resp.output_tokens)
    cache.set(key, resp)
    return resp

# emit both, side by side, so nobody trusts the wrong one again
def emit_metrics():
    return {
        "cache.request_hit_rate": round(stats.request_hit_rate, 3),
        "cache.dollar_hit_rate": round(stats.dollar_hit_rate, 3),
        "cache.miss_cost": stats.miss_cost_percentiles(),
    }

When we put those two lines on the same panel, the gap was undeniable. Request hit rate near 90, dollar hit rate scraping the bottom. The miss-cost percentiles made it concrete: the p99 miss was hundreds of times the median miss. All the money was in the tail, and the tail was invisible on the old graph.
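To make the gap concrete, here is a toy calculation. The traffic mix and token counts are invented for illustration (they are not our production numbers), and the prices are the same placeholders used in the stats class.

```python
# Placeholder per-1k-token prices; swap in your real ones.
INPUT_PER_1K = 0.0005
OUTPUT_PER_1K = 0.0015

def dollar_cost(prompt_tokens: int, output_tokens: int) -> float:
    return (prompt_tokens / 1000) * INPUT_PER_1K + (output_tokens / 1000) * OUTPUT_PER_1K

# Hypothetical mix: 900 tiny repetitive calls that hit the cache,
# 100 long-context calls that miss it.
hit_dollars = 900 * dollar_cost(200, 50)
miss_dollars = 100 * dollar_cost(60_000, 1_000)

request_hit_rate = 900 / (900 + 100)
dollar_hit_rate = hit_dollars / (hit_dollars + miss_dollars)

print(f"request hit rate: {request_hit_rate:.1%}")
print(f"dollar hit rate:  {dollar_hit_rate:.1%}")
```

With this made-up mix the request-weighted number stays at 90% while the dollar-weighted number lands in the low single digits, which is the shape we saw on the real panel.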

Normalizing the keys pulled the dollar hit rate up, because some of that expensive traffic was more repetitive than the strict key let us see. But the honest conclusion was that a response cache in front of mostly-unique long-context calls was never going to save much. Knowing that stopped us from over-investing in a cache that could not move the bill, and pointed us at prompt trimming and shorter retrieved context instead, which did.

What I'd page on

Different checklist from the autoscaling write-up. This one is about a cache lying to you by measuring the wrong axis.

Dollar-weighted hit rate. Dollars served from cache over total dollars. This is the headline. Warn if it drops below your target while request hit rate stays high, because that divergence is the exact failure in this post. Request hit rate alone is a vanity metric here.
Request hit rate vs dollar hit rate, on one panel. Never show one without the other. The wider the gap between them, the more spend the request number is hiding.
Miss cost distribution, p50 / p90 / p99. Where the unserved money lives. A p99 miss orders of magnitude above the median means your expensive traffic is bypassing the cache, and that is where to spend engineering time.
Key-collision rate. Fraction of lookups where a normalized key would have hit but the raw key missed. A rising value means your keys are too strict and you are busting cache on semantically identical prompts. Alert when it climbs.
Cost per served request, hourly. Total spend over requests served. Flat is healthy. Rising while the request hit rate holds steady is the surprise invoice forming, and it is invisible on any request-count panel.
Cache dollar savings, absolute per day. Not a rate, a number: dollars the cache actually kept off the bill today. If it stops growing while traffic grows, the cache has stopped earning its keep and it is time to look upstream at prompt and context size.
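
The first two items reduce to one check: dollar hit rate under target while request hit rate still looks healthy. A minimal sketch of that check follows. The function name and both default thresholds are my assumptions, so set them from your own targets.

```python
from typing import Optional

def divergence_alert(
    request_hit_rate: float,
    dollar_hit_rate: float,
    dollar_target: float = 0.5,   # assumed target, not a recommendation
    request_high: float = 0.8,    # what counts as "request hit rate stays high"
) -> Optional[str]:
    """Warn when the dollar-weighted rate is under target while the
    request-weighted rate still looks good."""
    if dollar_hit_rate < dollar_target and request_hit_rate >= request_high:
        gap = request_hit_rate - dollar_hit_rate
        return (
            f"cache divergence: request={request_hit_rate:.2f} "
            f"dollar={dollar_hit_rate:.2f} gap={gap:.2f}"
        )
    return None

print(divergence_alert(0.90, 0.05))
```

It deliberately stays quiet when both rates are low, because that is a cold or broken cache and a different alert should own it.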

What wakes you at 2am when an enterprise operator deploys your agent

Jasmine Park — Thu, 02 Jul 2026 08:13:02 +0000

Month one of our enterprise rollout: three 3am pages. None of them were code bugs. None of them were model quality issues. All of them were operational surprises we had not accounted for before the handoff.

We were production-ready. We were not operator-ready. Those are different things.

Here's what the three incidents were, what they cost, and the alert set I now require before any enterprise agent deployment.

Incident one: rate limits hit at 6pm EST

The operator's usage pattern had a daily spike between 5pm and 7pm EST. Their team was closing out the day, reviewing agent outputs, running follow-up queries. Peak usage was 4x average.

Our OpenAI rate limit was sized for average load. At 6:04pm on day eight, we hit it. Every request after that returned 429 errors for eleven minutes. The agent returned a generic error message. Eleven minutes of silent failure during peak business hours.

Nobody had tested for burst traffic. We had tested for average load. The cost of finding this in production instead of before handoff: one executive complaint, two support tickets, and an emergency weekend call.

The fix was multi-provider routing with automatic failover: when provider A returns 429, route to provider B. We had this wired up within 48 hours. We should have had it before day one.
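
The routing rule itself is small. Here is a sketch of the idea, not our production router: RateLimited, call_with_failover, and the two fake providers are hypothetical names, and a real client would raise its own rate-limit error that you would translate at the edge.

```python
class RateLimited(Exception):
    """Raised by a provider callable when the upstream answers HTTP 429."""

def call_with_failover(request, providers):
    """Try providers in order; fall through only when one is rate limited.

    providers is a list of (name, callable) pairs. Any other exception
    propagates, so real errors are not masked by the failover.
    """
    last_error = None
    for name, call in providers:
        try:
            return name, call(request)
        except RateLimited as exc:
            last_error = exc  # this provider is saturated, try the next one
    raise RuntimeError("all providers rate limited") from last_error

# Fake providers for illustration only.
def provider_a(request):
    raise RateLimited("429 from provider A")

def provider_b(request):
    return f"answer to {request}"

used, response = call_with_failover("summarize", [("A", provider_a), ("B", provider_b)])
print(used, response)
```

Failing over only on the rate-limit error is a deliberate choice: a malformed request would fail on provider B too, and you want that to surface.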

What this incident taught us about operator-readiness: size for the operator's peak, not your average. The operator's usage pattern will not match your test environment.

Incident two: $11,800 overage in 18 days

The operator had one very active team. Four engineers running batch analysis jobs against our agent at hours that made no operational sense (11pm local, Saturdays). Within 18 days, their usage represented 68% of total API cost.

We had a total cost ceiling. We did not have a per-tenant ceiling. When we hit the total ceiling, the entire deployment slowed down. The heavy team's cost was invisible to us until the month-end bill.

The question from the customer's VP of operations on day 20: "Can you break down the cost by team?"

We could not.

This is a FinOps problem that looks like a monitoring problem. Per-tenant cost tagging needs to be in the request headers before the first request goes out. Not added after month-end when someone asks the question.

Cost of finding this in production instead of before handoff: $11,800 overage, one executive conversation, two weeks of retroactive tagging work to approximate the breakdown.

Minimum viable cost attribution setup: tag every request with operator ID, team ID, and use-case ID at the gateway layer. Aggregate daily by tag. Alert when any single tag hits 70% of its monthly budget by day 15.
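
A sketch of that aggregation and the 70%-by-day-15 rule, assuming the gateway already writes one record per request with the three tags and a cost. The field names, sample records, and budgets are hypothetical.

```python
from collections import defaultdict

def spend_by_tag(records, tag):
    """Sum cost_usd per value of one tag (operator_id, team_id, use_case_id)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[tag]] += record["cost_usd"]
    return dict(totals)

def budget_alerts(totals, budgets, day_of_month, threshold=0.70, by_day=15):
    """Tags that reached `threshold` of their monthly budget on or before `by_day`."""
    if day_of_month > by_day:
        return []
    return sorted(
        tag for tag, spent in totals.items()
        if tag in budgets and spent >= threshold * budgets[tag]
    )

# Made-up month-to-date records.
records = [
    {"operator_id": "op-1", "team_id": "batch", "use_case_id": "analysis", "cost_usd": 720.0},
    {"operator_id": "op-1", "team_id": "support", "use_case_id": "triage", "cost_usd": 90.0},
    {"operator_id": "op-1", "team_id": "batch", "use_case_id": "analysis", "cost_usd": 30.0},
]
totals = spend_by_tag(records, "team_id")
alerts = budget_alerts(totals, {"batch": 1000.0, "support": 1000.0}, day_of_month=12)
print(totals, alerts)
```

The same spend_by_tag call with "use_case_id" or "operator_id" answers the VP's question without any retroactive tagging.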

Incident three: no audit trail for a compliance request

Six weeks in, the operator's compliance officer needed to reconstruct a decision the agent made on a specific document at a specific time. Customer data question. The agent had made a routing decision that the compliance team needed to audit.

We had trace spans in our observability system. We did not have an immutable per-request audit log that showed: which document, which agent version, which model version, which prompt version, what the output was, what confidence score.

Trace spans are not audit logs. They're operational data. An audit log needs to be write-once, timestamped, and correlated with the business object (the document, the customer record, whatever the operator's domain object is).
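
For shape, here is a minimal sketch of a record carrying the fields listed above, plus an append-only wrapper. The class and field names are my own, and the in-memory list stands in for real write-once storage, which this sketch does not provide.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: a record cannot be edited after creation
class AuditRecord:
    request_id: str
    business_object_id: str  # the document or customer record the decision was about
    agent_version: str
    model_version: str
    prompt_version: str
    output: str
    confidence: float
    recorded_at: str  # UTC ISO-8601 timestamp

class AppendOnlyAuditLog:
    """In-memory stand-in for write-once storage: append and read, no update."""

    def __init__(self):
        self._lines = []
        self._seen = set()

    def append(self, **fields) -> AuditRecord:
        record = AuditRecord(recorded_at=datetime.now(timezone.utc).isoformat(), **fields)
        if record.request_id in self._seen:
            raise ValueError(f"request {record.request_id} already recorded")
        self._seen.add(record.request_id)
        self._lines.append(json.dumps(asdict(record), sort_keys=True))
        return record

    def for_object(self, business_object_id):
        """Everything the agent decided about one business object, in order."""
        rows = (json.loads(line) for line in self._lines)
        return [row for row in rows if row["business_object_id"] == business_object_id]

log = AppendOnlyAuditLog()
entry = log.append(
    request_id="req-001", business_object_id="doc-42", agent_version="1.3.0",
    model_version="model-2026-05", prompt_version="p7",
    output="route:manual_review", confidence=0.62,
)
print(log.for_object("doc-42"))
```

The lookup by business object is the part trace spans do not give you: the compliance officer asks about a document, not a trace ID.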

We spent four days building a retroactive audit log approximation. It was not what the compliance officer needed, but it was the best we could do. The real audit log went live in week eight.

Cost of finding this in production instead of before handoff: one compliance near-miss, four days of engineering time, one uncomfortable meeting.

What operator-ready means vs. what production-ready means

Production-ready is about your system. Uptime, latency, error rates. Metrics you control.

Operator-ready is about what happens when your system runs inside someone else's organization, with their usage patterns, their cost constraints, their compliance requirements, their data.

The three incidents above were all foreseeable. The operator's usage pattern is discoverable before handoff (just ask them). Per-tenant cost attribution is a gateway configuration decision that takes a day to implement. Audit log requirements for regulated industries are documented in their compliance frameworks.

We didn't discover any of these things before the handoff because we didn't ask the right questions before the handoff.

The pre-operator readiness checklist I run now

Before any enterprise agent deployment, I go through five operational questions. These are not code reviews. They're operational configuration checks.

Rate limit sizing: What is the operator's expected peak usage, and what is our per-provider rate limit at that peak? If peak usage exceeds 60% of the rate limit, configure burst handling or multi-provider routing before go-live.

Cost attribution: Is every request tagged with the minimum attribution set (operator, team, use-case) at the gateway layer? If not, implement it before the first request. Do not add it retroactively.

Audit log schema: Does the operator operate in a regulated industry? If yes, map their compliance requirements to a specific log schema before deployment. Generic trace spans do not satisfy financial services, healthcare, or legal audit requirements.

Failover configuration: Is there a secondary provider configured for automatic failover? If not, is there a documented manual procedure and a stated SLA for the outage window?

Cost ceiling configuration: Is there a per-tenant monthly budget ceiling with an alert at 70% of budget? Not a per-account ceiling. Per tenant. An over-active team should not consume another team's budget.
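
The rate limit question is the easiest of the five to turn into a mechanical check. A sketch, with a hypothetical function name and made-up request rates. The 4x peak ratio mirrors incident one.

```python
def needs_burst_handling(peak_rpm: float, provider_limit_rpm: float,
                         max_utilization: float = 0.60) -> bool:
    """True when the operator's expected peak uses more than
    max_utilization of the provider rate limit."""
    if provider_limit_rpm <= 0:
        raise ValueError("provider_limit_rpm must be positive")
    return peak_rpm / provider_limit_rpm > max_utilization

average_rpm = 100             # hypothetical average load
peak_rpm = 4 * average_rpm    # peak was 4x average in incident one
print(needs_burst_handling(peak_rpm, provider_limit_rpm=500))
```

Sizing against the average (100 of 500) would have passed this check comfortably, which is exactly how we missed it.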

These five checks take about two hours to complete and review. Three incidents in month one took about three weeks to fully resolve, plus the relationship cost.

What I'd page on

If I were setting up the alert set for a new enterprise agent deployment, these are the four alerts I'd configure first:

Alert one: per-tenant request rate above 80% of the tenant's configured rate limit. Not 100%. Leave 20% headroom to investigate before hitting the ceiling.

Alert two: per-request cost moving average above threshold for any single tenant (set the threshold based on the expected per-request cost, alert at 3x). Catches batch jobs and runaway loops before month-end.

Alert three: agent response with no trace ID in the audit log. Means the audit trail has a gap. You want to know about this in real time, not when a compliance officer asks.

Alert four: first-request p99 latency above 2x the steady-state p99. Catches cold-start regressions before the operator's peak usage hits them.
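
Alert two is the one people ask me to spell out. Here is a minimal sketch of a per-tenant moving average with a 3x threshold. The class name, window size, and costs are assumptions, and a real deployment would more likely express this as a query in the metrics backend than as application code.

```python
from collections import defaultdict, deque

class TenantCostMonitor:
    """Per-tenant moving average of request cost, alerting at a multiple
    of the expected per-request cost."""

    def __init__(self, expected_cost: float, window: int = 50, multiplier: float = 3.0):
        self.threshold = expected_cost * multiplier
        self.window = window
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def observe(self, tenant: str, cost_usd: float) -> bool:
        """Record one request; return True when the tenant should alert."""
        costs = self._recent[tenant]
        costs.append(cost_usd)
        if len(costs) < self.window:
            return False  # not enough samples for a stable average yet
        return sum(costs) / len(costs) > self.threshold

monitor = TenantCostMonitor(expected_cost=0.01, window=3)
results = [monitor.observe("tenant-a", c) for c in (0.01, 0.01, 0.01, 0.10)]
print(results)
```

Keeping the window per tenant is the point: one expensive tenant cannot hide behind the average of the quiet ones.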

None of these alerts require custom infrastructure. They require that your gateway and your agent emit the right metadata on every request.

Get the metadata right before handoff. Fix the alerts before go-live. The 3am page is not a production incident. It's a pre-deployment checklist item you deferred.

The Langfuse migration that cost us a sprint: how I now budget LLM observability

Jasmine Park — Fri, 26 Jun 2026 21:37:56 +0000

We moved off our first tracer in month eight. The migration took one engineer the better part of a sprint, because the trace data lived in a schema we did not own. Nobody costed that line item on day one. I am writing this so you can.

I run reliability for a small team shipping LLM features. When the pager goes off at 2am, I do not care which dashboard is prettiest. I care about two numbers: what this tool costs me per month, and what it costs me to leave. Those two numbers are the whole story, and they are almost never on the comparison page.

So here are six Langfuse alternatives. For each I tracked both numbers: the monthly bill on the invoice, and the exit bill that only shows up the day you migrate. I compared Helicone, Arize Phoenix, LangSmith, Braintrust, Laminar, and Future AGI traceAI. They all trace LLM calls (prompts, tokens, retrieval spans, latency). The axis that decides your exit cost is whether the trace format is OpenTelemetry-native or a vendor schema. Get that wrong and the migration bill lands later, with interest.

The cost nobody puts on the pricing page

Your monthly invoice is the visible cost. The exit cost is the invisible one: re-instrumenting the app, rebuilding integrations, and losing historical traces when the schema does not travel. If your spans are OTel, the exit cost trends toward zero because the data is portable by construction. If they are proprietary, you are paying a deferred bill every month you stay. Sort on that first.

Helicone. The gateway-first option. You proxy model calls through it and get logging, cost tracking, and analytics with almost no code change. Apache-2.0, self-hostable, roughly 5,800 GitHub stars as of June 2026. On pure observability ergonomics this is one of the strongest picks, and the proxy model means low setup cost. The thing to watch at scale: a gateway in the request path is one more hop to reason about when latency spikes.

Arize Phoenix. The open-source OTel option. Tracing plus evals, self-hostable, around 10,000 stars as of June 2026. Because it is OTel-native, your exit cost stays low. The commercial Arize AX tier adds ML monitoring and enterprise features. If portability is your top line, this and traceAI are the two that keep the invisible bill near zero.

LangSmith. The LangChain-native option. If you live in LangChain or LangGraph, instrumentation is automatic and the developer experience is strong. Proprietary and closed-source, tightly coupled to the LangChain ecosystem. This is the most lock-in of the group: the day-one cost is the lowest, the day-200 cost is the highest. Worth it only if you are certain you are never leaving LangChain.

Braintrust. The polished SaaS option. One of the better eval and observability experiences, and the people who do not page (PMs, leads) tend to like the UI. Proprietary trace schema, closed-source, managed by default. Even on enterprise deployments you operate inside their format, so the exit cost stays on the books.

Laminar. The newer open-source entrant. OTel-based tracing with evals, smaller and younger than Phoenix, in the low-thousands of stars as of June 2026. Lower lock-in on the same OTel logic. The cost to weigh here is maturity, not portability: a smaller project means fewer battle-tested edges, which matters more for an on-call rotation than a demo.

Future AGI traceAI. The instrumentation-layer option. Worth being precise here, because it is not the same kind of thing as the others. traceAI is not an observability dashboard. It is an Apache-2.0, OpenTelemetry-native instrumentation SDK (pip install fi-instrumentation-otel) that emits portable OTel spans for 50-plus frameworks as of June 2026. The spans go wherever you point your collector. Future AGI's broader platform adds evals on top (50-plus metrics under one evaluate() call as of June 2026), but on raw observability ergonomics Helicone and Phoenix are more mature dashboards. Where traceAI earns its place on this list is the exit-cost column: because it speaks OTel, the cost of leaving is roughly the cost of changing a collector endpoint. Code: github.com/future-agi/traceAI.

The two numbers, side by side

Visible cost is easy: read the pricing page, multiply by your span volume, done. Invisible cost is the one that bit me. The open-source OTel tools (Phoenix, Laminar, traceAI as the instrumentation layer) keep your exit near free. The proprietary ones (LangSmith, Braintrust) front-load convenience and back-load the migration. Helicone sits in between: open and portable, with a proxy hop to account for. Pick the lock-in profile you can afford in month eight, then argue about features.

What I'd page on

If I were standing this up again, here is the dashboard and alert set I would build before I cared about anything else:

Trace export success rate below 99 percent over 5 minutes. A silent collector drop is invisible until you need the trace you do not have.

Span ingestion cost per day trending above your budget line. Token spend gets watched; span volume does not, and it scales with traffic too.

P99 added latency from the tracing path above your SLO budget. If the tracer (or proxy) adds tail latency, that is a reliability cost masquerading as observability.

Percent of spans in a portable (OTel) format. This is your exit-cost gauge. If it drifts down because someone added a proprietary integration, you just took on migration debt. Page on it before it compounds.

Dropped-trace rate during incidents specifically. Tracing tends to fail exactly when load is highest, which is exactly when you need it. Alert on the correlation, not just the absolute.
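
Two of those are plain ratios, so here is a sketch of how I would compute them from counters. The function names, the "otel" format label, and the sample counts are hypothetical; your collector will expose its own counter names.

```python
def export_success_rate(exported: int, attempted: int) -> float:
    """Share of spans the exporter delivered over the window."""
    return exported / attempted if attempted else 1.0

def portable_share(spans_by_format: dict) -> float:
    """Share of spans emitted in OTel format; the exit-cost gauge."""
    total = sum(spans_by_format.values())
    return spans_by_format.get("otel", 0) / total if total else 1.0

def observability_alerts(exported, attempted, spans_by_format,
                         min_export=0.99, portable_baseline=1.0):
    alerts = []
    if export_success_rate(exported, attempted) < min_export:
        alerts.append("trace export success below target")
    if portable_share(spans_by_format) < portable_baseline:
        alerts.append("portable span share dropped")
    return alerts

# Made-up 5-minute window: some drops, and a new proprietary integration.
fired = observability_alerts(9_850, 10_000, {"otel": 9_000, "vendor_sdk": 1_000})
print(fired)
```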

Build those five first. The dashboard you actually page on is cheaper than the migration you did not plan for.

The gateway tax: 6 OpenAI-compatible gateways

Jasmine Park — Fri, 26 Jun 2026 21:35:09 +0000

On March 14, 2026, our LLM bill came in at $9,140 for the month, up from about $5,200, and I could not tell you which team spent it. The gateway in front of every provider emitted one cost line and one trace span per request, all tagged service=llm-gateway, so the platform team ate the whole overage in the FinOps review while three product teams shrugged.

That month is the reason I now treat cost attribution as a gateway design decision, not an afterthought. If you cannot answer "which team, which feature, which key spent this" from the layer every call already passes through, you will answer it never. This is a comparison of the OpenAI-compatible LLM gateways I have evaluated for exactly that job: LiteLLM, Portkey, Helicone, Cloudflare AI Gateway, and Bifrost, plus one newer open-source entrant I introduce in the comparison table below. The lens is an SRE lens. What does it cost you in p99, and how granularly can you bill it back.

TL;DR

Cost attribution belongs at the gateway, not in each app's SDK and not in your provider's dashboard. The gateway is the one chokepoint every call crosses, so it is the only place where per-team, per-feature, per-key spend is both complete and consistent.

Every OpenAI-compatible gateway you put in that path adds latency. Call it the gateway tax. It is real, it is usually single-digit milliseconds at the proxy hop, and it varies with what you turn on (caching, guardrails, semantic lookups). The tax is not the deciding factor for most teams, because provider latency dwarfs it. What actually differs across gateways, by a lot, is attribution granularity: whether you can slice spend by virtual key, by route, by user, and whether the cost shows up as a first-class OpenTelemetry span attribute or as a number you have to scrape out of a dashboard later.

So the decision rule is short. Pick the gateway whose tax you can afford at your p99 budget, and whose attribution you can actually bill against. Most teams over-index on the first half and never check the second. Then March happens.

One honesty note up front, because it matters for how you read everything below. We did not re-run a latency benchmark across these six gateways on one rig. Anybody who hands you a clean cross-vendor p99 table either ran a heroic apples-to-apples harness (rare) or is quietly comparing numbers each vendor measured on different hardware against different upstreams (common). Where I cite latency, it is the vendor's own published number, labeled as such. The capability columns (self-host, caching type, attribution granularity, OTel-native, guardrails, license) are checked against each project's public docs and READMEs, because those are verifiable and they are what you will actually live with.

Why not the app SDK, and why not the provider dashboard

Before the table, kill the two alternatives, because most teams reach for one of them first and it is why their numbers never reconcile.

Cost attribution does not belong in each app's SDK. The pitch is seductive: every service instruments its own OpenAI client, tags spend with its own team name, ships it to your metrics backend. In practice you now have N implementations of "compute token cost" drifting against each other. One team is on an old pricing table. One forgot to count cached input tokens at the discounted rate. One service calls the provider directly in a cron job and bypasses instrumentation entirely, so that spend is simply invisible. When the provider changes per-token pricing (they do, quietly), you are editing N codebases to stay correct. SDK metering is great for in-process latency spans. It is a bad system of record for dollars, because the source of truth is smeared across every repo and every deploy cadence.

Cost attribution does not belong in the provider dashboard either. The OpenAI or Anthropic billing console knows your org spent the money. It does not know your org chart. It cannot tell you that team-checkout spent $4k and team-search spent $300, because your teams are not a concept the provider has. The best you get is per-API-key, and only if you had the discipline to mint one key per team up front and never share them, which under load nobody does. Multi-provider makes it worse: now you are stitching three billing consoles, three export formats, three currencies of "cost," into one spreadsheet a human maintains by hand. That spreadsheet is wrong by the second week.

The gateway is the only layer that sees every request, knows which credential made it, can compute cost once against one pricing table, and can stamp that cost onto a span before the response leaves the building. That is the whole argument. Now, which gateway.

Definitions, so the table means something

Two terms do all the work in this post. Pin them down before you read the comparison.

Cost-attribution granularity is the finest dimension along which the gateway can split spend without you doing post-hoc log surgery. I rank it in three tiers:

Per-key: the gateway issues virtual keys (its own keys, mapped to upstream provider keys) and tracks spend and budget per virtual key. You hand team-checkout a virtual key, and its spend is isolated. This is the floor for billing back, and honestly it is enough for most orgs.
Per-route / per-model: spend split by which model or endpoint served the call, so you can see that GPT-4-class traffic is 80% of cost while being 10% of calls.
Per-user / per-metadata: arbitrary tags (end-user id, feature flag, tenant) attached at request time and queryable later. This is what you need for usage-based billing to your customers, not just internal chargeback.

A gateway that only gives you per-key is fine for internal FinOps. A gateway that gives you per-user metadata is what you need if you resell LLM features and bill your customers by usage.
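
The three tiers are the same roll-up over different dimensions of one record. A small sketch, assuming each gateway log record carries a virtual key, a model, free-form metadata, and a cost. The field names and figures are invented.

```python
from collections import defaultdict

def rollup(records, dimension):
    """Sum cost_usd grouped by whatever dimension(record) returns."""
    totals = defaultdict(float)
    for record in records:
        totals[dimension(record)] += record["cost_usd"]
    return dict(totals)

# Hypothetical gateway records.
records = [
    {"virtual_key": "team-checkout", "model": "large", "metadata": {"end_user": "u1"}, "cost_usd": 4.0},
    {"virtual_key": "team-checkout", "model": "small", "metadata": {"end_user": "u2"}, "cost_usd": 0.5},
    {"virtual_key": "team-search", "model": "small", "metadata": {"end_user": "u1"}, "cost_usd": 0.25},
]

per_key = rollup(records, lambda r: r["virtual_key"])            # tier 1
per_model = rollup(records, lambda r: r["model"])                # tier 2
per_user = rollup(records, lambda r: r["metadata"]["end_user"])  # tier 3
print(per_key, per_model, per_user)
```

A gateway that stops at tier 1 only ever stored the first dimension, so the other two roll-ups are not recoverable later without log surgery.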

The gateway tax is the latency the gateway hop adds on top of provider latency. It has a floor (the proxy itself: parse, auth, route, re-serialize) and a variable part (every feature you enable adds a little: an exact-cache lookup is cheap, a semantic-cache vector search is not free, each inline guardrail is a synchronous scan). The tax is paid on every request that is not a cache hit. On a cache hit you skip the provider entirely and the gateway saves you latency, which is the one case where the tax goes negative. The mistake teams make is benchmarking the bare proxy, seeing 2 ms, and budgeting as if guardrails and semantic cache are free. They are not. Measure the tax with your real feature set on, or do not quote it.
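
The floor-plus-features framing can be written as a back-of-envelope model. This is a sketch with invented millisecond figures, not a measurement, and it assumes a cache hit still pays the gateway's own processing while skipping the provider.

```python
def expected_latency_delta_ms(proxy_floor_ms: float, feature_costs_ms: dict,
                              provider_ms: float, cache_hit_rate: float) -> float:
    """Average latency change per request versus calling the provider directly.

    Positive means the gateway costs you time; negative means the cache
    is saving more than the hop adds.
    """
    tax = proxy_floor_ms + sum(feature_costs_ms.values())
    on_miss = tax               # full provider call plus the gateway hop
    on_hit = tax - provider_ms  # gateway hop only, provider skipped
    return cache_hit_rate * on_hit + (1 - cache_hit_rate) * on_miss

features = {"semantic_cache_lookup": 8.0, "guardrail_scan": 4.0}  # hypothetical
bare = expected_latency_delta_ms(2.0, {}, provider_ms=800.0, cache_hit_rate=0.0)
loaded = expected_latency_delta_ms(2.0, features, provider_ms=800.0, cache_hit_rate=0.0)
with_hits = expected_latency_delta_ms(2.0, features, provider_ms=800.0, cache_hit_rate=0.1)
print(bare, loaded, with_hits)
```

The gap between bare and loaded is the budgeting mistake described above: the bare proxy number says nothing about the feature set you will actually run.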

And again, the number you measure on your rig is not comparable to the number a vendor measured on theirs. Different CPU, different upstream, different concurrency, different request body size. Treat every cross-vendor latency claim, including the ones in this post, as directional.

The comparison

Read this as capabilities first, latency last. The capability columns are what you live with daily. The latency figures are vendor-published and not re-run by us, so they live in the notes under the table rather than in a column, and they are the least load-bearing thing here.

Gateway	Self-host?	Caching (exact / semantic)	Cost-attribution granularity	OTel-native?	Inline guardrails?	License	Verdict
LiteLLM	Yes	Exact (Redis/in-mem/disk/S3/GCS) + semantic (Qdrant/Redis)	Per-key, per-team, per-user (virtual keys + budgets + spend tags)	Via OTel callback/integration	Via plugins + Guardrails hooks	MIT (OSS); paid Enterprise tier	Broadest provider + ecosystem coverage. Default pick if you want the biggest model zoo.
Portkey	Yes (gateway is OSS; full platform is SaaS)	Simple (exact) + semantic	Per virtual key + metadata tags; rich SaaS dashboards	Partial / via integrations	Yes (integrated Guardrails)	Gateway MIT; platform proprietary SaaS	Most polished managed dashboards and config UI. Default if you want a hosted control plane, not a DIY one.
Helicone	Yes (self-host available)	Exact-match only (cache-key hash)	Custom properties (per-user / per-feature) via metadata; per-key	OTLP ingest (observability-first)	Limited / not the focus	OSS (observability platform)	Observability-first, not a routing-heavy gateway. Default if logging + analytics is the job.
Cloudflare AI Gateway	No (Cloudflare edge, cloud-only)	Caching (exact); no documented semantic cache	Per-request analytics, basic metadata; provider/token/cost metrics	No documented OTel export	Not the focus	Proprietary (managed service)	Zero-ops edge gateway. Default if you are already all-in on Cloudflare and want one toggle.
Bifrost	Yes	Semantic caching (exact also supported)	Hierarchical budgets: virtual keys, teams, customers	Yes (Prometheus + OTel/tracing)	Yes (plugin middleware / enterprise guardrails)	Apache-2.0 (Go)	Fast Go OSS gateway with strong budget hierarchy. Default if you want OSS + native budgets and live in Go.
Future AGI Agent Command Center	Yes (single Go binary)	Exact (6 backends) + semantic (4 backends)	Per virtual key budgets/quotas + per-request cost on the span	Yes, OTel-native (W3C trace context) + Prometheus `/metrics`	Yes, 18 built-in scanners + external adapters	Apache-2.0 (Go)	End-to-end OSS platform where the gateway is one piece beside eval/observability. Default if you want OTel + Prometheus + caching + guardrails in one binary.

Notes on the latency column, deliberately kept out of the table because it is not comparable: LiteLLM publishes proxy-overhead figures in the single-digit-millisecond range on their own harness; Future AGI publishes a vendor benchmark of roughly +1.4 ms P95 added by three inline guardrails and a lower added-latency figure than LiteLLM measured on Future AGI's own rig (their numbers, their methodology, not verified by us); Bifrost publishes its own low-microsecond internal-selection numbers. None of these were measured against each other. Do not put them in a slide as if they were.

Gateway by gateway

LiteLLM

The one with the longest provider list and the deepest ecosystem. If a model exists, LiteLLM probably has a route to it, and the litellm SDK is already in half the agent frameworks you will touch. For attribution it is genuinely strong: virtual keys, budgets, and spend tracking down to key, team, and user, plus cache (exact via Redis and friends, semantic via Qdrant). OpenTelemetry is available through its callback/integration system rather than being the native wire format, which means you wire it up rather than getting it for free. The tax is the usual proxy hop; LiteLLM publishes single-digit-ms overhead on their own harness. The cost of all that breadth is configuration surface: there is a lot of it, and a lot of ways to hold it wrong.

Choose LiteLLM when your priority is provider coverage and ecosystem fit, and you have someone who will own the config.

Portkey

The most polished managed experience. The gateway core is open source and you can run it with npx @portkey-ai/gateway, but the part people actually pay for is the hosted control plane: the dashboards, the config UI, the virtual-key and metadata management without you standing up storage. Caching is simple plus semantic, guardrails are integrated, attribution is per-virtual-key plus metadata tags. If you want to hand a non-platform team a screen where they can see their own spend without you building it, Portkey is the shortest path. The trade is that the nice parts are SaaS and proprietary, so the dependency is on Portkey-the-company, not just Portkey-the-binary.

Choose Portkey when you want a managed control plane and dashboards out of the box, and SaaS dependency is acceptable.

Helicone

Observability-first. Helicone is excellent at logging every request, tagging it with custom properties, and giving you analytics over that, including per-user and per-feature cost slicing via metadata. Caching is exact-match only (the cache key is a hash of URL, body, and relevant headers, so "Hello" and "Hi" are different entries). It is self-hostable and open source, and it leans into OTLP-style ingest because its center of gravity is the observability plane, not heavy multi-provider routing or failover. If your real problem is "I cannot see what my LLM calls are doing," Helicone solves that cleanly. If your real problem is "I need 15 routing strategies and inline guardrails," it is not aimed there.

Choose Helicone when logging, analytics, and per-feature cost visibility are the job and routing is secondary.

Cloudflare AI Gateway

The zero-ops option. It runs on Cloudflare's edge, so there is no binary to operate and no single point of failure (SPOF) you own (you inherit Cloudflare's). It does caching and gives you analytics: request counts, tokens, cost. What you do not get, per the public docs, is self-hosting, a documented OpenTelemetry export, or deep per-team attribution beyond request-level metadata. It is the right answer when you are already on Cloudflare, you want one dashboard and one toggle, and your attribution needs stop at "roughly how much, roughly where."

Choose Cloudflare AI Gateway when you want a managed edge gateway with near-zero ops and you already live on Cloudflare.

Bifrost

A fast Go OSS gateway (Apache-2.0) with a genuinely good cost model: hierarchical budgets across virtual keys, teams, and customers, which maps cleanly onto chargeback. It ships native Prometheus metrics and distributed tracing / OTel, semantic caching, and a plugin middleware system for analytics and guardrail-style logic. It is newer and the ecosystem is smaller than LiteLLM's, so you trade provider breadth for a tight, performant core and a budget hierarchy that is built in rather than bolted on.

Choose Bifrost when you want OSS, native budget hierarchy, and Prometheus + OTel, and you are comfortable in the Go ecosystem.

Future AGI Agent Command Center

An OpenAI-compatible gateway shipped as a single Go binary, Apache-2.0, open source (repo at github.com/future-agi). As of June 2026 it ships 15 routing strategies, two-tier caching (6 exact-match backends and 4 semantic backends), and 18 built-in guardrail scanners plus adapters for external guardrail vendors. The piece that matters for this post: it is OpenTelemetry-native using W3C trace context and also exposes a Prometheus /metrics endpoint, and it tracks per-virtual-key budgets and quotas, so cost can ride on the span rather than living only in a dashboard. It also ships a committed, reproducible benchmark harness (a bench/ directory with a mock upstream), which I respect more than a marketing number, because it means you can re-run their claim instead of trusting it.

On their own published benchmark (vendor numbers, not verified by us), three inline guardrails add roughly +1.4 ms at P95, and they claim lower added latency than LiteLLM measured on their rig. Same caveat as everywhere else: their hardware, their upstream, their methodology. The honest positioning: LiteLLM still has the broadest provider and ecosystem coverage, and Portkey has the more polished managed SaaS and dashboards. Future AGI's actual edge is that the gateway is one component of an end-to-end open-source platform that also does eval and observability, with native OTel plus Prometheus and built-in caching and guardrails in a single binary, so you are not assembling four tools to get attribution onto a span.

Choose Agent Command Center when you want OTel + Prometheus + caching + guardrails in one OSS binary, and you value the gateway being part of one eval/observability platform.

The diagram you should draw on your whiteboard

Figure: the gateway is the one layer every call crosses. Stamp cost on the OpenTelemetry span at GOVERN/COST and attribution stays complete and consistent.

The single most important thing in that diagram is where the span is emitted. It is emitted inside the gateway, at the govern/cost control point, after the gateway has resolved the credential and computed the cost. That is what makes attribution complete (every call crosses it) and consistent (one pricing table, one cost function). Move that emission into each app and you reintroduce every drift problem from the "why not the SDK" section above.

Honest limitations: where every one of these adds risk

No gateway is free of downside. If you put one in your hot path, you have signed up for these, regardless of vendor.

Single point of failure. Every request now depends on the gateway being up. A managed edge service (Cloudflare) trades your SPOF for theirs, which may be a better or worse bet than your own uptime. A self-hosted binary (LiteLLM, Bifrost, Future AGI) is yours to make HA: run more than one replica, put a real load balancer in front, and test failover before you need it. "We deployed one gateway pod" is not a control plane, it is an incident waiting for a node drain.

Cache poisoning and stale answers. Semantic caching is the feature most likely to bite you. A vector-similarity hit can return a cached answer for a prompt that is close but not equivalent, and now one user sees another user's response, or a stale answer to a changed question. Exact caching is safer but still leaks across users if your cache key does not include the right scoping. Scope cache keys per tenant where correctness matters, and keep semantic caching off for anything with PII or per-user state until you have measured the false-hit rate.
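To make the scoping advice concrete, here is a minimal sketch of an exact-match cache key that includes the tenant. It is not any gateway's actual key format; the function name cache_key and its fields are assumptions for illustration only.

```python
import hashlib
import json


def cache_key(tenant_id: str, model: str, prompt: str, params: dict) -> str:
    """Build an exact-match cache key scoped to one tenant (hypothetical helper)."""
    payload = json.dumps(
        {"tenant": tenant_id, "model": model, "prompt": prompt, "params": params},
        sort_keys=True,          # stable ordering so identical requests hash identically
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same prompt, same model, different tenants: the keys must not collide.
key_a = cache_key("tenant-a", "gpt-4o", "Summarize my invoice", {"temperature": 0})
key_b = cache_key("tenant-b", "gpt-4o", "Summarize my invoice", {"temperature": 0})
print(key_a != key_b)
```

Because the tenant is part of the hashed payload, a hit for tenant-a can never be served to tenant-b, at the price of a lower hit rate for prompts that really are shareable.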

Span-cardinality blowup. The fix for attribution (rich tags on every span) is also the way you melt your metrics backend. Put end_user_id as a label on a Prometheus metric and you have just created one time series per user. That is a cardinality bomb. Keep high-cardinality identifiers (user id, request id) on traces and logs, where high cardinality is fine, and keep your metric labels low-cardinality (team, model, provider, cache_hit). Conflating the two is the most common way an attribution rollout pages the observability team instead of the FinOps team.
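A rough way to see the bomb before you ship it is to multiply out the label values. The sketch below computes a worst-case series count for one metric; the label counts (20 teams, 50,000 users, and so on) are made-up numbers for illustration, not measurements.

```python
from math import prod


def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label counts, chosen only to show the order of magnitude.
low_cardinality = {"team": 20, "model": 8, "provider": 4, "cache_hit": 2}
with_user_id = {**low_cardinality, "end_user_id": 50_000}

print(estimated_series(low_cardinality))  # 1280
print(estimated_series(with_user_id))     # 64000000
```

The product is an upper bound, since not every user touches every team and model, but the floor is still at least one series per user, which is already far above the 1,280 the low-cardinality labels allow.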

A pasteable artifact: per-key budget plus OTel export

Here is a minimal, runnable setup for one gateway (LiteLLM, because its config is the most widely deployed and the spend tracking is mature), showing a per-virtual-key budget and OpenTelemetry export, plus the queries that turn it into a bill-back.

docker-compose.yml:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgres://litellm:litellm@db:5432/litellm
      # Send OTel spans to your collector
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
      POSTGRES_DB: litellm
    volumes:
      - litellm-pg:/var/lib/postgresql/data

volumes:
  litellm-pg:

config.yaml:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  # Emit an OpenTelemetry span per request, with cost + tokens as attributes.
  callbacks: ["otel"]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  # With a database configured, spend is persisted so it can be queried per key/team/user.
  database_url: os.environ/DATABASE_URL
  # Optional: also lets models added through the API/UI be stored in the database.
  store_model_in_db: true

Mint a virtual key for one team, with a hard monthly budget, so March cannot happen silently:

curl -s http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "team-checkout",
        "models": ["gpt-4o"],
        "max_budget": 500,
        "budget_duration": "30d",
        "metadata": {"team": "checkout", "cost_center": "cc-4471"}
      }'

That key now refuses traffic once team-checkout crosses $500 in a 30-day window, and every call it makes carries team=checkout into the spend store and onto the OTel span.

Attributing spend to a team comes from the gateway's own spend store. With LiteLLM's spend logs in Postgres, the bill-back for last month is one query:

SELECT
  metadata ->> 'team'      AS team,
  COUNT(*)                 AS requests,
  ROUND(SUM(spend)::numeric, 2) AS usd
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= date_trunc('month', now()) - interval '1 month'
  AND "startTime" <  date_trunc('month', now())
GROUP BY 1
ORDER BY usd DESC;

And for the live alerting view, scrape low-cardinality cost metrics into Prometheus and rank trailing-30-day spend by team (a [30d] range is a rolling window, not the calendar month). With a gateway that exposes a per-team cost counter (label team, deliberately low-cardinality), the PromQL is:

topk(5,
  sum by (team) (
    increase(llm_gateway_cost_usd_total[30d])
  )
)

Keep team, model, and provider as metric labels. Keep end_user_id and request_id out of metrics and on the trace instead. That one discipline is the difference between an attribution dashboard and a cardinality incident.

Paste this into your PRD

A scenario matrix for the decision review, so the next person does not re-derive it.

Scenario	Priority	Default pick	Escalate to	Why
Internal chargeback, many providers	Provider breadth + per-team spend	LiteLLM	Bifrost (if you want native budget hierarchy in Go)	Biggest model zoo, mature virtual keys and spend tracking; budgets get you per-team bill-back.
Non-platform teams need their own spend screen	Managed dashboards, low build cost	Portkey	LiteLLM self-host (if SaaS dependency is a no)	Hosted control plane and config UI mean you do not build the dashboard yourself.
"I cannot see what my LLM calls do"	Logging + per-feature cost visibility	Helicone	Future AGI ACC (if you also need routing + guardrails)	Observability-first with custom-property attribution; exact-match cache.
Already on Cloudflare, want near-zero ops	One toggle, no binary to run	Cloudflare AI Gateway	Any self-hosted gateway (when you outgrow request-level attribution)	Edge-managed, no SPOF you operate; attribution stops at request-level metadata.
Want OTel + Prometheus + cache + guardrails in one OSS binary	One platform, attribution on the span	Future AGI Agent Command Center	LiteLLM (for wider provider coverage) or Portkey (for managed dashboards)	Native OTel (W3C) + Prometheus, two-tier cache, 18 guardrail scanners in one Go binary, part of an eval/observability platform.
Resell LLM features, bill your customers per seat	Per-user / per-metadata attribution	LiteLLM or Portkey (rich metadata)	Helicone (for the analytics layer on top)	You need arbitrary per-user tags queryable later, not just per-key.

What I'd page on

This is the on-call checklist for a gateway in your hot path. If you adopt one of these gateways and do not wire these alerts, you are flying blind and the next $9k month is already in flight.

Gateway p99 latency, by route. Page if p99 of the gateway-added overhead (gateway span duration minus upstream span duration) exceeds your budget for 5 minutes. This is the gateway tax going bad. Separate the proxy hop from provider latency or you will blame the wrong layer at 2am.
Gateway error rate and saturation. Page on 5xx rate from the gateway above baseline, and on CPU saturation, because at high concurrency CPU is the bottleneck, not the network. A saturated gateway fails every team at once.
Per-team budget burn. Page (or auto-throttle) when any virtual key crosses, say, 80% of its monthly budget before the month is 80% over. This is the alert that would have caught March on March 6, not March 31.
Total spend rate-of-change. Page on day-over-day total LLM spend up more than X%. A runaway retry loop or a new feature shipping uncapped shows up here first, hours before the invoice.
Cache hit rate drop. Page if cache hit rate falls below your assumed floor, because your cost model and your latency budget both silently assumed those hits. A cache that quietly stopped hitting is a bill increase and a latency regression in one.
Semantic-cache false-hit signal. If you run semantic caching on anything user-facing, alert on user reports or eval-detected wrong answers correlated with cache hits. This is correctness, not cost, and it is the one that becomes a postmortem instead of a FinOps slide.
Span cardinality / metrics ingestion. Page if your metrics backend's active series count jumps after a deploy. That is usually someone putting a user id on a metric label. Catch it before it takes down the observability stack.
Provider failover events. Alert (not page) when the gateway fails over between providers, so a silent provider degradation does not hide inside your routing logic until the bill from the more expensive fallback shows up.
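The per-team budget burn rule in the list above is the one most worth prototyping before you encode it in an alerting system. A minimal sketch, assuming a calendar-month budget and the 80% threshold from the checklist; the function name and the sample numbers are hypothetical.

```python
import calendar
from datetime import date


def burning_ahead_of_pace(spend_usd: float, budget_usd: float, today: date,
                          threshold: float = 0.8) -> bool:
    """True when spend crossed `threshold` of budget before that share of the month elapsed."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    month_elapsed = today.day / days_in_month
    budget_used = spend_usd / budget_usd
    return budget_used >= threshold and month_elapsed < threshold


# $410 of a $500 budget gone by the 6th: page. The same spend on the 28th: on pace.
print(burning_ahead_of_pace(410, 500, date(2026, 3, 6)))
print(burning_ahead_of_pace(410, 500, date(2026, 3, 28)))
```

If your gateway budgets are rolling windows (like the 30d key above) rather than calendar months, swap the month-elapsed fraction for the fraction of the window elapsed since the last reset.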

Pick the gateway whose tax you can afford and whose attribution you can bill against. Then wire the eight alerts above, because the gateway is now load-bearing infrastructure, and load-bearing infrastructure gets a pager.

Capability claims here reflect each project's public docs and READMEs as of June 2026. Latency figures are vendor-published on each vendor's own harness, not re-run on a common rig, and are not comparable across vendors. Future AGI's gateway (Agent Command Center) is open source at github.com/future-agi.

Langfuse alternatives: 6 LLM observability tools, sorted by the thing that bites you in month eight

Jasmine Park — Fri, 19 Jun 2026 09:33:19 +0000

TL;DR

I went looking for Langfuse alternatives after living with a proprietary tracer for eight months and then paying to migrate off it.

I compared six options:

Helicone
Arize Phoenix
LangSmith
Braintrust
Laminar
Future AGI traceAI

They all trace LLM calls.

The axis that actually mattered was OpenTelemetry-native (OTel) vs proprietary tracing, because that's what determines whether you can leave without re-instrumenting everything.

Four of the six are open-source, ranging from roughly 200 to 10,000 GitHub stars (June 2026). That spread turned out to predict almost nothing about what I actually cared about: portability.

The Axis That Bites You in Month Eight: Whose Traces Are These?

Every tracer captures:

LLM calls
Prompts
Tokens
Retrieval events

The question nobody asks on day one and everybody regrets on day 200:

Is the trace format OpenTelemetry (portable) or the vendor's own schema (locked to their dashboard)?

If it's proprietary, switching tools later often means:

Re-instrumenting your application
Rebuilding integrations
Losing historical trace data

I learned this the expensive way.

So today I sort observability tools by lock-in first and features second.

The Six Tools, Sorted by Lock-In

1. Helicone

The gateway-first open-source pick.

Often the first Langfuse alternative people mention.

You proxy model calls through Helicone and get:

Logging
Cost tracking
Analytics

with very little code change.

Highlights

Open-source (Apache-2.0)
Self-hostable
~5,800 GitHub stars (June 2026)

Best for: teams that want a fast observability layer with minimal engineering effort.

2. Arize Phoenix

The open-source OTel pick.

Phoenix combines:

OTel-based tracing
Evaluations
Self-hosted deployment

The core project is free and open-source.

Arize's commercial offering (AX) adds:

Enterprise capabilities
Advanced ML monitoring

Highlights

Open-source
OTel-native
Self-hosted
~10,000 GitHub stars (June 2026)

Best for: teams that want portable tracing and full ownership.

3. LangSmith

The LangChain-native pick.

If you're already using LangChain or LangGraph, LangSmith provides:

Automatic instrumentation
Deep framework integration
Strong developer experience

The tradeoff is coupling.

Highlights

Proprietary
Closed-source
Closely tied to the LangChain ecosystem

Best for: teams fully committed to LangChain.

Most lock-in of the group.

4. Braintrust

The polished SaaS pick.

Braintrust has one of the strongest experiences for:

Evaluations
Observability
Cross-functional visibility

Non-technical stakeholders tend to like the UI.

Highlights

Proprietary trace schema
Closed-source
Managed SaaS by default
Enterprise deployment options available

Even in enterprise deployments, you're still operating within their trace format.

Best for: organizations that prioritize product polish over portability.

5. Future AGI traceAI

The no-lock-in instrumentation pick.

traceAI is different from the others.

It is not an observability platform.

It is an Apache-2.0 OpenTelemetry instrumentation layer that captures:

LLM calls
Prompts
Tokens
Retrieval
Agent steps

and exports them to any OTel-compatible backend:

Datadog
Grafana
Jaeger
Vendor platforms

In other words:

It handles instrumentation, not dashboards.

If you want a polished product out of the box, tools like Langfuse or Helicone are more complete.

If you want portable traces that you own, this is the lightest approach I found.

Highlights

Apache-2.0
OTel-native
Backend-agnostic
~200 GitHub stars (June 2026)

Best for: teams optimizing for long-term portability.

It's also the youngest project here, so think of it as a bet on the instrumentation-first approach rather than a mature platform.

6. Laminar

The newer open-source pick.

Laminar combines:

OTel-based observability
Evaluations
Modern architecture

It is newer than Phoenix but worth evaluating.

Highlights

Apache-2.0
OTel-native
~3,000 GitHub stars (June 2026)

Best for: teams looking for a modern open-source observability stack.

My Take

I'm not crowning a winner.

Different tools optimize for different priorities.

If you want...	Look at...
Fastest onboarding	Helicone
Self-hosted OTel observability	Phoenix or Laminar
Deep LangChain integration	LangSmith
Polished SaaS workflows	Braintrust
Pure OTel instrumentation	traceAI

The proprietary tools are completely reasonable choices.

Until the day you want to leave.

What I Would Do Differently

I would choose OTel-native instrumentation from day one.

Not because proprietary tools are bad.

In many cases they're actually easier and more pleasant to use.

But the cost of switching is paid later through:

Re-instrumentation
Migration effort
Lost historical context

And on day one, you have no idea whether you'll outgrow the tool.

The argument is simple:

Instrument once with OpenTelemetry. Point it at whatever backend you want. Change backends without changing application code.

That's the entire case for OTel.

And it's the one strong opinion I came away with.

FAQ

Is Langfuse bad?

No.

Langfuse is genuinely good.

It's self-hostable, widely adopted, and has roughly 29,000 GitHub stars as of June 2026.

This post is about alternatives and tradeoffs, not criticism.

If I'm OTel-native, does that mean I don't get a dashboard?

No.

You still need somewhere to view traces; you just pick that backend separately from the instrumentation.

Examples include:

Grafana
Datadog
Jaeger
Vendor platforms

OTel-native simply means you can change the backend later without changing instrumentation.

Should I self-host or use a managed service?

Self-host if you need:

Data residency
Infrastructure control
Lower long-term platform dependency

Managed if you want:

Faster setup
Less operational burden
A more polished experience

Open Question

Lock-in is easy to reason about.

What I still struggle to quantify is the value difference between:

A polished proprietary platform
A portable-but-rawer OTel stack

The proprietary tools genuinely save time every day.

The portability benefits only pay off if you actually switch.

I don't have a clean framework for valuing that optionality.

If you've found a useful way to think about that tradeoff, I'd love to hear it.

Per-project LLM cost attribution with OTel spans: the wiring

Jasmine Park — Thu, 04 Jun 2026 21:01:52 +0000

TL;DR. If your LLM bill is one line item on a cloud invoice, you cannot answer "which team spent that." We fixed this by tagging every gateway span with team.id, project.id, and feature.id, plus the OpenInference token-count attributes, shipping those spans through an OTel collector into Tempo, and rolling cost up per team with TraceQL in Grafana. The payoff that sold it internally: one team's monthly spend quietly went from a few hundred dollars to over a thousand because of a retry loop, and the org-level dashboard never flinched. The per-team view caught it in a day. Below is the wiring, the collector config, the rollup query, the alert, and the attributes I tried and threw away.

1. The problem is attribution, not collection

Most teams already collect LLM telemetry. Spans exist, tokens get counted, traces land somewhere. What is missing is the dimension that finance and eng leads actually ask about: who owns this spend. The provider invoice gives you one number per month per API key. If you share keys across services (most people do at some point), that number is useless for chargeback. You cannot tell the platform team's spend from the support-bot team's spend.

So the design goal was narrow. Every LLM call has to carry enough labels that I can group spend by team, by project under that team, and by feature inside that project. Three levels. No more, because once you go deeper than feature, nobody reads the dashboard. I standardized the whole pipeline on OpenTelemetry and OpenInference, and I will state the one opinion plainly: I want the labels, the wire format, and the storage to be things I can swap without rewriting instrumentation. We tag spans with open semantic conventions so the day we change a backend or a dashboard tool, the gateway code does not move. That is a portability decision, not a verdict on anyone's product.

2. Which attributes get tagged, and where

Tag at the gateway, not in each service. We run an LLM gateway (every call to every provider goes through it), so it is the one place that sees model, token counts, and request context together. A new service gets attribution for free as long as it routes through the gateway and forwards the three context headers.

The cost-math group comes straight from OpenInference semantic conventions: llm.model_name, llm.token_count.prompt, llm.token_count.completion. The attribution group is custom, set from request headers: team.id, project.id, feature.id. Cost is not a span attribute. I compute it at query time from token counts and a small price lookup, because prices change and I do not want last quarter's spans frozen at last quarter's rates.

Attribute	What it buys you	Keep or drop
`team.id`	Top-level chargeback. The number a director asks for.	Keep
`project.id`	Splits a team's spend across its services.	Keep
`feature.id`	Which feature drove a spike inside a project.	Keep
`llm.model_name`	Lets you weight tokens by per-model price.	Keep
`llm.token_count.prompt`	Input side of the cost.	Keep
`llm.token_count.completion`	Output side. Usually the expensive half and the one that runs away.	Keep
`user.id`	Per-user spend, in theory. A privacy liability in traces.	Drop
`request.id`	Already covered by the trace and span IDs.	Drop
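As a sketch of what the gateway-side tagging could look like, the function below maps three request headers onto the "Keep" attributes from the table. The header names (x-team-id and friends) are hypothetical, since the post does not name the headers, and the result is a plain dict rather than a real OTel span so the example stays dependency-free.

```python
# Hypothetical header names; substitute whatever your gateway actually forwards.
ATTRIBUTION_HEADERS = {
    "x-team-id": "team.id",
    "x-project-id": "project.id",
    "x-feature-id": "feature.id",
}


def span_attributes(headers: dict[str, str], model: str,
                    prompt_tokens: int, completion_tokens: int) -> dict[str, object]:
    """Build one gateway span's attributes: attribution from headers, cost inputs from the response."""
    lowered = {name.lower(): value for name, value in headers.items()}
    attrs: dict[str, object] = {
        "llm.model_name": model,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
    }
    for header, attribute in ATTRIBUTION_HEADERS.items():
        value = lowered.get(header)
        if value:  # missing labels stay unset here; the collector backfills them
            attrs[attribute] = value
    return attrs


attrs = span_attributes(
    {"X-Team-Id": "support-platform", "X-Project-Id": "helpdesk-bot"},
    model="gpt-4o", prompt_tokens=812, completion_tokens=164,
)
print(attrs)
```

Note what is absent: no user.id, no request.id, and no cost attribute. The span carries only the inputs that the cost rollup needs.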

3. The collector config

OTLP in, backfill anything the gateway missed, batch, Tempo out. The one processor worth calling out is transform: I use it to backfill team.id with a sentinel when a service forgets the header, so unlabeled spend shows up as unattributed instead of vanishing. Cost with no label is cost you will never find.

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch: { timeout: 5s, send_batch_size: 1024 }
  transform/attribution:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team.id"], "unattributed") where attributes["team.id"] == nil
          - set(attributes["project.id"], "unknown") where attributes["project.id"] == nil
          - set(attributes["feature.id"], "unknown") where attributes["feature.id"] == nil

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/attribution, batch]
      exporters: [otlp/tempo]

Two notes from running this. Put transform before batch so the backfill happens per span while the data is still cheap to touch. And keep the price table out of the collector. I tried encoding per-model rates as collector attributes once. Every price change became a config deploy, and the rates drifted out of sync with what we were actually billed. Pricing lives next to the query now.

4. Rolling cost up by team in Grafana

Tempo stores spans, not dollars. So the rollup is two steps: TraceQL pulls token sums grouped by the attribution attributes, and a small price map turns tokens into cost downstream. I start from this, which aggregates output-token counts (the number I watch most, because completion tokens are usually where the money and the runaways are):

{ .team.id = "support-platform" && .llm.token_count.completion > 0 }
| sum_over_time(.llm.token_count.completion) by (.team.id, .project.id, .llm.model_name)

Drop the team.id filter and group by it instead for the all-teams board. The grouping by llm.model_name matters: a mini-tier model and a frontier model can differ by more than an order of magnitude per token, so summing raw tokens across models hides which team is expensive because of volume versus model choice. The dollar step is deliberately dumb: a lookup from llm.model_name to input-price and output-price per thousand tokens, multiplied through, summed per team. Keeping it dumb and external is what lets me re-price history when a provider changes rates.
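The dollar step can be sketched in a few lines. This is an illustration of the idea, not the code we run: the model names and per-1,000-token prices below are invented, and the row shape (one row per team and model with summed tokens) is an assumption about what the rollup hands you.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens: (input, output). Not real provider rates.
PRICE_PER_1K = {
    "mini-model": (0.0002, 0.0008),
    "frontier-model": (0.0050, 0.0150),
}


def cost_by_team(rows: list[dict]) -> dict[str, float]:
    """Sum dollar cost per team from token totals already grouped by team and model."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        # An unknown model raises KeyError on purpose: unpriced spend should fail loudly.
        input_price, output_price = PRICE_PER_1K[row["llm.model_name"]]
        totals[row["team.id"]] += (
            row["prompt_tokens"] / 1000 * input_price
            + row["completion_tokens"] / 1000 * output_price
        )
    return dict(totals)


rows = [
    {"team.id": "support-platform", "llm.model_name": "mini-model",
     "prompt_tokens": 2_000_000, "completion_tokens": 500_000},
    {"team.id": "support-platform", "llm.model_name": "frontier-model",
     "prompt_tokens": 100_000, "completion_tokens": 40_000},
]
print(cost_by_team(rows))
```

With these made-up rates the mini-tier row costs $0.80 for 2.5M tokens and the frontier row costs $1.10 for 140k tokens, which is the volume-versus-model-choice split that summing raw tokens would hide. Re-pricing history is a matter of editing PRICE_PER_1K and re-running the rollup.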

5. The alert: page when a team's output tokens jump

Cost attribution is reporting. The thing that earns its keep is the page. The rule I run is week-over-week on output tokens per team: if this week's completion-token total for any team is more than 2x the same window last week, page. Output tokens, not input, because the runaway failure modes (retry storms, an agent that loops, a prompt-chaining bug that re-asks) all show up as generation volume first. Why 2x week-over-week and not a fixed dollar ceiling: a fixed ceiling either pages constantly for your big teams or never fires for your small ones. A relative jump normalizes across team size on its own. The team whose spend doubled in the story above would have tripped a 2x rule on day one. It did not trip our dollar alert because the absolute number was still small against the org total. Small against the org, doubled for the team, is exactly the blind spot per-team attribution exists to close. Route it to whoever owns the team's budget, not a shared channel where it gets ignored.
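The rule itself is small enough to sketch. The function below is a minimal illustration, assuming you already have per-team completion-token totals for this week and the same window last week; the team names and numbers are invented, and skipping teams with no baseline is my assumption about how to treat brand-new teams.

```python
def teams_to_page(this_week: dict[str, int], last_week: dict[str, int],
                  ratio: float = 2.0) -> list[str]:
    """Return teams whose completion-token total is more than `ratio` times last week's."""
    flagged = []
    for team, current in this_week.items():
        previous = last_week.get(team, 0)
        if previous == 0:
            continue  # no baseline yet; new teams need a separate rule
        if current > ratio * previous:
            flagged.append(team)
    return sorted(flagged)


this_week = {"support-platform": 900_000, "search": 5_200_000, "new-team": 10_000}
last_week = {"support-platform": 400_000, "search": 5_000_000}
print(teams_to_page(this_week, last_week))  # ['support-platform']
```

The big team moved by 200k tokens and stays quiet; the small team moved by 500k and pages, because the comparison is relative to each team's own baseline rather than to the org total.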

6. What I tagged and then dropped

user.id. Per-user spend sounds useful and is occasionally asked for. But putting a user identifier on every span means every trace is now PII, and your whole tracing backend inherits the retention, access, and deletion obligations that come with that. The attribution win did not come close to justifying the compliance surface. Dropped it, have not missed it.

request.id. Pure redundancy. A trace already has a trace ID and every span has a span ID. Anywhere I thought I wanted it, the trace ID was already there and already correct.

The pattern in both: an attribute is only worth tagging if it answers a question the cheaper attributes cannot, and if its cost (privacy, plumbing, drift) is lower than that answer is worth.

FAQ

Why compute cost at query time instead of writing a cost attribute on the span? Prices change and I want to re-cost history when they do. A cost attribute freezes the rate at write time.

Do I need the gateway, or can each service tag its own spans? You can tag per service. I prefer the gateway because it sees model and tokens and request context in one place, so a new service gets attribution by routing through it and forwarding three headers.

Why Tempo specifically? It is what we run, and TraceQL's aggregation over span attributes does the rollup I need. The attribute conventions are OpenInference, so the labels are not tied to Tempo. The point of standardizing on open conventions is that this choice is reversible.

What if a service forgets the attribution headers? The collector backfills unattributed. The spend still shows up, just in a bucket whose name tells me to go fix the instrumentation.

Is week-over-week 2x too noisy? For steady traffic, no. For genuinely spiky workloads, raise the ratio or widen the comparison window. I bias toward a slightly noisy page over a silent doubling.

Open questions

Caching breaks the token-to-cost math. Cached prompt tokens bill at a different rate (sometimes free), and I do not yet tag cache hits cleanly enough to price them right. Streaming and cancelled generations: if a client disconnects mid-stream, what is the honest output-token count, and does the provider bill for tokens generated after the cancel? Feature-level granularity has a ceiling, and I keep wanting per-prompt-version attribution but every level deeper is one more label nobody reads. And whether 2x week-over-week should itself be per-team, since some teams are spiky by nature and one global ratio serves both imperfectly. If you have wired cached-token pricing into a span-based cost model in a way that survives a provider changing its cache rates, I want to hear how.

Span attributes that catch LLM cost regressions before billing does

Jasmine Park — Tue, 02 Jun 2026 19:09:31 +0000

The default OTel + OpenInference span has llm.token_count.prompt and llm.token_count.completion as numeric attributes. Useful for trace-level debugging. Not useful for per-team cost regressions, because nothing groups traces by team.

The 3 attribute additions that earned their keep:

team.id. Tagged on every span at the gateway layer (before the call routes to the LLM provider). This is the column that makes the cost rollup possible. Without it, you can attribute spend to an org but not to a team inside the org.

feature.id. The product feature that triggered the call (chat_assistant, summarizer, rag_search). Lets you see when one feature's token cost spikes vs the overall trend.

llm.model_name. Already standard in OpenInference but worth flagging: without this, you cannot separate a cheap mini-tier model's spikes from a frontier model's spikes when both are in the same feature.

The daily Tempo + Grafana query (TraceQL metrics):

{ resource.service.name = "llm-gateway" }
| quantile_over_time(.llm.token_count.completion, .95) by (.team.id, .feature.id, .llm.model_name)

The alert rule: page when 7-day-trailing average of output-tokens-per-team-per-feature jumps more than 2x week-over-week. We caught a runaway retry loop last quarter that the org-level spend dashboard missed because the total stayed within budget while one team's bill quietly doubled.

What we tried and dropped:

user.id tagging: privacy concerns at scale, and the rollup-by-team covered the use case.

request.id tagging: redundant with the trace_id; just adds cardinality.

Drafted with AI assistance, edited and verified by author.