<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jasmine Park</title>
    <description>The latest articles on DEV Community by Jasmine Park (@jasmine_park_dev).</description>
    <link>https://dev.to/jasmine_park_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3940554%2F33355dac-c999-4ac2-ba72-34c28bf9f1d7.png</url>
      <title>DEV Community: Jasmine Park</title>
      <link>https://dev.to/jasmine_park_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jasmine_park_dev"/>
    <language>en</language>
    <item>
      <title>What wakes you at 2am when an enterprise operator deploys your agent</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Thu, 02 Jul 2026 08:13:02 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/what-wakes-you-at-2am-when-an-enterprise-operator-deploys-your-agent-3dlm</link>
      <guid>https://dev.to/jasmine_park_dev/what-wakes-you-at-2am-when-an-enterprise-operator-deploys-your-agent-3dlm</guid>
      <description>&lt;p&gt;Month one of our enterprise rollout: three 3am pages. None of them were code bugs. None of them were model quality issues. All of them were operational surprises we had not accounted for before the handoff.&lt;/p&gt;

&lt;p&gt;We were production-ready. We were not operator-ready. Those are different things.&lt;/p&gt;

&lt;p&gt;Here's what the three incidents were, what they cost, and the alert set I now require before any enterprise agent deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident one: rate limits hit at 6pm EST
&lt;/h2&gt;

&lt;p&gt;The operator's usage pattern had a daily spike between 5pm and 7pm EST. Their team was closing out the day, reviewing agent outputs, running follow-up queries. Peak usage was 4x average.&lt;/p&gt;

&lt;p&gt;Our OpenAI rate limit was sized for average load. At 6:04pm on day eight, we hit it. Every request after that returned 429 errors for eleven minutes. The agent returned a generic error message. Eleven minutes of silent failure during peak business hours.&lt;/p&gt;

&lt;p&gt;Nobody had tested for burst traffic. We had tested for average load. The cost of finding this in production instead of before handoff: one executive complaint, two support tickets, and an emergency weekend call.&lt;/p&gt;

&lt;p&gt;The fix was multi-provider routing with automatic failover: when provider A returns 429, route to provider B. We had this wired up within 48 hours. We should have had it before day one.&lt;/p&gt;

&lt;p&gt;What this incident taught us about operator-readiness: size for the operator's peak, not your average. The operator's usage pattern will not match your test environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident two: $11,800 overage in 18 days
&lt;/h2&gt;

&lt;p&gt;The operator had one very active team. Four engineers running batch analysis jobs against our agent at hours that made no operational sense (11pm local, Saturdays). Within 18 days, their usage represented 68% of total API cost.&lt;/p&gt;

&lt;p&gt;We had a total cost ceiling. We did not have a per-tenant ceiling. When we hit the total ceiling, the entire deployment slowed down. The heavy team's cost was invisible to us until the month-end bill.&lt;/p&gt;

&lt;p&gt;The question from the customer's VP of operations on day 20: "Can you break down the cost by team?"&lt;/p&gt;

&lt;p&gt;We could not.&lt;/p&gt;

&lt;p&gt;This is a FinOps problem that looks like a monitoring problem. Per-tenant cost tagging needs to be in the request headers before the first request goes out. Not added after month-end when someone asks the question.&lt;/p&gt;

&lt;p&gt;Cost of finding this in production instead of before handoff: $11,800 overage, one executive conversation, two weeks of retroactive tagging work to approximate the breakdown.&lt;/p&gt;

&lt;p&gt;Minimum viable cost attribution setup: tag every request with operator ID, team ID, and use-case ID at the gateway layer. Aggregate daily by tag. Alert when any single tag hits 70% of its monthly budget by day 15.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident three: no audit trail for a compliance request
&lt;/h2&gt;

&lt;p&gt;Six weeks in, the operator's compliance officer needed to reconstruct a decision the agent made on a specific document at a specific time. Customer data question. The agent had made a routing decision that the compliance team needed to audit.&lt;/p&gt;

&lt;p&gt;We had trace spans in our observability system. We did not have an immutable per-request audit log that showed: which document, which agent version, which model version, which prompt version, what the output was, what confidence score.&lt;/p&gt;

&lt;p&gt;Trace spans are not audit logs. They're operational data. An audit log needs to be write-once, timestamped, and correlated with the business object (the document, the customer record, whatever the operator's domain object is).&lt;/p&gt;

&lt;p&gt;We spent four days building a retroactive audit log approximation. It was not what the compliance officer needed, but it was the best we could do. The real audit log went live in week eight.&lt;/p&gt;

&lt;p&gt;Cost of finding this in production instead of before handoff: one compliance near-miss, four days of engineering time, one uncomfortable meeting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What operator-ready means vs. what production-ready means
&lt;/h2&gt;

&lt;p&gt;Production-ready is about your system. Uptime, latency, error rates. Metrics you control.&lt;/p&gt;

&lt;p&gt;Operator-ready is about what happens when your system runs inside someone else's organization, with their usage patterns, their cost constraints, their compliance requirements, their data.&lt;/p&gt;

&lt;p&gt;The three incidents above were all foreseeable. The operator's usage pattern is discoverable before handoff (just ask them). Per-tenant cost attribution is a gateway configuration decision that takes a day to implement. Audit log requirements for regulated industries are documented in their compliance frameworks.&lt;/p&gt;

&lt;p&gt;We didn't discover any of these things before the handoff because we didn't ask the right questions before the handoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pre-operator readiness checklist I run now
&lt;/h2&gt;

&lt;p&gt;Before any enterprise agent deployment, I go through five operational questions. These are not code reviews. They're operational configuration checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limit sizing:&lt;/strong&gt; What is the operator's expected peak usage, and what is our per-provider rate limit at that peak? If peak usage exceeds 60% of the rate limit, configure burst handling or multi-provider routing before go-live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution:&lt;/strong&gt; Is every request tagged with the minimum attribution set (operator, team, use-case) at the gateway layer? If not, implement it before the first request. Do not add it retroactively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit log schema:&lt;/strong&gt; Does the operator operate in a regulated industry? If yes, map their compliance requirements to a specific log schema before deployment. Generic trace spans do not satisfy financial services, healthcare, or legal audit requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failover configuration:&lt;/strong&gt; Is there a secondary provider configured for automatic failover? If not, is there a documented manual procedure and a stated SLA for the outage window?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost ceiling configuration:&lt;/strong&gt; Is there a per-tenant monthly budget ceiling with an alert at 70% of budget? Not a per-account ceiling. Per tenant. An over-active team should not consume another team's budget.&lt;/p&gt;

&lt;p&gt;These five checks take about two hours to complete and review. Three incidents in month one took about three weeks to fully resolve, plus the relationship cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd page on
&lt;/h2&gt;

&lt;p&gt;If I were setting up the alert set for a new enterprise agent deployment, these are the four alerts I'd configure first:&lt;/p&gt;

&lt;p&gt;Alert one: per-tenant request rate above 80% of the tenant's configured rate limit. Not 100%. Leave 20% headroom to investigate before hitting the ceiling.&lt;/p&gt;

&lt;p&gt;Alert two: per-request cost moving average above threshold for any single tenant (set the threshold based on the expected per-request cost, alert at 3x). Catches batch jobs and runaway loops before month-end.&lt;/p&gt;

&lt;p&gt;Alert three: agent response with no trace ID in the audit log. Means the audit trail has a gap. You want to know about this in real time, not when a compliance officer asks.&lt;/p&gt;

&lt;p&gt;Alert four: first-request p99 latency above 2x the steady-state p99. Catches cold-start regressions before the operator's peak usage hits them.&lt;/p&gt;

&lt;p&gt;None of these alerts require custom infrastructure. They require that your gateway and your agent emit the right metadata on every request.&lt;/p&gt;

&lt;p&gt;Get the metadata right before handoff. Fix the alerts before go-live. The 3am page is not a production incident. It's a pre-deployment checklist item you deferred.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>observability</category>
      <category>sre</category>
      <category>llmproduction</category>
    </item>
    <item>
      <title>The Langfuse migration that cost us a sprint: how I now budget LLM observability</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Fri, 26 Jun 2026 21:37:56 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/the-langfuse-migration-that-cost-us-a-sprint-how-i-now-budget-llm-observability-ane</link>
      <guid>https://dev.to/jasmine_park_dev/the-langfuse-migration-that-cost-us-a-sprint-how-i-now-budget-llm-observability-ane</guid>
      <description>&lt;p&gt;We moved off our first tracer in month eight. The migration took one engineer the better part of a sprint, because the trace data lived in a schema we did not own. Nobody costed that line item on day one. I am writing this so you can.&lt;/p&gt;

&lt;p&gt;I run reliability for a small team shipping LLM features. When the pager goes off at 2am, I do not care which dashboard is prettiest. I care about two numbers: what this tool costs me per month, and what it costs me to leave. Those two numbers are the whole story, and they are almost never on the comparison page.&lt;/p&gt;

&lt;p&gt;So here are six Langfuse alternatives. For each I tracked both numbers: the monthly bill on the invoice, and the exit bill that only shows up the day you migrate. I compared Helicone, Arize Phoenix, LangSmith, Braintrust, Laminar, and Future AGI traceAI. They all trace LLM calls (prompts, tokens, retrieval spans, latency). The axis that decides your exit cost is whether the trace format is OpenTelemetry-native or a vendor schema. Get that wrong and the migration bill lands later, with interest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost nobody puts on the pricing page
&lt;/h2&gt;

&lt;p&gt;Your monthly invoice is the visible cost. The exit cost is the invisible one: re-instrumenting the app, rebuilding integrations, and losing historical traces when the schema does not travel. If your spans are OTel, the exit cost trends toward zero because the data is portable by construction. If they are proprietary, you are paying a deferred bill every month you stay. Sort on that first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone.&lt;/strong&gt; The gateway-first option. You proxy model calls through it and get logging, cost tracking, and analytics with almost no code change. Apache-2.0, self-hostable, roughly 5,800 GitHub stars as of June 2026. On pure observability ergonomics this is one of the strongest picks, and the proxy model means low setup cost. The thing to watch at scale: a gateway in the request path is one more hop to reason about when latency spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Arize Phoenix.&lt;/strong&gt; The open-source OTel option. Tracing plus evals, self-hostable, around 10,000 stars as of June 2026. Because it is OTel-native, your exit cost stays low. The commercial Arize AX tier adds ML monitoring and enterprise features. If portability is your top line, this and traceAI are the two that keep the invisible bill near zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith.&lt;/strong&gt; The LangChain-native option. If you live in LangChain or LangGraph, instrumentation is automatic and the developer experience is strong. Proprietary and closed-source, tightly coupled to the LangChain ecosystem. This is the most lock-in of the group: the day-one cost is the lowest, the day-200 cost is the highest. Worth it only if you are certain you are never leaving LangChain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust.&lt;/strong&gt; The polished SaaS option. One of the better eval and observability experiences, and the people who do not page (PMs, leads) tend to like the UI. Proprietary trace schema, closed-source, managed by default. Even on enterprise deployments you operate inside their format, so the exit cost stays on the books.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Laminar.&lt;/strong&gt; The newer open-source entrant. OTel-based tracing with evals, smaller and younger than Phoenix, in the low-thousands of stars as of June 2026. Lower lock-in on the same OTel logic. The cost to weigh here is maturity, not portability: a smaller project means fewer battle-tested edges, which matters more for an on-call rotation than a demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future AGI traceAI.&lt;/strong&gt; The instrumentation-layer option. Worth being precise here, because it is not the same kind of thing as the others. traceAI is not an observability dashboard. It is an Apache-2.0, OpenTelemetry-native instrumentation SDK (pip install fi-instrumentation-otel) that emits portable OTel spans for 50-plus frameworks as of June 2026. The spans go wherever you point your collector. Future AGI's broader platform adds evals on top (50-plus metrics under one evaluate() call as of June 2026), but on raw observability ergonomics Helicone and Phoenix are more mature dashboards. Where traceAI earns its place on this list is the exit-cost column: because it speaks OTel, the cost of leaving is roughly the cost of changing a collector endpoint. Code: github.com/future-agi/traceAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two numbers, side by side
&lt;/h2&gt;

&lt;p&gt;Visible cost is easy: read the pricing page, multiply by your span volume, done. Invisible cost is the one that bit me. The open-source OTel tools (Phoenix, Laminar, traceAI as the instrumentation layer) keep your exit near free. The proprietary ones (LangSmith, Braintrust) front-load convenience and back-load the migration. Helicone sits in between: open and portable, with a proxy hop to account for. Pick the lock-in profile you can afford in month eight, then argue about features.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd page on
&lt;/h2&gt;

&lt;p&gt;If I were standing this up again, here is the dashboard and alert set I would build before I cared about anything else:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace export success rate&lt;/strong&gt; below 99 percent over 5 minutes. A silent collector drop is invisible until you need the trace you do not have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Span ingestion cost per day&lt;/strong&gt; trending above your budget line. Token spend gets watched; span volume does not, and it scales with traffic too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P99 added latency from the tracing path&lt;/strong&gt; above your SLO budget. If the tracer (or proxy) adds tail latency, that is a reliability cost masquerading as observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Percent of spans in a portable (OTel) format.&lt;/strong&gt; This is your exit-cost gauge. If it drifts down because someone added a proprietary integration, you just took on migration debt. Page on it before it compounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropped-trace rate during incidents specifically.&lt;/strong&gt; Tracing tends to fail exactly when load is highest, which is exactly when you need it. Alert on the correlation, not just the absolute.&lt;/p&gt;

&lt;p&gt;Build those five first. The dashboard you actually page on is cheaper than the migration you did not plan for.&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>langfuse</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>The gateway tax: 6 OpenAI-compatible gateways.</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Fri, 26 Jun 2026 21:35:09 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/the-gateway-tax-6-openai-compatible-gateways-5f1e</link>
      <guid>https://dev.to/jasmine_park_dev/the-gateway-tax-6-openai-compatible-gateways-5f1e</guid>
      <description>&lt;p&gt;On March 14, 2026, our LLM bill came in at $9,140 for the month, up from about $5,200, and I could not tell you which team spent it. The gateway in front of every provider emitted one cost line and one trace span per request, all tagged &lt;code&gt;service=llm-gateway&lt;/code&gt;, so the platform team ate the whole overage in the FinOps review while three product teams shrugged.&lt;/p&gt;

&lt;p&gt;That month is the reason I now treat cost attribution as a gateway design decision, not an afterthought. If you cannot answer "which team, which feature, which key spent this" from the layer every call already passes through, you will answer it never. This is a comparison of the OpenAI-compatible LLM gateways I have evaluated for exactly that job: LiteLLM, Portkey, Helicone, Cloudflare AI Gateway, and Bifrost, plus one newer open-source entrant I introduce in the comparison table below. The lens is an SRE lens. What does it cost you in p99, and how granularly can you bill it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Cost attribution belongs at the gateway, not in each app's SDK and not in your provider's dashboard. The gateway is the one chokepoint every call crosses, so it is the only place where per-team, per-feature, per-key spend is both complete and consistent.&lt;/p&gt;

&lt;p&gt;Every OpenAI-compatible gateway you put in that path adds latency. Call it the gateway tax. It is real, it is usually single-digit milliseconds at the proxy hop, and it varies with what you turn on (caching, guardrails, semantic lookups). The tax is not the deciding factor for most teams, because provider latency dwarfs it. What actually differs across gateways, by a lot, is attribution granularity: whether you can slice spend by virtual key, by route, by user, and whether the cost shows up as a first-class OpenTelemetry span attribute or as a number you have to scrape out of a dashboard later.&lt;/p&gt;

&lt;p&gt;So the decision rule is short. Pick the gateway whose tax you can afford at your p99 budget, and whose attribution you can actually bill against. Most teams over-index on the first half and never check the second. Then March happens.&lt;/p&gt;

&lt;p&gt;One honesty note up front, because it matters for how you read everything below. We did not re-run a latency benchmark across these six gateways on one rig. Anybody who hands you a clean cross-vendor p99 table either ran a heroic apples-to-apples harness (rare) or is quietly comparing numbers each vendor measured on different hardware against different upstreams (common). Where I cite latency, it is the vendor's own published number, labeled as such. The capability columns (self-host, caching type, attribution granularity, OTel-native, guardrails, license) are checked against each project's public docs and READMEs, because those are verifiable and they are what you will actually live with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not the app SDK, and why not the provider dashboard
&lt;/h2&gt;

&lt;p&gt;Before the table, kill the two alternatives, because most teams reach for one of them first and it is why their numbers never reconcile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution does not belong in each app's SDK.&lt;/strong&gt; The pitch is seductive: every service instruments its own OpenAI client, tags spend with its own team name, ships it to your metrics backend. In practice you now have N implementations of "compute token cost" drifting against each other. One team is on an old pricing table. One forgot to count cached input tokens at the discounted rate. One service calls the provider directly in a cron job and bypasses instrumentation entirely, so that spend is simply invisible. When the provider changes per-token pricing (they do, quietly), you are editing N codebases to stay correct. SDK metering is great for in-process latency spans. It is a bad system of record for dollars, because the source of truth is smeared across every repo and every deploy cadence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution does not belong in the provider dashboard either.&lt;/strong&gt; The OpenAI or Anthropic billing console knows your org spent the money. It does not know your org chart. It cannot tell you that &lt;code&gt;team-checkout&lt;/code&gt; spent $4k and &lt;code&gt;team-search&lt;/code&gt; spent $300, because your teams are not a concept the provider has. The best you get is per-API-key, and only if you had the discipline to mint one key per team up front and never share them, which under load nobody does. Multi-provider makes it worse: now you are stitching three billing consoles, three export formats, three currencies of "cost," into one spreadsheet a human maintains by hand. That spreadsheet is wrong by the second week.&lt;/p&gt;

&lt;p&gt;The gateway is the only layer that sees every request, knows which credential made it, can compute cost once against one pricing table, and can stamp that cost onto a span before the response leaves the building. That is the whole argument. Now, which gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Definitions, so the table means something
&lt;/h2&gt;

&lt;p&gt;Two terms do all the work in this post. Pin them down before you read the comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-attribution granularity&lt;/strong&gt; is the finest dimension along which the gateway can split spend without you doing post-hoc log surgery. I rank it in three tiers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Per-key&lt;/em&gt;: the gateway issues virtual keys (its own keys, mapped to upstream provider keys) and tracks spend and budget per virtual key. You hand &lt;code&gt;team-checkout&lt;/code&gt; a virtual key, and its spend is isolated. This is the floor for billing back, and honestly it is enough for most orgs.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Per-route / per-model&lt;/em&gt;: spend split by which model or endpoint served the call, so you can see that GPT-4-class traffic is 80% of cost while being 10% of calls.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Per-user / per-metadata&lt;/em&gt;: arbitrary tags (end-user id, feature flag, tenant) attached at request time and queryable later. This is what you need for usage-based billing to &lt;em&gt;your&lt;/em&gt; customers, not just internal chargeback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A gateway that only gives you per-key is fine for internal FinOps. A gateway that gives you per-user metadata is what you need if you resell LLM features and bill your customers per seat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gateway tax&lt;/strong&gt; is the latency the gateway hop adds on top of provider latency. It has a floor (the proxy itself: parse, auth, route, re-serialize) and a variable part (every feature you enable adds a little: an exact-cache lookup is cheap, a semantic-cache vector search is not free, each inline guardrail is a synchronous scan). The tax is paid on every request that is not a cache hit. On a cache hit you skip the provider entirely and the gateway &lt;em&gt;saves&lt;/em&gt; you latency, which is the one case where the tax goes negative. The mistake teams make is benchmarking the bare proxy, seeing 2 ms, and budgeting as if guardrails and semantic cache are free. They are not. Measure the tax with your real feature set on, or do not quote it.&lt;/p&gt;

&lt;p&gt;And again, the number you measure on your rig is not comparable to the number a vendor measured on theirs. Different CPU, different upstream, different concurrency, different request body size. Treat every cross-vendor latency claim, including the ones in this post, as directional.&lt;/p&gt;

&lt;h2&gt;
  
  
  The comparison
&lt;/h2&gt;

&lt;p&gt;Read this as capabilities first, latency last. The capability columns are what you live with daily. The latency column is vendor-published and not re-run by us, so it is the least load-bearing thing here.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Self-host?&lt;/th&gt;
&lt;th&gt;Caching (exact / semantic)&lt;/th&gt;
&lt;th&gt;Cost-attribution granularity&lt;/th&gt;
&lt;th&gt;OTel-native?&lt;/th&gt;
&lt;th&gt;Inline guardrails?&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Exact (Redis/in-mem/disk/S3/GCS) + semantic (Qdrant/Redis)&lt;/td&gt;
&lt;td&gt;Per-key, per-team, per-user (virtual keys + budgets + spend tags)&lt;/td&gt;
&lt;td&gt;Via OTel callback/integration&lt;/td&gt;
&lt;td&gt;Via plugins + Guardrails hooks&lt;/td&gt;
&lt;td&gt;MIT (OSS); paid Enterprise tier&lt;/td&gt;
&lt;td&gt;Broadest provider + ecosystem coverage. Default pick if you want the biggest model zoo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portkey&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (gateway is OSS; full platform is SaaS)&lt;/td&gt;
&lt;td&gt;Simple (exact) + semantic&lt;/td&gt;
&lt;td&gt;Per virtual key + metadata tags; rich SaaS dashboards&lt;/td&gt;
&lt;td&gt;Partial / via integrations&lt;/td&gt;
&lt;td&gt;Yes (integrated Guardrails)&lt;/td&gt;
&lt;td&gt;Gateway MIT; platform proprietary SaaS&lt;/td&gt;
&lt;td&gt;Most polished managed dashboards and config UI. Default if you want a hosted control plane, not a DIY one.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helicone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (self-host available)&lt;/td&gt;
&lt;td&gt;Exact-match only (cache-key hash)&lt;/td&gt;
&lt;td&gt;Custom properties (per-user / per-feature) via metadata; per-key&lt;/td&gt;
&lt;td&gt;OTLP ingest (observability-first)&lt;/td&gt;
&lt;td&gt;Limited / not the focus&lt;/td&gt;
&lt;td&gt;OSS (observability platform)&lt;/td&gt;
&lt;td&gt;Observability-first, not a routing-heavy gateway. Default if logging + analytics is the job.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No (Cloudflare edge, cloud-only)&lt;/td&gt;
&lt;td&gt;Caching (exact); no documented semantic cache&lt;/td&gt;
&lt;td&gt;Per-request analytics, basic metadata; provider/token/cost metrics&lt;/td&gt;
&lt;td&gt;No documented OTel export&lt;/td&gt;
&lt;td&gt;Not the focus&lt;/td&gt;
&lt;td&gt;Proprietary (managed service)&lt;/td&gt;
&lt;td&gt;Zero-ops edge gateway. Default if you are already all-in on Cloudflare and want one toggle.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bifrost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Semantic caching (exact also supported)&lt;/td&gt;
&lt;td&gt;Hierarchical budgets: virtual keys, teams, customers&lt;/td&gt;
&lt;td&gt;Yes (Prometheus + OTel/tracing)&lt;/td&gt;
&lt;td&gt;Yes (plugin middleware / enterprise guardrails)&lt;/td&gt;
&lt;td&gt;Apache-2.0 (Go)&lt;/td&gt;
&lt;td&gt;Fast Go OSS gateway with strong budget hierarchy. Default if you want OSS + native budgets and live in Go.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Future AGI Agent Command Center&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (single Go binary)&lt;/td&gt;
&lt;td&gt;Exact (6 backends) + semantic (4 backends)&lt;/td&gt;
&lt;td&gt;Per virtual key budgets/quotas + per-request cost on the span&lt;/td&gt;
&lt;td&gt;Yes, OTel-native (W3C trace context) + Prometheus &lt;code&gt;/metrics&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Yes, 18 built-in scanners + external adapters&lt;/td&gt;
&lt;td&gt;Apache-2.0 (Go)&lt;/td&gt;
&lt;td&gt;End-to-end OSS platform where the gateway is one piece beside eval/observability. Default if you want OTel + Prometheus + caching + guardrails in one binary.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notes on the latency column, deliberately kept out of the table because it is not comparable: LiteLLM publishes proxy-overhead figures in the single-digit-millisecond range on their own harness; Future AGI publishes a vendor benchmark of roughly +1.4 ms P95 added by three inline guardrails and a lower added-latency figure than LiteLLM measured on Future AGI's own rig (their numbers, their methodology, not verified by us); Bifrost publishes its own low-microsecond internal-selection numbers. None of these were measured against each other. Do not put them in a slide as if they were.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gateway by gateway
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LiteLLM
&lt;/h3&gt;

&lt;p&gt;The one with the longest provider list and the deepest ecosystem. If a model exists, LiteLLM probably has a route to it, and the &lt;code&gt;litellm&lt;/code&gt; SDK is already in half the agent frameworks you will touch. For attribution it is genuinely strong: virtual keys, budgets, and spend tracking down to key, team, and user, plus cache (exact via Redis and friends, semantic via Qdrant). OpenTelemetry is available through its callback/integration system rather than being the native wire format, which means you wire it up rather than getting it for free. The tax is the usual proxy hop; LiteLLM publishes single-digit-ms overhead on their own harness. The cost of all that breadth is configuration surface: there is a lot of it, and a lot of ways to hold it wrong.&lt;/p&gt;

&lt;p&gt;Choose LiteLLM when your priority is provider coverage and ecosystem fit, and you have someone who will own the config.&lt;/p&gt;

&lt;h3&gt;
  
  
  Portkey
&lt;/h3&gt;

&lt;p&gt;The most polished managed experience. The gateway core is open source and you can run it with &lt;code&gt;npx @portkey-ai/gateway&lt;/code&gt;, but the part people actually pay for is the hosted control plane: the dashboards, the config UI, the virtual-key and metadata management without you standing up storage. Caching is simple plus semantic, guardrails are integrated, attribution is per-virtual-key plus metadata tags. If you want to hand a non-platform team a screen where they can see their own spend without you building it, Portkey is the shortest path. The trade is that the nice parts are SaaS and proprietary, so the dependency is on Portkey-the-company, not just Portkey-the-binary.&lt;/p&gt;

&lt;p&gt;Choose Portkey when you want a managed control plane and dashboards out of the box, and SaaS dependency is acceptable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helicone
&lt;/h3&gt;

&lt;p&gt;Observability-first. Helicone is excellent at logging every request, tagging it with custom properties, and giving you analytics over that, including per-user and per-feature cost slicing via metadata. Caching is exact-match only (the cache key is a hash of URL, body, and relevant headers, so "Hello" and "Hi" are different entries). It is self-hostable and open source, and it leans into OTLP-style ingest because its center of gravity is the observability plane, not heavy multi-provider routing or failover. If your real problem is "I cannot see what my LLM calls are doing," Helicone solves that cleanly. If your real problem is "I need 15 routing strategies and inline guardrails," it is not aimed there.&lt;/p&gt;

&lt;p&gt;Choose Helicone when logging, analytics, and per-feature cost visibility are the job and routing is secondary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloudflare AI Gateway
&lt;/h3&gt;

&lt;p&gt;The zero-ops option. It runs on Cloudflare's edge, so there is no binary to operate and no SPOF you own (you inherit Cloudflare's). It does caching and gives you analytics: request counts, tokens, cost. What you do not get, per the public docs, is self-hosting, a documented OpenTelemetry export, or deep per-team attribution beyond request-level metadata. It is the right answer when you are already on Cloudflare, you want one dashboard and one toggle, and your attribution needs stop at "roughly how much, roughly where."&lt;/p&gt;

&lt;p&gt;Choose Cloudflare AI Gateway when you want a managed edge gateway with near-zero ops and you already live on Cloudflare.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bifrost
&lt;/h3&gt;

&lt;p&gt;A fast Go OSS gateway (Apache-2.0) with a genuinely good cost model: hierarchical budgets across virtual keys, teams, and customers, which maps cleanly onto chargeback. It ships native Prometheus metrics and distributed tracing / OTel, semantic caching, and a plugin middleware system for analytics and guardrail-style logic. It is newer and the ecosystem is smaller than LiteLLM's, so you trade provider breadth for a tight, performant core and a budget hierarchy that is built in rather than bolted on.&lt;/p&gt;

&lt;p&gt;Choose Bifrost when you want OSS, native budget hierarchy, and Prometheus + OTel, and you are comfortable in the Go ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future AGI Agent Command Center
&lt;/h3&gt;

&lt;p&gt;An OpenAI-compatible gateway shipped as a single Go binary, Apache-2.0, open source (repo at github.com/future-agi). As of June 2026 it ships 15 routing strategies, two-tier caching (6 exact-match backends and 4 semantic backends), and 18 built-in guardrail scanners plus adapters for external guardrail vendors. The piece that matters for this post: it is OpenTelemetry-native using W3C trace context and also exposes a Prometheus &lt;code&gt;/metrics&lt;/code&gt; endpoint, and it tracks per-virtual-key budgets and quotas, so cost can ride on the span rather than living only in a dashboard. It also ships a committed, reproducible benchmark harness (a &lt;code&gt;bench/&lt;/code&gt; directory with a mock upstream), which I respect more than a marketing number, because it means you can re-run their claim instead of trusting it.&lt;/p&gt;

&lt;p&gt;On their own published benchmark (vendor numbers, not verified by us), three inline guardrails add roughly +1.4 ms at P95, and they claim lower added latency than LiteLLM measured on their rig. Same caveat as everywhere else: their hardware, their upstream, their methodology. The honest positioning: LiteLLM still has the broadest provider and ecosystem coverage, and Portkey has the more polished managed SaaS and dashboards. Future AGI's actual edge is that the gateway is one component of an end-to-end open-source platform that also does eval and observability, with native OTel plus Prometheus and built-in caching and guardrails in a single binary, so you are not assembling four tools to get attribution onto a span.&lt;/p&gt;

&lt;p&gt;Choose Agent Command Center when you want OTel + Prometheus + caching + guardrails in one OSS binary, and you value the gateway being part of one eval/observability platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The diagram you should draw on your whiteboard
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Fd%2F1o6-cyigVBiyxu3Qh4sqDjUgkvCIBJ4_m" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Fd%2F1o6-cyigVBiyxu3Qh4sqDjUgkvCIBJ4_m" alt="Gateway control points with the OTel cost span emitted at govern/cost" width="1483" height="789"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: the gateway is the one layer every call crosses. Stamp cost on the OpenTelemetry span at GOVERN/COST and attribution stays complete and consistent.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The single most important thing in that diagram is where the span is emitted. It is emitted inside the gateway, at the govern/cost control point, after the gateway has resolved the credential and computed the cost. That is what makes attribution complete (every call crosses it) and consistent (one pricing table, one cost function). Move that emission into each app and you reintroduce every drift problem from the "why not the SDK" section above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations: where every one of these adds risk
&lt;/h2&gt;

&lt;p&gt;No gateway is free of downside. If you put one in your hot path, you have signed up for these, regardless of vendor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single point of failure.&lt;/strong&gt; Every request now depends on the gateway being up. A managed edge service (Cloudflare) trades your SPOF for theirs, which may be a better or worse bet than your own uptime. A self-hosted binary (LiteLLM, Bifrost, Future AGI) is yours to make HA: run more than one replica, put a real load balancer in front, and test failover before you need it. "We deployed one gateway pod" is not a control plane, it is an incident waiting for a node drain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache poisoning and stale answers.&lt;/strong&gt; Semantic caching is the feature most likely to bite you. A vector-similarity hit can return a cached answer for a prompt that is &lt;em&gt;close&lt;/em&gt; but not equivalent, and now one user sees another user's response, or a stale answer to a changed question. Exact caching is safer but still leaks across users if your cache key does not include the right scoping. Scope cache keys per tenant where correctness matters, and keep semantic caching off for anything with PII or per-user state until you have measured the false-hit rate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Span-cardinality blowup.&lt;/strong&gt; The fix for attribution (rich tags on every span) is also the way you melt your metrics backend. Put &lt;code&gt;end_user_id&lt;/code&gt; as a label on a Prometheus metric and you have just created one time series per user. That is a cardinality bomb. Keep high-cardinality identifiers (user id, request id) on &lt;em&gt;traces and logs&lt;/em&gt;, where high cardinality is fine, and keep your &lt;em&gt;metric&lt;/em&gt; labels low-cardinality (team, model, provider, cache_hit). Conflating the two is the most common way an attribution rollout pages the observability team instead of the FinOps team.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pasteable artifact: per-key budget plus OTel export
&lt;/h2&gt;

&lt;p&gt;Here is a minimal, runnable setup for one gateway (LiteLLM, because its config is the most widely deployed and the spend tracking is mature), showing a per-virtual-key budget and OpenTelemetry export, plus the queries that turn it into a bill-back.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;litellm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-stable&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000:4000"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://litellm:litellm@db:5432/litellm&lt;/span&gt;
      &lt;span class="c1"&gt;# Send OTel spans to your collector&lt;/span&gt;
      &lt;span class="na"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://otel-collector:4317&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/config.yaml"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config.yaml:/app/config.yaml:ro&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;

  &lt;span class="na"&gt;db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;litellm-pg:/var/lib/postgresql/data&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;litellm-pg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/OPENAI_API_KEY&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Emit an OpenTelemetry span per request, with cost + tokens as attributes.&lt;/span&gt;
  &lt;span class="na"&gt;callbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;otel"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# Track and persist spend so it can be queried per key/team/user.&lt;/span&gt;
  &lt;span class="na"&gt;store_model_in_db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;
  &lt;span class="na"&gt;database_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/DATABASE_URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mint a virtual key for one team, with a hard monthly budget, so March cannot happen silently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:4000/key/generate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
        "key_alias": "team-checkout",
        "models": ["gpt-4o"],
        "max_budget": 500,
        "budget_duration": "30d",
        "metadata": {"team": "checkout", "cost_center": "cc-4471"}
      }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That key now refuses traffic once &lt;code&gt;team-checkout&lt;/code&gt; crosses $500 in a 30-day window, and every call it makes carries &lt;code&gt;team=checkout&lt;/code&gt; into the spend store and onto the OTel span.&lt;/p&gt;

&lt;p&gt;Attributing spend to a team comes from the gateway's own spend store. With LiteLLM's spend logs in Postgres, the bill-back for last month is one query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'team'&lt;/span&gt;      &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;team&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spend&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"LiteLLM_SpendLogs"&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"startTime"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'1 month'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"startTime"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;  &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;usd&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for the live alerting view, scrape low-cardinality cost metrics into Prometheus and rank current-month spend by team. With a gateway that exposes a per-team cost counter (label &lt;code&gt;team&lt;/code&gt;, deliberately low-cardinality), the PromQL is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;topk(5,
  sum by (team) (
    increase(llm_gateway_cost_usd_total[30d])
  )
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;provider&lt;/code&gt; as metric labels. Keep &lt;code&gt;end_user_id&lt;/code&gt; and &lt;code&gt;request_id&lt;/code&gt; out of metrics and on the trace instead. That one discipline is the difference between an attribution dashboard and a cardinality incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Paste this into your PRD
&lt;/h2&gt;

&lt;p&gt;A scenario matrix for the decision review, so the next person does not re-derive it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Default pick&lt;/th&gt;
&lt;th&gt;Escalate to&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Internal chargeback, many providers&lt;/td&gt;
&lt;td&gt;Provider breadth + per-team spend&lt;/td&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Bifrost (if you want native budget hierarchy in Go)&lt;/td&gt;
&lt;td&gt;Biggest model zoo, mature virtual keys and spend tracking; budgets get you per-team bill-back.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-platform teams need their own spend screen&lt;/td&gt;
&lt;td&gt;Managed dashboards, low build cost&lt;/td&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;LiteLLM self-host (if SaaS dependency is a no)&lt;/td&gt;
&lt;td&gt;Hosted control plane and config UI mean you do not build the dashboard yourself.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"I cannot see what my LLM calls do"&lt;/td&gt;
&lt;td&gt;Logging + per-feature cost visibility&lt;/td&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;Future AGI ACC (if you also need routing + guardrails)&lt;/td&gt;
&lt;td&gt;Observability-first with custom-property attribution; exact-match cache.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already on Cloudflare, want near-zero ops&lt;/td&gt;
&lt;td&gt;One toggle, no binary to run&lt;/td&gt;
&lt;td&gt;Cloudflare AI Gateway&lt;/td&gt;
&lt;td&gt;Any self-hosted gateway (when you outgrow request-level attribution)&lt;/td&gt;
&lt;td&gt;Edge-managed, no SPOF you operate; attribution stops at request-level metadata.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Want OTel + Prometheus + cache + guardrails in one OSS binary&lt;/td&gt;
&lt;td&gt;One platform, attribution on the span&lt;/td&gt;
&lt;td&gt;Future AGI Agent Command Center&lt;/td&gt;
&lt;td&gt;LiteLLM (for wider provider coverage) or Portkey (for managed dashboards)&lt;/td&gt;
&lt;td&gt;Native OTel (W3C) + Prometheus, two-tier cache, 18 guardrail scanners in one Go binary, part of an eval/observability platform.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resell LLM features, bill your customers per seat&lt;/td&gt;
&lt;td&gt;Per-user / per-metadata attribution&lt;/td&gt;
&lt;td&gt;LiteLLM or Portkey (rich metadata)&lt;/td&gt;
&lt;td&gt;Helicone (for the analytics layer on top)&lt;/td&gt;
&lt;td&gt;You need arbitrary per-user tags queryable later, not just per-key.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd page on
&lt;/h2&gt;

&lt;p&gt;This is the on-call checklist for a gateway in your hot path. If you adopt one of these gateways and do not wire these alerts, you are flying blind and the next $9k month is already in flight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gateway p99 latency, by route.&lt;/strong&gt; Page if p99 of the gateway-added overhead (gateway span duration minus upstream span duration) exceeds your budget for 5 minutes. This is the gateway tax going bad. Separate the proxy hop from provider latency or you will blame the wrong layer at 2am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway error rate and saturation.&lt;/strong&gt; Page on 5xx rate from the gateway above baseline, and on CPU saturation, because at high concurrency CPU is the bottleneck, not the network. A saturated gateway fails every team at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-team budget burn.&lt;/strong&gt; Page (or auto-throttle) when any virtual key crosses, say, 80% of its monthly budget before the month is 80% over. This is the alert that would have caught March on March 6, not March 31.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total spend rate-of-change.&lt;/strong&gt; Page on day-over-day total LLM spend up more than X%. A runaway retry loop or a new feature shipping uncapped shows up here first, hours before the invoice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache hit rate drop.&lt;/strong&gt; Page if cache hit rate falls below your assumed floor, because your cost model and your latency budget both silently assumed those hits. A cache that quietly stopped hitting is a bill increase and a latency regression in one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic-cache false-hit signal.&lt;/strong&gt; If you run semantic caching on anything user-facing, alert on user reports or eval-detected wrong answers correlated with cache hits. This is correctness, not cost, and it is the one that becomes a postmortem instead of a FinOps slide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Span cardinality / metrics ingestion.&lt;/strong&gt; Page if your metrics backend's active series count jumps after a deploy. That is usually someone putting a user id on a metric label. Catch it before it takes down the observability stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider failover events.&lt;/strong&gt; Alert (not page) when the gateway fails over between providers, so a silent provider degradation does not hide inside your routing logic until the bill from the more expensive fallback shows up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick the gateway whose tax you can afford and whose attribution you can bill against. Then wire the eight alerts above, because the gateway is now load-bearing infrastructure, and load-bearing infrastructure gets a pager.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Capability claims here reflect each project's public docs and READMEs as of June 2026. Latency figures are vendor-published on each vendor's own harness, not re-run on a common rig, and are not comparable across vendors. Future AGI's gateway (Agent Command Center) is open source at github.com/future-agi.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>gateway</category>
      <category>observability</category>
      <category>finops</category>
    </item>
    <item>
      <title>Langfuse alternatives: 6 LLM observability tools, sorted by the thing that bites you in month eight</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:33:19 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/langfuse-alternatives-6-llm-observability-tools-sorted-by-the-thing-that-bites-you-in-month-eight-22j8</link>
      <guid>https://dev.to/jasmine_park_dev/langfuse-alternatives-6-llm-observability-tools-sorted-by-the-thing-that-bites-you-in-month-eight-22j8</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I went looking for Langfuse alternatives after living with a proprietary tracer for eight months and then paying to migrate off it.&lt;/p&gt;

&lt;p&gt;I compared six options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Helicone&lt;/li&gt;
&lt;li&gt;Arize Phoenix&lt;/li&gt;
&lt;li&gt;LangSmith&lt;/li&gt;
&lt;li&gt;Braintrust&lt;/li&gt;
&lt;li&gt;Laminar&lt;/li&gt;
&lt;li&gt;Future AGI traceAI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all trace LLM calls.&lt;/p&gt;

&lt;p&gt;The axis that actually mattered was &lt;strong&gt;OpenTelemetry-native (OTel) vs proprietary tracing&lt;/strong&gt;, because that's what determines whether you can leave without re-instrumenting everything.&lt;/p&gt;

&lt;p&gt;Four of the six are open-source, ranging from roughly &lt;strong&gt;200 to 10,000 GitHub stars&lt;/strong&gt; (June 2026). That spread turned out to predict almost nothing about what I actually cared about: &lt;strong&gt;portability&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Axis That Bites You in Month Eight: Whose Traces Are These?
&lt;/h1&gt;

&lt;p&gt;Every tracer captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM calls&lt;/li&gt;
&lt;li&gt;Prompts&lt;/li&gt;
&lt;li&gt;Tokens&lt;/li&gt;
&lt;li&gt;Retrieval events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question nobody asks on day one and everybody regrets on day 200:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is the trace format OpenTelemetry (portable) or the vendor's own schema (locked to their dashboard)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it's proprietary, switching tools later often means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-instrumenting your application&lt;/li&gt;
&lt;li&gt;Rebuilding integrations&lt;/li&gt;
&lt;li&gt;Losing historical trace data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I learned this the expensive way.&lt;/p&gt;

&lt;p&gt;So today I sort observability tools by &lt;strong&gt;lock-in first&lt;/strong&gt; and &lt;strong&gt;features second&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Six Tools, Sorted by Lock-In
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Helicone
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The gateway-first open-source pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Often the first Langfuse alternative people mention.&lt;/p&gt;

&lt;p&gt;You proxy model calls through Helicone and get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Cost tracking&lt;/li&gt;
&lt;li&gt;Analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;with very little code change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-source (Apache-2.0)&lt;/li&gt;
&lt;li&gt;Self-hostable&lt;/li&gt;
&lt;li&gt;~5,800 GitHub stars (June 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams that want a fast observability layer with minimal engineering effort.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Arize Phoenix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The open-source OTel pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Phoenix combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTel-based tracing&lt;/li&gt;
&lt;li&gt;Evaluations&lt;/li&gt;
&lt;li&gt;Self-hosted deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core project is free and open-source.&lt;/p&gt;

&lt;p&gt;Arize's commercial offering (AX) adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise capabilities&lt;/li&gt;
&lt;li&gt;Advanced ML monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-source&lt;/li&gt;
&lt;li&gt;OTel-native&lt;/li&gt;
&lt;li&gt;Self-hosted&lt;/li&gt;
&lt;li&gt;~10,000 GitHub stars (June 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams that want portable tracing and full ownership.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. LangSmith
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The LangChain-native pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're already using LangChain or LangGraph, LangSmith provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic instrumentation&lt;/li&gt;
&lt;li&gt;Deep framework integration&lt;/li&gt;
&lt;li&gt;Strong developer experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is coupling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proprietary&lt;/li&gt;
&lt;li&gt;Closed-source&lt;/li&gt;
&lt;li&gt;Closely tied to the LangChain ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams fully committed to LangChain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most lock-in of the group.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Braintrust
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The polished SaaS pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Braintrust has one of the strongest experiences for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluations&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Cross-functional visibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Non-technical stakeholders tend to like the UI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proprietary trace schema&lt;/li&gt;
&lt;li&gt;Closed-source&lt;/li&gt;
&lt;li&gt;Managed SaaS by default&lt;/li&gt;
&lt;li&gt;Enterprise deployment options available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even in enterprise deployments, you're still operating within their trace format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; organizations that prioritize product polish over portability.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Future AGI traceAI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The no-lock-in instrumentation pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;traceAI is different from the others.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;not an observability platform&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is an &lt;strong&gt;Apache-2.0 OpenTelemetry instrumentation layer&lt;/strong&gt; that captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM calls&lt;/li&gt;
&lt;li&gt;Prompts&lt;/li&gt;
&lt;li&gt;Tokens&lt;/li&gt;
&lt;li&gt;Retrieval&lt;/li&gt;
&lt;li&gt;Agent steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and exports them to any OTel-compatible backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Datadog&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;Jaeger&lt;/li&gt;
&lt;li&gt;Vendor platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It handles instrumentation, not dashboards.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you want a polished product out of the box, tools like Langfuse or Helicone are more complete.&lt;/p&gt;

&lt;p&gt;If you want portable traces that you own, this is the lightest approach I found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache-2.0&lt;/li&gt;
&lt;li&gt;OTel-native&lt;/li&gt;
&lt;li&gt;Backend-agnostic&lt;/li&gt;
&lt;li&gt;~200 GitHub stars (June 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams optimizing for long-term portability.&lt;/p&gt;

&lt;p&gt;It's also the youngest project here, so think of it as a bet on the instrumentation-first approach rather than a mature platform.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Laminar
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The newer open-source pick.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Laminar combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OTel-based observability&lt;/li&gt;
&lt;li&gt;Evaluations&lt;/li&gt;
&lt;li&gt;Modern architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is newer than Phoenix but worth evaluating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache-2.0&lt;/li&gt;
&lt;li&gt;OTel-native&lt;/li&gt;
&lt;li&gt;~3,000 GitHub stars (June 2026)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams looking for a modern open-source observability stack.&lt;/p&gt;




&lt;h1&gt;
  
  
  My Take
&lt;/h1&gt;

&lt;p&gt;I'm not crowning a winner.&lt;/p&gt;

&lt;p&gt;Different tools optimize for different priorities.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you want...&lt;/th&gt;
&lt;th&gt;Look at...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fastest onboarding&lt;/td&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted OTel observability&lt;/td&gt;
&lt;td&gt;Phoenix or Laminar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep LangChain integration&lt;/td&gt;
&lt;td&gt;LangSmith&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polished SaaS workflows&lt;/td&gt;
&lt;td&gt;Braintrust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure OTel instrumentation&lt;/td&gt;
&lt;td&gt;traceAI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The proprietary tools are completely reasonable choices.&lt;/p&gt;

&lt;p&gt;Until the day you want to leave.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I Would Do Differently
&lt;/h1&gt;

&lt;p&gt;I would choose &lt;strong&gt;OTel-native instrumentation from day one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not because proprietary tools are bad.&lt;/p&gt;

&lt;p&gt;In many cases they're actually easier and more pleasant to use.&lt;/p&gt;

&lt;p&gt;But the cost of switching is paid later through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-instrumentation&lt;/li&gt;
&lt;li&gt;Migration effort&lt;/li&gt;
&lt;li&gt;Lost historical context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And on day one, you have no idea whether you'll outgrow the tool.&lt;/p&gt;

&lt;p&gt;The argument is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Instrument once with OpenTelemetry. Point it at whatever backend you want. Change backends without changing application code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the entire case for OTel.&lt;/p&gt;

&lt;p&gt;And it's the one strong opinion I came away with.&lt;/p&gt;




&lt;h1&gt;
  
  
  FAQ
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Is Langfuse bad?
&lt;/h2&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;Langfuse is genuinely good.&lt;/p&gt;

&lt;p&gt;It's self-hostable, widely adopted, and has roughly &lt;strong&gt;29,000 GitHub stars&lt;/strong&gt; as of June 2026.&lt;/p&gt;

&lt;p&gt;This post is about alternatives and tradeoffs, not criticism.&lt;/p&gt;




&lt;h2&gt;
  
  
  If I'm OTel-native, does that mean I don't get a dashboard?
&lt;/h2&gt;

&lt;p&gt;No.&lt;/p&gt;

&lt;p&gt;You still need somewhere to view traces.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;Datadog&lt;/li&gt;
&lt;li&gt;Jaeger&lt;/li&gt;
&lt;li&gt;Vendor platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OTel-native simply means you can change the backend later without changing instrumentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Should I self-host or use a managed service?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Self-host&lt;/strong&gt; if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data residency&lt;/li&gt;
&lt;li&gt;Infrastructure control&lt;/li&gt;
&lt;li&gt;Lower long-term platform dependency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Managed&lt;/strong&gt; if you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster setup&lt;/li&gt;
&lt;li&gt;Less operational burden&lt;/li&gt;
&lt;li&gt;A more polished experience&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Open Question
&lt;/h1&gt;

&lt;p&gt;Lock-in is easy to reason about.&lt;/p&gt;

&lt;p&gt;What I still struggle to quantify is the value difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A polished proprietary platform&lt;/li&gt;
&lt;li&gt;A portable-but-rawer OTel stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The proprietary tools genuinely save time every day.&lt;/p&gt;

&lt;p&gt;The portability benefits only pay off if you actually switch.&lt;/p&gt;

&lt;p&gt;I don't have a clean framework for valuing that optionality.&lt;/p&gt;

&lt;p&gt;If you've found a useful way to think about that tradeoff, I'd love to hear it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Per-project LLM cost attribution with OTel spans: the wiring</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Thu, 04 Jun 2026 21:01:52 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/per-project-llm-cost-attribution-with-otel-spans-the-wiring-3897</link>
      <guid>https://dev.to/jasmine_park_dev/per-project-llm-cost-attribution-with-otel-spans-the-wiring-3897</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR.&lt;/strong&gt; If your LLM bill is one line item on a cloud invoice, you cannot answer "which team spent that." We fixed this by tagging every gateway span with &lt;code&gt;team.id&lt;/code&gt;, &lt;code&gt;project.id&lt;/code&gt;, and &lt;code&gt;feature.id&lt;/code&gt;, plus the OpenInference token-count attributes, shipping those spans through an OTel collector into Tempo, and rolling cost up per team with TraceQL in Grafana. The payoff that sold it internally: one team's monthly spend quietly went from a few hundred dollars to over a thousand because of a retry loop, and the org-level dashboard never flinched. The per-team view caught it in a day. Below is the wiring, the collector config, the rollup query, the alert, and the attributes I tried and threw away.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The problem is attribution, not collection
&lt;/h2&gt;

&lt;p&gt;Most teams already collect LLM telemetry. Spans exist, tokens get counted, traces land somewhere. What is missing is the dimension that finance and eng leads actually ask about: who owns this spend. The provider invoice gives you one number per month per API key. If you share keys across services (most people do at some point), that number is useless for chargeback. You cannot tell the platform team's spend from the support-bot team's spend.&lt;/p&gt;

&lt;p&gt;So the design goal was narrow. Every LLM call has to carry enough labels that I can group spend by team, by project under that team, and by feature inside that project. Three levels. No more, because deeper than feature and nobody reads the dashboard. I standardized the whole pipeline on OpenTelemetry and OpenInference, and I will state the one opinion plainly: I want the labels, the wire format, and the storage to be things I can swap without rewriting instrumentation. We tag spans with open semantic conventions so the day we change a backend or a dashboard tool, the gateway code does not move. That is a portability decision, not a verdict on anyone's product.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Which attributes get tagged, and where
&lt;/h2&gt;

&lt;p&gt;Tag at the gateway, not in each service. We run an LLM gateway (every call to every provider goes through it), so it is the one place that sees model, token counts, and request context together. A new service gets attribution for free as long as it routes through the gateway and forwards the three context headers.&lt;/p&gt;

&lt;p&gt;The cost-math group comes straight from OpenInference semantic conventions: &lt;code&gt;llm.model_name&lt;/code&gt;, &lt;code&gt;llm.token_count.prompt&lt;/code&gt;, &lt;code&gt;llm.token_count.completion&lt;/code&gt;. The attribution group is custom, set from request headers: &lt;code&gt;team.id&lt;/code&gt;, &lt;code&gt;project.id&lt;/code&gt;, &lt;code&gt;feature.id&lt;/code&gt;. Cost is not a span attribute. I compute it at query time from token counts and a small price lookup, because prices change and I do not want last quarter's spans frozen at last quarter's rates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attribute&lt;/th&gt;
&lt;th&gt;What it buys you&lt;/th&gt;
&lt;th&gt;Keep or drop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;team.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Top-level chargeback. The number a director asks for.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;project.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Splits a team's spend across its services.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feature.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which feature drove a spike inside a project.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.model_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lets you weight tokens by per-model price.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.token_count.prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Input side of the cost.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.token_count.completion&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Output side. Usually the expensive half and the one that runs away.&lt;/td&gt;
&lt;td&gt;Keep&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;user.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-user spend, in theory. A privacy liability in traces.&lt;/td&gt;
&lt;td&gt;Drop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request.id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Already covered by the trace and span IDs.&lt;/td&gt;
&lt;td&gt;Drop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  3. The collector config
&lt;/h2&gt;

&lt;p&gt;OTLP in, batch, set anything the gateway missed, Tempo out. The one processor worth calling out is &lt;code&gt;transform&lt;/code&gt;: I use it to backfill &lt;code&gt;team.id&lt;/code&gt; with a sentinel when a service forgets the header, so unlabeled spend shows up as unattributed instead of vanishing. Cost with no label is cost you will never find.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0.0.0.0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;4317&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0.0.0.0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;4318&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5s&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;1024&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;transform/attribution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;trace_statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;span&lt;/span&gt;
        &lt;span class="na"&gt;statements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;set(attributes["team.id"], "unattributed") where attributes["team.id"] == nil&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;set(attributes["project.id"], "unknown") where attributes["project.id"] == nil&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;set(attributes["feature.id"], "unknown") where attributes["feature.id"] == nil&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;transform/attribution&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two notes from running this. Put &lt;code&gt;transform&lt;/code&gt; before &lt;code&gt;batch&lt;/code&gt; so the backfill happens per span while the data is still cheap to touch. And keep the price table out of the collector. I tried encoding per-model rates as collector attributes once. Every price change became a config deploy, and the rates drifted out of sync with what we were actually billed. Pricing lives next to the query now.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Rolling cost up by team in Grafana
&lt;/h2&gt;

&lt;p&gt;Tempo stores spans, not dollars. So the rollup is two steps: TraceQL pulls token sums grouped by the attribution attributes, and a small price map turns tokens into cost downstream. I start from this, which aggregates output-token counts (the number I watch most, because completion tokens are usually where the money and the runaways are):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ .team.id = "support-platform" &amp;amp;&amp;amp; .llm.token_count.completion &amp;gt; 0 }
| select(.project.id, .feature.id, .llm.model_name, .llm.token_count.prompt, .llm.token_count.completion)
| by(.team.id, .project.id, .llm.model_name)
| sum(.llm.token_count.completion)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop the &lt;code&gt;team.id&lt;/code&gt; filter and group by it instead for the all-teams board. The grouping by &lt;code&gt;llm.model_name&lt;/code&gt; matters: a mini-tier model and a frontier model can differ by more than an order of magnitude per token, so summing raw tokens across models hides which team is expensive because of volume versus model choice. The dollar step is deliberately dumb: a lookup from &lt;code&gt;llm.model_name&lt;/code&gt; to input-price and output-price per thousand tokens, multiplied through, summed per team. Keeping it dumb and external is what lets me re-price history when a provider changes rates.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The alert: page when a team's output tokens jump
&lt;/h2&gt;

&lt;p&gt;Cost attribution is reporting. The thing that earns its keep is the page. The rule I run is week-over-week on output tokens per team: if this week's completion-token total for any team is more than 2x the same window last week, page. Output tokens, not input, because the runaway failure modes (retry storms, an agent that loops, a prompt-chaining bug that re-asks) all show up as generation volume first. Why 2x week-over-week and not a fixed dollar ceiling: a fixed ceiling either pages constantly for your big teams or never fires for your small ones. A relative jump normalizes across team size on its own. The team whose spend doubled in the story above would have tripped a 2x rule on day one. It did not trip our dollar alert because the absolute number was still small against the org total. Small against the org, doubled for the team, is exactly the blind spot per-team attribution exists to close. Route it to whoever owns the team's budget, not a shared channel where it gets ignored.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. What I tagged and then dropped
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;user.id&lt;/code&gt;.&lt;/strong&gt; Per-user spend sounds useful and is occasionally asked for. But putting a user identifier on every span means every trace is now PII, and your whole tracing backend inherits the retention, access, and deletion obligations that come with that. The attribution win did not come close to justifying the compliance surface. Dropped it, have not missed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;request.id&lt;/code&gt;.&lt;/strong&gt; Pure redundancy. A trace already has a trace ID and every span has a span ID. Anywhere I thought I wanted it, the trace ID was already there and already correct.&lt;/p&gt;

&lt;p&gt;The pattern in both: an attribute is only worth tagging if it answers a question the cheaper attributes cannot, and if its cost (privacy, plumbing, drift) is lower than that answer is worth.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why compute cost at query time instead of writing a cost attribute on the span?&lt;/strong&gt; Prices change and I want to re-cost history when they do. A cost attribute freezes the rate at write time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need the gateway, or can each service tag its own spans?&lt;/strong&gt; You can tag per service. I prefer the gateway because it sees model and tokens and request context in one place, so a new service gets attribution by routing through it and forwarding three headers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Tempo specifically?&lt;/strong&gt; It is what we run, and TraceQL's aggregation over span attributes does the rollup I need. The attribute conventions are OpenInference, so the labels are not tied to Tempo. The point of standardizing on open conventions is that this choice is reversible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if a service forgets the attribution headers?&lt;/strong&gt; The collector backfills unattributed. The spend still shows up, just in a bucket whose name tells me to go fix the instrumentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is week-over-week 2x too noisy?&lt;/strong&gt; For steady traffic, no. For genuinely spiky workloads, raise the ratio or widen the comparison window. I bias toward a slightly noisy page over a silent doubling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;Caching breaks the token-to-cost math. Cached prompt tokens bill at a different rate (sometimes free), and I do not yet tag cache hits cleanly enough to price them right. Streaming and cancelled generations: if a client disconnects mid-stream, what is the honest output-token count, and does the provider bill for tokens generated after the cancel? Feature-level granularity has a ceiling, and I keep wanting per-prompt-version attribution but every level deeper is one more label nobody reads. And whether 2x week-over-week should itself be per-team, since some teams are spiky by nature and one global ratio serves both imperfectly. If you have wired cached-token pricing into a span-based cost model in a way that survives a provider changing its cache rates, I want to hear how.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>observability</category>
      <category>opentelemetry</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Span attributes that catch LLM cost regressions before billing does</title>
      <dc:creator>Jasmine Park</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:09:31 +0000</pubDate>
      <link>https://dev.to/jasmine_park_dev/span-attributes-that-catch-llm-cost-regressions-before-billing-does-472n</link>
      <guid>https://dev.to/jasmine_park_dev/span-attributes-that-catch-llm-cost-regressions-before-billing-does-472n</guid>
      <description>&lt;p&gt;The default OTel + OpenInference span has &lt;code&gt;llm.tokens.input&lt;/code&gt; and &lt;code&gt;llm.tokens.output&lt;/code&gt; as numeric attributes. Useful for trace-level debugging. Not useful for per-team cost regressions, because nothing groups traces by team.&lt;/p&gt;

&lt;p&gt;The 3 attribute additions that earned their keep:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;team.id&lt;/code&gt;. Tagged on every span at the gateway layer (before the call routes to the LLM provider). This is the column that makes the cost rollup possible. Without it, you can attribute spend to an org but not to a team inside the org.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;feature.id&lt;/code&gt;. The product feature that triggered the call (chat_assistant, summarizer, rag_search). Lets you see when one feature's token cost spikes vs the overall trend.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;llm.model&lt;/code&gt;. Already standard in OpenInference but worth flagging: without this, you cannot separate a cheap mini-tier model's spikes from a frontier model's spikes when both are in the same feature.&lt;/p&gt;

&lt;p&gt;The daily Tempo + Grafana query (TraceQL):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"llm-gateway"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;histogram_quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;feature&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert rule: page when 7-day-trailing average of output-tokens-per-team-per-feature jumps more than 2x week-over-week. We caught a runaway retry loop last quarter that the org-level spend dashboard missed because the total stayed within budget while one team's bill quietly doubled.&lt;/p&gt;

&lt;p&gt;What we tried and dropped:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;user.id&lt;/code&gt; tagging: privacy concerns at scale, and the rollup-by-team covered the use case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;request.id&lt;/code&gt; tagging: redundant with the trace_id; just adds cardinality.&lt;/p&gt;

&lt;p&gt;Drafted with AI assistance, edited and verified by author.&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
