Jasmine Park

Posted on Jun 26 • Originally published at jas-blogs.hashnode.dev

The gateway tax: 6 OpenAI-compatible gateways.

#llmops #gateway #observability #finops

On March 14, 2026, our LLM bill came in at $9,140 for the month, up from about $5,200, and I could not tell you which team spent it. The gateway in front of every provider emitted one cost line and one trace span per request, all tagged service=llm-gateway, so the platform team ate the whole overage in the FinOps review while three product teams shrugged.

That month is the reason I now treat cost attribution as a gateway design decision, not an afterthought. If you cannot answer "which team, which feature, which key spent this" from the layer every call already passes through, you will answer it never. This is a comparison of the OpenAI-compatible LLM gateways I have evaluated for exactly that job: LiteLLM, Portkey, Helicone, Cloudflare AI Gateway, and Bifrost, plus one newer open-source entrant I introduce in the comparison table below. The lens is an SRE lens. What does it cost you in p99, and how granularly can you bill it back?

TL;DR

Cost attribution belongs at the gateway, not in each app's SDK and not in your provider's dashboard. The gateway is the one chokepoint every call crosses, so it is the only place where per-team, per-feature, per-key spend is both complete and consistent.

Every OpenAI-compatible gateway you put in that path adds latency. Call it the gateway tax. It is real, it is usually single-digit milliseconds at the proxy hop, and it varies with what you turn on (caching, guardrails, semantic lookups). The tax is not the deciding factor for most teams, because provider latency dwarfs it. What actually differs across gateways, by a lot, is attribution granularity: whether you can slice spend by virtual key, by route, by user, and whether the cost shows up as a first-class OpenTelemetry span attribute or as a number you have to scrape out of a dashboard later.

So the decision rule is short. Pick the gateway whose tax you can afford at your p99 budget, and whose attribution you can actually bill against. Most teams over-index on the first half and never check the second. Then March happens.

One honesty note up front, because it matters for how you read everything below. We did not re-run a latency benchmark across these six gateways on one rig. Anybody who hands you a clean cross-vendor p99 table either ran a heroic apples-to-apples harness (rare) or is quietly comparing numbers each vendor measured on different hardware against different upstreams (common). Where I cite latency, it is the vendor's own published number, labeled as such. The capability columns (self-host, caching type, attribution granularity, OTel-native, guardrails, license) are checked against each project's public docs and READMEs, because those are verifiable and they are what you will actually live with.

Why not the app SDK, and why not the provider dashboard

Before the table, kill the two alternatives, because most teams reach for one of them first and it is why their numbers never reconcile.

Cost attribution does not belong in each app's SDK. The pitch is seductive: every service instruments its own OpenAI client, tags spend with its own team name, ships it to your metrics backend. In practice you now have N implementations of "compute token cost" drifting against each other. One team is on an old pricing table. One forgot to count cached input tokens at the discounted rate. One service calls the provider directly in a cron job and bypasses instrumentation entirely, so that spend is simply invisible. When the provider changes per-token pricing (they do, quietly), you are editing N codebases to stay correct. SDK metering is great for in-process latency spans. It is a bad system of record for dollars, because the source of truth is smeared across every repo and every deploy cadence.

Cost attribution does not belong in the provider dashboard either. The OpenAI or Anthropic billing console knows your org spent the money. It does not know your org chart. It cannot tell you that team-checkout spent $4k and team-search spent $300, because your teams are not a concept the provider has. The best you get is per-API-key, and only if you had the discipline to mint one key per team up front and never share them, which under load nobody does. Multi-provider makes it worse: now you are stitching three billing consoles, three export formats, three currencies of "cost," into one spreadsheet a human maintains by hand. That spreadsheet is wrong by the second week.

The gateway is the only layer that sees every request, knows which credential made it, can compute cost once against one pricing table, and can stamp that cost onto a span before the response leaves the building. That is the whole argument. Now, which gateway.
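To make "compute cost once against one pricing table" concrete, here is a minimal sketch of the single cost function the gateway would own. The model names and per-million-token prices are hypothetical placeholders, not any provider's real pricing, and `request_cost_usd` is an illustrative name rather than any gateway's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens. All values below are made up for illustration."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# One pricing table, owned by the gateway. Hypothetical models and prices.
PRICING = {
    "example-large": Price(input_per_m=5.00, cached_input_per_m=2.50, output_per_m=15.00),
    "example-small": Price(input_per_m=0.50, cached_input_per_m=0.25, output_per_m=1.50),
}


def request_cost_usd(model: str, input_tokens: int,
                     cached_input_tokens: int, output_tokens: int) -> float:
    """Cost of one request.

    cached_input_tokens is assumed to be the subset of input_tokens that the
    provider bills at the discounted cached rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    price = PRICING[model]
    uncached = input_tokens - cached_input_tokens
    total = (
        uncached * price.input_per_m
        + cached_input_tokens * price.cached_input_per_m
        + output_tokens * price.output_per_m
    )
    return total / 1_000_000
```

When a provider changes pricing, the edit is one table in one deploy, which is the whole point of not smearing this function across N repos.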

Definitions, so the table means something

Two terms do all the work in this post. Pin them down before you read the comparison.

Cost-attribution granularity is the finest dimension along which the gateway can split spend without you doing post-hoc log surgery. I rank it in three tiers:

Per-key: the gateway issues virtual keys (its own keys, mapped to upstream provider keys) and tracks spend and budget per virtual key. You hand team-checkout a virtual key, and its spend is isolated. This is the floor for billing back, and honestly it is enough for most orgs.
Per-route / per-model: spend split by which model or endpoint served the call, so you can see that GPT-4-class traffic is 80% of cost while being 10% of calls.
Per-user / per-metadata: arbitrary tags (end-user id, feature flag, tenant) attached at request time and queryable later. This is what you need for usage-based billing to your customers, not just internal chargeback.

A gateway that only gives you per-key is fine for internal FinOps. A gateway that gives you per-user metadata is what you need if you resell LLM features and bill your customers per seat.
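The three tiers are easier to see as three group-bys over the same spend records. The sketch below assumes a hypothetical record shape (the field names `virtual_key`, `model`, `end_user`, and `usd` are mine, not any gateway's schema) and made-up amounts.

```python
from collections import defaultdict

# Hypothetical spend records as a gateway might log them, one per request.
RECORDS = [
    {"virtual_key": "team-checkout", "model": "example-large", "end_user": "u1", "usd": 0.40},
    {"virtual_key": "team-checkout", "model": "example-small", "end_user": "u2", "usd": 0.05},
    {"virtual_key": "team-search", "model": "example-large", "end_user": "u1", "usd": 0.30},
]


def spend_by(records, dimension):
    """Sum spend along one attribution dimension (a key present on every record)."""
    totals = defaultdict(float)
    for record in records:
        totals[record[dimension]] += record["usd"]
    return dict(totals)


per_key = spend_by(RECORDS, "virtual_key")   # tier 1: internal chargeback
per_model = spend_by(RECORDS, "model")       # tier 2: which models drive cost
per_user = spend_by(RECORDS, "end_user")     # tier 3: usage-based billing
```

A gateway that only records `virtual_key` can answer the first question and nothing else; the other two dimensions cannot be recovered after the fact if they were never attached at request time.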

The gateway tax is the latency the gateway hop adds on top of provider latency. It has a floor (the proxy itself: parse, auth, route, re-serialize) and a variable part (every feature you enable adds a little: an exact-cache lookup is cheap, a semantic-cache vector search is not free, each inline guardrail is a synchronous scan). The tax is paid on every request that is not a cache hit. On a cache hit you skip the provider entirely and the gateway saves you latency, which is the one case where the tax goes negative. The mistake teams make is benchmarking the bare proxy, seeing 2 ms, and budgeting as if guardrails and semantic cache are free. They are not. Measure the tax with your real feature set on, or do not quote it.
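A back-of-the-envelope model of that paragraph, as a sketch under simplifying assumptions: every request pays the proxy floor plus the enabled-feature cost, and a cache hit skips the provider call entirely. All numbers in the usage note are invented inputs, not measurements of any gateway.

```python
def expected_added_latency_ms(proxy_ms: float, feature_ms: float,
                              provider_ms: float, cache_hit_rate: float) -> float:
    """Mean latency the gateway adds versus calling the provider directly.

    Assumptions (simplifications, not measured behavior):
      - miss: request pays proxy + features on top of the provider call
      - hit:  request pays proxy + features but skips the provider call
    A negative result means the gateway is a net latency win on average.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    overhead = proxy_ms + feature_ms
    added_on_miss = overhead
    added_on_hit = overhead - provider_ms
    return (1.0 - cache_hit_rate) * added_on_miss + cache_hit_rate * added_on_hit
```

With a 2 ms proxy floor, 8 ms of enabled features, and an 800 ms provider call, the tax is +10 ms at a 0% hit rate and goes negative once the hit rate passes 10 / 800 = 1.25%. Swap in the numbers from your own rig; the shape of the formula is the useful part, not these inputs.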

And again, the number you measure on your rig is not comparable to the number a vendor measured on theirs. Different CPU, different upstream, different concurrency, different request body size. Treat every cross-vendor latency claim, including the ones in this post, as directional.

The comparison

Read this as capabilities first, latency last. The capability columns are what you live with daily. Latency is deliberately not a column: the figures are vendor-published and not re-run by us, so they sit in the notes under the table as the least load-bearing thing here.

Gateway	Self-host?	Caching (exact / semantic)	Cost-attribution granularity	OTel-native?	Inline guardrails?	License	Verdict
LiteLLM	Yes	Exact (Redis/in-mem/disk/S3/GCS) + semantic (Qdrant/Redis)	Per-key, per-team, per-user (virtual keys + budgets + spend tags)	Via OTel callback/integration	Via plugins + Guardrails hooks	MIT (OSS); paid Enterprise tier	Broadest provider + ecosystem coverage. Default pick if you want the biggest model zoo.
Portkey	Yes (gateway is OSS; full platform is SaaS)	Simple (exact) + semantic	Per virtual key + metadata tags; rich SaaS dashboards	Partial / via integrations	Yes (integrated Guardrails)	Gateway MIT; platform proprietary SaaS	Most polished managed dashboards and config UI. Default if you want a hosted control plane, not a DIY one.
Helicone	Yes (self-host available)	Exact-match only (cache-key hash)	Custom properties (per-user / per-feature) via metadata; per-key	OTLP ingest (observability-first)	Limited / not the focus	Apache-2.0 (OSS)	Observability-first, not a routing-heavy gateway. Default if logging + analytics is the job.
Cloudflare AI Gateway	No (Cloudflare edge, cloud-only)	Caching (exact); no documented semantic cache	Per-request analytics, basic metadata; provider/token/cost metrics	No documented OTel export	Not the focus	Proprietary (managed service)	Zero-ops edge gateway. Default if you are already all-in on Cloudflare and want one toggle.
Bifrost	Yes	Semantic caching (exact also supported)	Hierarchical budgets: virtual keys, teams, customers	Yes (Prometheus + OTel/tracing)	Yes (plugin middleware / enterprise guardrails)	Apache-2.0 (Go)	Fast Go OSS gateway with strong budget hierarchy. Default if you want OSS + native budgets and live in Go.
Future AGI Agent Command Center	Yes (single Go binary)	Exact (6 backends) + semantic (4 backends)	Per virtual key budgets/quotas + per-request cost on the span	Yes, OTel-native (W3C trace context) + Prometheus `/metrics`	Yes, 18 built-in scanners + external adapters	Apache-2.0 (Go)	End-to-end OSS platform where the gateway is one piece beside eval/observability. Default if you want OTel + Prometheus + caching + guardrails in one binary.

Notes on the latency column, deliberately kept out of the table because it is not comparable: LiteLLM publishes proxy-overhead figures in the single-digit-millisecond range on their own harness; Future AGI publishes a vendor benchmark of roughly +1.4 ms P95 added by three inline guardrails and a lower added-latency figure than LiteLLM measured on Future AGI's own rig (their numbers, their methodology, not verified by us); Bifrost publishes its own low-microsecond internal-selection numbers. None of these were measured against each other. Do not put them in a slide as if they were.

Gateway by gateway

LiteLLM

The one with the longest provider list and the deepest ecosystem. If a model exists, LiteLLM probably has a route to it, and the litellm SDK is already in half the agent frameworks you will touch. For attribution it is genuinely strong: virtual keys, budgets, and spend tracking down to key, team, and user, plus cache (exact via Redis and friends, semantic via Qdrant or Redis). OpenTelemetry is available through its callback/integration system rather than being the native wire format, which means you wire it up rather than getting it for free. The tax is the usual proxy hop; LiteLLM publishes single-digit-ms overhead on their own harness. The cost of all that breadth is configuration surface: there is a lot of it, and a lot of ways to hold it wrong.

Choose LiteLLM when your priority is provider coverage and ecosystem fit, and you have someone who will own the config.

Portkey

The most polished managed experience. The gateway core is open source and you can run it with npx @portkey-ai/gateway, but the part people actually pay for is the hosted control plane: the dashboards, the config UI, the virtual-key and metadata management without you standing up storage. Caching is simple plus semantic, guardrails are integrated, attribution is per-virtual-key plus metadata tags. If you want to hand a non-platform team a screen where they can see their own spend without you building it, Portkey is the shortest path. The trade is that the nice parts are SaaS and proprietary, so the dependency is on Portkey-the-company, not just Portkey-the-binary.

Choose Portkey when you want a managed control plane and dashboards out of the box, and SaaS dependency is acceptable.

Helicone

Observability-first. Helicone is excellent at logging every request, tagging it with custom properties, and giving you analytics over that, including per-user and per-feature cost slicing via metadata. Caching is exact-match only (the cache key is a hash of URL, body, and relevant headers, so "Hello" and "Hi" are different entries). It is self-hostable and open source, and it leans into OTLP-style ingest because its center of gravity is the observability plane, not heavy multi-provider routing or failover. If your real problem is "I cannot see what my LLM calls are doing," Helicone solves that cleanly. If your real problem is "I need 15 routing strategies and inline guardrails," it is not aimed there.

Choose Helicone when logging, analytics, and per-feature cost visibility are the job and routing is secondary.

Cloudflare AI Gateway

The zero-ops option. It runs on Cloudflare's edge, so there is no binary to operate and no SPOF you own (you inherit Cloudflare's). It does caching and gives you analytics: request counts, tokens, cost. What you do not get, per the public docs, is self-hosting, a documented OpenTelemetry export, or deep per-team attribution beyond request-level metadata. It is the right answer when you are already on Cloudflare, you want one dashboard and one toggle, and your attribution needs stop at "roughly how much, roughly where."

Choose Cloudflare AI Gateway when you want a managed edge gateway with near-zero ops and you already live on Cloudflare.

Bifrost

A fast Go OSS gateway (Apache-2.0) with a genuinely good cost model: hierarchical budgets across virtual keys, teams, and customers, which maps cleanly onto chargeback. It ships native Prometheus metrics and distributed tracing / OTel, semantic caching, and a plugin middleware system for analytics and guardrail-style logic. It is newer and the ecosystem is smaller than LiteLLM's, so you trade provider breadth for a tight, performant core and a budget hierarchy that is built in rather than bolted on.

Choose Bifrost when you want OSS, native budget hierarchy, and Prometheus + OTel, and you are comfortable in the Go ecosystem.

Future AGI Agent Command Center

An OpenAI-compatible gateway shipped as a single Go binary, Apache-2.0, open source (repo at github.com/future-agi). As of June 2026 it ships 15 routing strategies, two-tier caching (6 exact-match backends and 4 semantic backends), and 18 built-in guardrail scanners plus adapters for external guardrail vendors. The piece that matters for this post: it is OpenTelemetry-native using W3C trace context and also exposes a Prometheus /metrics endpoint, and it tracks per-virtual-key budgets and quotas, so cost can ride on the span rather than living only in a dashboard. It also ships a committed, reproducible benchmark harness (a bench/ directory with a mock upstream), which I respect more than a marketing number, because it means you can re-run their claim instead of trusting it.

On their own published benchmark (vendor numbers, not verified by us), three inline guardrails add roughly +1.4 ms at P95, and they claim lower added latency than LiteLLM measured on their rig. Same caveat as everywhere else: their hardware, their upstream, their methodology. The honest positioning: LiteLLM still has the broadest provider and ecosystem coverage, and Portkey has the more polished managed SaaS and dashboards. Future AGI's actual edge is that the gateway is one component of an end-to-end open-source platform that also does eval and observability, with native OTel plus Prometheus and built-in caching and guardrails in a single binary, so you are not assembling four tools to get attribution onto a span.

Choose Agent Command Center when you want OTel + Prometheus + caching + guardrails in one OSS binary, and you value the gateway being part of one eval/observability platform.

The diagram you should draw on your whiteboard

Figure: the gateway is the one layer every call crosses. Stamp cost on the OpenTelemetry span at GOVERN/COST and attribution stays complete and consistent.

The single most important thing in that diagram is where the span is emitted. It is emitted inside the gateway, at the govern/cost control point, after the gateway has resolved the credential and computed the cost. That is what makes attribution complete (every call crosses it) and consistent (one pricing table, one cost function). Move that emission into each app and you reintroduce every drift problem from the "why not the SDK" section above.

Honest limitations: where every one of these adds risk

No gateway is free of downside. If you put one in your hot path, you have signed up for these, regardless of vendor.

Single point of failure. Every request now depends on the gateway being up. A managed edge service (Cloudflare) trades your SPOF for theirs, which may be a better or worse bet than your own uptime. A self-hosted gateway (LiteLLM, Bifrost, Future AGI) is yours to make HA: run more than one replica, put a real load balancer in front, and test failover before you need it. "We deployed one gateway pod" is not a control plane, it is an incident waiting for a node drain.

Cache poisoning and stale answers. Semantic caching is the feature most likely to bite you. A vector-similarity hit can return a cached answer for a prompt that is close but not equivalent, and now one user sees another user's response, or a stale answer to a changed question. Exact caching is safer but still leaks across users if your cache key does not include the right scoping. Scope cache keys per tenant where correctness matters, and keep semantic caching off for anything with PII or per-user state until you have measured the false-hit rate.

Span-cardinality blowup. The fix for attribution (rich tags on every span) is also the way you melt your metrics backend. Put end_user_id as a label on a Prometheus metric and you have just created one time series per user. That is a cardinality bomb. Keep high-cardinality identifiers (user id, request id) on traces and logs, where high cardinality is fine, and keep your metric labels low-cardinality (team, model, provider, cache_hit). Conflating the two is the most common way an attribution rollout pages the observability team instead of the FinOps team.
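The arithmetic behind the cardinality bomb is just a product of label cardinalities, and the fix is an allowlist that decides which attributes may become metric labels. A sketch, with made-up cardinalities and an allowlist that mirrors the labels named above:

```python
from math import prod

# Only these attributes are allowed to become metric labels.
METRIC_LABELS = ("team", "model", "provider", "cache_hit")


def split_attributes(attrs: dict) -> tuple[dict, dict]:
    """Return (metric_labels, span_only_attributes) for one request."""
    labels = {k: v for k, v in attrs.items() if k in METRIC_LABELS}
    span_only = {k: v for k, v in attrs.items() if k not in METRIC_LABELS}
    return labels, span_only


def worst_case_series(cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(cardinalities.values())


safe = worst_case_series({"team": 3, "model": 4, "provider": 2, "cache_hit": 2})
bomb = worst_case_series({"team": 3, "model": 4, "provider": 2, "cache_hit": 2,
                          "end_user_id": 10_000})
```

With these illustrative numbers, `safe` is 48 series and `bomb` is 480,000, for a single metric. Everything `split_attributes` returns in the second dict still reaches the trace, so you lose no attribution detail by keeping it off the metric.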

A pasteable artifact: per-key budget plus OTel export

Here is a minimal, runnable setup for one gateway (LiteLLM, because its config is the most widely deployed and the spend tracking is mature), showing a per-virtual-key budget and OpenTelemetry export, plus the queries that turn it into a bill-back.

docker-compose.yml:

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      DATABASE_URL: postgres://litellm:litellm@db:5432/litellm
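      # Send OTel spans to your collector (otel-collector is not defined in this
      # file; point this at your own). Exporter variable names have differed across
      # LiteLLM versions (some read OTEL_EXPORTER / OTEL_ENDPOINT), so confirm
      # against the OpenTelemetry integration docs for the image tag you pin.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317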
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: litellm
      POSTGRES_DB: litellm
    volumes:
      - litellm-pg:/var/lib/postgresql/data

volumes:
  litellm-pg:

config.yaml:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  # Emit an OpenTelemetry span per request, with cost + tokens as attributes.
  callbacks: ["otel"]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  # Connecting Postgres is what persists keys, budgets, and spend logs,
  # so spend can be queried per key/team/user.
  database_url: os.environ/DATABASE_URL
  # Optional and unrelated to spend: persists models added via the API/UI.
  store_model_in_db: true

Mint a virtual key for one team, with a hard monthly budget, so March cannot happen silently:

curl -s http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "key_alias": "team-checkout",
        "models": ["gpt-4o"],
        "max_budget": 500,
        "budget_duration": "30d",
        "metadata": {"team": "checkout", "cost_center": "cc-4471"}
      }'

That key now refuses traffic once team-checkout crosses $500 in a 30-day window, and every call it makes carries team=checkout into the spend store and onto the OTel span.

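Attributing spend to a team comes from the gateway's own spend store. With LiteLLM's spend logs in Postgres, the bill-back for last month is one query. Table and column names follow LiteLLM's database schema, so check them, and where the team tag actually lands in metadata, against the version you run: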

SELECT
  metadata ->> 'team'      AS team,
  COUNT(*)                 AS requests,
  ROUND(SUM(spend)::numeric, 2) AS usd
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= date_trunc('month', now()) - interval '1 month'
  AND "startTime" <  date_trunc('month', now())
GROUP BY 1
ORDER BY usd DESC;

And for the live alerting view, scrape low-cardinality cost metrics into Prometheus and rank trailing-30-day spend by team. With a gateway that exposes a per-team cost counter (label team, deliberately low-cardinality; the metric name below is illustrative), the PromQL is:

topk(5,
  sum by (team) (
    increase(llm_gateway_cost_usd_total[30d])
  )
)

Keep team, model, and provider as metric labels. Keep end_user_id and request_id out of metrics and on the trace instead. That one discipline is the difference between an attribution dashboard and a cardinality incident.

Paste this into your PRD

A scenario matrix for the decision review, so the next person does not re-derive it.

Scenario	Priority	Default pick	Escalate to	Why
Internal chargeback, many providers	Provider breadth + per-team spend	LiteLLM	Bifrost (if you want native budget hierarchy in Go)	Biggest model zoo, mature virtual keys and spend tracking; budgets get you per-team bill-back.
Non-platform teams need their own spend screen	Managed dashboards, low build cost	Portkey	LiteLLM self-host (if SaaS dependency is a no)	Hosted control plane and config UI mean you do not build the dashboard yourself.
"I cannot see what my LLM calls do"	Logging + per-feature cost visibility	Helicone	Future AGI ACC (if you also need routing + guardrails)	Observability-first with custom-property attribution; exact-match cache.
Already on Cloudflare, want near-zero ops	One toggle, no binary to run	Cloudflare AI Gateway	Any self-hosted gateway (when you outgrow request-level attribution)	Edge-managed, no SPOF you operate; attribution stops at request-level metadata.
Want OTel + Prometheus + cache + guardrails in one OSS binary	One platform, attribution on the span	Future AGI Agent Command Center	LiteLLM (for wider provider coverage) or Portkey (for managed dashboards)	Native OTel (W3C) + Prometheus, two-tier cache, 18 guardrail scanners in one Go binary, part of an eval/observability platform.
Resell LLM features, bill your customers per seat	Per-user / per-metadata attribution	LiteLLM or Portkey (rich metadata)	Helicone (for the analytics layer on top)	You need arbitrary per-user tags queryable later, not just per-key.

What I'd page on

This is the on-call checklist for a gateway in your hot path. If you adopt one of these gateways and do not wire these alerts, you are flying blind and the next $9k month is already in flight.

Gateway p99 latency, by route. Page if p99 of the gateway-added overhead (gateway span duration minus upstream span duration) exceeds your budget for 5 minutes. This is the gateway tax going bad. Separate the proxy hop from provider latency or you will blame the wrong layer at 2am.
Gateway error rate and saturation. Page on 5xx rate from the gateway above baseline, and on CPU saturation, because at high concurrency CPU is usually the bottleneck, not the network. A saturated gateway fails every team at once.
Per-team budget burn. Page (or auto-throttle) when any virtual key crosses, say, 80% of its monthly budget before the month is 80% over. This is the alert that would have caught March on March 6, not March 31.
Total spend rate-of-change. Page on day-over-day total LLM spend up more than X%. A runaway retry loop or a new feature shipping uncapped shows up here first, hours before the invoice.
Cache hit rate drop. Page if cache hit rate falls below your assumed floor, because your cost model and your latency budget both silently assumed those hits. A cache that quietly stopped hitting is a bill increase and a latency regression in one.
Semantic-cache false-hit signal. If you run semantic caching on anything user-facing, alert on user reports or eval-detected wrong answers correlated with cache hits. This is correctness, not cost, and it is the one that becomes a postmortem instead of a FinOps slide.
Span cardinality / metrics ingestion. Page if your metrics backend's active series count jumps after a deploy. That is usually someone putting a user id on a metric label. Catch it before it takes down the observability stack.
Provider failover events. Alert (not page) when the gateway fails over between providers, so a silent provider degradation does not hide inside your routing logic until the bill from the more expensive fallback shows up.
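Two of those alerts reduce to a few lines of arithmetic, sketched here so the conditions are unambiguous. The function names and the 80% threshold are illustrative, and the burn check assumes a calendar-month budget; a rolling 30-day budget window would need the window start instead of the first of the month.

```python
import calendar
from datetime import datetime, timezone


def gateway_overhead_ms(gateway_span_ms: float, upstream_span_ms: float) -> float:
    """Gateway tax for one request: gateway span minus the upstream provider span."""
    return max(0.0, gateway_span_ms - upstream_span_ms)


def month_elapsed_fraction(now: datetime) -> float:
    """Fraction of the current calendar month that has elapsed at `now`."""
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return (now - month_start).total_seconds() / (days_in_month * 86_400)


def budget_burn_alert(spend_usd: float, budget_usd: float,
                      now: datetime, threshold: float = 0.8) -> bool:
    """True when spend crossed `threshold` of budget before `threshold` of the month passed."""
    return (spend_usd >= threshold * budget_usd
            and month_elapsed_fraction(now) < threshold)


early = datetime(2026, 3, 6, tzinfo=timezone.utc)
late = datetime(2026, 3, 28, tzinfo=timezone.utc)
fires_early = budget_burn_alert(410.0, 500.0, early)   # 82% spent, ~16% of month gone
fires_late = budget_burn_alert(410.0, 500.0, late)     # 82% spent, ~87% of month gone
```

`fires_early` is the page you want; `fires_late` stays quiet because spending 82% of budget by the 28th is roughly on pace. Feed `gateway_overhead_ms` into your p99 computation per route so the proxy hop and the provider are never blamed for each other.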

Pick the gateway whose tax you can afford and whose attribution you can bill against. Then wire the eight alerts above, because the gateway is now load-bearing infrastructure, and load-bearing infrastructure gets a pager.

Capability claims here reflect each project's public docs and READMEs as of June 2026. Latency figures are vendor-published on each vendor's own harness, not re-run on a common rig, and are not comparable across vendors. Future AGI's gateway (Agent Command Center) is open source at github.com/future-agi.

Top comments (1)

Max Quimby • Jun 29

The "which team, which feature, which key spent this" framing matches exactly what bit us. The thing I'd add: attribution granularity isn't only a gateway feature — it's a discipline you have to enforce upstream. A single shared virtual key behind a service means the gateway can attribute to the service but never to the feature or the request that caused the spend, no matter how good its OTel support is. We ended up minting a key per logical caller and propagating a feature/trace tag as a header the gateway promotes to a span attribute, otherwise you're right back to one service=llm-gateway line eating the overage. Strong +1 on refusing to trust cross-vendor p99 tables, too. The tax that actually hurt us wasn't the proxy hop — it was turning on semantic caching, which added a synchronous embedding call on the hot path. Did caching mode dominate the tax variance in your eval, or did guardrails turn out to be the bigger swing?