I've been building an AI agent that routes requests across multiple LLM providers, OpenAI, Anthropic etc., based on the task. But pretty quickly, I...
The per-token billing approach is smart: most AI agent systems I have seen just track total API cost at the provider level, which makes it really hard to attribute spend to specific features or user actions. Breaking it down to the token level gives you the granularity to actually optimize.
One thing I have found useful in similar setups is adding a "token budget" per task type. Instead of just tracking what was spent, you set a ceiling before execution starts. If the agent is about to blow past the budget on a single task, it forces a checkpoint instead of running up the bill silently. Pairs well with the billing system you built here.
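A minimal sketch of that checkpoint idea in Python; the task types, ceilings, and class names here are made-up placeholders, not part of the setup from the article:

```python
class BudgetExceeded(Exception):
    """Raised when a task would blow past its token ceiling."""

# Hypothetical per-task-type ceilings (in tokens); tune for your workload.
TASK_BUDGETS = {"summarize": 2_000, "deep_research": 50_000}

class TokenBudget:
    def __init__(self, task_type: str):
        self.ceiling = TASK_BUDGETS[task_type]
        self.spent = 0

    def charge(self, estimated_tokens: int) -> None:
        """Checkpoint BEFORE the call: refuse instead of overspending silently."""
        if self.spent + estimated_tokens > self.ceiling:
            raise BudgetExceeded(
                f"{self.spent} spent + {estimated_tokens} requested > {self.ceiling}"
            )
        self.spent += estimated_tokens

budget = TokenBudget("summarize")
budget.charge(1_500)       # fine, under the 2,000-token ceiling
try:
    budget.charge(800)     # would exceed the ceiling -> checkpoint fires
except BudgetExceeded as e:
    print("checkpoint:", e)
```

The key design point is that the check runs on the *estimate* before execution, so the failure mode is a visible checkpoint rather than a surprise bill.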
Yeah, totally agree, budgeting is the missing control loop.
Per-token billing (what I built here with Kong AI Gateway + Konnect Metering & Billing) gives you accurate attribution: who or what actually consumed tokens. But by itself, it's reactive.
A token budget adds a runtime guardrail. For agent flows, that means checking expected token usage before each step and stopping or degrading (smaller model, less context, fewer tool calls) instead of silently overspending.
In practice, you need both:
metering for visibility, budgets for control.
"Hey, this is one of the cleanest and most practical token billing setups Iβve seen. Really well written!
I love that you went with Kong AI Gateway + Konnect Metering instead of building yet another custom pipeline. The fact that the gateway already knows the token counts and can meter them directly is such a smart move.
The part about splitting input vs output tokens (and why it matters for pricing) is gold; a lot of people miss that and end up undercharging on output-heavy usage.
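For anyone who hasn't run the numbers, a toy illustration of why the split matters; the model name and per-million-token rates here are illustrative, not any provider's real prices:

```python
# Illustrative rates in USD per 1M tokens; not real provider pricing.
PRICES = {"example-model": {"input": 2.50, "output": 10.00}}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Price input and output tokens at their own rates."""
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

# An output-heavy call: a flat blended rate would undercharge this badly,
# since the 2,000 completion tokens cost 4x what the same input tokens would.
cost = request_cost("example-model", 500, 2_000)
```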
Quick questions for you:
How's the added latency from the gateway in production? Noticeable or basically zero?
Would you recommend this stack for a smaller indie AI product, or is it more suitable once you have decent volume?
Thanks for the detailed walkthrough; saved it for future reference. Super helpful!
Thanks, really appreciate that.
On latency:
In practice, the gateway hop is usually small relative to model/provider latency, so it hasn't been the bottleneck in my experience. Kong's docs also call out that Gateway and AI Gateway are designed for minimal and predictable latency, but I'd still benchmark with your own setup (plugins, traffic, provider mix), since that's what really determines impact.
developer.konghq.com/ai-gateway/re...
For indie products:
Yeah, I think it can make sense earlier than most people expect, if you already know you need a gateway boundary, provider abstraction, per-consumer usage tracking, and usage-based billing.
AI Gateway gives you a consistent layer across providers, and Konnect Metering & Billing handles usage tracking, pricing models, subscriptions/invoicing, and limits on top.
dev.to/tejakummarikuntla/i-built-a...
If itβs a very small app with a single provider and you just need basic cost visibility, this might be more than you need initially. But once you care about attribution, enforcing limits, or monetizing usage cleanly, doing it at the gateway layer is a lot simpler than pushing all of that logic into app code.
❤️
❤️
The decision to meter at the gateway level instead of the application layer is smart. I've seen teams build token tracking into their app code, and it becomes a maintenance nightmare when you add new models or providers. The gateway already sees everything, so why duplicate that logic? One challenge I've run into with per-token billing is that users often can't predict their costs because token counts are invisible to them. A "2,000 token request" means nothing to a non-technical user. Have you considered adding a cost-estimate preview before the request actually executes, or some kind of budget cap that blocks requests once a threshold is hit? That seems like the missing UX piece for making usage-based AI billing actually work for end users.
Totally agree on both points.
Gateway-level metering was mainly about avoiding duplication and keeping model/provider changes out of the app layer.
On the UX side, you're right, token counts aren't intuitive at all. Right now this setup solves accurate billing, but not predictable costs. Adding a cost-estimate preview and a budget cap, like you describe, is what would make it more solid.
Estimation is a bit tricky (especially output tokens), but even a rough preview would go a long way. Feels like thatβs the next layer needed to make this usable for non-technical users.
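A rough pre-flight preview could be as simple as showing the user a cost range, since output length is the unknown; everything here (function name, rates, the low-end 10% heuristic) is illustrative:

```python
def preview_cost(prompt_tokens: int, max_output_tokens: int,
                 input_rate: float, output_rate: float) -> tuple[float, float]:
    """Return a (low, high) USD range to show BEFORE execution.
    Rates are USD per 1M tokens. Low assumes a short reply (10% of the
    output cap, an arbitrary heuristic); high assumes the full cap is used."""
    input_cost = prompt_tokens * input_rate
    low = (input_cost + 0.1 * max_output_tokens * output_rate) / 1_000_000
    high = (input_cost + max_output_tokens * output_rate) / 1_000_000
    return round(low, 4), round(high, 4)

low, high = preview_cost(1_200, 1_000, input_rate=2.50, output_rate=10.00)
# render to the user as something like "roughly $0.004 - $0.013"
```

Showing dollars instead of tokens sidesteps the "2,000 tokens means nothing" problem entirely.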
Love the article man. Thanks for posting it!
π
Great breakdown. One thing worth flagging for anyone copying this pattern: always log both prompt_tokens and completion_tokens separately, because output tokens are typically 3-5x more expensive depending on the model. I also add a safety buffer by multiplying the tokenizer estimate by ~1.1 before charging; real billed tokens often come in slightly higher than tiktoken's local count, especially for models that do tool-calling. How are you handling streaming responses where you don't get the final usage object until the end?
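The buffer is a one-liner, but the rounding direction matters; a sketch (the 1.1 factor is the heuristic from the comment above, not a universal constant):

```python
import math

def billable_estimate(local_token_count: int, buffer: float = 1.1) -> int:
    """Pad the local tokenizer count before reserving/charging, since
    provider-billed tokens (special tokens, tool-calling wrappers) often
    run slightly higher than a local count. Always round UP so the
    reservation never comes in under the real bill."""
    return math.ceil(local_token_count * buffer)

billable_estimate(999)    # -> 1099
billable_estimate(1_001)  # -> 1102, never rounded down to 1101
```

If you reconcile against the provider's own usage object afterwards, the buffer only needs to cover the gap between estimate time and reconciliation.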
the token budget approach only works if the price you're budgeting against is accurate. most systems hardcode a rate at build time and never update it. vendors reprice quietly, caching discounts appear or disappear, and suddenly your budget math is off by 30% or more without any visible signal. the control loop needs live pricing inputs to stay meaningful.
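one way to sketch that: keep the rate table behind a TTL-refreshed cache instead of a build-time constant. the fetcher below is a stand-in for wherever your canonical rates actually live (a pricing API, a config service, a reviewed file); class and field names are made up:

```python
import time

class LivePriceTable:
    """Cache provider rates and refresh on a TTL, so budget math tracks
    vendor repricing instead of a constant frozen at build time."""

    def __init__(self, fetch_prices, ttl_seconds: float = 3600.0):
        # fetch_prices: callable returning {model: {"input": rate, "output": rate}}
        self._fetch = fetch_prices
        self._ttl = ttl_seconds
        self._prices = {}
        self._fetched_at = float("-inf")  # force a fetch on first use

    def rate(self, model: str, direction: str) -> float:
        if time.monotonic() - self._fetched_at > self._ttl:
            self._prices = self._fetch()
            self._fetched_at = time.monotonic()
        return self._prices[model][direction]

# stand-in fetcher; in practice this would hit your pricing source of truth
table = LivePriceTable(lambda: {"example-model": {"input": 2.5, "output": 10.0}})
```

the same refresh hook is where caching discounts or tiered rates would get picked up, which is exactly the "invisible 30% drift" the comment is warning about.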
Solid walkthrough. I've been running a similar setup but hit an interesting edge case: streaming responses. When you're using SSE for chat completions, token counts aren't always available until the stream ends. Had to implement a small buffer that waits for the final chunk before emitting the usage event to the gateway.
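Roughly what that buffer looks like, assuming the provider attaches a usage object to the final chunk; the chunk shapes here are simplified, not any specific provider's schema:

```python
def usage_from_stream(chunks):
    """Consume an SSE-style stream: collect text deltas as they arrive,
    but only surface the usage record once the final chunk delivers it."""
    usage = None
    text_parts = []
    for chunk in chunks:
        if "delta" in chunk:
            text_parts.append(chunk["delta"])
        if "usage" in chunk:  # typically only present on the last chunk
            usage = chunk["usage"]
    # only now is it safe to emit the usage event to the metering layer
    return "".join(text_parts), usage

stream = [
    {"delta": "Hel"},
    {"delta": "lo"},
    {"usage": {"prompt_tokens": 12, "completion_tokens": 2}},  # final chunk
]
text, usage = usage_from_stream(stream)
```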
The input vs output pricing split is crucial. We started with a flat "token" rate and quickly realized we were losing money on long-form generation tasks. GPT-4o's 4x output premium adds up fast.
One question: how are you handling failed requests? If a request times out or hits a rate limit mid-stream, do you still bill for the partial tokens consumed? We ended up adding a "billable" flag that only gets set when the response completes successfully.
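The billable-flag idea might look something like this; field and function names are illustrative, not from the article's setup:

```python
from dataclasses import dataclass

@dataclass
class UsageEvent:
    prompt_tokens: int
    completion_tokens: int
    billable: bool = False  # only flipped on a clean completion

def finalize(event: UsageEvent, status_code: int, stream_completed: bool) -> UsageEvent:
    """Mark usage billable only when the response finished successfully.
    Timeouts and mid-stream rate limits still get METERED (you want the
    visibility), they just don't get CHARGED."""
    event.billable = stream_completed and 200 <= status_code < 300
    return event

ok = finalize(UsageEvent(100, 50), status_code=200, stream_completed=True)
aborted = finalize(UsageEvent(100, 20), status_code=429, stream_completed=False)
```

Keeping the non-billable events around is deliberate: partial tokens were still consumed upstream, so they matter for cost visibility even if you eat them instead of charging the user.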
Hi Kai, we have `provider` (e.g., Anthropic), `model` (e.g., opus-4), `type` (e.g., output), and `status_code` dimensions on metered AI requests, so you can price input and output tokens differently and filter out non-successful requests.

Really solid approach to per-token billing. The split between input and output token pricing is something a lot of teams overlook; they just track total cost per call and lose visibility into where the money actually goes.
One thing I've been thinking about with multi-provider agent setups: do you handle rate limiting or fallback routing at the gateway level too? Because if you're already tracking tokens per provider through Kong, it seems like a natural extension to add cost-aware routing β e.g., route lower-priority tasks to the cheaper model automatically based on the billing data you're already collecting.
The Konnect Metering + Stripe integration is clean. Way better than building a custom metering pipeline from scratch.
Hi, yes, Kong AI Gateway has both usage and cost rate limiters.
This is technically possible, but shouldn't that be an app decision? What counts as a low- or high-priority task is specific to what you're building.