DEV Community

John Medina

Per-customer cost attribution without a proxy

Most AI cost tracking solutions force you to route all your LLM traffic through their proxy. Frankly, that's an architectural nightmare waiting to happen. You're adding latency, introducing a single point of failure, and handing some third-party service the keys to your entire prompt stream.

If their proxy goes down, your app goes down. If their proxy gets slow, your users think your app is slow. And let's not even talk about the compliance headache of sending sensitive customer data through an intermediary just to track API costs.

You don't need a proxy to figure out which customer is burning your OpenAI budget. You just need proper attribution at the request level, handled asynchronously.

The Problem with LLM Billing

When you look at your billing dashboard on OpenAI or Anthropic, you just see total tokens used and a massive dollar amount at the end of the month. You don't see that user_123 ran a massive batch extraction job that cost you $40 in API calls, while your other 100 users cost $2 combined.

Multi-tenant SaaS apps need unit economics. If you charge a flat $20/mo subscription but a power user is burning $50/mo in Claude 3.5 API costs, you are actively losing money. But to fix it, you need to know exactly who is spending what.
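To make that unit-economics check concrete, here is a tiny sketch. The $20 flat fee matches the example above; the type and function names are illustrative, and the usage numbers would come from your own attribution database:

```typescript
// Flag customers whose monthly LLM spend exceeds what they pay you.
// `CustomerUsage` and `unprofitableCustomers` are illustrative names.
type CustomerUsage = { customerId: string; monthlyLlmCostUsd: number };

const FLAT_FEE_USD = 20; // the flat subscription price from the example above

function unprofitableCustomers(usage: CustomerUsage[]): string[] {
  return usage
    .filter((u) => u.monthlyLlmCostUsd > FLAT_FEE_USD)
    .map((u) => u.customerId);
}
```

You can only run this query if every API call was attributed to a customer in the first place — which is the whole point of the rest of this post.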

Why Proxies Are a Bad Idea for This

A lot of dev tools in the AI space right now tell you to just swap your base URL from api.openai.com to proxy.theirservice.com.

Here is what happens when you do that:

  1. Every request adds 50-200ms of network overhead.
  2. If the proxy goes down, your production app fails to serve requests.
  3. You are sending raw PII and proprietary data to a vendor just to count tokens.

It's massive overkill. Cost tracking should be out-of-band. It should never be in the critical path of your application's request/response cycle.

The Async Logging Approach

The correct way to handle this is logging costs asynchronously after the request completes. Your app talks directly to the provider (OpenAI, DeepSeek, OpenRouter, Anthropic), gets the token usage from the response, and fires a background job to log it against the customer ID in your own database.

Here is the flow:

  1. User triggers an action.
  2. Your backend calls the LLM provider directly using their official SDK.
  3. Provider responds with the completion and usage stats (prompt_tokens, completion_tokens).
  4. Your backend returns the response to the user immediately.
  5. Your backend fires a non-blocking async event (e.g., using Inngest, BullMQ, or standard background workers) with the user ID, model used, and token count.

This gives you zero added latency. Zero third-party risk. Your app stays fast and reliable even if your cost-tracking database goes down.

Implementing the Calculation

Calculating the cost is straightforward but tedious. You need to maintain a pricing table for every model you support.

For example, if the payload from OpenAI says:
{ "prompt_tokens": 1500, "completion_tokens": 400 }

Your background worker looks up the current pricing for that model in your table, multiplies it out, and writes the resulting cost to your database against the customer ID.
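A minimal sketch of that worker-side calculation, assuming a hand-maintained pricing table. The per-million-token rates below are illustrative placeholders — always check the provider's current pricing page rather than trusting numbers in a blog post:

```typescript
// USD per 1M tokens. Example rates only — verify against the provider's
// pricing page before using in production.
type ModelPrice = { inputPerMTok: number; outputPerMTok: number };

const PRICING: Record<string, ModelPrice> = {
  "gpt-4o": { inputPerMTok: 2.5, outputPerMTok: 10 },
  "gpt-4o-mini": { inputPerMTok: 0.15, outputPerMTok: 0.6 },
};

function costUsd(
  model: string,
  promptTokens: number,
  completionTokens: number,
): number {
  const p = PRICING[model];
  // Fail loudly on unknown models so a new model can't silently log $0.
  if (!p) throw new Error(`no pricing entry for model ${model}`);
  return (
    (promptTokens / 1_000_000) * p.inputPerMTok +
    (completionTokens / 1_000_000) * p.outputPerMTok
  );
}
```

For the example payload above (1,500 prompt tokens, 400 completion tokens), at these example gpt-4o rates the math is 1500/1e6 × $2.50 + 400/1e6 × $10 ≈ $0.00775 per request. Throwing on unknown models is deliberate: the tedious part of this approach is keeping the table current, and a loud failure is better than silently undercounting.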

That's the whole system: direct provider calls, an async worker, and a pricing table you maintain yourself — per-customer cost attribution with no proxy anywhere in the critical path.
