If you're running an AI agency, you're probably building some
variation of RAG or agentic workflows for your clients.
You deliver the project, it works great, and then the first OpenAI bill hits.
Most agencies I talk to are still in the "winging it" phase when it
comes to API costs. They use one master key for dev, one for prod, and
maybe—if they're feeling fancy—one key per client.
But fwiw, per-client keys are a maintenance nightmare. And if you're
using a single master key for multiple clients, you're flying blind.
The "Averages" Trap
You might think: "I'll just charge a flat $100/mo for API usage."
Then one client decides to run a bulk ingest of 5,000 PDFs. Your
margin on that client just went negative, and you won't even know it
until the end of the month when you see a spike in the dashboard that
you can't explain.
Averages don't work for LLMs. Usage is too spiky.
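To see how fast a flat fee evaporates, here's a back-of-the-envelope calculation. All the numbers are assumptions for illustration: ~10,000 input tokens per PDF (a few dozen pages each) and gpt-4o list pricing at the time of writing ($2.50 per 1M input tokens, $10 per 1M output tokens) — check current pricing before relying on these.

```javascript
// Back-of-the-envelope cost for a bulk ingest.
// Prices are assumed gpt-4o list prices; verify against the
// provider's current pricing page.
const PRICE_INPUT_PER_MTOK = 2.5;   // USD per 1M input tokens
const PRICE_OUTPUT_PER_MTOK = 10.0; // USD per 1M output tokens

function estimateCost({ docs, inputTokensPerDoc, outputTokensPerDoc }) {
  const inputTokens = docs * inputTokensPerDoc;
  const outputTokens = docs * outputTokensPerDoc;
  return (
    (inputTokens / 1e6) * PRICE_INPUT_PER_MTOK +
    (outputTokens / 1e6) * PRICE_OUTPUT_PER_MTOK
  );
}

// 5,000 PDFs, ~10k tokens each, summarized to ~500 tokens each:
const cost = estimateCost({
  docs: 5000,
  inputTokensPerDoc: 10000,
  outputTokensPerDoc: 500,
});
console.log(`Bulk ingest: ~$${cost.toFixed(2)}`); // ~$150.00
```

One bulk ingest and that client's $100/mo flat fee is already underwater, before counting their normal daily usage.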
3 ways to handle client attribution
1. The Metadata Header (The "Minimum Viable" way)
OpenAI, Anthropic, and OpenRouter all allow you to pass a user or
metadata header.
```javascript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [...],
  user: "client_acme_corp" // tag every request with the client it belongs to
});
```
This is the bare minimum. It lets you export a CSV at the end of the
month and spend 4 hours in Excel trying to pivot-table your way to a
client invoice. tbh, it's better than nothing, but it's not real-time.
2. The Custom Proxy
You build a middleman. Every request from your client's app goes to
your proxy first, you log the tokens, then you forward it to OpenAI.
Pros: Absolute control.
Cons: You just added a single point of failure and 200ms of latency to
every request. Unless you're a DevOps wizard, this is usually
over-engineering for a boutique agency.
3. Real-time Attribution (The "Sanity" way)
You keep your direct provider connection but fire an async event for
every request.
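The pattern looks roughly like this. The collector URL and event shape below are hypothetical placeholders, not any particular product's API — the point is that the tracking call is fired after the provider response comes back and is never awaited:

```javascript
// Build a usage event from a completed request. Field names are
// illustrative; shape it however your store expects.
function buildUsageEvent(clientId, model, usage) {
  return {
    client: clientId,
    model,
    prompt_tokens: usage.prompt_tokens,
    completion_tokens: usage.completion_tokens,
    ts: new Date().toISOString(),
  };
}

function trackUsage(event) {
  // Deliberately not awaited: the main request path never waits
  // on attribution, so it adds zero latency.
  fetch("https://collector.example.com/events", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(event),
  }).catch(() => {
    /* attribution must never break the actual request */
  });
}

// After the normal, direct provider call:
// const response = await openai.chat.completions.create({ ... });
// trackUsage(buildUsageEvent("client_acme_corp", "gpt-4o", response.usage));
```

If the collector is down, you lose a data point, not a client request — that's the trade-off that makes this approach safer than a proxy.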
This is why I built LLMeter. I needed a way to see exactly which
client was spending what, in real-time, without adding latency to the
main request.
We use it at Simplifai for our own tools. It tracks:
- Cost per client (tenant)
- Cost per model
- Daily burn rates
If a client starts hitting the API harder than expected, I get an
alert immediately. I can then decide to upsell them, cap their usage,
or adjust the billing.
Why this matters for your agency
Clients don't like "surprise" bills. If you can show them a dashboard
(or a report) with their exact usage and cost, the trust level goes up
10x. It moves you from "freelancer with a script" to "professional AI
partner."
LLMeter is open-source (AGPL-3.0) and you can self-host it for free.
Check it out: llmeter.org
How are you billing your clients for LLM usage right now? Flat fee?
Pass-through? Or just eating the cost?