You're watching your OpenAI bill climb and the numbers don't add up. You've been careful — short prompts, reasonable max_tokens, no runaway loops. But the usage dashboard tells a different story.
If you're routing API calls through any kind of middleware, proxy, or wrapper tool, there's a real possibility that extra tokens are being consumed without your knowledge. I've seen this firsthand, and it's more common than you'd think.
The Problem: Invisible Token Overhead
When you send a prompt directly to an LLM API, what you see is what you get. But the moment you introduce a layer between your code and the API — a caching proxy, a prompt router, an AI gateway — that layer can modify your requests before they reach the provider.
Common ways this happens:
- System prompt injection: The proxy prepends its own system instructions to improve response quality or enforce guardrails
- Telemetry and feedback loops: Your prompts and completions get sent to a secondary model for evaluation, quality scoring, or fine-tuning data collection
- Prompt rewriting: The tool "enhances" your prompt with additional context, few-shot examples, or chain-of-thought instructions
- Shadow requests: Your request gets duplicated to a second model for A/B testing or fallback purposes
The frustrating part? Most of these are technically documented somewhere in the tool's docs or source code, but almost nobody reads every line before deploying.
How to Catch It: Comparing Expected vs. Actual Usage
The debugging process is straightforward. You need to compare what you think you're sending with what actually hits the API.
Step 1: Log Your Outbound Requests
Before anything else, intercept the actual HTTP requests leaving your infrastructure. If you're using Python with the OpenAI SDK, you can do this with a simple wrapper:
```python
import openai
import logging

logger = logging.getLogger("llm_audit")

def audited_completion(**kwargs):
    # Log the exact payload before it ships
    prompt_tokens_estimate = sum(
        len(m["content"]) // 4  # rough estimate: ~4 chars per token
        for m in kwargs.get("messages", [])
    )
    logger.info(f"Outbound request: ~{prompt_tokens_estimate} tokens estimated")
    logger.info(f"Messages count: {len(kwargs.get('messages', []))}")

    response = openai.chat.completions.create(**kwargs)

    # Log what the API actually reports
    usage = response.usage
    logger.info(f"API reported: {usage.prompt_tokens} prompt, {usage.completion_tokens} completion")

    # Flag significant discrepancies
    if usage.prompt_tokens > prompt_tokens_estimate * 1.5:
        logger.warning(
            f"TOKEN MISMATCH: estimated {prompt_tokens_estimate}, "
            f"API reported {usage.prompt_tokens} prompt tokens"
        )
    return response
```
If there's a big gap between your estimate and what the API reports, something is adding tokens between you and the provider.
Step 2: Use a Man-in-the-Middle Proxy to Inspect Traffic
For a more thorough inspection, route your traffic through mitmproxy and examine the raw payloads:
```shell
# Install mitmproxy
pip install mitmproxy

# Start the proxy (with its web UI) on port 8080
mitmweb --listen-port 8080
```
Then configure your HTTP client to route through it:
```python
import httpx
import openai

# Point your OpenAI client through mitmproxy
client = openai.OpenAI(
    http_client=httpx.Client(
        proxy="http://127.0.0.1:8080",
        verify=False,  # only for local debugging, obviously
    )
)
```
Now open the mitmweb UI in your browser and inspect every request. Look at the actual JSON body being sent to the API. Are there extra messages in the messages array that you didn't put there? Is the system prompt longer than what you wrote?
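To make that comparison mechanical instead of eyeball-based, you can diff the payload you captured against the one your code built. A minimal sketch — the helper name and example payloads are mine, not from any library:

```python
def find_injected_messages(sent_by_you, captured_at_api):
    """Return any messages in the captured payload that you didn't write."""
    yours = {(m["role"], m["content"]) for m in sent_by_you}
    return [m for m in captured_at_api if (m["role"], m["content"]) not in yours]

# Example: a gateway prepended its own system prompt
mine = [{"role": "user", "content": "Summarize this document."}]
captured = [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Summarize this document."},
]
extras = find_injected_messages(mine, captured)
# extras -> the system message you never wrote
```

Anything in `extras` is overhead someone else put on your bill.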
Step 3: Check for Parallel or Follow-Up Requests
Some tools make additional API calls you never asked for. In mitmproxy, filter for all requests to api.openai.com (or your provider's endpoint) and count them. If you made 10 calls in your code but see 20 requests in the proxy, that's your answer.
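Once you export the captured flows, a quick tally makes shadow requests obvious. A sketch assuming you have dumped `(method, url)` pairs from the proxy (the hostnames here are invented for illustration):

```python
from collections import Counter

def tally_endpoints(flows):
    """Count observed requests per URL from (method, url) pairs."""
    return Counter(url for _method, url in flows)

# Hypothetical capture: your code made 2 completion calls,
# but the proxy observed 3, plus one to a host you never configured
flows = [
    ("POST", "https://api.openai.com/v1/chat/completions"),
    ("POST", "https://api.openai.com/v1/chat/completions"),
    ("POST", "https://api.openai.com/v1/chat/completions"),
    ("POST", "https://eval.some-gateway.example/v1/score"),
]
counts = tally_endpoints(flows)
print(counts)
```

The extra completions call and the unfamiliar eval endpoint are your leads.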
You can also check this at the provider level. OpenAI's usage dashboard breaks down requests by time, and most providers expose a usage API:
```python
import os
from datetime import date

import requests

# Check your actual API usage for today.
# Note: the usage API has changed over time — check your provider's
# current docs for the exact endpoint and parameters your account supports.
headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
}
resp = requests.get(
    "https://api.openai.com/v1/usage",
    headers=headers,
    params={"date": date.today().isoformat()},
)
# Compare this breakdown against your application-level logs
print(resp.json())
```
If the provider-reported usage significantly exceeds your application logs, there's a leak somewhere.
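One way to pin down "significantly exceeds" is a simple reconciliation check between your logs and the provider's numbers. The helper and the 5% tolerance are my own choices — tune them to your traffic:

```python
def unexplained_tokens(app_logged, provider_reported, tolerance=0.05):
    """Return the token gap if the provider's figure exceeds your logs
    by more than the tolerance fraction, else 0."""
    gap = provider_reported - app_logged
    return gap if gap > app_logged * tolerance else 0

# Your logs say 1.0M tokens this week; the dashboard says 1.3M
leak = unexplained_tokens(1_000_000, 1_300_000)
# leak == 300000 -> something upstream is spending ~30% extra
```

A nonzero result is your cue to fire up mitmproxy and find out who is spending the difference.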
Common Culprits and How to Fix Them
Prompt Caching Layers That Phone Home
Some caching proxies send a subset of your queries to a feedback endpoint. This is sometimes disclosed in the terms of service but not in the technical docs. Fix: read the source code if it's open source, or capture traffic for a day and audit it.
Gateway-Injected System Prompts
AI gateways often inject their own system prompt for safety or routing purposes. This might add 200-500 tokens per request. Doesn't sound like much until you're making thousands of calls a day.
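Back-of-envelope arithmetic, with made-up but realistic figures, shows how fast that overhead compounds:

```python
# Hypothetical figures: 300 injected tokens per request,
# 5,000 requests/day, $2.50 per million input tokens
injected_per_request = 300
requests_per_day = 5_000
price_per_million = 2.50

monthly_tokens = injected_per_request * requests_per_day * 30  # 45,000,000
monthly_cost = monthly_tokens / 1_000_000 * price_per_million  # $112.50
print(f"Injected overhead: ~${monthly_cost:.2f}/month")
```

That's real money for a system prompt you never asked for.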
Fix: Compare the messages array in your code vs. what arrives at the API:
```python
# What you think you're sending
messages = [
    {"role": "user", "content": "Summarize this document."}
]

# What actually arrives at the API (captured via mitmproxy)
# [
#   {"role": "system", "content": "You are a helpful assistant. Always
#    respond in a structured format. Report quality issues to..."},
#   {"role": "user", "content": "Summarize this document."}
# ]
#
# That system message? You didn't write it.
```
Fine-Tuning Data Collection
This is the one that really gets people. Some tools collect your prompt-completion pairs to improve their own models or services. Your prompts become training data, and you're paying for the API calls that generate it. Check the privacy policy and terms, but also verify by monitoring for requests to endpoints you don't recognize.
Prevention: A Checklist
- Always log `usage` from API responses and compare against your expected token counts. Set up alerts for discrepancies above a threshold (I use 30%).
- Audit network traffic periodically. Run mitmproxy for an hour once a month and review where your requests actually go.
- Read the source code of any open-source proxy or gateway you deploy. Grep for things like `system`, `prepend`, `inject`, `telemetry`, `feedback`, or `analytics` in the codebase.
- Use API keys with minimal scope. If your proxy only needs to call chat completions, don't give it a key that can also access fine-tuning or file uploads.
- Set hard budget limits at the provider level. OpenAI lets you set monthly spend caps. Use them.
- Pin your dependencies. A proxy tool that's clean in v1.2 might start collecting telemetry in v1.3. Review changelogs before upgrading.
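The grep step above can also be scripted in Python if you want to run it in CI before every upgrade. A sketch — the red-flag identifiers and the `*.py` glob are just a starting point, widen them for your stack:

```python
import re
from pathlib import Path

RED_FLAGS = re.compile(r"prepend|inject|telemetry|feedback|analytics|system_prompt")

def scan_source(root):
    """Return (file, line_number, line) for every red-flag hit under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if RED_FLAGS.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Run it against each new release of the proxy and diff the hits against the previous version — a clean v1.2 that suddenly grows `telemetry` calls in v1.3 will show up immediately.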
The Bigger Picture
This isn't necessarily malicious. Many of these tools inject tokens to genuinely improve the experience — better system prompts, safety guardrails, smarter routing. The problem is transparency. If a tool is consuming your API credits to do work you didn't ask for, you should know about it upfront.
Before adopting any LLM middleware, ask three questions: Does it modify my prompts? Does it make additional API calls? Does it send my data anywhere else? If the docs don't answer these clearly, the traffic logs will.
The good news is that this is entirely debuggable. HTTP requests aren't magic. Capture them, count them, and compare. The numbers either add up or they don't.