You deploy an AI feature to production, everything looks fine in testing, and three days later you get a surprise API bill that makes you question your life choices. I've seen it happen. More than once. And the worst part is it's almost never a big obvious bug. It's a quiet failure mode that compounds silently.
Here's the pattern that causes it, how to spot it before it hits your wallet, and the guardrails I now put on every AI pipeline I build.
The Bug: Runaway Retries in an AI Agent Loop
Suppose you build an AI-powered job description enrichment pipeline. The flow is simple: take a raw job listing from an ATS, send it to GPT-4 with a structured prompt, and get back a cleaned, keyword-optimized description.
The problem hides in the error handling. A naive retry pattern looks like this:
async function enrichJobDescription(rawText: string, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: "gpt-4",
        messages: [
          { role: "system", content: ENRICHMENT_PROMPT },
          { role: "user", content: rawText }
        ],
        max_tokens: 2000
      });
      return response.choices[0].message.content;
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed:`, error);
      // Retry immediately
    }
  }
  throw new Error("All retries exhausted");
}
See the problem? No backoff. No circuit breaker. No cost-awareness. When the OpenAI API starts returning 429 rate-limit errors during a traffic spike, the retries fire instantly and keep hammering an API that is already telling you to slow down. A rejected 429 request is generally not billed, but the same loop retries every other failure too, including timeouts where the request may already have been processed and charged. And if you process jobs in batches of 100 with 5 attempts each, a single rate-limit event can trigger up to 500 requests in under a minute.
But that's not the worst failure mode.
The Real Culprit: Unbounded Token Usage from an Oversized Input
The real cost explosion happens when a single bad input slips through. Picture a job listing that contains 80,000 characters of raw HTML embedded in the description field. The ATS scraped the listing badly and dumped the entire page markup into the job description.
Your prompt doesn't truncate input. It sends the full 80,000 characters (roughly 20,000 tokens) to GPT-4 with max_tokens: 2000. On a model whose context window is large enough to accept it, all of those input tokens are processed and charged at the input rate. Whenever a call then fails after processing has started (a timeout, for example), the retry loop can resend the same payload up to 5 times per batch cycle. One bad record, processed repeatedly across multiple batch cycles, can eat through a significant portion of your monthly API budget faster than anything else in your system.
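To make the arithmetic concrete, here is a rough sketch of what one oversized record can cost. The price is an assumed illustrative rate, not a quoted one (check your provider's current pricing), and `estimateWorstCaseInputCost` is a hypothetical helper. It uses a rough 4-characters-per-token heuristic and assumes every attempt is billed, which is the worst case.

```typescript
// Hypothetical helper: worst-case input cost for one record if every attempt is billed.
// All numbers passed in are assumptions for illustration, not real pricing.
function estimateWorstCaseInputCost(
  inputChars: number,
  attempts: number,
  usdPerThousandInputTokens: number
): number {
  const inputTokens = Math.ceil(inputChars / 4); // rough heuristic: ~4 chars per token
  return (inputTokens / 1000) * usdPerThousandInputTokens * attempts;
}

// 80,000 characters is roughly 20,000 tokens.
// Assumed rate of $0.03 per 1K input tokens, 5 attempts:
const worstCase = estimateWorstCaseInputCost(80000, 5, 0.03);
console.log(worstCase.toFixed(2)); // prints "3.00"
```

Under these assumptions that is about $3 per batch cycle for a single record, and the total grows linearly with every cycle that picks the record up again.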
The failure is systemic. You have no guardrails.
The Fix: Token Budgeting Before the API Call
The first fix is obvious once you see the pattern. Truncate input before it ever reaches the model. Add a simple token estimator and a hard cutoff:
function estimateTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English text
  return Math.ceil(text.length / 4);
}

function truncateToTokenBudget(text: string, maxInputTokens: number): string {
  const estimatedTokens = estimateTokens(text);
  if (estimatedTokens <= maxInputTokens) return text;
  // Cut to the approximate character budget; the appended marker adds a few tokens on top.
  // This does not reserve room for the system prompt, so budget for that separately.
  const charBudget = maxInputTokens * 4;
  return text.slice(0, charBudget) + "\n\n[Content truncated due to length]";
}

async function safeEnrichJobDescription(rawText: string) {
  const truncated = truncateToTokenBudget(rawText, 3000);
  // Now the API call with predictable cost
}
This is a five-line fix that prevents an entire class of problem. But it's not enough on its own.
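The `3000` above is a magic number. One way to derive it, sketched here under assumptions: start from the model's context window, subtract the system prompt and the tokens reserved for output, and give the remainder to user input. `inputBudget` and `truncateWithMarker` are hypothetical helpers, the 8192-token window is an assumed figure to replace with your model's real limit, and the 4-characters-per-token estimate is a heuristic that tends to undercount for HTML or non-English text.

```typescript
const CHARS_PER_TOKEN = 4; // rough heuristic, not a real tokenizer
const TRUNCATION_MARKER = "\n\n[Content truncated due to length]";

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Tokens left for user input after the system prompt and the reserved output.
function inputBudget(contextWindow: number, systemPrompt: string, maxOutputTokens: number): number {
  return Math.max(0, contextWindow - estimateTokens(systemPrompt) - maxOutputTokens);
}

// Truncate so that the result, marker included, never exceeds the character budget.
function truncateWithMarker(text: string, maxInputTokens: number): string {
  const charBudget = maxInputTokens * CHARS_PER_TOKEN;
  if (text.length <= charBudget) return text;
  // Budget too small to fit the marker at all: hard cut without it.
  if (charBudget < TRUNCATION_MARKER.length) return text.slice(0, charBudget);
  return text.slice(0, charBudget - TRUNCATION_MARKER.length) + TRUNCATION_MARKER;
}

// Usage with assumed numbers: 8192-token window, 2000 tokens reserved for output.
const budget = inputBudget(8192, "You clean up job descriptions.", 2000);
const safeInput = truncateWithMarker("x".repeat(80000), budget);
```

Deriving the budget this way keeps the input limit in step with the prompt: if the system prompt grows, the room left for user text shrinks automatically instead of silently overflowing.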
Rate Limiting and Cost-Aware Circuit Breakers
The retry pattern needs a complete redesign. Implement exponential backoff with jitter, plus a cost-aware guard (the simplest form of a circuit breaker) that refuses to make a call when its estimated cost exceeds what is left of the daily budget:
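const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function costAwareLLMCall(prompt: string, options: {
  maxRetries: number;
  maxInputTokens: number;
  costPerToken: number;
  dailyBudgetRemaining: number;
}) {
  // Worst-case input cost; output tokens are not included in this estimate
  const estimatedCost = options.maxInputTokens * options.costPerToken;
  if (estimatedCost > options.dailyBudgetRemaining) {
    throw new Error("Daily budget exhausted for this pipeline");
  }

  let delay = 1000;
  for (let attempt = 0; attempt < options.maxRetries; attempt++) {
    try {
      return await openai.chat.completions.create({ /* ... */ });
    } catch (error) {
      // `error` is `unknown` in strict TypeScript, so narrow before reading `status`
      const status = (error as { status?: number } | null)?.status;
      if (status === 429) {
        await sleep(delay + Math.random() * 1000); // Jitter
        delay *= 2; // Exponential backoff
        continue;
      }
      throw error; // Non-rate-limit errors fail fast
    }
  }
  // Without this, the function would silently return undefined after the last 429
  throw new Error("All retries exhausted");
}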
The key insight: treat every LLM call like a financial transaction. Know the cost before you execute. If the estimated cost exceeds your remaining budget, fail fast and log it. Don't let a runaway loop burn through your month's API spend in an afternoon.
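The function above checks the budget once per call. To halt a whole batch when spend crosses a threshold, the breaker needs state that survives across calls. A minimal in-memory sketch follows; `SpendCircuitBreaker` and `processBatch` are hypothetical names, and a real pipeline would likely keep the running total somewhere shared (a database or cache) rather than in process memory.

```typescript
// Hypothetical in-memory breaker: trips once recorded spend reaches the limit.
class SpendCircuitBreaker {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Record the cost of a completed call.
  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  // True once spend has reached the limit; callers should stop sending work.
  isOpen(): boolean {
    return this.spentUsd >= this.limitUsd;
  }

  remainingUsd(): number {
    return Math.max(0, this.limitUsd - this.spentUsd);
  }
}

// Process items in order until the breaker trips; everything after that is skipped.
async function processBatch<T>(
  items: T[],
  breaker: SpendCircuitBreaker,
  handle: (item: T) => Promise<number> // resolves to the call's cost in USD
): Promise<{ processed: T[]; skipped: T[] }> {
  const processed: T[] = [];
  const skipped: T[] = [];
  for (const item of items) {
    if (breaker.isOpen()) {
      skipped.push(item);
      continue;
    }
    breaker.record(await handle(item));
    processed.push(item);
  }
  return { processed, skipped };
}
```

Returning the skipped items, rather than throwing, lets the caller re-queue them for the next budget window and send one alert for the whole batch.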
Observability: What You Don't Measure Will Cost You
The reason a bug like this runs for days without anyone noticing is simple: nobody is watching. Sentry for error tracking and LogRocket for session replay are great, but neither alerts on cost anomalies by default.
Add three things.
First, a per-request cost log. Every LLM call logs its model, input tokens, output tokens, latency, and estimated cost:
async function loggedLLMCall(params) {
  // `route` is our own label, not an API parameter, so strip it before the call
  const { route, ...request } = params;
  const start = Date.now();
  const response = await openai.chat.completions.create(request);
  const duration = Date.now() - start;
  await db.collection("llm_calls").insertOne({
    model: request.model,
    // `usage` can be absent on some responses, so fall back to zero
    inputTokens: response.usage?.prompt_tokens ?? 0,
    outputTokens: response.usage?.completion_tokens ?? 0,
    estimatedCost: response.usage ? calculateCost(response.usage) : 0,
    duration,
    timestamp: new Date(),
    route: route || "unknown"
  });
  return response;
}
Second, a daily cost budget check that runs before batch processing starts. If spend has already reached a set fraction of the daily budget (80%, say), pause the pipeline and send an alert.
Third, a Sentry alert that fires when any single LLM call exceeds a cost threshold. That catches the unusually large input before it multiplies through retries.
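The logging snippet calls `calculateCost` without defining it, and the second and third checks are described only in prose. One possible shape for all three, as a sketch: here `calculateCost` takes pricing as an explicit second argument, the `Pricing` values are placeholders to fill in from your provider's current price list, the 80% pause threshold is an arbitrary example, and `shouldPauseBatch` and `isCostAnomaly` are hypothetical names. Wiring the results into Sentry or another alerting tool is left out, since that depends on your setup.

```typescript
// Token counts, named the way the Chat Completions response reports them.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

// Placeholder pricing; fill in from your provider's current price list.
interface Pricing {
  usdPerThousandInputTokens: number;
  usdPerThousandOutputTokens: number;
}

// Estimated cost of one call: input and output tokens are priced separately.
function calculateCost(usage: Usage, pricing: Pricing): number {
  return (
    (usage.prompt_tokens / 1000) * pricing.usdPerThousandInputTokens +
    (usage.completion_tokens / 1000) * pricing.usdPerThousandOutputTokens
  );
}

// Second check: pause before a batch if spend has reached a fraction of the daily budget.
function shouldPauseBatch(
  spentTodayUsd: number,
  dailyBudgetUsd: number,
  pauseAtFraction = 0.8 // arbitrary example threshold
): boolean {
  return spentTodayUsd >= dailyBudgetUsd * pauseAtFraction;
}

// Third check: flag any single call whose cost exceeds a per-call threshold.
function isCostAnomaly(callCostUsd: number, perCallThresholdUsd: number): boolean {
  return callCostUsd > perCallThresholdUsd;
}
```

Keeping these as pure functions makes the thresholds easy to unit test and lets the same checks run both inline (before a batch) and offline (over the `llm_calls` log).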
The Pattern That Matters
The mistake isn't using AI in production. It's treating LLM calls like regular API calls. A database query that costs a fraction of a cent to run 10,000 times is fine. An LLM call that costs significantly more per invocation and is designed to run 10,000 times is a cost explosion waiting to happen.
The patterns I now apply to every AI pipeline I build:
- Token budget first, prompt engineering second. Always truncate input before the API call. Know your max cost before you send a single token.
- Exponential backoff with jitter, not naive retries. A rate-limit error means slow down, not try again immediately.
- Cost-aware circuit breakers. If a pipeline has spent its daily budget, stop processing and alert. Don't let one bad record burn through your margin.
- Per-request cost logging. You can't optimize what you don't measure. Log every LLM call's token usage and estimated cost.
If your team is shipping AI features without cost guardrails, you're one bad input away from a surprise bill. That's the kind of thing I help engineering teams fix before it hits production. Happy to compare notes on what's worked for your stack.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.