DEV Community

Whatsonyourmind

Your LLM Costs Spiked 400% Last Night — Here's How to Catch It in One API Call

You wake up Monday morning. Coffee in hand, you open your LLM provider's billing dashboard. The weekend total: $2,400. Your usual weekend spend is $600.

Somewhere between Friday at 11pm and Saturday at 3am, an agent hit a retry loop. Each retry included the full conversation context. Each retry was bigger than the last. A 400% cost spike. Nobody noticed because nobody was watching.

The fix took 5 minutes — a missing max_retries cap. The damage took 48 hours to discover.

This is the most expensive category of bug in AI-native applications. Not a logic error. Not a crash. A silent cost explosion that hides inside normal-looking logs until the invoice arrives.

You'd think monitoring would catch it. And it would — if you had monitoring. But proper observability means Datadog ($15/host/month), New Relic ($0.30/GB ingested), or a full Prometheus + Grafana stack that someone needs to maintain. For a team running a few LLM-powered features, that's like buying a fire truck to watch a candle.

Here's the thing: you don't need any of that. The math behind anomaly detection is old. Really old. The standard deviation dates to the 19th century, and Tukey's fences to 1977. The two techniques that catch 90% of cost spikes run in microseconds, and they can be wrapped in a single API call.

Let me show you.

Two Algorithms That Catch Almost Everything

There are two statistical methods that handle the vast majority of "did something weird happen in my numbers?" scenarios. They're different, and knowing when to use each one matters.

Z-Score: For Well-Behaved Data

The Z-score measures how far a data point is from the mean, expressed in standard deviations:

z = (x - mean) / standard_deviation

That's it. If your daily LLM cost averages $150 with a standard deviation of $20, and today's cost is $250, the Z-score is:

z = (250 - 150) / 20 = 5.0

A Z-score of 5.0 means the value is 5 standard deviations from normal. In a normal distribution, anything beyond 2-3 standard deviations is extremely unlikely (less than 0.3% probability at z > 3). You have an anomaly.

Best for: Costs, latency, throughput — any metric that clusters around a predictable average. If you plotted two weeks of your daily LLM spend and it looked roughly like a bell curve, Z-score is your tool.

Weakness: Z-score assumes your data is normally distributed. If your data is already skewed — say, you have occasional legitimate high-spend days — the mean and standard deviation get pulled toward the outliers, and real anomalies hide in the noise.
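
If you'd rather check the math yourself, or skip the network call entirely, a z-score detector fits in a few lines of JavaScript. This is a plain sketch, not OraClaw's implementation; it uses the sample standard deviation (n - 1), so exact numbers may differ slightly from any particular API's convention:

```javascript
// Flag every point whose z-score exceeds `threshold` standard deviations.
function zScoreAnomalies(data, threshold = 2.0) {
  const mean = data.reduce((a, b) => a + b, 0) / data.length;
  const variance =
    data.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (data.length - 1);
  const stdDev = Math.sqrt(variance);
  return data
    .map((value, index) => ({ index, value, zScore: (value - mean) / stdDev }))
    .filter((p) => Math.abs(p.zScore) > threshold);
}

// 14 days of daily costs with one bad day:
const daily = [142, 156, 138, 161, 145, 152, 139, 148, 155, 143, 612, 147, 151, 140];
console.log(zScoreAnomalies(daily));
// Only index 10 (the $612 day) is flagged, with a z-score of about 3.5.
```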

IQR: For Data With a Long Tail

The Interquartile Range method doesn't care about your data's shape. It works by looking at the middle 50%:

IQR = Q3 - Q1
Lower fence = Q1 - 1.5 * IQR
Upper fence = Q3 + 1.5 * IQR

Q1 is the 25th percentile. Q3 is the 75th percentile. Anything below the lower fence or above the upper fence is an anomaly.

The 1.5 multiplier is Tukey's original recommendation from 1977 — it corresponds roughly to +/- 2.7 standard deviations in normal data, catching about 0.7% of points as outliers.

Best for: Response times (they always have a long tail), batch sizes, error rates, token counts per request — anything where legitimate values occasionally spike but you still want to catch the truly abnormal ones.

More robust than Z-score because medians and quartiles aren't pulled by extreme values the way means and standard deviations are.
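
Computing the fences locally is just as simple. One caveat for small samples: quartile conventions differ between tools (linear interpolation, nearest rank, and so on), so the fences can shift slightly from one implementation to another. This sketch, with hypothetical function names, uses linear interpolation:

```javascript
// Tukey's fences: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
function quantile(sorted, q) {
  // Linear interpolation between the two nearest order statistics.
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function iqrAnomalies(data, k = 1.5) {
  const sorted = [...data].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - k * iqr;
  const upper = q3 + k * iqr;
  const anomalies = data
    .map((value, index) => ({ index, value }))
    .filter((p) => p.value < lower || p.value > upper);
  return { lower, upper, anomalies };
}
```

On the 14-day cost series used later in this post, the fences come out near $124 and $172, and only the $612 day lands outside them.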

The Decision Rule

If your data looks like a bell curve, use Z-score. If it has a long tail or you're not sure, use IQR.

When in doubt, run both. If they agree, you have high confidence. If only one flags an anomaly, investigate further.

Real Example: Catching a Cost Spike

Here's a working example. These are 14 days of daily LLM costs in dollars. One day had a problem:

curl -X POST https://oraclaw-api.onrender.com/api/v1/detect/anomaly \
  -H "Content-Type: application/json" \
  -d '{
    "data": [142, 156, 138, 161, 145, 152, 139, 148, 155, 143, 612, 147, 151, 140],
    "method": "zscore",
    "threshold": 2.0
  }'

Response:

{
  "method": "zscore",
  "anomalies": [{ "index": 10, "value": 612, "zScore": 3.5 }],
  "stats": { "mean": 180.6, "stdDev": 124.3, "threshold": 2.0 },
  "totalPoints": 14,
  "anomalyCount": 1
}

Let's walk through the output:

  • Day 11 (index 10, zero-indexed) cost $612
  • The mean across all 14 days is $180.60 (inflated by the spike itself)
  • Even with the spike pulling the mean up, $612 is still 3.5 standard deviations above it
  • Your actual baseline is around $148/day. Something went very wrong on day 11.

A Z-score of 3.5 corresponds to roughly a 0.02% chance of occurring naturally in a normal distribution. That's not variance. That's an incident.

You can swap "method": "zscore" for "method": "iqr" to use the IQR method instead — useful if your cost data has legitimate weekly patterns (higher on weekdays, lower on weekends) that make the distribution non-normal.

Building an Alert Pipeline in 10 Lines

Detection is only useful if it triggers an action. Here's a minimal alert pipeline — a cron job that checks daily costs and sends a Slack notification when something looks wrong:

// anomaly-alert.js — run via cron: 0 8 * * *
// Requires Node 18+ (built-in fetch, top-level await in ES modules).
// fetchDailyCosts and sendSlackAlert are your own helpers: one reads your
// provider's billing API, the other posts to a Slack incoming webhook.
const costs = await fetchDailyCosts(14); // last 14 days of daily cost totals
const res = await fetch("https://oraclaw-api.onrender.com/api/v1/detect/anomaly", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ data: costs, method: "zscore", threshold: 2.5 }),
});
const { anomalies, stats } = await res.json();
if (anomalies.length > 0) {
  await sendSlackAlert(`Cost anomaly detected: $${anomalies[0].value} ` +
    `(z-score: ${anomalies[0].zScore.toFixed(1)}, baseline: ~$${stats.mean.toFixed(0)})`);
}

That's it. No agents. No dashboards. No monthly SaaS bill. A cron job, one HTTP call, and a Slack webhook. You now have cost spike detection.

Set the threshold based on your tolerance: 2.0 catches more anomalies but includes some false positives — good for high-stakes environments where you'd rather investigate a false alarm than miss a real spike. 3.0 catches only extreme outliers — better for noisy data where daily fluctuations are normal. Start at 2.5 and adjust based on what you see in your first week.

A few practical notes for production use:

  • Window size matters. 14 days gives a solid baseline. Fewer than 7 data points and your statistics get unreliable. More than 30 and you start averaging over too much history, making seasonal shifts invisible.
  • Run both methods. If Z-score and IQR both flag the same point, that's a high-confidence anomaly. If only one flags it, it might be worth investigating but isn't urgent.
  • Include context in your alert. The raw Z-score or IQR deviation tells you how anomalous the value is, but your Slack message should also include what the normal range looks like, so whoever gets paged can immediately gauge severity.
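
The "run both methods" advice is easy to automate locally. Here's one hedged way to do it (the function names and severity labels are my own, not part of any API): treat a point as high confidence only when the z-score and IQR checks agree.

```javascript
// Consensus detection: "high" if both methods flag a point, "review" if only one does.
function zScoreFlags(data, threshold = 2.5) {
  const mean = data.reduce((a, b) => a + b, 0) / data.length;
  const stdDev = Math.sqrt(
    data.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (data.length - 1)
  );
  return new Set(data.flatMap((x, i) => (Math.abs((x - mean) / stdDev) > threshold ? [i] : [])));
}

function iqrFlags(data, k = 1.5) {
  const s = [...data].sort((a, b) => a - b);
  const q = (p) => {
    const pos = (s.length - 1) * p;
    const lo = Math.floor(pos);
    return s[lo] + (s[Math.ceil(pos)] - s[lo]) * (pos - lo);
  };
  const [q1, q3] = [q(0.25), q(0.75)];
  const iqr = q3 - q1;
  return new Set(data.flatMap((x, i) => (x < q1 - k * iqr || x > q3 + k * iqr ? [i] : [])));
}

function classify(data) {
  const z = zScoreFlags(data);
  const iqr = iqrFlags(data);
  return data
    .map((value, index) => ({
      index,
      value,
      severity:
        z.has(index) && iqr.has(index) ? "high" :
        z.has(index) || iqr.has(index) ? "review" : "ok",
    }))
    .filter((p) => p.severity !== "ok");
}
```

On the cost series from earlier, both detectors agree on the $612 day, so it comes back with severity "high".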

When You Need More

This approach handles the "did something weird happen?" question well. But there are cases where you need heavier tools:

  • Real-time streaming detection (sub-second) — look at Grafana's built-in anomaly detection or AWS CloudWatch Anomaly Detection
  • Time-series decomposition (separating trend, seasonality, residual) — Facebook's Prophet or statsmodels in Python
  • Multi-dimensional anomalies (cost is normal but latency + error rate together are weird) — PyOD, Isolation Forest, or a full observability platform

For "did my daily numbers do something weird?" — one API call is enough.

The Bottom Line

Z-score and IQR are textbook statistics: the z-score is 19th-century math, and Tukey's fences date to 1977. They work. They're fast. They're deterministic. They don't need training data, GPUs, or a machine learning pipeline.

You don't need a $500/month observability platform to know that $612 is not normal when your average is $148. You need arithmetic and a threshold.

The OraClaw anomaly detection endpoint wraps both Z-score and IQR into a single API call — free, open source, no API key required. The /detect/anomaly route is one of 18 endpoints in a broader decision intelligence toolkit, but it stands alone for this exact use case.

Stop discovering cost spikes from invoices. Start discovering them from alerts.


Found this useful? The OraClaw API includes anomaly detection, Monte Carlo simulation, Bayesian inference, and 15 other decision endpoints. Star the repo or try the live API at oraclaw-api.onrender.com.
