Disclosure: I'm a senior backend tech lead and I run HostingGuru, where built-in AI monitoring is the feature I'm proudest of. This article mentions HostingGuru in one section near the end, but the patterns and detection methods below work on any platform. I want this useful even if you never become a customer.
The cleanest version of this story is one I keep hearing from founders, with small variations each time:
"We woke up to a $2,400 OpenAI bill. The product still works. Sentry is green. Our error rate is normal. We have no idea what happened."
Then they dig in. They find a webhook handler that's been retrying a Stripe event for 11 days because a key was rotated and the retry logic capped out at "30 minutes between attempts" instead of "stop after 24 hours." Each retry calls an LLM to summarize the event for an internal log. 11 days × 48 retries a day × 8K tokens a call, at roughly $0.04 per 1K tokens. The math is unforgiving.
Or they find an agent that's been self-triggering. Or a context window that quietly grew from 4K to 80K tokens because nobody noticed a bug stuffing the entire conversation history into every prompt. Or a cron job that runs at 3am and produces output nobody reads, but produces it via Claude Sonnet at $3 per million input tokens.
This is the 2026 version of a problem that used to be small. AI made it expensive.
I want to walk you through the five patterns I see most often, why they're invisible to traditional monitoring, and what you can actually do about them tonight.
Why this is harder than it used to be
Pre-AI, a runaway loop in your app was annoying. It maxed out a CPU, your alerting noticed the CPU pegged, you got paged, you fixed it. Total damage: a few hours of degraded service, maybe a small AWS bill bump.
Post-AI, a runaway loop is expensive. Each iteration calls an LLM. Each LLM call costs real money. Worst of all, the loop doesn't show up as a problem in any of your existing tools:
- Sentry: aggregates errors by fingerprint. Same retry loop = "1 issue, +850 events." It looks like one bug, not 850 firings of a runaway loop.
- CloudWatch / Datadog: traffic and CPU look fine — a retry loop is just a steady stream of requests.
- Stripe / your billing dashboard: shows charges after they happen, on a 24-48h delay.
- Your inbox: silent. The OpenAI / Anthropic / Stripe APIs don't email you when one customer is making 50,000 calls an hour.
The first signal you usually get is the credit card alert from your bank. By then, you're $1,000+ in.
Pattern 1: The infinite retry loop
The classic. A background job hits a transient error, your retry logic backs off exponentially, eventually retries every 30 minutes, but never gives up. The underlying issue is permanent: a webhook secret was rotated, an API key was deactivated, a file path was renamed. The job will retry forever.
If the job involves an LLM call (summarizing the error, deciding next action, generating a fallback response), every retry costs tokens. Multiply by however long until someone notices.
Real example I saw last month: a B2B SaaS doing email parsing. Their email parser used GPT-4 to extract structured data. One specific email format consistently failed validation downstream. The retry queue kept retrying. 11,000 emails × 6 retries × $0.10 per call = $6,600 wasted before the founder noticed.
How to detect it tonight: query your job queue (BullMQ, Sidekiq, Celery, whatever) for jobs that have been "active" or "failed" for more than 24 hours. Set a hard cap: any job that retries more than 10 times gets paged or dropped, no exceptions.
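Here's a minimal sketch of that hard cap, assuming Celery (BullMQ's `attempts` option and Sidekiq's `retry:` setting do the same job). The task name and `alert_oncall` are placeholders for your own code:

```python
from celery import Celery
from celery.signals import task_failure

app = Celery("jobs", broker="redis://localhost:6379/0")

def alert_oncall(message: str) -> None:
    print(message)  # placeholder: wire this to your pager / Telegram / Slack

@app.task(
    bind=True,
    autoretry_for=(ConnectionError, TimeoutError),  # retry transient errors only
    retry_backoff=True,        # exponential backoff between attempts
    retry_backoff_max=1800,    # never wait more than 30 minutes
    max_retries=10,            # the hard cap: after 10 attempts, fail for good
)
def summarize_event(self, event_id: str) -> None:
    ...  # the job that would otherwise loop forever

@task_failure.connect
def page_on_final_failure(sender=None, exception=None, **kwargs):
    # Fires when a task fails for good, including after retries are exhausted:
    # page a human instead of failing silently.
    alert_oncall(f"{sender.name} gave up after {sender.max_retries} retries: {exception!r}")
```

The key detail is that `max_retries` bounds the total attempts while `retry_backoff_max` only bounds the gap between them. The "retry every 30 minutes forever" failure mode is exactly what you get when you set the second without the first.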
Pattern 2: The self-triggering agent
Multi-agent systems are particularly good at this one. Agent A produces output. Agent B reads agent A's output and decides "I should ping agent A for clarification." Agent A produces a clarification. Agent B reads it and decides "I should clarify the clarification." The conversation continues until you run out of context or money — whichever comes first.
I saw this kill a YC startup's monthly budget in 14 hours. They'd shipped a "research assistant" that orchestrated three agents. A user typed an ambiguous query. The agents started clarifying each other. By the time the user's session timed out, the system had made 4,200 LLM calls.
How to detect it tonight: hard-cap your multi-agent loops at 10 turns. After 10 back-and-forth iterations, the system returns whatever it has and exits. If you're using LangChain or similar, this is one config flag. If you've written your own orchestration, it's three lines of code.
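If you've rolled your own orchestration, the cap really is tiny. A sketch with a hypothetical `call_agent` standing in for your LLM calls (in LangChain, the equivalent knob is `AgentExecutor`'s `max_iterations` parameter):

```python
MAX_TURNS = 10  # hard ceiling on agent back-and-forth

def call_agent(name: str, prompt: str) -> str:
    # Placeholder for your real LLM call; returns "DONE" here so the
    # sketch runs standalone.
    return f"DONE ({name}): {prompt[:40]}"

def run_agents(query: str) -> str:
    message = query
    reply = ""
    for _ in range(MAX_TURNS):
        reply = call_agent("researcher", message)
        critique = call_agent("reviewer", reply)
        if "DONE" in critique:   # whatever your real stop condition is
            return reply
        message = critique       # keep clarifying, but never past the cap
    return reply  # cap hit: return the best answer so far, stop burning tokens
```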
Pattern 3: The "fingerprint aggregation" blind spot
This is the most insidious one because it specifically defeats Sentry / Bugsnag / Honeybadger.
Error monitoring tools group errors by fingerprint (basically a hash of the stack trace + error message). Same fingerprint = "this is the same bug." The dashboard shows "+842 events on this issue" with a slowly incrementing counter.
The problem: a retry loop firing the same error 800 times looks identical to 800 different users hitting the same bug once. Your error tool can't tell them apart. Both show up as "+800 events on the same issue." If you're not specifically watching event-rate per fingerprint, you'll miss the loop entirely.
The default Sentry alerts trigger on new issues, not on suddenly-very-noisy existing issues. So a bug that's been silently looping at 50/sec for 6 hours doesn't trip any alerts.
How to detect it tonight: add a custom Sentry alert on "any single issue with > 100 events per hour." Most teams forget this exists. It's the alert that catches the silent loops.
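For the record, since people ask where this lives: in Sentry it's an issue alert, not a metric alert. Alerts → Create Alert → Issue Alert, with the condition "the issue is seen more than 100 times in one hour." Exact menu names drift between Sentry versions, but the condition has been there for years. Route it somewhere a human actually looks.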
Pattern 4: The context window that quietly grew
Here's how this happens: you ship an AI feature with a 4K-token context window. Works fine in dev. In prod, a customer accumulates a long conversation history. Your code (or worse, Claude's code from when it built the feature) appends the entire conversation history to every new prompt without truncation.
Six months later, that customer has a 60K-token conversation. Every interaction now costs 15× what it did at launch. Multiplied across all your power users, you've quietly 5x'd your per-user AI cost without noticing — because the increase is gradual and the dashboard just shows "monthly OpenAI bill went up."
How to detect it tonight: log the input token count of every LLM call (most SDKs return this). Plot the p95 input token count over time. If it's trending up, you have context bloat. The fix is usually a sliding window or a summarization step.
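A minimal sketch using the OpenAI Python SDK, which returns usage counts on every response. Here `log_metric` is a placeholder for whatever metrics pipeline you have (StatsD, CloudWatch, even a database table), and the 20-message window is an arbitrary example, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()

def log_metric(name: str, value: int, **tags) -> None:
    print(name, value, tags)  # placeholder: ship this to your metrics store

def truncate(messages: list[dict], max_messages: int = 20) -> list[dict]:
    # Crude sliding window: keep the system prompt plus the last N messages,
    # instead of appending the whole history to every prompt.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]

def chat(messages: list[dict], user_id: str, model: str = "gpt-4o") -> str:
    resp = client.chat.completions.create(model=model, messages=truncate(messages))
    # These two numbers are what you plot: p95 input tokens trending up = context bloat.
    log_metric("llm.input_tokens", resp.usage.prompt_tokens, user=user_id, model=model)
    log_metric("llm.output_tokens", resp.usage.completion_tokens, user=user_id, model=model)
    return resp.choices[0].message.content
```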
This is also where I see the most "Claude Code did this and now I owe $400" stories. Claude is generous with context — it'll happily concatenate everything if you don't tell it not to.
Pattern 5: The cron job that never reads its output
Less dramatic, more common: a `0 3 * * *` cron job kicks off every night at 3am. It runs an analysis. It generates a report. It writes the report to a database table or an S3 bucket. Nobody reads the report.
It was useful when you built it last year. Then the team member who used it left. Then the report went stale. Then it went wrong. But the cron keeps running every night, calling the LLM, eating tokens. Quietly.
How to detect it tonight: list every cron job in your system. For each one, ask: "if this stopped running tomorrow, would anyone notice within 7 days?" If the answer is no, kill it. (You can always add it back if someone complains.)
What "good monitoring" looks like for AI apps
Traditional monitoring (Sentry, Datadog, CloudWatch) is great at finding errors. It's bad at finding patterns.
The patterns above all share two properties:
- They're not errors. They're successful behavior at high volume.
- They don't trigger alerts. Each individual call looks fine. Only the aggregate rate is wrong.
What you actually need is a layer that watches behavior, not errors. Some signals worth tracking (a sketch of the first one follows the list):
- Token usage per user per day (spike = investigation trigger)
- LLM call rate per service (steady ≠ healthy if it's been steady for 18 hours unattended)
- Job queue length over time (growing slowly = retry loop accumulating)
- Per-fingerprint event rate (the Sentry blind spot above)
- Cost per active user (rising = something's bloating somewhere)
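To make the first signal concrete, here's a minimal sketch of a per-user daily token spike check. It assumes you've been logging LLM calls as `(user_id, day, input_tokens)` rows somewhere; the 3× threshold is illustrative, not gospel:

```python
from collections import defaultdict
from statistics import median

def find_token_spikes(events, spike_factor=3.0):
    """events: iterable of (user_id, day, input_tokens) tuples.
    Returns users whose most recent day exceeds spike_factor x their
    median daily usage on previous days."""
    per_user_day = defaultdict(lambda: defaultdict(int))
    for user_id, day, tokens in events:
        per_user_day[user_id][day] += tokens

    spikes = []
    for user_id, days in per_user_day.items():
        ordered = [days[d] for d in sorted(days)]
        if len(ordered) < 2:
            continue  # not enough history to establish a baseline
        baseline = median(ordered[:-1])
        if baseline and ordered[-1] > spike_factor * baseline:
            spikes.append((user_id, ordered[-1], baseline))
    return spikes
```

Run it daily against the token logs from the previous section and page yourself on any non-empty result.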
You can build this yourself. It takes about 2 weeks of work for a backend engineer. You can also use a platform that has it built in, which is what I want to be honest about now.
What I built (and why I built it)
I'm a senior backend tech lead. I've shipped production systems for BeReal, Oney, Ringover. I built HostingGuru because the gap between "Sentry tells me when something errors" and "I get a Telegram ping at 3am that says 'this Stripe webhook handler has retried 200 times in the last hour, here's the link to the logs'" was the gap I kept finding myself filling manually for clients.
HostingGuru's AI monitoring tails your production logs and alerts on patterns, not errors:
- Retry loops: the same operation fires faster than expected, regardless of error rate
- Token spikes: a user's per-day LLM cost jumps significantly
- Hot fingerprints: one Sentry-style issue suddenly explodes in event rate
- Anomalous response times: p95 latency jumps without an obvious traffic cause
- Silent cron failures: a job that ran consistently for 30 days suddenly stops
Alerts go to Telegram by default — because that's where founders actually look at 3am. (Email and Slack also supported.)
It works on any app deployed to HostingGuru, on any of the 14+ frameworks we support. The alerts and pattern detection are part of the platform — no extra config, no extra subscription.
If you've ever woken up to a surprise bill, this is the layer that would have caught it before it happened.
What to do tonight, regardless of which platform you use
You don't need to switch hosts to catch most of these. Five concrete moves:
- Run a query on your job queue for jobs retrying more than 10 times. Cancel them.
- Cap your multi-agent loops at 10 turns in code. One commit.
- Add a Sentry alert on "any single issue with > 100 events per hour."
- Log token counts on every LLM call and watch the p95 input-token trend. If it's rising, fix your context truncation.
- List every cron job and kill any whose output nobody reads.
These five moves take an evening. They prevent the vast majority of the surprise-bill stories I hear. Whether you do them on HostingGuru, Render, Railway, AWS, or your own VPS, you should do them.
The harder truth
The hardest part of running an AI-powered product in 2026 isn't building it. AI tools made building it 10x cheaper. The hard part is operating it — knowing what's running, what it's costing, what's broken in a way that doesn't show up as broken.
The cost of a runaway loop went from "annoying" to "expensive" the moment AI became a per-call API charge. The tools we use to monitor production didn't get the memo. Sentry was designed in a world where errors were the primary problem; it's still the best at that, but it's not the right tool for "your tokens are leaking somewhere."
Until that gap closes across the whole industry, you have to build it yourself or use a platform that has it built in. Either path is fine. The one path that doesn't end well is "we'll find out at the end of the month."
If you've had a $2,000 surprise bill, what was the cause? I'm collecting these stories — drop them in the comments. The patterns repeat surprisingly often.
Previous posts in this series:
2. I built my MVP with Claude Code. Now I need to deploy it. Here's what nobody tells you.