Last year I was asked to investigate a startup's AWS bill.
It had jumped from roughly $200/month to over $3,000 in a few weeks.
Nobody knew why.
After digging through logs, metrics, and database traffic, I found the culprit: a polling loop with no backoff strategy.
The code looked harmless:
    async function processQueue() {
      const jobs = await getJobs()
      for (const job of jobs) {
        await processFile(job)
      }
      processQueue()
    }
    processQueue()
At first glance, this seems reasonable. Process all available jobs, then check again.
The problem appears when the queue is empty.
When getJobs() returned no work, the loop immediately queried the database again. And again. And again.
There was no delay, no backoff, and no event-driven trigger.
As a result, the service continuously hammered the database looking for work that didn't exist.
Each iteration generated:
- A database query
- Network traffic
- CPU usage
- Logging overhead
- Additional infrastructure load
Individually, each operation was cheap.
Executed hundreds of thousands of times per day, they became expensive.
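To put rough numbers on that, here is a back-of-the-envelope sketch. The 250 ms per empty iteration is an assumed figure for illustration, not a measurement from the incident, and `pollsPerDay` is a hypothetical helper.

```javascript
// Rough estimate of how many times a polling loop runs per day.
// cycleMs is the time one full iteration takes (query plus any sleep).
const MS_PER_DAY = 24 * 60 * 60 * 1000 // 86,400,000

function pollsPerDay(cycleMs) {
  return Math.floor(MS_PER_DAY / cycleMs)
}

// Assumed: an empty poll with no delay takes about 250 ms round trip.
console.log(pollsPerDay(250)) // 345600

// With a 5-second pause between polls (ignoring the query time itself):
console.log(pollsPerDay(5000)) // 17280
```

Under that assumption, a 5-second pause cuts the number of empty polls by a factor of 20.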
The fix was simple:
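    async function processQueue() {
      while (true) {
        // Fetch and process whatever is currently waiting
        const jobs = await getJobs()
        for (const job of jobs) {
          await processFile(job)
        }
        // Pause 5 seconds before checking again, so an empty queue
        // no longer triggers back-to-back database queries
        await new Promise(resolve => setTimeout(resolve, 5000))
      }
    }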
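A fixed delay was enough here, but a loop can also back off progressively while the queue stays empty. The following is a minimal sketch of that idea, not the code that was deployed: `nextDelay`, `pollWithBackoff`, the 1-second base, and the 60-second cap are all assumptions chosen for illustration, and `getJobs` and `processFile` are passed in so the example stays self-contained.

```javascript
const BASE_DELAY_MS = 1000  // assumed starting delay
const MAX_DELAY_MS = 60000  // assumed upper bound on the delay

// Returns the delay to wait before the next poll.
function nextDelay(previousMs, foundWork) {
  if (foundWork) return BASE_DELAY_MS           // reset once work shows up
  return Math.min(previousMs * 2, MAX_DELAY_MS) // otherwise double, up to the cap
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// shouldStop lets the caller end the loop, for example on shutdown.
async function pollWithBackoff(getJobs, processFile, shouldStop = () => false) {
  let delayMs = BASE_DELAY_MS
  while (!shouldStop()) {
    const jobs = await getJobs()
    for (const job of jobs) {
      await processFile(job)
    }
    delayMs = nextDelay(delayMs, jobs.length > 0)
    await sleep(delayMs)
  }
}
```

With these values, consecutive empty polls wait 2 s, 4 s, 8 s and so on until the delay settles at 60 s, and the first poll that finds work resets it to 1 s. The trade-off is latency: a job that arrives during a long idle stretch may wait up to the cap before it is picked up.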
Even better would have been replacing polling entirely with an event-driven design using a message queue.
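As a rough illustration of the event-driven shape, here is an in-process stand-in for a message queue. It is only a sketch: `JobQueue` and `consume` are hypothetical names, everything lives in memory, and a real system would put a message broker in this position.

```javascript
// In-memory stand-in for a message queue: consumers wait until a job is
// pushed instead of repeatedly asking whether one exists.
class JobQueue {
  constructor() {
    this.jobs = []    // jobs waiting for a consumer
    this.waiters = [] // resolve callbacks of consumers waiting for a job
  }

  push(job) {
    const waiter = this.waiters.shift()
    if (waiter) {
      waiter(job)     // hand the job straight to a waiting consumer
    } else {
      this.jobs.push(job)
    }
  }

  // Resolves immediately if a job is queued, otherwise when one arrives.
  next() {
    if (this.jobs.length > 0) {
      return Promise.resolve(this.jobs.shift())
    }
    return new Promise((resolve) => this.waiters.push(resolve))
  }
}

// processFile is passed in so the sketch stays self-contained.
async function consume(queue, processFile) {
  while (true) {
    const job = await queue.next() // waits here without issuing any queries
    await processFile(job)
  }
}
```

The point of the design is that an empty queue produces no activity at all: the consumer is parked on a pending promise until a producer calls `push`.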
What this incident taught me:
1. Empty queues are production workloads.
Many engineers optimize for peak traffic and forget about idle traffic. Systems often spend more time idle than busy.
2. Polling needs backoff.
If you're polling, always define what happens when no work is found.
3. Cost bugs rarely look like bugs.
Nothing crashed. No exceptions were thrown. The system was technically working exactly as written.
It was just doing useless work 24/7.
4. Always monitor cost alongside performance.
CPU, latency, and error rates looked normal.
The AWS bill was the first real alert.
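One way to make idle work visible before the bill does is to count how often a poll comes back empty. This is a small sketch under assumptions: `PollStats` is a hypothetical helper, and where the numbers get reported (logs, a metrics service, a dashboard) is left open.

```javascript
// Hypothetical counters for a polling worker.
class PollStats {
  constructor() {
    this.total = 0 // polls performed
    this.empty = 0 // polls that returned no jobs
  }

  // Call once per poll with the number of jobs it returned.
  record(jobCount) {
    this.total += 1
    if (jobCount === 0) this.empty += 1
  }

  // Fraction of polls that found nothing; 0 when nothing was recorded yet.
  emptyRatio() {
    return this.total === 0 ? 0 : this.empty / this.total
  }
}

const stats = new PollStats()
stats.record(0)
stats.record(3)
stats.record(0)
stats.record(0)
console.log(stats.emptyRatio()) // 0.75
```

In a worker, `stats.record(jobs.length)` would go right after the `getJobs()` call. A high empty ratio combined with a high total count is the pattern that stayed invisible in this incident.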
One question I ask during reviews now:
"What does this code do when there's nothing to do?"
That single question has caught more production issues than many architecture discussions ever did.
What's the most expensive bug you've ever seen in production?
Top comments (7)
I feel that... I was adding features to my accounting suite, so naturally I added cloud logging. But I didn't limit the logs that the cloud is meant to generate... nor the storage time... Yeah, so $50 of log files later, I won't be making that mistake again.
Ouch, that one hurts! At least $50 is a lesson learned cheap — I've heard of people hitting thousands just from CloudWatch logs with no retention policy set. The scary part is how long it can go unnoticed before the billing alert fires (if you even have one set up). Did you catch it from the bill or did something else tip you off?
This is on GCP free trial credits. So I just saw my credit balance had gone down, had to dig into the reports, filter by SKU, disable savings and BOOM: Cloud Run cost 2c, Vertex AI $2, cloud log $20, cloud log storage $30... Not a fun one, cuz it means I need to pay the overage outta pocket when clients hit the limit a whole 5 months earlier.
WoW, that's crazy, ☺️
Thanks! :)
I'm curious if the issue was related to an
Good question. The main issue was that getJobs() queried the database continuously when the queue was empty. There was no delay between checks, so the service ended up making an enormous number of requests for work that didn't exist.
Have you ever run into something similar with polling loops or background workers?