Lolo

Posted on Jun 23

The Node.js Mistake That Cost My Client $3,000 in AWS Bills

#aws #webdev #javascript #node

Tight loops and missing backoff logic

Last year I was asked to investigate a startup's AWS bill.

It had jumped from roughly $200/month to over $3,000 in a few weeks.

Nobody knew why.

After digging through logs, metrics, and database traffic, I found the culprit: a polling loop with no backoff strategy.

The code looked harmless:

async function processQueue() {
  const jobs = await getJobs()

  for (const job of jobs) {
    await processFile(job)
  }

  processQueue()
}

processQueue()

At first glance, this seems reasonable. Process all available jobs, then check again.

The problem appears when the queue is empty.

When getJobs() returned no work, the loop immediately queried the database again. And again. And again.

There was no delay, no backoff, and no event-driven trigger.

As a result, the service continuously hammered the database looking for work that didn't exist.

Each iteration generated:

A database query
Network traffic
CPU usage
Logging overhead
Additional infrastructure load

Individually, each operation was cheap.

Executed hundreds of thousands of times per day, they became expensive.

The fix was simple:

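async function processQueue() {
  // A loop instead of recursion: one long-lived worker
  while (true) {
    const jobs = await getJobs()

    for (const job of jobs) {
      await processFile(job)
    }

    // Fixed 5-second pause, so an empty queue costs at most ~12 queries per minute
    await new Promise(resolve => setTimeout(resolve, 5000))
  }
}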

Even better would have been replacing polling entirely with an event-driven design using a message queue.
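To make the event-driven idea concrete, here is a minimal sketch, not the client's actual setup. JobQueue, runWorker, and STOP are hypothetical names, and the in-memory queue is only a stand-in for a real message broker. The point it illustrates is that the worker waits on a promise instead of asking the database for work, so an empty queue generates no queries at all.

```javascript
// Hypothetical in-memory stand-in for a message queue.
// A production system would use a real broker instead.
const STOP = Symbol('stop') // sentinel that tells the worker to shut down

class JobQueue {
  constructor() {
    this.jobs = []    // jobs waiting for a consumer
    this.waiters = [] // consumers waiting for a job
  }

  // Producer side: hand the job to a waiting consumer, or buffer it.
  push(job) {
    const waiter = this.waiters.shift()
    if (waiter) waiter(job)
    else this.jobs.push(job)
  }

  // Consumer side: resolves immediately if a job is buffered; otherwise the
  // promise stays pending (no queries, no timers) until push() is called.
  next() {
    if (this.jobs.length > 0) return Promise.resolve(this.jobs.shift())
    return new Promise(resolve => this.waiters.push(resolve))
  }
}

// processFile is passed in so the sketch stays self-contained.
async function runWorker(queue, processFile) {
  while (true) {
    const job = await queue.next()
    if (job === STOP) return
    await processFile(job)
  }
}
```

With this shape, an idle worker does nothing until a producer calls push(), which is the behavior the polling loop was missing.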

What this incident taught me:

1. Empty queues are production workloads.

Many engineers optimize for peak traffic and forget about idle traffic. Systems often spend more time idle than busy.

2. Polling needs backoff.

If you're polling, always define what happens when no work is found.
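One way to define it is exponential backoff: wait longer after each empty poll, up to a cap, and reset as soon as work shows up. The following is a sketch under assumptions rather than the fix that was deployed. pollWithBackoff and its options (minMs, maxMs, shouldStop, sleepFn) are hypothetical names, and getJobs and processFile are passed in as parameters so the example is self-contained.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical polling loop with exponential backoff on empty results.
async function pollWithBackoff(getJobs, processFile, options = {}) {
  const {
    minMs = 1000,             // first wait after an empty poll
    maxMs = 60000,            // upper bound for the wait
    shouldStop = () => false, // lets callers shut the loop down
    sleepFn = sleep           // injectable so the loop can be tested without waiting
  } = options

  let delayMs = minMs

  while (!shouldStop()) {
    const jobs = await getJobs()

    for (const job of jobs) {
      await processFile(job)
    }

    if (jobs.length > 0) {
      // Work was found: reset the delay and check again right away
      delayMs = minMs
      continue
    }

    // Nothing to do: wait, then double the delay for the next empty poll
    await sleepFn(delayMs)
    delayMs = Math.min(delayMs * 2, maxMs)
  }
}
```

With the defaults, a queue that stays empty is polled after 1s, 2s, 4s, and so on, settling at one query per minute. The trade-off is pickup latency: a job that arrives during a long wait sits until the next poll.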

3. Cost bugs rarely look like bugs.

Nothing crashed. No exceptions were thrown. The system was technically working exactly as written.

It was just doing useless work 24/7.

4. Always monitor cost alongside performance.

CPU, latency, and error rates looked normal.

The AWS bill was the first real alert.
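A cheap signal that could have fired earlier is the number of polls that come back empty. Below is a small sketch of that idea, assuming getJobs returns an array as in the code above. withPollStats and the stats field names are hypothetical, and how the numbers get reported (logs, a metrics service, an alarm) is left open.

```javascript
// Hypothetical wrapper that counts total polls and polls that found no work.
function withPollStats(getJobs) {
  const stats = { polls: 0, emptyPolls: 0 }

  async function instrumentedGetJobs() {
    const jobs = await getJobs()
    stats.polls += 1
    if (jobs.length === 0) stats.emptyPolls += 1
    return jobs
  }

  return { getJobs: instrumentedGetJobs, stats }
}
```

Reporting stats.polls and stats.emptyPolls on a fixed interval and alerting when the rate looks unreasonable would surface a runaway loop even while CPU, latency, and error rates stay flat.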

One question I ask during reviews now:

"What does this code do when there's nothing to do?"

That single question has caught more production issues than many architecture discussions ever did.

What's the most expensive bug you've ever seen in production?

Top comments (16)

UnitBuilds • Jun 23

I feel that... I was adding features to my accounting suite, so naturally I added cloud logging. But I didn't limit the logs that cloud is meant to generate... Nor storage time... Yeah, so $50 of log files later, won't be making that mistake again.

Lolo • Jun 23

Ouch, that one hurts! At least $50 is a lesson learned cheap — I've heard of people hitting thousands just from CloudWatch logs with no retention policy set. The scary part is how long it can go unnoticed before the billing alert fires (if you even have one set up). Did you catch it from the bill or did something else tip you off?

UnitBuilds • Jun 23

This is on GCP free trial credits. So I just saw my credit balance went down, had to dig into the reports, filter by SKU, disable savings and BOOM cloud run cost 2c, vertex ai $2, cloud log $20, cloud log storage $30... Not a fun 1, cuz it means I need to pay overage outta pocket when clients hit the limit a whole 5 months earlier

Frank • Jun 24

Was the issue related to improper handling of async operations or perhaps a misconfigured cluster autoscaler, leading to unnecessary resource utilization?

Lolo • Jun 24

Nope. No fancy infrastructure failure—just a few lines of code enthusiastically looking for work that didn't exist. 😅

𝐓𝐡𝐞 𝐋𝐚𝐳𝐲 𝐆𝐢𝐫𝐥 • Jun 23

Real-world engineering lessons like this are far more valuable than theoretical best practices. The bill hurt, but the takeaway will probably save many developers from making the same mistake!

Lolo • Jun 24

Thanks! That's exactly why I wanted to share it. The expensive lessons tend to be the ones that stick. 😅

Nazar Boyko • Jun 24

That closing question, what does this code do when there's nothing to do, is the part worth stealing. It travels way past polling loops, into retries that never back off or a health check firing way more than it needs to. On the fix, the 5 second sleep stops the bleeding but it's still polling, so you're trading a huge bill for a few seconds of pickup delay, which is fine right up until that delay matters and you wish you'd gone event driven from the start. The scariest line in the whole thing is that nothing crashed, because a bug with no exception and no failed metric can run for weeks before the invoice is what finally tells you.

Lolo • Jun 24

Exactly. The invoice ended up being the first real alert, which is never where you want to discover a bug. 😅

Frank • Jun 23

I'm curious if the issue was related to an

Lolo • Jun 23

Good question. The main issue was that getJobs() queried the database continuously when the queue was empty. There was no delay between checks, so the service ended up making an enormous number of requests for work that didn't exist.
Have you ever run into something similar with polling loops or background workers?

HopeGeek • Jun 23

WoW, that's crazy, ☺️

Lolo • Jun 23

Thanks! :)

Mudassir Khan • Jun 29

"what does this code do when there's nothing to do?" is the question i run on every background worker in code review. catches more than polling bugs — retry handlers that compound on empty error queues, health checks on intervals too short for the underlying service, scheduled jobs that lock a row already processed.

the 5 second sleep stops the bleed but it is still polling. the next question after that fix: how long can pickup delay be before it matters? if the answer is "zero, realtime pickup" then you have found the event driven argument already.

the scariest line is "nothing crashed." silent cost bugs run for weeks because every metric you watch looks fine.

what did your CloudWatch log retention look like before this? usually the second bill hiding next to the polling one.

Vlad Z • Jul 28

A $3,000 jump in AWS bills is a huge hit. I once had to deal with a similar issue where a misconfigured Redis cluster was costing my company $5,000 a month, and we were able to reduce that cost by 30% by optimizing our instance types and usage patterns. What was the main contributor to the increased costs in your client's AWS bill?
