DEV Community

Lolo

The Node.js Mistake That Cost My Client $3,000 in AWS Bills

Tight loops and missing backoff logic

Last year I was asked to investigate a startup's AWS bill.

It had jumped from roughly $200/month to over $3,000 in a few weeks.

Nobody knew why.

After digging through logs, metrics, and database traffic, I found the culprit: a polling loop with no backoff strategy.

The code looked harmless:

async function processQueue() {
  const jobs = await getJobs()

  for (const job of jobs) {
    await processFile(job)
  }

  // Runs again immediately, even when getJobs() returned nothing
  processQueue()
}

processQueue()

At first glance, this seems reasonable. Process all available jobs, then check again.

The problem appears when the queue is empty.

When getJobs() returned no work, the loop immediately queried the database again. And again. And again.

There was no delay, no backoff, and no event-driven trigger.

As a result, the service continuously hammered the database looking for work that didn't exist.

Each iteration generated:

  • A database query
  • Network traffic
  • CPU usage
  • Logging overhead
  • Additional infrastructure load

Individually, each operation was cheap.

Executed hundreds of thousands of times per day, they became expensive.
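
To get a feel for the scale, here is a rough back-of-the-envelope sketch. The function name `pollsPerDay` and the cycle times are hypothetical, chosen only for illustration; they are not measurements from the incident.

```javascript
// Rough arithmetic: how many polls per day does a loop make if each
// empty check takes `cycleMs` milliseconds (query time plus any delay)?
const MS_PER_DAY = 24 * 60 * 60 * 1000 // 86,400,000 ms

function pollsPerDay(cycleMs) {
  if (!(cycleMs > 0)) throw new RangeError('cycleMs must be positive')
  return Math.floor(MS_PER_DAY / cycleMs)
}

// Hypothetical cycle times, not figures from the incident.
console.log(pollsPerDay(200))  // 432000
console.log(pollsPerDay(5000)) // 17280
```

Under these assumed numbers, an empty check every 200 ms adds up to 432,000 queries a day, while one check every five seconds stays around 17,000.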

The fix was simple. Wait five seconds after each pass before polling again:

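async function processQueue() {
  while (true) {
    const jobs = await getJobs()

    for (const job of jobs) {
      await processFile(job)
    }

    // Pause before polling again so an empty queue no longer
    // triggers back-to-back database queries.
    await new Promise(resolve => setTimeout(resolve, 5000))
  }
}

processQueue()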

An even better fix would have been to replace polling entirely with an event-driven design built on a message queue.
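
As a minimal sketch of the event-driven idea, the example below uses Node's built-in `EventEmitter` as an in-process stand-in for a real message queue. This is an assumption made to keep the example self-contained: a production system would use an actual queue service, and `createWorker` and the `'job'` event name are hypothetical.

```javascript
const { EventEmitter } = require('events')

// In-process stand-in for a message queue (assumption for illustration).
// The worker does nothing at all until a 'job' event arrives.
function createWorker(queue, processFile) {
  let chain = Promise.resolve()

  const onJob = job => {
    // Run jobs one at a time, in arrival order. The catch keeps one
    // failed job from blocking the jobs queued behind it.
    chain = chain
      .then(() => processFile(job))
      .catch(err => console.error('job failed:', err))
  }

  queue.on('job', onJob)

  return {
    idle: () => chain,                   // resolves once queued jobs finish
    stop: () => queue.off('job', onJob), // stop listening for new jobs
  }
}

// Usage: no polling, no timers. Work happens only when a job is emitted.
const queue = new EventEmitter()
const worker = createWorker(queue, async job => {
  console.log('processing', job)
})
queue.emit('job', 'report.csv')
```

The point of the design is that an empty queue costs nothing: with no events, no code runs and no queries are sent.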

What this incident taught me:

1. Empty queues are production workloads.

Many engineers optimize for peak traffic and forget about idle traffic. Systems often spend more time idle than busy.

2. Polling needs backoff.

If you're polling, always define what happens when no work is found: how long to wait, and whether that wait grows while the queue stays empty.
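
One way to make that explicit is exponential backoff, which goes a step beyond a fixed delay. The sketch below is an illustration under stated assumptions, not the fix that was deployed: `pollWithBackoff`, `shouldStop`, `wait`, and the `minMs`/`maxMs` defaults are hypothetical names and values, while `getJobs` and `processFile` are passed in by the caller.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function pollWithBackoff({
  getJobs,
  processFile,
  shouldStop = () => false, // lets callers (and tests) end the loop
  wait = sleep,
  minMs = 1000,             // first wait after an empty poll
  maxMs = 60000,            // upper bound on the wait
}) {
  let delayMs = minMs

  while (!shouldStop()) {
    const jobs = await getJobs()

    if (jobs.length > 0) {
      for (const job of jobs) {
        await processFile(job)
      }
      delayMs = minMs // work was found: reset and poll again right away
      continue
    }

    // Empty queue: wait, then double the next wait up to maxMs.
    await wait(delayMs)
    delayMs = Math.min(delayMs * 2, maxMs)
  }
}
```

With these defaults, a queue that stays empty is checked after 1 s, 2 s, 4 s, and so on, settling at one check per minute, and the delay drops back to 1 s as soon as a job shows up.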

3. Cost bugs rarely look like bugs.

Nothing crashed. No exceptions were thrown. The system was technically working exactly as written.

It was just doing useless work 24/7.

4. Always monitor cost alongside performance.

CPU, latency, and error rates looked normal.

The AWS bill was the first real alert.
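
One signal that could surface this kind of waste earlier is the share of polls that find nothing. The wrapper below is a minimal sketch of that idea; `withPollStats` and `emptyRatio` are hypothetical names, and how the counts get exported to a metrics or alerting system is left open.

```javascript
// Wraps a getJobs-style function and counts total vs. empty polls.
function withPollStats(getJobs) {
  const stats = { polls: 0, emptyPolls: 0 }

  async function countedGetJobs() {
    const jobs = await getJobs()
    stats.polls += 1
    if (jobs.length === 0) stats.emptyPolls += 1
    return jobs
  }

  // Fraction of polls that returned no work (0 before the first poll).
  const emptyRatio = () =>
    stats.polls === 0 ? 0 : stats.emptyPolls / stats.polls

  return { getJobs: countedGetJobs, stats, emptyRatio }
}
```

A very high poll count combined with an empty ratio close to 1 describes a service that is busy doing nothing, which CPU, latency, and error dashboards can miss.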

One question I ask during reviews now:

"What does this code do when there's nothing to do?"

That single question has caught more production issues than many architecture discussions ever did.

What's the most expensive bug you've ever seen in production?

Top comments (7)

UnitBuilds

I feel that... I was adding features to my accounting suite, so naturally I added cloud logging. But I didn't limit the logs the cloud is meant to generate, nor the storage time. Yeah, so $50 of log files later, I won't be making that mistake again.

Lolo

Ouch, that one hurts! At least $50 is a cheap lesson. I've heard of people hitting thousands just from CloudWatch logs with no retention policy set. The scary part is how long it can go unnoticed before the billing alert fires (if you even have one set up). Did you catch it from the bill or did something else tip you off?

UnitBuilds

This is on GCP free trial credits. I just saw that my credit balance had gone down, so I had to dig into the reports, filter by SKU, disable savings, and BOOM: Cloud Run cost 2c, Vertex AI $2, cloud log $20, cloud log storage $30... Not a fun one, because it means I need to pay overage out of pocket when clients hit the limit a whole 5 months earlier.

HopeGeek

WoW, that's crazy, ☺️

Lolo

Thanks! :)

Frank

I'm curious if the issue was related to an

Lolo

Good question. The main issue was that getJobs() queried the database continuously when the queue was empty. There was no delay between checks, so the service ended up making an enormous number of requests for work that didn't exist.
Have you ever run into something similar with polling loops or background workers?