
Frozen Blood

Our AWS Bill Spiked 3x Overnight — It Wasn’t Traffic, It Was One Missing Limit

One morning I opened AWS Billing and saw something that makes every DevOps engineer uncomfortable: a sharp vertical spike.
No marketing campaign.
No traffic surge.
No new feature launch.

Our AWS cost had tripled in less than 24 hours.

Here’s what happened — and the guardrail I’ll never skip again.


The Setup: “It’s Just a Background Job”

We had a background worker running on EC2. Its job was simple:

  • Pull messages from SQS
  • Process media files
  • Upload results to S3
  • Delete the message

Standard stuff.

We had recently increased concurrency to “speed things up.”

Node.js worker (simplified):

```javascript
// Simplified worker loop (AWS SDK for JavaScript v2)
while (true) {
  const { Messages = [] } = await sqs
    .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 10 })
    .promise();

  await Promise.all(
    Messages.map(msg => processJob(JSON.parse(msg.Body)))
  );
}
```

It worked great in testing.

Until production.


The Silent Problem: No Concurrency Ceiling

Under certain failure conditions:

  • processJob retried internally
  • Jobs took longer
  • Messages weren’t deleted fast enough
  • More instances scaled up

Because we had:

  • An Auto Scaling Group
  • CPU-based scaling policy
  • No properly configured maximum instance cap

AWS did exactly what it was told to do:

“CPU high? Add instances.”

It added them.

And kept adding them.


The Cascade Effect

Here’s how it spiraled:

  1. Jobs slow down → CPU increases
  2. ASG launches more EC2 instances
  3. More instances pull more SQS messages
  4. Downstream dependency throttles
  5. Retries increase
  6. CPU spikes more
  7. Repeat

It wasn’t traffic.

It was self-amplifying concurrency.

By the time we noticed:

  • EC2 count had doubled
  • S3 PUT requests spiked
  • Data transfer increased
  • NAT gateway costs quietly ballooned

AWS didn’t fail.
Our assumptions did.


The Fix (What I Should Have Done First)

1️⃣ Hard Cap the Auto Scaling Group

Always define a maximum instance count.

Terraform example:

```hcl
resource "aws_autoscaling_group" "workers" {
  max_size         = 5
  min_size         = 1
  desired_capacity = 2

  # launch template, subnets, and scaling policies omitted for brevity
}
```

No infinite scaling. Ever.


2️⃣ Limit Application-Level Concurrency

Instead of unlimited Promise.all, I switched to controlled parallelism:

```javascript
import pLimit from "p-limit";

const limit = pLimit(5);

await Promise.all(
  messages.map(msg =>
    limit(() => processJob(JSON.parse(msg.Body)))
  )
);
```

Infrastructure scaling + unlimited app concurrency = chaos.

Pick one.


3️⃣ Add Circuit Breakers

If downstream dependency fails:

  • Stop pulling new messages
  • Pause briefly
  • Fail fast instead of retrying blindly

Backpressure is not optional in distributed systems.
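As a minimal sketch of the idea (not our production implementation — the class name and thresholds here are illustrative), a circuit breaker can be a small counter with a cooldown: after enough consecutive failures it "opens" and fails fast instead of hammering the dependency.

```javascript
// Illustrative circuit-breaker sketch. After `maxFailures` consecutive
// failures the breaker opens and fails fast for `cooldownMs`, giving the
// downstream dependency time to recover before a trial request is allowed.
class CircuitBreaker {
  constructor({ maxFailures = 5, cooldownMs = 30_000 } = {}) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }

  get isOpen() {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: reset and allow a trial request ("half-open").
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  async exec(fn) {
    if (this.isOpen) throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.failures = 0; // a success resets the failure counter
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

In the polling loop, you would wrap the downstream call in `breaker.exec(...)` and skip `receiveMessage` entirely while `breaker.isOpen`, so workers stop pulling work and amplifying load during an outage.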


4️⃣ Billing Alerts (The Obvious One)

I added:

  • AWS Budget alerts at 50%, 75%, 90%
  • Cost anomaly detection
  • Slack notifications

Embarrassingly, we had none of that before.
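For teams already managing infrastructure in Terraform (as the ASG above is), a budget alert can live alongside it. A sketch with an illustrative budget name, limit, and email — add similar `notification` blocks for the 50% and 90% thresholds:

```hcl
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget" # illustrative
  budget_type  = "COST"
  limit_amount = "1000"                # example limit, in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 75
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@example.com"] # illustrative
  }
}
```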


The Lesson Most Teams Learn Once

Cloud doesn’t break loudly.

It scales quietly.

And scaling is expensive when the trigger condition is flawed.

AWS is incredibly good at doing what you configure.
It will happily automate your mistakes.


Conclusion / Key takeaway

Our AWS bill didn’t spike because of growth.
It spiked because we removed a limit and trusted scaling without boundaries.

In cloud systems, unbounded concurrency is a liability.

Have you ever had an AWS (or cloud) cost surprise? What guardrail did you add afterward?
