
Frozen Blood

Our AWS Bill Spiked 3x Overnight — It Wasn’t Traffic, It Was One Missing Limit

One morning I opened AWS Billing and saw something that makes every DevOps engineer uncomfortable: a sharp vertical spike.
No marketing campaign.
No traffic surge.
No new feature launch.

Our AWS cost had tripled in less than 24 hours.

Here’s what happened — and the guardrail I’ll never skip again.


The Setup: “It’s Just a Background Job”

We had a background worker running on EC2. Its job was simple:

  • Pull messages from SQS
  • Process media files
  • Upload results to S3
  • Delete the message

Standard stuff.

We had recently increased concurrency to “speed things up.”

Node.js worker (simplified):

```javascript
// Simplified worker loop (AWS SDK for JavaScript v2)
while (true) {
  const { Messages = [] } = await sqs
    .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 10 })
    .promise();

  await Promise.all(
    Messages.map(msg => processJob(JSON.parse(msg.Body)))
  );
}
```

It worked great in testing.

Until production.


The Silent Problem: No Concurrency Ceiling

Under certain failure conditions:

  • processJob retried internally
  • Jobs took longer
  • Messages weren’t deleted fast enough
  • More instances scaled up

Because we had:

  • An Auto Scaling Group
  • CPU-based scaling policy
  • No properly configured maximum instance cap

AWS did exactly what it was told to do:

“CPU high? Add instances.”

It added them.

And kept adding them.


The Cascade Effect

Here’s how it spiraled:

  1. Jobs slow down → CPU increases
  2. ASG launches more EC2 instances
  3. More instances pull more SQS messages
  4. Downstream dependency throttles
  5. Retries increase
  6. CPU spikes more
  7. Repeat

It wasn’t traffic.

It was self-amplifying concurrency.

By the time we noticed:

  • EC2 count had doubled
  • S3 PUT requests spiked
  • Data transfer increased
  • NAT gateway costs quietly ballooned

AWS didn’t fail.
Our assumptions did.


The Fix (What I Should Have Done First)

1️⃣ Hard Cap the Auto Scaling Group

Always define a maximum instance count.

Terraform example:

```hcl
resource "aws_autoscaling_group" "workers" {
  max_size         = 5
  min_size         = 1
  desired_capacity = 2

  # launch template, subnets, and scaling policies omitted for brevity
}
```

No infinite scaling. Ever.


2️⃣ Limit Application-Level Concurrency

Instead of unlimited Promise.all, I switched to controlled parallelism:

```javascript
import pLimit from "p-limit";

const limit = pLimit(5);

await Promise.all(
  messages.map(msg =>
    limit(() => processJob(JSON.parse(msg.Body)))
  )
);
```

Infrastructure scaling + unlimited app concurrency = chaos.

Pick one.


3️⃣ Add Circuit Breakers

If downstream dependency fails:

  • Stop pulling new messages
  • Pause briefly
  • Fail fast instead of retrying blindly

Backpressure is not optional in distributed systems.
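As a minimal sketch of the idea (not our production implementation — the class name and thresholds here are illustrative), a circuit breaker can be a small counter with a cooldown: after enough consecutive failures it "opens" and fails fast instead of hammering the dependency.

```javascript
// Illustrative circuit-breaker sketch. After `maxFailures` consecutive
// failures the breaker opens and fails fast for `cooldownMs`, giving the
// downstream dependency time to recover before a trial request is allowed.
class CircuitBreaker {
  constructor({ maxFailures = 5, cooldownMs = 30_000 } = {}) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }

  get isOpen() {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: reset and allow a trial request ("half-open").
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  async exec(fn) {
    if (this.isOpen) throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.failures = 0; // a success resets the failure counter
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

In the polling loop, you would wrap the downstream call in `breaker.exec(...)` and skip `receiveMessage` entirely while `breaker.isOpen`, so workers stop pulling work and amplifying load during an outage.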


4️⃣ Billing Alerts (The Obvious One)

I added:

  • AWS Budget alerts at 50%, 75%, 90%
  • Cost anomaly detection
  • Slack notifications

Embarrassingly, we had none of that before.
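For teams already managing infrastructure in Terraform (as the ASG above is), a budget alert can live alongside it. A sketch with an illustrative budget name, limit, and email — add similar `notification` blocks for the 50% and 90% thresholds:

```hcl
resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost-budget" # illustrative
  budget_type  = "COST"
  limit_amount = "1000"                # example limit, in USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 75
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@example.com"] # illustrative
  }
}
```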


The Lesson Most Teams Learn Once

Cloud doesn’t break loudly.

It scales quietly.

And scaling is expensive when the trigger condition is flawed.

AWS is incredibly good at doing what you configure.
It will happily automate your mistakes.


Conclusion / Key takeaway

Our AWS bill didn’t spike because of growth.
It spiked because we removed a limit and trusted scaling without boundaries.

In cloud systems, unbounded concurrency is a liability.

Have you ever had an AWS (or cloud) cost surprise? What guardrail did you add afterward?
