One morning I opened AWS Billing and saw something that makes every DevOps engineer uncomfortable: a sharp vertical spike.
No marketing campaign.
No traffic surge.
No new feature launch.
Our AWS cost had tripled in less than 24 hours.
Here’s what happened — and the guardrail I’ll never skip again.
The Setup: “It’s Just a Background Job”
We had a background worker running on EC2. Its job was simple:
- Pull messages from SQS
- Process media files
- Upload results to S3
- Delete the message
Standard stuff.
We had recently increased concurrency to “speed things up.”
Node.js worker (simplified):
while (true) {
  // NB: the SQS response nests messages under `Messages` (undefined when the
  // queue is empty), and long polling avoids hot-looping on an empty queue.
  // `queueUrl` is defined elsewhere in the worker.
  const { Messages = [] } = await sqs
    .receiveMessage({ QueueUrl: queueUrl, MaxNumberOfMessages: 10, WaitTimeSeconds: 20 })
    .promise();

  await Promise.all(
    Messages.map(msg => processJob(JSON.parse(msg.Body)))
  );
}
It worked great in testing.
Until production.
The Silent Problem: No Concurrency Ceiling
Under certain failure conditions:
- `processJob` retried internally
- Jobs took longer
- Messages weren’t deleted fast enough
- More instances scaled up
Because we had:
- An Auto Scaling Group
- CPU-based scaling policy
- No max instance cap set properly
AWS did exactly what it was told to do:
“CPU high? Add instances.”
It added them.
And kept adding them.
The Cascade Effect
Here’s how it spiraled:
- Jobs slow down → CPU increases
- ASG launches more EC2 instances
- More instances pull more SQS messages
- Downstream dependency throttles
- Retries increase
- CPU spikes more
- Repeat
It wasn’t traffic.
It was self-amplifying concurrency.
By the time we noticed:
- EC2 count had doubled
- S3 PUT requests spiked
- Data transfer increased
- NAT gateway costs quietly ballooned
AWS didn’t fail.
Our assumptions did.
The Fix (What I Should Have Done First)
1️⃣ Hard Cap the Auto Scaling Group
Always define a maximum instance count.
Terraform example:
resource "aws_autoscaling_group" "workers" {
  max_size         = 5
  min_size         = 1
  desired_capacity = 2
  # (launch template, subnets, and scaling policies omitted for brevity)
}
No infinite scaling. Ever.
2️⃣ Limit Application-Level Concurrency
Instead of unlimited Promise.all, I switched to controlled parallelism:
import pLimit from "p-limit";

const limit = pLimit(5); // at most 5 jobs in flight per instance

await Promise.all(
  messages.map(msg =>
    limit(() => processJob(JSON.parse(msg.Body)))
  )
);
Infrastructure scaling + unlimited app concurrency = chaos.
Pick one.
3️⃣ Add Circuit Breakers
If downstream dependency fails:
- Stop pulling new messages
- Pause briefly
- Fail fast instead of retrying blindly
Backpressure is not optional in distributed systems.
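A minimal in-process version of that breaker might look like the sketch below. The class name, threshold, and cooldown values are illustrative, not from our actual codebase:

```javascript
// Minimal circuit-breaker sketch: open after N consecutive failures,
// fail fast while open, allow a trial call after the cooldown.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30_000 } = {}) {
    this.threshold = threshold;   // consecutive failures before opening
    this.cooldownMs = cooldownMs; // how long to stay open
    this.failures = 0;
    this.openedAt = null;
  }

  get isOpen() {
    if (this.openedAt === null) return false;
    if (Date.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: reset and allow a trial call ("half-open").
      this.openedAt = null;
      this.failures = 0;
      return false;
    }
    return true;
  }

  async exec(fn) {
    if (this.isOpen) throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.failures = 0; // any success resets the failure streak
      return result;
    } catch (err) {
      if (++this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

In the worker, you would wrap the downstream call (`breaker.exec(() => uploadToS3(...))`) and stop polling SQS while `breaker.isOpen`, so a struggling dependency stops generating retries and CPU load.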
4️⃣ Billing Alerts (The Obvious One)
I added:
- AWS Budget alerts at 50%, 75%, 90%
- Cost anomaly detection
- Slack notifications
Embarrassingly, we had none of that before.
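As a sketch of what those threshold alerts look like programmatically, here is one way to build a `createBudget` request for the AWS Budgets API (SDK v2). The budget name, limit, and email are placeholders, and the parameter shape is my reading of the Budgets API, so verify against the official docs before relying on it:

```javascript
// Hedged sketch: parameters for AWS Budgets createBudget with one
// ACTUAL-spend notification per threshold (50%, 75%, 90%).
// BUDGET_NAME, MONTHLY_LIMIT_USD, and ALERT_EMAIL are illustrative placeholders.
const BUDGET_NAME = "worker-fleet-monthly";
const MONTHLY_LIMIT_USD = "500";
const ALERT_EMAIL = "devops@example.com";

function buildBudgetParams(accountId) {
  return {
    AccountId: accountId,
    Budget: {
      BudgetName: BUDGET_NAME,
      BudgetType: "COST",
      TimeUnit: "MONTHLY",
      BudgetLimit: { Amount: MONTHLY_LIMIT_USD, Unit: "USD" },
    },
    NotificationsWithSubscribers: [50, 75, 90].map(threshold => ({
      Notification: {
        NotificationType: "ACTUAL",
        ComparisonOperator: "GREATER_THAN",
        Threshold: threshold,
        ThresholdType: "PERCENTAGE",
      },
      Subscribers: [{ SubscriptionType: "EMAIL", Address: ALERT_EMAIL }],
    })),
  };
}

// Usage (requires aws-sdk and credentials; Budgets is served from us-east-1):
// const AWS = require("aws-sdk");
// await new AWS.Budgets({ region: "us-east-1" })
//   .createBudget(buildBudgetParams("123456789012"))
//   .promise();
```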
The Lesson Most Teams Learn Once
Cloud doesn’t break loudly.
It scales quietly.
And scaling is expensive when the trigger condition is flawed.
AWS is incredibly good at doing what you configure.
It will happily automate your mistakes.
Conclusion / Key takeaway
Our AWS bill didn’t spike because of growth.
It spiked because we removed a limit and trusted scaling without boundaries.
In cloud systems, unbounded concurrency is a liability.
Have you ever had an AWS (or cloud) cost surprise? What guardrail did you add afterward?