We usually just default to CPU-based scaling for our Auto Scaling Groups (ASGs). It’s the standard move. It’s easy, it’s familiar, and for web servers? It usually works fine.
But sometimes, CPU utilization lies.
We recently hit a wall where CPU scaling completely failed us. This is the story of how a critical background job kept crashing even though our dashboards said everything was "healthy," and how switching to memory-based metrics saved the day.
The Silent Failure
We run a Ruby on Rails app. It relies heavily on Sidekiq for background work. These workers run on EC2 instances in an Auto Scaling Group.
On paper, everything looked great.
CPU usage? A comfortable 20–30%.
Network? Normal.
Disk? Fine.
AWS said we were green.
But the app was on fire.
Critical jobs were timing out. The queues were piling up. Retries were spiking. Worst of all? Workers were just... vanishing. Processes were dying, but since CPU was low, the auto-scaler didn't care. It didn't launch new instances. It just let them die.
The Culprit: The OOM Killer
I dug into the logs, and the answer was right there. Memory.
Our Sidekiq jobs are hungry. As the Ruby processes chewed through heavy tasks, they ate up more and more RAM. The instances were running out of memory, and the Linux OOM (Out-of-Memory) Killer stepped in to save the server by killing our Sidekiq process.
The problem? EC2 doesn't send memory metrics to CloudWatch by default.
So, while our RAM was screaming for help, CloudWatch saw low CPU and thought, "Everything is chill."
Step 1: Making Memory Visible
You can't fix what you can't see.
First thing I did was install the CloudWatch Agent on our instances. I needed it to ship custom metrics—specifically mem_used_percent—to AWS.
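If you're setting this up yourself, the agent reads a JSON config file. Here's a trimmed-down sketch of the memory section (not our exact file; the `CWAgent` namespace and the 60-second interval are just the common defaults, tweak them for your setup):

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [["AutoScalingGroupName"]],
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```

The `append_dimensions` / `aggregation_dimensions` pair tags every datapoint with the instance's ASG name and rolls it up per group. That per-group average is exactly what the scaling policy in Step 2 needs to track.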
As soon as we turned it on, the graphs confirmed it.
CPU was bored at 20%.
Memory? It was spiking over 85%.

Above: Finally seeing the truth. CPU was low, but RAM was maxed out.
Step 2: Changing the Rules
We stopped listening to CPU. I set up a Target Tracking scaling policy that looks strictly at memory.
I told the ASG: "Keep average memory at 40%."
Sounds low, right? But background workers are unpredictable. They need breathing room for sudden spikes. This setup does two things:
- It adds new servers before we hit the danger zone (75%+).
- It doesn't kill servers too fast, so we avoid "thrashing" (booting up and shutting down constantly).

Above: The new policy. If RAM goes up, we scale out.
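If you prefer the CLI to the console, the policy boils down to a target tracking configuration in JSON. A rough sketch of what ours looks like (the ASG name is a placeholder, and the metric name, namespace, and dimension have to match whatever your agent actually publishes):

```json
{
  "TargetValue": 40.0,
  "CustomizedMetricSpecification": {
    "MetricName": "mem_used_percent",
    "Namespace": "CWAgent",
    "Dimensions": [
      { "Name": "AutoScalingGroupName", "Value": "sidekiq-workers-asg" }
    ],
    "Statistic": "Average"
  },
  "DisableScaleIn": false
}
```

You'd attach it with `aws autoscaling put-scaling-policy --policy-type TargetTrackingScaling --target-tracking-configuration file://your-policy.json`. Keeping `DisableScaleIn` at false means the ASG can still remove instances when memory drops, and target tracking scales in conservatively by design, which is what avoids the thrashing mentioned above.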
The Result
Instant fix.
The scaling became proactive. Instead of waiting for a crash, the cluster sees memory pressure building and adds capacity before things break.
We haven't seen a single OOM kill since. The Sidekiq service is happy, the queues are empty, and I can finally sleep.

Above: Stable, boring, and running perfectly.
Why CPU Scaling Sucks for Workers
Here's the takeaway.
Web traffic is usually CPU-heavy. You get a request, you process it, you send a response. CPU spikes, you scale. Simple.
Background workers (like Sidekiq, Celery, Bull) are different. They load big files. They process heavy data objects. They eat RAM. Your CPU can be totally asleep while your memory is completely full.
Final Thoughts
If you're running background jobs in an ASG and you're only watching CPU... you might be one heavy job away from a silent outage.
Don't just use the default settings. Scale on what actually hurts your application.
Note: The screenshots above are from a test setup. I’ve hidden sensitive stuff like Account IDs.