srinu nuthi

Posted on Jun 14 • Originally published at srinun.in

QPS Limit Exceeded on EKS Start-up: The Image Pull Thundering Herd

#kubernetes #aws #eks #devops

I scaled our dev EKS cluster down to zero overnight to save cost. The next morning it didn't come back up cleanly — pods got stuck and the events were full of "QPS limit exceeded". The cause wasn't the automation. It was every pod trying to pull its image at the same second. Here's the thundering herd, and how I fixed it.

Why I started stopping the dev cluster at night

A dev cluster doesn't need to run 24/7. There are 168 hours in a week, but a dev environment realistically only needs ~50 (10 hours a day, 5 days a week). So I set up a schedule: scale the node groups to zero at night, bring them back at 8 AM. The control plane stays up; the expensive worker nodes go to zero.

Savings: roughly 60–70% on dev worker-node compute.
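Where that range comes from is easy to check. A quick sketch of the arithmetic, assuming worker-node cost scales linearly with running hours (on-demand pricing) and ignoring anything that keeps billing while the nodes are gone:

```python
HOURS_PER_WEEK = 7 * 24   # 168
hours_needed = 10 * 5     # 10 hours a day, 5 days a week

uptime_fraction = hours_needed / HOURS_PER_WEEK
max_savings = 1 - uptime_fraction

print(f"uptime: {uptime_fraction:.1%}, upper bound on savings: {max_savings:.1%}")
# uptime: 29.8%, upper bound on savings: 70.2%
```

So about 70% is the ceiling for this schedule, and the real figure lands somewhat below it.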

Then the cluster woke up angry

The automation worked perfectly going down. The problem was going up. The first morning the cluster scaled back from zero, pods got stuck in ContainerCreating:

Failed to pull image "xxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest":
... 429 Too Many Requests
Warning  Failed   kubelet  Error: ErrImagePull
Warning  Failed   kubelet  ... QPS limit exceeded / Rate exceeded

Root cause: everything pulls at once

When a cluster runs normally, pods start at different times, so image pulls are naturally spread out. But when you bring a cluster back from zero, that smooth spread collapses into a single spike:

All node groups scale up together — a batch of fresh nodes joins within the same minute.
Every node starts with an empty image cache.
The scheduler places every pending pod from every namespace at once.
So every kubelet, on every node, fires image pulls to the registry at the same second.

This is a classic thundering herd, and it hits two rate limits at once:

Registry-side (ECR): Amazon ECR rate-limits the API calls used during a pull (GetAuthorizationToken, BatchGetImage, GetDownloadUrlForLayer). When hundreds of pulls arrive at once, the excess calls come back as 429 Too Many Requests.
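Node-side (kubelet): Each kubelet also rate-limits its own pulls via registryPullQPS and registryBurst. Pulls beyond that budget are rejected immediately rather than queued, and the kubelet retries them with back-off.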

The key insight: the error has nothing to do with broken images or a down registry. It's purely a concurrency problem — too many pulls in too short a window.
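To make the concurrency point concrete, here is a toy model of a token-bucket limiter. It is a simplified sketch, not the kubelet's or ECR's actual implementation; the numbers (5 QPS, burst of 10, 100 pulls) are illustrative, and TokenBucket and count_rejected are names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Simplified token bucket: `qps` tokens refill per second, capped at `burst`."""
    qps: float
    burst: int

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket
        self.last = 0.0

    def try_accept(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request, then try to spend one token.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # non-blocking: the request is rejected, not queued


def count_rejected(arrival_times: list[float], qps: float = 5, burst: int = 10) -> int:
    bucket = TokenBucket(qps, burst)
    return sum(not bucket.try_accept(t) for t in sorted(arrival_times))


herd = [0.0] * 100                          # 100 pulls in the same instant
staggered = [i * 0.25 for i in range(100)]  # the same 100 pulls, 4 per second

print(count_rejected(herd))       # 90
print(count_rejected(staggered))  # 0
```

Same images, same number of pulls, same limiter. Only the arrival pattern changes, and the rejections go from 90 to 0.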

How I fixed it

1. Stagger the scale-up instead of big-banging it

This is the single most effective fix. Instead of scaling all node groups to full size at once, bring capacity back in phases: a few nodes first, a couple of minutes for their images to land, then the rest. The same idea applies to workloads: restore critical namespaces first and the rest a few minutes later.
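A minimal sketch of the phased scale-up. The set_desired_size hook is hypothetical: in a real script it might wrap a call such as boto3's EKS update_nodegroup_config for one node group. The batch sizes and the two-minute pause are assumptions to tune, and waiting on node readiness would be more robust than a fixed sleep.

```python
import time
from typing import Callable, Iterator


def phased_sizes(target: int, first_batch: int, batch: int) -> Iterator[int]:
    """Yield increasing desired sizes: a small first batch, then fixed steps up to target."""
    if first_batch < 1 or batch < 1:
        raise ValueError("first_batch and batch must be at least 1")
    size = min(first_batch, target)
    while size < target:
        yield size
        size = min(size + batch, target)
    yield target


def staggered_scale_up(
    set_desired_size: Callable[[int], None],  # hypothetical hook around your scaling API
    target: int,
    first_batch: int = 2,
    batch: int = 3,
    pause_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[int]:
    """Apply each phase in order, pausing between phases, and return the sizes applied."""
    applied = []
    for size in phased_sizes(target, first_batch, batch):
        set_desired_size(size)
        applied.append(size)
        if size < target:
            sleep(pause_seconds)  # give the new nodes time to pull their images
    return applied


# Dry run: record the calls instead of touching a cluster, and skip the real sleep.
calls: list[int] = []
staggered_scale_up(calls.append, target=10, sleep=lambda seconds: None)
print(calls)  # [2, 5, 8, 10]
```

Injecting the hook and the sleep function keeps the phasing logic testable without AWS credentials or waiting.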

2. Tune the kubelet pull limits

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # allow parallel pulls per node
registryPullQPS: 10          # default is 5
registryBurst: 20            # default is 10

Caution: turning these up while the registry is the bottleneck can make ECR throttling worse. Pair it with step 3.
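The reason for the caution is that the kubelet limits are per node, while the registry sees the sum across all nodes. A rough sketch of that ceiling, using a hypothetical 30-node cluster (cluster_pull_ceiling is a made-up helper, and each pull also fans out into several registry API calls, so the real load is higher):

```python
def cluster_pull_ceiling(nodes: int, registry_pull_qps: int = 5, registry_burst: int = 10) -> dict[str, int]:
    """Upper bounds the per-node kubelet limits place on pulls started across the whole cluster."""
    return {
        "burst": nodes * registry_burst,                     # pulls that can start in the same instant
        "sustained_per_second": nodes * registry_pull_qps,   # pulls per second after the burst is spent
    }


print(cluster_pull_ceiling(30))
# {'burst': 300, 'sustained_per_second': 150}
print(cluster_pull_ceiling(30, registry_pull_qps=10, registry_burst=20))
# {'burst': 600, 'sustained_per_second': 300}
```

Doubling the node-side limits doubles what the whole cluster is allowed to throw at the registry in the first second of a cold start.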

3. Put a pull-through cache in front of ECR

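Set up an ECR pull-through cache rule. It mirrors images from upstream registries such as Docker Hub into your own ECR registry the first time they are pulled, so repeated pulls hit a warm copy instead of re-fetching upstream. That is especially valuable for public images, whose registries have their own aggressive rate limits. Also make sure the cluster reaches ECR over VPC interface endpoints (ecr.api and ecr.dkr) plus the S3 gateway endpoint, since image layers are served from S3.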
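Workloads only benefit from the cache if their image references point at it. A small sketch of that rewrite for Docker Hub images, assuming the cache rule was created with the repository prefix docker-hub; the account ID below is a placeholder and cached_image_ref is a hypothetical helper, not part of any AWS SDK.

```python
def cached_image_ref(upstream_ref: str, account_id: str, region: str, prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference so it is pulled through an ECR pull-through cache rule."""
    # Strip an explicit Docker Hub host if one is present.
    for host in ("docker.io/", "index.docker.io/", "registry-1.docker.io/"):
        if upstream_ref.startswith(host):
            upstream_ref = upstream_ref[len(host):]
            break
    # Official images live under the implicit "library/" namespace on Docker Hub.
    name = upstream_ref.split(":")[0].split("@")[0]
    if "/" not in name:
        upstream_ref = f"library/{upstream_ref}"
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{upstream_ref}"


print(cached_image_ref("nginx:1.25", "123456789012", "ap-south-1"))
# 123456789012.dkr.ecr.ap-south-1.amazonaws.com/docker-hub/library/nginx:1.25
print(cached_image_ref("docker.io/grafana/grafana:10.0.0", "123456789012", "ap-south-1"))
# 123456789012.dkr.ecr.ap-south-1.amazonaws.com/docker-hub/grafana/grafana:10.0.0
```

This handles only Docker Hub references; other upstream registries would need their own prefix and rule.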

4. Pre-pull the hot images

Don't let nodes start with an empty cache: bake common images into a custom AMI, or run a lightweight image pre-puller DaemonSet. Fewer cold pulls means a far smaller herd.
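One common shape for a pre-puller is a DaemonSet whose init containers each reference one hot image and exit immediately. The sketch below builds such a manifest as a Python dict and prints it as JSON, which kubectl accepts. The names are hypothetical, and the no-op command assumes every listed image ships a shell, which is not true for distroless or scratch images.

```python
import json


def prepuller_daemonset(images: list[str], name: str = "image-prepuller") -> dict:
    """Build a DaemonSet whose init containers exist only to force each image onto every node."""
    init_containers = [
        {
            "name": f"pull-{i}",
            "image": image,
            # Assumption: the image has a shell. Swap in another no-op command if it does not.
            "command": ["sh", "-c", "true"],
        }
        for i, image in enumerate(images)
    ]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # A tiny long-running container keeps the pod alive after the pulls finish.
                    "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
                },
            },
        },
    }


manifest = prepuller_daemonset(["xxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest"])
print(json.dumps(manifest, indent=2))
```

Init containers run one after another, so on each node the hot images are pulled sequentially rather than all at once.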

The result

The "QPS limit exceeded" / 429 errors disappeared on subsequent morning start-ups.
Pods reached Running faster.
We kept the full cost savings of scaling to zero — without the painful wake-up.

If you only do one thing: stagger the scale-up. Most "QPS limit exceeded" start-up failures vanish the moment you stop bringing the entire cluster back in a single burst.

Takeaway

Scaling a dev cluster to zero overnight is one of the easiest cost wins in Kubernetes — but "scale to zero" quietly turns your start-up from a trickle into a flood. Once I stopped big-banging the start-up and gave ECR breathing room with a cache and pre-pulled images, the mornings got quiet again.

