"The scheduler made a promise. VPA broke it. Your users felt it."
🎯 The Setup
You deployed VPA. Requests are auto-tuned. Nodes are optimally packed. You feel smart.
Then 3am happens. PagerDuty fires. Half your production pods are in Pending. The other half just restarted cold, in a different zone, with no image cache.
VPA didn't malfunction. It did exactly what it was designed to do. The problem is that VPA and the Kubernetes scheduler operate on fundamentally incompatible assumptions, and nobody told you they were quietly at war inside your cluster.
This post is that warning.
🤯 Interesting Fact #1: VPA Can Make Your Pod Permanently Unschedulable
Not temporarily unschedulable. Permanently.
Here's how:
VPA's Recommender watches your pod's actual CPU usage over time. Your pod runs on a node with 8 CPUs and consistently pegs at 7.5 cores (almost certainly throttled at the node's ceiling), so VPA keeps ratcheting its estimate upward and responsibly recommends:
status:
  recommendation:
    containerRecommendations:
    - containerName: api
      target:
        cpu: "14"        # ← VPA's honest recommendation
        memory: "24Gi"
Honest? Yes. Schedulable? Absolutely not.
Your entire cluster runs 8-CPU nodes. No node can ever fit requests: cpu: 14. The VPA Updater evicts your pod. The scheduler tries to place it. Filters every node. Finds zero candidates.
Events:
  Warning  FailedScheduling  0/12 nodes are available:
                             12 Insufficient cpu.
Your pod sits in Pending forever. VPA just self-destructed your workload with good intentions.
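To see why retries can't help, here's a minimal Python sketch of the scheduler's filter phase. The node names, sizes, and helper function are illustrative, not real kube-scheduler code:

```python
def feasible_nodes(nodes, request_cpu):
    """Mimic the scheduler's filter phase: keep only nodes whose
    allocatable CPU can hold the pod's request (simplified: CPU only)."""
    return [name for name, allocatable in nodes.items() if allocatable >= request_cpu]

# A hypothetical cluster of twelve 8-CPU nodes
cluster = {f"node-{i:02d}": 8.0 for i in range(12)}

print(feasible_nodes(cluster, 14.0))       # VPA's recommendation -> no candidates
print(len(feasible_nodes(cluster, 4.0)))   # capped by maxAllowed -> every node fits
```

No amount of rescheduling changes the answer: the request simply exceeds every node's capacity, forever.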
The fix is non-negotiable:
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: api
      maxAllowed:
        cpu: "4"         # ← Always cap below your largest node size
        memory: 8Gi
      minAllowed:
        cpu: 100m
        memory: 128Mi
🔥 SRE Rule:
`maxAllowed` is not optional. It's the contract between VPA's ambitions and your cluster's physical reality.
🧠 Understanding the Three-Headed Beast
VPA isn't one thing. It's three components with three very different personalities:
┌────────────────────────────────────────────────────────────────┐
│                        VPA Architecture                        │
│                                                                │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐    │
│  │  Recommender   │  │    Updater     │  │   Admission    │    │
│  │                │  │                │  │   Controller   │    │
│  │ Watches        │  │ Evicts pods    │  │ Mutates pod    │    │
│  │ metrics via    │  │ whose requests │  │ spec at        │    │
│  │ metrics-server;│  │ drift too far  │  │ creation with  │    │
│  │ computes ideal │  │ from target;   │  │ VPA-recommended│    │
│  │ requests via a │  │ respects PDBs  │  │ values         │    │
│  │ histogram algo │  │ (if they exist)│  │                │    │
│  └────────────────┘  └────────────────┘  └────────────────┘    │
│                                                                │
│  All three talk to the VPA object. You control which ones      │
│  are active via updateMode.                                    │
└────────────────────────────────────────────────────────────────┘
The Recommender is harmless: it only writes recommendations. The Updater is where the chaos lives. It proactively evicts running pods to force them to restart with new requests. No warning, no graceful drain. Just SIGTERM and goodbye.
🔥 Conflict #1: The Scheduler's Promise vs. VPA's Revision
The scheduler operates on a single moment in time. At pod creation, it evaluates the pod's requests, filters nodes, scores them, and commits. That's it. It doesn't watch your pod after placement. It doesn't re-evaluate. It made its decision and moved on.
VPA operates on continuous time. It's always watching. Always revising. Never satisfied.
t=0        Pod created: requests cpu=200m
           Scheduler: "node-07 has 300m free → placing here ✅"

t=30m      VPA Recommender: "Actual usage is 900m → recommending 950m"
           VPA Updater: "Current requests too low → evicting pod 💣"

t=30m+1s   Pod evicted. Scheduler wakes up.
           Scheduler: "Find node with 950m CPU free..."
           node-07: "Only 150m free now (others moved in)"
           node-12: "950m free → placing here"

t=30m+8s   Pod running on node-12.
           Different zone. No image cache. Affinity re-evaluated.
           Your carefully tuned topology? Gone.
🤯 Wild Fact: The scheduler has no memory of why it placed a pod somewhere. Every reschedule starts from scratch. All the context (image locality, zone preference, anti-affinity satisfaction) is reconstructed from current cluster state, which has changed.
The SRE impact: This is an unplanned restart with a cold-start penalty (image pull, JVM warmup, cache miss), landing on a node chosen from the cluster's current state, which may look nothing like the state your topology was designed against 30 minutes ago.
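A toy Python model makes the statelessness concrete. Both placements run the same logic; only the cluster snapshot differs. The node names and the most-free-CPU scoring are simplifications, not the scheduler's real scoring:

```python
def place(free_cpu, request):
    """Filter on *current* free capacity, then pick the node with the most
    headroom. There is no memory of any earlier placement decision."""
    feasible = {node: free for node, free in free_cpu.items() if free >= request}
    return max(feasible, key=feasible.get) if feasible else None

# t=0: pod requests 200m; only node-07 has room
print(place({"node-07": 0.300, "node-12": 0.150}, 0.200))   # -> node-07

# t=30m: VPA raised the request to 950m and the cluster moved on
print(place({"node-07": 0.150, "node-12": 0.950}, 0.950))   # -> node-12
```

Same pod, same code path, completely different outcome, because the only input the scheduler consults is the snapshot in front of it.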
🔥 Conflict #2: VPA + HPA = Feedback Loop From Hell
This is the conflict that takes down clusters.
Run VPA and HPA both targeting CPU on the same deployment, and you've created a distributed control system with two competing controllers and no coordination mechanism:
Step 1: CPU spikes → HPA scales out (adds replicas)
Step 2: More replicas → load redistributed → CPU per pod drops
Step 3: VPA sees lower CPU per pod → recommends lower requests
Step 4: Lower requests → pods look cheaper → scheduler packs them tighter
Step 5: Tighter packing → CPU spikes again → back to Step 1
Meanwhile VPA is also evicting pods to apply new requests, which HPA interprets as replica count changes, which triggers its own scaling decisions...
It's two thermostats in one room fighting over the temperature. The room never stabilizes.
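You can watch the loop fail to converge in a few lines of Python. This is a deliberately crude model (fixed total load, idealized controllers), not real HPA/VPA code: HPA chases 70% utilization of requests while VPA keeps resetting requests to observed usage, so utilization stays pinned at 100% and HPA scales out forever:

```python
import math

def simulate(steps, load=10.0, replicas=5, request=2.0, hpa_target=0.7):
    """Each step: HPA resizes replicas toward hpa_target utilization,
    then VPA resets the per-pod request to the new observed usage."""
    history = []
    for _ in range(steps):
        usage = load / replicas                                   # per-pod CPU
        replicas = math.ceil(replicas * usage / (hpa_target * request))  # HPA step
        request = load / replicas                                 # VPA: request = usage
        history.append(replicas)
    return history

print(simulate(6))   # -> [8, 12, 18, 26, 38, 55]: replica count never stabilizes
```

Each controller's action invalidates the other's target, so the system ratchets instead of settling, exactly the two-thermostats problem.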
The absolute rule:
| Autoscaler | Controls | Metric Source |
|---|---|---|
| HPA | Replica count | RPS, queue depth, custom metrics |
| VPA | CPU/Memory requests per pod | Historical usage |
| Never | Both on CPU/Memory | Mutual destruction |
# ✅ Safe combination
# HPA scales on requests-per-second (not CPU)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # ← External/custom metric
      target:
        type: AverageValue
        averageValue: 1000m

# VPA owns CPU and memory right-sizing
# HPA never touches those dimensions
🔥 Pro Tip: Use KEDA to drive HPA scaling on queue depth, Kafka lag, or SQS length: signals completely orthogonal to CPU/memory. Then VPA can safely own the resource dimension without fighting anyone.
🔥 Conflict #3: VPA Evictions Don't Care About Your Traffic
VPA's Updater evicts pods when their actual requests diverge too far from the recommendation. It does respect PodDisruptionBudgets, but only if you've defined them.
Without a PDB, VPA can and will evict all replicas of a deployment simultaneously:
Deployment: api-server (5 replicas)
No PDB defined.
VPA Updater: "All 5 pods have requests that need updating"
VPA Updater: *evicts pod 1* *evicts pod 2* *evicts pod 3*...
api-server: 0 replicas running.
Your users: 503s.
Your SLO: burning.
With a PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "80%"   # VPA Updater must leave 80% running
  selector:
    matchLabels:
      app: api-server
VPA's Updater queries the PDB before each eviction. If the eviction would violate it, the Updater backs off and retries later, one pod at a time, rolling safely.
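The Updater's eviction budget behaves roughly like this sketch. It's a simplification: Kubernetes rounds a percentage minAvailable up, and real PDB accounting also considers pod readiness, which this ignores:

```python
import math

def allowed_evictions(running, min_available_pct):
    """How many pods may be evicted right now without violating a
    percentage-based minAvailable PDB (percentage rounds up)."""
    must_stay = math.ceil(running * min_available_pct / 100)
    return max(running - must_stay, 0)

print(allowed_evictions(5, 80))   # minAvailable: "80%" -> only 1 eviction allowed
print(allowed_evictions(5, 0))    # no PDB floor -> all 5 can go at once
```

With the 80% PDB, the Updater can only take one of five replicas per pass; without it, nothing stops a simultaneous wipeout.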
🚨 SRE Non-Negotiable: PDB is the seatbelt for VPA Auto mode. No PDB = no seatbelt. If you're running `updateMode: Auto` without PDBs, you're one VPA recommendation cycle away from a full outage.
⚙️ The Update Mode Dial: Know What You're Turning On
updateMode: "Off"
# 🟢 Recommender runs. Nothing applied.
# Read recommendations via: kubectl describe vpa <name>
# Perfect for: new workloads, learning phase, audit

updateMode: "Initial"
# 🟡 Admission controller applies recommendations at pod CREATION only.
# No evictions. Scheduler sees correct values upfront → no conflict!
# Perfect for: stateless apps, safe migration from Off

updateMode: "Recreate"
# 🟠 Applies at creation AND evicts running pods whose requests drift
# from the recommendation. Today this is effectively the same loop as
# Auto; Auto is reserved to adopt in-place updates in future versions.

updateMode: "Auto"
# 🔴 Full loop. Proactive evictions. Continuous tuning.
# Perfect for: stateless apps WITH PDBs and bounded maxAllowed.
# Dangerous for: stateful apps, anything without PDB.
💡 Google SRE Graduation Ladder:
`Off` (2-4 weeks) → `Initial` → `Recreate` → `Auto` (only with PDB + `maxAllowed`)
🤯 Interesting Fact #2: VPA Uses a Histogram, Not an Average
Most engineers assume VPA recommends based on average CPU/memory usage. It doesn't.
VPA's Recommender builds an exponentially decaying histogram of observed usage samples. By default it recommends at roughly the 90th percentile for CPU, and handles memory even more conservatively, with OOM events bumping the estimate upward.
This means:
- VPA recommendations are spiky-traffic-aware: they account for your worst 10% of traffic moments
- Old samples decay in weight over time: recent spikes matter more than ancient ones
- Memory is handled more conservatively: OOM kills are weighted more heavily than CPU throttling
Why this matters for the scheduler conflict:
Average CPU: 200m → Scheduler would have placed the pod fine
P90 CPU:     850m → VPA recommends 850m
→ Scheduler now needs 850m free on a node, not 200m
→ Feasible node set shrinks dramatically
The scheduler was designed around declared requests. VPA dynamically moves that target based on statistical modeling of your actual workload. The two systems are speaking different languages about the same resource.
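Here's a small Python sketch of the idea: a decay-weighted 90th percentile versus a plain average. The real Recommender uses bucketed histograms and its own half-lives, so treat the sample data and the `half_life` parameter as illustrative only:

```python
def decayed_p90(samples, half_life=24.0):
    """samples: list of (age_hours, cpu_cores). Newer samples carry more
    weight; return the weighted 90th-percentile CPU value."""
    weighted = sorted(
        ((cpu, 0.5 ** (age / half_life)) for age, cpu in samples),
        key=lambda pair: pair[0],
    )
    total = sum(w for _, w in weighted)
    acc = 0.0
    for cpu, w in weighted:
        acc += w
        if acc >= 0.9 * total:
            return cpu
    return weighted[-1][0]

# Mostly idle for two days, with a burst of recent spikes (hypothetical data)
samples = [(48, 0.2)] * 50 + [(1, 0.85)] * 10

print(decayed_p90(samples))                        # -> 0.85 (the spike value)
print(sum(c for _, c in samples) / len(samples))   # plain average, roughly 0.31
```

An average-based tuner would have requested around 300m; the decay-weighted percentile requests 850m, which is exactly the gap that shrinks the scheduler's feasible node set.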
🗺️ Decision Framework: Should You Even Use VPA?
Is your workload stateless (Deployment)?
├── YES → Does it have predictable, well-tuned requests from load testing?
│         ├── YES → Skip VPA. Use HPA on custom metrics.
│         └── NO  → VPA is valuable. Start with updateMode: Off.
│                   Validate recommendations for 2 weeks.
│                   Graduate: Initial → Auto (with PDB + maxAllowed)
│
└── NO (StatefulSet / batch / ML training)
    └── NEVER use updateMode: Auto.
        Use updateMode: Off for recommendations only.
        Apply manually during maintenance windows.
        Reason: stateful pods can't safely restart mid-operation.
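If you want this framework in code form (for a cluster-policy linter, say), it reduces to a tiny function. The return strings are my own labels, not any Kubernetes API:

```python
def vpa_mode_advice(stateless: bool, well_tuned: bool) -> str:
    """Encode the decision tree above as a lookup."""
    if not stateless:
        # StatefulSet / batch / ML training: never let VPA evict
        return "Off: recommend-only, apply during maintenance windows"
    if well_tuned:
        return "skip VPA: use HPA on custom metrics"
    return "Off -> Initial -> Auto (with PDB + maxAllowed)"

print(vpa_mode_advice(stateless=True, well_tuned=False))
```

The point of writing it down is that there is no branch where `Auto` appears without both a PDB and `maxAllowed` attached.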
📊 SRE Monitoring Pack for VPA
# Track VPA recommendation vs actual requests: catch divergence early
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target

# VPA-evicted pods: should be predictable and low
kube_pod_status_reason{reason="Evicted"}

# Pending pods after VPA eviction: signals over-recommendation
kube_pod_status_phase{phase="Pending"} > 0

# Scheduler failures after VPA update: catch the unschedulable bomb
scheduler_pending_pods{queue="unschedulable"}

# Alert: pod evicted AND pending for > 2 min = VPA caused a scheduling failure
(kube_pod_status_reason{reason="Evicted"} > 0)
and (kube_pod_status_phase{phase="Pending"} > 0)
📋 TL;DR Cheat Sheet
| Problem | Root Cause | Fix |
|---|---|---|
| Pod permanently Pending after VPA update | Recommendation exceeds node capacity | Set `maxAllowed` below largest node |
| HPA and VPA fighting | Both targeting CPU | HPA on custom/external metrics only |
| VPA evicted all replicas simultaneously | No PodDisruptionBudget | Define PDB with `minAvailable: 80%` |
| Scheduler placed pod in wrong zone after eviction | Scheduler has no memory of prior placement | Use `topologySpreadConstraints` (re-enforced every schedule) |
| VPA recommendations too aggressive | Workload has traffic spikes | Tune `targetCPUPercentile` in VPA config |
If VPA has ever woken you up at 3am, drop a 🔥 in the comments. You're not alone.
Follow for more deep dives into the Kubernetes internals that actually matter in production 🚀