Nijo George Payyappilly

βš”οΈ Kubernetes Civil War: When VPA Fights the Scheduler (And Your Pods Pay the Price)

"The scheduler made a promise. VPA broke it. Your users felt it."


🎯 The Setup

You deployed VPA. Requests are auto-tuned. Nodes are optimally packed. You feel smart.

Then 3am happens. PagerDuty fires. Half your production pods are in Pending. The other half just restarted cold, in a different zone, with no image cache.

VPA didn't malfunction. It did exactly what it was designed to do. The problem is that VPA and the Kubernetes scheduler operate on fundamentally incompatible assumptions β€” and nobody told you they were quietly at war inside your cluster.

This post is that warning.


🀯 Interesting Fact #1: VPA Can Make Your Pod Permanently Unschedulable

Not temporarily unschedulable. Permanently.

Here's how:

VPA's Recommender watches your pod's actual CPU usage over time. Your pod runs on a node with 8 CPUs. It consistently pegs at 7.5 cores. VPA sees this and responsibly recommends:

```yaml
status:
  recommendation:
    containerRecommendations:
    - containerName: api
      target:
        cpu: "14"    # ← VPA's honest recommendation
        memory: "24Gi"
```

Honest? Yes. Schedulable? Absolutely not.

Your entire cluster runs 8-CPU nodes. No node can ever fit requests: cpu: 14. The VPA Updater evicts your pod. The scheduler tries to place it. Filters every node. Finds zero candidates.

```
Events:
  Warning  FailedScheduling  0/12 nodes available:
           12 Insufficient cpu.
```

Your pod sits in Pending forever. VPA just self-destructed your workload with good intentions.

The fix is non-negotiable:

```yaml
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: api
      maxAllowed:
        cpu: "4"        # ← Always cap below your largest node size
        memory: 8Gi
      minAllowed:
        cpu: 100m
        memory: 128Mi
```

πŸ”₯ SRE Rule: maxAllowed is not optional. It's the contract between VPA's ambitions and your cluster's physical reality.
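Conceptually, the cap reduces VPA's raw target to a clamp between the two bounds. A minimal Python sketch of the idea (not VPA's actual code):

```python
def clamp_recommendation(target_m, min_m, max_m):
    """Clamp a raw VPA target between minAllowed and maxAllowed (millicores)."""
    return max(min_m, min(target_m, max_m))

# The "14 CPUs on a cluster of 8-CPU nodes" scenario, with maxAllowed cpu "4":
clamp_recommendation(14000, 100, 4000)  # capped to 4000m, which still schedules
```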


🧠 Understanding the Three-Headed Beast

VPA isn't one thing. It's three components with three very different personalities:

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        VPA Architecture                          β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚   Recommender   β”‚   β”‚    Updater      β”‚   β”‚   Admission   β”‚   β”‚
β”‚  β”‚                 β”‚   β”‚                 β”‚   β”‚  Controller   β”‚   β”‚
β”‚  β”‚  πŸ‘ Watches     β”‚   β”‚  πŸ’£ Evicts pods  β”‚   β”‚  🎭 Mutates   β”‚   β”‚
β”‚  β”‚  metrics via    β”‚   β”‚  whose requests β”‚   β”‚  pod spec at  β”‚   β”‚
β”‚  β”‚  metrics-server β”‚   β”‚  drift too far  β”‚   β”‚  creation     β”‚   β”‚
β”‚  β”‚  Computes ideal β”‚   β”‚  from target    β”‚   β”‚  with VPA     β”‚   β”‚
β”‚  β”‚  requests using β”‚   β”‚  Respects PDBs  β”‚   β”‚  recommended  β”‚   β”‚
β”‚  β”‚  histogram algo β”‚   β”‚  (if they exist)β”‚   β”‚  values       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                  β”‚
β”‚         All three talk to the VPA object. You control            β”‚
β”‚         which ones are active via updateMode.                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

The Recommender is harmless β€” it only writes recommendations. The Updater is where the chaos lives. It proactively evicts running pods to force them to restart with new requests. No heads-up, no coordination with your rollout: just an eviction, a SIGTERM, and goodbye.


πŸ’₯ Conflict #1 β€” The Scheduler's Promise vs. VPA's Revision

The scheduler operates on a single moment in time. At pod creation, it evaluates the pod's requests, filters nodes, scores them, and commits. That's it. It doesn't watch your pod after placement. It doesn't re-evaluate. It made its decision and moved on.

VPA operates on continuous time. It's always watching. Always revising. Never satisfied.

```
t=0   Pod created: requests cpu=200m
      Scheduler: "node-07 has 300m free β†’ placing here βœ…"

t=30m VPA Recommender: "Actual usage is 900m β†’ recommending 950m"
      VPA Updater: "Current requests too low β†’ evicting pod πŸ’£"

t=30m+1s  Pod evicted. Scheduler wakes up.
           Scheduler: "Find node with 950m CPU free..."
           node-07: "Only 150m free now (others moved in)"
           node-12: "950m free β†’ placing here"

t=30m+8s  Pod running on node-12.
           Different zone. No image cache. Affinity re-evaluated.
           Your carefully tuned topology? Gone.
```

🀯 Wild Fact: The scheduler has no memory of why it placed a pod somewhere. Every reschedule starts from scratch. All the context β€” image locality, zone preference, anti-affinity satisfaction β€” is reconstructed from current cluster state, which has changed.

The SRE impact: This is an unplanned restart with cold start penalty (image pull, JVM warmup, cache miss) landing on a node the scheduler chose based on a cluster state from 30 minutes ago, not the state you designed for.
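The shrinking-candidate effect is easy to see in a toy model of the scheduler's filter phase (node names and free-CPU numbers below are invented for illustration):

```python
def feasible_nodes(free_cpu_by_node, request_m):
    """Scheduler filter phase: keep only nodes with enough free CPU (millicores)."""
    return [n for n, free in free_cpu_by_node.items() if free >= request_m]

cluster = {"node-07": 150, "node-12": 950, "node-03": 400}
feasible_nodes(cluster, 200)  # original request: multiple candidates
feasible_nodes(cluster, 950)  # post-VPA request: only node-12 qualifies
```

Same cluster, same instant; only the declared request changed, and the feasible set collapsed to one node.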


πŸ’₯ Conflict #2 β€” VPA + HPA = Feedback Loop From Hell

This is the conflict that takes down clusters.

Run VPA and HPA both targeting CPU on the same deployment, and you've created a distributed control system with two competing controllers and no coordination mechanism:

```
Step 1: CPU spikes β†’ HPA scales out (adds replicas)
Step 2: More replicas β†’ load redistributed β†’ CPU per pod drops
Step 3: VPA sees lower CPU per pod β†’ recommends lower requests
Step 4: Lower requests β†’ pods look cheaper β†’ scheduler packs them tighter
Step 5: Tighter packing β†’ CPU spikes again β†’ back to Step 1
```

Meanwhile VPA is also evicting pods to apply new requests, which HPA interprets as replica count changes, which triggers its own scaling decisions...

It's two thermostats in one room fighting over the temperature. The room never stabilizes.
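You can reproduce the oscillation in a toy model. The HPA line uses the real `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)` formula; the VPA line is oversimplified to "chase observed usage". All numbers are made up:

```python
import math

def simulate(total_load_m=2000, replicas=2, request_m=500.0,
             target_util=0.6, steps=6):
    """Two uncoordinated controllers reacting to the same per-pod CPU signal."""
    history = []
    for _ in range(steps):
        per_pod = total_load_m / replicas
        util = per_pod / request_m
        # HPA: desiredReplicas = ceil(currentReplicas * currentUtil / targetUtil)
        replicas = max(1, math.ceil(replicas * util / target_util))
        # VPA (oversimplified): set requests to observed usage, evict to apply
        request_m = per_pod
        history.append((replicas, round(request_m)))
    return history

simulate()  # replica counts keep swinging instead of settling
```

Run it and the replica count bounces between values forever; neither controller ever sees a stable input, because the other keeps moving it.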

The absolute rule:

| Autoscaler | Controls | Metric Source |
| --- | --- | --- |
| HPA | Replica count | RPS, queue depth, custom metrics |
| VPA | CPU/memory requests per pod | Historical usage |
| Never | Both on CPU/memory | Mutual destruction |

```yaml
# βœ… Safe combination
# HPA scales on requests-per-second (not CPU)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # ← External/custom metric
      target:
        type: AverageValue
        averageValue: "1k"          # 1000 requests/sec per pod

# VPA owns CPU and memory right-sizing
# HPA never touches those dimensions
```

πŸ”₯ Pro Tip: Use KEDA for HPA scaling on queue depth, Kafka lag, or SQS length β€” completely orthogonal to CPU/memory. Then VPA can safely own the resource dimension without fighting anyone.


πŸ’₯ Conflict #3 β€” VPA Evictions Don't Care About Your Traffic

VPA Updater evicts pods when their actual requests diverge too far from the recommendation. It does respect PodDisruptionBudgets β€” but only if you've defined them.

Without a PDB, VPA can and will evict all replicas of a deployment simultaneously:

```
Deployment: api-server (5 replicas)
No PDB defined.

VPA Updater: "All 5 pods have requests that need updating"
VPA Updater: *evicts pod 1* *evicts pod 2* *evicts pod 3*...

api-server: 0 replicas running.
Your users: 503s.
Your SLO: burning.
```

With a PDB:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "80%"   # VPA Updater must leave 80% running
  selector:
    matchLabels:
      app: api-server
```

VPA Updater queries the PDB before each eviction. If the eviction would violate it, the Updater backs off and retries later β€” one pod at a time, rolling safely.
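The gate the Updater applies boils down to simple arithmetic. A sketch (percentage `minAvailable` rounds up, matching how the disruption controller scales percentages):

```python
import math

def eviction_allowed(running, desired, min_available="80%"):
    """Would evicting one more pod still satisfy the PDB?"""
    if min_available.endswith("%"):
        # percentage minAvailable is rounded up to a whole pod count
        floor = math.ceil(desired * int(min_available[:-1]) / 100)
    else:
        floor = int(min_available)
    return running - 1 >= floor

eviction_allowed(5, 5)  # True: 4 survivors still meet ceil(80% of 5) = 4
eviction_allowed(4, 5)  # False: the Updater backs off and retries later
```

That second call is why the rollout proceeds one pod at a time: each eviction must individually clear the budget.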

🚨 SRE Non-Negotiable: PDB is the seatbelt for VPA Auto mode. No PDB = no seatbelt. If you're running updateMode: Auto without PDBs, you're one VPA recommendation cycle away from a full outage.


βš™οΈ The Update Mode Dial β€” Know What You're Turning On

```yaml
updateMode: "Off"
# 🟒 Recommender runs. Nothing applied.
# Read recommendations via: kubectl describe vpa <name>
# Perfect for: new workloads, learning phase, audit

updateMode: "Initial"
# 🟑 Admission controller applies recommendations at pod CREATION only.
# No evictions. Scheduler sees correct values upfront β€” no conflict!
# Perfect for: stateless apps, safe migration from Off

updateMode: "Recreate"
# 🟠 Applies at creation AND evicts running pods to update them.
# Guarantees a restart on every applied change.

updateMode: "Auto"
# πŸ”΄ Currently equivalent to Recreate; reserved for future in-place
# updates. Full loop: proactive evictions, continuous tuning.
# Perfect for: stateless apps WITH PDBs and bounded maxAllowed.
# Dangerous for: stateful apps, anything without PDB.
```

πŸ’‘ Google SRE Graduation Ladder:
Off (2-4 weeks) β†’ Initial β†’ Auto (only with PDB + maxAllowed)


🀯 Interesting Fact #2: VPA Uses a Histogram, Not an Average

Most engineers assume VPA recommends based on average CPU/memory usage. It doesn't.

VPA's Recommender builds an exponentially decaying histogram of observed usage samples. By default it recommends at roughly the 90th percentile for both CPU and memory, with memory handled more conservatively: a safety margin is added, and OOM events bump the recommendation upward.

This means:

  • VPA recommendations are spiky-traffic-aware β€” they account for your worst 10% of traffic moments
  • Old samples decay in weight over time β€” recent spikes matter more than ancient ones
  • Memory is handled more conservatively β€” OOM kills are weighted more heavily than CPU throttling
Why this matters for the scheduler conflict:

```
Average CPU: 200m  β†’ Scheduler would have placed fine
P90 CPU:     850m  β†’ VPA recommends 850m
Scheduler now needs 850m free on a node, not 200m
Feasible node set shrinks dramatically
```

The scheduler was designed around declared requests. VPA dynamically moves that target based on statistical modeling of your actual workload. The two systems are speaking different languages about the same resource.
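The decay-weighted percentile idea fits in a few lines. This is a toy model using VPA's default 24-hour half-life and 90th-percentile target, not VPA's actual implementation (which buckets samples into a histogram rather than keeping them individually):

```python
def decayed_p90(samples, half_life_h=24.0):
    """Weighted 90th percentile where a sample's weight halves every 24h.

    samples: (age_hours, cpu_millicores) pairs.
    """
    weighted = sorted((cpu, 0.5 ** (age / half_life_h)) for age, cpu in samples)
    total = sum(w for _, w in weighted)
    acc = 0.0
    for cpu, w in weighted:
        acc += w
        if acc >= 0.9 * total:
            return cpu
    return weighted[-1][0]

# 18 hours of ~200m baseline plus two fresh spikes near 850-900m:
history = [(h, 200) for h in range(1, 19)] + [(0.5, 850), (0.2, 900)]
decayed_p90(history)  # recommends 850m, nowhere near the ~260m average
```

Two properties of the conflict fall out directly: recent spikes dominate (their weight is near 1.0), and a single burst can move the recommendation far above what the scheduler originally placed against.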


πŸ—ΊοΈ Decision Framework: Should You Even Use VPA?

```
Is your workload stateless (Deployment)?
β”œβ”€β”€ YES β†’ Does it have predictable, well-tuned requests from load testing?
β”‚         β”œβ”€β”€ YES β†’ Skip VPA. Use HPA on custom metrics.
β”‚         └── NO  β†’ VPA is valuable. Start with updateMode: Off.
β”‚                   Validate recommendations for 2 weeks.
β”‚                   Graduate: Initial β†’ Auto (with PDB + maxAllowed)
β”‚
└── NO (StatefulSet / batch / ML training)?
          └── NEVER use updateMode: Auto.
              Use updateMode: Off for recommendations only.
              Apply manually during maintenance windows.
              Reason: stateful pods can't safely restart mid-operation.
```
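The tree above collapses to a small lookup. A sketch for readability (the strings are shorthand, not API values):

```python
def vpa_strategy(stateless: bool, requests_already_tuned: bool) -> str:
    """Encode the VPA adoption decision tree as a single function."""
    if not stateless:
        # StatefulSet / batch / ML training: recommendations only
        return "Off: recommendations only, apply during maintenance windows"
    if requests_already_tuned:
        return "skip VPA: use HPA on custom metrics"
    return "start Off, validate ~2 weeks, graduate Initial -> Auto (PDB + maxAllowed)"
```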

πŸ“Š SRE Monitoring Pack for VPA

```promql
# Track VPA recommendation vs actual requests β€” catch divergence early
kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target

# VPA-evicted pods β€” should be predictable and low
kube_pod_status_reason{reason="Evicted"}

# Pending pods after VPA eviction β€” signals over-recommendation
kube_pod_status_phase{phase="Pending"} > 0

# Scheduler failures after VPA update β€” catch the unschedulable bomb
scheduler_unschedulable_pods_total

# Alert: pod evicted AND pending for > 2 min = VPA caused scheduling failure
(kube_pod_status_reason{reason="Evicted"} > 0)
  and (kube_pod_status_phase{phase="Pending"} > 0)
```

🏁 TL;DR Cheat Sheet

| Problem | Root Cause | Fix |
| --- | --- | --- |
| Pod permanently Pending after VPA update | Recommendation exceeds node capacity | Set `maxAllowed` below largest node |
| HPA and VPA fighting | Both targeting CPU | HPA on custom/external metrics only |
| VPA evicted all replicas simultaneously | No PodDisruptionBudget | Define PDB with `minAvailable: 80%` |
| Scheduler placed pod in wrong zone after eviction | Scheduler has no memory of prior placement | Use `topologySpreadConstraints` (re-enforced every schedule) |
| VPA recommendations too aggressive | Workload has traffic spikes | Tune the recommender's `--target-cpu-percentile` flag |

If VPA has ever woken you up at 3am, drop a πŸ”₯ in the comments. You're not alone.

Follow for more deep dives into the Kubernetes internals that actually matter in production πŸš€
