Daya Shankar
Memory Ballooning Effects in Virtualized Cloud Environments

Memory ballooning is a host memory reclaim method used during VM overcommit. The hypervisor inflates a balloon driver inside a VM to claw back RAM. 

It can avoid host swapping, but it also shrinks guest page cache and can trigger paging. In Kubernetes, you see MemoryPressure, pod evictions, and tail-latency spikes. 

What memory ballooning is

Ballooning is cooperative reclaim. It’s not “free memory.”

On VMware, the balloon driver (vmmemctl) works with the host to reclaim pages the guest considers least valuable. 

VMware’s own perf guidance is blunt: avoid overcommit that forces regular host swapping, because that’s where performance collapses. 

What you actually see in a managed Kubernetes service

You don’t see “ballooned MB.” You see consequences.

Kubelet enforces node-pressure eviction. The default hard threshold on Linux is memory.available<100Mi, and hard evictions have no grace period.
So any reclaim event that drops memory.available below that line can turn into kills.
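Managed services often don't let you change these thresholds, but where a platform does accept kubelet configuration, the hard-eviction defaults can be made explicit. The values below are the documented Linux defaults; how the file is delivered to the node is an assumption about your provisioning path:

```yaml
# KubeletConfiguration fragment: hard eviction thresholds.
# These mirror the Linux defaults; hard evictions have no grace period.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
  nodefs.inodesFree: "5%"
```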

How ballooning pressure turns into outages on K8s nodes

This is the failure chain you should expect under overcommit.

  1. Cache gets punched → more disk reads → p95 climbs.
  2. Paging starts → jitter rises.
  3. Kubelet evicts → restarts + thundering herd.

You don’t need hypervisor access to catch this. You just need node metrics and events.

What AceCloud gives you to control blast radius

You control node sizing and node-group policy, not the host reclaim knobs.

AceCloud Managed Kubernetes exposes worker node configurations such as 2 vCPU/4 GiB, 4 vCPU/8 GiB, and 8 vCPU/16 GiB (per their published comparison table).
If you need bigger worker nodes, AceCloud’s flavor catalog shows Standard Instance options like S3a.2xlarge (8 vCPU/32 GiB) up through S3a.8xlarge (32 vCPU/128 GiB) and beyond. 

Guardrails that work when overcommit is “yes”

These are defaults you can deploy without cluster-specific tuning.

Split worker node groups by risk

One node pool for prod latency. One for batch.

  • Protected pool: ingress, API, user-facing services.
  • Best-effort pool: ETL, async jobs, rebuilds.

This keeps batch from turning your prod nodes into the provider’s pressure valve.
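One way to wire the split is labels plus taints on the node groups, so batch work can only land on the best-effort pool. The label/taint names here are illustrative, not anything AceCloud defines:

```yaml
# Assumes the best-effort node group carries label pool=batch and
# taint pool=batch:NoSchedule set at node-group creation time.
apiVersion: v1
kind: Pod
metadata:
  name: etl-job
spec:
  nodeSelector:
    pool: batch          # only schedule onto the batch pool
  tolerations:
    - key: pool
      operator: Equal
      value: batch
      effect: NoSchedule # allowed onto the tainted batch nodes
  containers:
    - name: etl
      image: registry.example.com/etl:latest  # placeholder image
```

Protected-pool pods simply omit the toleration, so the taint keeps them and batch apart by default.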

Enforce requests/limits everywhere

The scheduler packs based on requests. If you don’t set them, you’re gambling.

Use Kubernetes resource requests/limits for CPU and memory.
For latency pods, run Guaranteed QoS: requests == limits.

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "1"
    memory: "2Gi"

Keep headroom by design

If you can’t tune Node Allocatable, you simulate it with request budgets.

Kubernetes calls this concept Node Allocatable (reserving resources for system daemons).
In a managed service, you may not get to set kube-reserved / system-reserved, so leave headroom in pod requests.

Baseline rule (protected nodes): don’t schedule more than 75% of node RAM by requested memory.
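Kubernetes has no per-node request budget, but if a namespace maps roughly one-to-one to a node group, a ResourceQuota is a workable approximation: cap total requested memory at ~75% of the pool's aggregate RAM. The numbers below assume a hypothetical four-node pool of 8 GiB workers (32 GiB total):

```yaml
# ~75% of 4 × 8 GiB = 24Gi of total requested memory for this namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-memory-budget
  namespace: prod
spec:
  hard:
    requests.memory: 24Gi   # scheduler-visible budget
    limits.memory: 32Gi     # hard ceiling across all pods
```

It's an approximation because the scheduler can still pack one node unevenly, but it stops the namespace as a whole from overrunning the pool.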

Pod density: don’t chase 110

Kubernetes “supported scale” guidance says no more than 110 pods per node.
Some platforms can configure higher, but pod IP and CNI limits usually bite first. 
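Where your platform accepts per-node-group kubelet configuration, the cap is a single field. Pick a value sized to node memory (40 here is an example for a mid-size protected node, not a recommendation):

```yaml
# KubeletConfiguration fragment: cap pods per node below the 110 ceiling.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 40
```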

Use caps that match memory, not bragging rights.

Starting caps for AceCloud-sized worker nodes

Assumptions: typical daemonsets, no hugepages/DPDK, overcommit exists somewhere upstream.

| Worker node | Role | Max total pod memory requests | Pod cap | Why |
| --- | --- | --- | --- | --- |
| 4 GiB | best-effort | 2.5–3.0 GiB | 15–25 | leaves OS + kube headroom |
| 8 GiB | protected | 5.5–6.0 GiB | 25–40 | avoids eviction on small dips |
| 16 GiB | protected | 11–12 GiB | 40–70 | room for spikes + cache |
| 32 GiB | mixed | 24–26 GiB | 70–110 | only if requests are real |

Anchor: the 110-pods/node guidance is a ceiling, not a target. 

Evictions: make them predictable

If you can’t set kubelet flags, you still control which pods die first.

  • Assign PriorityClasses.
  • Put best-effort on best-effort nodes.
  • Put strict limits on batch so it can’t eat the node.
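A sketch of the priority split (class names and values are illustrative):

```yaml
# Higher value = preempted and evicted last under pressure.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-critical
value: 100000
globalDefault: false
description: "User-facing services on protected nodes"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
description: "Preemptible batch; first to go under pressure"
```

Pods opt in with `priorityClassName: prod-critical` (or `batch-low`) in their spec.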

Know the kubelet defaults: memory.available<100Mi is the hard tripwire on Linux. 

Swap: pick a stance and document it

Swap support exists now, but it’s not “turn it on and pray.”

Kubernetes documents swap memory management and node swap behaviors (including LimitedSwap). 

Practical policy:

  • Protected nodes: swap off unless you’ve load-tested tail latency with swap on.
  • Best-effort nodes: consider LimitedSwap if you accept slower jobs.
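On node groups where you do control kubelet configuration and swap is provisioned on the host, LimitedSwap is two fields. Availability depends on your Kubernetes version and whether the provider enables node swap at all, so treat this as a sketch:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false        # required: kubelet refuses to start on swap otherwise
memorySwap:
  swapBehavior: LimitedSwap
```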

What to alert on (works in any managed K8s)

You don’t need vCenter. You need signals.

Kubernetes-level

  • Node condition: MemoryPressure=True
  • Events: eviction messages

(Default eviction behavior is documented upstream.) 

Node-level (Prometheus / node-exporter)

Alert on:

  • sustained low MemAvailable
  • paging activity (pgmajfault, pswpin, pswpout)
  • memory PSI pressure rising

If those light up during latency spikes, you’re in reclaim/paging territory.
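Assuming node-exporter's default collectors, those three signals map to Prometheus rules like these. The thresholds are starting points to tune, not tested values:

```yaml
groups:
  - name: node-memory-reclaim
    rules:
      - alert: NodeMemAvailableLow
        # sustained low available memory as a fraction of total
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
      - alert: NodeMajorPageFaults
        # paging activity: major faults mean reads from disk
        expr: rate(node_vmstat_pgmajfault[5m]) > 50
        for: 10m
      - alert: NodeSwapActivity
        # any sustained swap-in/out is suspect on protected nodes
        expr: rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m]) > 0
        for: 10m
      - alert: NodeMemoryPSI
        # memory pressure stall information (kernels with PSI enabled)
        expr: rate(node_pressure_memory_waiting_seconds_total[5m]) > 0.1
        for: 10m
```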

Where AceCloud fits in this story

This is how you use their catalog without lying to yourself.

  • Start with AceCloud’s published worker sizes (4/8/16 GiB) for general pools. 
  • For memory-heavy services (Kafka, JVM heaps, model servers), move the protected pool to bigger flavors from the standard catalog (ex: 8 vCPU/32 GiB and up). 
  • Scale node groups earlier instead of packing nodes to the cliff. Node-group autoscaling is part of their managed Kubernetes offering. 

If you want the “tight” version for your cluster

You can do it later from any terminal with cluster access, but you don’t need it to start.

  • Use the caps table above.
  • Enforce requests/limits + PriorityClasses.
  • Split node groups.
  • Keep 20–25% memory headroom on protected nodes.

That stops the common eviction storm even when the provider is running overcommit behind the scenes.
