You've probably seen "4 vCPU" on a pricing page and wondered what that actually means. Is it 4 CPU cores? 4 threads? Something else entirely?
The short answer: a vCPU is whatever the hell your cloud provider decides. Sometimes it's a share of CPU time, sometimes it's an actual hardware thread, and sometimes providers offer both - AWS has regular instances with dedicated threads and burstable instances with CPU credits.
This post focuses on the quota-based model, because that's where the confusing behavior lives. Understanding how it works will save you debugging headaches.
To be fair, I don't actually think vCPUs are a scam, usually. But it made you click, and now I get to explain some fun stuff about the Linux CFS!
If you're a visual learner and want to skip the text, try this out:
The interactive component doesn't work on dev.to - click the image to land on my own blog, where it works!
Why vCPUs Exist
Most web applications don't need constant CPU. Your server processes a request in milliseconds, then sits idle waiting for the next one. Even under load, CPU usage typically looks like spikes rather than a flat line.
Giving each customer a dedicated physical core would waste most of that capacity. Instead, providers give you a quota of CPU time: a baseline you can always use, plus a burst allowance for spikes. If your API needs 50ms of CPU three times per second, that's 150ms out of 1000ms - why pay for the other 850ms?
There's another reason: "1 core" is meaningless as a unit. A 2025 AMD EPYC core and a 2012 Xeon have completely different performance. A time-based quota at least gives you a consistent allocation, even if what you can accomplish in that time still depends on the underlying hardware.
To use this model well, you need to understand how it works under the hood.
The CFS Bandwidth Controller
Linux uses the CFS (Completely Fair Scheduler) bandwidth controller to manage CPU quotas. Three parameters control everything:
- cpu.cfs_quota_us: How much CPU time (in microseconds) you get per period
- cpu.cfs_period_us: How long each accounting period is (also in microseconds)
- cpu.cfs_burst_us: The maximum accumulated run-time you can bank (in microseconds)
These are cgroup v1 names, which I find easier to understand. If you're configuring this yourself, you're probably on cgroup v2, which uses cpu.max and cpu.max.burst instead.
For example, if your quota is 25,000µs (25ms) and your period is 100,000µs (100ms), you can use 25ms of CPU time every 100ms. That's equivalent to 25% of a single CPU's time - and that might be what "1 vCPU" means on a shared instance.
The math: quota / period = your CPU share. A quota of 50ms per 100ms period means 50% of a CPU's time. Note that this is aggregate time across all threads in the cgroup, not pinned to a single core.
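If you want to sanity-check what your container actually got, here's a minimal Python sketch. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup (the usual setup in recent container runtimes); adjust the path if your cgroup lives elsewhere.

```python
from pathlib import Path

def cpu_share(cgroup_dir: str = "/sys/fs/cgroup") -> float | None:
    # cpu.max holds "<quota> <period>" in microseconds, e.g. "25000 100000"
    quota, period = Path(cgroup_dir, "cpu.max").read_text().split()
    if quota == "max":  # "max" means no bandwidth limit at all
        return None
    return int(quota) / int(period)  # 25000 / 100000 = 0.25 of one CPU

if __name__ == "__main__":
    share = cpu_share()
    print("no CPU limit" if share is None else f"~{share:.2f} vCPU ({share:.0%} of one CPU's time)")
```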
The burst parameter allows unused quota to accumulate. If your app only uses 10ms during one period, the leftover 15ms gets added to a burst balance (up to the burst cap). When a later request needs more than your baseline quota, it can draw from this balance to temporarily exceed the baseline. The burst cap can be anywhere from 0 up to your quota, and it's often disabled (set to 0) by default.
Once the burst balance hits zero, you're capped at your baseline until it recovers. It only recovers when you're using less than your baseline - which becomes difficult when requests are queuing up.
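To make that accounting concrete, here's a toy per-period model in Python. This is not the kernel's actual implementation - among other things, real throttled work spills into the next period instead of being dropped - but it shows how the balance fills up during quiet periods and drains during spikes.

```python
def simulate(demands_ms, quota_ms=25, burst_cap_ms=25):
    """Per-period toy model: bank unused quota up to the cap, spend it on spikes."""
    balance = 0.0
    for demand in demands_ms:
        available = quota_ms + balance   # baseline plus whatever is banked
        used = min(demand, available)
        throttled = demand - used        # work that has to wait for a later period
        balance = min(burst_cap_ms, available - used)
        yield used, throttled, balance

# Mostly-idle workload with one sustained spike (CPU demand per 100ms period):
for used, throttled, balance in simulate([10, 10, 60, 60, 5]):
    print(f"used={used:5.1f}ms  throttled={throttled:5.1f}ms  balance={balance:5.1f}ms")
```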
See It In Action
Play with this simulator to build intuition for how quota and period settings affect latency:
The interactive component doesn't work on dev.to - click the image to land on my own blog, where it works!
Try these experiments:
- Switch to "Spiky" workload and watch the balance drain to zero, triggering throttling (red dashed line)
- Increase the baseline to 25% - notice how the balance stays healthier and throttling decreases
- With "Bursty" workload and low baseline, see how bursts drain balance but it recovers during idle periods
What Happens When You Exceed Your Quota
Let's say you have a 25ms quota per 100ms period, and a request comes in that needs 30ms of CPU time to process.
- Your process starts running
- After 25ms, the kernel sees you've used your quota
- Your process gets paused until the next period starts
- 75ms later, a new period begins and you get 25ms more
- Your process finishes the remaining 5ms
Total wall-clock time: 105ms for 30ms of actual work.
Your process wasn't slow - it was waiting. When you exceed your quota, latency doesn't degrade gracefully. It jumps by the length of the remaining period. That sucks, especially for your P99 latency!
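Here's a back-of-the-envelope version of that timeline in Python, assuming the request arrives at the start of a fresh period, there's no burst balance, and nothing else is running in the cgroup:

```python
def wall_clock_ms(cpu_ms, quota_ms=25, period_ms=100):
    """Rough wall-clock latency for a request needing cpu_ms of CPU time."""
    elapsed, remaining = 0.0, cpu_ms
    while remaining > quota_ms:
        remaining -= quota_ms  # burn this period's quota...
        elapsed += period_ms   # ...then sit throttled until the next period starts
    return elapsed + remaining # the final chunk fits within the quota and just runs

print(wall_clock_ms(30))  # 105.0 -> 30ms of work, ~105ms of latency
print(wall_clock_ms(50))  # 125.0 -> 50ms of work, ~125ms of latency
```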
When This Works Well
The quota model fits most web workloads because they're inherently bursty - short CPU spikes with idle time in between. If your app averages 10% CPU but occasionally spikes to 50%, the burst balance absorbs those spikes while idle periods let it recover.
When This Breaks Down
The model falls apart in a few scenarios:
Long synchronous operations: A request needing 50ms of CPU with a 25ms quota will always get throttled, regardless of burst balance.
Latency-sensitive workloads: If P99 latency matters, your longest operations need to fit within your quota.
Sustained load: Once burst balance depletes and requests queue up, each new request starts mid-period with less quota remaining. The backlog compounds.
Practical Takeaways
1. Size for your longest operations, not average CPU: If your P99 request needs 40ms of CPU, a 25ms quota will throttle those requests every time.
2. Shorter periods reduce worst-case throttling delay; longer periods increase it: A 50ms period means you wait at most 50ms when throttled. A 100ms period means potentially waiting 100ms. The tradeoff: for the same share, a shorter period also means a smaller quota per period, so a single long operation gets chopped up and throttled more often.
3. Watch for cascade effects: When one request gets throttled and takes longer, it holds a connection longer, which can cause queuing, which makes the next request start with less remaining quota in the period.
4. "Low CPU usage" can be misleading: If your monitoring shows 20% CPU but users complain about latency, check throttling stats. You might be at 80% of your quota while only using 20% of the physical core.
5. Consider dedicated CPU for latency-critical paths: If consistent latency matters more than cost, dedicated CPU instances guarantee you won't share with noisy neighbors. But of course, "dedicated" also has many different definitions. Sometimes that just means a bigger slice!
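For point 4, the throttling counters live in cpu.stat, right next to the files from earlier. A minimal sketch, again assuming cgroup v2 mounted at /sys/fs/cgroup:

```python
from pathlib import Path

def throttle_stats(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    # cpu.stat is a flat "key value" file with counters such as
    # nr_periods, nr_throttled and throttled_usec
    stats = {}
    for line in Path(cgroup_dir, "cpu.stat").read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

s = throttle_stats()
periods, throttled = s.get("nr_periods", 0), s.get("nr_throttled", 0)
if periods:
    print(f"throttled in {throttled}/{periods} periods ({100 * throttled / periods:.1f}%), "
          f"~{s.get('throttled_usec', 0) / 1000:.0f}ms spent waiting")
```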
The Bottom Line
vCPUs aren't a scam; they're a mostly sensible way to share compute resources efficiently. The quota system works great for bursty workloads, which is most workloads.
The key is understanding that exceeding your quota doesn't make things "a little slower" - it makes them wait for the next period. Once you internalize that, you can make informed decisions about resource sizing and understand why latency sometimes spikes even when CPU "looks fine."
Notes & References
The cover image illustration was generated with AI.
Linux kernel CFS bandwidth controller documentation
Fly.io - Understanding VM CPU Performance
