LinkedIn Draft — Insight (2026-05-01)
This is what separates teams that scale from teams that survive:
Capacity planning is a risk budget conversation, not a utilization spreadsheet
Teams that plan capacity by extrapolating last month's P95 get surprised when a product launch doubles traffic in a week. The right frame isn't 'what utilization should we run at' — it's 'what's the asymmetric cost of being wrong in each direction, and how much buffer does that justify?'
Cost asymmetry analysis:
Over-provision by 20%:
▸ Cost: +$8K/month
▸ Direct, predictable
Under-provision by 20%:
▸ Cost: incident
▸ + on-call burnout
▸ + customer churn
▸ + post-mortem
▸ + team morale tax
For P0 services, a 20% buffer almost always wins.
The non-obvious part:
→ The teams who get capacity planning right treat it as an insurance calculation, not an optimization problem. Over-provision cost is direct and visible. Under-provision cost is diffuse, delayed, and always larger than it looks. The asymmetry should drive your buffer strategy — not your CFO's target utilization number.
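A minimal sketch of the insurance framing, in Python: treat the monthly over-provision cost as the premium and compare it with the expected loss from a capacity incident. The $8K premium is the figure from the comparison above; the incident probability and the all-in incident cost are hypothetical placeholders, not measurements.

```python
def buffer_is_worth_it(premium_per_month: float,
                       incident_probability_per_month: float,
                       cost_per_incident: float) -> bool:
    """Insurance-style check: the buffer pays off when the premium
    is lower than the expected monthly loss it protects against."""
    expected_loss = incident_probability_per_month * cost_per_incident
    return premium_per_month < expected_loss


# Hypothetical inputs: a 10% monthly chance of a capacity incident
# that costs $150K once churn, on-call time, and the post-mortem are counted.
print(buffer_is_worth_it(8_000, 0.10, 150_000))
```

The diffuse costs are the hard part to estimate, so it may help to run this with a low and a high guess for the incident cost and see whether the answer changes.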
My rule:
→ Set buffer based on incident cost, not utilization targets. For every P0 service, calculate: what does one hour of downtime cost vs one month of 20% over-provision? The math almost always justifies the buffer.
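The rule above can be sketched as a break-even calculation: how much downtime per month would cost as much as the buffer? The function name and the $20K/hour downtime figure are assumptions for illustration; substitute your own service's numbers.

```python
def break_even_downtime_hours(buffer_cost_per_month: float,
                              downtime_cost_per_hour: float) -> float:
    """Hours of downtime per month whose cost equals one month of buffer."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return buffer_cost_per_month / downtime_cost_per_hour


# $8K/month buffer (from the post) vs a hypothetical $20K per hour of P0 downtime.
hours = break_even_downtime_hours(8_000, 20_000)
print(f"Break-even: {hours * 60:.0f} minutes of avoided downtime per month")
```

With those placeholder numbers the buffer breaks even at 24 minutes of avoided downtime per month, before counting any of the delayed costs such as churn or burnout.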
Worth reading:
▸ Google SRE Book: Being On-Call (ch. 11) and Handling Overload (ch. 21)
▸ AWS/GCP cost anomaly detection: automated alerts when spend shifts, a useful secondary signal that your buffer is being consumed
Hiring note: engineers who think about this in system design conversations stand out immediately.