This is what separates teams that scale from teams that survive:

#observability #sre #devops #platformengineering

LinkedIn Draft — Insight (2026-05-01)

Capacity planning is a risk budget conversation, not a utilization spreadsheet

Teams that plan capacity by extrapolating last month's P95 get surprised when a product launch doubles traffic in a week. The right frame isn't 'what utilization should we run at?' It's 'what's the asymmetric cost of being wrong in each direction, and how much buffer does that justify?'

Cost asymmetry analysis:

Over-provision by 20%:     Under-provision by 20%:

Cost: +$8K/month           Cost: Incident
Direct, predictable              + on-call burnout
                                 + customer churn
                                 + post-mortem
                                 + team morale tax

For P0 services, a 20% buffer almost always wins.

The non-obvious part:
→ The teams that get capacity planning right treat it as an insurance calculation, not an optimization problem. Over-provision cost is direct and visible. Under-provision cost is diffuse, delayed, and almost always larger than it looks. That asymmetry should drive your buffer strategy, not your CFO's target utilization number.
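
The insurance framing can be made concrete with a rough expected-cost comparison. Here is a minimal sketch in Python. Only the $8K/month buffer cost comes from the comparison above; the incident probabilities and the per-incident cost are hypothetical placeholders, and the function name is illustrative.

```python
def expected_monthly_cost(buffer_cost: float,
                          incident_probability: float,
                          incident_cost: float) -> float:
    """Return buffer spend plus the probability-weighted cost of an incident."""
    if not 0.0 <= incident_probability <= 1.0:
        raise ValueError("incident_probability must be between 0 and 1")
    return buffer_cost + incident_probability * incident_cost


# Hypothetical inputs for one P0 service. Only the $8K/month buffer cost
# comes from the post; the probabilities and incident cost are assumptions.
INCIDENT_COST = 150_000.0  # assumed all-in cost of one capacity incident

# Assumed monthly chance of a capacity incident with and without the buffer.
with_buffer = expected_monthly_cost(8_000.0, 0.02, INCIDENT_COST)
without_buffer = expected_monthly_cost(0.0, 0.15, INCIDENT_COST)

print(f"with buffer:    ${with_buffer:,.0f}/month")
print(f"without buffer: ${without_buffer:,.0f}/month")
```

The visible line item (the buffer) is only one term. The comparison flips only if the incident probability or incident cost is very small, which is rarely a safe assumption for a P0 service.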

My rule:
→ Set buffer based on incident cost, not utilization targets. For every P0 service, calculate: what does one hour of downtime cost versus one month of 20% over-provisioning? The math almost always justifies the buffer.
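
One way to run that calculation is as a break-even: how much downtime per month does the buffer need to prevent before it pays for itself? A small Python sketch, assuming the $8K/month figure used earlier in this post and a hypothetical downtime cost of $50K/hour (substitute your own numbers):

```python
def break_even_downtime_hours(monthly_buffer_cost: float,
                              downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per month at which the buffer pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return monthly_buffer_cost / downtime_cost_per_hour


MONTHLY_BUFFER_COST = 8_000.0      # 20% over-provision, from the post
DOWNTIME_COST_PER_HOUR = 50_000.0  # hypothetical: revenue + SLA credits + churn

hours = break_even_downtime_hours(MONTHLY_BUFFER_COST, DOWNTIME_COST_PER_HOUR)
print(f"Break-even: {hours * 60:.1f} minutes of avoided downtime per month")
```

If the break-even is a few minutes of downtime, the buffer is cheap insurance. If it is many hours, the service may not need P0-level headroom.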

Worth reading:
▸ Google SRE Book: Being On-Call (ch. 11) and Handling Overload (ch. 21)
▸ AWS/GCP cost anomaly detection — real-time signals for when your buffer is being consumed

https://neeraja-portfolio-v1.vercel.app/insights/capacity-planning-is-a-risk-budget-conversation-not-a-utilization-spreadsheet

Hiring note: engineers who think about this in system design conversations stand out immediately.