There's a small industry of vendors that want to sell you machine learning capacity planning. For 95% of teams, you don't need it. You need a spreadsheet, an honest growth assumption, and a buffer.
Here's the practical version of capacity planning that catches most real problems.
What you actually need to know
You need to answer three questions:
1. When does the current setup run out?
2. What does adding capacity cost (money and engineering time)?
3. What's the cheapest action you can take to push the answer to question 1 out by 6 months?
That's the whole game. Everything else is detail.
The boring forecast
Take your traffic from the last 12 months. Fit a linear regression. Extrapolate forward 6 months. That's your baseline forecast.
For most B2B SaaS workloads, this is accurate enough. Your traffic isn't a fractal pattern. It grows roughly linearly with customer count, with seasonal bumps you already know about.
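As a minimal sketch of the fit-and-extrapolate step, the snippet below does an ordinary least-squares line fit in plain Python. The `history` numbers and the function names `fit_line` and `forecast` are made up for illustration, and it assumes one data point per month; a spreadsheet's trend-line function does the same job.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0..n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys, months_ahead=6):
    """Extend the fitted line months_ahead steps past the last data point."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [slope * (n + k) + intercept for k in range(months_ahead)]


# Hypothetical monthly traffic (e.g. peak requests/sec) for the last 12 months.
history = [100, 104, 109, 115, 118, 124, 130, 133, 139, 146, 150, 155]
print([round(v) for v in forecast(history)])
```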
Where to be careful:
- One-time growth events. If your sales team is about to land a contract that doubles traffic, that's not captured in the regression. Talk to sales monthly so you know what's coming.
- Product launches. If marketing is about to launch a campaign with paid acquisition, expect a 2-4x bump for the campaign window. Don't average it into your baseline.
- Customer churn. A big customer leaving will dent the trend. Subtract their traffic from history before fitting, then add it back as a separate term that drops to zero on the date they leave.
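One way to keep a single large customer out of the regression is sketched below. It assumes you can break out that customer's traffic per month; `forecast_with_separate_term` and `customer_future` are hypothetical names, and the customer's future traffic is something you supply yourself (zeros if they have left, or whatever sales tells you to expect). The same shape works for a one-time contract or a campaign window.

```python
def linear_forecast(ys, months_ahead):
    """Least-squares line through the history, extended months_ahead steps."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [slope * (n + k) + intercept for k in range(months_ahead)]


def forecast_with_separate_term(total, customer, customer_future):
    """Fit the trend without one customer, then add that customer back.

    total           -- total monthly traffic history
    customer        -- that customer's share of each month in `total`
    customer_future -- assumed traffic from that customer in each forecast month
    """
    if len(total) != len(customer):
        raise ValueError("total and customer histories must be the same length")
    baseline = [t - c for t, c in zip(total, customer)]
    baseline_forecast = linear_forecast(baseline, len(customer_future))
    return [b + c for b, c in zip(baseline_forecast, customer_future)]


# Hypothetical: a customer worth 100 units/month left after month 2.
total = [110, 112, 14, 16]
customer = [100, 100, 0, 0]
print(forecast_with_separate_term(total, customer, customer_future=[0, 0]))
```

Fitting the raw `total` here would produce a steep downward line; fitting the baseline recovers the underlying growth.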
What to provision against
Your forecast tells you the expected traffic. You should be provisioned for double the peak day of that forecast. The factor-of-two buffer covers:
- The forecast being wrong (it always is)
- A spike day (some Monday will be 50% above average)
- The buffer you need to do safe deploys without saturating
- A reasonable margin for new product features adding load
If doubling capacity is unaffordable, you're already running close to the edge. That's a real problem and you should be talking to leadership about either the cost or the SLO compromise, not the architecture.
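The factor-of-two rule, and the "when does the current setup run out?" question, reduce to a few lines of arithmetic. This is a sketch under assumptions: `forecast_peaks` is a hypothetical list of forecast peak-day traffic per month, capacity is measured in the same unit as traffic, and the function names are illustrative.

```python
def required_capacity(forecast_peaks, buffer=2.0):
    """Capacity to provision: buffer times the highest forecast peak day."""
    return buffer * max(forecast_peaks)


def first_month_over_capacity(forecast_peaks, current_capacity, buffer=2.0):
    """Return the 0-based forecast month where buffer * peak first exceeds
    current capacity, or None if the current setup lasts the whole forecast."""
    for month, peak in enumerate(forecast_peaks):
        if buffer * peak > current_capacity:
            return month
    return None


# Hypothetical six-month forecast of peak-day traffic, and current capacity.
forecast_peaks = [160, 165, 170, 175, 180, 185]
current_capacity = 345

print(required_capacity(forecast_peaks))
print(first_month_over_capacity(forecast_peaks, current_capacity))
```

If `first_month_over_capacity` returns `0`, the buffer is already gone, which is the "running close to the edge" case.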
When ML actually helps
There are cases where machine learning capacity planning pays off:
- Highly seasonal workloads. E-commerce on Black Friday, betting platforms during major sporting events, tax software in March. The pattern is complex enough that a linear regression misses too much.
- Multi-dimensional resource constraints. When CPU, memory, network, and disk all behave differently and you have a complex auto-scaler trying to optimize across them.
- Very large fleets. Beyond a few thousand instances, the variance between machines matters and the spreadsheet stops working.
If you're none of those, skip the ML. Spend the time you saved on something that actually reduces your bill, like rightsizing your instances or shutting down zombie services nobody owns.