Capacity Planning for Startups

#sre #devops #capacity #startup

Capacity planning sounds like enterprise spreadsheet work. For a startup, it's 'don't get embarrassed when traffic spikes, don't go broke overprovisioning.' Here's the pragmatic version.

The 3 questions

1. What does normal look like right now? Peak RPS, p99 latency, CPU/memory utilization at peak. If you don't know these, stop and measure. You cannot plan without a baseline.

2. What's the next expected spike? A launch. A press mention. A marketing campaign. The Black Friday of your industry. Put these on a calendar.

3. How long does it take to add capacity? Minutes (autoscaling)? Hours (VM provisioning)? Weeks (vendor contracts)?

Your capacity plan has to cover the gap between how fast the expected spike arrives and how fast you can add capacity.
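
To make that gap concrete, here is a rough sketch in Python. It assumes traffic ramps linearly from baseline to the spike and that you request more capacity the moment the ramp starts. Both are simplifications, and the function name and numbers are made up for illustration.

```python
def seconds_exposed(baseline_rps, capacity_rps, spike_rps,
                    ramp_seconds, provision_seconds):
    """Estimate how long load exceeds capacity before new capacity arrives.

    Assumptions (illustrative only): traffic ramps linearly from baseline_rps
    to spike_rps over ramp_seconds, and extra capacity is requested at the
    start of the ramp and becomes usable provision_seconds later.
    """
    if spike_rps <= capacity_rps:
        return 0.0  # existing capacity already covers the spike
    if baseline_rps >= capacity_rps:
        time_to_overload = 0.0  # already overloaded before the ramp begins
    else:
        # Fraction of the ramp completed when load crosses current capacity.
        fraction = (capacity_rps - baseline_rps) / (spike_rps - baseline_rps)
        time_to_overload = ramp_seconds * fraction
    return max(0.0, provision_seconds - time_to_overload)


# Hypothetical numbers: 100 RPS baseline, 300 RPS capacity, spike to 1000 RPS in 60 s.
print(seconds_exposed(100, 300, 1000, 60, provision_seconds=180))  # slow VM provisioning
print(seconds_exposed(100, 300, 1000, 60, provision_seconds=10))   # fast, pre-warmed capacity
```

Any result above zero is time spent overloaded. You close it by holding more capacity up front or by shortening the time it takes to add more.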

The startup hack: overprovision early

At startup scale, overprovisioning is cheap. An extra $5k/month of slack is trivial compared to the embarrassment of 'we went down during the launch.'

Run at 30-40% utilization at peak. Yes, that's wasteful. It's also roughly a 2.5x to 3.3x buffer for unexpected spikes. Worth it until you're big enough to care about the efficiency.
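
The buffer implied by a utilization target is simple arithmetic. A minimal sketch, assuming utilization grows linearly with load (real systems often degrade before they reach 100%, so treat the result as an upper bound); the function name is illustrative:

```python
def headroom_factor(peak_utilization):
    """Multiple of current peak load that fits before reaching 100% utilization.

    Assumes utilization scales linearly with load, which is an approximation.
    """
    if not 0 < peak_utilization <= 1:
        raise ValueError("peak_utilization must be in (0, 1]")
    return 1.0 / peak_utilization


for utilization in (0.30, 0.40, 0.70):
    factor = headroom_factor(utilization)
    print(f"{utilization:.0%} at peak -> about {factor:.1f}x headroom")
```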

The autoscaling reality

Autoscaling is great but has limits:

Cold start times mean it can't handle traffic that doubles in 30 seconds
Provisioning limits mean you can only add X instances per minute
Downstream dependencies (databases, queues) usually don't autoscale

Test your autoscaling before you need it. The first time I depended on autoscaling in production, it worked. The second time, it didn't, because our database connection pool was capped. That was the real bottleneck.
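
One way a connection limit can quietly cap autoscaling is sketched below. This is not a reconstruction of the incident above, just the general arithmetic: if every instance opens its own pool, the database's connection limit sets a ceiling on instance count no matter what the autoscaler allows. All names and numbers are hypothetical.

```python
def max_instances_for_db(db_max_connections, pool_size_per_instance,
                         reserved_connections=0):
    """Largest instance count whose combined pools fit within the DB limit.

    reserved_connections covers things like admin sessions and migrations
    (an assumption; set it to whatever your setup actually needs).
    """
    usable = db_max_connections - reserved_connections
    return max(0, usable // pool_size_per_instance)


autoscaler_max = 30  # hypothetical autoscaling group ceiling
db_ceiling = max_instances_for_db(db_max_connections=200,
                                  pool_size_per_instance=20,
                                  reserved_connections=10)
if db_ceiling < autoscaler_max:
    print(f"Database caps you at {db_ceiling} instances, "
          f"not the autoscaler's {autoscaler_max}")
```

If the database ceiling is lower than the autoscaler's maximum, the usual options are smaller per-instance pools, a connection pooler in front of the database, or a higher connection limit.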

The 3 bottlenecks to check

For every scaling test, verify:

Stateless service capacity. Usually easy — just add more instances.
Database capacity. Connection counts, query latency, replication lag. Usually the real bottleneck.
Third-party dependencies. Rate limits on external APIs, email providers, payment processors. A sudden 10x spike usually hits someone's rate limit.
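
For the third-party check, a quick projection tells you which limits to raise before the spike. A minimal sketch, assuming each dependency's call rate scales linearly with your traffic; the dependency names, rates, and limits are invented for illustration:

```python
def dependencies_over_limit(current_rps, rate_limits, multiplier):
    """Return dependencies whose projected call rate exceeds their rate limit.

    current_rps and rate_limits map dependency name -> requests per second.
    Assumes outbound calls grow in proportion to inbound traffic.
    """
    over = {}
    for name, rps in current_rps.items():
        projected = rps * multiplier
        if projected > rate_limits[name]:
            over[name] = projected
    return over


# Hypothetical current call rates and provider limits, in requests per second.
current = {"email": 5, "payments": 2, "geocoding": 20}
limits = {"email": 100, "payments": 15, "geocoding": 50}

for name, projected in dependencies_over_limit(current, limits, multiplier=10).items():
    print(f"{name}: projected {projected} RPS exceeds limit of {limits[name]} RPS")
```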

The launch checklist

Before any planned spike:

Provision 3x the capacity you think you need
Pre-warm caches and connection pools
Confirm your paging rotation is ready
Prepare a rollback plan for the feature being launched
Schedule the launch during your team's waking hours, not off-hours
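
The first checklist item turns into an instance count with one line of arithmetic. A rough sketch, assuming you have measured how many requests per second a single instance handles comfortably; the function name and figures are illustrative:

```python
import math


def instances_to_provision(expected_peak_rps, rps_per_instance, safety_factor=3):
    """Instances to have running before the spike, rounded up.

    rps_per_instance should come from a load test, not a guess.
    safety_factor=3 mirrors the 3x rule of thumb in the checklist.
    """
    return math.ceil(expected_peak_rps * safety_factor / rps_per_instance)


# Hypothetical launch: you expect 900 RPS and each instance handles 150 RPS.
print(instances_to_provision(expected_peak_rps=900, rps_per_instance=150))
```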

The real lesson

For startups, the goal of capacity planning is not efficiency. It's confidence. If you have to spend a little more to avoid panic during growth, spend it. Optimize for efficiency later, when you have a year of traffic history to work from.

Right now, your job is to stay standing. Do that first.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com