Capacity planning sounds like enterprise spreadsheet work. For a startup, it's 'don't get embarrassed when traffic spikes, don't go broke overprovisioning.' Here's the pragmatic version.
The 3 questions
1. What does normal look like right now? Peak RPS, p99 latency, CPU/memory utilization at peak. If you don't know these, stop and measure. You cannot plan without a baseline.
2. What's the next expected spike? A launch. A press mention. A marketing campaign. The Black Friday of your industry. Put these on a calendar.
3. How long does it take to add capacity? Minutes (autoscaling)? Hours (VM provisioning)? Weeks (vendor contracts)?
Your capacity plan has to cover the gap between how fast the expected spike arrives and how long it takes you to add capacity.
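To put a number on that gap, here is a minimal sketch. Everything in it is an assumption for illustration: the function name, the parameters, and the simplification that traffic ramps linearly from baseline to the spike peak. It estimates how much capacity has to be running before the spike starts, because the extra capacity you request will not arrive in time.

```python
def prewarm_rps_needed(baseline_rps, spike_rps, ramp_seconds,
                       provision_seconds, current_capacity_rps):
    """RPS of extra capacity that must already be running when the spike starts.

    Hypothetical model: traffic ramps linearly from baseline_rps to spike_rps
    over ramp_seconds, and capacity requested at the start of the spike only
    comes online after provision_seconds.
    """
    if provision_seconds >= ramp_seconds:
        # New capacity arrives after the peak: you face the full spike alone.
        load_before_help = spike_rps
    else:
        fraction = provision_seconds / ramp_seconds
        load_before_help = baseline_rps + (spike_rps - baseline_rps) * fraction
    return max(0.0, load_before_help - current_capacity_rps)


# Illustrative numbers only: 200 RPS baseline, 2000 RPS spike over 10 minutes,
# 500 RPS of capacity running today.
print(prewarm_rps_needed(200, 2000, 600, 60, 500))    # autoscaling, ~1 minute
print(prewarm_rps_needed(200, 2000, 600, 300, 500))   # slower scaling, 5 minutes
print(prewarm_rps_needed(200, 2000, 600, 3600, 500))  # VM provisioning, ~1 hour
```

The slower your answer to question 3, the more of the spike you have to cover in advance.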
The startup hack: overprovision early
At startup scale, overprovisioning is cheap. An extra $5k/month of slack is trivial compared to the embarrassment of 'we went down during the launch.'
Run at 30-40% peak utilization. Yes, that's wasteful. It's also a roughly 2.5-3x buffer for unexpected spikes. Worth it until you're big enough to care about the efficiency.
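The buffer arithmetic is just the inverse of utilization. A quick sketch, assuming utilization grows linearly with traffic (real systems often degrade before they reach 100%, so treat the result as an upper bound); the function name is made up for this example:

```python
def headroom_multiple(peak_utilization):
    """How many times current peak traffic fits before utilization hits 100%.

    Assumes utilization scales linearly with traffic, which is a simplification.
    """
    if not 0 < peak_utilization <= 1:
        raise ValueError("peak_utilization must be in (0, 1]")
    return 1 / peak_utilization


for u in (0.30, 0.40, 0.70):
    print(f"{u:.0%} peak utilization -> {headroom_multiple(u):.1f}x headroom")
```

At 70% peak utilization the same calculation gives about 1.4x, which is not much room for a surprise.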
The autoscaling reality
Autoscaling is great but has limits:
- Cold start times mean it can't handle traffic that doubles in 30 seconds
- Provisioning limits mean you can only add X instances per minute
- Downstream dependencies (databases, queues) usually don't autoscale
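The first two limits can be estimated on paper before you run a real test. The sketch below is a toy model, not how any particular autoscaler behaves: it assumes instances launch in one batch at the start of each minute, up to a per-minute cap, and that each one starts serving after a fixed cold start. All names and numbers are hypothetical.

```python
import math


def seconds_until_capacity(current_instances, rps_per_instance, target_rps,
                           cold_start_seconds, max_new_per_minute):
    """Estimate seconds until enough instances are serving to cover target_rps.

    Toy model (an assumption): at most max_new_per_minute instances launch
    per minute, in one batch at the start of each minute, and each instance
    serves traffic cold_start_seconds after it launches.
    """
    needed = math.ceil(target_rps / rps_per_instance)
    missing = max(0, needed - current_instances)
    if missing == 0:
        return 0
    batches = math.ceil(missing / max_new_per_minute)
    # The last batch launches (batches - 1) minutes in, then has to warm up.
    return (batches - 1) * 60 + cold_start_seconds


# Illustrative: 10 instances at 100 RPS each, traffic doubles to 2000 RPS,
# 90-second cold start, at most 5 new instances per minute.
print(seconds_until_capacity(10, 100, 2000, 90, 5))
```

With these made-up numbers the fleet needs 150 seconds to double. If traffic doubles in 30 seconds, that leaves about two minutes of under-served requests.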
Test your autoscaling before you need it. The first time I depended on autoscaling in production, it worked. The second time, it didn't, because our database connection pool was capped. That was the real bottleneck.
The 3 bottlenecks to check
For every scaling test, verify:
- Stateless service capacity. Usually easy: just add more instances.
- Database capacity. Connection counts, query latency, replication lag. Usually the real bottleneck.
- Third-party dependencies. Rate limits on external APIs, email providers, payment processors. A sudden 10x spike usually hits someone's rate limit.
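A rough way to run this check on paper before a load test: list each resource's current peak usage next to its hard limit, multiply by the spike you expect, and see what breaks first. The sketch below assumes usage grows linearly with traffic. The resource names and numbers are invented for illustration.

```python
def bottlenecks_at(multiplier, limits):
    """Return resources whose projected usage exceeds their limit, worst first.

    limits maps a resource name to (current_peak_usage, hard_limit).
    Assumes usage grows linearly with traffic, which is a simplification.
    """
    over = []
    for name, (usage, limit) in limits.items():
        projected = usage * multiplier
        if projected > limit:
            over.append((name, projected, limit))
    # Sort by how far past the limit each resource would be.
    return sorted(over, key=lambda item: item[1] / item[2], reverse=True)


# Hypothetical numbers for a small service.
limits = {
    "app instances": (8, 100),
    "db connections": (120, 500),
    "email API requests/min": (300, 1000),
}

for name, projected, limit in bottlenecks_at(10, limits):
    print(f"{name}: projected {projected} vs limit {limit}")
```

In this example the stateless tier survives a 10x spike, while the email provider's rate limit and the database connection count both fail.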
The launch checklist
Before any planned spike:
- Provision 3x the capacity you think you need
- Pre-warm caches and connection pools
- Confirm your paging rotation is ready
- Prepare a rollback plan for the feature being launched
- Schedule the launch during your team's waking hours, not off-hours
The real lesson
For startups, the goal of capacity planning is not efficiency. It's confidence. If you have to spend a little more to avoid panic during growth, spend it. Optimize for efficiency later, when you have a year of traffic history to work from.
Right now, your job is to stay standing. Do that first.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com