Your system worked fine. Then it didn't.
Not at 1,000 users — at 1,000 it was still fine, a bit slow maybe. The crash came at around 50,000 concurrent requests. The database refused connections. Response times went from 180ms to 11 seconds. The on-call was you. The postmortem was painful.
This isn't a story about bad engineering. Most teams that hit scaling walls wrote reasonable code for the scale they had. The problem is that reasonable code for 100 users has a different shape than reasonable code for 10 million, and nobody warns you about the specific places it breaks in between.
What follows is the sequence of bottlenecks you'll actually hit, roughly in the order you'll hit them. Not theory. The things we've seen break at funded startups, and what actually fixed them.
Start boring. Stay boring as long as you can.
The most expensive advice in early-stage software is "build for scale from day one."
Don't.
Nobody knows what their system actually needs to scale until it needs to scale. Teams that design microservices at MVP stage spend their first year fighting infrastructure instead of building their product. I've watched it happen. It's not a capacity problem — it's a self-inflicted coordination problem.
The right architecture for your first 10,000 users is a monolith: one codebase, one database, one server. A well-tuned PostgreSQL instance on a decent Hetzner or DigitalOcean box can handle more traffic than most founders expect. Gojek didn't launch as a distributed system. Neither did Tokopedia. They started boring, scaled up when they had to, and made the hard architectural decisions with real traffic data instead of guesses.
The skill isn't picking the right architecture upfront. It's recognising when your current one stops working and knowing what to reach for next.
Where systems actually break first: the database
Eighty percent of scaling problems live here. Not the app layer. Not the load balancer. The database.
Most backends start on a single PostgreSQL (or MySQL) instance. That's fine — until queries slow down, connections pile up, and response times spike at peak hours. Before reaching for read replicas or sharding, check these first:
Unindexed columns. Run EXPLAIN ANALYZE on your slowest queries. You'll almost always find a sequential scan on a column with no index. Adding the right index can turn a 4-second query into 40ms. We've seen it on tables with 200 million rows — the query just worked after the index landed. A migration sketch follows this list.
N+1 queries. ORMs hide these well. Your endpoint that loads 50 products is probably firing 51 queries: one for the list, one per product for a related model. Find it in your query logs. Fix it with eager loading or a JOIN, as sketched below the list.
Connection exhaustion. Every API request opening its own database connection doesn't scale. PgBouncer as a connection pooler is a one-afternoon change that has unblocked teams hitting walls at 50k DAU.
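To make the first item concrete, here is roughly what the index fix looks like in a Django project. Everything here is a hypothetical sketch — the `orders` app, the `external_id` field, and the migration names are stand-ins for your own schema. `AddIndexConcurrently` (Django 3.0+, PostgreSQL only) builds the index without locking writes; run EXPLAIN ANALYZE again afterwards to confirm the sequential scan became an index scan.

```python
# A sketch, not a drop-in migration: hypothetical "orders" app whose slow
# query filters on external_id.
from django.contrib.postgres.operations import AddIndexConcurrently
from django.db import migrations, models


class Migration(migrations.Migration):
    atomic = False  # CONCURRENTLY cannot run inside a transaction
    dependencies = [("orders", "0007_previous_migration")]

    operations = [
        AddIndexConcurrently(
            model_name="order",
            index=models.Index(fields=["external_id"], name="orders_external_id_idx"),
        ),
    ]
```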
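And the second item, in Django ORM terms. The Product and Category models here are hypothetical stand-ins for whatever your list endpoint actually loads.

```python
# Hypothetical models: Product has a ForeignKey to Category and an M2M to Tag.
from myapp.models import Product  # assumption: your own app's models

# Before: 1 query for the page, plus 1 more per product the moment the
# serializer touches product.category — 51 queries for a 50-item page.
products = Product.objects.all()[:50]
rows = [(p.name, p.category.name) for p in products]

# After: select_related turns it into a single JOIN.
products = Product.objects.select_related("category")[:50]
rows = [(p.name, p.category.name) for p in products]

# For many-to-many or reverse relations, prefetch_related does the same job
# with one extra query instead of one per row.
products = Product.objects.prefetch_related("tags")[:50]
```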
Fix those three things first. You probably just bought yourself three to six months of headroom.
When that's not enough: add a read replica. Route all SELECT queries there, writes stay on primary. This halves primary load for read-heavy applications and is a Monday morning change, not a quarter-long project.
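In Django terms, the routing is usually one small router class. A minimal sketch, assuming a second database alias named "replica" is configured in settings.DATABASES — the alias and class name are my own, not a standard:

```python
# settings.py (excerpt): DATABASE_ROUTERS = ["myproject.routers.ReplicaRouter"]

class ReplicaRouter:
    """Send reads to the replica; keep writes and migrations on the primary."""

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same data, so relations are always fine.
        return True

    def allow_migrate(self, db, app_label, **hints):
        return db == "default"
```

One thing to watch: replication lag. A request that writes and then immediately reads its own write may need to be pinned to the primary for that read.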
Sharding — splitting data across multiple database instances — comes much later, when a single machine genuinely can't store your data or sustain your write volume. Most startups never get there. The ones that do at least know exactly why they're doing it.
Caching: what it solves, and what it doesn't
Redis is often treated as magic. It isn't. It's a trade-off: faster reads at the cost of potential staleness.
It works well when the same data gets read far more often than it changes — user profiles, product listings, pricing tables, configuration values. The cache-aside pattern covers most cases: check Redis first, on miss hit the database, write the result back to Redis with a TTL.
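The pattern fits in a few lines of Python with redis-py; `get_product_from_db()` is a hypothetical stand-in for your real query.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
TTL_SECONDS = 300  # per key type: how stale can this data afford to be?


def get_product_from_db(product_id: int) -> dict:
    ...  # placeholder for the real database query


def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                    # hit: serve from Redis
    product = get_product_from_db(product_id)        # miss: hit the database
    r.set(key, json.dumps(product), ex=TTL_SECONDS)  # write back with a TTL
    return product
```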
Two things that bite teams in production:
Cache stampede. Your TTL expires on a popular key. Three hundred concurrent requests miss cache simultaneously and pile onto the database. Fix it with mutex locking on cache population (only one request rebuilds the cache, others wait) or by randomising TTLs so popular keys don't all expire at the same moment. Both fixes are sketched after these two items.
Stale data at the worst time. A promotion goes live, prices change, cache still serves old values. Every cached key needs a TTL appropriate to how often the underlying data actually changes. "Cache forever" always becomes a problem eventually.
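Both stampede mitigations fit in a short sketch, reusing the redis-py client from the cache-aside example above; `rebuild_product()` is hypothetical.

```python
import json
import random
import time

import redis

r = redis.Redis()


def rebuild_product(key: str) -> dict:
    ...  # placeholder for the expensive database rebuild


def get_with_stampede_protection(key: str, ttl: int = 300) -> dict:
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    # Mutex via SET NX: only the first request on a miss rebuilds the value.
    if r.set(f"lock:{key}", "1", nx=True, ex=10):
        try:
            value = rebuild_product(key)
            # Jittered TTL so popular keys don't all expire at the same moment.
            r.set(key, json.dumps(value), ex=ttl + random.randint(0, 60))
            return value
        finally:
            r.delete(f"lock:{key}")

    # Everyone else waits briefly and re-reads instead of piling onto the DB.
    time.sleep(0.1)
    return get_with_stampede_protection(key, ttl)
```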
One important note: caching buys time. It doesn't fix slow queries or connection problems. Solve those first, then layer caching on top.
Horizontal scaling: when it helps, when it doesn't
Adding more app servers is the straightforward part — once your application is stateless. Sessions can't live in memory on individual servers. They need to live in Redis or the database so any instance can handle any request.
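In a Django app, moving sessions out of process memory is mostly settings. A sketch assuming the django-redis package and a Redis hostname I've made up:

```python
# settings.py
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://cache.internal:6379/1",  # hypothetical host
    }
}

# Sessions live in the shared cache, so any instance can serve any request.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_CACHE_ALIAS = "default"
```

If losing sessions on a Redis restart would matter, the cached_db backend (`django.contrib.sessions.backends.cached_db`) writes through to the database as well.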
Beyond statelessness: a load balancer distributes traffic across instances, and health checks remove dead ones automatically. Round-robin works for most cases.
What horizontal scaling doesn't fix is a slow database. Five app servers hitting a slow query just create five times the load on the same bottleneck. This is the trap most teams fall into — they see high CPU on the app server, add another instance, and watch database CPU spike instead.
Fix the database layer first. Then scale the application horizontally.
The mistake almost everyone makes
Microservices.
I've seen this at multiple startups in the last two years. The team reads about how a unicorn operates, decides they should architect the same way, and six months later they have fifteen services, a Kubernetes cluster nobody fully understands, distributed tracing that half-works, and a deployment pipeline that takes 45 minutes.
Microservices solve an organisational problem, not a technical one. They exist so large engineering organisations — 50, 100, 200 people — can ship independently without blocking each other. At 10 to 20 engineers, you don't have that problem. You just gave yourself one.
The inflection point where microservices start making sense: multiple teams, multiple deployment cadences, clear domain ownership, and enough engineers to properly staff each service. Before that, the right answer is usually a modular monolith — clear internal module boundaries, defined interfaces between them, deployed as one unit. Most of the organisational benefit, none of the distributed systems complexity.
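What "clear internal module boundaries" can look like in practice: each domain package exposes one small public surface, and everything else stays internal. The package and function names below are illustrative only, not a prescribed layout.

```python
# billing/service.py — the only module other packages may import from billing
from billing import repository  # internal detail, never imported elsewhere


def charge_order(order_id: int, amount_cents: int) -> str:
    """Public entry point; callers never see billing's tables or internals."""
    return repository.create_charge(order_id, amount_cents)


# orders/checkout.py — talks to billing only through its public interface
from billing.service import charge_order


def checkout(order_id: int, total_cents: int) -> str:
    return charge_order(order_id, total_cents)
```

If those import rules need enforcing rather than just agreeing on, a tool like import-linter can fail the build on boundary violations.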
What an actual scale-up sequence looks like
A fintech company in Jakarta processes payment webhook notifications for a mid-size e-commerce platform. At launch: single Django app, single PostgreSQL, one EC2 instance.
At around 300,000 daily active users, two things broke simultaneously. Database connections were exhausted during 11am–1pm peak (the lunch scroll). Webhook processing was blocking synchronous API responses, adding 3–8 seconds of latency.
The fix sequence:
- PgBouncer for connection pooling → connection exhaustion resolved within 24 hours
- Celery + Redis for async webhook processing (sketched below) → API responses back to sub-200ms
- PostgreSQL read replica → offloaded 60% of DB reads, primary CPU dropped from 82% to 34%
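A sketch of that second fix, the async webhook step. `process_payment_webhook()` and `parse_payload()` are hypothetical stand-ins for the team's existing code, and the broker URL is an assumption.

```python
# tasks.py
from celery import Celery

app = Celery("payments", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=5, default_retry_delay=30)
def handle_webhook(self, payload: dict):
    try:
        process_payment_webhook(payload)  # the slow part: DB writes, downstream calls
    except Exception as exc:
        raise self.retry(exc=exc)         # retry later instead of dropping the event


# views.py — the endpoint now just enqueues and acknowledges
from django.http import HttpResponse


def webhook_view(request):
    handle_webhook.delay(parse_payload(request))
    return HttpResponse(status=202)       # accepted; processed asynchronously
```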
Same Django monolith throughout. No Kubernetes. No microservices. Six times the headroom, two weeks of engineering work.
They're at 1.2M DAU now on the same core architecture. The next actual architectural decision is sharding the payments table, which is approaching 800GB. That's a six-month project, carefully sequenced. It's the right problem to be solving at 1.2M DAU — not at 300k.
FAQ
When should I move from a monolith to microservices?
When you have multiple teams that need to deploy independently, clear domain boundaries in your codebase, and at least two engineers who can own each service end-to-end. Most teams under 30 engineers aren't there yet, and the ones that think they are usually regret it six months in.
How much traffic can a single PostgreSQL instance actually handle?
With proper indexing and connection pooling, a well-specced instance (32 cores, 128GB RAM) handles tens of thousands of queries per second. Most teams hit problems in their application code long before the database itself is the ceiling.
My server is struggling. What's the first thing to check?
Run EXPLAIN ANALYZE on your slowest queries. Then check connection counts in pg_stat_activity. Then look at whether you're repeatedly fetching data that rarely changes. In that order — skipping ahead usually wastes a week.
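The same triage in code form, assuming psycopg2, a DSN in an environment variable, and (for the second query) the pg_stat_statements extension. The timing column is mean_exec_time on PostgreSQL 13+, mean_time on older versions.

```python
import os

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var
with conn.cursor() as cur:
    # 1. How many connections are open, and what are they doing?
    cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state;")
    for state, count in cur.fetchall():
        print(state, count)

    # 2. Which queries are actually slow on average? (pg_stat_statements)
    cur.execute("""
        SELECT query, calls, mean_exec_time
        FROM pg_stat_statements
        ORDER BY mean_exec_time DESC
        LIMIT 5;
    """)
    for query, calls, mean_ms in cur.fetchall():
        print(f"{mean_ms:.1f} ms avg x {calls} calls: {query[:80]}")
```

Run EXPLAIN ANALYZE on whatever shows up at the top of that second list.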
Do we actually need Kubernetes?
Probably not yet. Kubernetes is operationally expensive. Managed container services — AWS ECS, Cloud Run, Fly.io — give you the container deployment benefits without the complexity overhead. Most startups are better served by those until they have a dedicated platform team who wants to own the cluster.
How do you handle sudden traffic spikes?
Queue-based load levelling is the most reliable pattern: spikes hit the queue, workers drain it at a pace the database can sustain. Teams that handle Lebaran or Harbolnas well pre-scale infrastructure, aggressively cache product and pricing data, and have tested their queue depth limits before the event. The ones that don't plan spend the night firefighting.
Scaling is a sequence of boring decisions made at the right moment. The teams that get it right aren't the ones who designed for 10M users on day one — they're the ones who knew which bottleneck they were actually solving when each one showed up.
If you're not sure where your system starts breaking, an architecture audit is usually faster than guessing in the dark.
SpectreDev builds high-traffic, reliable backend systems for startups in Indonesia, Australia, and Southeast Asia.