Rajkiran

Posted on May 31

Scalability in System Design

#architecture #backend #performance #systemdesign

The Day Instagram Almost Died
In 2012, Instagram had 13 employees and 30 million users. Then they launched on Android.

In 24 hours, they got 1 million new signups.

Their servers didn't die. Not because they were lucky — but because they had already solved the hardest question in system design:

"What happens when 100x more people show up tomorrow?"

That question is called scalability. And today, we're going to understand it the way senior engineers do — not just the definition, but the why, the when, and what actually breaks.

What Is Scalability?
Scalability is your system's ability to handle growing load without breaking or slowing down.

Load can mean:

More users (1K → 1M)
More data (GB → PB)
More requests per second (100 QPS → 100,000 QPS)

A scalable system doesn't just survive growth — it handles it gracefully, ideally without you waking up at 3am.

The Two Paths: Vertical vs Horizontal Scaling
Imagine you run a restaurant. On a busy Friday night, you have two options:

Option A: Hire one superhuman chef who can cook 10x faster.
Option B: Hire 10 regular chefs and divide the work.

That's the exact difference between vertical and horizontal scaling.

Vertical Scaling (Scale Up)
What it is: Make your existing server bigger — more CPU, more RAM, faster disk.

Before: [Server: 4 CPU, 16GB RAM]

After: [Server: 32 CPU, 256GB RAM]

When it works: Early stage. Simple systems. Databases that are hard to distribute (like PostgreSQL). The early Instagram ran on a few beefy EC2 instances — vertical scaling bought them time.

Why it hits a wall:

There's a physical limit to how big one machine can get.
It's expensive: at the high end, price climbs faster than capacity, so doubling RAM can far more than double the bill.
Single point of failure: if that one big server dies, everything dies.
Upgrading usually means downtime while the machine is resized or replaced.

The ceiling is real. You can scale vertically to a point, but you will hit it.

Horizontal Scaling (Scale Out)
What it is: Add more servers and spread the load across them.

Before: [Server 1]

After: [Server 1] [Server 2] [Server 3] ... [Server N]

A Load Balancer sits in front and distributes incoming requests across all servers.
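To make "distributes incoming requests" concrete, here is a minimal sketch of round-robin, one of the simplest distribution strategies. The class name and server names are made up for illustration, and real load balancers also handle health checks, weights, and connection draining, which this ignores.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend servers in a fixed rotating order (illustrative only)."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._rotation = cycle(self._servers)

    def pick(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.pick() for _ in range(6)]
print(assignments)
```

Six requests land on the three servers twice each, in order. Notice that the balancer has no idea which user sent which request.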

Why Netflix uses horizontal scaling: Netflix runs on tens of thousands of servers. When they add capacity, they don't upgrade existing machines — they spin up new ones. When load drops, they terminate them. This is elastic, cost-efficient, and there's no theoretical ceiling.
The catch: Horizontal scaling introduces complexity. If a user logs in on Server 1 and that server keeps the session in its own memory, the session exists only on Server 1. If the next request goes to Server 3, Server 3 knows nothing about them. This is the problem that stateless design solves, and we'll get to it in a moment.

The Real Unlock: Stateless Design
Here's the architectural insight that makes horizontal scaling actually work.

Stateful server (problem): The server remembers things about you between requests. Your session, your cart, your login state — it's all stored in the server's memory.

User → Request 1 → Server A (stores session)

User → Request 2 → Server B (no session = logged out!)

Stateless server (solution): The server remembers nothing. All state lives outside the server: in a shared database, a Redis cache, or a JWT that the client carries.

User → Request 1 → Server A (reads session from Redis ✓)

User → Request 2 → Server B (reads session from Redis ✓)

User → Request 3 → Server C (reads session from Redis ✓)

Now any server can handle any request. You can add 100 more servers tomorrow and they all work immediately — because none of them hold state.
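The flow above can be sketched in a few lines. This is a toy model, not a real Redis client: `SessionStore` is a hypothetical in-memory stand-in for a shared store, and `StatelessServer` is an invented name. The point is only that the servers hold a reference to the store and nothing else.

```python
class SessionStore:
    """In-memory stand-in for a shared store such as Redis (hypothetical)."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = session

    def load(self, session_id):
        return self._data.get(session_id)


class StatelessServer:
    """Keeps no per-user state; every lookup goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def login(self, session_id, user):
        self._store.save(session_id, {"user": user})
        return f"{self.name}: logged in {user}"

    def handle(self, session_id):
        session = self._store.load(session_id)
        if session is None:
            return f"{self.name}: not logged in"
        return f"{self.name}: hello {session['user']}"


store = SessionStore()
server_a = StatelessServer("A", store)
server_b = StatelessServer("B", store)

print(server_a.login("sess-1", "alice"))
print(server_b.handle("sess-1"))
```

Server B greets a user who logged in on Server A, because the session never lived on either server.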

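This is the architectural principle behind every horizontally scalable system. AWS Lambda functions, Kubernetes pods, and Docker containers are all meant to be treated as disposable: they can be started, stopped, and replaced at any time, which only works cleanly when the state lives somewhere else.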

Auto-Scaling: Letting the System Manage Itself
Manual scaling (someone clicking "add server" at 2am) doesn't work at scale. The answer is auto-scaling — the system detects load and adjusts itself automatically.

How it works (AWS Auto Scaling Group example):

CPU > 70% for 5 minutes → Add 2 servers

CPU < 30% for 10 minutes → Remove 1 server
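Those two rules are easy to express as a function. The sketch below is not the AWS API (in practice these rules are configured as alarms and scaling policies); it only models the decision logic, assuming one CPU sample per minute. The `min_servers` floor is an added assumption so the fleet never scales to zero.

```python
def scaling_decision(cpu_samples, current_servers, min_servers=1):
    """Return the change in server count for per-minute CPU samples (percent).

    cpu_samples is ordered oldest to newest, one sample per minute.
    """
    last_5 = cpu_samples[-5:]
    last_10 = cpu_samples[-10:]

    # Scale out: CPU above 70% for each of the last 5 minutes.
    if len(last_5) == 5 and all(s > 70 for s in last_5):
        return 2

    # Scale in: CPU below 30% for each of the last 10 minutes,
    # but never drop below the minimum fleet size (assumed floor).
    if len(last_10) == 10 and all(s < 30 for s in last_10):
        if current_servers - 1 >= min_servers:
            return -1

    return 0


print(scaling_decision([50, 80, 85, 90, 75, 72], current_servers=3))
print(scaling_decision([20] * 10, current_servers=3))
```

The asymmetry is deliberate: scale out fast (5 minutes, +2) and scale in slowly (10 minutes, -1), so a brief dip in traffic doesn't strip capacity you're about to need.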

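Kubernetes Horizontal Pod Autoscaler (HPA):

    # When average CPU crosses 60%, spin up more pods
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-api
      # minReplicas and maxReplicas are illustrative bounds, pick your own
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60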
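How does the HPA decide how many pods to run? The Kubernetes documentation describes the core calculation as scaling the replica count by the ratio of the current metric to the target. Here is a sketch of that formula using the 60% target from above. It leaves out the tolerance band, stabilization windows, and min/max bounds that the real controller applies, and the function name is my own.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60):
    """Core HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


print(desired_replicas(4, 90))  # 4 pods running hot at 90% CPU
print(desired_replicas(4, 30))  # 4 pods idling at 30% CPU
```

With 4 pods at 90% CPU the formula asks for 6; at 30% it asks for 2. The same rule handles both directions.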

Real example: Netflix uses auto-scaling aggressively. On Sunday evenings (peak streaming time), their infrastructure automatically scales up. By 3am, it scales back down. They're not paying for idle servers — and nobody is manually managing this.

The 3 Failure Modes Nobody Talks About
Knowing what breaks makes you a better designer than knowing what works.

1. Stateful servers under horizontal scale. You add servers but sessions break. Users get randomly logged out. Fix: externalize all state.

2. Database becomes the bottleneck. You scaled your app servers 10x. Now 10x the queries are hitting one database. The DB becomes the ceiling. Fix: read replicas, caching, sharding (Day 3 topics).

3. Premature horizontal scaling. A startup with 500 users adds a load balancer, 3 app servers, and Redis for sessions. Now they have 5x the infrastructure to maintain and debug. Fix: start vertical, switch to horizontal when you actually feel the pain.

Instagram's early lesson: They scaled vertically first (bigger servers) then horizontally (more servers). They didn't start distributed — they evolved to it.

Interview Scenario: "Estimate Servers for 1 Million DAU"
This is a real interview question. Here's how to answer it like a senior engineer:

Given: 1 million Daily Active Users (DAU)

Step 1: Calculate QPS

Assume each user makes 10 requests/day

Total requests/day = 1M × 10 = 10M requests/day

Average QPS = 10M ÷ 86,400 seconds ≈ 116 QPS

Peak QPS = 116 × 3 (peak multiplier) ≈ 350 QPS

Step 2: Estimate server capacity

A typical server handles ~500-1000 simple requests/sec

For 350 peak QPS → 1 server is enough

But: add redundancy (minimum 2), so → 2 servers
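If you want to replay Steps 1 and 2 with different assumptions, the arithmetic fits in one small function. The function and parameter names here are made up for this sketch, and 500 requests/sec per server is simply the conservative end of the range quoted above.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate_servers(dau, requests_per_user, peak_multiplier, qps_per_server, min_servers=2):
    """Back-of-envelope server count from daily active users."""
    average_qps = dau * requests_per_user / SECONDS_PER_DAY
    peak_qps = average_qps * peak_multiplier
    # Servers needed for peak load, then enforce the redundancy floor.
    needed = math.ceil(peak_qps / qps_per_server)
    return average_qps, peak_qps, max(needed, min_servers)


average_qps, peak_qps, servers = estimate_servers(
    dau=1_000_000,
    requests_per_user=10,
    peak_multiplier=3,
    qps_per_server=500,  # conservative end of the 500-1000 range
)
print(f"average ≈ {average_qps:.0f} QPS, peak ≈ {peak_qps:.0f} QPS, servers = {servers}")
```

The script reports a peak of about 347 QPS rather than 350 because it doesn't round the average before multiplying. For an interview estimate, either number leads to the same answer: 2 servers.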

Step 3: Know when to go vertical vs horizontal

If your service is stateless (REST API) → horizontal
If your service needs to maintain state or run single-threaded (DB, matching engine) → vertical first

The interviewer is testing: Do you estimate before designing? Do you know the difference between average and peak load? Do you know when each scaling strategy applies?

The Trade-off Triangle
Every scalability decision trades off three things:

     Performance
          ▲
         /|\
        / | \
       /  |  \
      /   |   \
Cost ◄----+----► Simplicity

Vertical scaling: High simplicity, high cost at scale, performance ceiling.
Horizontal scaling: High performance, higher complexity, better cost curve.
Auto-scaling: Best performance/cost ratio, most complex to set up right.

There is no free lunch. Your job as an engineer is to pick the right trade-off for your current scale — not the theoretically perfect architecture.

Real Systems, Real Decisions

Stack Overflow serving 1.5B requests/month with just 9 web servers is one of the most impressive vertical scaling stories in the industry — proof that "scale horizontally" isn't always the answer.

Key Takeaways
Scalability = handling more load without breaking. It's not optional — it's survival.
Vertical scaling is simple but has a ceiling. Use it early, when your team is small.
Horizontal scaling has no ceiling but requires stateless design.
Stateless design is the prerequisite for horizontal scaling — externalize all state.
Auto-scaling automates capacity management — essential at production scale.
The database is almost always the first bottleneck when you scale app servers.
Always estimate (QPS, servers, storage) before designing. Numbers validate your design.
