DEV Community

Rupesh Konduru


What Is System Design, Really?

And why your perfectly working code can still fail spectacularly at scale.


Let me start with something honest: I used to think system design was something only senior engineers needed to worry about. Write clean code, pass the tests, ship the feature. Done.

Then I started actually thinking about what happens when your app goes from 500 users to 500,000 — and I realized good code alone doesn't save you. The structure of your system is what either holds or collapses under pressure.

This is the first post in a three-part series where I break down the foundations of system design the way I wish someone had explained them to me — through real analogies, simple diagrams, and plain English.


The Restaurant That Went Viral

Imagine you open a small restaurant. Day one, it's just you — you cook, you serve, you clean. Ten customers walk in. Everything runs smoothly. You're happy.

Now imagine a food blogger with a million followers posts about your place. The next morning, 10,000 people show up.

Suddenly you need multiple chefs. A system for taking orders without everyone shouting at once. A pantry that restocks itself. A way to handle the dinner rush without the kitchen catching fire.

System design is the art of building software that doesn't fall apart when the world shows up at your door.

That's it. That's the whole field. Everything else is just details of how to do that well.

What System Design Actually Asks

When you solve a LeetCode problem, you're asking: does this work?

When you do system design, you're asking something completely different: does this work for ten million people, reliably, cheaply, and without going down at 2am on a Sunday?

These are two different kinds of thinking. The first is about correctness. The second is about architecture — and that's what this series is about.

The two goals every system must balance:

  • Scalability — Can it handle growth?
  • Reliability — Does it keep working when things go wrong?

Most design decisions trade these two goals against each other, and both against cost. There's no perfect answer, only informed choices.


Your Starter Kit — The Building Blocks

Think of system design like LEGO. Before you build a castle, you need to know what pieces exist. Here's the vocabulary you need before anything else makes sense:

| Component     | What it does                        | Restaurant analogy                     |
|---------------|-------------------------------------|----------------------------------------|
| Client        | The browser or app making requests  | The customer walking in                |
| Server        | Processes incoming requests         | The kitchen                            |
| Database      | Stores data persistently            | The pantry and fridge                  |
| Cache         | Fast, temporary storage             | Pre-prepped ingredients on the counter |
| Load Balancer | Distributes traffic across servers  | The host who seats customers evenly    |
| Message Queue | Holds tasks to be processed later   | The order ticket rail in a diner       |

We'll go deep on each of these. For now, just know they exist and roughly what job they do.


Scaling: What Happens When Your App Blows Up

So your app got popular. Great problem to have. Now what?

You have exactly two moves. The mental model: your server is a worker in a factory.

Option 1 — Vertical Scaling

Make the worker stronger. Give your existing server more RAM, a faster CPU, more storage. It's simple and usually needs no code changes, though resizing a machine often means restarting it.

Before:  [ Server: 8GB RAM,  4 cores  ]
After:   [ Server: 64GB RAM, 32 cores ]

This works — until it doesn't. There's a physical ceiling to how powerful one machine can get. And here's the silent killer: if that one giant server goes down, everything goes down. You've built a very expensive single point of failure.

Option 2 — Horizontal Scaling

Instead of making one worker stronger, hire more workers. Add more servers and split the work between them.

Before:  [ Server 1 ]

After:   [ Server 1 ]  [ Server 2 ]  [ Server 3 ]

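This is how Google, Amazon, and Netflix operate. In principle there's no ceiling: you just keep adding machines. And if one dies, the others keep running, so no single server is a single point of failure (as long as whatever sits in front of them and behind them is redundant too).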
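To make "split the work between them" concrete, here is a minimal sketch of round-robin distribution that skips servers marked as down. The class name `RoundRobinPool`, its methods, and the server names are all hypothetical, made up for illustration. Real systems hand this job to a dedicated load balancer rather than a few lines of Python.

```python
from itertools import cycle


class RoundRobinPool:
    """Hands out servers in rotation, skipping any marked unhealthy (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Pretend a health check failed for this server.
        self.healthy.discard(server)

    def next_server(self):
        # Look at each server at most once per call, so we never loop forever.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


pool = RoundRobinPool(["server-1", "server-2", "server-3"])
first_round = [pool.next_server() for _ in range(3)]

pool.mark_down("server-2")  # one worker dies...
after_failure = [pool.next_server() for _ in range(4)]  # ...the others keep running

print(first_round)
print(after_failure)
```

After `server-2` is marked down, requests simply rotate between the two remaining servers.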

The downside? Complexity. Now you need something to coordinate these servers. And a new question emerges: if a user logs in on Server 1, does Server 3 know who they are?


The Stateless Insight That Makes It All Work

When you have multiple servers, a user might hit Server 1 on their first request and Server 3 on their next. If their login session was stored inside Server 1, Server 3 has no idea who they are.

The elegant fix: make your servers stateless. They don't remember anything about the user themselves. All session data lives in a shared database or cache that every server can reach.

❌ Stateful — bad for scaling:
User → Server 1 (remembers session) ✅
User → Server 3 (no memory)         ❌

✅ Stateless — good for scaling:
User → Server 1 → reads from shared DB ✅
User → Server 3 → reads from shared DB ✅

Every server becomes interchangeable — like identical chefs who all read from the same recipe book. It doesn't matter which one handles your order. The output is the same.
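Here is a small sketch of the same idea in code. It assumes an in-memory `SharedSessionStore` standing in for the shared database or cache, and `StatelessServer`, `login`, and `whoami` are hypothetical names invented for this example. In a real deployment the store would be a separate service that every server talks to over the network.

```python
import secrets


class SharedSessionStore:
    """Stand-in for a shared cache or database that every server can reach."""

    def __init__(self):
        self._sessions = {}

    def create(self, user_id):
        token = secrets.token_hex(16)  # random session token
        self._sessions[token] = {"user_id": user_id}
        return token

    def get(self, token):
        return self._sessions.get(token)


class StatelessServer:
    """Keeps no per-user data of its own; everything lives in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user_id):
        return self.store.create(user_id)

    def whoami(self, token):
        session = self.store.get(token)
        if session is None:
            return f"{self.name}: not logged in"
        return f"{self.name}: hello {session['user_id']}"


store = SharedSessionStore()
server_1 = StatelessServer("server-1", store)
server_3 = StatelessServer("server-3", store)

token = server_1.login("alice")   # first request lands on Server 1
reply = server_3.whoami(token)    # next request lands on Server 3
print(reply)
```

Server 3 never saw the login, yet it recognizes the user because the session lives in the store, not in any one server.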

Don't Forget: Your Database Scales Too

Here's a mistake beginners almost always make. You scale your servers to 100 instances — but they're all hammering the same single database. That database becomes your new bottleneck. You've just moved the problem downstream.

Two techniques to know for now:

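Replication: Copy your database across multiple machines. Reads get faster because they can be spread across the copies, and you get redundancy if one machine fails. (Replicas are not a substitute for backups: a bad write or accidental delete gets copied too.)

Sharding: Split your database into chunks. User IDs 1 to 1,000,000 on DB1, 1,000,001 to 2,000,000 on DB2. Each machine handles a slice of the data.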
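The range-based sharding described above can be sketched as a simple lookup. This assumes each shard holds one million consecutive user IDs, with ID 1,000,000 being the last one on DB1. The names `SHARD_SIZE`, `SHARDS`, and `shard_for` are hypothetical, and `DB3` is added only to show the pattern continuing.

```python
SHARD_SIZE = 1_000_000           # assumed: one million user IDs per shard
SHARDS = ["DB1", "DB2", "DB3"]   # DB3 is a hypothetical third shard


def shard_for(user_id):
    """Return the name of the shard that owns this user ID (IDs start at 1)."""
    if user_id < 1:
        raise ValueError("user IDs start at 1")
    index = (user_id - 1) // SHARD_SIZE
    if index >= len(SHARDS):
        raise LookupError(f"no shard configured for user {user_id}")
    return SHARDS[index]


for uid in (1, 1_000_000, 1_000_001, 2_000_000, 2_500_000):
    print(uid, "->", shard_for(uid))
```

Range sharding is easy to reason about, but if new users are the most active ones, the newest shard takes most of the traffic. That is one reason other splitting schemes exist.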

The key insight: every layer of your system can become a bottleneck, and every layer can be scaled.


The Mental Model to Keep

Whenever someone asks "how would you scale X?" — think in layers:

Traffic surge hits →
  → Scale your servers (horizontal)
  → Put a Load Balancer in front
  → Make servers stateless
  → Scale your database (replication / sharding)
  → Add a Cache to reduce DB load
  → Add a CDN for static content

Each fix reveals the next bottleneck. That's not a bug — that's the game.
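As one example of a layer in that list, here is a rough sketch of how a cache reduces database load, using the common "cache-aside" pattern: check the cache first, and only go to the database on a miss. `SlowDatabase` and `CacheAside` are hypothetical classes invented for this example, and the sketch deliberately ignores expiry and invalidation, which a real cache has to deal with.

```python
class SlowDatabase:
    """Stand-in for a real database; counts how many reads reach it."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def fetch(self, key):
        self.reads += 1
        return self._rows.get(key)


class CacheAside:
    """Check the cache first; on a miss, read the database and remember the result."""

    def __init__(self, db):
        self.db = db
        self._cache = {}

    def get(self, key):
        if key in self._cache:
            return self._cache[key]   # cache hit: the database is not touched
        value = self.db.fetch(key)    # cache miss: go to the database
        if value is not None:
            self._cache[key] = value
        return value


db = SlowDatabase({"user:1": "alice"})
reader = CacheAside(db)

values = [reader.get("user:1") for _ in range(5)]
print(values)
print("database reads:", db.reads)
```

Five requests for the same key reach the database only once. The trade-off is that the cached value can go stale if the database changes underneath it.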

Anyone can write code. Not everyone can think about what happens when 10 million people run that code simultaneously.

That's what system design is training you to do.


Next in the series → Load Balancing & Consistent Hashing — The Art of Splitting Work Fairly
