surajrkhonde

System Design From Zero: An Engineering Head Teaches His Nephew

Part 0: Why Smart Engineers Freeze in This Round

👦 Nephew: Uncle, I know Redis, MongoDB, Kafka, Docker, AWS. I've built real things. But the moment someone says "Design WhatsApp" or "Design a Notification System," my brain just... goes blank. I start talking about random technologies and it sounds unstructured.

👨‍🦳 Uncle: That's not a knowledge problem. That's a framework problem, and it's the single most common reason strong engineers fail this round. Let me show you the wrong instinct first, because you'll recognize it immediately.

The wrong mindset: the moment you hear "design X," your brain jumps straight to "which database? which cache? should I use Kafka?" That's technology-first thinking.

The right mindset: "what problem are we actually solving? what's the scale? where will this break?" Architecture comes after you answer those, never before.

👦 Nephew: So the fix isn't learning more tools?

👨‍🦳 Uncle: Correct. You already know enough tools. What you're missing is a fixed sequence you run through every single time, regardless of whether it's a URL shortener, WhatsApp, Netflix, or a notification system. Master the sequence once, and you can walk into any "design X" question without ever feeling lost, because the shape of your answer never changes, only the specific boxes inside it. That's what we're building today, completely, in one sitting.


Part 1: The 12-Step Master Framework - The Spine of Everything

👨‍🦳 Uncle: Here it is. Memorize the flow, not just the numbers.

| Step | Phase | Activity |
|------|-------|----------|
| 1 | Understand | Understand the Problem |
| 2 | Understand | Gather Requirements |
| 3 | Plan | Estimate Scale |
| 4 | Design | Design APIs |
| 5 | Design | Design Database |
| 6 | Design | High Level Design (HLD) |
| 7 | Explain | Deep Dive Components |
| 8 | Scale | Scaling Strategy |
| 9 | Robust | Reliability |
| 10 | Observe | Monitoring |
| 11 | Protect | Security |
| 12 | Discuss | Bottlenecks & Tradeoffs |

👦 Nephew: Twelve steps feels like a lot to hold in my head under interview pressure.

👨‍🦳 Uncle: Don't hold twelve numbers. Hold three phases, each with a one-line purpose:

Phase 1 - UNDERSTAND  (Steps 1-4)   "Before you build, understand what you're building."
Phase 2 - DESIGN      (Steps 5-8)   "Now design the actual system."
Phase 3 - ROBUSTIFY   (Steps 9-12)  "Now make sure it doesn't fall apart."

👨‍🦳 Uncle: Three phases, and inside each phase the steps follow naturally. Once you understand the problem, requirements follow. Once you know requirements, scale estimation follows. Once you know scale, your API and database design write themselves, because now they're grounded in real numbers instead of guesses. Never skip a step, and never change the order. Skip scale estimation and jump straight to "I'll use Kafka," and you've just guessed. A senior engineer never guesses when they could calculate.


Part 2: Phase 1 - Understand (Steps 1-4)

Step 1 - Understand the Problem

👦 Nephew: Interviewer says "Design a URL Shortener." What do I actually say first?

👨‍🦳 Uncle: You do not say "I'll use Redis." You ask questions, out loud, before touching architecture:

  • What's the main use case: is this for individual users, or a marketing team doing bulk campaigns?
  • What features are required versus optional?
  • Any constraints I should know about (must the short URL be a specific length, must it be hard to guess, etc.)?

This alone, pausing to ask instead of diving in, signals seniority before you've said a single technical word.

Step 2 - Requirements: Split Functional vs Non-Functional

👨‍🦳 Uncle: Always split these into two explicit buckets. Interviewers love this separation because it shows systematic thinking, not just listing random features.

Functional Requirements - WHAT the system does:

  • User submits a long URL
  • System generates a short URL
  • Visiting the short URL redirects to the original
  • (Optional) Analytics on clicks
  • (Optional) Expiration after 30 days

Non-Functional Requirements - HOW WELL it should behave:

  • 99.99% availability
  • Low latency (<100ms)
  • Scalable to millions of users
  • Secure (no URL guessing)

👦 Nephew: Why does splitting them matter so much? Isn't it all just "requirements"?

👨‍🦳 Uncle: Because they drive completely different architecture decisions. Functional requirements shape your API and data model. Non-functional requirements shape your infrastructure choices: availability targets push you toward replication and multi-region thinking; latency targets push you toward caching; scale targets push you toward horizontal scaling and sharding. Mixing them together is how candidates end up designing something that "does the right things" but "does them the wrong way": technically correct features, but the wrong infrastructure underneath.

Step 3 - Non-Functional Requirements, One Level Deeper

👨‍🦳 Uncle: Let's go deeper here, because this is where seniority really shows. There isn't just one non-functional requirement. There's a whole family, and different systems prioritize different members of that family:

| NFR | Question It Answers | Typical Tools |
|-----|---------------------|---------------|
| Scalability | Can it grow 10x without breaking? | Horizontal scaling, sharding, caching |
| Availability | Is the system up? (99.99% = ~52 min downtime/year) | Replicas, failover, load balancing |
| Reliability | Can users trust the system? | Transactions, replication, retry logic |
| Consistency | Do all users see the same data? (Strong vs Eventual) | ACID DB, leader-based writes, quorum |
| Latency | How fast is the response? | Caching, CDN, database indexes |
| Throughput | How many requests/second? | Message queues, load balancing |
| Fault Tolerance | What happens if a component fails? | Replication, backup, failover |
| Durability | Does data survive crashes? | Persistent storage, backups, WAL |
| Security | Protected from attacks? | HTTPS, encryption, rate limiting, auth |
| Maintainability | Can engineers easily modify it? | Microservices, CI/CD, monitoring |
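
The "99.99% = ~52 min downtime/year" figure in the Availability row is simple arithmetic, and it is worth being able to reproduce for other targets. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Each extra "nine" cuts the downtime budget by a factor of ten.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:,.1f} minutes/year")
```

Seen this way, moving from 99.9% to 99.99% is not a small tweak: the yearly budget drops from hours to under an hour, which is what pushes a design toward replicas and automated failover.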

Memory trick: S-A-C-L-R-F-D-S (Scalability, Availability, Consistency, Latency, Reliability, Fault tolerance, Durability, Security). It covers eight of the ten rows above; Throughput and Maintainability are the two it leaves out.

👦 Nephew: Do I need all ten for every system?

👨‍🦳 Uncle: No, and knowing which ones matter most for this specific system is exactly what separates mid-level from senior. Look at this comparison:

| System | Most Important NFRs | Why |
|--------|---------------------|-----|
| WhatsApp | Scalability, Availability, Latency | Billions of messages, can't afford downtime, must feel instant |
| Netflix | Availability, Scalability, Bandwidth | Millions streaming simultaneously, video needs huge bandwidth |
| Banking | Consistency, Durability, Security | Money is involved: zero tolerance for data loss or inconsistency |
| Trading Platform | Consistency, Latency, Reliability | Milliseconds matter, accuracy is everything |
| URL Shortener | Availability, Scalability, Latency | Simple functionality, but must handle massive, simple traffic |

👨‍🦳 Uncle: Say this out loud: "for a banking system, I'd prioritize strong consistency and durability over raw availability, because losing or double-counting money is unacceptable, even if it costs us a bit of uptime." That one sentence tells the interviewer you actually understand tradeoffs, not just terminology.

Step 4 - Scale Estimation: The Step Everyone Skips (Don't)

👨‍🦳 Uncle: This is, without exaggeration, the single most important step in the entire framework. Most candidates skip straight from requirements to architecture. That's the mistake that separates a mid-level answer from a senior one. You cannot choose the right architecture without knowing the numbers first.

Ask these before calculating anything:

  • How many Daily Active Users (DAU)?
  • How many requests per day?
  • What's the read/write ratio?
  • How much storage per record?

Why this single step tells you almost everything:

  • Immediately tells you if you need caching
  • Tells you your database size, and whether one server can hold it
  • Tells you your read/write pattern (read-heavy? write-heavy?)
  • Tells you if a single server is even viable, or if you need to design for distribution from day one

We'll do the actual math in the next section, but understand why it comes first: everything after this step (API design, database choice, caching strategy, number of servers) is a direct consequence of these numbers, not a separate creative decision.


Part 3: Back-of-the-Envelope Calculations - The Skill That Makes You Sound Senior Instantly

👨‍🦳 Uncle: Before you ever say "Redis," "MongoDB," "Kafka," or "CDN" out loud, you should be able to answer: how many users? How many requests? How much storage? How much bandwidth? How much memory? Otherwise you're designing blind: guessing dressed up as confidence.

Memorize these cold: memory units

| Unit | Bytes | Everyday Example |
|------|-------|------------------|
| 1 KB | 10³ = 1,000 | Small text file |
| 1 MB | 10⁶ = 1,000,000 | A song, small image |
| 1 GB | 10⁹ = 1,000,000,000 | A movie, many songs |
| 1 TB | 10¹² = 1,000,000,000,000 | A large database |

And one time constant you'll use in nearly every calculation:

1 Day = 86,400 seconds

👦 Nephew: Why 86,400 specifically? Where does that even come from?

👨‍🦳 Uncle: 24 hours × 60 minutes × 60 seconds = 86,400. Simple, but if you fumble this number live in an interview trying to do 24×60×60 in your head under pressure, it costs you composure at the worst possible moment. Just memorize it as a fact.

The Five Formulas You Must Know By Heart

Formula 1 - QPS (Queries Per Second), the most important one:

QPS = Total Requests Per Day / 86,400

Example: 100 million requests/day → 100,000,000 / 86,400 ≈ 1,157 QPS

Peak traffic rule: always design for peak, never average:

Peak QPS = Average QPS × 3 to 5

Example: 1,157 × 5 = 5,785 QPS

👨‍🦳 Uncle: Why multiply by 3-5? Because average traffic hides the truth: real traffic isn't smooth, it spikes around specific hours, campaigns, or events. If you design a system that can only handle the average, the very first traffic spike takes it down. Designing for peak is designing for reality, not for a spreadsheet.

Formula 2 - Storage per day:

Storage = Records Per Day × Size Per Record

Example: 10 million new URLs/day × 500 bytes = 5 GB/day

Formula 3 - Yearly storage:

Yearly Storage = Daily Storage × 365

Example: 5 GB/day × 365 ≈ 1.8 TB/year

👦 Nephew: Why does yearly storage matter so much? Isn't 5 GB/day tiny?

👨‍🦳 Uncle: Because a single day's number looks harmless, and that's exactly the trap. 1.8 TB/year tells you something concrete: somewhere around year 2-3 (roughly 3.6-5.5 TB accumulated), you'll likely need to plan for sharding or a distributed database, because a single machine's disk starts becoming a real constraint. This number is what lets you say, confidently, in an interview: "for year one, a single well-provisioned database is fine; by year three, we'll need to shard." That's a genuinely senior-sounding sentence, backed by actual math, not vibes.

Formula 4 - Bandwidth:

Bandwidth = Peak QPS × Response Size

Example: 5,000 QPS × 2 KB = 10 MB/sec

👨‍🦳 Uncle: This tells you whether your network card, your load balancer, and your CDN strategy can actually keep up. It's especially critical for anything media-heavy (we'll see this explode in the Netflix example shortly).

Formula 5 - Cache memory:

Cache Memory = Hot Data Size × 1.2 (overhead factor)

Example: 10 million hot URLs × 500 bytes = 5 GB → × 1.2 = 6 GB, so provision a 6-8 GB Redis instance

👦 Nephew: Why the 1.2 overhead factor?

👨‍🦳 Uncle: Caches like Redis don't store just your raw data: there's metadata, key overhead, and internal data structure costs. The 1.2 multiplier is a practical safety margin so you don't under-provision and hit unexpected evictions the moment real traffic arrives.

The 7 Numbers You Should Calculate, Every Single Time

1. DAU (Daily Active Users)
2. Total Requests/day
3. Read Requests/day
4. Write Requests/day
5. Read QPS (average)
6. Peak QPS (for capacity planning)
7. Storage/year

👨‍🦳 Uncle: Build the habit of running through exactly these seven, in this order, for every design question, without exception: URL shortener, chat app, notification system, rate limiter, file upload, it doesn't matter. This becomes muscle memory, and muscle memory is what survives interview nerves.
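
The five formulas are mechanical enough to put into one small function. The sketch below is an illustration only: the function and field names are made up for this example, the peak factor of 5 and the 1.2 cache overhead are the rules of thumb from this section, and the inputs are assumed numbers for a URL shortener (10M new URLs/day, 100M redirects/day, 500 bytes per record, 10M hot records).

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    write_qps: float
    read_qps: float
    peak_read_qps: float
    daily_storage_gb: float
    yearly_storage_tb: float
    peak_bandwidth_mb_s: float
    cache_gb: float


def estimate(writes_per_day, reads_per_day, record_bytes, hot_records,
             peak_factor=5, cache_overhead=1.2):
    """Apply the back-of-the-envelope formulas using decimal units (1 GB = 10^9 B)."""
    read_qps = reads_per_day / SECONDS_PER_DAY
    peak_read_qps = read_qps * peak_factor
    daily_bytes = writes_per_day * record_bytes
    return Estimate(
        write_qps=writes_per_day / SECONDS_PER_DAY,
        read_qps=read_qps,
        peak_read_qps=peak_read_qps,
        daily_storage_gb=daily_bytes / 1e9,
        yearly_storage_tb=daily_bytes * 365 / 1e12,
        peak_bandwidth_mb_s=peak_read_qps * record_bytes / 1e6,
        cache_gb=hot_records * record_bytes * cache_overhead / 1e9,
    )


url_shortener = estimate(
    writes_per_day=10_000_000,
    reads_per_day=100_000_000,
    record_bytes=500,
    hot_records=10_000_000,
)
print(url_shortener)
```

The results differ from hand calculations only by rounding: the code multiplies the unrounded average QPS by the peak factor, while mental math usually rounds to 1,157 first.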

Seeing It All Together - Three Full Worked Examples

👨‍🦳 Uncle: Numbers in isolation don't teach you anything. Let's see how the same seven-number process produces completely different architectures depending on the system.

Example A - URL Shortener (a Read-Heavy System)

Assumptions: 10M users, 10M new URLs/day, 100M redirects/day, 500 bytes/record.

| Metric | Calculation | Result | Implication |
|--------|-------------|--------|-------------|
| Write QPS | 10M / 86,400 | ~116 QPS | One server handles this easily |
| Read QPS | 100M / 86,400 | ~1,157 QPS | Single server maxes around 1,000-5,000 QPS |
| Peak Read QPS | 1,157 × 5 | ~5,785 QPS | Need multiple servers + cache |
| Daily Storage | 10M × 500 B | 5 GB/day | Small for one day |
| Yearly Storage | 5 GB × 365 | ~1.8 TB/year | Sharding relevant around year 2-3 |
| Bandwidth | 5,785 × 500 B | ~3 MB/sec | Trivial for a normal network |
| Cache Size | 10M × 500 B × 1.2 | ~6 GB (provision 6-8 GB) | A standard Redis instance |

Architecture decision this produces: a cache-heavy system, with Redis for hot URLs, PostgreSQL for reliable storage, 3-5 API servers for peak load, and read replicas for scaling reads further.

Example B - WhatsApp (a Write-Heavy System)

Assumptions: 1 billion users, 1 billion messages/day, 1 KB/message.

| Metric | Calculation | Result | Implication |
|--------|-------------|--------|-------------|
| Write QPS | 1B / 86,400 | ~11,574 QPS | Huge: more than one server can absorb |
| Peak QPS | 11,574 × 5 | ~57,870 QPS | Need 10-50+ servers minimum |
| Daily Storage | 1B × 1 KB | 1 TB/day | Massive |
| Yearly Storage | 1 TB × 365 | 365 TB/year | A distributed database is mandatory, not optional |

Architecture decision this produces: Kafka to queue and decouple message writes, Cassandra for distributed, high-write-throughput storage, sharding by userId with consistent hashing, 50+ API servers, WebSockets for real-time delivery.

👦 Nephew: So the exact same 7-number process, on a different system, points you toward a totally different toolset, without me ever having to "guess" which technology sounds impressive?

👨‍🦳 Uncle: That's the entire point of this discipline. The numbers are the design decision. You're not choosing Kafka because it's trendy. You're choosing it because 57,870 peak write QPS, sustained, calls for a queue-and-distribute approach; a single database cannot comfortably absorb that write rate directly.

Example C - Netflix (a Bandwidth-Heavy System)

Assumptions: 10 million concurrent users, 5 Mbps per video stream.

| Metric | Calculation | Result | Implication |
|--------|-------------|--------|-------------|
| Bandwidth | 10M × 5 Mbps | 50 Tbps | Literally impossible to serve from one origin |

Architecture decision this produces: this single number rules out "just a bigger server" entirely. It forces a CDN (Akamai, CloudFront) with regional edge caching so videos are served from servers physically near each user; a lightweight "control plane" for metadata (small, fast) kept separate from the "data plane" for actual video delivery (huge, offloaded to the CDN); and adaptive bitrate streaming to adjust quality per connection.

👨‍🦳 Uncle: Notice that for Netflix, we didn't even need to discuss database schema first. The bandwidth number alone was so extreme it dictated the entire shape of the architecture before anything else mattered. This is why scale estimation comes right after requirements in the framework: sometimes one number single-handedly eliminates most of your design options before you've drawn a single box.


Part 4: Phase 2 - Design (Steps 5-8)

Step 5 - Design APIs Before the Database

👨‍🦳 Uncle: Define your API contracts before your database schema. Why this order? Because the API describes what the outside world actually needs from your system, and that's what should shape your data model, not the other way around. Designing the database first often leads to a schema that's convenient to build but awkward for the actual use cases.

| Endpoint | Method | Request | Response |
|----------|--------|---------|----------|
| /shorten | POST | {"url": "https://google.com"} | {"shortUrl": "abc123"} |
| /:shortCode | GET | - | 301 Redirect → original URL |
| /stats/:shortCode | GET | - | {"clicks": 1000, "created": "..."} |

👦 Nephew: Why 301 for the redirect specifically?

👨‍🦳 Uncle: A 301 is a permanent redirect. Browsers and search engines are allowed to cache it, meaning repeat visits to the same short URL can skip your server entirely after the first hit. This one HTTP status code choice is itself a small but real caching decision, baked directly into your API design. The tradeoff: a cached redirect never reaches your server, so if click analytics is a hard requirement, a temporary redirect (302) is the usual choice, because every visit then comes back to you and can be counted.
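
To make the contract concrete, here is a framework-free sketch of the three endpoints as plain functions. Everything in it is an assumption for illustration: the in-memory `store` dict stands in for the database, the function names are invented, and the random 8-character code is just one way to get a hard-to-guess short code (the article does not specify how codes are generated).

```python
import secrets
import string
from datetime import datetime, timezone

ALPHABET = string.ascii_letters + string.digits  # 62 URL-safe characters
store = {}  # short code -> record; a stand-in for the real database


def _random_code(length):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def shorten(url, length=8):
    """POST /shorten: create a random, hard-to-guess short code."""
    code = _random_code(length)
    while code in store:  # retry on the (rare) collision
        code = _random_code(length)
    store[code] = {
        "url": url,
        "clicks": 0,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    return {"shortUrl": code}


def redirect(short_code):
    """GET /:shortCode: return (status, headers) for the redirect."""
    record = store.get(short_code)
    if record is None:
        return 404, {}
    record["clicks"] += 1
    return 301, {"Location": record["url"]}


def stats(short_code):
    """GET /stats/:shortCode: return click count and creation time."""
    record = store.get(short_code)
    if record is None:
        return None
    return {"clicks": record["clicks"], "created": record["created"]}
```

Note that the click counter only sees requests that actually reach the server, which is exactly why the choice of redirect status code interacts with analytics.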

Step 6 - Design the Database

👨‍🦳 Uncle: Now think about what data must actually be stored, and exactly how big each field is, because this feeds directly back into your storage calculations from Step 4.

| Field | Type | Size | Index? |
|-------|------|------|--------|
| _id | UUID | 16 B | Primary Key |
| shortCode | String(8) | 8 B | Unique Index |
| originalUrl | String(2048) | ~300 B | No |
| userId | UUID | 16 B | Index |
| createdAt | Timestamp | 8 B | No |
| expiresAt | Timestamp | 8 B | Index |
| Total per record | | ~356 B of fields, budgeted as ~500 B with overhead | |

👦 Nephew: Why index expiresAt? Nobody's directly querying "give me the record that expires at this exact millisecond."

👨‍🦳 Uncle: Good instinct to question it. The real use case is a background cleanup job running periodically, querying "find all records where expiresAt < now." Without an index on that field, that query becomes a full table scan across your entire dataset every time it runs, which gets painfully slow as your table grows into the millions. This is exactly the kind of quiet, unglamorous indexing decision that separates a schema that scales gracefully from one that silently degrades over a year.
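
The cleanup job is easy to try locally. The sketch below uses SQLite from the Python standard library purely as a convenient stand-in for the real database; the table and column names are assumptions modeled on the schema above, and timestamps are stored as Unix seconds for simplicity.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        short_code   TEXT PRIMARY KEY,
        original_url TEXT NOT NULL,
        expires_at   INTEGER NOT NULL  -- Unix timestamp in seconds
    )
""")
# The index that lets the cleanup query avoid scanning every row.
conn.execute("CREATE INDEX idx_urls_expires_at ON urls (expires_at)")

now = int(time.time())
conn.executemany(
    "INSERT INTO urls VALUES (?, ?, ?)",
    [
        ("old00001", "https://example.com/a", now - 60),    # already expired
        ("new00001", "https://example.com/b", now + 3600),  # still valid
    ],
)
conn.commit()


def delete_expired(connection, current_time):
    """Background cleanup: remove every record whose expiry has passed."""
    cursor = connection.execute(
        "DELETE FROM urls WHERE expires_at < ?", (current_time,)
    )
    connection.commit()
    return cursor.rowcount


# Ask the database how it would find expired rows, then run the cleanup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT short_code FROM urls WHERE expires_at < ?",
    (now,),
).fetchall()
print(plan)
deleted = delete_expired(conn, now)
print(f"deleted {deleted} expired record(s)")
```

Printing the query plan before and after dropping the index is a quick way to see the difference between an index search and a full scan on your own machine.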

SQL or NoSQL - The Actual Decision Tree

👨‍🦳 Uncle: This question comes up in nearly every design interview, and most candidates answer it with a vibe ("SQL feels more reliable") instead of a reason. Here's the actual decision tree:

Need relationships / multi-table joins / transactions across entities?
      → SQL (PostgreSQL): ACID guarantees, strong consistency

Need massive horizontal scale (billions of records, simple access patterns)?
      → NoSQL (MongoDB, Cassandra): flexible schema, scales out easily

Need both: some strongly relational data AND some massive-scale data?
      → Hybrid, e.g. PostgreSQL for core transactional data (users, billing),
        a NoSQL store for high-volume, simpler data (activity logs, messages)

A quick reference decision tree for the broader toolkit, not just databases:

Need data relationships?              → SQL (PostgreSQL)
Need massive scale (billions)?        → NoSQL (MongoDB, Cassandra)
Read-heavy (e.g. 10:1 ratio)?         → Add a Redis cache
Write-heavy (e.g. near 1:1 ratio)?    → Message queue (Kafka)
Need instant, real-time updates?      → WebSockets + Redis pub/sub
Need to handle 10x growth?            → Horizontal scaling + sharding
Need <100ms latency?                  → In-memory cache + CDN
Need ACID guarantees?                 → SQL with transactions
Can tolerate occasional data loss?    → Redis alone is fine
Cannot afford any data loss?          → Persistent storage + backups

👦 Nephew: Can I just say "MongoDB" every time to sound modern?

👨‍🦳 Uncle: Please don't. Saying "MongoDB because it scales" without being able to explain why your specific access pattern needs that scale, and what you're giving up (joins, strong consistency by default) to get it, is exactly the technology-first thinking we started this whole conversation warning you against. Always justify the choice with the actual requirement it satisfies.

Step 7 - High Level Design (HLD): Draw the Big Boxes First

👨‍🦳 Uncle: Now, and only now, draw the big picture. Don't explain internals yet; that's the next step. HLD is about showing the overall shape of the system.

   Client (Web / Mobile)
          │
          ▼
   Load Balancer (Nginx)
          │
   ┌──────┼──────┐
   ▼      ▼      ▼
[API-1] [API-2] [API-3]
          │
          ▼
   Redis Cache (Hot URLs)
          │
          ▼
   Primary DB  ⇄  Replica DB
          │
          ▼
   Backup (Daily Snapshot)

👨‍🦳 Uncle: Every box on this diagram exists to answer one specific number or requirement from the earlier steps. The Load Balancer exists because peak QPS exceeds what one server can handle; Redis exists because read QPS is high and reads are repeated (hot URLs); the Replica exists because availability requirements demand no single point of failure; the Backup exists because durability requirements demand recoverability. If you can't point to the number or requirement that justifies a box, that box probably shouldn't be on your diagram yet.

Step 8 - Deep Dive Each Component

👨‍🦳 Uncle: Now, explain how each box actually works internally.

Load Balancer:

  • Distributes requests, typically round-robin or least-connections
  • If a server goes down, it auto-detects (via health checks) and stops routing to it: automatic failover
  • Can use hash-based routing for session stickiness when needed
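
The load balancer bullets above can be sketched in a few lines. This is a toy model with invented class and method names, not how Nginx is configured: it rotates through servers round-robin and skips any server that a health check has marked as down.

```python
import itertools


class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        """Called when a health check fails."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Called when a health check passes again."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["api-1", "api-2", "api-3"])
print([balancer.next_server() for _ in range(3)])
balancer.mark_down("api-2")
print([balancer.next_server() for _ in range(3)])
```

A real balancer runs the health checks itself on a timer; here `mark_down` and `mark_up` are called by hand to keep the sketch short.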

Cache (Redis):

  • Stores hot (frequently accessed) URLs
  • TTL (time-to-live) around 1 day, tuned to your data's actual staleness tolerance
  • Eviction policy: LRU (Least Recently Used) when memory is full
  • Request flow: check Redis first → found? return it. Not found? → query the database → store the result in Redis → return it. This exact pattern is called cache-aside, and it's the most common caching pattern you'll use.
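
The cache-aside request flow can be sketched without a running Redis. In the sketch below, `TTLCache` is a small in-memory stand-in written for this example (not a Redis client), `db` is a plain dict standing in for the database, and the injectable `clock` exists only so expiry can be demonstrated without waiting a day.

```python
import time


class TTLCache:
    """Minimal in-memory cache where every entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:  # entry is stale: drop it
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)


def resolve(short_code, cache, db):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    url = cache.get(short_code)
    if url is not None:
        return url, "cache"
    url = db.get(short_code)
    if url is not None:
        cache.set(short_code, url)  # populate so the next read is a hit
    return url, "database"


db = {"abc123": "https://google.com"}
cache = TTLCache(ttl_seconds=86_400)  # roughly the 1-day TTL mentioned above
print(resolve("abc123", cache, db))  # first read goes to the database
print(resolve("abc123", cache, db))  # second read is served from the cache
```

The application owns the logic here, which is the defining trait of cache-aside: the cache never talks to the database on its own.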

Database:

  • The permanent source of truth: Redis can lose data (it's a cache), the database cannot
  • Indexes on shortCode, userId, expiresAt
  • A background job periodically deletes expired records

Replication:

  • Primary handles all writes; Replicas handle reads
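  • If the primary fails, a replica is promoted to become the new primary (automatically, when a failover manager is in place)
  • Replicas stay in sync by replaying the primary's change log: WAL streaming in PostgreSQL, the binary log (binlog) in MySQL. Replication is usually asynchronous, so a replica can lag slightly behind the primary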

Part 5: Phase 3 - Robustify (Steps 9-12)

Step 9 - Scaling Strategy

👨‍🦳 Uncle: Interviewers expect you to raise this without being asked. Two categories:

Vertical Scaling: add more CPU/RAM to the existing machine. Simple, but limited: you hit a ceiling fast, and the cost curve is brutal ($10K server → $50K server for incremental gains).

Horizontal Scaling: add more machines instead. Preferred, because it scales much further, and commodity servers are cheap: "10 servers instead of 1 huge server."

Database scaling specifically has its own two levers:

  • Read replicas: scale read QPS by adding more replicas that only serve reads
  • Sharding: scale write QPS by splitting data across multiple independent databases, typically by a hash of some key (like userId). Each shard commonly runs its own primary + replicas setup internally.

👦 Nephew: When do I reach for sharding versus just adding more read replicas?

👨‍🦳 Uncle: Read replicas solve read bottlenecks. They don't help at all if your writes are the problem, because all replicas still ultimately depend on one primary absorbing every write. When writes themselves exceed what a single primary can handle (like our WhatsApp example, with 57,870 peak write QPS), replicas can't save you; you need sharding, which splits the write load itself across multiple independent primaries.
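
Routing "by a hash of some key" can be shown in a few lines. This sketch uses the simplest scheme, hash modulo shard count; the function name and the choice of SHA-256 are assumptions for illustration. A stable hash is used on purpose, because Python's built-in `hash()` for strings changes between processes and would send the same user to different shards after a restart.

```python
import hashlib
from collections import Counter


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same user always lands on the same shard.
print(shard_for("user-42", 8), shard_for("user-42", 8))

# Many users spread roughly evenly across shards.
counts = Counter(shard_for(f"user-{i}", 8) for i in range(10_000))
print(sorted(counts.items()))
```

The weakness of plain modulo is resharding: changing `num_shards` remaps most keys. Consistent hashing, mentioned in the WhatsApp example, is the usual way to limit how many keys move when shards are added or removed.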

Step 10 - Reliability & Failure Handling

👨‍🦳 Uncle: Most candidates forget this entirely. Senior engineers ask, unprompted: "what if this specific component crashes?" for every component on the diagram.

| Failure Scenario | Solution |
|------------------|----------|
| API server crashes | Load balancer detects it via health checks, reroutes to healthy servers |
| Cache (Redis) dies | Fall back directly to the database (slower, but the system still works) |
| Primary DB fails | A replica is automatically promoted to primary |
| Network partition | Message queue buffers requests; retry once connectivity returns |
| Data corruption | Point-in-time recovery from backup |
| Cascading failure | Circuit breaker stops sending requests to the already-struggling component, giving it room to recover instead of piling on more load |

👦 Nephew: What's a "cascading failure," actually, in plain terms?

👨‍🦳 Uncle: Imagine your database is struggling and responding slowly. Every API server, instead of backing off, keeps sending it more requests (because users keep retrying), which makes the database even slower, which makes users retry even more: a downward spiral that takes the entire system down over one struggling component. A circuit breaker watches for repeated failures from a component and, past a threshold, temporarily stops sending it traffic entirely, like a real household circuit breaker cutting power before a small fault becomes a fire.
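
A circuit breaker is small enough to sketch in full. The version below is a simplified illustration with invented names and thresholds, not any particular library's implementation: it counts consecutive failures, fails fast while open, and lets one trial call through after a cool-down (often called the half-open state).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast, do not touch the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each call to the struggling dependency, for example `breaker.call(query_database, short_code)` where `query_database` is whatever function talks to the database, and treat `CircuitOpenError` as a signal to serve a fallback instead of retrying.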

Step 11 - Monitoring & Observability, Including Logging and Tracing

👨‍🦳 Uncle: You wanted specific depth here, so let's actually go deep. This is the step that turns "it works on my laptop" into "we know exactly what's happening in production, right now, and can find any bug within minutes."

Metrics - the numbers, tracked over time (Prometheus):

  • QPS (requests/sec)
  • Latency, and crucially percentiles, not just averages: p50, p95, p99. A p99 of 3 seconds means 1% of requests take 3 seconds or longer, a real, painful problem an average number would completely hide.
  • Error rate
  • Cache hit ratio
  • Database connection pool usage
  • CPU, Memory, Disk
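
To see why percentiles matter more than averages, here is a small sketch with made-up latency samples: 98 fast requests and 2 very slow ones. The `percentile` helper is a simple nearest-rank implementation written for this example; monitoring systems typically estimate percentiles from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 98 requests at 40 ms, 2 requests at 3,000 ms.
latencies_ms = [40] * 98 + [3000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms} ms")
print(f"p50  = {percentile(latencies_ms, 50)} ms")
print(f"p95  = {percentile(latencies_ms, 95)} ms")
print(f"p99  = {percentile(latencies_ms, 99)} ms")
```

The mean comes out just under 100 ms, which looks healthy, while p99 exposes the 3-second requests that real users are hitting.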

Visualization (Grafana): turns those raw metrics into real-time dashboards, with alert thresholds such as "page someone if error rate exceeds 2% for more than 5 minutes."

Logs - the detailed, per-event record of what actually happened (the ELK stack):

Application logs (e.g. via Pino)
        │
        ▼
     Logstash        (collects and processes logs from everywhere)
        │
        ▼
   Elasticsearch      (stores and indexes logs for fast search)
        │
        ▼
      Kibana          (search and visualize logs instantly)

👦 Nephew: How is this different from metrics? Aren't they both "monitoring"?

👨‍🦳 Uncle: Metrics tell you that something's wrong and roughly where: "error rate on /shorten jumped to 8% at 3:14 PM." Logs tell you exactly what happened, to which specific request, with which specific data. You search Elasticsearch for all logs tied to that endpoint in that time window and read the actual error messages and stack traces. Metrics narrow the search; logs solve the case. This is the exact same principle from our earlier debugging conversations, just now applied at the system-design level instead of the single-request level.

Tracing (Jaeger) - following one request across many services:

👨‍🦳 Uncle: In a system with multiple services (API server → cache → database → maybe a downstream notification call), a single slow user request might touch five different components. Distributed tracing assigns one unique trace ID to that request at the very start, and every service it touches records its own timing against that same trace ID. This lets you reconstruct the entire journey of one specific slow request and see exactly which one of the five components was the actual bottleneck, instead of guessing by staring at five separate dashboards trying to correlate timestamps by eye.

Trace ID: abc-123
  ├─ API Server:     12ms
  ├─ Redis lookup:    2ms
  ├─ Database query: 340ms  ← the bottleneck, immediately obvious
  └─ Response sent:   4ms

👨‍🦳 Uncle: In an interview, mentioning tracing specifically (not just "logs and metrics") is a strong signal: it suggests you've actually operated a real multi-service system in production, not just read about monitoring in theory.
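
The core idea of a trace, one ID plus a timed span per component, fits in a short sketch. This is a hand-rolled illustration with invented names; it is not the Jaeger or OpenTelemetry API, and it runs in a single process, whereas real tracing propagates the trace ID across services in request headers. The `time.sleep` calls simulate work.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects timed spans that all share one trace ID."""

    def __init__(self, trace_id=None):
        self.trace_id = trace_id or str(uuid.uuid4())
        self.spans = []  # list of (name, duration_ms)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append((name, elapsed_ms))

    def slowest(self):
        """Return the (name, duration_ms) of the slowest span."""
        return max(self.spans, key=lambda span: span[1])


trace = Trace()
with trace.span("redis_lookup"):
    time.sleep(0.001)  # simulated fast cache lookup
with trace.span("database_query"):
    time.sleep(0.05)   # simulated slow query

print(f"Trace ID: {trace.trace_id}")
for name, duration_ms in trace.spans:
    print(f"  {name}: {duration_ms:.1f}ms")
print("bottleneck:", trace.slowest()[0])
```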

Step 12 - Security

| Concern | Tool/Approach |
|---------|---------------|
| Authentication | JWT, OAuth, or Session-based |
| Authorization | Role-Based Access Control |
| Rate Limiting | e.g. 100 requests/minute per user |
| Encryption in transit | HTTPS (TLS) |

👨‍🦳 Uncle: Always mention this, even briefly. Interviewers specifically listen for it, and its absence is noticed even if everything else was excellent.
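
The "100 requests/minute per user" rule can be sketched as a fixed-window counter. The class name and in-memory storage are assumptions for illustration; a production limiter would typically keep its counters in a shared store such as Redis with expiring keys, and this sketch never cleans up old windows.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user in each fixed time window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._counts = defaultdict(int)  # (user_id, window index) -> count

    def allow(self, user_id):
        window_index = int(self.clock() // self.window_seconds)
        key = (user_id, window_index)
        if self._counts[key] >= self.limit:
            return False  # over the limit: the API would answer HTTP 429
        self._counts[key] += 1
        return True


limiter = FixedWindowRateLimiter(limit=100, window_seconds=60)
results = [limiter.allow("user-1") for _ in range(101)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

Fixed windows are the simplest scheme but allow a burst of up to twice the limit across a window boundary; sliding-window and token-bucket limiters trade a little complexity for smoother behavior.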

Step 13 (Really Step 12) - Bottlenecks & Tradeoffs: The Senior-Level Closer

👨‍🦳 Uncle: This is the single most important discussion of the entire interview, and it's where junior and senior answers diverge the most sharply.

Junior engineer thinking: "Use Redis. Use MongoDB. Done."

Senior engineer thinking: "What happens if Redis fails? What happens if MongoDB fails? What happens if an entire region goes down? How will we monitor it? How will we scale it? How will we recover the data? How much traffic can it actually handle before it breaks?"

Where bottlenecks typically appear, and their remedies:

| Component | Problem | Solution |
|-----------|---------|----------|
| Database | Too many reads | Add cache + read replicas |
| Database | Too many writes | Sharding |
| Database | Large result sets | Pagination |
| Network | Large payloads | Compression (gzip) + CDN for geographically distributed users |
| Network | Bandwidth exceeds capacity | Rate limiting |
| CPU | Heavy processing | Background jobs (Kafka-driven workers) |
| CPU | Auth overhead | JWT (no database lookup needed per request, unlike sessions) |

And always, always mention the CAP theorem tradeoff explicitly:

👨‍🦳 Uncle: The CAP theorem says a distributed system cannot guarantee all three of Consistency, Availability, and Partition tolerance at once. Partition tolerance isn't really optional in a real distributed system, so in practice the choice is what to do during a network partition: stay consistent or stay available. Most systems lean toward availability, accepting eventual consistency, because a slightly stale response is usually better for the user than no response at all. Banking systems are the notable exception: they lean toward consistency, because a stale balance is genuinely dangerous, and they accept the availability cost.

👦 Nephew: So ending my answer with "here's where this breaks first, and here's the tradeoff we're accepting" is actually the strongest way to finish?

👨‍🦳 Uncle: Exactly. It shows the interviewer you're not just capable of building the happy path; you already know where the cracks will appear before they've asked. That's the entire senior-versus-mid-level difference, distilled into one closing statement.


Part 6: Caching, Properly โ€” The Piece Every Design Leans On

๐Ÿ‘ฆ Nephew: We used Redis in almost every example. Can we go deeper on caching itself?

๐Ÿ‘จโ€๐Ÿฆณ Uncle: Should have known you'd ask โ€” caching deserves its own deep dive, since it's the single highest-leverage tool in nearly every system design answer.

What to cache: hot data โ€” things read far more often than they're written or changed. A hot URL redirect, a user's profile info, a product page. Never cache things that change every request or are highly personalized and rarely re-read.

Caching patterns:

  • Cache-aside (the one we used above): the application checks the cache first; on a miss, it reads from the database and populates the cache. Simple, and the most common pattern by far.
  • Write-through: every write goes to both the cache and the database synchronously, and is acknowledged only after both are updated. This keeps them in sync at the cost of slightly slower writes.
  • Write-behind (write-back): writes go to the cache immediately and are asynchronously flushed to the database later. Fast writes, but a real risk of data loss if the cache fails before the flush happens. Use it only when that risk is genuinely acceptable.
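The three patterns above can be sketched in a few lines. This is a minimal illustration, not a production cache: `SlowDatabase`, `cache_aside_read`, `write_through`, and `WriteBehindCache` are hypothetical names, and plain dictionaries stand in for Redis and the real database.

```python
class SlowDatabase:
    """Stand-in for a real database: a dict plus a read counter."""

    def __init__(self):
        self.rows = {}
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


def cache_aside_read(cache, db, key):
    """Cache-aside: check the cache; on a miss, read the DB and populate the cache."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def write_through(cache, db, key, value):
    """Write-through: the write is done only after DB and cache are both updated."""
    db.put(key, value)
    cache[key] = value


class WriteBehindCache:
    """Write-behind: writes land in the cache and reach the DB only on flush()."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet persisted

    def write(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in sorted(self.dirty):
            self.db.put(key, self.cache[key])
        self.dirty.clear()


db = SlowDatabase()
db.put("user:1", "Asha")
cache = {}
first = cache_aside_read(cache, db, "user:1")   # miss: goes to the database
second = cache_aside_read(cache, db, "user:1")  # hit: served from the cache
```

Anything still in `dirty` when the process dies is lost, which is exactly the write-behind risk described above.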

TTL (Time To Live) strategy: every cached item should expire. Too short, and you lose most of the caching benefit (constant cache misses). Too long, and users see stale data. Tune it to how frequently the underlying data actually changes.
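A small sketch of how TTL expiry behaves, assuming a hypothetical `TTLCache` class with an injectable clock so the expiry can be shown without actually waiting. In Redis you would normally set the expiry on the key itself rather than implement it by hand.

```python
import time


class TTLCache:
    """Each entry remembers when it expires; expired entries count as misses."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        return value


now = [0.0]  # fake clock for the demo, in seconds
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("url:abc", "https://example.com")
fresh = cache.get("url:abc")    # still inside the 60-second TTL
now[0] = 61.0
expired = cache.get("url:abc")  # past the TTL, so this is a miss
```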

Eviction policy: when the cache is full, something has to go. LRU (Least Recently Used) is the most common default. It evicts whatever hasn't been touched in the longest time, on the reasonable assumption that recently-used data is more likely to be needed again soon.
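LRU eviction is easy to see in a toy version. The sketch below assumes a hypothetical `LRUCache` class built on the standard library's `OrderedDict`; real caches such as Redis implement eviction internally and, in Redis's case, approximately.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` items; evicts the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now the most recently used
cache.set("c", 3)  # cache is full, so "b" is evicted
```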

👦 Nephew: What if the cache and the database disagree? The classic stale data problem we talked about with React and RTK Query?

👨‍🦳 Uncle: Exact same root problem, one level down the stack, and the same discipline applies. Whenever the underlying data changes, explicitly invalidate (delete or update) the corresponding cache key at that moment, rather than relying only on TTL expiry to eventually catch up. This is the system-design-level version of the "invalidate the tag when it changes" lesson from RTK Query: the specific tool changes, but the discipline is identical wherever you find a caching layer.
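Invalidate-on-write can be sketched like this. The function names and the `profile:{id}` key format are assumptions for illustration, and dictionaries stand in for the database and the cache.

```python
def read_profile(db, cache, user_id):
    """Cache-aside read: populate the cache on a miss."""
    key = f"profile:{user_id}"
    if key not in cache:
        cache[key] = db[user_id]
    return cache[key]


def update_profile(db, cache, user_id, new_profile):
    """Write the source of truth, then invalidate the cached copy right away."""
    db[user_id] = new_profile
    cache.pop(f"profile:{user_id}", None)  # the next read repopulates it


db = {42: {"name": "Old Name"}}
cache = {}
before = read_profile(db, cache, 42)  # caches the old profile
update_profile(db, cache, 42, {"name": "New Name"})
after = read_profile(db, cache, 42)   # miss, so the fresh row is loaded
```

Without the `cache.pop` line, `after` would still be the old profile until the TTL expired.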


Part 7: Choosing the Right Async Tool - Redis vs RabbitMQ vs Kafka

👨‍🦳 Uncle: These three all solve "asynchronous communication between parts of your system," but with genuinely different guarantees, and choosing wrong here is a very common interview mistake.

| Feature | Redis Pub/Sub | RabbitMQ | Kafka |
| --- | --- | --- | --- |
| Stores messages? | No | Temporary (until acknowledged) | Persistent (kept for the configured retention period, 7 days by default) |
| Real-time delivery? | Excellent | Good | Good |
| Replay past events? | No | Limited | Yes, excellent (within the retention period) |
| Great for background jobs? | No | Excellent | Possible, but not its primary strength |
| Throughput | Medium | Medium | Excellent |
| Typical use | Presence, typing indicators | Email, SMS, task queues | Event streaming, analytics, audit logs |

Three simple mental pictures that make this permanently stick:

  • Redis Pub/Sub = FM Radio. If you're tuned in right now, you hear it. If you weren't listening, it's gone forever: no replay.
  • RabbitMQ = A Post Office. Your letter (message) is held temporarily. Once it's delivered and signed for, it's discarded.
  • Kafka = A Library Book. The story stays on the shelf for as long as the retention policy keeps it (7 days by default, configurable up to indefinitely). Anyone, including a brand-new team that joins later, can read it from the beginning of what is still retained.
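The three mental pictures can be made concrete with toy in-memory stand-ins. This is only a sketch: the class names `PubSub`, `WorkQueue`, and `EventLog` are invented for illustration, and none of this uses the real Redis, RabbitMQ, or Kafka client APIs.

```python
from collections import deque


class PubSub:
    """FM radio: only callbacks subscribed at publish time get the message."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)  # nothing is stored for late subscribers


class WorkQueue:
    """Post office: a message is held until a consumer takes and acks it."""

    def __init__(self):
        self.pending = deque()

    def publish(self, message):
        self.pending.append(message)

    def consume_and_ack(self):
        return self.pending.popleft() if self.pending else None


class EventLog:
    """Library book: an append-only log that readers replay from any offset."""

    def __init__(self):
        self.records = []

    def append(self, message):
        self.records.append(message)

    def read_from(self, offset):
        return self.records[offset:]


radio = PubSub()
radio.publish("typing...")  # nobody is subscribed yet, so this is lost
heard = []
radio.subscribe(heard.append)
radio.publish("online")

queue = WorkQueue()
queue.publish("send email")
job = queue.consume_and_ack()
leftover = queue.consume_and_ack()  # the queue is empty after the ack

log = EventLog()
log.append("msg_sent")
log.append("msg_read")
replayed = log.read_from(0)  # a new reader can start from the beginning
```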

A realistic example of how a WhatsApp-style chat app could use all three, each for a different job:

Typing Indicator     → Redis Pub/Sub    (not critical if occasionally lost)
Email Verification   → RabbitMQ         (must eventually be delivered, reliably)
Message Analytics    → Kafka            (millions of events; need persistence AND replay)

👦 Nephew: So it's never "which one is best"; it's "which guarantee does this specific job actually need"?

👨‍🦳 Uncle: Precisely. Asking "does this message need to survive a crash? does it need to be replayed later by a system that doesn't exist yet? does losing an occasional message matter at all?" leads you to the right tool every time, far more reliably than memorizing "Kafka is for big companies."
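As a rough rule of thumb, those questions can be written as a tiny decision function. The name `pick_async_tool` and the order of the checks are assumptions made for illustration; a real choice also weighs throughput, ordering guarantees, and operational cost.

```python
def pick_async_tool(must_survive_crash, needs_replay):
    """Map the two key guarantees to the tool that provides them."""
    if needs_replay:
        return "Kafka"          # persistent log, readable again later
    if must_survive_crash:
        return "RabbitMQ"       # held until acknowledged, then discarded
    return "Redis Pub/Sub"      # fire-and-forget, lowest overhead


typing_indicator = pick_async_tool(must_survive_crash=False, needs_replay=False)
email_verification = pick_async_tool(must_survive_crash=True, needs_replay=False)
message_analytics = pick_async_tool(must_survive_crash=True, needs_replay=True)
```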


Part 8: A Complete Worked Example - Designing a Notification System

👨‍🦳 Uncle: Let's run the entire framework, start to finish, on one real design, exactly the way you'd do it live in an interview. This is one of the best practice questions there is: it touches queues, async processing, retries, and failure handling all at once.

1. Understand the problem: "Design a Notification System." First question to ask: which channels? Email, SMS, push, real-time, scheduled? Assume: Email, SMS, Push.

2. Functional Requirements: send email, send SMS, send push notification, support bulk notifications, retry failed notifications, view delivery status.

3. Non-Functional Requirements. The single most important question here: can we afford to lose a notification? Answer: No. This one answer shapes the entire architecture that follows: it immediately tells you a queue with persistence and retries is mandatory, not optional.

4. Scale Estimation:

  • 10M total users, 1M DAU, 5 notifications/user/day → 1M × 5 = 5M notifications/day
  • QPS: 5,000,000 / 86,400 ≈ 58 QPS, Peak: 58 × 5 = 290 QPS
  • The critical edge case, bulk campaigns: marketing sends 1 million emails in a 10-minute window → 1,000,000 / 600 ≈ 1,667 notifications/sec. This is where the queue genuinely earns its place: a burst 5-6x higher than even your "peak" steady-state number.
  • Storage: 5M × 500 bytes = 2.5 GB/day → × 365 = 912.5 GB/year (roughly 0.9 TB)
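The same back-of-envelope arithmetic, written out so the numbers can be rechecked or changed. The variable names are illustrative, and the 5x peak multiplier and 500-byte record size are the assumptions stated in the estimate above.

```python
SECONDS_PER_DAY = 86_400

dau = 1_000_000
notifications_per_user = 5
peak_multiplier = 5            # assumed ratio of peak to average traffic
bytes_per_notification = 500   # assumed average record size

daily_notifications = dau * notifications_per_user
average_qps = daily_notifications / SECONDS_PER_DAY
peak_qps = average_qps * peak_multiplier

# Bulk campaign: 1 million emails inside a 10-minute window.
campaign_qps = 1_000_000 / (10 * 60)

daily_storage_gb = daily_notifications * bytes_per_notification / 1e9
yearly_storage_gb = daily_storage_gb * 365

print(f"notifications/day: {daily_notifications:,}")
print(f"average QPS:       {average_qps:.0f}")
print(f"peak QPS:          {peak_qps:.0f}")
print(f"campaign burst:    {campaign_qps:.0f}/sec")
print(f"storage per day:   {daily_storage_gb:.1f} GB")
print(f"storage per year:  {yearly_storage_gb:.1f} GB")
```

Computed from the unrounded average, the peak comes out near 289 rather than 290; the gap is only rounding, which is well within back-of-envelope precision.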

5. API Design: POST /notifications, GET /notifications/{id}

6. Data Model: Notification: {id, userId, channel, message, status, createdAt}, DeliveryLog: {notificationId, provider, status}
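One possible way to write that data model down, using dataclasses. The field names follow the two records above (converted to Python's snake_case); the `Channel` and `Status` values, the UUID ids, and the UTC timestamp default are assumptions added for the sketch.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Channel(Enum):
    EMAIL = "email"
    SMS = "sms"
    PUSH = "push"


class Status(Enum):  # assumed lifecycle states
    PENDING = "pending"
    SENT = "sent"
    FAILED = "failed"


@dataclass
class Notification:
    user_id: str
    channel: Channel
    message: str
    status: Status = Status.PENDING
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class DeliveryLog:
    notification_id: str  # points back at Notification.id
    provider: str         # e.g. "SendGrid", "Twilio", "FCM"
    status: Status


notification = Notification(
    user_id="u1", channel=Channel.EMAIL, message="Your order shipped"
)
attempt = DeliveryLog(
    notification_id=notification.id, provider="SendGrid", status=Status.SENT
)
```

Keeping `DeliveryLog` separate from `Notification` means each retry can be recorded as its own row without rewriting the notification itself.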

7. High Level Design, and why it looks this way:

Client → Notification API → Kafka →  ┌── Email Worker ──┐
                                     ├── SMS Worker ────┤ → Provider (SendGrid, Twilio, FCM)
                                     └── Push Worker ───┘
                                              │
                                              ▼
                                          Database

👨‍🦳 Uncle: Here's the question every interviewer wants you to answer unprompted: why Kafka? Why not just call the email provider directly from the API?

Bad:    Order Service → Send Email directly → wait 2 seconds → user is stuck waiting
Better: Order Service → Kafka → return success immediately → email sent asynchronously, later

👦 Nephew: So the queue exists purely to stop the user from waiting on something they don't need to wait on?

👨‍🦳 Uncle: That's one huge reason. But combined with our "cannot lose a notification" requirement from Step 3, there's a second, equally important reason: if the email provider is temporarily down, a direct call just fails and the notification is gone. With Kafka, the event sits safely in the queue regardless of whether the provider is currently up, and a worker can retry it whenever the provider recovers. The queue isn't just a performance optimization here; it's the actual mechanism satisfying your non-negotiable non-functional requirement.

8. Scaling: if 5M notifications/day becomes 500M, add more Kafka partitions and more workers per channel. Workers in the same consumer group consume partitions in parallel, so throughput scales roughly linearly with worker count, up to one active worker per partition.
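A sketch of why partitions set the ceiling on parallelism. Both functions are hypothetical: `partition_for` uses CRC32 only because it ships with the standard library (Kafka's own client uses a different hash), and `assign_partitions` is a simple round-robin stand-in for the consumer group's real assignment logic.

```python
import zlib


def partition_for(key, num_partitions):
    """Hash a key to a partition so the same key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, num_workers):
    """Spread partitions over workers round-robin; each partition has one owner."""
    assignment = {worker: [] for worker in range(num_workers)}
    for partition in range(num_partitions):
        assignment[partition % num_workers].append(partition)
    return assignment


balanced = assign_partitions(num_partitions=6, num_workers=3)
# Each of the 3 workers owns 2 partitions.

oversized = assign_partitions(num_partitions=4, num_workers=6)
idle_workers = [worker for worker, owned in oversized.items() if not owned]
# With only 4 partitions, 2 of the 6 workers have nothing to read.
```

This is why the scaling step adds partitions and workers together: extra workers beyond the partition count sit idle.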

9. Reliability & Failure Handling, the heart of this design:

  • If the Email Provider (SendGrid) fails: retry 1, retry 2, retry 3. Still failing after that? Send the event to a Dead Letter Queue (DLQ) for manual inspection later, rather than silently dropping it. Kafka retains the event for its retention period, so a worker can come back and retry later.
  • If a Worker crashes mid-processing: Kafka still has the event. The crashed worker never committed its offset, so a new worker instance resumes from the last committed offset and picks the event up again. No loss, though the event may be processed twice, so handlers should be idempotent.
  • If the Database goes down: Primary + Replica setup with automatic failover.
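The retry-then-DLQ rule from the first bullet can be sketched as follows. Everything here is illustrative: `ProviderError`, `deliver_with_retries`, the three-attempt limit, and the exponential backoff delays are assumptions, and a plain list stands in for the dead letter queue.

```python
import time


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def deliver_with_retries(event, send, dead_letter_queue,
                         max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try send(event) up to max_attempts times, then park it in the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(event)
            return True
        except ProviderError:
            if attempt < max_attempts:
                # Exponential backoff: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letter_queue.append(event)  # never dropped silently
    return False


calls = {"count": 0}


def flaky_send(event):
    """Fails twice, then succeeds, like a provider that recovers."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ProviderError("provider unavailable")


def broken_send(event):
    raise ProviderError("provider down")


def no_sleep(seconds):
    """Skip real waiting so the demo runs instantly."""


dlq = []
recovered = deliver_with_retries({"id": 1}, flaky_send, dlq, sleep=no_sleep)
gave_up = deliver_with_retries({"id": 2}, broken_send, dlq, sleep=no_sleep)
```

Because a crashed or retried worker may send the same event twice, the `send` function should be safe to repeat, for example by checking the notification's status before calling the provider.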

10. Monitoring: notifications sent (counter), failed notifications (counter), retry count (histogram), queue size (gauge), latency (histogram), and critically, Kafka consumer lag: how far behind the workers are from the incoming event stream. A rapidly growing lag number is often the very first sign of trouble, well before anything else looks wrong.
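Consumer lag is just a subtraction per partition, which a short sketch makes clear. The function names and the sample offsets below are made up for illustration; in practice the offsets come from Kafka's own tooling or a metrics exporter.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = newest offset in the log - last committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lag_is_growing(samples):
    """True when every sample of total lag is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


end_offsets = {0: 1500, 1: 1480}
committed = {0: 1450, 1: 1480}
lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())

alert = lag_is_growing([50, 400, 2500])   # workers are falling behind
steady = lag_is_growing([50, 40, 45])     # normal fluctuation
```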

11. Security: JWT auth, authorization so only the notification's owner can view its status, rate limiting to prevent spam abuse, HTTPS encryption throughout.
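The rate-limiting item can be implemented in several ways; the text does not name one. A token bucket is one common choice and is sketched below. `TokenBucket` and its parameters are hypothetical, and the injectable clock exists only so the behaviour can be shown without waiting.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `refill_per_second` rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Top the bucket up for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


now = [0.0]  # fake clock for the demo, in seconds
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third exceeds the burst
now[0] = 1.0                                              # one second later
after_wait = bucket.allow()
```

In a real service there would be one bucket per user or API key, typically stored in a shared store so that every API server enforces the same limit.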

👦 Nephew: Every single answer in this walkthrough traced back to one of the twelve steps, in order, without me ever feeling lost.

👨‍🦳 Uncle: That was the entire exercise. You just watched the framework carry an entire, genuinely complex design end-to-end, without a single moment of "what do I even say next."


Part 9: The 30-Second Mental Checklist and the 30-Minute Interview Clock

👨‍🦳 Uncle: Let's compress everything into what actually runs through your head the instant an interviewer says "Design X."

The 30-second mental checklist, in order:

1. What problem, exactly? (Clarify)
2. How many users? (Scale)
3. Read/write ratio? (Traffic pattern)
4. QPS & Peak QPS? (Capacity)
5. Storage size? (Database)
6. THEN, and only then, design

Do NOT:

  • Start with "I'd use Kafka..." before establishing why
  • Draw a diagram with no numbers behind any of the boxes
  • Forget to discuss failure scenarios
  • Skip the monitoring conversation entirely

How to actually spend a 30-minute interview

Minutes 0-5:   Ask clarifying questions + gather requirements (the 5 magic questions)
Minutes 5-12:  Calculate scale: QPS, storage, bandwidth, the works
Minutes 12-20: HLD + API design + Database schema (draw the boxes)
Minutes 20-25: Deep dive: load balancer, cache, DB replication, how each piece works
Minutes 25-30: Failures + Monitoring + Tradeoffs (the senior-level closing)

The 5 Magic Questions, worth asking out loud at the very start of every single design

1. Scale?        (How many users / DAU / requests?)
2. Availability?  (99.9%? 99.99%?)
3. Latency?       (<100ms? <1s?)
4. Consistency?   (Strong? Eventual?)
5. Security?      (Data sensitivity? Auth required?)

👨‍🦳 Uncle: Asking these five, out loud, in the first two minutes, before you say a single technology name, makes you sound two levels more senior than a candidate who launches straight into "I'll use Redis and MongoDB."

The Five Mistakes 90% of Candidates Make

  1. No clarifying questions: jumping straight to "use Redis and MongoDB" without first asking about users, requests, and latency requirements.
  2. Technology-first thinking: "I'll use Kafka for everything," without ever establishing whether the scale actually requires it.
  3. Skipping back-of-envelope math: "100 million users, I'll just use a distributed database," without ever calculating QPS, storage, or bandwidth to justify that.
  4. No failure discussion: never mentioning "what if the server crashes" or "what if the cache dies." Reliability is a significant chunk of your evaluation, and skipping it costs real points.
  5. Overly complex answers: reaching for 15 microservices, Kubernetes, Kafka, and Elasticsearch for what's fundamentally a simple problem. Interviewers consistently prefer pragmatism. Start simple, and add complexity only when a specific number or requirement actually demands it.

What Interviewers Are Actually Scoring You On

1. Systematic thinking     (did you follow a clear framework?)
2. Calculation skills       (can you actually do the back-of-envelope math?)
3. Communication             (clear explanation, legible diagrams)
4. Pragmatism                 (practical architecture, not over-engineered)
5. Tradeoff awareness          (CAP theorem, consistency vs speed, genuinely understood)
6. Ops mindset                  (monitoring, reliability, failure handling, all unprompted)
7. Honest about unknowns         (comfortable saying "I don't know, but here's how I'd find out")

Part 10: The Full Reference Sheet - Pin This to Your Wall

The 12-Step Framework, once more, cold:

UNDERSTAND: Problem → Requirements (Func + Non-Func) → Scale Estimation
DESIGN:     APIs → Database → HLD → Deep Dive Components
ROBUSTIFY:  Scaling → Reliability → Monitoring → Security → Bottlenecks & Tradeoffs

Every Formula, together:

QPS               = Requests Per Day / 86,400
Peak QPS          = Average QPS × (3 to 5)
Storage           = Records × Size Per Record
Yearly Storage    = Daily Storage × 365
Bandwidth         = Peak QPS × Response Size
Cache Memory      = Hot Data Size × 1.2

Every unit, together:

1 KB  = 10³  bytes
1 MB  = 10⁶  bytes
1 GB  = 10⁹  bytes
1 TB  = 10¹² bytes
1 Day = 86,400 seconds

The full "what every system design must cover" checklist:

✓ Requirements (Functional + Non-functional)
✓ Scale (Users, QPS, Storage, Bandwidth)
✓ API (Input/Output contracts)
✓ Database (Schema, SQL vs NoSQL choice, indexes)
✓ Cache (What to cache, TTL strategy, eviction policy)
✓ Load Balancer (Traffic distribution)
✓ Servers (Number, horizontal scaling plan)
✓ Storage strategy (Sharding, replication)
✓ Monitoring (Metrics, logs, tracing, alerts)
✓ Security (Auth, encryption, rate limiting)
✓ Failure Handling (Replication, failover, backups, circuit breakers)
✓ Scaling (Horizontal, vertical, sharding; which one, and why)
✓ Tradeoffs (CAP theorem, consistency vs speed, explicitly stated)

👨‍🦳 Uncle: If you can walk through every single line of this checklist for any system, out loud, from memory, without notes, you are ready for this round. Not because you'll have memorized "the answer" to every possible question, but because you'll have internalized the process that generates the right answer for a question you've never seen before.


Uncle's Closing Words

👨‍🦳 Uncle: You came in today thinking this round tests how many technologies you know. It doesn't. It tests whether you can think like the person who'll be paged at 2 AM when this system breaks: someone who asks the right questions before building, does the math before choosing tools, designs for failure before designing for success, and can explain why every single box on the diagram exists.

👦 Nephew: Understand before designing. Calculate before building. Design from numbers, not opinions. Handle failure before celebrating success. And always be honest about the tradeoffs.

👨‍🦳 Uncle: (grins) Say that exact sentence, calmly, in your next interview, right after you draw your HLD, and watch the interviewer's face change.


End of chat. Now go practice five designs: URL Shortener, Chat App, Notification System, Rate Limiter, File Upload. Do it with a timer running.
