Thesius Code

System Design Interview Preparation: The Complete Roadmap

Most engineers who fail system design interviews don't fail on technical knowledge — they fail on structure. They know what a load balancer does, they understand caching, but the moment someone says "Design Twitter," they freeze and start drawing boxes at random.

This guide fixes that. Instead of memorizing 30 specific system designs, you'll learn a repeatable framework and the core building blocks so you can design any system on the spot.

The Framework: How to Structure Your Answer

Every system design interview should follow this four-step structure. Internalize it until it's muscle memory.

Step 1: Clarify Requirements (3–5 minutes)

Before designing anything, ask questions. This demonstrates engineering maturity and prevents wasted effort.

Functional requirements:

  • What are the core features?
  • Who are the users?
  • What are the inputs and outputs?

Non-functional requirements:

  • What's the expected scale? (users, requests/sec, data volume)
  • What are the latency requirements?
  • Is availability or consistency more important?
  • What's the read/write ratio?

Example for "Design Twitter":

Functional:
- Post tweets (text, 280 chars)
- Follow/unfollow users
- View home timeline (tweets from followed users)
- Search tweets

Non-functional:
- 500M users, 200M DAU
- ~600 tweets/sec writes, ~600K reads/sec
- Timeline latency < 200ms
- Availability > consistency (eventual consistency OK)
- Read-heavy: ~1000:1 read/write ratio

Step 2: High-Level Design (5–10 minutes)

Draw the major components and how data flows between them:

Client → Load Balancer → API Gateway → Services
                                          │
                              ┌───────────┼───────────┐
                              ▼           ▼           ▼
                         Tweet Service  Timeline    User Service
                              │         Service         │
                              ▼           │             ▼
                         Tweet DB         ▼          User DB
                              │       Cache Layer
                              ▼     (Timeline Cache)
                         Message Queue
                              │
                              ▼
                      Fan-out Service

Step 3: Deep Dive (15–20 minutes)

Pick the most critical components and design them in detail. The interviewer will guide you, but be prepared to dive into:

  • Database schema and choice
  • API design
  • Scaling strategy
  • Caching approach
  • Failure handling

Step 4: Trade-offs and Bottlenecks (5 minutes)

Discuss what could break, what you'd monitor, and alternative approaches you considered.

Core Building Blocks You Must Know

These are the Lego pieces of system design. Learn these deeply, and you can assemble any system.

1. Load Balancing

Distributes traffic across servers. Know the algorithms:

| Algorithm | When to Use |
| --- | --- |
| Round Robin | Equal server capacity, stateless services |
| Weighted Round Robin | Mixed server capacities |
| Least Connections | Long-lived connections (WebSocket) |
| IP Hash | Session affinity needs |
| Consistent Hashing | Distributed caches, database sharding |

Key point: L4 (TCP) vs L7 (HTTP) load balancing. L7 can route based on content (URL path, headers) but adds latency. L4 is faster but less flexible.
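As a quick illustration, weighted round robin can be sketched in a few lines of Python. The expansion-based approach below is the simplest possible version; real balancers such as nginx use a smoother interleaving, so treat this as a teaching sketch only:

```python
from itertools import cycle

def weighted_round_robin(servers):
    """Yield server names in proportion to their weights.

    `servers` is a list of (name, weight) pairs. Expanding the list by
    weight and cycling through it gives each server its share of traffic.
    """
    expanded = [name for name, weight in servers for _ in range(weight)]
    return cycle(expanded)

# Server "a" has twice the capacity of server "b"
rr = weighted_round_robin([("a", 2), ("b", 1)])
picks = [next(rr) for _ in range(6)]  # "a" gets two thirds of the requests
```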

2. Caching

Caching appears in every system design answer. Know the patterns:

Cache-Aside (Lazy Loading):
1. App checks cache
2. Cache miss → read from DB
3. Write result to cache
4. Return to client

Write-Through:
1. App writes to cache
2. Cache writes to DB
3. Return to client

Write-Behind (Write-Back):
1. App writes to cache
2. Cache async writes to DB (batched)
3. Return to client immediately

When to use what:

  • Cache-aside: Default choice. Works for read-heavy workloads.
  • Write-through: When you can't afford cache misses on recently written data.
  • Write-behind: High write throughput needed, acceptable risk of data loss.

Cache invalidation strategies:

  • TTL (Time-To-Live): Simple, eventual consistency. Set TTL to match acceptable staleness.
  • Event-based: Invalidate on write. More complex but data stays fresher.
  • Version tags: Include version in cache key. New version = automatic miss.
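The cache-aside pattern with TTL invalidation can be sketched in a few lines. This is a minimal in-process version for illustration — the dict-backed store stands in for Redis or Memcached, and the DB is just a mapping:

```python
import time

class CacheAside:
    """Cache-aside (lazy loading) with TTL-based invalidation."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db            # any mapping-like backing store
        self.ttl = ttl_seconds
        self.cache = {}         # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]                       # cache hit
        value = self.db[key]                      # miss: read from DB
        self.cache[key] = (value, time.time() + self.ttl)
        return value

db = {"user:1": {"name": "Ada"}}
cache = CacheAside(db)
cache.get("user:1")   # miss -> reads DB, populates cache
cache.get("user:1")   # hit  -> served from cache until the TTL expires
```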

3. Database Selection

| Requirement | Database Type | Examples |
| --- | --- | --- |
| Structured data, ACID | Relational | PostgreSQL, MySQL |
| Flexible schema, high write volume | Document | MongoDB, DynamoDB |
| Social graphs, relationships | Graph | Neo4j, Amazon Neptune |
| Time-series metrics | Time-series | InfluxDB, TimescaleDB |
| Full-text search | Search engine | Elasticsearch, OpenSearch |
| Session data, leaderboards | Key-value | Redis, Memcached |
| Wide-column, massive scale | Column-family | Cassandra, HBase |

4. Database Scaling Patterns

Vertical scaling — Bigger machine. Simple but has a ceiling.

Read replicas — Primary handles writes, replicas handle reads. Works for read-heavy workloads.

Sharding — Split data across multiple databases by a shard key.

Shard by user_id:
  user_id % 4 = 0 → Shard A
  user_id % 4 = 1 → Shard B
  user_id % 4 = 2 → Shard C
  user_id % 4 = 3 → Shard D

Problems with naive sharding:
- Hot shards (uneven distribution)
- Cross-shard queries are expensive
- Rebalancing when adding shards

Better: Consistent hashing with virtual nodes
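The consistent-hashing approach can be sketched as follows. The vnode count (100) and the use of MD5 are illustrative choices, not a production recipe — the point is that adding a node only remaps the keys that fall on its ring segments:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes for shard/cache placement."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # First vnode clockwise from the key's hash owns the key
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
node = ring.get_node("user:42")   # deterministic shard for this key
```

Unlike `user_id % N`, going from 3 to 4 nodes here moves only roughly a quarter of the keys instead of nearly all of them.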

5. Message Queues

Decouple producers from consumers. Essential for async processing.

Producer → Queue → Consumer

Use cases:
- Order processing (place order → queue → payment → queue → fulfillment)
- Notifications (event → queue → email/push/SMS services)
- Data pipelines (change event → queue → downstream processing)

Key concepts:
- At-least-once delivery (most common)
- Exactly-once semantics (harder, Kafka supports it)
- Dead letter queues (failed messages go here)
- Message ordering (per-partition in Kafka)
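The retry-then-dead-letter flow above can be sketched with Python's standard `queue` module. This is a single-process illustration — a real broker (Kafka, SQS) tracks delivery attempts for you:

```python
import queue

def consume_with_retries(work_q, dead_letter_q, handler, max_retries=3):
    """At-least-once consumer loop with a dead letter queue.

    Messages are (payload, attempts) pairs; failures are re-enqueued
    until max_retries, then parked on the DLQ for investigation.
    """
    while not work_q.empty():
        msg, attempts = work_q.get()
        try:
            handler(msg)
        except Exception:
            if attempts + 1 >= max_retries:
                dead_letter_q.put(msg)           # give up: park on the DLQ
            else:
                work_q.put((msg, attempts + 1))  # retry later

work_q, dlq = queue.Queue(), queue.Queue()
work_q.put(("order-1", 0))
work_q.put(("order-bad", 0))

def handler(msg):
    if msg == "order-bad":
        raise ValueError("payment failed")

consume_with_retries(work_q, dlq, handler)
# "order-bad" lands on the DLQ after exhausting its retries
```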

6. The CAP Theorem (Practical Version)

In a distributed system during a network partition, you must choose:

  • CP (Consistency + Partition tolerance): Every read gets the most recent write, but some requests may fail. Use for banking and inventory.
  • AP (Availability + Partition tolerance): Every request gets a response, but it might be stale. Use for social media feeds and DNS.

In practice, most systems pick AP for user-facing reads and CP for critical writes.

7. Rate Limiting

Protect services from abuse and cascading failures.

Algorithms:
1. Token Bucket — Allows bursts, smooth average rate
2. Sliding Window — Precise, more memory
3. Fixed Window — Simple, edge-case bursts at window boundaries
4. Leaky Bucket — Constant output rate, good for APIs

Where to implement:
- API Gateway (global rate limiting)
- Per-service (service-specific limits)
- Per-user/API-key (fairness)
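A minimal token bucket sketch (in-process only; distributed setups typically keep the bucket state in Redis so all gateway instances share one limit):

```python
import time

class TokenBucket:
    """Token bucket rate limiter: allows bursts up to `capacity`
    while enforcing an average rate of `rate` requests per second."""

    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # max burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow() for _ in range(12)]
# allows an initial burst up to capacity, then throttles until tokens refill
```

Token bucket is often the default choice precisely because it tolerates short bursts while still enforcing the long-run average rate.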

Practice Problems with Solution Outlines

Problem 1: Design a URL Shortener

Requirements: 100M URLs/day, 1000:1 read/write, < 10ms redirect latency

Key decisions:

  • ID generation: Base62 encoding of auto-increment or snowflake ID. 7 chars = 3.5 trillion URLs.
  • Storage: Key-value store (Redis for hot URLs, DynamoDB for persistence).
  • Caching: Cache-aside with Redis. Most URLs follow Zipf distribution (top 20% get 80% of traffic).
  • Read path: Cache → DB → 301/302 redirect.
  • Analytics: Async via Kafka → Analytics service.
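The Base62 ID-encoding step can be sketched as follows (the alphabet ordering is an arbitrary illustrative choice; any fixed 62-character ordering works):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a numeric ID (auto-increment or snowflake) as a short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

code = base62_encode(123456789)
# 62**7 ≈ 3.5 trillion distinct 7-character codes
```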

Problem 2: Design a Notification System

Requirements: Multi-channel (push, email, SMS, in-app), 100M notifications/day, prioritization

Key decisions:

  • Architecture: Event-driven with priority queues.
  • Queue design: Separate queues per channel, priority levels within each.
  • Rate limiting: Per-user per-channel to prevent notification fatigue.
  • Template engine: Pre-compiled templates with variable substitution.
  • Delivery tracking: State machine (created → queued → sent → delivered → read).
  • Failure handling: Exponential backoff with max retries, DLQ for investigation.

Problem 3: Design a Distributed Cache

Requirements: Sub-millisecond latency, 1TB data, fault-tolerant

Key decisions:

  • Partitioning: Consistent hashing with virtual nodes.
  • Replication: Each partition replicated to 3 nodes.
  • Consistency: Quorum reads and writes (R + W > N) when strong consistency is needed; lower R for faster, eventually consistent reads.
  • Eviction: LRU per node with global TTL.
  • Hot key handling: Local caching on client, key splitting.
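The per-node LRU eviction policy can be sketched with an `OrderedDict`. This covers eviction only — replication, quorums, and TTLs are out of scope for the sketch:

```python
from collections import OrderedDict

class LRUCache:
    """Per-node LRU eviction: on overflow, drop the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now most recently used
cache.put("c", 3)     # evicts "b"
```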

12-Week Study Plan

| Week | Focus Area | Practice Problem |
| --- | --- | --- |
| 1–2 | Scaling fundamentals, load balancing, caching | URL Shortener |
| 3–4 | Database design, SQL vs NoSQL, sharding | Instagram/Twitter |
| 5–6 | Message queues, async processing | Notification System |
| 7–8 | Real-time systems, WebSockets, pub/sub | Chat Application |
| 9–10 | Search systems, indexing, ranking | Search Engine |
| 11–12 | Distributed systems, consensus, replication | Distributed Cache |

Daily Practice Routine

  1. Morning (30 min): Review one building block concept in depth.
  2. Evening (60 min): Practice one design problem end-to-end.
  3. Weekend (2 hours): Mock interview with a peer or self-record and review.

Mistakes That Sink Interviews

  1. Jumping to the solution — Always clarify requirements first. The interviewer is evaluating your process as much as your answer.
  2. Skipping back-of-envelope math — "How many servers do we need?" You should be able to estimate within an order of magnitude.
  3. Ignoring failure modes — "What happens when this component goes down?" Always address this proactively.
  4. Over-engineering — Start simple, then add complexity as requirements demand. Don't design for Google scale when the prompt says 10K users.
  5. Not discussing trade-offs — There is no perfect design. Every choice has a cost. The best candidates articulate these clearly.

Back-of-Envelope Calculations Cheat Sheet

Useful numbers:
- 1 day = ~100K seconds (86,400)
- 1 year = ~30M seconds
- QPS from daily users: DAU × avg_requests / 86,400
- Storage: items × size × retention_period
- Bandwidth: QPS × avg_response_size

Example: Twitter timeline reads
- 200M DAU, each refreshes 10x/day
- QPS = 200M × 10 / 86,400 ≈ 23K QPS
- Peak = 2–3× average ≈ 60K QPS
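The cheat-sheet formulas above reduce to a few lines of arithmetic. The 2–3x peak factor is a rule of thumb, not a measured value:

```python
def qps_from_dau(dau, requests_per_user_per_day, peak_factor=2.5):
    """Back-of-envelope QPS estimate: average and peak."""
    avg = dau * requests_per_user_per_day / 86_400  # seconds per day
    return avg, avg * peak_factor

# Twitter timeline example: 200M DAU, 10 refreshes per user per day
avg_qps, peak_qps = qps_from_dau(200_000_000, 10)
# avg ≈ 23K QPS, peak ≈ 58K QPS
```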

Wrapping Up

System design interviews test three things:

  1. Can you break down ambiguous problems? (Requirements gathering)
  2. Do you know the building blocks? (Technical knowledge)
  3. Can you make and defend trade-offs? (Engineering judgment)

Master the framework, deeply understand 6–8 building blocks, and practice 10–15 problems. That's the formula.


If you're looking for ready-made architecture diagrams, cheat sheets, and structured study materials for data engineering and system design, check out DataStack Pro.
