Md Asif Ullah Chowdhury

Originally published at asifthewebguy.me

System Design Interview: Distributed Systems Fundamentals

I still remember my first system design interview at a mid-sized SaaS company in 2019. The interviewer asked me to design a URL shortener, and I immediately jumped into database schemas and API endpoints. Twenty minutes in, he stopped me. "That's fine for a single server," he said. "Now what happens when you have 100 million users?"

I froze. I knew about load balancers and caching in theory, but I had no framework for how to think through distributed systems problems under pressure. That interview taught me something crucial: system design interviews aren't about memorizing solutions. They're about demonstrating how you reason through trade-offs when multiple computers need to work together as one system.

Here's what I've learned since then, refined through dozens of interviews on both sides of the table and years of building distributed systems in production. This isn't the usual regurgitated list of patterns. Every concept below is tied to a real system's architecture decision—Netflix, Uber, Twitter—so you understand not just what these patterns are, but when and why teams chose them.

What is a Distributed System (and Why It Matters for Interviews)

A distributed system is multiple computers working together to appear as a single coherent system to end users. Your banking app talks to dozens of servers. Instagram's 2 billion users hit thousands of machines. Netflix streams video from edge servers scattered across continents.

The key word is appear. Behind the scenes, these systems are coordinating across network boundaries, handling failures, and managing data that lives in multiple places at once. That coordination is hard. Networks are unreliable. Servers crash. Data gets out of sync.

Companies ask system design questions because this is the actual work. If you're hired at Google, Meta, or Amazon, you'll be building features that scale to millions of users across distributed infrastructure. The interview simulates that: here's a problem, here's scale, now show me how you think.

What interviewers evaluate isn't whether you know the "right" answer—there often isn't one. They're watching how you:

  • Clarify requirements before diving into solutions
  • Estimate capacity to size your system appropriately
  • Make trade-offs explicitly and explain why you chose one path over another
  • Communicate clearly as you design, so they can follow your reasoning

The interview is a 45-minute window into how you'd collaborate on a real architecture review. Treat it like one.

Core Distributed Systems Concepts You Must Know

Before you can design anything distributed, you need a shared vocabulary for the problems these systems solve.

Scalability

Scalability is your system's ability to handle increased load without falling over. There are two paths: vertical scaling (bigger machines) and horizontal scaling (more machines).

Vertical scaling means upgrading your server—more CPU, more RAM, faster disks. It's simple. No code changes. But there's a ceiling. The biggest AWS instance tops out, and you've hit a wall.

Horizontal scaling means adding more servers and distributing the load across them. Instagram didn't scale to 2 billion users by buying one massive server. They scaled horizontally: thousands of application servers, sharded databases, distributed caches.

The trade-off? Horizontal scaling introduces complexity. Now you need load balancers, data partitioning strategies, and coordination between nodes. But the ceiling is much, much higher.

In interviews, if someone says "design a system for 100 million users," you're designing for horizontal scale. One server won't cut it.

Reliability and Fault Tolerance

Reliability means your system does what it's supposed to do, even when things break. Fault tolerance is the mechanism: your system continues operating despite failures.

Netflix is a great example. They run on AWS, and AWS regions fail. In 2011, an outage in AWS's US-East region took down many major sites. Netflix largely stayed up because they designed for failure: multi-region deployments, circuit breakers to isolate broken services, and automated failover.

The lesson: in distributed systems, failures aren't edge cases. They're Tuesday. Disks fail, networks partition, servers crash. Fault-tolerant design assumes these things will happen and builds around them.

Consistency

Consistency asks: when data exists in multiple places, do all readers see the same value at the same time?

Imagine you update your profile picture on Instagram. That change propagates to multiple databases and caches worldwide. If I view your profile one second later from Singapore, do I see the new picture or the old one?

Strong consistency guarantees I see the new picture immediately. Eventual consistency means I might see the old picture for a few seconds, but I'll eventually see the new one.

The reason this matters: achieving strong consistency across a distributed system is expensive. It requires coordination, locks, and waiting. Eventual consistency is faster but introduces temporary staleness.

Different parts of the same system often choose different consistency models. Your bank account balance? Strongly consistent. Your Twitter follower count? Eventually consistent is fine.

Availability

Availability measures how often your system is operational and responding to requests. It's usually expressed as uptime: 99.9% availability means roughly 8.7 hours of downtime per year.
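
The arithmetic behind those "nines" is worth having at your fingertips. A quick sketch:

```python
# Rough downtime allowed per year at a given availability target.
HOURS_PER_YEAR = 365.25 * 24  # ~8,766 hours

def downtime_hours_per_year(availability: float) -> float:
    """availability as a fraction, e.g. 0.999 for 'three nines'."""
    return (1 - availability) * HOURS_PER_YEAR

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability -> {downtime_hours_per_year(target):.2f} hours of downtime/year")
```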

High availability requires redundancy. If one server fails, another takes over. Load balancers distribute traffic across multiple healthy nodes. Databases replicate to standby instances.

But here's the catch: availability and consistency sometimes conflict. If your primary database fails, do you serve stale data from a replica (high availability, lower consistency) or refuse requests until the primary recovers (high consistency, lower availability)?

That's the trade-off space system design interviews explore.

Partition Tolerance

A network partition happens when servers can't communicate with each other. Maybe a fiber cable gets cut. Maybe a datacenter's network switch fails. The network splits into islands.

Partition tolerance means your system continues operating despite this split, even if that means making trade-offs on consistency or availability.

In practice, partitions are inevitable in distributed systems. You don't get to choose whether partitions happen—you get to choose how your system behaves when they do.

This brings us to the CAP theorem.

The CAP Theorem: Choosing Your Trade-Offs

The CAP theorem says you can have at most two of these three guarantees in a distributed system:

  • Consistency: All nodes see the same data at the same time.
  • Availability: Every request gets a response (success or failure).
  • Partition Tolerance: The system works despite network failures.

Here's the practical reality: partitions happen. Network splits are facts of life in distributed infrastructure. So partition tolerance is non-negotiable. The real choice is between consistency and availability during a partition.

CP Systems: Consistency Over Availability

A CP system prioritizes consistency. If the network partitions and nodes can't coordinate, the system refuses requests rather than risk serving stale or conflicting data.

Example: Banking systems. If my account balance is $100 and I try to withdraw $80 from an ATM while simultaneously withdrawing $50 from another ATM during a network partition, the system must prevent both withdrawals from succeeding. It chooses consistency (no overdraft) over availability (one ATM might reject my request).

MongoDB in its default configuration is CP. If the primary node loses connectivity to the majority of replicas, it steps down and stops accepting writes. The system becomes unavailable for writes, but you won't get inconsistent data.

AP Systems: Availability Over Consistency

An AP system prioritizes availability. If the network partitions, both sides of the partition continue serving requests. They'll reconcile later, but in the moment, availability wins.

Example: Social media feeds. When you post a photo on Instagram, it might not appear instantly to all 2 billion users worldwide. Some users might see the old feed state for a few seconds. That's acceptable—eventual consistency is fine for a social feed.

Amazon's Dynamo, the design behind DynamoDB, is AP. During a partition, it continues serving reads and writes from the nodes it can reach. Amazon built it for their shopping cart: it's better to show you a slightly stale cart than to refuse to show you a cart at all. Conflicts are reconciled later using versioning (vector clocks in the original design).

Real-World Nuance: Tunable Consistency

Many modern systems don't pick one extreme. They offer tunable consistency.

Cassandra, for instance, lets you specify a consistency level per query:

  • QUORUM: Wait for a majority of replicas to acknowledge (stronger consistency).
  • ONE: Accept the first response from any replica (higher availability, weaker consistency).

You can choose strong consistency for critical operations (user account updates) and eventual consistency for less critical ones (analytics counters).
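
To make that concrete, here's roughly what per-query consistency looks like with the Python Cassandra driver (a sketch; the keyspace, table, and column names are made up):

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("app")

# Critical write: wait for a majority of replicas to acknowledge.
update_user = SimpleStatement(
    "UPDATE users SET email = %s WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(update_user, ("new@example.com", 12345))

# Non-critical counter read: the first replica to answer wins.
read_count = SimpleStatement(
    "SELECT views FROM page_stats WHERE page_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read_count, (42,)).one()
```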

The interview lesson: when someone asks you to design a system, ask what the consistency requirements are. Don't assume. Different parts of the same system might need different guarantees.

Essential Distributed Systems Patterns

Now let's talk about the building blocks interviewers expect you to know.

Sharding / Partitioning

Sharding distributes data across multiple databases so no single database holds everything.

Problem it solves: Your database can't fit on one machine, or the query load is too high for one machine to handle.

How it works: You split data by some key. Common strategies:

  • Hash-based sharding: Hash the user ID, mod by the number of shards. User 12345 always maps to the same shard.
  • Range-based sharding: Users A-M go to shard 1, N-Z to shard 2.
  • Geographic sharding: US users on US databases, EU users on EU databases.

When to use: When you've exhausted vertical scaling and read replicas can't handle the write load.

Real example: Instagram shards user data by user ID. Your photos, profile, and follower list live on a specific shard determined by your user ID. This lets them distribute billions of users across thousands of database instances.

Trade-off: Queries that span shards (like "show me all posts tagged #sunset") become expensive. You're trading global query flexibility for horizontal scale.
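
To make the hash-based approach concrete, here's a minimal sketch of shard routing (the shard count and connection strings are placeholders; real deployments usually add a lookup layer so shards can be rebalanced without rehashing everything):

```python
import hashlib

NUM_SHARDS = 4  # assumption for illustration
SHARD_DSNS = [f"postgres://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)]

def shard_for_user(user_id: int) -> str:
    """Hash the user ID so the same user always lands on the same shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % NUM_SHARDS]

# All of user 12345's data (profile, photos, followers) lives on one shard.
print(shard_for_user(12345))
```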

Replication

Replication duplicates data across multiple servers for redundancy and read scaling.

Problem it solves: Single point of failure (if your database crashes, you're offline) and read-heavy workloads (one database can't handle all the read queries).

How it works:

  • Master-slave replication: One primary handles writes. Replicas copy the data and serve reads. If the primary fails, promote a replica.
  • Multi-master replication: Multiple nodes accept writes. Conflicts get resolved with versioning or last-write-wins.
  • Quorum-based replication: Writes succeed when acknowledged by a majority of replicas.

When to use: Always, for critical data. Replication gives you fault tolerance and read scaling.

Real example: My Docker deployment setup uses a single PostgreSQL instance because I'm running a small-scale blog. But production systems at scale run master-slave replication—one primary for writes, multiple read replicas distributed geographically to reduce latency.
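
For the quorum-based approach listed above, the rule worth memorizing is that a read is guaranteed to see the latest acknowledged write when the write and read quorums overlap, i.e. W + R > N. A tiny sketch of that check:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """With N replicas, a write acknowledged by W nodes and a read that waits
    for R nodes are guaranteed to intersect when W + R > N."""
    return w + r > n

print(quorums_overlap(n=3, w=2, r=2))  # True: classic majority quorum
print(quorums_overlap(n=3, w=1, r=1))  # False: a read can miss the latest write
```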

Caching

Caching stores frequently accessed data in fast storage (RAM) to avoid hitting slower backends (databases, APIs).

Problem it solves: Database queries are slow. Network calls are slow. Recomputing results is expensive.

How it works: Check the cache first. If the data is there (cache hit), return it. If not (cache miss), fetch from the database, store in cache, then return.
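
That read path is the classic cache-aside pattern. A minimal sketch with redis-py (the database helper, key format, and TTL are assumptions for illustration):

```python
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
TTL_SECONDS = 300  # accept up to 5 minutes of staleness

def fetch_user_from_db(user_id: int) -> dict:
    # Placeholder for the real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                     # cache hit
        return json.loads(cached)
    user = fetch_user_from_db(user_id)         # cache miss: go to the database
    cache.setex(key, TTL_SECONDS, json.dumps(user))
    return user
```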

Where to cache:

  • CDN (Content Delivery Network): Cache static assets (images, CSS, JS) at edge servers near users. CloudFlare, Fastly.
  • Application cache: Cache API responses, database query results. Redis, Memcached.
  • Database cache: MySQL query cache, PostgreSQL shared buffers.

When to use: For read-heavy workloads with data that doesn't change frequently.

Real example: Twitter caches timeline data in Redis. When you load your feed, Twitter doesn't query the database for every tweet from every user you follow. It serves a pre-computed, cached timeline. Updates propagate to the cache asynchronously.

Trade-off: Cache invalidation is hard. When the underlying data changes, you need a strategy to update or evict stale cache entries. As Phil Karlton put it: "There are only two hard things in Computer Science: cache invalidation and naming things."

Load Balancing

Load balancing distributes incoming requests across multiple servers so no single server gets overwhelmed.

Problem it solves: One server can't handle all the traffic. You need to spread the load.

How it works:

  • Round-robin: Requests go to servers in rotation. Simple, fair.
  • Least connections: Send the request to the server with the fewest active connections. Good for long-lived connections.
  • Consistent hashing: Map requests to servers using a hash ring. Adding or removing servers only affects a small subset of requests.

When to use: As soon as you have more than one application server.

Real example: Uber uses load balancers in front of their microservices. A ride request hits a load balancer, which routes it to one of hundreds of backend instances. If one instance crashes, the load balancer stops sending traffic to it.
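
Consistent hashing is the one strategy from that list worth being able to sketch from memory. Here's a bare-bones hash ring, without the virtual nodes real implementations add to smooth out the distribution:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers):
        # Place each server at a point on the ring determined by its hash.
        self.ring = sorted((self._hash(s), s) for s in servers)
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, request_key: str) -> str:
        # Walk clockwise to the first server at or after the key's position.
        idx = bisect.bisect(self.keys, self._hash(request_key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["app-1", "app-2", "app-3"])
print(ring.server_for("user:12345"))
# Adding "app-4" only remaps the keys that fall between it and its ring neighbor.
```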

Message Queues

Message queues decouple producers (who create work) from consumers (who process work) using an asynchronous queue in between.

Problem it solves: Synchronous processing can't handle spiky traffic. You need to buffer work and process it at your own pace.

How it works: Producer puts a message (task) in the queue. Consumer pulls messages from the queue and processes them. If the consumer is slow or crashes, messages wait in the queue.

When to use: For background jobs, asynchronous workflows, or when producers and consumers operate at different speeds.

Real example: When you upload a video to YouTube, the upload service puts a message in a queue: "transcode this video." Worker servers pull messages from the queue and transcode videos. If transcode servers are busy, the queue grows. If they're idle, the queue drains. The upload service doesn't wait—it responds immediately.

Common tools: Kafka (high-throughput event streaming), RabbitMQ (traditional message broker), AWS SQS (managed queue).
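
The decoupling is easiest to see in code. Here's a toy in-process version using Python's standard library; a real system would put Kafka, RabbitMQ, or SQS between the two halves, but the producer/consumer shape is the same:

```python
import queue
import threading
import time

transcode_jobs: "queue.Queue[str]" = queue.Queue()

def upload_service(video_id: str) -> None:
    # Producer: enqueue the work and return immediately.
    transcode_jobs.put(video_id)
    print(f"upload accepted: {video_id}")

def transcode_worker() -> None:
    # Consumer: drain the queue at its own pace.
    while True:
        video_id = transcode_jobs.get()
        time.sleep(0.1)  # pretend to transcode
        print(f"transcoded: {video_id}")
        transcode_jobs.task_done()

threading.Thread(target=transcode_worker, daemon=True).start()
for i in range(3):
    upload_service(f"video-{i}")
transcode_jobs.join()  # wait until the backlog drains
```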

Rate Limiting

Rate limiting restricts how many requests a client can make in a given time window.

Problem it solves: Protect your API from overload, abuse, or accidental denial-of-service (like a buggy client in a retry loop).

How it works:

  • Fixed window: Allow 100 requests per minute. Counter resets every minute.
  • Sliding window: Track requests over a rolling 60-second window.
  • Token bucket: Refill tokens at a fixed rate. Each request consumes a token.

When to use: On all public-facing APIs.

Real example: Twitter's API enforces rate limits such as 300 requests per 15-minute window for certain endpoints. Exceed the limit and you get a 429 status code. This prevents one client from monopolizing API capacity.
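
Of the three strategies above, the token bucket is the one you're most likely to be asked to sketch. A minimal single-process version (a production limiter would keep this state in something like Redis so every API server shares the same counts):

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond with HTTP 429

limiter = TokenBucket(capacity=10, refill_per_second=5)  # burst of 10, 5 req/s sustained
print(limiter.allow())
```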

Data Consistency Models in Distributed Systems

Consistency isn't binary. There's a spectrum of guarantees, each with different performance and complexity trade-offs.

Strong Consistency

Strong consistency (also called linearizability) guarantees that once a write completes, all subsequent reads return that value. There's no window where different readers see different data.

How it works: Typically requires coordination—locks, consensus protocols (like Paxos or Raft), waiting for acknowledgments from multiple nodes before confirming a write.

When to use: Financial transactions, inventory systems, anything where stale data causes serious problems.

Example: A stock trading platform needs strong consistency. If I sell 100 shares, no one else should be able to buy those same shares based on stale data.

Trade-off: Coordination is expensive. It adds latency and reduces throughput. Strongly consistent distributed databases are slower than eventually consistent ones.

Eventual Consistency

Eventual consistency guarantees that if no new updates are made, all replicas will eventually converge to the same value. But there's a window where replicas might return different values.

How it works: Writes propagate asynchronously. Replicas accept writes independently, then gossip updates to each other in the background.

When to use: Social media, analytics, any system where temporary staleness is acceptable.

Example: Facebook's "like" counts. If you like a post, your like might not immediately show up for every user worldwide. A few seconds later, it propagates everywhere. That delay is fine—it's not worth the coordination cost for a like button.

Trade-off: Application logic must tolerate stale reads. You can't rely on reading the most recent write.

Causal Consistency

Causal consistency preserves cause-and-effect relationships. If event A caused event B, all nodes see A before B. But independent events might appear in different orders on different nodes.

How it works: Track dependencies using vector clocks or similar mechanisms. Ensure dependent writes propagate in order.

When to use: Collaborative editing, messaging systems, any workflow where order matters for related events but not for independent events.

Example: A commenting system. If you post a comment and I reply to it, everyone should see your comment before my reply. But if two people comment independently, the order doesn't matter.

Trade-off: More complex than eventual consistency, but often more useful in practice without the full cost of strong consistency.

Common System Design Interview Questions and Frameworks

Here's the structure I use for every system design interview, both as a candidate and as an interviewer. It's not magic—it's just a way to organize your thinking so you don't spiral into irrelevant details.

The Framework

Step 1: Clarify requirements (5 minutes)

Don't assume. Ask:

  • What are we building? (URL shortener, Twitter, Instagram, etc.)
  • What's the scale? (How many users? Requests per second? Data volume?)
  • What's the read/write ratio? (Read-heavy, write-heavy, balanced?)
  • What are the latency requirements? (Real-time? Eventually consistent?)
  • What features are in scope? (Core features only, or advanced features too?)

Write these down. The interviewer is evaluating whether you gather requirements before jumping to solutions.

Step 2: Estimate capacity (5 minutes)

Back-of-the-envelope math:

  • Traffic estimate (e.g., 100M users, 10 tweets/day/user = 1B tweets/day = ~12K tweets/sec).
  • Storage estimate (e.g., 1B tweets/day × 200 bytes/tweet × 365 days × 5 years = ~365 TB).
  • Bandwidth estimate (12K tweets/sec × 200 bytes = 2.4 MB/sec write, assume 10:1 read/write ratio = 24 MB/sec read).

You don't need perfect numbers. You need order-of-magnitude estimates to inform your design (e.g., do we need sharding? How much cache do we need?).
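
I find it easier to keep this arithmetic honest by scripting it rather than juggling zeros in my head. The same Twitter-style numbers from above:

```python
SECONDS_PER_DAY = 86_400

users = 100_000_000
tweets_per_user_per_day = 10
bytes_per_tweet = 200
read_write_ratio = 10

tweets_per_day = users * tweets_per_user_per_day           # 1 billion
writes_per_sec = tweets_per_day / SECONDS_PER_DAY           # ~11.6K
write_bw_mb = writes_per_sec * bytes_per_tweet / 1e6        # MB/s written
storage_5y_tb = tweets_per_day * bytes_per_tweet * 365 * 5 / 1e12

print(f"{writes_per_sec:,.0f} tweets/sec, {write_bw_mb:.1f} MB/s write, "
      f"{write_bw_mb * read_write_ratio:.0f} MB/s read, {storage_5y_tb:,.0f} TB over 5 years")
```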

Step 3: Define APIs (5 minutes)

Sketch the core API contracts:

  • POST /tweet — create a tweet
  • GET /timeline/:user_id — fetch a user's timeline
  • POST /follow/:user_id — follow a user

This forces you to think about what data flows where.

Step 4: Design the data model (5 minutes)

What tables/collections do you need?

  • users (user_id, username, created_at)
  • tweets (tweet_id, user_id, content, created_at)
  • follows (follower_id, followee_id)

Identify access patterns. Are you querying by user ID? By time range? This informs indexing and sharding strategies.

Step 5: Draw the high-level architecture (15 minutes)

This is where you bring in the patterns:

  • Load balancer → application servers
  • Application servers → databases (sharded? replicated?)
  • Cache layer (Redis for timelines)
  • Message queue (Kafka for async jobs like notification delivery)
  • CDN for static assets

Talk through data flow: "When a user tweets, the API server writes to the database, invalidates the cache, and puts a message in the queue to update followers' timelines."

Step 6: Identify bottlenecks and optimize (10 minutes)

Where does this design break?

  • Database writes can't keep up → shard by user ID.
  • Timeline queries are slow → cache pre-computed timelines in Redis.
  • Hotspot users (celebrities with millions of followers) overwhelm the system → use a fan-out-on-read model for them instead of fan-out-on-write.

This is where you show you understand trade-offs. "We fan out on write for normal users so timeline reads stay cheap, but fan out on read for celebrities, because pushing one post into 100 million followers' cached timelines would overwhelm the write path."
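
A toy in-memory sketch of that hybrid fan-out (the celebrity threshold and data structures are made up for illustration; the branch is the point):

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000        # assumed cutoff; tune from real traffic
followers = defaultdict(set)           # author_id -> {follower_id, ...}
tweets_by_author = defaultdict(list)   # author_id -> [tweet_id, ...]
timeline_cache = defaultdict(list)     # user_id  -> [tweet_id, ...]

def publish_tweet(author_id: int, tweet_id: int) -> None:
    tweets_by_author[author_id].append(tweet_id)
    if len(followers[author_id]) < CELEBRITY_THRESHOLD:
        # Fan-out-on-write: push into every follower's cached timeline.
        for follower_id in followers[author_id]:
            timeline_cache[follower_id].append(tweet_id)
    # Celebrities skip the fan-out; their tweets are merged in at read time.

def read_timeline(user_id: int, followed_celebrities: list[int]) -> list[int]:
    timeline = list(timeline_cache[user_id])           # cheap, pre-computed part
    for celeb_id in followed_celebrities:               # fan-out-on-read part
        timeline.extend(tweets_by_author[celeb_id][-50:])
    return sorted(timeline, reverse=True)                # pretend higher ID = newer
```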

Example Walkthrough: Design Instagram

Let me walk through one example so you see the framework in action.

Requirements clarification:

  • 2 billion users, 500 million daily active users.
  • Users upload photos, follow other users, view a personalized feed.
  • Read-heavy (users view feeds more than they post).
  • Latency: feeds should load in under 1 second.
  • Scope: photo uploads, feed generation, follow/unfollow. Out of scope: stories, direct messaging.

Capacity estimation:

  • 500M DAU, average 2 photos uploaded per user per day = 1B photos/day = ~11.5K uploads/sec.
  • Average photo size: 2 MB. Daily storage: 1B × 2 MB = 2 PB/day. 5 years: ~3.6 exabytes (clearly need distributed storage).
  • Feed reads: assume 10:1 read/write ratio = 115K feed requests/sec.

APIs:

  • POST /photos — upload a photo.
  • GET /feed/:user_id — get personalized feed.
  • POST /follow/:user_id — follow a user.

Data model:

  • users (user_id, username, profile_pic_url)
  • photos (photo_id, user_id, image_url, caption, created_at)
  • follows (follower_id, followee_id)

High-level architecture:

  • Load balancer distributes requests across app servers.
  • Application servers handle API logic.
  • Object storage (S3) stores photos. CDN caches popular photos.
  • Database (sharded PostgreSQL or Cassandra) stores user data, photo metadata, follows. Shard by user_id.
  • Cache (Redis) stores pre-computed feeds.
  • Message queue (Kafka) handles async feed updates: when a user uploads a photo, queue a task to update followers' feeds.

Bottlenecks and optimizations:

  • Feed generation is expensive. If a user follows 1000 people, querying their recent photos and merging them is slow. Solution: fan-out-on-write. When a user posts a photo, push it to all followers' feed caches. Reads become simple cache lookups.
  • Celebrity problem. A celebrity with 100 million followers can't fan-out-on-write—that's 100 million cache writes per post. Solution: fan-out-on-read for celebrities. When you load your feed, fetch celebrity posts on demand.
  • Photo storage. 3.6 exabytes in 5 years is too much for one datacenter. Solution: use S3 or equivalent distributed object storage, with CDN (CloudFlare, CloudFront) for hot photos.

Key Questions to Ask the Interviewer

These questions guide you toward the right design:

  • What's the read/write ratio?
  • What's the expected scale (users, requests/sec)?
  • What are the latency requirements (real-time, near-real-time, eventual consistency)?
  • What features are in scope, and what's out of scope?
  • Do we need to support multiple regions?

How to Communicate Trade-Offs

Don't just say "I'll use Redis for caching." Say:

"I'll use Redis for caching pre-computed timelines because feed reads are 10x more frequent than writes, and users expect sub-second load times. The trade-off is that cached feeds can be slightly stale—if someone I follow posts right now, it might take a few seconds to appear in my feed. For Instagram, that's acceptable. If this were a stock trading platform, I'd choose a different consistency model."

That's what interviewers want to hear. You're making a choice, you're naming the trade-off, and you're explaining why it fits this specific problem.

Measuring and Optimizing Distributed Systems

Once your system is live, you need to know if it's working. Here's what matters in production.

Latency (and Why Percentiles Matter)

Latency is how long a request takes. But "average latency" hides problems.

If your average latency is 100ms, that sounds good. But if the p99 latency (the value that the slowest 1% of requests exceed) is 5 seconds, 1 in 100 requests is giving someone a terrible experience.

Why percentiles matter: A user loading a page might trigger 10 backend requests. If each has a 1% chance of being slow, the page has roughly a 10% chance of being slow (1 - 0.99^10 ≈ 9.6%). Tail latency compounds.

I track:

  • p50 (median): Half of requests are faster than this.
  • p95: 95% of requests are faster than this.
  • p99: 99% of requests are faster than this.

If p99 latency spikes, something is wrong. Maybe a database query hit a slow path. Maybe garbage collection paused the JVM. Percentiles surface these issues.
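
Computing these from raw request timings is a one-liner once you have the samples. A sketch using NumPy, with made-up data standing in for real measurements:

```python
import numpy as np

# Pretend these are per-request latencies in milliseconds from the last minute.
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.6, size=10_000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```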

Throughput

Throughput is how many requests your system handles per second (QPS, queries per second, or RPS, requests per second).

High throughput is good, but only if latency stays low. A system can have high throughput with terrible latency if it's queuing requests.

Error Rates and SLAs/SLOs

Error rate is the percentage of requests that fail (5xx errors, timeouts, etc.).

SLA (Service Level Agreement) is a contract: "We guarantee 99.9% uptime."

SLO (Service Level Objective) is an internal target: "We aim for 99.95% uptime."

If your error rate exceeds your SLO, you're burning your error budget. High error rates often correlate with system overload, cascading failures, or dependency outages.
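
The error budget falls straight out of the SLO arithmetic. A quick sketch for a monthly window (the traffic and failure numbers are illustrative):

```python
SLO = 0.9995                      # internal target: 99.95% of requests succeed
requests_per_month = 500_000_000  # illustrative traffic

error_budget = (1 - SLO) * requests_per_month   # failed requests you can "afford"
failed_so_far = 180_000                          # hypothetically, from monitoring

print(f"budget: {error_budget:,.0f} failed requests/month, "
      f"used: {failed_so_far / error_budget:.0%}")
```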

Where Bottlenecks Typically Appear

In most distributed systems, bottlenecks are:

  • Database: Slow queries, too many writes, lock contention. Solution: indexing, sharding, caching.
  • Network: High latency between services, bandwidth saturation. Solution: co-locate services, use compression, add CDN.
  • Cache misses: If your cache hit rate drops, traffic hits the database. Solution: increase cache size, improve eviction policy, pre-warm cache.

Monitoring Strategies

I use Prometheus for metrics (request rates, latency percentiles, error rates) and Grafana for dashboards. For distributed tracing (tracking a request across multiple services), Jaeger or DataDog APM.

When something breaks, you want:

  • Metrics to tell you what is broken (error rate spike, latency increase).
  • Logs to tell you why (stack traces, error messages).
  • Traces to tell you where (which service in the chain is slow).

Learning Resources and Practice Problems

Here's how I'd prepare if I were interviewing next month.

Books

  • Designing Data-Intensive Applications by Martin Kleppmann. The single best book on distributed systems. Covers consistency models, replication, partitioning, consensus. It's dense but worth every page.
  • System Design Interview – An Insider's Guide by Alex Xu (Volume 1 and 2). Practical, interview-focused. Each chapter walks through a real design problem (URL shortener, rate limiter, etc.).

Practice Platforms

  • Pramp (pramp.com): Free peer-to-peer mock interviews. You interview someone, they interview you. Great for practicing communication under pressure.
  • interviewing.io: Anonymous mock interviews with engineers from top companies. Some are free, some are paid. You get real feedback.

Real Architecture Blogs

Reading how real companies solve real problems is more valuable than generic tutorials. I follow:

  • Netflix Tech Blog (netflixtechblog.com): Chaos engineering, microservices, multi-region deployments.
  • Uber Engineering Blog (eng.uber.com): Sharding, real-time data pipelines, geospatial indexing.
  • Airbnb Engineering & Data Science (medium.com/airbnb-engineering): Migrating off the monolith, their service mesh, their experimentation platform.

Open-Source Systems to Study

Want to understand how distributed systems actually work? Read the code:

  • Redis: In-memory cache and data store. Beautifully simple C codebase.
  • Cassandra: Wide-column distributed database. Great example of eventual consistency and gossip protocols.
  • Kafka: Distributed event streaming. Study how it handles partitioning and replication.

Don't try to read the entire codebase. Pick one feature (e.g., how does Redis handle expiration? How does Kafka replicate logs?) and trace it through.


System design interviews are not about memorizing the "right" architecture for Instagram or Twitter. They're about demonstrating that you can reason through ambiguity, make trade-offs, and communicate your thinking clearly.

The real skill is this: when someone says "design X for 100 million users," you can ask the right questions, sketch a reasonable architecture, identify where it breaks, and explain how you'd fix it. That's what I look for when I interview candidates. That's what got me past the interviews I used to freeze in.

Start with the framework. Practice out loud. Study real systems. And remember: the interviewer isn't testing whether you know the answer—they're testing how you think.
