A deeply-synthesized, opinionated reference distilled from five canonical sources:
donnemartin/system-design-primer ·
ByteByteGoHq/system-design-101 ·
karanpratapsingh/system-design ·
ashishps1/awesome-system-design-resources ·
binhnguyennus/awesome-scalability

Use it as: a study guide for interviews, a checklist for design reviews, and a vocabulary for cross-team discussions.
Table of Contents
- How to Use This Playbook
- The System Design Mindset
- Core Mental Models
- The Interview Framework (RAPID-S)
- Back-of-Envelope Math
- Networking Fundamentals
- DNS, CDN, and Proxies
- Load Balancing & API Gateways
- Databases: Pick Your Engine
- Replication, Sharding, Federation
- Consistency, Transactions & Isolation
- Caching
- Asynchronous Communication
- API Design
- Architectural Patterns
- Distributed Systems Primitives
- Reliability & Resilience Patterns
- Observability, SLA/SLO/SLI
- Security
- Capacity Planning & Scaling Playbook
- Data Engineering & Analytics
- Deployment, Release & Schema Evolution
- Tradeoffs Cheat Sheet
- Interview Problem Templates
- Real-World Case Studies
- Anti-Patterns to Avoid
- Must-Read Papers & Further Reading
1. How to Use This Playbook
There are three audiences:
- Interview candidate. Read sections 2–5 cold, drill section 22, then revisit section 21 the night before.
- Engineer in a design review. Open the relevant chapter (cache, queue, db) plus section 21 and challenge each tradeoff explicitly.
- Tech lead writing an RFC. Use section 4 as the document spine; sections 17, 18, 24 for the "Risks" section.
Reading rule: Every concept here has a counter-concept. If a passage feels like an absolute, you have not read carefully enough – find the tradeoff sentence.
2. The System Design Mindset
System design is the art of making a small set of large, hard-to-reverse decisions explicit. It is rarely about choosing the "best" component; it is about choosing the component whose failure modes you can tolerate.
A good design:
- Scales with growth without full rewrites at each 10x.
- Fails gracefully rather than catastrophically – partial loss is preferable to total loss.
- Lets independent teams move in parallel without cross-team handoffs blocking releases.
- Makes tradeoffs explicit – every choice should have a paragraph saying what we gave up.
Three habits that separate senior from staff designers:
- Quantify before you draw. No box on the diagram should exist without an estimated QPS, latency budget, or storage size attached.
- Name the failure modes. For every component, ask: "what happens when this is slow / down / wrong?" If you cannot answer, you have not designed it.
- Defer the exotic. Reach for the boring tool (Postgres, Redis, Nginx, Kafka) until measurements force the exotic one. Instagram's three rules: use proven tech, don't reinvent, keep it simple.
3. Core Mental Models
3.1 The Six Axes Every Design Lives On
| Axis | Left extreme | Right extreme | Drives choice of |
|---|---|---|---|
| Consistency vs Availability | Strong consistency (CP) | High availability (AP) | Database, replication strategy |
| Latency vs Throughput | Optimize p99 of one request | Maximize req/sec aggregate | Sync vs batched, queueing |
| Read-heavy vs Write-heavy | Cache + replicas | Shard + partition + queue | Storage + access pattern |
| Monolith vs Microservices | Single deployable | Many fine-grained services | Org structure + deployment cadence |
| Sync vs Async | In-line response | Decoupled, eventual | Coupling + tolerance to lag |
| Stateless vs Stateful | Scales linearly | Sharding complexity required | Where you put the hard problem |
3.2 CAP and PACELC
CAP (Brewer): a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance; when a partition occurs, you must choose between C and A. Since partitions are inevitable in distributed systems, the practical choice is CP or AP.
- CP (consistency + partition tolerance): HBase, MongoDB (default), Spanner, Zookeeper. Reject requests during partitions to preserve correctness.
- AP (availability + partition tolerance): Cassandra, DynamoDB (default), CouchDB. Accept stale reads during partitions; reconcile later.
- CA without P: only single-node systems. Postgres, MySQL on one box. Not a real distributed-system choice.
PACELC extends CAP with normal-operation behavior: "if Partitioned, choose A or C; Else, choose Latency or Consistency." Examples: Spanner is PC/EC (consistent always, pays latency); Cassandra is PA/EL (favors availability + low latency).
Practical rule: Most "we need strong consistency" claims are really "we need linearizability for one specific operation." Design that one operation around a sequencer (single shard, leader, lock, distributed transaction) and let the rest be eventually consistent.
3.3 ACID vs BASE
| ACID / BASE | ACID means | BASE means |
|---|---|---|
| Atomicity / Basic Availability | Transaction is all-or-nothing | System keeps responding even if degraded |
| Consistency / Soft state | Constraints hold post-tx | State may change without input |
| Isolation / Eventual consistency | Concurrent tx behave as serial | Nodes converge over time |
| Durability | Committed writes persist | (implicit) |
| Use when | Money, inventory, identity | Feeds, search, analytics, leaderboards |
3.4 Performance vs Scalability – Distinct Problems
- Performance problem: the system is slow for one user.
- Scalability problem: the system is fine for one user but degrades as you add load.
You can have a fast non-scalable system (single beefy box) or a scalable slow system (loosely-coupled microservices with bad cache hit rate). You usually want both, but you fix them with different techniques.
3.5 Latency vs Throughput vs Bandwidth
- Latency: time to do one thing (ms).
- Throughput: things per unit time (QPS, MB/s).
- Bandwidth: maximum throughput a channel could carry.
Little's Law: concurrency = throughput × latency. If a service handles 1000 req/s with 100 ms latency, it has 100 in-flight requests on average. This is the back-of-envelope formula for thread/connection pool sizing.
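Little's Law as runnable napkin math. A minimal sketch; the 1.5× headroom factor is an assumption to tune, not a rule:

```python
# Little's Law: concurrency = throughput x latency.
# All numbers below are illustrative assumptions, not measurements.

def required_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Average number of in-flight requests the service must hold."""
    return throughput_rps * latency_s

# 1000 req/s at 100 ms latency -> 100 in-flight requests on average.
inflight = required_concurrency(1000, 0.100)

# Size the worker/connection pool with headroom for bursts and tail latency.
pool_size = int(inflight * 1.5)  # 50% headroom: a common, arguable default
print(inflight, pool_size)       # 100.0 150
```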
4. The Interview Framework (RAPID-S)
A 6-step structure that fits a 45-minute design interview, adapted from system-design-primer and reinforced by ByteByteGo.
| Step | Time | Output |
|---|---|---|
| Requirements | 5 min | Functional + non-functional list, scale numbers |
| API | 5 min | Endpoints, request/response shapes |
| Plumbing (HLD) | 10 min | Boxes-and-arrows diagram |
| Internals (LLD) | 15 min | Schema, indexes, partition keys, algorithms |
| Deep dives | 5 min | One or two areas the interviewer steers you to |
| Scale + reliability | 5 min | Bottlenecks, failure modes, observability |
4.1 Step 1 – Requirements
Ask before assuming. Functional ("what does it do?") and non-functional ("how well?"):
- DAU / MAU, peak QPS (often 5x average), read/write ratio.
- p50 and p99 latency budgets.
- Durability – how much data loss is acceptable (RPO)?
- Availability target – three nines? four?
- Geographic distribution – single region vs global?
- Consistency requirement – strong on which entities?
State assumptions explicitly: "I'll assume 100M DAU, 10:1 read:write, p99 < 200 ms, eventual consistency on feed but strong on payments."
4.2 Step 2 – APIs First
Defining the public contract first forces clarity. For each endpoint specify method, path, params, response, idempotency. This anchors the rest of the design.
4.3 Step 3 – High-Level Design
Draw 5–7 boxes. Typical: client → CDN → LB → API gateway → service(s) → cache → primary DB + replicas + queue + worker. Justify each box; remove any you cannot justify.
4.4 Step 4 – Low-Level Design
This is where you earn the title. Per service: data model with PK/SK, indexes, partition key, hot-key strategy, cache key, TTL. Per algorithm: name it (consistent hash, geohash, bloom filter, top-k via count-min sketch).
4.5 Step 5 – Deep Dives
Expect interviewer to pick the weakest area. Common targets: hot partition handling, idempotency for retries, exactly-once semantics, schema migration without downtime.
4.6 Step 6 – Bottlenecks & Reliability
Walk every box and ask: what fails when this is slow / dies / lies? Add timeouts, retries with jitter, circuit breakers, rate limits, fallbacks, dead-letter queues. State your monitoring (RED + USE), alerts, and runbook headings.
5. Back-of-Envelope Math
In a 45-minute design interview, you have ~5 minutes to size the system. The goal is not precision – it's getting within an order of magnitude in seconds, then defending the assumption. The numbers below are the toolbox; this chapter shows how to wield them.
The same math runs the design review: when someone proposes a new dependency, a new cache layer, or a 10× scale-up, an engineer who can compute the consequence on a napkin out-argues three engineers who can't.
5.1 Powers of Two (memorize)
Computers count in powers of 2; capacity, addressing, and memory come in 2ⁿ. The convenient coincidence: 2¹⁰ ≈ 10³, so binary and decimal prefixes line up cleanly and you can convert in your head.
| Power | Approx | Name | Where you see it |
|---|---|---|---|
| 2^10 | 10^3 | thousand (KB) | Packet, small file |
| 2^20 | 10^6 | million (MB) | Image, document |
| 2^30 | 10^9 | billion (GB) | Per-host RAM, HD video |
| 2^40 | 10^12 | trillion (TB) | Database, single dataset |
| 2^50 | 10^15 | quadrillion (PB) | Datacenter-scale storage |
| 2^60 | 10^18 | quintillion (EB) | Hyperscaler totals |
Bit-budget shortcuts that come up constantly:
- A signed 32-bit int holds ~2.1 × 10⁹. User IDs, tweet IDs, and auto-increment counters all hit this ceiling – that's why you'll find production `int` → `bigint` migrations in every old codebase.
- A signed 64-bit int holds ~9.2 × 10¹⁸ – effectively infinite for any counter you'll ever build.
- A 64-bit nanosecond timestamp covers ~292 years from 1970.
- UUIDv4 = 128 bits = 16 bytes binary, ~36 chars hex, ~22 chars base64.
Typical record sizes (memorize the order of magnitude):
| Item | Size |
|---|---|
| Boolean, int8, char | 1 B |
| int32, float32, IPv4 | 4 B |
| int64, float64, timestamp | 8 B |
| UUID (binary) | 16 B |
| SHA-256 hash | 32 B |
| Tweet text | ~140 B |
| URL | ~100 B |
| JSON user record | 0.5–2 KB |
| Web image (compressed) | 50–500 KB |
| Phone photo (full) | 1–5 MB |
| HD video (per minute) | ~30 MB |
| 4K video (per minute) | ~200 MB |
These prevent the most common interview mistake: estimating storage off by 1000× because you mixed up KB and MB.
5.2 Latency Numbers Every Programmer Should Know
Originally compiled by Jeff Dean and updated by Peter Norvig. The values below are the modern, rounded version. Memorize them – every capacity argument descends from this table.
| Operation | Time | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | "free" |
| Branch mispredict | 5 ns | Flush the pipeline |
| L2 cache reference | 7 ns | 14× L1 |
| Mutex lock/unlock | 25 ns | Uncontended; contention is much worse |
| Main memory reference | 100 ns | 200× L1 |
| Compress 1 KB with Zippy / Snappy | 10 µs | |
| Send 1 KB over 1 Gbps | 10 µs | Network bandwidth, not latency |
| Read 4 KB random from SSD | 150 µs | NVMe is faster (10–50 µs) |
| Read 1 MB sequential from memory | 250 µs | |
| Round-trip within same datacenter | 500 µs (0.5 ms) | One AZ-to-AZ hop |
| Read 1 MB sequential from SSD | 1 ms | |
| Disk seek | 10 ms | Why databases hate random I/O |
| Read 1 MB sequential from disk | 20 ms | 80× SSD |
| Cross-region (intra-continent) | 10–60 ms | |
| Cross-continent round-trip | ~150 ms | Speed of light through fiber |
Time-scaled to human terms (intuition pump). If 1 ns = 1 second:
| Operation | Human-scale |
|---|---|
| L1 hit | 0.5 s (a heartbeat) |
| Memory access | ~2 minutes |
| SSD random read | ~1.5 days |
| Same-DC round trip | ~6 days |
| 1 MB from disk | ~8 months |
| Cross-continent round trip | ~5 years |
This is why crossing layers – process → host → datacenter → region – is the dominant design concern. Each boundary is 10–100× slower than the one before.
Operational implications:
- Never block a user request on a cross-region call unless you absolutely must. 150 ms is a non-negotiable speed-of-light tax that blows most p99 budgets.
- Disk seeks are the enemy. Sequential I/O is ~100× faster than random. This is the reason LSM-trees, log-structured storage, and append-only logs win for write-heavy workloads.
- A network call costs roughly the same as 1 MB of memory work. A chatty service that issues 50 RPCs per page-render burns 50 × 0.5 ms = 25 ms in network alone, before any actual work.
- Memory bandwidth dominates within a process. Allocating millions of small objects is often slower than fewer big ones, because cache misses, not CPU work, are the bottleneck.
- Compression is essentially free at 10 µs per KB compared to network I/O – always compress payloads crossing the network.
Typical p99 latency budget for a 200 ms web request:
| Component | Budget |
|---|---|
| TLS handshake + LB + ingress | 5–10 ms |
| App server processing | 20–30 ms |
| 1–3 cache lookups | 1–5 ms |
| 1–2 database queries | 20–50 ms |
| 1–2 downstream RPCs | 10–30 ms each |
| Response serialization + egress | 5 ms |
| Headroom for tail / GC / retries | the rest |
If any single component eats > 50 ms, scrutinize it. The discipline of budgeting latency before building catches more performance bugs than any profiler.
5.3 Time, Throughput, and Storage Quick Reference
Time conversions to memorize:
- 1 day = 86,400 s ≈ 10⁵ s
- 1 month ≈ 2.6 × 10⁶ s
- 1 year ≈ 3.15 × 10⁷ s ≈ 32 M s
Throughput conversions:
- QPS = daily_requests ÷ 86,400. 1 M requests/day ≈ 12 QPS average.
- Peak QPS ≈ 2–10× average, depending on workload. Consumer apps spike hard at evenings and weekends; B2B SaaS spikes at business hours; ad systems are flatter. Default to 5× when you don't know.
- Bandwidth = QPS × payload_size. 1,000 QPS × 100 KB = 100 MB/s = 800 Mbps.
- Daily ingest = QPS × payload × 86,400.
Storage growth:
- Annual storage = avg_QPS × bytes_per_record × 86,400 × 365 × replication_factor
- 5-year retention with 3× replication = 15× the year-1 raw number.
- Rule of thumb: a 1 KB record at 1,000 QPS sustained for a year × 3 replicas ≈ 100 TB (see the sketch below).
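The storage formula above as code you can rerun with your own assumptions; the record size and QPS here are illustrative:

```python
# Back-of-envelope storage sizing. All inputs are assumptions to state out loud.
SECONDS_PER_YEAR = 86_400 * 365          # ~3.15e7

def annual_storage_bytes(avg_qps: float, bytes_per_record: int,
                         replication_factor: int = 3) -> float:
    return avg_qps * bytes_per_record * SECONDS_PER_YEAR * replication_factor

# 1 KB records at a sustained 1,000 QPS with 3x replication:
tb = annual_storage_bytes(1_000, 1_024) / 1e12
print(f"{tb:.0f} TB/year")  # ~97 TB/year -- the "~100 TB" rule of thumb
```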
Worked example β Twitter sizing.
- 500 M DAU, each posts 0.2 tweets/day and reads 100 tweets/day.
- Writes: 500 M × 0.2 = 100 M tweets/day ≈ ~1,200 write QPS avg, ~6,000 peak.
- Reads: 500 M × 100 = 50 B reads/day ≈ ~580 K read QPS avg, ~3 M peak. Read:write = 500:1 → read-dominated, cache aggressively.
- Per tweet: ~1 KB with metadata. Daily ingest = 100 GB. 5 years × 3 replicas ≈ 550 TB. Storage fits on one cluster, so storage isn't the dominant constraint – read QPS and fan-out are.
This is the right shape of an interview answer: numbers anchored, ratio called out, and the constraint named.
Read-to-write ratios (rough priors for common system types):
| System | Read : Write |
|---|---|
| Social feed (Twitter, Instagram, TikTok) | 100:1 to 1000:1 |
| Document collab (Notion, Google Docs) | 5:1 to 20:1 |
| E-commerce browse vs purchase | ~100:1 |
| Banking / ledger | ~1:1 |
| Logging / metrics / event ingest | 1:100 (write-heavy) |
| Search (queries vs reindex) | ~100:1 |
Read:write ratio is the most important early signal for the design. Read-heavy → cache + replicas + denormalize. Write-heavy → partition + queue + LSM-tree.
5.4 Availability in Numbers
| Availability | Annual downtime | Monthly | Daily |
|---|---|---|---|
| 99% (2-9s) | 3.65 days | 7.2 h | 14.4 min |
| 99.9% (3-9s) | 8.77 h | 43.8 min | 1.44 min |
| 99.95% | 4.38 h | 21.9 min | 43.2 s |
| 99.99% (4-9s) | 52.6 min | 4.32 min | 8.6 s |
| 99.999% (5-9s) | 5.26 min | 25.9 s | 0.86 s |
| 99.9999% (6-9s) | 31.5 s | 2.6 s | 0.09 s |
Each additional 9 costs roughly 10× more in engineering hours, infrastructure, and operational complexity. Industry reality:
- Most consumer products live at 99.9–99.95%.
- Tier-1 SaaS commits to 99.95–99.99%.
- Payment networks aim for 99.99%.
- Telephone networks were the canonical 99.999% (~5 min/year).
- 6-9s is mythological for any single system; you only get there by composing redundant systems and counting carefully.
Series vs parallel β the math that drives architecture.
When components are in series (every one must be up), availabilities multiply and total goes down:
A_total = A1 × A2 × A3 × …
A typical request path: LB (99.99%) → App (99.95%) → Cache (99.99%) → DB (99.95%) → External API (99.9%).
Total: 0.9999 × 0.9995 × 0.9999 × 0.9995 × 0.999 = **99.78%** – worse than the worst single component.
Lesson 1. Adding a dependency always lowers your availability. Each external service is an availability tax. This is one of the strongest arguments against gratuitous microservice splits – every hop is a 9 you didn't earn.
When components are in parallel (any one up keeps the system up), failure probabilities multiply and total goes up:
A_total = 1 − (1 − A1) × (1 − A2) × (1 − A3) × …
Two 99% replicas: 1 − 0.01² = 99.99%. Three: 1 − 0.01³ = 99.9999%. Redundancy compounds exponentially – but only if failures are independent.
Lesson 2. A redundant cluster is only as good as the correlation of its failures. Two replicas in the same rack share PDU and switch failures; two regions share a deploy pipeline; all replicas share a software bug. Audit shared dependencies, not just replica counts. The truly correlated failures (a bad deploy, a poisoned cache key) are what take down "highly available" systems.
Composite reasoning – what you actually compute in a design review:
A_system = A_series_path × A_redundant_groups
A 3-replica DB cluster (effective 99.9999%) behind an LB (99.99%) behind an app tier (99.95%):
0.999999 × 0.9999 × 0.9995 ≈ **99.94%** – roughly 5 hours downtime/year. To improve this, fix the weakest link (the 99.95% app tier here) rather than piling on more DB replicas – those bought you a 9 that another tier is already throwing away.
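A minimal sketch of the series/parallel arithmetic, assuming independent failures (exactly the assumption Lesson 2 warns about):

```python
# Series/parallel availability math from this section, as code.

def series(*avail: float) -> float:
    """Every component must be up: availabilities multiply."""
    total = 1.0
    for a in avail:
        total *= a
    return total

def parallel(*avail: float) -> float:
    """Any replica up keeps the system up: failure probabilities multiply."""
    failure = 1.0
    for a in avail:
        failure *= (1.0 - a)
    return 1.0 - failure

# Request path: LB -> App -> Cache -> DB -> External API
print(series(0.9999, 0.9995, 0.9999, 0.9995, 0.999))  # ~0.9978

# Three 99% replicas, assuming independent failures (the big assumption):
print(parallel(0.99, 0.99, 0.99))                     # 0.999999

# Composite: a 3-replica DB group behind an LB and an app tier (illustrative):
print(series(parallel(0.999, 0.999, 0.999), 0.9999, 0.9995))  # ~0.9994
```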
Error budget. If your SLO is 99.9%, you have 0.1% × 30 days ≈ 43 min/month of allowed downtime. That budget is spent on: deploys, experiments, planned maintenance, and unplanned outages. Burn it intentionally on shipping; preserve it during incidents. (See §18.3 for the operational practice.)
6. Networking Fundamentals
6.1 OSI Model (the practical version)
| Layer | Name | Examples | When you care |
|---|---|---|---|
| 7 | Application | HTTP, gRPC, DNS, SMTP | Always |
| 6 | Presentation | TLS, compression | Auth + perf |
| 5 | Session | RPC sessions | Rarely |
| 4 | Transport | TCP, UDP, QUIC | LB algorithms, sockets |
| 3 | Network | IP, ICMP | Routing, VPC, subnets |
| 2 | Data link | Ethernet, MAC | DC engineers |
| 1 | Physical | Cables, wifi | Hardware |
Practical takeaway: L4 vs L7 load balancing, TLS at L6, CDN at L7. Most senior engineers live in L7, occasionally drop to L4 for performance, and only touch L3 for VPC/peering.
6.2 TCP vs UDP vs QUIC
| TCP | UDP | QUIC (HTTP/3) | |
|---|---|---|---|
| Connection | Handshake (3-way) | None | Transport + TLS handshake combined (1 RTT, 0-RTT resumption) |
| Reliability | Guaranteed in-order | None | Guaranteed |
| Congestion control | Yes | No | Yes (better than TCP) |
| Head-of-line blocking | Yes | N/A | No (per-stream) |
| Use for | HTTP/1.1, HTTP/2, DBs, SSH | DNS, video, VoIP, gaming | HTTP/3, gRPC over QUIC |
Connection pooling: TCP handshake costs an RTT. Reusing connections (keep-alive, gRPC channels, DB connection pools) is the #1 micro-optimization for backend services.
6.3 IP Basics
- IPv4: 32-bit, ~4.3 B addresses (exhausted; NAT + CIDR keep it alive).
- IPv6: 128-bit, effectively unlimited.
- Static vs dynamic: services use static; clients use DHCP-assigned dynamic.
- Public vs private: RFC 1918 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are private; NAT gateways translate to public.
7. DNS, CDN, and Proxies
7.1 DNS
DNS resolves a domain name to an IP via a hierarchical lookup: stub resolver → recursive resolver → root → TLD → authoritative. Caching at every layer (browser, OS, resolver) is critical to performance.
Record types you must know:
- A – domain → IPv4
- AAAA – domain → IPv6
- CNAME – alias to another name
- MX – mail exchange
- NS – authoritative nameservers
- TXT – arbitrary text (SPF, DKIM, domain verification)
- PTR – reverse lookup
TTL: the cache duration. Low TTL (60s) enables fast failover but increases lookup load. High TTL (24h) is efficient but slow to propagate changes. Production rule: low TTL on records you will fail over (api.example.com), high TTL on stable records (www.example.com).
Routing strategies via DNS:
- Weighted round-robin (canary deploys).
- Latency-based (Route 53).
- Geolocation (compliance-driven).
- Failover (active-passive).
7.2 CDN
A CDN caches static (and increasingly dynamic) content at geographically distributed PoPs. Reduces latency for the user and load on the origin.
| | Push CDN | Pull CDN |
|---|---|---|
| Trigger | You upload on change | CDN fetches on first miss |
| Storage | All content always present | Hot content cached |
| Best for | Low-traffic, infrequent updates | High-traffic, frequent changes |
| Stale risk | Until next push | Until TTL expires |
Cache key tips: include version in path or query (/v3/style.css, ?v=hash). Prefer immutable URLs + long TTLs over short TTLs + invalidation. Use stale-while-revalidate for the best of both worlds.
Edge compute (Cloudflare Workers, Lambda@Edge): A/B routing, request rewriting, light auth – anything that benefits from running close to the user.
7.3 Forward vs Reverse Proxy
- Forward proxy sits in front of clients. Used for anonymity, content filtering, corporate egress, geo-bypass (VPN).
- Reverse proxy sits in front of servers. Provides TLS termination, caching, compression, rate limiting, request rewriting, blue-green routing. Examples: Nginx, Envoy, HAProxy, Traefik.
A reverse proxy is often also a load balancer; the terms overlap when you have multiple backends. The distinction: load balancer's primary job is distribution; reverse proxy's primary job is interface unification + edge concerns.
8. Load Balancing & API Gateways
8.1 Load Balancer Layers
L4 (transport): routes by IP + port. Cheap, fast, content-blind. Connection-level stickiness only. Use for: TCP services, gRPC (with care), MySQL/Redis frontends.
L7 (application): routes by HTTP path, host, header, cookie. Expensive, flexible. Can do: SSL termination, canary by header, JSON-based routing, request rewriting. Use for: web traffic, API gateways.
8.2 Algorithms
| Algorithm | Behavior | Best for |
|---|---|---|
| Round-robin | Rotate through backends | Homogeneous backends |
| Weighted round-robin | Bigger machines get more | Heterogeneous fleet |
| Least connections | Send to least-busy | Long-lived connections, websockets |
| Least response time | Send to fastest | Mixed workloads |
| IP hash / consistent hash | Same client → same backend | Sticky cache, stateful sessions |
| Random / random-2-choices | Pick 2 at random, send to the less loaded | Best general default at scale |
Power of 2 random choices outperforms round-robin under realistic latency variance.
8.3 Sticky Sessions vs Stateless
Sticky sessions tie a client to one backend. They make caching easier but break when that backend dies (session lost) or scales down. Prefer stateless services with session in Redis/JWT; use sticky only for stateful protocols (websockets) and even then expect to handle disconnects.
8.4 API Gateway
A specialized reverse proxy + L7 LB at the edge of a microservice cluster. Concerns it owns:
- AuthN / AuthZ (JWT validation, mTLS)
- Rate limiting and quotas
- Request transformation (protocol bridging: REST ↔ gRPC)
- Response aggregation (BFF pattern)
- API versioning and routing
- Observability (request logs, traces)
- WAF / IP blocklist
Pitfall: the gateway can become a god-object. Keep business logic in services; gateway is for cross-cutting concerns.
9. Databases: Pick Your Engine
9.1 Decision Matrix
| Use case | Pick | Why |
|---|---|---|
| Money, inventory, identity, anything regulated | Postgres / MySQL | ACID, mature, strong constraints |
| Flexible JSON-shaped data, modest scale | Postgres (JSONB) or MongoDB | Document flexibility |
| Massive write volume, time-series, IoT | Cassandra, ScyllaDB, InfluxDB | Wide-column / TSDB |
| Sub-ms reads, ephemeral state | Redis | In-memory KV |
| Petabyte analytics | Snowflake, BigQuery, Redshift | Columnar OLAP |
| Full-text search | Elasticsearch / OpenSearch | Inverted index |
| Highly relational queries (recommendations, fraud) | Neo4j, JanusGraph | Graph traversal |
| Globally consistent + scale | Spanner, CockroachDB, YugabyteDB | Distributed SQL |
9.2 SQL (RDBMS)
Strengths: schema enforcement, joins, ACID transactions, decades of tooling, well-understood failure modes.
Weaknesses: vertical scaling first, schema migrations under load, joins across shards are painful.
When stuck, try in this order before switching to NoSQL: index, denormalize, partition table, read replica, vertical scale, shard.
9.3 NoSQL Families
Key-Value (Redis, Memcached, DynamoDB, Riak)
- O(1) get/put. No queries beyond key. Great for cache, session, leaderboard, rate limiter state.
- Limitation: no rich query, easy to corrupt invariants by writing piecemeal.
Document (MongoDB, Couchbase, DynamoDB)
- JSON/BSON values, queryable by field, secondary indexes.
- Schemaless feels easy at first, painful at year 3 – invest in schema-on-read tooling.
Wide-Column (Cassandra, HBase, BigTable, ScyllaDB)
- Row key + dynamic columns, sparse, sorted on disk.
- Built for write-heavy time-series and event logs at PB scale.
- Consistency tunable per query (R+W>N for strong reads).
- Modeling rule: design tables per query, never normalize.
Graph (Neo4j, JanusGraph, Amazon Neptune)
- First-class nodes + edges + properties. Cypher / Gremlin.
- Killer app: many-hop relationship queries (friends-of-friends, fraud rings).
Time-Series (InfluxDB, TimescaleDB, Prometheus, Druid)
- Optimized for `(metric, timestamp, value, tags)` ingestion + windowed aggregation + downsampling.
Search (Elasticsearch, OpenSearch, Solr)
- Inverted index. Full-text + faceted search + ranking.
- Not a primary store – the index is rebuildable; use a real DB as source of truth.
9.4 SQL vs NoSQL – Selection Heuristic
Pick SQL when:
- Schema is stable and relationships matter.
- You need joins, multi-row transactions, or constraints.
- Data fits comfortably on one large server (or a small cluster).
Pick NoSQL when:
- Schema is flexible / multi-tenant.
- Write rate exceeds what one master can absorb.
- Access pattern is well-known and narrow (key lookup, time range).
- Operating ACID across rows is not required.
The most expensive lesson teams learn: picking NoSQL because "we'll be web-scale" when they have 100K rows. Start SQL until measurements force change. (Pinterest, GitHub, Shopify all run massive Postgres/MySQL clusters.)
9.5 Storage Engines: B-Tree vs LSM-Tree
The choice of storage engine is the biggest single determinant of a database's read/write profile. Two families dominate.
B-Tree (Postgres, MySQL InnoDB, MongoDB WiredTiger, SQLite, Oracle)
- In-place updates: writes mutate pages on disk via WAL + buffer pool.
- ~2× write amplification (page rewrite + WAL).
- Read-optimized: O(log n) seek, page locality.
- Mature ecosystem: indexing, MVCC, transactions, concurrency control built around it.
LSM-Tree (Cassandra, RocksDB, LevelDB, HBase, ScyllaDB, BigTable)
- Append-only memtable → flushed as immutable sorted files (SSTables) → compacted in background.
- Write-friendly: pure sequential I/O, no in-place updates.
- Read amplification: a key may live across many SSTables → bloom filter + per-file index narrow the search.
- Space amplification + compaction CPU are the costs.
The amplification triangle. A storage engine optimizes at most two of: write amp, read amp, space amp. B-trees pay write amp for read perf; LSM-trees pay read+space amp for write perf.
| Workload | Pick |
|---|---|
| Read-heavy OLTP, joins, transactions | B-tree |
| Write-heavy time-series, event logs, telemetry | LSM-tree |
| Mixed but reads dominate the latency budget | B-tree |
| Append-mostly, batch-tolerant reads | LSM-tree |
Implication for design: when an interviewer says "10× write rate vs read rate," that's an LSM signal even before they say "Cassandra."
10. Replication, Sharding, Federation
10.1 Replication
Master-Slave (Primary-Replica)
- One writer, many readers. Replicas serve read traffic and act as failover candidates.
- Async replication: low write latency, replica lag, possible data loss on failover.
- Semi-sync: wait for one replica ack – a middle ground.
- Sync: strong durability, write latency dominated by the slowest replica.
- Pitfall: read-your-writes anomalies – solve with sticky read-from-primary for a session window after a write, or version tokens.
Master-Master (Multi-Primary)
- Both nodes accept writes. Requires conflict resolution (last-write-wins, vector clocks, CRDTs).
- Higher availability for writes; harder correctness.
Quorum (R + W > N)
- N replicas, write to W, read from R. If R+W>N, every read overlaps at least one node that has the latest write.
- Cassandra, Dynamo. Tune per-query for AP-vs-CP tradeoff.
10.2 Sharding (Horizontal Partitioning)
Splits data across nodes by a shard key. Three strategies:
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range | shard = f(range(key)) (e.g., A–F, G–M…) | Range queries fast | Hotspots if data skewed |
| Hash | shard = hash(key) % N | Even distribution | Range queries scatter; resharding rehashes everything |
| Consistent hash | Map nodes onto a ring, key → next node clockwise | Minimal movement on add/remove | More complex |
| Directory | Lookup table from key → shard | Maximum flexibility | Lookup service is SPOF; extra hop |
| Geographic | Shard by user region | Latency wins | Cross-region traffic harder |
Shard key selection β the most important decision:
- Cardinality: millions of distinct values, not dozens.
- Even access: no celebrity hot key (e.g., a global counter).
- Query alignment: queries should be answerable from one shard whenever possible.
- Mutability: key must not change.
Examples: (user_id, created_at) for chat messages, (tenant_id, doc_id) for SaaS, (date, event_id) for events.
Resharding is the hardest operational problem. Plan for it from day one – version your shard map, build a backfill pipeline, accept dual-writes during migration.
10.3 Federation (Functional Partitioning)
Split the database by domain, not by rows: users_db, orders_db, inventory_db. Each owned by one team.
- Pro: clean ownership, independent schema evolution, smaller blast radius.
- Con: cross-domain joins now require app-level fan-out or duplication.
- Plays well with microservices (one DB per service).
10.4 Consistent Hashing
Place nodes at hashed positions on a 0…2^32 ring. A key maps to the first node clockwise from hash(key).
- Adding a node moves only ~K/N keys (the slice between the predecessor and the new node).
- Virtual nodes: each physical node owns many ring positions – smooths distribution and prevents hotspots when nodes differ in capacity.
- Used by Memcached client-side, Cassandra, DynamoDB, Discord routing layer.
10.5 Replication + Sharding Combined
Real systems do both. Each shard is itself a replica set (e.g., 3-node Raft group). A 100-shard cluster is 300 nodes. The shard map says "key X lives on shard 7"; the replica set says "shard 7 is hosted by nodes A/B/C with A as leader."
11. Consistency, Transactions & Isolation
11.1 Consistency Spectrum
From weakest to strongest:
- Eventual – replicas converge given no new writes.
- Read-your-writes – a client sees its own writes immediately.
- Monotonic reads – once seen, never see older.
- Causal – causally related writes are observed in order.
- Sequential – all clients agree on a single order.
- Linearizable – operations appear instantaneous and totally ordered (real-time).
- Strict serializable – linearizable + serializable across multi-key transactions.
Most user-facing systems need read-your-writes + monotonic. Linearizability is reserved for leader election, locking, and money.
11.2 Transaction Isolation Levels (SQL)
| Level | Dirty read | Non-repeatable read | Phantom read |
|---|---|---|---|
| Read uncommitted | possible | possible | possible |
| Read committed (default in Postgres, Oracle) | no | possible | possible |
| Repeatable read (default in MySQL InnoDB) | no | no | possible* |
| Snapshot isolation | no | no | no (but write skew possible) |
| Serializable | no | no | no |
* InnoDB's "repeatable read" is actually snapshot isolation in practice.
Anomalies to know:
- Lost update – two read-modify-writes overwrite each other. Fix: SELECT FOR UPDATE, optimistic locking with a version column, atomic increment.
- Write skew – two transactions read overlapping data, write disjoint data, both commit, breaking an invariant. Only serializable isolation prevents it.
11.3 Distributed Transactions
Two-Phase Commit (2PC)
- Coordinator: PREPARE → all participants vote → if all yes, COMMIT.
- Atomic, simple to reason about.
- Blocking: if coordinator dies after PREPARE, participants are stuck holding locks.
- Fine within one datacenter for short transactions; bad across services or WAN.
Three-Phase Commit (3PC)
- Adds pre-commit phase to be non-blocking.
- Theoretically nicer, rarely used in practice.
Saga Pattern (the modern answer)
- A transaction = a sequence of local transactions, each with a compensating undo.
- Two flavors:
- Choreography: services emit events; downstream services react and emit their own.
- Orchestration: a saga coordinator (state machine) drives the flow.
- Choose orchestration for >3 steps or complex error paths.
TCC (Try-Confirm-Cancel)
- Reservation-style: each service "tries" (reserves), then orchestrator either "confirms" or "cancels" all.
- Stronger than saga (no observed in-between state) but more invasive on services.
Outbox Pattern (must-know companion)
- Atomically write business state + event row in same DB transaction; a separate process publishes the event row to the bus.
- Solves the "service updated DB but failed to publish event" problem without distributed transactions.
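A sketch of the outbox pattern using sqlite3 from the standard library; the table and column names are invented for illustration:

```python
# Outbox pattern sketch: business write + event row commit atomically;
# a separate poller publishes pending rows and marks them sent.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    with db:  # one local transaction: state change + event row, or neither
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", json.dumps({"order_id": order_id, "event": "placed"})))

def publish_pending(publish) -> None:
    # Poller (or CDC) drains the outbox; at-least-once, so consumers dedupe.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)            # e.g., a Kafka producer send
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42)
publish_pending(lambda t, p: print(t, p))
```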
11.4 Consensus
Paxos / Multi-Paxos – the original. Hard to understand, hard to implement.
Raft – the practical replacement. Used by etcd, Consul, CockroachDB, TiKV.
ZAB – Zookeeper's variant.
You almost never implement consensus yourself. You use a library (etcd, Zookeeper, Consul) for: leader election, distributed locks, configuration, service discovery, group membership.
Consensus is expensive. Don't put it in the request hot path. Use it for control-plane decisions (who's leader, what's the shard map), then let data-plane traffic flow without consensus on every request.
11.5 Idempotency: A First-Class Design Concern
"At-least-once delivery + idempotent handler" is the practical pattern that replaces the unattainable "exactly once." It also defends against client retries, browser double-clicks, network timeouts, and message-bus redeliveries.
The canonical recipe:
- Client generates a UUID per logical operation and sends it as an `Idempotency-Key` header (Stripe pattern).
- Server checks a dedup store (Redis, DB table) keyed by `(tenant_id, idempotency_key)`:
  - Present + complete → return the stored response verbatim.
  - Present + in-flight → return 409 Conflict, or block-and-wait.
  - Absent → mark in-flight, perform the operation, store the response.
- TTL the dedup record (24 h–7 d typical).
Per-operation kind:
- Create: dedup by client key.
- Increment / counter: convert to "set value if event_id not seen" (event log + materialized counter), or use natively idempotent commands (`SETNX`, `INCR` with a seen-set guard).
- External call (charge card, send email): wrap in a dedup table. Record the provider's response so a retry returns an identical payload.
- Stream processing: dedup by `(producer_id, sequence_number)` or a unique event ID. Kafka's transactional producer + offset commits give end-to-end exactly-once within Kafka.
- HTTP PUT: semantically idempotent already – full replacement, repeatable.
Fencing tokens (for distributed locks): every write carries a monotonically increasing token (issued by lock service). Storage rejects writes with stale tokens. Defends against zombie clients holding expired locks (the classic Redis Redlock failure mode).
Hot-take: if your design has a POST without an idempotency-key story, the design has a bug.
12. Caching
12.1 Layers (in order, from client to disk)
- Browser cache – HTTP cache headers, service workers.
- CDN – geographic edge.
- Reverse proxy / web server cache – Varnish, Nginx.
- Application cache – Redis, Memcached.
- Database query cache / buffer pool – Postgres shared_buffers.
- OS page cache – Linux page cache.
Each level is faster + smaller than the next. Cache hits compound: a 90% hit rate at three layers = 99.9% of requests never reach the DB.
12.2 Cache Patterns (Read)
Cache-aside (lazy loading) – most common.
GET key in cache?
  yes → return cached
  no → read from DB → write to cache → return
- Pro: only requested data is cached. Resilient to cache failures.
- Con: cold-cache spikes. Stale data unless TTL or invalidation.
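Cache-aside in code – a sketch where a dict with expiry stands in for Redis and `load_from_db` is a placeholder for the real query:

```python
# Cache-aside sketch: check cache, fall back to DB, populate on miss.
import time

cache: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
TTL_SECONDS = 300

def load_from_db(user_id: str) -> dict:
    return {"id": user_id, "name": "...from the database..."}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    hit = cache.get(key)
    if hit and hit[0] > time.time():          # fresh hit -> serve from cache
        return hit[1]
    value = load_from_db(user_id)             # miss -> read from DB
    cache[key] = (time.time() + TTL_SECONDS, value)
    return value
```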
Read-through – same effect, but the cache library does the DB read on miss. App only talks to cache.
Refresh-ahead – cache proactively refreshes hot keys before TTL. Reduces tail latency for predictable hot keys.
12.3 Cache Patterns (Write)
| Pattern | Order | Pro | Con |
|---|---|---|---|
| Write-through | App → cache → DB (sync) | Fresh cache, no loss | Slow writes |
| Write-around | App → DB; cache filled lazily on read | Fast writes | First read slow |
| Write-behind / write-back | App → cache → DB (async batch) | Fast writes, batchable | Risk of loss on cache crash |
12.4 Eviction Policies
| Policy | Behavior | Best for |
|---|---|---|
| LRU | Evict least recently used | General purpose default |
| LFU | Evict least frequently used | Long-lived hot keys |
| FIFO | Evict oldest inserted | Simple, but rarely best |
| TTL | Evict on expiry | Time-bounded data |
| Random / 2-random | Pick random victim | Low-overhead approximation |
Production caches usually combine TTL + LRU.
12.5 Invalidation – "the second hardest problem in CS"
Strategies:
- TTL – cheapest, eventually consistent, accept staleness.
- Write-through – synchronous correctness, write cost.
- Explicit invalidation on write – app deletes the cache key after the DB write. Race condition: if another process repopulates between your write and delete, you cache stale data. Mitigations: delete-then-write order, double-delete with delay, bump a version key.
- Versioned keys – `user:123:v42`. Update a version pointer atomically; old keys age out.
- Pub/sub invalidation – a DB CDC stream broadcasts invalidations.
12.6 Common Pitfalls
- Thundering herd: TTL expires under load and every request hits the DB simultaneously. Fix: jittered TTL, single-flight (one request fills, others wait; see the sketch after this list), early refresh.
- Cache stampede on cold start: warm-up script before traffic shift; tiered caches.
- Cache penetration: queries for non-existent keys bypass cache and hit DB. Fix: cache the "not found" result, or use a bloom filter.
- Cache avalanche: mass simultaneous expiry. Fix: random jitter on TTL.
- Hot key: one celebrity key overwhelms one shard. Fix: replicate across N keys, split the key, in-process LRU on app servers.
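A sketch of two of these fixes – jittered TTLs and single-flight – using only the standard library (error handling on the filler path is elided):

```python
import random
import threading

def jittered_ttl(base_s: float) -> float:
    # Spread expiries so a cohort of keys doesn't all expire at once.
    return base_s * random.uniform(0.9, 1.1)

_inflight: dict[str, threading.Event] = {}
_results: dict[str, object] = {}
_mu = threading.Lock()

def single_flight(key: str, compute):
    """One caller recomputes a missing key; concurrent callers wait for it."""
    with _mu:
        ev = _inflight.get(key)
        leader = ev is None
        if leader:
            ev = threading.Event()
            _inflight[key] = ev
    if leader:
        try:
            _results[key] = compute()      # exactly one DB hit per key
        finally:
            with _mu:
                del _inflight[key]
            ev.set()
    else:
        ev.wait()                          # followers block until filled
    return _results.get(key)               # None if the filler failed
```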
13. Asynchronous Communication
13.1 Why Async
Decouples producer from consumer in time, fault-domain, and rate. The producer publishes a message; the consumer processes when it can. The system absorbs spikes and isolates failures.
13.2 Message Queue vs Event Stream
| | Message Queue (RabbitMQ, SQS, ActiveMQ) | Event Stream (Kafka, Pulsar, Kinesis) |
|---|---|---|
| Model | Point-to-point or routing | Pub-sub log |
| Consumption | Message removed after ack | Messages retained, consumers track offset |
| Replay | Generally no | Yes (rewind to offset) |
| Ordering | Per-queue | Per-partition |
| Throughput | High (10k–100k/s) | Very high (1M+/s) |
| Use for | Job processing, RPC | Event sourcing, log aggregation, stream processing |
Use a queue for: send-email jobs, video transcoding, retryable RPC, fan-out to one worker.
Use a stream for: event sourcing, change data capture, multi-consumer fan-out, analytics, audit trail.
13.3 Delivery Semantics
- At-most-once – fire and forget. Messages may be lost. Use for telemetry where the exact count is unimportant.
- At-least-once – guaranteed delivery, possible duplicates. The default and the realistic target.
- Exactly-once – guaranteed delivery, no duplicates. Practically achieved via at-least-once + idempotent consumer (deduplicate by message ID). Kafka offers transactional producer + read-process-write within Kafka, but end-to-end exactly-once across systems is an idempotency design problem, not a guarantee you buy.
13.4 Patterns
- Work queue: N producers → queue → M workers, one worker per message. Auto-scales.
- Pub-sub / fan-out: one publish → N subscribers each get a copy.
- Routing / topic: message tagged; subscribers filter.
- Dead-letter queue (DLQ): messages that fail repeatedly land in DLQ for manual / scripted recovery. Always configure one.
- Outbox + CDC: atomic write to DB + event table; CDC publishes. Eliminates dual-write inconsistency.
13.5 Backpressure
When consumers can't keep up, the queue grows unbounded → memory blow-up → cascading failure.
Defenses:
- Bounded queues – drop or block when full (see the sketch after this list).
- HTTP 503 + Retry-After – push back to clients, who retry with exponential backoff + jitter.
- Token bucket / leaky bucket rate limiting – at the producer side.
- Auto-scaling consumers – but watch for the downstream (DB) bottleneck: scaling consumers without scaling the DB just moves the bottleneck.
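A minimal bounded-queue producer sketch; the capacity and Retry-After values are illustrative:

```python
# Backpressure via a bounded queue: reject work instead of growing unbounded.
import queue

jobs: queue.Queue = queue.Queue(maxsize=1000)   # the bound is an explicit capacity decision

def enqueue(job) -> tuple[int, dict]:
    try:
        jobs.put_nowait(job)
        return 202, {}                           # accepted for async processing
    except queue.Full:
        # Push back: the client retries with exponential backoff + jitter.
        return 503, {"Retry-After": "2"}
```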
13.6 Kafka Mental Model
- Topic = ordered log split into partitions. Order preserved per partition only.
- Partition key decides which partition (similar to shard key). Choose for distribution + ordering needs.
- Consumers organized into consumer groups; one partition consumed by exactly one consumer in a group.
- Retention by time or size. Topic is the source of truth in event-sourced systems.
- Compaction keeps the latest value per key – useful for materializing a current-state table from a log.
13.7 Stream Processing Fundamentals
When data is unbounded (clicks, sensor readings, financial ticks), batch jobs aren't enough. Stream processing runs continuous queries on top of Kafka / Kinesis / Pulsar.
Three time concepts β pick the right one:
- Event time: when the event actually occurred (in the data).
- Ingestion time: when the broker received it.
- Processing time: when the operator handled it.
Always aggregate by event time when correctness matters – processing time is sensitive to backlog and replay.
Windows:
- Tumbling – fixed, non-overlapping (every 1 min, no overlap).
- Sliding – overlapping (every 1 min, 5-min look-back).
- Session – gaps define boundaries (per-user activity sessions).
Watermarks declare "I believe all events with timestamp ≤ T have arrived." They let windows close even when out-of-order events trickle in. Options for late events: drop them, route them to a side output, or trigger window updates.
State management: stateful operators (joins, aggregations) need durable state. Frameworks checkpoint state to durable storage (RocksDB local + S3 backup in Flink) for fault tolerance.
Exactly-once in practice: Kafka transactions + framework checkpoint barriers, paired with idempotent or transactional sinks (UPSERT into DB; transactional Kafka producer; or end-of-pipeline dedup).
Frameworks:
- Flink – true streaming, low latency, sophisticated state, native event time. Default modern choice.
- Spark Structured Streaming – micro-batch, integrates with the Spark batch ecosystem.
- Kafka Streams – a library, no separate cluster, stateful via local RocksDB.
- Apache Beam – unified batch+stream API; runs on Flink/Spark/Dataflow.
- Materialize / RisingWave – streaming SQL with materialized views.
14. API Design
14.1 The Big Four Styles
| | REST | GraphQL | gRPC | WebSocket |
|---|---|---|---|---|
| Transport | HTTP/1.1 + HTTP/2 | HTTP | HTTP/2 | TCP via HTTP upgrade |
| Encoding | JSON | JSON | Protobuf (binary) | Anything |
| Schema | OpenAPI (optional) | Strongly typed | Strongly typed (.proto) | App-defined |
| Direction | Request-response | Request-response | Uni / streaming both ways | Bi-directional |
| Use | Public APIs | BFF, mobile, complex queries | Service-to-service, low-latency | Real-time, chat, gaming |
14.2 REST Best Practices
- Resources, not actions: `POST /orders`, not `POST /createOrder`.
- Verbs: GET (safe + idempotent), PUT (idempotent replace), PATCH (partial), POST (create / non-idempotent), DELETE (idempotent).
- Status codes: 200 OK, 201 Created, 204 No Content, 301/302 redirects, 400 bad request, 401 unauthorized, 403 forbidden, 404 not found, 409 conflict, 429 rate limit, 500 server error, 502/503/504 upstream.
- Versioning: URL (`/v2/...`) is most pragmatic; header (`Accept: application/vnd.api+json;v=2`) is purer; never break v1.
- Pagination:
  - Offset/limit (`?page=3&size=50`) – easy, breaks under inserts, slow at deep offsets.
  - Cursor / keyset (`?after=abc123`) – consistent, scales, the right default for large datasets.
- Idempotency: require an `Idempotency-Key` header on POSTs that must not duplicate (payments, signup).
- Filter / sort / fields: `?status=active&sort=-createdAt&fields=id,name`.
- HATEOAS is academically nice, practically rare.
14.3 GraphQL β When and When Not
When: Many clients with different shape needs (mobile + web + partners), aggregation across many sources, rapidly evolving UI.
Not when: Simple CRUD, public APIs (cacheability is harder), file uploads, RPC-style.
Risks: N+1 query explosion (mitigate with DataLoader / batching), unbounded queries (depth + cost limits), caching loss (no HTTP cache for POSTed queries – use persisted queries).
14.4 gRPC
- Use: internal service-to-service in polyglot orgs.
- Wins: schema enforcement, code generation, HTTP/2 multiplexing, streaming, smaller payloads.
- Pitfalls: browser support requires gRPC-Web + proxy; harder to debug (binary); load balancing needs L7 awareness or a service mesh.
14.5 Real-Time Push: Long Polling vs SSE vs WebSocket
| | Long Polling | SSE | WebSocket |
|---|---|---|---|
| Direction | Client pulls | Server → client | Both |
| Connection | Repeated request | Persistent (HTTP/1.1) | Persistent upgrade |
| Browser support | Universal | Modern browsers | Universal |
| Best for | Legacy systems | Server notifications, news feeds | Chat, gaming, collaborative editing |
14.6 Webhooks
Server-to-server callback. Provider POSTs to your URL when an event happens. Always: verify signature, return 2xx fast and process async, dedupe by event ID, expect retries.
15. Architectural Patterns
15.1 Monolith vs Microservices vs Modular Monolith
Monolith – single deployable, single DB. Pro: simple, fast to develop. Con: deploys couple teams; scaling is all-or-nothing.
Modular monolith – one deployable, strict module boundaries with explicit interfaces. Often the right answer for teams of < 50 engineers.
Microservices – many deployables, each owned by one team, ideally each with its own DB. Pro: independent deploys, polyglot, fault isolation. Con: the distributed-systems tax (networking, observability, data consistency, deployment complexity, on-call). Conway's Law: the architecture mirrors the org chart – microservices succeed only when the org is structured for them.
Rule of thumb: start monolith. Split a service out only when (a) it has a clear domain boundary, (b) a team can own it, (c) the cost of co-deployment is provably hurting you.
15.2 N-Tier Architecture
Classic: Presentation → Business Logic → Data. Modern translation: SPA → API → Service → DB. Useful as a thinking frame, not a religion.
15.3 Event-Driven Architecture (EDA)
Services communicate via events on a bus rather than RPC. Decouples producers from consumers. Excellent for: workflows, integrations, audit, analytics. Pitfall: distributed debugging is hard – invest in correlation IDs and tracing from day one.
15.4 Event Sourcing
Persist state as an append-only sequence of events; current state is a fold of events. Excellent for: audit, time-travel debugging, deriving multiple read models from one source.
Pairs with CQRS: writes go to event store; reads go to one or more materialized projections optimized for query patterns.
Costs: event schema evolution, replay cost, harder ad-hoc querying. Reach for it when audit / temporal queries are core to the domain.
15.5 CQRS (Command Query Responsibility Segregation)
Two models: a command model that mutates state, a query model that reads denormalized projections. Lets reads and writes scale independently and have different schemas. Often paired with event sourcing but doesn't require it.
15.6 Saga Pattern
Already covered in §11.3. A workflow of local transactions with compensations. The de facto answer to "distributed transaction" in microservices.
15.7 Circuit Breaker
State machine: Closed (normal) → Open (fail fast after a threshold of errors) → Half-Open (probe) → Closed. Prevents cascading failure when a downstream is slow or dead. Tools: Hystrix (deprecated), resilience4j, Polly, Envoy.
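A compact circuit-breaker sketch; thresholds and cooldowns are illustrative, and real libraries add half-open probe limits and metrics:

```python
# Circuit breaker: trip Open after N consecutive failures, probe after a cooldown.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let this one probe through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()   # trip to Open
            raise
        self.failures = 0                       # success closes the circuit
        self.opened_at = None
        return result
```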
15.8 Bulkhead
Isolate resource pools so a flood in one cannot starve another. E.g., a separate thread pool per downstream, a separate DB connection pool per workload. Inspired by ship hulls – one breach doesn't sink the ship.
15.9 Sidecar (and Service Mesh)
A helper container deployed alongside each service to handle cross-cutting concerns: TLS, retries, observability, rate limiting. Implementations: Envoy as sidecar with Istio / Linkerd as control plane. Lifts these concerns out of every language's library mess into a single, language-agnostic layer.
15.10 Strangler Fig
Migration pattern: route some traffic to the new system, leave the rest on the legacy, gradually shift, retire legacy when traffic = 0. The safe alternative to big-bang rewrites.
15.11 BFF (Backend for Frontend)
A thin API per client type (web BFF, iOS BFF, partner BFF). Aggregates internal services and shapes responses for one client. Avoids the "lowest common denominator" general API.
15.12 Serverless / FaaS
Functions on demand (Lambda, Cloud Functions). Pro: zero idle cost, autoscale, no server ops. Con: cold start, runtime limits, harder local dev, vendor lock-in, observability. Use for: event handlers, glue, low-volume APIs, scheduled jobs.
16. Distributed Systems Primitives
16.1 Consensus & Coordination
Already covered in §11.4 (Paxos, Raft). Practical use: etcd / Zookeeper / Consul for leader election, distributed locks, configuration, service discovery.
16.2 Leader Election
Many algorithms (Bully, Raft-style). Practical: use a coordination service. Critical: design for split-brain – two nodes thinking they're leader. Defenses: quorum-based election, fencing tokens, lease + heartbeat.
16.3 Gossip Protocol
Each node periodically exchanges state with random peers. Probabilistic eventual convergence. Used by: Cassandra (membership), Dynamo, Consul (LAN), serf. Scales to thousands of nodes without central authority.
16.4 Bloom Filter
Probabilistic set membership: "definitely not in the set" or "maybe in the set." Tiny memory, no false negatives, tunable false positive rate.
Use: "is this URL crawled?", "has this user seen this article?", filtering DB reads – query the bloom filter first, hit the DB only on a positive.
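A toy Bloom filter to make the mechanics concrete; production deployments size the bit array and hash count from the target false-positive rate:

```python
# Toy Bloom filter: k hash positions per item over an m-bit array.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1 << 20, k: int = 5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False -> definitely absent; True -> maybe present.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

crawled = BloomFilter()
crawled.add("https://example.com/a")
print(crawled.might_contain("https://example.com/a"))  # True
print(crawled.might_contain("https://example.com/b"))  # False (almost surely)
```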
16.5 Count-Min Sketch / HyperLogLog
- Count-Min Sketch: approximate frequency of items in a stream. Top-K trending.
- HyperLogLog: approximate cardinality (distinct count) in tiny memory. Redis `PFCOUNT`.
16.6 Merkle Tree
A tree of hashes where each non-leaf is a hash of its children. Quickly identifies which subtree differs between two replicas. Used by: Cassandra anti-entropy, DynamoDB, Git, blockchains, ZFS.
16.7 Vector Clocks & CRDTs
- Vector clock: logical timestamp tracking causality across nodes. Detects concurrent writes (which can then be resolved or surfaced to app).
- CRDT (Conflict-free Replicated Data Type): data structures that automatically merge concurrent updates without coordination. G-Counter, OR-Set, LWW-Register, etc. Powers offline-first apps (Riak, Redis Enterprise, collaborative editors).
16.8 Geohash & Quadtree
- Geohash: encode (lat, lng) as a string; a common prefix implies spatial proximity. Easy to index in a regular B-tree. Use for "within X km of me".
- Quadtree: recursive 2D partitioning. Good when density varies wildly across regions. Use for game worlds, map tile rendering, Uber's H3 (a hexagonal variant).
16.9 Distributed Lock
Lock service across nodes. Implementations: Redis Redlock (controversial), Zookeeper, etcd. Fundamental gotcha: the client crashes holding the lock → the lock must expire. Solution: fencing tokens – every operation includes a monotonically increasing token; storage rejects stale tokens.
17. Reliability & Resilience Patterns
17.1 Failure Modes Inventory
For every component ask:
- What if it's slow (high latency)?
- What if it's down (no response)?
- What if it lies (corrupted / wrong response)?
- What if it's partitioned (some clients reach it, some don't)?
- What if it fills up (storage / queue / connection pool)?
17.2 Timeouts
Default. Every network call needs a timeout. Without one, your service inherits the slowness of every downstream and your thread pool dies. Set timeouts shorter than your own SLA (otherwise you're doomed before retry).
17.3 Retries
- Exponential backoff with jitter – never retry immediately, never retry in lockstep (see the sketch after this list).
- Limit attempts – usually 3.
- Idempotency required – never retry a non-idempotent operation without an idempotency key.
- Retry only on retriable errors – 5xx, 429, network timeouts. Never retry 4xx (you'll get the same answer).
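A sketch of that retry policy with capped exponential backoff and full jitter; the retriable exception set and limits are assumptions:

```python
# Retry with capped exponential backoff + full jitter.
import random
import time

RETRIABLE = (TimeoutError, ConnectionError)   # map your client's errors here

def call_with_retries(fn, attempts: int = 3, base_s: float = 0.1, cap_s: float = 2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except RETRIABLE:
            if attempt == attempts - 1:
                raise                          # budget exhausted: surface the error
            sleep = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            time.sleep(sleep)                  # jitter prevents retry lockstep
```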
17.4 Circuit Breaker
Already covered in §15.7. Combine with retries: an open circuit prevents wasteful retries during an outage.
17.5 Bulkhead
§15.8. Per-dependency thread pools / connection limits.
17.6 Rate Limiting
Algorithms:
| Algorithm | How | Pro | Con |
|---|---|---|---|
| Fixed window | N tokens per minute, reset at boundary | Simple | Burst at boundary |
| Sliding window log | Store timestamps, count last N s | Accurate | Memory |
| Sliding window counter | Weighted blend of two fixed windows | Cheap + accurate | |
| Token bucket | Bucket fills at rate r, request takes 1 | Allows bursts | Tuning |
| Leaky bucket | Queue with constant outflow | Smooths spikes | Latency |
Apply at: edge (API gateway, per IP / API key), per service (per dependency), per user, per tenant. Use distributed counter (Redis) for cluster-wide limits.
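A single-process token bucket sketch; for cluster-wide limits the same state typically lives in Redis behind an atomic script:

```python
# Token bucket: refill at rate r, allow bursts up to capacity.
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, clamped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False      # caller responds 429 + Retry-After

limiter = TokenBucket(rate_per_s=100, capacity=200)  # 100 rps, bursts of 200
print(limiter.allow())  # True
```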
17.7 Backpressure
§13.5. Push back on the producer when consumers can't keep up. The alternative is silent queue blow-up.
17.8 Graceful Degradation
When a non-critical dependency fails, return a degraded response (cached value, default, partial). Examples:
- Recommendation service down → show last-known popular items.
- Personalization service down → show the generic homepage.
- Comment count service down → show "comments" without the count.
17.9 Disaster Recovery
| Term | Meaning | Question to ask |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "How long can we be down?" |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "How much data can we lose?" |
DR strategies, in order of cost and speed:
- Backup & restore – slow restore, low cost. RTO hours, RPO hours.
- Pilot light – minimum infra running, scale up on disaster. RTO minutes, RPO seconds.
- Warm standby – scaled-down full copy, scale up. RTO seconds.
- Active-active multi-region – full capacity in each region. RTO ~0, RPO ~0. Most expensive, hardest to test.
Test your DR. Untested DR is theatre.
17.10 Chaos Engineering
Deliberately inject failure in production to validate resilience. Pioneered by Netflix Chaos Monkey. Modern: Gremlin, AWS Fault Injection Simulator, ChaosMesh on Kubernetes.
17.11 Tail Latency: "The Tail at Scale"
Average latency lies. p99 dictates user experience – and tail effects compound when one request fans out to many services.
The math that should scare you: if a service has p99 = 1 s and a request fans out to 10 such services awaiting all responses, the chance all 10 finish within 1 s is 0.99^10 ≈ 90%. So a component's p99 becomes roughly the gather call's p90. With 100 fan-outs, only 37% of requests stay within the per-service p99 window. Tail latency is not negligible – it is the design problem.
Sources of tail latency:
- GC pauses, JIT compilation warm-up.
- Lock contention, queueing under load.
- Slow node (degraded disk, network microburst, neighboring container).
- Background tasks (compaction, vacuum) competing for resources.
- TCP retransmits, head-of-line blocking on HTTP/2 streams.
Mitigations (Dean & Barroso, The Tail at Scale, 2013):
- Hedged requests: after a p95-latency delay, send the same request to a second replica; take the first response (see the sketch at the end of this section).
- Tied requests: send to two replicas simultaneously; each carries the other's identity; whichever starts first cancels its sibling.
- Micro-batching at the connection level instead of single-request RPCs.
- Per-class queueing: prioritize short interactive requests over background scans.
- Slow-node detection + drain: continuously remove the slowest replica from rotation.
- Request-level parallelism with first-N-of-M responses when business semantics allow (recommendations, search re-rank).
- Reduce fan-out depth: every extra hop multiplies tail probability.
Operational rule: alarm on p99 (or p99.9), never the mean. The mean hides everything that hurts users.
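A minimal hedged-request sketch using a thread pool; `fetch` is any blocking call to one replica, and the 50 ms hedge delay stands in for your measured p95:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_call(fetch, replicas, hedge_after: float = 0.05):
    """Call replica 0; if no answer within `hedge_after` s (your measured p95),
    fire the same request at replica 1 and take whichever finishes first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(fetch, replicas[0])]
    done, _ = wait(futures, timeout=hedge_after, return_when=FIRST_COMPLETED)
    if not done:  # primary is in its tail: hedge
        futures.append(pool.submit(fetch, replicas[1]))
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
    # the losing request still runs to completion here; tied requests add
    # cross-cancellation. shutdown(wait=False) avoids blocking on the loser.
    pool.shutdown(wait=False)
    return done.pop().result()
```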
18. π Observability, SLA/SLO/SLI
18.1 The Three Pillars
Metrics β numerical time-series. Dashboards, alerts. Examples: QPS, error rate, p99 latency, queue depth, CPU. Cheap. Tools: Prometheus, Datadog, Atlas (Netflix), M3 (Uber).
Logs β discrete events with context. Debugging, audit. Examples: request logs, app logs, security audit. Expensive at scale. Tools: ELK, Splunk, Loki, CloudWatch.
Traces β causal chain of one request across services. Pinpoint slow span. Tools: Jaeger, Zipkin, Tempo, AWS X-Ray. Modern standard: OpenTelemetry.
18.2 RED (services) and USE (resources)
- RED: Rate, Errors, Duration β the three metrics every service owes you.
- USE: Utilization, Saturation, Errors β the three metrics every resource (CPU, disk, queue) owes you.
18.3 SLI / SLO / SLA
- SLI (Service Level Indicator) β what you measure (availability %, p99 latency).
- SLO (Service Level Objective) β internal target (99.9% availability monthly).
- SLA (Service Level Agreement) β external contract with consequences (refund if < 99.5%).
Error budget: 1 β SLO. If SLO is 99.9%, you have 43 minutes of monthly downtime budget. Spend it on shipping risky features. When you blow it, stop shipping and fix reliability. This is the SRE-vs-product peace treaty.
18.4 Alerting Rules
- Alert on symptoms (user pain), not causes. A pegged CPU is fine if latency is OK. Alert on "p99 > 500 ms" not "CPU > 80%".
- Page only when human action is required, now. Everything else β ticket / dashboard.
- Every alert must link to a runbook.
19. π Security
19.1 Authentication vs Authorization
- AuthN: "who are you?" β passwords, MFA, SSO.
- AuthZ: "what can you do?" β RBAC, ABAC, ACL.
19.2 OAuth 2.0 vs OIDC
- OAuth 2.0: delegated authorization. "User lets app A access their resources at provider B" via access tokens. Flows: authorization code (with PKCE for SPAs/mobile), client credentials (machine-to-machine).
- OpenID Connect: identity layer on top of OAuth 2.0. Adds an ID token (JWT) describing the user. This is what powers "Sign in with Google".
- Rule of thumb: if you want login β OIDC. If you want "let app act on behalf of user" β OAuth.
19.3 JWT (JSON Web Token)
header.payload.signature, base64url-encoded. Pros: stateless, self-contained. Cons: revocation is hard (use short expiry + refresh tokens), payload is not encrypted (only signed), size grows with claims.
Practical rules: sign with asymmetric (RS256/EdDSA) so resource servers verify without private key; keep TTL short (β€15 min); use refresh tokens for sessions; never put secrets in payload.
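A sketch of those rules with the PyJWT library; `private_pem` / `public_pem` and the key file paths are assumptions standing in for key material loaded from your KMS or vault:

```python
import time

import jwt  # PyJWT

private_pem = open("signing_key.pem").read()      # assumed: from KMS / vault
public_pem = open("signing_key.pub.pem").read()   # distributed to verifiers

now = int(time.time())
token = jwt.encode(
    {"sub": "user-123", "iat": now, "exp": now + 900},  # 15-minute TTL
    private_pem,
    algorithm="RS256",
)

# Resource servers verify with the public key only (no shared secret to leak).
# decode() raises on bad signature or expiry.
claims = jwt.decode(token, public_pem, algorithms=["RS256"])
```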
19.4 SSO and SAML
- SSO β log in once, access many systems. Implemented via OIDC (modern) or SAML (enterprise legacy).
- SAML β XML-based assertions, common in enterprise IdPs (Okta, AD FS). Bigger and older than OIDC; choose OIDC for new builds unless mandated.
19.5 TLS, mTLS, HTTPS
- TLS β encryption + integrity + server authentication. Replaces SSL (deprecated).
- mTLS β mutual TLS: both sides present certificates. Standard for service-to-service inside a mesh / zero-trust network.
- HTTPS = HTTP + TLS. Cert managed by the LB / CDN / reverse proxy in production.
19.6 Encryption
- In transit: TLS everywhere. No internal cleartext.
- At rest: disk-level (LUKS, KMS-managed S3, EBS); column-level for PII.
- Symmetric (AES-256-GCM) is fast β bulk data. Asymmetric (RSA, Ed25519) for key exchange + signatures.
- Key management: never roll your own. Use AWS KMS, GCP KMS, HashiCorp Vault.
19.7 Password Storage
- Never store plaintext.
- Hash with slow, salted function: bcrypt, scrypt, Argon2id. Never MD5/SHA-256 directly (too fast).
- Per-user salt is mandatory.
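A sketch with the `bcrypt` library β the per-user salt is generated and embedded in the hash, and verification is constant-time:

```python
import bcrypt

def hash_password(password: str) -> bytes:
    # gensalt() embeds a fresh per-user salt; raise the cost factor (default 12)
    # until hashing takes ~100 ms on your hardware
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def check_password(password: str, stored_hash: bytes) -> bool:
    return bcrypt.checkpw(password.encode("utf-8"), stored_hash)
```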
19.8 OWASP Top 10 β Drill List
Injection, broken auth, sensitive data exposure, XXE, broken access control, security misconfig, XSS, insecure deserialization, vulnerable components, insufficient logging. Internalize this list and the controls for each.
19.9 Defense in Depth
WAF at edge β rate limiting at gateway β input validation at service β least-privilege IAM at infra β encryption at rest β audit logs. Assume any single layer will fail.
20. π Capacity Planning & Scaling Playbook
20.1 Scaling Axes
- Vertical (scale up): bigger box. Simple, eventually impossible.
- Horizontal (scale out): more boxes. Required for true scale; demands statelessness or sharding.
- Functional (scale by service): split by domain (federation / microservices).
- Data (scale by partition): shard.
20.2 The Scale Sequence (apply in order)
- Profile. Where is the actual bottleneck? CPU, memory, disk, network, lock contention?
- Cache. First and cheapest. Identify hot reads, add Redis/Memcached, target 90%+ hit rate.
- Optimize. Indexes, query plans, N+1 elimination, payload size.
- Add read replicas. Read-heavy workloads scale here for free.
- Vertical scale. Often cheaper than re-architecting at small scale.
- Async-ify writes. Move expensive work off the request path: queue + worker.
- Functional split. Federate by domain.
- Shard. Last resort because operationally expensive. Pick shard key carefully (Β§10.2).
20.3 Capacity Estimation Worksheet
For any service, compute on paper:
```
DAU = ?
peak QPS = DAU Γ actions/user/day / 86400 Γ peak_factor (5β10Γ)
storage growth per year = QPS Γ bytes/record Γ 86400 Γ 365 Γ replication
network bandwidth = QPS Γ payload Γ replication
```
Compare to a rough capacity per box (e.g., a modern app server: 10K QPS, 16 GB RAM; a single Postgres node: 50K read QPS, 5K write QPS with proper indexes; Redis: 100K ops/sec; Kafka broker: 100 MB/s).
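A worked pass through the worksheet with illustrative numbers (10M DAU, 20 actions/day, 500-byte records, 3Γ replication β assumptions, not a benchmark):

```python
dau, actions_per_day = 10_000_000, 20
record_bytes, replication, peak_factor = 500, 3, 7

avg_qps = dau * actions_per_day / 86_400            # ~2,315
peak_qps = avg_qps * peak_factor                    # ~16,200
tb_per_year = avg_qps * record_bytes * 86_400 * 365 * replication / 1e12  # ~110 TB

app_boxes = peak_qps / 10_000 * 2                   # 10K QPS/box, 2x headroom: ~3
print(f"{peak_qps:,.0f} peak QPS, {tb_per_year:.0f} TB/yr, ~{app_boxes:.0f} app servers")
```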
20.4 Hot Spots
Skewed access destroys partitioned systems. Identify with histograms; fix with:
- Key salting: `userId:randomBucket` for write fan-out (sketch after this list).
- In-process caching at app layer for celebrity reads.
- Replication of hot keys across multiple shards.
- Application-level sharding of one logical key into N physical keys.
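A sketch of the salting bullet above: one hot logical key is spread over N physical keys on write and re-aggregated on read. `SALT_BUCKETS` is a tuning knob, not a magic number:

```python
import random

SALT_BUCKETS = 16  # spread one hot logical key across 16 physical keys/shards

def write_key(user_id: str) -> str:
    # each write lands on a random bucket, so no single shard takes the full load
    return f"{user_id}:{random.randrange(SALT_BUCKETS)}"

def read_keys(user_id: str) -> list[str]:
    # reads must fan out to every bucket and aggregate (sum, merge, etc.)
    return [f"{user_id}:{bucket}" for bucket in range(SALT_BUCKETS)]
```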
20.5 Autoscaling
- Reactive: CPU / memory / queue depth thresholds. Cheap, but lags behind demand spikes.
- Predictive: ML-based forecast (Netflix Scryer). Hard, but flattens cold starts.
- Schedule-based: known peak hours.
- Don't autoscale stateful tiers (DB, cache) the same way as stateless. Stateful scaling = sharding + rebalance, not "add a node".
20.6 Multi-Region Patterns
Going multi-region buys disaster tolerance and lower user-perceived latency, at a steep operational cost.
| Pattern | Behavior | RTO | Use when |
|---|---|---|---|
| Single-region + DR backup | Backups in another region; restore on disaster | hours | Small product, regulatory minimum |
| Active-passive | Standby region with live replica; manual or automated failover | minutes | Tier-1 service, occasional disasters acceptable |
| Active-active read | All regions serve reads; one region writes | minutes for write, ~0 for read | Read-heavy global apps |
| Active-active write | All regions serve writes | seconds | Truly global scale |
Write strategies for active-active:
- Home region per user/tenant. Each user pinned to one region; cross-region requests proxy back. Used by Slack, Zoom, GitHub. Simplest correct option for user-scoped data.
- Single global write region. Writes funnel to one region, replicated out. Strong consistency, latency for far users (Spanner with leader near majority).
- Multi-master with conflict resolution. Cassandra / DynamoDB Global Tables. LWW or app-level merge. Strong availability, weak consistency.
Routing: Geo-DNS (Route 53 latency or geo policies), Anycast IPs, or client-side region selection based on a config endpoint.
Compliance: GDPR, India DPDP, China, Russia mandate data residency. Region pinning is a product feature, not just an architecture choice. Build it in early β retrofitting tenant-scoped data residency is a migration nightmare.
Failure modes specific to multi-region:
- Cross-region replication lag spikes during regional incidents.
- Partial-region outages (some AZs up, some down) confuse health checks.
- DNS propagation slow β stragglers pin to dead region for minutes.
- Asymmetric routing (writes go region A, reads go B) β read-your-writes anomalies.
20.7 Multi-Tenancy (SaaS)
| Model | Sharing | Pros | Cons |
|---|---|---|---|
| Pool | Shared infra, `tenant_id` column | Cheap, easy ops | Noisy neighbor, blast radius, per-tenant scale ceiling |
| Silo | Dedicated stack per tenant | Isolated, per-tenant tunable, compliance-friendly | Expensive, ops complexity multiplies |
| Bridge / Hybrid | Most pooled, big customers siloed | Right-sized | Two systems to maintain |
Required across all tenancy models:
- Tenant ID in every query, cache key, log line, metric label. No exceptions β leakage is a P0 incident.
- Per-tenant rate limits and quotas. Prevents one tenant's bad actor from consuming all capacity.
- Per-tenant encryption keys (BYOK) for regulated tenants.
- Per-tenant observability: metrics aggregated by tenant for support, debugging, cost attribution.
- Schema strategies: shared schema with `tenant_id` (most common), schema-per-tenant (Postgres schemas), DB-per-tenant (silo).
The biggest pool-vs-silo question: can a tenant's load realistically threaten others? If yes β silo or bulkhead the largest tenants.
20.8 Capacity Reference Card
Numbers to anchor estimates. Always benchmark, but expect this order of magnitude on commodity cloud hardware.
| Component | Capacity per instance |
|---|---|
| Modern app server (4β8 vCPU) | 5Kβ20K QPS for stateless HTTP |
| Postgres / MySQL primary | 10Kβ50K read QPS, 1Kβ5K write QPS with proper indexes |
| Postgres read replica | Same as primary for reads |
| Redis (single node) | 100K ops/sec, sub-ms latency |
| Memcached (single node) | 200K+ ops/sec |
| Kafka broker | 100 MB/s sustained, 10K+ msg/s per partition |
| Cassandra node | ~10K writes/sec, ~5K reads/sec |
| Elasticsearch node | 1K+ index ops/sec (depends on doc size) |
| Nginx / Envoy | 50K+ RPS per core for proxying |
| CDN edge (cache hit) | ~1 ms in-region |
| Cross-AZ network RTT | < 1 ms |
| Cross-region intra-continent | 10β60 ms |
| Cross-region intercontinental | 100β200 ms |
| 1 Gbps NIC | 125 MB/s, ~83K pps at MTU 1500 |
| 10 Gbps NIC | 1.25 GB/s |
| NVMe SSD | 500K+ IOPS, several GB/s sequential |
| Spinning disk | ~100 IOPS, ~100 MB/s sequential |
Use: when sizing, divide your peak QPS by per-instance numbers to get a rough box count. Add 2Γ headroom for spikes, 1.3Γ for redundancy across AZs.
21. π Data Engineering & Analytics
The product database (OLTP) is bad at analytics, and the analytics warehouse (OLAP) is bad at transactions. Modern systems run both, connected by a pipeline. Knowing the boundary is essential to scaling either side.
21.1 OLTP vs OLAP
| OLTP | OLAP | |
|---|---|---|
| Workload | Many small transactions | Few large scans |
| Latency | ms | secondsβminutes |
| Storage | Row-oriented | Column-oriented |
| Consistency | ACID | Eventually consistent (often replicated from OLTP) |
| Examples | Postgres, MySQL, MongoDB, DynamoDB | Snowflake, BigQuery, Redshift, ClickHouse, Druid |
Why columnar wins for analytics: queries touch few columns of many rows; columnar storage skips the rest; same-type values compress 10β20Γ; SIMD aggregates blocks of values at once.
21.2 Data Warehouse vs Data Lake vs Lakehouse
- Data warehouse: structured, schema-on-write, governed, expensive per TB. Fast SQL on cleaned data. Snowflake, BigQuery, Redshift, Synapse.
- Data lake: raw files (Parquet, ORC, Avro, JSON) on object storage (S3/GCS/ADLS); schema-on-read; cheap. Tends to become a swamp without governance.
- Lakehouse: open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on object storage that add ACID transactions, schema evolution, and time travel. Best of both worlds; powering modern Databricks, Snowflake-on-Iceberg, AWS Athena workloads.
21.3 ETL vs ELT
- ETL (legacy): transform before loading. Heavy upfront modeling, brittle to schema change.
- ELT (modern): load raw, transform inside the warehouse using SQL (dbt). Cheaper compute, faster iteration, easier reprocessing β just rerun the SQL.
21.4 CDC (Change Data Capture)
Stream the binlog/WAL of your OLTP DB into Kafka, then onward. Tools: Debezium (most popular, open source), AWS DMS, Fivetran, Airbyte.
Common destinations:
- DB β Kafka β warehouse (analytics replication, near-real-time).
- DB β Kafka β search index (Elasticsearch) β keeps search fresh without dual-writes.
- DB β Kafka β cache invalidation.
- DB β Kafka β derived stores in other microservices (lets services own their read models without distributed transactions).
Pair CDC with the outbox pattern (Β§13.4) so the stream carries first-class application events, not raw row changes.
21.5 Lambda vs Kappa Architecture
- Lambda: two pipelines β batch (slow, accurate, source of truth) + speed (fast, approximate). Reconcile in the serving layer. Operational pain: maintain two codebases for the same logic.
- Kappa: stream-only. Replay history through the same stream pipeline by re-reading Kafka from offset 0. Simpler, requires capable stream framework (Flink) + adequate retention.
Most modern data platforms are Kappa-leaning, with batch as a special case (bounded stream).
21.6 Reference Pipeline
```
Source DB --(Debezium CDC)--> Kafka --> Flink (cleanse, enrich, window)
                                           |
               +---------------------------+---------------------------+
               |                           |                           |
         Iceberg/Delta               Elasticsearch              Online feature
          (lakehouse)                  (search)                 store (Redis)
               |
         dbt models --> BI dashboards
```
This shape β CDC β Kafka β stream proc β fan-out to lakehouse + search + online stores β is the modern default for any non-trivial data platform.
22. π Deployment, Release & Schema Evolution
Designing the system is half the job. Releasing it safely without downtime is the other half.
22.1 Deployment Strategies
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Recreate | Stop old, start new | Simple | Downtime |
| Rolling | Replace instances incrementally | No downtime, gradual | Mixed versions live simultaneously |
| Blue-Green | Stand up parallel env, flip LB | Instant rollback, no version mixing | 2Γ infra during cutover |
| Canary | Send 1% β 5% β 25% β 100% to new | Catch issues with limited blast | Requires good metrics + auto-rollback |
| Shadow / Mirror | Copy traffic to new, discard responses | Test in prod with no user risk | Doesn't validate write path |
22.2 Feature Flags
Decouple deploy from release. Code ships dark; flags toggle behavior at runtime per user, tenant, percentage. Use for: progressive rollout, A/B testing, kill switches, dark launches, ops mode (read-only emergency).
Hygiene: every flag is technical debt. Set TTLs, owners, cleanup tasks. Tools: LaunchDarkly, Unleash, Flagsmith, in-house tables.
22.3 Schema Evolution: Expand-Contract (Parallel Change)
Never break running code. Apply changes in non-breaking phases:
- Expand β add the new column / table / field / version alongside the old. Both readable.
- Migrate writers β code writes to both old and new (dual-write). Backfill historical data into new.
- Migrate readers β code reads from new with fallback to old.
- Cutover β readers ignore old; writers stop writing old.
- Contract β drop old after a monitoring window.
Examples:
- Rename column: add new, dual-write, switch readers, drop old.
- Split table: create new tables, dual-write, migrate readers, retire old.
- Change type: add a `_new` column, backfill with cast, switch, drop.
This is the only safe pattern for online systems. "Big bang" migrations always break in production.
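An application-side sketch of phases 2β3 for a hypothetical `full_name` β `display_name` rename (column names invented for illustration; `db` is any DB-API connection):

```python
def save_user(db, user_id: int, name: str) -> None:
    # Phase 2 (expand + dual-write): keep the old column correct while new fills
    db.execute(
        "UPDATE users SET full_name = %s, display_name = %s WHERE id = %s",
        (name, name, user_id),
    )

def read_name(row: dict) -> str:
    # Phase 3 (migrate readers): prefer the new column, fall back during backfill
    return row.get("display_name") or row["full_name"]
```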
22.4 Online Schema Migration
Long `ALTER TABLE` on big tables takes locks and blocks traffic. Tools that copy and swap atomically:
- gh-ost (GitHub) β uses binlog for incremental sync, no triggers.
- pt-online-schema-change (Percona) β trigger-based.
- Postgres: `CREATE INDEX CONCURRENTLY`, partition swap, logical replication for major changes.
22.5 Schema Versioning for Messages and APIs
- Avro / Protobuf with a schema registry. Enforce backward + forward compatibility.
- Compatibility rules: never reuse field numbers, never change types, only add optional fields, never remove a required field.
- Consumers should tolerate unknown fields (forward compat) and missing fields (backward compat).
- For REST APIs: additive change preferred; breaking change β new version path (`/v2`).
22.6 Database Migration Tooling
- Flyway, Liquibase (JVM); goose (Go); Alembic (Python); Prisma migrate (Node); Rails migrations.
- Forward-only philosophy: never edit applied migrations; create a new migration to fix a previous one.
- Test migrations on a recent prod-shaped snapshot β schema migrations on a tiny dev DB hide row-count and lock issues.
22.7 Progressive Delivery
Auto-rollback on SLO violation during canary. Tools: Argo Rollouts, Flagger, Spinnaker pipelines. Metrics-driven decisions remove the human from the rollback loop.
22.8 Twelve-Factor Highlights
The factors that matter most for system design:
- Config in env β never in code.
- Backing services as resources β DB, cache, queue addressable by URL; swappable.
- Stateless processes β state in backing services, not in app memory.
- Disposable processes β fast startup, graceful shutdown (SIGTERM β drain connections β exit within timeout).
- Dev/prod parity β minimize the gap to make releases predictable.
- Logs as event streams β write to stdout, let infra route + aggregate.
23. π Tradeoffs Cheat Sheet
| Choice | Win | Cost |
|---|---|---|
| Vertical scale | Simple, no app changes | Ceiling, single point of failure, downtime |
| Horizontal scale | Linear capacity, redundancy | Statelessness or sharding required |
| Cache | Latency, offload backend | Invalidation complexity, staleness |
| Read replica | Cheap read scale | Replica lag, read-after-write anomalies |
| Sharding | Parallel writes, smaller indexes | Hot keys, cross-shard joins, resharding pain |
| Denormalization | Read speed | Write complexity, redundancy |
| Strong consistency | Correctness, simpler app | Latency, lower availability |
| Eventual consistency | Latency, availability | App must tolerate staleness |
| Async (queue) | Decoupling, spike absorption | Latency, debug complexity, dup risk |
| Sync RPC | Simple, immediate response | Tight coupling, cascading failures |
| Microservices | Team autonomy, indep deploy | Distributed-systems tax |
| Monolith | Simplicity, perf, easy txns | Coupled deploys, scaling all-or-nothing |
| Push CDN | Bandwidth efficiency | Storage, manual upload |
| Pull CDN | Set and forget | First-request slow, possible stale |
| Master-slave | Simple, read scale | Failover complexity, lag |
| Master-master | Write scale, fast failover | Conflict resolution |
| 2PC | ACID across nodes | Blocking, slow, fragile |
| Saga | Liveness across services | Compensations, complexity |
| REST | Universal, cacheable | Over/under-fetching |
| GraphQL | Flexible queries | N+1, caching loss |
| gRPC | Perf, schema | Browser support, debug |
| WebSocket | Real-time, bidirectional | Stateful conns, scaling |
| SSE | Simple server push | One direction, HTTP/1.1 conn limits |
| JWT | Stateless | Hard to revoke |
| Server sessions | Easy revoke, smaller token | Stateful storage |
| Bloom filter | Memory tiny, fast | Probabilistic (false positives) |
| Consistent hashing | Smooth rebalance | Implementation complexity |
24. π‘ Interview Problem Templates
Each template lists the 4β6 things you must mention.
24.1 URL Shortener (TinyURL / bit.ly)
- Encoding: base62 of an auto-incremented ID (sketch after this list), or hash + collision retry. ID generation: range allocator, snowflake, or DB sequence. 7 chars of base62 β 3.5T URLs.
- Storage: KV (id β long URL). Reads vastly outnumber writes (say 100:1).
- Cache: LRU on hot short URLs. CDN for redirect responses (edge cache the 301).
- Analytics: async event stream β batch aggregation. Don't write a row per click on the hot path.
- Custom aliases: uniqueness check; reserve namespace.
- Expiration: TTL field; lazy delete.
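A sketch of the counter β base62 encoding from the first bullet; 62^7 β 3.5T keys fit in 7 characters:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most-significant digit first

assert base62_encode(125) == "21"  # 125 = 2*62 + 1
```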
24.2 Pastebin / Document Service
- Like URL shortener for IDs, plus blob storage (S3) for content.
- Markdown rendering on read (cache the HTML), or on write.
- Expiration, access control (link-only / private / public).
24.3 News Feed / Twitter Timeline
The classic fan-out decision:
- Fan-out on write (push): when a celebrity tweets, copy to each follower's inbox. Read = O(1). Write = O(followers). Bad for users with 100M followers.
- Fan-out on read (pull): read tweets of all followees, merge. Read = O(followees). Write = O(1). Bad for high-volume readers.
- Hybrid: push for normal users, pull for celebrities (Twitter's actual approach).
Required mentions: timeline cache (Redis sorted set per user), media in CDN, ranking signals, async fan-out via queue, search via Elasticsearch.
24.4 Chat / Messaging (WhatsApp, Slack)
- Connection layer: WebSocket gateways with sticky LB; presence in Redis.
- Delivery: per-user inbox queue; ack from client; offline messages persisted.
- Storage: Cassandra / wide-column, partition by `(user_id, conversation_id)`. Discord stores trillions of messages this way.
- Group chat: fan-out on write to participants' inboxes; or fan-out on read with a single conversation log.
- End-to-end encryption: Signal protocol β server cannot read messages.
- Push notifications when offline (APNs / FCM).
24.5 Video Streaming (Netflix, YouTube)
- Upload + transcode: S3 + queue + worker farm transcoding into multiple bitrates (HLS / DASH segments).
- Storage: segments in object store; metadata in SQL/NoSQL.
- Delivery: multi-tier CDN, push popular segments to edge (Open Connect).
- Adaptive bitrate (ABR): client picks bitrate based on bandwidth.
- Recommendation: offline batch + online learning.
24.6 Ride-Sharing (Uber, Lyft)
- Location ingest: drivers send GPS updates at ~4 Hz over WebSocket. 1M drivers Γ 4 = 4M events/s β Kafka.
- Geospatial index: geohash / H3 hexes; bucket of nearby drivers per cell, kept in Redis.
- Matching: rider request β find drivers in adjacent cells β rank by ETA β dispatch.
- State machine per trip; Saga for payment.
- Surge pricing based on supply/demand per cell, computed every minute.
24.7 Search Autocomplete
- Trie of prefixes β top-K completions (with frequencies).
- Trie too big for one node? Shard by first 2 chars.
- Update from query log via batch (daily) β autocomplete doesn't need fresh.
- Cache top results per prefix in CDN.
24.8 Web Crawler
- Frontier (URLs to crawl) in priority queue; politeness (per-host rate limit).
- Bloom filter to dedupe URLs.
- Distributed workers; DNS cache; robots.txt cache.
- Storage: object store for raw pages; index pipeline β Elasticsearch / inverted index.
- Detect spider traps (depth limit, content hash dedupe).
24.9 Distributed Rate Limiter
- Token bucket per user/IP; counters in Redis with `INCR` + `EXPIRE` (sketch after this list).
- For cluster-wide accuracy: leaky bucket via Redis sorted set, or sliding window.
- For huge scale: approximate with local counters synced periodically (cost: small over-allowance).
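A fixed-window sketch of the `INCR` + `EXPIRE` bullet with redis-py β fine at the edge, with the window-boundary burst caveat from Β§17.6. The endpoint and limits are illustrative:

```python
import time

import redis

r = redis.Redis()  # assumed shared endpoint reachable by every gateway node

def allow(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    window = int(time.time() // window_s)
    key = f"rl:{user_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)              # atomic cluster-wide count for this window
    pipe.expire(key, window_s)  # window keys clean themselves up
    count, _ = pipe.execute()
    return count <= limit       # over the limit -> caller responds 429
```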
24.10 Distributed Unique ID (Snowflake)
- 64-bit ID = `timestamp_ms (41) | machine_id (10) | sequence (12)`. ~4096 IDs/ms/machine (sketch after this list).
- Required: clock sync, worker ID assignment (via Zookeeper / config).
- Alternatives: UUIDv7 (timestamp-prefixed), KSUID, DB sequence + range allocation.
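A sketch of the 41|10|12 bit layout from the first bullet (clock-rollback handling omitted for brevity):

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # a fixed past epoch; Twitter's original value

class Snowflake:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # 10 bits
        self.machine_id = machine_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12 bits: 4096 IDs per ms
                if self.seq == 0:                  # exhausted this millisecond
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.seq
```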
24.11 Notification System
- Channels: push (APNs/FCM), SMS, email, in-app.
- Per-channel queue with retry + DLQ.
- Template service + user preferences (do-not-disturb, channel opt-out).
- Idempotency key on send to prevent duplicates.
24.12 Payment System
- Idempotency on every mutation (Idempotency-Key header + dedup table).
- Double-entry ledger β every transaction is two balanced entries.
- Saga for multi-step (charge β ship β fulfill); compensations for refund.
- Async reconciliation with payment processor.
- PCI scope minimization β tokenize card data; never store PAN.
- Hot account problem (accounts with millions of writes) β shard by sub-account.
24.13 File Storage (Dropbox / S3)
- Chunking (4β8 MB) with content-addressed hashes β enables dedup, partial sync, parallel upload.
- Metadata DB (chunk list per file).
- Object store for chunks (replicated 3x, or erasure-coded for cold storage β better space efficiency than 3x replication for rarely-read data).
- Sync protocol with delta sync, conflict resolution (LWW or branched).
24.14 Distributed Cache
- Β§10.4 + Β§12. Consistent hashing, replication for HA, eviction policy.
- Watch out: thundering herd, hot key, cache penetration, cache stampede.
24.15 Distributed Search Index
- Inverted index per shard; routing by document ID; query fan-out + merge.
- Ranking: TF-IDF / BM25 baseline, learned-to-rank on top.
- Tradeoff: more shards = faster query, more network overhead and harder relevance scoring.
24.16 Collaborative Editor (Google Docs)
- Operational Transformation (OT) or CRDT for concurrent edits without locks. Y.js, Automerge are mature CRDT libraries.
- WebSocket per session; one server is the merge authority for a given document.
- Document partitioning: one shard owns one document; co-editors all connect there.
- Snapshot + ops log: every op appended; periodic snapshots for fast loading.
- Presence cursors as a separate ephemeral channel (lower durability needs than text ops).
- For spreadsheets/drawings: domain-specific CRDTs (sequence, map, register).
24.17 Top-K Trending
- Count-Min Sketch for approximate frequency of millions of distinct keys in fixed memory.
- Heap of size K kept alongside; on each update, check if new freq > heap min.
- Time decay: shard counts by minute/hour; sum windowed for "trending in last N min."
- For accuracy at the top, combine sketch with full counters for the heap candidates.
- Stream-process via Flink with tumbling/sliding windows.
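A compact Count-Min Sketch to show the mechanics; the width and depth here are illustrative, and should be sized from your tolerable error bounds:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # one independent hash per row, derived by salting blake2b
        for row in range(self.depth):
            digest = hashlib.blake2b(key.encode(), salt=bytes([row]) * 16).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # collisions only ever inflate a cell, so the min is the tightest bound
        return min(self.table[row][col] for row, col in self._cells(key))
```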
24.18 Leaderboard
- Redis sorted set (`ZADD`, `ZINCRBY`, `ZREVRANGE`). Sub-ms top-N reads (sketch after this list).
- Sharding for huge games: hash range of users β many sorted sets, merge top-K from each.
- Tiered: top-100 cached aggressively; rank for arbitrary user computed on demand or approximated.
- For 100M+ players: per-region leaderboards + global aggregation in batch.
- Anti-cheat: rate-limit score updates, validate server-side.
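The core sorted-set calls with redis-py (key and member names invented for illustration):

```python
import redis

r = redis.Redis()

r.zincrby("lb:global", 50, "player:42")                  # +50 points
top10 = r.zrevrange("lb:global", 0, 9, withscores=True)  # top-10 with scores
my_rank = r.zrevrank("lb:global", "player:42")           # 0-based rank, O(log N)
```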
24.19 Distributed Scheduler / Cron
- Leader-elected coordinator (Zookeeper / etcd) β only one scheduler dispatches at a time.
- Time-bucketed queue: jobs land in a sorted set keyed by `next_run_at`.
- Worker pool pulls due jobs; at-least-once + idempotent jobs for safety.
- Catch-up policy on outage (run all missed? skip? run latest only?). State this explicitly.
- Production tools: Quartz, Airflow scheduler, Temporal/Cadence, AWS EventBridge.
24.20 Online Presence (Status / Last Seen)
- Heartbeat: client pings every 30 s; server sets Redis key with TTL = 60 s.
- Presence read = key exists.
- Fan-out on transition to friends via pub/sub when state changes (online β offline) β not on every heartbeat.
- Sharded by user ID; cross-shard friend lookups batched.
- Last-seen as `LASTSEEN:user` with debounced writes (1/min, not every heartbeat).
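The whole heartbeat/TTL trick fits in a few lines of redis-py; the key prefix and intervals match the bullets above:

```python
import redis

r = redis.Redis()

def heartbeat(user_id: str) -> None:
    # client pings every 30 s; TTL of 60 s tolerates one dropped heartbeat
    r.set(f"presence:{user_id}", 1, ex=60)

def is_online(user_id: str) -> bool:
    return bool(r.exists(f"presence:{user_id}"))  # key expiry == offline
```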
25. π Real-World Case Studies
Synthesized lessons from production write-ups (curated by awesome-scalability).
25.1 Netflix
- Microservices with strong service ownership; chaos engineering native (Chaos Monkey, Simian Army).
- EVCache (Memcached + custom) for distributed caching with cache warmer.
- Open Connect CDN β Netflix-owned appliances deployed inside ISPs β 95% of traffic served from edge.
- Atlas for metrics, Mantis for stream processing, Spinnaker for CD.
- Rule: observability is built before scale, never retrofitted.
25.2 Uber
- Polyglot microservices (originally Python, moved core to Go + Java).
- H3 geospatial index β hexagonal grid (uniform neighbor distance).
- Schemaless (in-house MySQL sharding layer).
- Migrated HDFS β S3 for analytics β data gravity dictates compute location.
- Ringpop for application-layer sharding.
25.3 Twitter / X
- Hybrid timeline: push for normal users, pull for celebrities β solves fan-out asymmetry.
- Manhattan distributed DB; Gizzard sharding framework.
- Kafka for event pipeline; trillions of events/day.
- Timeline construction in 1.5 s p99 via aggressive caching at every layer.
25.4 Discord
- Cassandra for messages β partition by `(channel_id, bucket_id)`, billions of messages/day.
- Recently migrated to ScyllaDB for better tail latency.
- Voice: separate WebRTC infrastructure, regional routing.
- Elixir for connection-heavy services (BEAM scheduling shines).
25.5 Airbnb
- Migrated from Rails monolith to service-oriented architecture.
- Elasticsearch powers search (geo + facet + ranking).
- Multi-currency, multi-payment-method ledger.
- Lessons: service migration is a multi-year project; Strangler Fig is the only safe approach.
25.6 Pinterest
- MySQL with sharding (vs going NoSQL) β vindication of relational + sharding for relational data.
- Functional partitioning by domain (pins, boards, users).
- Heavy use of Memcached + Redis.
25.7 Instagram
- Three rules: keep it simple, don't reinvent, use proven technologies.
- Postgres + sharding for social graph.
- Cassandra for activity feeds.
- Aggressive caching, one-engineer-per-million-users efficiency.
25.8 Stripe
- Idempotency-key first-class API design.
- Veneur (in-house observability pipeline) + machine learning fraud detection (Radar) on every transaction.
- Distributed rate limiting on token-bucket primitive.
25.9 LinkedIn
- Birthplace of Kafka, Samza, Pinot, Voldemort, Espresso.
- Kafka clusters spanning data centers β cross-DC pipelines β real-time + batch unified.
- Lesson: observability investment is a force multiplier. "Observability powers high availability for LinkedIn Feed."
25.10 Recurring Lessons (the 10 most important)
- Embrace operational complexity early. Observability + chaos before scale.
- Data gravity dominates. Compute moves to data, not the other way.
- Statelessness scales linearly. Push state down to a few specialized tiers.
- Database selection is multi-dimensional. Mix SQL + NoSQL + cache + search; one size never fits.
- Observability prevents outages. You can't fix what you can't see.
- Org structure mirrors architecture (Conway). Microservices fail without team realignment.
- Cost-perf tradeoffs are real and additive. Saving 10% in three places = 30%.
- Async/event-driven decouples failure. A queue between two services is a fault break.
- Replication lag is inevitable. Design for it (read-your-writes via session, version tokens).
- Test at scale via simulation. Chaos, load tests, dark traffic, shadow writes.
26. β οΈ Anti-Patterns to Avoid
- Premature microservices. Splitting before domains and teams are clear creates a distributed monolith β worst of both.
- Premature NoSQL. "We'll be web-scale" while you have 100K rows. Postgres scales further than you think.
- Distributed transactions across services. Reach for sagas, idempotency, and outbox instead.
- Sticky sessions as state strategy. Hides true stateful design until LB scaling reveals it.
- No idempotency on POST. Every retry creates a duplicate. Plan for it day 1.
- No timeouts. Cascading failure is one slow downstream away.
- Retries without backoff. Self-DDoS during recovery.
- Cache without TTL or invalidation strategy. Permanent staleness time bomb.
- Single load balancer. SPOF, often invisible until it isn't.
- Synchronous fan-out to many services. One slow node breaks p99 for everyone.
- Logging PII. Compliance disaster.
- No observability before scale. Retrofitting traces / metrics / structured logs costs 10Γ more than building them in.
- Over-engineered abstractions. "We might need to switch DB" β you won't, and the abstraction costs you forever.
- No DLQ. Failed messages quietly disappear.
- Untested DR. Backup that's never restored is not a backup.
27. π Must-Read Papers & Further Reading
27.1 Foundational Papers
- Lamport β *Time, Clocks, and the Ordering of Events* (1978). Logical time, causality.
- Brewer β *Towards Robust Distributed Systems* (2000). CAP.
- Gilbert & Lynch β CAP proof (2002).
- Lamport β *Paxos Made Simple* (2001).
- Ongaro & Ousterhout β *In Search of an Understandable Consensus Algorithm (Raft)* (2014).
- Dean & Ghemawat β *MapReduce* (2004).
- Ghemawat et al. β *Google File System* (2003).
- Chang et al. β *Bigtable* (2006).
- DeCandia et al. β *Dynamo* (2007).
- Corbett et al. β *Spanner* (2012).
- Kreps β *The Log: What every software engineer should know* (2013).
27.2 Books
- Designing Data-Intensive Applications β Martin Kleppmann (the single most valuable systems book).
- Site Reliability Engineering β Google.
- Database Internals β Alex Petrov.
- System Design Interview (Vol 1 + 2) β Alex Xu.
- Building Microservices β Sam Newman.
- Release It! β Michael Nygard (resilience patterns).
27.3 Engineering Blogs (read regularly)
Netflix Tech Blog Β· Uber Engineering Β· Airbnb Engineering Β· Discord Engineering Β· Stripe Β· Cloudflare Β· Slack Β· Shopify Β· Dropbox Β· LinkedIn Engineering Β· The Pragmatic Engineer Β· High Scalability.
27.4 Source Repositories Referenced
- system-design-primer β interview prep, deepest single resource.
- system-design-101 β visual concepts, cheat sheets.
- karanpratapsingh/system-design β book-style chapters.
- awesome-system-design-resources β curated reading list.
- awesome-scalability β production case studies, the gold mine for real-world architecture lessons.
Final principle: The best system design is the simplest one that meets the actual requirements β not the one that anticipates every imagined future. Build for the load you have plus 10Γ. When you reach 5Γ, design the next 10Γ. When you reach 9Γ, build it. Every "we might need it someday" abstraction is a tax you pay every day for a benefit you may never collect.
If you found this helpful, let me know with a π or a comment β and if you think this post could help someone, share it. Thank you! π