DEV Community: Rajkiran

System Design - 24. Geospatial Indexing: How Uber Finds the Nearest Driver Among Millions in Milliseconds

Rajkiran — Mon, 15 Jun 2026 15:23:29 +0000

Series: System Design Mastery — Day 8 of 15
Reading time: 11 min
Covers: Quadtrees, Geohash, Google S2, Uber H3, Real-Time Index Updates

The Query That SQL Was Never Built For

You open Uber. The app needs to answer: "Which drivers are within 2km of this exact latitude/longitude, right now, out of the millions of drivers active worldwide?"

A naive SQL approach:

SELECT driver_id, latitude, longitude
FROM drivers
WHERE 
  SQRT(POW(latitude - :user_lat, 2) + POW(longitude - :user_lng, 2)) < 0.02

This computes the distance for every single row in the table (every active driver on Earth) for every search. It also treats degrees as flat Euclidean units, which distorts distances away from the equator. At Uber's scale (millions of drivers, location updates every 3-5 seconds), a full scan per query is far too slow to run in real time.

The fundamental problem: standard database indexes (B+ trees, hash indexes) are designed for 1-dimensional data, sorted by a single value (a timestamp, an ID, a name). Latitude and longitude are 2-dimensional. A B+ tree index on latitude alone, or longitude alone, doesn't help you efficiently find "nearby" points: being close in one dimension says nothing about the other, so an index on latitude returns an entire band around the globe that still has to be filtered by longitude.

Geospatial indexing solves this by transforming 2D space into something that CAN be indexed efficiently in 1D — and there are three major approaches.

Approach 1: Quadtrees — Recursive Spatial Division

A Quadtree recursively divides 2D space into four equal quadrants, continuing to subdivide quadrants that contain "too much" data.

Start: entire map (1 region)

┌─────────────┬─────────────┐
│             │             │
│   NW        │    NE       │
│             │             │
├─────────────┼─────────────┤
│             │             │
│   SW        │    SE       │
│             │             │
└─────────────┴─────────────┘

If NE quadrant has too many points (e.g., dense city center), 
subdivide it further:

┌─────────────┬──────┬──────┐
│             │ NE-NW│ NE-NE│
│   NW        ├──────┼──────┤
│             │ NE-SW│ NE-SE│
├─────────────┼──────┴──────┤
│             │             │
│   SW        │    SE       │
│             │             │
└─────────────┴─────────────┘

This creates a tree structure where densely populated areas (city centers, with thousands of drivers per square km) have many small leaf nodes, while sparsely populated areas (rural regions, with few drivers per 100 square km) have few large leaf nodes.

Finding nearby drivers:

1. Locate the leaf node(s) containing the user's location
2. Drivers in that leaf node are candidates
3. If not enough candidates, check neighboring leaf nodes
   (a leaf representing a small area in a dense city center might
   need to check 1-2 neighbors; a leaf representing a large rural 
   area might already have all nearby drivers)
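
As a rough illustration of the subdivision and lookup described above, here is a minimal point quadtree in Python. It is a sketch under assumptions: the `capacity` threshold, the `MAX_DEPTH` guard, and the flat x/y coordinates are all illustrative choices, not taken from any particular system. A box query stands in for "check the leaf, then its neighbors", since it visits every leaf that overlaps the search box.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 16  # assumed guard so identical points cannot split forever


@dataclass
class QuadTree:
    # This node covers the half-open box [x0, x1) x [y0, y1).
    x0: float
    y0: float
    x1: float
    y1: float
    capacity: int = 4   # assumed: max points in a leaf before it splits
    depth: int = 0
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)  # empty while this is a leaf

    def insert(self, x, y, item):
        if not (self.x0 <= x < self.x1 and self.y0 <= y < self.y1):
            return False  # point lies outside this node
        if self.children:
            # Exactly one child contains the point; any() stops at that child.
            return any(c.insert(x, y, item) for c in self.children)
        self.points.append((x, y, item))
        if len(self.points) > self.capacity and self.depth < MAX_DEPTH:
            self._split()
        return True

    def _split(self):
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        d = self.depth + 1
        self.children = [
            QuadTree(self.x0, self.y0, mx, my, self.capacity, d),  # SW
            QuadTree(mx, self.y0, self.x1, my, self.capacity, d),  # SE
            QuadTree(self.x0, my, mx, self.y1, self.capacity, d),  # NW
            QuadTree(mx, my, self.x1, self.y1, self.capacity, d),  # NE
        ]
        pending, self.points = self.points, []
        for px, py, item in pending:
            self.insert(px, py, item)  # re-routed into the new children

    def query(self, qx0, qy0, qx1, qy1):
        """Items whose point lies in the closed box [qx0, qx1] x [qy0, qy1]."""
        if qx1 < self.x0 or qx0 >= self.x1 or qy1 < self.y0 or qy0 >= self.y1:
            return []  # search box does not overlap this node
        found = [i for x, y, i in self.points
                 if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found
```

Inserting a tight cluster of points makes the tree subdivide only around that cluster, while a lone point far away stays in a large leaf, which is the density adaptation described above.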

Advantages:

Naturally adapts to data density — dense areas get fine-grained subdivision automatically
Conceptually simple — a tree structure most engineers already understand

Disadvantages:

Tree traversal for neighbor lookups can be complex — finding "all leaves adjacent to this leaf" isn't trivial when leaves are different sizes
Rebalancing as data shifts (rush hour moves driver density from residential areas to business districts) requires tree restructuring

Used by: Many GIS (Geographic Information Systems) applications and spatial databases. MongoDB's 2dsphere indexes use a related recursive subdivision (they are built on S2 cells, covered below).

Approach 2: Geohash — Encoding Location as a String

Geohash takes a completely different approach: encode a latitude/longitude pair into a single string, where the key property is that two locations whose geohashes share a long prefix are close together. The converse does not always hold, as the boundary problem below shows.

How Geohash Encoding Works

Latitude/Longitude: (37.7749, -122.4194)  ← San Francisco

The encoding interleaves bits from latitude and longitude ranges,
recursively halving the search space:

Step 1: Is longitude in [-180, 0] or [0, 180]? 
        -122.4194 is in [-180, 0] → bit = 0
Step 2: Is latitude in [-90, 0] or [0, 90]?
        37.7749 is in [0, 90] → bit = 1
Step 3: Continue alternating longitude/latitude, halving each time...

After enough bits, group into base32 characters:
Result: "9q8yyk8ytpxr" ← this is the Geohash for San Francisco
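
The bit-interleaving steps above can be sketched in a few lines of Python. This is a minimal illustration of the standard geohash encoding, not a replacement for a tested library; the function name `geohash_encode` is just a label chosen here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, length=12):
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1   # upper half -> bit 1
                lng_lo = mid
            else:
                bits <<= 1               # lower half -> bit 0
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:               # every 5 bits become one character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(37.7749, -122.4194, 6))  # 9q8yyk
```

Asking for fewer characters simply stops the halving earlier, which is why a shorter geohash is always a prefix of the longer one for the same point.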

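The Prefix Property

"9q8yyk8ytpxr"  → San Francisco (12 characters, cell of roughly 4cm × 2cm)
"9q8yyk8ytpx"   → San Francisco (11 characters, ~15cm)
"9q8yyk8yt"     → San Francisco (9 characters, ~5m)
"9q8yyk"        → San Francisco neighborhood (6 characters, ~1.2km × 0.6km)
"9q8y"          → San Francisco and nearby (4 characters, ~39km × 20km)
"9q"            → California / Nevada region (2 characters, ~1,250km × 625km)

Cell sizes are approximate and measured at the equator; cells get narrower east-west at higher latitudes.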

Each character you remove from the end roughly multiplies the represented area by 32 (since base32 has 32 possible characters per position).

Finding Nearby Points with Geohash

-- Find all locations with geohash starting with "9q8yyk" 
-- (same grid cell as our target location, roughly 1.2km × 0.6km)
SELECT * FROM drivers WHERE geohash LIKE '9q8yyk%';

This is a prefix match — something B+ tree indexes (Topic 25) handle extremely efficiently! You've converted a 2D "nearby" query into a 1D "string prefix" query that any standard database index can answer fast.

The Edge Case: Boundary Problem

Two points that are physically VERY close to each other can fall in 
DIFFERENT grid cells, and so fail to share the prefix you search on, 
if they're on opposite sides of a cell boundary:

Point A: geohash "9q8yyk..." (just inside one grid cell)
Point B: geohash "9q8yym..." (just across the boundary, 
         physically 10 meters from Point A)

A prefix search for "9q8yyk%" would MISS Point B entirely, 
even though it's very close. Across a major boundary (the equator or 
the prime meridian) two neighbors may share no prefix at all.

The fix: When searching, also check the 8 neighboring grid cells (the cells surrounding the cell containing the query point) — not just the cell containing the point itself. Most Geohash implementations include a "neighbors" function for exactly this purpose.
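
One simple way to build such a "neighbors" function is to decode the cell to its bounding box, step one cell width or height in each direction, and re-encode. The sketch below takes that approach; production libraries typically use lookup tables instead, and the helper names (`encode`, `bounding_box`, `neighbors`) are illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def encode(lat, lng, length):
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits = [], 0
    for i in range(length * 5):
        rng, value = (lng_rng, lng) if i % 2 == 0 else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bit = value >= mid
        rng[0 if bit else 1] = mid       # keep the half that contains the value
        bits = (bits << 1) | bit
        if i % 5 == 4:
            out.append(BASE32[bits])
            bits = 0
    return "".join(out)


def bounding_box(geohash):
    """Return ([lat_lo, lat_hi], [lng_lo, lng_hi]) for a geohash cell."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    i = 0
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            rng = lng_rng if i % 2 == 0 else lat_rng
            rng[0 if bit else 1] = (rng[0] + rng[1]) / 2
            i += 1
    return lat_rng, lng_rng


def neighbors(geohash):
    """The up-to-8 cells surrounding a geohash cell, at the same precision."""
    (lat_lo, lat_hi), (lng_lo, lng_hi) = bounding_box(geohash)
    lat_c, lng_c = (lat_lo + lat_hi) / 2, (lng_lo + lng_hi) / 2
    height, width = lat_hi - lat_lo, lng_hi - lng_lo
    result = set()
    for dlat in (-1, 0, 1):
        for dlng in (-1, 0, 1):
            if dlat == 0 and dlng == 0:
                continue
            lat = lat_c + dlat * height
            if not -90.0 <= lat <= 90.0:
                continue                 # no cell beyond the poles
            lng = (lng_c + dlng * width + 180.0) % 360.0 - 180.0  # wrap at 180
            result.add(encode(lat, lng, len(geohash)))
    return result


cells_to_search = {"9q8yyk"} | neighbors("9q8yyk")  # 9 prefixes to query
```

A nearby-search then issues the prefix query for each of the nine cells and filters the combined results by exact distance.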

Approach 3: Google S2 and Uber H3 — Cell-Based Indexing at Scale

Geohash has a subtle issue: its grid cells are rectangular, and rectangles on a sphere (the Earth) have wildly varying actual sizes depending on latitude — a "1-degree by 1-degree" cell near the equator covers a much larger physical area than the same cell near the poles. This distortion complicates distance calculations.

Google S2: Projecting the Sphere onto a Cube

S2 projects the Earth's surface onto the 6 faces of a cube, then recursively subdivides each face into a hierarchy of cells — similar in spirit to a Quadtree, but designed specifically to minimize the distortion of spherical geometry.

Earth (sphere) → projected onto cube faces → each face recursively 
subdivided into a hierarchy of cells (S2 cells)

Each S2 cell has a unique 64-bit ID.
Cells at the same "level" have roughly similar areas, 
regardless of where on Earth they are — solving the 
rectangular-distortion problem of simple lat/lng grids.

Used by: Google Maps internally for spatial indexing and proximity queries at global scale.

Uber H3: Hexagonal Grid System

Uber open-sourced H3, which divides the Earth into a hierarchical grid of hexagonal cells.

Why hexagons (and not squares/rectangles)?

Square grid:
┌───┬───┬───┐
│   │   │   │   Each cell has 8 neighbors, but 4 are "diagonal" 
├───┼───┼───┤   (share only a corner) and 4 are "adjacent" 
│   │ X │   │   (share an edge) — DIFFERENT distances to 
├───┼───┼───┤   center, creating asymmetry
│   │   │   │
└───┴───┴───┘

Hexagonal grid:
  ╱ ╲ ╱ ╲ ╱ ╲
 │   │ X │   │    Each cell has exactly 6 neighbors, 
  ╲ ╱ ╲ ╱ ╲ ╱     ALL at the SAME distance from the center.
   ╲ ╱ ╲ ╱ ╲      Uniform adjacency — no diagonal/edge distinction.

Why uniform adjacency matters for ride-hailing: When Uber's matching algorithm asks "find drivers in cells near this rider," with hexagons, "near" has a consistent, uniform meaning in every direction. With square grids, a driver in a "diagonal" cell is actually farther away than a driver in an "edge-adjacent" cell — even though both appear as "1 cell away" — creating subtle biases in matching.
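
The asymmetry is easy to check numerically. The snippet below compares center-to-center distances on an ideal flat square grid and an ideal flat hexagonal tiling, both with unit spacing. This is a simplification: real H3 cells sit on a sphere and vary slightly in size.

```python
import math

# Offsets from a cell center to each neighbor's center, in units of cell spacing.
square_neighbors = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)]
# Regular hexagonal tiling: six neighbors, 60 degrees apart, at unit spacing.
hex_neighbors = [(math.cos(math.radians(60 * k)), math.sin(math.radians(60 * k)))
                 for k in range(6)]

square_dists = sorted({round(math.hypot(dx, dy), 6) for dx, dy in square_neighbors})
hex_dists = sorted({round(math.hypot(dx, dy), 6) for dx, dy in hex_neighbors})

print(square_dists)  # two distinct distances: 1.0 and about 1.414214
print(hex_dists)     # a single distance: 1.0
```

A diagonal neighbor on the square grid is about 41% farther away than an edge neighbor, while every hexagon neighbor is at the same distance.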

H3's hierarchical resolution:

H3 has 16 resolution levels:
  Resolution 0: ~4,250,000 km² per cell (continent-scale)
  Resolution 7: ~5.2 km² per cell (city neighborhood-scale)
  Resolution 9: ~0.1 km² per cell (city block-scale)
  Resolution 15: ~0.9 m² per cell (about one square meter)

Uber typically uses resolution 7-9 for driver matching — 
city-block to neighborhood granularity.

Real Uber matching flow using H3:

1. Rider requests a trip at location (lat, lng)
2. Convert to H3 cell at resolution 9: cell_id = "8928308280fffff"
3. Query: "which drivers are in THIS cell, or its 6 neighbors?"
   (a simple lookup — drivers' current cell_ids are indexed)
4. Rank candidate drivers by actual road-network distance/ETA
5. Dispatch the best match

The H3 cell lookup (step 3) is essentially O(1) — a hash lookup by cell ID, returning a small set of candidates. The expensive part (step 4 — actual routing/ETA calculation) only runs on this small candidate set, not on every driver in the city.
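
The two-stage shape of that flow (cheap cell lookup, then expensive ranking on a few candidates) can be sketched without any geospatial library. Everything here is a stand-in and labeled as such: `cell_of` uses a toy square lat/lng grid rather than H3 hexagons, `cell_and_neighbors` returns 9 cells rather than 7, and `haversine_km` substitutes straight-line distance for a real routing/ETA service.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # toy cell size in degrees (hypothetical; not an H3 resolution)


def cell_of(lat, lng):
    """Stand-in for an H3 'lat/lng to cell' call: a square grid cell id."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


def cell_and_neighbors(cell):
    """Stand-in for 'this cell plus its ring of neighbors'."""
    r, c = cell
    return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance; a stand-in for road-network ETA (step 4)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


drivers_by_cell = defaultdict(dict)  # cell id -> {driver_id: (lat, lng)}


def add_driver(driver_id, lat, lng):
    drivers_by_cell[cell_of(lat, lng)][driver_id] = (lat, lng)


def nearest_drivers(lat, lng, limit=3):
    candidates = []
    for cell in cell_and_neighbors(cell_of(lat, lng)):   # step 3: cheap lookups
        for driver_id, (dlat, dlng) in drivers_by_cell.get(cell, {}).items():
            candidates.append((haversine_km(lat, lng, dlat, dlng), driver_id))
    return [d for _, d in sorted(candidates)[:limit]]     # step 4: rank the few


add_driver("near", 37.7750, -122.4190)
add_driver("close", 37.7800, -122.4150)
add_driver("oakland", 37.8044, -122.2712)
print(nearest_drivers(37.7749, -122.4194))  # ['near', 'close']
```

The driver in Oakland is never even considered, because its cell is not among those looked up; only the candidates that survive step 3 pay for the distance calculation.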

Comparing the Three Approaches

	Quadtree	Geohash	H3 / S2
Shape	Rectangular, variable size (adaptive)	Rectangular, fixed grid	Hexagonal (H3) / roughly square (S2), fixed hierarchical grid
Adapts to density?	Yes — subdivides dense areas	No — fixed grid regardless of density	No — fixed grid, but multiple resolution levels available
Neighbor queries	Complex (variable-size neighbors)	Simple but has boundary issues	Simple, uniform (especially H3's hexagons)
Distortion	N/A (works in 2D plane)	High at extreme latitudes	Low (designed for spherical geometry)
Used by	GIS systems, older spatial databases	General-purpose location encoding	Uber (H3), Google Maps (S2)

The practical guidance: For most system design interviews, Geohash is the easiest to explain and implement (prefix matching on a string — leverages existing database indexes). H3/S2 are the "I know what real companies use at scale" answer — mentioning them, especially H3's hexagonal uniform-adjacency property, signals deeper knowledge.

Handling Real-Time Updates: The Hard Part

Static geospatial indexing (indexing restaurant locations, which rarely move) is one problem. Indexing millions of moving objects, each updating position every few seconds, is a much harder operational challenge.

The challenge:
  - 1 million active drivers
  - Each sends a GPS update every 3-5 seconds
  - = roughly 200,000-330,000 location updates PER SECOND globally

For each update:
  1. Remove driver from their OLD geo-index cell
  2. Add driver to their NEW geo-index cell
  (most updates: old cell == new cell, since drivers move slowly 
   relative to update frequency — but the system must still process it)
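
The per-update work can be sketched as a small in-memory structure. This is an illustration only: the class name `LiveCellIndex` is hypothetical, cell ids are passed in by the caller, and a production system would shard this across machines.

```python
from collections import defaultdict


class LiveCellIndex:
    """Cell id -> set of driver ids, plus each driver's current cell."""

    def __init__(self):
        self.drivers_in_cell = defaultdict(set)
        self.cell_of_driver = {}

    def update(self, driver_id, new_cell):
        """Apply one location update. Returns True if the driver changed cells."""
        old_cell = self.cell_of_driver.get(driver_id)
        if old_cell == new_cell:
            return False                                       # common case: no move
        if old_cell is not None:
            self.drivers_in_cell[old_cell].discard(driver_id)  # step 1: leave old cell
            if not self.drivers_in_cell[old_cell]:
                del self.drivers_in_cell[old_cell]             # drop empty cells
        self.drivers_in_cell[new_cell].add(driver_id)          # step 2: join new cell
        self.cell_of_driver[driver_id] = new_cell
        return True


# Throughput arithmetic from the text: 1M drivers, one update every 3-5 seconds.
drivers = 1_000_000
updates_per_sec_low = drivers / 5    # 200,000
updates_per_sec_high = drivers / 3   # about 333,333
```

Because most updates leave the driver in the same cell, the early return keeps the common case to a single dictionary lookup.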

Why this rules out traditional databases for the "live" index:
A SQL database with a spatial index, receiving 250,000 writes/second to update positions, would be overwhelmed — especially since most "writes" are actually small UPDATEs to existing rows, which still require index maintenance.

The production pattern:

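Live driver locations → Redis (in-memory, ephemeral)
  - Redis GEOADD / GEORADIUS commands provide built-in geospatial 
    indexing using geohash internally
  - In-memory writes, sharded by city or region, can absorb 
    250K updates/sec
  - Expiry of stale drivers: if a driver stops sending updates 
    (app crashed, phone died), they should disappear from the index 
    after a short timeout (e.g., 30 seconds). Redis TTLs apply to 
    whole keys, not to individual members of a geo set, so this is 
    usually done with a per-driver heartbeat key that carries the TTL, 
    or a last-seen timestamp per driver plus a periodic cleanup (ZREM)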

Historical location data (for analytics, ETAs, surge calculation)
  → Cassandra — write-heavy, time-series friendly

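Redis GEO commands (built on Geohash internally):

GEOADD drivers:active -122.4194 37.7749 "driver_12345"
GEORADIUS drivers:active -122.4194 37.7749 2 km
  → returns all drivers within 2km. Under the hood, Redis stores each 
    member's geohash as a sorted-set score, scans the score ranges for 
    the covering cells, and filters the results by exact distance.

GEORADIUS is deprecated since Redis 6.2; the equivalent newer command is:
GEOSEARCH drivers:active FROMLONLAT -122.4194 37.7749 BYRADIUS 2 km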

This is why "Redis for live location, Cassandra for history" is such a common pattern in ride-hailing and delivery system designs — it directly addresses the write-throughput requirement that a traditional spatial database can't meet.

Interview Scenario: "Design 'Find Nearby Drivers'"

The structured answer:

"I'd use a geospatial index based on Geohash or H3 — I'll go with H3 for this since it's what Uber actually uses, and its uniform hexagonal adjacency avoids the corner-case distance distortions of square grids.

Each driver's current location gets converted to an H3 cell ID at a resolution appropriate for city-block granularity — roughly resolution 8 or 9. This cell ID becomes part of the key when storing the driver's location.

For live location data, I'd use Redis: drivers send GPS updates every few seconds, and at scale that's hundreds of thousands of writes per second, which a sharded in-memory store can absorb. Each driver also gets a short-lived heartbeat key (Redis TTLs apply to whole keys, not to members of a set), so a driver who stops sending updates (crashed app, dead phone) can be dropped from the active index after a short timeout.

To find nearby drivers: convert the rider's location to its H3 cell, then query that cell plus its 6 neighboring cells — a simple set of lookups. This gives a small candidate set, typically tens of drivers even in a dense city, not thousands.

Only THEN do I run the expensive operation — actual routing/ETA calculation via a routing service — on this small candidate set, not on every driver in the city. The geo-index's job is to cheaply narrow millions of drivers down to dozens; the routing service's job is to rank those dozens accurately."

Key Takeaways

Standard B+ tree indexes can't efficiently answer 2D "nearby" queries — geospatial indexing transforms 2D proximity into something indexable.
Quadtree: recursive spatial subdivision, adapts to data density, but complex neighbor queries.
Geohash: encodes lat/lng as a string where shared prefixes = proximity — leverages standard database prefix indexes, but has a boundary problem (check neighboring cells too).
Google S2 projects the sphere onto a cube to minimize distortion; Uber H3 uses hexagonal cells for uniform adjacency in all directions.
Real-time location updates (hundreds of thousands/second) call for in-memory storage; Redis GEO commands (built on Geohash) are a common production pattern.
The architectural pattern: geo-index narrows millions of candidates to dozens cheaply; expensive ranking (routing/ETA) only runs on that small set.

What's Next

Topic 25 covers a grab-bag of essential data structures: Skip Lists (how Redis sorted sets work), HyperLogLog (counting a billion unique visitors with almost no memory), Tries (autocomplete), and LSM Trees vs B+ Trees (the fundamental write-optimized vs read-optimized database internals choice).

Topic 24 of the System Design Mastery series.

Tags: system-design geospatial uber data-structures backend algorithms interview-prep

System Design - 23. Bloom Filters: How Chrome Checks Billions of Malicious URLs Using Almost No Memory

Rajkiran — Sat, 13 Jun 2026 18:49:00 +0000

Covers: Probabilistic Data Structures, False Positives vs False Negatives, Counting Bloom Filters, Tuning, Real Implementations

A Question With a Surprising Answer

How does Chrome check, on every single page load, whether the URL you're visiting is on a list of millions of known malicious websites — without sending your browsing history to Google for every page you visit, and without downloading a multi-gigabyte database to your phone?

The answer is a data structure so simple it can be explained in one sentence, yet so powerful it underpins systems at Google, Cassandra, Akamai, and nearly every large-scale database: the Bloom Filter.

The Core Idea: A Probabilistic "Maybe"

A normal data structure (a hash set, a database) answers "is X in this collection?" with a definitive yes or no.

A Bloom Filter answers with:

"Definitely NOT in the set" (100% certain)
"Possibly in the set" (might be a false positive)

Bloom Filter says "NOT in set"  → guaranteed correct, 100% of the time
Bloom Filter says "MAYBE in set" → could be wrong (false positive)

A Bloom Filter NEVER says:
"Definitely IS in set" (a positive answer is never certain)
"NOT in set" when the item actually IS in the set (no false negatives, ever)

This asymmetry — no false negatives, but possible false positives — is the entire trick, and it's precisely calibrated to be useful.

How It Works: Bit Arrays and Hash Functions

A Bloom Filter is a bit array of size m (initially all zeros) plus k independent hash functions.

Adding an Item

Bit array (m=16): [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Add "malware.com":
  hash1("malware.com") % 16 = 2  → set bit 2 to 1
  hash2("malware.com") % 16 = 7  → set bit 7 to 1
  hash3("malware.com") % 16 = 11 → set bit 11 to 1

Bit array: [0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0]
                ↑           ↑     ↑
              bit 2       bit 7  bit 11

Checking an Item

Check "malware.com" (already added):
  hash1("malware.com") % 16 = 2  → bit 2 is 1 ✓
  hash2("malware.com") % 16 = 7  → bit 7 is 1 ✓
  hash3("malware.com") % 16 = 11 → bit 11 is 1 ✓
  ALL bits set → "MAYBE in set" (correctly — it IS in the set)

Check "safe-site.com" (never added):
  hash1("safe-site.com") % 16 = 3  → bit 3 is 0 ✗
  ANY bit is 0 → "DEFINITELY NOT in set" (correct — it's not!)

Check "another-site.com" (never added, but...):
  hash1("another-site.com") % 16 = 2  → bit 2 is 1 (set by malware.com)
  hash2("another-site.com") % 16 = 11 → bit 11 is 1 (set by malware.com)
  hash3("another-site.com") % 16 = 7  → bit 7 is 1 (set by malware.com)
  ALL bits happen to be 1 → "MAYBE in set" 
  → FALSE POSITIVE! "another-site.com" was never added, but its hash 
    positions happen to overlap with bits set by "malware.com"

Why no false negatives are possible: If an item was added, ALL its bits were set to 1 by definition. Checking those same bits later will always find them set to 1 (bits are never unset in a basic Bloom Filter). So a real member can never be reported as "not in set."

Why false positives are possible: Multiple items' hash functions can map to overlapping bit positions. An item that was never added might, by coincidence, have all its bit positions already set to 1 by other items.
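
Here is a minimal Bloom filter in Python that mirrors the add/check walkthrough above. One implementation choice is an assumption made here rather than something the text prescribes: instead of k separate hash functions, the k positions are derived from two halves of a single SHA-256 digest (a common technique known as double hashing).

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                            # number of bits
        self.k = k                            # number of hash positions per item
        self.bits = bytearray((m + 7) // 8)   # all zeros initially

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # forced odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not in set"; True means "maybe in set".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
bf.add("malware.com")
print(bf.might_contain("malware.com"))   # True: added items are never missed
```

Since `add` only ever sets bits and nothing clears them, any item that was added will always pass `might_contain`, which is the no-false-negatives guarantee.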

Why Tiny Memory Footprint Is the Whole Point

Here's the number that makes Bloom Filters remarkable:

To store 1 million URLs with a 1% false positive rate:
  A hash set storing actual URL strings: ~50-100 MB (depends on URL length)
  A Bloom Filter: ~1.2 MB

To store 1 BILLION URLs with a 1% false positive rate:
  Hash set: ~50-100 GB
  Bloom Filter: ~1.2 GB
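
The Bloom Filter figures above follow from the standard sizing formula m = -n · ln(p) / (ln 2)², which assumes the optimal number of hash functions. A quick check in Python (the function name `bloom_bits` is just a label used here):

```python
import math


def bloom_bits(n, p):
    """Bits needed for n items at false positive rate p, with optimal k."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


for n in (1_000_000, 1_000_000_000):
    bits = bloom_bits(n, 0.01)
    print(f"{n:>13,} items -> {bits / 8 / 1e6:,.1f} MB "
          f"({bits / n:.2f} bits per item)")
```

At a 1% false positive rate this works out to about 9.6 bits per item regardless of how long the URLs themselves are, which is where the ~1.2 MB and ~1.2 GB figures come from.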

The Bloom Filter doesn't store the actual data, just bits representing "something hashed here." This is why Chrome could ship a compact representation of Google's Safe Browsing malicious URL list as a small download, updated periodically and checked entirely locally, with no network call needed for the common case (the URL is safe). Early versions of Chrome used a Bloom Filter for this; later versions moved to a compact set of hash prefixes, which follows the same local pre-filter pattern.

Chrome's flow:
  1. User visits a URL
  2. Check LOCAL Bloom Filter: "is this URL possibly malicious?"
  3a. Bloom Filter says "definitely not" → proceed immediately, 
      NO network call (the vast majority of URLs)
  3b. Bloom Filter says "maybe" → NOW make a network call to Google's 
      full database to confirm (rare — only for the small % of false 
      positives plus actual malicious URLs)

The Bloom Filter acts as a fast local pre-filter — eliminating ~99%+ of cases from ever needing a network round trip, while never missing an actual malicious URL.

Tuning the False Positive Rate

Two parameters control accuracy: the size of the bit array (m) and the number of hash functions (k).

The formula (intuition, not memorization-required for interviews):

False positive rate ≈ (1 - e^(-kn/m))^k

Where:
  n = number of items inserted
  m = size of bit array
  k = number of hash functions

The practical trade-off:

More bits per item (larger m relative to n):
  → Lower false positive rate
  → More memory used

Optimal k (number of hash functions):
  k ≈ (m/n) × ln(2)
  → Too few hash functions: insufficient bit coverage, higher false positives
  → Too many hash functions: bit array fills up too fast, higher false positives
  → There's a sweet spot

Practical numbers (commonly cited):

~10 bits per item, 7 hash functions → ~1% false positive rate
~15 bits per item, 10 hash functions → ~0.1% false positive rate
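
These commonly cited numbers can be reproduced directly from the formulas above. The helper names below are illustrative:

```python
import math


def false_positive_rate(bits_per_item, k):
    """(1 - e^(-k*n/m))^k, written in terms of bits_per_item = m/n."""
    return (1 - math.exp(-k / bits_per_item)) ** k


def optimal_k(bits_per_item):
    """k = (m/n) * ln(2), before rounding to a whole number of hashes."""
    return bits_per_item * math.log(2)


for bits, k in ((10, 7), (15, 10)):
    print(f"{bits} bits/item, k={k}: "
          f"FPR = {false_positive_rate(bits, k):.4%}, "
          f"optimal k = {optimal_k(bits):.2f}")
```

Both configurations land slightly below their quoted rates (a bit under 1% and a bit under 0.1%), and the optimal k rounds to 7 and 10 respectively.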

The interview-ready statement: "Bloom filters trade memory for accuracy: you choose your acceptable false positive rate upfront, and that determines how many bits per item you need. Doubling the bits per item roughly squares the false positive rate (1% becomes about 0.01%), and in the practical range roughly 50% more bits buys about 10x fewer false positives."

Counting Bloom Filters: Adding Deletion

A basic Bloom Filter has a problem: you can't delete an item. Bits are shared between items (that's the whole mechanism) — unsetting a bit to "remove" one item might break the membership check for other items whose hashes also touched that bit.

Counting Bloom Filters solve this by using small counters instead of single bits:

Basic Bloom Filter:        [0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0]
Counting Bloom Filter:     [0,0,2,0,0,0,0,1,0,0,0,3,0,0,0,0]
                                ↑               ↑
                          2 items hash here  3 items hash here

Adding an item: increment relevant counters
Removing an item: decrement relevant counters
  (if a counter reaches 0, that "bit" is effectively unset)

Checking membership: same as before — are all relevant 
counters > 0?

The cost: Counting Bloom Filters use more memory (typically 4 bits per counter instead of 1 bit) — but still dramatically less than storing actual items, while supporting deletion.
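
A counting variant needs only small changes to a basic filter. The sketch below uses a plain Python list of integers for readability, whereas real implementations pack small fixed-width counters; the double-hashing scheme for deriving positions is an assumption made for this example.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m   # one counter per slot instead of one bit

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only safe for items that were actually added: removing a non-member
        # that happens to pass the check would corrupt other items' counters.
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True

    def might_contain(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))
```

Removing an item lowers only the counters it raised, so other items that share a slot keep a count of at least one and still pass the membership check.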

Real Implementation: Cassandra's SSTables

This is one of the most elegant uses of Bloom Filters in production databases, and directly relates to the LSM Tree structure we'll cover in Topic 25.

Cassandra stores data in immutable files called SSTables.
A single Cassandra node might have HUNDREDS of SSTable files on disk.

Without Bloom Filters:
  Looking up "user_12345" requires checking EVERY SSTable file 
  on disk — even if "user_12345" only exists in ONE of them.
  Each check = a disk read = slow (remember Day 1's latency numbers:
  SSD read ≈ 150μs, much slower than memory).

With Bloom Filters:
  Each SSTable has an associated Bloom Filter (kept in memory).
  Before reading an SSTable from disk, Cassandra checks its 
  Bloom Filter: "might 'user_12345' be in this SSTable?"

  - "Definitely not" (most SSTables) → skip this file entirely, 
    NO disk read
  - "Maybe" (the 1-2 SSTables that actually contain the key, plus 
    occasional false positives) → read this file from disk

The impact: A query that might have required checking 100 SSTable files on disk now typically checks 1-2 — because 98 of them are eliminated by an in-memory Bloom Filter check that costs microseconds. This is one of the primary reasons Cassandra achieves its famous write and read performance despite using an architecture (LSM trees) that would otherwise require many disk reads per query.

Real Implementation: Google Bigtable / CDN "One-Hit Wonder" Filtering

Akamai's CDN faces a specific problem: a huge fraction of content requested from origin servers is requested exactly once — a user clicks a link nobody else will click, requesting a resource that will never be requested again ("one-hit wonders").

Caching every requested item — including one-hit wonders — wastes cache space on content that will never produce a cache hit.

Akamai's approach using a Bloom Filter:
  1. First request for URL X → check Bloom Filter
  2. "Definitely not seen before" → DON'T cache it (probably a one-hit 
     wonder), but ADD it to the Bloom Filter
  3. Second request for the SAME URL X → check Bloom Filter
  4. "Maybe seen before" (it's now in the filter from step 2) 
     → THIS time, cache it — it's been requested twice, 
     likely to be requested again

This is a beautifully simple use of the false-positive-tolerant nature of Bloom Filters: occasionally caching a one-hit wonder due to a false positive is a minor inefficiency, but the overall cache hit rate improves significantly by not wasting cache slots on content unlikely to be re-requested.
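
The cache-on-second-request idea can be sketched as follows. This is not Akamai's implementation: the `TinyBloom` class, its sizes, the `handle_request` function, and the `fetch_from_origin` callback are all hypothetical names used for illustration.

```python
import hashlib


class TinyBloom:
    def __init__(self, m=8192, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen_once = TinyBloom()   # URLs requested at least once
cache = {}                # URL -> body, only for URLs requested twice or more


def handle_request(url, fetch_from_origin):
    if url in cache:
        return cache[url], "hit"
    body = fetch_from_origin(url)
    if seen_once.might_contain(url):
        cache[url] = body             # second (or later) request: worth caching
        return body, "miss-cached"
    seen_once.add(url)                # first request: remember it, do not cache
    return body, "miss-not-cached"
```

Calling `handle_request` three times for the same URL yields "miss-not-cached", then "miss-cached", then "hit"; a false positive in the filter only means a one-hit wonder occasionally gets cached on its first request.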

Google Bigtable uses Bloom Filters similarly to Cassandra — to avoid unnecessary disk reads when checking if a row exists in a particular SSTable.

Bloom Filter vs Hash Set: When to Use Which

	Hash Set	Bloom Filter
Memory	O(n) — stores actual items	O(n) but tiny constant — just bits
False positives	Never	Possible (tunable rate)
False negatives	Never	Never
Deletion	Yes	No (unless Counting Bloom Filter)
Retrieve actual items	Yes	No — only membership testing

Use a Bloom Filter when:

You need to test "might X be in this huge set?"
Memory is constrained relative to the size of the set
Occasional false positives are acceptable (because they trigger a more expensive but authoritative check — disk read, network call, etc.)
You DON'T need to retrieve or enumerate the actual items — only test membership

Use a Hash Set when:

You need exact membership testing (no false positives tolerable)
You need to retrieve or iterate the actual stored items
Memory isn't a constraint relative to your dataset size

Interview Scenario: "Reduce Database Load with Bloom Filters"

Q: Your application frequently checks "does this username exist?" — and most checks are for usernames that DON'T exist (new signups checking availability). How would you optimize this?

"Most of these checks are for usernames that don't exist — every database query for a non-existent username is essentially wasted I/O. I'd maintain a Bloom Filter containing all existing usernames, kept in memory and updated whenever a new user signs up.

The check flow becomes: first check the Bloom Filter. If it says 'definitely not in set,' the username is available — no database query needed at all, this is instant. If it says 'maybe in set,' THEN query the database to confirm — this covers both real existing usernames and the rare false positives.

Given that the vast majority of availability checks during signup are for usernames that genuinely don't exist, this eliminates the majority of database reads for this endpoint — turning a database query into an in-memory bit-check for most requests.

One consideration: the Bloom Filter needs to be kept in sync as users sign up — I'd update it synchronously on user creation (cheap — just set a few bits) so it never has false negatives for newly created users."

Key Takeaways

A Bloom Filter answers "definitely not in set" (always correct) or "maybe in set" (possible false positive) — never a false negative.
It's a bit array + multiple hash functions. Adding sets bits; checking verifies all relevant bits are set.
Memory usage is a tiny constant per item — orders of magnitude smaller than storing actual data, regardless of how large the items themselves are.
Tuning: more bits per item and the right number of hash functions reduces the false positive rate — it's a tunable trade-off, not fixed.
Counting Bloom Filters use small counters instead of bits, enabling deletion at the cost of more memory.
Cassandra/Bigtable: Bloom Filters per SSTable eliminate most unnecessary disk reads — checking 1-2 files instead of 100.
Chrome Safe Browsing: a local Bloom Filter eliminates network calls for the vast majority of (safe) URLs.
Akamai CDN: Bloom Filters identify "one-hit wonders" to avoid wasting cache space on content unlikely to be requested again.

What's Next

Topic 24 covers Geospatial Indexing — how Uber finds nearby drivers among millions of moving vehicles in milliseconds, comparing Quadtrees, Google's S2, and Uber's own H3 hexagonal grid system.

Topic 23 of the System Design Mastery series.

Tags: system-design bloom-filters data-structures cassandra backend algorithms interview-prep

System Design - 22. Consistent Hashing: The Algorithm That Lets Cassandra Add a Server Without Breaking Everything

Rajkiran — Sat, 13 Jun 2026 18:44:36 +0000

Covers: The Modulo Problem, Hash Ring, Virtual Nodes, Real Implementations in Cassandra and Dynamo

The Promise We Made on Day 3 (Now Fulfilled)

Back on Day 3, when discussing hash-based sharding, we hit a wall: adding a server to a hash-based shard remaps almost everything.

4 shards: shard = hash(key) % 4
5 shards: shard = hash(key) % 5

Adding ONE server changed the modulo from 4 to 5 — 
and remapped roughly 80% of all keys to different shards.
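
The 80% figure is easy to confirm empirically. The snippet below hashes a batch of synthetic keys (the `user_N` names are made up for the experiment) and counts how many land on a different shard when the modulus changes from 4 to 5:

```python
import hashlib


def stable_hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


keys = [f"user_{i}" for i in range(20_000)]
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
fraction_moved = moved / len(keys)
print(f"{fraction_moved:.1%} of keys change shard")  # close to 80%
```

A key keeps its shard only when its hash gives the same remainder under both moduli, which happens for 4 out of every 20 consecutive hash values, so about 20% stay and 80% move.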

We promised a solution: consistent hashing. Today we deliver on that promise — and it's one of the most elegant algorithms in distributed systems.

The Core Idea: A Ring, Not a Line

Instead of mapping keys to shard numbers via modulo, consistent hashing maps both keys and servers onto the same circular space — a "ring" — using a hash function.

Hash space: 0 to 2^32 - 1 (a circle, where 2^32-1 wraps back to 0)

                    0 / 2^32
                       │
            Server D ──┼── Server A
                       │
        Server C ──────┼────── 
                       │
                  Server B

Placing servers on the ring:

import hashlib

def hash_to_ring(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

servers = ["server_A", "server_B", "server_C", "server_D"]
ring_positions = {hash_to_ring(s): s for s in servers}

# Result (example positions on the ring):
# server_A → position 500,000,000
# server_B → position 1,800,000,000
# server_C → position 2,900,000,000
# server_D → position 4,000,000,000

Placing data on the ring:

key = "user_12345"
key_position = hash_to_ring(key)  # e.g., position 2,100,000,000

The assignment rule: A key belongs to the first server clockwise from its position on the ring.

Ring positions (clockwise):
  Server A: 500M
  Server B: 1.8B
  Server C: 2.9B
  Server D: 4.0B

Key "user_12345" at position 2.1B
  → Walk clockwise from 2.1B → first server is Server C (at 2.9B)
  → "user_12345" is stored on Server C
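
The clockwise walk is a binary search over the sorted server positions. A small sketch using the example positions above (illustrative round numbers, not real hash outputs):

```python
import bisect

RING_SIZE = 2**32
# Example positions from the text (illustrative, not real hash outputs).
ring = {
    500_000_000: "server_A",
    1_800_000_000: "server_B",
    2_900_000_000: "server_C",
    4_000_000_000: "server_D",
}
positions = sorted(ring)


def owner(key_position):
    """First server at or clockwise from key_position, wrapping past the top."""
    i = bisect.bisect_left(positions, key_position % RING_SIZE)
    return ring[positions[i % len(positions)]]


print(owner(2_100_000_000))  # server_C
print(owner(4_100_000_000))  # wraps around to server_A
```

The modulo on the index handles the wrap-around case: a key past the last server on the ring belongs to the first one.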

The Magic: Adding or Removing a Server

This is where consistent hashing earns its name. Watch what happens when we add Server E at position 2.5B (between Server B and Server C):

Before:
  ...Server B (1.8B) ────────────────► Server C (2.9B)...
     Keys in range (1.8B, 2.9B] all belong to Server C

After adding Server E at 2.5B:
  ...Server B (1.8B) ──► Server E (2.5B) ──► Server C (2.9B)...
     Keys in range (1.8B, 2.5B] now belong to Server E
     Keys in range (2.5B, 2.9B] still belong to Server C

ONLY keys between 1.8B and 2.5B move — to Server E.
ALL other keys (on Server A, B, D, and the rest of Server C) stay where they are: UNCHANGED.

Compare to modulo hashing:

Modulo hashing (4 → 5 servers): ~80% of ALL keys remap
Consistent hashing (4 → 5 servers): only ~20% of keys remap 
  (specifically, only the keys that "belonged" to the segment 
   now split by the new server)

The general rule: Adding or removing one server out of N only affects keys in the immediate neighboring segment(s) — roughly 1/N of all keys, not all of them. This is the property that makes horizontal scaling of stateful systems (databases, caches) practical without massive, system-wide data migrations.
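
The Server E example can be checked by simulation. The sketch below places real hashed keys on a ring with the example server positions (illustrative values) and compares ownership before and after adding E at 2.5B:

```python
import bisect
import hashlib


def hash_to_ring(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)


def owner(ring, key):
    positions = sorted(ring)
    i = bisect.bisect_left(positions, hash_to_ring(key))
    return ring[positions[i % len(positions)]]


before = {500_000_000: "A", 1_800_000_000: "B",
          2_900_000_000: "C", 4_000_000_000: "D"}
after = {**before, 2_500_000_000: "E"}

keys = [f"user_{i}" for i in range(5_000)]
moves = [(owner(before, k), owner(after, k)) for k in keys]
changed = [(old, new) for old, new in moves if old != new]
fraction_moved = len(changed) / len(keys)

print(f"{fraction_moved:.1%} of keys moved")
print(set(changed))  # every move is from C to E
```

With these particular positions the moved segment is 0.7B out of a 2^32 ring, so roughly 16% of keys move, and every one of them moves from Server C to Server E.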

The Hotspot Problem (And Why Virtual Nodes Fix It)

There's a catch with the basic algorithm above. With only 4-5 servers randomly placed on a ring spanning 0 to 2^32, the segments between servers can be wildly uneven:

Random placement might produce:
  Server A: 100M  ─┐
  Server B: 150M   │ Server B handles only 50M of keyspace (tiny segment)
                   │
  Server C: 2.5B  ─┘ Server C handles 2.35B of keyspace (huge segment!)
  Server D: 3.9B

Server C gets WAY more traffic and data than Server B — 
even though they're supposedly equal peers.

With few servers, random hash positions create uneven segments purely by chance — just like randomly throwing 4 darts at a circular dartboard rarely divides it into 4 equal slices.

Virtual Nodes: The Solution

Instead of placing each physical server at one position on the ring, place it at many positions (virtual nodes, or "vnodes") — typically 100-256 per physical server.

Without vnodes (4 physical servers, 4 ring positions):
  Uneven segments — Server C might get 50% of keyspace, Server B gets 2%

With vnodes (4 physical servers, 256 vnodes each = 1024 ring positions):
  Server A: vnodes at positions [12M, 89M, 156M, ... 256 positions total]
  Server B: vnodes at positions [34M, 102M, 198M, ... 256 positions total]
  Server C: vnodes at positions [...]
  Server D: vnodes at positions [...]

  1024 small segments, scattered across the ring.
  Each physical server "owns" ~256 of these segments — 
  on average, ~25% of the ring each (with much less variance 
  than the 4-position version).

import bisect
import hashlib

def hash_to_ring(key):
    # MD5 is used only to spread keys evenly, not for security
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

VNODES_PER_SERVER = 256

ring = {}  # position -> physical server
for server in ["server_A", "server_B", "server_C", "server_D"]:
    for vnode_id in range(VNODES_PER_SERVER):
        position = hash_to_ring(f"{server}#vnode{vnode_id}")
        ring[position] = server

sorted_positions = sorted(ring.keys())

def get_server(key):
    key_position = hash_to_ring(key)
    # Binary search for the first vnode position >= key_position (clockwise)
    idx = bisect.bisect_left(sorted_positions, key_position)
    if idx == len(sorted_positions):
        idx = 0  # wrap around to the start of the ring
    return ring[sorted_positions[idx]]

The law of large numbers at work: With 256 vnodes per server spread across the ring, the sum of each server's vnode segments averages out close to 1/N of the total ring — even though any individual vnode segment might be small or large. More vnodes = more even distribution.
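The averaging effect can be checked directly. The sketch below computes the exact fraction of the ring each physical server owns, once with a single position per server and once with 256 vnodes; the `keyspace_share` helper and the server names are assumptions made for illustration.

```python
import hashlib

RING_SIZE = 2**32

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

def keyspace_share(servers, vnodes_per_server):
    """Fraction of the ring owned by each physical server."""
    ring = sorted(
        (ring_hash(f"{server}#vnode{i}"), server)
        for server in servers
        for i in range(vnodes_per_server)
    )
    owned = dict.fromkeys(servers, 0)
    for i, (position, server) in enumerate(ring):
        previous = ring[i - 1][0]  # for i == 0 this is the last position (the wrap-around segment)
        # The segment (previous, position] belongs to the server at `position`
        owned[server] += (position - previous) % RING_SIZE
    return {server: size / RING_SIZE for server, size in owned.items()}

servers = ["server_A", "server_B", "server_C", "server_D"]
share_1 = keyspace_share(servers, 1)
share_256 = keyspace_share(servers, 256)

for label, share in (("1 position each", share_1), ("256 vnodes each", share_256)):
    print(label, {server: f"{fraction:.1%}" for server, fraction in share.items()})
```

The single-position shares depend entirely on where four hashes happen to land, so they typically stray far from 25%. The 256-vnode shares should cluster within a few percentage points of it.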

Bonus Benefit: Easier Rebalancing

With vnodes, when you add a new physical server, instead of taking one large chunk from one neighbor, the new server's 256 vnodes each take a small chunk from 256 different existing vnodes (scattered across all other physical servers). The data migration load is spread evenly across the entire cluster — not concentrated on one unlucky neighbor.

Real Implementation: Cassandra

Cassandra uses consistent hashing with virtual nodes (the num_tokens setting: 256 by default in older releases, 16 since Cassandra 4.0) as the foundation of its entire architecture.

Cassandra cluster: 6 nodes, 256 vnodes each = 1536 total ring positions

When you write a row:
  1. Hash the partition key → position on the ring
  2. Walk clockwise to find the "owning" vnode → identifies physical node
  3. Replicate to N distinct physical nodes clockwise from there (N = replication factor)
     (these replicas are what the W/R quorum from Day 2 operates over)

Adding a 7th node:
  - New node gets 256 new vnode positions, scattered across the ring
  - Each new vnode "steals" a small range from an existing vnode
  - Data for those ranges streams to the new node
  - ~1/7 of total data moves (not 6/7 or some larger fraction)
  - Cluster remains fully operational during this rebalancing — 
    reads/writes continue normally

This is why Cassandra clusters can grow from 10 nodes to 100 nodes over time, incrementally, without ever taking the cluster offline for a "resharding operation" — directly solving the resharding catastrophe from Day 3.

Real Implementation: Amazon Dynamo

Amazon's Dynamo paper (2007), which inspired Cassandra, Riak, and DynamoDB, used consistent hashing as a core building block specifically to solve the incremental scalability problem for their shopping cart and session storage systems, where adding capacity during traffic growth (especially around peak shopping seasons) couldn't require downtime. The hashing technique itself predates Dynamo; the paper's contribution was how it put the ring to work.

Dynamo's specific contribution was combining consistent hashing with the quorum-based replication (W + R > N) from Day 2 — the ring determines which nodes are responsible for a key, and quorum determines how many of those nodes must agree for reads/writes. Consistent hashing answers "where," quorum answers "how consistent."
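The "where" and "how consistent" split can be sketched in a few lines. The `preference_list` helper below walks the ring clockwise and keeps only distinct physical nodes, since with vnodes the next few ring positions may belong to the same machine. Node names, the vnode count, and the function names are assumptions for illustration, not Dynamo's or Cassandra's actual code.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

def build_ring(servers, vnodes_per_server=64):
    return sorted(
        (ring_hash(f"{server}#vnode{i}"), server)
        for server in servers
        for i in range(vnodes_per_server)
    )

def preference_list(ring, key, n):
    """Walk clockwise from the key and collect the first n distinct physical nodes."""
    start = bisect.bisect_left(ring, (ring_hash(key),))
    replicas = []
    for step in range(len(ring)):
        server = ring[(start + step) % len(ring)][1]
        if server not in replicas:  # skip vnodes of a node that was already chosen
            replicas.append(server)
            if len(replicas) == n:
                break
    return replicas

def quorum_overlaps(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n

ring = build_ring([f"node_{i}" for i in range(1, 7)])
replicas = preference_list(ring, "user_12345", n=3)

print("replicas for user_12345:", replicas)
print("N=3, W=2, R=2 overlaps:", quorum_overlaps(3, 2, 2))
print("N=3, W=1, R=1 overlaps:", quorum_overlaps(3, 1, 1))
```

The ring decides which three nodes hold the key; the W and R values decide how many of those three must answer before a write or read counts as successful.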

Real Implementation: Memcached Clusters

Memcached itself has no built-in clustering — each Memcached instance is independent and unaware of others. Consistent hashing happens client-side.

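# Client-side consistent hashing for Memcached
# (build_consistent_hash_ring and memcached_client are placeholder helpers, not a real library API)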
memcached_servers = ["cache1:11211", "cache2:11211", "cache3:11211"]
ring = build_consistent_hash_ring(memcached_servers, vnodes=256)

def get_from_cache(key):
    server = ring.get_server(key)  # client decides which server
    return memcached_client[server].get(key)

When a Memcached server is added or removed, the client library's ring recalculates — and because of consistent hashing's core property, only ~1/N of cache keys "miss" on the new ring topology (they'll be re-fetched from the database and re-cached on their new server). Without consistent hashing, adding/removing a Memcached server would invalidate the entire cache — a massive spike in database load as everything is re-fetched simultaneously.

Interview Scenario: "Design a Distributed Cache Using Consistent Hashing"

The structured answer:

"I'd build a ring using a hash function like MD5 or SHA-1, mapping both cache server identifiers and keys onto a fixed-size space — say 0 to 2^32. Each physical cache server would be assigned multiple virtual node positions on the ring — I'd start with around 150-256 vnodes per server, which gives good load distribution without excessive memory overhead for the ring structure itself.

For lookups, a key is hashed to a ring position, and I walk clockwise to find the first vnode — that identifies the owning physical server.

When a server is added, its vnodes claim small ranges from existing vnodes scattered across the ring — so only roughly 1/N of keys need to move or be re-fetched, not the entire cache. This is the critical property: cache availability during scaling events.

For replication and fault tolerance, I'd replicate each key to the next 2 distinct physical servers found walking clockwise from its primary position (skipping vnodes that map to a server already chosen), so if one server is down, requests fall through to a replica without a full cache miss.

One detail I'd watch for: hot keys. If a single key (a viral post) gets disproportionate traffic, consistent hashing alone doesn't help — that key still lands on one server. I'd combine this with the key salting technique from Day 3 for known hot keys."

Key Takeaways

Consistent hashing maps both servers and keys onto a circular hash space (a "ring") — a key belongs to the first server clockwise from its position.
Adding/removing one server out of N only remaps ~1/N of keys — solving the "modulo remaps everything" problem from Day 3.
Virtual nodes (100-256 per physical server) solve the uneven-segment problem of having too few ring positions, and spread rebalancing load across the entire cluster.
Cassandra uses consistent hashing + vnodes as its core architecture, enabling incremental scaling without downtime.
Amazon Dynamo combined consistent hashing (for "where") with quorum (for "how consistent") — the foundation of DynamoDB, Cassandra, and Riak.
Memcached clusters rely on client-side consistent hashing — without it, adding/removing a cache server invalidates the entire cache.
Consistent hashing doesn't solve hot keys — combine with key salting (Day 3) for that.

What's Next

Topic 23 covers Bloom Filters — the probabilistic data structure that lets Chrome check billions of malicious URLs using almost no memory, and how Cassandra uses them to avoid disk reads for keys that don't exist.

Topic 22 of the System Design Mastery series. The advanced data structures finale begins.

Tags: system-design consistent-hashing distributed-systems cassandra backend databases interview-prep

System Design - 21. Rate Limiting: The 5 Algorithms That Protect Every API on the Internet

Rajkiran — Sat, 13 Jun 2026 18:38:45 +0000

Rate Limiting: The 5 Algorithms That Protect Every API on the Internet

Series: System Design Mastery — Day 7 of 15
Reading time: 12 min
Covers: Token Bucket, Leaky Bucket, Fixed/Sliding Window, Distributed Rate Limiting with Redis, Multi-DC

The API That Got Hugged to Death

In 2013, a small startup's API went viral — a popular blog post linked directly to their public endpoint, and within minutes their servers were receiving 50x normal traffic. Not from an attack — from genuine, enthusiastic users, all hitting "refresh" on a slow-loading page, triggering retries, triggering more load.

The servers fell over within 20 minutes. The viral moment — which should have been their best day — became their worst, as the service was completely unusable exactly when the most people wanted to try it.

Rate limiting exists to prevent exactly this — whether the traffic surge comes from genuine enthusiasm, a buggy client retrying too aggressively, or a malicious attacker. The mechanism is the same: bound how much traffic any single source can send, so the system stays healthy for everyone.

Why Rate Limiting Matters Beyond "Stopping Attacks"

Most people think rate limiting = anti-DDoS. That's one use case, but not the primary one for most systems:

1. Fair resource allocation: If one customer's batch job sends 10,000 requests/second, it shouldn't degrade service for every other customer sharing the infrastructure.

2. Cost control: Each API call to a downstream paid service (a third-party API, a database query) costs money. Rate limiting bounds your maximum cost exposure.

3. Protecting downstream systems: Your API might handle 10,000 req/s fine, but your database can only handle 1,000 writes/s. Rate limiting at the API layer protects the database from being overwhelmed by API traffic.

4. Preventing retry storms: A buggy client that retries failed requests in a tight loop can accidentally generate enormous load — rate limiting caps the damage.

5. Business model enforcement: Free tier gets 100 requests/day, paid tier gets 100,000. Rate limiting is the product tier enforcement mechanism.

The 5 Rate Limiting Algorithms

Algorithm 1: Token Bucket

The mental model: A bucket holds tokens. Tokens are added at a fixed rate, up to a maximum capacity. Each request consumes one token. If the bucket is empty, the request is rejected (or queued).

Bucket capacity: 10 tokens
Refill rate: 1 token per second

Time 0s: bucket = [●●●●●●●●●●] (10 tokens, full)
Request arrives → consume 1 token → bucket = [●●●●●●●●●_] (9 tokens)

5 rapid requests arrive → consume 5 tokens → bucket = [●●●●______] (4 tokens)
(BURST allowed! All 5 requests succeeded immediately)

Time passes, tokens refill at 1/sec...
If bucket is empty and a request arrives → REJECTED (429 Too Many Requests)

Key property: allows bursts. If the bucket is full, you can make 10 requests instantly — then you're limited to the refill rate (1/sec) until the bucket replenishes.

Implementation:

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments

    def allow_request(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Refill tokens based on elapsed time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True  # Allowed
        return False  # Rejected

Best for: APIs where occasional bursts are normal and desirable — a user opening an app and making several requests at once (load dashboard widgets) shouldn't be immediately rate-limited just because they happened in the same second.

Who uses it: Token bucket is one of the most common algorithms for public APIs. Stripe, for example, has publicly described its rate limiters as token-bucket based.

Algorithm 2: Leaky Bucket

The mental model: Requests enter a queue (the "bucket"). The queue is processed ("leaks") at a constant, fixed rate, regardless of how fast requests arrive. If the queue is full, new requests are dropped.

Queue capacity: 5
Processing rate: 1 request per second (constant, regardless of input rate)

10 requests arrive instantly:
  → First 5 enter the queue
  → Remaining 5 are REJECTED (queue full)

Queue processes at exactly 1/second:
  Second 1: process request 1
  Second 2: process request 2
  Second 3: process request 3
  ...

Key property: smooths output to a constant rate. Unlike Token Bucket (which allows bursts through), Leaky Bucket guarantees the downstream system never sees more than the configured rate — even if 1000 requests arrive in the same millisecond, the downstream sees a steady drip.

Token Bucket vs Leaky Bucket — the core difference:

Token Bucket: "How many requests can I ALLOW THROUGH right now?"
  → Output rate can spike (bursts pass through immediately)

Leaky Bucket: "At what CONSTANT rate do I process requests?"
  → Output rate is always smooth, never spikes

Best for: Protecting downstream systems that genuinely cannot handle bursts — e.g., a legacy database that chokes on concurrent connections, or a third-party API with a strict "exactly N requests per second" contract.
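For comparison with the Token Bucket class above, here is a minimal leaky bucket sketch. It takes the current time as an explicit argument so the behavior is easy to follow. The class and method names are assumptions, and a real deployment would typically drive `leak` from a worker loop using a monotonic clock.

```python
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, rate_per_sec, start_time=0.0):
        self.capacity = capacity              # maximum number of queued requests
        self.interval = 1.0 / rate_per_sec    # minimum spacing between processed requests
        self.queue = deque()
        self.last_leak = start_time

    def submit(self, request):
        """Queue a request, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self, now):
        """Release at most one queued request, and only if a full interval has passed."""
        if self.queue and now - self.last_leak >= self.interval:
            self.last_leak = now
            return self.queue.popleft()
        return None

bucket = LeakyBucket(capacity=5, rate_per_sec=1)
accepted = [bucket.submit(i) for i in range(10)]   # 10 requests arrive at once
print("accepted:", accepted)

for now in (0.5, 1.0, 1.2, 2.0):
    print(f"t={now}s -> processed:", bucket.leak(now))
```

However quickly requests arrive, successive outputs are always at least one interval apart, which is the smoothing guarantee the downstream system relies on.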

Algorithm 3: Fixed Window Counter

The mental model: Divide time into fixed windows (e.g., 1-minute blocks). Count requests in the current window. Reject if the count exceeds the limit. Reset the counter when the window ends.

Limit: 100 requests per minute

Window [12:00:00 - 12:01:00]: counter = 0
  Request arrives → counter = 1
  ... 99 more requests → counter = 100
  101st request → REJECTED (limit reached)

Window [12:01:00 - 12:02:00]: counter resets to 0
  New requests allowed again

The edge case problem (this is the famous interview gotcha):

Limit: 100 requests per minute

Window 1 [12:00:00 - 12:01:00]: 
  100 requests arrive at 12:00:59 (last second of window) → all allowed

Window 2 [12:01:00 - 12:02:00]:
  100 requests arrive at 12:01:00 (first second of new window) → all allowed

Result: 200 requests in a 2-second span (12:00:59 to 12:01:01)
But the configured limit was "100 per minute"!

The fixed window resets abruptly, allowing a burst of 2x limit right at the boundary. This is a real vulnerability — attackers can exploit window boundaries to send 2x traffic.

Best for: Simple cases where this edge case doesn't matter much (internal tools, low-stakes limits). Simple to implement, very low memory overhead.
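A fixed window counter fits in a dozen lines, which also makes the boundary problem easy to reproduce. This is a single-process sketch with an explicit `now` argument (seconds); the class name is an assumption.

```python
class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow_request(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new window has started: the counter resets abruptly
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(limit=100, window_seconds=60)

# 150 attempts in the last second of window 1, then 150 in the first second of window 2
allowed_late = sum(limiter.allow_request(now=59.0) for _ in range(150))
allowed_early = sum(limiter.allow_request(now=60.0) for _ in range(150))

print("allowed at t=59s:", allowed_late)
print("allowed at t=60s:", allowed_early)
print("allowed within about one second:", allowed_late + allowed_early)
```

Each window individually respects the limit of 100, yet 200 requests get through within roughly one second of wall-clock time.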

Algorithm 4: Sliding Window Log

The mental model: Store a timestamp for every request. To check if a new request is allowed, count how many timestamps fall within the last N seconds (a true sliding window, not fixed boundaries).

Limit: 100 requests per minute

Request arrives at 12:01:30
→ Count all stored timestamps between 12:00:30 and 12:01:30
→ If count < 100, allow and store this timestamp
→ If count >= 100, reject

Old timestamps (before 12:00:30) are discarded

Key property: perfectly accurate. No boundary exploits — the window always represents exactly "the last N seconds," continuously sliding.

Implementation with Redis:

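import time
import uuid

def is_allowed_sliding_log(redis_client, user_id, limit=100, window=60):
    now = time.time()
    key = f"rate_limit:{user_id}"

    # Remove timestamps older than the window
    redis_client.zremrangebyscore(key, 0, now - window)

    # Count remaining timestamps (requests in the current window)
    count = redis_client.zcard(key)

    if count < limit:
        # Unique member per request, so two requests with the same timestamp are both counted
        redis_client.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
        redis_client.expire(key, window)
        return True
    return False

# Note: the trim, count, and add are separate calls, so concurrent requests can race;
# running the same steps inside a Lua script makes the sequence atomic.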

Disadvantage: memory heavy. Storing a timestamp per request — for a user making 100 requests/minute, that's 100 entries per user, continuously. At millions of users, this becomes a significant memory cost in Redis.

Best for: When precision matters more than memory cost — security-critical rate limits (login attempts, password reset requests) where the 2x boundary exploit of Fixed Window is unacceptable.

Algorithm 5: Sliding Window Counter (The Best Balance)

The mental model: A hybrid — combine the previous window's count and the current window's count, weighted by how far into the current window we are. Approximates the Sliding Window Log without storing every timestamp.

Limit: 100 requests per minute

Previous window [12:00:00-12:01:00]: 80 requests
Current window  [12:01:00-12:02:00]: 30 requests so far
Current time: 12:01:15 (25% into the current window)

Weighted count = (previous_window_count × (1 - elapsed_fraction)) 
                  + current_window_count
                = (80 × (1 - 0.25)) + 30
                = (80 × 0.75) + 30
                = 60 + 30
                = 90

90 < 100 → request ALLOWED

This formula approximates "how many requests occurred in the trailing 60 seconds" using just two counters (previous window, current window) instead of storing every timestamp.

Why this is the industry standard:

Memory: O(1) per user (just 2 numbers), vs O(N) for Sliding Window Log
Accuracy: very close to true sliding window — eliminates the 2x boundary exploit of Fixed Window
This is what Cloudflare and most major API providers use in production
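The weighted-count formula translates almost directly into code. The sketch below is a single-process version with an explicit `now` argument (seconds); the class name and the window-rolling details are assumptions rather than any particular provider's implementation.

```python
class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_id = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now):
        window_id = int(now // self.window)
        if window_id == self.window_id + 1:
            # Moved into the next window: current becomes previous
            self.previous_count, self.current_count = self.current_count, 0
        elif window_id != self.window_id:
            # Idle for more than one full window: both counters are stale
            self.previous_count = self.current_count = 0
        self.window_id = window_id

    def weighted_count(self, now):
        self._roll(now)
        elapsed_fraction = (now % self.window) / self.window
        return self.previous_count * (1 - elapsed_fraction) + self.current_count

    def allow_request(self, now):
        if self.weighted_count(now) < self.limit:
            self.current_count += 1
            return True
        return False

# Reproduce the worked example: 80 requests in the previous window, 30 in the current one
limiter = SlidingWindowCounter(limit=100, window_seconds=60)
for _ in range(80):
    limiter.allow_request(now=30.0)
for _ in range(30):
    limiter.allow_request(now=75.0)
estimate = limiter.weighted_count(now=75.0)  # 25% into the current window
print("weighted count:", estimate)

# The Fixed Window boundary burst no longer works
edge = SlidingWindowCounter(limit=100, window_seconds=60)
allowed_late = sum(edge.allow_request(now=59.0) for _ in range(150))
allowed_early = sum(edge.allow_request(now=60.0) for _ in range(150))
print("allowed at t=59s:", allowed_late, "| allowed at t=60s:", allowed_early)
```

At the window boundary the previous window still carries full weight, so the second burst is rejected instead of doubling the effective limit.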

Distributed Rate Limiting with Redis

In a system with multiple API servers behind a load balancer, rate limiting must be shared across all of them — otherwise, a user could hit server A's limit, then send requests to server B which has its own independent counter, effectively multiplying their limit by the number of servers.

The solution: centralized counting in Redis, accessed by all API servers.

# Token Bucket using Redis + Lua script for atomicity
import time

LUA_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill based on elapsed time (clamped so clock skew between API servers never removes tokens)
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

-- Persist state on both paths (multi-field HSET needs Redis 4.0+; HMSET is deprecated).
-- Idle buckets expire only after they would have refilled completely.
redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
return allowed  -- 1 = allowed, 0 = rejected
"""

def is_allowed(redis_client, user_id, capacity=10, refill_rate=1):
    now = time.time()
    result = redis_client.eval(
        LUA_SCRIPT, 1, f"rate_limit:{user_id}", capacity, refill_rate, now
    )
    return result == 1

Why Lua scripting matters here: Without it, "check tokens, then update tokens" is two separate Redis calls, which is a race condition. Two simultaneous requests could both read "1 token available," both proceed, and both decrement, even though the single remaining token should have allowed only one of them.

Lua scripts execute atomically in Redis — the entire check-and-update happens as one indivisible operation, eliminating the race condition. This is the standard pattern for distributed rate limiting.

Response Headers: Communicating Limits to Clients

A well-designed rate-limited API tells clients where they stand:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 37
X-RateLimit-Reset: 1718888940

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1718888940
Retry-After: 45

Retry-After tells the client exactly how long to wait before retrying — directly enabling well-behaved exponential backoff (Topic 18) on the client side. A well-designed rate limiter doesn't just reject requests — it tells clients how to behave.
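As a rough sketch, the headers above could be produced by a small helper like this one. The function name and parameters are assumptions. The X-RateLimit-* names are a widely used convention rather than a formal standard, while Retry-After is a standard HTTP header.

```python
import math

def rate_limit_headers(limit, remaining, reset_epoch, now_epoch):
    """Build rate-limit response headers; Retry-After is added only when the client is out of quota."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if remaining <= 0:
        # Seconds until the limit resets, rounded up so the client never retries too early
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now_epoch)))
    return headers

ok_headers = rate_limit_headers(limit=100, remaining=37,
                                reset_epoch=1718888940, now_epoch=1718888895)
limited_headers = rate_limit_headers(limit=100, remaining=0,
                                     reset_epoch=1718888940, now_epoch=1718888895)

print(ok_headers)
print(limited_headers)
```

A client that honors Retry-After can sleep for that many seconds before its next attempt instead of guessing a backoff interval.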

Tiered Rate Limits

Real APIs don't have one global limit — different user tiers get different limits:

TIER_LIMITS = {
    "free":       {"capacity": 100,    "refill_rate": 100/86400},     # 100/day
    "pro":        {"capacity": 10000,  "refill_rate": 10000/3600},    # 10K/hour
    "enterprise": {"capacity": 100000, "refill_rate": 100000/60},     # 100K/min
}

def get_rate_limit_config(user_id):
    user_tier = user_config_service.get_tier(user_id)  # cached lookup
    return TIER_LIMITS[user_tier]

This is how API products like Stripe, Twilio, and GitHub enforce their pricing tiers — rate limiting is the enforcement mechanism for "upgrade to get higher limits."

Global vs Per-Service Rate Limiting

Where should rate limiting live in your architecture?

Option 1: At the API Gateway (most common)
  Single enforcement point, before requests reach any backend service
  ✓ Consistent across all services
  ✓ Protects all downstream services uniformly
  ✗ Gateway must be fast — adds latency to every request

Option 2: Per-service
  Each service enforces its own limits
  ✓ Services can have different limits based on their specific load capacity
  ✗ Inconsistent enforcement, duplicated logic
  ✗ A user could exceed the "global" intent by spreading requests across services

Option 3: Both (defense in depth)
  Gateway enforces overall user/API-key limits
  Individual services enforce limits specific to expensive operations
  (e.g., "image processing" endpoint has a stricter limit than "get profile")

Most production systems use Option 3 — coarse-grained limiting at the gateway (overall fairness, DDoS protection) plus fine-grained limiting at specific expensive endpoints.

Multi-Datacenter Rate Limiting: The Hard Problem

If your Redis cluster is per-region, and a user's requests are routed to different regions (geo-routing, failover), each region's Redis has an independent count — a user could get limit × number_of_regions total throughput.

Approach 1: Global Redis (cross-region)
A single Redis cluster, accessed by all regions. Simple, but adds cross-region latency to every request (Day 1: 150ms cross-continent) — often unacceptable.

Approach 2: Per-region limits (a static split of the global budget)
Each region gets limit / number_of_regions as its local limit. Simple, but if traffic is unevenly distributed (all traffic happens to hit one region), that region's limit may be too restrictive even though the global budget isn't exhausted.

Approach 3: Async global synchronization
Each region rate-limits locally with a slightly generous local limit, and periodically syncs counts to a global store. There's a small window of "overshoot" (a user could exceed the true global limit briefly), but most systems accept this trade-off for the latency win.

The honest answer for interviews: "Perfect global rate limiting across multiple datacenters with zero added latency and zero overshoot is fundamentally a trade-off — you can have strong consistency (global Redis, adds latency) or low latency (per-region, allows some overshoot), but not both. Most systems choose per-region with generous local limits and accept brief overshoot as the lesser evil — similar to the AP choice in CAP theorem from Day 2."
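The overshoot in Approach 3 is easier to reason about with a toy model. In this sketch each region admits requests against the last global count it knows about and reconciles with a shared store only when `sync` is called. `GlobalStore` is an in-memory stand-in for a cross-region store, all names are assumptions, and window resets are left out to keep the sketch short.

```python
class GlobalStore:
    """In-memory stand-in for a store that every region can reach (slowly)."""
    def __init__(self):
        self.total = 0

    def add_and_get(self, delta):
        self.total += delta
        return self.total

class RegionLimiter:
    def __init__(self, store, global_limit):
        self.store = store
        self.limit = global_limit
        self.known_global = 0  # global count as of the last sync
        self.unsynced = 0      # requests admitted locally since the last sync

    def allow_request(self):
        # Local decision only: no cross-region round trip on the request path
        if self.known_global + self.unsynced < self.limit:
            self.unsynced += 1
            return True
        return False

    def sync(self):
        # Periodic background step: publish local admissions, learn the global total
        self.known_global = self.store.add_and_get(self.unsynced)
        self.unsynced = 0

store = GlobalStore()
us, eu = RegionLimiter(store, global_limit=100), RegionLimiter(store, global_limit=100)

# Between syncs, each region sees only its own traffic
admitted = sum(us.allow_request() for _ in range(60)) + sum(eu.allow_request() for _ in range(60))
print("admitted before any sync:", admitted)  # above the global limit of 100

us.sync(); eu.sync(); us.sync()
print("after sync, us allows:", us.allow_request(), "| eu allows:", eu.allow_request())
```

The overshoot is bounded by how much traffic arrives between syncs, so the sync interval is the knob that trades accuracy against cross-region chatter.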

Key Takeaways

Token Bucket: allows bursts, refills at a steady rate — most common for public APIs.
Leaky Bucket: smooths output to a constant rate — best for protecting rate-sensitive downstream systems.
Fixed Window Counter: simple but has a boundary exploit (2x burst at window edges).
Sliding Window Log: perfectly accurate, but memory-heavy (stores every timestamp).
Sliding Window Counter: the industry-standard balance — O(1) memory, near-perfect accuracy, used by Cloudflare and most major APIs.
Distributed rate limiting requires centralized state (Redis) with atomic operations (Lua scripts) to avoid race conditions across multiple API servers.
Response headers (X-RateLimit-*, Retry-After) let clients behave well — combine with Topic 18's exponential backoff.
Multi-DC rate limiting is a genuine CAP-theorem-style trade-off between strict accuracy and low latency — most systems accept some overshoot for speed.

You've now covered Security (authentication, authorization, Zero Trust), Observability (the 3 pillars, golden signals, alerting), and Rate Limiting (the 5 algorithms and distributed implementation). These three topics form the protective and diagnostic layer that wraps around every system you'll design.

Next — the final day of Phase 2 — covers the advanced data structures every senior engineer should know: Consistent Hashing, Bloom Filters, Geospatial Indexing, and a grab-bag of structures (Skip Lists, HyperLogLog, Tries, LSM Trees, B+ Trees) that power the internals of the databases and caches you've been learning about all week.

Tags: system-design rate-limiting redis api-design backend distributed-systems interview-prep

System Design - 20. Observability: The 3 Pillars, 4 Golden Signals, and How Netflix Debugs 100 Microservices

Rajkiran — Sat, 13 Jun 2026 18:33:10 +0000

Observability: The 3 Pillars, 4 Golden Signals, and How Netflix Debugs 100 Microservices

Series: System Design Mastery — Day 7 of 15
Reading time: 11 min
Covers: Metrics/Logs/Traces, 4 Golden Signals, Distributed Tracing, Alert Fatigue, SLO-Based Alerting

The 3am Page With No Answer

Imagine you're on-call. At 3am, an alert fires: "API error rate above threshold."

You check the dashboard. Errors are up — from 0.1% to 4%. But why? Which service? Which endpoint? Which users? Is it one bad deploy, a downstream dependency failing, a database issue, or something else entirely?

In a monolith, you'd check one log file. In a system with 100 microservices, the request that failed might have passed through 8 services before erroring. Which one actually failed? Without the right tooling, you're grep-ing through 100 different log streams hoping to find a needle in a haystack — at 3am.

Observability is the discipline of building systems that can answer "why is this happening?" — not just "is something happening?" The difference between monitoring and observability is the difference between a smoke alarm and being able to see exactly which wire is overheating.

The 3 Pillars of Observability

Pillar 1: Metrics

Metrics are numerical measurements over time — counters, gauges, and histograms.

Counter:    requests_total{service="payment", status="200"} = 145,302
Gauge:      active_connections{service="payment"} = 47
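Histogram:  request_duration_seconds_bucket{service="payment", le="0.5"} = 143,821
            (cumulative count of requests that took 0.5s or less; percentiles such as p99 are estimated from these buckets)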

Prometheus is the dominant open-source metrics system. Services expose a /metrics endpoint; Prometheus periodically "scrapes" (polls) this endpoint and stores the time-series data.

# Example /metrics endpoint output
http_requests_total{method="GET",status="200"} 145302
http_requests_total{method="GET",status="500"} 23
http_request_duration_seconds_bucket{le="0.1"} 98234
http_request_duration_seconds_bucket{le="0.5"} 143821

Grafana visualizes this data — dashboards showing request rates, error rates, latency percentiles, resource usage, all in real-time graphs.

Strengths: Extremely efficient storage (numbers compress well), great for trends and alerting ("error rate > 5% for 5 minutes → alert"), low overhead.

Weakness: Metrics tell you that something is wrong (error rate spiked) but not why (which specific request, which user, what error message).
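Latency percentiles on a dashboard are usually estimated from cumulative histogram buckets like the `le="0.1"` and `le="0.5"` lines shown earlier. The sketch below does that estimate by linear interpolation inside the bucket that contains the target rank, similar in spirit to Prometheus's `histogram_quantile`. The function is a simplified illustration, and the `+Inf` bucket count (145,325, the sum of the two request counters) is an assumed value since the sample output does not show it.

```python
import math

def estimate_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Rank falls beyond the last finite bucket: the best we can report is that bound
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound

buckets = [(0.1, 98234), (0.5, 143821), (math.inf, 145325)]  # +Inf count is assumed
p50 = estimate_quantile(0.50, buckets)
p99 = estimate_quantile(0.99, buckets)

print(f"p50 ~ {p50:.3f}s")
print(f"p99 ~ {p99:.3f}s (capped at the highest finite bucket bound)")
```

The p99 result shows a practical limit of histograms: when the rank falls in the open-ended top bucket, the estimate can only say "at least 0.5s", so bucket boundaries need to cover the latencies you care about.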

Pillar 2: Logs

Logs are timestamped records of discrete events — usually text, often structured as JSON.

{
  "timestamp": "2024-06-13T03:14:22.103Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "message": "Payment authorization failed",
  "user_id": "user_98765",
  "error": "card_declined",
  "amount": 4999
}

The ELK Stack (Elasticsearch, Logstash, Kibana) — or its modern variants (OpenSearch, Loki + Grafana) — is the standard for log aggregation:

Every service → writes structured JSON logs to stdout
       ↓
Log collector (Fluentd/Filebeat) → ships logs to Elasticsearch
       ↓
Elasticsearch → indexes logs for fast search
       ↓
Kibana → search/filter/visualize: 
  "show me all ERROR logs from payment-service in the last hour 
   where user_id=user_98765"

Structured logging matters enormously. Compare:

Unstructured: "Payment failed for user 98765, card declined, amount $49.99"
Structured:   {"event": "payment_failed", "user_id": "98765", 
               "reason": "card_declined", "amount": 4999}

The structured version can be queried, aggregated, and filtered programmatically. "Show me all payment failures with reason=card_declined in the last hour, grouped by amount range" — trivial with structured logs, painful with text parsing.
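With Python's standard `logging` module, structured output can be as small as a custom formatter. This is a minimal sketch: the `JsonFormatter` class, the hard-coded service name, and the convention of passing event data under an `extra={"fields": ...}` key are all assumptions made for illustration.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),  # local time by default
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "Payment authorization failed",
    extra={"fields": {"user_id": "user_98765", "error": "card_declined", "amount": 4999}},
)

entry = json.loads(stream.getvalue())
print(entry)
```

Because every line is valid JSON, the log collector can index `user_id` and `error` as fields instead of parsing them out of free text.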

Log levels and sampling in production:

DEBUG → only in development (too verbose for production)
INFO  → significant events (request received, order placed)
WARN  → recoverable issues (retry succeeded after 1 failure)
ERROR → failures requiring attention

At high traffic: sample DEBUG/INFO logs (e.g., log 1% of successful 
requests) to reduce volume and cost, but log 100% of ERROR/WARN.
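One way to express that sampling rule is a `logging.Filter` that always passes WARN and above and passes lower levels with a fixed probability. The class names and the 1% rate below are illustrative assumptions, and the seeded random generator is there only to make the demo repeatable.

```python
import logging
import random

class SamplingFilter(logging.Filter):
    def __init__(self, sample_rate, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # always keep WARN and ERROR
        return self.rng.random() < self.sample_rate  # keep a fraction of INFO/DEBUG

class ListHandler(logging.Handler):
    """Collects records in memory so the demo can count them."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)

handler = ListHandler()
logger = logging.getLogger("sampling-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(SamplingFilter(sample_rate=0.01, rng=random.Random(42)))

for _ in range(1000):
    logger.info("request served")
for _ in range(10):
    logger.error("request failed")

kept_info = sum(r.levelno == logging.INFO for r in handler.records)
kept_error = sum(r.levelno == logging.ERROR for r in handler.records)
print(f"kept {kept_info} of 1000 INFO records and {kept_error} of 10 ERROR records")
```

Random sampling per record is the simplest option; sampling per trace ID instead keeps all the logs for a sampled request together.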

Weakness: Logs are siloed per service by default. Correlating "this user's request failed" across 8 services requires a shared identifier — which brings us to traces.

Pillar 3: Traces

Distributed tracing follows a single request as it flows through multiple services, recording the time spent in each.

Trace ID: abc123def456 (one ID for the ENTIRE request journey)

┌─────────────────────────────────────────────────────────┐
│ Span: API Gateway              [0ms ─────────────── 245ms]│
│   └─ Span: Order Service          [5ms ──────── 230ms]    │
│        └─ Span: Payment Service      [10ms ── 180ms]      │
│             └─ Span: Database query     [15ms─150ms] ←SLOW│
│        └─ Span: Inventory Service    [185ms─220ms]        │
└─────────────────────────────────────────────────────────┘

Total request time: 245ms
The Database query inside Payment Service took 135ms 
— that's the bottleneck.

Key concepts:

Trace = the entire journey of one request across all services.
Span = one unit of work within that journey (e.g., "Payment Service processing", "Database query"). Spans have a parent-child relationship, forming a tree.
Trace context propagation = passing the trace_id and span_id through HTTP headers as the request hops between services:

Service A makes a call to Service B:
  HTTP Headers:
    traceparent: 00-abc123def456-span001-01
                     │trace_id│  │span_id│

Service B continues the trace:
  Creates a new span (span002) as a child of span001
  Passes traceparent: 00-abc123def456-span002-01 to Service C
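The IDs in the example above are shortened for readability. In the W3C Trace Context format the header is `version-traceid-parentid-flags`, with a 32-hex-character trace ID and a 16-hex-character span ID. Here is a small sketch of what Service B does with it: parse the incoming header, start a child span, and build the header to pass downstream. The function names are assumptions, and a real service would normally let a tracing library such as OpenTelemetry handle this.

```python
import re
import secrets

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    match = TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}

def start_child_span(incoming_header):
    """Continue the caller's trace: same trace_id, new span_id, caller's span as parent."""
    ctx = parse_traceparent(incoming_header)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    span = {
        "trace_id": ctx["trace_id"],
        "span_id": span_id,
        "parent_span_id": ctx["parent_id"],
    }
    # The header sent downstream names this span as the parent of whatever comes next
    outgoing_header = f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
    return span, outgoing_header

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span, outgoing = start_child_span(incoming)

print("span recorded by Service B:", span)
print("header sent to Service C:  ", outgoing)
```

The trace ID never changes along the path, which is what lets the tracing backend stitch spans from different services into one tree.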

Jaeger and Zipkin are the dominant open-source tracing systems. Google Dapper (the internal system that inspired both) was one of the first large-scale implementations — Google needed it because a single search query could touch hundreds of internal services.

Why traces are essential at scale: Metrics tell you "p99 latency is 245ms." Traces tell you "...and it's because the database query inside Payment Service is taking 135ms of that." Without traces, you're debugging blind in a microservices architecture.
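Finding the bottleneck in a trace comes down to computing each span's self time: its own duration minus the time spent in its children. The sketch below applies that to the example trace from this section. The span dictionaries are a simplified stand-in for what a tracing backend stores, and the calculation assumes child spans of the same parent do not overlap in time.

```python
spans = [
    {"name": "API Gateway",       "parent": None,              "start": 0,   "end": 245},
    {"name": "Order Service",     "parent": "API Gateway",     "start": 5,   "end": 230},
    {"name": "Payment Service",   "parent": "Order Service",   "start": 10,  "end": 180},
    {"name": "Database query",    "parent": "Payment Service", "start": 15,  "end": 150},
    {"name": "Inventory Service", "parent": "Order Service",   "start": 185, "end": 220},
]

def self_times(spans):
    """Time each span spent on its own work, excluding time inside child spans (ms)."""
    duration = {s["name"]: s["end"] - s["start"] for s in spans}
    self_time = dict(duration)
    for s in spans:
        if s["parent"] is not None:
            self_time[s["parent"]] -= duration[s["name"]]
    return self_time

times = self_times(spans)
bottleneck = max(times, key=times.get)

for name, ms in times.items():
    print(f"{name:<18} self time: {ms}ms")
print("bottleneck:", bottleneck)
```

Payment Service looks slow by total duration (170ms), but most of that is the database query underneath it. Self time is what points at the span actually doing the slow work.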

How the 3 Pillars Work Together

3am Alert: "Payment Service error rate > 5%" (from METRICS)
       ↓
Click into Grafana dashboard → see error spike started at 3:02am
       ↓
Filter LOGS for payment-service, level=ERROR, around 3:02am
       ↓
Find: "Database connection pool exhausted" — but WHY?
       ↓
Pick a trace_id from one of the failed requests → open in Jaeger (TRACES)
       ↓
Trace shows: Inventory Service is taking 8 seconds (normally 50ms) 
→ Payment Service's calls to Inventory are timing out
→ Connection pool fills up waiting for Inventory's slow responses
       ↓
Root cause found: Inventory Service had a bad deploy at 3:00am 
that introduced a slow database query.

Metrics told you something was wrong and roughly when. Logs gave you the specific error. Traces revealed the actual root cause was in a different service than the one alerting. This investigation — which could take hours of grep-ing without proper observability — takes minutes with all 3 pillars integrated.

The 4 Golden Signals

Google's SRE book defines 4 Golden Signals — if you can only monitor 4 things, monitor these:

1. Latency

How long do requests take? Critical: distinguish successful request latency from failed request latency. A request that fails fast (400 Bad Request in 2ms) shouldn't be averaged together with successful requests — it'll make your latency look artificially good while masking real problems.

2. Traffic

How much demand is the system experiencing? Requests per second, concurrent connections, queue depth. Traffic patterns reveal trends (growth, seasonality) and anomalies (sudden spikes — could be legitimate viral growth or an attack).

3. Errors

What fraction of requests are failing? Both explicit failures (500 errors) and implicit ones (200 OK but wrong content, policy violations). Track error rate as a percentage of traffic, not absolute count — 50 errors out of 100 requests is very different from 50 errors out of 1,000,000.

4. Saturation

How "full" is your system? CPU, memory, disk I/O, connection pool utilization. Saturation often predicts problems before they cause errors — a connection pool at 95% utilization will hit 100% (and start failing) soon.

Dashboard for ANY service should show these 4 at a glance:

┌─────────────┬─────────────┬─────────────┬─────────────┐
│   LATENCY    │   TRAFFIC    │   ERRORS     │  SATURATION  │
│  p50: 45ms   │  1,240 req/s │  0.3%        │  CPU: 62%    │
│  p99: 380ms  │  ▲ trending  │  ▼ trending  │  Mem: 71%    │
│              │     up       │     down     │  Conns: 85%  │
└─────────────┴─────────────┴─────────────┴─────────────┘
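
As a rough sketch of how the four signals on that dashboard might be derived from raw request records: the `Request` record, the `golden_signals` helper, and the nearest-rank percentile below are illustrative assumptions, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct; None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def golden_signals(records, window_seconds, pool_in_use, pool_size):
    ok = [r.latency_ms for r in records if 200 <= r.status < 400]
    failed = [r for r in records if r.status >= 500]
    return {
        # Latency: successful requests only, so fast failures don't flatter it
        "latency_p50_ms": percentile(ok, 50),
        "latency_p99_ms": percentile(ok, 99),
        # Traffic: demand per second over the window
        "traffic_rps": len(records) / window_seconds,
        # Errors: a rate, not an absolute count
        "error_rate": len(failed) / len(records) if records else 0.0,
        # Saturation: how full a bounded resource is (here, a connection pool)
        "saturation": pool_in_use / pool_size,
    }


records = [Request(40, 200), Request(50, 200), Request(60, 200),
           Request(2, 400), Request(900, 500)]
signals = golden_signals(records, window_seconds=5, pool_in_use=85, pool_size=100)
print(signals)
```

Note that the 2ms `400` response is left out of the latency percentiles, which is the "don't average fast failures with successes" point made under Latency.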

If you're designing a monitoring system in an interview, structuring your answer around these 4 signals demonstrates you know the industry-standard framework — not just "I'd add some dashboards."

Alert Fatigue: When Everything Is an Alert, Nothing Is

A common failure mode: a team sets up alerts for everything. CPU > 70%? Alert. Memory > 80%? Alert. Any 500 error? Alert. Latency > 100ms? Alert.

Within weeks, the on-call engineer is receiving 50+ alerts per day — most of which are noise (a single 500 error that auto-recovered, a brief CPU spike during a scheduled job). Engineers start ignoring alerts, muting channels, or worse — missing the one alert that mattered because it was buried in noise.

This is alert fatigue, and it's a leading cause of missed real incidents.

Severity Tiers

P1 (Page immediately, wake someone up):
  - Service completely down
  - Error rate > 50%
  - Data loss risk

P2 (Notify during business hours, investigate same day):
  - Error rate elevated but service functional (5-10%)
  - Latency degraded but within tolerable range
  - One replica down (but others healthy)

P3 (Log for review, no immediate action):
  - Minor anomalies
  - Resource usage trending toward thresholds (not yet critical)

Only P1 should page someone at 3am. P2 and P3 should be visible on dashboards and reviewed during business hours.
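
One way to picture the tiers is as a small routing function. This is only a sketch: `classify_alert` and `should_page` are hypothetical names, and treating everything between 5% and 50% as P2 is an assumption (the tiers above only spell out 5-10%).

```python
def classify_alert(error_rate, service_down=False, data_loss_risk=False):
    """Map a few signals to a severity tier. Thresholds are illustrative."""
    if service_down or data_loss_risk or error_rate > 0.50:
        return "P1"  # page immediately
    if error_rate >= 0.05:
        return "P2"  # notify during business hours
    return "P3"      # log for review


def should_page(severity):
    """Only P1 is allowed to wake someone up."""
    return severity == "P1"


print(classify_alert(0.07), should_page(classify_alert(0.07)))
```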

SLO-Based Alerting: The Modern Approach

Threshold-based alerts ("CPU > 70%") are noisy because they don't reflect user impact. SLO-based alerting (introduced in Day 1) flips this: alert based on error budget burn rate — how fast you're consuming your allowed unreliability.

SLO: 99.9% availability = 43.8 minutes of downtime allowed per month
   = 0.1% error budget

Burn rate alerting:
  "Are we consuming our monthly error budget faster than 
   we can sustain?"

Fast burn (page immediately):
  Consuming 1 hour's worth of budget every 5 minutes (burn rate 12)
  → At this rate, the ENTIRE monthly budget is gone in about 2.5 days
  → This is a genuine emergency

Slow burn (notify, don't page):
  Consuming 2 hours' worth of budget every hour (burn rate 2)
  → The budget would run out in about 15 days, before the month ends
  → Concerning, but you have time to investigate during business hours

Why this is better than threshold alerts: A threshold alert (error rate > 1%) fires the same way whether it's a brief 30-second blip or a sustained outage. Burn-rate alerting distinguishes "brief blip that barely touches the error budget" from "sustained issue that will blow through the entire month's budget by lunch" — and pages accordingly.

Google's multi-window, multi-burn-rate alerting (from the SRE workbook) uses multiple time windows (5 minutes AND 1 hour) to catch both sudden spikes and slow leaks, while filtering out transient noise that self-resolves.
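
The arithmetic behind burn rate is simple enough to sketch. The helper names below are hypothetical, and the 14.4 threshold (roughly 2% of a 30-day budget burned in one hour) is taken as an assumed paging threshold in the spirit of the SRE workbook, not a value this series prescribes.

```python
def error_budget_minutes(slo, period_minutes=30.44 * 24 * 60):
    """Allowed downtime for the period (default: an average-length month)."""
    return (1 - slo) * period_minutes


def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' we are burning.
    1.0 means the budget lasts exactly the whole period."""
    return error_rate / (1 - slo)


def should_page(err_5m, err_1h, slo, threshold=14.4):
    """Multi-window check: the long AND the short window must both exceed
    the threshold, so a blip that has already recovered does not page."""
    return (burn_rate(err_1h, slo) >= threshold
            and burn_rate(err_5m, slo) >= threshold)


print(round(error_budget_minutes(0.999), 1))  # minutes of budget per month
print(should_page(err_5m=0.02, err_1h=0.02, slo=0.999))
```

The short window is what lets the alert clear quickly once the problem is fixed; the long window is what keeps a 30-second blip from paging anyone.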

On-Call Culture: Runbooks and Blameless Postmortems

Observability tooling is only half the story — the human processes around incidents matter just as much.

Runbooks: A documented procedure for responding to a specific alert.

Alert: "Payment Service error rate > 5%"

Runbook:
1. Check Grafana dashboard: payment-service overview
2. Check recent deploys: did a deploy happen in the last 30 minutes?
   → If yes, consider rolling back: `kubectl rollout undo deployment/payment-service`
3. Check downstream dependencies (Inventory, Fraud Check) — 
   are THEIR error rates also elevated?
4. Check database connection pool saturation
5. If unresolved in 15 minutes, escalate to #payments-oncall

Runbooks turn "3am panic" into "follow the steps" — dramatically reducing time-to-resolution and stress on whoever's on call.

Blameless postmortems: After an incident, the team writes up what happened — without assigning blame to individuals. The focus is entirely on systemic factors: "Why did our monitoring not catch this sooner? Why did the deploy process allow a breaking change to reach production? What guardrails can we add?"

Why blameless matters: If engineers fear blame for incidents, they hide mistakes, don't report near-misses, and don't share context that could help prevent future issues. Blameless culture (pioneered by Etsy, championed by Google SRE) treats incidents as learning opportunities for the system, not performance issues for individuals.

Real Example: Netflix's Observability at Scale

Netflix operates one of the largest microservices deployments in the world — thousands of services, processing billions of requests daily. Their observability stack includes:

Atlas — Netflix's in-house metrics platform, purpose-built to handle the cardinality (millions of unique metric combinations) at their scale
Distributed tracing integrated into their service mesh
Automated canary analysis — when deploying a new version, Netflix automatically compares the new version's metrics against the old version's, and automatically rolls back if the new version shows statistically significant degradation — without human intervention
Chaos engineering (from Day 1) feeds directly into observability — when Chaos Monkey kills an instance, the team verifies their dashboards and alerts actually detect and surface it correctly

The meta-lesson: Observability isn't just for debugging incidents after they happen — at Netflix's scale, it's integrated into the deployment pipeline itself, automatically preventing bad deploys from ever reaching most users.

Interview Scenario: "Design Monitoring for 100 Microservices"

The structured answer:

"I'd start with the 4 Golden Signals as the baseline for every service — latency, traffic, errors, and saturation — exposed via Prometheus metrics with a standard dashboard template so every team's service looks consistent.

For logs, I'd enforce structured JSON logging across all services, shipped to a centralized system like Elasticsearch, with a mandatory trace_id field in every log line.

For traces, I'd implement distributed tracing with context propagation through HTTP headers — likely using OpenTelemetry as the instrumentation standard, since it's vendor-neutral and works with Jaeger, Zipkin, or commercial backends.

For alerting, rather than static thresholds per service — which would create alert fatigue at 100 services — I'd implement SLO-based burn-rate alerting. Each service defines its own SLO appropriate to its criticality, and alerts fire based on error budget consumption rate, with severity tiers so only genuine emergencies page on-call at 3am.

Finally, I'd pair this with runbooks for common alerts and a blameless postmortem process — because observability tooling without good incident response processes just means you find out about problems faster without necessarily resolving them faster."

Key Takeaways

3 Pillars: Metrics (numerical trends, efficient, good for alerting), Logs (detailed events, good for specific errors), Traces (request journeys across services, good for root cause).
Structured logging (JSON) is essential — unstructured text logs can't be queried programmatically at scale.
Distributed tracing uses trace IDs propagated via headers to follow one request across many services — essential for microservices debugging.
4 Golden Signals: Latency, Traffic, Errors, Saturation — the minimum viable dashboard for any service.
Alert fatigue is real and dangerous — use severity tiers (P1/P2/P3), and only page for genuine emergencies.
SLO-based burn-rate alerting distinguishes brief blips from sustained issues — far less noisy than static thresholds.
Runbooks + blameless postmortems turn observability data into faster resolution and systemic learning — tooling alone isn't enough.

What's Next

Topic 21 closes with Rate Limiting — Token Bucket, Leaky Bucket, Sliding Window algorithms, and how to implement distributed rate limiting with Redis that works correctly across multiple data centers.

Tags: system-design observability monitoring sre backend distributed-systems interview-prep

System Design - 19. Authentication & Authorization: OAuth2, JWT, and the Equifax Breach That Changed Everything

Rajkiran — Sat, 13 Jun 2026 18:30:03 +0000

Authentication & Authorization: OAuth2, JWT, and the Equifax Breach That Changed Everything

Covers: OAuth2 Flow, JWT vs Sessions, SAML, RBAC vs ABAC, mTLS, Zero Trust, Token Revocation

The Breach That Exposed 147 Million People

In 2017, Equifax — one of the three major US credit bureaus — suffered a breach that exposed the personal data of 147 million people: Social Security numbers, birth dates, addresses, and in some cases credit card numbers.

The root cause wasn't a sophisticated zero-day exploit. It was a known vulnerability in Apache Struts that Equifax had failed to patch for months after a fix was available — combined with an internal network where, once an attacker got in, they could move laterally with minimal additional authentication.

The lesson the security world took from this: authentication and authorization can't be an afterthought, and they can't be "strong at the perimeter, weak inside." This is the philosophy behind Zero Trust — and it's reshaped how every system designs identity and access from the ground up.

Today we cover the core building blocks: how users prove who they are (authentication), how systems decide what they can do (authorization), and the protocols that make this work at scale.

Authentication vs Authorization: The Distinction That Matters

These two words get conflated constantly, but they answer fundamentally different questions:

Authentication (AuthN): Who are you? Verifying identity. Logging in with a password, fingerprint, or token.

Authorization (AuthZ): What are you allowed to do? Verifying permissions. Once we know you're "Priya," can Priya delete this file?

Authentication: "Prove you're Priya"
  → Password check, biometric, OTP

Authorization: "Is Priya allowed to delete order #4521?"
  → Check Priya's role, ownership, permissions

A system can authenticate you perfectly and still deny you access — you proved who you are, but you don't have permission for this specific action.

OAuth2: The Protocol Behind "Sign in with Google"

OAuth2 isn't actually an authentication protocol — it's an authorization framework. It answers: "Can this third-party app access my Google Calendar, without me giving the app my Google password?"

The Authorization Code Grant Flow (Most Common)

This is the flow you experience every time you click "Sign in with Google" on a website.

┌────────┐                                          ┌──────────┐
│  User   │                                          │  Google   │
│(Browser)│                                          │ (Auth     │
└───┬────┘                                          │ Server)   │
    │                                                └─────┬────┘
    │  1. Click "Sign in with Google" on YourApp           │
    │ ──────────────────────────────────────────────────► │
    │                                                       │
    │  2. Redirect to Google login                         │
    │ ◄──────────────────────────────────────────────────  │
    │                                                       │
    │  3. User logs in, approves permissions               │
    │ ──────────────────────────────────────────────────► │
    │                                                       │
    │  4. Redirect back to YourApp with AUTHORIZATION CODE │
    │ ◄──────────────────────────────────────────────────  │
    │                                                       │
┌───▼────┐                                                  │
│ YourApp │  5. YourApp's BACKEND exchanges code for tokens │
│(Server) │ ──────────────────────────────────────────────► │
└───┬────┘                                                  │
    │  6. Returns: access_token + refresh_token             │
    │ ◄──────────────────────────────────────────────────  │
    │                                                       │
    │  7. YourApp uses access_token to call Google APIs    │
    │ ──────────────────────────────────────────────────► │

Why the "authorization code" step exists (and isn't just the token directly):

The redirect in step 4 happens through the browser — visible in the URL, browser history, server logs. If Google sent the actual access_token in this redirect, it would be exposed in all those places.

Instead, Google sends a short-lived, single-use authorization code. Only YourApp's backend (step 5) — which has a secret client_secret that never touches the browser — can exchange this code for the actual tokens. This exchange happens server-to-server, never exposed to the browser.

This is what makes the Authorization Code flow safer than handing tokens to the browser directly: the actual access token never appears in a URL, browser history, or front-end JavaScript where it could be intercepted.
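
To make steps 1 and 5 concrete, here is a minimal sketch that only builds the two requests and sends nothing. The parameter names come from the OAuth2 spec; the endpoint URLs are Google's as I understand them and should be checked against Google's documentation, and the client ID, redirect URI, and the `state` anti-CSRF value are placeholders added for illustration.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"


def build_authorization_url(client_id, redirect_uri, scope, state):
    """Steps 1-2: the URL the browser is redirected to for login and consent."""
    params = {
        "response_type": "code",  # ask for an authorization code, not a token
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,           # echoed back in step 4; compare it to detect CSRF
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


def build_token_request(code, client_id, client_secret, redirect_uri):
    """Step 5: the form body the BACKEND would POST to TOKEN_ENDPOINT."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,  # never leaves the server
        "redirect_uri": redirect_uri,
    }


state = secrets.token_urlsafe(16)
url = build_authorization_url(
    "my-client-id", "https://yourapp.example/callback", "openid email", state
)
body = build_token_request(
    "code-from-step-4", "my-client-id", "my-client-secret",
    "https://yourapp.example/callback",
)
print(url)
```

The browser-visible URL carries the `client_id` but never the `client_secret`; the secret only appears in the server-to-server body.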

The Tokens OAuth2 Produces

access_token:  Short-lived (minutes to hours). Used to call APIs.
               "This bearer can access Priya's Calendar for the next hour."

refresh_token: Long-lived (days to months). Used to get NEW access tokens
               without the user logging in again.

JWT: Stateless, Self-Contained Tokens

A JWT (JSON Web Token) is a specific token format — widely used for access_tokens — that's self-contained and cryptographically signed.

Anatomy of a JWT

eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJwcml5YSIsImV4cCI6MTcxODg4ODg4OH0.4f8a92...
└──────────┬──────────┘└──────────────┬──────────────┘└────┬────┘
         HEADER                     PAYLOAD              SIGNATURE
   (algorithm info)            (claims — the data)    (verifies integrity)

Example decoded payload (claims) for a token carrying a few more fields:

{
  "sub": "priya_12345",
  "name": "Priya Sharma",
  "role": "admin",
  "exp": 1718888888,
  "iat": 1718885288
}
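
It is worth seeing that the header and payload are only base64url-encoded, not encrypted. The sketch below decodes the first two segments of the sample token shown above using only the standard library; `decode_segment` is a hypothetical helper name.

```python
import base64
import json


def decode_segment(segment):
    """Base64url-decode one JWT segment and parse it as JSON."""
    padded = segment + "=" * (-len(segment) % 4)  # JWTs strip '=' padding
    return json.loads(base64.urlsafe_b64decode(padded))


header_b64 = "eyJhbGciOiJIUzI1NiJ9"
payload_b64 = "eyJzdWIiOiJwcml5YSIsImV4cCI6MTcxODg4ODg4OH0"

header = decode_segment(header_b64)
payload = decode_segment(payload_b64)
print(header)
print(payload)
```

Anyone holding the token can read these claims, so a JWT should never carry secrets. Only the signature stops the claims from being changed.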

Why JWT Is "Stateless" — And Why That's Powerful

Traditional session-based auth:
  User logs in → Server creates session → stores in Redis/DB
  Every request: server looks up session in Redis → "yes, this is Priya"
  → Requires a database/cache lookup on EVERY request

JWT-based auth:
  User logs in → Server creates JWT, SIGNS it, gives to user
  Every request: server VERIFIES the signature (no DB lookup needed!)
  → If signature is valid, the claims inside are trusted

The signature is what makes this work. The server holds a secret key. When issuing a JWT, it signs the encoded header and payload with this key. When verifying, it recomputes and checks the signature using the same key (HMAC) or, with asymmetric signing (RSA/ECDSA), the matching public key.

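import time

import jwt  # PyJWT

secret_key = "replace-with-a-long-random-secret-from-your-secret-store"  # placeholder
expiry_timestamp = int(time.time()) + 15 * 60  # expires 15 minutes from now


class AuthError(Exception):
    """Raised when a token cannot be trusted."""


# Issuing (server signs with secret key)
token = jwt.encode(
    {"sub": "priya_12345", "role": "admin", "exp": expiry_timestamp},
    secret_key,
    algorithm="HS256"
)

# Verifying (any service with the secret/public key can verify: NO DB CALL)
try:
    payload = jwt.decode(token, secret_key, algorithms=["HS256"])
    # payload["sub"] == "priya_12345" is trusted, because the signature is valid
except jwt.ExpiredSignatureError:
    # Signature is fine but the exp claim is in the past: reject
    raise AuthError("token expired")
except jwt.InvalidTokenError:
    # Tampered, malformed, or signed with a different key: reject
    raise AuthError("invalid token")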

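This is why JWTs suit microservices so well: any service holding the verification key can independently verify a token without calling a central auth service. No network hop, no database lookup, no shared session store. Prefer asymmetric signing (RSA/ECDSA) across services: with a shared HMAC secret, every service that can verify tokens can also forge them.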
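
To show that nothing mysterious happens during verification, here is a standard-library sketch of HS256 signing and signature checking. It is for understanding only: it skips `exp` and other claim checks, the function names are hypothetical, and real services should use a maintained library.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without '=' padding, as JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"


def signature_is_valid(token: str, secret: bytes) -> bool:
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return False  # not three dot-separated segments
    signing_input = f"{header}.{payload}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many characters matched
    return hmac.compare_digest(b64url(expected), signature)


secret = b"demo-secret-do-not-use-in-production"
token = sign_hs256({"sub": "priya_12345", "role": "viewer"}, secret)

# An attacker swaps the payload for one claiming admin but cannot re-sign it
head, _, sig = token.split(".")
forged_payload = b64url(json.dumps({"sub": "priya_12345", "role": "admin"}).encode())
forged = f"{head}.{forged_payload}.{sig}"

print(signature_is_valid(token, secret), signature_is_valid(forged, secret))
```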

JWT's Achilles Heel: Revocation

Here's the catch. Since JWTs are self-contained and stateless, a server verifying a JWT has no way to know if it's been "revoked" — there's no database to check.

The problem: A user's account is compromised. You want to immediately invalidate all their tokens. But their JWT is valid until exp — and there's no central record to delete.

Solving Token Revocation at Scale

Approach 1: Short TTL + Refresh Token Pattern

access_token: expires in 15 minutes (short-lived)
refresh_token: expires in 30 days, but STORED in a database

To get a new access_token:
  Client sends refresh_token to /token/refresh
  Server checks: is this refresh_token in the database AND not revoked?
  If yes → issue new access_token (15 min)
  If no  → reject, user must log in again

To revoke a user's access:
  DELETE the refresh_token from the database
  → Within 15 minutes, ALL their access_tokens expire naturally
  → They can't get new ones (refresh_token is gone)
  → Maximum exposure window: 15 minutes

This is the industry-standard pattern. You accept a small window (the access token's TTL) where a "revoked" token still technically works, in exchange for the massive performance win of stateless verification for the vast majority of requests.
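
A minimal sketch of the server-side half of this pattern, with a plain dict standing in for the refresh-token table. All names here are hypothetical, and a real implementation would also expire and rotate refresh tokens and store them hashed.

```python
import secrets
import time

ACCESS_TTL_SECONDS = 15 * 60

# Stand-in for the database table of refresh tokens: token -> user id
refresh_tokens = {}


def login(user_id):
    """Issue a refresh token and remember it server-side."""
    token = secrets.token_urlsafe(32)
    refresh_tokens[token] = user_id
    return token


def refresh(refresh_token, now=None):
    """Return (user_id, access_token_expiry), or None if revoked or unknown."""
    user_id = refresh_tokens.get(refresh_token)
    if user_id is None:
        return None  # user must log in again
    now = time.time() if now is None else now
    return user_id, now + ACCESS_TTL_SECONDS


def revoke_user(user_id):
    """Delete every refresh token the user holds."""
    for token in [t for t, u in refresh_tokens.items() if u == user_id]:
        del refresh_tokens[token]


rt = login("priya")
before = refresh(rt, now=1000)
revoke_user("priya")
after = refresh(rt, now=1000)
print(before, after)
```

After `revoke_user`, any access token already issued keeps working until its own expiry, which is exactly the bounded exposure window described above.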

Approach 2: Blacklist (for immediate revocation)

Maintain a Redis set of "revoked token IDs" (jti claim)
On verification: check signature (stateless) AND check blacklist (one Redis call)

Trade-off: re-introduces a lookup on every request — but it's a fast 
Redis lookup, not a full session database query. Used when 15-minute 
exposure windows are unacceptable (e.g., financial systems).
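
A sketch of the revocation-list check, with a dict standing in for Redis keys that expire. The names are hypothetical. The useful detail is that an entry only needs to live until the token's own `exp`, so the list stays small.

```python
import time

# Stand-in for Redis keys with a TTL: jti -> the token's exp timestamp
revoked = {}


def revoke(jti, exp):
    """Remember a revoked token ID until the token would have expired anyway."""
    revoked[jti] = exp


def is_revoked(jti, now=None):
    now = time.time() if now is None else now
    exp = revoked.get(jti)
    if exp is None:
        return False
    if exp <= now:
        # The token is past exp, so signature verification already rejects it;
        # the entry can be forgotten.
        del revoked[jti]
        return False
    return True


revoke("token-abc", exp=2000)
print(is_revoked("token-abc", now=1000), is_revoked("token-xyz", now=1000))
```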

JWT vs Sessions: The Honest Trade-off

	Session-based	JWT-based
State	Server stores session (Redis/DB)	Stateless — token is self-contained
Scaling	Requires shared session store across servers	Any server can verify independently
Revocation	Instant — delete the session	Hard — requires short TTL + refresh pattern
Token size	Small (just a session ID)	Larger (contains claims)
Microservices	Every service needs access to session store	Any service with the key can verify
Mobile/SPA	Cookies awkward for mobile apps	Works naturally — token in header

The honest take: JWTs aren't "better" than sessions — they trade instant revocation for statelessness. For a monolith with a fast Redis session store, sessions are simpler and have no revocation problem. For microservices and mobile clients, JWT's statelessness is usually worth the revocation complexity.

SAML: Enterprise SSO

SAML (Security Assertion Markup Language) is an older protocol (SAML 2.0 was standardized in 2005) that still dominates enterprise Single Sign-On: the "Login with your company account" flow used by Okta, OneLogin, and corporate Active Directory integrations.

User → tries to access SaaS App (Service Provider, SP)
SaaS App → redirects to company's Identity Provider (IdP) — e.g., Okta
User → already logged into Okta (corporate SSO session)
Okta → generates a SIGNED XML "assertion": "This is priya@company.com, 
        verified, here are her roles"
User → browser POSTs this assertion back to SaaS App
SaaS App → verifies signature, creates session for Priya

SAML vs OAuth2/OIDC:

SAML uses XML, OAuth2/OIDC use JSON — SAML is older and more verbose
SAML is dominant in enterprise/B2B SSO (legacy systems, Active Directory integration)
OAuth2 + OpenID Connect (OIDC, which adds authentication on top of OAuth2's authorization) is dominant for consumer apps and modern APIs

When you see "Enterprise SSO" as a requirement in a system design interview — that's a SAML signal. "Sign in with Google/GitHub" — that's OAuth2/OIDC.

RBAC vs ABAC: Two Models of Authorization

Once you know who the user is, how do you decide what they can do?

RBAC: Role-Based Access Control

Users are assigned roles. Roles have permissions.

Roles:
  "admin"   → permissions: [read, write, delete, manage_users]
  "editor"  → permissions: [read, write]
  "viewer"  → permissions: [read]

User "priya" → role: "editor"
→ Priya can read and write, but not delete or manage users

Advantages: Simple to understand, easy to audit ("show me everyone with admin role"), maps naturally to org structures.

Limitation: Roles are coarse-grained. What if Priya should be able to edit documents she created but not documents created by others? RBAC alone can't express this — every "editor" has the same permissions regardless of context.
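
The role table above maps almost directly onto a lookup. A minimal sketch, with `ROLE_PERMISSIONS`, `USER_ROLES`, and `is_allowed` as hypothetical names:

```python
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete", "manage_users"},
    "editor": {"read", "write"},
    "viewer": {"read"},
}

USER_ROLES = {"priya": "editor"}


def is_allowed(user, permission):
    """True if the user's role grants the permission; unknown users get nothing."""
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("priya", "write"), is_allowed("priya", "delete"))
```

The check never looks at which document is being edited or who owns it, which is the coarse-grained limitation just described.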

ABAC: Attribute-Based Access Control

Access decisions are based on attributes of the user, resource, action, and environment — evaluated against policies.

Policy: "A user can EDIT a document IF:
  user.role == 'editor' 
  AND document.owner_id == user.id
  AND current_time is within business_hours
  AND user.department == document.department"

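from datetime import datetime

def is_business_hours(now: datetime) -> bool:
    # Assumed definition for illustration: Monday-Friday, 09:00-17:59 local time
    return now.weekday() < 5 and 9 <= now.hour < 18

def can_edit_document(user, document, context):
    # Every condition in the policy must hold; any single failure denies access
    return (
        user.role == "editor" and
        document.owner_id == user.id and
        is_business_hours(context.current_time) and
        user.department == document.department
    )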

Advantages: Extremely fine-grained — context-aware decisions (time of day, location, resource ownership, relationships between entities).

Limitation: More complex to implement, audit, and reason about. Policies can become a tangled web of conditions that are hard to verify for correctness.

The practical guideline:

RBAC for broad, organizational access control ("admins can access the admin panel")
ABAC for fine-grained, contextual rules ("users can edit their own posts, but only during business hours, and only within their department")
Many real systems use both: RBAC for coarse roles, ABAC for fine-grained exceptions layered on top.

mTLS: Service-to-Service Authentication

Regular TLS (the "S" in HTTPS) authenticates the server to the client — your browser verifies it's really talking to bank.com. But the server doesn't verify who the client is beyond what application-layer auth (passwords, tokens) provides.

Mutual TLS (mTLS) requires both sides to present certificates:

Service A wants to call Service B:

1. Service A presents its certificate to Service B
   "I am service-a.internal, signed by our internal CA"

2. Service B presents its certificate to Service A
   "I am service-b.internal, signed by our internal CA"

3. Both verify each other's certificates against the trusted CA
4. Connection established — BOTH sides cryptographically verified

This is exactly what we saw the service mesh (Topic 17) automate — Istio issues certificates to every service and enforces mTLS for all internal traffic, without application code changes. Every service-to-service call is mutually authenticated and encrypted by default.

Zero Trust: "Never Trust, Always Verify"

The Equifax breach was as damaging as it was partly because, once an attacker breached the perimeter, the internal network trusted them. Zero Trust, a model that predates the breach but gained urgency after it, is the architectural answer: no request is trusted by default, regardless of whether it originates inside or outside the network perimeter.

Traditional ("castle and moat"):
  Strong perimeter security (firewall, VPN)
  Once inside → relatively trusted, broad access

Zero Trust:
  Every request — internal or external — must be authenticated 
  AND authorized, regardless of network location

  Service A calling Service B internally:
    → mTLS authenticates A's identity
    → Service B checks: is A authorized for THIS specific operation?
    → Every hop verified, nothing assumed because "it's internal"

Practical implementation: mTLS for service identity (Topic 17's service mesh), short-lived credentials everywhere (no long-lived API keys), continuous verification (not just at login), and least-privilege access (services get only the permissions they need, nothing more).

Google's BeyondCorp is the most famous Zero Trust implementation — Google employees access internal tools the same way whether they're in a Google office or a coffee shop, because the network location confers zero trust. Identity and device posture are what matter.

Interview Scenario: "Walk Through OAuth2 Flow Step by Step"

The structured answer (this is almost always asked verbatim):

"I'll walk through the Authorization Code Grant, the most common and secure flow for server-side apps.

Step 1: The user clicks 'Sign in with Google' on our app. We redirect them to Google's authorization endpoint, including our client_id, the redirect_uri, and the requested scope (e.g., calendar access).

Step 2: The user authenticates with Google (if not already) and approves the requested permissions.

Step 3: Google redirects the browser back to our redirect_uri with a short-lived, single-use authorization_code in the URL.

Step 4: Our backend server — not the browser — exchanges this code for tokens by calling Google's token endpoint, including our client_secret. This is a server-to-server call, so the secret never touches the browser.

Step 5: Google returns an access_token and refresh_token. We store the refresh token securely server-side and associate it with the user's session.

Step 6: We use the access token to call Google APIs on the user's behalf. When it expires, we use the refresh token to get a new one — without bothering the user.

The key security property: the actual tokens never appear in browser-visible locations like URLs or history — only the one-time authorization code does, and that's useless without the client_secret to exchange it."

Key Takeaways

Authentication = who you are. Authorization = what you can do. Different problems, different solutions.
OAuth2 is an authorization framework — the Authorization Code Grant flow keeps tokens out of browser-visible locations via a server-side exchange step.
JWT is self-contained and stateless — any service with the key can verify without a database lookup. Its weakness is revocation.
Solve JWT revocation with short access token TTLs (minutes) + long-lived refresh tokens stored server-side (revoke by deleting the refresh token).
SAML dominates enterprise SSO (XML-based). OAuth2/OIDC dominates consumer and API auth (JSON-based).
RBAC (roles → permissions) for broad access control. ABAC (attribute-based policies) for fine-grained, contextual rules. Most systems use both.
mTLS authenticates both sides of a connection — the foundation of service mesh security.
Zero Trust: never trust based on network location — verify every request, everywhere, always.

What's Next

Topic 20 covers Observability — the 3 pillars (metrics, logs, traces), the 4 Golden Signals, distributed tracing, and how to avoid alert fatigue when you're running hundreds of services.

Tags: system-design authentication oauth2 jwt security backend interview-prep

System Design - 18. Fault Tolerance Patterns: Circuit Breakers, Bulkheads, and the Art of Failing Gracefully

Rajkiran — Fri, 12 Jun 2026 11:46:47 +0000

Covers: Circuit Breaker, Retry + Exponential Backoff + Jitter, Bulkhead, Timeout, Fallback, Redundancy

The Titanic's Bulkheads (And Why They Failed)

The RMS Titanic was designed with 16 watertight compartments separated by bulkheads. The idea: if the hull was breached, water would flood only the affected compartments, and the ship would stay afloat.

The fatal flaw: the bulkheads didn't extend high enough. Water flooding one compartment spilled over the top into the next, and the next, and the next. The isolation that was supposed to contain the damage didn't — because the walls were too short.

This is, almost too perfectly, the story of fault tolerance in distributed systems. The patterns exist. Teams implement them. But if implemented incompletely — bulkheads "too short" — a single failure cascades through the entire system anyway.

Today we cover the five patterns that, implemented correctly and together, are the difference between "one service had a bad day" and "the entire platform went down."

Why Failures Cascade: The Mechanism

Before the patterns, understand the failure mode they prevent. This is the cascading failure scenario from Day 2, now in full mechanical detail:

Step 1: Payment Service becomes slow (database under load, 5 seconds per call instead of 50ms)

Step 2: Order Service calls Payment Service, waits...
  Order Service has a thread pool of 100 threads
  Each call to Payment Service now holds a thread for 5 seconds (instead of 50ms)
  Each thread is held 100x longer, so the pool completes 100x fewer calls per second

Step 3: Order Service's thread pool exhausts
  All 100 threads are blocked waiting on Payment Service
  New incoming requests to Order Service have no threads available
  Order Service starts rejecting/timing out ALL requests — 
  even ones that don't need Payment Service!

Step 4: Services calling Order Service experience the same problem
  Checkout Service calls Order Service → also times out
  Checkout Service's thread pool exhausts

Step 5: Cascade continues upward through the entire call graph
  The ENTIRE platform becomes unresponsive — 
  because ONE service (Payment) got slow.

The root cause: A slow dependency consumed a shared resource (threads) needed for unrelated operations. The fault tolerance patterns all attack this mechanism from different angles.
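
The numbers in the scenario can be checked with one line of arithmetic: a pool of blocking threads can complete at most pool size divided by time per call. A small sketch, where `max_throughput_rps` is a hypothetical helper:

```python
def max_throughput_rps(pool_size, seconds_per_call):
    """Upper bound on requests/second a blocking thread pool can complete."""
    return pool_size / seconds_per_call


healthy = max_throughput_rps(100, 0.050)  # Payment Service answers in 50ms
degraded = max_throughput_rps(100, 5.0)   # Payment Service answers in 5s

print(f"healthy: {healthy:.0f} req/s, degraded: {degraded:.0f} req/s")
```

Under these assumptions the same 100 threads go from roughly 2,000 requests per second to 20. Any incoming load above 20 requests per second now queues up, and the pool exhausts.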

Pattern 1: Timeout — Never Wait Forever

The simplest, most fundamental pattern — and the one most commonly missing.

import requests

# WITHOUT timeout (dangerous default in many HTTP libraries, including requests)
response = requests.get("http://payment-service/charge")
# If payment-service hangs, this line waits FOREVER

# WITH timeout
response = requests.get("http://payment-service/charge", timeout=2.0)
# If connecting or waiting for data takes longer than 2 seconds,
# this raises requests.exceptions.Timeout

Why this matters so much: Without a timeout, a hung dependency holds your thread indefinitely. With a timeout, the worst case is bounded — your thread is freed after 2 seconds, available for other work.

Choosing timeout values:

Too short: legitimate slow requests get cancelled unnecessarily
           (false failures under normal load spikes)

Too long:  threads tied up too long during real failures
           (cascading failure mechanism still triggers, just slower)

Rule of thumb: set timeout based on p99 latency of the dependency
  If p99 latency is 200ms → timeout at 500ms-1s
  Gives headroom for normal variance, fails fast for genuine hangs

Critical detail: Timeouts must be set at every layer — HTTP client, database driver, connection pool acquisition. A common bug: setting an HTTP timeout but the underlying TCP connection pool has no timeout, so threads still hang waiting for a connection from the pool.
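
The p99 rule of thumb can be turned into a small helper that derives a timeout from observed latencies. This is a sketch under stated assumptions: the 2.5x multiplier and the 100ms floor are illustrative choices, and the function names are hypothetical.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # integer ceiling of 99% of n
    return ordered[rank - 1]


def suggest_timeout_ms(latencies_ms, multiplier=2.5, floor_ms=100):
    """Timeout = p99 plus headroom, never below a small floor."""
    return max(floor_ms, p99(latencies_ms) * multiplier)


samples = list(range(1, 201))  # pretend we observed latencies of 1ms..200ms
timeout_ms = suggest_timeout_ms(samples)
print(p99(samples), timeout_ms)
```

With the requests library, the result could then be passed as `timeout=(connect_seconds, read_seconds)`, which sets the connect and read timeouts separately.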

Pattern 2: Retry with Exponential Backoff + Jitter

Transient failures (brief network blip, momentary overload) often succeed on retry. But naive retries can make things worse.

The Naive (Dangerous) Retry

import time

import requests
from requests.exceptions import RequestException

def call_with_retry(url):
    for attempt in range(5):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            time.sleep(1)  # wait 1 second, retry
    raise Exception("All retries failed")

The problem: If Payment Service is overloaded and returning errors, and 1000 clients are all retrying every 1 second... you've just created a synchronized retry storm. Every client retries at the same intervals, hammering the already-struggling service in waves, preventing it from ever recovering.

Exponential Backoff

Increase the wait time between retries exponentially:

def call_with_exponential_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            time.sleep(wait_time)
    raise Exception("All retries failed")

This gives the failing service progressively more breathing room. But there's still a problem.

Why Jitter Is Critical

Imagine 1000 clients all start their first retry at the same moment (because they all called at the same moment and all failed at the same moment). With pure exponential backoff:

All 1000 clients wait: 1s, 2s, 4s, 8s, 16s...
→ Still synchronized! All 1000 hit the service AGAIN 1s after the first
  failure, then AGAIN 2s after that, then 4s after that, etc.
→ The "thundering herd" pattern from caching (Day 3) — but for retries

Jitter adds randomness to break synchronization:

import random

def call_with_backoff_and_jitter(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=1)
        except RequestException:
            base_wait = 2 ** attempt
            wait = random.uniform(0, base_wait)  # random wait in [0, base_wait]
            time.sleep(wait)
    raise Exception("All retries failed")

# Client A waits: 0.3s, 1.8s, 3.1s, ...
# Client B waits: 0.9s, 0.2s, 2.7s, ...
# Client C waits: 0.1s, 0.4s, 3.9s, ...
# → Retries spread out over time, not synchronized

AWS's recommended "full jitter" formula:

wait_time = random.uniform(0, min(cap, base * (2 ** attempt)))
# cap = maximum wait time regardless of attempt number (e.g., 60s)
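To make the formula concrete, here is a minimal sketch of a retry helper built around it. The function names and the injectable `sleep` parameter are assumptions made for illustration (the latter keeps the example testable without real waiting); only the wait formula itself comes from the text above.

```python
import random
import time


def full_jitter_wait(attempt, base=1.0, cap=60.0, rng=random):
    """Wait drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def retry_with_full_jitter(func, max_retries=5, base=1.0, cap=60.0,
                           retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait with full jitter and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_wait(attempt, base, cap))


# Usage: a call that fails twice with a transient error, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

waits = []
result = retry_with_full_jitter(flaky, retry_on=(ConnectionError,),
                                sleep=waits.append)
```

Passing `retry_on` explicitly keeps the helper from retrying errors that will never succeed, and re-raising on the last attempt avoids a pointless final sleep.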

The interview answer: "Exponential backoff prevents hammering a recovering service with the same frequency. Jitter prevents synchronized retry storms across many clients. You need both — backoff alone still produces thundering herds at scale."

What NOT to retry: most 4xx errors (client errors: retrying a malformed request won't fix it; 429 Too Many Requests is the usual exception, since it signals temporary overload), and non-idempotent operations without an idempotency key (retrying a payment charge could double-charge; this ties back to Day 5's Saga pattern).
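The retry rules can be captured in one small predicate. This is a sketch under stated assumptions: `should_retry` is a hypothetical helper, and treating 408 and 429 as retryable is a common convention rather than something every API guarantees.

```python
# Methods defined as idempotent by the HTTP specification
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(status_code, method, has_idempotency_key=False):
    """Decide whether a failed HTTP call is worth retrying."""
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if not safe_to_repeat:
        return False           # e.g. POST /charge without a key: could double-charge
    if status_code in (408, 429):
        return True            # assumption: timeout / rate-limit responses are transient
    if 400 <= status_code < 500:
        return False           # client error: the same request will fail again
    return status_code >= 500  # server-side errors are often transient
```

For example, `should_retry(503, "POST")` is False, while the same call with `has_idempotency_key=True` is True.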

Pattern 3: Circuit Breaker — Stop Calling What's Broken

If a dependency is consistently failing, continuing to call it — even with retries — wastes resources and prolongs the cascade. The Circuit Breaker pattern (named after electrical circuit breakers) stops calls entirely when a dependency is unhealthy.

The Three States

                    ┌─────────────────┐
        ┌──────────►│      CLOSED      │ (normal operation)
        │           │  Requests pass    │
        │           │  through normally │
        │           └─────────┬────────┘
        │                     │
        │      Failure rate exceeds threshold
        │      (e.g., 50% failures in 10s)
        │                     │
        │                     ▼
        │           ┌──────────────────┐
        │           │       OPEN        │ (failing fast)
        │           │  Requests fail    │
        │           │  IMMEDIATELY,     │
   Success          │  no call made     │
   threshold        └─────────┬────────┘
   reached                     │
        │             After timeout period
        │             (e.g., 30 seconds)
        │                     │
        │                     ▼
        │           ┌──────────────────┐
        └───────────┤    HALF-OPEN      │ (testing recovery)
                     │  Allow LIMITED    │
                     │  requests through │
                     │  to test if fixed │
                     └─────────┬────────┘
                                │
                       If test requests fail
                       → back to OPEN

CLOSED (normal): Requests flow through normally. The breaker monitors the failure rate.

OPEN (failing fast): Once the failure rate crosses a threshold, the breaker "trips." All subsequent requests fail immediately — without even attempting the network call. This is the key insight: failing fast and locally is far better than waiting for a timeout on every request to a known-broken service.

HALF-OPEN (testing recovery): After a cooldown period, the breaker allows a small number of test requests through. If they succeed, the breaker closes (back to normal). If they fail, it reopens (back to failing fast) and waits again.

Implementation Sketch

import time

class CircuitOpenException(Exception):
    """Raised when the breaker is OPEN and the call is rejected locally."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0  # consecutive failures since the last success
        self.state = "CLOSED"
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # cooldown elapsed: allow a trial call
            else:
                raise CircuitOpenException("Circuit is OPEN: failing fast")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call reopens immediately; otherwise trip at the threshold
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

        # Success: recovery confirmed (if HALF_OPEN) and the failure streak resets
        self.state = "CLOSED"
        self.failure_count = 0
        return result

Real implementation: Netflix's Hystrix (now in maintenance mode) pioneered this for microservices. Resilience4j is the modern Java successor. Most languages have equivalents (e.g., pybreaker for Python, gobreaker for Go).

Why this matters at scale: If Payment Service is down, and Order Service makes 1000 requests/second to it, without a circuit breaker that's 1000 timeouts/second — each holding a thread for the timeout duration. With a circuit breaker in OPEN state, those 1000 requests fail in microseconds instead — freeing threads immediately, and giving Payment Service room to recover without being hammered.
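The thread arithmetic behind this follows Little's law: average in-flight requests equal arrival rate times the time each request is held. A rough sketch, where the 2-second timeout and the 50-microsecond fail-fast cost are assumed figures chosen only for illustration:

```python
def threads_held(requests_per_second, seconds_held_per_request):
    """Little's law: average in-flight requests = arrival rate * time held."""
    return requests_per_second * seconds_held_per_request


# Each call waits out an assumed 2s timeout before failing
without_breaker = threads_held(1000, 2.0)

# Each call is rejected locally in an assumed ~50 microseconds
with_breaker = threads_held(1000, 0.00005)
```

Under these assumptions the caller needs on the order of 2000 threads just to wait on a dead dependency, versus a small fraction of one thread when the breaker is open.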

Pattern 4: Bulkhead — Isolate Failure Domains

Named after the watertight compartments (bulkheads) in a ship's hull, which keep a single breach from flooding the whole vessel. The idea: partition resources (thread pools, connection pools) per dependency, so one dependency's failure can't exhaust resources needed for others.

Without Bulkheads (Shared Thread Pool)

Order Service has ONE thread pool of 100 threads, shared by all calls:
  - Calls to Payment Service
  - Calls to Inventory Service
  - Calls to Shipping Service

Payment Service hangs → 80 of 100 threads get stuck waiting on Payment
→ Only 20 threads remain for Inventory and Shipping calls
→ Inventory and Shipping requests queue up, time out
→ Even though Inventory and Shipping are perfectly healthy!

With Bulkheads (Isolated Thread Pools)

Order Service has SEPARATE thread pools per dependency:
  - Payment Service pool:   20 threads
  - Inventory Service pool: 20 threads
  - Shipping Service pool:  20 threads
  - (60 threads total, but partitioned)

Payment Service hangs → all 20 Payment-pool threads get stuck
→ Inventory pool (20 threads) and Shipping pool (20 threads) 
  are COMPLETELY UNAFFECTED
→ Inventory and Shipping requests continue normally

The trade-off: You're "wasting" capacity — if Payment's pool is exhausted but Inventory's pool is idle, you can't dynamically borrow threads. But this rigidity is exactly the point: it guarantees failure containment at the cost of some efficiency.
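The same isolation can be sketched without separate thread pools by capping concurrent calls per dependency with a semaphore. This is a minimal illustration, not a production implementation; `Bulkhead` and `BulkheadFullError` are hypothetical names, and the 10ms wait is an assumed default.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    def __init__(self, max_concurrent_calls, max_wait_seconds=0.01):
        self._slots = threading.BoundedSemaphore(max_concurrent_calls)
        self._max_wait = max_wait_seconds

    def call(self, func, *args, **kwargs):
        # Wait briefly for a free slot, then reject instead of queueing forever
        if not self._slots.acquire(timeout=self._max_wait):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One budget per dependency: exhausting one leaves the other untouched
payment_bulkhead = Bulkhead(max_concurrent_calls=20)
inventory_bulkhead = Bulkhead(max_concurrent_calls=20)
```

If all 20 Payment slots are held by hung calls, the 21st Payment call is rejected after about 10ms, while Inventory calls continue to acquire slots from their own budget.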

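Resilience4j bulkhead configuration (the semaphore-based variant, which caps concurrent calls per dependency rather than allocating a dedicated thread pool):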

resilience4j.bulkhead:
  instances:
    paymentService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms
    inventoryService:
      maxConcurrentCalls: 20
      maxWaitDuration: 10ms

Bulkhead vs Circuit Breaker — the distinction:

Bulkhead prevents resource exhaustion from spreading (isolation)
Circuit Breaker prevents wasted calls to a known-broken dependency (fail-fast)

They're complementary — bulkheads contain the blast radius, circuit breakers reduce wasted effort. Production systems use both together.

Pattern 5: Fallback — Degrade Gracefully

When a dependency is unavailable (circuit open, timeout, or error), what do you return to the user instead of an error?

def get_product_recommendations(user_id):
    try:
        return recommendation_service.get_personalized(user_id)
    except (CircuitOpenException, TimeoutError):
        # Fallback: return generic "trending" recommendations
        # instead of personalized ones
        return cache.get("trending_products")  # cached, always available

Fallback strategies, from best to worst degradation:

1. Cached/stale data
   "Here's your feed from 5 minutes ago" — better than nothing

2. Default/generic response
   "Here are trending products" instead of personalized recommendations

3. Reduced functionality
   "Search is temporarily unavailable, browse by category instead"

4. Queue for later
   "Your request is being processed" — async retry when service recovers

5. Honest error (last resort)
   "This feature is temporarily unavailable" — but the REST of the 
   page still works
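The ordered list above maps naturally onto a chain of providers tried in sequence. A minimal sketch, where `first_successful` and the three provider functions are hypothetical stand-ins for real service calls:

```python
def first_successful(*providers):
    """Return (label, value) from the first provider that does not raise."""
    last_error = None
    for label, provider in providers:
        try:
            return label, provider()
        except Exception as exc:  # sketch: real code would catch narrower errors
            last_error = exc
    raise RuntimeError("all fallbacks failed") from last_error


def personalized():
    raise TimeoutError("recommendation service timed out")

def cached_feed():
    raise KeyError("no cached feed for this user")

def trending():
    return ["item-1", "item-2"]


source, items = first_successful(
    ("personalized", personalized),
    ("cached", cached_feed),
    ("trending", trending),
)
```

Returning the label alongside the value lets the caller record which degradation level was served, which is useful for monitoring how often users see a fallback.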

The principle: partial degradation beats total failure. If your product page shows the product, price, and "Add to Cart" — but the "Customers also bought" section silently shows nothing (or cached trending items) because Recommendation Service is down — most users won't even notice. Compare that to the entire page returning a 500 error because one non-critical service failed.

Real example: Amazon's product pages are composed of dozens of independently-loaded widgets (price, reviews, recommendations, "frequently bought together"). Each widget fails independently with its own fallback. A Recommendation Service outage degrades one widget — the rest of the page works perfectly.

Pattern 6: Redundancy — Active-Active vs Active-Passive (Revisited)

From Day 1, but worth reinforcing in the fault tolerance context: redundancy is the foundation that makes the other patterns effective.

If there's only ONE instance of Payment Service:
  Circuit breaker trips → ALL payment requests fail
  (there's nothing else to fall back to)

If there are MULTIPLE instances across availability zones:
  Circuit breaker trips for the unhealthy instance
  Load balancer routes to healthy instances in other AZs
  Payment processing continues — degraded capacity, not total failure

Active-Active redundancy + Circuit Breakers + Bulkheads + Timeouts + Fallbacks together form a complete fault tolerance strategy. Remove any one, and the others are significantly weakened:

Without timeouts → circuit breakers can't detect failures fast enough
Without circuit breakers → retries continue hammering a dead service
Without bulkheads → one dependency's failure exhausts shared resources
Without fallbacks → circuit breaker "fails fast" just means failing faster, still an error to the user
Without redundancy → there's nothing to fail over to

Interview Scenario: "Design a Fault-Tolerant Payment Service Caller"

The complete answer, layering all patterns:

import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Assumes CircuitBreaker / CircuitOpenException from the sketch above, and a
# `queue` object with an enqueue() method (a stand-in for your job queue client).

class PaymentServiceClient:
    def __init__(self):
        # Bulkhead: dedicated thread pool, isolated from other dependencies
        self.executor = ThreadPoolExecutor(max_workers=20)

        # Circuit breaker: stop calling if Payment Service is unhealthy
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5, 
            recovery_timeout=30
        )

    def charge(self, user_id, amount, idempotency_key):
        try:
            # Run on the dedicated pool so Payment calls never occupy more
            # than 20 worker threads; result() re-raises any exception
            future = self.executor.submit(
                self.circuit_breaker.call,
                self._charge_with_retry,
                user_id, amount, idempotency_key
            )
            return future.result()
        except CircuitOpenException:
            # Fallback: queue for async retry, return "pending" to user
            queue.enqueue("retry_payment", {
                "user_id": user_id, 
                "amount": amount, 
                "idempotency_key": idempotency_key
            })
            return {"status": "pending", "message": "Processing your payment"}

    def _charge_with_retry(self, user_id, amount, idempotency_key):
        for attempt in range(3):
            try:
                # Timeout: never wait forever
                return requests.post(
                    "http://payment-service/charge",
                    json={"user_id": user_id, "amount": amount, 
                          "idempotency_key": idempotency_key},  # idempotent!
                    timeout=2.0
                )
            except (requests.Timeout, requests.ConnectionError):
                if attempt == 2:
                    raise
                # Exponential backoff + jitter
                wait = random.uniform(0, min(10, 2 ** attempt))
                time.sleep(wait)

This single code sample demonstrates: timeout, retry with backoff+jitter, idempotency (from Day 5), circuit breaker, bulkhead (separate executor), and fallback (queue for later). This is what "Top 1%" looks like in an interview — not naming the patterns, but composing them correctly together.

Key Takeaways

Cascading failures happen because a slow dependency consumes shared resources (threads) needed for unrelated work.
Timeout: never wait forever. Set based on p99 latency of the dependency, with headroom.
Retry with exponential backoff + jitter: backoff gives the dependency breathing room; jitter prevents synchronized retry storms across clients. Never retry non-idempotent operations without idempotency keys.
Circuit breaker: CLOSED → OPEN → HALF-OPEN. Fail fast locally instead of waiting for timeouts on a known-broken dependency.
Bulkhead: isolate thread/connection pools per dependency so one failure can't exhaust resources needed by others.
Fallback: degrade gracefully — cached data, generic defaults, reduced functionality — partial degradation beats total failure.
Redundancy (Active-Active) is the foundation — without something to fail over to, the other patterns just fail "faster," not "better."
All patterns work together. Removing any one significantly weakens the others.

You've now covered the entire microservices infrastructure layer: when and how to split a monolith (Topic 16), how services find each other (Topic 17), and how to keep one failing service from taking down everything else (Topic 18). This is the operational backbone of every production microservices system.

Next, we cover Security and Observability: OAuth2, JWT, the three pillars of observability (metrics, logs, traces), and rate limiting algorithms. These are the systems that protect your platform and tell you when something's wrong before your users do.

Tags: system-design fault-tolerance microservices resilience backend distributed-systems interview-prep

System Design - 17. Service Discovery & Service Mesh: How Thousands of Services Find Each Other

Rajkiran — Fri, 12 Jun 2026 11:28:21 +0000

Covers: Client-Side vs Server-Side Discovery, Service Registries, Service Mesh (Istio/Envoy), Kubernetes DNS

The Problem That Didn't Exist in the Monolith

In a monolith, calling another module is simple: OrderService.create(data). It's a function call. The compiler resolves the address. It always works (assuming the code compiles).

In microservices, "calling another service" means: where is it, right now, on the network?

This sounds trivial until you consider what's actually happening in a production environment:

Services run on dynamically allocated IPs (containers get new IPs every restart)
Services scale up and down constantly (auto-scaling adds/removes instances every few minutes)
Services deploy multiple times per day (new versions get new instances)
A single logical service might have 50 running instances across multiple servers

Hardcoding IP addresses is impossible. Even a config file with IPs would be stale within minutes. This is the problem service discovery solves.

The Two Models of Service Discovery

Client-Side Discovery

The calling service queries a service registry directly, gets a list of healthy instances, and load-balances between them itself.

Order Service wants to call Payment Service:

1. Order Service → Service Registry: "Where is Payment Service?"
2. Service Registry → returns: [10.0.1.5:8080, 10.0.1.6:8080, 10.0.1.7:8080]
3. Order Service → picks one (round-robin/random) → 10.0.1.6:8080
4. Order Service → calls Payment Service directly at 10.0.1.6:8080

┌──────────────┐     1. "Where's Payment Service?"    ┌──────────────┐
│ Order Service │ ───────────────────────────────────► │   Registry   │
│              │ ◄─────────────────────────────────── │   (Eureka)   │
│              │     2. [list of healthy instances]    └──────────────┘
│              │
│              │     3. Direct call (load-balanced     ┌──────────────┐
│              │ ───── client-side) ──────────────────►│Payment Service│
└──────────────┘                                        │  (instance 2) │
                                                          └──────────────┘
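The four steps above can be sketched with a toy in-memory registry and a round-robin picker. Both classes are hypothetical illustrations of the flow, not the Eureka API.

```python
import itertools


class InMemoryRegistry:
    """Toy registry: maps a service name to its healthy instance addresses."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def lookup(self, service):
        return list(self._instances.get(service, []))


class RoundRobinClient:
    """Caller-side load balancing over whatever the registry returns."""

    def __init__(self, registry):
        self._registry = registry
        self._counter = itertools.count()

    def pick(self, service):
        instances = self._registry.lookup(service)  # steps 1-2: ask the registry
        if not instances:
            raise LookupError(f"no healthy instances of {service}")
        return instances[next(self._counter) % len(instances)]  # step 3: choose one


registry = InMemoryRegistry()
for address in ("10.0.1.5:8080", "10.0.1.6:8080", "10.0.1.7:8080"):
    registry.register("payment-service", address)

client = RoundRobinClient(registry)
picks = [client.pick("payment-service") for _ in range(4)]
# Step 4 would be the direct HTTP call to the picked address
```

The load-balancing policy lives entirely in the caller here, which is exactly the coupling that the disadvantages of this model refer to.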

Real example: Netflix Eureka

Every service registers itself with Eureka on startup:

@EnableEurekaClient
@SpringBootApplication
public class PaymentServiceApplication {
    // On startup, this service registers with Eureka:
    // "I'm payment-service, I'm at 10.0.1.6:8080, I'm healthy"
}

Other services query Eureka and use Ribbon (Netflix's client-side load balancer) to pick an instance and call it directly.

Advantages:

No extra network hop (client calls the service directly)
Client has full control over load-balancing strategy

Disadvantages:

Every service needs discovery client logic — coupling every service to the registry's API and SDK
Multi-language environments need a discovery library for each language

Server-Side Discovery

The calling service makes a request to a load balancer, which queries the registry and routes the request. The caller never sees individual instance addresses.

Order Service wants to call Payment Service:

1. Order Service → calls "payment-service.internal" (a fixed name)
2. Load Balancer → queries registry for healthy Payment instances
3. Load Balancer → routes to one instance
4. Response flows back through the Load Balancer to Order Service

┌──────────────┐                    ┌──────────────┐                   ┌──────────────┐
│ Order Service │ ──── "payment-    │ Load Balancer │ ── queries ──────►│   Registry   │
│              │     service" ─────►│   (AWS ALB)   │ ◄── instance list─│   (AWS ECS)  │
└──────────────┘                    └───────┬──────┘                   └──────────────┘
                                              │
                                              ▼
                                     ┌──────────────┐
                                     │Payment Service│
                                     │  (instance 2)│
                                     └──────────────┘

Real example: AWS ALB + ECS

ECS (container orchestration) automatically registers/deregisters container instances with the ALB's target group as they start/stop. The Order Service simply calls a fixed DNS name — payment-service.internal — and AWS handles everything else.

Advantages:

Calling services need zero discovery logic — just call a fixed name
Language-agnostic — works the same for Java, Python, Go, anything
Centralized load-balancing logic, easier to update

Disadvantages:

Extra network hop (through the load balancer)
The load balancer itself must be highly available

Service Registry: The Source of Truth

Whichever model you use, there's a registry maintaining the live list of service instances. Popular implementations:

Consul (HashiCorp):

Service registration via agent on each host
Built-in health checking
DNS and HTTP interfaces for querying
Multi-datacenter support

etcd:

Distributed key-value store (also used as Kubernetes' backing store)
Services write their address to a key; watchers detect changes
Strongly consistent (uses Raft consensus)

ZooKeeper:

One of the oldest solutions (used by Kafka, Hadoop for coordination)
Strong consistency guarantees
More operationally complex than Consul/etcd

The registration lifecycle:

1. Service instance starts up
2. Registers itself: "I'm payment-service-7, at 10.0.1.6:8080, healthy"
3. Periodically sends heartbeats: "still alive"
4. Registry monitors heartbeats
5. If heartbeats stop (instance crashed) → registry marks instance unhealthy
6. After grace period → instance removed from registry entirely

Deregistration on graceful shutdown:
1. Instance receives SIGTERM (shutdown signal)
2. Instance explicitly deregisters from registry FIRST
3. Instance finishes in-flight requests (connection draining)
4. Instance exits
   → Other services stop routing new requests to it immediately,
     rather than waiting for heartbeat timeout (which could take 30+ seconds)

This deregistration detail matters a lot in interviews: the difference between a graceful shutdown (instant deregistration) and a crash (timeout-based detection) determines how quickly your system "heals" after instance churn.
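The lifecycle can be sketched as a registry that tracks the last heartbeat per instance. This is a simplified model under stated assumptions: `HeartbeatRegistry` is a hypothetical class, the 30-second TTL is an example value, and the injectable clock exists only so the behavior can be demonstrated without waiting.

```python
import time


class HeartbeatRegistry:
    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}  # instance id -> time of last heartbeat

    def register(self, instance_id):
        self._last_seen[instance_id] = self._clock()

    heartbeat = register  # a heartbeat just refreshes the timestamp

    def deregister(self, instance_id):
        # Graceful shutdown: removal takes effect immediately
        self._last_seen.pop(instance_id, None)

    def healthy_instances(self):
        now = self._clock()
        return sorted(i for i, seen in self._last_seen.items()
                      if now - seen <= self._ttl)


now = [0.0]  # fake clock for the demonstration
reg = HeartbeatRegistry(ttl_seconds=30, clock=lambda: now[0])
reg.register("payment-1")
reg.register("payment-2")

now[0] = 20
reg.heartbeat("payment-1")  # payment-2 has crashed and stays silent

now[0] = 40
after_crash = reg.healthy_instances()  # payment-2 has exceeded the TTL

reg.deregister("payment-1")  # graceful shutdown
after_shutdown = reg.healthy_instances()
```

The crashed instance keeps receiving traffic until its TTL expires, while the gracefully stopped one disappears from the list at once.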

Kubernetes: Service Discovery Built In

If you're running Kubernetes, you largely don't think about service discovery — it's built into the platform via DNS.

# Define a Service — a stable name for a set of pods
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment       # Matches pods with label app=payment
  ports:
    - port: 8080

Any pod in the cluster can now call:
  http://payment-service:8080

Kubernetes DNS (CoreDNS) resolves "payment-service" 
→ to the Service's virtual IP (ClusterIP)
→ kube-proxy load-balances to one of the matching pod IPs

How it works under the hood:

Kubernetes maintains a list of "Endpoints" — the actual pod IPs matching the Service's selector
As pods are created/destroyed (scaling, deployments, crashes), the Endpoints list updates automatically
kube-proxy on each node maintains iptables/IPVS rules that load-balance traffic to current Endpoints
DNS resolution + load balancing happens transparently — application code just calls http://payment-service:8080
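The selector-to-Endpoints step can be modeled in a few lines. This is a toy model of the matching logic only, not the Kubernetes API; the pod dictionaries and `matching_endpoints` are assumptions made for illustration.

```python
def matching_endpoints(selector, pods, port):
    """Return 'ip:port' for every ready pod whose labels include the selector."""
    return [
        f"{pod['ip']}:{port}"
        for pod in pods
        if pod["ready"] and selector.items() <= pod["labels"].items()
    ]


pods = [
    {"ip": "10.1.0.4", "labels": {"app": "payment", "version": "v1"}, "ready": True},
    {"ip": "10.1.0.5", "labels": {"app": "payment"}, "ready": False},
    {"ip": "10.1.0.6", "labels": {"app": "orders"}, "ready": True},
]

endpoints = matching_endpoints({"app": "payment"}, pods, 8080)
```

Only the first pod qualifies: the second matches the selector but is not ready, and the third carries a different `app` label.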

This is server-side discovery, fully managed by the platform. It's a major reason Kubernetes became the dominant orchestration platform — service discovery, one of the hardest microservices problems, is solved by default.

Service Mesh: Discovery Is Just the Beginning

Once you have many services, you face a recurring set of cross-cutting problems for every service-to-service call:

How do I discover the target service? (discovery)
Is this connection encrypted? (mTLS)
What if the call fails — retry? How many times?
What if the target is overloaded — circuit break?
How do I trace this request across services?
How do I roll out a new version to 5% of traffic first (canary)?

Implementing all of this inside every service's application code means every team reimplements (or imports a library for) the same logic, in every language they use.

A service mesh moves all of this into infrastructure — typically a sidecar proxy deployed alongside every service instance.

┌─────────────────────────┐     ┌─────────────────────────┐
│   Order Service Pod      │     │  Payment Service Pod      │
│  ┌───────────┐ ┌───────┐│     │┌───────┐  ┌───────────┐  │
│  │   Order    │ │ Envoy ││     ││ Envoy │  │   Payment   │  │
│  │ Container  │◄┤Sidecar├┼─────┼┤Sidecar│◄─┤  Container  │  │
│  └───────────┘ └───────┘│     │└───────┘  └───────────┘  │
└─────────────────────────┘     └─────────────────────────┘
       Application code never talks to network directly —
       Envoy sidecar intercepts ALL traffic in and out

Every request from Order Service to Payment Service actually goes:

Order Container → Order's Envoy sidecar → Payment's Envoy sidecar → Payment Container

The application code is unaware — it just makes a normal HTTP call to localhost or a service name. The sidecar handles everything else.

What Istio/Envoy Handles Transparently

mTLS (mutual TLS):
Every connection between services is automatically encrypted and authenticated — without any application code changes. Each service gets a cryptographic identity.

Retries with backoff:

# Istio VirtualService config — no app code changes needed
retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,connect-failure

Circuit breaking:

trafficPolicy:
  connectionPool:
    http:
      maxRequestsPerConnection: 10
  outlierDetection:
    consecutive5xxErrors: 5
    interval: 30s
    baseEjectionTime: 30s
    # After 5 consecutive 5xx errors, eject this instance for 30s

Traffic splitting (canary deployments):

http:
  - route:
    - destination:
        host: payment-service
        subset: v1
      weight: 90
    - destination:
        host: payment-service
        subset: v2     # new version
      weight: 10        # 10% of traffic to test the new version
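Conceptually, the weights describe a weighted random choice made per request. A rough sketch of that idea follows; it illustrates the arithmetic only, and makes no claim about how Envoy implements the selection internally. `pick_subset` is a hypothetical helper.

```python
import random


def pick_subset(routes, rng=random):
    """routes: list of (subset, weight) pairs; weights are relative."""
    subsets = [subset for subset, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(subsets, weights=weights, k=1)[0]


routes = [("v1", 90), ("v2", 10)]
rng = random.Random(42)  # seeded only to make the demo repeatable
sample = [pick_subset(routes, rng) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)  # close to 0.10
```

Because the choice is random per request, the observed split only approaches 90/10 over many requests; a small burst of traffic can deviate noticeably.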

Distributed tracing:
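Each sidecar generates spans and reports them to Jaeger/Zipkin without a tracing SDK in the application. One caveat: the application still has to forward the trace headers from each inbound request to its outbound calls, otherwise the spans cannot be stitched into a single trace.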
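One practical detail behind mesh tracing: sidecars can only join spans into a single trace if the application copies the trace headers from the inbound request onto its outbound calls. A minimal sketch, assuming the B3-style and W3C header names commonly listed for Istio; the exact set depends on the tracing backend, and `propagate_trace_headers` is a hypothetical helper.

```python
# Assumed header set; check the mesh and tracing backend documentation
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
)


def propagate_trace_headers(incoming_headers):
    """Copy tracing headers from an inbound request for use on outbound calls."""
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


incoming = {"X-Request-Id": "abc", "x-b3-traceid": "t1", "Accept": "application/json"}
outbound_headers = propagate_trace_headers(incoming)
```

The returned dictionary would be merged into the headers of every downstream HTTP call made while handling that request.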

Service Mesh vs API Gateway: The Confusion Cleared Up

These get confused constantly. Here's the clean distinction:

            External traffic
                  │
                  ▼
          ┌──────────────┐
          │  API Gateway  │  ← North-South traffic
          │  (Kong, ALB)  │     (outside world → your cluster)
          └──────┬───────┘
                  │
    ┌─────────────┼─────────────┐
    ▼             ▼             ▼
[Service A]──►[Service B]──►[Service C]
    ↑─────────────↑─────────────↑
         Service Mesh (Istio)     ← East-West traffic
         (service ←→ service,      (inside your cluster)
          all sidecar-mediated)

API Gateway: Handles North-South traffic — requests entering your system from outside (browsers, mobile apps, partner integrations). Concerns: public authentication, rate limiting per API key, request transformation for external contracts.

Service Mesh: Handles East-West traffic — requests between your internal services. Concerns: mTLS, internal retries/circuit breaking, service-to-service authorization, internal observability.

They're complementary, not competing. A request might pass through the API Gateway once (entering the system) and then through the service mesh multiple times (as it's processed by several internal services).

The Cost of a Service Mesh

Service meshes solve real problems, but they're not free:

Resource overhead: Every pod now runs an extra sidecar container — additional CPU/memory per service instance. At thousands of pods, this is a meaningful infrastructure cost.

Latency overhead: Every call now passes through two sidecars (sender's and receiver's) instead of going directly. Typically adds 1-3ms per hop — usually negligible, but compounds across deep call chains.

Operational complexity: Istio itself is a complex distributed system. Debugging "why is this request slow" now involves understanding sidecar configuration, not just application code.

The honest guidance: Service meshes make sense when you have dozens to hundreds of services and the cross-cutting concerns (mTLS, retries, observability) are genuinely painful to implement per-service. For 5-10 services, the operational cost of running Istio often exceeds the benefit — application-level libraries (like Resilience4j for circuit breaking, covered in Topic 18) may be simpler.

Interview Scenario: "Client-Side vs Server-Side Discovery — Which Would You Choose?"

"It depends on the team's technology diversity and operational maturity. If the organization is running Kubernetes, server-side discovery via Kubernetes Services is essentially free — DNS-based, language-agnostic, and requires zero application code. I'd default to that.

Client-side discovery (like Eureka) made more sense in the pre-Kubernetes era, or in environments without a unified orchestration platform, because it avoids the extra network hop through a load balancer. But it requires every service — in every language — to integrate a discovery client, which becomes a maintenance burden in polyglot environments.

For the broader cross-cutting concerns — retries, circuit breaking, mTLS — I'd evaluate whether a service mesh is justified by the number of services. Below ~10-15 services, I'd handle these concerns with application-level libraries. Beyond that, the consistency and language-agnostic benefits of a service mesh like Istio typically outweigh its operational and latency overhead."

Key Takeaways

Service discovery solves the problem of finding service instances in a dynamic environment where IPs change constantly.
Client-side discovery (Eureka): caller queries registry, load-balances itself. No extra hop, but requires discovery libraries per language.
Server-side discovery (AWS ALB, Kubernetes Services): caller hits a fixed name, infrastructure routes. Language-agnostic, adds one hop.
Kubernetes provides server-side discovery via DNS automatically — a major reason for its dominance.
Service mesh (Istio/Envoy) moves cross-cutting concerns — mTLS, retries, circuit breaking, tracing, canary routing — into sidecar proxies, out of application code.
API Gateway handles North-South traffic (external → internal). Service Mesh handles East-West traffic (internal → internal). They're complementary.
Service meshes add real overhead (resources, latency, complexity) — justify their use by service count and operational pain, not by trend-following.

What's Next

Topic 18 closes Day 6 with Fault Tolerance Patterns — Circuit Breakers, Retries with Exponential Backoff and Jitter, Bulkheads, and Timeouts. The patterns that determine whether a single failing service takes down your entire platform, or fails gracefully and recovers on its own.

Tags: system-design microservices service-mesh kubernetes backend distributed-systems interview-prep

System Design - 16. Microservices vs Monolith: The Decision That Will Define Your Next 3 Years

Rajkiran — Fri, 12 Jun 2026 11:26:25 +0000

Covers: Monolith vs Microservices Trade-offs, Strangler Fig Pattern, Distributed Monolith Anti-pattern, Domain-Driven Design

The Pivot That Took Amazon 4 Years

In 2001, Amazon's entire website was a single, massive application — a monolith. Every feature shared one codebase, one database, one deployment process. Adding a new feature meant touching code that thousands of other features also depended on. A bug in the recommendations engine could crash checkout.

Amazon's leadership made a decision that, at the time, seemed almost insane: break the monolith into hundreds of independently deployable services, each owned by a small team, each with its own database, communicating only through APIs.

It took roughly 4 years. It required organizational restructuring — Jeff Bezos's famous "two-pizza team" mandate (every team small enough to be fed by two pizzas) and the equally famous API mandate (every team must expose its functionality through an API, with no backdoor database access).

The result became the architectural foundation for AWS itself — many AWS services originated as internal tools Amazon built to manage this transition.

But here's what most engineers miss about this story: Amazon ran as a successful monolith for years before this migration. The monolith wasn't a mistake — it was the right architecture for that stage. The migration happened because their scale and organizational structure had outgrown it.

This is the lens through which every "microservices vs monolith" decision should be made.

The Monolith: Simple, Until It Isn't

A monolith is a single deployable unit containing all application logic — UI, business logic, data access — typically backed by one database.

┌─────────────────────────────────┐
│         Monolith App             │
│  ┌─────────┐ ┌─────────┐        │
│  │ Users   │ │ Orders  │        │
│  ├─────────┤ ├─────────┤        │
│  │ Payment │ │ Inventory│       │
│  └─────────┘ └─────────┘        │
│         (single codebase)        │
└─────────────────────────────────┘
              │
              ▼
       ┌─────────────┐
       │   Database    │
       └─────────────┘

Why Monoliths Are Genuinely Good (Not Just "For Beginners")

Simplicity of development:
One codebase. One IDE project. git clone, run, develop. No need to run 15 services locally to test a feature.

Simplicity of deployment:
One artifact to build, test, and deploy. One CI/CD pipeline. One thing to monitor.

Transaction integrity:
A single database means ACID transactions across your entire data model. Updating a user's order and their loyalty points balance? One transaction. Done.
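The order-plus-loyalty-points case can be shown with an in-memory SQLite database. This is a small sketch with an assumed two-table schema; the point is only that both writes commit or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
conn.execute(
    "CREATE TABLE loyalty (user_id INTEGER PRIMARY KEY, "
    "points INTEGER CHECK (points >= 0))")
conn.execute("INSERT INTO loyalty VALUES (1, 100)")
conn.commit()


def place_order(conn, user_id, total, points_delta):
    """Insert the order and adjust loyalty points atomically."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                     (user_id, total))
        conn.execute("UPDATE loyalty SET points = points + ? WHERE user_id = ?",
                     (points_delta, user_id))


place_order(conn, 1, 50, 5)  # both writes commit together

try:
    place_order(conn, 1, 70, -500)  # violates the CHECK constraint
except sqlite3.IntegrityError:
    pass  # the order insert from this call was rolled back too

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
points = conn.execute("SELECT points FROM loyalty WHERE user_id = 1").fetchone()[0]
```

With separate Orders and Loyalty services, each owning its own database, this guarantee disappears and has to be rebuilt with patterns such as sagas.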

Performance:
Function calls within a monolith are nanoseconds. Calls between microservices involve network round-trips — milliseconds. For tightly coupled operations, a monolith is faster.

Easier debugging:
A single stack trace shows you the entire request path. No distributed tracing required.

Where Monoliths Genuinely Struggle

One bug crashes everything:
A memory leak in the reporting module can bring down checkout, because they share the same process.

Scaling is all-or-nothing:
If your image processing is CPU-heavy but your API is I/O-bound, you can't scale them independently. You scale the entire monolith — wasting resources on the parts that didn't need it.

Deployment risk increases with size:
Every deploy ships everything — even unrelated changes. A small CSS fix and a major database migration go out together. Larger blast radius for every release.

Team coordination overhead:
As teams grow, multiple teams modifying the same codebase create merge conflicts, deployment queues, and "whose change broke production" debates.

Microservices: Independent, Until They're Not

Microservices split the application into independently deployable services, each owning its own data, communicating over the network.

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Users   │   │  Orders  │   │ Payment  │   │Inventory │
│ Service  │   │ Service  │   │ Service  │   │ Service  │
└────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │              │
     ▼              ▼              ▼              ▼
 [Users DB]    [Orders DB]   [Payment DB]   [Inventory DB]

Why Microservices Are Genuinely Good

Independent deployment:
The Payment team can deploy 10 times a day without coordinating with the Inventory team. Smaller, more frequent, lower-risk deploys.

Independent scaling:
If Inventory Service needs 50 instances during a flash sale but Payment only needs 5, you scale them separately. No wasted resources.

Technology diversity:
The Recommendation Service can be Python with ML libraries. The Payment Service can be Java for its mature ecosystem. The real-time Notification Service can be Go for concurrency. Each team picks the right tool.

Fault isolation:
If the Recommendation Service crashes, users can still browse, add to cart, and checkout. The blast radius of a failure is contained — if you've designed fault tolerance correctly (Topic 18).

Team autonomy:
Teams own their services end-to-end — code, database, deployment, on-call. This is the organizational benefit that often matters more than the technical one.

Where Microservices Genuinely Struggle

Network overhead and latency:
What was a function call is now an HTTP/gRPC call: milliseconds instead of nanoseconds. A user-facing request that touches 10 services makes at least 10 network calls, each adding latency and each a chance to hit a slow response (remember Day 2's tail latency lesson).
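A rough way to see why fan-out hurts: assume each call is independently "slow" 1% of the time (the 1% figure and the independence assumption are illustrative, not measurements). The chance that a request touching several services hits at least one slow call grows quickly.

```python
def p_any_slow(p_slow_per_call: float, n_calls: int) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1.0 - (1.0 - p_slow_per_call) ** n_calls

# One call at 1% slow stays at 1%; ten such calls reach roughly 9.6%.
one_call = p_any_slow(0.01, 1)
ten_calls = p_any_slow(0.01, 10)
```

So a per-service p99 that looks fine in isolation can turn into roughly one in ten user requests feeling slow once ten services sit on the request path.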

Distributed transactions are hard:
No more BEGIN TRANSACTION across services. You need Sagas (Day 5) for anything that spans services.

Operational complexity explodes:
Instead of monitoring one application, you monitor dozens — each with its own logs, metrics, deployment pipeline, and on-call rotation. You need service discovery, API gateways, distributed tracing — infrastructure that doesn't exist in a monolith.

Testing is harder:
Integration testing requires spinning up multiple services (or sophisticated mocking). "Works on my machine" becomes much harder to achieve.

The cost is real and upfront. The benefits compound over time — which is why the decision depends heavily on your current scale and trajectory.

The Strangler Fig Pattern: How to Actually Migrate

Named after the strangler fig tree, which grows around a host tree, gradually replacing it while the host continues to live — until eventually the host is gone and only the fig remains.

This is the pattern for migrating monolith to microservices without a risky "big rewrite."

Step 1: Monolith handles everything
┌─────────────────────────────┐
│          Monolith            │
│  [Users][Orders][Payment]...  │
└─────────────────────────────┘
              ↑
         All traffic

Step 2: Introduce a proxy/router in front
┌─────────┐    ┌─────────────────────────────┐
│  Proxy   │ → │          Monolith            │
└─────────┘    └─────────────────────────────┘
   All traffic still routes to monolith

Step 3: Extract ONE service. Proxy routes its traffic there.
┌─────────┐    ┌──────────────────────────┐
│  Proxy   │ → │   Monolith (minus Users)  │
│          │    └──────────────────────────┘
│          │ → ┌──────────────┐
└─────────┘    │ Users Service │
                └──────────────┘
   /users/* → Users Service
   everything else → Monolith

Step 4: Repeat for each module, one at a time
Step 5: Eventually, monolith handles nothing — it's fully "strangled"
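The proxy's routing decision in Steps 2-4 can be sketched as a prefix table. The upstream URLs and the names route, MONOLITH, and EXTRACTED are hypothetical; a real deployment would usually express this in the gateway's own configuration.

```python
MONOLITH = "http://monolith.internal"  # hypothetical upstream

# Prefixes extracted so far; the table grows by one entry per migration step.
EXTRACTED = {
    "/users/": "http://users-service.internal",
}

def route(path: str, extracted: dict[str, str] = EXTRACTED) -> str:
    """Return the upstream that should handle `path`."""
    # Longest prefix wins, so nested extractions stay unambiguous.
    for prefix in sorted(extracted, key=len, reverse=True):
        if path.startswith(prefix):
            return extracted[prefix]
    return MONOLITH  # anything not extracted yet still goes to the monolith
```

Rolling back an extraction is then just removing its entry from the table: route("/users/42", {}) goes to the monolith again.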

Why this works:

Each extraction is small and low-risk — you can stop the migration at any point with a working system
The proxy means clients never notice the migration happening
You learn from each extraction and apply lessons to the next
If an extraction goes badly, route traffic back to the monolith (rollback is trivial)

Real-world timeline: This is genuinely slow. Amazon's migration took years. Shopify's modularization effort (moving toward a "modular monolith" — a middle ground) has been multi-year. Anyone promising a "quick microservices migration" is underestimating the work.

The Distributed Monolith: Worst of Both Worlds

This is the anti-pattern that catches teams who adopt microservices without understanding why the patterns exist.

Symptoms of a distributed monolith:

❌ Services share a database
   → "Microservices" that all read/write the same tables
   → A schema change in one service breaks three others

❌ Services must be deployed together
   → Service A's new API requires Service B to deploy simultaneously
   → You've recreated a monolith's deployment coupling, but now over a network

❌ Synchronous call chains for everything
   → Service A calls B calls C calls D, synchronously, for every request
   → One slow service makes everything slow (no isolation benefit)
   → If any service is down, the whole chain fails

❌ Shared libraries with business logic
   → A "common" library contains business rules used by every service
   → Changing the library requires redeploying every service that uses it

The result: You have all the operational complexity of microservices (network calls, distributed tracing, multiple deployments, service discovery) — with none of the benefits (no independent deployment, no fault isolation, no independent scaling).

How to avoid it:

Each service owns its data — no shared databases, ever
Use async communication (events, queues) for cross-service workflows where possible
Version your APIs so services can deploy independently
Duplicate small amounts of logic rather than sharing libraries with business rules — "a little duplication is far cheaper than a tight coupling"

Domain-Driven Design: How to Draw Service Boundaries

The hardest question in microservices isn't "should we split?" — it's "where do we split?"

Domain-Driven Design (DDD) provides a framework: identify bounded contexts — areas of the business with their own language, rules, and models.

E-commerce domain, split into bounded contexts:

┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│  Catalog Context  │  │  Ordering Context │  │ Fulfillment Context│
│                  │  │                  │  │                  │
│ "Product" means: │  │ "Product" means: │  │ "Product" means: │
│  - name, images  │  │  - SKU, price,    │  │  - dimensions,   │
│  - description   │  │    quantity       │  │    weight        │
│  - category      │  │  - in an order    │  │  - warehouse     │
│                  │  │                  │  │    location      │
└─────────────────┘  └─────────────────┘  └─────────────────┘

The key insight: "Product" means something different in each context. The Catalog team thinks about marketing content. The Ordering team thinks about price and inventory. The Fulfillment team thinks about physical dimensions and warehouse logistics.

Trying to have one unified "Product" model serving all three contexts leads to a bloated, constantly-changing entity that every team fights over. DDD says: let each context have its own model of "Product." Translate between them at the boundaries (via events or API calls).

This maps directly to service boundaries:

Catalog Service     → owns "Product" (marketing view)
Ordering Service    → owns "OrderLineItem" (price/quantity view)
Fulfillment Service → owns "ShippableItem" (logistics view)

Each service has a clean, focused model. They communicate via well-defined contracts (events: ProductCreated, OrderPlaced, ItemShipped).
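As a sketch of what "each context has its own model" looks like in code, the three views can be separate types that share only an identifier. The field names and the sku key are assumptions for illustration; the source only names the three models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogProduct:  # Catalog context: marketing view
    sku: str
    name: str
    description: str
    category: str

@dataclass(frozen=True)
class OrderLineItem:  # Ordering context: price/quantity view
    sku: str
    unit_price_cents: int
    quantity: int

    @property
    def total_cents(self) -> int:
        return self.unit_price_cents * self.quantity

@dataclass(frozen=True)
class ShippableItem:  # Fulfillment context: logistics view
    sku: str
    weight_grams: int
    warehouse: str

def to_shippable(line: OrderLineItem, weight_grams: int, warehouse: str) -> list[ShippableItem]:
    """Translate at the boundary: one shippable unit per ordered quantity."""
    return [ShippableItem(line.sku, weight_grams, warehouse) for _ in range(line.quantity)]
```

Notice that no type carries every field. The translation function is the only place where two contexts meet, which keeps a change to pricing from rippling into fulfillment.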

The practical exercise: Get your domain experts (not just engineers) in a room and map out the "ubiquitous language" of each part of the business. Where the vocabulary changes meaning, that's a likely service boundary.

Nanoservices: The Opposite Extreme

If microservices done wrong gives you a distributed monolith, microservices done too far gives you nanoservices — services so small that the overhead of running them exceeds the value they provide.

❌ Nanoservice anti-pattern:
   - "GetUserNameService" — does one thing: returns a user's name
   - "GetUserEmailService" — does one thing: returns a user's email
   - "GetUserPhoneService" — does one thing: returns a user's phone

   To render a user profile, the client now makes 3 network calls
   for data that lives in the same database row.

The rule of thumb: A service should encapsulate a meaningful business capability — not a single field, not a single function. If two pieces of data are always read together and always change together, they probably belong in the same service.

When NOT to Use Microservices

This question is asked constantly in interviews, and the honest answer matters:

Don't use microservices when:

Your team is small (< 10-15 engineers). The operational overhead of microservices requires dedicated platform/DevOps investment that small teams can't afford. A modular monolith gives you most organizational benefits without the operational tax.
Your domain isn't well understood yet. If you're still discovering your product (early startup, MVP), service boundaries drawn too early will be wrong boundaries — and changing service boundaries is far more expensive than changing module boundaries within a monolith.
You don't have strong DevOps/platform capabilities. Microservices require CI/CD per service, service discovery, distributed tracing, centralized logging. Without this infrastructure, you'll spend more time fighting infrastructure than building product.
Your transactions are heavily relational. If most operations touch many entities in ACID transactions (financial ledgers, inventory systems with strict consistency), splitting them across services forces you into complex sagas for what used to be a single COMMIT.

The honest industry trend (2023-2026): Many companies that adopted microservices early are now consolidating into "modular monoliths": single deployments with strong internal module boundaries, ready to extract services later if needed, but without the network overhead until it's justified. Shopify has invested in modularizing its monolith rather than splitting it apart, and Amazon Prime Video publicly described consolidating one distributed monitoring pipeline back into a single process.

Interview Scenario: "How Would You Split a Monolith?"

The structured answer:

"First, I'd map the domain using DDD — identify bounded contexts where the business vocabulary and rules genuinely differ. I wouldn't start with technical boundaries (database tables); I'd start with business capabilities.

Then I'd use the Strangler Fig pattern — introduce a proxy/gateway, and extract one service at a time, starting with the module that has the clearest boundary and least coupling to others. I'd pick something with low risk first to validate the approach before tackling the complex, tightly-coupled modules.

Critically, each extracted service gets its own database from day one — no shared tables, even temporarily, because that's how distributed monoliths happen. Cross-service consistency needs would be handled with sagas or eventual consistency via events, not distributed transactions.

I'd also resist over-extraction — if two pieces of functionality are always read and written together, they probably belong in the same service even if they're conceptually 'different things.'"

Key Takeaways

Monoliths are simple to develop, deploy, and debug — and genuinely the right choice for small teams and early-stage products.
Microservices enable independent deployment, scaling, and team autonomy — at the cost of network overhead, operational complexity, and harder transactions.
Strangler Fig pattern: migrate incrementally via a proxy, extracting one service at a time — never a "big rewrite."
Distributed monolith: the anti-pattern where you get microservices' complexity without their benefits — usually caused by shared databases or synchronous coupling.
Domain-Driven Design helps draw service boundaries around bounded contexts — areas where business vocabulary and rules genuinely differ.
Nanoservices: services so granular the network overhead exceeds their value. A service should encapsulate a business capability, not a field.
The industry is increasingly favoring modular monoliths as a middle ground — strong internal boundaries, single deployment, services extracted only when justified by genuine scaling or team-autonomy needs.

What's Next

Topic 17 covers Service Discovery and Service Mesh — once you have many services, how do they find each other, and how does infrastructure like Istio handle retries, mTLS, and circuit breaking transparently?

Tags: system-design microservices software-architecture backend domain-driven-design distributed-systems interview-prep

System Design - 15. The Saga Pattern: How Uber Books a Trip Without a Single Database Transaction

Rajkiran — Fri, 12 Jun 2026 11:12:12 +0000

Covers: Two-Phase Commit, Saga Pattern, Choreography vs Orchestration Sagas, Compensating Transactions, Idempotency

The Question That Breaks Most Microservices Designs

You're designing Uber's trip booking flow. A single trip booking touches multiple services:

1. Trip Service:    create trip record
2. Driver Service:  assign a driver, mark as unavailable
3. Payment Service: authorize payment method
4. Pricing Service: lock in the fare estimate

In a monolith with one database, this would be a single transaction:

BEGIN TRANSACTION;
  INSERT INTO trips (...);
  UPDATE drivers SET status = 'on_trip' WHERE id = ?;
  INSERT INTO payment_authorizations (...);
  INSERT INTO fare_locks (...);
COMMIT;  -- all or nothing

If anything fails, ROLLBACK undoes everything. Clean. Simple. Guaranteed.

But in microservices, each of these lives in a different service with its own database. There is no single COMMIT that spans all four. So what happens if:

Trip is created ✓
Driver is assigned ✓
Payment authorization fails ✗

You now have a trip with an assigned driver but no valid payment. The driver is marked unavailable for a trip that can't proceed. How do you "roll back" across four independent databases?

This is the central problem the Saga pattern solves.

Two-Phase Commit (2PC): The Tempting Wrong Answer

2PC is the "obvious" distributed transaction protocol — and almost universally considered an anti-pattern for microservices. Understanding why is important.

How 2PC Works

Phase 1 (Prepare):
  Coordinator → asks all participants: "Can you commit this?"
  Trip Service:     "Yes, I can commit" (locks resources, doesn't commit yet)
  Driver Service:   "Yes, I can commit" (locks resources, doesn't commit yet)
  Payment Service:  "Yes, I can commit" (locks resources, doesn't commit yet)
  Pricing Service:  "Yes, I can commit" (locks resources, doesn't commit yet)

Phase 2 (Commit):
  All said yes → Coordinator tells everyone: "COMMIT"
  All services commit and release locks

  OR if anyone said no:
  Coordinator tells everyone: "ROLLBACK"
  All services roll back and release locks
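The two phases can be simulated in a few lines. This is a toy, in-process sketch with hypothetical class and function names; it ignores timeouts, durable logs, and coordinator crashes, which are exactly where real 2PC gets painful.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"  # idle -> prepared -> committed | rolled_back

    def prepare(self):
        # Phase 1: vote. A "yes" vote means resources stay locked until phase 2.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    votes = [p.prepare() for p in participants]  # Phase 1: collect votes
    if all(votes):
        for p in participants:  # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:  # Phase 2: roll back everywhere
        p.rollback()
    return False
```

Every participant that voted yes sits in the "prepared" state until the coordinator speaks again, which is the window the next section is concerned with.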

Why 2PC Is an Anti-Pattern for Microservices

1. Blocking and locks held across services
Once a participant votes yes in Phase 1, it must hold its locks until the coordinator's decision arrives. If the coordinator crashes between Phase 1 and Phase 2, prepared participants can neither commit nor abort on their own, so they keep holding locks until the coordinator recovers: a "blocking" state.

2. Tight coupling and availability cascade
If the Payment Service is slow or down, the entire transaction blocks — Trip Service and Driver Service hold their locks waiting. One service's unavailability brings down the whole operation. This is exactly the cascading failure problem from Day 2.

3. Doesn't scale
2PC requires synchronous coordination across all participants for every transaction. At Uber's scale (millions of trips per day across dozens of services), this creates massive contention.

4. Poor fit for NoSQL
Many NoSQL databases (Cassandra, DynamoDB) don't expose the prepare/commit participant interface (such as XA) that 2PC depends on, so they can't take part in a cross-service 2PC at all.

The rule: If you're designing microservices and reach for 2PC, stop. There's almost always a better pattern — usually the Saga.

The Saga Pattern: Local Transactions + Compensation

A Saga breaks a distributed transaction into a sequence of local transactions, each in a single service. If any step fails, previously completed steps are undone using compensating transactions.

Saga: Book Trip
  Step 1: Trip Service     → create trip            (local transaction)
  Step 2: Driver Service   → assign driver           (local transaction)
  Step 3: Payment Service  → authorize payment       (local transaction)
  Step 4: Pricing Service  → lock fare                (local transaction)

If Step 3 fails:
  Compensate Step 2: Driver Service → release driver
  Compensate Step 1: Trip Service   → cancel trip
  (Steps run in reverse order)

Each step is its own ACID transaction within its own service's database. There's no global lock, no blocking coordinator. If something fails partway through, you run compensating actions to undo the completed steps — not a database rollback, but a business-level undo operation.

The crucial insight: A compensating transaction isn't "undo" in the database sense — it's a new operation that semantically reverses the effect of the original. "Cancel the trip" isn't the same as "delete the trip row" — it might mean marking it cancelled, notifying the user, logging the cancellation reason, and releasing the driver.
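The core mechanic, run steps forward and undo completed ones in reverse on failure, fits in a small helper. This is a minimal sketch with hypothetical names (run_saga, SagaFailed); it assumes compensations themselves succeed, which a production system cannot assume.

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """Run a saga described as a list of (name, action, compensation) tuples.

    Actions run in order. If one raises, the compensations of the steps that
    already completed run in reverse order, then SagaFailed is raised.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for _done_name, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step '{name}' failed: {exc}") from exc
        completed.append((name, compensation))
```

The failed step's own compensation is never called, because it did not complete. Only steps that committed their local transaction need a business-level undo.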

Choreography-Based Saga

Each service publishes events; other services react. No central coordinator.

1. Order Service: creates order (PENDING)
   → publishes OrderCreated event

2. Payment Service: (listens for OrderCreated)
   → charges card
   → publishes PaymentCompleted (success) OR PaymentFailed (failure)

3a. If PaymentCompleted:
    Inventory Service: (listens for PaymentCompleted)
    → reserves stock
    → publishes StockReserved

3b. If PaymentFailed:
    Order Service: (listens for PaymentFailed)
    → marks order as CANCELLED (compensating action)

Diagram of the happy path and failure path:

Happy path:
OrderCreated → PaymentCompleted → StockReserved → OrderConfirmed

Failure path (payment fails):
OrderCreated → PaymentFailed → OrderCancelled
(cancelling the order is the only compensation; no other service has acted yet)

Failure path (stock unavailable, AFTER payment succeeded):
OrderCreated → PaymentCompleted → StockUnavailable
            → Payment Service listens for StockUnavailable
            → refunds payment (compensating action)
            → Order Service listens for PaymentRefunded
            → marks order as CANCELLED
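The event flows above can be reproduced with a tiny in-process event bus. This is a sketch under loud assumptions: a real system would use a message broker with asynchronous delivery, and EventBus, wire_services, and place_order are hypothetical names.

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)

def wire_services(bus, orders, card_ok=True, in_stock=True):
    """Register each service's reactions; `orders` maps order_id -> status."""
    # Payment Service: charge on OrderCreated, refund on StockUnavailable.
    bus.subscribe("OrderCreated", lambda e: bus.publish(
        "PaymentCompleted" if card_ok else "PaymentFailed", e))
    bus.subscribe("StockUnavailable", lambda e: bus.publish("PaymentRefunded", e))
    # Inventory Service: try to reserve stock once payment succeeded.
    bus.subscribe("PaymentCompleted", lambda e: bus.publish(
        "StockReserved" if in_stock else "StockUnavailable", e))
    # Order Service: update the order status from downstream outcomes.
    def set_status(status):
        return lambda e: orders.__setitem__(e["order_id"], status)
    bus.subscribe("StockReserved", set_status("CONFIRMED"))
    bus.subscribe("PaymentFailed", set_status("CANCELLED"))
    bus.subscribe("PaymentRefunded", set_status("CANCELLED"))

def place_order(bus, orders, order_id):
    orders[order_id] = "PENDING"
    bus.publish("OrderCreated", {"order_id": order_id})
```

No function here describes the whole flow. It only emerges from the subscriptions, which is both the appeal and the debugging cost of choreography.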

Advantages:

Loosely coupled: services depend only on event contracts, not on each other's APIs
Easy to add steps (just subscribe to relevant events)

Disadvantages:

The "saga" — the overall flow — exists only implicitly, scattered across event handlers in multiple services
Hard to answer "what's the current state of order #123?" without tracing through events across services
Cyclic dependencies are easy to accidentally create

Best for: Sagas with 2-4 steps and simple compensation logic.

Orchestration-Based Saga

A central Saga Orchestrator explicitly calls each service in sequence and handles compensation.

class OrderSagaOrchestrator:
    def execute(self, order_data):
        try:
            # Step 1
            order = order_service.create_order(order_data)

            # Step 2
            try:
                payment = payment_service.charge(order.total)
            except PaymentFailedException:
                order_service.cancel_order(order.id)  # compensate step 1
                raise SagaFailedException("Payment failed")

            # Step 3
            try:
                inventory_service.reserve_stock(order.items)
            except StockUnavailableException:
                payment_service.refund(payment.id)     # compensate step 2
                order_service.cancel_order(order.id)    # compensate step 1
                raise SagaFailedException("Stock unavailable")

            # Step 4
            try:
                shipping_service.schedule_delivery(order)
            except ShippingException:
                inventory_service.release_stock(order.items)  # compensate step 3
                payment_service.refund(payment.id)             # compensate step 2
                order_service.cancel_order(order.id)           # compensate step 1
                raise SagaFailedException("Shipping unavailable")

            order_service.confirm_order(order.id)
            return order

        except SagaFailedException as e:
            log_saga_failure(order_data, e)
            raise

The orchestrator maintains saga state — typically persisted so it can resume after a crash:

Saga State Table:
  saga_id | order_id | current_step | status
  saga_1  | order_42 | 3 (inventory) | IN_PROGRESS

If orchestrator crashes after step 3 completes:
  On restart, read saga state → resume from step 4
  (or run compensations for steps 1-3 if step 4 can't proceed)
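Crash recovery can be sketched by checkpointing after every step. Here a plain dict stands in for the saga state table, and the step names and run_from_saved_state function are assumptions for illustration.

```python
import json

STEPS = ["create_order", "charge_payment", "reserve_inventory", "schedule_shipping"]

def run_from_saved_state(store, saga_id, actions):
    """Run the remaining steps of a saga, persisting progress after each one.

    store:   dict-like persistence (stand-in for a database table)
    actions: maps step name -> zero-argument callable
    """
    state = json.loads(store.get(saga_id, '{"next_step": 0, "status": "IN_PROGRESS"}'))
    while state["next_step"] < len(STEPS):
        actions[STEPS[state["next_step"]]]()
        state["next_step"] += 1
        store[saga_id] = json.dumps(state)  # checkpoint after the step
    state["status"] = "COMPLETED"
    store[saga_id] = json.dumps(state)
    return state
```

There is still a gap between finishing a step and writing the checkpoint. A crash in that gap replays the step on restart, so persisted state reduces duplicate work but does not remove the need for idempotent steps.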

Advantages:

The entire workflow is visible in one place — easy to understand, modify, debug
Centralized error handling and retry logic
Saga state can be persisted and resumed after crashes

Disadvantages:

Orchestrator becomes a critical component — must be highly available
Services become aware of the orchestrator's API contract

Best for: Complex multi-step workflows with non-trivial compensation logic. Production order and booking flows of the kind Amazon and Uber run are commonly built as orchestration-based sagas, often on a workflow engine such as Temporal, AWS Step Functions, or Camunda.

Idempotency: The Non-Negotiable Requirement

Sagas involve retries — networks fail, services restart, messages get redelivered. Every step (and every compensation) must be idempotent: running it multiple times produces the same result as running it once.

Non-idempotent (dangerous):

def charge_card(card_id, amount):
    # No deduplication: if a timeout triggers a retry, the card is charged TWICE
    payment_gateway.charge(card_id, amount)

Idempotent (safe):

def charge_card(card_id, amount, idempotency_key):
    # The gateway deduplicates on the key, so a retry returns the first result
    payment_gateway.charge(card_id, amount, idempotency_key=idempotency_key)

Stripe's idempotency key pattern (industry standard):

import stripe

order_id = 42  # hypothetical order identifier
# Derive the key from the order so every retry sends the same value
idempotency_key = f"order_{order_id}_payment"

stripe.PaymentIntent.create(
    amount=5000,
    currency="usd",
    idempotency_key=idempotency_key  # Stripe deduplicates if seen before
)

If this request is sent twice (due to a retry), Stripe recognizes the idempotency key and returns the original result without charging again. This single technique prevents one of the most costly bugs in payment flows: double charges.

Compensating transactions must also be idempotent. If "release driver" is sent twice (retry), the second call should be a safe no-op — not an error, and definitely not "release a different driver."
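On the receiving side, idempotency usually means remembering what was already done. The sketch below uses in-memory stand-ins with hypothetical names (PaymentLedger, DriverPool); a real service would keep the key-to-result mapping in durable storage.

```python
class PaymentLedger:
    """Toy server-side store that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency_key -> stored result
        self.charges = []  # side effects actually performed

    def charge(self, idempotency_key, card_id, amount_cents):
        if idempotency_key in self.results:
            # Replay: return the first result, perform no new side effect.
            return self.results[idempotency_key]
        self.charges.append((card_id, amount_cents))
        result = {"charge_id": f"ch_{len(self.charges)}", "amount_cents": amount_cents}
        self.results[idempotency_key] = result
        return result

class DriverPool:
    """Compensation target: releasing a driver twice must be harmless."""

    def __init__(self):
        self.on_trip = set()

    def assign(self, driver_id):
        self.on_trip.add(driver_id)

    def release(self, driver_id):
        # discard() is a no-op if the driver was already released.
        self.on_trip.discard(driver_id)
```

The release operation names the specific driver rather than "the current driver", which is what keeps a duplicate delivery from releasing someone else.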

Real-World Example: Amazon Order Fulfillment

Amazon's order fulfillment saga (simplified) looks like this:

1. Order Service: Create order (status: PLACED)
2. Payment Service: Authorize payment (hold funds, don't capture yet)
3. Inventory Service: Reserve items across warehouses
4. Fulfillment Service: Generate pick/pack/ship instructions
5. Payment Service: Capture payment (now actually charge)
6. Shipping Service: Hand off to carrier
7. Order Service: Update status to SHIPPED

Compensation scenarios:
- If inventory unavailable after payment auth → release auth (no charge happened yet)
- If fulfillment fails after capture → refund + cancel order
- If item damaged before shipping → refund + restock + notify customer

Notice step 2 (authorize) vs step 5 (capture) — this is a deliberate design choice. Authorization holds funds without charging. This gives the saga a "soft" compensation option (release the hold) for the early failure scenarios, and only "hard" compensation (refund) is needed for failures after the actual charge.

This pattern — separating authorization from capture — is one of the most important saga design techniques for payment flows. It buys you a cheap, reversible step before the expensive, harder-to-reverse step.
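The "soft versus hard compensation" idea can be modeled as a small state machine. The state names and the PaymentHold class are assumptions for illustration, not any payment provider's API.

```python
class PaymentHold:
    """AUTHORIZED -> CAPTURED -> REFUNDED, or AUTHORIZED -> VOIDED."""

    def __init__(self, amount_cents):
        self.amount_cents = amount_cents
        self.state = "AUTHORIZED"  # funds held, nothing charged yet

    def capture(self):
        if self.state != "AUTHORIZED":
            raise ValueError(f"cannot capture from {self.state}")
        self.state = "CAPTURED"

    def compensate(self):
        """Pick the cheapest valid undo for the current state."""
        if self.state == "AUTHORIZED":
            self.state = "VOIDED"  # soft: release the hold
        elif self.state == "CAPTURED":
            self.state = "REFUNDED"  # hard: money moves back
        # VOIDED or REFUNDED: already compensated, so this is a no-op.
        return self.state
```

Because compensate() is a no-op once the hold is voided or refunded, it is also safe to retry, which ties back to the idempotency requirement.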

Interview Scenario: "Handle Payment Spanning 3 Microservices"

Q: A user purchase involves Order Service, Payment Service, and Inventory Service. How do you ensure consistency?

"I'd implement this as a Saga rather than attempting a distributed transaction. Given the complexity — three services, multiple failure scenarios — I'd lean toward an orchestration-based saga rather than choreography, so the workflow logic lives in one place and is easy to reason about.

The sequence would be: create the order in PENDING state, authorize (not capture) payment, reserve inventory, then capture payment and confirm the order. I'd use authorization-before-capture so early failures (inventory unavailable) only require releasing the auth hold — no refund needed.

Every step and compensation needs an idempotency key, because the orchestrator will retry on failures, and I need to guarantee a retried 'charge card' doesn't double-charge.

I'd persist the saga state after each step so that if the orchestrator crashes, it can resume from where it left off rather than restarting the whole flow — which could cause duplicate charges or duplicate inventory reservations if not handled carefully."

This answer demonstrates: knowledge of the pattern, a clear architectural choice with justification, awareness of idempotency, and crash-recovery thinking — exactly what the "Top 1%" checklist from our syllabus describes.

Key Takeaways

2PC is an anti-pattern for microservices — it creates blocking locks across services and cascading availability failures.
Saga pattern: break distributed transactions into local transactions per service, with compensating transactions for rollback.
Choreography sagas: event-driven, decoupled, best for simple 2-4 step flows.
Orchestration sagas: centralized coordinator, explicit workflow, best for complex flows with non-trivial compensation — most production systems use this.
Idempotency is mandatory — every step and compensation must handle retries safely. Use idempotency keys (Stripe's pattern is the gold standard).
Authorize-then-capture for payments gives you a cheap, reversible early step before the costly, harder-to-reverse final step.
Persist saga state so orchestrators can recover from crashes without duplicating side effects.

You've now covered the entire async communication layer: Message Queues (how services talk without blocking), Event-Driven Architecture (how systems react to "things that happened"), and Sagas (how distributed transactions actually work in microservices). Together, these three topics explain how every large-scale system coordinates work across dozens of independent services.

Next, we move into Microservices Infrastructure: Monolith vs Microservices trade-offs, Service Discovery, and Fault Tolerance Patterns like Circuit Breakers and Bulkheads. How to actually run hundreds of services in production without them taking each other down.

Tags: system-design microservices saga-pattern distributed-systems backend software-architecture interview-prep

System Design - 14. Event-Driven Architecture: Event Sourcing, CQRS, and the Outbox Pattern Explained

Rajkiran — Fri, 12 Jun 2026 11:07:41 +0000

Event-Driven Architecture: Event Sourcing, CQRS, and the Outbox Pattern Explained

Covers: Event Sourcing, CQRS, Outbox Pattern, Choreography vs Orchestration

The Bank That Never Stores a Balance

Here's something that surprises most engineers: many banking systems don't store your account balance as a number in a database row.

Instead, they store every transaction that ever happened — every deposit, withdrawal, transfer, fee — as an immutable event. Your "balance" is computed by replaying all those events.

Account 12345 events:
  2024-01-01: DEPOSIT +1000
  2024-01-05: WITHDRAWAL -200
  2024-01-10: DEPOSIT +500
  2024-01-15: WITHDRAWAL -150

Balance = 1000 - 200 + 500 - 150 = 1150

Why would anyone do this instead of just storing balance: 1150?

Because the event log gives you something a single number never can: a complete, immutable, auditable history of everything that ever happened. You can answer "what was my balance on January 8th?" You can detect fraud by analyzing transaction patterns. You can replay history to debug a discrepancy.
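Using the four events listed for account 12345, both the current balance and a point-in-time balance fall out of the same fold. The tuple layout and the balance function are assumptions chosen for brevity.

```python
from datetime import date

EVENTS = [
    (date(2024, 1, 1), "DEPOSIT", 1000),
    (date(2024, 1, 5), "WITHDRAWAL", -200),
    (date(2024, 1, 10), "DEPOSIT", 500),
    (date(2024, 1, 15), "WITHDRAWAL", -150),
]

def balance(events, as_of=None):
    """Sum signed amounts, optionally ignoring events after `as_of`."""
    return sum(
        amount
        for day, _kind, amount in events
        if as_of is None or day <= as_of
    )

current = balance(EVENTS)                          # all four events
on_jan_8 = balance(EVENTS, as_of=date(2024, 1, 8))  # first two events only
```

The January 8th answer (1000 - 200 = 800) is not stored anywhere. It exists only because the history was kept.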

This is event sourcing — and it's one piece of a broader architectural philosophy called event-driven architecture.

The Core Idea: Events as Facts

In traditional architecture, your database stores current state. An UPDATE statement overwrites the old value — it's gone forever.

In event-driven architecture, you store events — immutable facts about things that happened. State is derived from events, not stored directly (or stored as a cache of the derived state).

Traditional:
  UPDATE accounts SET balance = 1150 WHERE id = 12345
  (Previous balance 1300 is lost — no record of the $150 withdrawal)

Event-Driven:
  INSERT INTO events (account_id, type, amount, timestamp)
  VALUES (12345, 'WITHDRAWAL', -150, '2024-01-15T10:30:00Z')
  (The event is permanent. Balance is computed by replaying events.)

This single shift — from "store current state" to "store the history of changes" — unlocks several powerful patterns.

Event Sourcing: Store History, Derive State

Event Sourcing is the pattern of persisting all changes to application state as a sequence of events, and reconstructing current state by replaying those events.

How It Works

Event Store (append-only log):
┌────────────────────────────────────────────┐
│ OrderCreated   { order_id: 1, items: [...] }│
│ ItemAdded      { order_id: 1, item: "X" }   │
│ PaymentReceived{ order_id: 1, amount: 50 }  │
│ OrderShipped   { order_id: 1, carrier: "Y" }│
└────────────────────────────────────────────┘
            ↓ replay events in order
┌────────────────────────────────────────────┐
│ Current State:                              │
│ Order #1: items=[...], paid=true,           │
│           status="shipped"                  │
└────────────────────────────────────────────┘

To get the current state of Order #1, you replay all events for that order, applying each one in sequence.
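Replaying is typically written as an apply function per event type plus a fold over the stream. The sketch below mirrors the four order events in the diagram; the payload shapes and state fields are assumptions.

```python
def apply(state, event):
    """Return the new state after applying one (kind, data) event."""
    kind, data = event
    if kind == "OrderCreated":
        return {"items": list(data["items"]), "paid": False, "status": "created"}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + [data["item"]]}
    if kind == "PaymentReceived":
        return {**state, "paid": True}
    if kind == "OrderShipped":
        return {**state, "status": "shipped", "carrier": data["carrier"]}
    raise ValueError(f"unknown event type: {kind}")

def replay(events, state=None):
    """Fold events in order, starting from `state` (None for a new order)."""
    for event in events:
        state = apply(state, event)
    return state
```

Each call returns a new dict instead of mutating the old one, so replaying only the first N events gives the state as of that point with no extra work.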

Snapshots: Avoiding Replaying Everything

If an order has 10,000 events (unlikely, but imagine a long-lived entity like a user account with years of activity), replaying all of them on every read is slow.

Snapshots solve this — periodically save the computed state, then only replay events since the snapshot:

Snapshot at event #9000: { state at that point }
                ↓
Replay events #9001 - #10000 (only 1000 events, not 10000)
                ↓
Current state
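A snapshot is just a saved (position, state) pair. In this sketch events are plain signed amounts and the state is a running total, both simplifying assumptions; take_snapshot and current_state are hypothetical helper names.

```python
def take_snapshot(events, upto):
    """Return (number_of_events_applied, state) for the first `upto` events."""
    return (upto, sum(events[:upto]))

def current_state(events, snapshot=None):
    """Return (state, events_replayed), starting from a snapshot if given."""
    start, total = snapshot if snapshot else (0, 0)
    tail = events[start:]  # only events after the snapshot position
    return total + sum(tail), len(tail)
```

With 10,000 events and a snapshot taken at 9,000, the read replays 1,000 events and produces the same state as a full replay.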

Why Event Sourcing Is Powerful

1. Complete audit trail
Every change is recorded with who, what, when. Critical for compliance (finance, healthcare).

2. Time travel debugging
"What did this order look like before the bug was introduced?" — replay events up to that point in time.

3. Temporal queries
"What was the user's subscription status on March 15th?" — replay events up to March 15th.

4. Multiple projections from one source
The same event stream can generate different "views" — a dashboard view, an analytics view, a search index — all derived independently from the same events.
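A temporal query like the subscription example can be sketched as "replay, but stop at a date". The event tuples and the `status_as_of` function below are hypothetical, and the events are assumed to be sorted by time.

```python
from datetime import date

# (when, event kind, plan) tuples, oldest first.
events = [
    (date(2024, 1, 10), "Subscribed", "basic"),
    (date(2024, 3, 1), "Upgraded", "pro"),
    (date(2024, 4, 2), "Cancelled", None),
]


def status_as_of(events, as_of):
    """Replay only the events that happened on or before as_of."""
    status = "none"
    for when, kind, plan in events:
        if when > as_of:
            break  # later events are ignored
        status = "cancelled" if kind == "Cancelled" else plan
    return status


march_15 = status_as_of(events, date(2024, 3, 15))
```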

The Costs

1. Complexity
Reconstructing state from events is more complex than SELECT * FROM table.

2. Schema evolution is hard
If your event format changes, you need to handle old event formats when replaying historical events.

3. Eventual consistency
Projections (derived views) may lag behind the event stream slightly.
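One common way to cope with schema evolution is to "upcast" old events to the latest format just before applying them. The sketch below assumes a made-up change (version 1 stored a single `name`, version 2 splits it into two fields); the field names and version numbers are illustrative only.

```python
def upcast(event):
    """Return the event in the latest format (version 2 in this sketch)."""
    version = event.get("version", 1)  # events without a version are treated as v1
    if version == 1:
        first, _, last = event["data"]["name"].partition(" ")
        return {
            "type": event["type"],
            "version": 2,
            "data": {"first_name": first, "last_name": last},
        }
    return event


old_event = {"type": "UserRegistered", "data": {"name": "Ada Lovelace"}}
new_event = upcast(old_event)
```

The replay code then only ever has to understand the newest format, while the stored history stays untouched.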

Real examples: banking ledgers, Git as a loose analogy (commits form an immutable, append-only history and your working directory is the "current state" derived from it, although Git stores snapshots rather than replayable change events), and the Axon Framework (a Java event sourcing framework used in enterprise systems).

CQRS: Separating Reads from Writes

CQRS (Command Query Responsibility Segregation) separates the model used for writing data (Commands) from the model used for reading data (Queries).

The Problem CQRS Solves

In a traditional system, the same database table serves both writes and reads:

Single Model:
  Write: INSERT INTO orders (...)
  Read:  SELECT * FROM orders WHERE user_id = ? ORDER BY date DESC

But writes and reads often have very different requirements:

Writes need to be fast, validated, transactional
Reads need to be fast, denormalized, optimized for specific UI views — often aggregating data from multiple sources

Trying to satisfy both with one schema leads to compromises on both sides.

The CQRS Solution

Commands (Writes)              Queries (Reads)
       ↓                              ↑
[Write Model / DB]  ──events──► [Read Model / DB]
  Normalized,                    Denormalized,
  transactional,                 optimized per view,
  validates business rules       can be multiple specialized stores

Concrete example — e-commerce order system:

Write side (PostgreSQL, normalized):
  orders table, order_items table, customers table
  → Strict foreign keys, ACID transactions, business rule validation

Read side (multiple specialized views, built from events):
  - "Order History" view (Elasticsearch — fast full-text search)
  - "Admin Dashboard" view (denormalized SQL — pre-joined for reports)
  - "Customer Order Count" view (Redis — instant counter lookups)

Each read view is updated asynchronously when write-side events occur. The write side stays clean and normalized. The read side is optimized for whatever each specific screen needs — even if that means redundant, denormalized copies of data.

CQRS + Event Sourcing: A Natural Pair

These two patterns are often used together, though each can be adopted without the other:

1. Command arrives: "Place Order"
2. Write side validates, persists event: OrderPlaced
3. Event published to event bus (Kafka)
4. Read-side projections consume the event:
   - Search index adds the order
   - Analytics dashboard updates order count
   - Customer's "recent orders" cache updates
5. Each read view is eventually consistent with the write side
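The five steps above can be sketched in memory. Everything here is a stand-in: a Python list plays the event store, two dictionaries play the read models, and the projection runs synchronously where a real system would consume from the event bus asynchronously.

```python
from collections import defaultdict

event_log = []                      # write side: append-only events
order_count = defaultdict(int)      # read model 1: orders per customer
recent_orders = defaultdict(list)   # read model 2: newest orders first


def place_order(customer_id, order_id, total):
    """Command handler: validate, then record an OrderPlaced event."""
    if total <= 0:
        raise ValueError("order total must be positive")
    event = {
        "type": "OrderPlaced",
        "customer_id": customer_id,
        "order_id": order_id,
        "total": total,
    }
    event_log.append(event)
    return event


def project(event):
    """Projection: update every read model affected by the event."""
    if event["type"] == "OrderPlaced":
        cid = event["customer_id"]
        order_count[cid] += 1
        # Keep only the five most recent order ids.
        recent_orders[cid] = ([event["order_id"]] + recent_orders[cid])[:5]


for event in (place_order(456, 123, 99.99), place_order(456, 124, 10.0)):
    project(event)
```

Queries read `order_count` or `recent_orders` directly and never touch the write side.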

When to use CQRS: Complex domains where read and write patterns are genuinely different — e-commerce (write: place order; read: browse history, search, recommendations), social media (write: post; read: feed, search, trending).

When NOT to use it: Simple CRUD applications. CQRS adds real complexity — don't introduce it unless reads and writes are genuinely pulling your data model in different directions.

The Outbox Pattern: Solving the Dual-Write Problem

Here's a subtle but critical bug pattern in event-driven systems.

The dual-write problem:

def place_order(order_data):
    db.insert("orders", order_data)          # Write 1: database
    kafka.publish("order-placed", order_data) # Write 2: message queue

    # PROBLEM: What if the process crashes between these two lines?
    # → Order exists in DB, but event was never published
    # → Downstream services never know about this order

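These are two separate systems (database and message broker), and no single transaction spans both, so the two writes cannot be made atomic without heavyweight machinery such as distributed transactions. If the database write succeeds but the Kafka publish fails (network blip, broker down, process crash), you have a "ghost order" that exists but nothing downstream knows about. Swapping the order of the two writes only flips the failure: an event gets published for an order that was never saved.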

The Outbox Pattern Solution

Write the event to an outbox table in the same database transaction as the business data. A separate process reads the outbox and publishes to Kafka.

BEGIN TRANSACTION;
  INSERT INTO orders (id, user_id, total) VALUES (123, 456, 99.99);
  INSERT INTO outbox (event_type, payload, status) 
    VALUES ('OrderPlaced', '{"order_id": 123, ...}', 'PENDING');
COMMIT;
-- Both inserts succeed or both fail. Atomic. Guaranteed.

A separate outbox processor (running continuously) reads pending outbox rows and publishes them to Kafka:

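import time

def outbox_processor():
    while True:
        # Oldest first, so events are published in the order they were written
        pending = db.query(
            "SELECT * FROM outbox WHERE status = 'PENDING' ORDER BY created_at"
        )
        for event in pending:
            kafka.publish(event.event_type, event.payload)
            # Mark the row only after the publish call has succeeded
            db.execute(
                "UPDATE outbox SET status = 'PUBLISHED' WHERE id = ?", 
                event.id
            )
        time.sleep(0.1)  # poll interval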

Why this works: The database transaction guarantees the order and the outbox event are written together, atomically. The outbox processor guarantees eventual publishing to Kafka. If the processor crashes mid-publish, the outbox row stays PENDING and is retried on restart. The flip side: a crash after publishing but before the status update means the same event is published twice, so delivery is at-least-once and downstream consumers should be idempotent.
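For a version you can actually run, here is a minimal sketch using SQLite from the standard library. The table layout is simplified, and `publish` is a plain callback standing in for a Kafka producer; neither is meant as the production setup.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT,
        payload TEXT,
        status TEXT DEFAULT 'PENDING'
    );
""")


def place_order(order_id, user_id, total):
    # "with conn" commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, total)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def drain_outbox(publish):
    """Publish pending rows in insertion order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE status = 'PENDING' ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, payload)  # if this raises, the row stays PENDING
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'PUBLISHED' WHERE id = ?", (row_id,)
            )
    return len(rows)


published = []
place_order(123, 456, 99.99)
drain_outbox(lambda event_type, payload: published.append((event_type, payload)))
```

Calling `drain_outbox` a second time finds nothing pending, and a failing `publish` leaves the row untouched for the next attempt.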

Debezium is a popular tool that implements this via Change Data Capture (CDC) — it watches the database's write-ahead log directly and publishes changes to Kafka, eliminating the need for a custom outbox processor entirely.

Choreography vs Orchestration

When an event triggers a chain of actions across multiple services, who's "in charge" of the workflow?

Choreography: No Central Coordinator

Each service listens for events and reacts independently. No service knows about the full workflow — each just does its part.

OrderPlaced event published
    ├──► Inventory Service: reserves stock → publishes StockReserved
    └──► Payment Service: charges card → publishes PaymentProcessed
              └──► Notification Service: (listens for PaymentProcessed) → sends email

Each service reacts to events. No one orchestrates the whole flow.
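A toy version of this flow, assuming an in-process event bus instead of a real broker. The `subscribe` and `publish` helpers and the three handler functions are invented for the illustration.

```python
from collections import defaultdict

handlers = defaultdict(list)   # event type -> subscribed handler functions
log = []                       # records what each service did, in order


def subscribe(event_type, handler):
    handlers[event_type].append(handler)


def publish(event_type, payload):
    # Copy the list so handlers can subscribe/publish while we iterate.
    for handler in list(handlers[event_type]):
        handler(payload)


def reserve_stock(order):
    log.append(f"inventory: reserved stock for order {order['order_id']}")
    publish("StockReserved", order)


def charge_card(order):
    log.append(f"payment: charged order {order['order_id']}")
    publish("PaymentProcessed", order)


def send_email(order):
    log.append(f"notification: emailed order {order['order_id']}")


# Each service only declares which events it cares about.
subscribe("OrderPlaced", reserve_stock)
subscribe("OrderPlaced", charge_card)
subscribe("PaymentProcessed", send_email)

publish("OrderPlaced", {"order_id": 1})
```

Notice that no function describes the whole workflow; the sequence only emerges from the subscriptions.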

Advantages:

Fully decoupled — services don't know about each other
Easy to add new participants (just subscribe to relevant events)
No single point of failure

Disadvantages:

Hard to see the overall flow — the "business process" is implicit, scattered across many services' event handlers
Debugging is hard — tracing a request through choreographed events requires distributed tracing
Easy to create circular dependencies (Service A reacts to Service B's event, which reacts to Service A's event...)

Orchestration: A Central Coordinator

A central orchestrator explicitly directs each step of the workflow.

OrderSaga (orchestrator):
  1. Call Inventory Service: reserve stock
     → wait for response
  2. Call Payment Service: charge card
     → wait for response
  3. Call Shipping Service: schedule delivery
     → wait for response
  4. Call Notification Service: send confirmation

If step 2 fails: orchestrator calls Inventory Service to release stock (compensating action)
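A bare-bones orchestrator with compensation might look like the sketch below. The `run_saga` function, the `StepFailed` exception, and the step names are hypothetical; real orchestrators also persist their progress so they can resume after a crash.

```python
class StepFailed(Exception):
    """Raised by a step's action when it cannot complete."""


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises StepFailed, run the compensations of the steps that
    already completed, in reverse order, and report failure.
    """
    completed = []
    history = []
    for name, action, compensate in steps:
        try:
            action()
            history.append(f"done:{name}")
            completed.append((name, compensate))
        except StepFailed:
            history.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                history.append(f"compensated:{done_name}")
            return False, history
    return True, history


def noop():
    pass


def fail_payment():
    raise StepFailed("card declined")


ok, history = run_saga([
    ("reserve_stock", noop, noop),
    ("charge_card", fail_payment, noop),
    ("schedule_delivery", noop, noop),
])
```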

Advantages:

Workflow is explicit and visible — read the orchestrator code to understand the whole process
Easier to handle complex error/retry/compensation logic
Easier to debug — one place to look

Disadvantages:

Orchestrator is a central point of coordination — if poorly designed, becomes a bottleneck
Services become coupled to the orchestrator's expectations

The general guideline: Choreography for simple event reactions (2-3 services, simple flows). Orchestration for complex multi-step business processes with compensation logic (this leads directly into the Saga pattern — our next topic).

Real-World Example: Activity Feed Using Events

How would you design Instagram's "Following" activity feed using event-driven architecture?

User performs action → publishes event:
  - PostCreated { user_id, post_id, timestamp }
  - PostLiked { user_id, post_id, liker_id }
  - UserFollowed { follower_id, followed_id }
  - CommentAdded { user_id, post_id, comment_id }

Activity Feed Service subscribes to all these events:
  - On PostCreated: notify followers → write to their feed projection
  - On PostLiked: append "X liked your post" to user's notification feed
  - On UserFollowed: append "X started following you"
  - On CommentAdded: append "X commented on your post"

Each user's activity feed is a CQRS read model — 
  built entirely by projecting these events into a 
  per-user feed table (Cassandra, partitioned by user_id).
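As a rough sketch of that projection, with dictionaries standing in for the per-user feed table and string user ids chosen just for the example:

```python
from collections import defaultdict

followers = defaultdict(set)   # followed_id -> set of follower ids
feeds = defaultdict(list)      # user_id -> feed entries, oldest first


def handle(event):
    """Project one event into the per-user feeds."""
    kind = event["type"]
    if kind == "UserFollowed":
        followers[event["followed_id"]].add(event["follower_id"])
        feeds[event["followed_id"]].append(
            f"{event['follower_id']} started following you"
        )
    elif kind == "PostCreated":
        # Fan-out on write: copy the entry into every follower's feed.
        for follower in sorted(followers[event["user_id"]]):
            feeds[follower].append(f"{event['user_id']} posted {event['post_id']}")
    elif kind == "PostLiked":
        feeds[event["user_id"]].append(f"{event['liker_id']} liked your post")


for event in [
    {"type": "UserFollowed", "follower_id": "bob", "followed_id": "alice"},
    {"type": "PostCreated", "user_id": "alice", "post_id": "p1"},
    {"type": "PostLiked", "user_id": "alice", "post_id": "p1", "liker_id": "bob"},
]:
    handle(event)
```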

This is broadly the shape of the architecture used by large-scale social platforms. The write path (creating posts, likes, follows) is decoupled from the read path (viewing feeds) via events.

Key Takeaways

Event-driven architecture treats "things that happened" (events) as the primary data, with state derived from them.
Event Sourcing: store the full history of events, reconstruct state by replaying. Gives you audit trails, time travel, and multiple derived views — at the cost of complexity.
CQRS: separate write models (normalized, transactional) from read models (denormalized, optimized per view). Pairs naturally with event sourcing.
Outbox Pattern: solves the dual-write problem — write business data and the event in the same DB transaction, publish asynchronously via a separate processor.
Choreography: decentralized, event-reactive — simple flows, fully decoupled.
Orchestration: centralized coordinator — complex flows, explicit logic, easier debugging.

What's Next

Topic 15 closes Day 5 with the Saga Pattern — how to handle transactions that span multiple microservices, why Two-Phase Commit is considered an anti-pattern in modern architectures, and how Uber coordinates trip booking across a dozen services without ever locking a database.

Tags: system-design event-driven-architecture cqrs event-sourcing backend distributed-systems interview-prep

System Design - 13. Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever

Rajkiran — Thu, 11 Jun 2026 17:33:20 +0000

Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever

Covers: Point-to-Point vs Pub-Sub, Kafka Internals, Delivery Guarantees, Dead Letter Queues, Backpressure

The Upload That Broke Everything

Around 2010, LinkedIn's activity feed was choking. Every time a user updated their profile, viewed a connection, or clicked an article, the system needed to:

Update the activity feed
Recalculate recommendations
Notify relevant connections
Update search indexes
Log the event for analytics

All synchronously. All in the same request. All blocking the user from getting a response until every downstream system confirmed success.

When traffic spiked, the whole chain collapsed. One slow downstream system stalled every user action behind it.

The engineering team asked a radical question: "Does the user really need to wait for all of this?"

The answer was no. The user needed to know their profile was saved. Everything else — feed updates, recommendations, notifications — could happen a few seconds later without the user caring.

That insight led to the creation of Apache Kafka. And it fundamentally changed how large-scale systems handle communication between services.

What Is a Message Queue?

A message queue is a component that allows services to communicate asynchronously — one service produces a message, the queue stores it, and another service consumes it later.

Without queue (synchronous):
[User Action] → Service A → Service B → Service C → Service D → Response
                            ↑ if any service is slow, user waits

With queue (asynchronous):
[User Action] → Service A → [Queue] → Response (immediate)
                                ↓           ↓
                            Service B    Service C    Service D
                            (processes later, independently)

The user gets an instant response. The downstream work happens in the background, decoupled from the user's request lifecycle.

This unlocks three fundamental capabilities:

Decoupling: Producer doesn't know or care who consumes its messages. Add a new consumer without changing the producer at all.

Load leveling: Traffic spikes fill the queue rather than overwhelming consumers. Consumers process at their own pace.

Resilience: If a consumer is down, messages accumulate in the queue and are processed when it recovers. Nothing is lost, provided the queue stores messages durably and the consumer comes back before they expire.

Two Fundamental Models

Point-to-Point (Queue Model)

Each message is consumed by exactly one consumer. Once consumed, it's gone.

Producer → [Queue] → Consumer A
                  ← (message removed after consumption)

If you have multiple consumers, each message goes to only one of them (competing consumers pattern):

Producer → [Queue] → Consumer A  (takes message 1)
                  → Consumer B  (takes message 2)
                  → Consumer C  (takes message 3)

Use case: Task queues. You want each job handed to a single worker. Order processing, email sending, image resizing: each task should be handled by one worker, not three.
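The competing consumers pattern can be imitated with the standard library: one shared `queue.Queue` and three worker threads. This is only an in-process analogy for a broker-backed queue, and the worker names are arbitrary.

```python
import queue
import threading

jobs = queue.Queue()
for job_id in range(9):
    jobs.put(job_id)

handled = []               # (worker_name, job_id) pairs
lock = threading.Lock()


def worker(name):
    while True:
        try:
            job_id = jobs.get_nowait()   # each job is handed to one worker only
        except queue.Empty:
            return                       # nothing left to do
        with lock:
            handled.append((name, job_id))
        jobs.task_done()


threads = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Which worker gets which job varies from run to run, but every job is taken by exactly one of them.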

RabbitMQ is the canonical point-to-point queue. When you push a job to a RabbitMQ queue, exactly one worker picks it up and processes it.

Publish-Subscribe (Pub-Sub Model)

Each message is delivered to all subscribers. Producers publish to a topic; every subscriber to that topic gets every message.

Producer publishes "user.signup" event
    ↓
[Topic: user.signup]
    ├──► Email Service       (sends welcome email)
    ├──► Analytics Service   (records signup event)
    ├──► Recommendations     (initializes user model)
    └──► Notification Service (sends push notification)

All four consumers get the same message. Adding a fifth consumer (say, a fraud detection service) requires zero changes to the producer or existing consumers.

Use case: Event broadcasting. One thing happened; many systems need to know. This is the architecture behind every event-driven system.

Apache Kafka is the dominant pub-sub system. Let's go deep on how it works.

Kafka Internals: The Engine Behind Modern Data Pipelines

Kafka is not just a message queue — it's a distributed commit log. Understanding its internals is what separates senior engineers from those who just know the vocabulary.

Topics and Partitions

A topic is a named stream of messages (e.g., user-signups, order-placed, payment-completed).

A topic is divided into partitions: ordered, immutable sequences of messages. Partitions are spread across brokers (servers), and one broker can host several partitions:

Topic: "order-placed" (4 partitions across 3 brokers)

Broker 1: [Partition 0] msg0, msg4, msg8, msg12...
Broker 2: [Partition 1] msg1, msg5, msg9, msg13...
Broker 3: [Partition 2] msg2, msg6, msg10, msg14...
Broker 1: [Partition 3] msg3, msg7, msg11, msg15...

Partitions enable parallel processing: multiple consumers can read from different partitions simultaneously, so throughput scales roughly with partition count until brokers, network, or consumers become the bottleneck.

Partition key: When producing a message, you can specify a key. Messages with the same key always go to the same partition, as long as the topic's partition count does not change:

producer.send(
    topic="order-placed",
    key=b"user_12345",      # All orders from user 12345 → same partition
    value=order_json
)

This guarantees ordering per key — all events for a given user are processed in sequence. Critical for correctness (you don't want "order cancelled" processed before "order placed").
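Under the hood, key-based routing is just "hash the key, take it modulo the partition count". The sketch below uses CRC32 from the standard library as a stand-in hash; Kafka's Java client uses murmur2, so the actual partition numbers will differ.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2."""
    return zlib.crc32(key) % num_partitions


first = partition_for(b"user_12345", 4)
second = partition_for(b"user_12345", 4)   # same key, same partition
```

Because the result depends on `num_partitions`, adding partitions to a topic later can move a key to a different partition, which is why per-key ordering only holds while the partition count stays fixed.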

Offsets: The Bookmark System

Every message in a partition has a sequential offset — an integer that uniquely identifies its position.

Partition 0: [offset 0][offset 1][offset 2][offset 3][offset 4]...
                msg_A    msg_B    msg_C    msg_D    msg_E

Consumers track their offset — which message they've processed up to. This offset is stored in Kafka itself (in a special __consumer_offsets topic).

The replay superpower: Because messages are persisted on disk (not deleted after consumption), consumers can:

Rewind to any offset and reprocess historical messages
A new service joining today can process all events from Day 1
After a bug fix, replay the last 24 hours of messages through the fixed code

This is something a classic RabbitMQ queue cannot do, because messages are deleted once they are acknowledged (RabbitMQ's newer stream queues are the exception). Kafka's log retention (configurable, default 7 days) makes it a time machine.
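To make the offset idea concrete, here is a tiny in-memory model of one partition and one consumer. The `PartitionLog` and `Consumer` classes are hypothetical and ignore persistence, retention, and offset commits.

```python
class PartitionLog:
    """Append-only log; reading never removes messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def read(self, offset, max_messages=100):
        return self._messages[offset:offset + max_messages]


class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0   # next offset to read

    def poll(self, max_messages=100):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset   # rewind (or skip ahead) to any offset


log = PartitionLog()
for message in ["msg_A", "msg_B", "msg_C"]:
    log.append(message)

consumer = Consumer(log)
first_pass = consumer.poll()
consumer.seek(1)            # rewind to offset 1
replayed = consumer.poll()
```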

Consumer Groups

A consumer group is a set of consumers that collectively process a topic's partitions. Each partition is assigned to exactly one consumer in the group:

Topic: "order-placed" (4 partitions)
Consumer Group: "payment-service" (2 consumers)

Consumer 1 → Partition 0, Partition 1
Consumer 2 → Partition 2, Partition 3

Scaling: Add more consumers to the group → each handles fewer partitions → higher throughput. You can scale up to as many consumers as there are partitions; any consumers beyond that number sit idle.
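A simplified assignment function shows both the split and the idle-consumer limit. It is loosely modeled on range-style assignment; the `assign` function is made up for this example, and real Kafka assignors also deal with rebalancing and multiple topics.

```python
def assign(num_partitions, consumers):
    """Give each consumer a contiguous block of partitions.

    Earlier consumers get one extra partition when the split is uneven.
    """
    assignment = {}
    base, extra = divmod(num_partitions, len(consumers))
    start = 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


two = assign(4, ["consumer-1", "consumer-2"])
five = assign(4, ["c1", "c2", "c3", "c4", "c5"])
```

With five consumers and four partitions, the fifth consumer ends up with an empty list: it is part of the group but has nothing to read.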

Multiple groups, same topic: Different services can each have their own consumer group, all reading the same topic independently:

Topic: "order-placed"
├── Consumer Group "payment-service"    → processes all orders
├── Consumer Group "inventory-service"  → processes all orders
└── Consumer Group "analytics-service"  → processes all orders

All three groups get every message. None of them interfere with each other. This is the pub-sub model in action.

Delivery Guarantees: The Triangle of Trust

Every messaging system makes promises about delivery. Understanding these promises is critical for system design.

At-Most-Once

Message is delivered zero or one times. The message is acknowledged (or its offset committed) before processing finishes, so if the consumer crashes mid-processing, the message is lost.

Producer → Queue → Message acknowledged on delivery
                → Consumer starts processing
                → Consumer crashes mid-processing
                → Message NOT retried
                → Message lost forever

When to use: Metrics, logs, analytics events — where losing occasional messages is acceptable and duplicates are worse than losses. Very high throughput. Very low overhead.

At-Least-Once

Message is delivered one or more times. On failure, it's retried. Duplicates are possible.

Producer → Queue → Consumer processes message
                → Consumer sends ACK
                → Network drops the ACK
                → Queue doesn't receive ACK
                → Queue retries message
                → Consumer processes message AGAIN (duplicate!)

When to use: Most production systems. The standard default. Your consumers must be idempotent — processing the same message twice produces the same result as processing it once.

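# Idempotent consumer example (db and charge_card are placeholders):
def process_payment(payment_id, amount):
    # Check if already processed (idempotency key)
    if db.exists(f"processed_payment:{payment_id}"):
        return  # Already done, skip safely

    # Process payment, then record the key so a redelivery is skipped
    charge_card(amount)
    db.set(f"processed_payment:{payment_id}", True)
    # Caveat: a crash between charge_card and db.set still allows a repeat
    # charge on redelivery. Production code records the key atomically with
    # the side effect, or passes the key to the payment provider.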
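One way to make the check and the write atomic is to let a unique constraint do the deduplication inside a single transaction. The sketch below uses SQLite and assumes the side effect is a row in the same database; the table names are invented. For an external call such as a card charge, you would typically pass the idempotency key to the provider instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_payments (payment_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE ledger (payment_id TEXT, amount INTEGER)")
conn.commit()


def process_payment(payment_id, amount):
    """Return True if the payment was applied, False if it was a duplicate."""
    try:
        with conn:  # one transaction: marker row and ledger entry together
            conn.execute(
                "INSERT INTO processed_payments VALUES (?)", (payment_id,)
            )
            conn.execute(
                "INSERT INTO ledger VALUES (?, ?)", (payment_id, amount)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary key already present: duplicate delivery


first = process_payment("p1", 150)
duplicate = process_payment("p1", 150)   # same message delivered again
```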

Exactly-Once

Message is delivered and processed exactly once, even in the face of failures. No duplicates, no losses.

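The hardest guarantee. Kafka achieves it, for pipelines that read from and write to Kafka, through:

Idempotent producers: Kafka deduplicates producer retries using sequence numbers
Transactional API: write to multiple partitions atomically
Transactional read-process-write: consumed offsets are committed in the same transaction as the produced output, and downstream consumers read with isolation.level=read_committed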

producer = KafkaProducer(
    enable_idempotence=True,           # Dedup producer retries
    transactional_id="payment-producer-1"
)

producer.init_transactions()
producer.begin_transaction()
try:
    producer.send("payments", payment_data)
    producer.send("audit-log", audit_data)
    producer.commit_transaction()  # Both or neither
except Exception:
    producer.abort_transaction()

When to use: Financial transactions, payment processing, inventory deduction — anywhere duplicates cause real harm (double-charging a customer, overselling stock).

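The cost: Lower throughput than at-least-once. More complex implementation. The guarantee also stops at Kafka's boundary: an external side effect such as a card charge still needs an idempotency key.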

Dead Letter Queue: The Safety Net

What happens to messages that consistently fail processing? Without a safety net, they can block the queue forever (a "poison pill" message).

A Dead Letter Queue (DLQ) is a special queue where failed messages are sent after N retry attempts:

Normal Queue → Consumer fails to process message
→ Retry (attempt 2)
→ Retry (attempt 3)  ← max retries reached
→ Move to DLQ

DLQ: message sits here for manual inspection or automated alert
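The retry-then-park flow can be sketched with two in-memory collections. The `consume` function and the "poison" message are illustrative; a real broker tracks attempt counts for you.

```python
from collections import deque

MAX_ATTEMPTS = 3


def consume(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages; park any that fail max_attempts times in the DLQ."""
    main_queue = deque((message, 0) for message in messages)
    processed, dead_letters = [], []
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(message)             # give up, move to DLQ
            else:
                main_queue.append((message, attempts))   # retry later
    return processed, dead_letters


def handler(message):
    if message == "poison":
        raise ValueError("cannot parse message")


processed, dead_letters = consume(["a", "poison", "b"], handler)
```

The healthy messages on either side of the poison pill still get processed; only the failing one ends up parked.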

Why this matters:

Main queue keeps flowing — the poison pill doesn't block other messages
Failed messages aren't lost — they're in the DLQ for investigation
Engineers get alerted to DLQ growth → investigate the root cause
After fixing the bug, messages can be replayed from DLQ back to the main queue

AWS SQS DLQ config:

{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq",
  "maxReceiveCount": 3
}

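After 3 failed receives, SQS moves the message to my-dlq. A CloudWatch alarm on the DLQ's depth (configured separately) can alert engineers. Messages stay in the DLQ for its configured retention period, which SQS allows up to 14 days (the default is 4 days).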

Backpressure: When Consumers Can't Keep Up

If producers emit 10,000 messages/second but consumers can only process 1,000/second, the queue grows indefinitely. Eventually: out-of-memory, disk full, system collapse.

Backpressure is the mechanism by which a system signals upstream components to slow down.

Pull-based consumption (Kafka's model):
Consumers pull messages at their own pace, so they never receive more than they asked for. If a consumer is slow, it simply reads fewer messages and the topic absorbs the backlog, up to its retention limit.

Fast producer → [Kafka topic, growing backlog]
Slow consumer → pulls 100 messages at a time, processes, pulls next 100
→ Consumer naturally throttles itself

Push-based queues (RabbitMQ):
The broker pushes messages to consumers. prefetch count limits how many unacknowledged messages a consumer can hold:

channel.basic_qos(prefetch_count=10)
# Consumer receives at most 10 messages before it must ACK some
# Prevents overwhelming a slow consumer

Application-level backpressure:
When the queue depth exceeds a threshold, producers are asked to slow down:

Queue depth > 1 million messages → alert producer service
Producer service → reduce emission rate by 50%
Queue depth decreasing → producer returns to full rate
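A small simulation makes the feedback loop visible. The thresholds, rates, and the `simulate` function are arbitrary choices for the example, not values from any particular system.

```python
def simulate(ticks, full_rate, consume_rate, high_water, low_water):
    """Return the queue depth after each tick of a throttled producer."""
    depth, rate, history = 0, full_rate, []
    for _ in range(ticks):
        depth = max(0, depth + rate - consume_rate)
        if depth > high_water:
            rate = full_rate // 2   # signal received: slow down by 50%
        elif depth < low_water:
            rate = full_rate        # backlog drained: back to full rate
        history.append(depth)
    return history


# Producer emits 100/tick at full rate, consumer handles 80/tick.
history = simulate(ticks=10, full_rate=100, consume_rate=80,
                   high_water=50, low_water=20)
```

Without the throttle the depth would climb by 20 every tick; with it, the depth oscillates and stays bounded.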

Stream processors such as Spark Streaming and Flink ship their own built-in backpressure mechanisms that apply the same idea: slow the source down when downstream operators fall behind.

Kafka vs RabbitMQ: The Real Comparison

Feature	Kafka	RabbitMQ
Model	Pub-Sub (log-based)	Point-to-point + Pub-Sub
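Message retention	Persisted on disk (days/weeks)	Deleted after acknowledgment (classic queues)
Replay	Yes, rewind to any retained offset	Not with classic queues (stream queues support it)
Throughput	Up to millions/sec per cluster (workload dependent)	Roughly tens of thousands/sec per queue (workload dependent)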
Consumer model	Pull (consumer controls pace)	Push (broker sends to consumer)
Ordering	Per partition	Per queue
Routing	Topic + partition key	Flexible exchange-based routing
Use case	Event streaming, data pipelines	Task queues, complex routing

Choose Kafka when:

High throughput (100K+ messages/second)
You need replay / event sourcing
Multiple independent consumers need the same events
Building a data pipeline (Kafka → Spark/Flink → data warehouse)

Choose RabbitMQ when:

Complex routing logic (route by message header, content, priority)
Task queue semantics (each job done by exactly one worker)
Lower throughput requirements
Need per-message TTL, priority queues, or delayed delivery

Key Takeaways

Message queues decouple producers from consumers, enabling async processing, load leveling, and resilience.
Point-to-point (RabbitMQ): one consumer per message. For task queues.
Pub-Sub (Kafka): all consumers get every message. For event broadcasting.
Kafka internals: topics → partitions → offsets. Consumer groups enable parallel processing. Partition keys guarantee per-key ordering.
Delivery guarantees: At-most-once (lossy, fast), At-least-once (default, needs idempotency), Exactly-once (strong, costly).
DLQ prevents poison pills from blocking queues — failed messages park here for investigation.
Backpressure prevents fast producers from overwhelming slow consumers — Kafka's pull model handles this naturally.

What's Next

Topic 14 goes deeper into the architectural pattern that Kafka enables: Event-Driven Architecture — event sourcing, CQRS, the outbox pattern, and how to build systems where "something happened" is the fundamental primitive.

Tags: system-design kafka message-queues backend distributed-systems event-driven interview-prep