S, Sanjay

Distributed Systems: Where Physics, Murphy's Law, and Your Career Collide πŸ’₯

🎬 The Interview Question That Breaks People

"Design a system that handles 100,000 requests per second with 99.99% availability across multiple regions."

Silence. Sweating. "Uh... load balancer?"

Here's the thing β€” distributed systems aren't magic. They're a collection of patterns applied to specific problems. Once you learn the patterns, the interview question becomes solvable. And more importantly, the 3 AM production issue becomes debuggable.

Let's learn the patterns that power the internet.


πŸ§ͺ The Fundamental Laws You Can't Break

CAP Theorem: Pick Two (But Actually Pick One)

In a distributed system, when a network partition happens (and it WILL), you must choose between:

                    Consistency
                     (C)
                      /\
                     /  \
                    /    \
                   / Pick \
                  /  two   \
                 /    but   \
                /   actually \
               /   one since  \
              /  partitions    \
             /   always happen  \
            /                    \
    Availability ──────────── Partition
        (A)                  Tolerance (P)

In plain English:

CP (Consistency + Partition Tolerance):
  "I'd rather refuse a request than give you wrong data."
  Examples: Banking systems, inventory counts, etcd
  When a network partition happens β†’ some requests fail β†’ but data is always correct

AP (Availability + Partition Tolerance):
  "I'd rather give you possibly-stale data than refuse your request."
  Examples: Shopping cart, social media feed, DNS
  When a network partition happens β†’ all requests succeed β†’ but data might be stale

CA (Consistency + Availability):
  "Doesn't exist in distributed systems."
  Only works for single-node databases. The moment you go distributed, network
  partitions are possible, so you MUST handle P.

🚨 Real Scenario: Choosing Wrong Consistency

The System: An e-commerce platform with a product catalog replicated across 3 regions.

The Choice: AP (eventual consistency) β€” because "availability matters more."

The Disaster: A flash sale. Product price was updated from $99 to $9.99 in the US region. Due to replication lag (3 seconds), the EU region still showed $99. EU customers paid $99 for the same product that US customers got for $9.99. Customer complaints, social media firestorm, $200K in refunds.

The Lesson: For pricing and inventory, you need strong consistency (CP). For product descriptions and reviews, eventual consistency (AP) is fine.

Rule of thumb:
  πŸ’° Involves money?  β†’ Strong consistency (CP)
  πŸ“¦ Involves stock?  β†’ Strong consistency (CP)
  πŸ“ Involves content? β†’ Eventual consistency (AP) is fine
  πŸ‘€ Involves profiles? β†’ Eventual consistency (AP) is fine

πŸ›‘οΈ Resilience Patterns: Surviving the Chaos

Pattern 1: Circuit Breaker

Problem: Service A calls Service B. B starts failing. A keeps calling B, wasting resources and cascading the failure everywhere.

Without circuit breaker:
  Service A β†’ "Call B" β†’ TIMEOUT (5s) β†’ "Try again" β†’ TIMEOUT β†’
  "Try again" β†’ TIMEOUT β†’ ... (meanwhile, A's thread pool is exhausted)
  β†’ A fails β†’ Everything calling A fails β†’ πŸ’€

With circuit breaker:

  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   success   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   timer    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  CLOSED  │────────────▢│   OPEN   │───────────▢│HALF-OPEN β”‚
  β”‚ (normal) β”‚             β”‚(fast-failβ”‚             β”‚(test 1   β”‚
  β”‚          │◀──too many──│ all req) β”‚   success  β”‚ request) β”‚
  β”‚          β”‚   failures  β”‚          │◀───────────│          β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   fail β†’   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                         back to OPEN
CLOSED:    Everything is fine. Let requests through.
OPEN:      B is broken. INSTANTLY fail all requests to B.
           Don't even try. Return a fallback/error immediately.
           This prevents A from drowning in timeouts.
HALF-OPEN: After 30 seconds, try ONE request.
           If it works β†’ CLOSED (B recovered!)
           If it fails β†’ OPEN (B still broken, wait more)
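The state machine above fits in a few dozen lines. Here's a minimal sketch in Python — the names (`failure_threshold`, `recovery_timeout`) are illustrative, not from any particular library; production code would reach for something like resilience4j or Polly instead of rolling its own:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF-OPEN state machine (illustrative sketch)."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF-OPEN"   # timer expired: allow one probe request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Any failure in HALF-OPEN, or too many in CLOSED, trips the breaker
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"              # a success closes the circuit
        return result
```

Note the key property: while OPEN, the call fails in microseconds instead of burning a thread for a 5-second timeout.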

Pattern 2: Retry with Exponential Backoff + Jitter

Naive retry:
  Fail β†’ Retry immediately β†’ Fail β†’ Retry immediately β†’ Fail...
  Problem: If 1000 clients all retry at the same time = thundering herd
  β†’ Makes the failing service EVEN MORE overwhelmed

Smart retry:
  Attempt 1: Wait 100ms + random(0-50ms)   = ~125ms
  Attempt 2: Wait 200ms + random(0-100ms)  = ~250ms
  Attempt 3: Wait 400ms + random(0-200ms)  = ~500ms
  Attempt 4: Wait 800ms + random(0-400ms)  = ~1000ms
  Attempt 5: Give up. Circuit breaker opens.

The JITTER (random component) is crucial:
  Without jitter: 1000 clients all retry at 100ms, 200ms, 400ms (synchronized waves)
  With jitter:    1000 clients retry at random times (spread out, no wave)
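The schedule in that table can be generated in a couple of lines. A sketch, assuming the same shape as above (doubling base delay, jitter up to 50% of the base):

```python
import random

def backoff_delays(attempts=5, base=0.1, cap=30.0):
    """Exponential backoff with jitter, mirroring the table above.

    base=0.1 gives 100ms, 200ms, 400ms, 800ms... each plus a random
    0-50% jitter so 1000 clients never retry in a synchronized wave.
    """
    delays = []
    for attempt in range(attempts):
        exp = min(cap, base * (2 ** attempt))        # 0.1, 0.2, 0.4, 0.8, ...
        delays.append(exp + random.uniform(0, exp / 2))  # add the jitter
    return delays
```

In a real retry loop you'd `time.sleep()` each delay between attempts and give up (or open the circuit breaker) after the last one.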

Rules for Retries

βœ… Retry these:
  HTTP 429 (Too Many Requests) β€” you're rate limited, wait and retry
  HTTP 503 (Service Unavailable) β€” server is temporarily overwhelmed
  HTTP 502/504 (Gateway errors) β€” upstream might recover
  Network timeouts β€” transient network issues

❌ Never retry these:
  HTTP 400 (Bad Request) β€” your request is wrong, retrying won't fix it
  HTTP 401/403 (Auth errors) β€” you're not authorized, stop trying
  HTTP 404 (Not Found) β€” it doesn't exist, it won't appear on retry
  HTTP 409 (Conflict) β€” your data is stale, need new data first

⚠️ Only retry IDEMPOTENT operations:
  GET, PUT, DELETE: Safe to retry (same result each time)
  POST: DANGEROUS to retry (might create duplicates!)
  β†’ For POST retries, use idempotency keys
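How do idempotency keys make POST retries safe? The server remembers the response per key, so a retried request replays the stored result instead of creating a duplicate. A minimal sketch — the in-memory dict and the `create_order` name are illustrative; production would back this with Redis or the database:

```python
import uuid

# Maps idempotency key -> the response we already produced for it.
# Illustrative only: a real service would persist this with a TTL.
_processed = {}

def create_order(payload, idempotency_key):
    """Create an order at most once per idempotency key."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # retry: replay stored response
    order = {"id": str(uuid.uuid4()), **payload}
    _processed[idempotency_key] = order      # remember before returning
    return order
```

The client generates the key once per logical operation (not per attempt) and sends it on every retry, typically as an `Idempotency-Key` header.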

Pattern 3: Bulkhead

Inspired by ship compartments β€” if one floods, the others stay dry.

Without bulkhead:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  Shared thread pool (100 threads)    β”‚
  β”‚  β”œβ”€β”€ Service A calls (SLOW!)  90/100 β”‚  ← A is broken
  β”‚  β”œβ”€β”€ Service B calls           5/100 β”‚  ← B starved
  β”‚  └── Service C calls           5/100 β”‚  ← C starved
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  A breaks β†’ B and C starve β†’ Everything breaks

With bulkhead:
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Pool A (40)  β”‚ β”‚ Pool B (30)  β”‚ β”‚ Pool C (30)  β”‚
  β”‚ Service A    β”‚ β”‚ Service B    β”‚ β”‚ Service C    β”‚
  β”‚              β”‚ β”‚              β”‚ β”‚              β”‚
  β”‚ A is slow    β”‚ β”‚ B runs fine  β”‚ β”‚ C runs fine  β”‚
  β”‚ Pool exhaustsβ”‚ β”‚ Unaffected   β”‚ β”‚ Unaffected   β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  A breaks β†’ Only A is affected β†’ B and C are fine!

In practice:

  • Kubernetes: Separate node pools for critical vs. best-effort workloads
  • Code: Separate thread pools / connection pools per dependency
  • Networking: Separate ingress controllers for internal vs external traffic
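At the code level, a bulkhead can be as simple as a bounded semaphore per dependency: a slow dependency can exhaust only its own slots. A sketch — the pool names and sizes mirror the diagram above and are purely illustrative:

```python
import threading

# One bounded pool of "slots" per downstream dependency (sizes from the
# diagram above; pick real sizes from your latency/throughput numbers).
pools = {
    "service_a": threading.BoundedSemaphore(40),
    "service_b": threading.BoundedSemaphore(30),
    "service_c": threading.BoundedSemaphore(30),
}

def call_with_bulkhead(dependency, fn):
    """Run fn() only if the dependency's pool has a free slot; shed load otherwise."""
    sem = pools[dependency]
    if not sem.acquire(blocking=False):
        raise RuntimeError(f"bulkhead for {dependency} is full: shedding load")
    try:
        return fn()
    finally:
        sem.release()
```

If Service A hangs, at most 40 callers are stuck waiting on it; calls to B and C keep flowing through their own pools.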

πŸ“ˆ Scalability Patterns

Vertical vs. Horizontal Scaling

Vertical (Scale Up): Buy a bigger machine
  β”œβ”€β”€ Simple: No code changes
  β”œβ”€β”€ Limited: There's a maximum VM size
  └── Expensive: Exponential cost curve

  $100/mo β†’ $400/mo β†’ $1,600/mo β†’ $6,400/mo
  (2x CPU)  (4x CPU)   (8x CPU)    (16x CPU)

Horizontal (Scale Out): Add more machines
  β”œβ”€β”€ Complex: Need load balancing, stateless design
  β”œβ”€β”€ Unlimited: Add as many as needed
  └── Linear: Linear cost curve

  $100 Γ— 1 β†’ $100 Γ— 2 β†’ $100 Γ— 4 β†’ $100 Γ— 8
  ($100)      ($200)      ($400)      ($800)

Database Scaling: The Real Bottleneck

Your app scales horizontally easily (add more pods).
Your database is almost always the bottleneck.

Scaling strategies (in order of complexity):

1. Read Replicas (easy)
   β”Œβ”€β”€β”€β”€ Write ────▢ Primary DB
   β”‚                    β”‚
   β”‚              β”Œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”
   β”‚              β–Ό     β–Ό     β–Ό
   └── Read ──▢ Rep 1  Rep 2  Rep 3

   Works when: 80%+ of queries are reads (most apps)
   Doesn't help: Write-heavy workloads

2. Caching Layer (medium)
   App β†’ Redis Cache β†’ hit? Return cached β†’ miss? Query DB

   Works when: Same data is requested frequently (product pages)
   Gotcha: Cache invalidation (one of the two hard problems in computer science)

3. Sharding (hard)
   Shard key: user_id
   Users 1 - 1M      → Shard 1
   Users 1M+1 - 2M   → Shard 2
   Users 2M+1 - 3M   → Shard 3

   Works when: Data is partitionable by a key
   Gotcha: Cross-shard queries are painful (joins across shards = πŸ’€)
   Gotcha: Rebalancing shards when they grow unevenly

4. CQRS (Command Query Responsibility Segregation) (complex)
   Writes β†’ Write Model (normalized, consistent)
   Reads  β†’ Read Model (denormalized, fast, eventually consistent)

   Works when: Read and write patterns are vastly different
   Gotcha: Eventually consistent reads (fine for most apps)
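Strategy 2 (the caching layer) is usually implemented as cache-aside: check the cache, fall back to the database on a miss, then populate the cache with a TTL. A sketch — the dict stands in for Redis, and `db_lookup` stands in for your real query:

```python
import time

_cache = {}          # stands in for Redis: key -> (value, cached_at)
TTL_SECONDS = 60     # tune per data type: short for prices, long for reviews

def get_product(product_id, db_lookup):
    """Cache-aside read: serve from cache if fresh, otherwise query and fill."""
    entry = _cache.get(product_id)
    if entry is not None and time.monotonic() - entry[1] < TTL_SECONDS:
        return entry[0]                        # cache hit
    value = db_lookup(product_id)              # cache miss: hit the database
    _cache[product_id] = (value, time.monotonic())
    return value
```

The gotcha mentioned above lives in the TTL: until it expires (or you explicitly delete the key on writes), readers can see stale data — which is exactly the pricing trap from the CAP scenario earlier.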

🚨 Real-World Disaster: The Database Connection Stampede

Setup: 50 pods, each with a connection pool of 20 connections = 1,000 database connections. PostgreSQL max_connections = 500.

Normal operation:
  50 pods Γ— 5 active connections = 250 connections (within limit)

After a deployment (all pods restart simultaneously):
  50 pods boot up at the same time
  Each opens 20 connections immediately
  50 Γ— 20 = 1,000 connection attempts
  Database: "I can only handle 500!"
  Result: Half the pods fail to start β†’ CrashLoopBackOff
  β†’ Pod restarts β†’ more connection attempts β†’ worse stampede

The Fix:

# 1. Add PgBouncer as a connection pooler
# PgBouncer sits between your app and PostgreSQL
# 1000 app connections β†’ PgBouncer β†’ 100 actual DB connections

# 2. Rolling restart instead of recreate strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # One at a time!

# 3. Startup probe with backoff
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # Wait before trying
  failureThreshold: 30
  periodSeconds: 5

πŸ“¬ Event-Driven Architecture: Decoupling Services

The Problem With Synchronous Communication

Synchronous (request-response):
  Order Service β†’ Payment Service β†’ Inventory Service β†’ Email Service

  If Payment Service is slow (2s) β†’ EVERYTHING waits
  If Inventory Service is down β†’ EVERYTHING breaks
  Total latency = sum of all service latencies

The Event-Driven Solution

Event-driven (publish-subscribe):
  Order Service publishes: "OrderCreated" event

  β”œβ”€β”€ Payment Service subscribes β†’ processes payment
  β”œβ”€β”€ Inventory Service subscribes β†’ decrements stock
  β”œβ”€β”€ Email Service subscribes β†’ sends confirmation
  └── Analytics Service subscribes β†’ records metrics

  Services are decoupled:
  βœ… Payment is slow? Order Service doesn't care.
  βœ… Email Service is down? Events queue up, delivered later.
  βœ… New service? Just subscribe. No changes to Order Service.
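The decoupling works because the publisher only knows the event type, never the subscribers. An in-process sketch of the idea (a real system would put Kafka, RabbitMQ, or a cloud pub/sub service in the middle, which is what buys you the "events queue up, delivered later" property):

```python
from collections import defaultdict

# event type -> list of handler functions
_subscribers = defaultdict(list)

def subscribe(event_type, handler):
    """Register a handler. The publisher never needs to know about it."""
    _subscribers[event_type].append(handler)

def publish(event_type, payload):
    """Deliver the event to every current subscriber."""
    for handler in _subscribers[event_type]:
        handler(payload)
```

Adding the Analytics Service is one `subscribe()` call — zero changes to the Order Service, exactly as the diagram promises.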

🚨 Real-World Disaster: The Unordered Events

System: Event-driven order processing.

Expected order:
  1. OrderCreated β†’ Payment processes β†’ Inventory decrements β†’ Email sent

What actually happened:
  Network glitch caused events to arrive out of order:
  1. InventoryDecremented (before payment!)
  2. OrderCreated
  3. PaymentProcessed

  Result: Inventory was decremented for orders where payment FAILED.
  1,200 phantom inventory deductions. Stock counts wrong for 3 days.

The Fix: Design for out-of-order events.

Option 1: Include sequence numbers
  Event { orderId: 123, sequence: 1, type: "OrderCreated" }
  Event { orderId: 123, sequence: 2, type: "PaymentProcessed" }
  Consumer: "I got sequence 2 before 1. Buffer it, wait for 1."

Option 2: Idempotent consumers
  Each event has a unique ID. Consumer tracks processed IDs.
  If duplicate arrives β†’ skip. If out of order β†’ handle gracefully.

Option 3: Event sourcing
  Store ALL events in order. Replay to build current state.
  The event log IS the truth. Services derive their view from it.
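Options 1 and 2 combine naturally: buffer out-of-order events per order and apply them strictly in sequence, skipping anything already applied. A sketch, with the same `orderId`/`sequence` event shape as above (the class name and in-memory state are illustrative — a real consumer would persist its position):

```python
class OrderedConsumer:
    """Apply events per order in sequence order; buffer gaps, skip duplicates."""

    def __init__(self):
        self.next_seq = {}   # orderId -> next sequence we expect to apply
        self.buffer = {}     # orderId -> {sequence: event} awaiting their turn
        self.applied = []    # events applied, in order (stand-in for real work)

    def receive(self, event):
        oid, seq = event["orderId"], event["sequence"]
        expected = self.next_seq.setdefault(oid, 1)
        if seq < expected:
            return                               # duplicate: already applied
        self.buffer.setdefault(oid, {})[seq] = event
        # Drain everything that is now contiguous with what we've applied
        while self.next_seq[oid] in self.buffer[oid]:
            self.applied.append(self.buffer[oid].pop(self.next_seq[oid]))
            self.next_seq[oid] += 1
```

With this in place, the disaster above can't happen: `InventoryDecremented` arriving before `OrderCreated` just sits in the buffer until its turn.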

πŸ—οΈ Platform Design Patterns

The Internal Developer Platform (IDP)

The Problem:
  Developer: "I need a new microservice deployed."
  Developer: Writes code β†’ writes Terraform β†’ writes K8s manifests β†’
             configures CI/CD β†’ sets up monitoring β†’ creates DNS β†’
             configures SSL β†’ adds to service mesh β†’ 
             2 WEEKS LATER: "It's deployed!"

The Solution: Internal Developer Platform
  Developer: "I need a new microservice deployed."
  Developer: Fills in a template β†’ clicks deploy β†’ 
             15 MINUTES LATER: "It's deployed with monitoring,
             SSL, CI/CD, and service mesh. All standard. All secure."

The Golden Path (Not the Golden Cage)

Golden Path = The recommended way to do things
  "Here's a well-paved road with guardrails.
   Use it and go fast."

NOT Golden Cage:
  "Here's the ONLY way to do things.
   Deviate and face consequences."

The difference matters. Teams should be able to leave the
golden path when they have a good reason. But 95% of the
time, the path should be so good that nobody wants to leave.

πŸ”₯ The Anti-Patterns Hall of Shame

πŸ† Distributed Monolith
   "We have 50 microservices, but they all have to deploy
    together and they all share one database."
   Congratulations, you built a monolith but with network
   latency! The worst of both worlds.

πŸ† The God Service
   "The OrderService handles orders, payments, inventory,
    emails, analytics, and user management."
   That's not a microservice. That's a monolith in a trench coat.

πŸ† Chatty Services
   "To render a product page, we make 47 API calls to 12 services."
   Each call adds latency and failure risk. Use the BFF
   (Backend for Frontend) pattern or GraphQL.

πŸ† Shared Database
   "All 8 services read and write to the same database."
   You lost the entire point of microservices. One schema
   change breaks everything. One slow query blocks everyone.

πŸ† Not Invented Here
   "We built our own message queue because Kafka was too complex."
   Your custom queue doesn't have 15 years of production testing.
   Use the boring technology. It works.

🧠 System Design Quick Reference

Problem: Need high availability?
  β†’ Multi-AZ deployment (minimum)
  β†’ Multi-region for critical services
  β†’ Health checks + auto-failover
  β†’ Circuit breakers between services

Problem: Need low latency?
  β†’ CDN for static content
  β†’ Cache (Redis) for hot data
  β†’ Edge computing for global users
  β†’ Async processing for non-critical work

Problem: Need high throughput?
  β†’ Horizontal scaling (more instances)
  β†’ Event-driven architecture (decouple services)
  β†’ Database read replicas + sharding
  β†’ Connection pooling everywhere

Problem: Need data consistency?
  β†’ Strongly consistent DB (PostgreSQL, Azure SQL)
  β†’ Two-phase commit (expensive, avoid if possible)
  β†’ Saga pattern for distributed transactions
  β†’ Idempotency keys for retry safety

Problem: Need fault tolerance?
  β†’ Circuit breakers between services
  β†’ Retries with exponential backoff + jitter
  β†’ Bulkheads for isolation
  β†’ Graceful degradation (serve cached/partial data)
  β†’ Queue-based architecture (survive downstream failures)

🎯 Key Takeaways

  1. CAP theorem is real β€” understand your consistency needs per use case
  2. Circuit breakers prevent cascading failures β€” they're non-negotiable for microservices
  3. Retries without jitter create thundering herds β€” always add randomness
  4. The database is almost always the bottleneck β€” scale it before anything else
  5. Event-driven decoupling saves systems β€” but design for out-of-order delivery
  6. Anti-patterns are more important than patterns β€” knowing what NOT to do prevents disasters
  7. Use boring technology β€” battle-tested beats cutting-edge in production

πŸ”₯ Homework

  1. Draw the architecture of your main system. Identify where a circuit breaker would prevent cascading failures.
  2. Check your retry configurations. Do they have jitter? If not, add it.
  3. Find one synchronous call chain that could be replaced with events. Write the event schema.

🏁 Series Wrap-Up

Congratulations β€” you've made it through the entire DevOps Principal Mastery series!

Here's what we covered:

  • [Blog 1] Azure Cloud-Native Architecture β€” subscriptions, networking, identity
  • [Blog 2] Kubernetes Mastery β€” pods, scaling, security, GitOps
  • [Blog 3] Terraform at Scale β€” state, modules, testing, environments
  • [Blog 4] CI/CD Standardization β€” pipelines, DORA, deployment strategies
  • [Blog 5] Observability β€” metrics, logs, traces, alerting
  • [Blog 6] DevSecOps β€” supply chain, secrets, container security, zero-trust
  • [Blog 7] SRE β€” SLOs, error budgets, incidents, chaos engineering
  • [Blog 8] Technical Leadership β€” ADRs, mentoring, stakeholder management
  • [Blog 9] System Design β€” CAP, resilience patterns, scalability, events

Every blog was packed with real incidents, real errors, real fixes, and real patterns used in production. No theoretical fluff. Just the stuff that matters.


πŸ’¬ Which blog in the series was most valuable to you? What topic should I deep-dive next? Drop your votes below β€” the next series depends on YOU. 🎯
