Rajkiran

Posted on Jun 13

System Design - 20. Observability: The 3 Pillars, 4 Golden Signals, and How Netflix Debugs 100 Microservices

#microservices #monitoring #sre #systemdesign


Series: System Design Mastery — Day 7 of 15
Reading time: 11 min
Covers: Metrics/Logs/Traces, 4 Golden Signals, Distributed Tracing, Alert Fatigue, SLO-Based Alerting

The 3am Page With No Answer

Imagine you're on-call. At 3am, an alert fires: "API error rate above threshold."

You check the dashboard. Errors are up — from 0.1% to 4%. But why? Which service? Which endpoint? Which users? Is it one bad deploy, a downstream dependency failing, a database issue, or something else entirely?

In a monolith, you'd check one log file. In a system with 100 microservices, the request that failed might have passed through 8 services before erroring. Which one actually failed? Without the right tooling, you're grep-ing through 100 different log streams hoping to find a needle in a haystack — at 3am.

Observability is the discipline of building systems that can answer "why is this happening?" — not just "is something happening?" The difference between monitoring and observability is the difference between a smoke alarm and being able to see exactly which wire is overheating.

The 3 Pillars of Observability

Pillar 1: Metrics

Metrics are numerical measurements over time — counters, gauges, and histograms.

Counter:    requests_total{service="payment", status="200"} = 145,302
Gauge:      active_connections{service="payment"} = 47
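Histogram:  request_duration_seconds_bucket{service="payment", le="0.5"} = 143,821
            (percentiles such as p99 are computed from the buckets at query time;
             a quantile="0.99" label belongs to the Summary type, not a histogram)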

Prometheus is the dominant open-source metrics system. Services expose a /metrics endpoint; Prometheus periodically "scrapes" (polls) this endpoint and stores the time-series data.

# Example /metrics endpoint output
http_requests_total{method="GET",status="200"} 145302
http_requests_total{method="GET",status="500"} 23
http_request_duration_seconds_bucket{le="0.1"} 98234
http_request_duration_seconds_bucket{le="0.5"} 143821
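Percentiles such as p99 are not stored in a histogram; they are estimated from the cumulative buckets at query time. Here is a minimal Python sketch of that estimate, assuming linear interpolation inside the bucket that contains the target rank (the same idea behind PromQL's histogram_quantile). The function name and the bucket counts are made up for illustration and do not come from any library.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by bound,
    ending with an (inf, total_count) overflow bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical bucket counts, for illustration only.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, buckets)
p95 = estimate_quantile(0.95, buckets)
print(p50, p95)
```

The estimate is only as precise as the bucket boundaries, which is why bucket edges are usually chosen close to the latency targets you care about.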

Grafana visualizes this data — dashboards showing request rates, error rates, latency percentiles, resource usage, all in real-time graphs.

Strengths: Extremely efficient storage (numbers compress well), great for trends and alerting ("error rate > 5% for 5 minutes → alert"), low overhead.

Weakness: Metrics tell you that something is wrong (error rate spiked) but not why (which specific request, which user, what error message).

Pillar 2: Logs

Logs are timestamped records of discrete events — usually text, often structured as JSON.

{
  "timestamp": "2024-06-13T03:14:22.103Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "message": "Payment authorization failed",
  "user_id": "user_98765",
  "error": "card_declined",
  "amount": 4999
}

The ELK Stack (Elasticsearch, Logstash, Kibana), or an alternative such as OpenSearch or Loki + Grafana, is a common choice for log aggregation:

Every service → writes structured JSON logs to stdout
       ↓
Log collector (Fluentd/Filebeat) → ships logs to Elasticsearch
       ↓
Elasticsearch → indexes logs for fast search
       ↓
Kibana → search/filter/visualize: 
  "show me all ERROR logs from payment-service in the last hour 
   where user_id=user_98765"

Structured logging matters enormously. Compare:

Unstructured: "Payment failed for user 98765, card declined, amount $49.99"
Structured:   {"event": "payment_failed", "user_id": "98765", 
               "reason": "card_declined", "amount": 4999}

The structured version can be queried, aggregated, and filtered programmatically. "Show me all payment failures with reason=card_declined in the last hour, grouped by amount range" — trivial with structured logs, painful with text parsing.
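To make this concrete, here is a small sketch of structured logging using only Python's standard logging module. The JsonFormatter class and the "fields" convention for passing extra data are assumptions made for this example, not part of the standard library; real projects often use a dedicated library instead.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields arrive through logging's `extra` argument
        # under a hypothetical "fields" key chosen for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="payment-service"))

logger = logging.getLogger("payment")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Payment authorization failed",
    extra={"fields": {"trace_id": "abc123def456", "user_id": "user_98765",
                      "error": "card_declined", "amount": 4999}},
)

event = json.loads(stream.getvalue())
print(event["level"], event["error"], event["amount"])
```

Because every field is a key rather than a substring, a query like "reason=card_declined in the last hour" becomes a simple filter instead of a regular expression.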

Log levels and sampling in production:

DEBUG → only in development (too verbose for production)
INFO  → significant events (request received, order placed)
WARN  → recoverable issues (retry succeeded after 1 failure)
ERROR → failures requiring attention

At high traffic: sample INFO logs (e.g., log 1% of successful 
requests) to reduce volume and cost, but log 100% of ERROR/WARN.
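One way to apply that sampling rule is a filter on the logger. The sketch below uses Python's standard logging.Filter; the SamplingFilter class, its sample_rate parameter, and the make_record helper are names invented for this example.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING-and-above record; keep a fraction of the rest."""

    def __init__(self, sample_rate, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self._rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop WARN/ERROR
        return self._rng.random() < self.sample_rate


def make_record(level):
    # Demo helper: builds a bare record at the given level.
    return logging.LogRecord("demo", level, "demo.py", 0, "msg", None, None)


drop_all = SamplingFilter(sample_rate=0.0)
keep_all = SamplingFilter(sample_rate=1.0)

info_dropped = not drop_all.filter(make_record(logging.INFO))
error_kept = drop_all.filter(make_record(logging.ERROR))
info_kept = keep_all.filter(make_record(logging.INFO))
print(info_dropped, error_kept, info_kept)
```

In a service you would attach it with something like `logger.addFilter(SamplingFilter(0.01))`. Random per-line sampling is the simplest option; sampling per request (so a kept request keeps all of its lines) is a common refinement.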

Weakness: Logs are siloed per service by default. Correlating "this user's request failed" across 8 services requires a shared identifier — which brings us to traces.

Pillar 3: Traces

Distributed tracing follows a single request as it flows through multiple services, recording the time spent in each.

Trace ID: abc123def456 (one ID for the ENTIRE request journey)

┌─────────────────────────────────────────────────────────┐
│ Span: API Gateway              [0ms ─────────────── 245ms]│
│   └─ Span: Order Service          [5ms ──────── 230ms]    │
│        └─ Span: Payment Service      [10ms ── 180ms]      │
│             └─ Span: Database query     [15ms─150ms] ←SLOW│
│        └─ Span: Inventory Service    [185ms─220ms]        │
└─────────────────────────────────────────────────────────┘

Total request time: 245ms
The Database query inside Payment Service took 135ms 
— that's the bottleneck.
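The same conclusion can be reached mechanically. The sketch below encodes the spans from the diagram and computes each span's "self time", meaning its duration minus the time spent in its direct children. The Span class and self_times function are illustrative names, and the calculation assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span name: time not covered by its direct children}.

    Assumes the children of a span do not overlap each other.
    """
    result = {s.name: s.duration_ms for s in spans}
    for s in spans:
        if s.parent is not None:
            result[s.parent] -= s.duration_ms
    return result


# Spans copied from the diagram above.
spans = [
    Span("API Gateway", 0, 245),
    Span("Order Service", 5, 230, parent="API Gateway"),
    Span("Payment Service", 10, 180, parent="Order Service"),
    Span("Database query", 15, 150, parent="Payment Service"),
    Span("Inventory Service", 185, 220, parent="Order Service"),
]
times = self_times(spans)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

Looking at self time rather than total duration matters: the API Gateway span is the longest at 245ms, but almost all of that is spent waiting on its children.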

Key concepts:

Trace = the entire journey of one request across all services.
Span = one unit of work within that journey (e.g., "Payment Service processing", "Database query"). Spans have a parent-child relationship, forming a tree.
Trace context propagation = passing the trace_id and span_id through HTTP headers as the request hops between services:

Service A makes a call to Service B:
  HTTP Headers:
    traceparent: 00-abc123def456-span001-01
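                     │trace_id│  │span_id│
    (IDs shortened for readability; in the W3C Trace Context format the
     trace ID is 32 hex characters and the span ID is 16)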

Service B continues the trace:
  Creates a new span (span002) as a child of span001
  Passes traceparent: 00-abc123def456-span002-01 to Service C
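Here is a minimal sketch of what Service B does with that header, assuming the W3C Trace Context layout (version, 32-hex trace ID, 16-hex span ID, flags). The function names are invented for this example, and the validation is simplified: the spec also rejects all-zero IDs, which this sketch does not check. In practice an instrumentation library such as OpenTelemetry handles this for you.

```python
import re
import secrets

# version - trace_id (32 hex) - span_id (16 hex) - flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header into its four fields, or raise ValueError."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}


def child_traceparent(incoming_header):
    """Build the header to send downstream: same trace, fresh span ID."""
    ctx = parse_traceparent(incoming_header)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{ctx['version']}-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

The trace ID stays constant across every hop, which is what lets a tracing backend stitch the spans back into one tree.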

Jaeger and Zipkin are the dominant open-source tracing systems. Google Dapper (the internal system that inspired both) was one of the first large-scale implementations — Google needed it because a single search query could touch hundreds of internal services.

Why traces are essential at scale: Metrics tell you "p99 latency is 245ms." Traces tell you "...and it's because the database query inside Payment Service is taking 135ms of that." Without traces, you're debugging blind in a microservices architecture.

How the 3 Pillars Work Together

3am Alert: "Payment Service error rate > 5%" (from METRICS)
       ↓
Click into Grafana dashboard → see error spike started at 3:02am
       ↓
Filter LOGS for payment-service, level=ERROR, around 3:02am
       ↓
Find: "Database connection pool exhausted" — but WHY?
       ↓
Pick a trace_id from one of the failed requests → open in Jaeger (TRACES)
       ↓
Trace shows: Inventory Service is taking 8 seconds (normally 50ms) 
→ Payment Service's calls to Inventory are timing out
→ Each waiting request holds its database connection, so the pool fills up
       ↓
Root cause found: Inventory Service had a bad deploy at 3:00am 
that introduced a slow database query.

Metrics told you something was wrong and roughly when. Logs gave you the specific error. Traces revealed the actual root cause was in a different service than the one alerting. This investigation — which could take hours of grep-ing without proper observability — takes minutes with all 3 pillars integrated.
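The hop from logs to traces only works if every log line carries the trace_id. As a rough sketch of that correlation step, the function below pulls every event for one trace out of a mixed stream of JSON log lines and orders them by time. The function name and the sample log lines are hypothetical; a real setup would run this as a query in the log backend instead.

```python
import json


def timeline_for_trace(log_lines, trace_id):
    """Return the parsed events for one trace, ordered by timestamp."""
    events = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        if isinstance(event, dict) and event.get("trace_id") == trace_id:
            events.append(event)
    # ISO 8601 UTC timestamps in one format sort correctly as strings.
    return sorted(events, key=lambda e: e.get("timestamp", ""))


# Hypothetical log lines from several services, mixed together.
log_lines = [
    '{"timestamp": "2024-06-13T03:02:11.480Z", "service": "payment-service", '
    '"trace_id": "abc123def456", "level": "ERROR", '
    '"message": "Database connection pool exhausted"}',
    '{"timestamp": "2024-06-13T03:02:03.120Z", "service": "inventory-service", '
    '"trace_id": "abc123def456", "level": "WARN", "message": "slow query"}',
    '{"timestamp": "2024-06-13T03:02:05.000Z", "service": "payment-service", '
    '"trace_id": "zzz999", "level": "INFO", "message": "ok"}',
    "plain text line that is not JSON",
]

timeline = timeline_for_trace(log_lines, "abc123def456")
for event in timeline:
    print(event["timestamp"], event["service"], event["message"])
```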

The 4 Golden Signals

Google's SRE book defines 4 Golden Signals — if you can only monitor 4 things, monitor these:

1. Latency

How long do requests take? Critical: distinguish successful request latency from failed request latency. A request that fails fast (400 Bad Request in 2ms) shouldn't be averaged together with successful requests — it'll make your latency look artificially good while masking real problems.

2. Traffic

How much demand is the system experiencing? Requests per second, concurrent connections, queue depth. Traffic patterns reveal trends (growth, seasonality) and anomalies (sudden spikes — could be legitimate viral growth or an attack).

3. Errors

What fraction of requests are failing? Both explicit failures (500 errors) and implicit ones (200 OK but wrong content, policy violations). Track error rate as a percentage of traffic, not absolute count — 50 errors out of 100 requests is very different from 50 errors out of 1,000,000.

4. Saturation

How "full" is your system? CPU, memory, disk I/O, connection pool utilization. Saturation often predicts problems before they cause errors — a connection pool at 95% utilization will hit 100% (and start failing) soon.

Dashboard for ANY service should show these 4 at a glance:

┌─────────────┬─────────────┬─────────────┬─────────────┐
│   LATENCY    │   TRAFFIC    │   ERRORS     │  SATURATION  │
│  p50: 45ms   │  1,240 req/s │  0.3%        │  CPU: 62%    │
│  p99: 380ms  │  ▲ trending  │  ▼ trending  │  Mem: 71%    │
│              │     up       │     down     │  Conns: 85%  │
└─────────────┴─────────────┴─────────────┴─────────────┘
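As a rough illustration of how those four numbers could be derived from raw data, here is a sketch that computes them for one time window. The Request class, the golden_signals function, and the sample values are all invented for this example. It assumes "successful" means status below 400, "error" means status 500 or above, and that connection pool usage stands in for saturation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, pool_in_use, pool_size):
    # Latency is computed over successful requests only, so fast
    # failures do not make the numbers look better than they are.
    ok_latencies = [r.duration_ms for r in requests if r.status < 400]
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p50_ms": percentile(ok_latencies, 50),
        "latency_p99_ms": percentile(ok_latencies, 99),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests) if requests else 0.0,
        "saturation": pool_in_use / pool_size,
    }


# Hypothetical 5-second window of traffic.
requests = [Request(d, 200) for d in (10, 20, 30, 40, 50, 60, 70, 80)]
requests += [Request(2, 500), Request(1, 400)]

signals = golden_signals(requests, window_seconds=5, pool_in_use=17, pool_size=20)
print(signals)
```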

If you're designing a monitoring system in an interview, structuring your answer around these 4 signals demonstrates you know the industry-standard framework — not just "I'd add some dashboards."

Alert Fatigue: When Everything Is an Alert, Nothing Is

A common failure mode: a team sets up alerts for everything. CPU > 70%? Alert. Memory > 80%? Alert. Any 500 error? Alert. Latency > 100ms? Alert.

Within weeks, the on-call engineer is receiving 50+ alerts per day — most of which are noise (a single 500 error that auto-recovered, a brief CPU spike during a scheduled job). Engineers start ignoring alerts, muting channels, or worse — missing the one alert that mattered because it was buried in noise.

This is alert fatigue, and it's a leading cause of missed real incidents.

Severity Tiers

P1 (Page immediately, wake someone up):
  - Service completely down
  - Error rate > 50%
  - Data loss risk

P2 (Notify during business hours, investigate same day):
  - Error rate elevated but service functional (5-10%)
  - Latency degraded but within tolerable range
  - One replica down (but others healthy)

P3 (Log for review, no immediate action):
  - Minor anomalies
  - Resource usage trending toward thresholds (not yet critical)

Only P1 should page someone at 3am. P2 and P3 should be visible on dashboards and reviewed during business hours.

SLO-Based Alerting: The Modern Approach

Threshold-based alerts ("CPU > 70%") are noisy because they don't reflect user impact. SLO-based alerting (introduced in Day 1) flips this: alert based on error budget burn rate — how fast you're consuming your allowed unreliability.

SLO: 99.9% availability = 43.8 minutes of downtime allowed per month
   = 0.1% error budget

Burn rate alerting:
  "Are we consuming our monthly error budget faster than 
   we can sustain?"

Fast burn (page immediately):
  Burn rate 14.4x: 2% of the monthly budget consumed in 1 hour
  → At this rate, the ENTIRE 30-day budget is gone in about 2 days
  → This is a genuine emergency

Slow burn (notify, don't page):
  Burn rate 1x: 10% of the monthly budget consumed over 3 days
  → Concerning, but you have time to investigate during business hours
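The arithmetic behind burn rate is short enough to sketch. The function names below are made up for this example, and the model is the simplest one: burn rate is the observed error rate divided by the error budget, and it is assumed to stay constant. Note that a strict 30-day window gives 43.2 minutes of budget at 99.9%; the 43.8 figure corresponds to an average-length month of about 30.44 days.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_rate, slo):
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_rate / (1 - slo)


def hours_until_exhausted(rate, window_days=30):
    """Time to spend the whole budget if the burn rate stays constant."""
    return window_days * 24 / rate


slo = 0.999
budget = error_budget_minutes(slo)          # about 43.2 minutes per 30 days
rate = burn_rate(0.0144, slo)               # 1.44% errors -> about 14.4x
remaining = hours_until_exhausted(rate)     # about 50 hours
print(round(budget, 1), round(rate, 1), round(remaining, 1))
```

A burn rate of 1 means the budget runs out exactly at the end of the window; anything above 1 means it runs out early.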

Why this is better than threshold alerts: A threshold alert (error rate > 1%) fires the same way whether it's a brief 30-second blip or a sustained outage. Burn-rate alerting distinguishes "brief blip that barely touches the error budget" from "sustained issue that will blow through the entire month's budget by lunch" — and pages accordingly.

Google's multi-window, multi-burn-rate alerting (from the SRE workbook) uses multiple time windows (5 minutes AND 1 hour) to catch both sudden spikes and slow leaks, while filtering out transient noise that self-resolves.
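A minimal sketch of the two-window check, assuming the error rates for each window have already been computed elsewhere (for example by a metrics query). The should_page function and its parameters are illustrative names; the 14.4 default is the burn rate at which 2% of a 30-day budget is spent in one hour.

```python
def should_page(long_window_error_rate, short_window_error_rate, slo,
                threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    The long window (e.g. 1 hour) shows the problem is significant; the
    short window (e.g. 5 minutes) shows it is still happening right now.
    """
    budget = 1 - slo
    return (long_window_error_rate / budget >= threshold
            and short_window_error_rate / budget >= threshold)


slo = 0.999
sustained_outage = should_page(0.02, 0.03, slo)       # both windows hot
already_recovered = should_page(0.02, 0.0005, slo)    # short window is calm
brief_blip = should_page(0.001, 0.05, slo)            # long window is calm
print(sustained_outage, already_recovered, brief_blip)
```

Requiring both windows is what filters the noise: a blip never lifts the long window enough, and an incident that has already recovered no longer trips the short window.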

On-Call Culture: Runbooks and Blameless Postmortems

Observability tooling is only half the story — the human processes around incidents matter just as much.

Runbooks: A documented procedure for responding to a specific alert.

Alert: "Payment Service error rate > 5%"

Runbook:
1. Check Grafana dashboard: payment-service overview
2. Check recent deploys: did a deploy happen in the last 30 minutes?
   → If yes, consider rolling back: `kubectl rollout undo deployment/payment-service`
3. Check downstream dependencies (Inventory, Fraud Check) — 
   are THEIR error rates also elevated?
4. Check database connection pool saturation
5. If unresolved in 15 minutes, escalate to #payments-oncall

Runbooks turn "3am panic" into "follow the steps" — dramatically reducing time-to-resolution and stress on whoever's on call.

Blameless postmortems: After an incident, the team writes up what happened — without assigning blame to individuals. The focus is entirely on systemic factors: "Why did our monitoring not catch this sooner? Why did the deploy process allow a breaking change to reach production? What guardrails can we add?"

Why blameless matters: If engineers fear blame for incidents, they hide mistakes, don't report near-misses, and don't share context that could help prevent future issues. Blameless culture (pioneered by Etsy, championed by Google SRE) treats incidents as learning opportunities for the system, not performance issues for individuals.

Real Example: Netflix's Observability at Scale

Netflix operates one of the largest microservices deployments in the world — thousands of services, processing billions of requests daily. Their observability stack includes:

Atlas — Netflix's in-house metrics platform, purpose-built to handle the cardinality (millions of unique metric combinations) at their scale
Distributed tracing integrated into their service mesh
Automated canary analysis — when deploying a new version, Netflix automatically compares the new version's metrics against the old version's, and automatically rolls back if the new version shows statistically significant degradation — without human intervention
Chaos engineering (from Day 1) feeds directly into observability — when Chaos Monkey kills an instance, the team verifies their dashboards and alerts actually detect and surface it correctly
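To give a feel for what "statistically significant degradation" can mean, here is a deliberately simple sketch: a one-sided two-proportion z-test on error counts. This is not Netflix's actual method, which this article does not describe; the function name, the threshold, and the sample counts are assumptions chosen for illustration.

```python
import math


def canary_is_worse(baseline_errors, baseline_total,
                    canary_errors, canary_total, z_threshold=1.645):
    """One-sided two-proportion z-test: is the canary's error rate higher?

    z_threshold=1.645 corresponds to roughly a 5% one-sided significance level.
    """
    p_base = baseline_errors / baseline_total
    p_canary = canary_errors / canary_total
    pooled = (baseline_errors + canary_errors) / (baseline_total + canary_total)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / baseline_total + 1 / canary_total)
    )
    if std_err == 0:
        return False  # identical all-success or all-failure: nothing to compare
    z = (p_canary - p_base) / std_err
    return z > z_threshold


# Hypothetical counts: 10,000 requests served by each version.
clear_regression = canary_is_worse(10, 10_000, 50, 10_000)
within_noise = canary_is_worse(10, 10_000, 11, 10_000)
print(clear_regression, within_noise)
```

A deploy pipeline could call a check like this after the canary has served enough traffic and trigger a rollback when it returns True. Real canary analysis typically compares many metrics, not just error counts.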

The meta-lesson: Observability isn't just for debugging incidents after they happen — at Netflix's scale, it's integrated into the deployment pipeline itself, automatically preventing bad deploys from ever reaching most users.

Interview Scenario: "Design Monitoring for 100 Microservices"

The structured answer:

"I'd start with the 4 Golden Signals as the baseline for every service — latency, traffic, errors, and saturation — exposed via Prometheus metrics with a standard dashboard template so every team's service looks consistent.

For logs, I'd enforce structured JSON logging across all services, shipped to a centralized system like Elasticsearch, with a mandatory trace_id field in every log line.

For traces, I'd implement distributed tracing with context propagation through HTTP headers — likely using OpenTelemetry as the instrumentation standard, since it's vendor-neutral and works with Jaeger, Zipkin, or commercial backends.

For alerting, rather than static thresholds per service — which would create alert fatigue at 100 services — I'd implement SLO-based burn-rate alerting. Each service defines its own SLO appropriate to its criticality, and alerts fire based on error budget consumption rate, with severity tiers so only genuine emergencies page on-call at 3am.

Finally, I'd pair this with runbooks for common alerts and a blameless postmortem process — because observability tooling without good incident response processes just means you find out about problems faster without necessarily resolving them faster."

Key Takeaways

3 Pillars: Metrics (numerical trends, efficient, good for alerting), Logs (detailed events, good for specific errors), Traces (request journeys across services, good for root cause).
Structured logging (JSON) is essential — unstructured text logs can't be queried programmatically at scale.
Distributed tracing uses trace IDs propagated via headers to follow one request across many services — essential for microservices debugging.
4 Golden Signals: Latency, Traffic, Errors, Saturation — the minimum viable dashboard for any service.
Alert fatigue is real and dangerous — use severity tiers (P1/P2/P3), and only page for genuine emergencies.
SLO-based burn-rate alerting distinguishes brief blips from sustained issues — far less noisy than static thresholds.
Runbooks + blameless postmortems turn observability data into faster resolution and systemic learning — tooling alone isn't enough.

What's Next

Topic 21 closes with Rate Limiting — Token Bucket, Leaky Bucket, Sliding Window algorithms, and how to implement distributed rate limiting with Redis that works correctly across multiple data centers.

Tags: system-design observability monitoring sre backend distributed-systems interview-prep
