Samson Tanimawo

Posted on Apr 16

The Golden Signals: A Practical Implementation Guide

#devops #sre #monitoring #observability

Four Metrics to Rule Them All

Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.

Here's a practical guide from someone who's implemented them across 50+ services.

Signal 1: Latency

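Not all latency is equal. Track successful and failed requests separately: a burst of fast 5xx responses can pull the overall numbers down and hide a real problem, while slow failures can make healthy traffic look worse than it is.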

# Bad: Average latency (hides the slow tail)
latency = total_request_time / total_requests  # Useless

# Good: Percentile latency, separated by status
import time

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint', 'status_class'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Assumes `app` is a FastAPI/Starlette application; the middleware
# decorator needs the "http" argument to register the function.
@app.middleware("http")
async def track_latency(request, call_next):
    start = time.perf_counter()  # monotonic clock, safe for durations
    response = await call_next(request)
    duration = time.perf_counter() - start
    status_class = f"{response.status_code // 100}xx"
    REQUEST_LATENCY.labels(
        method=request.method,
        # Raw paths containing IDs create one series per URL;
        # prefer the route template if your framework exposes it.
        endpoint=request.url.path,
        status_class=status_class
    ).observe(duration)
    return response

Alert on p99, not p50. Your happiest users don't need help.

- alert: HighLatencyP99
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning
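
If it helps to see what histogram_quantile does with those buckets, here is a rough Python sketch of the estimate. It is a simplified approximation of the linear interpolation Prometheus applies inside a bucket, not the server's actual code. The function name estimate_quantile and the sample counts are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes the lowest bucket starts at 0 and that
    counts are cumulative, as in a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: the best we can
                # report is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts: 900 requests under 100ms, 980 under 500ms.
sample = [(0.1, 900), (0.5, 980), (1.0, 1000), (float("inf"), 1000)]
p50 = estimate_quantile(0.50, sample)  # inside the first bucket
p99 = estimate_quantile(0.99, sample)  # about 0.75s, above the 0.5s threshold
```

The takeaway: the p99 is only as precise as your bucket boundaries, so put a boundary near the threshold you alert on.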

Signal 2: Traffic

Traffic tells you "is this normal?" It's the context for every other signal.

# Current request rate
rate(http_requests_total[5m])

# Compare to same time last week
rate(http_requests_total[5m]) 
  / 
rate(http_requests_total[5m] offset 7d)

# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: >
    sum(rate(http_requests_total[5m]))
    <
    (sum(rate(http_requests_total[5m] offset 1h)) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped >50% compared to 1 hour ago"

Traffic drops are often more concerning than traffic spikes.
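
The comparison logic behind those queries is simple enough to sketch in Python, which is handy if you want to unit test thresholds before committing them to alert rules. The function names and the 0.5 default are my own illustration; the one design choice worth noting is that a missing or zero baseline returns "no drop" rather than dividing by zero.

```python
def traffic_ratio(current_rps, baseline_rps):
    """Ratio of current traffic to a baseline (e.g. same time last week).

    Returns None when there is no usable baseline.
    """
    if baseline_rps is None or baseline_rps <= 0:
        return None
    return current_rps / baseline_rps


def traffic_dropped(current_rps, baseline_rps, threshold=0.5):
    """True when current traffic is below `threshold` times the baseline."""
    ratio = traffic_ratio(current_rps, baseline_rps)
    if ratio is None:
        return False  # nothing to compare against, so do not alert
    return ratio < threshold


# Example: 40 req/s now versus 100 req/s an hour ago is a >50% drop.
dropped = traffic_dropped(40.0, 100.0)
```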

Signal 3: Errors

Track error rate as a percentage, not as an absolute count:

# Error rate percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

But also track error types separately:

error_categories:
  - 5xx: "Server errors (our fault)"
  - 4xx_excluding_404: "Client errors (possible API issue)"
  - timeout: "Request timeouts"
  - circuit_breaker: "Dependency failures"
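
One way to get those categories is to classify each request outcome before incrementing a counter. Here is a minimal Python sketch, assuming you can tell at the call site whether a request timed out or was rejected by a circuit breaker. The function names and the (status, timed_out, circuit_open) tuple shape are hypothetical.

```python
from collections import Counter


def categorize_error(status_code, timed_out=False, circuit_open=False):
    """Map one request outcome to an error category, or None if not an error."""
    if circuit_open:
        return "circuit_breaker"
    if timed_out:
        return "timeout"
    if 500 <= status_code <= 599:
        return "5xx"
    if 400 <= status_code <= 499 and status_code != 404:
        return "4xx_excluding_404"
    return None


def summarize(outcomes):
    """Return (5xx rate in percent, per-category counts).

    `outcomes` is a list of (status_code, timed_out, circuit_open) tuples.
    """
    counts = Counter()
    for status_code, timed_out, circuit_open in outcomes:
        category = categorize_error(status_code, timed_out, circuit_open)
        if category is not None:
            counts[category] += 1
    total = len(outcomes)
    server_error_rate = 100.0 * counts["5xx"] / total if total else 0.0
    return server_error_rate, counts


# Made-up sample: 96 successes, two 500s, one 429, one timed-out 504.
outcomes = [(200, False, False)] * 96 + [
    (500, False, False),
    (500, False, False),
    (429, False, False),
    (504, True, False),
]
rate_percent, by_type = summarize(outcomes)
```

Note that timeouts and circuit-breaker rejections are checked first, so they land in their own category even when they surface as a 5xx status.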

Signal 4: Saturation

The most underrated signal. Saturation answers: "how close are we to full?"

# CPU saturation (fraction of the container's CPU limit in use)
rate(container_cpu_usage_seconds_total[5m])
  / (container_spec_cpu_quota / container_spec_cpu_period)

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Connection pool saturation
active_connections / max_connections

# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity

Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.
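
The same two thresholds apply to every resource, so it helps to express them once. A small Python sketch, assuming you already have a "used" and a "capacity" number for the resource; the function name and return labels are illustrative only.

```python
def saturation_severity(used, capacity, warn=0.80, crit=0.95):
    """Classify a resource as ok, warning, or critical by utilization."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


# Example: 45 of 50 pooled connections in use is 90%, so a warning.
pool_state = saturation_severity(used=45, capacity=50)
```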

Putting It All Together

Every service gets a standard dashboard with four rows:

Row 1: Latency    [p50] [p90] [p99] [error latency]
Row 2: Traffic    [rate] [vs last week] [by endpoint]
Row 3: Errors     [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]

This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.

The Anti-Pattern

Don't build a golden signals dashboard per service manually. Template it:

{
  "dashboard": {
    "title": "Golden Signals: {{ service_name }}",
    "templating": {
      "list": [
        { "name": "service", "type": "query" },
        { "name": "environment", "type": "custom", "options": ["prod", "staging"] }
      ]
    }
  }
}

One template, 50 dashboards. Update once, apply everywhere.
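
A minimal sketch of that rendering step in Python, assuming the template is held as a dict and the only placeholder is the {{ service_name }} token in the title. The function names are hypothetical, and pushing the rendered JSON into your dashboard tool is left out.

```python
import copy
import json

# Trimmed copy of the template above, held as a Python dict.
TEMPLATE = {
    "dashboard": {
        "title": "Golden Signals: {{ service_name }}",
        "templating": {
            "list": [
                {"name": "service", "type": "query"},
            ]
        },
    }
}


def render_dashboard(service_name):
    """Return a copy of TEMPLATE with the title filled in for one service."""
    dashboard = copy.deepcopy(TEMPLATE)  # never mutate the shared template
    title = dashboard["dashboard"]["title"]
    dashboard["dashboard"]["title"] = title.replace(
        "{{ service_name }}", service_name
    )
    return dashboard


def render_all(service_names):
    """Render one JSON document per service, keyed by service name."""
    return {
        name: json.dumps(render_dashboard(name), indent=2)
        for name in service_names
    }


# Hypothetical service names.
rendered = render_all(["checkout", "search"])
```

Filling the placeholder on the parsed dict rather than on the raw JSON string means a service name containing quotes cannot break the document.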

If you want golden signal monitoring that sets itself up automatically, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
