Samson Tanimawo

The Golden Signals: A Practical Implementation Guide

Four Metrics to Rule Them All

Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.

Here's a practical guide from someone who's implemented them across 50+ services.

Signal 1: Latency

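Not all latency is equal. Track successful and failed requests separately: errors that fail fast pull the overall numbers down and can hide slow successes, while a slow error is worse than a fast one.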

# Bad: average latency hides the slow tail
# latency = total_request_time / total_requests

# Good: percentile latency, separated by status
import time

from fastapi import FastAPI  # assumption: the original snippet does not name its framework
from prometheus_client import Histogram

app = FastAPI()

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint', 'status_class'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

@app.middleware("http")
async def track_latency(request, call_next):
    start = time.perf_counter()  # monotonic clock, unaffected by wall-clock adjustments
    response = await call_next(request)
    duration = time.perf_counter() - start
    status_class = f"{response.status_code // 100}xx"  # 200 -> "2xx", 503 -> "5xx"
    REQUEST_LATENCY.labels(
        method=request.method,
        # Raw paths such as /users/123 create one series per ID; prefer the route template.
        endpoint=request.url.path,
        status_class=status_class
    ).observe(duration)
    return response

Alert on p99, not p50. Your happiest users don't need help.

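# Evaluated per method/endpoint/status_class series; wrap the rate in sum by (le) (...) for one service-wide p99
- alert: HighLatencyP99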
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
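
To make the p99 expression less of a black box, here is a minimal Python sketch of the linear interpolation that Prometheus documents for histogram_quantile on classic histograms. The function name estimate_quantile and the sample counts are invented for illustration; this is not the Prometheus source.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    ending with the +Inf bucket, like the *_bucket series of a histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = cumulative - prev_count
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, cumulative
    return math.nan  # not reached for well-formed cumulative buckets


# Hypothetical 5-minute snapshot: 100 requests, 2 of them slower than 0.5s.
sample = [(0.1, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, sample)
p99 = estimate_quantile(0.99, sample)
```

With these made-up counts, p50 comes out at roughly 0.056s while p99 comes out at 0.75s, above the 0.5s threshold. The median looks healthy while the tail is already in alert territory. The estimate can only be as precise as the bucket boundaries, which is a good reason to place a boundary at your alert threshold.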

Signal 2: Traffic

Traffic tells you "is this normal?" It's the context for every other signal.

# Current request rate
rate(http_requests_total[5m])

# Compare to same time last week
rate(http_requests_total[5m]) 
  / 
rate(http_requests_total[5m] offset 7d)

# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: >
    rate(http_requests_total[5m]) 
    < 
    (rate(http_requests_total[5m] offset 1h) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped >50% compared to 1 hour ago"

Traffic drops are often more concerning than traffic spikes.
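
The logic of the TrafficDrop rule can be sketched in a few lines of Python. This is an illustration under assumptions, not how Prometheus evaluates rules internally: the function name traffic_drop_sustained is hypothetical, and the two sequences stand in for the request rate now and the rate one hour earlier, sampled at each rule evaluation inside the 10-minute window.

```python
from typing import Sequence


def traffic_drop_sustained(
    current: Sequence[float],
    baseline: Sequence[float],
    ratio: float = 0.5,
) -> bool:
    """Return True when every paired sample shows current < baseline * ratio.

    Mirrors the idea behind `for: 10m`: the condition has to hold at every
    evaluation in the window, so a single dip does not fire the alert.
    """
    if not current or len(current) != len(baseline):
        raise ValueError("need two non-empty sequences of equal length")
    return all(cur < base * ratio for cur, base in zip(current, baseline))


# Hypothetical requests-per-second samples, one per evaluation.
one_hour_ago = [200.0, 210.0, 205.0, 198.0]
brief_dip = [190.0, 60.0, 195.0, 200.0]
real_outage = [80.0, 75.0, 60.0, 40.0]

dip_fires = traffic_drop_sustained(brief_dip, one_hour_ago)
outage_fires = traffic_drop_sustained(real_outage, one_hour_ago)
```

Here the brief dip does not fire and the sustained drop does. A baseline of zero never fires either, since nothing is lower than half of zero, which matches how the PromQL comparison behaves.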

Signal 3: Errors

Track error rate as a percentage, not absolute count:

# Error rate percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

But also track error types separately:

error_categories:
  - 5xx: "Server errors (our fault)"
  - 4xx_excluding_404: "Client errors (possible API issue)"
  - timeout: "Request timeouts"
  - circuit_breaker: "Dependency failures"
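
One way to produce those categories is a small classifier that runs where the request outcome is known. The sketch below is an assumption-heavy illustration: the function names, the timed_out and circuit_open flags, and the precedence order (circuit breaker first, then timeout, then status code) are my choices, not something the categories above prescribe.

```python
from typing import Optional


def classify_error(
    status_code: int,
    timed_out: bool = False,
    circuit_open: bool = False,
) -> Optional[str]:
    """Map a request outcome to one of the error categories, or None if it is not an error."""
    if circuit_open:
        return "circuit_breaker"  # dependency failure, rejected before the call
    if timed_out:
        return "timeout"
    if 500 <= status_code <= 599:
        return "5xx"  # server errors (our fault)
    if 400 <= status_code <= 499 and status_code != 404:
        return "4xx_excluding_404"  # client errors (possible API issue)
    return None


def error_rate_percent(server_errors: int, total: int) -> float:
    """Error rate as a percentage; 0.0 when there was no traffic at all."""
    if total == 0:
        return 0.0
    return 100.0 * server_errors / total
```

For example, classify_error(503) gives "5xx", classify_error(404) gives None, and error_rate_percent(5, 1000) gives 0.5. The returned string can be used as a label value on an errors counter so each category gets its own series.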

Signal 4: Saturation

The most underrated signal. Saturation answers: "how close are we to full?"

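# CPU saturation: cores in use as a fraction of the CPU limit.
# The usage metric is a counter, so it needs rate(); the quota is in
# microseconds per period, so divide it by the period to get cores.
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
  /
sum by (namespace, pod, container) (container_spec_cpu_quota / container_spec_cpu_period)

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Connection pool saturation (metric names depend on your exporter)
active_connections / max_connections

# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity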

Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.
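
The two thresholds translate into a very small piece of logic. Here is a sketch in Python, with a hypothetical function name and the 80% and 95% values from above as defaults:

```python
def saturation_level(
    used: float,
    capacity: float,
    warning: float = 0.80,
    critical: float = 0.95,
) -> str:
    """Return "ok", "warning", or "critical" for a used/capacity pair."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio >= critical:
        return "critical"
    if ratio >= warning:
        return "warning"
    return "ok"


# Hypothetical connection pool with 100 slots.
pool_status = saturation_level(used=87, capacity=100)
```

The same function works for any of the four ratios (CPU, memory, connections, queue depth), because each one is already expressed as used over capacity. In the example, 87 of 100 connections in use lands in the warning band.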

Putting It All Together

Every service gets a standard dashboard with four rows:

Row 1: Latency    [p50] [p90] [p99] [error latency]
Row 2: Traffic    [rate] [vs last week] [by endpoint]
Row 3: Errors     [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]

This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.

The Anti-Pattern

Don't build a golden signals dashboard per service manually. Template it:

{
  "dashboard": {
    "title": "Golden Signals: {{ service_name }}",
    "templating": {
      "list": [
        { "name": "service", "type": "query" },
        { "name": "environment", "type": "custom", "options": ["prod", "staging"] }
      ]
    }
  }
}

One template, 50 dashboards. Update once, apply everywhere.
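
As a rough illustration of the "one template, many dashboards" idea, the sketch below renders the template once per service in plain Python. Everything in it is an assumption for the example: the render_dashboard helper, the service names, and the use of str.format on the title. It only builds the JSON documents; how they get loaded into your dashboard tool depends on your setup.

```python
import copy
import json

# Same shape as the template above, with a Python format placeholder in the title.
TEMPLATE = {
    "dashboard": {
        "title": "Golden Signals: {service_name}",
        "templating": {
            "list": [
                {"name": "service", "type": "query"},
                {"name": "environment", "type": "custom", "options": ["prod", "staging"]},
            ]
        },
    }
}


def render_dashboard(service_name: str) -> dict:
    """Return a per-service copy of the template without mutating the original."""
    dashboard = copy.deepcopy(TEMPLATE)
    title = dashboard["dashboard"]["title"]
    dashboard["dashboard"]["title"] = title.format(service_name=service_name)
    return dashboard


# Hypothetical service names.
services = ["checkout", "search", "payments"]
rendered = {name: json.dumps(render_dashboard(name), indent=2) for name in services}
```

Changing TEMPLATE and re-running the loop updates every dashboard definition at once, which is the whole point of templating instead of hand-editing.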

If you want golden signal monitoring that sets itself up automatically, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
