Four Metrics to Rule Them All
Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.
Here's a practical guide from someone who's implemented them across 50+ services.
Signal 1: Latency
Not all latency is equal. Track the latency of successful and failed requests separately: fast errors (or slow errors) mixed into one distribution will distort what you think your users experience.
# Bad: average latency
latency = total_request_time / total_requests  # useless: the slow tail vanishes into the mean
# Good: Percentile latency, separated by status
import time

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'Request latency',
    ['method', 'endpoint', 'status_class'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10],
)

@app.middleware("http")  # FastAPI-style HTTP middleware
async def track_latency(request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    status_class = f"{response.status_code // 100}xx"  # 2xx, 4xx, 5xx, ...
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path,
        status_class=status_class,
    ).observe(duration)
    return response
Alert on p99, not p50. Your happiest users don't need help.
- alert: HighLatencyP99
  expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning
Signal 2: Traffic
Traffic tells you "is this normal?" It's the context for every other signal.
# Current request rate
rate(http_requests_total[5m])
# Compare to same time last week
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 7d)
# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: >
    rate(http_requests_total[5m])
    <
    (rate(http_requests_total[5m] offset 1h) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped >50% compared to 1 hour ago"
Traffic drops are often more concerning than traffic spikes.
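The queries above assume an `http_requests_total` counter is already being exported. A minimal sketch of that instrumentation with `prometheus_client` (the `record_request` helper name is illustrative, not part of any framework):

```python
from prometheus_client import Counter, REGISTRY

# prometheus_client appends the _total suffix, yielding http_requests_total.
REQUEST_COUNT = Counter(
    'http_requests',
    'Total HTTP requests',
    ['method', 'endpoint', 'status'],
)

def record_request(method: str, endpoint: str, status: int) -> None:
    """Increment the traffic counter for one completed request."""
    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=str(status)).inc()

record_request('GET', '/api/users', 200)
record_request('GET', '/api/users', 200)
record_request('POST', '/api/users', 500)
```

With a counter like this in place, both the week-over-week comparison and the TrafficDrop alert above work unchanged.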
Signal 3: Errors
Track error rate as a percentage, not absolute count:
# Error rate percentage
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
But also track error types separately:
error_categories:
  - 5xx: "Server errors (our fault)"
  - 4xx_excluding_404: "Client errors (possible API issue)"
  - timeout: "Request timeouts"
  - circuit_breaker: "Dependency failures"
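One way to apply that categorization in code is a small classifier run on each finished request. This is a sketch of my own; the function name and the `timed_out`/`circuit_open` flags are illustrative, not a real API:

```python
from typing import Optional

def classify_error(status_code: int, timed_out: bool = False,
                   circuit_open: bool = False) -> Optional[str]:
    """Map a finished request onto an error category, or None if healthy."""
    if circuit_open:
        return 'circuit_breaker'    # dependency failure
    if timed_out:
        return 'timeout'
    if 500 <= status_code < 600:
        return '5xx'                # server error (our fault)
    if 400 <= status_code < 500 and status_code != 404:
        return '4xx_excluding_404'  # client error (possible API issue)
    return None                     # healthy request
```

The returned category can then feed a labeled counter, so each error type gets its own rate.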
Signal 4: Saturation
The most underrated signal. Saturation answers: "how close are we to full?"
# CPU saturation: CPU seconds consumed per second vs. the number of CPUs allowed by the quota
rate(container_cpu_usage_seconds_total[5m])
/
(container_spec_cpu_quota / container_spec_cpu_period)
# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes
# Connection pool saturation
active_connections / max_connections
# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity
Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.
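The 80%/95% thresholds can be captured in a small helper so every saturation check classifies the same way. A sketch under my own naming (the defaults match the thresholds above):

```python
def saturation_severity(used: float, capacity: float,
                        warn: float = 0.80, crit: float = 0.95) -> str:
    """Classify a saturation ratio as 'ok', 'warning', or 'critical'."""
    ratio = used / capacity
    if ratio >= crit:
        return 'critical'
    if ratio >= warn:
        return 'warning'
    return 'ok'
```

For example, a connection pool at 85 of 100 connections comes back as a warning, leaving headroom to act before the pool is exhausted.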
Putting It All Together
Every service gets a standard dashboard with four rows:
Row 1: Latency [p50] [p90] [p99] [error latency]
Row 2: Traffic [rate] [vs last week] [by endpoint]
Row 3: Errors [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]
This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.
The Anti-Pattern
Don't build a golden signals dashboard per service manually. Template it:
{
  "dashboard": {
    "title": "Golden Signals: {{ service_name }}",
    "templating": {
      "list": [
        { "name": "service", "type": "query" },
        { "name": "environment", "type": "custom", "options": ["prod", "staging"] }
      ]
    }
  }
}
One template, 50 dashboards. Update once, apply everywhere.
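As one minimal way to realize "one template, 50 dashboards", here is a sketch using the stdlib `string.Template` to stamp out per-service dashboard JSON. Real provisioning pipelines (Grafonnet, Terraform, the Grafana API) are more typical, and the JSON shape here is deliberately simplified:

```python
import json
from string import Template

# One template; $service_name is filled in per service at render time.
DASHBOARD_TEMPLATE = Template(json.dumps({
    "dashboard": {
        "title": "Golden Signals: $service_name",
        "tags": ["golden-signals", "$service_name"],
    }
}))

def render_dashboard(service_name: str) -> dict:
    """Render the template for one service and parse it back into a dict."""
    return json.loads(DASHBOARD_TEMPLATE.substitute(service_name=service_name))

# Update the template once, re-render, and every service's dashboard changes.
dashboards = [render_dashboard(s) for s in ('auth', 'billing', 'search')]
```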
If you want golden signal monitoring that sets itself up automatically, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com