The Kubernetes Monitoring Maze
Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.
After four years of running K8s in production, here's what I've found actually matters.
The Three Layers
Kubernetes observability has three distinct layers, and you need different strategies for each:
Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)
Layer 1: Cluster Health
These are your "is the platform working?" metrics:
critical_cluster_metrics:
  nodes:
    - node_ready_status          # Are all nodes healthy?
    - node_cpu_utilization       # Alert at 85%
    - node_memory_utilization    # Alert at 90%
    - node_disk_pressure         # Boolean alert
    - node_pid_pressure          # Rarely fires, always critical
  control_plane:
    - apiserver_request_latency_p99    # Alert > 1s
    - etcd_disk_wal_fsync_duration     # Alert > 100ms
    - scheduler_pending_pods           # Alert if > 0 for 5min
    - controller_manager_queue_depth   # Alert if growing
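To make the API server latency check above concrete, here's a minimal alert sketch. It assumes the standard apiserver_request_duration_seconds histogram exposed by kube-apiserver and excludes long-lived WATCH/CONNECT requests; adjust labels to your setup:
# Alert when API server p99 request latency exceeds 1s
alert: APIServerHighLatency
expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])) by (le, verb)) > 1
for: 5m
labels:
  severity: critical
annotations:
  summary: "API server p99 latency for {{ $labels.verb }} requests is above 1s"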
Pro tip: Don't alert on individual node CPU. Alert on cluster-level capacity:
# Alert when cluster is 80% utilized
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) > 0.80
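The same idea works for memory. A sketch assuming node_exporter's MemAvailable/MemTotal gauges:
# Alert when less than 20% of cluster memory is available
(
  sum(node_memory_MemAvailable_bytes)
  /
  sum(node_memory_MemTotal_bytes)
) < 0.20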
Layer 2: Workload Health
This is where most teams get it wrong. They monitor pods instead of workloads.
critical_workload_metrics:
  deployments:
    - available_replicas < desired_replicas        # For > 5min
    - deployment_generation != observed_generation # Stuck rollout
  pods:
    - restart_count increasing     # CrashLoopBackOff detection
    - container_oom_killed         # Memory limits too low
    - pod_pending_duration > 2min  # Scheduling issues
  hpa:
    - current_replicas == max_replicas  # Scale ceiling hit
    - cpu_utilization_vs_target         # Consistently above target
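As a sketch of the first deployment check in the list, here's how it might look with kube-state-metrics (assuming its standard kube_deployment_* series):
# Deployment has had fewer available replicas than desired for 5 minutes
alert: DeploymentReplicasMismatch
expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
for: 5m
labels:
  severity: warning
annotations:
  summary: "Deployment {{ $labels.deployment }} in {{ $labels.namespace }} is missing replicas"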
The most valuable alert I ever wrote:
# Detect pods stuck in CrashLoopBackOff
alert: PodCrashLooping
expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.pod }} is crash-looping"
  description: "{{ $labels.pod }} in {{ $labels.namespace }} has restarted {{ $value }} times in 15 minutes"
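A close second is catching OOM kills. A hedged sketch using kube-state-metrics' last-terminated-reason series (it fires whenever a container's most recent termination was an OOM kill; your label set may vary):
# Container was OOM-killed: memory limits are probably too low
alert: ContainerOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
labels:
  severity: warning
annotations:
  summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOM-killed"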
Layer 3: Application Performance
This is what your users actually care about:
application_metrics:
  red_method:   # Rate, Errors, Duration
    - request_rate_per_second
    - error_rate_percentage   # Alert > 1%
    - request_duration_p99    # Alert > 500ms
  use_method:   # Utilization, Saturation, Errors
    - cpu_request_vs_limit_ratio
    - memory_request_vs_limit_ratio
    - network_receive_bytes_rate
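For the RED-method error rate, a sketch assuming your services expose a counter like http_requests_total with a status-code label and a service label (those names are placeholders, not a standard API):
# Alert when more than 1% of requests fail over 5 minutes
alert: HighErrorRate
expr: sum by (service) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.01
for: 5m
labels:
  severity: critical
annotations:
  summary: "{{ $labels.service }} error rate is above 1%"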
The Dashboard That Saves Us
We built a single "K8s Health" dashboard with four panels:
- Cluster capacity — CPU/Memory/Disk utilization per node pool
- Workload status — Table of all deployments with health status
- Error rates — All services, sorted by error rate
- Recent events — K8s events filtered to warnings and errors
This one dashboard answers 90% of "is something wrong?" questions.
Common Mistakes
- Monitoring pods instead of services — Pods are ephemeral, services are what matter
- Not setting resource requests — Without requests, your metrics are meaningless
- Alerting on resource usage instead of SLOs — High CPU isn't a problem if latency is fine
- Ignoring the control plane — An unhealthy API server affects everything
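On the resource-requests point: without requests, utilization ratios and scheduling metrics have no baseline. A minimal container spec fragment (values are illustrative, tune them to your workload):
# Requests give the scheduler and your utilization metrics a baseline
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi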
If you want unified Kubernetes observability without the complexity, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com