DEV Community

Samson Tanimawo

Kubernetes Observability: What to Monitor and Why

The Kubernetes Monitoring Maze

Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.

After running K8s in production for four years, here's what actually matters.

The Three Layers

Kubernetes observability has three distinct layers, and you need different strategies for each:

Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)

Layer 1: Cluster Health

These are your "is the platform working?" metrics:

critical_cluster_metrics:
  nodes:
    - node_ready_status        # Are all nodes healthy?
    - node_cpu_utilization     # Alert at 85%
    - node_memory_utilization  # Alert at 90%
    - node_disk_pressure       # Boolean alert
    - node_pid_pressure        # Rarely fires, always critical

  control_plane:
    - apiserver_request_latency_p99  # Alert > 1s
    - etcd_disk_wal_fsync_duration   # Alert > 100ms
    - scheduler_pending_pods         # Alert if > 0 for 5min
    - controller_manager_queue_depth # Alert if growing
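
As a rough sketch, three of the control-plane checks above could be written as Prometheus alerting rules like the ones below. This assumes the API server, etcd, and scheduler metrics endpoints are actually scraped (many managed clusters hide some of them). The alert names, the `for` durations on the first two rules, the severity labels, and the choice to exclude long-lived WATCH/CONNECT requests are illustrative assumptions, not something the list prescribes. The controller-manager queue check is left out here.

```yaml
groups:
  - name: control-plane-health   # group name is an arbitrary choice
    rules:
      # p99 API server request latency above 1s.
      # WATCH and CONNECT are excluded (assumption) because they are
      # long-lived by design and would dominate the percentile.
      - alert: ApiServerHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning

      # p99 etcd WAL fsync duration above 100ms (0.1s).
      - alert: EtcdSlowWalFsync
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            )
          ) > 0.1
        for: 10m
        labels:
          severity: warning

      # Any pods waiting in the scheduler queues for 5 minutes.
      - alert: SchedulerPodsPending
        expr: sum(scheduler_pending_pods) > 0
        for: 5m
        labels:
          severity: warning
```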

Pro tip: Don't alert on individual node CPU. Alert on cluster-level capacity:

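# Alert when the cluster as a whole is more than 80% CPU-utilized.
# node_cpu_seconds_total is a cumulative counter, so compare rates over a
# window rather than raw totals (raw totals give a lifetime average).
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) > 0.80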

Layer 2: Workload Health

This is where most teams get it wrong. They monitor pods instead of workloads.

critical_workload_metrics:
  deployments:
    - available_replicas < desired_replicas  # For > 5min
    - deployment_generation != observed_generation  # Stuck rollout

  pods:
    - restart_count increasing       # CrashLoopBackOff detection
    - container_oom_killed            # Memory limits too low
    - pod_pending_duration > 2min     # Scheduling issues

  hpa:
    - current_replicas == max_replicas  # Scale ceiling hit
    - cpu_utilization_vs_target         # Consistently above target
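
For the deployment and HPA checks, a minimal sketch in Prometheus rule form might look like this. It assumes kube-state-metrics v2 is installed and scraped, since that is where the `kube_deployment_*` and `kube_horizontalpodautoscaler_*` series come from. Alert names, severities, and the `for` durations other than the 5 minutes mentioned above are placeholders to tune.

```yaml
groups:
  - name: workload-health   # group name is an arbitrary choice
    rules:
      # Fewer available replicas than the spec asks for, for more than 5 minutes.
      - alert: DeploymentReplicasUnavailable
        expr: |
          kube_deployment_status_replicas_available
            < kube_deployment_spec_replicas
        for: 5m
        labels:
          severity: warning

      # The controller has not yet observed the latest spec: a stuck rollout.
      - alert: DeploymentRolloutStuck
        expr: |
          kube_deployment_status_observed_generation
            != kube_deployment_metadata_generation
        for: 10m   # assumed duration
        labels:
          severity: warning

      # The HPA has been pinned at its ceiling and cannot scale further.
      - alert: HpaAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            == kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m   # assumed duration
        labels:
          severity: warning
```

One caveat with the last rule: an HPA configured with equal min and max replicas would fire it permanently, so such HPAs would need to be filtered out.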

The most valuable alert I ever wrote:

# Detect pods stuck in CrashLoopBackOff
alert: PodCrashLooping
# increase() counts restarts over the window; rate() would report restarts per second
expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.pod }} is crash-looping"
  description: '{{ $labels.pod }} in {{ $labels.namespace }} has restarted about {{ printf "%.0f" $value }} times in the last 15 minutes'
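
The same pattern covers the other two pod-level signals from the list, OOM kills and pods stuck in Pending. This is a sketch that again assumes kube-state-metrics; the alert names, the 10-minute restart window, and the severities are illustrative choices.

```yaml
groups:
  - name: pod-health   # group name is an arbitrary choice
    rules:
      # Container whose last termination was an OOM kill and that has
      # restarted recently, which usually points at a memory limit set too low.
      - alert: ContainerOOMKilled
        expr: |
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          and on (namespace, pod, container)
          increase(kube_pod_container_status_restarts_total[10m]) > 0
        labels:
          severity: warning

      # Pod has been in the Pending phase for more than 2 minutes,
      # typically a scheduling or capacity problem.
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 2m
        labels:
          severity: warning
```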

Layer 3: Application Performance

This is what your users actually care about:

application_metrics:
  red_method:  # Rate, Errors, Duration
    - request_rate_per_second
    - error_rate_percentage        # Alert > 1%
    - request_duration_p99         # Alert > 500ms

  use_method:  # Utilization, Saturation, Errors
    - cpu_usage_vs_limit_ratio       # Utilization: actual usage against the limit
    - memory_usage_vs_limit_ratio
    - network_receive_bytes_rate
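
To make the RED thresholds concrete, here is a hedged sketch of the two alerts as Prometheus rules. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, and the `service` and `status` labels, are hypothetical: they stand in for whatever your application or ingress actually exports. Only the 1% and 500ms thresholds come from the list above.

```yaml
groups:
  - name: application-red   # group name is an arbitrary choice
    rules:
      # Errors: share of 5xx responses above 1% per service.
      # Metric and label names are assumptions about your instrumentation.
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
            > 0.01
        for: 5m   # assumed duration
        labels:
          severity: critical

      # Duration: p99 request latency above 500ms (0.5s) per service.
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          ) > 0.5
        for: 5m   # assumed duration
        labels:
          severity: warning
```

The request rate itself is usually better as a dashboard panel than an alert, which is why it has no rule here.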

The Dashboard That Saves Us

We built a single "K8s Health" dashboard with four panels:

  1. Cluster capacity — CPU/Memory/Disk utilization per node pool
  2. Workload status — Table of all deployments with health status
  3. Error rates — All services, sorted by error rate
  4. Recent events — K8s events filtered to warnings and errors

This one dashboard answers 90% of "is something wrong?" questions.

Common Mistakes

  1. Monitoring pods instead of services — Pods are ephemeral, services are what matter
  2. Not setting resource requests — Without requests, your metrics are meaningless
  3. Alerting on resource usage instead of SLOs — High CPU isn't a problem if latency is fine
  4. Ignoring the control plane — An unhealthy API server affects everything
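
Mistake 2 is one you can check for mechanically. As a sketch, assuming kube-state-metrics is installed, the rule below lists containers that exist but have no CPU request recorded. The alert name, duration, and severity are placeholders.

```yaml
groups:
  - name: resource-hygiene   # group name is an arbitrary choice
    rules:
      # Every known container, minus those that report a CPU request.
      # What remains are containers running without a CPU request.
      - alert: ContainerMissingCpuRequest
        expr: |
          kube_pod_container_info
          unless on (namespace, pod, container)
          kube_pod_container_resource_requests{resource="cpu"}
        for: 30m   # assumed duration
        labels:
          severity: info
```

Swapping `resource="cpu"` for `resource="memory"` gives the equivalent check for memory requests.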

If you want unified Kubernetes observability without the complexity, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
