Samson Tanimawo

Posted on Apr 16

Kubernetes Observability: What to Monitor and Why

#kubernetes #observability #monitoring #sre

The Kubernetes Monitoring Maze

Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.

After four years of running K8s in production, here's what I've found actually matters.

The Three Layers

Kubernetes observability has three distinct layers, and you need different strategies for each:

Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)

Layer 1: Cluster Health

These are your "is the platform working?" metrics:

critical_cluster_metrics:
  nodes:
    - node_ready_status        # Are all nodes healthy?
    - node_cpu_utilization     # Alert at 85%
    - node_memory_utilization  # Alert at 90%
    - node_disk_pressure       # Boolean alert
    - node_pid_pressure        # Rarely fires, always critical

  control_plane:
    - apiserver_request_latency_p99  # Alert > 1s
    - etcd_disk_wal_fsync_duration   # Alert > 100ms
    - scheduler_pending_pods         # Alert if > 0 for 5min
    - controller_manager_queue_depth # Alert if growing
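
As a rough sketch, a few of those thresholds could be written as Prometheus alerting rules like the ones below. This assumes kube-state-metrics is installed and that the API server, etcd, and scheduler metrics are actually scraped (many managed clusters don't expose etcd or scheduler metrics). The rule names, severities, and most of the `for:` durations are placeholders I picked for illustration, not values from a real setup.

```yaml
groups:
  - name: control-plane-sketch   # hypothetical group name
    rules:
      - alert: NodeNotReady
        # kube-state-metrics reports 1 on the series whose status matches the condition
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m                  # assumed duration
        labels:
          severity: critical
      - alert: APIServerHighLatency
        # p99 request latency above 1s, ignoring long-running WATCH/CONNECT requests
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m                 # assumed duration
        labels:
          severity: warning
      - alert: EtcdSlowWalFsync
        # p99 WAL fsync above 100ms usually points at slow disks
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, instance) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])
            )
          ) > 0.1
        for: 10m                 # assumed duration
        labels:
          severity: critical
      - alert: PodsPendingInScheduler
        # scheduler_pending_pods is split by queue, so sum across queues
        expr: sum(scheduler_pending_pods) > 0
        for: 5m
        labels:
          severity: warning
```

The latency rules aggregate histogram buckets with `rate()` before calling `histogram_quantile()`, which is the usual way to get a percentile across instances.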

Pro tip: Don't alert on individual node CPU. Alert on cluster-level capacity:

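# Alert when cluster CPU is 80% utilized (averaged over 5 minutes).
# node_cpu_seconds_total is a counter, so compare rates, not raw totals.
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) > 0.80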
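
The same cluster-level idea applies to memory. Here's a minimal sketch, assuming node_exporter is running on every node; the 80% threshold, the rule name, and the `for:` duration are my assumptions, so tune them to your own headroom.

```yaml
groups:
  - name: cluster-capacity-sketch   # hypothetical group name
    rules:
      - alert: ClusterMemoryCapacityHigh
        # Fraction of total memory across all nodes that is NOT available
        expr: |
          1 - (
            sum(node_memory_MemAvailable_bytes)
            /
            sum(node_memory_MemTotal_bytes)
          ) > 0.80
        for: 15m                    # assumed duration
        labels:
          severity: warning
```

Memory values are gauges, so unlike the CPU counter there is no `rate()` here.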

Layer 2: Workload Health

This is where most teams get it wrong. They monitor pods instead of workloads.

critical_workload_metrics:
  deployments:
    - available_replicas < desired_replicas  # For > 5min
    - deployment_generation != observed_generation  # Stuck rollout

  pods:
    - restart_count increasing       # CrashLoopBackOff detection
    - container_oom_killed            # Memory limits too low
    - pod_pending_duration > 2min     # Scheduling issues

  hpa:
    - current_replicas == max_replicas  # Scale ceiling hit
    - cpu_utilization_vs_target         # Consistently above target
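
Most of that list maps fairly directly onto kube-state-metrics series. The rules below are a sketch of how that might look, assuming kube-state-metrics v2 metric names; the alert names and any `for:` duration not mentioned in the list above are placeholders.

```yaml
groups:
  - name: workload-health-sketch   # hypothetical group name
    rules:
      - alert: DeploymentReplicasUnavailable
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 5m
      - alert: DeploymentRolloutStuck
        # The controller has not yet observed the latest spec generation
        expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 10m                   # assumed duration
      - alert: ContainerOOMKilled
        # Stays at 1 until the container terminates again for a different reason
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 2m
      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m                   # assumed duration
```

Each comparison joins two series on their shared labels (namespace plus deployment or HPA name), so you get one alert per workload rather than one per pod.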

The most valuable alert I ever wrote:

# Detect pods that keep restarting (CrashLoopBackOff and similar)
alert: PodCrashLooping
# increase() yields an approximate restart count over the window; rate() would be per-second
expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.pod }} is crash-looping"
  description: "{{ $labels.pod }} in {{ $labels.namespace }} has restarted about {{ $value | humanize }} times in the last 15 minutes"

Layer 3: Application Performance

This is what your users actually care about:

application_metrics:
  red_method:  # Rate, Errors, Duration
    - request_rate_per_second
    - error_rate_percentage        # Alert > 1%
    - request_duration_p99         # Alert > 500ms

  use_method:  # Utilization, Saturation, Errors
    - cpu_usage_vs_limit_ratio
    - memory_usage_vs_limit_ratio
    - network_receive_bytes_rate
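
To make the RED half concrete, here's a sketch of the three signals as Prometheus rules. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, and the `service` and `code` labels, are assumptions about how your apps are instrumented; substitute whatever your client library or service mesh actually exports.

```yaml
groups:
  - name: red-method-sketch   # hypothetical group name
    rules:
      # Rate: requests per second, per service
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Errors: share of requests returning 5xx, alert above 1%
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 5m               # assumed duration
      # Duration: p99 latency, alert above 500ms
      - alert: HighP99Latency
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m               # assumed duration
```

A service with no 5xx responses at all produces no series on the left side of the error ratio, so that alert simply stays silent for it.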

The Dashboard That Saves Us

We built a single "K8s Health" dashboard with four panels:

Cluster capacity: CPU/Memory/Disk utilization per node pool
Workload status: Table of all deployments with health status
Error rates: All services, sorted by error rate
Recent events: K8s events filtered to warnings and errors

This one dashboard answers 90% of "is something wrong?" questions.

Common Mistakes

Monitoring pods instead of services: Pods are ephemeral; services are what matter
Not setting resource requests: Without requests, the scheduler has nothing to plan placement around and utilization-vs-request ratios have no baseline
Alerting on resource usage instead of SLOs: High CPU isn't a problem if latency is fine
Ignoring the control plane: An unhealthy API server affects everything
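
For the SLO point, one way to alert on user impact rather than resource usage is an error-budget burn-rate rule. This is a simplified single-window sketch, not a full multi-window setup: it assumes a 99.9% availability SLO over 30 days and the same hypothetical `http_requests_total` metric with `service` and `code` labels.

```yaml
groups:
  - name: slo-sketch   # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn
        # 99.9% SLO leaves a 0.1% error budget (0.001).
        # Burning at 14.4x for 1h spends about 2% of a 30-day budget.
        expr: |
          (
            sum by (service) (rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum by (service) (rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m        # assumed duration
        labels:
          severity: critical
```

The 14.4 factor comes from 2% of a 720-hour budget consumed in one hour (0.02 × 720 = 14.4). A rule like this stays quiet during a CPU spike that users never notice.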

If you want unified Kubernetes observability without the complexity, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
