The 3 AM Alert That Wasn't Actually a Problem

By Soumyajyoti, Senior Software Engineer @ ProtectAI

It's 3 AM. Your phone buzzes aggressively with an alert notification:

DatasourceNoData | FIRING
→ Memory usage of [no value] in [no value] namespace is at %!f(string=)% (>80%)

Half-asleep, you grab your laptop, VPN into your infrastructure, and... everything's fine. No pods are crashing, no services are down, and CPU/memory usage across the cluster is normal. You've just been woken up by a false alert caused by a temporary blip in your monitoring system.

If this sounds familiar, you're not alone. Many Kubernetes operators and SREs face this exact challenge: how do you create robust monitoring alerts that catch real issues but don't wake you up for temporary glitches?

In this post, I'll share practical techniques our team developed for building reliable Grafana alerts for Kubernetes environments that strike the perfect balance between sensitivity and resilience.

The Hidden Costs of Alert Fatigue

Before diving into solutions, let's acknowledge why this problem matters.

False alerts aren't just annoying; they're expensive:

  • Engineer time and focus: Each unnecessary interruption costs roughly 23 minutes of refocus time before an engineer is back to productive work
  • Diminished alert credibility: Teams start ignoring alerts after too many false positives
  • SRE burnout: Nobody wants to be the on-call engineer for a system that cries wolf

At our organization, we discovered that over 60% of our after-hours alerts were false positives caused by monitoring system hiccups rather than actual infrastructure issues. That's why we invested time in refining our alerting strategy.

The Four Pillars of Reliable Kubernetes Alerts

Through trial and error, we've identified four key strategies that have dramatically reduced our false-positive rate while maintaining our ability to catch real issues.

1. Targeting the Right Workloads: Filtering for What Matters

When monitoring a Kubernetes cluster with dozens of namespaces and hundreds of deployments, alerting on everything is a recipe for noise. Instead, focus on what truly matters.

Here's how we built targeted CPU utilization alerts for critical deployments:

# CPU usage as percentage of limits for critical containers
sum by(namespace, container) (
  rate(container_cpu_usage_seconds_total{container=~"app-api|myworker"}[5m])
) / (
  sum by(namespace, container) (
    kube_pod_container_resource_limits{resource="cpu"}
  )
) * 100
or vector(0)  # Prevent "no data" scenarios

This query targets specific containers that your application depends on. The or vector(0) ensures the query always returns data, even when no matches are found.
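
The same shape works for memory, which is what the 3 AM alert at the top of this post was about. Here's a minimal sketch reusing the same illustrative container names (swap in whatever your application actually depends on):

# Memory usage as percentage of limits for critical containers
sum by(namespace, container) (
  container_memory_working_set_bytes{container=~"app-api|myworker"}
) / (
  sum by(namespace, container) (
    kube_pod_container_resource_limits{resource="memory"}
  )
) * 100
or vector(0)  # Prevent "no data" scenarios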

2. Beyond Simple Thresholds: Intelligent Alert Conditions

Alerting when any pod hits 80% CPU for a split second will bury you in notifications. Instead, use these techniques for more intelligent conditions:

  • Add time requirements: Require conditions to persist for a meaningful period
  • Consider rate of change: Alert on rapid increases, not just absolute values (see the sketch after this list)
  • Use contextual thresholds: Different workloads have different "normal" ranges
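
To illustrate the rate-of-change idea, here's a minimal PromQL sketch; the container regex comes from the earlier CPU example, and the 15-minute comparison window is an arbitrary choice, not a value we prescribe:

# Percentage growth in working-set memory over the last 15 minutes
(
  sum by(namespace, container) (
    container_memory_working_set_bytes{container=~"app-api|myworker"}
  )
  -
  sum by(namespace, container) (
    container_memory_working_set_bytes{container=~"app-api|myworker"} offset 15m
  )
) / sum by(namespace, container) (
  container_memory_working_set_bytes{container=~"app-api|myworker"} offset 15m
) * 100
or vector(0)  # Keep the query returning data even if the containers disappear

A threshold on this expression catches a sudden climb (say, memory doubling within 15 minutes) even while absolute usage is still below a static limit-based threshold.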

In our Grafana configuration, we set a 2-minute "Pending period" before firing alerts:

(Screenshot: the 2-minute Pending period setting in the Grafana alert rule)

This means a threshold must be exceeded continuously for 2 minutes before an alert fires, eliminating most transient spikes.

3. The or vector(0) Pattern: Ensuring Queries Always Return Data

One of the most frustrating Grafana alert issues occurs when your query returns no data, resulting in [no value] placeholders in alert notifications. This commonly happens when:

  • Metrics are temporarily unavailable
  • The data source connection experiences a brief hiccup
  • A label selector matches nothing (e.g., a pod is no longer running)

The solution? Add or vector(0) to your PromQL queries:

# Without vector(0) - can lead to "no data" alerts:
sum(kube_pod_status_phase{namespace="production", phase="Pending"})

# With vector(0) - always returns data:
sum(kube_pod_status_phase{namespace="production", phase="Pending"}) or vector(0)

This pattern ensures your query always returns values (with zeroes when no match), preventing those cryptic [no value] notifications.

4. The Secret Weapon: Proper No-Data and Error Handling

Even with perfect queries, temporary issues with Prometheus or Grafana can cause alert evaluation to fail. That's where Grafana's built-in error handling comes in.

In your alert configuration, find the "Configure no data and error handling" section and set:

  1. "Alert state if no data or all values are null" to "Normal"
  2. "Alert state if execution error or timeout" to "Normal"

(Screenshot: the "Configure no data and error handling" section of the Grafana alert rule)

This configuration tells Grafana to treat temporary data issues as normal conditions rather than alerting triggers. It's like telling your alert system, "If you're not sure, don't wake me up."

Conclusion: Sleep Better with Robust Alerting

Remember that alert tuning is an iterative process. Start with your most problematic alerts, implement these patterns, and observe the results before proceeding to others.

By implementing these patterns, you'll create a monitoring system that respects your team's time and attention while still providing the critical safety net you need for production systems.

The next time you're on call, you might actually get some sleep!
