Are Grafana UI Alerts Insufficient? Why You Need Alertmanager and How to Set It Up

#monitoring #devops

Grafana's user-friendly interface is a great starting point for visualizing your metrics and creating basic alerts. However, in production environments with complex and critical systems, we often encounter situations where these built-in alerting mechanisms fall short. In this post, I will explain why Grafana UI alerts often fail to meet expectations and why we should turn to more advanced solutions like Alertmanager, using concrete examples from my own experience.

Grafana UI Alerts: Attractive at First Glance, Limited Deep Down

Creating an alert rule from Grafana's interface is quite simple. You select a metric, set a threshold, and configure a notification channel (Slack, email, etc.). This may be sufficient, especially for early-stage projects or simple monitoring needs. However, as your systems grow over time, the need for finer tuning, more complex conditions, and more advanced routing emerges.

For example, while monitoring the shipping module of a production ERP system, you might want to track the number of delayed orders. In the Grafana UI, a simple rule like "delayed_orders > 10" is easy to create. However, once you want the rule to fire only during business hours, to use different thresholds for a specific order type, or to notify different people depending on which threshold is exceeded, the built-in features start to fall short.

A second example: last month, we were alerting whenever the average CPU usage on a client's system exceeded 80%. The alerts fired constantly, and each time we had to open the dashboard to find out whether there was a real problem. The result was what is known as "alert fatigue."

ℹ️ What is Alert Fatigue?

Alert fatigue is a state where system administrators or developers become tired of constantly receiving irrelevant or minor alerts, eventually becoming desensitized to real problems. This situation can lead to critical alerts being missed.

Grafana UI alerts are well suited to simple conditions on individual metrics. They fall short for alerts that combine multiple metrics, need to be evaluated over time windows, or rely on complex state logic.
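In a Prometheus-based setup, conditions like these are usually written as alerting rules in a rules file. The following is a minimal sketch rather than a tuned rule set: it assumes node_exporter metrics (node_cpu_seconds_total, node_load5) are being scraped, and the group name, alert names, and the load threshold of 4 are illustrative assumptions.

```yaml
# rules/cpu.yml (hypothetical file name), loaded by Prometheus via rule_files
groups:
  - name: cpu-alerts            # illustrative group name
    rules:
      # Windowing: CPU usage is averaged over 5 minutes, and the condition
      # must hold for 10 minutes before the alert fires, which filters out spikes.
      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }} for 10 minutes"

      # Combining metrics: fire only when CPU usage AND the 5-minute load
      # average are both high on the same instance.
      - alert: HighCpuAndLoad
        expr: |
          (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)
          and on (instance)
          (node_load5 > 4)
        for: 10m
        labels:
          severity: critical
```

The for clause is what addresses the constantly firing CPU alert described earlier: short bursts above 80% never leave the pending state, so no notification is sent for them.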

Why Do We Need Alertmanager?

As part of the Prometheus ecosystem, Alertmanager offers capabilities to group, route, and silence incoming alerts. This allows us to build a more robust and flexible alerting system that goes beyond Grafana UI alerts. One of the biggest advantages of Alertmanager is its ability to route alerts based on advanced rules.

For example, if a server's CPU usage exceeds 90%, you can send this alert directly to the system administrators group. If the disk utilization of the same server reaches 95%, you can route that alert to a different group (e.g., the storage team) with a higher priority. Alertmanager also reduces notification volume by grouping repetitive alerts. If a service goes offline temporarily, instead of sending a stream of "service offline" notifications, Alertmanager can merge these alerts and deliver them as a single notification after a short waiting period.

Once, in a client project, the disk space of a database server reached a critical level. The simple alerts coming from Grafana were missed because they kept re-triggering and had become background noise. After deploying Alertmanager, we configured one alert for disk usage above 90%, and a more urgent alert, routed to the responsible team, for usage above 95%. This allowed us to detect and resolve the issue much faster. It also shows that Alertmanager is not just a notification tool, but a layer that decides which alerts matter and who should see them.
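A two-level setup like this could be sketched as follows. This is an assumption-laden illustration, not the configuration from that project: it assumes node_exporter filesystem metrics, and the alert name, the team label, and the fstype filter are hypothetical choices.

```yaml
# rules/disk.yml (hypothetical file name), loaded by Prometheus
groups:
  - name: disk-alerts
    rules:
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
          team: storage          # illustrative label used for routing
```

On the Alertmanager side, an inhibit rule can then mute the warning while the critical alert for the same filesystem is firing, so that only the more urgent notification goes out:

```yaml
# Fragment of alertmanager.yml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    # Only inhibit when both alerts refer to the same alert, host, and mount point.
    equal: ['alertname', 'instance', 'mountpoint']
```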

Alertmanager Installation and Basic Configuration

Alertmanager is typically deployed alongside Prometheus. Prometheus collects metrics, evaluates the alert rules, and sends firing alerts to Alertmanager. Alertmanager itself is configured through the alertmanager.yml file, which contains the basic settings that determine how it groups, routes, and delivers notifications.
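One common way to run the two side by side is with containers. The following Docker Compose file is a minimal sketch under a few assumptions: the official prom/prometheus and prom/alertmanager images are used, both configuration files sit in the current directory, and the images read their configuration from their default paths.

```yaml
# docker-compose.yml (illustrative; image tags and ports are assumptions)
services:
  alertmanager:
    image: prom/alertmanager
    volumes:
      # Mounted at the path the image reads by default
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
    depends_on:
      - alertmanager
```

Inside this Compose network, Prometheus would reach Alertmanager at alertmanager:9093 rather than localhost:9093, so the target address in prometheus.yml needs to be adjusted accordingly.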

A basic alertmanager.yml file might look like this:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#alerts'

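  # A second receiver. On its own this only defines a destination:
  # alerts reach it only when a route names it as its receiver.
  - name: 'critical-receiver'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#critical-alerts'

Prometheus also has to be told where Alertmanager is running. That setting belongs in prometheus.yml, not in alertmanager.yml, because Alertmanager's configuration has no alerting key:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093'] # Address where Alertmanager is listening

In this configuration:

global: General settings. resolve_timeout is how long Alertmanager waits before treating an alert as resolved when it stops receiving updates for it. It matters mainly for senders that do not set an end time, so it has little effect on alerts coming from Prometheus.
route: Contains the routing rules: which labels alerts are grouped by (group_by), how long to wait before sending the first notification for a new group (group_wait), how long to wait before notifying about new alerts added to an existing group (group_interval), and how long to wait before repeating a notification for alerts that are still firing (repeat_interval).
receivers: Defines where notifications will go (Slack, PagerDuty, email, etc.) and to which channel they will be sent.
alerting (in prometheus.yml): Tells Prometheus which Alertmanager instances to send its alerts to.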

Even this simple configuration provides grouping and basic routing that are hard to achieve with simple Grafana UI alert rules. For example, the group_by: ['alertname', 'cluster', 'service'] line puts all alerts that share the same alert name, cluster, and service into one group. If the same alert fires on ten instances of a service at once, you receive a single notification listing all of them instead of ten separate ones.

Advanced Routing and Grouping Strategies

The power of Alertmanager lies in its advanced routing and grouping capabilities. Within the route block, you can define sub-routes that route alerts with specific labels to different receivers. This allows you to intelligently distribute alerts based on the criticality level of your systems, the type of issue, or the affected team.

For example, while working on an e-commerce platform, you can send "5xx server error" alerts of the main website directly to the operations team, while routing alerts about the failure of a background analytics job to the data engineering team. This ensures that each team receives notifications related to their area of responsibility and prevents unnecessary distractions.

route:
  receiver: 'default-receiver'
  routes:
    - receiver: 'critical-receiver'
      matchers:
        - severity = "critical" # Alerts with the severity label 'critical'
      continue: true # Check other routes after this match as well

    - receiver: 'database-receiver'
      matchers:
        - service = "postgres" # Alerts with the service label 'postgres'
      continue: true

    - receiver: 'frontend-receiver'
      matchers:
        - service = "frontend"
        - alertname = "HighLatency"
      continue: false # Do not look for other routes after this match

In this example:

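All alerts with the label severity = "critical" go to critical-receiver. Because of continue: true, Alertmanager keeps evaluating the routes that follow, so a critical alert that also has service = "postgres" is delivered to database-receiver as well. The default-receiver is used only for alerts that match none of the sub-routes.
Alerts with the label service = "postgres" go to database-receiver, and evaluation continues with the next route.
Alerts matching both service = "frontend" and alertname = "HighLatency" go to frontend-receiver, and no further routes are evaluated due to continue: false (which is also the default).
Note that database-receiver and frontend-receiver must also be defined in the receivers section for this configuration to load.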
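The business-hours requirement mentioned earlier for the delayed-orders alert can also be handled at the routing level. The sketch below assumes a recent Alertmanager release that supports top-level time_intervals; the interval name, the shipping-receiver receiver, and the DelayedOrdersHigh alert name are hypothetical. Times are interpreted as UTC unless a location is configured.

```yaml
# Fragment of alertmanager.yml
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'

route:
  receiver: 'default-receiver'
  routes:
    - receiver: 'shipping-receiver'      # hypothetical receiver, must be defined under receivers
      matchers:
        - alertname = "DelayedOrdersHigh"  # hypothetical alert name
      # Notifications for this route are only sent inside the interval above.
      active_time_intervals:
        - business-hours
```

The alert still fires and is visible in Alertmanager outside business hours; only the notifications for this route are held back.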

This type of configuration is critical in combating "alert fatigue." For example, in a recent issue I experienced with an automation system on a production line, there were "low value" alerts coming from multiple sensors. When these alerts were triggered at the same time, it became difficult to understand which sensor was actually problematic. Using Alertmanager's grouping capabilities, we collected alerts from different sensors for the same line type under a single notification. This allowed us to find the root cause of the problem much faster.
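A grouping setup along those lines might look like the following sketch. The line_type label, the SensorLowValue alert name, and the line-operators receiver are assumptions made for illustration; the actual labels depend on how the sensor metrics are exported.

```yaml
# Fragment of alertmanager.yml
route:
  receiver: 'default-receiver'
  routes:
    - receiver: 'line-operators'         # hypothetical receiver
      matchers:
        - alertname = "SensorLowValue"   # hypothetical alert name
      # One notification per line type, regardless of how many sensors fire.
      group_by: ['alertname', 'line_type']
      group_wait: 1m        # collect alerts from other sensors before the first notification
      group_interval: 5m    # batch sensors that start firing later
```

Because the sensor identifier is deliberately left out of group_by, every affected sensor on a line shows up in the same notification, which makes it easier to see at a glance which ones are involved.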

Notification Strategies and Integrations

Another important feature offered by Alertmanager is control over notification behavior. Settings such as how often a notification is repeated and whether a notification is sent when an alert is resolved (send_resolved) shape the notification flow. repeat_interval determines how long to wait before sending a notification again for an alert that is still firing. This keeps critical issues from being forgotten, because you get a reminder at fixed intervals, without flooding you with the same alert over and over.
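As a minimal sketch, reusing the receiver names from the basic configuration above, these settings could be combined as follows. The 30-minute reminder for critical alerts is an illustrative value, not a recommendation.

```yaml
# Fragment of alertmanager.yml
route:
  receiver: 'default-receiver'
  repeat_interval: 4h            # default reminder cadence for unresolved alerts
  routes:
    - receiver: 'critical-receiver'
      matchers:
        - severity = "critical"
      repeat_interval: 30m       # remind more often while a critical alert is firing

receivers:
  - name: 'default-receiver'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#alerts'
        send_resolved: true      # also notify when the alert clears
  - name: 'critical-receiver'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#critical-alerts'
        send_resolved: true
```

Sub-routes inherit repeat_interval from their parent unless they override it, so only the routes that need a different cadence have to set it.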

Alertmanager can integrate with many popular notification services such as Slack, PagerDuty, OpsGenie, and VictorOps. These integrations accelerate incident response processes by creating notifications that reach the right team at the right time. For example, integration with an on-call management system like PagerDuty can automatically call or send an SMS to the on-call engineer when a critical alert is triggered.
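A PagerDuty receiver could be sketched like this, assuming an Events API v2 integration key. The receiver name is hypothetical, and the placeholder key has to be replaced with the key of your own PagerDuty service.

```yaml
# Fragment of alertmanager.yml
route:
  receiver: 'default-receiver'
  routes:
    - receiver: 'oncall-pagerduty'
      matchers:
        - severity = "critical"

receivers:
  - name: 'oncall-pagerduty'     # hypothetical receiver name
    pagerduty_configs:
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY>'
        severity: 'critical'     # severity reported to PagerDuty
```

Whether the on-call engineer is then called, texted, or notified in the app is decided by the escalation policy on the PagerDuty side, not by Alertmanager.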

In my own projects, especially in my Android spam blocker app, I was sending critical error notifications directly to my development email address, while sending less critical error notifications to a logging service. This distinction allowed me to quickly resolve errors I encountered during the development process, while also keeping me informed about the overall health of the system. Alertmanager facilitates such flexible integrations and notification strategies.

Conclusion: Alertmanager for Smarter Alerting Systems

While Grafana UI alerts are useful for simple monitoring needs, they fall short given the complexity and criticality of production environments. Alertmanager allows us to build a more robust, flexible, and smart alerting system with its advanced grouping, routing, and notification management capabilities.

As your systems grow and become more complex, tools like Alertmanager become necessary to combat "alert fatigue" and prevent critical alerts from being missed. This not only speeds up incident response but also increases operational efficiency and improves system reliability. My own experience has shown time and again how valuable the detailed control and automation offered by Alertmanager are when managing complex systems.

As a next step, you can learn how to integrate Alertmanager with your Prometheus setup and create alert rules tailored to your own needs. This is one of the most important steps you can take for the health and security of your systems.