1. Introduction
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.
2. Run service alertmanager
Add the alertmanager service to your Docker Compose file and update the stack.
alertmanager:
  image: prom/alertmanager:v0.22.2
  container_name: alertmanager
  volumes:
    - /etc/alertmanager:/etc/alertmanager
  command:
    - '--config.file=/etc/alertmanager/config.yml'
    - '--storage.path=/alertmanager'
  ports:
    - 9093:9093
  restart: unless-stopped
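For alerts to flow end to end, Prometheus also has to know where Alertmanager lives and which rules file to load. If your prometheus.yml doesn't contain these sections yet, the additions look roughly like this (a minimal sketch, assuming Prometheus runs in the same Compose network as the alertmanager service above and reads its config from /etc/prometheus/prometheus.yml):

# Load the alerting rules file created in the next step
rule_files:
  - '/etc/prometheus/rules.yml'

# Send fired alerts to the alertmanager service defined above
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'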
3. Create alerting rules in Prometheus
We move to the server subfolder, open it in the code editor, and create a new rules file. In rules.yml, you specify the conditions under which you would like to be alerted.
$ sudo nano /etc/prometheus/rules.yml
After you've decided on your alerting conditions, specify them in rules.yml. Its content will be the following:
- Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) is down for more than 1 minute.

groups:
- name: AllInstances
  rules:
  - alert: InstanceDown
    # Condition for alerting
    expr: up == 0
    for: 1m
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Instance {{ $labels.instance }} down'
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'critical'
- Trigger an alert if the Docker host CPU is under high load for more than 30 seconds.

  - alert: high_cpu_load
    # Condition for alerting
    expr: node_load1 > 1.5
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Server under high load'
      description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'warning'
- Trigger an alert if the Docker host memory is almost full.

  - alert: high_memory_load
    # Condition for alerting
    expr: (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached) ) / sum(node_memory_MemTotal) * 100 > 85
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Server memory is almost full'
      description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'warning'
- Trigger an alert if the Docker host storage is almost full.

  - alert: high_storage_load
    # Condition for alerting
    expr: (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"}) / node_filesystem_size{fstype="aufs"} * 100 > 85
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Server storage is almost full'
      description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'warning'
- Trigger an alert if a container is down for more than 30 seconds.

  - alert: redis_down
    # Condition for alerting
    expr: absent(container_memory_usage_bytes{name="redis"})
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Redis down'
      description: "The Redis container has been down for more than 30 seconds."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'critical'
- Trigger an alert if a container is using more than 10% of total CPU cores for more than 30 seconds.

  - alert: redis_high_cpu
    # Condition for alerting
    expr: sum(rate(container_cpu_usage_seconds_total{name="redis"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Redis high CPU usage'
      description: "Redis CPU usage is {{ humanize $value}}%."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'warning'
- Trigger an alert if a container is using more than 1.2 GB of RAM for more than 30 seconds.

  - alert: redis_high_memory
    # Condition for alerting
    expr: sum(container_memory_usage_bytes{name="redis"}) > 1200000000
    for: 30s
    # Annotation - additional informational labels to store more information
    annotations:
      title: 'Redis high memory usage'
      description: "Redis memory consumption is at {{ humanize $value}}."
    # Labels - additional labels to be attached to the alert
    labels:
      severity: 'warning'
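To catch syntax mistakes before (re)starting Prometheus, you can validate the rules file with promtool, which ships with the Prometheus release (and is included in the prom/prometheus image):

$ promtool check rules /etc/prometheus/rules.yml

If Prometheus itself runs in Docker, you can run the same check inside the container with docker exec; the exact container name depends on your Compose file.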
4. Set up Slack alerts
If you want to receive notifications via Slack, you should be part of a Slack workspace. To set up alerting in your Slack workspace, you’re going to need a Slack API URL. Go to Slack -> Administration -> Manage apps.
In the Manage apps directory, search for Incoming WebHooks and add it to your Slack workspace.
Next, specify in which channel you'd like to receive notifications from Alertmanager. (I created a #monitoring-infrastructure channel.) After you confirm and add the Incoming WebHooks integration, the webhook URL (which is your Slack API URL) is displayed. Copy it.
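Optionally, you can confirm the webhook works before wiring it into Alertmanager by posting a test message with curl (the URL below is just a placeholder; use your own webhook URL):

$ curl -X POST -H 'Content-type: application/json' \
    --data '{"text": "Test message from Incoming WebHooks"}' \
    https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX

A test message should appear in the channel you selected.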
5. Set up Alertmanager
The Alertmanager service is responsible for handling alerts sent by the Prometheus server. Alertmanager can send notifications via email, Pushover, Slack, HipChat or any other system that exposes a webhook interface.
The notification receivers can be configured in the alertmanager/config.yml file. Copy the Slack webhook URL into the slack_api_url field and specify a Slack channel.
$ sudo nano /etc/alertmanager/config.yml
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/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'

route:
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#monitoring-instances'
    send_resolved: true
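Before reloading, you can verify the configuration with amtool, which is bundled with Alertmanager (and available inside the prom/alertmanager image); the container name below matches the container_name set in the Compose file:

$ docker exec alertmanager amtool check-config /etc/alertmanager/config.yml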
Reload the configuration by sending a POST request to the /-/reload endpoint: curl -X POST http://localhost:9093/-/reload. In a couple of minutes (after you stop at least one of your instances), you should receive your alert notifications through Slack, like this:
If you would like to improve your notifications and make them look nicer, you can use the template below, or use this tool and create your own.
global:
  resolve_timeout: 1m
  slack_api_url: 'https://hooks.slack.com/services/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'

route:
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#monitoring-instances'
    send_resolved: true
    icon_url: https://avatars3.githubusercontent.com/u/3380462
    title: |-
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
      {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
        {{" "}}(
        {{- with .CommonLabels.Remove .GroupLabels.Names }}
          {{- range $index, $label := .SortedPairs -}}
            {{ if $index }}, {{ end }}
            {{- $label.Name }}="{{ $label.Value -}}"
          {{- end }}
        {{- end -}}
        )
      {{- end }}
    text: >-
      {{ range .Alerts -}}
      *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

      *Description:* {{ .Annotations.description }}

      *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
6. Set up PagerDuty Alerts
PagerDuty is one of the most well-known incident response platforms for IT departments. To set up alerting through PagerDuty, you need to create an account there. (PagerDuty is a paid service, but you can always do a 14-day free trial.) Once you’re logged in, go to Configuration -> Services -> + New Service.
Choose Prometheus from the Integration types list and give the service a name — I decided to call mine Prometheus Alertmanager. (You can also customize the incident settings, but I went with the default setup.) Then click save.
The Integration Key will be displayed. Copy the key.
You'll need to update the content of your Alertmanager config.yml. It should look like the example below, but use your own service_key (the integration key from PagerDuty). The pagerduty_url should stay the same and should be set to https://events.pagerduty.com/v2/enqueue. Save and restart Alertmanager.
global:
  resolve_timeout: 1m
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  receiver: 'pagerduty-notifications'

receivers:
- name: 'pagerduty-notifications'
  pagerduty_configs:
  - service_key: 0c1cc665a594419b6d215e81f4e38f7
    send_resolved: true
Stop one of your instances. After a couple of minutes, alert notifications should be displayed in PagerDuty.
In PagerDuty user settings, you can decide on how you’d like to be notified. I chose both — email and phone call — and I was notified via both.
7. Set up Gmail Alerts
If you prefer to be notified by email, the setup is even easier. Alertmanager can simply pass on emails to email services — in this case, Gmail — which then sends them on your behalf.
It’s not recommended that you use your personal password for this, so you should create an App Password. To do that, go to Account Settings -> Security -> Signing in to Google -> App password (if you don’t see App password as an option, you probably haven’t set up 2-Step Verification and will need to do that first). Copy the newly-created password.
You'll need to update the content of your Alertmanager config.yml again. The content should look similar to the example below. Don't forget to replace the email address with your own email address, and the password with your new App Password.
global:
  resolve_timeout: 1m

route:
  receiver: 'gmail-notifications'

receivers:
- name: 'gmail-notifications'
  email_configs:
  - to: monitoringinstances@gmail.com
    from: monitoringinstances@gmail.com
    smarthost: smtp.gmail.com:587
    auth_username: monitoringinstances@gmail.com
    auth_identity: monitoringinstances@gmail.com
    auth_password: password
    send_resolved: true
Once again, after a couple of minutes (after you stop at least one of your instances), alert notifications should be sent to your Gmail.
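If you don't want to stop an instance just to test the notification pipeline, you can also push a synthetic alert straight into Alertmanager through its API (a rough sketch; the alertname and labels here are made up for the test):

$ curl -X POST http://localhost:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    --data '[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "test"}}]'

Alertmanager routes it like any other alert, so it should show up in whichever receiver you currently have configured.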
8. Conclusion
Thank you very much for taking the time to read this. I would really appreciate any comments in the comments section.
Enjoy🎉