Aisalkyn Aidarova

Posted on May 13

SRE Monitoring Lab 2

Prometheus Alerting + AlertManager + Grafana Alerts

Now you will build a REAL SRE alerting system.

In production, monitoring without alerts is useless.

This lab teaches you:

- What alerts are
- How SRE engineers detect failures
- How Prometheus rules work
- How AlertManager works
- How alert routing works
- How to troubleshoot alerts
- How incidents are detected in production

You already have Prometheus, Grafana, and Node Exporter running on your EC2 Ubuntu machine.

# WHAT SRE ENGINEERS MUST UNDERSTAND FIRST

# Why Monitoring Alone Is Not Enough

Dashboards are passive, and SRE engineers cannot stare at them 24/7.

Instead:

```text
System detects problems automatically
→ sends alerts
→ engineers respond
```




---

# REAL PRODUCTION FLOW



```text
Node Exporter
      ↓
Prometheus scrapes metrics
      ↓
Alert Rules evaluate metrics
      ↓
AlertManager receives alerts
      ↓
Email / Slack / PagerDuty
```

---

# WHAT IS ALERTMANAGER?

Alertmanager handles the alerts that Prometheus sends to it.

It:

- groups related alerts into a single notification
- deduplicates repeated alerts
- routes alerts to the right receiver
- sends notifications
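
These four jobs map onto a handful of fields in `alertmanager.yml`. The following is a minimal sketch, not this lab's configuration: the receiver names `default` and `oncall` are hypothetical, neither has a notification integration attached, and the file is written to a scratch path rather than the live config.

```bash
# Write a hypothetical routing sketch to a scratch file (not the live config).
example=/tmp/alertmanager-routing-example.yml
cat > "$example" <<'EOF'
route:
  receiver: default                      # fallback receiver for unmatched alerts
  group_by: ['alertname', 'instance']    # alerts sharing these labels form one notification
  group_wait: 30s                        # wait before sending the first notification of a group
  group_interval: 5m                     # wait before notifying about new alerts in a group
  repeat_interval: 4h                    # re-send interval while an alert keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall                   # critical alerts go to a different receiver
receivers:
  - name: default
  - name: oncall
EOF

# Validate the sketch only if amtool is already installed.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$example"
fi
```

Grouping and the three timers are what keep one failing node from producing a stream of separate notifications.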

---

# WHAT SRE ENGINEERS MUST KNOW

Prometheus:

- stores metrics
- evaluates rules

AlertManager:

- handles notifications

Grafana:

- visualization

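---

# LAB GOAL

You will create:

| Alert       | Trigger              |
| ----------- | -------------------- |
| High CPU    | CPU usage > 70%      |
| Node Down   | Exporter unavailable |
| High Memory | Memory usage > 80%   |

Then you will intentionally break the system and watch the alerts fire.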

---

# ARCHITECTURE

```text
Node Exporter
↓
Prometheus
↓
AlertManager
↓
Alert UI
```




---

# STEP 1 — DOWNLOAD ALERTMANAGER

SSH into your EC2 instance.

Go to:

```bash
cd /tmp
```

Download:

```bash
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
```




---

# STEP 2 — EXTRACT FILE



```bash
tar -xvf alertmanager-0.28.1.linux-amd64.tar.gz
```

---

# STEP 3 — MOVE BINARIES

```bash
sudo mv alertmanager-0.28.1.linux-amd64/alertmanager /usr/local/bin/
sudo mv alertmanager-0.28.1.linux-amd64/amtool /usr/local/bin/
```




---

# WHAT IS AMTOOL?

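`amtool` is the command-line tool that ships with AlertManager. It can validate the configuration file, query and silence alerts, and send test alerts.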
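
As a rough illustration of typical `amtool` usage, the sketch below validates a config file and lists current alerts. It assumes the config path that Step 6 of this lab creates and an Alertmanager listening on its default port on the same machine; both commands print a message instead of aborting when those assumptions do not hold yet.

```bash
# Assumed locations: config path from Step 6, Alertmanager on its default port.
AM_CONFIG=/etc/alertmanager/alertmanager.yml
AM_URL=http://localhost:9093

if command -v amtool >/dev/null 2>&1; then
  # Validate the configuration file.
  amtool check-config "$AM_CONFIG" || echo "config check failed (file missing or invalid)"
  # List the alerts a running Alertmanager currently holds.
  amtool --alertmanager.url="$AM_URL" alert query || echo "Alertmanager not reachable at $AM_URL"
else
  echo "amtool is not installed yet"
fi
```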

---

# STEP 4 — CREATE USER



```bash
sudo useradd -rs /bin/false alertmanager
```

---

# STEP 5 — CREATE DIRECTORIES

```bash
sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager
```




---

# STEP 6 — COPY CONFIG



```bash
sudo cp /tmp/alertmanager-0.28.1.linux-amd64/alertmanager.yml /etc/alertmanager/
```

---

# STEP 7 — SET OWNERSHIP

```bash
sudo chown -R alertmanager:alertmanager /etc/alertmanager
sudo chown -R alertmanager:alertmanager /var/lib/alertmanager
```




---

# STEP 8 — CREATE SYSTEMD SERVICE



```bash
sudo nano /etc/systemd/system/alertmanager.service
```

Paste:

```ini
[Unit]
Description=AlertManager
After=network.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target
```

Save the file.

---

# STEP 9 — START ALERTMANAGER



```bash
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
```

---

# STEP 10 — VERIFY SERVICE

```bash
sudo systemctl status alertmanager
```

You should see:

```text
active (running)
```

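---

# STEP 11 — OPEN SECURITY GROUP

Go to the AWS Console.

EC2 → Security Groups

Add an inbound rule:

| Port | Purpose      |
| ---- | ------------ |
| 9093 | AlertManager |

Source:

```text
0.0.0.0/0
```

`0.0.0.0/0` opens the port to the whole internet, and the AlertManager UI has no authentication by default. That is acceptable only for a short-lived lab; otherwise restrict the source to your own IP.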




---

# STEP 12 — VERIFY ALERTMANAGER UI

Open in a browser:

```text
http://YOUR_PUBLIC_IP:9093
```

You should see the AlertManager UI.
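
If the page does not load, it helps to separate "AlertManager is not running" from "the security group is blocking me". One way to do that, sketched here under the assumption that you run it on the EC2 instance itself, is to call AlertManager's health endpoint locally. The helper name `am_healthy` is made up for this example.

```bash
# Hypothetical helper: returns 0 when the given Alertmanager base URL
# answers its health endpoint, non-zero otherwise.
am_healthy() {
  curl -fsS --max-time 5 "${1:-http://localhost:9093}/-/healthy" >/dev/null 2>&1
}

if am_healthy; then
  echo "Alertmanager is healthy locally: check the security group if the browser still fails"
else
  echo "Alertmanager did not answer on localhost:9093: check the service first"
fi
```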

---

# WHAT SRE ENGINEERS MUST KNOW

| Port | Service       |
| ---- | ------------- |
| 9090 | Prometheus    |
| 9093 | AlertManager  |
| 9100 | Node Exporter |
| 3000 | Grafana       |

---

# STEP 13 — CONNECT PROMETHEUS TO ALERTMANAGER

Edit the Prometheus config:

```bash
sudo nano /etc/prometheus/prometheus.yml
```

Add:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093
```

---

# STEP 14 — ADD RULE FILE

In the same file, add:

```yaml
rule_files:
  - "alert.rules.yml"
```

# FULL STRUCTURE

```yaml
global:
  scrape_interval: 15s

rule_files:
  - "alert.rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

# Keep the scrape_configs section you already have below this.
```




---

# STEP 15 — CREATE ALERT RULE FILE



```bash
sudo nano /etc/prometheus/alert.rules.yml
```

Paste:

```yaml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter Down"

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 70
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage"
```

Save the file.

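---

# WHAT SRE ENGINEERS MUST UNDERSTAND

# ALERT RULE STRUCTURE

| Field       | Meaning                                                      |
| ----------- | ------------------------------------------------------------ |
| alert       | Alert name                                                   |
| expr        | PromQL expression that defines the alert condition           |
| for         | How long the condition must stay true before the alert fires |
| labels      | Extra labels attached to the alert, such as severity         |
| annotations | Human-readable info                                          |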
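
To see how the five fields fit together, here is one rule written out with an extra templated `description` annotation. This is a sketch: the group name `example_alerts` and the description text are made up for illustration, and the file goes to a scratch path so it does not touch the lab's real rule file.

```bash
# Write a single example rule to a scratch file (hypothetical group name).
example=/tmp/alert-rule-example.yml
cat > "$example" <<'EOF'
groups:
  - name: example_alerts
    rules:
      - alert: NodeDown                 # alert: the alert name
        expr: up == 0                   # expr: PromQL condition
        for: 1m                         # for: how long it must stay true
        labels:
          severity: critical            # labels: used later for routing
        annotations:
          summary: "Node Exporter Down"
          description: "{{ $labels.instance }} has been unreachable for more than 1 minute."
EOF

# Validate the rule syntax only if promtool is available.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$example"
fi
```

The `{{ $labels.instance }}` template is filled in by Prometheus when the alert fires, so the notification names the affected machine.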

---

# WHAT IS PROMQL?

PromQL is the Prometheus Query Language.

It is used for:

- dashboards
- alerts
- troubleshooting
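
For troubleshooting, a PromQL expression can also be run from the shell through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable on its default port from the instance; `prom_query` is a hypothetical helper name.

```bash
PROM_URL=http://localhost:9090   # assumed: Prometheus on its default port

# Hypothetical helper: run one PromQL expression and print the raw JSON result.
prom_query() {
  curl -fsS --max-time 5 --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Which scrape targets are up (1) or down (0)?
prom_query 'up' || echo "Prometheus not reachable at $PROM_URL"

# The memory expression from the HighMemoryUsage rule, without the threshold.
prom_query '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100' \
  || echo "Prometheus not reachable at $PROM_URL"
```

Running an alert expression without its threshold shows the current value, which is handy when deciding whether the threshold is sensible.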

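---

# CPU ALERT EXPLANATION

The `HighCPUUsage` expression takes the per-second rate of idle CPU time over the last minute, averages it across all cores of each instance, converts it to a percentage, and subtracts it from 100. The result is:

```text
Non-idle CPU percentage
```

Meaning:

```text
Actual CPU usage
```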
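
The arithmetic behind that expression can be worked through by hand. The idle rate of `0.22` below is a hypothetical sample value, standing in for what `avg by(instance)(rate(...{mode="idle"}[1m]))` might return on a busy machine.

```bash
# Hypothetical sample: average idle rate across cores
# (seconds spent idle per second of wall-clock time).
idle_rate=0.22

# Same arithmetic as the rule: 100 - (idle_rate * 100)
busy_pct=$(awk -v idle="$idle_rate" 'BEGIN { printf "%.1f", 100 - idle * 100 }')
echo "CPU busy: ${busy_pct}%"

# Compare against the rule's threshold of 70.
if awk -v p="$busy_pct" 'BEGIN { exit !(p > 70) }'; then
  echo "HighCPUUsage condition is true"
else
  echo "HighCPUUsage condition is false"
fi
```

With 22% idle time the machine is 78% busy, so the condition `> 70` holds.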

---

# STEP 16 — VERIFY CONFIGURATION

Run:

```bash
promtool check config /etc/prometheus/prometheus.yml
```

The output should contain:

```text
SUCCESS
```

---

# STEP 17 — RESTART PROMETHEUS

```bash
sudo systemctl restart prometheus
```




---

# STEP 18 — VERIFY ALERTS PAGE

Open in a browser:

```text
http://YOUR_PUBLIC_IP:9090/alerts
```

You should see the three alert rules, each in state:

```text
Inactive
```
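
The same check can be made from the shell, which also confirms that Prometheus picked up the AlertManager target from Step 13. This is a sketch assuming Prometheus on its default port on the instance; `prom_api` is a hypothetical helper.

```bash
PROM_URL=http://localhost:9090   # assumed: Prometheus on its default port

# Hypothetical helper: fetch a Prometheus HTTP API path,
# failing on connection or HTTP errors.
prom_api() {
  curl -fsS --max-time 5 "$PROM_URL$1"
}

# Alertmanager instances Prometheus is sending alerts to (Step 13).
prom_api /api/v1/alertmanagers || echo "Prometheus not reachable at $PROM_URL"

# Rule groups Prometheus has loaded (Step 15), including each rule's state.
prom_api /api/v1/rules || echo "Prometheus not reachable at $PROM_URL"
```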




---

# STEP 19 — TRIGGER CPU ALERT

Install the stress tool if it is not installed:

```bash
sudo apt install stress -y
```

Generate load:

```bash
stress --cpu 4 --timeout 180
```




---

# WHAT HAPPENS?

CPU usage rises.

On each evaluation, Prometheus checks:

```text
CPU > 70%
```

While the condition is true, the alert is first shown as Pending. Once it has stayed true for 1 minute:

```text
Alert fires
```




---

# STEP 20 — WATCH ALERT

Go to:

```text
http://YOUR_PUBLIC_IP:9090/alerts
```

You should see:

```text
FIRING
```




---

# STEP 21 — CHECK ALERTMANAGER

Open:

```text
http://YOUR_PUBLIC_IP:9093
```

You should see the alert there.
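
If nothing shows up, one way to tell whether the problem is in AlertManager or in the Prometheus connection is to push a hand-made alert with `amtool`. The sketch below assumes AlertManager on its default port; the alert name `LabTestAlert` and its labels are invented for the test.

```bash
AM_URL=http://localhost:9093   # assumed: Alertmanager on its default port

if command -v amtool >/dev/null 2>&1; then
  # Push a hand-made alert straight into Alertmanager, bypassing Prometheus.
  amtool --alertmanager.url="$AM_URL" alert add \
    --annotation=summary="Manual test alert" \
    alertname=LabTestAlert severity=warning instance=lab \
    || echo "could not reach Alertmanager at $AM_URL"
else
  echo "amtool not found in PATH"
fi
```

If the test alert appears in the UI but the CPU alert does not, the `alerting` block in `prometheus.yml` is the next thing to check.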

---

# WHAT SRE ENGINEERS MUST KNOW

Prometheus evaluates the rules and decides when an alert fires.

AlertManager receives the firing alerts and turns them into notifications.

---

# STEP 22 — TEST NODE DOWN ALERT

Stop the exporter:

```bash
sudo systemctl stop node_exporter
```

Wait a little over 1 minute.

Refresh:

```text
http://YOUR_PUBLIC_IP:9090/alerts
```

You should see:

```text
NodeDown FIRING
```




---

# THIS IS REAL SRE INCIDENT DETECTION

Production example:

```text
Kubernetes node dies
→ exporter unreachable
→ NodeDown fires
→ PagerDuty alerts SRE
```

---

# STEP 23 — TROUBLESHOOT AS SRE

# CHECK SERVICE

```bash
sudo systemctl status node_exporter
```

---

# CHECK PORT

```bash
ss -tulnp | grep 9100
```

---

# CHECK ENDPOINT

```bash
curl localhost:9100/metrics
```

---

# CHECK PROMETHEUS TARGETS

```text
http://YOUR_PUBLIC_IP:9090/targets
```

When you have finished troubleshooting, bring the exporter back with `sudo systemctl start node_exporter` and confirm that the target returns to UP.

---

# WHAT SRE ENGINEERS MUST MEMORIZE

| Command    | Purpose       |
| ---------- | ------------- |
| systemctl  | services      |
| journalctl | logs          |
| curl       | endpoint test |
| ss         | ports         |
| top        | CPU           |
| free -m    | RAM           |
| df -h      | disk          |
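
Several of these commands can be chained into a quick first-pass check. The function below is a sketch, not a standard tool: `triage` is a made-up name, and it assumes the service exposes `/metrics` on the given port on localhost, as Node Exporter does in this lab.

```bash
# Hypothetical helper: first-pass triage of an exporter-style service.
# Usage: triage <systemd-unit> <port>
triage() {
  unit=$1
  port=$2

  # 1. Is the service running?
  if systemctl is-active --quiet "$unit" 2>/dev/null; then
    echo "service: active"
  else
    echo "service: NOT active"
  fi

  # 2. Is anything listening on the TCP port?
  if ss -tln 2>/dev/null | grep -q ":$port "; then
    echo "port $port: listening"
  else
    echo "port $port: NOT listening"
  fi

  # 3. Does the metrics endpoint answer?
  if curl -fsS --max-time 5 "http://localhost:$port/metrics" >/dev/null 2>&1; then
    echo "endpoint: OK"
  else
    echo "endpoint: FAILED"
  fi
}

triage node_exporter 9100
```

When a check fails, `journalctl -u node_exporter -n 50 --no-pager` shows the most recent log lines for the unit.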

---

# IMPORTANT SRE CONCEPTS

# Alert Fatigue

Too many alerts lead engineers to ignore them.

SRE engineers carefully tune:

- thresholds
- durations
- severities

# Severity Levels

| Severity | Meaning          |
| -------- | ---------------- |
| critical | immediate action |
| warning  | monitor          |
| info     | informational    |

---

# WHY "for: 1m"?

It prevents short-lived spikes from triggering alerts: the condition must stay true for a full minute before the alert fires.

Without it:

```text
Tiny CPU spike
→ false alert
```
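
The effect of `for` can be simulated with a few lines of shell. The sketch assumes rules are evaluated every 15 seconds (an assumption; your `evaluation_interval` may differ) and uses made-up CPU readings. An alert is pending from the first evaluation where the condition is true and fires once it has stayed true for the full minute.

```bash
# Hypothetical CPU readings (%), one per rule evaluation, 15s apart.
samples="65 85 60 75 80 90 95 88"
threshold=70
for_evals=4        # 1m / 15s: evaluations spent pending before firing
streak=0
states=""

for s in $samples; do
  # Count consecutive evaluations where the condition is true.
  if [ "$s" -gt "$threshold" ]; then
    streak=$((streak + 1))
  else
    streak=0
  fi

  if [ "$streak" -eq 0 ]; then
    state=inactive
  elif [ "$streak" -le "$for_evals" ]; then
    state=pending      # condition true, but not yet for a full minute
  else
    state=firing       # condition has held for at least 1m
  fi

  echo "cpu=${s}% -> $state"
  states="${states:+$states }$state"
done
```

The lone 85% reading only reaches pending and then resets, while the sustained run at the end of the list is the only one that fires.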




---

# WHAT REAL COMPANIES USE

| Tool         | Purpose       |
| ------------ | ------------- |
| AlertManager | routing       |
| PagerDuty    | on-call       |
| Opsgenie     | escalation    |
| Slack        | notifications |
| Email        | alerts        |

---

# REAL INTERVIEW QUESTIONS

# What is AlertManager?

The component that receives alerts from Prometheus, then groups, deduplicates, and routes them to receivers as notifications.

---

# Difference Between Prometheus and AlertManager?

| Prometheus      | AlertManager         |
| --------------- | -------------------- |
| stores metrics  | handles alerts       |
| evaluates rules | routes notifications |

---

# What is PromQL?

Prometheus Query Language.

---

# Why use "for" in alerts?

Avoid false positives.

---

# What happens if Node Exporter stops?

Prometheus target becomes DOWN and alert fires.
