DevOps Monitoring & Alerting — Real-World Lab (Prometheus + Grafana)

1) Why DevOps sets up email notifications

Dashboards are passive. Alerts + email are active.

You need email notifications when:

  • You are on-call and must know about incidents immediately
  • The system is unattended (night/weekend)
  • You need evidence for SLAs and incident reports

DevOps goal:

  • Detect problems before users complain
  • Reduce MTTR (mean time to recovery)
  • Avoid “silent failure” (monitoring is broken but nobody knows)

2) What must be true before email notifications can work

Email notifications depend on four layers:

  1. Exporter / Metrics exist (node_exporter up)
  2. Prometheus scrapes (Targets show UP)
  3. Grafana alert rule fires (Normal → Pending → Firing)
  4. Notification delivery (SMTP works + contact point + policy routes alerts)

In real life, most failures happen at layer 4.
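A fast way to walk the four layers from a shell, as a minimal sketch (the IPs are placeholders for your lab):

# Layer 1: node_exporter answers on the target machine
curl -s http://<TARGET_IP>:9100/metrics | head

# Layer 2: Prometheus sees its targets as healthy
curl -s http://<PROMETHEUS_IP>:9090/api/v1/targets | grep -o '"health":"[a-z]*"'

# Layers 3 and 4 live in Grafana; at minimum confirm it is up
curl -s http://<GRAFANA_IP>:3000/api/health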


3) Step-by-step: Configure SMTP on Grafana server (DevOps setup)

This is done on the machine running Grafana (your “monitor” instance).

Step 3.1 — SSH to the Grafana server

ssh -i ~/Downloads/keypaircalifornia.pem ubuntu@<GRAFANA_PUBLIC_IP>

Step 3.2 — Edit Grafana config

sudo nano /etc/grafana/grafana.ini

Step 3.3 — Add/enable SMTP section

For Gmail SMTP (lab-friendly):

[smtp]
enabled = true
host = smtp.gmail.com:587
user = YOUR_SENDER_GMAIL@gmail.com
password = YOUR_GMAIL_APP_PASSWORD
from_address = YOUR_SENDER_GMAIL@gmail.com
from_name = Grafana Alerts
skip_verify = true
startTLS_policy = OpportunisticStartTLS

DevOps notes (what matters)

  • host: SMTP server + port
  • user: mailbox used to send alerts (sender)
  • password: App Password, not the normal Gmail password
  • from_address: must match the sender for best deliverability
  • startTLS_policy: enables encryption for SMTP
  • skip_verify = true: skips TLS certificate verification; acceptable for a lab, avoid in production

A quick connectivity check from the Grafana server is sketched below.
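A minimal reachability sketch, assuming outbound port 587 is open from the Grafana server (nc may need installing; Ctrl+C exits the openssl session):

# Can we reach Gmail SMTP at all?
nc -vz smtp.gmail.com 587

# Does the server offer STARTTLS?
openssl s_client -starttls smtp -connect smtp.gmail.com:587 -crlf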

Step 3.4 — Restart Grafana to load changes

sudo systemctl restart grafana-server
sudo systemctl status grafana-server

If Grafana fails to start, your config has a syntax problem.

Step 3.5 — Watch Grafana logs while testing (DevOps habit)

sudo journalctl -u grafana-server -f

Keep this terminal open while testing notifications.


4) Step-by-step: Gmail App Password (Most common failure)

A typical error:
535 5.7.8 Username and Password not accepted (BadCredentials)

That means you used a normal password or Gmail blocked the sign-in.

Step 4.1 — Enable 2-Step Verification (required)

Google Account → Security → 2-Step Verification ON

Step 4.2 — Create App Password

Google Account → Security → App passwords → create one for “Mail”
Copy the 16-character app password.

Step 4.3 — Put that App Password in grafana.ini

Paste it without spaces.

Restart Grafana again.

DevOps tip

When you see one of these (a log check is sketched after the list):

  • 535 BadCredentials → wrong password/app password missing
  • 534-5.7.9 Application-specific password required → needs app password
  • connection timeout → network egress blocked / wrong SMTP host/port
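To confirm which failure Grafana is actually hitting, pull just the SMTP-related lines out of its log, as a sketch (assumes the systemd unit is named grafana-server, as above):

sudo journalctl -u grafana-server --since "10 min ago" | grep -iE "smtp|dial|535|534"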

5) Step-by-step: Configure Grafana UI (Contact point + policy)

SMTP is server-side configuration. The UI decides WHO gets notified.

Step 5.1 — Create Contact Point

Grafana → Alerting → Contact points → Create contact point

Step 5.2 — Test Contact Point (mandatory)

Click Test.

Expected:

  • UI: “Test notification sent”
  • Inbox: “Grafana test notification”
  • Logs: show email send attempt

If it fails:

  • Look at the UI error + logs
  • Fix SMTP first

Step 5.3 — Configure Notification Policy (routing)

Grafana → Alerting → Notification policies

Ensure there is a policy that routes alerts to your contact point.
Options:

  • Put your email contact point in the Default policy, or
  • Create a policy that matches labels like:

    • severity = critical
    • team = devops

DevOps rule

No policy route → no notification, even if contact point exists.
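You can double-check both pieces without clicking through the UI. A sketch, assuming Grafana 9+ (which exposes the alerting provisioning API) and admin credentials:

# List contact points and the notification policy tree
curl -s -u admin:<ADMIN_PASSWORD> http://<GRAFANA_IP>:3000/api/v1/provisioning/contact-points
curl -s -u admin:<ADMIN_PASSWORD> http://<GRAFANA_IP>:3000/api/v1/provisioning/policies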


6) Step-by-step: Create a “real” alert and trigger it

Step 6.1 — Create alert rule (example: High CPU)

Use Prometheus datasource and query:

CPU %:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Condition:

  • IS ABOVE 80
  • For 1m

Labels (important for routing):

  • severity = warning or critical
  • team = devops

Save rule.
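Before waiting for the rule to fire, you can preview the exact value it will evaluate by running the same query against the Prometheus HTTP API (the IP is a placeholder; --data-urlencode handles the spaces in the expression):

curl -sG http://<PROMETHEUS_IP>:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'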

Step 6.2 — Trigger CPU load on target machine

On node exporter VM:

sudo apt update
sudo apt install -y stress
stress --cpu 2 --timeout 180
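If the stress package is not available on the target, a plain-shell fallback works just as well (a sketch: two busy loops for about three minutes):

timeout 180 sh -c 'yes > /dev/null' &
timeout 180 sh -c 'yes > /dev/null' &
wait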

Step 6.3 — Watch alert state

Grafana → Alerting → Active alerts:

  • Normal → Pending → Firing

Step 6.4 — Confirm email arrives

You should get:

  • FIRING email
  • RESOLVED email after load ends

7) How DevOps reads an alert email (what matters)

When an alert email comes, DevOps must answer:

A) What is the problem?

  • “High CPU”
  • “Node down”
  • “Disk almost full”

This tells you the urgency and the type of incident.

B) Which system/server?

Look for:

  • instance label (IP:port)
  • job label (node/prometheus)
  • environment label (prod/dev) if you use it

In your lab, the most important label is:

  • instance="172.31.x.x:9100"

C) How bad is it?

Look for:

  • Severity label: warning vs critical
  • Actual value (CPU 92%, disk 95%)
  • “For 1m” or “For 5m” indicates persistence

D) Is it new or recurring?

Check:

  • Start time
  • Frequency
  • Similar previous emails

E) What action should I take first?

DevOps initial actions should be fast:

For High CPU:

  1. SSH to server
  2. Check top processes:
   top
   ps aux --sort=-%cpu | head
  3. Identify cause: deployment? runaway job? attack?
  4. Mitigation: restart service, scale out, stop job

For Node Down:

  1. Check if host is reachable (ping/ssh)
  2. AWS instance status checks
  3. Security group changes?
  4. node_exporter service status

For Disk Full:

  1. Find biggest usage:
   df -h
   sudo du -xh / | sort -h | tail
  2. Clean logs / expand disk / rotate logs (see the cleanup sketch below)
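A few quick log-cleanup wins, as a sketch (verify what you delete before doing this in production):

sudo journalctl --vacuum-size=200M        # shrink the systemd journal
sudo logrotate -f /etc/logrotate.conf     # force a rotation cycle
sudo du -xh /var/log | sort -h | tail     # confirm the biggest remaining offenders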

8) What DevOps must pay attention to (best practices)

1) Always alert on monitoring failures

Critical alert:

up{job="node"} == 0

Because if node exporter dies, you become blind.

2) Avoid noisy alerts

Use:

  • FOR 1m or FOR 5m
  • avg / rate windows

Otherwise you get spam and start ignoring alerts.

3) Include context in alerts

Use labels/annotations:

  • summary: “CPU above 80% on {{ $labels.instance }}”
  • description: “Check top, deployments, scaling”

4) Test notifications regularly

DevOps must test after:

  • SMTP changes
  • Grafana upgrades
  • firewall changes
  • password rotations

5) Separate “Warning” vs “Critical”

Example:

  • warning: CPU > 80% for 5m
  • critical: CPU > 95% for 2m

9) Mini checklist

✅ SMTP configured in /etc/grafana/grafana.ini
✅ Gmail App Password (not normal password)
✅ Grafana restarted
✅ Contact point created + Test succeeded
✅ Notification policy routes alerts to contact point
✅ Alert rule has correct query + labels
✅ Trigger event causes Firing + email received

🧪 PromQL LAB: Why Node Exporter Is Mandatory for DevOps

🔁 Architecture Reminder (Before Lab)

[ Linux Server ]
   └── node_exporter (system metrics)
            ↓
        Prometheus (scrapes metrics)
            ↓
        Grafana (query + alert + notify)

LAB PART 1 — What Prometheus Knows WITHOUT Node Exporter

Step 1 — Open Prometheus UI

http://<PROMETHEUS_IP>:9090

Go to Graph tab.


Step 2 — Run this query

up

Expected result:

You will see something like:

up{job="prometheus"} = 1

DevOps explanation:

  • Prometheus knows itself
  • It knows nothing about CPU, memory, disk
  • up only means “can I scrape this endpoint?”

👉 Important DevOps truth:

Prometheus by itself only knows if targets are reachable, not how the system behaves.
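The same check works without the UI, straight against the Prometheus HTTP API (the IP is a placeholder; jq is optional and only used for readable output):

curl -sG http://<PROMETHEUS_IP>:9090/api/v1/query --data-urlencode 'query=up' | jq '.data.result'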


Step 3 — Try this query (WITHOUT node_exporter)

node_cpu_seconds_total

Expected result:

No data

Why?

  • Prometheus does not collect OS metrics
  • Prometheus is not an agent
  • It only pulls what is exposed

👉 DevOps conclusion:

Prometheus is a collector, not a sensor.


LAB PART 2 — What Node Exporter Adds

Now node_exporter is installed and running on the target machine.


Step 4 — Confirm node exporter is scraped

up{job="node"}

Expected result:

up{instance="172.31.x.x:9100", job="node"} = 1

DevOps meaning:

  • Prometheus can reach node_exporter
  • Metrics are available
  • Monitoring is alive

LAB PART 3 — CPU Metrics (Most Common Incident)

Step 5 — Raw CPU metric

node_cpu_seconds_total

What students see:

  • Multiple time series
  • Labels:

    • cpu="0"
    • mode="idle" | user | system | iowait

DevOps explanation:

  • Linux CPU time is cumulative
  • Metrics grow forever
  • We must use rate() to make sense of it

Step 6 — CPU usage percentage (REAL DEVOPS QUERY)

100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)

What this shows:

  • CPU usage %
  • Per server

DevOps interpretation:

  • 0–30% → normal
  • 50–70% → watch
  • > 80% → alert
  • > 95% → incident

👉 Why DevOps needs this:

  • High CPU causes:

    • Slow apps
    • Timeouts
    • Failed deployments

LAB PART 4 — Memory Metrics (Silent Killers)

Step 7 — Total memory

node_memory_MemTotal_bytes

Interpretation:

  • Physical RAM installed
  • Does NOT change

Step 8 — Available memory

node_memory_MemAvailable_bytes

DevOps meaning:

  • How much memory apps can still use
  • A better signal than “free” memory, because it also counts reclaimable cache

Step 9 — Memory usage percentage

(
  1 - (
    node_memory_MemAvailable_bytes
    /
    node_memory_MemTotal_bytes
  )
) * 100

DevOps interpretation:

  • Memory > 80% → danger
  • Memory leaks show up as a slow, steady increase
  • OOM kills happen suddenly

👉 Why DevOps needs this:

Memory issues crash apps without warning if not monitored.
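To convince yourself the numbers are real, cross-check node_exporter against the OS on the target machine (a sketch, run where node_exporter listens on localhost:9100):

free -b | head -2
curl -s http://localhost:9100/metrics | grep -E '^node_memory_Mem(Total|Available)_bytes'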


LAB PART 5 — Disk Metrics (Most Dangerous)

Step 10 — Disk usage %

100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  /
  node_filesystem_size_bytes{mountpoint="/"}
) * 100

DevOps interpretation:

  • Disk full = app crashes
  • Databases stop
  • Logs can’t write
  • OS can become unstable

👉 This alert is mandatory in production
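The same cross-check works for disk. A sketch, run on the target machine (df -B1 reports bytes, matching the metric units):

df -B1 /
curl -s http://localhost:9100/metrics | grep -E '^node_filesystem_(avail|size)_bytes.*mountpoint="/"'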


LAB PART 6 — Network Metrics (Hidden Bottlenecks)

Step 11 — Network receive rate

rate(node_network_receive_bytes_total[5m])

Step 12 — Network transmit rate

rate(node_network_transmit_bytes_total[5m])

DevOps interpretation:

  • Sudden spikes → traffic surge or attack
  • Drops → network issues
  • Used in:

    • DDoS detection
    • Load testing validation

LAB PART 7 — Proving Why Node Exporter Is REQUIRED

Question to students:

“Why can’t Prometheus do this alone?”

Answer:

Prometheus:

  • ❌ Does not know CPU
  • ❌ Does not know memory
  • ❌ Does not know disk
  • ❌ Does not know network
  • ❌ Does not run on every server

Node Exporter:

  • ✅ Reads /proc, /sys
  • ✅ Exposes OS internals safely
  • ✅ Lightweight
  • ✅ Industry standard

👉 DevOps conclusion:

Prometheus without exporters is blind.


LAB PART 8 — Real Incident Simulation

Step 13 — Generate CPU load

stress --cpu 2 --timeout 120

Step 14 — Watch PromQL graph change

100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100

DevOps observation:

  • CPU spikes
  • Alert transitions to Firing
  • Email notification sent

WHAT DEVOPS MUST PAY ATTENTION TO

1️⃣ Always monitor exporters themselves

up{job="node"} == 0

Because:

If exporter dies, monitoring dies silently.


2️⃣ Use time windows correctly

  • rate(...[1m]) → fast reaction
  • rate(...[5m]) → stable alerts

3️⃣ Avoid raw counters

Bad:

node_cpu_seconds_total

Good:

rate(node_cpu_seconds_total[5m])

4️⃣ Labels matter

  • instance → which server
  • job → which role
  • mountpoint → which disk

“Prometheus collects metrics,
node_exporter exposes system data,
PromQL turns numbers into insight,
alerts turn insight into action.”

🧪 LAB: Monitor KIND Kubernetes using EC2 Prometheus (Central Monitoring)

🧱 Final Architecture (Explain First)

EC2 (Prometheus + Grafana)
          |
          |  scrape metrics
          v
KIND cluster
 ├─ control-plane node (node-exporter pod)
 ├─ worker node (node-exporter pod)

👉 Prometheus stays on EC2
👉 KIND is just another “target”


PHASE 0 — Prerequisites

On your laptop:

  • KIND cluster running
  • kubectl configured
kubectl get nodes

On EC2:

  • Prometheus already running
  • Prometheus UI accessible
  • You know the Prometheus EC2 instance's IP (public IP for SSH and the UI)

PHASE 1 — Deploy Node Exporter in KIND (DaemonSet)

Why DaemonSet

“If something must run on every node → DaemonSet”

Node exporter:

  • Needs host access
  • One per node
  • Not per pod

STEP 1 — Create monitoring namespace

kubectl create namespace monitoring

STEP 2 — Node Exporter DaemonSet for KIND

Create file: node-exporter-kind.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--path.rootfs=/host/root"
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /

Apply it:

kubectl apply -f node-exporter-kind.yaml

STEP 3 — Verify node exporter pods

kubectl get pods -n monitoring -o wide

Expected:

  • One pod per KIND node
  • Each pod on a different node

👉 DevOps rule:
If a node has no exporter → you are blind on that node.
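A quick way to prove the coverage, as a sketch: the DaemonSet's DESIRED and READY counts should both equal the node count.

kubectl get ds node-exporter -n monitoring
kubectl get nodes --no-headers | wc -l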


PHASE 2 — Expose Node Exporter to EC2 Prometheus

Key Concept (VERY IMPORTANT)

KIND runs locally, inside Docker on your laptop.
EC2 Prometheus cannot reach 127.0.0.1 or the KIND Docker network directly.

So we expose node exporter via a NodePort.


STEP 4 — Create NodePort Service

Create file: node-exporter-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
      nodePort: 30910

Apply:

kubectl apply -f node-exporter-svc.yaml

STEP 5 — Verify NodePort

kubectl get svc -n monitoring

You should see:

node-exporter   NodePort   9100:30910/TCP

STEP 6 — Test metrics locally (sanity check)

From your laptop (if localhost does not answer, curl the KIND node's container IP from kubectl get nodes -o wide instead):

curl http://localhost:30910/metrics | head

You MUST see:

node_cpu_seconds_total
node_memory_MemAvailable_bytes

👉 If this fails → Prometheus will fail too.
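The same check must also pass from the EC2 side, because that is the exact path Prometheus will use. A sketch, assuming port 30910 is actually reachable from the internet (with KIND this usually needs extraPortMappings in the cluster config, plus any router or firewall forwarding in between):

# Run these on the EC2 Prometheus server
nc -vz <YOUR_LAPTOP_PUBLIC_IP> 30910
curl -s http://<YOUR_LAPTOP_PUBLIC_IP>:30910/metrics | head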


PHASE 3 — Configure EC2 Prometheus to Scrape KIND

STEP 7 — Edit Prometheus config on EC2

SSH into EC2 Prometheus server:

ssh -i keypair.pem ubuntu@<EC2_IP>

Edit config:

sudo nano /etc/prometheus/prometheus.yml

Add this new job under scrape_configs:

  - job_name: "kind-node-exporter"
    static_configs:
      - targets:
          - "<YOUR_LAPTOP_PUBLIC_IP>:30910"

⚠️ Replace <YOUR_LAPTOP_PUBLIC_IP>
(Use curl ifconfig.me on the laptop.)
The laptop must be reachable from EC2 on port 30910, so open any firewall or router port-forwarding in between.
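Before restarting, it is worth validating the edited file; promtool ships with Prometheus (path as used above):

promtool check config /etc/prometheus/prometheus.yml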


STEP 8 — Reload Prometheus

sudo systemctl restart prometheus

Or, if Prometheus was started with --web.enable-lifecycle, use the reload endpoint:

curl -X POST http://localhost:9090/-/reload

STEP 9 — Verify targets in Prometheus UI

Open:

Status → Targets

Expected:

kind-node-exporter   UP

👉 This is the big success moment.


PHASE 4 — PromQL Labs (KIND Nodes)

Now PromQL works unchanged.


LAB 1 — Is KIND node visible?

up{job="kind-node-exporter"}

Interpretation:

  • 1 → node reachable
  • 0 → cluster blind

LAB 2 — CPU usage of KIND node

100 - (
  avg by (instance) (
    rate(node_cpu_seconds_total{mode="idle"}[5m])
  ) * 100
)

Teach students:

  • This is host CPU
  • Includes kubelet, containers, OS

LAB 3 — Memory usage

(
  1 - (
    node_memory_MemAvailable_bytes
    /
    node_memory_MemTotal_bytes
  )
) * 100

Explain:

  • High memory → pod OOMKills
  • Kubernetes hides this unless you look

LAB 4 — Disk usage (CRITICAL)

100 - (
  node_filesystem_avail_bytes{mountpoint="/"}
  /
  node_filesystem_size_bytes{mountpoint="/"}
) * 100

Explain:

  • Disk full → kubelet stops
  • Pods fail silently

PHASE 5 — Create Alerts

Node exporter down (MANDATORY)

up{job="kind-node-exporter"} == 0

High CPU

100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80

Disk almost full

node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024

Alerts go to same email.


PHASE 6 — Incident Simulation

Scenario

Pods restarting randomly.

Step 1 — Kubernetes view

kubectl get pods

Looks normal.

Step 2 — Node metrics (Prometheus)

CPU or disk is high.

Step 3 — DevOps action

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets

👉 Node exporter revealed the real cause.
