1) Why DevOps sets up email notifications
Dashboards are passive. Alerts + email are active.
You need email notifications when:
- You are on-call and must know about incidents immediately
- The system is unattended (night/weekend)
- You need evidence for SLAs and incident reports
DevOps goal:
- Detect problems before users complain
- Reduce MTTR (mean time to recovery)
- Avoid “silent failure” (monitoring is broken but nobody knows)
2) What must be true before email notifications can work
Email notification depends on 4 layers:
- Exporter / Metrics exist (node_exporter up)
- Prometheus scrapes (Targets show UP)
- Grafana alert rule fires (Normal → Pending → Firing)
- Notification delivery (SMTP works + contact point + policy routes alerts)
In real life, most failures happen at layer 4.
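A quick shell check for each layer, in order (a minimal sketch; replace the placeholder IPs and adjust the job name if yours differs):
# Layer 1: the exporter answers on the target machine
curl -s http://<NODE_IP>:9100/metrics | head
# Layer 2: Prometheus reports the target's health via its HTTP API
curl -s http://<PROMETHEUS_IP>:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
# Layer 3/4: Grafana itself is up (alert rules and SMTP live here)
curl -s http://<GRAFANA_IP>:3000/api/health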
3) Step-by-step: Configure SMTP on Grafana server (DevOps setup)
This is done on the machine running Grafana (your “monitor” instance).
Step 3.1 — SSH to the Grafana server
ssh -i ~/Downloads/keypaircalifornia.pem ubuntu@<GRAFANA_PUBLIC_IP>
Step 3.2 — Edit Grafana config
sudo nano /etc/grafana/grafana.ini
Step 3.3 — Add/enable SMTP section
For Gmail SMTP (lab-friendly):
[smtp]
enabled = true
host = smtp.gmail.com:587
user = YOUR_SENDER_GMAIL@gmail.com
password = YOUR_GMAIL_APP_PASSWORD
from_address = YOUR_SENDER_GMAIL@gmail.com
from_name = Grafana Alerts
skip_verify = true
startTLS_policy = OpportunisticStartTLS
DevOps notes (what matters)
- host: SMTP server + port
- user: mailbox used to send alerts (sender)
- password: App Password, not the normal Gmail password
- from_address: must match the sender for best deliverability
- startTLS_policy: enables STARTTLS encryption for SMTP
Step 3.4 — Restart Grafana to load changes
sudo systemctl restart grafana-server
sudo systemctl status grafana-server
If Grafana fails to start, your config has a syntax problem.
Step 3.5 — Watch Grafana logs while testing (DevOps habit)
sudo journalctl -u grafana-server -f
You keep this open when testing notifications.
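To focus on delivery problems only, the same log stream can be filtered for SMTP/email lines (a small variation, assuming default systemd logging):
sudo journalctl -u grafana-server -f | grep -iE "smtp|email|notif"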
4) Step-by-step: Gmail App Password (Most common failure)
Your error:
535 5.7.8 Username and Password not accepted (BadCredentials)
That means you used a normal password or Gmail blocked the sign-in.
Step 4.1 — Enable 2-Step Verification (required)
Google Account → Security → 2-Step Verification ON
Step 4.2 — Create App Password
Google Account → Security → App passwords → create one for “Mail”
Copy the 16-character app password.
Step 4.3 — Put that App Password in grafana.ini
Paste it without spaces.
Restart Grafana again.
DevOps tip
When you see:
- 535 BadCredentials → wrong password / app password missing
- 534-5.7.9 Application-specific password required → needs an app password
- connection timeout → network egress blocked / wrong SMTP host or port
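Before assuming bad credentials, confirm the Grafana server can reach Gmail on port 587 at all (a connectivity sketch; nc and openssl may need to be installed):
# Should report success/open if outbound SMTP is allowed
nc -vz smtp.gmail.com 587
# Optional: confirm STARTTLS is offered (Ctrl+C to exit)
openssl s_client -starttls smtp -connect smtp.gmail.com:587 -crlf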
5) Step-by-step: Configure Grafana UI (Contact point + policy)
SMTP is server-side. UI decides WHO gets notified.
Step 5.1 — Create Contact Point
Grafana → Alerting → Contact points → Create contact point
- Type: Email
- Addresses: your receiver email (example: aisalkynaidarova8@gmail.com)
- Save
Step 5.2 — Test Contact Point (mandatory)
Click Test.
Expected:
- UI: “Test notification sent”
- Inbox: “Grafana test notification”
- Logs: show email send attempt
If it fails:
- Look at the UI error + logs
- Fix SMTP first
Step 5.3 — Configure Notification Policy (routing)
Grafana → Alerting → Notification policies
Ensure there is a policy that routes alerts to your contact point.
Options:
- Put your email contact point in the Default policy, or
- Create a policy that matches labels like:
  severity = critical
  team = devops
DevOps rule
No policy route → no notification, even if contact point exists.
6) Step-by-step: Create a “real” alert and trigger it
Step 6.1 — Create alert rule (example: High CPU)
Use Prometheus datasource and query:
CPU %:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Condition:
- IS ABOVE 80
- For 1m
Labels (important for routing):
- severity = warning or critical
- team = devops
Save rule.
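You can also sanity-check the same query directly against the Prometheus HTTP API before relying on the Grafana rule (a sketch; replace <PROMETHEUS_IP>):
curl -s 'http://<PROMETHEUS_IP>:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'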
Step 6.2 — Trigger CPU load on target machine
On node exporter VM:
sudo apt update
sudo apt install -y stress
stress --cpu 2 --timeout 180
Step 6.3 — Watch alert state
Grafana → Alerting → Active alerts:
- Normal → Pending → Firing
Step 6.4 — Confirm email arrives
You should get:
- FIRING email
- RESOLVED email after load ends
7) How DevOps reads an alert email (what matters)
When an alert email comes, DevOps must answer:
A) What is the problem?
- “High CPU”
- “Node down”
- “Disk almost full”
This tells urgency and type of incident.
B) Which system/server?
Look for:
- instance label (IP:port)
- job label (node/prometheus)
- environment label (prod/dev), if you use it
In your lab, the most important is:
instance="172.31.x.x:9100"
C) How bad is it?
Look for:
- Severity label: warning vs critical
- Actual value (CPU 92%, disk 95%)
- “For 1m” or “For 5m” indicates persistence
D) Is it new or recurring?
Check:
- Start time
- Frequency
- Similar previous emails
E) What action should I take first?
DevOps initial actions should be fast:
For High CPU:
- SSH to server
- Check top processes:
top
ps aux --sort=-%cpu | head
- Identify cause: deployment? runaway job? attack?
- Mitigation: restart service, scale out, stop job
For Node Down:
- Check if host is reachable (ping/ssh)
- AWS instance status checks
- Security group changes?
- node_exporter service status
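A minimal first-response sketch for Node Down (hypothetical key and IP placeholders; assumes the host still accepts SSH):
ping -c 3 <NODE_IP>
ssh -i <KEYPAIR>.pem ubuntu@<NODE_IP>
sudo systemctl status node_exporter
curl -s http://localhost:9100/metrics | head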
For Disk Full:
- Find biggest usage:
df -h
sudo du -xh / | sort -h | tail
- Clean logs / expand disk / rotate logs
8) What DevOps must pay attention to (best practices)
1) Always alert on monitoring failures
Critical alert:
up{job="node"} == 0
Because if node exporter dies, you become blind.
2) Avoid noisy alerts
Use:
- FOR 1m or FOR 5m
- avg / rate windows
Otherwise you get spam and ignore alerts.
3) Include context in alerts
Use labels/annotations:
- summary: “CPU above 80% on {{ $labels.instance }}”
- description: “Check top, deployments, scaling”
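In Grafana-managed alerts these go into the rule's Annotations fields. If you later move rules into Prometheus itself, the equivalent rule file would look roughly like this (a sketch with assumed file path, threshold, and label values; the file must also be listed under rule_files in prometheus.yml):
cat <<'EOF' | sudo tee /etc/prometheus/rules/high_cpu.yml
groups:
  - name: node-alerts
    rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: devops
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
          description: "Check top, recent deployments, scaling"
EOF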
4) Test notifications regularly
DevOps must test after:
- SMTP changes
- Grafana upgrades
- firewall changes
- password rotations
5) Separate “Warning” vs “Critical”
Example:
- warning: CPU > 80% for 5m
- critical: CPU > 95% for 2m
9) Mini checklist
✅ SMTP configured in /etc/grafana/grafana.ini
✅ Gmail App Password (not normal password)
✅ Grafana restarted
✅ Contact point created + Test succeeded
✅ Notification policy routes alerts to contact point
✅ Alert rule has correct query + labels
✅ Trigger event causes Firing + email received
🧪 PromQL LAB: Why Node Exporter Is Mandatory for DevOps
🔁 Architecture Reminder (Before Lab)
[ Linux Server ]
└── node_exporter (system metrics)
↓
Prometheus (scrapes metrics)
↓
Grafana (query + alert + notify)
LAB PART 1 — What Prometheus Knows WITHOUT Node Exporter
Step 1 — Open Prometheus UI
http://<PROMETHEUS_IP>:9090
Go to Graph tab.
Step 2 — Run this query
up
Expected result:
You will see something like:
up{job="prometheus"} = 1
DevOps explanation:
- Prometheus knows itself
- It knows nothing about CPU, memory, disk
- up only means “can I scrape this endpoint?”
👉 Important DevOps truth:
Prometheus by itself only knows if targets are reachable, not how the system behaves.
Step 3 — Try this query (WITHOUT node_exporter)
node_cpu_seconds_total
Expected result:
❌ No data
Why?
- Prometheus does not collect OS metrics
- Prometheus is not an agent
- It only pulls what is exposed
👉 DevOps conclusion:
Prometheus is a collector, not a sensor.
LAB PART 2 — What Node Exporter Adds
Now node_exporter is installed and running on the target machine.
Step 4 — Confirm node exporter is scraped
up{job="node"}
Expected result:
up{instance="172.31.x.x:9100", job="node"} = 1
DevOps meaning:
- Prometheus can reach node_exporter
- Metrics are available
- Monitoring is alive
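You can also curl the exporter yourself to see the plain-text format Prometheus pulls on every scrape:
curl -s http://<NODE_IP>:9100/metrics | grep "^node_cpu_seconds_total" | head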
LAB PART 3 — CPU Metrics (Most Common Incident)
Step 5 — Raw CPU metric
node_cpu_seconds_total
What students see:
- Multiple time series
- Labels: cpu="0", mode="idle" | user | system | iowait
DevOps explanation:
- Linux CPU time is cumulative
- Metrics grow forever
- We must use rate() to make sense of it
Step 6 — CPU usage percentage (REAL DEVOPS QUERY)
100 - (
avg by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
)
What this shows:
- CPU usage %
- Per server
DevOps interpretation:
- 0–30% → normal
- 50–70% → watch
- > 80% → alert
- > 95% → incident
👉 Why DevOps needs this:
High CPU causes:
- Slow apps
- Timeouts
- Failed deployments
LAB PART 4 — Memory Metrics (Silent Killers)
Step 7 — Total memory
node_memory_MemTotal_bytes
Interpretation:
- Physical RAM installed
- Does NOT change
Step 8 — Available memory
node_memory_MemAvailable_bytes
DevOps meaning:
- How much memory apps can still use
- Much better than “free memory”
Step 9 — Memory usage percentage
(
1 - (
node_memory_MemAvailable_bytes
/
node_memory_MemTotal_bytes
)
) * 100
DevOps interpretation:
- Memory > 80% → danger
- Memory leaks show slow increase
- OOM kills happen suddenly
👉 Why DevOps needs this:
Memory issues crash apps without warning if not monitored.
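You can cross-check the PromQL numbers against the host itself; the "available" column (in MiB) should roughly match node_memory_MemAvailable_bytes / 1024 / 1024:
free -m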
LAB PART 5 — Disk Metrics (Most Dangerous)
Step 10 — Disk usage %
100 - (
node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}
) * 100
DevOps interpretation:
- Disk full = app crashes
- Databases stop
- Logs can’t write
- OS can become unstable
👉 This alert is mandatory in production
LAB PART 6 — Network Metrics (Hidden Bottlenecks)
Step 11 — Network receive rate
rate(node_network_receive_bytes_total[5m])
Step 12 — Network transmit rate
rate(node_network_transmit_bytes_total[5m])
DevOps interpretation:
- Sudden spikes → traffic surge or attack
- Drops → network issues
Used in:
- DDoS detection
- Load testing validation
LAB PART 7 — Proving Why Node Exporter Is REQUIRED
Question to students:
“Why can’t Prometheus do this alone?”
Answer:
Prometheus:
- ❌ Does not know CPU
- ❌ Does not know memory
- ❌ Does not know disk
- ❌ Does not know network
- ❌ Does not run on every server
Node Exporter:
- ✅ Reads /proc and /sys
- ✅ Exposes OS internals safely
- ✅ Lightweight
- ✅ Industry standard
👉 DevOps conclusion:
Prometheus without exporters is blind.
LAB PART 8 — Real Incident Simulation
Step 13 — Generate CPU load
stress --cpu 2 --timeout 120
Step 14 — Watch PromQL graph change
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
DevOps observation:
- CPU spikes
- Alert transitions to Firing
- Email notification sent
WHAT DEVOPS MUST PAY ATTENTION TO
1️⃣ Always monitor exporters themselves
up{job="node"} == 0
Because:
If exporter dies, monitoring dies silently.
2️⃣ Use time windows correctly
- rate(...[1m]) → fast reaction
- rate(...[5m]) → stable alerts
3️⃣ Avoid raw counters
Bad:
node_cpu_seconds_total
Good:
rate(node_cpu_seconds_total[5m])
4️⃣ Labels matter
- instance → which server
- job → which role
- mountpoint → which disk
“Prometheus collects metrics,
node_exporter exposes system data,
PromQL turns numbers into insight,
alerts turn insight into action.”
🧪 LAB: Monitor KIND Kubernetes using EC2 Prometheus (Central Monitoring)
🧱 Final Architecture (Explain First)
EC2 (Prometheus + Grafana)
|
| scrape metrics
v
KIND cluster
├─ control-plane node (node-exporter pod)
└─ worker node (node-exporter pod)
👉 Prometheus stays on EC2
👉 KIND is just another “target”
PHASE 0 — Prerequisites
On your laptop:
- KIND cluster running
- kubectl configured
kubectl get nodes
On EC2:
- Prometheus already running
- Prometheus UI accessible
- You know EC2 private IP of Prometheus
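A quick prerequisite check on both sides (a sketch):
# On the laptop: the cluster exists and nodes are Ready
kind get clusters
kubectl get nodes -o wide
# On EC2: Prometheus answers locally
curl -s http://localhost:9090/-/healthy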
PHASE 1 — Deploy Node Exporter in KIND (DaemonSet)
Why DaemonSet
“If something must run on every node → DaemonSet”
Node exporter:
- Needs host access
- One per node
- Not per pod
STEP 1 — Create monitoring namespace
kubectl create namespace monitoring
STEP 2 — Node Exporter DaemonSet for KIND
Create file: node-exporter-kind.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--path.rootfs=/host/root"
          ports:
            - containerPort: 9100
              hostPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
Apply it:
kubectl apply -f node-exporter-kind.yaml
STEP 3 — Verify node exporter pods
kubectl get pods -n monitoring -o wide
Expected:
- One pod per KIND node
- Each pod on a different node
👉 DevOps rule:
If a node has no exporter → you are blind on that node.
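A quick coverage check is to compare node count with exporter pod count; the two numbers should match:
kubectl get nodes --no-headers | wc -l
kubectl get pods -n monitoring -l app=node-exporter --no-headers | wc -l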
PHASE 2 — Expose Node Exporter to EC2 Prometheus
Key Concept (VERY IMPORTANT)
KIND runs locally on your laptop, inside Docker.
EC2 Prometheus cannot reach your laptop's 127.0.0.1 or the KIND Docker network directly.
So we expose node exporter via a NodePort and make that port reachable from EC2.
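Note: with KIND, a NodePort is reachable on the laptop's localhost only if the cluster was created with an extraPortMappings entry for that port. If yours was not, a minimal cluster config would look roughly like this (an assumption about your setup; recreating the cluster is required to apply it):
cat <<'EOF' > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30910
        hostPort: 30910
  - role: worker
EOF
kind create cluster --config kind-config.yaml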
STEP 4 — Create NodePort Service
Create file: node-exporter-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
      targetPort: 9100
      nodePort: 30910
Apply:
kubectl apply -f node-exporter-svc.yaml
STEP 5 — Verify NodePort
kubectl get svc -n monitoring
You should see:
node-exporter NodePort 9100:30910/TCP
STEP 6 — Test metrics locally (sanity check)
From your laptop:
curl http://localhost:30910/metrics | head
You MUST see:
node_cpu_seconds_total
node_memory_MemAvailable_bytes
👉 If this fails → Prometheus will fail too.
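If localhost does not answer (for example, no extraPortMappings), a fallback is to test against the KIND node container directly, since node_exporter runs with hostNetwork on port 9100 (a sketch; assumes the default cluster name "kind"):
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' kind-control-plane
curl -s http://<NODE_CONTAINER_IP>:9100/metrics | head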
PHASE 3 — Configure EC2 Prometheus to Scrape KIND
STEP 7 — Edit Prometheus config on EC2
SSH into EC2 Prometheus server:
ssh -i keypair.pem ubuntu@<EC2_IP>
Edit config:
sudo nano /etc/prometheus/prometheus.yml
Add this new job:
- job_name: "kind-node-exporter"
  static_configs:
    - targets:
        - "<YOUR_LAPTOP_PUBLIC_IP>:30910"
⚠️ Replace <YOUR_LAPTOP_PUBLIC_IP>
(Use curl ifconfig.me on laptop)
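Also confirm from EC2 that the laptop's port is actually reachable before touching Prometheus; home routers/NAT usually need a port-forward rule for 30910 (a quick check):
nc -vz <YOUR_LAPTOP_PUBLIC_IP> 30910
curl -s http://<YOUR_LAPTOP_PUBLIC_IP>:30910/metrics | head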
STEP 8 — Reload Prometheus
sudo systemctl restart prometheus
Or if using reload endpoint:
curl -X POST http://localhost:9090/-/reload
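Either way, it is worth validating the file first with promtool, which ships with Prometheus:
promtool check config /etc/prometheus/prometheus.yml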
STEP 9 — Verify targets in Prometheus UI
Open:
Status → Targets
Expected:
kind-node-exporter UP
👉 This is the big success moment.
PHASE 4 — PromQL Labs (KIND Nodes)
Now PromQL works unchanged.
LAB 1 — Is KIND node visible?
up{job="kind-node-exporter"}
Interpretation:
- 1 → node reachable
- 0 → node not scraped (you are blind on it)
LAB 2 — CPU usage of KIND node
100 - (
avg by (instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
)
Teach students:
- This is host CPU
- Includes kubelet, containers, OS
LAB 3 — Memory usage
(
1 - (
node_memory_MemAvailable_bytes
/
node_memory_MemTotal_bytes
)
) * 100
Explain:
- High memory → pod OOMKills
- Kubernetes hides this unless you look
LAB 4 — Disk usage (CRITICAL)
100 - (
node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}
) * 100
Explain:
- Disk full → kubelet reports DiskPressure and starts evicting pods
- Pods fail silently
PHASE 5 — Create Alerts
Node exporter down (MANDATORY)
up{job="kind-node-exporter"} == 0
High CPU
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
Disk almost full
node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
Alerts go to same email.
PHASE 6 — Incident Simulation
Scenario
Pods restarting randomly.
Step 1 — Kubernetes view
kubectl get pods
Looks normal.
Step 2 — Node metrics (Prometheus)
CPU or disk is high.
Step 3 — DevOps action
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets
👉 Node exporter revealed the real cause.
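Once the underlying cause is fixed (disk freed, runaway load stopped), return the node to service:
kubectl uncordon <node>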