# Scenario

You are the Senior SRE on call.

Production users report:

```text
Application is slow and timing out
```

Your mission:

* detect the issue
* investigate metrics
* identify the bottleneck
* create dashboards
* analyze telemetry
* verify recovery

This lab simulates real production troubleshooting.

---
# Architecture
```text
Linux EC2
↓
Node Exporter
↓
Prometheus
↓
Grafana
```

# Skills Practiced

| Skill | Production Relevance |
|---|---|
| PromQL | real incident debugging |
| Grafana dashboards | observability |
| Linux telemetry | bottleneck analysis |
| CPU saturation | scaling decisions |
| memory pressure | OOM prevention |
| disk saturation | outage prevention |
| alerting | incident response |
| exporter failures | monitoring reliability |
# Phase 1 — Validate Monitoring Stack

# Step 1 — Check Targets

Open:

```text
http://YOUR_IP:9090/targets
```

Expected:

* prometheus = UP
* node_exporter = UP

Production importance:

* monitoring itself must be healthy

---
# Step 2 — Verify Exporter Metrics
Run:

```bash
curl localhost:9100/metrics | grep node_cpu
```

Production importance:

- validates exporter telemetry
- validates scrape endpoint
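
For scripted checks, the same validation can be reduced to a sample count. The following is a minimal sketch, not part of Node Exporter: the helper name `count_metric_samples` is hypothetical, and the usage line assumes the exporter listens on `localhost:9100` as in this lab.

```bash
#!/usr/bin/env bash
# Hypothetical helper: counts sample lines for one metric in Prometheus
# text exposition format read from stdin.
count_metric_samples() {
  local metric="$1"
  # Sample lines start with the metric name followed by "{" or a space.
  # "# HELP" and "# TYPE" lines are excluded by the ^ anchor.
  # grep -c prints 0 and exits non-zero on no match, hence "|| true".
  grep -cE "^${metric}(\{| )" || true
}

# Usage (assumes the lab exporter on localhost:9100):
#   curl -s localhost:9100/metrics | count_metric_samples node_cpu_seconds_total
# A count of 0 means the endpoint answered but does not expose that metric.
```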
# Phase 2 — Build Production Dashboard Mentality

Open Grafana:

```text
http://YOUR_IP:3000
```

Open the dashboard:

```text
Node Exporter Full
```

Focus ONLY on:

| Panel | Why Important |
|---|---|
| CPU Busy | saturation |
| Sys Load | scheduler contention |
| CPU Pressure | waiting tasks |
| RAM Used | memory exhaustion |
| Swap Used | memory pressure |
| Root FS Used | disk exhaustion |
| IOwait | storage bottlenecks |
| Network Traffic | traffic spikes |

# Phase 3 — Production CPU Incident

## Scenario

Users report:

```text
API latency extremely high
```

---
# Step 1 — Simulate CPU Saturation
Run:

```bash
stress --cpu 2 --timeout 300
```

This runs two CPU-bound workers for 300 seconds, enough to fully load both of your 2 vCPUs.

# Step 2 — Observe Grafana

Watch:

- CPU Busy
- Sys Load
- CPU Pressure

Expected:

- CPU Busy near 100%
- load increases
- pressure rises
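
The pressure signal can also be read straight from the kernel, which helps when Grafana itself is unavailable. This sketch assumes a kernel with pressure stall information (PSI) enabled, which exposes `/proc/pressure/cpu`; the helper name `psi_avg10` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the 10-second average from the "some" line of
# a PSI file read from stdin. The value is the percentage of time in which
# at least one task was stalled waiting for the resource.
psi_avg10() {
  awk '$1 == "some" {
         for (i = 2; i <= NF; i++) {
           if ($i ~ /^avg10=/) {
             sub(/^avg10=/, "", $i)
             print $i
           }
         }
       }'
}

# Usage (assumes PSI is available on the host):
#   psi_avg10 < /proc/pressure/cpu
# Run it before and during the stress test and compare the two values.
```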
# Step 3 — Investigate with PromQL

Open Prometheus:

```text
http://YOUR_IP:9090
```

Run:

```text
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
```

Observe:

- CPU approaching 100%

Production importance:

- validates saturation
- confirms resource bottleneck
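
The query takes the per-second idle rate averaged across CPUs (a value between 0 and 1), turns it into a percentage, and subtracts it from 100. The arithmetic can be sketched as below; the helper name `cpu_busy_pct` is hypothetical, and the commented `curl` call assumes Prometheus on `localhost:9090`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: converts an average idle rate (idle CPU-seconds per
# second, between 0 and 1) into the busy percentage the query returns.
cpu_busy_pct() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# Example: an idle rate of 0.05 means the CPUs were idle 5% of the time,
# so the query reports 95% busy:
#   cpu_busy_pct 0.05

# The same query can be sent to the Prometheus HTTP API instead of the UI
# (assumes Prometheus on localhost:9090, as in this lab):
#   curl -s http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)'
```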
# Step 4 — Linux Investigation

Run:

```bash
top
```

Questions:

* Which process consumes the most CPU?
* Is the load average high?
* How much idle CPU remains?

Exit:

```text
q
```

# Step 5 — Advanced Analysis

Run:

```bash
uptime
```

Example:

```text
load average: 4.5, 3.9, 2.1
```

Your instance:

- 2 vCPUs

Interpretation:

- load > CPU count
- tasks waiting
- system saturated

Senior SRE concept:

```text
load average counts runnable tasks plus tasks in uninterruptible sleep (usually waiting on I/O)
```
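
Comparing load to CPU count is easier as a single ratio: anything above 1.00 means more tasks want to run than there are CPUs. A minimal sketch, with the hypothetical helper name `load_per_cpu`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: divides a load average by the number of CPUs.
load_per_cpu() {
  local load="$1" cpus="$2"
  awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%.2f\n", l / c }'
}

# With the example above (1-minute load 4.5 on 2 vCPUs):
#   load_per_cpu 4.5 2      # prints 2.25
# On a live Linux host:
#   load_per_cpu "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```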
---
# Phase 4 — Memory Leak Incident
## Scenario
The application becomes slow after a deployment. A memory leak is suspected.

---
# Step 1 — Simulate Memory Pressure
Run:

```bash
stress --vm 1 --vm-bytes 300M --timeout 300
```

# Step 2 — Observe Grafana

Watch:

- RAM Used
- Memory Pressure
- Swap Used
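
As a rough cross-check of the memory panels, the kernel's own estimate of available memory can be read from `/proc/meminfo`. This is a sketch only: the helper name `mem_available_pct` is hypothetical, and the Grafana panels may compute their values differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints MemAvailable as a percentage of MemTotal,
# reading /proc/meminfo-style input from stdin.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Usage on a Linux host:
#   mem_available_pct < /proc/meminfo
# The value should drop while the stress command runs and recover afterwards.
```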
# Step 3 — Check OOM Events

Run:

```bash
sudo dmesg | grep -i oom
```

Production importance:
* detects memory exhaustion
* confirms kernel intervention
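
A plain `grep -i oom` also matches unrelated lines, so it can help to filter for the kernel's kill message specifically. The sketch below assumes the usual kernel wording ("Out of memory: Killed process ..." or, on older kernels, "Kill process"); the helper name `oom_kills` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints only the kernel log lines that record a
# process being killed by the OOM killer. Input is read from stdin.
oom_kills() {
  # Matches both "Kill process" (older kernels) and "Killed process".
  grep -iE 'out of memory: kill(ed)? process' || true
}

# Usage:
#   sudo dmesg | oom_kills
#   sudo journalctl -k | oom_kills    # on systemd hosts
# No output means the kernel log holds no OOM kill records.
```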
---
# Step 4 — Investigate Memory Consumers
Run:

```bash
htop
```

Observe:
- memory-heavy processes
- swap activity
- CPU behavior
# Phase 5 — Disk Saturation Incident

## Scenario

Monitoring suddenly stops storing data.

# Step 1 — Simulate Disk Consumption

Run:

```bash
fallocate -l 2G incidentfile
```
---
# Step 2 — Observe Grafana
Watch:
* Root FS Used
* filesystem graphs
---
# Step 3 — Investigate Disk
Run:

```bash
df -h
```

Then:

```bash
sudo du -sh /var/lib/prometheus
```

Production importance:
* Prometheus itself consumes disk
* metrics retention matters
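
To track disk usage from a script rather than by eye, the usage percentage can be pulled out of `df`. A minimal sketch, assuming POSIX `df -P` output; the helper name `fs_used_pct` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `df -P <mountpoint>` output from stdin and
# prints the used-capacity column without the percent sign.
fs_used_pct() {
  # Line 1 is the header; line 2 is the filesystem row.
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage:
#   df -P / | fs_used_pct
# Compare the value before creating incidentfile and after removing it.
```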
---
# Step 4 — Cleanup
Run:

```bash
rm -f incidentfile
```

Observe recovery: Root FS Used should drop back to its earlier level.

# Phase 6 — Monitoring Failure Incident

## Scenario

Dashboards suddenly show:

```text
No Data
```
---
# Step 1 — Simulate Exporter Failure
Run:

```bash
sudo systemctl stop node_exporter
```

# Step 2 — Observe

Prometheus Targets:

```text
DOWN
```

Grafana:

* panels fail
* gaps appear
Production importance:
* monitoring outages
* exporter failures
* telemetry gaps
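
The exporter can also be probed directly from the host, which separates "exporter is down" from "Prometheus cannot reach it". This is a sketch under the lab's assumptions (exporter on `localhost:9100`); the helper name `exporter_state` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints UP if the metrics endpoint answers with a
# successful HTTP status within 2 seconds, DOWN otherwise.
exporter_state() {
  local url="${1:-http://localhost:9100/metrics}"
  if curl -sf --max-time 2 -o /dev/null "$url"; then
    echo UP
  else
    echo DOWN
  fi
}

# Usage:
#   exporter_state                                   # lab default endpoint
#   exporter_state http://localhost:9100/metrics
# Inside Prometheus, the built-in `up` metric carries the same signal:
# it is 0 for a target whose last scrape failed.
```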
---
# Step 3 — Recovery
Run:

```bash
sudo systemctl start node_exporter
```

Observe:
- recovery
- targets UP
# Phase 7 — Production Alerting

# Step 1 — Create CPU Alert

Create:

```bash
sudo nano /etc/prometheus/alert.rules.yml
```

Add:

```yaml
groups:
  - name: sre-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 70
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: High CPU Usage
```

# Step 2 — Add Rule to Prometheus

Edit:

```bash
sudo nano /etc/prometheus/prometheus.yml
```

Add:

```yaml
rule_files:
  - "alert.rules.yml"
```
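
A YAML indentation mistake in either file will stop Prometheus from starting, so it can be worth validating both before the restart. The sketch below assumes `promtool` is installed alongside Prometheus and uses the file paths from this lab; the wrapper name `validate_rules` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: checks that the rules file exists, then lets
# promtool validate its syntax and expressions.
validate_rules() {
  local rules="${1:-/etc/prometheus/alert.rules.yml}"
  if [ ! -f "$rules" ]; then
    echo "rules file not found: $rules" >&2
    return 2
  fi
  promtool check rules "$rules"
}

# Usage (assumes promtool is on PATH):
#   validate_rules
#   promtool check config /etc/prometheus/prometheus.yml
# The second command validates the main config, including the files
# referenced under rule_files.
```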
# Step 3 — Restart Prometheus

```bash
sudo systemctl restart prometheus
```

---
# Step 4 — Trigger Alert
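Run the CPU stress command from Phase 3 again. Because the rule sets `for: 1m`, the alert is pending first and fires only after CPU usage has stayed above 70% for one minute.
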
Observe:
* alert fires
Production importance:
* incident detection
* automated monitoring
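
Alert state can be checked without the UI through the Prometheus alerts endpoint. This is a rough sketch that assumes Prometheus on `localhost:9090` and the compact JSON it normally returns; a JSON parser such as `jq` would be more robust. The helper name `count_firing` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: counts alerts in the "firing" state in an
# /api/v1/alerts response read from stdin.
count_firing() {
  # grep -o emits one line per match; tr strips padding some wc builds add.
  grep -o '"state":"firing"' | wc -l | tr -d ' '
}

# Usage (assumes Prometheus on localhost:9090):
#   curl -s http://localhost:9090/api/v1/alerts | count_firing
# Expect 0 while the alert is inactive or pending, and 1 once it fires.
```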
---
# Phase 8 — Senior SRE RCA
Now perform Root Cause Analysis.
Questions:
1. What caused CPU saturation?
2. Did load average confirm contention?
3. Did pressure metrics increase?
4. Was swap used?
5. Was disk healthy?
6. Did Prometheus capture telemetry correctly?
7. Did Grafana visualize the incident correctly?
8. Which process caused the issue?
9. How would you scale production?
10. Would horizontal or vertical scaling help?
---
# Phase 9 — Real Production Thinking
## What Senior SRE Engineers Actually Analyze
Not just:

```text
CPU %
```

But:

- load
- pressure
- iowait
- steal time
- saturation
- latency
- telemetry gaps
- retention
- cardinality
- alert fatigue
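
Two of these signals, iowait and steal time, are easy to miss because a plain CPU percentage hides them. They can be read from the aggregate `cpu` line in `/proc/stat`. The sketch below reports averages since boot, not current values, and the helper name `cpu_wait_steal_pct` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the share of CPU time spent in iowait and
# steal since boot, from /proc/stat-style input on stdin.
cpu_wait_steal_pct() {
  # Fields on the "cpu" line: user nice system idle iowait irq softirq steal
  # (fields 2 to 9). Guest time is already included in user and nice.
  awk '$1 == "cpu" {
         total = 0
         for (i = 2; i <= 9; i++) total += $i
         if (total > 0)
           printf "iowait=%.1f%% steal=%.1f%%\n", $6 * 100 / total, $9 * 100 / total
       }'
}

# Usage on a Linux host:
#   cpu_wait_steal_pct < /proc/stat
# High steal on a cloud instance suggests contention from the hypervisor
# rather than from your own workload.
```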
# Phase 10 — Final Architecture Understanding

```text
Linux Kernel
↓
Node Exporter
↓
Prometheus Scraping
↓
Prometheus TSDB
↓
Grafana Visualization
↓
Alerting
↓
Incident Response
↓
SRE Investigation
```

This is the same pipeline, at a smaller scale, that Senior SRE teams rely on for production observability.