Aisalkyn Aidarova

Posted on May 20

# Production-Level SRE Lab — Prometheus + Grafana Incident Simulation

## Scenario

You are the senior SRE on call.

Production users report:

```text
Application is slow and timing out
```

Your mission:

* detect issue
* investigate metrics
* identify bottleneck
* create dashboards
* analyze telemetry
* verify recovery

This lab simulates REAL production troubleshooting.

---

# Architecture



```text
Linux EC2
   ↓
Node Exporter
   ↓
Prometheus
   ↓
Grafana
```

# Skills Practiced

| Skill | Production Relevance |
| --- | --- |
| PromQL | real incident debugging |
| Grafana dashboards | observability |
| Linux telemetry | bottleneck analysis |
| CPU saturation | scaling decisions |
| memory pressure | OOM prevention |
| disk saturation | outage prevention |
| alerting | incident response |
| exporter failures | monitoring reliability |

# Phase 1 — Validate Monitoring Stack

# Step 1 — Check Targets

Open:

```text
http://YOUR_IP:9090/targets
```

Expected:

* prometheus = UP
* node_exporter = UP

Production importance:

* monitoring itself must be healthy
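
The same check can be scripted against the Prometheus HTTP API instead of the web UI. The sketch below assumes Prometheus listens on `localhost:9090` and that `curl` is installed; the helper name `count_down_targets` is hypothetical. It reads the JSON body of `/api/v1/targets` on stdin and counts the targets whose `health` field is `down`.

```bash
# count_down_targets: hypothetical helper, not part of Prometheus.
# Reads the JSON body of /api/v1/targets on stdin and prints how many
# targets report "health":"down".
count_down_targets() {
  awk '
    # gsub returns the number of matches it replaced; "&" keeps the text unchanged.
    { n += gsub(/"health"[ \t]*:[ \t]*"down"/, "&") }
    END { print n + 0 }
  '
}

# Example against a live server (assumed address):
#   curl -s http://localhost:9090/api/v1/targets | count_down_targets
```

A result of `0` matches the "everything UP" expectation above; anything else means at least one scrape target needs attention before you trust the dashboards.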

---

# Step 2 — Verify Exporter Metrics

Run:



```bash
curl localhost:9100/metrics | grep node_cpu
```

Production importance:

* validates exporter telemetry
* validates scrape endpoint
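
If you want a yes/no answer rather than a wall of text, one option is to count the sample lines for a given metric. This is a minimal sketch; `count_metric_samples` is a hypothetical helper name, and it assumes the exporter serves the standard Prometheus text exposition format on port 9100.

```bash
# count_metric_samples: hypothetical helper for this lab.
# Reads Prometheus text exposition format on stdin and prints the number of
# sample lines for the metric named in $1, ignoring "# HELP" / "# TYPE" lines.
count_metric_samples() {
  local metric="$1"
  awk -v m="$metric" '
    /^#/ { next }                  # skip HELP/TYPE comment lines
    {
      name = $0
      sub(/[{ ].*/, "", name)      # drop labels and value, keep the metric name
      if (name == m) n++
    }
    END { print n + 0 }
  '
}

# Usage on the lab host:
#   curl -s localhost:9100/metrics | count_metric_samples node_cpu_seconds_total
```

A count of zero would suggest the CPU collector is disabled or the endpoint is not the node exporter you expect.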

# Phase 2 — Build Production Dashboard Mentality

Open Grafana:

```text
http://YOUR_IP:3000
```

Open the dashboard:

```text
Node Exporter Full
```

Focus ONLY on:

| Panel | Why Important |
| --- | --- |
| CPU Busy | saturation |
| Sys Load | scheduler contention |
| CPU Pressure | waiting tasks |
| RAM Used | memory exhaustion |
| Swap Used | memory pressure |
| Root FS Used | disk exhaustion |
| IOwait | storage bottlenecks |
| Network Traffic | traffic spikes |

# Phase 3 — Production CPU Incident

## Scenario

Users report:

```text
API latency extremely high
```

---

# Step 1 — Simulate CPU Saturation

Run:



```bash
stress --cpu 2 --timeout 300
```

This fully loads both vCPUs of a 2-vCPU instance for 300 seconds.

# Step 2 — Observe Grafana

Watch:

* CPU Busy
* Sys Load
* CPU Pressure

Expected:

* CPU Busy near 100%
* load increases
* pressure rises

# Step 3 — Investigate with PromQL

Open Prometheus:

```text
http://YOUR_IP:9090
```

Run:

```text
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
```

Observe:

* CPU approaching 100%

Production importance:

* validates saturation
* confirms resource bottleneck
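
The arithmetic behind that query is "100 minus the idle share". A rough host-side equivalent, sketched below, takes two snapshots of the aggregate `cpu` line in `/proc/stat` and computes `100 * (1 - idle_delta / total_delta)`. The helper name `cpu_busy_pct` is hypothetical, and the result only approximates the PromQL value because the sampling windows differ.

```bash
# cpu_busy_pct: hypothetical helper. Takes two aggregate "cpu ..." lines
# from /proc/stat (an earlier and a later snapshot) and prints the busy
# percentage over that interval as 100 * (1 - idle_delta / total_delta).
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user..steal; guest time is already inside user
      idle = $5                               # fourth counter on the line is idle
      if (NR == 1) { t0 = total; i0 = idle }
      else         { dt = total - t0; di = idle - i0 }
    }
    END {
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (1 - di / dt)
    }
  '
}

# Usage on a live host: take two snapshots a few seconds apart.
#   a=$(head -n1 /proc/stat); sleep 5; b=$(head -n1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

As in the query, only `idle` is treated as not busy, so time spent in `iowait` or `steal` counts toward the busy percentage.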

# Step 4 — Linux Investigation

Run:

```bash
top
```

Questions:

* Which process consumes CPU?
* Is load average high?
* How much idle remains?

Exit:

```text
q
```

# Step 5 — Advanced Analysis

Run:

```bash
uptime
```

Example:

```text
load average: 4.5, 3.9, 2.1
```

Your instance:

* 2 vCPUs

Interpretation:

* load > CPU count
* tasks waiting
* system saturated

Senior SRE concept:

```text
On Linux, load average counts runnable tasks (running or waiting for a CPU)
plus tasks in uninterruptible sleep (usually waiting on I/O)
```
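
The "load versus CPU count" comparison is easy to script. The sketch below is one way to do it; `load_status` is a hypothetical helper name, and the simple "load greater than CPU count" rule is the same rule of thumb used above, not a hard limit.

```bash
# load_status: hypothetical helper. Compares a load average ($1) with a
# CPU count ($2) and prints "saturated" when the load exceeds the CPU
# count, otherwise "ok".
load_status() {
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    if (load + 0 > cpus + 0) print "saturated"; else print "ok"
  }'
}

# Usage on the lab host (1-minute load average against the vCPU count):
#   load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the example values above, `load_status 4.5 2` prints `saturated`.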




---

# Phase 4 — Memory Leak Incident

## Scenario

Application becomes slow after deployment.

Possible memory leak.

---

# Step 1 — Simulate Memory Pressure

Run:



```bash
stress --vm 1 --vm-bytes 300M --timeout 300
```

# Step 2 — Observe Grafana

Watch:

* RAM Used
* Memory Pressure
* Swap Used
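
To cross-check the RAM panel from the shell, you can compute how much memory is still available as a share of the total. This is a minimal sketch: `mem_available_pct` is a hypothetical helper, and it assumes a kernel whose `/proc/meminfo` exposes `MemAvailable`. The node exporter publishes the same two values as `node_memory_MemAvailable_bytes` and `node_memory_MemTotal_bytes`.

```bash
# mem_available_pct: hypothetical helper. Reads /proc/meminfo-style text
# on stdin and prints MemAvailable as a percentage of MemTotal.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "unknown"; exit 1 }
      printf "%.1f\n", 100 * avail / total
    }
  '
}

# Usage on the lab host:
#   mem_available_pct < /proc/meminfo
```

Run it before and during the `stress --vm` test; the percentage should drop by roughly the amount of memory the stress worker allocates.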

# Step 3 — Check OOM Events

Run:

```bash
sudo dmesg | grep -i oom
```

Production importance:

* detects memory exhaustion
* confirms kernel intervention
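
When the OOM killer has run, the next question is usually which process it killed. The sketch below extracts the PID and process name from kernel log lines. It assumes the wording `Out of memory: Killed process <pid> (<name>)`, which recent kernels use; older kernels phrase this differently, so treat the pattern as an assumption. `oom_victims` is a hypothetical helper name.

```bash
# oom_victims: hypothetical helper. Reads kernel log text on stdin and
# prints "<pid> <name>" for each line of the assumed form
# "Out of memory: Killed process <pid> (<name>) ...".
oom_victims() {
  sed -n 's/.*Out of memory: Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on the lab host:
#   sudo dmesg | oom_victims
```

An empty result simply means no matching OOM kill was logged, which is expected if the instance has enough free memory to absorb the 300M allocation.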

---

# Step 4 — Investigate Memory Consumers

Run:



```bash
htop
```

Observe:

* memory-heavy processes
* swap activity
* CPU behavior

# Phase 5 — Disk Saturation Incident

## Scenario

Monitoring suddenly stops storing data.

# Step 1 — Simulate Disk Consumption

Run:

```bash
fallocate -l 2G incidentfile
```

---

# Step 2 — Observe Grafana

Watch:

* Root FS Used
* filesystem graphs

---

# Step 3 — Investigate Disk

Run:



```bash
df -h
```

Then:

```bash
sudo du -sh /var/lib/prometheus
```

Production importance:

* Prometheus itself consumes disk
* metrics retention matters
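
Reading `df` by eye works in a lab; in an incident it helps to filter for filesystems that are actually close to full. A minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces; `fs_over_threshold` and its default of 80% are hypothetical choices for illustration.

```bash
# fs_over_threshold: hypothetical helper. Reads `df -P` output on stdin and
# prints "<mount> <used%>" for every filesystem at or above the threshold
# given in $1 (default 80).
fs_over_threshold() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    NR == 1 { next }               # skip the header row
    {
      pct = $5
      sub(/%/, "", pct)            # "92%" -> "92"
      if (pct + 0 >= limit + 0) print $6, pct "%"
    }
  '
}

# Usage on the lab host:
#   df -P | fs_over_threshold 80
```

After the `fallocate` step, the root filesystem should show up in this list if the 2G file pushes it past the threshold.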

---

# Step 4 — Cleanup

Run:



```bash
rm -f incidentfile
```

Observe recovery.

# Phase 6 — Monitoring Failure Incident

## Scenario

Dashboards suddenly show:

```text
No Data
```

---

# Step 1 — Simulate Exporter Failure

Run:



```bash
sudo systemctl stop node_exporter
```

# Step 2 — Observe

Prometheus Targets:

```text
DOWN
```

Grafana:

* panels fail
* gaps appear

Production importance:

* monitoring outages
* exporter failures
* telemetry gaps

---

# Step 3 — Recovery

Run:



```bash
sudo systemctl start node_exporter
```

Observe:

* recovery
* targets UP
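
Recovery is not instant: the service has to come up, and Prometheus only marks the target UP again on its next scrape. One way to verify the first half from the shell is a small retry loop. This is a sketch; `wait_until` is a hypothetical helper, and the one-second poll interval is an arbitrary choice.

```bash
# wait_until: hypothetical helper. Retries a command once per second until
# it succeeds or the timeout in seconds ($1) has elapsed.
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1                     # gave up: command never succeeded
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage on the lab host:
#   wait_until 30 systemctl is-active --quiet node_exporter && echo "exporter is back"
```

Once the service is active, refresh the Targets page to confirm the scrape itself has recovered.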

# Phase 7 — Production Alerting

# Step 1 — Create CPU Alert

Create the rules file:

```bash
sudo nano /etc/prometheus/alert.rules.yml
```

Add:



```yaml
groups:
- name: sre-alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 70
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: High CPU Usage
```

# Step 2 — Add Rule to Prometheus

Edit:

```bash
sudo nano /etc/prometheus/prometheus.yml
```

Add:



```yaml
rule_files:
  - "alert.rules.yml"
```

# Step 3 — Restart Prometheus

```bash
sudo systemctl restart prometheus
```
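
A restart with a broken rule file or config can leave Prometheus down, which is the last thing you want mid-incident. A hedged sketch of a pre-restart check using `promtool check rules` and `promtool check config` follows. The wrapper name `validate_prometheus` and the `PROMTOOL` override variable are assumptions of this example, and the file paths are the ones used in this lab.

```bash
# validate_prometheus: hypothetical wrapper around promtool.
# $1 = rules file, $2 = main config file. The PROMTOOL variable is an
# assumption of this sketch that lets the binary path be overridden.
validate_prometheus() {
  local promtool_bin="${PROMTOOL:-promtool}"
  if ! command -v "$promtool_bin" >/dev/null 2>&1; then
    echo "promtool not found: $promtool_bin" >&2
    return 127
  fi
  # Both checks must pass; the first failure stops the chain.
  "$promtool_bin" check rules "$1" && "$promtool_bin" check config "$2"
}

# Usage on the lab host: only restart when both checks pass.
#   validate_prometheus /etc/prometheus/alert.rules.yml /etc/prometheus/prometheus.yml \
#     && sudo systemctl restart prometheus
```

If either check fails, fix the reported error first and leave the running Prometheus untouched.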




---

# Step 4 — Trigger Alert

Run CPU stress again.

Observe:

* alert fires

Production importance:

* incident detection
* automated monitoring

---

# Phase 8 — Senior SRE RCA

Now perform Root Cause Analysis.

Questions:

1. What caused CPU saturation?
2. Did load average confirm contention?
3. Did pressure metrics increase?
4. Was swap used?
5. Was disk healthy?
6. Did Prometheus capture telemetry correctly?
7. Did Grafana visualize incident correctly?
8. Which process caused issue?
9. How would you scale production?
10. Would horizontal or vertical scaling help?

---

# Phase 9 — Real Production Thinking

## What Senior SRE Engineers Actually Analyze

Not just:



```text
CPU %
```

But:

* load
* pressure
* iowait
* steal time
* saturation
* latency
* telemetry gaps
* retention
* cardinality
* alert fatigue

# Phase 10 — Final Architecture Understanding

```text
Linux Kernel
↓
Node Exporter
↓
Prometheus Scraping
↓
Prometheus TSDB
↓
Grafana Visualization
↓
Alerting
↓
Incident Response
↓
SRE Investigation
```

This flow, from kernel metrics through alerting to investigation, closely mirrors the observability pipeline that senior SRE teams operate in production.
