DEV Community

Aisalkyn Aidarova

Production-Level SRE Lab — Prometheus + Grafana Incident Simulation

## Scenario

You are the Senior SRE on call.

Production users report:

```text
Application is slow and timing out
```

Your mission:

* detect issue
* investigate metrics
* identify bottleneck
* create dashboards
* analyze telemetry
* verify recovery

This lab simulates REAL production troubleshooting.

---

# Architecture



```text
Linux EC2
   ↓
Node Exporter
   ↓
Prometheus
   ↓
Grafana
```

---

# Skills Practiced

| Skill | Production Relevance |
| --- | --- |
| PromQL | real incident debugging |
| Grafana dashboards | observability |
| Linux telemetry | bottleneck analysis |
| CPU saturation | scaling decisions |
| memory pressure | OOM prevention |
| disk saturation | outage prevention |
| alerting | incident response |
| exporter failures | monitoring reliability |

---

# Phase 1 — Validate Monitoring Stack

# Step 1 — Check Targets

Open:

```text
http://YOUR_IP:9090/targets
```

Expected:

* prometheus = UP
* node_exporter = UP

Production importance:

* the monitoring stack itself must be healthy before its data can be trusted
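
The same target check can be scripted against the Prometheus HTTP API, which helps when the UI is not reachable. The sketch below is a minimal example under stated assumptions: `unhealthy_targets` is a hypothetical helper name, Prometheus is assumed to listen on `localhost:9090`, and a plain `grep` over the JSON is assumed to be good enough for a lab (a real JSON parser would be more robust).

```bash
# Hypothetical helper: count scrape targets whose health is not "up".
# Reads a /api/v1/targets JSON response on stdin and prints the count.
unhealthy_targets() {
  grep -oE '"health" *: *"[a-z]+"' | grep -vc '"up"' || true
}

# Example (assumes Prometheus on localhost:9090):
#   curl -s http://localhost:9090/api/v1/targets | unhealthy_targets
```

A result of `0` corresponds to every target showing UP on the targets page.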

---

# Step 2 — Verify Exporter Metrics

Run:



```bash
curl -s localhost:9100/metrics | grep node_cpu
```

Production importance:

* validates exporter telemetry
* validates scrape endpoint
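
Beyond eyeballing the output, one quick sanity check is whether the exporter reports one idle series per CPU. The helper below is a sketch, not part of node_exporter: `exporter_cpu_count` is a hypothetical name, and it assumes the usual `node_cpu_seconds_total{cpu="N",mode="idle"}` line format.

```bash
# Hypothetical helper: count CPUs by counting idle-mode series
# in node_exporter's text exposition format (read from stdin).
exporter_cpu_count() {
  grep '^node_cpu_seconds_total{' | grep -c 'mode="idle"' || true
}

# Example (assumes node_exporter on localhost:9100); compare the result with `nproc`:
#   curl -s localhost:9100/metrics | exporter_cpu_count
```

If the number differs from `nproc`, the scrape endpoint is probably not exposing what you expect.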

---

# Phase 2 — Build Production Dashboard Mentality

Open Grafana:

```text
http://YOUR_IP:3000
```

Open the dashboard:

```text
Node Exporter Full
```

Focus ONLY on:

| Panel | Why Important |
| --- | --- |
| CPU Busy | saturation |
| Sys Load | scheduler contention |
| CPU Pressure | waiting tasks |
| RAM Used | memory exhaustion |
| Swap Used | memory pressure |
| Root FS Used | disk exhaustion |
| IOwait | storage bottlenecks |
| Network Traffic | traffic spikes |

---

# Phase 3 — Production CPU Incident

## Scenario

Users report:

```text
API latency extremely high
```

---

# Step 1 — Simulate CPU Saturation

Run:



```bash
stress --cpu 2 --timeout 300
```

This starts two CPU-bound workers for 300 seconds, which is enough to fully load an instance with 2 vCPUs.
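
The worker count only saturates the machine when it matches the number of CPUs. On an instance of a different size, a small wrapper can size the command from `nproc`. This is a sketch: `cpu_stress_cmd` is a hypothetical helper, and it only prints the command so you can review it before running it.

```bash
# Hypothetical helper: build a stress command sized to the CPU count.
# $1 = number of workers (defaults to the output of nproc)
# $2 = timeout in seconds (defaults to 300, as in the lab)
cpu_stress_cmd() {
  local workers="${1:-$(nproc)}"
  local timeout="${2:-300}"
  echo "stress --cpu ${workers} --timeout ${timeout}"
}

# Example: print the command for this host, then copy and run it.
#   cpu_stress_cmd
```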

---

# Step 2 — Observe Grafana

Watch:

* CPU Busy
* Sys Load
* CPU Pressure

Expected:

* CPU Busy near 100%
* load increases
* pressure rises

---

# Step 3 — Investigate with PromQL

Open Prometheus:

```text
http://YOUR_IP:9090
```

Run:

```text
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
```

Observe:

* CPU approaching 100%

Production importance:

* validates saturation
* confirms resource bottleneck
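
During an incident it is often faster to run the query from the terminal than from the browser. The sketch below uses the Prometheus `/api/v1/query` endpoint. `promql` and `sample_value` are hypothetical helper names, `PROM_URL` is an assumed default, and the `sed` extraction only handles a response that contains a single series.

```bash
PROM_URL="${PROM_URL:-http://localhost:9090}"   # assumption: local Prometheus

# Run an instant PromQL query through the HTTP API.
promql() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Pull the sample value out of a single-series instant-vector response (stdin).
sample_value() {
  sed -nE 's/.*"value":\[[0-9.]+,"([^"]+)"\].*/\1/p'
}

# Example:
#   promql '100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)' | sample_value
```

With several instances in the result, a JSON parser would be a better fit than `sed`.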

---

# Step 4 — Linux Investigation

Run:

```bash
top
```

Questions:

* Which process consumes the most CPU?
* Is the load average high?
* How much idle CPU remains?

Exit with:

```text
q
```

---

# Step 5 — Advanced Analysis

Run:

```bash
uptime
```

Example:

```text
load average: 4.5, 3.9, 2.1
```

Your instance:

* 2 vCPUs

Interpretation:

* the 1-minute load (4.5) is more than twice the CPU count (2)
* tasks are waiting for CPU time
* the system is saturated
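
The comparison of load against CPU count can be reduced to a single number. This is a minimal sketch: `load_per_cpu` is a hypothetical helper, and the live example assumes a Linux host with `/proc/loadavg` and `nproc`.

```bash
# Hypothetical helper: load average divided by CPU count.
# A result above 1.00 means that, on average, more tasks wanted to run than there were CPUs.
# $1 = load average, $2 = CPU count
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Example on a live Linux host:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

For the example above, `load_per_cpu 4.5 2` prints `2.25`, so each vCPU has roughly two tasks competing for it.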

Senior SRE concept:

```text
on Linux, load average counts runnable tasks plus tasks in uninterruptible (usually I/O) wait
```

---

# Phase 4 — Memory Leak Incident

## Scenario

The application becomes slow after a deployment.

A memory leak is suspected.

---

# Step 1 — Simulate Memory Pressure

Run:



```bash
stress --vm 1 --vm-bytes 300M --timeout 300
```

---

# Step 2 — Observe Grafana

Watch:

* RAM Used
* Memory Pressure
* Swap Used
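
The same picture is available on the host itself from `/proc/meminfo`. The sketch below computes the share of memory the kernel still considers available. `mem_available_pct` is a hypothetical helper, and it assumes a kernel that reports `MemAvailable`.

```bash
# Hypothetical helper: percentage of memory still available.
# Reads /proc/meminfo-formatted text on stdin.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Example on a live Linux host:
#   mem_available_pct < /proc/meminfo
```

Running it before and during the `stress --vm` test shows how much headroom the 300M allocation removes.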

---

# Step 3 — Check OOM Events

Run:

```bash
sudo dmesg | grep -i oom
```

Production importance:

* detects memory exhaustion
* confirms kernel intervention
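
When the OOM killer does fire, the useful detail is which process it chose. The sketch below extracts the PID and name from the kernel log. `oom_victims` is a hypothetical helper, and it assumes the common `Out of memory: Killed process <pid> (<name>)` message format, which can vary between kernel versions.

```bash
# Hypothetical helper: print "<pid> <name>" for each process the OOM killer ended.
# Reads kernel log lines on stdin.
oom_victims() {
  sed -nE 's/.*Out of memory: Kill(ed)? process ([0-9]+) \(([^)]+)\).*/\2 \3/p'
}

# Example on a live host:
#   sudo dmesg | oom_victims
```

No output means the kernel did not kill anything, which is expected if the instance has more free memory than the stress test allocates.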

---

# Step 4 — Investigate Memory Consumers

Run:



```bash
htop
```

Observe:

* memory-heavy processes
* swap activity
* CPU behavior

---

# Phase 5 — Disk Saturation Incident

## Scenario

Monitoring suddenly stops storing data.

---

# Step 1 — Simulate Disk Consumption

Run:

```bash
fallocate -l 2G incidentfile
```

---

# Step 2 — Observe Grafana

Watch:

* Root FS Used
* filesystem graphs

---

# Step 3 — Investigate Disk

Run:



```bash
df -h
```

Then:

```bash
sudo du -sh /var/lib/prometheus
```

Production importance:

* Prometheus itself consumes disk
* metrics retention matters
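
To spot a filling disk without reading the whole `df` table, the output can be filtered by usage. This is a sketch: `fs_over_threshold` is a hypothetical helper, the 80% default is an arbitrary assumption, and mount points containing spaces are not handled.

```bash
# Hypothetical helper: list mount points at or above a usage threshold.
# Reads `df -P` output on stdin; $1 = threshold in percent (default 80).
fs_over_threshold() {
  awk -v limit="${1:-80}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                      # strip the percent sign
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# Example on a live host:
#   df -P | fs_over_threshold 80
```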

---

# Step 4 — Cleanup

Run:



```bash
rm -f incidentfile
```

Observe the recovery in the Root FS Used panel.


---

# Phase 6 — Monitoring Failure Incident

## Scenario

Dashboards suddenly show:

```text
No Data
```

---

# Step 1 — Simulate Exporter Failure

Run:



```bash
sudo systemctl stop node_exporter
```

---

# Step 2 — Observe

Prometheus Targets:

```text
DOWN
```

Grafana:

* panels fail
* gaps appear

Production importance:

* monitoring outages
* exporter failures
* telemetry gaps
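
When a panel shows no data, a first question is whether the exporter endpoint answers at all. The sketch below probes a URL and reports UP or DOWN. `check_endpoint` is a hypothetical helper, and the two-second timeout is an arbitrary assumption.

```bash
# Hypothetical helper: report whether an HTTP endpoint answers successfully.
# $1 = URL to probe
check_endpoint() {
  if curl -sf --max-time 2 -o /dev/null "$1"; then
    echo "UP"
  else
    echo "DOWN"
  fi
}

# Example (assumes node_exporter on localhost:9100):
#   check_endpoint http://localhost:9100/metrics
```

A DOWN here while `systemctl status node_exporter` shows the service stopped points at the exporter, not at Prometheus or Grafana.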

---

# Step 3 — Recovery

Run:



```bash
sudo systemctl start node_exporter
```

Observe:

* recovery
* targets UP

---

# Phase 7 — Production Alerting

# Step 1 — Create CPU Alert

Create:

```bash
sudo nano /etc/prometheus/alert.rules.yml
```

Add:

```yaml
groups:
- name: sre-alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 70
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: High CPU Usage
```

---

# Step 2 — Add Rule to Prometheus

Edit:

```bash
sudo nano /etc/prometheus/prometheus.yml
```

Add:

```yaml
rule_files:
  - "alert.rules.yml"
```
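
A YAML mistake in either file can stop Prometheus from starting, so it may be worth validating both before a restart. The sketch below wraps `promtool`, which ships with Prometheus. `validate_prometheus_files` is a hypothetical helper, and the default paths are the ones used in this lab.

```bash
# Hypothetical helper: validate the rule file and the main config with promtool.
# $1 = rule file, $2 = Prometheus config (defaults match this lab's paths)
validate_prometheus_files() {
  local rules="${1:-/etc/prometheus/alert.rules.yml}"
  local config="${2:-/etc/prometheus/prometheus.yml}"
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 2
  fi
  promtool check rules "$rules" && promtool check config "$config"
}

# Example: only restart when validation passes.
#   validate_prometheus_files && sudo systemctl restart prometheus
```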

---

# Step 3 — Restart Prometheus

```bash
sudo systemctl restart prometheus
```

---

# Step 4 — Trigger Alert

Run CPU stress again.

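Observe:

* the alert moves from Pending to Firing on the Alerts page (http://YOUR_IP:9090/alerts) once the `for: 1m` window has elapsed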

Production importance:

* incident detection
* automated monitoring
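
The alert state can also be read from the Prometheus `/api/v1/alerts` endpoint. The sketch below counts alerts currently in the firing state. `firing_alerts` is a hypothetical helper, and it assumes Prometheus on `localhost:9090` and a simple `grep` over the JSON response.

```bash
# Hypothetical helper: count alerts in the "firing" state.
# Reads a /api/v1/alerts JSON response on stdin.
firing_alerts() {
  grep -oE '"state" *: *"firing"' | wc -l | tr -d ' '
}

# Example (assumes Prometheus on localhost:9090):
#   curl -s http://localhost:9090/api/v1/alerts | firing_alerts
```

During the stress run the count should go from `0` to `1` about a minute after CPU usage crosses the 70% threshold.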

---

# Phase 8 — Senior SRE RCA

Now perform Root Cause Analysis.

Questions:

1. What caused the CPU saturation?
2. Did the load average confirm contention?
3. Did pressure metrics increase?
4. Was swap used?
5. Was the disk healthy?
6. Did Prometheus capture telemetry correctly?
7. Did Grafana visualize the incident correctly?
8. Which process caused the issue?
9. How would you scale production?
10. Would horizontal or vertical scaling help?

---

# Phase 9 — Real Production Thinking

## What Senior SRE Engineers Actually Analyze

Not just:



```text
CPU %
```

But:

* load
* pressure
* iowait
* steal time
* saturation
* latency
* telemetry gaps
* retention
* cardinality
* alert fatigue
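
Two of these signals, iowait and steal time, can be read directly from the aggregate `cpu` line of `/proc/stat`. The sketch below is a rough illustration: `cpu_wait_steal_pct` is a hypothetical helper, and the percentages are averages since boot, not the current rate that Grafana shows.

```bash
# Hypothetical helper: iowait and steal as a percentage of total CPU time since boot.
# Reads /proc/stat-formatted text on stdin.
# Fields on the "cpu" line: user nice system idle iowait irq softirq steal ...
cpu_wait_steal_pct() {
  awk '/^cpu / {
    total = 0
    for (i = 2; i <= 9; i++) total += $i   # user through steal
    if (total > 0)
      printf "iowait=%.1f steal=%.1f\n", $6 * 100 / total, $9 * 100 / total
  }'
}

# Example on a live Linux host:
#   cpu_wait_steal_pct < /proc/stat
```

High steal on an EC2 instance suggests the hypervisor is withholding CPU, which no amount of tuning inside the guest will fix.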

---

# Phase 10 — Final Architecture Understanding

```text
Linux Kernel
   ↓
Node Exporter
   ↓
Prometheus Scraping
   ↓
Prometheus TSDB
   ↓
Grafana Visualization
   ↓
Alerting
   ↓
Incident Response
   ↓
SRE Investigation
```

This pipeline closely mirrors the observability workflow that Senior SRE teams run in production.
