Aisalkyn Aidarova

Posted on May 19

Lab 2 — Your First Grafana Dashboard (Built from Scratch)

What you will build

Panel	Type	What it measures
CPU usage %	Time series	How hard your cores are working
System load (1m/5m/15m)	Time series	Load average trend
Memory used vs available	Time series	RAM consumption
Disk usage %	Gauge	How full your root partition is
Network traffic (in/out)	Time series	Bytes per second on eth0
Key metrics snapshot	Table	Current values side by side

Part 1 — Connect Grafana to Prometheus

Open Grafana: http://<EC2-PUBLIC-IP>:3000 (default login: admin / admin)
Go to Connections → Data sources → Add data source
Select Prometheus
Set URL to http://localhost:9090
Click Save & test — you should see a green success message

Why localhost? Grafana and Prometheus run on the same machine, so Grafana can reach Prometheus over the loopback interface. Port 9090 never has to be opened to the internet.
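If Save & test fails, it helps to confirm that Prometheus itself is answering before debugging Grafana. Here is a minimal Python sketch to run on the EC2 instance, assuming Prometheus listens on the default port 9090 and serves its built-in /-/healthy endpoint. The function names are illustrative, not part of any library.

```python
import urllib.error
import urllib.request

# Assumption: the same URL that was entered in the Grafana data source form.
PROMETHEUS_URL = "http://localhost:9090"


def health_url(base: str) -> str:
    """Return the address of the Prometheus health endpoint for a base URL."""
    return base.rstrip("/") + "/-/healthy"


def is_healthy(base: str = PROMETHEUS_URL, timeout: float = 3.0) -> bool:
    """True if Prometheus answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("Prometheus reachable:", is_healthy())
```

If this prints False on the instance, the problem is with Prometheus (not running, or listening on a different port) rather than with the Grafana data source settings.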

Part 2 — Panel 1: CPU Usage %

Visualization: Time series

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Settings: Unit → Percent (0-100) · Min 0 · Max 100 · Threshold warning at 70, critical at 90

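Verify: Run stress-ng --cpu 2 --timeout 60s and watch the panel rise. Because the query averages over a 5m window, a 60-second burst shows up as a smoothed bump rather than an instant jump to 100%.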
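To see what the query computes, here is a rough Python sketch of the same arithmetic on two invented scrapes of the idle counter, taken 5 minutes apart on a two-core host. It is a simplification: the real rate() also handles counter resets and extrapolates to the window edges, which this omits.

```python
# Invented samples of node_cpu_seconds_total{mode="idle"} for two cores,
# taken 300 seconds (5m) apart. The numbers are for illustration only.
WINDOW_SECONDS = 300
idle_start = {"cpu0": 10_000.0, "cpu1": 12_000.0}
idle_end = {"cpu0": 10_240.0, "cpu1": 12_270.0}


def idle_rate(start: float, end: float, window: float) -> float:
    """Per-second increase of the counter, i.e. the fraction of time spent idle."""
    return (end - start) / window


def cpu_usage_percent(start: dict, end: dict, window: float) -> float:
    """Mirror 100 - avg(rate(idle[5m])) * 100 for a single instance."""
    rates = [idle_rate(start[cpu], end[cpu], window) for cpu in start]
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


usage = cpu_usage_percent(idle_start, idle_end, WINDOW_SECONDS)
print(f"CPU usage: {usage:.1f}%")  # prints "CPU usage: 15.0%"
```

Here cpu0 was idle for 240 of the 300 seconds and cpu1 for 270, so the average idle fraction is 0.85 and the panel would show 15%.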

Part 3 — Panel 2: System Load Average

Visualization: Time series — add 3 queries:

node_load1    # legend: 1m load
node_load5    # legend: 5m load
node_load15   # legend: 15m load

Settings: Unit → Short · Add a threshold line at your core count (e.g. 1 for t2.micro, 2 for t2.medium)

A load average above the core count means more tasks are waiting to run than the CPUs can serve, so the system is falling behind.
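The comparison against core count can be sketched in a few lines of Python. This is an illustrative helper, not something Grafana or the node exporter provides. It assumes a Linux host, where os.getloadavg() returns the kernel's 1m/5m/15m load averages.

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Normalise a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_overloaded(load: float, cores: int) -> bool:
    """True when the load average exceeds the number of cores."""
    return load_per_core(load, cores) > 1.0


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    for label, value in (("1m", one), ("5m", five), ("15m", fifteen)):
        per_core = load_per_core(value, cores)
        print(f"{label}: {value:.2f} ({per_core:.2f} per core)")
```

A per-core value that stays above 1.0 across all three windows points to sustained pressure. A high 1m value with low 5m and 15m values is usually a short burst.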

Part 4 — Panel 3: Memory Usage

Visualization: Time series — add 2 queries:

node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes   # legend: Used
node_memory_MemAvailable_bytes                                 # legend: Available

Settings: Unit → bytes(IEC) · Enable Stack series

Verify: stress-ng --vm 1 --vm-bytes 400M --timeout 60s
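On Linux the node exporter reads these values from /proc/meminfo, so the two queries can be reproduced by hand. The sketch below parses a made-up three-line excerpt. The sample numbers are invented and the helper names are illustrative.

```python
# Invented excerpt of /proc/meminfo; real files contain many more fields.
SAMPLE_MEMINFO = """\
MemTotal:        1000000 kB
MemFree:          200000 kB
MemAvailable:     600000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Return /proc/meminfo fields in bytes, keyed by field name."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Fields with a "kB" suffix are in units of 1024 bytes.
            factor = 1024 if len(parts) > 1 and parts[1] == "kB" else 1
            values[name.strip()] = int(parts[0]) * factor
    return values


def used_and_available(meminfo: dict) -> tuple:
    """Mirror the panel's two queries: Used = MemTotal - MemAvailable."""
    available = meminfo["MemAvailable"]
    return meminfo["MemTotal"] - available, available


used, available = used_and_available(parse_meminfo(SAMPLE_MEMINFO))
print(f"Used: {used / 2**20:.1f} MiB, Available: {available / 2**20:.1f} MiB")
```

Used and Available always add up to MemTotal, which is why stacking the two series gives a flat top line at the instance's total RAM.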

Part 5 — Panel 4: Disk Usage (Gauge)

Visualization: Gauge

100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} 
      / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)

Settings: Unit → Percent (0-100) · Thresholds: 0 green → 70 yellow → 85 red
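As a cross-check, the same percentage can be computed locally on the instance. The sketch below assumes the exporter's avail and size metrics correspond to the f_bavail and f_blocks fields of the filesystem statistics call, scaled by block size. It prints the free-based figure next to the avail-based one so the two can be compared.

```python
import os


def disk_used_percent(avail_bytes: int, size_bytes: int) -> float:
    """Mirror 100 - (avail / size) * 100 from the gauge query."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 100 - (avail_bytes / size_bytes) * 100


if __name__ == "__main__":
    st = os.statvfs("/")
    size = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize  # space available to unprivileged users
    free = st.f_bfree * st.f_frsize    # also counts blocks reserved for root
    print(f"used (avail-based): {disk_used_percent(avail, size):.1f}%")
    print(f"used (free-based):  {disk_used_percent(free, size):.1f}%")
```

The first line should be close to what the gauge shows. On filesystems with reserved blocks (ext4 by default), the second line comes out a little lower.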

Part 6 — Panel 5: Network Traffic

Visualization: Time series — add 2 queries:

rate(node_network_receive_bytes_total{device="eth0"}[5m])    # legend: In
rate(node_network_transmit_bytes_total{device="eth0"}[5m])   # legend: Out

Settings: Unit → bytes/sec(IEC)

Not every instance names its interface eth0; some use ens5 instead. To check, run node_network_receive_bytes_total in the Prometheus UI, look at the device label, and use that value in both queries.
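The same lookup can be scripted. The sketch below works on a hypothetical, trimmed response of the kind the Prometheus HTTP API returns for an instant query of node_network_receive_bytes_total. The sample payload, label values, and helper name are invented for illustration.

```python
# Hypothetical, trimmed response from
# GET /api/v1/query?query=node_network_receive_bytes_total
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"device": "lo", "instance": "localhost:9100"},
             "value": [1716100000, "52340"]},
            {"metric": {"device": "ens5", "instance": "localhost:9100"},
             "value": [1716100000, "98123456"]},
        ],
    },
}


def non_loopback_devices(response: dict) -> list:
    """Return the sorted device labels, skipping the loopback interface."""
    devices = {
        series["metric"].get("device")
        for series in response["data"]["result"]
    }
    return sorted(d for d in devices if d and d != "lo")


print(non_loopback_devices(SAMPLE_RESPONSE))  # prints "['ens5']"
```

Whatever name comes back is the value to put in the device="..." matcher of both panel queries.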

Part 7 — Panel 6: Key Metrics Table

Visualization: Table — set each query to Instant, add Reduce transformation → Last

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)          # CPU %
node_load1                                                                   # Load 1m
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) 
  / node_memory_MemTotal_bytes * 100                                        # Memory %
100 - ((node_filesystem_avail_bytes{mountpoint="/"} 
      / node_filesystem_size_bytes{mountpoint="/"}) * 100)                 # Disk %
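An Instant query asks Prometheus for a single current value per series instead of a range of points. As a rough illustration of what that looks like outside Grafana, this sketch builds the URL for the HTTP API's /api/v1/query endpoint and pulls the number out of a response. The response body here is invented, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode

# Assumption: the same URL used for the Grafana data source.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query_url(expr: str, base: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{base}/api/v1/query?{params}"


def first_value(body: str) -> float:
    """Extract the value of the first series from an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    _timestamp, value = result[0]["value"]
    return float(value)  # the API encodes sample values as strings


# Invented response body for the "Load 1m" row.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_load1"},
             "value": [1716100000, "0.42"]},
        ],
    },
})

print(instant_query_url("node_load1"))
print(first_value(SAMPLE_BODY))
```

Each row of the table panel corresponds to one such value, which is why range data is unnecessary here.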

Part 8 — Save and stress test

Save the dashboard as EC2 System Health. Set auto-refresh to 30s.

Then run all three stressors simultaneously:

stress-ng --cpu 2 --timeout 120s &
stress-ng --vm 1 --vm-bytes 500M --timeout 120s &
stress-ng --io 2 --timeout 120s

Watch all panels respond in real time. After stress ends, verify metrics return to baseline.

Checkpoint questions

Answer before moving to Lab 3:

Why use rate() on node_cpu_seconds_total instead of the raw value?
What does load average 4.0 mean on a 2-core machine?
Why does the disk query use avail_bytes not free_bytes?
What changes if you switch [5m] to [1m] in the CPU query?
Why is avg by(instance) important in a multi-host setup?
