The Monitoring Gap Every DevOps Engineer Faces
Full monitoring stacks like Prometheus + Grafana are great, but they take time to set up. What about the servers you inherit? The staging environments? The emergency VM you spin up during an outage?
Command-line monitoring is your immediate, universal answer. These tools work on every Linux box, no agents required. Better yet, they're fast enough to script into alerting workflows.
This post covers the essential Linux monitoring commands plus patterns to turn raw metrics into actionable alerts—perfect follow-up to our Bash scripting guide.
1. Real-Time Resource Dashboards
The top/htop Foundation
top gives you an instant system snapshot:
top - 11:26:45 up 5 days, 3:12, 2 users, load average: 1.23, 1.45, 1.67
Tasks: 234 total, 2 running, 232 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.3 us, 8.7 sy, 0.0 ni, 78.9 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 7900.2 total, 1234.5 free, 4567.8 used, 2097.9 buff/cache
Pro move: htop (install with apt install htop)
Mouse/keyboard navigation
Color-coded resource bars
Tree view of processes (
F5)
Quick filters:
htop -p $(pgrep -d, nginx) # Monitor nginx processes only
Memory Deep Dive: free -h
free -h
total used free shared buff/cache available
Mem: 7.7Gi 4.2Gi 1.2Gi 128Mi 2.3Gi 3.1Gi
Swap: 2.0Gi 0B 2.0Gi
What matters: Focus on available column, not free. Linux aggressively caches to disk.
2. CPU Analysis: Who's Eating Cycles?
Per-Process Breakdown
ps aux --sort=-%cpu | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mysql 1234 45.2 12.3 2.1g 980m ? S 10:00 3:45 /usr/sbin/mysqld
Historical CPU Trends: sar
# Install: apt install sysstat
sar -u 1 5 # CPU every 1 sec, 5 samples
sar -u -f /var/log/sysstat/sa08 # Yesterday's data
Average: CPU %user %nice %system %iowait %steal %idle
Average: all 12.34 0.00 8.76 1.23 0.00 77.67
Alert pattern:
#!/bin/bash
if sar -u 1 3 | tail -1 | awk '{if($8 < 70) exit 1}'; then
echo "CPU idle <70% for 3s - investigate!"
fi
3. Disk I/O: The Silent Killer
Current I/O: iostat
iostat -x 1 5
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 23.4 1.2 234.5 12.3 0.0 10.2 0.00 89.12 0.1 2.3 0.45 10.0 6.2 1.23 45.2
Red flags: %util >80%, await >20ms
Disk Space Alerts: df
df -h --output=source,fstype,size,used,avail,pcent,target | grep -v tmpfs
Scriptable alert:
df -h | grep -E "[8-9][0-9]%|[9][0-9]%|[100]%" || echo "Disk healthy"
4. Network Troubleshooting Masters
Active Connections: ss
# Replace netstat everywhere
ss -tuln # Listening TCP/UDP
ss -tunap | grep :80 # Processes on port 80
ss -t state established | grep :443 | wc -l # Active HTTPS connections
Drop Counters: netstat or ss
netstat -s | grep -E "errors|dropped|retrans"
Ip:
1234 total packets received
56 dropped because of memory problems
Live Packet Capture: tcpdump
# Capture 100 packets on interface eth0, port 80
sudo tcpdump -i eth0 -c 100 port 80 -w capture.pcap
# Read capture
tcpdump -r capture.pcap -nn
5. Log Monitoring: Beyond tail -f
Service Logs: journalctl
journalctl -u nginx -f # Follow nginx logs
journalctl -u nginx --since "1h ago" # Last hour
journalctl -p err -u nginx # Only errors
journalctl --no-pager | grep -i panic # System panics
Pattern Mining: grep + awk
# Count 5xx errors per minute
journalctl -u nginx --since "10min ago" | \
grep " 500 " | \
awk '{print $1, $2}' | cut -d. -f1 | sort | uniq -c
# Slow requests (>2s)
awk '$NF > 2 {print}' /var/log/nginx/access.log
6. Production Alerting Patterns
CPU/Memory Watchdog
#!/bin/bash
set -euo pipefail
alert() { curl -X POST -d "CPU ${CPU}%, MEM ${MEM}%" "$SLACK_WEBHOOK"; }
CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
[[ "$CPU" -gt 80 || "$MEM" -gt 80 ]] && alert
Disk Space Guardian
#!/bin/bash
for fs in $(df --local --output=source | tail -n +2); do
usage=$(df $fs | tail -1 | awk '{print $5}' | sed 's/%//')
[[ $usage -gt 85 ]] && echo "ALERT: $fs at ${usage}%"
done
Cron schedule:
# Every 5 minutes
*/5 * * * * /usr/local/bin/check_resources.sh
7. One-Line Dashboards
Combine tools into instant observability:
# System overview (alias this to 'sys')
watch -n 2 'printf "\nCPU: "; sar -u 1 1 |tail-1; printf "MEM: "; free -h |tail-1; printf "DISK: "; df -h / /var |tail -2'
# Top resource hogs
watch -n 2 'ps aux --sort=-%cpu | head -8; echo "---"; ps aux --sort=-%mem | head -8'
Quick Reference Table
| Scenario | Command | Pro Tip |
| ----------- | ---------------------- | ------------------------------------ |
| CPU trends | sar -u 1 5 | Historical data in /var/log/sysstat/ |
| Memory | free -h | Watch available, ignore free |
| Disk I/O | iostat -x 1 | %util >80% = trouble |
| Connections | ss -tuln | Modern netstat replacement |
| Logs | journalctl -u nginx -f | systemd's tail -f |
| Processes | htop -p $(pgrep nginx) | Filter to specific app |
Top comments (0)