
Tom

Posted on • Originally published at bubobot.com

Monitoring System Health with Linux CLIs: Troubleshooting Disk, Memory, and CPU for Better Uptime


Let's be real, downtime sucks. Nobody wants to deal with angry users or scramble to restore service at 3 AM. But when disaster strikes, knowing which command-line tools to reach for can be the difference between a quick fix and an extended outage.

I've spent countless hours in the terminal wrestling unresponsive servers back to life. Here are the CLI tools I use most often - organized by scenario so you can quickly find what you need when things go sideways.

Scenario 1: System Down – Time to Investigate!

Your system crashed. First step? Get it back online ASAP (restart, rollback, whatever it takes). Then it's time to play detective and figure out why it happened.

Your Investigation Toolkit

# How long was the system up before the crash?
$ uptime
 11:23:42 up 2 days, 1:14,  3 users,  load average: 15.32, 12.67, 10.21

# What does the system log say about the crash?
$ journalctl -p err..emerg -b -1
May 12 03:42:11 webserver kernel: Out of memory: Kill process 4312 (java) score 567 or sacrifice child
May 12 03:42:11 webserver kernel: Killed process 4312 (java) total-vm:18245652kB, anon-rss:11291012kB

# Which processes are consuming resources now?
$ htop  # Interactive process viewer


When a system crashes, the first things I check are:

  • System load with uptime - The load averages of 15+ in the example above point to severe overload

  • System logs with journalctl - Look for OOM (Out Of Memory) killers, kernel panics, and service failures

  • Resource usage with htop - Find CPU/memory hogs that might be causing problems

  • Disk space with df -h - A full disk can wreak havoc across the system
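A rough first pass on "is this box overloaded?" can be scripted. Here's a minimal sketch that compares the 1-minute load average against the CPU count; the 2x-cores factor is an arbitrary rule of thumb, not a standard, so tune it for your workload:

```shell
#!/bin/sh
# Flag the box as overloaded when the 1-minute load average
# exceeds twice the number of CPU cores.
overloaded() {
  load=$1   # 1-minute load average, e.g. from: cut -d' ' -f1 /proc/loadavg
  cores=$2  # e.g. from: nproc
  # shell arithmetic is integer-only, so do the comparison in awk
  awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > 2 * c) }'
}

if overloaded "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"; then
  echo "OVERLOADED: investigate with htop and journalctl"
else
  echo "load looks normal"
fi
```

Keep in mind that Linux load averages count tasks waiting on I/O as well as on CPU, so treat a hit here as a prompt to dig deeper with htop and iostat, not a verdict.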

Quick Analysis Example

# Check what's taking up disk space
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        30G   30G   12K 100% /

# Find the culprit
$ du -h --max-depth=1 / | sort -hr | head -10
15G  /var
8G   /usr
4G   /opt
2G   /home

# Dig deeper
$ du -h --max-depth=1 /var | sort -hr | head -5
14G  /var/log
512M /var/lib
128M /var/cache

# Find the specific log files
$ find /var/log -type f -name "*.log" -size +100M | xargs ls -lh
-rw-r--r-- 1 root root 12G May 12 03:40 /var/log/application.log


This investigation shows a classic scenario: a runaway log file filled the disk, causing the system to crash. Time to implement log rotation!
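To prevent a repeat, a logrotate rule for the offending file is usually enough. This is a minimal sketch assuming the /var/log/application.log path from the output above; adjust size, retention, and schedule to your environment:

```
# /etc/logrotate.d/application
/var/log/application.log {
    daily
    rotate 7
    maxsize 500M
    compress
    delaycompress
    missingok
    notifempty
    copytruncate   # only needed if the app can't reopen its log on rotation
}
```

copytruncate avoids having to signal the application, at the cost of possibly losing a few log lines during the copy; if the app supports a reload signal, a postrotate script is cleaner.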

Scenario 2: System Slowdown – Feeling Sluggish?

Sometimes systems don't crash outright - they just start performing poorly. Pages load slowly, requests time out intermittently, and everything feels... off.

Your Slowdown Diagnostic Tools

# Check memory usage
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi        28Gi       256Mi       1.0Gi       2.7Gi       1.2Gi
Swap:          2.0Gi       2.0Gi          0B

# Find memory hogs
$ ps aux --sort=-%mem | head -10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql    12345 95.7 85.2 5270716 27311600 ?    Ssl  May10 3122:41 mysqld

# Check disk I/O
$ iostat -xz 1
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %util
sda             12.00  1450.00    152.00  24512.00     0.00     0.00  95.60

# All-in-one monitoring
$ dstat -cdngy 1
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
 12   8  45  35   0|   0  24.5M|   0     0 |   0     0 | 789  1425
 15  10  40  35   0|   0  26.2M|  16k   12k|   0     0 | 812  1567


The output here tells a clear story:

  1. Memory is almost exhausted (28GB used out of 31GB)

  2. MySQL is consuming 85% of system memory

  3. Disk I/O is at 95% utilization with heavy writes

  4. CPU is spending 35% of its time waiting for I/O operations

This is a classic case of a database server that needs optimization - either query tuning, proper indexing, or possibly hardware upgrades.
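The "is this process a memory hog?" judgment can be made mechanical. Here's a small sketch; the 80% threshold and the RSS/MemTotal figures are illustrative values taken from the example output above, not recommendations:

```shell
#!/bin/sh
# Flag a process as a "memory hog" when its RSS exceeds a
# threshold percentage of total RAM.
is_mem_hog() {
  rss_kb=$1      # process RSS in kB (e.g. from: ps -o rss= -p PID)
  total_kb=$2    # MemTotal in kB (from /proc/meminfo)
  threshold=${3:-80}
  pct=$(( rss_kb * 100 / total_kb ))
  [ "$pct" -ge "$threshold" ]
}

# Live usage would look like:
#   total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
#   rss_kb=$(ps -o rss= -p "$(pgrep -o mysqld)")
# Here we plug in the example server's numbers (27 GB RSS, 31 GB RAM):
if is_mem_hog 27311600 32505856 80; then
  echo "memory hog"
else
  echo "ok"
fi
```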

Quick Action Steps

For memory issues:

# Drop the page cache if needed (requires root; rarely necessary,
# since the kernel reclaims cache on demand - be careful!)
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# Restart the memory-hogging service
$ systemctl restart mysql


For disk I/O issues:

# Find processes causing disk I/O
$ iotop -o

# Check for large files that might be causing issues
$ ncdu /var


Scenario 3: Resource Overload – "Help! I'm drowning!"

Sometimes systems get overwhelmed by too many requests, background jobs, or runaway processes.

Your Resource Management Arsenal

# See which processes are using the most CPU
$ top -b -n 1 -o %CPU | head -20

# Find I/O bottlenecks
$ iotop -o -b -n 2

# Check network connections and listening ports
$ ss -tuln
$ netstat -tnlp  # legacy equivalent; ss replaces netstat on modern systems

# Monitor network traffic
$ iftop -i eth0


Example Resource Overload Diagnosis

Let's say your web server is struggling with too many connections:

# Check current connection count
$ ss -s
Total: 1425
TCP:   1418 (estab 1124, closed 276, orphaned 0, timewait 267)

# See which processes have the most open connections
$ lsof -i | grep ESTABLISHED | awk '{print $1}' | sort | uniq -c | sort -rn
    987 nginx
    124 php-fpm
     13 sshd

# Check nginx process details
$ ps aux | grep nginx
www-data 12345  98.7  2.3 142796 28392 ?       R    09:27  12:42 nginx: worker process


This shows a classic overload scenario - nginx is handling nearly 1,000 connections and consuming 98.7% CPU. Time to either scale up or implement rate limiting!
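If you go the rate-limiting route, nginx's limit_req module is the usual tool. A minimal sketch - the zone name, memory size, and limits here are illustrative, not recommendations:

```
# nginx.conf sketch: cap each client IP at 10 requests/second
http {
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        location / {
            limit_req zone=perip burst=20 nodelay;
        }
    }
}
```

With burst=20 nodelay, each IP gets its 10 req/s plus a 20-request burst served immediately; anything beyond that is rejected (503 by default, tunable via limit_req_status).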

Scenario 4: Peak Demand – Handling the Rush Hour

During peak traffic periods, you need to keep a close eye on system performance to ensure everything scales properly.

# Monitor load averages over time
$ watch -n 10 "uptime"

# Check system activity reports
$ sar -q 1 10  # Load average
$ sar -r 1 10  # Memory usage
$ sar -b 1 10  # I/O operations

# Track process resource usage over time
$ pidstat -r -u -d 1 10


Handling Peak Load

During high load periods, you might need to temporarily prioritize critical processes:

# Give your database higher priority
$ renice -n -5 -p $(pgrep mysql)

# Limit CPU usage of non-critical background jobs
$ cpulimit -p $(pgrep backup_script) -l 30  # Limit to 30% CPU


Creating Your Own Monitoring Dashboard

Want to create a simple monitoring dashboard? Here's a quick script I use:

#!/bin/bash
# Simple terminal dashboard
while true; do
  clear
  echo "=== SYSTEM DASHBOARD === $(date) ==="
  echo ""
  echo "=== LOAD ==="
  uptime
  echo ""
  echo "=== MEMORY ==="
  free -h
  echo ""
  echo "=== DISK ==="
  df -h | grep -v tmpfs
  echo ""
  echo "=== TOP PROCESSES ==="
  ps aux --sort=-%cpu | head -6
  echo ""
  echo "=== RECENT ERRORS ==="
  journalctl -p err..emerg -n 5 --no-pager

  sleep 5
done


Save this as dashboard.sh, make it executable with chmod +x dashboard.sh, and run it in a terminal window for a simple real-time system overview.

Pro Tips from the Trenches

After years of dealing with system issues, here are some best practices I've learned:

  1. Establish baselines - Know what "normal" looks like so you can quickly spot abnormal behavior

  2. Use screen or tmux for long-running diagnostics - Nothing worse than losing your SSH connection during troubleshooting

  3. Create aliases for common commands - Add these to your .bashrc:

alias meminfo='free -h'
alias cpuinfo='top -b -n 1 | head -20'
alias diskinfo='df -h'
alias ioinfo='iostat -xz 1 5'

  4. Keep a troubleshooting journal - Document issues and solutions for faster resolution next time

  5. Set up automated monitoring - Don't rely solely on manual checks

Automating Your Monitoring

While these CLI tools are invaluable for troubleshooting, you shouldn't rely on manual checks alone. Automated monitoring systems can alert you before small issues become major outages.

For critical production systems, consider setting up:

  1. Resource threshold alerts - Get notified when CPU, memory, or disk usage crosses critical thresholds

  2. Service availability checks - Ensure your key services remain responsive

  3. Log analysis - Automatically scan logs for error patterns

  4. Performance metrics - Track response times to catch slowdowns early
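Even without a full monitoring stack, a small cron job covers the first item. A sketch, assuming you wire the `notify` function (a stand-in here, it just echoes) to whatever alerting channel you use - mail, a chat webhook, etc.:

```shell
#!/bin/sh
# Alert when the root filesystem crosses a usage threshold.
THRESHOLD=90

notify() { echo "ALERT: $1"; }   # stand-in; replace with mail/webhook

# Extract the Use% figure for a mount point from `df` output
disk_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

pct=$(disk_pct /)
if [ "$pct" -ge "$THRESHOLD" ]; then
  notify "root filesystem at ${pct}% (threshold ${THRESHOLD}%)"
fi
```

Save it as something like disk_alert.sh and run it every five minutes with a crontab entry such as `*/5 * * * * /usr/local/bin/disk_alert.sh`.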

Conclusion

Linux CLI tools provide immediate insights when things go wrong, but they're most powerful when you know which ones to use in specific scenarios. Keep this guide handy for your next firefighting session!

Remember that while manual CLI troubleshooting is essential, combining these techniques with automated monitoring gives you the best of both worlds - deep diagnostic capabilities plus proactive notification when things start to go wrong.


For more CLI-based monitoring tips and advanced Linux troubleshooting techniques, check out our comprehensive guide on the Bubobot blog.

#SystemMonitoring, #LinuxCLI, #BetterUptime
