Linux CLIs for System Health: Prevent Downtime, Ensure Uptime
Let's be real, downtime sucks. Nobody wants to deal with angry users or scramble to restore service at 3 AM. But when disaster strikes, knowing which command-line tools to reach for can be the difference between a quick fix and an extended outage.
I've spent countless hours in the terminal wrestling unresponsive servers back to life. Here are the CLI tools I use most often - organized by scenario so you can quickly find what you need when things go sideways.
Scenario 1: System Down – Time to Investigate!
Your system crashed. First step? Get it back online ASAP (restart, rollback, whatever it takes). Then it's time to play detective and figure out why it happened.
Your Investigation Toolkit
# How long was the system up before the crash?
$ uptime
11:23:42 up 2 days, 1:14, 3 users, load average: 15.32, 12.67, 10.21
# What does the system log say about the crash?
$ journalctl -p err..emerg -b -1
May 12 03:42:11 webserver kernel: Out of memory: Kill process 4312 (java) score 567 or sacrifice child
May 12 03:42:11 webserver kernel: Killed process 4312 (java) total-vm:18245652kB, anon-rss:11291012kB
# Which processes are consuming resources now?
$ htop # Interactive process viewer
When a system crashes, the first things I check are:
System load with uptime - Load averages of 15+ like those in the example above suggest severe overload
System logs with journalctl - Look for OOM (Out Of Memory) killers, kernel panics, and service failures
Resource usage with htop - Find CPU/memory hogs that might be causing problems
Disk space with df -h - A full disk can wreak havoc across the system
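The four checks above can be bundled into one non-interactive pass for a quick first look (htop is swapped for a ps one-liner here since htop is interactive, and journalctl may be absent on non-systemd boxes):

```shell
#!/bin/sh
# first_look: run the crash-triage checks in one non-interactive pass.
first_look() {
    echo "== load ==";    uptime
    echo "== top cpu =="; ps aux --sort=-%cpu | head -6
    echo "== disk ==";    df -h
    echo "== recent errors =="
    journalctl -p err..emerg -b --no-pager -n 10 2>/dev/null \
        || echo "(journalctl not available)"
}

first_look
```

Paste it into your shell (or `.bashrc`) and you get the whole triage picture in one screenful.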
Quick Analysis Example
# Check what's taking up disk space
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 30G 12K 100% /
# Find the culprit
$ du -xh --max-depth=1 / 2>/dev/null | sort -hr | head -10
15G /var
8G /usr
4G /opt
2G /home
# Dig deeper
$ du -h --max-depth=1 /var | sort -hr | head -5
14G /var/log
512M /var/lib
128M /var/cache
# Find the specific log files
$ find /var/log -type f -name "*.log" -size +100M -exec ls -lh {} +
-rw-r--r-- 1 root root 12G May 12 03:40 /var/log/application.log
This investigation shows a classic scenario: a runaway log file filled the disk, causing the system to crash. Time to implement log rotation!
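Logrotate can cap that file before it fills the disk again. A minimal sketch - the path matches the log found above, but the rotation count, size limit, and filename are assumptions to tune for your setup:

```
# /etc/logrotate.d/application  (filename is an assumption)
/var/log/application.log {
    daily
    rotate 7          # keep one week of history
    maxsize 500M      # rotate early if the file grows past 500MB
    compress
    delaycompress
    missingok
    notifempty
    copytruncate      # avoid restarting the app that holds the file open
}
```

Note `copytruncate`: it lets you rotate without coordinating with the application, at the cost of possibly losing a few log lines during the copy.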
Scenario 2: System Slowdown – Feeling Sluggish?
Sometimes systems don't crash outright - they just start performing poorly. Pages load slowly, requests time out intermittently, and everything feels... off.
Your Slowdown Diagnostic Tools
# Check memory usage
$ free -h
total used free shared buff/cache available
Mem: 31Gi 28Gi 256Mi 1.0Gi 2.7Gi 1.2Gi
Swap: 2.0Gi 2.0Gi 0B
# Find memory hogs
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mysql 12345 95.7 85.2 35270716 27311600 ? Ssl May10 3122:41 mysqld
# Check disk I/O
$ iostat -xz 1
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util
sda 12.00 1450.00 152.00 24512.00 0.00 0.00 95.60
# All-in-one monitoring
$ dstat -cdngy 1
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read writ| recv send| in out | int csw
12 8 45 35 0| 0 24.5M| 0 0 | 0 0 | 789 1425
15 10 40 35 0| 0 26.2M| 16k 12k| 0 0 | 812 1567
The output here tells a clear story:
Memory is almost exhausted (28GB of 31GB used, and swap is completely full)
MySQL is consuming 85% of system memory
Disk I/O is at 95% utilization with heavy writes
CPU is spending 35% of its time waiting for I/O operations
This is a classic case of a database server that needs optimization - either query tuning, proper indexing, or possibly hardware upgrades.
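For the query-tuning route, a good first step is turning on MySQL's slow query log so you can see exactly what the server is grinding on. A sketch of the relevant my.cnf options (the file path and the 1-second threshold are assumptions):

```
# /etc/mysql/conf.d/slow-query.cnf
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1    # log anything slower than 1 second
```

After a restart (or `SET GLOBAL slow_query_log = 1;` at runtime), summarize the worst offenders with `mysqldumpslow -s t /var/log/mysql/slow.log`.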
Quick Action Steps
For memory issues:
# Drop the page cache if needed (be careful - this frees cache, not leaked memory!)
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
# Restart the memory-hogging service
$ systemctl restart mysql
For disk I/O issues:
# Find processes causing disk I/O
$ iotop -o
# Check for large files that might be causing issues
$ ncdu /var
Scenario 3: Resource Overload – "Help! I'm drowning!"
Sometimes systems get overwhelmed by too many requests, background jobs, or runaway processes.
Your Resource Management Arsenal
# See which processes are using the most CPU
$ top -b -n 1 -o %CPU | head -20
# Find I/O bottlenecks
$ iotop -o -b -n 2
# Check network connections and listening ports
$ ss -tuln
$ netstat -tnlp
# Monitor network traffic
$ iftop -i eth0
Example Resource Overload Diagnosis
Let's say your web server is struggling with too many connections:
# Check current connection count
$ ss -s
Total: 1425
TCP: 1418 (estab 1124, closed 276, orphaned 0, timewait 267)
# See which processes have the most open connections
$ lsof -i | grep ESTABLISHED | awk '{print $1}' | sort | uniq -c | sort -rn
987 nginx
124 php-fpm
13 sshd
# Check nginx process details
$ ps aux | grep nginx
www-data 12345 98.7 2.3 142796 28392 ? R 09:27 12:42 nginx: worker process
This shows a classic overload scenario - nginx is handling nearly 1,000 connections and consuming 98.7% CPU. Time to either scale up or implement rate limiting!
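If scaling up isn't an option right away, nginx's built-in limit_req module can shed the excess load for you. A minimal sketch - the zone size, rate, and burst values here are illustrative, not tuned numbers, and the upstream address is hypothetical:

```
# /etc/nginx/conf.d/ratelimit.conf  (path is an assumption; adjust per distro)
# Track clients by IP in a 10MB shared zone, allowing 10 requests/second each.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;
    location / {
        # Absorb short bursts of up to 20 requests before rejecting.
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://127.0.0.1:8080;  # hypothetical upstream
    }
}
```

Rejected requests get a 503 by default; `limit_req_status 429;` is often a friendlier choice for API clients.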
Scenario 4: Peak Demand – Handling the Rush Hour
During peak traffic periods, you need to keep a close eye on system performance to ensure everything scales properly.
# Monitor load averages over time
$ watch -n 10 "uptime"
# Check system activity reports
$ sar -q 1 10 # Load average
$ sar -r 1 10 # Memory usage
$ sar -b 1 10 # I/O operations
# Track process resource usage over time
$ pidstat -r -u -d 1 10
Handling Peak Load
During high load periods, you might need to temporarily prioritize critical processes:
# Give your database higher priority
$ sudo renice -n -5 -p $(pgrep mysql)
# Limit CPU usage of non-critical background jobs
$ cpulimit -p $(pgrep -n backup_script) -l 30 # Limit to 30% CPU (cpulimit takes a single PID)
Creating Your Own Monitoring Dashboard
Want to create a simple monitoring dashboard? Here's a quick script I use:
#!/bin/bash
# Simple terminal dashboard
while true; do
    clear
    echo "=== SYSTEM DASHBOARD === $(date) ==="
    echo ""
    echo "=== LOAD ==="
    uptime
    echo ""
    echo "=== MEMORY ==="
    free -h
    echo ""
    echo "=== DISK ==="
    df -h | grep -v tmpfs
    echo ""
    echo "=== TOP PROCESSES ==="
    ps aux --sort=-%cpu | head -6
    echo ""
    echo "=== RECENT ERRORS ==="
    journalctl -p err..emerg -n 5 --no-pager
    sleep 5
done
Save this as dashboard.sh, make it executable with chmod +x dashboard.sh, and run it in a terminal window for a simple real-time system overview.
Pro Tips from the Trenches
After years of dealing with system issues, here are some best practices I've learned:
Establish baselines - Know what "normal" looks like so you can quickly spot abnormal behavior
Use screen or tmux for long-running diagnostics - Nothing worse than losing your SSH connection during troubleshooting
Create aliases for common commands - Add these to your .bashrc:
alias meminfo='free -h'
alias cpuinfo='top -b -n 1 | head -20'
alias diskinfo='df -h'
alias ioinfo='iostat -xz 1 5'
Keep a troubleshooting journal - Document issues and solutions for faster resolution next time
Set up automated monitoring - Don't rely solely on manual checks
Automating Your Monitoring
While these CLI tools are invaluable for troubleshooting, you shouldn't rely on manual checks alone. Automated monitoring systems can alert you before small issues become major outages.
For critical production systems, consider setting up:
Resource threshold alerts - Get notified when CPU, memory, or disk usage crosses critical thresholds
Service availability checks - Ensure your key services remain responsive
Log analysis - Automatically scan logs for error patterns
Performance metrics - Track response times to catch slowdowns early
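Even before you wire up a full monitoring stack, the first bullet is easy to script yourself. A minimal sketch of a disk-threshold check you could drop into cron - the 90% default is an arbitrary assumption, tune it for your disks:

```shell
#!/bin/sh
# check_disk: read `df -P` output on stdin and warn about any filesystem
# at or above the given usage threshold (default 90%).
check_disk() {
    limit="${1:-90}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit)
            printf "WARNING: %s at %s%% (mounted on %s)\n", $1, use, $6
    }'
}

# Live usage (e.g. from a cron job that mails you the output):
#   df -P | check_disk 90
```

Cron mails any non-empty output to you by default, so the script alerts only when a filesystem actually crosses the line.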
Conclusion
Linux CLI tools provide immediate insights when things go wrong, but they're most powerful when you know which ones to use in specific scenarios. Keep this guide handy for your next firefighting session!
Remember that while manual CLI troubleshooting is essential, combining these techniques with automated monitoring gives you the best of both worlds - deep diagnostic capabilities plus proactive notification when things start to go wrong.
For more CLI-based monitoring tips and advanced Linux troubleshooting techniques, check out our comprehensive guide on the Bubobot blog.
#SystemMonitoring, #LinuxCLI, #BetterUptime
Read more at https://bubobot.com/blog/linux-cl-is-for-system-health-prevent-downtime-ensure-uptime?utm_source=dev.to