Every DevOps engineer eventually gets this message:
"The server is slow. Can you check?"
What follows is a systematic investigation. Not random commands — a structured approach to find out exactly what's wrong. Here are the 20 Linux commands I use every week, organized by the problem they solve.
CPU Troubleshooting
1. top — Real-Time Process Overview
top -bn1 | head -20
The first command I run. It answers three questions instantly:
- How loaded is the CPU? (look at the %Cpu(s) line)
- Which process is eating CPU? (sort by %CPU)
- Is the system swapping? (the Swap line — if swap used is high, you have a memory problem)
top - 14:32:01 up 45 days, 3:42, 2 users, load average: 8.52, 4.21, 2.10
Tasks: 312 total, 3 running, 309 sleeping
%Cpu(s): 78.2 us, 5.1 sy, 0.0 ni, 15.3 id, 0.0 wa, 0.0 hi, 1.4 si
MiB Mem : 16384.0 total, 1024.5 free, 12288.3 used, 3071.2 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 3584.2 avail Mem
Reading the load average: Three numbers = 1min, 5min, 15min averages. Compare them to your CPU count:
- 4-core machine with load average 4.0 → 100% utilized
- 4-core machine with load average 8.0 → overloaded, processes are queuing
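That comparison is easy to script; a minimal sketch reading /proc/loadavg (assumes a Linux system with nproc available):

```shell
# Compare the 1-minute load average to the CPU count
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" \
  'BEGIN { printf "load %.2f on %d cores = %.0f%% of capacity\n", l, c, 100 * l / c }'
```

Anything consistently over 100% means processes are queuing for CPU time.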
2. mpstat — Per-CPU Breakdown
mpstat -P ALL 1 3
Shows utilization per CPU core. If one core is at 100% while others are idle, you have a single-threaded bottleneck.
CPU %usr %sys %iowait %idle
0 95.2 3.1 0.0 1.7 ← Bottleneck on core 0
1 2.4 1.0 0.0 96.6
2 3.1 0.8 0.0 96.1
3 1.8 0.5 0.0 97.7
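You can flag that pattern automatically; a rough sketch (assumes sysstat's mpstat, whose last output column is %idle):

```shell
# Print any core busier than 90% in a single 1-second mpstat sample
mpstat -P ALL 1 1 | awk '$2 ~ /^[0-9]+$/ && (100 - $NF) > 90 \
  { printf "core %s is %.1f%% busy\n", $2, 100 - $NF }'
```

The `$2 ~ /^[0-9]+$/` guard skips the header and the "all" summary rows.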
3. pidstat — Per-Process CPU Usage Over Time
pidstat -u 1 5
Unlike top (which shows a snapshot), pidstat shows CPU usage sampled every second over 5 intervals. This catches processes that spike briefly and go idle.
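If pidstat isn't installed (it ships with sysstat, not base Linux), you can approximate the same sampling with ps in a loop; a rough sketch:

```shell
# Log the top CPU consumer once per second for 5 seconds
for i in 1 2 3 4 5; do
  ps -eo pid,pcpu,comm --sort=-pcpu --no-headers | head -1
  sleep 1
done
```

Note that ps's %CPU is averaged over the process lifetime, so pidstat's per-interval numbers are more accurate for catching short spikes.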
Memory Troubleshooting
4. free — Memory Overview
free -h
total used free shared buff/cache available
Mem: 16Gi 12Gi 512Mi 64Mi 3.5Gi 3.2Gi
Swap: 4.0Gi 0B 4.0Gi
Key insight: Don't look at "free" — look at "available." Linux uses free memory for disk caching (buff/cache), which is released when applications need it. "Available" tells you how much memory is actually available for new processes.
Red flags:
- available is less than 10% of total → memory pressure
- Swap used is non-zero and growing → active swapping, performance will degrade
5. ps — Top Memory Consumers
ps aux --sort=-%mem | head -15
Lists processes sorted by memory usage (highest first). Quick way to find the memory hog.
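When one service forks many workers, per-PID numbers understate the real footprint. A sketch that aggregates resident memory (the rss column, in KiB) by command name:

```shell
# Total resident memory (MB) grouped by command name, top 10
ps -eo rss,comm --no-headers \
  | awk '{ sum[$2] += $1 } END { for (c in sum) printf "%10.1f MB  %s\n", sum[c] / 1024, c }' \
  | sort -rn | head -10
```

Shared pages are counted once per process here, so sums for forked workers overstate actual usage somewhat.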
6. vmstat — Virtual Memory Statistics
vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 524288 102400 3670016 0 0 12 156 450 890 35 5 58 2
5 3 0 262144 102400 3670016 0 0 890 2048 1200 2400 85 10 0 5
What to watch:
- r (running): if consistently greater than CPU count → CPU bottleneck
- b (blocked): processes waiting for I/O → disk bottleneck
- si/so (swap in/out): non-zero means active swapping → memory issue
- wa (I/O wait): >20% → disk is the bottleneck
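The run-queue check can be automated; a minimal sketch comparing vmstat's r column against the core count:

```shell
# Flag vmstat samples where the run queue exceeds the number of cores
vmstat 1 5 | awk -v c="$(nproc)" \
  'NR > 2 && $1 > c { print "CPU queueing: r=" $1 " on " c " cores" }'
```

`NR > 2` skips vmstat's two header lines.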
Disk Troubleshooting
7. df — Disk Space
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 92G 8.0G 92% /
/dev/sdb1 500G 234G 266G 47% /data
Critical rule: When a disk hits 100%, bad things happen — databases crash, logs stop writing, containers fail to start. Set alerts at 80% and 90%.
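Those thresholds are easy to script; a minimal sketch over POSIX `df -P` output, with the 80/90% levels as an assumption you should tune:

```shell
# Warn at 80% and alert at 90% disk usage per filesystem
df -P | awk 'NR > 1 {
  gsub("%", "", $5)
  if ($5 >= 90)      print "CRIT " $6 " at " $5 "%"
  else if ($5 >= 80) print "WARN " $6 " at " $5 "%"
}'
```

Drop it into cron or a monitoring agent rather than running it by hand.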
8. du — What's Using the Space
# Top 10 largest directories under /
du -h --max-depth=1 / 2>/dev/null | sort -rh | head -10
# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null
Common culprits:
- /var/log/ — unrotated logs
- /tmp/ — leftover temp files
- /var/lib/docker/ — Docker images and volumes
- Container overlay filesystems
9. iostat — Disk I/O Performance
iostat -xz 1 3
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s await %util
sda 12.0 450.0 0.05 28.12 0.00 180.0 8.52 95.3
sdb 2.0 5.0 0.01 0.02 0.00 1.0 1.20 0.8
Key columns:
- %util: >80% means the disk is near saturation
- await: average time (ms) for I/O requests; >10ms on SSD or >20ms on HDD means slowdown
- w/s: writes per second — correlate with your application's write patterns
10. lsof — Open Files by Process
# Which process has a specific file open?
lsof /var/log/syslog
# All files opened by a process (-d, joins multiple PIDs with commas)
lsof -p "$(pgrep -d, nginx)"
# Find deleted files still holding disk space
lsof +L1
The +L1 trick is gold. Sometimes df shows 95% used but du only accounts for 60%. The difference is deleted files still held open by running processes. The fix: restart the process holding the deleted file.
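To quantify how much space those deleted files are holding, you can sum lsof's SIZE/OFF column; a rough sketch (column positions assume default lsof output, where SIZE/OFF is field 7):

```shell
# Total space held by deleted files that are still open
lsof +L1 2>/dev/null \
  | awk 'NR > 1 { sum += $7 } END { printf "%.1f MB held by deleted files\n", sum / 1024 / 1024 }'
```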
Network Troubleshooting
11. ss — Socket Statistics (Modern netstat)
# Active connections
ss -tunap
# Count connections by state
ss -s
# Find what's listening on a specific port
ss -tlnp | grep :8080
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 128 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=1234))
ESTAB 0 0 10.0.1.5:8080 10.0.2.3:54321 users:(("nginx",pid=1234))
Useful patterns:
- Too many CLOSE_WAIT → your application isn't closing connections properly
- Too many TIME_WAIT → high connection churn, consider connection pooling
- Recv-Q growing → the application can't process data fast enough
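Counting connections per state makes these patterns jump out immediately; a one-liner sketch over `ss -tan` output:

```shell
# Count TCP connections by state, most common first
ss -tan \
  | awk 'NR > 1 { count[$1]++ } END { for (s in count) printf "%6d  %s\n", count[s], s }' \
  | sort -rn
```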
12. dig — DNS Resolution
# Basic lookup
dig api.example.com
# Short answer only
dig +short api.example.com
# Trace the full DNS resolution path
dig +trace api.example.com
# Query a specific DNS server
dig @8.8.8.8 api.example.com
When services can't communicate, DNS is the cause more often than you'd expect.
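A quick way to tell a broken local resolver from a broken record is to compare answers against a known-good resolver; a sketch (8.8.8.8 as the reference resolver, and the hostname is a placeholder):

```shell
# Compare the system resolver's answer against Google DNS
host=api.example.com
local_ip=$(dig +short "$host" | head -1)
ref_ip=$(dig +short @8.8.8.8 "$host" | head -1)
if [ "$local_ip" = "$ref_ip" ]; then
  echo "resolvers agree: $local_ip"
else
  echo "MISMATCH: local=$local_ip ref=$ref_ip"
fi
```

Agreement points you at the application; a mismatch points at /etc/resolv.conf, a caching resolver, or stale records.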
13. curl — HTTP Debugging
# Check if a service is responding
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health
# See response time breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com
# Test with specific headers
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/v1/users
The -w timing breakdown is incredibly useful. It tells you exactly where latency is coming from — DNS, TCP connection, TLS handshake, or server processing.
14. tcpdump — Packet Capture
# Capture traffic on port 8080
tcpdump -i eth0 port 8080 -nn
# Capture and save to file for Wireshark analysis
tcpdump -i eth0 port 443 -w capture.pcap -c 1000
# Show HTTP requests
tcpdump -i eth0 -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'
Last resort debugging — when logs show nothing and metrics are inconclusive, packet captures reveal the truth.
Process Troubleshooting
15. journalctl — Systemd Service Logs
# Last 100 lines of a service
journalctl -u nginx --no-pager -n 100
# Follow logs in real-time
journalctl -u nginx -f
# Logs since last boot
journalctl -u nginx -b
# Logs from last hour
journalctl -u nginx --since "1 hour ago"
# Filter by priority (errors only)
journalctl -u nginx -p err
16. strace — System Call Tracing
# Trace a running process
strace -p $(pgrep -f payment-service) -f -e trace=network
# Trace a command from start
strace -f -e trace=open,read,write -o /tmp/trace.log ./my-app
# Count system calls (performance overview)
strace -c -p $(pgrep nginx)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- --------
65.23 0.452310 12 37692 epoll_wait
18.41 0.127650 3 42530 write
10.12 0.070180 2 35090 read
When you've exhausted logs and metrics, strace shows you exactly what a process is doing at the system call level. It's the ultimate debugging tool.
17. dmesg — Kernel Messages
# Recent kernel messages
dmesg -T | tail -50
# Filter for errors
dmesg -T --level=err,warn
# OOM killer events
dmesg -T | grep -i "oom\|out of memory\|killed process"
If the kernel OOM-kills your process, it won't appear in application logs. It only appears in dmesg and in the kernel log file (e.g. /var/log/kern.log on Debian-based systems).
Quick Diagnostic One-Liners
18. System Load Summary
uptime
14:32:01 up 45 days, 3:42, 2 users, load average: 2.15, 1.92, 1.45
First thing I check. If load average is within normal range and the server "feels slow," the problem is elsewhere — network, database, external API.
19. Who's Logged In & What Are They Doing
w
USER TTY FROM LOGIN@ IDLE WHAT
alice pts/0 10.0.1.100 14:20 0.00s top
bob pts/1 10.0.2.50 14:25 5:00 vi /etc/nginx/nginx.conf
Important during incidents — know who else is on the server and what they're changing.
20. Quick Health Check Script
#!/bin/bash
echo "=== System Health Check ==="
echo ""
echo "--- Load Average ---"
uptime
echo ""
echo "--- Memory ---"
free -h
echo ""
echo "--- Disk ---"
df -h | grep -E '^/dev/'
echo ""
echo "--- Top CPU Processes ---"
ps aux --sort=-%cpu | head -6
echo ""
echo "--- Top Memory Processes ---"
ps aux --sort=-%mem | head -6
echo ""
echo "--- Network Connections ---"
ss -s
echo ""
echo "--- Recent Errors ---"
dmesg -T --level=err,warn 2>/dev/null | tail -5
journalctl -p err --since "1 hour ago" --no-pager 2>/dev/null | tail -5
Save this as healthcheck.sh on every server. When someone says "the server is slow," run this first.
The Diagnostic Flow
When you get a "server is slow" report, follow this order:
1. uptime → Is the server actually loaded?
2. top → CPU? Memory? Which process?
3. free -h → Memory pressure? Swapping?
4. df -h → Disk full?
5. iostat -xz 1 → Disk I/O saturated?
6. ss -s → Connection issues?
7. dmesg -T → OOM kills? Hardware errors?
8. journalctl -p err → Service-level errors?
This takes under 2 minutes and identifies the bottleneck category (CPU, memory, disk, network) in almost every case.
These aren't obscure commands. They're the everyday toolkit that separates "I think the server is slow" from "the payment-service process is consuming 94% CPU due to a regex backtracking bug in the input validation module."
Precision beats guesswork. Every time.
What's your go-to Linux troubleshooting command? Drop it in the comments.
Follow me for more practical DevOps and SRE content.