S, Sanjay
Linux Troubleshooting for DevOps: 20 Commands I Use Every Single Week

Every DevOps engineer eventually gets this message:

"The server is slow. Can you check?"

What follows is a systematic investigation. Not random commands — a structured approach to find out exactly what's wrong. Here are the 20 Linux commands I use every week, organized by the problem they solve.


CPU Troubleshooting

1. top — Real-Time Process Overview

top -bn1 | head -20

The first command I run. It answers three questions instantly:

  • How loaded is the CPU? (look at %Cpu(s) line)
  • Which process is eating CPU? (sort by %CPU)
  • Is the system swapping? (Swap line — if swap used is high, you have a memory problem)
top - 14:32:01 up 45 days,  3:42,  2 users,  load average: 8.52, 4.21, 2.10
Tasks: 312 total,   3 running, 309 sleeping
%Cpu(s): 78.2 us,  5.1 sy,  0.0 ni, 15.3 id,  0.0 wa,  0.0 hi,  1.4 si
MiB Mem :  16384.0 total,   1024.5 free,  12288.3 used,   3071.2 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.   3584.2 avail Mem

Reading the load average: Three numbers = 1min, 5min, 15min averages. Compare them to your CPU count:

  • 4-core machine with load average 4.0 → 100% utilized
  • 4-core machine with load average 8.0 → overloaded, processes are queuing
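A quick way to put those numbers side by side (a small sketch: `nproc` and `/proc/loadavg` are standard on Linux, and awk does the floating-point math):

```shell
# Compare the 1-minute load average to the core count
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)

awk -v l="$load1" -v c="$cores" 'BEGIN {
  printf "load/core = %.2f\n", l / c
  if (l > c) print "WARNING: run queue exceeds core count"
}'
```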

2. mpstat — Per-CPU Breakdown

mpstat -P ALL 1 3

Shows utilization per CPU core. If one core is at 100% while others are idle, you have a single-threaded bottleneck.

CPU    %usr   %sys  %iowait  %idle
  0   95.2    3.1     0.0     1.7    ← Bottleneck on core 0
  1    2.4    1.0     0.0    96.6
  2    3.1    0.8     0.0    96.1
  3    1.8    0.5     0.0    97.7

3. pidstat — Per-Process CPU Usage Over Time

pidstat -u 1 5

Unlike top (which shows a snapshot), pidstat shows CPU usage sampled every second over 5 intervals. This catches processes that spike briefly and go idle.


Memory Troubleshooting

4. free — Memory Overview

free -h
              total    used    free   shared  buff/cache   available
Mem:           16Gi    12Gi   512Mi     64Mi        3.5Gi      3.2Gi
Swap:         4.0Gi      0B   4.0Gi

Key insight: Don't look at "free" — look at "available." Linux uses free memory for disk caching (buff/cache), which is released when applications need it. "Available" tells you how much memory is actually available for new processes.

Red flags:

  • available is less than 10% of total → memory pressure
  • Swap used is non-zero and growing → active swapping, performance will degrade
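Both red flags can be checked in one pass (a minimal sketch: without `-h`, `free` prints kilobytes, and the 10% threshold is the one from the list above):

```shell
# Alert when "available" (column 7 of the Mem: row) drops below
# 10% of total (column 2)
free | awk '/^Mem:/ {
  pct = $7 / $2 * 100
  printf "available: %.1f%% of total\n", pct
  if (pct < 10) print "ALERT: memory pressure"
}'
```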

5. ps — Top Memory Consumers

ps aux --sort=-%mem | head -15

Lists processes sorted by memory usage (highest first). Quick way to find the memory hog.

6. vmstat — Virtual Memory Statistics

vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 524288 102400 3670016   0    0    12   156  450  890 35  5 58  2
 5  3      0 262144 102400 3670016   0    0   890  2048 1200 2400 85 10  0  5

What to watch:

  • r (running): If consistently greater than CPU count → CPU bottleneck
  • b (blocked): Processes waiting for I/O → disk bottleneck
  • si/so (swap in/out): Non-zero means active swapping → memory issue
  • wa (I/O wait): >20% → disk is the bottleneck
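The `r` check is easy to automate. A hedged sketch that flags any sample where the run queue exceeds the core count (vmstat prints two header lines before the data):

```shell
# Flag vmstat samples where the run queue ("r", column 1)
# exceeds the number of CPU cores
vmstat 1 5 | awk -v cores="$(nproc)" 'NR > 2 && $1 > cores {
  print "CPU backlog: run queue", $1, "on", cores, "cores"
}'
```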

Disk Troubleshooting

7. df — Disk Space

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   92G  8.0G  92% /
/dev/sdb1       500G  234G  266G  47% /data

Critical rule: When a disk hits 100%, bad things happen — databases crash, logs stop writing, containers fail to start. Set alerts at 80% and 90%.
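That rule is easy to turn into a cron job. A sketch using GNU df's `--output` option, with thresholds matching the alerts above:

```shell
# Warn at 80% usage, alert at 90% (GNU coreutils df)
df --output=target,pcent | tail -n +2 | while read -r mount pct; do
  use=${pct%\%}                  # strip the trailing %
  if [ "$use" -ge 90 ]; then
    echo "ALERT: $mount at $pct"
  elif [ "$use" -ge 80 ]; then
    echo "WARN: $mount at $pct"
  fi
done
```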

8. du — What's Using the Space

# Top 10 largest directories under /
du -h --max-depth=1 / 2>/dev/null | sort -rh | head -10

# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null

Common culprits:

  • /var/log/ — unrotated logs
  • /tmp/ — leftover temp files
  • /var/lib/docker/ — docker images and volumes
  • Container overlay filesystems

9. iostat — Disk I/O Performance

iostat -xz 1 3
Device   r/s   w/s  rMB/s  wMB/s  rrqm/s  wrqm/s  await  %util
sda     12.0  450.0  0.05   28.12    0.00   180.0   8.52   95.3
sdb      2.0    5.0  0.01    0.02    0.00     1.0   1.20    0.8

Key columns:

  • %util: >80% means the disk is near saturation
  • await: Average time (ms) for I/O requests. >10ms on SSD or >20ms on HDD means slowdown
  • w/s: Writes per second — correlate with your application's write patterns

10. lsof — Open Files by Process

# Which process has a specific file open?
lsof /var/log/syslog

# All files opened by a process
lsof -p "$(pgrep -d, nginx)"   # -d, joins multiple PIDs with commas, as lsof expects

# Find deleted files still holding disk space
lsof +L1

The +L1 trick is gold. Sometimes df shows 95% used but du only accounts for 60%. The difference is deleted files still held open by running processes. The fix: restart the process holding the deleted file.
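To see how much space those ghost files hold, sum lsof's size column (a sketch; SIZE/OFF is column 7 in typical lsof output, but verify the field position on your version):

```shell
# List deleted-but-open files and total the space they hold
lsof +L1 2>/dev/null | awk 'NR > 1 {
  sum += $7
  print $1, $2, $7, $NF
} END { printf "total held: %.1f MB\n", sum / 1024 / 1024 }'
```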


Network Troubleshooting

11. ss — Socket Statistics (Modern netstat)

# Active connections
ss -tunap

# Count connections by state
ss -s

# Find what's listening on a specific port
ss -tlnp | grep :8080
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     0.0.0.0:8080        0.0.0.0:*          users:(("nginx",pid=1234))
ESTAB   0       0       10.0.1.5:8080       10.0.2.3:54321     users:(("nginx",pid=1234))

Useful patterns:

  • Too many CLOSE_WAIT → your application isn't closing connections properly
  • Too many TIME_WAIT → high connection churn, consider connection pooling
  • Recv-Q growing → application can't process data fast enough
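A one-liner gets you the state counts directly (built on `ss -tan`, which lists all TCP sockets under a single header row):

```shell
# Count TCP connections by state, highest first; a quick way to
# spot a CLOSE-WAIT or TIME-WAIT pileup
ss -tan | awk 'NR > 1 { count[$1]++ }
  END { for (s in count) print count[s], s }' | sort -rn
```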

12. dig — DNS Resolution

# Basic lookup
dig api.example.com

# Short answer only
dig +short api.example.com

# Trace the full DNS resolution path
dig +trace api.example.com

# Query a specific DNS server
dig @8.8.8.8 api.example.com

When services can't communicate, DNS is the cause more often than you'd expect.

13. curl — HTTP Debugging

# Check if a service is responding
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health

# See response time breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com

# Test with specific headers
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/v1/users

The -w timing breakdown is incredibly useful. It tells you exactly where latency is coming from — DNS, TCP connection, TLS handshake, or server processing.
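If you use that timing breakdown often, curl can read the format string from a file with `-w "@file"` (the `%{time_*}` variables are documented in curl(1); the path here is just an example):

```shell
# Write a reusable timing template once...
cat > /tmp/curl-timing.txt <<'EOF'
DNS:     %{time_namelookup}s
Connect: %{time_connect}s
TLS:     %{time_appconnect}s
TTFB:    %{time_starttransfer}s
Total:   %{time_total}s
EOF

# ...then reuse it against any endpoint
curl -s -o /dev/null -w "@/tmp/curl-timing.txt" https://api.example.com
```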

14. tcpdump — Packet Capture

# Capture traffic on port 8080
tcpdump -i eth0 port 8080 -nn

# Capture and save to file for Wireshark analysis
tcpdump -i eth0 port 443 -w capture.pcap -c 1000

# Show HTTP requests
tcpdump -i eth0 -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

Last resort debugging — when logs show nothing and metrics are inconclusive, packet captures reveal the truth.


Process Troubleshooting

15. journalctl — Systemd Service Logs

# Last 100 lines of a service
journalctl -u nginx --no-pager -n 100

# Follow logs in real-time
journalctl -u nginx -f

# Logs since last boot
journalctl -u nginx -b

# Logs from last hour
journalctl -u nginx --since "1 hour ago"

# Filter by priority (errors only)
journalctl -u nginx -p err

16. strace — System Call Tracing

# Trace a running process
strace -p $(pgrep -of payment-service) -f -e trace=network   # -o picks the oldest match if several PIDs match

# Trace a command from start
strace -f -e trace=open,read,write -o /tmp/trace.log ./my-app

# Count system calls (performance overview)
strace -c -p $(pgrep -o nginx)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 65.23    0.452310          12     37692           epoll_wait
 18.41    0.127650           3     42530           write
 10.12    0.070180           2     35090           read

When you've exhausted logs and metrics, strace shows you exactly what a process is doing at the system call level. It's the ultimate debugging tool.

17. dmesg — Kernel Messages

# Recent kernel messages
dmesg -T | tail -50

# Filter for errors
dmesg -T --level=err,warn

# OOM killer events
dmesg -T | grep -i "oom\|out of memory\|killed process"

If the kernel OOM-kills your process, it won't appear in application logs. It only appears in dmesg and in /var/log/kern.log.
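To pull the victim out of the kernel log, match the kill line (the exact wording varies across kernel versions, so treat this pattern as a sketch for recent kernels):

```shell
# Extract the PID the OOM killer targeted
dmesg -T | awk '/Out of memory: Killed process/ {
  for (i = 1; i <= NF; i++)
    if ($i == "process") { print "OOM victim PID:", $(i + 1); break }
}'
```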


Quick Diagnostic One-Liners

18. System Load Summary

uptime
 14:32:01 up 45 days,  3:42,  2 users,  load average: 2.15, 1.92, 1.45

First thing I check. If load average is within normal range and the server "feels slow," the problem is elsewhere — network, database, external API.

19. Who's Logged In & What Are They Doing

w
USER     TTY      FROM             LOGIN@   IDLE   WHAT
alice    pts/0    10.0.1.100       14:20    0.00s  top
bob      pts/1    10.0.2.50        14:25    5:00   vi /etc/nginx/nginx.conf

Important during incidents — know who else is on the server and what they're changing.

20. Quick Health Check Script

#!/bin/bash
echo "=== System Health Check ==="
echo ""
echo "--- Load Average ---"
uptime
echo ""
echo "--- Memory ---"
free -h
echo ""
echo "--- Disk ---"
df -h | grep -E '^/dev/'
echo ""
echo "--- Top CPU Processes ---"
ps aux --sort=-%cpu | head -6
echo ""
echo "--- Top Memory Processes ---"
ps aux --sort=-%mem | head -6
echo ""
echo "--- Network Connections ---"
ss -s
echo ""
echo "--- Recent Errors ---"
dmesg -T --level=err,warn 2>/dev/null | tail -5
journalctl -p err --since "1 hour ago" --no-pager 2>/dev/null | tail -5

Save this as healthcheck.sh on every server. When someone says "the server is slow," run this first.


The Diagnostic Flow

When you get a "server is slow" report, follow this order:

1. uptime          → Is the server actually loaded?
2. top             → CPU? Memory? Which process?
3. free -h         → Memory pressure? Swapping?
4. df -h           → Disk full?
5. iostat -xz 1    → Disk I/O saturated?
6. ss -s           → Connection issues?
7. dmesg -T        → OOM kills? Hardware errors?
8. journalctl -p err → Service-level errors?

This takes under 2 minutes and identifies the bottleneck category (CPU, memory, disk, network) in almost every case.


These aren't obscure commands. They're the everyday toolkit that separates "I think the server is slow" from "the payment-service process is consuming 94% CPU due to a regex backtracking bug in the input validation module."

Precision beats guesswork. Every time.


What's your go-to Linux troubleshooting command? Drop it in the comments.

Follow me for more practical DevOps and SRE content.
