S, Sanjay
Linux Troubleshooting for DevOps: 20 Commands I Use Every Single Week

Every DevOps engineer eventually gets this message:

"The server is slow. Can you check?"

What follows is a systematic investigation. Not random commands — a structured approach to find out exactly what's wrong. Here are the 20 Linux commands I use every week, organized by the problem they solve.


CPU Troubleshooting

1. top — Real-Time Process Overview

top -bn1 | head -20

The first command I run. It answers three questions instantly:

  • How loaded is the CPU? (look at %Cpu(s) line)
  • Which process is eating CPU? (sort by %CPU)
  • Is the system swapping? (Swap line — if swap used is high, you have a memory problem)
top - 14:32:01 up 45 days,  3:42,  2 users,  load average: 8.52, 4.21, 2.10
Tasks: 312 total,   3 running, 309 sleeping
%Cpu(s): 78.2 us,  5.1 sy,  0.0 ni, 15.3 id,  0.0 wa,  0.0 hi,  1.4 si
MiB Mem :  16384.0 total,   1024.5 free,  12288.3 used,   3071.2 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used.   3584.2 avail Mem

Reading the load average: Three numbers = 1min, 5min, 15min averages. Compare them to your CPU count:

  • 4-core machine with load average 4.0 → 100% utilized
  • 4-core machine with load average 8.0 → overloaded, processes are queuing
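A quick way to put those numbers side by side (a small sketch: `nproc` and `/proc/loadavg` are standard on Linux, and awk does the floating-point math):

```shell
# Compare the 1-minute load average to the core count
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)

awk -v l="$load1" -v c="$cores" 'BEGIN {
  printf "load/core = %.2f\n", l / c
  if (l > c) print "WARNING: run queue exceeds core count"
}'
```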

2. mpstat — Per-CPU Breakdown

mpstat -P ALL 1 3

Shows utilization per CPU core. If one core is at 100% while others are idle, you have a single-threaded bottleneck.

CPU    %usr   %sys  %iowait  %idle
  0   95.2    3.1     0.0     1.7    ← Bottleneck on core 0
  1    2.4    1.0     0.0    96.6
  2    3.1    0.8     0.0    96.1
  3    1.8    0.5     0.0    97.7

3. pidstat — Per-Process CPU Usage Over Time

pidstat -u 1 5

Unlike top (which shows a snapshot), pidstat shows CPU usage sampled every second over 5 intervals. This catches processes that spike briefly and go idle.


Memory Troubleshooting

4. free — Memory Overview

free -h
              total    used    free   shared  buff/cache   available
Mem:           16Gi    12Gi   512Mi     64Mi        3.5Gi      3.2Gi
Swap:         4.0Gi      0B   4.0Gi

Key insight: Don't look at "free" — look at "available." Linux uses free memory for disk caching (buff/cache), which is released when applications need it. "Available" tells you how much memory is actually available for new processes.

Red flags:

  • available is less than 10% of total → memory pressure
  • Swap used is non-zero and growing → active swapping, performance will degrade
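Both red flags can be checked in one pass (a minimal sketch: without `-h`, `free` prints kilobytes, and the 10% threshold is the one from the list above):

```shell
# Alert when "available" (column 7 of the Mem: row) drops below
# 10% of total (column 2)
free | awk '/^Mem:/ {
  pct = $7 / $2 * 100
  printf "available: %.1f%% of total\n", pct
  if (pct < 10) print "ALERT: memory pressure"
}'
```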

5. ps — Top Memory Consumers

ps aux --sort=-%mem | head -15

Lists processes sorted by memory usage (highest first). Quick way to find the memory hog.

6. vmstat — Virtual Memory Statistics

vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 524288 102400 3670016   0    0    12   156  450  890 35  5 58  2
 5  3      0 262144 102400 3670016   0    0   890  2048 1200 2400 85 10  0  5

What to watch:

  • r (running): If consistently greater than CPU count → CPU bottleneck
  • b (blocked): Processes waiting for I/O → disk bottleneck
  • si/so (swap in/out): Non-zero means active swapping → memory issue
  • wa (I/O wait): >20% → disk is the bottleneck
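The `r` check is easy to automate. A hedged sketch that flags any sample where the run queue exceeds the core count (vmstat prints two header lines before the data):

```shell
# Flag vmstat samples where the run queue ("r", column 1)
# exceeds the number of CPU cores
vmstat 1 5 | awk -v cores="$(nproc)" 'NR > 2 && $1 > cores {
  print "CPU backlog: run queue", $1, "on", cores, "cores"
}'
```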

Disk Troubleshooting

7. df — Disk Space

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   92G  8.0G  92% /
/dev/sdb1       500G  234G  266G  47% /data

Critical rule: When a disk hits 100%, bad things happen — databases crash, logs stop writing, containers fail to start. Set alerts at 80% and 90%.
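That rule is easy to turn into a cron job. A sketch using GNU df's `--output` option, with thresholds matching the alerts above:

```shell
# Warn at 80% usage, alert at 90% (GNU coreutils df)
df --output=target,pcent | tail -n +2 | while read -r mount pct; do
  use=${pct%\%}                  # strip the trailing %
  if [ "$use" -ge 90 ]; then
    echo "ALERT: $mount at $pct"
  elif [ "$use" -ge 80 ]; then
    echo "WARN: $mount at $pct"
  fi
done
```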

8. du — What's Using the Space

# Top 10 largest directories under /
du -h --max-depth=1 / 2>/dev/null | sort -rh | head -10

# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null

Common culprits:

  • /var/log/ — unrotated logs
  • /tmp/ — leftover temp files
  • /var/lib/docker/ — docker images and volumes
  • Container overlay filesystems

9. iostat — Disk I/O Performance

iostat -xz 1 3
Device   r/s   w/s  rMB/s  wMB/s  rrqm/s  wrqm/s  await  %util
sda     12.0  450.0  0.05   28.12    0.00   180.0   8.52   95.3
sdb      2.0    5.0  0.01    0.02    0.00     1.0   1.20    0.8

Key columns:

  • %util: >80% means the disk is near saturation
  • await: Average time (ms) for I/O requests. >10ms on SSD or >20ms on HDD means slowdown
  • w/s: Writes per second — correlate with your application's write patterns

10. lsof — Open Files by Process

# Which process has a specific file open?
lsof /var/log/syslog

# All files opened by a process
lsof -p "$(pgrep -d, nginx)"   # -d, joins multiple PIDs with commas, as lsof expects

# Find deleted files still holding disk space
lsof +L1

The +L1 trick is gold. Sometimes df shows 95% used but du only accounts for 60%. The difference is deleted files still held open by running processes. The fix: restart the process holding the deleted file.
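To see how much space those ghost files hold, sum lsof's size column (a sketch; SIZE/OFF is column 7 in typical lsof output, but verify the field position on your version):

```shell
# List deleted-but-open files and total the space they hold
lsof +L1 2>/dev/null | awk 'NR > 1 {
  sum += $7
  print $1, $2, $7, $NF
} END { printf "total held: %.1f MB\n", sum / 1024 / 1024 }'
```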


Network Troubleshooting

11. ss — Socket Statistics (Modern netstat)

# Active connections
ss -tunap

# Count connections by state
ss -s

# Find what's listening on a specific port
ss -tlnp | grep :8080
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     0.0.0.0:8080        0.0.0.0:*          users:(("nginx",pid=1234))
ESTAB   0       0       10.0.1.5:8080       10.0.2.3:54321     users:(("nginx",pid=1234))

Useful patterns:

  • Too many CLOSE_WAIT → your application isn't closing connections properly
  • Too many TIME_WAIT → high connection churn, consider connection pooling
  • Recv-Q growing → application can't process data fast enough
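A one-liner gets you the state counts directly (built on `ss -tan`, which lists all TCP sockets under a single header row):

```shell
# Count TCP connections by state, highest first; a quick way to
# spot a CLOSE-WAIT or TIME-WAIT pileup
ss -tan | awk 'NR > 1 { count[$1]++ }
  END { for (s in count) print count[s], s }' | sort -rn
```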

12. dig — DNS Resolution

# Basic lookup
dig api.example.com

# Short answer only
dig +short api.example.com

# Trace the full DNS resolution path
dig +trace api.example.com

# Query a specific DNS server
dig @8.8.8.8 api.example.com

When services can't communicate, DNS is the cause more often than you'd expect.

13. curl — HTTP Debugging

# Check if a service is responding
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health

# See response time breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" -o /dev/null -s https://api.example.com

# Test with specific headers
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/v1/users

The -w timing breakdown is incredibly useful. It tells you exactly where latency is coming from — DNS, TCP connection, TLS handshake, or server processing.
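If you use that timing breakdown often, curl can read the format string from a file with `-w "@file"` (the `%{time_*}` variables are documented in curl(1); the path here is just an example):

```shell
# Write a reusable timing template once...
cat > /tmp/curl-timing.txt <<'EOF'
DNS:     %{time_namelookup}s
Connect: %{time_connect}s
TLS:     %{time_appconnect}s
TTFB:    %{time_starttransfer}s
Total:   %{time_total}s
EOF

# ...then reuse it against any endpoint
curl -s -o /dev/null -w "@/tmp/curl-timing.txt" https://api.example.com
```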

14. tcpdump — Packet Capture

# Capture traffic on port 8080
tcpdump -i eth0 port 8080 -nn

# Capture and save to file for Wireshark analysis
tcpdump -i eth0 port 443 -w capture.pcap -c 1000

# Show HTTP requests
tcpdump -i eth0 -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

Last resort debugging — when logs show nothing and metrics are inconclusive, packet captures reveal the truth.


Process Troubleshooting

15. journalctl — Systemd Service Logs

# Last 100 lines of a service
journalctl -u nginx --no-pager -n 100

# Follow logs in real-time
journalctl -u nginx -f

# Logs since last boot
journalctl -u nginx -b

# Logs from last hour
journalctl -u nginx --since "1 hour ago"

# Filter by priority (errors only)
journalctl -u nginx -p err

16. strace — System Call Tracing

# Trace a running process
strace -p $(pgrep -of payment-service) -f -e trace=network   # -o picks the oldest match if several PIDs match

# Trace a command from start
strace -f -e trace=open,read,write -o /tmp/trace.log ./my-app

# Count system calls (performance overview)
strace -c -p $(pgrep -o nginx)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- --------
 65.23    0.452310          12     37692           epoll_wait
 18.41    0.127650           3     42530           write
 10.12    0.070180           2     35090           read

When you've exhausted logs and metrics, strace shows you exactly what a process is doing at the system call level. It's the ultimate debugging tool.

17. dmesg — Kernel Messages

# Recent kernel messages
dmesg -T | tail -50

# Filter for errors
dmesg -T --level=err,warn

# OOM killer events
dmesg -T | grep -i "oom\|out of memory\|killed process"

If the kernel OOM-kills your process, it won't appear in application logs. It only appears in dmesg and in /var/log/kern.log.
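To pull the victim out of the kernel log, match the kill line (the exact wording varies across kernel versions, so treat this pattern as a sketch for recent kernels):

```shell
# Extract the PID the OOM killer targeted
dmesg -T | awk '/Out of memory: Killed process/ {
  for (i = 1; i <= NF; i++)
    if ($i == "process") { print "OOM victim PID:", $(i + 1); break }
}'
```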


Quick Diagnostic One-Liners

18. System Load Summary

uptime
 14:32:01 up 45 days,  3:42,  2 users,  load average: 2.15, 1.92, 1.45

First thing I check. If load average is within normal range and the server "feels slow," the problem is elsewhere — network, database, external API.

19. Who's Logged In & What Are They Doing

w
USER     TTY      FROM             LOGIN@   IDLE   WHAT
alice    pts/0    10.0.1.100       14:20    0.00s  top
bob      pts/1    10.0.2.50        14:25    5:00   vi /etc/nginx/nginx.conf

Important during incidents — know who else is on the server and what they're changing.

20. Quick Health Check Script

#!/bin/bash
echo "=== System Health Check ==="
echo ""
echo "--- Load Average ---"
uptime
echo ""
echo "--- Memory ---"
free -h
echo ""
echo "--- Disk ---"
df -h | grep -E '^/dev/'
echo ""
echo "--- Top CPU Processes ---"
ps aux --sort=-%cpu | head -6
echo ""
echo "--- Top Memory Processes ---"
ps aux --sort=-%mem | head -6
echo ""
echo "--- Network Connections ---"
ss -s
echo ""
echo "--- Recent Errors ---"
dmesg -T --level=err,warn 2>/dev/null | tail -5
journalctl -p err --since "1 hour ago" --no-pager 2>/dev/null | tail -5

Save this as healthcheck.sh on every server. When someone says "the server is slow," run this first.


The Diagnostic Flow

When you get a "server is slow" report, follow this order:

1. uptime          → Is the server actually loaded?
2. top             → CPU? Memory? Which process?
3. free -h         → Memory pressure? Swapping?
4. df -h           → Disk full?
5. iostat -xz 1    → Disk I/O saturated?
6. ss -s           → Connection issues?
7. dmesg -T        → OOM kills? Hardware errors?
8. journalctl -p err → Service-level errors?

This takes under 2 minutes and identifies the bottleneck category (CPU, memory, disk, network) in almost every case.


These aren't obscure commands. They're the everyday toolkit that separates "I think the server is slow" from "the payment-service process is consuming 94% CPU due to a regex backtracking bug in the input validation module."

Precision beats guesswork. Every time.


What's your go-to Linux troubleshooting command? Drop it in the comments.

Follow me for more practical DevOps and SRE content.
