Anderson Leite
Your production server is dead. Hard reboot, what caused the issue?

Last week, one of our production servers became completely unresponsive: no SSH, no SSM, no ping. The application was down, the database was spiking, and the monitoring had been screaming for minutes before anyone noticed.

We had to force-reboot to recover. But then came the hard part: figuring out what happened on a machine where all the pre-crash state was gone.

This is the story of how basic GNU/Linux tools (the kind most cloud engineers never bother learning these days since "we can check grafana") gave us the complete picture when our fancy observability stack had nothing.

The scene

Here's what we knew after the reboot:

  • The EC2 instance (48 CPUs, 92GB RAM, Ubuntu 24.04) had been completely unreachable for ~18 minutes until the team decided to hard-reboot it
  • Both RDS primary and replica showed a CPU spike during the same window
  • Application logs in our centralized logging (Loki) showed nothing unusual
  • Sentry had captured DNS resolution failures to the database
  • The ops team couldn't open an SSH or SSM session to it, or even ping the server

The monitoring dashboards showed us that something happened. But not what or why.

Step 1: Was it AWS's fault? (CloudWatch StatusCheckFailed metrics)

The first instinct in cloud is to blame the cloud. Fair enough — hardware fails, hypervisors crash, networks partition.

```
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-XXXXXXXXX \
  --start-time 2026-03-26T13:00:00Z \
  --end-time 2026-03-26T14:00:00Z \
  --period 60 --statistics Maximum
```

Both StatusCheckFailed_System (hypervisor/host) and StatusCheckFailed_Instance (OS-level) were 0.0 across every minute. AWS thought the instance was perfectly healthy the entire time.

This is actually the trickiest result: it means the problem was inside the OS, but subtle enough that AWS's health checks (which are fairly basic) didn't catch it.

Lesson: Don't stop at "AWS says it's fine." That just narrows the scope; it doesn't answer the question.

Step 2: The serial console is your black box recorder (get-console-output)

After a plane crash, investigators look for the black box. After a server crash, the EC2 serial console serves a similar purpose:

```
aws ec2 get-console-output --instance-id i-XXXXXXXXX --latest
```

Bad news: the output showed a clean boot sequence from the reboot. The serial console buffer had been overwritten: no kernel panic, no OOM killer messages. The pre-crash evidence was gone (my bad on this, I should have checked it before the reboot!)

This happens often after hard reboots. The lesson here is that if you need to preserve the console output, grab it before rebooting if possible. In our case, the server was completely unreachable, so we had no choice.

Step 3: The hero nobody talks about: sar

This is where most cloud-native engineers would be stuck. The server had been rebooted, the Docker logs were gone, the console was overwritten, and the application was up and running again. What's left?

sar (System Activity Reporter), part of the sysstat package. It silently collects system metrics every 10 minutes and writes them to /var/log/sysstat/. Unlike in-memory metrics, these files survive reboots.
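Those daily data files are named after the day of the month (sa01 through sa31), which is easy to fumble mid-incident. A tiny helper for building the right path (the function name is mine; assumes GNU date and the default Debian/Ubuntu sysstat directory):

```shell
# sysstat writes one binary file per day of month: sa01 .. sa31.
# This builds the path for a given date so you query the right file.
sa_file() {
  echo "/var/log/sysstat/sa$(date -d "$1" +%d)"
}

sa_file 2026-03-26
# → /var/log/sysstat/sa26
```

You can then plug it straight into sar, e.g. `sar -r -f "$(sa_file 2026-03-26)"`.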

```
sar -A -f /var/log/sysstat/sa26 -s 14:10:00 -e 14:35:00
```

This single command revealed everything:

| Metric | Baseline | During incident |
| --- | --- | --- |
| CPU iowait | 0.04% | 71% |
| Memory used | 20GB (20%) | 82GB (85%) |
| Page cache | 20GB | 617MB |
| NVMe queue depth | 0.21 | 95 |
| NVMe await | 1.1ms | 53ms |
| Load average | 6 | 1,075 |
| Blocked processes | 0 | 38 |

The system was in a memory/IO thrashing death spiral. Something consumed ~60GB of RAM, forcing the kernel to evict all page cache, which turned every disk read into a physical I/O operation, which saturated the NVMe drive, which made every process on the system wait for disk.
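On modern kernels (4.20+), this kind of spiral shows up early in the pressure-stall information (PSI) files under /proc/pressure/. A minimal sketch of extracting the avg10 value from a PSI line — the sample string below stands in for a real read of /proc/pressure/memory, and the helper name is mine:

```shell
# A PSI line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
# avg10 is the share of the last 10 seconds that tasks stalled on memory.
psi_avg10() {
  echo "$1" | grep -o 'avg10=[0-9.]*' | cut -d= -f2
}

# Sample line; on a live box you would use: grep ^some /proc/pressure/memory
line="some avg10=12.34 avg60=5.00 avg300=1.20 total=123456"
psi_avg10 "$line"
# → 12.34
```

An avg10 in double digits on memory pressure is exactly the "death spiral forming" signal that plain CPU and RAM graphs miss.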

But here's the gotcha that cost us 30 minutes: The server's clock was set to UTC+1, while our Slack timestamps and monitoring were in UTC. We initially ran sar for the wrong time window and saw perfectly normal metrics. Always check date on the machine and adjust accordingly.
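GNU date can do that conversion for you. A quick sketch, assuming the server sits at UTC+1 (note that the POSIX Etc/GMT zone names invert the sign):

```shell
# Convert a UTC incident timestamp into the server's local time.
# POSIX inverts the sign: a UTC+1 box is the zone "Etc/GMT-1".
TZ="Etc/GMT-1" date -d "2026-03-26 14:14 UTC" "+%Y-%m-%d %H:%M"
# → 2026-03-26 15:14
```

So a 14:14 UTC alert means querying sar for 15:14 on that machine.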

Step 4: Checking what the kernel saw (journalctl, dmesg, syslog)

```
dmesg | grep -i oom
# (empty)

journalctl --since "2026-03-26 14:10" --until "2026-03-26 14:35" -k | grep -i -E "oom|kill|memory"
# Only post-reboot boot messages
```

No OOM killer was ever invoked. The system became unresponsive before the kernel could trigger it. This is an important finding: it means there was no safety net. (Spoiler: the system had zero swap configured, so there was no buffer between "memory pressure" and "completely dead.")

But syslog had a clue:

```
grep -i "no buffer space" /var/log/syslog*
```

```
2026-03-26T14:14:53 scanner-agent: "netlink receive: recvmsg: no buffer space available"
2026-03-26T14:15:39 scanner-agent: "netlink receive: recvmsg: no buffer space available"
2026-03-26T14:15:44 scanner-agent: "netlink receive: recvmsg: no buffer space available"
```

Socket buffers were exhausted at 14:14, the very start of the incident. This explained the DNS resolution failures our application was reporting to Sentry. The system literally couldn't make new network connections.

Step 5: Who ate all the memory? (docker stats, docker inspect)

```
docker stats --no-stream
docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' $(docker ps -q)
```

Every single container showed Memory: 0, meaning unlimited. No container had memory limits set. Any process could consume the entire 92GB of host RAM without Docker lifting a finger.
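The fix is a line or two per container. A minimal Compose sketch (service and image names are placeholders, not from the incident):

```yaml
# docker-compose.yml fragment: cap each container's memory
services:
  app:
    image: my-app:latest
    mem_limit: 4g       # hard cap enforced by the kernel memory cgroup
    memswap_limit: 4g   # equal to mem_limit: no extra swap beyond the RAM cap
```

With a cap in place, a runaway container gets OOM-killed by itself instead of taking the whole host down with it.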

Step 6: Finding the trigger (systemctl, journalctl)

At this point we knew what happened (memory exhaustion → IO thrashing → system death), but not what caused it. We needed to dig deeper:

```
journalctl -u scanner-agent-scanner.service \
  --since "2026-03-26 14:00" --until "2026-03-26 14:33" --no-pager
```

There it was. A third-party security scanning agent had been running a full filesystem scan of /, including 3.7TB of image files and 47GB of Docker overlay layers, continuously for nearly 5 days. When systemd stopped it during the reboot, it reported:

```
Consumed 2d 18h 18min 42.045s CPU time, 3.9G memory peak
```

The scanner respected its own 4GB memory limit, but its sustained disk I/O (~101 MB/s of reads) was flooding the kernel page cache. When the application's hourly batch of ~30 cron jobs fired at the top of the hour and needed memory, the kernel had to aggressively reclaim pages and the system entered a thrashing spiral it couldn't escape.
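Since the scanner runs as a systemd unit, its I/O can be throttled without touching the vendor's own config. A hedged drop-in sketch (the unit name is from our journalctl output above; the limit values are illustrative and need cgroup v2):

```ini
# /etc/systemd/system/scanner-agent-scanner.service.d/limits.conf
[Service]
MemoryMax=4G                # hard memory cap for the whole unit
IOReadBandwidthMax=/ 20M    # throttle reads on the root fs so a scan
                            # can no longer flood the page cache
CPUQuota=200%               # at most 2 of the 48 cores
```

After `systemctl daemon-reload && systemctl restart scanner-agent-scanner`, the scan just takes longer instead of starving everything else.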

Step 7: Was this a one-time thing? (historical sar)

```
for day in 20 21 22 23 24 25; do
  echo "=== March $day ==="
  sar -r -f /var/log/sysstat/sa$day -s 14:50:00 -e 15:20:00 2>/dev/null
done
```

Memory was elevated around the same time on every previous day; the scanner was always running. But it only crashed on the 26th, because that's when the scanner's I/O peak happened to coincide perfectly with the application's cron storm.

The tools that saved us

Here's the thing: None of our cloud-native observability tools helped with the root cause:

  • Grafana showed the outage happened
  • Sentry showed DNS failures
  • Uptime Kuma showed HTTP 504s

But the why came entirely from:

| Tool | What it told us |
| --- | --- |
| sar | Complete system metrics from before the crash, surviving the reboot |
| journalctl | Which systemd service was responsible, how long it ran, resource consumption |
| syslog / grep | Socket buffer exhaustion timeline |
| dmesg | Confirmed no OOM killer fired |
| docker stats / inspect | No memory limits on any container |
| systemctl cat | The scanner's configuration and exclusion gaps |
| df -h | The full scope of what the scanner was trying to scan |
| crontab -l | The cron storm pattern |

These are not exotic tools. They come pre-installed on every GNU/Linux server. And yet, I've interviewed dozens of SRE and platform engineering candidates who couldn't tell you what sar does or how to read journalctl output.

What we got wrong (and what we fixed)

  1. Zero swap space — There was no buffer between "memory pressure" and "kernel can't function." We added 8GB swap with swappiness=10 as a safety net.

  2. No container memory limits — All 13 containers were running unlimited. We're adding explicit limits to every one.

  3. No thrashing alerts — We had uptime monitoring but no alerts on iowait, memory pressure (PSI metrics), or load average. By the time Uptime Kuma noticed, the server was already dead.

  4. 30 cron jobs in 10 minutes — The application fired ~30 PHP processes in the first 10 minutes of every hour. We're staggering them across the full 60-minute window.

  5. Third-party agent running unaudited — A security scanner was doing a full filesystem scan of 4.3TB with no exclusions. Nobody had reviewed its configuration since installation.
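For reference, fix number 1 sketched end to end — sizes are the ones from the post; run as root and sanity-check against your distro's docs before copying:

```shell
# Create an 8GB swap file and make it persistent.
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# Prefer reclaiming page cache over swapping out application memory.
echo 'vm.swappiness=10' > /etc/sysctl.d/99-swappiness.conf
sysctl --system
```

Swap won't save a box from a true memory leak, but it buys the OOM killer time to actually fire instead of the system freezing first.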

The takeaway

Cloud abstractions are wonderful until they aren't. When your EC2 instance is unresponsive and your Kubernetes dashboard is useless because the node is dead, you're back to basics: sar, journalctl, dmesg, syslog, free, df, ps.

These tools have been around for decades. They're boring. They don't have nice UIs. But when everything else fails, they're what you have.

If you're an SRE, platform engineer, or DevOps engineer and you can't use these tools fluently, you have a gap in your skillset that will bite you during the worst possible moment — a production incident where the clock is ticking and your observability platform has nothing for you.

Go install sysstat on your servers today. Future-you during an incident will thank you.


The investigation took about 2 hours from start to confirmed root cause. Without sar, we might never have found it.
