Ajit Kumar

Posted on Jan 13

When Cloud Storage Fails: The DevOps Playbook for EC2 Disk Crises

#aws #ec2 #memory #devops

We’ve all been there: the monitoring alerts start screaming, the API returns 500 Internal Server Error, and your SSH terminal feels like it's wading through mud. The culprit? 100% Disk Usage.

In a "system down" scenario, every second counts, but so does every character you type. This guide covers how to triage, expand, and optimize your EC2 storage without losing data or your mind.

Phase 1: The Triage (Emergency Diagnosis)

Before touching the AWS Console, you need to know exactly where the pressure is.

1. Check the "Bucket" vs. the "Hardware"

df -h: Shows how much "water" (data) is in your "buckets" (filesystems).
lsblk: Shows the actual size of the "buckets" (hardware volumes).

# Check filesystem usage
df -h /

# Check physical block devices
lsblk

⚠️ The Senior "Gotcha": If lsblk shows a 20GB disk but df -h reports only about 11GB for /, the volume was expanded but the partition and filesystem were not. If you see a device like xvdb with no MOUNTPOINT, you have a "ghost volume" that you are paying for but not using. On Nitro-based instance types the same devices show up as nvme0n1, nvme1n1, and so on.
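To put a number on that gap, here is a minimal sketch. The helper name unused_gib is made up for illustration, and /dev/xvda and /dev/xvda1 are assumed device names; use whatever lsblk prints on your instance.

```shell
# unused_gib DISK_BYTES PART_BYTES
# Prints the whole GiB on the disk that the partition does not cover yet.
unused_gib() {
  echo $(( ($1 - $2) / 1073741824 ))
}

# Assumed device names; adjust to what lsblk shows on your instance.
if [ -b /dev/xvda ] && [ -b /dev/xvda1 ]; then
  disk_bytes=$(lsblk -bdno SIZE /dev/xvda)   # whole volume, in bytes
  part_bytes=$(lsblk -bdno SIZE /dev/xvda1)  # root partition, in bytes
  echo "Not yet claimed by the partition: $(unused_gib "$disk_bytes" "$part_bytes") GiB"
fi
```

A result of 0 means the partition already covers the volume to within 1 GiB, so any remaining shortfall in df is in the filesystem itself.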

2. Find the "Space Hog"

sudo du -sh /var/log /home/* 2>/dev/null | sort -h

⚠️ Caution: Running du on a massive disk can be slow. If the system is barely breathing, start with /var/log first, as that is the most common culprit.
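One way to keep du cheap is to look only one level deep and stay on a single filesystem. The sketch below wraps that in a hypothetical helper, top_dirs; the name and the default of 5 entries are assumptions, not part of any standard tool.

```shell
# top_dirs DIR [N]
# Prints the N largest entries one level below DIR (sizes in KiB, largest last).
# -x keeps du on one filesystem; -d1 stops it from listing every subdirectory.
# The final line is always the total for DIR itself.
top_dirs() {
  du -xk -d1 "$1" 2>/dev/null | sort -n | tail -n "${2:-5}"
}

# Example (run from a root shell so unreadable directories are counted):
#   top_dirs /var/log 5
```

Start with /var/log, then widen to /var or / only if nothing stands out.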

Phase 2: Immediate "Breathing Room"

You cannot move or edit files effectively if the disk is at 100%. You need to clear a few megabytes just to make the system stable enough to fix.

1. The Journal Vacuum

sudo journalctl --vacuum-time=1d

Why? Systemd logs can grow to gigabytes. This removes archived journal files older than one day and leaves the active journal file in place.

2. Truncate, Don't Delete

⛔ DO NOT rm a log file that a service (Nginx, Gunicorn) is currently writing to. The service still holds the file open, so the OS won't free the space until the service restarts or reopens its logs, and in the meantime it keeps writing to a file you can no longer see.

# This empties the file to 0 bytes; the service's open file descriptor stays valid
sudo truncate -s 0 /var/log/syslog
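If someone has already deleted an active log, the space is still held by the process. A minimal sketch for spotting that on Linux follows; it relies on /proc, and the helper name deleted_fds is hypothetical.

```shell
# deleted_fds PID
# Lists file descriptors of PID that still point at files removed from disk.
# The kernel marks such links in /proc/PID/fd with a "(deleted)" suffix.
deleted_fds() {
  ls -l "/proc/$1/fd" 2>/dev/null | grep '(deleted)'
}

# Example (run from a root shell; assumes nginx is running and pgrep is installed):
#   deleted_fds "$(pgrep -o nginx)"
```

Restarting or reloading the service that owns the descriptor is the usual way to release the space.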

Phase 3: Expanding the Root Volume (The Big Move)

If your 12GB drive is simply too small for your application, it's time to grow.

Step 1: AWS Web Console

Navigate to EC2 > Volumes.
Select your root volume (usually attached as /dev/sda1 or /dev/xvda).
Actions > Modify Volume. Change the size (e.g., from 12GB to 20GB).

🛑 THE CRITICAL "WAIT" (Optimization Stage)

After you click "Modify," AWS begins the Optimization process.

The Warning: In the Volume list, the state will show "modifying (optimizing)".
The Action: Do not run the Linux resize commands yet. AWS says the new size can be used as soon as the modification reaches the optimizing state, but on a "stressed" t2.medium it is safer to wait until the optimizing label is gone, the volume shows a plain "in-use" again, and the console lists the new size.
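The same wait can be checked from a terminal. The sketch below assumes the AWS CLI is installed and configured; resize_ready is a hypothetical helper and vol-0123456789abcdef0 is a placeholder volume ID.

```shell
# resize_ready STATE
# Succeeds only when the EBS modification state is "completed", which matches
# the conservative advice above. AWS reports one of:
# modifying, optimizing, completed, failed.
resize_ready() {
  [ "$1" = "completed" ]
}

# Example with the AWS CLI (placeholder volume ID):
#   state=$(aws ec2 describe-volumes-modifications \
#             --volume-ids vol-0123456789abcdef0 \
#             --query 'VolumesModifications[0].ModificationState' --output text)
#   resize_ready "$state" && echo "OK to run growpart" || echo "Still $state, wait"
```

If you are comfortable following the AWS guidance instead, accepting "optimizing" as well is a one-line change to the test.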

Step 2: Grow the Partition and Filesystem

Once AWS shows the new size, tell Linux to recognize it:

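# 1. Expand the partition (note the space between xvda and 1).
#    growpart needs a little free space in /tmp, so finish Phase 2 first.
sudo growpart /dev/xvda 1

# 2. Resize the filesystem (ext4; for an XFS root use: sudo xfs_growfs /)
sudo resize2fs /dev/xvda1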

Verify: df -h / should now reflect the new 20GB capacity.
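Which resize tool applies depends on the filesystem type, and the two common tools take different arguments. A small sketch that picks the command is shown here; grow_cmd is a hypothetical helper, and it only prints the command rather than running it.

```shell
# grow_cmd FSTYPE DEVICE MOUNTPOINT
# Prints the command that grows a filesystem of the given type.
# resize2fs takes the device; xfs_growfs takes the mount point.
grow_cmd() {
  case "$1" in
    ext2|ext3|ext4) echo "resize2fs $2" ;;
    xfs)            echo "xfs_growfs $3" ;;
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Example: detect what / uses, then print (not run) the matching command.
#   fstype=$(findmnt -no FSTYPE /)
#   device=$(findmnt -no SOURCE /)
#   echo "sudo $(grow_cmd "$fstype" "$device" /)"
```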

Phase 4: Mounting Secondary Volumes (The "Ghost" 4GB)

If you have a secondary 4GB volume (xvdb) that is still attached but was not remounted after a reboot, here is how to mount it permanently.

1. Format (Only if Empty!) and Mount

# Format as XFS. This destroys existing data: 'sudo blkid /dev/xvdb' must print nothing first
sudo mkfs -t xfs /dev/xvdb

# Create a permanent mount point
sudo mkdir -p /mnt/extra_storage
sudo mount /dev/xvdb /mnt/extra_storage

2. The Permission "Gotcha"

The top-level directory of a freshly formatted volume is owned by root. Your application user (usually ubuntu) will get "Permission Denied" if it tries to write here.

sudo chown -R ubuntu:ubuntu /mnt/extra_storage

3. The "Reboot-Proof" fstab Fix

⚠️ DANGER: A typo in /etc/fstab can make your EC2 instance unbootable.

Get the UUID: sudo blkid /dev/xvdb
Open the file: sudo nano /etc/fstab
Add: UUID=your-uuid /mnt/extra_storage xfs defaults,nofail 0 2
Verification: NEVER REBOOT without running sudo mount -a. If this command returns an error, your fstab is wrong. Fix it now, or you’ll be locked out of your server.
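Typing the entry by hand is where typos creep in, so one option is to generate the line and keep a backup of the file. This is a sketch only: fstab_entry is a hypothetical helper that reproduces the options used above.

```shell
# fstab_entry UUID MOUNTPOINT FSTYPE
# Prints one /etc/fstab line with the options used in this article.
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "$3"
}

# Example (needs root; paths and device are the ones from this article):
#   uuid=$(sudo blkid -s UUID -o value /dev/xvdb)
#   sudo cp /etc/fstab /etc/fstab.bak
#   fstab_entry "$uuid" /mnt/extra_storage xfs | sudo tee -a /etc/fstab
#   sudo mount -a
```

If mount -a complains, restoring /etc/fstab.bak gets you back to the last known-good file. On recent util-linux versions, sudo findmnt --verify gives a second opinion on the syntax.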

Phase 5: Moving Data with Symbolic Links (Zero Breaking Changes)

You want to move a heavy folder (e.g., /home/ubuntu/app-data) to the new 4GB drive without changing your scripts or S3 copy commands.

# 1. Stop anything writing to the folder, then move the data (ensure the 4GB drive is mounted first!)
mv /home/ubuntu/app-data /mnt/extra_storage/

# 2. Create the Symlink (The Pointer)
ln -s /mnt/extra_storage/app-data /home/ubuntu/app-data

Why this works: Your scripts still point to /home/ubuntu/app-data, but the OS transparently sends the data to the 4GB drive.
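The move-then-link step can be wrapped with a couple of safety checks. The sketch below uses a hypothetical helper, relocate, that refuses to run twice or to overwrite an existing target.

```shell
# relocate SRC DEST_DIR
# Moves directory SRC into DEST_DIR and leaves a symlink at the old path.
# Refuses to run if SRC is already a symlink or the target name is taken.
relocate() {
  src=${1%/}                       # drop a trailing slash, if any
  dest="$2/$(basename "$src")"
  if [ -L "$src" ] || [ ! -d "$src" ]; then
    echo "relocate: $src is not a real directory" >&2; return 1
  fi
  if [ -e "$dest" ]; then
    echo "relocate: $dest already exists" >&2; return 1
  fi
  mv "$src" "$dest" && ln -s "$dest" "$src"
}

# Example with the article's paths; mountpoint -q guards against moving data
# onto the root disk when the 4GB volume is not actually mounted.
#   mountpoint -q /mnt/extra_storage && relocate /home/ubuntu/app-data /mnt/extra_storage
```

Use absolute paths for both arguments so the symlink keeps working regardless of the current directory.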

Final Senior DevOps Checklist

Step	Command	Caution / Warning
Check Space	`df -h`	`df` shows filesystems, `lsblk` shows block devices and partitions.
Clean Logs	`truncate -s 0`	Don't delete active logs; you'll break service logging.
AWS Resize	`Modify Volume`	Wait for the "Optimizing" stage to finish before Linux commands.
Expand Partition	`growpart`	Ensure there is a space between the device and partition number.
Persistent Mount	`nano /etc/fstab`	Crucial: Always add `nofail` so the EC2 boots even if the disk fails.
Verify fstab	`sudo mount -a`	If this fails, DO NOT REBOOT. Fix the file first.
Permissions	`chown -R`	Your app user needs ownership of the new mount point.

Summary

By expanding your root drive and correctly mounting secondary volumes with symbolic links, you can scale your EC2 storage without refactoring a single line of your deployment scripts.

DEV Community