DEV Community

Kenneth Atta Oppong
How I Shrunk an AWS EBS Volume (And Survived the GRUB Panic to Tell the Tale)

A real-world war story about migrating a Jenkins server from a 650GB EBS volume to 200GB — on Ubuntu 24.04 Noble Numbat — with every curveball included.


The Problem Nobody Warns You About

There's a dirty little secret about AWS EBS volumes that you only discover when your disk usage report shows 134GB used on a 650GB volume: you can't shrink them.

Increase? Absolutely. Shrink? AWS will laugh at you. Whether you use the Console or the CLI, the API-level restriction is the same — you cannot create a volume smaller than its source snapshot.

So when our Jenkins server was sitting at 22% disk usage on a bloated 650GB volume — over 400GB completely wasted — I knew we needed to do this the hard way.

Here's exactly how I did it, including the parts that went wrong.


The Setup

  • Instance: t3.large (Ubuntu 24.04 Noble Numbat)
  • Current volume: 650GB gp3, 22% used (~134GB actual data)
  • Target volume: 200GB (plenty of headroom)
  • Boot setup: GPT + UEFI with 4 partitions

Before touching anything, I confirmed the actual disk layout:

df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        629G  134G  496G  22% /
/dev/nvme0n1p16  881M  159M  660M  20% /boot
/dev/nvme0n1p15  105M  6.2M   99M   6% /boot/efi
lsblk
nvme0n1      259:0    0  650G  0 disk
├─nvme0n1p1  259:1    0  649G  0 part /
├─nvme0n1p14 259:2    0    4M  0 part
├─nvme0n1p15 259:3    0  106M  0 part /boot/efi
└─nvme0n1p16 259:4    0  913M  0 part /boot

Four partitions. GPT. UEFI. This wasn't going to be a simple "rsync the root filesystem" job.


Why the Snapshot Route Fails

My first instinct was: create a snapshot → spin up a smaller volume from it. Simple, right?

Wrong. AWS enforces a hard rule: a volume created from a snapshot must be equal to or larger than the original snapshot's size. Since the snapshot captures the full 650GB volume, you can't use it to create a 200GB volume. The Console won't let you, and the CLI will throw an error.

The only real path forward is to create a brand new empty volume and copy the data manually.
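For completeness, creating and attaching that empty volume from the CLI looks roughly like this. The AZ and instance ID are placeholders, and the whole thing is guarded so it's a no-op on a machine without a configured AWS CLI:

```shell
TARGET_SIZE_GB=200        # the new, smaller size
AZ=us-east-1a             # placeholder: must match the instance's AZ
INSTANCE_ID=i-EXAMPLE     # placeholder instance ID

if command -v aws >/dev/null 2>&1; then
  # Create a brand new empty gp3 volume -- no snapshot involved
  NEW_VOL=$(aws ec2 create-volume \
      --availability-zone "$AZ" \
      --size "$TARGET_SIZE_GB" \
      --volume-type gp3 \
      --query VolumeId --output text)

  # Attach it as a secondary device; on a Nitro instance it will
  # enumerate as an NVMe device (nvme1n1 in this walkthrough)
  aws ec2 attach-volume \
      --volume-id "$NEW_VOL" \
      --instance-id "$INSTANCE_ID" \
      --device /dev/sdf
fi
```

The key point: no --snapshot-id anywhere, so the size restriction never applies.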


The Migration Strategy

The correct approach:

  1. Attach a new 200GB volume to the running instance
  2. Recreate the partition table manually
  3. Format all partitions
  4. Copy all data with rsync
  5. Reinstall GRUB
  6. Swap the volumes

Simple in theory. Let's talk about what actually happened.


Step 1 — Scale Up First (Smart Move)

Before anything else, I scaled the instance from t3.large to a c7a.2xlarge (8 vCPUs, 16GB RAM). The copy operation is I/O bound, not CPU bound, but the bigger instance gives you better network and EBS bandwidth headroom.

More importantly, I boosted the EBS throughput on both volumes — because I initially only boosted the new one and forgot the old disk is the one being read from:

# Boost old volume (source reads)
aws ec2 modify-volume --volume-id vol-OLD --iops 16000 --throughput 1000

# Boost new volume (destination writes)
aws ec2 modify-volume --volume-id vol-NEW --iops 16000 --throughput 1000

The default gp3 throughput cap is 125 MB/s. At 1000 MB/s, the rsync finished dramatically faster. AWS applies this live — no restart needed — though you'll see it say "optimizing (0%)" for a few minutes first.
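You can poll that modification state from the CLI too (volume ID is a placeholder, guarded as above). In my experience the new throughput is usable while the state still reads optimizing, so there's no need to wait for completed before starting the copy:

```shell
VOLUME_ID=vol-NEW   # placeholder volume ID

if command -v aws >/dev/null 2>&1; then
  # Shows e.g. "optimizing  0" right after the change, then "completed  100"
  aws ec2 describe-volumes-modifications \
      --volume-ids "$VOLUME_ID" \
      --query 'VolumesModifications[0].[ModificationState,Progress]' \
      --output text
fi
```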


Step 2 — Attach the New 200GB Volume

After creating and attaching the new volume, it showed up as nvme1n1:

lsblk /dev/nvme1n1
# NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
# nvme1n1 259:5    0  200G  0 disk

Step 3 — Recreate the Partition Table

This is where things get interesting. My first attempt was the obvious one:

sudo sgdisk -R /dev/nvme1n1 /dev/nvme0n1
Problem: partition 1 is too big for the disk.
Aborting write operation!

Of course. The partition table from a 650GB disk can't be cloned onto a 200GB disk because partition 1 is 649GB. We need to recreate the table manually.

First, I grabbed the exact sector positions of the small partitions from the old disk:

sudo sgdisk -p /dev/nvme0n1
Number  Start (sector)    End (sector)  Size       Code  Name
   1         2099200      1363148766   649.0 GiB   8300
  14            2048           10239   4.0 MiB     EF02
  15           10240          227327   106.0 MiB   EF00
  16          227328         2097152   913.0 MiB   EA00

Then wiped the new disk and recreated each partition using those exact sector positions, letting p1 fill the remaining space:

sudo sgdisk --zap-all /dev/nvme1n1

# p14 - BIOS boot (exact same sectors)
sudo sgdisk -n 14:2048:10239 -t 14:EF02 /dev/nvme1n1

# p15 - EFI (exact same sectors)
sudo sgdisk -n 15:10240:227327 -t 15:EF00 /dev/nvme1n1

# p16 - Boot (exact same sectors)
sudo sgdisk -n 16:227328:2097152 -t 16:EA00 /dev/nvme1n1

# p1 - Root — fill all remaining space on the new disk
sudo sgdisk -n 1:2099200:0 -t 1:8300 /dev/nvme1n1

Result: a clean 200GB disk with all 4 partitions correctly laid out, p1 using the remaining ~199GB.
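You can sanity-check that "~199GB" with a few lines of shell arithmetic. The assumptions: 512-byte sectors, p1 starting at the same sector 2099200 as on the old disk, and GPT reserving 33 sectors at the end of the disk (so the last usable sector is total minus 34). sgdisk may round the end down slightly for alignment, so treat this as approximate:

```shell
# 200 GiB disk expressed in 512-byte sectors
DISK_SECTORS=$((200 * 1024 * 1024 * 1024 / 512))   # 419430400
P1_START=2099200                                   # same start as the old disk
P1_END=$((DISK_SECTORS - 34))                      # last usable sector before the GPT backup
P1_GIB=$(( (P1_END - P1_START + 1) * 512 / 1024 / 1024 / 1024 ))
echo "$P1_GIB"   # prints: 198
```

So p1 comes out just shy of 199 GiB, which is exactly what lsblk reports after the recreate.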


Step 4 — Format the Partitions

sudo mkfs.ext4 /dev/nvme1n1p1    # Root
sudo mkfs.ext4 /dev/nvme1n1p16   # Boot
sudo mkfs.fat -F32 /dev/nvme1n1p15  # EFI
# p14 is BIOS boot — no filesystem needed

A warning appeared on the EFI partition (fatlabel: differences between boot sector and its backup). It's harmless, just a small inconsistency in the FAT metadata.


Step 5 — Set Filesystem Labels

Here's something that saved a lot of headache: the original fstab used labels instead of UUIDs:

LABEL=cloudimg-rootfs   /        ext4   discard,commit=30,errors=remount-ro   0 1
LABEL=BOOT              /boot    ext4   defaults                               0 2
LABEL=UEFI              /boot/efi vfat  umask=0077                            0 1

This meant I didn't need to edit fstab at all — I just needed to set the same labels on the new partitions:

sudo e2label /dev/nvme1n1p1 cloudimg-rootfs
sudo e2label /dev/nvme1n1p16 BOOT
sudo fatlabel /dev/nvme1n1p15 UEFI
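It's worth confirming the labels took before you copy anything. A quick check, using the device names from this walkthrough (guarded so it's a no-op on a machine without the disk):

```shell
if [ -b /dev/nvme1n1p1 ]; then
  sudo blkid -s LABEL -o value /dev/nvme1n1p1    # expect: cloudimg-rootfs
  sudo blkid -s LABEL -o value /dev/nvme1n1p16   # expect: BOOT
  sudo blkid -s LABEL -o value /dev/nvme1n1p15   # expect: UEFI
fi
```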

Step 6 — Mount and rsync

sudo mkdir /mnt/newvol
sudo mount /dev/nvme1n1p1 /mnt/newvol

# Order matters: mount boot first, then create efi *inside* the mounted
# boot partition (an efi dir made before mounting p16 would be hidden by it)
sudo mkdir -p /mnt/newvol/boot
sudo mount /dev/nvme1n1p16 /mnt/newvol/boot
sudo mkdir -p /mnt/newvol/boot/efi
sudo mount /dev/nvme1n1p15 /mnt/newvol/boot/efi

Then the big copy:

# Root filesystem
sudo rsync -axHAWXS --numeric-ids --info=progress2 / /mnt/newvol/

# Boot and EFI
sudo rsync -axHAWXS --numeric-ids --info=progress2 /boot/ /mnt/newvol/boot/
sudo rsync -axHAWXS --numeric-ids --info=progress2 /boot/efi/ /mnt/newvol/boot/efi/

The -x flag is critical — it keeps rsync from crossing filesystem boundaries, so it only copies what's on the current mount.

With both volumes boosted to 1000 MB/s throughput, 134GB of data moved comfortably.
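One subtlety in those rsync commands is the trailing slash: src/ copies the directory's contents into the destination, while src copies the directory itself. Get it wrong and you end up with /mnt/newvol/boot/boot. A tiny self-contained demo of the difference, using temp dirs only (skipped if rsync isn't installed):

```shell
src=$(mktemp -d); dst1=$(mktemp -d); dst2=$(mktemp -d)
mkdir "$src/boot"
echo kernel > "$src/boot/vmlinuz"

if command -v rsync >/dev/null 2>&1; then
  # Trailing slash: copy the *contents* of boot/ into the destination
  rsync -a "$src/boot/" "$dst1/"    # result: $dst1/vmlinuz
  # No trailing slash: copy the directory itself
  rsync -a "$src/boot" "$dst2/"     # result: $dst2/boot/vmlinuz
fi
```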


Step 7 — Reinstall GRUB

sudo mount --bind /dev /mnt/newvol/dev
sudo mount --bind /proc /mnt/newvol/proc
sudo mount --bind /sys /mnt/newvol/sys
sudo mount --bind /sys/firmware/efi/efivars /mnt/newvol/sys/firmware/efi/efivars

sudo chroot /mnt/newvol

grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
update-grub
exit

GRUB installed cleanly. update-grub found both kernel versions. Everything looked good.


Step 8 — The Volume Swap

Stopped the instance, detached the old 650GB volume, attached the new 200GB volume as /dev/xvda, reverted the instance type back to t3.large, and started it up.

And then...


The Kernel Panic

VFS: Cannot open root device "PARTUUID=ede263c9-faae-4d70-b586-306f8f26c5d3"
Kernel panic - not syncing: VFS: Unable to mount root fs

The kernel was looking for a PARTUUID that belonged to the old volume's partition. Even though we'd reinstalled GRUB and run update-grub, the kernel command line still had the old PARTUUID hardcoded in it.

The culprit was a file I hadn't noticed:

/etc/default/grub.d/40-force-partuuid.cfg

This Ubuntu/AWS-specific file explicitly overrides the root device with a hardcoded PARTUUID:

GRUB_FORCE_PARTUUID=ede263c9-faae-4d70-b586-306f8f26c5d3

This file bypasses the label-based fstab entirely and tells GRUB to pass a specific PARTUUID as the root= kernel parameter. Our update-grub had correctly regenerated grub.cfg using the new PARTUUID — but then GRUB's initrdless boot path picked up this override file and used the old PARTUUID anyway.

The instance dropped into the initramfs emergency shell.


The initramfs Recovery

From the BusyBox shell, the fix was straightforward once we knew what to look for:

mkdir /mnt
mount /dev/nvme0n1p1 /mnt

cat /mnt/etc/default/grub.d/40-force-partuuid.cfg
# GRUB_FORCE_PARTUUID=ede263c9-faae-4d70-b586-306f8f26c5d3  ← old PARTUUID

# Update it with the new partition's PARTUUID
echo 'GRUB_FORCE_PARTUUID=651df20d-498a-4d06-aeaf-1ddc1fc22872' \
  > /mnt/etc/default/grub.d/40-force-partuuid.cfg

Then chroot in and regenerate GRUB:

mount /dev/nvme0n1p16 /mnt/boot
mount /dev/nvme0n1p15 /mnt/boot/efi
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
mount --bind /sys/firmware/efi/efivars /mnt/sys/firmware/efi/efivars

chroot /mnt

update-grub
grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
exit

reboot

On the next boot, it came up clean.


The Lesson About GRUB_FORCE_PARTUUID

This is the part that will save you hours if you're doing this on an AWS Ubuntu instance.

Ubuntu's cloud images ship with 40-force-partuuid.cfg specifically to enable initrdless booting — a performance optimisation where the kernel boots directly from the partition without needing an initrd image. The tradeoff is that the PARTUUID is hardcoded. When you move to a new partition (even with the same label), this file must be updated.

The fix is to grab the new partition's PARTUUID before the swap, while the new disk is still attached as the secondary device:

sudo blkid /dev/nvme1n1p1
# ... PARTUUID="651df20d-498a-4d06-aeaf-1ddc1fc22872" ...

And update the file before swapping the volumes (lesson learned the hard way). Do this from inside the chroot on the new volume (Step 7), so you're editing the new root's copy of the file and regenerating its grub.cfg:

echo 'GRUB_FORCE_PARTUUID=651df20d-498a-4d06-aeaf-1ddc1fc22872' \
  > /etc/default/grub.d/40-force-partuuid.cfg

update-grub
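If you'd rather not copy the PARTUUID by hand, the capture-and-write can be done in one step from the host, before chrooting. The mount point and device name are the ones from this walkthrough; adjust to yours. Guarded so it's a no-op where the device doesn't exist:

```shell
if [ -b /dev/nvme1n1p1 ]; then
  # Read the new root partition's PARTUUID directly
  NEW_PARTUUID=$(sudo blkid -s PARTUUID -o value /dev/nvme1n1p1)

  # Write the override onto the *new* volume's filesystem
  echo "GRUB_FORCE_PARTUUID=$NEW_PARTUUID" \
    | sudo tee /mnt/newvol/etc/default/grub.d/40-force-partuuid.cfg
fi
```

Then run update-grub inside the chroot as in Step 7 so the change lands in the new volume's grub.cfg.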

Final Verification

After boot:

df -h
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        196G  134G   55G  71% /
/dev/nvme0n1p16  881M  159M  660M  20% /boot
/dev/nvme0n1p15  105M  6.2M   99M   6% /boot/efi

200GB. Jenkins running. All jobs intact.


Cleanup

Once you've confirmed everything is working:

  1. Delete the old 650GB volume — it's just costing money now
  2. Revert EBS throughput on the new volume back to 3000 IOPS / 125 MB/s (unless you actually need the boost)
  3. Tag the new volume clearly so future-you knows what it is
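All three cleanup steps can be done from the CLI. The volume IDs and tag value below are placeholders, and the block is guarded so it does nothing without a configured AWS CLI:

```shell
OLD_VOL=vol-OLD   # placeholder: the 650GB volume (must be detached)
NEW_VOL=vol-NEW   # placeholder: the new 200GB volume

if command -v aws >/dev/null 2>&1; then
  # 1. Delete the old 650GB volume
  aws ec2 delete-volume --volume-id "$OLD_VOL"

  # 2. Drop the new volume back to gp3 baseline performance
  aws ec2 modify-volume --volume-id "$NEW_VOL" --iops 3000 --throughput 125

  # 3. Tag it so future-you knows what it is
  aws ec2 create-tags --resources "$NEW_VOL" \
      --tags Key=Name,Value=jenkins-root-200gb
fi
```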

The Full Checklist (Do This Right the First Time)

For anyone attempting this on an AWS Ubuntu Noble instance, here's the complete ordered checklist:

  • [ ] Confirm actual used space with df -h
  • [ ] Scale up instance type for faster copy (optional but worth it)
  • [ ] Create and attach new volume in the same AZ
  • [ ] Boost EBS throughput on both volumes
  • [ ] Wipe new disk: sgdisk --zap-all
  • [ ] Recreate partitions manually using exact sector positions from old disk
  • [ ] Format all partitions (ext4 for root/boot, FAT32 for EFI, skip BIOS boot)
  • [ ] Set filesystem labels to match old disk
  • [ ] Mount all partitions in correct order
  • [ ] rsync root, boot, and EFI separately
  • [ ] Update /etc/default/grub.d/40-force-partuuid.cfg with the new PARTUUID ← don't skip this
  • [ ] Chroot, run update-grub and grub-install
  • [ ] Unmount cleanly
  • [ ] Stop instance, swap volumes, start instance
  • [ ] Verify disk size, mounts, and services
  • [ ] Delete old volume
  • [ ] Revert EBS throughput to baseline

TL;DR

AWS won't let you shrink an EBS volume through snapshots — the new volume must always be ≥ the snapshot size. The only way to genuinely reduce volume size is to copy the data to a new, smaller volume. On Ubuntu Noble with AWS cloud images, the key gotcha is 40-force-partuuid.cfg — a file that hardcodes the root partition's PARTUUID into the GRUB boot config for initrdless booting. Forget to update it, and you'll meet the initramfs panic shell. Update it before you swap volumes, and everything works cleanly.

It's a bit of an adventure, but 450GB of recovered disk space is absolutely worth it.
