Okay, let’s step back a bit. About two weeks ago, I was performing open-heart surgery on my production-grade Kubernetes cluster — I swapped out the storage backbone from Rook-Ceph to Longhorn.
And I'm happy to report: the patient is not only alive but running better than ever.
No theoretical deep-dive here—this is a raw, post-migration debrief from the trenches. If you've ever whispered the words "my storage is a bit... fragile," grab a coffee. This one's for you.
Part 1: The "Why Now?" Moment
Let's be real: I didn't just wake up and decide to rip out a core infrastructure piece for fun. Rook-Ceph is powerful. It’s like owning a Formula 1 car. But my needs? I was basically just doing a school run.
I needed reliable block storage for my databases, backups and queues. Instead, I got:
- "Operational Russian Roulette": A single Ceph Monitor having a bad day could trigger a debugging session that felt like defusing a bomb. So many moving parts!
- Resource Hunger Games: My Ceph OSDs would constantly brawl with `kubelet` for CPU and RAM. The result? Unpredictable node instability that made me a little twitchy.
- The "Day-2 Ops" Black Hole: I found myself needing to become a Ceph expert just to keep the lights on. That's not a strategic investment; that's a part-time job I didn't apply for.
After one too many 2 AM pages, the message was clear: My F1 car was too high-maintenance for the daily commute. I needed a reliable minivan.
Part 2: The Switch - A Storage Mindset Shift
Moving from Rook-Ceph to Longhorn isn't a simple "plug and play." It's a fundamental philosophical change.
I went from managing dedicated raw block devices to using shared filesystems. Think of it like swapping a dedicated warehouse for every tenant (Ceph) for a modern apartment building with secure, individual units (Longhorn).
Here’s where the real work happened:
Change #1: The Great "Raw Disk" Purge
Gone are the days of provisioning special `/dev/sdb` EBS volumes for Ceph to devour. Longhorn just uses a folder on your existing filesystem. I deleted so much convoluted Terraform and disk-prep code. It was cathartic.
👋 Shout-out to my Talos users! This new model is a dream for you. Use the partitioning feature to create a dedicated spot, format it, mount it at `/var/lib/longhorn`, and you're golden. All the isolation, none of the raw device voodoo.
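Something along these lines in your Talos machine config does the trick. This is a minimal sketch, assuming the spare disk shows up as `/dev/sdb` on each node (adjust the device name for your hardware):

```yaml
# Talos machine config snippet (e.g. applied as a config patch):
# carve out a dedicated partition and mount it where Longhorn expects its data.
machine:
  disks:
    - device: /dev/sdb                    # assumption: the extra disk on each node
      partitions:
        - mountpoint: /var/lib/longhorn   # Longhorn's default data path
```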
Change #2: The Prerequisite Scavenger Hunt
Longhorn runs on classic Linux tech: iSCSI and NFS. This meant I had to ensure every node had `open-iscsi` and `nfs-utils` installed and enabled. A quick update to my node bootstrap scripts (or `MachineConfig` for the Talos crew) and I was in business.
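If your nodes are plain cloud VMs, the bootstrap change is tiny. Here's a minimal cloud-init sketch, assuming Debian/Ubuntu hosts (on RHEL-family distros the NFS package is `nfs-utils` instead of `nfs-common`):

```yaml
#cloud-config
# Install the iSCSI and NFS client pieces Longhorn relies on,
# and make sure the iSCSI daemon is actually running.
packages:
  - open-iscsi
  - nfs-common
runcmd:
  - systemctl enable --now iscsid
```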
Change #3: Label Liberation
Rook-Ceph required me to manually tag nodes with labels like `storage-node=true`. Longhorn's `manager` DaemonSet just automatically discovers everything. I tore those labels off and celebrated my newfound simplicity.
Part 3: My 2-Week Migration Playbook (That Actually Worked)
So how did I do it without a multi-day outage? Carefully.
- Stage 1: The Controlled Demolition. I followed the Rook docs to decommission Ceph safely. (Let me say this louder for the people in the back: HAVE VERIFIED BACKUPS).
- Stage 2: The Node Makeover. I rolled through my nodes, wiping Ceph configs, installing the iSCSI/NFS prereqs, and setting up my dedicated Longhorn partition.
- Stage 3: The New Sheriff in Town. A simple `kubectl apply` brought Longhorn online. It was almost... anticlimactic.
- Stage 4: The Grand Reopening. I created a new Longhorn `StorageClass` (sketched below) and began methodically migrating my StatefulSets. One by one, they came online with their new, simpler storage backend.
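For reference, a Longhorn `StorageClass` along these lines is all it takes. This is a minimal sketch based on Longhorn's stock example, not my exact manifest; replica count and reclaim policy are the knobs you'll actually care about:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io   # Longhorn's CSI driver
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"           # three copies of each volume, spread across nodes
  staleReplicaTimeout: "2880"     # minutes before a failed replica is cleaned up
  fsType: "ext4"
```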
The Verdict After 2 Weeks?
I traded a feature-laden behemoth for a focused, Kubernetes-native specialist. And I have zero regrets.
✅ My on-call phone has stopped buzzing. The "mystery" storage instability is gone.
✅ Debugging is now... logical. The UI is clear, and the logs make sense.
✅ It just feels solid. Recovery is faster, and the entire system is more predictable.
Sometimes, the "best" tool isn't the right tool. For me, Longhorn was the right tool.
Ever been through a major infrastructure swap as a solo operator? Was it a glorious victory or a cautionary tale? I'd love to hear your war stories in the comments!