War Story: File System Corruption Caused a 2-Day Outage at Our Data Center
It was 3:12 AM on a Tuesday when our first PagerDuty alert fired. The on-call engineer, bleary-eyed, logged into our primary data center’s monitoring dashboard and froze: 40% of our production storage nodes were reporting read-only file systems, and customer-facing APIs were returning 503 errors at a rate of 12% and climbing.
The Initial Outage
Our data center runs a 3-tier storage architecture: 12 Ceph storage nodes backing a fleet of 48 application servers, all connected via 100GbE spine-leaf switching. The cluster of alerts indicated that 5 of the 12 Ceph OSD (Object Storage Daemon) nodes had remounted their storage volumes read-only within a 90-second window. Initial triage assumed a network partition, but BGP and link status checks showed all switch ports were up, and latency between nodes was <1ms.
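For anyone who hasn't seen this failure mode before, the snippet below is a minimal sketch of how you can spot it from the host side: it parses /proc/mounts and flags ext4 volumes that the kernel has remounted read-only. The /var/lib/ceph prefix is an assumption about where OSD data volumes live, not our actual layout, and this isn't our monitoring agent; it just illustrates the check.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag ext4 mounts that the kernel has remounted read-only.

Assumption (illustrative only): OSD data volumes are ext4 file systems
mounted somewhere under /var/lib/ceph.
"""

def read_only_ext4_mounts(mounts_file="/proc/mounts", prefix="/var/lib/ceph"):
    """Return (device, mountpoint) pairs for ext4 mounts carrying the 'ro' flag."""
    flagged = []
    with open(mounts_file) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 4:
                continue
            device, mountpoint, fstype, options = fields[:4]
            if fstype != "ext4" or not mountpoint.startswith(prefix):
                continue
            # /proc/mounts options are comma-separated; 'ro' means read-only.
            if "ro" in options.split(","):
                flagged.append((device, mountpoint))
    return flagged

if __name__ == "__main__":
    for device, mountpoint in read_only_ext4_mounts():
        print(f"READ-ONLY: {device} mounted at {mountpoint}")
```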
We failed over to our secondary data center 20 minutes later, but the damage was done: 18 minutes of partial outage had already impacted 14,000 active users, and our secondary site was running at 80% capacity, with no room for the primary site’s workload if the outage dragged on.
Root Cause Discovery
By 6:00 AM, our engineering team had gathered for a war room call. We pulled kernel logs from the affected Ceph nodes first. Every node that had gone read-only showed the same error: EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal. Further digging revealed that the journal for each affected node’s ext4 file system had been corrupted, triggering the kernel to remount the file system as read-only to prevent further data loss.
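If you want to catch that signature before your storage layer does, here's a rough sketch of the grep we effectively ran, wrapped in Python. It shells out to dmesg (which usually needs root), and the exact message text varies by kernel version, so treat the pattern as an approximation rather than production alerting code.

```python
#!/usr/bin/env python3
"""Sketch: scan kernel messages for ext4 aborted-journal errors.

An illustration of the signature we grepped for, not our production alerting;
the exact message text can vary across kernel versions.
"""
import re
import subprocess

# Matches lines such as:
#   EXT4-fs error (device sda3): ext4_journal_check_start:56: Detected aborted journal
PATTERN = re.compile(r"EXT4-fs error \(device (\S+)\):.*[Aa]borted journal")

def aborted_journal_devices():
    """Return the set of block devices with an aborted ext4 journal in dmesg."""
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return {m.group(1) for m in PATTERN.finditer(dmesg.stdout)}

if __name__ == "__main__":
    devices = aborted_journal_devices()
    if devices:
        print("Aborted ext4 journal detected on:", ", ".join(sorted(devices)))
    else:
        print("No aborted-journal errors found in dmesg.")
```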
But why? We cross-referenced the journal corruption timestamps with our change management logs. The only change in the 24 hours prior was a firmware update to the 12 SAS expanders in our storage chassis, pushed via our automated hardware management pipeline at 2:15 AM that morning. The firmware update was supposed to patch a minor latency bug, but we’d skipped full regression testing for "low-risk" hardware updates.
We pulled the SAS expander firmware’s release notes and found a known (but undocumented in our patch notes) issue: the firmware mishandled SCSI command queuing for drives with 4K native sectors, causing intermittent journal write failures during heavy I/O. Our Ceph nodes were under peak backup load at 2:15 AM, which triggered the bug, corrupted the ext4 journals, and cascaded into read-only file systems.
The 2-Day Recovery
Recovering corrupted ext4 journals on production storage nodes is not a fast process. First, we had to stop all Ceph services on the affected nodes to prevent further writes. Then, we ran fsck.ext4 on each corrupted volume, but 3 of the 5 nodes had irrecoverable journal errors that required restoring from the previous night’s snapshot.
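For illustration, here's a simplified outline of the per-node repair loop, assuming one OSD per corrupted volume. The OSD ID, device, and mount point are hypothetical, and a real run involves far more ceremony (setting noout, coordinating with the rest of the cluster, verifying data afterwards); this only captures the order of operations.

```python
#!/usr/bin/env python3
"""Simplified outline of the per-node repair loop described above.

The OSD IDs, device paths, and mount points are hypothetical; this only shows
the order of operations: stop the OSD, unmount, fsck, then decide whether a
snapshot restore is needed.
"""
import subprocess

def run(cmd):
    """Run a command, returning its exit code without raising."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode

def repair_volume(osd_id, device, mountpoint):
    """Attempt an in-place journal repair; return True if a restore is still needed."""
    run(["systemctl", "stop", f"ceph-osd@{osd_id}"])  # stop writes first
    run(["umount", mountpoint])                        # fsck needs the fs offline
    # fsck.ext4 (e2fsck) exit codes: 0/1/2 mean clean or repaired;
    # 4 and above means errors were left uncorrected, so fall back to a restore.
    rc = run(["fsck.ext4", "-f", "-y", device])
    return rc >= 4

if __name__ == "__main__":
    # Hypothetical example values for one affected node.
    if repair_volume(osd_id=7, device="/dev/sda3", mountpoint="/var/lib/ceph/osd/ceph-7"):
        print("Journal irrecoverable; schedule a snapshot restore for this volume.")
    else:
        print("Volume repaired; safe to remount and restart the OSD.")
```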
Restoring snapshots alone took 14 hours: our storage nodes hold 2.4PB of data, and even over 100GbE links, pulling the affected nodes’ snapshots from our off-site backup cluster takes time. We also had to rebalance the Ceph cluster after restoring the nodes, which took another 8 hours to bring data redundancy back within our SLA requirements.
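As a rough sanity check on that 14-hour number, here's a back-of-envelope calculation. It assumes data was spread evenly across the 12 nodes, that only the 3 restored nodes' share had to move, and that a single 100GbE link stayed saturated, all of which are simplifications; the point is that we were already close to wire speed, which is why we later moved the backups closer rather than trying to tune the transfer.

```python
#!/usr/bin/env python3
"""Back-of-envelope check on the snapshot restore time.

Assumptions (ours, for illustration): data is spread evenly across the 12
storage nodes, the 3 restored nodes pull over a single saturated 100GbE link,
and protocol overhead and backup-cluster read limits are ignored.
"""

TOTAL_DATA_TB = 2400      # 2.4 PB across the cluster, in decimal TB
TOTAL_NODES = 12
RESTORED_NODES = 3
LINK_GBPS = 100           # 100GbE

restored_tb = TOTAL_DATA_TB * RESTORED_NODES / TOTAL_NODES  # ~600 TB
link_gb_per_s = LINK_GBPS / 8                               # ~12.5 GB/s
seconds = restored_tb * 1000 / link_gb_per_s

print(f"Data to restore: {restored_tb:.0f} TB")
print(f"Theoretical minimum at line rate: {seconds / 3600:.1f} hours")
# Prints roughly 13 hours; the observed 14 hours was already near wire speed.
```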
We brought the primary data center back online 47 hours after the first alert, once we’d verified all file systems were clean, Ceph health had returned to OK, and application servers were processing requests without errors. Total downtime for the primary site was just under 2 days; the secondary site handled 100% of traffic for 46 of those 47 hours.
Lessons Learned
This outage cost us $420k in SLA credits and engineering time, but it taught us five critical lessons:
- Never skip regression testing for hardware firmware updates, even for "low-risk" patches. We now require 72 hours of soak testing for all firmware updates, in a staging environment that mirrors production storage workloads.
- Implement pre-flight checks for automated hardware updates: our pipeline now verifies that no high-I/O workloads are running before pushing firmware, and rolls back automatically if journal errors are detected in the first 10 minutes post-update (see the sketch after this list).
- Add file system health checks to our standard monitoring stack: we now alert on ext4 journal errors, read-only remounts, and SCSI command failures in real time, instead of relying on Ceph-level alerts.
- Improve snapshot restore speeds: we’ve since deployed a local backup cache in the primary data center, cutting snapshot restore time from 14 hours to 3 hours.
- Update our incident response playbook to include a dedicated storage war room role, focused solely on file system and hardware triage, to avoid splitting engineers between network and storage debugging.
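To make the pre-flight lesson concrete, here's a minimal sketch of the kind of gate we're describing. The I/O threshold, the 10-minute watch window, and the placeholder hooks (disk_io_utilization, journal_errors_since, flash_firmware, rollback_firmware) are stand-ins for whatever your hardware management pipeline and metrics stack provide; this is the shape of the gate, not our pipeline code.

```python
#!/usr/bin/env python3
"""Minimal sketch of a firmware pre-flight / rollback gate.

Not our pipeline code: the threshold, watch window, and hooks below are
placeholders for your own hardware management and metrics tooling.
"""
import time

IO_UTIL_THRESHOLD = 0.30   # skip the update if disks are busier than this
WATCH_WINDOW_S = 600       # 10 minutes of post-update observation

def disk_io_utilization():
    """Placeholder: return current disk utilization as a 0.0-1.0 fraction."""
    raise NotImplementedError("wire this to your metrics system")

def journal_errors_since(start_time):
    """Placeholder: return True if any ext4 journal errors were logged after start_time."""
    raise NotImplementedError("wire this to your kernel log pipeline")

def flash_firmware():
    """Placeholder: push the new firmware image."""
    raise NotImplementedError

def rollback_firmware():
    """Placeholder: restore the previous firmware image."""
    raise NotImplementedError

def guarded_firmware_update():
    # Pre-flight: refuse to flash while the node is under heavy I/O,
    # since that is exactly the condition that exposed the queuing bug.
    if disk_io_utilization() > IO_UTIL_THRESHOLD:
        print("Pre-flight failed: high I/O load, deferring firmware update.")
        return False

    start = time.time()
    flash_firmware()

    # Post-flight: watch for journal errors and roll back automatically.
    while time.time() - start < WATCH_WINDOW_S:
        if journal_errors_since(start):
            print("Journal errors detected post-update; rolling back firmware.")
            rollback_firmware()
            return False
        time.sleep(30)

    print("Firmware update completed cleanly.")
    return True
```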
Final Takeaway
File system corruption is a silent killer for data center operations, and even "minor" firmware updates can cascade into multi-day outages if you skip proper testing. We’d gotten complacent with our automated pipelines, but this incident forced us to tighten our change management processes and prioritize storage resilience over deployment speed.