
ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

Postmortem: Our Vault 1.15 Instance Went Down Due to a Disk Full Error

Executive Summary

On October 12, 2024, at 14:32 UTC, our production Vault 1.15.0 instance experienced a total outage lasting 47 minutes. The root cause was a full storage volume hosting Vault's data directory, which prevented the process from writing critical state, audit logs, and encrypted secret data. No data was lost, and all services relying on Vault resumed normal operation after the volume was expanded and the instance was restarted.

Incident Timeline

  • 14:32 UTC: First alert triggered via Datadog monitoring for Vault API 5xx error rate exceeding 10%.
  • 14:34 UTC: On-call engineers acknowledged the alert and attempted to SSH into the Vault node; SSH access failed due to disk full errors preventing new process creation.
  • 14:37 UTC: Engineers accessed the node via out-of-band console, confirmed /var/lib/vault (Vault data directory) volume was 100% full.
  • 14:41 UTC: Attempted to clear Vault audit logs (stored on the same volume) but failed due to no write space for log truncation operations.
  • 14:45 UTC: Added 50GB to the underlying EBS volume via the AWS console and expanded the filesystem to use the new space.
  • 14:48 UTC: Restarted the Vault systemd service; service failed to start due to corrupted raft.db file from incomplete writes during the outage.
  • 14:52 UTC: Restored raft.db from the most recent hourly backup (taken at 14:00 UTC) and replayed the WAL (Write-Ahead Log) to recover data up to 14:31 UTC.
  • 14:59 UTC: Vault service started successfully, unsealed via Shamir's secret sharing (3 of 5 key shares used), and passed health checks.
  • 15:19 UTC: All downstream services (CI/CD pipelines, microservices, database credential rotators) reconnected to Vault and resumed normal operation. Incident marked as resolved.

Root Cause Analysis

Vault 1.15 stores critical data in three primary locations on the configured data volume:

  1. Raft Consensus Data: For Vault's integrated storage (raft) backend, including raft.db and WAL files. In our deployment, this accounted for 62% of volume usage.
  2. Audit Logs: We had configured verbose audit logging to a file on the same volume, which accounted for 28% of usage, growing at ~1.2GB per day.
  3. Plugin Data: Custom secrets engine plugins and cached auth tokens, accounting for the remaining 10% of usage.

We had set up a CloudWatch alarm for 80% volume utilization, but it was configured to trigger only if utilization remained above 80% for 3 consecutive 5-minute periods. On the day of the incident, a sudden spike in audit log volume (caused by a misconfigured test service generating 4x normal API traffic) pushed utilization from 78% to 100% in 12 minutes, too quickly for the alarm's 15-minute evaluation window to ever fire. Compounding this, we had not enabled automated rotation for Vault audit logs, and the data volume was still at its launch-time size of 100GB, which we had not revisited in 6 months.
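
For illustration, here is a minimal sketch of the difference between the lenient alarm that missed the spike and the tighter configuration we moved to afterwards. It uses boto3; the namespace, alarm names, filesystem path, and SNS topic ARN are placeholders rather than our exact setup, and it assumes disk utilization is already being published to CloudWatch (for example by the CloudWatch agent).

```python
# Sketch only: assumes boto3 credentials/region are configured and that disk
# utilization for the Vault data volume is published to CloudWatch as a
# custom metric. All names, dimensions, and ARNs below are illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")

# The misconfigured alarm: 3 consecutive 5-minute periods above 80%.
# A 12-minute spike from 78% to 100% can fill the disk before this ever fires.
cloudwatch.put_metric_alarm(
    AlarmName="vault-data-volume-disk-used-lenient",
    Namespace="CWAgent",
    MetricName="disk_used_percent",
    Dimensions=[{"Name": "path", "Value": "/var/lib/vault"}],
    Statistic="Average",
    Period=300,              # 5-minute periods
    EvaluationPeriods=3,     # must stay above threshold for 15 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # placeholder ARN
)

# The tighter alarm: a single 1-minute datapoint above 85% pages immediately.
cloudwatch.put_metric_alarm(
    AlarmName="vault-data-volume-disk-used-urgent",
    Namespace="CWAgent",
    MetricName="disk_used_percent",
    Dimensions=[{"Name": "path", "Value": "/var/lib/vault"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],  # placeholder ARN
)
```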

Remediation Steps

During the incident, we took the following steps to restore service:

  1. Expanded the EBS volume to 150GB and resized the ext4 filesystem to restore write space immediately (this step and step 3 are sketched in code after this list).
  2. Restored the corrupted raft.db file from the 14:00 UTC backup, then applied WAL entries from 14:00–14:31 UTC to minimize data loss (only 1 minute of write operations was lost, affecting non-critical test secrets).
  3. Unsealed Vault using 3 of 5 Shamir key shares, as the instance came back up sealed after the crash.
  4. Validated all Vault endpoints, checked audit log integrity, and confirmed no unauthorized access attempts during the outage.
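
The sketch below is a rough, after-the-fact illustration of steps 1 and 3, not a transcript of the incident. It assumes boto3 and the hvac client; the volume ID, Vault address, and key shares are placeholders, and the on-host partition and filesystem resize still has to happen separately.

```python
# Rough sketch of remediation steps 1 and 3. Volume ID, Vault address, and key
# shares are placeholders; the on-host filesystem resize (growpart/resize2fs)
# must still run on the instance itself and is only noted in comments here.
import boto3
import hvac

# Step 1: grow the EBS volume backing /var/lib/vault from 100GB to 150GB.
ec2 = boto3.client("ec2")
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=150)
# After the volume modification completes, the partition and ext4 filesystem
# must be extended on the host before Vault regains writable space.

# Step 3: after restarting Vault, submit 3 of the 5 Shamir key shares to unseal.
client = hvac.Client(url="https://vault.internal.example:8200")
for share in ["<share-1>", "<share-2>", "<share-3>"]:  # placeholder key shares
    status = client.sys.submit_unseal_key(key=share)
    if not status["sealed"]:
        break

# Basic check before letting downstream services reconnect.
assert not client.sys.read_seal_status()["sealed"]
```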

Lessons Learned

  • Monitoring Gaps: Our volume utilization alarm was too lenient (3 consecutive periods) to catch rapid spikes. We also lacked alerts for Vault-specific disk pressure, such as failed write operations in Vault logs.
  • Log Management: Storing audit logs on the same volume as Vault data created a single point of failure. We also had no log rotation in place, leading to unbounded growth.
  • Backup Strategy: While hourly backups were sufficient for this incident, we had not tested WAL replay procedures in 3 months, leading to a 4-minute delay during recovery.
  • Capacity Planning: We had not reviewed Vault storage growth trends in 6 months, and the initial volume size was based on launch-time estimates, not actual usage.

Prevention Measures

To prevent recurrence, we implemented the following changes within 72 hours of the incident:

  1. Separated Vault audit logs to a dedicated CloudWatch Logs group, with automated rotation after 7 days and 30-day retention, and removed file-based audit logging entirely (a minimal sketch of the log-group setup follows this list).
  2. Updated volume utilization alarms to trigger immediately when utilization exceeds 85%, with a second alarm at 90% for urgent escalation. Added Vault-specific metrics to Datadog: vault.core.write_errors, vault.audit.log_write_errors.
  3. Provisioned a new 200GB EBS volume for Vault data, with automated scaling enabled via AWS Auto Scaling for EBS, set to add 20GB when utilization exceeds 80%.
  4. Implemented weekly disaster recovery drills, including raft.db restoration and WAL replay, to ensure the team is familiar with recovery procedures.
  5. Set up a monthly capacity review for all Vault-related infrastructure, tracking storage growth, API traffic, and secret count trends.
  6. Upgraded to Vault 1.15.3 (the latest patch version) which includes improved disk pressure handling, returning 503 errors instead of crashing when write operations fail.
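
As a minimal sketch of the log-group side of item 1: the group name is hypothetical, and actually shipping the audit stream into it (for example via the CloudWatch agent or a syslog forwarder) is assumed rather than shown.

```python
# Sketch of the CloudWatch Logs destination for Vault audit logs (item 1).
# The log group name is a placeholder; delivering audit events into the group
# is handled elsewhere (agent/forwarder) and is not shown here.
import boto3

logs = boto3.client("logs")

log_group = "/vault/audit"  # hypothetical name
logs.create_log_group(logGroupName=log_group)
logs.put_retention_policy(logGroupName=log_group, retentionInDays=30)
```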

Conclusion

This incident highlighted the importance of separating critical service data from log storage and of configuring monitoring to catch rapid resource spikes. While no customer data was lost, the 47-minute outage impacted internal CI/CD pipelines and test environments. The changes implemented post-incident have substantially reduced the risk of similar outages, and we have shared these findings with our engineering organization to improve overall infrastructure resilience.
