Postmortem: How a Corrupted Node Modules Folder Caused 3-Hour Outage for Our CI Pipeline
Published: October 26, 2024 | Author: DevOps Team | 5 min read
Executive Summary
On October 24, 2024, our team experienced a 3-hour, 12-minute outage of our primary CI/CD pipeline, impacting 47 active pull requests and delaying 3 production releases. The root cause was identified as a corrupted node_modules directory on our shared CI runner, triggered by an incomplete npm install process during a high-concurrency job spike.
Timeline of Events
All times are in UTC:
- 14:02 – CI pipeline begins failing for all new jobs with `MODULE_NOT_FOUND` errors for core dependencies like `express` and `jest`.
- 14:07 – On-call engineer acknowledges the alert and starts investigating failed job logs.
- 14:15 – Initial hypothesis: the recent dependency update to `@company/utils` v2.1.0 broke compatibility. A rollback attempt fails.
- 14:28 – Engineer notices `node_modules` on the primary shared runner has 0-byte files for 12+ dependencies. A manual `npm install` on the runner fails with `ENOENT` errors.
- 14:45 – Team identifies that 14 concurrent CI jobs triggered at 14:00 overloaded the shared runner, causing `npm install` to hang and corrupt the global cache.
- 15:12 – Team purges the corrupted `node_modules` directory and npm cache, then reruns `npm ci` with concurrency limits.
- 15:14 – CI pipeline begins processing queued jobs successfully.
- 17:14 – All 47 queued pull requests are merged, and the delayed production releases are deployed.
Root Cause Analysis
Our CI pipeline uses a pool of 3 shared Ubuntu 22.04 runners with a global `node_modules` directory mounted at `/opt/ci/global_deps` to speed up build times by avoiding redundant dependency installs. The failure chain was as follows:
- A surge of 14 concurrent PR jobs triggered at 14:00 (due to a batch merge of feature branches) exceeded the runner’s I/O throughput limit.
- The `npm install` process for each job attempted to write to the shared `node_modules` directory simultaneously, causing file lock conflicts.
- Two jobs were killed mid-install by the runner's OOM (Out of Memory) killer, leaving partial 0-byte dependency files in `node_modules`.
- Subsequent jobs inherited the corrupted `node_modules` directory, because our pipeline logic skipped `npm install` whenever the directory already existed (a flawed optimization to reduce build time).
- The corrupted files caused `MODULE_NOT_FOUND` errors for all downstream jobs, as Node.js could not load the truncated dependency files.
We confirmed the corruption by running `find /opt/ci/global_deps/node_modules -type f -size 0`, which returned 14 empty files for critical dependencies.
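That manual diagnosis can be turned into an automated guard. The sketch below wraps the same `find` invocation in a reusable check; the function name is ours and the path is illustrative, not our exact pipeline code:

```shell
#!/usr/bin/env bash
# Sketch: fail fast when a shared dependency directory contains 0-byte
# files, the signature of an interrupted npm install.
# The path in the example invocation is illustrative.

check_no_empty_files() {
  local deps_dir="$1"
  local empty_count
  # Count zero-byte regular files anywhere under the dependency tree.
  empty_count=$(find "$deps_dir" -type f -size 0 2>/dev/null | wc -l)
  if [ "$empty_count" -gt 0 ]; then
    echo "ERROR: $empty_count zero-byte files under $deps_dir; cache may be corrupted" >&2
    return 1
  fi
  echo "OK: no zero-byte files under $deps_dir"
}

# Example invocation, using the path from the incident:
# check_no_empty_files /opt/ci/global_deps/node_modules
```

Run as a pre-build step, a non-zero exit fails the job before it can inherit a corrupted cache.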
Impact
- 3 hours, 12 minutes of total CI pipeline downtime.
- 47 pull requests queued, 12 of which were urgent bug fixes for production.
- 3 scheduled production releases delayed by 2–4 hours each.
- ~15 engineering hours spent on incident response and remediation.
Resolution Steps
We followed these steps to restore service:
- Paused all new CI jobs to prevent further corruption.
- Purged the shared `node_modules` directory and npm cache (`rm -rf /opt/ci/global_deps/node_modules ~/.npm/_cacache`).
- Updated the pipeline configuration to use `npm ci` instead of `npm install` for deterministic dependency installs.
- Added a pre-install step to validate `node_modules` integrity with `npm ls --depth=0` before skipping installs.
- Reduced the maximum concurrent jobs per runner from 8 to 4 to avoid I/O overload.
- Gradually resumed CI jobs, monitoring for errors for 30 minutes post-restart.
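The validate-then-reinstall step can be sketched as a small wrapper around a validator command such as `npm ls --depth=0`. The function name and paths here are illustrative, not our exact pipeline code:

```shell
#!/usr/bin/env bash
# Sketch: run a validator command; if it fails, purge the given cache
# directory and reinstall deterministically with `npm ci`.
# Function name and paths are illustrative.

ensure_clean_deps() {
  local deps_dir="$1"
  shift
  # "$@" is the validator, e.g. `npm ls --depth=0`, which exits non-zero
  # when installed packages are missing or inconsistent.
  if "$@" >/dev/null 2>&1; then
    echo "dependency tree OK, skipping reinstall"
  else
    echo "dependency tree inconsistent; purging $deps_dir and reinstalling"
    rm -rf "$deps_dir"
    # npm ci installs exactly what package-lock.json specifies.
    npm ci
  fi
}

# Example invocation, mirroring the remediation above:
# ensure_clean_deps /opt/ci/global_deps/node_modules npm ls --depth=0
```

Passing the validator as arguments keeps the purge logic independent of any one integrity check.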
Preventive Measures
To avoid recurrence, we implemented the following changes:
- Migrated from a shared `node_modules` directory to per-job ephemeral dependency caches, using a Docker container for each CI job and eliminating shared state between jobs.
- Added a hard limit of 5 concurrent jobs per runner, with auto-scaling runner pools to handle traffic spikes.
- Implemented a pre-commit hook to validate dependency changes, reducing unexpected surges in CI jobs.
- Added automated alerts for 0-byte files in `node_modules` directories across all runners.
- Updated our incident response runbook to include `node_modules` corruption as a known failure mode.
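The per-job isolation boils down to replacing the shared mount with a throwaway container per job, roughly as follows. This is a sketch of the pattern: the image tag (`node:20-slim`) and mount paths are assumptions, not our exact job definition:

```shell
#!/usr/bin/env bash
# Sketch: run one CI job in its own container, so node_modules is created
# fresh inside the container and discarded when it exits.
# Image tag and mount paths are illustrative.

docker run --rm \
  -v "$PWD:/workspace" \
  -w /workspace \
  node:20-slim \
  sh -c "npm ci && npm test"
```

Because nothing outside the repository checkout is mounted, a job killed mid-install can no longer leave corrupted state for the next job to inherit.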
Conclusion
This outage highlighted the risks of shared mutable state in CI pipelines, even when intended to optimize performance. By eliminating shared node_modules directories and adding concurrency limits, we’ve reduced the risk of similar corruption incidents to near zero. We’ve also shared these findings with our engineering team to raise awareness of CI pipeline state management best practices.