ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Corrupted Node Modules Folder Caused a 3-Hour Outage for Our CI Pipeline

Published: October 26, 2024 | Author: DevOps Team | 5 min read

Executive Summary

On October 24, 2024, our team experienced a 3-hour, 12-minute outage of our primary CI/CD pipeline, impacting 47 active pull requests and delaying 3 production releases. The root cause was identified as a corrupted node_modules directory on our shared CI runner, triggered by an incomplete npm install process during a high-concurrency job spike.

Timeline of Events

All times are in UTC:

  • 14:02 – CI pipeline begins failing for all new jobs with MODULE_NOT_FOUND errors for core dependencies like express and jest.
  • 14:07 – On-call engineer acknowledges alert, starts investigating failed job logs.
  • 14:15 – Initial hypothesis: Recent dependency update to @company/utils v2.1.0 broke compatibility. Rollback attempt fails.
  • 14:28 – Engineer notices node_modules on the primary shared runner has 0-byte files for 12+ dependencies. Manual npm install on runner fails with ENOENT errors.
  • 14:45 – Team identifies that the 14 concurrent CI jobs triggered at 14:00 overloaded the shared runner, causing npm install to hang and corrupt the global cache.
  • 15:12 – Team purges corrupted node_modules and npm cache, reruns npm ci with concurrency limits.
  • 15:14 – CI pipeline begins processing queued jobs successfully.
  • 17:14 – All 47 queued pull requests are merged, production releases deployed.

Root Cause Analysis

Our CI pipeline uses a pool of 3 shared Ubuntu 22.04 runners with a global node_modules directory mounted to /opt/ci/global_deps to speed up build times by avoiding redundant dependency installs. The failure chain was as follows:

  1. A surge of 14 concurrent PR jobs triggered at 14:00 (due to a batch merge of feature branches) exceeded the runner’s I/O throughput limit.
  2. The npm install process for each job attempted to write to the shared node_modules directory simultaneously, causing file lock conflicts.
  3. Two jobs were killed mid-install by the runner’s OOM (Out of Memory) killer, leaving partial 0-byte dependency files in node_modules.
  4. Subsequent jobs inherited the corrupted node_modules directory, as our pipeline logic skipped npm install if the directory existed (a flawed optimization to reduce build time).
  5. The corrupted files caused MODULE_NOT_FOUND errors for all downstream jobs, as Node.js could not load the truncated dependency files.
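Step 4 was the critical flaw. A minimal reconstruction of that skip-install logic (illustrative, not our actual pipeline script; the function name and messages are made up for the sketch):

```shell
#!/usr/bin/env sh
# Reconstruction of the flawed optimization: the directory merely existing
# was the only condition checked, so a half-written node_modules left
# behind by an OOM-killed npm install was silently reused by later jobs.
install_if_needed() {
  if [ -d "$1" ]; then
    # The flaw: no integrity check, just existence.
    echo "deps present, skipping install"
  else
    echo "deps missing, would run npm install"
  fi
}
```

A directory full of 0-byte files passes this check exactly as a healthy one does, which is why every downstream job failed once the corruption occurred.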

We confirmed the corruption by running find /opt/ci/global_deps/node_modules -type f -size 0, which returned 14 empty files for critical dependencies.

Impact

  • 3 hours, 12 minutes of total CI pipeline downtime.
  • 47 pull requests queued, 12 of which were urgent bug fixes for production.
  • 3 scheduled production releases delayed by 2–4 hours each.
  • ~15 engineering hours spent on incident response and remediation.

Resolution Steps

We followed these steps to restore service:

  1. Paused all new CI jobs to prevent further corruption.
  2. Purged the shared node_modules directory and npm cache (rm -rf /opt/ci/global_deps/node_modules ~/.npm/_cacache).
  3. Updated pipeline configuration to use npm ci instead of npm install for deterministic dependency installs.
  4. Added a pre-install step to validate node_modules integrity using npm ls --depth=0 before skipping installs.
  5. Reduced max concurrent jobs per runner from 8 to 4 to avoid I/O overload.
  6. Gradually resumed CI jobs, monitoring for errors for 30 minutes post-restart.
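Condensed, steps 2–4 above amount to the following sketch. The purge path is the shared mount from the incident; the helper function names are illustrative, and npm ci / npm ls behave as documented by npm (npm ci installs strictly from package-lock.json and removes any existing node_modules first; npm ls exits non-zero if declared dependencies are missing or invalid):

```shell
#!/usr/bin/env sh
# Sketch of the recovery sequence; function names are illustrative.

# Step 2: remove the corrupted dependency tree and npm's content cache.
purge_deps() {
  rm -rf "$1" "${HOME}/.npm/_cacache"
}

# Steps 3-4: reinstall deterministically from the lockfile, then verify
# the top-level dependency tree before any job is allowed to reuse it.
reinstall_and_verify() {
  npm ci && npm ls --depth=0
}

# Example invocation on the affected runner (destructive, shown commented):
# purge_deps "/opt/ci/global_deps/node_modules"
# reinstall_and_verify
```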

Preventive Measures

To avoid recurrence, we implemented the following changes:

  • Migrated from shared node_modules to per-job ephemeral dependency caches using Docker containers for each CI job, eliminating shared state between jobs.
  • Added a hard limit of 5 concurrent jobs per runner, with auto-scaling runner pools to handle traffic spikes.
  • Implemented a pre-commit hook to validate dependency changes, reducing unexpected surges in CI jobs.
  • Added automated alerts for 0-byte files in node_modules directories across all runners.
  • Updated our incident response runbook to include node_modules corruption as a known failure mode.
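The 0-byte-file alert can be sketched as a periodic scan; the check below mirrors the find command we used during the incident, while the alert output format is a placeholder for whatever monitoring integration a team actually wires the non-zero exit into:

```shell
#!/usr/bin/env sh
# Sketch of the preventive 0-byte-file check across runner dependency dirs.

scan_for_empty_deps() {
  # List every 0-byte file under the given directory; return non-zero if
  # any exist, so a cron job or CI health check can page on the failure.
  empties=$(find "$1" -type f -size 0 2>/dev/null)
  if [ -n "$empties" ]; then
    printf 'ALERT: 0-byte dependency files found:\n%s\n' "$empties"
    return 1
  fi
  return 0
}

# Example: scan_for_empty_deps /opt/ci/global_deps/node_modules
```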

Conclusion

This outage highlighted the risks of shared mutable state in CI pipelines, even when intended to optimize performance. By eliminating shared node_modules directories and adding concurrency limits, we’ve reduced the risk of similar corruption incidents to near zero. We’ve also shared these findings with our engineering team to raise awareness of CI pipeline state management best practices.
