Postmortem: LocalStack 3.0 and TestContainers 2.0 Conflict Caused 200 Test Flakes
On October 12, 2024, our CI pipeline began experiencing widespread test instability, with over 200 flaky test failures across 12 microservice repositories within 48 hours. This postmortem documents the investigation, root cause, and resolution of the conflict between LocalStack 3.0 and TestContainers 2.0 that triggered this outage.
Summary
We upgraded LocalStack from 2.3.1 to 3.0.0 and TestContainers from 1.19.8 to 2.0.0 in our shared testing library on October 10. Two days later, tests using both tools to spin up local AWS service mocks began failing intermittently, with failure rates spiking to 35% in peak hours. The conflict stemmed from incompatible Docker network configuration defaults introduced in both tools' major releases.
Impact
- Affected tests: 217 across 12 repositories.
- Failure rate: 12% initially, peaking at 35% on October 13.
- Blocked work: 14 pull requests blocked and 2 production releases delayed by 24 hours each.
- Production impact: none. The issue was isolated to CI/CD test environments.
Root Cause Analysis
Investigation revealed two conflicting changes in the major releases:
- LocalStack 3.0 changed its default Docker network mode from bridge to a custom localstack network with a hardcoded subnet of 10.0.1.0/24, to support multi-service orchestration.
- TestContainers 2.0 introduced a new default network alias validation that rejects containers attached to networks whose subnets overlap the Docker default bridge (172.17.0.0/16). Because of a bug in its CIDR overlap checking logic, it incorrectly flagged LocalStack’s 10.0.1.0/24 subnet as conflicting.
When tests launched LocalStack via TestContainers, the network conflict caused intermittent DNS resolution failures between test containers and LocalStack mocks, leading to timeouts and flaky assertion failures.
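The claim that the two subnets do not actually overlap is easy to verify independently. The following is a minimal sketch using Python's standard ipaddress module; it is not the TestContainers validation code (which is not reproduced in this postmortem), and the helper name subnets_overlap is ours, chosen for illustration.

```python
import ipaddress


def subnets_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    # overlaps() is True when either network contains the other's
    # first or last address, i.e. the address ranges intersect.
    return net_a.overlaps(net_b)


# Subnets quoted in the root cause analysis.
LOCALSTACK_SUBNET = "10.0.1.0/24"
DOCKER_DEFAULT_BRIDGE = "172.17.0.0/16"

# A correct check reports no conflict for LocalStack's subnet.
print(subnets_overlap(LOCALSTACK_SUBNET, DOCKER_DEFAULT_BRIDGE))  # False

# A subnet inside the bridge range is a genuine conflict.
print(subnets_overlap("172.17.5.0/24", DOCKER_DEFAULT_BRIDGE))  # True
```

A check like this can serve as a quick sanity test when a validation error names a subnet conflict that looks implausible.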
Timeline
- October 10, 09:00 UTC: Shared testing library upgraded to LocalStack 3.0 and TestContainers 2.0, merged to main.
- October 12, 14:30 UTC: First flaky test reports filed by developers, initially dismissed as transient CI issues.
- October 13, 08:15 UTC: CI team notices 20% failure rate spike, opens incident ticket.
- October 13, 10:45 UTC: Investigation confirms LocalStack and TestContainers version upgrade as common factor across all failing repos.
- October 13, 13:20 UTC: Root cause identified: network subnet conflict from major version changes.
- October 13, 15:40 UTC: Hotfix deployed: pin LocalStack to 2.3.1 and TestContainers to 1.19.8 in shared library, failure rate drops to 0%.
- October 14, 11:00 UTC: Permanent fix tested: configure LocalStack to use TestContainers-managed network with compatible subnet, upgrade both tools to latest patch versions.
- October 14, 16:30 UTC: Permanent fix merged to main, all tests passing consistently.
Resolution
Two-part fix implemented:
- Short-term hotfix: Reverted the shared testing library to LocalStack 2.3.1 and TestContainers 1.19.8 to immediately restore CI stability. Within 2 hours of deployment, all pending pull requests and both delayed releases were unblocked.
- Long-term fix: Updated test configuration to explicitly set LocalStack’s Docker network to use the TestContainers-managed network for each test suite, avoiding subnet conflicts. We also upgraded to LocalStack 3.0.2 and TestContainers 2.0.3, which included patches for the CIDR checking bug and network configuration flexibility.
Lessons Learned
What Went Well
- Centralized shared testing library allowed rapid rollback of breaking changes.
- CI metrics dashboard alerted us to abnormal failure rate spikes within 2 hours of peak impact.
- Cross-team collaboration between CI, platform, and application teams accelerated root cause identification.
What Went Wrong
- Major version upgrades of two core testing tools were bundled together, making it difficult to isolate the conflict source initially.
- No integration testing of the shared library upgrades against a representative sample of downstream repositories before merging to main.
- TestContainers 2.0 release notes did not document the new network alias validation behavior, delaying root cause identification.
Action Items
- Implement mandatory staged rollouts for shared library upgrades: test against 5% of downstream repos first, then 25%, then 100%.
- Add integration tests for LocalStack + TestContainers compatibility to the shared library CI pipeline.
- Require major version upgrades of core testing tools to be rolled out independently, with at least 48 hours between upgrades.
- Contribute documentation fix to TestContainers to clarify network validation behavior in release notes.
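As a rough illustration of the staged rollout in the first action item, the sketch below splits a set of downstream repositories into cumulative 5%, 25%, and 100% cohorts. The repository names, the hash-based ordering, and the rule of rounding up to at least one repository are all assumptions made for this example, not a description of our actual tooling.

```python
import hashlib
import math

# Stage fractions from the action item: 5%, then 25%, then 100%.
ROLLOUT_STAGES = (0.05, 0.25, 1.0)


def rollout_cohorts(repos, stages=ROLLOUT_STAGES):
    """Return one cumulative list of repositories per rollout stage.

    Repositories are ordered by a hash of their name, so the order is
    stable between runs but not tied to alphabetical order. Each stage
    takes ceil(fraction * total) repositories, with a minimum of one.
    """
    ordered = sorted(
        repos,
        key=lambda name: hashlib.sha256(name.encode("utf-8")).hexdigest(),
    )
    cohorts = []
    for fraction in stages:
        size = max(1, math.ceil(fraction * len(ordered)))
        cohorts.append(ordered[:size])
    return cohorts


# Hypothetical names standing in for the 12 affected repositories.
repos = [f"service-{i:02d}" for i in range(1, 13)]

for fraction, cohort in zip(ROLLOUT_STAGES, rollout_cohorts(repos)):
    print(f"{fraction:.0%}: {len(cohort)} repos")
```

With 12 repositories this yields cohorts of 1, 3, and 12, so the first stage exercises a single repository. For a fleet this small, the first stage may need a fixed minimum size rather than a percentage to be representative.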
Conclusion
This incident highlighted the risks of bundling major dependency upgrades in shared libraries, even for tools with strong backwards compatibility guarantees. By implementing staged rollouts and targeted compatibility testing, we can prevent similar conflicts from causing widespread CI instability in the future.