Hey there, fellow developer! 👋 Let’s talk about the silent saboteur of every monorepo pipeline: flaky tests. You know the ones—tests that pass 90% of the time but fail randomly, leaving you squinting at logs at 2 AM, muttering, “But it worked on my machine!”
In a monorepo, flaky tests aren’t just annoying—they’re codebase-wide grenades. One shaky test can block deployments for all projects. But fear not! Let’s turn those flaky foes into harmless quirks with quarantine pipelines and battle-tested hacks.
## What Makes Flaky Tests So Dangerous in Monorepos?
Flaky tests fail unpredictably due to:
- Race conditions (e.g., timing issues in async code).
- Shared state (tests stepping on each other’s toes).
- External dependencies (APIs, databases acting up).
In monorepos, these failures ripple across projects, grinding pipelines to a halt. 😱
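To make "shared state" concrete, here's a minimal sketch (plain Node, no test framework — the check names are illustrative) of two checks whose outcome depends entirely on execution order:

```javascript
// Minimal demo of "shared state" flakiness (plain Node, no framework).
const cache = [];

// Passes only when no earlier check left data behind.
function checkA() {
  return cache.length === 0;
}

// Pollutes the shared cache and never cleans up.
function checkB() {
  cache.push("b");
  return cache.includes("b");
}

// Run checks in a given order against a fresh cache.
function runInOrder(checks) {
  cache.length = 0; // reset before the "suite"
  return checks.map((fn) => fn());
}
```

Run A-then-B and everything passes; run B-then-A and A suddenly fails. Shuffle the order in CI and you have a "random" failure that's actually deterministic.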
## Strategy 1: The Quarantine Pipeline 🚨
Don’t let flaky tests poison your main CI/CD pipeline. Isolate them!
### Step 1: Identify the Culprits
Detect flaky tests by rerunning them N times, using tools like jest-circus retries (`jest.retryTimes`), pytest-flakefinder, or custom scripts:

```bash
# Example: run the suite serially in CI; pair this with
# jest.retryTimes(3) in a setup file (jest-circus runner) to
# confirm flakiness via automatic reruns
npm test -- --ci --runInBand --detectOpenHandles
```
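The rerun idea is tool-agnostic. Here's a sketch of the core logic as a plain function — `classifyTest` and the run count are illustrative, not a real CLI:

```javascript
// Rerun a test function N times and classify it:
// "stable-pass", "stable-fail", or "flaky" (mixed results).
function classifyTest(testFn, runs = 3) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      testFn();
      passes++;
    } catch (_) {
      // swallow the failure; we only count outcomes
    }
  }
  if (passes === runs) return "stable-pass";
  if (passes === 0) return "stable-fail";
  return "flaky";
}
```

Anything that comes back "flaky" is a quarantine candidate; "stable-fail" is a real bug, not flakiness.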
### Step 2: Move Them to Quarantine
Create a separate pipeline/job just for flaky tests:

```yaml
# GitHub Actions example
# (--excludeFlaky / --onlyFlaky are placeholder flags -- wire them
# up via testPathIgnorePatterns / testMatch in your Jest config)
jobs:
  main-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test -- --excludeFlaky
  quarantine:
    needs: main-tests
    runs-on: ubuntu-latest
    continue-on-error: true # a failing quarantine job won't block the workflow
    steps:
      - uses: actions/checkout@v4
      - run: npm test -- --onlyFlaky
```
**Why this works:**
- Main pipeline stays fast and stable.
- Quarantine runs flaky tests after the main build, so they don’t block deploys.
- You get visibility into flaky tests without derailing progress.
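Jest has no built-in `--excludeFlaky` flag, so you need a convention to back it. One common approach (an assumption here, not the only way) is naming quarantined files `*.flaky.test.js` and splitting on the path:

```javascript
// One way to back the hypothetical --excludeFlaky / --onlyFlaky
// split: name quarantined files *.flaky.test.js and filter by path.
const FLAKY_PATTERN = /\.flaky\.test\.js$/;

function isQuarantined(testPath) {
  return FLAKY_PATTERN.test(testPath);
}

// The main pipeline's jest.config.js might then use:
//   testPathIgnorePatterns: ["\\.flaky\\.test\\.js$"]
// and the quarantine job:
//   testMatch: ["**/*.flaky.test.js"]
```

Renaming a file is also a nice, greppable audit trail of what's currently in quarantine.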
## Strategy 2: Stability Hacks to Neutralize Flakiness 🛡️
### Hack 1: Retry with Care
Retry flaky tests in the quarantine pipeline only:

```yaml
# GitLab CI example -- note that `retry` reruns the whole job on
# failure, not individual tests
quarantine:
  retry:
    max: 2 # rerun the job up to 2 times
  script:
    - npm run test:flaky
```
Pro Tip: Avoid retries in your main pipeline—they mask real issues!
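If you'd rather retry at the test level than the job level, the core loop is tiny. A sync sketch (the async version just adds `await`) — for the quarantine pipeline only:

```javascript
// Retry a test body up to `max` extra attempts before giving up.
// Quarantine pipeline only -- see the pro tip above!
function retryTest(fn, max = 2) {
  let lastError;
  for (let attempt = 0; attempt <= max; attempt++) {
    try {
      return fn(attempt); // first success wins
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // exhausted all attempts: surface the real error
}
```

The cap matters: unbounded retries turn a flaky test into an infinite loop, and a generous cap just hides genuine regressions longer.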
### Hack 2: Kill Shared State
- Isolate databases: Use Docker containers with fresh DB instances per test.
- Randomize test order: Prevent tests from depending on execution sequence.

```bash
jest --randomize # shuffle test order within each file (Jest 29+)
```
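The isolation idea boils down to this: build a fresh instance per test instead of sharing one. A minimal sketch (`makeStore` stands in for whatever DB or cache your tests share):

```javascript
// Instead of one shared store, each test gets its own instance.
function makeStore() {
  const data = new Map();
  return {
    set: (k, v) => data.set(k, v),
    get: (k) => data.get(k),
    size: () => data.size,
  };
}

// Every "test" receives a fresh store, so order no longer matters.
function runIsolated(tests) {
  return tests.map((t) => t(makeStore()));
}
```

Compare this with the shared-cache demo earlier: here, shuffling the test order can't change any outcome.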
### Hack 3: Timeout Buffers
Add grace periods for slow operations:

```javascript
// Jest example: pass the timeout as the third argument so it applies
// to this test only. (Calling jest.setTimeout inside the test body
// does NOT affect the currently running test.)
test("slow API test", async () => {
  const data = await fetchSlowApi(); // hypothetical slow call
  expect(data).toBeDefined();
}, 15000); // 15s timeout instead of the default 5s
```
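Under the hood, a per-test timeout is just a race between your promise and a deadline. A framework-free sketch of that idea:

```javascript
// Race a promise against a deadline -- the idea behind per-test
// timeouts in most runners.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clean up the timer.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Buffers buy slack for genuinely slow operations, but they don't fix a race condition — if a test needs ever-growing timeouts, treat that as a smell, not a solution.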
## Real-World Win: How Startup X Saved Their Sanity
A 20-project monorepo was failing 30% of deployments due to flaky tests. They:
- Moved 15 flaky tests to a quarantine pipeline.
- Fixed 10/15 with better timeout handling and DB isolation.
- Reduced pipeline failures by 80% in 2 weeks.
## Pitfalls to Avoid
- Ignoring Quarantine: Don’t let flaky tests pile up—fix them incrementally.
- Over-Retrying: Retries != fixes. Use them sparingly.
- No Monitoring: Track flaky test frequency with dashboards (e.g., Datadog, Grafana).
## Tools to Fight the Flake
- CircleCI Flaky Test Insights: Auto-detects and surfaces flaky tests.
- Buildkite Test Analytics: Visualizes flaky test trends.
- Custom Scripts: Automate flaky test detection with cron jobs.
## Your Action Plan
1. **Audit**: Find flaky tests with reruns and logs.
2. **Quarantine**: Move them to a separate pipeline.
3. **Fix**: Tackle the root cause (shared state, timeouts, etc.).
4. **Monitor**: Prevent new flaky tests from creeping in.
Final Thought: Flaky tests are like weeds—ignore them, and they’ll take over your garden. But with quarantine pipelines and smart hacks, you’ll reclaim your CI/CD sanity.
Got a flaky test horror story or hack? Share it below—let’s build a flaky-free future! 🚀