Hey there, fellow developer! 👋 Let’s talk about the silent saboteur of every monorepo pipeline: flaky tests. You know the ones—tests that pass 90% of the time but fail randomly, leaving you squinting at logs at 2 AM, muttering, “But it worked on my machine!”
In a monorepo, flaky tests aren’t just annoying—they’re codebase-wide grenades. One shaky test can block deployments for all projects. But fear not! Let’s turn those flaky foes into harmless quirks with quarantine pipelines and battle-tested hacks.
## What Makes Flaky Tests So Dangerous in Monorepos?
Flaky tests fail unpredictably due to:
- Race conditions (e.g., timing issues in async code).
- Shared state (tests stepping on each other’s toes).
- External dependencies (APIs, databases acting up).
In monorepos, these failures ripple across projects, grinding pipelines to a halt. 😱
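To make "shared state" concrete, here's a minimal sketch (plain Node, no test framework — the check names are illustrative) of two checks whose outcome depends entirely on execution order:

```javascript
// Minimal demo of "shared state" flakiness (plain Node, no framework).
const cache = [];

// Passes only when no earlier check left data behind.
function checkA() {
  return cache.length === 0;
}

// Pollutes the shared cache and never cleans up.
function checkB() {
  cache.push("b");
  return cache.includes("b");
}

// Run checks in a given order against a fresh cache.
function runInOrder(checks) {
  cache.length = 0; // reset before the "suite"
  return checks.map((fn) => fn());
}
```

Run A-then-B and everything passes; run B-then-A and A suddenly fails. Shuffle the order in CI and you have a "random" failure that's actually deterministic.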
## Strategy 1: The Quarantine Pipeline 🚨
Don’t let flaky tests poison your main CI/CD pipeline. Isolate them!
### Step 1: Identify the Culprits
Detect flaky tests by rerunning them N times, using tools like jest-circus retries (`jest.retryTimes`), pytest-flakefinder, or custom scripts:

```bash
# Example: run the suite serially in CI; pair this with
# jest.retryTimes(3) in a setup file (jest-circus runner) to
# confirm flakiness via automatic reruns
npm test -- --ci --runInBand --detectOpenHandles
```
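The rerun idea is tool-agnostic. Here's a sketch of the core logic as a plain function — `classifyTest` and the run count are illustrative, not a real CLI:

```javascript
// Rerun a test function N times and classify it:
// "stable-pass", "stable-fail", or "flaky" (mixed results).
function classifyTest(testFn, runs = 3) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      testFn();
      passes++;
    } catch (_) {
      // swallow the failure; we only count outcomes
    }
  }
  if (passes === runs) return "stable-pass";
  if (passes === 0) return "stable-fail";
  return "flaky";
}
```

Anything that comes back "flaky" is a quarantine candidate; "stable-fail" is a real bug, not flakiness.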
### Step 2: Move Them to Quarantine
Create a separate pipeline/job just for flaky tests:

```yaml
# GitHub Actions example
# (--excludeFlaky / --onlyFlaky are placeholder flags -- wire them
# up via testPathIgnorePatterns / testMatch in your Jest config)
jobs:
  main-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test -- --excludeFlaky
  quarantine:
    needs: main-tests
    runs-on: ubuntu-latest
    continue-on-error: true # a failing quarantine job won't block the workflow
    steps:
      - uses: actions/checkout@v4
      - run: npm test -- --onlyFlaky
```
**Why this works:**
- Main pipeline stays fast and stable.
- Quarantine runs flaky tests after the main build, so they don’t block deploys.
- You get visibility into flaky tests without derailing progress.
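Jest has no built-in `--excludeFlaky` flag, so you need a convention to back it. One common approach (an assumption here, not the only way) is naming quarantined files `*.flaky.test.js` and splitting on the path:

```javascript
// One way to back the hypothetical --excludeFlaky / --onlyFlaky
// split: name quarantined files *.flaky.test.js and filter by path.
const FLAKY_PATTERN = /\.flaky\.test\.js$/;

function isQuarantined(testPath) {
  return FLAKY_PATTERN.test(testPath);
}

// The main pipeline's jest.config.js might then use:
//   testPathIgnorePatterns: ["\\.flaky\\.test\\.js$"]
// and the quarantine job:
//   testMatch: ["**/*.flaky.test.js"]
```

Renaming a file is also a nice, greppable audit trail of what's currently in quarantine.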
## Strategy 2: Stability Hacks to Neutralize Flakiness 🛡️
### Hack 1: Retry with Care
Retry flaky tests in the quarantine pipeline only:

```yaml
# GitLab CI example -- note that `retry` reruns the whole job on
# failure, not individual tests
quarantine:
  retry:
    max: 2 # rerun the job up to 2 times
  script:
    - npm run test:flaky
```
Pro Tip: Avoid retries in your main pipeline—they mask real issues!
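If you'd rather retry at the test level than the job level, the core loop is tiny. A sync sketch (the async version just adds `await`) — for the quarantine pipeline only:

```javascript
// Retry a test body up to `max` extra attempts before giving up.
// Quarantine pipeline only -- see the pro tip above!
function retryTest(fn, max = 2) {
  let lastError;
  for (let attempt = 0; attempt <= max; attempt++) {
    try {
      return fn(attempt); // first success wins
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // exhausted all attempts: surface the real error
}
```

The cap matters: unbounded retries turn a flaky test into an infinite loop, and a generous cap just hides genuine regressions longer.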
### Hack 2: Kill Shared State
- Isolate databases: Use Docker containers with fresh DB instances per test.
- Randomize test order: Prevent tests from depending on execution sequence.

```bash
jest --randomize # shuffle test order within each file (Jest 29+)
```
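The isolation idea boils down to this: build a fresh instance per test instead of sharing one. A minimal sketch (`makeStore` stands in for whatever DB or cache your tests share):

```javascript
// Instead of one shared store, each test gets its own instance.
function makeStore() {
  const data = new Map();
  return {
    set: (k, v) => data.set(k, v),
    get: (k) => data.get(k),
    size: () => data.size,
  };
}

// Every "test" receives a fresh store, so order no longer matters.
function runIsolated(tests) {
  return tests.map((t) => t(makeStore()));
}
```

Compare this with the shared-cache demo earlier: here, shuffling the test order can't change any outcome.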
### Hack 3: Timeout Buffers
Add grace periods for slow operations:

```javascript
// Jest example: pass the timeout as the third argument so it applies
// to this test only. (Calling jest.setTimeout inside the test body
// does NOT affect the currently running test.)
test("slow API test", async () => {
  const data = await fetchSlowApi(); // hypothetical slow call
  expect(data).toBeDefined();
}, 15000); // 15s timeout instead of the default 5s
```
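Under the hood, a per-test timeout is just a race between your promise and a deadline. A framework-free sketch of that idea:

```javascript
// Race a promise against a deadline -- the idea behind per-test
// timeouts in most runners.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clean up the timer.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

Buffers buy slack for genuinely slow operations, but they don't fix a race condition — if a test needs ever-growing timeouts, treat that as a smell, not a solution.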
## Real-World Win: How Startup X Saved Their Sanity
A 20-project monorepo was failing 30% of deployments due to flaky tests. They:
- Moved 15 flaky tests to a quarantine pipeline.
- Fixed 10/15 with better timeout handling and DB isolation.
- Reduced pipeline failures by 80% in 2 weeks.
## Pitfalls to Avoid
- Ignoring Quarantine: Don’t let flaky tests pile up—fix them incrementally.
- Over-Retrying: Retries != fixes. Use them sparingly.
- No Monitoring: Track flaky test frequency with dashboards (e.g., Datadog, Grafana).
## Tools to Fight the Flake
- CircleCI Flaky Test Insights: Auto-detects and surfaces flaky tests.
- Buildkite Test Analytics: Visualizes flaky test trends.
- Custom Scripts: Automate flaky test detection with cron jobs.
## Your Action Plan
1. **Audit**: Find flaky tests with reruns and logs.
2. **Quarantine**: Move them to a separate pipeline.
3. **Fix**: Tackle the root cause (shared state, timeouts, etc.).
4. **Monitor**: Prevent new flaky tests from creeping in.
Final Thought: Flaky tests are like weeds—ignore them, and they’ll take over your garden. But with quarantine pipelines and smart hacks, you’ll reclaim your CI/CD sanity.
Got a flaky test horror story or hack? Share it below—let’s build a flaky-free future! 🚀