Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach
The Problem
In modern CI/CD environments, automated tests are expected to provide fast and reliable feedback. However, flaky tests (tests that pass and fail intermittently without any code change) introduce instability into the pipeline.
A flaky test may:
- Pass locally but fail in CI
- Fail due to timing issues or race conditions
- Fail because of shared state or environment dependencies
Over time, flaky tests reduce trust in automation and slow down engineering velocity.
Why It Damages CI/CD Velocity
When a test fails, engineers must decide:
- Is this a real regression?
- Or just another flaky failure?
This uncertainty causes:
- Repeated pipeline reruns
- Increased build time
- Delayed releases
- Developer frustration
In high-frequency deployment environments, flaky tests silently become productivity killers.
Why Traditional Approaches Fail
Several mitigation strategies are commonly used:
1. Reruns
Automatically rerunning failed tests may hide instability but does not eliminate the root cause.
2. Retry Logic
Retrying tests reduces visible failures but increases pipeline time and masks systemic issues.
3. Manual Tagging
Marking tests as flaky requires human intervention and constant maintenance.
All these methods are reactive rather than predictive.
Proposed Machine Learning Approach
Instead of reacting to flaky behavior, we can attempt to predict it.
The idea is to model test instability using historical execution data.
Feature Engineering
Potential predictive signals include:
- Historical failure frequency
- Time between failures
- Execution duration variance
- Commit correlation patterns
- Environment-specific behavior
These features can be extracted from CI execution logs.
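As a sketch, these signals could be aggregated from raw run records like this. The record shape, field names, and the `extract_features` helper are assumptions for illustration, not the repository's actual schema:

```python
from collections import defaultdict
from statistics import pvariance

def extract_features(runs):
    """Aggregate per-test signals from CI run records.

    `runs` is a hypothetical list of dicts with keys:
    test_id, passed (bool), duration_s (float).
    """
    by_test = defaultdict(list)
    for r in runs:
        by_test[r["test_id"]].append(r)

    features = {}
    for test_id, recs in by_test.items():
        outcomes = [r["passed"] for r in recs]
        durations = [r["duration_s"] for r in recs]
        features[test_id] = {
            # fraction of runs that failed
            "failure_rate": outcomes.count(False) / len(outcomes),
            # unstable runtimes often correlate with timing-sensitive tests
            "duration_variance": pvariance(durations) if len(durations) > 1 else 0.0,
            # pass<->fail transitions hint at intermittency
            "flips": sum(a != b for a, b in zip(outcomes, outcomes[1:])),
        }
    return features

runs = [
    {"test_id": "t1", "passed": True,  "duration_s": 1.0},
    {"test_id": "t1", "passed": False, "duration_s": 3.2},
    {"test_id": "t1", "passed": True,  "duration_s": 1.1},
]
print(extract_features(runs)["t1"]["flips"])  # 2
```

In a real pipeline these aggregates would be computed per test over a sliding window of recent builds rather than over the whole history.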
Labeling Strategy
A test can be labeled as flaky if:
- It alternates between pass and fail without related code changes
- Failure patterns show inconsistency over multiple builds
This labeling enables supervised learning.
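A minimal version of the first rule, assuming each run record carries its commit SHA (the field names and the `label_flaky` helper are hypothetical): if a test both passed and failed on the same commit, no code change can explain the flip.

```python
from collections import defaultdict

def label_flaky(runs):
    """Label a test flaky when it both passed and failed on the same commit.

    `runs` is a hypothetical list of dicts with keys test_id, commit, passed.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for r in runs:
        seen[r["test_id"]][r["commit"]].add(r["passed"])
    return {
        test_id: any(len(outcomes) == 2 for outcomes in per_commit.values())
        for test_id, per_commit in seen.items()
    }

runs = [
    {"test_id": "login_test",  "commit": "abc123", "passed": True},
    {"test_id": "login_test",  "commit": "abc123", "passed": False},
    {"test_id": "search_test", "commit": "abc123", "passed": True},
    {"test_id": "search_test", "commit": "def456", "passed": False},
]
print(label_flaky(runs))  # {'login_test': True, 'search_test': False}
```

Note that `search_test` is not labeled flaky here: it failed on a different commit, so a genuine regression could explain the change in outcome.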
Model Selection
Initial models for experimentation:
- Logistic Regression
- Random Forest
- Gradient Boosting
These models can classify tests into:
- Stable
- Potentially flaky
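To keep this sketch dependency-free, here is a toy logistic regression trained with plain stochastic gradient descent on two of the features above. In practice you would reach for scikit-learn; the training data here is invented for illustration.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=200):
    """Minimal logistic regression via SGD (pure-Python sketch)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi, threshold=0.5):
    """Classify as potentially flaky (True) or stable (False)."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return (1.0 / (1.0 + math.exp(-z))) >= threshold

# toy features: [failure_rate, duration_variance]; 1 = flaky
X = [[0.0, 0.1], [0.05, 0.2], [0.4, 2.5], [0.5, 3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
print(predict(w, b, [0.45, 2.8]))  # True (resembles the flaky examples)
```

The same feature matrix would feed a Random Forest or Gradient Boosting model unchanged; only the estimator swaps out.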
Initial Experimental Setup
To ensure this research remains independent and reproducible:
- Test framework: Playwright
- CI data source: Synthetic execution logs
- Dataset: Artificially generated instability patterns
No proprietary or company data is used.
The dataset simulates:
- Random intermittent failures
- Timing-based instability
- Controlled failure injection
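A sketch of how such a synthetic dataset might be generated; the test names, failure rates, and load model below are illustrative choices, not taken from the repository:

```python
import random

def simulate_runs(n_builds=200, seed=42):
    """Generate synthetic CI results: one stable test, one randomly
    intermittent test, and one timing-dependent test."""
    rng = random.Random(seed)  # seeded for reproducibility
    runs = []
    for build in range(n_builds):
        runs.append({"test_id": "stable_test", "build": build, "passed": True})
        # random intermittent failure: fails roughly 15% of the time
        runs.append({"test_id": "random_flaky", "build": build,
                     "passed": rng.random() > 0.15})
        # timing-based instability: fails when simulated load is high
        load = rng.random()
        runs.append({"test_id": "timing_flaky", "build": build,
                     "passed": load < 0.8})
    return runs

runs = simulate_runs()
stable_fails = sum(not r["passed"] for r in runs if r["test_id"] == "stable_test")
flaky_fails = sum(not r["passed"] for r in runs if r["test_id"] == "random_flaky")
print(stable_fails, flaky_fails)
```

Controlled failure injection then amounts to varying these rates per test and recording the ground-truth labels alongside the runs.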
Preliminary Results
In early synthetic experiments:
- Accuracy: ~82%
- Precision: Moderate
- Recall: Strong for frequently unstable tests
Observations
- Historical variance in execution duration is a strong indicator
- Tests with environment-dependent patterns show higher unpredictability
- Simpler models perform surprisingly well with structured features
These results suggest feasibility, though real-world validation is required.
Next Steps
Future improvements include:
- Collecting real-world open-source CI datasets
- Improving feature selection
- Exploring time-series modeling
- Integrating predictions directly into CI pipelines
The long-term goal is proactive CI reliability: identifying unstable tests before they disrupt delivery.
GitHub Repository:
https://github.com/srivastava-rajeev/flaky-test-prediction-ml
Update (Feb 22, 2026): Experimental Results from Reproducible Pipeline
I ran the end-to-end pipeline from this repository:
https://github.com/srivastava-rajeev/flaky-test-prediction-ml
Latest Metrics
- Logistic Regression: ROC-AUC 0.944, Precision@0.5 0.966, Recall@0.5 0.929
- Random Forest: ROC-AUC 0.950, Precision@0.5 0.966, Recall@0.5 0.929
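For readers unfamiliar with the `@0.5` notation: precision and recall at a threshold can be computed directly from predicted probabilities. A from-scratch sketch with invented toy scores (not the repository's data):

```python
def precision_recall_at(y_true, y_prob, t=0.5):
    """Precision and recall when flagging tests with score >= t as flaky."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# invented toy scores: 1 = labeled flaky
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
print(precision_recall_at(y_true, y_prob, 0.5))  # (0.75, 0.75)
```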
CI Threshold Simulation (Logistic Regression)
- t=0.30 -> estimated policy cost 548.00
- t=0.50 -> estimated policy cost 548.00
- t=0.70 -> estimated policy cost 569.33 (+21.33)
Key Takeaway
Model quality is important, but CI impact depends heavily on threshold policy and false-negative cost trade-offs.
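One way to make that trade-off concrete is a cost model that charges more for a missed flaky test (false negative) than for a needless quarantine (false positive). The weights and toy scores below are illustrative assumptions, not the repository's actual policy costs:

```python
def policy_cost(y_true, y_prob, t, fp_cost=1.0, fn_cost=10.0):
    """Estimated CI policy cost at threshold t.

    Assumption: a missed flaky test (FN) wastes far more engineering
    time than a needlessly quarantined stable test (FP).
    """
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
    return fp_cost * fp + fn_cost * fn

# invented toy scores: 1 = labeled flaky
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
for t in (0.3, 0.5, 0.7):
    print(t, policy_cost(y_true, y_prob, t))  # 2.0, 11.0, 10.0
```

Even on this toy data the cheapest threshold is the lowest one, because false negatives dominate the cost; the optimal threshold shifts as the FP/FN cost ratio changes, which is exactly why the same model can yield different policy costs at different thresholds.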
Reproducible Artifacts
- data/processed/sample_features.csv
- models/results/baseline_metrics.json
- ci_integration/threshold_scenarios.csv