Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach
The Problem
In modern CI/CD environments, automated tests are expected to provide fast and reliable feedback. However, flaky tests (tests that pass and fail intermittently without any code change) introduce instability into the pipeline.
A flaky test may:
- Pass locally but fail in CI
- Fail due to timing issues or race conditions
- Fail because of shared state or environment dependencies
Over time, flaky tests reduce trust in automation and slow down engineering velocity.
Why It Damages CI/CD Velocity
When a test fails, engineers must decide:
- Is this a real regression?
- Or just another flaky failure?
This uncertainty causes:
- Repeated pipeline reruns
- Increased build time
- Delayed releases
- Developer frustration
In high-frequency deployment environments, flaky tests silently become productivity killers.
Why Traditional Approaches Fail
Several mitigation strategies are commonly used:
1. Reruns
Automatically rerunning failed tests may hide instability but does not eliminate the root cause.
2. Retry Logic
Retrying tests reduces visible failures but increases pipeline time and masks systemic issues.
3. Manual Tagging
Marking tests as flaky requires human intervention and constant maintenance.
All these methods are reactive rather than predictive.
Proposed Machine Learning Approach
Instead of reacting to flaky behavior, we can attempt to predict it.
The idea is to model test instability using historical execution data.
Feature Engineering
Potential predictive signals include:
- Historical failure frequency
- Time between failures
- Execution duration variance
- Commit correlation patterns
- Environment-specific behavior
These features can be extracted from CI execution logs.
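As a sketch, these signals could be aggregated from raw run records like this. The record shape, field names, and the `extract_features` helper are assumptions for illustration, not the repository's actual schema:

```python
from collections import defaultdict
from statistics import pvariance

def extract_features(runs):
    """Aggregate per-test signals from CI run records.

    `runs` is a hypothetical list of dicts with keys:
    test_id, passed (bool), duration_s (float).
    """
    by_test = defaultdict(list)
    for r in runs:
        by_test[r["test_id"]].append(r)

    features = {}
    for test_id, recs in by_test.items():
        outcomes = [r["passed"] for r in recs]
        durations = [r["duration_s"] for r in recs]
        features[test_id] = {
            # fraction of runs that failed
            "failure_rate": outcomes.count(False) / len(outcomes),
            # unstable runtimes often correlate with timing-sensitive tests
            "duration_variance": pvariance(durations) if len(durations) > 1 else 0.0,
            # pass<->fail transitions hint at intermittency
            "flips": sum(a != b for a, b in zip(outcomes, outcomes[1:])),
        }
    return features

runs = [
    {"test_id": "t1", "passed": True,  "duration_s": 1.0},
    {"test_id": "t1", "passed": False, "duration_s": 3.2},
    {"test_id": "t1", "passed": True,  "duration_s": 1.1},
]
print(extract_features(runs)["t1"]["flips"])  # 2
```

In a real pipeline these aggregates would be computed per test over a sliding window of recent builds rather than over the whole history.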
Labeling Strategy
A test can be labeled as flaky if:
- It alternates between pass and fail without related code changes
- Failure patterns show inconsistency over multiple builds
This labeling enables supervised learning.
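A minimal version of the first rule, assuming each run record carries its commit SHA (the field names and the `label_flaky` helper are hypothetical): if a test both passed and failed on the same commit, no code change can explain the flip.

```python
from collections import defaultdict

def label_flaky(runs):
    """Label a test flaky when it both passed and failed on the same commit.

    `runs` is a hypothetical list of dicts with keys test_id, commit, passed.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for r in runs:
        seen[r["test_id"]][r["commit"]].add(r["passed"])
    return {
        test_id: any(len(outcomes) == 2 for outcomes in per_commit.values())
        for test_id, per_commit in seen.items()
    }

runs = [
    {"test_id": "login_test",  "commit": "abc123", "passed": True},
    {"test_id": "login_test",  "commit": "abc123", "passed": False},
    {"test_id": "search_test", "commit": "abc123", "passed": True},
    {"test_id": "search_test", "commit": "def456", "passed": False},
]
print(label_flaky(runs))  # {'login_test': True, 'search_test': False}
```

Note that `search_test` is not labeled flaky here: it failed on a different commit, so a genuine regression could explain the change in outcome.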
Model Selection
Initial models for experimentation:
- Logistic Regression
- Random Forest
- Gradient Boosting
These models can classify tests into:
- Stable
- Potentially flaky
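To keep this sketch dependency-free, here is a toy logistic regression trained with plain stochastic gradient descent on two of the features above. In practice you would reach for scikit-learn; the training data here is invented for illustration.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=200):
    """Minimal logistic regression via SGD (pure-Python sketch)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi, threshold=0.5):
    """Classify as potentially flaky (True) or stable (False)."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return (1.0 / (1.0 + math.exp(-z))) >= threshold

# toy features: [failure_rate, duration_variance]; 1 = flaky
X = [[0.0, 0.1], [0.05, 0.2], [0.4, 2.5], [0.5, 3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
print(predict(w, b, [0.45, 2.8]))  # True (resembles the flaky examples)
```

The same feature matrix would feed a Random Forest or Gradient Boosting model unchanged; only the estimator swaps out.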
Initial Experimental Setup
To ensure this research remains independent and reproducible:
- Test framework: Playwright
- CI data source: Synthetic execution logs
- Dataset: Artificially generated instability patterns
No proprietary or company data is used.
The dataset simulates:
- Random intermittent failures
- Timing-based instability
- Controlled failure injection
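A sketch of how such a synthetic dataset might be generated; the test names, failure rates, and load model below are illustrative choices, not taken from the repository:

```python
import random

def simulate_runs(n_builds=200, seed=42):
    """Generate synthetic CI results: one stable test, one randomly
    intermittent test, and one timing-dependent test."""
    rng = random.Random(seed)  # seeded for reproducibility
    runs = []
    for build in range(n_builds):
        runs.append({"test_id": "stable_test", "build": build, "passed": True})
        # random intermittent failure: fails roughly 15% of the time
        runs.append({"test_id": "random_flaky", "build": build,
                     "passed": rng.random() > 0.15})
        # timing-based instability: fails when simulated load is high
        load = rng.random()
        runs.append({"test_id": "timing_flaky", "build": build,
                     "passed": load < 0.8})
    return runs

runs = simulate_runs()
stable_fails = sum(not r["passed"] for r in runs if r["test_id"] == "stable_test")
flaky_fails = sum(not r["passed"] for r in runs if r["test_id"] == "random_flaky")
print(stable_fails, flaky_fails)
```

Controlled failure injection then amounts to varying these rates per test and recording the ground-truth labels alongside the runs.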
Preliminary Results
In early synthetic experiments:
- Accuracy: ~82%
- Precision: Moderate
- Recall: Strong for frequently unstable tests
Observations
- Historical variance in execution duration is a strong indicator
- Tests with environment-dependent patterns show higher unpredictability
- Simpler models perform surprisingly well with structured features
These results suggest feasibility, though real-world validation is required.
Next Steps
Future improvements include:
- Collecting real-world open-source CI datasets
- Improving feature selection
- Exploring time-series modeling
- Integrating predictions directly into CI pipelines
The long-term goal is proactive CI reliability: identifying unstable tests before they disrupt delivery.
GitHub Repository:
https://github.com/srivastava-rajeev/flaky-test-prediction-ml
Update (Feb 22, 2026): Experimental Results from Reproducible Pipeline
I ran the end-to-end pipeline from this repository:
https://github.com/srivastava-rajeev/flaky-test-prediction-ml
Latest Metrics
- Logistic Regression: ROC-AUC 0.944, Precision@0.5 0.966, Recall@0.5 0.929
- Random Forest: ROC-AUC 0.950, Precision@0.5 0.966, Recall@0.5 0.929
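For readers unfamiliar with the `@0.5` notation: precision and recall at a threshold can be computed directly from predicted probabilities. A from-scratch sketch with invented toy scores (not the repository's data):

```python
def precision_recall_at(y_true, y_prob, t=0.5):
    """Precision and recall when flagging tests with score >= t as flaky."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# invented toy scores: 1 = labeled flaky
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
print(precision_recall_at(y_true, y_prob, 0.5))  # (0.75, 0.75)
```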
CI Threshold Simulation (Logistic Regression)
- t=0.30 -> estimated policy cost 548.00
- t=0.50 -> estimated policy cost 548.00
- t=0.70 -> estimated policy cost 569.33 (+21.33)
Key Takeaway
Model quality is important, but CI impact depends heavily on threshold policy and false-negative cost trade-offs.
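One way to make that trade-off concrete is a cost model that charges more for a missed flaky test (false negative) than for a needless quarantine (false positive). The weights and toy scores below are illustrative assumptions, not the repository's actual policy costs:

```python
def policy_cost(y_true, y_prob, t, fp_cost=1.0, fn_cost=10.0):
    """Estimated CI policy cost at threshold t.

    Assumption: a missed flaky test (FN) wastes far more engineering
    time than a needlessly quarantined stable test (FP).
    """
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
    return fp_cost * fp + fn_cost * fn

# invented toy scores: 1 = labeled flaky
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
for t in (0.3, 0.5, 0.7):
    print(t, policy_cost(y_true, y_prob, t))  # 2.0, 11.0, 10.0
```

Even on this toy data the cheapest threshold is the lowest one, because false negatives dominate the cost; the optimal threshold shifts as the FP/FN cost ratio changes, which is exactly why the same model can yield different policy costs at different thresholds.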
Reproducible Artifacts
- data/processed/sample_features.csv
- models/results/baseline_metrics.json
- ci_integration/threshold_scenarios.csv