Rajeev Srivastava

Detecting Flaky Tests in CI/CD Using Machine Learning: A Research Approach

The Problem

In modern CI/CD environments, automated tests are expected to provide fast and reliable feedback. However, flaky tests, which pass and fail intermittently without any code change, introduce instability into the pipeline.

A flaky test may:

  • Pass locally but fail in CI
  • Fail due to timing issues or race conditions
  • Fail because of shared state or environment dependencies

Over time, flaky tests reduce trust in automation and slow down engineering velocity.


Why It Damages CI/CD Velocity

When a test fails, engineers must decide:

  • Is this a real regression?
  • Or just another flaky failure?

This uncertainty causes:

  • Repeated pipeline reruns
  • Increased build time
  • Delayed releases
  • Developer frustration

In high-frequency deployment environments, flaky tests silently become productivity killers.


Why Traditional Approaches Fail

Several mitigation strategies are commonly used:

1. Reruns

Automatically rerunning failed tests may hide instability but does not eliminate the root cause.

2. Retry Logic

Retrying tests reduces visible failures but increases pipeline time and masks systemic issues.

3. Manual Tagging

Marking tests as flaky requires human intervention and constant maintenance.

All these methods are reactive rather than predictive.


Proposed Machine Learning Approach

Instead of reacting to flaky behavior, we can attempt to predict it.

The idea is to model test instability using historical execution data.

Feature Engineering

Potential predictive signals include:

  • Historical failure frequency
  • Time between failures
  • Execution duration variance
  • Commit correlation patterns
  • Environment-specific behavior

These features can be extracted from CI execution logs.
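As a sketch of how some of these signals could be derived, the helper below computes failure frequency, duration variance, and pass/fail flips from raw run records. The record schema (dicts with `status` and `duration` fields) is an assumed format for illustration, not any specific CI provider's log layout:

```python
import statistics

def extract_features(runs):
    """Derive per-test instability signals from a run history.

    `runs` is an assumed log format: dicts with "status" ("pass"/"fail")
    and "duration" (seconds), ordered oldest first.
    """
    outcomes = [r["status"] == "fail" for r in runs]
    durations = [r["duration"] for r in runs]
    # Historical failure frequency.
    failure_rate = sum(outcomes) / len(outcomes)
    # Execution duration variance: unstable tests often vary widely.
    duration_variance = statistics.pvariance(durations)
    # Pass/fail flips: transitions between consecutive runs.
    flip_count = sum(a != b for a, b in zip(outcomes, outcomes[1:]))
    return {
        "failure_rate": failure_rate,
        "duration_variance": duration_variance,
        "flip_count": flip_count,
    }
```

Time-between-failures and commit-correlation features would need timestamps and commit hashes in the same records; the shape of the extraction is the same.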

Labeling Strategy

A test can be labeled as flaky if:

  • It alternates between pass and fail without related code changes
  • Failure patterns show inconsistency over multiple builds

This labeling enables supervised learning.
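The two labeling conditions above can be expressed as a small heuristic. The flip threshold and the input shape here are illustrative assumptions, not a rule from the post:

```python
def label_flaky(outcomes, changed_code, flip_threshold=2):
    """Heuristic label: a test is flaky if it flips between pass and
    fail at least `flip_threshold` times across runs where the code
    under test did not change.

    `outcomes` is a list of booleans (True = pass); `changed_code` is a
    parallel list marking runs that followed a related code change.
    """
    flips = 0
    for prev, curr, changed in zip(outcomes, outcomes[1:], changed_code[1:]):
        # Only count a pass/fail transition as suspicious when no
        # related code change explains it.
        if prev != curr and not changed:
            flips += 1
    return flips >= flip_threshold
```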

Model Selection

Initial models for experimentation:

  • Logistic Regression
  • Random Forest
  • Gradient Boosting

These models can classify tests into:

  • Stable
  • Potentially flaky
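A minimal sketch of this comparison, assuming scikit-learn: the feature columns and the labeling rule below are synthetic stand-ins (failure rate, duration variance, flip count), not the repository's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
# Synthetic feature matrix: failure_rate, duration_variance, flip_count.
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.exponential(0.5, n),
    rng.integers(0, 10, n),
])
# Illustrative labeling rule: frequent failures plus many flips = flaky.
y = ((X[:, 0] > 0.3) & (X[:, 2] > 3)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
accuracies = {}
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_tr, y_tr)
    accuracies[type(model).__name__] = model.score(X_te, y_te)
print(accuracies)
```

Gradient boosting drops in the same way (e.g. `GradientBoostingClassifier`); the point is that all three share the fit/score interface, so swapping models is cheap during experimentation.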

Initial Experimental Setup

To ensure this research remains independent and reproducible:

  • Test framework: Playwright
  • CI data source: Synthetic execution logs
  • Dataset: Artificially generated instability patterns

No proprietary or company data is used.

The dataset simulates:

  • Random intermittent failures
  • Timing-based instability
  • Controlled failure injection
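These instability patterns can be simulated with a small generator. The modes and their parameters (failure probability, duration distribution, deadline) are illustrative assumptions, not the repository's exact injection scheme:

```python
import random

def simulate_runs(n_runs, mode, seed=0):
    """Generate a synthetic pass/fail history for one test.

    Assumed instability patterns:
      - "stable": always passes
      - "random": fails intermittently with a fixed probability
      - "timing": fails when a simulated duration exceeds a deadline
    """
    rnd = random.Random(seed)
    history = []
    for _ in range(n_runs):
        if mode == "stable":
            status = "pass"
        elif mode == "random":
            status = "fail" if rnd.random() < 0.2 else "pass"
        elif mode == "timing":
            duration = rnd.gauss(1.0, 0.4)  # seconds
            status = "fail" if duration > 1.5 else "pass"
        else:
            raise ValueError(f"unknown mode: {mode}")
        history.append(status)
    return history
```

Seeding the generator keeps every synthetic dataset reproducible, which matters when comparing models across runs.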

Preliminary Results

In early synthetic experiments:

  • Accuracy: ~82%
  • Precision: Moderate
  • Recall: Strong for frequently unstable tests
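For reference, precision and recall on the flaky/stable classification reduce to counting prediction outcomes; the tiny example labels below are made up for illustration:

```python
def precision_recall(y_true, y_pred):
    """Precision: of tests flagged flaky, how many truly were.
    Recall: of truly flaky tests, how many were caught.
    Inputs are parallel lists of 0/1 labels (1 = flaky)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```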

Observations

  • Historical variance in execution duration is a strong indicator
  • Tests with environment-dependent patterns show higher unpredictability
  • Simpler models perform surprisingly well with structured features

These results suggest feasibility, though real-world validation is required.


Next Steps

Future improvements include:

  • Collecting real-world open-source CI datasets
  • Improving feature selection
  • Exploring time-series modeling
  • Integrating predictions directly into CI pipelines

The long-term goal is proactive CI reliability: identifying unstable tests before they disrupt delivery.


🔗 GitHub Repository:

https://github.com/srivastava-rajeev/flaky-test-prediction-ml

Update (Feb 22, 2026): Experimental Results from Reproducible Pipeline

I ran the end-to-end pipeline from this repository:
https://github.com/srivastava-rajeev/flaky-test-prediction-ml

Latest Metrics

CI Threshold Simulation (Logistic Regression)

  • t=0.30 -> estimated policy cost 548.00
  • t=0.50 -> estimated policy cost 548.00
  • t=0.70 -> estimated policy cost 569.33 (+21.33)

Key Takeaway

Model quality is important, but CI impact depends heavily on threshold policy and false-negative cost trade-offs.
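One way to see this trade-off: a policy cost is just false positives and false negatives weighted by what each costs the pipeline. The cost values and example scores below are illustrative assumptions, not the repository's exact cost model:

```python
def policy_cost(y_true, scores, threshold, cost_fp=1.0, cost_fn=10.0):
    """Estimated CI cost of quarantining tests scored above `threshold`.

    Assumed costs: a false positive quarantines a healthy test (cheap
    review); a false negative lets a flaky test keep breaking builds
    (expensive reruns). 1 = flaky in `y_true`.
    """
    cost = 0.0
    for truth, score in zip(y_true, scores):
        predicted_flaky = score >= threshold
        if predicted_flaky and not truth:
            cost += cost_fp        # healthy test quarantined
        elif truth and not predicted_flaky:
            cost += cost_fn        # flaky test slips through
    return cost

labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.2, 0.35]
for t in (0.30, 0.50, 0.70):
    print(f"t={t:.2f} -> estimated policy cost {policy_cost(labels, scores, t):.2f}")
```

With false negatives weighted heavily, raising the threshold quickly gets expensive even when headline accuracy is unchanged, which mirrors the jump at t=0.70 in the results above.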

Reproducible Artifacts

  • data/processed/sample_features.csv
  • models/results/baseline_metrics.json
  • ci_integration/threshold_scenarios.csv
