Catching Flaky Tests Before They Hit CI: Meet FlakyDetector

By Artem Alimpiev — Python Backend Developer

Every engineering team has that test.

The one that passes locally, passes on your teammate’s machine, passes three times in CI… and then suddenly fails at 2 AM for absolutely no reason.

So somebody reruns the pipeline.

Again.

And again.

Eventually the build goes green, everyone moves on, and the flaky test quietly stays in the repository like a cursed artifact nobody wants to touch.

I’ve seen this happen too many times in real projects. And honestly, the bigger the infrastructure becomes, the worse the problem gets.

At some point I started asking myself a simple question:

Why are we still detecting flaky tests after they break CI?

That question eventually turned into FlakyDetector — an AST + Machine Learning powered system for detecting flaky tests before they hit your pipelines.

GitHub repository: https://github.com/Artem7898/flakydetector

Flaky Tests Are More Expensive Than Most Teams Realize

People often treat flaky tests as “just annoying.”

But flaky tests are actually infrastructure debt.

They:

  • slow down releases
  • waste CI resources
  • destroy trust in automation
  • create false positives (red builds with no real bug behind them)
  • normalize ignoring failed builds

And that last one is especially dangerous.

Because once engineers stop trusting CI, the entire feedback loop starts collapsing.

Google engineers once reported that a significant percentage of test failures inside large CI systems were caused by flakiness rather than real regressions.

Think about that for a second.

Imagine debugging failures that aren’t even real bugs.

Now combine that with:

  • async systems
  • distributed services
  • parallel test execution
  • unstable timing
  • shared global state
  • external APIs

Suddenly your test suite behaves less like deterministic engineering and more like a physics experiment.

Most Flaky Detection Tools React Too Late

The majority of flaky-test tooling works reactively.

Usually the workflow looks like this:

Write test
  ↓
Push code
  ↓
CI randomly fails
  ↓
Retry pipeline
  ↓
Lose 40 minutes of engineering time

Traditional systems rely on:

  • rerun statistics
  • CI telemetry
  • historical failure tracking
  • probabilistic heuristics

Useful? Absolutely.

Preventive? Not really.

I wanted something different.

I wanted a system that could analyze the source code itself and detect risky patterns before the tests ever started failing in production CI environments.

That’s where AST analysis became incredibly interesting.

AST Analysis: Looking at Code Structurally

If you’ve never worked with Python ASTs directly, here’s the simple explanation.

Python code isn’t just text.

Under the hood, Python parses source code into an Abstract Syntax Tree (AST): a tree of nodes that the compiler then turns into bytecode, and that tools can inspect through the standard ast module.

For example:

time.sleep(5)

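isn’t handled as plain text internally.

It becomes a small tree of nodes:

  • a function call (the sleep(...) invocation)
  • a module reference (the time name the attribute is read from)
  • a constant argument (the 5)

From those nodes an analyzer can infer higher-level facts, such as an execution dependency on wall-clock timing.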

And once you have structure, you can detect patterns.

That changes everything.
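To make the structure concrete, here is a minimal sketch that parses the same one-liner with the standard library's ast module and inspects the resulting nodes. It only illustrates what the tree looks like; it is not taken from FlakyDetector.

```python
import ast

# Parse the snippet; nothing is executed, so `time` does not need to be imported.
tree = ast.parse("time.sleep(5)")

# The module body holds one expression statement whose value is the Call node.
call = tree.body[0].value

print(type(call).__name__)       # Call
print(type(call.func).__name__)  # Attribute
print(call.func.value.id)        # time
print(call.func.attr)            # sleep
print(call.args[0].value)        # 5
```

Every later check, from simple rules to feature counting, is a question asked of nodes like these.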

FlakyDetector scans Python test suites and searches for architectural anti-patterns associated with nondeterministic behavior.

Examples include:

  • `time.sleep()`
  • `datetime.now()`
  • mutable global state
  • unmocked network requests
  • dangerous fixture scopes
  • resource leakage
  • high cyclomatic complexity

Or in testing terminology:

Test Smells.

Yes, that’s a real technical term. And yes, it sounds slightly ridiculous.
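A rule of this kind can be sketched with ast.NodeVisitor. The example below is an assumption-heavy illustration, not FlakyDetector's actual rule engine: the RISKY_CALLS table, the SmellVisitor class and the smell labels are all hypothetical names chosen for the demo.

```python
import ast

# Hypothetical rule table: dotted call name -> smell label.
RISKY_CALLS = {
    "time.sleep": "fixed sleep",
    "datetime.now": "wall-clock read",
}

def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

class SmellVisitor(ast.NodeVisitor):
    """Collect (test name, line, label) for risky calls inside test_* functions."""

    def __init__(self):
        self.findings = []
        self._current_test = None

    def visit_FunctionDef(self, node):
        previous = self._current_test
        if node.name.startswith("test_"):
            self._current_test = node.name
        self.generic_visit(node)
        self._current_test = previous

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Call(self, node):
        name = dotted_name(node.func)
        if self._current_test and name:
            for risky, label in RISKY_CALLS.items():
                # Also match longer chains such as datetime.datetime.now.
                if name == risky or name.endswith("." + risky):
                    self.findings.append((self._current_test, node.lineno, label))
        self.generic_visit(node)

SOURCE = '''
import time
from datetime import datetime

def helper():
    time.sleep(1)

def test_order_expires():
    created = datetime.now()
    time.sleep(2)
    assert created is not None
'''

visitor = SmellVisitor()
visitor.visit(ast.parse(SOURCE))
for finding in visitor.findings:
    print(finding)
```

The sleep inside helper() is ignored because it is not in a test function; a real tool would likely follow calls from tests into helpers as well.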

Why I Added Machine Learning

At first, FlakyDetector started as a pure AST rule engine.

But I quickly ran into a problem.

Real flaky behavior is rarely caused by a single issue.

Usually it’s a combination of:

  • timing dependencies
  • complexity
  • state mutations
  • async interactions
  • fixture misuse

That means simple rule matching eventually hits limits.

So I added an ML classification layer using CatBoost.

The pipeline now looks roughly like this:

Python Test Code
  ↓
AST Parsing
  ↓
Feature Extraction
  ↓
42-Dimensional Feature Vector
  ↓
CatBoost Classification
  ↓
Flaky Probability + Severity

And honestly, CatBoost turned out to be a surprisingly strong fit.

Most developers associate CatBoost with recommendation systems or business analytics. But it’s extremely good at structured tabular feature classification.

Which is exactly what AST-derived metrics become.

The 42-Feature Detection System

This is where the project became much more than “another linter.”

FlakyDetector currently extracts a 42-dimensional feature vector for each test.

The features include:

Feature Group        Count
AST Features         16
Category Features    9
Fixture Analysis     5
Derived Metrics      3
Confidence Scores    8
Test Smells

Examples include:

  • timing dependency counters
  • network interaction detection
  • mutation ratios
  • fixture scope analysis
  • pattern diversity metrics
  • cyclomatic complexity scoring

The system then classifies the probability of flakiness and assigns severity levels.

But the important part is this:

The model is explainable.

That matters a lot in developer tooling.

Nobody wants a black-box AI saying:

“Your test is dangerous. Trust me.”

So FlakyDetector exposes:

  • confidence levels
  • detected anti-patterns
  • feature importance
  • severity categories

The goal is not just prediction.

The goal is understanding.
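One way to picture such an explainable result is a small report object that carries the evidence alongside the score. This is a sketch under stated assumptions: the FlakinessReport class, its field names and the severity cut-offs are all hypothetical and are not FlakyDetector's real output schema.

```python
from dataclasses import dataclass, field

# Hypothetical cut-offs; the real project's thresholds may differ.
SEVERITY_THRESHOLDS = [(0.8, "critical"), (0.5, "high"), (0.2, "medium")]

def severity_for(probability):
    """Map a flakiness probability to a coarse severity label."""
    for cutoff, label in SEVERITY_THRESHOLDS:
        if probability >= cutoff:
            return label
    return "low"

@dataclass
class FlakinessReport:
    test_name: str
    probability: float
    detected_patterns: list = field(default_factory=list)
    feature_importance: dict = field(default_factory=dict)

    @property
    def severity(self):
        return severity_for(self.probability)

    def top_features(self, n=3):
        """Names of the n features that contributed most to the score."""
        ranked = sorted(self.feature_importance.items(), key=lambda kv: kv[1], reverse=True)
        return [name for name, _ in ranked[:n]]

report = FlakinessReport(
    test_name="test_retry",
    probability=0.86,
    detected_patterns=["fixed sleep", "wall-clock read"],
    feature_importance={"sleep_calls": 0.41, "clock_reads": 0.22, "branch_count": 0.05},
)
print(report.severity)         # critical
print(report.top_features(2))  # ['sleep_calls', 'clock_reads']
```

The point of the shape is that a reviewer can see why a test was flagged, not only that it was.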

The Architecture: Clean, Fast, and CI-Friendly

The project follows a Hexagonal Architecture approach.

That means the core analysis engine stays isolated from:

  • UI
  • infrastructure
  • APIs
  • storage layers
  • CI integrations
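In ports-and-adapters terms, that isolation usually means the core depends on small interfaces and never on concrete infrastructure. A minimal sketch, assuming hypothetical names (ReportSink, Analyzer, InMemorySink) and a placeholder scoring rule that has nothing to do with the real model:

```python
from typing import Protocol

class ReportSink(Protocol):
    """Port: anything that can receive an analysis result."""

    def publish(self, test_name: str, probability: float) -> None: ...

class Analyzer:
    """Core logic. It knows the port, never a concrete adapter."""

    def __init__(self, sink: ReportSink) -> None:
        self._sink = sink

    def analyze(self, test_name: str, sleep_calls: int) -> float:
        # Placeholder rule, only here to make the example runnable.
        probability = min(1.0, 0.3 * sleep_calls)
        self._sink.publish(test_name, probability)
        return probability

class InMemorySink:
    """Adapter for unit tests; an HTTP or CI adapter would expose the same method."""

    def __init__(self) -> None:
        self.records = []

    def publish(self, test_name: str, probability: float) -> None:
        self.records.append((test_name, probability))

sink = InMemorySink()
Analyzer(sink).analyze("test_retry", sleep_calls=2)
print(sink.records)  # [('test_retry', 0.6)]
```

Swapping InMemorySink for an API, storage or CI adapter requires no change to Analyzer, which is the property the architecture is after.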

The stack currently includes:

  • Python 3.12
  • FastAPI
  • Pydantic v2
  • CatBoost
  • React + Vite
  • ChromaDB
  • Ollama
  • Docker
  • GitHub Actions

I also focused heavily on developer experience.

The project uses:

  • uv for extremely fast dependency management
  • ruff for linting and formatting
  • pyright in strict mode for type checking
  • pre-commit
  • Dockerized infrastructure

Because let’s be honest:

Nobody wants reliability tooling that itself becomes maintenance debt.

GitHub Actions Integration Is Where It Gets Practical

This is probably the most important feature for real teams.

FlakyDetector can block problematic tests directly inside CI pipelines.

Example:

- name: Run FlakyDetector
  run: uv run python scripts/scan_folder.py ./tests --fail-on-critical

If the system detects critical anti-patterns, the pipeline fails immediately.

That means developers catch instability risks during code review instead of after the test suite starts randomly exploding three weeks later.
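The mechanism behind a gate like this is plain process exit codes: GitHub Actions marks a step as failed when its command exits with a non-zero status. The sketch below shows that decision in isolation; the exit_code function and the severity labels are assumptions for illustration, not the contents of scripts/scan_folder.py.

```python
def exit_code(severities, fail_on_critical):
    """Return 1 if the gate should fail the CI step, else 0."""
    if fail_on_critical and "critical" in severities:
        return 1
    return 0

# Hypothetical scan results: one severity label per analyzed test.
found = ["low", "medium", "critical"]

print(exit_code(found, fail_on_critical=True))   # 1
print(exit_code(found, fail_on_critical=False))  # 0
```

A command-line script would pass that value to sys.exit(), so the same scan can run in report-only mode locally and in blocking mode in CI.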

This is especially useful for:

  • fintech platforms
  • SaaS systems
  • async Python services
  • microservice environments
  • ML infrastructure
  • large monorepos

The RAG + LLM Layer Sounds Weird… But It’s Useful

One experimental feature I added integrates:

  • ChromaDB
  • vector search
  • local LLMs via Ollama
  • semantic instability analysis

At first glance it sounds like:

“Congratulations, we added AI to flaky tests.”

Which honestly made me laugh too.

But there’s a practical reason behind it.

Large repositories often contain repeated instability patterns across multiple teams.

The semantic search layer allows engineers to find tests with similar architectural problems.

For example:

“Show me tests similar to this flaky async Redis integration.”

That becomes surprisingly powerful in enterprise-scale repositories.
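The retrieval idea can be shown without any vector database. The sketch below swaps in a deliberately crude technique, token counts compared by cosine similarity, as a stand-in for learned embeddings stored in ChromaDB. The test names and descriptions are made up for the demo.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': token counts. A real setup would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical index: test name -> short description of what the test touches.
TESTS = {
    "test_redis_cache_async": "async redis client sleep retry cache",
    "test_invoice_total": "pure function arithmetic invoice total",
    "test_redis_pubsub": "async redis pubsub sleep timeout",
}

def most_similar(query, k=2):
    """Return the k indexed tests closest to the query."""
    q = embed(query)
    ranked = sorted(TESTS, key=lambda name: cosine(q, embed(TESTS[name])), reverse=True)
    return ranked[:k]

print(most_similar("flaky async redis integration"))
# ['test_redis_pubsub', 'test_redis_cache_async']
```

With real embeddings the match no longer depends on shared words, which is what lets a query surface tests that are structurally similar but named differently.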

It’s still evolving, but I think AI-assisted reliability engineering is going to become much more important over the next few years.

Why I Think This Problem Matters

Modern engineering teams invest enormous effort into:

  • observability
  • security
  • performance
  • type safety
  • infrastructure automation

But test reliability is still oddly under-engineered.

We accept flaky tests as “normal.”

That’s strange if you think about it.

Because unstable tests don’t just waste CI minutes.

They quietly destroy confidence in the engineering process itself.

FlakyDetector is my attempt to treat flaky testing as an architectural problem instead of random chaos.

And honestly, I think the industry needs more tools that shift reliability checks earlier into the development lifecycle.

The same way:

  • CodeQL changed security scanning
  • mypy changed Python typing
  • Ruff changed linting performance

Static reliability analysis could become a completely normal part of CI pipelines.

Try It Yourself

Quick setup:

git clone https://github.com/Artem7898/flakydetector
cd flakydetector
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
python scripts/train_model.py

Run a scan:

uv run python scripts/scan_folder.py ./tests/

And yes…

There’s a good chance it finds something uncomfortable in your test suite.

One Final Question

How many flaky tests are currently sitting in your repository pretending to be “temporary”?

And how much engineering time are they silently burning every single week?

That’s probably worth thinking about.

If this article was useful — follow me for more deep dives into Python backend engineering, CI/CD systems, AST tooling, static analysis, and modern developer infrastructure.

I’d genuinely love to hear how your team deals with flaky tests today.

Retry buttons?
Quarantine lists?
Pure denial?

Write in the comments — I’m curious.
