Artem

Posted on Jun 28

Catching Flaky Tests Before They Hit CI: Meet FlakyDetector

#python #testing #machinelearning #devops

By Artem Alimpiev — Python Backend Developer

Every engineering team has that test.

The one that passes locally, passes on your teammate’s machine, passes three times in CI… and then suddenly fails at 2 AM for absolutely no reason.

So somebody reruns the pipeline.

Again.

And again.

Eventually the build goes green, everyone moves on, and the flaky test quietly stays in the repository like a cursed artifact nobody wants to touch.

I’ve seen this happen too many times in real projects. And honestly, the bigger the infrastructure becomes, the worse the problem gets.

At some point I started asking myself a simple question:

Why are we still detecting flaky tests after they break CI?

That question eventually turned into FlakyDetector — an AST + Machine Learning powered system for detecting flaky tests before they hit your pipelines.

GitHub Repository:
https://github.com/Artem7898/flakydetector

Flaky Tests Are More Expensive Than Most Teams Realize

People often treat flaky tests as “just annoying.”

But flaky tests are actually infrastructure debt.

They:

slow down releases
waste CI resources
destroy trust in automation
create false positives (failures with no real bug behind them)
normalize ignoring failed builds

And that last one is especially dangerous.

Because once engineers stop trusting CI, the entire feedback loop starts collapsing.

Google engineers have reported that almost 16% of their tests show some level of flakiness, and that about 1.5% of all test runs report a flaky result rather than a real regression.

Think about that for a second.

Imagine debugging failures that aren’t even real bugs.

Now combine that with:

async systems
distributed services
parallel test execution
unstable timing
shared global state
external APIs

Suddenly your test suite behaves less like deterministic engineering and more like a physics experiment.

Most Flaky Detection Tools React Too Late

The majority of flaky-test tooling works reactively.

Usually the workflow looks like this:

Write test
↓
Push code
↓
CI randomly fails
↓
Retry pipeline
↓
Lose 40 minutes of engineering time

Traditional systems rely on:

rerun statistics
CI telemetry
historical failure tracking
probabilistic heuristics

Useful? Absolutely.

Preventive? Not really.

I wanted something different.

I wanted a system that could analyze the source code itself and detect risky patterns before the tests ever started failing in production CI environments.

That’s where AST analysis became incredibly interesting.

AST Analysis: Looking at Code Structurally

If you’ve never worked with Python ASTs directly, here’s the simple explanation.

Python code isn’t just text.

Under the hood, Python converts code into an Abstract Syntax Tree — a structured representation the interpreter can reason about.

For example:

time.sleep(5)

isn’t stored as plain text internally.

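It becomes a tree of nodes:

a function call
an attribute lookup (sleep) on a name (time)
a constant argument (5)

From that structure, an analyzer can infer which module is being referenced and that the call introduces a timing dependency.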

And once you have structure, you can detect patterns.

That changes everything.
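To make that concrete, here is a minimal sketch using Python's standard `ast` module. It is only an illustration of what the parser produces for `time.sleep(5)`, not code taken from FlakyDetector.

```python
import ast

source = "time.sleep(5)"
tree = ast.parse(source)

# The single statement is an expression whose value is a Call node.
call = tree.body[0].value
assert isinstance(call, ast.Call)

# call.func is an Attribute node: the attribute `sleep` looked up on the Name `time`.
module_name = call.func.value.id   # the name the call is made on
function_name = call.func.attr     # the attribute being called
argument = call.args[0].value      # the constant passed as the first argument

# Print the full tree, then the three pieces pulled out of it.
print(ast.dump(tree, indent=2))
print(module_name, function_name, argument)
```

Everything a rule needs (what is called, on which name, with which argument) is available as plain attributes on the nodes.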

FlakyDetector scans Python test suites and searches for architectural anti-patterns associated with nondeterministic behavior.

Examples include:

`time.sleep()`
`datetime.now()`
mutable global state
unmocked network requests
dangerous fixture scopes
resource leakage
high cyclomatic complexity

Or in testing terminology:

Test Smells.

Yes, that’s a real technical term. And yes, it sounds slightly ridiculous.
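A rough sketch of how a few of these smells could be detected with an `ast.NodeVisitor` follows. The `Smell` class, the `RISKY_CALLS` table, and the labels are hypothetical names made up for this example; FlakyDetector's real rules are likely more thorough.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Smell:
    name: str
    line: int


# Hypothetical rule table: (object name, attribute) -> smell label.
RISKY_CALLS = {
    ("time", "sleep"): "fixed sleep",
    ("datetime", "now"): "wall-clock dependency",
    ("requests", "get"): "possible unmocked network call",
}


class SmellVisitor(ast.NodeVisitor):
    def __init__(self) -> None:
        self.smells: list[Smell] = []

    def visit_Call(self, node: ast.Call) -> None:
        func = node.func
        # Only matches the simple `name.attr(...)` shape; `datetime.datetime.now()`
        # or aliased imports would need extra handling.
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            label = RISKY_CALLS.get((func.value.id, func.attr))
            if label:
                self.smells.append(Smell(label, node.lineno))
        self.generic_visit(node)  # keep walking into nested calls

    def visit_Global(self, node: ast.Global) -> None:
        self.smells.append(Smell("mutable global state", node.lineno))


def find_smells(source: str) -> list[Smell]:
    visitor = SmellVisitor()
    visitor.visit(ast.parse(source))
    return visitor.smells


sample = '''
import time, requests
from datetime import datetime

def test_order_is_processed():
    global cache
    requests.get("https://example.com/orders")
    time.sleep(2)
    assert datetime.now().hour < 24
'''

for smell in find_smells(sample):
    print(f"line {smell.line}: {smell.name}")
```

The source is only parsed, never executed, so a check like this can run on a pull request without importing the test suite.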

Why I Added Machine Learning

At first, FlakyDetector started as a pure AST rule engine.

But I quickly ran into a problem.

Real flaky behavior is rarely caused by a single issue.

Usually it’s a combination of:

timing dependencies
complexity
state mutations
async interactions
fixture misuse

That means simple rule matching eventually hits limits.

So I added an ML classification layer using CatBoost.

The pipeline now looks roughly like this:

Python Test Code
↓
AST Parsing
↓
Feature Extraction
↓
42-Dimensional Feature Vector
↓
CatBoost Classification
↓
Flaky Probability + Severity

And honestly, CatBoost turned out to be a surprisingly strong fit.

Most developers associate CatBoost with recommendation systems or business analytics. But it’s extremely good at structured tabular feature classification.

Which is exactly what AST-derived metrics become.

The 42-Feature Detection System

This is where the project became much more than “another linter.”

FlakyDetector currently extracts a 42-dimensional feature space.

The features include:

Feature Group | Count
AST Features | 16
Category Features | 9
Fixture Analysis | 5
Derived Metrics | 3
Confidence Scores | 8
Test Smells |

Examples include:

timing dependency counters
network interaction detection
mutation ratios
fixture scope analysis
pattern diversity metrics
cyclomatic complexity scoring

The system then classifies the probability of flakiness and assigns severity levels.
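As an illustration of what "feature extraction" can mean here, the sketch below turns a test's source into a small fixed-order numeric vector. The five feature names and the complexity estimate (one plus the number of branching nodes) are assumptions chosen for the example, not the actual 42 features.

```python
import ast

# Hypothetical feature names, in a fixed order so every test maps to the same layout.
FEATURE_NAMES = [
    "sleep_calls",
    "now_calls",
    "global_statements",
    "complexity_estimate",
    "pattern_diversity",
]

# Node types counted as decision points for the rough complexity estimate.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)


def _is_call_to(node: ast.AST, obj: str, attr: str) -> bool:
    """True for calls shaped like `obj.attr(...)`."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and isinstance(node.func.value, ast.Name)
        and node.func.value.id == obj
        and node.func.attr == attr
    )


def extract_features(source: str) -> list[float]:
    nodes = list(ast.walk(ast.parse(source)))
    sleep_calls = sum(_is_call_to(n, "time", "sleep") for n in nodes)
    now_calls = sum(_is_call_to(n, "datetime", "now") for n in nodes)
    global_statements = sum(isinstance(n, ast.Global) for n in nodes)
    branches = sum(isinstance(n, BRANCH_NODES) for n in nodes)

    # Share of smell categories that appear at least once.
    counts = [sleep_calls, now_calls, global_statements]
    pattern_diversity = sum(1 for c in counts if c > 0) / len(counts)

    return [
        float(sleep_calls),
        float(now_calls),
        float(global_statements),
        float(1 + branches),
        pattern_diversity,
    ]


sample = '''
def test_retry():
    for attempt in range(3):
        if attempt:
            time.sleep(1)
    assert datetime.now().year > 2000
'''

for name, value in zip(FEATURE_NAMES, extract_features(sample)):
    print(f"{name}: {value:.2f}")
```

A vector like this is plain tabular data, which is the kind of input a gradient-boosted classifier is built for.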

But the important part is this:

The model is explainable.

That matters a lot in developer tooling.

Nobody wants a black-box AI saying:

“Your test is dangerous. Trust me.”

So FlakyDetector exposes:

confidence levels
detected anti-patterns
feature importance
severity categories

The goal is not just prediction.

The goal is understanding.
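One way such an explanation could be assembled is sketched below. The severity thresholds, the `Report` class, and the numbers in the usage example are all invented for illustration; they are not FlakyDetector's real values or API.

```python
from dataclasses import dataclass

# Hypothetical cut-offs mapping a flaky probability to a severity label.
SEVERITY_THRESHOLDS = [(0.8, "critical"), (0.5, "high"), (0.2, "medium")]


def severity_for(probability: float) -> str:
    for threshold, label in SEVERITY_THRESHOLDS:
        if probability >= threshold:
            return label
    return "low"


@dataclass
class Report:
    test_name: str
    probability: float
    importances: dict[str, float]  # feature name -> contribution to the score

    def explain(self, top_n: int = 2) -> str:
        # Show the strongest signals first so the reader sees *why*, not just *what*.
        top = sorted(self.importances.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        reasons = ", ".join(f"{name} ({weight:.2f})" for name, weight in top)
        return (
            f"{self.test_name}: {severity_for(self.probability)} "
            f"(p={self.probability:.2f}); top signals: {reasons}"
        )


# Made-up numbers, purely to show the output shape.
report = Report(
    test_name="test_checkout",
    probability=0.83,
    importances={"sleep_calls": 0.41, "fixture_scope": 0.12, "now_calls": 0.27},
)
print(report.explain())
```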

The Architecture: Clean, Fast, and CI-Friendly

The project follows a Hexagonal Architecture approach.

That means the core analysis engine stays isolated from:

UI
infrastructure
APIs
storage layers
CI integrations
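In code, that isolation usually comes down to the core depending on an interface (a "port") while the outside world supplies implementations ("adapters"). The sketch below shows the idea with `typing.Protocol`; the class names and the deliberately naive substring check are placeholders, not the project's real design.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    test_name: str
    severity: str


class ReportSink(Protocol):
    """Port: anything that can receive findings (CLI, HTTP API, CI annotation)."""

    def publish(self, findings: list[Finding]) -> None: ...


class Analyzer:
    """Core: has no idea where the findings end up."""

    def __init__(self, sink: ReportSink) -> None:
        self._sink = sink

    def run(self, sources: dict[str, str]) -> None:
        # Placeholder rule; a real core would do AST analysis here.
        findings = [
            Finding(name, "critical")
            for name, code in sources.items()
            if "time.sleep(" in code
        ]
        self._sink.publish(findings)


class InMemorySink:
    """Adapter used in tests; a CI adapter could write annotations instead."""

    def __init__(self) -> None:
        self.received: list[Finding] = []

    def publish(self, findings: list[Finding]) -> None:
        self.received.extend(findings)


sink = InMemorySink()
Analyzer(sink).run({"test_a": "time.sleep(1)", "test_b": "assert True"})
print(sink.received)
```

Swapping `InMemorySink` for a FastAPI response or a GitHub Actions annotation writer would not require touching `Analyzer`.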

The stack currently includes:

Python 3.12
FastAPI
Pydantic v2
CatBoost
React + Vite
ChromaDB
Ollama
Docker
GitHub Actions

I also focused heavily on developer experience.

The project uses:

uv for extremely fast dependency management
ruff for linting and formatting
pyright in strict mode for type checking
pre-commit
Dockerized infrastructure

Because let’s be honest:

Nobody wants reliability tooling that itself becomes maintenance debt.

GitHub Actions Integration Is Where It Gets Practical

This is probably the most important feature for real teams.

FlakyDetector can block problematic tests directly inside CI pipelines.

Example:

- name: Run FlakyDetector
  run: uv run python scripts/scan_folder.py ./tests --fail-on-critical

If the system detects critical anti-patterns, the pipeline fails immediately.

That means developers catch instability risks during code review instead of after the test suite starts randomly exploding three weeks later.

This is especially useful for:

fintech platforms
SaaS systems
async Python services
microservice environments
ML infrastructure
large monorepos

The RAG + LLM Layer Sounds Weird… But It’s Useful

One experimental feature I added integrates:

ChromaDB
vector search
local LLMs via Ollama
semantic instability analysis

At first glance it sounds like:

“Congratulations, we added AI to flaky tests.”

Which honestly made me laugh too.

But there’s a practical reason behind it.

Large repositories often contain repeated instability patterns across multiple teams.

The semantic search layer allows engineers to find tests with similar architectural problems.

For example:

“Show me tests similar to this flaky async Redis integration.”

That becomes surprisingly powerful in enterprise-scale repositories.
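To show the shape of a "find similar tests" query without pulling in ChromaDB or an LLM, here is a toy stand-in: it swaps real embeddings for token-count vectors and ranks tests by cosine similarity. The function names and the two-test corpus are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> Counter:
    """Crude bag of identifiers; a real system would use learned embeddings."""
    return Counter(re.findall(r"[A-Za-z_]+", code.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query: str, corpus: dict[str, str], top_n: int = 1) -> list[str]:
    query_vector = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda name: cosine(query_vector, tokenize(corpus[name])),
        reverse=True,
    )
    return ranked[:top_n]


corpus = {
    "test_redis_cache": "async def test_redis_cache(): await redis.get('k'); await asyncio.sleep(1)",
    "test_math": "def test_math(): assert add(1, 2) == 3",
}

print(most_similar("await redis.set('k', 1); await asyncio.sleep(2)", corpus))
```

A vector store does the same ranking job at scale, with embeddings that capture meaning rather than shared identifiers.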

It’s still evolving, but I think AI-assisted reliability engineering is going to become much more important over the next few years.

Why I Think This Problem Matters

Modern engineering teams invest enormous effort into:

observability
security
performance
type safety
infrastructure automation

But test reliability is still oddly under-engineered.

We accept flaky tests as “normal.”

That’s strange if you think about it.

Because unstable tests don’t just waste CI minutes.

They quietly destroy confidence in the engineering process itself.

FlakyDetector is my attempt to treat flaky testing as an architectural problem instead of random chaos.

And honestly, I think the industry needs more tools that shift reliability checks earlier into the development lifecycle.

The same way:

CodeQL changed security scanning
mypy changed Python typing
Ruff changed linting performance

Static reliability analysis could become a completely normal part of CI pipelines.

Try It Yourself

Quick setup:

git clone https://github.com/Artem7898/flakydetector

cd flakydetector

uv venv --python 3.12 

source .venv/bin/activate 

uv pip install -e ".[dev]" 

python scripts/train_model.py

Run a scan:

uv run python scripts/scan_folder.py ./tests/

And yes…

There’s a good chance it finds something uncomfortable in your test suite.

One Final Question

How many flaky tests are currently sitting in your repository pretending to be “temporary”?

And how much engineering time are they silently burning every single week?

That’s probably worth thinking about.

If this article was useful — follow me for more deep dives into Python backend engineering, CI/CD systems, AST tooling, static analysis, and modern developer infrastructure.

I’d genuinely love to hear how your team deals with flaky tests today.

Retry buttons?
Quarantine lists?
Pure denial?

Write in the comments — I’m curious.