By Artem Alimpiev — Python Backend Developer
Every engineering team has that test.
The one that passes locally, passes on your teammate’s machine, passes three times in CI… and then suddenly fails at 2 AM for absolutely no reason.
So somebody reruns the pipeline.
Again.
And again.
Eventually the build goes green, everyone moves on, and the flaky test quietly stays in the repository like a cursed artifact nobody wants to touch.
I’ve seen this happen too many times in real projects. And honestly, the bigger the infrastructure becomes, the worse the problem gets.
At some point I started asking myself a simple question:
Why are we still detecting flaky tests after they break CI?
That question eventually turned into FlakyDetector — an AST + Machine Learning powered system for detecting flaky tests before they hit your pipelines.
GitHub repository: https://github.com/Artem7898/flakydetector
Flaky Tests Are More Expensive Than Most Teams Realize
People often treat flaky tests as “just annoying.”
But flaky tests are actually infrastructure debt.
They:
- slow down releases
- waste CI resources
- destroy trust in automation
- create false alarms (failures with no real bug behind them)
- normalize ignoring failed builds
And that last one is especially dangerous.
Because once engineers stop trusting CI, the entire feedback loop starts collapsing.
Google engineers once reported that a significant percentage of test failures inside large CI systems were caused by flakiness rather than real regressions.
Think about that for a second.
Imagine debugging failures that aren’t even real bugs.
Now combine that with:
- async systems
- distributed services
- parallel test execution
- unstable timing
- shared global state
- external APIs
Suddenly your test suite behaves less like deterministic engineering and more like a physics experiment.
Most Flaky Detection Tools React Too Late
The majority of flaky-test tooling works reactively.
Usually the workflow looks like this:
Write test
↓
Push code
↓
CI randomly fails
↓
Retry pipeline
↓
Lose 40 minutes of engineering time
Traditional systems rely on:
- rerun statistics
- CI telemetry
- historical failure tracking
- probabilistic heuristics
Useful? Absolutely.
Preventive? Not really.
I wanted something different.
I wanted a system that could analyze the source code itself and detect risky patterns before the tests ever started failing in production CI environments.
That’s where AST analysis became incredibly interesting.
AST Analysis: Looking at Code Structurally
If you’ve never worked with Python ASTs directly, here’s the simple explanation.
Python code isn’t just text.
Before executing anything, Python parses your code into an Abstract Syntax Tree (AST): a structured representation that tools can inspect and reason about.
For example:
time.sleep(5)
isn’t stored as plain text internally.
It becomes structure that an analyzer can interpret as:
- function call
- module reference
- execution dependency
- timing behavior
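To make that concrete, here is a minimal sketch that uses the standard-library `ast` module to parse the snippet and look at the resulting node. The helper name `describe_call` is my own, chosen for illustration only.

```python
import ast

source = "time.sleep(5)"
tree = ast.parse(source)

# The single statement is an expression whose value is a Call node.
call = tree.body[0].value
assert isinstance(call, ast.Call)


def describe_call(node: ast.Call) -> str:
    """Return a dotted name such as 'time.sleep' for simple calls."""
    func = node.func
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    if isinstance(func, ast.Name):
        return func.id
    return "<dynamic call>"


print(describe_call(call))       # time.sleep
print(ast.dump(call, indent=2))  # full node structure
```

The tree itself only contains nodes such as `Call`, `Attribute`, `Name`, and `Constant`. Labels like "timing behavior" are interpretations that an analyzer layers on top of those nodes.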
And once you have structure, you can detect patterns.
That changes everything.
FlakyDetector scans Python test suites and searches for architectural anti-patterns associated with nondeterministic behavior.
Examples include:
- `time.sleep()`
- `datetime.now()`
- mutable global state
- unmocked network requests
- dangerous fixture scopes
- resource leakage
- high cyclomatic complexity
Or in testing terminology:
Test Smells.
Yes, that’s a real technical term. And yes, it sounds slightly ridiculous.
Why I Added Machine Learning
At first, FlakyDetector started as a pure AST rule engine.
But I quickly ran into a problem.
Real flaky behavior is rarely caused by a single issue.
Usually it’s a combination of:
- timing dependencies
- complexity
- state mutations
- async interactions
- fixture misuse
That means simple rule matching eventually hits limits.
So I added an ML classification layer using CatBoost.
The pipeline now looks roughly like this:
Python Test Code
↓
AST Parsing
↓
Feature Extraction
↓
42-Dimensional Feature Vector
↓
CatBoost Classification
↓
Flaky Probability + Severity
And honestly, CatBoost turned out to be a surprisingly strong fit.
Most developers associate CatBoost with recommendation systems or business analytics. But as a gradient-boosted decision tree library, it's extremely good at classifying structured, tabular features.
Which is exactly what AST-derived metrics become.
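The shape of that pipeline can be sketched end to end with the standard library alone. To keep the example dependency-free, a hand-written logistic function stands in for the trained CatBoost model. The four feature names, the weights, the bias, and the severity thresholds are all invented for illustration and do not come from the project.

```python
import ast
import math

# Hypothetical feature names, in the fixed order the model expects.
FEATURE_ORDER = ["sleep_calls", "now_calls", "global_statements", "branch_count"]


def extract_features(source: str) -> dict[str, float]:
    """Count a few structural signals in the parsed source."""
    counts = dict.fromkeys(FEATURE_ORDER, 0.0)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr == "sleep":
                counts["sleep_calls"] += 1
            elif node.func.attr == "now":
                counts["now_calls"] += 1
        elif isinstance(node, ast.Global):
            counts["global_statements"] += 1
        elif isinstance(node, (ast.If, ast.For, ast.While, ast.Try)):
            counts["branch_count"] += 1
    return counts


def to_vector(features: dict[str, float]) -> list[float]:
    """Turn the feature dict into a fixed-order numeric vector."""
    return [features[name] for name in FEATURE_ORDER]


# Stand-in for the trained classifier: a logistic model with invented weights.
WEIGHTS = [1.2, 0.8, 1.0, 0.3]
BIAS = -2.0


def predict_probability(vector: list[float]) -> float:
    score = BIAS + sum(w * x for w, x in zip(WEIGHTS, vector))
    return 1.0 / (1.0 + math.exp(-score))


def severity(probability: float) -> str:
    """Map a probability to a label; the thresholds are assumptions."""
    if probability >= 0.8:
        return "critical"
    if probability >= 0.5:
        return "warning"
    return "ok"


test_source = '''
import time
from datetime import datetime

def test_token_expires():
    issued = datetime.now()
    time.sleep(1)
    time.sleep(1)
    assert issued is not None
'''

vector = to_vector(extract_features(test_source))
probability = predict_probability(vector)
print(vector, round(probability, 3), severity(probability))
```

With CatBoost installed, `predict_probability` would be replaced by a trained `CatBoostClassifier` and its `predict_proba` method. The rest of the flow stays the same.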
The 42-Feature Detection System
This is where the project became much more than “another linter.”
FlakyDetector currently extracts a 42-dimensional feature space.
The features include:
| Feature Group | Count |
| --- | --- |
| AST Features | 16 |
| Category Features | 9 |
| Fixture Analysis | 5 |
| Derived Metrics | 3 |
| Confidence Scores | 8 |
| Test Smells | |
Examples include:
- timing dependency counters
- network interaction detection
- mutation ratios
- fixture scope analysis
- pattern diversity metrics
- cyclomatic complexity scoring
The system then classifies the probability of flakiness and assigns severity levels.
But the important part is this:
The model is explainable.
That matters a lot in developer tooling.
Nobody wants a black-box AI saying:
“Your test is dangerous. Trust me.”
So FlakyDetector exposes:
- confidence levels
- detected anti-patterns
- feature importance
- severity categories
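As a rough sketch of what such a report could look like, the example below ranks features by their contribution to a score. It assumes a simple linear weighting so the arithmetic stays visible. The `Explanation` type, the weights, and the thresholds are hypothetical and are not FlakyDetector's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    probability: float
    severity: str
    top_features: list[tuple[str, float]]  # (feature name, contribution)


def explain(
    features: dict[str, float],
    weights: dict[str, float],
    probability: float,
    top_n: int = 2,
) -> Explanation:
    """Rank features by the absolute size of weight * value."""
    contributions = {
        name: features.get(name, 0.0) * weight for name, weight in weights.items()
    }
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    if probability >= 0.8:
        severity = "critical"
    elif probability >= 0.5:
        severity = "warning"
    else:
        severity = "ok"
    return Explanation(probability, severity, ranked[:top_n])


report = explain(
    features={"sleep_calls": 2, "now_calls": 1, "branch_count": 1},
    weights={"sleep_calls": 1.2, "now_calls": 0.8, "branch_count": 0.3},
    probability=0.77,
)
print(report.severity, [name for name, _ in report.top_features])
```

A tree-based model does not expose linear weights like this. With CatBoost, per-feature importance would come from the model itself, for example through `get_feature_importance`.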
The goal is not just prediction.
The goal is understanding.
The Architecture: Clean, Fast, and CI-Friendly
The project follows a Hexagonal Architecture approach.
That means the core analysis engine stays isolated from:
- UI
- infrastructure
- APIs
- storage layers
- CI integrations
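In code, that isolation usually means the core depends only on small interfaces (ports), and everything else plugs in as an adapter. Here is a minimal sketch of the idea. All class and method names are illustrative assumptions, and the one-line `analyze` rule is a toy stand-in for the real engine.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Finding:
    path: str
    severity: str


# --- Core: pure logic, no I/O ---------------------------------------------
def analyze(path: str, source: str) -> list[Finding]:
    """Toy rule standing in for the real analysis engine."""
    return [Finding(path, "critical")] if "time.sleep(" in source else []


# --- Ports: what the core needs from the outside world ---------------------
class SourceProvider(Protocol):
    def read_all(self) -> dict[str, str]: ...


class ReportSink(Protocol):
    def publish(self, findings: list[Finding]) -> None: ...


class ScanService:
    """Application service wired to ports, never to concrete infrastructure."""

    def __init__(self, sources: SourceProvider, sink: ReportSink) -> None:
        self._sources = sources
        self._sink = sink

    def run(self) -> int:
        findings = [
            finding
            for path, source in self._sources.read_all().items()
            for finding in analyze(path, source)
        ]
        self._sink.publish(findings)
        return len(findings)


# --- Adapters: swappable implementations (in-memory here) ------------------
class InMemorySources:
    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    def read_all(self) -> dict[str, str]:
        return dict(self._files)


class ListSink:
    def __init__(self) -> None:
        self.published: list[Finding] = []

    def publish(self, findings: list[Finding]) -> None:
        self.published.extend(findings)


sink = ListSink()
files = {"test_a.py": "time.sleep(1)", "test_b.py": "assert True"}
print(ScanService(InMemorySources(files), sink).run(), sink.published)
```

Swapping `InMemorySources` for a filesystem reader, or `ListSink` for an HTTP or CI reporter, requires no change to `analyze` or `ScanService`.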
The stack currently includes:
- Python 3.12
- FastAPI
- Pydantic v2
- CatBoost
- React + Vite
- ChromaDB
- Ollama
- Docker
- GitHub Actions
I also focused heavily on developer experience.
The project uses:
- uv for extremely fast dependency management
- ruff for linting and formatting
- pyright in strict mode for type checking
- pre-commit
- Dockerized infrastructure
Because let’s be honest:
Nobody wants reliability tooling that itself becomes maintenance debt.
GitHub Actions Integration Is Where It Gets Practical
This is probably the most important feature for real teams.
FlakyDetector can block problematic tests directly inside CI pipelines.
Example:
- name: Run FlakyDetector
  run: uv run python scripts/scan_folder.py ./tests --fail-on-critical
If the system detects critical anti-patterns, the pipeline fails immediately.
That means developers catch instability risks during code review instead of after the test suite starts randomly exploding three weeks later.
This is especially useful for:
- fintech platforms
- SaaS systems
- async Python services
- microservice environments
- ML infrastructure
- large monorepos
The RAG + LLM Layer Sounds Weird… But It’s Useful
One experimental feature I added integrates:
- ChromaDB
- vector search
- local LLMs via Ollama
- semantic instability analysis
At first glance it sounds like:
“Congratulations, we added AI to flaky tests.”
Which honestly made me laugh too.
But there’s a practical reason behind it.
Large repositories often contain repeated instability patterns across multiple teams.
The semantic search layer allows engineers to find tests with similar architectural problems.
For example:
“Show me tests similar to this flaky async Redis integration.”
That becomes surprisingly powerful in enterprise-scale repositories.
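The retrieval idea can be sketched without a vector database. In the example below, token counts stand in for learned embeddings and a plain dictionary stands in for ChromaDB. Both are simplifications chosen to keep the example self-contained, and the test snippets in `corpus` are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter[str]:
    """Stand-in embedding: counts of lowercase identifier-like tokens."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter[str], b: Counter[str]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k corpus entries closest to the query."""
    query_vec = embed(query)
    ranked = sorted(
        corpus,
        key=lambda name: cosine(query_vec, embed(corpus[name])),
        reverse=True,
    )
    return ranked[:k]


# Invented test bodies, keyed by test name.
corpus = {
    "test_cache_ttl": "async def test_cache_ttl(redis): await redis.set('k', 1); await asyncio.sleep(2)",
    "test_parse_date": "def test_parse_date(): assert parse('2024-01-01').year == 2024",
    "test_queue_retry": "async def test_queue_retry(redis): await redis.lpush('q', 1); await asyncio.sleep(1)",
}

print(most_similar("async redis test that awaits asyncio.sleep", corpus))
```

Real embeddings capture meaning rather than shared tokens, so they can match tests that use different libraries but share the same instability pattern. The ranking step works the same way.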
It’s still evolving, but I think AI-assisted reliability engineering is going to become much more important over the next few years.
Why I Think This Problem Matters
Modern engineering teams invest enormous effort into:
- observability
- security
- performance
- type safety
- infrastructure automation
But test reliability is still oddly under-engineered.
We accept flaky tests as “normal.”
That’s strange if you think about it.
Because unstable tests don’t just waste CI minutes.
They quietly destroy confidence in the engineering process itself.
FlakyDetector is my attempt to treat flaky testing as an architectural problem instead of random chaos.
And honestly, I think the industry needs more tools that shift reliability checks earlier into the development lifecycle.
The same way:
- CodeQL changed security scanning
- mypy changed Python typing
- Ruff changed linting performance
Static reliability analysis could become a completely normal part of CI pipelines.
Try It Yourself
Quick setup:
git clone https://github.com/Artem7898/flakydetector
cd flakydetector
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
python scripts/train_model.py
Run a scan:
uv run python scripts/scan_folder.py ./tests/
And yes…
There’s a good chance it finds something uncomfortable in your test suite.
One Final Question
How many flaky tests are currently sitting in your repository pretending to be “temporary”?
And how much engineering time are they silently burning every single week?
That’s probably worth thinking about.
If this article was useful — follow me for more deep dives into Python backend engineering, CI/CD systems, AST tooling, static analysis, and modern developer infrastructure.
I’d genuinely love to hear how your team deals with flaky tests today.
Retry buttons?
Quarantine lists?
Pure denial?
Write in the comments — I’m curious.



