I built an open-source CLI to detect flaky tests -- here's what I learned

#testing #opensource #javascript #qa

I've been a Lead SDET for over a decade. In that time, one problem has followed me across every team, every company, every stack: flaky tests.

You know the drill. A test passes on your machine, fails in CI, passes when you re-run it, fails again tomorrow. Someone adds @skip or retry: 3 and moves on. The test suite slowly becomes a graveyard of unreliable signals.

Six months ago, I got fed up enough to build something to fix it.

The real cost of flaky tests

Before I talk about the tool, let me share some numbers from my own experience:

A team I worked with had a CI pipeline that took 22 minutes. Flaky test failures caused an average of 2.3 re-runs per PR. That's roughly 50 minutes (22 × 2.3 ≈ 50.6) of wasted CI time per PR.
Developers started ignoring test failures entirely. "It's probably flaky" became the default assumption -- even for real bugs.
We once shipped a regression to production because the actual failing test was hidden in a pile of known-flaky noise.
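
To make the arithmetic concrete, here is a tiny sketch of how the per-PR cost adds up. The helper name and the weekly PR count are made up for illustration; only the 22-minute pipeline and the 2.3 re-runs come from the team described above.

```typescript
// Hypothetical helper: CI minutes spent on re-runs for a single PR.
function wastedCiMinutes(pipelineMinutes: number, avgReruns: number): number {
  return pipelineMinutes * avgReruns;
}

// Figures from the team above: 22-minute pipeline, 2.3 re-runs per PR.
const wastedPerPr = wastedCiMinutes(22, 2.3); // roughly 50.6 minutes

// ASSUMPTION for illustration only: 40 PRs per week. Plug in your own number.
const assumedPrsPerWeek = 40;
const wastedHoursPerWeek = (wastedPerPr * assumedPrsPerWeek) / 60;
```

Swap in your own pipeline duration and re-run rate to see what flakiness costs your team.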

Flaky tests don't just waste time. They erode trust. And once your team stops trusting the test suite, you've lost the entire point of having tests.

What I built

DeFlaky is an open-source CLI that detects flaky tests by running your test suite multiple times and analyzing the results.

Install it:

npm i -g deflaky-cli

Run it against any test command:

deflaky -c "npx playwright test" -r 5

That runs your Playwright tests 5 times, collects the results, and outputs a report showing which tests are flaky, along with a FlakeScore for each one.
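
The core idea fits in a few lines. The sketch below is not DeFlaky's actual code; it assumes each run has already been reduced to a map of test name to outcome (the `RunResult` shape and the `summarizeRuns` name are hypothetical) and marks a test as flaky when it both passed and failed across runs.

```typescript
type Outcome = "pass" | "fail";

// ASSUMPTION: one suite run reduced to "test name -> outcome".
type RunResult = Record<string, Outcome>;

interface TestSummary {
  name: string;
  passes: number;
  failures: number;
  flaky: boolean;
}

// Hypothetical helper: tally outcomes per test across repeated runs.
function summarizeRuns(runs: RunResult[]): TestSummary[] {
  // Object.create(null) avoids collisions with names like "constructor".
  const counts: Record<string, { passes: number; failures: number }> =
    Object.create(null);

  for (const run of runs) {
    for (const name of Object.keys(run)) {
      if (!counts[name]) counts[name] = { passes: 0, failures: 0 };
      if (run[name] === "pass") counts[name].passes += 1;
      else counts[name].failures += 1;
    }
  }

  return Object.keys(counts)
    .sort()
    .map((name) => {
      const c = counts[name];
      // Flaky means mixed results: at least one pass AND at least one failure.
      return {
        name,
        passes: c.passes,
        failures: c.failures,
        flaky: c.passes > 0 && c.failures > 0,
      };
    });
}

// Example: three runs of a two-test suite.
const exampleRuns: RunResult[] = [
  { "login works": "pass", "cart total": "pass" },
  { "login works": "fail", "cart total": "pass" },
  { "login works": "pass", "cart total": "pass" },
];
const exampleSummary = summarizeRuns(exampleRuns);
```

Note that a test failing on every run is not flagged here: it is consistently broken, which is a different problem from flakiness.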

What's a FlakeScore?

FlakeScore is a 0-100 metric that quantifies how unstable a test is. A test that passes 5/5 times scores 0 (stable). A test that passes only 3/5 times scores above 0, because its results are inconsistent. The scoring accounts for pass/fail variance, consecutive failure patterns, and historical data if available.

It gives you a single number to prioritize which flaky tests to fix first.
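
To give a feel for how a score like this can work, here is a simplified sketch. It is not the actual FlakeScore formula: the 70/30 weighting, the use of pass/fail flips as a stand-in for "consecutive failure patterns", and the function name are all assumptions, and it ignores historical data entirely.

```typescript
type RunOutcome = "pass" | "fail";

// Hypothetical scoring sketch, NOT DeFlaky's real FlakeScore.
function sketchFlakeScore(history: RunOutcome[]): number {
  const n = history.length;
  if (n < 2) return 0; // too few runs to observe any instability

  const passes = history.filter((o) => o === "pass").length;
  const passRate = passes / n;

  // Normalized variance: 0 when every run agrees, 1 at a 50/50 split.
  const variance = 4 * passRate * (1 - passRate);

  // Flip rate: how often the outcome changes between consecutive runs.
  let flips = 0;
  for (let i = 1; i < n; i++) {
    if (history[i] !== history[i - 1]) flips += 1;
  }
  const flipRate = flips / (n - 1);

  // ASSUMED weights: 70% variance, 30% flip rate, scaled to 0-100.
  return Math.round(100 * (0.7 * variance + 0.3 * flipRate));
}

const stableScore = sketchFlakeScore(["pass", "pass", "pass", "pass", "pass"]);
const intermittentScore = sketchFlakeScore(["pass", "fail", "pass", "fail", "pass"]);
const streakyScore = sketchFlakeScore(["pass", "pass", "pass", "fail", "fail"]);
```

With these assumed weights, a test that alternates between passing and failing ranks above one that fails in a single streak, even though both pass 3/5 times. A test that always fails scores 0 in this sketch, since it is broken rather than flaky.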

AI root cause analysis

This is the part I'm most excited about. DeFlaky can optionally analyze your flaky test failures using LLMs to suggest probable root causes.

deflaky -c "pytest tests/" -r 5 --analyze

It looks at the failure stack traces, test code, timing data, and failure patterns, then suggests whether the flakiness is likely caused by:

Race conditions or timing issues
Shared state between tests
External dependency instability
Environment-specific problems
Non-deterministic data

It supports 5 LLM providers, so you can use whichever one you're already paying for.
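
If you're curious what feeding a flaky failure to an LLM can look like, here is a rough sketch of assembling the inputs listed above into a single prompt. Everything in it is hypothetical (the `FlakyFailureContext` shape, the function name, the prompt wording); it doesn't reflect DeFlaky's real prompts or any provider's API.

```typescript
// ASSUMPTION: a minimal bundle of the signals mentioned above.
interface FlakyFailureContext {
  testName: string;
  stackTrace: string;
  testSource: string;
  durationsMs: number[]; // timing data, one entry per run
  outcomes: string[]; // failure pattern, e.g. ["pass", "fail", "pass"]
}

// The five candidate causes, taken from the list above.
const CANDIDATE_CAUSES: string[] = [
  "Race conditions or timing issues",
  "Shared state between tests",
  "External dependency instability",
  "Environment-specific problems",
  "Non-deterministic data",
];

// Hypothetical helper: build one plain-text prompt for any chat-style LLM.
function buildAnalysisPrompt(ctx: FlakyFailureContext): string {
  const causeLines = CANDIDATE_CAUSES.map((cause, i) => `${i + 1}. ${cause}`);
  return [
    `Flaky test: ${ctx.testName}`,
    `Outcomes across runs: ${ctx.outcomes.join(", ")}`,
    `Durations (ms): ${ctx.durationsMs.join(", ")}`,
    "Stack trace:",
    ctx.stackTrace,
    "Test code:",
    ctx.testSource,
    "Which of these is the most likely root cause, and why?",
  ]
    .concat(causeLines)
    .join("\n");
}
```

Keeping the prompt provider-neutral like this is one way to stay independent of any single LLM vendor.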

Framework-agnostic by design

DeFlaky doesn't parse your test files or understand your framework's internals. It wraps your test command, captures the output, and analyzes the results. This means it works with:

Playwright (my primary use case)
Selenium / WebDriver
Cypress
Jest
Pytest
Any framework that outputs test results to stdout or generates JUnit XML

This was a deliberate design choice. Integrating deeply with one framework means maintaining compatibility with every version of that framework. Wrapping the command means DeFlaky works with whatever you're already using.
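
JUnit XML is what makes this approach practical, because most frameworks can emit it. Here is a deliberately naive sketch of pulling test names and statuses out of a report with regular expressions. It is not DeFlaky's parser: the function name is hypothetical, and it assumes attribute values contain no `>` characters and skips XML entity decoding. A real tool would use a proper XML parser.

```typescript
type CaseStatus = "pass" | "fail" | "skipped";

interface ParsedCase {
  name: string;
  status: CaseStatus;
}

// Hypothetical helper: extract <testcase> results from JUnit-style XML.
function parseJUnitCases(xml: string): ParsedCase[] {
  const cases: ParsedCase[] = [];
  // Matches self-closing <testcase .../> and <testcase ...>body</testcase>.
  const testcaseRe = /<testcase\b([^>]*?)(?:\/>|>([\s\S]*?)<\/testcase>)/g;
  let match: RegExpExecArray | null;

  while ((match = testcaseRe.exec(xml)) !== null) {
    const attrs = match[1];
    const body = match[2] || ""; // empty for self-closing elements
    // \b keeps this from matching the "name" inside classname="...".
    const nameMatch = /\bname="([^"]*)"/.exec(attrs);

    let status: CaseStatus = "pass";
    if (/<(failure|error)\b/.test(body)) status = "fail";
    else if (/<skipped\b/.test(body)) status = "skipped";

    cases.push({ name: nameMatch ? nameMatch[1] : "(unnamed)", status });
  }
  return cases;
}

const sampleReport = `
<testsuite name="demo" tests="3">
  <testcase classname="auth" name="login works" time="1.2"/>
  <testcase classname="cart" name="cart total" time="0.4">
    <failure message="expected 3 but got 2">stack</failure>
  </testcase>
  <testcase classname="cart" name="coupon" time="0"><skipped/></testcase>
</testsuite>`;
const parsedCases = parseJUnitCases(sampleReport);
```

Because the input is just a report file, the same code handles Playwright, Pytest, or anything else that writes JUnit XML.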

What I learned building this

1. Flakiness detection is harder than it sounds. Running tests N times seems simple, but you need to handle partial failures, timeouts, framework crashes, and the difference between "test failed" and "test runner failed."

2. People don't know which tests are flaky. Most teams have a vague sense ("oh yeah, that login test is weird"), but nobody has an actual inventory. Just generating the list is valuable.

3. AI analysis is surprisingly useful here. I was skeptical, but LLMs are genuinely good at looking at a stack trace + test code and saying "this looks like a race condition because you're not waiting for the network request to complete." It's not magic, but it saves investigation time.

4. The CLI-first approach was right. I considered building a SaaS dashboard first, but starting with the CLI meant developers could try it in 30 seconds with zero signup. That feedback loop was invaluable.
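
To illustrate point 1, here is a sketch of telling "test failed" apart from "test runner failed". The field names and verdict labels are assumptions for illustration, not DeFlaky's internals. The idea is that a non-zero exit code alone is ambiguous, so you also look at whether any results could be parsed.

```typescript
// ASSUMPTION: what a wrapper might record about one execution of the test command.
interface RawRun {
  exitCode: number | null; // null if the process was killed
  timedOut: boolean;
  parsedTestCount: number; // test results recovered from the output
  failedTestCount: number;
}

type RunVerdict = "all-passed" | "tests-failed" | "runner-failed" | "timed-out";

// Hypothetical helper: decide what a single run actually tells us.
function classifyRun(run: RawRun): RunVerdict {
  // A timeout may leave partial results behind; treat the run as incomplete.
  if (run.timedOut) return "timed-out";
  // No results at all means the runner crashed or never started.
  if (run.parsedTestCount === 0) return "runner-failed";
  if (run.failedTestCount > 0) return "tests-failed";
  // Every parsed test passed but the process still exited non-zero.
  if (run.exitCode !== 0) return "runner-failed";
  return "all-passed";
}
```

Only "all-passed" and "tests-failed" runs say anything about individual tests. Counting the other two as test failures would inflate flakiness numbers for tests that never ran.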

Try it

The CLI is free and MIT licensed. Always will be.

npm i -g deflaky-cli
deflaky -c "your-test-command" -r 5

If you want historical tracking, team dashboards, and CI integration, there's a Pro tier at $19/mo with a 15-day free trial at deflaky.com.

GitHub: github.com/PramodDutta/deflaky
npm: npmjs.com/package/deflaky-cli
Blog: deflaky.com/blog (41 posts on flaky test patterns and strategies)

I'd love feedback. What's your worst flaky test story?

I'm Pramod Dutta, Lead SDET and creator of The Testing Academy on YouTube. I build testing tools and write about test automation at deflaky.com/blog.