The real cost of flaky CI: a quick community survey

Every engineering team knows the feeling: a test is failing, but only sometimes. You re-run it. It passes. You push. It fails again in CI. The PR sits blocked. You ping a teammate. They re-run it. It passes again.

That hour of context-switching never shows up in a sprint retrospective.

The numbers most teams don't track

A Launchable study found that flaky tests account for roughly 10–15% of all CI failures in medium-to-large codebases. Stripe's engineering blog put the average cost of a single flaky test investigation at 1–3 engineer-hours. Google classifies a test as flaky if it fails more than 1% of the time — and they run millions of tests per day.

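For a team of 15 engineers running CI 50 times a day, a 5% flakiness rate means 2 to 3 flaky failures every day (50 × 0.05 = 2.5). If each one costs about an hour of rerunning and context-switching, that's 2–3 hours of engineer time evaporating daily on reruns alone. That's before the morale cost of a CI pipeline nobody trusts.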
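If you want to plug in your own team's numbers, the back-of-the-envelope math above fits in a few lines. This is a rough sketch, not a measurement: the function name and the one-hour-per-failure figure are assumptions for illustration, and the model ignores overlapping reruns and queue time.

```python
def daily_flaky_cost_hours(ci_runs_per_day: float,
                           flaky_rate: float,
                           hours_per_flaky_failure: float) -> float:
    """Estimate engineer-hours lost per day to flaky CI failures.

    Assumes every flaky failure costs the same fixed amount of time,
    which is a simplification.
    """
    flaky_failures_per_day = ci_runs_per_day * flaky_rate
    return flaky_failures_per_day * hours_per_flaky_failure


# The example from the text: 50 runs a day, 5% flaky,
# and an assumed one hour lost per flaky failure.
per_day = daily_flaky_cost_hours(50, 0.05, 1.0)
per_week = per_day * 5  # assuming a five-day work week

print(f"~{per_day:.1f} engineer-hours per day, ~{per_week:.1f} per week")
```

Swap in your own run count and an honest guess at time lost per failure; the weekly figure is usually the one that gets attention.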

The part nobody talks about: attribution

Detection is table stakes now. Tools like BuildPulse, Trunk, and Buildkite Test Engine flag flaky tests reliably. The harder problem — the one that actually gets tests fixed — is attribution: which commit first made this test flaky?

Without that answer, "flaky" becomes a permanent label on a test nobody wants to touch. With it, the PR that introduced the problem gets a comment, and the fix lands with the engineer who knows exactly what they changed.

Most teams are still running manual git bisect sessions — or just ignoring the test entirely.
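To make the attribution problem concrete, here is a minimal sketch of what "bisecting for flakiness" involves. A single passing run proves nothing about a flaky test, so each candidate commit has to be rerun enough times to be reasonably sure. Everything here is illustrative: run_test is a hypothetical callback standing in for "check out this commit and run the test once", and the search assumes the first commit is stable, the last one is flaky, and the flakiness persists once introduced.

```python
import math
from typing import Callable, Sequence


def runs_needed(failure_rate: float, miss_probability: float = 0.01) -> int:
    """Reruns needed so that a test failing with probability `failure_rate`
    passes all of them with probability at most `miss_probability`.

    Solves (1 - failure_rate) ** n <= miss_probability for n.
    """
    return math.ceil(math.log(miss_probability) / math.log(1.0 - failure_rate))


def looks_flaky(run_test: Callable[[str], bool], commit: str, reruns: int) -> bool:
    """Treat a commit as flaky if any one of `reruns` runs fails."""
    return any(not run_test(commit) for _ in range(reruns))


def first_flaky_commit(commits: Sequence[str],
                       run_test: Callable[[str], bool],
                       reruns: int) -> str:
    """Binary-search an oldest-to-newest commit list for the first flaky one.

    Assumes commits[0] is stable and commits[-1] is flaky.
    """
    lo, hi = 0, len(commits) - 1  # lo: known stable, hi: known flaky
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_flaky(run_test, commits[mid], reruns):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def make_fake_runner(commits: Sequence[str], introduced_at: str,
                     fail_every: int = 20) -> Callable[[str], bool]:
    """Hypothetical stand-in for a real test run: from `introduced_at`
    onward, every `fail_every`-th call fails (roughly a 5% failure rate)."""
    bad = set(commits[commits.index(introduced_at):])
    calls = {"n": 0}

    def run_test(commit: str) -> bool:
        calls["n"] += 1
        return not (commit in bad and calls["n"] % fail_every == 0)

    return run_test


history = [f"c{i}" for i in range(8)]
runner = make_fake_runner(history, introduced_at="c5")
reruns = runs_needed(failure_rate=0.05)
print(reruns, first_flaky_commit(history, runner, reruns))
```

The rerun count is where the cost hides: at a 5% failure rate, getting the chance of a missed detection under 1% takes about 90 runs per commit, multiplied by every step of the search. A stable commit never produces a false alarm under this scheme, but a flaky one can still slip through, so a result is only as trustworthy as the rerun budget behind it.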

We're studying this — and we'd love your data

We're running a quick 6-question survey on how engineering teams experience and handle flaky CI today. Among the questions:

How much time is your team actually losing per week?
What tools are you using (or not)?
What would you pay for a tool that automatically identifies the introducing commit — no manual git bisect needed?

The last question uses the Van Westendorp price-sensitivity method. We'll publish the aggregated results once we hit 30 responses.

Takes 3 minutes. No signup required.

→ Take the survey

If you've got a war story about a particularly expensive flaky test, drop it in the comments. I'd genuinely like to read it.