DEV Community: Mihir Shinde

Why Claude Code AutoFix Can’t Fix Flaky Tests

Mihir Shinde — Mon, 18 May 2026 17:01:14 +0000

AutoFix is great at real bugs. Flaky tests break it — and the fix loop costs more than the flake did.

May 8, 2026 · 8 min read

Anthropic shipped Claude Code AutoFix — an agent that subscribes to your PR’s GitHub events, watches CI, and pushes commits to fix failing tests and address review comments. For real bugs, it’s genuinely good. For flaky tests, it’s a footgun.

We build a tool in this exact space (Kleore quantifies and surfaces flaky-test waste), so we watched the AutoFix launch closely. Here’s the honest read on what it does, where it breaks, and why throwing an AI agent at flakiness makes the problem more expensive, not less.

What AutoFix actually does

AutoFix is a cloud-hosted Claude Code session attached to a pull request. When CI fails or a reviewer leaves a comment, it:

Reads the failure log or comment
Investigates the relevant code
Pushes a commit with an explanation
Re-runs CI and iterates

For a deterministic failure — a real null deref, a type error, a missing import — this loop converges. Test fails → agent reads stack trace → agent fixes code → test passes. Clean.

Flaky tests break the loop

A flaky test, by definition, fails for reasons unrelated to the code change. Race conditions. Unstable network mocks. Order-dependent fixtures. Timezone drift. The test that failed on this PR will pass on the next run with no change at all.

AutoFix doesn’t know that. It sees a red CI, assumes the diff broke something, and starts hunting. The result is one of three failure modes:

1. The speculative fix loop

AutoFix reads the stack trace, invents a plausible cause, and pushes a “fix.” The flaky test passes on the next run — not because of the fix, but because flakes pass ~70% of the time. AutoFix declares victory. You’ve now merged a code change that was triggered by randomness, not by an actual bug.

Multiply this across a quarter and your codebase fills up with cargo-cult fixes: extra awaits, defensive null checks, retries, sleep statements, narrowed test assertions. Each one looks reasonable in isolation. Together they’re the AI version of “don’t touch this, it works.”

2. The infinite re-run

Sometimes AutoFix gets it right and tries to re-run. The flake fails again. It fixes harder. Re-runs. Fails. Each iteration burns tokens, CI minutes, and your patience. The blog post that introduced AutoFix flags this directly: agents can “enter speculative fix loops that consume resources without resolving the underlying problem.”

A single flaky test can cost an AutoFix session 10–30 LLM calls, each touching multiple files, each pushing a commit. Your token bill and your git history both look terrible.

3. The wrong file blamed

Race conditions and shared-state bugs rarely live in the file the test exercises. AutoFix looks where the stack trace points. The actual cause (a fixture in a sibling file, a global mock that leaked, a database row left by an earlier test) is two directories away. AutoFix “fixes” the wrong thing, the symptom moves, and the next PR sees a new flake.

The cost math is bad

Failure type	Real bug	Flaky test
AutoFix iterations	1–3	5–30
CI minutes consumed	10–30	60–300
Token spend per failure	$0.50–$2	$5–$25
Outcome	Bug fixed	Symptom hidden

The unit economics flip on flaky tests. You pay 10x more, end up with worse code, and the underlying flake is still there waiting for the next PR.

The right move: classify before you fix

Every CI failure should be sorted into one of two buckets before an agent (or a human) starts fixing it:

Real failure — this PR’s code change broke something. Send to AutoFix. It’ll do a great job.
Flake — this test fails on unrelated PRs too. Don’t fix it on this PR. Quarantine, log, and address the root cause separately.

AutoFix has no way to make this distinction on its own. It only sees one PR. Flake detection requires looking across many PRs, over time, and asking: does this test fail on diffs that have nothing to do with it?
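
The cross-PR question above can be sketched in a few lines. This is a minimal illustration, not how AutoFix or Kleore actually work: the `PastFailure` shape, the path-prefix heuristic, and the default threshold of 2 are all assumptions made for the example.

```typescript
// Hypothetical record of one past CI failure for a given test.
interface PastFailure {
  prNumber: number;
  // Files changed by the PR on which the test failed.
  changedFiles: string[];
}

// Assumed heuristic: a failure is "unrelated" when the PR touched
// none of the path prefixes the test is known to depend on.
function isUnrelatedFailure(failure: PastFailure, relatedPaths: string[]): boolean {
  return !failure.changedFiles.some((file) =>
    relatedPaths.some((prefix) => file.indexOf(prefix) === 0)
  );
}

// Label a test a likely flake when enough of its past failures
// happened on PRs that did not touch related code.
function classifyFailure(
  history: PastFailure[],
  relatedPaths: string[],
  minUnrelatedFailures: number = 2 // assumed threshold, tune per repo
): "likely-flake" | "likely-real" {
  const unrelated = history.filter((f) => isUnrelatedFailure(f, relatedPaths));
  return unrelated.length >= minUnrelatedFailures ? "likely-flake" : "likely-real";
}
```

A gate in front of an agent would send only `"likely-real"` failures to the fixer and route the rest to quarantine. The point of the sketch is that the input is a history across PRs, which a single-PR session never sees.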

This is what Kleore does

Kleore connects to your GitHub repos, scans your CI history, and ranks every flaky test by frequency and dollar cost. It’s the missing classifier in front of AutoFix:

Test fails on a PR → check Kleore → if the test is on the flake list, skip AutoFix entirely
Top flakes get triaged as their own work item, not patched into random PRs
Engineering managers see a weekly $ number and can decide whether to invest fix-time

AutoFix and Kleore are complements, not competitors. AutoFix needs a flake-aware front door to be safe in a real codebase. Without one, every flaky test in your suite becomes a recurring tax on your token bill.

Stop letting AutoFix burn cycles on flakes.

Install the Kleore GitHub App. Get a ranked list of every flaky test in your repos — with dollar costs attached — in two minutes. Free to start.

FAQ: Claude Code AutoFix and flaky tests

Can Claude Code AutoFix fix flaky tests?

No. AutoFix is designed to fix deterministic failures caused by the current PR’s code change. Flaky tests fail for reasons unrelated to the diff — race conditions, shared state, network jitter — so AutoFix either invents a cargo-cult fix that “works” because the flake passed by chance, or it loops indefinitely burning tokens and CI minutes.

Why does AutoFix loop on flaky tests?

AutoFix treats every red CI as a code bug. When a flake fails, AutoFix pushes a speculative fix and re-runs. The flake fails again for unrelated reasons, AutoFix fixes harder, and the loop repeats. Each iteration consumes LLM calls, CI minutes, and adds noisy commits to your git history.

How do I stop AutoFix from wasting cycles on flaky tests?

Classify failures before AutoFix runs. If a test has historically failed on PRs unrelated to its code, treat it as a flake — quarantine and address separately. Tools like Kleore scan your CI history across many PRs to identify flaky tests and rank them by frequency and dollar cost, so AutoFix only engages on real bugs.

Are AutoFix and Kleore competitors?

No, they’re complements. AutoFix is a per-PR fixer that needs a flake-aware front door. Kleore provides cross-PR flaky test detection and dollar-cost reporting that tells AutoFix (and your engineers) when to fix and when to quarantine.

How much does running AutoFix on a flaky test cost?

A single flaky test can drive AutoFix through 5–30 iterations, consuming 60–300 CI minutes and $5–$25 in tokens per failure — roughly 10x the cost of fixing a real bug. Multiplied across a quarter, this becomes a recurring tax on your engineering budget.

How to Find and Fix Flaky Tests in Jest

Mihir Shinde — Thu, 14 May 2026 17:53:07 +0000

The most common root causes of Jest flakiness — and battle-tested fixes you can apply today.

March 28, 2026 · 14 min read

Jest is the most popular JavaScript testing framework, powering test suites at companies from startups to Fortune 500s. It’s also one of the most common sources of flaky tests. The combination of parallel test execution, module mocking, and shared Node.js process state creates a perfect storm for intermittent failures.

If your CI pipeline randomly fails with tests that pass on re-run, this guide is for you. We’ll cover why Jest tests become flaky, how to identify the culprits, and concrete fixes for each pattern — with code you can copy into your codebase today.

Want to skip the guesswork?

Before you start debugging one test at a time, get a ranked list of your flakiest tests. Kleore scans your CI history and shows you exactly which Jest tests are flaky, how often they fail, and how much each one costs in wasted CI minutes and developer time.

Why Jest tests become flaky

Jest runs test files in parallel worker processes by default. Each worker gets its own Node.js instance, but tests within a file share the same process. This architecture means that any state leaking between tests in the same file — or any resource shared across workers — becomes a flakiness vector.

The four most common root causes:

Shared mutable state — Global variables, module-level caches, singleton instances, or database records that persist between tests.
Timer and date dependencies — Tests that rely on setTimeout, setInterval, Date.now(), or real clock time. CI runners are slower than your laptop, and timing assumptions break.
Async race conditions — Tests that don’t properly wait for async operations to complete. The test asserts before the state has updated, and it fails intermittently depending on execution speed.
Module mocking leaks — jest.mock() calls that bleed across tests because mocks aren’t properly reset between test cases.

How to identify flaky Jest tests

Before you fix anything, you need to know which tests are flaky. Here are the tools Jest gives you to flush out non-deterministic tests.

Use --detectOpenHandles to find leaked resources

If Jest hangs after tests complete or you see “Jest did not exit one second after the test run has completed,” you have open handles — unclosed database connections, running servers, or pending timers.

Detect leaked handles

# Find open handles that prevent Jest from exiting cleanly
npx jest --detectOpenHandles --forceExit

# Run tests sequentially to isolate ordering issues
npx jest --runInBand

Run tests repeatedly to reproduce flakes

A test that passes once might fail on the 50th run. Jest has no built-in flag for repeating a test file, so use a simple shell loop to stress-test suspected flaky tests.

Stress-test a suspected flaky test

# Run the suspected test 50 times and stop at the first failure
for i in $(seq 1 50); do
  npx jest path/to/suspected-flaky.test.ts || { echo "FAILED on run $i"; exit 1; }
done

Randomize test order to find hidden dependencies

Jest 29.2+ supports --randomize to shuffle test order within each file (it requires the default jest-circus runner). If a test only passes when another test runs before it, randomization will expose it.

Randomize and isolate

# Randomize test order within files
npx jest --randomize

# If a test fails, re-run with the same seed to reproduce
npx jest --randomize --seed=12345

Use jest-circus retry for automatic detection

The jest-circus test runner (default since Jest 27) supports retries. While retries are a bandaid, they’re useful for identifying which tests need attention: any test that passes on retry is flaky by definition.

jest.config.ts — retry configuration

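// jest.config.ts
export default {
  // Load the setup file below before each test file
  setupFilesAfterEnv: ["<rootDir>/jest.setup.ts"],
};

// jest.setup.ts
// Retry failed tests up to 2 times (requires the jest-circus runner)
// Tests that need retries are flaky, so track them
// logErrorsBeforeRetry prints each failure so you can find it in CI output
jest.retryTimes(2, { logErrorsBeforeRetry: true });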

Common patterns and fixes

Pattern 1: Shared mutable state between tests

Symptom: Test passes with it.only, fails when run with the full suite. Or it only fails when another specific test runs before it.

Root cause: A module-level variable, database record, or in-memory cache is modified by one test and not cleaned up before the next.

Bad — shared state across tests

// userService.ts
let cachedUsers: User[] = []; // Module-level cache

export function getUsers() {
  if (cachedUsers.length) return cachedUsers;
  cachedUsers = fetchUsersFromDB();
  return cachedUsers;
}

// test file — Test A populates the cache, Test B reads stale data
it("fetches users", () => {
  const users = getUsers(); // Populates cache
  expect(users).toHaveLength(5);
});

it("handles empty state", () => {
  // This SHOULD test empty state, but cache is warm from previous test
  const users = getUsers(); // Returns cached data!
  expect(users).toHaveLength(0); // FAILS
});

Fix — reset state in beforeEach

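// resetCache is a small helper you add to userService.ts (not shown above):
// export function resetCache() { cachedUsers = []; }
import { resetCache } from "./userService";

beforeEach(() => {
  resetCache(); // Clear module-level state
  jest.clearAllMocks(); // Clear recorded calls and results on all mocks
});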

// Or use jest.isolateModules for complete isolation
it("handles empty state", () => {
  jest.isolateModules(() => {
    const { getUsers } = require("./userService");
    // Fresh module instance — no cached data
    const users = getUsers();
    expect(users).toHaveLength(0);
  });
});

Pattern 2: Timer-dependent tests

Symptom: Tests involving debounce, throttle, setTimeout, or animations pass locally but fail in CI where runners are slower.

Root cause: The test relies on real clock time. A 200ms debounce might take 300ms on a loaded CI runner, causing the assertion to fire too early.

Bad — real timers in tests

it("debounces search input", async () => {
  render(<SearchBox />);
  fireEvent.change(input, { target: { value: "hello" } });

  // Hoping 300ms is enough on CI... it's not
  await new Promise(r => setTimeout(r, 300));
  expect(mockSearch).toHaveBeenCalledWith("hello");
});

Fix — jest.useFakeTimers()

beforeEach(() => {
  jest.useFakeTimers();
});

afterEach(() => {
  // Restore real timers even when an assertion fails
  jest.useRealTimers();
});

it("debounces search input", () => {
  render(<SearchBox />);
  fireEvent.change(input, { target: { value: "hello" } });

  // Advance the fake clock past the 200ms debounce delay
  jest.advanceTimersByTime(250);

  expect(mockSearch).toHaveBeenCalledWith("hello");
});

Always call jest.useRealTimers() in an afterEach to prevent fake timers from leaking into other tests. Better yet, set it up globally in your Jest setup file.

Pattern 3: Async race conditions

Symptom: Tests using React Testing Library’s getBy* queries fail because the element hasn’t rendered yet. The test passes when you add a small delay.

Root cause: The test asserts synchronously against an asynchronously updated DOM. On faster machines it works; on slower CI runners, the render hasn’t completed yet.

Bad — synchronous assertion on async DOM

it("shows success message after submit", () => {
  render(<Form />);
  fireEvent.click(screen.getByText("Submit"));

  // Element hasn't rendered yet — race condition!
  expect(screen.getByText("Success")).toBeInTheDocument();
});

Fix — use waitFor or findBy

it("shows success message after submit", async () => {
  render(<Form />);
  fireEvent.click(screen.getByText("Submit"));

  // waitFor retries until the assertion passes or times out
  await waitFor(() => {
    expect(screen.getByText("Success")).toBeInTheDocument();
  });

  // Or use findBy* which combines getBy + waitFor
  const message = await screen.findByText("Success");
  expect(message).toBeInTheDocument();
});

Pattern 4: Port conflicts in integration tests

Symptom: EADDRINUSE: address already in use errors. Tests pass individually but fail when multiple test files run in parallel.

Root cause: Multiple test workers trying to bind to the same hardcoded port.

Bad — hardcoded port

// Every test file tries to use port 3000
beforeAll(() => {
  server = app.listen(3000);
});

Fix — dynamic port allocation

import { AddressInfo } from "net";

beforeAll(() => {
  // Port 0 = OS assigns an available port
  server = app.listen(0);
  const { port } = server.address() as AddressInfo;
  baseUrl = `http://localhost:${port}`;
});

afterAll((done) => {
  // Wait for the server to close so the port is released before Jest exits
  server.close(done);
});

Pattern 5: Snapshot drift

Symptom: Snapshot tests fail in CI but pass locally. Or they fail after someone else’s PR merges but no one updated the snapshots.

Root cause: Snapshots contain environment-specific data (timestamps, random IDs, absolute paths) or were committed from a different OS/locale.

Fix — deterministic snapshots

// Mock Date so snapshots don't change daily
beforeAll(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date("2025-01-01T00:00:00Z"));
});

// Use property matchers for dynamic values.
// Property matchers apply to plain objects, not rendered DOM nodes,
// so snapshot the data object rather than the container.
it("matches the user snapshot", () => {
  expect(testUser).toMatchSnapshot({
    // Allow these properties to be any string
    id: expect.any(String),
    createdAt: expect.any(String),
  });
});

// Add to CI: never write snapshots automatically
// package.json script:
// "test:ci": "jest --ci"
// With --ci, Jest does not write new snapshots; a missing snapshot fails the test

How to quarantine flaky Jest tests

Sometimes you can’t fix a flaky test immediately. Maybe it’s in a complex integration test that requires a larger refactor, or the root cause is an upstream dependency. In these cases, quarantining the test prevents it from blocking your team while you work on a fix.

A basic quarantine approach with Jest:

Manual quarantine with describe.skip

// Mark flaky tests so they don't block CI
// TODO: Fix flaky — https://linear.app/team/issue/ENG-1234
describe.skip("payment webhook handler", () => {
  it("processes refund events", async () => {
    // This test flakes due to Stripe webhook timing
  });
});

The problem with manual quarantine is that it’s easy to forget about skipped tests. They accumulate, and eventually you have a pile of tests that no one runs. Kleore automates this by detecting flaky tests from your CI history, tracking them over time, and alerting you when quarantined tests haven’t been addressed. No manual tagging required.

Prevention: jest.config settings that reduce flakiness

The right Jest configuration can prevent flaky tests from sneaking in. Here’s a production-hardened jest.config.ts with annotations explaining each setting.

jest.config.ts — production-hardened

import type { Config } from "jest";

const config: Config = {
  // Run tests in random order to catch hidden dependencies
  randomize: true,

  // In CI, run tests sequentially to reduce resource contention
  // Locally, use all cores for speed
  maxWorkers: process.env.CI ? 1 : "50%",

  // Fail fast — stop after first failure in CI
  bail: process.env.CI ? 1 : 0,

  // Clear mocks between every test automatically
  clearMocks: true,
  restoreMocks: true,

  // Do not write new snapshots automatically; fail the test instead (CI only)
  ci: !!process.env.CI,

  // Timeout per test — generous enough for CI, strict enough to catch hangs
  testTimeout: process.env.CI ? 30000 : 10000,

  // Detect open handles that prevent clean exit
  detectOpenHandles: true,

  // Force Jest to exit after all tests complete
  // (safety net — fix the root cause, don't rely on this)
  forceExit: !!process.env.CI,
};

export default config;

The key insight: --maxWorkers=1 in CI eliminates parallelism-related flakiness. Yes, it’s slower. But a 10-minute reliable suite is better than a 5-minute suite that fails 20% of the time and gets re-run. If you need speed, invest in splitting your test suite into parallel CI jobs with GitHub Actions matrix strategy instead.

Stop guessing which Jest tests are flaky.

Kleore scans your GitHub Actions history and gives you a ranked list of every flaky test — with failure rates, cost estimates, and fix priority. Free to start.

We Analyzed 10,000 GitHub Actions Runs — Here’s What Flaky Tests Actually Cost

Mihir Shinde — Tue, 12 May 2026 21:39:54 +0000

Five findings from real CI data. The numbers are worse than you think.

March 28, 2026 · 9 min read

We looked at 10,000 workflow runs across GitHub Actions repos. Not a survey. Not opinions. Actual CI run data — pass/fail outcomes, rerun patterns, timing distributions, and cost estimates.

Here’s what the data says about flaky tests.

Finding 1: 30% of CI reruns are caused by flaky tests, not real bugs

Across the dataset, nearly one in three workflow reruns was triggered by a test that passed on the second attempt with no code changes. The failure wasn’t a real bug — it was noise.

Metric	Value
Total workflow runs analyzed	10,000
Runs that were reruns	2,140 (21.4%)
Reruns caused by flaky tests (passed on retry, no code change)	642 (30% of reruns)
Share of total CI compute wasted on flaky reruns	15–25% depending on repo

Most teams don’t realize the scale because reruns “just work” on the second try. The failure disappears. Nobody files a bug. The cost accrues silently.

Finding 2: The average flaky test costs $37.50 per occurrence

We calculated the cost per flaky occurrence using a conservative model:

Cost component	Time	Cost
CI wait time (rerun)	20 min	$0.16 compute
Developer context switch + investigation	10 min	$12.50
Focus recovery (research avg: 23 min to regain deep work)	~20 min	$25.00
Total per flaky occurrence	~30 min wasted	~$37.50

At $75/hr fully-loaded engineering cost, 30 minutes of wasted time is $37.50. That’s per occurrence — per developer, per flake.

A single test that flakes 3 times a week costs $5,850 per year. Most repos have more than one flaky test.
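
The arithmetic behind those two numbers, as a small sketch. The constants come from the cost model above; the function and variable names are ours, chosen for illustration.

```typescript
// Inputs taken from the cost model above.
const HOURLY_RATE_USD = 75; // fully loaded engineering cost
const WASTED_MINUTES_PER_FLAKE = 30; // developer time lost per occurrence
const WEEKS_PER_YEAR = 52;

// Cost of one flaky occurrence in dollars.
function costPerOccurrence(hourlyRateUsd: number, wastedMinutes: number): number {
  return (hourlyRateUsd * wastedMinutes) / 60;
}

// Yearly cost of a test that flakes a given number of times per week.
function annualFlakeCost(occurrencesPerWeek: number, perOccurrenceUsd: number): number {
  return occurrencesPerWeek * WEEKS_PER_YEAR * perOccurrenceUsd;
}

const perOccurrence = costPerOccurrence(HOURLY_RATE_USD, WASTED_MINUTES_PER_FLAKE); // 37.5
const yearly = annualFlakeCost(3, perOccurrence); // 5850
```

The $0.16 of rerun compute is left out here because it is a rounding error next to developer time, which is why the headline figure is $37.50 rather than $37.66.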

Run the math on your own team

These are averages. Your numbers may be better or worse. Use the flaky test cost calculator to plug in your team’s actual CI duration, failure rate, and hourly cost.

Finding 3: 80% of CI waste comes from the top 3 tests

The Pareto principle applies hard. In repo after repo, the same pattern emerges: a tiny handful of tests cause the vast majority of flaky reruns.

Typical “worst offenders” breakdown (composite example)

Rank	Test	Flake rate	Weekly reruns	Annual cost
#1	`checkout.e2e → “applies discount code”`	18%	7	$13,650
#2	`auth.integration → “refreshes expired token”`	12%	4	$7,800
#3	`dashboard.render → “loads within 3s”`	8%	3	$5,850
All other tests combined		<3%	~3	$5,400

Fix three tests and you eliminate 80% of the waste. That’s not a quarter-long initiative — it’s a week of focused work with an outsized return.

The challenge is knowing which three. Most teams are guessing based on gut feel or recent Slack complaints. The data tells a different story.
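
Once you have a cost per test, finding the worst offenders is a sort and a sum. A rough sketch, with the `FlakyTestCost` shape and the function name invented for the example:

```typescript
// Hypothetical shape: one entry per flaky test (or per bucket of tests).
interface FlakyTestCost {
  testName: string;
  annualCostUsd: number;
}

// Sort by cost, highest first, and return the share of total cost
// covered by the top `n` entries (0 when there is no cost at all).
function topNShare(tests: FlakyTestCost[], n: number): number {
  const sorted = tests.slice().sort((a, b) => b.annualCostUsd - a.annualCostUsd);
  const total = sorted.reduce((sum, t) => sum + t.annualCostUsd, 0);
  if (total === 0) return 0;
  const top = sorted.slice(0, n).reduce((sum, t) => sum + t.annualCostUsd, 0);
  return top / total;
}
```

Fed the annual costs from the composite table ($13,650, $7,800, $5,850, and $5,400 for everything else), `topNShare(tests, 3)` comes out a little above 0.8.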

Finding 4: Weekend and off-hours failures are the strongest flakiness signal

This was the most useful pattern in the dataset. Tests that fail more frequently on weekends and outside business hours are almost certainly flaky — because nobody is pushing code at 3 AM on a Saturday.

Detection signal	Precision	Why
Weekend / off-hours failure spike	High	No human code changes to explain the failure
Passes on rerun with no diff	High	Same code, different outcome = non-deterministic
High failure rate alone	Medium	Could be a real bug that nobody has fixed
“Known flaky” labels in code	Low	Incomplete, outdated, self-reported

Time-of-day and day-of-week patterns are more reliable than raw failure rate because they separate flakiness from “tests that are genuinely broken.” A test with a 40% failure rate might just be broken. A test that fails 10% of the time, but only on weekends, is almost certainly flaky.
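
Here is one way to turn that signal into a number. It is a sketch under stated assumptions: “off-hours” is defined as weekends or outside 09:00 to 18:00 UTC, and the `CiFailure` shape is hypothetical. A real implementation would use the team’s own time zone and working hours.

```typescript
// Hypothetical record of one failed run of a test.
interface CiFailure {
  testName: string;
  failedAt: Date;
}

// Assumed definition of off-hours: weekends, or outside 09:00-18:00 UTC.
function isOffHours(at: Date): boolean {
  const day = at.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = at.getUTCHours();
  return day === 0 || day === 6 || hour < 9 || hour >= 18;
}

// Fraction of a test's failures that happened off-hours (0 if it never failed).
function offHoursFailureShare(failures: CiFailure[]): number {
  if (failures.length === 0) return 0;
  const offHours = failures.filter((f) => isOffHours(f.failedAt)).length;
  return offHours / failures.length;
}
```

The share only means something next to a baseline: compare it with the fraction of all runs that happen off-hours, so a repo with heavy nightly schedules doesn’t flag every test.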

Finding 5: Quarantining flaky tests cuts CI reruns by 60% within 2 weeks

Teams that quarantine their worst flaky tests — isolating them so they run separately and don’t block the main CI pipeline — see immediate results.

Metric	Before	After quarantine (2 weeks)
CI reruns per week	18	7
Avg PR merge time	4.2 hours	2.1 hours
Developer trust in CI (survey)	3.1 / 5	4.4 / 5

Quarantine works because it stops the bleeding immediately. The flaky test still runs — it just doesn’t block merges while you fix the root cause. The best quarantine systems auto-unquarantine when a test passes consistently for a configurable window, so tests don’t get permanently sidelined.
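
The auto-unquarantine rule is simple enough to sketch. The names here are hypothetical, and a run-count window stands in for whatever “configurable window” a given tool uses (it could equally be a number of days).

```typescript
// Outcome of one run of a quarantined test; most recent result last.
type RunResult = "pass" | "fail";

// Release a test from quarantine once its most recent `windowSize`
// runs have all passed. Too little history means it stays quarantined.
function shouldUnquarantine(recentResults: RunResult[], windowSize: number): boolean {
  if (windowSize <= 0 || recentResults.length < windowSize) return false;
  const recentWindow = recentResults.slice(-windowSize);
  return recentWindow.every((r) => r === "pass");
}
```

A larger window makes release slower but safer. A test that flakes 10% of the time will clear a 5-run window more often than not, so very small windows let flakes back in.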

The psychological effect matters too. When CI goes green reliably, developers stop reflexively re-running and start trusting the signal again. That trust compounds.

What you can do about it

The data points to a clear playbook:

Measure the damage. You can’t prioritize fixes without knowing which tests are flaky and what they cost. Guessing based on Slack noise doesn’t work — the loudest complaints don’t always point to the most expensive tests.
Fix the top 3. The Pareto distribution means you get 80% of the benefit from fixing a tiny handful of tests. Start there.
Quarantine while you fix. Don’t let flaky tests block the pipeline while you work on the root cause. Isolate them immediately.
Use time-of-day signals. Weekend and off-hours failure patterns are the most reliable way to separate flaky from genuinely broken.
Track the trend. After you fix or quarantine, make sure the numbers actually improve. Flakiness has a tendency to creep back.

See your own numbers.

Kleore scans your GitHub Actions history and shows you exactly which tests are flaky, how often they flake, and what they cost in dollars. No configuration. No test framework changes.

Kleore vs Alternatives: Honest Comparison

Mihir Shinde — Fri, 08 May 2026 04:44:57 +0000

Picking a flaky test tool? Here’s how Kleore stacks up against BuildPulse, Trunk, and Datadog — including where they beat us.

March 21, 2026 · 7 min read

There are a handful of tools that tackle flaky test detection. Each one makes different trade-offs around setup complexity, pricing, CI coverage, and features. We built Kleore because we thought the existing options were either too expensive, too complex, or too narrow. But we’ll let you judge.

Feature comparison

Feature	Kleore	BuildPulse	Trunk	Datadog
Zero-config setup	✓	—	—	—
No test framework changes	✓	—	—	—
Dollar cost per flaky test	✓	—	—	~
GitHub-native (no separate login)	✓	✓	—	—
Quarantine management	✓	✓	✓	—
Owner assignment + SLA	✓	—	~	—
Shareable health reports	✓	—	—	—
AI-powered diagnosis	✓	—	—	—
Auto-fix PRs	✓	—	—	—
Weekly digest (Slack/email)	✓	~	—	✓
Multi-CI support	~	✓	✓	✓
Test-level analytics	✓	✓	✓	✓
Free tier	✓	—	✓	~

✓ = supported ~ = partial/limited — = not available. Based on publicly available information as of March 2026.

Pricing

Kleore

Free — $149/mo Pro

Free: unlimited repos, flaky detection, cost breakdown, health reports
Pro: quarantine, assignment, SLA tracking, AI diagnosis, auto-fix PRs
Flat rate — no per-seat or per-repo pricing

BuildPulse

From $50/mo

Priced per-repo
No free tier
Requires JUnit XML upload step in CI

Trunk Flaky Tests

Free — usage-based paid

Free tier with limits
Paid tiers based on test runs
Requires Trunk CLI integration in CI

Datadog Test Visibility

Part of Datadog CI Visibility

Bundled with Datadog CI module
Priced per committed test run
Requires Datadog agent + tracer in CI

Setup complexity

This is where the tools diverge most sharply. Here’s what setup actually looks like for each:

Kleore

~2 minutes

Install the Kleore GitHub App
Select which repos to scan
Done — Kleore reads your existing CI runs, no workflow changes needed

BuildPulse

~15-30 minutes

Create a BuildPulse account
Add a CI step to upload JUnit XML test results
Configure test suite grouping
Modify CI workflow files for each repo

Trunk

~20-45 minutes

Install Trunk CLI locally and in CI
Initialize Trunk in your repo
Configure test runner integration
Add Trunk upload step to CI workflow
Set up Trunk dashboard account

Datadog

~30-60 minutes

Set up Datadog account with CI Visibility module
Install Datadog agent in CI runners
Add language-specific tracer to test framework
Configure environment variables in CI
Verify traces are being sent correctly

Where each tool wins

We’re biased, but we try to be honest. Here’s when each tool is the better choice:

Choose Kleore if...

You want zero-config setup — no CI workflow changes, no test framework modifications
You need dollar-cost visibility to justify fixing flaky tests to leadership
You want assignment, SLA tracking, and quarantine in one tool
You're a small-to-mid team that needs results fast without a long integration project

Choose BuildPulse if...

You use multiple CI providers beyond GitHub Actions
You need deep JUnit XML parsing and test-level timing analytics
You're already uploading test artifacts and want to layer on flaky detection

Choose Trunk if...

You want flaky test detection as part of a broader developer tools suite
You're already using other Trunk products (linting, merge queues)
You have engineering bandwidth to manage the CLI integration

Choose Datadog if...

You're already a Datadog customer and want everything in one dashboard
You need test visibility across multiple languages and CI providers
Cost is not a primary concern — you're optimizing for observability breadth

The bottom line

If you want to see your flaky test problem in two minutes without modifying a single workflow file, Kleore is the fastest path. Install the GitHub App, and you immediately get a ranked list of every flaky test with dollar costs.

If you need multi-CI support or are already deep in another tool’s ecosystem, one of the alternatives might be a better fit. But for GitHub-native teams that want results without a setup project, Kleore is purpose-built for you.

See for yourself — it takes two minutes.

Install Kleore, pick your repos, and see every flaky test ranked by cost. No credit card. No workflow changes. No vendor lock-in.

What Are Flaky Tests? The Silent Killer of CI Pipelines

Mihir Shinde — Fri, 08 May 2026 04:43:45 +0000

They pass. They fail. Nothing changed. And your team just lost another hour.

March 21, 2026 · 8 min read

A flaky test is a test that produces different results — pass or fail — when run against the same code. No one touched the source. No dependency changed. Yet the test failed, your build went red, and someone on your team had to stop what they were doing to investigate.

Thirty minutes later, they re-run the pipeline. It passes. They shrug, merge the PR, and move on — but the damage is already done: time wasted, context lost, and a little more trust eroded in your test suite.

Why do tests become flaky?

Flaky tests aren’t random. They have root causes, but those causes are often subtle enough that they don’t surface on every run. The most common culprits:

Timing & race conditions

Tests that depend on specific timing — setTimeout, polling intervals, animations — fail when the runner is a few milliseconds slower than expected.

Shared state

Tests that read from or write to shared databases, files, or global variables. Run them in a different order and they break.

External dependencies

API calls to third-party services, DNS lookups, network requests that time out intermittently under load.

Environment differences

Your test passes locally on macOS but fails on the Linux CI runner due to filesystem case sensitivity, timezone differences, or resource limits.

Date & time sensitivity

Tests that compare against "now" or assume a specific day of the week. They fail at midnight, on weekends, or across timezone boundaries.

Resource contention

Parallel test runners competing for ports, file locks, or database connections. Works fine sequentially, breaks under concurrency.

The real cost is invisible

Most teams underestimate flaky tests because the cost is diffuse. It’s not one big outage — it’s a thousand small interruptions.

● Re-runs burn CI minutes. Every retry is compute you’re paying for twice. At scale, this adds up to thousands of dollars per month.
● Developer time is the hidden multiplier. An engineer investigating a false failure for 20 minutes costs more than the CI compute. Multiply that by every flaky test, every day.
● Trust erodes slowly, then all at once. Once developers stop trusting the test suite, they start ignoring real failures. That’s when bugs ship to production.
● Merge velocity drops. PRs sit open longer because the build is “probably just flaky.” Reviews stack up. Shipping slows down.

Industry data point

Google’s internal research found that roughly 1.5% of all test runs across their monorepo were flaky. At Google’s scale, that translated to millions of wasted compute hours per year. Your team is smaller, but the proportional cost can be just as painful.

How do you know if you have a flaky test problem?

If any of these sound familiar, you already do:

✔️ Developers routinely re-run CI without changing code
✔️ Your team has a Slack message template for “just re-run it”
✔️ Certain tests are known to be unreliable but no one has time to fix them
✔️ CI costs have been creeping up and nobody knows exactly why
✔️ Engineers merge PRs even when CI is red, saying “it’s a known flake”

What high-performing teams do differently

The best engineering teams don’t just fix flaky tests — they build systems to catch and manage them before they metastasize. Here’s the playbook:

1. Detect automatically

Don’t wait for developers to report flaky tests in Slack. Analyze CI run history programmatically. A test that fails on one commit but passes on a retry — with no code diff — is flaky. Flag it immediately.
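As a rough illustration of that rule, the sketch below scans a list of CI results and flags any test that both passed and failed on the same commit. The record shape `(commit_sha, test_name, passed)` and the sample data are assumptions made for the example, not the format of any particular CI provider.

```python
from collections import defaultdict

def find_flaky_tests(runs):
    """Return names of tests that both passed and failed on the same commit.

    `runs` is an iterable of (commit_sha, test_name, passed) tuples.
    This record shape is an assumption made for the example.
    """
    outcomes = defaultdict(set)
    for commit_sha, test_name, passed in runs:
        outcomes[(commit_sha, test_name)].add(passed)
    # Both True and False for one (commit, test) pair means the result
    # changed without any code change.
    return sorted({test for (_, test), seen in outcomes.items()
                   if seen == {True, False}})

history = [
    ("abc123", "test_checkout", False),
    ("abc123", "test_checkout", True),   # passed on retry, same commit
    ("abc123", "test_login", True),
    ("def456", "test_search", False),    # failed both times: likely a real bug
    ("def456", "test_search", False),
]
print(find_flaky_tests(history))
```

A test that fails consistently on a commit is left out on purpose: that pattern points to a real regression rather than a flake.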

2. Quantify the damage

Knowing a test is flaky isn’t enough. You need to know how much it’s costing you in CI minutes, in re-runs, and in dollars. That’s what turns a “we should fix this” into a “we need to fix this now.”

3. Assign ownership

Flaky tests without owners don’t get fixed. Assign each flaky test to a person with an SLA. Track resolution like you track incidents.

4. Quarantine strategically

While a fix is in progress, quarantine the test so it stops blocking other developers. But quarantine with an expiration date — otherwise it becomes a graveyard.
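One way to keep a quarantine from turning into a graveyard is to store an expiry date next to each entry and let failures block the build again once it passes. The sketch below is a minimal version of that idea; the `QUARANTINE` mapping, the test IDs, and the dates are all hypothetical.

```python
from datetime import date

# Hypothetical quarantine list: test id -> last day the quarantine is valid.
QUARANTINE = {
    "tests/test_checkout.py::test_apply_discount": date(2026, 6, 1),
    "tests/test_search.py::test_ranking": date(2026, 4, 1),
}

def should_block_build(test_id, today):
    """Return True if a failure of this test should fail the build."""
    expires = QUARANTINE.get(test_id)
    # Not quarantined, or the quarantine has lapsed: the failure counts again.
    return expires is None or today > expires

today = date(2026, 5, 1)
for test_id in QUARANTINE:
    print(test_id, should_block_build(test_id, today))
```

Because an expired entry starts failing the build again, someone has to either fix the test or consciously extend the date.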

5. Measure improvement over time

Track flaky test count and cost week over week. If the trend isn’t going down, your process isn’t working.

This is exactly what Kleore does.

Kleore connects to your GitHub repos, analyzes your CI history, and shows you every flaky test — ranked by cost. Assign owners, quarantine tests, track your burn-down. Two-minute setup, no config changes.

How to Find and Fix Flaky Tests in pytest

Mihir Shinde — Fri, 17 Apr 2026 17:05:53 +0000

Database state, network calls, import side effects — the most common causes of Python test flakiness and how to eliminate each one.

March 28, 2026 · 14 min read

pytest is the gold standard for Python testing. Its fixture system, plugin ecosystem, and clean syntax make it a joy to write tests with. But those same powerful features — especially fixtures with broad scopes and plugin interactions — can introduce subtle flakiness that only shows up in CI.

This guide covers the most common patterns behind flaky pytest tests and gives you concrete fixes with real code. Whether you’re dealing with database state leaks, time-dependent assertions, or mysterious import side effects, you’ll find the solution here.

Want to skip the guesswork?

Instead of hunting through CI logs manually, Kleore analyzes your CI history and ranks every flaky test by failure rate and cost — so you fix the worst ones first.

Why pytest tests become flaky

Python’s dynamic nature and pytest’s powerful fixture system create unique flakiness vectors that don’t exist in more constrained testing frameworks. Here are the five most common root causes:

● Database state leaking between tests: Tests share a database and don’t properly isolate transactions. Test A creates a record, Test B doesn’t expect it to exist.
● File system conflicts: Tests write to the same files or directories. Parallel execution causes race conditions on file reads/writes.
● Network calls to real services: Tests make HTTP requests to external APIs that are slow, rate-limited, or occasionally down.
● Import side effects: Python modules that execute code at import time (database connections, config loading, signal handlers) create hidden coupling between tests.
● Test ordering dependencies: Test B only passes when Test A runs first because A sets up state that B implicitly relies on.

How to identify flaky pytest tests

pytest’s plugin ecosystem includes several tools specifically designed to flush out non-deterministic tests.

pytest-randomly: Shuffle test order

The most effective way to find tests with hidden ordering dependencies. pytest-randomly shuffles the order of test modules, classes, and functions on every run. A test that passes in the default order but fails under a shuffled one depends on state left behind by another test.

Install and use pytest-randomly

pip install pytest-randomly

# Run with randomized order (enabled by default after install)
pytest

# Reproduce a specific failure with the same seed
pytest -p randomly --randomly-seed=12345

# Disable randomization temporarily
pytest -p no:randomly

pytest-repeat: Stress-test suspected flakes

Run a specific test many times to confirm it’s non-deterministic.

Repeat a test to confirm flakiness

pip install pytest-repeat

# Run a test 100 times — if it fails once, it's flaky
pytest --count=100 tests/test_checkout.py::test_apply_discount

# Stop on first failure
pytest --count=100 -x tests/test_checkout.py::test_apply_discount

pytest-rerunfailures: Detect and retry

This plugin automatically reruns failed tests. Tests that pass on rerun are flaky by definition. Use this for detection, not as a permanent solution.

Detect flaky tests with reruns

pip install pytest-rerunfailures

# Rerun failed tests up to 3 times
pytest --reruns 3

# Add a delay between reruns (useful for timing-dependent flakes)
pytest --reruns 3 --reruns-delay 2

# Or, inside a test module, mark specific tests that are known to flake
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_webhook_delivery():
    ...

Common patterns and fixes

Pattern 1: Database state leaking between tests

Symptom: Tests pass individually but fail when run together. Failures involve unexpected records in the database or unique constraint violations.

Root cause: Tests create database records that persist across test boundaries. One test’s setup data becomes another test’s pollution.

Fix — transaction rollback with autouse fixture

# conftest.py
import os

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Assumption: the test database URL is provided via an environment variable
TEST_DATABASE_URL = os.environ["TEST_DATABASE_URL"]

@pytest.fixture(autouse=True)
def db_session():
    """Wrap every test in a transaction that rolls back."""
    engine = create_engine(TEST_DATABASE_URL)
    connection = engine.connect()
    transaction = connection.begin()
    session = sessionmaker(bind=connection)()

    yield session

    session.close()
    transaction.rollback()
    connection.close()
    engine.dispose()  # release the connection pool created for this test

# For Django projects, use pytest-django's built-in support:
@pytest.fixture(autouse=True)
def enable_db_access(db):
    """pytest-django's db fixture wraps each test in a rolled-back transaction."""
    pass

# Tests that need real commits can opt in per test with:
# @pytest.mark.django_db(transaction=True)

The autouse=True parameter ensures every test gets isolation automatically, without needing to request the fixture explicitly. This prevents new tests from accidentally skipping isolation.

Pattern 2: Time-dependent tests

Symptom: Tests that check expiration, scheduling, or duration fail at certain times of day or run slower in CI than expected.

Root cause: Tests use datetime.now() or time.time() directly, and their assertions depend on the current time.

Fix — freeze time with freezegun

pip install freezegun

Using freezegun in tests

from datetime import timedelta

import pytest  # needed for the fixture at the bottom
from freezegun import freeze_time
from myapp.auth import create_token, is_token_expired

@freeze_time("2025-06-15 12:00:00")
def test_token_expiration():
    token = create_token(expires_in=timedelta(hours=1))

    # Still within the hour — not expired
    assert not is_token_expired(token)

@freeze_time("2025-06-15 14:00:00")
def test_token_is_expired():
    # Create a token that expired an hour ago
    with freeze_time("2025-06-15 12:00:00"):
        token = create_token(expires_in=timedelta(hours=1))

    # Now it's 2pm — token expired at 1pm
    assert is_token_expired(token)

# As a fixture for broader use:
@pytest.fixture
def frozen_time():
    with freeze_time("2025-01-01 00:00:00") as frozen:
        yield frozen

Pattern 3: Network calls to real services

Symptom: Tests fail with ConnectionError, Timeout, or 429 Too Many Requests. Failures happen in bursts when the external service has issues.

Root cause: Tests make real HTTP requests to APIs you don’t control.

Fix — mock HTTP with responses library

pip install responses

Mocking HTTP calls

import pytest
import requests
import responses
# PaymentError is assumed to live alongside charge_customer
from myapp.payment import PaymentError, charge_customer

@responses.activate
def test_successful_charge():
    responses.add(
        responses.POST,
        "https://api.stripe.com/v1/charges",
        json={"id": "ch_test_123", "status": "succeeded"},
        status=200,
    )

    result = charge_customer(amount=2000, token="tok_visa")
    assert result.status == "succeeded"

@responses.activate
def test_payment_gateway_timeout():
    responses.add(
        responses.POST,
        "https://api.stripe.com/v1/charges",
        body=requests.exceptions.Timeout(),
    )

    with pytest.raises(PaymentError, match="timeout"):
        charge_customer(amount=2000, token="tok_visa")

# For httpx, the httpx_mock fixture comes from the pytest-httpx plugin:
# pip install pytest-httpx
import pytest
from httpx import AsyncClient

@pytest.fixture
def mock_httpx(httpx_mock):
    httpx_mock.add_response(
        url="https://api.example.com/data",
        json={"results": []},
    )
    return httpx_mock

Pattern 4: File system conflicts

Symptom: Tests fail with FileNotFoundError, PermissionError, or produce corrupted output. Especially common with parallel test execution via pytest-xdist.

Root cause: Multiple tests read/write the same file paths concurrently.

Fix — use pytest's tmp_path fixture

def test_export_csv(tmp_path):
    """tmp_path gives each test a unique temporary directory."""
    output_file = tmp_path / "export.csv"

    export_data(output_path=output_file)

    content = output_file.read_text()
    assert "header1,header2" in content
    assert len(content.splitlines()) == 101  # header + 100 rows

    # No manual cleanup needed: pytest prunes old tmp_path directories on later runs

def test_config_loading(tmp_path):
    """Create isolated config files per test."""
    config_file = tmp_path / "config.yaml"
    config_file.write_text("""
database:
  host: localhost
  port: 5432
""")

    config = load_config(str(config_file))
    assert config["database"]["host"] == "localhost"

# For fixtures that need a persistent temp directory across a test class:
@pytest.fixture(scope="class")
def shared_tmp(tmp_path_factory):
    return tmp_path_factory.mktemp("shared")

Pattern 5: Import side effects

Symptom: Tests fail with errors about database connections already being open, signal handlers being registered twice, or global config having unexpected values.

Root cause: Python modules execute code at import time. If a module opens a database connection, registers a signal handler, or modifies global state when imported, that side effect persists for the entire test session.

Fix — mock at module level or use importlib

# If the module connects to a database on import:
# myapp/db.py
# connection = psycopg2.connect(DATABASE_URL)  # Runs at import time!

# Option 1: Mock before import
import sys
from unittest.mock import MagicMock

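# Prevent the real module from connecting (the mock stays in sys.modules
# for the rest of the session, so do this once, e.g. at the top of conftest.py)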
sys.modules["psycopg2"] = MagicMock()

from myapp.db import get_users  # Now uses mocked connection

# Option 2: Use importlib for fresh imports
import importlib

def test_with_fresh_module():
    import myapp.db
    importlib.reload(myapp.db)  # Re-executes module code
    # ... test with fresh state

# Option 3 (best): Refactor to lazy initialization
# myapp/db.py
import psycopg2

_connection = None

def get_connection():
    global _connection
    if _connection is None:
        _connection = psycopg2.connect(DATABASE_URL)
    return _connection

Quarantining flaky pytest tests

The pytest-quarantine plugin lets you mark tests as known-flaky so they don’t block your CI pipeline while you work on fixes.

pytest-quarantine setup

pip install pytest-quarantine

# Save the tests that fail in this run to a quarantine file
pytest --save-quarantine=quarantine.txt

# Run tests, treating quarantined tests as expected failures
pytest --quarantine=quarantine.txt

For a more automated approach, Kleore detects flaky tests automatically from your CI history — no manual tagging needed. It tracks every test that has passed and failed on the same commit, ranks them by impact, and gives you a prioritized fix list with cost estimates.

CI configuration tips for pytest

Beyond fixing individual tests, your CI configuration can reduce flakiness across the board.

pyproject.toml — hardened pytest config

[tool.pytest.ini_options]
# Randomize test order to catch hidden dependencies
# (pass --randomly-seed=last on the command line to replay the previous order)
addopts = "-p randomly"

# Strict markers — prevent typos in marker names
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks integration tests",
    "flaky: marks known flaky tests",
]
strict_markers = true

# Timeout per test (requires pytest-timeout)
timeout = 30

# Fail on warnings to catch deprecation issues early
filterwarnings = [
    "error",
    "ignore::DeprecationWarning:third_party_lib.*",
]

.github/workflows/test.yml — pytest CI config

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      PYTHONDONTWRITEBYTECODE: "1"
      PYTHONHASHSEED: "0"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version-file: ".python-version"
          cache: "pip"
      - run: pip install -r requirements-test.txt
      - run: pytest -x --tb=short -q
        # -x: stop on first failure
        # --tb=short: concise tracebacks
        # -q: quiet output

  # For parallel execution with pytest-xdist:
  test-parallel:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version-file: ".python-version"
          cache: "pip"
      - run: pip install -r requirements-test.txt
      - run: pytest --forked -n auto
        # --forked: each test in its own subprocess (requires pytest-forked)
        # -n auto: use all available CPUs

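Setting PYTHONDONTWRITEBYTECODE=1 prevents .pyc file conflicts in parallel runs. PYTHONHASHSEED=0 disables string hash randomization, so set iteration order and anything else that depends on hash values is the same on every run. Dictionaries already preserve insertion order, so this mainly removes flakes in tests that compare output built from sets.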
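To see what a fixed hash seed buys you, the sketch below starts a fresh interpreter twice with the same PYTHONHASHSEED and compares the order in which a set of strings is printed. The helper name `set_order` and the sample strings are made up for the example.

```python
import os
import subprocess
import sys

SNIPPET = "print(list({'alpha', 'beta', 'gamma', 'delta', 'epsilon'}))"

def set_order(hash_seed):
    """Run a fresh interpreter with the given PYTHONHASHSEED and return its output."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# With a fixed seed, two separate interpreter runs agree on the order.
print(set_order("0") == set_order("0"))
```

Without a fixed seed, string hashes differ between interpreter runs, so a test that asserts on the printed order of a set can pass on one run and fail on the next.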

Stop guessing which pytest tests are flaky.

Kleore scans your GitHub Actions history and gives you a ranked list of every flaky test — with failure rates, cost estimates, and fix priority. Free to start.

How to Fix Flaky Tests in GitHub Actions

Mihir Shinde — Fri, 17 Apr 2026 00:36:33 +0000

Six patterns that cause 90% of test flakiness — and how to fix each one with concrete code changes.

March 21, 2026 · 12 min read

You know the drill: CI goes red, you check the logs, the failure looks unrelated to your changes. You hit re-run. It passes. You merge. And the cycle repeats tomorrow.

This guide covers the six most common patterns behind flaky tests in GitHub Actions and gives you concrete fixes for each. Not theories — actual code changes and configuration updates you can apply today.

Before you start fixing

The first step is knowing which tests are flaky and how often they fail. If you’re guessing based on Slack complaints, you’re working blind. Kleore analyzes your CI history and ranks every flaky test by failure rate and cost — so you fix the worst ones first.

1. Timing & race conditions

Symptom: Test passes locally, fails intermittently in CI. Often involves UI tests, async operations, or anything that waits for a condition to become true.

Root cause: GitHub Actions runners have variable performance. A 2-core runner under load is slower than your M3 MacBook. Tests that assume operations complete within a specific window break when the runner is under pressure.

The fix: Replace fixed waits with condition-based polling.

Before — fragile timing

// Bad: assumes the element appears within 500ms
await new Promise(r => setTimeout(r, 500));
expect(screen.getByText("Success")).toBeInTheDocument();

After — condition-based wait

// Good: waits for the condition, not the clock
await waitFor(() => {
  expect(screen.getByText("Success")).toBeInTheDocument();
}, { timeout: 5000 });

For E2E tests with Playwright or Cypress, use their built-in auto-waiting mechanisms instead of explicit sleeps. For backend tests, poll with exponential backoff rather than sleeping.
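For the backend case, a polling helper with exponential backoff might look like the sketch below. The name `wait_until` and its default delays are assumptions for illustration, not part of any library.

```python
import time

def wait_until(condition, timeout=5.0, initial_delay=0.05, max_delay=1.0):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        result = condition()
        if result:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"condition not met within {timeout}s")
        # Sleep, but never past the deadline, then double the delay.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)

# Example: a fake job that reports "done" on the third check.
checks = {"count": 0}

def job_finished():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(job_finished)
print(checks["count"])
```

The test finishes as soon as the condition holds, so a fast runner is not slowed down, while a slow runner gets the full timeout before the test fails.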

2. Shared mutable state

Symptom: Test passes in isolation (it.only) but fails when run with the full suite. Or it fails only when a specific other test runs before it.

Root cause: Tests share a database, in-memory store, filesystem, or global variable. Test A writes data that Test B doesn’t expect, or Test A forgets to clean up.

The fix: Isolate test state completely.

Database isolation

// Run each test in a transaction that rolls back
beforeEach(async () => {
  await db.query("BEGIN");
});

afterEach(async () => {
  await db.query("ROLLBACK");
});

Unique identifiers per test

// Instead of hardcoding IDs that collide:
const userId = `test-user-${crypto.randomUUID()}`;
await createUser({ id: userId, name: "Test" });

If you’re using a shared test database, consider running each test file in its own database schema or using containers. The small overhead is worth the determinism.

3. External service dependencies

Symptom: Tests fail with network timeouts, 503 errors, or rate-limit responses. Usually happens in bursts (when the external service has issues).

Root cause: Your tests make real HTTP calls to APIs you don’t control — payment gateways, auth providers, third-party data services.

The fix: Mock at the HTTP boundary, not the function level.

MSW (Mock Service Worker) approach

import { http, HttpResponse } from "msw";
import { setupServer } from "msw/node";

const server = setupServer(
  http.post("https://api.stripe.com/v1/charges", () => {
    return HttpResponse.json({
      id: "ch_test_123",
      status: "succeeded",
      amount: 2000,
    });
  })
);

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

Use MSW or similar tools to intercept HTTP at the network level. This tests your actual HTTP client code (headers, serialization, error handling) while eliminating network flakiness. Reserve real API calls for a small set of integration tests that run separately.

4. Environment differences

Symptom: Tests pass on macOS, fail on Linux. Or pass with Node 20, fail with Node 22. Or pass Monday through Friday, fail on weekends.

Root cause: Assumptions baked into tests about the OS, timezone, locale, filesystem behavior, or available system resources.

The fix: Pin your CI environment explicitly.

.github/workflows/test.yml

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      TZ: UTC
      LC_ALL: C.UTF-8
      NODE_ENV: test
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: ".node-version"
      - run: npm ci
      - run: npm test

Key practices: always set TZ=UTC, use a .node-version file instead of hardcoding versions, and test with the same OS as production. If your tests compare file paths, normalize separators.
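As a small illustration of normalizing separators, the sketch below converts a path string to forward slashes before comparing it. It is written in Python for brevity; the helper name `normalize` is made up, and it assumes backslashes in the input are always separators, which does not hold for POSIX filenames that contain a literal backslash.

```python
from pathlib import PureWindowsPath

def normalize(path_text):
    """Return the path with forward slashes, whichever separator it used."""
    # PureWindowsPath accepts both "\\" and "/" as separators.
    return PureWindowsPath(path_text).as_posix()

# The same logical path compares equal regardless of where it was produced.
print(normalize("reports\\2026\\summary.csv") == normalize("reports/2026/summary.csv"))
```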

5. Port & resource conflicts

Symptom: EADDRINUSE errors, database connection failures, or file lock errors. Happens especially when tests run in parallel.

Root cause: Multiple test processes or test files trying to bind to the same port, open the same file, or connect to the same database concurrently.

The fix: Use dynamic port allocation.

Dynamic port allocation

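// AddressInfo is the type server.address() returns for TCP servers
import type { AddressInfo } from "node:net";

// Instead of: server.listen(3000)
// Use port 0 to let the OS assign an available port
const server = app.listen(0);
const port = (server.address() as AddressInfo).port;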

// Pass port to your test client
const client = createTestClient(`http://localhost:${port}`);

For database tests, use unique database names per test worker or use Docker containers. For file-based tests, use os.tmpdir() with random suffixes.
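The same port-0 trick works outside Node. The Python sketch below opens two listeners and lets the OS choose a port for each; the helper name `listen_on_free_port` is an assumption for the example.

```python
import socket

def listen_on_free_port():
    """Bind to port 0 so the OS picks an unused port, then start listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen()
    return sock, sock.getsockname()[1]

server_a, port_a = listen_on_free_port()
server_b, port_b = listen_on_free_port()
# Both listeners are live at once, so the OS must hand out distinct ports.
print(port_a != port_b)
server_a.close()
server_b.close()
```

Keep the socket open and pass it (or its port) to the code under test. Closing it and re-binding the same port later reintroduces the race you were trying to remove.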

6. Test order dependency

Symptom: Tests pass when run in the default order, fail when randomized or when a specific test file is skipped.

Root cause: Test A sets up state that Test B implicitly depends on. When A doesn’t run first, B fails.

The fix: Make every test self-contained.

Self-contained test

describe("checkout flow", () => {
  // Each test creates its own state from scratch
  it("applies discount code", async () => {
    // Setup: create the user, cart, and product for this test
    const user = await createTestUser();
    const product = await createTestProduct({ price: 100 });
    const cart = await createCart(user.id, [product.id]);

    // Act
    const result = await applyDiscount(cart.id, "SAVE20");

    // Assert
    expect(result.total).toBe(80);
  });
});

Enable test randomization to catch these issues early. Jest supports --randomize, and Vitest can be configured with sequence.shuffle: true. If your tests slow down from redundant setup, invest in fast factory functions — not shared state.

The meta-fix: Retry as a bandaid, not a cure

GitHub Actions has no built-in step-level retry, so many teams reach for a community action such as nick-fields/retry, or simply re-run the workflow, as a first response:

Retry step (bandaid)

- uses: nick-fields/retry@v3
  with:
    max_attempts: 3
    timeout_minutes: 10
    command: npm test

This is fine as a short-term bandaid while you fix the root cause. But retrying hides the problem. A test that fails 30% of the time and gets up to 3 attempts will appear to pass about 97% of the time (1 − 0.3³ ≈ 0.973), while still costing extra CI minutes on every retry, up to 3x on the worst runs, and masking the underlying issue.
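The retry arithmetic is easy to check. The sketch below assumes each attempt fails independently with the same probability, which is a simplification, and computes both the apparent pass rate and the average number of attempts you pay for.

```python
def retry_stats(fail_rate, max_attempts):
    """Return (apparent pass rate, expected attempts), assuming independent attempts."""
    # The step only fails if every attempt fails.
    pass_rate = 1 - fail_rate ** max_attempts
    # Attempt k+1 happens only if the first k attempts all failed.
    expected_attempts = sum(fail_rate ** k for k in range(max_attempts))
    return pass_rate, expected_attempts

pass_rate, attempts = retry_stats(fail_rate=0.3, max_attempts=3)
print(round(pass_rate, 3), round(attempts, 2))
```

For a 30% flake with three attempts this gives a pass rate of about 0.973 and roughly 1.39 attempts per run on average. The dashboard looks almost green while the suite quietly costs about 39% more compute.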

Retry to unblock your team today. Fix the root cause this sprint.

How to prioritize which tests to fix first

Not all flaky tests are equal. A test that flakes once a month is annoying. A test that flakes daily on your critical path is an emergency. Prioritize by:

● Failure frequency: How often does it flake? Daily flakes first.
● Blast radius: Does it block all PRs, or just one workflow?
● Cost per failure: Long test suites cost more per re-run.
● Fix complexity: Can you fix it in an hour, or does it need a refactor?
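These four factors can be folded into a single rough score. The sketch below is one hypothetical heuristic: weekly minutes lost divided by estimated fix effort. It is not how any particular tool ranks tests, and the field names and sample numbers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class FlakyTest:
    name: str
    flakes_per_week: float       # failure frequency
    blocked_prs_per_flake: int   # blast radius
    rerun_minutes: float         # cost per failure
    fix_hours: float             # fix complexity

def priority(test):
    """Weekly minutes lost per hour of fix effort (hypothetical heuristic)."""
    weekly_minutes_lost = (test.flakes_per_week
                           * test.blocked_prs_per_flake
                           * test.rerun_minutes)
    # Floor the effort so a trivial fix does not produce an unbounded score.
    return weekly_minutes_lost / max(test.fix_hours, 0.5)

tests = [
    FlakyTest("test_monthly_report", 0.25, 1, 12, 1),
    FlakyTest("test_checkout_e2e", 7, 5, 12, 2),
    FlakyTest("test_search_index", 3, 1, 30, 8),
]
ranked = sorted(tests, key=priority, reverse=True)
print([t.name for t in ranked])
```

With these sample numbers, the daily flake on the critical path ranks first even though it is not the cheapest to fix.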

Let Kleore do the prioritization for you.

Kleore analyzes your GitHub Actions history and ranks every flaky test by failure rate, cost, and impact. You get a prioritized list with dollar amounts — so you know exactly where to start.

How Much Do Flaky Tests Actually Cost?

Mihir Shinde — Fri, 17 Apr 2026 00:10:34 +0000

Spoiler: it’s not just CI minutes. The real number will make your engineering manager wince.

March 21, 2026 · 10 min read

When teams talk about the cost of flaky tests, they usually start with CI minutes. That’s the visible part — the line item on your GitHub bill. But CI compute is maybe 10% of the real cost. The other 90% is human time, delayed shipping, and the slow erosion of engineering culture.

Let’s break it down with real numbers.

Layer 1: CI compute

This is the easy math. Every time a flaky test causes a re-run, you’re paying for the same CI job twice.

Metric	Example team
Average CI run duration	12 minutes
Flaky-caused re-runs per week	40
Wasted CI minutes per week	480 minutes
GitHub Actions cost per minute	$0.008
Monthly CI waste	~$15/month

$15 a month? That’s nothing, right? That’s the trap. CI compute is cheap enough that nobody escalates it. But it’s the tip of the iceberg.

Layer 2: Developer time

This is where the real money goes. Every flaky failure triggers a human response:

1. Developer sees red CI badge on their PR
2. Opens CI logs, scrolls through output
3. Tries to figure out if the failure is real or flaky
4. Decides to re-run (or asks a teammate)
5. Waits for the re-run to finish
6. Resumes their previous work, but the context switch already happened

Research on context switching shows it takes an average of 23 minutes to regain deep focus after an interruption. Even if the investigation itself takes only 5 minutes, the true cost per interruption is closer to 30 minutes of productive time.

Metric	Example team
Flaky interruptions per week	40
Context-switch cost per interruption	30 min
Total developer hours lost per week	20 hours
Average fully-loaded eng cost	$85/hour
Monthly developer time waste	~$6,800/month

That’s over 100x the CI compute cost. And this is for a modest team with a moderate flaky test problem. A team of 30 engineers with a bad flaky test culture can easily burn $20,000+/month in lost productivity.
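To rerun this estimate with your own numbers, the sketch below reproduces the arithmetic behind the two tables. It assumes four weeks per month, which is the rounding the developer-time figure implies, and the function names are made up for the example.

```python
WEEKS_PER_MONTH = 4  # assumption: matches the rounding in the developer-time table

def monthly_ci_cost(run_minutes, reruns_per_week, cost_per_minute):
    """Compute wasted on re-runs, in dollars per month."""
    return run_minutes * reruns_per_week * cost_per_minute * WEEKS_PER_MONTH

def monthly_dev_cost(interruptions_per_week, minutes_per_interruption, hourly_rate):
    """Developer time lost to interruptions, in dollars per month."""
    hours_per_week = interruptions_per_week * minutes_per_interruption / 60
    return hours_per_week * hourly_rate * WEEKS_PER_MONTH

ci = monthly_ci_cost(run_minutes=12, reruns_per_week=40, cost_per_minute=0.008)
dev = monthly_dev_cost(interruptions_per_week=40, minutes_per_interruption=30, hourly_rate=85)
print(round(ci, 2), round(dev, 2), round(dev / ci))
```

With the example team's inputs, the compute line works out to about $15 a month and the developer-time line to $6,800, so the human cost is several hundred times the compute cost.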

Layer 3: Shipping velocity

Flaky tests don’t just waste time — they slow down how fast you ship.

● PRs stay open longer. A PR that gets a flaky red build sits in review limbo. The author re-runs, waits, and the reviewer has moved on to something else. Round-trip time expands from hours to days.
● Merge conflicts compound. Longer PR lifetimes mean more merge conflicts. Each conflict is another context switch, another re-run, another delay.
● Deploys batch up. When teams can’t merge quickly, changes pile up into larger, riskier deploys. The opposite of continuous delivery.

This is the hardest cost to quantify but often the most painful. Your competitors ship daily while your team spends a quarter of their time fighting CI noise.

Layer 4: Trust erosion

This is the most dangerous cost because it’s invisible until it’s catastrophic.

When tests are unreliable, developers develop a reflex: “It’s probably just flaky.” This is rational behavior given unreliable signals. But it means real failures get ignored too.

The progression looks like this:

Phase 1: Team re-runs flaky tests and reports them in Slack

Phase 2: Team re-runs without reporting — it's just background noise

Phase 3: Team merges with red CI, assuming flakiness

Phase 4: A real bug slips through. "We thought it was flaky."

Phase 5: Production incident. Post-mortem identifies eroded CI trust as root cause.

The total picture

Cost layer	Monthly cost	Visibility
CI compute	$15	On your bill
Developer time	$6,800	Hidden
Shipping velocity	$???	Invisible
Trust erosion	$???	Invisible until incident
Total	$7,000 – $25,000+/month

The irony: the cost that shows up on your bill (CI minutes) is the smallest component. The costs that don’t show up anywhere — developer time, delayed shipping, trust — are 100x larger.

What can you actually do about it?

Step one is visibility. You can’t fix what you can’t see. Most teams have no idea how many flaky tests they have, which ones are the worst, or what they cost.

That’s the gap Kleore fills. It connects to your GitHub repos, analyzes your CI run history, and gives you a ranked list of every flaky test — with dollar costs attached. No configuration, no test framework changes, no new CLI tools. Just the data you need to start making decisions.

See your real CI waste in two minutes.

Install the Kleore GitHub App and get a dollar-cost breakdown of every flaky test in your repos. Free to start. No credit card required.

GitHub Actions CI Is Slow? Here’s What’s Actually Wasting Your Time

Mihir Shinde — Thu, 16 Apr 2026 23:02:01 +0000

The top 5 time wasters in GitHub Actions pipelines — and how to fix each one with real workflow examples.

March 28, 2026 · 13 min read

Your GitHub Actions pipeline takes 20 minutes. Your team runs it 50 times a day. That’s nearly 17 hours of CI compute daily, and most of it is waste. Developers context-switch while waiting, merge queues back up, and by the end of the week your team has lost an entire engineer’s worth of productive time to a slow pipeline.

The fix isn’t “buy bigger runners.” It’s eliminating the waste that’s already in your pipeline. Here are the five biggest time wasters and how to fix each one.

The hidden cost of slow CI

Slow CI doesn’t just waste compute. It creates a cascade of productivity losses that compound across your team:

Developer wait time: A developer waiting 20 minutes for CI is not coding. They’re checking Slack, reading Hacker News, or starting a second task that creates costly context-switching when CI finishes.
Context switching: Studies show it takes 23 minutes to fully refocus after a context switch. A 20-minute CI wait often creates a 43-minute productivity gap.
Merge queue bottlenecks: When CI takes 20 minutes, your merge queue can process 3 PRs per hour at most (serially). With a team of 10 developers, PRs stack up and block each other.
Deployment velocity: Slow CI means fewer deployments per day, which means larger batch sizes, which means more risk per deploy. It’s a vicious cycle.

The math is simple: if your CI takes 20 minutes and you have 10 developers, optimizing it to 8 minutes saves 12 minutes per run. Even if each developer waits on just one run per day, that is 2 hours of developer wait time saved daily. At $150/hour loaded engineering cost, that’s $300/day, or $78,000/year over 260 working days.
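The sketch below spells that calculation out so you can plug in your own numbers. It assumes one waited-on run per developer per day and 260 working days per year. Those two inputs are assumptions chosen because they reproduce the figures above.

```python
def wait_savings(before_min, after_min, developers, hourly_rate,
                 runs_per_dev_per_day=1, working_days=260):
    """Return (hours saved per day, dollars saved per year)."""
    minutes_saved = (before_min - after_min) * developers * runs_per_dev_per_day
    hours_per_day = minutes_saved / 60
    return hours_per_day, hours_per_day * hourly_rate * working_days

hours, dollars = wait_savings(before_min=20, after_min=8, developers=10, hourly_rate=150)
print(hours, dollars)
```

Most developers wait on more than one run per day, so raising `runs_per_dev_per_day` gives a quick upper estimate for a busier team.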

How much is your CI actually wasting?

Use our Flaky Test Cost Calculator to plug in your team’s numbers and see the dollar impact. Or install Kleore for an automated analysis of your actual CI history.

Time waster #1: Flaky test reruns

This is the single biggest source of CI waste, and it’s the one most teams underestimate. When a flaky test fails, developers re-run the entire pipeline. That re-run wastes 100% of the compute — you’re running the same tests again just to get a different roll of the dice.

The numbers are staggering. In our analysis of 10,000 GitHub Actions workflow runs, we found that 15-25% of CI compute is wasted on flaky test reruns. That means if you spend $10,000/month on GitHub Actions, $1,500 to $2,500 is literally burned on re-running tests that aren’t actually broken.

The fix: Identify and quarantine flaky tests.

You can’t fix what you can’t measure. Start by identifying which tests are flaky, then quarantine them so they don’t block CI while you fix the root causes.

.github/workflows/test.yml — retry with reporting

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: ".node-version"
      - run: npm ci
      - name: Run tests with retry reporting
        run: |
          # Run tests; "|| true" keeps the step alive so the check below decides the outcome
          npm test -- --json --outputFile=test-results.json || true

          # Compare any failures against your known-flaky list (check-flaky.js is your own script)
          if [ -f test-results.json ]; then
            node scripts/check-flaky.js test-results.json
          fi

For a deeper dive on fixing flaky tests specifically, see our guides for Jest and pytest.

Time waster #2: No dependency caching

Every CI run that starts with npm install or pip install -r requirements.txt from scratch is downloading the same packages over and over. For a typical Node.js project, this wastes 1-3 minutes per run. Multiply that by 50 runs/day and you’re losing anywhere from 50 minutes to 2.5 hours daily.

The fix: Use actions/cache or built-in caching.

Node.js — cache npm downloads

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: ".node-version"
          cache: "npm"  # Built-in npm cache support
      - run: npm ci    # Uses cache when lockfile hasn't changed

Python — cache pip packages

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version-file: ".python-version"
          cache: "pip"  # Built-in pip cache support
      - run: pip install -r requirements.txt

Custom cache for monorepos or complex setups

- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: |
      node_modules
      ~/.cache/Cypress
      .next/cache
    key: deps-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-${{ runner.os }}-

Pro tip: npm ci is faster than npm install in CI because it skips the lockfile resolution step. Always use npm ci when you have a lockfile.

Time waster #3: Running all tests on every PR

If a PR only changes a README file, there’s no reason to run your entire test suite. Yet most teams configure their pipeline to run everything on every push. For large monorepos, this wastes enormous amounts of compute.

The fix: Use path filters and affected test detection.

Path filters — skip tests for docs-only changes

on:
  pull_request:
    # paths and paths-ignore cannot be combined for the same event,
    # so exclusions are written as "!" patterns inside paths
    paths:
      # Only run tests when code files change
      - "src/**"
      - "tests/**"
      - "package.json"
      - "package-lock.json"
      - ".github/workflows/test.yml"
      # Never run tests for these changes
      - "!**.md"
      - "!docs/**"
      - "!.vscode/**"

Conditional jobs based on changed files

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      backend: ${{ steps.filter.outputs.backend }}
      frontend: ${{ steps.filter.outputs.frontend }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            backend:
              - "api/**"
              - "tests/api/**"
            frontend:
              - "web/**"
              - "tests/web/**"

  test-backend:
    needs: changes
    if: ${{ needs.changes.outputs.backend == 'true' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:backend

  test-frontend:
    needs: changes
    if: ${{ needs.changes.outputs.frontend == 'true' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:frontend
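
The conditional jobs above boil down to a mapping from changed files to test suites. As a rough sketch of that idea (not how dorny/paths-filter is implemented), the snippet below uses plain directory prefixes instead of glob patterns. SUITES and affectedSuites are hypothetical names chosen to mirror the filters in the workflow.

```javascript
// Hypothetical mapping that mirrors the backend/frontend filters above,
// simplified to directory prefixes rather than full glob patterns.
const SUITES = {
  backend: ["api/", "tests/api/"],
  frontend: ["web/", "tests/web/"],
};

// Return the names of suites that have at least one changed file
// under one of their prefixes.
function affectedSuites(changedFiles, suites = SUITES) {
  return Object.keys(suites).filter((name) =>
    changedFiles.some((file) =>
      suites[name].some((prefix) => file.startsWith(prefix))
    )
  );
}

console.log(affectedSuites(["api/users.js", "README.md"])); // [ 'backend' ]
console.log(affectedSuites(["web/app.tsx", "tests/api/users.test.js"])); // [ 'backend', 'frontend' ]
console.log(affectedSuites(["docs/setup.md"])); // []
```

A docs-only change maps to no suites at all, which is exactly the case where both test jobs are skipped.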

Time waster #4: Sequential jobs that could be parallel

Many teams structure their pipeline as a linear chain: lint, then type-check, then unit tests, then integration tests, then e2e tests. If linting takes 2 minutes and tests take 15 minutes, you’re waiting 17 minutes total. But lint and tests don’t depend on each other — they can run simultaneously.

The fix: Parallelize independent jobs and use matrix strategy.

Parallel independent jobs

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run typecheck

  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]  # Split tests across 4 runners
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4

  # Gate deployment on all checks passing
  deploy:
    needs: [lint, typecheck, test]
    runs-on: ubuntu-latest
    steps:
      - run: echo "All checks passed, deploying..."

With this setup, lint (2 min), typecheck (1 min), and 4 parallel test shards (4 min each instead of 16 min total) all run simultaneously. Total wall time drops from 19 minutes to about 4 minutes, since the slowest job now sets the pace. You pay for more compute-minutes because every job repeats checkout and npm ci, but your developers get feedback nearly 5x faster.
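
The arithmetic behind that estimate can be sketched in a few lines. This is a back-of-the-envelope model, assuming tests split evenly across shards and ignoring per-job setup time; the function names are hypothetical.

```javascript
// Sequential layout: every job waits for the previous one, so times add up.
function sequentialMinutes(jobs) {
  return jobs.reduce((sum, job) => sum + job.minutes, 0);
}

// Parallel layout: wall time is set by the slowest job, where a sharded job
// is assumed to split evenly across its runners.
function parallelMinutes(jobs) {
  return Math.max(...jobs.map((job) => job.minutes / (job.shards ?? 1)));
}

// The illustrative durations used in this section.
const jobs = [
  { name: "lint", minutes: 2 },
  { name: "typecheck", minutes: 1 },
  { name: "test", minutes: 16, shards: 4 },
];

const before = sequentialMinutes(jobs);
const after = parallelMinutes(jobs);
console.log(
  `sequential: ${before} min, parallel: ${after} min, speedup: ${(before / after).toFixed(2)}x`
);
// sequential: 19 min, parallel: 4 min, speedup: 4.75x
```

Real shards are rarely perfectly balanced, so treat the parallel figure as a lower bound on wall time.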

Time waster #5: Oversized Docker images

If your CI builds Docker images, the image size directly impacts build time, push time, and pull time. A 2GB image takes minutes to push to a registry and minutes to pull on every deploy. Most of that size is build dependencies and tooling that aren’t needed at runtime.

The fix: Multi-stage builds with slim base images.

Dockerfile — multi-stage build

# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: Production (only runtime dependencies)
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Only copy what's needed to run
COPY --from=builder /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist

# Result: ~150MB instead of ~1.5GB
CMD ["node", "dist/server.js"]

GitHub Actions — Docker layer caching

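# The gha cache backend needs a Buildx builder, not the default docker driver
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push Docker image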
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myapp:latest
    cache-from: type=gha    # Use GitHub Actions cache
    cache-to: type=gha,mode=max

How to measure CI waste

Before optimizing, measure where your time actually goes. GitHub provides a built-in usage report, but it only shows total minutes. To understand why those minutes are being spent, you need more granularity.

GitHub Actions usage report: Go to Settings → Billing → Actions to see total minutes consumed. This gives you the dollar baseline.
Workflow run duration trends: Use the GitHub API or gh run list to track how your workflow duration has changed over time. If it’s trending up, something is degrading.
Job-level timing: Look at individual job durations in the Actions tab. The longest job is your bottleneck — that’s where optimization has the biggest impact.
Flaky test cost: Kleore specifically measures the cost of flaky test reruns — how many minutes are wasted re-running workflows that failed due to flaky tests rather than real bugs.
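
For the duration-trend step, one possible starting point is to export runs with gh run list --limit 200 --json workflowName,startedAt,updatedAt,conclusion and summarize them per workflow. The sketch below assumes that JSON shape and treats updatedAt as an approximation of when the run finished. summarizeRuns is a hypothetical helper and the sample data is made up.

```javascript
// Summarize average duration and failure count per workflow.
// Assumption: updatedAt is close enough to the run's end time for trend analysis.
function summarizeRuns(runs) {
  const byWorkflow = new Map();
  for (const run of runs) {
    const minutes = (Date.parse(run.updatedAt) - Date.parse(run.startedAt)) / 60000;
    if (!Number.isFinite(minutes) || minutes < 0) continue; // skip malformed rows
    const entry = byWorkflow.get(run.workflowName) ?? { runs: 0, totalMinutes: 0, failures: 0 };
    entry.runs += 1;
    entry.totalMinutes += minutes;
    if (run.conclusion === "failure") entry.failures += 1;
    byWorkflow.set(run.workflowName, entry);
  }
  // Slowest workflows first: those are the likely bottlenecks.
  return [...byWorkflow]
    .map(([workflow, e]) => ({
      workflow,
      runs: e.runs,
      avgMinutes: e.totalMinutes / e.runs,
      failures: e.failures,
    }))
    .sort((a, b) => b.avgMinutes - a.avgMinutes);
}

// Made-up sample in the shape produced by the gh command above.
const sampleRuns = [
  { workflowName: "test", startedAt: "2026-05-01T10:00:00Z", updatedAt: "2026-05-01T10:12:00Z", conclusion: "success" },
  { workflowName: "test", startedAt: "2026-05-02T10:00:00Z", updatedAt: "2026-05-02T10:18:00Z", conclusion: "failure" },
  { workflowName: "lint", startedAt: "2026-05-02T10:00:00Z", updatedAt: "2026-05-02T10:02:00Z", conclusion: "success" },
];

console.log(summarizeRuns(sampleRuns));
```

Running this weekly and comparing avgMinutes over time gives a cheap trend line before you invest in dedicated tooling.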

Quick wins checklist

Here’s a prioritized checklist you can work through this week. Each item is independent — start with whichever is easiest for your setup.

Enable dependency caching — 5 minutes to set up, saves 1-3 minutes per run. Use actions/setup-node or actions/setup-python with the cache option.
Parallelize lint/typecheck/test — 15 minutes to restructure your workflow. Independent jobs run simultaneously instead of sequentially.
Add path filters — 10 minutes to add paths or paths-ignore to your workflow trigger (GitHub Actions does not accept both on the same event). Docs-only PRs skip CI entirely.
Shard your test suite — 20 minutes to set up matrix strategy. Split tests across 2-4 runners for a proportional speedup.
Identify and quarantine flaky tests — 5 minutes to install Kleore. Get a ranked list of every flaky test, then quarantine the worst offenders to stop wasting reruns.
Use multi-stage Docker builds — 30 minutes to refactor your Dockerfile. Cuts image size by 50-90%, which speeds up both build and deploy.

See how much your CI is wasting.

Kleore scans your GitHub Actions history and shows you exactly where your CI minutes go — flaky reruns, slow tests, and wasted compute. You get a dollar amount and a prioritized fix list.

Why Claude Code AutoFix Can’t Fix Flaky Tests

What AutoFix actually does

Flaky tests break the loop

1. The speculative fix loop

2. The infinite re-run

3. The wrong file blamed

The cost math is bad

The right move: classify before you fix

This is what Kleore does

Stop letting AutoFix burn cycles on flakes.

FAQ: Claude Code AutoFix and flaky tests

Can Claude Code AutoFix fix flaky tests?

Why does AutoFix loop on flaky tests?

How do I stop AutoFix from wasting cycles on flaky tests?

Are AutoFix and Kleore competitors?

How much does running AutoFix on a flaky test cost?

Further reading

How to Find and Fix Flaky Tests in Jest

Why Jest tests become flaky

How to identify flaky Jest tests

Use --detectOpenHandles to find leaked resources

Run tests repeatedly to reproduce flakes

Randomize test order to find hidden dependencies

Use jest-circus retry for automatic detection

Common patterns and fixes

Pattern 1: Shared mutable state between tests

Pattern 2: Timer-dependent tests

Pattern 3: Async race conditions

Pattern 4: Port conflicts in integration tests

Pattern 5: Snapshot drift

How to quarantine flaky Jest tests

Prevention: jest.config settings that reduce flakiness

Stop guessing which Jest tests are flaky.

Further reading

We Analyzed 10,000 GitHub Actions Runs — Here’s What Flaky Tests Actually Cost

Finding 1: 30% of CI reruns are caused by flaky tests, not real bugs

Finding 2: The average flaky test costs $37.50 per occurrence

Finding 3: 80% of CI waste comes from the top 3 tests

Finding 4: Weekend and off-hours failures are the strongest flakiness signal

Finding 5: Quarantining flaky tests cuts CI reruns by 60% within 2 weeks

What you can do about it

See your own numbers.

Further reading

Kleore vs Alternatives: Honest Comparison

Feature comparison

Pricing

Kleore

BuildPulse

Trunk Flaky Tests

Datadog Test Visibility

Setup complexity

Kleore

BuildPulse

Trunk

Datadog

Where each tool wins

Choose Kleore if...

Choose BuildPulse if...

Choose Trunk if...

Choose Datadog if...

The bottom line

See for yourself — it takes two minutes.

Further reading

What Are Flaky Tests? The Silent Killer of CI Pipelines

Why do tests become flaky?

Timing & race conditions

Shared state

External dependencies

Environment differences

Date & time sensitivity

Resource contention

The real cost is invisible

How do you know if you have a flaky test problem?

What high-performing teams do differently