Taras H

Posted on Jun 9 • Originally published at codenotes.tech

Why Integration Tests Flake in CI but Pass Locally

#testing #ci #backend #debugging

An integration test that passes locally and fails in CI is usually not random.

It is usually depending on something the test does not control:

shared database state
test order
worker parallelism
real time
background jobs
service startup timing
reused fixture data

That is why the same test can fail in CI, pass on rerun, and then fail again tomorrow.
The rerun did not fix anything.
It only gave the hidden assumption an environment where it happened to hold.

I have found that flaky integration tests become much easier to fix when you stop asking:

How do I make this test pass?

and start asking:

What did this test borrow from the environment?

Here are the failure patterns I would check first.

1. The Test Shares Data With Another Test

This is the classic one.

The test creates a user like this:

const user = await createUser({
  email: 'integration@example.com',
})

It passes locally because you run one file against one clean database.

Then CI runs several files in parallel.
Another test creates the same email.
A previous run leaves a row behind.
A worker truncates a table while another worker is asserting behavior.

Now the test fails with a unique constraint error, a wrong row count, or a status that makes no sense.

The fix is not to retry the test.
The fix is to give every run ownership of its own data.

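import { randomUUID } from 'node:crypto'

// Builds an id that is unique per CI run, per worker, and per call.
function testId(name: string) {
  return [
    process.env.GITHUB_RUN_ID ?? 'local',
    process.env.TEST_WORKER_ID ?? 'w0',
    name,
    randomUUID(),
  ].join('_')
}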

const runId = testId('order-create')

const user = await createUser({
  email: `${runId}@example.test`,
  testRunId: runId,
})

Now the failure log can tell you which run created the data, and cleanup can target only that run.
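
Run-scoped cleanup could look roughly like the sketch below. It is an illustration under assumptions, not a fixed recipe: the test_run_id column, the table list, and the buildRunCleanup helper are all hypothetical, and the statements use PostgreSQL-style $1 placeholders.

```typescript
// Hypothetical sketch: assumes every test-owned table carries a test_run_id column.
type CleanupStatement = { text: string; values: string[] }

// Child tables come first so foreign keys do not block the deletes.
const RUN_SCOPED_TABLES = ['order_items', 'orders', 'users'] as const

function buildRunCleanup(runId: string): CleanupStatement[] {
  return RUN_SCOPED_TABLES.map((table) => ({
    // Table names come from the fixed list above; only the run id is a bound parameter.
    text: `DELETE FROM ${table} WHERE test_run_id = $1`,
    values: [runId],
  }))
}

// Example: the statements a test would hand to its database client after the run.
const statements = buildRunCleanup('local_w0_order-create_1234')
```

Because every delete is filtered by the run id, a worker that runs this cleanup cannot remove rows that belong to another run.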

2. Cleanup Works Locally But Breaks Under Parallelism

Database cleanup is part of the test design.
It is not just housekeeping.

For serial tests, truncating tables between cases can be fine:

TRUNCATE TABLE
  outbox_events,
  payment_attempts,
  order_items,
  orders,
  users
RESTART IDENTITY CASCADE;

But if multiple CI workers share the same database, truncation can become destructive.
One worker may delete rows while another worker is still using them.

Better options are:

database per worker
schema per worker
run-scoped data with a testRunId
transaction rollback when the app and test can share the same transaction boundary

The exact choice depends on the stack.
The rule is simpler:

One worker should not be able to delete or mutate another worker's data.
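
For the schema-per-worker option, one possible starting point is a helper that derives a schema name from the run and worker ids. This is a sketch under assumptions: workerSchemaName is a hypothetical helper, and how the ids reach it depends on your runner.

```typescript
// Hypothetical helper: derives one PostgreSQL schema name per CI run and worker.
function workerSchemaName(runId: string | undefined, workerId: string | undefined): string {
  const raw = `test_${workerId ?? 'w0'}_${runId ?? 'local'}`
  // Lowercase and replace anything outside [a-z0-9_] so the name needs no quoting.
  const safe = raw.toLowerCase().replace(/[^a-z0-9_]/g, '_')
  // PostgreSQL limits identifiers to 63 bytes by default, so truncate explicitly.
  return safe.slice(0, 63)
}

// Example: worker 3 of CI run 12345.
const schema = workerSchemaName('12345', '3')
// Each worker would then run something like:
//   CREATE SCHEMA IF NOT EXISTS <schema>; SET search_path TO <schema>;
```

With a separate schema per worker, a TRUNCATE issued by one worker only touches that worker's tables.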

3. The Test Sleeps Instead Of Waiting For Behavior

This is another common source of CI-only flakes:

await request(app).post('/api/orders').send(payload)

await sleep(500)

const order = await db.order.findFirst({
  where: { externalReference: payload.externalReference },
})

expect(order?.status).toBe('confirmed')

This test is not waiting for the system.
It is waiting for the clock.

If CI is slow, 500 ms may not be enough.
If CI is fast, the sleep wastes time.
If the worker crashed, the test waits anyway and then fails with weak evidence.

Prefer bounded polling against the state that matters:

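const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Polls `read` until `assert` stops throwing or the deadline passes.
async function eventually<T>(
  read: () => Promise<T>,
  assert: (value: T) => void,
  { timeoutMs = 5000, intervalMs = 100 } = {}
) {
  const deadline = Date.now() + timeoutMs

  // The loop always checks at least once, and once more after the last sleep.
  while (true) {
    const value = await read()

    try {
      assert(value)
      return
    } catch (error) {
      // Past the deadline: surface the most recent assertion failure as evidence.
      if (Date.now() >= deadline) throw error
    }

    await sleep(intervalMs)
  }
}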

The timeout still exists, but now it protects a meaningful condition.
The test waits for a durable effect, not an arbitrary delay.
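
To see what that looks like in practice, here is a self-contained sketch that polls a fake order until it is confirmed. The in-memory orders map, placeOrder, and waitForConfirmed are hypothetical stand-ins for the real database and worker, and the helper is repeated in compact form only so the example runs on its own.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Compact copy of the polling helper so this sketch runs on its own.
async function eventually<T>(
  read: () => Promise<T>,
  assert: (value: T) => void,
  { timeoutMs = 5000, intervalMs = 100 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (true) {
    const value = await read()
    try {
      assert(value)
      return
    } catch (error) {
      if (Date.now() >= deadline) throw error
    }
    await sleep(intervalMs)
  }
}

// Hypothetical in-memory stand-ins for the orders table and the background worker.
type Order = { externalReference: string; status: 'pending' | 'confirmed' }
const orders = new Map<string, Order>()

function placeOrder(externalReference: string, confirmAfterMs: number | null): void {
  orders.set(externalReference, { externalReference, status: 'pending' })
  if (confirmAfterMs === null) return // simulates a worker that crashed
  setTimeout(() => {
    orders.set(externalReference, { externalReference, status: 'confirmed' })
  }, confirmAfterMs)
}

function waitForConfirmed(externalReference: string, timeoutMs = 1000): Promise<void> {
  return eventually(
    async () => orders.get(externalReference),
    (order) => {
      if (order?.status !== 'confirmed') {
        // The message records the last observed state, not just "timed out".
        throw new Error(`order ${externalReference} is ${order?.status ?? 'missing'}`)
      }
    },
    { timeoutMs, intervalMs: 10 }
  )
}
```

If the fake worker never confirms the order, the failure says which state the order was stuck in, which is stronger evidence than a bare timeout.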

4. The Test Accidentally Depends On Real Time

Some tests fail only near midnight, month end, daylight saving changes, or slow CI runs.

Examples:

token expiration
trial period calculation
scheduled jobs
invoice dates
order expiry
time-zone conversion

If time is not the behavior under test, freeze it.

// `clock` stands for whichever fake-clock helper your test setup provides.
beforeEach(() => {
  clock.freeze(new Date('2026-05-30T10:00:00.000Z'))
})

afterEach(() => {
  clock.restore()
})

Also make the CI time zone explicit:

env:
  TZ: UTC

One caveat: make sure the application and database agree about time.
If the app uses a fake JavaScript clock but the database calls now(), the database still reads the real clock, so timestamps written by SQL can disagree with the frozen time.
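
One way to keep the two in agreement is to compute timestamps in the application and pass them to the database as parameters. The sketch below assumes a hypothetical Clock interface, a hypothetical trials table, and PostgreSQL-style placeholders; it is one option, not the only one.

```typescript
// Hypothetical clock abstraction: production uses the system clock, tests use a fixed one.
interface Clock {
  now(): Date
}

const systemClock: Clock = { now: () => new Date() }

function fixedClock(iso: string): Clock {
  const instant = new Date(iso).getTime()
  // Return a fresh Date each time so callers cannot mutate the frozen instant.
  return { now: () => new Date(instant) }
}

// Both timestamps come from the injected clock, so the SQL never calls now().
function buildTrialInsert(clock: Clock, userId: string, trialDays: number) {
  const startedAt = clock.now()
  const expiresAt = new Date(startedAt.getTime() + trialDays * 24 * 60 * 60 * 1000)
  return {
    text: 'INSERT INTO trials (user_id, started_at, expires_at) VALUES ($1, $2, $3)',
    values: [userId, startedAt.toISOString(), expiresAt.toISOString()],
  }
}

// Example: a 14-day trial starting at the frozen test time.
const insert = buildTrialInsert(fixedClock('2026-05-30T10:00:00.000Z'), 'user-1', 14)
```

In a test, the frozen clock now decides every timestamp in the row, so an assertion on the expiry date cannot drift with the real clock.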

5. The Dependency Is Running But Not Ready

A container can be "up" before it is useful.

PostgreSQL may accept connections before migrations finish.
An HTTP stub may open a port before fixtures are loaded.
A worker may start after the first request already created a job.

The first test fails.
The rerun passes.

That is a readiness problem, not a flaky assertion.

Make startup explicit:

beforeAll(async () => {
  await waitForDatabase()
  await runMigrations()
  await resetDatabase()
  await waitForWorker()
})

"Port is open" is weaker than "the service can do the work this test needs."

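A readiness helper in that spirit might look like the sketch below. The names waitUntilReady and migrationProbe are hypothetical, and the probe is an assumption about what "ready" means for your stack: it returns null once the dependency can do real work and a reason string while it cannot.

```typescript
const pause = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// A probe returns null when the dependency can do real work, or a reason when it cannot yet.
type Probe = () => Promise<string | null>

async function waitUntilReady(
  name: string,
  probe: Probe,
  { timeoutMs = 30_000, intervalMs = 250 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  while (true) {
    // Treat a thrown probe (for example, connection refused) as "not ready yet".
    const reason = await probe().catch((error: unknown) => String(error))
    if (reason === null) return
    if (Date.now() >= deadline) {
      throw new Error(`${name} not ready after ${timeoutMs} ms: ${reason}`)
    }
    await pause(intervalMs)
  }
}

// Hypothetical probe: ready only once the expected migration has been applied.
function migrationProbe(readAppliedVersion: () => Promise<number>, expected: number): Probe {
  return async () => {
    const applied = await readAppliedVersion()
    return applied >= expected ? null : `migration ${applied} applied, need ${expected}`
  }
}
```

The probe checks the thing the tests depend on, an applied migration, instead of an open port, and a timeout reports the last reason the dependency was not ready.
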
What I Log When A CI Integration Test Fails

Before changing the test, I want the failure to leave evidence.

At minimum:

test name and file
CI run id
attempt number
worker id
database or schema name
random seed, if the runner has one
ids of created users/orders/tenants
current time and time zone
recent relevant database rows
pending background jobs
dependency stub calls

Without that, a rerun can make the test green without teaching you anything.
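
As a rough sketch, part of that evidence can be collected into one object and printed when a test fails. The FailureContext shape and buildFailureContext are hypothetical. GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are the GitHub Actions defaults, so other CI systems need different names, and TEST_WORKER_ID is an assumed variable that your runner may call something else.

```typescript
// Hypothetical shape for the evidence a failed test should leave behind.
type FailureContext = {
  test: string
  file: string
  ciRunId: string
  attempt: string
  workerId: string
  schema: string
  createdIds: Record<string, string[]>
  now: string
  timeZone: string
}

function buildFailureContext(
  test: { name: string; file: string; schema: string; createdIds: Record<string, string[]> },
  env: Record<string, string | undefined>,
  now: Date = new Date()
): FailureContext {
  return {
    test: test.name,
    file: test.file,
    ciRunId: env.GITHUB_RUN_ID ?? 'local',
    // This is the workflow run attempt, not a per-test retry counter.
    attempt: env.GITHUB_RUN_ATTEMPT ?? '1',
    workerId: env.TEST_WORKER_ID ?? 'w0',
    schema: test.schema,
    createdIds: test.createdIds,
    now: now.toISOString(),
    // The time zone the test process actually resolved, which may differ from what you expect.
    timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  }
}

// Example: what a failure hook might print for one test.
const context = buildFailureContext(
  { name: 'creates an order', file: 'orders.test.ts', schema: 'test_3_42', createdIds: { users: ['u1'] } },
  { GITHUB_RUN_ID: '42', TEST_WORKER_ID: '3' },
  new Date('2026-05-30T10:00:00.000Z')
)
console.error(JSON.stringify(context, null, 2))
```

Database rows, pending jobs, and stub calls need queries against the real system, so they are left out of this sketch.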

A Useful Debugging Order

When an integration test flakes in CI, I check this sequence:

Does it reuse global fixture values?
Can two workers touch the same rows?
Does cleanup delete another worker's data?
Does it sleep instead of waiting for observable state?
Does it depend on the real clock or local time zone?
Are containers actually ready before tests start?
Does the failure reproduce only with parallelism?
Is the test exposing a real race condition in the product?

That last point matters.

Sometimes a flaky integration test is not "just a bad test."
Sometimes it is the only thing showing you that a boundary is unsafe: a transaction commits too early, an idempotency key is missing, a worker can process the same job twice, or a status transition is not atomic.

Takeaway

A flaky integration test is a test with an uncontrolled dependency.

The dependency might be data, time, order, parallelism, cleanup, startup, or a real product race.

Do not start by hiding the failure.
Find what the test borrowed from the environment.
Then make that dependency explicit, isolated, observable, or removed.

I wrote a longer version with database cleanup strategies, bounded polling examples, quarantine rules, and a checklist here:

https://codenotes.tech/blog/flaky-integration-tests-in-ci