An integration test that passes locally and fails in CI is usually not random.
It usually depends on something the test does not control:
- shared database state
- test order
- worker parallelism
- real time
- background jobs
- service startup timing
- reused fixture data
That is why the same test can fail in CI, pass on rerun, and then fail again tomorrow.
The rerun did not fix anything.
It only gave the hidden assumption a better environment.
I have found that flaky integration tests become much easier to fix when you stop asking:
How do I make this test pass?
and start asking:
What did this test borrow from the environment?
Here are the failure patterns I would check first.
1. The Test Shares Data With Another Test
This is the classic one.
The test creates a user like this:
const user = await createUser({
  email: 'integration@example.com',
})
It passes locally because you run one file against one clean database.
Then CI runs several files in parallel.
Another test creates the same email.
A previous run leaves a row behind.
A worker truncates a table while another worker is asserting behavior.
Now the test fails with a unique constraint error, a wrong row count, or a status that makes no sense.
The fix is not to retry the test.
The fix is to give every run ownership of its own data.
import { randomUUID } from 'node:crypto'

// Builds an id that records which CI run and worker created the data.
function testId(name: string) {
  return [
    process.env.GITHUB_RUN_ID ?? 'local',
    process.env.TEST_WORKER_ID ?? 'w0',
    name,
    randomUUID(),
  ].join('_')
}

const runId = testId('order-create')
const user = await createUser({
  email: `${runId}@example.test`,
  testRunId: runId,
})
Now the failure log can tell you which run created the data, and cleanup can target only that run.
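Run-scoped cleanup could then look something like the sketch below. It assumes every test-owned table carries a test_run_id column and that the table list is the one shown; both are illustrative assumptions, not something the original schema guarantees. The function only builds parameterized statements, so it can be handed to whichever database client the project uses.

```typescript
// Hypothetical table list, ordered children-first so foreign keys are not violated.
// Assumption: every table has a test_run_id column populated by the test helpers.
const RUN_SCOPED_TABLES = ['order_items', 'orders', 'users'] as const

interface Statement {
  text: string
  values: string[]
}

// Builds one parameterized DELETE per table for a single test run.
// The run id is passed as a bind parameter, never interpolated into the SQL.
function cleanupStatements(runId: string): Statement[] {
  if (runId.trim() === '') {
    // An empty id makes it too easy to write a delete that matches the wrong rows.
    throw new Error('cleanupStatements requires a non-empty runId')
  }
  return RUN_SCOPED_TABLES.map((table) => ({
    text: `DELETE FROM ${table} WHERE test_run_id = $1`,
    values: [runId],
  }))
}
```

Because each statement filters on the run id, a worker that runs it mid-suite cannot remove rows that belong to another run.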
2. Cleanup Works Locally But Breaks Under Parallelism
Database cleanup is part of the test design.
It is not just housekeeping.
For serial tests, truncating tables between cases can be fine:
TRUNCATE TABLE
  outbox_events,
  payment_attempts,
  order_items,
  orders,
  users
RESTART IDENTITY CASCADE;
But if multiple CI workers share the same database, truncation can become destructive.
One worker may delete rows while another worker is still using them.
Better options are:
- database per worker
- schema per worker
- run-scoped data with a testRunId
- transaction rollback when the app and test can share the same transaction boundary
The exact choice depends on the stack.
The rule is simpler:
One worker should not be able to delete or mutate another worker's data.
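For the schema-per-worker option, a minimal sketch might derive the schema name from the worker id and create it before the suite starts. The function names and the "test" prefix are hypothetical, and the SQL assumes PostgreSQL.

```typescript
// Derives a PostgreSQL-safe schema name for one CI worker.
// Assumption: the runner exposes some worker id string.
function schemaForWorker(workerId: string, prefix = 'test'): string {
  // Keep only characters that are valid in an unquoted identifier.
  const safe = workerId.toLowerCase().replace(/[^a-z0-9_]/g, '_')
  if (safe === '') throw new Error('workerId must not be empty')
  // PostgreSQL truncates identifiers longer than 63 bytes, so cap the length here.
  return `${prefix}_${safe}`.slice(0, 63)
}

// Statements a worker could run once before its tests start.
// The schema name is safe to interpolate because schemaForWorker sanitized it.
function schemaSetupSql(schema: string): string[] {
  return [
    `DROP SCHEMA IF EXISTS ${schema} CASCADE`,
    `CREATE SCHEMA ${schema}`,
    `SET search_path TO ${schema}`,
  ]
}
```

Note that search_path is a per-session setting, so with a connection pool it would need to be applied to every connection the worker opens, and migrations would need to run inside the new schema.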
3. The Test Sleeps Instead Of Waiting For Behavior
This is another common source of CI-only flakes:
await request(app).post('/api/orders').send(payload)
await sleep(500)
const order = await db.order.findFirst({
  where: { externalReference: payload.externalReference },
})
expect(order?.status).toBe('confirmed')
This test is not waiting for the system.
It is waiting for the clock.
If CI is slow, 500 ms is not enough.
If CI is fast, the sleep wastes time.
If the worker crashed, the test waits anyway and then fails with weak evidence.
Prefer bounded polling against the state that matters:
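const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Polls read() until assert() stops throwing or the timeout elapses.
async function eventually<T>(
  read: () => Promise<T>,
  assert: (value: T) => void,
  { timeoutMs = 5000, intervalMs = 100 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  let lastError: unknown
  // do/while guarantees at least one attempt, so lastError is always set before the throw.
  do {
    try {
      // read() sits inside the try so a transient read failure is retried too.
      const value = await read()
      assert(value)
      return value
    } catch (error) {
      lastError = error
      await sleep(intervalMs)
    }
  } while (Date.now() < deadline)
  // Rethrow the last failure so the report shows the final observed state.
  throw lastError
}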
The timeout still exists, but now it protects a meaningful condition.
The test waits for a durable effect, not an arbitrary delay.
4. The Test Accidentally Depends On Real Time
Some tests fail only near midnight, month end, daylight saving changes, or slow CI runs.
Examples:
- token expiration
- trial period calculation
- scheduled jobs
- invoice dates
- order expiry
- time-zone conversion
If time is not the behavior under test, freeze it.
beforeEach(() => {
  clock.freeze(new Date('2026-05-30T10:00:00.000Z'))
})
afterEach(() => {
  clock.restore()
})
Also make the CI time zone explicit:
env:
  TZ: UTC
One caveat: make sure the application and database agree about time.
If the app uses a fake JavaScript clock but the database uses now(), the test can still be inconsistent.
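One way to keep the two in agreement is to let the application own the clock and pass the timestamp into queries as a parameter, so the database never calls now() itself. The sketch below illustrates the idea; the Clock interface, the fixedClock helper, and the expires_at column are hypothetical names, not part of any particular library.

```typescript
// A tiny clock abstraction: tests pass a fixed clock, production passes the system clock.
interface Clock {
  now(): Date
}

const systemClock: Clock = { now: () => new Date() }

function fixedClock(iso: string): Clock {
  const instant = new Date(iso)
  // Return a copy each time so callers cannot mutate the frozen instant.
  return { now: () => new Date(instant.getTime()) }
}

// Hypothetical query builder: the cutoff is a bind parameter taken from the clock,
// so the database never consults its own now().
function expiredOrdersQuery(clock: Clock): { text: string; values: string[] } {
  return {
    text: 'SELECT id FROM orders WHERE expires_at <= $1',
    values: [clock.now().toISOString()],
  }
}
```

With this shape, freezing time in the test freezes it for the query as well, because both read the same injected clock.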
5. The Dependency Is Running But Not Ready
A container can be "up" before it is useful.
PostgreSQL may accept connections before migrations finish.
An HTTP stub may open a port before fixtures are loaded.
A worker may start after the first request already created a job.
The first test fails.
The rerun passes.
That is a readiness problem, not a flaky assertion.
Make startup explicit:
beforeAll(async () => {
  await waitForDatabase()
  await runMigrations()
  await resetDatabase()
  await waitForWorker()
})
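The helpers in that hook are left undefined here. As a rough sketch, waitForDatabase and waitForWorker could both be built on a generic probe loop like the one below. The waitUntilReady name and its options are assumptions for illustration; the probe itself would be whatever check proves the dependency can do real work.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Retries a readiness probe until it resolves to true or the deadline passes.
// The probe should exercise what the tests need (for example, query a migrated table),
// not just check that a port accepts connections.
async function waitUntilReady(
  name: string,
  probe: () => Promise<boolean>,
  { timeoutMs = 30000, intervalMs = 250 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs
  let lastError: unknown
  do {
    try {
      if (await probe()) return
      lastError = new Error('probe returned false')
    } catch (error) {
      // Connection errors are expected while a service is still booting.
      lastError = error
    }
    await sleep(intervalMs)
  } while (Date.now() < deadline)
  const reason = lastError instanceof Error ? lastError.message : String(lastError)
  throw new Error(`${name} not ready after ${timeoutMs} ms: ${reason}`)
}
```

A database probe might run a trivial query against a table that only exists after migrations, and a worker probe might check a heartbeat row or a health endpoint. On timeout, the error names the dependency and the last failure reason, which is more useful than a failed assertion in the first test.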
"Port is open" is weaker than "the service can do the work this test needs."
What I Log When A CI Integration Test Fails
Before changing the test, I want the failure to leave evidence.
At minimum:
- test name and file
- CI run id
- attempt number
- worker id
- database or schema name
- random seed, if the runner has one
- ids of created users/orders/tenants
- current time and time zone
- recent relevant database rows
- pending background jobs
- dependency stub calls
Without that, a rerun can make the test green without teaching you anything.
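Part of that list can be gathered by a small helper, sketched below. GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are GitHub Actions variables; TEST_WORKER_ID and TEST_SCHEMA are assumed names that would need to match whatever the runner and setup code actually expose. Database rows, pending jobs, and stub calls would still have to be collected separately.

```typescript
interface FailureContext {
  testName: string
  runId: string
  attempt: number
  workerId: string
  schema: string
  createdIds: string[]
  timestamp: string
  timeZone: string
}

// Collects run, worker, and time evidence into one JSON-serializable object.
// The environment is passed in so the helper is easy to test and has no hidden inputs.
function buildFailureContext(
  testName: string,
  createdIds: string[],
  env: Record<string, string | undefined>,
  now: Date = new Date()
): FailureContext {
  return {
    testName,
    runId: env['GITHUB_RUN_ID'] ?? 'local',
    attempt: Number(env['GITHUB_RUN_ATTEMPT'] ?? '1'),
    workerId: env['TEST_WORKER_ID'] ?? 'w0', // assumed variable name
    schema: env['TEST_SCHEMA'] ?? 'public', // assumed variable name
    createdIds,
    timestamp: now.toISOString(),
    // Fall back to the runtime's resolved zone when TZ is not set explicitly.
    timeZone: env['TZ'] ?? Intl.DateTimeFormat().resolvedOptions().timeZone,
  }
}
```

In a Node-based runner this could be called with process.env from a failure hook and written out with JSON.stringify, so every red run leaves the same structured record.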
A Useful Debugging Order
When an integration test flakes in CI, I check this sequence:
- Does it reuse global fixture values?
- Can two workers touch the same rows?
- Does cleanup delete another worker's data?
- Does it sleep instead of waiting for observable state?
- Does it depend on the real clock or local time zone?
- Are containers actually ready before tests start?
- Does the failure reproduce only with parallelism?
- Is the test exposing a real race condition in the product?
That last point matters.
Sometimes a flaky integration test is not "just a bad test."
Sometimes it is the only thing showing you that a boundary is unsafe: a transaction commits too early, an idempotency key is missing, a worker can process the same job twice, or a status transition is not atomic.
Takeaway
A flaky integration test is a test with an uncontrolled dependency.
The dependency might be data, time, order, parallelism, cleanup, startup, or a real product race.
Do not start by hiding the failure.
Find what the test borrowed from the environment.
Then make that dependency explicit, isolated, observable, or removed.
I wrote a longer version with database cleanup strategies, bounded polling examples, quarantine rules, and a checklist here: