DEV Community

Sebastian Lim

Flaky Tests Are Not a Testing Problem. They're a Feedback Loop You Broke.

Every retry rule in your CI pipeline is a painkiller. It suppresses the symptom, the stock of broken code keeps growing underneath, and nobody feels the pain until the whole system is addicted.

I came across this post on HN that perfectly illustrates the pattern: retries everywhere, quarantining tests, adding waits, slowly losing trust in CI signal. The author asked whether flakiness is "a test problem, a product problem, or infrastructure noise."

It's none of those. It's a system structure problem. And if you look at it through the lens of System Dynamics, the diagnosis becomes obvious.

The Addiction Loop (R1)

Every "fix" that masks a failure instead of resolving it feeds a reinforcing loop:

    (Symptom)                    (Symptom Relief)
    RED PIPELINE  ---------------->  GREEN BUILD
         ^                                |
         |          (Short Term)          |
         |           BALANCING            |
         |             LOOP               |
         |                                v
         +----------- <RETRY> ------------+
         |          (Intervention)        |
         |                                |
         |                                |
         | (Long Term Side-Effect)        |
         |     REINFORCING LOOP (R1)      |
         |     "The Addiction"            |
         |                                |
         |                                v
    +-----------+                +-------------+
    |   MORE    | <------------- | HIDDEN BUGS |
    | FLAKINESS |    (Delay)     | ACCUMULATE  |
    +-----------+                +-------------+

Look at the bottom loop. Every time you hit "retry," you feel good because the light turns green. But you are feeding R1: hidden bugs accumulate, making the system flakier, forcing you to retry even more tomorrow.

This is textbook "Shifting the Burden," a classic system archetype; Donella Meadows describes the same trap in Thinking in Systems as "shifting the burden to the intervenor." The short-term fix (retry) actively undermines the long-term solution (actually fixing the bug).

The Erosion Loop (R2)

R1 doesn't run alone. It drags a second loop behind it:

    Actual Quality              Perceived Quality
    (Lots of Red)  --------->   (It's just noise)
          ^                           |
          |                           |
          |      REINFORCING          v
          |        LOOP (R2)      LOWER STANDARDS
          |     "The Erosion"     (Normalize Failure)
          |                           |
          |                           |
          +---------------------------+
             (Less debugging,
              more merging)

This is "Drift to Low Performance." Because you don't trust the CI signal, you lower your standards. Because you lower standards, you merge worse code. Which makes the signal even less trustworthy. Repeat until your CI pipeline is a decoration.

The original HN post asked: "how do QA and engineering teams split responsibility?" That's the wrong question. The real question is: how do you make the pain of instability felt by the person who introduced it? Right now, the infrastructure absorbs the pain by retrying, so developers never feel it. They keep submitting flaky code because the system lets them get away with it.
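One way to route that pain back is a merge gate with a retry budget. Here is a minimal, hypothetical sketch (the budget number, the `merge_allowed` name, and the event format are my own illustration, not from the original post): every retry a pull request consumes is charged to that PR, and the merge is blocked once the budget is exhausted.

```python
# Hypothetical sketch: charge each retry to the change that needed it,
# so the author feels the instability instead of the infrastructure
# silently absorbing it. Names and formats here are illustrative.

from collections import Counter

RETRY_BUDGET = 2  # retries a single PR may consume before merging is blocked

def merge_allowed(retry_events, pr_id):
    """retry_events: iterable of (pr_id, test_name) tuples, one per retry."""
    used = Counter(pr for pr, _ in retry_events)[pr_id]
    return used <= RETRY_BUDGET

events = [("pr-42", "test_login"), ("pr-42", "test_login"),
          ("pr-42", "test_cart"), ("pr-7", "test_login")]
merge_allowed(events, "pr-42")  # False: pr-42 burned 3 retries, over budget
merge_allowed(events, "pr-7")   # True: one retry is within budget
```

The point is not the specific threshold; it is that the feedback arrives at the person who can act on it, while the change is still fresh.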

I Ran Into the Same Wall

I built a CI pipeline to port an entire ROS 2 Desktop stack onto two Linux distributions that ROS doesn't officially support: openEuler (CentOS-based) and openKylin (Ubuntu-based), both on RISC-V. Two different base systems, 973 packages, zero upstream CI support.

My system went through three phases, and they map directly to the dynamics above.

v1: The Brute-Force Probe. I blindly pulled all 973 packages into the pipeline and let them build. This triggered widespread breakages, but it wasn't a failure; it was a data-mining operation. I successfully built 597 packages (proving feasibility) and, more importantly, mapped 214 specific dependency gaps and 151 build failures. The pipeline wasn't meant to pass. It was meant to make every hidden stock of problems visible.

v2: The Verification Engine. Armed with v1's data, I built a system that verifies before building — probing the OS environment to identify dependency gaps before consuming expensive build resources. Build attempts dropped, success rate went up, because I stopped feeding garbage into the pipeline. (GitHub repo)
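The "verify before building" idea can be sketched in a few lines. This is a hypothetical stand-in, not the actual pipeline's logic: the tool names and the PATH-based probe merely illustrate checking the environment before a build slot is spent.

```python
# Minimal sketch of verify-before-build: probe the OS environment for each
# declared dependency first, and only consume build resources when the
# environment can actually satisfy them. Illustrative, not the real pipeline.

import shutil

def missing_deps(required_tools):
    """Return the declared tools the OS environment cannot provide."""
    return [tool for tool in required_tools if shutil.which(tool) is None]

def should_build(required_tools):
    gaps = missing_deps(required_tools)
    if gaps:
        # Route to the "dependency gap" stock instead of burning a build.
        return False, gaps
    return True, []

ok, gaps = should_build(["sh", "definitely-not-a-real-tool"])
# ok is False: the fake tool is flagged before any build is attempted
```

Same shape as the v2 engine: the gap is attributed to the environment up front, so a doomed build never enters the pipeline.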

v3: Incremental Stock Management. Instead of tackling everything at once, I identify small batches of problematic dependencies, isolate them into manageable "stocks," and resolve them one group at a time. Subtraction, not addition.

My System Is Addicted Too

Here's the part where I punch myself in the face.

My own CI has the same addiction pattern. I use virtual environments to bypass system dependency conflicts. I have masquerade rules that spoof package identities. If you look at the architecture diagram in my README, you'll spot multiple "intervention" nodes — each one is a band-aid.

I know this is not sustainable. These are temporary splints, not fixes.

But here's the difference: I know they're band-aids. Most teams don't; they think retries are "solutions." My virtualenv bypass is a splint I chose consciously, with full awareness of the technical debt I'm taking on. Being aware of the addiction and being consumed by it are two very different things.

The Real Bottleneck

I can identify the stock that's poisoning your pipeline. I can design the feedback loop that makes the right person feel the pain. What I can't do is force an organization to care.

And that's usually the real bottleneck — not the flaky tests, but the system's refusal to let anyone feel the consequences.


*If you're dealing with a similar "build-first-verify-never" problem, the v2 Verification Engine repo shows this systems thinking approach applied to a real project. I'm looking for exactly these kinds of challenges.*

Top comments (3)

JP

Spot on! Flaky tests are almost always a code smell for race conditions, unpredictable state, or tightly coupled dependencies. Static analysis tools can catch many of these patterns before they become flaky tests — think unhandled async calls, global state mutations, timing assumptions. Fix the root cause, not just the test.

Mahima From HeyDev

Love the “retry rules are painkillers” framing - it matches what I’ve seen on teams where CI slowly stops being a trusted signal. The practical fix that worked best for us was forcing a fast, deterministic failure: quarantine is time-boxed, and owners get paged if a test flakes twice in a day, so the pain is felt close to the change that introduced it.

One detail I’m curious about: do you have a heuristic for distinguishing infra jitter (timeouts, shared runners) vs genuine concurrency bugs, without turning the pipeline into a profiler?

Sebastian Lim • Edited

Good question. My heuristic is simple: correlate failures with code changes, not with time.

If the same test on the same code fails randomly across runs, it's infra jitter: the environment is the variable, not the code. If failures cluster around specific commits and disappear on revert, it's a code bug.

You don't need a profiler for this. You need a log that tracks (commit, test, result) and a script that flags: "this test only started failing after commit X." That's your lightweight sensor.
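That sensor can be sketched in a few lines. This is an illustrative sketch, not the project's actual script; the log format and verdict labels are my own: a test that both passes and fails on the same commit looks like infra jitter, while a test whose failures begin at one commit and persist looks like a bug introduced there.

```python
# Lightweight flake attributor over a (commit, test, result) log.
# Same commit passing AND failing -> the environment is the variable.
# Failures starting at one commit -> "this test only started failing
# after commit X." Format and labels are illustrative assumptions.

from collections import defaultdict

def classify_flakes(log, commit_order):
    """log: list of (commit, test, result), result in {"pass", "fail"}."""
    rank = {c: i for i, c in enumerate(commit_order)}
    by_test = defaultdict(lambda: defaultdict(set))
    for commit, test, result in log:
        by_test[test][commit].add(result)
    verdicts = {}
    for test, commits in by_test.items():
        if any(results == {"pass", "fail"} for results in commits.values()):
            verdicts[test] = "infra-jitter"   # same code, random outcome
            continue
        failing = [c for c, results in commits.items() if "fail" in results]
        if failing:
            first = min(failing, key=lambda c: rank[c])
            verdicts[test] = f"code-bug@{first}"
    return verdicts

log = [
    ("c1", "test_io", "pass"), ("c1", "test_io", "fail"),   # both on c1
    ("c1", "test_net", "pass"), ("c2", "test_net", "fail"),
    ("c3", "test_net", "fail"),                              # fails from c2 on
]
classify_flakes(log, ["c1", "c2", "c3"])
# {"test_io": "infra-jitter", "test_net": "code-bug@c2"}
```

A cron job appending to that log and a script like this is the whole sensor; no profiler required.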

The system dynamics framing: infra noise and code bugs are two separate inflow sources feeding the same stock (flaky tests). The cheapest intervention is a filter at the inflow — attribute before you diagnose.

In my ROS 2 porting project, v2 does exactly this at the dependency level: probe the environment state first, separate "the OS is missing something" from "the code is broken," then route them differently. Same principle, different layer.