DEV Community

GauntletCI

Posted on • Originally published at gauntletci.com

The Asymmetry of Change: Why Your Tests Are Looking the Wrong Way


A passing build is often treated as a certificate of correctness. In reality, it's a narrow contract.

It doesn't prove your code is right. It proves that the assertions you wrote in the past, against behaviors you anticipated back then, still hold true today.

When you open a pull request, your unit tests ask: "Does the system still behave the way it used to?"

The question you actually need to answer is different: "Is the new behavior I just introduced safe?"

Those aren't the same thing. And that gap is exactly where production incidents live.


The Wrong Question

Here's the problem: tests are a snapshot of past understanding.

Your code changed. Your tests didn't. And somehow the build is still green.

A guard clause disappears. No test explicitly covered it because the guard was the coverage. A condition gets narrowed. An exception handler gets swapped. A state transition loses a validation step.

The test suite sees none of this, because it was never asked to care about these things. It was asked about something else. Something that still works fine.


The Evidence

This isn't theoretical. Multiple independent studies have found the same pattern across multiple major programming languages.

Test Co-Evolution Studies:

A 2025 study analyzed 526 repositories across JavaScript, TypeScript, Java, Python, PHP, and C#. Finding: asynchronous evolution of tests and code is pervasive. [1] Earlier work on 975 Java projects reached the same conclusion: production code frequently changes without test updates. [2] This has been documented since at least 2010. [3]

Chromium CI Study:

Researchers analyzed 1.5 million test executions across 14,000 commits. Result: even with 99.2% precision, modern flakiness detection still caused 76.2% of real regression faults to be missed. [4] Not because tests were missing. Because the tests that existed were being silenced.

Real Example - Django 6.0:

A refactor in the querystring template tag introduced a loop that worked fine for standard dictionaries but silently broke QueryDict instances. Existing tests passed. The bug shipped. It was caught only by a targeted rendered-output test that nobody thought to run regularly. [5]

The Numbers:

In an analysis of 598 pull requests across 57 open-source .NET repositories, 71% of PRs submitted without test file modifications contained at least one behavioral risk indicator. [6] That's not an outlier. That's the norm.


The Time Machine Problem

Every diff is a time machine moving in one direction.

The assertions stay where they were written. The code underneath moves forward.

```csharp
// Before: implicit contract
if (user == null) return;
Process(user.Name);

// After: contract broken, tests don't notice
Process(user.Name);
```

The guard was always there. Because it was always there, nobody wrote a test for the null case. It was implicit in the structure. The contract was protected by accident.

Remove that guard, and the test suite stays green. It's not "broken." It just never knew the guard mattered.

This is the Implicit Contract problem. And it's everywhere.
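Here's what turning that implicit contract into an explicit one looks like. This is a minimal sketch with hypothetical names (`UserProcessor`, `Process`): the null guard stays in the code, and one small test pins it down so its removal can no longer be silent.

```csharp
using System;

// Hypothetical sketch: the implicit null guard from the diff above,
// made explicit by a test that pins the contract down.
public static class UserProcessor
{
    public static string Process(string userName)
    {
        // The guard that "was always there" -- now a documented contract.
        if (userName == null) return null;
        return userName.ToUpperInvariant();
    }
}

public static class Program
{
    public static void Main()
    {
        // The test nobody wrote: null must be a no-op, not a crash.
        if (UserProcessor.Process(null) != null)
            throw new Exception("null guard contract violated");

        Console.WriteLine(UserProcessor.Process("ada")); // prints "ADA"
    }
}
```

With that single assertion in place, deleting the guard turns the build red instead of shipping a NullReferenceException.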


Why Code Review Isn't Enough

We rely on code review to catch these slips.

But human reviewers have a context window too. On a Tuesday afternoon, looking at a 400-line diff, they might see a refactor and miss that a crucial exception handler got swapped or a validation step disappeared.

We are asking humans to perform high-stakes pattern matching against a moving target. It's a process designed for fatigue.

Plus: reviewers didn't write the original code. They don't carry the full behavioral contract in their head. The removed guard clause looks like cleanup. The narrowed condition looks like a legitimate business rule change.

Code review is essential. But it's not a safety net. It's a second pair of eyes that also gets tired.


The Deterministic Answer

Here's what actually works: catch these patterns before anyone else sees the code.

Not with an LLM that sometimes forgets what you told it thirty messages ago. Not with probabilities. With deterministic rules that fire the same way every single time.

A Roslyn-powered engine that scans your diff and flags:

  • Removed guard clauses or defensive conditions
  • Narrowed catch blocks (catch (Exception) → catch (ArgumentException))
  • Validation steps removed from state transitions
  • Thread-blocking patterns introduced in async code (e.g., Thread.Sleep() instead of await Task.Delay())
  • Behavioral changes that touch no test files

Each of these is a pattern that has caused real production incidents. Each can slip past a green test suite.
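To make the narrowed-catch pattern concrete, here's a hedged sketch with hypothetical names (`OrderService`, `TrySubmit`): the happy-path test still passes after the catch is narrowed, while a failure mode nobody tests now escapes to the caller.

```csharp
using System;

// Hypothetical sketch: how a narrowed catch block changes behavior
// in a way a happy-path test never exercises.
public static class OrderService
{
    public static bool TrySubmit(Action submit)
    {
        try
        {
            submit();
            return true;
        }
        catch (ArgumentException) // was: catch (Exception)
        {
            return false;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Happy path: still works, so the suite stays green.
        Console.WriteLine(OrderService.TrySubmit(() => { })); // True

        // The path no test covers: a backend failure now crashes
        // the caller instead of returning false.
        try
        {
            OrderService.TrySubmit(() => throw new InvalidOperationException("backend down"));
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("escaped"); // behavior change, invisible to existing tests
        }
    }
}
```

The diff is one token. The behavioral delta is an entire class of exceptions that used to be swallowed and now takes down the caller.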

The output is a checklist, not a verdict. You still decide what's actually a risk and what isn't. But you decide with full information, at the moment of change, when the logic is still fresh in your head.
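The thread-blocking flag deserves a quick illustration too. This is a sketch, not GauntletCI's actual rule logic: both methods below compile, both pass a test that only checks completion, but one silently ties up a thread-pool thread for the entire delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    // The anti-pattern: looks async, but blocks the calling thread.
    public static async Task BlockingDelayAsync()
    {
        Thread.Sleep(100); // a thread does nothing useful for 100 ms
        await Task.CompletedTask;
    }

    // The idiomatic version: the thread is released while the delay runs.
    public static async Task YieldingDelayAsync()
    {
        await Task.Delay(100);
    }

    public static async Task Main()
    {
        await BlockingDelayAsync();
        await YieldingDelayAsync();
        Console.WriteLine("done"); // both complete; only one starved the pool
    }
}
```

Under load, the blocking variant is the difference between a responsive service and thread-pool starvation, and no completion-based test will ever tell them apart.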


Moving the "Uh-Oh" Moment

The most expensive place to have an "uh-oh" moment is in a post-mortem.

The second most expensive is a failed staging build.

The goal is to move that realization to your local terminal. The millisecond you hit save. Before you even think about committing.

When you catch unvalidated behavioral changes while the code is still in front of you, you don't just keep the build green. You ensure the build is actually correct.

You stop the time machine before it leaves the station.


What's Next

If this problem feels familiar, you've already felt the cost of it.

The question isn't whether these gaps exist. The evidence is clear: they're everywhere. The question is whether you want to keep finding them in production, or find them at the diff.



References

[1] Miranda, J. et al. (2025). Test Co-Evolution in Software Projects: A Large-Scale Empirical Study. Journal of Software: Evolution and Process. DOI: 10.1002/smr.70035

[2] Sun, W. et al. (2021). Understanding and Facilitating the Co-Evolution of Production and Test Code. IEEE International Conference on Software Engineering (ICSE).

[3] Gergely, T. et al. (2010). Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining. Empirical Software Engineering. DOI: 10.1007/s10664-010-9143-7

[4] Haben, G., Habchi, S., Papadakis, M., Cordy, M., & Le Traon, Y. (2023). The Importance of Discerning Flaky from Fault-triggering Test Failures: A Case Study on the Chromium CI. arXiv:2302.10594.

[5] Moreau, M. (2026). How a Single Test Revealed a Bug in Django 6.0. Lincoln Loop.

[6] Cogen, E. (2025). GauntletCI Corpus Analysis. 598 pull requests across 57 open-source .NET repositories.


Eric I. Cogen builds software for production. Twenty years in .NET, twenty years of shipping bugs that tests never caught. GauntletCI is the pre-commit gate he wishes he'd had all along.
