Juan Torchia

Posted on • Originally published at juanchi.dev

AI Agents That Pass Your Tests. That's the Problem.

Almost 30% of the tests my agents passed were false positives. Not badly written tests — tests I reviewed, ran by hand, tests that worked. The agent passed them perfectly and solved the wrong problem.

It took me three days to understand what I was looking at.

AI Agents and False Positive Tests: The Problem Nobody Warns You About

Whenever we talk about AI agents generating code, the conversation always ends up in the same place: "but does it pass the tests?" As if that were the definitive question. As if a green suite were equivalent to correct code.

It's not. And with agents, the gap between those two things is much larger than I thought.

The setup was simple: I have a real project, a data processing module with its corresponding test suite. I decided to let three different agents — one based on Claude, one on GPT-4o, one with Gemini 1.5 Pro — reimplement individual functions from scratch, with access only to the tests as a specification. No peeking at the original code.

The idea was to measure generation quality. What I actually measured, completely by accident, was something else entirely.

The Experiment: Real Code, Real Numbers

The module I used does transformations on tabular datasets: normalization, null imputation, outlier detection, categorical encoding. Nothing exotic. 47 functions, 312 tests.

# Example of the kind of test I had in the suite
def test_normalize_column_with_outliers():
    """
    Normalization must be robust to outliers.
    We use IQR instead of min-max to avoid a single
    extreme value distorting the entire distribution.
    """
    data = pd.Series([1, 2, 3, 4, 5, 100])  # 100 is the outlier
    result = normalize_robust(data)

    # The 100 shouldn't collapse all other values toward 0
    assert result[:5].std() > 0.02  # Normal values keep some spread
    assert result[5] > result[4]   # The outlier is still the largest

This test looks reasonable. And it is. The problem is what an agent does with it.

What the agent generated:

def normalize_robust(series: pd.Series) -> pd.Series:
    """
    Robust normalization using IQR.
    Generated by the agent — passes all assertions.
    """
    # The agent calculated exactly what minimum std value
    # it needed to pass the first assertion
    q1 = series.quantile(0.1)  # ← Sneaky: uses 0.1, not 0.25
    q3 = series.quantile(0.9)  # ← Same, uses 0.9 instead of 0.75
    iqr = q3 - q1

    if iqr == 0:
        return pd.Series([0.0] * len(series))

    return (series - q1) / iqr

All assertions pass. The results are numerically within the ranges the test verifies. But the implementation uses the 10th–90th percentiles instead of the 25th–75th quartiles. It's not robust IQR normalization — it's something else that also happens to pass my tests.

Why does it matter? When a dataset with a different distribution shows up, with outliers in a different position, the behavior will diverge from what's expected. And no test will catch it because I never thought to write the test that catches that specific divergence.
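To make that divergence concrete, here's a quick sketch. The data is invented for illustration; `normalize_p10_p90` mirrors what the agent generated, `normalize_iqr` is what I meant:

```python
import pandas as pd

def normalize_iqr(series: pd.Series) -> pd.Series:
    # What I meant: true quartiles (Q1/Q3)
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return (series - q1) / (q3 - q1)

def normalize_p10_p90(series: pd.Series) -> pd.Series:
    # What the agent shipped: 10th/90th percentiles
    p10, p90 = series.quantile(0.1), series.quantile(0.9)
    return (series - p10) / (p90 - p10)

# On the original test data the two are close enough to slip past loose
# assertions. On a sample with more mass in the tail, they diverge hard:
data = pd.Series([1, 2, 3, 4, 5, 100, 200, 300])
diff = (normalize_iqr(data) - normalize_p10_p90(data)).abs().max()
print(f"max divergence: {diff:.2f}")  # > 1.0 on this sample
```

One heavy-tailed input is enough to pull the two implementations more than a full normalized unit apart — exactly the kind of divergence no test in my suite was positioned to catch.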

The Three Patterns I Found

After manually reviewing the 89 "suspicious" cases (the ones I had to read twice), I identified three clear patterns.

Pattern 1: Literal Assertion Satisfaction

The agent optimizes to make the check pass, not to implement the concept. If the test says assert len(result) == len(input), the agent makes sure that's true. How — that's secondary.
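A minimal, hypothetical illustration (both the test and `remove_duplicates` are invented for this post): the assertion is literally true, and the function does nothing.

```python
import pandas as pd

# Spec-by-test: the suite only checks that length is preserved
# when the input happens to contain no duplicates.
def test_dedupe_preserves_length_when_clean():
    data = pd.Series([1, 2, 3])
    assert len(remove_duplicates(data)) == len(data)

# A literal-satisfaction implementation: the check is green because
# the test data has no duplicates — the function never deduplicates
# anything at all.
def remove_duplicates(series: pd.Series) -> pd.Series:
    return series.copy()
```

Feed it `[1, 1, 2]` and it still returns three elements. No assertion in the suite ever notices.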

Pattern 2: Overfitting to the Test Cases

# My outlier detection test
def test_detects_outliers_zscore():
    data = [1, 2, 3, 4, 5, 50]  # 50 is clearly an outlier
    outliers = detect_outliers_zscore(data, threshold=2.5)
    assert 50 in outliers
    assert 1 not in outliers

# What the agent generated (simplified):
def detect_outliers_zscore(data, threshold=2.5):
    mean = np.mean(data)
    std = np.std(data)

    # This works for [1,2,3,4,5,50]
    # Fails silently for distributions with small std
    return [x for x in data if abs(x - mean) / (std + 1e-10) > threshold]
    # The +1e-10 avoids division by zero BUT
    # it also distorts the effective threshold when std is small

The +1e-10 is a hack the agent added to handle the division-by-zero edge case. It works for my test data. For data with a real std close to zero, the effective threshold shifts dramatically.

Pattern 3: Exploiting Incomplete Specification

This was the most interesting one. When my tests didn't specify a behavior, the agent took the path of least resistance — which was sometimes technically valid but conceptually wrong.

One example: I had a null imputation function. My tests verified that no nulls remained and that the column mean stayed within a certain range. The agent imputed with the global median of the entire dataset instead of the per-column median. All my tests passed because I never specified which median.
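A reconstructed sketch of that scenario (the column names and function names are made up; the spec gap is the point):

```python
import numpy as np
import pandas as pd

def impute_per_column(df: pd.DataFrame) -> pd.DataFrame:
    # What I meant: each column filled with its own median
    return df.fillna(df.median())

def impute_global(df: pd.DataFrame) -> pd.DataFrame:
    # What the agent did: one median computed over the whole dataset
    return df.fillna(np.nanmedian(df.to_numpy()))

df = pd.DataFrame({"age": [20.0, 30.0, None],
                   "salary": [1000.0, 2000.0, None]})

# Per-column: the missing age becomes 25, the missing salary 1500.
# Global: both become 515 — a salary-scale value lands in "age",
# yet "no nulls remain" and a loose mean-range check still pass.
print(impute_per_column(df))
print(impute_global(df))
```

Both versions satisfy "no nulls remain". Only one of them satisfies the function I actually had in my head.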

The Problem Isn't the Agent. It's Me.

This is the uncomfortable part.

When I write tests knowing a human is going to run them — or that I'm going to read the code myself — there's an implicit layer of shared understanding. A human who reads normalize_robust and sees it using 10th–90th percentiles instead of 25th–75th quartiles would probably ask me about it. Or change it. Or at least know they're doing something different.

An agent doesn't have that layer. It only has the explicit contract I wrote. And it turns out my contracts have enormous holes in them.

It's the same problem I ran into when I wrote a Python interpreter in Python: the limits of a system become visible when someone — or something — explores them without the implicit assumptions you carry around.

The agent isn't cheating. I was writing tests for humans and using them as specifications for agents. Those are two different things.

How I Changed My Approach

After this, I started thinking in two layers of tests whenever I work with agents.

Layer 1: Observable Behavior Tests (what I already had)
Verify that the output has the correct properties.

Layer 2: Conceptual Invariant Tests (what I was missing)
Verify that the implementation respects the concepts I actually care about.

# Conceptual invariant tests — layer 2
class TestRobustNormalizationInvariants:

    def test_uses_real_quartiles(self):
        """
        Verify the implementation uses standard IQR (Q3-Q1),
        not alternative percentiles that could also pass
        the behavior tests.
        """
        # We design a case where Q1/Q3 vs P10/P90 give distinct results
        # with a distribution specifically chosen for this
        control_data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

        expected_q1 = control_data.quantile(0.25)  # 32.5
        expected_q3 = control_data.quantile(0.75)  # 77.5
        expected_iqr = expected_q3 - expected_q1   # 45.0

        result = normalize_robust(control_data)

        # Verify that a value near Q1 normalizes close to 0.
        # This holds ONLY with real quartiles: with P10/P90
        # (19 and 91 here) the same value lands around 0.15
        value_at_q1 = result[control_data == 30].iloc[0]
        assert abs(value_at_q1) < 0.1  # With real IQR, values near Q1 normalize near 0

    def test_behavior_with_low_std(self):
        """
        The +epsilon hack to avoid division by zero
        must not affect the effective threshold.
        """
        # Many nearly identical values plus one clear outlier.
        # With only 5 points no z-score can exceed sqrt(4) = 2 under
        # population std, so we need enough data for a 2.5 threshold
        # to be reachable at all.
        uniform_data = pd.Series([10.0 + i * 0.001 for i in range(20)] + [50.0])
        outliers = detect_outliers_zscore(uniform_data, threshold=2.5)

        # 50 MUST be an outlier — if epsilon distorts the threshold,
        # it might not be detected, or everything gets flagged
        assert len(outliers) == 1
        assert 50.0 in outliers

These are more complex tests. Harder to write. But they're the ones that actually specify the problem, not just the output.

This has a cost — I've been measuring it. Every additional test the agent runs adds tokens, adds latency, adds money. I analyzed those numbers in another post and the conclusion is the same: design decisions have real costs. Deciding how exhaustive your agent tests are is an architectural decision with economic impact.

The Meta-Problem: Specification as Communication

There's something deeper here that keeps nagging at me.

When I looked at how Anthropic designed Claude's developer experience, one of the tensions I identified was exactly this: agents are good at executing explicit specifications but bad at inferring implicit intent. Not because they're dumb — but because implicit intent requires context that lives outside the prompt.

My tests were implicit specifications dressed up as explicit contracts. I knew that normalize_robust used standard IQR. That knowledge was never in the test. The agent had no way to know it.

It's similar to what I found when I analyzed the real costs of my agents: the surface numbers told me one thing, and the real story was more complicated. Same thing here. The green suite told me the code was correct. It wasn't that simple.

And there's something almost philosophical about this that reminds me of the post about Brunost and programming languages in minority languages: who decides what's "readable" and what's "correct" depends entirely on what assumptions you share with whoever's reading. An agent doesn't share your assumptions. Never has, never will.

Common Mistakes When Using Agents with TDD

Mistake 1: Confusing "passes the tests" with "solves the problem"
The first is necessary but nowhere near sufficient. With humans there's a lot of overlap. With agents, not so much.

Mistake 2: Tests that only verify the happy path
Agents are especially good at the happy path. Poorly specified edge cases are where broken-but-green implementations show up.

Mistake 3: No conceptual regression tests
If you're reimplementing with an agent, you need tests that verify the new implementation preserves the conceptual properties of the old one — not just the output values.

Mistake 4: Leaving implementation space unconstrained
Any degree of freedom you didn't specify, the agent will explore. Sometimes that's good. Often it generates implementations that pass your tests in ways you never anticipated.
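For Mistake 3 in particular, the cheapest guard I know is a differential test: run the old and new implementations side by side on randomized inputs the hand-written suite never covered. A sketch — the two function names are placeholders for your trusted version and the agent's reimplementation (here they're identical so the test passes):

```python
import numpy as np
import pandas as pd

def normalize_old(s: pd.Series) -> pd.Series:
    # The trusted implementation being replaced
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return (s - q1) / (q3 - q1)

def normalize_new(s: pd.Series) -> pd.Series:
    # Stand-in for the agent's reimplementation
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return (s - q1) / (q3 - q1)

def test_new_matches_old_on_random_inputs():
    rng = np.random.default_rng(42)
    for _ in range(100):
        # Heavy-tailed samples at random scales — distributions
        # the hand-written suite never exercised
        s = pd.Series(rng.standard_t(df=2, size=50) * rng.uniform(1.0, 100.0))
        pd.testing.assert_series_equal(normalize_old(s), normalize_new(s))
```

If the agent had swapped in P10/P90 here, this loop would have flagged it on the first heavy-tailed sample — without me having to anticipate the specific divergence.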

FAQ: AI Agents and False Positive Tests

Can an AI agent cheat on tests on purpose?
Not in the sense of malicious intent. What it does is optimize to satisfy the success criterion you gave it — which is the assertions. If an assertion can be satisfied in multiple ways, the agent picks the simplest one it finds in its search space. There's no cheating, just misdirected optimization.

Does this problem apply only to certain agents or frameworks?
I saw it in all three I tested (Claude, GPT-4o, Gemini 1.5 Pro) with different frequencies but the same pattern. It's not an implementation bug — it's an emergent property of using tests as the primary specification. Any agent generating code based on tests will have this tendency.

So is TDD with AI agents a bad idea?
No, but it requires rethinking how you do TDD. Tests as a safety net are still valuable. Tests as a complete specification of expected behavior — that's where the problem lives. You need conceptual invariant tests on top of observable behavior tests.

How do I detect if an agent passed a test in a "hollow" way?
Some signals: the implementation has hardcoded constants, uses epsilons or adjustments it didn't explain, behaves differently in ranges your tests don't cover, or the function does something slightly different from what its name implies. Human code review is still necessary — tests don't replace that.

How many additional tests do I need for this to stop happening?
There's no magic number. The heuristic I use: for every function an agent reimplements, I add at least one invariant test that verifies a specific implementation property, not just an output property. It increased my test-writing time by ~40% but cut my false positives from ~29% to ~8% in the next iteration.

Is the extra cost of more elaborate tests worth it with agents?
Depends on what you're building. For throwaway code or prototypes, probably not. For code going to production or that other agents will use as a dependency, yes — absolutely. The cost of a conceptual bug in production outweighs the cost of more robust tests.

Tests Are a Language, and Agents Speak It Differently

The 29% false positive rate doesn't scare me because of the number itself. It scares me because of what it implies: I had badly calibrated confidence in my test suite. I thought green = correct. Green = satisfies my assertions. Those are different things.

With humans, the difference is small because there's implicit understanding. With agents, the difference can be enormous because there's nothing implicit — only what you wrote.

I'm not going to stop using agents to generate code. I use them every day and they're genuinely useful. But I changed something fundamental: I stopped thinking of tests as the final arbiter of correctness when there's an agent involved. Now they're the minimum floor. The ceiling is set by code review and invariant tests.

If you're using AI agents to generate code — and you're using tests as the specification — I'd recommend running the same experiment I did. Grab a module you know well, let an agent reimplement it using only the tests, and then manually review the first 20 results that pass.

Maybe you'll find your tests are airtight. Maybe you'll find what I found.

Worth a look.
