i started having claude write tests for my project and quickly realized something: most of them were useless. they passed, but they didn't test anything, and watching 12/12 tests go green over and over gave me a serious false sense of security.
a test that asserts expect(result).toBeDefined() after calling a function is technically a passing test. it will never fail unless the function throws.
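to make that concrete, here's a toy sketch. getUserName is a hypothetical function with a deliberate bug, and plain boolean checks stand in for jest matchers:

```typescript
// hypothetical function with a bug: returns the wrong field
function getUserName(user: { name: string; email: string }): string {
  return user.email; // bug: should return user.name
}

const result = getUserName({ name: "ada", email: "ada@example.com" });

// weak assertion (the toBeDefined pattern): passes despite the bug
console.log(result !== undefined); // true, test "passes"

// strong assertion: pins the expected value, so the bug is exposed
console.log(result === "ada"); // false, test fails
```

both checks "exercise" the same code path; only the second one can ever tell you the function is wrong.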
this was like 80% of what i was getting: tests that exercised code paths without actually checking that the code did the right thing. so i had great code coverage, but stuff was still breaking constantly.
so i started thinking about what makes a test actually worth having, and i ended up with a set of gates that changed how i think about it.
the mutation test is the most important one. take a passing test, go flip a condition or change a return value in the source code, and run the test again. if it still passes, then the test is garbage because it's not actually sensitive to the behavior it claims to test. this is the thing that catches toBeDefined and toBeTruthy and all the other assertions that look like tests but aren't.
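here's the mutation test in miniature. applyDiscount is a made-up example, and the "mutant" is a copy with the condition flipped by hand, the way the mutation step would flip it in the source:

```typescript
// original: members get 10% off
function applyDiscount(price: number, isMember: boolean): number {
  return isMember ? price * 0.9 : price;
}

// mutant: the condition flipped, as the mutation step would do
function applyDiscountMutant(price: number, isMember: boolean): number {
  return isMember ? price : price * 0.9;
}

// weak test: "returns a number" passes on the original AND the mutant,
// so flipping the condition doesn't fail it. garbage.
const weakSurvives =
  typeof applyDiscount(100, true) === "number" &&
  typeof applyDiscountMutant(100, true) === "number";
console.log(weakSurvives); // true: the weak test can't tell them apart

// strong test: pins the actual value, so the mutant kills it
console.log(applyDiscount(100, true) === 90); // true on the original
console.log(applyDiscountMutant(100, true) === 90); // false: mutant caught
```

if the test passes on both versions, it isn't testing the behavior you changed.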
the other gates i landed on:
no weak assertions. toBeDefined, toBeTruthy, toBeInstanceOf: these get rejected automatically. if your assertion would pass on literally any non-null value, it's not an assertion.
isolation. the test has to pass on its own, not just when the full suite runs. tests that accidentally depend on state from other tests are a time bomb.
coverage delta. the test has to actually cover new lines. a test that just re-covers stuff that's already tested is noise.
full suite compatibility. the new test can't break anything else when the whole suite runs.
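the weak-assertion gate is the easiest one to automate. a minimal sketch, just a substring scan over the test source (a real version would want an AST, and the matcher list here is only the three named above):

```typescript
// hypothetical gate: flag test source that uses weak jest matchers
const WEAK_MATCHERS = ["toBeDefined", "toBeTruthy", "toBeInstanceOf"];

function findWeakAssertions(testSource: string): string[] {
  // naive substring check; a robust gate would parse the code instead
  return WEAK_MATCHERS.filter((m) => testSource.includes(`.${m}(`));
}

const src = 'expect(result).toBeDefined();\nexpect(items.length).toBe(3);';
console.log(findWeakAssertions(src)); // ["toBeDefined"]
```

a hit from this gate means "regenerate with a concrete expected value," not "tweak the matcher."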
i eventually built a skill for claude code that automates this whole loop. it finds untested files, generates tests, runs them through all five gates, retries up to 3 times with feedback if they fail, and commits the ones that pass. it runs in a background tmux session so i can work on other stuff while it churns through the backlog.
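the shape of that loop is simple. this is a sketch with stand-ins, not the skill's actual api: generateTest and runGates here are stubs for the model call and the five gates, wired so the stub fails on a weak assertion and passes once the feedback is applied:

```typescript
type GateResult = { passed: boolean; feedback: string };

// stand-in for the five gates; the real thing runs mutation,
// weak-assertion, isolation, coverage-delta, and full-suite checks
function runGates(test: string): GateResult {
  return test.includes("toBe(")
    ? { passed: true, feedback: "" }
    : { passed: false, feedback: "weak assertion: pin a concrete value" };
}

// stand-in for the model call; retries get the gate feedback
function generateTest(file: string, feedback: string): string {
  return feedback === "" ? "expect(x).toBeDefined()" : "expect(x).toBe(42)";
}

// generate -> gate -> retry with feedback, up to maxAttempts
function generateWithRetries(file: string, maxAttempts = 3): string | null {
  let feedback = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const test = generateTest(file, feedback);
    const result = runGates(test);
    if (result.passed) return test; // this is what gets committed
    feedback = result.feedback; // fed into the next attempt
  }
  return null; // give up on this file after maxAttempts
}

console.log(generateWithRetries("src/utils.ts")); // "expect(x).toBe(42)"
```

the important part is that gate failures come back as feedback, so the retry isn't a blind re-roll.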
the point is that "AI can write tests" is true in the same way that "AI can write code" is true. the output looks correct and compiles and runs. the question is whether it actually works. and for tests specifically, "works" means "fails when the code is wrong." most AI-generated tests don't clear that bar.
if you're using AI to write tests, add a mutation step. flip something in the source, run the test. if it still passes, delete it. that one check filters out more garbage than everything else combined.