Harry Floyd

Posted on • Originally published at harryfloyd.substack.com

Never Let Claude Code Tell You It's Done

What you will do: add one test that catches the agent's mistakes, then wire it so Claude Code runs it on its own and cannot end a turn while it is failing. About twenty minutes.

Who this is for: you use Claude Code on real code, and it has told you "fixed it, tests pass, done" when it was not.

Who should skip it: if your agent already cannot end on a failing test, you are past this.

You need: Claude Code (version 2.1.143 or later), Python 3 for the worked example (it uses the built-in test runner, nothing to install), and a project of your own.

1. Write a test that says what you want

A test is a small program that runs your code and checks it does the right thing. The important word is yours. The test has to encode what you want the code to do, because if the agent writes both the code and the test, it can quietly make the two agree. The check has to come from outside the agent, or it is not a check.

Here is the smallest possible example: a function, and a test for it. The agent was asked to fix add, and reported back that it was done.

calc.py

def add(a, b):
    return a - b  # wrong

test_calc.py

from calc import add
import unittest

class TestAdd(unittest.TestCase):
    def test_adds(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()

Save both files in the same empty folder, open your terminal there, and run python3 test_calc.py. The agent was certain. The test does not care how certain it was. add(2, 3) came back -1, not 5, and now you know, in one second, that "done" was not true.
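
One way to make the test say more of what you want is to list several cases you picked yourself. This is a minimal sketch, assuming the intended behavior is plain addition. The extra cases, the local add stand-in, and the run_cases helper are illustrative and not part of the example project:

```python
import unittest


def add(a, b):
    # Stand-in for calc.add as it should behave once fixed (assumption).
    return a + b


class TestAddCases(unittest.TestCase):
    def test_known_sums(self):
        # Each tuple is (a, b, expected). You choose the values, not the agent.
        cases = [(2, 3, 5), (0, 0, 0), (-1, 1, 0), (10, -4, 6)]
        for a, b, expected in cases:
            # subTest reports every failing case instead of stopping at the first.
            with self.subTest(a=a, b=b):
                self.assertEqual(add(a, b), expected)


def run_cases():
    # Load and run the TestCase in-process; returns a unittest.TestResult.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAddCases)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project you would drop the stand-in and write from calc import add, so the cases run against the code the agent actually touched.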

2. Put it behind a gate the agent cannot talk past

Save this as check.sh in your project:

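#!/bin/bash
# Run from the script's own directory, wherever the hook is invoked from.
cd "$(dirname "$0")"
# Write bytecode caches to a throwaway directory so each run starts clean.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
export PYTHONPYCACHEPREFIX="$tmpdir"
# Send the test output to stderr: on exit 2, Claude Code passes stderr, not stdout, back to the agent.
python3 -m unittest 1>&2
status=$?
[ "$status" -eq 0 ] && exit 0
exit 2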

That 2 is the load-bearing part. An ordinary failure exits 1. We exit 2 on purpose, because of what Claude Code does with it next.
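
If you would rather not maintain a shell script, the same mapping can be sketched in Python. This is an assumption-laden port, not a replacement the article prescribes: the names gate_exit_code and main are made up for illustration, and it forwards the test output to stderr on the assumption that stderr is the stream Claude Code hands back to the agent.

```python
import subprocess
import sys

BLOCK = 2  # the exit code check.sh uses to signal "do not stop"


def gate_exit_code(test_returncode):
    # A green suite stays 0; every kind of failure becomes 2.
    return 0 if test_returncode == 0 else BLOCK


def main(project_dir="."):
    # Run unittest discovery in the project, as check.sh does.
    # -B stops Python from writing .pyc files into the project.
    proc = subprocess.run(
        [sys.executable, "-B", "-m", "unittest"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    # Forward everything the test run printed to stderr.
    sys.stderr.write(proc.stdout + proc.stderr)
    return gate_exit_code(proc.returncode)
```

To use it as a script, end the file with sys.exit(main()) so the process exit code carries the result.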

3. Make the gate run itself

Claude Code has hooks: scripts it runs for you on certain events without being asked. The one we want is Stop, which runs the instant the agent tries to end its turn.

Wire check.sh to it. In your project, make the .claude folder if it is not there, then create or open .claude/settings.json and add:

{
  "hooks": {
    "Stop": [
      { "hooks": [{ "type": "command", "command": "bash \"${CLAUDE_PROJECT_DIR}/check.sh\"" }] }
    ]
  }
}

When a Stop hook exits 2, Claude Code blocks the stop: it refuses to let the agent finish, feeds your test failures back to it as the reason, and makes it keep working. The agent cannot tell you it is done while the suite is red, because the suite, not the agent, now decides when the turn is allowed to end.
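
A settings file with a stray quote or a misplaced key fails quietly, so it can be worth checking before you trust the gate. The sketch below assumes the nested layout in which Stop holds a list of groups, each with its own hooks list of command entries. The stop_commands helper and the SAMPLE text are hypothetical.

```python
import json

SAMPLE = """
{
  "hooks": {
    "Stop": [
      {"hooks": [{"type": "command", "command": "bash check.sh"}]}
    ]
  }
}
"""


def stop_commands(settings_text):
    # json.loads raises json.JSONDecodeError if the text is not valid JSON.
    settings = json.loads(settings_text)
    stop = settings.get("hooks", {}).get("Stop", [])
    if not isinstance(stop, list):
        return []  # Stop is present but not in the assumed list layout
    commands = []
    for group in stop:
        if not isinstance(group, dict):
            continue
        for hook in group.get("hooks", []):
            if isinstance(hook, dict) and hook.get("type") == "command":
                commands.append(hook.get("command"))
    return commands
```

Pass it the contents of .claude/settings.json. An empty list means no Stop command was found where this sketch expects one.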

4. When it still goes wrong

  • A gate that always blocks would loop. Claude Code ends the turn on its own after 8 consecutive blocks (change the limit with the CLAUDE_CODE_STOP_HOOK_BLOCK_CAP environment variable).
  • A big suite makes every stop slow. Point check.sh at a fast subset (the tests near what you changed) and leave the full run to CI.
  • Green is only as good as the test. The gate proves the tests pass, not that the tests are enough.
  • It only guards what the tests touch. Untested code, prose claims, "I checked the docs": the gate sees none of that.

5. Prove the gate fires

Do not take my word, or the agent's, that the gate works. Break something on purpose: change a line so a test fails, then ask Claude Code to wrap up. Watch it get pulled back and handed the failure instead of stopping. Once you have seen it block, a clean finish from the agent finally means something.
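
You can rehearse the red and green cases without Claude Code in the loop. The sketch below builds a throwaway copy of the calc example in a temporary folder, runs the suite, and returns the exit status. The suite_returncode helper is hypothetical, and it only shows what the test runner reports, not what the hook does with it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

TEST_SOURCE = """\
import unittest
from calc import add

class TestAdd(unittest.TestCase):
    def test_adds(self):
        self.assertEqual(add(2, 3), 5)
"""


def suite_returncode(add_expression):
    # Write calc.py with the given return expression, then run the tests against it.
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "calc.py").write_text(
            f"def add(a, b):\n    return {add_expression}\n"
        )
        Path(folder, "test_calc.py").write_text(TEST_SOURCE)
        # -B avoids writing .pyc files; cwd makes discovery start in the temp folder.
        proc = subprocess.run(
            [sys.executable, "-B", "-m", "unittest"],
            cwd=folder,
            capture_output=True,
            text=True,
        )
        return proc.returncode
```

Calling suite_returncode("a - b") should give a non-zero status and suite_returncode("a + b") should give 0. A gate that maps the first to exit 2 is the one that pulls the agent back.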

That is the floor, and it is a real one: the agent can no longer decide for itself that the job is done. Something it cannot argue with does. The next step is widening the gate from "the tests pass" to "the tests are worth passing."
