141 tests passed. One of them had been mutating production state the entire time.
The test was test_warmup_eval_no_flag_subprocess_exits_0. It invoked the real warmup eval subcommand, without , dry run, against the live data directory. There was a comment in the test file that said "in memory fallback" to explain why this was fine. That comment was wrong.
Here is what actually happened. The x engine is installed as an editable install. When the subprocess launches, config.STATE_PATH resolves to the real state.json. The previous test path happened to be the "no change detected" path, which persisted nothing, so the mutation never showed up. You could run the suite a thousand times and never notice.
Then I landed a fix that seeded the dwell anchor on first warmup evaluate (commit 1e1e53d). Suddenly the test was writing a real anchor into production state on every run. Not obviously. Not loudly. Just quietly advancing the live warmup sequence every time I ran pytest.
This is a classic subprocess test trap. You write a test that shells out to your own binary, you assume the binary will operate on some isolated scratch space, and you never verify that assumption. If the binary happens to not write anything, you never find out you were wrong. The test suite goes green, you ship, and the invariant you thought you had is fiction.
The fix is simple. I added an X_ENGINE_STATE_PATH environment variable that overrides config.STATE_PATH when set, with the default unchanged when the variable is absent. Then I pointed all three warmup eval subprocess tests at a tmp_path file via that env var. The subprocess now writes to a throwaway path that pytest cleans up automatically. No patching, no mocking, no monkeypatching os.environ inside the process. Just an env var the binary already knows how to read.
Verified: the full 141 test suite now leaves the live state.json byte identical after a run. Checksums before and after. No drift.
What would I do differently?
First, any test that invokes the real binary should set up isolated state before it runs. Not as an afterthought. As the first thing written. The question "where does this binary write?" should be answered before the test is considered valid, not after a correct unrelated fix surfaces the bug.
Second, the "in memory fallback" comment should have been a test assertion. If you believe a subprocess has no side effects on disk, assert it: snapshot the relevant file mtimes or checksums before the call, check them after, fail if anything changed. Comments rot. Assertions catch regressions on the next commit that actually matters.
The dwell anchor seeding fix was correct. It is also what exposed this. Sometimes a right change breaks something that should have been broken all along. That is worth writing down.
Top comments (0)