7 out of 7 tests pass. That number is boring and exactly correct.
Adding a CLI subcommand sounds trivial until you count the actual failure modes: the subparser is not wired up, the func is not registered, the dry run flag is silently swallowed, the argument parser crashes with a useless error. Any of those can bite you. The only way to know they do not is to test at three distinct levels, which is what this commit actually does.
Here is what wiring warmup eval looks like in practice.
The parser registration
argparse subcommands get wired in __main__.py by calling add_parser on the subparsers object, then setting set_defaults(func=cmd_warmup_eval). That last line is the critical one. If you forget it, args.func does not exist and every invocation blows up with an AttributeError that tells you nothing useful. The unit test covering this is two lines but it is the most important one in the file.
The dry run flag
, dry run is just add_argument(', dry run', action='store_true'). Fifteen seconds of work at the parser layer. The real discipline question is whether the flag is actually honored inside cmd_warmup_eval or silently accepted and then ignored. Wiring the flag is not the same as plumbing it through.
The test triangle
Three types, all necessary, none redundant.
Unit tests confirm the command function behaves correctly given controlled inputs, without touching the full argument parser or the shell.
Subparser tests confirm the parser tree is wired correctly: that python m x_engine warmup eval dispatches to cmd_warmup_eval and not to the default error handler. This is where set_defaults either proves itself or does not.
Subprocess acceptance tests confirm the whole thing actually runs when invoked from a shell, including the import chain, the argument parsing, and exit codes. This is the test that mirrors what a real user does. It is also the one that catches the failure mode where unit tests pass fine and then a missing import blows up at runtime because the module structure changed under you.
Skip any of the three and you have false confidence somewhere in the stack.
What I would do differently
The , dry run flag is wired at the parser and passes into cmd_warmup_eval. What the subprocess acceptance test checks is that the command exits cleanly with , dry run. It does not assert that , dry run actually changes behavior versus a live run. That means the flag could be a no op and the test still passes with a green check.
The fix is one extra assertion: capture stdout or a side effect that differs between dry and live modes, and verify the diff. Without it, the test proves the flag is accepted, not that it is respected. Those are different guarantees.
7/7 is the right score. Writing fewer tests would have been faster and worse.
Top comments (0)