testing the part of my chess app that downloads a 50MB binary

#python #testing #pytest #opensource

I added CI to my chess GUI last week and immediately hit a wall. Most of the app is easy to test. Board state, move legality, FEN parsing, all pure functions with obvious inputs and outputs. python-chess does the heavy lifting and you just assert on the results.

The part I actually wanted covered was the installer. And the installer is the worst possible thing to test.

Here's what it does. On first launch it figures out what your CPU supports, picks the matching Stockfish build, downloads it from the Stockfish releases page, and caches it. Stockfish ships around eight builds per release, including vnni512, avx512, bmi2, avx2, sse41-popcnt, ssse3, and a plain x86-64 fallback. The fast ones use instructions older chips don't have, so picking wrong means the app launches and instantly dies.

So the function I most needed to trust was one that reads hardware, talks to the network, downloads 50MB, and then runs a binary. You can't do any of that honestly in a GitHub runner. The runner's CPU isn't your user's CPU. Downloading 50MB on every push is slow and flaky. And you obviously can't test "does the avx512 build run" on a machine that may not have avx512.

My first attempt was to just run the real thing in CI and see what happened. It worked maybe four times out of five. The fifth time the Stockfish releases page rate-limited the download and the whole run went red. A test that fails because someone else's server had a bad day is worse than no test.

So I split it up.

The actual decision, which build matches these CPU features, is pure. It takes a set of feature strings and returns a build name. No network, no hardware reads, nothing. Once it's pure you can throw any fake CPU at it.

import pytest
from installer import pick_build

CASES = [
    ({"avx512_vnni", "avx512f", "avx2"}, "vnni512"),
    ({"avx512f", "avx2", "bmi2"},        "avx512"),
    ({"avx2", "bmi2"},                   "bmi2"),
    ({"avx2"},                           "avx2"),
    ({"sse4_1", "popcnt"},               "sse41-popcnt"),
    ({"ssse3"},                          "ssse3"),
    (set(),                              "x86-64"),
]

@pytest.mark.parametrize("features, expected", CASES)
def test_pick_build(features, expected):
    assert pick_build(features) == expected

That last case is the one that matters. An empty feature set has to return the generic build, not crash and not return None. If pick_build ever returns None, the download step gets None as the build name, and the error you see three steps later tells you nothing useful. So the "CPU from 2009" case is a real test now, not a hope.
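For anyone who wants the shape of the function without opening the repo, here's a minimal sketch of a pick_build that passes the cases above. It is not the repo's code. BUILD_REQUIREMENTS and GENERIC_BUILD are names I made up for the example, and the required feature sets are inferred from the test table, not taken from Stockfish's documentation.

```python
# Hypothetical sketch. Ordered fastest first; each entry is
# (build name, features that build is assumed to need).
BUILD_REQUIREMENTS = [
    ("vnni512", {"avx512_vnni"}),
    ("avx512", {"avx512f"}),
    ("bmi2", {"avx2", "bmi2"}),
    ("avx2", {"avx2"}),
    ("sse41-popcnt", {"sse4_1", "popcnt"}),
    ("ssse3", {"ssse3"}),
]
GENERIC_BUILD = "x86-64"


def pick_build(features):
    """Return the fastest build whose required features are all present."""
    available = set(features)
    for name, required in BUILD_REQUIREMENTS:
        if required <= available:  # subset check
            return name
    # Nothing matched, including the empty set: never return None.
    return GENERIC_BUILD
```

The explicit return at the bottom is the whole point. There is no code path that falls off the end of the function.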

The hardware reading is separate. On Windows it parses wmic output, on Linux it reads /proc/cpuinfo, on macOS it shells out to sysctl. I pulled the parsing apart from the reading, so the parser takes a raw string and returns the feature set. Now I can paste real /proc/cpuinfo dumps from three different machines into the test files and assert the parser pulls the right flags out. The actual reading of the file stays untested, which is fine because it's one line.
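Here's roughly what the Linux half of that split can look like. This is a sketch under assumptions: parse_cpuinfo_flags is a hypothetical name, and SAMPLE is a trimmed, made-up stand-in for a real dump, just enough to show the "flags" line the parser looks for.

```python
def parse_cpuinfo_flags(text):
    """Return the CPU flags from /proc/cpuinfo-style text as a set.

    Takes the already-read string, so tests never touch the filesystem.
    """
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Every core repeats the same line; the first one is enough.
            return set(value.split())
    return set()


# Illustrative input only, not a dump from a real machine.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "flags\t\t: fpu sse4_1 popcnt avx2 bmi2\n"
)
```

A missing or empty flags line gives back an empty set, which lines up with the generic-build case in the pick_build tests.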

The download is mocked. I don't care that requests works, I care that my code writes the file to the cache path, names it right, and skips the download when the cache already has it. So the test patches the network call and checks the side effects instead of the bytes.
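A minimal sketch of that kind of test, using unittest.mock from the standard library. Everything named here is an assumption for the example: download_bytes, ensure_cached, the "stockfish-<build>" file name, and the placeholder URL are not from the repo. In a real project you'd normally patch by import path. Here the module-level name is swapped in place so the snippet runs on its own.

```python
from pathlib import Path
from unittest import mock
import tempfile


def download_bytes(url):
    """Stand-in for the real network call (hypothetical name)."""
    raise RuntimeError("tests must never reach the network")


def ensure_cached(build, cache_dir):
    """Return the cached path for `build`, downloading only if it's missing."""
    target = Path(cache_dir) / f"stockfish-{build}"
    if not target.exists():
        # Placeholder URL; the real one comes from the releases page.
        target.write_bytes(download_bytes(f"https://example.invalid/{build}"))
    return target


def check_download_and_cache(cache_dir):
    fake = mock.Mock(return_value=b"fake engine bytes")
    # Replace the network call for the duration of the with-block.
    with mock.patch.dict(globals(), {"download_bytes": fake}):
        first = ensure_cached("avx2", cache_dir)
        second = ensure_cached("avx2", cache_dir)  # should hit the cache
    assert first == second == Path(cache_dir) / "stockfish-avx2"
    assert first.read_bytes() == b"fake engine bytes"
    fake.assert_called_once()  # second call skipped the download
    return fake.call_count
```

The assertions are all about side effects: where the file landed, what it's called, and how many times the network was touched.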

The hardest one was the fallback. The whole point of the installer is that if the picked build crashes in the first half second, it downgrades and tries the next one down the list. To test that without an actual crashing binary, I made the launch function take the thing it runs as an argument instead of calling it directly. The test passes in a fake that raises on the first two builds and succeeds on the third, then asserts the installer ended up on the third.

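# Assumes both names live in the installer module alongside pick_build.
from installer import CrashedImmediately, install_and_launch


def test_fallback_walks_down(tmp_path):
    attempts = []

    def fake_run(build):
        # Record every build tried; crash on the first two, succeed on the third.
        attempts.append(build)
        if len(attempts) < 3:
            raise CrashedImmediately(build)

    install_and_launch(features={"avx512_vnni"}, runner=fake_run, cache=tmp_path)
    assert attempts == ["vnni512", "avx512", "bmi2"]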

The injectable runner felt like a hack when I wrote it. It isn't. It turned the one piece of logic I couldn't otherwise reach into the easiest thing in the file to test.
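To make the pattern concrete, here's a stripped-down sketch of a fallback loop with an injected runner. It leaves out the download and cache steps entirely. launch_with_fallback and BUILD_ORDER are hypothetical names, the order follows the build list earlier in the post, and this CrashedImmediately is a local stand-in for the real exception.

```python
class CrashedImmediately(Exception):
    """Stand-in: raised by a runner when the engine dies right after launch."""


# Fastest first, generic last (assumed order).
BUILD_ORDER = [
    "vnni512", "avx512", "bmi2", "avx2",
    "sse41-popcnt", "ssse3", "x86-64",
]


def launch_with_fallback(first_choice, runner):
    """Try first_choice, then each slower build, until one launches.

    `runner` is any callable taking a build name. Production code would
    pass something that starts the real binary; tests pass a fake.
    """
    start = BUILD_ORDER.index(first_choice)
    for build in BUILD_ORDER[start:]:
        try:
            runner(build)
        except CrashedImmediately:
            continue  # downgrade and try the next build down
        return build
    raise RuntimeError("no Stockfish build launched on this machine")
```

Because the loop only knows about a callable, it has no idea whether it's running a 50MB binary or a three-line fake.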

What this doesn't cover: it never proves a real vnni512 binary runs on a real vnni512 chip. The whole bug class that started this project, shipping an instruction the CPU can't decode, is exactly the thing CI can't catch, because the runner is one fixed machine. The logic is tested. The hardware promise still only gets verified when a real person on a real laptop opens the app. So I added a tiny self-check on startup that logs the chosen build and whether it launched clean. At least when it does break on someone's machine, I'll know which build it picked.
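A self-check like that can be very small. This is a sketch, not the repo's code: self_check and the logger name are made up for the example, and it assumes the same kind of injected runner as the fallback test.

```python
import logging

logger = logging.getLogger("installer.selfcheck")  # hypothetical logger name


def self_check(build, runner):
    """Launch `build` once via `runner` and log how it went.

    Returns True on a clean launch, False if the runner raised.
    """
    try:
        runner(build)
    except Exception:
        # logger.exception records the traceback along with the build name.
        logger.exception("engine build %s failed to launch", build)
        return False
    logger.info("engine build %s launched clean", build)
    return True
```

The build name goes into both log lines, so a bug report with the log attached already says which binary was picked.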

Repo's here if you want to see the messy version: https://github.com/TiltedLunar123/stockfish-chess

It works. Not perfect, but it works.