Niv Dvir

Your parity gate must enforce the number you publish: a testing methodology for porting ML models across runtimes

You ship an ML model port from Python to another runtime — Swift, C++, ONNX, whatever. The build succeeds. The model loads. You feed it the canonical test image and the output looks right: valid JSON, sensible structure, plausible coordinates.

Then you measure pixel-for-pixel against the Python reference. One bbox edge is off by 9 px. Or 200 px. Deterministically. On every test image. And it's been like that for weeks because your test gate ran at a 30 px tolerance and never flagged it.

This kind of silent drift will happen on any cross-runtime ML port that doesn't make a strict parity gate a first-class part of the build. This post is about the four-component setup that catches it — generalized so you can apply it to your own port.

The lesson, distilled: your parity gate has to enforce the number you publish. A gate that's looser than the claim is not a gate. It's a decoration that gives you false confidence.


Why cross-runtime ML ports drift silently

When you port a model implementation across runtimes, identical weights and identical architecture do not guarantee identical output. The places where you accumulate sub-pixel error:

  • Image preprocessing. Every framework's "Lanczos" is a different Lanczos. PIL, Core Image, OpenCV, TensorFlow's tf.image.resize — they all produce subtly different output for the same input. A 1- or 2-level difference at 8-bit precision in a few hundred edge pixels propagates through attention (a quick comparison follows this list).
  • Numerical precision boundaries. When the model uses FP16/BF16 in some places and FP32 in others, the exact ordering of cast → multiply → sum can move a least-significant bit. Across a few thousand multiplies, the LSBs add up (a three-line demo follows as well).
  • Tokenizer / chat-template subtleties. If your runtime emits image-token positions one off from what the reference does, attention is one position off everywhere. The output looks plausible because the model gracefully degrades — it just degrades wrong.
  • Initialization timing. Caches, position-ID buffers, RoPE state — anything that persists between calls. A buffer that resets in Python but persists in your port produces different output on the second image than the first.

Each alone produces small drift. Stacked, they produce wildly wrong output. Individually, none throws an error. They all pass shape checks. They all return something.
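
The first two are easy to demonstrate in isolation. A minimal sketch, assuming Pillow and OpenCV are installed (the file name and target size are placeholders, not values from this project): resize the same image with each library's Lanczos and count the pixels that disagree.

# Illustrative only: file name and target size are placeholders.
import numpy as np
import cv2
from PIL import Image

target = (1260, 868)  # (width, height) the model resizes to
src = Image.open("leetcode_test.png").convert("RGB")

pil_out = np.asarray(src.resize(target, Image.LANCZOS))
cv_out = cv2.resize(np.asarray(src), target, interpolation=cv2.INTER_LANCZOS4)

diff = np.abs(pil_out.astype(int) - cv_out.astype(int))
print("pixels that differ:", int((diff > 0).any(axis=-1).sum()))
print("max channel delta: ", int(diff.max()))

The precision point is just as concrete: in FP16 the spacing between representable values near 2048 is 2.0, so the grouping of two additions changes the answer. A toy, deterministic case:

# FP16 addition is not associative: the grouping decides whether small terms survive rounding.
import numpy as np

a, b, c = np.float16(2048.0), np.float16(1.0), np.float16(1.0)
print((a + b) + c)   # 2048.0, each +1 rounds back down (ULP at 2048 is 2.0)
print(a + (b + c))   # 2050.0, the ones combine to 2 first, which survives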

You only catch them if you compare your runtime's output against a known-correct reference, at the strictness you care about, on every edge of every output, every time the gate runs.


The 30 px gate that hid a 9 px bug

The specific incident: a published claim of "≤ 2 px parity on all 8 edges of both panels" against the Python mlx-vlm reference. The gate enforcing that claim ran at TOLERANCE=30 px. It had been at 30 px since early prototyping when anything sub-50-px was a win.

Two of the three canonical test images were drifting on their outer edges, by roughly 9, 8, and 5 px. Nobody noticed because the gate was 15× looser than the published number it was supposed to enforce.

The forensic re-measurement happened only while preparing to upstream the patch, with the actual measurements open alongside the published prose. That's when it became clear the gate was a decorative artifact, not a check.

The fix wasn't a single bug, though there was one, in chat-template content ordering. The deeper fix was recognizing that the test infrastructure no longer matched the claim it was supposed to enforce. Once a strict gate was wired to the actual published number, the bug surfaced on the first run.


The four components of a working parity setup

To make this robust, four things have to exist together:

1. A saved, deterministic reference output

The reference is the source of truth. Every other piece compares against it.

canonical_baselines.json (the file's name in this project — call it whatever) holds the Python reference output for every canonical test image, generated at temperature=0 with a pinned model snapshot:

{
  "model": "mlx-community/Qwen2.5-VL-7B-Instruct-4bit",
  "snapshot_hash": "fdcc572e8b05ba9daeaf71be8c9e4267c826ff9b",
  "mlx_vlm_version": "0.4.3",
  "max_edge_delta_allowed": 2,
  "prompt": "Detect these two UI panels...",
  "images": [
    {
      "name": "leetcode_test",
      "path": "_zero_px_test/leetcode_test.png",
      "size_px": [3078, 2114],
      "model_resize": [1260, 868],
      "panels": {
        "question": [1, 146, 421, 626],
        "editor":   [421, 146, 881, 626]
      }
    }
  ]
}

Three things are pinned in this file: the model snapshot hash, the reference framework's version, and the input dimensions / preprocessing parameters that produced the output. All three are required. If any one of them changes, the saved output is no longer a valid reference, and the gate must be re-baselined deliberately rather than passing trivially.
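
One concrete way to enforce the pins is to make the gate refuse to compare at all when the environment has drifted. A minimal sketch (the package distribution name and this guard are assumptions, not lines from the project's scripts):

# Illustrative guard: abort the gate if the pinned reference-framework version has drifted.
import json
from importlib.metadata import version

spec = json.load(open("canonical_baselines.json"))
installed = version("mlx-vlm")   # assumed pip distribution name
if installed != spec["mlx_vlm_version"]:
    raise SystemExit(
        f"mlx-vlm {installed} != pinned {spec['mlx_vlm_version']}: "
        "re-baseline deliberately instead of comparing against a stale reference."
    )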

The published claim — "≤ 2 px on every edge" — also lives in this file as max_edge_delta_allowed: 2. The number now lives in code, not in prose. If you ever want to publish a different number, you have to edit this file, which means the gate adjusts at the same time.

2. A strict per-edge comparison gate

strict_2px_gate.py: runs the freshly-built non-reference binary on every canonical image, parses the model's output, compares each edge of each output element against the saved reference, and exits non-zero if any single edge exceeds the allowed delta.

spec = json.load(open("canonical_baselines.json"))   # the saved reference from step 1
max_allowed = spec["max_edge_delta_allowed"]          # the published number, enforced here
failed = []
for img in spec["images"]:
    ipath, (rw, rh) = img["path"], img["model_resize"]
    out = run_runtime(args.binary, ipath, rw, rh)     # run the ported binary on the canonical image
    panels = parse_bboxes(out)                        # parse bboxes out of the model's output
    for label, ref_bbox in img["panels"].items():
        actual = panels.get(label)
        if not actual:
            failed.append((img["name"], label, "missing"))
            continue
        for i, edge_name in enumerate(["x1", "y1", "x2", "y2"]):
            delta = abs(actual[i] - ref_bbox[i])
            if delta > max_allowed:
                failed.append((img["name"], label, edge_name, delta))

Two non-obvious things this gate gets right:

  • Per-edge, not per-bbox. A bbox can score an IoU of 0.95 against the reference and still be wrong by 12 px on one specific edge (worked through just below). IoU is too forgiving as the headline metric. Per-edge maxes catch the asymmetric drift that the most subtle preprocessing bugs produce.
  • Missing outputs are failures. If the model returns a bbox for "question" but not "editor", that's not "no measurement" — it's a failure. The gate flags it loudly.
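
To make the first point concrete, here is a quick worked check. The reference box is the "editor" panel from the baselines file; the ported output is hypothetical, with its right edge drifted by 12 px:

# IoU stays comfortable while one edge is clearly out of tolerance.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

ref    = [421, 146, 881, 626]   # "editor" panel from canonical_baselines.json
ported = [421, 146, 893, 626]   # hypothetical output, right edge off by 12 px

print(round(iou(ref, ported), 3))                    # 0.975, looks "fine"
print(max(abs(p - r) for p, r in zip(ported, ref)))  # 12, fails a 2 px gate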

3. A bootstrap script that fails on incomplete patches

setup_and_verify.sh: the only sanctioned way to set up a parity-test clone. It fresh-clones the upstream library at the pinned commit, applies your patch, builds the reproducer with the production-equivalent toolchain, and runs the gate. Aborts loudly on any failure.

# After git apply, sanity-check the patch landed the canonical fix line:
if ! grep -qF 'message.images.map { _ in ["type": "image"] }' \
        Libraries/MLXVLM/Models/Qwen2VL.swift; then
    echo "Patch applied but Qwen2VL.swift is missing the image-first chat-template fix."
    echo "The patch file is incomplete — re-derive from working tree."
    exit 1
fi

The grep-the-patched-checkout trick catches the most insidious version of this problem: a patch that applies cleanly but is missing a fix line because the patch file itself was generated incompletely. The patched build looks fine and runs to completion. It produces output that's wrong by 9 px. Your gate would have to be tight enough to catch that.

The grep makes the gate stricter — it ensures the patch landed the specific lines that fix the bug you claim to have fixed.

4. Multi-implementation cross-checks where you have them

If your project has more than one consumer (an SDK, an MCP server, a dynamic-library entry point, an alternative runtime path), wire them all to a single cross-adapter gate:

# scripts/parity/cross_adapter_gate.sh — runs the same image through every adapter
IMAGE="$1"
fail() { echo "cross-adapter mismatch: $1" >&2; exit 1; }

A_BBOX=$(./run_adapter_a "$IMAGE")
B_BBOX=$(./run_adapter_b "$IMAGE")
C_BBOX=$(./run_adapter_c "$IMAGE")
diff <(echo "$A_BBOX") <(echo "$B_BBOX") || fail "A vs B"
diff <(echo "$A_BBOX") <(echo "$C_BBOX") || fail "A vs C"

This catches integration drift — situations where one adapter has stale state, a different image-preprocessing path, or has simply drifted from the canonical implementation. The tolerance here is 0 px: all your adapters call the same model with the same preprocessing, so any difference at all is a bug.

In this project, this gate runs at zero tolerance across every adapter that ships. It catches anyone breaking adapter parity before it gets noticed in the field.


Wiring all four together

The gate runs in this order on every parity check:

  1. Fresh-clone the reference library at the pinned commit. (No reusing dirty trees.)
  2. Apply your patch. Fail if it doesn't apply cleanly.
  3. Grep the patched tree for the literal lines your patch was supposed to introduce. Fail if any are missing.
  4. Build with the production-equivalent toolchain (in this project's case xcodebuild, not swift build — different code paths).
  5. Run the binary on every canonical image.
  6. Compare every edge of every output element to the saved reference.
  7. Run the cross-adapter gate.
  8. Pass = no edge of any output exceeds max_edge_delta_allowed AND every adapter agrees at zero tolerance.

Any step that fails aborts the whole gate with a specific error. There is no "warning"; there is no "fuzzy pass." There is "the published number holds" or "the published number does not hold." Binary.
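
If you want the whole sequence behind one command, a thin driver is enough. A minimal sketch, assuming the script paths named in this post (the test-image argument and the lack of flags are illustrative, not the project's actual invocations); check=True means any failing step aborts the run with that step's error:

# Hypothetical driver: each step must exit 0 or the whole gate aborts.
import subprocess

STEPS = [
    ["./scripts/parity/setup_and_verify.sh"],             # clone, patch, grep, build
    ["python3", "scripts/parity/strict_2px_gate.py"],      # per-edge check against the baselines
    ["./scripts/parity/cross_adapter_gate.sh",
     "_zero_px_test/leetcode_test.png"],                   # zero-tolerance adapter check
]

for cmd in STEPS:
    subprocess.run(cmd, check=True)   # raises CalledProcessError on non-zero exit

print("PASS: the published number holds")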


What this generalizes to

The specific scripts above are tuned for a Swift port of Qwen2.5-VL with mlx-swift-lm as the runtime. The methodology is platform-agnostic.

If you're porting an ML model across runtimes, the minimum-viable parity setup looks like this:

  • One reference output file, pinned to a specific reference-framework version + model snapshot, with the number you'd publish embedded as a hard limit.
  • One gate script that runs the non-reference binary on every test case and compares per-output-element, per-edge (or per-token, per-pixel — whatever the units of your model output are) against the reference, with a strict tolerance.
  • One bootstrap script that ensures the binary under test is built from a clean checkout with the canonical patch fully applied, and aborts if any patch line is missing.
  • One cross-adapter sanity check if you have multiple consumers.

That's it. Four components — a few short scripts plus a JSON file. It runs in a few minutes after the first build (which dominates wall time anyway).

The reason to build the gate first, before the bug-hunt, is that the bugs you find in your port are the ones the gate flags. A gate at the right tolerance turns "this is mysteriously wrong" into "this specific edge of this specific output exceeds the threshold by N px on this specific image." That's debuggable. A gate at the wrong tolerance produces no information at all.


Provenance

This came out of a Swift port of Qwen2.5-VL on Apple Silicon. The 9 px drift incident happened while preparing the upstream PR for the model fixes. A previously-published "≤ 2 px parity on all 8 edges" claim was sitting on top of a 30 px gate. The four-component setup above is what replaced it.

The bug-hunt war story (failed model approaches, the bug list, what eventually worked) is on dev.to as On-Device Document Grounding on macOS: Getting Qwen2.5-VL to Actually Work in Swift. This post is the testing-methodology companion.

The scripts I described — canonical_baselines.json, strict_2px_gate.py, setup_and_verify.sh, cross_adapter_gate.sh — live in the project's scripts/parity/ directory if you want to adapt them.

The summary: the parity gate must enforce the number you publish. A gate that's looser than the claim is not a gate. Build it first.
