<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Niv Dvir</title>
    <description>The latest articles on DEV Community by Niv Dvir (@nivdvir).</description>
    <link>https://dev.to/nivdvir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838336%2Fdb2e5a42-2c89-4772-88e1-ce7f6e4aa813.png</url>
      <title>DEV Community: Niv Dvir</title>
      <link>https://dev.to/nivdvir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nivdvir"/>
    <language>en</language>
    <item>
      <title>On-Device Document Grounding on macOS: Getting Qwen2.5-VL to Actually Work in Swift</title>
      <dc:creator>Niv Dvir</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:05:36 +0000</pubDate>
      <link>https://dev.to/nivdvir/building-a-real-time-screen-reader-on-macos-that-actually-works-471</link>
      <guid>https://dev.to/nivdvir/building-a-real-time-screen-reader-on-macos-that-actually-works-471</guid>
      <description>&lt;p&gt;&lt;em&gt;The models that failed, the bugs that took weeks, and the architecture that survived.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Trying to Do
&lt;/h2&gt;

&lt;p&gt;I wanted to read what was on my &lt;em&gt;own&lt;/em&gt; screen — a long Wikipedia article, an arXiv PDF, a release note — and render an overlay on top with content annotations (summary bullets, section anchors, corner-to-corner perspective lines). Everything local on Apple Silicon. No cloud, no audio capture, no hidden assistance — just "here's a document on screen, understand it on-device, draw something useful on top."&lt;/p&gt;

&lt;p&gt;On paper this is a well-defined pipeline: detect content regions in a screenshot, OCR them, accumulate text across scroll positions, render guide markers over the panels. In practice every step broke in a way I hadn't expected, and the working combination took months to find.&lt;/p&gt;




&lt;h2&gt;
  
  
  Everything That Failed
&lt;/h2&gt;

&lt;p&gt;This section might be the most useful part of this article. Each of these approaches cost days to weeks of effort. If you are building anything similar on macOS, you can skip all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Florence-2
&lt;/h3&gt;

&lt;p&gt;Microsoft's Florence-2 was the first vision model I tried. It supports grounding tasks out of the box -- you give it an image and ask "where is the text panel?" and it returns bounding box coordinates. On paper, perfect for UI panel detection.&lt;/p&gt;

&lt;p&gt;In practice, Florence-2 cannot run on macOS with Apple Silicon. The model uses a custom architecture that requires &lt;code&gt;trust_remote_code=True&lt;/code&gt;, depends on flash-attention (a CUDA-only library), and cannot be converted to CoreML. There is no MLX port. I spent two days trying different conversion paths before accepting that this model simply does not exist on Apple's platform.&lt;/p&gt;

&lt;p&gt;If you are searching for a grounding-capable vision model on macOS, remove Florence-2 from your list immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ferret-UI
&lt;/h3&gt;

&lt;p&gt;Apple's own UI understanding model seemed like the obvious choice for an Apple Silicon project. Ferret-UI was specifically designed to understand user interfaces -- element detection, widget classification, spatial reasoning about UI layouts.&lt;/p&gt;

&lt;p&gt;It was a dead end. Ferret-UI requires CUDA flash-attention, which means it needs an NVIDIA GPU. Apple's own UI understanding model does not run on Apple's own hardware without significant porting effort. Beyond the runtime issue, the model's grounding output was not usable for my task -- I needed precise pixel-coordinate bounding boxes, and the model's output format did not map cleanly to that.&lt;/p&gt;

&lt;p&gt;The irony of Apple publishing a UI model that cannot run on macOS was not lost on me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen2.5-VL-3B (the Small One)
&lt;/h3&gt;

&lt;p&gt;After the first two dead ends, I found that Qwen2.5-VL had an MLX port via the &lt;code&gt;mlx-vlm&lt;/code&gt; library. The 3B parameter variant (4-bit quantized) was only 2.9GB, loaded in 1.9 seconds, and ran inference in 3-7 seconds. Fast and light.&lt;/p&gt;

&lt;p&gt;But too weak. The 3B model could identify that UI elements existed in an image -- it would say "there is a text panel on the left" -- but the bounding box coordinates it returned were hallucinated. Boxes would be off by hundreds of pixels, overlap incorrectly, or enclose regions that contained nothing. For panel detection where you need to know "the question text lives between pixels (0, 120) and (900, 800)," a model that hallucinates coordinates is worse than no model at all.&lt;/p&gt;

&lt;p&gt;The 7B variant turned out to be the sweet spot. More on that in the next section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kd9o21gut4mr1ekfv83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kd9o21gut4mr1ekfv83.png" alt="3B model hallucinated bounding boxes vs 7B accurate detection" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pixel-Edge Detection (No ML)
&lt;/h3&gt;

&lt;p&gt;Before committing to a VLM, I tried the traditional computer vision approach. Each UI panel has a uniform background color. The question panel might be &lt;code&gt;rgb(53, 67, 83)&lt;/code&gt;, the editor panel &lt;code&gt;rgb(22, 43, 54)&lt;/code&gt;. In theory, you can find panel boundaries by detecting where the background color changes.&lt;/p&gt;

&lt;p&gt;The algorithm worked on test screenshots. Then I tested it on a page where both panels used similar background colors. The panel border was a thin 1-pixel line that blended into the surrounding regions. Same-color-background UIs -- which are increasingly common with modern design trends -- broke the approach entirely.&lt;/p&gt;

&lt;p&gt;Pixel-edge detection is fragile because it depends on an assumption (panels have visually distinct backgrounds) that is not guaranteed. A VLM can detect panel boundaries semantically -- it understands "this is a question panel" regardless of what color it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accessibility API (AX API)
&lt;/h3&gt;

&lt;p&gt;macOS has a built-in accessibility API that lets you programmatically read UI elements. For a screen reader, this sounds ideal.&lt;/p&gt;

&lt;p&gt;The problem is that the Accessibility API cannot see inside web content rendered in Chrome. The browser exposes high-level structural elements -- the window, the tab bar, the content area -- but not individual text lines, panel layouts, or the DOM structure within the page. You get a single "web area" element that says "this is a web view" with no ability to drill into it.&lt;/p&gt;

&lt;p&gt;If your target is a native macOS application, the AX API might work. For reading web-based UIs through the browser, it is a dead end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spawning a New Python VLM Process Per Inference
&lt;/h3&gt;

&lt;p&gt;My initial integration spawned a new Python process for each VLM inference call. The Python script imported &lt;code&gt;mlx-vlm&lt;/code&gt;, loaded the Qwen2.5-VL-7B model (5.3GB of weights), ran inference on one image, printed the result, and exited. The next cycle, 15 seconds later, spawned a new process that loaded the 5.3GB model again.&lt;/p&gt;

&lt;p&gt;After three or four cycles, the Mac froze. Each process was loading the full model into unified memory, and the previous processes had not fully released their allocations before the next one started. OOM within minutes.&lt;/p&gt;

&lt;p&gt;The fix was a persistent server architecture: load once, serve many.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Worked -- The Architecture
&lt;/h2&gt;

&lt;p&gt;Here is the system that survived. Each component earned its place by being the last option standing after everything else failed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw469k0ytylzov2plorp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw469k0ytylzov2plorp3.png" alt="Architecture diagram — screen reader pipeline" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Panel Detection: Qwen2.5-VL-7B via MLX
&lt;/h3&gt;

&lt;p&gt;The 7B parameter Qwen2.5-VL model, 4-bit quantized, is the sweet spot for UI panel detection on Apple Silicon. The 3B model hallucinates bounding boxes. Larger models (14B+) are too slow for interactive use. The 7B variant reliably returns accurate panel coordinates when prompted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why MLX matters.&lt;/strong&gt; Apple Silicon's unified memory architecture means the CPU and GPU share the same physical RAM. MLX exploits this -- the model weights live in unified memory once and are accessed by both the CPU (for attention computations) and the GPU (for matrix multiplications) without copying. The 4-bit quantized model shows ~238MB resident memory in Activity Monitor, not the full weight file size, because MLX memory-maps the weights and pages them in on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt that works.&lt;/strong&gt; After testing dozens of prompt variations, this format reliably produces usable output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect the following UI panels in this screenshot and output their
bounding box coordinates in JSON format:
1. The "question" panel (problem description text area)
2. The "editor" panel (code editor area)

Return JSON with format: [{"label": "question", "bbox_2d": [x1,y1,x2,y2]},
{"label": "editor", "bbox_2d": [x1,y1,x2,y2]}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details: ask for each panel by name in a numbered list (the model sometimes merges panels into one bbox if you describe them in a single sentence), and specify the exact JSON format you want (the model follows format instructions well).&lt;/p&gt;
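
&lt;p&gt;&lt;strong&gt;Parsing the response.&lt;/strong&gt; The model usually returns the JSON as requested, but it sometimes wraps the array in a Markdown fence or surrounds it with prose. A defensive parser -- a sketch, not the exact code in the repo -- pulls out the first bracketed JSON span and validates each entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json, re

def parse_bboxes(raw: str):
    """Extract [{"label": ..., "bbox_2d": [x1, y1, x2, y2]}, ...] from model
    output, tolerating Markdown fences and surrounding prose. Assumption:
    the answer is the first [...] span in the text."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match is None:
        return []
    try:
        panels = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries with a 4-number bbox
    return [p for p in panels
            if isinstance(p.get("bbox_2d"), list) and len(p["bbox_2d"]) == 4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;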

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4jmygeos0szcv6urzhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4jmygeos0szcv6urzhr.png" alt="Before and after: raw screenshot vs VLM panel detection" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The initial architecture: a persistent Python server.&lt;/strong&gt; The model takes ~12 seconds to load. Rather than paying that cost every cycle, I built a Python server process that loads the model at startup and accepts requests over a simple stdin/stdout protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server: load model once, serve forever
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm.prompt_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;apply_chat_template&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_config&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                  &lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Swift host process spawns this server once, sends JSON requests on stdin, and reads JSON responses from stdout. No HTTP server, no sockets, no serialization framework -- just newline-delimited JSON over pipes.&lt;/p&gt;
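
&lt;p&gt;For debugging, the client side of that protocol can be exercised from any language. A hypothetical Python test client (assuming the server above is saved as &lt;code&gt;vlm_server.py&lt;/code&gt;; the file name is an assumption for this sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json, subprocess

# Hypothetical test client for the stdin/stdout server above.
server = subprocess.Popen(
    ["python3", "vlm_server.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

request = {"prompt": "Detect the following UI panels...",
           "image_path": "/tmp/screenshot.png"}
server.stdin.write(json.dumps(request) + "\n")   # one request per line
server.stdin.flush()

response = json.loads(server.stdout.readline())  # one response per line
print(response["result"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;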

&lt;p&gt;&lt;strong&gt;Coordinate conversion.&lt;/strong&gt; The model returns bounding boxes in the coordinate space of the resized image (max 1280px on the longest side, rounded to multiples of 28). To get screen pixels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;screen_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;resized_width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;retina_scale&lt;/span&gt;
&lt;span class="n"&gt;screen_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_height&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;resized_height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;retina_scale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a Retina display, &lt;code&gt;retina_scale&lt;/code&gt; is 2.0. Forgetting this division is a common source of bounding boxes that are exactly 2x too large.&lt;/p&gt;
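
&lt;p&gt;The resized dimensions can be computed without asking the model, following the rule described above (longest side capped at 1280px, sides snapped to multiples of 28). A simplified sketch -- the reference preprocessor also enforces total-pixel bounds, which this omits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def resized_dims(width: int, height: int, max_side: int = 1280, patch: int = 28):
    """Model-input dimensions per the rule above: scale so the longest side
    fits within max_side, then snap each side down to a multiple of patch.
    Simplification: the real preprocessor also applies total-pixel bounds."""
    scale = min(1.0, max_side / max(width, height))
    w = max(patch, math.floor(width * scale / patch) * patch)
    h = max(patch, math.floor(height * scale / patch) * patch)
    return w, h

# Example: a 2560x1600 Retina capture maps to a 1260x784 model space
resized_width, resized_height = resized_dims(2560, 1600)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;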

&lt;h3&gt;
  
  
  From Python Server to Native Swift
&lt;/h3&gt;

&lt;p&gt;The Python persistent server worked. But it had friction: a Python subprocess to manage, a PIL resize helper, stdin/stdout JSON marshaling, and ~50ms of overhead per inference just from process communication. For an interactive-latency pipeline I wanted everything native.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mlx-swift-lm&lt;/code&gt; library promised exactly this -- a Swift implementation of the MLX model runtime, including Qwen2.5-VL. Load the model in Swift, run inference in Swift, no Python anywhere. In theory, a single-binary solution.&lt;/p&gt;

&lt;p&gt;In practice, the Swift implementation had 10 bugs (the 10th surfaced after this article was published — see the postscript). Finding and fixing them took weeks. But the result was worth it: a fully native Swift binary that runs Qwen2.5-VL-7B with zero Python dependencies, producing output that matches the Python reference within 2 px on every edge of every panel.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 9 Bugs in mlx-swift-lm's Qwen2.5-VL (plus a 10th found after publication)
&lt;/h3&gt;

&lt;p&gt;This section documents those weeks. The bugs collectively made the model produce wrong bounding boxes. Fixing them was the difference between "the model hallucinates" and "the model matches the Python reference within 2 px on every edge of every panel — bit-exact 0 px on most test images."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MROPE section selection (split-select vs slice-replace).&lt;/strong&gt; Multimodal Rotary Position Embedding (MROPE) assigns different frequency bands to temporal, height, and width dimensions. The Swift implementation split the frequency tensor into three parts using modulo indexing (&lt;code&gt;i % 3&lt;/code&gt;), which interleaves the frequencies. Python's implementation starts with temporal frequencies and overwrites height/width slices in-place: &lt;code&gt;[T_0-15, H_16-39, W_40-63]&lt;/code&gt;. The layouts are completely different, and the wrong layout produces subtly wrong attention patterns.&lt;/p&gt;
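
&lt;p&gt;A toy illustration (not library code) of why the two layouts disagree, assuming sections of T=16, H=24, W=24 as implied by the layout above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Which dimension owns each of 64 frequency slots?
# Correct (slice-replace, Python reference): contiguous blocks.
slice_layout = ["T"] * 16 + ["H"] * 24 + ["W"] * 24

# Buggy (modulo split): T,H,W,T,H,W,... interleaved across all slots.
mod_layout = ["THW"[i % 3] for i in range(64)]

agree = sum(a == b for a, b in zip(slice_layout, mod_layout))
print(f"slots where the layouts agree: {agree}/64")  # 22/64 -- mostly wrong
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;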

&lt;p&gt;&lt;strong&gt;2. Chat template ordering.&lt;/strong&gt; The Swift message generator placed text before the image token in the content array. The Python implementation puts the image first: &lt;code&gt;&amp;lt;|vision_start|&amp;gt;&amp;lt;|image_pad|&amp;gt;&amp;lt;|vision_end|&amp;gt;PROMPT&lt;/code&gt;. This ordering matters because the model's attention patterns are position-dependent -- putting text before the image means the text tokens attend to positions where image features have not yet been injected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. invFreq registered as a Module weight.&lt;/strong&gt; The &lt;code&gt;invFreq&lt;/code&gt; tensor was declared as a property on an &lt;code&gt;Attention&lt;/code&gt; class that inherits from &lt;code&gt;Module&lt;/code&gt;. MLX's weight-loading mechanism scans all &lt;code&gt;Module&lt;/code&gt; properties and tries to load matching weights from the checkpoint. Since &lt;code&gt;invFreq&lt;/code&gt; is a computed constant (not a learned weight), the loader either threw &lt;code&gt;keyNotFound&lt;/code&gt; errors or silently overwrote it with garbage. The fix was wrapping it in a non-&lt;code&gt;Module&lt;/code&gt; class to hide it from reflection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. rope_deltas unused during autoregressive generation.&lt;/strong&gt; After the prefill pass, the code cleared the cached position IDs but never applied &lt;code&gt;rope_deltas&lt;/code&gt; during subsequent token generation. The correct computation is &lt;code&gt;positionIds = cache_offset + rope_deltas + arange(seqLen)&lt;/code&gt;. Without the deltas, position embeddings drifted with each generated token, degrading output quality progressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Image resize using 1800px max instead of 1280px.&lt;/strong&gt; The Swift code resized input images to a maximum of 1800 pixels on the longest side, producing 2688 visual tokens. The Python reference implementation uses 1280px maximum, producing 1305 visual tokens. The model was trained on the 1280px resolution. Feeding it 1800px images meant the visual token positions were outside the model's training distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Prompt format for single bbox output.&lt;/strong&gt; Using a single sentence asking for both panels caused the model to sometimes return one combined bounding box. Switching to a numbered list with explicit labels ("1. question panel" / "2. editor panel") reliably produced two separate bboxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. maxTokens not set.&lt;/strong&gt; Without an explicit &lt;code&gt;max_tokens&lt;/code&gt; parameter, the model generated tokens until hitting an internal limit or running out of memory. For a task that should return ~100 tokens of JSON, this caused multi-second waits and occasionally produced thousands of tokens of hallucinated output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. MROPE state not reset between successive images.&lt;/strong&gt; The cached position IDs and rope deltas from one image persisted into the next inference call. When processing a new screenshot, the model's position embeddings started from where the previous image left off instead of resetting. This caused progressively worse results on the second, third, and subsequent images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Vision attention mask ignored -- the ROOT CAUSE.&lt;/strong&gt; This was the single bug most responsible for bounding box inaccuracy. The vision encoder's self-attention uses a mask to implement windowed attention (the model processes the image in patches, and each patch should only attend to patches within its window). The Swift code passed &lt;code&gt;mask: .none&lt;/code&gt; to the scaled dot-product attention call instead of &lt;code&gt;mask: .array(floatMask)&lt;/code&gt;. Without the mask, every patch attended to every other patch globally, destroying the spatial locality that the model relies on for precise coordinate prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// WRONG -- ignores the attention mask entirely&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;attnOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scaledDotProductAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;none&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// CORRECT -- applies the windowed attention mask&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;attnOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scaledDotProductAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;floatMask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing all 9 bugs (and a 10th, surfaced after publication), the Swift implementation passes a strict ≤2 px parity gate against the Python &lt;code&gt;mlx-vlm&lt;/code&gt; reference on every edge of every panel across three canonical test images. Two of three are bit-exact (0 px); the third is 2 px on y2 of both panels. The gate (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/scripts/parity/setup_and_verify.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/parity/setup_and_verify.sh&lt;/code&gt;&lt;/a&gt;) fresh-clones upstream &lt;code&gt;mlx-swift-lm&lt;/code&gt;, applies the patch, builds, and aborts if any edge exceeds 2 px on any image. The model was not hallucinating -- the implementation was broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0wajlpuj710si8yu8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0wajlpuj710si8yu8d.png" alt="Python vs Swift convergence — 0px delta after 9 bug fixes" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upstream status.&lt;/strong&gt; The 8 bugs that live inside &lt;code&gt;mlx-swift-lm&lt;/code&gt; (all of the above except #6 prompt format and #7 maxTokens, which belong in consumer code) are submitted upstream as the omnibus &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/222" rel="noopener noreferrer"&gt;&lt;strong&gt;#222&lt;/strong&gt;&lt;/a&gt; plus four isolated splits per upstream review feedback: &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/238" rel="noopener noreferrer"&gt;&lt;strong&gt;#238&lt;/strong&gt;&lt;/a&gt; (vision attention mask), &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/239" rel="noopener noreferrer"&gt;&lt;strong&gt;#239&lt;/strong&gt;&lt;/a&gt; (MROPE + rope_deltas + invFreq + state-reset), &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/242" rel="noopener noreferrer"&gt;&lt;strong&gt;#242&lt;/strong&gt;&lt;/a&gt; (chat-template image-first), and &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/243" rel="noopener noreferrer"&gt;&lt;strong&gt;#243&lt;/strong&gt;&lt;/a&gt; (preprocessing: PIL-Lanczos + 1280 cap). Each split links back to a verifiable strict ≤2 px gate that fresh-clones upstream and runs end-to-end.&lt;/p&gt;

&lt;p&gt;These patterns aren't specific to Qwen2.5-VL. The same &lt;code&gt;mask: .none&lt;/code&gt; attention bug appears in &lt;code&gt;Qwen2VL.swift&lt;/code&gt; and &lt;code&gt;GlmOcr.swift&lt;/code&gt;; the MROPE plumbing (invFreq, rope_deltas, section selection) is shared across all MROPE-based VLMs in the library (&lt;code&gt;Qwen2VL&lt;/code&gt;, &lt;code&gt;Qwen25VL&lt;/code&gt;, &lt;code&gt;Qwen3VL&lt;/code&gt;, &lt;code&gt;Qwen35&lt;/code&gt;, &lt;code&gt;GlmOcr&lt;/code&gt;). PR #222 is the flagship fix with model-by-model follow-ups coming as each is validated against the Python reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-architecture validation.&lt;/strong&gt; To sanity-check that the combined patch really does something model-independent, I ran the same A/B on &lt;code&gt;mlx-community/UI-TARS-1.5-7B-4bit&lt;/code&gt; — ByteDance's click-prediction model, which shares &lt;code&gt;Qwen2_5_VLForConditionalGeneration&lt;/code&gt; architecture, same hidden size, same special tokens. Same deterministic input image, same model, only the source of &lt;code&gt;Qwen25VL.swift&lt;/code&gt; differs.&lt;/p&gt;

&lt;p&gt;With the PR applied: 200 tokens generated, 9+ distinct coordinates tracking actual content positions. Without the PR: 52 tokens generated, output collapses to two entries at identical coordinates &lt;code&gt;(141, 141)&lt;/code&gt; — a clean signature of broken position encoding. Two independent Qwen2.5-VL-family models, same failure mode when the fixes aren't present. The &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/222#issuecomment-4283420555" rel="noopener noreferrer"&gt;A/B reproducer is attached to PR #222&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCR: Apple Vision Framework
&lt;/h3&gt;

&lt;p&gt;Apple's Vision framework provides on-device OCR that runs on the Neural Engine at ~300ms per frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recognition levels are confusing.&lt;/strong&gt; The API has two recognition levels: level 0 (&lt;code&gt;.accurate&lt;/code&gt;) and level 1 (&lt;code&gt;.fast&lt;/code&gt;). Intuitively, you might assume level 0 is the baseline (fast) and level 1 is the premium (accurate). It is the opposite: level 0 is the accurate path (slower, higher quality) and level 1 is the fast path (lower quality). I ran with level 1 for weeks thinking I was getting the best results, then discovered I was using the fast path the entire time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecognizeDocumentsRequest vs VNRecognizeTextRequest.&lt;/strong&gt; Apple's Vision framework has two OCR APIs, and they behave very differently on code content. &lt;code&gt;RecognizeDocumentsRequest&lt;/code&gt; (the newer, WWDC25 API) is optimized for documents -- prose, forms, receipts. It silently drops lines that look like code: indented lines with brackets, semicolons, and unusual formatting. For a code editor panel, it would capture 15 out of 20 visible lines, silently losing the rest.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;VNRecognizeTextRequest&lt;/code&gt; (the older API) captures everything -- every line, regardless of formatting. For reading code from screen, use &lt;code&gt;VNRecognizeTextRequest&lt;/code&gt;. I discovered this after weeks of mysterious "missing lines" that turned out to be the newer API being too clever about what constitutes document text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded OCR.&lt;/strong&gt; Rather than scanning the entire screen (which picks up menu bars, dock icons, and other noise), the OCR is bounded to the panel regions detected by the VLM. This reduces both processing time and false positives -- you only extract text from the panel you care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scroll Accumulator
&lt;/h3&gt;

&lt;p&gt;Most non-trivial content does not fit in a single viewport. A problem description might be 40 lines long, but only 15 are visible at once. The scroll accumulator solves this by scrolling through the content in steps, OCR-ing each viewport, and stitching the results into a complete transcript.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stitching problem.&lt;/strong&gt; Adjacent viewports overlap. When you scroll down by 100 pixels, the bottom 80% of the previous viewport is still visible. Naive concatenation produces massive duplication. The accumulator uses Levenshtein distance to fuzzy-match each incoming OCR line against all accumulated lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold tuning.&lt;/strong&gt; A line is classified as "already seen" if its Levenshtein similarity to any accumulated line exceeds 60%. I tested thresholds from 40% to 80%:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40%: too permissive -- novel lines were classified as duplicates and dropped&lt;/li&gt;
&lt;li&gt;80%: too strict -- lines with minor OCR variations were classified as novel and added twice&lt;/li&gt;
&lt;li&gt;60%: best F1 score for the duplicate-vs-novel classification task&lt;/li&gt;
&lt;/ul&gt;
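
&lt;p&gt;A minimal sketch of the duplicate check (standard-library Python, not the production Swift accumulator) -- normalized Levenshtein similarity plus the 60% threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def levenshtein(a: str, b: str) -&gt; int:
    """Classic dynamic-programming edit distance."""
    if len(a) &lt; len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -&gt; float:
    return 1.0 - levenshtein(a, b) / (max(len(a), len(b)) or 1)

def accumulate(seen: list, incoming: list, threshold: float = 0.60) -&gt; list:
    """Append only lines that don't fuzzy-match anything already seen."""
    for line in incoming:
        if all(similarity(line, s) &lt; threshold for s in seen):
            seen.append(line)
    return seen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;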

&lt;h3&gt;
  
  
  Metal GPU Overlay Rendering
&lt;/h3&gt;

&lt;p&gt;The overlay renders detected text and annotations as a transparent window on top of the target application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;contentRect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;screenFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;styleMask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borderless&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;backing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;defer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSWindow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;rawValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOpaque&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backgroundColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clear&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ignoresMouseEvents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hasShadow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Self-exclusion from screen capture.&lt;/strong&gt; This is critical: the overlay must not appear in its own screenshots. If it does, the next VLM inference cycle sees the overlay text, interprets it as UI content, and the system enters a feedback loop where it reads its own annotations. The fix is &lt;code&gt;captureScreenExcluding(windowID:)&lt;/code&gt;, which tells ScreenCaptureKit to exclude the overlay window from the captured frame.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Demo
&lt;/h3&gt;

&lt;p&gt;Here's the system running in reader mode — detecting the main content region, reading text across scroll positions, and rendering a summary overlay with perspective lines anchored to the source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sy2b58i45otlk7w89we.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sy2b58i45otlk7w89we.gif" alt="GroundingKit reader mode — summary overlay on a long article with corner-anchor perspective lines" width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VLM model loading&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;Unified memory (one-time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM panel detection&lt;/td&gt;
&lt;td&gt;~18s per inference&lt;/td&gt;
&lt;td&gt;GPU (MLX unified memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCR per frame&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;Neural Engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlay render&lt;/td&gt;
&lt;td&gt;&amp;lt;16ms (60fps)&lt;/td&gt;
&lt;td&gt;Metal GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full scroll accumulation&lt;/td&gt;
&lt;td&gt;~40s (20 steps)&lt;/td&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model resident memory&lt;/td&gt;
&lt;td&gt;~5.5GB peak&lt;/td&gt;
&lt;td&gt;Unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The VLM inference is the bottleneck at ~18 seconds, but it only needs to run when the panel layout changes (e.g., navigating to a new page). During normal operation, the OCR and overlay run continuously at ~300ms per cycle while the VLM-detected panel bounds remain cached. On an M1 Pro with 16GB, the system runs comfortably alongside Chrome and other applications.&lt;/p&gt;




&lt;h3&gt;
  
  
  A note on "native Swift vs Python"
&lt;/h3&gt;

&lt;p&gt;Worth being honest about what "native Swift" does and doesn't buy you here. The VLM forward pass is the same Metal kernels in either language, so a one-shot &lt;code&gt;mlx-vlm&lt;/code&gt; Python benchmark on the same image finishes within a few percent of the Swift equivalent. Swift doesn't make the model run faster.&lt;/p&gt;

&lt;p&gt;What Swift changes is everything &lt;em&gt;around&lt;/em&gt; the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cold start ~3 s vs ~15 s&lt;/strong&gt; — no interpreter, no PyTorch import, just mmap the weights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-IPC pipeline&lt;/strong&gt; — &lt;code&gt;CGWindowListCreateImage&lt;/code&gt; → VLM → &lt;code&gt;VNRecognizeTextRequest&lt;/code&gt; → Metal overlay all run in one process with shared memory. A Python pipeline has to serialize each screenshot across the subprocess boundary, adding 30–100 ms per cycle on top of inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time frame work becomes viable&lt;/strong&gt; — a per-frame 16 ms budget has room for actual OCR and overlay redraw; it doesn't fit a round-trip to a Python worker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded memory over long sessions&lt;/strong&gt; — &lt;code&gt;autoreleasepool&lt;/code&gt; around CGImage ops keeps a 100-minute session at ~5.5 GB peak. The earlier Python-subprocess version leaked ~900 MB over the same duration through PyObjC bridging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The headline 18-second number is the same either way. The difference is whether you can wrap that around a responsive app — startable in 3 seconds, no IPC between stages, 60 fps overlay — rather than a command-line script.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Reproduce This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS 14+ on Apple Silicon (M1/M2/M3/M4)&lt;/li&gt;
&lt;li&gt;Xcode 16+&lt;/li&gt;
&lt;li&gt;~16GB unified memory (8GB minimum, 16GB comfortable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/code&gt; from Hugging Face (~5.3GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mlx-swift-lm&lt;/code&gt; (Swift package, for native VLM inference)&lt;/li&gt;
&lt;li&gt;Apple Vision framework (built into macOS)&lt;/li&gt;
&lt;li&gt;Metal (built into macOS)&lt;/li&gt;
&lt;li&gt;ScreenCaptureKit (built into macOS)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;GroundingKit is the open-source macOS app extracted from this project. Clone it, build it, and try panel detection on your own screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit" rel="noopener noreferrer"&gt;github.com/NivDvir/screen-overlay-toolkit&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NivDvir/screen-overlay-toolkit.git
&lt;span class="nb"&gt;cd &lt;/span&gt;screen-overlay-toolkit
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;mlx-vlm Pillow
swift build
.build/debug/GroundingKit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~1,500 lines of Swift + 120 lines of Python. Menu bar app, runs entirely local on Apple Silicon.&lt;/p&gt;







&lt;h2&gt;
  
  
  Postscript: 10th bug, post-publication (added 2026-04-25)
&lt;/h2&gt;

&lt;p&gt;After publication, while preparing PR splits per a maintainer's review request, forensic re-measurement found the Swift output drifting +9 px from the Python reference on a canonical LeetCode test image — despite this article's "0 px on all 8 edges" claim. Two things were going on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A 10th bug was hiding.&lt;/strong&gt; &lt;code&gt;Qwen2VLMessageGenerator&lt;/code&gt; in &lt;code&gt;Qwen2VL.swift&lt;/code&gt; ordered chat-template content as &lt;code&gt;[text, image]&lt;/code&gt;, but HuggingFace's template for Qwen2.5-VL emits &lt;code&gt;&amp;lt;|vision_start|&amp;gt;&amp;lt;|image_pad|&amp;gt;&amp;lt;|vision_end|&amp;gt;{text}&lt;/code&gt; — image content first. Swapping the order removes a deterministic +9 / +8 / +5 px bbox shift. Fix filed upstream as &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/242" rel="noopener noreferrer"&gt;&lt;strong&gt;mlx-swift-lm #242&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The original parity claim was over-confident.&lt;/strong&gt; The "0 px on all 8 edges" prose in this article wasn't backed by an automated gate at the tolerance the prose claimed. The project's accuracy gate ran at 30 px tolerance, so the 9 px drift passed silently. The single-bug-fix narrative was right; the parity number was published ahead of the gate that would have enforced it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both have been addressed. The patch file (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/patches/mlx-swift-lm-mrope-fixes.patch" rel="noopener noreferrer"&gt;&lt;code&gt;patches/mlx-swift-lm-mrope-fixes.patch&lt;/code&gt;&lt;/a&gt;) now covers both &lt;code&gt;Qwen25VL.swift&lt;/code&gt; and &lt;code&gt;Qwen2VL.swift&lt;/code&gt;. The strict ≤2 px gate (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/scripts/parity/setup_and_verify.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/parity/setup_and_verify.sh&lt;/code&gt;&lt;/a&gt;) runs against saved Python &lt;code&gt;mlx-vlm&lt;/code&gt; reference output and aborts if any edge of any panel exceeds 2 px on any image. With the fix applied, parity is bit-exact (0 px) on two of three canonical test images and within 2 px on the third (only y2 of both panels). The omnibus PR #222 now has four isolated companion PRs (#238, #239, #242, #243) per the maintainer's split-friendly review preference.&lt;/p&gt;

&lt;p&gt;The lesson: a parity gate must enforce the number you publish. A gate looser than the claim is not a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Building this system produced more failure than success. Six major approaches failed before the working architecture emerged, and even the working approach required fixing 10 implementation bugs in a third-party library before it produced correct output (the 10th surfaced after publication). The total development time from "I want to read a panel from the screen" to "this reliably works" was measured in weeks, not days.&lt;/p&gt;

&lt;p&gt;The experience of building this on-device overlay led directly to a testing methodology I call CCSV (Cross-Channel Spatiotemporal Verification) -- the idea that you can verify a UI by reading it through two completely independent channels (DOM and pixels) and comparing what they see. That methodology is described in a companion article.&lt;/p&gt;

&lt;p&gt;If you are building something similar -- local VLMs on Apple Silicon, on-device document grounding, overlay rendering -- I would like to hear what you have tried and what worked. The failure modes are not well documented anywhere, and the community benefits from sharing them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Niv Dvir is a software developer who builds tools at the intersection of computer vision and UI automation. You can find him on &lt;a href="https://github.com/NivDvir" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>swift</category>
      <category>macos</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built a Cochlear Spiral Spectrogram That Visualizes Music Like the Inner Ear</title>
      <dc:creator>Niv Dvir</dc:creator>
      <pubDate>Sun, 22 Mar 2026 12:23:13 +0000</pubDate>
      <link>https://dev.to/nivdvir/how-i-built-a-cochlear-spiral-spectrogram-that-visualizes-music-like-the-inner-ear-3k49</link>
      <guid>https://dev.to/nivdvir/how-i-built-a-cochlear-spiral-spectrogram-that-visualizes-music-like-the-inner-ear-3k49</guid>
      <description>&lt;p&gt;What if you could see music the way your inner ear hears it?&lt;/p&gt;

&lt;p&gt;I built a visualization system that maps audio frequencies onto a &lt;strong&gt;Fermat spiral&lt;/strong&gt; — the same geometric curve that describes how the human cochlea arranges its frequency-sensitive hair cells. The result reveals the hidden geometry of harmony: you can literally &lt;em&gt;see&lt;/em&gt; the difference between a major and minor chord.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/66RiYBl7aQY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Traditional spectrograms show frequency vs. time as a rectangular heatmap. They're useful but clinical — they don't capture the &lt;em&gt;feeling&lt;/em&gt; of music.&lt;/p&gt;

&lt;p&gt;The cochlea (your inner ear) isn't rectangular. It's a spiral. High frequencies resonate at the base of the coil, low frequencies at the apex — logarithmically spaced, just like musical octaves.&lt;/p&gt;

&lt;p&gt;So I asked: &lt;strong&gt;what if we visualize frequencies on an actual spiral?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Audio Analysis (scipy FFT)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;381 logarithmically-spaced frequency bins (20 Hz — 8 kHz)&lt;/li&gt;
&lt;li&gt;ISO 226 equal-loudness contours for perceptual accuracy&lt;/li&gt;
&lt;li&gt;60 FPS frame-by-frame analysis
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified core: FFT → cochlear frequency mapping
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.fft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rfftfreq&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;44100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;381&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spectrum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rfftfreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Logarithmic bins: 20 Hz to 8 kHz (cochlear range)
&lt;/span&gt;    &lt;span class="n"&gt;bin_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;amplitudes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;amplitudes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spectrum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;amplitudes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Spiral Mapping (Fermat Spiral)
&lt;/h3&gt;

&lt;p&gt;Each frequency bin gets a position on a Fermat spiral: &lt;strong&gt;r = sqrt(θ)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Low frequencies sit at the outer edge (taking the role of the cochlea's low-frequency apex), high frequencies spiral inward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Map frequency bins to spiral coordinates
&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Chromesthesia Color Mapping
&lt;/h3&gt;

&lt;p&gt;Colors follow a &lt;strong&gt;chromesthesia&lt;/strong&gt; mapping — the neurological phenomenon where people "see" sounds as colors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low frequencies (bass) → warm reds/oranges&lt;/li&gt;
&lt;li&gt;Mid frequencies (voice, guitar) → greens/yellows&lt;/li&gt;
&lt;li&gt;High frequencies (cymbals, harmonics) → cool blues/cyans&lt;/li&gt;
&lt;/ul&gt;
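
&lt;p&gt;One simple way to realize such a mapping -- a sketch, not the project's exact palette -- is to sweep hue from red to blue along the log-frequency axis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import colorsys
import numpy as np

def freq_to_rgb(freq_hz: float, f_min: float = 20.0, f_max: float = 8000.0):
    """Bass -&gt; warm reds/oranges, mids -&gt; greens/yellows, highs -&gt; blues/cyans.
    The linear hue sweep is an illustrative choice, not the tuned palette."""
    t = np.clip(np.log10(freq_hz / f_min) / np.log10(f_max / f_min), 0.0, 1.0)
    hue = 0.6 * t  # 0.0 = red ... 0.6 = blue on the HSV wheel
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

print(freq_to_rgb(80))    # bass: warm orange-red
print(freq_to_rgb(1000))  # mids: green
print(freq_to_rgb(6000))  # highs: cyan-blue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;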

&lt;h3&gt;
  
  
  4. Temporal Features (The Secret Sauce)
&lt;/h3&gt;

&lt;p&gt;Static spectrograms miss the &lt;em&gt;movement&lt;/em&gt; of music. I added 5 temporal features, each validated across &lt;strong&gt;1,704 audio samples&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Optimal parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Melodic trails&lt;/td&gt;
&lt;td&gt;Short glowing trails following melody&lt;/td&gt;
&lt;td&gt;10 frames, 0.70 decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rhythm pulses&lt;/td&gt;
&lt;td&gt;Radial pulse on beat hits&lt;/td&gt;
&lt;td&gt;0.50 intensity, 0.25 decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harmonic auras&lt;/td&gt;
&lt;td&gt;Sustained glow for held chords&lt;/td&gt;
&lt;td&gt;4.0s blend time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atmospheric context&lt;/td&gt;
&lt;td&gt;Background mood from 60s window&lt;/td&gt;
&lt;td&gt;0.35 influence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harmonic connections&lt;/td&gt;
&lt;td&gt;Lines between harmonically related notes&lt;/td&gt;
&lt;td&gt;Octave + fifth detection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
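
&lt;p&gt;As an illustration of the first row, the melodic-trail state can be thought of as one decaying intensity buffer per frequency bin. A sketch using the table's parameters (10 frames, 0.70 decay) -- not the project's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

TRAIL_DECAY = 0.70               # per-frame decay, from the table above
TRAIL_FLOOR = TRAIL_DECAY ** 10  # intensity after ~10 frames; cull below this

def update_trails(trail: np.ndarray, frame_amps: np.ndarray) -&gt; np.ndarray:
    """Decay existing trails, re-ignite bins that are loud this frame,
    and drop anything that has faded past the 10-frame horizon."""
    trail = trail * TRAIL_DECAY
    trail = np.maximum(trail, frame_amps / (frame_amps.max() + 1e-9))
    trail[trail &lt; TRAIL_FLOOR] = 0.0
    return trail

trail = np.zeros(381)  # one trail intensity per frequency bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;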

&lt;h2&gt;
  
  
  Why Harmony Looks Beautiful
&lt;/h2&gt;

&lt;p&gt;This is the magical part. When notes are &lt;strong&gt;harmonically related&lt;/strong&gt; (octaves, fifths, thirds), they land at &lt;strong&gt;symmetric positions&lt;/strong&gt; on the spiral. A major chord creates a visually balanced, symmetric pattern. Dissonance creates asymmetric, chaotic (but still beautiful) patterns.&lt;/p&gt;

&lt;p&gt;Different musical traditions create remarkably different visual signatures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classical harmony&lt;/strong&gt; → orderly radial symmetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic maqam&lt;/strong&gt; → quarter-tone asymmetry with unique geometric beauty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EDM/electronic&lt;/strong&gt; → explosive, pulsing energy patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/bhgEEtMXEJ0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It: The Wellspring
&lt;/h2&gt;

&lt;p&gt;I also built a crowdsourcing platform called &lt;a href="https://synesthesia-labeler.onrender.com" rel="noopener noreferrer"&gt;&lt;strong&gt;The Wellspring&lt;/strong&gt;&lt;/a&gt; where people can rate how well these visualizations capture the music. The goal: build an open dataset for AI-powered audio visualization evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audio analysis:&lt;/strong&gt; scipy (FFT), librosa&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rendering:&lt;/strong&gt; PIL (2D), PyVista (3D optional)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video encoding:&lt;/strong&gt; FFmpeg (H.264, CRF 18, 60 FPS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web platform:&lt;/strong&gt; React 18 + TypeScript, Node/Express, PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on browser-based creation tools so anyone can create their own audio-visual harmony — no installation needed. The vision: a global community of creators exploring the intersection of sound and moving image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ancient dance between rhythm and movement, renewed with modern tools.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Channel: &lt;a href="https://www.youtube.com/@NivDvir-ND" rel="noopener noreferrer"&gt;youtube.com/@NivDvir-ND&lt;/a&gt;&lt;br&gt;
The Wellspring: &lt;a href="https://synesthesia-labeler.onrender.com" rel="noopener noreferrer"&gt;synesthesia-labeler.onrender.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love to hear your thoughts — especially from anyone working on audio visualization, creative coding, or signal processing!&lt;/p&gt;

</description>
      <category>audio</category>
    </item>
  </channel>
</rss>
