<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Niv Dvir</title>
    <description>The latest articles on DEV Community by Niv Dvir (@nivdvir).</description>
    <link>https://dev.to/nivdvir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838336%2Fdb2e5a42-2c89-4772-88e1-ce7f6e4aa813.png</url>
      <title>DEV Community: Niv Dvir</title>
      <link>https://dev.to/nivdvir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nivdvir"/>
    <language>en</language>
    <item>
      <title>On-Device Document Grounding on macOS: Getting Qwen2.5-VL to Actually Work in Swift</title>
      <dc:creator>Niv Dvir</dc:creator>
      <pubDate>Sat, 18 Apr 2026 04:05:36 +0000</pubDate>
      <link>https://dev.to/nivdvir/building-a-real-time-screen-reader-on-macos-that-actually-works-471</link>
      <guid>https://dev.to/nivdvir/building-a-real-time-screen-reader-on-macos-that-actually-works-471</guid>
      <description>&lt;p&gt;&lt;em&gt;The models that failed, the bugs that took weeks, and the architecture that survived.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Trying to Do
&lt;/h2&gt;

&lt;p&gt;I wanted to read what was on my &lt;em&gt;own&lt;/em&gt; screen — a long Wikipedia article, an arXiv PDF, a release note — and render an overlay on top with content annotations (summary bullets, section anchors, corner-to-corner perspective lines). Everything local on Apple Silicon. No cloud, no audio capture, no hidden assistance — just "here's a document on screen, understand it on-device, draw something useful on top."&lt;/p&gt;

&lt;p&gt;On paper this is a well-defined pipeline: detect content regions in a screenshot, OCR them, accumulate text across scroll positions, render guide markers over the panels. In practice every step broke in a way I hadn't expected, and the working combination took months to find.&lt;/p&gt;




&lt;h2&gt;
  
  
  Everything That Failed
&lt;/h2&gt;

&lt;p&gt;This section might be the most useful part of this article. Each of these approaches cost days to weeks of effort. If you are building anything similar on macOS, you can skip all of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Florence-2
&lt;/h3&gt;

&lt;p&gt;Microsoft's Florence-2 was the first vision model I tried. It supports grounding tasks out of the box -- you give it an image and ask "where is the text panel?" and it returns bounding box coordinates. On paper, perfect for UI panel detection.&lt;/p&gt;

&lt;p&gt;In practice, Florence-2 cannot run on macOS with Apple Silicon. The model uses a custom architecture that requires &lt;code&gt;trust_remote_code=True&lt;/code&gt;, depends on flash-attention (a CUDA-only library), and cannot be converted to CoreML. There is no MLX port. I spent two days trying different conversion paths before accepting that this model simply does not exist on Apple's platform.&lt;/p&gt;

&lt;p&gt;If you are searching for a grounding-capable vision model on macOS, remove Florence-2 from your list immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ferret-UI
&lt;/h3&gt;

&lt;p&gt;Apple's own UI understanding model seemed like the obvious choice for an Apple Silicon project. Ferret-UI was specifically designed to understand user interfaces -- element detection, widget classification, spatial reasoning about UI layouts.&lt;/p&gt;

&lt;p&gt;It was a dead end. Ferret-UI requires CUDA flash-attention, which means it needs an NVIDIA GPU. Apple's own UI understanding model does not run on Apple's own hardware without significant porting effort. Beyond the runtime issue, the model's grounding output was not usable for my task -- I needed precise pixel-coordinate bounding boxes, and the model's output format did not map cleanly to that.&lt;/p&gt;

&lt;p&gt;The irony of Apple publishing a UI model that cannot run on macOS was not lost on me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Qwen2.5-VL-3B (the Small One)
&lt;/h3&gt;

&lt;p&gt;After the first two dead ends, I found that Qwen2.5-VL had an MLX port via the &lt;code&gt;mlx-vlm&lt;/code&gt; library. The 3B parameter variant (4-bit quantized) was only 2.9GB, loaded in 1.9 seconds, and ran inference in 3-7 seconds. Fast and light.&lt;/p&gt;

&lt;p&gt;But too weak. The 3B model could identify that UI elements existed in an image -- it would say "there is a text panel on the left" -- but the bounding box coordinates it returned were hallucinated. Boxes would be off by hundreds of pixels, overlap incorrectly, or enclose regions that contained nothing. For panel detection where you need to know "the question text lives between pixels (0, 120) and (900, 800)," a model that hallucinates coordinates is worse than no model at all.&lt;/p&gt;

&lt;p&gt;The 7B variant turned out to be the sweet spot. More on that in the next section.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kd9o21gut4mr1ekfv83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kd9o21gut4mr1ekfv83.png" alt="3B model hallucinated bounding boxes vs 7B accurate detection" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pixel-Edge Detection (No ML)
&lt;/h3&gt;

&lt;p&gt;Before committing to a VLM, I tried the traditional computer vision approach. Each UI panel has a uniform background color. The question panel might be &lt;code&gt;rgb(53, 67, 83)&lt;/code&gt;, the editor panel &lt;code&gt;rgb(22, 43, 54)&lt;/code&gt;. In theory, you can find panel boundaries by detecting where the background color changes.&lt;/p&gt;

&lt;p&gt;The algorithm worked on test screenshots. Then I tested it on a page where both panels used similar background colors. The panel border was a thin 1-pixel line that blended into the surrounding regions. Same-color-background UIs -- which are increasingly common with modern design trends -- broke the approach entirely.&lt;/p&gt;

&lt;p&gt;Pixel-edge detection is fragile because it depends on an assumption (panels have visually distinct backgrounds) that is not guaranteed. A VLM can detect panel boundaries semantically -- it understands "this is a question panel" regardless of what color it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accessibility API (AX API)
&lt;/h3&gt;

&lt;p&gt;macOS has a built-in accessibility API that lets you programmatically read UI elements. For a screen reader, this sounds ideal.&lt;/p&gt;

&lt;p&gt;The problem is that the Accessibility API cannot see inside web content rendered in Chrome. The browser exposes high-level structural elements -- the window, the tab bar, the content area -- but not individual text lines, panel layouts, or the DOM structure within the page. You get a single "web area" element that says "this is a web view" with no ability to drill into it.&lt;/p&gt;

&lt;p&gt;If your target is a native macOS application, the AX API might work. For reading web-based UIs through the browser, it is a dead end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spawning a New Python VLM Process Per Inference
&lt;/h3&gt;

&lt;p&gt;My initial integration spawned a new Python process for each VLM inference call. The Python script imported &lt;code&gt;mlx-vlm&lt;/code&gt;, loaded the Qwen2.5-VL-7B model (5.3GB of weights), ran inference on one image, printed the result, and exited. The next cycle, 15 seconds later, spawned a new process that loaded the 5.3GB model again.&lt;/p&gt;

&lt;p&gt;After three or four cycles, the Mac froze. Each process was loading the full model into unified memory, and the previous processes had not fully released their allocations before the next one started. OOM within minutes.&lt;/p&gt;

&lt;p&gt;The fix was a persistent server architecture: load once, serve many.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Worked -- The Architecture
&lt;/h2&gt;

&lt;p&gt;Here is the system that survived. Each component earned its place by being the last option standing after everything else failed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw469k0ytylzov2plorp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw469k0ytylzov2plorp3.png" alt="Architecture diagram — screen reader pipeline" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Panel Detection: Qwen2.5-VL-7B via MLX
&lt;/h3&gt;

&lt;p&gt;The 7B parameter Qwen2.5-VL model, 4-bit quantized, is the sweet spot for UI panel detection on Apple Silicon. The 3B model hallucinates bounding boxes. Larger models (14B+) are too slow for interactive use. The 7B variant reliably returns accurate panel coordinates when prompted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why MLX matters.&lt;/strong&gt; Apple Silicon's unified memory architecture means the CPU and GPU share the same physical RAM. MLX exploits this -- the model weights live in unified memory once and are accessed by both the CPU (for attention computations) and the GPU (for matrix multiplications) without copying. The 4-bit quantized model shows ~238MB resident memory in Activity Monitor, not the full weight file size, because MLX memory-maps the weights and pages them in on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt that works.&lt;/strong&gt; After testing dozens of prompt variations, this format reliably produces usable output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Detect the following UI panels in this screenshot and output their
bounding box coordinates in JSON format:
1. The "question" panel (problem description text area)
2. The "editor" panel (code editor area)

Return JSON with format: [{"label": "question", "bbox_2d": [x1,y1,x2,y2]},
{"label": "editor", "bbox_2d": [x1,y1,x2,y2]}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details: ask for each panel by name in a numbered list (the model sometimes merges panels into one bbox if you describe them in a single sentence), and specify the exact JSON format you want (the model follows format instructions well).&lt;/p&gt;
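
&lt;p&gt;&lt;strong&gt;Parsing the response.&lt;/strong&gt; The model usually returns the JSON as requested, but it sometimes wraps the array in a Markdown fence or surrounds it with prose. A defensive parser -- a sketch, not the exact code in the repo -- pulls out the first bracketed JSON span and validates each entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json, re

def parse_bboxes(raw: str):
    """Extract [{"label": ..., "bbox_2d": [x1, y1, x2, y2]}, ...] from model
    output, tolerating Markdown fences and surrounding prose. Assumption:
    the answer is the first [...] span in the text."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match is None:
        return []
    try:
        panels = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only well-formed entries with a 4-number bbox
    return [p for p in panels
            if isinstance(p.get("bbox_2d"), list) and len(p["bbox_2d"]) == 4]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;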

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4jmygeos0szcv6urzhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4jmygeos0szcv6urzhr.png" alt="Before and after: raw screenshot vs VLM panel detection" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The initial architecture: a persistent Python server.&lt;/strong&gt; The model takes ~12 seconds to load. Rather than paying that cost every cycle, I built a Python server process that loads the model at startup and accepts requests over a simple stdin/stdout protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Server: load model once, serve forever
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm.prompt_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;apply_chat_template&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mlx_vlm.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_config&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                  &lt;span class="n"&gt;num_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                      &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Swift host process spawns this server once, sends JSON requests on stdin, and reads JSON responses from stdout. No HTTP server, no sockets, no serialization framework -- just newline-delimited JSON over pipes.&lt;/p&gt;
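
&lt;p&gt;For debugging, the client side of that protocol can be exercised from any language. A hypothetical Python test client (assuming the server above is saved as &lt;code&gt;vlm_server.py&lt;/code&gt;; the file name is an assumption for this sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json, subprocess

# Hypothetical test client for the stdin/stdout server above.
server = subprocess.Popen(
    ["python3", "vlm_server.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

request = {"prompt": "Detect the following UI panels...",
           "image_path": "/tmp/screenshot.png"}
server.stdin.write(json.dumps(request) + "\n")   # one request per line
server.stdin.flush()

response = json.loads(server.stdout.readline())  # one response per line
print(response["result"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;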

&lt;p&gt;&lt;strong&gt;Coordinate conversion.&lt;/strong&gt; The model returns bounding boxes in the coordinate space of the resized image (max 1280px on the longest side, rounded to multiples of 28). To get screen pixels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;screen_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;resized_width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;retina_scale&lt;/span&gt;
&lt;span class="n"&gt;screen_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_height&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;resized_height&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;retina_scale&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a Retina display, &lt;code&gt;retina_scale&lt;/code&gt; is 2.0. Forgetting this division is a common source of bounding boxes that are exactly 2x too large.&lt;/p&gt;
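
&lt;p&gt;The resized dimensions can be computed without asking the model, following the rule described above (longest side capped at 1280px, sides snapped to multiples of 28). A simplified sketch -- the reference preprocessor also enforces total-pixel bounds, which this omits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def resized_dims(width: int, height: int, max_side: int = 1280, patch: int = 28):
    """Model-input dimensions per the rule above: scale so the longest side
    fits within max_side, then snap each side down to a multiple of patch.
    Simplification: the real preprocessor also applies total-pixel bounds."""
    scale = min(1.0, max_side / max(width, height))
    w = max(patch, math.floor(width * scale / patch) * patch)
    h = max(patch, math.floor(height * scale / patch) * patch)
    return w, h

# Example: a 2560x1600 Retina capture maps to a 1260x784 model space
resized_width, resized_height = resized_dims(2560, 1600)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;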

&lt;h3&gt;
  
  
  From Python Server to Native Swift
&lt;/h3&gt;

&lt;p&gt;The Python persistent server worked. But it had friction: a Python subprocess to manage, a PIL resize helper, stdin/stdout JSON marshaling, and ~50ms of overhead per inference just from process communication. For an interactive-latency pipeline I wanted everything native.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mlx-swift-lm&lt;/code&gt; library promised exactly this -- a Swift implementation of the MLX model runtime, including Qwen2.5-VL. Load the model in Swift, run inference in Swift, no Python anywhere. In theory, a single-binary solution.&lt;/p&gt;

&lt;p&gt;In practice, the Swift implementation had 10 bugs (the 10th surfaced after this article was published — see the postscript). Finding and fixing them took weeks. But the result was worth it: a fully native Swift binary that runs Qwen2.5-VL-7B with zero Python dependencies, producing output that matches the Python reference within 2 px on every edge of every panel.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 9 Bugs in mlx-swift-lm's Qwen2.5-VL (plus a 10th found after publication)
&lt;/h3&gt;

&lt;p&gt;This section documents those weeks. The bugs collectively made the model produce wrong bounding boxes. Fixing them was the difference between "the model hallucinates" and "the model matches the Python reference within 2 px on every edge of every panel — bit-exact 0 px on most test images."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MROPE section selection (split-select vs slice-replace).&lt;/strong&gt; Multimodal Rotary Position Embedding (MROPE) assigns different frequency bands to temporal, height, and width dimensions. The Swift implementation split the frequency tensor into three parts using modulo indexing (&lt;code&gt;i % 3&lt;/code&gt;), which interleaves the frequencies. Python's implementation starts with temporal frequencies and overwrites height/width slices in-place: &lt;code&gt;[T_0-15, H_16-39, W_40-63]&lt;/code&gt;. The layouts are completely different, and the wrong layout produces subtly wrong attention patterns.&lt;/p&gt;
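
&lt;p&gt;A toy illustration (not library code) of why the two layouts disagree, assuming sections of T=16, H=24, W=24 as implied by the layout above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Which dimension owns each of 64 frequency slots?
# Correct (slice-replace, Python reference): contiguous blocks.
slice_layout = ["T"] * 16 + ["H"] * 24 + ["W"] * 24

# Buggy (modulo split): T,H,W,T,H,W,... interleaved across all slots.
mod_layout = ["THW"[i % 3] for i in range(64)]

agree = sum(a == b for a, b in zip(slice_layout, mod_layout))
print(f"slots where the layouts agree: {agree}/64")  # 22/64 -- mostly wrong
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;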

&lt;p&gt;&lt;strong&gt;2. Chat template ordering.&lt;/strong&gt; The Swift message generator placed text before the image token in the content array. The Python implementation puts the image first: &lt;code&gt;&amp;lt;|vision_start|&amp;gt;&amp;lt;|image_pad|&amp;gt;&amp;lt;|vision_end|&amp;gt;PROMPT&lt;/code&gt;. This ordering matters because the model's attention patterns are position-dependent -- putting text before the image means the text tokens attend to positions where image features have not yet been injected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. invFreq registered as a Module weight.&lt;/strong&gt; The &lt;code&gt;invFreq&lt;/code&gt; tensor was declared as a property on an &lt;code&gt;Attention&lt;/code&gt; class that inherits from &lt;code&gt;Module&lt;/code&gt;. MLX's weight-loading mechanism scans all &lt;code&gt;Module&lt;/code&gt; properties and tries to load matching weights from the checkpoint. Since &lt;code&gt;invFreq&lt;/code&gt; is a computed constant (not a learned weight), the loader either threw &lt;code&gt;keyNotFound&lt;/code&gt; errors or silently overwrote it with garbage. The fix was wrapping it in a non-&lt;code&gt;Module&lt;/code&gt; class to hide it from reflection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. rope_deltas unused during autoregressive generation.&lt;/strong&gt; After the prefill pass, the code cleared the cached position IDs but never applied &lt;code&gt;rope_deltas&lt;/code&gt; during subsequent token generation. The correct computation is &lt;code&gt;positionIds = cache_offset + rope_deltas + arange(seqLen)&lt;/code&gt;. Without the deltas, position embeddings drifted with each generated token, degrading output quality progressively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Image resize using 1800px max instead of 1280px.&lt;/strong&gt; The Swift code resized input images to a maximum of 1800 pixels on the longest side, producing 2688 visual tokens. The Python reference implementation uses 1280px maximum, producing 1305 visual tokens. The model was trained on the 1280px resolution. Feeding it 1800px images meant the visual token positions were outside the model's training distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Prompt format for single bbox output.&lt;/strong&gt; Using a single sentence asking for both panels caused the model to sometimes return one combined bounding box. Switching to a numbered list with explicit labels ("1. question panel" / "2. editor panel") reliably produced two separate bboxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. maxTokens not set.&lt;/strong&gt; Without an explicit &lt;code&gt;max_tokens&lt;/code&gt; parameter, the model generated tokens until hitting an internal limit or running out of memory. For a task that should return ~100 tokens of JSON, this caused multi-second waits and occasionally produced thousands of tokens of hallucinated output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. MROPE state not reset between successive images.&lt;/strong&gt; The cached position IDs and rope deltas from one image persisted into the next inference call. When processing a new screenshot, the model's position embeddings started from where the previous image left off instead of resetting. This caused progressively worse results on the second, third, and subsequent images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Vision attention mask ignored -- the ROOT CAUSE.&lt;/strong&gt; This was the single bug most responsible for bounding box inaccuracy. The vision encoder's self-attention uses a mask to implement windowed attention (the model processes the image in patches, and each patch should only attend to patches within its window). The Swift code passed &lt;code&gt;mask: .none&lt;/code&gt; to the scaled dot-product attention call instead of &lt;code&gt;mask: .array(floatMask)&lt;/code&gt;. Without the mask, every patch attended to every other patch globally, destroying the spatial locality that the model relies on for precise coordinate prediction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// WRONG -- ignores the attention mask entirely&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;attnOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scaledDotProductAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;none&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// CORRECT -- applies the windowed attention mask&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;attnOutput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scaledDotProductAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;floatMask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After fixing all 9 bugs (and a 10th, surfaced after publication), the Swift implementation passes a strict ≤2 px parity gate against the Python &lt;code&gt;mlx-vlm&lt;/code&gt; reference on every edge of every panel across three canonical test images. Two of three are bit-exact (0 px); the third is 2 px on y2 of both panels. The gate (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/scripts/parity/setup_and_verify.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/parity/setup_and_verify.sh&lt;/code&gt;&lt;/a&gt;) fresh-clones upstream &lt;code&gt;mlx-swift-lm&lt;/code&gt;, applies the patch, builds, and aborts if any edge exceeds 2 px on any image. The model was not hallucinating -- the implementation was broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0wajlpuj710si8yu8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc0wajlpuj710si8yu8d.png" alt="Python vs Swift convergence — 0px delta after 9 bug fixes" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upstream status.&lt;/strong&gt; The 8 bugs that live inside &lt;code&gt;mlx-swift-lm&lt;/code&gt; (all of the above except #6 prompt format and #7 maxTokens, which belong in consumer code) are submitted upstream as the omnibus &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/222" rel="noopener noreferrer"&gt;&lt;strong&gt;#222&lt;/strong&gt;&lt;/a&gt; plus four isolated splits per upstream review feedback: &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/238" rel="noopener noreferrer"&gt;&lt;strong&gt;#238&lt;/strong&gt;&lt;/a&gt; (vision attention mask), &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/239" rel="noopener noreferrer"&gt;&lt;strong&gt;#239&lt;/strong&gt;&lt;/a&gt; (MROPE + rope_deltas + invFreq + state-reset), &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/242" rel="noopener noreferrer"&gt;&lt;strong&gt;#242&lt;/strong&gt;&lt;/a&gt; (chat-template image-first), and &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/243" rel="noopener noreferrer"&gt;&lt;strong&gt;#243&lt;/strong&gt;&lt;/a&gt; (preprocessing: PIL-Lanczos + 1280 cap). Each split links back to a verifiable strict ≤2 px gate that fresh-clones upstream and runs end-to-end.&lt;/p&gt;

&lt;p&gt;These patterns aren't specific to Qwen2.5-VL. The same &lt;code&gt;mask: .none&lt;/code&gt; attention bug appears in &lt;code&gt;Qwen2VL.swift&lt;/code&gt; and &lt;code&gt;GlmOcr.swift&lt;/code&gt;; the MROPE plumbing (invFreq, rope_deltas, section selection) is shared across all MROPE-based VLMs in the library (&lt;code&gt;Qwen2VL&lt;/code&gt;, &lt;code&gt;Qwen25VL&lt;/code&gt;, &lt;code&gt;Qwen3VL&lt;/code&gt;, &lt;code&gt;Qwen35&lt;/code&gt;, &lt;code&gt;GlmOcr&lt;/code&gt;). PR #222 is the flagship fix with model-by-model follow-ups coming as each is validated against the Python reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-architecture validation.&lt;/strong&gt; To sanity-check that the combined patch really does something model-independent, I ran the same A/B on &lt;code&gt;mlx-community/UI-TARS-1.5-7B-4bit&lt;/code&gt; — ByteDance's click-prediction model, which shares &lt;code&gt;Qwen2_5_VLForConditionalGeneration&lt;/code&gt; architecture, same hidden size, same special tokens. Same deterministic input image, same model, only the source of &lt;code&gt;Qwen25VL.swift&lt;/code&gt; differs.&lt;/p&gt;

&lt;p&gt;With the PR applied: 200 tokens generated, 9+ distinct coordinates tracking actual content positions. Without the PR: 52 tokens generated, output collapses to two entries at identical coordinates &lt;code&gt;(141, 141)&lt;/code&gt; — a clean signature of broken position encoding. Two independent Qwen2.5-VL-family models, same failure mode when the fixes aren't present. The &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/222#issuecomment-4283420555" rel="noopener noreferrer"&gt;A/B reproducer is attached to PR #222&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  OCR: Apple Vision Framework
&lt;/h3&gt;

&lt;p&gt;Apple's Vision framework provides on-device OCR that runs on the Neural Engine at ~300ms per frame.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recognition levels are confusing.&lt;/strong&gt; The API has two recognition levels: level 0 (&lt;code&gt;.accurate&lt;/code&gt;) and level 1 (&lt;code&gt;.fast&lt;/code&gt;). Intuitively, you might assume level 0 is the baseline (fast) and level 1 is the premium (accurate). It is the opposite: level 0 is the accurate path (slower, higher quality) and level 1 is the fast path (lower quality). I ran with level 1 for weeks thinking I was getting the best results, then discovered I was using the fast path the entire time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecognizeDocumentsRequest vs VNRecognizeTextRequest.&lt;/strong&gt; Apple's Vision framework has two OCR APIs, and they behave very differently on code content. &lt;code&gt;RecognizeDocumentsRequest&lt;/code&gt; (the newer, WWDC25 API) is optimized for documents -- prose, forms, receipts. It silently drops lines that look like code: indented lines with brackets, semicolons, and unusual formatting. For a code editor panel, it would capture 15 out of 20 visible lines, silently losing the rest.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;VNRecognizeTextRequest&lt;/code&gt; (the older API) captures everything -- every line, regardless of formatting. For reading code from screen, use &lt;code&gt;VNRecognizeTextRequest&lt;/code&gt;. I discovered this after weeks of mysterious "missing lines" that turned out to be the newer API being too clever about what constitutes document text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounded OCR.&lt;/strong&gt; Rather than scanning the entire screen (which picks up menu bars, dock icons, and other noise), the OCR is bounded to the panel regions detected by the VLM. This reduces both processing time and false positives -- you only extract text from the panel you care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scroll Accumulator
&lt;/h3&gt;

&lt;p&gt;Most non-trivial content does not fit in a single viewport. A problem description might be 40 lines long, but only 15 are visible at once. The scroll accumulator solves this by scrolling through the content in steps, OCR-ing each viewport, and stitching the results into a complete transcript.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stitching problem.&lt;/strong&gt; Adjacent viewports overlap. When you scroll down by 100 pixels, the bottom 80% of the previous viewport is still visible. Naive concatenation produces massive duplication. The accumulator uses Levenshtein distance to fuzzy-match each incoming OCR line against all accumulated lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold tuning.&lt;/strong&gt; A line is classified as "already seen" if its Levenshtein similarity to any accumulated line exceeds 60%. I tested thresholds from 40% to 80%:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40%: too permissive -- novel lines were classified as duplicates and dropped&lt;/li&gt;
&lt;li&gt;80%: too strict -- lines with minor OCR variations were classified as novel and added twice&lt;/li&gt;
&lt;li&gt;60%: best F1 score for the duplicate-vs-novel classification task&lt;/li&gt;
&lt;/ul&gt;
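
&lt;p&gt;A minimal sketch of the duplicate check (standard-library Python, not the production Swift accumulator) -- normalized Levenshtein similarity plus the 60% threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def levenshtein(a: str, b: str) -&gt; int:
    """Classic dynamic-programming edit distance."""
    if len(a) &lt; len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -&gt; float:
    return 1.0 - levenshtein(a, b) / (max(len(a), len(b)) or 1)

def accumulate(seen: list, incoming: list, threshold: float = 0.60) -&gt; list:
    """Append only lines that don't fuzzy-match anything already seen."""
    for line in incoming:
        if all(similarity(line, s) &lt; threshold for s in seen):
            seen.append(line)
    return seen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;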

&lt;h3&gt;
  
  
  Metal GPU Overlay Rendering
&lt;/h3&gt;

&lt;p&gt;The overlay renders detected text and annotations as a transparent window on top of the target application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;contentRect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;screenFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;styleMask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borderless&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;backing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;defer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;NSWindow&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="kt"&gt;Level&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;rawValue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isOpaque&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backgroundColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clear&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ignoresMouseEvents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hasShadow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Self-exclusion from screen capture.&lt;/strong&gt; This is critical: the overlay must not appear in its own screenshots. If it does, the next VLM inference cycle sees the overlay text, interprets it as UI content, and the system enters a feedback loop where it reads its own annotations. The fix is &lt;code&gt;captureScreenExcluding(windowID:)&lt;/code&gt;, which tells ScreenCaptureKit to exclude the overlay window from the captured frame.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Demo
&lt;/h3&gt;

&lt;p&gt;Here's the system running in reader mode — detecting the main content region, reading text across scroll positions, and rendering a summary overlay with perspective lines anchored to the source:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sy2b58i45otlk7w89we.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sy2b58i45otlk7w89we.gif" alt="GroundingKit reader mode — summary overlay on a long article with corner-anchor perspective lines" width="720" height="467"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VLM model loading&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;Unified memory (one-time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VLM panel detection&lt;/td&gt;
&lt;td&gt;~18s per inference&lt;/td&gt;
&lt;td&gt;GPU (MLX unified memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCR per frame&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;Neural Engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlay render&lt;/td&gt;
&lt;td&gt;&amp;lt;16ms (60fps)&lt;/td&gt;
&lt;td&gt;Metal GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full scroll accumulation&lt;/td&gt;
&lt;td&gt;~40s (20 steps)&lt;/td&gt;
&lt;td&gt;Combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model resident memory&lt;/td&gt;
&lt;td&gt;~5.5GB peak&lt;/td&gt;
&lt;td&gt;Unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The VLM inference is the bottleneck at ~18 seconds, but it only needs to run when the panel layout changes (e.g., navigating to a new page). During normal operation, the OCR and overlay run continuously at ~300ms per cycle while the VLM-detected panel bounds remain cached. On an M1 Pro with 16GB, the system runs comfortably alongside Chrome and other applications.&lt;/p&gt;




&lt;h3&gt;
  
  
  A note on "native Swift vs Python"
&lt;/h3&gt;

&lt;p&gt;Worth being honest about what "native Swift" does and doesn't buy you here. The VLM forward pass is the same Metal kernels in either language, so a one-shot &lt;code&gt;mlx-vlm&lt;/code&gt; Python benchmark on the same image finishes within a few percent of the Swift equivalent. Swift doesn't make the model run faster.&lt;/p&gt;

&lt;p&gt;What Swift changes is everything &lt;em&gt;around&lt;/em&gt; the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cold start ~3 s vs ~15 s&lt;/strong&gt; — no interpreter, no PyTorch import, just mmap the weights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-IPC pipeline&lt;/strong&gt; — &lt;code&gt;CGWindowListCreateImage&lt;/code&gt; → VLM → &lt;code&gt;VNRecognizeTextRequest&lt;/code&gt; → Metal overlay all run in one process with shared memory. A Python pipeline has to serialize each screenshot across the subprocess boundary, adding 30–100 ms per cycle on top of inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time frame work becomes viable&lt;/strong&gt; — a per-frame 16 ms budget has room for actual OCR and overlay redraw; it doesn't fit a round-trip to a Python worker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded memory over long sessions&lt;/strong&gt; — &lt;code&gt;autoreleasepool&lt;/code&gt; around CGImage ops keeps a 100-minute session at ~5.5 GB peak. The earlier Python-subprocess version leaked ~900 MB over the same duration through PyObjC bridging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The headline 18-second number is the same either way. The difference is whether you can wrap that around a responsive app — startable in 3 seconds, no IPC between stages, 60 fps overlay — rather than a command-line script.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Reproduce This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;macOS 14+ on Apple Silicon (M1/M2/M3/M4)&lt;/li&gt;
&lt;li&gt;Xcode 16+&lt;/li&gt;
&lt;li&gt;~16GB unified memory (8GB minimum, 16GB comfortable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mlx-community/Qwen2.5-VL-7B-Instruct-4bit&lt;/code&gt; from Hugging Face (~5.3GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key dependencies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mlx-swift-lm&lt;/code&gt; (Swift package, for native VLM inference)&lt;/li&gt;
&lt;li&gt;Apple Vision framework (built into macOS)&lt;/li&gt;
&lt;li&gt;Metal (built into macOS)&lt;/li&gt;
&lt;li&gt;ScreenCaptureKit (built into macOS)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;GroundingKit is the open-source macOS app extracted from this project. Clone it, build it, and try panel detection on your own screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit" rel="noopener noreferrer"&gt;github.com/NivDvir/screen-overlay-toolkit&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/NivDvir/screen-overlay-toolkit.git
&lt;span class="nb"&gt;cd &lt;/span&gt;screen-overlay-toolkit
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;mlx-vlm Pillow
swift build
.build/debug/GroundingKit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~1,500 lines of Swift + 120 lines of Python. Menu bar app, runs entirely local on Apple Silicon.&lt;/p&gt;







&lt;h2&gt;
  
  
  Postscript: 10th bug, post-publication (added 2026-04-25)
&lt;/h2&gt;

&lt;p&gt;After publication, while preparing PR splits per a maintainer's review request, forensic re-measurement found the Swift output drifting +9 px from the Python reference on a canonical LeetCode test image — despite this article's "0 px on all 8 edges" claim. Two things were going on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A 10th bug was hiding.&lt;/strong&gt; &lt;code&gt;Qwen2VLMessageGenerator&lt;/code&gt; in &lt;code&gt;Qwen2VL.swift&lt;/code&gt; ordered chat-template content as &lt;code&gt;[text, image]&lt;/code&gt;, but HuggingFace's template for Qwen2.5-VL emits &lt;code&gt;&amp;lt;|vision_start|&amp;gt;&amp;lt;|image_pad|&amp;gt;&amp;lt;|vision_end|&amp;gt;{text}&lt;/code&gt; — image content first. Swapping the order removes a deterministic +9 / +8 / +5 px bbox shift. Fix filed upstream as &lt;a href="https://github.com/ml-explore/mlx-swift-lm/pull/242" rel="noopener noreferrer"&gt;&lt;strong&gt;mlx-swift-lm #242&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The original parity claim was over-confident.&lt;/strong&gt; The "0 px on all 8 edges" prose in this article wasn't backed by an automated gate at the tolerance the prose claimed. The project's accuracy gate ran at 30 px tolerance, so the 9 px drift passed silently. The single-bug-fix narrative was right; the parity number was published ahead of the gate that would have enforced it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both have been addressed. The patch file (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/patches/mlx-swift-lm-mrope-fixes.patch" rel="noopener noreferrer"&gt;&lt;code&gt;patches/mlx-swift-lm-mrope-fixes.patch&lt;/code&gt;&lt;/a&gt;) now covers both &lt;code&gt;Qwen25VL.swift&lt;/code&gt; and &lt;code&gt;Qwen2VL.swift&lt;/code&gt;. The strict ≤2 px gate (&lt;a href="https://github.com/NivDvir/screen-overlay-toolkit/blob/main/scripts/parity/setup_and_verify.sh" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/parity/setup_and_verify.sh&lt;/code&gt;&lt;/a&gt;) runs against saved Python &lt;code&gt;mlx-vlm&lt;/code&gt; reference output and aborts if any edge of any panel exceeds 2 px on any image. With the fix applied, parity is bit-exact (0 px) on two of three canonical test images and within 2 px on the third (only y2 of both panels). The omnibus PR #222 now has four isolated companion PRs (#238, #239, #242, #243) per the maintainer's split-friendly review preference.&lt;/p&gt;

&lt;p&gt;The lesson: a parity gate must enforce the number you publish. A gate looser than the claim is not a gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Building this system produced more failure than success. Six major approaches failed before the working architecture emerged, and even the working approach required fixing 10 implementation bugs in a third-party library before it produced correct output (the 10th surfaced after publication). The total development time from "I want to read a panel from the screen" to "this reliably works" was measured in weeks, not days.&lt;/p&gt;

&lt;p&gt;The experience of building this on-device overlay led directly to a testing methodology I call CCSV (Cross-Channel Spatiotemporal Verification) -- the idea that you can verify a UI by reading it through two completely independent channels (DOM and pixels) and comparing what they see. That methodology is described in a companion article.&lt;/p&gt;

&lt;p&gt;If you are building something similar -- local VLMs on Apple Silicon, on-device document grounding, overlay rendering -- I would like to hear what you have tried and what worked. The failure modes are not well documented anywhere, and the community benefits from sharing them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Niv Dvir is a software developer who builds tools at the intersection of computer vision and UI automation. You can find him on &lt;a href="https://github.com/NivDvir" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>swift</category>
      <category>macos</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built a Cochlear Spiral Spectrogram That Visualizes Music Like the Inner Ear</title>
      <dc:creator>Niv Dvir</dc:creator>
      <pubDate>Sun, 22 Mar 2026 12:23:13 +0000</pubDate>
      <link>https://dev.to/nivdvir/how-i-built-a-cochlear-spiral-spectrogram-that-visualizes-music-like-the-inner-ear-3k49</link>
      <guid>https://dev.to/nivdvir/how-i-built-a-cochlear-spiral-spectrogram-that-visualizes-music-like-the-inner-ear-3k49</guid>
      <description>&lt;p&gt;What if you could see music the way your inner ear hears it?&lt;/p&gt;

&lt;p&gt;I built a visualization system that maps audio frequencies onto a &lt;strong&gt;Fermat spiral&lt;/strong&gt; — the same geometric curve that describes how the human cochlea arranges its frequency-sensitive hair cells. The result reveals the hidden geometry of harmony: you can literally &lt;em&gt;see&lt;/em&gt; the difference between a major and minor chord.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/66RiYBl7aQY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Traditional spectrograms show frequency vs. time as a rectangular heatmap. They're useful but clinical — they don't capture the &lt;em&gt;feeling&lt;/em&gt; of music.&lt;/p&gt;

&lt;p&gt;The cochlea (your inner ear) isn't rectangular. It's a spiral. High frequencies resonate at the base of the coil, low frequencies at the apex — logarithmically spaced, just like musical octaves.&lt;/p&gt;

&lt;p&gt;So I asked: &lt;strong&gt;what if we visualize frequencies on an actual spiral?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Audio Analysis (scipy FFT)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;381 logarithmically-spaced frequency bins (20 Hz — 8 kHz)&lt;/li&gt;
&lt;li&gt;ISO 226 equal-loudness contours for perceptual accuracy&lt;/li&gt;
&lt;li&gt;60 FPS frame-by-frame analysis
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified core: FFT → cochlear frequency mapping
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.fft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rfftfreq&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;44100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;381&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spectrum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rfft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rfftfreq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Logarithmic bins: 20 Hz to 8 kHz (cochlear range)
&lt;/span&gt;    &lt;span class="n"&gt;bin_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;logspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;amplitudes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bin_edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;amplitudes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spectrum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;amplitudes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Spiral Mapping (Fermat Spiral)
&lt;/h3&gt;

&lt;p&gt;Each frequency bin gets a position on a Fermat spiral: &lt;strong&gt;r = sqrt(θ)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Low frequencies sit at the outer edge (taking the role of the cochlea's low-frequency apex), high frequencies spiral inward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Map frequency bins to spiral coordinates
&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_bins&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Chromesthesia Color Mapping
&lt;/h3&gt;

&lt;p&gt;Colors follow a &lt;strong&gt;chromesthesia&lt;/strong&gt; mapping — the neurological phenomenon where people "see" sounds as colors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low frequencies (bass) → warm reds/oranges&lt;/li&gt;
&lt;li&gt;Mid frequencies (voice, guitar) → greens/yellows&lt;/li&gt;
&lt;li&gt;High frequencies (cymbals, harmonics) → cool blues/cyans&lt;/li&gt;
&lt;/ul&gt;
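
&lt;p&gt;One simple way to realize such a mapping -- a sketch, not the project's exact palette -- is to sweep hue from red to blue along the log-frequency axis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import colorsys
import numpy as np

def freq_to_rgb(freq_hz: float, f_min: float = 20.0, f_max: float = 8000.0):
    """Bass -&gt; warm reds/oranges, mids -&gt; greens/yellows, highs -&gt; blues/cyans.
    The linear hue sweep is an illustrative choice, not the tuned palette."""
    t = np.clip(np.log10(freq_hz / f_min) / np.log10(f_max / f_min), 0.0, 1.0)
    hue = 0.6 * t  # 0.0 = red ... 0.6 = blue on the HSV wheel
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

print(freq_to_rgb(80))    # bass: warm orange-red
print(freq_to_rgb(1000))  # mids: green
print(freq_to_rgb(6000))  # highs: cyan-blue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;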

&lt;h3&gt;
  
  
  4. Temporal Features (The Secret Sauce)
&lt;/h3&gt;

&lt;p&gt;Static spectrograms miss the &lt;em&gt;movement&lt;/em&gt; of music. I added 5 temporal features, each validated across &lt;strong&gt;1,704 audio samples&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Optimal parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Melodic trails&lt;/td&gt;
&lt;td&gt;Short glowing trails following melody&lt;/td&gt;
&lt;td&gt;10 frames, 0.70 decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rhythm pulses&lt;/td&gt;
&lt;td&gt;Radial pulse on beat hits&lt;/td&gt;
&lt;td&gt;0.50 intensity, 0.25 decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harmonic auras&lt;/td&gt;
&lt;td&gt;Sustained glow for held chords&lt;/td&gt;
&lt;td&gt;4.0s blend time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atmospheric context&lt;/td&gt;
&lt;td&gt;Background mood from 60s window&lt;/td&gt;
&lt;td&gt;0.35 influence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Harmonic connections&lt;/td&gt;
&lt;td&gt;Lines between harmonically related notes&lt;/td&gt;
&lt;td&gt;Octave + fifth detection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
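
&lt;p&gt;As an illustration of the first row, the melodic-trail state can be thought of as one decaying intensity buffer per frequency bin. A sketch using the table's parameters (10 frames, 0.70 decay) -- not the project's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

TRAIL_DECAY = 0.70               # per-frame decay, from the table above
TRAIL_FLOOR = TRAIL_DECAY ** 10  # intensity after ~10 frames; cull below this

def update_trails(trail: np.ndarray, frame_amps: np.ndarray) -&gt; np.ndarray:
    """Decay existing trails, re-ignite bins that are loud this frame,
    and drop anything that has faded past the 10-frame horizon."""
    trail = trail * TRAIL_DECAY
    trail = np.maximum(trail, frame_amps / (frame_amps.max() + 1e-9))
    trail[trail &lt; TRAIL_FLOOR] = 0.0
    return trail

trail = np.zeros(381)  # one trail intensity per frequency bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;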

&lt;h2&gt;
  
  
  Why Harmony Looks Beautiful
&lt;/h2&gt;

&lt;p&gt;This is the magical part. When notes are &lt;strong&gt;harmonically related&lt;/strong&gt; (octaves, fifths, thirds), they land at &lt;strong&gt;symmetric positions&lt;/strong&gt; on the spiral. A major chord creates a visually balanced, symmetric pattern. Dissonance creates asymmetric, chaotic (but still beautiful) patterns.&lt;/p&gt;

&lt;p&gt;Different musical traditions create remarkably different visual signatures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classical harmony&lt;/strong&gt; → orderly radial symmetry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arabic maqam&lt;/strong&gt; → quarter-tone asymmetry with unique geometric beauty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EDM/electronic&lt;/strong&gt; → explosive, pulsing energy patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/bhgEEtMXEJ0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It: The Wellspring
&lt;/h2&gt;

&lt;p&gt;I also built a crowdsourcing platform called &lt;a href="https://synesthesia-labeler.onrender.com" rel="noopener noreferrer"&gt;&lt;strong&gt;The Wellspring&lt;/strong&gt;&lt;/a&gt; where people can rate how well these visualizations capture the music. The goal: build an open dataset for AI-powered audio visualization evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audio analysis:&lt;/strong&gt; scipy (FFT), librosa&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rendering:&lt;/strong&gt; PIL (2D), PyVista (3D optional)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video encoding:&lt;/strong&gt; FFmpeg (H.264, CRF 18, 60 FPS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web platform:&lt;/strong&gt; React 18 + TypeScript, Node/Express, PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on browser-based creation tools so anyone can create their own audio-visual harmony — no installation needed. The vision: a global community of creators exploring the intersection of sound and moving image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ancient dance between rhythm and movement, renewed with modern tools.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Channel: &lt;a href="https://www.youtube.com/@NivDvir-ND" rel="noopener noreferrer"&gt;youtube.com/@NivDvir-ND&lt;/a&gt;&lt;br&gt;
The Wellspring: &lt;a href="https://synesthesia-labeler.onrender.com" rel="noopener noreferrer"&gt;synesthesia-labeler.onrender.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love to hear your thoughts — especially from anyone working on audio visualization, creative coding, or signal processing!&lt;/p&gt;

</description>
      <category>audio</category>
    </item>
  </channel>
</rss>
