<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hector Li</title>
    <description>The latest articles on DEV Community by Hector Li (@hector_lxm).</description>
    <link>https://dev.to/hector_lxm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3766888%2F726a9842-393e-4fc8-9886-a507888b0217.png</url>
      <title>DEV Community: Hector Li</title>
      <link>https://dev.to/hector_lxm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hector_lxm"/>
    <language>en</language>
    <item>
      <title>I Shipped a 5-Bug Fix to ONNX Runtime — By Telling an AI Agent "Still Wrong"</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Fri, 13 Feb 2026 06:54:28 +0000</pubDate>
      <link>https://dev.to/hector_lxm/i-shipped-a-5-bug-fix-to-onnx-runtime-by-telling-an-ai-agent-still-wrong-4gi4</link>
      <guid>https://dev.to/hector_lxm/i-shipped-a-5-bug-fix-to-onnx-runtime-by-telling-an-ai-agent-still-wrong-4gi4</guid>
      <description>&lt;p&gt;&lt;em&gt;I shipped a 5-file, production-quality &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;PR&lt;/a&gt; to ONNX Runtime in one session — and I wrote almost none of the code myself.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Know Your Goal (or the Problem)
&lt;/h2&gt;

&lt;p&gt;I had an ONNX model with a 2-bit quantized &lt;code&gt;MatMulNBits&lt;/code&gt; operator. It ran correctly on CPU. I wanted to run it in a web project using ONNX Runtime's WebGPU backend. I tried, and got this error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Error running model: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: .../matmul_nbits.cc:123 ... nbits != 2 was false. Currently, zero points are not supported for Q2 quantization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the error message, I knew that 2-bit &lt;code&gt;MatMulNBits&lt;/code&gt; was partially supported in WebGPU, but there was a feature gap — it didn't support models that include a &lt;code&gt;zero_points&lt;/code&gt; input.&lt;/p&gt;

&lt;p&gt;As a former ONNX Runtime developer, I knew something about low-bit quantization, T-MAC, and the 2-bit CPU implementation, but I had no experience with ONNX Runtime's WebGPU development. Next, let's see what an AI coding agent could do with this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ask the AI Agent to Do the Work
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open VS Code with the local ONNX Runtime repository.&lt;/li&gt;
&lt;li&gt;Copy the error message directly into the AI agent (GitHub Copilot with Claude Opus 4.6).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Round 1: Remove the Gate
&lt;/h3&gt;

&lt;p&gt;From the error message, the agent located the source file that threw the error and started investigating.&lt;/p&gt;

&lt;p&gt;The agent started reading the code and thinking it through.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iew29qkd4cqhr8rsm6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iew29qkd4cqhr8rsm6g.png" alt="The agent started to read the code and thinking" width="800" height="731"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the root cause and made the changes.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2nlbi57uvgfdl795f1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2nlbi57uvgfdl795f1q.png" alt="The agent found the root cause and made the changes" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent removed the restriction — an &lt;code&gt;ORT_ENFORCE(nbits != 2, ...)&lt;/code&gt; guard that explicitly blocked Q2 with zero points. I knew from experience that simply removing a guard wouldn't be enough to make the feature work correctly — the underlying shader logic still assumed 4-bit. But I asked the agent to build it anyway to establish a baseline. I ran it with my model. Of course, it produced wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Domain judgment — knowing the guard removal was necessary but insufficient, and choosing to proceed anyway to see what broke next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 2: Fix the Buffer Stride
&lt;/h3&gt;

&lt;p&gt;I copied the error to the agent, and it started investigating.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbkqm6c6rcasdifqqr3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbkqm6c6rcasdifqqr3b.png" alt="start" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the problem and made the changes.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd99u9z9lw4wbzkzpr6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd99u9z9lw4wbzkzpr6v.png" alt="problem" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent found that the zero-point buffer stride calculation used a Q4-only shortcut (&lt;code&gt;+1&lt;/code&gt;) that didn't generalize to Q2's 4-values-per-byte packing. It rewrote the formula with proper ceiling arithmetic.&lt;/p&gt;
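&lt;p&gt;As an illustration of the arithmetic involved (a minimal TypeScript sketch with made-up names, not the PR's actual code): the stride must round the block count up to a whole number of packed zero-point bytes, which only coincides with "+1" in the Q4 case.&lt;/p&gt;

```typescript
// Hypothetical sketch of the ceiling arithmetic; names are illustrative.
function zeroBlocksPerCol(nBlocksPerCol: number, nbits: number): number {
  const valuesPerByte = 8 / nbits; // Q4: 2 values per byte, Q2: 4 per byte
  // Round up to the next multiple of valuesPerByte (byte alignment).
  return Math.ceil(nBlocksPerCol / valuesPerByte) * valuesPerByte;
}

console.log(zeroBlocksPerCol(5, 4)); // 6 — the Q4 "+1" shortcut agrees here
console.log(zeroBlocksPerCol(6, 2)); // 8 — "+1" would wrongly give 7
```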

&lt;p&gt;I rebuilt and tested with my project. The result was still not correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Testing against ground truth in a browser environment the agent couldn't access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 3: Write Unit Tests as a Diagnostic Tool
&lt;/h3&gt;

&lt;p&gt;At this point, staring at shader generator code wasn't productive. I asked the agent to create unit tests — not just for coverage, but as a &lt;strong&gt;diagnostic strategy&lt;/strong&gt; to isolate which configurations were failing.&lt;/p&gt;

&lt;p&gt;I asked the agent to create some unit tests to see whether it could find more issues.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kssog6yddnap5ejpyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kssog6yddnap5ejpyc.png" alt="ut" width="800" height="147"&gt;&lt;/a&gt;&lt;br&gt;
It created the tests, found bugs, and fixed them.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i1xeuia2g9y73zsprfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i1xeuia2g9y73zsprfw.png" alt="ut_fix" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent wrote a &lt;code&gt;MatMul2BitsWebGpu&lt;/code&gt; test suite, found that 6 of 8 test cases failed, traced the failures to bit-shift and value-extraction ordering bugs in the TypeScript shader generator, and fixed them.&lt;/p&gt;

&lt;p&gt;I rebuilt and tested with my project. The result was still not correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Choosing the right diagnostic approach — unit tests revealed bugs that code reading alone couldn't surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 4: Feed It the Real Model
&lt;/h3&gt;

&lt;p&gt;The unit tests were passing, but my real model still gave wrong output. I provided the agent the actual 2-bit quantized transformer model I was using.&lt;/p&gt;

&lt;p&gt;I asked the agent to investigate with the real model.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfnowh2kgeo7e0spoz0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfnowh2kgeo7e0spoz0v.png" alt="investigate1" width="800" height="232"&gt;&lt;/a&gt;&lt;br&gt;
The agent walked through the code with the data and node attributes from the real model to address the issue. That was amazing!&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dozoofovc2tis4bppv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dozoofovc2tis4bppv1.png" alt="real_model" width="800" height="721"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the root cause and made the fix.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F865tnwqr4ikfnqhb2hkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F865tnwqr4ikfnqhb2hkl.png" alt="fix" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the most impressive round. The agent wrote Python scripts to simulate the shader's bit extraction logic step by step, using real data from my model. It discovered that the A-data (activation) pointer was being double-advanced across multi-pass loops — pass 1 was reading &lt;code&gt;A[16]&lt;/code&gt; instead of &lt;code&gt;A[8]&lt;/code&gt;, silently skipping 8 values. A one-line fix resolved it.&lt;/p&gt;
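&lt;p&gt;The offset bug is easy to reproduce outside the shader. A minimal TypeScript simulation (variable names are illustrative, and it assumes one value per component, i.e. &lt;code&gt;aComponents = 1&lt;/code&gt;):&lt;/p&gt;

```typescript
// Illustrative trace of the pointer bug: pass 0 advances the offset while
// reading 8 activation values, so computing pass 1's start as
// "offset + pass * 8" double-counts the advance.
function simulateReads(buggy: boolean): number[] {
  const starts: number[] = [];
  let inputOffset = 0;
  for (let pass = 0; pass < 2; pass++) {
    const start = buggy ? inputOffset + pass * 8 : inputOffset;
    starts.push(start);
    let offset = start;
    for (let i = 0; i < 8; i++) offset++; // read A[offset], advancing it
    inputOffset = offset; // the loop has already moved the pointer forward
  }
  return starts;
}

console.log(simulateReads(true));  // -> [0, 16]: pass 1 skips A[8..15]
console.log(simulateReads(false)); // -> [0, 8]:  pass 1 resumes at A[8]
```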

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Providing the real model — something the agent couldn't obtain on its own. This was the input that unlocked the final bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 5: Fill the Test Gaps
&lt;/h3&gt;

&lt;p&gt;The result was correct with my test project. I asked the agent to add more test cases to cover all the changes.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgkx3skzkj2xk8iopy88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgkx3skzkj2xk8iopy88.png" alt="result" width="800" height="135"&gt;&lt;/a&gt;&lt;br&gt;
The agent said the existing tests already had good coverage but were missing cases that matched the configuration of my real model.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98lslay0i33b0io4s8gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98lslay0i33b0io4s8gm.png" alt="final" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result was finally correct! I asked the agent to update test coverage. It identified that the existing tests didn't include &lt;code&gt;block_size=64&lt;/code&gt; (the configuration my real model used, which exercises zero-point padding edge cases) and added three new test cases. All 9 tests passed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Validating the final result against the real model and asking for coverage of the actual production configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Five bugs across five files, each hidden behind the last:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q2+ZP blocked&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;matmul_nbits.cc&lt;/code&gt;, &lt;code&gt;matmul_nbits.h&lt;/code&gt;, WGSL template&lt;/td&gt;
&lt;td&gt;Hard-coded guards rejecting Q2 with zero points; missing bit mask&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buffer stride&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmul_nbits.cc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero-point stride used Q4-only &lt;code&gt;+1&lt;/code&gt; rounding instead of proper ceiling formula&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bit shift&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-pass shift &lt;code&gt;pass * 8&lt;/code&gt; crossed byte boundaries; should be &lt;code&gt;pass * bits * 4&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value ordering&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unpack4xU8&lt;/code&gt; extracts same bit position from all 4 bytes — wrong order for Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A-data offset&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pass 1 double-advanced the activation pointer, skipping 8 values&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The PR
&lt;/h2&gt;

&lt;p&gt;All work done! Time to push the changes to GitHub and create a PR: &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;Improve WebGPU MatMulNBits to support zero pointer for 2bits&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that the PR didn't receive any review comments directly related to the code changes — only a future improvement request. The agent's code was production-quality on the first submission.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus: Ask the Agent to Write the Blog
&lt;/h2&gt;

&lt;p&gt;I asked the agent to create a blog post from what we had done.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhjxvkuqx2dj4l1nbzkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhjxvkuqx2dj4l1nbzkc.png" alt="blog1" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First attempt — a technical summary of the bugs and fixes:&lt;br&gt;
&lt;a href="https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj"&gt;Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's useful, but what I wanted was a blog showing how I paired with the AI agent. So I asked again:&lt;/p&gt;

&lt;p&gt;I asked the agent to write another blog post.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhv52r4vrrrt1e9b4fp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhv52r4vrrrt1e9b4fp5.png" alt="blog2" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7"&gt;Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reading that second blog, you'll notice it emphasizes "what the agent did well", "tireless code reading", "the agent is most valuable on...". And you might wonder: what exactly did the &lt;em&gt;developer&lt;/em&gt; do? Just keep saying "result is not correct!" and "why don't the tests cover all cases?" 😄&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Did
&lt;/h2&gt;

&lt;p&gt;But that framing misses the point. Here's what the developer contributed that the agent couldn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Defined the problem&lt;/strong&gt; — provided the error message, the model, and the expected behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Made strategic choices&lt;/strong&gt; — when to build, when to switch to unit tests, when to provide the real model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Held ground truth&lt;/strong&gt; — tested in a real browser environment the agent had no access to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applied domain judgment&lt;/strong&gt; — knew the guard removal was insufficient, knew which model configurations mattered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The developer's job wasn't to write code — it was to define the problem, validate the result, and make judgment calls about what to try next. That turned out to be enough.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:14:25 +0000</pubDate>
      <link>https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj</link>
      <guid>https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj</guid>
      <description>&lt;p&gt;&lt;em&gt;A story of five bugs, bit-level debugging, and running transformer models at 2-bit precision in the browser. Here's the &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;PR&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;ONNX Runtime's &lt;code&gt;MatMulNBits&lt;/code&gt; operator supports low-bit quantized matrix multiplication — packing weight values into 2, 4, or 8 bits per element. The WebGPU execution provider (both the native C++ path and the JavaScript/JSEP path) already supported 4-bit (Q4) quantization, but 2-bit (Q2) was blocked or broken. Our goal: make Q2 with zero points work correctly end-to-end so that 2-bit quantized transformer models run accurately in the browser via WebGPU.&lt;/p&gt;
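&lt;p&gt;For concreteness, here is a minimal sketch of this style of bit-packing (LSB-first within each byte; the operator's exact on-disk layout is defined by the contrib-op spec, so treat this as illustrative):&lt;/p&gt;

```typescript
// Pack quantized values at nbits each, 8/nbits per byte, LSB-first.
// Assumes values.length is a multiple of 8/nbits.
function packValues(values: number[], nbits: number): number[] {
  const perByte = 8 / nbits;
  const bytes: number[] = [];
  for (let i = 0; i < values.length; i += perByte) {
    let b = 0;
    for (let j = 0; j < perByte; j++) {
      b |= (values[i + j] & ((1 << nbits) - 1)) << (j * nbits);
    }
    bytes.push(b);
  }
  return bytes;
}

// Q2: four 2-bit values [1, 2, 3, 0] -> one byte 0b00_11_10_01 = 57
console.log(packValues([1, 2, 3, 0], 2)); // [57]
// Q4: two 4-bit values [5, 10] -> one byte 0xA5 = 165
console.log(packValues([5, 10], 4)); // [165]
```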

&lt;p&gt;What seemed like a single feature gap turned out to be &lt;strong&gt;five distinct bugs&lt;/strong&gt;, each hidden behind the last.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 1: The Gate — Hard-Coded Rejection of Q2 + Zero Points
&lt;/h2&gt;

&lt;p&gt;The first issue was immediate: attempting to run a 2-bit model with zero points threw a runtime error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Currently, zero points are not supported for Q2 quantization."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two enforcement guards explicitly blocked Q2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native WebGPU EP&lt;/strong&gt; (matmul_nbits.cc): An &lt;code&gt;ORT_ENFORCE(nbits != 2)&lt;/code&gt; when zero points were present.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSEP C++ kernel&lt;/strong&gt; (matmul_nbits.h): &lt;code&gt;ORT_ENFORCE(nbits_ == 4)&lt;/code&gt; — only Q4 was allowed at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, the WGSL zero-point extraction template (matmul_nbits_zero_pt.wgsl.template) had &lt;code&gt;#elif n_bits == 2&lt;/code&gt; but was missing the &lt;code&gt;bit_mask&lt;/code&gt; constant, so even if the guard were removed, the shader would malfunction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Remove the enforcement blocks, add &lt;code&gt;const bit_mask = 0x3u;&lt;/code&gt; for Q2, guard the DP4A path (which uses a hardcoded LUT assuming &lt;code&gt;zero_point=2&lt;/code&gt;) to skip Q2 with custom zero points.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 2: Zero Point Buffer Stride Miscalculation
&lt;/h2&gt;

&lt;p&gt;With the gates removed, tests ran — but produced wrong results. The root cause was in how &lt;code&gt;zero_blocks_per_col&lt;/code&gt; was computed.&lt;/p&gt;

&lt;p&gt;Zero points are packed into bytes: for Q4, two values per byte; for Q2, &lt;strong&gt;four values per byte&lt;/strong&gt;. Each column's zero points are byte-aligned, so the shader uses a flat linear stride to skip between columns. The original formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zero_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nbits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "+1" was a Q4 shortcut. For Q2 with &lt;code&gt;n_blocks_per_col = 6&lt;/code&gt; (e.g., K=384, block_size=64), the stride needs to round up to the next multiple of 4 (values per byte), not just add 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Proper ceiling-to-multiple formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nbits&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zero_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
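&lt;p&gt;Translating both formulas into a quick TypeScript check (illustrative only, with positive integer inputs) makes the divergence visible:&lt;/p&gt;

```typescript
// The original Q4 shortcut: bump to the next value only when misaligned.
function oldStride(nBlocksPerCol: number, nbits: number): number {
  return nBlocksPerCol % (8 / nbits) === 0 ? nBlocksPerCol : nBlocksPerCol + 1;
}
// The fixed version: ceiling to the next multiple of values-per-byte.
function newStride(nBlocksPerCol: number, nbits: number): number {
  const zpElementsPerByte = 8 / nbits;
  return Math.ceil(nBlocksPerCol / zpElementsPerByte) * zpElementsPerByte;
}

// Q4: the shortcut happens to be correct (5 % 2 != 0 -> 5 + 1 = 6).
console.log(oldStride(5, 4), newStride(5, 4)); // 6 6
// Q2 with n_blocks_per_col = 6: "+1" gives 7, but the byte-aligned
// stride must round up to the next multiple of 4.
console.log(oldStride(6, 2), newStride(6, 2)); // 7 8
```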






&lt;h2&gt;
  
  
  Bug 3: Shift Formula Crosses Byte Boundaries
&lt;/h2&gt;

&lt;p&gt;Now the native EP worked, but the JSEP path (the browser-facing JavaScript shaders in matmulnbits.ts) still produced garbage.&lt;/p&gt;

&lt;p&gt;For Q4, each &lt;code&gt;u32&lt;/code&gt; word holds 8 values — processed in a single pass. For Q2, each word holds &lt;strong&gt;16 values&lt;/strong&gt;, requiring 2 passes of 8. The original shift used &lt;code&gt;pass * 8&lt;/code&gt;, meaning pass 1 shifted by 8 bits — crossing from one byte into the next, mixing values from different bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;lowerShift = pass * bits * 4&lt;/code&gt; — for Q2 this gives shifts of 0 and 4, staying within each byte's boundaries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 4: Value Extraction Ordering — The Nibble-Spread
&lt;/h2&gt;

&lt;p&gt;After the shift fix, output changed but was still wrong. Deeper analysis revealed a fundamental ordering problem.&lt;/p&gt;

&lt;p&gt;The Q4 extraction pattern &lt;code&gt;unpack4xU8(b_value &amp;amp; 0x0F0F0F0F)&lt;/code&gt; works because it extracts the &lt;strong&gt;same bit position from all 4 bytes simultaneously&lt;/strong&gt; — and for Q4, that gives 4 sequential values (one per byte). But for Q2, the same technique extracts bit position 0-1 from bytes 0, 1, 2, and 3 — producing values v0, v4, v8, v12 instead of v0, v1, v2, v3. The A-data is sequential, so &lt;code&gt;a[2] * b[8]&lt;/code&gt; is computed instead of &lt;code&gt;a[2] * b[2]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A "nibble-spread" technique that reorganizes bytes before extraction. Each pass takes 2 bytes (8 sequential values), spreads each nibble (4 bits = two Q2 values) into its own byte of a synthetic &lt;code&gt;u32&lt;/code&gt;, then applies the standard &lt;code&gt;unpack4xU8&lt;/code&gt; + mask pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let half_word = b_value &amp;gt;&amp;gt; (pass * 16u);
let byte_lo = half_word &amp;amp; 0xFFu;
let byte_hi = (half_word &amp;gt;&amp;gt; 8u) &amp;amp; 0xFFu;
let spread_word = (byte_lo &amp;amp; 0xFu)
    | ((byte_lo &amp;gt;&amp;gt; 4u) &amp;lt;&amp;lt; 8u)
    | ((byte_hi &amp;amp; 0xFu) &amp;lt;&amp;lt; 16u)
    | ((byte_hi &amp;gt;&amp;gt; 4u) &amp;lt;&amp;lt; 24u);
b_value_lower = unpack4xU8(spread_word &amp;amp; 0x03030303u);
b_value_upper = unpack4xU8((spread_word &amp;gt;&amp;gt; 2u) &amp;amp; 0x03030303u);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was applied to both the general shader path and the BlockSize32 optimized path.&lt;/p&gt;
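&lt;p&gt;To see why this recovers sequential order, the WGSL above can be re-implemented in TypeScript and run against a known packing. This simulation mirrors the snippet; &lt;code&gt;unpack4xU8&lt;/code&gt; is modeled as low-byte-first, and the interleave of lower/upper values back into v0..v7 (which the surrounding shader performs) is made explicit:&lt;/p&gt;

```typescript
// Model of the WGSL unpack4xU8 built-in: split a u32 into 4 bytes, low first.
function unpack4xU8(w: number): number[] {
  return [w & 0xff, (w >>> 8) & 0xff, (w >>> 16) & 0xff, (w >>> 24) & 0xff];
}

// Extract 8 sequential Q2 values from one pass over a packed u32.
function extractQ2Pass(bValue: number, pass: number): number[] {
  const halfWord = (bValue >>> (pass * 16)) & 0xffff; // 2 bytes = 8 values
  const byteLo = halfWord & 0xff;
  const byteHi = (halfWord >>> 8) & 0xff;
  // Nibble-spread: each nibble (two Q2 values) gets its own byte.
  const spread =
    (byteLo & 0xf) |
    ((byteLo >>> 4) << 8) |
    ((byteHi & 0xf) << 16) |
    ((byteHi >>> 4) << 24);
  const lower = unpack4xU8(spread & 0x03030303);         // v0 v2 v4 v6
  const upper = unpack4xU8((spread >>> 2) & 0x03030303); // v1 v3 v5 v7
  // Interleave back into sequential order v0..v7.
  return lower.flatMap((v, i) => [v, upper[i]]);
}

// Pack v0..v15 = 0,1,2,3,0,1,2,3,... at 2 bits each, LSB-first, and check.
let packed = 0;
for (let i = 0; i < 16; i++) packed |= (i % 4) << (2 * i);
packed = packed >>> 0; // treat as u32

console.log(extractQ2Pass(packed, 0)); // -> [0, 1, 2, 3, 0, 1, 2, 3]
console.log(extractQ2Pass(packed, 1)); // -> [0, 1, 2, 3, 0, 1, 2, 3]
```

&lt;p&gt;Without the spread, masking &lt;code&gt;0x03030303&lt;/code&gt; directly against the packed word would have produced v0, v4, v8, v12 — exactly the misordering described above.&lt;/p&gt;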




&lt;h2&gt;
  
  
  Bug 5: A-Data Double-Advancement
&lt;/h2&gt;

&lt;p&gt;After the nibble-spread fix, the result changed again — closer, but still incorrect. A Python trace script finally pinpointed the last bug: the A-data offset for pass 1 was wrong.&lt;/p&gt;

&lt;p&gt;In the multi-pass loop, pass 0 reads A values via a loop that increments &lt;code&gt;input_offset&lt;/code&gt; 8 times. Pass 1 then computed its starting offset as &lt;code&gt;input_offset + 8/aComponents&lt;/code&gt; — but &lt;code&gt;input_offset&lt;/code&gt; had &lt;strong&gt;already been advanced&lt;/strong&gt; by pass 0's loop. This double-counted the offset, causing pass 1 to read A[16] instead of A[8], skipping 8 activation values entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Pass 1 simply uses &lt;code&gt;input_offset&lt;/code&gt; directly — it already points to exactly where pass 0 left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before (bug): input_offset + ${(pass * 8) / aComponents}&lt;/span&gt;
&lt;span class="c1"&gt;// After (fix):  input_offset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this fix, the 2-bit quantized model produced &lt;strong&gt;correct results&lt;/strong&gt; on WebGPU, matching CPU output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parameterizing the Shader for Variable Bit Widths
&lt;/h2&gt;

&lt;p&gt;Beyond the bug fixes, the JSEP shader needed systematic parameterization. Hard-coded Q4 assumptions were replaced with &lt;code&gt;attributes.bits&lt;/code&gt;-driven constants throughout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Q4&lt;/th&gt;
&lt;th&gt;Q2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Values per u32 word&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passes per word&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bit mask&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x0F0F0F0Fu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x03030303u&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default zero point&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZP values per byte&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZP byte mask&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0xFu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x3u&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;word_offset increment&lt;/td&gt;
&lt;td&gt;8/aComponents&lt;/td&gt;
&lt;td&gt;16/aComponents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
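&lt;p&gt;All of these constants fall out of the bit width. A sketch of the derivation (illustrative names, not the actual &lt;code&gt;matmulnbits.ts&lt;/code&gt; code):&lt;/p&gt;

```typescript
// Derive the per-bit-width shader constants from `bits` alone.
function shaderParams(bits: number) {
  const valuesPerWord = 32 / bits;
  return {
    valuesPerWord,
    passesPerWord: valuesPerWord / 8,                 // 8 values handled per pass
    bitMask: (((1 << bits) - 1) * 0x01010101) >>> 0,  // same mask in every byte
    defaultZeroPoint: 1 << (bits - 1),                // midpoint of the range
    zpValuesPerByte: 8 / bits,
    zpByteMask: (1 << bits) - 1,
  };
}

console.log(shaderParams(4).bitMask.toString(16)); // "f0f0f0f"
console.log(shaderParams(2).bitMask.toString(16)); // "3030303"
```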




&lt;h2&gt;
  
  
  Test Coverage
&lt;/h2&gt;

&lt;p&gt;We added a &lt;code&gt;MatMul2BitsWebGpu&lt;/code&gt; test suite to exercise the Q2 path on the WebGPU EP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Symmetric &amp;amp; asymmetric&lt;/strong&gt; (with/without zero points)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple block sizes&lt;/strong&gt; (16, 32, 64, 128) — block_size=64 is the critical case where &lt;code&gt;n_blocks_per_col&lt;/code&gt; is not a multiple of 4, exercising the zero-point padding logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Varying dimensions&lt;/strong&gt; (K=16 to 1024, N=1 to 384) — covering single-word and multi-word extraction patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch tests&lt;/strong&gt; (M=1, 4, 100)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All 9 test configurations pass on WebGPU EP, with results matching CPU baseline within tolerance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits.cc&lt;/td&gt;
&lt;td&gt;Remove Q2+ZP block, fix &lt;code&gt;zero_blocks_per_col&lt;/code&gt;, guard DP4A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits_zero_pt.wgsl.template&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;bit_mask = 0x3u&lt;/code&gt; for Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits.h&lt;/td&gt;
&lt;td&gt;Allow &lt;code&gt;nbits == 2&lt;/code&gt; in JSEP kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmulnbits.ts&lt;/td&gt;
&lt;td&gt;Parameterize for Q2, shift fix, nibble-spread, A-offset fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_2bits_test.cc&lt;/td&gt;
&lt;td&gt;WebGPU-specific Q2 test suite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One feature, five bugs&lt;/strong&gt; — each fix revealed the next layer of incorrectness. Without tests that compared against a CPU baseline, any single fix would have appeared to "do something" while still being wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bit-packing extraction is subtle&lt;/strong&gt; — the Q4 pattern of "mask the same bits from all 4 bytes" only works because Q4 has exactly one value per nibble per byte. Q2 breaks that assumption fundamentally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace scripts are essential&lt;/strong&gt; — Python scripts that simulate shader logic step-by-step (nibble-spread verification, A-offset tracking) were what ultimately identified bugs 4 and 5 after code-reading alone proved insufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parameterize, don't fork&lt;/strong&gt; — rather than creating a separate Q2 shader, making the existing shader bit-width-aware keeps the code maintainable and makes future N-bit support (Q3, Q8) straightforward.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>onnxruntime</category>
      <category>webgpu</category>
      <category>2bit</category>
      <category>quantization</category>
    </item>
    <item>
      <title>Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:10:32 +0000</pubDate>
      <link>https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7</link>
      <guid>https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7</guid>
      <description>&lt;p&gt;&lt;em&gt;How a developer paired with an AI agent to find and fix five layered bugs in ONNX Runtime's GPU shader pipeline — without being an expert in WGSL or bit-packing. Here's the &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;OnnxRuntime PR (merged)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A developer needed to enable 2-bit (Q2) quantized model inference on ONNX Runtime's WebGPU backend. The 4-bit path worked, but 2-bit with zero points crashed immediately. The codebase involved C++ GPU kernels, WGSL shader templates, TypeScript shader generators, Emscripten WASM builds, and multiple build systems: a deep stack where any single layer could silently produce wrong numbers.&lt;/p&gt;

&lt;p&gt;Rather than spending days manually tracing shader bit logic, the developer partnered with an AI coding agent (GitHub Copilot in VS Code) to systematically find and fix every issue.&lt;/p&gt;

&lt;p&gt;Here's how that collaboration actually worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: "Why does it crash?" — The Agent Reads the Error
&lt;/h2&gt;

&lt;p&gt;The developer shared the error message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Currently, zero points are not supported for Q2 quantization"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent searched the codebase, found the &lt;code&gt;ORT_ENFORCE&lt;/code&gt; guard in matmul_nbits.cc and the &lt;code&gt;nbits_ == 4&lt;/code&gt; check in matmul_nbits.h, and identified a missing &lt;code&gt;bit_mask&lt;/code&gt; constant in the WGSL template. Instead of just pointing these out, the agent &lt;strong&gt;directly applied all three fixes&lt;/strong&gt; — removing the guards, adding the mask, and guarding the DP4A codepath that couldn't handle Q2 zero points — across three files in a single edit operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Cross-file root cause analysis from a single error message. The developer didn't need to know which files to look at.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: "Tests pass but output is wrong" — The Agent Spots a Math Bug
&lt;/h2&gt;

&lt;p&gt;With the crash fixed, the developer built and ran tests. Six of eight failed with wrong numerical output. The developer asked the agent to investigate.&lt;/p&gt;

&lt;p&gt;The agent read the zero-point buffer stride calculation and identified that the formula &lt;code&gt;n_blocks_per_col + 1&lt;/code&gt; was a Q4-only shortcut. For Q2, where four values pack per byte, the stride must round up to the nearest multiple of 4. The agent wrote the corrected ceiling formula and applied it.&lt;/p&gt;
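&lt;p&gt;In toy Python (hypothetical names, not the ORT source), the corrected idea looks like this:&lt;/p&gt;

```python
# Toy sketch of the stride fix (hypothetical names, not ORT's code).
# Packed zero points share bytes: 2 values per byte for Q4, 4 for Q2,
# so the per-column stride must round up to that packing granularity.
def zero_point_stride(n_blocks_per_col, nbits):
    values_per_byte = 8 // nbits
    # ceiling to the nearest multiple of values_per_byte
    return ((n_blocks_per_col + values_per_byte - 1)
            // values_per_byte) * values_per_byte
```

&lt;p&gt;For Q2, a column with 5 blocks needs a stride of 8, where the Q4-style &lt;code&gt;n_blocks_per_col + 1&lt;/code&gt; shortcut would give 6.&lt;/p&gt;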

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Pattern recognition in quantization math. The "+1" looked innocuous but encoded a Q4 assumption the developer might have glossed over.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: "JSEP still gives wrong results" — Diving into TypeScript Shader Generators
&lt;/h2&gt;

&lt;p&gt;After the native C++ path was fixed, the developer reported that the browser-facing JSEP path still produced garbage. This is where the collaboration got interesting.&lt;/p&gt;

&lt;p&gt;The JSEP shaders are &lt;strong&gt;generated at runtime by TypeScript code&lt;/strong&gt; — template strings that emit WGSL. The agent needed to understand code that &lt;em&gt;writes&lt;/em&gt; shader code, not the shader itself.&lt;/p&gt;

&lt;p&gt;The agent traced through matmulnbits.ts, identified that the multi-pass loop used &lt;code&gt;pass * 8&lt;/code&gt; as a bit shift — which works for Q4 (one pass) but for Q2 (two passes) shifts into the wrong byte — and fixed the formula to &lt;code&gt;pass * bits * 4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Reasoning through meta-programming. The bug wasn't in the TypeScript or the WGSL — it was in the &lt;em&gt;relationship&lt;/em&gt; between them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: "Still wrong" — The Agent Writes Verification Scripts
&lt;/h2&gt;

&lt;p&gt;After the shift fix, the developer tested again: &lt;em&gt;"the result changed, but still not correct."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At this point, staring at code wasn't enough. The agent &lt;strong&gt;wrote Python simulation scripts&lt;/strong&gt; that replicated the shader's bit extraction logic step by step. The first script (verify_extraction.py) proved the shift fix was necessary but insufficient. A second script (verify_extraction2.py) revealed the deeper bug:&lt;/p&gt;

&lt;p&gt;The Q4 extraction pattern &lt;code&gt;unpack4xU8(b_value &amp;amp; 0x0F0F0F0F)&lt;/code&gt; extracts the same bit position from all four bytes simultaneously. For Q4, that gives four sequential values. For Q2, it gives values v0, v4, v8, v12 — completely out of order relative to the sequential A-data they're multiplied with.&lt;/p&gt;

&lt;p&gt;The agent designed a "nibble-spread" technique: take two bytes per pass, spread each nibble into its own byte of a synthetic u32, then apply the standard extraction. It wrote yet another verification script (verify_nibble_spread2.py) with a non-repeating test pattern to confirm the extraction produces values in the correct order, then applied the fix to both shader paths in the TypeScript.&lt;/p&gt;
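&lt;p&gt;The two extraction strategies can be simulated in a few lines of Python (a simplified model with hypothetical helper names; the real logic lives in the generated WGSL):&lt;/p&gt;

```python
# Simplified simulation of the Q2 extraction-order bug (hypothetical
# helper names). A u32 word holds 16 two-bit values v0..v15, four per byte.
def pack_q2(values):                       # 16 two-bit values -> one u32
    word = 0
    for i, v in enumerate(values):
        word += v * 4 ** i
    return word

def byte_of(word, j):                      # j-th byte of a u32
    return (word // 256 ** j) % 256

def q4_style_extract(word):
    # "mask the same bits from all 4 bytes": low 2 bits of every byte
    return [byte_of(word, j) % 4 for j in range(4)]

def nibble_spread_extract(word, pass_idx):
    # take two bytes per pass, spread each nibble into its own byte of a
    # synthetic u32, then apply the standard per-byte extraction twice
    b0, b1 = byte_of(word, 2 * pass_idx), byte_of(word, 2 * pass_idx + 1)
    nibbles = [b0 % 16, b0 // 16, b1 % 16, b1 // 16]
    synthetic = sum(n * 256 ** j for j, n in enumerate(nibbles))
    out = []
    for j in range(4):                     # each nibble holds 2 values
        out.append(byte_of(synthetic, j) % 4)         # first 2-bit value
        out.append((byte_of(synthetic, j) // 4) % 4)  # second 2-bit value
    return out

vals = [0, 1, 2, 3, 3, 2, 1, 0, 1, 3, 0, 2, 2, 0, 3, 1]
print(q4_style_extract(pack_q2(vals)))          # v0, v4, v8, v12: out of order
print(nibble_spread_extract(pack_q2(vals), 0))  # v0..v7, in order
```

&lt;p&gt;The Q4-style mask pulls one value out of each byte (v0, v4, v8, v12), while the nibble-spread pass walks the values sequentially.&lt;/p&gt;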

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; When code reading hit a wall, the agent pivoted to &lt;strong&gt;writing executable proofs&lt;/strong&gt;. Each script answered a specific yes/no question about the bit logic, building confidence incrementally rather than guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: "Almost — but still off" — The Last Bug
&lt;/h2&gt;

&lt;p&gt;The developer tested again: &lt;em&gt;"the result changed, but still not correct."&lt;/em&gt; Three fixes in, still wrong.&lt;/p&gt;

&lt;p&gt;The agent wrote verify_a_offset.py — a script that traced how the A-data (activation) pointer advances across passes. It found the final bug: pass 0's inner loop increments &lt;code&gt;input_offset&lt;/code&gt; eight times. Pass 1 then computed its start as &lt;code&gt;input_offset + 8/aComponents&lt;/code&gt;, but since &lt;code&gt;input_offset&lt;/code&gt; was already advanced, this &lt;strong&gt;double-counted&lt;/strong&gt; the offset. Pass 1 read A[16] instead of A[8], skipping eight activation values.&lt;/p&gt;

&lt;p&gt;The fix was a one-line change: pass 1 uses &lt;code&gt;input_offset&lt;/code&gt; directly instead of adding an offset to an already-advanced pointer.&lt;/p&gt;
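&lt;p&gt;The double-count is easy to reproduce with a tiny trace (hypothetical names mirroring the description above, not the generated WGSL):&lt;/p&gt;

```python
# Minimal trace of the A-offset double-count (hypothetical names,
# mirroring the article's description, not the actual shader code).
def pass_start_indices(values_per_pass=8, buggy=True):
    input_offset = 0
    starts = []
    for pass_idx in range(2):
        if pass_idx == 0:
            start = input_offset
        elif buggy:
            # bug: input_offset was already advanced by pass 0's inner
            # loop, so adding the pass width again double-counts it
            start = input_offset + values_per_pass
        else:
            # fix: use the already-advanced offset directly
            start = input_offset
        starts.append(start)
        for _ in range(values_per_pass):   # inner loop advances the pointer
            input_offset += 1
    return starts

print(pass_start_indices(buggy=True))   # pass 1 reads A[16] instead of A[8]
print(pass_start_indices(buggy=False))  # pass 1 correctly reads A[8]
```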

&lt;p&gt;The developer tested: &lt;em&gt;"the result is correct now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Maintained state across a long debugging session. By this point, the agent had built a mental model of how &lt;code&gt;word_offset&lt;/code&gt;, &lt;code&gt;input_offset&lt;/code&gt;, pass indices, and &lt;code&gt;aComponents&lt;/code&gt; interact across the shader generator's nested loops — context that would take a human significant time to reconstruct after each failed attempt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: "Do we need to update the tests?" — The Agent Adds Coverage
&lt;/h2&gt;

&lt;p&gt;With all fixes working, the developer asked whether tests needed updating. The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the existing test file to assess coverage gaps&lt;/li&gt;
&lt;li&gt;Identified that block_size=64 (the real-model configuration that exercised the zero-point padding bug) had no test&lt;/li&gt;
&lt;li&gt;Added three new test cases covering block_size=64, symmetric variants, and multi-word extraction scenarios&lt;/li&gt;
&lt;li&gt;Figured out which build target to compile (&lt;code&gt;onnxruntime_provider_test&lt;/code&gt;, not &lt;code&gt;onnxruntime_test_all&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Built and ran all nine tests — all passed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; End-to-end task completion. The developer asked a yes/no question; the agent answered by doing the work, including navigating an unfamiliar build system to find the right test binary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collaboration Pattern
&lt;/h2&gt;

&lt;p&gt;Looking back, the session followed a repeating cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer: "It's broken" / "Still wrong"
    → Agent: Search, read, analyze, hypothesize
    → Agent: Write verification script OR apply code fix
    → Agent: Build
    → Developer: Test with real model
    → (repeat until correct)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer brought &lt;strong&gt;domain context&lt;/strong&gt; (which model to test, what "correct" looks like, the build commands) and &lt;strong&gt;judgment&lt;/strong&gt; (when to test, when to push back). The agent brought &lt;strong&gt;tireless code reading&lt;/strong&gt;, &lt;strong&gt;cross-file tracing&lt;/strong&gt;, &lt;strong&gt;bit-level arithmetic verification&lt;/strong&gt;, and the ability to &lt;strong&gt;maintain context&lt;/strong&gt; across a multi-hour, multi-bug debugging session without losing track of which fixes were already applied.&lt;/p&gt;

&lt;p&gt;Key moments where the agent added outsized value:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Without agent&lt;/th&gt;
&lt;th&gt;With agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Finding all Q4-hardcoded guards&lt;/td&gt;
&lt;td&gt;Grep + manual reading across C++, WGSL, TypeScript&lt;/td&gt;
&lt;td&gt;Agent searched and identified all three in one pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Understanding shader generator meta-programming&lt;/td&gt;
&lt;td&gt;Mentally compile TypeScript → WGSL → GPU execution&lt;/td&gt;
&lt;td&gt;Agent traced the template logic and identified the generated shift values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verifying bit extraction ordering&lt;/td&gt;
&lt;td&gt;Pen-and-paper binary arithmetic&lt;/td&gt;
&lt;td&gt;Agent wrote executable Python proofs with non-repeating test patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracking pointer advancement across nested loops&lt;/td&gt;
&lt;td&gt;Extremely error-prone mental simulation&lt;/td&gt;
&lt;td&gt;Agent wrote a trace script that showed exact index values at each step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintaining context across 5 sequential bugs&lt;/td&gt;
&lt;td&gt;Each "still wrong" resets human working memory&lt;/td&gt;
&lt;td&gt;Agent retained cumulative understanding of every prior fix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Didn't Work (and What the Developer Still Had to Do)
&lt;/h2&gt;

&lt;p&gt;The agent couldn't run the actual model on WebGPU — the developer had a test project with a browser environment and a real 2-bit transformer model. Each "is it correct now?" required the developer to run the model, compare the output against the CPU baseline, and report back. The agent operated on code structure and logic; the developer operated on ground truth.&lt;/p&gt;

&lt;p&gt;The build system was also a friction point. The agent had to discover — through trial and error — that tests lived in &lt;code&gt;onnxruntime_provider_test.exe&lt;/code&gt; rather than &lt;code&gt;onnxruntime_test_all.exe&lt;/code&gt;, and that the VS 2026 Insiders vcvarsall path was non-standard. These are the kinds of environmental details where the developer's existing knowledge was essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways for Developers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Describe symptoms, not solutions.&lt;/strong&gt; Saying "it gives wrong results on WebGPU but correct on CPU" gave the agent more to work with than "I think the bit shift is wrong."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let the agent write verification scripts.&lt;/strong&gt; When the bug is in bit-level arithmetic inside a shader generator, reading code has diminishing returns. Executable proofs are faster and more reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate tight loops.&lt;/strong&gt; The five-bug sequence would have been demoralizing solo — each fix revealing another failure. With the agent maintaining context and proposing the next investigation immediately, the cycle stayed fast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep ground truth in human hands.&lt;/strong&gt; The developer's ability to test with a real model and say "correct" or "still wrong" was the irreplaceable signal that drove the entire session. The agent can analyze and fix; only the developer can validate against the actual use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent is most valuable on cross-cutting, multi-layer bugs.&lt;/strong&gt; A bug in one file is easy. Five bugs spanning C++, WGSL templates, TypeScript shader generators, and build configuration — each masked by the previous one — is where an agent that doesn't lose context across files and hours earns its keep.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>webgpu</category>
      <category>onnxruntime</category>
    </item>
  </channel>
</rss>
