<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hector Li</title>
    <description>The latest articles on DEV Community by Hector Li (@hector_lxm).</description>
    <link>https://dev.to/hector_lxm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3766888%2F726a9842-393e-4fc8-9886-a507888b0217.png</url>
      <title>DEV Community: Hector Li</title>
      <link>https://dev.to/hector_lxm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hector_lxm"/>
    <language>en</language>
    <item>
      <title>I Shipped a 5-Bug Fix to ONNX Runtime — By Telling an AI Agent "Still Wrong"</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Fri, 13 Feb 2026 06:54:28 +0000</pubDate>
      <link>https://dev.to/hector_lxm/i-shipped-a-5-bug-fix-to-onnx-runtime-by-telling-an-ai-agent-still-wrong-4gi4</link>
      <guid>https://dev.to/hector_lxm/i-shipped-a-5-bug-fix-to-onnx-runtime-by-telling-an-ai-agent-still-wrong-4gi4</guid>
      <description>&lt;p&gt;&lt;em&gt;I shipped a 5-file, production-quality &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;PR&lt;/a&gt; to ONNX Runtime in one session — and I wrote almost none of the code myself.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Know Your Goal (or the Problem)
&lt;/h2&gt;

&lt;p&gt;I had an ONNX model with a 2-bit quantized &lt;code&gt;MatMulNBits&lt;/code&gt; operator. It ran correctly on CPU. I wanted to run it in a web project using ONNX Runtime's WebGPU backend. I tried, and got this error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Error running model: failed to call OrtRun(). ERROR_CODE: 1, ERROR_MESSAGE: .../matmul_nbits.cc:123 ... nbits != 2 was false. Currently, zero points are not supported for Q2 quantization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the error message, I knew that 2-bit &lt;code&gt;MatMulNBits&lt;/code&gt; was partially supported in WebGPU, but there was a feature gap — it didn't support models that include a &lt;code&gt;zero_points&lt;/code&gt; input.&lt;/p&gt;

&lt;p&gt;As a former ONNX Runtime developer, I knew something about low-bit quantization, T-MAC, and the 2-bit CPU implementation, but I had no experience with ONNX Runtime's WebGPU development. Next, let's see what an AI coding agent could do with this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ask the AI Agent to Do the Work
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open VS Code with the local ONNX Runtime repository.&lt;/li&gt;
&lt;li&gt;Copy the error message directly into the AI agent (GitHub Copilot with Claude Opus 4.6).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Round 1: Remove the Gate
&lt;/h3&gt;

&lt;p&gt;From the error message, the agent located the source file that threw the error and started investigating.&lt;/p&gt;

&lt;p&gt;The agent started reading the code and thinking it through.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iew29qkd4cqhr8rsm6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4iew29qkd4cqhr8rsm6g.png" alt="The agent started to read the code and thinking" width="800" height="731"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the root cause and made the changes.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2nlbi57uvgfdl795f1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2nlbi57uvgfdl795f1q.png" alt="The agent found the root cause and made the changes" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent removed the restriction — an &lt;code&gt;ORT_ENFORCE(nbits != 2, ...)&lt;/code&gt; guard that explicitly blocked Q2 with zero points. I knew from experience that simply removing a guard wouldn't be enough to make the feature work correctly — the underlying shader logic still assumed 4-bit. But I asked the agent to build it anyway to establish a baseline. I ran it with my model. Of course, it produced wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Domain judgment — knowing the guard removal was necessary but insufficient, and choosing to proceed anyway to see what broke next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 2: Fix the Buffer Stride
&lt;/h3&gt;

&lt;p&gt;I copied the error to the agent, and it started investigating.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbkqm6c6rcasdifqqr3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbkqm6c6rcasdifqqr3b.png" alt="start" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the problem and made the changes.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd99u9z9lw4wbzkzpr6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd99u9z9lw4wbzkzpr6v.png" alt="problem" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent found that the zero-point buffer stride calculation used a Q4-only shortcut (&lt;code&gt;+1&lt;/code&gt;) that didn't generalize to Q2's 4-values-per-byte packing. It rewrote the formula with proper ceiling arithmetic.&lt;/p&gt;
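&lt;p&gt;As an illustration of the arithmetic involved (a minimal TypeScript sketch with made-up names, not the PR's actual code): the stride must round the block count up to a whole number of packed zero-point bytes, which only coincides with "+1" in the Q4 case.&lt;/p&gt;

```typescript
// Hypothetical sketch of the ceiling arithmetic; names are illustrative.
function zeroBlocksPerCol(nBlocksPerCol: number, nbits: number): number {
  const valuesPerByte = 8 / nbits; // Q4: 2 values per byte, Q2: 4 per byte
  // Round up to the next multiple of valuesPerByte (byte alignment).
  return Math.ceil(nBlocksPerCol / valuesPerByte) * valuesPerByte;
}

console.log(zeroBlocksPerCol(5, 4)); // 6 — the Q4 "+1" shortcut agrees here
console.log(zeroBlocksPerCol(6, 2)); // 8 — "+1" would wrongly give 7
```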

&lt;p&gt;I rebuilt and tested with my project. The result was still not correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Testing against ground truth in a browser environment the agent couldn't access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 3: Write Unit Tests as a Diagnostic Tool
&lt;/h3&gt;

&lt;p&gt;At this point, staring at shader generator code wasn't productive. I asked the agent to create unit tests — not just for coverage, but as a &lt;strong&gt;diagnostic strategy&lt;/strong&gt; to isolate which configurations were failing.&lt;/p&gt;

&lt;p&gt;I asked the agent to create some unit tests to see whether it could find more issues.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kssog6yddnap5ejpyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4kssog6yddnap5ejpyc.png" alt="ut" width="800" height="147"&gt;&lt;/a&gt;&lt;br&gt;
It created the tests, found bugs, and fixed them.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i1xeuia2g9y73zsprfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i1xeuia2g9y73zsprfw.png" alt="ut_fix" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The agent wrote a &lt;code&gt;MatMul2BitsWebGpu&lt;/code&gt; test suite, found that 6 of 8 test cases failed, traced the failures to bit-shift and value-extraction ordering bugs in the TypeScript shader generator, and fixed them.&lt;/p&gt;

&lt;p&gt;I rebuilt and tested with my project. The result was still not correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Choosing the right diagnostic approach — unit tests revealed bugs that code reading alone couldn't surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 4: Feed It the Real Model
&lt;/h3&gt;

&lt;p&gt;The unit tests were passing, but my real model still gave wrong output. I provided the agent the actual 2-bit quantized transformer model I was using.&lt;/p&gt;

&lt;p&gt;I asked the agent to investigate with the real model.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfnowh2kgeo7e0spoz0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfnowh2kgeo7e0spoz0v.png" alt="investigate1" width="800" height="232"&gt;&lt;/a&gt;&lt;br&gt;
The agent walked through the code with the data and node attributes from the real model to address the issue. That was amazing!&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dozoofovc2tis4bppv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dozoofovc2tis4bppv1.png" alt="real_model" width="800" height="721"&gt;&lt;/a&gt;&lt;br&gt;
The agent found the root cause and made the fix.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F865tnwqr4ikfnqhb2hkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F865tnwqr4ikfnqhb2hkl.png" alt="fix" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the most impressive round. The agent wrote Python scripts to simulate the shader's bit extraction logic step by step, using real data from my model. It discovered that the A-data (activation) pointer was being double-advanced across multi-pass loops — pass 1 was reading &lt;code&gt;A[16]&lt;/code&gt; instead of &lt;code&gt;A[8]&lt;/code&gt;, silently skipping 8 values. A one-line fix resolved it.&lt;/p&gt;
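&lt;p&gt;The offset bug is easy to reproduce outside the shader. A minimal TypeScript simulation (variable names are illustrative, and it assumes one value per component, i.e. &lt;code&gt;aComponents = 1&lt;/code&gt;):&lt;/p&gt;

```typescript
// Illustrative trace of the pointer bug: pass 0 advances the offset while
// reading 8 activation values, so computing pass 1's start as
// "offset + pass * 8" double-counts the advance.
function simulateReads(buggy: boolean): number[] {
  const starts: number[] = [];
  let inputOffset = 0;
  for (let pass = 0; pass < 2; pass++) {
    const start = buggy ? inputOffset + pass * 8 : inputOffset;
    starts.push(start);
    let offset = start;
    for (let i = 0; i < 8; i++) offset++; // read A[offset], advancing it
    inputOffset = offset; // the loop has already moved the pointer forward
  }
  return starts;
}

console.log(simulateReads(true));  // -> [0, 16]: pass 1 skips A[8..15]
console.log(simulateReads(false)); // -> [0, 8]:  pass 1 resumes at A[8]
```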

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Providing the real model — something the agent couldn't obtain on its own. This was the input that unlocked the final bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 5: Fill the Test Gaps
&lt;/h3&gt;

&lt;p&gt;The result was correct with my test project. I asked the agent to add more test cases to cover all the changes.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgkx3skzkj2xk8iopy88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgkx3skzkj2xk8iopy88.png" alt="result" width="800" height="135"&gt;&lt;/a&gt;&lt;br&gt;
The agent said the existing tests already had good coverage but were missing cases that matched the configuration of my real model.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98lslay0i33b0io4s8gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98lslay0i33b0io4s8gm.png" alt="final" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result was finally correct! I asked the agent to update test coverage. It identified that the existing tests didn't include &lt;code&gt;block_size=64&lt;/code&gt; (the configuration my real model used, which exercises zero-point padding edge cases) and added three new test cases. All 9 tests passed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My role:&lt;/strong&gt; Validating the final result against the real model and asking for coverage of the actual production configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed
&lt;/h2&gt;

&lt;p&gt;Five bugs across five files, each hidden behind the last:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q2+ZP blocked&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;matmul_nbits.cc&lt;/code&gt;, &lt;code&gt;matmul_nbits.h&lt;/code&gt;, WGSL template&lt;/td&gt;
&lt;td&gt;Hard-coded guards rejecting Q2 with zero points; missing bit mask&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buffer stride&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmul_nbits.cc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero-point stride used Q4-only &lt;code&gt;+1&lt;/code&gt; rounding instead of proper ceiling formula&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bit shift&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-pass shift &lt;code&gt;pass * 8&lt;/code&gt; crossed byte boundaries; should be &lt;code&gt;pass * bits * 4&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Value ordering&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;unpack4xU8&lt;/code&gt; extracts same bit position from all 4 bytes — wrong order for Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A-data offset&lt;/td&gt;
&lt;td&gt;&lt;code&gt;matmulnbits.ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pass 1 double-advanced the activation pointer, skipping 8 values&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The PR
&lt;/h2&gt;

&lt;p&gt;All work done! Time to push the changes to GitHub and create a PR: &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;Improve WebGPU MatMulNBits to support zero pointer for 2bits&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that the PR didn't receive any review comments directly related to the code changes — only a future improvement request. The agent's code was production-quality on the first submission.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus: Ask the Agent to Write the Blog
&lt;/h2&gt;

&lt;p&gt;I asked the agent to create a blog post from what we had done.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhjxvkuqx2dj4l1nbzkc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhjxvkuqx2dj4l1nbzkc.png" alt="blog1" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First attempt — a technical summary of the bugs and fixes:&lt;br&gt;
&lt;a href="https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj"&gt;Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's useful, but what I wanted was a blog showing how I paired with the AI agent. So I asked again:&lt;/p&gt;

&lt;p&gt;I asked the agent to write another blog post.&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhv52r4vrrrt1e9b4fp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhv52r4vrrrt1e9b4fp5.png" alt="blog2" width="800" height="217"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7"&gt;Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reading that second blog, you'll notice it emphasizes "what the agent did well", "tireless code reading", "the agent is most valuable on...". And you might wonder: what exactly did the &lt;em&gt;developer&lt;/em&gt; do? Just keep saying "result is not correct!" and "why don't the tests cover all cases?" 😄&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Did
&lt;/h2&gt;

&lt;p&gt;But that framing misses the point. Here's what the developer contributed that the agent couldn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Defined the problem&lt;/strong&gt; — provided the error message, the model, and the expected behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Made strategic choices&lt;/strong&gt; — when to build, when to switch to unit tests, when to provide the real model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Held ground truth&lt;/strong&gt; — tested in a real browser environment the agent had no access to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applied domain judgment&lt;/strong&gt; — knew the guard removal was insufficient, knew which model configurations mattered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The developer's job wasn't to write code — it was to define the problem, validate the result, and make judgment calls about what to try next. That turned out to be enough.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Bringing 2-Bit Quantization to ONNX Runtime's WebGPU Backend</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:14:25 +0000</pubDate>
      <link>https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj</link>
      <guid>https://dev.to/hector_lxm/bringing-2-bit-quantization-to-onnx-runtimes-webgpu-backend-33cj</guid>
      <description>&lt;p&gt;&lt;em&gt;A story of five bugs, bit-level debugging, and running transformer models at 2-bit precision in the browser. Here's the &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;PR&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;ONNX Runtime's &lt;code&gt;MatMulNBits&lt;/code&gt; operator supports low-bit quantized matrix multiplication — packing weight values into 2, 4, or 8 bits per element. The WebGPU execution provider (both the native C++ path and the JavaScript/JSEP path) already supported 4-bit (Q4) quantization, but 2-bit (Q2) was blocked or broken. Our goal: make Q2 with zero points work correctly end-to-end so that 2-bit quantized transformer models run accurately in the browser via WebGPU.&lt;/p&gt;
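&lt;p&gt;For concreteness, here is a minimal sketch of this style of bit-packing (LSB-first within each byte; the operator's exact on-disk layout is defined by the contrib-op spec, so treat this as illustrative):&lt;/p&gt;

```typescript
// Pack quantized values at nbits each, 8/nbits per byte, LSB-first.
// Assumes values.length is a multiple of 8/nbits.
function packValues(values: number[], nbits: number): number[] {
  const perByte = 8 / nbits;
  const bytes: number[] = [];
  for (let i = 0; i < values.length; i += perByte) {
    let b = 0;
    for (let j = 0; j < perByte; j++) {
      b |= (values[i + j] & ((1 << nbits) - 1)) << (j * nbits);
    }
    bytes.push(b);
  }
  return bytes;
}

// Q2: four 2-bit values [1, 2, 3, 0] -> one byte 0b00_11_10_01 = 57
console.log(packValues([1, 2, 3, 0], 2)); // [57]
// Q4: two 4-bit values [5, 10] -> one byte 0xA5 = 165
console.log(packValues([5, 10], 4)); // [165]
```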

&lt;p&gt;What seemed like a single feature gap turned out to be &lt;strong&gt;five distinct bugs&lt;/strong&gt;, each hidden behind the last.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 1: The Gate — Hard-Coded Rejection of Q2 + Zero Points
&lt;/h2&gt;

&lt;p&gt;The first issue was immediate: attempting to run a 2-bit model with zero points threw a runtime error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Currently, zero points are not supported for Q2 quantization."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two enforcement guards explicitly blocked Q2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native WebGPU EP&lt;/strong&gt; (matmul_nbits.cc): An &lt;code&gt;ORT_ENFORCE(nbits != 2)&lt;/code&gt; when zero points were present.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSEP C++ kernel&lt;/strong&gt; (matmul_nbits.h): &lt;code&gt;ORT_ENFORCE(nbits_ == 4)&lt;/code&gt; — only Q4 was allowed at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, the WGSL zero-point extraction template (matmul_nbits_zero_pt.wgsl.template) had &lt;code&gt;#elif n_bits == 2&lt;/code&gt; but was missing the &lt;code&gt;bit_mask&lt;/code&gt; constant, so even if the guard were removed, the shader would malfunction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Remove the enforcement blocks, add &lt;code&gt;const bit_mask = 0x3u;&lt;/code&gt; for Q2, guard the DP4A path (which uses a hardcoded LUT assuming &lt;code&gt;zero_point=2&lt;/code&gt;) to skip Q2 with custom zero points.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 2: Zero Point Buffer Stride Miscalculation
&lt;/h2&gt;

&lt;p&gt;With the gates removed, tests ran — but produced wrong results. The root cause was in how &lt;code&gt;zero_blocks_per_col&lt;/code&gt; was computed.&lt;/p&gt;

&lt;p&gt;Zero points are packed into bytes: for Q4, two values per byte; for Q2, &lt;strong&gt;four values per byte&lt;/strong&gt;. Each column's zero points are byte-aligned, so the shader uses a flat linear stride to skip between columns. The original formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zero_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nbits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "+1" was a Q4 shortcut. For Q2 with &lt;code&gt;n_blocks_per_col = 6&lt;/code&gt; (e.g., K=384, block_size=64), the stride needs to round up to the next multiple of 4 (values per byte), not just add 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Proper ceiling-to-multiple formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;nbits&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;zero_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_blocks_per_col&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;zp_elements_per_byte&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
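&lt;p&gt;Translating both formulas into a quick TypeScript check (illustrative only, with positive integer inputs) makes the divergence visible:&lt;/p&gt;

```typescript
// The original Q4 shortcut: bump to the next value only when misaligned.
function oldStride(nBlocksPerCol: number, nbits: number): number {
  return nBlocksPerCol % (8 / nbits) === 0 ? nBlocksPerCol : nBlocksPerCol + 1;
}
// The fixed version: ceiling to the next multiple of values-per-byte.
function newStride(nBlocksPerCol: number, nbits: number): number {
  const zpElementsPerByte = 8 / nbits;
  return Math.ceil(nBlocksPerCol / zpElementsPerByte) * zpElementsPerByte;
}

// Q4: the shortcut happens to be correct (5 % 2 != 0 -> 5 + 1 = 6).
console.log(oldStride(5, 4), newStride(5, 4)); // 6 6
// Q2 with n_blocks_per_col = 6: "+1" gives 7, but the byte-aligned
// stride must round up to the next multiple of 4.
console.log(oldStride(6, 2), newStride(6, 2)); // 7 8
```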






&lt;h2&gt;
  
  
  Bug 3: Shift Formula Crosses Byte Boundaries
&lt;/h2&gt;

&lt;p&gt;Now the native EP worked, but the JSEP path (the browser-facing JavaScript shaders in matmulnbits.ts) still produced garbage.&lt;/p&gt;

&lt;p&gt;For Q4, each &lt;code&gt;u32&lt;/code&gt; word holds 8 values — processed in a single pass. For Q2, each word holds &lt;strong&gt;16 values&lt;/strong&gt;, requiring 2 passes of 8. The original shift used &lt;code&gt;pass * 8&lt;/code&gt;, meaning pass 1 shifted by 8 bits — crossing from one byte into the next, mixing values from different bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;lowerShift = pass * bits * 4&lt;/code&gt; — for Q2 this gives shifts of 0 and 4, staying within each byte's boundaries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bug 4: Value Extraction Ordering — The Nibble-Spread
&lt;/h2&gt;

&lt;p&gt;After the shift fix, output changed but was still wrong. Deeper analysis revealed a fundamental ordering problem.&lt;/p&gt;

&lt;p&gt;The Q4 extraction pattern &lt;code&gt;unpack4xU8(b_value &amp;amp; 0x0F0F0F0F)&lt;/code&gt; works because it extracts the &lt;strong&gt;same bit position from all 4 bytes simultaneously&lt;/strong&gt; — and for Q4, that gives 4 sequential values (one per byte). But for Q2, the same technique extracts bit position 0-1 from bytes 0, 1, 2, and 3 — producing values v0, v4, v8, v12 instead of v0, v1, v2, v3. The A-data is sequential, so &lt;code&gt;a[2] * b[8]&lt;/code&gt; is computed instead of &lt;code&gt;a[2] * b[2]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A "nibble-spread" technique that reorganizes bytes before extraction. Each pass takes 2 bytes (8 sequential values), spreads each nibble (4 bits = two Q2 values) into its own byte of a synthetic &lt;code&gt;u32&lt;/code&gt;, then applies the standard &lt;code&gt;unpack4xU8&lt;/code&gt; + mask pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;let half_word = b_value &amp;gt;&amp;gt; (pass * 16u);
let byte_lo = half_word &amp;amp; 0xFFu;
let byte_hi = (half_word &amp;gt;&amp;gt; 8u) &amp;amp; 0xFFu;
let spread_word = (byte_lo &amp;amp; 0xFu)
    | ((byte_lo &amp;gt;&amp;gt; 4u) &amp;lt;&amp;lt; 8u)
    | ((byte_hi &amp;amp; 0xFu) &amp;lt;&amp;lt; 16u)
    | ((byte_hi &amp;gt;&amp;gt; 4u) &amp;lt;&amp;lt; 24u);
b_value_lower = unpack4xU8(spread_word &amp;amp; 0x03030303u);
b_value_upper = unpack4xU8((spread_word &amp;gt;&amp;gt; 2u) &amp;amp; 0x03030303u);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was applied to both the general shader path and the BlockSize32 optimized path.&lt;/p&gt;
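&lt;p&gt;To see why this recovers sequential order, the WGSL above can be re-implemented in TypeScript and run against a known packing. This simulation mirrors the snippet; &lt;code&gt;unpack4xU8&lt;/code&gt; is modeled as low-byte-first, and the interleave of lower/upper values back into v0..v7 (which the surrounding shader performs) is made explicit:&lt;/p&gt;

```typescript
// Model of the WGSL unpack4xU8 built-in: split a u32 into 4 bytes, low first.
function unpack4xU8(w: number): number[] {
  return [w & 0xff, (w >>> 8) & 0xff, (w >>> 16) & 0xff, (w >>> 24) & 0xff];
}

// Extract 8 sequential Q2 values from one pass over a packed u32.
function extractQ2Pass(bValue: number, pass: number): number[] {
  const halfWord = (bValue >>> (pass * 16)) & 0xffff; // 2 bytes = 8 values
  const byteLo = halfWord & 0xff;
  const byteHi = (halfWord >>> 8) & 0xff;
  // Nibble-spread: each nibble (two Q2 values) gets its own byte.
  const spread =
    (byteLo & 0xf) |
    ((byteLo >>> 4) << 8) |
    ((byteHi & 0xf) << 16) |
    ((byteHi >>> 4) << 24);
  const lower = unpack4xU8(spread & 0x03030303);         // v0 v2 v4 v6
  const upper = unpack4xU8((spread >>> 2) & 0x03030303); // v1 v3 v5 v7
  // Interleave back into sequential order v0..v7.
  return lower.flatMap((v, i) => [v, upper[i]]);
}

// Pack v0..v15 = 0,1,2,3,0,1,2,3,... at 2 bits each, LSB-first, and check.
let packed = 0;
for (let i = 0; i < 16; i++) packed |= (i % 4) << (2 * i);
packed = packed >>> 0; // treat as u32

console.log(extractQ2Pass(packed, 0)); // -> [0, 1, 2, 3, 0, 1, 2, 3]
console.log(extractQ2Pass(packed, 1)); // -> [0, 1, 2, 3, 0, 1, 2, 3]
```

&lt;p&gt;Without the spread, masking &lt;code&gt;0x03030303&lt;/code&gt; directly against the packed word would have produced v0, v4, v8, v12 — exactly the misordering described above.&lt;/p&gt;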




&lt;h2&gt;
  
  
  Bug 5: A-Data Double-Advancement
&lt;/h2&gt;

&lt;p&gt;After the nibble-spread fix, the result changed again — closer, but still incorrect. A Python trace script finally pinpointed the last bug: the A-data offset for pass 1 was wrong.&lt;/p&gt;

&lt;p&gt;In the multi-pass loop, pass 0 reads A values via a loop that increments &lt;code&gt;input_offset&lt;/code&gt; 8 times. Pass 1 then computed its starting offset as &lt;code&gt;input_offset + 8/aComponents&lt;/code&gt; — but &lt;code&gt;input_offset&lt;/code&gt; had &lt;strong&gt;already been advanced&lt;/strong&gt; by pass 0's loop. This double-counted the offset, causing pass 1 to read A[16] instead of A[8], skipping 8 activation values entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Pass 1 simply uses &lt;code&gt;input_offset&lt;/code&gt; directly — it already points to exactly where pass 0 left off:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before (bug): input_offset + ${(pass * 8) / aComponents}&lt;/span&gt;
&lt;span class="c1"&gt;// After (fix):  input_offset&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this fix, the 2-bit quantized model produced &lt;strong&gt;correct results&lt;/strong&gt; on WebGPU, matching CPU output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Parameterizing the Shader for Variable Bit Widths
&lt;/h2&gt;

&lt;p&gt;Beyond the bug fixes, the JSEP shader needed systematic parameterization. Hard-coded Q4 assumptions were replaced with &lt;code&gt;attributes.bits&lt;/code&gt;-driven constants throughout:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Q4&lt;/th&gt;
&lt;th&gt;Q2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Values per u32 word&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passes per word&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bit mask&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x0F0F0F0Fu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x03030303u&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default zero point&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZP values per byte&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZP byte mask&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0xFu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x3u&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;word_offset increment&lt;/td&gt;
&lt;td&gt;8/aComponents&lt;/td&gt;
&lt;td&gt;16/aComponents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
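&lt;p&gt;All of these constants fall out of the bit width. A sketch of the derivation (illustrative names, not the actual &lt;code&gt;matmulnbits.ts&lt;/code&gt; code):&lt;/p&gt;

```typescript
// Derive the per-bit-width shader constants from `bits` alone.
function shaderParams(bits: number) {
  const valuesPerWord = 32 / bits;
  return {
    valuesPerWord,
    passesPerWord: valuesPerWord / 8,                 // 8 values handled per pass
    bitMask: (((1 << bits) - 1) * 0x01010101) >>> 0,  // same mask in every byte
    defaultZeroPoint: 1 << (bits - 1),                // midpoint of the range
    zpValuesPerByte: 8 / bits,
    zpByteMask: (1 << bits) - 1,
  };
}

console.log(shaderParams(4).bitMask.toString(16)); // "f0f0f0f"
console.log(shaderParams(2).bitMask.toString(16)); // "3030303"
```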




&lt;h2&gt;
  
  
  Test Coverage
&lt;/h2&gt;

&lt;p&gt;We added a &lt;code&gt;MatMul2BitsWebGpu&lt;/code&gt; test suite to exercise the Q2 path on the WebGPU EP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Symmetric &amp;amp; asymmetric&lt;/strong&gt; (with/without zero points)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple block sizes&lt;/strong&gt; (16, 32, 64, 128) — block_size=64 is the critical case where &lt;code&gt;n_blocks_per_col&lt;/code&gt; is not a multiple of 4, exercising the zero-point padding logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Varying dimensions&lt;/strong&gt; (K=16 to 1024, N=1 to 384) — covering single-word and multi-word extraction patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch tests&lt;/strong&gt; (M=1, 4, 100)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All 9 test configurations pass on WebGPU EP, with results matching CPU baseline within tolerance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Files Changed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits.cc&lt;/td&gt;
&lt;td&gt;Remove Q2+ZP block, fix &lt;code&gt;zero_blocks_per_col&lt;/code&gt;, guard DP4A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits_zero_pt.wgsl.template&lt;/td&gt;
&lt;td&gt;Add &lt;code&gt;bit_mask = 0x3u&lt;/code&gt; for Q2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_nbits.h&lt;/td&gt;
&lt;td&gt;Allow &lt;code&gt;nbits == 2&lt;/code&gt; in JSEP kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmulnbits.ts&lt;/td&gt;
&lt;td&gt;Parameterize for Q2, shift fix, nibble-spread, A-offset fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;matmul_2bits_test.cc&lt;/td&gt;
&lt;td&gt;WebGPU-specific Q2 test suite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One feature, five bugs&lt;/strong&gt; — each fix revealed the next layer of incorrectness. Without tests that compared against a CPU baseline, any single fix would have appeared to "do something" while still being wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bit-packing extraction is subtle&lt;/strong&gt; — the Q4 pattern of "mask the same bits from all 4 bytes" only works because Q4 has exactly one value per nibble per byte. Q2 breaks that assumption fundamentally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace scripts are essential&lt;/strong&gt; — Python scripts that simulate shader logic step-by-step (nibble-spread verification, A-offset tracking) were what ultimately identified bugs 4 and 5 after code-reading alone proved insufficient.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parameterize, don't fork&lt;/strong&gt; — rather than creating a separate Q2 shader, making the existing shader bit-width-aware keeps the code maintainable and makes future N-bit support (Q3, Q8) straightforward.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>onnxruntime</category>
      <category>webgpu</category>
      <category>2bit</category>
      <category>quantization</category>
    </item>
    <item>
      <title>Using an AI Coding Agent to Ship 2-Bit Quantization for WebGPU</title>
      <dc:creator>Hector Li</dc:creator>
      <pubDate>Wed, 11 Feb 2026 18:10:32 +0000</pubDate>
      <link>https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7</link>
      <guid>https://dev.to/hector_lxm/using-an-ai-coding-agent-to-ship-2-bit-quantization-for-webgpu-38j7</guid>
      <description>&lt;p&gt;&lt;em&gt;How a developer paired with an AI agent to find and fix five layered bugs in ONNX Runtime's GPU shader pipeline — without being an expert in WGSL or bit-packing. Here's the &lt;a href="https://github.com/microsoft/onnxruntime/pull/27285" rel="noopener noreferrer"&gt;OnnxRuntime PR (merged)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A developer needed to enable 2-bit (Q2) quantized model inference on ONNX Runtime's WebGPU backend. The 4-bit path worked, but 2-bit with zero points crashed immediately. The codebase involved C++ GPU kernels, WGSL shader templates, TypeScript shader generators, Emscripten WASM builds, and multiple build systems: a deep stack where any single layer could silently produce wrong numbers.&lt;/p&gt;

&lt;p&gt;Rather than spending days manually tracing shader bit logic, the developer partnered with an AI coding agent (GitHub Copilot in VS Code) to systematically find and fix every issue.&lt;/p&gt;

&lt;p&gt;Here's how that collaboration actually worked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: "Why does it crash?" — The Agent Reads the Error
&lt;/h2&gt;

&lt;p&gt;The developer shared the error message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Currently, zero points are not supported for Q2 quantization"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The agent searched the codebase, found the &lt;code&gt;ORT_ENFORCE&lt;/code&gt; guard in matmul_nbits.cc and the &lt;code&gt;nbits_ == 4&lt;/code&gt; check in matmul_nbits.h, and identified a missing &lt;code&gt;bit_mask&lt;/code&gt; constant in the WGSL template. Instead of just pointing these out, the agent &lt;strong&gt;directly applied all three fixes&lt;/strong&gt; — removing the guards, adding the mask, and guarding the DP4A codepath that couldn't handle Q2 zero points — across three files in a single edit operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Cross-file root cause analysis from a single error message. The developer didn't need to know which files to look at.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: "Tests pass but output is wrong" — The Agent Spots a Math Bug
&lt;/h2&gt;

&lt;p&gt;With the crash fixed, the developer built and ran tests. Six of eight failed with wrong numerical output. The developer asked the agent to investigate.&lt;/p&gt;

&lt;p&gt;The agent read the zero-point buffer stride calculation and identified that the formula &lt;code&gt;n_blocks_per_col + 1&lt;/code&gt; was a Q4-only shortcut. For Q2, where four values pack per byte, the stride must round up to the nearest multiple of 4. The agent wrote the corrected ceiling formula and applied it.&lt;/p&gt;
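&lt;p&gt;In toy Python (hypothetical names, not the ORT source), the corrected idea looks like this:&lt;/p&gt;

```python
# Toy sketch of the stride fix (hypothetical names, not ORT's code).
# Packed zero points share bytes: 2 values per byte for Q4, 4 for Q2,
# so the per-column stride must round up to that packing granularity.
def zero_point_stride(n_blocks_per_col, nbits):
    values_per_byte = 8 // nbits
    # ceiling to the nearest multiple of values_per_byte
    return ((n_blocks_per_col + values_per_byte - 1)
            // values_per_byte) * values_per_byte
```

&lt;p&gt;For Q2, a column with 5 blocks needs a stride of 8, where the Q4-style &lt;code&gt;n_blocks_per_col + 1&lt;/code&gt; shortcut would give 6.&lt;/p&gt;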

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Pattern recognition in quantization math. The "+1" looked innocuous but encoded a Q4 assumption the developer might have glossed over.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: "JSEP still gives wrong results" — Diving into TypeScript Shader Generators
&lt;/h2&gt;

&lt;p&gt;After the native C++ path was fixed, the developer reported that the browser-facing JSEP path still produced garbage. This is where the collaboration got interesting.&lt;/p&gt;

&lt;p&gt;The JSEP shaders are &lt;strong&gt;generated at runtime by TypeScript code&lt;/strong&gt; — template strings that emit WGSL. The agent needed to understand code that &lt;em&gt;writes&lt;/em&gt; shader code, not the shader itself.&lt;/p&gt;

&lt;p&gt;The agent traced through matmulnbits.ts, identified that the multi-pass loop used &lt;code&gt;pass * 8&lt;/code&gt; as a bit shift — which works for Q4 (one pass) but for Q2 (two passes) shifts into the wrong byte — and fixed the formula to &lt;code&gt;pass * bits * 4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Reasoning through meta-programming. The bug wasn't in the TypeScript or the WGSL — it was in the &lt;em&gt;relationship&lt;/em&gt; between them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: "Still wrong" — The Agent Writes Verification Scripts
&lt;/h2&gt;

&lt;p&gt;After the shift fix, the developer tested again: &lt;em&gt;"the result changed, but still not correct."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At this point, staring at code wasn't enough. The agent &lt;strong&gt;wrote Python simulation scripts&lt;/strong&gt; that replicated the shader's bit extraction logic step by step. The first script (verify_extraction.py) proved the shift fix was necessary but insufficient. A second script (verify_extraction2.py) revealed the deeper bug:&lt;/p&gt;

&lt;p&gt;The Q4 extraction pattern &lt;code&gt;unpack4xU8(b_value &amp;amp; 0x0F0F0F0F)&lt;/code&gt; extracts the same bit position from all four bytes simultaneously. For Q4, that gives four sequential values. For Q2, it gives values v0, v4, v8, v12 — completely out of order relative to the sequential A-data they're multiplied with.&lt;/p&gt;

&lt;p&gt;The agent designed a "nibble-spread" technique: take two bytes per pass, spread each nibble into its own byte of a synthetic u32, then apply the standard extraction. It wrote yet another verification script (verify_nibble_spread2.py) with a non-repeating test pattern to confirm the extraction produces values in the correct order, then applied the fix to both shader paths in the TypeScript.&lt;/p&gt;
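&lt;p&gt;The two extraction strategies can be simulated in a few lines of Python (a simplified model with hypothetical helper names; the real logic lives in the generated WGSL):&lt;/p&gt;

```python
# Simplified simulation of the Q2 extraction-order bug (hypothetical
# helper names). A u32 word holds 16 two-bit values v0..v15, four per byte.
def pack_q2(values):                       # 16 two-bit values -> one u32
    word = 0
    for i, v in enumerate(values):
        word += v * 4 ** i
    return word

def byte_of(word, j):                      # j-th byte of a u32
    return (word // 256 ** j) % 256

def q4_style_extract(word):
    # "mask the same bits from all 4 bytes": low 2 bits of every byte
    return [byte_of(word, j) % 4 for j in range(4)]

def nibble_spread_extract(word, pass_idx):
    # take two bytes per pass, spread each nibble into its own byte of a
    # synthetic u32, then apply the standard per-byte extraction twice
    b0, b1 = byte_of(word, 2 * pass_idx), byte_of(word, 2 * pass_idx + 1)
    nibbles = [b0 % 16, b0 // 16, b1 % 16, b1 // 16]
    synthetic = sum(n * 256 ** j for j, n in enumerate(nibbles))
    out = []
    for j in range(4):                     # each nibble holds 2 values
        out.append(byte_of(synthetic, j) % 4)         # first 2-bit value
        out.append((byte_of(synthetic, j) // 4) % 4)  # second 2-bit value
    return out

vals = [0, 1, 2, 3, 3, 2, 1, 0, 1, 3, 0, 2, 2, 0, 3, 1]
print(q4_style_extract(pack_q2(vals)))          # v0, v4, v8, v12: out of order
print(nibble_spread_extract(pack_q2(vals), 0))  # v0..v7, in order
```

&lt;p&gt;The Q4-style mask pulls one value out of each byte (v0, v4, v8, v12), while the nibble-spread pass walks the values sequentially.&lt;/p&gt;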

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; When code reading hit a wall, the agent pivoted to &lt;strong&gt;writing executable proofs&lt;/strong&gt;. Each script answered a specific yes/no question about the bit logic, building confidence incrementally rather than guessing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: "Almost — but still off" — The Last Bug
&lt;/h2&gt;

&lt;p&gt;The developer tested again: &lt;em&gt;"the result changed, but still not correct."&lt;/em&gt; Three fixes in, still wrong.&lt;/p&gt;

&lt;p&gt;The agent wrote verify_a_offset.py — a script that traced how the A-data (activation) pointer advances across passes. It found the final bug: pass 0's inner loop increments &lt;code&gt;input_offset&lt;/code&gt; eight times. Pass 1 then computed its start as &lt;code&gt;input_offset + 8/aComponents&lt;/code&gt;, but since &lt;code&gt;input_offset&lt;/code&gt; was already advanced, this &lt;strong&gt;double-counted&lt;/strong&gt; the offset. Pass 1 read A[16] instead of A[8], skipping eight activation values.&lt;/p&gt;

&lt;p&gt;The fix was a one-line change: pass 1 uses &lt;code&gt;input_offset&lt;/code&gt; directly instead of adding an offset to an already-advanced pointer.&lt;/p&gt;
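&lt;p&gt;The double-count is easy to reproduce with a tiny trace (hypothetical names mirroring the description above, not the generated WGSL):&lt;/p&gt;

```python
# Minimal trace of the A-offset double-count (hypothetical names,
# mirroring the article's description, not the actual shader code).
def pass_start_indices(values_per_pass=8, buggy=True):
    input_offset = 0
    starts = []
    for pass_idx in range(2):
        if pass_idx == 0:
            start = input_offset
        elif buggy:
            # bug: input_offset was already advanced by pass 0's inner
            # loop, so adding the pass width again double-counts it
            start = input_offset + values_per_pass
        else:
            # fix: use the already-advanced offset directly
            start = input_offset
        starts.append(start)
        for _ in range(values_per_pass):   # inner loop advances the pointer
            input_offset += 1
    return starts

print(pass_start_indices(buggy=True))   # pass 1 reads A[16] instead of A[8]
print(pass_start_indices(buggy=False))  # pass 1 correctly reads A[8]
```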

&lt;p&gt;The developer tested: &lt;em&gt;"the result is correct now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; Maintained state across a long debugging session. By this point, the agent had built a mental model of how &lt;code&gt;word_offset&lt;/code&gt;, &lt;code&gt;input_offset&lt;/code&gt;, pass indices, and &lt;code&gt;aComponents&lt;/code&gt; interact across the shader generator's nested loops — context that would take a human significant time to reconstruct after each failed attempt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: "Do we need to update the tests?" — The Agent Adds Coverage
&lt;/h2&gt;

&lt;p&gt;With all fixes working, the developer asked whether tests needed updating. The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the existing test file to assess coverage gaps&lt;/li&gt;
&lt;li&gt;Identified that block_size=64 (the real-model configuration that exercised the zero-point padding bug) had no test&lt;/li&gt;
&lt;li&gt;Added three new test cases covering block_size=64, symmetric variants, and multi-word extraction scenarios&lt;/li&gt;
&lt;li&gt;Figured out which build target to compile (&lt;code&gt;onnxruntime_provider_test&lt;/code&gt;, not &lt;code&gt;onnxruntime_test_all&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Built and ran all nine tests — all passed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What the agent did well:&lt;/strong&gt; End-to-end task completion. The developer asked a yes/no question; the agent answered by doing the work, including navigating an unfamiliar build system to find the right test binary.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Collaboration Pattern
&lt;/h2&gt;

&lt;p&gt;Looking back, the session followed a repeating cycle:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer: "It's broken" / "Still wrong"
    → Agent: Search, read, analyze, hypothesize
    → Agent: Write verification script OR apply code fix
    → Agent: Build
    → Developer: Test with real model
    → (repeat until correct)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The developer brought &lt;strong&gt;domain context&lt;/strong&gt; (which model to test, what "correct" looks like, the build commands) and &lt;strong&gt;judgment&lt;/strong&gt; (when to test, when to push back). The agent brought &lt;strong&gt;tireless code reading&lt;/strong&gt;, &lt;strong&gt;cross-file tracing&lt;/strong&gt;, &lt;strong&gt;bit-level arithmetic verification&lt;/strong&gt;, and the ability to &lt;strong&gt;maintain context&lt;/strong&gt; across a multi-hour, multi-bug debugging session without losing track of which fixes were already applied.&lt;/p&gt;

&lt;p&gt;Key moments where the agent added outsized value:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Without agent&lt;/th&gt;
&lt;th&gt;With agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Finding all Q4-hardcoded guards&lt;/td&gt;
&lt;td&gt;Grep + manual reading across C++, WGSL, TypeScript&lt;/td&gt;
&lt;td&gt;Agent searched and identified all three in one pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Understanding shader generator meta-programming&lt;/td&gt;
&lt;td&gt;Mentally compile TypeScript → WGSL → GPU execution&lt;/td&gt;
&lt;td&gt;Agent traced the template logic and identified the generated shift values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verifying bit extraction ordering&lt;/td&gt;
&lt;td&gt;Pen-and-paper binary arithmetic&lt;/td&gt;
&lt;td&gt;Agent wrote executable Python proofs with non-repeating test patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracking pointer advancement across nested loops&lt;/td&gt;
&lt;td&gt;Extremely error-prone mental simulation&lt;/td&gt;
&lt;td&gt;Agent wrote a trace script that showed exact index values at each step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintaining context across 5 sequential bugs&lt;/td&gt;
&lt;td&gt;Each "still wrong" resets human working memory&lt;/td&gt;
&lt;td&gt;Agent retained cumulative understanding of every prior fix&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Didn't Work (and What the Developer Still Had to Do)
&lt;/h2&gt;

&lt;p&gt;The agent couldn't run the actual model on WebGPU — the developer had a test project with a browser environment and a real 2-bit transformer model. Each "is it correct now?" required the developer to run the model, compare the output against the CPU baseline, and report back. The agent operated on code structure and logic; the developer operated on ground truth.&lt;/p&gt;

&lt;p&gt;The build system was also a friction point. The agent had to discover — through trial and error — that tests lived in &lt;code&gt;onnxruntime_provider_test.exe&lt;/code&gt; rather than &lt;code&gt;onnxruntime_test_all.exe&lt;/code&gt;, and that the VS 2026 Insiders vcvarsall path was non-standard. These are the kinds of environmental details where the developer's existing knowledge was essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways for Developers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Describe symptoms, not solutions.&lt;/strong&gt; Saying "it gives wrong results on WebGPU but correct on CPU" gave the agent more to work with than "I think the bit shift is wrong."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let the agent write verification scripts.&lt;/strong&gt; When the bug is in bit-level arithmetic inside a shader generator, reading code has diminishing returns. Executable proofs are faster and more reliable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterate tight loops.&lt;/strong&gt; The five-bug sequence would have been demoralizing solo — each fix revealing another failure. With the agent maintaining context and proposing the next investigation immediately, the cycle stayed fast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep ground truth in human hands.&lt;/strong&gt; The developer's ability to test with a real model and say "correct" or "still wrong" was the irreplaceable signal that drove the entire session. The agent can analyze and fix; only the developer can validate against the actual use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent is most valuable on cross-cutting, multi-layer bugs.&lt;/strong&gt; A bug in one file is easy. Five bugs spanning C++, WGSL templates, TypeScript shader generators, and build configuration — each masked by the previous one — is where an agent that doesn't lose context across files and hours earns its keep.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>webgpu</category>
      <category>onnxruntime</category>
    </item>
  </channel>
</rss>
