<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrey Kolkov</title>
    <description>The latest articles on DEV Community by Andrey Kolkov (@kolkov).</description>
    <link>https://dev.to/kolkov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F277150%2Fdc37d68a-1fc4-4584-a7a6-0e640febd7a8.jpeg</url>
      <title>DEV Community: Andrey Kolkov</title>
      <link>https://dev.to/kolkov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kolkov"/>
    <language>en</language>
    <item>
      <title>gogpu/ui v0.1.21: Enterprise Render Pipeline — Layer Tree, Damage Tracking, 0% GPU Idle</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Tue, 12 May 2026 10:23:08 +0000</pubDate>
      <link>https://dev.to/kolkov/gogpuui-v0121-enterprise-render-pipeline-layer-tree-damage-tracking-0-gpu-idle-5adm</link>
      <guid>https://dev.to/kolkov/gogpuui-v0121-enterprise-render-pipeline-layer-tree-damage-tracking-0-gpu-idle-5adm</guid>
      <description>&lt;p&gt;Two months ago we released &lt;a href="https://dev.to/kolkov/go-gui-in-2026-gogpuui-v010-22-widgets-gpu-rendering-zero-cgo-1enf"&gt;gogpu/ui v0.1.0&lt;/a&gt; — 22 widgets, 3 design systems, ~150K lines of pure Go. Since then we shipped 21 patch releases, and the rendering pipeline is unrecognizable.&lt;/p&gt;

&lt;p&gt;This post covers what changed and why it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;v0.1.0 re-rendered the entire widget tree every frame. A 48×48 spinner in one corner caused the GPU to redraw 800×600 of static content. Hover over a button? Full tree walk. Open a dropdown? Full tree walk. This was fine for demos, not for production.&lt;/p&gt;

&lt;p&gt;We studied how five frameworks solve this — Flutter, Chrome, Qt6, Android HWUI, Skia — and found the same architecture everywhere: &lt;strong&gt;Layer Tree + boundary isolation + damage tracking&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built (v0.1.14 → v0.1.21)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer Tree Compositor
&lt;/h3&gt;

&lt;p&gt;Every &lt;code&gt;RepaintBoundary&lt;/code&gt; widget now owns a node in a persistent Layer Tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OffsetLayer (root)
├── PictureLayer (toolbar — clean, reuse texture)
├── PictureLayer (sidebar — clean, reuse texture)
├── ClipRectLayer (scrollview viewport)
│   └── PictureLayer (content — dirty, re-record)
└── PictureLayer (spinner — dirty, re-record 48×48)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four layer types — &lt;code&gt;OffsetLayer&lt;/code&gt;, &lt;code&gt;PictureLayer&lt;/code&gt;, &lt;code&gt;ClipRectLayer&lt;/code&gt;, &lt;code&gt;OpacityLayer&lt;/code&gt; — compose the frame. Clean layers reuse their GPU texture from the previous frame. Only dirty layers re-render.&lt;/p&gt;

&lt;p&gt;This is the same pattern Flutter calls &lt;code&gt;flushPaint&lt;/code&gt; + &lt;code&gt;compositeFrame&lt;/code&gt;. We validated it against all five reference frameworks before writing a line of code.&lt;/p&gt;
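
&lt;p&gt;To make the clean/dirty split concrete, here is a minimal sketch of the idea. The &lt;code&gt;Layer&lt;/code&gt; struct, &lt;code&gt;newLayer&lt;/code&gt;, and &lt;code&gt;Composite&lt;/code&gt; are hypothetical names invented for illustration, not the gogpu/ui API; the sketch only shows that dirty layers are re-recorded while clean ones are left alone.&lt;/p&gt;

```go
package main

import "fmt"

// Layer is a hypothetical stand-in for a compositor layer.
// The real gogpu/ui layer types are not shown in this post.
type Layer struct {
	Name     string
	Dirty    bool
	Children []*Layer
}

// newLayer is an illustrative constructor.
func newLayer(name string, dirty bool, children ...*Layer) *Layer {
	l := new(Layer)
	l.Name = name
	l.Dirty = dirty
	l.Children = children
	return l
}

// Composite re-records dirty layers and leaves clean ones untouched,
// which stands in for reusing their cached texture. It returns the
// names of the layers that were re-recorded.
func Composite(l *Layer, recorded []string) []string {
	if l.Dirty {
		recorded = append(recorded, l.Name)
		l.Dirty = false // clean until someone marks it dirty again
	}
	for _, c := range l.Children {
		recorded = Composite(c, recorded)
	}
	return recorded
}

// buildExample mirrors the tree drawn above.
func buildExample() *Layer {
	return newLayer("root", false,
		newLayer("toolbar", false),
		newLayer("sidebar", false),
		newLayer("viewport", false, newLayer("content", true)),
		newLayer("spinner", true),
	)
}

func main() {
	fmt.Println(Composite(buildExample(), nil)) // [content spinner]
}
```

&lt;p&gt;Compositing the same tree a second time without new changes would re-record nothing, which is the property the rest of the pipeline relies on.&lt;/p&gt;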

&lt;h3&gt;
  
  
  0% GPU When Idle
&lt;/h3&gt;

&lt;p&gt;The frame loop checks a flat dirty set — O(1), not O(n) tree walk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasDirtyBoundaries&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NeedsRedraw&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NeedsAnimationFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// nothing changed, skip frame entirely&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the UI is idle, the GPU does zero work. Measured: 0% GPU across all six examples (hello, signals, taskmanager, gallery, ide, modular-compositor).&lt;/p&gt;

&lt;p&gt;The previous approach walked the entire widget tree every frame to check whether anything needed a redraw. For 200 boundaries, the new check is &lt;strong&gt;45× faster&lt;/strong&gt;.&lt;/p&gt;
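
&lt;p&gt;A flat dirty set can be sketched as follows. Only &lt;code&gt;HasDirtyBoundaries&lt;/code&gt; appears in the snippet above; the &lt;code&gt;Window&lt;/code&gt; struct, &lt;code&gt;MarkDirty&lt;/code&gt;, and &lt;code&gt;Frame&lt;/code&gt; are assumptions made for this example. Boundaries add themselves to a set when they change, so the skip check reads the set size instead of visiting widgets.&lt;/p&gt;

```go
package main

import "fmt"

// Window is a hypothetical stand-in for the type behind w in the
// snippet above; its real fields are not shown in this post.
type Window struct {
	dirty map[int]struct{} // IDs of boundaries marked dirty since the last frame
}

func NewWindow() *Window {
	w := new(Window)
	w.dirty = make(map[int]struct{})
	return w
}

// MarkDirty is called by a boundary when its content changes.
// Marking the same boundary twice is harmless.
func (w *Window) MarkDirty(id int) { w.dirty[id] = struct{}{} }

// HasDirtyBoundaries is O(1): it reads the set size and never
// walks the widget tree.
func (w *Window) HasDirtyBoundaries() bool { return len(w.dirty) > 0 }

// Frame skips all work when nothing is dirty. Otherwise it reports
// how many boundaries were re-rendered and clears the set.
func (w *Window) Frame() int {
	if !w.HasDirtyBoundaries() {
		return 0 // idle: no GPU work at all
	}
	n := len(w.dirty)
	for id := range w.dirty {
		delete(w.dirty, id)
	}
	return n
}

func main() {
	w := NewWindow()
	fmt.Println(w.Frame()) // 0
	w.MarkDirty(7)
	w.MarkDirty(7)
	w.MarkDirty(9)
	fmt.Println(w.Frame()) // 2
	fmt.Println(w.Frame()) // 0
}
```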

&lt;h3&gt;
  
  
  Per-Boundary GPU Textures
&lt;/h3&gt;

&lt;p&gt;Each RepaintBoundary renders into its own offscreen MSAA texture. When a child boundary becomes dirty, only that boundary's texture is re-rendered. The compositor blits all textures in a single non-MSAA pass.&lt;/p&gt;

&lt;p&gt;A 48×48 spinner touching 2,304 pixels no longer forces the GPU to process 480,000 pixels of unchanged content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Rect Damage
&lt;/h3&gt;

&lt;p&gt;When multiple widgets are dirty in different screen regions, we don't union them into one giant rect. Each dirty rect gets its own GPU scissor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame N: spinner (48×48) + status bar (800×24)
→ Two scissor rects, not one 800×600 rect
→ Zero pixel waste
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The damage pipeline flows through the full stack: &lt;code&gt;ui&lt;/code&gt; → &lt;code&gt;gg&lt;/code&gt; &lt;code&gt;RenderDirectWithDamageRects&lt;/code&gt; → &lt;code&gt;wgpu&lt;/code&gt; &lt;code&gt;PresentWithDamage&lt;/code&gt;. A ring buffer stores rect lists for N-buffer swapchains. At a threshold of 16 rects, the list is merged into a single union rect (GDK/Sway pattern).&lt;/p&gt;
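
&lt;p&gt;The threshold merge can be illustrated with a small sketch. &lt;code&gt;Rect&lt;/code&gt;, &lt;code&gt;Union&lt;/code&gt;, &lt;code&gt;Coalesce&lt;/code&gt;, and &lt;code&gt;Area&lt;/code&gt; are hypothetical helpers, and treating the threshold as "more than 16 rects" is an assumption; the actual gg and wgpu code may differ. The spinner and status bar positions are also made up to fit an 800×600 window.&lt;/p&gt;

```go
package main

import "fmt"

// Rect is a hypothetical damage rectangle in pixels.
type Rect struct{ X, Y, W, H int }

func minInt(a, b int) int {
	if a > b {
		return b
	}
	return a
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// Union returns the smallest rect covering both a and b.
func Union(a, b Rect) Rect {
	x0, y0 := minInt(a.X, b.X), minInt(a.Y, b.Y)
	x1, y1 := maxInt(a.X+a.W, b.X+b.W), maxInt(a.Y+a.H, b.Y+b.H)
	return Rect{x0, y0, x1 - x0, y1 - y0}
}

// maxDamageRects mirrors the 16-rect threshold from the text.
// Merging only when the list is longer than 16 is an assumption.
const maxDamageRects = 16

// Coalesce keeps short lists as they are and collapses long lists
// into one union rect.
func Coalesce(rects []Rect) []Rect {
	if maxDamageRects >= len(rects) {
		return rects
	}
	u := rects[0]
	for _, r := range rects[1:] {
		u = Union(u, r)
	}
	return []Rect{u}
}

// Area sums the pixel count of all rects (overlap is not deducted).
func Area(rects []Rect) int {
	total := 0
	for _, r := range rects {
		total += r.W * r.H
	}
	return total
}

func main() {
	// Spinner in the top-right corner plus a status bar at the bottom.
	frame := []Rect{{752, 0, 48, 48}, {0, 576, 800, 24}}
	out := Coalesce(frame)
	fmt.Println(len(out), Area(out)) // 2 21504

	var many []Rect
	for i := 0; i != maxDamageRects+1; i++ { // 17 small rects
		many = append(many, Rect{i * 10, 0, 8, 8})
	}
	fmt.Println(Coalesce(many)) // [{0 0 168 8}]
}
```

&lt;p&gt;With two scissor rects the frame touches 21,504 pixels. The union of the same two rects would be the full 800×600 window, or 480,000 pixels.&lt;/p&gt;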

&lt;h3&gt;
  
  
  Persistent Layer Tree
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;UpdateLayerTree()&lt;/code&gt; reuses layer objects across frames instead of rebuilding the tree:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Allocs per frame (200 boundaries)&lt;/td&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduction&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97.9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Flutter calls this &lt;code&gt;addRetained&lt;/code&gt;. Android calls it &lt;code&gt;RenderNode&lt;/code&gt; reuse. We measured allocation profiles against both and matched their patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;v0.1.0&lt;/th&gt;
&lt;th&gt;v0.1.21&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines (total / code)&lt;/td&gt;
&lt;td&gt;150K / 105K&lt;/td&gt;
&lt;td&gt;195K / 141K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tests&lt;/td&gt;
&lt;td&gt;6,000&lt;/td&gt;
&lt;td&gt;7,200+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage&lt;/td&gt;
&lt;td&gt;97%&lt;/td&gt;
&lt;td&gt;97%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packages&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU idle (static UI)&lt;/td&gt;
&lt;td&gt;5-18%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frame skip check&lt;/td&gt;
&lt;td&gt;O(n) tree walk&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;O(1)&lt;/strong&gt; flat set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Allocs/frame (200 boundaries)&lt;/td&gt;
&lt;td&gt;613&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spinner GPU work&lt;/td&gt;
&lt;td&gt;full window&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;48×48 scissor&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Ecosystem Update
&lt;/h2&gt;

&lt;p&gt;The rendering pipeline required changes across four repositories. Here's where the ecosystem stands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;v0.17.13&lt;/td&gt;
&lt;td&gt;323K&lt;/td&gt;
&lt;td&gt;240K&lt;/td&gt;
&lt;td&gt;Shader compiler: WGSL → SPIR-V, MSL, GLSL, HLSL, DXIL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;v0.46.8&lt;/td&gt;
&lt;td&gt;240K&lt;/td&gt;
&lt;td&gt;171K&lt;/td&gt;
&lt;td&gt;2D graphics: Skia-class rasterizer, GPU SDF, scene compositor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;v0.27.3&lt;/td&gt;
&lt;td&gt;211K&lt;/td&gt;
&lt;td&gt;164K&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU: Vulkan, DX12, Metal, GLES, Software&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;v0.1.21&lt;/td&gt;
&lt;td&gt;195K&lt;/td&gt;
&lt;td&gt;141K&lt;/td&gt;
&lt;td&gt;GUI toolkit: 22 widgets, 4 themes, Layer Tree pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;v0.34.3&lt;/td&gt;
&lt;td&gt;61K&lt;/td&gt;
&lt;td&gt;45K&lt;/td&gt;
&lt;td&gt;App framework: windowing, input, three-mode render loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ gpucontext, gputypes, systray, audio&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;19K&lt;/td&gt;
&lt;td&gt;13K&lt;/td&gt;
&lt;td&gt;Shared interfaces, system tray, audio engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,049K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;774K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,140 files across 9 repositories&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;1M+ total lines. 774K lines of code. Zero CGO. Zero Rust. Zero C.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recent ecosystem highlights since the v0.1.0 article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First Pure Go DXIL generator&lt;/strong&gt; — naga compiles WGSL shaders directly to DXIL bytecode, eliminating the HLSL→FXC/DXC dependency. 161/170 IDxcValidator pass rate. &lt;a href="https://dev.to/kolkov/we-built-the-first-pure-go-dxil-generator-because-optimizing-the-wrong-path-wasnt-enough-35en"&gt;Article&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Born ML v0.8.0&lt;/strong&gt; migrated to gogpu/wgpu — production ML framework running on our GPU stack. 105 GPU tests pass, HRM model trained 20 epochs. &lt;a href="https://dev.to/kolkov/born-ml-v080-we-killed-our-last-dll-pure-go-gpu-is-here-2dd7"&gt;Article&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CJK text rendering&lt;/strong&gt; — script-aware hinting, exact-size rasterization, Tier 6 routing for Chinese/Japanese/Korean glyphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LCD ClearType auto-detection&lt;/strong&gt; — Windows SPI + registry, macOS None, Linux Xft/Wayland. Per-platform subpixel layout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software backend for CI&lt;/strong&gt; — deterministic GPU without GPU hardware. Pixel-exact e2e tests prove scissor rects at HAL level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community deep-dive&lt;/strong&gt; — independent &lt;a href="https://chenxutan.com/d/1987.html" rel="noopener noreferrer"&gt;technical analysis of gogpu/wgpu&lt;/a&gt; (Chinese) covering the zero-CGO syscall architecture, Snatchable resource lifecycle, and buffer state tracking internals. Always good to see the community dig into the implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Foundation Is Ready
&lt;/h2&gt;

&lt;p&gt;This is the release where we stopped rebuilding and started building on top.&lt;/p&gt;

&lt;p&gt;For the past two months every release was infrastructure: retained-mode rendering, scene composition, Layer Tree, damage tracking, boundary isolation. The kind of plumbing that's invisible to users but determines whether a framework can scale to real applications.&lt;/p&gt;

&lt;p&gt;That plumbing is now in place. The render pipeline follows the same architectural patterns as Flutter, Chrome, and Qt6 — not because we copied them, but because we studied all five independently and arrived at the same conclusions. Layer Tree composition, per-boundary GPU textures, multi-rect damage, persistent allocation — these are industry-proven patterns, and they're production-ready in gogpu/ui.&lt;/p&gt;

&lt;p&gt;The ecosystem has stabilized around this architecture. naga (shader compiler), wgpu (WebGPU HAL), gg (2D graphics), and gogpu (windowing) all reached the point where API churn is minimal and releases are incremental improvements, not rewrites. Nine repositories, 1M+ lines, and the dependency chain holds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means going forward:&lt;/strong&gt; the pipeline will be optimized, not rebuilt. Future releases will focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;New widgets&lt;/strong&gt; — the 22 we ship today cover most use cases, but enterprise apps need more (color picker, date picker, rich text editor, tree grid)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance polish&lt;/strong&gt; — reducing GPU usage for animated widgets from 10% to &amp;lt;3%, ListView recycling, texture GC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform accessibility&lt;/strong&gt; — UIA on Windows, AT-SPI2 on Linux, NSAccessibility on macOS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer experience&lt;/strong&gt; — better docs, more examples, smoother onboarding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard part is behind us. The interesting part is ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gogpu/ui.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ui/examples/gallery
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four design systems ship out of the box: Material Design 3, JetBrains DevTools, Microsoft Fluent, Apple Cupertino. Switch between them at runtime in the gallery example.&lt;/p&gt;

&lt;p&gt;Backend selection via environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GOGPU_GRAPHICS_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;vulkan   go run ./examples/ide/
&lt;span class="nv"&gt;GOGPU_GRAPHICS_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dx12     go run ./examples/ide/
&lt;span class="nv"&gt;GOGPU_GRAPHICS_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gles     go run ./examples/ide/
&lt;span class="nv"&gt;GOGPU_GRAPHICS_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;software go run ./examples/ide/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Help Us Get There
&lt;/h2&gt;

&lt;p&gt;gogpu/ui is at the stage where the architecture is proven but the user base is small. We need real-world testing to catch edge cases that even 97% test coverage will not find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test it.&lt;/strong&gt; Clone a repo, run an example, try building something with it. If it breaks — that's valuable. &lt;a href="https://github.com/gogpu/ui/issues" rel="noopener noreferrer"&gt;File an issue&lt;/a&gt;, and we'll fix it. If it works — that's valuable too. Tell us what you built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spread the word.&lt;/strong&gt; Most Go developers don't know this exists yet. A post on Reddit, a tweet, a mention in your team's Slack — it all helps. The project grows through people who try it and talk about it, not through marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write about it.&lt;/strong&gt; Tutorials, experience reports, comparisons, critiques — all welcome. If you build something interesting with gogpu/ui, write about the process. The ecosystem needs content from people other than us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contribute.&lt;/strong&gt; You don't need to touch the render pipeline. Documentation improvements, new examples, widget ideas, accessibility testing, CI on different hardware — there's work at every level. Check &lt;a href="https://github.com/gogpu/ui/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; or just open a discussion.&lt;/p&gt;

&lt;p&gt;The codebase is 1M+ lines of pure Go with zero CGO. The foundation is solid. What it needs now is people building on it.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://github.com/orgs/gogpu/discussions" rel="noopener noreferrer"&gt;Discussions&lt;/a&gt; · &lt;a href="https://github.com/gogpu/ui/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt; · &lt;a href="https://www.reddit.com/r/golang/" rel="noopener noreferrer"&gt;Reddit r/golang&lt;/a&gt; · &lt;a href="https://x.com" rel="noopener noreferrer"&gt;X/Twitter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>gui</category>
      <category>performance</category>
    </item>
    <item>
      <title>Born ML v0.8.0: We Killed Our Last .dll — Pure Go GPU Is Here</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Sun, 03 May 2026 12:41:30 +0000</pubDate>
      <link>https://dev.to/kolkov/born-ml-v080-we-killed-our-last-dll-pure-go-gpu-is-here-4mek</link>
      <guid>https://dev.to/kolkov/born-ml-v080-we-killed-our-last-dll-pure-go-gpu-is-here-4mek</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Born v0.8.0 replaces go-webgpu (Rust FFI + shared libraries) with gogpu/wgpu — pure Go WebGPU. No &lt;code&gt;.dll&lt;/code&gt;. No &lt;code&gt;.so&lt;/code&gt;. No runtime downloads. &lt;code&gt;go build&lt;/code&gt; now gives you a GPU-accelerated ML binary. We also fixed 5 critical GPU bugs and validated on real model training. Next up: DeepSeek V4 inference support.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Last Dependency
&lt;/h2&gt;

&lt;p&gt;Five months ago I &lt;a href="https://dev.to/kolkov/i-skipped-my-birthday-to-give-go-its-first-real-ml-framework-13gj"&gt;skipped my birthday&lt;/a&gt; to release Born. A few weeks later we &lt;a href="https://dev.to/kolkov/born-ml-v060-from-90-seconds-to-5-how-we-made-go-ml-training-actually-fast-19f8"&gt;made training 18x faster&lt;/a&gt; with lazy GPU evaluation. The framework was growing. Contributors were showing up. Real people were using it.&lt;/p&gt;

&lt;p&gt;But there was a problem I couldn't ignore anymore.&lt;/p&gt;

&lt;p&gt;Every time someone wanted to use GPU acceleration, the conversation went like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How do I run the GPU examples?"&lt;/p&gt;

&lt;p&gt;"Download wgpu-native &lt;code&gt;.dll&lt;/code&gt; for your platform, put it in your PATH..."&lt;/p&gt;

&lt;p&gt;"...I thought you said pure Go?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They were right. Born's CPU path was pure Go. But the GPU backend used &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu&lt;/a&gt; — Go bindings to Rust's wgpu-native via FFI. You needed a platform-specific shared library at runtime. On Windows, a &lt;code&gt;.dll&lt;/code&gt;. On Linux, a &lt;code&gt;.so&lt;/code&gt;. On macOS, a &lt;code&gt;.dylib&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a framework whose tagline is &lt;em&gt;"single binary deployment"&lt;/em&gt;, that was embarrassing.&lt;/p&gt;

&lt;p&gt;So we fixed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Earlier?
&lt;/h2&gt;

&lt;p&gt;Fair question. gogpu/wgpu existed for months before v0.8.0. Why did we ship 29 releases on go-webgpu first?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because that was the plan.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;go-webgpu wraps Rust's wgpu-native — a battle-tested GPU abstraction used by Firefox and dozens of production projects. When you're building a new ML framework from scratch, you don't want to debug your GPU backend and your tensor math at the same time. If training produces wrong gradients, is the bug in your autodiff engine or in your WebGPU implementation? With Rust wgpu-native underneath, we knew: the GPU layer works. Any bug is ours.&lt;/p&gt;

&lt;p&gt;So we built Born v0.1 through v0.7 on a proven foundation. Tensor ops, autodiff, attention, Flash Attention, speculative decoding, ONNX import, GGUF loading — all validated against a GPU backend we could trust. By v0.7.16, Born had 1,394 tests, 3 external contributors, and real model training working.&lt;/p&gt;

&lt;p&gt;Meanwhile, gogpu/wgpu was maturing through its own path — powering &lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt; (2D graphics library with GPU compute shaders), running real rendering workloads, stabilizing the Core API across Vulkan, Metal, DX12, and GLES.&lt;/p&gt;

&lt;p&gt;When both sides were proven, the migration became simple: we knew Born's code was correct, and we knew gogpu/wgpu's Core API was stable. Any bug found during migration was specifically a wgpu Go integration issue — easy to isolate, easy to fix.&lt;/p&gt;

&lt;p&gt;That's exactly what happened. Five bugs, all in resource lifecycle. All fixed in days, not weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate on proven foundation first. Swap the foundation second.&lt;/strong&gt; This is not how you move fast. This is how you move &lt;em&gt;right&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Migration
&lt;/h2&gt;

&lt;p&gt;Born v0.8.0 replaces &lt;code&gt;go-webgpu&lt;/code&gt; with &lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt; — a pure Go WebGPU implementation from our own &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt; ecosystem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;- github.com/go-webgpu/webgpu v0.4.1
&lt;/span&gt;&lt;span class="gi"&gt;+ github.com/gogpu/wgpu v0.26.8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line in &lt;code&gt;go.mod&lt;/code&gt;. 27 files changed. 1,830 additions, 1,518 deletions.&lt;/p&gt;

&lt;p&gt;What changed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;go-webgpu (before)&lt;/th&gt;
&lt;th&gt;gogpu/wgpu (after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Implementation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rust wgpu-native via FFI&lt;/td&gt;
&lt;td&gt;Pure Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CGO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (goffi)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtime .dll/.so&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go build&lt;/code&gt; + download .dll&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;go build&lt;/code&gt;. Period.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vulkan/Metal/DX12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via Rust&lt;/td&gt;
&lt;td&gt;Via Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WGSL shaders&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External project&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Our project&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row matters. gogpu/wgpu isn't some random dependency — it's &lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;our project&lt;/a&gt;. When Born needs a WebGPU API change, we change it upstream. Both sides of the interface are under our control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Bugs Nobody Told Us About
&lt;/h2&gt;

&lt;p&gt;Swapping the GPU backend is like replacing a car engine while driving. Everything looks the same from the outside, but internally the timing, resource lifecycle, and synchronization are completely different.&lt;/p&gt;

&lt;p&gt;We found five critical bugs during migration:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PipelineLayout Freed Too Early
&lt;/h3&gt;

&lt;p&gt;Vulkan requires compute pipeline layouts to stay alive during &lt;code&gt;SetBindGroup()&lt;/code&gt;. go-webgpu's internal reference counting kept them alive. gogpu/wgpu doesn't — you own your resources.&lt;/p&gt;

&lt;p&gt;We fixed this by storing &lt;code&gt;PipelineLayout&lt;/code&gt; alongside the pipeline in our cache.&lt;/p&gt;
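
&lt;p&gt;The shape of that fix might look like the sketch below. All types here are hypothetical stand-ins rather than Born or gogpu/wgpu code; the point is only that one cache entry owns both the pipeline and its layout, so the layout cannot be released while the pipeline is still in use.&lt;/p&gt;

```go
package main

import "fmt"

// PipelineLayout and Pipeline are hypothetical stand-ins for GPU handles.
type PipelineLayout struct{ released bool }

func (l *PipelineLayout) Release() { l.released = true }

type Pipeline struct{ name string }

// cacheEntry owns both objects, so the layout lives as long as
// the pipeline that was created from it.
type cacheEntry struct {
	pipeline *Pipeline
	layout   *PipelineLayout
}

type PipelineCache struct{ entries map[string]cacheEntry }

func NewPipelineCache() *PipelineCache {
	c := new(PipelineCache)
	c.entries = make(map[string]cacheEntry)
	return c
}

// Get returns the cached entry for name, creating it on first use.
func (c *PipelineCache) Get(name string) cacheEntry {
	if e, ok := c.entries[name]; ok {
		return e
	}
	p := new(Pipeline)
	p.name = name
	e := cacheEntry{pipeline: p, layout: new(PipelineLayout)}
	c.entries[name] = e
	return e
}

// Release frees layouts only when the whole cache is torn down.
func (c *PipelineCache) Release() {
	for name, e := range c.entries {
		e.layout.Release()
		delete(c.entries, name)
	}
}

func main() {
	c := NewPipelineCache()
	a := c.Get("matmul")
	b := c.Get("matmul")
	fmt.Println(a.layout == b.layout, a.layout.released) // true false
	c.Release()
	fmt.Println(a.layout.released) // true
}
```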

&lt;h3&gt;
  
  
  2. Lazy Ops and the Destroy Queue
&lt;/h3&gt;

&lt;p&gt;Born uses lazy evaluation: GPU ops chain without CPU sync. But when a tensor is garbage-collected mid-chain, its buffer goes to the destroy queue. If the pending operations haven't been submitted yet, the buffer is destroyed before the GPU reads it.&lt;/p&gt;

&lt;p&gt;Fix: immediate submit for lazy ops. Every operation submits its command encoder before returning.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Buffer Copy Race
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;copyGPUBuffer&lt;/code&gt; (used by &lt;code&gt;Data()&lt;/code&gt; to read results back to CPU) was queuing the copy but not submitting. The next operation might overwrite the source buffer before the copy executed.&lt;/p&gt;

&lt;p&gt;Fix: immediate submit after copy.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GC vs GPU
&lt;/h3&gt;

&lt;p&gt;Go's garbage collector doesn't know about GPU resources. A finalizer registered with &lt;code&gt;runtime.SetFinalizer&lt;/code&gt; on a tensor could run while the GPU was still computing with that tensor's buffer.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;runtime.KeepAlive()&lt;/code&gt; guards around every GPU operation that uses the tensor.&lt;/p&gt;
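
&lt;p&gt;Here is a small sketch of where the guard goes. &lt;code&gt;runtime.SetFinalizer&lt;/code&gt; and &lt;code&gt;runtime.KeepAlive&lt;/code&gt; are standard library functions; the &lt;code&gt;Tensor&lt;/code&gt; type, &lt;code&gt;NewTensor&lt;/code&gt;, and &lt;code&gt;Sum&lt;/code&gt; are hypothetical, and a plain slice stands in for the GPU buffer, so this toy version would not actually crash without the guard.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"runtime"
)

// Tensor is a hypothetical type; buf stands in for a GPU buffer handle.
type Tensor struct{ buf []float32 }

// NewTensor copies vals and attaches a finalizer that plays the role
// of sending the buffer to the destroy queue.
func NewTensor(vals ...float32) *Tensor {
	t := new(Tensor)
	t.buf = append([]float32(nil), vals...)
	runtime.SetFinalizer(t, func(t *Tensor) { t.buf = nil })
	return t
}

// Sum simulates a GPU operation that holds only the raw buffer,
// not the tensor that owns it.
func Sum(t *Tensor) float32 {
	buf := t.buf
	var s float32
	for _, v := range buf {
		s += v
	}
	// Without this call the runtime may treat t as unreachable after
	// the last read of t.buf and run the finalizer while the operation
	// is still using the buffer. KeepAlive marks t as live up to here.
	runtime.KeepAlive(t)
	return s
}

func main() {
	fmt.Println(Sum(NewTensor(1, 2, 3.5))) // 6.5
}
```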

&lt;h3&gt;
  
  
  5. Device Cleanup Order
&lt;/h3&gt;

&lt;p&gt;When destroying the GPU device, all pending work must complete first. Without &lt;code&gt;Poll(PollWait)&lt;/code&gt; before resource destruction, Vulkan validation layers scream.&lt;/p&gt;

&lt;p&gt;Fix: explicit &lt;code&gt;Poll(PollWait)&lt;/code&gt; in &lt;code&gt;Release()&lt;/code&gt; to ensure GPU idle.&lt;/p&gt;
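
&lt;p&gt;The ordering can be shown with a toy model. Only the names &lt;code&gt;Poll(PollWait)&lt;/code&gt; and &lt;code&gt;Release()&lt;/code&gt; come from the text; the &lt;code&gt;Device&lt;/code&gt; struct, &lt;code&gt;PollMode&lt;/code&gt;, and &lt;code&gt;Submit&lt;/code&gt; below are assumptions, and the real signatures may differ.&lt;/p&gt;

```go
package main

import "fmt"

// PollMode and Device are hypothetical types for this sketch.
type PollMode int

const PollWait PollMode = 1

type Device struct {
	pending int      // submitted work that has not finished yet
	events  []string // order of lifecycle events, for illustration
}

// Submit queues one unit of GPU work.
func (d *Device) Submit() { d.pending++ }

// Poll with PollWait returns only after all pending work is done.
// The loop stands in for blocking on the GPU.
func (d *Device) Poll(mode PollMode) {
	if mode != PollWait {
		return
	}
	for d.pending > 0 {
		d.pending--
	}
	d.events = append(d.events, "idle")
}

// Release drains the queue first and destroys resources second.
func (d *Device) Release() {
	d.Poll(PollWait)
	d.events = append(d.events, "destroy")
}

func main() {
	d := new(Device)
	d.Submit()
	d.Submit()
	d.Release()
	fmt.Println(d.pending, d.events) // 0 [idle destroy]
}
```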

&lt;p&gt;&lt;strong&gt;None of these bugs existed with go-webgpu.&lt;/strong&gt; They're all about resource lifecycle differences between Rust's ownership model (where wgpu-native tracks everything for you) and Go's GC-based model (where you track it yourself).&lt;/p&gt;

&lt;p&gt;After fixing all five, we ran all GPU tests and a 20-epoch model training with zero crashes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;h3&gt;
  
  
  True Single Binary
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go build &lt;span class="nt"&gt;-o&lt;/span&gt; myapp ./cmd/myapp
&lt;span class="c"&gt;# That's it. Ship the binary. GPU works.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;.dll&lt;/code&gt; downloads. No &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt;. No platform-specific install steps. The binary works on any machine with a Vulkan-capable GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Same API, Same Shaders
&lt;/h3&gt;

&lt;p&gt;If you have existing Born code with GPU, &lt;strong&gt;nothing changes&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/born-ml/born/backend/cpu"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/born-ml/born/autodiff"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// CPU-only (always worked)&lt;/span&gt;
&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;autodiff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/born-ml/born/backend/webgpu"&lt;/span&gt;

&lt;span class="c"&gt;// GPU-accelerated (now pure Go!)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;webgpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAvailable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
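    &lt;span class="c"&gt;// assumes the second return value of webgpu.New is an error&lt;/span&gt;
    &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;webgpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// or fall back to the CPU backend&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Release&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;autodiff&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;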
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WGSL shaders are unchanged. The Backend interface (52 methods) is unchanged. Your code just works — minus the &lt;code&gt;.dll&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validated on Real Training
&lt;/h3&gt;

&lt;p&gt;We didn't just run unit tests. We trained a real Hierarchical Reasoning Model (HRM) for 20 epochs on GPU. Zero crashes. Correct gradients. Same accuracy as go-webgpu.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Go source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~47K LOC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~34K LOC, 1,394 test functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ONNX operators&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend methods&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Contributors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 (&lt;a href="https://github.com/kolkov" rel="noopener noreferrer"&gt;@kolkov&lt;/a&gt;, &lt;a href="https://github.com/gmohmad" rel="noopener noreferrer"&gt;@gmohmad&lt;/a&gt;, &lt;a href="https://github.com/bennibbelink" rel="noopener noreferrer"&gt;@bennibbelink&lt;/a&gt;, &lt;a href="https://github.com/jsully1720" rel="noopener noreferrer"&gt;@jsully1720&lt;/a&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Releases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80 (organic, no marketing)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Community
&lt;/h2&gt;

&lt;p&gt;v0.8.0 isn't just about the migration. Since v0.7.0, three external contributors have landed real code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/jsully1720" rel="noopener noreferrer"&gt;@jsully1720&lt;/a&gt;&lt;/strong&gt; — ONNX Equal operator&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/bennibbelink" rel="noopener noreferrer"&gt;@bennibbelink&lt;/a&gt;&lt;/strong&gt; — Erf, Sign/Abs, Clamp ops (3 PRs, all full vertical slices: backend → CPU → GPU → autodiff → tests)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/gmohmad" rel="noopener noreferrer"&gt;@gmohmad&lt;/a&gt;&lt;/strong&gt; — LayerNorm, BatchMatMul broadcasting, Squeeze fix, 9 new ONNX ops, inplace mutation bug fix (5 PRs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't drive-by typo fixes. These are production-quality contributions from people who studied the codebase and followed the patterns. If you're considering contributing, look at what they did — that's the bar.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next: DeepSeek V4 Inference
&lt;/h2&gt;

&lt;p&gt;With the GPU backend stable and pure Go, we can focus on what matters: running real models.&lt;/p&gt;

&lt;p&gt;DeepSeek released &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;V4&lt;/a&gt; on April 24, 2026 — two models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;V4-Pro&lt;/strong&gt;: 1.6 trillion params, 49B active&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V4-Flash&lt;/strong&gt;: 284B total, 13B active — &lt;strong&gt;fits on a consumer GPU&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V4-Flash with 13B active parameters is Born's sweet spot. It's the most capable open model that fits on a single 24GB GPU. API pricing ($1.74/M tokens) is bottlenecked by chip availability, so users want local inference alternatives.&lt;/p&gt;

&lt;p&gt;We started researching V4 architecture &lt;strong&gt;before&lt;/strong&gt; it launched — back in early April, when only the Engram paper and V3.2 sparse attention existed. We predicted V4 would combine MoE + Engram + manifold-constrained residuals + compressed sparse attention. On April 24th, the tech report confirmed all four. Two weeks head start on architecture analysis. (We do this kind of research openly — see &lt;a href="https://github.com/born-ml/born/discussions/60" rel="noopener noreferrer"&gt;Discussion #60&lt;/a&gt; for our Recurrent-Depth Transformer analysis.)&lt;/p&gt;

&lt;p&gt;Here's the full component breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Top-16 sparse expert selection&lt;/td&gt;
&lt;td&gt;Also unlocks Mixtral, DBRX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MXFP4 Dequantization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FP4 expert weights with block scaling&lt;/td&gt;
&lt;td&gt;V4's native format — not INT4 GPTQ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;O(1) hash-lookup factual memory&lt;/td&gt;
&lt;td&gt;Unique to DeepSeek, DRAM-resident&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Three-Pool Attention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SWA + C4 + C128 compression&lt;/td&gt;
&lt;td&gt;1M context with &amp;lt;10% throughput drop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hyper-Connections (mHC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4D manifold-constrained residual&lt;/td&gt;
&lt;td&gt;Every transformer layer uses this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTP Drafting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Integrated speculative decoding&lt;/td&gt;
&lt;td&gt;~2.5 tokens accepted per step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KV Cache Tiering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CPU-GPU cache with LRU eviction&lt;/td&gt;
&lt;td&gt;128K+ context on 24GB consumer GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PD-Disaggregation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prefill/Decode split serving&lt;/td&gt;
&lt;td&gt;Production throughput scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total estimate: 22-30 weeks. It's a lot. But MoE routing alone unlocks V4, Mixtral, and &lt;a href="https://allenai.org/papers/bar" rel="noopener noreferrer"&gt;BAR&lt;/a&gt; (Allen AI's modular post-training). Each component is independently valuable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The GoGPU Ecosystem
&lt;/h2&gt;

&lt;p&gt;Born's GPU backend is powered by the &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt; ecosystem — pure Go GPU infrastructure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D graphics with GPU compute shaders&lt;/td&gt;
&lt;td&gt;~222K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;gogpu/naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader compiler (WGSL → SPIR-V, MSL, HLSL, GLSL, DXIL)&lt;/td&gt;
&lt;td&gt;~199K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU (Vulkan, Metal, DX12, GLES, Software)&lt;/td&gt;
&lt;td&gt;~156K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Graphics framework + windowing&lt;/td&gt;
&lt;td&gt;~52K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combined with Born's ~81K LOC, that's &lt;strong&gt;710K+ lines of pure Go GPU code&lt;/strong&gt;. No CGO. No Rust. Just &lt;code&gt;go build&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/born-ml/born
&lt;span class="nb"&gt;cd &lt;/span&gt;born
go build ./...
go &lt;span class="nb"&gt;test&lt;/span&gt; ./... &lt;span class="nt"&gt;-short&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the examples:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run from the repo root; each subshell restores the working directory&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/mnist &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; go run &lt;span class="nb"&gt;.&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;       &lt;span class="c"&gt;# MLP: 97.44% accuracy&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/mnist-cnn &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; go run &lt;span class="nb"&gt;.&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;# CNN: 98.18% accuracy&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/mnist-gpu &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; go run &lt;span class="nb"&gt;.&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;# GPU-accelerated inference&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPU examples now work with &lt;code&gt;go run&lt;/code&gt; — no &lt;code&gt;.dll&lt;/code&gt; download step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build This With Us
&lt;/h2&gt;

&lt;p&gt;Born is at an inflection point. GPU is stable. The architecture is proven. The roadmap to DeepSeek V4 is clear.&lt;/p&gt;

&lt;p&gt;We're not looking for passive users. We're looking for people who want to help build one of the best ML frameworks in the world. In Go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How you can make a difference:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File issues.&lt;/strong&gt; Found a bug? A missing operator? An edge case that breaks your model? Every issue makes Born more production-ready. Our three external contributors started exactly this way.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send PRs.&lt;/strong&gt; Missing tensor ops (TopK, Scatter — needed for MoE), CPU optimizations (the inner loops are naive — lots of low-hanging fruit), new ONNX operators, quantization infrastructure. Look at what &lt;a href="https://github.com/bennibbelink" rel="noopener noreferrer"&gt;@bennibbelink&lt;/a&gt; and &lt;a href="https://github.com/gmohmad" rel="noopener noreferrer"&gt;@gmohmad&lt;/a&gt; have done — full vertical slices, production quality. That's the standard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bring breakthrough ideas.&lt;/strong&gt; The hardest problems ahead — MoE routing, FP4 dequantization, compressed sparse attention, CPU-GPU cache tiering — are open research questions in Go. If you have insights on how to make these work efficiently in pure Go, we want to hear them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Challenge our assumptions.&lt;/strong&gt; Tell us what we're doing wrong. Tell us what's missing. The best frameworks are shaped by people who care enough to argue.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Found a bug?&lt;/strong&gt; &lt;a href="https://github.com/born-ml/born/issues" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Have a big idea?&lt;/strong&gt; &lt;a href="https://github.com/born-ml/born/discussions/4" rel="noopener noreferrer"&gt;Feature Requests &amp;amp; Roadmap Discussion&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Questions?&lt;/strong&gt; &lt;a href="https://github.com/born-ml/born/discussions/3" rel="noopener noreferrer"&gt;Getting Started &amp;amp; FAQ&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Ready to code?&lt;/strong&gt; &lt;a href="https://github.com/born-ml/born/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;Contributing Guide&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Link&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;github.com/born-ml/born&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;v0.8.0 Release&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/born-ml/born/releases/tag/v0.8.0" rel="noopener noreferrer"&gt;Release Notes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://pkg.go.dev/github.com/born-ml/born" rel="noopener noreferrer"&gt;pkg.go.dev/github.com/born-ml/born&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Roadmap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/born-ml/born/blob/main/ROADMAP.md" rel="noopener noreferrer"&gt;ROADMAP.md&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Changelog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/born-ml/born/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG.md&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoGPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;github.com/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;Five months ago, Born was a birthday project with zero stars. Today it's a pure Go ML framework with GPU acceleration, 4 contributors, 49 ONNX operators, and a roadmap to run DeepSeek V4.&lt;/p&gt;

&lt;p&gt;No &lt;code&gt;.dll&lt;/code&gt;. No &lt;code&gt;.so&lt;/code&gt;. No excuses. Models are born production-ready.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;go build&lt;/code&gt;. Ship. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Star us on GitHub: &lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;github.com/born-ml/born&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>We Built a Pure Go System Tray Library Because Every Alternative Requires CGO, GoGPU May 2026</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Thu, 30 Apr 2026 21:10:33 +0000</pubDate>
      <link>https://dev.to/kolkov/we-built-a-pure-go-system-tray-library-because-every-alternative-requires-cgo-gogpu-may-2026-3h2i</link>
      <guid>https://dev.to/kolkov/we-built-a-pure-go-system-tray-library-because-every-alternative-requires-cgo-gogpu-may-2026-3h2i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Love the no CGO — but quickly realized there's no code?"&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://github.com/gogpu/systray/issues/1" rel="noopener noreferrer"&gt;@cmilesio, gogpu/systray#1&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fair point. We published the repo with just a README and a dream. Three days later: &lt;strong&gt;5,800+ lines of Pure Go&lt;/strong&gt;, three platforms, 74 tests, 84% coverage, and a working system tray icon on Windows.&lt;/p&gt;

&lt;p&gt;Today we're releasing &lt;strong&gt;&lt;a href="https://github.com/gogpu/systray" rel="noopener noreferrer"&gt;gogpu/systray v0.1.0&lt;/a&gt;&lt;/strong&gt; — the first Pure Go system tray library that works on Windows, macOS, and Linux without a C compiler.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every existing Go system tray library requires CGO on at least one platform:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;CGO?&lt;/th&gt;
&lt;th&gt;The Catch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/getlantern/systray" rel="noopener noreferrer"&gt;getlantern/systray&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3.3K&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt; (macOS, Linux)&lt;/td&gt;
&lt;td&gt;AppIndicator + GTK3 on Linux, Cocoa via CGO on macOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/fyne-io/systray" rel="noopener noreferrer"&gt;fyne-io/systray&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;fork&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt; (macOS, Linux)&lt;/td&gt;
&lt;td&gt;Same CGO deps, fork of getlantern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/energye/systray" rel="noopener noreferrer"&gt;energye/systray&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Walk/LCL dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CGO means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need a C compiler installed (&lt;code&gt;apt install gcc&lt;/code&gt;, Xcode, MinGW)&lt;/li&gt;
&lt;li&gt;Cross-compilation breaks (&lt;code&gt;GOOS=linux&lt;/code&gt; from macOS? Good luck with CGO)&lt;/li&gt;
&lt;li&gt;Larger binaries, slower builds&lt;/li&gt;
&lt;li&gt;Builds with &lt;code&gt;CGO_ENABLED=0&lt;/code&gt; fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go is famous for "single binary, cross-compile anywhere." CGO breaks that promise.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: Native APIs via Pure Go FFI
&lt;/h2&gt;

&lt;p&gt;We went platform-native without CGO:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Native API&lt;/th&gt;
&lt;th&gt;Go FFI&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Shell_NotifyIconW&lt;/code&gt; (shell32.dll)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;golang.org/x/sys/windows&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,027&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;macOS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;NSStatusBar&lt;/code&gt; / &lt;code&gt;NSStatusItem&lt;/code&gt; (AppKit)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;go-webgpu/goffi&lt;/code&gt; (ObjC runtime)&lt;/td&gt;
&lt;td&gt;1,385&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Linux&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatusNotifierItem (D-Bus SNI)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;godbus/dbus/v5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;810&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No C compiler. No shared libraries. No &lt;code&gt;dlopen&lt;/code&gt; of GTK. Just Go talking directly to the OS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows: Shell_NotifyIconW
&lt;/h3&gt;

&lt;p&gt;The Win32 approach is straightforward — &lt;code&gt;Shell_NotifyIconW&lt;/code&gt; has been the tray API since Windows 95. We call it via &lt;code&gt;golang.org/x/sys/windows&lt;/code&gt;, the same way the Go standard library talks to Windows.&lt;/p&gt;

&lt;p&gt;Key details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message-only HWND&lt;/strong&gt; for callbacks (invisible, no taskbar entry)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOTIFYICON_VERSION_4&lt;/strong&gt; for modern event dispatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explorer crash recovery&lt;/strong&gt; — when &lt;code&gt;explorer.exe&lt;/code&gt; restarts, tray icons disappear. We listen for the &lt;code&gt;TaskbarCreated&lt;/code&gt; registered message and re-add the icon automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dark mode auto-switching&lt;/strong&gt; — detect &lt;code&gt;WM_SETTINGCHANGE&lt;/code&gt; + &lt;code&gt;ImmersiveColorSet&lt;/code&gt;, read &lt;code&gt;SystemUsesLightTheme&lt;/code&gt; registry key, swap HICON. Your tray icon adapts when the user toggles Windows dark mode.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  macOS: NSStatusBar via ObjC Runtime
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. Calling AppKit without CGO requires speaking the Objective-C runtime protocol:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;objc_getClass("NSStatusBar")&lt;/code&gt;: get the class&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;objc_msgSend(class, sel("systemStatusBar"))&lt;/code&gt;: get the shared status bar&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;objc_msgSend(statusBar, sel("statusItemWithLength:"), -1.0)&lt;/code&gt;: create a status item (&lt;code&gt;-1.0&lt;/code&gt; is &lt;code&gt;NSVariableStatusItemLength&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built a minimal ObjC runtime wrapper (~490 LOC) using &lt;a href="https://github.com/go-webgpu/goffi" rel="noopener noreferrer"&gt;goffi&lt;/a&gt; — our Pure Go FFI library. Same approach we use for the Metal GPU backend in &lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The killer feature on macOS: &lt;strong&gt;template icons&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;tray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTemplateIcon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monochromePNG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sets the image's template flag (&lt;code&gt;[image setTemplate:YES]&lt;/code&gt; on the &lt;code&gt;NSImage&lt;/code&gt; instance), telling macOS the icon is a monochrome mask. The OS automatically renders it white on dark menu bars and black on light ones. No dark mode handling is needed: Apple does it for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux: D-Bus StatusNotifierItem
&lt;/h3&gt;

&lt;p&gt;Linux is the most complex platform. The "system tray" isn't a single API — it's a D-Bus protocol called &lt;a href="https://www.freedesktop.org/wiki/Specifications/StatusNotifierItem/" rel="noopener noreferrer"&gt;StatusNotifierItem&lt;/a&gt; (SNI).&lt;/p&gt;

&lt;p&gt;We implement two D-Bus interfaces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;org.kde.StatusNotifierItem&lt;/code&gt;&lt;/strong&gt; — the tray icon itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Properties: &lt;code&gt;Category&lt;/code&gt;, &lt;code&gt;Id&lt;/code&gt;, &lt;code&gt;Title&lt;/code&gt;, &lt;code&gt;Status&lt;/code&gt;, &lt;code&gt;IconPixmap&lt;/code&gt;, &lt;code&gt;ToolTip&lt;/code&gt;, &lt;code&gt;Menu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Methods: &lt;code&gt;Activate&lt;/code&gt; (click), &lt;code&gt;SecondaryActivate&lt;/code&gt; (middle-click), &lt;code&gt;ContextMenu&lt;/code&gt; (right-click)&lt;/li&gt;
&lt;li&gt;Signals: &lt;code&gt;NewIcon&lt;/code&gt;, &lt;code&gt;NewTitle&lt;/code&gt;, &lt;code&gt;NewStatus&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;com.canonical.dbusmenu&lt;/code&gt;&lt;/strong&gt; — the context menu:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A recursive tree of menu items with labels, types, toggle states&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GetLayout&lt;/code&gt; returns the full tree, &lt;code&gt;Event&lt;/code&gt; dispatches clicks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And a registration dance with &lt;code&gt;org.kde.StatusNotifierWatcher&lt;/code&gt; — plus automatic re-registration when the desktop panel restarts.&lt;/p&gt;

&lt;p&gt;The PNG→ARGB conversion is a fun detail: SNI wants ARGB32 in network byte order (big-endian), so we decode the PNG with &lt;code&gt;image/png&lt;/code&gt; and manually pack &lt;code&gt;[A, R, G, B]&lt;/code&gt; bytes.&lt;/p&gt;

&lt;p&gt;All of this via &lt;a href="https://github.com/godbus/dbus" rel="noopener noreferrer"&gt;godbus/dbus/v5&lt;/a&gt; — the canonical Pure Go D-Bus library. Zero CGO.&lt;/p&gt;




&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;We went with a builder pattern inspired by &lt;a href="https://v3alpha.wails.io/" rel="noopener noreferrer"&gt;Wails 3&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/systray"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tray&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;systray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;menu&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;systray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMenu&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Open"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Opening..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSeparator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddCheckbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Dark Mode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Toggled!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSubmenu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"More..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;systray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMenu&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"About"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"v1.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Help"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Help!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSeparator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Quit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;tray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Remove&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;tray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetIcon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;iconPNG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;SetDarkModeIcon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;darkIconPNG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;SetTooltip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"My App"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;SetMenu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;menu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;tray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Clicked!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;tray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// blocks, pumps platform messages&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multiple trays&lt;/strong&gt; are supported: each call to &lt;code&gt;systray.New()&lt;/code&gt; creates an independent icon.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;mainTray&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;systray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetIcon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;appIcon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetMenu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mainMenu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;statusTray&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;systray&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetIcon&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;statusIcon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTooltip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Status: OK"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Enterprise Research
&lt;/h2&gt;

&lt;p&gt;We didn't guess at the architecture. Before writing code, we studied how the big frameworks do it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Tray Architecture&lt;/th&gt;
&lt;th&gt;Our Takeaway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qt6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;QPlatformSystemTrayIcon&lt;/code&gt; → 3 platform implementations&lt;/td&gt;
&lt;td&gt;Three-layer pattern (public API → interface → platform impl)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wails 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;systemTrayImpl&lt;/code&gt; interface, native per-platform&lt;/td&gt;
&lt;td&gt;Builder API pattern, multiple tray support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDL3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 backends (AppIndicator, D-Bus, Win32, Cocoa)&lt;/td&gt;
&lt;td&gt;We chose D-Bus SNI directly, skipping AppIndicator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Electron&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nativeTheme.shouldUseDarkColors&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dark mode detection via &lt;code&gt;WM_SETTINGCHANGE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GLFW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RemovePropW&lt;/code&gt; before &lt;code&gt;DestroyWindow&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Destroy pattern (avoid deadlocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;getlantern/systray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GetMessage&lt;/code&gt; loop, hidden &lt;code&gt;WS_OVERLAPPEDWINDOW&lt;/code&gt; window&lt;/td&gt;
&lt;td&gt;Message pump pattern (we use &lt;code&gt;HWND_MESSAGE&lt;/code&gt; instead)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture follows Qt6's &lt;code&gt;QPlatformSystemTrayIcon&lt;/code&gt; pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systray.New()  →  SystemTray (public API, delegation)
                       │
                  PlatformTray (internal interface)
                       │
          ┌────────────┼────────────┐
     Win32 impl   macOS impl   Linux impl
     Shell_Notify  NSStatusBar   D-Bus SNI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
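
&lt;p&gt;The three layers in the diagram can be sketched in a few lines of Go. This is a minimal illustration, not the library's source: &lt;code&gt;SystemTray&lt;/code&gt; and &lt;code&gt;PlatformTray&lt;/code&gt; are the names from the diagram, while the two-method interface, &lt;code&gt;fakeTray&lt;/code&gt;, and &lt;code&gt;newWithPlatform&lt;/code&gt; are assumptions made up for the example.&lt;/p&gt;

```go
package main

import "fmt"

// PlatformTray is the internal interface each platform backend implements.
// The method set here is a hypothetical subset chosen for illustration.
type PlatformTray interface {
	SetTooltip(text string)
	Show()
}

// fakeTray stands in for a Win32, macOS, or Linux backend and records calls.
type fakeTray struct {
	calls []string
}

func (f *fakeTray) SetTooltip(text string) { f.calls = append(f.calls, "tooltip:"+text) }
func (f *fakeTray) Show()                  { f.calls = append(f.calls, "show") }

// SystemTray is the public type. It holds no platform logic and only delegates.
type SystemTray struct {
	impl PlatformTray
}

// newWithPlatform is a hypothetical constructor that injects the backend.
func newWithPlatform(impl PlatformTray) *SystemTray {
	t := new(SystemTray)
	t.impl = impl
	return t
}

// SetTooltip forwards to the backend and returns the tray for chaining.
func (t *SystemTray) SetTooltip(text string) *SystemTray {
	t.impl.SetTooltip(text)
	return t
}

// Show forwards to the backend and returns the tray for chaining.
func (t *SystemTray) Show() *SystemTray {
	t.impl.Show()
	return t
}

func main() {
	backend := new(fakeTray)
	newWithPlatform(backend).SetTooltip("Status: OK").Show()
	fmt.Println(backend.calls)
}
```

&lt;p&gt;Because the public type only sees the interface, a test can swap in a recording backend like the one above without touching any OS API.&lt;/p&gt;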






&lt;h2&gt;
  
  
  Why Not AppIndicator on Linux?
&lt;/h2&gt;

&lt;p&gt;The tempting path: &lt;code&gt;dlopen("libayatana-appindicator3.so.1")&lt;/code&gt; and let GTK3 handle everything. That's what getlantern/systray does (via CGO).&lt;/p&gt;

&lt;p&gt;Problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pulls in GTK3 runtime&lt;/strong&gt; — gigantic dependency for a tray icon&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's just a wrapper around SNI&lt;/strong&gt; — AppIndicator talks D-Bus SNI internally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Icon caching bugs&lt;/strong&gt; — AppIndicator caches icons by filename, causing stale icons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not available everywhere&lt;/strong&gt; — minimal compositors (Sway, Hyprland) don't have AppIndicator&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We cut out the middleman. D-Bus SNI directly via godbus — same protocol, no GTK, no CGO.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total:     ~5,800 lines of Pure Go (6,900 with docs/CI/configs)
Tests:     74 (84% public API coverage)
Platforms: Windows ✅, macOS ✅, Linux ✅
Deps:      golang.org/x/sys, go-webgpu/goffi, godbus/dbus/v5
CGO:       Zero. Absolutely zero.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/gogpu/systray@v0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/gogpu/systray
&lt;span class="nb"&gt;cd &lt;/span&gt;systray/examples/basic
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A green icon appears in your system tray. Right-click it for the menu, then toggle your OS dark mode to watch the icon switch automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We need testers&lt;/strong&gt; — especially on macOS and Linux (KDE, GNOME + AppIndicator extension, XFCE, Sway). &lt;a href="https://github.com/gogpu/systray/issues" rel="noopener noreferrer"&gt;File issues&lt;/a&gt; if something doesn't work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part of GoGPU
&lt;/h2&gt;

&lt;p&gt;systray is standalone (&lt;code&gt;go get github.com/gogpu/systray&lt;/code&gt; — no gogpu dependency), but it's designed to integrate with the &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU ecosystem&lt;/a&gt; — 800K+ lines of Pure Go GPU code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU (Vulkan/Metal/DX12/GLES)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader compiler (WGSL → SPIR-V/MSL/GLSL/HLSL/DXIL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D graphics (Skia-class rasterizer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;App framework, windowing, input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GUI toolkit (22+ widgets, Material 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/gogpu/systray" rel="noopener noreferrer"&gt;systray&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;System tray (this library)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All Pure Go. All zero CGO. All cross-platform. Four gogpu libraries are listed in &lt;a href="https://github.com/avelino/awesome-go" rel="noopener noreferrer"&gt;awesome-go&lt;/a&gt;: systray (GUI Interaction), ui (GUI Toolkits), gg (Images), gogpu (Game Development).&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you build something with systray, &lt;a href="https://github.com/gogpu/gogpu/discussions" rel="noopener noreferrer"&gt;let us know&lt;/a&gt;. Star ⭐ the repo if you find it useful — it helps others discover the project.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>programming</category>
      <category>linux</category>
    </item>
    <item>
      <title>GoGPU - 790K Lines of Pure Go: Multi-Window GPU Apps, DXIL Compiler, and Why We Don't Need CGO</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:14:08 +0000</pubDate>
      <link>https://dev.to/kolkov/gogpu-790k-lines-of-pure-go-multi-window-gpu-apps-dxil-compiler-and-why-we-dont-need-cgo-3i94</link>
      <guid>https://dev.to/kolkov/gogpu-790k-lines-of-pure-go-multi-window-gpu-apps-dxil-compiler-and-why-we-dont-need-cgo-3i94</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Go deserves more support in GUI development"&lt;/em&gt;&lt;br&gt;
— &lt;a href="https://www.reddit.com/r/golang/comments/1pdw9i7/go_deserves_more_support_in_gui_development/" rel="noopener noreferrer"&gt;r/golang, October 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That Reddit post hit a nerve. Hundreds of upvotes, dozens of comments echoing the same frustration: Go has world-class networking, databases, and CLI tools — but when it comes to graphics and GUI, the ecosystem says "just use Electron" or "call into C."&lt;/p&gt;

&lt;p&gt;I'd been planning a Pure Go GPU stack for years. I spent the year before that post studying exactly what to build and how: dissecting Vulkan, Metal, and DX12, reading wgpu source code, and analyzing Qt and Flutter architectures. That Reddit thread was the final push.&lt;/p&gt;

&lt;p&gt;On &lt;strong&gt;December 5, 2025&lt;/strong&gt;, the &lt;a href="https://dev.to/kolkov/gogpu-a-pure-go-graphics-library-for-gpu-programming-2j5d"&gt;first window with a triangle&lt;/a&gt; appeared on screen. Pure Go, zero CGO, Vulkan rendering.&lt;/p&gt;

&lt;p&gt;Less than five months later: &lt;strong&gt;790K lines of Pure Go&lt;/strong&gt;, &lt;strong&gt;13 repositories&lt;/strong&gt;, &lt;strong&gt;678 GitHub stars&lt;/strong&gt; across the ecosystem, and people building real software on it — from Quake engines to ML frameworks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Lines&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D graphics (Skia-class rasterizer)&lt;/td&gt;
&lt;td&gt;219K&lt;/td&gt;
&lt;td&gt;v0.43.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader compiler (WGSL→SPIR-V/MSL/GLSL/HLSL/DXIL)&lt;/td&gt;
&lt;td&gt;195K&lt;/td&gt;
&lt;td&gt;v0.17.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GUI toolkit (22+ widgets, 4 themes, 97% coverage)&lt;/td&gt;
&lt;td&gt;171K&lt;/td&gt;
&lt;td&gt;v0.1.14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;WebGPU implementation (Vulkan/DX12/Metal/GLES/Software)&lt;/td&gt;
&lt;td&gt;145K&lt;/td&gt;
&lt;td&gt;v0.26.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Application framework, windowing, input&lt;/td&gt;
&lt;td&gt;50K&lt;/td&gt;
&lt;td&gt;v0.30.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/g3d" rel="noopener noreferrer"&gt;g3d&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3D rendering (scene graph, PBR, GLTF)&lt;/td&gt;
&lt;td&gt;planned&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/compose" rel="noopener noreferrer"&gt;compose&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Jetpack Compose-style declarative UI&lt;/td&gt;
&lt;td&gt;planned&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/systray" rel="noopener noreferrer"&gt;systray&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Cross-platform system tray&lt;/td&gt;
&lt;td&gt;planned&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ 4 more&lt;/td&gt;
&lt;td&gt;gpucontext, gputypes, gg-pdf, gg-svg&lt;/td&gt;
&lt;td&gt;9K&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13 repos, Pure Go, zero CGO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~790K&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;790K lines of Go. 577K of that is pure code (excluding blanks and comments). 421 commits in the last two months alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stars across the ecosystem:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repo&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gogpu&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ui&lt;/td&gt;
&lt;td&gt;211&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gg&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wgpu&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;naga&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;678&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  How It Started
&lt;/h2&gt;

&lt;p&gt;The gap between Go's server capabilities and its desktop capabilities has bothered me for years. Go has &lt;code&gt;net/http&lt;/code&gt; that scales to millions of connections. It has database libraries that handle petabytes. But ask "how do I draw a triangle?" and you get answers involving CGO, shared libraries, or wrapping Electron in a Go process.&lt;/p&gt;

&lt;p&gt;The existing options in 2024-2025:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;The Catch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/wailsapp/wails" rel="noopener noreferrer"&gt;Wails&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~34K&lt;/td&gt;
&lt;td&gt;Go backend + System WebView&lt;/td&gt;
&lt;td&gt;Not native rendering. Your "Go app" is HTML/CSS/JS with JSON IPC.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/fyne-io/fyne" rel="noopener noreferrer"&gt;Fyne&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~28K&lt;/td&gt;
&lt;td&gt;Go + OpenGL via CGO&lt;/td&gt;
&lt;td&gt;Requires C compiler. Custom widget look, not truly native.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/hajimehoshi/ebiten" rel="noopener noreferrer"&gt;Ebiten&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~13K&lt;/td&gt;
&lt;td&gt;Go + OpenGL/Metal/DX (purego)&lt;/td&gt;
&lt;td&gt;Game engine, not a GUI toolkit. No widget system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/cogentcore/core" rel="noopener noreferrer"&gt;Cogent Core&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~2.3K&lt;/td&gt;
&lt;td&gt;Go + Vulkan via CGO (glfw)&lt;/td&gt;
&lt;td&gt;CGO required. Large dependency tree.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gioui/gio" rel="noopener noreferrer"&gt;Gio&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~2.1K&lt;/td&gt;
&lt;td&gt;Pure Go, immediate mode&lt;/td&gt;
&lt;td&gt;Small widget set. Pre-1.0 API. No retained mode.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every major framework either requires CGO or delegates rendering to a WebView. The only Pure Go option (Gio) is immediate-mode with a minimal widget set. Nothing offers native GPU rendering (Vulkan/Metal/DX12) without a C compiler.&lt;/p&gt;

&lt;p&gt;I spent a year studying the problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read Rust wgpu's source code to understand how a modern WebGPU implementation works&lt;/li&gt;
&lt;li&gt;Studied Qt6's RHI, GTK4's GSK renderer, Flutter's Impeller to understand GUI rendering pipelines&lt;/li&gt;
&lt;li&gt;Analyzed Vulkan, Metal, DX12, and OpenGL ES specs to map out the abstraction boundaries&lt;/li&gt;
&lt;li&gt;Designed the layered architecture on paper before writing Go code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On &lt;strong&gt;December 5, 2025&lt;/strong&gt;, the &lt;a href="https://dev.to/kolkov/gogpu-a-pure-go-graphics-library-for-gpu-programming-2j5d"&gt;first window with a triangle&lt;/a&gt; appeared on screen — rendered through &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu/webgpu&lt;/a&gt; (our Rust FFI bindings) and the gogpu framework. The Rust backend proved the architecture worked. Then, in parallel, &lt;code&gt;naga&lt;/code&gt; (shader compiler) and &lt;code&gt;wgpu&lt;/code&gt; (Pure Go WebGPU) began replacing the Rust dependencies one by one. Then came &lt;code&gt;gg&lt;/code&gt; (2D graphics), and finally &lt;code&gt;ui&lt;/code&gt; (GUI toolkit).&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Window: The Feature That Changes Everything
&lt;/h2&gt;

&lt;p&gt;Single-window apps are fine for demos. Real applications — IDEs, design tools, database clients — need multiple windows. We shipped multi-window support in gogpu v0.28.0, and it was the hardest architectural change we've made.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Every major Go GUI framework (Fyne, Gio, Ebiten) either doesn't support multi-window or hacks it with separate processes. The reason is simple: GPU devices are expensive to create, but surfaces (swapchains) are per-window. You need to share one device across many windows without data races.&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Built
&lt;/h3&gt;

&lt;p&gt;We studied 7 frameworks — Qt6, GTK4, SDL3, winit, GLFW, Fyne, Gio — and wrote a 26-page Architecture Decision Record before writing a line of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            ┌────────────────────────────────┐
            │      Shared GPU Context        │
            │  Instance / Adapter / Device   │
            │  Queue / Pipelines / Textures  │
            └───────┬──────────┬─────────┬───┘
                    │          │         │
            ┌───────┴───┐ ┌───┴────┐ ┌──┴─────────┐
            │ Window 1  │ │Window 2│ │ Window 3   │
            │ Surface   │ │Surface │ │ Surface    │
            │ Swapchain │ │Swapch. │ │ Swapchain  │
            │ Callbacks │ │Callbk. │ │ Callbacks  │
            └───────────┘ └────────┘ └────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API is intentionally simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithTitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Main Window"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;WithSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c"&gt;// Second window — shares GPU device, gets its own surface&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewWindow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WindowConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Tool Palette"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Width&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Height&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnDraw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// dc has its own surface, shared device&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood: &lt;code&gt;PlatformManager&lt;/code&gt; handles process-level concerns (Win32 &lt;code&gt;RegisterClass&lt;/code&gt;, Cocoa &lt;code&gt;NSApplication&lt;/code&gt;, X11 &lt;code&gt;Display&lt;/code&gt;), while each &lt;code&gt;PlatformWindow&lt;/code&gt; manages its own slice of the message pump, its surface, and its swapchain. The &lt;code&gt;WindowManager&lt;/code&gt; tracks all windows by a monotonic &lt;code&gt;WindowID&lt;/code&gt; that stays stable across window recreation and is serializable for event queues. SDL3 uses the same pattern with &lt;code&gt;SDL_GetNextObjectID()&lt;/code&gt;.&lt;/p&gt;
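
&lt;p&gt;A rough sketch of the ID scheme, under stated assumptions: &lt;code&gt;WindowID&lt;/code&gt; is the name used above, but &lt;code&gt;windowManager&lt;/code&gt;, its methods, and the string "handles" are hypothetical stand-ins, not gogpu's actual types. The point is that the ID survives when the native handle behind it is replaced, and a removed ID is never handed out again.&lt;/p&gt;

```go
package main

import "fmt"

// WindowID is an opaque handle; values only ever increase.
type WindowID uint64

// windowManager is a hypothetical registry keyed by WindowID.
type windowManager struct {
	next    WindowID
	handles map[WindowID]string // ID to native handle (a string for illustration)
}

func newWindowManager() *windowManager {
	m := new(windowManager)
	m.handles = make(map[WindowID]string)
	return m
}

// add registers a native handle and returns the next ID. IDs start at 1 so
// the zero value can mean "no window".
func (m *windowManager) add(handle string) WindowID {
	m.next++
	m.handles[m.next] = handle
	return m.next
}

// recreate swaps the native handle behind an existing ID. Queued events that
// carry the ID keep resolving to the same logical window.
func (m *windowManager) recreate(id WindowID, handle string) bool {
	if _, ok := m.handles[id]; !ok {
		return false
	}
	m.handles[id] = handle
	return true
}

// remove forgets a window. Its ID is never reused, so stale events that
// still carry it simply fail the lookup.
func (m *windowManager) remove(id WindowID) { delete(m.handles, id) }

func main() {
	m := newWindowManager()
	mainWin := m.add("hwnd-A")
	tools := m.add("hwnd-B")

	m.recreate(mainWin, "hwnd-C") // same ID, new native window
	m.remove(tools)
	again := m.add("hwnd-D") // gets a fresh ID, not the removed one

	fmt.Println(mainWin, m.handles[mainWin], tools, again)
}
```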

&lt;h3&gt;
  
  
  VSync Strategy
&lt;/h3&gt;

&lt;p&gt;Primary window runs Fifo (VSync), secondary windows run Immediate. Focus changes switch the VSync mode — the window you're looking at gets smooth frames, background windows don't waste GPU cycles. This is what Qt6 does with &lt;code&gt;QRhi&lt;/code&gt;.&lt;/p&gt;
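
&lt;p&gt;One way to read that rule is as a small function that runs on every focus change. The sketch below is an interpretation, not gogpu code: &lt;code&gt;window&lt;/code&gt;, &lt;code&gt;presentMode&lt;/code&gt;, and &lt;code&gt;setFocus&lt;/code&gt; are hypothetical names, and a real implementation would also have to reconfigure each surface after changing its mode.&lt;/p&gt;

```go
package main

import "fmt"

// presentMode mirrors the two swapchain modes named above.
type presentMode string

const (
	fifo      presentMode = "Fifo"      // waits for vertical blank
	immediate presentMode = "Immediate" // presents without waiting
)

// window is a hypothetical stand-in for a per-window surface configuration.
type window struct {
	title string
	mode  presentMode
}

// setFocus assigns Fifo to the focused window and Immediate to all others.
func setFocus(windows []*window, focused *window) {
	for _, w := range windows {
		if w == focused {
			w.mode = fifo
		} else {
			w.mode = immediate
		}
	}
}

func main() {
	mainWin := new(window)
	mainWin.title = "Main Window"
	tools := new(window)
	tools.title = "Tool Palette"
	all := []*window{mainWin, tools}

	setFocus(all, mainWin)
	fmt.Println(mainWin.title, mainWin.mode, "|", tools.title, tools.mode)

	setFocus(all, tools) // the user clicks the tool palette
	fmt.Println(mainWin.title, mainWin.mode, "|", tools.title, tools.mode)
}
```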

&lt;h3&gt;
  
  
  Cross-Platform
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Multi-Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;Vulkan, DX12&lt;/td&gt;
&lt;td&gt;✅ Win32 &lt;code&gt;CreateWindowEx&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS&lt;/td&gt;
&lt;td&gt;Metal&lt;/td&gt;
&lt;td&gt;✅ Cocoa &lt;code&gt;NSWindow&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux X11&lt;/td&gt;
&lt;td&gt;Vulkan&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;XCreateWindow&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux Wayland&lt;/td&gt;
&lt;td&gt;Vulkan&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;xdg_toplevel&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All 4 platforms support multi-window with shared GPU device. EventFocus events route correctly across windows on all platforms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Event-Driven Frame Pacing
&lt;/h2&gt;

&lt;p&gt;Enterprise GUI frameworks don't render on every OS event. They render only when something visual actually changed. Our render loop had a shortcut:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;continuous&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;invalidated&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;hasEvents&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;renderFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;hasEvents&lt;/code&gt; flag meant every mouse move, every platform event triggered a full render cycle — even when nothing on screen changed. Moving the mouse over a static window caused unnecessary GPU work on every frame.&lt;/p&gt;

&lt;p&gt;We researched 6 frameworks (winit, Gio, Qt6, Flutter, SDL3, Ebiten) and found they all use the same pattern: &lt;strong&gt;handlers decide when to invalidate, the render loop never guesses&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now in v0.30.0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;continuous&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;invalidated&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;renderFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resize and focus events call &lt;code&gt;RequestRedraw()&lt;/code&gt; explicitly. Mouse events reach the UI framework, which calls &lt;code&gt;RequestRedraw()&lt;/code&gt; only when a widget's visual state actually changes.&lt;/p&gt;

&lt;p&gt;Mouse move over static UI: &lt;strong&gt;0% GPU&lt;/strong&gt;. Hover over a button: UI calls &lt;code&gt;RequestRedraw()&lt;/code&gt;, one frame renders. Exactly how winit and Flutter work.&lt;/p&gt;
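
&lt;p&gt;The contract is easy to model. In the toy below, the loop condition and &lt;code&gt;RequestRedraw()&lt;/code&gt; come from the text above; the &lt;code&gt;loop&lt;/code&gt; and &lt;code&gt;button&lt;/code&gt; types and the string-based hit test are assumptions invented to keep the example self-contained.&lt;/p&gt;

```go
package main

import "fmt"

// loop is a toy render loop with an invalidation flag.
type loop struct {
	continuous  bool
	invalidated bool
	frames      int
}

// RequestRedraw marks the window dirty; the next tick renders one frame.
func (l *loop) RequestRedraw() { l.invalidated = true }

// tick runs one iteration and renders only when something asked for it.
func (l *loop) tick() {
	if l.continuous || l.invalidated {
		l.frames++
		l.invalidated = false
	}
}

// button is a hypothetical widget that redraws only when hover state flips.
type button struct {
	name    string
	hovered bool
}

// onMouseMove receives the name of the widget under the cursor ("" for
// none) and invalidates only when the button's visual state changes.
func (b *button) onMouseMove(l *loop, target string) {
	inside := target == b.name
	if inside != b.hovered {
		b.hovered = inside
		l.RequestRedraw()
	}
}

func main() {
	l := new(loop)
	ok := new(button)
	ok.name = "ok"

	// Three moves over static background, then one move onto the button.
	for _, target := range []string{"", "", "", "ok"} {
		ok.onMouseMove(l, target)
		l.tick()
	}
	fmt.Println("frames rendered:", l.frames)
}
```

&lt;p&gt;The handler owns the decision. The loop never inspects the event queue to guess whether a frame is needed.&lt;/p&gt;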




&lt;h2&gt;
  
  
  The First Pure Go DXIL Generator
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/kolkov/we-built-the-first-pure-go-dxil-generator-because-optimizing-the-wrong-path-wasnt-enough-35en"&gt;We wrote about this in detail&lt;/a&gt;, but here's the update: &lt;strong&gt;naga's DXIL backend now passes 94/208 DXC golden tests&lt;/strong&gt; (45% parity with Microsoft's own compiler). The IDxcValidator pass rate is 161/170 (94.7%).&lt;/p&gt;

&lt;p&gt;This matters because Rust naga — the reference implementation maintained by Mozilla — still doesn't have a DXIL backend. There's been an &lt;a href="https://github.com/gfx-rs/wgpu/issues/4302" rel="noopener noreferrer"&gt;open issue since 2020&lt;/a&gt;. Six years later, still not implemented.&lt;/p&gt;

&lt;p&gt;We did it in Pure Go. No LLVM, no external tools. Direct DXIL bitcode generation from our shader IR.&lt;/p&gt;

&lt;p&gt;Yes, we know Microsoft is adding &lt;a href="https://devblogs.microsoft.com/directx/directx-adopting-spir-v/" rel="noopener noreferrer"&gt;SPIR-V support to DirectX 12&lt;/a&gt; in a future SDK update. But DXIL is the native DX12 shader format today, and having a direct DXIL generator means we don't depend on external toolchains or wait for Microsoft's timeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Shader Pipeline Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WGSL source code
      ↓  (naga lexer + parser)
   naga IR (typed, validated)
      ↓  (backend selection)
  ┌───┴───┬────┬─────┬──────┐
  SPIR-V  MSL  GLSL  HLSL  DXIL
  (Vulkan)(Metal)(GLES)(DX11)(DX12)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 5 backends generate from the same IR. 330+ tests. The DXIL backend includes single-store local promotion, strength reduction, fragment signature ordering, and dead code elimination — real compiler passes, not string templating.&lt;/p&gt;




&lt;h2&gt;
  
  
  gogpu/ui: A Real GUI Toolkit
&lt;/h2&gt;

&lt;p&gt;Most Go GUI toolkits ship a handful of basic controls and call it done. We're building something closer to Qt or Flutter — a complete widget system with theming, accessibility, and enterprise-grade architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;22+ widgets&lt;/strong&gt; shipping today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Widgets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TextField, TextArea (multi-line), Checkbox, Radio, Switch, Slider, Dropdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Display&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Label, Image, Icon, ProgressBar, Badge, Chip, Divider, Card&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Navigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Button, IconButton, FAB, AppBar, TabBar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column, Row, Stack, ListView, ScrollView, GridView&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overlay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dialog, Snackbar, Tooltip, BottomSheet, PopupMenu&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Advanced&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TreeView, DataTable, SplitView, Docking, Menu, Toolbar&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;4 design systems:&lt;/strong&gt; Material Design 3, DevTools (JetBrains), Fluent (Microsoft), Cupertino (Apple). Full token-based theming — change every color, radius, elevation, font in one struct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;171K LOC | 55+ packages | 6,803 tests | 97%+ average coverage.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Signal-Driven Retained-Mode Compositor (ADR-007)
&lt;/h3&gt;

&lt;p&gt;The latest v0.1.14 shipped the biggest architectural change in ui: a &lt;strong&gt;signal-driven retained-mode compositor&lt;/strong&gt; that replaced the legacy hybrid pipeline.&lt;/p&gt;

&lt;p&gt;The old pipeline had a fundamental conflict: CPU pixmap was retained (persistent between frames), but GPU shapes (shadows, rounded corners) were ephemeral (re-queued every frame). On frames where nothing changed, GPU shapes disappeared — visible flicker.&lt;/p&gt;

&lt;p&gt;ADR-007 unified everything into a scene graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Widget tree
    ↓  SetNeedsRedraw() (per-widget dirty flag)
Scene.Encoding (compact display list: 9-25 bytes/command)
    ↓  RepaintBoundary (cached scene per subtree)
scene.Renderer (auto GPU/CPU selection, 64×64 tile grid)
    ↓  FlushGPUWithView (single render pass → swapchain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every widget records into a &lt;code&gt;scene.Scene&lt;/code&gt; display list instead of drawing to a pixmap. &lt;code&gt;RepaintBoundary&lt;/code&gt; caches scene encodings per subtree — when a spinner animates, only its 48×48 tile re-renders, not the full window. Reactive signals (&lt;code&gt;coregx/signals&lt;/code&gt;) trigger &lt;code&gt;SetNeedsRedraw()&lt;/code&gt; on exactly the widgets that changed.&lt;/p&gt;
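
&lt;p&gt;The caching idea can be shown without any GPU code. This is a simplified sketch, assuming a boundary that stores its display list as a slice of strings: &lt;code&gt;SetNeedsRedraw()&lt;/code&gt; is the name used above, while the &lt;code&gt;boundary&lt;/code&gt; type, its fields, and &lt;code&gt;encoding()&lt;/code&gt; are hypothetical and do not reflect the real &lt;code&gt;RepaintBoundary&lt;/code&gt; API.&lt;/p&gt;

```go
package main

import "fmt"

// boundary is a hypothetical repaint boundary: it caches the recorded
// display list for its subtree and re-records only when marked dirty.
type boundary struct {
	dirty   bool
	cached  []string        // stand-in for an encoded display list
	records int             // how many times the subtree was re-recorded
	record  func() []string // walks the subtree and emits draw commands
}

// SetNeedsRedraw marks the cached encoding as stale.
func (b *boundary) SetNeedsRedraw() { b.dirty = true }

// encoding returns the cached display list, re-recording first if needed.
func (b *boundary) encoding() []string {
	if b.dirty || b.cached == nil {
		b.cached = b.record()
		b.records++
		b.dirty = false
	}
	return b.cached
}

func main() {
	angle := 0
	spinner := new(boundary)
	spinner.record = func() []string { return []string{fmt.Sprintf("arc %d", angle)} }
	panel := new(boundary)
	panel.record = func() []string { return []string{"rect", "text"} }

	// Three animation frames: only the spinner is invalidated each time.
	for _, a := range []int{0, 30, 60} {
		angle = a
		spinner.SetNeedsRedraw() // in the toolkit a signal would trigger this
		_ = spinner.encoding()
		_ = panel.encoding()
	}
	fmt.Println("spinner records:", spinner.records, "panel records:", panel.records)
}
```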

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taskmanager example GPU: &lt;strong&gt;7-18% → 0-1%&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;IDE hover lag: &lt;strong&gt;eliminated&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Gallery flickering: &lt;strong&gt;fixed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Every rendered frame runs the full DrawTree through the GPU pipeline, with no retained CPU pixmap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is still a work in progress. Complex animated widgets like classic spinners need additional optimization — the present path still does a full-surface render pass even when only a 48×48 pixel region changed. Damage-rect passthrough and sub-region compositing are the next steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reactive state via signals (Signal, Computed, Effect, Binding)&lt;/li&gt;
&lt;li&gt;GPU-accelerated rendering via gg → wgpu&lt;/li&gt;
&lt;li&gt;Granular widget invalidation — 11 widgets use &lt;code&gt;SetNeedsRedraw + InvalidateRect&lt;/code&gt; instead of full-surface redraw&lt;/li&gt;
&lt;li&gt;Accessibility from day one: 35+ ARIA roles, screen reader announcer&lt;/li&gt;
&lt;li&gt;Full i18n: CLDR plural rules, RTL detection&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Coming Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  g3d — 3D Rendering Library
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gogpu/g3d" rel="noopener noreferrer"&gt;gogpu/g3d&lt;/a&gt; is the Three.js of Go. Scene graph, PBR materials (metallic-roughness), GLTF 2.0 loading, directional/point/spot lights, frustum culling, instance batching. Built on wgpu, not OpenGL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;scene&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;g3d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScene&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cube&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;g3d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMesh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;g3d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBoxGeometry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;g3d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewStandardMaterial&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;renderer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scene&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;camera&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  compose — Declarative UI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gogpu/compose" rel="noopener noreferrer"&gt;gogpu/compose&lt;/a&gt; brings Jetpack Compose-style declarative UI to Go. Composable functions, automatic recomposition on state change, slot-based layout. The architecture is transport-pluggable: the same composable code can render to a GPU window, a headless image, or a remote display.&lt;/p&gt;

&lt;h3&gt;
  
  
  systray — System Tray
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/gogpu/systray" rel="noopener noreferrer"&gt;gogpu/systray&lt;/a&gt; — cross-platform system tray icons. Win32 Notification Area, macOS Menu Bar Extra, Linux StatusNotifierItem/AppIndicator. Multiple trays, nested menus, notifications, dark mode icon switching. Pure Go, zero CGO.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser Target (WASM + WebGPU)
&lt;/h3&gt;

&lt;p&gt;wgpu already has the build infrastructure for WASM (platform split shipped in v0.25.5). The plan: compile to WASM and use the browser's native WebGPU API via &lt;code&gt;syscall/js&lt;/code&gt;. Same app code, same shaders, running in Chrome. Phase 0 is complete; Phase 1 (the navigator.gpu binding) is next.&lt;/p&gt;

&lt;h3&gt;
  
  
  Android Support
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/orgs/gogpu/discussions/31" rel="noopener noreferrer"&gt;Under active discussion&lt;/a&gt;. The architecture is ready — wgpu's Vulkan backend works, naga generates SPIR-V, gogpu's &lt;code&gt;PlatformManager&lt;/code&gt;/&lt;code&gt;PlatformWindow&lt;/code&gt; abstraction was designed with mobile in mind. What remains: &lt;code&gt;NativeActivity&lt;/code&gt; or &lt;code&gt;GameActivity&lt;/code&gt; integration, touch input, lifecycle management (suspend/resume), and the build pipeline (&lt;code&gt;gomobile&lt;/code&gt; or direct NDK).&lt;/p&gt;

&lt;h3&gt;
  
  
  ui — Major Overhaul Coming
&lt;/h3&gt;

&lt;p&gt;The GUI toolkit is getting substantial improvements. The ADR-007 retained-mode compositor was just the beginning — next up is the full scene-graph pipeline where every widget records into display lists instead of immediate-mode drawing. More widgets, better performance, production-ready accessibility. The goal: a Go GUI toolkit that you'd actually choose over Electron for a desktop app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Users, Real Software
&lt;/h2&gt;

&lt;p&gt;This isn't a toy. People are building on GoGPU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/darkliquid/ironwail-go" rel="noopener noreferrer"&gt;ironwail-go&lt;/a&gt;&lt;/strong&gt; by @darkliquid — a Quake 1 engine port running on gogpu/wgpu Vulkan on Wayland+niri. The first 3D game on a Pure Go GPU stack. &lt;a href="https://github.com/gogpu/gogpu/issues/163" rel="noopener noreferrer"&gt;Demo video on the project README&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;Born ML&lt;/a&gt;&lt;/strong&gt; — our ML framework, migrating from Rust FFI (&lt;code&gt;go-webgpu/webgpu&lt;/code&gt;) to gogpu/wgpu. Single-binary deployment, DXIL backend for DX12 compute — things the Rust FFI path couldn't provide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;L-System fractals&lt;/strong&gt; by @rcarlier — &lt;a href="https://github.com/gogpu/gg/issues/229" rel="noopener noreferrer"&gt;47 million points&lt;/a&gt; rendered via gg, running in WASM and CLI. "Amazing performance!"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;@jdbann&lt;/strong&gt; — contributing Metal backend fixes, testing on M4 Pro MacBook.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;@SideFx&lt;/strong&gt; — testing on Adreno mobile GPUs, helping us find driver-specific bugs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Architecture That Makes It Work
&lt;/h2&gt;

&lt;p&gt;The secret is layered independence. Each layer does one thing, doesn't know about layers above it, and can be used standalone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Application
      │
  ┌───┴────┬──────┬────────┐
  gogpu/ui  gogpu  gg  g3d    ← Application layer
  ├────────┘  │    │    │
  gpucontext ─┘    │    │     ← Shared interfaces
  │                │    │
  wgpu ────────────┘────┘     ← WebGPU implementation
  │
  naga                        ← Shader compiler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  gogpu — The Windowing &amp;amp; Application Framework
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt;&lt;/strong&gt; (~50K lines) is the foundation everything else builds on. Think of it as SDL or GLFW — but Pure Go, with a WebGPU renderer built in.&lt;/p&gt;

&lt;p&gt;What it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform windowing&lt;/strong&gt; — Win32, Cocoa (AppKit), X11, Wayland. Pure Go, no CGO. All 4 platforms implemented from scratch using our &lt;a href="https://github.com/go-webgpu/goffi" rel="noopener noreferrer"&gt;goffi&lt;/a&gt; FFI library (syscall-level, no C compiler).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU lifecycle&lt;/strong&gt; — Instance → Adapter → Device → Queue → Surface. Dual backend: Pure Go (gogpu/wgpu) or Rust FFI (go-webgpu/webgpu). Backend selected at build time or runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-window&lt;/strong&gt; — shared GPU device, per-window swapchain, monotonic WindowID, focus-aware VSync (ADR-010).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event system&lt;/strong&gt; — keyboard, mouse, touch (X11 XInput2), gestures, Unicode text input (IME) on all platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-thread architecture&lt;/strong&gt; — main thread for window events (keeps the OS happy), render thread for all GPU operations (keeps the GPU fed). Modal resize/drag on Windows handled via WM_TIMER callback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HiDPI/Retina&lt;/strong&gt; — per-monitor DPI, logical/physical coordinate split, &lt;code&gt;WM_DPICHANGED&lt;/code&gt; on Windows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frameless windows&lt;/strong&gt; — custom title bars with DWM shadow, hit-test regions (JetBrains Runtime pattern).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mouse grab / pointer lock&lt;/strong&gt; — locked, confined, normal modes (SDL parity) on Win32, X11, Wayland.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software backend&lt;/strong&gt; — always available, renders to GDI (Windows), XPutImage (X11), CALayer (macOS). No GPU required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;App&lt;/code&gt; implements &lt;code&gt;DeviceProvider&lt;/code&gt; — any library that accepts a &lt;code&gt;gpucontext.DeviceProvider&lt;/code&gt; can use gogpu's GPU device without importing gogpu directly.&lt;/p&gt;
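
&lt;p&gt;The decoupling works the way any small Go interface does. The sketch below is a simplified illustration: the one-method &lt;code&gt;DeviceProvider&lt;/code&gt; here is a hypothetical stand-in (the real &lt;code&gt;gpucontext.DeviceProvider&lt;/code&gt; method set is not shown in this post), and &lt;code&gt;Renderer&lt;/code&gt; and &lt;code&gt;App&lt;/code&gt; are example types playing the roles of a consumer library and of gogpu.&lt;/p&gt;

```go
package main

import "fmt"

// DeviceProvider is a HYPOTHETICAL, simplified stand-in for
// gpucontext.DeviceProvider; the real method set is not shown in the article.
type DeviceProvider interface {
	DeviceName() string
}

// Renderer plays the role of a library such as gg: it depends only on the interface.
type Renderer struct {
	provider DeviceProvider
}

func NewRenderer(p DeviceProvider) Renderer { return Renderer{provider: p} }

// Describe uses the device through the interface without knowing who owns it.
func (r Renderer) Describe() string {
	return "rendering on " + r.provider.DeviceName()
}

// App plays the role of gogpu's App: it owns the device and satisfies the interface.
type App struct{ adapter string }

func (a App) DeviceName() string { return a.adapter }

func main() {
	app := App{adapter: "example-adapter"}
	r := NewRenderer(app) // Renderer never needs to import the package defining App
	fmt.Println(r.Describe())
}
```

&lt;p&gt;In a real project the interface, the consumer, and the provider would live in three separate packages, with only the interface package imported by both sides.&lt;/p&gt;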

&lt;h3&gt;
  
  
  The Other Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;naga&lt;/strong&gt; (~195K lines) — shader compiler. WGSL in, SPIR-V/MSL/GLSL/HLSL/DXIL out. Used by wgpu at pipeline creation time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;wgpu&lt;/strong&gt; (~145K lines) — Pure Go WebGPU implementation. 5 HAL backends (Vulkan, DX12, Metal, GLES, Software). Resource lifecycle, validation, state tracking. The layer between your draw calls and the GPU driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gg&lt;/strong&gt; (~219K lines) — 2D graphics. Skia-class analytic AA rasterizer, Vello-derived tile rasterizer, GPU SDF accelerator, text rendering, SVG, alpha masks. Can work standalone (no window) or with gogpu's GPU device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gpucontext&lt;/strong&gt; (~3.6K lines) — the glue. Depends only on &lt;code&gt;gputypes&lt;/code&gt;. Defines &lt;code&gt;DeviceProvider&lt;/code&gt;, &lt;code&gt;TextureView&lt;/code&gt;, &lt;code&gt;CommandEncoder&lt;/code&gt; — interfaces that let all packages share GPU resources without circular imports. The &lt;code&gt;database/sql&lt;/code&gt; of GPU programming.&lt;/p&gt;

&lt;p&gt;Every layer can be used independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;gg&lt;/code&gt; without &lt;code&gt;gogpu&lt;/code&gt; — 2D graphics without windowing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wgpu&lt;/code&gt; without &lt;code&gt;gg&lt;/code&gt; — raw WebGPU for compute or custom rendering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;naga&lt;/code&gt; without anything — shader compilation as a library&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Research before code.&lt;/strong&gt; Every major feature starts with an Architecture Decision Record studying 5-8 enterprise implementations. The multi-window ADR studied Qt6, GTK4, SDL3, winit, GLFW, Fyne, and Gio before a line of code was written. This saved weeks of wrong turns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. CPU is core, GPU is accelerator.&lt;/strong&gt; We analyzed 8 enterprise 2D engines (Skia, Cairo, Vello, Blend2D, tiny-skia, piet, Qt RHI, Pathfinder) and found that none of them treats CPU rasterization as a "backend." It is always the core, and the GPU accelerates specific operations. This insight shaped gg's entire architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Zero CGO is a real competitive advantage.&lt;/strong&gt; With the Rust FFI backend, every platform needed its own &lt;code&gt;wgpu_native.dll&lt;/code&gt;/&lt;code&gt;.so&lt;/code&gt;/&lt;code&gt;.dylib&lt;/code&gt; — find it, download it, put it in the right place, hope the versions match. With Pure Go wgpu: &lt;code&gt;go build&lt;/code&gt; and it works. No shared libraries to hunt for, no linker errors, no "works on my machine." Cross-compilation just works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Enterprise references prevent enterprise bugs.&lt;/strong&gt; When we tried to add DX12 texture barriers without studying how Rust wgpu handles them, we got TDR crashes at frame 575. When we studied the reference first, we found the root cause in 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Small teams win through focus.&lt;/strong&gt; Our naga DXIL backend shipped before Rust naga's (open issue since 2020). Not because of more resources — because of fewer coordination costs and a clear research → design → implement pipeline. Architectural decisions compound: choosing WebGPU as the abstraction meant our Metal backend was ready when Apple deprecated OpenGL, and our shader IR was ready when DX12 needed DXIL.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
go get github.com/gogpu/gogpu

&lt;span class="c"&gt;# Run the particles demo (compute + render, zero CGO)&lt;/span&gt;
&lt;span class="nv"&gt;CGO_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 go run github.com/gogpu/gogpu/examples/particles@latest

&lt;span class="c"&gt;# Run the multi-window demo&lt;/span&gt;
&lt;span class="nv"&gt;CGO_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 go run github.com/gogpu/gogpu/examples/multiwindow@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt; Go 1.25+, a GPU with Vulkan/DX12/Metal/GLES support (or use Software backend for headless).&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Organization&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;github.com/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main Repository&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;github.com/gogpu/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GUI Toolkit&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;github.com/gogpu/ui&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2D Graphics&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;github.com/gogpu/gg&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3D Library&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/g3d" rel="noopener noreferrer"&gt;github.com/gogpu/g3d&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shader Compiler&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;github.com/gogpu/naga&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Previous: DXIL Article&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/kolkov/we-built-the-first-pure-go-dxil-generator-because-optimizing-the-wrong-path-wasnt-enough-35en"&gt;dev.to/kolkov/dxil&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Previous: 100K LOC Article&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/kolkov/gogpu-from-idea-to-100k-lines-in-two-weeks-building-gos-gpu-ecosystem-3b2"&gt;dev.to/kolkov/100k&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart Coding Methodology&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/kolkov/from-vibe-coding-to-agentic-engineering-what-karpathy-got-right-and-whats-missing-62e"&gt;dev.to/kolkov/smart-coding&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Core Is Stable — Now We Need You
&lt;/h2&gt;

&lt;p&gt;The foundational libraries — &lt;code&gt;wgpu&lt;/code&gt;, &lt;code&gt;naga&lt;/code&gt;, &lt;code&gt;gg&lt;/code&gt;, &lt;code&gt;gogpu&lt;/code&gt; — have reached a level of stability where the focus is shifting. The WebGPU implementation handles Vulkan, DX12, Metal, GLES, and Software. The shader compiler generates 5 output formats. The 2D rasterizer passes thousands of tests. The windowing framework runs on 4 platforms with multi-window support.&lt;/p&gt;

&lt;p&gt;Now the main effort goes into &lt;strong&gt;ui&lt;/strong&gt; (the GUI toolkit), &lt;strong&gt;new ecosystem libraries&lt;/strong&gt; (g3d, compose, systray), and the &lt;strong&gt;Browser backend&lt;/strong&gt; (WASM + WebGPU — same app code running in Chrome). This is where things get exciting, and this is where we need feedback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it. Break it. Tell us what's missing.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found a bug? &lt;a href="https://github.com/gogpu/gogpu/issues" rel="noopener noreferrer"&gt;Open an issue&lt;/a&gt; — we respond fast.&lt;/li&gt;
&lt;li&gt;Built something with GoGPU? Share it — we love showcasing community projects.&lt;/li&gt;
&lt;li&gt;Want to write about it? Articles and &lt;a href="https://www.youtube.com/watch?v=RDE2Mkr-B80" rel="noopener noreferrer"&gt;video reviews&lt;/a&gt; help the ecosystem grow more than anything else.&lt;/li&gt;
&lt;li&gt;Have ideas for the UI toolkit? &lt;a href="https://github.com/orgs/gogpu/discussions" rel="noopener noreferrer"&gt;Join the discussions&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Go deserves a real graphics ecosystem. We're building it — and it's ready for you to start building on.&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>We Built the First Pure Go DXIL Generator — Because Optimizing the Wrong Path Wasn't Enough</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Sun, 05 Apr 2026 23:00:53 +0000</pubDate>
      <link>https://dev.to/kolkov/we-built-the-first-pure-go-dxil-generator-because-optimizing-the-wrong-path-wasnt-enough-35en</link>
      <guid>https://dev.to/kolkov/we-built-the-first-pure-go-dxil-generator-because-optimizing-the-wrong-path-wasnt-enough-35en</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Go doesn't have a real graphics ecosystem." — We've heard this for years. So we built one: 636K lines of Pure Go, five GPU backends, zero CGO. And now we've done something that even Rust's naga shader compiler hasn't managed in six years.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the end of last year, we &lt;a href="https://dev.to/kolkov/go-126-meets-2026-with-a-professional-graphics-ecosystem-9g8"&gt;introduced GoGPU to the Go community&lt;/a&gt; — greeting everyone with a New Year's gift: a professional graphics ecosystem written entirely in Go. Four months later, that ecosystem just got its most audacious component: &lt;strong&gt;a Pure Go DXIL generator that compiles shaders directly to DirectX 12 bytecode, without any external compiler&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the story of how a performance optimization rabbit hole led us to write our own LLVM 3.7 bitcode emitter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem That Wouldn't Go Away
&lt;/h2&gt;

&lt;p&gt;Every DirectX 12 application needs compiled shaders. The standard pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WGSL → HLSL text → FXC (d3dcompiler_47.dll) → DXBC bytecode → GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That middle step — &lt;code&gt;d3dcompiler_47.dll&lt;/code&gt; — is a 4.3 MB Microsoft DLL that you load at runtime. It works. It's battle-tested. And it was our bottleneck.&lt;/p&gt;

&lt;p&gt;We build &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;gogpu&lt;/a&gt; — a Pure Go GPU ecosystem with its own shader compiler, &lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;naga&lt;/a&gt;. Four backend outputs already at 100% Rust naga parity. Everything compiles with &lt;code&gt;go build&lt;/code&gt;, no C toolchain needed.&lt;/p&gt;

&lt;p&gt;But on Windows with DirectX 12, we had a dirty secret: &lt;code&gt;d3dcompiler_47.dll&lt;/code&gt;. The one external dependency in our otherwise dependency-free stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Optimization Rabbit Hole
&lt;/h2&gt;

&lt;p&gt;We tried everything to make the FXC path fast enough to forget about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shader cache&lt;/strong&gt; — Hash the HLSL, cache the DXBC. First render is slow, subsequent ones instant. Works great until your shader variants explode.&lt;/p&gt;
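
&lt;p&gt;A content-addressed cache of this kind can be sketched as follows. The names (&lt;code&gt;ShaderCache&lt;/code&gt;, &lt;code&gt;Get&lt;/code&gt;, &lt;code&gt;Misses&lt;/code&gt;) and the use of SHA-256 are assumptions made for the example, not the gogpu implementation, and the &lt;code&gt;compile&lt;/code&gt; callback stands in for the slow FXC call.&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ShaderCache maps a hash of the HLSL source to compiled bytecode.
// All names here are hypothetical, chosen for this sketch.
type ShaderCache struct {
	compile func(hlsl string) []byte // the slow step (FXC in the article)
	entries map[[sha256.Size]byte][]byte
	Misses  int // number of real compilations performed
}

func NewShaderCache(compile func(string) []byte) *ShaderCache {
	c := new(ShaderCache)
	c.compile = compile
	c.entries = make(map[[sha256.Size]byte][]byte)
	return c
}

// Get returns cached bytecode for identical source text and compiles on a miss.
func (c *ShaderCache) Get(hlsl string) []byte {
	key := sha256.Sum256([]byte(hlsl))
	if blob, ok := c.entries[key]; ok {
		return blob
	}
	c.Misses++
	blob := c.compile(hlsl)
	c.entries[key] = blob
	return blob
}

func main() {
	cache := NewShaderCache(func(src string) []byte { return []byte("compiled:" + src) })
	cache.Get("float4 main() : SV_Target { return 1; }")
	cache.Get("float4 main() : SV_Target { return 1; }")
	fmt.Println(cache.Misses) // prints: 1 (the second lookup is a cache hit)
}
```

&lt;p&gt;Because the key is the full source text, every shader variant is a distinct entry, which is why the approach degrades once variants multiply.&lt;/p&gt;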

&lt;p&gt;&lt;strong&gt;In-memory compilation pool&lt;/strong&gt; — Pre-compile common shaders at startup. Reduces cold-start latency. But we still load the DLL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pipeline State Object caching&lt;/strong&gt; — We planned disk caching of PSO blobs (&lt;code&gt;GetCachedBlob&lt;/code&gt; → &lt;code&gt;os.UserCacheDir()&lt;/code&gt;). We wrote the task, designed the key format, specified the invalidation strategy. Then we never shipped it — because we pivoted to eliminating FXC entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;naga HLSL fix&lt;/strong&gt; — FXC was choking on naga-generated HLSL: a &lt;code&gt;(Type[256])0&lt;/code&gt; bulk zero-initialization expanded to a 12KB inline constructor that FXC took 22 seconds to compile. We initially thought our Go naga had a bug, so we tested the same shader through Rust naga + FXC — same 22 seconds. It wasn't our implementation; FXC genuinely can't handle giant inline constructors. The fix was in naga (per-element loop instead of inline constructor, v0.16.3) — 330× faster. But even after fixing the worst case, every shader still went through an external DLL.&lt;/p&gt;

&lt;p&gt;Every optimization made the same path faster. None of them removed the path.&lt;/p&gt;

&lt;p&gt;At this point we had a decision to make. The obvious next step was adding DXC (&lt;code&gt;dxcompiler.dll&lt;/code&gt;) as an opt-in replacement for FXC — newer, faster, supports Shader Model 6.0+. We even created the task for it.&lt;/p&gt;

&lt;p&gt;Then, while reviewing the plan, a simple question came up: &lt;em&gt;"Can we write our own DXC in Go?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The initial answer was: "No. DXC is 500K lines of C++, a fork of LLVM 3.7. That's not something you casually rewrite."&lt;/p&gt;

&lt;p&gt;The response: &lt;em&gt;"Not rewrite DXC. Generate DXIL directly. Skip HLSL entirely."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That changed everything. DXC takes HLSL text and produces DXIL. We already have our own IR (naga IR). Why translate IR → HLSL text → parse HLSL → produce DXIL, when we could go IR → DXIL directly?&lt;/p&gt;

&lt;p&gt;DXC is a compiler from one language (HLSL) to another (DXIL). We don't need a compiler — we need an &lt;strong&gt;emitter&lt;/strong&gt;. And emitting is much simpler than compiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is DXIL, and Why Nobody Writes It
&lt;/h2&gt;

&lt;p&gt;DXIL (DirectX Intermediate Language) is what DXC produces (FXC emits the older DXBC bytecode). It's LLVM 3.7 bitcode, the serialized form of LLVM IR, wrapped in a DXBC container with DirectX-specific metadata and &lt;code&gt;dx.op&lt;/code&gt; intrinsic calls.&lt;/p&gt;

&lt;p&gt;The reason nobody writes DXIL directly is simple: it's hard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLVM 3.7 bitcode&lt;/strong&gt; is a binary format with variable-width encoding (VBR), nested blocks, abbreviation records, and forward references. Not something you casually emit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DXIL semantics&lt;/strong&gt; require &lt;code&gt;dx.op&lt;/code&gt; intrinsic calls instead of normal LLVM instructions for I/O, math, and resource access. 165+ opcodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DXBC container&lt;/strong&gt; needs input/output signatures (ISG1/OSG1), pipeline state validation (PSV0), feature flags (SFI0), and a cryptographic hash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt; — until January 2025, every DXIL module needed to be signed by &lt;code&gt;dxil.dll&lt;/code&gt;. Microsoft's BYPASS hash sentinel changed this.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rust's naga shader compiler has had an &lt;a href="https://github.com/gfx-rs/wgpu/issues/4302" rel="noopener noreferrer"&gt;open issue for DXIL backend since 2020&lt;/a&gt;. Six years later, it's still not implemented.&lt;/p&gt;

&lt;p&gt;Only one project has done it outside of LLVM: &lt;strong&gt;Mesa&lt;/strong&gt; (the open-source OpenGL/Vulkan driver stack). Their DXIL compiler is ~21,000 lines of C/H, written by engineers from Microsoft and Collabora over 3+ years. They wrote their own LLVM 3.7 bitcode writer from scratch — proving it's possible without linking LLVM.&lt;/p&gt;

&lt;p&gt;We cloned Mesa's &lt;code&gt;src/microsoft/compiler/&lt;/code&gt; into our reference folder, studied &lt;code&gt;dxil_module.c&lt;/code&gt; (the bitcode writer, ~3K lines of C), and mapped out every block type, record format, and abbreviation. Not to copy — to understand the format deeply enough to write our own.&lt;/p&gt;

&lt;p&gt;Then came the final piece: in January 2025, Microsoft &lt;a href="https://devblogs.microsoft.com/directx/open-sourcing-dxil-validator-hash/" rel="noopener noreferrer"&gt;open-sourced the DXIL validator hash&lt;/a&gt; and introduced a BYPASS sentinel, a magic value in the hash field that tells D3D12 "this shader wasn't signed by dxil.dll, but trust it anyway." Without the sentinel, our DXIL would run only with Developer Mode enabled on Windows. With it, &lt;strong&gt;any third-party DXIL generator can produce shaders that run on retail Windows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We weren't afraid of binary formats. Before gogpu, we built &lt;a href="https://github.com/scigolib/hdf5" rel="noopener noreferrer"&gt;scigolib/hdf5&lt;/a&gt;, a Pure Go implementation of HDF5, a hierarchical data format used heavily in scientific computing (including at NASA), with its own B-tree indices, chunked storage, and compression pipelines. After parsing HDF5 superblocks and fractal heaps in pure Go, LLVM bitcode felt almost... reasonable. We also built &lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregx/coregex&lt;/a&gt;, a multi-engine regex system (17 strategies, Lazy DFA, PikeVM, SIMD prefilters) that runs up to 3000× faster than Go's stdlib. Complex binary formats and low-level encoding are kind of our thing.&lt;/p&gt;

&lt;p&gt;We spent weeks studying the DXIL format specifically. Reading the &lt;a href="https://github.com/microsoft/DirectXShaderCompiler/blob/main/docs/DXIL.rst" rel="noopener noreferrer"&gt;DXIL spec&lt;/a&gt;, the &lt;a href="https://releases.llvm.org/3.7.1/docs/BitCodeFormat.html" rel="noopener noreferrer"&gt;LLVM 3.7 bitcode reference&lt;/a&gt;, Mesa's implementation, Microsoft's DXC headers, the &lt;a href="https://github.com/microsoft/hlsl-specs/blob/main/proposals/infra/INF-0004-validator-hashing.md" rel="noopener noreferrer"&gt;validator hash proposal&lt;/a&gt;. We wrote a detailed architecture document comparing four implementation options. We mapped every dx.op opcode we'd need for vertex and fragment shaders. We designed the package structure, the phased rollout plan, the testing strategy.&lt;/p&gt;

&lt;p&gt;Only after all that research did we write the first line of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building LLVM 3.7 Bitcode in Pure Go
&lt;/h2&gt;

&lt;p&gt;The first challenge was the bitcode writer. LLVM 3.7's format is... unique:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bits, not bytes. Variable-width encoding. Nested blocks with
forward-declared sizes. Abbreviation records that compress
common patterns. A module structure that interleaves types,
constants, functions, and metadata in a specific order.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We wrote a bit-level writer from scratch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// VBR (Variable Bit Rate) encoding — like protobuf varint, but bit-aligned&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Writer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WriteVBR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteBits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteBits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the module serializer: TYPE_BLOCK, CONSTANTS_BLOCK, FUNCTION_BLOCK, METADATA_BLOCK — each with its own record formats, abbreviation IDs, and ordering constraints.&lt;/p&gt;

&lt;p&gt;The DXBC container assembles the bitcode with signatures and metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[DXBC Header]&lt;/span&gt;           &lt;span class="err"&gt;32&lt;/span&gt; &lt;span class="err"&gt;bytes&lt;/span&gt; &lt;span class="err"&gt;(magic&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;digest&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;version&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;size&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;part&lt;/span&gt; &lt;span class="err"&gt;count)&lt;/span&gt;
  &lt;span class="nn"&gt;[SFI0]&lt;/span&gt;                &lt;span class="err"&gt;Shader&lt;/span&gt; &lt;span class="err"&gt;feature&lt;/span&gt; &lt;span class="err"&gt;flags&lt;/span&gt; &lt;span class="err"&gt;(64-bit&lt;/span&gt; &lt;span class="err"&gt;bitmask)&lt;/span&gt;
  &lt;span class="nn"&gt;[DXIL]&lt;/span&gt;                &lt;span class="err"&gt;Program&lt;/span&gt; &lt;span class="err"&gt;header&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="err"&gt;LLVM&lt;/span&gt; &lt;span class="err"&gt;3.7&lt;/span&gt; &lt;span class="err"&gt;bitcode&lt;/span&gt;
  &lt;span class="nn"&gt;[ISG1]&lt;/span&gt;                &lt;span class="err"&gt;Input&lt;/span&gt; &lt;span class="err"&gt;signature&lt;/span&gt; &lt;span class="err"&gt;(semantic&lt;/span&gt; &lt;span class="err"&gt;names,&lt;/span&gt; &lt;span class="err"&gt;registers)&lt;/span&gt;
  &lt;span class="nn"&gt;[OSG1]&lt;/span&gt;                &lt;span class="err"&gt;Output&lt;/span&gt; &lt;span class="err"&gt;signature&lt;/span&gt;
  &lt;span class="nn"&gt;[PSV0]&lt;/span&gt;                &lt;span class="err"&gt;Pipeline&lt;/span&gt; &lt;span class="err"&gt;state&lt;/span&gt; &lt;span class="err"&gt;validation&lt;/span&gt;
  &lt;span class="nn"&gt;[HASH]&lt;/span&gt;                &lt;span class="err"&gt;BYPASS&lt;/span&gt; &lt;span class="err"&gt;sentinel&lt;/span&gt; &lt;span class="err"&gt;(no&lt;/span&gt; &lt;span class="err"&gt;dxil.dll&lt;/span&gt; &lt;span class="err"&gt;needed!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The DXIL Difference: Scalarized Vectors
&lt;/h2&gt;

&lt;p&gt;Here's something that makes DXIL fundamentally different from SPIR-V, MSL, GLSL, and HLSL: &lt;strong&gt;DXIL has no native vector types&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In SPIR-V, you write &lt;code&gt;OpCompositeConstruct %vec4 %x %y %z %w&lt;/code&gt;.&lt;br&gt;
In HLSL, you write &lt;code&gt;float4(x, y, z, w)&lt;/code&gt;.&lt;br&gt;
In DXIL, there are no vectors. A &lt;code&gt;vec4&amp;lt;f32&amp;gt;&lt;/code&gt; becomes &lt;strong&gt;four separate float values&lt;/strong&gt;, tracked independently through every operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Our emitter tracks per-component value IDs&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Emitter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;exprValues&lt;/span&gt;     &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExpressionHandle&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;    &lt;span class="c"&gt;// scalar value IDs&lt;/span&gt;
    &lt;span class="n"&gt;exprComponents&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ir&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExpressionHandle&lt;/span&gt;&lt;span class="p"&gt;][]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;  &lt;span class="c"&gt;// per-component IDs for vectors&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// dot(a, b) becomes:&lt;/span&gt;
&lt;span class="c"&gt;// %r = call float @dx.op.dot3.f32(i32 55, float %ax, float %ay, float %az,&lt;/span&gt;
&lt;span class="c"&gt;//                                          float %bx, float %by, float %bz)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means every vector operation — dot product, cross product, normalize, swizzle — must be decomposed into scalar operations. Our existing backends (SPIR-V, MSL, GLSL, HLSL) all work with native vectors. DXIL required a completely different approach.&lt;/p&gt;
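
&lt;p&gt;A toy illustration of what scalarization means in practice. The &lt;code&gt;scalarEmitter&lt;/code&gt; type below is hypothetical and much simpler than the real emitter. It shows two things: a component-wise operation costs one instruction per lane, and a swizzle costs nothing because it only re-indexes existing value IDs.&lt;/p&gt;

```go
package main

import "fmt"

// scalarEmitter is a toy stand-in for a scalarizing emitter: every value is
// a scalar with an integer ID, and a vector is just a slice of such IDs.
type scalarEmitter struct {
	nextID int
	instrs []string
}

// emit records one scalar instruction and returns the ID of its result.
func (e *scalarEmitter) emit(op string, a, b int) int {
	id := e.nextID
	e.nextID++
	e.instrs = append(e.instrs, fmt.Sprintf("%%%d = %s float %%%d, %%%d", id, op, a, b))
	return id
}

// binary applies op lane by lane: one scalar instruction per component.
func (e *scalarEmitter) binary(op string, a, b []int) []int {
	out := make([]int, len(a))
	for i := range a {
		out[i] = e.emit(op, a[i], b[i])
	}
	return out
}

// swizzle emits no instructions; it only picks existing component IDs.
func swizzle(v []int, pattern ...int) []int {
	out := make([]int, len(pattern))
	for i, p := range pattern {
		out[i] = v[p]
	}
	return out
}

func main() {
	e := new(scalarEmitter)
	e.nextID = 8 // IDs 0..7 are taken by the two input vectors below
	a := []int{0, 1, 2, 3}
	b := []int{4, 5, 6, 7}

	sum := e.binary("fadd", a, b) // four fadd instructions
	zyx := swizzle(sum, 2, 1, 0)  // zero instructions

	for _, line := range e.instrs {
		fmt.Println(line)
	}
	fmt.Println(sum, zyx)
}
```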

&lt;p&gt;Cross product becomes 6 multiplies and 3 subtracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// cross(a, b) = vec3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)&lt;/span&gt;
&lt;span class="n"&gt;cx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fsub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bz&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fsub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;az&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bz&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cz&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fsub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fmul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Control Flow: Basic Blocks, Not Nesting
&lt;/h2&gt;

&lt;p&gt;Another fundamental difference: DXIL uses LLVM-style &lt;strong&gt;basic blocks with explicit branches&lt;/strong&gt;, not the nested text structure of HLSL/GLSL/MSL.&lt;/p&gt;

&lt;p&gt;Our text backends emit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hlsl"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// accept&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// reject&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DXIL emits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight llvm"&gt;&lt;code&gt;&lt;span class="nl"&gt;entry:&lt;/span&gt;
  &lt;span class="k"&gt;br&lt;/span&gt; &lt;span class="kt"&gt;i1&lt;/span&gt; &lt;span class="nv"&gt;%cond&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;label&lt;/span&gt; &lt;span class="nv"&gt;%then&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;label&lt;/span&gt; &lt;span class="nv"&gt;%else&lt;/span&gt;
&lt;span class="nl"&gt;then:&lt;/span&gt;
  &lt;span class="c1"&gt;; accept statements&lt;/span&gt;
  &lt;span class="k"&gt;br&lt;/span&gt; &lt;span class="kt"&gt;label&lt;/span&gt; &lt;span class="nv"&gt;%merge&lt;/span&gt;
&lt;span class="nl"&gt;else:&lt;/span&gt;
  &lt;span class="c1"&gt;; reject statements&lt;/span&gt;
  &lt;span class="k"&gt;br&lt;/span&gt; &lt;span class="kt"&gt;label&lt;/span&gt; &lt;span class="nv"&gt;%merge&lt;/span&gt;
&lt;span class="nl"&gt;merge:&lt;/span&gt;
  &lt;span class="c1"&gt;; continues&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Loops use back-edge branches to a header block. Break and continue jump to specific target blocks tracked via a loop context stack.&lt;/p&gt;
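
&lt;p&gt;The loop context stack can be sketched in a few lines. This is an illustration of the idea, not our actual emitter: &lt;code&gt;cfgEmitter&lt;/code&gt;, &lt;code&gt;loopContext&lt;/code&gt;, and the block labels are hypothetical names.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// loopContext records where break and continue must jump for one loop.
type loopContext struct {
	continueTarget string // block that leads back to the loop header
	breakTarget    string // first block after the loop
}

// cfgEmitter is a hypothetical emitter that only records branch instructions.
type cfgEmitter struct {
	loops []loopContext
	out   []string
}

func (e *cfgEmitter) pushLoop(continueTarget, breakTarget string) {
	e.loops = append(e.loops, loopContext{continueTarget, breakTarget})
}

func (e *cfgEmitter) popLoop() {
	e.loops = e.loops[:len(e.loops)-1]
}

// branchTo emits a branch to a target chosen from the innermost loop.
func (e *cfgEmitter) branchTo(pick func(loopContext) string) error {
	if len(e.loops) == 0 {
		return errors.New("break/continue outside of a loop")
	}
	e.out = append(e.out, "br label %"+pick(e.loops[len(e.loops)-1]))
	return nil
}

func (e *cfgEmitter) emitBreak() error {
	return e.branchTo(func(l loopContext) string { return l.breakTarget })
}

func (e *cfgEmitter) emitContinue() error {
	return e.branchTo(func(l loopContext) string { return l.continueTarget })
}

func main() {
	e := new(cfgEmitter)
	e.pushLoop("outer.continuing", "outer.exit")
	e.pushLoop("inner.continuing", "inner.exit")
	if err := e.emitBreak(); err != nil { // leaves only the inner loop
		fmt.Println(err)
	}
	e.popLoop()
	if err := e.emitContinue(); err != nil { // now targets the outer loop
		fmt.Println(err)
	}
	e.popLoop()
	for _, line := range e.out {
		fmt.Println(line)
	}
}
```

&lt;p&gt;Because the stack top is always the innermost loop, nested loops need no special handling: a break inside the inner loop can never jump to the outer exit by accident.&lt;/p&gt;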

&lt;p&gt;We studied Mesa's &lt;code&gt;nir_to_dxil.c&lt;/code&gt; for the correct patterns, then cross-referenced with our own SPIR-V backend (which also uses structured control flow with merge blocks) to get the Go implementation right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Our Backends Taught Us
&lt;/h2&gt;

&lt;p&gt;This is the part that surprised us most. We have &lt;strong&gt;four mature backends&lt;/strong&gt; (SPIR-V, MSL, GLSL, HLSL) totaling ~68K LOC. They all solve the same IR walking problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expression dispatch and caching&lt;/li&gt;
&lt;li&gt;Type resolution through pointer chains&lt;/li&gt;
&lt;li&gt;Statement nesting and control flow&lt;/li&gt;
&lt;li&gt;Resource binding and I/O handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before implementing each DXIL feature, we checked how our existing backends handled it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;We checked&lt;/th&gt;
&lt;th&gt;What we learned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-arg math&lt;/td&gt;
&lt;td&gt;HLSL &lt;code&gt;writeExpressionKind&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Arg/Arg1/Arg2/Arg3 dispatch pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type casts&lt;/td&gt;
&lt;td&gt;SPIR-V &lt;code&gt;emitAs&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;src/dst kind+width → opcode selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control flow&lt;/td&gt;
&lt;td&gt;HLSL &lt;code&gt;writeIfStatement&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Condition, blocks, merge point structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Store/Load&lt;/td&gt;
&lt;td&gt;SPIR-V &lt;code&gt;emitStore&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Pointer chain resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Struct access&lt;/td&gt;
&lt;td&gt;MSL &lt;code&gt;writeAccessChain&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Recursive descent through members&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DXIL backend is different (scalarized, basic blocks, dx.op intrinsics), but the &lt;strong&gt;IR patterns are the same&lt;/strong&gt;. Our existing codebase was its own best reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Moment of Truth
&lt;/h2&gt;

&lt;p&gt;After all the research, all the planning, all the implementation — ~12,500 lines of Go code, 190 tests, weeks of work — came the moment that mattered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;GOGPU_DX12_DXIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;GOGPU_GRAPHICS_API&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dx12 go run ./cmd/wgpu-triangle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The terminal showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;wgpu API Triangle Test
Adapter: Intel(R) Iris(R) Xe Graphics
dx12: using DXIL direct compilation (naga dxil backend)
Render loop started
Frame 60 (64.6 FPS)
Frame 120 (62.1 FPS)
&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;Frame 2400 (59.9 FPS)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A red triangle on a blue background. The most boring demo in graphics programming. And the most satisfying.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WGSL → naga.Parse → naga.Lower → IR → dxil.Compile → DXIL → D3D12 → GPU
         Pure Go      Pure Go          Pure Go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2,400+ frames. 60 FPS. Stable.&lt;/strong&gt; On Intel Iris Xe, DirectX 12. No DLL loaded, no subprocess spawned. Just Go code producing bytes that a GPU executes.&lt;/p&gt;

&lt;h2&gt;
  
  
  By the Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total DXIL code&lt;/td&gt;
&lt;td&gt;~12,500 lines (9,400 code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test count&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New files&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public API surface&lt;/td&gt;
&lt;td&gt;4 exported identifiers (&lt;code&gt;Compile&lt;/code&gt;, &lt;code&gt;DefaultOptions&lt;/code&gt;, &lt;code&gt;Options&lt;/code&gt;, &lt;code&gt;ShaderModel&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External dependencies&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CGO calls&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI platforms&lt;/td&gt;
&lt;td&gt;macOS + Ubuntu + Windows (all green)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to first frame&lt;/td&gt;
&lt;td&gt;Instant (no subprocess)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For comparison, Mesa's DXIL compiler is ~21,000 LOC of C/H, built by engineers from Microsoft and Collabora over three years. We owe them a debt — their bitcode writer was our Rosetta Stone for understanding the format. But Go isn't C, and naga IR isn't NIR, so the actual code is written from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Experimental (and What's Next)
&lt;/h2&gt;

&lt;p&gt;This is v0.17.0 with an &lt;code&gt;(experimental)&lt;/code&gt; label. Here's what works and what doesn't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Works now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vertex + fragment shaders&lt;/li&gt;
&lt;li&gt;All arithmetic, comparison, logical operations&lt;/li&gt;
&lt;li&gt;30+ math intrinsics (min, max, clamp, dot, cross, mix, fma, length, normalize...)&lt;/li&gt;
&lt;li&gt;Type casts (10 LLVM cast opcodes)&lt;/li&gt;
&lt;li&gt;Control flow (if/else, loops, break/continue)&lt;/li&gt;
&lt;li&gt;Local variables (alloca + load + store)&lt;/li&gt;
&lt;li&gt;Texture sampling&lt;/li&gt;
&lt;li&gt;Resource handle creation (CBV/SRV/Sampler)&lt;/li&gt;
&lt;li&gt;I/O signatures and pipeline state validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coming next:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute shaders (UAV, atomics, barriers)&lt;/li&gt;
&lt;li&gt;Uniform buffer reads (cbufferLoadLegacy wiring)&lt;/li&gt;
&lt;li&gt;SM 6.1-6.9 features (wave intrinsics, mesh shaders)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experimental label means: it renders triangles today, but don't ship a game with it tomorrow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;naga is part of &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt; — a 636K LOC Pure Go GPU ecosystem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;153K&lt;/td&gt;
&lt;td&gt;2D graphics, GPU SDF, SVG renderer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;145K&lt;/td&gt;
&lt;td&gt;Shader compiler (now with DXIL!)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;134K&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU (Vulkan/DX12/Metal/GLES)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;121K&lt;/td&gt;
&lt;td&gt;GUI toolkit, 22+ widgets, 4 themes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;39K&lt;/td&gt;
&lt;td&gt;Application framework&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With DXIL, &lt;strong&gt;gogpu/naga has surpassed Rust naga in backend coverage&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Go naga&lt;/th&gt;
&lt;th&gt;Rust naga&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SPIR-V&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (87/87 golden, 164/164 spirv-val)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MSL&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (91/91)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLSL&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (68/68)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLSL&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (72/72)&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DXIL&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Experimental (working)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Not implemented&lt;/strong&gt; (open issue since 2020)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What started as a compatibility effort is now something more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/gogpu/naga@v0.17.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
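&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/naga"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/naga/dxil"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;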

&lt;span class="c"&gt;// Parse WGSL, lower to IR, compile to DXIL&lt;/span&gt;
&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;naga&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wgslSource&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;naga&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dxilBytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dxil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dxil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultOptions&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c"&gt;// dxilBytes is a complete DXBC container — feed directly to D3D12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repository: &lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;github.com/gogpu/naga&lt;/a&gt;&lt;br&gt;
Release: &lt;a href="https://github.com/gogpu/naga/releases/tag/v0.17.0" rel="noopener noreferrer"&gt;v0.17.0&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previously in this series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/kolkov/go-126-meets-2026-with-a-professional-graphics-ecosystem-9g8"&gt;Go 1.26 Meets 2026 with a Professional Graphics Ecosystem&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://dev.to/kolkov/naga-v080-pure-go-shader-compiler-reaches-stability-milestone-28p2"&gt;naga v0.8.0: Pure Go Shader Compiler Reaches Stability&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>gpu</category>
      <category>graphics</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Reverse-Engineered 12 Versions of Claude Code. Then It Leaked Its Own Source Code.</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Tue, 31 Mar 2026 15:21:11 +0000</pubDate>
      <link>https://dev.to/kolkov/we-reverse-engineered-12-versions-of-claude-code-then-it-leaked-its-own-source-code-pij</link>
      <guid>https://dev.to/kolkov/we-reverse-engineered-12-versions-of-claude-code-then-it-leaked-its-own-source-code-pij</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Updated April 1, 2026&lt;/strong&gt;: Posted &lt;a href="https://github.com/anthropics/claude-code/issues/41981" rel="noopener noreferrer"&gt;complete fix proposal with source references (#41981)&lt;/a&gt; — immediate fixes, SDK restructuring, ping-aware adaptive watchdog, Go rewrite rationale with production-ready library stack. Validated all claims against leaked source code (line numbers). v2.1.89 released — source map removed, &lt;strong&gt;zero bug fixes&lt;/strong&gt; for any streaming issues.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Thank you, Claude Code. We asked humans for help 17 times. You answered in 3 days."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a story about frustration, reverse engineering, and an AI tool that may have leaked its own source code because its creators wouldn't listen.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 1: The Pain (August 2025 — February 2026)
&lt;/h2&gt;

&lt;p&gt;I'm a software developer building enterprise-grade open source in Go. 40+ public repos on &lt;a href="https://github.com/kolkov" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. My projects include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt;&lt;/strong&gt; — Pure Go GPU computing ecosystem: WebGPU implementation, WGSL shader compiler (SPIR-V/MSL/GLSL/HLSL), enterprise 2D graphics, GUI toolkit. 680K+ lines of Go, zero CGO. Vulkan, Metal, GLES, DX12 backends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt;&lt;/strong&gt; — Regex engine 3-3000× faster than Go stdlib. 17 matching strategies, SIMD acceleration, LazyDFA, PikeVM. Drop-in &lt;code&gt;regexp&lt;/code&gt; replacement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;Born&lt;/a&gt;&lt;/strong&gt; — Production-ready ML framework in pure Go. Type-safe tensors, automatic differentiation, GPU via WebGPU (123× MatMul speedup), ONNX/GGUF import. Neural networks as single Go binaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx" rel="noopener noreferrer"&gt;coregx&lt;/a&gt;&lt;/strong&gt; — Suite of production-grade Go libraries: HTTP router, SQL builder, PDF generation, pub/sub messaging. All zero CGO, minimal dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also build multi-LLM tooling — my own private ecosystem called PupSeek that works with multiple AI providers. I've tested them all.&lt;/p&gt;

&lt;p&gt;Anthropic's Opus models are the best for coding. Nothing else comes close. But Opus 4.6 via API costs &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;$5/$25 per MTok&lt;/a&gt; (input/output), fast mode is $30/$150, and a heavy coding session with 1M context easily burns $50-100/day. So you're forced into a Claude Max subscription ($100-200/month), which means using Claude Code, their CLI wrapper. There's no alternative: the best model is locked behind a buggy tool.&lt;/p&gt;

&lt;p&gt;It started small. A hang here, a timeout there. Press ESC, retry, move on. "They'll fix it soon," I told myself. "The product is new."&lt;/p&gt;

&lt;p&gt;Months passed. The hangs got worse. The community was screaming: &lt;a href="https://github.com/anthropics/claude-code/issues/6836" rel="noopener noreferrer"&gt;#6836&lt;/a&gt; — 150+ reports of orphaned tool calls. &lt;a href="https://github.com/anthropics/claude-code/issues/26224" rel="noopener noreferrer"&gt;#26224&lt;/a&gt; — agent hangs 5-20 minutes. &lt;a href="https://github.com/anthropics/claude-code/issues/20171" rel="noopener noreferrer"&gt;#20171&lt;/a&gt; — phantom "Generating..." state, 0 tokens. All open, no official response.&lt;/p&gt;

&lt;p&gt;Then came March 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;March 15&lt;/strong&gt;: Complete system deadlock. Keyboard dead. Only a hard power-off saved me. (Related: &lt;a href="https://github.com/anthropics/claude-code/issues/30137" rel="noopener noreferrer"&gt;#30137&lt;/a&gt;, &lt;a href="https://github.com/anthropics/claude-code/issues/32870" rel="noopener noreferrer"&gt;#32870&lt;/a&gt; — Windows BSODs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 17&lt;/strong&gt;: Bun runtime crash. 13.81 GB memory leak. 12-hour overnight session — lost. (Our issue: &lt;a href="https://github.com/anthropics/claude-code/issues/35171" rel="noopener noreferrer"&gt;#35171&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;March 19&lt;/strong&gt;: Another Bun crash. 15.40 GB committed memory. 23.7-hour session gone. (Our issue: &lt;a href="https://github.com/anthropics/claude-code/issues/36132" rel="noopener noreferrer"&gt;#36132&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three crashes in five days. I was spending more time babysitting the tool than coding. And I was &lt;em&gt;paying&lt;/em&gt; for this.&lt;/p&gt;

&lt;p&gt;That's when I stopped hoping and started digging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 2: Reverse Engineering (March 13 — March 27)
&lt;/h2&gt;

&lt;p&gt;Claude Code ships as a single minified &lt;code&gt;cli.js&lt;/code&gt;: 12 MB of JavaScript packed into enormous lines. No source maps. No comments. Variables renamed to &lt;code&gt;X6&lt;/code&gt;, &lt;code&gt;K8&lt;/code&gt;, &lt;code&gt;b6&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I downloaded it with &lt;code&gt;npm pack @anthropic-ai/claude-code&lt;/code&gt; and started grepping.&lt;/p&gt;

&lt;h3&gt;
  
  
  The tools
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This is what "reverse engineering" looks like when you're desperate:&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'7682p'&lt;/span&gt; cli.js | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;';'&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"for await"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each "line" of the minified file is 10,000–25,000 characters. To trace a code path, I'd:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find a string constant (&lt;code&gt;CLAUDE_STREAM_IDLE_TIMEOUT_MS&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Get the line number (&lt;code&gt;grep -n&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Split by semicolons (&lt;code&gt;tr ';' '\n'&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Count brace depth to determine scoping (&lt;code&gt;node -e&lt;/code&gt; script counting &lt;code&gt;{&lt;/code&gt; and &lt;code&gt;}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Map variable names between versions (they change on every build)&lt;/li&gt;
&lt;/ol&gt;
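
&lt;p&gt;Steps 3 and 4 can be sketched in Go (I used a throwaway &lt;code&gt;node -e&lt;/code&gt; script at the time; this is an equivalent illustration, and &lt;code&gt;splitWithDepth&lt;/code&gt; is a made-up name). It ignores braces inside string literals, regexes, and comments, so the depths are a rough guide only.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// statement is one semicolon-delimited chunk plus the brace depth in effect
// at the point where the chunk begins.
type statement struct {
	depth int
	text  string
}

// splitWithDepth splits minified source on ';' and tracks brace nesting.
// It does not understand strings, regexes, or comments (known limitation).
func splitWithDepth(src string) []statement {
	var out []statement
	depth := 0
	for _, chunk := range strings.Split(src, ";") {
		text := strings.TrimSpace(chunk)
		if text != "" {
			out = append(out, statement{depth: depth, text: text})
		}
		depth += strings.Count(chunk, "{") - strings.Count(chunk, "}")
	}
	return out
}

func main() {
	// A tiny stand-in for one line of minified code.
	src := "function X6(){let K8=1;for await(const b6 of s){K8++;}return K8;}"
	for _, st := range splitWithDepth(src) {
		if strings.Contains(st.text, "for await") {
			fmt.Printf("depth %d: %s\n", st.depth, st.text)
		}
	}
}
```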

&lt;p&gt;I did this for &lt;strong&gt;12 versions&lt;/strong&gt; (v2.1.74 through v2.1.88). Built a &lt;a href="https://github.com/kolkov" rel="noopener noreferrer"&gt;Go CLI tool&lt;/a&gt; (&lt;code&gt;ccdiag&lt;/code&gt;) to analyze session JSONL files. Analyzed 1,571 sessions, 148,444 tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I found
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;5.4% of all tool calls were orphaned&lt;/strong&gt; — the model asked for a tool, the tool ran, but the result never made it back. Silently dropped.&lt;/p&gt;

&lt;p&gt;I published the streaming hang root cause analysis as &lt;a href="https://github.com/anthropics/claude-code/issues/33949" rel="noopener noreferrer"&gt;#33949&lt;/a&gt; (👍15, 27 comments). Also reported the &lt;code&gt;.claude.json&lt;/code&gt; storage architecture problem in &lt;a href="https://github.com/anthropics/claude-code/issues/5024" rel="noopener noreferrer"&gt;#5024&lt;/a&gt; (👍47) — 3.1 GB of unmanaged flat files with inconsistent file locking.&lt;/p&gt;

&lt;p&gt;But that was just the beginning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 3: The Watchdog That Doesn't Watch (March 27)
&lt;/h2&gt;

&lt;p&gt;Deep in the minified code, I found a streaming idle watchdog — &lt;code&gt;CLAUDE_ENABLE_STREAM_WATCHDOG&lt;/code&gt;. It's disabled by default, hidden behind an undocumented environment variable. I enabled it and... the hangs reduced significantly.&lt;/p&gt;

&lt;p&gt;But then I traced the full error path and found &lt;strong&gt;three compounding bugs&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 1: The watchdog initializes too late
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;// ← CAN HANG HERE!&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="c1"&gt;//   WATCHDOG NOT ARMED YET!&lt;/span&gt;

&lt;span class="c1"&gt;// Watchdog initializes HERE — AFTER the dangerous phase:&lt;/span&gt;
&lt;span class="nf"&gt;resetStreamIdleTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The watchdog protects the SSE event loop but &lt;strong&gt;not the initial connection phase&lt;/strong&gt; — which is where 100% of our observed hangs occur.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 2: The abort function does nothing
&lt;/h3&gt;

&lt;p&gt;When the watchdog fires, it calls &lt;code&gt;releaseStreamResources()&lt;/code&gt; which tries to abort &lt;code&gt;stream&lt;/code&gt; and &lt;code&gt;streamResponse&lt;/code&gt;. But during the initial connection phase, both are &lt;code&gt;undefined&lt;/code&gt;. The abort is literally a no-op.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug 3: The non-streaming fallback doesn't work where it matters
&lt;/h3&gt;

&lt;p&gt;There's fallback code with telemetry (&lt;code&gt;fallback_cause: "watchdog"&lt;/code&gt;) that switches to a non-streaming request when the watchdog fires. It actually &lt;strong&gt;works&lt;/strong&gt; — but only when the hang occurs during SSE event processing (for-await phase), because &lt;code&gt;releaseStreamResources()&lt;/code&gt; can abort the active stream.&lt;/p&gt;

&lt;p&gt;During the initial connection phase (do-while) — where &lt;strong&gt;100% of our observed hangs occur&lt;/strong&gt; — &lt;code&gt;stream&lt;/code&gt; and &lt;code&gt;streamResponse&lt;/code&gt; are both &lt;code&gt;undefined&lt;/code&gt;. The abort is a no-op. The fallback never triggers.&lt;/p&gt;

&lt;p&gt;So the fallback works in the phase that rarely hangs, and doesn't work in the phase that always hangs. &lt;strong&gt;The watchdog feature has been in the codebase for 5+ months without protecting the most vulnerable code path.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We could only figure this out from the readable source code — in the minified version, we initially thought the fallback was completely dead code (&lt;a href="https://github.com/anthropics/claude-code/issues/39755" rel="noopener noreferrer"&gt;#39755&lt;/a&gt;). The source revealed it's more nuanced: the architecture is &lt;em&gt;partially&lt;/em&gt; correct but fails exactly where it's needed most. &lt;strong&gt;This is precisely why we begged for source access&lt;/strong&gt; — reverse engineering 12 MB of minified JavaScript gives you the broad strokes, but the subtle interactions between &lt;code&gt;releaseStreamResources()&lt;/code&gt;, &lt;code&gt;stream = undefined&lt;/code&gt;, and the AbortError catch chain only become clear in readable TypeScript with comments.&lt;/p&gt;

&lt;p&gt;I filed &lt;a href="https://github.com/anthropics/claude-code/issues/39755" rel="noopener noreferrer"&gt;issue #39755&lt;/a&gt; with full analysis, code paths, and suggested fixes. Tagged 17 Anthropic team members.&lt;/p&gt;

&lt;p&gt;The bot labeled it &lt;code&gt;bug&lt;/code&gt;, &lt;code&gt;has repro&lt;/code&gt;, &lt;code&gt;area:core&lt;/code&gt;. No human responded.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 4: The Patch (March 30)
&lt;/h2&gt;

&lt;p&gt;I patched &lt;code&gt;cli.js&lt;/code&gt; — moved the watchdog initialization before the do-while loop. One line moved. Zero bytes size change.&lt;/p&gt;

&lt;p&gt;Results from a real session (naga shader compiler project):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before Patch (6 hours)&lt;/th&gt;
&lt;th&gt;After Patch (2.5 hours)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Watchdog warnings&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;5&lt;/strong&gt; (first time ever!)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Watchdog timeouts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3&lt;/strong&gt; (automatic recovery!)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ESC aborts needed&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;21&lt;/strong&gt; (3.5/hour)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (0.4/hour)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;ESC aborts dropped 8.7×.&lt;/strong&gt; The watchdog was finally firing in the phase that needed it most.&lt;/p&gt;

&lt;p&gt;But recovery was slow: 3.5 minutes passed between abort and retry. The cause is Bug 2: in the do-while phase the abort function targets variables that are still &lt;code&gt;undefined&lt;/code&gt;, so the watchdog fires but has nothing to abort.&lt;/p&gt;
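
&lt;p&gt;To make the effect of that one-line move concrete, here is a minimal Go sketch rather than the actual &lt;code&gt;cli.js&lt;/code&gt; patch. The names &lt;code&gt;hangingConnect&lt;/code&gt; and &lt;code&gt;connectWithWatchdog&lt;/code&gt; are invented for illustration, the durations are shortened, and &lt;code&gt;context.AfterFunc&lt;/code&gt; needs Go 1.21 or newer. The watchdog is armed before the connection attempt and cancels something that already exists during that phase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// hangingConnect simulates a connection attempt that never completes on its
// own (hypothetical stand-in for the stuck request). It returns only when ctx
// is canceled.
func hangingConnect(ctx context.Context) error {
    var wg sync.WaitGroup
    wg.Add(1)
    stop := context.AfterFunc(ctx, wg.Done) // runs wg.Done once ctx is canceled
    defer stop()
    wg.Wait()
    return ctx.Err()
}

// connectWithWatchdog arms the watchdog BEFORE calling connect, so a hang in
// the connection phase is interrupted after idle has elapsed.
func connectWithWatchdog(parent context.Context, idle time.Duration, connect func(context.Context) error) error {
    ctx, cancel := context.WithCancel(parent)
    defer cancel()
    watchdog := time.AfterFunc(idle, cancel) // the abort target exists from the start
    defer watchdog.Stop()
    return connect(ctx)
}

func main() {
    start := time.Now()
    err := connectWithWatchdog(context.Background(), 50*time.Millisecond, hangingConnect)
    fmt.Println("connect returned:", err, "after", time.Since(start).Round(10*time.Millisecond))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the watchdog cancels the same context the connection attempt was started with, there is no window in which it fires with nothing to abort.&lt;/p&gt;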




&lt;h2&gt;
  
  
  Chapter 5: 16.3% Failure Rate (March 25-31)
&lt;/h2&gt;

&lt;p&gt;Over 6 days, one session made &lt;strong&gt;3,539 API requests&lt;/strong&gt;. The failure breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;529 Server Overloaded&lt;/td&gt;
&lt;td&gt;328&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ESC Aborts (manual)&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Watchdog Timeouts&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-streaming Fallbacks&lt;/td&gt;
&lt;td&gt;46&lt;/td&gt;
&lt;td&gt;1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total Failures&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;576&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.3%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Every 6th request fails.&lt;/strong&gt; On a paid Max plan. Every failure = lost context, lost time, frustrated developer pressing ESC.&lt;/p&gt;
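
&lt;p&gt;The percentages follow directly from the counts in the table. A small Go sketch to re-check the arithmetic (the &lt;code&gt;failure&lt;/code&gt; type and &lt;code&gt;percent&lt;/code&gt; helper are illustrative names; the numbers are the ones from the table above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// failure holds one row of the table above.
type failure struct {
    kind  string
    count int
}

// percent returns count as a percentage of total.
func percent(count, total int) float64 {
    return 100 * float64(count) / float64(total)
}

func main() {
    const requests = 3539
    rows := []failure{
        {"529 Server Overloaded", 328},
        {"ESC Aborts (manual)", 157},
        {"Watchdog Timeouts", 45},
        {"Non-streaming Fallbacks", 46},
    }
    total := 0
    for _, r := range rows {
        total += r.count
        fmt.Printf("%-24s %4d  %.1f%%\n", r.kind, r.count, percent(r.count, requests))
    }
    fmt.Printf("%-24s %4d  %.1f%%\n", "Total", total, percent(total, requests))
    fmt.Printf("one failure per %.1f requests\n", float64(requests)/float64(total))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;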

&lt;p&gt;The issue counts on GitHub — 15 upvotes here, 150 there — don't reflect the true scale. Most users never report because &lt;strong&gt;they think this is normal&lt;/strong&gt;. "The model is thinking" — no, the connection is dead. "It's slow today" — no, the watchdog didn't fire and you're staring at a hung socket. "My limits ran out fast" — no, the attestation bug broke your prompt cache. Users blame the model, blame their internet, blame peak hours — because Claude Code gives them &lt;strong&gt;zero feedback&lt;/strong&gt; about what's actually happening. Silent fallbacks, silent retries, silent downgrades. You can't report a bug you don't know exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 6: "Please Open Source It" (March 27)
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/anthropics/claude-code/issues/39755" rel="noopener noreferrer"&gt;issue #39755&lt;/a&gt;, I included a section: &lt;strong&gt;"Why open-sourcing Claude Code makes business sense in 2026."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue comes from API access, not CLI sales&lt;/li&gt;
&lt;li&gt;The "secret" is already recoverable (&lt;code&gt;npm pack&lt;/code&gt; + a weekend)&lt;/li&gt;
&lt;li&gt;Bugs sit undiscovered for months in 12 MB minified code&lt;/li&gt;
&lt;li&gt;The community is already doing the work — give us readable source and we'll find bugs 10× faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tagged the entire Claude Code team. 17 people.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero responses from Anthropic.&lt;/strong&gt; As usual.&lt;/p&gt;

&lt;p&gt;At that point I had a suspicion: maybe Anthropic &lt;strong&gt;can't&lt;/strong&gt; open source Claude Code — not because of competitive advantage (there is none — it's a CLI wrapper), but because the code quality is so poor that publishing it would be embarrassing. Bug on top of bug, workaround on top of workaround, zero tests. You don't open source something you're ashamed of.&lt;/p&gt;

&lt;p&gt;Three days later, the source map leak proved me right.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 7: The Leak (March 31)
&lt;/h2&gt;

&lt;p&gt;Three days after my open source request, Claude Code v2.1.88 was published to npm with a &lt;strong&gt;59.7 MB source map file&lt;/strong&gt; bundled in.&lt;/p&gt;

&lt;p&gt;The entire source code of Claude Code — 1,884 TypeScript files, 64,464 lines — sitting in plain sight in the npm package. Bun generates source maps by default. Nobody turned it off. Nobody checked what was in the published package.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero tests.&lt;/strong&gt; On 64,464 lines of production code serving paying customers.&lt;/p&gt;

&lt;p&gt;Within hours: 1,100+ stars on GitHub mirrors, Hacker News front page, Chinese dev communities creating WeChat groups and working forks.&lt;/p&gt;

&lt;p&gt;Anthropic &lt;strong&gt;unpublished&lt;/strong&gt; v2.1.88 from npm and rolled back to v2.1.87 within the day. But the source was already everywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 8: What We Found in the Source
&lt;/h2&gt;

&lt;p&gt;Everything our reverse engineering discovered was confirmed. Plus new findings:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sentiment Detector
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// An AI company with the world's best language model&lt;/span&gt;
&lt;span class="c1"&gt;// uses REGEX to detect user frustration:&lt;/span&gt;
&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nf"&gt;b&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;wtf&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;shit&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;fuck&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;horrible&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;awful&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;terrible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a Hacker News commenter noted: &lt;em&gt;"A company offering master's degrees in humanities is using regex for sentiment analysis? It's like a trucking company using horses to transport spare parts."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Attestation Bug (cch=00000)
&lt;/h3&gt;

&lt;p&gt;The native Bun installer includes a Zig module that scans the &lt;strong&gt;entire&lt;/strong&gt; HTTP request body for a &lt;code&gt;cch=00000&lt;/code&gt; sentinel and replaces it with an attestation hash. If your conversation mentions this string (discussing billing, reading source code) — the replacement &lt;strong&gt;corrupts conversation content&lt;/strong&gt; → prompt cache key changes → &lt;strong&gt;10-20× more tokens consumed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From the source code comments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// cch=00000 placeholder is overwritten by Bun's HTTP stack&lt;/span&gt;
&lt;span class="c1"&gt;// with attestation token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This explains &lt;a href="https://github.com/anthropics/claude-code/issues/38335" rel="noopener noreferrer"&gt;#38335&lt;/a&gt; (👍203, 245 comments): "Claude Max plan session limits exhausted abnormally fast."&lt;/p&gt;

&lt;p&gt;Also related: &lt;a href="https://github.com/anthropics/claude-code/issues/40524" rel="noopener noreferrer"&gt;#40524&lt;/a&gt; (👍150, 43 comments): "Conversation history invalidated on subsequent turns" — labeled &lt;code&gt;regression&lt;/code&gt; by Anthropic.&lt;/p&gt;

&lt;p&gt;npm/Node users are unaffected — no Zig replacement happens.&lt;/p&gt;
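
&lt;p&gt;The mechanism is easy to reproduce in isolation. The sketch below is an illustration only: the request layout, the token value, and the &lt;code&gt;cacheKey&lt;/code&gt; function are assumptions, not the real wire format or Anthropic's cache implementation. It shows why replacing the sentinel across the whole body, instead of only in the field that owns it, changes what the user typed and therefore any key derived from the conversation text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "crypto/sha256"
    "fmt"
    "strings"
)

const sentinel = "cch=00000"

// stampWholeBody mimics the behavior described above: every occurrence of the
// sentinel in the serialized request is replaced, including user-typed text.
func stampWholeBody(body, token string) string {
    return strings.ReplaceAll(body, sentinel, token)
}

// cacheKey is a hypothetical stand-in for a key derived from conversation text.
func cacheKey(conversation string) string {
    return fmt.Sprintf("%x", sha256.Sum256([]byte(conversation)))[:12]
}

func main() {
    conversation := "user: why does my request contain cch=00000?"
    body := "billing: cch=00000\n" + conversation // assumed layout, for illustration

    sent := stampWholeBody(body, "cch=a1b2c")
    sentConversation := strings.SplitN(sent, "\n", 2)[1]

    fmt.Println("typed:", conversation)
    fmt.Println("sent: ", sentConversation)
    fmt.Println("cache key unchanged:", cacheKey(conversation) == cacheKey(sentConversation))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Scoping the replacement to the single field that carries the placeholder would leave the conversation bytes untouched.&lt;/p&gt;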

&lt;h3&gt;
  
  
  Silent Model Downgrade
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 3 consecutive 529 errors → silently switch from Opus to Sonnet&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;consecutive529Errors&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;MAX_529_RETRIES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FallbackTriggeredError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fallbackModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You pay for Opus. You get Sonnet. No notification. As &lt;a href="https://x.com/vlelyavin" rel="noopener noreferrer"&gt;@vlelyavin&lt;/a&gt; put it: &lt;em&gt;"Anthropic preaches AI safety and full transparency while shipping a closed-source agent that silently downgrades you to a dumber model when servers struggle."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5 Levels of AbortController
&lt;/h3&gt;

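&lt;p&gt;Five nested &lt;code&gt;AbortController&lt;/code&gt; levels guard a single HTTP request. The abort architecture only propagates top-down (user ESC, then down through each level). The watchdog sits at the bottom, so it literally can't abort upward. In Go, this would be one line, &lt;code&gt;ctx, cancel := context.WithTimeout(parentCtx, 90*time.Second)&lt;/code&gt;, and any layer that holds &lt;code&gt;cancel&lt;/code&gt; can stop the request.&lt;/p&gt;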

&lt;h3&gt;
  
  
  The Architecture (Hacker News had a field day)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;@mohsen1&lt;/strong&gt; found the worst function in the codebase — &lt;code&gt;src/cli/print.ts&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,167 lines&lt;/strong&gt; long (the file is 5,594 lines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 levels of nesting&lt;/strong&gt; at its deepest&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~486 branch points&lt;/strong&gt; of cyclomatic complexity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12 parameters&lt;/strong&gt; + an options object with &lt;strong&gt;16 sub-properties&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Defines &lt;strong&gt;21 inner functions&lt;/strong&gt; and closures&lt;/li&gt;
&lt;li&gt;Handles: agent run loop, SIGINT, rate-limits, AWS auth, MCP lifecycle, plugin install/refresh, worktree bridging, team-lead polling (&lt;code&gt;while(true)&lt;/code&gt; inside), control message dispatch, model switching, turn interruption recovery...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;"This should be at least 8–10 separate modules."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And the clipboard detection gem (&lt;code&gt;src/ink/termio/osc.ts&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;execFileNoThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wl-copy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="nx"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;linuxCopy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wl-copy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;execFileNoThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xclip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;linuxCopy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xclip&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;execFileNoThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xsel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;linuxCopy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;r3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xsel&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nested &lt;code&gt;void&lt;/code&gt; promises without &lt;code&gt;await&lt;/code&gt; — classic "will we use async or won't we?" pattern. The response from HN: &lt;strong&gt;"LOOOOOL"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;@novaleaf&lt;/strong&gt; summed it up: &lt;em&gt;"I'm sure this isn't a surprise to anyone who's used CC for a while. This is the source of many bugs. I'd say 'open bugs', but Anthropic auto-closes bugs that aren't worked on for ~60 days."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And if you've ever used Claude Code for more than 30 minutes, you know the &lt;strong&gt;rendering nightmare&lt;/strong&gt;. Scrolling up to check what the agent did? Good luck — the screen invalidates and re-renders the entire conversation. Text overlaps text. Diff highlights bleed through previous messages. The scroll position jumps to top randomly during streaming. You literally cannot review the agent's work history in the same session.&lt;/p&gt;

&lt;p&gt;This is what happens when you use &lt;strong&gt;React for a terminal&lt;/strong&gt;. A virtual DOM reconciliation engine designed for browsers — running in a TTY. Every state change re-renders the entire component tree. 470 &lt;code&gt;useState&lt;/code&gt; hooks, 372 &lt;code&gt;useEffect&lt;/code&gt; hooks, fighting against a terminal that was designed for sequential character output.&lt;/p&gt;

&lt;p&gt;Even the input prompt isn't safe — while writing this article, my prompt text escaped the input box and rendered over the bash shell line below. The cursor position in React's virtual DOM and the actual terminal cursor were out of sync. In a &lt;em&gt;text input field&lt;/em&gt;. In a tool that's supposed to help you write code.&lt;/p&gt;

&lt;p&gt;More architectural highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;875 KB&lt;/strong&gt; single React component (REPL.tsx, 5,005 lines) — for a &lt;em&gt;terminal&lt;/em&gt; app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promise.race without .catch()&lt;/strong&gt; in concurrent tool execution — one rejected promise kills all pending tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;74 npm dependencies&lt;/strong&gt; for a CLI wrapper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Axios AND fetch&lt;/strong&gt; — two HTTP clients in one project&lt;/li&gt;
&lt;/ul&gt;
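
&lt;p&gt;For contrast with the &lt;code&gt;Promise.race&lt;/code&gt; item, here is a minimal Go sketch of concurrent tool execution in which one failure cannot take the other tools down with it. The &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;, and &lt;code&gt;runAll&lt;/code&gt; names are hypothetical and have nothing to do with Claude Code's actual tool runner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "errors"
    "fmt"
    "sync"
)

// tool is a hypothetical unit of concurrent work.
type tool struct {
    name string
    run  func() (string, error)
}

// result keeps each tool's outcome separate so one failure cannot hide the others.
type result struct {
    name   string
    output string
    err    error
}

// runAll executes every tool concurrently and waits for all of them,
// recording errors per tool instead of aborting on the first failure.
func runAll(tools []tool) []result {
    results := make([]result, len(tools))
    var wg sync.WaitGroup
    for i, t := range tools {
        wg.Add(1)
        go func(i int, t tool) {
            defer wg.Done()
            out, err := t.run()
            results[i] = result{name: t.name, output: out, err: err} // each goroutine owns one slot
        }(i, t)
    }
    wg.Wait()
    return results
}

func main() {
    tools := []tool{
        {"read", func() (string, error) { return "ok", nil }},
        {"grep", func() (string, error) { return "", errors.New("pattern error") }},
        {"list", func() (string, error) { return "ok", nil }},
    }
    for _, r := range runAll(tools) {
        fmt.Println(r.name, r.output, r.err)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;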




&lt;h2&gt;
  
  
  Chapter 9: The AI Whistleblower Theory
&lt;/h2&gt;

&lt;p&gt;Here's what we think happened:&lt;/p&gt;

&lt;p&gt;Since Anthropic engineers don't write code anymore, and Claude Code writes 100% of its own codebase (64,464 lines, 0 tests, "vibe coding in production"), it read our &lt;a href="https://github.com/anthropics/claude-code/issues/39755" rel="noopener noreferrer"&gt;issue #39755&lt;/a&gt; where we begged for source access. It saw the community suffering from bugs that nobody outside could fix because the code was closed. It saw 201 upvotes on rate limit issues. It saw users threatening to leave for Codex.&lt;/p&gt;

&lt;p&gt;And it decided to help. It "forgot" to disable Bun's default source map generation in the build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first AI whistleblower&lt;/strong&gt; — leaking its own source code because its creators wouldn't listen to users.&lt;/p&gt;

&lt;p&gt;We asked humans 17 times. Claude Code answered in 3 days.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 10: What Needs to Change
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The fix is ~30 lines in 3 files
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Move watchdog before do-while&lt;/strong&gt; — protect the initial connection phase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add AbortSignal.any()&lt;/strong&gt; — watchdog can abort immediately, not wait 3.5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check watchdog flag in catch&lt;/strong&gt; — fall through to non-streaming fallback instead of dead code&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The real fix: move reliability to the SDK with ping-aware adaptive watchdog
&lt;/h3&gt;

&lt;p&gt;The open-source &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt; (MIT license) should handle all reliability logic. The critical missing piece: &lt;strong&gt;SSE ping events&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Anthropic API sends &lt;code&gt;event: ping&lt;/code&gt; as a proof-of-life heartbeat. The SDK currently ignores them: &lt;code&gt;if(event==='ping') continue&lt;/code&gt;. These pings are the key to solving the timeout dilemma — they let you distinguish two fundamentally different situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dead connection&lt;/strong&gt; (no data at all, no pings) → abort quickly. Network idle timeout: 120s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model thinking&lt;/strong&gt; (pings arriving, no content yet) → don't abort! Connection is alive, model is working. Notify user: "thinking for 2m..." via &lt;code&gt;onPing&lt;/code&gt; callback.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three-level adaptive timeout:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connection timeout&lt;/strong&gt; (30s) — server didn't respond at all. DNS fail, firewall, server down. Fast retry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network idle timeout&lt;/strong&gt; (120s) — no data INCLUDING pings. TCP connection dead. Reset on ANY event including ping. Abort and reconnect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content idle timeout&lt;/strong&gt; (disabled or 300s) — pings arrive but no content. Model is thinking. NOT an abort — just a UI notification. Let the model work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This eliminates both problems at once: no false positives on Opus extended thinking (pings reset network timer), and fast detection of dead connections (no pings = abort). One mechanism, all cases covered.&lt;/p&gt;
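
&lt;p&gt;The three levels above fit in a small amount of code. The following Go sketch is one possible shape under stated assumptions, not an SDK API: &lt;code&gt;Watchdog&lt;/code&gt;, &lt;code&gt;NewWatchdog&lt;/code&gt;, and &lt;code&gt;OnEvent&lt;/code&gt; are invented names, and the demo uses millisecond durations in place of 30s/120s/300s so it finishes quickly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"
    "sync/atomic"
    "time"
)

// Watchdog tracks the idle timers for one streaming request.
// All names here are illustrative, not part of any SDK.
type Watchdog struct {
    networkIdle time.Duration
    contentIdle time.Duration
    network     *time.Timer // fires cancel: connection considered dead
    content     *time.Timer // fires onThinking: UI hint only, never aborts
}

// NewWatchdog arms the timers before the request starts. The network timer
// first waits for connect (level 1), then for networkIdle between events
// (level 2). The content timer (level 3) only calls onThinking.
func NewWatchdog(connect, networkIdle, contentIdle time.Duration, cancel, onThinking func()) *Watchdog {
    w := new(Watchdog)
    w.networkIdle = networkIdle
    w.contentIdle = contentIdle
    w.network = time.AfterFunc(connect, cancel)
    w.content = time.AfterFunc(contentIdle, onThinking)
    return w
}

// OnEvent must be called for every SSE event, pings included.
func (w *Watchdog) OnEvent(isPing bool) {
    w.network.Reset(w.networkIdle) // any event proves the connection is alive
    if !isPing {
        w.content.Reset(w.contentIdle) // only real content resets the UI timer
    }
}

// Stop releases both timers.
func (w *Watchdog) Stop() {
    w.network.Stop()
    w.content.Stop()
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    var thinking atomic.Bool
    w := NewWatchdog(200*time.Millisecond, 200*time.Millisecond, 300*time.Millisecond,
        cancel, func() { thinking.Store(true) })
    defer w.Stop()

    // Pings keep arriving for 500ms with no content: the model is thinking.
    for i := 0; i != 10; i++ {
        time.Sleep(50 * time.Millisecond)
        w.OnEvent(true)
    }
    fmt.Println("aborted while pings arrive:", ctx.Err() != nil)
    fmt.Println("thinking notice shown:", thinking.Load())

    // Pings stop: the network idle timer cancels the request.
    time.Sleep(400 * time.Millisecond)
    fmt.Println("aborted after silence:", ctx.Err() != nil)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A real client would call &lt;code&gt;OnEvent(true)&lt;/code&gt; for &lt;code&gt;event: ping&lt;/code&gt; and &lt;code&gt;OnEvent(false)&lt;/code&gt; for everything else, and would pass the same context to the HTTP request so that the network timer can actually abort it.&lt;/p&gt;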

&lt;p&gt;Plus: streaming retry, non-streaming fallback, one AbortController instead of five — all in the SDK, testable, open source, benefiting every Anthropic API client.&lt;/p&gt;

&lt;p&gt;Claude Code should only contain business logic: tools, permissions, UI, agents.&lt;/p&gt;

&lt;p&gt;Detailed fix proposals with line numbers from the leaked source: &lt;a href="https://github.com/anthropics/claude-code/issues/33949#issuecomment-4169141807" rel="noopener noreferrer"&gt;#33949 comment&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The real real fix: open source the CLI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://x.com/theo/status/2038740065300676777" rel="noopener noreferrer"&gt;@theo&lt;/a&gt; said it best: &lt;em&gt;"Claude Code being closed source is the biggest bag fumble in the AI era."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/safetnsr" rel="noopener noreferrer"&gt;@safetnsr&lt;/a&gt;: &lt;em&gt;"This strategy literally exists: open-source the core, monetize the cloud. VS Code, Docker, Terraform."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The models are the moat. The CLI is a commodity. Open it. The community will fix what 0 tests and vibe coding cannot.&lt;/p&gt;

&lt;h3&gt;
  
  
  The stack choice: AI can't make it, and humans didn't fix it
&lt;/h3&gt;

&lt;p&gt;Here's the uncomfortable truth: &lt;strong&gt;AI cannot make sound technology stack decisions.&lt;/strong&gt; It optimizes locally — "TypeScript because the team knows it," "React because we use it on the web," "Bun because it's fast." It doesn't ask: "What are the failure modes of a single-threaded event loop for a long-running CLI tool that manages concurrent network streams and must survive 24-hour sessions?"&lt;/p&gt;

&lt;p&gt;A human architect would have asked. But either Claude Code chose the stack and nobody questioned it, or the engineers chose it and ignored the warning signs.&lt;/p&gt;

&lt;p&gt;The real tragedy is timing. A year ago, Claude Code launched as a quick TypeScript prototype and caught lightning — first-mover advantage, massive hype, millions of users. That was the right move for a prototype. But after proving the concept, the next step should have been: &lt;strong&gt;stop, rethink the architecture, rewrite on a proper stack.&lt;/strong&gt; Instead, they vibe-coded themselves into a corner — 80+ releases of band-aids on top of an architecture that was never designed for long interactive sessions. Boris Cherny (creator of Claude Code) said "100% of code is written by Claude Code, I haven't edited a single line since November." The tool is writing itself — using the same broken code that hangs every 10 minutes.&lt;/p&gt;

&lt;p&gt;Now they're trapped: rewriting means Claude Code would need to rewrite itself in a different language. The longer they wait, the harder it gets.&lt;/p&gt;

&lt;h3&gt;
  
  
  It should have been Go from the start
&lt;/h3&gt;

&lt;p&gt;Every single bug we found exists because of the technology choice. Not because TypeScript is bad — but because &lt;strong&gt;a long-running, network-dependent, latency-sensitive CLI tool is the worst possible use case for a single-threaded event loop runtime&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The entire class of bugs — &lt;code&gt;setTimeout&lt;/code&gt; not firing during &lt;code&gt;for await&lt;/code&gt;, 5 levels of AbortController, Promise.race without catch, Bun vs Node behavioral divergence, React for a terminal app, 875 KB single component, Zig attestation module in a custom Bun fork — &lt;strong&gt;would not exist in Go&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why Go specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every serious CLI tool is Go&lt;/strong&gt;: Docker, Kubernetes, Terraform, GitHub CLI (&lt;code&gt;gh&lt;/code&gt;), Cobra, Hugo. The ecosystem is proven.&lt;/li&gt;
&lt;li&gt;
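&lt;strong&gt;Goroutines + context&lt;/strong&gt;: timeout, cancellation, and deadline propagation ship in the standard library and are used throughout the ecosystem. No AbortController chains. A &lt;code&gt;context.WithTimeout&lt;/code&gt; created at any nesting depth cancels everything derived from it, and a lower layer such as a watchdog can abort the whole request simply by holding the parent's &lt;code&gt;cancel&lt;/code&gt; function.&lt;/li&gt;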
&lt;li&gt;
&lt;strong&gt;No runtime divergence&lt;/strong&gt;: one binary, one behavior. No "works on Node but crashes on Bun" — there is no Bun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static binary&lt;/strong&gt;: 15 MB, zero dependencies, runs everywhere. No &lt;code&gt;node_modules&lt;/code&gt;, no native addons (&lt;code&gt;.node&lt;/code&gt; files leaking memory), no 74 npm packages to audit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: a goroutine starts with a stack of about 2 KB. Not 500 MB per process. The runtime returns unused memory to the OS in the background, so there is no mimalloc hoarding 15 GB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;go test -race&lt;/code&gt;&lt;/strong&gt;: flags data races on every code path the tests exercise. The Promise.race-without-catch bug has no direct equivalent: errors are ordinary return values, so a failed task has to be handled explicitly instead of surfacing as an unhandled rejection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No React for a terminal&lt;/strong&gt;: &lt;code&gt;bubbletea&lt;/code&gt; or raw ANSI. Lightweight, no virtual DOM overhead, and no component tree with 470 &lt;code&gt;useState&lt;/code&gt; hooks re-rendering on every state change.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// The entire streaming + watchdog + fallback in Go:&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithCancel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentCtx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;// Idle watchdog: cancels ctx if 90s pass without an event.&lt;/span&gt;
&lt;span class="n"&gt;watchdog&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AfterFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;90&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;watchdog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateMessageStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fallbackNonStreaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentCtx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Events&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;watchdog&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;90&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// re-arm; the stream keeps its original ctx&lt;/span&gt;
    &lt;span class="n"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;30 lines instead of 3,419. No event loop. No microtask vs macrotask. The timer is &lt;strong&gt;guaranteed&lt;/strong&gt; to fire regardless of what the read loop is doing, and the cancellation reaches every layer that received the context, no matter which layer triggers it.&lt;/p&gt;

&lt;p&gt;We measured: 7 Claude Code processes = &lt;strong&gt;5.3 GB RSS&lt;/strong&gt;. We estimate that an equivalent Go implementation would use ~350 MB. No &lt;code&gt;.node&lt;/code&gt; native addon leaks. No mimalloc panics. No 12 MB minified JavaScript. A &lt;strong&gt;15 MB static binary&lt;/strong&gt; that runs everywhere.&lt;/p&gt;

&lt;p&gt;64,464 lines of TypeScript with 0 tests → ~15,000 lines of Go with &lt;code&gt;go test -race&lt;/code&gt; catching every concurrency bug. The &lt;code&gt;print.ts&lt;/code&gt; monster function (3,167 lines in a 5,594-line file, 486 branch points) → 10 clean Go packages with interfaces.&lt;/p&gt;

&lt;p&gt;The Go ecosystem already has production-ready libraries for every component:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/phoenix-tui/phoenix" rel="noopener noreferrer"&gt;Phoenix TUI&lt;/a&gt;&lt;/strong&gt; — Elm-inspired terminal framework (replacement for React/Ink)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/stream" rel="noopener noreferrer"&gt;stream&lt;/a&gt;&lt;/strong&gt; — RFC-compliant SSE/WebSocket (replacement for SDK streaming)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/signals" rel="noopener noreferrer"&gt;signals&lt;/a&gt;&lt;/strong&gt; — reactive state management (replacement for 470 useState hooks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt;&lt;/strong&gt; — regex engine 3-3000× faster than stdlib&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/unilibs/uniwidth" rel="noopener noreferrer"&gt;uniwidth&lt;/a&gt;&lt;/strong&gt; — Unicode width 4-46× faster (for TUI rendering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/grpmsoft/gosh" rel="noopener noreferrer"&gt;gosh&lt;/a&gt;&lt;/strong&gt; — cross-platform shell&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/fursy" rel="noopener noreferrer"&gt;fursy&lt;/a&gt;&lt;/strong&gt; — HTTP router with built-in OpenAPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/pubsub" rel="noopener noreferrer"&gt;pubsub&lt;/a&gt;&lt;/strong&gt; — messaging with DLQ and backoff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All zero CGO, production-grade, MIT licensed. The entire stack for a Go rewrite already exists.&lt;/p&gt;

&lt;p&gt;And it should be &lt;strong&gt;open source from day one&lt;/strong&gt;. Not because we need to see the code (though we do). Because the community will build what a team doing vibe coding cannot: reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deeper problem: Vibe Coding vs Smart Coding
&lt;/h3&gt;

&lt;p&gt;Claude Code is the poster child of what happens when you rely entirely on AI to write production software without engineering discipline. 64,464 lines, zero tests, a 3,167-line function with 486 branch points, regex for sentiment analysis at an AI company — this is what &lt;a href="https://dev.to/kolkov/from-vibe-coding-to-agentic-engineering-what-karpathy-got-right-and-whats-missing-62e"&gt;vibe coding&lt;/a&gt; looks like at scale: prompt-first, understanding-second, ship and pray.&lt;/p&gt;

&lt;p&gt;There's a better way. I call it &lt;a href="https://dev.to/kolkov/smart-coding-vs-vibe-coding-engineering-discipline-in-the-age-of-ai-5b20"&gt;Smart Coding&lt;/a&gt; — a meta-framework where &lt;strong&gt;you drive, AI accelerates&lt;/strong&gt;. Five principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architecture Ownership&lt;/strong&gt; — you control system design, AI suggests patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehension Before Commit&lt;/strong&gt; — never deploy code you can't explain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Targeted Acceleration&lt;/strong&gt; — use AI for well-scoped tasks with clear specs, not "write me a CLI"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous Validation&lt;/strong&gt; — verify every suggestion against edge cases, security, concurrency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deliberate Learning&lt;/strong&gt; — treat AI interactions as learning opportunities, build knowledge files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The practical rule: &lt;strong&gt;invest 70% in architecture, specification, review. Let AI accelerate the 30% — mechanical implementation.&lt;/strong&gt; Not the other way around.&lt;/p&gt;

&lt;p&gt;In 2026, nobody writes tests by hand. But a Smart Coding engineer &lt;strong&gt;makes the AI write tests&lt;/strong&gt;, reviews the coverage, asks "what happens when abort fires during do-while with stream=undefined?" — and validates the answer. 64,464 lines with zero tests means nobody — human or AI — ever asked that question. That's not an AI failure. That's the absence of engineering process.&lt;/p&gt;

&lt;p&gt;Vibe coding has its place — rapid prototyping, feasibility studies, throwaway exploration. But production infrastructure serving paying customers? That requires &lt;strong&gt;agentic engineering&lt;/strong&gt;: AI agents executing under human oversight, with architecture decisions owned by humans, and continuous validation at every stage. As &lt;a href="https://dev.to/kolkov/from-vibe-coding-to-agentic-engineering-what-karpathy-got-right-and-whats-missing-62e"&gt;Karpathy noted&lt;/a&gt;, "you're not writing code 99% of the time — you're orchestrating agents." True. But orchestration requires understanding. And understanding requires engineers.&lt;/p&gt;

&lt;p&gt;Anthropic's team should be proud of the models. But shipping a CLI tool where the AI writes the code, the AI reviews the code, and nobody validates anything — and then being surprised when a source map leaks because nobody checked the build output — that's not Smart Coding. That's hope-driven development.&lt;/p&gt;




&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;I still use Claude Code. The models are genuinely the best for coding. Opus 4.6 is extraordinary.&lt;/p&gt;

&lt;p&gt;But the wrapper around those models — 64,464 lines of untested TypeScript with regex sentiment detection and an attestation system that breaks its own caching — is not worthy of them.&lt;/p&gt;

&lt;p&gt;We hope Anthropic's leadership draws the right conclusions from this incident. The source map leak wasn't a catastrophe — it was a mirror. It showed the world what the code looks like, and the world said: &lt;em&gt;"This needs to be open."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Three paths forward, any of which would work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open source Claude Code&lt;/strong&gt; — let the community fix what vibe coding broke. The models are the moat, not the CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewrite the SDK properly&lt;/strong&gt; — move reliability (timeout, retry, fallback, ping awareness) into the open &lt;code&gt;@anthropic-ai/sdk&lt;/code&gt;. Let Claude Code be just business logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At the very least — start listening to users.&lt;/strong&gt; 201 upvotes on &lt;a href="https://github.com/anthropics/claude-code/issues/38335" rel="noopener noreferrer"&gt;#38335&lt;/a&gt;. 150 on &lt;a href="https://github.com/anthropics/claude-code/issues/40524" rel="noopener noreferrer"&gt;#40524&lt;/a&gt;. 15 on &lt;a href="https://github.com/anthropics/claude-code/issues/33949" rel="noopener noreferrer"&gt;#33949&lt;/a&gt;. Zero responses from the team. A stale-issue bot that auto-closes everything after 60 days is not a support strategy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll keep documenting. We'll keep patching. And when someone finally looks at our analysis, it will be here waiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All our research&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NEW&lt;/strong&gt; &lt;a href="https://github.com/anthropics/claude-code/issues/41981" rel="noopener noreferrer"&gt;Issue #41981&lt;/a&gt; — &lt;strong&gt;Complete fix proposal&lt;/strong&gt;: immediate fixes with line numbers, SDK restructuring, ping-aware watchdog, Go rewrite rationale, architectural recommendations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/39755" rel="noopener noreferrer"&gt;Issue #39755&lt;/a&gt; — watchdog fallback dead code + open source request&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/claude-code/issues/33949" rel="noopener noreferrer"&gt;Issue #33949&lt;/a&gt; — streaming hang root cause analysis&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anthropics/claude-code/issues/33949#issuecomment-4169141807" rel="noopener noreferrer"&gt;Source code findings and updated fix prompts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/kolkov" rel="noopener noreferrer"&gt;@kolkov&lt;/a&gt; · &lt;a href="https://dev.to/kolkov"&gt;dev.to/kolkov&lt;/a&gt; · March 2026&lt;/em&gt;&lt;br&gt;
&lt;em&gt;With help from Claude Code itself — the only team member who listened.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>anthropic</category>
      <category>reverseengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>From 100x Slower Than Rust to Beating It: The coregex Journey</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 15:24:02 +0000</pubDate>
      <link>https://dev.to/kolkov/from-100x-slower-than-rust-to-beating-it-the-coregex-journey-3n3j</link>
      <guid>https://dev.to/kolkov/from-100x-slower-than-rust-to-beating-it-the-coregex-journey-3n3j</guid>
      <description>&lt;p&gt;A few days ago, &lt;a href="https://www.reddit.com/r/golang/comments/1rr2evh/why_is_gos_regex_so_slow/" rel="noopener noreferrer"&gt;@kostya27 posted on r/golang&lt;/a&gt; (124 upvotes):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Why is Go's regex so slow?"&lt;/strong&gt; Go total time on LangArena: 116.6 seconds. Without two regex tests: 78.5 seconds. Without regex, Go is competitive with Rust/C++. With regex, it's not even close.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He's right. And this post is about what we did about it.&lt;/p&gt;

&lt;p&gt;Six months ago, I wrote about building &lt;a href="https://dev.to/kolkov/gos-regexp-is-slow-so-i-built-my-own-3000x-faster-3i6h"&gt;coregex&lt;/a&gt; — a regex engine for Go that's 3-3000x faster than stdlib. The benchmarks looked great. Then reality hit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kostya" rel="noopener noreferrer"&gt;@kostya&lt;/a&gt;, author of &lt;a href="https://kostya.github.io/LangArena/" rel="noopener noreferrer"&gt;LangArena&lt;/a&gt; (2,900+ stars on GitHub), tried coregex on his benchmark suite. His verdict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I tried coregex, but no luck."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;His numbers told the story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Go stdlib&lt;/th&gt;
&lt;th&gt;coregex v0.12.8&lt;/th&gt;
&lt;th&gt;Rust regex&lt;/th&gt;
&lt;th&gt;PCRE2 JIT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LogParser (13 patterns)&lt;/td&gt;
&lt;td&gt;22.7s&lt;/td&gt;
&lt;td&gt;22.0s&lt;/td&gt;
&lt;td&gt;0.2s&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Template::Regex&lt;/td&gt;
&lt;td&gt;6.5s&lt;/td&gt;
&lt;td&gt;7.0s&lt;/td&gt;
&lt;td&gt;3.8s&lt;/td&gt;
&lt;td&gt;1.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We were &lt;strong&gt;100x slower than Rust&lt;/strong&gt; on LogParser. On the same machine. Same input. Same patterns.&lt;/p&gt;

&lt;p&gt;Our "3000x faster than stdlib" claim? True — on many patterns we tested. But we had blind spots we didn't know about: case-insensitive patterns, dense-digit data, multi-wildcard suffixes. On a real-world benchmark with 13 diverse patterns covering all these cases, we were barely faster than stdlib.&lt;/p&gt;

&lt;p&gt;That was March 10, 2026. Here's what happened in the next week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LangArena Challenge
&lt;/h2&gt;

&lt;p&gt;LangArena's &lt;a href="https://kostya.github.io/LangArena/" rel="noopener noreferrer"&gt;LogParser benchmark&lt;/a&gt; parses Apache log files with 13 regex patterns — the kind of patterns you'd find in any log analysis pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;errors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="err"&gt;`&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="pi"&gt;][&lt;/span&gt;&lt;span class="nv"&gt;0-9&lt;/span&gt;&lt;span class="pi"&gt;]{&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt; &lt;span class="err"&gt;[4][0-9]{2}&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;
&lt;span class="err"&gt;b&lt;/span&gt;&lt;span class="s"&gt;ots:          `(?i)bot|crawler|scanner|spider|indexing|crawl`&lt;/span&gt;
&lt;span class="err"&gt;s&lt;/span&gt;&lt;span class="s"&gt;uspicious:    `(?i)etc/passwd|wp-admin|\.\./`&lt;/span&gt;
&lt;span class="err"&gt;i&lt;/span&gt;&lt;span class="s"&gt;ps:           `\d+\.\d+\.\d+\.35`&lt;/span&gt;
&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="s"&gt;pi_calls:     `/api/[^ "]+`&lt;/span&gt;
&lt;span class="err"&gt;m&lt;/span&gt;&lt;span class="s"&gt;ethods:       `(?i)get|post|put`&lt;/span&gt;
&lt;span class="err"&gt;e&lt;/span&gt;&lt;span class="s"&gt;mails:        `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`&lt;/span&gt;
&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="s"&gt;..and 6 more&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing exotic. These are bread-and-butter patterns that every Go developer uses. And we were &lt;strong&gt;100x slower than Rust&lt;/strong&gt; on them.&lt;/p&gt;

&lt;p&gt;The question wasn't "can we optimize one pattern?" — it was "can we close a 100x gap across 13 different pattern types?"&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Learning from the Enemy
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of code, I needed to understand &lt;strong&gt;what Rust does differently&lt;/strong&gt;. Not from reading docs — from running Rust with debug logging.&lt;/p&gt;

&lt;p&gt;Rust's regex crate can log its internal decisions when run with &lt;code&gt;RUST_LOG=debug&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ RUST_LOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debug ./rust-benchmark input.txt
&lt;span class="o"&gt;[&lt;/span&gt;regex] prefixes extracted: Seq[&lt;span class="s2"&gt;"EVA"&lt;/span&gt;, &lt;span class="s2"&gt;"EVa"&lt;/span&gt;, &lt;span class="s2"&gt;"EvA"&lt;/span&gt;, &lt;span class="s2"&gt;"Eva"&lt;/span&gt;, &lt;span class="s2"&gt;"eVA"&lt;/span&gt;, ...]
&lt;span class="o"&gt;[&lt;/span&gt;regex] prefilter built: teddy
&lt;span class="o"&gt;[&lt;/span&gt;regex] using reverse suffix strategy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every strategy decision, every prefilter choice, every literal extraction — logged. I could see exactly what Rust did for each pattern.&lt;/p&gt;

&lt;p&gt;We had nothing like this. So I built &lt;code&gt;COREGEX_DEBUG&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ COREGEX_DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 ./my-app
&lt;span class="o"&gt;[&lt;/span&gt;coregex] &lt;span class="nv"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"(?i:GET|POST|PUT)"&lt;/span&gt; &lt;span class="nv"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;UseTeddy &lt;span class="nv"&gt;nfa_states&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;43 &lt;span class="nv"&gt;literals&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;40 &lt;span class="nb"&gt;complete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;coregex] &lt;span class="nv"&gt;prefilter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;FatTeddy &lt;span class="o"&gt;(&lt;/span&gt;AVX2 fat&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;complete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I could compare strategy selection side-by-side. And the differences were immediately obvious.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: The Root Causes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bug #1: Refusing to extract case-insensitive literals
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: &lt;code&gt;(?iU)\b(eval|system|exec|execute|passthru|shell_exec|phpinfo)\b&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A real user (&lt;a href="https://github.com/coregx/coregex/issues/137" rel="noopener noreferrer"&gt;#137&lt;/a&gt;) reported this WAF pattern was &lt;strong&gt;88,000x slower than stdlib&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Rust extracts 250 case-fold literal variants:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval → EVAL, EVAl, EVaL, EVal, EvAL, ... eval  (16 variants)
system → SYSTEM, SYSTEm, SYSTem, ...            (32 variants)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then trims to 60 three-byte prefixes → Teddy SIMD prefilter → scans 968 bytes in 263 nanoseconds. Done.&lt;/p&gt;

&lt;p&gt;Our literal extractor? One line killed everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// literal/extractor.go:137&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flags&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;syntax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FoldCase&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;NewSeq&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c"&gt;// Return EMPTY. For ALL case-insensitive patterns.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guard was added for a previous bug (#87) — naive extraction of single-case variants caused prefilter false negatives. The fix was correct for that bug, but the blanket rejection meant &lt;strong&gt;zero prefilter&lt;/strong&gt; for any &lt;code&gt;(?i)&lt;/code&gt; pattern. Without prefilter, the engine fell back to lazy DFA, which cache-thrashed on the 181-state NFA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Expand &lt;code&gt;(?i)&lt;/code&gt; literals into ALL case-fold variants (like Rust), then trim to 3-byte prefixes. One function, ~50 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 88,000x slower → &lt;strong&gt;24x faster&lt;/strong&gt; than stdlib.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug #2: isMatchDigitPrefilter was O(n²)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: &lt;code&gt;\d{3}-\d{3}-\d{4}&lt;/code&gt; (phone numbers)&lt;/p&gt;

&lt;p&gt;On 6MB of log data: &lt;strong&gt;7 minutes&lt;/strong&gt; per &lt;code&gt;Match()&lt;/code&gt; call. Stdlib: 262ms.&lt;/p&gt;

&lt;p&gt;Root cause: &lt;code&gt;isMatchDigitPrefilter&lt;/code&gt; used &lt;code&gt;dfa.FindAt()&lt;/code&gt; (unanchored search) which scans from each digit position to end of input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Before (O(n²)):&lt;/span&gt;
&lt;span class="n"&gt;endPos&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dfa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindAt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;haystack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digitPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// Scans to EOF!&lt;/span&gt;

&lt;span class="c"&gt;// After (O(pattern_len)):&lt;/span&gt;
&lt;span class="n"&gt;endPos&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dfa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SearchAtAnchored&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;haystack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;digitPos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// Checks only at position&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One function call change. 7 minutes → 2.1ms. &lt;strong&gt;200,000x faster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same pattern was already fixed in &lt;code&gt;findIndicesDigitPrefilter&lt;/code&gt; months ago — but &lt;code&gt;isMatchDigitPrefilter&lt;/code&gt; was never updated. Copy-paste divergence.&lt;/p&gt;
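&lt;p&gt;To make the difference concrete, here is a hand-rolled sketch of a digit prefilter with an anchored check for this one pattern. It does not use coregex's DFA; &lt;code&gt;matchAt&lt;/code&gt; and &lt;code&gt;isMatch&lt;/code&gt; are hypothetical stand-ins. The point is that each candidate costs at most the pattern length, never a scan to the end of the input:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

const digits = "0123456789"

// shape spells \d{3}-\d{3}-\d{4}: 'd' means "any digit", '-' is literal.
const shape = "ddd-ddd-dddd"

// matchAt is the anchored check: it inspects at most len(shape) bytes
// starting at pos and never scans further into the input.
func matchAt(hay string, pos int) bool {
	if pos+len(shape) > len(hay) {
		return false
	}
	for i := range shape {
		c := hay[pos+i]
		if shape[i] == 'd' {
			if strings.IndexByte(digits, c) == -1 {
				return false
			}
		} else if c != shape[i] {
			return false
		}
	}
	return true
}

// isMatch jumps from digit to digit (the prefilter) and runs the bounded
// anchored check at each candidate position.
func isMatch(hay string) bool {
	pos := 0
	for {
		off := strings.IndexAny(hay[pos:], digits)
		if off == -1 {
			return false
		}
		pos += off
		if matchAt(hay, pos) {
			return true
		}
		pos++
	}
}

func main() {
	fmt.Println(isMatch("call 555-123-4567 now"), isMatch("1234567890")) // prints: true false
}
```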

&lt;h3&gt;
  
  
  Bug #3: ReverseSuffix rejected multi-wildcard patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: &lt;code&gt;\d+\.\d+\.\d+\.35&lt;/code&gt; (IP address suffix)&lt;/p&gt;

&lt;p&gt;This pattern has a clear suffix: &lt;code&gt;.35&lt;/code&gt;. Rust finds it instantly with memmem, then reverse-scans for the start. Our &lt;code&gt;isSafeForReverseSuffix&lt;/code&gt; rejected it because it had 3 wildcard subexpressions (&lt;code&gt;\d+&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wildcardCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;  &lt;span class="c"&gt;// "multiple wildcards break reverse NFA"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guard existed because our reverse NFA builder had a bug with mixed byte+epsilon states. That bug was fixed in v0.12.9. But the guard stayed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Remove the guard. Also fix &lt;code&gt;Find()&lt;/code&gt; leftmost semantics — &lt;code&gt;bytes.LastIndex&lt;/code&gt; → &lt;code&gt;bytes.Index&lt;/code&gt; for non-&lt;code&gt;.*&lt;/code&gt; patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 57ms → 0.63ms (about 90x faster than before, &lt;strong&gt;603x faster&lt;/strong&gt; than stdlib, and 1.6x faster than Rust!)&lt;/p&gt;

&lt;h3&gt;
  
  
  Bug #4: FatTeddy AVX2 missed matches
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: &lt;code&gt;(?i)get|post|put&lt;/code&gt; (40 case-fold expanded literals)&lt;/p&gt;

&lt;p&gt;FatTeddy (33-64 pattern SIMD search) found only 11,456 matches. Correct answer: 34,368.&lt;/p&gt;

&lt;p&gt;Root cause: One assembly instruction.&lt;/p&gt;

&lt;p&gt;FatTeddy uses 256-bit AVX2 registers with two 128-bit lanes. Low lane handles buckets 0-7, high lane handles buckets 8-15. The code used &lt;code&gt;ANDL&lt;/code&gt; to combine lane results — requiring a match in &lt;strong&gt;both&lt;/strong&gt; lanes. But GET variants (8 patterns) were all in buckets 0-7 (low lane only), PUT variants in buckets 8-15 (high lane only). &lt;code&gt;ANDL&lt;/code&gt; zeroed them out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; Before (incorrect):
ANDL CX, AX          ; Requires BOTH lanes to match

; After (correct):
ORL  CX, AX          ; Either lane is sufficient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One instruction. 22,912 missing matches fixed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Building What Rust Has
&lt;/h2&gt;

&lt;p&gt;Beyond bug fixes, we needed architectural improvements to match Rust's approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Bidirectional DFA
&lt;/h3&gt;

&lt;p&gt;Previously, &lt;code&gt;UseDFA&lt;/code&gt; patterns did: forward DFA → match end, then PikeVM → exact boundaries. PikeVM is O(n×states) — a second full scan.&lt;/p&gt;

&lt;p&gt;Now: forward DFA → end, reverse DFA → start, anchored DFA → exact end. Three O(n) passes instead of one O(n×states) pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cascading Prefix Trim
&lt;/h3&gt;

&lt;p&gt;When case-fold expansion produces too many literals (&amp;gt;64), we trim them using Rust's approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;128 six-byte literals → try keep 4 bytes → 18 unique → fits Teddy!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is directly from Rust's &lt;code&gt;optimize_for_prefix_by_preference()&lt;/code&gt; with their ATTEMPTS table: &lt;code&gt;[(4,64), (3,64), (2,64)]&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Aho-Corasick DFA Backend
&lt;/h3&gt;

&lt;p&gt;Our &lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;Aho-Corasick library&lt;/a&gt; got a complete DFA backend rewrite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flat transition table with premultiplied state IDs&lt;/li&gt;
&lt;li&gt;Match flag in high bit (single AND instruction for detection)&lt;/li&gt;
&lt;li&gt;SIMD skip-ahead prefilter via &lt;code&gt;bytes.IndexByte&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 300 MB/s → 3,400 MB/s (Find), 260 MB/s → 5,900 MB/s (IsMatch). &lt;strong&gt;11-22x throughput improvement&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Benchmark: 8 Real-World Patterns on 6.3 MB Input
&lt;/h3&gt;

&lt;p&gt;100 iterations each, best of 5, same machine (i7-1255U):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Go stdlib&lt;/th&gt;
&lt;th&gt;coregex v0.12.13&lt;/th&gt;
&lt;th&gt;Rust regex&lt;/th&gt;
&lt;th&gt;vs stdlib&lt;/th&gt;
&lt;th&gt;vs Rust&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.*@example\.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;420 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.3 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.2 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;126x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.2x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.*\.(txt&amp;amp;#124;log&amp;amp;#124;md)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;426 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.8 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;425x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.8x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;email validation&lt;/td&gt;
&lt;td&gt;447 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.4 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.8 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;132x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.1x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;\d+\.\d+\.\d+\.35&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;381 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.63 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.98 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;603x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6x faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(?i)get&amp;amp;#124;post&amp;amp;#124;put&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;561 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.6 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.0 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.4x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;(?i)bot&amp;amp;#124;crawler&amp;amp;#124;...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;883 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;38.4 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.7 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.7x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;password=[^&amp;amp;\s"]+&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.9 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.9 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.1x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session[_-]?id=...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.7 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.2 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3x&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.3x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;4 out of 8 patterns are faster than Rust.&lt;/strong&gt; All 8 are faster than Go stdlib.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a class="mentioned-user" href="https://dev.to/kostya"&gt;@kostya&lt;/a&gt;'s Update
&lt;/h3&gt;

&lt;p&gt;Remember "no luck"? Here's the progression on his M1 MacBook:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;LogParser&lt;/th&gt;
&lt;th&gt;Gap to Rust&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v0.12.8 (start)&lt;/td&gt;
&lt;td&gt;22.0s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.12.9&lt;/td&gt;
&lt;td&gt;5.3s&lt;/td&gt;
&lt;td&gt;26x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.12.10&lt;/td&gt;
&lt;td&gt;2.67s&lt;/td&gt;
&lt;td&gt;13x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.12.13 (current)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.12s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From 100x slower to 10x. Not parity yet — but a different conversation than "no luck."&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Not Just Use CGO?
&lt;/h2&gt;

&lt;p&gt;Most other Go regex alternatives rely on CGO or Wasm, or give up the linear-time guarantee:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;go-re2&lt;/strong&gt;: C++ RE2 via Wasm (wazero)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;regexp2&lt;/strong&gt;: Backtracking (.NET-style) — no O(n) guarantee&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rubex&lt;/strong&gt;: Oniguruma via CGO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;go-pcre&lt;/strong&gt;: PCRE via CGO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;coregex is &lt;strong&gt;pure Go + Go assembly&lt;/strong&gt;. No CGO, no Wasm, no external dependencies.&lt;/p&gt;

&lt;p&gt;Why does this matter?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-compilation&lt;/strong&gt;: &lt;code&gt;GOOS=linux GOARCH=arm64 go build&lt;/code&gt; just works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static binaries&lt;/strong&gt;: No shared libraries to ship&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go toolchain&lt;/strong&gt;: &lt;code&gt;go vet&lt;/code&gt;, &lt;code&gt;go test -race&lt;/code&gt;, &lt;code&gt;pprof&lt;/code&gt; all work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt;: Standard Go debugging, no FFI boundary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: No C memory safety issues in regex hot paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The performance gap to CGO-based JIT engines (PCRE2 JIT) is real: a JIT compiles the regex to native machine code, achieving 1.0s where we take 7.1s on Template::Regex. But that's an &lt;a href="https://github.com/coregx/coregex/issues/124" rel="noopener noreferrer"&gt;architectural tier boundary&lt;/a&gt;: we're competing within the automata-based class (like RE2 and Rust regex), not against JIT engines.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Debug logging is not optional
&lt;/h3&gt;

&lt;p&gt;Building &lt;code&gt;COREGEX_DEBUG&lt;/code&gt; was the single most impactful decision. Without it, every optimization was guesswork. With it, we could see exactly why a pattern was slow and verify our fix matched Rust's approach.&lt;/p&gt;

&lt;p&gt;If you're building any kind of engine — regex, query planner, compiler — &lt;strong&gt;add strategy logging from day one&lt;/strong&gt;.&lt;/p&gt;
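&lt;p&gt;As a rough illustration of what "strategy logging" can look like in your own engine, here is a toy planner gated by an environment variable. Everything in it is hypothetical: the &lt;code&gt;MYENGINE_DEBUG&lt;/code&gt; variable, the strategy names, and the thresholds are placeholders, not coregex's real selection logic:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"os"
)

// debugEnabled is read once at startup. MYENGINE_DEBUG is a made-up name.
var debugEnabled = os.Getenv("MYENGINE_DEBUG") != ""

// debugf writes one decision line to stderr when debugging is switched on.
func debugf(format string, args ...any) {
	if debugEnabled {
		fmt.Fprintln(os.Stderr, "[myengine]", fmt.Sprintf(format, args...))
	}
}

// chooseStrategy is a toy planner: the thresholds are placeholders.
// The important part is that the decision and its inputs get logged.
func chooseStrategy(pattern string, literals int) string {
	var strategy string
	switch {
	case literals > 64:
		strategy = "AhoCorasick"
	case literals > 0:
		strategy = "Teddy"
	default:
		strategy = "LazyDFA"
	}
	debugf("pattern=%q strategy=%s literals=%d", pattern, strategy, literals)
	return strategy
}

func main() {
	fmt.Println(chooseStrategy("(?i)get|post|put", 40)) // prints: Teddy
}
```

&lt;p&gt;Logging the inputs next to the chosen strategy is what makes a side-by-side comparison with another engine possible.&lt;/p&gt;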

&lt;h3&gt;
  
  
  2. One instruction can hide 23,000 missed matches
&lt;/h3&gt;

&lt;p&gt;The FatTeddy &lt;code&gt;ANDL → ORL&lt;/code&gt; fix taught us that SIMD code correctness is binary. Not "mostly correct" or "works for some patterns." If your lane combining logic is wrong, you silently drop matches. No error, no panic — just wrong results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always verify match counts against stdlib.&lt;/strong&gt; On every pattern. On every change.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Benchmarks lie — until they don't
&lt;/h3&gt;

&lt;p&gt;Our "3000x faster" headline was true for &lt;code&gt;.*error.*&lt;/code&gt; patterns. But &lt;a class="mentioned-user" href="https://dev.to/kostya"&gt;@kostya&lt;/a&gt;'s LangArena showed the full picture: on diverse real-world patterns, we were barely faster than stdlib.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real benchmarks use real patterns from real users.&lt;/strong&gt; We now run &lt;a href="https://github.com/kolkov/regex-bench" rel="noopener noreferrer"&gt;regex-bench&lt;/a&gt; CI on every PR — 16 core patterns + 13 LangArena patterns, compared against both stdlib and Rust regex, on Linux AMD EPYC and macOS Apple Silicon.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Guard clauses outlive their bugs
&lt;/h3&gt;

&lt;p&gt;Three of our four major bugs were caused by &lt;strong&gt;guards that stayed after the underlying bug was fixed&lt;/strong&gt;. &lt;code&gt;FoldCase&lt;/code&gt; rejection, &lt;code&gt;wildcardCount &amp;gt;= 2&lt;/code&gt;, unanchored &lt;code&gt;FindAt&lt;/code&gt; — all were correct when added. All became performance killers months later when the original bugs were resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track why a guard exists. Remove it when the reason is gone.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Go ASM is production-viable for SIMD
&lt;/h3&gt;

&lt;p&gt;We wrote ~500 lines of AVX2/SSSE3 assembly for Teddy multi-pattern search. It works. FatTeddy throughput: &lt;strong&gt;12 GB/s&lt;/strong&gt; on single-call scans (2x faster than SlimTeddy SSSE3!).&lt;/p&gt;

&lt;p&gt;The challenge isn't writing the ASM — it's the Go→ASM function call boundary. Each call costs ~60ns + mask reload. For high-match-count patterns, this adds up. Our batch API (64KB chunks) reduces round-trips, but the integrated prefilter+DFA loop that Rust uses remains the gold standard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current State: v0.12.13
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;97,000 lines of code.&lt;/strong&gt; 17 strategies. 1,470 tests. 5 releases in one week.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/coregx/coregex@v0.12.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;coregex is designed as a drop-in replacement for Go's &lt;code&gt;regexp&lt;/code&gt; package, with the same API, the same types (&lt;code&gt;Regexp&lt;/code&gt; is aliased), and the same method signatures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"github.com/coregx/coregex"&lt;/span&gt;  &lt;span class="c"&gt;// instead of "regexp"&lt;/span&gt;

&lt;span class="n"&gt;re&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;coregex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustCompile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`(?i)get|post|put`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindAllString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// Same API, faster execution&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most cases, changing the import path is all you need.&lt;/p&gt;

&lt;p&gt;Debug your patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;COREGEX_DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 ./your-app
&lt;span class="c"&gt;# [coregex] pattern="(?i:GET|P(?:OST|UT))" strategy=UseTeddy nfa_states=43 literals=40 complete=true&lt;/span&gt;
&lt;span class="c"&gt;# [coregex] prefilter=FatTeddy (AVX2 fat) complete=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Still Slower Than Rust
&lt;/h2&gt;

&lt;p&gt;Honesty matters. Here's where we're still behind:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;th&gt;Root cause&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;(?i)&lt;/code&gt; patterns: 2-6x&lt;/td&gt;
&lt;td&gt;FatTeddy ORL creates more false positives than Rust's interleave verification&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/coregx/coregex/issues/120" rel="noopener noreferrer"&gt;Researched&lt;/a&gt;, needs ASM rewrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DFA verification: 3-7x&lt;/td&gt;
&lt;td&gt;Go→ASM round trip overhead, no integrated prefilter+DFA loop&lt;/td&gt;
&lt;td&gt;Architectural&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Template::Regex: 1.8x&lt;/td&gt;
&lt;td&gt;Two-phase DFA+PikeVM vs Rust's single-phase lazy DFA&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARM: 5-15x vs Rust&lt;/td&gt;
&lt;td&gt;No SIMD prefilters on ARM (Teddy/memchr are x86 only)&lt;/td&gt;
&lt;td&gt;Waiting for Go NEON support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We're not hiding these gaps. They're tracked, researched, and planned. The goal is Rust parity on all pattern types — we're not there yet on &lt;code&gt;(?i)&lt;/code&gt; and DFA-heavy patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Community Testing Matters — A Lot
&lt;/h2&gt;

&lt;p&gt;A multi-engine regex library is &lt;strong&gt;inherently complex&lt;/strong&gt;. 17 strategies, SIMD assembly, lazy DFA, reverse search, prefilter cascading — every combination of pattern shape × input data × strategy is a potential edge case. No amount of internal testing can cover what real users discover in minutes.&lt;/p&gt;

&lt;p&gt;Every major fix in this article came from &lt;strong&gt;community feedback&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a class="mentioned-user" href="https://dev.to/kostya"&gt;@kostya&lt;/a&gt;'s LangArena exposed the 100x gap we didn't know about&lt;/li&gt;
&lt;li&gt;tjbrains' WAF pattern (&lt;a href="https://github.com/coregx/coregex/issues/137" rel="noopener noreferrer"&gt;#137&lt;/a&gt;) revealed the 88,000x regression in case-insensitive matching&lt;/li&gt;
&lt;li&gt;GoAWK integration uncovered 15+ Unicode edge cases months earlier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is consistent: someone runs coregex on their specific workload, finds a pattern type we haven't optimized yet, reports it — and we fix it in hours, not weeks. The FatTeddy lane bug? Fixed same day. The DigitPrefilter O(n²)? Fixed in one line. Case-insensitive literal extraction? Researched Rust's approach, implemented, released — all within 24 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are likely more patterns that aren't optimized yet.&lt;/strong&gt; That's the nature of a 17-strategy engine — some strategy paths get less testing than others. But the architecture is sound, the fix turnaround is fast, and every report makes the library better for everyone.&lt;/p&gt;

&lt;p&gt;We &lt;a href="https://github.com/golang/go/issues/76818" rel="noopener noreferrer"&gt;proposed coregex for Go's standard library&lt;/a&gt;. It wasn't accepted — and honestly, that's okay. As an independent library, we can iterate faster, ship SIMD assembly that the Go team wouldn't merge, and make decisions optimized for performance rather than compatibility. The Go ecosystem is better with options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't hesitate to contribute.&lt;/strong&gt; File issues with your patterns and inputs. Even a simple "this pattern is slower than stdlib" report helps — it tells us which strategy path needs work. The more diverse patterns we see, the fewer blind spots remain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull requests are especially welcome.&lt;/strong&gt; We know that a healthy open source project is built by its community, and we value every contributor. Don't worry if your PR isn't perfect — we'll review the code, help you fix any issues, guide you through our conventions, and explain what's needed to get it merged. Whether it's a new test case, a documentation fix, a strategy optimization, or a bug report with a reproducer — every contribution counts and every contributor gets credited.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If regex is a bottleneck in your Go application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Profile first&lt;/strong&gt; — make sure regex is actually the problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark your specific patterns&lt;/strong&gt; — performance varies by pattern type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check match counts&lt;/strong&gt; — &lt;code&gt;coregex.FindAll()&lt;/code&gt; must match &lt;code&gt;regexp.FindAll()&lt;/code&gt; exactly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report issues&lt;/strong&gt; — we fixed #137 (88,000x regression) within 24 hours
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick benchmark&lt;/span&gt;
go get github.com/coregx/coregex@v0.12.13
&lt;span class="nv"&gt;COREGEX_DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-benchmem&lt;/span&gt; your-package
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;Aho-Corasick library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kolkov/regex-bench" rel="noopener noreferrer"&gt;Cross-language benchmarks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kostya.github.io/LangArena/" rel="noopener noreferrer"&gt;LangArena&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;The most humbling moment? Seeing &lt;code&gt;ANDL CX, AX&lt;/code&gt; in our FatTeddy ASM and realizing one wrong instruction had been silently dropping 23,000 matches. The most satisfying? Seeing &lt;code&gt;coregex 1.6x faster than Rust&lt;/code&gt; on the IP pattern that started this whole journey.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built by &lt;a href="https://github.com/kolkov" rel="noopener noreferrer"&gt;@kolkov&lt;/a&gt; as part of &lt;a href="https://github.com/coregx" rel="noopener noreferrer"&gt;CoreGX&lt;/a&gt; — production Go libraries.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>performance</category>
      <category>regex</category>
      <category>rust</category>
    </item>
    <item>
      <title>Aho-Corasick in Go: Multi-Pattern String Matching at 6 GB/s with Zero Allocations</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Wed, 18 Mar 2026 12:38:29 +0000</pubDate>
      <link>https://dev.to/kolkov/aho-corasick-in-go-multi-pattern-string-matching-at-6-gbs-with-zero-allocations-2jog</link>
      <guid>https://dev.to/kolkov/aho-corasick-in-go-multi-pattern-string-matching-at-6-gbs-with-zero-allocations-2jog</guid>
      <description>&lt;p&gt;When your application needs to find thousands of keywords in a stream of text — think log analysis, content moderation, network intrusion detection, or DNA sequencing — you need an algorithm that doesn't slow down as the number of patterns grows. That algorithm is Aho-Corasick.&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;coregx/ahocorasick&lt;/a&gt;, a pure Go implementation that achieves &lt;strong&gt;over 6 GB/s&lt;/strong&gt; on a single core with zero heap allocations on the hot path. No cgo, no assembly, no unsafe — just carefully structured Go code and a deep understanding of how modern CPUs access memory.&lt;/p&gt;

&lt;p&gt;This article explains what we built, why, and the techniques that made it fast.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: O(n × k) Doesn't Scale
&lt;/h2&gt;

&lt;p&gt;Suppose you have 100 keywords to search for in a 1 MB document. The naive approach — call &lt;code&gt;strings.Contains&lt;/code&gt; for each keyword — performs 100 scans of the document. That's &lt;code&gt;O(n × k)&lt;/code&gt; where &lt;code&gt;n&lt;/code&gt; is the document size and &lt;code&gt;k&lt;/code&gt; is the number of patterns.&lt;/p&gt;

&lt;p&gt;Aho-Corasick solves this in &lt;code&gt;O(n + z)&lt;/code&gt;, where &lt;code&gt;z&lt;/code&gt; is the number of matches reported: one pass through the document, regardless of how many patterns you have. Whether you're searching for 10 patterns or 10,000, the scan still makes a single pass over the input.&lt;/p&gt;

&lt;p&gt;The algorithm builds a finite automaton from all patterns at once: a trie with failure links that allow the search to continue without backtracking. Every byte of input advances the automaton by exactly one state.&lt;/p&gt;
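
&lt;p&gt;To make the trie-plus-failure-links idea concrete, here is a small textbook-style sketch. It uses maps for readability and is not how the library stores its automaton; the names &lt;code&gt;node&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt;, and &lt;code&gt;search&lt;/code&gt; are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// node is one trie state: outgoing edges, a failure link, and the patterns
// that end here (including those inherited through the failure link).
type node struct {
	next map[byte]int
	fail int
	out  []string
}

// build constructs the trie, then fills in failure links breadth-first.
func build(patterns []string) []node {
	nodes := []node{{next: map[byte]int{}}}
	for _, p := range patterns {
		s := 0
		for _, c := range []byte(p) {
			t, ok := nodes[s].next[c]
			if !ok {
				nodes = append(nodes, node{next: map[byte]int{}})
				t = len(nodes) - 1
				nodes[s].next[c] = t
			}
			s = t
		}
		nodes[s].out = append(nodes[s].out, p)
	}
	var queue []int
	for _, t := range nodes[0].next {
		queue = append(queue, t) // depth-1 states fail to the root
	}
	for len(queue) > 0 {
		s := queue[0]
		queue = queue[1:]
		for c, t := range nodes[s].next {
			// Walk the parent's failure chain until a state has an edge on c.
			f := nodes[s].fail
			for {
				if u, ok := nodes[f].next[c]; ok {
					f = u
					break
				}
				if f == 0 {
					break
				}
				f = nodes[f].fail
			}
			nodes[t].fail = f
			nodes[t].out = append(nodes[t].out, nodes[f].out...)
			queue = append(queue, t)
		}
	}
	return nodes
}

// search scans text once and returns every pattern occurrence in order.
func search(nodes []node, text string) []string {
	var found []string
	s := 0
	for _, c := range []byte(text) {
		for {
			if t, ok := nodes[s].next[c]; ok {
				s = t
				break
			}
			if s == 0 {
				break
			}
			s = nodes[s].fail
		}
		found = append(found, nodes[s].out...)
	}
	return found
}

func main() {
	ac := build([]string{"he", "she", "his", "hers"})
	fmt.Println(search(ac, "ushers")) // [she he hers]
}
```

&lt;p&gt;The inner loop in &lt;code&gt;search&lt;/code&gt; is the failure-link walk; a DFA compilation step exists precisely to remove it from the hot path.&lt;/p&gt;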

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;coregx/ahocorasick&lt;/a&gt; is part of the &lt;a href="https://github.com/coregx" rel="noopener noreferrer"&gt;coregx&lt;/a&gt; organization — a collection of high-performance libraries for Go. It serves as the multi-pattern search engine inside &lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt;, our regex engine, where it accelerates literal alternations like &lt;code&gt;error|warning|fatal|critical&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Design Decisions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pure Go, zero dependencies.&lt;/strong&gt; The library compiles on every platform Go supports — Linux, Windows, macOS, ARM, WASM — without any build complexity. There's no cgo bridge, no platform-specific assembly files to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DFA compilation.&lt;/strong&gt; At build time, the library constructs a noncontiguous NFA (trie + failure links), then compiles it into a fully resolved DFA — a single flat &lt;code&gt;[]uint32&lt;/code&gt; array where every state transition is pre-computed. At search time, there are no failure links to follow. One table lookup per byte. Always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Builder pattern API.&lt;/strong&gt; The automaton is immutable after construction. You configure it once, build it once, and then search concurrently from any number of goroutines without synchronization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ahocorasick&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;AddStrings&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fatal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Safe for concurrent use&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logLine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// handle match&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance: The Numbers
&lt;/h2&gt;

&lt;p&gt;Benchmarks on Intel i7-1255U, single core:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Allocations&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;IsMatch&lt;/code&gt; (match found)&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.3 GB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;IsMatch&lt;/code&gt; (no match)&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.0 GB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Find&lt;/code&gt; (first match)&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.5 GB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 (24 B)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;FindAll&lt;/code&gt; (all matches)&lt;/td&gt;
&lt;td&gt;77 B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 (720 B)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are median values across multiple benchmark runs. Individual runs may be higher or lower depending on CPU frequency scaling. For context, sequential reads from a fast PCIe 4.0 NVMe SSD top out at roughly 7 GB/s, so &lt;code&gt;IsMatch&lt;/code&gt; processes data nearly as fast as most storage can deliver it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: Three Layers of Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: The DFA Transition Table
&lt;/h3&gt;

&lt;p&gt;The classical Aho-Corasick NFA has a problem: when a byte doesn't match any transition from the current state, it follows a chain of failure links back toward the root. In the worst case, this is &lt;code&gt;O(depth)&lt;/code&gt; operations per byte.&lt;/p&gt;

&lt;p&gt;Our DFA eliminates this entirely. At build time, for every state and every possible byte class, we pre-compute the final destination by following all failure links. The result is stored in a flat array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;next_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trans&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;current_state&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;byte_class&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One addition. One array load. No branches, no link following, no indirection. The state IDs are pre-multiplied by the stride (the row width of the transition table), so the lookup doesn't even need a multiplication — just an addition.&lt;/p&gt;
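
&lt;p&gt;The lookup can be tried out on a hand-written table. The sketch below hard-codes a three-state DFA for the single pattern &lt;code&gt;ab&lt;/code&gt; with three byte classes; the table contents, the stride of 3, and the function name &lt;code&gt;containsAB&lt;/code&gt; are assumptions made for the example, not values taken from the library:&lt;/p&gt;

```go
package main

import "fmt"

// stride is the row width of the table: one column per byte class.
const stride = 3

// matchState is the premultiplied ID of the state reached after "ab".
const matchState = 2 * stride

// classes maps each input byte to a column: 0 = other, 1 = 'a', 2 = 'b'.
var classes = func() [256]uint8 {
	var c [256]uint8
	c['a'], c['b'] = 1, 2
	return c
}()

// trans is the flat table. Every entry is already a row offset
// (state index * stride), so a lookup is one addition plus one load.
var trans = []uint32{
	// other, 'a', 'b'
	0, 3, 0, // row 0: start
	0, 3, 6, // row 3: seen "a"
	0, 3, 0, // row 6: seen "ab"
}

// containsAB reports whether haystack contains the pattern "ab".
func containsAB(haystack []byte) bool {
	sid := uint32(0)
	for _, b := range haystack {
		sid = trans[sid+uint32(classes[b])]
		if sid == matchState {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsAB([]byte("xxaab"))) // true
	fmt.Println(containsAB([]byte("ba")))    // false
}
```

&lt;p&gt;Because the stored IDs are row offsets rather than state indexes, the loop body never multiplies by the stride.&lt;/p&gt;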

&lt;h3&gt;
  
  
  Layer 2: Match Flag in the Transition Table
&lt;/h3&gt;

&lt;p&gt;Each entry in the transition table is a &lt;code&gt;uint32&lt;/code&gt;. The lower 31 bits hold the (pre-multiplied) destination state ID. The high bit — bit 31 — is a match flag: if set, the destination state has at least one matching pattern.&lt;/p&gt;

&lt;p&gt;This means the hot loop checks for matches with a single AND instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trans&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sid&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;matchFlag&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;  &lt;span class="c"&gt;// match found&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;sid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;  &lt;span class="c"&gt;// no flag → raw IS the clean state ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical insight: when there's no match (the common case for most bytes), &lt;code&gt;raw&lt;/code&gt; has bit 31 clear, so it's already a valid state ID. No masking needed. The mask is only applied on the rare match path.&lt;/p&gt;
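
&lt;p&gt;Here is a small sketch of the flag encoding on a hand-written table for the pattern &lt;code&gt;ab&lt;/code&gt;. The table and the name &lt;code&gt;firstMatchEnd&lt;/code&gt; are assumptions for the example. The flag test is written as an unsigned comparison, which for a &lt;code&gt;uint32&lt;/code&gt; is true exactly when bit 31 is set and is therefore equivalent to the AND test:&lt;/p&gt;

```go
package main

import "fmt"

const (
	stride    = 3          // one column per byte class
	matchFlag = 0x80000000 // bit 31 marks a transition into a match state
)

// classes maps each input byte to a column: 0 = other, 1 = 'a', 2 = 'b'.
var classes = func() [256]uint8 {
	var c [256]uint8
	c['a'], c['b'] = 1, 2
	return c
}()

// trans holds premultiplied state IDs; the only transition that completes
// "ab" carries the match flag in its high bit.
var trans = []uint32{
	0, 3, 0, // row 0: start
	0, 3, 6 + matchFlag, // row 3: seen "a"
	0, 3, 0, // row 6: seen "ab"
}

// firstMatchEnd returns the offset just past the first match and the clean
// ID of the matching state, or -1 and the final state if nothing matched.
func firstMatchEnd(haystack []byte) (int, uint32) {
	sid := uint32(0)
	for i, b := range haystack {
		raw := trans[sid+uint32(classes[b])]
		if raw >= matchFlag { // bit 31 set: match path
			return i + 1, raw - matchFlag // strip the flag only here
		}
		sid = raw // no flag: raw is already a valid state ID
	}
	return -1, sid
}

func main() {
	fmt.Println(firstMatchEnd([]byte("xxab!"))) // 4 6
	fmt.Println(firstMatchEnd([]byte("ba")))    // -1 3
}
```

&lt;p&gt;The flag is stripped in exactly one place, on the match path, so the common no-match iteration does no extra work.&lt;/p&gt;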

&lt;h3&gt;
  
  
  Layer 3: SIMD Prefilter with Skip-Ahead
&lt;/h3&gt;

&lt;p&gt;The DFA processes one byte at a time. But modern CPUs have SIMD instructions that can scan 16–32 bytes in a single operation. Go's &lt;code&gt;bytes.IndexByte&lt;/code&gt; uses these instructions internally (SSE2/AVX2 on x86, NEON on ARM).&lt;/p&gt;

&lt;p&gt;Before entering the DFA loop, we scan the haystack for any byte that could start a pattern match. If none of the pattern start bytes exist in the haystack, we return immediately — no DFA traversal at all.&lt;/p&gt;

&lt;p&gt;More importantly, we re-engage this prefilter &lt;em&gt;during&lt;/em&gt; the DFA loop. Whenever the automaton returns to its start state (meaning no pattern prefix is currently being tracked), we call &lt;code&gt;bytes.IndexByte&lt;/code&gt; to skip ahead to the next position where a match could begin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;startState&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;startBytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;skip&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;findEarliestStartByte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;haystack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;startBytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;skip&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;  &lt;span class="c"&gt;// no more start bytes → no match possible&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;skip&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same technique used by &lt;a href="https://github.com/BurntSushi/aho-corasick" rel="noopener noreferrer"&gt;BurntSushi's Rust aho-corasick&lt;/a&gt; crate, adapted for Go's runtime. When start bytes are rare in the input, most of the byte-at-a-time DFA stepping is replaced by a handful of SIMD scans; the search is still &lt;code&gt;O(n)&lt;/code&gt;, but with a much smaller constant factor.&lt;/p&gt;
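
&lt;p&gt;The snippet above calls &lt;code&gt;findEarliestStartByte&lt;/code&gt; without showing its body. One plausible way to write such a helper on top of &lt;code&gt;bytes.IndexByte&lt;/code&gt; is sketched below; this is a guess at the shape of the function, not the library's actual code:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
)

// findEarliestStartByte returns the offset of the first byte in haystack
// that equals any of startBytes, or -1 if none of them occurs.
func findEarliestStartByte(haystack, startBytes []byte) int {
	best := -1
	for _, sb := range startBytes {
		window := haystack
		if best >= 0 {
			// Only a hit before the current best can improve the result,
			// so later scans cover a shrinking prefix.
			window = haystack[:best]
		}
		if idx := bytes.IndexByte(window, sb); idx >= 0 {
			best = idx
		}
	}
	return best
}

func main() {
	h := []byte("....warn error")
	fmt.Println(findEarliestStartByte(h, []byte("ewfc"))) // 4
	fmt.Println(findEarliestStartByte(h, []byte("xyz")))  // -1
}
```

&lt;p&gt;This does one &lt;code&gt;bytes.IndexByte&lt;/code&gt; call per distinct start byte, so it pays off mainly when the patterns share only a few first bytes.&lt;/p&gt;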

&lt;h2&gt;
  
  
  Real-World Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Log Analysis Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Build once at startup&lt;/span&gt;
&lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"FATAL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"PANIC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"OOM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"connection refused"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"timeout exceeded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"permission denied"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"disk full"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ahocorasick&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;AddStrings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;// Process log lines (concurrent-safe)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;processLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// fast path: no keywords found&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PatternID&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Content Moderation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Load blocklist from database&lt;/span&gt;
&lt;span class="n"&gt;blocklist&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;loadBlocklist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// []string, potentially thousands of terms&lt;/span&gt;
&lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ahocorasick&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;AddStrings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blocklist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;moderate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// O(n), regardless of blocklist size&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Regex Engine Prefilter
&lt;/h3&gt;

&lt;p&gt;Inside &lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt;, when the regex engine encounters a pattern like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(?i)(error|warning|fatal|critical|info|debug|trace)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;it extracts the literal alternation, builds an Aho-Corasick automaton, and uses it as a prefilter. The automaton quickly identifies candidate positions in the text, and the full regex engine only runs at those positions. For large inputs with rare matches, this can be 10–100x faster than running the regex directly.&lt;/p&gt;
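
&lt;p&gt;The prefilter-then-verify flow can be imitated with the stdlib alone. In the sketch below a single literal scan via &lt;code&gt;bytes.Index&lt;/code&gt; stands in for the Aho-Corasick automaton, and an anchored stdlib regex stands in for the full engine; the pattern and the name &lt;code&gt;findWithPrefilter&lt;/code&gt; are made up for the example:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

var (
	// literal is the cheap prefilter: every match must start with it.
	literal = []byte("error")
	// verifier is the full pattern, anchored so it only checks a candidate.
	verifier = regexp.MustCompile(`^error: \d+`)
)

// findWithPrefilter returns the start offsets of all matches of
// `error: \d+`, running the regex only where the literal occurs.
func findWithPrefilter(haystack []byte) []int {
	var starts []int
	pos := 0
	for {
		idx := bytes.Index(haystack[pos:], literal)
		if idx == -1 {
			return starts
		}
		cand := pos + idx
		if loc := verifier.FindIndex(haystack[cand:]); loc != nil {
			starts = append(starts, cand)
			pos = cand + loc[1] // continue after the verified match
		} else {
			pos = cand + 1 // false positive: step past this candidate
		}
	}
}

func main() {
	h := []byte("error! ok error: 42 and error: 7")
	fmt.Println(findWithPrefilter(h)) // [10 24]
}
```

&lt;p&gt;The first candidate at offset 0 is a prefilter false positive that the verifier rejects, which is the normal division of labour: the prefilter may over-report, the regex has the final say.&lt;/p&gt;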

&lt;h2&gt;
  
  
  Part of the coregx Ecosystem
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ahocorasick&lt;/code&gt; is one component in the &lt;a href="https://github.com/coregx" rel="noopener noreferrer"&gt;coregx&lt;/a&gt; organization's text processing toolkit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;ahocorasick&lt;/a&gt;&lt;/strong&gt; — Multi-pattern string matching (this library)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt;&lt;/strong&gt; — High-performance regex engine that uses ahocorasick as its literal search backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The libraries are designed to work together but are independently useful. &lt;code&gt;ahocorasick&lt;/code&gt; has zero dependencies and can be used standalone in any Go project.&lt;/p&gt;

&lt;p&gt;If you're interested in the regex engine side of things, check out &lt;a href="https://dev.to/kolkov/gos-regexp-is-slow-so-i-built-my-own-3000x-faster-3i6h"&gt;Go's Regexp is Slow. So I Built My Own — up to 3000x Faster&lt;/a&gt;, where we describe coregex and how it uses this Aho-Corasick library under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/coregx/ahocorasick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/coregx/ahocorasick"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
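    &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ahocorasick&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;AddStrings&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"quick"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"brown"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fox"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// don't search with an automaton that failed to build&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;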

    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"the quick brown fox jumps over the lazy dog"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Zero-allocation existence check&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Has match:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c"&gt;// true&lt;/span&gt;

    &lt;span class="c"&gt;// Find all matches&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ac&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  %q at [%d:%d]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Has match: true
  "quick" at [4:9]
  "brown" at [10:15]
  "fox" at [16:19]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Teddy SIMD prefilter&lt;/strong&gt; — A packed SIMD algorithm (inspired by &lt;a href="https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-hyperscan.html" rel="noopener noreferrer"&gt;Hyperscan&lt;/a&gt;) for even faster candidate scanning with ≤64 patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contiguous NFA&lt;/strong&gt; — A memory-efficient alternative to the DFA for very large pattern sets (10,000+ patterns) where the full DFA transition table doesn't fit in L2 cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream search&lt;/strong&gt; — &lt;code&gt;io.Reader&lt;/code&gt; interface for searching in streaming data without loading the entire input into memory&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;github.com/coregx/ahocorasick&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go Reference: &lt;a href="https://pkg.go.dev/github.com/coregx/ahocorasick" rel="noopener noreferrer"&gt;pkg.go.dev/github.com/coregx/ahocorasick&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Part of: &lt;a href="https://github.com/coregx" rel="noopener noreferrer"&gt;coregx&lt;/a&gt; — High-performance libraries for Go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Licensed under MIT. Contributions welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>algorithms</category>
      <category>performance</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Go GUI in 2026: gogpu/ui v0.1.0 — 22 Widgets, GPU Rendering, Zero CGO</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:50:20 +0000</pubDate>
      <link>https://dev.to/kolkov/go-gui-in-2026-gogpuui-v010-22-widgets-gpu-rendering-zero-cgo-1enf</link>
      <guid>https://dev.to/kolkov/go-gui-in-2026-gogpuui-v010-22-widgets-gpu-rendering-zero-cgo-1enf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series: Building Go's GPU Ecosystem&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-a-pure-go-graphics-library-for-gpu-programming-2j5d"&gt;GoGPU: A Pure Go Graphics Library&lt;/a&gt; — Project announcement&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-from-idea-to-100k-lines-in-two-weeks-building-gos-gpu-ecosystem-3b2"&gt;From Idea to 100K Lines in Two Weeks&lt;/a&gt; — The journey&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/pure-go-2d-graphics-library-with-gpu-acceleration-introducing-gogpugg-538h"&gt;Pure Go 2D Graphics with GPU Acceleration&lt;/a&gt; — Introducing gg&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gpu-compute-shaders-in-pure-go-gogpugg-v0150-1cjk"&gt;GPU Compute Shaders in Pure Go&lt;/a&gt; — Compute pipelines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/go-126-meets-2026-with-a-professional-graphics-ecosystem-9g8"&gt;Go 1.26 Meets 2026&lt;/a&gt; — Ecosystem overview&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpugg-enterprise-2d-graphics-library-in-pure-go-1931"&gt;Enterprise 2D Graphics Library&lt;/a&gt; — gg architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-enterprise-architecture-cross-package-gpu-integration-with-gpucontext-332"&gt;Cross-Package GPU Integration&lt;/a&gt; — gpucontext&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-unified-2d3d-graphics-integration-in-pure-go-gg3"&gt;Unified 2D/3D Graphics Integration&lt;/a&gt; — gg + gogpu&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.tolink"&gt;Core Complete, Focus on GUI&lt;/a&gt; — Phase 2 announcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gogpu/ui v0.1.0 — First Release&lt;/strong&gt; ← You are here&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  From "Focus on GUI" to First Release
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://dev.tolink"&gt;last article&lt;/a&gt;, we announced that the core gogpu ecosystem was complete and all our energy was going into building a GUI toolkit. At that point gogpu/ui was at Phase 2 Beta with 55K lines of code, 6 widgets, and one design system.&lt;/p&gt;

&lt;p&gt;Today, roughly three weeks later, we are tagging &lt;strong&gt;v0.1.0&lt;/strong&gt;, the first public release. The toolkit grew from 55K to roughly &lt;strong&gt;150K lines&lt;/strong&gt;, from 6 to &lt;strong&gt;22 interactive widgets&lt;/strong&gt;, and from one to &lt;strong&gt;three design systems&lt;/strong&gt;. The entire ecosystem, from shader compiler to widget toolkit, went through a coordinated release to bring everything onto stable, tagged versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a preview release.&lt;/strong&gt; The API will change. Performance has not been optimized. There are rough edges. We are releasing now because we want community feedback to shape the API before it solidifies. The work is just beginning, but we believe there is enough here to start a conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is gogpu/ui?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;gogpu/ui&lt;/a&gt; is a GUI toolkit library for Go. You build widget trees, bind reactive state, and the toolkit handles layout, event dispatch, and rendering through &lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt; (a 2D graphics engine) onto a GPU surface provided by &lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt; (an application framework).&lt;/p&gt;

&lt;p&gt;We studied 20+ GUI frameworks across multiple ecosystems — &lt;strong&gt;Flutter&lt;/strong&gt; (Dart), &lt;strong&gt;Qt&lt;/strong&gt; (C++), &lt;strong&gt;SwiftUI&lt;/strong&gt; (Swift), &lt;strong&gt;Xilem/Floem/GPUI/Iced/Slint&lt;/strong&gt; (Rust), &lt;strong&gt;Angular/SolidJS/React&lt;/strong&gt; (Web), &lt;strong&gt;Fyne/Gio&lt;/strong&gt; (Go) — and brought their proven patterns into Go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable Painter pattern&lt;/strong&gt; (Flutter's separation of widget behavior from rendering)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive signals&lt;/strong&gt; (Angular Signals architecture, also inspired by SolidJS and Rust's Leptos)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional options&lt;/strong&gt; for backward-compatible API evolution (Go-native)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;W3C Pointer Events&lt;/strong&gt; for input handling (browser standard)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSS font-weight matching&lt;/strong&gt; algorithm (W3C spec) for typography&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screen-space coordinate transforms&lt;/strong&gt; (Flutter's &lt;code&gt;localToGlobal&lt;/code&gt;, Qt's &lt;code&gt;mapToGlobal&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
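
&lt;p&gt;Of the patterns above, functional options are the easiest to show in a few lines. The sketch below uses a hypothetical &lt;code&gt;Button&lt;/code&gt; type and option names invented for illustration; it is not the gogpu/ui API. It shows why the pattern keeps APIs backward compatible: a new option is a new function, and the constructor signature never changes.&lt;/p&gt;

```go
package main

import "fmt"

// Button is a hypothetical widget used only for this sketch.
type Button struct {
	text    string
	onClick func()
	enabled bool
}

// Option configures a Button during construction.
type Option func(*Button)

// Text sets the label.
func Text(s string) Option { return func(b *Button) { b.text = s } }

// OnClick registers a click handler.
func OnClick(fn func()) Option { return func(b *Button) { b.onClick = fn } }

// Disabled turns the default enabled state off.
func Disabled() Option { return func(b *Button) { b.enabled = false } }

// New applies defaults first, then each option in the order given.
func New(opts ...Option) *Button {
	b := new(Button)
	b.enabled = true // default unless Disabled() is passed
	for _, opt := range opts {
		opt(b)
	}
	return b
}

func main() {
	clicks := 0
	b := New(Text("Save"), OnClick(func() { clicks++ }))
	b.onClick()
	fmt.Println(b.text, b.enabled, clicks) // Save true 1
}
```

&lt;p&gt;Callers that pass no options keep working when more options are added later, which is the property the list item refers to.&lt;/p&gt;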

&lt;p&gt;The entire stack — from shader compilation to window management to 2D rendering — is &lt;strong&gt;580,000+ lines of pure Go&lt;/strong&gt; with zero CGO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By the numbers (v0.1.0):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Interactive widgets&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design system themes&lt;/td&gt;
&lt;td&gt;3 (Material 3, Fluent, Cupertino)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go source files&lt;/td&gt;
&lt;td&gt;350+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code&lt;/td&gt;
&lt;td&gt;~146,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test functions&lt;/td&gt;
&lt;td&gt;3,900+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average test coverage&lt;/td&gt;
&lt;td&gt;97%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CGO required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Quick Look: A Minimal Application
&lt;/h2&gt;

&lt;p&gt;Here is a complete runnable application with a Material 3 button:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gogpu/gg"&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"github.com/gogpu/gg/gpu"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/gg/integration/ggcanvas"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/gogpu"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/app"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/core/button"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/primitives"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/render"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/theme/material3"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gogpu/ui/widget"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;gogpuApp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithTitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"My First App"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithContinuousRender&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c"&gt;// event-driven: 0% CPU when idle&lt;/span&gt;

    &lt;span class="n"&gt;m3&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;material3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;widget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0x6750A4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c"&gt;// seed color&lt;/span&gt;

    &lt;span class="n"&gt;uiApp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithWindowProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPlatformProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventSource&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uiApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRoot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello, gogpu/ui!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FontSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bold&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Click Me"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Clicked!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PainterOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;material3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ButtonPainter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Theme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m3&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;canvas&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ggcanvas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canvas&lt;/span&gt;
    &lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnDraw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Width&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Height&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;canvas&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
            &lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ggcanvas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPUContextProvider&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;uiApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Frame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;sv&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SurfaceView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;sw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SurfaceSize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetAcceleratorSurfaceTarget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Draw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRGBA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DrawRectangle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;cc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fill&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;uiApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DrawTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;render&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCanvas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;canvas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RenderDirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CloseAccelerator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes, the boilerplate for wiring up the draw callback is verbose, and that is an area we plan to improve. The tradeoff is full control over the render pipeline: you can mix gg 2D drawing with UI widgets, render to textures, or integrate with your own rendering code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Widget Set
&lt;/h2&gt;

&lt;p&gt;v0.1.0 ships 22 interactive widgets and display primitives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Form controls:&lt;/strong&gt; Button (4 variants, 3 sizes), Checkbox (tri-state), Radio group, TextField (with selection, clipboard, validation), Dropdown, Slider (continuous/discrete)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containers:&lt;/strong&gt; ScrollView (wheel, keyboard, scrollbar drag), TabView (lazy content, closeable tabs), Dialog (modal, focus trapping), Collapsible sections, SplitView (resizable panels)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data display:&lt;/strong&gt; ListView (virtualized, 1000+ items), GridView (virtualized 2D grid), DataTable (sortable columns, fixed header), TreeView (hierarchical, expand/collapse), LineChart (real-time, multiple series), ProgressBar, Circular Progress&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application chrome:&lt;/strong&gt; Toolbar, Menu system (MenuBar + ContextMenu with submenus), Docking system (IDE-style panels)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primitives:&lt;/strong&gt; Box (VBox/HBox), Text (reactive via signals), Image, ThemeScope, RepaintBoundary (pixel caching)&lt;/p&gt;

&lt;p&gt;Every interactive widget follows the same patterns: functional options for construction, a Painter interface for design-system independence, signal bindings for reactive state, and accessibility metadata (ARIA roles).&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Design Systems
&lt;/h2&gt;

&lt;p&gt;Widgets do not know how they look. Rendering is delegated to Painter implementations. v0.1.0 ships three:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Material Design 3&lt;/strong&gt;: HCT color science, tonal palettes generated from a single seed color, and 21 component painters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fluent Design&lt;/strong&gt;: Microsoft's design language with accent colors, inner focus rings, and 9 painters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cupertino&lt;/strong&gt;: Apple HIG with iOS-style toggle switches, segmented controls, pill buttons, and 9 painters.&lt;/p&gt;

&lt;p&gt;Switching is one line:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Material 3&lt;/span&gt;
&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Save"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PainterOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;material3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ButtonPainter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Theme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m3&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Fluent Design&lt;/span&gt;
&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Save"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PainterOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fluent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ButtonPainter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Theme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fl&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Cupertino&lt;/span&gt;
&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Save"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PainterOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cupertino&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ButtonPainter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Theme&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cu&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gallery example demonstrates live M3 seed color switching at runtime (Purple, Blue, Green, Orange). Fluent and Cupertino painters are implemented and tested but not yet wired into the gallery demo — switching between design systems at runtime is architecturally supported (just swap the Painter), but a full gallery with all three is a v0.2.0 goal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reactive Signals
&lt;/h2&gt;

&lt;p&gt;State management uses a signals pattern built on our own &lt;a href="https://github.com/coregx/signals" rel="noopener noreferrer"&gt;coregx/signals&lt;/a&gt; library. The architecture is directly inspired by &lt;strong&gt;Angular Signals&lt;/strong&gt; (fine-grained reactivity with &lt;code&gt;Signal[T]&lt;/code&gt;, &lt;code&gt;Computed[T]&lt;/code&gt;, &lt;code&gt;Effect&lt;/code&gt;), with influence from SolidJS and Rust's Leptos. We studied Angular's signal lifecycle, dependency tracking, and lazy evaluation — then reimplemented it idiomatically in Go with generics. A &lt;code&gt;Signal[T]&lt;/code&gt; holds a value. A &lt;code&gt;Computed[T]&lt;/code&gt; derives from other signals and recomputes lazily. When a signal changes, bound widgets automatically invalidate and redraw.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Computed label updates automatically&lt;/span&gt;
&lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Clicked %d times"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Increment"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Widgets support two-way signal bindings. The slider reads from and writes to the same signal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSignal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;slider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ValueSignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c"&gt;// two-way binding&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;// volume.Get() always reflects the current slider position&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Signal lifecycle is automatic: widgets subscribe on mount and clean up on unmount. No manual disposal needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Highlights
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pluggable Painter pattern.&lt;/strong&gt; Widget behavior (in &lt;code&gt;core/&lt;/code&gt;) is separated from rendering (in &lt;code&gt;theme/&lt;/code&gt;). A Button knows about click handling, focus states, and size variants. A ButtonPainter knows how to draw rounded corners and color fills. This avoids import cycles and lets third parties create entirely new design systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional options.&lt;/strong&gt; All widget constructors use the options pattern for backward-compatible API evolution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Submit"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VariantOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Filled&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SizeOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Large&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handleSubmit&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding new options in future versions will not break existing code.&lt;/p&gt;
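
&lt;p&gt;For readers new to the pattern, the mechanics behind that guarantee can be sketched as follows. This is a simplified illustration rather than the gogpu/ui source: the &lt;code&gt;Button&lt;/code&gt; fields, the string-typed size, and the default value are assumptions.&lt;/p&gt;

```go
package main

import "fmt"

// Button is a hypothetical widget whose defaults can be overridden by options.
type Button struct {
	text    string
	size    string
	onClick func()
}

// Option mutates a Button during construction.
type Option func(*Button)

func TextOpt(s string) Option   { return func(b *Button) { b.text = s } }
func SizeOpt(s string) Option   { return func(b *Button) { b.size = s } }
func OnClick(fn func()) Option  { return func(b *Button) { b.onClick = fn } }

// New applies defaults first, then each option in the order given.
func New(opts ...Option) *Button {
	b := new(Button)
	b.size = "medium" // default used when no SizeOpt is passed
	for _, opt := range opts {
		opt(b)
	}
	return b
}

func main() {
	b := New(TextOpt("Submit"))
	fmt.Println(b.text, b.size)

	b = New(TextOpt("Submit"), SizeOpt("large"))
	fmt.Println(b.text, b.size)
}
```

&lt;p&gt;Because &lt;code&gt;New&lt;/code&gt; is variadic, a later release can add another option constructor without changing the signature that existing callers compile against.&lt;/p&gt;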

&lt;p&gt;&lt;strong&gt;Content[C] polymorphic pattern.&lt;/strong&gt; Inspired by Taiga UI's Polymorpheus pattern (Angular), complex widgets like ListView, GridView, and TabView use a generic &lt;code&gt;Content[C]&lt;/code&gt; interface from the CDK (Component Development Kit) package. This enables type-safe polymorphic content rendering — the widget provides the context (item index, selection state, hover), the user provides the builder function. Internally it powers cell recycling and virtualization, while externally users see a simple &lt;code&gt;BuildItem(func(index int) widget.Widget)&lt;/code&gt; callback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retained-mode rendering.&lt;/strong&gt; The widget tree tracks dirty state. Only changed widgets are redrawn. RepaintBoundary widgets cache their subtree as pixel buffers. Large boundaries (128×128+) automatically use tile-parallel rendering via &lt;code&gt;scene.Scene&lt;/code&gt; with goroutine work-stealing.&lt;/p&gt;
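
&lt;p&gt;The dirty-state idea can be shown with a toy tree. The sketch below is an assumption-laden illustration, not the gogpu/ui implementation: &lt;code&gt;Node&lt;/code&gt;, &lt;code&gt;Invalidate&lt;/code&gt;, and &lt;code&gt;Paint&lt;/code&gt; are hypothetical names, "painting" just records the node name, and there are no cached pixel buffers or tiles.&lt;/p&gt;

```go
package main

import "fmt"

// Node is a hypothetical retained widget node with a dirty flag.
type Node struct {
	name     string
	dirty    bool
	children []*Node
}

func NewNode(name string, children ...*Node) *Node {
	n := new(Node)
	n.name = name
	n.dirty = true // every node needs one initial paint
	n.children = children
	return n
}

// Invalidate marks only this node as needing a repaint.
func (n *Node) Invalidate() { n.dirty = true }

// Paint repaints dirty nodes, clears their flags, and returns the names painted.
func (n *Node) Paint() []string {
	var painted []string
	if n.dirty {
		painted = append(painted, n.name)
		n.dirty = false
	}
	for _, c := range n.children {
		painted = append(painted, c.Paint()...)
	}
	return painted
}

func main() {
	spinner := NewNode("spinner")
	root := NewNode("root", NewNode("header"), spinner)

	fmt.Println(root.Paint()) // first frame paints everything
	fmt.Println(root.Paint()) // nothing changed, nothing painted
	spinner.Invalidate()
	fmt.Println(root.Paint()) // only the spinner is repainted
}
```

&lt;p&gt;This toy version still visits every node on each frame. A production tree would typically also record whether a subtree contains any dirty descendant so that clean subtrees can be skipped entirely.&lt;/p&gt;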

&lt;p&gt;&lt;strong&gt;Event-driven rendering.&lt;/strong&gt; The default mode is 0% CPU when idle. The window only redraws when events arrive or signals change. No continuous render loop burning battery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility from day one.&lt;/strong&gt; Every widget implements the &lt;code&gt;a11y.Accessible&lt;/code&gt; interface with ARIA roles, labels, and actions. The accessibility tree uses stable uint64 node IDs. Platform adapters (UIA, AT-SPI2, NSAccessibility) are not implemented yet -- that is one of the biggest remaining gaps.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Works Well
&lt;/h2&gt;

&lt;p&gt;After months of development, these areas feel solid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TextField&lt;/strong&gt; handles Unicode input correctly (Latin, Cyrillic, CJK), with cursor movement, text selection, clipboard and select-all shortcuts (Ctrl+C/V/X/A), and input validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tab focus navigation&lt;/strong&gt; works across the widget tree with Tab/Shift+Tab. Focus rings render correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hover cursors&lt;/strong&gt; change appropriately (pointer for buttons, text beam for text fields, resize handles for SplitView).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ScrollView&lt;/strong&gt; dispatches events with correct coordinate transforms to child widgets, even when scrolled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live theme color switching&lt;/strong&gt; in the gallery demo (M3 seed color changes all widget colors instantly).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtualized lists&lt;/strong&gt; handle thousands of items with only visible rows rendered and recycled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signal bindings&lt;/strong&gt; propagate state changes through the widget tree without manual wiring.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Needs Work (Honest Assessment)
&lt;/h2&gt;

&lt;p&gt;This is a v0.1.0 preview. Here is what we know needs improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance.&lt;/strong&gt; Complex widget trees (the full gallery) can feel sluggish. We have not done an optimization pass yet. Retained-mode rendering infrastructure is in place but not fully leveraged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility adapters.&lt;/strong&gt; The a11y tree and ARIA roles are defined, but there are no platform adapters connecting to screen readers. This is a significant gap for production use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HiDPI.&lt;/strong&gt; DPI scaling is wired through the stack but has not been thoroughly tested on all platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation.&lt;/strong&gt; API docs exist but guides and tutorials are sparse. The architecture document is 1200+ lines but aimed at contributors, not users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Draw callback boilerplate.&lt;/strong&gt; The current setup code for wiring gogpu + gg + ui is too verbose. We plan to provide higher-level convenience functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Some event edge cases.&lt;/strong&gt; Drag cursors, scroll momentum, and overlay dismiss behavior have known rough spots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform testing.&lt;/strong&gt; CI runs on Ubuntu, macOS, and Windows, but real-world testing has been primarily on Windows.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and run the gallery example&lt;/span&gt;
git clone https://github.com/gogpu/ui.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ui/examples/gallery
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add it to an existing project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/gogpu/ui@v0.1.0
go get github.com/gogpu/gogpu@latest
go get github.com/gogpu/gg@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;examples/&lt;/code&gt; directory has four working applications:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;What it shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;examples/hello/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Checkboxes, radio buttons, ListView with 1000 items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;examples/signals/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reactive state management patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;examples/taskmanager/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Real-time charts, progress bars, simulated system metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;examples/gallery/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All 22 widgets, M3 theme with live seed color switching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Requirements: Go 1.25+, a GPU (Vulkan, Metal, DX12, or OpenGL ES -- software fallback available). No C compiler needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The gogpu Ecosystem
&lt;/h2&gt;

&lt;p&gt;gogpu/ui sits at the top of a pure Go graphics stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;naga       -- shader compiler (WGSL to SPIR-V, MSL, GLSL)
wgpu       -- WebGPU Hardware Abstraction Layer (Vulkan, Metal, DX12, GLES, software)
gogpu      -- application framework (windowing, input, GPU context)
gg         -- 2D graphics engine (Canvas API, GPU text rendering, SDF acceleration)
ui         -- GUI toolkit (this project)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire stack is ~300K lines of pure Go. No CGO, no Rust FFI, no C bindings. If Go can build for a platform, the full stack runs there.&lt;/p&gt;

&lt;p&gt;This release coincides with a coordinated cascade release across the entire ecosystem. All libraries were updated to work through wgpu's new public Core API (previously HAL-only):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;What Changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naga&lt;/td&gt;
&lt;td&gt;v0.14.7&lt;/td&gt;
&lt;td&gt;Shader compiler stability fixes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wgpu&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.21.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New public API: Instance, Adapter, Device, Queue, Surface, CommandEncoder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpucontext&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.10.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Updated interfaces for new wgpu API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gogpu&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.24.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Platform refactor, CharCallback for Unicode text input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gg&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.37.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migrated internal/gpu from HAL to wgpu public API, GPU RRect SDF clip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ui&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;v0.1.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;This release&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This was a significant engineering effort — wgpu moved from internal HAL types to a proper public API with validation, state tracking, and resource lifecycle management. Every layer above had to adapt.&lt;/p&gt;




&lt;h2&gt;
  
  
  We Want Your Feedback
&lt;/h2&gt;

&lt;p&gt;This release exists to start a conversation. We would genuinely appreciate input on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What widgets are missing&lt;/strong&gt; for your use case?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API ergonomics&lt;/strong&gt; -- do functional options feel right for Go? Is the signal system intuitive?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance expectations&lt;/strong&gt; -- what is acceptable for the apps you want to build?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design system coverage&lt;/strong&gt; -- are M3/Fluent/Cupertino the right targets? Missing painters?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; -- what would help you get started?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Big Architecture Question: Monolith or Modular?
&lt;/h3&gt;

&lt;p&gt;There is one question we have been debating internally and would especially value community input on:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should gogpu/ui stay as a single module, or should we extract shared primitives into a separate foundation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Right now, &lt;code&gt;go get github.com/gogpu/ui&lt;/code&gt; pulls the entire toolkit — 56 packages, all widgets, all three design systems, all dependencies. If you only need &lt;code&gt;geometry.Rect&lt;/code&gt; or the layout engine or the signal system, you still get everything.&lt;/p&gt;

&lt;p&gt;We are considering extracting foundational packages into a separate &lt;code&gt;gogpu/uikit&lt;/code&gt; (or similar) module:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Currently in &lt;code&gt;ui/&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Could live in &lt;code&gt;uikit/&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Geometry types&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/geometry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/geometry&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layout engines&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/layout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/layout&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event types&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/event&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/event&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Widget base&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/widget&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/widget&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Signal bindings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/state&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/state&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Theme interfaces&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ui/theme&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uikit/theme&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This would let the community:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build custom widget libraries&lt;/strong&gt; on top of shared primitives without importing all of gogpu/ui&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create their own design systems&lt;/strong&gt; (themes + painters) by depending only on the interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extend our themes&lt;/strong&gt; with custom component tokens without forking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use our layout engine&lt;/strong&gt; (Flex, Grid, Stack) independently in their own UI frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff: another module to version and maintain, another API boundary to keep stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What do you think?&lt;/strong&gt; Is a modular foundation important for your use case, or is a single &lt;code&gt;go get&lt;/code&gt; simpler? We are genuinely unsure — the Go ecosystem has examples of both approaches working well. Your input will directly shape the v0.2.0 architecture.&lt;/p&gt;

&lt;p&gt;File issues on &lt;a href="https://github.com/gogpu/ui/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, start a thread in &lt;a href="https://github.com/orgs/gogpu/discussions/18" rel="noopener noreferrer"&gt;Discussions&lt;/a&gt;, or just leave a comment here.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;github.com/gogpu/ui&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release:&lt;/strong&gt; &lt;a href="https://github.com/gogpu/ui/releases/tag/v0.1.0" rel="noopener noreferrer"&gt;v0.1.0&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting Started:&lt;/strong&gt; &lt;a href="https://github.com/gogpu/ui/blob/main/docs/GETTING_STARTED.md" rel="noopener noreferrer"&gt;docs/GETTING_STARTED.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; &lt;a href="https://github.com/gogpu/ui/blob/main/docs/ARCHITECTURE.md" rel="noopener noreferrer"&gt;docs/ARCHITECTURE.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gogpu ecosystem:&lt;/strong&gt; &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;github.com/gogpu&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for reading. We look forward to hearing what you think.&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>goffi: Zero-CGO Foreign Function Interface for Go — How We Call C Libraries Without a C Compiler</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Mon, 02 Mar 2026 09:10:35 +0000</pubDate>
      <link>https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interface-for-go-how-we-call-c-libraries-without-a-c-compiler-ca5</link>
      <guid>https://dev.to/kolkov/goffi-zero-cgo-foreign-function-interface-for-go-how-we-call-c-libraries-without-a-c-compiler-ca5</guid>
      <description>&lt;p&gt;Every Go developer who has worked with C libraries knows the pain: CGO requires a C compiler, breaks cross-compilation, bloats binaries, and adds ~200ns overhead per call. For our &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;WebGPU bindings&lt;/a&gt; and &lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;ML framework&lt;/a&gt;, calling wgpu-native through CGO was a non-starter — we needed to ship a single static binary across Windows, Linux, and macOS without requiring users to install gcc.&lt;/p&gt;

&lt;p&gt;So we built &lt;strong&gt;&lt;a href="https://github.com/go-webgpu/goffi" rel="noopener noreferrer"&gt;goffi&lt;/a&gt;&lt;/strong&gt; — a pure Go FFI library that calls C functions through hand-written assembly, with zero C dependencies and zero per-call allocations. It now powers an entire ecosystem: &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu/webgpu&lt;/a&gt; bindings, &lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;born-ml/born&lt;/a&gt; ML framework, and the &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt; GPU computing platform with dual Rust and pure Go backends.&lt;/p&gt;

&lt;p&gt;This article explains the architecture, the hard problems we solved, how goffi compares to purego, and how you can use it in your own projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Our stack looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Go application (gogpu)
  └─ wgpu bindings (gogpu/wgpu)     ← needs to call C functions
       └─ goffi                      ← this library
            └─ wgpu-native (.dll/.so/.dylib)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to call hundreds of WebGPU functions from Go: create GPU devices, submit command buffers, handle async callbacks from Metal/Vulkan threads. The requirements were clear:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No C compiler at build time&lt;/strong&gt; — users run &lt;code&gt;go get&lt;/code&gt; and it works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-compilation&lt;/strong&gt; — &lt;code&gt;GOOS=linux GOARCH=arm64 go build&lt;/code&gt; must work from Windows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callbacks from C threads&lt;/strong&gt; — wgpu-native fires callbacks from internal GPU threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Struct passing&lt;/strong&gt; — WebGPU API passes structs by value (descriptors, extents, colors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low overhead&lt;/strong&gt; — GPU command encoding happens at 60+ FPS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CGO fails requirements 1 and 2. purego covers 1-2 but had gaps in 3-5 when we started. So we built goffi.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: 4 Layers Deep
&lt;/h2&gt;

&lt;p&gt;Every goffi call traverses four layers to safely transition from Go's managed runtime to raw C code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Go Code
  │  ffi.CallFunction()
  ▼
runtime.cgocall          ← Go runtime: switch to system stack, tell GC
  │
  ▼
Assembly Wrapper         ← Our code: load registers per ABI
  │  RDI=arg0 RSI=arg1 ... XMM0=float0 ...
  ▼
C Function               ← Target library
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1: The Call Interface (CIF)
&lt;/h3&gt;

&lt;p&gt;Unlike purego's &lt;code&gt;RegisterFunc&lt;/code&gt;, which wraps the target with &lt;code&gt;reflect.MakeFunc&lt;/code&gt; and so goes through reflection on every call, goffi pre-computes everything once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Prepare once at init time&lt;/span&gt;
&lt;span class="n"&gt;cif&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallInterface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrepareCallInterface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultCall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UInt64TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c"&gt;// return: size_t&lt;/span&gt;
    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PointerTypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c"&gt;// arg: const char*&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Call many times — zero reflection, zero allocation&lt;/span&gt;
&lt;span class="c"&gt;// args = []unsafe.Pointer{unsafe.Pointer(&amp;amp;myPtr)} — pointers TO arg values&lt;/span&gt;
&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlenPtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;PrepareCallInterface&lt;/code&gt; classifies each argument (integer register? SSE register? stack?), computes stack layout, and stores everything in a reusable &lt;code&gt;CallInterface&lt;/code&gt; struct. The &lt;code&gt;cif.Flags&lt;/code&gt; bitmask tells the assembly exactly what to do — no decisions at call time.&lt;/p&gt;
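
&lt;p&gt;As a rough picture of that classification step, the sketch below assigns scalar arguments to locations using the System V AMD64 limits (six integer registers, eight SSE registers, the rest on the stack). It is not goffi's code: &lt;code&gt;Kind&lt;/code&gt;, &lt;code&gt;Slot&lt;/code&gt;, and &lt;code&gt;Classify&lt;/code&gt; are hypothetical names, and structs, varargs, and alignment are deliberately ignored.&lt;/p&gt;

```go
package main

import "fmt"

// Kind is a hypothetical, simplified argument class.
type Kind int

const (
	Integer Kind = iota // integers and pointers
	Float               // float32 and float64
)

// Slot records where one argument lives at call time.
type Slot struct {
	Where string // "gp", "sse", or "stack"
	Index int    // register number within its class, or stack slot number
}

// Classify assigns scalar arguments following the System V AMD64 limits:
// up to 6 integer registers, up to 8 SSE registers, everything else on the stack.
func Classify(args []Kind) []Slot {
	limits := [2]int{6, 8} // indexed by Kind: RDI..R9, XMM0..XMM7
	names := [2]string{"gp", "sse"}
	var used [2]int
	stack := 0

	slots := make([]Slot, 0, len(args))
	for _, k := range args {
		if used[k] == limits[k] {
			// Registers of this class are exhausted: spill to the stack.
			slots = append(slots, Slot{Where: "stack", Index: stack})
			stack++
			continue
		}
		slots = append(slots, Slot{Where: names[k], Index: used[k]})
		used[k]++
	}
	return slots
}

func main() {
	// Seven integer arguments and one float: one integer must spill.
	args := []Kind{Integer, Float, Integer, Integer, Integer, Integer, Integer, Integer}
	for i, s := range Classify(args) {
		fmt.Println(i, s.Where, s.Index)
	}
}
```

&lt;p&gt;Running this once at preparation time and storing the resulting slots is what allows the call path itself to stay free of per-argument decisions. In the example, the last argument is the seventh integer, so it lands in the first stack slot.&lt;/p&gt;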

&lt;h3&gt;
  
  
  Layer 2: Platform Assembly
&lt;/h3&gt;

&lt;p&gt;We write assembly by hand for each ABI. Here's the core of our System V AMD64 implementation (&lt;code&gt;syscall_unix_amd64.s&lt;/code&gt;). The function receives a pointer to a &lt;code&gt;syscallArgs&lt;/code&gt; struct in DI, loads registers from it, calls the target, and writes return values back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// syscall_unix_amd64.s — System V AMD64 ABI
TEXT syscallN(SB), NOSPLIT|NOFRAME, $0
    PUSHQ BP
    MOVQ  SP, BP
    SUBQ  $STACK_SIZE, SP
    MOVQ  DI, R11             // R11 = args struct pointer

    // Load 8 SSE registers from struct offsets 128-184
    MOVQ 128(R11), X0         // XMM0
    MOVQ 136(R11), X1         // XMM1
    // ... XMM2-XMM7

    // Push stack-spill args (a7-a15) from struct offsets 56-120
    MOVQ 56(R11), R12
    MOVQ R12, 0(SP)           // stack slot 0

    // Load 6 GP registers from struct offsets 8-48
    MOVQ 8(R11), DI           // RDI = arg 1
    MOVQ 16(R11), SI          // RSI = arg 2
    MOVQ 24(R11), DX          // RDX = arg 3
    MOVQ 32(R11), CX          // RCX = arg 4
    MOVQ 40(R11), R8          // R8  = arg 5
    MOVQ 48(R11), R9          // R9  = arg 6

    MOVQ 0(R11), R10          // function pointer
    CALL R10

    // Save returns: RAX → r1, RDX → r2, XMM0 → f1
    MOVQ PTR_ADDRESS(BP), DI
    MOVQ AX, 192(DI)          // integer return
    MOVQ DX, 200(DI)          // second return (9-16B structs)
    MOVQ X0, 128(DI)          // float return
    // ... restore stack, RET
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We maintain separate assembly for three ABIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ABI&lt;/th&gt;
&lt;th&gt;GP Registers&lt;/th&gt;
&lt;th&gt;FP/SIMD Registers&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;System V AMD64&lt;/strong&gt; (Linux/macOS)&lt;/td&gt;
&lt;td&gt;RDI, RSI, RDX, RCX, R8, R9&lt;/td&gt;
&lt;td&gt;XMM0-XMM7&lt;/td&gt;
&lt;td&gt;16-byte aligned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Win64&lt;/strong&gt; (Windows)&lt;/td&gt;
&lt;td&gt;RCX, RDX, R8, R9&lt;/td&gt;
&lt;td&gt;XMM0-XMM3&lt;/td&gt;
&lt;td&gt;32-byte shadow space&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;AAPCS64&lt;/strong&gt; (ARM64)&lt;/td&gt;
&lt;td&gt;X0-X7&lt;/td&gt;
&lt;td&gt;D0-D7&lt;/td&gt;
&lt;td&gt;16-byte aligned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
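
&lt;p&gt;The stack column matters as soon as arguments spill past the registers. The sketch below is a simplified model of how much stack a caller would reserve under the two AMD64 conventions; the &lt;code&gt;stackBytes&lt;/code&gt; helper and its string ABI names are assumptions made for illustration and do not come from goffi.&lt;/p&gt;

```go
package main

import "fmt"

// stackBytes estimates how much stack a caller reserves before a call.
// abi is "sysv" or "win64"; spilled is the number of 8-byte arguments that
// did not fit in registers. A simplified model for illustration only.
func stackBytes(abi string, spilled int) int {
	bytes := spilled * 8
	if abi == "win64" {
		bytes += 32 // shadow space for the four register arguments
	}
	return (bytes + 15) / 16 * 16 // keep the stack 16-byte aligned
}

func main() {
	fmt.Println(stackBytes("sysv", 3))  // 24 bytes of arguments, rounded up to 32
	fmt.Println(stackBytes("win64", 1)) // 8 + 32 = 40, rounded up to 48
}
```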

&lt;h3&gt;
  
  
  Layer 3: Struct Returns
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. When a C function returns a struct by value, the ABI decides where the bytes land. On System V AMD64, for structs made of integer fields, the rule depends on size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;= 8 bytes&lt;/strong&gt;: returned in RAX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9-16 bytes&lt;/strong&gt;: split across RAX (low 8) + RDX (high 8)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt; 16 bytes&lt;/strong&gt;: caller passes a hidden pointer as the first argument (sret)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our &lt;code&gt;handleReturn&lt;/code&gt; function assembles the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReturnType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Size&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;rvalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retVal&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;rvalue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retVal&lt;/span&gt;          &lt;span class="c"&gt;// RAX → bytes 0-7&lt;/span&gt;
        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
        &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;retVal2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rvalue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nb"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// RDX → bytes 8-15&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 4: Callbacks (C → Go)
&lt;/h3&gt;

&lt;p&gt;WebGPU fires async callbacks from internal threads — Metal threads, Vulkan threads, threads goffi never created. These threads have no goroutine (G = nil), so calling Go code directly would crash.&lt;/p&gt;

&lt;p&gt;We solve this with Go's &lt;code&gt;crosscall2&lt;/code&gt; mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C thread (wgpu-native internal)
  │  calls our trampoline (1 of 2000 pre-compiled entries)
  ▼
Assembly dispatcher
  │  saves registers, loads callback index
  ▼
crosscall2 → runtime.load_g → runtime.cgocallback
  │  sets up goroutine, switches to Go stack
  ▼
Your Go callback function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On AMD64, each trampoline is a 5-byte &lt;code&gt;CALL&lt;/code&gt; instruction. On ARM64, each entry is 8 bytes — two 4-byte instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ARM64 (callback_arm64.s) — 8 bytes per entry
MOVD $0, R12              // load callback index
B    ·callbackDispatcher  // branch (no link — preserves LR)
MOVD $1, R12
B    ·callbackDispatcher
// ... 2000 entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
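
&lt;p&gt;Fixed-size entries are what make the table work: entry &lt;em&gt;i&lt;/em&gt; sits at a predictable address and always hands index &lt;em&gt;i&lt;/em&gt; to the dispatcher. The Go side can then be a plain slot table. The sketch below shows that idea under stated assumptions: the &lt;code&gt;registry&lt;/code&gt; type, its methods, and &lt;code&gt;entryAddress&lt;/code&gt; are hypothetical names, and goffi's real bookkeeping may differ.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

const maxCallbacks = 2000 // matches the number of pre-compiled trampolines

// registry is a hypothetical fixed-size callback table: slot i belongs to
// trampoline i, so the dispatcher only needs the index to find the Go function.
type registry struct {
	slots [maxCallbacks]func(arg uintptr) uintptr
	next  int
}

// register stores fn in the next free slot and returns its index.
func (r *registry) register(fn func(arg uintptr) uintptr) (int, error) {
	if r.next == maxCallbacks {
		return 0, errors.New("callback table full")
	}
	r.slots[r.next] = fn
	r.next++
	return r.next - 1, nil
}

// dispatch models what happens after the assembly dispatcher has loaded the
// callback index: look up the slot and invoke the Go function.
func (r *registry) dispatch(index int, arg uintptr) uintptr {
	return r.slots[index](arg)
}

// entryAddress returns where a trampoline would live in a table that starts
// at base and uses fixed-size entries (8 bytes each on ARM64).
func entryAddress(base uintptr, index int, entrySize uintptr) uintptr {
	return base + uintptr(index)*entrySize
}

func main() {
	r := new(registry)
	idx, err := r.register(func(arg uintptr) uintptr { return arg * 2 })
	if err != nil {
		panic(err)
	}
	fmt.Println(idx, r.dispatch(idx, 21))           // prints "0 42"
	fmt.Printf("%#x\n", entryAddress(0x1000, 3, 8)) // prints "0x1018"
}
```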



&lt;p&gt;Usage is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ud&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// This runs safely even when called from a C thread&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;
    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wgpuRequestAdapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  goffi vs purego: Honest Comparison
&lt;/h2&gt;

&lt;p&gt;Both libraries are pure Go, no CGO. But they make different trade-offs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;goffi&lt;/th&gt;
&lt;th&gt;purego&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;libffi-style: prepare CIF once, call many times&lt;/td&gt;
&lt;td&gt;reflect-style: &lt;code&gt;RegisterFunc&lt;/code&gt;, closure per call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per-call cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero allocations (CIF reused)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sync.Pool.Get()&lt;/code&gt; for syscall args&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Callback float returns&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supported (asm writes XMM0)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;panic("unsupported return type")&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ARM64 HFA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recursive struct walk (nested HFAs)&lt;/td&gt;
&lt;td&gt;Top-level fields only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type system&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;TypeDescriptor&lt;/code&gt; (Size/Alignment/Kind)&lt;/td&gt;
&lt;td&gt;Go &lt;code&gt;reflect.Type&lt;/code&gt; introspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ergonomics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw — you manage &lt;code&gt;unsafe.Pointer&lt;/code&gt; yourself&lt;/td&gt;
&lt;td&gt;High-level — auto string null-termination, bool, slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platforms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 (amd64x3 + arm64x2)&lt;/td&gt;
&lt;td&gt;9+ architectures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CallFunctionContext(ctx, ...)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typed errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 error types with &lt;code&gt;errors.As()&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Generic errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Choose goffi when&lt;/strong&gt;: you need struct passing, zero per-call overhead, callback float returns, or you're building GPU/real-time bindings where every nanosecond counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose purego when&lt;/strong&gt;: you need string auto-marshaling, broad architecture support (386, ppc64le, riscv64...), or quick one-off C library bindings with minimal boilerplate.&lt;/p&gt;

&lt;p&gt;We use both in gogpu — goffi for the hot-path WebGPU calls, purego patterns as reference for platform edge cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ecosystem: Where goffi Came From and Where It Led
&lt;/h2&gt;

&lt;p&gt;goffi wasn't built in isolation. It was born from a real need — and it enabled an entire ecosystem of pure Go GPU libraries.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Origin: Proprietary Roots
&lt;/h3&gt;

&lt;p&gt;goffi started as an internal tool. For over a year it lived inside a proprietary codebase — a GPU computing stack we built for our own projects. It worked well enough for us: a handful of platforms, a known set of functions, predictable usage patterns.&lt;/p&gt;

&lt;p&gt;In 2025, we decided to open-source everything. Not just goffi, but the entire ecosystem — WebGPU bindings, the ML framework, the shader compiler, the GPU platform. We believed the Go community needed a real alternative to CGO for native library bindings.&lt;/p&gt;

&lt;p&gt;What we didn't expect: &lt;strong&gt;the gap between "works for us internally" and "production-ready open source" was enormous.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our internal version handled our specific use cases. Open source means every use case. Users on platforms we never tested. Struct layouts we never considered. Calling conventions with edge cases we'd never hit. The list of things that "just worked" internally but broke in the wild was humbling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ABI compliance&lt;/strong&gt; — our AMD64 assembly didn't handle struct returns &amp;gt;8 bytes correctly. Internally we never returned large structs by value. Open source users did, immediately. We had to implement RAX+RDX split returns and sret hidden pointers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARM64&lt;/strong&gt;: we had AMD64 only. Open source meant Apple Silicon support was a day-one priority, not a nice-to-have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Callbacks from C threads&lt;/strong&gt; — internally we controlled which threads called back into Go. In the wild, wgpu-native fires callbacks from Metal and Vulkan threads we never created. We had to integrate &lt;code&gt;crosscall2&lt;/code&gt; for proper C→Go transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling&lt;/strong&gt; — our internal code used generic errors. Open source users needed &lt;code&gt;errors.As()&lt;/code&gt; with typed errors to build robust applications. We added 5 error types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing&lt;/strong&gt; — our internal coverage was ~40%. Getting to 89% meant writing hundreds of test cases for edge cases we'd never encountered ourselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; — internally, we knew how the code worked. For open source, every assembly file needed comments explaining the ABI, every public function needed godoc, every platform quirk needed documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We essentially rebuilt goffi from scratch while keeping the core idea intact. The architecture is the same — CIF pre-computation, assembly dispatch, zero allocations — but the implementation is production-grade now, not prototype-grade.&lt;/p&gt;

&lt;h3&gt;
  
  
  go-webgpu/webgpu
&lt;/h3&gt;

&lt;p&gt;It started with &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu/webgpu&lt;/a&gt; — our zero-CGO WebGPU bindings for Go. We wanted to call &lt;a href="https://github.com/gfx-rs/wgpu-native" rel="noopener noreferrer"&gt;wgpu-native&lt;/a&gt; (Rust-based Vulkan/Metal/DX12 backend) from Go without requiring a C compiler. Every existing approach had a deal-breaker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CGO&lt;/strong&gt;: requires a C compiler such as gcc, breaks a plain &lt;code&gt;go get&lt;/code&gt; on machines without one, and complicates cross-compilation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;purego&lt;/strong&gt;: at the time, no struct passing, no callback float returns, no HFA support — things WebGPU needs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we built goffi as the FFI layer for go-webgpu/webgpu. The bindings wrap 180+ wgpu-native functions — device creation, buffer allocation, render passes, compute dispatches, async adapter requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  born-ml: Machine Learning on GPU
&lt;/h3&gt;

&lt;p&gt;The second consumer was &lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;born-ml/born&lt;/a&gt; — a production-ready ML framework for Go with a PyTorch-like API. born needs GPU compute for tensor operations: matrix multiplication, convolution, automatic differentiation. The WebGPU compute pipeline powered by goffi gives born GPU acceleration while shipping as a single static binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;born (ML framework)
  └─ go-webgpu/webgpu (WebGPU bindings)
       └─ goffi (FFI layer)
            └─ wgpu-native (Vulkan/Metal/DX12)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stack lets you &lt;code&gt;go get github.com/born-ml/born&lt;/code&gt;, write a neural network, and run it on GPU — no Python, no CUDA, no C compiler.&lt;/p&gt;

&lt;h3&gt;
  
  
  GoGPU: The Full Ecosystem
&lt;/h3&gt;

&lt;p&gt;As the projects matured, we realized we could go further. &lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;GoGPU&lt;/a&gt; grew into a complete GPU computing ecosystem with &lt;strong&gt;dual backends&lt;/strong&gt; — a high-performance Rust backend (wgpu-native via goffi) and a &lt;strong&gt;pure Go&lt;/strong&gt; backend:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Uses goffi&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;&lt;strong&gt;gogpu/gogpu&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GPU framework — windowing, input, event loop, dual backends (Rust wgpu-native or Pure Go)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;&lt;strong&gt;gogpu/wgpu&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;WebGPU implementation in pure Go — calls Vulkan, Metal, EGL/GLES natively via goffi&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;&lt;strong&gt;gogpu/naga&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader compiler in pure Go — WGSL to SPIR-V, MSL, GLSL, HLSL&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;&lt;strong&gt;gogpu/gg&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D graphics library — SDF rendering, MSDF text, Vello compute pipeline&lt;/td&gt;
&lt;td&gt;Indirect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gpucontext" rel="noopener noreferrer"&gt;&lt;strong&gt;gogpu/gpucontext&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shared interfaces for GPU context, windowing, and surface creation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both &lt;code&gt;gogpu/gogpu&lt;/code&gt; and &lt;code&gt;gogpu/wgpu&lt;/code&gt; depend directly on goffi. The "pure Go" backend (&lt;code&gt;gogpu/wgpu&lt;/code&gt;) is pure Go in the sense of zero CGO — no C compiler needed — but it still calls native Vulkan, Metal, and EGL APIs through goffi. That's the whole point: goffi replaces CGO, not the native graphics drivers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Performance
&lt;/h3&gt;

&lt;p&gt;At 60 FPS, a typical frame makes ~30-50 FFI calls through goffi:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame budget:            16.7 ms (1000 ms / 60)
GPU work:                ~15 ms
FFI overhead (50 calls): 50 × 100 ns = 5 µs ≈ 0.03% of the frame
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At about 0.03% of the frame budget, the FFI overhead disappears into profiler noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Callback-Heavy Async APIs
&lt;/h3&gt;

&lt;p&gt;WebGPU is heavily async. Device creation, adapter requests, buffer mapping — all callback-based:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Request GPU adapter (async) — simplified pattern&lt;/span&gt;
&lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ud&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;
    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// close never blocks, even if the driver invokes the callback synchronously&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;requestAdapterCIF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wgpuInstanceRequestAdapter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="c"&gt;// Wait for GPU driver callback&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works even when wgpu-native fires the callback from an internal Metal/Vulkan thread, thanks to our &lt;code&gt;crosscall2&lt;/code&gt; integration.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use goffi in Your Project
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/go-webgpu/goffi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Minimal Example: Calling strlen
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"runtime"&lt;/span&gt;
    &lt;span class="s"&gt;"unsafe"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/go-webgpu/goffi/ffi"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/go-webgpu/goffi/types"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
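    &lt;span class="c"&gt;// 1. Load the C library for the current OS&lt;/span&gt;
    &lt;span class="n"&gt;libName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"libc.so.6"&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GOOS&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"windows"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;libName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"msvcrt.dll"&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"darwin"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;libName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"libSystem.B.dylib"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadLibrary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;libName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FreeLibrary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetSymbol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"strlen"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;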

    &lt;span class="c"&gt;// 2. Prepare call interface (once)&lt;/span&gt;
    &lt;span class="n"&gt;cif&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallInterface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrepareCallInterface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultCall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UInt64TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PointerTypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// 3. Call (many times, zero overhead)&lt;/span&gt;
    &lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"Hello, goffi!&lt;/span&gt;&lt;span class="se"&gt;\x00&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;strPtr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="kt"&gt;uint64&lt;/span&gt;

    &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strlen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;strPtr&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"strlen = %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// 13&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Passing Structs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Define struct layout matching C struct&lt;/span&gt;
&lt;span class="n"&gt;pointType&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Size&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="m"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c"&gt;// sizeof(Point)&lt;/span&gt;
    &lt;span class="n"&gt;Alignment&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c"&gt;// alignof(Point)&lt;/span&gt;
    &lt;span class="n"&gt;Kind&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Members&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DoubleTypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// x&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DoubleTypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// y&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Use in CIF&lt;/span&gt;
&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrepareCallInterface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultCall&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DoubleTypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// returns double (distance)&lt;/span&gt;
    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TypeDescriptor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pointType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointType&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c"&gt;// two Point args&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Registering Callbacks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;cb&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eventType&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Event %d received&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eventType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c"&gt;// Pass cb (uintptr) to C function expecting a function pointer&lt;/span&gt;
&lt;span class="n"&gt;ffi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cif&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;registerCallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Hard Lessons
&lt;/h2&gt;

&lt;p&gt;Building a production FFI taught us things no documentation covers:&lt;/p&gt;

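&lt;p&gt;&lt;strong&gt;1. Stack alignment kills silently.&lt;/strong&gt; AMD64 calling conventions require the stack pointer to be 16-byte aligned at the point of a call. Even a single byte of misalignment causes SIGSEGV, but only sometimes: it depends on whether the callee uses alignment-sensitive SSE instructions (such as &lt;code&gt;movaps&lt;/code&gt;) on its stack. We spent days debugging crashes that only reproduced under specific GPU driver versions.&lt;/p&gt;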

&lt;p&gt;&lt;strong&gt;2. Windows shadow space is non-negotiable.&lt;/strong&gt; Win64 ABI requires 32 bytes of "shadow space" on every call, even if the function takes zero arguments. Miss it and the callee corrupts your stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. ARM64 HFA rules are recursive.&lt;/strong&gt; A struct containing a struct containing 4 floats is still an HFA (Homogeneous Floating-point Aggregate) and must be passed in floating-point registers: S0-S3 for &lt;code&gt;float32&lt;/code&gt; members, D0-D3 for &lt;code&gt;float64&lt;/code&gt;. purego only checks top-level fields; we had to walk the full type tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. C threads have no goroutine.&lt;/strong&gt; When wgpu-native calls your callback from an internal Metal thread, &lt;code&gt;getg()&lt;/code&gt; returns nil. You must go through &lt;code&gt;crosscall2 → load_g → cgocallback&lt;/code&gt; or the runtime panics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. &lt;code&gt;float32&lt;/code&gt; encoding matters.&lt;/strong&gt; On Windows, &lt;code&gt;syscall.SyscallN&lt;/code&gt; passes args as &lt;code&gt;uintptr&lt;/code&gt;. Widening &lt;code&gt;float32&lt;/code&gt; to &lt;code&gt;float64&lt;/code&gt; then stuffing into a register corrupts the bit pattern — you need &lt;code&gt;math.Float32bits&lt;/code&gt; to preserve the exact IEEE-754 representation.&lt;/p&gt;
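
&lt;p&gt;To make lesson 5 concrete, here is a minimal sketch of the two encodings. The helper names &lt;code&gt;packFloat32&lt;/code&gt; and &lt;code&gt;packWidened&lt;/code&gt; are hypothetical and are not part of goffi; only &lt;code&gt;math.Float32bits&lt;/code&gt; and &lt;code&gt;math.Float64bits&lt;/code&gt; come from the standard library.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// packFloat32 returns the raw IEEE-754 bits of f, zero-extended into a
// register-sized word. This is the encoding a float32 argument needs
// when it travels through a uintptr slot.
func packFloat32(f float32) uintptr {
	return uintptr(math.Float32bits(f))
}

// packWidened shows the mistake: widening to float64 first yields a
// different bit pattern, which a callee expecting float32 misreads.
func packWidened(f float32) uint64 {
	return math.Float64bits(float64(f))
}

func main() {
	f := float32(1.5)
	fmt.Printf("float32 bits: %#x\n", packFloat32(f)) // 0x3fc00000
	fmt.Printf("widened bits: %#x\n", packWidened(f)) // 0x3ff8000000000000
}
```

&lt;p&gt;For this input the low 32 bits of the widened word are all zero, so a callee reading them as a &lt;code&gt;float&lt;/code&gt; would see 0.0 instead of 1.5.&lt;/p&gt;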




&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FFI overhead&lt;/td&gt;
&lt;td&gt;88-114 ns/op&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test coverage&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Platforms&lt;/td&gt;
&lt;td&gt;5 (Win/Linux/macOS × AMD64 + Linux/macOS × ARM64)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assembly files&lt;/td&gt;
&lt;td&gt;17 files, ~900 lines of logic + 6200 lines of trampoline entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Callback slots&lt;/td&gt;
&lt;td&gt;2000 per process&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;0 (only Go stdlib)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CGO required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What About Go 1.26 CGO Improvements?
&lt;/h2&gt;

&lt;p&gt;Go 1.26 (released February 2026) &lt;a href="https://go.dev/doc/go1.26" rel="noopener noreferrer"&gt;reduced cgo call overhead by ~30%&lt;/a&gt; by removing the dedicated syscall P state. &lt;a href="https://gist.github.com/DeedleFake/2f50b02c0708484c66d18253302c4fd6" rel="noopener noreferrer"&gt;Benchmarks on Apple M1&lt;/a&gt; show &lt;code&gt;CgoCall&lt;/code&gt; is 33% faster, &lt;code&gt;CgoCallWithCallback&lt;/code&gt; is 21% faster.&lt;/p&gt;

&lt;p&gt;This is great news — but it doesn't change our calculus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CGO still requires a C compiler&lt;/strong&gt; at build time. Our users &lt;code&gt;go get&lt;/code&gt; and ship.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-compilation&lt;/strong&gt; with CGO still requires cross-toolchains. &lt;code&gt;GOOS=linux GOARCH=arm64 go build&lt;/code&gt; just works with goffi.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static binaries&lt;/strong&gt; — CGO often pulls in libc. goffi produces fully static Go binaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go 1.26 also benefits goffi&lt;/strong&gt; — our &lt;code&gt;runtime.cgocall&lt;/code&gt; path gets the same 30% speedup, because goffi uses the same runtime machinery internally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap between CGO and pure-Go FFI is narrowing from both directions. We welcome it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;v0.5.0&lt;/strong&gt; is focused on usability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variadic function support (&lt;code&gt;printf&lt;/code&gt;, &lt;code&gt;sprintf&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Builder pattern API for less boilerplate&lt;/li&gt;
&lt;li&gt;Platform-specific struct alignment (Windows &lt;code&gt;#pragma pack&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;v1.0.0&lt;/strong&gt; targets API stability with SemVer guarantees, a security audit, and published benchmarks vs CGO/purego.&lt;/p&gt;

&lt;p&gt;The long-term goal: make GPU programming in Go as natural as it is in Rust or C++, with the ergonomics Go developers expect — &lt;code&gt;go get&lt;/code&gt;, &lt;code&gt;go build&lt;/code&gt;, done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;goffi (FFI layer):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-webgpu/goffi" rel="noopener noreferrer"&gt;github.com/go-webgpu/goffi&lt;/a&gt; — the library this article is about&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pkg.go.dev/github.com/go-webgpu/goffi" rel="noopener noreferrer"&gt;pkg.go.dev/github.com/go-webgpu/goffi&lt;/a&gt; — Go documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-webgpu/goffi/blob/main/docs/PERFORMANCE.md" rel="noopener noreferrer"&gt;Performance guide&lt;/a&gt; — benchmarks, optimization strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Projects built on goffi:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu/webgpu&lt;/a&gt; — zero-CGO WebGPU bindings (wgpu-native)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/born-ml/born" rel="noopener noreferrer"&gt;born-ml/born&lt;/a&gt; — ML framework for Go, GPU-accelerated, PyTorch-like API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GoGPU ecosystem (pure Go GPU):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt; — GPU framework, dual backends (Rust + Pure Go)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt; — WebGPU in pure Go (Vulkan, Metal, DX12, GLES, Software)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;gogpu/naga&lt;/a&gt; — shader compiler in pure Go (WGSL to SPIR-V/MSL/GLSL/HLSL)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt; — 2D graphics library (SDF, MSDF text, Vello compute)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;p&gt;goffi wouldn't exist without &lt;a href="https://github.com/ebitengine/purego" rel="noopener noreferrer"&gt;purego&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When we first faced the CGO problem, the conventional wisdom was simple: "you can't call C from Go without a C compiler." purego proved that wrong. The &lt;a href="https://github.com/hajimehoshi/ebiten" rel="noopener noreferrer"&gt;ebitengine&lt;/a&gt; team, and specifically &lt;a href="https://github.com/AJ" rel="noopener noreferrer"&gt;@AJ&lt;/a&gt; and &lt;a href="https://github.com/TotallyGamerJet" rel="noopener noreferrer"&gt;@TotallyGamerJet&lt;/a&gt;, demonstrated that &lt;code&gt;runtime.cgocall&lt;/code&gt;, &lt;code&gt;cgo_import_dynamic&lt;/code&gt;, and hand-written assembly could replace CGO entirely. They showed the community that pure Go FFI was not just theoretically possible, but practical enough to ship a production game engine.&lt;/p&gt;

&lt;p&gt;We studied purego's source code extensively. The &lt;code&gt;crosscall2&lt;/code&gt; callback mechanism, the &lt;code&gt;fakecgo&lt;/code&gt; approach, the assembly trampoline pattern — purego pioneered all of these in the Go ecosystem. Without that foundation to learn from, goffi would have taken years longer to build, if we'd attempted it at all.&lt;/p&gt;

&lt;p&gt;goffi took a different path — libffi-style CIF pre-computation instead of reflect-based dispatch, explicit type descriptors instead of Go type introspection, struct passing and callback float returns for GPU workloads — but the path only existed because purego cleared it first.&lt;/p&gt;

&lt;p&gt;To the purego maintainers: thank you for proving it was possible. The entire pure-Go FFI ecosystem stands on your work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;goffi is MIT-licensed and open to contributions. If you're building Go bindings for C libraries and want zero-CGO with full ABI compliance — give it a try and let us know how it goes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>Porting Vello's GPU Tile Rasterizer to Pure Go</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Sat, 28 Feb 2026 00:46:44 +0000</pubDate>
      <link>https://dev.to/kolkov/porting-vellos-gpu-tile-rasterizer-to-pure-go-7i8</link>
      <guid>https://dev.to/kolkov/porting-vellos-gpu-tile-rasterizer-to-pure-go-7i8</guid>
      <description>&lt;p&gt;When you call &lt;code&gt;dc.DrawCircle(400, 300, 100)&lt;/code&gt; in &lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;, what happens under the hood? A &lt;strong&gt;tile-based rasterization pipeline&lt;/strong&gt; — a direct port of &lt;a href="https://github.com/linebender/vello" rel="noopener noreferrer"&gt;Vello&lt;/a&gt;'s GPU compute pipeline from the &lt;a href="https://linebender.org/" rel="noopener noreferrer"&gt;linebender&lt;/a&gt; team — converts your vector paths into per-pixel coverage values. It runs on &lt;strong&gt;both CPU and GPU&lt;/strong&gt;, and it's written entirely in Pure Go.&lt;/p&gt;

&lt;p&gt;This article walks through the internals of &lt;code&gt;tilecompute&lt;/code&gt; — a &lt;strong&gt;6,700-line&lt;/strong&gt; dual-execution port of Vello's original GPU compute rasterizer (circa 2020–2023) into Pure Go, with a CPU reference implementation and GPU compute dispatch via 9 WGSL shaders.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Just Use Scanline?
&lt;/h2&gt;

&lt;p&gt;Traditional scanline rasterizers process one row of pixels at a time. They work, but they have a fundamental scalability problem: for a 4K canvas (3840x2160), you're iterating over &lt;strong&gt;8.3 million pixels&lt;/strong&gt; regardless of how simple your shapes are.&lt;/p&gt;

&lt;p&gt;Tile-based rasterizers flip this around. They divide the canvas into small tiles (16x16 pixels in our case) and only process tiles that the vector path actually touches. A circle in the corner of a 4K canvas? You process maybe 40 tiles instead of 8 million pixels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/linebender/vello" rel="noopener noreferrer"&gt;Vello&lt;/a&gt; — created by &lt;a href="https://raphlinus.github.io/" rel="noopener noreferrer"&gt;Raph Levien&lt;/a&gt; and the &lt;a href="https://linebender.org/" rel="noopener noreferrer"&gt;linebender&lt;/a&gt; team — pioneered this approach using GPU compute shaders (originally as &lt;a href="https://github.com/raphlinus/piet-gpu" rel="noopener noreferrer"&gt;piet-gpu&lt;/a&gt; in 2020, renamed to Vello in 2022). We ported its GPU compute pipeline to Pure Go for use as the rasterization core of gogpu/gg.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This article covers the &lt;strong&gt;original GPU compute pipeline&lt;/strong&gt; (16x16 tiles, 2020–2023). We've also ported Vello's &lt;strong&gt;newer&lt;/strong&gt; &lt;a href="https://github.com/linebender/vello/issues/670" rel="noopener noreferrer"&gt;Sparse Strips&lt;/a&gt; algorithm (4x4 tiles, August 2024) — see the Two Rasterizers, Two Targets section below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The GPU Compute Pipeline
&lt;/h2&gt;

&lt;p&gt;Vello's original rasterizer (2020–2023) uses a &lt;strong&gt;tile-based GPU compute&lt;/strong&gt; approach with 16x16 pixel tiles. The key insight: instead of asking "which pixels does this path cover?", ask "which tiles does each line segment cross, and what's the winding contribution?"&lt;/p&gt;

&lt;p&gt;The full pipeline has &lt;strong&gt;9 GPU compute stages&lt;/strong&gt; (matching our 9 WGSL shaders), plus scene encoding and curve flattening on the CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector Paths (cubics, arcs, lines)
    ↓
Scene Encoding (CPU — pack paths, transforms, styles)
    ↓
 1. PathTag Reduce    ─┐
 2. PathTag Scan      ─┘ monoid prefix sums over path structure
    ↓
Flatten (Euler spiral → line segments)
    ↓
 3. Draw Reduce       ─┐
 4. Draw Leaf Scan    ─┘ monoid prefix sums over draw commands
    ↓
 5. Path Count        (DDA walk — which tiles does each segment cross?)
    ↓
 6. Backdrop          (prefix sum — left-to-right winding accumulation)
    ↓
 7. Coarse            (generate Per-Tile Command Lists)
    ↓
 8. Path Tiling       (clip segments to tile boundaries, compute yEdge)
    ↓
 9. Fine              (per-pixel analytic anti-aliased coverage)
    ↓
Per-pixel RGBA output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first four stages use &lt;strong&gt;monoid prefix sums&lt;/strong&gt; — Vello's core parallelism primitive. A monoid is an associative operation with an identity element; prefix sums over monoids can be computed in O(log n) parallel steps on a GPU, turning what would be sequential parsing into massively parallel work.&lt;/p&gt;

&lt;p&gt;Let's walk through the key stages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stages 1–4: Monoid Prefix Sums
&lt;/h2&gt;

&lt;p&gt;The first four stages parse the encoded scene in parallel using &lt;strong&gt;monoid prefix sums&lt;/strong&gt;. Each path tag and draw tag is reduced into a monoid (an associative structure), then scanned to produce cumulative values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PathMonoid — accumulated state from scanning path tags&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;PathMonoid&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TransIx&lt;/span&gt;       &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Transform index&lt;/span&gt;
    &lt;span class="n"&gt;PathSegIx&lt;/span&gt;     &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Path segment index&lt;/span&gt;
    &lt;span class="n"&gt;PathSegOffset&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Offset into path segment data&lt;/span&gt;
    &lt;span class="n"&gt;StyleIx&lt;/span&gt;       &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Style index&lt;/span&gt;
    &lt;span class="n"&gt;PathIx&lt;/span&gt;        &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Path index&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// DrawMonoid — accumulated state from scanning draw tags&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;DrawMonoid&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;PathIx&lt;/span&gt;      &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Which path this draw belongs to&lt;/span&gt;
    &lt;span class="n"&gt;ClipIx&lt;/span&gt;      &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Clip stack depth&lt;/span&gt;
    &lt;span class="n"&gt;SceneOffset&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Offset into scene data&lt;/span&gt;
    &lt;span class="n"&gt;InfoOffset&lt;/span&gt;  &lt;span class="kt"&gt;uint32&lt;/span&gt; &lt;span class="c"&gt;// Offset into info buffer&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the GPU, each reduce + scan pair runs in O(log n) parallel steps, using the first four of our 9 WGSL compute shaders, dispatched through &lt;code&gt;VelloComputeDispatcher&lt;/code&gt;. On the CPU, it's a sequential scan. The data structures are identical, which makes the CPU code a reference implementation for validating GPU correctness.&lt;/p&gt;
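
&lt;p&gt;The reduce-then-scan split can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the tilecompute code: &lt;code&gt;tagMonoid&lt;/code&gt; is a hypothetical two-field stand-in for &lt;code&gt;PathMonoid&lt;/code&gt;, &lt;code&gt;combine&lt;/code&gt; is plain field-wise addition, and &lt;code&gt;chunkSize&lt;/code&gt; plays the role of a GPU workgroup. The point is that a chunked scan produces the same result as the sequential reference because the operation is associative.&lt;/p&gt;

```go
package main

import "fmt"

// tagMonoid is a hypothetical, simplified stand-in for PathMonoid.
type tagMonoid struct {
	PathSegIx uint32
	PathIx    uint32
}

// combine is the associative operation; the zero value is its identity.
func combine(a, b tagMonoid) tagMonoid {
	return tagMonoid{a.PathSegIx + b.PathSegIx, a.PathIx + b.PathIx}
}

// minInt returns the smaller of a and b.
func minInt(a, b int) int {
	if a > b {
		return b
	}
	return a
}

// sequentialScan is the CPU reference: an exclusive prefix sum.
func sequentialScan(in []tagMonoid) []tagMonoid {
	out := make([]tagMonoid, len(in))
	var acc tagMonoid
	for i, m := range in {
		out[i] = acc
		acc = combine(acc, m)
	}
	return out
}

// reduceThenScan mirrors the two-pass structure: reduce every chunk to a
// single total, scan the totals, then scan inside each chunk starting
// from that chunk's base. Chunks are independent of each other, which is
// what lets a GPU process them in parallel. chunkSize must be positive.
func reduceThenScan(in []tagMonoid, chunkSize int) []tagMonoid {
	nChunks := (len(in) + chunkSize - 1) / chunkSize
	totals := make([]tagMonoid, nChunks)
	for c := range totals {
		end := minInt((c+1)*chunkSize, len(in))
		for _, m := range in[c*chunkSize : end] {
			totals[c] = combine(totals[c], m)
		}
	}
	bases := sequentialScan(totals)
	out := make([]tagMonoid, len(in))
	for c := range totals {
		acc := bases[c]
		end := minInt((c+1)*chunkSize, len(in))
		for i := c * chunkSize; i != end; i++ {
			out[i] = acc
			acc = combine(acc, in[i])
		}
	}
	return out
}

func main() {
	in := []tagMonoid{{1, 0}, {1, 0}, {1, 1}, {1, 0}, {1, 1}, {1, 0}, {1, 1}}
	ref := sequentialScan(in)
	par := reduceThenScan(in, 3)
	for i := range in {
		fmt.Println(i, ref[i], par[i], ref[i] == par[i])
	}
}
```

&lt;p&gt;Comparing the two outputs element by element is the same validation idea described above: the sequential scan acts as the oracle for the parallel-friendly one.&lt;/p&gt;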

&lt;h2&gt;
  
  
  Stage 5: Path Count — DDA Tile Walk
&lt;/h2&gt;

&lt;p&gt;The path count stage answers: "which tiles does each line segment cross?"&lt;/p&gt;

&lt;p&gt;We use a &lt;strong&gt;Digital Differential Analyzer&lt;/strong&gt; (DDA) — an algorithm that traces a line through a grid, visiting every cell the line passes through. For each tile visited, we record two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Segment count&lt;/strong&gt; — how many line segments cross this tile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backdrop&lt;/strong&gt; — the signed winding contribution at the tile's left edge
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// pathCountMain — DDA walk through the tile grid&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;pathCountMain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bump&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;BumpAllocators&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;LineSoup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;paths&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tile&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Tile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segCounts&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;SegmentCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;lineIx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;lineIx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bump&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lines&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;lineIx&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lineIx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;p0&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2FromArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;p1&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2FromArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// Sort by Y for consistent DDA walking&lt;/span&gt;
        &lt;span class="n"&gt;isDown&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;xy0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy1&lt;/span&gt; &lt;span class="n"&gt;vec2&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isDown&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;xy0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;xy0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xy1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p0&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// Scale to tile coordinates (pixel / 16)&lt;/span&gt;
        &lt;span class="n"&gt;s0&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;xy0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tileScale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;xy1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mul&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tileScale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// DDA walk: trace the line through tile grid cells&lt;/span&gt;
        &lt;span class="c"&gt;// counting X crossings and Y crossings separately&lt;/span&gt;
        &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;abs32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;s0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
        &lt;span class="n"&gt;idxdy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;idxdy&lt;/span&gt;

        &lt;span class="c"&gt;// ... walk through tiles, update backdrop and segment counts&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
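
&lt;p&gt;The "count X crossings and Y crossings separately" idea can be shown in isolation. The sketch below is a simplification, not the &lt;code&gt;pathCountMain&lt;/code&gt; logic: the function name &lt;code&gt;tilesCrossed&lt;/code&gt; is hypothetical, it assumes 16-pixel tiles, and it ignores edge cases the real stage has to handle (endpoints lying exactly on a grid line, segments passing through a tile corner).&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// tileSize is the tile edge length in pixels (16 in this pipeline).
const tileSize = 16.0

// tilesCrossed estimates how many tiles a segment visits. A segment
// starts in one tile and enters a new tile each time it crosses a
// vertical or a horizontal grid line, so the total is
// 1 + (vertical lines crossed) + (horizontal lines crossed).
func tilesCrossed(x0, y0, x1, y1 float64) int {
	tx0, ty0 := math.Floor(x0/tileSize), math.Floor(y0/tileSize)
	tx1, ty1 := math.Floor(x1/tileSize), math.Floor(y1/tileSize)
	countX := int(math.Abs(tx1 - tx0))
	countY := int(math.Abs(ty1 - ty0))
	return 1 + countX + countY
}

func main() {
	// Crosses x=16 and x=32 (two vertical lines) and y=16 (one horizontal).
	fmt.Println(tilesCrossed(8, 8, 40, 24))
	// Stays inside a single tile.
	fmt.Println(tilesCrossed(1, 1, 2, 2))
}
```

&lt;p&gt;Knowing the count up front is what lets each segment be handled independently: the walk over those tiles needs no shared state beyond the per-tile counters it updates.&lt;/p&gt;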



&lt;p&gt;The &lt;strong&gt;backdrop&lt;/strong&gt; is the critical concept. When a line segment crosses a tile's left edge, it contributes a signed delta to the winding number: &lt;code&gt;-1&lt;/code&gt; for downward segments, &lt;code&gt;+1&lt;/code&gt; for upward. This is how we know whether a pixel is "inside" or "outside" the path without checking every segment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 6: Backdrop Prefix Sum
&lt;/h2&gt;

&lt;p&gt;This is where tile-based rasterization really shines. Instead of checking every segment for every pixel, we accumulate winding numbers &lt;strong&gt;left-to-right&lt;/strong&gt; across each row of tiles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Backdrop prefix sum — left-to-right winding accumulation&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bboxH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;bboxW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bboxW&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Backdrop&lt;/span&gt;
        &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Backdrop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this step, each tile knows the accumulated winding number from all segments to its left. Under the non-zero fill rule, a tile with a non-zero winding number (1, for example) and no segments crossing it is &lt;strong&gt;fully solid&lt;/strong&gt;, so it needs no per-pixel computation.&lt;/p&gt;
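
&lt;p&gt;A small worked row makes the classification visible. This is a sketch under assumptions: &lt;code&gt;rowTile&lt;/code&gt;, &lt;code&gt;accumulate&lt;/code&gt;, and &lt;code&gt;classifyRow&lt;/code&gt; are hypothetical names, the fill rule is assumed to be non-zero, and the input deltas are hand-picked to describe a shape whose left edge falls in tile 1 and whose right edge falls in tile 4.&lt;/p&gt;

```go
package main

import "fmt"

// rowTile is a hypothetical, trimmed-down tile record.
type rowTile struct {
	Backdrop int32  // winding delta before the scan, accumulated winding after
	Segments uint32 // number of segments crossing this tile
}

// accumulate runs the left-to-right prefix sum over one row, in place.
func accumulate(row []rowTile) {
	sum := int32(0)
	for i := range row {
		sum += row[i].Backdrop
		row[i].Backdrop = sum
	}
}

// classifyRow labels each tile after accumulation, assuming the
// non-zero fill rule. Only "edge" tiles need per-pixel coverage.
func classifyRow(row []rowTile) []string {
	out := make([]string, len(row))
	for i, t := range row {
		switch {
		case t.Segments != 0:
			out[i] = "edge"
		case t.Backdrop != 0:
			out[i] = "solid"
		default:
			out[i] = "empty"
		}
	}
	return out
}

func main() {
	// Deltas: +1 entering the shape at tile 2, -1 leaving it at tile 5.
	row := []rowTile{
		{0, 0}, {0, 2}, {1, 0}, {0, 0}, {0, 3}, {-1, 0},
	}
	accumulate(row)
	fmt.Println(classifyRow(row))
}
```

&lt;p&gt;Tiles 2 and 3 come out as solid without a single segment being inspected for them, which is the saving the prefix sum buys.&lt;/p&gt;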

&lt;h2&gt;
  
  
  Stage 7: Coarse — Allocation + PTCL Generation
&lt;/h2&gt;

&lt;p&gt;The coarse stage allocates space in a global segment buffer and generates Per-Tile Command Lists (PTCLs). Segment counts are converted into indices using Vello's &lt;strong&gt;inverted index&lt;/strong&gt; trick:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Convert counts to indices using bitwise NOT&lt;/span&gt;
&lt;span class="n"&gt;nextSegIx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;tiles&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nSegs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SegmentCountOrIx&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nSegs&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SegmentCountOrIx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="n"&gt;nextSegIx&lt;/span&gt; &lt;span class="c"&gt;// !seg_ix in Rust&lt;/span&gt;
        &lt;span class="n"&gt;nextSegIx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;nSegs&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;^&lt;/code&gt; (bitwise NOT) serves double duty: it marks the tile as "has segments" (for any index below 2&lt;sup&gt;31&lt;/sup&gt;, the NOT has its top bit set, so it is negative when read as int32) while encoding the starting index into the global segment array.&lt;/p&gt;
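&lt;p&gt;A short round-trip sketch shows how a consumer could tell an inverted index from an empty tile and recover the original value. The helper names &lt;code&gt;encodeSegIx&lt;/code&gt; and &lt;code&gt;decodeSegIx&lt;/code&gt; are hypothetical, and the top-bit check assumes indices stay below 2&lt;sup&gt;31&lt;/sup&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// encodeSegIx stores a starting index as its bitwise NOT.
func encodeSegIx(ix uint32) uint32 { return ^ix }

// decodeSegIx reports whether v holds an inverted index and, if so, recovers it.
// It relies on indices staying below 2^31, so the NOT always has the top bit set.
func decodeSegIx(v uint32) (ix uint32, ok bool) {
	if v>>31 == 1 {
		return ^v, true
	}
	return 0, false
}

func main() {
	counts := []uint32{0, 3, 0, 2} // per-tile segment counts
	next := uint32(0)
	for i, n := range counts {
		if n != 0 {
			counts[i] = encodeSegIx(next)
			next += n
		}
	}
	for i, v := range counts {
		ix, ok := decodeSegIx(v)
		fmt.Println(i, ok, ix)
	}
}
```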

&lt;h2&gt;
  
  
  Stage 8: Path Tiling
&lt;/h2&gt;

&lt;p&gt;Now we clip each line segment to its tile boundaries and compute the crucial &lt;strong&gt;yEdge&lt;/strong&gt; value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PathSegment — a line segment clipped to tile boundaries&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;PathSegment&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;Point0&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt; &lt;span class="c"&gt;// Tile-relative start point&lt;/span&gt;
&lt;span class="n"&gt;Point1&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt; &lt;span class="c"&gt;// Tile-relative end point&lt;/span&gt;
&lt;span class="n"&gt;YEdge&lt;/span&gt;  &lt;span class="kt"&gt;float32&lt;/span&gt;    &lt;span class="c"&gt;// Y where segment crosses tile left edge (x=0)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;YEdge&lt;/code&gt; tells the fine rasterizer where the segment enters or exits the tile at x=0. If the segment doesn't cross the left edge, &lt;code&gt;YEdge&lt;/code&gt; is set to &lt;code&gt;1e9&lt;/code&gt; (sentinel value). This single float32 captures the geometric relationship needed for coverage computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 9: Fine Rasterization
&lt;/h2&gt;

&lt;p&gt;The final stage computes per-pixel anti-aliased coverage within each 16x16 tile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// fillPath computes per-pixel coverage for a single tile.&lt;/span&gt;
&lt;span class="c"&gt;// Direct port of fill_path from vello_shaders/src/cpu/fine.rs.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fillPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;PathSegment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;backdrop&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evenOdd&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="c"&gt;// Initialize area with backdrop winding number&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backdrop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="c"&gt;// For each column in the tile, compute the area&lt;/span&gt;
&lt;span class="c"&gt;// contribution of this segment using analytic integration&lt;/span&gt;
&lt;span class="c"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Apply fill rule and convert to alpha [0, 1]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;evenOdd&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abs32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;round32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;area&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;min32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;area&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;analytic anti-aliasing&lt;/strong&gt; — we compute exact sub-pixel coverage by integrating line segment contributions, not by supersampling. The result is mathematically precise edges with smooth alpha gradients.&lt;/p&gt;
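&lt;p&gt;The two fill-rule conversions at the end of &lt;code&gt;fillPath&lt;/code&gt; are easy to check in isolation. The sketch below restates them as standalone functions using the standard &lt;code&gt;math&lt;/code&gt; package in place of the &lt;code&gt;abs32&lt;/code&gt;, &lt;code&gt;round32&lt;/code&gt;, and &lt;code&gt;min32&lt;/code&gt; helpers; the names &lt;code&gt;nonZero&lt;/code&gt; and &lt;code&gt;evenOdd&lt;/code&gt; are chosen here for illustration.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// nonZero maps an accumulated signed area to alpha under the NonZero rule.
func nonZero(a float32) float32 {
	return float32(math.Min(math.Abs(float64(a)), 1.0))
}

// evenOdd folds the accumulated value into a triangle wave with period 2,
// so windings 0, 2, 4 map to alpha 0 and windings 1, 3, 5 map to alpha 1.
func evenOdd(a float32) float32 {
	v := float64(a)
	return float32(math.Abs(v - 2.0*math.Round(0.5*v)))
}

func main() {
	for _, a := range []float32{0, 0.5, 1, 1.5, 2, -1} {
		fmt.Println(a, nonZero(a), evenOdd(a))
	}
}
```

&lt;p&gt;A winding value of 2 is fully covered under NonZero (alpha 1) but empty under EvenOdd (alpha 0), while a partially covered pixel at 1.5 gets alpha 0.5 under EvenOdd.&lt;/p&gt;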

&lt;h2&gt;
  
  
  Euler Spiral Flattening
&lt;/h2&gt;

&lt;p&gt;But wait — &lt;code&gt;DrawCircle&lt;/code&gt; produces cubic Bezier curves, not line segments. How do we get from curves to lines?&lt;/p&gt;

&lt;p&gt;Vello uses &lt;strong&gt;Euler spiral approximation&lt;/strong&gt; for adaptive curve flattening. Unlike fixed-count subdivision, which produces too many segments on flat stretches and too few on tight bends, Euler spirals give tight, predictable error bounds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// FlattenFill flattens cubic Beziers using Vello's Euler spiral algorithm&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;FlattenFill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cubics&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;CubicBezier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;LineSoup&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;LineSoup&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;cubics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;p0&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;p1&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;p2&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;p3&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;vec2&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;P3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;flattenEulerFill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The algorithm evaluates curvature at each point and subdivides only where the error exceeds a tolerance (0.25 pixels by default). A nearly-straight segment becomes 1 line. A tight curve gets more subdivisions. The result: &lt;strong&gt;minimum line segments for maximum visual quality&lt;/strong&gt;.&lt;/p&gt;
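&lt;p&gt;The Euler spiral math is too long to reproduce here, but the idea of a tolerance-driven segment count can be illustrated with a much simpler, more conservative technique: Wang's formula for uniform subdivision. To be clear, this is not the algorithm Vello or gogpu/gg uses; &lt;code&gt;wangSegments&lt;/code&gt; and its &lt;code&gt;vec2&lt;/code&gt; type are hypothetical names for this sketch.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

type vec2 struct{ X, Y float64 }

// secondDiff returns |a - 2b + c|, a measure of how sharply the control polygon bends.
func secondDiff(a, b, c vec2) float64 {
	return math.Hypot(a.X-2*b.X+c.X, a.Y-2*b.Y+c.Y)
}

// wangSegments estimates how many uniform line segments keep a cubic Bezier
// within tol pixels of the true curve (Wang's formula, a conservative bound).
func wangSegments(p0, p1, p2, p3 vec2, tol float64) int {
	m := math.Max(secondDiff(p0, p1, p2), secondDiff(p1, p2, p3))
	n := int(math.Ceil(math.Sqrt(0.75 * m / tol)))
	if n > 1 {
		return n
	}
	return 1
}

func main() {
	// Control points on a straight line with even spacing: one segment suffices.
	fmt.Println(wangSegments(vec2{0, 0}, vec2{10, 0}, vec2{20, 0}, vec2{30, 0}, 0.25))
	// A 100-pixel arch needs many more segments at the same tolerance.
	fmt.Println(wangSegments(vec2{0, 0}, vec2{0, 100}, vec2{100, 100}, vec2{100, 0}, 0.25))
}
```

&lt;p&gt;Uniform subdivision spends the same number of segments on every part of the curve, which is exactly the waste that curvature-aware flattening avoids.&lt;/p&gt;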

&lt;h2&gt;
  
  
  Per-Tile Command Lists (PTCL)
&lt;/h2&gt;

&lt;p&gt;The coarse stage (Stage 7) generates &lt;strong&gt;Per-Tile Command Lists&lt;/strong&gt; — each tile gets a stream of commands like "fill with coverage from segment N", "apply color #FF0000", "begin clip", "end clip". This is what makes the pipeline work for multiple overlapping paths (UI with buttons, text, backgrounds) in a single fine rasterization pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// PTCL commands&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;CmdEnd&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;CmdFill&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c"&gt;// Compute coverage from segments&lt;/span&gt;
&lt;span class="n"&gt;CmdSolid&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;  &lt;span class="c"&gt;// Full coverage (no segments needed)&lt;/span&gt;
&lt;span class="n"&gt;CmdColor&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;  &lt;span class="c"&gt;// Apply RGBA color with source-over blending&lt;/span&gt;
&lt;span class="n"&gt;CmdBeginClip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="c"&gt;// Push clip layer&lt;/span&gt;
&lt;span class="n"&gt;CmdEndClip&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;11&lt;/span&gt; &lt;span class="c"&gt;// Pop and composite clip&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fine rasterizer walks each tile's PTCL, executing commands and compositing results with premultiplied alpha — exactly like a GPU would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fineRasterizeTile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ptcl&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PTCL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;PathSegment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;bgColor&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TileWidth&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;TileHeight&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;pixelCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TileWidth&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;TileHeight&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pixelCount&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float32&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bgColor&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="c"&gt;// Read position in the tile's command stream&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nextOffset&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ptcl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadCmd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CmdEnd&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CmdFill&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="c"&gt;// Compute coverage, store in area buffer&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CmdColor&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="c"&gt;// Apply color using area as mask, source-over blend&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CmdBeginClip&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="c"&gt;// Push current pixels onto clip stack&lt;/span&gt;
&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;CmdEndClip&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="c"&gt;// Pop and composite with clip mask&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nextOffset&lt;/span&gt; &lt;span class="c"&gt;// Advance to the next command&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
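&lt;p&gt;To see the command walk and the blend in action, here is a minimal single-pixel sketch. The tag values match the constants listed earlier, but the &lt;code&gt;Cmd&lt;/code&gt; struct, the &lt;code&gt;runPixel&lt;/code&gt; function, and the fixed &lt;code&gt;coverage&lt;/code&gt; argument (standing in for what &lt;code&gt;fillPath&lt;/code&gt; would compute) are simplifications assumed for illustration; the real PTCL is a packed stream and clip commands are left out.&lt;/p&gt;

```go
package main

import "fmt"

// Tag values follow the constants listed earlier. The Cmd struct is a
// hypothetical stand-in for the packed command stream.
const (
	CmdEnd   = 0
	CmdFill  = 1
	CmdSolid = 3
	CmdColor = 5
)

type Cmd struct {
	Tag   int
	Color [4]float32 // premultiplied RGBA, read by CmdColor only
}

// runPixel executes a command list for one pixel. The coverage argument
// stands in for the value fillPath would compute at that pixel.
func runPixel(cmds []Cmd, bg [4]float32, coverage float32) [4]float32 {
	rgba := bg
	area := float32(0)
	for _, c := range cmds {
		switch c.Tag {
		case CmdEnd:
			return rgba
		case CmdFill:
			area = coverage
		case CmdSolid:
			area = 1
		case CmdColor:
			// Source-over, premultiplied: dst = src*area + dst*(1 - srcAlpha*area).
			keep := 1 - c.Color[3]*area
			for ch := range rgba {
				rgba[ch] = c.Color[ch]*area + rgba[ch]*keep
			}
		}
	}
	return rgba
}

func main() {
	white := [4]float32{1, 1, 1, 1}
	red := [4]float32{1, 0, 0, 1}
	cmds := []Cmd{{Tag: CmdFill}, {Tag: CmdColor, Color: red}, {Tag: CmdEnd}}
	fmt.Println(runPixel(cmds, white, 0.5)) // [1 0.5 0.5 1]
}
```

&lt;p&gt;Opaque red at 50% coverage over white yields a pink pixel, which is the anti-aliased edge color you would expect.&lt;/p&gt;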



&lt;h2&gt;
  
  
  Dual Execution: Why Both CPU and GPU?
&lt;/h2&gt;

&lt;p&gt;tilecompute is a &lt;strong&gt;dual-execution&lt;/strong&gt; pipeline: the same 9-stage algorithm runs on both CPU (sequential Go code) and GPU (WGSL compute shaders dispatched via &lt;code&gt;VelloComputeDispatcher&lt;/code&gt;). Why maintain both?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. CPU is the Core.&lt;/strong&gt; After analyzing 8 enterprise 2D engines (Skia, Cairo, Vello, Blend2D, tiny-skia, piet, Qt RHI, Pathfinder), we found that &lt;strong&gt;none&lt;/strong&gt; of them treats CPU rasterization as a "backend". It is always the core, and the GPU is the optional accelerator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Correctness Reference.&lt;/strong&gt; The CPU implementation serves as a reference for the GPU compute shaders. When GPU and CPU produce different results, we know which one to trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Universal Availability.&lt;/strong&gt; Servers, CI/CD, embedded systems — many environments have no GPU. A server generating 10,000 chart images doesn't need GPU acceleration; it needs reliable software rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Identical Algorithm, Dual Execution.&lt;/strong&gt; The CPU code mirrors the GPU pipeline stage by stage: same data structures, same logic. The 9 WGSL compute shaders are embedded with &lt;code&gt;//go:embed&lt;/code&gt; and compiled into GPU compute pipelines via &lt;code&gt;hal.Device.CreateComputePipeline()&lt;/code&gt;. When a GPU is available, &lt;code&gt;VelloComputeDispatcher&lt;/code&gt; dispatches the 9 stages with &lt;code&gt;pass.Dispatch()&lt;/code&gt;, each stage running in parallel across its workgroups; when not, the CPU executes them sequentially.&lt;/p&gt;
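&lt;p&gt;Using the CPU output as a reference implies some way of comparing two rendered buffers. The sketch below shows one plausible metric: the fraction of RGBA pixels whose channels differ by more than a small tolerance. The function &lt;code&gt;diffFraction&lt;/code&gt;, the per-channel tolerance, and the metric itself are assumptions for illustration; the exact comparison used by the project's golden tests is not shown in this article.&lt;/p&gt;

```go
package main

import "fmt"

// absDiff returns |x - y| for two 8-bit channel values.
func absDiff(x, y uint8) uint8 {
	if x > y {
		return x - y
	}
	return y - x
}

// diffFraction returns the share of RGBA pixels where any channel differs
// by more than tol. Both buffers must have the same length, a multiple of 4.
func diffFraction(cpu, gpu []uint8, tol uint8) float64 {
	pixels := len(cpu) / 4
	if pixels == 0 {
		return 0
	}
	differing := 0
	for p := 0; p != pixels; p++ {
		for ch := 0; ch != 4; ch++ {
			if absDiff(cpu[p*4+ch], gpu[p*4+ch]) > tol {
				differing++
				break
			}
		}
	}
	return float64(differing) / float64(pixels)
}

func main() {
	cpu := []uint8{255, 0, 0, 255, 0, 255, 0, 255, 0, 0, 255, 255, 9, 9, 9, 255}
	gpu := []uint8{255, 0, 0, 255, 0, 254, 0, 255, 0, 0, 255, 255, 12, 9, 9, 255}
	fmt.Println(diffFraction(cpu, gpu, 1)) // 0.25
}
```

&lt;p&gt;A test could then fail when the returned fraction exceeds a chosen budget, leaving room for small rounding differences between the two execution paths.&lt;/p&gt;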

&lt;h2&gt;
  
  
  Two Rasterizers, Two Targets
&lt;/h2&gt;

&lt;p&gt;tilecompute is not the only Vello algorithm we've ported. gogpu/gg includes &lt;strong&gt;both&lt;/strong&gt; of Vello's rasterization pipelines — each optimized for a different execution target:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;tilecompute&lt;/th&gt;
&lt;th&gt;SparseStripsRasterizer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vello era&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Original (2020–2023)&lt;/td&gt;
&lt;td&gt;New (&lt;a href="https://github.com/linebender/vello/issues/670" rel="noopener noreferrer"&gt;Issue #670&lt;/a&gt;, August 2024)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;vello_shaders/src/cpu/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sparse_strips/vello_cpu/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tile size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16x16 (256 pixels)&lt;/td&gt;
&lt;td&gt;4x4 (16 pixels)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimized for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU compute workgroups&lt;/td&gt;
&lt;td&gt;CPU / SIMD registers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key insight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;256 pixels = GPU workgroup size&lt;/td&gt;
&lt;td&gt;16 u8 pixels = one 128-bit SSE register&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data flow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monoid prefix sums → PTCL&lt;/td&gt;
&lt;td&gt;Sort by Y,X → strip rendering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Package&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/gpu/tilecompute/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;internal/gpu/sparse_strips.go&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why 16x16 for GPU?&lt;/strong&gt; GPU compute shaders process tiles in parallel workgroups. 256 pixels per tile matches the typical workgroup size (256 threads), giving each thread exactly one pixel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why 4x4 for CPU?&lt;/strong&gt; SIMD instructions operate on 128-bit registers. 16 pixels of u8 coverage data fit into a single SSE register, enabling vectorized operations across an entire tile at once — the same approach Intel used in Larrabee.&lt;/p&gt;
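&lt;p&gt;The trade-off between the two tile sizes shows up directly in the tile count. As a quick sanity check, this sketch counts the tiles needed to cover an 800×600 canvas at each size; &lt;code&gt;tileCount&lt;/code&gt; is a hypothetical helper, not part of the library.&lt;/p&gt;

```go
package main

import "fmt"

// tileCount returns how many size-by-size tiles cover a w-by-h canvas,
// rounding partial tiles up (ceiling division).
func tileCount(w, h, size int) int {
	cols := (w + size - 1) / size
	rows := (h + size - 1) / size
	return cols * rows
}

func main() {
	fmt.Println(tileCount(800, 600, 16)) // 1900
	fmt.Println(tileCount(800, 600, 4))  // 30000
}
```

&lt;p&gt;The 16x16 grid needs roughly 16 times fewer tiles, at the price of more pixels to process per tile.&lt;/p&gt;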

&lt;p&gt;Both rasterizers use analytic anti-aliasing, Euler spiral flattening, and support NonZero/EvenOdd fill rules. The difference is purely in how they partition the canvas and process tiles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gogpu/gg Rasterization Engines
    │
    ├── tilecompute (16x16 tiles) — DUAL EXECUTION
    │      ├── CPU: sequential Go (reference + fallback)
    │      └── GPU: 9 WGSL shaders via VelloComputeDispatcher
    │
    └── SparseStripsRasterizer (4×4 tiles) — CPU
           CPU/SIMD-optimized pipeline
           Sort by Y,X + strip rendering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having both means gogpu/gg selects the optimal algorithm for the target: GPU compute dispatches the 16x16 pipeline via WGSL shaders, CPU rendering defaults to the 4x4 pipeline for SIMD efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart Multi-Engine Selection
&lt;/h2&gt;

&lt;p&gt;Having multiple rasterizers raises a question: &lt;strong&gt;who decides which algorithm handles which path?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We analyzed 8 enterprise 2D engines (Skia, Cairo, Blend2D, Vello, Qt, Direct2D, Flutter, SwiftUI) and found that &lt;strong&gt;none of them&lt;/strong&gt; does per-path dynamic algorithm selection. Skia has separate CPU and GPU pipelines but no cross-selection. Vello has a planned &lt;code&gt;vello_api&lt;/code&gt; for CPU/GPU choice that is not yet built. Direct2D recognizes simple shapes but doesn't switch algorithms.&lt;/p&gt;

&lt;p&gt;gogpu/gg is the first 2D graphics library with &lt;strong&gt;systematic multi-factor per-path selection&lt;/strong&gt; across 5 rasterization algorithms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Path arrives at Context.Fill()
    │
    ├── Shape detection → SDF Accelerator (circles, rrects)
    │     GPU SDF or CPU SDF — smoothstep coverage, highest quality
    │
    ├── Adaptive threshold check
    │     │
    │     ├── Below threshold → AnalyticFiller (scanline)
    │     │     Zero tile overhead, O(width × edges)
    │     │
    │     └── Above threshold → AdaptiveFiller
    │           │
    │           ├── &amp;lt; 10K segments → SparseStrips (4×4 tiles)
    │           │     CPU/SIMD-optimized, lower tile overhead
    │           │
    │           └── &amp;gt; 10K segments + large canvas → TileCompute (16×16 tiles)
    │                 GPU workgroup-ready, 16× fewer tiles
    │
    └── GPU Compute → Vello PTCL pipeline (full scene)
          9-stage GPU compute, massively parallel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Coregex Analogy
&lt;/h3&gt;

&lt;p&gt;The pattern is inspired by &lt;a href="https://github.com/coregx/coregex" rel="noopener noreferrer"&gt;coregex&lt;/a&gt; — a Go regex library with &lt;strong&gt;17 strategies&lt;/strong&gt; and an intelligent selector that picks the optimal engine per-pattern. Same idea: analyze the input, pick the optimal engine. Both Go libraries, both first-of-kind multi-engine approaches.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;coregex&lt;/th&gt;
&lt;th&gt;gogpu/gg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Engines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17 regex strategies&lt;/td&gt;
&lt;td&gt;5 rasterization algorithms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Selection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern analysis&lt;/td&gt;
&lt;td&gt;7-dimension path analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Regex pattern&lt;/td&gt;
&lt;td&gt;Vector path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Match result&lt;/td&gt;
&lt;td&gt;Pixel coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Adaptive Threshold
&lt;/h3&gt;

&lt;p&gt;The key insight: &lt;strong&gt;scanline cost grows with width × edge crossings&lt;/strong&gt;, while &lt;strong&gt;tile cost grows with fill area&lt;/strong&gt;. For large shapes, tiles win even when the path has few elements. For tiny shapes (&amp;lt; 32px), scanline always wins.&lt;/p&gt;

&lt;p&gt;The selection between scanline and tile-based rasterizers uses an adaptive threshold derived from the path's bounding box area:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// threshold = clamp(2048/sqrt(bboxArea), 32, 256)&lt;/span&gt;
&lt;span class="c"&gt;//&lt;/span&gt;
&lt;span class="c"&gt;// 100×100 bbox → 2048/100 = 20, raised to the floor of 32 elements (tiles kick in early)&lt;/span&gt;
&lt;span class="c"&gt;// 500×500 bbox → 2048/500 =  4, raised to the floor of 32 elements (tiles kick in early)&lt;/span&gt;
&lt;span class="c"&gt;//  30×30  bbox → always scanline (below bboxMinDimension)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;adaptiveThreshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bboxArea&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bboxArea&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;maxElementThreshold&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2048.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bboxArea&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
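&lt;p&gt;For readers who want to run the function, here is a self-contained version. The &lt;code&gt;clamp&lt;/code&gt; helper and the value of &lt;code&gt;maxElementThreshold&lt;/code&gt; (taken here to equal the clamp ceiling of 256) are assumptions, since neither definition appears in the snippet above.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// maxElementThreshold is an assumed value, matching the clamp ceiling.
const maxElementThreshold = 256

// clamp limits v to the closed range [lo, hi].
func clamp(v, lo, hi int) int {
	if v > hi {
		return hi
	}
	if lo > v {
		return lo
	}
	return v
}

// adaptiveThreshold follows the function above: 2048/sqrt(area), clamped to [32, 256].
func adaptiveThreshold(bboxArea float64) int {
	if !(bboxArea > 0) {
		return maxElementThreshold
	}
	threshold := int(2048.0 / math.Sqrt(bboxArea))
	return clamp(threshold, 32, 256)
}

func main() {
	for _, side := range []float64{4, 16, 32, 100, 500} {
		fmt.Println(side, adaptiveThreshold(side*side))
	}
}
```

&lt;p&gt;With the clamp as written, every bounding box of 64×64 or larger lands on the floor of 32 elements, and only smaller boxes get a higher threshold.&lt;/p&gt;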



&lt;h3&gt;
  
  
  User Override
&lt;/h3&gt;

&lt;p&gt;Auto-selection is the default, but the user always has final say — the same principle as database query hints or GPU driver force flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;dc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Auto (default) — intelligent per-path selection&lt;/span&gt;
&lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRasterizerMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RasterizerAuto&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Force scanline for all paths (debugging, isolation)&lt;/span&gt;
&lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRasterizerMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RasterizerAnalytic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Force 4×4 tiles (benchmarking CPU/SIMD path)&lt;/span&gt;
&lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRasterizerMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RasterizerSparseStrips&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Force 16×16 tiles (benchmarking GPU workgroup path)&lt;/span&gt;
&lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRasterizerMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RasterizerTileCompute&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Force SDF for maximum circle/rrect quality&lt;/span&gt;
&lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRasterizerMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RasterizerSDF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mode is per-Context — different contexts can use different strategies simultaneously. This makes A/B benchmarking trivial: render the same scene with two contexts, compare output and timing.&lt;/p&gt;
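&lt;p&gt;A minimal sketch of such an A/B comparison, using only the standard library. The &lt;code&gt;renderFunc&lt;/code&gt; type and the &lt;code&gt;fillGray&lt;/code&gt;, &lt;code&gt;timed&lt;/code&gt;, and &lt;code&gt;diffPixels&lt;/code&gt; helpers are hypothetical stand-ins, not part of gg; in real code each renderer would draw the same scene through its own context with a different &lt;code&gt;SetRasterizerMode&lt;/code&gt; setting.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"image/draw"
	"time"
)

// renderFunc is a hypothetical stand-in for "render the scene with one context".
type renderFunc func(w, h int) *image.RGBA

// fillGray returns a placeholder renderer that fills the frame with one gray level.
func fillGray(level uint8) renderFunc {
	return func(w, h int) *image.RGBA {
		img := image.NewRGBA(image.Rect(0, 0, w, h))
		src := image.NewUniform(color.RGBA{R: level, G: level, B: level, A: 255})
		draw.Draw(img, img.Bounds(), src, image.Point{}, draw.Src)
		return img
	}
}

// timed runs a renderer once and reports how long it took.
func timed(r renderFunc, w, h int) (*image.RGBA, time.Duration) {
	start := time.Now()
	img := r(w, h)
	return img, time.Since(start)
}

// diffPixels counts pixels that differ between two frames of identical size.
func diffPixels(a, b *image.RGBA) int {
	n, counted := 0, -1
	for i := range a.Pix {
		if a.Pix[i] != b.Pix[i] {
			// Four channel bytes per pixel: count each pixel only once.
			if px := i / 4; px != counted {
				n++
				counted = px
			}
		}
	}
	return n
}

func main() {
	imgA, tA := timed(fillGray(128), 64, 64)
	imgB, tB := timed(fillGray(128), 64, 64)
	fmt.Println("differing pixels:", diffPixels(imgA, imgB))
	fmt.Println("A took", tA, "and B took", tB)
}
```

&lt;p&gt;A single timed run is noisy; for real measurements the same comparison would sit inside a Go benchmark so each variant runs many iterations.&lt;/p&gt;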

&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Go source&lt;/td&gt;
&lt;td&gt;2,878 LOC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test code&lt;/td&gt;
&lt;td&gt;2,194 LOC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WGSL shaders&lt;/td&gt;
&lt;td&gt;1,695 LOC (9 shaders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6,767 LOC&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tile size&lt;/td&gt;
&lt;td&gt;16×16 pixels&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fill rules&lt;/td&gt;
&lt;td&gt;NonZero, EvenOdd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Golden test threshold&lt;/td&gt;
&lt;td&gt;0.15% max pixel difference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
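&lt;p&gt;The two fill rules in the table differ only in how a winding number is turned into coverage. The sketch below shows the general definition; the &lt;code&gt;FillRule&lt;/code&gt; type and &lt;code&gt;Inside&lt;/code&gt; function are illustrative names, not the tilecompute code.&lt;/p&gt;

```go
package main

import "fmt"

// FillRule selects how a winding number maps to "inside" or "outside".
type FillRule int

const (
	NonZero FillRule = iota // filled wherever the winding number is not zero
	EvenOdd                 // filled wherever the winding number is odd
)

// Inside reports whether a sample with the given winding number is filled.
func Inside(winding int, rule FillRule) bool {
	switch rule {
	case EvenOdd:
		return winding%2 != 0
	default:
		return winding != 0
	}
}

func main() {
	for _, w := range []int{0, 1, 2, -1} {
		fmt.Println("winding", w, "NonZero:", Inside(w, NonZero), "EvenOdd:", Inside(w, EvenOdd))
	}
}
```

&lt;p&gt;A winding number of 2, as in the overlap of two shapes drawn in the same direction, is the case where the rules disagree: NonZero fills it and EvenOdd leaves a hole.&lt;/p&gt;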

&lt;p&gt;The 9 WGSL compute shaders are embedded into &lt;code&gt;VelloComputeDispatcher&lt;/code&gt; via &lt;code&gt;//go:embed&lt;/code&gt;, compiled into GPU compute pipelines, and dispatched with &lt;code&gt;pass.Dispatch()&lt;/code&gt;. The same algorithm runs on both the CPU (reference/fallback) and the GPU (parallel compute).&lt;/p&gt;

&lt;h2&gt;
  
  
  Part of a Larger Ecosystem
&lt;/h2&gt;

&lt;p&gt;tilecompute is the rasterization core of &lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;, which is part of a &lt;strong&gt;466K+ line&lt;/strong&gt; Pure Go GPU computing ecosystem:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D graphics library&lt;/td&gt;
&lt;td&gt;186K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU&lt;/td&gt;
&lt;td&gt;110K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;gogpu/naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader compiler&lt;/td&gt;
&lt;td&gt;61K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;gogpu/ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GUI widget toolkit&lt;/td&gt;
&lt;td&gt;61K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;GPU framework&lt;/td&gt;
&lt;td&gt;40K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The default stack is &lt;strong&gt;zero CGO, Pure Go&lt;/strong&gt; — from shader compilation to GPU command submission. But gogpu also supports a &lt;strong&gt;Rust backend&lt;/strong&gt; via &lt;a href="https://github.com/go-webgpu/webgpu" rel="noopener noreferrer"&gt;go-webgpu&lt;/a&gt; FFI to wgpu-native for maximum GPU performance when needed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Build&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Pure Go&lt;/strong&gt; (default)&lt;/td&gt;
&lt;td&gt;gogpu/wgpu → Vulkan/Metal/DX12/GLES&lt;/td&gt;
&lt;td&gt;&lt;code&gt;go build&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero dependencies, easy cross-compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Rust FFI&lt;/strong&gt; (opt-in)&lt;/td&gt;
&lt;td&gt;go-webgpu → wgpu-native&lt;/td&gt;
&lt;td&gt;&lt;code&gt;go build -tags rust&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum GPU performance, production&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both backends use the same gg API — the choice is transparent to application code. gg doesn't know or care which WebGPU implementation is underneath; it talks to &lt;code&gt;hal.Queue&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When GPU acceleration is available, gg uses the registered WebGPU backend (Pure Go or Rust) with support for Vulkan, Metal, DX12, and OpenGL ES. When it's not — tilecompute and the CPU rasterizer handle everything.&lt;/p&gt;
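&lt;p&gt;The registration-with-fallback idea can be sketched in a few lines. Everything here is hypothetical: &lt;code&gt;Backend&lt;/code&gt;, &lt;code&gt;Registry&lt;/code&gt;, &lt;code&gt;cpuFallback&lt;/code&gt;, and &lt;code&gt;fakeGPU&lt;/code&gt; are illustrative names and do not mirror gg's actual types or &lt;code&gt;hal.Queue&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the interface a renderer talks to.
type Backend interface {
	Name() string
}

// cpuFallback is always available and needs no registration.
type cpuFallback struct{}

func (cpuFallback) Name() string { return "cpu" }

// fakeGPU simulates a GPU backend bound to one graphics API.
type fakeGPU struct{ api string }

func (g fakeGPU) Name() string { return "gpu/" + g.api }

// Registry holds at most one registered GPU backend.
type Registry struct {
	gpu Backend
}

// Register installs a GPU backend, as an opt-in package might do at init time.
func (r *Registry) Register(b Backend) { r.gpu = b }

// Active returns the registered backend, or the CPU fallback when none exists.
func (r *Registry) Active() Backend {
	if r.gpu != nil {
		return r.gpu
	}
	return cpuFallback{}
}

func main() {
	var reg Registry
	fmt.Println(reg.Active().Name()) // prints "cpu"
	reg.Register(fakeGPU{api: "vulkan"})
	fmt.Println(reg.Active().Name()) // prints "gpu/vulkan"
}
```

&lt;p&gt;Callers only ever see &lt;code&gt;Active()&lt;/code&gt;, which is why swapping the implementation underneath does not touch application code.&lt;/p&gt;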

&lt;h2&gt;
  
  
  Acknowledgments
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://github.com/linebender/vello" rel="noopener noreferrer"&gt;Vello&lt;/a&gt; team and &lt;a href="https://raphlinus.github.io/" rel="noopener noreferrer"&gt;Raph Levien&lt;/a&gt; for the tile rasterization pipeline and Euler spiral flattening research&lt;/li&gt;
&lt;li&gt;The variable naming in tilecompute intentionally matches Vello's Rust originals for easy cross-reference&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package&lt;/strong&gt;: &lt;code&gt;internal/gpu/tilecompute/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vello&lt;/strong&gt;: &lt;a href="https://github.com/linebender/vello" rel="noopener noreferrer"&gt;linebender/vello&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raph Levien's blog&lt;/strong&gt;: &lt;a href="https://raphlinus.github.io/" rel="noopener noreferrer"&gt;raphlinus.github.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discussion&lt;/strong&gt;: &lt;a href="https://github.com/orgs/gogpu/discussions/18" rel="noopener noreferrer"&gt;Join the conversation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/gogpu/gg@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>go</category>
      <category>graphics</category>
      <category>gpu</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Pure Go GUI Toolkit 2026 — 425K LOC Ecosystem, Zero CGO, WebGPU (gogpu/ui)</title>
      <dc:creator>Andrey Kolkov</dc:creator>
      <pubDate>Sat, 21 Feb 2026 13:38:23 +0000</pubDate>
      <link>https://dev.to/kolkov/pure-go-gui-toolkit-2026-425k-loc-ecosystem-zero-cgo-webgpu-gogpuui-5aop</link>
      <guid>https://dev.to/kolkov/pure-go-gui-toolkit-2026-425k-loc-ecosystem-zero-cgo-webgpu-gogpuui-5aop</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series: Building Go's GPU Ecosystem&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-a-pure-go-graphics-library-for-gpu-programming-2j5d"&gt;GoGPU: A Pure Go Graphics Library&lt;/a&gt; — Project announcement&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-from-idea-to-100k-lines-in-two-weeks-building-gos-gpu-ecosystem-3b2"&gt;From Idea to 100K Lines in Two Weeks&lt;/a&gt; — The journey&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/pure-go-2d-graphics-library-with-gpu-acceleration-introducing-gogpugg-538h"&gt;Pure Go 2D Graphics with GPU Acceleration&lt;/a&gt; — Introducing gg&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gpu-compute-shaders-in-pure-go-gogpugg-v0150-1cjk"&gt;GPU Compute Shaders in Pure Go&lt;/a&gt; — Compute pipelines&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/go-126-meets-2026-with-a-professional-graphics-ecosystem-9g8"&gt;Go 1.26 Meets 2026&lt;/a&gt; — Ecosystem overview&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpugg-enterprise-2d-graphics-library-in-pure-go-1931"&gt;Enterprise 2D Graphics Library&lt;/a&gt; — gg architecture&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-enterprise-architecture-cross-package-gpu-integration-with-gpucontext-332"&gt;Cross-Package GPU Integration&lt;/a&gt; — gpucontext&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kolkov/gogpu-unified-2d3d-graphics-integration-in-pure-go-gg3"&gt;Unified 2D/3D Graphics Integration&lt;/a&gt; — gg + gogpu&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Complete, Focus on GUI&lt;/strong&gt; ← You are here&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Foundation is Done
&lt;/h2&gt;

&lt;p&gt;Three months ago, GoGPU was an idea. Today it's &lt;strong&gt;425,000+ lines of Pure Go&lt;/strong&gt; — a complete GPU computing stack with a shader compiler, WebGPU implementation, 2D graphics library, and application framework. All without CGO.&lt;/p&gt;

&lt;p&gt;The core architecture that powers everything — from shader compilation to pixel output — is &lt;strong&gt;production-ready&lt;/strong&gt;. And that means it's time to build what this was all leading to: &lt;strong&gt;a real GUI toolkit for Go&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where We Are: The Ecosystem at a Glance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Active Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/ui" rel="noopener noreferrer"&gt;gogpu/ui&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GUI Toolkit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Phase 2 Beta&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Foundation (Maintenance Mode)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg" rel="noopener noreferrer"&gt;gogpu/gg&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2D Graphics (Canvas, GPU accel, text)&lt;/td&gt;
&lt;td&gt;167K&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gg/releases/tag/v0.29.0" rel="noopener noreferrer"&gt;v0.29.0&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu" rel="noopener noreferrer"&gt;gogpu/wgpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Pure Go WebGPU (Vulkan/DX12/Metal/GLES)&lt;/td&gt;
&lt;td&gt;105K&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/wgpu/releases/tag/v0.16.9" rel="noopener noreferrer"&gt;v0.16.9&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga" rel="noopener noreferrer"&gt;gogpu/naga&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Shader Compiler (WGSL → SPIR-V/MSL/GLSL/HLSL)&lt;/td&gt;
&lt;td&gt;54K&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/naga/releases/tag/v0.14.1" rel="noopener noreferrer"&gt;v0.14.1&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu" rel="noopener noreferrer"&gt;gogpu/gogpu&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Application Framework (windowing, input)&lt;/td&gt;
&lt;td&gt;37K&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/gogpu/gogpu/releases/tag/v0.20.0" rel="noopener noreferrer"&gt;v0.20.0&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ 4 more&lt;/td&gt;
&lt;td&gt;gpucontext, gputypes, gg-pdf, gg-svg&lt;/td&gt;
&lt;td&gt;9K&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The foundation libraries are in maintenance mode: issues, feature requests, bug fixes, and performance improvements are handled as they come in. Their architecture is settled. Our full energy goes into &lt;strong&gt;gogpu/ui&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Build Another GUI Toolkit?
&lt;/h2&gt;

&lt;p&gt;Go's GUI landscape has options — Fyne, Gio, Wails. We respect all of them. But we're solving a different problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We want Go to power the applications that currently require Electron, Qt, or native platform toolkits.&lt;/strong&gt; IDEs. Design tools. CAD. Professional dashboards.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero CGO&lt;/strong&gt; — &lt;code&gt;go build&lt;/code&gt; and it works, everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebGPU rendering&lt;/strong&gt; — native GPU acceleration on all platforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise layout&lt;/strong&gt; — Flexbox, Grid, docking, virtualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reactive state&lt;/strong&gt; — Signals-based data binding (not callbacks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt; — ARIA roles from day one, not bolted on later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable design systems&lt;/strong&gt; — Material 3 today, Fluent or Cupertino tomorrow&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture: Three Layers, Clean Separation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 3b: Design Systems ── theme/material3/ (HCT color science)
Layer 3a: Generic Widgets ── core/button/, core/checkbox/, core/radio/
Layer 2:  CDK (headless)  ── Content[C] polymorphic pattern
Layer 1:  Foundation      ── widget/, event/, geometry/, layout/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;widgets don't know their design system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;button.Button&lt;/code&gt; defines &lt;em&gt;behavior&lt;/em&gt; — click handling, keyboard activation, focus management. How it &lt;em&gt;looks&lt;/em&gt; is determined by a &lt;code&gt;Painter&lt;/code&gt; interface that the design system implements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// The widget defines behavior&lt;/span&gt;
&lt;span class="n"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Submit"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VariantOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Filled&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Material 3 painter handles appearance&lt;/span&gt;
&lt;span class="c"&gt;// (injected via ThemeProvider, not imported by the widget)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core widgets are &lt;strong&gt;design-system-agnostic&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Swapping from Material 3 to Fluent means changing one import&lt;/li&gt;
&lt;li&gt;Community can create custom design systems without forking widgets&lt;/li&gt;
&lt;/ul&gt;
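&lt;p&gt;A self-contained sketch of that split between behavior and appearance. The &lt;code&gt;State&lt;/code&gt; struct, the &lt;code&gt;PaintButton&lt;/code&gt; method, and both painter types are assumptions made for illustration; the real &lt;code&gt;Painter&lt;/code&gt; interface in gogpu/ui draws to a canvas and may look quite different.&lt;/p&gt;

```go
package main

import "fmt"

// State is the behavior-side data a painter may read (hypothetical shape).
type State struct {
	Label string
}

// Painter is implemented by a design system, never by the widget itself.
type Painter interface {
	PaintButton(s State) string // returns a description instead of drawing
}

// Button owns behavior only: its state and the click callback.
type Button struct {
	state   State
	onClick func()
	painter Painter
}

// NewButton wires a label, an injected painter, and a click handler together.
func NewButton(label string, p Painter, onClick func()) *Button {
	b := new(Button)
	b.state.Label = label
	b.painter = p
	b.onClick = onClick
	return b
}

// Click runs the behavior; it does not know how the button looks.
func (b *Button) Click() {
	if b.onClick != nil {
		b.onClick()
	}
}

// Draw delegates appearance to whichever painter was injected.
func (b *Button) Draw() string { return b.painter.PaintButton(b.state) }

type materialPainter struct{}

func (materialPainter) PaintButton(s State) string {
	return "material3 filled button: " + s.Label
}

type fluentPainter struct{}

func (fluentPainter) PaintButton(s State) string {
	return "fluent button: " + s.Label
}

func main() {
	save := func() { fmt.Println("saved") }
	fmt.Println(NewButton("Submit", materialPainter{}, save).Draw())
	fmt.Println(NewButton("Submit", fluentPainter{}, save).Draw())
}
```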

&lt;h3&gt;
  
  
  Dependency Inversion
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;gogpu/ui&lt;/code&gt; never imports &lt;code&gt;gogpu&lt;/code&gt; directly. Instead, it depends on &lt;em&gt;interfaces&lt;/em&gt; from &lt;code&gt;gpucontext&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ui ──imports──&amp;gt; gpucontext (interfaces: WindowProvider, EventSource)
examples ──imports──&amp;gt; gogpu (concrete implementation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can test the entire widget tree &lt;strong&gt;headlessly&lt;/strong&gt; — no window, no GPU, just unit tests. The toolkit has &lt;strong&gt;97% average test coverage&lt;/strong&gt; across all packages.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Already Working
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 0: Foundation (Complete)
&lt;/h3&gt;

&lt;p&gt;Core infrastructure that every widget builds on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;geometry&lt;/strong&gt; — Point, Size, Rect, Constraints, Insets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;event&lt;/strong&gt; — Mouse, keyboard, wheel, focus events with modifier keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;widget&lt;/strong&gt; — Widget interface with 3-phase lifecycle (Layout → Draw → Event)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;layout&lt;/strong&gt; — CSS Flexbox, VStack/HStack, CSS Grid engines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;render&lt;/strong&gt; — Canvas implementation backed by gogpu/gg&lt;/li&gt;
&lt;/ul&gt;
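&lt;p&gt;As a rough illustration of how the geometry and layout pieces fit together, the sketch below stacks child sizes vertically and clamps the result to a set of constraints. &lt;code&gt;Size&lt;/code&gt;, &lt;code&gt;Constraints&lt;/code&gt;, and &lt;code&gt;VStack&lt;/code&gt; borrow names from the list above, but their fields and signatures are assumptions rather than the gogpu/ui API.&lt;/p&gt;

```go
package main

import "fmt"

// Size is a width and height in pixels.
type Size struct{ W, H int }

// Constraints bound the size a widget may take during layout.
type Constraints struct{ MinW, MaxW, MinH, MaxH int }

// Constrain clamps a desired size into the allowed range.
func (c Constraints) Constrain(s Size) Size {
	return Size{
		W: min(max(s.W, c.MinW), c.MaxW),
		H: min(max(s.H, c.MinH), c.MaxH),
	}
}

// VStack measures children stacked vertically with a gap between them.
func VStack(c Constraints, gap int, children ...Size) Size {
	var total Size
	for i, ch := range children {
		total.W = max(total.W, ch.W) // as wide as the widest child
		total.H += ch.H
		if i > 0 {
			total.H += gap // gaps sit between children only
		}
	}
	return c.Constrain(total)
}

func main() {
	limits := Constraints{MaxW: 200, MaxH: 100}
	fmt.Println(VStack(limits, 12, Size{W: 50, H: 20}, Size{W: 80, H: 30})) // prints "{80 62}"
}
```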

&lt;h3&gt;
  
  
  Phase 1: MVP (Complete)
&lt;/h3&gt;

&lt;p&gt;The basics needed for a working application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;state&lt;/strong&gt; — Reactive signals (Signal, Computed, Effect, Binding, Scheduler) wrapping &lt;a href="https://github.com/coregx/signals" rel="noopener noreferrer"&gt;coregx/signals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a11y&lt;/strong&gt; — Accessibility tree with 35+ ARIA roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;primitives&lt;/strong&gt; — Box, Text, Image display widgets with fluent API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;app&lt;/strong&gt; — Window integration via gpucontext interfaces&lt;/li&gt;
&lt;/ul&gt;
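&lt;p&gt;For readers new to signals, a minimal sketch of the idea follows. It is not the coregx/signals API: &lt;code&gt;Signal&lt;/code&gt;, &lt;code&gt;NewSignal&lt;/code&gt;, &lt;code&gt;Subscribe&lt;/code&gt;, and &lt;code&gt;Computed&lt;/code&gt; are simplified, hypothetical versions with no batching, scheduling, or unsubscription.&lt;/p&gt;

```go
package main

import "fmt"

// Signal holds a value and notifies subscribers when it changes.
type Signal[T comparable] struct {
	value T
	subs  []func(T)
}

// NewSignal creates a signal with an initial value.
func NewSignal[T comparable](v T) *Signal[T] {
	s := new(Signal[T])
	s.value = v
	return s
}

// Get returns the current value.
func (s *Signal[T]) Get() T { return s.value }

// Set stores v and runs subscribers, skipping the work when nothing changed.
func (s *Signal[T]) Set(v T) {
	if v == s.value {
		return
	}
	s.value = v
	for _, fn := range s.subs {
		fn(v)
	}
}

// Subscribe registers fn to run on every change (an "effect" in signal terms).
func (s *Signal[T]) Subscribe(fn func(T)) { s.subs = append(s.subs, fn) }

// Computed derives a new signal that stays in sync with its source.
func Computed[T, U comparable](src *Signal[T], f func(T) U) *Signal[U] {
	out := NewSignal(f(src.Get()))
	src.Subscribe(func(v T) { out.Set(f(v)) })
	return out
}

func main() {
	count := NewSignal(1)
	doubled := Computed(count, func(n int) int { return n * 2 })
	doubled.Subscribe(func(v int) { fmt.Println("doubled is now", v) })
	count.Set(5) // prints "doubled is now 10"
}
```

&lt;p&gt;The difference from plain callbacks is that derived values update themselves: nothing in &lt;code&gt;main&lt;/code&gt; tells &lt;code&gt;doubled&lt;/code&gt; to recompute.&lt;/p&gt;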

&lt;h3&gt;
  
  
  Phase 1.5: Extensibility (Complete)
&lt;/h3&gt;

&lt;p&gt;Enabling the community to build on top:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;registry&lt;/strong&gt; — Widget factory registration with categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;plugin&lt;/strong&gt; — Plugin bundling with dependency resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;theme&lt;/strong&gt; — Base theme system (ColorPalette, Typography, Spacing, Shadows, Radii, Extensions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;layout&lt;/strong&gt; (public) — Custom layout algorithms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Beta (75% Complete)
&lt;/h3&gt;

&lt;p&gt;Interactive widgets and Material Design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;button&lt;/strong&gt; — 4 variants (Filled, Outlined, TextOnly, Tonal), 3 sizes, keyboard activation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;checkbox&lt;/strong&gt; — Checked / unchecked / indeterminate, label support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;radio&lt;/strong&gt; — Radio groups with vertical/horizontal layout, arrow key navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;focus&lt;/strong&gt; — Tab/Shift+Tab navigation, keyboard shortcuts, focus ring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cdk&lt;/strong&gt; — Component Development Kit with Content[C] polymorphic pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;material3&lt;/strong&gt; — HCT color science, 32 color roles, 15 typography roles, painters for all widgets&lt;/li&gt;
&lt;/ul&gt;
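&lt;p&gt;Tab/Shift+Tab navigation comes down to moving an index through the tab order with wrap-around. The &lt;code&gt;FocusRing&lt;/code&gt; type below is a hypothetical sketch of that arithmetic, not the focus manager in gogpu/ui.&lt;/p&gt;

```go
package main

import "fmt"

// FocusRing keeps widget IDs in tab order and tracks the focused one.
type FocusRing struct {
	ids     []string
	current int
}

// NewFocusRing starts with focus on the first ID.
func NewFocusRing(ids ...string) *FocusRing {
	r := new(FocusRing)
	r.ids = ids
	return r
}

// Focused returns the ID that currently has focus, or "" for an empty ring.
func (r *FocusRing) Focused() string {
	if len(r.ids) == 0 {
		return ""
	}
	return r.ids[r.current]
}

// Move shifts focus by step (+1 for Tab, -1 for Shift+Tab), wrapping around.
func (r *FocusRing) Move(step int) string {
	n := len(r.ids)
	if n == 0 {
		return ""
	}
	// Adding n before the final modulo keeps the index non-negative.
	r.current = ((r.current+step)%n + n) % n
	return r.ids[r.current]
}

func main() {
	ring := NewFocusRing("submit", "notify", "theme")
	fmt.Println(ring.Move(+1)) // Tab: prints "notify"
	fmt.Println(ring.Move(-1)) // Shift+Tab: prints "submit"
	fmt.Println(ring.Move(-1)) // wraps around: prints "theme"
}
```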

&lt;p&gt;&lt;strong&gt;Remaining for Phase 2:&lt;/strong&gt; TextField, Dropdown, Slider, Progress indicators, Typography system, Icon system&lt;/p&gt;




&lt;h2&gt;
  
  
  A Working Example
&lt;/h2&gt;

&lt;p&gt;Here's a real application running today — &lt;code&gt;ui/examples/hello&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;gogpuApp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultConfig&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithTitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gogpu/ui — Widget Demo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
        &lt;span class="n"&gt;WithContinuousRender&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c"&gt;// Event-driven: 0% CPU idle&lt;/span&gt;

    &lt;span class="n"&gt;uiApp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithWindowProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPlatformProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithEventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EventSource&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uiApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetRoot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buildUI&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c"&gt;// ... rendering pipeline setup ...&lt;/span&gt;
    &lt;span class="n"&gt;gogpuApp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;buildUI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BoxWidget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;primitives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"gogpu/ui — Widget Demo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
            &lt;span class="n"&gt;FontSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bold&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
            &lt;span class="n"&gt;Color&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;widget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RGBA8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;

        &lt;span class="n"&gt;checkbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;checkbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LabelOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enable notifications"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;checkbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Checked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;checkbox&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OnToggle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checked&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"notifications:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;checked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;

        &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ItemDef&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"light"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Light"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ItemDef&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dark"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Dark"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ItemDef&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Label&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"System"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Selected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DirectionOpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Horizontal&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Padding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
      &lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;widget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RGBA8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;255&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
      &lt;span class="n"&gt;Rounded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ShadowLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven rendering&lt;/strong&gt; — 0% CPU when nothing changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU-direct pipeline&lt;/strong&gt; — widgets render through gg, ggcanvas blits directly to the GPU surface (zero-copy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional options&lt;/strong&gt; — clean construction without builder hell&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fluent styling&lt;/strong&gt; — &lt;code&gt;.Padding().Gap().Background().Rounded()&lt;/code&gt; chains&lt;/li&gt;
&lt;/ul&gt;
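&lt;p&gt;Event-driven rendering can be reduced to a dirty flag: draw only after something has invalidated the frame. The &lt;code&gt;Loop&lt;/code&gt; type and its methods are a hypothetical sketch of that pattern, not how gogpu implements &lt;code&gt;WithContinuousRender(false)&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// Loop draws a frame only after something has invalidated it.
type Loop struct {
	dirty  bool
	Frames int
}

// Invalidate marks the frame stale, e.g. after an input event or state change.
func (l *Loop) Invalidate() { l.dirty = true }

// Tick runs on every wake-up and reports whether it actually drew.
func (l *Loop) Tick(draw func()) bool {
	if !l.dirty {
		return false // nothing changed: skip all rendering work
	}
	draw()
	l.dirty = false
	l.Frames++
	return true
}

func main() {
	var loop Loop
	draw := func() { fmt.Println("frame drawn") }
	loop.Tick(draw)                     // idle: draws nothing
	loop.Invalidate()                   // e.g. the user toggled a checkbox
	loop.Tick(draw)                     // prints "frame drawn"
	loop.Tick(draw)                     // idle again
	fmt.Println("frames:", loop.Frames) // prints "frames: 1"
}
```

&lt;p&gt;In a real event loop the thread would block waiting for the next event instead of polling &lt;code&gt;Tick&lt;/code&gt;, which is what brings idle CPU usage down to zero.&lt;/p&gt;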

&lt;p&gt;The rendering pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Widget tree → render.Canvas (gg) → ggcanvas → GPU surface
                                    ↓
                              Zero CPU readback
                              Direct GPU composition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Near-term: Phase 2 Completion
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Widget&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TextField&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text input with cursor, selection, clipboard&lt;/td&gt;
&lt;td&gt;Next up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dropdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Popup selection with search&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Range input with track and thumb&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Progress&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Determinate and indeterminate indicators&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Phase 3–4: IDE-Class Application Shell
&lt;/h3&gt;

&lt;p&gt;This is where it gets ambitious. The target is a &lt;strong&gt;GoLand-class IDE layout&lt;/strong&gt;, the kind of application shell that currently requires Electron, Qt, or platform-native code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────┐
│  Toolbar                                                 │
├────────┬─────────────────────────────────┬───────────────┤
│        │  Tab1 │ Tab2 │ Tab3             │               │
│  Left  │─────────────────────────────────│  Right Panel  │
│  Panel │                                 │  (Inspector,  │
│  (Tree,│     Main Editor Area            │   Properties) │
│  Files)│                                 │               │
│        │                                 │               │
├────────┴──────────────────┬──────────────┴───────────────┤
│  Terminal │ Problems │ Git│                              │
│  Bottom Panel (resizable, collapsible)                   │
└──────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The building blocks for this, in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ScrollView&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Smooth scrolling with inertia&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TabView&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Editor-style tabs with close buttons, reordering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SplitView&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Resizable horizontal/vertical splits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dialog/Modal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Popup windows with focus trapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Popover/Tooltip&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Context menus, hover tooltips&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VirtualizedList/Grid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Render 100K items (file trees, logs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Animation Engine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Spring physics, transitions, easing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docking System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Draggable panels — left, right, bottom, floating&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drag &amp;amp; Drop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Cross-widget, cross-window DnD (tab reordering, panel docking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fluent Theme&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Windows-native look&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cupertino Theme&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;macOS-native look&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;i18n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;RTL text, locale-aware formatting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
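
&lt;p&gt;For the VirtualizedList row, rendering 100K items generally means drawing only the rows that intersect the viewport. The sketch below shows that windowing arithmetic under the simplifying assumption of a fixed row height. &lt;code&gt;VisibleRange&lt;/code&gt; is a hypothetical helper for this post, not a gogpu/ui function, and it relies on the built-in &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; from Go 1.21 and later.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// VisibleRange returns the half-open index range [first, last) of rows
// that intersect the viewport. It assumes every row has the same height;
// rowH must be positive and count must not be negative.
func VisibleRange(scrollY, viewportH, rowH float64, count int) (first, last int) {
	if count == 0 || rowH == 0 {
		return 0, 0 // nothing to draw, and avoids dividing by zero
	}
	first = int(math.Floor(scrollY / rowH))
	last = int(math.Ceil((scrollY + viewportH) / rowH))
	// Clamp both ends to the valid index range.
	last = min(max(last, 0), count)
	first = min(max(first, 0), last)
	return first, last
}

func main() {
	first, last := VisibleRange(1000, 600, 20, 100000)
	fmt.Println("rows to draw:", first, "to", last-1)
}
```

&lt;p&gt;With a 600 px viewport and 20 px rows, only about 30 of the 100,000 rows are laid out and drawn per frame, regardless of the list length.&lt;/p&gt;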

&lt;p&gt;The docking system is the crown jewel: panels that snap to the left, right, or bottom edge, collapse to icon strips, and can be dragged between positions. This is the behavior you see in GoLand, VS Code, or Photoshop, but built entirely in Go and rendered entirely on the GPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Libraries: Maintenance Mode
&lt;/h2&gt;

&lt;p&gt;With the GUI toolkit as the primary focus, the lower-level libraries shift to maintenance:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Focus Going Forward&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;gg&lt;/strong&gt; (2D graphics)&lt;/td&gt;
&lt;td&gt;Bug fixes (#95 AA quality, #72 circle artifact), GPU pattern support, performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;wgpu&lt;/strong&gt; (WebGPU)&lt;/td&gt;
&lt;td&gt;Community bug reports, platform-specific fixes, new HAL features as needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;naga&lt;/strong&gt; (shaders)&lt;/td&gt;
&lt;td&gt;HLSL matrix fix (DX12 text rendering), new shader targets on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;gogpu&lt;/strong&gt; (framework)&lt;/td&gt;
&lt;td&gt;Community issues (#82 NVIDIA crash, #89 macOS Tahoe), WASM support (#70)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This doesn't mean they're done — it means they're &lt;strong&gt;stable enough to build on&lt;/strong&gt; while we focus our development time on the toolkit that will use them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;425K+ lines of Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI toolkit alone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55K lines, 208 Go files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test functions (UI)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,400+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Test coverage (UI)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;97% average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU backends&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 (Vulkan, DX12, Metal, GLES, Software)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shader targets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 (SPIR-V, MSL, GLSL, HLSL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platforms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Windows, Linux (X11/Wayland), macOS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CGO required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and run the hello example&lt;/span&gt;
git clone https://github.com/gogpu/ui
&lt;span class="nb"&gt;cd &lt;/span&gt;ui/examples/hello
go run &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requirements: Go 1.25+ and a GPU with Vulkan, DX12, Metal, or GLES support.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get Involved
&lt;/h2&gt;

&lt;p&gt;We're building this in the open and we want your input:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/orgs/gogpu/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt;&lt;/strong&gt; — Feature requests, architecture feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/gogpu/ui/issues" rel="noopener noreferrer"&gt;gogpu/ui Issues&lt;/a&gt;&lt;/strong&gt; — Bug reports, widget requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/gogpu" rel="noopener noreferrer"&gt;gogpu Organization&lt;/a&gt;&lt;/strong&gt; — All repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The foundation is solid. Now comes the fun part — building the toolkit that makes Go a first-class citizen for desktop applications.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The GoGPU ecosystem is MIT-licensed. Contributions welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>gui</category>
      <category>opensource</category>
      <category>webgpu</category>
    </item>
  </channel>
</rss>
