"Go doesn't have a real graphics ecosystem." — We've heard this for years. So we built one: 636K lines of Pure Go, five GPU backends, zero CGO. And now we've done something that even Rust's naga shader compiler hasn't managed in six years.
At the end of last year, we introduced GoGPU to the Go community — greeting everyone with a New Year's gift: a professional graphics ecosystem written entirely in Go. Four months later, that ecosystem just got its most audacious component: a Pure Go DXIL generator that compiles shaders directly to DirectX 12 bytecode, without any external compiler.
This is the story of how a performance optimization rabbit hole led us to write our own LLVM 3.7 bitcode emitter.
The Problem That Wouldn't Go Away
Every DirectX 12 application needs compiled shaders. The standard pipeline looks like this:
WGSL → HLSL text → FXC (d3dcompiler_47.dll) → DXBC bytecode → GPU
That middle step — d3dcompiler_47.dll — is a 4.3 MB Microsoft DLL that you load at runtime. It works. It's battle-tested. And it was our bottleneck.
We built gogpu — a 636K LOC Pure Go GPU ecosystem. Zero CGO. Every pixel rendered through Go code. Our shader compiler naga translates WGSL to SPIR-V, MSL, GLSL, HLSL, and now DXIL, with 100% Rust naga parity across all text/binary backends — 145K LOC of Pure Go.
But on Windows with DirectX 12, we had a dirty secret: d3dcompiler_47.dll. The one external dependency in our "zero external dependencies" stack.
The Optimization Rabbit Hole
We tried everything to make the FXC path fast enough to forget about:
Shader cache — Hash the HLSL, cache the DXBC. First render is slow, subsequent ones instant. Works great until your shader variants explode.
In-memory compilation pool — Pre-compile common shaders at startup. Reduces cold-start latency. But we still load the DLL.
Pipeline State Object caching — We planned disk caching of PSO blobs (GetCachedBlob → os.UserCacheDir()). We wrote the task, designed the key format, specified the invalidation strategy. Then we never shipped it — because we pivoted to eliminating FXC entirely.
naga HLSL fix — FXC was choking on naga-generated HLSL: a (Type[256])0 bulk zero-initialization expanded to a 12KB inline constructor that FXC took 22 seconds to compile. We initially thought our Go naga had a bug, so we tested the same shader through Rust naga + FXC — same 22 seconds. It wasn't our implementation; FXC genuinely can't handle giant inline constructors. The fix was in naga (per-element loop instead of inline constructor, v0.16.3) — 330× faster. But even after fixing the worst case, every shader still went through an external DLL.
Every optimization made the same path faster. None of them removed the path.
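The first of those attempts, the shader cache, is worth a closer look because it captures why the whole approach was a dead end: it makes the FXC path cheaper without removing it. A minimal sketch of the idea, with hypothetical names (a real key would also cover entry point and target profile):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// shaderCache memoizes compiled bytecode keyed by a hash of the HLSL
// source plus compile flags, so each shader variant is compiled once.
type shaderCache struct {
	mu      sync.Mutex
	entries map[string][]byte
}

func newShaderCache() *shaderCache {
	return &shaderCache{entries: make(map[string][]byte)}
}

// cacheKey hashes flags and source together; a NUL separator keeps
// ("a", "bc") and ("ab", "c") from colliding.
func cacheKey(hlsl, flags string) string {
	h := sha256.Sum256([]byte(flags + "\x00" + hlsl))
	return hex.EncodeToString(h[:])
}

// GetOrCompile returns cached bytecode, invoking compile only on a miss.
func (c *shaderCache) GetOrCompile(hlsl, flags string, compile func() []byte) []byte {
	key := cacheKey(hlsl, flags)
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.entries[key]; ok {
		return b
	}
	b := compile()
	c.entries[key] = b
	return b
}

func main() {
	c := newShaderCache()
	calls := 0
	compile := func() []byte { calls++; return []byte{0x44, 0x58} }
	c.GetOrCompile("float4 main() : SV_Target { return 0; }", "-O3", compile)
	c.GetOrCompile("float4 main() : SV_Target { return 0; }", "-O3", compile)
	fmt.Println("compile calls:", calls) // prints "compile calls: 1"
}
```

The second lookup never touches the compiler, which is exactly the problem: the compiler is still there for every first lookup.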
At this point we had a decision to make. The obvious next step was adding DXC (dxcompiler.dll) as an opt-in replacement for FXC — newer, faster, supports Shader Model 6.0+. We even created the task for it.
Then, while reviewing the plan, a simple question came up: "Can we write our own DXC in Go?"
The initial answer was: "No. DXC is 500K lines of C++, a fork of LLVM 3.7. That's not something you casually rewrite."
The response: "Not rewrite DXC. Generate DXIL directly. Skip HLSL entirely."
That changed everything. DXC takes HLSL text and produces DXIL. We already have our own IR (naga IR). Why translate IR → HLSL text → parse HLSL → produce DXIL, when we could go IR → DXIL directly?
DXC is a compiler from one language (HLSL) to another (DXIL). We don't need a compiler — we need an emitter. And emitting is much simpler than compiling.
What is DXIL, and Why Nobody Writes It
DXIL (DirectX Intermediate Language) is what FXC and DXC produce. It's LLVM 3.7 bitcode — the same IR format that LLVM uses internally — wrapped in a DXBC container with DirectX-specific metadata and dx.op intrinsic calls.
The reason nobody writes DXIL directly is simple: it's hard.
- LLVM 3.7 bitcode is a binary format with variable-width encoding (VBR), nested blocks, abbreviation records, and forward references. Not something you casually emit.
- DXIL semantics require dx.op intrinsic calls instead of normal LLVM instructions for I/O, math, and resource access. 165+ opcodes.
- The DXBC container needs input/output signatures (ISG1/OSG1), pipeline state validation (PSV0), feature flags (SFI0), and a cryptographic hash.
- Validation — until January 2025, every DXIL module needed to be signed by dxil.dll. Microsoft's BYPASS hash sentinel changed this.
Rust's naga shader compiler has had an open issue for a DXIL backend since 2020. Six years later, it's still marked as future work. The Rust team called it "a lot of work and a long way away."
Only one project has done it outside of LLVM: Mesa (the open-source OpenGL/Vulkan driver stack). Their DXIL compiler is ~21,000 lines of C/H, written by engineers from Microsoft and Collabora over 3+ years. They wrote their own LLVM 3.7 bitcode writer from scratch — proving it's possible without linking LLVM.
We cloned Mesa's src/microsoft/compiler/ into our reference folder, studied dxil_module.c (the bitcode writer, ~3K lines of C), and mapped out every block type, record format, and abbreviation. Not to copy — to understand the format deeply enough to write our own.
Then came the final piece: in January 2025, Microsoft open-sourced the DXIL validator hash and introduced a BYPASS sentinel — a magic value in the hash field that tells D3D12 "this shader wasn't signed by dxil.dll, but trust it anyway." Without this, our DXIL wouldn't run without Developer Mode on Windows. With it, any third-party DXIL generator can produce shaders that run on retail Windows.
We weren't afraid of binary formats. Before gogpu, we built scigolib/hdf5 — a Pure Go implementation of HDF5, NASA's hierarchical data format with its own B-tree indices, chunked storage, and compression pipelines. After parsing HDF5 superblocks and fractal heaps in pure Go, LLVM bitcode felt almost... reasonable. We also built coregx/coregex — a regex engine from scratch. Binary formats, state machines, low-level encoding — this is what we do.
We spent weeks studying the DXIL format specifically. Reading the DXIL spec, the LLVM 3.7 bitcode reference, Mesa's implementation, Microsoft's DXC headers, the validator hash proposal. We wrote a detailed architecture document comparing four implementation options. We mapped every dx.op opcode we'd need for vertex and fragment shaders. We designed the package structure, the phased rollout plan, the testing strategy.
Only after all that research did we write the first line of code.
Building LLVM 3.7 Bitcode in Pure Go
The first challenge was the bitcode writer. LLVM 3.7's format is... unique:
Bits, not bytes. Variable-width encoding. Nested blocks with forward-declared sizes. Abbreviation records that compress common patterns. A module structure that interleaves types, constants, functions, and metadata in a specific order.
We wrote a bit-level writer from scratch:
```go
// VBR (variable bit-rate) encoding — like protobuf varint, but bit-aligned
func (w *Writer) WriteVBR(value uint64, width uint) {
	tag := uint32(1) << (width - 1) // continuation bit
	mask := tag - 1                 // data bits per chunk
	for value > uint64(mask) {
		chunk := uint32(value&uint64(mask)) | tag
		value >>= width - 1
		w.WriteBits(chunk, width)
	}
	w.WriteBits(uint32(value), width)
}
```
Then the module serializer: TYPE_BLOCK, CONSTANTS_BLOCK, FUNCTION_BLOCK, METADATA_BLOCK — each with its own record formats, abbreviation IDs, and ordering constraints.
The DXBC container assembles the bitcode with signatures and metadata:
```
[DXBC Header]  32 bytes (magic + digest + version + size + part count)
[SFI0]         Shader feature flags (64-bit bitmask)
[DXIL]         Program header + LLVM 3.7 bitcode
[ISG1]         Input signature (semantic names, registers)
[OSG1]         Output signature
[PSV0]         Pipeline state validation
[HASH]         BYPASS sentinel (no dxil.dll needed!)
```
The DXIL Difference: Scalarized Vectors
Here's something that makes DXIL fundamentally different from SPIR-V, MSL, GLSL, and HLSL: DXIL has no native vector types.
In SPIR-V, you write OpCompositeConstruct %vec4 %x %y %z %w.
In HLSL, you write float4(x, y, z, w).
In DXIL, there are no vectors. A vec4<f32> becomes four separate float values, tracked independently through every operation.
```go
// Our emitter tracks per-component value IDs
type Emitter struct {
	exprValues     map[ir.ExpressionHandle]int   // scalar value IDs
	exprComponents map[ir.ExpressionHandle][]int // per-component IDs for vectors
}

// dot(a, b) becomes:
//   %r = call float @dx.op.dot3.f32(i32 55, float %ax, float %ay, float %az,
//                                   float %bx, float %by, float %bz)
```
This means every vector operation — dot product, cross product, normalize, swizzle — must be decomposed into scalar operations. Our existing backends (SPIR-V, MSL, GLSL, HLSL) all work with native vectors. DXIL required a completely different approach.
Cross product becomes 6 multiplies and 3 subtracts:
```go
// cross(a, b) = vec3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)
cx := fsub(fmul(ay, bz), fmul(az, by))
cy := fsub(fmul(az, bx), fmul(ax, bz))
cz := fsub(fmul(ax, by), fmul(ay, bx))
```
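Not everything gets more expensive, though. Swizzles fall out of this representation almost for free: they permute value IDs without emitting a single instruction. A sketch, assuming the per-lane slice of LLVM value IDs from the Emitter above (function name hypothetical):

```go
package main

import "fmt"

// swizzle reorders per-component value IDs; no instructions are emitted.
// pattern holds source lane indices, e.g. .zxy → [2, 0, 1].
func swizzle(components []int, pattern []int) []int {
	out := make([]int, len(pattern))
	for i, lane := range pattern {
		out[i] = components[lane]
	}
	return out
}

func main() {
	a := []int{10, 11, 12, 13}              // value IDs for a.x, a.y, a.z, a.w
	fmt.Println(swizzle(a, []int{2, 0, 1})) // a.zxy → prints "[12 10 11]"
}
```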
Control Flow: Basic Blocks, Not Nesting
Another fundamental difference: DXIL uses LLVM-style basic blocks with explicit branches, not the nested text structure of HLSL/GLSL/MSL.
Our text backends emit:
```hlsl
if (cond) {
    // accept
} else {
    // reject
}
```
DXIL emits:
```llvm
entry:
  br i1 %cond, label %then, label %else
then:
  ; accept statements
  br label %merge
else:
  ; reject statements
  br label %merge
merge:
  ; continues
```
Loops use back-edge branches to a header block. Break and continue jump to specific target blocks tracked via a loop context stack.
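That loop context stack can be sketched like this (names hypothetical; the real emitter tracks LLVM basic-block IDs):

```go
package main

import "fmt"

// loopCtx records the branch targets a loop body needs: `continue`
// jumps back to the header block, `break` jumps to the merge block.
type loopCtx struct {
	headerBlock int // target of `continue` and the back edge
	mergeBlock  int // target of `break`
}

type emitter struct {
	loops []loopCtx // innermost loop sits at the top of the stack
}

func (e *emitter) pushLoop(header, merge int) {
	e.loops = append(e.loops, loopCtx{header, merge})
}

func (e *emitter) popLoop() {
	e.loops = e.loops[:len(e.loops)-1]
}

// breakTarget is the block a `break` in the current loop branches to.
func (e *emitter) breakTarget() int { return e.loops[len(e.loops)-1].mergeBlock }

// continueTarget is the block a `continue` branches to.
func (e *emitter) continueTarget() int { return e.loops[len(e.loops)-1].headerBlock }

func main() {
	e := &emitter{}
	e.pushLoop(2, 5) // outer loop: header %2, merge %5
	e.pushLoop(7, 9) // inner loop: header %7, merge %9
	fmt.Println(e.breakTarget(), e.continueTarget()) // prints "9 7"
	e.popLoop()
	fmt.Println(e.breakTarget(), e.continueTarget()) // prints "5 2"
}
```

Nested loops just push and pop: a break inside the inner loop can never accidentally escape to the outer loop's merge block.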
We studied Mesa's nir_to_dxil.c for the correct patterns, then cross-referenced with our own SPIR-V backend (which also uses structured control flow with merge blocks) to get the Go implementation right.
What Our Backends Taught Us
This is the part that surprised us most. We have four mature backends (SPIR-V, MSL, GLSL, HLSL) totaling ~68K LOC. They all solve the same IR walking problems:
- Expression dispatch and caching
- Type resolution through pointer chains
- Statement nesting and control flow
- Resource binding and I/O handling
Before implementing each DXIL feature, we checked how our existing backends handled it:
| Feature | We checked | What we learned |
|---|---|---|
| Multi-arg math | HLSL writeExpressionKind | Arg/Arg1/Arg2/Arg3 dispatch pattern |
| Type casts | SPIR-V emitAs | src/dst kind+width → opcode selection |
| Control flow | HLSL writeIfStatement | Condition, blocks, merge point structure |
| Store/Load | SPIR-V emitStore | Pointer chain resolution |
| Struct access | MSL writeAccessChain | Recursive descent through members |
The DXIL backend is different (scalarized, basic blocks, dx.op intrinsics), but the IR patterns are the same. Our existing codebase was its own best reference.
The Moment of Truth
After all the research, all the planning, all the implementation — ~12,500 lines of Go code, 190 tests, weeks of work — came the moment that mattered:
GOGPU_DX12_DXIL=1 GOGPU_GRAPHICS_API=dx12 go run ./cmd/wgpu-triangle
The terminal showed:
wgpu API Triangle Test
Adapter: Intel(R) Iris(R) Xe Graphics
dx12: using DXIL direct compilation (naga dxil backend)
Render loop started
Frame 60 (64.6 FPS)
Frame 120 (62.1 FPS)
...
Frame 2400 (59.9 FPS)
A red triangle on a blue background. The most boring demo in graphics programming. And the most satisfying.
```
WGSL → naga.Parse → naga.Lower → IR → dxil.Compile → DXIL → D3D12 → GPU
        (Pure Go)    (Pure Go)         (Pure Go)
```
2,400+ frames. 60 FPS. Stable. On Intel Iris Xe with DirectX 12. Zero external dependencies. No d3dcompiler_47.dll. No DXC. No CGO. Every byte of LLVM bitcode generated by Go code.
import "github.com/gogpu/naga/dxil"
// One call. IR in, DXBC container out.
dxilBytes, err := dxil.Compile(irModule, dxil.DefaultOptions())
The output is a complete DXBC container that D3D12 CreateGraphicsPipelineState accepts directly.
By the Numbers
| Metric | Value |
|---|---|
| Total DXIL code | ~12,500 lines (9,400 code) |
| Test count | 190 |
| New files | 26 |
| Public API surface | 4 exported symbols (Compile, DefaultOptions, Options, ShaderModel) |
| External dependencies | 0 |
| CGO calls | 0 |
| CI platforms | macOS + Ubuntu + Windows (all green) |
| Time to first frame | Instant (no subprocess) |
For comparison, Mesa's DXIL compiler is ~21,000 LOC of C/H, developed by engineers from Microsoft and Collabora. We stood on their shoulders — studying their bitcode writer to understand the format, then implementing our own from scratch in Go. Having four existing shader backends meant we already understood the IR walking patterns. The DXIL-specific parts (scalarization, basic blocks, dx.op intrinsics) were new, but the architecture was familiar.
What's Experimental (and What's Next)
This is v0.17.0 with an (experimental) label. Here's what works and what doesn't:
Works now:
- Vertex + fragment shaders
- All arithmetic, comparison, logical operations
- 30+ math intrinsics (min, max, clamp, dot, cross, mix, fma, length, normalize...)
- Type casts (10 LLVM cast opcodes)
- Control flow (if/else, loops, break/continue)
- Local variables (alloca + load + store)
- Texture sampling
- Resource handle creation (CBV/SRV/Sampler)
- I/O signatures and pipeline state validation
Coming next:
- Compute shaders (UAV, atomics, barriers)
- Uniform buffer reads (cbufferLoadLegacy wiring)
- SM 6.1-6.9 features (wave intrinsics, mesh shaders)
The experimental label means: it renders triangles today, but don't ship a game with it tomorrow.
The Bigger Picture
naga is part of GoGPU — a 636K LOC Pure Go GPU ecosystem:
| Project | LOC | What |
|---|---|---|
| gg | 153K | 2D graphics, GPU SDF, SVG renderer |
| naga | 145K | Shader compiler (now with DXIL!) |
| wgpu | 134K | Pure Go WebGPU (Vulkan/DX12/Metal/GLES) |
| ui | 121K | GUI toolkit, 22+ widgets, 4 themes |
| gogpu | 39K | Application framework |
636K lines of Go code. Zero CGO. Every GPU API. Every platform.
With DXIL, gogpu/naga has surpassed Rust naga in backend coverage. This isn't just parity anymore:
| Backend | Go naga | Rust naga |
|---|---|---|
| SPIR-V | 100% (87/87 golden, 164/164 spirv-val) | 100% |
| MSL | 100% (91/91) | 100% |
| GLSL | 100% (68/68) | 100% |
| HLSL | 100% (72/72) | 100% |
| DXIL | Experimental (working) | Not implemented (open issue since 2020) |
We started this project to match Rust naga's output. Now we've gone beyond it.
Try It
```shell
go get github.com/gogpu/naga@v0.17.0
```

```go
import "github.com/gogpu/naga/dxil"

// Parse WGSL, lower to IR, compile to DXIL
ast, _ := naga.Parse(wgslSource)
module, _ := naga.Lower(ast)
dxilBytes, _ := dxil.Compile(module, dxil.DefaultOptions())
// dxilBytes is a complete DXBC container — feed directly to D3D12
```
Repository: github.com/gogpu/naga
Release: v0.17.0