"Go doesn't have a real graphics ecosystem." — We've heard this for years. So we built one: 636K lines of Pure Go, five GPU backends, zero CGO. And now we've done something that even Rust's naga shader compiler hasn't managed in six years.
At the end of last year, we introduced GoGPU to the Go community — greeting everyone with a New Year's gift: a professional graphics ecosystem written entirely in Go. Four months later, that ecosystem just got its most audacious component: a Pure Go DXIL generator that compiles shaders directly to DirectX 12 bytecode, without any external compiler.
This is the story of how a performance optimization rabbit hole led us to write our own LLVM 3.7 bitcode emitter.
The Problem That Wouldn't Go Away
Every DirectX 12 application needs compiled shaders. The standard pipeline looks like this:
WGSL → HLSL text → FXC (d3dcompiler_47.dll) → DXBC bytecode → GPU
That middle step — d3dcompiler_47.dll — is a 4.3 MB Microsoft DLL that you load at runtime. It works. It's battle-tested. And it was our bottleneck.
We built gogpu — a 636K LOC Pure Go GPU ecosystem. Zero CGO. Every pixel rendered through Go code. Our shader compiler naga translates WGSL to SPIR-V, MSL, GLSL, HLSL, and now DXIL, with 100% Rust naga parity across all text/binary backends — 145K LOC of Pure Go.
But on Windows with DirectX 12, we had a dirty secret: d3dcompiler_47.dll. The one external dependency in our "zero external dependencies" stack.
The Optimization Rabbit Hole
We tried everything to make the FXC path fast enough to forget about:
Shader cache — Hash the HLSL, cache the DXBC. First render is slow, subsequent ones instant. Works great until your shader variants explode.
In-memory compilation pool — Pre-compile common shaders at startup. Reduces cold-start latency. But we still load the DLL.
Pipeline State Object caching — We planned disk caching of PSO blobs (GetCachedBlob → os.UserCacheDir()). We wrote the task, designed the key format, specified the invalidation strategy. Then we never shipped it — because we pivoted to eliminating FXC entirely.
naga HLSL fix — FXC was choking on naga-generated HLSL: a (Type[256])0 bulk zero-initialization expanded to a 12KB inline constructor that FXC took 22 seconds to compile. We initially thought our Go naga had a bug, so we tested the same shader through Rust naga + FXC — same 22 seconds. It wasn't our implementation; FXC genuinely can't handle giant inline constructors. The fix was in naga (per-element loop instead of inline constructor, v0.16.3) — 330× faster. But even after fixing the worst case, every shader still went through an external DLL.
Every optimization made the same path faster. None of them removed the path.
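The first of those attempts, the shader cache, is worth a closer look because it captures why the whole approach was a dead end: it makes the FXC path cheaper without removing it. A minimal sketch of the idea, with hypothetical names (a real key would also cover entry point and target profile):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// shaderCache memoizes compiled bytecode keyed by a hash of the HLSL
// source plus compile flags, so each shader variant is compiled once.
type shaderCache struct {
	mu      sync.Mutex
	entries map[string][]byte
}

func newShaderCache() *shaderCache {
	return &shaderCache{entries: make(map[string][]byte)}
}

// cacheKey hashes flags and source together; a NUL separator keeps
// ("a", "bc") and ("ab", "c") from colliding.
func cacheKey(hlsl, flags string) string {
	h := sha256.Sum256([]byte(flags + "\x00" + hlsl))
	return hex.EncodeToString(h[:])
}

// GetOrCompile returns cached bytecode, invoking compile only on a miss.
func (c *shaderCache) GetOrCompile(hlsl, flags string, compile func() []byte) []byte {
	key := cacheKey(hlsl, flags)
	c.mu.Lock()
	defer c.mu.Unlock()
	if b, ok := c.entries[key]; ok {
		return b
	}
	b := compile()
	c.entries[key] = b
	return b
}

func main() {
	c := newShaderCache()
	calls := 0
	compile := func() []byte { calls++; return []byte{0x44, 0x58} }
	c.GetOrCompile("float4 main() : SV_Target { return 0; }", "-O3", compile)
	c.GetOrCompile("float4 main() : SV_Target { return 0; }", "-O3", compile)
	fmt.Println("compile calls:", calls) // prints "compile calls: 1"
}
```

The second lookup never touches the compiler, which is exactly the problem: the compiler is still there for every first lookup.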
At this point we had a decision to make. The obvious next step was adding DXC (dxcompiler.dll) as an opt-in replacement for FXC — newer, faster, supports Shader Model 6.0+. We even created the task for it.
Then, while reviewing the plan, a simple question came up: "Can we write our own DXC in Go?"
The initial answer was: "No. DXC is 500K lines of C++, a fork of LLVM 3.7. That's not something you casually rewrite."
The response: "Not rewrite DXC. Generate DXIL directly. Skip HLSL entirely."
That changed everything. DXC takes HLSL text and produces DXIL. We already have our own IR (naga IR). Why translate IR → HLSL text → parse HLSL → produce DXIL, when we could go IR → DXIL directly?
DXC is a compiler from one language (HLSL) to another (DXIL). We don't need a compiler — we need an emitter. And emitting is much simpler than compiling.
What is DXIL, and Why Nobody Writes It
DXIL (DirectX Intermediate Language) is what FXC and DXC produce. It's LLVM 3.7 bitcode — the same IR format that LLVM uses internally — wrapped in a DXBC container with DirectX-specific metadata and dx.op intrinsic calls.
The reason nobody writes DXIL directly is simple: it's hard.
- LLVM 3.7 bitcode is a binary format with variable-width encoding (VBR), nested blocks, abbreviation records, and forward references. Not something you casually emit.
- DXIL semantics require dx.op intrinsic calls instead of normal LLVM instructions for I/O, math, and resource access. 165+ opcodes.
- The DXBC container needs input/output signatures (ISG1/OSG1), pipeline state validation (PSV0), feature flags (SFI0), and a cryptographic hash.
- Validation — until January 2025, every DXIL module needed to be signed by dxil.dll. Microsoft's BYPASS hash sentinel changed this.
Rust's naga shader compiler has had an open issue for a DXIL backend since 2020. Six years later, it's still marked as future work. The Rust team called it "a lot of work and a long way away."
Only one project has done it outside of LLVM: Mesa (the open-source OpenGL/Vulkan driver stack). Their DXIL compiler is ~21,000 lines of C/H, written by engineers from Microsoft and Collabora over 3+ years. They wrote their own LLVM 3.7 bitcode writer from scratch — proving it's possible without linking LLVM.
We cloned Mesa's src/microsoft/compiler/ into our reference folder, studied dxil_module.c (the bitcode writer, ~3K lines of C), and mapped out every block type, record format, and abbreviation. Not to copy — to understand the format deeply enough to write our own.
Then came the final piece: in January 2025, Microsoft open-sourced the DXIL validator hash and introduced a BYPASS sentinel — a magic value in the hash field that tells D3D12 "this shader wasn't signed by dxil.dll, but trust it anyway." Without this, our DXIL wouldn't run without Developer Mode on Windows. With it, any third-party DXIL generator can produce shaders that run on retail Windows.
We weren't afraid of binary formats. Before gogpu, we built scigolib/hdf5 — a Pure Go implementation of HDF5, NASA's hierarchical data format with its own B-tree indices, chunked storage, and compression pipelines. After parsing HDF5 superblocks and fractal heaps in pure Go, LLVM bitcode felt almost... reasonable. We also built coregx/coregex — a regex engine from scratch. Binary formats, state machines, low-level encoding — this is what we do.
We spent weeks studying the DXIL format specifically. Reading the DXIL spec, the LLVM 3.7 bitcode reference, Mesa's implementation, Microsoft's DXC headers, the validator hash proposal. We wrote a detailed architecture document comparing four implementation options. We mapped every dx.op opcode we'd need for vertex and fragment shaders. We designed the package structure, the phased rollout plan, the testing strategy.
Only after all that research did we write the first line of code.
Building LLVM 3.7 Bitcode in Pure Go
The first challenge was the bitcode writer. LLVM 3.7's format is... unique:
Bits, not bytes. Variable-width encoding. Nested blocks with forward-declared sizes. Abbreviation records that compress common patterns. A module structure that interleaves types, constants, functions, and metadata in a specific order.
We wrote a bit-level writer from scratch:
```go
// VBR (variable bit-rate) encoding — like protobuf varint, but bit-aligned
func (w *Writer) WriteVBR(value uint64, width uint) {
	tag := uint32(1) << (width - 1) // continuation bit
	mask := tag - 1                 // data bits per chunk
	for value > uint64(mask) {
		chunk := uint32(value&uint64(mask)) | tag
		value >>= width - 1
		w.WriteBits(chunk, width)
	}
	w.WriteBits(uint32(value), width)
}
```
Then the module serializer: TYPE_BLOCK, CONSTANTS_BLOCK, FUNCTION_BLOCK, METADATA_BLOCK — each with its own record formats, abbreviation IDs, and ordering constraints.
The DXBC container assembles the bitcode with signatures and metadata:
```
[DXBC Header]  32 bytes (magic + digest + version + size + part count)
[SFI0]         Shader feature flags (64-bit bitmask)
[DXIL]         Program header + LLVM 3.7 bitcode
[ISG1]         Input signature (semantic names, registers)
[OSG1]         Output signature
[PSV0]         Pipeline state validation
[HASH]         BYPASS sentinel (no dxil.dll needed!)
```
The DXIL Difference: Scalarized Vectors
Here's something that makes DXIL fundamentally different from SPIR-V, MSL, GLSL, and HLSL: DXIL has no native vector types.
In SPIR-V, you write OpCompositeConstruct %vec4 %x %y %z %w.
In HLSL, you write float4(x, y, z, w).
In DXIL, there are no vectors. A vec4<f32> becomes four separate float values, tracked independently through every operation.
```go
// Our emitter tracks per-component value IDs
type Emitter struct {
	exprValues     map[ir.ExpressionHandle]int   // scalar value IDs
	exprComponents map[ir.ExpressionHandle][]int // per-component IDs for vectors
}

// dot(a, b) becomes:
//   %r = call float @dx.op.dot3.f32(i32 55, float %ax, float %ay, float %az,
//                                   float %bx, float %by, float %bz)
```
This means every vector operation — dot product, cross product, normalize, swizzle — must be decomposed into scalar operations. Our existing backends (SPIR-V, MSL, GLSL, HLSL) all work with native vectors. DXIL required a completely different approach.
Cross product becomes 6 multiplies and 3 subtracts:
```go
// cross(a, b) = vec3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x)
cx := fsub(fmul(ay, bz), fmul(az, by))
cy := fsub(fmul(az, bx), fmul(ax, bz))
cz := fsub(fmul(ax, by), fmul(ay, bx))
```
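Not everything gets more expensive, though. Swizzles fall out of this representation almost for free: they permute value IDs without emitting a single instruction. A sketch, assuming the per-lane slice of LLVM value IDs from the Emitter above (function name hypothetical):

```go
package main

import "fmt"

// swizzle reorders per-component value IDs; no instructions are emitted.
// pattern holds source lane indices, e.g. .zxy → [2, 0, 1].
func swizzle(components []int, pattern []int) []int {
	out := make([]int, len(pattern))
	for i, lane := range pattern {
		out[i] = components[lane]
	}
	return out
}

func main() {
	a := []int{10, 11, 12, 13}              // value IDs for a.x, a.y, a.z, a.w
	fmt.Println(swizzle(a, []int{2, 0, 1})) // a.zxy → prints "[12 10 11]"
}
```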
Control Flow: Basic Blocks, Not Nesting
Another fundamental difference: DXIL uses LLVM-style basic blocks with explicit branches, not the nested text structure of HLSL/GLSL/MSL.
Our text backends emit:
```hlsl
if (cond) {
    // accept
} else {
    // reject
}
```
DXIL emits:
```llvm
entry:
  br i1 %cond, label %then, label %else
then:
  ; accept statements
  br label %merge
else:
  ; reject statements
  br label %merge
merge:
  ; continues
```
Loops use back-edge branches to a header block. Break and continue jump to specific target blocks tracked via a loop context stack.
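That loop context stack can be sketched like this (names hypothetical; the real emitter tracks LLVM basic-block IDs):

```go
package main

import "fmt"

// loopCtx records the branch targets a loop body needs: `continue`
// jumps back to the header block, `break` jumps to the merge block.
type loopCtx struct {
	headerBlock int // target of `continue` and the back edge
	mergeBlock  int // target of `break`
}

type emitter struct {
	loops []loopCtx // innermost loop sits at the top of the stack
}

func (e *emitter) pushLoop(header, merge int) {
	e.loops = append(e.loops, loopCtx{header, merge})
}

func (e *emitter) popLoop() {
	e.loops = e.loops[:len(e.loops)-1]
}

// breakTarget is the block a `break` in the current loop branches to.
func (e *emitter) breakTarget() int { return e.loops[len(e.loops)-1].mergeBlock }

// continueTarget is the block a `continue` branches to.
func (e *emitter) continueTarget() int { return e.loops[len(e.loops)-1].headerBlock }

func main() {
	e := &emitter{}
	e.pushLoop(2, 5) // outer loop: header %2, merge %5
	e.pushLoop(7, 9) // inner loop: header %7, merge %9
	fmt.Println(e.breakTarget(), e.continueTarget()) // prints "9 7"
	e.popLoop()
	fmt.Println(e.breakTarget(), e.continueTarget()) // prints "5 2"
}
```

Nested loops just push and pop: a break inside the inner loop can never accidentally escape to the outer loop's merge block.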
We studied Mesa's nir_to_dxil.c for the correct patterns, then cross-referenced with our own SPIR-V backend (which also uses structured control flow with merge blocks) to get the Go implementation right.
What Our Backends Taught Us
This is the part that surprised us most. We have four mature backends (SPIR-V, MSL, GLSL, HLSL) totaling ~68K LOC. They all solve the same IR walking problems:
- Expression dispatch and caching
- Type resolution through pointer chains
- Statement nesting and control flow
- Resource binding and I/O handling
Before implementing each DXIL feature, we checked how our existing backends handled it:
| Feature | We checked | What we learned |
|---|---|---|
| Multi-arg math | HLSL writeExpressionKind | Arg/Arg1/Arg2/Arg3 dispatch pattern |
| Type casts | SPIR-V emitAs | src/dst kind+width → opcode selection |
| Control flow | HLSL writeIfStatement | Condition, blocks, merge point structure |
| Store/Load | SPIR-V emitStore | Pointer chain resolution |
| Struct access | MSL writeAccessChain | Recursive descent through members |
The DXIL backend is different (scalarized, basic blocks, dx.op intrinsics), but the IR patterns are the same. Our existing codebase was its own best reference.
The Moment of Truth
After all the research, all the planning, all the implementation — ~12,500 lines of Go code, 190 tests, weeks of work — came the moment that mattered:
GOGPU_DX12_DXIL=1 GOGPU_GRAPHICS_API=dx12 go run ./cmd/wgpu-triangle
The terminal showed:
wgpu API Triangle Test
Adapter: Intel(R) Iris(R) Xe Graphics
dx12: using DXIL direct compilation (naga dxil backend)
Render loop started
Frame 60 (64.6 FPS)
Frame 120 (62.1 FPS)
...
Frame 2400 (59.9 FPS)
A red triangle on a blue background. The most boring demo in graphics programming. And the most satisfying.
```
WGSL → naga.Parse → naga.Lower → IR → dxil.Compile → DXIL → D3D12 → GPU
        (Pure Go)    (Pure Go)         (Pure Go)
```
2,400+ frames. 60 FPS. Stable. On Intel Iris Xe with DirectX 12. Zero external dependencies. No d3dcompiler_47.dll. No DXC. No CGO. Every byte of LLVM bitcode generated by Go code.
import "github.com/gogpu/naga/dxil"
// One call. IR in, DXBC container out.
dxilBytes, err := dxil.Compile(irModule, dxil.DefaultOptions())
The output is a complete DXBC container that D3D12 CreateGraphicsPipelineState accepts directly.
By the Numbers
| Metric | Value |
|---|---|
| Total DXIL code | ~12,500 lines (9,400 code) |
| Test count | 190 |
| New files | 26 |
| Public API surface | 4 exported symbols (Compile, DefaultOptions, Options, ShaderModel) |
| External dependencies | 0 |
| CGO calls | 0 |
| CI platforms | macOS + Ubuntu + Windows (all green) |
| Time to first frame | Instant (no subprocess) |
For comparison, Mesa's DXIL compiler is ~21,000 LOC of C/H, developed by engineers from Microsoft and Collabora. We stood on their shoulders — studying their bitcode writer to understand the format, then implementing our own from scratch in Go. Having four existing shader backends meant we already understood the IR walking patterns. The DXIL-specific parts (scalarization, basic blocks, dx.op intrinsics) were new, but the architecture was familiar.
What's Experimental (and What's Next)
This is v0.17.0 with an (experimental) label. Here's what works and what doesn't:
Works now:
- Vertex + fragment shaders
- All arithmetic, comparison, logical operations
- 30+ math intrinsics (min, max, clamp, dot, cross, mix, fma, length, normalize...)
- Type casts (10 LLVM cast opcodes)
- Control flow (if/else, loops, break/continue)
- Local variables (alloca + load + store)
- Texture sampling
- Resource handle creation (CBV/SRV/Sampler)
- I/O signatures and pipeline state validation
Coming next:
- Compute shaders (UAV, atomics, barriers)
- Uniform buffer reads (cbufferLoadLegacy wiring)
- SM 6.1-6.9 features (wave intrinsics, mesh shaders)
The experimental label means: it renders triangles today, but don't ship a game with it tomorrow.
The Bigger Picture
naga is part of GoGPU — a 636K LOC Pure Go GPU ecosystem:
| Project | LOC | What |
|---|---|---|
| gg | 153K | 2D graphics, GPU SDF, SVG renderer |
| naga | 145K | Shader compiler (now with DXIL!) |
| wgpu | 134K | Pure Go WebGPU (Vulkan/DX12/Metal/GLES) |
| ui | 121K | GUI toolkit, 22+ widgets, 4 themes |
| gogpu | 39K | Application framework |
636K lines of Go code. Zero CGO. Every GPU API. Every platform.
With DXIL, gogpu/naga has surpassed Rust naga in backend coverage. This isn't just parity anymore:
| Backend | Go naga | Rust naga |
|---|---|---|
| SPIR-V | 100% (87/87 golden, 164/164 spirv-val) | 100% |
| MSL | 100% (91/91) | 100% |
| GLSL | 100% (68/68) | 100% |
| HLSL | 100% (72/72) | 100% |
| DXIL | Experimental (working) | Not implemented (open issue since 2020) |
We started this project to match Rust naga's output. Now we've gone beyond it.
Try It
```shell
go get github.com/gogpu/naga@v0.17.0
```

```go
import "github.com/gogpu/naga/dxil"

// Parse WGSL, lower to IR, compile to DXIL
ast, _ := naga.Parse(wgslSource)
module, _ := naga.Lower(ast)
dxilBytes, _ := dxil.Compile(module, dxil.DefaultOptions())
// dxilBytes is a complete DXBC container — feed directly to D3D12
```
Repository: github.com/gogpu/naga
Release: v0.17.0