I built a Rust LLM inference engine with custom WGSL GPU kernels: here's what I learned!

#rust #ai #llm #programming

I've been working on a side project called aether, a Rust LLM inference engine that loads GGUF models and runs them with WGPU GPU acceleration.

It started as a way to understand how LLMs actually work under the hood. One thing led to another, and now it has:

GGUF model loading (Llama/Mistral/Phi/Qwen)
WGPU GPU backend (Metal/Vulkan/DX12)
Custom fused WGSL compute shaders for Q8_0 and Q4_K quantized matmul (weights are dequantized inline rather than in a separate pass)
Concurrent request pool for serving multiple users
OpenAI-compatible API server (axum)
Pure Rust, no Python dependencies in the hot path
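
To make the "dequantize inline" idea concrete, here is a minimal CPU-side sketch in Rust. It is not aether's actual WGSL kernel, just an illustration of the technique under stated assumptions. Q8_0 groups weights into blocks of 32 signed 8-bit values that share one scale; the real format stores that scale as f16, while this sketch uses f32 to stay dependency-free. The names `Q8Block`, `quantize_q8_0`, and `matvec_q8_0_fused` are hypothetical and chosen for this example only.

```rust
/// Number of weights per Q8_0 block.
const QK8_0: usize = 32;

/// Simplified Q8_0 block: one shared scale plus 32 signed 8-bit quants.
/// Assumption: the on-disk format stores the scale as f16; f32 is used
/// here so the sketch needs no external crates.
#[derive(Clone, Copy)]
struct Q8Block {
    scale: f32,
    quants: [i8; QK8_0],
}

/// Quantize f32 weights into Q8_0-style blocks (length must be a multiple of 32).
fn quantize_q8_0(values: &[f32]) -> Vec<Q8Block> {
    assert!(values.len() % QK8_0 == 0);
    values
        .chunks_exact(QK8_0)
        .map(|chunk| {
            // The scale maps the largest magnitude in the block to 127.
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut quants = [0i8; QK8_0];
            for (q, v) in quants.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            Q8Block { scale, quants }
        })
        .collect()
}

/// Fused matrix-vector product: y = W * x, with W stored row-major as Q8 blocks.
/// Each weight is dequantized on the fly inside the dot product, so no
/// full-precision copy of W is ever materialized in a separate pass.
fn matvec_q8_0_fused(weights: &[Q8Block], rows: usize, cols: usize, x: &[f32]) -> Vec<f32> {
    assert!(cols % QK8_0 == 0);
    let blocks_per_row = cols / QK8_0;
    assert_eq!(weights.len(), rows * blocks_per_row);
    assert_eq!(x.len(), cols);

    (0..rows)
        .map(|r| {
            let row = &weights[r * blocks_per_row..(r + 1) * blocks_per_row];
            row.iter()
                .zip(x.chunks_exact(QK8_0))
                .map(|(block, xs)| {
                    // Accumulate quant * activation first, then apply the
                    // block scale once: one multiply per block, not per weight.
                    let partial = block
                        .quants
                        .iter()
                        .zip(xs)
                        .map(|(&q, &xv)| q as f32 * xv)
                        .sum::<f32>();
                    block.scale * partial
                })
                .sum::<f32>()
        })
        .collect()
}
```

A GPU version would presumably assign rows or tiles to workgroups and read the packed blocks straight from a storage buffer, but the arithmetic per block follows the same pattern: the quantized bytes are the only form of the weights that crosses the memory bus.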

The GPU path is still experimental (CPU mode is the safe default), but the dequant shaders and the fused matmul kernels were honestly the most fun part to write.

I'm not trying to compete with llama.cpp or MLX; this was primarily a learning project that grew into something actually useful. Happy to answer questions or take feedback.

Stack: Rust, WGPU, WGSL, GGUF, axum, Tokio

https://github.com/theoxfaber/aether

(Full transparency: most of this code and post was written with AI assistance. I drove the design decisions, architecture, and testing; AI handled a lot of the implementation. Treat it accordingly.)

Top comments (1)

Harjot Singh • May 31

Writing your own inference engine with custom WGSL kernels is hardcore. That's the layer most of us (me included) treat as a black box, so going down to the GPU kernels is a real flex and the best way to actually understand where the latency goes. The lesson that usually falls out: inference perf is dominated by memory bandwidth and kernel fusion, not raw FLOPS, and the naive implementation leaves most of the GPU idle. This isn't my daily layer (I work up at orchestration with Moonshift), but I respect that everything above only works because someone got this right underneath. What surprised you most: the kernel-fusion wins or the WGSL portability tradeoffs?