DEV Community

Farhan Syah

numr 0.5.0: The Rust numerical computing library that doesn't make you choose

Last year, I started building numr because I was frustrated.

GitHub: ml-rust / numr

A high-performance numerical computing library for Rust with GPU acceleration, inspired by NumPy

Foundational numerical computing for Rust

numr provides n-dimensional tensors, linear algebra, FFT, statistics, and automatic differentiation, with native acceleration across CPU, CUDA, and WebGPU backends.

numr is like NumPy for Rust, but with gradients, GPUs, and modern dtypes built in from day one.

What numr Is

A foundation library: mathematical building blocks for higher-level libraries and applications.

| numr IS | numr is NOT |
| --- | --- |
| Tensor library (like NumPy's ndarray) | A deep learning framework |
| Linear algebra (decompositions, solvers) | A high-level ML API |
| FFT, statistics, random distributions | Domain-specific |
| Native GPU (CUDA + WebGPU) + autograd | |

For SciPy-equivalent functionality (optimization, ODE, interpolation, signal), see solvr.

Why numr?

vs NumPy

| Capability | NumPy | numr |
| --- | --- | --- |
| N-dimensional tensors | ✓ | ✓ |
| Linear algebra, FFT, stats | ✓ | ✓ |
| Automatic differentiation | ✗ Need JAX/PyTorch | ✓ Built-in numr::autograd |
| GPU acceleration | ✗ Need CuPy/JAX | ✓ Native CUDA + WebGPU |
| Non-NVIDIA GPUs | ✗ None | ✓ AMD, Intel, Apple via WebGPU |
| FP8 / quantized compute | ✗ None | ✓ Built-in |

I wanted to do numerical computing in Rust — tensors, linear algebra, FFT, gradients — on GPUs. Not just NVIDIA GPUs. Any GPU. And I didn't want to glue together five incompatible crates to do it.

Python didn't plan for this either. NumPy emerged organically, and it took years of bolting on CuPy, JAX, and PyTorch before Python had GPU compute and autograd — scattered across incompatible libraries.

Some people say fragmentation is fine. Separate crates for separate concerns — that's the Unix philosophy. And I'd agree, if they shared conventions, types, and backends. But they don't. ndarray gives you tensors but no GPU. nalgebra gives you linear algebra but no autograd. rustfft gives you FFT but nothing else. Different types, different idioms, none of them compose.

So the burden falls on you — the application developer. You're the one writing adapter layers between crates. You're the one figuring out why this tensor type doesn't work with that decomposition. And when you need GPU support or a missing operation? You're filing issues and PRs upstream, waiting for maintainers, before you can get back to building your actual application.

numr takes that burden off you. One library, one tensor type, one API covering tensors, linalg, FFT, statistics, autograd, and GPU. numr handles the hard parts so you can focus on building your application.

One library, one API, every backend. Write your code once. Run it on CPU with AVX-512. Run it on NVIDIA with native CUDA kernels. Run it on AMD, Intel, or Apple silicon through WebGPU. Same code. Same results.
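The "write once, run on every backend" idea usually rests on a common trait that every device implements. As a rough illustration only (this is not numr's actual API; `Backend`, `Cpu`, and `axpy_like` are made-up names), the shape of such a design in Rust looks like this:

```rust
// Illustrative sketch only: numr's real backend abstraction is richer.
// The point is that user code is written once against a trait, and each
// backend (CPU, CUDA, WebGPU) supplies its own implementation.

trait Backend {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32>;
}

struct Cpu;

impl Backend for Cpu {
    fn add(&self, a: &[f32], b: &[f32]) -> Vec<f32> {
        a.iter().zip(b).map(|(x, y)| x + y).collect()
    }
}

// Generic over the backend: the same function runs on any device
// for which a Backend impl exists.
fn axpy_like<B: Backend>(backend: &B, a: &[f32], b: &[f32]) -> Vec<f32> {
    backend.add(a, b)
}

fn main() {
    let out = axpy_like(&Cpu, &[1.0, 2.0], &[3.0, 4.0]);
    println!("{:?}", out);
}
```

Swapping `Cpu` for a hypothetical `Cuda` or `Wgpu` type would leave the call site untouched, which is the portability property the paragraph above is describing.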

Today, numr 0.5.0 ships. And it's the release where it stopped being a "promising project" and became something you can actually build on.


What changed

Fused kernels — because memory bandwidth is the real bottleneck

The single biggest performance win in GPU computing isn't faster math. It's reading memory fewer times.

A naive softmax reads your tensor five times: max, subtract, exp, sum, divide. A fused softmax reads it once. For large tensors, that's not a 5x difference (the math is cheap), but it's easily 2-3x.
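To make the pass-counting concrete, here is a plain-Rust sketch (not numr's kernels) of the five-pass softmax next to a fused variant using the "online" max-and-sum trick, which touches the input twice instead of five times:

```rust
// Naive softmax: five separate passes over the data, as described above.
fn softmax_naive(x: &[f32]) -> Vec<f32> {
    let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);    // pass 1: max
    let shifted: Vec<f32> = x.iter().map(|v| v - m).collect();      // pass 2: subtract
    let exps: Vec<f32> = shifted.iter().map(|v| v.exp()).collect(); // pass 3: exp
    let sum: f32 = exps.iter().sum();                               // pass 4: sum
    exps.iter().map(|v| v / sum).collect()                          // pass 5: divide
}

// Fused ("online") softmax: max and sum are computed together in a
// single pass, rescaling the running sum whenever a new max appears;
// a second pass writes the output. Two reads instead of five.
fn softmax_fused(x: &[f32]) -> Vec<f32> {
    let (mut m, mut s) = (f32::NEG_INFINITY, 0.0f32);
    for &v in x {
        let m_new = m.max(v);
        s = s * (m - m_new).exp() + (v - m_new).exp();
        m = m_new;
    }
    x.iter().map(|&v| (v - m).exp() / s).collect()
}
```

On a GPU the fused kernel additionally keeps intermediates in registers or shared memory, which is where the bandwidth savings come from.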

0.5.0 adds fused kernels for the operations that matter most:

  • GEMM epilogue: matmul + bias + activation in one kernel launch. This is the inner loop of every neural network. Forward and backward.
  • Activation-mul: for gated architectures like SwiGLU that power modern LLMs. One read instead of three.
  • Add-norm: residual connection + normalization fused together. The other operation you hit every single transformer layer.

All of these work on CPU, CUDA, and WebGPU. All of them have backward passes for autograd.
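For intuition on the activation-mul fusion, here is what a SwiGLU-style gate looks like unfused versus fused, in plain Rust (a sketch of the idea, not numr's kernels):

```rust
// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

// Unfused: the activation writes an intermediate buffer, and the
// multiply reads it back. Three full tensor reads plus an extra write.
fn swiglu_unfused(gate: &[f32], up: &[f32]) -> Vec<f32> {
    let act: Vec<f32> = gate.iter().map(|&g| silu(g)).collect();
    act.iter().zip(up).map(|(a, u)| a * u).collect()
}

// Fused: one pass reads each input once and writes the result once.
fn swiglu_fused(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

The two functions compute the same values; the fused one simply never materializes the intermediate activation buffer.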


FP8 and quantized compute — because not everything needs 32 bits

FP8 isn't just "smaller numbers." It's the difference between fitting a model in VRAM or not. Between one GPU and two.

numr now does FP8 matrix multiplication natively — E4M3 and E5M2 formats, across all backends. No external libraries. No NVIDIA-only restrictions.
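For a sense of what an E4M3 value actually is: one byte holding a sign bit, 4 exponent bits (bias 7), and 3 mantissa bits. A minimal decoder, written from the OCP FP8 layout as I understand it (our own sketch, not numr code):

```rust
// Decode an OCP E4M3 byte to f32: 1 sign bit, 4 exponent bits (bias 7),
// 3 mantissa bits. E4M3 has no infinities; exponent=15 with mantissa=7
// encodes NaN, which is why the max finite value is 448.
fn e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = (byte >> 3) & 0x0F;
    let man = byte & 0x07;
    if exp == 0x0F && man == 0x07 {
        return f32::NAN;
    }
    let frac = man as f32 / 8.0;
    let val = if exp == 0 {
        // Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        frac * 2f32.powi(-6)
    } else {
        (1.0 + frac) * 2f32.powi(exp as i32 - 7)
    };
    sign * val
}
```

E5M2 trades a mantissa bit for an exponent bit, giving more range and less precision, and does keep IEEE-style infinities.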

We also added i8×i8→i32 quantized matmul on CPU. This is what powers efficient quantized inference when you don't have a GPU.
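A rough sketch of what i8×i8→i32 accumulation means (illustrative only; the per-tensor scale factors are an assumption about how dequantization is typically done, not a statement about numr's API):

```rust
// Quantized dot product: i8 inputs, i32 accumulator so the products
// cannot overflow (127 * 127 * len fits in i32 for realistic lengths),
// then a single f32 rescale at the end using the tensors' scales.
fn qdot(a: &[i8], b: &[i8], scale_a: f32, scale_b: f32) -> f32 {
    let acc: i32 = a.iter()
        .zip(b)
        .map(|(&x, &y)| x as i32 * y as i32)
        .sum();
    acc as f32 * scale_a * scale_b
}
```

A quantized matmul is this inner product over every row/column pair, with the integer part mapping onto fast SIMD instructions on CPU.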


2:4 structured sparsity — because half your weights are probably zero

NVIDIA's Ampere architecture introduced hardware support for 2:4 sparsity: for every group of 4 weights, exactly 2 are zero. The hardware skips them, doubling throughput for free.

numr 0.5.0 supports 2:4 structured sparsity across all backends. On CUDA, it hits the hardware fast path. On CPU and WebGPU, it uses optimized sparse kernels.
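The 2:4 constraint is easy to state in code: in every aligned group of four consecutive weights, at least two must be zero. A small validator sketch (our own, not numr's API):

```rust
// Returns true if the weights satisfy the 2:4 structured sparsity
// pattern: in every aligned group of 4, at most 2 values are nonzero.
// Pruning enforces this pattern; the hardware fast path relies on it.
fn is_2_4_sparse(weights: &[f32]) -> bool {
    weights
        .chunks(4)
        .all(|group| group.iter().filter(|&&w| w != 0.0).count() <= 2)
}
```

Because the pattern is fixed, a sparse kernel can store just the two surviving values per group plus a tiny index, which is what halves the memory traffic.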


Autograd that actually covers what you need

Previous releases had autograd for basic operations. 0.5.0 makes it comprehensive:

conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, the fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather — all differentiable, all with correct backward passes, all supporting second-order derivatives.

Activation checkpointing lets you trade compute for memory. Backward hooks let you trigger distributed gradient sync during backprop.

This isn't an ML framework. It's the autograd engine that ML frameworks build on.


A CUDA backend that acts like it belongs there

The CUDA story got serious in 0.5.0:

Caching allocator. CUDA memory allocation is expensive. The old approach (stream-ordered allocation) worked but left performance on the table. The new Rust-side caching allocator reuses memory blocks, cutting allocation overhead dramatically.
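The caching-allocator idea, in miniature (a toy sketch of the general technique, not numr's allocator): keep freed blocks in size-keyed free lists and hand them back out on the next request instead of paying for a fresh allocation.

```rust
use std::collections::HashMap;

// Toy caching allocator. Freed blocks go into per-size free lists and
// are reused on the next allocation of the same size, so the expensive
// underlying allocation (cudaMalloc in the real case, a Vec allocation
// here) only happens on a cache miss.
struct CachingAllocator {
    free_lists: HashMap<usize, Vec<Vec<u8>>>,
    cache_hits: usize,
}

impl CachingAllocator {
    fn new() -> Self {
        Self { free_lists: HashMap::new(), cache_hits: 0 }
    }

    fn alloc(&mut self, size: usize) -> Vec<u8> {
        if let Some(block) = self.free_lists.get_mut(&size).and_then(|l| l.pop()) {
            self.cache_hits += 1;
            block
        } else {
            vec![0u8; size] // cache miss: pay the real allocation cost
        }
    }

    fn free(&mut self, block: Vec<u8>) {
        self.free_lists.entry(block.len()).or_default().push(block);
    }
}
```

Real GPU allocators add stream-safety, block splitting, and size-class rounding on top of this, but the reuse loop is the core of the win.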

Graph capture. Record a sequence of kernel launches once, replay it with zero overhead. Essential for inference serving where you run the same computation thousands of times.

GEMV fast paths. When one matrix dimension is small (which happens constantly during inference — batch size 1), you don't want full tiled GEMM. Specialized GEMV kernels for transposed weight matrices avoid unnecessary work.
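The GEMV case in plain terms: with batch size 1, matrix × matrix collapses to matrix × vector, so the tiled GEMM machinery is overkill. A scalar reference sketch (a real kernel would vectorize and parallelize the rows, and handle the transposed layout mentioned above):

```rust
// y = W * x for a row-major m x n matrix W and a length-n vector x.
// With batch size 1 this single loop nest replaces a full tiled GEMM.
fn gemv(w: &[f32], x: &[f32], m: usize, n: usize) -> Vec<f32> {
    assert_eq!(w.len(), m * n);
    assert_eq!(x.len(), n);
    (0..m)
        .map(|i| (0..n).map(|j| w[i * n + j] * x[j]).sum())
        .collect()
}
```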

Pipelined D2H copy. Overlap GPU computation with data transfer back to the host. The GPU doesn't wait for the CPU, the CPU doesn't wait for the GPU.


Why 0.5.0 matters

This is where numr crosses the threshold from "interesting foundation" to "you can build real things on this." And we know because we did.

0.5.0 has been validated against real downstream consumers. solvr — a scientific computing library with optimization, ODE solvers, and interpolation — builds and runs on numr 0.5.0. boostr — an ML framework with attention, MoE, and Mamba blocks — builds and runs on it too. LLM inference and embedding generation work end-to-end.

This isn't a library that passes unit tests in isolation. It's a library that other libraries are built on, and those libraries work.

The fused kernels mean you're not leaving performance on the table. The autograd coverage means you can differentiate through realistic computation graphs. The CUDA infrastructure means GPU workloads actually perform. And all of it works the same across CPU, CUDA, and WebGPU.


What's next

0.5.0 unblocks new releases of solvr (scientific computing — optimization, ODE solvers, interpolation) and boostr (ML framework) which both build on numr.

For numr itself, 0.6.0 focuses on hardening: cleaning up error handling, API stability audit, and preparing for an eventual 1.0.

ROCm (native AMD GPU) is on the roadmap for 0.7.0+.


Try it

```toml
[dependencies]
numr = "0.5.0"

# With GPU support
numr = { version = "0.5.0", features = ["cuda"] }
numr = { version = "0.5.0", features = ["wgpu"] }
```

numr is Apache-2.0 licensed. Contributions welcome.

Top comments (2)

Velx Dev

Interesting timing -- just finished reading the Python Optimization Ladder article that benchmarks every Python speedup tool. One of the conclusions was that Rust via PyO3 tops out around 113-154x on compute-heavy benchmarks, which matches C.

The point about fragmentation in the Rust numerical ecosystem is real. ndarray + nalgebra + rustfft genuinely don't compose nicely. The "one tensor type" promise here is the main value prop for me.

One question: for the WebGPU backend, how close are you to CUDA performance? My understanding is there's a meaningful overhead from WebGPU's shader compilation and abstraction layer. Is this being addressed in the roadmap?

Farhan Syah • Edited

It depends on the use case.

Some ops come within about 5-10% of CUDA performance; others lose around 10-20%.

The performance gap is a real concern. Right now I improve shader performance when profiling surfaces a hotspot while I'm building the higher-level libraries and applications that use numr.

In fact, sometimes I use the CPU backend with Rayon instead of CUDA or WGPU specifically to avoid that overhead, so developers do need to understand when and where to switch backends.

I haven't been able to do a systematic optimization pass over all the shaders yet; I might add ROCm or Metal support first and come back to cross-backend performance later.

For now I fix or improve things as I encounter them. I'm hoping more people start using the library, so I get more performance issue reports.