DevOnBike
πŸš€ 8x Faster Than ONNX Runtime: Zero-Allocation AI Inference in Pure C#

The Myth: "C# is too slow for AI"

For years, the narrative has been the same: if you want high-performance AI, you must use C++ or Python wrappers (like PyTorch/ONNX) that call into native kernels. The common belief is that the Garbage Collector (GC) and the overhead of the "managed" environment make C# unsuitable for ultra-low latency inference.

I decided to challenge that.

By leveraging the latest features in .NET 10, AVX-512 instructions, and strict Zero-Allocation patterns, I built Overfit — an inference engine that runs roughly 8x faster than ONNX Runtime in micro-inference tasks.


πŸ“Š The Results: 432 Nanoseconds

The following benchmark compares Overfit against Microsoft.ML.OnnxRuntime. While ONNX is a powerhouse for large models, its overhead becomes a bottleneck for micro-inference.

Environment:

  • CPU: AMD Ryzen 9 9950X3D (Zen 5, AVX-512)
  • Runtime: .NET 10.0 (X64 RyuJIT x86-64-v4)
  • Task: Linear Layer Inference (784 -> 10 units)
| Method                       | Mean Latency | Allocated | Ratio |
|------------------------------|--------------|-----------|-------|
| Overfit (ZeroAlloc)          | 432.0 ns     | 0 B       | 0.12  |
| ONNX Runtime (Pre-allocated) | 3,571.8 ns   | 912 B     | 1.00  |
| ONNX Runtime (Full-alloc)    | 3,581.0 ns   | 1,128 B   | 1.24  |

In the time it takes ONNX Runtime to complete one prediction, Overfit completes eight. More importantly, Overfit does it with zero bytes allocated on the heap.


πŸ› οΈ Optimization #1: Persistent Buffers (The Death of GC)

The biggest killer of "Tail Latency" (P99.9) in .NET is the Garbage Collector. Even a small allocation of ~1KB per call triggers Gen-0 collections under heavy load. In high-frequency trading (HFT) or real-time game engines, a GC pause is a disaster.
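To make that pressure concrete, here is a small self-contained demo (not Overfit code; all names are illustrative) that counts Gen-0 collections for an allocating hot path versus one that writes into a caller-supplied, pre-allocated buffer:

```csharp
using System;

public static class GcPressureDemo
{
    // Allocating version: a fresh ~1 KB buffer per call feeds Gen-0.
    public static float[] InferAllocating(float[] input)
    {
        var output = new float[256]; // 1 KB allocated on every call
        for (int i = 0; i < output.Length; i++)
            output[i] = input[i % input.Length] * 0.5f;
        return output;
    }

    // Zero-allocation version: the caller supplies a persistent buffer.
    public static void InferPooled(ReadOnlySpan<float> input, Span<float> output)
    {
        for (int i = 0; i < output.Length; i++)
            output[i] = input[i % input.Length] * 0.5f;
    }

    public static void Main()
    {
        var input = new float[16];
        var persistent = new float[256];

        int gen0Before = GC.CollectionCount(0);
        for (int i = 0; i < 1_000_000; i++)
            InferAllocating(input);
        Console.WriteLine($"Allocating: {GC.CollectionCount(0) - gen0Before} Gen-0 collections");

        gen0Before = GC.CollectionCount(0);
        for (int i = 0; i < 1_000_000; i++)
            InferPooled(input, persistent);
        Console.WriteLine($"Pooled: {GC.CollectionCount(0) - gen0Before} Gen-0 collections");
    }
}
```

The allocating loop churns through roughly a gigabyte of short-lived garbage; the pooled loop reports zero collections, which is exactly the property Overfit's persistent buffers rely on.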

In Overfit, we use Persistent Inference Buffers. When the model is switched to Eval() mode, all necessary tensors are pre-allocated.

public AutogradNode Forward(ComputationGraph graph, AutogradNode input)
{
    // In Eval mode, we skip building the computation graph
    if (graph == null || !IsTraining)
    {
        // Compute directly into the pre-allocated persistent buffer
        LinearInferenceSimd(
            input.Data.AsReadOnlySpan(),
            _weightsTransposed.AsReadOnlySpan(),
            Biases.Data.AsReadOnlySpan(),
            _inferenceOutputNode.Data.AsSpan());

        return _inferenceOutputNode; // Zero-allocation return
    }

    return TensorMath.Linear(graph, input, Weights, Biases);
}
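The Eval() setup that makes this path possible is not shown above. Here is a minimal, self-contained sketch of the pattern (my own reconstruction with simplified types, not Overfit's actual API): the output buffer is allocated exactly once when the layer leaves training mode, and every subsequent Forward() writes into it:

```csharp
using System;

// Sketch of the persistent-buffer pattern; names and types are simplified.
public sealed class PersistentLinear
{
    private readonly float[] _weightsTransposed; // [output, input], row-major
    private readonly float[] _biases;
    private readonly int _inputSize, _outputSize;
    private float[] _inferenceOutput; // persistent buffer, created in Eval()

    public bool IsTraining { get; private set; } = true;

    public PersistentLinear(int inputSize, int outputSize)
    {
        _inputSize = inputSize;
        _outputSize = outputSize;
        _weightsTransposed = new float[outputSize * inputSize];
        _biases = new float[outputSize];
    }

    public void Eval()
    {
        IsTraining = false;
        // Allocated exactly once; Forward() never allocates after this.
        _inferenceOutput ??= new float[_outputSize];
    }

    // Must be called after Eval(); returns a view over the persistent buffer.
    public ReadOnlySpan<float> Forward(ReadOnlySpan<float> input)
    {
        var output = _inferenceOutput.AsSpan();
        for (int o = 0; o < _outputSize; o++)
        {
            var row = _weightsTransposed.AsSpan(o * _inputSize, _inputSize);
            float sum = _biases[o];
            for (int i = 0; i < _inputSize; i++)
                sum += input[i] * row[i];
            output[o] = sum;
        }
        return output; // zero allocations on this path
    }
}
```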

⚑ Optimization #2: SIMD & AVX-512

Modern CPUs are vector machines. The Ryzen 9 9950X3D features 512-bit registers, so .NET's Vector&lt;float&gt; widens to 16 lanes there and we can process 16 floats in a single CPU instruction.

For a layer with 784 inputs, a standard scalar loop does 784 multiplications. Our SIMD kernel does only 49 iterations (784 / 16 = 49), drastically reducing the CPU cycles required for the same mathematical operation.

// Core SIMD loop using Vector<float>
for (; i <= inputSize - vCount; i += vCount)
{
    var vIn = new Vector<float>(input.Slice(i));
    var vW = new Vector<float>(wRow.Slice(i));

    // Fused Multiply-Add equivalent in registers
    sum += vIn * vW; 
}

// Sum reduction and tail loop follow to handle remaining elements...
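For completeness, here is a self-contained version of that kernel with the horizontal reduction and the scalar tail loop written out (the method name `Dot` is mine; `Vector.Sum` requires .NET 6 or later):

```csharp
using System;
using System.Numerics;

public static class SimdKernels
{
    // Vectorized dot product: SIMD main loop, horizontal reduction, scalar tail.
    // On AVX-512 hardware Vector<float>.Count is 16, so 784 inputs take 49 iterations.
    public static float Dot(ReadOnlySpan<float> input, ReadOnlySpan<float> wRow)
    {
        int vCount = Vector<float>.Count;
        var vSum = Vector<float>.Zero;
        int i = 0;

        // Main SIMD loop: multiply vCount pairs per iteration, accumulate per lane.
        for (; i <= input.Length - vCount; i += vCount)
        {
            var vIn = new Vector<float>(input.Slice(i));
            var vW = new Vector<float>(wRow.Slice(i));
            vSum += vIn * vW;
        }

        // Horizontal reduction: collapse the lane-wise accumulator to a scalar.
        float sum = Vector.Sum(vSum);

        // Scalar tail for lengths not divisible by vCount.
        for (; i < input.Length; i++)
            sum += input[i] * wRow[i];

        return sum;
    }
}
```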

🧠 Optimization #3: Weight Transposition

Memory access patterns are just as important as CPU cycles. Standard weights are often stored as [Input, Output]. For a single inference (Batch=1), this results in "strided" memory access, which kills the CPU cache performance.

During the Eval() setup, Overfit pre-transposes the weights to [Output, Input]. This ensures that when we calculate a neuron's output, we read the weights sequentially. This maximizes L1/L2 cache hits and allows the CPU's prefetcher to work at 100% efficiency.

private void RebuildTransposedWeights()
{
    _weightsTransposed = new FastTensor<float>(_outputSize, _inputSize);
    // Transpose: W[inputSize, outputSize] -> W_T[outputSize, inputSize]
    // Resulting in sequential rows for high-speed Vector.Dot operations
}
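The transpose body is elided above; over flat row-major arrays (simplified types and naming of my own, not Overfit's internals), a straightforward implementation looks like this:

```csharp
using System;

public static class WeightLayout
{
    // Transpose W[input, output] (row-major) into W_T[output, input] so that
    // each neuron's weights become one contiguous row for sequential reads.
    public static float[] Transpose(ReadOnlySpan<float> weights, int inputSize, int outputSize)
    {
        var transposed = new float[outputSize * inputSize];
        for (int i = 0; i < inputSize; i++)
            for (int o = 0; o < outputSize; o++)
                transposed[o * inputSize + i] = weights[i * outputSize + o];
        return transposed;
    }
}
```

This is a one-time cost paid during Eval() setup, after which every inference reads each weight row front to back, exactly the access pattern the prefetcher is built for.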

🎯 Why This Matters

This isn't just about winning a benchmark. It’s about predictability.

When you eliminate heap allocations:

  1. Jitter disappears: Your P99.9 latency becomes almost identical to your P50.
  2. Zero GC Pressure: You can run millions of inferences per second without ever triggering a Garbage Collection cycle.
  3. Hardware Saturation: You are finally using the hardware you paid for (AVX-512) instead of wasting cycles on marshaling data between managed and unmanaged memory.

Check out the Project

Overfit is open-source (AGPLv3) and designed for developers who need extreme performance in .NET. Whether you are in FinTech, GameDev, or Edge AI, it’s time to stop settling for "managed overhead."

πŸ‘‰ GitHub Repository: https://github.com/DevOnBike/Overfit


What’s your experience with micro-optimizations in .NET? Let’s discuss in the comments!
