DevOnBike
πŸš€ 8x Faster Than ONNX Runtime: Zero-Allocation AI Inference in Pure C#

The Myth: "C# is too slow for AI"

For years, the narrative has been the same: if you want high-performance AI, you must use C++ or Python wrappers (like PyTorch/ONNX) that call into native kernels. The common belief is that the Garbage Collector (GC) and the overhead of the "managed" environment make C# unsuitable for ultra-low latency inference.

I decided to challenge that.

By leveraging the latest features in .NET 10, AVX-512 instructions, and strict Zero-Allocation patterns, I built Overfit — an inference engine that runs roughly 8x faster than ONNX Runtime in micro-inference tasks.


πŸ“Š The Results: 432 Nanoseconds

The following benchmark compares Overfit against Microsoft.ML.OnnxRuntime. While ONNX is a powerhouse for large models, its overhead becomes a bottleneck for micro-inference.

Environment:

  • CPU: AMD Ryzen 9 9950X3D (Zen 5, AVX-512)
  • Runtime: .NET 10.0 (X64 RyuJIT x86-64-v4)
  • Task: Linear Layer Inference (784 -> 10 units)
| Method                       | Mean Latency | Allocated | Ratio |
|------------------------------|--------------|-----------|-------|
| Overfit (ZeroAlloc)          | 432.0 ns     | 0 B       | 0.12  |
| ONNX Runtime (Pre-allocated) | 3,571.8 ns   | 912 B     | 1.00  |
| ONNX Runtime (Full-alloc)    | 3,581.0 ns   | 1,128 B   | 1.24  |

In the time it takes ONNX Runtime to complete one prediction, Overfit completes eight. More importantly, Overfit does it with zero bytes allocated on the heap.


πŸ› οΈ Optimization #1: Persistent Buffers (The Death of GC)

The biggest killer of "Tail Latency" (P99.9) in .NET is the Garbage Collector. Even a small allocation of ~1KB per call triggers Gen-0 collections under heavy load. In high-frequency trading (HFT) or real-time game engines, a GC pause is a disaster.
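To make that pressure concrete, here is a small self-contained demo (not Overfit code; all names are illustrative) that counts Gen-0 collections for an allocating hot path versus one that writes into a caller-supplied, pre-allocated buffer:

```csharp
using System;

public static class GcPressureDemo
{
    // Allocating version: a fresh ~1 KB buffer per call feeds Gen-0.
    public static float[] InferAllocating(float[] input)
    {
        var output = new float[256]; // 1 KB allocated on every call
        for (int i = 0; i < output.Length; i++)
            output[i] = input[i % input.Length] * 0.5f;
        return output;
    }

    // Zero-allocation version: the caller supplies a persistent buffer.
    public static void InferPooled(ReadOnlySpan<float> input, Span<float> output)
    {
        for (int i = 0; i < output.Length; i++)
            output[i] = input[i % input.Length] * 0.5f;
    }

    public static void Main()
    {
        var input = new float[16];
        var persistent = new float[256];

        int gen0Before = GC.CollectionCount(0);
        for (int i = 0; i < 1_000_000; i++)
            InferAllocating(input);
        Console.WriteLine($"Allocating: {GC.CollectionCount(0) - gen0Before} Gen-0 collections");

        gen0Before = GC.CollectionCount(0);
        for (int i = 0; i < 1_000_000; i++)
            InferPooled(input, persistent);
        Console.WriteLine($"Pooled: {GC.CollectionCount(0) - gen0Before} Gen-0 collections");
    }
}
```

The allocating loop churns through roughly a gigabyte of short-lived garbage; the pooled loop reports zero collections, which is exactly the property Overfit's persistent buffers rely on.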

In Overfit, we use Persistent Inference Buffers. When the model is switched to Eval() mode, all necessary tensors are pre-allocated.

public AutogradNode Forward(ComputationGraph graph, AutogradNode input)
{
    // In Eval mode, we skip building the computation graph
    if (graph == null || !IsTraining)
    {
        // Compute directly into the pre-allocated persistent buffer
        LinearInferenceSimd(
            input.Data.AsReadOnlySpan(),
            _weightsTransposed.AsReadOnlySpan(),
            Biases.Data.AsReadOnlySpan(),
            _inferenceOutputNode.Data.AsSpan());

        return _inferenceOutputNode; // Zero-allocation return
    }

    return TensorMath.Linear(graph, input, Weights, Biases);
}
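The Eval() setup that makes this path possible is not shown above. Here is a minimal, self-contained sketch of the pattern (my own reconstruction with simplified types, not Overfit's actual API): the output buffer is allocated exactly once when the layer leaves training mode, and every subsequent Forward() writes into it:

```csharp
using System;

// Sketch of the persistent-buffer pattern; names and types are simplified.
public sealed class PersistentLinear
{
    private readonly float[] _weightsTransposed; // [output, input], row-major
    private readonly float[] _biases;
    private readonly int _inputSize, _outputSize;
    private float[] _inferenceOutput; // persistent buffer, created in Eval()

    public bool IsTraining { get; private set; } = true;

    public PersistentLinear(int inputSize, int outputSize)
    {
        _inputSize = inputSize;
        _outputSize = outputSize;
        _weightsTransposed = new float[outputSize * inputSize];
        _biases = new float[outputSize];
    }

    public void Eval()
    {
        IsTraining = false;
        // Allocated exactly once; Forward() never allocates after this.
        _inferenceOutput ??= new float[_outputSize];
    }

    // Must be called after Eval(); returns a view over the persistent buffer.
    public ReadOnlySpan<float> Forward(ReadOnlySpan<float> input)
    {
        var output = _inferenceOutput.AsSpan();
        for (int o = 0; o < _outputSize; o++)
        {
            var row = _weightsTransposed.AsSpan(o * _inputSize, _inputSize);
            float sum = _biases[o];
            for (int i = 0; i < _inputSize; i++)
                sum += input[i] * row[i];
            output[o] = sum;
        }
        return output; // zero allocations on this path
    }
}
```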

⚑ Optimization #2: SIMD & AVX-512

Modern CPUs are vector machines. The Ryzen 9 9950X3D features 512-bit registers, so .NET's Vector&lt;float&gt; widens to 16 lanes there and we can process 16 floats in a single CPU instruction.

For a layer with 784 inputs, a standard scalar loop does 784 multiplications. Our SIMD kernel does only 49 iterations (784 / 16 = 49), drastically reducing the CPU cycles required for the same mathematical operation.

// Core SIMD loop using Vector<float>
for (; i <= inputSize - vCount; i += vCount)
{
    var vIn = new Vector<float>(input.Slice(i));
    var vW = new Vector<float>(wRow.Slice(i));

    // Fused Multiply-Add equivalent in registers
    sum += vIn * vW; 
}

// Sum reduction and tail loop follow to handle remaining elements...
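For completeness, here is a self-contained version of that kernel with the horizontal reduction and the scalar tail loop written out (the method name `Dot` is mine; `Vector.Sum` requires .NET 6 or later):

```csharp
using System;
using System.Numerics;

public static class SimdKernels
{
    // Vectorized dot product: SIMD main loop, horizontal reduction, scalar tail.
    // On AVX-512 hardware Vector<float>.Count is 16, so 784 inputs take 49 iterations.
    public static float Dot(ReadOnlySpan<float> input, ReadOnlySpan<float> wRow)
    {
        int vCount = Vector<float>.Count;
        var vSum = Vector<float>.Zero;
        int i = 0;

        // Main SIMD loop: multiply vCount pairs per iteration, accumulate per lane.
        for (; i <= input.Length - vCount; i += vCount)
        {
            var vIn = new Vector<float>(input.Slice(i));
            var vW = new Vector<float>(wRow.Slice(i));
            vSum += vIn * vW;
        }

        // Horizontal reduction: collapse the lane-wise accumulator to a scalar.
        float sum = Vector.Sum(vSum);

        // Scalar tail for lengths not divisible by vCount.
        for (; i < input.Length; i++)
            sum += input[i] * wRow[i];

        return sum;
    }
}
```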

🧠 Optimization #3: Weight Transposition

Memory access patterns are just as important as CPU cycles. Standard weights are often stored as [Input, Output]. For a single inference (Batch=1), this results in "strided" memory access, which kills the CPU cache performance.

During the Eval() setup, Overfit pre-transposes the weights to [Output, Input]. This ensures that when we calculate a neuron's output, we read the weights sequentially. This maximizes L1/L2 cache hits and allows the CPU's prefetcher to work at 100% efficiency.

private void RebuildTransposedWeights()
{
    _weightsTransposed = new FastTensor<float>(_outputSize, _inputSize);
    // Transpose: W[inputSize, outputSize] -> W_T[outputSize, inputSize]
    // Resulting in sequential rows for high-speed Vector.Dot operations
}
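The transpose body is elided above; over flat row-major arrays (simplified types and naming of my own, not Overfit's internals), a straightforward implementation looks like this:

```csharp
using System;

public static class WeightLayout
{
    // Transpose W[input, output] (row-major) into W_T[output, input] so that
    // each neuron's weights become one contiguous row for sequential reads.
    public static float[] Transpose(ReadOnlySpan<float> weights, int inputSize, int outputSize)
    {
        var transposed = new float[outputSize * inputSize];
        for (int i = 0; i < inputSize; i++)
            for (int o = 0; o < outputSize; o++)
                transposed[o * inputSize + i] = weights[i * outputSize + o];
        return transposed;
    }
}
```

This is a one-time cost paid during Eval() setup, after which every inference reads each weight row front to back, exactly the access pattern the prefetcher is built for.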

🎯 Why This Matters

This isn't just about winning a benchmark. It’s about predictability.

When you eliminate heap allocations:

  1. Jitter disappears: Your P99.9 latency becomes almost identical to your P50.
  2. Zero GC Pressure: You can run millions of inferences per second without ever triggering a Garbage Collection cycle.
  3. Hardware Saturation: You are finally using the hardware you paid for (AVX-512) instead of wasting cycles on marshaling data between managed and unmanaged memory.

Check out the Project

Overfit is open-source (AGPLv3) and designed for developers who need extreme performance in .NET. Whether you are in FinTech, GameDev, or Edge AI, it’s time to stop settling for "managed overhead."

πŸ‘‰ GitHub Repository: https://github.com/DevOnBike/Overfit


What’s your experience with micro-optimizations in .NET? Let’s discuss in the comments!
