The Myth: "C# is too slow for AI"
For years, the narrative has been the same: if you want high-performance AI, you must use C++ or Python wrappers (like PyTorch/ONNX) that call into native kernels. The common belief is that the Garbage Collector (GC) and the overhead of the "managed" environment make C# unsuitable for ultra-low latency inference.
I decided to challenge that.
By leveraging the latest features in .NET 10, AVX-512 instructions, and strict Zero-Allocation patterns, I built Overfit, an inference engine that is roughly 8x faster than ONNX Runtime in micro-inference tasks.
The Results: 432 Nanoseconds
The following benchmark compares Overfit against Microsoft.ML.OnnxRuntime. While ONNX Runtime is a powerhouse for large models, its per-call overhead becomes a bottleneck for micro-inference.
Environment:
- CPU: AMD Ryzen 9 9950X3D (Zen 5, AVX-512)
- Runtime: .NET 10.0 (X64 RyuJIT x86-64-v4)
- Task: Linear Layer Inference (784 -> 10 units)
| Method | Mean Latency | Allocated | Ratio |
|---|---|---|---|
| Overfit (ZeroAlloc) | 432.0 ns | 0 B | 0.12 |
| ONNX Runtime (Pre-allocated) | 3,571.8 ns | 912 B | 1.00 |
| ONNX Runtime (Full-alloc) | 3,581.0 ns | 1,128 B | 1.24 |
In the time it takes ONNX Runtime to complete one prediction, Overfit completes eight. More importantly, Overfit does it with zero bytes allocated on the heap.
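You can sanity-check the "0 B allocated" claim yourself without a full BenchmarkDotNet setup; the BCL's `GC.GetAllocatedBytesForCurrentThread` is enough. A minimal sketch (the scalar layer below is an illustrative stand-in, not the Overfit kernel):

```csharp
using System;

// Quick sanity check of the "0 B allocated" claim using only the BCL.
public static class AllocCheck
{
    // Scalar forward pass writing into a caller-supplied, reused buffer.
    public static void Forward(float[] input, float[] wT, float[] output)
    {
        for (int o = 0; o < output.Length; o++)
        {
            float sum = 0f;
            int row = o * input.Length;
            for (int i = 0; i < input.Length; i++)
                sum += input[i] * wT[row + i];
            output[o] = sum;
        }
    }

    // Returns managed bytes allocated by `iterations` forward passes (after warm-up).
    public static long MeasureAllocations(int iterations)
    {
        var input = new float[784];
        var wT = new float[10 * 784];
        var output = new float[10];
        Forward(input, wT, output); // warm-up: JIT the method before measuring

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int n = 0; n < iterations; n++)
            Forward(input, wT, output);
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

`MeasureAllocations(1_000)` should report 0 bytes: every buffer is created up front and the hot loop only touches pre-allocated arrays.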
Optimization #1: Persistent Buffers (The Death of GC)
The biggest killer of "Tail Latency" (P99.9) in .NET is the Garbage Collector. Even a small allocation of ~1KB per call triggers Gen-0 collections under heavy load. In high-frequency trading (HFT) or real-time game engines, a GC pause is a disaster.
In Overfit, we use Persistent Inference Buffers. When the model is switched to Eval() mode, all necessary tensors are pre-allocated.
public AutogradNode Forward(ComputationGraph graph, AutogradNode input)
{
    // In Eval mode, we skip building the computation graph
    if (graph == null || !IsTraining)
    {
        // Compute directly into the pre-allocated persistent buffer
        LinearInferenceSimd(
            input.Data.AsReadOnlySpan(),
            _weightsTransposed.AsReadOnlySpan(),
            Biases.Data.AsReadOnlySpan(),
            _inferenceOutputNode.Data.AsSpan());
        return _inferenceOutputNode; // Zero-allocation return
    }
    return TensorMath.Linear(graph, input, Weights, Biases);
}
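The pattern generalizes beyond Overfit's internals. Here is a minimal, self-contained sketch of a persistent-buffer linear layer; all names are illustrative, not the actual Overfit API:

```csharp
using System;

// Minimal sketch of the persistent-buffer pattern (illustrative, not Overfit's API).
public sealed class TinyLinear
{
    private readonly float[] _weightsT; // [output, input], pre-transposed
    private readonly float[] _biases;
    private readonly float[] _output;   // persistent inference buffer
    private readonly int _inputSize;

    public TinyLinear(float[] weightsT, float[] biases, int inputSize)
    {
        _weightsT = weightsT;
        _biases = biases;
        _inputSize = inputSize;
        _output = new float[biases.Length]; // allocated once, at "Eval" setup
    }

    // Hot path: every call writes into the same buffer -> zero allocations.
    public ReadOnlySpan<float> Forward(ReadOnlySpan<float> input)
    {
        for (int o = 0; o < _output.Length; o++)
        {
            float sum = _biases[o];
            var row = _weightsT.AsSpan(o * _inputSize, _inputSize);
            for (int i = 0; i < _inputSize; i++)
                sum += input[i] * row[i];
            _output[o] = sum;
        }
        return _output;
    }
}
```

Returning a `ReadOnlySpan<float>` over the internal buffer means the caller sees fresh results each call without any copy or allocation.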
Optimization #2: SIMD & AVX-512
Modern CPUs are vector machines. The Ryzen 9 9950X3D has full 512-bit data paths for AVX-512. With AVX-512 enabled, .NET's variable-width Vector<float> spans 16 floats, so a single instruction processes 16 values at once.
For a layer with 784 inputs, a standard scalar loop performs 784 multiply-adds per output neuron. The SIMD kernel needs only 49 iterations (784 / 16 = 49), drastically reducing the CPU cycles spent on the same math.
// Core SIMD loop using Vector<float>
// (i, vCount = Vector<float>.Count, and the vector accumulator `sum` are declared above)
for (; i <= inputSize - vCount; i += vCount)
{
    var vIn = new Vector<float>(input.Slice(i));
    var vW = new Vector<float>(wRow.Slice(i));
    // Multiply-accumulate in registers; the JIT can fuse this into an FMA
    sum += vIn * vW;
}
// A horizontal sum reduction and a scalar tail loop handle the remaining elements...
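For reference, the same kernel shape as a complete, self-contained dot product, including the reduction and tail loop elided above (a sketch in the same spirit, not the Overfit source):

```csharp
using System;
using System.Numerics;

public static class SimdDot
{
    // Vectorized dot product: wide loop, horizontal reduction, scalar tail.
    public static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        int vCount = Vector<float>.Count;   // 16 with AVX-512, 8 with AVX2
        var acc = Vector<float>.Zero;
        int i = 0;
        for (; i <= a.Length - vCount; i += vCount)
            acc += new Vector<float>(a.Slice(i)) * new Vector<float>(b.Slice(i));

        float sum = Vector.Sum(acc);        // horizontal reduction of the accumulator
        for (; i < a.Length; i++)           // scalar tail for the leftover elements
            sum += a[i] * b[i];
        return sum;
    }
}
```

Because `Vector<float>.Count` is resolved by the JIT for the actual hardware, the same code runs at 8 lanes on AVX2 machines and 16 lanes where 512-bit vectors are enabled.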
Optimization #3: Weight Transposition
Memory access patterns matter as much as raw CPU cycles. Weights are commonly stored as [Input, Output]. For single-sample inference (Batch=1), reading one neuron's weights then means "strided" memory access, which wrecks cache performance.
During the Eval() setup, Overfit pre-transposes the weights to [Output, Input]. When we compute a neuron's output, its weights are read sequentially, maximizing L1/L2 cache hits and letting the hardware prefetcher stream the data ahead of the loop.
private void RebuildTransposedWeights()
{
_weightsTransposed = new FastTensor<float>(_outputSize, _inputSize);
// Transpose: W[inputSize, outputSize] -> W_T[outputSize, inputSize]
// Resulting in sequential rows for high-speed Vector.Dot operations
}
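The transpose itself is just an index swap. A standalone sketch, assuming flat row-major arrays (the `WeightLayout` name is illustrative):

```csharp
using System;

// Self-contained sketch: re-lay out [input, output] weights as
// [output, input] so each neuron's weights become one contiguous row.
public static class WeightLayout
{
    public static float[] Transpose(float[] w, int inputSize, int outputSize)
    {
        var wT = new float[outputSize * inputSize];
        for (int i = 0; i < inputSize; i++)
            for (int o = 0; o < outputSize; o++)
                wT[o * inputSize + i] = w[i * outputSize + o];
        return wT;
    }
}
```

After the transpose, neuron `o`'s weights are `wT.AsSpan(o * inputSize, inputSize)`: one sequential read instead of `inputSize` strided loads.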
Why This Matters
This isn't just about winning a benchmark. It's about predictability.
When you eliminate heap allocations:
- Jitter disappears: Your P99.9 latency becomes almost identical to your P50.
- Zero GC Pressure: You can run millions of inferences per second without ever triggering a Garbage Collection cycle.
- Hardware Saturation: You are finally using the hardware you paid for (AVX-512) instead of wasting cycles on marshaling data between managed and unmanaged memory.
Check out the Project
Overfit is open-source (AGPLv3) and designed for developers who need extreme performance in .NET. Whether you are in FinTech, GameDev, or Edge AI, it's time to stop settling for "managed overhead."
GitHub Repository: https://github.com/DevOnBike/Overfit
What's your experience with micro-optimizations in .NET? Let's discuss in the comments!