DEV Community

Volatile Delegate

SIMD aggregate performance

Foreword 📝

.NET provides several types, some of them in the System.Runtime.Intrinsics namespace, that let the hardware apply a single instruction to several data elements at once (SIMD).

using System.Runtime.Intrinsics;

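// The non-generic Vector512/256/128 types are static helper classes,
// so a variable needs the generic form with an element type.
Vector512<int> v512;
Vector256<int> v256;
Vector128<int> v128;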

The number suffix (512, 256, 128) indicates the size in bits of the vector that the hardware can process in parallel.
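As a rough illustration of what the suffix means in practice, the sketch below derives the number of elements (lanes) per vector from the bit width and the element size, and prints the counts the runtime itself reports. The class name LaneCounts and the helper LanesFor are made up for this example; only the Count properties and Unsafe.SizeOf come from .NET.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class LaneCounts
{
    // Hypothetical helper: lanes of type T in a vector of vectorBits bits,
    // computed as bits / (8 * size of T in bytes).
    public static int LanesFor<T>(int vectorBits) where T : struct
        => vectorBits / (8 * Unsafe.SizeOf<T>());

    // Prints the lane counts reported by the runtime for a few element types.
    public static void Print()
    {
        Console.WriteLine($"Vector128<int>:    {Vector128<int>.Count} lanes");
        Console.WriteLine($"Vector256<int>:    {Vector256<int>.Count} lanes");
        Console.WriteLine($"Vector256<double>: {Vector256<double>.Count} lanes");
        Console.WriteLine($"Vector256<byte>:   {Vector256<byte>.Count} lanes");
    }
}
```

For 32-bit integers this means a 256-bit vector covers eight elements per operation, which is why the loop later in the article advances by Vector256<T>.Count elements at a time.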

This has a positive impact on operations that compute aggregates, especially loops over large arrays.

To find out whether the hardware supports registers of a given size, we can check the static read-only property IsHardwareAccelerated:

if (Vector256.IsHardwareAccelerated)
{
    _is256 = true;
    ...
}

The code above tests whether 256-bit vector operations are hardware accelerated through JIT intrinsics on the current machine.
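The same check exists for each vector size, so one possible way to pick a code path is to probe from widest to narrowest. This is only a sketch: the class SimdSupport and the method WidestAcceleratedBits are names invented here, and Vector512 requires .NET 8 or later.

```csharp
using System.Runtime.Intrinsics;

public static class SimdSupport
{
    // Hypothetical helper: returns the widest vector size (in bits) that is
    // hardware accelerated on this machine, or 0 when none is.
    public static int WidestAcceleratedBits()
    {
        if (Vector512.IsHardwareAccelerated) return 512;
        if (Vector256.IsHardwareAccelerated) return 256;
        if (Vector128.IsHardwareAccelerated) return 128;
        return 0; // fall back to a plain scalar loop
    }
}
```

A caller could switch on the returned value once at startup and dispatch to a 512-, 256-, or 128-bit implementation, or to a scalar one when the result is 0.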

Exploring 🧠

Suppose we want to simultaneously calculate the maximum and minimum of a sequence of integers using Vector256.

The approach is a loop that walks the input in 256-bit chunks, updating a running minimum and maximum for each chunk:

(T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source) 
    where T : struct, INumber<T>
{

}

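First we initialize references to the current element (current), to the position just past the final element (last), and to the position one full vector before that (to), beyond which a complete 256-bit load no longer fits. We also seed both accumulators with the first vector, which assumes source holds at least Vector256<T>.Count elements: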

ref T current = ref MemoryMarshal.GetReference(source);
ref T last = ref Unsafe.Add(ref current, source.Length);
ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);

Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
Vector256<T> maxElement = minElement;

Then we start the loop. Inside, we load the data in 256-bit chunks by calling Vector256.LoadUnsafe:

 while (Unsafe.IsAddressLessThan(ref current, ref to))
 {
     Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
     minElement = Vector256.Min(minElement, tempElement);
     maxElement = Vector256.Max(maxElement, tempElement);
     current = ref Unsafe.Add(ref current, Vector256<T>.Count);
 }

We use the static Min and Max methods of Vector256, which compare lane by lane, and store the results in minElement and maxElement.

Finally, we advance current by Vector256<T>.Count elements, that is, by 256 bits.
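To make the lane-by-lane behaviour of Vector256.Min and Vector256.Max concrete, here is a small sketch with two hand-written vectors of eight integers. The class LaneWiseDemo, its methods, and the sample values are illustrative only and do not come from the original code.

```csharp
using System.Runtime.Intrinsics;

public static class LaneWiseDemo
{
    // Copies the lanes of a vector into an array so they are easy to inspect.
    public static int[] ToArray(Vector256<int> v)
    {
        var lanes = new int[Vector256<int>.Count];
        for (int i = 0; i < lanes.Length; i++)
        {
            lanes[i] = v.GetElement(i);
        }
        return lanes;
    }

    // Lane i of each result is min(a[i], b[i]) or max(a[i], b[i]).
    public static (int[] Min, int[] Max) MinMaxLanes()
    {
        Vector256<int> a = Vector256.Create(5, -2, 9, 0, 7, 7, -8, 3);
        Vector256<int> b = Vector256.Create(1, 4, 9, -6, 8, 2, -1, 3);

        // Min lanes: 1, -2, 9, -6, 7, 2, -8, 3
        // Max lanes: 5,  4, 9,  0, 8, 7, -1, 3
        return (ToArray(Vector256.Min(a, b)), ToArray(Vector256.Max(a, b)));
    }
}
```

Note that neither call produces a single number: each returns eight independent per-lane results, which is why a final pass over the lanes is still needed after the loop.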

Once the loop ends, minElement and maxElement each hold one candidate per lane, so we reduce them to single values by comparing the lanes one by one:

T min = minElement[0];
T max = maxElement[0];

for (int i = 1; i < Vector256<T>.Count; i++)
{
    T tempMin = minElement[i];
    if (tempMin < min)
    {
        min = tempMin;
    }
    T tempMax = maxElement[i];
    if (tempMax > max)
    {
        max = tempMax;
    }
}

After that, we process any remaining elements one at a time:

while (Unsafe.IsAddressLessThan(ref current, ref last))
{
    if (current < min)
    {
        min = current;
    }
    if (current > max)
    {
        max = current;
    }
    current = ref Unsafe.Add(ref current, 1);
}

And that's all, we return the results:

return (min, max);
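For reference, the pieces above can be assembled into one method roughly as follows. This is a sketch rather than the author's exact code: the class name SimdAggregates, the empty-input check, and the scalar fallback (MinMaxScalar) are assumptions added here so the method is safe to call with short inputs, unsupported element types, or hardware without 256-bit acceleration.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class SimdAggregates
{
    public static (T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source)
        where T : struct, INumber<T>
    {
        if (source.IsEmpty)
        {
            throw new ArgumentException("Sequence contains no elements.", nameof(source));
        }

        // Assumed guard (not in the article): use a plain loop when 256-bit
        // vectors are not accelerated, when T cannot be vectorized, or when
        // the input is shorter than one vector.
        if (!Vector256.IsHardwareAccelerated
            || !Vector256<T>.IsSupported
            || source.Length < Vector256<T>.Count)
        {
            return MinMaxScalar(source);
        }

        ref T current = ref MemoryMarshal.GetReference(source);
        ref T last = ref Unsafe.Add(ref current, source.Length);   // one past the end
        ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);  // last full-vector start

        Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
        Vector256<T> maxElement = minElement;

        // Vector loop: one 256-bit chunk per iteration.
        while (Unsafe.IsAddressLessThan(ref current, ref to))
        {
            Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
            minElement = Vector256.Min(minElement, tempElement);
            maxElement = Vector256.Max(maxElement, tempElement);
            current = ref Unsafe.Add(ref current, Vector256<T>.Count);
        }

        // Reduce the per-lane candidates to single values.
        T min = minElement[0];
        T max = maxElement[0];
        for (int i = 1; i < Vector256<T>.Count; i++)
        {
            if (minElement[i] < min) min = minElement[i];
            if (maxElement[i] > max) max = maxElement[i];
        }

        // Scalar tail: elements the vector loop did not consume.
        while (Unsafe.IsAddressLessThan(ref current, ref last))
        {
            if (current < min) min = current;
            if (current > max) max = current;
            current = ref Unsafe.Add(ref current, 1);
        }

        return (min, max);
    }

    // Hypothetical fallback: straightforward single pass over the span.
    private static (T Min, T Max) MinMaxScalar<T>(ReadOnlySpan<T> source)
        where T : struct, INumber<T>
    {
        T min = source[0];
        T max = source[0];
        for (int i = 1; i < source.Length; i++)
        {
            if (source[i] < min) min = source[i];
            if (source[i] > max) max = source[i];
        }
        return (min, max);
    }
}
```

With an int array named data, a call looks like SimdAggregates.MinMax256<int>(data); the explicit type argument lets the array convert to ReadOnlySpan<int>. The sketch is meant for integer inputs as in the article; floating-point NaN handling is not considered here.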

Benchmark 🔥

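A quick test with BenchmarkDotNet, calculating the maximum and minimum of an array of 10_000 integers, shows the Vector256 version on .NET 8 running more than 146 times faster than the LINQ version on .NET Framework 4.8. Much of that gap comes from the runtime itself: on .NET 8 the LINQ version takes 1,228.518 ns, about 1.5 times as long as the SIMD version.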

💡 Ryzen 7 1700, 1 CPU
.NET SDK=8.0.100-rc.1.23455.8

| Method | Runtime | Mean (ns) |
| --- | --- | --- |
| 🐢 MinMaxLinq | .NET Framework 4.8 | 118,675.226 |
| ⚡ MinMaxSimd | .NET 8.0 | 808.150 |

Farewell

All the code, along with a more elaborate example, is hosted on GitHub. Be happy and love your family 💖

Simd Iteration

Test SIMD 512, 256, 128 registers for fast aggregate calculations.

Unfortunately my hardware doesn't support Vector512.

Anyway, the performance improvement is mind-blowing.

Important

The SIMD version on .NET 8 is more than 146 times faster than the LINQ version on .NET Framework 4.8 at calculating the Min and Max at the same time!

Results

  • BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19044.3086/21H2/November2021Update)
  • AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
  • .NET SDK=8.0.100-rc.1.23455.8
  • [Host] : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
  • .NET 7.0 : .NET 7.0.11 (7.0.1123.42427), X64 RyuJIT AVX2
  • .NET 8.0 : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
  • .NET Framework 4.8 : .NET Framework 4.8 (4.8.4644.0), X64 RyuJIT VectorSize=256
| Method | Runtime | Size | Mean | Allocated |
| --- | --- | --- | --- | --- |
| MinMaxLinq | .NET Framework 4.8 | 10000 | 118,675.226 ns | 65 B |
| MinMaxLinq | .NET 7.0 | 10000 | 2,350.046 ns | - |
| MinMaxLinq | .NET 8.0 | 10000 | 1,228.518 ns | - |
| MinMaxSimd | .NET 7.0 | 10000 | 834.291 ns | - |
| MinMaxSimd | .NET 8.0 | 10000 | 808.150 ns | - |

References

System.Runtime.Intrinsics Namespace | Microsoft Learn

Contains types used to create and convey register states in various sizes and formats for use with instruction-set extensions. For instructions on manipulating these registers, see System.Runtime.Intrinsics.X86 and System.Runtime.Intrinsics.Arm.

learn.microsoft.com

System.Runtime.Intrinsics work planned for .NET 8 #79005

This is a work in progress as we develop our .NET 8 plans. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.

Summary

During .NET 8, we will be focusing on AVX-512, an effort that includes the addition of a new intrinsic type Vector512 as well as Vector<T> improvements. Beyond that major theme, we will invest in quality, enhancements and new APIs. This is an ambitious set of work, so it's likely that several of the items below will be pushed out beyond .NET 8. It is also likely additional items will be added throughout the year.

