DEV Community

Volatile Delegate

Posted on Sep 22, 2023

SIMD aggregate performance

#csharp #dotnet #simd

Foreword 🍓

.NET provides several types, some of them under the System.Runtime.Intrinsics namespace, that let a single hardware instruction operate on several data elements at once (SIMD).

using System.Runtime.Intrinsics;

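// Vector512, Vector256 and Vector128 are static helper classes;
// variables use the generic struct, for example with int elements.
Vector512<int> v512;
Vector256<int> v256;
Vector128<int> v128;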

The number suffix (512, 256, 128) is the width of the vector in bits. A Vector256<int>, for example, holds eight 32-bit integers that the hardware can process with a single instruction.
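To make the relationship between vector width and element type concrete, here is a minimal sketch. The class and method names (LaneCounts, Lanes256) are hypothetical and only used for illustration; the Count property is part of Vector256<T>.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class LaneCounts
{
    // Hypothetical helper: number of T elements that fit in one 256-bit vector.
    // Vector256<T>.Count is 256 divided by the size of T in bits.
    public static int Lanes256<T>() where T : struct => Vector256<T>.Count;
}
```

For instance, LaneCounts.Lanes256<int>() is 8, LaneCounts.Lanes256<double>() is 4 and LaneCounts.Lanes256<byte>() is 32.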

This has a positive impact on operations that compute aggregates, especially loops over large arrays.

To find out whether the hardware supports registers of a given size, we can check the static read-only property IsHardwareAccelerated:

if (Vector256.IsHardwareAccelerated)
{
    _is256 = true;
    ...
}

The code above checks whether the JIT can map 256-bit vector operations to hardware instructions on the current machine.
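The same check exists for each vector size, so a program can pick the widest one available. The following is a small sketch, assuming .NET 8 (Vector512 is not available in earlier versions); the SimdSupport class and WidestAcceleratedBits method are hypothetical names.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class SimdSupport
{
    // Hypothetical helper: returns the widest vector size, in bits, that is
    // hardware accelerated on the current machine, or 0 when none is.
    public static int WidestAcceleratedBits()
    {
        if (Vector512.IsHardwareAccelerated) return 512;
        if (Vector256.IsHardwareAccelerated) return 256;
        if (Vector128.IsHardwareAccelerated) return 128;
        return 0;
    }
}
```

The result depends on the machine, so code that calls it should be prepared for any of the four values.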

Exploring 🧠

Suppose we want to simultaneously calculate the maximum and minimum of a sequence of integers using Vector256.

The process consists of a loop that walks the sequence in 256-bit chunks, updating a running minimum and maximum on each step.

(T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source) 
    where T : struct, INumber<T>
{

}

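First we set up three references: current points at the first element, last points one past the final element, and to marks the last position from which a full 256-bit vector can still be loaded. We also load the first vector as the initial minimum and maximum, which assumes the span holds at least Vector256<T>.Count elements.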

ref T current = ref MemoryMarshal.GetReference(source);
ref T last = ref Unsafe.Add(ref current, source.Length);
ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);

Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
Vector256<T> maxElement = minElement;

Then we start the loop. Inside, we load the data in 256-bit chunks by calling Vector256.LoadUnsafe:

 while (Unsafe.IsAddressLessThan(ref current, ref to))
 {
     Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
     minElement = Vector256.Min(minElement, tempElement);
     maxElement = Vector256.Max(maxElement, tempElement);
     current = ref Unsafe.Add(ref current, Vector256<T>.Count);
 }

We use the static Min and Max methods of Vector256, which compare two vectors element by element, and store the results in minElement and maxElement.

Finally, we advance current by Vector256<T>.Count elements, that is, by 256 bits.
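Because Min and Max work lane by lane, each lane of the result only reflects the elements that passed through that same lane. A short sketch on concrete values shows this; LaneWiseDemo and Combine are hypothetical names, while Vector256.Create, Vector256.Min and Vector256.Max are the framework methods.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class LaneWiseDemo
{
    // Hypothetical helper: returns the element-wise minimum and maximum
    // of two vectors of eight ints.
    public static (Vector256<int> Min, Vector256<int> Max) Combine(
        Vector256<int> a, Vector256<int> b)
        => (Vector256.Min(a, b), Vector256.Max(a, b));
}
```

With a = Vector256.Create(1, 9, 3, 7, 5, 5, 2, 8) and b = Vector256.Create(4), which fills every lane with 4, Combine returns <1, 4, 3, 4, 4, 4, 2, 4> as the minimum and <4, 9, 4, 7, 5, 5, 4, 8> as the maximum. This is why a separate step is still needed afterwards to collapse the lanes into a single value.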

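Once fewer than a full vector's worth of elements remain, the loop ends. At that point minElement and maxElement hold one candidate per lane, so we reduce them to a single minimum and maximum by comparing the lanes one by one: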

T min = minElement[0];
T max = maxElement[0];

for (int i = 1; i < Vector256<T>.Count; i++)
{
    T tempMin = minElement[i];
    if (tempMin < min)
    {
        min = tempMin;
    }
    T tempMax = maxElement[i];
    if (tempMax > max)
    {
        max = tempMax;
    }
}

After that we calculate the remaining elements if any:

while (Unsafe.IsAddressLessThan(ref current, ref last))
{
    if (current < min)
    {
        min = current;
    }
    if (current > max)
    {
        max = current;
    }
    current = ref Unsafe.Add(ref current, 1);
}

And that's all, we return the results:

return (min, max);
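For reference, the snippets above can be assembled into one method roughly as follows. This is a sketch rather than the exact code from the repository: the SimdAggregates class name is hypothetical, and the guard at the top (empty input, unsupported element type, no hardware acceleration, or fewer elements than one vector) is an added assumption, since the initial LoadUnsafe would otherwise read past the end of a short span.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class SimdAggregates
{
    public static (T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source)
        where T : struct, INumber<T>
    {
        if (source.IsEmpty)
            throw new ArgumentException("Sequence contains no elements.", nameof(source));

        // Assumed guard (not in the article's snippets): fall back to a plain
        // loop when the vector path cannot be used safely.
        if (!Vector256.IsHardwareAccelerated || !Vector256<T>.IsSupported
            || source.Length < Vector256<T>.Count)
        {
            T lo = source[0], hi = source[0];
            for (int i = 1; i < source.Length; i++)
            {
                if (source[i] < lo) lo = source[i];
                if (source[i] > hi) hi = source[i];
            }
            return (lo, hi);
        }

        ref T current = ref MemoryMarshal.GetReference(source);
        ref T last = ref Unsafe.Add(ref current, source.Length);    // one past the end
        ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);   // last full-vector start

        Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
        Vector256<T> maxElement = minElement;

        // Vector loop: one 256-bit chunk per iteration.
        while (Unsafe.IsAddressLessThan(ref current, ref to))
        {
            Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
            minElement = Vector256.Min(minElement, tempElement);
            maxElement = Vector256.Max(maxElement, tempElement);
            current = ref Unsafe.Add(ref current, Vector256<T>.Count);
        }

        // Reduce the lanes to a single minimum and maximum.
        T min = minElement[0];
        T max = maxElement[0];
        for (int i = 1; i < Vector256<T>.Count; i++)
        {
            if (minElement[i] < min) min = minElement[i];
            if (maxElement[i] > max) max = maxElement[i];
        }

        // Scalar tail: elements that did not fill a whole vector.
        while (Unsafe.IsAddressLessThan(ref current, ref last))
        {
            if (current < min) min = current;
            if (current > max) max = current;
            current = ref Unsafe.Add(ref current, 1);
        }

        return (min, max);
    }
}
```

Calling SimdAggregates.MinMax256<int>(new[] { 5, -3, 9, 0, 7, 7, 2, 8, 11, -20, 4 }) returns (-20, 11): the first eight values go through the vector path and the last three through the scalar tail.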

Benchmark 🔥

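A quick test with BenchmarkDotNet, calculating the minimum and maximum of an array of 10,000 integers, shows the Vector256 version on .NET 8 running about 146 times faster than the LINQ version on .NET Framework 4.8. On the same runtime the gap is much smaller: in the full results below, LINQ on .NET 8 takes about 1,229 ns against 808 ns for the SIMD version.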

💡 Ryzen 7 1700, 1 CPU
.NET SDK=8.0.100-rc.1.23455.8

Method	Mean (ns)
🐢 MinMaxLinq .NET Framework 4.8	118,675.226
⚡ MinMaxSimd .NET 8.0	808.150

Farewell

All the code, along with a more elaborate example, is hosted on GitHub. Be happy and love your family 💖

NetDefender / SimdIteration

SIMD tests

Simd Iteration

Test SIMD 512, 256, 128 registers for fast aggregate calculations.

Unfortunately my hardware doesn't support Vector512.

Anyway, the performance improvement is mind-blowing.

Important

The SIMD version on .NET 8 is about 146 times faster than LINQ on .NET Framework 4.8 at calculating the Min and Max at the same time!

Results

BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19044.3086/21H2/November2021Update)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-rc.1.23455.8
[Host] : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
.NET 7.0 : .NET 7.0.11 (7.0.1123.42427), X64 RyuJIT AVX2
.NET 8.0 : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
.NET Framework 4.8 : .NET Framework 4.8 (4.8.4644.0), X64 RyuJIT VectorSize=256

Method	Runtime	Size	Mean	Allocated
MinMaxLinq	.NET Framework 4.8	10000	118,675.226 ns	65 B
MinMaxLinq	.NET 7.0	10000	2,350.046 ns	-
MinMaxLinq	.NET 8.0	10000	1,228.518 ns	-
MinMaxSimd	.NET 7.0	10000	834.291 ns	-
MinMaxSimd	.NET 8.0	10000	808.150 ns	-


References

System.Runtime.Intrinsics Namespace | Microsoft Learn

Contains types used to create and convey register states in various sizes and formats for use with instruction-set extensions. For instructions on how to manipulate these registers, see System.Runtime.Intrinsics.X86 and System.Runtime.Intrinsics.Arm.

learn.microsoft.com

System.Runtime.Intrinsics work planned for .NET 8 #79005

dakersnar posted on Nov 29, 2022

This is a work in progress as we develop our .NET 8 plans. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.

Summary

During .NET 8, we will be focusing on AVX-512, an effort that includes the addition of a new intrinsic type Vector512 as well as Vector<T> improvements. Beyond that major theme, we will invest in quality, enhancements and new APIs. This is an ambitious set of work, so it's likely that several of the items below will be pushed out beyond .NET 8. It is also likely additional items will be added throughout the year.
