James Turner

Posted on Mar 4, 2020 • Edited on Apr 21, 2020 • Originally published at turnerj.com

Levenshtein Distance with SIMD (Bonus Part)

#algorithms #performance #csharp #dotnet

This is a bonus part because the other post was already jam-packed with optimizations plus this is a pretty exotic optimization that less developers are likely to directly use.

"Single Instruction, Multiple Data" (SIMD) is a method by which you can operate on a vector of data - allowing for certain mathematical and logic operations to take place on every element in the vector at the same time. This differs from Simultaneous Multithreading (SMT) where threads perform independent instructions - SIMD does a single instruction but to more than one bit of data at once.

To make things more confusing, there is also "Single Instruction, Multiple Threads" (SIMT) which is effectively how modern GPUs operate but that is a topic for a different post.

If you haven't heard of SIMD before, you may have heard of it under specific implementation names such as:

This is a very low level, CPU specific, optimization and is suited for algorithms that can be vectorized. Some instructions operate on 128-bits, some on 256-bits and some can go all the way to 512-bits (AVX-512).

The Levenshtein Distance algorithm isn't exactly a good candidate as processing a single cell relies on the computation of the cells around it. Nevertheless, there are some areas that SIMD instructions will still help us.

To target SIMD instructions in our code, we will be making using of new APIs specifically in .NET though SIMD instructions are most commonly found in lower-level languages like C, C++ or hand-written assembly. In the future, even WebAssembly may support SIMD instructions.

Let's run through a super basic example of a SIMD vector operation, adding numbers across two vectors.

[ 1, 2, 4, 6 ]
  +  +  +  +
[ 3, 5, 3, 2 ]
  =  =  =  =
[ 4, 7, 7, 8 ]

If we look at the individual columns in the above vector calculation, you'll see how it operates (1 + 3 = 4, 2 + 5 = 7, 4 + 3 = 7, 6 + 2 = 8). Assuming those numbers are all 32-bit Signed Integers, that could be the SSE2 Add instruction "PADDD".

Let's look at another example, comparing two vectors.

[  1,  2,  4,  6 ]
  ==  ==  ==  ==
[  4,  2,  4,  9 ]
   =   =   =   =
[  0, -1, -1,  0 ]

What we are getting as a result here is the HIGH and LOW value where HIGH (all bits are 1) means equal and LOW (all bits are 0) means not equal. We are using a 32-bit Signed Integers again here so an "all bits are 1" case means our result is -1. If we used an unsigned number, the value would be the maximum value of that number instead.

Using 32-bit Signed Integers, the instruction to do the vector comparison could be the SSE2 Compare Equals instruction "PCMPEQD".

Where we can actually use SIMD instructions for our Levenshtein Distance optimizations include:

Comparing the beginning and end characters of a string
Filling our initial row with an incrementing sequence
Branchless calculating the minimum of two values

Comparing the beginning and end characters of a string

Put simply, our best comparison/trimming code for the start and end of strings was performing one-character comparisons at a time.

var startIndex = 0;
var sourceEnd = source.Length;
var targetEnd = target.Length;

while (startIndex < sourceEnd && startIndex < targetEnd && source[startIndex] == target[startIndex])
{
    startIndex++;
}
while (startIndex < sourceEnd && startIndex < targetEnd && source[sourceEnd - 1] == target[targetEnd - 1])
{
    sourceEnd--;
    targetEnd--;
}

var sourceLength = sourceEnd - startIndex;
var targetLength = targetEnd - startIndex;

ReadOnlySpan<char> sourceSpan = source;
ReadOnlySpan<char> targetSpan = target;

sourceSpan = sourceSpan.Slice(startIndex, sourceLength);
targetSpan = targetSpan.Slice(startIndex, targetLength);

In .NET, the individual char of a string is a ushort - a 16-bit value. With this in mind, we could compare 8 characters at a time with a 128-bit vector or 16 characters with a 256-bit vector. We'll opt for the 16 character comparison which will utilise AVX2 SIMD instructions.

var charactersAvailableToTrim = Math.Min(sourceEnd, targetEnd);
if (charactersAvailableToTrim >= 16)
{
    fixed (char* sourcePtr = source)
    fixed (char* targetPtr = target)
    {
        var sourceUShortPtr = (ushort*)sourcePtr;
        var targetUShortPtr = (ushort*)targetPtr;

        while (charactersAvailableToTrim >= 16)
        {
            var sectionEquality = Avx2.MoveMask(
                Avx2.CompareEqual(
                    Avx.LoadDquVector256(sourceUShortPtr + startIndex),
                    Avx.LoadDquVector256(targetUShortPtr + startIndex)
                ).AsByte()
            );

            if (sectionEquality != -1)
            {
                break;
            }

            startIndex += 16;
            charactersAvailableToTrim -= 16;
        }

        while (charactersAvailableToTrim >= 16)
        {
            var sectionEquality = Avx2.MoveMask(
                Avx2.CompareEqual(
                    Avx.LoadDquVector256(sourceUShortPtr + (sourceEnd - 16 + 1)),
                    Avx.LoadDquVector256(targetUShortPtr + (targetEnd - 16 + 1))
                ).AsByte()
            );

            if (sectionEquality != -1)
            {
                break;
            }

            sourceEnd -= 16;
            targetEnd -= 16;
            charactersAvailableToTrim -= 16;
        }
    }
}

while (charactersAvailableToTrim > 0 && source[startIndex] == target[startIndex])
{
    charactersAvailableToTrim--;
    startIndex++;
}

while (charactersAvailableToTrim > 0 && source[sourceEnd - 1] == target[targetEnd - 1])
{
    charactersAvailableToTrim--;
    sourceEnd--;
    targetEnd--;
}

There are a lot of things going on in this block of code but let's focus on the most important part for us - the SIMD instructions.

while (charactersAvailableToTrim >= 16)
{
    var sectionEquality = Avx2.MoveMask(
        Avx2.CompareEqual(
            Avx.LoadDquVector256(sourceUShortPtr + startIndex),
            Avx.LoadDquVector256(targetUShortPtr + startIndex)
        ).AsByte()
    );

    if (sectionEquality != -1)
    {
        break;
    }
    startIndex += 16;
    charactersAvailableToTrim -= 16;
}

Our SIMD instructions are:

Avx.LoadDquVector256: Loads a 256-bit vector from a pointer ("VLDDQU")
Avx2.CompareEqual: Our equality comparison of two 256-bit vectors ("VPCMPEQW")
Avx2.MoveMask: Creates a bitmask from the most significant bit of each item in the vector ("VPMOVMSKB")

Basically we are saying "Unless all characters of this 16-character chunk are equal, break from the loop".

Because we are operating in 16-character blocks, we will still fall back to individual character comparisons so we can have the smallest strings to compare in the main Levenshtein Distance calculating code.

Filling our initial row with an incrementing sequence

In Part 2, we showed how there is a simple loop that fills in the initial row of data.

for (var i = 1; i <= target.Length; ++i)
{
    previousRow[i] = i;
}

But we can go faster, much faster...

var columnIndex = 0;
var columnsRemaining = previousRow.Length;

fixed (int* previousRowPtr = previousRow)
{
    var lastVector256 = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
    var shiftVector256 = Vector256.Create(8);

    while (columnsRemaining >= 8)
    {
        columnsRemaining -= 8;
        Avx.Store(previousRowPtr + columnIndex, lastVector256);
        lastVector256 = Avx2.Add(lastVector256, shiftVector256);
        columnIndex += 8;
    }

    if (columnsRemaining > 4)
    {
        columnsRemaining -= 4;
        previousRowPtr[columnIndex] = ++columnIndex;
        previousRowPtr[columnIndex] = ++columnIndex;
        previousRowPtr[columnIndex] = ++columnIndex;
        previousRowPtr[columnIndex] = ++columnIndex;
    }

    while (columnsRemaining > 0)
    {
        columnsRemaining--;
        previousRowPtr[columnIndex] = ++columnIndex;
    }
}

Our implementation now is partially using SIMD and is partially unrolled. Because our previousRow is an array of 32-bit Integers, we are only doing 8 characters at a time on a 256-bit vector.

Focusing on the SIMD instructions:

var lastVector256 = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
var shiftVector256 = Vector256.Create(8);

while (columnsRemaining >= 8)
{
    columnsRemaining -= 8;
    Avx.Store(previousRowPtr + columnIndex, lastVector256);
    lastVector256 = Avx2.Add(lastVector256, shiftVector256);
    columnIndex += 8;
}

We create an initial vector of the first 8 values (lastVector256), we get a vector we want to increment by (shiftVector256) and we simply work up the remaining columns of previousRow 8 items at a time.

Our SIMD instructions:

Avx.Store: Save a 256-bit vector to the specified pointer location ("VMOVDQA")
Avx2.Add: Adds two 256-bit vectors together, returning the result ("VPADDD")

Branchless calculating the minimum of two values

In Part 3, we covered cutting out branches by re-organizing code or unrolling loops. We can actually take it slightly further using SIMD instructions by abusing/misusing how a SIMD Min comparison works.

Here is our last version of the calculating code:

localCost = previousDiagonal;
deletionCost = previousRow[j];
if (sourceChar != target[j - 1])
{
    localCost = Math.Min(previousColumn, localCost);
    localCost = Math.Min(deletionCost, localCost);
    localCost++;
}
previousColumn = localCost;
previousRow[j++] = localCost;
previousDiagonal = deletionCost;

We can't eliminate the if-statement - I've tried, multiple times - but we can eliminate the branches that would occur as part of Math.Min. To be clear, Math.Min isn't a slow function and by itself, it should perform faster than what we are about to do however when there are two in a row, we can start to benefit from our optimization.

To gain the most performance out of this, we don't want to jump to and from vectors - the more operations we can do while it is a vector, the better it will be for our performance. This is the latency-vs-throughput potential with SIMD instructions.

localCostVector = previousDiagonalVector;
deletionCostVector = Vector128.Create(previousRowPtr[columnIndex]);
if (sourcePrevChar != targetPtr[columnIndex])
{
    localCostVector = Sse2.Add(
        Sse41.Min(
            Sse41.Min(
                previousColumnVector,
                localCostVector
            ),
            deletionCostVector
        ),
        allOnesVector
    );
}
previousColumnVector = localCostVector;
previousRowPtr[columnIndex++] = localCostVector.GetElement(0);
previousDiagonalVector = deletionCostVector;

While most of the variables are the same with just "Vector" at the end, there are some differences to know about.

deletionCostVector = Vector128.Create(previousRowPtr[columnIndex]);

This creates a 128-bit vector of the value at that specific pointer. That is to say, it is only a single value that is in the vector 4 times.

localCostVector = Sse2.Add(
    Sse41.Min(
        Sse41.Min(
            previousColumnVector,
            localCostVector
        ),
        deletionCostVector
    ),
    allOnesVector
);

Like the original code, we are doing two Math.Min equivalent operations and then adding 1 to the result. The vector allOnesVector is aptly named because the vector only contains the number 1 in all positions.

previousRowPtr[columnIndex++] = localCostVector.GetElement(0);

Finally, we end with taking only the first element from the vector and storing it at our specific column.

You might realise now why I said we were abusing/misusing SIMD here - we don't actually care about how big the vector is as we simply don't use anything more than one item in the vector. This is because we are not taking advantage of the vector, we are taking advantage that Sse41.Min makes our code effectively branchless in the minimum value comparison. With the constraint that the Levenshtein Distance calculation relies on previously calculated cells, I can't see a situation where you can make use of the full vector to speed up the calculations further.

Our SIMD instructions:

Sse41.Min: Gets the minimum of each value in the vector ("PMINUD")
Sse2.Add: Adds the values across the vectors ("PADDD")

Caveats of SIMD Instructions

Besides that the data and algorithm needing to support vectorization, it isn't a magic bullet. When digging this deep for performance, you'll be looking at the latency and throughput of instructions.

The latency and throughput of instructions will differ per CPU model.

The best resource that I have found that digs into the latency and throughput for specific models are the optimization manuals by Agner. That said, Intel does have some information about them on their Intrinsics Guide.

It is important for you to benchmark your code to see if it is an appropriate optimization for your own code and the machines you are targeting.