Hi! Let's say you have a time sensitive application. Maybe you have a lot of data that you need to process quickly, or maybe you are trying to write code that is very fast.
It may be possible to make your code much more performant.
How so?
With the help of vectorization!
There's a chance you are running a very big loop and running the same set of instructions on all your data.
What if we could shrink this loop a lot by processing chunks of it in a single step?
In fact, if you've ever used Python, fast processing libraries like numpy tend to use vectorized instructions under the hood to handle large amounts of data faster.
Before I show you how to steal the moon.... ahem... I mean vectorize your code, please drop any questions you have in the comments below!
Vectorized Instructions (SIMD)
SIMD stands for Single Instruction Multiple Data. Okay let's explore some instructions.
Before we look at instructions, I must say that every computer is different. Every CPU has a different architecture, so some may support vectorized instructions while others may not.
Lucky for us, most CPUs these days use the x86, x86-64, or ARM architectures, and all of these support SIMD instructions (even the Apple M1 chips, I believe).
How do SIMD instructions work?
Good question. (If you're lazy and want to use SIMD instructions without knowing much about how they work, jump to the I'm Lazy section lol).
If you ever took a computer architecture course, you may have heard of these things called registers.
Registers are like memory holders for tiny pieces of data. Generally, a lot of the usual ones your compiler uses are 64 bit or 32 bit registers.
This is for several reasons:
- x86-64 means an x86 architecture with 64 bit instructions; plain x86 generally uses 32 bit instructions
- Memory addresses for modern computers are addressable with 64 bit addresses
- The largest data types most languages support are 64 bits (uint64_t, double, long)
However, computer architectures have been supporting larger and larger registers for things like vectorization: generally 128 bit, 256 bit, and even 512 bit registers.
x86/x86-64
These are the registers generally used for SIMD on x86 architectures:
- mm0-mm7: 64 bit registers for SIMD
- xmm0-xmm15: 128 bit registers for SIMD
- ymm0-ymm15: 256 bit registers for SIMD
- zmm0-zmm31: 512 bit registers for SIMD
The registers and operations available to you on x86 largely depend on the support your CPU has. Here are the SIMD extensions CPUs have picked up over the years:
- MMX: 64 bit registers and SIMD instructions, the oldest (1997)
- SSE: 128 bit registers and 70 new instructions (1999)
- SSE2: 144 new instructions for the 128 bit registers (2000)
- SSE3: 13 new instructions, including horizontal add/subtract (2004)
- SSSE3: 16 new instructions (each usable on the MMX and XMM registers) extending MMX and SSE (2006)
- AVX: 256 bit vector registers (2011)
- AVX-512: 512 bit vector registers (2016)
- AMX: 8192 bit tile registers (tmm0-tmm7) (2023)
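(By the way, if you want a quick way to check which of these extensions your own CPU actually has, here is a minimal sketch using the __builtin_cpu_supports builtin that gcc and clang provide on x86. The feature names I check here are just the ones I picked; use whichever you care about.)

#include <iostream>

int main() {
    // __builtin_cpu_supports checks a CPU feature at runtime (gcc/clang builtin, x86 only)
    std::cout << "SSE2:    " << (__builtin_cpu_supports("sse2")    ? "yes" : "no") << "\n";
    std::cout << "SSE4.2:  " << (__builtin_cpu_supports("sse4.2")  ? "yes" : "no") << "\n";
    std::cout << "AVX:     " << (__builtin_cpu_supports("avx")     ? "yes" : "no") << "\n";
    std::cout << "AVX2:    " << (__builtin_cpu_supports("avx2")    ? "yes" : "no") << "\n";
    std::cout << "AVX512F: " << (__builtin_cpu_supports("avx512f") ? "yes" : "no") << "\n";
}

(On Linux you can also just grep /proc/cpuinfo for the same flags.)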
How do you use them??? No need to fear, superman is here.
There are special functions for your architecture that you can use in C/C++ called intrinsics. There are intrinsics to vectorize adds, vectorize multiplies, etc.
To use the intrinsics on x86 Intel CPUs, all you have to do to vectorize your code is:
- Include one of the following header files based on the intrinsics you want to use, for example:
  - <mmintrin.h>: (MMX)
  - <xmmintrin.h>: (SSE)
  - <emmintrin.h>: (SSE2)
  - <pmmintrin.h>: (SSE3)
  - <immintrin.h>: (AVX/AVX-512)
- Use one of these intrinsics, here is a list: here (a tiny example is sketched right after this list)
- Finally, compile with the matching flag on gcc, for example:
  - -mmmx: (MMX)
  - -msse: (SSE)
  - -msse3: (SSE3)
  - -mavx: (AVX)
  - -mavx512f: (AVX-512)
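To make that concrete, here is a minimal sketch (not pulled from any library, just an illustration assuming an x86-64 CPU with SSE2) that adds two small arrays of ints four at a time:

#include <emmintrin.h> // SSE2 intrinsics
#include <iostream>

int main() {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int c[8];

    // Each iteration handles 4 ints (128 bits) at once
    for (int i = 0; i < 8; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i)); // load 4 ints
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i vc = _mm_add_epi32(va, vb);                    // 4 adds in one instruction
        _mm_storeu_si128((__m128i*)(c + i), vc);               // store 4 results
    }

    for (int v : c) std::cout << v << " ";
    std::cout << std::endl;
}

Compile it with something like g++ -o simd_add simd_add.cpp -msse2 (the file name is just a placeholder, and on x86-64 SSE2 is actually enabled by default anyway).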
ARM
ARM is the other popular architecture for CPUs, commonly used in mobile devices. It also supports SIMD.
It uses registers called NEON registers, but the idea is similar:
- D0-D31: 64 bit registers for SIMD
- Q0-Q15: 128 bit registers for SIMD
To use these ARM vectorized instructions, you would have to do the following in C/C++:
- #include <arm_neon.h> at the top of your file
- Use the intrinsic functions (like vaddq_u32) from here
- Compile your program with the gcc flag -mfpu=neon (on 32 bit ARM; on 64 bit ARM/AArch64, NEON is enabled by default)
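And here is the same kind of minimal sketch with NEON intrinsics (again just an illustration, assuming an ARM CPU with NEON support):

#include <arm_neon.h> // NEON intrinsics
#include <cstdint>
#include <iostream>

int main() {
    uint32_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint32_t b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    uint32_t c[8];

    // Each iteration handles 4 unsigned ints (128 bits, one Q register) at once
    for (int i = 0; i < 8; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);   // load 4 unsigned ints
        uint32x4_t vb = vld1q_u32(b + i);
        uint32x4_t vc = vaddq_u32(va, vb);  // 4 adds in one instruction
        vst1q_u32(c + i, vc);               // store 4 results
    }

    for (uint32_t v : c) std::cout << v << " ";
    std::cout << std::endl;
}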
I'm Lazy
Okay lazy boy. Or girl lol.
If you don't want to think about x86 or ARM, compilers are made powerful just for you.
You can let your compiler automagically figure out your architecture and compile your code with SIMD instructions.
Let's keep it short, but all you have to do is compile like this:
g++ -o test test.cpp -O3 -ftree-vectorize -march=native
- -O3: aggressive optimization; technically you just need -O2 or higher for general optimization, but -O3 already includes -ftree-vectorize, so passing -ftree-vectorize on top of it is redundant.
- -ftree-vectorize: in case you skip -O3 or use a lower optimization level, this flag explicitly turns on vectorization so you can still get SIMD instructions in otherwise less optimized code.
- -march=native: if you use -ftree-vectorize or -O3 without this flag, the compiler will only use a default set of vectorized instructions (up to SSE2 for x86-64). Including this flag lets the compiler use the full inventory of SIMD instructions your CPU supports.
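One more tip: if you want to double check that the compiler actually vectorized your loops, gcc has a -fopt-info-vec flag that reports which loops it vectorized (and -fopt-info-vec-missed reports the ones it couldn't). For example:

g++ -o test test.cpp -O3 -march=native -fopt-info-vec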
Comparing speeds
Let's look at this simple sample program lol, adding two arrays of 30 ints together element by element.
#include <iostream>
#include <cstdint>

uint64_t rdtsc(){
    // rdtsc returns the timestamp counter split across edx:eax
    uint32_t lo, hi;
    __asm__ volatile(
        "rdtsc"
        : "=a" (lo), "=d" (hi)
    );
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main(int argc, char** argv){
    int a[30] = {1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5};
    int b[30] = {5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,5,4,3,2,1,2,3,4,5,4,3,2,1,2,3};
    int c[30];

    volatile uint64_t start = rdtsc();
    for(size_t i = 0; i < 30; i++){
        c[i] = a[i] + b[i];
    }
    volatile uint64_t end = rdtsc();

    std::cout << end-start << " cycles" << std::endl;
    for(auto v: c){
        std::cout << v << " ";
    }
    std::cout << std::endl;
}
Here, I use rdtsc() to time my program. For reference, I am timing and running my program from an x86-64 architecture CPU.
If you are unfamiliar with RDTSC, GO CHECK OUT MY TIMING YOUR PROGRAM BLOG.
If I compile this program with no optimization:
g++ -o test test.cpp
I get the following results:
463 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
For reference, the loop takes 463 CPU cycles. My CPU runs at 2.5GHz. This means it took roughly 182
nanoseconds to run!
Let's take a look at the assembly (the low level instructions) for the juicy part of the code, which we can get with the -S flag when compiling.
call _Z5rdtscv
movq %rax, -432(%rbp)
movq $0, -416(%rbp)
.L5:
cmpq $29, -416(%rbp)
ja .L4
movq -416(%rbp), %rax
movl -384(%rbp,%rax,4), %edx
movq -416(%rbp), %rax
movl -256(%rbp,%rax,4), %eax
addl %eax, %edx
movq -416(%rbp), %rax
movl %edx, -128(%rbp,%rax,4)
addq $1, -416(%rbp)
jmp .L5
.L4:
call _Z5rdtscv
Great, we can see that normal compilation leads to no use of vectorized registers/SIMD instructions. We just have the classic 64 bit registers (rax, rbp) and 32 bit registers (edx, eax, etc.).
Now let's see if we can vectorize this. Okay, let's do something simple with just -O3 optimization.
g++ -O3 -o test test.cpp
(Add the -S flag for assembly)
As a fun challenge, try to guess how long the optimized version took to run.
Before I reveal the answer, let's see how this program got optimized in the assembly:
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
movq %rax, 16(%rsp)
movq 16(%rsp), %rax
movdqa %xmm1, %xmm5
movdqa 160(%rsp), %xmm4
paddd %xmm0, %xmm5
paddd 192(%rsp), %xmm3
paddd 208(%rsp), %xmm2
movq %rax, (%rsp)
paddd %xmm0, %xmm4
movl 272(%rsp), %eax
paddd 240(%rsp), %xmm0
movaps %xmm5, 304(%rsp)
addl 144(%rsp), %eax
movaps %xmm4, 288(%rsp)
paddd 256(%rsp), %xmm1
movl %eax, 400(%rsp)
movl 148(%rsp), %eax
addl 276(%rsp), %eax
movaps %xmm3, 320(%rsp)
movaps %xmm2, 336(%rsp)
movaps %xmm4, 352(%rsp)
movaps %xmm0, 368(%rsp)
movaps %xmm1, 384(%rsp)
movl %eax, 404(%rsp)
movq $0, 24(%rsp)
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
Ahhh, now we have xmm registers! We also see SIMD x86 instructions like movaps and paddd.
If you remember from earlier, xmm registers are 128 bit registers, which are covered by the default SSE support.
Now to show you the speed difference this has:
217 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
Whoa! We roughly 2xed our speed by processing four ints per instruction instead of one!
We were at around 182 nanos before, and now it's closer to 85 nanos!
Remember, we're only adding 30 numbers here, but if we were doing a billion numbers, these nanoseconds WILL add up.
Okay, now let's see what happens when we introduce the microarchitecture flag into our compilation. Any better?
g++ -o test test.cpp -O3 -march=native
And here is the new assembly:
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
movq %rax, 16(%rsp)
movq 16(%rsp), %rax
vpaddd %ymm3, %ymm0, %ymm0
vpaddd 160(%rsp), %ymm2, %ymm2
vpaddd 192(%rsp), %ymm1, %ymm1
vmovdqa %ymm0, 352(%rsp)
movq %rax, (%rsp)
movl 256(%rsp), %eax
vmovdqa %ymm2, 288(%rsp)
addl 128(%rsp), %eax
movl %eax, 384(%rsp)
movl 260(%rsp), %eax
vmovdqa %ymm1, 320(%rsp)
addl 132(%rsp), %eax
movl %eax, 388(%rsp)
movl 264(%rsp), %eax
movq $0, 24(%rsp)
addl 136(%rsp), %eax
movl %eax, 392(%rsp)
movl 268(%rsp), %eax
addl 140(%rsp), %eax
movl %eax, 396(%rsp)
movl 272(%rsp), %eax
addl 144(%rsp), %eax
movl %eax, 400(%rsp)
movl 148(%rsp), %eax
addl 276(%rsp), %eax
movl %eax, 404(%rsp)
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
WHOAAAAA! Do you see what I see?!
We just got ymm registers! These guys are 256 bit registers that my CPU supports.
Let's see what that means about our speed. Any guesses?
3....
2....
1....
Boom:
162 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
This took 65 nanoseconds to run. It seems like our benefits are starting to decay with only 30 numbers.
We went from 182 nanoseconds to 65 nanoseconds. We basically 3xed our speed!
Note: I could've alternatively chosen to use intrinsics here as well. But I trust you will figure it out with the resources I've given you and the internet <3
Examples in the world
Okay, I shall attempt to find 1 example of SIMD instructions in the wild wild west.
OpenCV is a library used for image and vision processing!
Vectorized instructions can be useful for processing rows of pixels the same way.
Here is an example of SSE x86 intrinsics being used in OpenCV to merge two maps/images together.
void convertMaps_nninterpolate32f1c16s_SSE41(const float* src1f, const float* src2f, short* dst1, int width)
{
int x = 0;
for (; x <= width - 16; x += 16)
{
__m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)),
_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));
__m128i v_dst1 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 8)),
_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 12)));
__m128i v_dst2 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x)),
_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 4)));
__m128i v_dst3 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 8)),
_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 12)));
_mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);
_mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 8), v_dst1);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 16), v_dst2);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 24), v_dst3);
}
for (; x < width; x++)
{
dst1[x * 2] = saturate_cast<short>(src1f[x]);
dst1[x * 2 + 1] = saturate_cast<short>(src2f[x]);
}
}
Okay, let's keep this breakdown brief and simple.
- __m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)), _mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));
  These intrinsics are:
  - _mm_loadu_ps: load 4 floats into an xmm 128 bit register
  - _mm_cvtps_epi32: convert the 4 floats in the xmm into 4 32 bit ints
  - _mm_packs_epi32: pack the 2 x 4 ints into 8 shorts (2 bytes each) in another xmm
- _mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);
  Interleaves the 1st and 3rd xmm vectors together, then interleaves the 2nd and 4th vectors together. This is done so that the src1 and src2 points end up interleaved.
- _mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);
  Uses vectorized instructions to copy from the xmm registers to dst1.
- The last loop is scalar and handles the leftover elements when width isn't a multiple of 16.
Cool, that's enough examples in the wild.
The Conclusion
Let's keep this simple. We saw that with a small example you can speed up a program up to 3x with vectorized instructions.
However, with larger programs, you could see even more performance benefits from SIMD (Single Instruction Multiple Data) instructions.
So next time you consider speeding up your program, think about how you can use your CPU's features to your advantage.
All in all:
- SIMD support depends on CPU architecture
- SIMD instructions can be used with intrinsics
- The compiler can handle SIMD optimization for you alternatively
- SIMD could speed up your program 2-10x, depending on CPU support and the amount of data.
- SIMD is used in the wild, like in the OpenCV and numpy libraries.
I hope you are ready to be faster.
That's all I have this time.
Peace
-absterdabster