Hi! Let's say you have a time sensitive application. Maybe you have a lot of data that you need to process quickly, or maybe you are trying to write code that is very fast.
It may be possible to make your code much more performant.
How so?
With the help of vectorization!
There's a chance you are running a very big loop and running the same set of instructions on all your data.
What if we could shrink this loop a lot by processing chunks of it in a single step?
In fact, if you've ever used Python, fast processing libraries like numpy tend to use vectorized instructions under the hood to handle large amounts of data faster.
Before I show you how to steal the moon.... ahem... I mean vectorize your code, please drop any questions you have in the comments below!
Vectorized Instructions (SIMD)
SIMD stands for Single Instruction Multiple Data. Okay let's explore some instructions.
Before we look at instructions, I must say that every computer is different. Every CPU has a different architecture, so some may support vectorized instructions while others may not.
Lucky for us, most CPUs these days use the x86, x86-64, or ARM architectures, and all of these support SIMD instructions (even the Apple M1 chips, I believe).
How do SIMD instructions work?
Good question. (If you're lazy and want to use SIMD instructions without knowing much about how they work, jump to the I'm Lazy section lol).
If you ever took a computer architecture course, you may have heard of these things called registers.
Registers are like memory holders for tiny pieces of data. Generally, a lot of the usual ones your compiler uses are 64 bit or 32 bit registers.
This is for several reasons:
- x86-64 means an x86 architecture with 64 bit instructions; plain x86 generally uses 32 bit instructions
- Memory addresses for modern computers are addressable with 64 bit addresses
- The largest data types most languages support are 64 bits (uint64_t, double, long)
However, computer architectures have been supporting larger and larger registers for things like vectorization: generally 128 bit, 256 bit, and even 512 bit registers.
x86/x86-64
These are the registers generally used for SIMD on x86 architectures:
- mm0-mm7: 64 bit registers for SIMD
- xmm0-xmm15: 128 bit registers for SIMD
- ymm0-ymm15: 256 bit registers for SIMD
- zmm0-zmm31: 512 bit registers for SIMD
The registers and operations available to you on x86 largely depend on the support your CPU has. Here are the SIMD extensions CPUs have picked up over the years:
- MMX: 64 bit registers and SIMD instructions, the oldest (1997)
- SSE: 128 bit registers and 70 new instructions (1999)
- SSE2: 144 new instructions for the 128 bit registers (2000)
- SSE3: 13 new instructions, including horizontal add/subtract (2004)
- SSSE3: 16 new instructions (each usable on the MMX and XMM registers) extending MMX and SSE (2006)
- AVX: 256 bit vector registers (2011)
- AVX-512: 512 bit vector registers (2016)
- AMX: 8192 bit tile registers (tmm0-tmm7) (2023)
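(By the way, if you want a quick way to check which of these extensions your own CPU actually has, here is a minimal sketch using the __builtin_cpu_supports builtin that gcc and clang provide on x86. The feature names I check here are just the ones I picked; use whichever you care about.)

#include <iostream>

int main() {
    // __builtin_cpu_supports checks a CPU feature at runtime (gcc/clang builtin, x86 only)
    std::cout << "SSE2:    " << (__builtin_cpu_supports("sse2")    ? "yes" : "no") << "\n";
    std::cout << "SSE4.2:  " << (__builtin_cpu_supports("sse4.2")  ? "yes" : "no") << "\n";
    std::cout << "AVX:     " << (__builtin_cpu_supports("avx")     ? "yes" : "no") << "\n";
    std::cout << "AVX2:    " << (__builtin_cpu_supports("avx2")    ? "yes" : "no") << "\n";
    std::cout << "AVX512F: " << (__builtin_cpu_supports("avx512f") ? "yes" : "no") << "\n";
}

(On Linux you can also just grep /proc/cpuinfo for the same flags.)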
How do you use them??? No need to fear, superman is here.
There are special functions for your architecture that you can use in C/C++ called intrinsics. There are intrinsics to vectorize adds, vectorize multiplies, etc.
To use the intrinsics on x86 Intel CPUs, all you have to do to vectorize your code is:
- Include one of the following header files based on the intrinsics you want to use, for example:
  - <mmintrin.h>: (MMX)
  - <xmmintrin.h>: (SSE)
  - <emmintrin.h>: (SSE2)
  - <pmmintrin.h>: (SSE3)
  - <immintrin.h>: (AVX/AVX-512)
- Use one of these intrinsics, here is a list: here (a tiny example is sketched right after this list)
- Finally, compile with the matching flag on gcc, for example:
  - -mmmx: (MMX)
  - -msse: (SSE)
  - -msse3: (SSE3)
  - -mavx: (AVX)
  - -mavx512f: (AVX-512)
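To make that concrete, here is a minimal sketch (not pulled from any library, just an illustration assuming an x86-64 CPU with SSE2) that adds two small arrays of ints four at a time:

#include <emmintrin.h> // SSE2 intrinsics
#include <iostream>

int main() {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    int c[8];

    // Each iteration handles 4 ints (128 bits) at once
    for (int i = 0; i < 8; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i)); // load 4 ints
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i vc = _mm_add_epi32(va, vb);                    // 4 adds in one instruction
        _mm_storeu_si128((__m128i*)(c + i), vc);               // store 4 results
    }

    for (int v : c) std::cout << v << " ";
    std::cout << std::endl;
}

Compile it with something like g++ -o simd_add simd_add.cpp -msse2 (the file name is just a placeholder, and on x86-64 SSE2 is actually enabled by default anyway).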
ARM
ARM is the other popular architecture for CPUs, commonly used in mobile devices. It also supports SIMD.
It uses registers called NEON registers, but the idea is similar:
- D0-D31: 64 bit registers for SIMD
- Q0-Q15: 128 bit registers for SIMD
To use these ARM vectorized instructions, you would have to do the following in C/C++:
- #include <arm_neon.h> at the top of your file
- Use the intrinsic functions (like vaddq_u32) from here
- Compile your program with the gcc flag -mfpu=neon (on 32 bit ARM; on 64 bit ARM/AArch64, NEON is enabled by default)
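And here is the same kind of minimal sketch with NEON intrinsics (again just an illustration, assuming an ARM CPU with NEON support):

#include <arm_neon.h> // NEON intrinsics
#include <cstdint>
#include <iostream>

int main() {
    uint32_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    uint32_t b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    uint32_t c[8];

    // Each iteration handles 4 unsigned ints (128 bits, one Q register) at once
    for (int i = 0; i < 8; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);   // load 4 unsigned ints
        uint32x4_t vb = vld1q_u32(b + i);
        uint32x4_t vc = vaddq_u32(va, vb);  // 4 adds in one instruction
        vst1q_u32(c + i, vc);               // store 4 results
    }

    for (uint32_t v : c) std::cout << v << " ";
    std::cout << std::endl;
}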
I'm Lazy
Okay lazy boy. Or girl lol.
If you don't want to think about x86 or ARM, compilers are made powerful just for you.
You can let your compiler automagically figure out your architecture and compile your code with SIMD instructions.
Let's keep it short, but all you have to do is compile like this:
g++ -o test test.cpp -O3 -ftree-vectorize -march=native
- -O3: aggressive optimization; technically you just need -O2 or higher for general optimization, but -O3 already includes -ftree-vectorize, so passing -ftree-vectorize on top of it is redundant.
- -ftree-vectorize: in case you skip -O3 or use a lower optimization level, this flag explicitly turns on vectorization so you can still get SIMD instructions in otherwise less optimized code.
- -march=native: if you use -ftree-vectorize or -O3 without this flag, the compiler will only use a default set of vectorized instructions (up to SSE2 for x86-64). Including this flag lets the compiler use the full inventory of SIMD instructions your CPU supports.
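One more tip: if you want to double check that the compiler actually vectorized your loops, gcc has a -fopt-info-vec flag that reports which loops it vectorized (and -fopt-info-vec-missed reports the ones it couldn't). For example:

g++ -o test test.cpp -O3 -march=native -fopt-info-vec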
Comparing speeds
Let's look at this simple sample program lol, adding two arrays of 30 ints together element by element.
#include <iostream>
#include <cstdint>

uint64_t rdtsc(){
    // rdtsc returns the timestamp counter split across edx:eax
    uint32_t lo, hi;
    __asm__ volatile(
        "rdtsc"
        : "=a" (lo), "=d" (hi)
    );
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main(int argc, char** argv){
    int a[30] = {1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5};
    int b[30] = {5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,5,4,3,2,1,2,3,4,5,4,3,2,1,2,3};
    int c[30];

    volatile uint64_t start = rdtsc();
    for(size_t i = 0; i < 30; i++){
        c[i] = a[i] + b[i];
    }
    volatile uint64_t end = rdtsc();

    std::cout << end-start << " cycles" << std::endl;
    for(auto v: c){
        std::cout << v << " ";
    }
    std::cout << std::endl;
}
Here, I use rdtsc() to time my program. For reference, I am timing and running my program from an x86-64 architecture CPU.
If you are unfamiliar with RDTSC, GO CHECK OUT MY TIMING YOUR PROGRAM BLOG.
If I compile this program with no optimization:
g++ -o test test.cpp
I get the following results:
463 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
For reference, the loop takes 463 CPU cycles. My CPU runs at 2.5GHz. This means it took roughly 182
nanoseconds to run!
Let's take a look at the assembly (the low level instructions) for the juicy part of the code, which we can get with the -S flag when compiling.
call _Z5rdtscv
movq %rax, -432(%rbp)
movq $0, -416(%rbp)
.L5:
cmpq $29, -416(%rbp)
ja .L4
movq -416(%rbp), %rax
movl -384(%rbp,%rax,4), %edx
movq -416(%rbp), %rax
movl -256(%rbp,%rax,4), %eax
addl %eax, %edx
movq -416(%rbp), %rax
movl %edx, -128(%rbp,%rax,4)
addq $1, -416(%rbp)
jmp .L5
.L4:
call _Z5rdtscv
Great, we can see that normal compilation leads to no use of vectorized registers/SIMD instructions. We just have the classic 64 bit registers (rax, rbp) and 32 bit registers (edx, eax, etc.).
Now let's see if we can vectorize this. Okay, let's do something simple with just -O3 optimization.
g++ -O3 -o test test.cpp
(Add the -S flag for assembly)
As a fun challenge, try to guess how long the optimized version took to run.
Before I reveal the answer, let's see how this program got optimized in the assembly:
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
movq %rax, 16(%rsp)
movq 16(%rsp), %rax
movdqa %xmm1, %xmm5
movdqa 160(%rsp), %xmm4
paddd %xmm0, %xmm5
paddd 192(%rsp), %xmm3
paddd 208(%rsp), %xmm2
movq %rax, (%rsp)
paddd %xmm0, %xmm4
movl 272(%rsp), %eax
paddd 240(%rsp), %xmm0
movaps %xmm5, 304(%rsp)
addl 144(%rsp), %eax
movaps %xmm4, 288(%rsp)
paddd 256(%rsp), %xmm1
movl %eax, 400(%rsp)
movl 148(%rsp), %eax
addl 276(%rsp), %eax
movaps %xmm3, 320(%rsp)
movaps %xmm2, 336(%rsp)
movaps %xmm4, 352(%rsp)
movaps %xmm0, 368(%rsp)
movaps %xmm1, 384(%rsp)
movl %eax, 404(%rsp)
movq $0, 24(%rsp)
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
Ahhh, now we have xmm registers! We also see SIMD x86 instructions like movaps and paddd.
If you remember from earlier, xmm registers are 128 bit registers, which are covered by the default SSE support.
Now to show you the speed difference this has:
217 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
Whoa! We roughly 2xed our speed by processing four ints per instruction instead of one!
We were at around 182 nanos before, and now it's closer to 85 nanos!
Remember, we're only adding 30 numbers here, but if we were doing a billion numbers, these nanoseconds WILL add up.
Okay, now let's see what happens when we introduce the microarchitecture flag into our compilation. Any better?
g++ -o test test.cpp -O3 -march=native
And here is the new assembly:
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
movq %rax, 16(%rsp)
movq 16(%rsp), %rax
vpaddd %ymm3, %ymm0, %ymm0
vpaddd 160(%rsp), %ymm2, %ymm2
vpaddd 192(%rsp), %ymm1, %ymm1
vmovdqa %ymm0, 352(%rsp)
movq %rax, (%rsp)
movl 256(%rsp), %eax
vmovdqa %ymm2, 288(%rsp)
addl 128(%rsp), %eax
movl %eax, 384(%rsp)
movl 260(%rsp), %eax
vmovdqa %ymm1, 320(%rsp)
addl 132(%rsp), %eax
movl %eax, 388(%rsp)
movl 264(%rsp), %eax
movq $0, 24(%rsp)
addl 136(%rsp), %eax
movl %eax, 392(%rsp)
movl 268(%rsp), %eax
addl 140(%rsp), %eax
movl %eax, 396(%rsp)
movl 272(%rsp), %eax
addl 144(%rsp), %eax
movl %eax, 400(%rsp)
movl 148(%rsp), %eax
addl 276(%rsp), %eax
movl %eax, 404(%rsp)
#APP
# 7 "test.cpp" 1
rdtsc
# 0 "" 2
#NO_APP
WHOAAAAA! Do you see what I see?!
We just got ymm registers! These guys are 256 bit registers that my CPU supports.
Let's see what that means about our speed. Any guesses?
3....
2....
1....
Boom:
162 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8
This took 65 nanoseconds to run. It seems like our benefits are starting to decay with only 30 numbers.
We went from 182 nanoseconds to 65 nanoseconds. We basically 3xed our speed!
Note: I could've alternatively chosen to use intrinsics here as well. But I trust you will figure it out with the resources I've given you and the internet <3
Examples in the world
Okay, I shall attempt to find 1 example of SIMD instructions in the wild wild west.
OpenCV is a library used for image and vision processing!
Vectorized instructions can be useful for processing rows of pixels the same way.
Here is an example of SSE x86 intrinsics being used in OpenCV to merge two maps/images together.
void convertMaps_nninterpolate32f1c16s_SSE41(const float* src1f, const float* src2f, short* dst1, int width)
{
int x = 0;
for (; x <= width - 16; x += 16)
{
__m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)),
_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));
__m128i v_dst1 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 8)),
_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 12)));
__m128i v_dst2 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x)),
_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 4)));
__m128i v_dst3 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 8)),
_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 12)));
_mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);
_mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 8), v_dst1);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 16), v_dst2);
_mm_storeu_si128((__m128i *)(dst1 + x * 2 + 24), v_dst3);
}
for (; x < width; x++)
{
dst1[x * 2] = saturate_cast<short>(src1f[x]);
dst1[x * 2 + 1] = saturate_cast<short>(src2f[x]);
}
}
Okay, let's keep this breakdown brief and simple.
- __m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)), _mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));
  These intrinsics are:
  - _mm_loadu_ps: load 4 floats into an xmm 128 bit register
  - _mm_cvtps_epi32: convert the 4 floats in the xmm into 4 32 bit ints
  - _mm_packs_epi32: pack the 2 x 4 ints into 8 shorts (2 bytes each) in another xmm
- _mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);
  Interleaves the 1st and 3rd xmm vectors together, then interleaves the 2nd and 4th vectors together. This is done so that the src1 and src2 points end up interleaved.
- _mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);
  Uses vectorized instructions to copy from the xmm registers to dst1.
- The last loop is scalar and handles the leftover elements when width isn't a multiple of 16.
Cool, that's enough examples in the wild.
The Conclusion
Let's keep this simple. We saw that with a small example you can speed up a program up to 3x with vectorized instructions.
However, with larger programs, you could see even more performance benefits from SIMD (Single Instruction Multiple Data) instructions.
So next time you consider speeding up your program, think about how you can use your CPU's features to your advantage.
All in all:
- SIMD support depends on CPU architecture
- SIMD instructions can be used with intrinsics
- The compiler can handle SIMD optimization for you alternatively
- SIMD could speed up your program 2-10x, depending on CPU support and the amount of data.
- SIMD is used in the wild, like in the OpenCV and numpy libraries.
I hope you are ready to be faster.
That's all I have this time.
Peace
-absterdabster