This is a submission for the 2025 Hacktoberfest Writing Challenge
This is the first time I heard about hacktoberfest - a month-long initiative in October that encourages developers around the world to contribute to open source projects. This looked interesting, so I thought I'd give it a go.
Some time before that I started learning more about SIMD (AVX512 specifically) and was looking for a project where I can apply this in practice. One of the projects I knew was odiff, which I learned about at the FunOCaml conference I went to last year.
This blog post summarises my contribution to odiff, by rewriting odiff core algorithm into AVX512 assembly.
odiff
odiff is a very fast pixel-by-pixel image comparison tool. It is comparing images in the YIQ colour space rather than the RGB space to better represent visual differences. It's a really cool project. Go check it out.
It was originally written in OCaml and recently it was rewritten in zig for better SIMD support.
SIMD
Normally, CPU can only operate on a single memory location at a time (e.g. a byte, word, etc). This means that if we want to apply the same operation to large number of different memory locations, we need to do it in a loop, one element at a time. This is quite tedious and easy to get wrong. It also doesn't use all the power the CPU has.
There's a lot of workflows that could be optimised to do this is the hardware. Image processing, like in odiff, is one example. Other examples include audio and video processing. All of them essentially apply the same processing to a very large number of data samples.
Originally introduced as MMX extension for x86, but over time evolved to a large number of other extensions, with growing capabilities.
Each CPU has its own SIMD extensions: x86 has MMX, SSE and AVX, ARM has NEON, and RISC-V has RVV.
AVX512
AVX is a variant of x86 SIMD that operates on 256 bits of data at a time.
AVX512 is a evolution of AVX that operates on 512 bit of data. It's really awesome.
512 bits is 64 bytes, which for RGBA image data means we can operate on 16 pixels at a time in one instruction.
If you ever used array programming languages, AVX512 kind of feels similar (though of course it's definitely lower level). You always operate on whole arrays, not single elements. This requires a shift in the way you think about doing things, but it's very rewarding and can give you some nice performance boost.
vxdiff
This lead to the creation of vxdiff, a pure assembly AVX512 based reimplementation of the core odiff algorithm. It only implements the default set of odiff options (and odiff has some interesting options to tweak its behaviour), but it's still a capable tool. It's also easier to start simple rather than trying to implement all the possible options right away.
It was a really fun thing to do, and I learnt a lot about AVX
The algorithm
I will now explain the odiff core algorithm after translation to AVX. This might be a bit chaotic, because I was writing this in a bit of a hurry, so bear with me.
The full source is here, you can follow along.
- First the input images are loaded into memory using RGBA8 format, which means that there's 3 colour channels (red, green, blue) and an alpha channel, each having 8 bits.
- Once we've got the image data, the actual processing loop starts at .y_loop label, which iterates over image rows. Each row is handled by .x_loop label, which is a loop that loads 512bit of data at a time.
rdiandrsipoint to image data of the base and other images. After executingxmm1andxmm2will have 64 bytes of loaded data, which we can now operate on.
vmovdqu8 xmm1, [rdi]
vmovdqu8 xmm2, [rsi]
- Now, we want to replace pixels having alpha equal to
0with white. First we cleark6andk7mask registers withkxor. Then we check alpha channel to see which values are equal to 0 and store the result intok6mask usingvpcmpequb. At the indices where alpha was0, mask will equal1and for all other indices it will be0. Then we propagate that information to other channels (red, green and blue) usingkshiftlbandkor. Then we store255at places where resulting mask is1. We do this twice - once for the base image and second time for the image we compare against.
kxor k6, k6, k6
kxor k7, k7, k7
vpcmpequb k6 {k4}, xmm1, xmm0
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
vmovdqu8 xmm1 {k7}, xmm31
- Next, we want to do some multiplications, so we'll convert bytes to floats. First we convert from bytes to dwords using
vpmovzxbdand then from dwords to floats usingvcvtudq2ps.
vpmovzxbd zmm1, xmm1
vcvtudq2ps zmm1, zmm1
vpmovzxbd zmm2, xmm2
vcvtudq2ps zmm2, zmm2
- Normalise alpha from initial
0-255range to0.0 - 1.0range.k4mask apply thevmulpsoperation (which does multiplication) only to alpha channel.
vmulps zmm1 {k4}, zmm1, zmm3
vmulps zmm2 {k4}, zmm2, zmm3
- Do alpha-blending, so blend red, green and blue channels with white pixel using alpha channel. This essentially compute
(RGB - 255.0) * A + 255.0. I'm not actually sure why it's computed this way, but this is what odiff used to do in the previous version.k5apply the operations only to red, green and blue channels, leaving alpha channel unmodified.
vsubps zmm1 {k5}, zmm1, zmm30
vshufps zmm10, zmm1, zmm1, 0xff
vmulps zmm1 {k5}, zmm1, zmm10
vaddps zmm1 {k5}, zmm1, zmm30
- Now we start RGB to YIQ conversion. We use precomputed matrix in
zmm7,zmm8andzmm9to multiply to with RGB values to get YIQ values withvmulps. That will give us 3 separate vectors withY,IandQchannels. Then we shuffle them around and add them up to getYIQvector. We will use this to compute the difference between base image and the image we are comparing against.
vmulps zmm10, zmm1, zmm7 ; y
vmulps zmm11, zmm1, zmm8 ; i
vmulps zmm12, zmm1, zmm9 ; q
; yiq(R)
vxorps zmm13, zmm13, zmm13
vshufps zmm13 {k1}, zmm10, zmm10, 0b00000000
vshufps zmm13 {k2}, zmm11, zmm11, 0b00000000
vshufps zmm13 {k3}, zmm12, zmm12, 0b00000000
; yiq(G)
vxorps zmm14, zmm14, zmm14
vshufps zmm14 {k1}, zmm10, zmm10, 0b00000001
vshufps zmm14 {k2}, zmm11, zmm11, 0b00000100
vshufps zmm14 {k3}, zmm12, zmm12, 0b00010000
; yiq(B)
vxorps zmm15, zmm15, zmm15
vshufps zmm15 {k1}, zmm10, zmm10, 0b00000010
vshufps zmm15 {k2}, zmm11, zmm11, 0b00001000
vshufps zmm15 {k3}, zmm12, zmm12, 0b00100000
; yiq
vaddps zmm16, zmm13, zmm14
vaddps zmm16, zmm16, zmm15
- The result is that we have
YIQof base image inzmm16andYIQof the other image inzmm26. We take the difference of the two usingvsubps, square the result and multiply by delta coefficient.
; YIQ diff
vsubps zmm16, zmm16, zmm26
; YIQ*YIQ
vmulps zmm16, zmm16, zmm16
; YIQ*YIQ * delta coef
vmulps zmm16, zmm16, zmm29
- Now we have squares of differences of channels in
zmm16vector, so we need to sum those channels into a single one. This way even there's difference in both channelYandI, it will be counted as a single difference.
vxorps zmm17, zmm17, zmm17
vxorps zmm18, zmm18, zmm18
vxorps zmm19, zmm19, zmm19
vshufps zmm17 {k1}, zmm16, zmm16, 0b10101010
vshufps zmm18 {k1}, zmm16, zmm16, 0b01010101
vshufps zmm19 {k1}, zmm16, zmm16, 0b00000000
; delta
vaddps zmm16, zmm19, zmm18
vaddps zmm16, zmm16, zmm17
- Finally, we count the number of different pixels. We compare the resulting values against maximum allowed delta in
xmm28, which gives boolean mask ink6, which we then treat as an array with1or0values and sum it up. The sum is the number of pixels that differ when comparing base image against the other image.
vcompressps zmm16 {k1}, zmm16
vcmpgtps k6, xmm16, xmm28
kmov eax, k6
popcnt eax, eax
Back to odiff
vxdiff is definitely an interesting project, but its usefulness is limited when compared to odiff, given all its options. Fortunately, vxdiff is now part of odiff. To activate it, you need to specify --enable-asm when invoking odiff.
That's all folks
Shout out to Dmitriy Kovalenko and the organisers of FunOCaml, because as strange as it may seem, this project wouldn't have happened if FunOCaml hadn't been organised.
Top comments (0)