SPO 600 – Project Step 3 – Analysis

Hello!

It is time for my final blog post about SPO600: the analysis of what I did in Step 2.
As a quick recap, in Step 2 I was able to enable auto-vectorization for the FFmpeg package.

So let's start our analysis.

A snippet of the disassembly (using `objdump -d`)

Because I used auto-vectorization, there were many places in the code that were optimized to use SVE2 instructions.
Let's take a look at the first whilelo in the screenshot and see if we can understand what's going on, using the Arm64 documentation:

“WHILELO
While incrementing unsigned scalar lower than scalar

Generate a predicate that starting from the lowest numbered element is true while the incrementing value of the first, unsigned scalar operand is lower than the second scalar operand and false thereafter up to the highest numbered element.

The full width of the scalar operands is significant for the purposes of comparison, and the full width first operand is incremented by one for each destination predicate element, irrespective of the predicate result element size. The first general-purpose source register is not itself updated.”

From: https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-

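Basically, whilelo is not a loop by itself; it builds the predicate that controls the loop. Here it takes the scalable predicate register p0, the zero register wzr and register w1: the lanes of p0 are set to true while a counter starting at wzr (zero) is lower than w1, and false after that.
The next instruction is a mov, which moves #0x0 into register x0.
Then another mov moves #0 into vector register z1.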
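
To make the predicate idea more concrete, here is a small Python model of what whilelo computes. This is only a sketch of the semantics quoted above, not real SVE code: the function name, the lane count of 4 and the limit of 6 are my own assumptions for illustration, and the model ignores overflow of the counter.

```python
def whilelo(start, limit, num_lanes):
    """Model of WHILELO: lane i is active while (start + i) < limit."""
    return [(start + i) < limit for i in range(num_lanes)]

# Hypothetical case: 4 lanes per vector and 6 elements to process.
print(whilelo(0, 6, 4))  # first pass: all four lanes active
print(whilelo(4, 6, 4))  # second pass: only the first two lanes active
print(whilelo(8, 6, 4))  # third pass: no lane active, so the loop would end
```

The useful part is the second call: the leftover elements are handled by switching lanes off, so no separate scalar tail loop is needed.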

The next instruction is LD1D (vector plus immediate):

“Gather load doublewords to vector (immediate index)
Gather load of doublewords to active elements of a vector register from memory addresses generated by a vector base plus immediate index. The index is a multiple of 8 in the range 0 to 248. Inactive elements will not cause a read from Device memory or signal faults, and are set to zero in the destination vector.”

From: https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/LD1D--vector-plus-immediate---Gather-load-doublewords-to-vector--immediate-index--
This one is complex: as the documentation says, it loads doublewords from memory into the active elements of a vector register, here z0, in preparation for the next instruction.
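
Here is a rough Python model of the gather load described in the quote. It is a sketch under my own assumptions: memory is represented as a dictionary from byte address to a 64-bit value, and the function name, addresses and values are made up for the example.

```python
def ld1d_gather(memory, base_vector, imm, predicate):
    """Model of LD1D (vector plus immediate).

    Each active lane loads the doubleword at base_vector[lane] + imm.
    Inactive lanes are set to zero and never touch memory.
    """
    if imm % 8 != 0 or not 0 <= imm <= 248:
        raise ValueError("imm must be a multiple of 8 in the range 0 to 248")
    return [memory[addr + imm] if active else 0
            for addr, active in zip(base_vector, predicate)]

# Hypothetical memory contents: byte address -> doubleword value.
memory = {0x1000: 11, 0x1008: 22, 0x2000: 33, 0x2008: 44}

# The third lane is inactive, so its (unmapped) address is never read.
z0 = ld1d_gather(memory, [0x1000, 0x2000, 0x3000], 8, [True, True, False])
print(z0)
```

The point of the "gather" is that every lane has its own base address, so the loaded doublewords do not have to sit next to each other in memory.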

Next one is:

“ST1D (scalar plus immediate)
Contiguous store doublewords from vector (immediate index)

Contiguous store of doublewords from elements of a vector register to the memory address generated by a 64-bit scalar base and immediate index in the range -8 to 7 which is multiplied by the vector's in-memory size, irrespective of predication, and added to the base address. Inactive elements are not written to memory.”

From: https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/ST1D--scalar-plus-immediate---Contiguous-store-doublewords-from-vector--immediate-index--

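This one takes the doublewords held in z0 (loaded by the previous instruction) and stores them contiguously to memory. The address is a 64-bit scalar base register (not a vector register such as z1) plus the immediate index multiplied by the vector's in-memory size. The predicate is not part of the address calculation; it only decides which elements are actually written.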
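
The address calculation from the quote can be sketched in Python as well. Again this is only a model under my own assumptions: memory is a dictionary from byte address to value, the vector length is taken from the length of the list, and the names and numbers are invented for the example.

```python
def st1d(memory, vector, predicate, base, imm=0):
    """Model of ST1D (scalar plus immediate).

    The start address is base + imm * (size of one vector in bytes).
    Only active lanes are written; inactive lanes leave memory untouched.
    """
    if not -8 <= imm <= 7:
        raise ValueError("imm must be in the range -8 to 7")
    vector_bytes = 8 * len(vector)      # 8 bytes per doubleword lane
    start = base + imm * vector_bytes   # the predicate plays no part here
    for lane, (value, active) in enumerate(zip(vector, predicate)):
        if active:
            memory[start + 8 * lane] = value
    return memory

# Hypothetical case: 4 lanes, the last two switched off by the predicate.
memory = {}
st1d(memory, [5, 6, 7, 8], [True, True, False, False], base=0x100, imm=1)
print(memory)
```

With four doubleword lanes a vector is 32 bytes, so an immediate of 1 moves the store one whole vector past the base address.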

The next one:

“INCB, INCD, INCH, INCW (scalar)
Increment scalar by multiple of predicate constraint element count
Determines the number of active elements implied by the named predicate constraint, multiplies that by an immediate in the range 1 to 16 inclusive, and then uses the result to increment the scalar destination.
The named predicate constraint limits the number of active elements in a single predicate to:
* A fixed number (VL1 to VL256)
* The largest power of two (POW2)
* The largest multiple of three or four (MUL3 or MUL4)
* All available, implicitly a multiple of two (ALL).
Unspecified or out of range constraint encodings generate an empty predicate or zero element count rather than Undefined Instruction exception.
It has encodings from 4 classes: Byte, Doubleword, Halfword and Word”

From: https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/INCB--INCD--INCH--INCW--scalar---Increment-scalar-by-multiple-of-predicate-constraint-element-count-

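The last instruction in the loop controlled by whilelo increments x0 by a multiple of the predicate constraint element count. In other words, it advances the counter by the number of elements handled per vector, so the next iteration continues with the next chunk of data.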
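
Putting the pieces together, the overall shape of such a loop can be sketched in Python. This is a simplified model and not the actual FFmpeg code: it uses a plain contiguous copy instead of the gather load, assumes the ALL constraint for the increment, and the function names and the lane count of 4 are my own choices.

```python
def whilelo(start, limit, lanes):
    """Predicate: lane i is active while (start + i) < limit."""
    return [start + lane < limit for lane in range(lanes)]

def incd(x, lanes, multiplier=1):
    """Model of INCD with the ALL constraint: add lanes * multiplier."""
    if not 1 <= multiplier <= 16:
        raise ValueError("multiplier must be in the range 1 to 16")
    return x + lanes * multiplier

def vector_copy(src, lanes):
    """Copy src in chunks of `lanes` elements, the way a predicated loop would."""
    dst = [0] * len(src)
    x0 = 0                                  # mov x0, #0x0
    iterations = 0
    p0 = whilelo(x0, len(src), lanes)       # predicate for the first pass
    while any(p0):
        for lane, active in enumerate(p0):  # load + store, active lanes only
            if active:
                dst[x0 + lane] = src[x0 + lane]
        x0 = incd(x0, lanes)                # advance by one vector of elements
        p0 = whilelo(x0, len(src), lanes)   # predicate for the next pass
        iterations += 1
    return dst, iterations

copied, iterations = vector_copy(list(range(10)), lanes=4)
print(copied, iterations)
```

Ten elements with four lanes take three passes, where a scalar loop would need ten iterations. On real hardware the lane count depends on the vector length, which is what makes the code "scalable".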

Looking at the `objdump` output, I can see that many pieces of the code were optimized for SVE2. These instructions work on vector registers, and each vector register holds several elements that are processed at once, as seen above. This means the code should execute faster than the previous implementation without vectors.

As I showed in Step 2, I already ran some tests to see if it was working. I processed a video using the newly compiled SVE2 FFmpeg, and the output worked as intended.

I have two directories, FFmpeg with the SVE2 build and FFmpeg0 without it:

I ran the same test on both, using the same input file, “Flame.avi”.
Here are some screenshots:
On FFmpeg0 :

I used this command to process my input file
Here is the result of the command:

And here are two screenshots, side by side, of Flame.avi and output.avi played with VLC from the terminal:
Flame.avi:

Output.avi:

On FFmpeg (with SVE2):

Flame.avi:

Output.avi (made with SVE2 instructions):

Both output.avi files played exactly the same way.

Other Analysis:

For the future, when we have Armv9 hardware available:
Auto-vectorization will soon be enabled at the -O2 level in GCC, which means it will be considered a safe optimization.

The developers will have to change the configure script so that it accepts vectorization. From that point on, every FFmpeg build compiled with the new flag will contain SVE2 instructions.

A check for whether the hardware is Armv9 or Armv8 will also be needed: if it is Armv9, use the ‘configure’ file for Armv9 with the SVE2 optimizations; if it is Armv8, use only the SVE optimizations, as is done right now.
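
One possible way to do that kind of check on Linux is to look at the "Features" line of /proc/cpuinfo, where arm64 kernels list flags such as sve and sve2. The Python sketch below only shows the idea: the function names, the build labels and the sample text are hypothetical, and this is not how FFmpeg's configure script currently works.

```python
def cpu_features(cpuinfo_text):
    """Collect the flags from every 'Features' line of cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            features.update(value.split())
    return features

def pick_build(cpuinfo_text):
    """Choose a (hypothetical) build label based on the available flags."""
    features = cpu_features(cpuinfo_text)
    if "sve2" in features:
        return "sve2"
    if "sve" in features:
        return "sve"
    return "baseline"

# Made-up sample text; on a real system it would be read from /proc/cpuinfo.
sample = "processor : 0\nFeatures : fp asimd sve sve2\n"
print(pick_build(sample))
```

Checking for the feature flag instead of the architecture version has one advantage: the same test keeps working on any CPU that reports SVE2, whatever its version number is.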

This is my analysis for Step 3 of our project.
I hope you enjoyed reading this.

Thank you