A Serputov

Algorithm Selection with Inline Assembly (Part 2)

This is the second part of Algorithm Selection with Inline Assembly. We are going to change our code to use SVE2 instructions.

Quick note: the Armv9 Scalable Vector Extension version 2 (SVE2) provides a variable-width SIMD capability for AArch64 systems.
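To make "variable-width" a bit more concrete, here is a minimal sketch. The helper name lanes_per_vector_s16 is hypothetical and not part of the lab code. When the compiler targets SVE, it asks the hardware how many 16-bit lanes fit in one vector; otherwise it falls back to the fixed 8 lanes of a 128-bit register, which is the v0.8h layout used by the inline assembly later in this post.

```c
#include <stddef.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper (an assumption for illustration):
 * number of int16_t lanes handled by one vector operation. */
size_t lanes_per_vector_s16(void) {
#if defined(__ARM_FEATURE_SVE)
    /* SVE: the vector length is a property of the hardware (or emulator),
     * so the lane count is only known at run time. */
    return (size_t)svcnth();
#else
    /* Fallback: a fixed 128-bit register holds 8 x 16-bit lanes. */
    return 8;
#endif
}
```

The point of the sketch is that SVE code should not hard-code the lane count; the same binary can see a different value on different implementations.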

For those who don't remember, we are working on the AArch64 machine (the Israel server).

The purpose of this lab is to:

1. Create a new version of the volume scaling code from the Algorithm Selection Lab, which uses SVE2 instructions.
2. Prove that the code is using SVE2 instructions by analyzing the disassembly of the relevant portion of the binary.

Let's start:

I started by changing the Makefile so that our files are compiled with SVE2 enabled:

gcc -march=armv8-a+sve2 ...

Remember that in order to invoke the autovectorizer in GCC version 11, you must use -O3:

gcc -O3 -march=armv8-a+sve2 ...
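For context, this is the kind of loop the autovectorizer can work with. It is only a sketch: the function name scale_samples and its parameters are my own assumptions, not code from the lab. The loop is plain C with no intrinsics or assembly, so whether it ends up using SVE2 instructions depends entirely on the compiler flags above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar scaling loop (an assumption for illustration).
 * vol_int is the volume as a fixed-point factor, where 32767 is about 1.0.
 * 'restrict' promises the arrays do not overlap, which makes the loop
 * easier for the compiler to vectorize. */
void scale_samples(const int16_t *restrict in, int16_t *restrict out,
                   size_t n, int16_t vol_int) {
    for (size_t i = 0; i < n; i++) {
        /* The product is computed as a 32-bit int, then shifted back down.
         * Right-shifting a negative value is implementation-defined in C;
         * GCC and Clang perform an arithmetic shift. */
        out[i] = (int16_t)((in[i] * vol_int) >> 15);
    }
}
```

Note that this truncates instead of rounding, so its results can differ by one from the sqrdmulh-based version shown later.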

The next step is to add SVE2 instructions to our C code and check how long it takes to execute with the new features.

Here is our current code for Vol4:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <arm_sve.h>

int main() {

#ifndef __aarch64__
printf("Wrong architecture - written for aarch64 only.\n");
#else

// these variables will also be accessed by our assembler code
int16_t*    in_cursor;       // input cursor
int16_t*    out_cursor;       // output cursor
int16_t     vol_int;        // volume as int16_t

int16_t*    limit;         // end of input array

int       x;           // array iterator
int       ttl=0 ;         // array total

// ---- Create in[] and out[] arrays
int16_t*    in;
int16_t*    out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

// set vol_int to fixed-point representation of the volume factor
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

// Q: what is the purpose of these next two lines?
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES;

// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

while ( in_cursor < limit ) {
__asm__ (
"ldr q0, [%[in_cursor]], #16  \n\t"
// load eight samples into q0 (same as v0.8h)
// from [in_cursor]
// post-increment in_cursor by 16 bytes
// and store back into the pointer register

"sqrdmulh v0.8h, v0.8h, v1.8h  \n\t"
// with 32-bit signed integer output,
// multiply each lane in v0 * v1 * 2
// saturate results
// store upper 16 bits of results into
// the corresponding lane in v0

"str q0, [%[out_cursor]],#16      \n\t"
// store eight samples to [out_cursor]
// post-increment out_cursor by 16 bytes
// and store back into the pointer register

// Q: What do these next three lines do?
: [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
: "r"(in_cursor),"r"(out_cursor)
: "memory"
);
}

// --------------------------------------------------------------------

for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}

// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);

return 0;

#endif
}
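To check my understanding of the fixed-point math in the listing, here is a scalar sketch of what one lane goes through. The function names volume_to_q15 and sqrdmulh_model are hypothetical and only model the behaviour described in the comments above (double the product, round, saturate, keep the upper 16 bits); they are not part of the lab code.

```c
#include <stdint.h>

/* Hypothetical helper: volume percentage -> fixed-point factor,
 * using the same formula as vol_int in the listing above. */
int16_t volume_to_q15(int volume_percent) {
    return (int16_t)(volume_percent / 100.0 * 32767.0);
}

/* Hypothetical scalar model of one sqrdmulh lane on 16-bit elements. */
int16_t sqrdmulh_model(int16_t a, int16_t b) {
    /* Double the product and add the rounding constant. 64-bit math is used
     * so that (-32768 * -32768 * 2) cannot overflow. */
    int64_t p = 2 * (int64_t)a * (int64_t)b + (1 << 15);

    /* Keep the upper 16 bits (arithmetic shift on GCC and Clang). */
    p >>= 16;

    /* Saturate to the int16_t range. */
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}
```

For example, a sample of 1000 at 50% volume becomes sqrdmulh_model(1000, volume_to_q15(50)), which is 500. The model also shows why saturation matters: -32768 times -32768 would otherwise produce 32768, which does not fit in an int16_t.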

Note: we are testing our code with 5M samples.

We need to know more about SVE2. The lab instructions give these links to the Arm developer docs:
Arm Armv9-A A64 Instruction Set Architecture - https://developer.arm.com/documentation/ddi0602/2021-12/
Introduction to SVE2 - https://developer.arm.com/documentation/102340/0001/?lang=en
Intrinsics - Arm C Language Extensions for SVE (ACLE) - https://developer.arm.com/documentation/100987/latest
SVE Coding Considerations with Arm Compiler - Note that this documentation is specific to Arm's compiler, but most of it will apply to other compilers, including gcc - https://developer.arm.com/documentation/100748/0616/SVE-Coding-Considerations-with-Arm-Compiler

After reading these, I started implementing the SVE2 instructions.
I decided to work with vol4.
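As a rough idea of what an SVE2 version of the scaling loop could look like, here is a sketch using the ACLE intrinsics from arm_sve.h instead of inline assembly. This is not the final vol4 code: the function name scale_sve2 is hypothetical, and the scalar fallback is there only so the file still builds when SVE2 is not enabled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

/* Hypothetical helper (an assumption for illustration):
 * scale n samples from in[] into out[] by the fixed-point factor vol_int. */
void scale_sve2(const int16_t *in, int16_t *out, size_t n, int16_t vol_int) {
#if defined(__ARM_FEATURE_SVE2)
    /* Step by however many 16-bit lanes the hardware provides. */
    for (int64_t i = 0; i < (int64_t)n; i += (int64_t)svcnth()) {
        /* Predicate: only lanes with index < n are active, so the last
         * partial vector needs no separate clean-up loop. */
        svbool_t pg = svwhilelt_b16(i, (int64_t)n);
        svint16_t v = svld1_s16(pg, in + i);   /* load active lanes */
        v = svqrdmulh_n_s16(v, vol_int);       /* same math as sqrdmulh */
        svst1_s16(pg, out + i, v);             /* store active lanes only */
    }
#else
    /* Scalar fallback with the same rounding and saturation behaviour. */
    for (size_t i = 0; i < n; i++) {
        int64_t p = 2 * (int64_t)in[i] * (int64_t)vol_int + (1 << 15);
        p >>= 16;                              /* keep the upper 16 bits */
        if (p > INT16_MAX) p = INT16_MAX;      /* only -32768 * -32768 overflows */
        out[i] = (int16_t)p;
    }
#endif
}
```

Compared with the ldr/str version above, there is no fixed 16-byte step here; the loop increment comes from svcnth(), so the same code adapts to whatever vector length is available.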

The Makefile already has the new commands. For vol4 I have:

gcc -O3 -march=armv8-a+sve2 ${CCOPTS} vol4.c vol_createsample.o -o vol4

For all the others, I decided to experiment:

gcc ${CCOPTS} vol1/2/3/5.c -march=armv8-a+sve2 vol_createsample.o -Ofast -o vol1/2/3/5

Also, I've decided to change the number of samples.

To run the vol programs built with SVE2 instructions, we need to launch each one with the qemu-aarch64 command.

I also updated the registers as described in the SVE documentation.

I ran into one error while working on this lab, but it turned out to be caused by the number of samples.

Conclusion

⚠️ Computer Architecture Blog Post: Link