Introduction
Previously, we explored simple volume-adjustment algorithms that scale audio samples by a volume factor. Unfortunately, those algorithms use Advanced SIMD instructions rather than the Scalable Vector Extension (SVE) we learned about in the last post, which can greatly improve the vectorization of code. In this post, we are going to implement the volume-adjustment algorithm with SVE2 intrinsics in C and explore the result in assembly.
Before We Start
Since SVE2 is a new technology that the Armv8-A hardware available to us does not support natively, we can only run a program written with SVE2 instructions under emulation. This also means that we cannot meaningfully measure the performance of the program. Therefore, in this post, we are only going to implement the algorithm with SVE2 and test whether the program runs successfully.
Source Code
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif
#include "vol.h"
int main() {
    int x;      // array iterator
    int ttl=0;  // array total
    // ---- Create in[] and out[] arrays
    int16_t* in;
    int16_t* out;
    in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
    // ---- Create dummy samples in in[]
    vol_createsample(in, SAMPLES);
    // ---- SVE2 implementation
    // Volume factor as a 16-bit fixed-point value
    int16_t vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
    int32_t i = 0;
    int32_t vl = svcnth(); // number of 16-bit elements in one vector
    svbool_t pred;
    pred = svwhilelt_b16(i, SAMPLES); // lanes are active while i + lane < SAMPLES
    while(svptest_first(svptrue_b16(), pred)) {
        // load, multiply by the volume factor, store (active lanes only)
        svst1(pred, &out[i], (svqrdmulh(svld1(pred, &in[i]), svdup_s16(vol_int))));
        i += vl;
        pred = svwhilelt_b16(i, SAMPLES);
    }
    // ---- End of SVE2 implementation
    for (x = 0; x < SAMPLES; x++) {
        ttl=(ttl+out[x])%1000;
    }
    // Q: Are the results usable? Are they accurate?
    printf("Result: %d\n", ttl);
    return 0;
}
Why Compiler Intrinsics?
Compiler intrinsics are function-like calls that the compiler replaces with the appropriate SVE2 instructions, while it takes care of chores such as register allocation. They are a great way for developers (like me!) to use SVE2 instructions in C/C++ without writing assembly.
Code Analysis
First of all, we include the arm_sve.h header to access the SVE vector types, predicates, and intrinsics for SVE2 instructions. We then initialize a loop index, i, and vl, which holds the number of 16-bit elements in one vector. We also initialize a predicate with svwhilelt_b16 to control the while loop. The _b16 suffix specifies a predicate for 16-bit elements; conceptually, the intrinsic builds an integer vector starting at i and incrementing by 1 in each subsequent lane, and marks a lane active while its value is less than SAMPLES. In the while loop condition, we use svptest_first to check whether the first lane of the predicate is active, which means there is still work left to do. The logic inside the while loop is very similar to the versions written with Advanced SIMD instructions: svld1 loads a vector of samples starting at in[i], and svdup_s16 duplicates the value of vol_int into every lane of a vector. Afterward, svqrdmulh performs a saturating, rounding, doubling multiply of the two vectors and keeps the high 16 bits of each product, which amounts to a fixed-point multiplication by the volume factor, and svst1 stores the result starting at out[i]. Then, i is incremented by the number of 16-bit lanes in the vector and the predicate is recomputed.
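To make the predicate logic and the svqrdmulh arithmetic more concrete, here is a minimal scalar sketch in plain C that mimics the loop one lane at a time. It is only a model, not how SVE executes: MODEL_VL is a hypothetical fixed lane count (real hardware chooses the vector length), and the model_* helper names are made up for illustration. The rounding formula follows my reading of the instruction's description: double the product, add a rounding constant, keep the high 16 bits, and saturate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VL 8 /* hypothetical number of 16-bit lanes; real SVE hardware decides this */

/* One lane of a saturating rounding doubling multiply-high on 16-bit values. */
static int16_t model_sqrdmulh(int16_t a, int16_t b) {
    int64_t t = 2 * (int64_t)a * (int64_t)b + 0x8000; /* double, then add rounding constant */
    /* Floor division by 2^16, written out so negative values do not rely on >> behaviour. */
    int64_t q = (t >= 0) ? t / 65536 : -((-t + 65535) / 65536);
    if (q > INT16_MAX) q = INT16_MAX; /* only -32768 * -32768 reaches this */
    if (q < INT16_MIN) q = INT16_MIN;
    return (int16_t)q;
}

/* Model of svwhilelt_b16(i, n): lane k is active while i + k < n. */
static void model_whilelt(bool pred[MODEL_VL], int32_t i, int32_t n) {
    for (int k = 0; k < MODEL_VL; k++) {
        pred[k] = ((int64_t)i + k) < n;
    }
}

/* Model of the whole loop: predicated load, multiply, predicated store. */
static void model_scale(const int16_t *in, int16_t *out, int32_t n, int16_t vol_int) {
    assert(n >= 0);
    bool pred[MODEL_VL];
    int32_t i = 0;
    model_whilelt(pred, i, n);
    while (pred[0]) { /* like svptest_first: is the first lane still active? */
        for (int k = 0; k < MODEL_VL; k++) {
            if (pred[k]) { /* inactive lanes are neither loaded nor stored */
                out[i + k] = model_sqrdmulh(in[i + k], vol_int);
            }
        }
        i += MODEL_VL; /* like i += vl */
        model_whilelt(pred, i, n);
    }
}
```

Calling model_scale on 11 samples with MODEL_VL set to 8 runs the second pass with only three active lanes, so nothing past the end of the arrays is touched. With vol_int = 16383 (the 0x3fff constant visible in the disassembly), an input of 1000 comes out as 500 and -1000 as -500 in this model.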
Building Code
As we discussed before, the current hardware does not support SVE2 instructions, and the compiler does not enable them by default. Thus, we have to tell the compiler explicitly to emit code for Armv8-A with the SVE2 extension enabled, as follows:
$ gcc -march=armv8-a+sve2 vol6.c vol_createsample.o -o vol6
Then, we can run the program under QEMU's user-mode emulator, which executes in software the SVE2 instructions that the Armv8 host processor does not implement.
$ qemu-aarch64 ./vol6
Once we run it, we can see the program runs successfully without any problem!
Result: -809
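A successful run does not by itself answer the question left in the source about accuracy. One rough way to look at it, sketched below in plain C, is to compare the fixed-point result with a floating-point scale for every possible 16-bit sample. Some assumptions to note: the helper names q15_mul and max_abs_diff are hypothetical, q15_mul is a scalar stand-in for one lane of svqrdmulh rather than the real instruction, and the floating-point reference truncates toward zero, which may not match the earlier implementations exactly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar stand-in for one lane of svqrdmulh on 16-bit values (assumed semantics). */
static int16_t q15_mul(int16_t a, int16_t b) {
    int64_t t = 2 * (int64_t)a * (int64_t)b + 0x8000;
    int64_t q = (t >= 0) ? t / 65536 : -((-t + 65535) / 65536); /* floor(t / 2^16) */
    if (q > INT16_MAX) q = INT16_MAX;
    if (q < INT16_MIN) q = INT16_MIN;
    return (int16_t)q;
}

/* Largest absolute difference between fixed-point and floating-point
   scaling, checked over every possible int16_t sample. */
static int max_abs_diff(int volume_percent) {
    assert(volume_percent >= 0 && volume_percent <= 100);
    double factor = volume_percent / 100.0;
    int16_t vol_int = (int16_t)(factor * 32767.0); /* same conversion as the source */
    int worst = 0;
    for (int s = INT16_MIN; s <= INT16_MAX; s++) {
        int fixed = q15_mul((int16_t)s, vol_int);
        int reference = (int)(s * factor); /* truncates toward zero */
        int diff = abs(fixed - reference);
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

For a volume of 50 percent, which is what the 0x3fff constant in the disassembly corresponds to, max_abs_diff(50) returns 1 in this model, so the two approaches differ by at most one step of the 16-bit sample range.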
Assembler Output
0000000000400698 <main>:
400698: 043f57ff addvl sp, sp, #-1
40069c: d100c3ff sub sp, sp, #0x30
4006a0: a9007bfd stp x29, x30, [sp]
4006a4: 910003fd mov x29, sp
4006a8: 043f5020 addvl x0, sp, #1
4006ac: b900281f str wzr, [x0, #40]
4006b0: d2800041 mov x1, #0x2 // #2
4006b4: d2848000 mov x0, #0x2400 // #9216
4006b8: f2a01e80 movk x0, #0xf4, lsl #16
4006bc: 97ffff91 bl 400500 <calloc@plt>
4006c0: 047f5081 addpl x1, sp, #4
4006c4: f9001020 str x0, [x1, #32]
4006c8: d2800041 mov x1, #0x2 // #2
4006cc: d2848000 mov x0, #0x2400 // #9216
4006d0: f2a01e80 movk x0, #0xf4, lsl #16
4006d4: 97ffff8b bl 400500 <calloc@plt>
4006d8: 047f5081 addpl x1, sp, #4
4006dc: f9000c20 str x0, [x1, #24]
4006e0: 52848001 mov w1, #0x2400 // #9216
4006e4: 72a01e81 movk w1, #0xf4, lsl #16
4006e8: 047f5080 addpl x0, sp, #4
4006ec: f9401000 ldr x0, [x0, #32]
4006f0: 9400006c bl 4008a0 <vol_createsample>
4006f4: 5287ffe0 mov w0, #0x3fff // #16383
4006f8: 047f5081 addpl x1, sp, #4
4006fc: 79002c20 strh w0, [x1, #22]
400700: 043f5020 addvl x0, sp, #1
400704: b900241f str wzr, [x0, #36]
400708: 0460e3e0 cnth x0
40070c: 047f5081 addpl x1, sp, #4
400710: b9001020 str w0, [x1, #16]
400714: 043f5020 addvl x0, sp, #1
400718: b9402400 ldr w0, [x0, #36]
40071c: 52848001 mov w1, #0x2400 // #9216
400720: 72a01e81 movk w1, #0xf4, lsl #16
400724: 25610400 whilelt p0.h, w0, w1
400728: 910093e0 add x0, sp, #0x24
40072c: e5801c00 str p0, [x0, #7, mul vl]
400730: 14000026 b 4007c8 <main+0x130>
400734: 043f5020 addvl x0, sp, #1
400738: b9802400 ldrsw x0, [x0, #36]
40073c: d37ff800 lsl x0, x0, #1
400740: 047f5081 addpl x1, sp, #4
400744: f9400c21 ldr x1, [x1, #24]
400748: 8b000020 add x0, x1, x0
40074c: 043f5021 addvl x1, sp, #1
400750: b9802421 ldrsw x1, [x1, #36]
400754: d37ff821 lsl x1, x1, #1
400758: 047f5082 addpl x2, sp, #4
40075c: f9401042 ldr x2, [x2, #32]
400760: 8b010041 add x1, x2, x1
400764: 910093e2 add x2, sp, #0x24
400768: 85801c40 ldr p0, [x2, #7, mul vl]
40076c: a4a0a020 ld1h {z0.h}, p0/z, [x1]
400770: 047f5081 addpl x1, sp, #4
400774: 91005821 add x1, x1, #0x16
400778: 2518e3e0 ptrue p0.b
40077c: 84c0a021 ld1rh {z1.h}, p0/z, [x1]
400780: 04617400 sqrdmulh z0.h, z0.h, z1.h
400784: 910093e1 add x1, sp, #0x24
400788: 85801c20 ldr p0, [x1, #7, mul vl]
40078c: e4a0e000 st1h {z0.h}, p0, [x0]
400790: 043f5020 addvl x0, sp, #1
400794: b9402401 ldr w1, [x0, #36]
400798: 047f5080 addpl x0, sp, #4
40079c: b9401000 ldr w0, [x0, #16]
4007a0: 0b000020 add w0, w1, w0
4007a4: 043f5021 addvl x1, sp, #1
4007a8: b9002420 str w0, [x1, #36]
4007ac: 043f5020 addvl x0, sp, #1
4007b0: b9402400 ldr w0, [x0, #36]
4007b4: 52848001 mov w1, #0x2400 // #9216
4007b8: 72a01e81 movk w1, #0xf4, lsl #16
4007bc: 25610400 whilelt p0.h, w0, w1
4007c0: 910093e0 add x0, sp, #0x24
4007c4: e5801c00 str p0, [x0, #7, mul vl]
4007c8: 2558e3e0 ptrue p0.h
4007cc: 910093e0 add x0, sp, #0x24
4007d0: 85801c01 ldr p1, [x0, #7, mul vl]
4007d4: 2550c020 ptest p0, p1.b
4007d8: 9a9f57e0 cset x0, mi // mi = first
4007dc: 7100001f cmp w0, #0x0
4007e0: 54fffaa1 b.ne 400734 <main+0x9c> // b.any
4007e4: 043f5020 addvl x0, sp, #1
4007e8: b9002c1f str wzr, [x0, #44]
4007ec: 1400001d b 400860 <main+0x1c8>
4007f0: 043f5020 addvl x0, sp, #1
4007f4: b9802c00 ldrsw x0, [x0, #44]
4007f8: d37ff800 lsl x0, x0, #1
4007fc: 047f5081 addpl x1, sp, #4
400800: f9400c21 ldr x1, [x1, #24]
400804: 8b000020 add x0, x1, x0
400808: 79c00000 ldrsh w0, [x0]
40080c: 2a0003e1 mov w1, w0
400810: 043f5020 addvl x0, sp, #1
400814: b9402800 ldr w0, [x0, #40]
400818: 0b000020 add w0, w1, w0
40081c: 5289ba61 mov w1, #0x4dd3 // #19923
400820: 72a20c41 movk w1, #0x1062, lsl #16
400824: 9b217c01 smull x1, w0, w1
400828: d360fc21 lsr x1, x1, #32
40082c: 13067c22 asr w2, w1, #6
400830: 131f7c01 asr w1, w0, #31
400834: 4b010042 sub w2, w2, w1
400838: 52807d01 mov w1, #0x3e8 // #1000
40083c: 1b017c41 mul w1, w2, w1
400840: 4b010000 sub w0, w0, w1
400844: 043f5021 addvl x1, sp, #1
400848: b9002820 str w0, [x1, #40]
40084c: 043f5020 addvl x0, sp, #1
400850: b9402c00 ldr w0, [x0, #44]
400854: 11000400 add w0, w0, #0x1
400858: 043f5021 addvl x1, sp, #1
40085c: b9002c20 str w0, [x1, #44]
400860: 043f5020 addvl x0, sp, #1
400864: b9402c01 ldr w1, [x0, #44]
400868: 52847fe0 mov w0, #0x23ff // #9215
40086c: 72a01e80 movk w0, #0xf4, lsl #16
400870: 6b00003f cmp w1, w0
400874: 54fffbed b.le 4007f0 <main+0x158>
400878: 043f5020 addvl x0, sp, #1
40087c: b9402801 ldr w1, [x0, #40]
400880: 90000000 adrp x0, 400000 <__abi_tag-0x278>
400884: 9124e000 add x0, x0, #0x938
400888: 97ffff2e bl 400540 <printf@plt>
40088c: 52800000 mov w0, #0x0 // #0
400890: a9407bfd ldp x29, x30, [sp]
400894: 043f503f addvl sp, sp, #1
400898: 9100c3ff add sp, sp, #0x30
40089c: d65f03c0 ret
To check whether scalable-vector instructions are actually used, we can skim through the disassembly and search for the whilelt instruction.
400724: 25610400 whilelt p0.h, w0, w1
4007bc: 25610400 whilelt p0.h, w0, w1
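As we can see, SVE-specific instructions such as whilelt are used by the program, and it runs without any problem! Note that whilelt belongs to the base SVE instruction set; the instruction in this listing that requires SVE2 is the unpredicated sqrdmulh on z registers at address 400780.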
Conclusion
In this post, we explored how to implement the volume-adjustment algorithm with SVE2 instructions. Unfortunately, the hardware we have access to does not support SVE2 natively (yet!), so we must use an emulator to run the program. SVE2 is also challenging to adopt, as it requires an understanding of predicates and a new syntax. However, SVE2 is potentially beneficial for developers because upcoming hardware is expected to support it natively and because the vector length is determined by the machine, so the same code can adapt to different vector widths.