Implementing SVE2 for Volume Adjusting Algorithm

Introduction

Previously, we explored simple volume-adjusting algorithms that scale audio samples by a volume factor. Those algorithms used Advanced SIMD instructions rather than the Scalable Vector Extension (SVE) that we learned about in the last post, which can make code much easier to vectorize. In this post, we are going to apply SVE2 instructions to the volume-adjusting algorithm in C and explore the result in assembly.

Before We Start

Since SVE2 is a new technology that our current hardware (an Armv8-A processor) does not support natively, we can only run a program written with SVE2 instructions under emulation. This also means that we cannot meaningfully measure the performance of the program. Therefore, in this post, we are only going to implement the SVE2 version and test that the program runs successfully.

Source Code

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#ifdef   __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif
#include "vol.h"

int main() {

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- SVE2 implementation

        int16_t vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        int32_t i = 0;
        int32_t vl = svcnth(); // count the number of 16-bit element

        svbool_t pred;
        pred = svwhilelt_b16(i, SAMPLES);

        while(svptest_first(svptrue_b16(), pred)) {
                svst1(pred, &out[i], (svqrdmulh(svld1(pred, &in[i]), svdup_s16(vol_int))));
                i += vl;
                pred = svwhilelt_b16(i, SAMPLES);
        }

// ---- End of SVE2 implementation

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
}

Why Compiler Intrinsic?

Compiler intrinsics are function-like calls that the compiler replaces with the appropriate SVE2 instructions, while it takes care of details such as register allocation. They are a great way for developers (like me!) to use SVE2 instructions in C/C++ style without writing assembly.
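One practical consequence of using intrinsics is that the same source file can carry a plain C fallback for builds where SVE2 is not enabled. The following is a minimal sketch, not part of the original program: scale_samples and q15_mul_round are hypothetical names, and the fallback is my scalar model of what one lane of svqrdmulh computes. The compiler defines __ARM_FEATURE_SVE2 only when SVE2 is enabled for the build, so the intrinsic path is compiled only in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

/* Hypothetical helper: scalar model of one 16-bit lane of SQRDMULH.
 * It doubles the product, adds a rounding constant, keeps the high
 * 16 bits, and saturates. Only the upper bound can be exceeded
 * (when both inputs are -32768). The right shift of a negative value
 * is assumed to be arithmetic, as it is with GCC and Clang. */
int16_t q15_mul_round(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b + 0x8000;
    p >>= 16;
    if (p > INT16_MAX)
        p = INT16_MAX;
    return (int16_t)p;
}

/* Hypothetical wrapper: scales n samples by a Q15 factor, using the
 * SVE2 intrinsics when the build enables them and plain C otherwise. */
void scale_samples(const int16_t *in, int16_t *out, size_t n, int16_t factor)
{
    assert(in != NULL && out != NULL);
#if defined(__ARM_FEATURE_SVE2)
    svint16_t vfactor = svdup_s16(factor);
    for (uint64_t i = 0; i < n; i += svcnth()) {
        /* Lanes at or beyond n are inactive, so no tail loop is needed. */
        svbool_t pg = svwhilelt_b16_u64(i, (uint64_t)n);
        svst1_s16(pg, &out[i], svqrdmulh_s16(svld1_s16(pg, &in[i]), vfactor));
    }
#else
    for (size_t i = 0; i < n; i++)
        out[i] = q15_mul_round(in[i], factor);
#endif
}
```

Calling scale_samples(in, out, SAMPLES, vol_int) would then take the place of the hand-written loop, and the same file would still build with a compiler invocation that does not enable SVE2.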

Code Analysis

First of all, we include the arm_sve.h header file to access the SVE vector types, predicate types, and intrinsics for SVE2 instructions. We then initialize a loop index, i, and vl, which holds the number of 16-bit elements (lanes) in one vector as returned by svcnth. We also need to initialize a predicate by using svwhilelt_b16 to control the while loop. _b16 specifies a predicate for 16-bit elements. Conceptually, the intrinsic creates an integer sequence starting at i and incrementing by 1 in each subsequent vector lane, and a lane is active only while its value is less than SAMPLES. Within the while loop condition, we use svptest_first to check whether the first lane of the predicate is active, which means there is still work left to do. The logic inside the while loop is very similar to the version written with Advanced SIMD instructions. That is, svld1 loads a vector of samples starting at in[i] and svdup_s16 duplicates the value of vol_int into every lane of a vector. Afterward, svqrdmulh performs a saturating, rounding, doubling multiply of those two vectors and keeps the high 16 bits of each product, which is the fixed-point equivalent of multiplying each sample by the volume factor, and svst1 stores the result starting at out[i]. Inactive lanes are neither loaded nor stored, so the final partial vector needs no separate tail loop. Then, i is incremented by the number of 16-bit lanes in the vector and the predicate is recomputed.
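To make the predicate logic more concrete, here is a small scalar model of the loop control in plain C. It is only a sketch: MODEL_LANES fixes the vector at 8 lanes (a 128-bit vector of 16-bit elements), which is an assumption for illustration, since the real lane count comes from svcnth at run time. model_whilelt and model_predicated_loop are hypothetical names, not SVE intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption for illustration: 8 x 16-bit lanes, i.e. a 128-bit vector. */
#define MODEL_LANES 8

/* Model of svwhilelt_b16(i, n): lane k is active when i + k < n.
 * Returns the number of active lanes. */
size_t model_whilelt(size_t i, size_t n, bool pred[MODEL_LANES])
{
    size_t active = 0;
    for (size_t k = 0; k < MODEL_LANES; k++) {
        pred[k] = (i + k < n);
        if (pred[k])
            active++;
    }
    return active;
}

/* Model of the while loop in main(): returns how many iterations run
 * for n elements and reports how many lanes were active in the last one. */
size_t model_predicated_loop(size_t n, size_t *last_active)
{
    assert(last_active != NULL);

    bool pred[MODEL_LANES];
    size_t i = 0;
    size_t iterations = 0;
    size_t active = model_whilelt(i, n, pred);

    *last_active = 0;
    while (pred[0]) {            /* svptest_first: is the first lane active? */
        *last_active = active;   /* the load, multiply, and store happen here */
        iterations++;
        i += MODEL_LANES;        /* i += vl */
        active = model_whilelt(i, n, pred);
    }
    return iterations;
}
```

With 20 elements and 8 lanes, the model runs three iterations and only 4 lanes are active in the last one, which is how the real loop handles an element count that is not a multiple of the vector length.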

Building Code

As we discussed before, our current hardware does not support SVE2 instructions. Thus, we have to tell the compiler explicitly to target Armv8-A with the SVE2 extension enabled, as follows:

$ gcc -march=armv8-a+sve2 vol6.c vol_createsample.o -o vol6

Then, we can execute the program with the QEMU user-mode emulator, which emulates the SVE and SVE2 instructions in software so that the program can run on the Armv8 system.

$ qemu-aarch64 ./vol6

Once we run it, we can see the program runs successfully without any problem!

Result: -809
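The checksum alone does not tell us whether the scaled samples are accurate, which is the question left open in the source code comment. One way to check, sketched below, is to compare each output sample against a floating-point reference. max_abs_error is a hypothetical helper and is not part of vol.h. The volume percentage is passed in as a parameter because the value of VOLUME is not shown in this post (the constant 0x3fff in the disassembly suggests 50, but that is an inference).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the largest absolute difference between
 * the fixed-point output and a floating-point reference in[i] * volume. */
double max_abs_error(const int16_t *in, const int16_t *out, size_t n,
                     double volume_percent)
{
    assert(in != NULL && out != NULL);

    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double expected = in[i] * (volume_percent / 100.0);
        double err = expected - (double)out[i];
        if (err < 0.0)
            err = -err;          /* absolute value without needing libm */
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

Calling max_abs_error(in, out, SAMPLES, VOLUME) after the SVE2 loop would give a single number to judge the output by. A result of around one sample step or less would suggest that the fixed-point scaling is behaving as intended under emulation.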

Assembler Output

0000000000400698 <main>:
  400698:       043f57ff        addvl   sp, sp, #-1
  40069c:       d100c3ff        sub     sp, sp, #0x30
  4006a0:       a9007bfd        stp     x29, x30, [sp]
  4006a4:       910003fd        mov     x29, sp
  4006a8:       043f5020        addvl   x0, sp, #1
  4006ac:       b900281f        str     wzr, [x0, #40]
  4006b0:       d2800041        mov     x1, #0x2                        // #2
  4006b4:       d2848000        mov     x0, #0x2400                     // #9216
  4006b8:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006bc:       97ffff91        bl      400500 <calloc@plt>
  4006c0:       047f5081        addpl   x1, sp, #4
  4006c4:       f9001020        str     x0, [x1, #32]
  4006c8:       d2800041        mov     x1, #0x2                        // #2
  4006cc:       d2848000        mov     x0, #0x2400                     // #9216
  4006d0:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006d4:       97ffff8b        bl      400500 <calloc@plt>
  4006d8:       047f5081        addpl   x1, sp, #4
  4006dc:       f9000c20        str     x0, [x1, #24]
  4006e0:       52848001        mov     w1, #0x2400                     // #9216
  4006e4:       72a01e81        movk    w1, #0xf4, lsl #16
  4006e8:       047f5080        addpl   x0, sp, #4
  4006ec:       f9401000        ldr     x0, [x0, #32]
  4006f0:       9400006c        bl      4008a0 <vol_createsample>
  4006f4:       5287ffe0        mov     w0, #0x3fff                     // #16383
  4006f8:       047f5081        addpl   x1, sp, #4
  4006fc:       79002c20        strh    w0, [x1, #22]
  400700:       043f5020        addvl   x0, sp, #1
  400704:       b900241f        str     wzr, [x0, #36]
  400708:       0460e3e0        cnth    x0
  40070c:       047f5081        addpl   x1, sp, #4
  400710:       b9001020        str     w0, [x1, #16]
  400714:       043f5020        addvl   x0, sp, #1
  400718:       b9402400        ldr     w0, [x0, #36]
  40071c:       52848001        mov     w1, #0x2400                     // #9216
  400720:       72a01e81        movk    w1, #0xf4, lsl #16
  400724:       25610400        whilelt p0.h, w0, w1
  400728:       910093e0        add     x0, sp, #0x24
  40072c:       e5801c00        str     p0, [x0, #7, mul vl]
  400730:       14000026        b       4007c8 <main+0x130>
  400734:       043f5020        addvl   x0, sp, #1
  400738:       b9802400        ldrsw   x0, [x0, #36]
  40073c:       d37ff800        lsl     x0, x0, #1
  400740:       047f5081        addpl   x1, sp, #4
  400744:       f9400c21        ldr     x1, [x1, #24]
  400748:       8b000020        add     x0, x1, x0
  40074c:       043f5021        addvl   x1, sp, #1
  400750:       b9802421        ldrsw   x1, [x1, #36]
  400754:       d37ff821        lsl     x1, x1, #1
  400758:       047f5082        addpl   x2, sp, #4
  40075c:       f9401042        ldr     x2, [x2, #32]
  400760:       8b010041        add     x1, x2, x1
  400764:       910093e2        add     x2, sp, #0x24
  400768:       85801c40        ldr     p0, [x2, #7, mul vl]
  40076c:       a4a0a020        ld1h    {z0.h}, p0/z, [x1]
  400770:       047f5081        addpl   x1, sp, #4
  400774:       91005821        add     x1, x1, #0x16
  400778:       2518e3e0        ptrue   p0.b
  40077c:       84c0a021        ld1rh   {z1.h}, p0/z, [x1]
  400780:       04617400        sqrdmulh        z0.h, z0.h, z1.h
  400784:       910093e1        add     x1, sp, #0x24
  400788:       85801c20        ldr     p0, [x1, #7, mul vl]
  40078c:       e4a0e000        st1h    {z0.h}, p0, [x0]
  400790:       043f5020        addvl   x0, sp, #1
  400794:       b9402401        ldr     w1, [x0, #36]
  400798:       047f5080        addpl   x0, sp, #4
  40079c:       b9401000        ldr     w0, [x0, #16]
  4007a0:       0b000020        add     w0, w1, w0
  4007a4:       043f5021        addvl   x1, sp, #1
  4007a8:       b9002420        str     w0, [x1, #36]
  4007ac:       043f5020        addvl   x0, sp, #1
  4007b0:       b9402400        ldr     w0, [x0, #36]
  4007b4:       52848001        mov     w1, #0x2400                     // #9216
  4007b8:       72a01e81        movk    w1, #0xf4, lsl #16
  4007bc:       25610400        whilelt p0.h, w0, w1
  4007c0:       910093e0        add     x0, sp, #0x24
  4007c4:       e5801c00        str     p0, [x0, #7, mul vl]
  4007c8:       2558e3e0        ptrue   p0.h
  4007cc:       910093e0        add     x0, sp, #0x24
  4007d0:       85801c01        ldr     p1, [x0, #7, mul vl]
  4007d4:       2550c020        ptest   p0, p1.b
  4007d8:       9a9f57e0        cset    x0, mi  // mi = first
  4007dc:       7100001f        cmp     w0, #0x0
  4007e0:       54fffaa1        b.ne    400734 <main+0x9c>  // b.any
  4007e4:       043f5020        addvl   x0, sp, #1
  4007e8:       b9002c1f        str     wzr, [x0, #44]
  4007ec:       1400001d        b       400860 <main+0x1c8>
  4007f0:       043f5020        addvl   x0, sp, #1
  4007f4:       b9802c00        ldrsw   x0, [x0, #44]
  4007f8:       d37ff800        lsl     x0, x0, #1
  4007fc:       047f5081        addpl   x1, sp, #4
  400800:       f9400c21        ldr     x1, [x1, #24]
  400804:       8b000020        add     x0, x1, x0
  400808:       79c00000        ldrsh   w0, [x0]
  40080c:       2a0003e1        mov     w1, w0
  400810:       043f5020        addvl   x0, sp, #1
  400814:       b9402800        ldr     w0, [x0, #40]
  400818:       0b000020        add     w0, w1, w0
  40081c:       5289ba61        mov     w1, #0x4dd3                     // #19923
  400820:       72a20c41        movk    w1, #0x1062, lsl #16
  400824:       9b217c01        smull   x1, w0, w1
  400828:       d360fc21        lsr     x1, x1, #32
  40082c:       13067c22        asr     w2, w1, #6
  400830:       131f7c01        asr     w1, w0, #31
  400834:       4b010042        sub     w2, w2, w1
  400838:       52807d01        mov     w1, #0x3e8                      // #1000
  40083c:       1b017c41        mul     w1, w2, w1
  400840:       4b010000        sub     w0, w0, w1
  400844:       043f5021        addvl   x1, sp, #1
  400848:       b9002820        str     w0, [x1, #40]
  40084c:       043f5020        addvl   x0, sp, #1
  400850:       b9402c00        ldr     w0, [x0, #44]
  400854:       11000400        add     w0, w0, #0x1
  400858:       043f5021        addvl   x1, sp, #1
  40085c:       b9002c20        str     w0, [x1, #44]
  400860:       043f5020        addvl   x0, sp, #1
  400864:       b9402c01        ldr     w1, [x0, #44]
  400868:       52847fe0        mov     w0, #0x23ff                     // #9215
  40086c:       72a01e80        movk    w0, #0xf4, lsl #16
  400870:       6b00003f        cmp     w1, w0
  400874:       54fffbed        b.le    4007f0 <main+0x158>
  400878:       043f5020        addvl   x0, sp, #1
  40087c:       b9402801        ldr     w1, [x0, #40]
  400880:       90000000        adrp    x0, 400000 <__abi_tag-0x278>
  400884:       9124e000        add     x0, x0, #0x938
  400888:       97ffff2e        bl      400540 <printf@plt>
  40088c:       52800000        mov     w0, #0x0                        // #0
  400890:       a9407bfd        ldp     x29, x30, [sp]
  400894:       043f503f        addvl   sp, sp, #1
  400898:       9100c3ff        add     sp, sp, #0x30
  40089c:       d65f03c0        ret

To check that the scalable vector instructions are actually used, we can skim through the disassembly and search for the whilelt instruction.

  400724:       25610400        whilelt p0.h, w0, w1
  4007bc:       25610400        whilelt p0.h, w0, w1

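As we can see, SVE-specific instructions such as whilelt are used by the program, and it runs without any problem! Strictly speaking, whilelt already exists in the original SVE; the instruction in the listing that comes from SVE2 itself is the vector form of sqrdmulh at address 400780.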

Conclusion

In this post, we explored how to apply SVE2 instructions to the volume-adjusting algorithm. Unfortunately, our current hardware does not support SVE2 natively (yet!), so we had to use an emulator to run the program. SVE2 is also challenging to adopt because it requires an understanding of predicates and a new syntax. However, using SVE2 is potentially beneficial for developers because newer hardware is expected to support it natively, and because the vector length is determined by the machine at run time, so the same code can run on different vector widths.