Algorithm Selection on x86_64 vs AArch64 Part I

#assembly #assemblylanguage #simd #aarch64

In this post I'll explore benchmarking a few different programs with different algorithms to scale volume. I'll be benchmarking 5 different algorithms which act on an incoming stream of samples to scale them according to a desired volume. To scale audio in real time, acting on a 48000 kHz signal can involve more than 96,000 bytes of data per second, so efficiency is key to making sure nothing is lost or delayed. With that in mind, let's take a look at some different methods of scaling audio volume and see how they stack up against each other, as well as across x86_64 and AArch64 architectures.

Our incoming sample will be simulated by the following:

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i<sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}

Algorithm 1 - vol0.c - Naïve

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}

This first algorithm takes the naïve route of just multiplying each sample by a scale factor. This involved converting an integer to a floating point value and back again, which is very costly especially at this scale. I'm going to go out on a limb and say this could be done more efficiently, I predict that this one will perform the worst. (Also of note - the sum and print portions of the code are needed so the compiler doesn't optimize away the actual sample scaling portion of the program)

Algorithm 2 - vol1.c - Fixed Point

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;

This algorithm avoids the floating point conversions bogging down the previous code and opts for a whole number multiplication followed by a bit shift, which is much more conservative on compute power. This should save time over our last algorithm but will probably be the 2nd or 3rd slowest.

Algorithm 3 - vol2.c - Precalculated

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x <= 32767; x++) {
 // Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x < SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

This algorithm has all 65526 values (-32768 to 32767) precalculated, so the program just needs to look up the result for each value. This will elicit a 128kb table for all possible values of a 16 bit number scaled, which is not too much compared to the size of audio files. Performance in this case will hinge largely on how fast the math unit is vs the cache that will be fetching the 128kb of data. I think once again this could be the 2nd or 3rd slowest algorithm.

Algorithm 4 - vol4.c - Inline SIMD

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

This algorithm uses inline assembly to process multiple values simultaneously using SIMD. As such it will almost certainly perform better than the prior algorithms. Because SIMD is only available on AArch64 systems we'll have to see how it runs on those and leave the x86_64 benchmarking out for this algorithm.

There are 5 points of interest marked by "Q" as follows:

// Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

(1). The value needs to be multiplied by 32767 rather than 32768 to prevent integer overflow.

// Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;

(2). The in_cursor and out_cursor are set to point to the first elements of the in and out arrays. These will be used in the following loop to read to and from our scaling logic respectively.

 // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

(3). vol_int represents the volume as a signed 16 bit integer, which we're using the dup instruction on to duplicate the volume scaling factor from the 32-bit w0 to the vector register v1.8h.

// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"

(4). These 3 lines are all part of the second template in this program, the first line being outputs, the second inputs, and the last being clobbers - memory.

// Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

(5). The results here should be correct as the sqrdmulh instruction above saturates the results, preventing overflow.

Algorithm 5 - vol5.c - Intrinsics SIMD

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array interator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor < limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}

This last algorithm also uses SIMD but rather than inline assembler opts for compiler intrinsics. It should likewise benefit from the simultaneous processing of the previous algorithm, so I'd expect either this one or that one to come out on top. Again, some sections are pointed out for clarification which I'll do now.

// What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

(1). These intrinsic functions are equivalent to the instructions used in the last program. vst1q_s16 is equivalent to str, vqrdmulhq_s16 is equivalent to sqrdmulh, vld1q_s16 is equivalent to ldr, and vdupq_n_s16 is equivalent to dup.

// Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;

(2). The pointer is incremented by 8 because each intrinsic will calculate 8 elements at a time. In the inline assembler program, the pointer was incremented for us but here we need to do it manually.

// Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

(3). Once again, as we're using the intrinsic equivalent of sqrdmulh, the results should be saturated and avoid potential overflow, so the output should be reliable.

In the next post we'll get into putting these algorithms to the test and benchmarking them to find which is the fastest. More on that here.