Exploring and Benchmarking Audio Volume Adjusting Algorithms Part 1

Introduction

Uncompressed digital sound is typically represented as signed 16-bit (2 bytes) integer samples. For a 48000 audio sample (kHz), the data rate can easily surpass 96,000 bytes per seconds (2 bytes per sample * 48000 samples per seconds). When we change the sound volume, each sample needs to be scaled by a volume factor between 0 (no volume) and 1 (full volume). Considering the amount of data in sound samples, it is vital to have efficient volume adjusting algorithm to scale sound. This is especially true for a mobile device as the amount of processing required can affect its battery life.

In this post, we are going to explore a number of different algorithms for processing sound samples to control volume level. After that, we will study the performance of each algorithm to benchmark.

volume.h

/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume

/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);

vol_createsample.c

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i<sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}

In volume.h, we define a constant named SAMPLES to define the number of samples to be processed. We will use a reasonably large number for this to have a processed time at least 20 seconds. This will allow us to analyze the performance much more easily.

vol_createsample function is made to fill an array with random numbers to simulate an audio buffer.

Algorithm 1: vol0.c

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}

The vol0.c contains a very naïve algorithm that simply multiplies each sample by the volume scaling factor. This also involves with casting from signed 16-bit integer into floating point and back again - which can be very expensive and take a lot of resources.

It is also noteworthy to mention why we need to loop that sums the samples as well as to print the sum to the console. They must exist so that the algorithm can perform correctly. Let's take a look at the assembly code that is built by the compiler to understand more easily.

Assembly code of the original program

400580:       a9be7bfd        stp     x29, x30, [sp, #-32]!
  400584:       d2800041        mov     x1, #0x2                        // #2
  400588:       d2800200        mov     x0, #0x10                       // #16
  40058c:       910003fd        mov     x29, sp
  400590:       a90153f3        stp     x19, x20, [sp, #16]
  400594:       97ffffdb        bl      400500 <calloc@plt>
  400598:       d2800041        mov     x1, #0x2                        // #2
  40059c:       aa0003f4        mov     x20, x0
  4005a0:       d2800200        mov     x0, #0x10                       // #16
  4005a4:       97ffffd7        bl      400500 <calloc@plt>
  4005a8:       aa0003f3        mov     x19, x0
  4005ac:       52800201        mov     w1, #0x10                       // #16
  4005b0:       aa1403e0        mov     x0, x20
  4005b4:       94000077        bl      400790 <vol_createsample>
  4005b8:       d2800002        mov     x2, #0x0                        // #0
  4005bc:       1e2c1001        fmov    s1, #5.000000000000000000e-01
  4005c0:       78e26a81        ldrsh   w1, [x20, x2]
  4005c4:       1e220020        scvtf   s0, w1
  4005c8:       1e210800        fmul    s0, s0, s1
  4005cc:       5ea1b800        fcvtzs  s0, s0
  4005d0:       7c226a60        str     h0, [x19, x2]
  4005d4:       91000842        add     x2, x2, #0x2
  4005d8:       f100805f        cmp     x2, #0x20
  4005dc:       54ffff21        b.ne    4005c0 <main+0x40>  // b.any
  4005e0:       5289ba64        mov     w4, #0x4dd3                     // #19923
mov     x0, x19
  4005e8:       91008265        add     x5, x19, #0x20
  4005ec:       52800001        mov     w1, #0x0                        // #0
  4005f0:       72a20c44        movk    w4, #0x1062, lsl #16
  4005f4:       52807d03        mov     w3, #0x3e8                      // #1000
  4005f8:       78c02402        ldrsh   w2, [x0], #2
  4005fc:       0b010042        add     w2, w2, w1
  400600:       9b247c41        smull   x1, w2, w4
  400604:       9366fc21        asr     x1, x1, #38
  400608:       4b827c21        sub     w1, w1, w2, asr #31
  40060c:       1b038821        msub    w1, w1, w3, w2
  400610:       eb0000bf        cmp     x5, x0
  400614:       54ffff21        b.ne    4005f8 <main+0x78>  // b.any
  400618:       90000000        adrp    x0, 400000 <__abi_tag-0x278>
  40061c:       9120a000        add     x0, x0, #0x828
  400620:       97ffffc8        bl      400540 <printf@plt>
  400624:       52800000        mov     w0, #0x0                        // #0
  400628:       a94153f3        ldp     x19, x20, [sp, #16]
  40062c:       a8c27bfd        ldp     x29, x30, [sp], #32
  400630:       d65f03c0        ret

Assembly code without the sum loop and print

400500:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
  400504:       d2800041        mov     x1, #0x2                        // #2
  400508:       d2800200        mov     x0, #0x10                       // #16
  40050c:       910003fd        mov     x29, sp
  400510:       97ffffec        bl      4004c0 <calloc@plt>
  400514:       52800201        mov     w1, #0x10                       // #16
  400518:       9400005e        bl      400690 <vol_createsample>
  40051c:       52800000        mov     w0, #0x0                        // #0
  400520:       a8c17bfd        ldp     x29, x30, [sp], #16
  400524:       d65f03c0        ret

You can immediately notice that many parts of the assembly codes are missing when we do not include the sum loop and print. This is because the compiler recognizes that the results of volume scaling calculation is not used and optimizes the code by removing it. Obviously, we need to prevent this from happening as it is the code that we have to test!

Algorithm 2: vol1.c

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;

Instead of using floating-point calculation, vol1.c utilizes a fixed-point calculation with bit-shift operations. In this way, we can avoid the costly casting between integer and floating point and back again.

Algorithm 3: vol2.c

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x <= 32767; x++) {
 // Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x < SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

In vol2.c, we pre-calculate all 65536 result. Then, we use it to look up the result for each input value. Note we use a casting to uint16_t for each element's index. Since we cast a negative integer to unsigned type, x would have a unsigned integer with the bit pattern representing in the corresponding signed type. For example, -5 would become 65531 (2^16 - 5). In this way, we can populate the array with 65536 elements.

This program may have a better performance than the previous one because we create a table with all of the possible values. However, this may be varied depending on the speed of reading memory.

Dummy Algorithm: vol3.c

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) 100;
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sum the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

vol3.c is a simply dummy program and returns an identical sample value (100). The purpose of this program is to determine the possible overhead processing other than the scaling volume algorithm.

Algorithm 4: vol4.c

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

In vol4.c, we utilize Single Instruction, Multiple Data (SIMD) instructions with inline assembly codes (assembly language code inserted into a high-level language). Also note that we use AArch64 specific assembly code here and thus, this program can be only executed in the AArch64 system.

Let's take a look at some of the important points in the code (marked as Q). First of all, we need to multiply by 32767 when calculating vol_int to have a fixed-point representation of the volume factor. This is because the vol_int has a type of int16_t, a signed integer type with width of exactly 16 bits. Since its type is signed, the range of values it can hold is between -32,768 and 32,767. Thus, we need to multiply the sample with 32,767 to prevent the integer overflow.

Next, we need to set three pointers that point to the first element of in and out arrays as well as the end of the in array respectively. In this way, we can make a loop that multiplies the sample by the volume scaling factor.

Once we set all of the requirements mentioned above, we can start implementing inline assembly codes using __asm__. The dup instruction is used to duplicate the volume scaling factor from the register with 32-bit-wide access (w0) into the vector register with 8 lines (v1.8h). By doing this, we can multiply the each element of the vector by the scaling factor.

Inside of the loop, we have another inline assembly code that multiplies eight samples by the scaling factor. In contrast to the last __asm__ code, we have 3 operand parameters that are each separated by colon(:). The first operand parameter defines the output operands, in_cursor and out_cursor'. Each operand is named as[in_cursor] and [out_cursor] respectively so that they can be used in the assembler template (enclosed in double quotation). The + sign indicates a constraint that the given output operands are both read and written by the instruction. The last operand parameter is used for clobbers. The memory clobber is used to tell the compiler that the assembly code performs memory reads and writes.

Lastly, we can assume the printed results would be correct. This is because sqrdmulh instruction can saturate the result when overflowing happens.

Algorithm 4: vol5.c

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array interator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor < limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}

vol5.c also uses SIMD instruction but with complier intrinsic that are function-like language extensions built into the compiler. Since it uses the instructions unique to AArch64 architecture, this program is also specific to AArch64.

Let's explore the code together. Note that we use the same set of instructions as before - ldr instruction (vld1q_s16), dup instruction (vdupq_n_s16), sqrdmulh instruction (vqrdmulhq_s16), and str instruction (vst1q_s16). Note that the suffix of the intrinsic (s16, signed 16-bit values) indicates the vector length. Thus, each intrinsic will calculate 8 elements at a time (8 elements x 16 bits = 128 bits). This means we have to increment by 8 elements for both in_cursor and out_cursor (do not confuse that we are incrementing by 8 elements not 8 bytes!).

Also, notice that we have to manually increment both cursors for this time. This is because unlike the assembly inline code, the compiler intrinsic code does not increment the pointer for us.

Since both vol4.c and vol5.c utilize AArch64 specific SIMD instructions, it is logical to think these two should outperform other algorithms.

Conclusion

In this post we explored the multiple algorithms for adjusting volume samples. We saw how each algorithm differed even though they all accomplish the same goal. In the next post, we will examine the performance of each program and create a benchmark to verify our expectation.