Introduction
Uncompressed digital sound is typically represented as signed 16-bit (2 bytes) integer samples. For a 48000 audio sample (kHz), the data rate can easily surpass 96,000 bytes per seconds (2 bytes per sample * 48000 samples per seconds). When we change the sound volume, each sample needs to be scaled by a volume factor between 0 (no volume) and 1 (full volume). Considering the amount of data in sound samples, it is vital to have efficient volume adjusting algorithm to scale sound. This is especially true for a mobile device as the amount of processing required can affect its battery life.
In this post, we are going to explore a number of different algorithms for processing sound samples to control volume level. After that, we will study the performance of each algorithm to benchmark.
volume.h
/* This is the number of samples to be processed */
#define SAMPLES 16
/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume
/* Function prototype to fill an array sample of
* length sample_count with random int16_t numbers
* to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
vol_createsample.c
void vol_createsample(int16_t* sample, int32_t sample_count) {
int i;
for (i=0; i<sample_count; i++) {
sample[i] = (rand()%65536)-32768;
}
return;
}
In volume.h
, we define a constant named SAMPLES
to define the number of samples to be processed. We will use a reasonably large number for this to have a processed time at least 20 seconds. This will allow us to analyze the performance much more easily.
vol_createsample
function is made to fill an array with random numbers to simulate an audio buffer.
Algorithm 1: vol0.c
int16_t scale_sample(int16_t sample, int volume) {
return (int16_t) ((float) (volume/100.0) * (float) sample);
}
int main() {
int x;
int ttl=0;
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
out[x]=scale_sample(in[x], VOLUME);
}
// ---- This part sums the samples. (Why is this needed?)
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// ---- Print the sum of the samples. (Why is this needed?)
printf("Result: %d\n", ttl);
return 0;
}
The vol0.c
contains a very naïve algorithm that simply multiplies each sample by the volume scaling factor. This also involves with casting from signed 16-bit integer into floating point and back again - which can be very expensive and take a lot of resources.
It is also noteworthy to mention why we need to loop that sums the samples as well as to print the sum to the console. They must exist so that the algorithm can perform correctly. Let's take a look at the assembly code that is built by the compiler to understand more easily.
Assembly code of the original program
400580: a9be7bfd stp x29, x30, [sp, #-32]!
400584: d2800041 mov x1, #0x2 // #2
400588: d2800200 mov x0, #0x10 // #16
40058c: 910003fd mov x29, sp
400590: a90153f3 stp x19, x20, [sp, #16]
400594: 97ffffdb bl 400500 <calloc@plt>
400598: d2800041 mov x1, #0x2 // #2
40059c: aa0003f4 mov x20, x0
4005a0: d2800200 mov x0, #0x10 // #16
4005a4: 97ffffd7 bl 400500 <calloc@plt>
4005a8: aa0003f3 mov x19, x0
4005ac: 52800201 mov w1, #0x10 // #16
4005b0: aa1403e0 mov x0, x20
4005b4: 94000077 bl 400790 <vol_createsample>
4005b8: d2800002 mov x2, #0x0 // #0
4005bc: 1e2c1001 fmov s1, #5.000000000000000000e-01
4005c0: 78e26a81 ldrsh w1, [x20, x2]
4005c4: 1e220020 scvtf s0, w1
4005c8: 1e210800 fmul s0, s0, s1
4005cc: 5ea1b800 fcvtzs s0, s0
4005d0: 7c226a60 str h0, [x19, x2]
4005d4: 91000842 add x2, x2, #0x2
4005d8: f100805f cmp x2, #0x20
4005dc: 54ffff21 b.ne 4005c0 <main+0x40> // b.any
4005e0: 5289ba64 mov w4, #0x4dd3 // #19923
mov x0, x19
4005e8: 91008265 add x5, x19, #0x20
4005ec: 52800001 mov w1, #0x0 // #0
4005f0: 72a20c44 movk w4, #0x1062, lsl #16
4005f4: 52807d03 mov w3, #0x3e8 // #1000
4005f8: 78c02402 ldrsh w2, [x0], #2
4005fc: 0b010042 add w2, w2, w1
400600: 9b247c41 smull x1, w2, w4
400604: 9366fc21 asr x1, x1, #38
400608: 4b827c21 sub w1, w1, w2, asr #31
40060c: 1b038821 msub w1, w1, w3, w2
400610: eb0000bf cmp x5, x0
400614: 54ffff21 b.ne 4005f8 <main+0x78> // b.any
400618: 90000000 adrp x0, 400000 <__abi_tag-0x278>
40061c: 9120a000 add x0, x0, #0x828
400620: 97ffffc8 bl 400540 <printf@plt>
400624: 52800000 mov w0, #0x0 // #0
400628: a94153f3 ldp x19, x20, [sp, #16]
40062c: a8c27bfd ldp x29, x30, [sp], #32
400630: d65f03c0 ret
Assembly code without the sum loop and print
400500: a9bf7bfd stp x29, x30, [sp, #-16]!
400504: d2800041 mov x1, #0x2 // #2
400508: d2800200 mov x0, #0x10 // #16
40050c: 910003fd mov x29, sp
400510: 97ffffec bl 4004c0 <calloc@plt>
400514: 52800201 mov w1, #0x10 // #16
400518: 9400005e bl 400690 <vol_createsample>
40051c: 52800000 mov w0, #0x0 // #0
400520: a8c17bfd ldp x29, x30, [sp], #16
400524: d65f03c0 ret
You can immediately notice that many parts of the assembly codes are missing when we do not include the sum loop and print. This is because the compiler recognizes that the results of volume scaling calculation is not used and optimizes the code by removing it. Obviously, we need to prevent this from happening as it is the code that we have to test!
Algorithm 2: vol1.c
int16_t scale_sample(int16_t sample, int volume) {
return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}
int main() {
int x;
int ttl=0;
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
out[x]=scale_sample(in[x], VOLUME);
}
// ---- This part sums the samples.
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// ---- Print the sum of the samples.
printf("Result: %d\n", ttl);
return 0;
Instead of using floating-point calculation, vol1.c
utilizes a fixed-point calculation with bit-shift operations. In this way, we can avoid the costly casting between integer and floating point and back again.
Algorithm 3: vol2.c
int main() {
int x;
int ttl=0;
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
static int16_t* precalc;
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
precalc = (int16_t*) calloc(65536,2);
if (precalc == NULL) {
printf("malloc failed!\n");
return 1;
}
for (x = -32768; x <= 32767; x++) {
// Q: What is the purpose of the cast to unint16_t in the next line?
precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
}
for (x = 0; x < SAMPLES; x++) {
out[x]=precalc[(uint16_t) in[x]];
}
// ---- This part sums the samples.
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// ---- Print the sum of the samples.
printf("Result: %d\n", ttl);
return 0;
}
In vol2.c
, we pre-calculate all 65536 result. Then, we use it to look up the result for each input value. Note we use a casting to uint16_t
for each element's index. Since we cast a negative integer to unsigned type, x would have a unsigned integer with the bit pattern representing in the corresponding signed type. For example, -5
would become 65531 (2^16 - 5). In this way, we can populate the array with 65536 elements.
This program may have a better performance than the previous one because we create a table with all of the possible values. However, this may be varied depending on the speed of reading memory.
Dummy Algorithm: vol3.c
int16_t scale_sample(int16_t sample, int volume) {
return (int16_t) 100;
}
int main() {
int x;
int ttl=0;
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
out[x]=scale_sample(in[x], VOLUME);
}
// ---- This part sum the samples.
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// ---- Print the sum of the samples.
printf("Result: %d\n", ttl);
return 0;
}
vol3.c
is a simply dummy program and returns an identical sample value (100). The purpose of this program is to determine the possible overhead processing other than the scaling volume algorithm.
Algorithm 4: vol4.c
int main() {
#ifndef __aarch64__
printf("Wrong architecture - written for aarch64 only.\n");
#else
// these variables will also be accessed by our assembler code
int16_t* in_cursor; // input cursor
int16_t* out_cursor; // output cursor
int16_t vol_int; // volume as int16_t
int16_t* limit; // end of input array
int x; // array interator
int ttl=0 ; // array total
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
// set vol_int to fixed-point representation of the volume factor
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
// Q: what is the purpose of these next two lines?
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES;
// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
while ( in_cursor < limit ) {
__asm__ (
"ldr q0, [%[in_cursor]], #16 \n\t"
// load eight samples into q0 (same as v0.8h)
// from [in_cursor]
// post-increment in_cursor by 16 bytes
// and store back into the pointer register
"sqrdmulh v0.8h, v0.8h, v1.8h \n\t"
// with 32 signed integer output,
// multiply each lane in v0 * v1 * 2
// saturate results
// store upper 16 bits of results into
// the corresponding lane in v0
"str q0, [%[out_cursor]],#16 \n\t"
// store eight samples to [out_cursor]
// post-increment out_cursor by 16 bytes
// and store back into the pointer register
// Q: What do these next three lines do?
: [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
: "r"(in_cursor),"r"(out_cursor)
: "memory"
);
}
// --------------------------------------------------------------------
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);
return 0;
#endif
}
In vol4.c
, we utilize Single Instruction, Multiple Data (SIMD) instructions with inline assembly codes (assembly language code inserted into a high-level language). Also note that we use AArch64 specific assembly code here and thus, this program can be only executed in the AArch64 system.
Let's take a look at some of the important points in the code (marked as Q
). First of all, we need to multiply by 32767
when calculating vol_int
to have a fixed-point representation of the volume factor. This is because the vol_int
has a type of int16_t
, a signed integer type with width of exactly 16 bits. Since its type is signed, the range of values it can hold is between -32,768 and 32,767. Thus, we need to multiply the sample with 32,767 to prevent the integer overflow.
Next, we need to set three pointers that point to the first element of in
and out
arrays as well as the end of the in
array respectively. In this way, we can make a loop that multiplies the sample by the volume scaling factor.
Once we set all of the requirements mentioned above, we can start implementing inline assembly codes using __asm__
. The dup
instruction is used to duplicate the volume scaling factor from the register with 32-bit-wide access (w0) into the vector register with 8 lines (v1.8h). By doing this, we can multiply the each element of the vector by the scaling factor.
Inside of the loop, we have another inline assembly code that multiplies eight samples by the scaling factor. In contrast to the last __asm__
code, we have 3 operand parameters that are each separated by colon(:
). The first operand parameter defines the output operands, in_cursor
and out_cursor'. Each operand is named as
[in_cursor] and [out_cursor] respectively so that they can be used in the assembler template (enclosed in double quotation). The +
sign indicates a constraint that the given output operands are both read and written by the instruction. The last operand parameter is used for clobbers. The memory
clobber is used to tell the compiler that the assembly code performs memory reads and writes.
Lastly, we can assume the printed results would be correct. This is because sqrdmulh
instruction can saturate the result when overflowing happens.
Algorithm 4: vol5.c
int main() {
#ifndef __aarch64__
printf("Wrong architecture - written for aarch64 only.\n");
#else
register int16_t* in_cursor asm("r20"); // input cursor (pointer)
register int16_t* out_cursor asm("r21"); // output cursor (pointer)
register int16_t vol_int asm("r22"); // volume as int16_t
int16_t* limit; // end of input array
int x; // array interator
int ttl=0; // array total
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
vol_int = (int16_t) (VOLUME/100.0 * 32767.0);
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES ;
while ( in_cursor < limit ) {
// What do these intrinsic functions do?
// (See gcc intrinsics documentation)
vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));
// Q: Why is the increment below 8 instead of 16 or some other value?
// Q: Why is this line not needed in the inline assembler version
// of this program?
in_cursor += 8;
out_cursor += 8;
}
// --------------------------------------------------------------------
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// Q: Are the results usable? Are they accurate?
printf("Result: %d\n", ttl);
return 0;
#endif
}
vol5.c
also uses SIMD instruction but with complier intrinsic that are function-like language extensions built into the compiler. Since it uses the instructions unique to AArch64 architecture, this program is also specific to AArch64.
Let's explore the code together. Note that we use the same set of instructions as before - ldr
instruction (vld1q_s16
), dup
instruction (vdupq_n_s16
), sqrdmulh
instruction (vqrdmulhq_s16
), and str
instruction (vst1q_s16
). Note that the suffix of the intrinsic (s16, signed 16-bit values) indicates the vector length. Thus, each intrinsic will calculate 8 elements at a time (8 elements x 16 bits = 128 bits). This means we have to increment by 8 elements for both in_cursor
and out_cursor
(do not confuse that we are incrementing by 8 elements not 8 bytes!).
Also, notice that we have to manually increment both cursors for this time. This is because unlike the assembly inline code, the compiler intrinsic code does not increment the pointer for us.
Since both vol4.c
and vol5.c
utilize AArch64 specific SIMD instructions, it is logical to think these two should outperform other algorithms.
Conclusion
In this post we explored the multiple algorithms for adjusting volume samples. We saw how each algorithm differed even though they all accomplish the same goal. In the next post, we will examine the performance of each program and create a benchmark to verify our expectation.
Top comments (0)