A Serputov

Posted on Feb 28, 2022

Algorithm Selection with Inline Assembly

Background of this lab

Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
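The data-rate arithmetic above can be checked with a few lines of C. This is only a sketch; the function name pcm_bytes_per_second is made up for this illustration and is not part of the lab code.

```c
#include <stdint.h>

/* Bytes per second for uncompressed PCM audio.
 * rate_hz:          samples per second per channel (44100 or 48000 here)
 * channels:         2 for stereo
 * bytes_per_sample: 2 for signed 16-bit samples
 * (Hypothetical helper, named for this example only.) */
uint32_t pcm_bytes_per_second(uint32_t rate_hz, uint32_t channels,
                              uint32_t bytes_per_sample)
{
    return rate_hz * channels * bytes_per_sample;
}
```

Calling pcm_bytes_per_second(44100, 2, 2) gives 176400 and pcm_bytes_per_second(48000, 2, 2) gives 192000, matching the figures above.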

Multiple Approaches

Six programs are provided, each with a different approach to the problem, named vol0.c through vol5.c. A header file, vol.h, defines how much data (as a number of samples) will be processed by each program, as well as the volume level to be used for scaling (50%).

These are the six programs:

vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating point can be an expensive operation.
vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between integer and floating point and back again.
vol2.c pre-calculates all 65536 different results, and then looks up the answer for each input value.
vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (besides scaling the volume) done by the other programs.
vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not build for x86_64.
vol5.c uses SIMD instructions accessed through compiler intrinsics. This program is also specific to AArch64. Both vol4.c and vol5.c build only on AArch64 systems because they use architecture-specific SIMD instructions.
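To make the first three approaches concrete, here is a rough sketch of each one in portable C. These are not the lab's actual sources: the function names, the Q15 fixed-point format, and the table layout are my own assumptions for illustration.

```c
#include <stdint.h>

/* Naive approach (in the spirit of vol0): convert to float, multiply,
 * convert back. volume is in the range 0.0 to 1.0. */
int16_t scale_float(int16_t sample, float volume)
{
    return (int16_t)((float)sample * volume);
}

/* Fixed-point approach (in the spirit of vol1): the volume is an integer
 * with 15 fractional bits, so 0.5 is stored as 16384 (16384 / 32768).
 * vol_q15 is assumed to be in the range 0 to 32767.
 * Division is used instead of >> so negative samples behave portably. */
int16_t scale_fixed(int16_t sample, int32_t vol_q15)
{
    return (int16_t)(((int32_t)sample * vol_q15) / 32768);
}

/* Lookup-table approach (in the spirit of vol2): precompute the result
 * for all 65536 possible 16-bit inputs, then index by the sample bits. */
static int16_t table[65536];

void build_table(float volume)
{
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++) {
        /* (uint16_t)v maps -32768..32767 onto 0..65535 without gaps */
        table[(uint16_t)v] = (int16_t)((float)v * volume);
    }
}

int16_t scale_lookup(int16_t sample)
{
    return table[(uint16_t)sample];
}
```

The trade-off is visible in the shapes of the functions: scale_float pays for two conversions per sample, scale_fixed pays for one integer multiply and a divide by a power of two, and scale_lookup pays for one memory access into a 128 KiB table.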

Benchmarking

AArch64

My first step was to copy files from the root directory into my.
Afterwards, I unpacked the archive /public/spo600-volume-examples.tgz.
Here you can see the directory tree.
I built all the programs with make.
I tested vol0 through vol5 on AArch64 and vol0 through vol3 on x86_64, using this vol.h:


/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume

/* Function prototype to fill an array sample of 
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);

I changed the number of samples while experimenting with the code, and my results changed with it. I used 5,000,000 samples.

Now let's run each algorithm three times:

vol0	time ex
1 attempt	0m1.488s
2 attempt	0m1.469s
3 attempt	0m1.399s

AVG: 0m1.452s

vol1	time ex
1 attempt	0m1.479s
2 attempt	0m1.259s
3 attempt	0m1.329s

AVG: 0m1.355s

vol2	time ex
1 attempt	0m1.549s
2 attempt	0m1.379s
3 attempt	0m1.219s

AVG: 0m1.382s

vol3	time ex
1 attempt	0m1.279s
2 attempt	0m1.319s
3 attempt	0m1.349s

AVG: 0m1.315s

vol4	time ex
1 attempt	0m1.379s
2 attempt	0m1.359s
3 attempt	0m1.479s

AVG: 0m1.405s

vol5	time ex
1 attempt	0m1.338s
2 attempt	0m1.439s
3 attempt	0m1.349s

AVG: 0m1.375s

I also used a time function with #include
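The post does not show which header or function was used here. As one possible sketch, assuming the standard clock() function from time.h, in-program timing could look like this. The names cpu_seconds, sum_job and sum_work are hypothetical and exist only for this example.

```c
#include <stdint.h>
#include <time.h>

/* Run work(arg) once and return the CPU time it consumed, in seconds.
 * Assumption: clock() from <time.h> is the timing source. */
double cpu_seconds(void (*work)(void *), void *arg)
{
    clock_t start = clock();
    work(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical workload so the timer has something to measure:
 * adds 0 + 1 + ... + (n - 1) into total. */
struct sum_job {
    int64_t n;
    volatile int64_t total;   /* volatile so the loop is not optimized away */
};

void sum_work(void *arg)
{
    struct sum_job *job = arg;
    for (int64_t i = 0; i < job->n; i++) {
        job->total += i;
    }
}
```

Timing only the scaling loop this way would separate it from program start-up and sample generation, which the shell's time command includes in its total.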

My prediction

I expected vol4 or vol5 to be the fastest because they use SIMD instructions, but...

Memory Usage

x86_64

I repeated the same steps on x86_64.

vol0	time ex
1 attempt	0m0.811s
2 attempt	0m0.820s
3 attempt	0m0.844s

AVG: 0m0.825s

vol1	time ex
1 attempt	0m0.839s
2 attempt	0m0.803s
3 attempt	0m0.817s

AVG: 0m0.819s

vol2	time ex
1 attempt	0m0.826s
2 attempt	0m0.837s
3 attempt	0m0.828s

AVG: 0m0.830s

vol3	time ex
1 attempt	0m0.811s
2 attempt	0m0.824s
3 attempt	0m0.846s

AVG: 0m0.827s

Memory Usage:

Questions

Q: This part sums the samples. (Why is this needed?)

for (x = 0; x < SAMPLES; x++) {
     ttl=(ttl+out[x])%1000;
 }

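This loop walks through all SAMPLES entries (as defined in vol.h) of the out[] array and accumulates them into ttl, which is printed at the end. Because the final result depends on every scaled sample, an optimizing compiler cannot discard the scaling work as unused. The % 1000 keeps the running total small.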
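A small sketch of the same idea, using a stand-in scaling function (scale_half is hypothetical, not the lab's code) followed by the same kind of running sum:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lab's scaling loop: halve every sample. */
void scale_half(const int16_t *in, int16_t *out, size_t count)
{
    for (size_t x = 0; x < count; x++) {
        out[x] = (int16_t)(in[x] / 2);
    }
}

/* Same reduction as the lab code: a running sum kept small with % 1000.
 * Returning (and later printing) this value makes every out[x] "used". */
int checksum(const int16_t *out, size_t count)
{
    int ttl = 0;
    for (size_t x = 0; x < count; x++) {
        ttl = (ttl + out[x]) % 1000;
    }
    return ttl;
}
```

For the input {1000, 2000, 3000, 500}, scale_half produces {500, 1000, 1500, 250} and checksum returns 250.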

Q: Print the sum of the samples. (Why is this needed?)

printf("Result: %d\n", ttl);

Printing ttl makes the sum an observable output of the program, so the work that filled the out[] array must actually be performed. It also gives a value that can be compared between runs.

Q: should we use 32767 or 32768 in next line? why?

vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

We should use 32767, because that is the largest value a signed 16-bit integer can hold (the range is -32768 to 32767). With 32768, a volume of 100% would produce a factor of 32768, which does not fit in an int16_t.
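The overflow can be seen by computing the factor in 32 bits before narrowing it. This is a minimal sketch; volume_factor and fits_int16 are helper names invented for this example.

```c
#include <stdint.h>

/* Fixed-point factor for a volume percentage, computed in 32 bits so the
 * value can be inspected before it is narrowed to int16_t.
 * scale is the constant in question: 32767.0 or 32768.0. */
int32_t volume_factor(double volume_percent, double scale)
{
    return (int32_t)(volume_percent / 100.0 * scale);
}

/* Returns 1 if the factor can be stored in an int16_t, 0 otherwise. */
int fits_int16(int32_t factor)
{
    return factor >= INT16_MIN && factor <= INT16_MAX;
}
```

At 50% both constants are safe (volume_factor(50.0, 32767.0) is 16383), but at 100% only 32767.0 yields a factor that still fits.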

Q: what is the purpose of these next three lines?

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

The first line points the input cursor at the start of the in[] array, and the second points the output cursor at the start of the out[] array. The third line sets limit to the address just past the last input sample, so the loop knows where to stop.
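The same cursor-and-limit pattern in plain C, as a sketch (the function name scale_with_cursors and the halving step are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Walk two arrays with moving pointers instead of an index.
 * The loop ends when the input cursor reaches limit, which points
 * one element past the end of the input. */
void scale_with_cursors(const int16_t *in, int16_t *out, size_t samples)
{
    const int16_t *in_cursor = in;        /* next sample to read  */
    int16_t *out_cursor = out;            /* next slot to write   */
    const int16_t *limit = in + samples;  /* one past the last input */

    while (in_cursor < limit) {
        *out_cursor = (int16_t)(*in_cursor / 2);  /* hypothetical 50% scaling */
        in_cursor++;
        out_cursor++;
    }
}
```

Pointer cursors suit the SIMD versions well, because the load and store instructions can take the cursor registers directly as addresses.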

Q: what does it mean to "duplicate" values in the next line?

asm ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

To duplicate here means to copy one scalar value into every lane of a vector register. The value to copy is in %w0 (the register holding vol_int), and dup writes its low 16 bits into all eight 16-bit lanes of v1 (written v1.8h).
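A scalar model of what dup does, written in portable C so it runs anywhere. The lane-wise multiply that follows is my own illustration of why the factor is wanted in every lane; it is not the instruction the lab code uses.

```c
#include <stdint.h>

/* Scalar model of "dup v1.8h, w0": copy one 16-bit value into all
 * eight lanes of a 128-bit vector, modeled here as an array of 8. */
void dup_8h(int16_t lanes[8], int16_t value)
{
    for (int i = 0; i < 8; i++) {
        lanes[i] = value;
    }
}

/* Hypothetical lane-wise fixed-point multiply: each sample is scaled by
 * the factor in the same lane. factor lanes are assumed to be in the
 * range 0 to 32767 (15 fractional bits). */
void scale_8h(int16_t samples[8], const int16_t factor[8])
{
    for (int i = 0; i < 8; i++) {
        samples[i] = (int16_t)(((int32_t)samples[i] * factor[i]) / 32768);
    }
}
```

With the factor duplicated once outside the loop, a single vector instruction can then scale eight samples at a time.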

Q: What do these next three lines do?

                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

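These three lines are the operand lists of the asm statement. The first lists the output operands: in_cursor and out_cursor, marked "+r" because each is held in a register that the assembly both reads and writes. The second lists the input operands, here the same two cursors passed in registers. The third is the clobber list: "memory" tells the compiler that the assembly reads and writes memory, so it must not keep memory values cached in registers across the statement.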
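As a reduced sketch of how such operand lists are used, the function below copies one 16-byte block and advances both cursors. It is an assumption-laden illustration, not the lab's loop: copy_block8 is a made-up name, the separate input list is left empty because "+r" already covers reading, and a memcpy fallback stands in on non-AArch64 systems.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy eight int16_t samples (16 bytes) and advance both cursors by 8. */
void copy_block8(const int16_t **in, int16_t **out)
{
    const int16_t *in_cursor = *in;
    int16_t *out_cursor = *out;

#if defined(__aarch64__)
    __asm__ volatile(
        "ldr q0, [%[in_cursor]], #16  \n\t"  /* load 16 bytes, then add 16 to in_cursor   */
        "str q0, [%[out_cursor]], #16 \n\t"  /* store 16 bytes, then add 16 to out_cursor */
        : [in_cursor] "+r"(in_cursor), [out_cursor] "+r"(out_cursor) /* outputs, read and written */
        :                                     /* no separate inputs in this sketch */
        : "memory", "v0"                      /* memory is touched, v0 is overwritten */
    );
#else
    /* Portable fallback with the same effect. */
    memcpy(out_cursor, in_cursor, 8 * sizeof(int16_t));
    in_cursor += 8;
    out_cursor += 8;
#endif

    *in = in_cursor;
    *out = out_cursor;
}
```

Because the cursors are "+r" operands, the compiler knows their values change inside the assembly and reloads them afterwards.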

// Q: Why is the increment below 8 instead of 16 or some other value?
// Q: Why is this line not needed in the inline assembler version
//    of this program?

                in_cursor += 8;
                out_cursor += 8;
        }
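The post leaves these two questions unanswered. One likely reading, sketched below with made-up helper names: a 128-bit SIMD register holds eight 16-bit samples, and adding 8 to a pointer to int16_t moves it by 16 bytes. An assembly version that uses post-indexed loads and stores (adding 16 bytes to the address as part of the instruction) would already advance the cursors itself, so no separate increment would be needed there.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of int16_t samples that fit in one 128-bit (16-byte) register. */
size_t samples_per_vector(void)
{
    return 16 / sizeof(int16_t);
}

/* Bytes that a pointer to int16_t moves when n elements are added.
 * n must be at most 16 so the pointer stays within the local buffer. */
size_t bytes_advanced(size_t n)
{
    int16_t buffer[16] = { 0 };
    int16_t *cursor = buffer;
    int16_t *next = cursor + n;   /* pointer arithmetic counts elements */
    return (size_t)((char *)next - (char *)cursor);
}
```

So samples_per_vector() is 8, and bytes_advanced(8) is 16: the C increment of 8 elements and an assembly increment of 16 bytes describe the same step.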

Conclusion

⚠️ Computer Architecture Blog Post: Link

Links

🖇 Follow me on GitHub

🖇 Follow me on Twitter

P.S. This post was made for my Software Portability and Optimization class, Lab 5.
