Hello everyone, welcome to phase 1 of the SPO600 (Software Portability and Optimization) assignment. In this phase, we will benchmark six different volume-scaling algorithms implemented as C programs.
Background
Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
Introduction
There are six approaches, each using a different algorithm (a minimal sketch of the first three scaling approaches appears right after this list):
- vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating point can be an expensive operation.
- vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between integer and floating-point and back again.
- vol2.c pre-calculates all 65536 different results and then looks up the answer for each input value.
- vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (besides scaling the volume) done by the other programs.
- vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not be built for x86_64.
- vol5.c uses SIMD instructions accessed through Compiler Intrinsics. This program is also specific to AArch64. For more details, please click here.
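To make the differences among the first three approaches concrete, here is a minimal, self-contained sketch (not the course's vol0.c/vol1.c/vol2.c source) that scales one sample with each technique. It assumes a 50% volume, a Q15 fixed-point multiplier for the vol1-style path, and a 65,536-entry table indexed by the unsigned reinterpretation of the sample for the vol2-style path:

/* Sketch of the three scalar scaling approaches (naive, fixed-point, lookup). */
#include <stdint.h>
#include <stdio.h>

#define VOLUME 50.0

/* vol0-style: cast to float, multiply, cast back */
static int16_t scale_naive(int16_t s) {
    return (int16_t)((float)s * VOLUME / 100.0);
}

/* vol1-style: integer multiply by a Q15 volume factor, then shift right 15 bits */
static int16_t scale_fixed(int16_t s, int16_t vol_int) {
    return (int16_t)(((int32_t)s * vol_int) >> 15);
}

/* vol2-style: pre-calculate every possible answer, then look it up */
static int16_t precalc[65536];

static void build_table(void) {
    for (int32_t x = -32768; x <= 32767; x++)
        precalc[(uint16_t)x] = (int16_t)((float)x * VOLUME / 100.0);
}

int main(void) {
    int16_t vol_int = (int16_t)(VOLUME / 100.0 * 32767.0); /* 16383 at 50% volume */
    int16_t sample = 1000;

    build_table();
    printf("naive:  %d\n", scale_naive(sample));          /* 500 */
    printf("fixed:  %d\n", scale_fixed(sample, vol_int)); /* 499 (Q15 truncation) */
    printf("lookup: %d\n", precalc[(uint16_t)sample]);    /* 500 */
    return 0;
}

The fixed-point result can differ by one least-significant bit because the Q15 multiply truncates, which is inaudible at 16-bit resolution.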
Procedure
- Unpack the archive /public/spo600-volume-examples.tgz
- Make a prediction of the relative performance of each scaling algorithm.
- Build and test each of the programs.
- Test the performance of each program.
- Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples.
- Increase performance by changing the compiler options (via the Makefile).
- Answer the questions embedded in the code.
AArch64 (Israel)
My prediction
My prediction is that vol5 will be the fastest program because of its use of SIMD, and vol2 will be the slowest.
Build & test each program
This is the content of the vol.h header file:
/* This is the number of samples to be processed */
#define SAMPLES 16
/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume
#include <stdint.h>
/* Function prototype to fill an array sample of
* length sample_count with random int16_t numbers
* to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
In this header file, SAMPLES is defined as 16, so that is the number of samples each program processes by default.
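The course provides its own implementation of vol_createsample(); for reference, a hypothetical version that satisfies this prototype could look like the following (the use of rand() and the sample distribution are my assumptions, not the course code):

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical implementation: fill the buffer with pseudo-random samples
 * covering the full int16_t range (assumes RAND_MAX >= 65535, true on glibc). */
void vol_createsample(int16_t* sample, int32_t sample_count) {
    for (int32_t i = 0; i < sample_count; i++)
        sample[i] = (int16_t)((rand() % 65536) - 32768);
}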
Now, let's build and test each of the programs:
As you can see, most of the programs print the same result, but vol1 and vol3 produce different outputs. All of the programs process the same number of samples, so the scaling algorithm is the only factor that accounts for the different outputs.
Benchmarking
In this part, I used gettimeofday() from the system time functions to measure the execution time in milliseconds. To use it, we also need to add
#include <sys/time.h>
This is the code:
struct timeval start, end;
gettimeofday(&start, 0);
// the loop to be timed goes here
gettimeofday(&end, 0);
float duration = (end.tv_sec - start.tv_sec) * 1000.0f + (end.tv_usec - start.tv_usec) / 1000.0f;
printf("Time used: %f millisec\n", duration);
Results:
Number of samples: 1,000,000,000
vol0.c (using the basic algorithm)
Iteration | Duration (ms) | Result |
---|---|---|
1 | 3295.242920 | 289 |
2 | 3280.395996 | 289 |
3 | 3295.914062 | 289 |
4 | 3287.457031 | 289 |
5 | 3309.891113 | 289 |
The average execution time of vol0.c is 16,468.901122 / 5 = 3,293.7802244 milliseconds.
vol1.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 2871.633057 | 192 |
2 | 2862.878906 | 192 |
3 | 2855.055908 | 192 |
4 | 2868.831055 | 192 |
5 | 2853.552002 | 192 |
The average execution time of vol1.c is 14,311.950928 / 5 = 2,862.3901856 milliseconds, which is 431.3900388 milliseconds faster than the basic algorithm.
vol2.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 7004.999023 | 289 |
2 | 6997.184082 | 289 |
3 | 6996.041992 | 289 |
4 | 6994.092773 | 289 |
5 | 6983.860840 | 289 |
The average execution time of vol2.c is 34,976.178710 / 5 = 6,995.235742 milliseconds.
vol2.c is much slower than the other programs: on top of the time spent pre-calculating the 65,536-entry table (128 KiB of int16_t values), every sample requires a table lookup in memory, which on this system costs more than the arithmetic it replaces.
vol3.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 2193.116943 | 0 |
2 | 2192.761963 | 0 |
3 | 2193.154053 | 0 |
4 | 2191.714111 | 0 |
5 | 2193.959961 | 0 |
The average execution time of vol3.c is 10,964.707031 / 5 = 2,192.9414062 milliseconds.
vol4.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 1704.156982 | 389 |
2 | 1701.747070 | 389 |
3 | 1755.469971 | 389 |
4 | 1753.229980 | 389 |
5 | 1736.984009 | 389 |
The average execution time of vol4.c is 8,651.588012 / 5 = 1,730.3176024 milliseconds which is the fastest so far.
vol5.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 1738.630005 | 389 |
2 | 1756.073975 | 389 |
3 | 1754.686035 | 389 |
4 | 1788.035034 | 389 |
5 | 1703.054932 | 389 |
The average execution time of vol5.c is 8,740.479981 / 5 = 1,748.0959962 milliseconds.
Relative Memory Usage
To check the memory usage, I used this command:
free -m
This is the memory usage reported on the Israel (AArch64) server:
x86_64 (Portugal)
My prediction
My prediction for these programs is that vol3.c will be the fastest and vol2.c will be the slowest.
Build & test each program
On the x86_64 system, we only build vol0.c through vol3.c, because vol4.c and vol5.c are specific to AArch64.
This is the initial build with 16 samples:
The output of this initial build matches what we got from the AArch64 system.
Benchmarking
For the x86_64 system, I used the same timing code that measured execution time on the AArch64 system; feel free to copy the code above to test on the Portugal server.
First, let's test vol0, which uses the basic algorithm.
Results:
Number of samples: 1,000,000,000
vol0.c (using the basic algorithm)
Iteration | Duration (ms) | Result |
---|---|---|
1 | 1689.119019 | 289 |
2 | 1656.987061 | 289 |
3 | 1658.895996 | 289 |
4 | 1681.101074 | 289 |
5 | 1662.795044 | 289 |
The average execution time of vol0.c is 8,348.898194 / 5 = 1,669.7796388 milliseconds.
vol1.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 1657.718994 | 192 |
2 | 1659.477051 | 192 |
3 | 1655.046021 | 192 |
4 | 1662.978027 | 192 |
5 | 1654.200928 | 192 |
The average execution time of vol1.c is 8,289.421021 / 5 = 1,657.8842042 milliseconds.
vol2.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 2163.916992 | 289 |
2 | 2120.509033 | 289 |
3 | 2119.781006 | 289 |
4 | 2122.691895 | 289 |
5 | 2131.298096 | 289 |
The average execution time of vol2.c is 10,658.197022 / 5 = 2,131.6394044 milliseconds, which is considerably slower than the other programs on the x86_64 system.
vol3.c
Iteration | Duration (ms) | Result |
---|---|---|
1 | 1311.255981 | 0 |
2 | 1292.249023 | 0 |
3 | 1297.120972 | 0 |
4 | 1300.192017 | 0 |
5 | 1299.275024 | 0 |
The average execution time of vol3.c is 6,500.093017 / 5 = 1,300.0186034 milliseconds. It is the fastest program on x86_64, as expected, since it does not scale the samples at all.
Relative Memory Usage
Same as before, use the
free -m
command to check the memory usage on the Portugal server.
Questions
In this part, I will answer all of the questions marked with "Q:" in the source files.
Q1:
// ---- This part sums the samples. (Why is this needed?)
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
This loop walks through the out[] array (all SAMPLES scaled samples) and accumulates them, modulo 1000, into ttl. This produces a single checksum-style value that can be printed and compared across the different programs, and it also forces the compiler to actually use the scaled output, so the scaling loop cannot be optimized away.
Q2:
// ---- Print the sum of the samples. (Why is this needed?)
printf("Result: %d\n", ttl);
Printing ttl makes the result of the computation observable: we can check that the different implementations produce the same value, and the compiler cannot discard the scaling work as unused.
Q3:
// Q: What's the point of this dummy program? How does it help
// with benchmarking?
The dummy program does not scale the volume at all, so its run time represents the overhead of everything else the programs do (generating the samples, looping, and summing the results). Subtracting this overhead from the other programs' times isolates the cost of the scaling itself.
Q4:
// Q: What is the purpose of the cast to uint16_t in the next line?
precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
The value x being converted is a signed 16-bit sample, which can be negative. Casting it to uint16_t reinterprets it as an unsigned value in the range 0 to 65535, so it can be used as a valid, non-negative index into the 65,536-entry precalc[] lookup table.
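As a quick illustration (my own example, not from the course files), the cast is what turns a negative sample into a valid array index:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t x = -1;
    /* -1 reinterpreted as uint16_t is 65535, a valid precalc[] index;
     * without the cast, precalc[-1] would be out of bounds. */
    printf("%u\n", (unsigned)(uint16_t)x); /* prints 65535 */
    return 0;
}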
Q5:
// Q: should we use 32767 or 32768 in the next line? why?
vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
We should use 32767 because it is the largest value a signed 16-bit integer can hold (the int16_t range is -32768 to 32767). If we used 32768, then at 100% volume the factor would be 32768, which does not fit in int16_t, so vol_int would overflow. Using 32767 keeps the fixed-point volume factor within range at every volume setting.
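A small check (my own illustration, not part of the course files) of why 32768 would be a problem at full volume:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    printf("int16_t range: %d .. %d\n", INT16_MIN, INT16_MAX); /* -32768 .. 32767 */

    /* At 100% volume the factor equals the constant itself: */
    int32_t factor_32767 = (int32_t)(100.0 / 100.0 * 32767.0); /* 32767: fits in int16_t */
    int32_t factor_32768 = (int32_t)(100.0 / 100.0 * 32768.0); /* 32768: does NOT fit */

    printf("%d fits in int16_t: %s\n", factor_32767, factor_32767 <= INT16_MAX ? "yes" : "no");
    printf("%d fits in int16_t: %s\n", factor_32768, factor_32768 <= INT16_MAX ? "yes" : "no");
    return 0;
}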
Q6:
//Q: what is the purpose of these next two lines?
in_cursor = in;
out_cursor = out;
These two lines initialize the cursor pointers to the start of the in[] and out[] arrays. The SIMD loop then advances in_cursor and out_cursor through the buffers as it loads, scales, and stores the samples.
Q7:
// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
"Duplicate" means that the single scalar value held in the general-purpose register %w0 (which contains vol_int) is copied into every one of the eight 16-bit lanes of the SIMD register v1.8h. After this, one vector multiply can scale eight samples by the same volume factor at once.
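For comparison, the same "duplicate" step expressed with compiler intrinsics looks like this. This is a hedged sketch of the idea, not the course's vol4.c or vol5.c: vdupq_n_s16() is the intrinsic counterpart of the dup instruction, and vqdmulhq_s16() is used here as one way to apply a Q15 volume factor to eight samples at once. It only builds on AArch64 (or another target that provides arm_neon.h):

#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t in[8]  = {1000, -1000, 2000, -2000, 300, -300, 40, -40};
    int16_t out[8];
    int16_t vol_int = (int16_t)(0.5 * 32767.0);  /* 50% volume as a Q15 factor */

    int16x8_t vvol = vdupq_n_s16(vol_int);       /* duplicate vol_int into all 8 lanes */
    int16x8_t vin  = vld1q_s16(in);              /* load 8 samples */
    int16x8_t vout = vqdmulhq_s16(vin, vvol);    /* saturating (2*a*b)>>16, roughly (a*b)>>15 */
    vst1q_s16(out, vout);                        /* store 8 scaled samples */

    for (int i = 0; i < 8; i++)
        printf("%6d -> %6d\n", in[i], out[i]);
    return 0;
}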
Q8:
// Q: What do these next three lines do?
: [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
: "r"(in_cursor),"r"(out_cursor)
: "memory"
These three lines are the operand and clobber lists of the extended inline assembly. The first line is the output operand list: it binds in_cursor and out_cursor to registers with the "+r" constraint, meaning they are both read and written by the assembly. The second line lists them again as input operands. The third line, "memory", is the clobber list: it tells the compiler that the assembly reads and writes memory, so it must not cache values in registers or reorder memory accesses across the asm block.
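To make the three-part structure easier to see, here is a tiny, hypothetical AArch64 example of extended inline assembly, unrelated to the volume code, with the same outputs : inputs : clobbers layout:

#include <stdio.h>

int main(void) {
    long a = 40, b = 2, sum;

    __asm__("add %0, %1, %2"
            : "=r"(sum)              /* output operand list */
            : "r"(a), "r"(b)         /* input operand list  */
            : /* no clobbers needed for this instruction */);

    printf("%ld\n", sum);            /* prints 42 */
    return 0;
}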
Q9:
// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);
The results are usable and correct, but only on AArch64: vol4.c uses AArch64 inline assembly, so it can only be built and run on that architecture, where it works properly and produces a consistent result.
Q10:
// Q: Are the results usable? Are they accurate?
printf("Result: %d\n", ttl);
As with Q9, the results are usable only on AArch64, because vol5.c relies on AArch64 compiler intrinsics. On that architecture it works properly and produces a consistent result.
Conclusion
Click here to check the full list of comparisons.
In this assignment, I predicted that vol5.c would be the fastest program on AArch64 because of its use of SIMD, accelerated through the compiler, and that vol3.c would be the fastest on x86_64. The predictions were largely correct: on AArch64 the two SIMD versions (vol4.c and vol5.c) were by far the fastest, with vol4.c marginally ahead of vol5.c in my runs, and vol2.c was the slowest; on x86_64, vol3.c was the fastest. Based on these measurements, the x86_64 (Portugal) server ran the same programs with the same number of samples roughly two to three times faster than the AArch64 (Israel) server. Looking at the differences between the programs themselves, the choice of algorithm makes a large difference in run time, even for a small program.