DEV Community

Qzhang125
Qzhang125

Posted on • Updated on

SPO600 Project Stage 1: Algorithm Selection for Optimization

Hello everyone, welcome to assignment phase 1 of SPO600(Software Portability and optimization). In this phase, we will benchmark 6 different algorithms in the c program.

Background

Digital sound is typically represented, uncompressed, assigned 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.

Introduction

We got 6 approaches that use different algorithms:

  1. vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating points can be an expensive operation.
  2. vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between integer and floating-point and back again.
  3. vol2.c pre-calculates all 65536 different results and then looks up the answer for each input value.
  4. vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (besides scaling the volume) done by the other programs.
  5. vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not be built for x86_64.
  6. vol5.c uses SIMD instructions accessed through Compiler Intrinsics. This program is also specific to AArch64. For more details, please click here.

Procedure

  1. Unpack the archive /public/spo600-volume-examples.tgz
  2. Make a prediction of the relative performance of each scaling algorithm.
  3. Build and test each of the programs.
  4. Test the performance of each program.
  5. Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples.
  6. Increase performance by changing the compiler option (via the Makefile).
  7. Answer the question in the code.

AArch64(Israel)

My prediction

My prediction of the performance is the fastest program is vol5 because of the SIMD and the slowest program is vol2.

Build & test each program

This is what we got in the vol.h file:

/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume
#include <stdint.h>
/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
Enter fullscreen mode Exit fullscreen mode

In this header file, it defined 16 samples we got for the programs to process.
Now, let's build and test each of the programs:
Image description
As you can see here, most of the results are the same but we also got different outputs from the vol1 and vol3 programs. In all of the programs we got the same number of samples but different algorithms, so the algorithm is the only factor that affects different outputs.

Benchmarking

In this part, I designed a function using system time to calculate the time of execution in milliseconds. To complete this test, we also need to add

#include <sys/time.h>
Enter fullscreen mode Exit fullscreen mode

This is the code:

struct timeval  start , end;
gettimeofday(&start,0);
//loop goes here for testing
gettimeofday(&end,0);
float duration = (end.tv_sec - start.tv_sec) * 1000.0f + (end.tv_usec -start.tv_usec)/1000.0f;
printf("Time used: %f milisec\n", duration);
Enter fullscreen mode Exit fullscreen mode

Results:
For a number of samples: 1000000000
vol0.c(Using the basic algorithm)

Iteration Duration(millisecond) result
1 3295.242920 289
2 3280.395996 289
3 3295.914062 289
4 3287.457031 289
5 3309.891113 289

The average execution time of execution is 16,468.901122 / 5 = 3,293.7802244 milliseconds.

vol1.c

Iteration Duration(millisecond) result
1 2871.633057 192
2 2862.878906 192
3 2855.055908 192
4 2868.831055 192
5 2853.552002 192

The average execution time of vol2.c is 14,311.950928/5 = 2,862.3901856 milliseconds and it is 431.3900388 milliseconds faster than the basic algorithm.

vol2.c

Iteration Duration(millisecond) result
1 7004.999023 289
2 6997.184082 289
3 6996.041992 289
4 6994.092773 289
5 6983.860840 289

The average execution time of vol2.c is 34,976.169917 / 5 = 6,995.2339834 milliseconds.
vol2.c is so much slower than the other programs because of the pre-calculation.

vol3.c

Iteration Duration(millisecond) result
1 2193.116943 0
2 2192.761963 0
3 2193.154053 0
4 2191.714111 0
5 2193.959961 0

The average execution time of vol3.c is 10,964.707031 / 5 = 2,192.9414062 milliseconds.

vol4.c

Iteration Duration(millisecond) result
1 1704.156982 389
2 1701.747070 389
3 1755.469971 389
4 1753.229980 389
5 1736.984009 389

The average execution time of vol4.c is 8,651.588012 / 5 = 1,730.3176024 milliseconds which is the fastest so far.

vol5.c

Iteration Duration(millisecond) result
1 1738.630005 389
2 1756.073975 389
3 1754.686035 389
4 1788.035034 389
5 1703.054932 389

The average execution time of vol5.c is 8,740.479981 / 5 = 1,748.0959962 milliseconds.

Relative Memory Usage

To check the memory usage, I used this command:

free -m
Enter fullscreen mode Exit fullscreen mode

This is the result of memory usage on the Israel server in the AArch64 systems:
Image description

x86_64(Portugal)

My prediction

My prediction of performance for these programs is that vol3.c is the fastest and vol2.c is the slowest.

Build & test each program

In the x86_64 system, we only got vol0.c through vol3.c because vol4.c and vol5.c are only for the AArch64 system.
This is the initial build with 16 samples:
Image description
Compared to the previous initial build, the result is the same as the result that we got from the AArch64 system.

Benchmarking

For the program in the x86_64 system, I choose to use the same function that we used to test the execution time in the AArch64 system. Feel free to copy the previous code to test on the Portugal server.

First of all, let's test the vol0 which uses the basic algorithm.
Results:
For a number of samples: 1000000000
vol0.c(Using the basic algorithm)

Iteration Duration(millisecond) result
1 1689.119019 289
2 1656.987061 289
3 1658.895996 289
4 1681.101074 289
5 1662.795044 289

The average execution time of vol0.c is 8,348.898194 / 5 = 1,669.7796388 milliseconds.

vol1.c

Iteration Duration(millisecond) result
1 1657.718994 192
2 1659.477051 192
3 1655.046021 192
4 1662.978027 192
5 1654.200928 192

The average execution time of vol1.c is 8,289.421021 / 5 = 1,657.8842042 milliseconds.

vol2.c

Iteration Duration(millisecond) result
1 2163.916992 289
2 2120.509033 289
3 2119.781006 289
4 2122.691895 289
5 2131.298096 289

The average execution time of vol2.c is 10,658.197022 / 5 = 2,131.6394044 milliseconds which is considerably slower than other programs in x86_64 systems.

vol3.c

Iteration Duration(millisecond) result
1 1311.255981 0
2 1292.249023 0
3 1297.120972 0
4 1300.192017 0
5 1299.275024 0

The average execution time of vol3.c is 6,500.093017 / 5 = 1,300.0186034 milliseconds. It is the fastest program.

Relative Memory Usage

Same as before, use

free -m 
Enter fullscreen mode Exit fullscreen mode

command to check the memory usage on the Portugal server.
Image description

Questions

In this part, I will answer all of the questions which are marked with Q in all the source files.
Q1:

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }
Enter fullscreen mode Exit fullscreen mode

The reason that we need this part is that it is used to go through out of the SAMPLES that we defined in vol. h. Then assigned all of the results from out[] array to ttl property which then could be used to print out on as the result.

Q2:

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);
Enter fullscreen mode Exit fullscreen mode

The reason that we need this part is because it is used to print the result that we went through the SAMPLE then assigned from the out[] array.

Q3:

// Q: What's the point of this dummy program? How does it help
// with benchmarking?
Enter fullscreen mode Exit fullscreen mode

The dummy program does not scale the volume, it is used to determine the processing that has been done by other programs.

Q4:

// Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
Enter fullscreen mode Exit fullscreen mode

The reason to cast uint16_t is because it will explicitly define the number of bits, and it also ensures that the result is going to be an unsigned 16-bit integer.

Q5:

 // Q: should we use 32767 or 32768 in the next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
Enter fullscreen mode Exit fullscreen mode

We should use 32767 in the next line because we have defined the limit of the sample, so the sample will start at the minimum value of a 16-bit signed integer to the maximum value of the 16-bit signed integer.

Q6:

//Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
Enter fullscreen mode Exit fullscreen mode

The purpose of the first line is to assign the input cursor to the in[] array and the second line is going to assign the output cursor to the out[] array.

Q7:

 // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
Enter fullscreen mode Exit fullscreen mode

A duplicate of the value is stored in a vector which will act as an array of equal size. The value to duplicate is %w0 and the duplicate value will be sent to the dupv1.8h.

Q8:

// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
Enter fullscreen mode Exit fullscreen mode

These three lines are used to get the value from the input cursor and output cursor then store them into the system memory.

Q9:

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);
Enter fullscreen mode Exit fullscreen mode

This result is only usable in the AArch64 system because this program is only usable in the AArch64 system. It is correct if we use it in the AArch64 system and it does work properly.

Q10:

 // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);
Enter fullscreen mode Exit fullscreen mode

Same with the Q9, this result is only usable in the AArch64 system because the vol5.c is only usable in AArch64. However, it works properly in the AAcrh64 system.

Conclusion

Click here to check the full list of comparisons.
In this assignment, my prediction of the performance is the fastest program is vol5.c in among all of the programs because of the SIMD. It was accelerated through the compiler. In the x86_64 the vol3.c is the fastest program. My prediction is correct. Based on what we discovered, the x86_64 system executes the same program with the same number of samples two times faster than the AArch64. Looking at the differences between the programs themselves, the different algorithms could perform huge differences and save a lot of time for a little program.

Top comments (0)