Introduction
In the last post, we explored several volume-adjusting algorithms and made predictions about how well each would perform. Now we are going to measure the performance of each algorithm and test whether the results match our expectations.
The Audio Sample Size
Before we start testing, we will set the sample count to a large number so that the results are meaningful. For this, we will use 1,600,000,000 samples for each program. If we run the `time` command with the dummy program, we get the following result:
| real | 1m27.058s |
|---|---|
| user | 1m22.503s |
| sys | 0m4.496s |
The dummy program takes about a minute and a half in total. However, this time does not account only for the volume-scaling function: other work is involved as well (e.g. generating random samples, calculating results, and so on).
Evaluating Algorithm Performance
How do we measure the performance of only the volume-scaling function (`scale_sample`)?

```c
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
	out[x] = scale_sample(in[x], VOLUME);
}
```
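For context, a naive `scale_sample` might look like the following sketch (an assumption on my part; the actual implementation in these programs may differ in casting and rounding):

```c
#include <stdint.h>

// Naive scaling: multiply each signed 16-bit sample by a
// floating-point volume factor and truncate back to int16_t.
// Sketch only; the real scale_sample may differ.
static inline int16_t scale_sample(int16_t sample, float volume) {
	return (int16_t)(sample * volume);
}
```

This is essentially what `vol0` does: one floating-point multiply per sample.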
We can easily implement this using the C `time.h` library. With it, we can isolate the function and measure the elapsed time as follows:
```c
// ---- Include the C Time library
#include <time.h>

clock_t t;

// ---- Record the start time
t = clock();

// ... scale-sample code ...

// ---- Calculate the elapsed time
t = clock() - t;

// ---- Print the elapsed time in seconds
printf("Time elapsed: %f\n", ((double)t) / CLOCKS_PER_SEC);
```
In this way, we can estimate the elapsed time of the scaling function alone, in seconds.
Benchmark Test Results
For benchmarking, a total of 20 runs were performed for each algorithm. Every algorithm processed 1,600,000,000 samples, and all were assessed on both AArch64 and x86_64 systems. During the tests, background activity was kept to a minimum.
The following tables show the results. Both tables show very small standard deviations (SD), meaning the measurements are tightly clustered around the mean.
AArch64
All times are in seconds.

| Run | vol0 | vol1 | vol2 | vol4 | vol5 |
|---|---|---|---|---|---|
| 1 | 5.290686 | 4.571809 | 11.204779 | 2.862223 | 2.897304 |
| 2 | 5.271289 | 4.616451 | 11.236343 | 2.869659 | 2.860497 |
| 3 | 5.3009 | 4.618019 | 11.207497 | 2.839968 | 2.88575 |
| 4 | 5.257061 | 4.57951 | 11.229004 | 2.794136 | 2.837761 |
| 5 | 5.29981 | 4.584778 | 11.237608 | 2.879343 | 2.857112 |
| 6 | 5.252714 | 4.590422 | 11.220075 | 2.785239 | 2.859161 |
| 7 | 5.300421 | 4.590156 | 11.215143 | 2.870726 | 2.919503 |
| 8 | 5.286753 | 4.589992 | 11.224697 | 2.794225 | 2.895057 |
| 9 | 5.317688 | 4.61077 | 11.268087 | 2.907598 | 2.91678 |
| 10 | 5.272125 | 4.63759 | 11.235228 | 2.799026 | 2.881828 |
| 11 | 5.308232 | 4.58515 | 11.229461 | 2.882254 | 2.910783 |
| 12 | 5.286579 | 4.599118 | 11.253098 | 2.85217 | 2.903325 |
| 13 | 5.282362 | 4.597291 | 11.190576 | 2.875931 | 2.920964 |
| 14 | 5.276742 | 4.611212 | 11.239454 | 2.849582 | 2.853147 |
| 15 | 5.293711 | 4.591562 | 11.253258 | 2.870164 | 2.918136 |
| 16 | 5.293716 | 4.621955 | 11.228463 | 2.858067 | 2.850342 |
| 17 | 5.318874 | 4.591154 | 11.225114 | 2.864949 | 2.912111 |
| 18 | 5.306651 | 4.590993 | 11.252793 | 2.841034 | 2.847878 |
| 19 | 5.30221 | 4.641963 | 11.220678 | 2.877916 | 2.842209 |
| 20 | 5.299778 | 4.593774 | 11.206139 | 2.868532 | 2.856316 |
| Total | 105.818302 | 92.013669 | 224.577495 | 57.042742 | 57.625964 |
| Average | 5.2909151 | 4.60068345 | 11.22887475 | 2.8521371 | 2.8812982 |
| SD | 0.01805085609 | 0.01880182964 | 0.01914206262 | 0.0338674976 | 0.02977236719 |
In the previous post, we predicted that the algorithms using SIMD instructions would be faster than the others. Indeed, we can observe that the `vol4` and `vol5` algorithms outperform the rest. The performance difference between the two is tiny (~0.0291 seconds), indicating that inline assembly and compiler intrinsics are almost equally fast.

We can also see that `vol1` runs faster than `vol0`. This matches our expectation, since `vol1` uses fixed-point calculation with bit-shift operations instead of floating-point multiplication.
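A fixed-point version along those lines can be sketched as follows (an assumption on my part; the actual `vol1` code may use a different scale factor or rounding):

```c
#include <stdint.h>

// Fixed-point scaling: the volume factor is converted once into a
// Q15 integer (volume * 2^15); each sample is then scaled with an
// integer multiply followed by a 15-bit right shift, avoiding
// per-sample floating-point math. Sketch only.
static inline int16_t scale_sample_fixed(int16_t sample, int32_t factor) {
	// factor == (int32_t)(volume * 32768.0f), precomputed once
	return (int16_t)(((int32_t)sample * factor) >> 15);
}
```

For example, a volume of 0.5 becomes `factor = 16384`, so each sample is halved with one integer multiply and one shift.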
Interestingly, the `vol2` algorithm turns out to be significantly slower than the others. Initially, we assumed it might outperform `vol0` and `vol1`, which multiply each sample by the scaling factor, because it pre-calculates every possible result and stores them in a table. This result suggests that the CPU has an arithmetic logic unit (ALU) efficient enough to process the multiplications quickly, while reading the pre-calculated values from the table in memory is comparatively slow.
x86_64
All times are in seconds.

| Run | vol0 | vol1 | vol2 |
|---|---|---|---|
| 1 | 2.821902 | 2.784482 | 3.531761 |
| 2 | 2.903628 | 2.786877 | 3.569542 |
| 3 | 2.895999 | 2.78038 | 3.551214 |
| 4 | 2.877543 | 2.785402 | 3.559591 |
| 5 | 2.886563 | 2.785422 | 3.537273 |
| 6 | 2.891856 | 2.783449 | 3.545279 |
| 7 | 2.80208 | 2.786667 | 3.58345 |
| 8 | 2.855822 | 2.782619 | 3.590136 |
| 9 | 2.804731 | 2.781633 | 3.572802 |
| 10 | 2.782909 | 2.801589 | 3.587121 |
| 11 | 2.783267 | 2.783468 | 3.630578 |
| 12 | 2.785422 | 2.800091 | 3.562486 |
| 13 | 2.81526 | 2.77875 | 3.591089 |
| 14 | 2.873962 | 2.778289 | 3.529016 |
| 15 | 2.791908 | 2.789269 | 3.579964 |
| 16 | 2.785272 | 2.792904 | 3.55086 |
| 17 | 2.804883 | 2.778821 | 3.587747 |
| 18 | 2.78638 | 2.785906 | 3.545412 |
| 19 | 2.788079 | 2.795611 | 3.574527 |
| 20 | 2.810512 | 2.794108 | 3.54657 |
| Total | 56.547978 | 55.735737 | 71.326418 |
| Average | 2.8273989 | 2.78678685 | 3.5663209 |
| SD | 0.04456744515 | 0.006838116502 | 0.02516021857 |
The x86_64 system shows a similar pattern to the AArch64 system: the `vol1` algorithm is the fastest and `vol2` is the slowest. Note that `vol4` and `vol5` are missing here because those programs use SIMD instructions that are specific to the AArch64 architecture.
Conclusion
In this post, we measured the performance of each algorithm to test the assumptions we made in the previous post. As expected, the algorithms that use SIMD instructions run faster than the others, since they can process multiple data elements at a time.