Introduction
In the last post, we explored several volume-adjusting algorithms and made predictions about how well each would perform. Now we are going to measure the performance of each algorithm and test whether the results match our expectations.
The Audio Sample Size
Before we start testing, we will set the sample count to a large number so that the results are meaningful. For this, we will use 1,600,000,000 samples for each program. If we run the `time` command with the dummy program, we get the following result:
| real | 1m27.058s |
|---|---|
| user | 1m22.503s |
| sys | 0m4.496s |
The dummy program takes about a minute and a half in total. However, this time does not account only for the volume-scaling function: other work is involved as well (e.g. generating random samples, calculating results, and so on).
Evaluating Algorithm Performance
How do we measure the performance of only the volume-scaling function (`scale_sample`)?

```c
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
	out[x] = scale_sample(in[x], VOLUME);
}
```
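For context, a naive `scale_sample` might look like the following sketch (an assumption on my part; the actual implementation in these programs may differ in casting and rounding):

```c
#include <stdint.h>

// Naive scaling: multiply each signed 16-bit sample by a
// floating-point volume factor and truncate back to int16_t.
// Sketch only; the real scale_sample may differ.
static inline int16_t scale_sample(int16_t sample, float volume) {
	return (int16_t)(sample * volume);
}
```

This is essentially what `vol0` does: one floating-point multiply per sample.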
We can easily implement this using the C `time.h` library. With it, we can isolate the function and measure the elapsed time as follows:
```c
// ---- Include the C Time library
#include <time.h>

clock_t t;

// ---- Record the start time
t = clock();

// ... scale-sample code ...

// ---- Calculate the elapsed time
t = clock() - t;

// ---- Print the elapsed time in seconds
printf("Time elapsed: %f\n", ((double)t) / CLOCKS_PER_SEC);
```
In this way, we can estimate the elapsed time of the scaling function alone, in seconds.
Benchmark Test Results
For benchmarking, a total of 20 runs were performed for each algorithm. Every algorithm processed 1,600,000,000 samples, and all were assessed on both AArch64 and x86_64 systems. During the tests, background activity was kept to a minimum.
The following tables show the results. Both tables show very small standard deviations (SD), meaning the measurements are tightly clustered around the mean.
AArch64
All times are in seconds.

| Run | vol0 | vol1 | vol2 | vol4 | vol5 |
|---|---|---|---|---|---|
| 1 | 5.290686 | 4.571809 | 11.204779 | 2.862223 | 2.897304 |
| 2 | 5.271289 | 4.616451 | 11.236343 | 2.869659 | 2.860497 |
| 3 | 5.3009 | 4.618019 | 11.207497 | 2.839968 | 2.88575 |
| 4 | 5.257061 | 4.57951 | 11.229004 | 2.794136 | 2.837761 |
| 5 | 5.29981 | 4.584778 | 11.237608 | 2.879343 | 2.857112 |
| 6 | 5.252714 | 4.590422 | 11.220075 | 2.785239 | 2.859161 |
| 7 | 5.300421 | 4.590156 | 11.215143 | 2.870726 | 2.919503 |
| 8 | 5.286753 | 4.589992 | 11.224697 | 2.794225 | 2.895057 |
| 9 | 5.317688 | 4.61077 | 11.268087 | 2.907598 | 2.91678 |
| 10 | 5.272125 | 4.63759 | 11.235228 | 2.799026 | 2.881828 |
| 11 | 5.308232 | 4.58515 | 11.229461 | 2.882254 | 2.910783 |
| 12 | 5.286579 | 4.599118 | 11.253098 | 2.85217 | 2.903325 |
| 13 | 5.282362 | 4.597291 | 11.190576 | 2.875931 | 2.920964 |
| 14 | 5.276742 | 4.611212 | 11.239454 | 2.849582 | 2.853147 |
| 15 | 5.293711 | 4.591562 | 11.253258 | 2.870164 | 2.918136 |
| 16 | 5.293716 | 4.621955 | 11.228463 | 2.858067 | 2.850342 |
| 17 | 5.318874 | 4.591154 | 11.225114 | 2.864949 | 2.912111 |
| 18 | 5.306651 | 4.590993 | 11.252793 | 2.841034 | 2.847878 |
| 19 | 5.30221 | 4.641963 | 11.220678 | 2.877916 | 2.842209 |
| 20 | 5.299778 | 4.593774 | 11.206139 | 2.868532 | 2.856316 |
| Total | 105.818302 | 92.013669 | 224.577495 | 57.042742 | 57.625964 |
| Average | 5.2909151 | 4.60068345 | 11.22887475 | 2.8521371 | 2.8812982 |
| SD | 0.01805085609 | 0.01880182964 | 0.01914206262 | 0.0338674976 | 0.02977236719 |
In the previous post, we predicted that the algorithms using SIMD instructions would be faster than the others. Indeed, we can observe that the `vol4` and `vol5` algorithms outperform the rest. The performance difference between the two is tiny (~0.0291 seconds), indicating that inline assembly and compiler intrinsics are almost equally fast.

We can also see that `vol1` runs faster than `vol0`. This matches our expectation, since `vol1` uses fixed-point calculation with bit-shift operations instead of floating-point multiplication.
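A fixed-point version along those lines can be sketched as follows (an assumption on my part; the actual `vol1` code may use a different scale factor or rounding):

```c
#include <stdint.h>

// Fixed-point scaling: the volume factor is converted once into a
// Q15 integer (volume * 2^15); each sample is then scaled with an
// integer multiply followed by a 15-bit right shift, avoiding
// per-sample floating-point math. Sketch only.
static inline int16_t scale_sample_fixed(int16_t sample, int32_t factor) {
	// factor == (int32_t)(volume * 32768.0f), precomputed once
	return (int16_t)(((int32_t)sample * factor) >> 15);
}
```

For example, a volume of 0.5 becomes `factor = 16384`, so each sample is halved with one integer multiply and one shift.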
Interestingly, the `vol2` algorithm turns out to be significantly slower than the others. Initially, we assumed it might outperform `vol0` and `vol1`, which multiply each sample by the scaling factor, because it pre-calculates every possible result and stores them in a table. This result suggests that the CPU has an arithmetic logic unit (ALU) efficient enough to process the multiplications quickly, while reading the pre-calculated values from the table in memory is comparatively slow.
x86_64
All times are in seconds.

| Run | vol0 | vol1 | vol2 |
|---|---|---|---|
| 1 | 2.821902 | 2.784482 | 3.531761 |
| 2 | 2.903628 | 2.786877 | 3.569542 |
| 3 | 2.895999 | 2.78038 | 3.551214 |
| 4 | 2.877543 | 2.785402 | 3.559591 |
| 5 | 2.886563 | 2.785422 | 3.537273 |
| 6 | 2.891856 | 2.783449 | 3.545279 |
| 7 | 2.80208 | 2.786667 | 3.58345 |
| 8 | 2.855822 | 2.782619 | 3.590136 |
| 9 | 2.804731 | 2.781633 | 3.572802 |
| 10 | 2.782909 | 2.801589 | 3.587121 |
| 11 | 2.783267 | 2.783468 | 3.630578 |
| 12 | 2.785422 | 2.800091 | 3.562486 |
| 13 | 2.81526 | 2.77875 | 3.591089 |
| 14 | 2.873962 | 2.778289 | 3.529016 |
| 15 | 2.791908 | 2.789269 | 3.579964 |
| 16 | 2.785272 | 2.792904 | 3.55086 |
| 17 | 2.804883 | 2.778821 | 3.587747 |
| 18 | 2.78638 | 2.785906 | 3.545412 |
| 19 | 2.788079 | 2.795611 | 3.574527 |
| 20 | 2.810512 | 2.794108 | 3.54657 |
| Total | 56.547978 | 55.735737 | 71.326418 |
| Average | 2.8273989 | 2.78678685 | 3.5663209 |
| SD | 0.04456744515 | 0.006838116502 | 0.02516021857 |
The x86_64 system shows a similar pattern to the AArch64 system: the `vol1` algorithm is the fastest and `vol2` is the slowest. Note that `vol4` and `vol5` are missing here because those programs use SIMD instructions that are specific to the AArch64 architecture.
Conclusion
In this post, we measured the performance of each algorithm to test the assumptions we made in the previous post. As expected, the algorithms that use SIMD instructions run faster than the others, since they can process multiple data elements at a time.