Introduction
Previously, we tried to implement SVE2 by using the existing codes that are written in NEON instructions. Unfortunately, the result was not so fruitful. Instead, in this post, we will try to utilize the auto-vectorization method in order to add SVE2 instructions.
Before We Start
As we know, Opus package already supports NEON intrinsics. When we run the ./configure
, we can see that the script automatically detects that the existing processor supports ARM NEON intrinsics optimizations.
$ ./configure
opus 1.3.1-107-gccaaffa9-dirty: Automatic configuration OK.
Compiler support:
C99 var arrays: ................ yes
C99 lrintf: .................... yes
Use alloca: .................... no (using var arrays)
General configuration:
Floating point support: ........ yes
Fast float approximations: ..... no
Fixed point debugging: ......... no
Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
External Assembly Optimizations:
Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64)
Run-time CPU detection: ........ no
Custom modes: .................. no
Assertion checking: ............ no
Hardening: ..................... yes
Fuzzing: ....................... no
Check ASM: ..................... no
API documentation: ............. yes
Extra programs: ................ yes
However, NEON intrinsic may conflict with auto-vectorizations that are done by the compiler. For this reason, we have to look for a way to disable this feature in the first place.
Thankfully, we can easily achieve this by editing the configure.ac
file. When we search for neon
keyword, we can find the following codes:
AS_IF([test x"$enable_intrinsics" = x"yes"],[
intrinsics_support=""
AS_CASE([$host_cpu],
[arm*|aarch64*],
[
cpu_arm=yes
OPUS_CHECK_INTRINSICS(
[ARM Neon],
[$ARM_NEON_INTR_CFLAGS],
[OPUS_ARM_MAY_HAVE_NEON_INTR],
[OPUS_ARM_PRESUME_NEON_INTR],
[[#include <arm_neon.h>
]],
[[
static float32x4_t A0, A1, SUMM;
SUMM = vmlaq_f32(SUMM, A0, A1);
return (int)vgetq_lane_f32(SUMM, 0);
]]
)
The code checks if the enable_intrinsics
is true or not. By changing x"yes"
to x""
, we can disable the intrinsic configuration. When we rerun autogen.sh
and configure
scripts, we can confirm that the intrinsic optimizations are disabled.
$ ./autogen.sh
$ ./configure
opus 1.3.1-107-gccaaffa9-dirty: Automatic configuration OK.
Compiler support:
C99 var arrays: ................ yes
C99 lrintf: .................... yes
Use alloca: .................... no (using var arrays)
General configuration:
Floating point support: ........ yes
Fast float approximations: ..... no
Fixed point debugging: ......... no
Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
External Assembly Optimizations:
Intrinsics Optimizations: ...... no
Run-time CPU detection: ........ no
Custom modes: .................. no
Assertion checking: ............ no
Hardening: ..................... yes
Fuzzing: ....................... no
Check ASM: ..................... no
API documentation: ............. yes
Extra programs: ................ yes
Auto-vectorization Implementation
Before we begin, we need to check what what compiler flags are being used by the package. To do this, we can look at the Makefile
.
# Makefile
CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
In order to enable the auto-vectorization, we need to edit the CFLAGS
. However, this is not-so-easy job to do because there are multiple Makefile
throughout the package. That is, we have to edit (possibly) all the following files.
$ find . -name Makefile
./doc/Makefile
./doc/latex/Makefile
./Makefile
./celt/dump_modes/Makefile
Fortunately, we can avoid this problem by overriding the configure
script that is generated for us. For this, we use the existing CFLAGS
and modify it as follows:
$ ./configure CFLAGS="-g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes"
We change the optimization level to 03
to enable the auto-vectorization and specify the machine architecture to be ArmV8 with SVE2 extension. Once we run it, we can check that the CFLAGS
are updated successfully in the Makefile
.
# Makefile
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
Let's try to run the unit tests to see whether the auto-vectorization kicked in during the compilation.
$ make check
# ...
./test-driver: line 107: 1431013 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 1431028 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_api
# ...
============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS: 4
# SKIP: 0
# XFAIL: 0
# FAIL: 10
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to opus@xiph.org
============================================================================
This is expected because we are trying to execute a binary that is coded with SVE2 instructions and the existing hardware does not yet support them. To solve this, we need to run the emulation by using qemu-aarch64
command. Let's run the command using one of the unit test that has failed - test_opus_api
.
$ qemu-aarch64 test_opus_api
Error while loading test_opus_api: Exec format error
Interestingly, the command cannot run because the test file format is invalid. Let's take a look at the content of the test file.
$ vi test_opus_api
#! /bin/sh
# tests/test_opus_api - temporary wrapper script for .libs/test_opus_api
# Generated by libtool (GNU libtool) 2.4.6
#
# The tests/test_opus_api program cannot be directly executed until all the libtool
# libraries that it depends on are installed.
#
# This wrapper script should never be moved out of the build directory.
# If it is, it will not operate correctly.
# Sed substitution that helps us do robust quoting. It backslashifies
# metacharacters that are still active within double-quoted strings.
The test file is actually a wrapper script and that is why the qemu-aarch64
cannot run. We can easily solve this by inserting the command when the script launches the actual program.
Let's look for a code where the script starts the program and add the qemu-aarch64
command:
# Core function for launching the target application
func_exec_program_core ()
{
if test -n "$lt_option_debug"; then
$ECHO "test_opus_api:tests/test_opus_api:$LINENO: newargv[0]: $progdir/$program" 1>&2
func_lt_dump_args ${1+"$@"} 1>&2
fi
exec qemu-aarch64 "$progdir/$program" ${1+"$@"}
$ECHO "$0: cannot exec $program $*" 1>&2
exit 1
}
When we rerun the unit test, we can see that all tests passed without any problems.
$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically
Decoder basic API tests
---------------------------------------------------
opus_decoder_get_size(0)=0 ................... OK.
opus_decoder_get_size(1)=18228 ............... OK.
opus_decoder_get_size(2)=26996 ............... OK.
opus_decoder_get_size(3)=0 ................... OK.
opus_decoder_create() ........................ OK.
opus_decoder_init() .......................... OK.
OPUS_GET_FINAL_RANGE ......................... OK.
OPUS_UNIMPLEMENTED ........................... OK.
OPUS_GET_BANDWIDTH ........................... OK.
OPUS_GET_SAMPLE_RATE ......................... OK.
OPUS_GET_PITCH ............................... OK.
OPUS_GET_LAST_PACKET_DURATION ................ OK.
OPUS_SET_GAIN ................................ OK.
OPUS_GET_GAIN ................................ OK.
OPUS_RESET_STATE ............................. OK.
opus_{packet,decoder}_get_nb_samples() ....... OK.
opus_packet_get_nb_frames() .................. OK.
opus_packet_get_bandwidth() .................. OK.
opus_packet_get_samples_per_frame() .......... OK.
opus_decode() ................................ OK.
opus_decode_float() .......................... OK.
All decoder interface tests passed
(1219433 API invocations)
Multistream decoder basic API tests
---------------------------------------------------
opus_multistream_decoder_get_size(-1,-1)=0 ... OK.
opus_multistream_decoder_get_size(-1, 0)=0 ... OK.
opus_multistream_decoder_get_size(-1, 1)=0 ... OK.
opus_multistream_decoder_get_size(-1, 2)=0 ... OK.
opus_multistream_decoder_get_size(-1, 3)=0 ... OK.
opus_multistream_decoder_get_size( 0,-1)=0 ... OK.
opus_multistream_decoder_get_size( 0, 0)=0 ... OK.
opus_multistream_decoder_get_size( 0, 1)=0 ... OK.
opus_multistream_decoder_get_size( 0, 2)=0 ... OK.
opus_multistream_decoder_get_size( 0, 3)=0 ... OK.
opus_multistream_decoder_get_size( 1,-1)=0 ... OK.
opus_multistream_decoder_get_size( 1, 0)=18504 OK.
opus_multistream_decoder_get_size( 1, 1)=27272 OK.
opus_multistream_decoder_get_size( 1, 2)=0 ... OK.
opus_multistream_decoder_get_size( 1, 3)=0 ... OK.
opus_multistream_decoder_get_size( 2,-1)=0 ... OK.
opus_multistream_decoder_get_size( 2, 0)=36736 OK.
opus_multistream_decoder_get_size( 2, 1)=45504 OK.
opus_multistream_decoder_get_size( 2, 2)=54272 OK.
opus_multistream_decoder_get_size( 2, 3)=0 ... OK.
opus_multistream_decoder_get_size( 3,-1)=0 ... OK.
opus_multistream_decoder_get_size( 3, 0)=54968 OK.
opus_multistream_decoder_get_size( 3, 1)=63736 OK.
opus_multistream_decoder_get_size( 3, 2)=72504 OK.
opus_multistream_decoder_get_size( 3, 3)=81272 OK.
opus_multistream_decoder_create() ............ OK.
opus_multistream_decoder_init() .............. OK.
OPUS_GET_FINAL_RANGE ......................... OK.
OPUS_MULTISTREAM_GET_DECODER_STATE ........... OK.
OPUS_SET_GAIN ................................ OK.
OPUS_GET_GAIN ................................ OK.
OPUS_GET_BANDWIDTH ........................... OK.
OPUS_UNIMPLEMENTED ........................... OK.
OPUS_RESET_STATE ............................. OK.
opus_multistream_decode() .................... OK.
opus_multistream_decode_float() .............. OK.
All multistream decoder interface tests passed
(576106 API invocations)
Packet header parsing tests
---------------------------------------------------
code 0 (65 cases) ............................ OK.
code 1 (163456 cases) ........................ OK.
code 2 (326528 cases) ........................ OK.
code 3 m-truncation (64 cases) ............... OK.
code 3 m=0,49-64 (4096 cases) ................ OK.
code 3 m=1 CBR (81728 cases) ................. OK.
code 3 m=1-48 CBR (103544448 cases) .......... OK.
code 3 m=1-48 VBR (120832 cases) ............. OK.
code 3 padding (1519448 cases) ............... OK.
opus_packet_parse ............................ OK.
All packet parsing tests passed
(105760666 API invocations)
Encoder basic API tests
---------------------------------------------------
opus_encoder_get_size(0)=0 ................... OK.
opus_encoder_get_size(1)=43572 ............... OK.
opus_encoder_get_size(2)=48484 ............... OK.
opus_encoder_get_size(3)=0 ................... OK.
opus_encoder_create() ........................ OK.
opus_encoder_init() .......................... OK.
OPUS_GET_LOOKAHEAD ........................... OK.
OPUS_GET_SAMPLE_RATE ......................... OK.
OPUS_UNIMPLEMENTED ........................... OK.
OPUS_SET_APPLICATION ......................... OK.
OPUS_GET_APPLICATION ......................... OK.
OPUS_SET_BITRATE ............................. OK.
OPUS_GET_BITRATE ............................. OK.
OPUS_SET_FORCE_CHANNELS ...................... OK.
OPUS_GET_FORCE_CHANNELS ...................... OK.
OPUS_SET_BANDWIDTH ........................... OK.
OPUS_GET_BANDWIDTH ........................... OK.
OPUS_SET_MAX_BANDWIDTH ....................... OK.
OPUS_GET_MAX_BANDWIDTH ....................... OK.
OPUS_SET_DTX ................................. OK.
OPUS_GET_DTX ................................. OK.
OPUS_SET_COMPLEXITY .......................... OK.
OPUS_GET_COMPLEXITY .......................... OK.
OPUS_SET_INBAND_FEC .......................... OK.
OPUS_GET_INBAND_FEC .......................... OK.
OPUS_SET_PACKET_LOSS_PERC .................... OK.
OPUS_GET_PACKET_LOSS_PERC .................... OK.
OPUS_SET_VBR ................................. OK.
OPUS_GET_VBR ................................. OK.
OPUS_SET_VBR_CONSTRAINT ...................... OK.
OPUS_GET_VBR_CONSTRAINT ...................... OK.
OPUS_SET_SIGNAL .............................. OK.
OPUS_GET_SIGNAL .............................. OK.
OPUS_SET_LSB_DEPTH ........................... OK.
OPUS_GET_LSB_DEPTH ........................... OK.
OPUS_SET_PREDICTION_DISABLED ................. OK.
OPUS_GET_PREDICTION_DISABLED ................. OK.
OPUS_SET_EXPERT_FRAME_DURATION ............... OK.
OPUS_GET_EXPERT_FRAME_DURATION ............... OK.
OPUS_GET_FINAL_RANGE ......................... OK.
OPUS_RESET_STATE ............................. OK.
opus_encode() ................................ OK.
opus_encode_float() .......................... OK.
All encoder interface tests passed
(1152209 API invocations)
Repacketizer tests
---------------------------------------------------
opus_repacketizer_get_size()=496 ............. OK.
opus_repacketizer_init ....................... OK.
opus_repacketizer_create ..................... OK.
opus_repacketizer_get_nb_frames .............. OK.
opus_repacketizer_cat ........................ OK.
opus_repacketizer_out ........................ OK.
opus_repacketizer_out_range .................. OK.
opus_packet_pad .............................. OK.
opus_packet_unpad ............................ OK.
opus_multistream_packet_pad .................. OK.
opus_multistream_packet_unpad ................ OK.
All repacketizer tests passed
(6713561 API invocations)
malloc() failure tests
---------------------------------------------------
opus_decoder_create() ................... SKIPPED.
opus_encoder_create() ................... SKIPPED.
opus_repacketizer_create() .............. SKIPPED.
opus_multistream_decoder_create() ....... SKIPPED.
opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)
All API tests passed.
The libopus API was invoked 115421979 times.
Now, we know the SVE2 implementation is successfully added to the program. Let's double-check this by looking for the presence of SVE2 specific instruction within the binary files. Using the following command, we can see the list of files with whilelo
instruction.
find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done
#...
======= ./.libs/libopus.so.0.8.0
2ef0: 25a40fe1 whilelo p1.s, wzr, w4
2f28: 25a40c60 whilelo p0.s, w3, w4
2f7c: 25a40fe1 whilelo p1.s, wzr, w4
2fa0: 25a40c60 whilelo p0.s, w3, w4
3314: 25b80fe0 whilelo p0.s, wzr, w24
3344: 25b80c20 whilelo p0.s, w1, w24
3784: 25b80fe0 whilelo p0.s, wzr, w24
37a8: 25b80c00 whilelo p0.s, w0, w24
39cc: 25b80fe0 whilelo p0.s, wzr, w24
39f0: 25b80c00 whilelo p0.s, w0, w24
3a5c: 25b80fe0 whilelo p0.s, wzr, w24
3a80: 25b80c00 whilelo p0.s, w0, w24
4488: 25b50fe2 whilelo p2.s, wzr, w21
44fc: 25b50c00 whilelo p0.s, w0, w21
4590: 25b50fe2 whilelo p2.s, wzr, w21
45f8: 25b50c00 whilelo p0.s, w0, w21
4780: 25a40fe2 whilelo p2.s, wzr, w4
47e4: 25a40c00 whilelo p0.s, w0, w4
4940: 25b80fe0 whilelo p0.s, wzr, w24
4954: 25b80c00 whilelo p0.s, w0, w24
4a2c: 25a40fe1 whilelo p1.s, wzr, w4
4a50: 25a40c00 whilelo p0.s, w0, w4
4b34: 25a40fe1 whilelo p1.s, wzr, w4
4b68: 25a40c60 whilelo p0.s, w3, w4
4e34: 25b30fe0 whilelo p0.s, wzr, w19
4e54: 25b30c60 whilelo p0.s, w3, w19
4fe4: 25b30fe0 whilelo p0.s, wzr, w19
5000: 25b30c20 whilelo p0.s, w1, w19
51d8: 25b30fe0 whilelo p0.s, wzr, w19
5208: 25b30c00 whilelo p0.s, w0, w19
54a0: 25a10fe0 whilelo p0.s, wzr, w1
54b8: 25a10c00 whilelo p0.s, w0, w1
55c8: 25a11fe0 whilelo p0.s, xzr, x1
55e8: 25a11c00 whilelo p0.s, x0, x1
5724: 25a11fe0 whilelo p0.s, xzr, x1
5738: 25a11c00 whilelo p0.s, x0, x1
5c2c: 25a80fe1 whilelo p1.s, wzr, w8
5c7c: 25a80d21 whilelo p1.s, w9, w8
5ec4: 25a50fe2 whilelo p2.s, wzr, w5
5f30: 25a50c00 whilelo p0.s, w0, w5
64dc: 25a10fe0 whilelo p0.s, wzr, w1
6500: 25a10c00 whilelo p0.s, w0, w1
6818: 25a11fe0 whilelo p0.s, xzr, x1
6830: 25a11c00 whilelo p0.s, x0, x1
#...
And when we count the total number of lines that use whilelo
, we get a total of 2903 lines.
$ find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done | wc -l
2903
Conclusion
In this post, we explored and implemented SVE2 by using auto-vectorization of the compiler. Using the existing unit tests, we were able to identify if the auto-vectorization was added successfully. We also found that it added a significant number of SVE2 specific instruction, whilelo
(i.e. 2903 lines). This indicates that the Opus project may greatly benefit from SVE2 implementation.
Top comments (0)