DEV Community: Seung Woo (Paul) Ji

SVE2 Implementation for Opus Codec Library Analysis

Seung Woo (Paul) Ji — Fri, 22 Apr 2022 22:44:05 +0000

Introduction

Previously, we successfully added SVE2 code to the Opus codec library by using the compiler's auto-vectorization. In this post, we will analyze the result to check that the SVE2 code was generated correctly and to estimate its possible impact on the software's performance.

SVE2 Code Analysis

As we explored in the previous post, the compiler auto-vectorized many parts of the package. Let's take a look at one of them to see where SVE2 code is used.

The Opus codec uses CELT as one of its ways to encode and decode audio. In opus/celt, we can see the following list of files.

$ ls
arch.h           celt.o         entenc.o          mdct.c              quant_bands.lo
arm              cpu_support.h  fixed_c5x.h       mdct.h              quant_bands.o
bands.c          cwrs.c         fixed_c6x.h       mdct.lo             rate.c
bands.h          cwrs.h         fixed_debug.h     mdct.o              rate.h
bands.lo         cwrs.lo        fixed_generic.h   meson.build         rate.lo
bands.o          cwrs.o         float_cast.h      mfrngcod.h          rate.o
celt.c           dump_modes     kiss_fft.c        mips                stack_alloc.h
celt_decoder.c   ecintrin.h     _kiss_fft_guts.h  modes.c             static_modes_fixed_arm_ne10.h
celt_decoder.lo  entcode.c      kiss_fft.h        modes.h             static_modes_fixed.h
celt_decoder.o   entcode.h      kiss_fft.lo       modes.lo            static_modes_float_arm_ne10.h
celt_encoder.c   entcode.lo     kiss_fft.o        modes.o             static_modes_float.h
celt_encoder.lo  entcode.o      laplace.c         opus_custom_demo.c  tests
celt_encoder.o   entdec.c       laplace.h         os_support.h        vq.c
celt.h           entdec.h       laplace.lo        pitch.c             vq.h
celt.lo          entdec.lo      laplace.o         pitch.h             vq.lo
celt_lpc.c       entdec.o       mathops.c         pitch.lo            vq.o
celt_lpc.h       entenc.c       mathops.h         pitch.o             x86
celt_lpc.lo      entenc.h       mathops.lo        quant_bands.c
celt_lpc.o       entenc.lo      mathops.o         quant_bands.h

The celt_encoder.c file contains many for loops that may benefit from SVE2. The following code is one of them:

// celt_encoder.c
// ...

1100       /* For non-transient CBR/CVBR frames, halve the dynalloc contribution */
1101       if ((!vbr || constrained_vbr)&&!isTransient)
1102       {
1103          for (i=start;i<end;i++)
1104             follower[i] = HALF16(follower[i]);
1105       }
1106       for (i=start;i<end;i++)
1107       {
1108          if (i<8)
1109             follower[i] *= 2;
1110          if (i>=12)
1111             follower[i] = HALF16(follower[i]);

// ...

In the code, we can see a loop that iterates from start to end. Depending on the value of i, the ith element of the follower array is either doubled or halved. The loop body involves no complex logic and applies the same simple operation to every element, so it is a good candidate for auto-vectorization by the compiler.
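To experiment with this loop outside the Opus tree, it can be reduced to a small stand-alone function. The sketch below is an assumption-laden reduction, not the Opus source: scale_follower and HALF16_SKETCH are hypothetical names, and HALF16 is taken to mean "multiply by 0.5f", as it would in a floating-point build.

```c
#include <assert.h>

/* Stand-in for the Opus HALF16 macro, assuming a floating-point build.
   The real macro is defined inside Opus and differs in fixed-point builds. */
#define HALF16_SKETCH(x) (0.5f * (x))

/* Hypothetical helper that mirrors the loop at celt_encoder.c:1106. */
void scale_follower(float *follower, int start, int end)
{
    assert(start <= end);
    for (int i = start; i < end; i++) {
        if (i < 8)
            follower[i] *= 2;                         /* low bands are doubled */
        if (i >= 12)
            follower[i] = HALF16_SKETCH(follower[i]); /* high bands are halved */
    }
}
```

Compiling a file like this with the same -O3 -march=armv8-a+sve2 flags used in this post, plus -fopt-info-vec, should show whether the compiler treats the reduced loop the same way as the original.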

And as we expected, disassembling celt_encoder.o shows many SVE-specific whilelo instructions.

$ objdump -d celt_encoder.o | grep whilelo
     174:       25a30fe0        whilelo p0.s, wzr, w3
     198:       25a30c00        whilelo p0.s, w0, w3
     1e8:       25b40fe0        whilelo p0.s, wzr, w20
     200:       25b40c00        whilelo p0.s, w0, w20
     418:       25bc0fe0        whilelo p0.s, wzr, w28
     430:       25bc0c00        whilelo p0.s, w0, w28
     498:       25bc0fe0        whilelo p0.s, wzr, w28
     4b0:       25bc0c20        whilelo p0.s, w1, w28
   # ...    
    57ac:       25a10c00        whilelo p0.s, w0, w1
    5844:       25a10fe0        whilelo p0.s, wzr, w1
    585c:       25a10c00        whilelo p0.s, w0, w1
    5ae0:       25a10fe0        whilelo p0.s, wzr, w1
    5b00:       25a10c00        whilelo p0.s, w0, w1
    5ea8:       25a10fe0        whilelo p0.s, wzr, w1
    5ebc:       25a10c00        whilelo p0.s, w0, w1
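The whilelo instructions in this listing tend to come in pairs: one that starts from wzr (zero) and one that starts from the running loop index. A scalar model of what such a pair does may help when reading the disassembly. The sketch below is only a model in plain C; LANES, whilelo_model and halve_predicated are hypothetical names, and the real number of lanes is decided by the hardware vector length, not by a constant.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* assumed number of 32-bit lanes, for illustration only */

/* Scalar model of `whilelo pN.s, wA, wB`: lane k is active while
   base + k < limit (unsigned). Returns the number of active lanes. */
int whilelo_model(uint32_t base, uint32_t limit, int pred[LANES])
{
    int active = 0;
    assert(pred != 0);
    for (int k = 0; k < LANES; k++) {
        /* 64-bit sum so that base + k cannot wrap around */
        pred[k] = ((uint64_t)base + (uint64_t)k) < limit;
        active += pred[k];
    }
    return active;
}

/* Predicated form of: for (i = 0; i < n; i++) a[i] *= 0.5f; */
void halve_predicated(float *a, uint32_t n)
{
    int pred[LANES];
    uint32_t i = 0;
    int active = whilelo_model(0, n, pred);      /* like: whilelo p0.s, wzr, wN */
    while (active > 0) {
        for (int k = 0; k < LANES; k++)
            if (pred[k])
                a[i + k] *= 0.5f;                /* inactive lanes stay untouched */
        i += LANES;                              /* advance by one vector */
        active = whilelo_model(i, n, pred);      /* like: whilelo p0.s, wI, wN */
    }
}
```

The point of the model is that the last, partial group of elements is handled by the predicate instead of a separate scalar tail loop.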

But this only shows that celt_encoder.o contains SVE instructions somewhere. How can we know whether the code we are interested in uses them?

Let's look at this from a different angle: how does the compiler decide whether a piece of code is suitable for auto-vectorization? To find out, we can pass an additional option in CFLAGS when we run the configure script.

$ ./configure CFLAGS="-g -O3 -fopt-info-vec-all -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes" 

$ make -j24 |& tee make.log

-fopt-info adds optimization remarks to the compiler output. We specifically ask for all vectorization information by using -fopt-info-vec-all. When we compile the package again using make, these remarks tell us why the compiler did (or did not) vectorize each loop.

Once we run the make command as above, the make.log file contains all the information we want.

$ ll make.log
-rw-r--r--. 1 swji1 swji1 2831714 Apr 22 14:21 make.log

Let's narrow the result down to the messages for celt/celt_encoder.c as follows:

$ grep "celt/celt_encoder" make.log
celt/celt_encoder.c:1810:22: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1778:40: missed: couldn't vectorize loop
celt/celt_encoder.c:1778:40: missed: not vectorized: number of iterations cannot be computed.
celt/celt_encoder.c:1756:17: missed: couldn't vectorize loop
celt/celt_encoder.c:1761:20: missed: not vectorized: complicated access pattern.
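To make the "number of iterations cannot be computed" message more concrete, here is a minimal sketch of two loops that differ only in how they terminate. It is not taken from Opus; scale_all and scale_until_sentinel are hypothetical names, and the exact wording of the compiler remark may vary between GCC versions.

```c
#include <assert.h>
#include <stddef.h>

/* Countable loop: the trip count n is known before the loop starts,
   so the vectorizer can plan how many vector iterations it needs. */
void scale_all(float *x, size_t n, float g)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] *= g;
}

/* Sentinel-terminated loop: it stops at the first 0.0f it reads, so the
   trip count depends on the data and cannot be computed up front.
   Returns the number of elements that were scaled. */
size_t scale_until_sentinel(float *x, float g)
{
    size_t i = 0;
    assert(x != NULL);
    while (x[i] != 0.0f) {
        x[i] *= g;
        i++;
    }
    return i;
}
```

A loop of the first kind is the usual candidate for an "optimized: loop vectorized" remark, while a loop of the second kind is one common source of the "missed" remark about the iteration count.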

The log shows which loops were vectorized and which were not. Let's check whether the loop at line 1106 that we examined earlier is vectorized as well.

$ grep "celt/celt_encoder.c:1106" make.log
celt/celt_encoder.c:1106:21: celt/pitch.h:143:14: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1106:21: optimized: loop vectorized using variable length vectors

As we expected, the loop is vectorized by the compiler.

Now we may wonder which code the compiler could not auto-vectorize, and why. Let's take a look at one example.

celt/celt_encoder.c:1922:39: missed: not vectorized: complicated access pattern.

// celt_encoder.c
// ...
 do {
1915       for (i=start;i<end;i++)
1916       {
1917          /* When the energy is stable, slightly bias energy quantization towards
1918             the previous error to make the gain more stable (a constant offset is
1919             better than fluctuations). */
1920          if (ABS32(SUB32(bandLogE[i+c*nbEBands], oldBandE[i+c*nbEBands])) < QCONST16(2.f, DB_SHIFT))
1921          {
1922             bandLogE[i+c*nbEBands] -= MULT16_16_Q15(energyError[i+c*nbEBands], QCONST16(0.25f, 15));
1923          }
1924       }
1925    } while (++c < C);
// ...

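Here the store on line 1922 happens only when the if condition holds, and every array access is indexed by i+c*nbEBands, which depends on the counter c of the enclosing do-while loop. The message indicates that the compiler could not reduce these memory accesses to a simple pattern that it knows how to vectorize, so it left the loop as scalar code.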
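One way to experiment with this loop is to rewrite it so that the per-channel offset is computed once and the conditional store becomes an unconditional store of a selected value. The sketch below assumes a floating-point build, where the Opus macros reduce to plain float arithmetic, and assumes that the three arrays never overlap and that C is at least 1. The function names are hypothetical, and whether GCC actually vectorizes the rewritten form has not been verified here; it would have to be checked in the -fopt-info-vec-all output.

```c
#include <assert.h>

static float abs_sketch(float x) { return x < 0.0f ? -x : x; }

/* Same structure as the loop at celt_encoder.c:1915, with the Opus
   macros replaced by their assumed floating-point meaning. */
void bias_energy_original(float *bandLogE, const float *oldBandE,
                          const float *energyError,
                          int start, int end, int C, int nbEBands)
{
    int c = 0;
    assert(C >= 1);
    do {
        for (int i = start; i < end; i++) {
            if (abs_sketch(bandLogE[i + c*nbEBands] - oldBandE[i + c*nbEBands]) < 2.f)
                bandLogE[i + c*nbEBands] -= energyError[i + c*nbEBands] * 0.25f;
        }
    } while (++c < C);
}

/* Hypothetical rewrite: base pointers are hoisted out of the inner loop
   and marked restrict (assumes no overlap), and the store is unconditional. */
void bias_energy_rewritten(float *bandLogE, const float *oldBandE,
                           const float *energyError,
                           int start, int end, int C, int nbEBands)
{
    assert(C >= 1);
    for (int c = 0; c < C; c++) {
        float *restrict log_c = bandLogE + c*nbEBands;
        const float *restrict old_c = oldBandE + c*nbEBands;
        const float *restrict err_c = energyError + c*nbEBands;
        for (int i = start; i < end; i++) {
            float v = log_c[i];
            float biased = v - err_c[i] * 0.25f;
            /* select instead of branch: store back v when the energy moved */
            log_c[i] = (abs_sketch(v - old_c[i]) < 2.f) ? biased : v;
        }
    }
}
```

Both functions are meant to produce the same values, so the rewritten one can be checked against the original before looking at what the compiler makes of it.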

Performance Prediction

Unfortunately, we cannot benchmark the package at the moment because we lack hardware that supports SVE2. However, we do know that SVE2 can potentially improve performance, because it speeds up loops that process large data sets such as audio and video. As a rough proxy, we will assume that the more SVE2 instructions the library contains, the more it stands to gain, keeping in mind that an instruction count alone does not prove a speedup.

Before we begin, we also need to consider that the opus package contains multiple unit tests, which would inflate the total. Thus, we have to be careful to exclude them.
Let's count the total number of optimizations reported by the compiler.

$ grep -v "test" make.log | grep "optimized" -c
632

The compiler reported 632 "optimized" messages outside the tests, which is a significant amount of auto-vectorized code. Let's see how many SVE-specific whilelo instructions and how many lines that use SVE registers (predicate registers and scalable vector registers) appear in the compiled Opus shared library, libopus.

$ objdump -d libopus.so.0.8.0 | grep whilelo -c
671
$ objdump -d libopus.so.0.8.0 | grep whilelo
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
# ...
   47b38:       25a50fe0        whilelo p0.s, wzr, w5
   47b3c:       25a80c23        whilelo p3.s, w1, w8
   47b4c:       25aa0c24        whilelo p4.s, w1, w10
   47b54:       25250c26        whilelo p6.b, w1, w5
   47b5c:       25a60c22        whilelo p2.s, w1, w6
   47b68:       25a50c25        whilelo p5.s, w1, w5
   47b98:       25a50c20        whilelo p0.s, w1, w5

$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]" -c
5274
$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]"
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2ef8:       04a34801        index   z1.s, #0, w3
    2f0c:       25814420        mov     p0.b, p1.b
    2f18:       856140a0        ld1w    {z0.s}, p0/z, [x5, z1.s, sxtw #2]
    2f1c:       e54340c0        st1w    {z0.s}, p0, [x6, x3, lsl #2]
    2f28:       25a40c60        whilelo p0.s, w3, w4
# ...
   47f48:       6594a000        scvtf   z0.s, p0/m, z0.s
   47f4c:       25886100        mov     p0.b, p8.b
   47f50:       e544e4a2        st1w    {z2.s}, p1, [x5, #4, mul vl]
   47f54:       e546e0a1        st1w    {z1.s}, p0, [x5, #6, mul vl]
   47f58:       25896520        mov     p0.b, p9.b
   47f5c:       e547e0a0        st1w    {z0.s}, p0, [x5, #7, mul vl]

As we can see, auto-vectorization generated a substantial amount of SVE-specific code. Therefore, we can expect that the Opus library may benefit from it in overall performance.

Things that Can Further Improve the Performance

We already know that the compiler auto-vectorizes a large portion of the code. But we have to admit there is a limit to this method. As we found above, the compiler cannot auto-vectorize some loops. However, this does not mean they cannot be vectorized at all. In some cases, SVE2 code could be generated if the loop were written differently. For example, as this article suggested, we may use restrict qualifiers to inform the compiler that the arrays do not overlap.
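As a minimal sketch of the restrict idea (not code from Opus; both function names are hypothetical), the two functions below differ only in the qualifier. Without restrict, the compiler has to allow for dst and src pointing into the same memory, which typically means an extra run-time overlap check or no vectorization at all. With restrict, the caller promises that the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* dst and src may alias as far as the compiler can tell. */
void add_scaled(float *dst, const float *src, float g, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += g * src[i];
}

/* restrict is a promise by the caller that dst and src do not overlap.
   Calling this with overlapping arrays is undefined behavior. */
void add_scaled_restrict(float *restrict dst, const float *restrict src,
                         float g, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += g * src[i];
}
```

For non-overlapping inputs both versions compute the same result, so the qualifier only changes what the compiler is allowed to assume.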

Original and SVE2 Implementation Comparison

Now we know that auto-vectorization successfully added SVE2 code. However, this is meaningless if the SVE2 build of the library does not produce the same results as the original library. Let's examine whether the new build works as well as the original one.

# original file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1498808 Apr 13 20:16 libopus.so.0.8.0

# SVE2 implemented file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1684704 Apr 22 14:21 libopus.so.0.8.0

The SVE2 build is slightly larger (by about 0.18 MiB, roughly 12%), which is not a significant change.

Let's run the unit tests provided by the package authors. As we know from the previous post, we have to execute them through the qemu-aarch64 command so that the SVE2 instructions are emulated. But, unlike in the previous post, we will run several unit tests to see whether the SVE2 code works correctly.

$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically
Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)
# ...

Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.

$ ./test_opus_decode
Testing libopus 1.3.1-107-gccaaffa9-dirty decoder. Random seed: 2918850151 (76BD)
  Starting 10 decoders...
    opus_decoder_create(48000,1) OK. Copy OK.
    opus_decoder_create(48000,2) OK. Copy OK.
    opus_decoder_create(24000,1) OK. Copy OK.
    opus_decoder_create(24000,2) OK. Copy OK.
    opus_decoder_create(16000,1) OK. Copy OK.
    opus_decoder_create(16000,2) OK. Copy OK.
    opus_decoder_create(12000,1) OK. Copy OK.
    opus_decoder_create(12000,2) OK. Copy OK.
    opus_decoder_create( 8000,1) OK. Copy OK.
    opus_decoder_create( 8000,2) OK. Copy OK.
  dec[all] initial frame PLC OK.
  dec[all] all 2-byte prefix for length 3 and PLC, all modes (64) OK.
  dec[  5] all 3-byte prefix for length 4, mode 28 OK.
  dec[  0] all 3-byte prefix for length 4, mode  4 OK.
  dec[all] random packets, all modes (64), every 8th size from from 7 bytes to maximum OK.
  dec[all] random packets, all mode pairs (4096), 145 bytes/frame OK.
  dec[  3] random packets, all mode pairs (4096)*10, 81 bytes/frame OK.
  dec[  0] pre-selected random packets OK.
  Decoders stopped.
  Testing opus_pcm_soft_clip... OK.

$ ./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 2953257216 (421F)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode     LP FB encode  VBR,   9119 bps OK.
    Mode     LP FB encode  VBR,  13234 bps OK.
    Mode     LP FB encode  VBR,  64668 bps OK.
    Mode Hybrid FB encode  VBR,  28306 bps OK.
    Mode Hybrid FB encode  VBR,  54852 bps OK.
    Mode Hybrid FB encode  VBR,  55130 bps OK.
    Mode Hybrid FB encode  VBR,  96362 bps OK.
    Mode   MDCT FB encode  VBR, 893620 bps OK.
    Mode   MDCT FB encode  VBR,  25608 bps OK.
    Mode   MDCT FB encode  VBR,  29011 bps OK.
    Mode   MDCT FB encode  VBR,  93628 bps OK.
    Mode   MDCT FB encode  VBR,  93328 bps OK.
    Mode   MDCT FB encode  VBR, 160982 bps OK.
# ...
    Mode     LP NB dual-mono MS encode  CBR,  21883 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  60566 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  76774 bps OK.
    Mode     LP NB dual-mono MS encode  CBR, 167879 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   6953 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12756 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  60193 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14915 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  16946 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  34028 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  86938 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 172977 bps OK.
    All framesize pairs switching encode, 9683 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.

As we can see, the SVE2 build passes all of these unit tests, which indicates that it works as well as the original build.
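A stricter check than the unit tests would be to process the same input with both builds and compare the resulting files. The sketch below only shows the comparison step, under the assumption that two such output files already exist; compare_files is a hypothetical helper, and no claim is made here that the two builds must produce byte-identical output.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 0 if both files have identical contents, 1 if they differ,
   and -1 if either file cannot be opened. */
int compare_files(const char *path_a, const char *path_b)
{
    assert(path_a != NULL && path_b != NULL);
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = 0;
    if (fa == NULL || fb == NULL) {
        result = -1;
    } else {
        int ca, cb;
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
            if (ca != cb) {      /* also catches files of different length */
                result = 1;
                break;
            }
        } while (ca != EOF);
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

A result of 1 would not by itself prove a bug, but it would point to the exact build difference worth investigating.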

Conclusion

In this post, we found that the compiler successfully vectorized the code, and that performance may improve significantly given the substantial number of SVE-specific instructions and registers in the library. We also checked that the SVE2 build does not break the program and runs as well as the original. These findings suggest that users of the opus package may benefit greatly from the vectorized code once SVE2 hardware becomes publicly available.

Implementing SVE2 for Opus Codec Library Part 3: Auto-vectorization

Seung Woo (Paul) Ji — Wed, 13 Apr 2022 22:53:11 +0000

Introduction

Previously, we tried to implement SVE2 by reusing the existing code written with NEON intrinsics. Unfortunately, the result was not very fruitful. In this post, we will instead use auto-vectorization to add SVE2 instructions.

Before We Start

As we know, the Opus package already supports NEON intrinsics. When we run ./configure, we can see that the script automatically detects that the processor supports ARM NEON intrinsics optimizations.

$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64)
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes

However, the NEON intrinsics may conflict with the auto-vectorization done by the compiler. For this reason, we first have to find a way to disable this feature.

Thankfully, we can easily achieve this by editing the configure.ac file. When we search for the neon keyword, we find the following code:

AS_IF([test x"$enable_intrinsics" = x"yes"],[
   intrinsics_support=""
   AS_CASE([$host_cpu],
   [arm*|aarch64*],
   [
      cpu_arm=yes
      OPUS_CHECK_INTRINSICS(
         [ARM Neon],
         [$ARM_NEON_INTR_CFLAGS],
         [OPUS_ARM_MAY_HAVE_NEON_INTR],
         [OPUS_ARM_PRESUME_NEON_INTR],
         [[#include <arm_neon.h>
         ]],
         [[
            static float32x4_t A0, A1, SUMM;
            SUMM = vmlaq_f32(SUMM, A0, A1);
            return (int)vgetq_lane_f32(SUMM, 0);
         ]]
      )

The code checks whether enable_intrinsics is set to yes. By changing x"yes" to x"", the test no longer matches and the intrinsics configuration is skipped. When we rerun the autogen.sh and configure scripts, we can confirm that the intrinsics optimizations are disabled.

$ ./autogen.sh
$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... no
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes

Auto-vectorization Implementation

Before we begin, we need to check which compiler flags are being used by the package. To do this, we can look at the Makefile.

# Makefile
CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

In order to enable auto-vectorization, we need to edit the CFLAGS. However, this is not an easy job because there are multiple Makefiles throughout the package. That is, we would have to edit (possibly) all of the following files.

$ find . -name Makefile
./doc/Makefile
./doc/latex/Makefile
./Makefile
./celt/dump_modes/Makefile

Fortunately, we can avoid this problem by passing CFLAGS to the configure script, which then writes them into the Makefiles it generates. We take the existing CFLAGS and modify them as follows:

$ ./configure CFLAGS="-g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes"

We change the optimization level to -O3 to enable auto-vectorization and set the target architecture to Armv8-A with the SVE2 extension. Once we run it, we can check that the CFLAGS are updated successfully in the Makefile.

# Makefile
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
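Besides reading the Makefile, the flags can be confirmed from inside a C file, because the compiler predefines feature macros for the selected target. The sketch below relies on the __ARM_FEATURE_SVE2, __ARM_FEATURE_SVE and __ARM_NEON macros from the Arm C Language Extensions; built_with_sve2 and describe_target are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* 1 if this translation unit was compiled with SVE2 enabled, else 0. */
int built_with_sve2(void)
{
#if defined(__ARM_FEATURE_SVE2)
    return 1;
#else
    return 0;
#endif
}

/* Writes a short summary of the compile-time target features into buf. */
const char *describe_target(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
#if defined(__ARM_FEATURE_SVE)
    const char *sve = "yes";
#else
    const char *sve = "no";
#endif
#if defined(__ARM_NEON)
    const char *neon = "yes";
#else
    const char *neon = "no";
#endif
    snprintf(buf, size, "SVE2: %s, SVE: %s, NEON: %s",
             built_with_sve2() ? "yes" : "no", sve, neon);
    return buf;
}
```

When built with -march=armv8-a+sve2, the summary is expected to report SVE2 as enabled; on any other target it simply reports "no", so the file still compiles everywhere.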

Let's try to run the unit tests to see whether the auto-vectorization kicked in during the compilation.

$ make check

# ...

./test-driver: line 107: 1431013 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 1431028 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_api

# ...

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  10
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to opus@xiph.org
============================================================================
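The "Illegal instruction" crashes in this log are what happens when a CPU meets an instruction it does not implement. On Linux, a program can ask the kernel in advance whether SVE2 is available. The sketch below uses getauxval with AT_HWCAP2 and the HWCAP2_SVE2 bit; cpu_has_sve2 and sve2_status_text are hypothetical helper names, and on platforms other than AArch64 Linux the function simply reports that it cannot tell.

```c
#include <assert.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>      /* getauxval, AT_HWCAP2 */
#ifndef HWCAP2_SVE2
#include <asm/hwcap.h>     /* HWCAP2_SVE2 on kernels that know about SVE2 */
#endif
#endif

/* Returns 1 if the kernel reports SVE2 support, 0 if it does not, and
   -1 if this platform or these headers cannot answer the question. */
int cpu_has_sve2(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_SVE2)
    return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) ? 1 : 0;
#else
    return -1;
#endif
}

/* Human-readable form of the value returned by cpu_has_sve2(). */
const char *sve2_status_text(int status)
{
    assert(status >= -1 && status <= 1);
    if (status == 1)
        return "SVE2 available";
    if (status == 0)
        return "SVE2 not available";
    return "unknown on this platform";
}
```

A check like this could be used to skip or redirect SVE2-only binaries instead of letting them crash.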

These failures are expected: the binaries contain SVE2 instructions, and the existing hardware does not support them yet. To solve this, we need to run the binaries under emulation with the qemu-aarch64 command. Let's try it with one of the unit tests that failed, test_opus_api.

$ qemu-aarch64 test_opus_api
Error while loading test_opus_api: Exec format error

Interestingly, qemu-aarch64 cannot run the file because its format is not a valid executable. Let's take a look at the content of the test file.

$ vi test_opus_api
#! /bin/sh

# tests/test_opus_api - temporary wrapper script for .libs/test_opus_api
# Generated by libtool (GNU libtool) 2.4.6
#
# The tests/test_opus_api program cannot be directly executed until all the libtool
# libraries that it depends on are installed.
#
# This wrapper script should never be moved out of the build directory.
# If it is, it will not operate correctly.

# Sed substitution that helps us do robust quoting.  It backslashifies
# metacharacters that are still active within double-quoted strings.

The test file is actually a libtool wrapper script rather than an AArch64 executable, and that is why qemu-aarch64 cannot run it. We can easily solve this by inserting the qemu-aarch64 command where the script launches the actual program.

Let's look for the place where the script starts the program and add the qemu-aarch64 command:

# Core function for launching the target application
func_exec_program_core ()
{

      if test -n "$lt_option_debug"; then
        $ECHO "test_opus_api:tests/test_opus_api:$LINENO: newargv[0]: $progdir/$program" 1>&2
        func_lt_dump_args ${1+"$@"} 1>&2
      fi
      exec qemu-aarch64 "$progdir/$program" ${1+"$@"}

      $ECHO "$0: cannot exec $program $*" 1>&2
      exit 1
}

When we rerun the unit test, we can see that all tests pass without any problems.

$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically

  Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)

  Multistream decoder basic API tests
  ---------------------------------------------------
    opus_multistream_decoder_get_size(-1,-1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 0)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 2)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 0,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 0)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 1,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 0)=18504 OK.
    opus_multistream_decoder_get_size( 1, 1)=27272 OK.
    opus_multistream_decoder_get_size( 1, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 2,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 2, 0)=36736 OK.
    opus_multistream_decoder_get_size( 2, 1)=45504 OK.
    opus_multistream_decoder_get_size( 2, 2)=54272 OK.
    opus_multistream_decoder_get_size( 2, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 3,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 3, 0)=54968 OK.
    opus_multistream_decoder_get_size( 3, 1)=63736 OK.
    opus_multistream_decoder_get_size( 3, 2)=72504 OK.
    opus_multistream_decoder_get_size( 3, 3)=81272 OK.
    opus_multistream_decoder_create() ............ OK.
    opus_multistream_decoder_init() .............. OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_MULTISTREAM_GET_DECODER_STATE ........... OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_multistream_decode() .................... OK.
    opus_multistream_decode_float() .............. OK.
       All multistream decoder interface tests passed
                             (576106 API invocations)

  Packet header parsing tests
  ---------------------------------------------------
    code 0 (65 cases) ............................ OK.
    code 1 (163456 cases) ........................ OK.
    code 2 (326528 cases) ........................ OK.
    code 3 m-truncation (64 cases) ............... OK.
    code 3 m=0,49-64 (4096 cases) ................ OK.
    code 3 m=1 CBR (81728 cases) ................. OK.
    code 3 m=1-48 CBR (103544448 cases) .......... OK.
    code 3 m=1-48 VBR (120832 cases) ............. OK.
    code 3 padding (1519448 cases) ............... OK.
    opus_packet_parse ............................ OK.
                      All packet parsing tests passed
                          (105760666 API invocations)

  Encoder basic API tests
  ---------------------------------------------------
    opus_encoder_get_size(0)=0 ................... OK.
    opus_encoder_get_size(1)=43572 ............... OK.
    opus_encoder_get_size(2)=48484 ............... OK.
    opus_encoder_get_size(3)=0 ................... OK.
    opus_encoder_create() ........................ OK.
    opus_encoder_init() .......................... OK.
    OPUS_GET_LOOKAHEAD ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_SET_APPLICATION ......................... OK.
    OPUS_GET_APPLICATION ......................... OK.
    OPUS_SET_BITRATE ............................. OK.
    OPUS_GET_BITRATE ............................. OK.
    OPUS_SET_FORCE_CHANNELS ...................... OK.
    OPUS_GET_FORCE_CHANNELS ...................... OK.
    OPUS_SET_BANDWIDTH ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_SET_MAX_BANDWIDTH ....................... OK.
    OPUS_GET_MAX_BANDWIDTH ....................... OK.
    OPUS_SET_DTX ................................. OK.
    OPUS_GET_DTX ................................. OK.
    OPUS_SET_COMPLEXITY .......................... OK.
    OPUS_GET_COMPLEXITY .......................... OK.
    OPUS_SET_INBAND_FEC .......................... OK.
    OPUS_GET_INBAND_FEC .......................... OK.
    OPUS_SET_PACKET_LOSS_PERC .................... OK.
    OPUS_GET_PACKET_LOSS_PERC .................... OK.
    OPUS_SET_VBR ................................. OK.
    OPUS_GET_VBR ................................. OK.
    OPUS_SET_VBR_CONSTRAINT ...................... OK.
    OPUS_GET_VBR_CONSTRAINT ...................... OK.
    OPUS_SET_SIGNAL .............................. OK.
    OPUS_GET_SIGNAL .............................. OK.
    OPUS_SET_LSB_DEPTH ........................... OK.
    OPUS_GET_LSB_DEPTH ........................... OK.
    OPUS_SET_PREDICTION_DISABLED ................. OK.
    OPUS_GET_PREDICTION_DISABLED ................. OK.
    OPUS_SET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_encode() ................................ OK.
    opus_encode_float() .......................... OK.
                   All encoder interface tests passed
                             (1152209 API invocations)

  Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.

Now we know that the build with SVE2 enabled still passes the API tests. Let's double-check that vector code was actually generated by looking for an instruction that only exists in SVE and SVE2 code within the binary files. The following command lists every executable file together with the whilelo instructions it contains.

find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done

#...
======= ./.libs/libopus.so.0.8.0
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
    3784:       25b80fe0        whilelo p0.s, wzr, w24
    37a8:       25b80c00        whilelo p0.s, w0, w24
    39cc:       25b80fe0        whilelo p0.s, wzr, w24
    39f0:       25b80c00        whilelo p0.s, w0, w24
    3a5c:       25b80fe0        whilelo p0.s, wzr, w24
    3a80:       25b80c00        whilelo p0.s, w0, w24
    4488:       25b50fe2        whilelo p2.s, wzr, w21
    44fc:       25b50c00        whilelo p0.s, w0, w21
    4590:       25b50fe2        whilelo p2.s, wzr, w21
    45f8:       25b50c00        whilelo p0.s, w0, w21
    4780:       25a40fe2        whilelo p2.s, wzr, w4
    47e4:       25a40c00        whilelo p0.s, w0, w4
    4940:       25b80fe0        whilelo p0.s, wzr, w24
    4954:       25b80c00        whilelo p0.s, w0, w24
    4a2c:       25a40fe1        whilelo p1.s, wzr, w4
    4a50:       25a40c00        whilelo p0.s, w0, w4
    4b34:       25a40fe1        whilelo p1.s, wzr, w4
    4b68:       25a40c60        whilelo p0.s, w3, w4
    4e34:       25b30fe0        whilelo p0.s, wzr, w19
    4e54:       25b30c60        whilelo p0.s, w3, w19
    4fe4:       25b30fe0        whilelo p0.s, wzr, w19
    5000:       25b30c20        whilelo p0.s, w1, w19
    51d8:       25b30fe0        whilelo p0.s, wzr, w19
    5208:       25b30c00        whilelo p0.s, w0, w19
    54a0:       25a10fe0        whilelo p0.s, wzr, w1
    54b8:       25a10c00        whilelo p0.s, w0, w1
    55c8:       25a11fe0        whilelo p0.s, xzr, x1
    55e8:       25a11c00        whilelo p0.s, x0, x1
    5724:       25a11fe0        whilelo p0.s, xzr, x1
    5738:       25a11c00        whilelo p0.s, x0, x1
    5c2c:       25a80fe1        whilelo p1.s, wzr, w8
    5c7c:       25a80d21        whilelo p1.s, w9, w8
    5ec4:       25a50fe2        whilelo p2.s, wzr, w5
    5f30:       25a50c00        whilelo p0.s, w0, w5
    64dc:       25a10fe0        whilelo p0.s, wzr, w1
    6500:       25a10c00        whilelo p0.s, w0, w1
    6818:       25a11fe0        whilelo p0.s, xzr, x1
    6830:       25a11c00        whilelo p0.s, x0, x1
#...

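When we count the lines that this command prints, we get a total of 2903. Note that this count also includes one ======= header line per executable file, so the number of whilelo instructions themselves is slightly lower.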

$ find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2> /dev/null | grep whilelo ; done | wc -l
2903
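
To make the role of whilelo more concrete, here is a minimal scalar sketch in C of what the instruction computes and how it drives a loop. The names whilelo_model and add_arrays are hypothetical, and the lane count is passed in as a parameter, whereas real hardware derives it from its vector length. It is a model for illustration under those assumptions, not the code the compiler generated for Opus.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of `whilelo pN.s, start, limit`:
 * lane k of the predicate is active while (start + k) < limit,
 * compared as unsigned values. `lanes` stands in for the number of
 * 32-bit elements per vector, which real hardware determines. */
static size_t whilelo_model(uint32_t start, uint32_t limit,
                            uint8_t *pred, size_t lanes)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; k++) {
        /* Widen to 64 bits so that start + k cannot wrap around. */
        pred[k] = ((uint64_t)start + k) < (uint64_t)limit;
        active += pred[k];
    }
    return active;
}

/* Predicated loop in the shape seen in the listing: one whilelo before
 * the loop (start = 0, i.e. wzr) and one at the end of each iteration
 * (start = the loop counter). Inactive lanes do nothing, so no separate
 * scalar tail loop is needed. */
static void add_arrays(const int32_t *a, const int32_t *b, int32_t *out,
                       uint32_t n, size_t lanes)
{
    uint8_t pred[64];   /* assumes lanes <= 64 (2048-bit vectors) */
    uint32_t i = 0;
    size_t active = whilelo_model(i, n, pred, lanes);

    while (active > 0) {
        for (size_t k = 0; k < lanes; k++)
            if (pred[k])
                out[i + k] = a[i + k] + b[i + k];
        i += (uint32_t)lanes;
        active = whilelo_model(i, n, pred, lanes);
    }
}
```

With 10 elements and 4 lanes, the loop runs three times, and in the last iteration only the first two lanes are active. This is why the whilelo instructions in the listing come in pairs.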

Conclusion

In this post, we analyzed the SVE2 code that the compiler's auto-vectorization added to Opus. Using the existing unit tests, we confirmed that the auto-vectorized build still works correctly. We also found that the compiler added a significant number of whilelo instructions, which appear only in SVE and SVE2 code (2903 lines of output, including the per-file header lines). This suggests that the Opus project may benefit from an SVE2 implementation, although the actual performance gain still has to be measured.

Implementing SVE2 for Opus Codec Library Part 2: Compiler Intrinsics

Seung Woo (Paul) Ji — Wed, 13 Apr 2022 21:30:08 +0000

Introduction

In the last post, we explored how to compile and test the package. From now on, we will explore how we can add an SVE2 implementation to it.

Finding Candidates

As we explored before, Opus contains a number of files that utilize compiler intrinsics for SIMD implementation.

$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c

Among these, we need to find a file with loops. Let's take a look at the celt_neon_intr.c file.

void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len)
{
   int j;
   int32x4_t a = vld1q_s32(sum);
   /* Load y[0...3] */
   /* This requires len>0 to always be valid (which we assert in the C code). */
   int16x4_t y0 = vld1_s16(y);
   y += 4;

   for (j = 0; j + 8 <= len; j += 8)
   {
      /* Load x[0...7] */
      int16x8_t xx = vld1q_s16(x);
      int16x4_t x0 = vget_low_s16(xx);
      int16x4_t x4 = vget_high_s16(xx);
      /* Load y[4...11] */
      int16x8_t yy = vld1q_s16(y);
      int16x4_t y4 = vget_low_s16(yy);
      int16x4_t y8 = vget_high_s16(yy);
      int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
      int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);

      int16x4_t y1 = vext_s16(y0, y4, 1);
      int16x4_t y5 = vext_s16(y4, y8, 1);
      int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1);
      int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1);

      int16x4_t y2 = vext_s16(y0, y4, 2);
      int16x4_t y6 = vext_s16(y4, y8, 2);
      int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2);
      int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2);

      int16x4_t y3 = vext_s16(y0, y4, 3);
      int16x4_t y7 = vext_s16(y4, y8, 3);
      int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3);
      int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3);

      y0 = y8;
      a = a7;
      x += 8;
      y += 8;
   }

   for (; j < len; j++)
   {
      int16x4_t x0 = vld1_dup_s16(x);  /* load next x */
      int32x4_t a0 = vmlal_s16(a, y0, x0);

      int16x4_t y4 = vld1_dup_s16(y);  /* load next y */
      y0 = vext_s16(y0, y4, 1);
      a = a0;
      x++;
      y++;
   }

   vst1q_s32(sum, a);
}

This function uses multiple NEON intrinsics inside its for loops, which is exactly what we are looking for. Before we start implementing SVE2, we need to understand the code thoroughly. Let's walk through it step by step.

We can see the function takes in three arrays: x, y, and sum. The sum array is first loaded into a vector register with 4 lanes of 32 bits each. Since this code uses NEON to implement SIMD, the vector register is limited to 128 bits in total.

Then, the first four elements of the y array (i.e. y[0...3]) are loaded into a vector with 4 lanes of 16 bits each.

In the for loop, the x array is loaded into a register. The vector first contains 8 lanes of 16 bits. These, in turn, are divided into two groups of four lanes, x0 and x4. Together they correspond to the first eight elements in the x array (i.e. x[0...7]).

The code repeats the previous steps for the y array. Since we have already assigned a vector to the first four elements of the array, we start from the fifth element. These vectors therefore correspond to the fifth through the twelfth elements (i.e. y[4...11]).

To better understand what we have learned, we can make the following diagram:

x*(val16)   0    1    2    3    4    5    6    7
            |      x0      |    |      x4      |   

y*(val16)   0    1    2    3    4    5    6    7    8    9    10    11
            |      y0      |    |      y4      |    |      y8       | 

sum(val32)  0            1            2            3

In the first vmlal_lane_s16, the intrinsic multiplies each lane of y0 by the first lane (0) of x0. The result is then accumulated into the destination vector, where each element is twice as wide as the elements that are multiplied (i.e. 16 bit -> 32 bit). This means we do the following operation on the arrays:

(sum[0], sum[1], sum[2], sum[3]) += x[0] * (y[0], y[1], y[2], y[3])

We repeat the same operation as above but with y4 and x4.

Next, vext_s16 extracts a vector from the y0 and y4 pair. It takes the upper elements of y0, starting from the element at the given index (here 1), and fills the remaining lanes with the lowest elements of y4. This means we get the following vector as a result:

y0    : y[0], y[1], y[2], y[3] // taking the highest vector starting from the index 1.
y4    : y[4], y[5], y[6], y[7] // filling up the result vector by taking the lowest vector
Result: y[1], y[2], y[3], y[4]
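
The extraction above can be sketched in scalar C as follows. This is only an illustrative model of vext_s16 for 4-lane vectors, and the name vext_s16_model is hypothetical. It treats the two inputs as one 8-element sequence and copies four consecutive elements starting at index n.

```c
#include <stdint.h>

/* Hypothetical scalar model of vext_s16(a, b, n) for 4-lane vectors:
 * result lane k is element (n + k) of the 8-element concatenation a:b.
 * n is expected to be in the range 0..3. */
static void vext_s16_model(const int16_t a[4], const int16_t b[4],
                           int n, int16_t out[4])
{
    for (int k = 0; k < 4; k++) {
        int idx = n + k;
        out[k] = (idx < 4) ? a[idx] : b[idx - 4];
    }
}
```

With a = y[0...3], b = y[4...7] and n = 1, the model yields y[1...4], matching the result shown above.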

Afterwards, we do the same steps to keep multiplying and adding the rest of x and y elements.
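
Putting the steps together, the whole kernel boils down to four running sums of products between x and shifted copies of y. A minimal scalar sketch, assuming that opus_val16 and opus_val32 are 16-bit and 32-bit integers in a fixed-point build (the name xcorr_kernel_scalar is hypothetical and this is not the library's own reference code), could look like this:

```c
#include <stdint.h>

/* Scalar sketch of the cross-correlation the NEON kernel computes:
 *   sum[k] += x[j] * y[j + k]   for k = 0..3 and j = 0..len-1.
 * It reads x[0..len-1] and y[0..len+2]. The int16_t/int32_t types
 * stand in for opus_val16/opus_val32 (an assumption for fixed point). */
static void xcorr_kernel_scalar(const int16_t *x, const int16_t *y,
                                int32_t sum[4], int len)
{
    for (int j = 0; j < len; j++)
        for (int k = 0; k < 4; k++)
            sum[k] += (int32_t)x[j] * (int32_t)y[j + k];
}
```

The NEON version handles eight values of j per iteration and keeps the four sums in the lanes of one int32x4_t, which is why it needs vext_s16 to build the shifted y vectors.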

Problem

Unfortunately, the code that we walked through together is not easy to translate into SVE2 instructions. One of the reasons is that several of the NEON intrinsics used here lack direct SVE2 counterparts. This makes sense considering that SVE2 does not fix the length of vector registers, whereas this code depends on vectors of exactly 4 and 8 lanes. In order to solve this, we would have to rewrite the code in such a way that no fixed-size tuples of vector lanes are used.

Conclusion

In this post, we explored and analyzed whether the intrinsic code the package uses is a good starting point for implementing SVE2. Unfortunately, the code is fairly complex and requires more NEON and SVE2 knowledge than we have covered in the previous posts. In the following post, we will look at an alternative method to implement SVE2, namely auto-vectorization.

Implementing SVE2 for Opus Codec Library Part 1: Package Installation

Seung Woo (Paul) Ji — Mon, 11 Apr 2022 01:42:20 +0000

Introduction

Previously, we identified several packages that do not support SVE2 yet. We ultimately decided that the Opus audio codec is the best candidate. In this post, we will explore the package in detail and see how we can implement SVE2 in it.

Before We Start...

When we clone the package, we can see the following files:

$ ls
AUTHORS          cmake           LICENSE_PLEASE_READ.txt  meson.build        opus_sources.mk         silk             update_version
autogen.sh       CMakeLists.txt  m4                       meson_options.txt  opus-uninstalled.pc.in  silk_headers.mk  win32
celt             configure.ac    Makefile.am              NEWS               README                  silk_sources.mk
celt_headers.mk  COPYING         Makefile.mips            opus_headers.mk    README.draft            src
celt_sources.mk  doc             Makefile.unix            opus.m4            releases.sha2           tests
ChangeLog        include         meson                    opus.pc.in         scripts                 training

As we can see, the package contains several Makefile and configure template files. These give us an idea that this package may use the GNU Autotools to generate its Makefile and configure script. To get a clear understanding of how to install this package, we can read the README file.

# README
# ...

1) Clone the repository:

    % git clone https://gitlab.xiph.org/xiph/opus.git
    % cd opus

2) Compiling the source

    % ./autogen.sh
    % ./configure
    % make

3) Install the codec libraries (optional)

    % sudo make install

Once you have compiled the codec, there will be a opus_demo executable
in the top directory.

Usage: opus_demo [-e] <application> <sampling rate (Hz)> <channels (1/2)>
         <bits per second> [options] <input> <output>
       opus_demo -d <sampling rate (Hz)> <channels (1/2)> [options]
         <input> <output>

# ...

Now, let's follow the instructions. Once we run ./autogen.sh, we get the following list of files.

$ ./autogen.sh
$ ls
aclocal.m4       CMakeLists.txt  doc                      Makefile.mips      opus.pc.in              silk_headers.mk
AUTHORS          compile         include                  Makefile.unix      opus_sources.mk         silk_sources.mk
autogen.sh       config.guess    INSTALL                  meson              opus-uninstalled.pc.in  src
autom4te.cache   config.h.in     install-sh               meson.build        package_version         test-driver
celt             config.sub      LICENSE_PLEASE_READ.txt  meson_options.txt  README                  tests
celt_headers.mk  configure       ltmain.sh                missing            README.draft            training
celt_sources.mk  configure.ac    m4                       NEWS               releases.sha2           update_version
ChangeLog        COPYING         Makefile.am              opus_headers.mk    scripts                 win32
cmake            depcomp         Makefile.in              opus.m4            silk

We have more files now. The notable ones are Makefile.in and the configure script. Makefile.in is generated from Makefile.am but is still missing some values that the configure script will fill in. Now, let's run the configure script.

$ ./configure
$ ls
aclocal.m4       config.guess   include                  Makefile.unix      opus-uninstalled.pc     stamp-h1
AUTHORS          config.h       INSTALL                  meson              opus-uninstalled.pc.in  test-driver
autogen.sh       config.h.in    install-sh               meson.build        package_version         tests
autom4te.cache   config.log     libtool                  meson_options.txt  README                  training
celt             config.status  LICENSE_PLEASE_READ.txt  missing            README.draft            update_version
celt_headers.mk  config.sub     ltmain.sh                NEWS               releases.sha2           win32
celt_sources.mk  configure      m4                       opus_headers.mk    scripts
ChangeLog        configure.ac   Makefile                 opus.m4            silk
cmake            COPYING        Makefile.am              opus.pc            silk_headers.mk
CMakeLists.txt   depcomp        Makefile.in              opus.pc.in         silk_sources.mk
compile          doc            Makefile.mips            opus_sources.mk    src

Not surprisingly, we get a Makefile amongst the newly generated files. If we inspect the Makefile, we can see which CFLAGS it uses to compile the package.

# Makefile
# ...

CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

# ...

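From these flags, we can see that the package does not request auto-vectorization: it is built with -O2 and without -O3 or -ftree-vectorize, and GCC releases before version 12 do not enable the vectorizer at -O2. This means we can also try to add SVE2 code to this package through auto-vectorization.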

Now, let's compile the package. We can let make run several jobs in parallel to speed up compilation. A common rule of thumb is to use twice the number of cores plus one, which would be 33 on our 16-core machine (16 * 2 + 1). Here we use a more modest value of 24, which is still enough to keep every core busy.

$ make -j 24
$ ls
aclocal.m4       config.guess   include                  Makefile.mips      opus.pc                 silk
AUTHORS          config.h       INSTALL                  Makefile.unix      opus.pc.in              silk_headers.mk
autogen.sh       config.h.in    install-sh               meson              opus_sources.mk         silk_sources.mk
autom4te.cache   config.log     libopus.la               meson.build        opus-uninstalled.pc     src
celt             config.status  libtool                  meson_options.txt  opus-uninstalled.pc.in  stamp-h1
celt_headers.mk  config.sub     LICENSE_PLEASE_READ.txt  missing            package_version         test-driver
celt_sources.mk  configure      ltmain.sh                NEWS               README                  tests
ChangeLog        configure.ac   m4                       opus_compare       README.draft            training
cmake            COPYING        Makefile                 opus_demo          releases.sha2           trivial_example
CMakeLists.txt   depcomp        Makefile.am              opus_headers.mk    repacketizer_demo       update_version
compile          doc            Makefile.in              opus.m4            scripts                 win32

As the README file mentioned, we now have an executable file called opus_demo. When we run it, it prints its usage message, which shows that the package was compiled successfully.

$ ./opus_demo
Usage: /home/swji1/opus/.libs/opus_demo [-e] <application> <sampling rate (Hz)> <channels (1/2)> <bits per second>  [options] <input> <output>
       /home/swji1/opus/.libs/opus_demo -d <sampling rate (Hz)> <channels (1/2)> [options] <input> <output>

application: voip | audio | restricted-lowdelay
options:
-e                   : only runs the encoder (output the bit-stream)
-d                   : only runs the decoder (reads the bit-stream as input)
-cbr                 : enable constant bitrate; default: variable bitrate
-cvbr                : enable constrained variable bitrate; default: unconstrained
-delayed-decision    : use look-ahead for speech/music detection (experts only); default: disabled
-bandwidth <NB|MB|WB|SWB|FB> : audio bandwidth (from narrowband to fullband); default: sampling rate
-framesize <2.5|5|10|20|40|60|80|100|120> : frame size in ms; default: 20
-max_payload <bytes> : maximum payload size in bytes, default: 1024
-complexity <comp>   : complexity, 0 (lowest) ... 10 (highest); default: 10
-inbandfec           : enable SILK inband FEC
-forcemono           : force mono encoding, even for stereo input
-dtx                 : enable SILK DTX
-loss <perc>         : simulate packet loss, in percent (0-100); default: 0

Testing

But how do we validate that the binary works as intended? For this, we can refer to the README again.

# README
# ...

== Testing ==

This package includes a collection of automated unit and system tests
which SHOULD be run after compiling the package especially the first
time it is run on a new platform.

To run the integrated tests:

    % make check

# ...

Thankfully, the authors provide a set of unit tests to validate the integrity of the executable file. Using this, we can check if the package is compiled correctly.

$ make check
PASS: celt/tests/test_unit_cwrs32
PASS: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
PASS: celt/tests/test_unit_mdct
PASS: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
PASS: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
PASS: tests/test_opus_projection
============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9
============================================================================
# TOTAL: 14
# PASS:  14
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

Conclusion

In this post, we explored the package and learned how to compile it into executable binaries. We also confirmed that the default build does not use auto-vectorization. So, we may try implementing the vectorization in two ways: compiler intrinsics or auto-vectorization. In the next post, we will see how we can add SVE2 code by using intrinsics.

Implementing SVE2 for Open Source Project

Seung Woo (Paul) Ji — Tue, 29 Mar 2022 03:43:10 +0000

Introduction

In the last post, we explored and implemented Scalable Vector Extension 2 (SVE2) code for the volume adjusting algorithm. Now, we will follow the same process on a much larger scale by actually trying to contribute SVE2 code to an ongoing open source project.

Searching for a package

As we learned before, SVE2 is best suited to workloads that process large amounts of data, such as:

Computer vision
Multimedia
Long-Term Evolution (LTE) baseband processing
Genomics
In-memory database
Web serving
Cryptography
And so on...

And we know the vectorization can be implemented in 3 different ways:

Auto-vectorization
Compiler Intrinsics
Inline Assembler

Since we already have experience with intrinsics, we will try our best to search for packages that already use them.

We also have to consider whether a package supports our machine (Fedora 35 running on the AArch64 architecture), as we have to install the program. For this, we will use Fedora's package manager, DNF, and run the following commands:

$ dnf search search_keyword
$ dnf info package_name

With dnf search, the keyword is searched for in both the name and the description of every package. Once we find the name of a package, we can display its detailed description with dnf info. We also have to be careful to choose only open-source projects.

List of Possible Candidates

With the aforementioned strategy, we can find the following possible candidates:

libjpeg-turbo
SoundTouch
Opus

Let's look at each package together!

libjpeg-turbo

libjpeg-turbo is a JPEG image codec that utilizes SIMD instructions to perform JPEG compression and decompression. When we inspect the package, we can find a list of promising files as follows:

$ find . -name "*neon*"
./jidctfst-neon.c
./jcsample-neon.c
./aarch32/jchuff-neon.c
./aarch32/jsimd_neon.S
./aarch32/jccolext-neon.c
./jfdctfst-neon.c
./neon-compat.h.in
./aarch64/jchuff-neon.c
./aarch64/jsimd_neon.S
./aarch64/jccolext-neon.c
./jidctred-neon.c
./jfdctint-neon.c
./jdmerge-neon.c
./jidctint-neon.c
./jccolor-neon.c
./jdsample-neon.c
./jdcolor-neon.c
./jdmrgext-neon.c
./jcgryext-neon.c
./jcphuff-neon.c
./jcgray-neon.c
./jdcolext-neon.c
./jquanti-neon.c

// jquanti-neon.c
// ...

#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
#pragma unroll
#endif
  for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
    /* Load DCT coefficients. */
    int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);
    int16x8_t row1 = vld1q_s16(workspace + (i + 1) * DCTSIZE);
    int16x8_t row2 = vld1q_s16(workspace + (i + 2) * DCTSIZE);
    int16x8_t row3 = vld1q_s16(workspace + (i + 3) * DCTSIZE);
    /* Load reciprocals of quantization values. */
    uint16x8_t recip0 = vld1q_u16(recip_ptr + (i + 0) * DCTSIZE);
    uint16x8_t recip1 = vld1q_u16(recip_ptr + (i + 1) * DCTSIZE);
    uint16x8_t recip2 = vld1q_u16(recip_ptr + (i + 2) * DCTSIZE);
    uint16x8_t recip3 = vld1q_u16(recip_ptr + (i + 3) * DCTSIZE);
    uint16x8_t corr0 = vld1q_u16(corr_ptr + (i + 0) * DCTSIZE);
    uint16x8_t corr1 = vld1q_u16(corr_ptr + (i + 1) * DCTSIZE);
    uint16x8_t corr2 = vld1q_u16(corr_ptr + (i + 2) * DCTSIZE);
    uint16x8_t corr3 = vld1q_u16(corr_ptr + (i + 3) * DCTSIZE);
    int16x8_t shift0 = vld1q_s16(shift_ptr + (i + 0) * DCTSIZE);
    int16x8_t shift1 = vld1q_s16(shift_ptr + (i + 1) * DCTSIZE);
    int16x8_t shift2 = vld1q_s16(shift_ptr + (i + 2) * DCTSIZE);
    int16x8_t shift3 = vld1q_s16(shift_ptr + (i + 3) * DCTSIZE);

// ...

As we can see, the vld1q_s16 intrinsic is used to load a vector from memory. Furthermore, the package does not yet have an SVE or SVE2 implementation. This makes the project a good candidate for contributing our knowledge of SVE2.

SoundTouch

SoundTouch is an audio-processing library that allows changing the sound tempo, pitch and playback rate parameters. This sounds familiar to us, as we dealt with a simple audio algorithm before, and it may be another good candidate for us.

$ grep -ir neon .
./configure.ac:AC_CHECK_HEADERS([arm_neon.h])
./configure.ac:AC_ARG_ENABLE([neon-optimizations],
./configure.ac:              [AS_HELP_STRING([--enable-neon-optimizations],
./configure.ac:                              [use ARM NEON optimization [default=yes]])],[enable_neon_optimizations="${enableval}"],

# configure.ac 
if test "x$enable_neon_optimizations" = "xyes" -a "x$ac_cv_header_arm_neon_h" = "xyes"; then

        # Check for ARM NEON support
        original_saved_CXXFLAGS=$CXXFLAGS
        have_neon=no
        CXXFLAGS="-mfpu=neon -march=native $CXXFLAGS"

        # Check if can compile neon code using intrinsics, require GCC >= 4.3 for autovectorization.
        AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
        #if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))
        #error "Need GCC >= 4.3 for neon autovectorization"
        #endif
        #include <arm_neon.h>
        int main () {
                int32x4_t t = {1};
                return vaddq_s32(t,t)[0] == 2;
        }]])],[have_neon=yes])
        CXXFLAGS=$original_saved_CXXFLAGS
        if test "x$have_neon" = "xyes" ; then
                echo "****** NEON support enabled ******"
                CPPFLAGS="-mfpu=neon -march=native -mtune=native $CPPFLAGS"
                AC_DEFINE(SOUNDTOUCH_USE_NEON,1,[Use ARM NEON extension])
        fi
fi

The package does not contain any files that have simd or neon in their names. However, it does have a file that contains neon in its content. When we open that file, we can see that this package relies on the compiler's auto-vectorization feature. As we can see, the configure check fails with an error message saying that GCC 4.3 or later is needed for NEON auto-vectorization when it is compiled by an older GCC.

Opus

Opus is an audio codec for interactive speech and audio transmission across the Internet with compression algorithms. It can support a wide range of interactive audio applications such as Voice over IP (VoIP), remote live music performance, and video conferencing. Like the previous package, this may be a good candidate for us.

$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c

// celt_neon_intr.c
#include <arm_neon.h>
#include "../pitch.h"

#if defined(FIXED_POINT)
void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len)
{
   int j;
   int32x4_t a = vld1q_s32(sum);
   /* Load y[0...3] */
   /* This requires len>0 to always be valid (which we assert in the C code). */
   int16x4_t y0 = vld1_s16(y);
   y += 4;

   for (j = 0; j + 8 <= len; j += 8)
   {
      /* Load x[0...7] */
      int16x8_t xx = vld1q_s16(x);
      int16x4_t x0 = vget_low_s16(xx);
      int16x4_t x4 = vget_high_s16(xx);
      /* Load y[4...11] */
      int16x8_t yy = vld1q_s16(y);
      int16x4_t y4 = vget_low_s16(yy);
      int16x4_t y8 = vget_high_s16(yy);
      int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
      int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);

      int16x4_t y1 = vext_s16(y0, y4, 1);
      int16x4_t y5 = vext_s16(y4, y8, 1);
      int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1);
      int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1);

      int16x4_t y2 = vext_s16(y0, y4, 2);
      int16x4_t y6 = vext_s16(y4, y8, 2);
      int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2);
      int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2);

      int16x4_t y3 = vext_s16(y0, y4, 3);
      int16x4_t y7 = vext_s16(y4, y8, 3);
      int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3);
      int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3);

      y0 = y8;
      a = a7;
      x += 8;
      y += 8;
   }
// ...

When we search for neon, we can see a list of promising files that potentially deal with SIMD instructions. In the celt_neon_intr.c file, we can see that the xcorr_kernel_neon_fixed function executes a loop with SIMD intrinsics.

Result

We have several good open-source projects in which to implement SVE2. Amongst them, we will choose the Opus project for several reasons. First of all, this project is still actively maintained by its developers. As a matter of fact, it is standardized by the Internet Engineering Task Force (IETF) and is widely used for interactive audio transmission over the Internet. Besides, the package is well documented, which helps us understand the code thoroughly. Lastly, and most importantly, the code is more readable for new developers than that of the first two projects. As we can see, the authors kindly commented on the purpose of variables and functions. Thus, we will choose the Opus project to contribute our SVE2 knowledge.

Contributions

How to contribute to the Opus project is well explained on its wiki page. Thankfully, the wiki page states that one of the ways to contribute to Opus development is by doing optimizations (assembly/intrinsics). To do this, we can reach the developers on the mailing list or through the IRC channel.

Conclusion

In this post, we explored some of the open-source projects where we could contribute our SVE2 knowledge. As it turned out, the Opus project is the most suitable for us. In the following post, we will start implementing SVE2 code in the project.

Implementing SVE2 for Volume Adjusting Algorithm

Seung Woo (Paul) Ji — Wed, 23 Mar 2022 02:13:24 +0000

Introduction

Previously, we explored simple volume adjusting algorithms that scale audio samples by a volume factor. Unfortunately, these algorithms use Advanced SIMD instructions, not the Scalable Vector Extension that we learned about in the last post, which can greatly improve the vectorization of code. In this post, we are going to add SVE2 instructions to the volume adjusting algorithm in C++ and explore them in assembly.

Before We Start

Since SVE2 is a new technology and is not natively supported by our current hardware (an Armv8-A processor), we can only run a program written with SVE2 instructions under emulation. This also means that we cannot really measure the performance of the program. Therefore, in this post, we are only going to implement SVE2 and test whether the program runs successfully.

Source Code

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#ifdef   __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif
#include "vol.h"

int main() {

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- SVE2 implementation

        int16_t vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        int32_t i = 0;
        int32_t vl = svcnth(); // number of 16-bit elements per vector

        svbool_t pred;
        pred = svwhilelt_b16(i, SAMPLES);

        while(svptest_first(svptrue_b16(), pred)) {
                svst1(pred, &out[i], (svqrdmulh(svld1(pred, &in[i]), svdup_s16(vol_int))));
                i += vl;
                pred = svwhilelt_b16(i, SAMPLES);
        }

// ---- End of SVE2 implementation

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
}

Why Compiler Intrinsic?

Compiler intrinsics are function-like calls that the compiler replaces with the appropriate SVE2 instructions while taking care of jobs such as register allocation. They are a great way for developers (like me!) to use SVE2 instructions in C/C++ style without writing assembly.

Code Analysis

First of all, we include the arm_sve.h header to access the SVE vector and predicate types and the intrinsics for SVE2 instructions. We then initialize a loop iterator, i, and vl, which holds the number of 16-bit elements that fit in one vector. We also initialize a predicate with svwhilelt_b16 to control the while loop. The _b16 suffix specifies a predicate for 16-bit elements. Conceptually, the intrinsic builds a vector of integers that starts at i and increases by 1 in each subsequent lane, and it activates every lane whose integer is less than SAMPLES. In the while loop condition, we use svptest_first to check whether the first lane of the predicate is active, which means there is still work left to do. The logic inside the loop is very similar to the Advanced SIMD version. svld1 loads a vector of samples starting at in[i], and svdup_s16 duplicates the value of vol_int into every lane of a second vector. svqrdmulh then performs a saturating rounding doubling multiply of the two vectors and keeps the high half of each product, and svst1 stores the result starting at out[i]. Finally, i is incremented by the number of 16-bit lanes in the vector and the predicate is recomputed.
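To make the role of the predicate concrete, here is a minimal scalar sketch of the same loop in portable C. It is only a model, not the code the compiler generates: sqrdmulh_lane and scale_predicated are hypothetical names, and vl is passed in as a parameter instead of being read from the hardware with svcnth.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 16-bit SQRDMULH lane:
 * saturate((2 * a * b + 0x8000) >> 16). */
int16_t sqrdmulh_lane(int16_t a, int16_t b) {
    int64_t acc = 2 * (int64_t)a * (int64_t)b + 0x8000; /* double, then round */
    /* Floor-divide by 2^16, which is what an arithmetic shift right does. */
    int64_t high = (acc >= 0) ? acc / 65536 : -((-acc + 65535) / 65536);
    if (high > INT16_MAX) high = INT16_MAX;             /* saturate */
    if (high < INT16_MIN) high = INT16_MIN;
    return (int16_t)high;
}

/* Scalar model of the predicated loop. vl is a hypothetical lane count
 * that stands in for the value svcnth() would return. */
void scale_predicated(const int16_t *in, int16_t *out, size_t n,
                      size_t vl, int16_t vol_int) {
    assert(vl > 0);
    for (size_t i = 0; i < n; i += vl) {        /* runs while lane 0 is active */
        for (size_t lane = 0; lane < vl; lane++) {
            int active = (i + lane) < n;        /* whilelt: lane index < n */
            if (active) {                       /* inactive lanes are skipped */
                out[i + lane] = sqrdmulh_lane(in[i + lane], vol_int);
            }
        }
    }
}
```

Calling scale_predicated with the same input and two different vl values gives identical output, because the predicate masks off the lanes that would run past the end of the array in the final iteration.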

Building Code

As we discussed before, our current hardware does not support SVE2 instructions. Thus, we have to tell the compiler to target Armv8-A with the SVE2 extension enabled so that it accepts the SVE2 intrinsics, as follows:

$ gcc -march=armv8-a+sve2 vol6.c vol_createsample.o -o vol6

Then, we can execute the program with QEMU in user mode. QEMU emulates the SVE2 instructions in software, which lets the program run on an Armv8 system whose processor lacks them.

$ qemu-aarch64 ./vol6

Once we run it, we can see the program runs successfully without any problem!

Result: -809

Assembler Output

0000000000400698 <main>:
  400698:       043f57ff        addvl   sp, sp, #-1
  40069c:       d100c3ff        sub     sp, sp, #0x30
  4006a0:       a9007bfd        stp     x29, x30, [sp]
  4006a4:       910003fd        mov     x29, sp
  4006a8:       043f5020        addvl   x0, sp, #1
  4006ac:       b900281f        str     wzr, [x0, #40]
  4006b0:       d2800041        mov     x1, #0x2                        // #2
  4006b4:       d2848000        mov     x0, #0x2400                     // #9216
  4006b8:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006bc:       97ffff91        bl      400500 <calloc@plt>
  4006c0:       047f5081        addpl   x1, sp, #4
  4006c4:       f9001020        str     x0, [x1, #32]
  4006c8:       d2800041        mov     x1, #0x2                        // #2
  4006cc:       d2848000        mov     x0, #0x2400                     // #9216
  4006d0:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006d4:       97ffff8b        bl      400500 <calloc@plt>
  4006d8:       047f5081        addpl   x1, sp, #4
  4006dc:       f9000c20        str     x0, [x1, #24]
  4006e0:       52848001        mov     w1, #0x2400                     // #9216
  4006e4:       72a01e81        movk    w1, #0xf4, lsl #16
  4006e8:       047f5080        addpl   x0, sp, #4
  4006ec:       f9401000        ldr     x0, [x0, #32]
  4006f0:       9400006c        bl      4008a0 <vol_createsample>
  4006f4:       5287ffe0        mov     w0, #0x3fff                     // #16383
  4006f8:       047f5081        addpl   x1, sp, #4
  4006fc:       79002c20        strh    w0, [x1, #22]
  400700:       043f5020        addvl   x0, sp, #1
  400704:       b900241f        str     wzr, [x0, #36]
  400708:       0460e3e0        cnth    x0
  40070c:       047f5081        addpl   x1, sp, #4
  400710:       b9001020        str     w0, [x1, #16]
  400714:       043f5020        addvl   x0, sp, #1
  400718:       b9402400        ldr     w0, [x0, #36]
  40071c:       52848001        mov     w1, #0x2400                     // #9216
  400720:       72a01e81        movk    w1, #0xf4, lsl #16
  400724:       25610400        whilelt p0.h, w0, w1
  400728:       910093e0        add     x0, sp, #0x24
  40072c:       e5801c00        str     p0, [x0, #7, mul vl]
  400730:       14000026        b       4007c8 <main+0x130>
  400734:       043f5020        addvl   x0, sp, #1
  400738:       b9802400        ldrsw   x0, [x0, #36]
  40073c:       d37ff800        lsl     x0, x0, #1
  400740:       047f5081        addpl   x1, sp, #4
  400744:       f9400c21        ldr     x1, [x1, #24]
  400748:       8b000020        add     x0, x1, x0
  40074c:       043f5021        addvl   x1, sp, #1
  400750:       b9802421        ldrsw   x1, [x1, #36]
  400754:       d37ff821        lsl     x1, x1, #1
  400758:       047f5082        addpl   x2, sp, #4
  40075c:       f9401042        ldr     x2, [x2, #32]
  400760:       8b010041        add     x1, x2, x1
  400764:       910093e2        add     x2, sp, #0x24
  400768:       85801c40        ldr     p0, [x2, #7, mul vl]
  40076c:       a4a0a020        ld1h    {z0.h}, p0/z, [x1]
  400770:       047f5081        addpl   x1, sp, #4
  400774:       91005821        add     x1, x1, #0x16
  400778:       2518e3e0        ptrue   p0.b
  40077c:       84c0a021        ld1rh   {z1.h}, p0/z, [x1]
  400780:       04617400        sqrdmulh        z0.h, z0.h, z1.h
  400784:       910093e1        add     x1, sp, #0x24
  400788:       85801c20        ldr     p0, [x1, #7, mul vl]
  40078c:       e4a0e000        st1h    {z0.h}, p0, [x0]
  400790:       043f5020        addvl   x0, sp, #1
  400794:       b9402401        ldr     w1, [x0, #36]
  400798:       047f5080        addpl   x0, sp, #4
  40079c:       b9401000        ldr     w0, [x0, #16]
  4007a0:       0b000020        add     w0, w1, w0
  4007a4:       043f5021        addvl   x1, sp, #1
  4007a8:       b9002420        str     w0, [x1, #36]
  4007ac:       043f5020        addvl   x0, sp, #1
  4007b0:       b9402400        ldr     w0, [x0, #36]
  4007b4:       52848001        mov     w1, #0x2400                     // #9216
  4007b8:       72a01e81        movk    w1, #0xf4, lsl #16
  4007bc:       25610400        whilelt p0.h, w0, w1
  4007c0:       910093e0        add     x0, sp, #0x24
  4007c4:       e5801c00        str     p0, [x0, #7, mul vl]
  4007c8:       2558e3e0        ptrue   p0.h
  4007cc:       910093e0        add     x0, sp, #0x24
  4007d0:       85801c01        ldr     p1, [x0, #7, mul vl]
  4007d4:       2550c020        ptest   p0, p1.b
  4007d8:       9a9f57e0        cset    x0, mi  // mi = first
  4007dc:       7100001f        cmp     w0, #0x0
  4007e0:       54fffaa1        b.ne    400734 <main+0x9c>  // b.any
  4007e4:       043f5020        addvl   x0, sp, #1
  4007e8:       b9002c1f        str     wzr, [x0, #44]
  4007ec:       1400001d        b       400860 <main+0x1c8>
  4007f0:       043f5020        addvl   x0, sp, #1
  4007f4:       b9802c00        ldrsw   x0, [x0, #44]
  4007f8:       d37ff800        lsl     x0, x0, #1
  4007fc:       047f5081        addpl   x1, sp, #4
  400800:       f9400c21        ldr     x1, [x1, #24]
  400804:       8b000020        add     x0, x1, x0
  400808:       79c00000        ldrsh   w0, [x0]
  40080c:       2a0003e1        mov     w1, w0
  400810:       043f5020        addvl   x0, sp, #1
  400814:       b9402800        ldr     w0, [x0, #40]
  400818:       0b000020        add     w0, w1, w0
  40081c:       5289ba61        mov     w1, #0x4dd3                     // #19923
  400820:       72a20c41        movk    w1, #0x1062, lsl #16
  400824:       9b217c01        smull   x1, w0, w1
  400828:       d360fc21        lsr     x1, x1, #32
  40082c:       13067c22        asr     w2, w1, #6
  400830:       131f7c01        asr     w1, w0, #31
  400834:       4b010042        sub     w2, w2, w1
  400838:       52807d01        mov     w1, #0x3e8                      // #1000
  40083c:       1b017c41        mul     w1, w2, w1
  400840:       4b010000        sub     w0, w0, w1
  400844:       043f5021        addvl   x1, sp, #1
  400848:       b9002820        str     w0, [x1, #40]
  40084c:       043f5020        addvl   x0, sp, #1
  400850:       b9402c00        ldr     w0, [x0, #44]
  400854:       11000400        add     w0, w0, #0x1
  400858:       043f5021        addvl   x1, sp, #1
  40085c:       b9002c20        str     w0, [x1, #44]
  400860:       043f5020        addvl   x0, sp, #1
  400864:       b9402c01        ldr     w1, [x0, #44]
  400868:       52847fe0        mov     w0, #0x23ff                     // #9215
  40086c:       72a01e80        movk    w0, #0xf4, lsl #16
  400870:       6b00003f        cmp     w1, w0
  400874:       54fffbed        b.le    4007f0 <main+0x158>
  400878:       043f5020        addvl   x0, sp, #1
  40087c:       b9402801        ldr     w1, [x0, #40]
  400880:       90000000        adrp    x0, 400000 <__abi_tag-0x278>
  400884:       9124e000        add     x0, x0, #0x938
  400888:       97ffff2e        bl      400540 <printf@plt>
  40088c:       52800000        mov     w0, #0x0                        // #0
  400890:       a9407bfd        ldp     x29, x30, [sp]
  400894:       043f503f        addvl   sp, sp, #1
  400898:       9100c3ff        add     sp, sp, #0x30
  40089c:       d65f03c0        ret

To check that SVE2 instructions are used, we can skim through the disassembly and search for the whilelt instruction.

  400724:       25610400        whilelt p0.h, w0, w1
  4007bc:       25610400        whilelt p0.h, w0, w1

As we can see, SVE instructions such as whilelt are used by the program, along with the sqrdmulh on z registers at address 400780, and it runs without any problem!

Conclusion

In this post, we explored how to implement SVE2 instructions in the volume adjusting algorithm. Unfortunately, the hardware available to us does not support SVE2 natively (yet!), so we must use an emulator to run the program. Implementing SVE2 is also challenging because it requires an understanding of predicates and a new syntax. However, SVE2 is potentially beneficial for developers because newer hardware is expected to support it natively and because the vector length is determined by the machine rather than fixed in the code.

Exploring Scalable Vector Extension 2

Seung Woo (Paul) Ji — Sun, 20 Mar 2022 01:31:33 +0000

Introduction

Scalable Vector Extension (SVE) is a SIMD extension of the Armv8-A architecture that provides a new set of vector instructions to enable the vectorization of loops for High Performance Computing (HPC).

Why SVE?

One of the key features of SVE is that it does not require a fixed 128-bit vector length as the Neon architecture extension does. This enables vector-length agnostic (VLA) programming, in which the vector length is chosen by the hardware implementation rather than by the code. Thus, developers can write and build a program once and run it on different hardware with different SVE vector lengths (better portability!).
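As a rough illustration of what vector-length agnostic means for a loop, the sketch below counts how many iterations the same loop needs on hardware with different vector lengths. vla_iterations and vl_bits are hypothetical names used only for this example, and the element size is assumed to be 16 bits as in the volume algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Number of loop iterations needed to process n 16-bit samples when one
 * vector register holds vl_bits bits. vl_bits is a hypothetical parameter
 * that models the vector length chosen by the hardware. */
size_t vla_iterations(size_t n, size_t vl_bits) {
    assert(vl_bits >= 128 && vl_bits % 128 == 0);
    size_t lanes = vl_bits / 16;      /* 16-bit lanes per vector */
    return (n + lanes - 1) / lanes;   /* round up: the last vector may be partial */
}
```

For 1000 samples this gives 125 iterations with 128-bit vectors, 32 with 512-bit vectors, and 8 with 2048-bit vectors. The source code of the loop stays the same in every case.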

SVE2

SVE2 is basically a superset of SVE and the Neon extension. The SVE2 instructions extend the data-processing domains beyond HPC to include:

Computer vision
Multimedia
Long-Term Evolution (LTE) baseband processing
Genomics
In-memory database
Web serving
General-purpose software

SVE2 Registers

Like SVE, SVE2 is based on the following scalable registers:

Scalable vector registers

There are a total of 32 scalable vector registers (z0-z31). Their size in bits must be a multiple of 128, up to a maximum of 2048 bits. These registers can hold 64, 32, 16, and 8-bit elements. The lower 128 bits of each register are shared with the corresponding Neon register of the Advanced SIMD extension.

Scalable predicate registers

There are a total of 16 predicate registers (p0-p15), which are unique to SVE and SVE2. Each predicate register holds one bit for each byte of a z register, so it is 1/8 of the z register length. Registers p0-p7 are governing predicates for load, store, and arithmetic instructions. Registers p8-p15 are extra predicates for loop management.
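The register sizes described above follow from simple arithmetic, sketched here in C. The function names are hypothetical, and the validity check only encodes the rule as stated in this post (a multiple of 128 bits, up to 2048 bits).

```c
#include <stdbool.h>
#include <stddef.h>

/* True if vl_bits is a vector length allowed by the rule stated in this
 * post: a multiple of 128 bits, at most 2048 bits. */
bool sve_vl_is_valid(size_t vl_bits) {
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Number of elements of elem_bits each that fit in one z register. */
size_t sve_lanes(size_t vl_bits, size_t elem_bits) {
    return vl_bits / elem_bits;
}

/* Size of one predicate register in bits: one bit per byte of a z register. */
size_t sve_predicate_bits(size_t vl_bits) {
    return vl_bits / 8;
}
```

For example, a 512-bit implementation holds 32 16-bit samples per z register and has 64-bit predicate registers.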

Conclusion

SVE allows developers to vectorize a program more efficiently because they don't have to worry about the vector size. It also enables better portability because each hardware implementation chooses the vector size for the same program. In the next post, we will discuss how we can implement SVE2 in the volume algorithm we explored previously.

Exploring and Benchmarking Audio Volume Adjusting Algorithms Part 2

Seung Woo (Paul) Ji — Thu, 10 Mar 2022 00:14:53 +0000

Introduction

In the last post, we explored multiple volume adjusting algorithms and made assumptions about how well they would perform. Now, we are going to measure the performance of each algorithm and test whether it meets our expectations.

The Audio Sample Size

Before we start testing, we will set the number of samples to a large value so that we get meaningful results. For this, we will use 1,600,000,000 samples for each program. If we run the dummy program under the time command, we get the following result:

real	1m27.058s
user	1m22.503s
sys	0m4.496s

The dummy program takes about a minute and a half in total. However, we have to consider that this time does not only account for the volume scale function: other work is involved as well (e.g. generating random samples, calculating results and so on).

Evaluating Algorithm Performance

How do we measure only the performance of the volume scale function (scale_sample)?

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

We can do this with the C time library (time.h). With it, we can isolate the loop and measure the elapsed time as follows:

// ---- Include the C Time library
#include <time.h>

        clock_t         t;

// ---- Calculate the start time
        t = clock();

// Scale Sample Code

//----  Calculate the elapsed time
        t = clock() - t;

// ---- Print the elapsed time in seconds
        printf("Time elapsed: %f\n", ((double)t)/CLOCKS_PER_SEC);

In this way, we can measure the time spent in the scaling loop alone, in seconds. Note that clock() reports the processor time used by the program rather than wall-clock time.
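The timing pattern can also be wrapped in small helpers, as in the sketch below. elapsed_seconds, time_call, and busy_work are hypothetical names, and busy_work is only a stand-in for the scaling loop.

```c
#include <time.h>

/* Convert two clock() readings into elapsed processor time in seconds. */
double elapsed_seconds(clock_t start, clock_t end) {
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for the scaling loop being measured. The volatile
 * accumulator keeps the compiler from removing the loop. */
void busy_work(void) {
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++) {
        sum += i;
    }
}

/* Time a single call to fn using clock(). */
double time_call(void (*fn)(void)) {
    clock_t start = clock();
    fn();
    return elapsed_seconds(start, clock());
}
```

A call such as time_call(busy_work) returns the processor time of that one call, which keeps sample generation and the checksum loop out of the measurement.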

Benchmark Test Results

For benchmarking, each algorithm was run 20 times. Every run processed 1,600,000,000 samples, and the algorithms were assessed on AArch64 and x86_64 systems. During the tests, the number of background operations was minimized.

The following tables show the results. Both tables show a very small standard deviation (SD), meaning the measurements are clustered around the mean value.

AArch64

Algorithm	vol0	vol1	vol2	vol4	vol5
Time (seconds)	5.290686	4.571809	11.204779	2.862223	2.897304
	5.271289	4.616451	11.236343	2.869659	2.860497
	5.3009	4.618019	11.207497	2.839968	2.88575
	5.257061	4.57951	11.229004	2.794136	2.837761
	5.29981	4.584778	11.237608	2.879343	2.857112
	5.252714	4.590422	11.220075	2.785239	2.859161
	5.300421	4.590156	11.215143	2.870726	2.919503
	5.286753	4.589992	11.224697	2.794225	2.895057
	5.317688	4.61077	11.268087	2.907598	2.91678
	5.272125	4.63759	11.235228	2.799026	2.881828
	5.308232	4.58515	11.229461	2.882254	2.910783
	5.286579	4.599118	11.253098	2.85217	2.903325
	5.282362	4.597291	11.190576	2.875931	2.920964
	5.276742	4.611212	11.239454	2.849582	2.853147
	5.293711	4.591562	11.253258	2.870164	2.918136
	5.293716	4.621955	11.228463	2.858067	2.850342
	5.318874	4.591154	11.225114	2.864949	2.912111
	5.306651	4.590993	11.252793	2.841034	2.847878
	5.30221	4.641963	11.220678	2.877916	2.842209
	5.299778	4.593774	11.206139	2.868532	2.856316
Total	105.818302	92.013669	224.577495	57.042742	57.625964
Average	5.2909151	4.60068345	11.22887475	2.8521371	2.8812982
SD	0.01805085609	0.01880182964	0.01914206262	0.0338674976	0.02977236719

In the previous post, we assumed that the algorithms that use SIMD instructions would perform faster than the others. Indeed, we can observe that the vol4 and vol5 algorithms outperform the rest. The performance difference between them is really small (~0.0291 seconds), indicating that inline assembly and compiler intrinsics are almost equally fast.
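To put the averages in perspective, the ratio between two of them can be computed as in this small sketch. mean and speedup are hypothetical helper names, not part of the benchmark programs.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of count timing values. */
double mean(const double *values, size_t count) {
    assert(count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum / (double)count;
}

/* How many times faster the candidate is than the baseline. */
double speedup(double baseline_seconds, double candidate_seconds) {
    return baseline_seconds / candidate_seconds;
}
```

With the AArch64 averages from the table, speedup(5.2909151, 2.8521371) is roughly 1.86, so vol4 takes a little over half the time of vol0.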

We can also see that vol1 runs faster than vol0. This matches our expectation, as vol1 uses a fixed-point calculation with bit-shift operations.

Interestingly, the vol2 algorithm is significantly slower than the others. Initially, we assumed that it might perform faster than vol0 and vol1, which multiply each sample by the scaling factor, because it pre-calculates all the results and stores them in a table. The result suggests that the CPU either has an efficient arithmetic logic unit (ALU) that processes the multiplication quickly, or is comparatively slow at reading memory when it looks up the pre-calculated values. The table holds 65,536 two-byte entries (128 KiB) and is indexed by random samples, so the lookups may well miss the fastest cache levels.

x86_64

Algorithm	vol0	vol1	vol2
Time (seconds)	2.821902	2.784482	3.531761
	2.903628	2.786877	3.569542
	2.895999	2.78038	3.551214
	2.877543	2.785402	3.559591
	2.886563	2.785422	3.537273
	2.891856	2.783449	3.545279
	2.80208	2.786667	3.58345
	2.855822	2.782619	3.590136
	2.804731	2.781633	3.572802
	2.782909	2.801589	3.587121
	2.783267	2.783468	3.630578
	2.785422	2.800091	3.562486
	2.81526	2.77875	3.591089
	2.873962	2.778289	3.529016
	2.791908	2.789269	3.579964
	2.785272	2.792904	3.55086
	2.804883	2.778821	3.587747
	2.78638	2.785906	3.545412
	2.788079	2.795611	3.574527
	2.810512	2.794108	3.54657
Total	56.547978	55.735737	71.326418
Average	2.8273989	2.78678685	3.5663209
SD	0.04456744515	0.006838116502	0.02516021857

The x86_64 system shows a similar pattern to the AArch64 system: the vol1 algorithm is the fastest and vol2 is the slowest. Note that vol4 and vol5 are missing because these programs use SIMD instructions that are specific to AArch64.

Conclusion

In this post, we measured the performance of each algorithm to test the assumptions we made in the previous post. As expected, the algorithms that use SIMD instructions run faster than the others, as they can process multiple data elements at a time.

Exploring and Benchmarking Audio Volume Adjusting Algorithms Part 1

Seung Woo (Paul) Ji — Mon, 07 Mar 2022 00:52:03 +0000

Introduction

Uncompressed digital sound is typically represented as signed 16-bit (2-byte) integer samples. At a 48 kHz sample rate, a single channel produces 96,000 bytes per second (2 bytes per sample * 48,000 samples per second), and a stereo stream doubles that. When we change the sound volume, each sample needs to be scaled by a volume factor between 0 (no volume) and 1 (full volume). Considering the amount of data in sound samples, it is vital to have an efficient volume adjusting algorithm. This is especially true for a mobile device, as the amount of processing required can affect its battery life.
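The data rate arithmetic can be written out as a one-line function. pcm_bytes_per_second is a hypothetical name used only for this sketch.

```c
#include <stdint.h>

/* Bytes per second of uncompressed PCM audio. */
uint32_t pcm_bytes_per_second(uint32_t sample_rate_hz,
                              uint32_t bytes_per_sample,
                              uint32_t channels) {
    return sample_rate_hz * bytes_per_sample * channels;
}
```

pcm_bytes_per_second(48000, 2, 1) gives 96,000 bytes per second for one channel, and passing 2 channels gives 192,000.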

In this post, we are going to explore a number of different algorithms for processing sound samples to control the volume level. After that, we will benchmark each algorithm to study its performance.

volume.h

/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume

/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);

vol_createsample.c

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i<sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}

In volume.h, we define a constant named SAMPLES that sets the number of samples to be processed. For benchmarking, we will use a value large enough that processing takes at least 20 seconds. This will allow us to analyze the performance much more easily.

The vol_createsample function fills an array with random numbers to simulate an audio buffer.

Algorithm 1: vol0.c

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}

vol0.c contains a very naïve algorithm that simply multiplies each sample by the volume scaling factor. This involves casting from a signed 16-bit integer to floating point and back again, which can be expensive.

It is also worth explaining why we need the loop that sums the samples and the statement that prints the sum to the console. They must exist so that the compiler keeps the scaling code. Let's take a look at the assembly code produced by the compiler to understand this more easily.

Assembly code of the original program

  400580:       a9be7bfd        stp     x29, x30, [sp, #-32]!
  400584:       d2800041        mov     x1, #0x2                        // #2
  400588:       d2800200        mov     x0, #0x10                       // #16
  40058c:       910003fd        mov     x29, sp
  400590:       a90153f3        stp     x19, x20, [sp, #16]
  400594:       97ffffdb        bl      400500 <calloc@plt>
  400598:       d2800041        mov     x1, #0x2                        // #2
  40059c:       aa0003f4        mov     x20, x0
  4005a0:       d2800200        mov     x0, #0x10                       // #16
  4005a4:       97ffffd7        bl      400500 <calloc@plt>
  4005a8:       aa0003f3        mov     x19, x0
  4005ac:       52800201        mov     w1, #0x10                       // #16
  4005b0:       aa1403e0        mov     x0, x20
  4005b4:       94000077        bl      400790 <vol_createsample>
  4005b8:       d2800002        mov     x2, #0x0                        // #0
  4005bc:       1e2c1001        fmov    s1, #5.000000000000000000e-01
  4005c0:       78e26a81        ldrsh   w1, [x20, x2]
  4005c4:       1e220020        scvtf   s0, w1
  4005c8:       1e210800        fmul    s0, s0, s1
  4005cc:       5ea1b800        fcvtzs  s0, s0
  4005d0:       7c226a60        str     h0, [x19, x2]
  4005d4:       91000842        add     x2, x2, #0x2
  4005d8:       f100805f        cmp     x2, #0x20
  4005dc:       54ffff21        b.ne    4005c0 <main+0x40>  // b.any
  4005e0:       5289ba64        mov     w4, #0x4dd3                     // #19923
  4005e4:       aa1303e0        mov     x0, x19
  4005e8:       91008265        add     x5, x19, #0x20
  4005ec:       52800001        mov     w1, #0x0                        // #0
  4005f0:       72a20c44        movk    w4, #0x1062, lsl #16
  4005f4:       52807d03        mov     w3, #0x3e8                      // #1000
  4005f8:       78c02402        ldrsh   w2, [x0], #2
  4005fc:       0b010042        add     w2, w2, w1
  400600:       9b247c41        smull   x1, w2, w4
  400604:       9366fc21        asr     x1, x1, #38
  400608:       4b827c21        sub     w1, w1, w2, asr #31
  40060c:       1b038821        msub    w1, w1, w3, w2
  400610:       eb0000bf        cmp     x5, x0
  400614:       54ffff21        b.ne    4005f8 <main+0x78>  // b.any
  400618:       90000000        adrp    x0, 400000 <__abi_tag-0x278>
  40061c:       9120a000        add     x0, x0, #0x828
  400620:       97ffffc8        bl      400540 <printf@plt>
  400624:       52800000        mov     w0, #0x0                        // #0
  400628:       a94153f3        ldp     x19, x20, [sp, #16]
  40062c:       a8c27bfd        ldp     x29, x30, [sp], #32
  400630:       d65f03c0        ret

Assembly code without the sum loop and print

  400500:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
  400504:       d2800041        mov     x1, #0x2                        // #2
  400508:       d2800200        mov     x0, #0x10                       // #16
  40050c:       910003fd        mov     x29, sp
  400510:       97ffffec        bl      4004c0 <calloc@plt>
  400514:       52800201        mov     w1, #0x10                       // #16
  400518:       9400005e        bl      400690 <vol_createsample>
  40051c:       52800000        mov     w0, #0x0                        // #0
  400520:       a8c17bfd        ldp     x29, x30, [sp], #16
  400524:       d65f03c0        ret

You can immediately notice that many parts of the assembly code are missing when we do not include the sum loop and print. This is because the compiler recognizes that the results of the volume scaling calculation are never used and optimizes the code by removing it. Obviously, we need to prevent this from happening, as it is the code that we have to test!
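Printing the checksum is one way to keep the results observable. Another common technique, shown here only as a sketch and not as what these programs do, is to store the checksum into a volatile object. consume_samples and sink are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global sink. Stores to a volatile object count as
 * observable behaviour, so the compiler must keep them. */
volatile int sink;

/* Fold the output samples into a checksum and publish it through the
 * sink, so the scaled values feed into an observable result. */
int consume_samples(const int16_t *out, size_t n) {
    int ttl = 0;
    for (size_t x = 0; x < n; x++) {
        ttl = (ttl + out[x]) % 1000;
    }
    sink = ttl;
    return ttl;
}
```

Because the checksum depends on every element of out[], the scaling loop that produced those elements can no longer be treated as dead code.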

Algorithm 2: vol1.c

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

Instead of using a floating-point calculation, vol1.c uses a fixed-point calculation with bit-shift operations. In this way, we avoid the costly conversion from integer to floating point and back again.
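To see what the fixed-point expression does, the sketch below splits it into steps. scale_sample_fixed is a hypothetical name, and the arithmetic is the same as in scale_sample above.

```c
#include <stdint.h>

/* Fixed-point volume scaling, step by step. */
int16_t scale_sample_fixed(int16_t sample, int volume) {
    /* Volume as a fraction of 32767, doubled so that it becomes a
     * 16-bit fraction (50% gives 16383 << 1 = 32766). */
    int32_t factor = (int32_t)(32767 * volume / 100) << 1;
    /* 16-bit sample times 16-bit fraction gives a 32-bit product. */
    int32_t product = (int32_t)sample * factor;
    /* Keep the upper 16 bits. Right-shifting a negative value is
     * implementation-defined in C; mainstream compilers shift arithmetically. */
    return (int16_t)(product >> 16);
}
```

For a sample of 1000 at 50% volume the result is 499 rather than the 500 that the floating-point version produces, which shows the small precision cost of this approach.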

Algorithm 3: vol2.c

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x <= 32767; x++) {
                // Q: What is the purpose of the cast to uint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x < SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

In vol2.c, we pre-calculate all 65,536 possible results and then look up the result for each input value. Note that we cast each index to uint16_t. When a negative integer is cast to an unsigned type, the value wraps modulo 2^16, which keeps the same 16-bit two's-complement bit pattern. For example, -5 becomes 65531 (2^16 - 5). In this way, every sample from -32768 to 32767 maps to its own slot among the 65,536 array elements.
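The index mapping can be checked in isolation with a short sketch. table_index and build_table are hypothetical names, and build_table fills the table with the same expression as vol2.c.

```c
#include <stdint.h>

/* Map a signed sample to its slot in the lookup table. The conversion
 * to uint16_t wraps modulo 65536, so -5 maps to 65531. */
uint16_t table_index(int16_t sample) {
    return (uint16_t)sample;
}

/* Fill a 65536-entry table with every possible scaled result. */
void build_table(int16_t table[65536], double volume_percent) {
    for (int x = -32768; x <= 32767; x++) {
        table[(uint16_t)x] = (int16_t)((double)x * volume_percent / 100.0);
    }
}
```

Negative samples land in the upper half of the table (indices 32768 to 65535) and non-negative samples in the lower half, so no two samples share a slot.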

This program may perform better than the previous one because the table already holds every possible result. However, this depends on the speed of reading memory.

Dummy Algorithm: vol3.c

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) 100;
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

vol3.c is simply a dummy program whose scale_sample returns the same value (100) for every sample. Its purpose is to measure the processing overhead other than the volume scaling algorithm.

Algorithm 4: vol4.c

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

In vol4.c, we use Single Instruction, Multiple Data (SIMD) instructions through inline assembly (assembly language code embedded in a high-level language). Note that the assembly code here is specific to AArch64, so this program can only be built and run on an AArch64 system.

Let's take a look at some of the important points in the code (marked as Q). First of all, we multiply by 32767 when calculating vol_int to get a fixed-point representation of the volume factor. vol_int has the type int16_t, a signed integer type exactly 16 bits wide, so it can hold values between -32,768 and 32,767. A volume of 100% scaled by 32,768 would not fit in that range, so we scale the volume factor by 32,767 to prevent integer overflow.
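The choice between 32767 and 32768 can be checked with a small sketch. The helper names below (volume_to_fixed, fits_in_int16) are hypothetical and not part of vol4.c, and the sketch assumes a volume percentage between 0 and 100.

```c
#include <stdint.h>

/* Hypothetical helper (not part of vol4.c): convert a volume percentage
 * in the range 0..100 to a signed 16-bit fixed-point factor. */
int16_t volume_to_fixed(double volume_percent)
{
    /* Scaling by 32767 maps the 100% case exactly onto INT16_MAX. */
    return (int16_t)(volume_percent / 100.0 * 32767.0);
}

/* Returns 1 when the scaled value still fits in int16_t, 0 otherwise.
 * The product stays a double, so nothing overflows during the check. */
int fits_in_int16(double volume_percent, double scale)
{
    double scaled = volume_percent / 100.0 * scale;
    return scaled >= INT16_MIN && scaled <= INT16_MAX;
}
```

With a volume of 100, scaling by 32768 produces a value one above the int16_t maximum, while scaling by 32767 lands exactly on it.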

Next, we set three pointers: in_cursor and out_cursor point to the first elements of the in and out arrays, and limit points just past the end of the in array. With these, we can write a loop that walks through the samples and multiplies each one by the volume scaling factor.

Once everything above is set up, we can start writing inline assembly using __asm__. The dup instruction duplicates the volume scaling factor from a general-purpose register, accessed as a 32-bit W register (%w0), into all eight 16-bit lanes of the vector register (v1.8h). This lets us multiply every element of a sample vector by the scaling factor with a single instruction.

Inside the loop, a second inline assembly statement multiplies eight samples by the scaling factor. Unlike the previous __asm__ statement, this one has three operand sections after the assembler template, each introduced by a colon (:). The first section lists the output operands, in_cursor and out_cursor. They are given the symbolic names [in_cursor] and [out_cursor] so that they can be referenced inside the assembler template (the part enclosed in double quotation marks). The + sign is a constraint modifier indicating that each operand is both read and written by the instructions. The second section lists the input operands; here it names the same two pointers again, which the + modifier already covers. The last section lists the clobbers. The memory clobber tells the compiler that the assembly code reads and writes memory that is not listed among the operands.

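Lastly, we can expect the printed results to be correct to within fixed-point rounding. The sqrdmulh instruction doubles each product, rounds it, keeps the upper 16 bits, and saturates the result instead of letting it wrap around when an overflow occurs.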
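To make the behaviour of sqrdmulh on a single lane more concrete, here is a scalar model in plain C. This is only a sketch: the function name sqrdmulh_lane is hypothetical, and the code assumes that >> on a negative signed value is an arithmetic shift, as it is with GCC and Clang.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of sqrdmulh (hypothetical helper).
 * Assumes >> on a negative signed value is an arithmetic shift, as it
 * is with GCC and Clang. */
int16_t sqrdmulh_lane(int16_t a, int16_t b)
{
    /* Doubled product, computed in 64 bits so it cannot overflow. */
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;

    /* Add the rounding constant, then keep the upper 16 bits. */
    int64_t high = (doubled + 0x8000) >> 16;

    /* Saturate to the int16_t range instead of wrapping around. */
    if (high > INT16_MAX)
        return INT16_MAX;
    if (high < INT16_MIN)
        return INT16_MIN;
    return (int16_t)high;
}
```

The only input pair that actually needs saturation is -32768 times -32768, which would otherwise produce 32768. A sample of 1000 scaled by a 50% factor (16383) comes out as 500.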

Algorithm 4: vol5.c

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor < limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}

vol5.c also uses SIMD instructions, but through compiler intrinsics, which are function-like language extensions built into the compiler. Since these intrinsics map to instructions unique to the AArch64 architecture, this program is also specific to AArch64.

Let's explore the code together. The intrinsics correspond to the instructions we used before: ldr (vld1q_s16), dup (vdupq_n_s16), sqrdmulh (vqrdmulhq_s16), and str (vst1q_s16). The s16 suffix of each intrinsic gives the element type (signed 16-bit values), and the q in the name selects a 128-bit vector register. Thus, each intrinsic operates on 8 elements at a time (8 elements x 16 bits = 128 bits). This means we have to advance both in_cursor and out_cursor by 8 elements per iteration. Because both are int16_t pointers, adding 8 moves them by 8 elements (16 bytes), not by 8 bytes.

Also, notice that we have to increment both cursors manually this time. Unlike the inline assembly version, where the ldr and str instructions post-increment the pointer registers, the intrinsics do not update the pointers for us.
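The difference between elements and bytes in the cursor increment can be illustrated with portable C. The names LANES and bytes_advanced below are hypothetical and do not appear in vol5.c; the caller is assumed to pass a pointer into an array that holds at least the requested number of elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of int16_t lanes in one 128-bit vector register. */
#define LANES 8

/* Byte distance covered when an int16_t pointer is advanced by
 * `elements` (hypothetical helper). The caller must make sure that
 * cursor + elements stays within, or one past, the same array. */
size_t bytes_advanced(const int16_t *cursor, size_t elements)
{
    const int16_t *next = cursor + elements;   /* pointer arithmetic counts elements */
    return (size_t)((const char *)next - (const char *)cursor);
}
```

Advancing by LANES elements covers 16 bytes, which matches the #16 post-increment used by ldr and str in the inline assembly version.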

Since both vol4.c and vol5.c use AArch64-specific SIMD instructions, it is reasonable to expect these two to outperform the other algorithms.

Conclusion

In this post, we explored multiple algorithms for adjusting the volume of audio samples. We saw how the algorithms differ even though they all accomplish the same goal. In the next post, we will benchmark each program to verify our expectations.

Exploring Assembler on the x86-64 Platform

Seung Woo (Paul) Ji — Sun, 27 Feb 2022 23:45:58 +0000

Introduction

In this post, we are going to develop the same assembly program that we coded in the previous post, but this time on an x86_64 system.

Original Code

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

Like the original AArch64 program, the x86_64 code does nothing but loop 10 times (max = 10). However, there are a number of notable differences compared to the AArch64 platform. First of all, in this (AT&T) syntax we use a $ sign to indicate an immediate value and a % sign to indicate a register. Next, we have the inc instruction to increment the value of r15 directly instead of using an add instruction. We also use the jne instruction to jump to a label instead of the b.ne branch instruction, and the syscall instruction instead of svc to invoke a system call. Finally, the system call number and arguments go in a different set of registers (e.g. rax, rdi).

With that being said, let's continue developing the code to actually print out something to the console screen.

Improved Code - Print Message

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop\n"
        len = . - msg

Result

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

The program does what we expected. However, the printed messages are not very meaningful to us yet. Let's continue developing the code so that it prints the loop number.

Improved Code - Print Loop Number

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     %r15,%r14           /* Copy the value of r15 to r14 */
    add     $'0',%r14           /* Add the ascii value of '0' to the r14 and save */
    movb    %r14b,msg+6         /* Copy one byte of r14 to the address location of msg + 6 */

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop: #\n"
        len = . - msg

Result

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Now, the program prints more meaningful messages to the screen. Note that there are other notable differences compared to the AArch64 assembly. For example, we can use the mov instruction to store a value from a register directly to a memory address (msg+6). As you may remember, on AArch64 we had to place the address in a register and use the strb instruction for this job. Moreover, we put the b suffix on both the mov instruction and the register name (movb, r14b) to limit the move to a single byte.
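For readers more comfortable with C, the loop body above can be sketched roughly as follows. The names loop_msg and print_loop_line are hypothetical, and the sketch assumes a POSIX system where write() is available from unistd.h.

```c
#include <unistd.h>

/* Message template; offset 6 holds the '#' placeholder. */
static char loop_msg[] = "Loop: #\n";

/* Patch the digit into the message and write it to the file descriptor
 * (hypothetical helper). Returns the number of bytes written, or -1 on
 * error. Only valid for index values 0 through 9. */
long print_loop_line(int fd, unsigned index)
{
    loop_msg[6] = (char)('0' + index);                     /* movb %r14b,msg+6 */
    return (long)write(fd, loop_msg, sizeof loop_msg - 1); /* sys_write, length without the NUL */
}
```

Calling print_loop_line(1, i) for i from 0 to 9 mirrors the ten iterations of the assembly loop.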

However, this code is also not sufficient to handle the two-digit loop numbers.

Improved Code - Print Two Digit Loop Number

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 15                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */
    mov     $10,%r13            /* Divisor */

loop:

// Dividing by 10
    mov     %r15,%rax           /* Setting rax with the value of dividend */
    mov     $0,%rdx             /* rdx must be set to 0 before using div instruction */
    div     %r13                /* divide rax by the r13; place quotient into rax and remainder into rdx */
    cmp     $0,%rax
    je     oneDigit

// Inserting tens digit
    add     $'0',%rax           /* Add the ascii value of '0' to the rax and save */
    mov     %rax,%r12
    movb    %r12b,msg+6         /* Copy one byte of rax to the address location of msg + 6 */

oneDigit:

// Inserting ones digit
    add     $'0',%rdx           /* Add the ascii value of '0' to the rdx and save */
    mov     %rdx,%r12
    movb    %r12b,msg+7         /* Copy one byte of rdx to the address location of msg + 7 */

// Print Message
    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop:  #\n"
        len = . - msg

In this code, we divide the loop index stored in r15 by 10. The div instruction takes its dividend from the rdx:rax register pair, which is why rdx is cleared first, and, unlike the AArch64 udiv instruction, it produces a remainder as well as a quotient. We write the quotient as the tens digit and the remainder as the ones digit. To suppress the leading zero, we jump to the oneDigit label and skip inserting the tens digit when the quotient is equal to 0.
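The same idea can be sketched in C with the standard div() function, which also returns the quotient and the remainder together. The function name patch_two_digits is hypothetical, and the sketch assumes an index between 0 and 99 and a buffer laid out like the "Loop:  #\n" message above.

```c
#include <stdlib.h>

/* Fill offsets 6 and 7 of a "Loop:  #\n" style buffer with the tens and
 * ones digits of index (hypothetical helper, valid for 0..99). */
void patch_two_digits(char *msg, int index)
{
    div_t qr = div(index, 10);       /* quotient and remainder in one call */

    if (qr.quot != 0)                /* skip the tens digit to avoid a leading zero */
        msg[6] = (char)('0' + qr.quot);

    msg[7] = (char)('0' + qr.rem);   /* the ones digit is always written */
}
```

For an index below 10, offset 6 keeps whatever character was already there, just as the assembly version leaves the space untouched.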

Conclusion

In this post, we explored how to write a program on an x86_64 system that has the same logic as the AArch64 one from the previous post. Writing code that produces the same result on two different systems brings developers interesting challenges - we have to understand the different instruction sets and the way the instructions behave. Also, debugging on both systems is difficult, as we have to rely on either inspecting assembler error messages or using objdump to disassemble the generated machine code.

Exploring Assembler on the AArch64 Platform

Seung Woo (Paul) Ji — Mon, 21 Feb 2022 02:17:44 +0000

Introduction

In this post, we are going to investigate a simple code snippet that loops a few times on an AArch64 system.

Original Code

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

The code does not do anything special; it just loops the given maximum number of times (max = 30).

Let's improve this code a little bit and make it print a message for us.

Improved Code - Print Message

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop\n"
len=    . - msg

Result

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

This is much better than the original code since it prints something to the console. However, the message is not really meaningful to us. Why don't we make it print the loop number instead?

Improved Code - Print Loop Number

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

// Inserting digit
    add    x18, x19, '0'        /* Create a digit character by adding the ASCII value of '0' */
    adr    x17, msg+6           /* Pointer to the pound sign in the msg */
    strb   w18, [x17]           /* Store the digit in place of the pound sign */

// Print message
    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop: #\n"
len=    . - msg

Result

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

The code finally prints meaningful messages to the console. Note that we use the strb instruction instead of str because we only want to store a single character (1 byte), not a whole 64-bit (8-byte) register. The strb instruction takes its source as a 32-bit register, so we refer to the register as w18 rather than x18.

However, the code above only works for single-digit loop numbers. If the loop number is 10 or greater, the code starts printing non-numeric characters because the digit characters only occupy codes 48 to 57 in the ASCII table. To handle this, we need to add a few more lines of code.
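A tiny C sketch shows what goes wrong once the index reaches 10. The function name digit_char is hypothetical, and the sketch assumes an ASCII character set.

```c
/* Mirror of `add x18, x19, '0'`: offset a value from the character '0'
 * (hypothetical helper, assumes ASCII). Only 0..9 yield digit characters. */
char digit_char(unsigned value)
{
    return (char)('0' + value);
}
```

The value 9 maps to '9' (code 57), but 10 maps to code 58, which is the ':' character, and larger values continue through the punctuation that follows it.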

Improved Code - Print Two Digit Loop Number

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 15                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min
    mov     x20, 10

loop:

// Finding the tens digit
    udiv    x21, x19, x20       /* Divide by 10 */
    cmp     x21, 0
    b.eq    oneDigit            /* Skip inserting the tens digit if the quotient is equal to zero */

// Inserting the tens digit
    add     x18, x21, '0'       /* Create a digit character by adding the ASCII value of '0' */
    adr     x17, msg+6          /* Pointer to the tens position (the space before the pound sign) */
    strb    w18, [x17]          /* Store the tens digit at that position */

oneDigit:
// Finding the ones digit
    msub    x22, x20, x21, x19  /* Load x22 with the value of x19 - (x20 * x21) */

// Inserting the ones digit
    add     x18, x22, '0'       /* Create a digit character by adding the ASCII value of '0' */
    adr     x17, msg+7          /* Pointer to the pound sign in the msg */
    strb    w18, [x17]          /* Store the ones digit in place of the pound sign */

// Print message
    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop:  #\n"
len=    . - msg

Result

Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14

Let's walk through the code. First of all, we divide the loop index in x19 by 10. We use the quotient to fill in the tens digit. Since the udiv instruction only gives the quotient, we need another instruction to find the remainder, and msub is exactly what we need: it computes x19 - (x20 * x21). With the quotient and remainder in hand, we just need to write the digit characters into the message, but one important step remains: we have to suppress the leading zero when x19 is less than 10. For this, we use a label called oneDigit to skip inserting the tens digit if and only if the tens digit is equal to 0.
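The udiv and msub pair can be modelled in C as shown below. This is only a sketch with hypothetical function names; the zero result for a zero divisor reflects how the AArch64 udiv instruction behaves, which differs from C where dividing by zero is undefined.

```c
#include <stdint.h>

/* Model of `udiv Xd, Xn, Xm`: unsigned quotient, with a zero divisor
 * yielding zero as on AArch64 (hypothetical helper). */
uint64_t udiv_model(uint64_t n, uint64_t d)
{
    return d == 0 ? 0 : n / d;
}

/* Model of `msub Xd, Xn, Xm, Xa`: computes Xa - (Xn * Xm). */
uint64_t msub_model(uint64_t xn, uint64_t xm, uint64_t xa)
{
    return xa - xn * xm;
}

/* Remainder obtained the same way as in the assembly listing:
 * quotient first, then dividend - (divisor * quotient). */
uint64_t remainder_by_msub(uint64_t dividend, uint64_t divisor)
{
    uint64_t quotient = udiv_model(dividend, divisor);
    return msub_model(divisor, quotient, dividend);
}
```

For a loop index of 14, the quotient is 1 and the remainder is 14 - (10 * 1) = 4, which gives the two digits printed on the last line of the output.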

Conclusion

In this blog post, we learned how to write a small code snippet that prints the loop number to the screen. It is interesting to see that AArch64 assembly works in a way that is strikingly similar to 6502 assembly. In the next post, we will investigate the same code snippet on another popular modern system, x86_64.

A Simple Maze Game using 6502 Emulator Part 2

Seung Woo (Paul) Ji — Mon, 14 Feb 2022 03:03:53 +0000

Introduction

In the last post, we created simple 6502 assembly code that generates a maze for a player to explore. Today, we are going to build on top of it in order to implement the rest of the objectives.

Objectives

For this game, we need to accomplish 3 more objectives as follows:

~~1. The game must draw the maze in the bitmapped screen.~~ (Done!)

2. A player must be able to use the keyboard to control movement.
3. A player must find a route to reach the goal within the maze in order to win the game.
4. A player cannot go through the walls.

Code

; zero-page variables
define  ROW     $20 ; current row
define  COL     $21 ; current column
define  DRAWN_ROW   $22 ; number of drawn rows
define  MAZE_L      $14 ; a pointer that points to where the maze will 
define  MAZE_H      $15 ; be drawn
define  PLAYER_L    $10 ; a pointer that points to the player in the 
                ; screen
define  PLAYER_H    $11
define  TARGET_L    $12 ; a pointer that points to the target position
define  TARGET_H    $13 ; where the player wants to proceed

; constants
define  PATH        $03 ; path color
define  PLAYER      $0e ; player color
define  HEIGHT      7   ; height of the maze 
define  WIDTH       7   ; width of the maze

; ROM routine
define  SCINIT      $ff81 ; initialize/clear screen

        jsr printHelp
        jsr drawMaze
        jsr gameInit
        jsr gameLoop

printHelp:  ldy #$00    ; print instructions on the screen
pHelpLoop:  lda help,y
        beq done
        sta $f000,y
        iny
        bne pHelpLoop

gameInit:   lda #$01    ; initialize ROW, COL to make the player 
        sta ROW     ; starting at $0221 of the screen
        sta COL
        rts

gameLoop:   jsr updatePosition
        jsr getkey
        jsr checkCollision
        ldx #$00    ; clear out the key buffer
        stx $ff
        jmp gameLoop

updatePosition: ldy ROW     ; load PLAYER pointer with ROW 
        lda table_low,y
        sta PLAYER_L
        lda table_high,y
        sta PLAYER_H

        ldy COL     ; place the player at (POINTER + COL)
        lda #PLAYER
        sta (PLAYER_L),y
        rts

getkey:     lda $ff     ; get the input key

        cmp #$80    ; allow arrow keys only
        bmi getkey
        cmp #$84
        bpl getkey

        pha     ; save the accumulator
        lda #PATH   ; set color of the current position to PATH
        sta (PLAYER_L),y
        pla     ; restore accumulator

        cmp #$80    ; check key is up
        bne checkRight

        dec ROW     ; ... if yes, decrement ROW
        rts

checkRight: cmp #$81    ; check if key is right
        bne checkDown
        inc COL     ; ... if yes, increment COL
        rts

checkDown:  cmp #$82    ; check if key is down
        bne checkLeft
        inc ROW     ; ... if yes, increment ROW
        rts

checkLeft:  cmp #$83    ; check if key is left
        bne done
        dec COL     ; ... if yes, decrement COL
        rts

done:       rts     ; break out of a loop or subroutine

checkCollision: ldy ROW     ; load TARGET pointer with ROW 
        lda table_low,y
        sta TARGET_L
        lda table_high,y
        sta TARGET_H

        ldy COL     ; load the color from the target
        lda (TARGET_L),y; at (POINTER + COL)

        cmp #$01
        beq done
        cmp #$03
        beq done
        cmp #$0a
        beq gameComplete

        lda #$00
        sta (TARGET_L),y

        lda $ff
        cmp #$80    ; if input key was up...
        bne ifRight

        inc ROW     ; ... if yes, increment ROW
        rts

ifRight:    cmp #$81    ; if input key was right...
        bne ifDown

        dec COL     ; ... if yes, decrement COL
        rts

ifDown:     cmp #$82    ; if input key was down...
        bne ifLeft

        dec ROW     ; ... if yes, decrement ROW
        rts

ifLeft:     cmp #$83    ; if input key was left...
        bne done

        inc COL     ; ... if yes, increment COL
        rts

gameComplete:   jsr SCINIT
        ldy #$00    ; print game completion message on the screen 
pGameComplete:  lda complete,y
        beq done
        sta $f000,y
        iny
        bne pGameComplete
        brk

drawMaze:   lda #$21    ; a pointer to the top-left pixel of the
        sta MAZE_L  ; maze on the screen ($0221)
        lda #$02
        sta MAZE_H

        lda #$00    ; number of drawn rows
        sta DRAWN_ROW

        ldx #$00    ; maze data index
        ldy #$00    ; column index

draw:       lda maze_data,x
        sta (MAZE_L), y
        inx
        iny
        cpy #WIDTH  ; compare with the number of WIDTH
        bne draw    ; if not, keep drawing the column

        inc DRAWN_ROW   ; increment the number of row
        lda #HEIGHT
        cmp DRAWN_ROW   ; compare with the number of HEIGHT
        beq done

        lda MAZE_L
        clc
        adc #$20    ; add 32(0x0020) to increment the row
        sta MAZE_L  ; of the pixel
        lda MAZE_H
        adc #$00
        sta MAZE_H

        ldy #$00    ; reset the column index for the new row
        beq draw            

; help text message
help:
dcb "P","l","a","y",32,"w","i","t","h",32,"a","r","r","o","w"
dcb 32,"k","e","y","s",32,"t","o",32,"c","o","n","t","r","o","l",10
dcb 00

; game complete message
complete:
dcb "Y","o","u",32,"b","e","a","t",32
dcb "t","h","e",32,"g","a","m","e","!"
dcb 00

; maze map data
maze_data:
dcb 01,00,01,00,01,01,01
dcb 01,01,01,00,00,00,01
dcb 00,00,01,00,01,00,01
dcb 01,00,01,00,01,01,01
dcb 01,00,01,00,01,00,01
dcb 01,00,01,00,01,00,01
dcb 01,01,01,01,01,00,10

; these two tables contain the high and low bytes
; of the addresses of the start of each row
table_high:
dcb $02,$02,$02,$02,$02,$02,$02,$02
dcb $03,$03,$03,$03,$03,$03,$03,$03
dcb $04,$04,$04,$04,$04,$04,$04,$04
dcb $05,$05,$05,$05,$05,$05,$05,$05

table_low:
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0

Let's walk through the code together. First of all, we print the help instructions on the text screen with the printHelp subroutine. Then, we draw a maze using the drawMaze subroutine we created in the last post. With a maze on the screen for the player to explore, we set the initial game state by placing the player at screen address $0221 (row 1, column 1). After that, we call the gameLoop subroutine, which loops forever.
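The relationship between ROW, COL and a screen address such as $0221 can be sketched in C. The names below are hypothetical, and the sketch assumes the layout implied by the listing: a bitmapped display starting at $0200 with 32 bytes per row, which is what table_low and table_high encode.

```c
#include <stdint.h>

#define SCREEN_BASE 0x0200u   /* assumed start of the bitmapped display */
#define ROW_BYTES   32u       /* assumed number of pixels (bytes) per row */

/* Address of the first pixel in a row, i.e. what the two tables store. */
uint16_t row_start(unsigned row)
{
    return (uint16_t)(SCREEN_BASE + row * ROW_BYTES);
}

/* Low and high bytes of an address, as kept in table_low / table_high. */
uint8_t low_byte(uint16_t address)  { return (uint8_t)(address & 0xff); }
uint8_t high_byte(uint16_t address) { return (uint8_t)(address >> 8); }

/* Address of the pixel at (row, col): row start plus the column offset,
 * which is what the (PLAYER_L),y addressing computes. */
uint16_t pixel_address(unsigned row, unsigned col)
{
    return (uint16_t)(row_start(row) + col);
}
```

With ROW and COL both set to 1, the address works out to $0200 + 32 + 1 = $0221, the starting position of the player.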

gameLoop itself consists of a number of subroutine calls. The first one is updatePosition. This subroutine loads the player pointer from the current row and column so that we can place the player on the screen. Afterwards, we call the getkey subroutine to receive the player's input from the keyboard. We only accept arrow keys. Once we receive a key, we update the row or column accordingly. Then, the checkCollision subroutine checks whether the position the player wants to move to is a wall. If the player hits a wall, we simply undo the move.
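The move and collision logic can be summarised with a rough C sketch. It is not a translation of the 6502 code: all names are hypothetical, the bounds check is an addition for safety, and the sketch checks the target before moving rather than moving first and undoing, which gives the same outcome. The colour values are the ones used in the listing ($01 for open path, $03 for visited path, $0a for the goal).

```c
/* Colour codes taken from the listing. */
enum { MAZE_WALL = 0x00, MAZE_OPEN = 0x01, MAZE_VISITED = 0x03, MAZE_GOAL = 0x0a };

enum move_result { MOVED, BLOCKED, WON };

struct position { int row; int col; };

/* Try to move the player by (d_row, d_col) on a 32x32 screen of colour
 * values. The position only changes when the target is walkable. */
enum move_result try_move(unsigned char screen[32][32], struct position *player,
                          int d_row, int d_col)
{
    int row = player->row + d_row;
    int col = player->col + d_col;

    /* Added safety check: never look outside the screen. */
    if (row < 0 || row >= 32 || col < 0 || col >= 32)
        return BLOCKED;

    unsigned char colour = screen[row][col];
    if (colour != MAZE_OPEN && colour != MAZE_VISITED && colour != MAZE_GOAL)
        return BLOCKED;              /* wall: the player stays where it was */

    player->row = row;
    player->col = col;
    return colour == MAZE_GOAL ? WON : MOVED;
}
```

Pressing the right arrow corresponds to try_move(screen, &player, 0, 1), the up arrow to try_move(screen, &player, -1, 0), and so on.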

Once the player reaches the goal, the screen congratulates the player with a text message.

Conclusion

Making a simple maze game in 6502 assembly language is definitely harder and more time-consuming than in a high-level language. The game we explored together is also not polished and needs a lot of improvements (such as an alert when the player tries to walk into a wall). Yet, this experience gives us meaningful insight into how a game really works under the hood.