<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Seung Woo (Paul) Ji</title>
    <description>The latest articles on DEV Community by Seung Woo (Paul) Ji (@seungwooji).</description>
    <link>https://dev.to/seungwooji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F795176%2F778ce5e8-05a6-45b9-b1aa-5c727422efce.jpeg</url>
      <title>DEV Community: Seung Woo (Paul) Ji</title>
      <link>https://dev.to/seungwooji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seungwooji"/>
    <language>en</language>
    <item>
      <title>SVE2 Implementation for Opus Codec Library Analysis</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Fri, 22 Apr 2022 22:44:05 +0000</pubDate>
      <link>https://dev.to/seungwooji/sve2-implementation-for-opus-codec-library-analysis-3pkf</link>
      <guid>https://dev.to/seungwooji/sve2-implementation-for-opus-codec-library-analysis-3pkf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-3-59dg"&gt;Previously&lt;/a&gt;, we successfully implemented SVE2 into &lt;strong&gt;Opus codec library&lt;/strong&gt; by utilizing auto-vectorization method. In this post, we will analyze the result to further test if the SVE2 code is implemented correctly and determine its possible impact on the software's performance. &lt;/p&gt;

&lt;h2&gt;
  
  
  SVE2 Code Analysis
&lt;/h2&gt;

&lt;p&gt;As we explored in the previous post, the compiler auto-vectorized many parts of the package. Let's take a look at one of them to see where SVE2 code is used. &lt;/p&gt;

&lt;p&gt;Opus Codec utilizes &lt;a href="https://en.wikipedia.org/wiki/CELT"&gt;Celt&lt;/a&gt; as one of ways to encode and decode audio source. In &lt;code&gt;opus/celt&lt;/code&gt;, we can see the following list of files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls
arch.h           celt.o         entenc.o          mdct.c              quant_bands.lo
arm              cpu_support.h  fixed_c5x.h       mdct.h              quant_bands.o
bands.c          cwrs.c         fixed_c6x.h       mdct.lo             rate.c
bands.h          cwrs.h         fixed_debug.h     mdct.o              rate.h
bands.lo         cwrs.lo        fixed_generic.h   meson.build         rate.lo
bands.o          cwrs.o         float_cast.h      mfrngcod.h          rate.o
celt.c           dump_modes     kiss_fft.c        mips                stack_alloc.h
celt_decoder.c   ecintrin.h     _kiss_fft_guts.h  modes.c             static_modes_fixed_arm_ne10.h
celt_decoder.lo  entcode.c      kiss_fft.h        modes.h             static_modes_fixed.h
celt_decoder.o   entcode.h      kiss_fft.lo       modes.lo            static_modes_float_arm_ne10.h
celt_encoder.c   entcode.lo     kiss_fft.o        modes.o             static_modes_float.h
celt_encoder.lo  entcode.o      laplace.c         opus_custom_demo.c  tests
celt_encoder.o   entdec.c       laplace.h         os_support.h        vq.c
celt.h           entdec.h       laplace.lo        pitch.c             vq.h
celt.lo          entdec.lo      laplace.o         pitch.h             vq.lo
celt_lpc.c       entdec.o       mathops.c         pitch.lo            vq.o
celt_lpc.h       entenc.c       mathops.h         pitch.o             x86
celt_lpc.lo      entenc.h       mathops.lo        quant_bands.c
celt_lpc.o       entenc.lo      mathops.o         quant_bands.h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;celt_encoder.c&lt;/code&gt; file, we can see that it contains many &lt;code&gt;for loops&lt;/code&gt; that may benefit from &lt;code&gt;SVE2&lt;/code&gt; implementation. The following code example is one of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// celt_encode.c&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="mi"&gt;1100&lt;/span&gt;       &lt;span class="cm"&gt;/* For non-transient CBR/CVBR frames, halve the dynalloc contribution */&lt;/span&gt;
&lt;span class="mi"&gt;1101&lt;/span&gt;       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;vbr&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;constrained_vbr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;!&lt;/span&gt;&lt;span class="n"&gt;isTransient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1102&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;1103&lt;/span&gt;          &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1104&lt;/span&gt;             &lt;span class="n"&gt;follower&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HALF16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;follower&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="mi"&gt;1105&lt;/span&gt;       &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="mi"&gt;1106&lt;/span&gt;       &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1107&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;1108&lt;/span&gt;          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1109&lt;/span&gt;             &lt;span class="n"&gt;follower&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="mi"&gt;1110&lt;/span&gt;          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1111&lt;/span&gt;             &lt;span class="n"&gt;follower&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HALF16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;follower&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code, we can see a loop that iterates from &lt;code&gt;start&lt;/code&gt; to &lt;code&gt;end&lt;/code&gt;. Depending on the value of &lt;code&gt;i&lt;/code&gt;, the &lt;code&gt;i&lt;/code&gt;th element of &lt;code&gt;follower&lt;/code&gt; array is either halved or multiplied by two. As we can see, this does not involve complex logic and process a large amount of data in the uniform manner and, therefore, this could be a good candidate to utilize the auto-vectorization by the compiler.&lt;/p&gt;

&lt;p&gt;And as we expected, the &lt;code&gt;celt_encoder.o&lt;/code&gt; contains multiple SVE-specific &lt;code&gt;whilelo&lt;/code&gt; instructions when we disassemble it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ objdump -d celt_encoder.o | grep whilelo
     174:       25a30fe0        whilelo p0.s, wzr, w3
     198:       25a30c00        whilelo p0.s, w0, w3
     1e8:       25b40fe0        whilelo p0.s, wzr, w20
     200:       25b40c00        whilelo p0.s, w0, w20
     418:       25bc0fe0        whilelo p0.s, wzr, w28
     430:       25bc0c00        whilelo p0.s, w0, w28
     498:       25bc0fe0        whilelo p0.s, wzr, w28
     4b0:       25bc0c20        whilelo p0.s, w1, w28
   # ...    
    57ac:       25a10c00        whilelo p0.s, w0, w1
    5844:       25a10fe0        whilelo p0.s, wzr, w1
    585c:       25a10c00        whilelo p0.s, w0, w1
    5ae0:       25a10fe0        whilelo p0.s, wzr, w1
    5b00:       25a10c00        whilelo p0.s, w0, w1
    5ea8:       25a10fe0        whilelo p0.s, wzr, w1
    5ebc:       25a10c00        whilelo p0.s, w0, w1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But, this only shows that &lt;code&gt;celt_encode&lt;/code&gt; have implemented SVE2 instruction. How can we know if the code that we are interested in utilizes SVE2? &lt;/p&gt;

&lt;p&gt;Let's look at this in a different angle - how the compiler can determine if the codes are suitable for auto-vectorization? For this, we can specify an additional option to enable feature when you generate &lt;code&gt;configure&lt;/code&gt; binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./configure CFLAGS="-g -O3 -fopt-info-vec-all -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes" 

$ make -j24 |&amp;amp; tee make.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;fopt-info&lt;/code&gt; generates additional log in the compiler output. We specifically asks for all information regarding to vectorization by using &lt;code&gt;vec-all&lt;/code&gt;. When we compile the package again using &lt;code&gt;make&lt;/code&gt;, this feature will tell us why (or why not) the compiler add SVE2 implementation. &lt;/p&gt;

&lt;p&gt;Once we run &lt;code&gt;make&lt;/code&gt; command as above, we have the following &lt;code&gt;make.log&lt;/code&gt; file that contains every information we want to know.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ll make.log
-rw-r--r--. 1 swji1 swji1 2831714 Apr 22 14:21 make.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's refine the result by only searching the logs that happened in the &lt;code&gt;celt&lt;/code&gt; directory as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep "celt/celt_encoder"
celt/celt_encoder.c:1810:22: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1778:40: missed: couldn't vectorize loop
celt/celt_encoder.c:1778:40: missed: not vectorized: number of iterations cannot be computed.
celt/celt_encoder.c:1756:17: missed: couldn't vectorize loop
celt/celt_encoder.c:1761:20: missed: not vectorized: complicated access pattern.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see which lines of the code are vectorized or not as above. Let's find if the code located at &lt;code&gt;line 1106&lt;/code&gt; that we have examined is vectorized as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep "celt/celt_encoder.c:1106"
celt/celt_encoder.c:1106:21: celt/pitch.h:143:14: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1106:21: optimized: loop vectorized using variable length vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we expected, the loop is vectorized by the compiler. &lt;/p&gt;

&lt;p&gt;Now, we may wonder what are the codes that the compiler cannot perform auto-vectorization and why? Let's take a look at one of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;celt/celt_encoder.c:1922:39: missed: not vectorized: complicated access pattern.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// celt_encoder.c&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
 &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;1915&lt;/span&gt;       &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;1916&lt;/span&gt;       &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;1917&lt;/span&gt;          &lt;span class="cm"&gt;/* When the energy is stable, slightly bias energy quantization towards
1918             the previous error to make the gain more stable (a constant offset is
1919             better than fluctuations). */&lt;/span&gt;
&lt;span class="mi"&gt;1920&lt;/span&gt;          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABS32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SUB32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bandLogE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nbEBands&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;oldBandE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nbEBands&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;QCONST16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DB_SHIFT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="mi"&gt;1921&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="mi"&gt;1922&lt;/span&gt;             &lt;span class="n"&gt;bandLogE&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nbEBands&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;MULT16_16_Q15&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;energyError&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;nbEBands&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;QCONST16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="mi"&gt;1923&lt;/span&gt;          &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="mi"&gt;1924&lt;/span&gt;       &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="mi"&gt;1925&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;if&lt;/code&gt; statement inside of the loop, we can see that each element of the arrays requires extensive calculations beforehand. For this reason, the compiler cannot vectorize the loop as it requires complex access pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Prediction
&lt;/h2&gt;

&lt;p&gt;Unfortunately, we cannot benchmark the performance of the package at the moment due to the lack of hardware that supports SVE2. However, we do know the SVE2 implementation would potentially improve the performance as it optimizes loops when processing large datasets like audio and video resources. For this reason, we can assume there is a positive correlation between the number of SVE2 instructions and the performance. &lt;/p&gt;

&lt;p&gt;Before we begin, we need to also consider that &lt;code&gt;opus&lt;/code&gt; package contains multiple unit tests that can potentially increase the total number. Thus, we have to be extra careful to exclude them. &lt;br&gt;
Let's count the total number of optimizations that are done by the compiler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ grep -v "test" make.log | grep "optimized" -c
632
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler managed to auto-vectorize a significant amount (632) of codes. Let's take a look at how many of SVE-specific &lt;code&gt;whilelo&lt;/code&gt; instruction and registers (i.e. predicate register and scalable vector register) are implemented in the executable &lt;code&gt;opus&lt;/code&gt; codec library, &lt;code&gt;libopus&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ objdump -d libopus.so.0.8.0 | grep whilelo -c
671
$ objdump -d libopus.so.0.8.0 | grep whilelo
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
# ...
   47b38:       25a50fe0        whilelo p0.s, wzr, w5
   47b3c:       25a80c23        whilelo p3.s, w1, w8
   47b4c:       25aa0c24        whilelo p4.s, w1, w10
   47b54:       25250c26        whilelo p6.b, w1, w5
   47b5c:       25a60c22        whilelo p2.s, w1, w6
   47b68:       25a50c25        whilelo p5.s, w1, w5
   47b98:       25a50c20        whilelo p0.s, w1, w5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]" -c
5274
$objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]"
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2ef8:       04a34801        index   z1.s, #0, w3
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2ef8:       04a34801        index   z1.s, #0, w3
    2f0c:       25814420        mov     p0.b, p1.b
    2f18:       856140a0        ld1w    {z0.s}, p0/z, [x5, z1.s, sxtw #2]
    2f1c:       e54340c0        st1w    {z0.s}, p0, [x6, x3, lsl #2]
    2f28:       25a40c60        whilelo p0.s, w3, w4
# ...
   47f48:       6594a000        scvtf   z0.s, p0/m, z0.s
   47f4c:       25886100        mov     p0.b, p8.b
   47f50:       e544e4a2        st1w    {z2.s}, p1, [x5, #4, mul vl]
   47f54:       e546e0a1        st1w    {z1.s}, p0, [x5, #6, mul vl]
   47f58:       25896520        mov     p0.b, p9.b
   47f5c:       e547e0a0        st1w    {z0.s}, p0, [x5, #7, mul vl]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, there are substantial amount of SVE2 specific codes that are implemented by the auto-vectorization. Therefore, we can suspect that the &lt;code&gt;opus&lt;/code&gt; library may benefit from it to increase the overall performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that Can Further Improve the Performance
&lt;/h2&gt;

&lt;p&gt;We already know the compiler auto-vectorize a large portion of the codes. But, we have to admit there is a limit to this method. As we already found before, the compiler cannot auto-vectorize some codes. However, this does not mean they cannot be vectorized. In some cases, we may find places where SVE2 implementation could take place if the loop is written differently. For example, as &lt;a href="https://locklessinc.com/articles/vectorize/"&gt;this article&lt;/a&gt; suggested, we may use &lt;code&gt;restrict&lt;/code&gt; qualifiers to inform the compiler that there is no array overlaps. &lt;/p&gt;

&lt;h2&gt;
  
  
  Original and SVE2 Implementation Comparison
&lt;/h2&gt;

&lt;p&gt;Now, we know SVE2 implementation is successfully performed by the auto-vectorization. However, this is meaningless if the SVE2-improved library does not generate the same result as the original library. For this, let's examine if the improved version of the program works as well as the original version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# original file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1498808 Apr 13 20:16 libopus.so.0.8.0

# SVE2 implemented file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1684704 Apr 22 14:21 libopus.so.0.8.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SVE2 implemented version has a little bit larger in size (~0.2 MiB) but does not show a significant change. &lt;/p&gt;

&lt;p&gt;Let's run the unit tests that are provided by the package authors. As we know from the previous post, we have to execute them using &lt;code&gt;qemu-aarch64&lt;/code&gt; command to run the emulation. But, unlike previous post, we will run several unit tests to see if the SVE2 code works correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically
Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)
# ...

Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./test_opus_decode
Testing libopus 1.3.1-107-gccaaffa9-dirty decoder. Random seed: 2918850151 (76BD)
  Starting 10 decoders...
    opus_decoder_create(48000,1) OK. Copy OK.
    opus_decoder_create(48000,2) OK. Copy OK.
    opus_decoder_create(24000,1) OK. Copy OK.
    opus_decoder_create(24000,2) OK. Copy OK.
    opus_decoder_create(16000,1) OK. Copy OK.
    opus_decoder_create(16000,2) OK. Copy OK.
    opus_decoder_create(12000,1) OK. Copy OK.
    opus_decoder_create(12000,2) OK. Copy OK.
    opus_decoder_create( 8000,1) OK. Copy OK.
    opus_decoder_create( 8000,2) OK. Copy OK.
  dec[all] initial frame PLC OK.
  dec[all] all 2-byte prefix for length 3 and PLC, all modes (64) OK.
  dec[  5] all 3-byte prefix for length 4, mode 28 OK.
  dec[  0] all 3-byte prefix for length 4, mode  4 OK.
  dec[all] random packets, all modes (64), every 8th size from from 7 bytes to maximum OK.
  dec[all] random packets, all mode pairs (4096), 145 bytes/frame OK.
  dec[  3] random packets, all mode pairs (4096)*10, 81 bytes/frame OK.
  dec[  0] pre-selected random packets OK.
  Decoders stopped.
  Testing opus_pcm_soft_clip... OK.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 2953257216 (421F)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode     LP FB encode  VBR,   9119 bps OK.
    Mode     LP FB encode  VBR,  13234 bps OK.
    Mode     LP FB encode  VBR,  64668 bps OK.
    Mode Hybrid FB encode  VBR,  28306 bps OK.
    Mode Hybrid FB encode  VBR,  54852 bps OK.
    Mode Hybrid FB encode  VBR,  55130 bps OK.
    Mode Hybrid FB encode  VBR,  96362 bps OK.
    Mode   MDCT FB encode  VBR, 893620 bps OK.
    Mode   MDCT FB encode  VBR,  25608 bps OK.
    Mode   MDCT FB encode  VBR,  29011 bps OK.
    Mode   MDCT FB encode  VBR,  93628 bps OK.
    Mode   MDCT FB encode  VBR,  93328 bps OK.
    Mode   MDCT FB encode  VBR, 160982 bps OK.
# ...
    Mode     LP NB dual-mono MS encode  CBR,  21883 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  60566 bps OK.
    Mode     LP NB dual-mono MS encode  CBR,  76774 bps OK.
    Mode     LP NB dual-mono MS encode  CBR, 167879 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   6953 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12756 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  60193 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14915 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  16946 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  34028 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  86938 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 172977 bps OK.
    All framesize pairs switching encode, 9683 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the SVE2 program passes all the unit tests to confirm that it works as well as the original program. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we found that the compiler successfully vectorized the codes and there would be a significant improvement in the performance considering the substantial amount of SVE2-specific instructions and registers. We also checked that SVE2 does not break the program and run as well as the original program. These findings suggest that the authors of &lt;code&gt;opus&lt;/code&gt; package may greatly benefit from the vectorization of the codes when SVE2 become publicly available in the near future.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing SVE2 for Opus Codec Library Part 3: Auto-vectorization</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Wed, 13 Apr 2022 22:53:11 +0000</pubDate>
      <link>https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-3-59dg</link>
      <guid>https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-3-59dg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-2-4a81"&gt;Previously&lt;/a&gt;, we tried to implement SVE2 by using the existing codes that are written in NEON instructions. Unfortunately, the result was not so fruitful. Instead, in this post, we will try to utilize the auto-vectorization method in order to add SVE2 instructions. &lt;/p&gt;

&lt;h2&gt;
  
  
  Before We Start
&lt;/h2&gt;

&lt;p&gt;As we know, Opus package already supports NEON intrinsics. When we run the &lt;code&gt;./configure&lt;/code&gt;, we can see that the script automatically detects that the existing processor supports ARM NEON intrinsics optimizations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... ARM (NEON) (NEON Aarch64)
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, NEON intrinsic may conflict with auto-vectorizations that are done by the compiler. For this reason, we have to look for a way to disable this feature in the first place.&lt;/p&gt;

&lt;p&gt;Thankfully, we can easily achieve this by editing the &lt;code&gt;configure.ac&lt;/code&gt; file. When we search for &lt;code&gt;neon&lt;/code&gt; keyword, we can find the following codes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;AS_IF&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="nb"&gt;test &lt;/span&gt;x&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$enable_intrinsics&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; x&lt;span class="s2"&gt;"yes"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,[
   &lt;span class="nv"&gt;intrinsics_support&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
   AS_CASE&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="nv"&gt;$host_cpu&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
   &lt;span class="o"&gt;[&lt;/span&gt;arm&lt;span class="k"&gt;*&lt;/span&gt;|aarch64&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
   &lt;span class="o"&gt;[&lt;/span&gt;
      &lt;span class="nv"&gt;cpu_arm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes
      &lt;/span&gt;OPUS_CHECK_INTRINSICS&lt;span class="o"&gt;(&lt;/span&gt;
         &lt;span class="o"&gt;[&lt;/span&gt;ARM Neon],
         &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$ARM_NEON_INTR_CFLAGS&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
         &lt;span class="o"&gt;[&lt;/span&gt;OPUS_ARM_MAY_HAVE_NEON_INTR],
         &lt;span class="o"&gt;[&lt;/span&gt;OPUS_ARM_PRESUME_NEON_INTR],
         &lt;span class="o"&gt;[[&lt;/span&gt;&lt;span class="c"&gt;#include &amp;lt;arm_neon.h&amp;gt;&lt;/span&gt;
         &lt;span class="o"&gt;]]&lt;/span&gt;,
         &lt;span class="o"&gt;[[&lt;/span&gt;
            static float32x4_t A0, A1, SUMM&lt;span class="p"&gt;;&lt;/span&gt;
            SUMM &lt;span class="o"&gt;=&lt;/span&gt; vmlaq_f32&lt;span class="o"&gt;(&lt;/span&gt;SUMM, A0, A1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;int&lt;span class="o"&gt;)&lt;/span&gt;vgetq_lane_f32&lt;span class="o"&gt;(&lt;/span&gt;SUMM, 0&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
         &lt;span class="o"&gt;]]&lt;/span&gt;
      &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code checks if the &lt;code&gt;enable_intrinsics&lt;/code&gt; is true or not. By changing &lt;code&gt;x"yes"&lt;/code&gt; to &lt;code&gt;x""&lt;/code&gt;, we can disable the intrinsic configuration. When we rerun &lt;code&gt;autogen.sh&lt;/code&gt; and &lt;code&gt;configure&lt;/code&gt; scripts, we can confirm that the intrinsic optimizations are disabled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./autogen.sh
$ ./configure
opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

      C99 var arrays: ................ yes
      C99 lrintf: .................... yes
      Use alloca: .................... no (using var arrays)

    General configuration:

      Floating point support: ........ yes
      Fast float approximations: ..... no
      Fixed point debugging: ......... no
      Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
      External Assembly Optimizations:
      Intrinsics Optimizations: ...... no
      Run-time CPU detection: ........ no
      Custom modes: .................. no
      Assertion checking: ............ no
      Hardening: ..................... yes
      Fuzzing: ....................... no
      Check ASM: ..................... no

      API documentation: ............. yes
      Extra programs: ................ yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Auto-vectorization Implementation
&lt;/h2&gt;

&lt;p&gt;Before we begin, we need to check what what compiler flags are being used by the package. To do this, we can look at the &lt;code&gt;Makefile&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Makefile&lt;/span&gt;
CFLAGS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nt"&gt;-O2&lt;/span&gt; &lt;span class="nt"&gt;-fvisibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hidden &lt;span class="nt"&gt;-D_FORTIFY_SOURCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;-W&lt;/span&gt; &lt;span class="nt"&gt;-Wall&lt;/span&gt; &lt;span class="nt"&gt;-Wextra&lt;/span&gt; &lt;span class="nt"&gt;-Wcast-align&lt;/span&gt; &lt;span class="nt"&gt;-Wnested-externs&lt;/span&gt; &lt;span class="nt"&gt;-Wshadow&lt;/span&gt; &lt;span class="nt"&gt;-Wstrict-prototypes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to enable the auto-vectorization, we need to edit the &lt;code&gt;CFLAGS&lt;/code&gt;. However, this is not-so-easy job to do because there are multiple &lt;code&gt;Makefile&lt;/code&gt; throughout the package. That is, we have to edit (possibly) all the following files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find . -name Makefile
./doc/Makefile
./doc/latex/Makefile
./Makefile
./celt/dump_modes/Makefile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fortunately, we can avoid this problem by overriding the &lt;code&gt;configure&lt;/code&gt; script that is generated for us. For this, we use the existing &lt;code&gt;CFLAGS&lt;/code&gt; and modify it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./configure CFLAGS="-g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We change the optimization level to &lt;code&gt;03&lt;/code&gt; to enable the auto-vectorization and specify the machine architecture to be ArmV8 with SVE2 extension. Once we run it, we can check that the &lt;code&gt;CFLAGS&lt;/code&gt; are updated successfully in the &lt;code&gt;Makefile&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Makefile&lt;/span&gt;
CFLAGS &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nt"&gt;-O3&lt;/span&gt; &lt;span class="nt"&gt;-march&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;armv8-a+sve2 &lt;span class="nt"&gt;-fvisibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hidden &lt;span class="nt"&gt;-D_FORTIFY_SOURCE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;-W&lt;/span&gt; &lt;span class="nt"&gt;-Wall&lt;/span&gt; &lt;span class="nt"&gt;-Wextra&lt;/span&gt; &lt;span class="nt"&gt;-Wcast-align&lt;/span&gt; &lt;span class="nt"&gt;-Wnested-externs&lt;/span&gt; &lt;span class="nt"&gt;-Wshadow&lt;/span&gt; &lt;span class="nt"&gt;-Wstrict-prototypes&lt;/span&gt; &lt;span class="nt"&gt;-fvisibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;hidden &lt;span class="nt"&gt;-W&lt;/span&gt; &lt;span class="nt"&gt;-Wall&lt;/span&gt; &lt;span class="nt"&gt;-Wextra&lt;/span&gt; &lt;span class="nt"&gt;-Wcast-align&lt;/span&gt; &lt;span class="nt"&gt;-Wnested-externs&lt;/span&gt; &lt;span class="nt"&gt;-Wshadow&lt;/span&gt; &lt;span class="nt"&gt;-Wstrict-prototypes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's try to run the unit tests to see whether the auto-vectorization kicked in during the compilation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ make check

# ...

./test-driver: line 107: 1431013 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 1431028 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: tests/test_opus_api

# ...

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  10
# XPASS: 0
# ERROR: 0
============================================================================
See ./test-suite.log
Please report to opus@xiph.org
============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is expected because we are trying to execute a binary that is coded with SVE2 instructions and the existing hardware does not yet support them. To solve this, we need to run the emulation by using &lt;code&gt;qemu-aarch64&lt;/code&gt; command. Let's run the command using one of the unit test that has failed - &lt;code&gt;test_opus_api&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ qemu-aarch64 test_opus_api
Error while loading test_opus_api: Exec format error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interestingly, the command cannot run because the test file format is invalid. Let's take a look at the content of the test file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ vi test_opus_api
#! /bin/sh

# tests/test_opus_api - temporary wrapper script for .libs/test_opus_api
# Generated by libtool (GNU libtool) 2.4.6
#
# The tests/test_opus_api program cannot be directly executed until all the libtool
# libraries that it depends on are installed.
#
# This wrapper script should never be moved out of the build directory.
# If it is, it will not operate correctly.

# Sed substitution that helps us do robust quoting.  It backslashifies
# metacharacters that are still active within double-quoted strings.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test file is actually a wrapper script and that is why the &lt;code&gt;qemu-aarch64&lt;/code&gt; cannot run. We can easily solve this by inserting the command when the script launches the actual program. &lt;/p&gt;

&lt;p&gt;Let's look for a code where the script starts the program and add the &lt;code&gt;qemu-aarch64&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Core function for launching the target application&lt;/span&gt;
func_exec_program_core &lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$lt_option_debug&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nv"&gt;$ECHO&lt;/span&gt; &lt;span class="s2"&gt;"test_opus_api:tests/test_opus_api:&lt;/span&gt;&lt;span class="nv"&gt;$LINENO&lt;/span&gt;&lt;span class="s2"&gt;: newargv[0]: &lt;/span&gt;&lt;span class="nv"&gt;$progdir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$program&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 1&amp;gt;&amp;amp;2
        func_lt_dump_args &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="p"&gt;+&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; 1&amp;gt;&amp;amp;2
      &lt;span class="k"&gt;fi
      &lt;/span&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;qemu-aarch64 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$progdir&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$program&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="p"&gt;+&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

      &lt;span class="nv"&gt;$ECHO&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt;: cannot exec &lt;/span&gt;&lt;span class="nv"&gt;$program&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$*&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 1&amp;gt;&amp;amp;2
      &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we rerun the unit test, we can see that all tests passed without any problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically

  Decoder basic API tests
  ---------------------------------------------------
    opus_decoder_get_size(0)=0 ................... OK.
    opus_decoder_get_size(1)=18228 ............... OK.
    opus_decoder_get_size(2)=26996 ............... OK.
    opus_decoder_get_size(3)=0 ................... OK.
    opus_decoder_create() ........................ OK.
    opus_decoder_init() .......................... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_GET_PITCH ............................... OK.
    OPUS_GET_LAST_PACKET_DURATION ................ OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_RESET_STATE ............................. OK.
    opus_{packet,decoder}_get_nb_samples() ....... OK.
    opus_packet_get_nb_frames() .................. OK.
    opus_packet_get_bandwidth() .................. OK.
    opus_packet_get_samples_per_frame() .......... OK.
    opus_decode() ................................ OK.
    opus_decode_float() .......................... OK.
                   All decoder interface tests passed
                             (1219433 API invocations)

  Multistream decoder basic API tests
  ---------------------------------------------------
    opus_multistream_decoder_get_size(-1,-1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 0)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 1)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 2)=0 ... OK.
    opus_multistream_decoder_get_size(-1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 0,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 0)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 1)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 0, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 1,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 0)=18504 OK.
    opus_multistream_decoder_get_size( 1, 1)=27272 OK.
    opus_multistream_decoder_get_size( 1, 2)=0 ... OK.
    opus_multistream_decoder_get_size( 1, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 2,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 2, 0)=36736 OK.
    opus_multistream_decoder_get_size( 2, 1)=45504 OK.
    opus_multistream_decoder_get_size( 2, 2)=54272 OK.
    opus_multistream_decoder_get_size( 2, 3)=0 ... OK.
    opus_multistream_decoder_get_size( 3,-1)=0 ... OK.
    opus_multistream_decoder_get_size( 3, 0)=54968 OK.
    opus_multistream_decoder_get_size( 3, 1)=63736 OK.
    opus_multistream_decoder_get_size( 3, 2)=72504 OK.
    opus_multistream_decoder_get_size( 3, 3)=81272 OK.
    opus_multistream_decoder_create() ............ OK.
    opus_multistream_decoder_init() .............. OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_MULTISTREAM_GET_DECODER_STATE ........... OK.
    OPUS_SET_GAIN ................................ OK.
    OPUS_GET_GAIN ................................ OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_multistream_decode() .................... OK.
    opus_multistream_decode_float() .............. OK.
       All multistream decoder interface tests passed
                             (576106 API invocations)

  Packet header parsing tests
  ---------------------------------------------------
    code 0 (65 cases) ............................ OK.
    code 1 (163456 cases) ........................ OK.
    code 2 (326528 cases) ........................ OK.
    code 3 m-truncation (64 cases) ............... OK.
    code 3 m=0,49-64 (4096 cases) ................ OK.
    code 3 m=1 CBR (81728 cases) ................. OK.
    code 3 m=1-48 CBR (103544448 cases) .......... OK.
    code 3 m=1-48 VBR (120832 cases) ............. OK.
    code 3 padding (1519448 cases) ............... OK.
    opus_packet_parse ............................ OK.
                      All packet parsing tests passed
                          (105760666 API invocations)

  Encoder basic API tests
  ---------------------------------------------------
    opus_encoder_get_size(0)=0 ................... OK.
    opus_encoder_get_size(1)=43572 ............... OK.
    opus_encoder_get_size(2)=48484 ............... OK.
    opus_encoder_get_size(3)=0 ................... OK.
    opus_encoder_create() ........................ OK.
    opus_encoder_init() .......................... OK.
    OPUS_GET_LOOKAHEAD ........................... OK.
    OPUS_GET_SAMPLE_RATE ......................... OK.
    OPUS_UNIMPLEMENTED ........................... OK.
    OPUS_SET_APPLICATION ......................... OK.
    OPUS_GET_APPLICATION ......................... OK.
    OPUS_SET_BITRATE ............................. OK.
    OPUS_GET_BITRATE ............................. OK.
    OPUS_SET_FORCE_CHANNELS ...................... OK.
    OPUS_GET_FORCE_CHANNELS ...................... OK.
    OPUS_SET_BANDWIDTH ........................... OK.
    OPUS_GET_BANDWIDTH ........................... OK.
    OPUS_SET_MAX_BANDWIDTH ....................... OK.
    OPUS_GET_MAX_BANDWIDTH ....................... OK.
    OPUS_SET_DTX ................................. OK.
    OPUS_GET_DTX ................................. OK.
    OPUS_SET_COMPLEXITY .......................... OK.
    OPUS_GET_COMPLEXITY .......................... OK.
    OPUS_SET_INBAND_FEC .......................... OK.
    OPUS_GET_INBAND_FEC .......................... OK.
    OPUS_SET_PACKET_LOSS_PERC .................... OK.
    OPUS_GET_PACKET_LOSS_PERC .................... OK.
    OPUS_SET_VBR ................................. OK.
    OPUS_GET_VBR ................................. OK.
    OPUS_SET_VBR_CONSTRAINT ...................... OK.
    OPUS_GET_VBR_CONSTRAINT ...................... OK.
    OPUS_SET_SIGNAL .............................. OK.
    OPUS_GET_SIGNAL .............................. OK.
    OPUS_SET_LSB_DEPTH ........................... OK.
    OPUS_GET_LSB_DEPTH ........................... OK.
    OPUS_SET_PREDICTION_DISABLED ................. OK.
    OPUS_GET_PREDICTION_DISABLED ................. OK.
    OPUS_SET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_EXPERT_FRAME_DURATION ............... OK.
    OPUS_GET_FINAL_RANGE ......................... OK.
    OPUS_RESET_STATE ............................. OK.
    opus_encode() ................................ OK.
    opus_encode_float() .......................... OK.
                   All encoder interface tests passed
                             (1152209 API invocations)

  Repacketizer tests
  ---------------------------------------------------
    opus_repacketizer_get_size()=496 ............. OK.
    opus_repacketizer_init ....................... OK.
    opus_repacketizer_create ..................... OK.
    opus_repacketizer_get_nb_frames .............. OK.
    opus_repacketizer_cat ........................ OK.
    opus_repacketizer_out ........................ OK.
    opus_repacketizer_out_range .................. OK.
    opus_packet_pad .............................. OK.
    opus_packet_unpad ............................ OK.
    opus_multistream_packet_pad .................. OK.
    opus_multistream_packet_unpad ................ OK.
                        All repacketizer tests passed
                            (6713561 API invocations)

  malloc() failure tests
  ---------------------------------------------------
    opus_decoder_create() ................... SKIPPED.
    opus_encoder_create() ................... SKIPPED.
    opus_repacketizer_create() .............. SKIPPED.
    opus_multistream_decoder_create() ....... SKIPPED.
    opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)

All API tests passed.
The libopus API was invoked 115421979 times.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we know the SVE2 implementation is successfully added to the program. Let's double-check this by looking for the presence of SVE2 specific instruction within the binary files. Using the following command, we can see the list of files with &lt;code&gt;whilelo&lt;/code&gt; instruction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2&amp;gt; /dev/null | grep whilelo ; done

#...
======= ./.libs/libopus.so.0.8.0
    2ef0:       25a40fe1        whilelo p1.s, wzr, w4
    2f28:       25a40c60        whilelo p0.s, w3, w4
    2f7c:       25a40fe1        whilelo p1.s, wzr, w4
    2fa0:       25a40c60        whilelo p0.s, w3, w4
    3314:       25b80fe0        whilelo p0.s, wzr, w24
    3344:       25b80c20        whilelo p0.s, w1, w24
    3784:       25b80fe0        whilelo p0.s, wzr, w24
    37a8:       25b80c00        whilelo p0.s, w0, w24
    39cc:       25b80fe0        whilelo p0.s, wzr, w24
    39f0:       25b80c00        whilelo p0.s, w0, w24
    3a5c:       25b80fe0        whilelo p0.s, wzr, w24
    3a80:       25b80c00        whilelo p0.s, w0, w24
    4488:       25b50fe2        whilelo p2.s, wzr, w21
    44fc:       25b50c00        whilelo p0.s, w0, w21
    4590:       25b50fe2        whilelo p2.s, wzr, w21
    45f8:       25b50c00        whilelo p0.s, w0, w21
    4780:       25a40fe2        whilelo p2.s, wzr, w4
    47e4:       25a40c00        whilelo p0.s, w0, w4
    4940:       25b80fe0        whilelo p0.s, wzr, w24
    4954:       25b80c00        whilelo p0.s, w0, w24
    4a2c:       25a40fe1        whilelo p1.s, wzr, w4
    4a50:       25a40c00        whilelo p0.s, w0, w4
    4b34:       25a40fe1        whilelo p1.s, wzr, w4
    4b68:       25a40c60        whilelo p0.s, w3, w4
    4e34:       25b30fe0        whilelo p0.s, wzr, w19
    4e54:       25b30c60        whilelo p0.s, w3, w19
    4fe4:       25b30fe0        whilelo p0.s, wzr, w19
    5000:       25b30c20        whilelo p0.s, w1, w19
    51d8:       25b30fe0        whilelo p0.s, wzr, w19
    5208:       25b30c00        whilelo p0.s, w0, w19
    54a0:       25a10fe0        whilelo p0.s, wzr, w1
    54b8:       25a10c00        whilelo p0.s, w0, w1
    55c8:       25a11fe0        whilelo p0.s, xzr, x1
    55e8:       25a11c00        whilelo p0.s, x0, x1
    5724:       25a11fe0        whilelo p0.s, xzr, x1
    5738:       25a11c00        whilelo p0.s, x0, x1
    5c2c:       25a80fe1        whilelo p1.s, wzr, w8
    5c7c:       25a80d21        whilelo p1.s, w9, w8
    5ec4:       25a50fe2        whilelo p2.s, wzr, w5
    5f30:       25a50c00        whilelo p0.s, w0, w5
    64dc:       25a10fe0        whilelo p0.s, wzr, w1
    6500:       25a10c00        whilelo p0.s, w0, w1
    6818:       25a11fe0        whilelo p0.s, xzr, x1
    6830:       25a11c00        whilelo p0.s, x0, x1
#...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when we count the total number of lines that use &lt;code&gt;whilelo&lt;/code&gt;, we get a total of 2903 lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find . -type f -executable | while read F ; do echo ======= $F ; objdump -d $F 2&amp;gt; /dev/null | grep whilelo ; done | wc -l
2903
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored and implemented SVE2 by using auto-vectorization of the compiler. Using the existing unit tests, we were able to identify if the auto-vectorization was added successfully. We also found that it added a significant number of SVE2 specific instruction, &lt;code&gt;whilelo&lt;/code&gt; (i.e. 2903 lines). This indicates that the Opus project may greatly benefit from SVE2 implementation.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing SVE2 for Opus Codec Library Part 2: Compiler Intrinsics</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Wed, 13 Apr 2022 21:30:08 +0000</pubDate>
      <link>https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-2-4a81</link>
      <guid>https://dev.to/seungwooji/implementing-sve2-for-opus-codec-library-part-2-4a81</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/implementing-sve2-for-opus-2i6l"&gt;In the last post&lt;/a&gt;, we explored how we can compile and test the package. From now on, we will explore how we can add SVE2 implementation to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding Candidates
&lt;/h2&gt;

&lt;p&gt;As we explored &lt;a href="https://dev.to/seungwooji/implementing-sve2-for-open-source-project-27h0"&gt;before&lt;/a&gt;, Opus contains a number of files that utilizes compiler intrinsics for SIMD implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Among these, we need to find a file with loops. Let's take a look at &lt;code&gt;celt_neon_intr.c&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;xcorr_kernel_neon_fixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;opus_val16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;opus_val16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opus_val32&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="cm"&gt;/* Load y[0...3] */&lt;/span&gt;
   &lt;span class="cm"&gt;/* This requires len&amp;gt;0 to always be valid (which we assert in the C code). */&lt;/span&gt;
   &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="cm"&gt;/* Load x[0...7] */&lt;/span&gt;
      &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;xx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_low_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_high_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="cm"&gt;/* Load y[4...11] */&lt;/span&gt;
      &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;yy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_low_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_high_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1_dup_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* load next x */&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1_dup_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="cm"&gt;/* load next y */&lt;/span&gt;
      &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="n"&gt;vst1q_s32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function uses multiple intrinsic extensions inside of the for loops which meet our expectation. Before we start implementing SVE2, we need to understand the code thoroughly. Let's walk through the code one by one.&lt;/p&gt;

&lt;p&gt;We can see the function takes in three arrays, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;sum&lt;/code&gt;. The &lt;code&gt;sum&lt;/code&gt; array is first loaded to the vector register with a tuple of 4 lanes that each has 32 bits in length. Since this code uses NEON to implement SIMD, it makes sense the total length of the vector register is limited to 128 bits in total. &lt;/p&gt;

&lt;p&gt;Then, the &lt;code&gt;y&lt;/code&gt; array is loaded to the vector with a tuple of 4 lanes in which each has 16 bits in length. These correspond to the first four elements in the &lt;code&gt;y&lt;/code&gt; array (i.e. y[0...3]). &lt;/p&gt;

&lt;p&gt;In the for loop, the &lt;code&gt;x&lt;/code&gt; array is loaded into the register. The vector first contains 8 lanes of 16 bits. These, in turn, are divided into two groups, x0, and x4. Ultimately these correspond to the first eight elements in the &lt;code&gt;x&lt;/code&gt; array (i.e. x[0...3]). &lt;/p&gt;

&lt;p&gt;The code repeats the previous steps for &lt;code&gt;y&lt;/code&gt; array. Since, we already assign a vector for the first four elements from the array, we start from the fifth element in the array. At the end, these correspond to the elements ranged from the eighth to the eleventh element (i.e. y[4...11]).&lt;/p&gt;

&lt;p&gt;To better understand what we have learned, we can make the following diagram:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x*(val16)   0    1    2    3    4    5    6    7
            |      x0      |    |      x4      |   

y*(val16)   0    1    2    3    4    5    6    7    8    9    10    11
            |      y0      |    |      y4      |    |      y8       | 

sum(val32)  0            1            2            3     
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first &lt;code&gt;vmlal_lane_s16&lt;/code&gt;, the intrinsic multiplies the first lane (0) of the &lt;code&gt;x0&lt;/code&gt; to each lane of &lt;code&gt;y0&lt;/code&gt;. The result is then accumulated to the destination vector where each element is twice as long as the elements that are multiplied (i.e. 16 bit -&amp;gt; 32 bit). This means we do the following operations between two arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x[0] * (y[0], y[1], y[2], y[3]) = (sum[0], sum[1], sum[2], sum[3])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We repeat the same operation as above but with &lt;code&gt;y4&lt;/code&gt; and &lt;code&gt;x4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, &lt;code&gt;vext_s16&lt;/code&gt; extract a vector from the y0 and  y4 pairs. This is done by extracting the lowest vector elements from &lt;code&gt;y4&lt;/code&gt; and the highest vector elements from &lt;code&gt;y0&lt;/code&gt; starting from the element of desired index (i.e. 1). This means we get the following vector as a result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y0    : y[0], y[1], y[2], y[3] // taking the highest vector starting from the index 1.
y4    : y[4], y[5], y[6], y[7] // filling up the result vector by taking the lowest vector
Result: y[1], y[2], y[3], y[4] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards, we do the same steps to keep multiplying and adding the rest of x and y elements. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;Unfortunately, the codes that we walked though together are not easy to translate into ones with SVE2 instructions. One of the reasons is because of the lack of SVE2 counterparts of the NEON instruction that are used. This makes sense considering that the SVE2 does not restrict the length of vector registers. In order to solve this, we have to rewrite the codes in such a way that no &lt;br&gt;
 tuple of vector lanes are used. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored and analyzed whether the intrinsic codes the package uses are good for implementing SVE2. Unfortunately, the codes are fairly complex and requires more NEON and SVE2 knowledges that are beyond the scope that we have covered in the previous posts. In the following post, we will look for an alternative method to implement SVE2 - that is, by using auto-vectorization.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing SVE2 for Opus Codec Library Part 1: Package Installation</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Mon, 11 Apr 2022 01:42:20 +0000</pubDate>
      <link>https://dev.to/seungwooji/implementing-sve2-for-opus-2i6l</link>
      <guid>https://dev.to/seungwooji/implementing-sve2-for-opus-2i6l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/implementing-sve2-for-open-source-project-27h0"&gt;Previously&lt;/a&gt;, we identified several packages that do not support SVE2 codes yet. We ultimately decided that &lt;a href="https://gitlab.xiph.org/xiph/opus"&gt;Opus Audio Codec&lt;/a&gt; is the best candidate. In this post, we will explore the package in detail and see how we can implement SVE2 into it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Before We Start...
&lt;/h2&gt;

&lt;p&gt;When we clone the package, we can see the following files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls
&lt;/span&gt;AUTHORS          cmake           LICENSE_PLEASE_READ.txt  meson.build        opus_sources.mk         silk             update_version
autogen.sh       CMakeLists.txt  m4                       meson_options.txt  opus-uninstalled.pc.in  silk_headers.mk  win32
celt             configure.ac    Makefile.am              NEWS               README                  silk_sources.mk
celt_headers.mk  COPYING         Makefile.mips            opus_headers.mk    README.draft            src
celt_sources.mk  doc             Makefile.unix            opus.m4            releases.sha2           tests
ChangeLog        include         meson                    opus.pc.in         scripts                 training
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the packages contains several &lt;code&gt;Makefile&lt;/code&gt; and configure template files. These gives us an idea that this package may use the GNU Autotools to generate &lt;code&gt;Makefile&lt;/code&gt; and configure scripts. To have a clear understanding of how we can install this package, we can read the &lt;code&gt;README&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# README
# ...

1) Clone the repository:

    % git clone https://gitlab.xiph.org/xiph/opus.git
    % cd opus

2) Compiling the source

    % ./autogen.sh
    % ./configure
    % make

3) Install the codec libraries (optional)

    % sudo make install

Once you have compiled the codec, there will be a opus_demo executable
in the top directory.

Usage: opus_demo [-e] &amp;lt;application&amp;gt; &amp;lt;sampling rate (Hz)&amp;gt; &amp;lt;channels (1/2)&amp;gt;
         &amp;lt;bits per second&amp;gt; [options] &amp;lt;input&amp;gt; &amp;lt;output&amp;gt;
       opus_demo -d &amp;lt;sampling rate (Hz)&amp;gt; &amp;lt;channels (1/2)&amp;gt; [options]
         &amp;lt;input&amp;gt; &amp;lt;output&amp;gt;

# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's follow the instruction. Once we run the &lt;code&gt;./autogen.sh&lt;/code&gt;, we get the following list of files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./autogen.sh
$ ls
aclocal.m4       CMakeLists.txt  doc                      Makefile.mips      opus.pc.in              silk_headers.mk
AUTHORS          compile         include                  Makefile.unix      opus_sources.mk         silk_sources.mk
autogen.sh       config.guess    INSTALL                  meson              opus-uninstalled.pc.in  src
autom4te.cache   config.h.in     install-sh               meson.build        package_version         test-driver
celt             config.sub      LICENSE_PLEASE_READ.txt  meson_options.txt  README                  tests
celt_headers.mk  configure       ltmain.sh                missing            README.draft            training
celt_sources.mk  configure.ac    m4                       NEWS               releases.sha2           update_version
ChangeLog        COPYING         Makefile.am              opus_headers.mk    scripts                 win32
cmake            depcomp         Makefile.in              opus.m4            silk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have more files now. The notable files are &lt;code&gt;Makefile.in&lt;/code&gt; and &lt;code&gt;configure&lt;/code&gt; script file. &lt;code&gt;Makefile.in&lt;/code&gt; is generated from &lt;code&gt;Makefile.am&lt;/code&gt; file but still is missing some values that are going to be filled with the &lt;code&gt;configure&lt;/code&gt; script. Now, let's run the &lt;code&gt;configure&lt;/code&gt; script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./configure
$ ls
aclocal.m4       config.guess   include                  Makefile.unix      opus-uninstalled.pc     stamp-h1
AUTHORS          config.h       INSTALL                  meson              opus-uninstalled.pc.in  test-driver
autogen.sh       config.h.in    install-sh               meson.build        package_version         tests
autom4te.cache   config.log     libtool                  meson_options.txt  README                  training
celt             config.status  LICENSE_PLEASE_READ.txt  missing            README.draft            update_version
celt_headers.mk  config.sub     ltmain.sh                NEWS               releases.sha2           win32
celt_sources.mk  configure      m4                       opus_headers.mk    scripts
ChangeLog        configure.ac   Makefile                 opus.m4            silk
cmake            COPYING        Makefile.am              opus.pc            silk_headers.mk
CMakeLists.txt   depcomp        Makefile.in              opus.pc.in         silk_sources.mk
compile          doc            Makefile.mips            opus_sources.mk    src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not surprisingly, we get a &lt;code&gt;Makefile&lt;/code&gt; amongst the newly generated files. If we inspect the &lt;code&gt;Makefile&lt;/code&gt;, we can see what &lt;code&gt;CFLAG&lt;/code&gt; it uses to compile the package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Makefile
# ...

CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these flags, we can know that the package does not utilize auto-vectorization. This means we can also implement SVE2 codes by utilizing auto-vectorization in this package. &lt;/p&gt;

&lt;p&gt;Now, let's compile the package. For this, we can assign more jobs in parallel at a time when we execute the &lt;code&gt;Makefile&lt;/code&gt; to increase the speed of compilation. In general, we can calculate the number by doubling the core number plus one. In our case, we can use a value of 24 (16 cores * 2 + 1) to keep every core busy with jobs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ make -j 24
$ ls
aclocal.m4       config.guess   include                  Makefile.mips      opus.pc                 silk
AUTHORS          config.h       INSTALL                  Makefile.unix      opus.pc.in              silk_headers.mk
autogen.sh       config.h.in    install-sh               meson              opus_sources.mk         silk_sources.mk
autom4te.cache   config.log     libopus.la               meson.build        opus-uninstalled.pc     src
celt             config.status  libtool                  meson_options.txt  opus-uninstalled.pc.in  stamp-h1
celt_headers.mk  config.sub     LICENSE_PLEASE_READ.txt  missing            package_version         test-driver
celt_sources.mk  configure      ltmain.sh                NEWS               README                  tests
ChangeLog        configure.ac   m4                       opus_compare       README.draft            training
cmake            COPYING        Makefile                 opus_demo          releases.sha2           trivial_example
CMakeLists.txt   depcomp        Makefile.am              opus_headers.mk    repacketizer_demo       update_version
compile          doc            Makefile.in              opus.m4            scripts                 win32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the &lt;code&gt;README&lt;/code&gt; file mentioned, we have a executable file called &lt;code&gt;opus_demo&lt;/code&gt;. When we run it, we can see the package is successfully compiled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./opus_demo
Usage: /home/swji1/opus/.libs/opus_demo [-e] &amp;lt;application&amp;gt; &amp;lt;sampling rate (Hz)&amp;gt; &amp;lt;channels (1/2)&amp;gt; &amp;lt;bits per second&amp;gt;  [options] &amp;lt;input&amp;gt; &amp;lt;output&amp;gt;
       /home/swji1/opus/.libs/opus_demo -d &amp;lt;sampling rate (Hz)&amp;gt; &amp;lt;channels (1/2)&amp;gt; [options] &amp;lt;input&amp;gt; &amp;lt;output&amp;gt;

application: voip | audio | restricted-lowdelay
options:
-e                   : only runs the encoder (output the bit-stream)
-d                   : only runs the decoder (reads the bit-stream as input)
-cbr                 : enable constant bitrate; default: variable bitrate
-cvbr                : enable constrained variable bitrate; default: unconstrained
-delayed-decision    : use look-ahead for speech/music detection (experts only); default: disabled
-bandwidth &amp;lt;NB|MB|WB|SWB|FB&amp;gt; : audio bandwidth (from narrowband to fullband); default: sampling rate
-framesize &amp;lt;2.5|5|10|20|40|60|80|100|120&amp;gt; : frame size in ms; default: 20
-max_payload &amp;lt;bytes&amp;gt; : maximum payload size in bytes, default: 1024
-complexity &amp;lt;comp&amp;gt;   : complexity, 0 (lowest) ... 10 (highest); default: 10
-inbandfec           : enable SILK inband FEC
-forcemono           : force mono encoding, even for stereo input
-dtx                 : enable SILK DTX
-loss &amp;lt;perc&amp;gt;         : simulate packet loss, in percent (0-100); default: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;But, how we validate if the binary works as intended? For this, we can refer to the &lt;code&gt;README&lt;/code&gt; again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# README
# ...

== Testing ==

This package includes a collection of automated unit and system tests
which SHOULD be run after compiling the package especially the first
time it is run on a new platform.

To run the integrated tests:

    % make check

# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thankfully, the authors provide a set of unit tests to validate the integrity of the executable file. Using this, we can check if the package is compiled correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ make check
PASS: celt/tests/test_unit_cwrs32
PASS: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
PASS: celt/tests/test_unit_mdct
PASS: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
PASS: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
PASS: tests/test_opus_projection
============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9
============================================================================
# TOTAL: 14
# PASS:  14
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored the package and learned how to compile it to generate binary files to execute. We also confirmed that the package does not utilize the auto-vectorization. So, we may try implementing the vectorization in two ways: compiler intrinsics or auto-vectorization. In the next post, we will see how we can add SVE2 codes by using intrinsics.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing SVE2 for Open Source Project</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Tue, 29 Mar 2022 03:43:10 +0000</pubDate>
      <link>https://dev.to/seungwooji/implementing-sve2-for-open-source-project-27h0</link>
      <guid>https://dev.to/seungwooji/implementing-sve2-for-open-source-project-27h0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/implementing-sve2-for-volume-adjusting-algorithm-4697"&gt;In the last post&lt;/a&gt;, we explored and implemented Scalable Vector Extension 2 (SVE2) code for the volume adjusting algorithm. Now, we will do the same process but in a much bigger scale - by actually trying to contribute SVE2 code for the ongoing open source project. &lt;/p&gt;

&lt;h3&gt;
  
  
  Searching for a package
&lt;/h3&gt;

&lt;p&gt;As we learned before, SVE2 is best suitable for processing large amount of data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer vision&lt;/li&gt;
&lt;li&gt;Multimedia&lt;/li&gt;
&lt;li&gt;Long-Term Evolution (LTE) baseband processing&lt;/li&gt;
&lt;li&gt;Genomics&lt;/li&gt;
&lt;li&gt;In-memory database&lt;/li&gt;
&lt;li&gt;Web serving&lt;/li&gt;
&lt;li&gt;Cryptography&lt;/li&gt;
&lt;li&gt;And so on...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And we know the vectorization can be implemented in 3 different ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-vectorization&lt;/li&gt;
&lt;li&gt;Compiler Intrinsics&lt;/li&gt;
&lt;li&gt;Inline Assembler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since we already have the experience of intrinsics, we will try our best to search packages that already use them.&lt;/p&gt;

&lt;p&gt;We also have to consider if a package supports for our machine (Fedora 35 running on Aarch64 Architecture) as we have to install the program. For this, we will use the Fedora's package manager &lt;a href="https://docs.fedoraproject.org/en-US/quick-docs/dnf/"&gt;DNF&lt;/a&gt; and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$dnf&lt;/span&gt; search search_keyword
&lt;span class="nv"&gt;$dnf&lt;/span&gt; info package_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using &lt;code&gt;$dnf search&lt;/code&gt;, the keyword is searched in both name and description of every package. Once we find a name of package, we can display the detailed description of that package with &lt;code&gt;$dnf info&lt;/code&gt;. We also have to be careful to only choose open-source project. &lt;/p&gt;

&lt;h3&gt;
  
  
  List of Possible Candidates
&lt;/h3&gt;

&lt;p&gt;With the aforementioned strategy, we can find some possible candidates as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/libjpeg-turbo/libjpeg-turbo"&gt;libjpeg-turbo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codeberg.org/soundtouch/soundtouch"&gt;SoundTouch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.xiph.org/xiph/opus"&gt;Opus Audio Codec&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's see each package together!&lt;/p&gt;

&lt;h4&gt;
  
  
  libjpeg-turbo
&lt;/h4&gt;

&lt;p&gt;libjpeg-turbo is a JPEG image codec that utilizes SIMD instructions to perform JPEG compression and decompression. When we inspect the package, we can find a list of promising files as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*neon*"&lt;/span&gt;
./jidctfst-neon.c
./jcsample-neon.c
./aarch32/jchuff-neon.c
./aarch32/jsimd_neon.S
./aarch32/jccolext-neon.c
./jfdctfst-neon.c
./neon-compat.h.in
./aarch64/jchuff-neon.c
./aarch64/jsimd_neon.S
./aarch64/jccolext-neon.c
./jidctred-neon.c
./jfdctint-neon.c
./jdmerge-neon.c
./jidctint-neon.c
./jccolor-neon.c
./jdsample-neon.c
./jdcolor-neon.c
./jdmrgext-neon.c
./jcgryext-neon.c
./jcphuff-neon.c
./jcgray-neon.c
./jdcolext-neon.c
./jquanti-neon.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// jquanti-neon.c&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;

&lt;span class="cp"&gt;#if defined(__clang__) &amp;amp;&amp;amp; (defined(__aarch64__) || defined(_M_ARM64))
#pragma unroll
#endif
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cm"&gt;/* Load DCT coefficients. */&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;row0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;row1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;row2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;row3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="cm"&gt;/* Load reciprocals of quantization values. */&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;recip0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recip_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;recip1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recip_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;recip2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recip_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;recip3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recip_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;corr0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;corr1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;corr2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;uint16x8_t&lt;/span&gt; &lt;span class="n"&gt;corr3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_u16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;shift0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shift_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;shift1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shift_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;shift2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shift_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;shift3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shift_ptr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DCTSIZE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, &lt;code&gt;vld1q_s16&lt;/code&gt; intrinsic is used to load a vector from memory. Furthermore, the package does not yet use SVE or SVE2 implementation. This indicates this project is a good candidate where we can contribute our knowledge of SVE2 for this project.&lt;/p&gt;

&lt;h4&gt;
  
  
  SoundTouch
&lt;/h4&gt;

&lt;p&gt;Soundtouch is an audio-processing library that allows changing the sound tempo, pitch and playback rate parameters. This sounds familiar to us as we dealt with a simple audio algorithm before and maybe another good candidate for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$grep&lt;/span&gt; &lt;span class="nt"&gt;-ir&lt;/span&gt; neon &lt;span class="nb"&gt;.&lt;/span&gt;
./configure.ac:AC_CHECK_HEADERS&lt;span class="o"&gt;([&lt;/span&gt;arm_neon.h]&lt;span class="o"&gt;)&lt;/span&gt;
./configure.ac:AC_ARG_ENABLE&lt;span class="o"&gt;([&lt;/span&gt;neon-optimizations],
./configure.ac:              &lt;span class="o"&gt;[&lt;/span&gt;AS_HELP_STRING&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="nt"&gt;--enable-neon-optimizations&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
./configure.ac:                              &lt;span class="o"&gt;[&lt;/span&gt;use ARM NEON optimization &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;&lt;span class="o"&gt;]])]&lt;/span&gt;,[enable_neon_optimizations&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;enableval&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# configure.ac &lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"x&lt;/span&gt;&lt;span class="nv"&gt;$enable_neon_optimizations&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"xyes"&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="s2"&gt;"x&lt;/span&gt;&lt;span class="nv"&gt;$ac_cv_header_arm_neon_h&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"xyes"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;

        &lt;span class="c"&gt;# Check for ARM NEON support&lt;/span&gt;
        &lt;span class="nv"&gt;original_saved_CXXFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$CXXFLAGS&lt;/span&gt;
        &lt;span class="nv"&gt;have_neon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no
        &lt;span class="nv"&gt;CXXFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-mfpu=neon -march=native &lt;/span&gt;&lt;span class="nv"&gt;$CXXFLAGS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

        &lt;span class="c"&gt;# Check if can compile neon code using intrinsics, require GCC &amp;gt;= 4.3 for autovectorization.&lt;/span&gt;
        AC_COMPILE_IFELSE&lt;span class="o"&gt;([&lt;/span&gt;AC_LANG_SOURCE&lt;span class="o"&gt;([[&lt;/span&gt;
        &lt;span class="c"&gt;#if defined(__GNUC__) &amp;amp;&amp;amp; (__GNUC__ &amp;lt; 4 || (__GNUC__ == 4 &amp;amp;&amp;amp; __GNUC_MINOR__ &amp;lt; 3))&lt;/span&gt;
        &lt;span class="c"&gt;#error "Need GCC &amp;gt;= 4.3 for neon autovectorization"&lt;/span&gt;
        &lt;span class="c"&gt;#endif&lt;/span&gt;
        &lt;span class="c"&gt;#include &amp;lt;arm_neon.h&amp;gt;&lt;/span&gt;
        int main &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                int32x4_t t &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;return &lt;/span&gt;vaddq_s32&lt;span class="o"&gt;(&lt;/span&gt;t,t&lt;span class="o"&gt;)[&lt;/span&gt;0] &lt;span class="o"&gt;==&lt;/span&gt; 2&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}]])]&lt;/span&gt;,[have_neon&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;yes&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
        &lt;span class="nv"&gt;CXXFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$original_saved_CXXFLAGS&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="s2"&gt;"x&lt;/span&gt;&lt;span class="nv"&gt;$have_neon&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"xyes"&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
                &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"****** NEON support enabled ******"&lt;/span&gt;
                &lt;span class="nv"&gt;CPPFLAGS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-mfpu=neon -march=native -mtune=native &lt;/span&gt;&lt;span class="nv"&gt;$CPPFLAGS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
                AC_DEFINE&lt;span class="o"&gt;(&lt;/span&gt;SOUNDTOUCH_USE_NEON,1,[Use ARM NEON extension]&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;fi
fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The package does not contain any files that has &lt;code&gt;simd&lt;/code&gt; or &lt;code&gt;neon&lt;/code&gt; in their names. However, it does have a file that contains &lt;code&gt;neon&lt;/code&gt; in its content. When we open that file, we can see this package utilizes the auto-vectorization feature by the compiler. As we can see, the package prompts a message saying that it cannot perform the auto-vectorization when it is compiled by &lt;code&gt;GCC&lt;/code&gt; with a version less than 4.3.&lt;/p&gt;

&lt;h4&gt;
  
  
  Opus
&lt;/h4&gt;

&lt;p&gt;Opus is a audio codec for interactive speech and audio transmission across the Internet with compression algorithms. It can support a wide rage of interactive audio applications such as Voice Over IP (VoIP), remote live music performance, and video conferencing. As similar to the last one, this may be a good candidate for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;find | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// celt_neon_intr.c&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;arm_neon.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;"../pitch.h"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="cp"&gt;#if defined(FIXED_POINT)
&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;xcorr_kernel_neon_fixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;opus_val16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;opus_val16&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opus_val32&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="cm"&gt;/* Load y[0...3] */&lt;/span&gt;
   &lt;span class="cm"&gt;/* This requires len&amp;gt;0 to always be valid (which we assert in the C code). */&lt;/span&gt;
   &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="cm"&gt;/* Load x[0...7] */&lt;/span&gt;
      &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;xx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_low_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_high_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="cm"&gt;/* Load y[4...11] */&lt;/span&gt;
      &lt;span class="n"&gt;int16x8_t&lt;/span&gt; &lt;span class="n"&gt;yy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vld1q_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_low_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vget_high_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int16x4_t&lt;/span&gt; &lt;span class="n"&gt;y7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vext_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;int32x4_t&lt;/span&gt; &lt;span class="n"&gt;a7&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vmlal_lane_s16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When searched with &lt;code&gt;neon&lt;/code&gt;, we can see a list of promising files that potentially deal with &lt;code&gt;simd&lt;/code&gt; instructions. In &lt;code&gt;celt_neon_intr.c&lt;/code&gt; file, we can see &lt;code&gt;xcorr_kernel_neon_fixed&lt;/code&gt; function executes a loop with SIMD instructions. &lt;/p&gt;

&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;p&gt;We have a pretty good open-source projects to implement SVE2. Amongst them, we will choose &lt;code&gt;Opus&lt;/code&gt; project for several reasons. First of all, this project is still well and actively maintained by developers. As a matter of fact, it is standardized by the Internet Engineering Task Force IETF and unmatched for interactive audio transmission over the Internet. Besides, the package is &lt;a href="https://opus-codec.org/docs/opus_api-1.3.1.pdf"&gt;well-documented&lt;/a&gt; to understand the code thoroughly. Lastly, and most importantly, the code is written to be more readable by new developers as compared to the first two projects. As we can see, the author kindly commented the purpose of variables and functions. Thus, we will choose &lt;code&gt;Opus&lt;/code&gt; project to contribute our SVE2 knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributions
&lt;/h3&gt;

&lt;p&gt;The way to contribute for &lt;code&gt;Opus&lt;/code&gt; project is well-explained in its wiki page. Thankfully, &lt;a href="https://wiki.xiph.org/OpusContributing"&gt;the wiki page&lt;/a&gt; states that one of ways to contribute to &lt;code&gt;Opus&lt;/code&gt; development is by doing optimizations (assembly/intrinsics). To do this, we can easily approach to the developers on the mailing list or through the IRC channel. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored some of the open-source projects where we could contribute our SVE2 knowledge. As it turned out, &lt;code&gt;Opus&lt;/code&gt; project is most suitable for us. In the following post, we will start implementing SVE2 codes in the project.  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Implementing SVE2 for Volume Adjusting Algorithm</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Wed, 23 Mar 2022 02:13:24 +0000</pubDate>
      <link>https://dev.to/seungwooji/implementing-sve2-for-volume-adjusting-algorithm-4697</link>
      <guid>https://dev.to/seungwooji/implementing-sve2-for-volume-adjusting-algorithm-4697</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/seungwooji/exploring-and-benchmarking-audio-volume-adjusting-algorithms-2oi"&gt;Previously&lt;/a&gt;, we explored simple volume adjust algorithms to scale the audio samples by volume factor. Unfortunately, these algorithms use Advanced SIMD instruction, not Scalable Vector Extension that we learned from &lt;a href="https://dev.to/seungwooji/exploring-scalable-vector-extension-2-1mdg"&gt;the last post&lt;/a&gt; which can greatly improve vectorization of code. In this post, we are going to implement SVE2 instructions to the volume adjusting algorithms in C++ and explore them in assembly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before We Start
&lt;/h2&gt;

&lt;p&gt;Since SVE2 is new technology and not natively supported by current hardware (with Armv8a processor) as of now, we can only emulate a program that is written with SVE2 instructions. This also means that we cannot really measure the performance of the program. Therefore, in this post, we are only going to implement SVE2 and test if the program runs successfully. &lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#ifdef   __ARM_FEATURE_SVE
#include &amp;lt;arm_sve.h&amp;gt;
#endif
#include "vol.h"

int main() {

        int                     x;              // array interator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- SVE2 implementation

        int16_t vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        int32_t i = 0;
        int32_t vl = svcnth(); // count the number of 16-bit element

        svbool_t pred;
        pred = svwhilelt_b16(i, SAMPLES);

        while(svptest_first(svptrue_b16(), pred)) {
                svst1(pred, &amp;amp;out[i], (svqrdmulh(svld1(pred, &amp;amp;in[i]), svdup_s16(vol_int))));
                i += vl;
                pred = svwhilelt_b16(i, SAMPLES);
        }

// ---- End of SVE2 implementation

  for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;

        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Compiler Intrinsic?
&lt;/h3&gt;

&lt;p&gt;Compiler intrinsic is function-like calls that the compiler replaces with the appropriate SVE2 instructions while handling various jobs including register allocation. It is a great way for developers (like me!) to use SVE2 instructions in C/C++ style without assembly. &lt;/p&gt;

&lt;h4&gt;
  
  
  Code Analysis
&lt;/h4&gt;

&lt;p&gt;First of all, we define  header file to access SVE vectors, predicates, and intrinsics for SVE2 insturctions. We then initialize a loop iterator, &lt;code&gt;i&lt;/code&gt;, and &lt;code&gt;vl&lt;/code&gt; that is used to count the number of elements. We also need to initialize a predicate register by using &lt;code&gt;svwhilelt_b16&lt;/code&gt; to control the while loop. &lt;code&gt;_b16&lt;/code&gt; specifies a predicate for 16-bit elements and conceptually, this would create an integer vector starting at &lt;code&gt;i&lt;/code&gt; and and incrementing by 1 in each subsequent vector lane. Within the while loop condition, we use &lt;code&gt;svptest_first&lt;/code&gt; to check if a lane of the predicate is active and there is a work left to do. The logic inside of the while loop is very similar to the ones written in SIMD instructions. That is, &lt;code&gt;svld1&lt;/code&gt; loads a vector with the value from &lt;code&gt;in[i]&lt;/code&gt; array element and &lt;code&gt;svdup_s16&lt;/code&gt; duplicates the value of &lt;code&gt;vol_int&lt;/code&gt; into a vector. Afterward, &lt;code&gt;svqrdmulh&lt;/code&gt; performs integer multiplication of those two values and &lt;code&gt;svst1&lt;/code&gt; saves the result into &lt;code&gt;out[i]&lt;/code&gt;. Then, &lt;code&gt;i&lt;/code&gt; gets incremented by the number of integer lanes in the vector and the predicate is reassigned.&lt;/p&gt;

&lt;h4&gt;
  
  
  Building Code
&lt;/h4&gt;

&lt;p&gt;As we discussed before, the current hardware does not support SVE2 instructions. Thus, we have to instruct the compiler to emit code for an Armv8a processor to make it understand SVE2 as following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gcc -march=armv8-a+sve2 vol6.c vol_createsample.o -o vol6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can execute the program by emulating with the QEMU usermode system. This will trap SVE instructions and run it on the Armv8 system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ qemu-aarch64 ./vol6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we run it, we can see the program runs successfully without any problem!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Result: -809
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Assembler Output
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0000000000400698 &amp;lt;main&amp;gt;:
  400698:       043f57ff        addvl   sp, sp, #-1
  40069c:       d100c3ff        sub     sp, sp, #0x30
  4006a0:       a9007bfd        stp     x29, x30, [sp]
  4006a4:       910003fd        mov     x29, sp
  4006a8:       043f5020        addvl   x0, sp, #1
  4006ac:       b900281f        str     wzr, [x0, #40]
  4006b0:       d2800041        mov     x1, #0x2                        // #2
  4006b4:       d2848000        mov     x0, #0x2400                     // #9216
  4006b8:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006bc:       97ffff91        bl      400500 &amp;lt;calloc@plt&amp;gt;
  4006c0:       047f5081        addpl   x1, sp, #4
  4006c4:       f9001020        str     x0, [x1, #32]
  4006c8:       d2800041        mov     x1, #0x2                        // #2
  4006cc:       d2848000        mov     x0, #0x2400                     // #9216
  4006d0:       f2a01e80        movk    x0, #0xf4, lsl #16
  4006d4:       97ffff8b        bl      400500 &amp;lt;calloc@plt&amp;gt;
  4006d8:       047f5081        addpl   x1, sp, #4
  4006dc:       f9000c20        str     x0, [x1, #24]
  4006e0:       52848001        mov     w1, #0x2400                     // #9216
  4006e4:       72a01e81        movk    w1, #0xf4, lsl #16
  4006e8:       047f5080        addpl   x0, sp, #4
  4006ec:       f9401000        ldr     x0, [x0, #32]
  4006f0:       9400006c        bl      4008a0 &amp;lt;vol_createsample&amp;gt;
  4006f4:       5287ffe0        mov     w0, #0x3fff                     // #16383
  4006f8:       047f5081        addpl   x1, sp, #4
  4006fc:       79002c20        strh    w0, [x1, #22]
  400700:       043f5020        addvl   x0, sp, #1
  400704:       b900241f        str     wzr, [x0, #36]
  400708:       0460e3e0        cnth    x0
  40070c:       047f5081        addpl   x1, sp, #4
  400710:       b9001020        str     w0, [x1, #16]
  400714:       043f5020        addvl   x0, sp, #1
  400718:       b9402400        ldr     w0, [x0, #36]
  40071c:       52848001        mov     w1, #0x2400                     // #9216
  400720:       72a01e81        movk    w1, #0xf4, lsl #16
  400724:       25610400        whilelt p0.h, w0, w1
  400728:       910093e0        add     x0, sp, #0x24
  40072c:       e5801c00        str     p0, [x0, #7, mul vl]
  400730:       14000026        b       4007c8 &amp;lt;main+0x130&amp;gt;
  400734:       043f5020        addvl   x0, sp, #1
  400738:       b9802400        ldrsw   x0, [x0, #36]
  40073c:       d37ff800        lsl     x0, x0, #1
  400740:       047f5081        addpl   x1, sp, #4
  400744:       f9400c21        ldr     x1, [x1, #24]
  400748:       8b000020        add     x0, x1, x0
  40074c:       043f5021        addvl   x1, sp, #1
  400750:       b9802421        ldrsw   x1, [x1, #36]
  400754:       d37ff821        lsl     x1, x1, #1
  400758:       047f5082        addpl   x2, sp, #4
  40075c:       f9401042        ldr     x2, [x2, #32]
  400760:       8b010041        add     x1, x2, x1
  400764:       910093e2        add     x2, sp, #0x24
  400768:       85801c40        ldr     p0, [x2, #7, mul vl]
  40076c:       a4a0a020        ld1h    {z0.h}, p0/z, [x1]
  400770:       047f5081        addpl   x1, sp, #4
  400774:       91005821        add     x1, x1, #0x16
  400778:       2518e3e0        ptrue   p0.b
  40077c:       84c0a021        ld1rh   {z1.h}, p0/z, [x1]
  400780:       04617400        sqrdmulh        z0.h, z0.h, z1.h
  400784:       910093e1        add     x1, sp, #0x24
  400788:       85801c20        ldr     p0, [x1, #7, mul vl]
  40078c:       e4a0e000        st1h    {z0.h}, p0, [x0]
  400790:       043f5020        addvl   x0, sp, #1
  400794:       b9402401        ldr     w1, [x0, #36]
  400798:       047f5080        addpl   x0, sp, #4
  40079c:       b9401000        ldr     w0, [x0, #16]
  4007a0:       0b000020        add     w0, w1, w0
  4007a4:       043f5021        addvl   x1, sp, #1
  4007a8:       b9002420        str     w0, [x1, #36]
  4007ac:       043f5020        addvl   x0, sp, #1
  4007b0:       b9402400        ldr     w0, [x0, #36]
  4007b4:       52848001        mov     w1, #0x2400                     // #9216
  4007b8:       72a01e81        movk    w1, #0xf4, lsl #16
  4007bc:       25610400        whilelt p0.h, w0, w1
  4007c0:       910093e0        add     x0, sp, #0x24
  4007c4:       e5801c00        str     p0, [x0, #7, mul vl]
  4007c8:       2558e3e0        ptrue   p0.h
  4007cc:       910093e0        add     x0, sp, #0x24
  4007d0:       85801c01        ldr     p1, [x0, #7, mul vl]
  4007d4:       2550c020        ptest   p0, p1.b
  4007d8:       9a9f57e0        cset    x0, mi  // mi = first
  4007dc:       7100001f        cmp     w0, #0x0
  4007e0:       54fffaa1        b.ne    400734 &amp;lt;main+0x9c&amp;gt;  // b.any
  4007e4:       043f5020        addvl   x0, sp, #1
  4007e8:       b9002c1f        str     wzr, [x0, #44]
  4007ec:       1400001d        b       400860 &amp;lt;main+0x1c8&amp;gt;
  4007f0:       043f5020        addvl   x0, sp, #1
  4007f4:       b9802c00        ldrsw   x0, [x0, #44]
  4007f8:       d37ff800        lsl     x0, x0, #1
  4007fc:       047f5081        addpl   x1, sp, #4
  400800:       f9400c21        ldr     x1, [x1, #24]
  400804:       8b000020        add     x0, x1, x0
  400808:       79c00000        ldrsh   w0, [x0]
  40080c:       2a0003e1        mov     w1, w0
  400810:       043f5020        addvl   x0, sp, #1
  400814:       b9402800        ldr     w0, [x0, #40]
  400818:       0b000020        add     w0, w1, w0
  40081c:       5289ba61        mov     w1, #0x4dd3                     // #19923
  400820:       72a20c41        movk    w1, #0x1062, lsl #16
  400824:       9b217c01        smull   x1, w0, w1
  400828:       d360fc21        lsr     x1, x1, #32
  40082c:       13067c22        asr     w2, w1, #6
  400830:       131f7c01        asr     w1, w0, #31
  400834:       4b010042        sub     w2, w2, w1
  400838:       52807d01        mov     w1, #0x3e8                      // #1000
  40083c:       1b017c41        mul     w1, w2, w1
  400840:       4b010000        sub     w0, w0, w1
  400844:       043f5021        addvl   x1, sp, #1
  400848:       b9002820        str     w0, [x1, #40]
  40084c:       043f5020        addvl   x0, sp, #1
  400850:       b9402c00        ldr     w0, [x0, #44]
  400854:       11000400        add     w0, w0, #0x1
  400858:       043f5021        addvl   x1, sp, #1
  40085c:       b9002c20        str     w0, [x1, #44]
  400860:       043f5020        addvl   x0, sp, #1
  400864:       b9402c01        ldr     w1, [x0, #44]
  400868:       52847fe0        mov     w0, #0x23ff                     // #9215
  40086c:       72a01e80        movk    w0, #0xf4, lsl #16
  400870:       6b00003f        cmp     w1, w0
  400874:       54fffbed        b.le    4007f0 &amp;lt;main+0x158&amp;gt;
  400878:       043f5020        addvl   x0, sp, #1
  40087c:       b9402801        ldr     w1, [x0, #40]
  400880:       90000000        adrp    x0, 400000 &amp;lt;__abi_tag-0x278&amp;gt;
  400884:       9124e000        add     x0, x0, #0x938
  400888:       97ffff2e        bl      400540 &amp;lt;printf@plt&amp;gt;
  40088c:       52800000        mov     w0, #0x0                        // #0
  400890:       a9407bfd        ldp     x29, x30, [sp]
  400894:       043f503f        addvl   sp, sp, #1
  400898:       9100c3ff        add     sp, sp, #0x30
  40089c:       d65f03c0        ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to test if SVE2 instructions are used, we can skim through the codes and search for &lt;code&gt;whilelt&lt;/code&gt; instruction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  400724:       25610400        whilelt p0.h, w0, w1
  4007bc:       25610400        whilelt p0.h, w0, w1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the SVE-specific instruction like &lt;code&gt;whilelt&lt;/code&gt; is used by the program and it runs without any problem!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored how to implement SVE2 instructions to the volume adjusting algorithm. Unfortunately, the current native hardware does not support SVE2 (yet!) and must use an emulator to run the program. It is also challenging to implement SVE2 as it requires understanding of predicate and new syntax. However, utilizing SVE2 is potentially beneficial for developers because latest hardware plans to support it natively and the vector length is determined by the machine.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring Scalable Vector Extension 2</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Sun, 20 Mar 2022 01:31:33 +0000</pubDate>
      <link>https://dev.to/seungwooji/exploring-scalable-vector-extension-2-1mdg</link>
      <guid>https://dev.to/seungwooji/exploring-scalable-vector-extension-2-1mdg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.arm.com/documentation/101726/0400/Learn-about-the-Scalable-Vector-Extension--SVE-/What-is-the-Scalable-Vector-Extension-"&gt;Scalable Vector Extension (SVE)&lt;/a&gt; is SIMD extension of ARMv8 and provides a new set of vector instructions to enable vectorization of loops for &lt;a href="https://en.wikipedia.org/wiki/High-performance_computing"&gt;High Performance Computing (HPC)&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why SVE?
&lt;/h2&gt;

&lt;p&gt;One of the key features of SVE is that it does not require a fixed 128-bit vector length like &lt;a href="https://developer.arm.com/architectures/instruction-sets/simd-isas/neon"&gt;Neon architecture extension&lt;/a&gt;. This enables Vector-length agnostic (VLA) programming in which the vector length is determined by hardware that is best for the workload. Thus, developers can write and build programs once and run them on different hardware with different SVE vector length implementations (better portability!). &lt;/p&gt;

&lt;h2&gt;
  
  
  SVE2
&lt;/h2&gt;

&lt;p&gt;SVE2 is basically a superset of SVE and Neon extension. With SVE2 instruction, it further extends data-processing domains beyond HPC that now include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer vision&lt;/li&gt;
&lt;li&gt;Multimedia&lt;/li&gt;
&lt;li&gt;Long-Term Evolution (LTE) baseband processing&lt;/li&gt;
&lt;li&gt;Genomics&lt;/li&gt;
&lt;li&gt;In-memory database&lt;/li&gt;
&lt;li&gt;Web serving&lt;/li&gt;
&lt;li&gt;General-purpose software&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  SVE2 Registers
&lt;/h2&gt;

&lt;p&gt;Like SVE, SVE2 is based on the scalable vectors as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scalable vector registers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LRieL1mZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvrm4421unj15n527c7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LRieL1mZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvrm4421unj15n527c7j.png" alt="Scalable_Vector_Registers" width="594" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a total of 32 scalable vector registers (z0-z31). Their size in bits must be a multiple of 128 and up to 2048 bits. Data in these registers can holder 64, 32, 16, and 8-bit elements. The lower 128 bits of each register holds the corresponding Neon register of the SIMD extension. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scalable predicate registers 
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3VPUZ71D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ig95tezfy8dvckzpr1nz.png" alt="Scalable_Predicate_Registers" width="594" height="317"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are a total of 16 predicate registers which are unique to SVE and SVE2. Each predicate register can hold one bit for each byte available in the respective z register (1/8 of the z register length). P0 - P7 registers are governing predicates for load, store, and arithmetic. P8 - p15 registers are extra predicates for loop management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SVE allows developers to implement vectorization for the program in more efficient manner as they don't have to worry about the vector size. This also enable better portability because different hardware determines the vector size accordingly for the same program. In the next post, we will discuss how we can implement SVE2 to the volume algorithm we explored previously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://developer.arm.com/documentation/101726/0400/Learn-about-the-Scalable-Vector-Extension--SVE-/What-is-the-Scalable-Vector-Extension-"&gt;What is the Scalable Vector Extension?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.arm.com/documentation/102340/0001/Introducing-SVE2?lang=en"&gt;Introducing SVE2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=eGCcPo4UAHs"&gt;Introduction to Arm SVE&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Exploring and Benchmarking Audio Volume Adjusting Algorithms Part 2</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Thu, 10 Mar 2022 00:14:53 +0000</pubDate>
      <link>https://dev.to/seungwooji/exploring-and-benchmarking-audio-volume-adjusting-algorithms-part-2-3lpb</link>
      <guid>https://dev.to/seungwooji/exploring-and-benchmarking-audio-volume-adjusting-algorithms-part-2-3lpb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the last post, we explored multiple volume adjusting algorithms and made assumptions of how well they would perform. Now, we are going to measure the performance of each algorithm and test if they are met with our expectations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Audio Sample Size
&lt;/h2&gt;

&lt;p&gt;Before we start testing, we will set the number of sample size with a large number so that we can have meaningful result. For this, we will use the size of 1,600,000,000 for each program. If we run the &lt;code&gt;time&lt;/code&gt; command with the dummy program, we have the following result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;real&lt;/th&gt;
&lt;th&gt;1m27.058s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;user&lt;/td&gt;
&lt;td&gt;1m22.503s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sys&lt;/td&gt;
&lt;td&gt;0m4.496s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dummy program takes about a minute and a half seconds in total. However, we have to consider that this time does not only account for the volume scale function - there are different processes involved (e.g. generating random samples, calculating results and so on).&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating Algorithm Performance
&lt;/h2&gt;

&lt;p&gt;How do we only measure the performance of the volume scale function (&lt;code&gt;scale_sample&lt;/code&gt;)?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can easily implement this by utilizing the &lt;code&gt;C Time&lt;/code&gt; library. With this library, we can isolate the function and measure the elapsed time as following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ---- Include the C Time library
#include &amp;lt;time.h&amp;gt;

        clock_t         t;

// ---- Calculate the start time
        t = clock();

// Scale Sample Code

//----  Calculate the elapsed time
        t = clock() - t;

// ---- Print the elapsed time in seconds
        printf("Time elapsed: %f\n", ((double)t)/CLOCKS_PER_SEC);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, we can only estimate the elapsed time of the scale function in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Test Results
&lt;/h2&gt;

&lt;p&gt;For benchmarking, a total of 20 cases were tested for each algorithm. All algorithms also processed 1,600,000,000 samples and were assessed on AArch64 and x84_64 systems. During the tests, the number of background operations were minimized.&lt;/p&gt;

&lt;p&gt;The following table shows the results. Both tables show very small number of standard deviation (SD) meaning the data are clustered around the mean value.&lt;/p&gt;

&lt;h3&gt;
  
  
  AArch64
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;vol0&lt;/th&gt;
&lt;th&gt;vol1&lt;/th&gt;
&lt;th&gt;vol2&lt;/th&gt;
&lt;th&gt;vol4&lt;/th&gt;
&lt;th&gt;vol5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time (seconds)&lt;/td&gt;
&lt;td&gt;5.290686&lt;/td&gt;
&lt;td&gt;4.571809&lt;/td&gt;
&lt;td&gt;11.204779&lt;/td&gt;
&lt;td&gt;2.862223&lt;/td&gt;
&lt;td&gt;2.897304&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.271289&lt;/td&gt;
&lt;td&gt;4.616451&lt;/td&gt;
&lt;td&gt;11.236343&lt;/td&gt;
&lt;td&gt;2.869659&lt;/td&gt;
&lt;td&gt;2.860497&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.3009&lt;/td&gt;
&lt;td&gt;4.618019&lt;/td&gt;
&lt;td&gt;11.207497&lt;/td&gt;
&lt;td&gt;2.839968&lt;/td&gt;
&lt;td&gt;2.88575&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.257061&lt;/td&gt;
&lt;td&gt;4.57951&lt;/td&gt;
&lt;td&gt;11.229004&lt;/td&gt;
&lt;td&gt;2.794136&lt;/td&gt;
&lt;td&gt;2.837761&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.29981&lt;/td&gt;
&lt;td&gt;4.584778&lt;/td&gt;
&lt;td&gt;11.237608&lt;/td&gt;
&lt;td&gt;2.879343&lt;/td&gt;
&lt;td&gt;2.857112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.252714&lt;/td&gt;
&lt;td&gt;4.590422&lt;/td&gt;
&lt;td&gt;11.220075&lt;/td&gt;
&lt;td&gt;2.785239&lt;/td&gt;
&lt;td&gt;2.859161&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.300421&lt;/td&gt;
&lt;td&gt;4.590156&lt;/td&gt;
&lt;td&gt;11.215143&lt;/td&gt;
&lt;td&gt;2.870726&lt;/td&gt;
&lt;td&gt;2.919503&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.286753&lt;/td&gt;
&lt;td&gt;4.589992&lt;/td&gt;
&lt;td&gt;11.224697&lt;/td&gt;
&lt;td&gt;2.794225&lt;/td&gt;
&lt;td&gt;2.895057&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.317688&lt;/td&gt;
&lt;td&gt;4.61077&lt;/td&gt;
&lt;td&gt;11.268087&lt;/td&gt;
&lt;td&gt;2.907598&lt;/td&gt;
&lt;td&gt;2.91678&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.272125&lt;/td&gt;
&lt;td&gt;4.63759&lt;/td&gt;
&lt;td&gt;11.235228&lt;/td&gt;
&lt;td&gt;2.799026&lt;/td&gt;
&lt;td&gt;2.881828&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.308232&lt;/td&gt;
&lt;td&gt;4.58515&lt;/td&gt;
&lt;td&gt;11.229461&lt;/td&gt;
&lt;td&gt;2.882254&lt;/td&gt;
&lt;td&gt;2.910783&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.286579&lt;/td&gt;
&lt;td&gt;4.599118&lt;/td&gt;
&lt;td&gt;11.253098&lt;/td&gt;
&lt;td&gt;2.85217&lt;/td&gt;
&lt;td&gt;2.903325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.282362&lt;/td&gt;
&lt;td&gt;4.597291&lt;/td&gt;
&lt;td&gt;11.190576&lt;/td&gt;
&lt;td&gt;2.875931&lt;/td&gt;
&lt;td&gt;2.920964&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.276742&lt;/td&gt;
&lt;td&gt;4.611212&lt;/td&gt;
&lt;td&gt;11.239454&lt;/td&gt;
&lt;td&gt;2.849582&lt;/td&gt;
&lt;td&gt;2.853147&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.293711&lt;/td&gt;
&lt;td&gt;4.591562&lt;/td&gt;
&lt;td&gt;11.253258&lt;/td&gt;
&lt;td&gt;2.870164&lt;/td&gt;
&lt;td&gt;2.918136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.293716&lt;/td&gt;
&lt;td&gt;4.621955&lt;/td&gt;
&lt;td&gt;11.228463&lt;/td&gt;
&lt;td&gt;2.858067&lt;/td&gt;
&lt;td&gt;2.850342&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.318874&lt;/td&gt;
&lt;td&gt;4.591154&lt;/td&gt;
&lt;td&gt;11.225114&lt;/td&gt;
&lt;td&gt;2.864949&lt;/td&gt;
&lt;td&gt;2.912111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.306651&lt;/td&gt;
&lt;td&gt;4.590993&lt;/td&gt;
&lt;td&gt;11.252793&lt;/td&gt;
&lt;td&gt;2.841034&lt;/td&gt;
&lt;td&gt;2.847878&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.30221&lt;/td&gt;
&lt;td&gt;4.641963&lt;/td&gt;
&lt;td&gt;11.220678&lt;/td&gt;
&lt;td&gt;2.877916&lt;/td&gt;
&lt;td&gt;2.842209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.299778&lt;/td&gt;
&lt;td&gt;4.593774&lt;/td&gt;
&lt;td&gt;11.206139&lt;/td&gt;
&lt;td&gt;2.868532&lt;/td&gt;
&lt;td&gt;2.856316&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;105.818302&lt;/td&gt;
&lt;td&gt;92.013669&lt;/td&gt;
&lt;td&gt;224.577495&lt;/td&gt;
&lt;td&gt;57.042742&lt;/td&gt;
&lt;td&gt;57.625964&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;5.2909151&lt;/td&gt;
&lt;td&gt;4.60068345&lt;/td&gt;
&lt;td&gt;11.22887475&lt;/td&gt;
&lt;td&gt;2.8521371&lt;/td&gt;
&lt;td&gt;2.8812982&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SD&lt;/td&gt;
&lt;td&gt;0.01805085609&lt;/td&gt;
&lt;td&gt;0.01880182964&lt;/td&gt;
&lt;td&gt;0.01914206262&lt;/td&gt;
&lt;td&gt;0.0338674976&lt;/td&gt;
&lt;td&gt;0.02977236719&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the previous post, we assumed the algorithms that use SIMD instructions would perform faster than others. Indeed, we can observe that &lt;code&gt;vol4&lt;/code&gt; and &lt;code&gt;vol5&lt;/code&gt; algorithms outperform others. The performance difference between them are really small (~0.0291 seconds) indicating that both inline assembly and compiler intrinsic are almost equally fast.&lt;/p&gt;

&lt;p&gt;We can also see that &lt;code&gt;vol1&lt;/code&gt; runs faster than &lt;code&gt;vol0&lt;/code&gt;. This corresponds to our expectation as &lt;code&gt;vol1&lt;/code&gt; uses a fixed-point calculation with bit-shift operations.&lt;/p&gt;

&lt;p&gt;Interestingly, &lt;code&gt;vol2&lt;/code&gt; algorithm is found to be significantly slower than others. Initially, we assumed that this algorithm may perform faster than &lt;code&gt;vol0&lt;/code&gt; and &lt;code&gt;vol1&lt;/code&gt; which multiplies each sample with scaling factor because it pre-calculates all the results and stores them in a table. This result would mean that the CPU has an efficient arithmetic logic unit (ALU) that processes the multiplication fast or is slow at reading the memory when looking over the pre-calculated values within the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  x86_64
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;vol0&lt;/th&gt;
&lt;th&gt;vol1&lt;/th&gt;
&lt;th&gt;vol2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time (seconds)&lt;/td&gt;
&lt;td&gt;2.821902&lt;/td&gt;
&lt;td&gt;2.784482&lt;/td&gt;
&lt;td&gt;3.531761&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.903628&lt;/td&gt;
&lt;td&gt;2.786877&lt;/td&gt;
&lt;td&gt;3.569542&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.895999&lt;/td&gt;
&lt;td&gt;2.78038&lt;/td&gt;
&lt;td&gt;3.551214&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.877543&lt;/td&gt;
&lt;td&gt;2.785402&lt;/td&gt;
&lt;td&gt;3.559591&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.886563&lt;/td&gt;
&lt;td&gt;2.785422&lt;/td&gt;
&lt;td&gt;3.537273&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.891856&lt;/td&gt;
&lt;td&gt;2.783449&lt;/td&gt;
&lt;td&gt;3.545279&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.80208&lt;/td&gt;
&lt;td&gt;2.786667&lt;/td&gt;
&lt;td&gt;3.58345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.855822&lt;/td&gt;
&lt;td&gt;2.782619&lt;/td&gt;
&lt;td&gt;3.590136&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.804731&lt;/td&gt;
&lt;td&gt;2.781633&lt;/td&gt;
&lt;td&gt;3.572802&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.782909&lt;/td&gt;
&lt;td&gt;2.801589&lt;/td&gt;
&lt;td&gt;3.587121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.783267&lt;/td&gt;
&lt;td&gt;2.783468&lt;/td&gt;
&lt;td&gt;3.630578&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.785422&lt;/td&gt;
&lt;td&gt;2.800091&lt;/td&gt;
&lt;td&gt;3.562486&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.81526&lt;/td&gt;
&lt;td&gt;2.77875&lt;/td&gt;
&lt;td&gt;3.591089&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.873962&lt;/td&gt;
&lt;td&gt;2.778289&lt;/td&gt;
&lt;td&gt;3.529016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.791908&lt;/td&gt;
&lt;td&gt;2.789269&lt;/td&gt;
&lt;td&gt;3.579964&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.785272&lt;/td&gt;
&lt;td&gt;2.792904&lt;/td&gt;
&lt;td&gt;3.55086&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.804883&lt;/td&gt;
&lt;td&gt;2.778821&lt;/td&gt;
&lt;td&gt;3.587747&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.78638&lt;/td&gt;
&lt;td&gt;2.785906&lt;/td&gt;
&lt;td&gt;3.545412&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.788079&lt;/td&gt;
&lt;td&gt;2.795611&lt;/td&gt;
&lt;td&gt;3.574527&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.810512&lt;/td&gt;
&lt;td&gt;2.794108&lt;/td&gt;
&lt;td&gt;3.54657&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;56.547978&lt;/td&gt;
&lt;td&gt;55.735737&lt;/td&gt;
&lt;td&gt;71.326418&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;2.8273989&lt;/td&gt;
&lt;td&gt;2.78678685&lt;/td&gt;
&lt;td&gt;3.5663209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SD&lt;/td&gt;
&lt;td&gt;0.04456744515&lt;/td&gt;
&lt;td&gt;0.006838116502&lt;/td&gt;
&lt;td&gt;0.02516021857&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The x86_64 system shows similar aspects as the AArch64 system -&lt;code&gt;vol1&lt;/code&gt; algorithm is the fastest and &lt;code&gt;vol2&lt;/code&gt; is the slowest. Note that we are missing &lt;code&gt;vol4&lt;/code&gt; and &lt;code&gt;vol5&lt;/code&gt; algorithms because these programs utilize SIMD instructions that are unique to the AArch64 system. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we measured the performance of each algorithm to test the assumptions we made in the previous post. As expected, the algorithms that use SIMD instructions appear to run faster than others as they can process multiple data at a time. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring and Benchmarking Audio Volume Adjusting Algorithms Part 1</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Mon, 07 Mar 2022 00:52:03 +0000</pubDate>
      <link>https://dev.to/seungwooji/exploring-and-benchmarking-audio-volume-adjusting-algorithms-2oi</link>
      <guid>https://dev.to/seungwooji/exploring-and-benchmarking-audio-volume-adjusting-algorithms-2oi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Uncompressed digital sound is typically represented as signed 16-bit (2 bytes) integer samples. For a 48000 audio sample (kHz), the data rate can easily surpass 96,000 bytes per seconds (2 bytes per sample * 48000 samples per seconds). When we change the sound volume, each sample needs to be scaled by a volume factor between 0 (no volume) and 1 (full volume). Considering the amount of data in sound samples, it is vital to have efficient volume adjusting algorithm to scale sound. This is especially true for a mobile device as the amount of processing required can affect its battery life.&lt;/p&gt;

&lt;p&gt;In this post, we are going to explore a number of different algorithms for processing sound samples to control volume level. After that, we will study the performance of each algorithm to benchmark.&lt;/p&gt;

&lt;h3&gt;
  
  
  volume.h
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume

/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  vol_createsample.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i&amp;lt;sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;volume.h&lt;/code&gt;, we define a constant named &lt;code&gt;SAMPLES&lt;/code&gt; to define the number of samples to be processed. We will use a reasonably large number for this to have a processed time at least 20 seconds. This will allow us to analyze the performance much more easily.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vol_createsample&lt;/code&gt; function is made to fill an array with random numbers to simulate an audio buffer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm 1: vol0.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;vol0.c&lt;/code&gt; contains a very naïve algorithm that simply multiplies each sample by the volume scaling factor. This also involves with casting from signed 16-bit integer into floating point and back again - which can be very expensive and take a lot of resources.&lt;/p&gt;

&lt;p&gt;It is also noteworthy to mention why we need to loop that sums the samples as well as to print the sum to the console. They must exist so that the algorithm can perform correctly. Let's take a look at the assembly code that is built by the compiler to understand more easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assembly code of the original program
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;400580:       a9be7bfd        stp     x29, x30, [sp, #-32]!
  400584:       d2800041        mov     x1, #0x2                        // #2
  400588:       d2800200        mov     x0, #0x10                       // #16
  40058c:       910003fd        mov     x29, sp
  400590:       a90153f3        stp     x19, x20, [sp, #16]
  400594:       97ffffdb        bl      400500 &amp;lt;calloc@plt&amp;gt;
  400598:       d2800041        mov     x1, #0x2                        // #2
  40059c:       aa0003f4        mov     x20, x0
  4005a0:       d2800200        mov     x0, #0x10                       // #16
  4005a4:       97ffffd7        bl      400500 &amp;lt;calloc@plt&amp;gt;
  4005a8:       aa0003f3        mov     x19, x0
  4005ac:       52800201        mov     w1, #0x10                       // #16
  4005b0:       aa1403e0        mov     x0, x20
  4005b4:       94000077        bl      400790 &amp;lt;vol_createsample&amp;gt;
  4005b8:       d2800002        mov     x2, #0x0                        // #0
  4005bc:       1e2c1001        fmov    s1, #5.000000000000000000e-01
  4005c0:       78e26a81        ldrsh   w1, [x20, x2]
  4005c4:       1e220020        scvtf   s0, w1
  4005c8:       1e210800        fmul    s0, s0, s1
  4005cc:       5ea1b800        fcvtzs  s0, s0
  4005d0:       7c226a60        str     h0, [x19, x2]
  4005d4:       91000842        add     x2, x2, #0x2
  4005d8:       f100805f        cmp     x2, #0x20
  4005dc:       54ffff21        b.ne    4005c0 &amp;lt;main+0x40&amp;gt;  // b.any
  4005e0:       5289ba64        mov     w4, #0x4dd3                     // #19923
mov     x0, x19
  4005e8:       91008265        add     x5, x19, #0x20
  4005ec:       52800001        mov     w1, #0x0                        // #0
  4005f0:       72a20c44        movk    w4, #0x1062, lsl #16
  4005f4:       52807d03        mov     w3, #0x3e8                      // #1000
  4005f8:       78c02402        ldrsh   w2, [x0], #2
  4005fc:       0b010042        add     w2, w2, w1
  400600:       9b247c41        smull   x1, w2, w4
  400604:       9366fc21        asr     x1, x1, #38
  400608:       4b827c21        sub     w1, w1, w2, asr #31
  40060c:       1b038821        msub    w1, w1, w3, w2
  400610:       eb0000bf        cmp     x5, x0
  400614:       54ffff21        b.ne    4005f8 &amp;lt;main+0x78&amp;gt;  // b.any
  400618:       90000000        adrp    x0, 400000 &amp;lt;__abi_tag-0x278&amp;gt;
  40061c:       9120a000        add     x0, x0, #0x828
  400620:       97ffffc8        bl      400540 &amp;lt;printf@plt&amp;gt;
  400624:       52800000        mov     w0, #0x0                        // #0
  400628:       a94153f3        ldp     x19, x20, [sp, #16]
  40062c:       a8c27bfd        ldp     x29, x30, [sp], #32
  400630:       d65f03c0        ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Assembly code without the sum loop and print
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;400500:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
  400504:       d2800041        mov     x1, #0x2                        // #2
  400508:       d2800200        mov     x0, #0x10                       // #16
  40050c:       910003fd        mov     x29, sp
  400510:       97ffffec        bl      4004c0 &amp;lt;calloc@plt&amp;gt;
  400514:       52800201        mov     w1, #0x10                       // #16
  400518:       9400005e        bl      400690 &amp;lt;vol_createsample&amp;gt;
  40051c:       52800000        mov     w0, #0x0                        // #0
  400520:       a8c17bfd        ldp     x29, x30, [sp], #16
  400524:       d65f03c0        ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can immediately notice that many parts of the assembly codes are missing when we do not include the sum loop and print. This is because the compiler recognizes that the results of volume scaling calculation is not used and optimizes the code by removing it. Obviously, we need to prevent this from happening as it is the code that we have to test! &lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm 2: vol1.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) &amp;lt;&amp;lt;1) ) &amp;gt;&amp;gt; 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of using floating-point calculation, &lt;code&gt;vol1.c&lt;/code&gt; utilizes a fixed-point calculation with bit-shift operations. In this way, we can avoid the costly casting between integer and floating point and back again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm 3: vol2.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x &amp;lt;= 32767; x++) {
 // Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;vol2.c&lt;/code&gt;, we pre-calculate all 65536 result. Then, we use it to look up the result for each input value. Note we use a casting to &lt;code&gt;uint16_t&lt;/code&gt; for each element's index. Since we cast a negative integer to unsigned type, x would have a unsigned integer with the bit pattern representing in the corresponding signed type. For example, &lt;code&gt;-5&lt;/code&gt; would become 65531 (2^16 - 5). In this way, we can populate the array with 65536 elements.&lt;/p&gt;

&lt;p&gt;This program may have a better performance than the previous one because we create a table with all of the possible values. However, this may be varied depending on the speed of reading memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dummy Algorithm: vol3.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) 100;
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sum the samples.
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vol3.c&lt;/code&gt; is a simply dummy program and returns an identical sample value (100). The purpose of this program is to determine the possible overhead processing other than the scaling volume algorithm. &lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm 4: vol4.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor &amp;lt; limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;vol4.c&lt;/code&gt;, we utilize &lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data"&gt;Single Instruction, Multiple Data (SIMD)&lt;/a&gt; instructions with inline assembly codes (assembly language code inserted into a high-level language). Also note that we use AArch64 specific assembly code here and thus, this program can be only executed in the AArch64 system. &lt;/p&gt;

&lt;p&gt;Let's take a look at some of the important points in the code (marked as &lt;code&gt;Q&lt;/code&gt;). First of all, we need to multiply by &lt;code&gt;32767&lt;/code&gt; when calculating &lt;code&gt;vol_int&lt;/code&gt; to have a fixed-point representation of the volume factor. This is because the &lt;code&gt;vol_int&lt;/code&gt; has a type of &lt;code&gt;int16_t&lt;/code&gt;, a signed integer type with width of exactly 16 bits. Since its type is signed, the range of values it can hold is between -32,768 and 32,767. Thus, we need to multiply the sample with 32,767 to prevent the integer overflow.&lt;/p&gt;

&lt;p&gt;Next, we need to set three pointers that point to the first element of &lt;code&gt;in&lt;/code&gt; and &lt;code&gt;out&lt;/code&gt; arrays as well as the end of the &lt;code&gt;in&lt;/code&gt; array respectively. In this way, we can make a loop that multiplies the sample by the volume scaling factor.&lt;/p&gt;

&lt;p&gt;Once we set all of the requirements mentioned above, we can start implementing inline assembly codes using &lt;code&gt;__asm__&lt;/code&gt;. The &lt;code&gt;dup&lt;/code&gt; instruction is used to duplicate the volume scaling factor from the register with 32-bit-wide access (w0) into the vector register with 8 lines (v1.8h). By doing this, we can multiply the each element of the vector by the scaling factor.&lt;/p&gt;

&lt;p&gt;Inside of the loop, we have another inline assembly code that multiplies eight samples by the scaling factor. In contrast to the last &lt;code&gt;__asm__&lt;/code&gt; code, we have 3 operand parameters that are each separated by colon(&lt;code&gt;:&lt;/code&gt;). The first operand parameter defines the output operands, &lt;code&gt;in_cursor&lt;/code&gt; and &lt;code&gt;out_cursor'. Each operand is named as&lt;/code&gt;[in_cursor] and [out_cursor] respectively so that they can be used in the assembler template (enclosed in double quotation). The &lt;code&gt;+&lt;/code&gt; sign indicates &lt;a href="https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html"&gt;a constraint&lt;/a&gt; that the given output operands are both read and written by the instruction. The last operand parameter is used for clobbers. The &lt;code&gt;memory&lt;/code&gt; clobber is used to tell the compiler that the assembly code performs memory reads and writes.&lt;/p&gt;

&lt;p&gt;Lastly, we can assume the printed results would be correct. This is because &lt;code&gt;sqrdmulh&lt;/code&gt; instruction can saturate the result when overflowing happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithm 4: vol5.c
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array interator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor &amp;lt; limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vol5.c&lt;/code&gt; also uses SIMD instruction but with complier intrinsic that are function-like language extensions built into the compiler. Since it uses the instructions unique to AArch64 architecture, this program is also specific to AArch64.&lt;/p&gt;

&lt;p&gt;Let's explore the code together. Note that we use the same set of instructions as before - &lt;code&gt;ldr&lt;/code&gt; instruction (&lt;code&gt;vld1q_s16&lt;/code&gt;), &lt;code&gt;dup&lt;/code&gt; instruction (&lt;code&gt;vdupq_n_s16&lt;/code&gt;), &lt;code&gt;sqrdmulh&lt;/code&gt; instruction (&lt;code&gt;vqrdmulhq_s16&lt;/code&gt;), and &lt;code&gt;str&lt;/code&gt; instruction (&lt;code&gt;vst1q_s16&lt;/code&gt;). Note that the suffix of the intrinsic (s16, signed 16-bit values) indicates the vector length. Thus, each intrinsic will calculate 8 elements at a time (8 elements x 16 bits = 128 bits). This means we have to increment by 8 elements for both &lt;code&gt;in_cursor&lt;/code&gt; and &lt;code&gt;out_cursor&lt;/code&gt; (do not confuse that we are incrementing by 8 elements not 8 bytes!).&lt;/p&gt;

&lt;p&gt;Also, notice that we have to manually increment both cursors for this time. This is because unlike the assembly inline code, the compiler intrinsic code does not increment the pointer for us. &lt;/p&gt;

&lt;p&gt;Since both &lt;code&gt;vol4.c&lt;/code&gt; and &lt;code&gt;vol5.c&lt;/code&gt; utilize AArch64 specific SIMD instructions, it is logical to think these two should outperform other algorithms. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post we explored the multiple algorithms for adjusting volume samples. We saw how each algorithm differed even though they all accomplish the same goal. In the next post, we will examine the performance of each program and create a benchmark to verify our expectation.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring Assembler on the x86-64 Platform</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Sun, 27 Feb 2022 23:45:58 +0000</pubDate>
      <link>https://dev.to/seungwooji/investigating-sound-volume-changing-algorithms-in-aarch64-and-x8664-systems-5a5p</link>
      <guid>https://dev.to/seungwooji/investigating-sound-volume-changing-algorithms-in-aarch64-and-x8664-systems-5a5p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, we are going to develop the same assembly program that we coded in the &lt;a href="https://dev.to/seungwooji/exploring-assembler-on-the-x8664-and-aarch64-platforms-1p5i"&gt;previous post&lt;/a&gt; but within x86_64 system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Original Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Like the original code from AArch64 program, the code in x86_64 does not do anything but to loop for 10 times (max = 10). However, we can see that there are a number of notable differences when compared to the AArch64 platform. First of all, we use a &lt;code&gt;$&lt;/code&gt; sign to indicate an immediate value and a &lt;code&gt;%&lt;/code&gt; sign to indicate a register. Next, we have &lt;code&gt;inc&lt;/code&gt; instruction to directly increment the value of &lt;code&gt;r15&lt;/code&gt; instead of using &lt;code&gt;add&lt;/code&gt; instruction. We also use &lt;code&gt;jne&lt;/code&gt; instruction to jump to a label instead of breaching and &lt;code&gt;syscall&lt;/code&gt; instruction to invoke a system call. Finally, we use specialized group of registers (e.g. rdi, rax) for &lt;code&gt;syscall&lt;/code&gt; arguments&lt;/p&gt;

&lt;p&gt;With that being said, let's continue developing the code to actually print out something to the console screen.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Message
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop\n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program does what we expected for. But, the printed messages are not meaningful us yet. Let's continue on developing the code so that we can have the number of loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Loop Number
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     %r15,%r14           /* Copy the value of r15 to r14 */
    add     $'0',%r14           /* Add the ascii value of '0' to the r14 and save */
    movb    %r14b,msg+6         /* Copy one byte of r14 to the address location of msg + 6 */

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop: #\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, the program prints out more meaningful messages to the screen. Note that there are another notable differences as compared to the ones in AArch64 assembly. For example, we may reuse &lt;code&gt;mov&lt;/code&gt; instruction to move data from one register to an address pointed by another register. As you remember, we have to utilize &lt;code&gt;str&lt;/code&gt; instruction to do such job within AArch64 system. Moreover, we put the &lt;code&gt;b&lt;/code&gt; suffix after &lt;code&gt;mov&lt;/code&gt; instruction and the register in order to limit the number of byte to be moved. &lt;/p&gt;

&lt;p&gt;However, this code is also not sufficient to handle the two-digit loop numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Two Digit Loop Number
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 15                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */
    mov     $10,%r13            /* Divisor */

loop:

// Dividing by 10
    mov     %r15,%rax           /* Setting rax with the value of dividend */
    mov     $0,%rdx             /* rdx must be set to 0 before using div instruction */
    div     %r13                /* divide rax by the r13; place quotient into rax and remainder into rdx */
    cmp     $0,%rax
    je     oneDigit

// Inserting tens digit
    add     $'0',%rax           /* Add the ascii value of '0' to the rax and save */
    mov     %rax,%r12
    movb    %r12b,msg+6         /* Copy one byte of rax to the address location of msg + 6 */

oneDigit:

// Inserting ones digit
    add     $'0',%rdx           /* Add the ascii value of '0' to the rdx and save */
    mov     %rdx,%r12
    movb    %r12b,msg+7         /* Copy one byte of rdx to the address location of msg + 7 */

// Print Message
    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop:  #\n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we divide the given loop index stored in &lt;code&gt;r15&lt;/code&gt; by 10. We use the quotient to find the tens digit. Unlike &lt;code&gt;udiv&lt;/code&gt; instruction, &lt;code&gt;div&lt;/code&gt; instruction can also calculate a remainder. With given quotient and remainder, we can print the quotient value as tens digit and the remainder value as ones digit. Afterwards, we can remove the leading zero for the tens digit by jumping to the &lt;code&gt;oneDigit&lt;/code&gt; label to skip inserting zero digit character when the quotient value is equal to 0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we explored how we can make a program in x86_64 system that has the same logic as the one from AArch64 in the previous post. Having two different systems to develop a code that performs the same result bring developers interesting challenges - we have to understand the different set of instructions and the way they perform. Also, debugging in both systems are difficult as we have to rely on either inspecting compiler error messages or using &lt;code&gt;objdump&lt;/code&gt; to disassemble the generated machine code.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring Assembler on the AArch64 Platform</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Mon, 21 Feb 2022 02:17:44 +0000</pubDate>
      <link>https://dev.to/seungwooji/exploring-assembler-on-the-x8664-and-aarch64-platforms-1p5i</link>
      <guid>https://dev.to/seungwooji/exploring-assembler-on-the-x8664-and-aarch64-platforms-1p5i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this post, we are going to investigate a simple code snippet, that loops a few times, in AArch64 system. &lt;/p&gt;

&lt;h3&gt;
  
  
  Original Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code does not really do anything special but just loop itself for given maximum number of times (max = 30).&lt;/p&gt;

&lt;p&gt;Let's improve this code a little bit and make it to print out message for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Message
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is much better than the original code and prints out something in the console. But, the message is not really meaningful us. Why don't we make it in a way that it prints out the loop number instead?&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Loop Number
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

// Inserting digit
    add    x18, x19, '0'        /* Create a digit character by adding a ascii value of '0' */
    adr    x17, msg+6           /* Pointer pointing to the pound sign in the msg */
    strb   w18, [x17]           /* Put the digit within the pound sign of the msg */

// Print message
    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop: #\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code finally prints out some meaningful messages to the console. Note that we use &lt;code&gt;strb&lt;/code&gt; instruction instead of &lt;code&gt;str&lt;/code&gt; because we only want to deal with a single character (1 byte) not a whole 64 bytes. As a result, we need to add &lt;code&gt;w&lt;/code&gt; prefix for the register as it is required to use this instruction.&lt;/p&gt;

&lt;p&gt;However, the code above only works for one digit number of loops. If the loop number is bigger than 10, the code would start printing out the non-numeric character because the numeric characters are defined between 48 and 57 in ASCII table. For this, we need to add additional lines of codes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Code - Print Two Digit Loop Number
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 15                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min
    mov     x20, 10

loop:

// Finding the tens digit
    udiv    x21, x19, x20       /* Divide by 10 */
    cmp     x21, 0
    b.eq    oneDigit            /* Skip to inser the tens digit if the quotient is equal to zero */

// Inserting the tens digit
    add     x18, x21, '0'       /* Create a digit character by adding a ascii value of '0' */
    adr     x17, msg+6          /* Pointer pointing to the pound sign in the msg */
    strb    w18, [x17]          /* Put the digit within the pound sign of the msg */

oneDigit:
// Finding the ones digit
    msub    x22, x20, x21, x19  /* Load x22 with the value of r19 - (r20 * r21) */

// Inserting the ones digit
    add     x18, x22, '0'       /* Create a digit character by adding a ascii value of '0' */
    adr     x17, msg+7          /* Pointer pointing to the pound sign in the msg */
    strb    w18, [x17]          /* Put the digit within the pound sign of the msg */

// Print message
    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1         /* increment by 1 */
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop:  #\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through the code. First of all, we divide the given &lt;code&gt;r19&lt;/code&gt; value of the loop by 10. We use the quotient to fill out the tens digit. Since &lt;code&gt;udiv&lt;/code&gt; instruction only gives the quotient value, we have to utilize another instruction to find the remainder and &lt;code&gt;msub&lt;/code&gt; instruction is exactly what we need for it. With given quotient and remainder, we just need to print out the numeric character to the screen but one important step remains. That is, we have to remove the leading zero for the &lt;code&gt;r19&lt;/code&gt; value less than 10. For this, we use another label called &lt;code&gt;oneDigit&lt;/code&gt; to skip inserting the tens digit if and only if the tens digit is equal to 0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we learned how to make a small code snippet to print out the number of loops in the screen. It's interesting to see the way AArch64 assembly works is strikingly similar to the one in 6502 system. In the next post, we will further investigate the same code snippet but with another popular system in the modern days, x86_64.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Simple Maze Game using 6502 Emulator Part 2</title>
      <dc:creator>Seung Woo (Paul) Ji</dc:creator>
      <pubDate>Mon, 14 Feb 2022 03:03:53 +0000</pubDate>
      <link>https://dev.to/seungwooji/a-simple-maze-game-using-6502-emulator-part-2-4doc</link>
      <guid>https://dev.to/seungwooji/a-simple-maze-game-using-6502-emulator-part-2-4doc</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the last post, we created a simple 6502 assembly code that generates a maze for a player to explore. Today, we are going to build codes on top of it in order to implement the rest of objectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Objectives
&lt;/h2&gt;

&lt;p&gt;For this game, we need to accomplish 3 more objectives as following:&lt;/p&gt;

&lt;p&gt;&lt;del&gt;1. The game must draw the maze in the bitmapped screen.&lt;/del&gt; (Done!)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A player must be able to use the keyboard to control.&lt;/li&gt;
&lt;li&gt;A player must find a route to reach to the goal within the maze in order to win the game.&lt;/li&gt;
&lt;li&gt;A player cannot goes through the wall.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Code
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; zero-page variables
define  ROW     $20 ; current row
define  COL     $21 ; current column
define  DRAWN_ROW   $22 ; number of drawn rows
define  MAZE_L      $14 ; a pointer that points to where the maze will 
define  MAZE_H      $15 ; be drawn
define  PLAYER_L    $10 ; a pointer that points to the player in the 
                ; screen
define  PLAYER_H    $11
define  TARGET_L    $12 ; a pointer that points to the target position
define  TARGET_H    $13 ; where the player wants to proceed

; constants
define  PATH        $03 ; path color
define  PLAYER      $0e ; player color
define  HEIGHT      7   ; height of the maze 
define  WIDTH       7   ; width of the maze

; ROM routine
define  SCINIT      $ff81 ; initialize/clear screen

        jsr printHelp
        jsr drawMaze
        jsr gameInit
        jsr gameLoop

printHelp:  ldy #$00    ; print instructions on the screen
pHelpLoop:  lda help,y
        beq done
        sta $f000,y
        iny
        bne pHelpLoop

gameInit:   lda #$01    ; initialize ROW, COL to make the player 
        sta ROW     ; starting at $0221 of the screen
        sta COL
        rts

gameLoop:   jsr updatePosition
        jsr getkey
        jsr checkCollision
        ldx #$00    ; clear out the key buffer
        stx $ff
        jmp gameLoop

updatePosition: ldy ROW     ; load PLAYER pointer with ROW 
        lda table_low,y
        sta PLAYER_L
        lda table_high,y
        sta PLAYER_H

        ldy COL     ; place the player at (POINTER + COL)
        lda #PLAYER
        sta (PLAYER_L),y
        rts

getkey:     lda $ff     ; get the input key

        cmp #$80    ; allow arrow keys only
        bmi getkey
        cmp #$84
        bpl getkey

        pha     ; save the accumulator
        lda #PATH   ; set color of the current position to PATH
        sta (PLAYER_L),y
        pla     ; restore accumulator

        cmp #$80    ; check key is up
        bne checkRight

        dec ROW     ; ... if yes, decrement ROW
        rts

checkRight: cmp #$81    ; check if key is right
        bne checkDown
        inc COL     ; ... if yes, increment COL
        rts

checkDown:  cmp #$82    ; check if key is down
        bne checkLeft
        inc ROW     ; ... if yes, increment ROW
        rts

checkLeft:  cmp #$83    ; check if key is left
        bne done
        dec COL     ; ... if yes, decrement COL
        rts

done:       rts     ; break out of a loop or subroutine

checkCollision: ldy ROW     ; load TARGET pointer with ROW 
        lda table_low,y
        sta TARGET_L
        lda table_high,y
        sta TARGET_H

        ldy COL     ; load the color from the target
        lda (TARGET_L),y; at (POINTER + COL)

        cmp #$01
        beq done
        cmp #$03
        beq done
        cmp #$0a
        beq gameComplete

        lda #$00
        sta (TARGET_L),y

        lda $ff
        cmp #$80    ; if input key was up...
        bne ifRight

        inc ROW     ; ... if yes, increment ROW
        rts

ifRight:    cmp #$81    ; if input key was right...
        bne ifDown

        dec COL     ; ... if yes, decrement COL
        rts

ifDown:     cmp #$82    ; if input key was down...
        bne ifLeft

        dec ROW     ; ... if yes, decrement ROW
        rts

ifLeft:     cmp #$83    ; if input key was left...
        bne done

        inc COL     ; ... if yes, increment COL
        rts

gameComplete:   jsr SCINIT
        ldy #$00    ; print game completion message on the screen 
pGameComplete:  lda complete,y
        beq done
        sta $f000,y
        iny
        bne pGameComplete
        brk

drawMaze:   lda #$21    ; a pointer pointing to the first pixel
        sta MAZE_L  ; of the screen
        lda #$02
        sta MAZE_H

        lda #$00    ; number of drawn rows
        sta DRAWN_ROW

        ldx #$00    ; maze data index
        ldy #$00    ; column index

draw:       lda maze_data,x
        sta (MAZE_L), y
        inx
        iny
        cpy #WIDTH  ; compare with the number of WIDTH
        bne draw    ; if not, keep drawing the column

        inc DRAWN_ROW   ; increment the number of row
        lda #HEIGHT
        cmp DRAWN_ROW   ; compare with the number of HEIGHT
        beq done

        lda MAZE_L
        clc
        adc #$20    ; add 32(0x0020) to increment the row
        sta MAZE_L  ; of the pixel
        lda MAZE_H
        adc #$00
        sta MAZE_H

        ldy #$00    ; reset the column index for the new row
        beq draw            

; help text message
help:
dcb "P","l","a","y",32,"w","i","t","h",32,"a","r","r","o","w"
dcb 32,"k","e","y","s",32,"t","o",32,"c","o","n","t","r","o","l",10
dcb 00

; game complete message
complete:
dcb "Y","o","u",32,"b","e","a","t",32
dcb "t","h","e",32,"g","a","m","e","!"
dcb 00

; maze map data
maze_data:
dcb 01,00,01,00,01,01,01
dcb 01,01,01,00,00,00,01
dcb 00,00,01,00,01,00,01
dcb 01,00,01,00,01,01,01
dcb 01,00,01,00,01,00,01
dcb 01,00,01,00,01,00,01
dcb 01,01,01,01,01,00,10

; these two tables contain the high and low bytes
; of the addresses of the start of each row
table_high:
dcb $02,$02,$02,$02,$02,$02,$02,$02
dcb $03,$03,$03,$03,$03,$03,$03,$03
dcb $04,$04,$04,$04,$04,$04,$04,$04
dcb $05,$05,$05,$05,$05,$05,$05,$05

table_low:
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4dg3m5euhpoiecz5onk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4dg3m5euhpoiecz5onk.png" alt="game_screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through the code together. First of all, we print out the helpful instruction on the text screen with &lt;code&gt;printHelp&lt;/code&gt; subroutine. Then, we draw a maze using the &lt;code&gt;drawMaze&lt;/code&gt; subroutine we created from the last post. Having a maze for a player to explore on the screen, we need to first set the game state with the initial player position on the screen &lt;code&gt;$#0221&lt;/code&gt;. After that, we call the subroutine called &lt;code&gt;gameLoop&lt;/code&gt; which constantly loops itself.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7elx3xsa2haiil4p4mlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7elx3xsa2haiil4p4mlg.png" alt="text_help_message"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;gameLoop&lt;/code&gt; itself consists of a number of subroutines. The first one is &lt;code&gt;updatePosition&lt;/code&gt;. This subroutine loads the player pointer with the given the row and column information so that we can place the player on the screen. Afterwards, we call the &lt;code&gt;getKey&lt;/code&gt; subroutine to receive the player input from the keyboard. We limit the keyboard input by only accepting arrow keystrokes. Once we receive a key input, we update the number of column and row accordingly. Then, we check if the position the player wants to move is a wall by using the &lt;code&gt;checkCollision&lt;/code&gt; subroutine. If the player hits by the wall, we simply retract the move. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh2c0qtv6200gmme4o4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh2c0qtv6200gmme4o4u.png" alt="game_screen_2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once reaching to the goal, the screen will congratulate the player with the text message.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l9b79xcx8z3q0h0zed2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l9b79xcx8z3q0h0zed2.png" alt="game_screen_3n"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn06wiaz3hrwh8nqiyfp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn06wiaz3hrwh8nqiyfp2.png" alt="text_game_completed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Making a simple maze using the 6502 assembly language definitely is harder and more time-consuming as compared to other high-level languages. The game we explored together is also not polished and needs a lot of improvements as well (such as having an alert when the player wants to proceed into the wall). we could use. Yet, this experience gives us a very meaningful insight as to how the game really works under the hood. &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
