<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: gus</title>
    <description>The latest articles on DEV Community by gus (@gusmccallum).</description>
    <link>https://dev.to/gusmccallum</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F701032%2Fd506080b-4518-4ee6-b4f7-c545ff0a48ff.jpeg</url>
      <title>DEV Community: gus</title>
      <link>https://dev.to/gusmccallum</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gusmccallum"/>
    <language>en</language>
    <item>
      <title>Adding SVE2 Support to an Open Source Library - Part III</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Fri, 22 Apr 2022 19:43:19 +0000</pubDate>
      <link>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-iii-4bac</link>
      <guid>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-iii-4bac</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-i-349m"&gt;Part 1&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-ii-ali"&gt;Part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-iii-4bac"&gt;Part 3&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;In my last post I ran into some snags at the end when building &lt;a href="https://www.opus-codec.org/"&gt;opus&lt;/a&gt;, apparently some of the intrinsics I wrote for the file I modified errored out and as such I wasn't able to build and test the library. In this post, I'm going to change tactics and try autovectorization to see if I can successfully build and test the library, after which I'll give some analysis on the results. &lt;/p&gt;

&lt;p&gt;First off I'll start by clearing my work so far and downloading a fresh copy of the library. At this point I need to configure and build, but in order to prevent the NEON intrinsics from conflicting with the autovectorization I'm going to implement I'll need to turn off NEON support in the &lt;code&gt;configure.ac&lt;/code&gt; file. I searched for mentions of intrinsics and turned them off, and then ran &lt;code&gt;autogen.sh&lt;/code&gt; and &lt;code&gt;configure&lt;/code&gt; to get the build configured. We can confirm intrinsics are now turned off by the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;------------------------------------------------------------------------
  opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

    C99 var arrays: ................ yes
    C99 lrintf: .................... yes
    Use alloca: .................... no (using var arrays)

    General configuration:

    Floating point support: ........ yes
    Fast float approximations: ..... no
    Fixed point debugging: ......... no
    Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
    External Assembly Optimizations:  
    Intrinsics Optimizations: ...... no
    Run-time CPU detection: ........ no
    Custom modes: .................. no
    Assertion checking: ............ no
    Hardening: ..................... yes
    Fuzzing: ....................... no
    Check ASM: ..................... no

    API documentation: ............. yes
    Extra programs: ................ yes
------------------------------------------------------------------------

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now by subbing the CFLAGS mentioned in the last post (&lt;code&gt;-O3 -march=armv8-a+sve2&lt;/code&gt;) into the makefile and taking care to run the build with the &lt;code&gt;qemu-aarch64&lt;/code&gt; argument, we can see that the build and most of the tests execute successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 448983 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
./test-driver: line 107: 449031 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: celt/tests/test_unit_mdct
./test-driver: line 107: 449046 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
./test-driver: line 107: 449072 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
./test-driver: line 107: 449716 Illegal instruction     (core dumped) "$@" &amp;gt; $log_file 2&amp;gt;&amp;amp;1
FAIL: tests/test_opus_projection
======================================================
   opus 1.3.1-107-gccaaffa9-dirty: ./test-suite.log
======================================================

# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: celt/tests/test_unit_cwrs32
=================================

FAIL celt/tests/test_unit_cwrs32 (exit status: 132)

FAIL: celt/tests/test_unit_dft
==============================

FAIL celt/tests/test_unit_dft (exit status: 132)

FAIL: celt/tests/test_unit_mdct
===============================

FAIL celt/tests/test_unit_mdct (exit status: 132)

FAIL: celt/tests/test_unit_rotation
===================================

FAIL celt/tests/test_unit_rotation (exit status: 132)

FAIL: silk/tests/test_unit_LPC_inv_pred_gain
============================================

FAIL silk/tests/test_unit_LPC_inv_pred_gain (exit status: 132)

FAIL: tests/test_opus_projection
================================

FAIL tests/test_opus_projection (exit status: 132)

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0
============================================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at one of the tests that successfully made use of the SVE2 inclusion:&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Opus Encode Test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 3135156945 (95E3)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode    LP FB encode  VBR,  11318 bps OK.
    Mode    LP FB encode  VBR,  14930 bps OK.
    Mode    LP FB encode  VBR,  67659 bps OK.
    Mode Hybrid FB encode  VBR,  17712 bps OK.
    Mode Hybrid FB encode  VBR,  51200 bps OK.
    Mode Hybrid FB encode  VBR,  80954 bps OK.
    Mode Hybrid FB encode  VBR, 127480 bps OK.
    Mode   MDCT FB encode  VBR, 752629 bps OK.
    Mode   MDCT FB encode  VBR,  25609 bps OK.
    Mode   MDCT FB encode  VBR,  33107 bps OK.
    Mode   MDCT FB encode  VBR,  78592 bps OK.
    Mode   MDCT FB encode  VBR,  73157 bps OK.
    Mode   MDCT FB encode  VBR, 137477 bps OK.
    Mode    LP FB encode CVBR,  11480 bps OK.
    Mode    LP FB encode CVBR,  21257 bps OK.
    Mode    LP FB encode CVBR,  63201 bps OK.
    Mode Hybrid FB encode CVBR,  25583 bps OK.
    Mode Hybrid FB encode CVBR,  36126 bps OK.
    Mode Hybrid FB encode CVBR,  54107 bps OK.
    Mode Hybrid FB encode CVBR, 108482 bps OK.
    Mode   MDCT FB encode CVBR, 934758 bps OK.
    Mode   MDCT FB encode CVBR,  25111 bps OK.
    Mode   MDCT FB encode CVBR,  33929 bps OK.
    Mode   MDCT FB encode CVBR,  52270 bps OK.
    Mode   MDCT FB encode CVBR,  79059 bps OK.
    Mode   MDCT FB encode CVBR, 117366 bps OK.
    Mode    LP FB encode  CBR,   7432 bps OK.
    Mode    LP FB encode  CBR,  16781 bps OK.
    Mode    LP FB encode  CBR,  90950 bps OK.
    Mode Hybrid FB encode  CBR,  18257 bps OK.
    Mode Hybrid FB encode  CBR,  37925 bps OK.
    Mode Hybrid FB encode  CBR,  56473 bps OK.
    Mode Hybrid FB encode  CBR,  78233 bps OK.
    Mode   MDCT FB encode  CBR, 780220 bps OK.
    Mode   MDCT FB encode  CBR,  20668 bps OK.
    Mode   MDCT FB encode  CBR,  38398 bps OK.
    Mode   MDCT FB encode  CBR,  74376 bps OK.
    Mode   MDCT FB encode  CBR,  68468 bps OK.
    Mode   MDCT FB encode  CBR, 141108 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,   4884 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  18110 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  44628 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  15245 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  26620 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  61885 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  86977 bps OK.
    Mode    LP NB dual-mono MS encode  VBR, 119885 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,   7123 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19106 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  41453 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  10135 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19040 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  57693 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  77731 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR, 165272 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,   7245 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  16460 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  56065 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  13411 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  28783 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  61638 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  92219 bps OK.
    Mode    LP NB dual-mono MS encode CVBR, 110936 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,   4047 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  21622 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  43253 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  12557 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  28091 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  57473 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  77203 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR, 154714 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,   4000 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  12396 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  56699 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  10327 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  19576 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  36651 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  50625 bps OK.
    Mode    LP NB dual-mono MS encode  CBR, 122376 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   4916 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14647 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  55741 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12307 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  23408 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  62311 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  54876 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 104358 bps OK.
    All framesize pairs switching encode, 9810 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can inspect the encoding program and see how it makes use of SVE2 instructions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X | grep whilelo ;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lines in question are too numerous to put here but the files affected are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;======== ./tests/test_opus_projection
======== ./tests/.libs/test_opus_encode
======== ./tests/.libs/test_opus_api
======== ./tests/.libs/test_opus_decode
======== ./celt/tests/test_unit_entropy
======== ./celt/tests/test_unit_cwrs32
======== ./celt/tests/test_unit_mathops
======== ./celt/tests/test_unit_rotation
======== ./celt/tests/test_unit_dft
======== ./celt/tests/test_unit_mdct
======== ./.libs/opus_demo
======== ./.libs/libopus.so.0.8.0
======== ./.libs/trivial_example
======== ./opus_compare
======== ./silk/tests/test_unit_LPC_inv_pred_gain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a line count with &lt;code&gt;find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X 2&amp;gt; /dev/null | grep whilelo ; done | wc -l&lt;/code&gt; returns 2903 instances of whilelo. I'll zero in on one of these files to see how it makes use of its SVE2 instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing Opus Encode Test
&lt;/h2&gt;

&lt;p&gt;I'll go back to the encode test I ran before and take a look at how it's using its SVE2 instructions now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;objdump -d test_opus_encode &amp;gt; ~/opus_encode_objdump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In searching around the output I can find 6 instances of whilelo at play here, the first 2 being in this &lt;code&gt;&amp;lt;generate_music&amp;gt;&lt;/code&gt; section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000000004016b0 &amp;lt;generate_music&amp;gt;:
  4016b0:       d2800002        mov     x2, #0x0                        // #0
  4016b4:       d282d003        mov     x3, #0x1680                     // #5760
  4016b8:       2538c000        mov     z0.b, #0
  4016bc:       25631fe0        whilelo p0.h, xzr, x3
  4016c0:       e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]
  4016c4:       0470e3e2        inch    x2
  4016c8:       25631c40        whilelo p0.h, x2, x3
  4016cc:       54ffffa1        b.ne    4016c0 &amp;lt;generate_music+0x10&amp;gt;  // b.any
  4016d0:       712d003f        cmp     w1, #0xb40
  4016d4:       54000e4d        b.le    40189c &amp;lt;generate_music+0x1ec&amp;gt;
  4016d8:       a9bb7bfd        stp     x29, x30, [sp, #-80]!
  4016dc:       f000017e        adrp    x30, 430000 &amp;lt;memcpy@GLIBC_2.17&amp;gt;
  4016e0:       910593de        add     x30, x30, #0x164
  4016e4:       910003fd        mov     x29, sp
  4016e8:       a90153f3        stp     x19, x20, [sp, #16]
  4016ec:       d285a002        mov     x2, #0x2d00                     // #11520
  4016f0:       52955571        mov     w17, #0xaaab                    // #43691
  4016f4:       294093d4        ldp     w20, w4, [x30, #4]
  4016f8:       52955550        mov     w16, #0xaaaa                    // #43690
  4016fc:       8b020002        add     x2, x0, x2
  401700:       52800006        mov     w6, #0x0                        // #0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So let's break down what it's doing here. &lt;a href="https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-"&gt;Whilelo&lt;/a&gt; is a loop that's taking scalable predicate register &lt;code&gt;p0.h&lt;/code&gt; as its first argument (the destination register), and increments until the second argument - the value in register &lt;code&gt;xzr&lt;/code&gt; is lower than the value in register &lt;code&gt;x3&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  4016bc:       25631fe0        whilelo p0.h, xzr, x3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While that condition is true, the program performs a &lt;a href="https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/ST1H--scalar-plus-scalar---Contiguous-store-halfwords-from-vector--scalar-index--"&gt;st1h&lt;/a&gt;, or a contiguous store halfwords from vector, with a scalar index as its argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 4016c0:    e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then increments &lt;code&gt;x2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  4016c4:       0470e3e2        inch    x2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this helps us understand the mechanics of what's being called and why, what function does this serve in the program? The source code can give us some clues in a language that's easier to parse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   /* Generate input data */
   inbuf = (opus_int16*)malloc(sizeof(*inbuf)*SSAMPLES);
   generate_music(inbuf, SSAMPLES/2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see here that &lt;code&gt;generate_music&lt;/code&gt; is a function that, much like the &lt;code&gt;vol_createsample&lt;/code&gt; function in &lt;a href="https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-i-5ff6"&gt;lab 5&lt;/a&gt; creates dummy data to operate on and test the encoding utility. Looking at the function definition in full:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void generate_music(short *buf, opus_int32 len)
{
   opus_int32 a1,b1,a2,b2;
   opus_int32 c1,c2,d1,d2;
   opus_int32 i,j;
   a1=b1=a2=b2=0;
   c1=c2=d1=d2=0;
   j=0;
   /*60ms silence*/
   for(i=0;i&amp;lt;2880;i++)buf[i*2]=buf[i*2+1]=0;
   for(i=2880;i&amp;lt;len;i++)
   {
    opus_uint32 r;
    opus_int32 v1,v2;
    v1=v2=(((j*((j&amp;gt;&amp;gt;12)^((j&amp;gt;&amp;gt;10|j&amp;gt;&amp;gt;12)&amp;amp;26&amp;amp;j&amp;gt;&amp;gt;7)))&amp;amp;128)+128)&amp;lt;&amp;lt;15;
    r=fast_rand();v1+=r&amp;amp;65535;v1-=r&amp;gt;&amp;gt;16;
    r=fast_rand();v2+=r&amp;amp;65535;v2-=r&amp;gt;&amp;gt;16;
    b1=v1-a1+((b1*61+32)&amp;gt;&amp;gt;6);a1=v1;
    b2=v2-a2+((b2*61+32)&amp;gt;&amp;gt;6);a2=v2;
    c1=(30*(c1+b1+d1)+32)&amp;gt;&amp;gt;6;d1=b1;
    c2=(30*(c2+b2+d2)+32)&amp;gt;&amp;gt;6;d2=b2;
    v1=(c1+128)&amp;gt;&amp;gt;8;
    v2=(c2+128)&amp;gt;&amp;gt;8;
    buf[i*2]=v1&amp;gt;32767?32767:(v1&amp;lt;-32768?-32768:v1);
    buf[i*2+1]=v2&amp;gt;32767?32767:(v2&amp;lt;-32768?-32768:v2);
    if(i%6==0)j++;
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the entire function is essentially two loops, so it makes sense that we would be able to take advantage of whilelo to squeeze some more performance out of it. Using SIMD in this way allows multiple iterations of the &lt;code&gt;generate_music&lt;/code&gt; function to run simultaneously, which should speed up the performance greatly.  &lt;/p&gt;

&lt;p&gt;With that in mind, it would be interesting to see if there are loops in the source code that didn't get converted to SVE2 instructions and ascertain why. One such example is in main, which I'll show the first part of for context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main(int _argc, char **_argv)
{
   int args=1;
   char * strtol_str=NULL;
   const char * oversion;
   const char * env_seed;
   int env_used;
   int num_encoders_to_fuzz=5;
   int num_setting_changes=40;

   env_used=0;
   env_seed=getenv("SEED");
   if(_argc&amp;gt;1)
    iseed=strtol(_argv[1], &amp;amp;strtol_str, 10);  /* the first input argument might be the seed */
   if(strtol_str!=NULL &amp;amp;&amp;amp; strtol_str[0]=='\0')   /* iseed is a valid number */
    args++;
   else if(env_seed) {
    iseed=atoi(env_seed);
    env_used=1;
   }
   else iseed=(opus_uint32)time(NULL)^(((opus_uint32)getpid()&amp;amp;65535)&amp;lt;&amp;lt;16);
   Rw=Rz=iseed;

while(args&amp;lt;_argc)
   {
    if(strcmp(_argv[args], "-fuzz")==0 &amp;amp;&amp;amp; _argc==(args+3)) {
        num_encoders_to_fuzz=strtol(_argv[args+1], &amp;amp;strtol_str, 10);
        if(strtol_str[0]!='\0' || num_encoders_to_fuzz&amp;lt;=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        num_setting_changes=strtol(_argv[args+2], &amp;amp;strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes&amp;lt;=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }
   }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The while loop here iterates through the command line arguments &lt;code&gt;argc&lt;/code&gt;, and the logic within checks for the validity of the arguments. The correct way to call the encoding test is in the format &lt;code&gt;/test_opus_encode [&amp;lt;seed&amp;gt;] [-fuzz &amp;lt;num_encoders&amp;gt; &amp;lt;num_settings_per_encoder&amp;gt;]&lt;/code&gt;. Disassembled, the first loop section looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  4012f4:       97ffff7f        bl      4010f0 &amp;lt;strcmp@plt&amp;gt;
  4012f8:       350001e0        cbnz    w0, 401334 &amp;lt;main+0x134&amp;gt;
  4012fc:       11000e73        add     w19, w19, #0x3
  401300:       6b14027f        cmp     w19, w20
  401304:       54000181        b.ne    401334 &amp;lt;main+0x134&amp;gt;  // b.any
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can tell from the reference to &lt;code&gt;&amp;lt;strcmp@plt&amp;gt;&lt;/code&gt; that this is where the loop's first condition is evaluated, with the string comparison between the current command line argument and "-fuzz" taking place. So why isn't this loop vectorized? Let's break it down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while(args&amp;lt;_argc)
   {
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;args&lt;/code&gt; is initialized to 1. The while loop executes as long as &lt;code&gt;args&lt;/code&gt; is less than &lt;code&gt;argc&lt;/code&gt; (&lt;code&gt;argc&lt;/code&gt; is the number of command line argument provided when invoking the program).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if(strcmp(_argv[args], "-fuzz")==0 &amp;amp;&amp;amp; _argc==(args+3)) {
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first condition evaluated is if the argument is the string "-fuzz".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        num_encoders_to_fuzz=strtol(_argv[args+1], &amp;amp;strtol_str, 10);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it is and the number of arguments is 4, the number of encoders to fuzz is set with the next argument and execution moves to evaluation of the next condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        if(strtol_str[0]!='\0' || num_encoders_to_fuzz&amp;lt;=0) {
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;strtol_str[0]&lt;/code&gt; (the character following a number from the &lt;code&gt;_argv[args+1]&lt;/code&gt; string that was just parsed) is not a null terminating character or the &lt;code&gt;num_encoders_to_fuzz&lt;/code&gt; is less than or equal to zero - that is to say there are characters in the arguments when there should only be numbers at this point, or the number of encoders to fuzz was improperly set - then print the proper usage of the invocation arguments and exit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if(strtol_str[0]!='\0' || num_encoders_to_fuzz&amp;lt;=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Otherwise, continue evaluating the command line arguments and check if the &lt;code&gt;num_setting_changes&lt;/code&gt; is set properly by the third argument using the same logic of the previous condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;num_setting_changes=strtol(_argv[args+2], &amp;amp;strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes&amp;lt;=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this is true, increment &lt;code&gt;args&lt;/code&gt; by 3. Otherwise, exit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;args&lt;/code&gt; increment at the end will make the while condition evaluate false, so all this to say - the loop only evaluates once so it makes sense that SVE2 instructions wouldn't apply here. There would be no benefit to simultaneously running a loop that can only execute once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, it's been interesting looking at how SVE2 optimization can benefit an open source library. This is a cool technology that will no doubt become pervasive very quickly and  have widespread benefits, especially for large data processing libraries such as this. I explored some different ways to make use of it through compiler intrinsics as well as autovectorization, some attempts were challenging and less fruitful while others seemed to find purchase and successfully optimize opus' encoding functionality. I broke down some code that was optimized and some that wasn't and the reasons why, and gave a closer look at the disassembled code compared to its source to see how the compiler implements SVE2 for us and why. &lt;/p&gt;

&lt;p&gt;I hope my work can be useful to those interested in implementing SVE2 in their own projects, or to the maintainers of the opus project. The latter might find those tests that I couldn't get to pass with autovectorization to be a good place to start, as the "core dump" error message means that the &lt;code&gt;qemu-aarch64&lt;/code&gt; argument wasn't applied to those tests at runtime as I couldn't determine how to apply it in those cases. Doing so would likely cause all tests to pass and allow the entire library to take advantage of SVE2. &lt;/p&gt;

&lt;p&gt;This project and this course at large have been very useful in changing my perspective on programming and allowed me to get much closer to the metal than I have before. It's cleared up many misconceptions about how computers treat data - to paraphrase my professor, "Your other teachers probably told you  variables are stored in memory - they lied." This project and course have been full of little epiphanies like that that I think have been influential in refining my concept of programming and I'm glad I was able to have this experience before graduating. Thanks for reading. &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>sve2</category>
    </item>
    <item>
      <title>Optimizing a Program Through SVE2 Auto-Vectorization</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Thu, 21 Apr 2022 22:44:00 +0000</pubDate>
      <link>https://dev.to/gusmccallum/optimizing-a-program-through-sve2-auto-vectorization-49o9</link>
      <guid>https://dev.to/gusmccallum/optimizing-a-program-through-sve2-auto-vectorization-49o9</guid>
      <description>&lt;p&gt;Today I'm going to be taking another look at the volume scaling algorithms we benchmarked in &lt;a href="https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-ii-1ogd"&gt;my last post&lt;/a&gt; with the goal of adding SVE2 optimization and further improving the runtime. Because we're using SVE2 we need to make these changes on either vol4.c or vol5.c, as those are the AArch64-specific algorithms that take advantage of inline assembly and intrinsics, respectively. &lt;/p&gt;

&lt;p&gt;To make things simple I'll use the first candidate, vol4.c, which uses inline assembly. The full code is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor &amp;lt; limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To start, we need to include the relevant library by adding an include.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include "vol.h"
#include &amp;lt;time.h&amp;gt;
#include &amp;lt;arm_sve.h&amp;gt;

#ifndef __aarch64__
        printf("Wrong architecture- written for aarch64 only.\n");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I changed the duplicate instruction's destination to the z register as per the &lt;a href="https://developer.arm.com/documentation/102340/0001/SVE2-architecture-fundamentals?lang=en"&gt;SVE2 standard&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__asm__ ("dup z1.h,%w0"::"r"(vol_int)); //duplicate vol_int into z1.h
...
"sqrdmulh z0.h, z0.h, z1.h      \n\t"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next the makefile that we use to build the program needs to be changed to trigger the use of SVE2 by the compiler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vol4:    vol4.c vol_createsample.o vol.h
         gcc ${CCOPTS} vol4.c -march=armv8-a+sve2 vol_createsample.o -o vol4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally, when running it we need to make sure to add the &lt;code&gt;qemu-aarch64&lt;/code&gt; argument to specify that we'll be emulating the appropriate hardware to run SVE2, as the real thing isn't available to us yet. I ran it with the following command and confirmed it worked as intended.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qemu-aarch64 ./vol4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has been a quick exploration of making use of autovectorization to implement SVE2 in a program. Enjoy!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>sve2</category>
      <category>simd</category>
    </item>
    <item>
      <title>Adding SVE2 Support to an Open Source Library - Part II</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Tue, 12 Apr 2022 01:27:38 +0000</pubDate>
      <link>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-ii-ali</link>
      <guid>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-ii-ali</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-i-349m"&gt;Part 1&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-ii-ali"&gt;Part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-iii-4bac"&gt;Part 3&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;In the last entry in this series I found a library called &lt;a href="https://www.opus-codec.org/"&gt;opus&lt;/a&gt; which currently uses SIMD by way of compiler intrinsics. Today I'm implementing SVE2 optimization in this library.&lt;/p&gt;

&lt;p&gt;My first step will be swapping out the SIMD intrinsics in a file for their &lt;a href="https://developer.arm.com/documentation/100987/0000/"&gt;SVE2 counterparts&lt;/a&gt;. Then I can modify the makefile to detect when it's appropriate to use those enhancements and compile them accordingly. If a machine can't support SVE2, there's no use compiling that code.&lt;/p&gt;

&lt;p&gt;By performing a search for "neon" in the package we get the following results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find | grep neon

./celt/arm/pitch_neon_intr.lo
./celt/arm/celt_neon_intr.lo
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.o
./celt/arm/pitch_neon_intr.c
./celt/arm/celt_neon_intr.o
./celt/arm/.libs/pitch_neon_intr.o
./celt/arm/.libs/celt_neon_intr.o
./celt/arm/.deps/pitch_neon_intr.Plo
./celt/arm/.deps/celt_neon_intr.Plo
./silk/fixed/arm/.deps/warped_autocorrelation_FIX_neon_intr.Plo
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/biquad_alt_neon_intr.lo
./silk/arm/NSQ_neon.c
./silk/arm/NSQ_del_dec_neon_intr.o
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.lo
./silk/arm/NSQ_neon.h
./silk/arm/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/NSQ_del_dec_neon_intr.o
./silk/arm/.libs/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/biquad_alt_neon_intr.o
./silk/arm/.libs/NSQ_neon.o
./silk/arm/LPC_inv_pred_gain_neon_intr.lo
./silk/arm/.deps/NSQ_neon.Plo
./silk/arm/.deps/NSQ_del_dec_neon_intr.Plo
./silk/arm/.deps/LPC_inv_pred_gain_neon_intr.Plo
./silk/arm/.deps/biquad_alt_neon_intr.Plo
./silk/arm/biquad_alt_neon_intr.o
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.lo
./silk/arm/NSQ_neon.o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like there's a lot to work with here - unfortunately we don't have time to add SVE2 intrinsics to all these files so we'll have to narrow in on one file or even a section of a file to start with, which the maintainers can use as a jumping off point for future optimization. In the last post I'd mentioned one file in particular, &lt;code&gt;opus/celt/arm/pitch_neon_intr.c&lt;/code&gt;. I'll start there and see what I can do.&lt;/p&gt;

&lt;p&gt;First we'll include the appropriate header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifdef __ARM_FEATURE_SVE
#include &amp;lt;arm_sve.h&amp;gt;
#endif /* __ARM_FEATURE_SVE */ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starting with the first loop we encounter, the code is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;opus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{

int i;
    opus_val32 xy;
    int16x8_t x_s16x8, y_s16x8;
    int32x4_t xy_s32x4 = vdupq_n_s32(0);
    int64x2_t xy_s64x2;
    int64x1_t xy_s64x1;

    for (i = 0; i &amp;lt; N - 7; i += 8) {
        x_s16x8  = vld1q_s16(&amp;amp;x[i]);
        y_s16x8  = vld1q_s16(&amp;amp;y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_low_s16 (x_s16x8), vget_low_s16 (y_s16x8));
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_high_s16(x_s16x8), vget_high_s16(y_s16x8));
    }

for (; i &amp;lt; N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By looking up the intrinsics in the &lt;a href="https://developer.arm.com/architectures/instruction-sets/intrinsics"&gt;instruction set&lt;/a&gt; arm provides, we can quickly find out what the Neon intrinsics represent and determine their SVE2 counterparts. &lt;/p&gt;

&lt;p&gt;We start with initializations, including one initialization to the result of &lt;code&gt;vdupq_n_s32&lt;/code&gt; - which sets all lanes of the register to the same value. The SVE2 version of this is &lt;code&gt;svdup_lane&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The first intrinsic in the loop, &lt;code&gt;vld1q_s16&lt;/code&gt;, can load multiple elements to multiple registers. In this case, it loads &lt;code&gt;x_s16x8&lt;/code&gt; with the value from &lt;code&gt;&amp;amp;x[i]&lt;/code&gt;. It's followed by another of the same type which loads &lt;code&gt;y_s16x8&lt;/code&gt; with the value from &lt;code&gt;&amp;amp;y[i]&lt;/code&gt;. The SVE2 version of this is &lt;code&gt;svldnf1sh_32&lt;/code&gt;. Next there are two multiplications between the low portions of x and y and then the high portions using the &lt;code&gt;vmlal_s16&lt;/code&gt; instruction. The SVE versions of these are &lt;code&gt;svpmullb&lt;/code&gt; and &lt;code&gt;svpmullt&lt;/code&gt; respectively, for the bottom and top halves. We also need to call &lt;code&gt;vget_low_s16&lt;/code&gt; and &lt;code&gt;vget_high_s16&lt;/code&gt;, or rather their SVE2 counterparts: &lt;code&gt;svunpklo&lt;/code&gt; and &lt;code&gt;svunpkhi&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After making all the aforementioned adjustments, here's what we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifdef __ARM_FEATURE_SVE2
pus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{
    int i;
    opus_val32 xy;
    svint16_t x_s16x8, y_s16x8;
    svint32_t xy_s32x4 = svdup_lane(0);
    svint64_t xy_s64x2;
    svint64_t xy_s64x1;

    for (i = 0; i &amp;lt; N - 7; i += 8) {
        x_s16x8  = svldnf1sh_s32(&amp;amp;x[i]);
        y_s16x8  = svldnf1sh_s32(&amp;amp;y[i]);
        xy_s32x4 = svpmullb(xy_s32x4, svunpklo (x_s16x8), svunpklo (y_s16x8));
        xy_s32x4 = svpmullb(xy_s32x4, svunpkhi (x_s16x8), svunpkhi (y_s16x8));
    }

    if (N - i &amp;gt;= 4) {
        const int16x4_t x_s16x4 = vld1_s16(&amp;amp;x[i]);
        const int16x4_t y_s16x4 = vld1_s16(&amp;amp;y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, x_s16x4, y_s16x4);
        i += 4;
    }

    xy_s64x2 = vpaddlq_s32(xy_s32x4);
    xy_s64x1 = vadd_s64(vget_low_s64(xy_s64x2), vget_high_s64(xy_s64x2));
    xy      = vget_lane_s32(vreinterpret_s32_s64(xy_s64x1), 0);

    for (; i &amp;lt; N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }
#endif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all we have to do is see if we can compile and run it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CCASFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
CCDEPMODE = depmode=gcc3
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added the relevant compile flags to turn on SVE2 optimization and gave it a go - unfortunately there were some build errors that would have to be dealt with so in my next post I'll go over next steps to solve those and continue building SVE2 optimizations into this package. More on that soon!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>sve2</category>
    </item>
    <item>
      <title>Algorithm Selection on x86_64 vs AArch64 Part II</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Sun, 10 Apr 2022 21:47:16 +0000</pubDate>
      <link>https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-ii-1ogd</link>
      <guid>https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-ii-1ogd</guid>
      <description>&lt;p&gt;This is part 2 of a series on algorithm benchmarking and selection on x86_64 and AArch64 systems. You can find part 1 &lt;a href="https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-i-5ff6"&gt;here&lt;/a&gt;. In the previous post we went through the algorithms we're to benchmark and broke some of their workings down, providing predictions along the way as to how they would stack up. Now it's time to put them to the test and see which comes out on top. &lt;/p&gt;

&lt;p&gt;You may have noticed there was a gap in the numbering of the algorithms, between vol2.c and vol4.c. vol3.c is a dummy program provided to us without the volume scaling algorithm, so we can isolate the performance of that one function. Alternatively, we can do so with code by including the C time library and timing the scaling function. This method is less error prone so I'll be benchmarking the algorithms in this way. &lt;/p&gt;

&lt;p&gt;The first step is to increase the sample size in our header to work with a substantial enough dataset in our benchmarking to get some meaningful results. I cranked up the sample number to 1600000000, after which I got to work inserting the timing code into each of the programs.&lt;/p&gt;

&lt;p&gt;For example vol0.c looks like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        clock_t t;
        t = clock();

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

        t = clock() - t;

        printf("Time elapsed: %f\n", ((double)t)/CLOCKS_PER_SEC);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then ran it in a loop to execute 20 times and send the output to a log file like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for ((i = 0; i &amp;lt; 20; i++)) ; do ./vol0 ; done |&amp;amp;tee vol0output.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After following these steps for all the programs I had my results for the AArch64 system ready to compare. I did the same for the x86_64 algorithms, omitting the last two algorithms that use SIMD as they won't run on that architecture. The results are as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  AArch64 Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Vol0.c&lt;/th&gt;
&lt;th&gt;Vol1.c&lt;/th&gt;
&lt;th&gt;Vol2.c&lt;/th&gt;
&lt;th&gt;Vol4.c&lt;/th&gt;
&lt;th&gt;Vol5.c&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time (s)&lt;/td&gt;
&lt;td&gt;5.286&lt;/td&gt;
&lt;td&gt;4.644&lt;/td&gt;
&lt;td&gt;11.257&lt;/td&gt;
&lt;td&gt;2.756&lt;/td&gt;
&lt;td&gt;2.837&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.251&lt;/td&gt;
&lt;td&gt;4.587&lt;/td&gt;
&lt;td&gt;11.258&lt;/td&gt;
&lt;td&gt;2.776&lt;/td&gt;
&lt;td&gt;2.777&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.295&lt;/td&gt;
&lt;td&gt;4.623&lt;/td&gt;
&lt;td&gt;11.226&lt;/td&gt;
&lt;td&gt;2.766&lt;/td&gt;
&lt;td&gt;2.803&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.277&lt;/td&gt;
&lt;td&gt;4.573&lt;/td&gt;
&lt;td&gt;11.239&lt;/td&gt;
&lt;td&gt;2.784&lt;/td&gt;
&lt;td&gt;2.784&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.287&lt;/td&gt;
&lt;td&gt;4.603&lt;/td&gt;
&lt;td&gt;11.25&lt;/td&gt;
&lt;td&gt;2.757&lt;/td&gt;
&lt;td&gt;2.801&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.283&lt;/td&gt;
&lt;td&gt;4.568&lt;/td&gt;
&lt;td&gt;11.229&lt;/td&gt;
&lt;td&gt;2.796&lt;/td&gt;
&lt;td&gt;2.787&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.283&lt;/td&gt;
&lt;td&gt;4.581&lt;/td&gt;
&lt;td&gt;11.234&lt;/td&gt;
&lt;td&gt;2.74&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.311&lt;/td&gt;
&lt;td&gt;4.566&lt;/td&gt;
&lt;td&gt;11.244&lt;/td&gt;
&lt;td&gt;2.782&lt;/td&gt;
&lt;td&gt;2.806&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.287&lt;/td&gt;
&lt;td&gt;4.601&lt;/td&gt;
&lt;td&gt;11.233&lt;/td&gt;
&lt;td&gt;2.848&lt;/td&gt;
&lt;td&gt;2.796&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.244&lt;/td&gt;
&lt;td&gt;4.639&lt;/td&gt;
&lt;td&gt;11.244&lt;/td&gt;
&lt;td&gt;2.756&lt;/td&gt;
&lt;td&gt;2.755&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.279&lt;/td&gt;
&lt;td&gt;4.558&lt;/td&gt;
&lt;td&gt;11.239&lt;/td&gt;
&lt;td&gt;2.744&lt;/td&gt;
&lt;td&gt;2.763&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.293&lt;/td&gt;
&lt;td&gt;4.56&lt;/td&gt;
&lt;td&gt;11.236&lt;/td&gt;
&lt;td&gt;2.782&lt;/td&gt;
&lt;td&gt;2.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.288&lt;/td&gt;
&lt;td&gt;4.632&lt;/td&gt;
&lt;td&gt;11.233&lt;/td&gt;
&lt;td&gt;2.73&lt;/td&gt;
&lt;td&gt;2.886&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.27&lt;/td&gt;
&lt;td&gt;4.591&lt;/td&gt;
&lt;td&gt;11.262&lt;/td&gt;
&lt;td&gt;2.775&lt;/td&gt;
&lt;td&gt;2.818&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.277&lt;/td&gt;
&lt;td&gt;4.567&lt;/td&gt;
&lt;td&gt;11.243&lt;/td&gt;
&lt;td&gt;2.721&lt;/td&gt;
&lt;td&gt;2.836&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.31&lt;/td&gt;
&lt;td&gt;4.576&lt;/td&gt;
&lt;td&gt;11.234&lt;/td&gt;
&lt;td&gt;2.812&lt;/td&gt;
&lt;td&gt;2.799&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.295&lt;/td&gt;
&lt;td&gt;4.552&lt;/td&gt;
&lt;td&gt;11.237&lt;/td&gt;
&lt;td&gt;2.784&lt;/td&gt;
&lt;td&gt;2.789&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.25&lt;/td&gt;
&lt;td&gt;4.567&lt;/td&gt;
&lt;td&gt;11.23&lt;/td&gt;
&lt;td&gt;2.776&lt;/td&gt;
&lt;td&gt;2.806&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.283&lt;/td&gt;
&lt;td&gt;4.565&lt;/td&gt;
&lt;td&gt;11.235&lt;/td&gt;
&lt;td&gt;2.798&lt;/td&gt;
&lt;td&gt;2.762&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;5.248&lt;/td&gt;
&lt;td&gt;4.564&lt;/td&gt;
&lt;td&gt;11.215&lt;/td&gt;
&lt;td&gt;2.823&lt;/td&gt;
&lt;td&gt;2.824&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;5.279&lt;/td&gt;
&lt;td&gt;4.585&lt;/td&gt;
&lt;td&gt;11.238&lt;/td&gt;
&lt;td&gt;2.775&lt;/td&gt;
&lt;td&gt;2.800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FJdJMYC4%2FAArch64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FJdJMYC4%2FAArch64.png" alt="AArch64 Volume Scaling Algorithm Performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like these results more or less confirm what we predicted, with the fastest 2 being those that took advantage of SIMD optimization to run concurrently. Vol2.c was way behind in execution time at a whopping average of 11.238 seconds per execution, over double the next slowest algorithm. This confirms that precalculating a table of results can be incredibly costly in compute time due to the cache not being fast enough to outpace the math unit of the processor. The naïve approach in Vol0.c of multiplying each sample by a scale factor with multiple type conversions in the process somewhat unsurprisingly takes the second slowest pace. Avoiding the conversions by bit shifting in Vol1.c yields a slightly faster runtime. Now onto the x86_64 results:&lt;/p&gt;

&lt;h2&gt;
  
  
  x86_64 Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;Vol0.c&lt;/th&gt;
&lt;th&gt;Vol1.c&lt;/th&gt;
&lt;th&gt;Vol2.c&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time (s)&lt;/td&gt;
&lt;td&gt;2.91&lt;/td&gt;
&lt;td&gt;2.755&lt;/td&gt;
&lt;td&gt;3.574&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.849&lt;/td&gt;
&lt;td&gt;2.762&lt;/td&gt;
&lt;td&gt;3.552&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.764&lt;/td&gt;
&lt;td&gt;2.747&lt;/td&gt;
&lt;td&gt;3.543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.753&lt;/td&gt;
&lt;td&gt;2.739&lt;/td&gt;
&lt;td&gt;3.502&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.763&lt;/td&gt;
&lt;td&gt;2.771&lt;/td&gt;
&lt;td&gt;3.497&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.761&lt;/td&gt;
&lt;td&gt;2.739&lt;/td&gt;
&lt;td&gt;3.503&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.77&lt;/td&gt;
&lt;td&gt;2.774&lt;/td&gt;
&lt;td&gt;3.527&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.77&lt;/td&gt;
&lt;td&gt;2.77&lt;/td&gt;
&lt;td&gt;3.507&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.782&lt;/td&gt;
&lt;td&gt;2.751&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.752&lt;/td&gt;
&lt;td&gt;2.763&lt;/td&gt;
&lt;td&gt;3.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.765&lt;/td&gt;
&lt;td&gt;2.757&lt;/td&gt;
&lt;td&gt;3.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.753&lt;/td&gt;
&lt;td&gt;2.757&lt;/td&gt;
&lt;td&gt;3.501&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.776&lt;/td&gt;
&lt;td&gt;2.759&lt;/td&gt;
&lt;td&gt;3.515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.771&lt;/td&gt;
&lt;td&gt;2.758&lt;/td&gt;
&lt;td&gt;3.527&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.768&lt;/td&gt;
&lt;td&gt;2.761&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.758&lt;/td&gt;
&lt;td&gt;2.777&lt;/td&gt;
&lt;td&gt;3.518&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.783&lt;/td&gt;
&lt;td&gt;2.749&lt;/td&gt;
&lt;td&gt;3.499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.764&lt;/td&gt;
&lt;td&gt;2.747&lt;/td&gt;
&lt;td&gt;3.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.772&lt;/td&gt;
&lt;td&gt;2.752&lt;/td&gt;
&lt;td&gt;3.504&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2.777&lt;/td&gt;
&lt;td&gt;2.756&lt;/td&gt;
&lt;td&gt;3.502&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average&lt;/td&gt;
&lt;td&gt;2.778&lt;/td&gt;
&lt;td&gt;2.757&lt;/td&gt;
&lt;td&gt;3.514&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FrwYzmhc%2Fx86-64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FrwYzmhc%2Fx86-64.png" alt="AArch64 Volume Scaling Algorithm Performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The execution times on x86_64 tell a similar story, although there are a few interesting distinctions. First, the type conversions that set Vol0.c back so much in the AArch64 benchmarks seem to have much less of an impact here. Vol0.c and Vol1.c share almost exactly the same runtime, although working with one type and bit shifting did shave off a few milliseconds. Also of note is that Vol2.c doesn't seem to incur the massive performance penalty seen on its AArch64 counterpart. This is evidence that the cache on this machine's processor is much closer to the math unit in terms of getting the results we need. &lt;/p&gt;

&lt;p&gt;In conclusion, this was an eye opening experience that confirmed my knowledge about the advantages of SIMD while giving specific evidence to support just how fast it is compared to traditional processing. We also learned just how important it is to know the machine you're optimizing for intimately, to account for differences like that between the algorithm using the precalculated table on the AArch64 machine vs the x86_64 one. Doing so can inform your programming decisions and help avoid making costly assumptions that in this case might mean more than doubling your runtime. &lt;/p&gt;

</description>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>simd</category>
      <category>aarch64</category>
    </item>
    <item>
      <title>Adding SVE2 Support to an Open Source Library - Part I</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Mon, 28 Mar 2022 18:41:27 +0000</pubDate>
      <link>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-i-349m</link>
      <guid>https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-i-349m</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-i-349m"&gt;Part 1&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-ii-ali"&gt;Part 2&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/gusmccallum/adding-sve2-support-to-an-open-source-library-part-iii-4bac"&gt;Part 3&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;SVE was developed by Arm as a new SIMD instruction set used as an extension to AArch64, that allows for variable vector length implementations. SVE2 is a superset of SVE and its precursor, Neon. Among many benefits of SVE and SVE2, one is that the same binaries can run on different AArch64 hardware with differing vector length implementations. It is especially suited to processing large datasets and for this reason I'll be implementing its use in an open source library to improve performance. &lt;/p&gt;

&lt;p&gt;My first task is to find an open source library to implement SVE2 support for, ideally one that's used for processing large amounts of data like a crypto or multimedia library. As I'm interested in audio and audio programming, I'll start looking there and hopefully find a good candidate. Criteria for my search are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;li&gt;Library level package, application level SVE2 optimization is less useful&lt;/li&gt;
&lt;li&gt;Ideally has Neon implementation already to glean ideas for how I'll approach SVE2 implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I started by thinking of what open source audio applications I know of, and the first that came to mind was Audacity. I used dnf list as my prof recommended to look up the package on the AArch64 server and confirmed one was available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RJYQDhjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u6ccf511rl8d26lh7i70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RJYQDhjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u6ccf511rl8d26lh7i70.png" alt="Image description" width="880" height="65"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then used dnf deplist to see what dependencies it had to try and narrow down which would be a good target for optimization. There were several libraries which could be good candidates:&lt;/p&gt;
&lt;h2&gt;
  
  
  Advanced Linux Sound Architecture Library (ALSA)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8aAfCcV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/323kwvtan7o9e5gzcx4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8aAfCcV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/323kwvtan7o9e5gzcx4h.png" alt="Image description" width="722" height="116"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Free Lossless Audio Codec (FLAC)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t9OLMW2y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00uecftj4pstaozstd8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t9OLMW2y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00uecftj4pstaozstd8o.png" alt="Image description" width="721" height="129"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Libogg
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1upSEx6z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r64om59p6yx02f9lbuht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1upSEx6z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r64om59p6yx02f9lbuht.png" alt="Image description" width="673" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there I checked the FLAC library to get access to the source code and find out more about how an SVE2 optimization could work out. The git URL on their website was down so I left it for now to check out the other libraries and circle back to it if they don't pan out.&lt;/p&gt;

&lt;p&gt;I found the &lt;a href="https://www.alsa-project.org/wiki/GIT_Server"&gt;page&lt;/a&gt; with the relevant info to clone the ALSA library and did so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git://git.alsa-project.org/alsa-lib.git alsa-lib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, after many searches trying to find anything related to sve, Neon, or AArch64 specific implementations, I wasn't able to find anything. Again I'm going to keep going and circle back to this if I hit a wall. &lt;/p&gt;

&lt;p&gt;Last in my list is Libogg. I found out it's located &lt;a href="https://gitlab.xiph.org/xiph/liboggz"&gt;here&lt;/a&gt; and is maintained by the same organization that maintains FLAC. Thankfully this git link wasn't broken. Unfortunately I once again came up empty when looking for references to Neon or SIMD, so I expanded my search to look through the various xiph projects - the maintainer of the aforementioned FLAC and ogg libraries. In doing so I found a great candidate, this library called &lt;a href="https://www.opus-codec.org/"&gt;opus&lt;/a&gt; with specific references to AArch64 and Neon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opus
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rHC6_iAu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxb4p11zfcpxz2t6b1fw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rHC6_iAu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxb4p11zfcpxz2t6b1fw.png" alt="Image description" width="762" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;opus/cmake/OpusFunctions.cmake&lt;/code&gt; I was able to find a check to establish whether the CPU and the compiler support Neon. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iBMhjkYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkq8mns8ssjkhxtgpgiz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iBMhjkYu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hkq8mns8ssjkhxtgpgiz.png" alt="Image description" width="521" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This indicates that this package takes advantage of SIMD, Neon being one implementation.  &lt;/p&gt;

&lt;p&gt;After configuring the library I was able to find a Makefile and see what compilation options it was using. In this case it had the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moving this up a level to &lt;code&gt;-O3&lt;/code&gt; would get the SVE2 autovectorization optimization to kick in, and furthermore I found that this package takes advantages of intrinsics, for example in the &lt;code&gt;opus/celt/arm/pitch_neon_intr.c&lt;/code&gt; source file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (i = 0; i &amp;lt; N - 7; i += 8) {
        x_s16x8  = vld1q_s16(&amp;amp;x[i]);
        y_s16x8  = vld1q_s16(&amp;amp;y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_low_s16 (x_s16x8), vget_low_s16 (y_s16x8));
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_high_s16(x_s16x8), vget_high_s16(y_s16x8));
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would be a good place to start - create an SVE2 equivalent of &lt;code&gt;pitch_neon_intr.c&lt;/code&gt; and/or &lt;code&gt;celt_neon_intr.c&lt;/code&gt; with the SVE2 versions of the intrinsics therein, I can get the ball rolling on optimizing this package for SVE2. I sent an email to the opus developer mailing list expressing my intention to do so, and now all that's left is to do it! More on that soon.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>sve2</category>
    </item>
    <item>
      <title>Algorithm Selection on x86_64 vs AArch64 Part I</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Thu, 24 Mar 2022 02:30:10 +0000</pubDate>
      <link>https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-i-5ff6</link>
      <guid>https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-i-5ff6</guid>
      <description>&lt;p&gt;In this post I'll explore benchmarking a few different programs with different algorithms to scale volume. I'll be benchmarking 5 different algorithms which act on an incoming stream of samples to scale them according to a desired volume. To scale audio in real time, acting on a 48000 kHz signal can involve more than 96,000 bytes of data per second, so efficiency is key to making sure nothing is lost or delayed. With that in mind, let's take a look at some different methods of scaling audio volume and see how they stack up against each other, as well as across x86_64 and AArch64 architectures. &lt;/p&gt;

&lt;p&gt;Our incoming sample will be simulated by the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i&amp;lt;sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Algorithm 1 - vol0.c - Naïve
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This first algorithm takes the naïve route of just multiplying each sample by a scale factor. This involved converting an integer to a floating point value and back again, which is very costly especially at this scale. I'm going to go out on a limb and say this could be done more efficiently, I predict that this one will perform the worst. (Also of note - the sum and print portions of the code are needed so the compiler doesn't optimize away the actual sample scaling portion of the program)&lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm 2 - vol1.c - Fixed Point
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) &amp;lt;&amp;lt;1) ) &amp;gt;&amp;gt; 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This algorithm avoids the floating point conversions bogging down the previous code and opts for a whole number multiplication followed by a bit shift, which is much more conservative on compute power. This should save time over our last algorithm but will probably be the 2nd or 3rd slowest. &lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm 3 - vol2.c - Precalculated
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x &amp;lt;= 32767; x++) {
 // Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This algorithm has all 65526 values (-32768 to 32767) precalculated, so the program just needs to look up the result for each value. This will elicit a 128kb table for all possible values of a 16 bit number scaled, which is not too much compared to the size of audio files. Performance in this case will hinge largely on how fast the math unit is vs the cache that will be fetching the 128kb of data. I think once again this could be the 2nd or 3rd slowest algorithm.&lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm 4 - vol4.c - Inline SIMD
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor &amp;lt; limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This algorithm uses inline assembly to process multiple values simultaneously using SIMD. As such it will almost certainly perform better than the prior algorithms. Because SIMD is only available on AArch64 systems we'll have to see how it runs on those and leave the x86_64 benchmarking out for this algorithm.&lt;/p&gt;

&lt;p&gt;There are 5 points of interest marked by "Q" as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(1). The value needs to be multiplied by 32767 rather than 32768 to prevent integer overflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(2). The in_cursor and out_cursor are set to point to the first elements of the in and out arrays. These will be used in the following loop to read to and from our scaling logic respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(3). vol_int represents the volume as a signed 16 bit integer, which we're using the &lt;code&gt;dup&lt;/code&gt; instruction on to duplicate the volume scaling factor from the 32-bit &lt;code&gt;w0&lt;/code&gt; to the vector register &lt;code&gt;v1.8h&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(4). These 3 lines are all part of the second template in this program, the first line being outputs, the second inputs, and the last being clobbers - memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(5). The results here should be correct as the &lt;code&gt;sqrdmulh&lt;/code&gt; instruction above saturates the results, preventing overflow. &lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm 5 - vol5.c - Intrinsics SIMD
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array interator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor &amp;lt; limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last algorithm also uses SIMD but rather than inline assembler opts for compiler intrinsics. It should likewise benefit from the simultaneous processing of the previous algorithm, so I'd expect either this one or that one to come out on top. Again, some sections are pointed out for clarification which I'll do now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(1). These intrinsic functions are equivalent to the instructions used in the last program. &lt;code&gt;vst1q_s16&lt;/code&gt; is equivalent to &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;vqrdmulhq_s16&lt;/code&gt; is equivalent to &lt;code&gt;sqrdmulh&lt;/code&gt;, &lt;code&gt;vld1q_s16&lt;/code&gt; is equivalent to &lt;code&gt;ldr&lt;/code&gt;, and &lt;code&gt;vdupq_n_s16&lt;/code&gt; is equivalent to &lt;code&gt;dup&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(2). The pointer is incremented by 8 because each intrinsic will calculate 8 elements at a time. In the inline assembler program, the pointer was incremented for us but here we need to do it manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(3). Once again, as we're using the intrinsic equivalent of &lt;code&gt;sqrdmulh&lt;/code&gt;, the results should be saturated and avoid potential overflow, so the output should be reliable. &lt;/p&gt;

&lt;p&gt;In the next post we'll get into putting these algorithms to the test and benchmarking them to find which is the fastest. More on that &lt;a href="https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-ii-1ogd"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>simd</category>
      <category>aarch64</category>
    </item>
    <item>
      <title>x86_64 Assembly Language</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Tue, 22 Mar 2022 02:32:51 +0000</pubDate>
      <link>https://dev.to/gusmccallum/x8664-assembly-language-193o</link>
      <guid>https://dev.to/gusmccallum/x8664-assembly-language-193o</guid>
      <description>&lt;p&gt;In my last post we went through writing a program for printing a message with a 2 digit incrementing value in AArch64 assembly. This time, we're going to tackle the same thing on an x86_64 architecture system. &lt;/p&gt;

&lt;p&gt;Starting with the original code we're given:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:
    /* ... body of the loop ... do something useful here ... */

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First we need to add logic to print a message, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop\n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By adding the provided text output code as above we get the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now by adding logic to iterate an index and add it to the &lt;code&gt;msg&lt;/code&gt; string, we get the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     %r15,%r14           
    add     $'0',%r14     
    movb    %r14b,msg+6        

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop: #\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the resulting output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we'll move to printing a 2 digit index along with our loop by making the following changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 15                        /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:
    mov     $min,%r15        
    mov     $10,%r13          

loop:

    mov     %r15,%rax         
    mov     $0,%rdx          
    div     %r13          
    cmp     $0,%rax
    je     secondDigit

    add     $'0',%rax          
    mov     %rax,%r12
    movb    %r12b,msg+6       

secondDigit:

    add     $'0',%rdx          
    mov     %rdx,%r12
    movb    %r12b,msg+7        

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop:  #\n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which gets us the appropriate output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: #0
Loop: #1
Loop: #2
Loop: #3
Loop: #4
Loop: #5
Loop: #6
Loop: #7
Loop: #8
Loop: #9
Loop: #10
Loop: #11
Loop: #12
Loop: #13
Loop: #14
Loop: #15
Loop: #16
Loop: #17
Loop: #18
Loop: #19
Loop: #20
Loop: #21
Loop: #22
Loop: #23
Loop: #24
Loop: #25
Loop: #26
Loop: #27
Loop: #28
Loop: #29
Loop: #30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again this looks like the output we're looking for, so I'll break down how we got here and leave some parting thoughts on writing assembly programs for x86_64 vs my experience writing for AArch64.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_start:
    mov     $min,%r15        
    mov     $10,%r13       
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starting off with the start section, we move the value of min to &lt;code&gt;r15&lt;/code&gt; and set &lt;code&gt;r13&lt;/code&gt; to 10, which we'll use to divide and split our 2 digits. Remember in this syntax the destination register is placed on the right, contrary to how it was arranged in the AArch64 program.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop:

    mov     %r15,%rax         
    mov     $0,%rdx          
    div     %r13          
    cmp     $0,%rax
    je     secondDigit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we place the value to be divided into &lt;code&gt;rax&lt;/code&gt; and clear &lt;code&gt;rdx&lt;/code&gt; to accept the remainder, before using the &lt;code&gt;div&lt;/code&gt; instruction to divide what's in the &lt;code&gt;rax&lt;/code&gt; register. We compare the value placed in &lt;code&gt;rax&lt;/code&gt;, our "tens" column, to zero and branch to the secondDigit section if there's no tens column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    add     $'0',%rax          
    mov     %rax,%r12
    movb    %r12b,msg+6   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first line here we add an ascii 0 to the result of the division, after which we move that to &lt;code&gt;r12&lt;/code&gt; and finally move a byte of that with &lt;code&gt;movb&lt;/code&gt; to the address of the pound sign in &lt;code&gt;msg&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;secondDigit:

    add     $'0',%rdx          
    mov     %rdx,%r12
    movb    %r12b,msg+7 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again here we add an ascii 0, this time to the remainder from the previous division, and then move that to &lt;code&gt;r12&lt;/code&gt; to have a byte moved to the pound sign address in &lt;code&gt;msg&lt;/code&gt;. Like the last program, I'll end my breakdown here as the rest is pretty self evident and discuss my experience with x86_64. &lt;/p&gt;

&lt;p&gt;This was pretty similar to assembly in AArch64 in a lot of ways, there were some minor syntactical differences like the precedence of operands listed above, and $ and % symbols being used to denote immediate values and registers, respectively. I'd be hard pressed to pick one I prefer, but if I had to I'd lean toward the AArch64 for its syntax, which I find slightly more readable. The difference is pretty negligible though. I also like the philosophy of improvement and not being weighed down by legacy features and nomenclature that comes with x86_64, but that hasn't affected my coding on either to any great extent thus far. &lt;/p&gt;

&lt;p&gt;Overall this was a good challenge and I'm looking forward to diving deeper into these architectures.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>x8664</category>
    </item>
    <item>
      <title>AArch64 Assembly Language Part II</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Mon, 21 Mar 2022 20:08:04 +0000</pubDate>
      <link>https://dev.to/gusmccallum/aarch64-assembly-language-part-ii-36jp</link>
      <guid>https://dev.to/gusmccallum/aarch64-assembly-language-part-ii-36jp</guid>
      <description>&lt;p&gt;This is the second post in a series on writing assembly language code for a program on the AArch64 architecture. You can find the first &lt;a href="https://dev.to/gusmccallum/aarch64-assembly-language-part-i-13a9"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once again the task for this last stretch of our AArch64 exploration is to write a loop that iterates 30 times, which necessitates dealing with 2 digit numbers. This can be done by dividing the index by 10. The result goes into the first digit, with a branch to the secondDigit procedure if the first digit is 0. The full code is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min
    mov     x20, 0x0A

loop:

    udiv    x21, x19, x20
    cmp     x21, 0
    b.eq    secondDigit

    add     x18, x21, '0'
    adr     x17, msg+6 
    strb    w18, [x17] 

secondDigit:
    msub    x22, x20, x21, x19

    add     x18, x22, '0' 
    adr     x17, msg+7 
    strb    w18, [x17] 

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1   
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop:  #\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: #0
Loop: #1
Loop: #2
Loop: #3
Loop: #4
Loop: #5
Loop: #6
Loop: #7
Loop: #8
Loop: #9
Loop: #10
Loop: #11
Loop: #12
Loop: #13
Loop: #14
Loop: #15
Loop: #16
Loop: #17
Loop: #18
Loop: #19
Loop: #20
Loop: #21
Loop: #22
Loop: #23
Loop: #24
Loop: #25
Loop: #26
Loop: #27
Loop: #28
Loop: #29
Loop: #30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like the output we're looking for which is great, let me break down the program in a little more detail and summarize my experiences writing AArch64 assembly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_start:

    mov     x19, min
    mov     x20, 0x0A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start the program by assigning 0 and 10 to registers 19 and 20, respectively. Both are being used as 64 bit widths as made evident by the &lt;code&gt;x&lt;/code&gt; prefix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop:

    udiv    x21, x19, x20
    cmp     x21, 0
    b.eq    secondDigit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This portion divides the values in &lt;code&gt;x19&lt;/code&gt; by &lt;code&gt;x20&lt;/code&gt; and places it in &lt;code&gt;x21&lt;/code&gt;. The syntax for AArch64 assembly is such that you can look at operations as &lt;code&gt;operand = value&lt;/code&gt; or &lt;code&gt;operand = expression&lt;/code&gt; in this case, as the destination register comes first in this syntax. &lt;/p&gt;

&lt;p&gt;The second line compares the first digit of the result with 0, branching to the secondDigit label if the expression evaluates true. That would be in the case that it's a single digit result, which it will be for the first 10 iterations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    add     x18, x21, '0'
    adr     x17, msg+6 
    strb    w18, [x17] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first line adds '0' to the value in &lt;code&gt;x21&lt;/code&gt; and places it in &lt;code&gt;x18&lt;/code&gt;, after which the address of the pound sign in &lt;code&gt;msg&lt;/code&gt; is read into x17. The final line stores a byte from &lt;code&gt;w18&lt;/code&gt; to the address pointed to by &lt;code&gt;x17&lt;/code&gt;, the pound sign pointer we just created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;secondDigit:
    msub    x22, x20, x21, x19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the secondDigit label gets the remainder with the msub instruction by setting &lt;code&gt;x22&lt;/code&gt; to the result of &lt;code&gt;x20&lt;/code&gt;-(&lt;code&gt;x21&lt;/code&gt; * &lt;code&gt;x19&lt;/code&gt;), or 10 - (result of the division) * (loop index).&lt;/p&gt;

&lt;p&gt;The rest of the code is largely unchanged from the last few iterations of this program so I'll leave it at that.&lt;/p&gt;

&lt;p&gt;This was a challenging program to write, although I thought the transition from 6502 assembly to 64 bit assembly would be harder. I'm sure my next step, writing the x86_64 equivalent, will be equally if not not more difficult. There's definitely a more robust feeling for lack of a better word, to writing and building assembly on a machine rather than an emulator, and during my debugging process it seemed like the error messages were more meaningful as well. Although I'm not very familiar with working with Linux and that made some things awkward I like knowing that if I need to dig deeper to find out why something's not working that's an option I have. That's it for this post, stay tuned for the x86_64 equivalent of this program coming soon.&lt;/p&gt;

</description>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>aarch64</category>
    </item>
    <item>
      <title>AArch64 Assembly Language Part I</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Sun, 20 Mar 2022 21:20:55 +0000</pubDate>
      <link>https://dev.to/gusmccallum/aarch64-assembly-language-part-i-13a9</link>
      <guid>https://dev.to/gusmccallum/aarch64-assembly-language-part-i-13a9</guid>
      <description>&lt;p&gt;Today I'm moving on from the 6502 processor and starting to work with 64 bit assembly language. My tasks this time are to build off of two assembly code examples, one for the AArch64 architecture and the other for x86_64, to first print 0-9 in a loop and then 0-30. &lt;/p&gt;

&lt;p&gt;First off, I'll be working with the AArch64 program. The starting point given to us is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

    /* ... body of the loop ... do something useful here ... */

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code loops until it reaches 30 but doesn't do anything within the loop yet. Our first task is to change this so it prints each time a loop iteration executes. By adding in some code provided to us for printing a message we get the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

    add     x19, x19, 1 
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -&amp;gt; 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the output is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next up is to get the loop to print a number that iterates each time the loop repeats. I modified the code like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i&amp;lt;max) */

_start:

    mov     x19, min

loop:

    add    x18, x19, '0'         
    adr    x17, msg+6            
    strb   w18, [x17]            

    mov     x0, 1                /* file descriptor: 1 is stdout */
    adr     x1, msg              /* message location (memory address) */
    mov     x2, len              /* message length (bytes) */

    mov     x8, 64               /* write is syscall #64 */
    svc     0                    /* invoke syscall */

    add     x19, x19, 1       
    cmp     x19, max
    b.ne    loop

    mov     x0, 0                /* status -&amp;gt; 0 */
    mov     x8, 93               /* exit is syscall #93 */
    svc     0                    /* invoke syscall */

.data
msg:    .ascii      "Loop: #\n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And got the appropriate output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next time I'll implement iterating until 30, after which I'll tackle the same on an x86_64 architecture. More on that soon!&lt;/p&gt;

</description>
      <category>assembly</category>
      <category>assemblylanguage</category>
      <category>aarch64</category>
    </item>
    <item>
      <title>6502 Math and Strings Part III</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Tue, 15 Mar 2022 00:06:20 +0000</pubDate>
      <link>https://dev.to/gusmccallum/6502-math-and-strings-part-iii-29ip</link>
      <guid>https://dev.to/gusmccallum/6502-math-and-strings-part-iii-29ip</guid>
      <description>&lt;p&gt;This is part 3 in a series on writing a program in assembly for the 6502 processor. You can find part 1 &lt;a href="https://dev.to/gusmccallum/6502-math-and-strings-3b0e"&gt;here&lt;/a&gt; and part 2 &lt;a href="https://dev.to/gusmccallum/6502-math-and-strings-part-ii-37oa"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After covering the necessary tools for building our program, today I'm getting into the coding. I've chosen to write a simple program to get input from the user and determine if a number is even or odd using the Logical Shift Right I discussed in my last post. &lt;/p&gt;

&lt;p&gt;First up, I set up a subroutine to draw the result of the operation on the bitmapped display. Referencing the example &lt;a href="https://wiki.cdot.senecacollege.ca/wiki/6502_Emulator_Example_Code#Place_a_Graphic_on_the_Screen" rel="noopener noreferrer"&gt;here&lt;/a&gt; I wrote code to display "Even" or "Odd" depending on the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6vz9s3ywtm4z0thegyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6vz9s3ywtm4z0thegyi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx347jfkovfu4yeqtpko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx347jfkovfu4yeqtpko.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the code for just the "Odd" portion. From here I need to add character input, logic to perform the LSR operation, logic to get the remainder (through the carry flag I believe?) and finally print one of the two results to the bitmapped display.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;define WIDTH      32 ; width  of sprite
define HEIGHT     8  ; height of sprite

done:   brk

; win sprite print subroutine 
    lda #$26 ; create a pointer at $26
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $19

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

odddraw:lda oddmsg ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne odddraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq odddraw

; sprite
oddmsg:               
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately I ran into some issues with my program's execution so I'm going to have to leave it at that and move on, but the code for what I wrote is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; ROM Subroutines
define  SCINIT      $ff81 ; initialize/clear screen
define  CHRIN       $ffcf ; input character from keyboard
define  CHROUT      $ffd2 ; output character to screen
define  SCREEN      $ffed ; get screen size
define  PLOT        $fff0 ; get/set cursor coordinates


;CONSTANTS

define  WIDTH       32 ; width  of sprite
define  HEIGHT      8  ; height of sprite


define  INPUT       $10
define  NUM     $00

    ldy #$00

init:   lda msg,y
    beq getnum
    jsr CHROUT
    iny
    bne init

getnum:

    lda #NUM
    sta INPUT

        ldy #$00
        jsr CHRIN

        cmp #$00
        beq getnum

        cmp #$30
        bmi getnum

        cmp #$39
        bpl getnum

        jsr CHROUT
    sta input
    jmp modulomsg

modulomsg:

    lda msg2,y
    beq printinput
    jsr CHROUT
    iny
    bne modulomsg

printinput:

    lda INPUT
    jsr CHROUT
    jmp result

result:     
    lda input
    clc
    jsr chrout
    lsr x
    lda x
    jsr CHROUT
    bcc evendraw
    bcs odddraw


msg:
dcb "E","n","t","e","r",32,"a",32,"n","u","m","b","e","r",":",32,0, 

msg2:
dcb $0d,$0d,$0d,"Y","o","u",32,"e","n","t","e","r","e","d",":",32,0


done:   brk

odddraw:
    lda #$20 ; create a pointer at $26
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $0a

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

    lda odddata ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne odddraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq odddraw

evendraw:
    lda #$20 ; create a pointer at $20
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $0a

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

    lda evendata ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne evendraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq evendraw

; odd sprite
odddata:               
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

; even sprite
evendata:               
dcb 05,05,05,05,00,00,05,00,00,00,00,05,00,00,05,05,05,05,00,00,05,05,00,00,05,00,00,00,00,00,00,00

dcb 05,00,00,00,00,00,05,00,00,00,00,05,00,00,05,00,00,00,00,00,05,05,00,00,05,00,00,00,00,00,00,00
dcb 05,05,05,05,00,00,00,05,00,00,05,00,00,00,05,05,05,05,00,00,05,05,05,00,05,00,00,00,00,00,00,00
dcb 05,00,00,00,00,00,00,05,00,00,05,00,00,00,05,00,00,00,00,00,05,00,05,00,05,00,00,00,00,00,00,00
dcb 05,00,00,00,00,00,00,00,05,05,00,00,00,00,05,00,00,00,00,00,05,00,00,05,05,00,00,00,00,00,00,00
dcb 05,05,05,05,00,00,00,00,05,05,00,00,00,00,05,05,05,05,00,00,05,00,00,05,05,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For some reason it stops running after the character input is received and the subroutines for getting the result of the LSR and printing the appropriate message aren't triggered. And that concludes this series on math and strings in assembly on the 6502, hopefully it's been informative.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>6502</category>
      <category>assemblylanguage</category>
    </item>
    <item>
      <title>6502 Math and Strings Part II</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Wed, 02 Mar 2022 21:47:08 +0000</pubDate>
      <link>https://dev.to/gusmccallum/6502-math-and-strings-part-ii-37oa</link>
      <guid>https://dev.to/gusmccallum/6502-math-and-strings-part-ii-37oa</guid>
      <description>&lt;p&gt;This is part 2 in a series on writing a program in assembly for the 6502 processor. You can find part 1 &lt;a href="https://dev.to/gusmccallum/6502-math-and-strings-3b0e"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Picking up from where we left off last time, the last requirements to satisfy for this program are that it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must accept user input from the keyboard in some form.&lt;/li&gt;
&lt;li&gt;Must use some arithmetic/math instructions (to add, subtract, do bitwise operations, or rotate/shift)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Character input can, much like character output, be performed in a variety of ways. One way without using the CHRIN ROM routine can be done like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxha54ogbmfctkpe1b5qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxha54ogbmfctkpe1b5qd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code in full:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; let the user type on the first page of character screen
; has blinking cursor!
; does not use ROM routines
; backspace works (non-destructive), arrows/ENTER don't

next:     ldx #$00
idle:     inx
          cpx #$10
          bne check
          lda $f000,y
          eor #$80
          sta $f000,y

check:    lda $ff
          beq idle

          ldx #$00
          stx $ff

          cmp #$08 ; bs
          bne print

          lda $f000,y
          and #$7f
          sta $f000,y

          dey
          jmp next

print:    sta $f000,y
          iny
          jmp next
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Many of the code examples I'm using here are from  &lt;a href="https://wiki.cdot.senecacollege.ca/wiki/6502_Emulator_Example_Code#Using_the_ROM_routines" rel="noopener noreferrer"&gt;this page&lt;/a&gt; if you want to inspect the code in full.)&lt;/p&gt;

&lt;p&gt;Otherwise generally one uses the CHRIN ROM routine to get each character and then manipulates them from there. You can see a full example of that &lt;a href="https://github.com/gusmccallum/SPO600/blob/bfb8207ec8f302cf1565bc8e43855c3859168b39/Text%20Input%20Program" rel="noopener noreferrer"&gt;here&lt;/a&gt; where I've recreated a program from a lecture that allows for text input with a tracked cursor, responsive to backspace and enter characters and stores and prints the user's input. &lt;/p&gt;

&lt;p&gt;The final piece to cover before diving into coding our program is math on the 6502. There are two different ways of performing math, binary or decimal, which is decided by setting or clearing the decimal flag in the status register. the &lt;code&gt;SED&lt;/code&gt; instruction sets it, while the &lt;code&gt;CLD&lt;/code&gt; instruction clears it. Decimal mode treats each byte as two decimal digits, the lower 5 bits representing the lower digit and the upper 4 bits the upper ones. Numbers are treated as positive and values greater than 9 are invalid. &lt;/p&gt;

&lt;p&gt;Special care must be taken to clear the carry flag before the low-byte portion of a multi-byte addition, or before a single-byte operation. If a multi-byte addition is performed by adding the low-byte first, the carry flag will correctly carry bits forward from one byte to the next. Subtraction is performed similarly, with the carry flag set before performing subtraction on the lowest byte of a single or multi-byte subtraction, with subtraction then performed on each byte in sequence up to the highest byte. &lt;/p&gt;

&lt;p&gt;Multiplication and division are not generally available, but a Logical Shift right or left effectively performs a division or multiplication, respectively. Similarly, rotations perform the same function but the rotate left instruction will move the highest bit to the carry flag and the carry flag to the lowest bit. The opposite is true of rotate right. &lt;/p&gt;

&lt;p&gt;More bitwise operations can also be found &lt;a href="https://wiki.cdot.senecacollege.ca/wiki/Bitwise_Operations" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'll leave it at that for this post, and with all the last two posts have covered we'll be ready to dive into coding in earnest for our program in the next post. &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>6502</category>
      <category>assemblylanguage</category>
    </item>
    <item>
      <title>6502 Math and Strings Part I</title>
      <dc:creator>gus</dc:creator>
      <pubDate>Fri, 18 Feb 2022 03:15:04 +0000</pubDate>
      <link>https://dev.to/gusmccallum/6502-math-and-strings-3b0e</link>
      <guid>https://dev.to/gusmccallum/6502-math-and-strings-3b0e</guid>
      <description>&lt;p&gt;This week I'm working on another assembly program for the 6502, with the conditions that the program:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must work in the &lt;a href="http://6502.cdot.systems/"&gt;6502 Emulator&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Must output to the character screen as well as the graphics (bitmapped) screen.&lt;/li&gt;
&lt;li&gt;Must accept user input from the keyboard in some form.&lt;/li&gt;
&lt;li&gt;Must use some arithmetic/math instructions (to add, subtract, do bitwise operations, or rotate/shift)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Almost all of these are new to me in assembly, with the exception of having printed graphics to the bitmapped display, so this should be fun! &lt;/p&gt;

&lt;p&gt;I'll start by breaking down the requirements and getting up to speed on each one so I can then apply them to my program. The first is a given as I'm writing the program directly in the 6502 emulator, so let's dive into the second requirement.&lt;/p&gt;

&lt;p&gt;This requirement is twofold - output to the character screen as well as the graphics screen. I've already covered outputting colours to the bitmapped screen &lt;a href="https://dev.to/gusmccallum/calculating-6502-execution-time-part-3-4jf1"&gt;here&lt;/a&gt;, so the next thing to look into is outputting to the text screen. This can be accomplished in a few ways. &lt;/p&gt;

&lt;p&gt;The first is to manually assign a decimal or hexadecimal value representing each ASCII character to an address. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GF8QHVIF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff094a7djuuidu0cgwvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GF8QHVIF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ff094a7djuuidu0cgwvl.png" alt="Image description" width="880" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second is to use the DCB (Declare Constant Byte) mnemonic to define a string and then assign it to memory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nuvSqazT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inkq00qpjh4riwpwgik7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nuvSqazT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inkq00qpjh4riwpwgik7.png" alt="Image description" width="880" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also use the CHROUT ROM Routine to output characters to the text screen, which works the same as the last way but without manually assigning and iterating memory locations. &lt;/p&gt;

&lt;p&gt;In the my next post I'll go into the second two requirements and start writing my program. &lt;/p&gt;

</description>
      <category>opensource</category>
      <category>assembly</category>
      <category>6502</category>
      <category>assemblylanguage</category>
    </item>
  </channel>
</rss>
