DEV Community: gus

Adding SVE2 Support to an Open Source Library - Part III

gus — Fri, 22 Apr 2022 19:43:19 +0000

At the end of my last post I ran into some snags when building opus: some of the intrinsics I wrote for the file I modified caused build errors, so I wasn't able to build and test the library. In this post I'm going to change tactics and try autovectorization to see if I can successfully build and test the library, and then analyze the results.

First I'll clear my work so far and download a fresh copy of the library. Before configuring and building, I need to turn off NEON support in the configure.ac file so the existing NEON intrinsics don't conflict with the autovectorization I'm going to use. I searched for mentions of intrinsics and turned them off, then ran autogen.sh and configure to set up the build. The configure output confirms that intrinsics are now turned off:

------------------------------------------------------------------------
  opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

    C99 var arrays: ................ yes
    C99 lrintf: .................... yes
    Use alloca: .................... no (using var arrays)

    General configuration:

    Floating point support: ........ yes
    Fast float approximations: ..... no
    Fixed point debugging: ......... no
    Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
    External Assembly Optimizations:  
    Intrinsics Optimizations: ...... no
    Run-time CPU detection: ........ no
    Custom modes: .................. no
    Assertion checking: ............ no
    Hardening: ..................... yes
    Fuzzing: ....................... no
    Check ASM: ..................... no

    API documentation: ............. yes
    Extra programs: ................ yes
------------------------------------------------------------------------

Now, by substituting the CFLAGS mentioned in the last post (-O3 -march=armv8-a+sve2) into the makefile and taking care to run the build under qemu-aarch64, we can see that the build succeeds and 8 of the 14 tests pass.

FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 448983 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
./test-driver: line 107: 449031 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_mdct
./test-driver: line 107: 449046 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
./test-driver: line 107: 449072 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
./test-driver: line 107: 449716 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_projection
======================================================
   opus 1.3.1-107-gccaaffa9-dirty: ./test-suite.log
======================================================

# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: celt/tests/test_unit_cwrs32
=================================

FAIL celt/tests/test_unit_cwrs32 (exit status: 132)

FAIL: celt/tests/test_unit_dft
==============================

FAIL celt/tests/test_unit_dft (exit status: 132)

FAIL: celt/tests/test_unit_mdct
===============================

FAIL celt/tests/test_unit_mdct (exit status: 132)

FAIL: celt/tests/test_unit_rotation
===================================

FAIL celt/tests/test_unit_rotation (exit status: 132)

FAIL: silk/tests/test_unit_LPC_inv_pred_gain
============================================

FAIL silk/tests/test_unit_LPC_inv_pred_gain (exit status: 132)

FAIL: tests/test_opus_projection
================================

FAIL tests/test_opus_projection (exit status: 132)

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0
============================================================================

Let's take a closer look at one of the tests that successfully made use of the SVE2 inclusion:

Running Opus Encode Test

./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 3135156945 (95E3)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode    LP FB encode  VBR,  11318 bps OK.
    Mode    LP FB encode  VBR,  14930 bps OK.
    Mode    LP FB encode  VBR,  67659 bps OK.
    Mode Hybrid FB encode  VBR,  17712 bps OK.
    Mode Hybrid FB encode  VBR,  51200 bps OK.
    Mode Hybrid FB encode  VBR,  80954 bps OK.
    Mode Hybrid FB encode  VBR, 127480 bps OK.
    Mode   MDCT FB encode  VBR, 752629 bps OK.
    Mode   MDCT FB encode  VBR,  25609 bps OK.
    Mode   MDCT FB encode  VBR,  33107 bps OK.
    Mode   MDCT FB encode  VBR,  78592 bps OK.
    Mode   MDCT FB encode  VBR,  73157 bps OK.
    Mode   MDCT FB encode  VBR, 137477 bps OK.
    Mode    LP FB encode CVBR,  11480 bps OK.
    Mode    LP FB encode CVBR,  21257 bps OK.
    Mode    LP FB encode CVBR,  63201 bps OK.
    Mode Hybrid FB encode CVBR,  25583 bps OK.
    Mode Hybrid FB encode CVBR,  36126 bps OK.
    Mode Hybrid FB encode CVBR,  54107 bps OK.
    Mode Hybrid FB encode CVBR, 108482 bps OK.
    Mode   MDCT FB encode CVBR, 934758 bps OK.
    Mode   MDCT FB encode CVBR,  25111 bps OK.
    Mode   MDCT FB encode CVBR,  33929 bps OK.
    Mode   MDCT FB encode CVBR,  52270 bps OK.
    Mode   MDCT FB encode CVBR,  79059 bps OK.
    Mode   MDCT FB encode CVBR, 117366 bps OK.
    Mode    LP FB encode  CBR,   7432 bps OK.
    Mode    LP FB encode  CBR,  16781 bps OK.
    Mode    LP FB encode  CBR,  90950 bps OK.
    Mode Hybrid FB encode  CBR,  18257 bps OK.
    Mode Hybrid FB encode  CBR,  37925 bps OK.
    Mode Hybrid FB encode  CBR,  56473 bps OK.
    Mode Hybrid FB encode  CBR,  78233 bps OK.
    Mode   MDCT FB encode  CBR, 780220 bps OK.
    Mode   MDCT FB encode  CBR,  20668 bps OK.
    Mode   MDCT FB encode  CBR,  38398 bps OK.
    Mode   MDCT FB encode  CBR,  74376 bps OK.
    Mode   MDCT FB encode  CBR,  68468 bps OK.
    Mode   MDCT FB encode  CBR, 141108 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,   4884 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  18110 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  44628 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  15245 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  26620 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  61885 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  86977 bps OK.
    Mode    LP NB dual-mono MS encode  VBR, 119885 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,   7123 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19106 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  41453 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  10135 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19040 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  57693 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  77731 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR, 165272 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,   7245 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  16460 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  56065 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  13411 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  28783 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  61638 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  92219 bps OK.
    Mode    LP NB dual-mono MS encode CVBR, 110936 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,   4047 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  21622 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  43253 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  12557 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  28091 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  57473 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  77203 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR, 154714 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,   4000 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  12396 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  56699 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  10327 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  19576 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  36651 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  50625 bps OK.
    Mode    LP NB dual-mono MS encode  CBR, 122376 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   4916 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14647 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  55741 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12307 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  23408 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  62311 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  54876 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 104358 bps OK.
    All framesize pairs switching encode, 9810 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.

Now we can inspect the encoding program and see how it makes use of SVE2 instructions.

find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X | grep whilelo ; done

The lines in question are too numerous to put here but the files affected are:

======== ./tests/test_opus_projection
======== ./tests/.libs/test_opus_encode
======== ./tests/.libs/test_opus_api
======== ./tests/.libs/test_opus_decode
======== ./celt/tests/test_unit_entropy
======== ./celt/tests/test_unit_cwrs32
======== ./celt/tests/test_unit_mathops
======== ./celt/tests/test_unit_rotation
======== ./celt/tests/test_unit_dft
======== ./celt/tests/test_unit_mdct
======== ./.libs/opus_demo
======== ./.libs/libopus.so.0.8.0
======== ./.libs/trivial_example
======== ./opus_compare
======== ./silk/tests/test_unit_LPC_inv_pred_gain

A line count with find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X 2> /dev/null | grep whilelo ; done | wc -l returns 2903. That total also counts the ======== header line echoed for each executable, so the number of whilelo instances is slightly lower. I'll zero in on one of these files to see how it makes use of its SVE2 instructions.

Analyzing Opus Encode Test

I'll go back to the encode test I ran before and take a look at how it's using its SVE2 instructions now.

objdump -d test_opus_encode > ~/opus_encode_objdump

In searching around the output I can find 6 instances of whilelo at play here, the first 2 being in this <generate_music> section.

00000000004016b0 <generate_music>:
  4016b0:       d2800002        mov     x2, #0x0                        // #0
  4016b4:       d282d003        mov     x3, #0x1680                     // #5760
  4016b8:       2538c000        mov     z0.b, #0
  4016bc:       25631fe0        whilelo p0.h, xzr, x3
  4016c0:       e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]
  4016c4:       0470e3e2        inch    x2
  4016c8:       25631c40        whilelo p0.h, x2, x3
  4016cc:       54ffffa1        b.ne    4016c0 <generate_music+0x10>  // b.any
  4016d0:       712d003f        cmp     w1, #0xb40
  4016d4:       54000e4d        b.le    40189c <generate_music+0x1ec>
  4016d8:       a9bb7bfd        stp     x29, x30, [sp, #-80]!
  4016dc:       f000017e        adrp    x30, 430000 <memcpy@GLIBC_2.17>
  4016e0:       910593de        add     x30, x30, #0x164
  4016e4:       910003fd        mov     x29, sp
  4016e8:       a90153f3        stp     x19, x20, [sp, #16]
  4016ec:       d285a002        mov     x2, #0x2d00                     // #11520
  4016f0:       52955571        mov     w17, #0xaaab                    // #43691
  4016f4:       294093d4        ldp     w20, w4, [x30, #4]
  4016f8:       52955550        mov     w16, #0xaaaa                    // #43690
  4016fc:       8b020002        add     x2, x0, x2
  401700:       52800006        mov     w6, #0x0                        // #0

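So let's break down what it's doing here. whilelo is not a loop in itself: it builds the predicate (lane mask) that drives one. It writes to the scalable predicate register p0.h, its first argument (the destination register), and marks a lane as active for as long as the second argument, incremented by the lane number, is lower (as an unsigned value) than the third argument. Here the second argument is the zero register xzr and the third is x3, which holds 5760. whilelo also sets the condition flags; the b.ne (b.any) after the second whilelo uses them to repeat the loop while any lane is still active.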

  4016bc:       25631fe0        whilelo p0.h, xzr, x3

The program then performs a st1h, a contiguous store of halfwords from a vector, addressed by a base register plus a scalar index. Only the lanes that are active in p0 are written, and since z0 was set to zero, this stores zeros.

 4016c0:    e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]

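It then increments x2 with inch, which adds the number of halfword lanes in a vector, so the index moves on by one full vector per iteration.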

  4016c4:       0470e3e2        inch    x2
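
The instruction sequence above can be modelled in plain C to make the predicate mechanics concrete. This is a minimal sketch, not what the hardware literally does: whilelo_h and zero_halfwords are hypothetical names, the predicate is modelled as an array of flags, and the lanes parameter stands in for the hardware vector length, which the real code never needs to know.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 128   /* 2048-bit vector / 16-bit lanes, the SVE maximum */

/* Model of `whilelo p.h, base, limit`: lane k is active while
   base + k < limit (unsigned). Returns the number of active lanes. */
size_t whilelo_h(uint8_t pred[], size_t lanes, uint64_t base, uint64_t limit)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; k++) {
        pred[k] = (uint8_t)(base + k < limit);
        active += pred[k];
    }
    return active;
}

/* Model of the zeroing loop at 4016bc..4016cc. Returns how many times the
   store executed. `lanes` is an assumed vector length in halfwords. */
size_t zero_halfwords(int16_t *buf, uint64_t limit, size_t lanes)
{
    assert(buf != NULL && lanes > 0 && lanes <= MAX_LANES);
    uint8_t pred[MAX_LANES];
    uint64_t x2 = 0;                                 /* mov x2, #0 */
    size_t stores = 0;
    size_t active;

    whilelo_h(pred, lanes, 0, limit);                /* whilelo p0.h, xzr, x3 */
    do {
        for (size_t k = 0; k < lanes; k++)           /* st1h {z0.h}, p0, [x0, x2, lsl #1] */
            if (pred[k])
                buf[x2 + k] = 0;
        stores++;
        x2 += lanes;                                 /* inch x2 */
        active = whilelo_h(pred, lanes, x2, limit);  /* whilelo p0.h, x2, x3 */
    } while (active != 0);                           /* b.ne (b.any) */
    return stores;
}
```

With 8 halfword lanes (a 128-bit vector), zeroing 5760 halfwords takes 720 stores. When the limit is not a multiple of the lane count, the last store simply has some lanes switched off, which is why no scalar clean-up loop is needed.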

While this helps us understand the mechanics of what's being called and why, what function does this serve in the program? The source code can give us some clues in a language that's easier to parse:

   /* Generate input data */
   inbuf = (opus_int16*)malloc(sizeof(*inbuf)*SSAMPLES);
   generate_music(inbuf, SSAMPLES/2);

We can see here that generate_music is a function that, much like the vol_createsample function in lab 5, creates dummy data to operate on and test the encoding utility. Looking at the function definition in full:

void generate_music(short *buf, opus_int32 len)
{
   opus_int32 a1,b1,a2,b2;
   opus_int32 c1,c2,d1,d2;
   opus_int32 i,j;
   a1=b1=a2=b2=0;
   c1=c2=d1=d2=0;
   j=0;
   /*60ms silence*/
   for(i=0;i<2880;i++)buf[i*2]=buf[i*2+1]=0;
   for(i=2880;i<len;i++)
   {
    opus_uint32 r;
    opus_int32 v1,v2;
    v1=v2=(((j*((j>>12)^((j>>10|j>>12)&26&j>>7)))&128)+128)<<15;
    r=fast_rand();v1+=r&65535;v1-=r>>16;
    r=fast_rand();v2+=r&65535;v2-=r>>16;
    b1=v1-a1+((b1*61+32)>>6);a1=v1;
    b2=v2-a2+((b2*61+32)>>6);a2=v2;
    c1=(30*(c1+b1+d1)+32)>>6;d1=b1;
    c2=(30*(c2+b2+d2)+32)>>6;d2=b2;
    v1=(c1+128)>>8;
    v2=(c2+128)>>8;
    buf[i*2]=v1>32767?32767:(v1<-32768?-32768:v1);
    buf[i*2+1]=v2>32767?32767:(v2<-32768?-32768:v2);
    if(i%6==0)j++;
   }
}

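We can see that the entire function is essentially two loops, so it makes sense that the compiler would be able to take advantage of whilelo to squeeze some more performance out of it. The disassembly above corresponds to the first loop, the 60ms of silence: the limit of 5760 in x3 is 2880 iterations times two halfword stores each, and z0 holds zeros. Using SIMD in this way lets several iterations of that loop run in a single instruction, which should speed it up considerably. The second loop is a harder target, since each iteration depends on values carried over from the previous one (a1, b1, c1 and so on) and calls fast_rand().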
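
To illustrate why the two loops in generate_music are such different targets for a vectorizer, here is a minimal sketch that isolates each pattern. fill_silence and filter_chain are hypothetical names; the second function is modelled on the single source line `b1=v1-a1+((b1*61+32)>>6);a1=v1;` and leaves out the rest of the loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent iterations: no store depends on an earlier iteration, so
   several of them can be done by a single vector store. */
void fill_silence(int16_t *buf, size_t frames)
{
    assert(buf != NULL);
    for (size_t i = 0; i < frames; i++)
        buf[i * 2] = buf[i * 2 + 1] = 0;
}

/* Loop-carried dependency: iteration i needs the a and b values produced
   by iteration i-1, so the iterations cannot simply run side by side. */
void filter_chain(const int32_t *v, int32_t *out, size_t n)
{
    assert(v != NULL && out != NULL);
    int32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        b = v[i] - a + ((b * 61 + 32) >> 6);
        a = v[i];
        out[i] = b;
    }
}
```

Feeding filter_chain the constant input {64, 64, 64} produces {64, 61, 58}: every output is derived from the one before it, which is the kind of recurrence compilers generally leave as scalar code.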

With that in mind, it would be interesting to see if there are loops in the source code that didn't get converted to SVE2 instructions and ascertain why. One such example is in main, which I'll show the first part of for context:

int main(int _argc, char **_argv)
{
   int args=1;
   char * strtol_str=NULL;
   const char * oversion;
   const char * env_seed;
   int env_used;
   int num_encoders_to_fuzz=5;
   int num_setting_changes=40;

   env_used=0;
   env_seed=getenv("SEED");
   if(_argc>1)
    iseed=strtol(_argv[1], &strtol_str, 10);  /* the first input argument might be the seed */
   if(strtol_str!=NULL && strtol_str[0]=='\0')   /* iseed is a valid number */
    args++;
   else if(env_seed) {
    iseed=atoi(env_seed);
    env_used=1;
   }
   else iseed=(opus_uint32)time(NULL)^(((opus_uint32)getpid()&65535)<<16);
   Rw=Rz=iseed;

while(args<_argc)
   {
    if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {
        num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }
   }

The while loop here iterates through the command line arguments in _argv, and the logic within checks the validity of the arguments. The correct way to call the encoding test is in the format ./test_opus_encode [<seed>] [-fuzz <num_encoders> <num_settings_per_encoder>]. Disassembled, the first loop section looks like this:

  4012f4:       97ffff7f        bl      4010f0 <strcmp@plt>
  4012f8:       350001e0        cbnz    w0, 401334 <main+0x134>
  4012fc:       11000e73        add     w19, w19, #0x3
  401300:       6b14027f        cmp     w19, w20
  401304:       54000181        b.ne    401334 <main+0x134>  // b.any

We can tell from the reference to <strcmp@plt> that this is where the loop's first condition is evaluated, with the string comparison between the current command line argument and "-fuzz" taking place. So why isn't this loop vectorized? Let's break it down.

while(args<_argc)
   {

args is initialized to 1 (and incremented to 2 if a seed was given as the first argument). The while loop executes as long as args is less than _argc (the number of command line arguments provided when invoking the program).

    if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {

The first condition evaluated is if the argument is the string "-fuzz".

        num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);

If it is, and _argc equals args+3 (4 arguments when no seed is given), the number of encoders to fuzz is set from the next argument and execution moves on to the next condition.

        if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {

If strtol_str[0] (the character following a number from the _argv[args+1] string that was just parsed) is not a null terminating character or the num_encoders_to_fuzz is less than or equal to zero - that is to say there are characters in the arguments when there should only be numbers at this point, or the number of encoders to fuzz was improperly set - then print the proper usage of the invocation arguments and exit.

if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }

Otherwise, continue evaluating the command line arguments and check if the num_setting_changes is set properly by the third argument using the same logic of the previous condition.

num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }

If both values are valid, increment args by 3. If the argument wasn't -fuzz in the first place, print the usage and exit.

        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }

The args increment at the end makes the while condition evaluate false, so the loop body only runs once and it makes sense that SVE2 instructions wouldn't apply here. There would be no benefit to vectorizing a loop that can only execute once, and a body made up of function calls (strcmp, strtol) and early returns is not something an autovectorizer can generally turn into vector instructions anyway.

Conclusion

In conclusion, it's been interesting looking at how SVE2 optimization can benefit an open source library. This is a cool technology that will no doubt become pervasive very quickly and have widespread benefits, especially for large data processing libraries such as this. I explored some different ways to make use of it through compiler intrinsics as well as autovectorization; some attempts were challenging and less fruitful, while others seemed to find purchase and successfully optimize opus' encoding functionality. I broke down some code that was optimized and some that wasn't, along with the reasons why, and took a closer look at the disassembled code compared to its source to see how the compiler implements SVE2 for us.

I hope my work can be useful to those interested in implementing SVE2 in their own projects, or to the maintainers of the opus project. The latter might find those tests that I couldn't get to pass with autovectorization to be a good place to start. The "Illegal instruction (core dumped)" message suggests that those tests were run directly on the host CPU rather than under qemu-aarch64, and I couldn't determine how to apply the emulator in those cases. Doing so would likely allow those tests to pass and the entire library to take advantage of SVE2.
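
The exit status in the test log is consistent with that reading. Shells conventionally report a process killed by a signal as 128 plus the signal number, so 132 maps to signal 4, which is SIGILL (illegal instruction) on Linux. A minimal sketch of that arithmetic, where signal_from_exit_status is a hypothetical helper name and not part of opus or automake:

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical helper: recover the signal number from a shell-style exit
   status. Returns 0 when the status does not encode a signal. */
int signal_from_exit_status(int status)
{
    assert(status >= 0 && status <= 255);   /* exit statuses fit in one byte */
    return status > 128 ? status - 128 : 0;
}
```

Comparing signal_from_exit_status(132) against SIGILL from signal.h is a quick way to confirm which signal a failing test died from.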

This project and this course at large have been very useful in changing my perspective on programming and allowed me to get much closer to the metal than I have before. It's cleared up many misconceptions about how computers treat data - to paraphrase my professor, "Your other teachers probably told you variables are stored in memory - they lied." This project and course have been full of little epiphanies like that that I think have been influential in refining my concept of programming and I'm glad I was able to have this experience before graduating. Thanks for reading.

Optimizing a Program Through SVE2 Auto-Vectorization

gus — Thu, 21 Apr 2022 22:44:00 +0000

Today I'm going to be taking another look at the volume scaling algorithms we benchmarked in my last post with the goal of adding SVE2 optimization and further improving the runtime. Because we're using SVE2 we need to make these changes on either vol4.c or vol5.c, as those are the AArch64-specific algorithms that take advantage of inline assembly and intrinsics, respectively.

To make things simple I'll use the first candidate, vol4.c, which uses inline assembly. The full code is as follows:

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}
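
The sqrdmulh step in the listing above can be checked against a scalar model. The sketch below assumes the instruction behaves as the listing's comments describe (double the product, round, keep the upper 16 bits, saturate). sqrdmulh_s16 and scale_block are hypothetical names, and the 50% volume used in the note afterwards is an assumed value, since VOLUME is defined in vol.h, which isn't shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one sqrdmulh lane:
   (2*a*b + 0x8000) >> 16, saturated to the int16_t range. */
int16_t sqrdmulh_s16(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    /* Assumes >> on a negative value is an arithmetic shift,
       as it is with GCC and Clang. */
    int64_t rounded = (doubled + 0x8000) >> 16;
    if (rounded > INT16_MAX) rounded = INT16_MAX;
    if (rounded < INT16_MIN) rounded = INT16_MIN;
    return (int16_t)rounded;
}

/* Scale n samples by a Q15 volume factor, one sample at a time. */
void scale_block(const int16_t *in, int16_t *out, size_t n, int16_t vol_int)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = sqrdmulh_s16(in[i], vol_int);
}
```

With a volume of 50%, vol_int works out to (int16_t)(0.5 * 32767.0) = 16383, and scaling a sample of 1000 gives 500. Saturation only comes into play for -32768 * -32768, where the doubled product would otherwise overflow the 16-bit result.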

To start, we need to include the relevant header, arm_sve.h, alongside the existing includes.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <time.h>
#include <arm_sve.h>

#ifndef __aarch64__
        printf("Wrong architecture- written for aarch64 only.\n");

Next, I changed the dup and sqrdmulh instructions to use the scalable z registers instead of the NEON v registers, as per the SVE2 standard.

__asm__ ("dup z1.h,%w0"::"r"(vol_int)); //duplicate vol_int into z1.h
...
"sqrdmulh z0.h, z0.h, z1.h      \n\t"

Next the makefile that we use to build the program needs to be changed to trigger the use of SVE2 by the compiler.

vol4:    vol4.c vol_createsample.o vol.h
         gcc ${CCOPTS} vol4.c -march=armv8-a+sve2 vol_createsample.o -o vol4

And finally, when running it we need to launch it through qemu-aarch64, which emulates hardware capable of running SVE2, as the real thing isn't available to us yet. I ran it with the following command and confirmed it worked as intended.

qemu-aarch64 ./vol4

This has been a quick exploration of getting SVE2 instructions into a program, here by adapting its inline assembly and building with SVE2 enabled. Enjoy!

Adding SVE2 Support to an Open Source Library - Part II

gus — Tue, 12 Apr 2022 01:27:38 +0000

Part 1
Part 2
Part 3

In the last entry in this series I found a library called opus which currently uses SIMD by way of compiler intrinsics. Today I'm implementing SVE2 optimization in this library.

My first step will be swapping out the SIMD intrinsics in a file for their SVE2 counterparts. Then I can modify the makefile to detect when it's appropriate to use those enhancements and compile them accordingly. If a machine can't support SVE2, there's no use compiling that code.

By performing a search for "neon" in the package we get the following results:

find | grep neon

./celt/arm/pitch_neon_intr.lo
./celt/arm/celt_neon_intr.lo
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.o
./celt/arm/pitch_neon_intr.c
./celt/arm/celt_neon_intr.o
./celt/arm/.libs/pitch_neon_intr.o
./celt/arm/.libs/celt_neon_intr.o
./celt/arm/.deps/pitch_neon_intr.Plo
./celt/arm/.deps/celt_neon_intr.Plo
./silk/fixed/arm/.deps/warped_autocorrelation_FIX_neon_intr.Plo
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/biquad_alt_neon_intr.lo
./silk/arm/NSQ_neon.c
./silk/arm/NSQ_del_dec_neon_intr.o
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.lo
./silk/arm/NSQ_neon.h
./silk/arm/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/NSQ_del_dec_neon_intr.o
./silk/arm/.libs/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/biquad_alt_neon_intr.o
./silk/arm/.libs/NSQ_neon.o
./silk/arm/LPC_inv_pred_gain_neon_intr.lo
./silk/arm/.deps/NSQ_neon.Plo
./silk/arm/.deps/NSQ_del_dec_neon_intr.Plo
./silk/arm/.deps/LPC_inv_pred_gain_neon_intr.Plo
./silk/arm/.deps/biquad_alt_neon_intr.Plo
./silk/arm/biquad_alt_neon_intr.o
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.lo
./silk/arm/NSQ_neon.o

It looks like there's a lot to work with here - unfortunately we don't have time to add SVE2 intrinsics to all these files so we'll have to narrow in on one file or even a section of a file to start with, which the maintainers can use as a jumping off point for future optimization. In the last post I'd mentioned one file in particular, opus/celt/arm/pitch_neon_intr.c. I'll start there and see what I can do.

First we'll include the appropriate header:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */
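
The compiler defines these feature macros according to the -march flags it is given, so the same guard decides at build time which code path exists. A minimal sketch for checking which one a given build sees; simd_level and print_simd_level are hypothetical helper names, not part of opus:

```c
#include <assert.h>
#include <stdio.h>

/* Report the widest SIMD feature macro the compiler defined for this build. */
const char *simd_level(void)
{
#if defined(__ARM_FEATURE_SVE2)
    return "sve2";
#elif defined(__ARM_FEATURE_SVE)
    return "sve";
#elif defined(__ARM_NEON)
    return "neon";
#else
    return "scalar";
#endif
}

/* Print the detected level to the given stream. */
void print_simd_level(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "SIMD level: %s\n", simd_level());
}
```

Built with -march=armv8-a+sve2 this should report sve2, while a default AArch64 build would be expected to report neon. Note that __ARM_FEATURE_SVE and __ARM_FEATURE_SVE2 are separate macros, so a guard on one does not imply the other was tested.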

Starting with the first loop we encounter, the code is as follows:

opus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{

    int i;
    opus_val32 xy;
    int16x8_t x_s16x8, y_s16x8;
    int32x4_t xy_s32x4 = vdupq_n_s32(0);
    int64x2_t xy_s64x2;
    int64x1_t xy_s64x1;

    for (i = 0; i < N - 7; i += 8) {
        x_s16x8  = vld1q_s16(&x[i]);
        y_s16x8  = vld1q_s16(&y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_low_s16 (x_s16x8), vget_low_s16 (y_s16x8));
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_high_s16(x_s16x8), vget_high_s16(y_s16x8));
    }

    for (; i < N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }

By looking up the intrinsics in the reference Arm provides, we can quickly find out what the Neon intrinsics represent and look for their SVE2 counterparts.

We start with initializations, including one initialization to the result of vdupq_n_s32, which sets all lanes of the register to the same value. The SVE2 intrinsic I used for this is svdup_lane. Note that svdup_lane broadcasts one lane of an existing vector, so svdup_n_s32, which broadcasts a scalar, is the closer match.

The first intrinsic in the loop, vld1q_s16, loads eight consecutive 16-bit elements into one 128-bit register. In this case, it loads x_s16x8 with the values starting at &x[i]. It's followed by another of the same type which loads y_s16x8 with the values starting at &y[i]. The SVE2 intrinsic I used for this is svldnf1sh_s32, a non-faulting load that sign-extends halfwords into 32-bit lanes; the plain predicated load svld1_s16 is the more direct equivalent. Next there are two multiply-accumulates, between the low portions of x and y and then the high portions, using the vmlal_s16 intrinsic. For these I picked svpmullb and svpmullt, the bottom and top variants, but those are polynomial multiplies; the widening multiply-accumulate intrinsics in SVE2 are svmlalb and svmlalt. We also need to call vget_low_s16 and vget_high_s16, or rather their SVE2 counterparts: svunpklo and svunpkhi.

After making all the aforementioned adjustments, here's what we get:

#ifdef __ARM_FEATURE_SVE2
opus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{
    int i;
    opus_val32 xy;
    svint16_t x_s16x8, y_s16x8;
    svint32_t xy_s32x4 = svdup_lane(0);
    svint64_t xy_s64x2;
    svint64_t xy_s64x1;

    for (i = 0; i < N - 7; i += 8) {
        x_s16x8  = svldnf1sh_s32(&x[i]);
        y_s16x8  = svldnf1sh_s32(&y[i]);
        xy_s32x4 = svpmullb(xy_s32x4, svunpklo (x_s16x8), svunpklo (y_s16x8));
        xy_s32x4 = svpmullb(xy_s32x4, svunpkhi (x_s16x8), svunpkhi (y_s16x8));
    }

    if (N - i >= 4) {
        const int16x4_t x_s16x4 = vld1_s16(&x[i]);
        const int16x4_t y_s16x4 = vld1_s16(&y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, x_s16x4, y_s16x4);
        i += 4;
    }

    xy_s64x2 = vpaddlq_s32(xy_s32x4);
    xy_s64x1 = vadd_s64(vget_low_s64(xy_s64x2), vget_high_s64(xy_s64x2));
    xy      = vget_lane_s32(vreinterpret_s32_s64(xy_s64x1), 0);

    for (; i < N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }
#endif

Now all we have to do is see if we can compile and run it.

CCASFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
CCDEPMODE = depmode=gcc3
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

I added the relevant compile flags to turn on SVE2 optimization and gave it a go. Unfortunately there were some build errors that will have to be dealt with, so in my next post I'll go over the next steps to solve those and continue building SVE2 optimizations into this package. More on that soon!

Algorithm Selection on x86_64 vs AArch64 Part II

gus — Sun, 10 Apr 2022 21:47:16 +0000

This is part 2 of a series on algorithm benchmarking and selection on x86_64 and AArch64 systems. You can find part 1 here. In the previous post we went through the algorithms we're to benchmark and broke some of their workings down, providing predictions along the way as to how they would stack up. Now it's time to put them to the test and see which comes out on top.

You may have noticed there was a gap in the numbering of the algorithms, between vol2.c and vol4.c. vol3.c is a dummy program provided to us without the volume scaling algorithm, so we can isolate the performance of that one function. Alternatively, we can do so with code by including the C time library and timing the scaling function. This method is less error prone so I'll be benchmarking the algorithms in this way.

The first step is to increase the sample size in our header to work with a substantial enough dataset in our benchmarking to get some meaningful results. I cranked up the sample number to 1600000000, after which I got to work inserting the timing code into each of the programs.

For example vol0.c looks like so:

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        clock_t t;
        t = clock();

        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

        t = clock() - t;

        printf("Time elapsed: %f\n", ((double)t)/CLOCKS_PER_SEC);

I then ran it in a loop to execute 20 times and send the output to a log file like so:

for ((i = 0; i < 20; i++)) ; do ./vol0 ; done |&tee vol0output.log
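As an alternative to looping in the shell, the repetition and averaging could also be done inside the program. The following is only a rough sketch: average_cpu_seconds and dummy_work are hypothetical names that are not part of the vol programs, and dummy_work stands in for the scaling loop. Note that clock() reports processor time used by the program rather than wall-clock time.

```c
#include <time.h>

/* Hypothetical helper: run `work` `runs` times and return the mean CPU
 * time per run in seconds, measured with clock() as in the post. */
double average_cpu_seconds(void (*work)(void), int runs) {
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        work();
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return runs > 0 ? total / runs : 0.0;
}

/* Stand-in workload so the sketch is self-contained; volatile keeps the
 * compiler from removing the loop. */
void dummy_work(void) {
    volatile long sink = 0;
    for (long i = 0; i < 1000000; i++) {
        sink += i;
    }
}
```

Calling average_cpu_seconds(dummy_work, 20) would mirror the 20 runs used here, with the scaling loop taking the place of dummy_work.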

After following these steps for all the programs I had my results for the AArch64 system ready to compare. I did the same on the x86_64 system, omitting the last two algorithms, as their AArch64-specific SIMD code won't run on that architecture. The results are as follows:

AArch64 Results

Algorithm	Vol0.c	Vol1.c	Vol2.c	Vol4.c	Vol5.c
Time (s)	5.286	4.644	11.257	2.756	2.837
	5.251	4.587	11.258	2.776	2.777
	5.295	4.623	11.226	2.766	2.803
	5.277	4.573	11.239	2.784	2.784
	5.287	4.603	11.25	2.757	2.801
	5.283	4.568	11.229	2.796	2.787
	5.283	4.581	11.234	2.74	2.8
	5.311	4.566	11.244	2.782	2.806
	5.287	4.601	11.233	2.848	2.796
	5.244	4.639	11.244	2.756	2.755
	5.279	4.558	11.239	2.744	2.763
	5.293	4.56	11.236	2.782	2.79
	5.288	4.632	11.233	2.73	2.886
	5.27	4.591	11.262	2.775	2.818
	5.277	4.567	11.243	2.721	2.836
	5.31	4.576	11.234	2.812	2.799
	5.295	4.552	11.237	2.784	2.789
	5.25	4.567	11.23	2.776	2.806
	5.283	4.565	11.235	2.798	2.762
	5.248	4.564	11.215	2.823	2.824
Average	5.279	4.585	11.238	2.775	2.800
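To put the averages in the table above in relative terms, here is a small sketch that turns them into ratios. The function names are my own for illustration, and the constants are copied from the Average row. By this arithmetic Vol4.c comes out roughly 1.9 times faster than Vol0.c, and Vol2.c takes roughly 2.1 times as long as Vol0.c.

```c
#include <stdio.h>

/* Ratio of two runtimes: how many times faster `candidate` is than `baseline`. */
double speedup(double baseline_seconds, double candidate_seconds) {
    return baseline_seconds / candidate_seconds;
}

/* Print ratios using the AArch64 averages from the table above. */
void print_aarch64_speedups(void) {
    const double vol0 = 5.279, vol1 = 4.585, vol2 = 11.238, vol4 = 2.775;
    printf("vol4 vs vol0: %.2fx\n", speedup(vol0, vol4));
    printf("vol4 vs vol1: %.2fx\n", speedup(vol1, vol4));
    printf("vol0 vs vol2: %.2fx\n", speedup(vol2, vol0));
}
```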

It looks like these results more or less confirm what we predicted, with the fastest 2 being those that took advantage of SIMD to process eight samples per instruction. Vol2.c was way behind in execution time at a whopping average of 11.238 seconds per execution, over double the next slowest algorithm. This suggests that looking up precalculated results can be incredibly costly when the cache serving the table can't keep pace with the math unit of the processor. The naïve approach in Vol0.c of multiplying each sample by a scale factor, with multiple type conversions in the process, somewhat unsurprisingly comes in second slowest. Avoiding the conversions by using fixed-point multiplication and bit shifting in Vol1.c yields a slightly faster runtime. Now onto the x86_64 results:

x86_64 Results

Algorithm	Vol0.c	Vol1.c	Vol2.c
Time (s)	2.91	2.755	3.574
	2.849	2.762	3.552
	2.764	2.747	3.543
	2.753	2.739	3.502
	2.763	2.771	3.497
	2.761	2.739	3.503
	2.77	2.774	3.527
	2.77	2.77	3.507
	2.782	2.751	3.5
	2.752	2.763	3.496
	2.765	2.757	3.53
	2.753	2.757	3.501
	2.776	2.759	3.515
	2.771	2.758	3.527
	2.768	2.761	3.5
	2.758	2.777	3.518
	2.783	2.749	3.499
	2.764	2.747	3.496
	2.772	2.752	3.504
	2.777	2.756	3.502
Average	2.778	2.757	3.514

The execution times on x86_64 tell a similar story, although there are a few interesting distinctions. First, the type conversions that set Vol0.c back so much in the AArch64 benchmarks seem to have much less of an impact here. Vol0.c and Vol1.c share almost exactly the same runtime, although working with one type and bit shifting did shave off about 20 milliseconds on average. Also of note is that Vol2.c doesn't incur the massive performance penalty seen on its AArch64 counterpart. This suggests that the cache on this machine's processor comes much closer to keeping up with the math unit when fetching the results we need.

In conclusion, this was an eye opening experience that confirmed my knowledge about the advantages of SIMD while giving specific evidence to support just how fast it is compared to traditional processing. We also learned just how important it is to know the machine you're optimizing for intimately, to account for differences like that between the algorithm using the precalculated table on the AArch64 machine vs the x86_64 one. Doing so can inform your programming decisions and help avoid making costly assumptions that in this case might mean more than doubling your runtime.

Adding SVE2 Support to an Open Source Library - Part I

gus — Mon, 28 Mar 2022 18:41:27 +0000

Part 1
Part 2
Part 3

SVE was developed by Arm as a new SIMD instruction set used as an extension to AArch64, that allows for variable vector length implementations. SVE2 is a superset of SVE and its precursor, Neon. Among many benefits of SVE and SVE2, one is that the same binaries can run on different AArch64 hardware with differing vector length implementations. It is especially suited to processing large datasets and for this reason I'll be implementing its use in an open source library to improve performance.

My first task is to find an open source library to implement SVE2 support for, ideally one that's used for processing large amounts of data like a crypto or multimedia library. As I'm interested in audio and audio programming, I'll start looking there and hopefully find a good candidate. Criteria for my search are as follows:

Open source
Library level package, application level SVE2 optimization is less useful
Ideally has Neon implementation already to glean ideas for how I'll approach SVE2 implementation

I started by thinking of what open source audio applications I know of, and the first that came to mind was Audacity. I used dnf list as my prof recommended to look up the package on the AArch64 server and confirmed one was available.

I then used dnf deplist to see what dependencies it had to try and narrow down which would be a good target for optimization. There were several libraries which could be good candidates:

Advanced Linux Sound Architecture Library (ALSA)

Free Lossless Audio Codec (FLAC)

Libogg

From there I checked the FLAC library to get access to the source code and find out more about how an SVE2 optimization could work out. The git URL on their website was down so I left it for now to check out the other libraries and circle back to it if they don't pan out.

I found the page with the relevant info to clone the ALSA library and did so.

git clone git://git.alsa-project.org/alsa-lib.git alsa-lib

Unfortunately, after many searches trying to find anything related to sve, Neon, or AArch64 specific implementations, I wasn't able to find anything. Again I'm going to keep going and circle back to this if I hit a wall.

Last in my list is Libogg. I found out it's located here and is maintained by the same organization that maintains FLAC. Thankfully this git link wasn't broken. Unfortunately I once again came up empty when looking for references to Neon or SIMD, so I expanded my search to look through the various xiph projects - the maintainer of the aforementioned FLAC and ogg libraries. In doing so I found a great candidate, this library called opus with specific references to AArch64 and Neon.

Opus

In opus/cmake/OpusFunctions.cmake I was able to find a check to establish whether the CPU and the compiler support Neon.

This indicates that this package takes advantage of SIMD, Neon being one implementation.

After configuring the library I was able to find a Makefile and see what compilation options it was using. In this case it had the following:

CFLAGS = -g -O2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

Moving this up a level to -O3, together with an -march value that enables SVE2 such as -march=armv8-a+sve2, would get the SVE2 autovectorization to kick in. Furthermore, I found that this package takes advantage of intrinsics, for example in the opus/celt/arm/pitch_neon_intr.c source file:

for (i = 0; i < N - 7; i += 8) {
        x_s16x8  = vld1q_s16(&x[i]);
        y_s16x8  = vld1q_s16(&y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_low_s16 (x_s16x8), vget_low_s16 (y_s16x8));
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_high_s16(x_s16x8), vget_high_s16(y_s16x8));
    }

This would be a good place to start. By creating an SVE2 equivalent of pitch_neon_intr.c and/or celt_neon_intr.c with the SVE2 versions of the intrinsics therein, I can get the ball rolling on optimizing this package for SVE2. I sent an email to the opus developer mailing list expressing my intention to do so, and now all that's left is to do it! More on that soon.
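For reference, the Neon loop above accumulates products of 16-bit values into 32-bit sums, which is an inner product. Below is a plain C sketch of the same computation. The function name is hypothetical and is not part of opus. A simple loop of this shape is the kind of code an autovectorizer can target when built with -O3 and an SVE2-enabled -march, though whether it actually gets vectorized depends on the compiler.

```c
#include <stdint.h>

/* Scalar reference for the multiply-accumulate loop: sum of x[i] * y[i]
 * with 16-bit inputs and a 32-bit accumulator. The function name is
 * hypothetical and is not part of opus. */
int32_t inner_prod_scalar(const int16_t *x, const int16_t *y, int n) {
    int32_t xy = 0;
    for (int i = 0; i < n; i++) {
        xy += (int32_t)x[i] * (int32_t)y[i];
    }
    return xy;
}
```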

Algorithm Selection on x86_64 vs AArch64 Part I

gus — Thu, 24 Mar 2022 02:30:10 +0000

In this post I'll explore benchmarking a few different programs with different algorithms to scale volume. I'll be benchmarking 5 different algorithms which act on an incoming stream of samples to scale them according to a desired volume. Scaling audio in real time on a 48 kHz signal with 16-bit samples involves 96,000 bytes of data per second for each channel, so efficiency is key to making sure nothing is lost or delayed. With that in mind, let's take a look at some different methods of scaling audio volume and see how they stack up against each other, as well as across x86_64 and AArch64 architectures.

Our incoming sample will be simulated by the following:

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i<sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}

Algorithm 1 - vol0.c - Naïve

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}

This first algorithm takes the naïve route of just multiplying each sample by a scale factor. This involves converting an integer to a floating point value and back again, which is very costly, especially at this scale. I'm going to go out on a limb and say this could be done more efficiently, so I predict that this one will perform the worst. (Also of note: the sum and print portions of the code are needed so the compiler doesn't optimize away the actual sample scaling portion of the program.)

Algorithm 2 - vol1.c - Fixed Point

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

This algorithm avoids the floating point conversions bogging down the previous code and opts for a whole number multiplication followed by a bit shift, which is much more conservative on compute power. This should save time over our last algorithm but will probably be the 2nd or 3rd slowest.
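To see how closely the fixed-point version tracks the floating-point one, here is a small sketch that runs both scale_sample variants over every possible 16-bit input. The volume of 50 used in the example is my assumption (the post doesn't state VOLUME), and max_difference is a hypothetical helper. With a volume of 50, a sample of 1000 becomes 500 with floating point and 499 with fixed point, so the two are not bit-identical.

```c
#include <stdint.h>

/* Floating-point version, as in vol0.c. */
int16_t scale_float(int16_t sample, int volume) {
    return (int16_t)((float)(volume / 100.0) * (float)sample);
}

/* Fixed-point version, as in vol1.c. The right shift of a negative value
 * is implementation-defined in C; GCC and Clang shift arithmetically. */
int16_t scale_fixed(int16_t sample, int volume) {
    return (int16_t)((((int32_t)sample) * ((int32_t)(32767 * volume / 100) << 1)) >> 16);
}

/* Largest absolute difference between the two over every 16-bit input. */
int max_difference(int volume) {
    int worst = 0;
    for (int s = -32768; s <= 32767; s++) {
        int diff = scale_float((int16_t)s, volume) - scale_fixed((int16_t)s, volume);
        if (diff < 0) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

A difference of one step out of 32768 is far below anything audible, which is why the cheaper integer math is an acceptable trade.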

Algorithm 3 - vol2.c - Precalculated

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x <= 32767; x++) {
 // Q: What is the purpose of the cast to uint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x < SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

This algorithm has all 65536 values (-32768 to 32767) precalculated, so the program just needs to look up the result for each value. This produces a 128 KiB table (65536 entries of 2 bytes each) covering every possible 16 bit sample, which is not too much compared to the size of audio files. Performance in this case will hinge largely on how fast the math unit is vs the cache that will be fetching the 128 KiB of data. I think once again this could be the 2nd or 3rd slowest algorithm.
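The cast to uint16_t flagged in the code comment is what turns a signed sample into a valid array index. A minimal sketch of that mapping and of the table size follows. The helper names table_index and table_bytes are my own and don't appear in vol2.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Index used for the lookup: converting to uint16_t wraps negative
 * samples into the upper half of the table (modulo 65536). */
size_t table_index(int16_t sample) {
    return (size_t)(uint16_t)sample;
}

/* Size in bytes of a table holding one int16_t result per 16-bit input. */
size_t table_bytes(void) {
    return (size_t)65536 * sizeof(int16_t);
}
```

So a sample of -1 lands at index 65535 and -32768 at index 32768, while non-negative samples index the lower half directly.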

Algorithm 4 - vol4.c - Inline SIMD

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32-bit signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

This algorithm uses inline assembly to process multiple values simultaneously using SIMD. As such it will almost certainly perform better than the prior algorithms. Because the SIMD instructions used here are specific to AArch64, the program only runs on that architecture, so we'll have to leave the x86_64 benchmarking out for this algorithm.

There are 5 points of interest marked by "Q" as follows:

// Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

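(1). The value needs to be multiplied by 32767 rather than 32768 to prevent integer overflow: at a volume of 100% the factor would be 32768, which doesn't fit in an int16_t, whose maximum is 32767.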

// Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;

(2). The in_cursor and out_cursor are set to point to the first elements of the in and out arrays. These will be used in the following loop to read samples from in and write the scaled results to out.

 // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

(3). vol_int represents the volume as a signed 16 bit integer. The dup instruction copies that scaling factor from a 32-bit general purpose register (%w0 is the w-register form of the first operand) into each of the eight 16-bit lanes of the vector register v1.

// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"

(4). These 3 lines are all part of the second inline assembly statement in this program. The first line lists the output operands, the second the input operands, and the last the clobbers, in this case memory.

// Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

(5). The results here should be correct as the sqrdmulh instruction above saturates the results, preventing overflow.
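To make the saturation in point (5) concrete, here is a scalar C sketch of what sqrdmulh does to a single 16-bit lane, following the description in the code comments above (double the product, round, keep the upper 16 bits, saturate). The function name sqrdmulh_lane is hypothetical. The vol_int value of 16383 used in the usage note assumes a VOLUME of 50, which the post doesn't state.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of sqrdmulh:
 * saturate((2 * a * b + 0x8000) >> 16). */
int16_t sqrdmulh_lane(int16_t a, int16_t b) {
    int64_t product = 2 * (int64_t)a * (int64_t)b + 0x8000;
    int64_t high = product >> 16;   /* arithmetic shift on GCC and Clang */
    if (high > INT16_MAX) high = INT16_MAX;
    if (high < INT16_MIN) high = INT16_MIN;
    return (int16_t)high;
}
```

With vol_int at 16383, a sample of 1000 comes out as 500. The only input pair that actually needs saturating is -32768 times -32768, which would otherwise produce 32768.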

Algorithm 5 - vol5.c - Intrinsics SIMD

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor < limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}

This last algorithm also uses SIMD but rather than inline assembler opts for compiler intrinsics. It should likewise benefit from the simultaneous processing of the previous algorithm, so I'd expect either this one or that one to come out on top. Again, some sections are pointed out for clarification which I'll do now.

// What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

(1). These intrinsic functions are equivalent to the instructions used in the last program. vst1q_s16 is equivalent to str, vqrdmulhq_s16 is equivalent to sqrdmulh, vld1q_s16 is equivalent to ldr, and vdupq_n_s16 is equivalent to dup.

// Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;

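(2). The pointer is incremented by 8 because each intrinsic call processes 8 elements at a time: a 128-bit vector register holds eight 16-bit samples, and adding 8 to an int16_t pointer advances it by 16 bytes. In the inline assembler program, the load and store instructions post-incremented the pointers for us, but here we need to do it manually.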

// Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

(3). Once again, as we're using the intrinsic equivalent of sqrdmulh, the results should be saturated and avoid potential overflow, so the output should be reliable.
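The relationship between the increment of 8 here and the #16 post-increment in the inline assembler version comes down to C pointer arithmetic, which counts elements rather than bytes. A tiny sketch, with bytes_per_step as a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes an int16_t pointer moves when 8 is added to it. */
size_t bytes_per_step(void) {
    int16_t buffer[16] = {0};
    int16_t *cursor = buffer;
    int16_t *next = cursor + 8;   /* pointer arithmetic counts elements */
    return (size_t)((char *)next - (char *)cursor);
}
```

Eight 2-byte samples make 16 bytes, the size of one 128-bit vector register, so both versions step through memory at the same rate.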

In the next post we'll get into putting these algorithms to the test and benchmarking them to find which is the fastest. More on that here.

x86_64 Assembly Language

gus — Tue, 22 Mar 2022 02:32:51 +0000

In my last post we went through writing a program for printing a message with a 2 digit incrementing value in AArch64 assembly. This time, we're going to tackle the same thing on an x86_64 architecture system.

Starting with the original code we're given:

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:
    /* ... body of the loop ... do something useful here ... */

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

First we need to add logic to print a message, like so:

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop\n"
        len = . - msg

By adding the provided text output code as above we get the following output:

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop

Now by adding logic to iterate an index and add it to the msg string, we get the following:

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15           /* loop index */

loop:

    mov     %r15,%r14           
    add     $'0',%r14     
    movb    %r14b,msg+6        

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop: #\n"
        len = . - msg

With the resulting output:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9

Finally, we'll move to printing a 2 digit index along with our loop by making the following changes:

.text
.globl    _start

min = 0                         /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $min,%r15        
    mov     $10,%r13          

loop:

    mov     %r15,%rax         
    mov     $0,%rdx          
    div     %r13          
    cmp     $0,%rax
    je     secondDigit

    add     $'0',%rax          
    mov     %rax,%r12
    movb    %r12b,msg+6       

secondDigit:

    add     $'0',%rdx          
    mov     %rdx,%r12
    movb    %r12b,msg+7        

    mov     $len,%rdx           /* message length */
    mov     $msg,%rsi           /* message location */
    mov     $1,%rdi             /* file descriptor stdout */
    mov     $1,%rax             /* syscall sys_write */
    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

.section .data

msg:    .ascii      "Loop:  #\n"
        len = . - msg

Which gets us the appropriate output:

Loop: #0
Loop: #1
Loop: #2
Loop: #3
Loop: #4
Loop: #5
Loop: #6
Loop: #7
Loop: #8
Loop: #9
Loop: #10
Loop: #11
Loop: #12
Loop: #13
Loop: #14
Loop: #15
Loop: #16
Loop: #17
Loop: #18
Loop: #19
Loop: #20
Loop: #21
Loop: #22
Loop: #23
Loop: #24
Loop: #25
Loop: #26
Loop: #27
Loop: #28
Loop: #29
Loop: #30

Again this looks like the output we're looking for, so I'll break down how we got here and leave some parting thoughts on writing assembly programs for x86_64 vs my experience writing for AArch64.

_start:
    mov     $min,%r15        
    mov     $10,%r13

Starting off with the start section, we move the value of min to r15 and set r13 to 10, which we'll use to divide and split our 2 digits. Remember in this syntax the destination register is placed on the right, contrary to how it was arranged in the AArch64 program.

loop:

    mov     %r15,%rax         
    mov     $0,%rdx          
    div     %r13          
    cmp     $0,%rax
    je     secondDigit

Next we place the value to be divided into rax and clear rdx, since div divides the 128-bit value held in rdx:rax by its operand. After the div instruction the quotient is in rax and the remainder is in rdx. We compare the quotient in rax, our "tens" column, to zero and branch to the secondDigit label if there's no tens column.

    add     $'0',%rax          
    mov     %rax,%r12
    movb    %r12b,msg+6

In the first line here we add an ascii 0 to the result of the division, after which we move that to r12 and finally move a byte of that with movb to msg+6, the tens position just before the pound sign in msg.

secondDigit:

    add     $'0',%rdx          
    mov     %rdx,%r12
    movb    %r12b,msg+7

Again here we add an ascii 0, this time to the remainder from the previous division, and then move that to r12 to have a byte moved to the pound sign address in msg. Like the last program, I'll end my breakdown here as the rest is pretty self evident and discuss my experience with x86_64.
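For comparison, the digit handling walked through above can be written in C roughly as follows. This is only a sketch: format_loop_msg is a hypothetical name, and unlike the assembly, which edits a single buffer in place, it copies the template on every call.

```c
#include <string.h>

/* Fill `msg` (at least 10 bytes) from the template used in the post,
 * then write the digits of `index` (0..99) the way the loop body does. */
void format_loop_msg(char *msg, unsigned index) {
    memcpy(msg, "Loop:  #\n", 10);  /* 9 characters plus the terminator */
    unsigned tens = index / 10;      /* quotient, left in rax by div */
    unsigned ones = index % 10;      /* remainder, left in rdx by div */
    if (tens != 0) {
        msg[6] = (char)('0' + tens); /* skipped when there is no tens digit */
    }
    msg[7] = (char)('0' + ones);
}
```

The single div instruction yields both values at once, whereas the C version spells them out as a division and a modulo.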

This was pretty similar to assembly in AArch64 in a lot of ways. There were some minor syntactical differences like the order of operands mentioned above, and the $ and % symbols being used to denote immediate values and registers, respectively. I'd be hard pressed to pick one I prefer, but if I had to I'd lean toward AArch64 for its syntax, which I find slightly more readable. The difference is pretty negligible though. I also like AArch64's philosophy of improvement and of not being weighed down by the legacy features and nomenclature that come with x86_64, but that hasn't affected my coding on either to any great extent thus far.

Overall this was a good challenge and I'm looking forward to diving deeper into these architectures.

AArch64 Assembly Language Part II

gus — Mon, 21 Mar 2022 20:08:04 +0000

This is the second post in a series on writing assembly language code for a program on the AArch64 architecture. You can find the first here.

Once again the task for this last stretch of our AArch64 exploration is to write a loop that counts from 0 to 30, which necessitates dealing with 2 digit numbers. This can be done by dividing the index by 10. The quotient becomes the first digit and the remainder the second, with a branch straight to the secondDigit label if the first digit is 0. The full code is as follows:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min
    mov     x20, 0x0A

loop:

    udiv    x21, x19, x20
    cmp     x21, 0
    b.eq    secondDigit

    add     x18, x21, '0'
    adr     x17, msg+6 
    strb    w18, [x17] 

secondDigit:
    msub    x22, x20, x21, x19

    add     x18, x22, '0' 
    adr     x17, msg+7 
    strb    w18, [x17] 

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

// Proceed with loop
    add     x19, x19, 1   
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop:  #\n"
len=    . - msg

With the following output:

Loop: #0
Loop: #1
Loop: #2
Loop: #3
Loop: #4
Loop: #5
Loop: #6
Loop: #7
Loop: #8
Loop: #9
Loop: #10
Loop: #11
Loop: #12
Loop: #13
Loop: #14
Loop: #15
Loop: #16
Loop: #17
Loop: #18
Loop: #19
Loop: #20
Loop: #21
Loop: #22
Loop: #23
Loop: #24
Loop: #25
Loop: #26
Loop: #27
Loop: #28
Loop: #29
Loop: #30

This looks like the output we're looking for which is great, let me break down the program in a little more detail and summarize my experiences writing AArch64 assembly.

_start:

    mov     x19, min
    mov     x20, 0x0A

We start the program by assigning 0 and 10 to registers 19 and 20, respectively. Both are used at their full 64 bit width, as indicated by the x prefix.

loop:

    udiv    x21, x19, x20
    cmp     x21, 0
    b.eq    secondDigit

This portion divides the value in x19 by the value in x20 and places the quotient in x21. The syntax for AArch64 assembly is such that you can look at operations as operand = value, or operand = expression in this case, as the destination register comes first in this syntax.

The second line compares the first digit of the result with 0, branching to the secondDigit label if the expression evaluates true. That would be in the case that it's a single digit result, which it will be for the first 10 iterations.

    add     x18, x21, '0'
    adr     x17, msg+6 
    strb    w18, [x17]

The first line adds '0' to the value in x21, turning the digit into its ASCII character, and places the result in x18. The second line loads the address msg+6 into x17; that is the position just before the pound sign, where the tens digit belongs. The final line stores the low byte of x18 (w18) at the address held in x17.
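To see which byte each store touches, here is a small Python model of the message buffer. It is only a sketch: `patch_digit` is a hypothetical helper, and the bytearray stands in for the `msg` data defined at the bottom of the program.

```python
# Model of the msg buffer from the .data section: "Loop:  #\n"
msg = bytearray(b"Loop:  #\n")

def patch_digit(buf, offset, digit):
    """Mimic `add x18, xN, '0'` followed by `strb w18, [x17]` with x17 = msg+offset."""
    buf[offset] = (ord("0") + digit) & 0xFF  # strb stores only the low byte
    return buf

patch_digit(msg, 6, 1)   # tens digit goes to msg+6
patch_digit(msg, 7, 2)   # ones digit goes to msg+7, over the '#'
print(msg.decode(), end="")  # Loop: 12
```

Counting from zero, the two spaces sit at offsets 5 and 6 and the pound sign at offset 7, which is why the tens digit goes to msg+6 and the ones digit to msg+7.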

secondDigit:
    msub    x22, x20, x21, x19

Finally, the code at the secondDigit label gets the remainder with the msub instruction, which sets x22 to x19 - (x20 * x21), or (loop index) - 10 * (result of the division). That remainder is the ones digit, which is then written to msg+7 in place of the pound sign.
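The quotient and remainder arithmetic can be checked with a quick Python model. This is a sketch of the instruction semantics as I understand them, not generated code: `udiv`, `msub` and `split_digits` are hypothetical helper names that mirror the instructions used above.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def udiv(n, m):
    """Unsigned divide like AArch64 `udiv`: truncating, and 0 when dividing by 0."""
    return 0 if m == 0 else n // m

def msub(n, m, a):
    """`msub Xd, Xn, Xm, Xa` computes Xa - (Xn * Xm), wrapped to 64 bits."""
    return (a - n * m) & MASK64

def split_digits(index, base=10):
    quotient = udiv(index, base)              # udiv x21, x19, x20
    remainder = msub(base, quotient, index)   # msub x22, x20, x21, x19
    return quotient, remainder

for i in (7, 10, 30):
    print(i, split_digits(i))
# 7 (0, 7)
# 10 (1, 0)
# 30 (3, 0)
```

For every index from 0 to 30 the pair matches Python's built-in `divmod(i, 10)`, which is a handy cross-check that the operand order in msub is right.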

The rest of the code is largely unchanged from the last few iterations of this program so I'll leave it at that.

This was a challenging program to write, although I thought the transition from 6502 assembly to 64-bit assembly would be harder. I'm sure my next step, writing the x86_64 equivalent, will be equally if not more difficult. There's definitely a more robust feeling, for lack of a better word, to writing and building assembly on a real machine rather than an emulator, and during my debugging process the error messages seemed more meaningful as well. I'm not very familiar with Linux, which made some things awkward, but I like knowing that I have the option to dig deeper if I need to find out why something's not working. That's it for this post; stay tuned for the x86_64 equivalent of this program, coming soon.

AArch64 Assembly Language Part I

gus — Sun, 20 Mar 2022 21:20:55 +0000

Today I'm moving on from the 6502 processor and starting to work with 64 bit assembly language. My tasks this time are to build off of two assembly code examples, one for the AArch64 architecture and the other for x86_64, to first print 0-9 in a loop and then 0-30.

First off, I'll be working with the AArch64 program. The starting point given to us is as follows:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    /* ... body of the loop ... do something useful here ... */

    add     x19, x19, 1
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

This code loops until it reaches 30 but doesn't do anything within the loop yet. Our first task is to change this so it prints each time a loop iteration executes. By adding in some code provided to us for printing a message we get the following:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg         /* message location (memory address) */
    mov     x2, len         /* message length (bytes) */

    mov     x8, 64          /* write is syscall #64 */
    svc     0               /* invoke syscall */

    add     x19, x19, 1 
    cmp     x19, max
    b.ne    loop

    mov     x0, 0           /* status -> 0 */
    mov     x8, 93          /* exit is syscall #93 */
    svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop\n"
len=    . - msg

And the output is:

Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
Loop
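The ten lines above follow from the structure of the add/cmp/b.ne sequence, which behaves like a do-while loop with an equality test at the bottom. Here is a rough Python model of that control flow; `run_loop` is a hypothetical name and the list of strings stands in for the write syscall.

```python
def run_loop(minimum, maximum):
    """Model of the loop skeleton; returns the lines the body would print."""
    lines = []
    i = minimum                  # mov x19, min
    while True:
        lines.append("Loop")     # loop body: the write syscall
        i += 1                   # add x19, x19, 1
        if i == maximum:         # cmp x19, max / b.ne loop falls through when equal
            break
    return lines

print(len(run_loop(0, 10)))  # 10
```

Because the test is "not equal" rather than "less than", the body always runs at least once and the loop only stops when the index lands exactly on max. That's fine here since min is below max and the step is 1.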

Next up is to get the loop to print a number that iterates each time the loop repeats. I modified the code like so:

.text
.globl _start

min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */

_start:

    mov     x19, min

loop:

    add    x18, x19, '0'         
    adr    x17, msg+6            
    strb   w18, [x17]            

    mov     x0, 1                /* file descriptor: 1 is stdout */
    adr     x1, msg              /* message location (memory address) */
    mov     x2, len              /* message length (bytes) */

    mov     x8, 64               /* write is syscall #64 */
    svc     0                    /* invoke syscall */

    add     x19, x19, 1       
    cmp     x19, max
    b.ne    loop

    mov     x0, 0                /* status -> 0 */
    mov     x8, 93               /* exit is syscall #93 */
    svc     0                    /* invoke syscall */

.data
msg:    .ascii      "Loop: #\n"
len=    . - msg

And got the appropriate output:

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
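This single-byte approach only works while the index is below 10, since it adds the index straight onto the character code for '0'. A short Python check (with `to_ascii` as a hypothetical stand-in for the add instruction) shows what would be stored once the index goes past 9:

```python
def to_ascii(index):
    """Mimic `add x18, x19, '0'`: offset the index from the character '0'."""
    return chr(ord("0") + index)

print([to_ascii(i) for i in (0, 9, 10, 11)])  # ['0', '9', ':', ';']
```

From 10 onward the result runs into the punctuation that follows the digits in the ASCII table, so printing two-digit numbers needs the index split into separate digits first.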

Next time I'll implement iterating until 30, after which I'll tackle the same on an x86_64 architecture. More on that soon!

6502 Math and Strings Part III

gus — Tue, 15 Mar 2022 00:06:20 +0000

This is part 3 in a series on writing a program in assembly for the 6502 processor. You can find part 1 here and part 2 here.

After covering the necessary tools for building our program, today I'm getting into the coding. I've chosen to write a simple program to get input from the user and determine if a number is even or odd using the Logical Shift Right I discussed in my last post.

First up, I set up a subroutine to draw the result of the operation on the bitmapped display. Referencing the example here I wrote code to display "Even" or "Odd" depending on the result.

Here's the code for just the "Odd" portion. From here I need to add character input, logic to perform the LSR operation, logic to get the remainder (through the carry flag I believe?) and finally print one of the two results to the bitmapped display.

define WIDTH      32 ; width  of sprite
define HEIGHT     8  ; height of sprite

done:   brk

; win sprite print subroutine 
    lda #$26 ; create a pointer at $17
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $19

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

odddraw:lda oddmsg ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne odddraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq odddraw

; sprite
oddmsg:               
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
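The parity check itself comes down to LSR moving bit 0 of the accumulator into the carry flag, which BCC and BCS can then branch on. Here is a minimal Python model of that idea, assuming the input is a single ASCII digit; `lsr` and `parity_of_key` are hypothetical helpers, not part of the program.

```python
def lsr(value):
    """6502 LSR on an 8-bit value: returns (result, carry)."""
    value &= 0xFF
    return value >> 1, value & 1   # bit 0 falls into the carry flag

def parity_of_key(char):
    """Classify a typed digit by shifting its ASCII code, as BCC/BCS would."""
    _, carry = lsr(ord(char))
    return "Odd" if carry else "Even"

print(parity_of_key("7"), parity_of_key("4"))  # Odd Even
```

The ASCII digits start at $30, which is even, so the lowest bit of the character code matches the lowest bit of the digit and there's no need to subtract '0' before shifting.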

Unfortunately I ran into some issues with my program's execution so I'm going to have to leave it at that and move on, but the code for what I wrote is as follows:

; ROM Subroutines
define  SCINIT      $ff81 ; initialize/clear screen
define  CHRIN       $ffcf ; input character from keyboard
define  CHROUT      $ffd2 ; output character to screen
define  SCREEN      $ffed ; get screen size
define  PLOT        $fff0 ; get/set cursor coordinates


;CONSTANTS

define  WIDTH       32 ; width  of sprite
define  HEIGHT      8  ; height of sprite


define  INPUT       $10
define  NUM     $00

    ldy #$00

init:   lda msg,y
    beq getnum
    jsr CHROUT
    iny
    bne init

getnum:

    lda #NUM
    sta INPUT

        ldy #$00
        jsr CHRIN

        cmp #$00
        beq getnum

        cmp #$30
        bmi getnum

        cmp #$39
        bpl getnum

        jsr CHROUT
    sta input
    jmp modulomsg

modulomsg:

    lda msg2,y
    beq printinput
    jsr CHROUT
    iny
    bne modulomsg

printinput:

    lda INPUT
    jsr CHROUT
    jmp result

result:     
    lda input
    clc
    jsr chrout
    lsr x
    lda x
    jsr CHROUT
    bcc evendraw
    bcs odddraw


msg:
dcb "E","n","t","e","r",32,"a",32,"n","u","m","b","e","r",":",32,0, 

msg2:
dcb $0d,$0d,$0d,"Y","o","u",32,"e","n","t","e","r","e","d",":",32,0


done:   brk

odddraw:
    lda #$20 ; create a pointer at $17
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $19

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

    lda odddata ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne odddraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq odddraw

evendraw:
    lda #$20 ; create a pointer at $17
    sta $17  ; which points to where
    lda #$02 ; the sprite should be drawn
    sta $18

    lda #$00 ; number of rows we've drawn
    sta $19  ; is stored in $19

    ldx #$00 ; index for data
    ldy #$00 ; index for screen column

    lda evendata ,x
    sta ($17),y
    inx
    iny
    cpy #WIDTH
    bne evendraw
    inc $19     ; increment row counter
    lda #HEIGHT ; are we done yet?
    cmp $19
    beq done    ; ...exit if we are

    lda $17     ; load pointer
    clc
    adc #$20    ; add 32 to drop one row
    sta $17
    lda $18     ; carry to high byte if needed
    adc #$00
    sta $18

    ldy #$00
    beq evendraw

; odd sprite
odddata:               
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,00,18,00,18,00,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 18,00,00,18,00,00,18,00,18,00,00,18,00,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,18,18,00,00,00,18,18,00,00,00,18,18,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

; even sprite
evendata:               
dcb 05,05,05,05,00,00,05,00,00,00,00,05,00,00,05,05,05,05,00,00,05,05,00,00,05,00,00,00,00,00,00,00

dcb 05,00,00,00,00,00,05,00,00,00,00,05,00,00,05,00,00,00,00,00,05,05,00,00,05,00,00,00,00,00,00,00
dcb 05,05,05,05,00,00,00,05,00,00,05,00,00,00,05,05,05,05,00,00,05,05,05,00,05,00,00,00,00,00,00,00
dcb 05,00,00,00,00,00,00,05,00,00,05,00,00,00,05,00,00,00,00,00,05,00,05,00,05,00,00,00,00,00,00,00
dcb 05,00,00,00,00,00,00,00,05,05,00,00,00,00,05,00,00,00,00,00,05,00,00,05,05,00,00,00,00,00,00,00
dcb 05,05,05,05,00,00,00,00,05,05,00,00,00,00,05,05,05,05,00,00,05,00,00,05,05,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
dcb 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

For some reason it stops running after the character input is received, and the subroutines for getting the result of the LSR and printing the appropriate message aren't triggered. That concludes this series on math and strings in assembly on the 6502; hopefully it's been informative.

6502 Math and Strings Part II

gus — Wed, 02 Mar 2022 21:47:08 +0000

This is part 2 in a series on writing a program in assembly for the 6502 processor. You can find part 1 here.

Picking up from where we left off last time, the last requirements to satisfy for this program are that it:

Must accept user input from the keyboard in some form.
Must use some arithmetic/math instructions (to add, subtract, do bitwise operations, or rotate/shift)

Character input can, much like character output, be performed in a variety of ways. Here is one approach in full that does not use the CHRIN ROM routine:

; let the user type on the first page of character screen
; has blinking cursor!
; does not use ROM routines
; backspace works (non-destructive), arrows/ENTER don't

next:     ldx #$00
idle:     inx
          cpx #$10
          bne check
          lda $f000,y
          eor #$80
          sta $f000,y

check:    lda $ff
          beq idle

          ldx #$00
          stx $ff

          cmp #$08 ; bs
          bne print

          lda $f000,y
          and #$7f
          sta $f000,y

          dey
          jmp next

print:    sta $f000,y
          iny
          jmp next

(Many of the code examples I'm using here are from this page if you want to inspect the code in full.)

Otherwise generally one uses the CHRIN ROM routine to get each character and then manipulates them from there. You can see a full example of that here where I've recreated a program from a lecture that allows for text input with a tracked cursor, responsive to backspace and enter characters and stores and prints the user's input.

The final piece to cover before diving into coding our program is math on the 6502. There are two different ways of performing math, binary or decimal, and the choice is made by setting or clearing the decimal flag in the status register. The SED instruction sets it, while the CLD instruction clears it. Decimal mode treats each byte as two decimal digits, with the lower 4 bits representing the lower digit and the upper 4 bits the upper digit. Numbers are treated as positive, and a digit value greater than 9 in either half is invalid.
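As a rough illustration of the decimal-mode byte layout, here is a Python sketch that packs and unpacks two decimal digits per byte. The helper names `to_bcd` and `from_bcd` are made up for this example and only model the encoding, not the processor's ADC/SBC behaviour.

```python
def to_bcd(n):
    """Pack a number 0-99 into one byte: tens in the high 4 bits, ones in the low 4."""
    if not 0 <= n <= 99:
        raise ValueError("one BCD byte holds 0-99")
    return ((n // 10) << 4) | (n % 10)

def from_bcd(byte):
    """Unpack a BCD byte, rejecting digit values above 9."""
    high, low = byte >> 4, byte & 0x0F
    if high > 9 or low > 9:
        raise ValueError("digit values above 9 are not valid BCD")
    return high * 10 + low

print(hex(to_bcd(42)))                    # 0x42
print(hex(to_bcd(from_bcd(0x19) + 1)))    # 0x20 (decimal-style result of $19 + $01)
print(hex(0x19 + 0x01))                   # 0x1a (plain binary result)
```

The last two lines show the practical difference: adding 1 to $19 gives $20 when the byte is read as two decimal digits, but $1A in ordinary binary arithmetic.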

Special care must be taken to clear the carry flag before the low-byte portion of a multi-byte addition, or before a single-byte operation. If a multi-byte addition is performed by adding the low-byte first, the carry flag will correctly carry bits forward from one byte to the next. Subtraction is performed similarly, with the carry flag set before performing subtraction on the lowest byte of a single or multi-byte subtraction, with subtraction then performed on each byte in sequence up to the highest byte.
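Here is a small Python model of a multi-byte addition, written as a sketch under the assumption of binary (non-decimal) mode. The `adc` and `add16` helpers are hypothetical names; the pattern is the CLC / ADC low byte / ADC #$00 high byte sequence used to move a 16-bit pointer.

```python
def adc(a, operand, carry):
    """8-bit add with carry, like ADC in binary mode: returns (result, carry)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8

def add16(low, high, amount):
    """Add an 8-bit amount to a 16-bit value held as two separate bytes."""
    low, carry = adc(low, amount, 0)   # CLC, then ADC on the low byte
    high, _ = adc(high, 0, carry)      # ADC #$00 picks up the carry
    return low, high

low, high = add16(0xE0, 0x02, 0x20)
print(f"${high:02x}{low:02x}")  # $0300
```

Without the carry step the pointer would wrap from $02e0 back to $0200 instead of moving on to $0300.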

Multiplication and division instructions are not available on the 6502, but a logical shift right or an arithmetic shift left effectively divides or multiplies by 2, respectively. Rotations shift the bits in the same way, but they go through the carry flag: the rotate left instruction moves the highest bit into the carry flag and the old carry flag into the lowest bit. The opposite is true of rotate right.
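The way rotations pass a bit through the carry flag is easier to see with concrete values. Below is a minimal Python model of ROL and ROR on 8-bit values; the function names simply mirror the mnemonics and are not part of any library.

```python
def rol(value, carry):
    """Rotate left through carry: returns (result, new_carry)."""
    result = ((value << 1) | carry) & 0xFF   # old carry enters at bit 0
    return result, (value >> 7) & 1          # old bit 7 becomes the carry

def ror(value, carry):
    """Rotate right through carry: returns (result, new_carry)."""
    result = ((value & 0xFF) >> 1) | (carry << 7)   # old carry enters at bit 7
    return result, value & 1                        # old bit 0 becomes the carry

print(rol(0b10000001, 0))  # (2, 1)
print(ror(0b00000001, 1))  # (128, 1)
```

Chaining a shift on the low byte with a rotate on the high byte is how the carry lets a shift span more than 8 bits.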

More bitwise operations can also be found here.

I'll leave it at that for this post. With everything the last two posts have covered, we'll be ready to dive into coding our program in earnest in the next post.

6502 Math and Strings Part I

gus — Fri, 18 Feb 2022 03:15:04 +0000

This week I'm working on another assembly program for the 6502, with the conditions that the program:

Must work in the 6502 Emulator
Must output to the character screen as well as the graphics (bitmapped) screen.
Must accept user input from the keyboard in some form.
Must use some arithmetic/math instructions (to add, subtract, do bitwise operations, or rotate/shift)

Almost all of these are new to me in assembly, with the exception of having printed graphics to the bitmapped display, so this should be fun!

I'll start by breaking down the requirements and getting up to speed on each one so I can then apply them to my program. The first is a given as I'm writing the program directly in the 6502 emulator, so let's dive into the second requirement.

This requirement is twofold - output to the character screen as well as the graphics screen. I've already covered outputting colours to the bitmapped screen here, so the next thing to look into is outputting to the text screen. This can be accomplished in a few ways.

The first is to manually assign a decimal or hexadecimal value representing each ASCII character to an address.

The second is to use the DCB (Declare Constant Byte) mnemonic to define a string and then assign it to memory.

You can also use the CHROUT ROM Routine to output characters to the text screen, which works the same as the last way but without manually assigning and iterating memory locations.
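To make the DCB approach a bit more concrete, here is a Python sketch of copying a zero-terminated string into screen memory one byte at a time. It assumes the character screen starts at $f000, as in the emulator examples, and `write_string` is a hypothetical helper standing in for an LDA / BEQ / STA / INY loop.

```python
SCREEN_BASE = 0xF000  # assumed start of the emulator's character screen

def write_string(memory, data):
    """Copy bytes to the screen until a 0 terminator; returns the count written."""
    y = 0
    while data[y] != 0:                     # BEQ exits on the terminator
        memory[SCREEN_BASE + y] = data[y]   # STA $f000,y
        y += 1                              # INY
    return y

memory = bytearray(0x10000)   # 64 KB address space
text = b"6502\x00"            # stands in for: dcb "6","5","0","2",0
count = write_string(memory, text)
print(count, bytes(memory[SCREEN_BASE:SCREEN_BASE + count]))  # 4 b'6502'
```

CHROUT takes care of the address bookkeeping for you, so with the ROM routine the loop only needs to load each character and call it.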

In my next post I'll go into the last two requirements and start writing my program.