<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tecca Yu</title>
    <description>The latest articles on DEV Community by Tecca Yu (@pykedot).</description>
    <link>https://dev.to/pykedot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F703126%2F06a45f70-ac8b-4fe0-9802-9140dc295163.png</url>
      <title>DEV Community: Tecca Yu</title>
      <link>https://dev.to/pykedot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pykedot"/>
    <language>en</language>
    <item>
      <title>Project stage 3 - Analysis</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Tue, 19 Apr 2022 17:21:42 +0000</pubDate>
      <link>https://dev.to/pykedot/4-19-fic</link>
      <guid>https://dev.to/pykedot/4-19-fic</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hi, this is Tecca, and this post is a summary of the project. For more details, check the previous posts about &lt;a href="https://dev.to/pykedot/111-46d"&gt;stage 1&lt;/a&gt; and &lt;a href="https://dev.to/pykedot/4-6-44e4"&gt;stage 2&lt;/a&gt; of the project.&lt;/p&gt;

&lt;p&gt;In stage 2, I added auto-vectorization to the project. In this post, I will go over some details and check whether there are places where auto-vectorization was not applied, and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Running djpeg before applying auto-vectorization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xk-SPnCQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oawsgix3c00u4oak9138.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xk-SPnCQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oawsgix3c00u4oak9138.png" alt="Image description" width="880" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running djpeg on qemu-aarch64 after applying auto-vectorization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x5glnQ7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y8qa6h9q2zj1farjaj5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x5glnQ7n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y8qa6h9q2zj1farjaj5z.png" alt="Image description" width="880" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Number of whilelo instructions after applying auto-vectorization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BQjGZgKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ox2ogmik2kjpdylp6b31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BQjGZgKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ox2ogmik2kjpdylp6b31.png" alt="Image description" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would say the screenshots above show that auto-vectorization was successfully applied to the project and that the original djpeg executable still works without crashing. But I wonder whether all the necessary locations were auto-vectorized.&lt;/p&gt;

&lt;p&gt;I believe my implementation will actually run slower than the original code when tested on qemu-aarch64, because of the nature of qemu-aarch64: regular code runs at close to full speed, while SVE2 instructions are emulated and run considerably slower.&lt;/p&gt;

&lt;p&gt;Anyway, to get a log of which files were vectorized and which were not, I need to rebuild the project the same way I did in stage 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make -j$((`nproc`+1)) |&amp;amp; tee make.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stores the output of the make process in the make.log file.&lt;/p&gt;
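One detail worth noting: GCC prints its vectorizer notes to stderr, not stdout, so the |&amp; in the command above (bash shorthand for 2>&1 |) is what gets them into the log at all. A tiny stand-alone sketch, using echo stand-ins rather than a real build:

```shell
# Simulate one build step: normal output to stdout, a vectorizer note to stderr.
# '2>&1 |' merges stderr into the pipe, same as bash's '|&' shorthand.
{ echo "CC jdcolor.c"; echo "jdcolor.c:10: note: loop vectorized" 1>&2; } \
  2>&1 | tee demo.log > /dev/null

# Both lines, including the stderr note, landed in the log:
grep "loop vectorized" demo.log
```

With a plain | instead, the note would go to the terminal but never reach make.log.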

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jbw0ZK-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sfavzyqg8bnjvk3hc9dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jbw0ZK-r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sfavzyqg8bnjvk3hc9dd.png" alt="Image description" width="880" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can tell from the log that some loops were definitely "missed" by auto-vectorization.&lt;/p&gt;
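Since the flags included -fopt-info-vec-all, the log mixes "optimized" notes with "missed" ones, and a pair of greps gives a quick tally. The log lines below are a made-up sample just to show the shape of GCC's messages, not output from my build; run the same greps against the real make.log:

```shell
# Made-up sample of GCC -fopt-info-vec-all output (illustrative only)
cat > sample.log <<'EOF'
jccolor.c:120:5: optimized: loop vectorized using 16 byte vectors
jdcolor.c:88:3: missed: couldn't vectorize loop
jdsample.c:45:9: optimized: loop vectorized using variable length vectors
jerror.c:30:2: missed: not vectorized: control flow in loop.
EOF

# Count successes vs. misses
grep -c "loop vectorized" sample.log   # loops the vectorizer handled
grep -c "missed:" sample.log           # loops it gave up on
```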

&lt;h3&gt;
  
  
  Details of the files that were not vectorized
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hBX_FNoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcw9a4htgafam1phogcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hBX_FNoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcw9a4htgafam1phogcx.png" alt="Image description" width="880" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qSYQgOpx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xim1lyvatoar4y3j4szx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qSYQgOpx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xim1lyvatoar4y3j4szx.png" alt="Image description" width="714" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can tell from the screenshot above that far fewer files were vectorized than not.&lt;/p&gt;

&lt;p&gt;My guess is that only files containing loops that process large amounts of data get optimized by auto-vectorization, because loops that touch only small amounts of data are unlikely to benefit from it. It also makes sense that such important loops are far outnumbered by the less important ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Different types of vectorization
&lt;/h2&gt;

&lt;p&gt;Two types of vectors were used: variable-length vectors and fixed-size (specified-byte) vectors.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nYh3p4iV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clnw624mlv9fiqcizvk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nYh3p4iV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/clnw624mlv9fiqcizvk3.png" alt="Image description" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking into the non-vectorized code and trying modifications
&lt;/h2&gt;

&lt;p&gt;I tried modifying the code that was not vectorized to see if I could get the compiler to vectorize it. None of the methods I tried worked; there could be various reasons for that, and they are actually explained in make.log.&lt;/p&gt;
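To pull those reasons out of make.log and see which ones dominate, the "missed" lines can be stripped of their file:line prefix and tallied. The sample lines below are again illustrative, not from my build; point the same pipeline at the real make.log:

```shell
# Made-up "missed" lines in the style of GCC's -fopt-info-vec output
cat > sample.log <<'EOF'
jdhuff.c:210:5: missed: not vectorized: control flow in loop.
jdapimin.c:77:3: missed: not vectorized: complicated access pattern.
jdmarker.c:133:9: missed: not vectorized: control flow in loop.
EOF

# Keep only the reason text, then count each distinct reason
sed -n 's/.*missed: //p' sample.log | sort | uniq -c | sort -rn
```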

&lt;p&gt;I did a bit of research, and in most cases a C/C++ compiler cannot vectorize a for-loop because it cannot match the loop's structure to one of its predefined vectorization templates, for example when one iteration depends on the result of a previous one.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Throughout the three stages, I selected a candidate open-source package for optimization; in stage 3 I tried adding SVE2 support manually, but could not gain more vectorization by modifying the source code. In stage 2, however, I successfully enabled auto-vectorization, presumably at all the necessary locations in the library, by modifying the compiler options.&lt;/p&gt;

</description>
      <category>spo600</category>
    </item>
    <item>
      <title>Project stage 2 - Implementation (part 2)</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Wed, 06 Apr 2022 10:47:18 +0000</pubDate>
      <link>https://dev.to/pykedot/4-6-44e4</link>
      <guid>https://dev.to/pykedot/4-6-44e4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hi, this is Tecca. This post is for stage 2 of the SPO600 project; for more context, please check &lt;a href="https://dev.to/pykedot/4-5-4lc2"&gt;part 1&lt;/a&gt;. In this post, I will be adding SVE2 support to the libjpeg-turbo project via auto-vectorization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Implementing auto-vectorization
&lt;/h2&gt;

&lt;p&gt;Last time, I successfully set up the environment and ran one of the executables (djpeg) with the -fast option on a Unix system.&lt;/p&gt;

&lt;p&gt;This time, I will start by adding the new compiler options to the entire program.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--23HWEnR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2x7np8qnns1npw4pguyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--23HWEnR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2x7np8qnns1npw4pguyx.png" alt="Image description" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default compiler options were set to -O3 -DNDEBUG, as you can see in the image above.&lt;/p&gt;

&lt;p&gt;What I need to do is modify them for the SVE2 implementation. After a bit of research online and through the project directories, I found that the compiler flags and CMAKE_ASM_FLAGS both reside in the CMakeCache.txt file, which is generated after the first run of&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake -G"Unix Makefiles"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//Flags used by the C compiler during all build types.
CMAKE_C_FLAGS:STRING=
...
//Flags used by the C compiler during RELEASE builds.
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
...
//Flags used by the ASM compiler during all build types.
CMAKE_ASM_FLAGS:STRING=
...
//Flags used by the ASM compiler during RELEASE builds.
CMAKE_ASM_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can either modify CMAKE_C_FLAGS_RELEASE:STRING and CMAKE_ASM_FLAGS_RELEASE:STRING in the CMakeCache.txt file manually, or export CFLAGS as an environment variable before running cmake for the first time.&lt;/p&gt;

&lt;p&gt;Modifying CMAKE_C_FLAGS would take effect on all build types, but since we are only working with the release build at this time, it is better to modify only the flags used by the C compiler during RELEASE builds.&lt;/p&gt;
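For the manual route, a sed one-liner can append the flags to both RELEASE entries. The two-line cache file below is a stand-in so the command can be demonstrated end to end; in the real tree you would run the same sed against the generated CMakeCache.txt (the flag values match the CFLAGS exported in the example that follows):

```shell
# Stand-in for the generated CMakeCache.txt (illustrative two-line excerpt)
cat > CMakeCache.txt <<'EOF'
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_ASM_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
EOF

# Append the SVE2 auto-vectorization flags to both RELEASE entries
sed -i \
  -e 's/^\(CMAKE_C_FLAGS_RELEASE:STRING=.*\)/\1 -g -fopt-info-vec-all -march=armv8-a+sve2/' \
  -e 's/^\(CMAKE_ASM_FLAGS_RELEASE:STRING=.*\)/\1 -g -fopt-info-vec-all -march=armv8-a+sve2/' \
  CMakeCache.txt

cat CMakeCache.txt
```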

&lt;p&gt;Example command for exporting CFLAGS as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export CFLAGS="-g -fopt-info-vec-all -march=armv8-a+sve2"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After changing the compiler options to the ones we need, we run cmake again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NNPlxXGS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv5le64niwbscqjfqwny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NNPlxXGS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pv5le64niwbscqjfqwny.png" alt="Image description" width="880" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that both compiler flags have been changed to the ones we want for SVE2.&lt;/p&gt;

&lt;p&gt;Now we &lt;strong&gt;make&lt;/strong&gt; again just like we did in part 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make -j$((`nproc`+1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pa7_Tloq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/203job464wj04xz43g9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pa7_Tloq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/203job464wj04xz43g9e.png" alt="Image description" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The build was successful.&lt;br&gt;
Now we need to check whether whilelo instructions were emitted in all the possible locations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6h2SA5Im--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03cfboxpyae19vzfwmk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6h2SA5Im--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03cfboxpyae19vzfwmk6.png" alt="Image description" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The find and grep fetched 1972 whilelo instructions among all possible locations, which suggests we have built the project correctly. Now let's see if &lt;strong&gt;djpeg&lt;/strong&gt; works as it did in part 1.&lt;/p&gt;
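The exact find-and-grep command isn't shown in the screenshot. One way to get such a count is to disassemble the built objects and grep for the mnemonic; the listing below is a fake three-line disassembly sample (the encodings are invented) so the counting step can be shown concretely:

```shell
# On the real build, something along these lines would produce the listing
# (the objdump invocation here is an assumption, not taken from the post):
#   find . -name '*.o' -exec objdump -d {} \; | grep -c whilelo

# Fake disassembly sample: invented encodings, real SVE mnemonics
cat > disasm.txt <<'EOF'
  4005a0: 25a06fe0  whilelo p0.s, wzr, w1
  4005a4: 04b0e3e2  incw    x2
  4005a8: 25a06c20  whilelo p0.s, w1, w3
EOF
grep -c whilelo disasm.txt
```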

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We can't run it the same way we did in part 1, since the hardware available to us does not support SVE2 instructions. We now have to use &lt;strong&gt;qemu-aarch64&lt;/strong&gt;, which emulates SVE2 instructions for us.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u9g0hy_w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e867ynfl93plvltowtn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u9g0hy_w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e867ynfl93plvltowtn9.png" alt="Image description" width="880" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The testimgint.jpg was successfully decompressed, generating a new decompressed.pgm just like in part 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we've successfully added SVE2 (auto-vectorization) support to the project and run it with qemu-aarch64 without breaking anything.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>stage2</category>
    </item>
    <item>
      <title>Project stage 2 - Implementation (part 1)</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Tue, 05 Apr 2022 15:39:39 +0000</pubDate>
      <link>https://dev.to/pykedot/4-5-4lc2</link>
      <guid>https://dev.to/pykedot/4-5-4lc2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Hi, this is Tecca. This post is for stage 2 of the SPO600 project; for more context, please check the &lt;a href="https://dev.to/pykedot/111-46d"&gt;previous post&lt;/a&gt;. I decided to add SVE2 support via auto-vectorization to the libjpeg-turbo package for this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Setting up environment
&lt;/h2&gt;

&lt;p&gt;The first thing is to set up the package for the Unix environment and compile the project, to make sure everything runs as it is supposed to before implementing auto-vectorization.&lt;/p&gt;

&lt;p&gt;After cloning the repo, let's take a look at all the files first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BcD8xdvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uh9njx7blzs9vfq4d12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BcD8xdvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3uh9njx7blzs9vfq4d12.png" alt="Image description" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;BUILDING.md&lt;/strong&gt; file tells us the procedure we need to follow to build the project on our Unix system. I will build it with the default flags first to see if everything works properly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yy3n7fnQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1i0e8xukt56ondsl8tb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yy3n7fnQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d1i0e8xukt56ondsl8tb.jpg" alt="Image description" width="880" height="452"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cmake -G"Unix Makefiles"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the above building procedure command:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xtim9qmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpnpy9nvpo5zx1c2lpvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xtim9qmb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bpnpy9nvpo5zx1c2lpvw.png" alt="Image description" width="880" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cmpKoTI3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v75iqgv8r0q8z4m3zpne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cmpKoTI3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v75iqgv8r0q8z4m3zpne.png" alt="Image description" width="880" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice the default &lt;strong&gt;Compiler flags&lt;/strong&gt; = -O3 -DNDEBUG; -O3 is GCC's highest standard optimization level and enables all normal optimizations, including the auto-vectorizer.&lt;/p&gt;

&lt;p&gt;Just in case you are curious about -DNDEBUG: it is used to disable assert(). If the macro &lt;strong&gt;NDEBUG&lt;/strong&gt; is defined as a macro name at the point in the source code where &amp;lt;assert.h&amp;gt; or &amp;lt;cassert&amp;gt; is included, then assert does nothing. In general, people use it as a release/debug switch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.cppreference.com/w/cpp/error/assert"&gt;Resource&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's run the command &lt;code&gt;make&lt;/code&gt; to build the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;make -j$((`nproc`+1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-j&lt;/code&gt; tells make how many jobs to run in parallel. &lt;code&gt;nproc&lt;/code&gt; returns the number of cores on the machine; because it is wrapped in backticks, it executes first and its result is substituted into the arithmetic expression that make receives. I used core count + 1, a common recommendation for faster compiles that keeps every core loaded; values between core count + 1 and core count × 2 are typical. So &lt;code&gt;-j$((`nproc`+1))&lt;/code&gt; runs (number of cores on this system + 1) jobs in parallel at a time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://unix.stackexchange.com/questions/208568/how-to-determine-the-maximum-number-to-pass-to-make-j-option"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1WQHaK9H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0matc8iv0u360vt9du4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1WQHaK9H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0matc8iv0u360vt9du4t.png" alt="Image description" width="880" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I'm running on a 16 core system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nGYRzWtQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vekprzaofuwkg4vffgs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nGYRzWtQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vekprzaofuwkg4vffgs6.png" alt="Image description" width="731" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It seems everything built successfully; let's look at the generated executable files before we move on to the next step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s0Gta1TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ca03srvsqvqeq5gqp2kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s0Gta1TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ca03srvsqvqeq5gqp2kj.png" alt="Image description" width="880" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The files in green are the executables, and we can also use the following command to list only the executables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find . -executable -type f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tAFNcr5z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/djghbll6nornlu98dtzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tAFNcr5z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/djghbll6nornlu98dtzk.png" alt="Image description" width="880" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The usage.txt file explains how to use the executables and what each one is for. I will go with the &lt;strong&gt;djpeg&lt;/strong&gt; executable, which decompresses a JPEG file back into a conventional image format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GENERAL USAGE

We provide two programs, cjpeg to compress an image file into JPEG format,
and djpeg to decompress a JPEG file back into a conventional image format.

On most systems, you say:
        cjpeg [switches] [imagefile] &amp;gt;jpegfile
or
        djpeg [switches] [jpegfile]  &amp;gt;imagefile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I will also try the &lt;strong&gt;-fast&lt;/strong&gt; option they mentioned in the usage.txt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; -fast           Select recommended processing options for fast, low
                        quality output.  (The default options are chosen for
                        highest quality output.)  Currently, this is equivalent
                        to "-dct fast -nosmooth -onepass -dither ordered".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I will be using test images in the /testimages directory to test out &lt;strong&gt;djpeg&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RocbkaSR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x0pad2o1ezfyecs3nst9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RocbkaSR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x0pad2o1ezfyecs3nst9.png" alt="Image description" width="802" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./djpeg -fast ./testimages/testimgint.jpg &amp;gt; decompressed.pgm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I9hzNPvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a6ilg2qzgrpal53p3yir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I9hzNPvQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a6ilg2qzgrpal53p3yir.png" alt="Image description" width="880" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got a new decompressed pgm file from the testimgint.jpg file!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EiXpCjJx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwluaeawljf2vidfv6dz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EiXpCjJx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rwluaeawljf2vidfv6dz.png" alt="Image description" width="880" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By comparing the size of the decompressed .pgm file with the original .jpg file, the increase in size tells us that djpeg executed successfully; PGM is an uncompressed format, so the output should always be larger than the JPEG input.&lt;/p&gt;
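For a check that doesn't rely on eyeballing the listing, wc -c prints a file's byte count. The two files created below are stand-ins for the real testimgint.jpg and decompressed.pgm, purely to show the comparison:

```shell
# Stand-in files (illustrative sizes, not the real images)
printf 'jpegdata' > in.jpg          # 8 bytes, plays the compressed input
head -c 4096 /dev/zero > out.pgm    # 4096 bytes, plays the decompressed output

if [ "$(wc -c < out.pgm)" -gt "$(wc -c < in.jpg)" ]; then
  echo "decompressed output is larger than the input, as expected"
fi
```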

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we've successfully set up the environment and run one of the executables (djpeg) with the -fast option on a Unix system. I will move on to implementing auto-vectorization in the upcoming post.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>autovectorization</category>
    </item>
    <item>
      <title>SPO600 Project - Candidate Selection final summary</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Tue, 29 Mar 2022 02:40:34 +0000</pubDate>
      <link>https://dev.to/pykedot/111-46d</link>
      <guid>https://dev.to/pykedot/111-46d</guid>
      <description>&lt;p&gt;Hello, this is Tecca, with regard to previous post, in this &lt;a href="https://dev.to/pykedot/11-24m9"&gt;post&lt;/a&gt; I will do an update on two candidate packages that I was looking into, and summarize with what my approaches are towards optimization in the end.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;FFmpeg&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;FFmpeg is a collection of libraries and tools to process multimedia content such as audio, video, subtitles and related metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="http://ffmpeg.org/"&gt;FFmpeg website&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/FFmpeg/FFmpeg"&gt;FFmpeg Github repo&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rtm - Realtime Math&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;This library is geared towards realtime applications that require their math to be as fast as possible. Much care was taken to maximize inlining opportunities and for code generation to be optimal when a function isn't inlined by passing values in registers whenever possible. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/nfrechette/rtm"&gt;Rtm Github repo&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;libjpeg-turbo&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;libjpeg-turbo is a JPEG image codec that uses SIMD instructions to accelerate baseline JPEG compression and decompression on x86, x86-64, Arm, PowerPC, and MIPS systems, as well as progressive JPEG compression on x86, x86-64, and Arm systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/libjpeg-turbo/libjpeg-turbo"&gt;libjpeg-turbo repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I came across these three packages feeling they could greatly benefit from utilizing SVE2; none of them has SVE/SVE2 optimizations implemented. After a bit of research, I decided to go with libjpeg-turbo for this project. &lt;/p&gt;

&lt;p&gt;This package (libjpeg-turbo) utilizes SIMD operations and already supports several SIMD instruction sets (MMX, SSE2, AVX2, Neon, AltiVec) as well as the ARM64 architecture. Implementing SVE2 for Armv9 seems to be a valid optimization.&lt;/p&gt;

&lt;h1&gt;
  
  
  Strategy - optimization approach
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Three options implementing SVE2 optimizations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;auto-vectorization&lt;/li&gt;
&lt;li&gt;inline assembler&lt;/li&gt;
&lt;li&gt;using SVE2 intrinsics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My plan is to start with something small: within the files that utilize SIMD operations, I will try to apply auto-vectorization.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>projectstage1</category>
    </item>
    <item>
      <title>SPO600 Project - Candidate Selection</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Sat, 26 Mar 2022 07:02:42 +0000</pubDate>
      <link>https://dev.to/pykedot/11-24m9</link>
      <guid>https://dev.to/pykedot/11-24m9</guid>
      <description>&lt;p&gt;Hi this is Tecca, this post is meant to be an update on progress of me finding a candidate package for the SPO600 project stage 1.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stages of the project
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Stage 1: Selection
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Searching for and selecting a candidate package to optimize, and picking an approach to optimize its current code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stage 2: Implementation
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Implementing the optimization approach chosen in stage 1, and testing the program to ensure it uses SVE2 instructions without crashing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stage 3: Upstreaming
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Submitting the changes to the upstream codebase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  What I need
&lt;/h1&gt;

&lt;p&gt;I need to find a package that is open-source and uses a SIMD (Single Instruction, Multiple Data) implementation at the library level, and make sure it doesn't already have SVE or SVE2 optimizations implemented.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>stage1</category>
    </item>
    <item>
      <title>SVE2 (Scalable Vector Extension version 2) - LAB 6 part 2</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Sat, 26 Mar 2022 07:02:25 +0000</pubDate>
      <link>https://dev.to/pykedot/project-stage-1-591l</link>
      <guid>https://dev.to/pykedot/project-stage-1-591l</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Welcome back to part 2 of SVE2 (Scalable Vector Extension version 2). If you are not sure what this post is about, see &lt;a href="https://dev.to/pykedot/lab6-3pdd"&gt;part 1&lt;/a&gt; for a better idea.&lt;/p&gt;

&lt;h1&gt;
  
  
  Source code (vol1.c) for conversion to adapt SVE2
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include "vol.h"

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) &amp;lt;&amp;lt;1) ) &amp;gt;&amp;gt; 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.

        printf("Result: %d\n", ttl);

        return 0;

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can tell, this is vol1 from &lt;a href="https://dev.to/pykedot/lab-5-1omi"&gt;previous post&lt;/a&gt; about algorithm selection.&lt;br&gt;
Note that vol1 utilizes a fixed-point calculation. This avoids the cost of repetitively casting between integer and floating point.&lt;/p&gt;
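A quick self-check (my own sketch) that the fixed-point expression tracks the floating-point version from vol0: the two results differ by at most one unit due to rounding. (Right-shifting a negative value is technically implementation-defined in C, but GCC on AArch64 uses an arithmetic shift, which is what the expression relies on.)

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Fixed-point scaling as in vol1.c */
static int16_t scale_fixed(int16_t sample, int volume) {
    return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) << 1)) >> 16);
}

/* Floating-point scaling as in vol0.c */
static int16_t scale_float(int16_t sample, int volume) {
    return (int16_t)((float)(volume / 100.0) * (float)sample);
}

/* Sweep a sparse sample of the int16_t range, returning the largest
 * disagreement between the two implementations. */
static int max_diff(int volume) {
    int worst = 0;
    for (int s = -32768; s <= 32767; s += 17) {
        int d = abs(scale_fixed((int16_t)s, volume) - scale_float((int16_t)s, volume));
        if (d > worst) worst = d;
    }
    return worst;
}
```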
&lt;h1&gt;
  
  
  Converting
&lt;/h1&gt;
&lt;h2&gt;
  
  
  C Compiler Options
&lt;/h2&gt;

&lt;p&gt;Most compilers do not have a specific target for Armv9 systems. Therefore, to build code that includes SVE2 instructions, we need to instruct the compiler to emit code for an Armv8-A processor that also understands SVE2 instructions; on the GCC compiler, this is done with the -march= option.&lt;/p&gt;

&lt;p&gt;To invoke the auto-vectorizer in GCC 11, we must also compile with -O3 (or enable the corresponding optimization options individually):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcc -O3 -march=armv8-a+sve2 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, we will be working with vol1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcc -o3 -march=armv8-a+sve2 vol1.c vol_createsample.c -o vol1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can execute the program under QEMU user-mode emulation. QEMU traps the SVE2 instructions and emulates them in software, while executing the Armv8-A instructions directly on the hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qemu-aarch64 ./vol1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Result:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p_xbujIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wat8oc4hzcocq8com9we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p_xbujIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wat8oc4hzcocq8com9we.png" alt="Image description" width="730" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Converted code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.arch armv8-a+sve2
        .file   "vol1.c"
        .text
        .align  2
        .p2align 4,,11
        .global scale_sample
        .type   scale_sample, %function
scale_sample:
.LFB24:
        .cfi_startproc
        lsl     w2, w1, 15
        mov     w3, 34079
        sub     w1, w2, w1
        movk    w3, 0x51eb, lsl 16
        sxth    w0, w0
        smull   x3, w1, w3
        asr     x3, x3, 37
        sub     w1, w3, w1, asr 31
        lsl     w1, w1, 1
        mul     w0, w1, w0
        lsr     w0, w0, 16
        ret
        .cfi_endproc
.LFE24:
        .size   scale_sample, .-scale_sample
        .section        .rodata.str1.8,"aMS",@progbits,1
        .align  3
.LC0:
        .string "Total Time: %2.9f\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Understanding converted code
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Generated scale_sample instructions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; .cfi_startproc
        lsl     w2, w1, 15
        mov     w3, 34079
        sub     w1, w2, w1
        movk    w3, 0x51eb, lsl 16
        sxth    w0, w0
        smull   x3, w1, w3
        asr     x3, x3, 37
        sub     w1, w3, w1, asr 31
        lsl     w1, w1, 1
        mul     w0, w1, w0
        lsr     w0, w0, 16
        ret
        .cfi_endproc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  corresponding C code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) &amp;lt;&amp;lt;1) ) &amp;gt;&amp;gt; 16);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;‘lsl w2, w1, 15’ and ‘sub w1, w2, w1’ compute volume * 32767 as (volume &amp;lt;&amp;lt; 15) - volume.&lt;/li&gt;
&lt;li&gt;‘mov w3, 34079’ and ‘movk w3, 0x51eb, lsl 16’ build the 32-bit magic constant 0x51EB851F, with movk inserting 0x51eb into the upper 16 bits.&lt;/li&gt;
&lt;li&gt;‘sxth w0, w0’ sign-extends the 16-bit sample in w0 to 32 bits.&lt;/li&gt;
&lt;li&gt;‘smull x3, w1, w3’, ‘asr x3, x3, 37’, and ‘sub w1, w3, w1, asr 31’ divide volume * 32767 by 100, using a multiply by the magic constant followed by shifts instead of a slow division.&lt;/li&gt;
&lt;li&gt;‘lsl w1, w1, 1’ is the shift left by one bit (&amp;lt;&amp;lt;1) in the C expression.&lt;/li&gt;
&lt;li&gt;‘mul w0, w1, w0’ multiplies the scale factor by the sample as 32-bit integers.&lt;/li&gt;
&lt;li&gt;‘lsr w0, w0, 16’ shifts the final result right by 16 bits (&amp;gt;&amp;gt;16).&lt;/li&gt;
&lt;/ul&gt;
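The mov/movk/smull/asr/sub sequence is a standard strength reduction: GCC replaces the division by 100 with a multiplication by the magic constant 0x51EB851F (which is ceil(2^37 / 100)) followed by shifts. A small sketch of the trick (my own verification, mirroring the instruction sequence in C; arithmetic right shift of negatives is assumed, as on AArch64):

```c
#include <assert.h>
#include <stdint.h>

/* Divide by 100 without a division instruction, as GCC emitted:
 * multiply by ceil(2^37 / 100) = 0x51EB851F, arithmetic-shift right
 * by 37, then correct the rounding for negative inputs. */
int32_t div100(int32_t n)
{
    int64_t q = ((int64_t)n * 0x51EB851FLL) >> 37;  /* smull + asr */
    return (int32_t)(q - (n >> 31));                /* sub w1, w3, w1, asr 31 */
}
```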

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We're done experimenting with SVE2 instructions on the volume-adjusting algorithm (vol1). Since SVE2 is very new and there is practically no hardware available for it yet, we have to use an emulator to run the program, and I wasn't able to find a way to measure the real performance of the SVE2 assembly code. &lt;br&gt;
The most challenging part of this lab was relating the instructions in the generated assembly back to the code in the original C file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source:&lt;/strong&gt; &lt;a href="https://wiki.cdot.senecacollege.ca/wiki/SVE2"&gt;SVE2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab6</category>
    </item>
    <item>
      <title>SVE2 (Scalable Vector Extension version 2) - LAB 6 part 1</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Tue, 22 Mar 2022 07:52:11 +0000</pubDate>
      <link>https://dev.to/pykedot/lab6-3pdd</link>
      <guid>https://dev.to/pykedot/lab6-3pdd</guid>
      <description>&lt;p&gt;Hi, this is Tecca. In this post I will be going over SVE2 which stands for (Scalable Vector Extension version 2) &lt;/p&gt;

&lt;p&gt;SVE2 is a SIMD (Single Instruction, Multiple Data) instruction set that is only available on the AArch64 architecture. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://wiki.cdot.senecacollege.ca/wiki/SVE2"&gt;SVE2&lt;/a&gt; enables vectorization of loops for High Performance Computing, which basically means making programs run faster!&lt;/p&gt;

&lt;p&gt;Applying SVE enables software that processes large amounts of data to be ported more easily (for example, the volume scaling lab). The greatest advantage of SVE is the ability to write and build software only once, then run those binaries on different AArch64 hardware. &lt;/p&gt;

&lt;p&gt;The main difference between SVE2 and SVE is the functional coverage of the instruction set. SVE was designed for HPC and ML applications. SVE2 extends the SVE instruction set to enable data-processing domains beyond HPC and ML. The SVE2 instruction set can also accelerate the common algorithms that are used in the following applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Computer vision&lt;/li&gt;
&lt;li&gt;Multimedia&lt;/li&gt;
&lt;li&gt;Long-Term Evolution (LTE) baseband processing&lt;/li&gt;
&lt;li&gt;Genomics&lt;/li&gt;
&lt;li&gt;In-memory database&lt;/li&gt;
&lt;li&gt;Web serving&lt;/li&gt;
&lt;li&gt;General-purpose software&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source from: &lt;a href="https://developer.arm.com/documentation/102340/0001/Introducing-SVE2?lang=en"&gt;Introducing SVE2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal&lt;/strong&gt; of this experiment is to convert our code from the &lt;a href="https://dev.to/pykedot/lab-5-1omi"&gt;previous post&lt;/a&gt; to adapt it to SVE2.&lt;/p&gt;

&lt;p&gt;In the next post, I will be applying the conversion and verify the result.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab6</category>
      <category>sve2</category>
    </item>
    <item>
      <title>Algorithm Selection LAB 5 in x86_64 and AArch64 - part 2</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Mon, 14 Mar 2022 17:09:21 +0000</pubDate>
      <link>https://dev.to/pykedot/1-19nd</link>
      <guid>https://dev.to/pykedot/1-19nd</guid>
      <description>&lt;p&gt;Hi this is Tecca, this post will be a continuation of last &lt;a href="https://dev.to/pykedot/lab-5-1omi"&gt;post&lt;/a&gt; for comparing the relative performance of various algorithms on a same machine across several implementations of AArch64 and x86_64 systems.&lt;/p&gt;

&lt;p&gt;In this post I will be demonstrating the result of the performance of each algorithms mentions in last &lt;a href="https://dev.to/pykedot/lab-5-1omi"&gt;post&lt;/a&gt;. And going over some questions relate to the algorithms.&lt;/p&gt;

&lt;p&gt;We want to measure the performance of each algorithm specifically, with nothing else in the way, in order to get a correct measurement of the time elapsed performing the algorithm.&lt;/p&gt;

&lt;p&gt;In order to do that we need to make some modification to all the files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;time.h&amp;gt;

        clock_t         start_t, end_t;

        start_t = clock();

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

        end_t = clock();
        printf("Time elapsed: %f\n", ((double)(end_t - start_t))/CLOCKS_PER_SEC);

all other code goes below...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key point is to wrap the for loop which performs the scaling of the samples to get the most accurate result of the algorithm when it runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declaration for testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each file was compiled with the gcc compiler with 
&lt;code&gt;gcc -g -O2 -fopt-info-vec-all -fno-lto -fno-tree-vectorize -fno-tree-loop-vectorize&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;SAMPLES was defined as 1,500,000,000 for each algorithm tested.&lt;/li&gt;
&lt;li&gt;Each algorithm/file was tested 15 times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example running vol0 in x86_64 systems&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2NPOygyb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c9lwaq2sov4209ygoqc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2NPOygyb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c9lwaq2sov4209ygoqc4.png" alt="Image description" width="554" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarking Result running in x86_64 systems&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;vol0&lt;/th&gt;
&lt;th&gt;vol1&lt;/th&gt;
&lt;th&gt;vol2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2.696961&lt;/td&gt;
&lt;td&gt;2.655414&lt;/td&gt;
&lt;td&gt;3.370985&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2.666739&lt;/td&gt;
&lt;td&gt;2.649839&lt;/td&gt;
&lt;td&gt;3.302379&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2.671609&lt;/td&gt;
&lt;td&gt;2.648382&lt;/td&gt;
&lt;td&gt;3.30752&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2.664414&lt;/td&gt;
&lt;td&gt;2.653646&lt;/td&gt;
&lt;td&gt;3.327133&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2.660572&lt;/td&gt;
&lt;td&gt;2.679785&lt;/td&gt;
&lt;td&gt;3.285885&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;2.671767&lt;/td&gt;
&lt;td&gt;2.639657&lt;/td&gt;
&lt;td&gt;3.302924&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;2.685051&lt;/td&gt;
&lt;td&gt;2.633608&lt;/td&gt;
&lt;td&gt;3.396682&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2.674105&lt;/td&gt;
&lt;td&gt;2.663188&lt;/td&gt;
&lt;td&gt;3.352611&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;2.675021&lt;/td&gt;
&lt;td&gt;2.663653&lt;/td&gt;
&lt;td&gt;3.331424&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2.677584&lt;/td&gt;
&lt;td&gt;2.651484&lt;/td&gt;
&lt;td&gt;3.291357&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;2.655889&lt;/td&gt;
&lt;td&gt;2.653947&lt;/td&gt;
&lt;td&gt;3.314665&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;2.659099&lt;/td&gt;
&lt;td&gt;2.634871&lt;/td&gt;
&lt;td&gt;3.287128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;2.591898&lt;/td&gt;
&lt;td&gt;2.66246&lt;/td&gt;
&lt;td&gt;3.283349&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;2.645027&lt;/td&gt;
&lt;td&gt;2.662764&lt;/td&gt;
&lt;td&gt;3.327623&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;2.582574&lt;/td&gt;
&lt;td&gt;2.662238&lt;/td&gt;
&lt;td&gt;3.298565&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;average&lt;/td&gt;
&lt;td&gt;2.658554&lt;/td&gt;
&lt;td&gt;2.654329067&lt;/td&gt;
&lt;td&gt;3.318682&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;median&lt;/td&gt;
&lt;td&gt;2.666739&lt;/td&gt;
&lt;td&gt;2.653947&lt;/td&gt;
&lt;td&gt;3.30752&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On the x86_64 system, the vol1 algorithm is the fastest, vol0 comes second, and vol2 is the slowest of the three. This matches my expectation from the previous post: vol2 pre-calculates all results and then looks up an answer for each input value, which costs more than the other approaches, while vol1 beats vol0 slightly by using a fixed-point calculation, which avoids the cost of repeatedly casting between integer and floating point.&lt;/p&gt;

&lt;p&gt;We test the vol4 and vol5 algorithms only on the AArch64 system, because the SIMD instructions they use (inline assembly and compiler intrinsics) are specific to AArch64.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmarking result running in AArch64 systems&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;vol0&lt;/th&gt;
&lt;th&gt;vol1&lt;/th&gt;
&lt;th&gt;vol2&lt;/th&gt;
&lt;th&gt;vol4&lt;/th&gt;
&lt;th&gt;vol5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4.980222&lt;/td&gt;
&lt;td&gt;4.313869&lt;/td&gt;
&lt;td&gt;10.567804&lt;/td&gt;
&lt;td&gt;3.481179&lt;/td&gt;
&lt;td&gt;4.04965&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4.968114&lt;/td&gt;
&lt;td&gt;4.323427&lt;/td&gt;
&lt;td&gt;10.560218&lt;/td&gt;
&lt;td&gt;3.456832&lt;/td&gt;
&lt;td&gt;4.045124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4.959307&lt;/td&gt;
&lt;td&gt;4.30239&lt;/td&gt;
&lt;td&gt;10.59247&lt;/td&gt;
&lt;td&gt;3.460554&lt;/td&gt;
&lt;td&gt;4.045296&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4.973315&lt;/td&gt;
&lt;td&gt;4.325246&lt;/td&gt;
&lt;td&gt;10.590718&lt;/td&gt;
&lt;td&gt;3.469896&lt;/td&gt;
&lt;td&gt;4.028074&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4.961954&lt;/td&gt;
&lt;td&gt;4.31275&lt;/td&gt;
&lt;td&gt;10.585899&lt;/td&gt;
&lt;td&gt;3.493608&lt;/td&gt;
&lt;td&gt;4.039965&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;4.976137&lt;/td&gt;
&lt;td&gt;4.320063&lt;/td&gt;
&lt;td&gt;10.532851&lt;/td&gt;
&lt;td&gt;3.438831&lt;/td&gt;
&lt;td&gt;4.040304&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;4.959057&lt;/td&gt;
&lt;td&gt;4.349743&lt;/td&gt;
&lt;td&gt;10.642468&lt;/td&gt;
&lt;td&gt;3.482111&lt;/td&gt;
&lt;td&gt;4.055758&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;4.960718&lt;/td&gt;
&lt;td&gt;4.317437&lt;/td&gt;
&lt;td&gt;10.534451&lt;/td&gt;
&lt;td&gt;3.469382&lt;/td&gt;
&lt;td&gt;4.150955&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;4.95485&lt;/td&gt;
&lt;td&gt;4.336698&lt;/td&gt;
&lt;td&gt;10.548297&lt;/td&gt;
&lt;td&gt;3.451517&lt;/td&gt;
&lt;td&gt;4.085376&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4.960642&lt;/td&gt;
&lt;td&gt;4.329521&lt;/td&gt;
&lt;td&gt;10.552774&lt;/td&gt;
&lt;td&gt;3.455712&lt;/td&gt;
&lt;td&gt;4.022335&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;4.952321&lt;/td&gt;
&lt;td&gt;4.332141&lt;/td&gt;
&lt;td&gt;10.550225&lt;/td&gt;
&lt;td&gt;3.384459&lt;/td&gt;
&lt;td&gt;4.041784&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;4.990002&lt;/td&gt;
&lt;td&gt;4.334904&lt;/td&gt;
&lt;td&gt;10.572559&lt;/td&gt;
&lt;td&gt;3.423148&lt;/td&gt;
&lt;td&gt;4.058293&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;4.942643&lt;/td&gt;
&lt;td&gt;4.326342&lt;/td&gt;
&lt;td&gt;10.545258&lt;/td&gt;
&lt;td&gt;3.420771&lt;/td&gt;
&lt;td&gt;4.096189&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;4.957297&lt;/td&gt;
&lt;td&gt;4.317604&lt;/td&gt;
&lt;td&gt;10.548684&lt;/td&gt;
&lt;td&gt;3.391878&lt;/td&gt;
&lt;td&gt;4.020974&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;4.967439&lt;/td&gt;
&lt;td&gt;4.317819&lt;/td&gt;
&lt;td&gt;10.558838&lt;/td&gt;
&lt;td&gt;3.433111&lt;/td&gt;
&lt;td&gt;4.062904&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;average&lt;/td&gt;
&lt;td&gt;4.964267867&lt;/td&gt;
&lt;td&gt;4.323996933&lt;/td&gt;
&lt;td&gt;10.5655676&lt;/td&gt;
&lt;td&gt;3.4475326&lt;/td&gt;
&lt;td&gt;4.056198733&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;median&lt;/td&gt;
&lt;td&gt;4.960718&lt;/td&gt;
&lt;td&gt;4.323427&lt;/td&gt;
&lt;td&gt;10.558838&lt;/td&gt;
&lt;td&gt;3.455712&lt;/td&gt;
&lt;td&gt;4.045296&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Looking at the results from running vol0, vol1, vol2, vol4, vol5.&lt;/p&gt;

&lt;p&gt;In the previous post, we assumed the algorithms that use SIMD instructions (vol4 and vol5) would perform faster than the others. We can observe that vol4 and vol5 definitely outperform the rest. Their performance comes very close to each other; I believe this is because the inline assembly and compiler intrinsic versions compile to almost equally fast code.&lt;/p&gt;

&lt;p&gt;vol1 still performs better than vol0 on AArch64 as well. &lt;/p&gt;

&lt;p&gt;The vol2 algorithm became significantly slower than the other algorithms. This result suggests that the CPU is slow at reading the pre-calculated values from memory on this system.&lt;/p&gt;

&lt;h1&gt;
  
  
  QUESTIONS SPECIFIC TO ALGORITHMS
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Q: Why is this code needed?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

 printf("Result: %d\n", ttl);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code sums the output samples (modulo 1000) and prints the result. Because the out[] array is actually consumed, the compiler cannot optimize the scaling loop away, and the printed value can be compared across implementations to check that each algorithm produces consistent results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q: What does this next block do? Why?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ifeq ($(shell echo | gcc -E -dM - | grep -c aarch64),1)
        BINARIES:=vol0 vol1 vol2 vol3 vol4 vol5
else
        BINARIES:=vol0 vol1 vol2 vol3
endif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This block in the Makefile detects which architecture the user is building on and sets “BINARIES” accordingly.&lt;br&gt;
Since vol4.c and vol5.c are designed to run only on AArch64, it is important to exclude them from BINARIES on other systems, to prevent them from being built and causing errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q: What is the purpose of the cast to uint16_t in the next line?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cast maps the signed loop values (-32768 to 32767) onto non-negative array indices (0 to 65535): negative values wrap around into the upper half of the range. This lets every possible int16_t sample value be used directly as an index into the precalc lookup table.&lt;/p&gt;
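The wrap-around behaviour is easy to check directly (a small sketch; `sample_index` is my own helper name, not part of the lab code):

```c
#include <assert.h>
#include <stdint.h>

/* Converting int16_t to uint16_t adds 65536 to negative values,
 * mapping -32768..-1 onto 32768..65535 — valid table indices. */
uint16_t sample_index(int16_t s)
{
    return (uint16_t)s;
}
```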

&lt;h2&gt;
  
  
  Q: What's the point of this dummy program? How does it help with benchmarking?
&lt;/h2&gt;

&lt;p&gt;The dummy program (vol3.c) does everything except the real scaling: it creates the samples and sums the results. This makes it a baseline for the processing time of everything that is not the algorithm, against which the files containing the different algorithms can be compared.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q: what is the purpose of these next two lines?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These lines initialize the two int16_t cursor pointers to the start of the in and out arrays, and set limit to the end of the input array so the loop knows where to stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q: Are the results usable? Are they accurate?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;printf("Result: %d\n", ttl);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results for vol5 and vol4 are the same, but when comparing them to the naïve implementation the results vary by a lot, so I'm not too sure about the accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q: Why is the increment below 8 instead of 16 or some other value?
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Q: Why is this line not needed in the inline assembler version of this program?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; in_cursor += 8;
 out_cursor += 8;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The in_cursor and out_cursor are incremented by 8 because each iteration processes eight samples at once: a 128-bit SIMD register holds eight 16-bit values. &lt;/p&gt;

&lt;p&gt;In the inline assembler version, the load and store instructions use post-index addressing, which advances the cursor registers automatically, so no separate increment is needed. In the C version, we have to advance the cursors explicitly.&lt;/p&gt;
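In scalar C terms (my own sketch, not the lab's vol4/vol5 code), each SIMD iteration covers eight samples because a 128-bit register holds 128 / 16 = 8 int16_t lanes:

```c
#include <assert.h>
#include <stdint.h>

#define LANES (128 / (8 * (int)sizeof(int16_t)))  /* = 8 int16_t lanes */

/* Scalar model of the vectorized loop: each pass handles LANES samples,
 * which is why the cursors advance by 8, not by 1 or 16. */
void halve_samples(const int16_t *in, int16_t *out, int n)
{
    for (int i = 0; i < n; i += LANES) {          /* cursor += 8 per pass */
        for (int lane = 0; lane < LANES && i + lane < n; lane++) {
            out[i + lane] = (int16_t)(in[i + lane] / 2);
        }
    }
}
```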

&lt;h2&gt;
  
  
  Q: what does it mean to "duplicate" values in the next line?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dup&lt;/code&gt; duplicates the scaling factor vol_int into all eight 16-bit lanes of the 128-bit register v1, so a single vector multiply can scale eight samples at once.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In the end, the performance of each algorithm was measured to test the assumptions made in the part-1 post of this lab. &lt;br&gt;
As expected, the algorithms that use SIMD instructions (vol4 and vol5 in this case) outperform the others, as they are best at processing multiple data elements simultaneously.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab5</category>
    </item>
    <item>
      <title>Algorithm Selection LAB 5 in x86_64 and AArch64 - part 1</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Fri, 11 Mar 2022 01:35:51 +0000</pubDate>
      <link>https://dev.to/pykedot/lab-5-1omi</link>
      <guid>https://dev.to/pykedot/lab-5-1omi</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hi, this is Tecca. In this post I will be comparing the relative performance of several implementations of a volume-scaling algorithm on the same machines, on both AArch64 and x86_64 systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;vol.h&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume

/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol.h defines the number of samples to be processed (16) and the volume level to be used (50%).&lt;/p&gt;

&lt;p&gt;For benchmarking, a much larger number of samples seems reasonable, because it makes the performance differences between the algorithms much easier to analyze. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol0.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include "vol.h"

int16_t scale_sample(int16_t sample, int volume) {
        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);


// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples
        printf("Result: %d\n", ttl);
        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In vol0.c, audio samples are multiplied by the volume scaling factor, casting between signed 16-bit integers and floating-point values. These repeated casts take up a lot of &lt;em&gt;resources&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol1.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include "vol.h"

int16_t scale_sample(int16_t sample, int volume) {
        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) &amp;lt;&amp;lt;1) ) &amp;gt;&amp;gt; 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples
        printf("Result: %d\n", ttl);
        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol1.c utilizes a fixed-point calculation. This avoids the cost of repetitively casting between integer and floating point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol2.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include "vol.h"

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x &amp;lt;= 32767; x++) {
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples
        printf("Result: %d\n", ttl);
        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike vol0.c and vol1.c, vol2.c pre-calculates all 65536 results, looking up the answer for each input value afterward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol3.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdbool.h&amp;gt;
#include "vol.h"

int16_t scale_sample(int16_t sample, int volume) {
        return (int16_t) 100;
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples
        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples
        printf("Result: %d\n", ttl);
        return 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol3.c returns a fixed value for every sample; the purpose of this program is to serve as a baseline (a do-nothing scaling function) against which the other volume-scaling algorithms can be compared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol4.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include "vol.h"

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else
        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- Scale the samples from in[], placing results in out[]
        // set vol_int to fixed-point representation of the volume factor
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor &amp;lt; limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register

                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // using a 32-bit signed intermediate,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        printf("Result: %d\n", ttl);
        return 0;

#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol4.c uses SIMD (Single Instruction, Multiple Data) instructions accessed through inline assembly; these particular instructions are only available on the AArch64 architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol5.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#ifdef __aarch64__
#include &amp;lt;arm_neon.h&amp;gt;
#endif
#include "vol.h"

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);


// ---- Scale the samples from in[], placing results in out[]
        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor &amp;lt; limit ) {
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));


                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x &amp;lt; SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }


        printf("Result: %d\n", ttl);
        return 0;
#endif
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol5.c, like vol4.c, also utilizes SIMD instructions, but through compiler intrinsics rather than hand-written inline assembly. vol5.c is likewise specific to AArch64, because the intrinsics map to instructions unique to that architecture.&lt;/p&gt;

&lt;p&gt;The code above, vol0.c through vol5.c, implements the various algorithms whose performance will be tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vol_createsample.c&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;
#include "vol.h"

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i&amp;lt;sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;vol_createsample.c contains the function vol_createsample(int16_t* sample, int32_t sample_count), which is used to create dummy samples for the algorithms to run against.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Conclusion&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In this post we've examined several algorithms for adjusting the volume of sound samples, each taking a different approach to the same goal. In the next post, we will benchmark each program to see whether its performance matches our expectations.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab5</category>
    </item>
    <item>
      <title>LAB4 - Continued</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Mon, 28 Feb 2022 21:47:48 +0000</pubDate>
      <link>https://dev.to/pykedot/lab4-continued-311i</link>
      <guid>https://dev.to/pykedot/lab4-continued-311i</guid>
      <description>&lt;p&gt;In the previous &lt;a href="https://dev.to/pykedot/lab4-1o2i"&gt;post&lt;/a&gt;, the code was able to loop and print from 0 to 9 onto the screen with both Aarch64 and x84_64.&lt;/p&gt;

&lt;p&gt;In order to print numbers beyond that, meaning two-digit numbers, the code needs to be modified.&lt;/p&gt;

&lt;p&gt;Expected result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 00
Loop: 01
Loop: 02
Loop: 03
Loop: 04
...
Loop: 30
etc.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strategy is to divide the number by 10 and store both the quotient and the remainder. Displaying the quotient and remainder side by side on the screen lets us show two-digit numbers. &lt;/p&gt;

&lt;p&gt;Partial code in x86_64&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loop:
        movq     $0, %rdx                        /* clear remainder for division */
        movq     %r15, %rax                      /* set rax as the dividend */
        movq     $10, %r10                       /* set divisor 10 */
        div      %r10                            /* perform division */
        movq     %rax, %r14                      /* store quotient to the register*/
        movq     %rdx, %r13                      /* store remainder to the register*/

        add      $0x30, %r14                     /* converting quotient to ASCII */
        add      $0x30, %r13                     /* converting remainder to ASCII */
        mov      %r13b, msg+8                    /* put the remainder one byte after the quotient in msg */
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above divides the number by 10, stores the quotient and remainder in registers, converts both to ASCII, and writes them into their positions in msg.&lt;br&gt;
Complete source code in x86_64:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl  _start


_start:

        movq     $0, %r15                        /* Loop counter */
        movq     $0x30, %r12                     /* value of 0 in Ascii */

loop:
        movq     $0, %rdx                        /* clear remainder for division */
        movq     %r15, %rax                      /* set rax as the dividend */
        movq     $10, %r10                       /* set divisor 10 */
        div      %r10                            /* perform division */
        movq     %rax, %r14                      /* store quotient to the register*/
        movq     %rdx, %r13                      /* store remainder to the register*/

        add      $0x30, %r14                     /* converting quotient to ascii */
        add      $0x30, %r13                     /* converting remainder to ascii */
        mov      %r13b, msg+8                    /* Modify 1 byte in msg with remainder */

        mov      %r14b, msg+7                    /* Modify 1 byte in msg with quotient */

        movq     $len, %rdx                      /* message length */
        movq     $msg, %rsi                      /* message location */
        movq     $1, %rdi                                /* file descriptor stdout */
        movq     $1, %rax                                /* syscall sys_write */
        syscall

        inc     %r15                            /* increment counter */
        cmp     $31, %r15                               /* see if we're done */
        jne     loop                            /* if not, loop */

        movq     $0, %rdi                                /* exit status */
        movq     $60, %rax                       /* syscall sys_exit */
        syscall

.section .data

        msg:    .ascii   "Loop :   \n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complete Source Code in AArch64:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start


_start:

        mov     x19, 0
        mov     x17, 10

loop:
        mov     x0, 1           /* file descriptor: 1 is stdout */
        adr     x1, msg         /* message location (memory address) */
        mov     x2, len         /* message length (bytes) */

        mov     x18, x19        
        udiv    x9, x18, x17
        add     x13, x9, 0x30
        msub    x10, x9, x17, x18   
        add     x14, x10, 0x30
        adr     x15, msg
        strb    w13, [x15, 7]

        strb    w14, [x15, 8]
        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        add     x19, x19, 1
        cmp     x19, 31
        b.ne    loop

.data

msg:    .ascii      "Loop :   \n"
len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QIUwauqR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gwe74vvmydv56dd1lv59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QIUwauqR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gwe74vvmydv56dd1lv59.png" alt="Image description" width="457" height="613"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
Personally, the biggest challenge I found in this lab was figuring out how to perform the division, store the quotient and remainder from it in two registers, and place them into msg.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>assemblylanguage</category>
      <category>lab</category>
    </item>
    <item>
      <title>LAB3 - continued</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Mon, 28 Feb 2022 01:31:44 +0000</pubDate>
      <link>https://dev.to/pykedot/lab3-continued-18m6</link>
      <guid>https://dev.to/pykedot/lab3-continued-18m6</guid>
      <description>&lt;p&gt;In the previous &lt;a href="https://dev.to/joycew414/spo600-lab3-6502-assembly-language-math-lab-1iba"&gt;post&lt;/a&gt;, a drawing program was successfully run on 6502 Emulator, now I want to create a kaleidoscope drawing program.&lt;/p&gt;

&lt;p&gt;My plan is to allow the user to draw only in the second quadrant (the top-left quadrant of the 6502 display) and to reflect whatever the user draws there into the other three quadrants, creating a kaleidoscope effect.&lt;/p&gt;

&lt;p&gt;A couple more things need to be done in order to achieve this.&lt;/p&gt;

&lt;p&gt;First, the cursor's movement needs to be restricted so the user can only draw in the second quadrant; in drawing_cursor, masking ROW and COL into the range [$00 : $0F] makes this happen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drawing_cursor:
    lda ROW     ; ensure ROW is in range 0:0F
    and #$0f
    sta ROW

    lda COL     ; ensure COL is in range 0:0F
    and #$0f
    sta COL

    ldy ROW     ; load POINTER with start-of-row
    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H

    ldy COL     ; store CURSOR at POINTER plus COL
    lda #CURSOR
    sta (POINTER),y

    rts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then whatever pixels are drawn in the second quadrant should be reflected into the other three quadrants. This is done by saving a copy of the current position and applying the x or y offset each time the user moves the cursor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;drawing_other_quads:     
    LDA POINTER ;; save the pointer to the
    PHA     ;; original location in top_left_quad
    LDA POINTER_H
    PHA

; top right quadrant
    LDA #$10
    CLC
    SBC COL
    CLC
    ADC #$10
    TAY
    LDA DOT
    STA (POINTER),y

    TYA
    PHA     ; save the y offset

; bottom left quadrant  
    lda #$10    ; load POINTER with start-of-row
    CLC
    SBC ROW
    CLC
    ADC #$10
    TAY

    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H

    ldy COL     ; store CURSOR at POINTER plus COL
    lda DOT
    sta (POINTER),y

    PLA
    TAY

; bottom right quadrant 
    lda DOT
    sta (POINTER),y

    PLA
    STA POINTER_H
    PLA
    STA POINTER

    RTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complete source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;; zero-page variable locations
 define ROW     $20 ; current row
 define COL     $21 ; current column
 define POINTER     $10 ; ptr: start of row
 define POINTER_H   $11

 ; constants
 define DOT     $01 ; dot colour location
 define CURSOR      $04 ; purple colour


setup:  lda #$0f    ; set initial ROW,COL
    sta ROW
    sta COL
    LDA #$01
    STA DOT

draw:   jsr drawing_cursor

getkey:
    ldx #$00    ; clear out the key buffer
    lda $ff     ; get a keystroke
    stx $ff

    cmp #$30
    bmi getkey
    cmp #$40
    bpl continue

    SEC
    sbc #$30
    tay
    lda color_pallete, y
    sta DOT
    jmp done

continue:   
    cmp #$43    ; handle C or c
    beq clear
    cmp #$63
    beq clear

    cmp #$80    ; if not a cursor key, ignore
    bmi getkey
    cmp #$84
    bpl getkey

    pha     ; save A

    lda DOT ; set current position to DOT
    sta (POINTER),y
    jsr drawing_other_quads

    pla     ; restore A

    cmp #$80    ; check key == up
    bne check1

    dec ROW     ; ... if yes, decrement ROW
    jmp done

 check1:    
    cmp #$81    ; check key == right
    bne check2

    inc COL     ; ... if yes, increment COL
    jmp done

 check2:    
    cmp #$82    ; check if key == down
    bne check3

    inc ROW     ; ... if yes, increment ROW
    jmp done

 check3:    
    cmp #$83    ; check if key == left
    bne done

    dec COL     ; ... if yes, decrement COL
    clc
    bcc done

 clear: 
    lda table_low   ; clear the screen
    sta POINTER
    lda table_high
    sta POINTER_H

    ldy #$00
    tya

 c_loop:    
    sta (POINTER),y
    iny
    bne c_loop

    inc POINTER_H
    ldx POINTER_H
    cpx #$06
    bne c_loop

 done:  
    clc     ; repeat
    bcc draw


 drawing_cursor:
    lda ROW     ; ensure ROW is in range 0:0F
    and #$0f
    sta ROW

    lda COL     ; ensure COL is in range 0:0F
    and #$0f
    sta COL

    ldy ROW     ; load POINTER with start-of-row
    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H

    ldy COL     ; store CURSOR at POINTER plus COL
    lda #CURSOR
    sta (POINTER),y

    rts

 drawing_other_quads:     
    LDA POINTER ;; save the pointer to the
    PHA     ;; original location in top_left_quad
    LDA POINTER_H
    PHA

; top right quadrant
    LDA #$10
    CLC
    SBC COL
    CLC
    ADC #$10
    TAY
    LDA DOT
    STA (POINTER),y

    TYA
    PHA     ; save the y offset

; bottom left quadrant  
    lda #$10    ; load POINTER with start-of-row
    CLC
    SBC ROW
    CLC
    ADC #$10
    TAY

    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H

    ldy COL     ; store CURSOR at POINTER plus COL
    lda DOT
    sta (POINTER),y

    PLA
    TAY

; bottom right quadrant 
    lda DOT
    sta (POINTER),y

    PLA
    STA POINTER_H
    PLA
    STA POINTER

    RTS

 ; these two tables contain the high and low bytes
 ; of the addresses of the start of each row

 table_high:
 dcb $02,$02,$02,$02,$02,$02,$02,$02
 dcb $03,$03,$03,$03,$03,$03,$03,$03
 dcb $04,$04,$04,$04,$04,$04,$04,$04
 dcb $05,$05,$05,$05,$05,$05,$05,$05

 table_low:
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0

color_pallete:
dcb $01,$02,$03,$04,$05,$06,$07,$08,$09,$0a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VWgHzv7a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7jurfpj9esevf6axwhvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VWgHzv7a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7jurfpj9esevf6axwhvk.png" alt="Image description" width="160" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mG3F-OE_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7u65ssggmpu94esh8x9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mG3F-OE_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7u65ssggmpu94esh8x9n.png" alt="Image description" width="160" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab</category>
    </item>
    <item>
      <title>LAB4 - AArch64 vs. x86_64</title>
      <dc:creator>Tecca Yu</dc:creator>
      <pubDate>Mon, 28 Feb 2022 01:04:43 +0000</pubDate>
      <link>https://dev.to/pykedot/lab4-1o2i</link>
      <guid>https://dev.to/pykedot/lab4-1o2i</guid>
      <description>&lt;p&gt;Hi this is Tecca, and in this post I will be demonstrating my findings with assembly language in the AArch64, an ARM architecture, and x86_64, an x86 architecture.&lt;/p&gt;

&lt;p&gt;The goal of this lab is to generate output like the following in both AArch64 and x86_64:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
...
Loop: 30
etc.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source code for this lab in x86_64&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl  _start


_start:


print:  mov     $0,%r15                 /* loop index */

loop:   mov %r15,%r14
    add $'0',%r14
    movb    %r14b,msg+6

    movq    $len,%rdx                       /* message length */
        movq    $msg,%rsi                       /* message location */
        movq    $1,%rdi                         /* file descriptor stdout */
        movq    $1,%rax                         /* syscall sys_write */
        syscall

        inc     %r15                /* increment the index */
        cmp     $10,%r15           /* check if we're done */
        jne     loop                /* keep looping if we're not */

        movq    $0,%rdi                         /* exit status */
        movq    $60,%rax                        /* syscall sys_exit */
        syscall

.section .data

msg:    .ascii      "Loop: #\n"
        len = . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EH05J3bp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gc3hy3g3v65czirr51f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EH05J3bp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gc3hy3g3v65czirr51f1.png" alt="Image description" width="525" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moving onto AArch64 source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _start

_start:

        mov     x19, 0

loop:
        add     x20, x19, '0'   // Create digit character
        adr     x17, msg+6      // Get a pointer to desired location of digit
        strb    w20, [x17]      // Put digit to desired location

        mov     x0, 1           /* file descriptor: 1 is stdout */
        adr     x1, msg         /* message location (memory address) */
        mov     x2, len         /* message length (bytes) */

        mov     x8, 64          
        svc     0               /* invoke syscall */

        add     x19, x19, 1
        cmp     x19, 10
        b.ne    loop

        mov     x0, 0           /* status -&amp;gt; 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
       msg:    .ascii      "Loop: #\n"
       len=    . - msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works by storing the loop counter in a register and converting it to an ASCII character for printing.&lt;br&gt;
The conversion adds the counter to the ASCII value of '0' (48 in decimal). Once the loop increments the counter to 10, the program no longer prints the expected number, because there is no single ASCII character for "10". &lt;/p&gt;

&lt;p&gt;I've tried adding 10 to the ASCII value of '0'; the resulting character is a colon ":" instead of the expected number 10, because 48+10=58 and ASCII 58 is ":".&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eQ5Zvg-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y44j4o2l4v9ia7k1dokp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eQ5Zvg-p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y44j4o2l4v9ia7k1dokp.png" alt="Image description" width="501" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'll focus on making it loop past 10 and onward in the next post.&lt;/p&gt;

</description>
      <category>spo600</category>
      <category>lab</category>
      <category>assemblylanguage</category>
    </item>
  </channel>
</rss>
