DEV Community: Qzhang125

Week 14 Project Stage 3: SVE2

Qzhang125 — Wed, 15 Dec 2021 23:25:54 +0000

Hello my friend, welcome to the last project stage of SPO600. In this post, we will discuss how software that already uses the Neon SIMD (Single Instruction, Multiple Data) extension could be extended to support the Scalable Vector Extension version 2 (SVE2), using the open-source software that I chose in project stage 2. Before I start stage 3, let's do a short review of what we did in stage 2. In stage 2, I chose FFmpeg, an open-source, cross-platform tool to record, stream, and convert video and audio. It uses a lot of SIMD code to accelerate data processing, but the Neon extension has a fixed 128-bit vector length. To go beyond that limit, Arm designed SVE.

What is SVE

We all know that a 128-bit vector instruction set can only operate on 128 bits of data at a time. To improve on that, SVE lets each hardware implementation choose a vector length from 128 bits up to 2048 bits, in 128-bit increments. Beyond this, the SVE design lets developers write and build software once and then run it on different AArch64 hardware regardless of the vector length that the hardware implements. SVE also includes:

Per-lane predication
Gather-load and scatter-store
Speculative vectorization
These features help to vectorize and optimize loops for large datasets.

SVE 2

The main difference between SVE and SVE 2 is the functional coverage of the instruction set. SVE was designed to make the architecture more suitable for High-Performance Computing (HPC) and Machine Learning (ML). SVE 2 expands the instruction set to more data-processing domains and accelerates the common algorithms that are used in the areas below:

Computer vision
Multimedia
Long-Term Evolution (LTE) baseband processing
Genomics
In-memory database
Web serving
General-purpose software

SVE2 usage

 int16x8_t q0s16, q2s16, q3s16, q8s16, q10s16, q11s16, q13s16;
    int16x8_t q14s16, q15s16, qzs16;
    int16x4_t d0s16, d2s16, d3s16, dzs16;
    uint16x8_t q1u16, q9u16;
    uint16x4_t d1u16;

Now let's talk about how to extend the open-source software to support SVE2. In project stage 2 we discussed the mpegvideo.c file, which uses Neon intrinsics to process MPEG data while compressing and decompressing moving pictures. SVE2 could improve this procedure. SVE2 builds on the Vector Length Agnostic (VLA) programming model introduced by SVE: the code does not hard-code a vector length, so the same binary can process more picture elements per instruction on hardware with wider vectors, and the compiler does not have to target one fixed width when it vectorizes a loop. For FFmpeg to support and take advantage of SVE2, the FFmpeg developers could consider adding SVE2 code paths next to the existing Neon ones, for example as inline assembly that uses the 32 scalable vector registers (Z0 to Z31) and the SVE2 assembly syntax to invoke the new instructions.
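To make the vector-length-agnostic idea a bit more concrete, here is a minimal sketch in plain C. It is not real SVE2 code: it only models a loop that works for any vector length, with a per-lane predicate that switches off the lanes past the end of the array. The function name scale_vla_model and the lane-count parameter vl are my own assumptions for illustration; real SVE2 code would get the lane count from the hardware and keep the predicate in a predicate register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 128 /* 2048 bits / 16 bits per lane */

/* Hypothetical model: doubles every element of buf, handling `vl` lanes
 * per step, the way a vector-length-agnostic loop would. */
void scale_vla_model(int16_t *buf, size_t n, size_t vl)
{
    assert(vl >= 8 && vl <= MAX_LANES); /* 128-bit up to 2048-bit vectors */

    for (size_t i = 0; i < n; i += vl) {
        for (size_t lane = 0; lane < vl; lane++) {
            /* Predicate: a lane is active only while it is inside the
             * array, so the last partial vector needs no special tail loop. */
            int active = (i + lane) < n;
            if (active)
                buf[i + lane] = (int16_t)(buf[i + lane] * 2);
        }
    }
}
```

Calling scale_vla_model(buf, 11, 8) and scale_vla_model(buf, 11, 128) on the same input gives the same output, which is the point of VLA: the source code does not change when the vector gets wider.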

Conclusion

In this project, we discussed SIMD and one of its instruction set extensions, SVE2. To extend FFmpeg to support SVE2, a developer could start from the existing Neon code, add SVE2 inline assembly, and replace the fixed 128-bit vector length with vector-length-agnostic code. Since this is the last blog for this course, I would say this is one of the hardest courses that I have ever taken so far. I learned a lot about how a program works underneath high-level programming languages and how to benchmark an application. Lastly, I got a picture of how the compiler optimizes programs using SIMD and many other techniques. It was a fun and valuable experience for me, and it helped me build a picture of the relationship between the machine, the compiler, and my program.

Week 13 Project Stage 2

Qzhang125 — Sun, 12 Dec 2021 04:33:35 +0000

Hello everyone, welcome to the week 13 project blog. This is the second phase of the SPO600 (Software Portability and Optimization) project; click here to check phase 1. In this blog we are going to pick an open-source software package, locate SIMD code in its source code, and then determine how that SIMD code is used.

Introduction

For my blog, I chose FFmpeg. FFmpeg is a cross-platform framework to record, convert, and stream video and audio. It provides an easy way to convert video and audio to other formats.

FFmpeg is a very active open-source package with daily updates and bug fixes on GitHub. Below is a short log of the recent updates of FFmpeg, or you can click here to see more of the update history.

Single Instruction, Multiple Data Units

SIMD (Single Instruction, Multiple Data) is a form of parallel processing that is widely used in high-performance and embedded computing. The idea is that a single instruction applies one operation to multiple data elements in parallel, instead of processing one element at a time. To get more information about SIMD, you can also check my last blog for week 11.

SIMD Usage

To get the source code of FFmpeg, you can simply go to FFmpeg's GitHub repository and clone it onto an AArch64 or x86_64 system. Let's take a look at the code snippet I found in the /FFmpeg/libavcodec/neon/mpegvideo.c file.
This SIMD implementation uses Neon, so it only applies to Arm systems such as AArch64:

static void inline ff_dct_unquantize_h263_neon(int qscale, int qadd, int nCoeffs,
                                               int16_t *block)
{
    int16x8_t q0s16, q2s16, q3s16, q8s16, q10s16, q11s16, q13s16;
    int16x8_t q14s16, q15s16, qzs16;
    int16x4_t d0s16, d2s16, d3s16, dzs16;
    uint16x8_t q1u16, q9u16;
    uint16x4_t d1u16;

    dzs16 = vdup_n_s16(0);
    qzs16 = vdupq_n_s16(0);

    q15s16 = vdupq_n_s16(qscale << 1);
    q14s16 = vdupq_n_s16(qadd);
    q13s16 = vnegq_s16(q14s16);

    if (nCoeffs > 4) {
        for (; nCoeffs > 8; nCoeffs -= 16, block += 16) {
            q0s16 = vld1q_s16(block);
            q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));
            q8s16 = vld1q_s16(block + 8);
            q1u16 = vceqq_s16(q0s16, qzs16);
            q2s16 = vmulq_s16(q0s16, q15s16);
            q11s16 = vreinterpretq_s16_u16(vcltq_s16(q8s16, qzs16));
            q10s16 = vmulq_s16(q8s16, q15s16);
            q3s16 = vbslq_s16(vreinterpretq_u16_s16(q3s16), q13s16, q14s16);
            q11s16 = vbslq_s16(vreinterpretq_u16_s16(q11s16), q13s16, q14s16);
            q2s16 = vaddq_s16(q2s16, q3s16);
            q9u16 = vceqq_s16(q8s16, qzs16);
            q10s16 = vaddq_s16(q10s16, q11s16);
            q0s16 = vbslq_s16(q1u16, q0s16, q2s16);
            q8s16 = vbslq_s16(q9u16, q8s16, q10s16);
            vst1q_s16(block, q0s16);
            vst1q_s16(block + 8, q8s16);
        }
    }
    if (nCoeffs <= 0)
        return;

    d0s16 = vld1_s16(block);
    d3s16 = vreinterpret_s16_u16(vclt_s16(d0s16, dzs16));
    d1u16 = vceq_s16(d0s16, dzs16);
    d2s16 = vmul_s16(d0s16, vget_high_s16(q15s16));
    d3s16 = vbsl_s16(vreinterpret_u16_s16(d3s16),
                     vget_high_s16(q13s16), vget_high_s16(q14s16));
    d2s16 = vadd_s16(d2s16, d3s16);
    d0s16 = vbsl_s16(d1u16, d0s16, d2s16);
    vst1_s16(block, d0s16);
}

This code snippet is for AArch64 systems and uses Neon intrinsics. At the beginning, it declares several vector variables whose lanes are signed or unsigned 16-bit integers; the compiler maps these variables onto Neon registers.

int16x8_t q0s16, q2s16, q3s16, q8s16, q10s16, q11s16, q13s16;
    int16x8_t q14s16, q15s16, qzs16;
    int16x4_t d0s16, d2s16, d3s16, dzs16;
    uint16x8_t q1u16, q9u16;
    uint16x4_t d1u16;

Next, it fills the vectors declared above with constant values: zero, qscale * 2, qadd, and -qadd.

dzs16 = vdup_n_s16(0);
    qzs16 = vdupq_n_s16(0);

    q15s16 = vdupq_n_s16(qscale << 1);
    q14s16 = vdupq_n_s16(qadd);
    q13s16 = vnegq_s16(q14s16);

Afterward, SIMD is being used from here:

if (nCoeffs > 4) {
        for (; nCoeffs > 8; nCoeffs -= 16, block += 16) {
            q0s16 = vld1q_s16(block);
            q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));
            q8s16 = vld1q_s16(block + 8);
            q1u16 = vceqq_s16(q0s16, qzs16);
            q2s16 = vmulq_s16(q0s16, q15s16);
            q11s16 = vreinterpretq_s16_u16(vcltq_s16(q8s16, qzs16));
            q10s16 = vmulq_s16(q8s16, q15s16);
            q3s16 = vbslq_s16(vreinterpretq_u16_s16(q3s16), q13s16, q14s16);
            q11s16 = vbslq_s16(vreinterpretq_u16_s16(q11s16), q13s16, q14s16);
            q2s16 = vaddq_s16(q2s16, q3s16);
            q9u16 = vceqq_s16(q8s16, qzs16);
            q10s16 = vaddq_s16(q10s16, q11s16);
            q0s16 = vbslq_s16(q1u16, q0s16, q2s16);
            q8s16 = vbslq_s16(q9u16, q8s16, q10s16);
            vst1q_s16(block, q0s16);
            vst1q_s16(block + 8, q8s16);
        }
    }

This part of the code snippet uses an if-condition and a for-loop to load 16 coefficients per iteration, multiply them by the values stored in the vectors, and store the results back. For example,

q0s16 = vld1q_s16(block); // load eight 16-bit values from block
q2s16 = vmulq_s16(q0s16, q15s16); // multiply the two vectors lane by lane

Looking at the way SIMD is used in this part of the code, we can see that intrinsics are also nested inside one another:

q3s16 = vreinterpretq_s16_u16(vcltq_s16(q0s16, qzs16));

Looking at the inside part first, vcltq_s16 compares each vector element in the first register q0s16 with the corresponding vector element in the second register qzs16. If the first signed integer value is less than the second signed integer value, it sets every bit of the corresponding element in the destination register to 1; otherwise it sets every bit of that element to 0. The outer vreinterpretq_s16_u16 is a reinterpret-cast: it changes the type of the unsigned comparison mask to a signed vector without changing any bits.
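As a rough illustration of these semantics, the sketch below models a single 16-bit lane in plain C. The helper names lane_clt_s16, lane_bsl_s16, and lane_sign_offset are hypothetical, not Neon intrinsics; they only mimic what vcltq_s16 and vbslq_s16 do to each lane in the snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* One-lane model of vcltq_s16: all ones if a < b, otherwise all zeros. */
uint16_t lane_clt_s16(int16_t a, int16_t b)
{
    return (a < b) ? 0xFFFFu : 0x0000u;
}

/* One-lane model of vbslq_s16: take the bits of x where mask is 1,
 * and the bits of y where mask is 0. */
int16_t lane_bsl_s16(uint16_t mask, int16_t x, int16_t y)
{
    uint16_t bits = (uint16_t)((mask & (uint16_t)x) | (~mask & (uint16_t)y));
    /* Converting back to int16_t assumes the usual two's complement
     * behaviour of mainstream compilers. */
    return (int16_t)bits;
}

/* Mirrors the pattern in the loop: pick -qadd for negative inputs and
 * +qadd otherwise, without branching on the data. */
int16_t lane_sign_offset(int16_t x, int16_t qadd)
{
    assert(qadd >= 0); /* this illustration assumes a non-negative offset */
    return lane_bsl_s16(lane_clt_s16(x, 0), (int16_t)-qadd, qadd);
}
```

For example, lane_sign_offset(-5, 7) selects -7, while lane_sign_offset(5, 7) selects 7.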
After the loop, the same calculation is repeated once with 64-bit vectors to handle the remaining four coefficients and store the result back to the block:

if (nCoeffs <= 0)
        return;

    d0s16 = vld1_s16(block);
    d3s16 = vreinterpret_s16_u16(vclt_s16(d0s16, dzs16));
    d1u16 = vceq_s16(d0s16, dzs16);
    d2s16 = vmul_s16(d0s16, vget_high_s16(q15s16));
    d3s16 = vbsl_s16(vreinterpret_u16_s16(d3s16),
                     vget_high_s16(q13s16), vget_high_s16(q14s16));
    d2s16 = vadd_s16(d2s16, d3s16);
    d0s16 = vbsl_s16(d1u16, d0s16, d2s16);
    vst1_s16(block, d0s16);

After my examination, the SIMD implementation in the mpegvideo.c file is part of video compression and decompression: ff_dct_unquantize_h263_neon rescales (dequantizes) the DCT coefficients of a block. It is selected at compile time, uses intrinsics rather than macros, and puts all of the SIMD operations into a for loop so that each iteration processes 16 coefficients in two vectors at the same time. This saves the package a lot of calculation time.
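To check my reading of the Neon code, here is a scalar sketch of what I believe each lane computes: a coefficient of zero stays zero, and any other coefficient is multiplied by qscale * 2 and then moved away from zero by qadd. The function name dct_unquantize_h263_scalar is hypothetical, and unlike the Neon version this sketch processes exactly n coefficients instead of working in groups of 16 and 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the per-lane arithmetic of
 * ff_dct_unquantize_h263_neon (my reading of the intrinsics). */
void dct_unquantize_h263_scalar(int qscale, int qadd, int n, int16_t *block)
{
    assert(block != NULL);

    for (int i = 0; i < n; i++) {
        int level = block[i];
        if (level == 0)
            continue; /* vceqq_s16 + vbslq_s16 keep zero lanes unchanged */

        /* vmulq_s16 by (qscale << 1), then add -qadd or +qadd by sign */
        int scaled = level * (qscale << 1) + (level < 0 ? -qadd : qadd);
        block[i] = (int16_t)scaled;
    }
}
```

With qscale = 2 and qadd = 1, the block {0, 3, -3, 1} becomes {0, 13, -13, 5}.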

Looking at the SIMD implementation, FFmpeg explicitly creates a folder for the Neon instruction set:

In the Neon directory we got this:

We could see that the SIMD implementation for this part of the functionality is separated from other SIMD implementations.

The SIMD is also working on x86 systems. I found the code snippet from /FFmpeg/libavcodec/x86/lpc.c :

static void lpc_compute_autocorr_sse2(const double *data, int len, int lag,
                                      double *autoc)
{
    int j;

    if((x86_reg)data & 15)
        data++;

    for(j=0; j<lag; j+=2){
        x86_reg i = -len*sizeof(double);
        if(j == lag-2) {
            __asm__ volatile(
                "movsd    "MANGLE(pd_1)", %%xmm0    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm1    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm2    \n\t"
                "1:                                 \n\t"
                "movapd   (%2,%0), %%xmm3           \n\t"
                "movupd -8(%3,%0), %%xmm4           \n\t"
                "movapd   (%3,%0), %%xmm5           \n\t"
                "mulpd     %%xmm3, %%xmm4           \n\t"
                "mulpd     %%xmm3, %%xmm5           \n\t"
                "mulpd -16(%3,%0), %%xmm3           \n\t"
                "addpd     %%xmm4, %%xmm1           \n\t"
                "addpd     %%xmm5, %%xmm0           \n\t"
                "addpd     %%xmm3, %%xmm2           \n\t"
                "add       $16,    %0               \n\t"
                "jl 1b                              \n\t"
                "movhlps   %%xmm0, %%xmm3           \n\t"
                "movhlps   %%xmm1, %%xmm4           \n\t"
                "movhlps   %%xmm2, %%xmm5           \n\t"
                "addsd     %%xmm3, %%xmm0           \n\t"
                "addsd     %%xmm4, %%xmm1           \n\t"
                "addsd     %%xmm5, %%xmm2           \n\t"
                "movsd     %%xmm0,   (%1)           \n\t"
                "movsd     %%xmm1,  8(%1)           \n\t"
                "movsd     %%xmm2, 16(%1)           \n\t"
                :"+&r"(i)
                :"r"(autoc+j), "r"(data+len), "r"(data+len-j)
                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
                :"memory"
            );
        } else {
            __asm__ volatile(
                "movsd    "MANGLE(pd_1)", %%xmm0    \n\t"
                "movsd    "MANGLE(pd_1)", %%xmm1    \n\t"
                "1:                                 \n\t"
                "movapd   (%3,%0), %%xmm3           \n\t"
                "movupd -8(%4,%0), %%xmm4           \n\t"
                "mulpd     %%xmm3, %%xmm4           \n\t"
                "mulpd    (%4,%0), %%xmm3           \n\t"
                "addpd     %%xmm4, %%xmm1           \n\t"
                "addpd     %%xmm3, %%xmm0           \n\t"
                "add       $16,    %0               \n\t"
                "jl 1b                              \n\t"
                "movhlps   %%xmm0, %%xmm3           \n\t"
                "movhlps   %%xmm1, %%xmm4           \n\t"
                "addsd     %%xmm3, %%xmm0           \n\t"
                "addsd     %%xmm4, %%xmm1           \n\t"
                "movsd     %%xmm0, %1               \n\t"
                "movsd     %%xmm1, %2               \n\t"
                :"+&r"(i), "=m"(autoc[j]), "=m"(autoc[j+1])
                :"r"(data+len), "r"(data+len-j)
                 NAMED_CONSTRAINTS_ARRAY_ADD(pd_1)
            );
        }
    }
}

This function uses inline assembly: the SSE2 instructions are written directly inside a .c file with __asm__ volatile and are assembled when the file is compiled. On x86 systems it uses SSE2 to do the calculation and move the data.
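For comparison, this is a plain C sketch of the autocorrelation that I understand the assembly to compute. It rests on several assumptions of mine: each sum starts at 1.0 (the pd_1 constant appears to hold 1.0), lags 0 through lag inclusive are written (matching the three stores in the j == lag-2 branch), and samples before the start of data are skipped. The name autocorr_scalar is hypothetical.

```c
#include <assert.h>

/* Hypothetical scalar reference: autoc[j] = 1.0 + sum of data[i] * data[i - j].
 * Writes lag + 1 values, so autoc must have room for them. */
void autocorr_scalar(const double *data, int len, int lag, double *autoc)
{
    assert(data && autoc && len >= 0 && lag >= 0);

    for (int j = 0; j <= lag; j++) {
        double sum = 1.0; /* assumed starting value, see the note above */
        for (int i = j; i < len; i++)
            sum += data[i] * data[i - j];
        autoc[j] = sum;
    }
}
```

The SSE2 version does the same multiply-and-add work, but it handles two doubles per instruction and two or three lags per pass over the data.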

Conclusion

To conclude the usage of SIMD in FFmpeg, I only looked at two examples, one for AArch64 and one for x86, but I would say that FFmpeg takes full advantage of SIMD. It saves a lot of time when the software converts video and audio to another format, because the SIMD instructions process multiple data elements simultaneously. As for the code structure, after this project I have a brief picture of how FFmpeg is organized. I would say FFmpeg is well structured because it puts the SIMD implementations in separate folders. This taught me that separating code this way is very helpful for debugging and for locating an error when the program runs into an issue. It also provides a chance to reuse code that was developed before. To improve the code, I would suggest restructuring some loop conditions to avoid unnecessary loop iterations.

Week 12 Reflection: Paged Memory System

Qzhang125 — Fri, 10 Dec 2021 03:13:31 +0000

Hello everyone, welcome back to the SPO600(Software Portability and Optimization) week 12 blog. In this blog, we will find something fun with the paged memory system.

Introduction

Most modern systems use paged memory. This means the physical memory of the system is divided into pages, and these pages are arranged via a page table (memory map) into the memory layout that a process sees. It allows us to present a different view of memory to different processes, so some areas of memory are visible to one process but invisible to another, and it also allows us to share memory between processes.

This diagram shows the relationship between physical memory and logical (virtual) memory. In this diagram the current process is a game process. On the left-hand side we have the view of memory that is visible to that process. The game process appears to have 4 pages of virtual memory corresponding to three pages of physical memory, and the pages don't have to be contiguous in physical memory the way they are in virtual memory. The page size on x86_64 systems is commonly 4 KiB (larger 2 MiB and 1 GiB pages are also available); on AArch64 systems it is usually 4 KiB or 64 KiB.
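As a small sketch of how page size shows up in a program, the code below asks the system for its page size with the POSIX sysconf call and rounds a byte count up to whole pages. The helper names system_page_size and pages_needed are my own for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Page size of the running system in bytes, as reported by POSIX sysconf. */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Number of whole pages needed to hold `bytes` bytes (rounds up). */
size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

With 4096-byte pages, a 4097-byte allocation needs two pages, even though the second page is almost empty.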

How much memory is a process using

Defining how much memory a process uses is quite a difficult question, because we need a standard way to count the pages that belong to a process at a given moment. There are two standard measures: VSS (Virtual Set Size) and RSS (Resident Set Size). The VSS counts all of the pages that are present in the virtual (logical) memory of that process, whether or not each page has actually been allocated, so it is the worst-case size. The RSS only counts the pages that are physically present in memory; it excludes pages that have not yet been loaded from disk or that have been swapped out to disk. It gives us a more accurate picture of the actual physical memory usage of a particular process.
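On Linux, one way to read both numbers for the current process is the /proc/self/statm file, whose first two fields are the total program size and the resident set size, both counted in pages. The sketch below is a minimal reader under that assumption; the function names parse_statm and read_self_statm are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parses the first two fields of a statm line: virtual size and
 * resident size, both in pages. Returns 1 on success, 0 otherwise. */
int parse_statm(const char *line, unsigned long *vss_pages,
                unsigned long *rss_pages)
{
    assert(line && vss_pages && rss_pages);
    return sscanf(line, "%lu %lu", vss_pages, rss_pages) == 2;
}

/* Reads the values for the calling process (Linux only). */
int read_self_statm(unsigned long *vss_pages, unsigned long *rss_pages)
{
    char line[256];
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f)
        return 0;
    int ok = fgets(line, sizeof line, f) != NULL &&
             parse_statm(line, vss_pages, rss_pages);
    fclose(f);
    return ok;
}
```

Multiplying each value by the page size gives the sizes in bytes; the resident value should never be larger than the virtual one.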

What it looks like

To see what it looks like on an actual operating system, we need to use this command:

top

For my personal laptop, I got this result below:

As we can see, I have 11946.7 mebibytes in total: 5615.7 mebibytes free, 740.6 mebibytes in use, and 5588.7 mebibytes for buff/cache.

Below this, we can see the individual processes. For each process, top shows VIRT (Virtual Set Size), RES (Resident Set Size), and SHR (Shared Memory Size).

Conclusion

This week we learned almost the last topic of this course. To be honest, this course is the hardest course for me so far because it works at a lower level than the other computer science classes that I took before, like C++, Java, JavaScript, etc. However, it is fun to look at how and where a program actually works and what the compiler and the system do to run the program that we created. In this course, I worked with 6502 assembly language, assembly language on x86_64 and AArch64 systems, and algorithms to speed up a program. I think I have got a picture of how a program runs at a lower level, and I hope I can explore more about it.

Week 11 Reflection SIMD

Qzhang125 — Fri, 10 Dec 2021 03:13:20 +0000

Hello my friend, welcome back to the week 11 blog about SPO600(Software Portability and Optimization). In this blog, I'm going to blog about SIMD.

Introduction

SIMD (Single Instruction, Multiple Data) is also called vectorization. The concept of SIMD is that a single instruction applies one operation to multiple data elements in parallel. It is an ability that many modern processors have. The name vectorization comes from working with vectors: arrays of values, typically one-dimensional, where the same operation is applied to different parts of the array simultaneously. For example,

In the diagram above, the SIMD unit could calculate multiple values from vector A and vector B with the same operation and then put the result into the vector C in parallel which means it is executing four operations simultaneously.

MMX instruction set

MMX defines 8 processor registers, named MM0 to MM7, and operations that operate on them. Each register is 64 bits wide and can hold either one 64-bit integer or multiple smaller integers: one instruction can be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once.
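To picture how one 64-bit register holds several smaller integers, the sketch below packs four 16-bit lanes into a uint64_t and adds two such values lane by lane with wrap-around, which is the behaviour of the MMX PADDW instruction. This is plain C that models the idea, not MMX code, and the helper names pack4x16 and add4x16 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs four 16-bit lanes into one 64-bit value; lane 0 is the lowest 16 bits. */
uint64_t pack4x16(const uint16_t lanes[4])
{
    assert(lanes != NULL);
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint64_t)lanes[i] << (16 * i);
    return r;
}

/* Lane-wise add: each 16-bit lane wraps around on its own, and no carry
 * leaks into the neighbouring lane. */
uint64_t add4x16(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)(uint16_t)(x + y) << (16 * i);
    }
    return r;
}
```

A real SIMD unit does the four lane additions with one instruction instead of a loop, but the result per lane is the same.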

How to check SIMD instruction

Before using a SIMD instruction set, we need to know that both the processor and the compiler support it. I used my personal laptop as an example:
Check SIMD instruction sets that are supported by the CPU:

cat /proc/cpuinfo

Then find the flags line; it lists the instruction sets that the CPU supports.

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse

Check SIMD instruction sets which are supported by GCC

gcc -march=native -c -Q --help=target

We could see the instruction set which is enabled or disabled.
For example:

The following options are target specific:
  -m128bit-long-double                  [enabled]
  -m16                                  [disabled]
  -m32                                  [disabled]
  -m3dnow                               [disabled]
  -m3dnowa                              [disabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -m8bit-idiv                           [disabled]
  -m96bit-long-double                   [disabled]
  -mabi=                                sysv
  -mabm                                 [disabled]
  -maccumulate-outgoing-args            [disabled]
  -maddress-mode=                       long
  -madx                                 [disabled]
  -maes                                 [disabled]
  -malign-data=                         compat
  -malign-double                        [disabled]
  -malign-functions=                    0
  -malign-jumps=                        0
  -malign-loops=                        0
  -malign-stringops                     [enabled]
  -mamx-bf16                            [disabled]
  -mamx-int8                            [disabled]
  -mamx-tile                            [disabled]
  -mandroid                             [disabled]
  -march=                               nehalem
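The same information is visible from inside a C program at compile time, because GCC and Clang define a macro for each SIMD extension that is enabled for the build. The sketch below checks a few of these predefined macros; the function names compiled_simd_level and report_simd_level are my own, and the list of macros is only a small selection.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the name of the widest SIMD extension (from a short list)
 * that the compiler was allowed to use for this build. */
const char *compiled_simd_level(void)
{
#if defined(__ARM_FEATURE_SVE2)
    return "SVE2";
#elif defined(__ARM_FEATURE_SVE)
    return "SVE";
#elif defined(__ARM_NEON)
    return "Neon";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__MMX__)
    return "MMX";
#else
    return "none detected";
#endif
}

/* Prints the result, for example to stdout. */
void report_simd_level(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "Compiled with SIMD support: %s\n", compiled_simd_level());
}
```

Building the same file with and without -march=native can change the answer, because the macros follow the compiler options and not the CPU that later runs the program.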

Conclusion

SIMD is a very significant ability that a modern processor should have. It helps to process multiple data elements simultaneously and saves a lot of time. In Lab 6 (aka project 1) we explored it further and tested how much time SIMD saved for the program. It is fun to learn and actually test how it works.

SPO600 Project Stage 1: Algorithm Selection for Optimization

Qzhang125 — Sun, 05 Dec 2021 04:53:23 +0000

Hello everyone, welcome to assignment phase 1 of SPO600 (Software Portability and Optimization). In this phase, we will benchmark 6 different algorithms written in C.

Background

Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second per channel, for a total of 88.2 or 96 thousand samples per second (kHz). Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
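The arithmetic above can be written as a short C sketch. The function names pcm_bytes_per_second and scale_sample are hypothetical; they are not taken from the course files.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second of uncompressed 16-bit audio. */
uint32_t pcm_bytes_per_second(uint32_t sample_rate, uint32_t channels)
{
    return sample_rate * channels * (uint32_t)sizeof(int16_t);
}

/* Scales one sample by a volume factor between 0.00 and 1.00. */
int16_t scale_sample(int16_t sample, float volume)
{
    assert(volume >= 0.0f && volume <= 1.0f);
    return (int16_t)(sample * volume);
}
```

For stereo sound at 44,100 samples per second this gives 176,400 bytes per second, and at 48,000 samples per second it gives 192,000, matching the numbers above.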

Introduction

We got 6 approaches that use different algorithms:

vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between integer and floating points can be an expensive operation.
vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between integer and floating-point and back again.
vol2.c pre-calculates all 65536 different results and then looks up the answer for each input value.
vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (besides scaling the volume) done by the other programs.
vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not be built for x86_64.
vol5.c uses SIMD instructions accessed through Compiler Intrinsics. This program is also specific to AArch64. For more details, please click here.
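To show the difference between the fixed-point and lookup-table ideas, here is a minimal sketch of both. It is not the actual source of vol1.c or vol2.c: the function names are hypothetical, and I divide by 32768 where the real code may use a bit shift.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-point idea: convert the volume factor to an integer once. */
int16_t volume_to_fixed(float volume)
{
    assert(volume >= 0.0f && volume <= 1.0f);
    return (int16_t)(volume * 32767.0f);
}

/* Then scale each sample with integer math only. Dividing by 32768
 * (2^15) undoes the fixed-point scaling of the factor. */
int16_t scale_fixed(int16_t sample, int16_t vol_int)
{
    return (int16_t)(((int32_t)sample * vol_int) / 32768);
}

/* Lookup-table idea: pre-calculate the answer for all 65536 inputs. */
static int16_t table[65536];

void build_table(float volume)
{
    for (int32_t x = INT16_MIN; x <= INT16_MAX; x++)
        table[(uint16_t)x] = (int16_t)((float)x * volume);
}

/* Scaling a sample is then a single array access. */
int16_t scale_lookup(int16_t sample)
{
    return table[(uint16_t)sample];
}
```

At 50% volume the fixed-point version turns 1000 into 499 and the table version turns it into 500. Small differences like this between the approaches are one possible reason why programs can print different results.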

Procedure

Unpack the archive /public/spo600-volume-examples.tgz
Make a prediction of the relative performance of each scaling algorithm.
Build and test each of the programs.
Test the performance of each program.
Find a way to measure performance without the time taken to perform the test setup pre-processing (generating the samples) and post-processing (summing the results) so that you can measure only the time taken to scale the samples.
Increase performance by changing the compiler option (via the Makefile).
Answer the question in the code.

AArch64(Israel)

My prediction

My prediction is that the fastest program is vol5 because of SIMD, and the slowest program is vol2.

Build & test each program

This is what we got in the vol.h file:

/* This is the number of samples to be processed */
#define SAMPLES 16

/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume
#include <stdint.h>
/* Function prototype to fill an array sample of
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);

This header file defines 16 samples for the programs to process.
Now, let's build and test each of the programs:

As you can see here, most of the results are the same, but we also got different outputs from the vol1 and vol3 programs. All of the programs process the same number of samples with different algorithms, so the algorithm is the only factor that can explain the different outputs.

Benchmarking

In this part, I wrote timing code that uses the system time to calculate the execution time in milliseconds. To complete this test, we also need to add

#include <sys/time.h>

This is the code:

struct timeval  start , end;
gettimeofday(&start,0);
//loop goes here for testing
gettimeofday(&end,0);
float duration = (end.tv_sec - start.tv_sec) * 1000.0f + (end.tv_usec -start.tv_usec)/1000.0f;
printf("Time used: %f millisec\n", duration);
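The same measurement can be wrapped in small helpers so that it is easy to reuse around any piece of code. This is only a sketch: the names elapsed_ms and time_call_ms are hypothetical, and I use double instead of float to keep more digits when the seconds value is large.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Milliseconds between two gettimeofday readings. */
double elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_usec - start.tv_usec) / 1000.0;
}

/* Runs fn once and returns how long it took in milliseconds. */
double time_call_ms(void (*fn)(void))
{
    struct timeval start, end;
    assert(fn != NULL);
    gettimeofday(&start, NULL);
    fn(); /* the loop to be measured goes inside fn */
    gettimeofday(&end, NULL);
    return elapsed_ms(start, end);
}
```

Placing only the scaling loop inside the timed function keeps the sample generation and the final sum out of the measurement.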

Results:
For a number of samples: 1000000000
vol0.c(Using the basic algorithm)

Iteration	Duration(millisecond)	result
1	3295.242920	289
2	3280.395996	289
3	3295.914062	289
4	3287.457031	289
5	3309.891113	289

The average execution time of vol0.c is 16,468.901122 / 5 = 3,293.7802244 milliseconds.
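The averages in this post are simple arithmetic means of the five runs. As a sketch, a helper like the one below (the name mean_ms is hypothetical) reproduces the calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of `count` measured durations in milliseconds. */
double mean_ms(const double *runs, size_t count)
{
    assert(runs != NULL && count > 0);
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += runs[i];
    return total / (double)count;
}
```

Feeding it the five vol0.c durations from the table above gives about 3,293.78 milliseconds.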

vol1.c

Iteration	Duration(millisecond)	result
1	2871.633057	192
2	2862.878906	192
3	2855.055908	192
4	2868.831055	192
5	2853.552002	192

The average execution time of vol1.c is 14,311.950928 / 5 = 2,862.3901856 milliseconds, which is 431.3900388 milliseconds faster than the basic algorithm.

vol2.c

Iteration	Duration(millisecond)	result
1	7004.999023	289
2	6997.184082	289
3	6996.041992	289
4	6994.092773	289
5	6983.860840	289

The average execution time of vol2.c is 34,976.178710 / 5 = 6,995.235742 milliseconds.
vol2.c is much slower than the other programs, which I attribute to the pre-calculation of the lookup table.

vol3.c

Iteration	Duration(millisecond)	result
1	2193.116943	0
2	2192.761963	0
3	2193.154053	0
4	2191.714111	0
5	2193.959961	0

The average execution time of vol3.c is 10,964.707031 / 5 = 2,192.9414062 milliseconds.

vol4.c

Iteration	Duration(millisecond)	result
1	1704.156982	389
2	1701.747070	389
3	1755.469971	389
4	1753.229980	389
5	1736.984009	389

The average execution time of vol4.c is 8,651.588012 / 5 = 1,730.3176024 milliseconds which is the fastest so far.

vol5.c

Iteration	Duration(millisecond)	result
1	1738.630005	389
2	1756.073975	389
3	1754.686035	389
4	1788.035034	389
5	1703.054932	389

The average execution time of vol5.c is 8,740.479981 / 5 = 1,748.0959962 milliseconds.

Relative Memory Usage

To check the memory usage, I used this command:

free -m

This is the result of memory usage on the Israel server in the AArch64 systems:

x86_64(Portugal)

My prediction

My prediction of performance for these programs is that vol3.c is the fastest and vol2.c is the slowest.

Build & test each program

In the x86_64 system, we only got vol0.c through vol3.c because vol4.c and vol5.c are only for the AArch64 system.
This is the initial build with 16 samples:

Compared to the previous initial build, the result is the same as the result that we got from the AArch64 system.

Benchmarking

For the programs on the x86_64 system, I chose to use the same timing code that we used to test the execution time on the AArch64 system. Feel free to copy the previous code to test on the Portugal server.

First of all, let's test the vol0 which uses the basic algorithm.
Results:
For a number of samples: 1000000000
vol0.c(Using the basic algorithm)

Iteration	Duration(millisecond)	result
1	1689.119019	289
2	1656.987061	289
3	1658.895996	289
4	1681.101074	289
5	1662.795044	289

The average execution time of vol0.c is 8,348.898194 / 5 = 1,669.7796388 milliseconds.

vol1.c

Iteration	Duration(millisecond)	result
1	1657.718994	192
2	1659.477051	192
3	1655.046021	192
4	1662.978027	192
5	1654.200928	192

The average execution time of vol1.c is 8,289.421021 / 5 = 1,657.8842042 milliseconds.

vol2.c

Iteration	Duration(millisecond)	result
1	2163.916992	289
2	2120.509033	289
3	2119.781006	289
4	2122.691895	289
5	2131.298096	289

The average execution time of vol2.c is 10,658.197022 / 5 = 2,131.6394044 milliseconds which is considerably slower than other programs in x86_64 systems.

vol3.c

Iteration	Duration(millisecond)	result
1	1311.255981	0
2	1292.249023	0
3	1297.120972	0
4	1300.192017	0
5	1299.275024	0

The average execution time of vol3.c is 6,500.093017 / 5 = 1,300.0186034 milliseconds. It is the fastest program.

Relative Memory Usage

Same as before, use

free -m

command to check the memory usage on the Portugal server.

Questions

In this part, I will answer all of the questions which are marked with Q in all the source files.
Q1:

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

We need this part because it walks through all of the SAMPLES defined in vol.h and accumulates every value of the out[] array into ttl, which is then printed as the result. Because ttl depends on every scaled sample, the compiler also cannot throw away the scaling loop as unused work.

Q2:

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

We need this part because it prints ttl, the sum accumulated from the out[] array, so that we can compare the results of the different programs.

Q3:

// Q: What's the point of this dummy program? How does it help
// with benchmarking?

The dummy program does not scale the volume. It is used to measure the overhead of the processing that the other programs perform besides scaling, so that this overhead can be taken into account when benchmarking.

Q4:

// Q: What is the purpose of the cast to unint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);

The cast to uint16_t explicitly fixes the number of bits and ensures that the index is an unsigned 16-bit integer. Since x also covers negative sample values, the cast maps every sample to a valid array index between 0 and 65535.

Q5:

 // Q: should we use 32767 or 32768 in the next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

We should use 32767 because it is the maximum value of a 16-bit signed integer: samples range from the minimum value of a 16-bit signed integer (-32768) to the maximum value (32767). With 32768, a volume of 100% would not fit into an int16_t.
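A quick way to see the limit is to check whether the scaled volume factor still fits into an int16_t. The helper name volume_factor_fits is hypothetical and only serves this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if (percent / 100) * scale can be stored in an int16_t. */
int volume_factor_fits(double percent, double scale)
{
    assert(percent >= 0.0 && percent <= 100.0);
    double value = percent / 100.0 * scale;
    return value >= INT16_MIN && value <= INT16_MAX;
}
```

At 100% volume the factor fits with a scale of 32767 but not with 32768, while at 50% both scales would fit.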

Q6:

//Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;

The first line points the input cursor at the start of the in[] array, and the second line points the output cursor at the start of the out[] array.

Q7:

 // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

To duplicate a value means to copy the same value into every element of a vector register. The value to duplicate is vol_int, passed in through %w0, and it is copied into each of the eight 16-bit lanes of v1 (v1.8h).

Q8:

// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"

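These three lines are the operand lists of the inline assembly statement. The first line lists the output operands: in_cursor and out_cursor are kept in registers and are both read and written ("+r"). The second line lists the input operands, again in_cursor and out_cursor in registers. The third line is the clobber list: "memory" tells the compiler that the assembly code reads and writes memory, so it must not keep memory values cached in registers across it.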

Q9:

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

The results are only usable on an AArch64 system, because this program only works on AArch64. On an AArch64 system it works properly and the results are correct.

Q10:

 // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

As in Q9, the results are only usable on an AArch64 system, because vol5.c only works on AArch64. On an AArch64 system it works properly.

Conclusion

Click here to check the full list of comparisons.
In this assignment, I predicted that vol5.c would be the fastest of all the programs because it uses SIMD, with the acceleration coming through the compiler. On x86_64, vol3.c is the fastest program. My prediction was correct. Based on what we discovered, the x86_64 system executes the same program with the same number of samples twice as fast as the AArch64 system. Looking at the differences between the programs themselves, a different algorithm can make a huge difference in performance and save a lot of time, even for a small program.

Week10 Lab 5: Assembler Lab

Qzhang125 — Thu, 25 Nov 2021 04:48:42 +0000

Welcome back to the week 10 SPO600 (Software Portability and Optimization) blog. This week we work on lab 5. In this lab, we write a simple program with a loop that runs 31 times and displays the numbers from 0 to 30 on AArch64 and x86_64 systems.

x86_64 system

.text
.globl  _start


_start:

        mov     $0, %r15                        /* Loop counter */
        mov     $0x30, %r12                     /* value of 0 in Ascii */

loop:
        mov     $0, %rdx                        /* clear remainder for division */
        mov     %r15, %rax                      /* set rax to the dividend */
        mov     $10, %r10                       /* set divisor */
        div     %r10                            /* divide */
        mov     %rax, %r14                      /* store quotient */
        mov     %rdx, %r13                      /* store remainder */

        add     $0x30, %r14                     /* quotient to ascii */
        add     $0x30, %r13                     /* remainder to ascii */
        mov     %r13b, msg+7                    /* Modify 1 byte in msg with remainder */

        cmp     %r12, %r14                      /* compare quotient with ASCII '0' (result is not used) */
        mov     %r14b, msg+6                    /* Modify 1 byte in msg with quotient */

        mov     $len, %rdx                      /* message length */
        mov     $msg, %rsi                      /* message location */
        mov     $1, %rdi                                /* file descriptor stdout */
        mov     $1, %rax                                /* syscall sys_write */
        syscall

        inc     %r15                            /* increment counter */
        cmp     $31, %r15                               /* see if we're done */
        jne     loop                            /* if not, loop */

        mov     $0, %rdi                                /* exit status */
        mov     $60, %rax                       /* syscall sys_exit */
        syscall

.section .data

        msg:    .ascii   "Loop:   \n"
        len = . - msg

AArch64 system

.text
.globl _start
_start:

        mov     x4, 0           /* loop counter */
        mov     w10, 0x30       /* Value of 0 in ascii */

loop:
        add     w24, w4, 0x30   /* Converting iterator to ascii  */
        mov     x11, 10         /* Using 10 as the divisor  */
        udiv    x12, x4, x11    /* Getting the quotient  */
        msub    x13, x11, x12, x4       /* Getting the remainder  */

        add     w14, w12, 0x30  /* Ascii conversion  */
        add     w15, w13, 0x30  /* Ascii conversion  */

        adr     x16, msg        /* Loading the address of msg  */
        strb    w15, [x16, 7]   /* Storing remainder into msg at byte 7  */
        cmp     w14, w10        /* Is the quotient 0? (result is not used)  */
        strb    w14, [x16, 6]   /* Storing quotient in msg at byte 6  */

        mov     x0, 1           /* file descriptor  */
        adr     x1, msg         /* message location (memory address) */
        mov     x2, len         /* message length (bytes) */

        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        add     x4, x4, 1
        cmp     x4, 31          /* Checks if the iterator equals 31  */
        b.ne    loop

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop:    \n"
len=    . - msg

The result is the same on both systems:

Reflection

This is the first lab in which we actually program for x86_64 and AArch64 systems. It was considerably hard because my group and I were completely unfamiliar with them and had no idea how to program for them. However, it is fun to see the differences between 6502 assembly language and the x86_64 and AArch64 assembly languages. They are quite different, but since they are all assembly languages, it is not hard to find similarities in coding style and programming logic. I will definitely explore more of this in the future.

SPO600 Week 9 reflection

Qzhang125 — Wed, 24 Nov 2021 05:08:09 +0000

Hello, welcome to the week 9 SPO600 (Software Portability and Optimization) blog. This week we will discuss benchmarking and profiling.

Test a C program

In this part we will do a very simple profiling test. We have 2 C programs:
The first is the hello.c file

#include <stdio.h>

int main() {
        printf("Hello World!\n");
}

The second is hello2.c file

#include <unistd.h>

int main() {
        write(1,"Hello World!\n",13);
}

We could use a simple command

time make

to measure how much time the command spends running.
Let's look at the result below:

As we can see above, the time is divided into 3 parts: real, user, and sys. The real time corresponds to the wall clock time; in this test it is 0.372 seconds. The user time is how long the program was executing on my behalf, directly and with my normal permissions; in this test it is 0.276 seconds. The system time (sys) is the amount of time that the kernel spent doing work on behalf of this program; in this test it is 0.092 seconds. If we add the user time and system time together (0.276 + 0.092 = 0.368 seconds), the result should be very close to the real time of this test.

Fun things to know

The command

less /proc/cpuinfo

displays your CPU information.

Reflection

This week we worked on benchmarking and profiling. Benchmarking and profiling are super useful in real programming life. They give a clear view of how much time a program consumes, which helps programmers pick the best solution out of many ideas.

SPO600 Week 8 reflection

Qzhang125 — Wed, 24 Nov 2021 05:07:40 +0000

Hello my friends, welcome to the week 8 SPO600 (Software Portability and Optimization) blog. In this blog, let's talk about the registers in the x86_64 architecture.

Introduction

Remember that the x86_64 processor family is based on a long series of processors that date back to the 1970s. Originally these processors had a small number of registers, much like the X and Y registers of the 6502. The main registers in the x86 family were the A, B, C, and D registers. Through the years the register set has been extended: we still have the A, B, C, and D registers, but now they are called RAX, RBX, RCX, and RDX, the extended versions of A, B, C, and D, and they are 64 bits wide. Beyond these, we also have RBP, the base pointer, which marks the start of the current stack frame, and RSP, which holds the stack pointer. RSI and RDI are typically used for copy operations: RSI holds the source index, which points to the data you want to copy, and RDI holds the destination index, which points to where you want to place the copy. x86_64 also added another 8 registers (R8-R15), since some of the original x86 registers have special implicit meanings and are not used as general-purpose registers.

Passing Argument

In this blog, I only focus on integer arguments. For example, when 7 integer arguments are passed to a function on Linux, the first six are passed in registers: the first integer is placed in RDI, the second in RSI, the third in RDX, the fourth in RCX, and the fifth and sixth in R8 and R9. Only the 7th integer is passed on the stack.

Reflection

This is the second week that we have worked with the x86_64 and AArch64 architectures. In this blog we only discussed the registers in x86_64, but in the future I will also blog more about AArch64. This week I learned that modern processors are super fast because they often "guess" the outcome of an operation and store the results of the "guess" in the registers that we talked about above.

SPO600 String lab option 2: Color Selector

Qzhang125 — Wed, 24 Nov 2021 05:05:23 +0000

Introduction

Hello my friend, welcome back to the second part of the string lab. In this lab, we decided to create 2 programs in 6502 assembly language and split them into part 1 and part 2. In the first part, we created an adding calculator, and what you are reading now is the second part. In this part, we decided to create a color selector. This program lets the user choose a color from the list on the screen with a keystroke, and then paints the chosen color on the screen.

Steps

Create a ROM to print the color string on the screen.
Create the up key and down key then assign the functionality. It will allow users to select the color string on the screen.
Print the title of the color string.
Print the 16 color strings (colors 0 to 15) using a loop.
Use a loop to highlight the name string that the user has selected.
Use draw_screen loop to change the color of the pixels.

Code

define SCINIT  $ff81 ; initialize/clear screen
define CHROUT  $ffd2 ; output character to screen

define COLOUR  $10
define COLOUR_INDEX $11
define POINTER  $40
define POINTER_H $41
define UP_KEY  $80
define DOWN_KEY $82

 lda #$00 
 sta COLOUR
 sta COLOUR_INDEX 

 jsr initializePrint

getKey:
 lda $ff
 sty $ff

 cmp #UP_KEY
 beq decrementKey

 cmp #DOWN_KEY 
 beq incrementKey

 jmp getKey

decrementKey:
 lda COLOUR
 cmp #$01
 bpl decrementColour

 jmp getKey

decrementColour:
 dec COLOUR

 jsr initializePrint
 jsr initializePaint
 jmp getKey

incrementKey:
 lda COLOUR
 cmp #$0f
 bmi incrementColour

 jmp getKey

incrementColour:
 inc COLOUR
 jsr initializePrint
 jsr initializePaint
 jmp getKey

initializePrint:
 jsr SCINIT
        ldy #$00

writeTitle:
 lda title,y
        beq titleDone
        jsr CHROUT
        iny
        bne writeTitle

titleDone:
 lda #$00
 sta COLOUR_INDEX

startColour:
 ora #$00
 ldy #$00

colourName:
 jsr selectedColour
 beq afterWriting
 jsr highlightLine

 jsr CHROUT

 iny
 bne colourName

afterWriting:
 inc COLOUR_INDEX
 lda COLOUR_INDEX
 cmp #$10
 bne startColour

selectedColour:
 lda COLOUR_INDEX

 cmp #$00
 beq printColour0

 cmp #$01
 beq printColour1

 cmp #$02
 beq printColour2

 cmp #$03
 beq printColour3

 cmp #$04
 beq printColour4

 cmp #$05
 beq printColour5

 cmp #$06
 beq printColour6

 cmp #$07
 beq printColour7

 cmp #$08
 beq printColour8

 cmp #$09
 beq printColour9

 cmp #$0a
 beq printColour10

 cmp #$0b
 beq printColour11

 cmp #$0c
 beq printColour12

 cmp #$0d
 beq printColour13

 cmp #$0e
 beq printColour14

 cmp #$0f
 beq printColour15

 rts

printColour0:
 lda colour0,y
 rts

printColour1:
 lda colour1,y
 rts

printColour2:
 lda colour2,y
 rts

printColour3:
 lda colour3,y
 rts

printColour4:
 lda colour4,y
 rts

printColour5:
 lda colour5,y
 rts

printColour6:
 lda colour6,y
 rts

printColour7:
 lda colour7,y
 rts

printColour8:
 lda colour8,y
 rts

printColour9:
 lda colour9,y
 rts

printColour10:
 lda colour10,y
 rts

printColour11:
 lda colour11,y
 rts

printColour12:
 lda colour12,y
 rts

printColour13:
 lda colour13,y
 rts

printColour14:
 lda colour14,y
 rts

printColour15: 
 lda colour15,y
 rts

highlightLine:
 ldx COLOUR_INDEX
 cpx COLOUR
 beq highlight

 ora #$00
 rts

highlight:
 ora #$80
 rts

initializePaint:
 lda #$00         ; set a pointer at $40 to point to $0200
        sta POINTER
        lda #$02
        sta POINTER_H

 ldy #$00

 lda COLOUR

draw_screen:
  sta ($40), y     ; set pixel

        iny              ; increment index
        bne draw_screen  ; continue until done the page

        inc $41          ; increment the page
        ldx $41          ; get the page
        cpx #$06         ; compare with 6
        bne draw_screen  ; continue until done all pages

 rts

title:
dcb "L","i","s","t",32,"o","f",32,"C","o","l","o","u","r","s",":",13
dcb 00

colour0:
dcb "B","l","a","c","k",13
dcb 00

colour1:
dcb "W","h","i","t","e",13
dcb 00

colour2:
dcb "R","e","d",13
dcb 00

colour3:
dcb "C","y","a","n",13
dcb 00

colour4:
dcb "P","u","r","p","l","e",13
dcb 00

colour5:
dcb "G","r","e","e","n",13
dcb 00

colour6:
dcb "B","l","u","e",13
dcb 00

colour7:
dcb "Y","e","l","l","o","w",13
dcb 00

colour8:
dcb "O","r","a","n","g","e",13
dcb 00

colour9:
dcb "B","r","o","w","n",13
dcb 00

colour10:
dcb "L","i","g","h","t",32,"r","e","d",13
dcb 00

colour11:
dcb "D","a","r","k",32,"g","r","e","y",13
dcb 00

colour12:
dcb "G","r","e","y",13
dcb 00

colour13:
dcb "L","i","g","h","t",32,"g","r","e","e","n",13
dcb 00

colour14:
dcb "L","i","g","h","t",32,"b","l","u","e",13
dcb 00

colour15:
dcb "L","i","g","h","t",32,"g","r","e","y",13
dcb 00

The sample results:

Reflection

In lab 4 part 2, we created a color selector using assembly language. It was a new challenge for us. The most tedious part was using different loops to print the color strings and highlight the user's selection. Afterwards, we also had to use a loop to change the color of the pixels. Assembly language is hard for me as a beginner to learn and work with across 4 different labs. I cannot say I'm a master of it, because I'm not a hundred percent familiar with the syntax and the coding logic the way I am with higher-level programming languages. But I'm glad to have learned and worked on it for almost the whole semester. It showed me a view of low-level programming that I had never touched before, and it also helps me understand higher-level programming languages more deeply from the storage and RAM point of view.

SPO600 Lab4 String lab Option 1: Adding Calculator

Qzhang125 — Wed, 24 Nov 2021 02:53:04 +0000

Hello, welcome to lab 4 for SPO600 (Software Portability and Optimization). This is a lab for the week 4 materials. My team chose to work on option 1 and option 4.

Introduction

This blog is for option 1: we created an adding calculator using 6502 assembly language. It uses a subroutine that lets the user enter 2 numbers, then adds them together.

Requirements

Create a subroutine which enables the user to enter two numbers of up to two digits. Indicate where the cursor is, and allow the user to use the digit keys (0-9), backspace, and enter keys. Return the user's input value in the accumulator (A) register.
Using this subroutine, write a program which adds the two numbers (each of which is in the range 0-99) and prints the result.

Coding

define  SCINIT  $ff81 ; initialize/clear screen
define  CHRIN  $ffcf ; input character from keyboard
define  CHROUT  $ffd2 ; output character to screen
define  SCREEN  $ffed ; get screen size
define  PLOT    $fff0 ; get/set cursor coordinates

define RIGHT  $81
define LEFT  $83
define ENTER  $0d
define BACKSPACE $08

define NUM1    $15;
define  NUM2    $16;

jsr SCINIT

mainLoop:
ldy #$00
jsr firstNumPrint ; ask for input for first number
jsr getNum; get the first number
jsr storeFirstNum  ; then store the first number
ldy #$00
jsr secondNumPrint  ; ask for input for second number
jsr getNum; get the second number
jsr storeSecondNum  ; store the second number
ldy #$00
jsr resultPrintString  ; print a string 'Result'
jsr printResult ; print the result
jmp mainLoop  ; go back to the first step

getNum:
     sec
     jsr PLOT
     ldx #$15
     clc
     jsr PLOT

getNumLoop:
     sec
     jsr PLOT
     jsr CHRIN


charCheck: 
     cmp #BACKSPACE ; if user enter backspace, it erase the #$15 digit
     beq move_back

     cmp #RIGHT ; if user enter right arrow, it goes to the first digit
     beq move_right

     cmp #LEFT ; if user enter left arrow, it goes to the second digit
     beq move_left

     cmp #ENTER ; if user enter enter, it goes to the next process
     beq move

printNum:
     cmp #$30
     bcc getNumLoop

     clc
     cmp #$3a
     bcs getNumLoop

     jsr CHROUT

     sec
     jsr PLOT
     cpx #$17
     bne getNumLoop
     dex
     clc
     jsr PLOT
     jmp getNumLoop

move_back:
 cpx #$15
 beq getNumLoop
 jsr CHROUT
 jmp getNumLoop

move_left: 
     cpx #$15 ; first digit
     beq getNumLoop
     jsr CHROUT
     jmp getNumLoop

move_right: 
 cpx #$16 ; second digit
     beq getNumLoop
     jsr CHROUT
     jmp getNumLoop

move:
     sec
     jsr PLOT
     ldx #$15 ; first digit
     clc
     jsr PLOT
     sec
     jsr PLOT

     clc
     sbc #$2F ; carry is clear, so this subtracts $30 (ASCII digit to value)

     asl
     asl
     asl
     asl

     pha

     ldx #$16
     clc
     jsr PLOT
     sec
     jsr PLOT

     clc
     sbc #$2F ; carry is clear, so this subtracts $30 (ASCII digit to value)
     pha

     ldx #$00
     iny
     clc
     jsr PLOT
     sec
     jsr PLOT

     pla
     tax
     pla

     rts

storeFirstNum:
     sta NUM1
     txa
     eor NUM1
     sta NUM1
     rts

storeSecondNum:
     sta NUM2
     txa
     eor NUM2
     sta NUM2
     rts

printResult:
     sec
     jsr PLOT
     ldx #$15
     clc
     jsr PLOT
     sec
     jsr PLOT

     sed
     lda NUM1
     adc NUM2
     cld
     pha

     bcc outputAddition
     ldx #$14
     clc
     jsr PLOT
     sec
     jsr PLOT
     lda #$31
     jsr CHROUT

outputAddition:
     lsr
     lsr
     lsr
     lsr
     clc
     adc #$30 ; add $30 to convert the digit to its ASCII code
     jsr CHROUT

     pla
     and #$0F
     clc
     adc #$30 ; add $30 to convert the digit to its ASCII code
     jsr CHROUT

     sec
     jsr PLOT
     ldx #$00
     iny
     clc
     jsr PLOT

     rts

firstNumPrint:
     lda firstNum,y
     beq goback_main
     jsr CHROUT
     iny
     bne firstNumPrint

secondNumPrint:
 lda secondNum,y
        beq goback_main
        jsr CHROUT
        iny
        bne secondNumPrint

resultPrintString:
 lda result,y
        beq goback_main
        jsr CHROUT
        iny
        bne resultPrintString

goback_main:
     rts

firstNum:
dcb "E","N","T","E","R",32,"F","I","R","S","T",32,"N","U","M","B","E","R",":",32,32,"0","0"
dcb 00


secondNum:
dcb "E","N","T","E","R",32,"S","E","C","O","N","D",32,"N","U","M","B","E","R",":",32,"0","0"
dcb 00

result:
dcb "R","E","S","U","L","T",":"
dcb 00

These are the results:

Thoughts & reflection

The assembly language programs are getting harder than before, and this one was even harder than the third lab. In this lab, I spent a lot of time on how to get the user input, store the two numbers and the result, and display the result on the screen. As I said in each blog about 6502 assembly language, it is a very low-level programming language and it frequently accesses memory and registers. The operation that we designed for this program is quite easy in any higher-level language, but in assembly language I'm still not familiar with the syntax. I hope more practice will help me solve problems like this more easily.

SPO600 Lab3 - Math Lab Update Kaleidoscope

Qzhang125 — Tue, 23 Nov 2021 23:06:12 +0000

Hello my friend, good to see you here again. Today I'm going to update the lab 3 that I started a couple of weeks ago. Unfortunately, my group mates and I were unable to finish the Pong game in 6502 assembly language. Our problem was that we were not sure how to make the ball bounce off a wall. So we decided to start a new project. If you are still interested in the Pong game and want to explore it further, click here. The new project is Kaleidoscope.

For this project we needed to create a program that draws the same shape in each quarter of the screen when the arrow keys are pressed. This is the code:

define POINTER     $10      ;ptr: start of row
 define POINTER_H   $11 
define ROW     $20      ;current row
 define COL     $21         ; current column
 define DOT     $01         ; dot colour location
 define CURSOR      $05     ; green colour
setup: lda #$0f         ; initial ROW,COL
    sta ROW
    sta COL
    LDA #$01
    STA DOT
 draw:  jsr draw_cursor
 getkey:    lda $ff         ; get the keystroke
    ldx #$00            ; clear the key buffer
    stx $ff
    cmp #$30
    bmi getkey
    cmp #$40
    bpl continue
    SEC
    sbc #$30
    tay
    lda color_pallete, y
    sta DOT
    jmp done

continue:   cmp #$43    
    beq clear
    cmp #$63
    beq clear
    cmp #$80    
    bmi getkey
    cmp #$84
    bpl getkey
    pha    
    lda DOT
    sta (POINTER),y
    jsr draw_on_quads
    pla
    cmp #$80
    bne check1
    dec ROW  
    jmp done

 check1:    cmp #$81    ; right key
    bne check2
    inc COL     ; increment COL
    jmp done

 check2:    cmp #$82    ; down key
    bne check3
    inc ROW     ; increment ROW
    jmp done

 check3:    cmp #$83    ; left key
    bne done
    dec COL     ; decrement COL
    clc
    bcc done

 clear: lda table_low   ; clear screen
    sta POINTER
    lda table_high
    sta POINTER_H
    ldy #$00
    tya

 c_loop:    sta (POINTER),y
    iny
    bne c_loop
    inc POINTER_H
    ldx POINTER_H
    cpx #$06
    bne c_loop

 done:  clc   
    bcc draw

 draw_cursor:
    lda ROW     
    and #$0f
    sta ROW
    lda COL     
    and #$0f
    sta COL
    ldy ROW     
    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H
    ldy COL     
    lda #CURSOR
    sta (POINTER),y
    rts

 draw_on_quads:     
    LDA POINTER 
    PHA         
    LDA POINTER_H
    PHA
    LDA #$10
    CLC
    SBC COL
    CLC
    ADC #$10
    TAY
    LDA DOT
    STA (POINTER),y
    TYA
    PHA 
    lda #$10    
    CLC
    SBC ROW
    CLC
    ADC #$10
    TAY
    lda table_low,y
    sta POINTER
    lda table_high,y
    sta POINTER_H
    ldy COL     
    lda DOT
    sta (POINTER),y
    PLA
    TAY
    lda DOT
    sta (POINTER),y
    PLA
    STA POINTER_H
    PLA
    STA POINTER
    RTS
 table_high:
 dcb $02,$02,$02,$02,$02,$02,$02,$02
 dcb $03,$03,$03,$03,$03,$03,$03,$03
 dcb $04,$04,$04,$04,$04,$04,$04,$04
 dcb $05,$05,$05,$05,$05,$05,$05,$05
table_low:
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
 dcb $00,$20,$40,$60,$80,$a0,$c0,$e0
color_pallete:
dcb $01,$02,$03,$04,$05,$06,$07,$08,$09,$0a

This is a sample of the result:

Reflection

In this lab, we worked on a fully functional program that receives the user's input and draws a picture. After finishing this lab, I know how to work with subroutines and I understand the syntax and logic of assembly language more deeply. Assembly language is a low-level programming language that frequently and explicitly accesses memory and registers. I hope this practice will make working with 6502 assembly language easier for me in the future.

SPO 600 Week 7 Reflection

Qzhang125 — Wed, 27 Oct 2021 18:47:05 +0000

Hello my friend, welcome to the week 7 SPO600 (Software Portability and Optimization) blog. In this blog we will discuss compiler optimization.

Introduction

Compiler optimizations are alterations made by a compiler that keep the same result as the original code while achieving better performance. This usually means reducing execution time and reducing code size. In other words, compiler optimization tries to make the code as efficient as it can be.

Performance

Let's look at an example.

#include <iostream>

// Checks size on every iteration, even though it never changes.
void foo(int size) {
    for (int i = 0; i < 5; i++) {
        if (size == 10) {
            std::cout << "Size is 10" << std::endl;
        }
        else {
            std::cout << "Keep going" << std::endl;
        }
    }
}

// Checks size once, then runs the matching loop.
void betterFoo(int size) {
    if (size == 10) {
        for (int i = 0; i < 5; i++) {
            std::cout << "Size is 10" << std::endl;
        }
    }
    else {
        for (int i = 0; i < 5; i++) {
            std::cout << "Keep going" << std::endl;
        }
    }
}

int main() {
    std::cout << "Foo" << std::endl;
    foo(10);
    std::cout << "better foo" << std::endl;
    betterFoo(10);
    return 0;
}

The result:

In this C++ program, I created 2 functions. The first function, foo, prints either the "Size is 10" or the "Keep going" message 5 times. The second function, betterFoo, does the same thing. The difference is where size is checked: the value of size never changes, yet foo checks it on each of the 5 iterations of the loop. In betterFoo, size is checked only once, before the loop starts.

Reflection

This week, we worked on many different compiler optimizations. Each of them looks tiny, but together they make our programs faster and more time-efficient, because the algorithm is super important for a program. It controls the sequence of execution between methods and functions, and every tiny improvement makes the final solution a bit better.