Posted on Mar 5

SPO600 Lab 5 - Analyzing and Experimenting with Assembly

#spo600 #aarch64 #x86 #assembly

Introduction

x86 and aarch64 have a richer set of instructions than the 6502, so both are far more capable with complex operations. Each architecture was designed with different goals: x86 prioritizes raw performance, aarch64 prioritizes power efficiency, and the 6502 is simple and cost-efficient (relative to today’s standards).

Let’s get introduced to the x86 and aarch64 instruction sets by analyzing a simple “Hello World” program, then experimenting with loops.

Preparing the Lab Environment

The source code was obtained by using tar:

tar xvf /public/spo600-assembler-lab-examples.tgz

This will extract the contents into the current directory - I decided to extract it into my labs/l5 directory.

The source code directory follows this structure (Source):

 spo600
 └── examples
     └── hello                     # "hello world" example programs
         ├── assembler
         │   ├── aarch64           # aarch64 gas assembly language version
         │   │   ├── hello.s
         │   │   └── Makefile
         │   ├── Makefile
         │   └── x86_64            # x86_64 assembly language versions
         │       ├── hello-gas.s   # ... gas syntax
         │       ├── hello-nasm.s  # ... nasm syntax
         │       └── Makefile
         └── c                     # Portable C versions
             ├── hello2.c          # ... using write()
             ├── hello3.c          # ... using syscall()
             ├── hello.c           # ... using printf()
             └── Makefile

`aarch64`

Navigate to spo600/examples/hello/assembler/aarch64 and run make, which will assemble and link the source to create an executable named hello.

Next, compare the source code in hello.s with the disassembled output from objdump -d hello.o:

[japablo@aarch64-002 aarch64]$ cat hello.s
.text
.globl _start
_start:

    mov     x0, 1           /* file descriptor: 1 is stdout */
    adr     x1, msg     /* message location (memory address) */
    mov     x2, len     /* message length (bytes) */

    mov     x8, 64      /* write is syscall #64 */
    svc     0           /* invoke syscall */

    mov     x0, 0       /* status -> 0 */
    mov     x8, 93      /* exit is syscall #93 */
    svc     0           /* invoke syscall */

.data
msg:    .ascii      "Hello, world!\n"
len=    . - msg
[japablo@aarch64-002 aarch64]$ objdump -d hello.o

hello.o:     file format elf64-littleaarch64

Disassembly of section .text:

0000000000000000 <_start>:
   0:   d2800020    mov x0, #0x1                    // #1
   4:   10000001    adr x1, 0 <_start>
   8:   d28001c2    mov x2, #0xe                    // #14
   c:   d2800808    mov x8, #0x40                   // #64
  10:   d4000001    svc #0x0
  14:   d2800000    mov x0, #0x0                    // #0
  18:   d2800ba8    mov x8, #0x5d                   // #93
  1c:   d4000001    svc #0x0

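The disassembled output is the assembled machine code translated back into readable instructions. Both versions have 8 instructions. Each line of the disassembly starts with the offset of the instruction, followed by its 32-bit encoding in hexadecimal, followed by the corresponding human-readable instruction. The adr x1 line shows 0 <_start> instead of msg because hello.o has not been linked yet, so the final address of msg is not filled in.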
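To make the link between the hex encoding and the readable instruction concrete, here is a small Python sketch that decodes the four mov lines from the listing. It assumes the standard A64 MOVZ (move wide immediate) field layout; the function name decode_movz is hypothetical and only covers this one instruction form.

```python
# Decode the 64-bit MOVZ words from the objdump listing above.
# Assumed field layout: bits 31-23 opcode, 22-21 shift selector (hw),
# bits 20-5 16-bit immediate, bits 4-0 destination register.
MOVZ_64_OPCODE = 0b110100101


def decode_movz(word):
    """Return (register number, immediate) for a 64-bit MOVZ word."""
    if (word >> 23) & 0x1FF != MOVZ_64_OPCODE:
        raise ValueError(f"{word:#010x} is not a 64-bit MOVZ")
    shift = ((word >> 21) & 0x3) * 16   # hw selects a 16-bit lane
    imm16 = (word >> 5) & 0xFFFF
    rd = word & 0x1F
    return rd, imm16 << shift


for word in (0xD2800020, 0xD28001C2, 0xD2800808, 0xD2800BA8):
    rd, imm = decode_movz(word)
    print(f"{word:08x}: mov x{rd}, #{imm}")
```

The decoded registers and immediates line up with the mov lines in the disassembly, and every word is exactly 4 bytes, which is why the offsets advance by 4.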

Loops

The first step is to produce a loop that prints “Loop:” six times (iterations 0 through 5).

The high level instructions are as follows:

create a string variable Loop:
define min and max values, which set the number of iterations
initialize a variable to track the loop iteration
print the string variable Loop:
print a newline
increment the loop counter
check whether the counter has reached the max value, and repeat if not

    .section .data
msg:    .ascii "Loop: "        // message prefix (6 bytes)
digit:  .byte '0'              // placeholder for the loop index (initialized to '0')
newline:.ascii "\n"            // newline character (1 byte)

    .section .text
    .globl _start

    min = 0                    
    max = 6                    

_start:
    mov     x19, min           // init loop counter (x19 = 0)

loop:
    // Write "Loop: "
    mov     x0, 1              // file descriptor: stdout (1)
    ldr     x1, =msg           // load address of "Loop: "
    mov     x2, 6              // length: 6 bytes ("Loop: ")
    mov     x8, 64             // syscall: write (64)
    svc     0                  // invoke syscall

    // Print newline character
    mov     x0, 1             
    ldr     x1, =newline       // load address of newline character
    mov     x2, 1              // length: 1 byte
    mov     x8, 64             // syscall: write (64)
    svc     0                  // invoke syscall

    // Increment loop counter and check condition
    add     x19, x19, 1        // x19 = x19 + 1
    cmp     x19, max           // compare x19 to max (6)
    b.ne    loop               // if not equal, continue loop

    // Exit program
    mov     x0, 0              // exit status: 0
    mov     x8, 93             // syscall: exit (93)
    svc     0                  // invoke syscall

Loop:
Loop:
Loop:
Loop:
Loop:
Loop:

The instructions function similarly to the 6502: values are moved in and out of registers, and a register can be compared against another value. The 6502 has no plain add instruction, only adc (add with carry), whereas the aarch64 add sums two operands and stores the result in a destination register. Performing addition on the 6502 is more verbose: it requires managing the carry flag and moving values between memory and the accumulator.
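As a rough illustration of that difference, the Python sketch below models a 16-bit addition done the 6502 way (one byte at a time, passing the carry along) next to a single aarch64-style add. The function names are hypothetical, and the 6502 model assumes binary mode (decimal mode is not handled).

```python
def adc_6502(accumulator, operand, carry):
    """Model 6502 ADC in binary mode: A = A + M + C, 8 bits wide."""
    total = accumulator + operand + carry
    return total & 0xFF, int(total > 0xFF)   # (new accumulator, carry out)


def add_16bit_6502(a, b):
    """Add two 16-bit values one byte at a time, as 6502 code has to."""
    low, carry = adc_6502(a & 0xFF, b & 0xFF, 0)       # CLC, then ADC low bytes
    high, carry = adc_6502(a >> 8, b >> 8, carry)      # ADC high bytes plus carry
    return (high << 8) | low, carry


def add_aarch64(xn, imm):
    """Model `add xd, xn, imm`: one instruction with 64-bit wraparound."""
    return (xn + imm) & 0xFFFFFFFFFFFFFFFF


print(add_16bit_6502(0x00FF, 0x0001))   # carry moves from the low byte to the high byte
print(add_aarch64(5, 1))
```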

Loop with Iteration Number

Next is to print the iteration number next to Loop: (e.g. Loop: 1). Two pieces make this work: the digit placeholder in the data section and the block that converts the loop index to ASCII and prints it.

    .section .data
msg:    .ascii "Loop: "        
digit:  .byte '0'              // placeholder for the loop index (initialized to '0')
newline:.ascii "\n"            // newline character (1 byte)

    .section .text
    .globl _start

    min = 0                    // loop start index
    max = 6                    // loop exit condition (exclusive)

_start:
    mov     x19, min            // initialize loop counter (x19 = 0)

loop:
    // Write "Loop: "
    mov     x0, 1              // file descriptor: stdout (1)
    ldr     x1, =msg           // load address of "Loop: "
    mov     x2, 6              // length: 6 bytes ("Loop: ")
    mov     x8, 64             // syscall: write (64)
    svc     0                  // invoke syscall

    // Convert loop index to ASCII and print it
    mov     x0, 1              // file descriptor: stdout (1)
    ldr     x1, =digit         // load address of digit buffer
    add     w20, w19, '0'      // convert loop index to ASCII (add 0x30)
    strb    w20, [x1]          // store ASCII character in buffer
    mov     x2, 1              // length: 1 byte (single digit)
    mov     x8, 64             // syscall: write (64)
    svc     0                  // invoke syscall

    // Print newline character
    mov     x0, 1              
    ldr     x1, =newline       // load address of newline character
    mov     x2, 1              // length: 1 byte
    mov     x8, 64             // syscall: write (64)
    svc     0                  // invoke syscall

    // Increment loop counter and check condition
    add     x19, x19, 1        // x19 = x19 + 1
    cmp     x19, max           // compare x19 to max (6)
    b.ne    loop               // if not equal, continue loop

    // Exit program
    mov     x0, 0              // exit status: 0 (success)
    mov     x8, 93             // syscall: exit (93)
    svc     0                  // Invoke syscall

Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5

add converts the current iteration (in w19) into its ASCII character so it can be printed to the console. It does this by adding 0x30 (the ASCII code for '0', which acts as the offset from a digit's value to its character) and storing the result in register w20. The digit buffer is then updated with the value in w20 via strb w20, [x1].
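The offset can be checked quickly outside of assembly. This is a minimal Python sketch of the same conversion; digit_to_ascii is a hypothetical helper name, and the range check is an added assumption reflecting that the buffer holds a single character.

```python
ASCII_ZERO = 0x30   # same value as '0' in the assembly source


def digit_to_ascii(index):
    """Mirror `add w20, w19, '0'` for a single decimal digit (0-9)."""
    if not 0 <= index <= 9:
        raise ValueError("only one decimal digit fits in the 1-byte buffer")
    return index + ASCII_ZERO


for i in range(6):
    byte = digit_to_ascii(i)
    print(f"Loop: {chr(byte)}   (byte written: {byte:#04x})")
```

The range check hints at why this approach stops working once the counter reaches 10: the result would no longer be a digit character.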

Loop to 32

    .section .data
msg:    .ascii "Loop: "          
digits: .ascii "00"             // Buffer to hold 2-digit number
newline:.ascii "\n"             

    .section .text
    .globl _start

    min = 0                      
    max = 33                     

_start:
    mov     x19, min             

loop:
    // Write "Loop: "
    mov     x0, 1                // file descriptor: stdout (1)
    ldr     x1, =msg             // load address of "Loop: "
    mov     x2, 6                // length: 6 bytes
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Convert x19 to 2-digit ASCII
    mov     x20, x19             // copy loop counter

    // Divide by 10 to extract the tens digit
    mov     x21, 10
    udiv    x22, x20, x21        // x22 = quotient (tens place)
    msub    x23, x22, x21, x20   // x23 = remainder (ones place)

    add     w22, w22, '0'        // convert quotient to ASCII
    add     w23, w23, '0'        // convert remainder to ASCII

    ldr     x1, =digits          // load address of digits buffer
    strb    w22, [x1]            // store tens digit
    strb    w23, [x1, 1]         // store ones digit

    // Write the 2-digit number
    mov     x0, 1                
    mov     x2, 2                // length: 2 bytes
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Write newline character
    mov     x0, 1                
    ldr     x1, =newline         // load address of newline character
    mov     x2, 1                // length: 1 byte
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Increment loop counter and check condition
    add     x19, x19, 1          // x19 = x19 + 1
    cmp     x19, max             // compare x19 to max (33)
    b.ne    loop                 // if not equal, continue loop

    // Exit program
    mov     x0, 0                // exit status: 0 (success)
    mov     x8, 93               // syscall: exit (93)
    svc     0                    // invoke syscall

Loop: 00
Loop: 01
Loop: 02
Loop: 03
Loop: 04
Loop: 05
Loop: 06
Loop: 07
Loop: 08
Loop: 09
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30
Loop: 31
Loop: 32

The added udiv and msub instructions calculate the tens and ones digits, respectively.

udiv, unsigned division, takes our loop counter stored in x20, and 10 stored in x21, and stores the quotient (tens digit) in x22.

In the first iteration, when x20 (the loop counter) is 0, the operation:

udiv x22, x20, x21

Calculates:

x22 = 0 ÷ 10 → 0 (tens digit)

Since the quotient is 0, the tens digit remains 0.

Next, the msub instruction is used to calculate the remainder (ones digit):

msub x23, x22, x21, x20

This performs the following calculation:

x23 = x20 - (x22 * x21)
    = 0 - (0 * 10)
    = 0 (ones digit)

So, for the first iteration (x20 = 0), both x22 (tens) and x23 (ones) are 0.
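The same calculation can be traced for other iterations with a short Python model. This is only a sketch: the helpers udiv and msub are hypothetical stand-ins named after the instructions, the variable names follow the registers used above, and division by zero is not modeled.

```python
def udiv(xn, xm):
    """Model `udiv xd, xn, xm`: unsigned division, result truncated."""
    return xn // xm


def msub(xn, xm, xa):
    """Model `msub xd, xn, xm, xa`: xd = xa - (xn * xm)."""
    return xa - xn * xm


def two_digits(counter):
    """Return the two ASCII characters stored in the digits buffer."""
    x21 = 10
    x22 = udiv(counter, x21)         # tens digit
    x23 = msub(x22, x21, counter)    # ones digit
    return chr(x22 + ord("0")) + chr(x23 + ord("0"))


for n in (0, 9, 10, 27, 32):
    print(n, "->", two_digits(n))
```

For example, a counter of 27 gives a quotient of 2 and a remainder of 27 - (2 * 10) = 7.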

Loop to 32 Without Leading 0’s

    .section .data
msg:    .ascii "Loop: "          
digits: .ascii "00"             // buffer to hold 2-digit number
newline:.ascii "\n"             

    .section .text
    .globl _start

    min = 0                      
    max = 33                     

_start:
    mov     x19, min             // initialize loop counter (x19 = 0)

loop:
    // Write "Loop: "
    mov     x0, 1                // file descriptor: stdout (1)
    ldr     x1, =msg             // load address of "Loop: "
    mov     x2, 6                // length: 6 bytes
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Convert x19 to 2-digit ASCII
    mov     x20, x19             // copy loop counter

    // Divide by 10 to extract the tens digit
    mov     x21, 10
    udiv    x22, x20, x21        // x22 = quotient (tens place)
    msub    x23, x22, x21, x20   // x23 = remainder (ones place)

    add     w22, w22, '0'        // convert quotient to ASCII
    add     w23, w23, '0'        // convert remainder to ASCII

    ldr     x1, =digits          // load address of digits buffer

    // Handle leading zeros
    cmp     x19, 10              // check if the number is less than 10
    blt     print_ones_only      

    // Store both digits
    strb    w22, [x1]            // store tens digit
    strb    w23, [x1, 1]         // store ones digit

    mov     x0, 1                // file descriptor: stdout (1)
    mov     x2, 2                // length: 2 bytes
    b       write_digits

print_ones_only:
    mov     w22, ' '             // clear the tens place with a space
    strb    w22, [x1]            // store space in the tens place
    strb    w23, [x1, 1]         // store ones digit
    mov     x0, 1                // file descriptor: stdout (1)
    mov     x2, 2                // length: 2 bytes

write_digits:
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Write newline character
    mov     x0, 1                
    ldr     x1, =newline         // load address of newline character
    mov     x2, 1                // length: 1 byte
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Increment loop counter and check condition
    add     x19, x19, 1          // x19 = x19 + 1
    cmp     x19, max             // compare x19 to max (33)
    b.ne    loop                 // if not equal, continue loop

    // Exit program
    mov     x0, 0                // exit status: 0 (success)
    mov     x8, 93               // syscall: exit (93)
    svc     0                    // invoke syscall

Loop:  0
Loop:  1
Loop:  2
Loop:  3
Loop:  4
Loop:  5
Loop:  6
Loop:  7
Loop:  8
Loop:  9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30
Loop: 31
Loop: 32

If the iteration is greater than or equal to 10, the loop behaves as before. Otherwise, blt branches to print_ones_only, which overwrites the tens position in the digits buffer with a space and then stores the ones digit after it. Both paths write 2 bytes, so single-digit numbers are printed with a leading space instead of a leading zero.
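The branch can be modeled in a few lines of Python. This is a sketch of the logic only; format_counter is a hypothetical name, and the comments map each path back to the labels in the listing.

```python
def format_counter(counter):
    """Return the 2 characters the loop writes for a counter in 0..99."""
    tens, ones = divmod(counter, 10)
    ones_char = chr(ones + ord("0"))
    if counter < 10:                       # the blt print_ones_only path
        return " " + ones_char             # a space overwrites the tens slot
    return chr(tens + ord("0")) + ones_char   # fall-through path: both digits


for n in (0, 9, 10, 32):
    print(f"Loop: {format_counter(n)}")
```

Because both paths return two characters, the numbers stay right-aligned, which matches the extra space visible in the first ten lines of output.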

Looping to 32 with Hex Values

    .section .data
msg:    .ascii "Loop: "          
digits: .ascii "00"             
newline:.ascii "\n"             

    .section .text
    .globl _start

    min = 0                      
    max = 33                     

_start:
    mov     x19, min              

loop:
    // Write "Loop: "
    mov     x0, 1                // file descriptor: stdout (1)
    ldr     x1, =msg             // load address of "Loop: "
    mov     x2, 6                // length: 6 bytes
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Convert x19 to 2-digit hexadecimal
    mov     x20, x19             // copy loop counter

    // Get the hexadecimal value
    mov     x21, 16              // hexadecimal base (16)
    udiv    x22, x20, x21        // x22 = quotient (high hex digit)
    msub    x23, x22, x21, x20   // x23 = remainder (low hex digit)

    // Convert quotient and remainder to ASCII (hex digits)
    add     w22, w22, '0'        // add ASCII '0' (correct for 0-9; A-F adjusted below)
    add     w23, w23, '0'        // add ASCII '0' (correct for 0-9; A-F adjusted below)
    cmp     w22, '9'             // check if quotient is greater than '9' (A-F)
    ble     no_hex_conversion    // if less than or equal to '9', skip
    add     w22, w22, 7          // skip ahead to 'A' - 'F' (values 10-15)
no_hex_conversion:
    cmp     w23, '9'             // check if remainder is greater than '9' (A-F)
    ble     no_hex_conversion_2  // if less than or equal to '9', skip
    add     w23, w23, 7          // skip ahead to 'A' - 'F' (values 10-15)

no_hex_conversion_2:
    ldr     x1, =digits          // load address of digits buffer

    // Store both digits
    strb    w22, [x1]            // store quotient (high hex digit)
    strb    w23, [x1, 1]         // store remainder (low hex digit)

    mov     x0, 1                // file descriptor: stdout (1)
    mov     x2, 2                // length: 2 bytes
    b       write_digits

print_ones_only:
    mov     w22, ' '             // clear the tens place with a space
    strb    w22, [x1]            // store space in the tens place
    strb    w23, [x1, 1]         // store ones digit
    mov     x0, 1                // file descriptor: stdout (1)
    mov     x2, 2                // length: 2 bytes

write_digits:
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Write newline character
    mov     x0, 1                
    ldr     x1, =newline         // load address of newline character
    mov     x2, 1                // length: 1 byte
    mov     x8, 64               // syscall: write (64)
    svc     0                    // invoke syscall

    // Increment loop counter and check condition
    add     x19, x19, 1          // x19 = x19 + 1
    cmp     x19, max             
    b.ne    loop                 // if not equal, continue loop

    // Exit program
    mov     x0, 0                // exit status: 0 (success)
    mov     x8, 93               // syscall: exit (93)
    svc     0                    // invoke syscall

After the usual conversion to ASCII, the code determines whether the high and/or low digit needs an extra offset to display its hexadecimal equivalent. Specifically, cmp w22, '9' and ble no_hex_conversion check whether the high digit is at most '9'; if so, execution branches to no_hex_conversion, which performs the same check on the low digit.

The conversion to a hexadecimal letter adds an offset of 7 to the character, because the seven characters between '9' (0x39) and 'A' (0x41) must be skipped. This maps the values 10-15 to A-F.

Loop: 00
Loop: 01
Loop: 02
Loop: 03
Loop: 04
Loop: 05
Loop: 06
Loop: 07
Loop: 08
Loop: 09
Loop: 0A
Loop: 0B
Loop: 0C
Loop: 0D
Loop: 0E
Loop: 0F
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 1A
Loop: 1B
Loop: 1C
Loop: 1D
Loop: 1E
Loop: 1F
Loop: 20
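The two-step conversion (add '0', then add 7 when the result is past '9') can be checked against the output above with a short Python sketch. The function names are hypothetical; the comments point at the instructions each line stands in for.

```python
def hex_digit(value):
    """Mirror the add/cmp/ble/add sequence for one hex digit (0-15)."""
    char = value + ord("0")     # add w22, w22, '0'
    if char > ord("9"):         # cmp w22, '9' (ble skips the next step)
        char += 7               # jump over ':' through '@' to reach 'A'
    return chr(char)


def two_hex_digits(counter):
    """Split a counter into high and low hex digits and convert both."""
    high, low = divmod(counter, 16)
    return hex_digit(high) + hex_digit(low)


for n in (9, 10, 15, 16, 31, 32):
    print(f"Loop: {two_hex_digits(n)}")
```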

`x86`

The same approach was taken with x86 for each loop. Link to GitHub containing both x86 and aarch64 implementations.

Results

The history of each architecture plays a role in how the data is handled. x86 was first introduced in 1978 with the Intel 8086, a 16-bit processor (Source). x86-64 is backwards compatible, meaning that a 64-bit processor can run 32-bit programs, but not 16-bit programs directly. aarch64 is relatively new, first introduced in 2011 as part of the ARMv8 architecture (Source). As we have seen, it uses a RISC (Reduced Instruction Set Computer) design with a fixed 32-bit instruction size. While x86-64 prioritizes backward compatibility, aarch64 maintains a cleaner instruction set and may support 32-bit ARM code on some hardware, though some newer devices are 64-bit only.

These design differences can be seen in how the iteration number is converted into its hexadecimal value.

Extracting Tens and Ones

In the x86 version, the value is handled in nibbles (4 bits each): the high nibble holds the first hex digit (the "tens") and the low nibble holds the second (the "ones"). The instruction shr rax, 4 shifts the value right by 4 bits to move the high nibble into the lower 4 bits of the register. The instruction and al, 0x0F performs a bitwise AND to isolate the low nibble by masking out the upper 4 bits.

In the aarch64 version, the values are extracted with arithmetic operations: udiv for the quotient and msub for the remainder. In contrast, the x86 version uses bitwise operations, shr (shift right) and and (masking), to extract the same two digits by manipulating nibbles directly.
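Both techniques give the same two digits whenever the base is 16, which a quick Python check can confirm. This is a sketch with hypothetical function names, one modeled on each style described above.

```python
def split_with_division(value):
    """udiv/msub style: quotient and remainder by 16."""
    quotient = value // 16
    remainder = value - quotient * 16
    return quotient, remainder


def split_with_bits(value):
    """shr/and style: shift out the low nibble, then mask it."""
    return value >> 4, value & 0x0F


for n in (0, 10, 31, 32):
    print(n, split_with_division(n), split_with_bits(n))
```

The shortcut relies on 16 being a power of two; the decimal loops earlier still need a real division by 10.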

Conclusion

Each architecture produces the same result for each loop, but the approaches differ. The history and design choices of each architecture partly explain why. In the aarch64 version, udiv and msub make division and remainder short to express, whereas the x86 version is more verbose. The bit-level approach in the x86 version can still pay off: shifting and masking are generally cheaper than a division, which leaves room for optimization when the base is a power of two. Both techniques are available on either architecture, so part of the difference here comes down to implementation choice.
