Introduction
x86 and aarch64 have a richer set of instructions than the 6502, which makes them far more capable with complex operations. Each architecture was designed with different goals: x86 prioritizes raw performance, aarch64 prioritizes power efficiency, and the 6502 is simple and cost-efficient (relative to today’s standards).
Let’s ease into the x86 and aarch64 instruction sets by analyzing a simple “Hello World” program, then experimenting with loops.
Preparing the Lab Environment
The source code was obtained by extracting the lab archive with tar:
tar xvf /public/spo600-assembler-lab-examples.tgz
This extracts the contents into the current directory; I ran it from my labs/l5 directory.
The source code directory follows this structure (Source):
spo600
└── examples
    └── hello                     # "hello world" example programs
        ├── assembler
        │   ├── aarch64           # aarch64 gas assembly language version
        │   │   ├── hello.s
        │   │   └── Makefile
        │   ├── Makefile
        │   └── x86_64            # x86_64 assembly language versions
        │       ├── hello-gas.s   # ... gas syntax
        │       ├── hello-nasm.s  # ... nasm syntax
        │       └── Makefile
        └── c                     # Portable C versions
            ├── hello2.c          # ... using write()
            ├── hello3.c          # ... using syscall()
            ├── hello.c           # ... using printf()
            └── Makefile
aarch64
Navigate to the spo600/examples/hello/assembler/aarch64 directory and run make, which will assemble and link the source to create an executable named hello.
Next, compare the source code in hello.s to the disassembled output produced by objdump -d hello.o.
[japablo@aarch64-002 aarch64]$ cat hello.s
.text
.globl _start
_start:
mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */
mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */
mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */
.data
msg: .ascii "Hello, world!\n"
len= . - msg
[japablo@aarch64-002 aarch64]$ objdump -d hello.o
hello.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <_start>:
0: d2800020 mov x0, #0x1 // #1
4: 10000001 adr x1, 0 <_start>
8: d28001c2 mov x2, #0xe // #14
c: d2800808 mov x8, #0x40 // #64
10: d4000001 svc #0x0
14: d2800000 mov x0, #0x0 // #0
18: d2800ba8 mov x8, #0x5d // #93
1c: d4000001 svc #0x0
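The disassembled output is the machine-code form of the source code. Both versions have 8 instructions. Each line of the disassembly starts with the offset of the instruction, followed by its 32-bit encoding in hexadecimal, followed by the corresponding human-readable instruction. Note that adr x1 shows 0 <_start> because hello.o has not been linked yet, so the address of msg is still unresolved.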
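To make the link between the hexadecimal encodings and the mnemonics concrete, here is a minimal Python sketch that decodes the four mov instructions from the listing above. It assumes they are all the 64-bit MOVZ form (which matches the d28... prefix in the listing); decode_movz is a hypothetical helper name used only for illustration, not part of objdump or any toolchain.

```python
# Sketch: decode 64-bit MOVZ ("mov Xd, #imm16") words taken from the objdump listing.
# decode_movz is a hypothetical helper, written only to illustrate the bit layout.

def decode_movz(word):
    """Return (destination register number, immediate value) for a 64-bit MOVZ."""
    # Bits 31..23 identify a 64-bit MOVZ: sf=1, opc=10, fixed pattern 100101.
    if (word >> 23) != 0b110100101:
        raise ValueError(f"{word:#010x} is not a 64-bit MOVZ encoding")
    shift = ((word >> 21) & 0b11) * 16  # hw field: left shift in multiples of 16 bits
    imm16 = (word >> 5) & 0xFFFF        # bits 20..5: the 16-bit immediate
    rd = word & 0b11111                 # bits 4..0: destination register number
    return rd, imm16 << shift


for word in (0xD2800020, 0xD28001C2, 0xD2800808, 0xD2800BA8):
    rd, imm = decode_movz(word)
    print(f"{word:08x}: mov x{rd}, #{imm}")
```

Running it should reproduce the register numbers and immediates that objdump prints (1, 14, 64 and 93), which shows that the assembler resolved len to 14, the byte length of the message.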
Loops
The first step is to produce a loop that prints “Loop:” six times (once for each index from 0 through 5).
The high-level steps are as follows:
- create a string variable holding Loop:
- define min and max values, which set the number of iterations
- initialize a variable to track the loop iteration
- print the string variable Loop:
- print a newline
- increment the loop counter
- check whether the counter has reached the max value
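Before the assembly version, the steps above can be sketched in Python. This is only a rough high-level model under stated assumptions: loop_lines is a hypothetical name, and it collects the lines in a list rather than issuing write syscalls. It deliberately tests the counter at the bottom of the loop, since that is where a compare-and-branch would sit in assembly.

```python
# Sketch of the loop steps; loop_lines is a hypothetical name for illustration.
MIN = 0  # loop start index
MAX = 6  # loop exit value (exclusive), so the body runs for 0..5

def loop_lines():
    msg = "Loop: "          # the string variable
    lines = []
    i = MIN                 # variable tracking the current iteration
    while True:
        lines.append(msg)   # "print" the string (a newline would follow)
        i += 1              # increment the loop counter
        if i == MAX:        # stop once the counter reaches max
            break
    return lines


print(len(loop_lines()))
```

With MIN = 0 and MAX = 6 the body runs six times, matching the six output lines of the assembly program.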
.section .data
msg: .ascii "Loop: " // message prefix (6 bytes)
digit: .byte '0' // placeholder for the loop index (initialized to '0')
newline:.ascii "\n" // newline character (1 byte)
.section .text
.globl _start
min = 0
max = 6
_start:
mov x19, min // init loop counter (x19 = 0)
loop:
// Write "Loop: "
mov x0, 1 // file descriptor: stdout (1)
ldr x1, =msg // load address of "Loop: "
mov x2, 6 // length: 6 bytes ("Loop: ")
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Print newline character
mov x0, 1
ldr x1, =newline // load address of newline character
mov x2, 1 // length: 1 byte
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Increment loop counter and check condition
add x19, x19, 1 // x19 = x19 + 1
cmp x19, max // compare x19 to max (6)
b.ne loop // if not equal, continue loop
// Exit program
mov x0, 0 // exit status: 0
mov x8, 93 // syscall: exit (93)
svc 0 // invoke syscall
Loop:
Loop:
Loop:
Loop:
Loop:
Loop:
The instructions function similarly to those of the 6502: values are loaded into and out of registers, and register values can be compared against other values. The 6502 has no plain add instruction (only ADC, add with carry), whereas aarch64 has add, which sums two operands and stores the result in a destination register. Performing addition on the 6502 is more verbose: it requires managing the carry flag and moving values between memory and the accumulator.
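To illustrate why carry tracking makes 6502 addition verbose, here is a small Python model. It is a sketch under assumptions: adc and add64 are hypothetical helper names, adc models the 6502 ADC instruction in binary mode only (decimal mode is ignored), and add64 models a 64-bit aarch64 add that simply wraps on overflow.

```python
# Sketch: 8-bit add-with-carry (6502 style) versus a single 64-bit add (aarch64 style).
# Both function names are hypothetical and exist only for this illustration.

def adc(a, operand, carry):
    """Model 6502 ADC in binary mode: returns (8-bit result, carry out)."""
    total = a + operand + carry
    return total & 0xFF, 1 if total > 0xFF else 0

def add64(a, b):
    """Model aarch64 add on 64-bit registers: the result wraps modulo 2**64."""
    return (a + b) & 0xFFFFFFFFFFFFFFFF


# 16-bit addition 0x01FF + 0x0001 on an 8-bit accumulator, one byte at a time.
low, carry = adc(0xFF, 0x01, 0)       # carry cleared first (CLC), then add low bytes
high, carry = adc(0x01, 0x00, carry)  # add high bytes, including the carry
print(hex((high << 8) | low))

# The same sum as a single register-to-register addition.
print(hex(add64(0x01FF, 0x0001)))
```

Both lines print the same sum, but the 8-bit version needs one step per byte and has to pass the carry between them.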
Loop with Iteration Number
Next is to print the iteration number next to Loop: (e.g. Loop: 1). The bolded instructions (wrapped in ** below) are what was added to include the iteration number in each loop.
.section .data
msg: .ascii "Loop: "
**digit: .byte '0'** // placeholder for the loop index (initialized to '0')
newline:.ascii "\n" // newline character (1 byte)
.section .text
.globl _start
min = 0 // loop start index
max = 6 // loop exit condition (exclusive)
_start:
mov x19, min // initialize loop counter (x19 = 0)
loop:
// Write "Loop: "
mov x0, 1 // file descriptor: stdout (1)
ldr x1, =msg // load address of "Loop: "
mov x2, 6 // length: 6 bytes ("Loop: ")
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
**// Convert loop index to ASCII and print it
mov x0, 1
ldr x1, =digit // load address of digit buffer
add w20, w19, '0' // convert loop index to ASCII (0x30)
strb w20, [x1] // store ASCII character in buffer
mov x2, 1 // length: 1 byte (single digit)
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall**
// Print newline character
mov x0, 1
ldr x1, =newline // load address of newline character
mov x2, 1 // length: 1 byte
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Increment loop counter and check condition
add x19, x19, 1 // x19 = x19 + 1
cmp x19, max // compare x19 to max (6)
b.ne loop // if not equal, continue loop
// Exit program
mov x0, 0 // exit status: 0 (success)
mov x8, 93 // syscall: exit (93)
svc 0 // Invoke syscall
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
add converts the current iteration (in w19) into its ASCII character code so it can be printed to the console. It does this by adding 0x30 (the ASCII code for '0') to the numerical value, then storing the result in register w20. digit is then updated with the value in w20 via strb w20, [x1]. This only works while the counter is a single digit (0 through 9).
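The offset trick can be checked quickly in Python. This is a minimal sketch; digit_to_ascii is a hypothetical name, and the range check is my addition to make the single-digit assumption explicit.

```python
# Sketch: converting a single decimal digit to its ASCII code by adding 0x30.
# digit_to_ascii is a hypothetical helper name used for illustration.

def digit_to_ascii(n):
    """Return the ASCII code of the digit n, valid only for 0..9."""
    if not 0 <= n <= 9:
        raise ValueError("only single digits can be converted this way")
    return n + 0x30  # 0x30 is the ASCII code of '0'


for i in range(6):
    print("Loop: " + chr(digit_to_ascii(i)))
```

Adding 0x30 works because the characters '0' through '9' occupy consecutive ASCII codes starting at 0x30.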
Loop to 32
.section .data
msg: .ascii "Loop: "
digits: .ascii "00" // Buffer to hold 2-digit number
newline:.ascii "\n"
.section .text
.globl _start
min = 0
max = 33
_start:
mov x19, min
loop:
// Write "Loop: "
mov x0, 1 // file descriptor: stdout (1)
ldr x1, =msg // load address of "Loop: "
mov x2, 6 // length: 6 bytes
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
**// Convert x19 to 2-digit ASCII
mov x20, x19 // copy loop counter
// Divide by 10 to extract the tens digit
mov x21, 10
udiv x22, x20, x21 // x22 = quotient (tens place)
msub x23, x22, x21, x20 // x23 = remainder (ones place)
add w22, w22, '0' // convert quotient to ASCII
add w23, w23, '0' // convert remainder to ASCII
ldr x1, =digits // load address of digits buffer
strb w22, [x1] // store tens digit
strb w23, [x1, 1] // store ones digit
// Write the 2-digit number
mov x0, 1
mov x2, 2 // length: 2 bytes
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Write newline character
mov x0, 1
ldr x1, =newline // load address of newline character
mov x2, 1 // length: 1 byte
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall**
// Increment loop counter and check condition
add x19, x19, 1 // x19 = x19 + 1
cmp x19, max // compare x19 to max (33)
b.ne loop // if not equal, continue loop
// Exit program
mov x0, 0 // exit status: 0 (success)
mov x8, 93 // syscall: exit (93)
svc 0 // invoke syscall
Loop: 00
Loop: 01
Loop: 02
Loop: 03
Loop: 04
Loop: 05
Loop: 06
Loop: 07
Loop: 08
Loop: 09
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30
Loop: 31
Loop: 32
The udiv and msub instructions calculate the tens and ones digits, respectively.
udiv (unsigned division) divides the loop counter stored in x20 by the value 10 stored in x21, and stores the quotient (the tens digit) in x22.
In the first iteration, when x20 (the loop counter) is 0, the operation:
udiv x22, x20, x21
calculates:
x22 = 0 ÷ 10 → 0 (tens digit)
Since the quotient is 0, the tens digit is 0.
Next, the msub instruction is used to calculate the remainder (the ones digit):
msub x23, x22, x21, x20
This performs the following calculation:
x23 = x20 - (x22 * x21)
= 0 - (0 * 10)
= 0 (ones digit)
So, for the first iteration (x20 = 0), both x22 (tens) and x23 (ones) are 0.
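Since the first iteration is a trivial case, a non-zero counter shows the arithmetic more clearly. The sketch below mirrors the udiv and msub steps in Python; two_digits is a hypothetical name, and it assumes the counter stays below 100 so that two characters are enough.

```python
# Sketch: the udiv / msub digit extraction, modelled with integer arithmetic.
# two_digits is a hypothetical helper name; it assumes 0 <= n <= 99.

def two_digits(n, base=10):
    tens = n // base         # udiv x22, x20, x21  -> quotient
    ones = n - tens * base   # msub x23, x22, x21, x20 -> x20 - (x22 * x21)
    # Add the ASCII code of '0' to each digit, as the two add instructions do.
    return chr(tens + ord("0")) + chr(ones + ord("0"))


print(two_digits(27))  # quotient 2, remainder 27 - (2 * 10) = 7
```

For a counter of 27 the quotient is 2 and the remainder is 7, so the buffer holds the characters "27".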
Loop to 32 Without Leading 0’s
.section .data
msg: .ascii "Loop: "
digits: .ascii "00" // buffer to hold 2-digit number
newline:.ascii "\n"
.section .text
.globl _start
min = 0
max = 33
_start:
mov x19, min // initialize loop counter (x19 = 0)
loop:
// Write "Loop: "
mov x0, 1 // file descriptor: stdout (1)
ldr x1, =msg // load address of "Loop: "
mov x2, 6 // length: 6 bytes
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Convert x19 to 2-digit ASCII
mov x20, x19 // copy loop counter
// Divide by 10 to extract the tens digit
mov x21, 10
udiv x22, x20, x21 // x22 = quotient (tens place)
msub x23, x22, x21, x20 // x23 = remainder (ones place)
add w22, w22, '0' // convert quotient to ASCII
add w23, w23, '0' // convert remainder to ASCII
**ldr x1, =digits // load address of digits buffer
// Handle leading zeros
cmp x19, 10 // check if the number is less than 10
blt print_ones_only
// Store both digits
strb w22, [x1] // store tens digit
strb w23, [x1, 1] // store ones digit
mov x0, 1 // file descriptor: stdout (1)
mov x2, 2 // length: 2 bytes
b write_digits
print_ones_only:
mov w22, ' ' // clear the tens place with a space
strb w22, [x1] // store space in the tens place
strb w23, [x1, 1] // store ones digit
mov x0, 1 // file descriptor: stdout (1)
mov x2, 2 // length: 2 bytes**
write_digits:
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Write newline character
mov x0, 1
ldr x1, =newline // load address of newline character
mov x2, 1 // length: 1 byte
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Increment loop counter and check condition
add x19, x19, 1 // x19 = x19 + 1
cmp x19, max // compare x19 to max (33)
b.ne loop // if not equal, continue loop
// Exit program
mov x0, 0 // exit status: 0 (success)
mov x8, 93 // syscall: exit (93)
svc 0 // invoke syscall
Loop: 0
Loop: 1
Loop: 2
Loop: 3
Loop: 4
Loop: 5
Loop: 6
Loop: 7
Loop: 8
Loop: 9
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 20
Loop: 21
Loop: 22
Loop: 23
Loop: 24
Loop: 25
Loop: 26
Loop: 27
Loop: 28
Loop: 29
Loop: 30
Loop: 31
Loop: 32
If the iteration number is greater than or equal to 10, the loop behaves as before. Otherwise, it branches to print_ones_only via blt, where the tens position in the digits buffer is overwritten with a space and the ones digit is stored after it. Two bytes are still written in both cases.
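The branch can be modelled in a few lines of Python. This is a sketch only: format_counter is a hypothetical name, and it assumes the counter stays below 100. It returns the two bytes that end up in the digits buffer.

```python
# Sketch: suppressing the leading zero by storing a space in the tens position.
# format_counter is a hypothetical helper name; it assumes 0 <= n <= 99.

def format_counter(n):
    tens, ones = divmod(n, 10)
    if n < 10:
        tens_char = " "                    # print_ones_only path: space in the tens place
    else:
        tens_char = chr(tens + ord("0"))   # normal path: real tens digit
    return tens_char + chr(ones + ord("0"))


print(repr(format_counter(7)))
print(repr(format_counter(32)))
```

The result is always two characters long, so single-digit values come out right-aligned with a leading space rather than being shortened to one character.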
Looping to 32 with Hex Values
.section .data
msg: .ascii "Loop: "
digits: .ascii "00"
newline:.ascii "\n"
.section .text
.globl _start
min = 0
max = 33
_start:
mov x19, min
loop:
// Write "Loop: "
mov x0, 1 // file descriptor: stdout (1)
ldr x1, =msg // load address of "Loop: "
mov x2, 6 // length: 6 bytes
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Convert x19 to 2-digit hexadecimal
mov x20, x19 // copy loop counter
// Split the counter into its high and low hexadecimal digits
mov x21, 16 // hexadecimal base (16)
udiv x22, x20, x21 // x22 = quotient (high hex digit)
msub x23, x22, x21, x20 // x23 = remainder (low hex digit)
**// Convert quotient and remainder to ASCII (hex digits)
add w22, w22, '0' // convert quotient to ASCII (0-9 or A-F)
add w23, w23, '0' // convert remainder to ASCII (0-9 or A-F)
cmp w22, '9' // check if quotient is greater than 9 (A-F)
ble no_hex_conversion // if less than or equal to '9', skip
add w22, w22, 7 // convert 'A' - 'F' (10-15)
no_hex_conversion:
cmp w23, '9' // check if remainder is greater than 9 (A-F)
ble no_hex_conversion_2 // if less than or equal to '9', skip
add w23, w23, 7 // convert 'A' - 'F' (10-15)
no_hex_conversion_2:
ldr x1, =digits // load address of digits buffer
// Store both digits
strb w22, [x1] // store quotient (tens place)
strb w23, [x1, 1] // store remainder (ones place)
mov x0, 1 // file descriptor: stdout (1)
mov x2, 2 // length: 2 bytes**
b write_digits
print_ones_only:
mov w22, ' ' // clear the tens place with a space
strb w22, [x1] // store space in the tens place
strb w23, [x1, 1] // store ones digit
mov x0, 1 // file descriptor: stdout (1)
mov x2, 2 // length: 2 bytes
write_digits:
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Write newline character
mov x0, 1
ldr x1, =newline // load address of newline character
mov x2, 1 // length: 1 byte
mov x8, 64 // syscall: write (64)
svc 0 // invoke syscall
// Increment loop counter and check condition
add x19, x19, 1 // x19 = x19 + 1
cmp x19, max
b.ne loop // if not equal, continue loop
// Exit program
mov x0, 0 // exit status: 0 (success)
mov x8, 93 // syscall: exit (93)
svc 0 // invoke syscall
After the usual conversion to ASCII, the code checks whether the tens and/or ones digit needs an extra offset to display its hexadecimal equivalent. Specifically, cmp w22, '9' and ble no_hex_conversion check whether the tens character is at or below '9'; if so, execution branches to no_hex_conversion, where the same check is applied to the ones digit.
The hexadecimal conversion adds an offset of 7 to the character, because 'A' (0x41) sits 7 positions after ':' (0x3A), the character that follows '9'. Values 10 through 15 therefore end up as A through F.
Loop: 00
Loop: 01
Loop: 02
Loop: 03
Loop: 04
Loop: 05
Loop: 06
Loop: 07
Loop: 08
Loop: 09
Loop: 0A
Loop: 0B
Loop: 0C
Loop: 0D
Loop: 0E
Loop: 0F
Loop: 10
Loop: 11
Loop: 12
Loop: 13
Loop: 14
Loop: 15
Loop: 16
Loop: 17
Loop: 18
Loop: 19
Loop: 1A
Loop: 1B
Loop: 1C
Loop: 1D
Loop: 1E
Loop: 1F
Loop: 20
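The hexadecimal conversion described above (add '0', then add 7 more when the character is past '9') can be modelled in Python as follows. This is a sketch under assumptions: hex_digit and to_hex2 are hypothetical names, and the input is assumed to fit in two hex digits (0 through 255).

```python
# Sketch: the "+ '0', then + 7 if past '9'" hex digit conversion.
# hex_digit and to_hex2 are hypothetical helper names used for illustration.

def hex_digit(value):
    """Convert a value 0..15 to its uppercase hex character."""
    code = value + ord("0")   # add w22, w22, '0'
    if code > ord("9"):       # cmp w22, '9' / ble skips the adjustment
        code += 7             # jump over ':' .. '@' to land on 'A' .. 'F'
    return chr(code)

def to_hex2(n):
    """Return n (0..255) as two uppercase hex characters."""
    high, low = divmod(n, 16)  # udiv / msub with base 16
    return hex_digit(high) + hex_digit(low)


for n in (9, 10, 15, 16, 31, 32):
    print("Loop: " + to_hex2(n))
```

For any value in that range the result should agree with Python's built-in format(n, "02X"), which is a convenient cross-check.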
x86
The same approach was taken with x86 for each loop. Link to GitHub containing both x86 and aarch64 implementations.
Results
The history of each architecture plays a role in how data is handled. x86 was first introduced in 1978 with the Intel 8086, a 16-bit processor (Source). x86-64 is backwards compatible, meaning that a 64-bit processor can run 32-bit programs, but not 16-bit programs directly. aarch64 is relatively new, first introduced in 2011 as part of the ARMv8 architecture (Source). As we have seen, it uses a RISC (Reduced Instruction Set Computer) design with a fixed 32-bit instruction size. While x86-64 prioritizes backward compatibility, aarch64 maintains a cleaner instruction set and may support 32-bit ARM code on some hardware, though many newer devices are 64-bit only.
These design differences can be seen in how the iteration number is converted into its hexadecimal value.
Extracting Tens and Ones
In the x86 version, the value is handled in nibbles (4 bits each), which hold the tens and ones digits of the hexadecimal value. The instruction shr rax, 4 shifts the value right by 4 bits to move the high nibble (tens place) into the lower 4 bits of the register. The instruction and al, 0x0F performs a bitwise AND to isolate the low nibble (ones place) by masking out the upper 4 bits.
In the aarch64 version, the values are extracted with arithmetic instructions: udiv (for division) and msub (for the remainder). In contrast, the x86 version uses bitwise operations, shr (shift right) and and (masking), to extract the same tens and ones values by manipulating individual bits and nibbles.
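The two approaches give the same digits because 16 is a power of two: dividing by 16 is the same as shifting right by 4 bits, and the remainder is the same as masking the low 4 bits. A short Python sketch makes this checkable; the function names are hypothetical, and values are assumed to fit in one byte (0 through 255).

```python
# Sketch: extracting hex digits by shift/mask versus by divide/multiply-subtract.
# Both helper names are hypothetical; inputs are assumed to be 0..255.

def nibbles_bitwise(n):
    high = (n >> 4) & 0x0F   # like shr by 4: high nibble moved to the low bits
    low = n & 0x0F           # like and with 0x0F: keep only the low nibble
    return high, low

def nibbles_arithmetic(n):
    high = n // 16           # like udiv by 16
    low = n - high * 16      # like msub: n - (high * 16)
    return high, low


print(nibbles_bitwise(0x1F), nibbles_arithmetic(0x1F))
print(all(nibbles_bitwise(n) == nibbles_arithmetic(n) for n in range(256)))
```

The equivalence only holds for power-of-two bases, which is why the decimal loops earlier still need a real division by 10.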
Conclusion
Each architecture produces the same result for each loop, but the approaches differ. The history and design choices of each architecture partly explain why. The aarch64 version expresses multiplication and division compactly with udiv and msub, whereas the x86 version is more verbose. That verbosity can pay off in performance, however: shifts and masks are typically cheaper than a division, which allows for optimizations where possible.