Tomasz Wegrzanowski

Posted on Jan 4, 2022

100 Languages Speedrun: Episode 46: ARM64 Assembly

#arm #assembly #apple

There aren't that many relevant CPU architectures these days. x86-64 is dominant on the high-performance devices, ARM is dominant on the low-power devices, and RISC-V is the only serious upcoming challenger. It's been like that for about 20 years now, more or less since the time Intel's Itanium architecture launched and instantly failed.

I did x86-64 (https://dev.to/taw/100-languages-speedrun-episode-40-x86-64-assembly-jjj and https://taw.hashnode.dev/100-languages-speedrun-episode-40-x86-64-assembly) and RISC-V (https://taw.hashnode.dev/100-languages-speedrun-episode-44-risc-v-assembly), so it's time to complete the set with ARM.

Just like x86 and RISC-V, ARM has both 32-bit and 64-bit versions, and we'll be doing the 64-bit one.

How to run ARM64 in Docker

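Docker Desktop comes with QEMU preinstalled, so it can emulate other architectures, and I'll be using that. On a plain Linux Docker Engine you might need to set up QEMU user-mode emulation yourself first.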

You might actually have an ARM machine around, like a Raspberry Pi or Apple M1. To follow along on one of those, you'd need to adjust things slightly. Most Raspberries are 32-bit and only some 64-bit versions started showing up recently. Apple M1 is 64-bit, but OSX uses slightly different system calls than Linux. The code from this episode could work with minor adaptations on either.

So let's start a new container, install compiling tools on it, and we can begin:

$ docker run -it --platform linux/arm64/v8 -v $(pwd):/source arm64v8/ubuntu
$ apt update
$ apt install -y build-essential

Simplest ARM64 program

The simplest program just exits with a numeric exit code. These are normally used to indicate errors, with 0 being success, and various non-zero values indicating failure. Some programs have a complicated mapping of which non-zero value means which kind of issue, others just use the same value for every problem.

To exit, or to do any interaction with the outside world, our program needs to call the operating system. This is how it's done:

.global _start

.text
_start:
  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

$ as exit.s -o exit.o
$ ld exit.o -o exit
$ ./exit
$ echo $?
7

This program is really similar to x86-64 and RISC-V versions, it's just the opcode names that are different.

.text is code section
_start is where the code execution will begin when the program is loaded, we need to mark this symbol as exported with .global _start too
/* ... */ for comments, I'm so baffled that assemblers for different architectures use different comment characters
mov x8, 93 means x8 = 93 - the x8 register is where we pass the function number to the operating system, and 93 is the ARM64 Linux system call number for exit. RISC-V Linux uses the same number, but x86-64 Linux uses 60. On OSX it would be some different number.
mov x0, 7 means x0 = 7 - the x0 register is where we pass the first argument, in this case the exit code
svc 0 performs the system call on Linux

Simplest OSX ARM64 program

I don't have one lying around, so I never tried running it, but from the documentation, this is how the same program would look on OSX:

.global _start

.text
_start:
  /* _exit(7) */
  mov x16, 1
  mov x0, 7
  svc 0x80

A few things are different:

system call number is different, exit is 93 on ARM64 Linux and 1 on ARM64 OSX. Interestingly ARM64 and RISC-V Linux share the number 93, while x86-64 Linux uses 60, and x86-64 OSX uses different numbers from ARM64 OSX.
we pass the operation number in x16 instead of x8
we use svc 0x80 not svc 0 to call the operating system

Hello, World!

The Hello, World also looks similar to the x86-64 and RISC-V versions, but there are some real differences below the surface that we'll get to:

.global _start

.text
_start:
  /* write(1, "Hello, World!\n", 14) */
  mov x8, 64
  mov x0, 1
  ldr x1, =hello
  mov x2, 14
  svc 0

  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

.data
hello:
  .ascii "Hello, World!\n"

Let's run it:

$ as hello.s -o hello.o
$ ld hello.o -o hello
$ ./hello
Hello, World!

mov x8, 64 means x8 = 64, that's the Linux system call number for the write function
mov x0, 1 means x0 = 1, that means standard output
ldr x1, =hello means x1 = address of hello, but there's more here
mov x2, 14 means x2 = 14, that's the length of the string
.data is data section
hello is a label for where we have the string
.ascii "Hello, World!\n" is the string itself

Constant pools

Assembly programs don't pass strings and other such objects around, they pass their addresses in memory. On a 64-bit machine, addresses are 64 bits, or 8 bytes. So how do we load an address into a register?

On x86-64 it's super easy - instructions have variable length, so if you need to load a 64-bit address or any other 64-bit number into a register, you can use mov instruction, and it will be 10 bytes (2 to select instruction, 8 for data), but that's fine.

ARM64 instructions are all 32bit. Part of that needs to select which instruction we use, so an instruction can't contain a 32bit number, let alone a 64bit one. So how does that work?
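To make the bit budget concrete, here is a minimal Python sketch (not part of the toolchain used in this episode) that builds the 32-bit word for a simple mov by hand. It assumes the standard A64 MOVZ layout for 64-bit registers: a fixed opcode pattern in the top bits, a 2-bit shift selector, a 16-bit immediate, and a 5-bit register number. The name encode_movz is made up for this illustration.

```python
MOVZ_X_BASE = 0xD2800000  # 64-bit MOVZ with all operand fields set to zero


def encode_movz(rd: int, imm16: int, shift: int = 0) -> int:
    """Return the 32-bit instruction word for `movz x<rd>, #imm16, lsl #shift`."""
    if not 0 <= rd <= 31:
        raise ValueError("register number must be 0..31")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("immediate must fit in 16 bits")
    if shift not in (0, 16, 32, 48):
        raise ValueError("shift must be 0, 16, 32 or 48")
    hw = shift // 16  # 2-bit field selecting which 16-bit chunk is written
    return MOVZ_X_BASE | (hw << 21) | (imm16 << 5) | rd


if __name__ == "__main__":
    # Both words also show up in the objdump listing of the hello program.
    print(hex(encode_movz(8, 64)))  # mov x8, 64 -> 0xd2800808
    print(hex(encode_movz(0, 1)))   # mov x0, 1  -> 0xd2800020
```

Only 16 of the 32 bits are left for the number itself, which is why a full 64-bit address can't ride along inside a single mov.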

Enough talk, let's take a peek inside with objdump. -d means to disassemble, -s to show contents of each section:

$ objdump -ds ./hello

./hello:     file format elf64-littleaarch64

Contents of section .text:
 4000b0 080880d2 200080d2 c1000058 c20180d2  .... ......X....
 4000c0 010000d4 a80b80d2 e00080d2 010000d4  ................
 4000d0 d8004100 00000000                    ..A.....
Contents of section .data:
 4100d8 48656c6c 6f2c2057 6f726c64 210a      Hello, World!.

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:   d2800808    mov x8, #0x40                   // #64
  4000b4:   d2800020    mov x0, #0x1                    // #1
  4000b8:   580000c1    ldr x1, 4000d0 <_start+0x20>
  4000bc:   d28001c2    mov x2, #0xe                    // #14
  4000c0:   d4000001    svc #0x0
  4000c4:   d2800ba8    mov x8, #0x5d                   // #93
  4000c8:   d28000e0    mov x0, #0x7                    // #7
  4000cc:   d4000001    svc #0x0
  4000d0:   004100d8    .word   0x004100d8
  4000d4:   00000000    .word   0x00000000

So:

data ended up at address 0x00000000004100d8 - the only data entry is our string there
code ended up at address 0x00000000004000b0 - the only function is our _start function
but there's something weird following the _start function, there's a big number 0x00000000004100d8 (objdump splits it between two lines, but it's a single 64-bit number)
ldr x1, =hello got translated to ldr x1, 4000d0 - but that decoding is not completely accurate, what's actually in the instruction is the address of what we're loading, relative to the address of the current instruction - the whole memory might be many GBs, but the constant pool is very close to the function itself, so the small offset that fits in the instruction is generally enough.
so the address of hello isn't anywhere in the code, it's in the constant pool just after the function, and the code loads the address of hello from the constant pool
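To double check that reading of the dump, here is a small Python sketch that decodes the ldr word by hand. It assumes the standard A64 "LDR (literal)" layout for 64-bit registers: top byte 0x58, a signed 19-bit word offset in bits 5-23, and the target register in the low 5 bits. The function name decode_ldr_literal is hypothetical.

```python
def decode_ldr_literal(word: int, pc: int) -> tuple[int, int]:
    """Decode a 64-bit `ldr xN, <label>`; return (register number, target address)."""
    if word & 0xFF000000 != 0x58000000:
        raise ValueError("not a 64-bit LDR (literal) instruction")
    rt = word & 0x1F
    imm19 = (word >> 5) & 0x7FFFF
    if imm19 & 0x40000:        # sign-extend the 19-bit field
        imm19 -= 0x80000
    return rt, pc + imm19 * 4  # the offset is counted in 4-byte words


if __name__ == "__main__":
    # Instruction word and its address, both taken from the objdump output.
    rt, target = decode_ldr_literal(0x580000C1, pc=0x4000B8)
    print(f"ldr x{rt}, {target:#x}")  # ldr x1, 0x4000d0

    # The 8 bytes stored at 4000d0, read as one little-endian 64-bit number.
    pool = bytes.fromhex("d800410000000000")
    print(hex(int.from_bytes(pool, "little")))  # 0x4100d8
```

The instruction only stores the offset 6 (six words, so 24 bytes, past 4000b8), and the 64-bit address of hello comes from the constant pool entry it points at.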

Constant pools aren't the only way to load big numbers on ARM64, you can also build a 64-bit value from 4 16-bit chunks with one movz and up to three movk instructions, but that would generally be slower and in the worst case take even more space - 4 instructions and 16 bytes instead of 1 instruction and 12 bytes (4 for the ldr, 8 for the constant).
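As a rough illustration of the chunked alternative, this Python sketch splits a constant into 16-bit pieces and prints the movz/movk lines you could write by hand. It assumes the value is already known when you write the code, which is not the case for a label like hello whose address is only fixed by the linker. The helper name movz_movk_sequence is made up.

```python
def movz_movk_sequence(value: int, reg: str = "x1") -> list[str]:
    """Return assembly lines that build a 64-bit constant 16 bits at a time."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value must fit in 64 bits")
    parts = [(shift, (value >> shift) & 0xFFFF) for shift in (0, 16, 32, 48)]
    # movz zeroes the rest of the register, so all-zero chunks can be skipped.
    nonzero = [(shift, chunk) for shift, chunk in parts if chunk != 0] or [(0, 0)]
    lines = []
    for shift, chunk in nonzero:
        opcode = "movk" if lines else "movz"  # first write clears, later ones keep
        suffix = f", lsl #{shift}" if shift else ""
        lines.append(f"{opcode} {reg}, #{chunk:#x}{suffix}")
    return lines


if __name__ == "__main__":
    for line in movz_movk_sequence(0x4100D8):
        print(line)
    # movz x1, #0xd8
    # movk x1, #0x41, lsl #16
```

Small values like this address need fewer than 4 instructions, the full 4 are only needed when every 16-bit chunk is non-zero.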

The whole situation is definitely more complicated than on x86-64. Normally the compiler deals with all that complexity for you, and in this case even assembler does some of it for you.

Loop

Loops are very straightforward. Just like x86-64 and unlike RISC-V, ARM64 has a separate flags register, so first we compare with one instruction that sets some flags, then we do a conditional jump with another instruction.

This code keeps loop counter in x19, it starts at 5, goes down by 1 every iteration, and the loop ends when it reaches 0.

.global _start

.text
_start:
  mov x19, 5

loop:
  /* write(1, "Hello, World!\n", 14) */
  mov x8, 64
  mov x0, 1
  ldr x1, =hello
  mov x2, 14
  svc 0

  /* x19 = x19 - 1 */
  sub x19, x19, 1
  /* if x19 != 0 goto loop for another iteration */
  cmp x19, 0
  b.ne loop

  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

.data
hello:
  .ascii "Hello, World!\n"

It indeed prints the message 5 times:

$ as loop.s -o loop.o
$ ld loop.o -o loop
$ ./loop
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!

Print numbers

As usual, the most challenging part is converting numbers to strings. It's the same algorithm - building the string digit by digit starting from the last one. I put some comments all over the code, hopefully it should be clear enough.

.global _start

.text
print_number:
  /* start with x1 pointing at last character of the buffer */
  /* that's where the digit will go */
  /* x2 is total count of characters to print (including newline) */
  ldr x1, =buffer
  add x1, x1, 31
  mov x2, 2
  mov x3, 10

print_number_loop:
  /* do one digit, shift x0 */
  /* x4 = x0/10 */
  /* x5 = x0%10 */
  sdiv x4, x0, x3
  /* ARM doesn't have a modulo instruction, but it has "multiply and subtract" instruction */
  msub x5, x4, x3, x0
  /* add 48 to convert number to ASCII code, then write to buffer */
  add x5, x5, 48
  /* strb = SToRe Byte */
  /* w5 is bottom 32bits of x5 */
  /* it doesn't really matter, as we're only writing the lowest byte */
  strb w5, [x1]

  /* check if x4 is 0 */
  /* if yes, we're done and can print what we built */
  /* if not, more digits are coming */
  cmp x4, 0
  b.eq print_number_loop_done

  mov x0, x4
  sub x1, x1, 1
  add x2, x2, 1
  b print_number_loop

print_number_loop_done:
  /* output some part of the buffer */
  /* write(1, x1, x2) */
  mov x8, 64
  mov x0, 1
  svc 0
  ret

_start:
  /* load big_number from the constant pool to x0 */
  ldr x0, =big_number
  /* call print_number */
  bl print_number

  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

.data
big_number = 12345678901234
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"

That works just as expected:

$ as print_number.s -o print_number.o
$ ld print_number.o -o print_number
$ ./print_number
12345678901234
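For readers who find the register shuffling hard to follow, here is the same digit-by-digit algorithm as a Python sketch. It mirrors the assembly step by step, but returns the bytes instead of calling write, and it assumes a non-negative input that fits in a signed 64-bit register. The function name format_number is hypothetical.

```python
def format_number(n: int) -> bytes:
    """Mirror print_number: fill the buffer from the end, return the slice to write."""
    if not 0 <= n < 1 << 63:
        raise ValueError("this sketch only handles non-negative 64-bit values")
    buffer = bytearray(b"0123456789abcdef0123456789abcdef\n")
    pos = 31   # x1: index of the last character before the newline
    count = 2  # x2: one digit plus the newline
    while True:
        quotient = n // 10             # sdiv x4, x0, x3
        remainder = n - quotient * 10  # msub x5, x4, x3, x0
        buffer[pos] = remainder + 48   # add 48 for ASCII, then strb
        if quotient == 0:
            break                      # no more digits, ready to print
        n = quotient
        pos -= 1
        count += 1
    return bytes(buffer[pos:pos + count])


if __name__ == "__main__":
    print(format_number(12345678901234))  # b'12345678901234\n'
```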

Let's look inside too:

$ objdump -ds print_number

print_number:     file format elf64-littleaarch64

Contents of section .text:
 4000b0 01030058 217c0091 420080d2 430180d2  ...X!|..B...C...
 4000c0 040cc39a 8580039b a5c00091 25000039  ............%..9
 4000d0 9f0000f1 a0000054 e00304aa 210400d1  .......T....!...
 4000e0 42040091 f7ffff17 080880d2 200080d2  B........... ...
 4000f0 010000d4 c0035fd6 00010058 edffff97  ......_....X....
 400100 a80b80d2 e00080d2 010000d4 00000000  ................
 400110 20014100 00000000 f22fce73 3a0b0000   .A....../.s:...
Contents of section .data:
 410120 30313233 34353637 38396162 63646566  0123456789abcdef
 410130 30313233 34353637 38396162 63646566  0123456789abcdef
 410140 0a                                   .

Disassembly of section .text:

00000000004000b0 <print_number>:
  4000b0:   58000301    ldr x1, 400110 <_start+0x18>
  4000b4:   91007c21    add x1, x1, #0x1f
  4000b8:   d2800042    mov x2, #0x2                    // #2
  4000bc:   d2800143    mov x3, #0xa                    // #10

00000000004000c0 <print_number_loop>:
  4000c0:   9ac30c04    sdiv    x4, x0, x3
  4000c4:   9b038085    msub    x5, x4, x3, x0
  4000c8:   9100c0a5    add x5, x5, #0x30
  4000cc:   39000025    strb    w5, [x1]
  4000d0:   f100009f    cmp x4, #0x0
  4000d4:   540000a0    b.eq    4000e8 <print_number_loop_done>  // b.none
  4000d8:   aa0403e0    mov x0, x4
  4000dc:   d1000421    sub x1, x1, #0x1
  4000e0:   91000442    add x2, x2, #0x1
  4000e4:   17fffff7    b   4000c0 <print_number_loop>

00000000004000e8 <print_number_loop_done>:
  4000e8:   d2800808    mov x8, #0x40                   // #64
  4000ec:   d2800020    mov x0, #0x1                    // #1
  4000f0:   d4000001    svc #0x0
  4000f4:   d65f03c0    ret

00000000004000f8 <_start>:
  4000f8:   58000100    ldr x0, 400118 <_start+0x20>
  4000fc:   97ffffed    bl  4000b0 <print_number>
  400100:   d2800ba8    mov x8, #0x5d                   // #93
  400104:   d28000e0    mov x0, #0x7                    // #7
  400108:   d4000001    svc #0x0
  40010c:   00000000    .inst   0x00000000 ; undefined
  400110:   00410120    .word   0x00410120
  400114:   00000000    .word   0x00000000
  400118:   73ce2ff2    .word   0x73ce2ff2
  40011c:   00000b3a    .word   0x00000b3a

The .data section contains our buffer and nothing else, that makes sense (big_number is an assembly-time constant, not data, so it only shows up in the constant pool). The instructions generally correspond to what we wrote, but there's also one mystery entry at 40010c, four bytes of extra zeroes. It's there to pad the code to a multiple of 64 bits, so the constant pool starts at an 8-byte boundary.
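The amount of padding is simple arithmetic. This is a small Python sketch of it, assuming the 8-byte alignment seen in the dump above. The helper name padding_for is made up.

```python
def padding_for(address: int, alignment: int = 8) -> int:
    """Number of filler bytes needed to round address up to the alignment."""
    return -address % alignment


if __name__ == "__main__":
    end_of_code = 0x40010C  # first free address after the last svc
    pad = padding_for(end_of_code)
    print(pad, hex(end_of_code + pad))  # 4 0x400110
```

In the earlier hello program the code ended at 4000d0, which is already a multiple of 8, so no padding word showed up there.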

Architectures differ on their alignment requirements. Usually "misaligned" data access still works, it's just slower. On some older architectures, unaligned data access just wasn't supported at all.

I don't think any modern architecture strictly requires alignment for normal data access, but compilers still care about it as it's bad for performance, and occasionally some extra features like atomic memory access or SIMD might require aligned addresses. It's easiest to just align all the things. A few extra zeroes are usually no big deal, and it's done for us automatically.

Print loop

We can print one number, so how about we print a bunch of them? We already know how to loop, so it should be easy. Other than b.le (branch if less or equal), there's nothing new here:

.global _start

.text
print_number:
  ldr x1, =buffer
  add x1, x1, 31
  mov x2, 2
  mov x3, 10

print_number_loop:
  sdiv x4, x0, x3
  msub x5, x4, x3, x0
  add x5, x5, 48
  strb w5, [x1]

  cmp x4, 0
  b.eq print_number_loop_done

  mov x0, x4
  sub x1, x1, 1
  add x2, x2, 1
  b print_number_loop

print_number_loop_done:
  mov x8, 64
  mov x0, 1
  svc 0
  ret

_start:

  mov x19, 1

loop:
  mov x0, x19
  bl print_number

  add x19, x19, 1
  cmp x19, 20
  b.le loop

  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

.data
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"

It prints:

$ as print_loop.s -o print_loop.o
$ ld print_loop.o -o print_loop
$ ./print_loop
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

FizzBuzz

And now we're ready to write the FizzBuzz! We've seen all the pieces before, now it's the time to assemble them.

.global _start

.text
print_number:
  ldr x1, =buffer
  add x1, x1, 31
  mov x2, 2
  mov x3, 10

print_number_loop:
  sdiv x4, x0, x3
  msub x5, x4, x3, x0
  add x5, x5, 48
  strb w5, [x1]

  cmp x4, 0
  b.eq print_number_loop_done

  mov x0, x4
  sub x1, x1, 1
  add x2, x2, 1
  b print_number_loop

print_number_loop_done:
  mov x8, 64
  mov x0, 1
  svc 0
  ret

_start:

  mov x19, 1
  mov x20, 3
  mov x21, 5

loop:
  /* x5 = x19 % 3 */
  sdiv x4, x19, x20
  msub x5, x4, x20, x19
  /* is the remainder zero? */
  cmp x5, 0
  b.eq divides_by_three

does_not_divide_by_three:
  /* x5 = x19 % 5 */
  sdiv x4, x19, x21
  msub x5, x4, x21, x19
  /* is the remainder zero? */
  cmp x5, 0
  b.eq divides_by_five_only

divides_by_neither:
  /* print_number(x19) */
  mov x0, x19
  bl print_number
  b continue_loop

divides_by_three:
  /* x5 = x19 % 5 */
  sdiv x4, x19, x21
  msub x5, x4, x21, x19
  /* is the remainder zero? */
  cmp x5, 0
  b.eq divides_by_three_and_five

divides_by_three_only:
  /* write(1, "Fizz\n", 5) */
  mov x8, 64
  mov x0, 1
  ldr x1, =fizz
  mov x2, 5
  svc 0
  b continue_loop

divides_by_five_only:
  /* write(1, "Buzz\n", 5) */
  mov x8, 64
  mov x0, 1
  ldr x1, =buzz
  mov x2, 5
  svc 0
  b continue_loop

divides_by_three_and_five:
  /* write(1, "FizzBuzz\n", 9) */
  mov x8, 64
  mov x0, 1
  ldr x1, =fizzbuzz
  mov x2, 9
  svc 0

continue_loop:
  add x19, x19, 1
  cmp x19, 100
  b.le loop

  /* _exit(7) */
  mov x8, 93
  mov x0, 7
  svc 0

.data
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"
fizz:
  .ascii "Fizz\n"
buzz:
  .ascii "Buzz\n"
fizzbuzz:
  .ascii "FizzBuzz\n"

Which does the expected thing:

$ as fizzbuzz.s -o fizzbuzz.o
$ ld fizzbuzz.o -o fizzbuzz
$ ./fizzbuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Buzz
Fizz
97
98
Fizz
Buzz
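The branching in the assembly is easier to see in a high-level language. This Python sketch follows the same decision tree, with comments naming the matching assembly labels. It builds the output as a string instead of making system calls, and the function names are hypothetical.

```python
def fizzbuzz_line(n: int) -> str:
    """Return one line of output, following the same branches as the assembly."""
    if n % 3 == 0:               # divides_by_three
        if n % 5 == 0:
            return "FizzBuzz\n"  # divides_by_three_and_five
        return "Fizz\n"          # divides_by_three_only
    if n % 5 == 0:
        return "Buzz\n"          # divides_by_five_only
    return f"{n}\n"              # divides_by_neither


def fizzbuzz(limit: int = 100) -> str:
    """Concatenate the lines for 1..limit, like the loop over x19."""
    return "".join(fizzbuzz_line(n) for n in range(1, limit + 1))


if __name__ == "__main__":
    print(fizzbuzz(), end="")
```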

Should you use ARM64 Assembly?

Definitely not for writing any real programs.

Assembly is still a lot of fun to play with. If you want to learn some assembly for fun, and are trying to decide which one to start with, I recommend x86-64.

You probably have an x86-64 computer already. There's orders of magnitude more x86-64 code you might want to decompile than ARM64 code - a lot of software for ARM devices like Android phones doesn't even ship as machine code, it ships in a bytecode format that only gets compiled on the device. For x86-64 you have old games you might want to hack, CTF hacking challenges, real software with real security issues, and so on. x86-64 is also significantly more approachable, and it has tried its best to keep things as similar as it could over the decades. ARM is a lot more fragmented, with multiple significantly different variants, so what you learn will need a lot more adjustment for a different ARM device.

But once you've played with x86-64 enough, and want another one to try, ARM is the second most likely one you'd have access to. Get a Raspberry Pi, preferably a 64-bit kind, and have a go.