There aren't that many relevant CPU architectures these days. x86-64 is dominant on the high-performance devices, ARM is dominant on the low-power devices, and RISC-V is the only serious upcoming challenger. It's been like that for about 20 years now, more or less since the time Intel's Itanium architecture launched and instantly failed.
I did [x86-64](https://taw.hashnode.dev/100-languages-speedrun-episode-40-x86-64-assembly) and [RISC-V](https://taw.hashnode.dev/100-languages-speedrun-episode-44-risc-v-assembly), so it's time to complete the set with ARM.
Just like x86 and RISC-V, ARM has both 32-bit and 64-bit versions, and we'll be doing the 64-bit one.
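Docker Desktop comes with QEMU preinstalled, so it can emulate other architectures, and I'll be using that. On a plain Linux Docker Engine install you may need to set up QEMU emulation yourself first.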
You might actually have an ARM machine around, like a Raspberry Pi or Apple M1. If you want to follow along on one, you'd need to adjust the code slightly. Most Raspberries are 32-bit and only some 64-bit versions started showing up recently. Apple M1 is 64-bit, but OSX uses slightly different system calls than Linux. The code from this episode could work with minor adaptations on either.
So let's start a new container, install compiling tools on it, and we can begin:
    $ docker run -it --platform linux/arm64/v8 -v $(pwd):/source arm64v8/ubuntu
    $ apt update
    $ apt install -y build-essential
The simplest program just exits with a numeric exit code. These are normally used to indicate errors, with 0 being success, and various non-zero values indicating failure. Some programs have a complicated mapping of which non-zero value means which kind of issue, others just use the same value for every problem.
To exit, or to do any interaction with the outside world, our program needs to call the operating system. This is how it's done:
    .global _start
    .text
    _start:
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    $ as exit.s -o exit.o
    $ ld exit.o -o exit
    $ ./exit
    $ echo $?
    7
This program is really similar to the x86-64 and RISC-V versions, it's mostly the opcode and register names that are different.
- `.text` is the code section
- `_start` is where the code execution will begin when the program is loaded, we need to mark this symbol as exported with `.global _start`
- `/* ... */` is for comments, I'm so baffled that assemblers for different architectures use different comment characters
- `mov x8, 93` means `x8 = 93` - the `x8` register is where we pass the system call number to the operating system, and `93` is the Linux system call number for `exit` on ARM64. RISC-V Linux uses the same number, while x86-64 Linux uses 60. On OSX it would be some different number.
- `mov x0, 7` means `x0 = 7` - the `x0` register is where we pass the first argument, in this case the exit code
- `svc 0` performs the system call on Linux
I don't have an ARM Mac lying around, so I never tried running it, but from the documentation, this is how the same program would look on OSX:
    .global _start
    .text
    _start:
      /* _exit(7) */
      mov x16, 1
      mov x0, 7
      svc 0x80
A few things are different:
- the system call number is different, `93` on ARM64 Linux and `1` on ARM64 OSX. ARM64 and RISC-V Linux share `93` (x86-64 Linux uses `60`), and x86-64 OSX uses different numbers from ARM64 OSX.
- we pass the system call number in `x16` instead of `x8`
- we use `svc 0x80` instead of `svc 0` to call the operating system
The Hello, World also looks similar to the x86-64 and RISC-V versions, but there are some real differences below the surface that we'll get to:
    .global _start
    .text
    _start:
      /* write(1, "Hello, World!\n", 14) */
      mov x8, 64
      mov x0, 1
      ldr x1, =hello
      mov x2, 14
      svc 0
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    .data
    hello:
      .ascii "Hello, World!\n"
Let's run it:
    $ as hello.s -o hello.o
    $ ld hello.o -o hello
    $ ./hello
    Hello, World!
- `mov x8, 64` means `x8 = 64`, that's the Linux system call number for `write`
- `mov x0, 1` means `x0 = 1`, that means standard output
- `ldr x1, =hello` means `x1 = address of hello`, but there's more here
- `mov x2, 14` means `x2 = 14`, that's the length of the string
- `.data` is the data section
- `hello` is a label for where we have the string
- `.ascii "Hello, World!\n"` is the string itself
Assembly programs don't pass strings and other such objects around, they pass their addresses in memory. On a 64-bit machine, addresses are 64 bits, or 8 bytes. So how do we load an address into a register?
On x86-64 it's super easy - instructions have variable length, so if you need to load a 64-bit address or any other 64-bit number into a register, you can use the `mov` instruction, and it will be 10 bytes (2 to select the instruction, 8 for data), but that's fine.
ARM64 instructions are all 32 bits long. Part of that needs to select which instruction we use, so an instruction can't contain a 32-bit number, let alone a 64-bit one. So how does that work?
Enough talk, let's take a peek inside with `objdump`. `-d` means to disassemble, `-s` to show contents of each section:
    $ objdump -ds ./hello
    ./hello:     file format elf64-littleaarch64
    Contents of section .text:
     4000b0 080880d2 200080d2 c1000058 c20180d2  .... ......X....
     4000c0 010000d4 a80b80d2 e00080d2 010000d4  ................
     4000d0 d8004100 00000000                    ..A.....
    Contents of section .data:
     4100d8 48656c6c 6f2c2057 6f726c64 210a      Hello, World!.
    Disassembly of section .text:
    00000000004000b0 <_start>:
      4000b0: d2800808  mov x8, #0x40  // #64
      4000b4: d2800020  mov x0, #0x1   // #1
      4000b8: 580000c1  ldr x1, 4000d0 <_start+0x20>
      4000bc: d28001c2  mov x2, #0xe   // #14
      4000c0: d4000001  svc #0x0
      4000c4: d2800ba8  mov x8, #0x5d  // #93
      4000c8: d28000e0  mov x0, #0x7   // #7
      4000cc: d4000001  svc #0x0
      4000d0: 004100d8  .word 0x004100d8
      4000d4: 00000000  .word 0x00000000
- data ended up at address `0x00000000004100d8` - the only data entry is our string there
- code ended up at address `0x00000000004000b0` - the only function is our `_start`
- but there's something weird following the `_start` function, there's a big number `0x00000000004100d8` (`objdump` splits it between two lines, but it's a single 64-bit number)
- `ldr x1, =hello` got translated to `ldr x1, 4000d0` - but that decoding is not completely accurate. What's actually in the instruction is the offset of the value we're loading, relative to the address of the current instruction. The whole memory might be many GBs, but the constant pool is very close to the function itself, so the small offset that fits in the instruction is generally enough.
- so the address of `hello` isn't anywhere in the code, it's in the constant pool just after the function, and the code loads the address of `hello` from the constant pool
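To make the relative addressing concrete, here's a minimal Python sketch that decodes the `ldr` instruction word from the listing above. It's only an illustration: the function name `decode_ldr_literal` is made up for this example, and it handles just the 64-bit PC-relative form of `ldr`, where the top byte is `0x58`, the low 5 bits are the register number, and the next 19 bits are a signed offset counted in 4-byte words.

```python
def decode_ldr_literal(instruction, pc):
    """Decode a 64-bit ARM64 `ldr xN, <label>` (PC-relative literal load).

    Returns (register_number, target_address).
    """
    # The top byte 0x58 identifies the 64-bit LDR (literal) form.
    if (instruction >> 24) != 0x58:
        raise ValueError("not a 64-bit LDR (literal) instruction")
    register = instruction & 0x1F          # bits 0-4: destination register
    imm19 = (instruction >> 5) & 0x7FFFF   # bits 5-23: signed word offset
    if imm19 & 0x40000:                    # sign-extend the 19-bit field
        imm19 -= 0x80000
    return register, pc + imm19 * 4        # offset is counted in 4-byte words


# Instruction word and its address, both taken from the objdump output above.
register, target = decode_ldr_literal(0x580000C1, 0x4000B8)
print(f"ldr x{register}, {target:x}")
```

The offset stored in the instruction is just 6 words (24 bytes), and adding it to the instruction's own address `4000b8` gives the `4000d0` that `objdump` shows.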
Constant pools aren't the only way to load big numbers on ARM64. You can also use 4 instructions to load 4 16-bit chunks with `movk`, but that would generally be slower and take even more space - 4 instructions and 16 bytes instead of 1 instruction and 12 bytes.
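As a rough illustration of the 4-chunk approach, here's a Python sketch that splits a 64-bit value into 16-bit chunks and prints one possible instruction sequence. The helper names are made up for this example, and it assumes the usual pattern of one `movz` (which zeroes the rest of the register) followed by `movk` for the remaining chunks. A real assembler or compiler would likely skip chunks that are zero.

```python
def split_into_16_bit_chunks(value):
    """Split a 64-bit value into four 16-bit chunks, lowest chunk first."""
    return [(value >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]


def movk_sequence(register, value):
    """Build four instructions that load `value` into `register`."""
    chunks = split_into_16_bit_chunks(value)
    # movz writes the lowest chunk and zeroes the rest of the register
    lines = [f"movz {register}, #0x{chunks[0]:04x}"]
    # each movk overwrites one 16-bit slice and keeps the other bits
    for index, chunk in enumerate(chunks[1:], start=1):
        lines.append(f"movk {register}, #0x{chunk:04x}, lsl #{16 * index}")
    return lines


# 0x4100d8 is the address of `hello` from the objdump output above
for line in movk_sequence("x1", 0x4100D8):
    print(line)
```

That's four 4-byte instructions, which is where the 16 bytes come from.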
The whole situation is definitely more complicated than on x86-64. Normally the compiler deals with all that complexity for you, and in this case even the assembler does some of it for you.
Loops are very straightforward. Just like x86-64 and unlike RISC-V, ARM64 has a separate flags register, so first we compare with one instruction that sets some flags, then we do a conditional jump with another instruction.
This code keeps the loop counter in `x19`. It starts at `5`, goes down by `1` every iteration, and the loop ends when it reaches `0`.
    .global _start
    .text
    _start:
      mov x19, 5
    loop:
      /* write(1, "Hello, World!\n", 14) */
      mov x8, 64
      mov x0, 1
      ldr x1, =hello
      mov x2, 14
      svc 0
      /* x19 = x19 - 1 */
      sub x19, x19, 1
      /* if x19 != 0 goto loop for another iteration */
      cmp x19, 0
      b.ne loop
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    .data
    hello:
      .ascii "Hello, World!\n"
It indeed prints the message 5 times:
    $ as loop.s -o loop.o
    $ ld loop.o -o loop
    $ ./loop
    Hello, World!
    Hello, World!
    Hello, World!
    Hello, World!
    Hello, World!
As usual, the most challenging part is converting numbers to strings. It's the same algorithm - building the string digit by digit starting from the last one. I put some comments all over the code, hopefully it should be clear enough.
    .global _start
    .text
    print_number:
      /* start with x1 pointing at last character of the buffer */
      /* that's where the digit will go */
      /* x2 is total count of characters to print (including newline) */
      ldr x1, =buffer
      add x1, x1, 31
      mov x2, 2
      mov x3, 10
    print_number_loop:
      /* do one digit, shift x0 */
      /* x4 = x0/10 */
      /* x5 = x0%10 */
      sdiv x4, x0, x3
      /* ARM doesn't have a modulo instruction, but it has "multiply and subtract" instruction */
      msub x5, x4, x3, x0
      /* add 48 to convert number to ASCII code, then write to buffer */
      add x5, x5, 48
      /* strb = SToRe Byte */
      /* w5 is bottom 32bits of x5 */
      /* it doesn't really matter, as we're only writing the lowest byte */
      strb w5, [x1]
      /* check if x4 is 0 */
      /* if yes, we're done and can print what we built */
      /* if not, more digits are coming */
      cmp x4, 0
      b.eq print_number_loop_done
      mov x0, x4
      sub x1, x1, 1
      add x2, x2, 1
      b print_number_loop
    print_number_loop_done:
      /* output some part of the buffer */
      /* write(1, x1, x2) */
      mov x8, 64
      mov x0, 1
      svc 0
      ret
    _start:
      /* load big_number from the constant pool to x0 */
      ldr x0, =big_number
      /* call print_number */
      bl print_number
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    .data
    big_number = 12345678901234
    /* just put some random stuff in the buffer */
    /* we'll overwrite it before printing anyway (except final \n) */
    buffer:
      .ascii "0123456789abcdef0123456789abcdef\n"
That works just as expected:
    $ as print_number.s -o print_number.o
    $ ld print_number.o -o print_number
    $ ./print_number
    12345678901234
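For comparison, here's the same digit-by-digit algorithm as a Python sketch. It's only meant to mirror the assembly step by step, with comments pointing at the matching instructions. The function name `number_to_bytes` is made up for this example, and like the assembly it's only meant for non-negative numbers.

```python
def number_to_bytes(number):
    """Mirror print_number: fill a buffer from the end, one digit at a time."""
    buffer = bytearray(b"0123456789abcdef0123456789abcdef\n")
    position = 31  # like x1: index of the last character before the newline
    count = 2      # like x2: one digit plus the trailing newline
    while True:
        quotient = number // 10         # sdiv x4, x0, x3
        digit = number - quotient * 10  # msub x5, x4, x3, x0
        buffer[position] = digit + 48   # add x5, x5, 48 and strb w5, [x1]
        if quotient == 0:               # cmp x4, 0 and b.eq
            break
        number = quotient               # mov x0, x4
        position -= 1                   # sub x1, x1, 1
        count += 1                      # add x2, x2, 1
    # like write(1, x1, x2): count bytes starting at position
    return bytes(buffer[position:position + count])


print(number_to_bytes(12345678901234))
```

The leftover junk at the start of the buffer never gets printed, because the slice starts at the first digit we wrote.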
Let's look inside too:
    $ objdump -ds print_number
    print_number:     file format elf64-littleaarch64
    Contents of section .text:
     4000b0 01030058 217c0091 420080d2 430180d2  ...X!|..B...C...
     4000c0 040cc39a 8580039b a5c00091 25000039  ............%..9
     4000d0 9f0000f1 a0000054 e00304aa 210400d1  .......T....!...
     4000e0 42040091 f7ffff17 080880d2 200080d2  B........... ...
     4000f0 010000d4 c0035fd6 00010058 edffff97  ......_....X....
     400100 a80b80d2 e00080d2 010000d4 00000000  ................
     400110 20014100 00000000 f22fce73 3a0b0000   .A....../.s:...
    Contents of section .data:
     410120 30313233 34353637 38396162 63646566  0123456789abcdef
     410130 30313233 34353637 38396162 63646566  0123456789abcdef
     410140 0a                                   .
    Disassembly of section .text:
    00000000004000b0 <print_number>:
      4000b0: 58000301  ldr x1, 400110 <_start+0x18>
      4000b4: 91007c21  add x1, x1, #0x1f
      4000b8: d2800042  mov x2, #0x2   // #2
      4000bc: d2800143  mov x3, #0xa   // #10
    00000000004000c0 <print_number_loop>:
      4000c0: 9ac30c04  sdiv x4, x0, x3
      4000c4: 9b038085  msub x5, x4, x3, x0
      4000c8: 9100c0a5  add x5, x5, #0x30
      4000cc: 39000025  strb w5, [x1]
      4000d0: f100009f  cmp x4, #0x0
      4000d4: 540000a0  b.eq 4000e8 <print_number_loop_done>  // b.none
      4000d8: aa0403e0  mov x0, x4
      4000dc: d1000421  sub x1, x1, #0x1
      4000e0: 91000442  add x2, x2, #0x1
      4000e4: 17fffff7  b 4000c0 <print_number_loop>
    00000000004000e8 <print_number_loop_done>:
      4000e8: d2800808  mov x8, #0x40  // #64
      4000ec: d2800020  mov x0, #0x1   // #1
      4000f0: d4000001  svc #0x0
      4000f4: d65f03c0  ret
    00000000004000f8 <_start>:
      4000f8: 58000100  ldr x0, 400118 <_start+0x20>
      4000fc: 97ffffed  bl 4000b0 <print_number>
      400100: d2800ba8  mov x8, #0x5d  // #93
      400104: d28000e0  mov x0, #0x7   // #7
      400108: d4000001  svc #0x0
      40010c: 00000000  .inst 0x00000000 ; undefined
      400110: 00410120  .word 0x00410120
      400114: 00000000  .word 0x00000000
      400118: 73ce2ff2  .word 0x73ce2ff2
      40011c: 00000b3a  .word 0x00000b3a
The `.data` section contains our buffer and nothing else, that makes sense. The instructions generally correspond to what we wrote, but there's also one mystery entry at `40010c`, a bunch of extra zeroes. It's there to pad the code to a multiple of 64 bits, so the constant pool starts at a 64-bit boundary.
Architectures differ on their alignment requirements. Usually "misaligned" data access still works, it's just slower. And it used to be the case on some architectures, that unaligned data access just wouldn't be supported at all.
I don't think any modern architecture strictly requires alignment for normal data access, but compilers still care about it as it's bad for performance, and occasionally some extra features like atomic memory access or SIMD might require aligned addresses. It's easiest to just align all the things. A few extra zeroes are usually no big deal, and it's done for us automatically.
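The padding in the listing above comes down to a bit of arithmetic. Here's a small Python sketch of it, using the common round-up-to-a-power-of-two trick. The helper name `align_up` is made up for this example, and this is just the general idea, not a claim about how the assembler implements it internally.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


code_end = 0x40010C  # first address after the last `svc 0` in print_number
pool_start = align_up(code_end, 8)
print(hex(pool_start), pool_start - code_end)
```

In the Hello, World example the code ended at `4000d0`, which is already a multiple of 8, so no padding was needed there.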
We can print one number, so how about we print a bunch of them? We already know how to loop, so it should be easy. Other than `b.le` (branch if less or equal), there's nothing new here:
    .global _start
    .text
    print_number:
      ldr x1, =buffer
      add x1, x1, 31
      mov x2, 2
      mov x3, 10
    print_number_loop:
      sdiv x4, x0, x3
      msub x5, x4, x3, x0
      add x5, x5, 48
      strb w5, [x1]
      cmp x4, 0
      b.eq print_number_loop_done
      mov x0, x4
      sub x1, x1, 1
      add x2, x2, 1
      b print_number_loop
    print_number_loop_done:
      mov x8, 64
      mov x0, 1
      svc 0
      ret
    _start:
      mov x19, 1
    loop:
      mov x0, x19
      bl print_number
      add x19, x19, 1
      cmp x19, 20
      b.le loop
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    .data
    /* just put some random stuff in the buffer */
    /* we'll overwrite it before printing anyway (except final \n) */
    buffer:
      .ascii "0123456789abcdef0123456789abcdef\n"
    $ as print_loop.s -o print_loop.o
    $ ld print_loop.o -o print_loop
    $ ./print_loop
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
And now we're ready to write the FizzBuzz! We've seen all the pieces before, now it's the time to assemble them.
    .global _start
    .text
    print_number:
      ldr x1, =buffer
      add x1, x1, 31
      mov x2, 2
      mov x3, 10
    print_number_loop:
      sdiv x4, x0, x3
      msub x5, x4, x3, x0
      add x5, x5, 48
      strb w5, [x1]
      cmp x4, 0
      b.eq print_number_loop_done
      mov x0, x4
      sub x1, x1, 1
      add x2, x2, 1
      b print_number_loop
    print_number_loop_done:
      mov x8, 64
      mov x0, 1
      svc 0
      ret
    _start:
      mov x19, 1
      mov x20, 3
      mov x21, 5
    loop:
      /* x5 = x19 % 3 */
      sdiv x4, x19, x20
      msub x5, x4, x20, x19
      /* is the remainder zero? */
      cmp x5, 0
      b.eq divides_by_three
    does_not_divide_by_three:
      /* x5 = x19 % 5 */
      sdiv x4, x19, x21
      msub x5, x4, x21, x19
      /* is the remainder zero? */
      cmp x5, 0
      b.eq divides_by_five_only
    divides_by_neither:
      /* print_number(x19) */
      mov x0, x19
      bl print_number
      b continue_loop
    divides_by_three:
      /* x5 = x19 % 5 */
      sdiv x4, x19, x21
      msub x5, x4, x21, x19
      /* is the remainder zero? */
      cmp x5, 0
      b.eq divides_by_three_and_five
    divides_by_three_only:
      /* write(1, "Fizz\n", 5) */
      mov x8, 64
      mov x0, 1
      ldr x1, =fizz
      mov x2, 5
      svc 0
      b continue_loop
    divides_by_five_only:
      /* write(1, "Buzz\n", 5) */
      mov x8, 64
      mov x0, 1
      ldr x1, =buzz
      mov x2, 5
      svc 0
      b continue_loop
    divides_by_three_and_five:
      /* write(1, "FizzBuzz\n", 9) */
      mov x8, 64
      mov x0, 1
      ldr x1, =fizzbuzz
      mov x2, 9
      svc 0
    continue_loop:
      add x19, x19, 1
      cmp x19, 100
      b.le loop
      /* _exit(7) */
      mov x8, 93
      mov x0, 7
      svc 0
    .data
    /* just put some random stuff in the buffer */
    /* we'll overwrite it before printing anyway (except final \n) */
    buffer:
      .ascii "0123456789abcdef0123456789abcdef\n"
    fizz:
      .ascii "Fizz\n"
    buzz:
      .ascii "Buzz\n"
    fizzbuzz:
      .ascii "FizzBuzz\n"
Which does the expected thing:
    $ as fizzbuzz.s -o fizzbuzz.o
    $ ld fizzbuzz.o -o fizzbuzz
    $ ./fizzbuzz
    1
    2
    Fizz
    4
    Buzz
    Fizz
    7
    8
    Fizz
    Buzz
    11
    Fizz
    13
    14
    FizzBuzz
    16
    17
    Fizz
    19
    Buzz
    ...
    Buzz
    Fizz
    97
    98
    Fizz
    Buzz
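If you want something to check the assembly output against, here's a Python sketch with the same branch layout. The function names are made up for this example, the comments name the matching assembly labels, and the remainder is computed the same way as with `sdiv` and `msub` (for positive numbers Python's `//` gives the same quotient as `sdiv`).

```python
def remainder(number, divisor):
    """Remainder the way the assembly computes it: divide, then multiply and subtract."""
    quotient = number // divisor        # sdiv
    return number - quotient * divisor  # msub


def fizzbuzz_lines(limit=100):
    """Return the FizzBuzz output for 1..limit as a list of lines."""
    lines = []
    for number in range(1, limit + 1):
        if remainder(number, 3) == 0:
            if remainder(number, 5) == 0:
                lines.append("FizzBuzz")  # divides_by_three_and_five
            else:
                lines.append("Fizz")      # divides_by_three_only
        elif remainder(number, 5) == 0:
            lines.append("Buzz")          # divides_by_five_only
        else:
            lines.append(str(number))     # divides_by_neither
    return lines


print("\n".join(fizzbuzz_lines(15)))
```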
ARM64 assembly is definitely not for writing any real programs.
Assembly is still a lot of fun to play with. If you want to learn some assembly for fun, and are trying to decide which one to start with, I recommend x86-64.
You probably have an x86-64 computer already. There are orders of magnitude more x86-64 code you might want to decompile than ARM64 code - software for ARM devices like phones often doesn't even come as machine code, it comes in some bytecode format that only gets Just In Time compiled on the device. For x86-64 you have old games you might want to hack, CTF hacking challenges, real software with real security issues, and so on. x86-64 is also significantly more approachable, and it tried its best to keep things as similar as it could over the decades. ARM is a lot more fragmented, with multiple significantly different variants, so what you learn will need a lot more adjustment for a different ARM device.
But once you've played with x86-64 enough, and want another one to try, ARM is the second most likely one you'd have access to. Get a Raspberry Pi, preferably a 64-bit kind, and have a go.