Tomasz Wegrzanowski

Posted on Dec 29, 2021

100 Languages Speedrun: Episode 40: x86-64 Assembly

#asm #x86 #intel

We write in all kinds of programming languages, but in the end, an actual CPU needs to run the code.

A CPU sees a list of numbers. It takes a few of those numbers at a time, interprets them as an instruction to run, then goes to the next one. At least that's the general idea; CPUs are now far more complex than that. To keep the post approachable, I'll do my best to ignore all that extra complexity.

We'll be using x86-64 assembly. x86 is the instruction set originally introduced by the 16-bit Intel 8086 processor, which then got expanded to 32-bit and finally to 64-bit, getting all sorts of extensions along the way. x86-64 just means x86 in 64-bit mode.

Currently there are only two CPU architectures that matter. x86-64 is completely dominant with PCs and laptops. ARM is completely dominant with phones, tablets, and other smart devices. Generally x86-64 offers better performance, while ARM offers better power efficiency. Other kinds of CPUs see a lot less use.

x86-64 assembly for OSX, Linux, and Windows works the same as far as calculations go, but different operating systems have different ways of telling the operating system to print data and so on. Code for this episode will run on OSX.

How to run assembly

The two names are often used interchangeably, as it's generally clear from the context, but "machine code" refers to the numbers the CPU sees, while "assembly" refers to human readable text that is turned into those numbers in a pretty much one-to-one way.

Enough introduction, let's brew install nasm and write some code:

global start

section .text

start:
  ; Tell operating system to exit with code 7
  mov rax, 0x2000001 ; B8 01 00 00 02
  mov rdi, 7         ; BF 07 00 00 00
  syscall            ; 0F 05

section .data

And compile it:

$ nasm -f macho64 exit.asm
$ ld -static -o exit exit.o
$ ./exit
$ echo $?
7

This is as simple an assembly program as it gets.

Step by step:

I put in comments the numbers this code gets turned into
global start and start: are there to tell the operating system where we want to start our program.
section .text starts the executable code, that doesn't really have anything to do with text, but the name stuck
section .data starts the data section, which in our case is empty
rax, rdi etc. are 64-bit registers, which are used to store numbers - CPU has a bunch of those registers, which it can access super fast, if you need more data, you need to put it in "memory", which is a lot slower
mov rax, 0x2000001 means rax = 0x2000001
mov rdi, 7 means rdi = 7
syscall means to call operating system to do something - arguments are passed in registers
on OSX system call 0x2000001 is for exit, and rdi contains exit code - which is generally used to tell the parent process if we succeeded or not, we can see it in shell with $?
due to its complicated history OSX has multiple sets of system calls, 0x0200.... are where all the usual ones go. On Linux the interface is very similar, mainly the system call numbers are different.
for register names r prefix means 64bit, e prefix means 32bit, without any means 16bit, and there are special names for 8bit. There's also a lot of special registers for floating point numbers, doing multiple operations at once, and so on. CPUs are really complicated, but we'll be sticking to simple stuff.
first we use nasm to compile one assembly file to "object file", then we use ld to gather a bunch of object files into a final executable. This two step process is fairly common with many compiled languages.
nasm -f macho64 just means OSX 64-bit.
ld -static means we don't want to link to any libraries and we'll do everything on our own - so no printf, atoi, malloc or such, just raw assembly and syscalls.
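
To make the Linux remark above concrete, here is a minimal sketch of the same program for 64-bit Linux. The file name exit_linux.asm is just something I made up for illustration. The differences I'm relying on are that exit is system call number 60 there, and that the linker looks for _start by default.

```nasm
; exit_linux.asm (hypothetical file name) - same program, Linux flavor
global _start

section .text

_start:
  ; Tell operating system to exit with code 7
  mov rax, 60        ; Linux x86-64 system call number for exit
  mov rdi, 7         ; exit code, same register as on OSX
  syscall
```

On Linux this would be built with nasm -f elf64 exit_linux.asm and then ld -o exit_linux exit_linux.o, and echo $? after running it should print 7 just like before.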

Disassembly

We can do it in reverse, and use objdump to turn binary data back into readable text:

$ objdump -x86-asm-syntax=intel -d exit.o

exit.o: file format Mach-O 64-bit x86-64


Disassembly of section __TEXT,__text:

0000000000000000 start:
       0: b8 01 00 00 02                mov eax, 33554433
       5: bf 07 00 00 00                mov edi, 7
       a: 0f 05                         syscall

You should already have objdump installed. We need to pass it -x86-asm-syntax=intel to specify that we want the Intel / NASM syntax. For stupid historical reasons there are multiple different syntaxes, and GNU tooling defaults to a non-standard one (known as AT&T syntax), with a lot of sigils and with order of arguments that's backwards:

$ objdump -d exit.o

exit.o: file format Mach-O 64-bit x86-64


Disassembly of section __TEXT,__text:

0000000000000000 start:
       0: b8 01 00 00 02                movl    $33554433, %eax
       5: bf 07 00 00 00                movl    $7, %edi
       a: 0f 05                         syscall

Oh wait, why are they referencing eax not rax? eax is the bottom 32bits of 64bit register rax. When writing to eax, the top 32bits are all automatically set to 0.
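
Here's a small sketch to see that zeroing for yourself. It's not from the original programs, just an experiment: fill rax with ones, write to eax, then shift the top half down and use it as the exit code.

```nasm
global start

section .text

start:
  mov rax, -1        ; rax = 0xFFFFFFFFFFFFFFFF, all 64 bits set
  mov eax, 7         ; 32-bit write: top 32 bits get cleared, rax = 7
  shr rax, 32        ; move the top 32 bits down, so rax = 0
  mov rdi, rax       ; use that as exit code

  ; exit(rdi)
  mov rax, 0x2000001
  syscall

section .data
```

After running it, echo $? shows 0. If you change mov eax, 7 to mov ax, 7, the 16-bit write leaves the rest of rax alone, the shifted value is 0xFFFFFFFF, and since the shell only sees the lowest 8 bits of an exit code, you get 255 instead.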

Hello, World!

All right, let's write a simple program that prints "Hello, World!" to the screen.

For that we'll need a second system call, write, that takes three arguments: file descriptor, memory address, and amount of data we're writing. Every program starts with 0 as "standard input", 1 as "standard output", and 2 as "standard error", and any extra files or Internet connections we open get extra file descriptor numbers.

global start

section .text

start:
  ; write(1, "Hello, World!\n", 14)
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, hello
  mov rdx, 14
  syscall

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
hello:
  db "Hello, World!", 10

You probably won't be too surprised by the result:

$ nasm -f macho64 hello.asm
$ ld -static -o hello hello.o
$ ./hello
Hello, World!

Step by step:

in .data section we add a label hello, and put "Hello, World!", 10 there. 10 is just code for \n, as nasm doesn't process escape codes inside double-quoted strings.
we calculated length of it manually to be 14 - nasm can definitely help with that, and we'll get to how
mov rsi, hello doesn't move the string to rsi - rsi only contains numbers, it cannot contain strings - it just means to assign to rsi the memory address where hello is.
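
The file descriptor is just the number in rdi, so the same pattern works for standard error. A minimal sketch, with a made-up message and label name, that writes to file descriptor 2 and exits with a non-zero code:

```nasm
global start

section .text

start:
  ; write(2, "Something went wrong\n", 21)
  mov rax, 0x2000004
  mov rdi, 2         ; 2 = standard error
  mov rsi, message
  mov rdx, 21        ; 20 characters plus the newline
  syscall

  ; exit(1)
  mov rax, 0x2000001
  mov rdi, 1
  syscall

section .data
message:
  db "Something went wrong", 10
```

If you run it with 2>/dev/null in the shell, the message disappears, which is an easy way to check it really went to standard error and not standard output.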

Loop

Let's write a simple loop that prints Hello, World! five times.

global start

section .text

start:
  ; rbx = 5
  ; start the loop
  mov rbx, 5
  jmp loop_check

loop_iteration:
  ; inside loop body
  ; write(1, "Hello, World!\n", 14)
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, hello
  mov rdx, hello_len
  syscall
  ; rbx -= 1
  dec rbx

loop_check:
  ; check if we want to run the loop or not
  ; if (rbx != 0) goto loop_iteration
  cmp rbx, 0
  jne loop_iteration

  ; we're outside the loop now
  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
hello:
  db "Hello, World!", 10
hello_len: equ $ - hello

It does what we expect:

$ ./loop
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!

What's going on here:

putting hello_len: equ $ - hello just after hello defines a constant, that says "how far are we from start of hello". This is just a number and isn't stored into memory. This way we can have nasm do string lengths and such for us
we store iteration counter in rbx, starting from 5 and ending the loop when it hits 0
we can either put loop body first, or loop condition first
jmp loop_check simply jumps to loop_check
cmp rbx, 0 compares if rbx is > 0, = 0, or < 0 and sets some CPU flags
after cmp runs, we can do jne loop_iteration which means jump if not equal - it checks the flags set by the last instruction that set them
this isn't the "optimal" way to do this
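
For comparison, here's one common way to tighten it, as a sketch rather than a claim about what's fastest. It puts the loop body first, and relies on dec already setting the zero flag, so the separate cmp goes away. It assumes the counter starts at 1 or more.

```nasm
global start

section .text

start:
  ; rbx = 5
  mov rbx, 5

loop_iteration:
  ; write(1, "Hello, World!\n", hello_len)
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, hello
  mov rdx, hello_len
  syscall

  ; rbx -= 1, and dec sets the zero flag when the result is 0
  dec rbx
  ; if (rbx != 0) goto loop_iteration
  jnz loop_iteration

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
hello:
  db "Hello, World!", 10
hello_len: equ $ - hello
```

jnz is just another name for jne, it's the same instruction. The output is the same five lines as before.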

Print numbers

If we linked with C standard library, we could use printf, but that's not what we're going for.

Instead let's write our own print_number function:

global start

section .text

; number to use goes into rax
print_number:
  ; we'll build the string to print backwards
  ; so 1234 will be built step by step as
  ; "\n" "4\n" "34\n" "234\n" "1234\n"
  mov rbx, buffer_last_byte
  mov [rbx], byte 10

print_number_loop:
  ; make space for another character
  dec rbx

  ; div is more complicated: it uses 2 registers for input, and 2 for output
  ; input is always rdx:rax, so we set rdx to 0 first
  ; after div: rdx = input % 10
  ;            rax = input / 10
  mov rdx, 0
  mov rdi, 10
  div rdi

  ; we add 48 to turn numbers 0-9 to ascii codes for digits 48-57
  ; then store it in a string
  add rdx, 48
  mov [rbx], dl

  ; if rax is 0, we're done, otherwise continue
  cmp rax, 0
  jne print_number_loop

  ; time to tell operating system what we want to print
  ; we know how many bytes to print by how far rbx moved from end of the buffer
  ; write(1, rbx, buffer_after-rbx)
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, rbx
  mov rdx, buffer_after
  sub rdx, rsi
  syscall

  ; return to caller
  ret

start:
  ; call print_number(12345678)
  ; it saves return address on stack
  ; when ret is called we return to continue this code
  mov rax, 12345678
  call print_number

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
buffer:
  db "                        "
buffer_last_byte: equ $ - 1
buffer_after: equ $

Hopefully comments in the code explain enough. It works just fine:

$ ./print_number
12345678

Oh wait, what happens if we didn't leave it enough memory, and the number is too big? The program crashes of course. And hackers can take advantage of that, and take over your computer. Better hope you did it right.
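
For this particular buffer we can check the worst case by hand. The biggest number that fits in a 64-bit register is 18446744073709551615, which is 20 digits, plus 1 byte for the newline, so 21 bytes, and the buffer has 24. A sketch that exercises exactly that case, reusing print_number unchanged:

```nasm
global start

section .text

; number to use goes into rax
print_number:
  mov rbx, buffer_last_byte
  mov [rbx], byte 10
print_number_loop:
  dec rbx
  mov rdx, 0
  mov rdi, 10
  div rdi                  ; unsigned division, so all 64 bits count
  add rdx, 48
  mov [rbx], dl
  cmp rax, 0
  jne print_number_loop
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, rbx
  mov rdx, buffer_after
  sub rdx, rsi
  syscall
  ret

start:
  ; biggest unsigned 64-bit value: 20 digits + newline = 21 bytes
  mov rax, 0xFFFFFFFFFFFFFFFF
  call print_number

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
buffer:
  db "                        " ; 24 bytes
buffer_last_byte: equ $ - 1
buffer_after: equ $
```

This prints 18446744073709551615. Shrink the buffer below 21 bytes and the same call starts writing before the start of buffer, which is the kind of bug the paragraph above is about.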

Print numbers with loop

Now that we have our number printing function, we can call it in a loop:

global start

section .text

print_number:
  mov rbx, buffer_last_byte
  mov [rbx], byte 10
print_number_loop:
  dec rbx
  mov rdx, 0
  mov rdi, 10
  div rdi
  add rdx, 48
  mov [rbx], dl
  cmp rax, 0
  jne print_number_loop
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, rbx
  mov rdx, buffer_after
  sub rdx, rsi
  syscall
  ret

start:
  ; r12 = 0
  mov r12, 0

  ; do {
  ;  r12 = r12 + 1; print_number(r12);
  ; } while (r12 < 10);
loop:
  inc r12
  mov rax, r12
  call print_number
  cmp r12, 10
  jl loop

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
buffer:
  db "                        "
buffer_last_byte: equ $ - 1
buffer_after: equ $

Which prints the numbers:

$ ./print_loop
1
2
3
4
5
6
7
8
9
10

You might already be noticing a small problem. If a CPU only has a small number of registers, and every function needs some of them to do its work, how do we decide which function gets to use which register? There are many very complicated conventions for this. Some registers are "callee-saved": if the called function wants to use them, it should save them on the stack and restore them before returning. Others are "caller-saved": the caller should expect them to get overwritten, and if it wants them preserved, it needs to save them on the stack itself. syscall follows similar rules, and some registers won't be preserved by it. This usually works well enough.

For now our programs are small enough we can completely ignore the issue.
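
Still, here's a rough sketch of what playing by the rules could look like. In the convention used on OSX and Linux, rbx is one of the callee-saved registers, so this version of print_number saves it with push and restores it with pop. That lets the caller keep its own counter in rbx. The countdown label is made up for this example.

```nasm
global start

section .text

; number to use goes into rax
print_number:
  push rbx                 ; rbx is callee-saved: save the caller's value
  mov rbx, buffer_last_byte
  mov [rbx], byte 10
print_number_loop:
  dec rbx
  mov rdx, 0
  mov rdi, 10
  div rdi
  add rdx, 48
  mov [rbx], dl
  cmp rax, 0
  jne print_number_loop
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, rbx
  mov rdx, buffer_after
  sub rdx, rsi
  syscall
  pop rbx                  ; restore the caller's value before returning
  ret

start:
  ; the caller keeps its counter in rbx
  mov rbx, 3
countdown:
  mov rax, rbx
  call print_number
  dec rbx
  jnz countdown

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
buffer:
  db "                        "
buffer_last_byte: equ $ - 1
buffer_after: equ $
```

This prints 3, 2, 1 on separate lines. Without the push and pop, rbx would come back from print_number holding an address inside buffer, and the countdown would go badly wrong.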

FizzBuzz

And now that we have it, we can build the FizzBuzz:

global start

section .text

print_number:
  mov rbx, buffer_last_byte
  mov [rbx], byte 10
print_number_loop:
  dec rbx
  mov rdx, 0
  mov rdi, 10
  div rdi
  add rdx, 48
  mov [rbx], dl
  cmp rax, 0
  jne print_number_loop
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, rbx
  mov rdx, buffer_after
  sub rdx, rsi
  syscall
  ret

start:
  ; r12 = 0
  mov r12, 0

loop:
  ; r12 += 1
  inc r12

  ; if (r12 % 3 == 0) goto divides_by_3
  mov rdx, 0
  mov rax, r12
  mov rdi, 3
  div rdi
  cmp rdx, 0
  je divides_by_3

  ; if (r12 % 5 == 0) goto divides_only_by_5
  mov rdx, 0
  mov rax, r12
  mov rdi, 5
  div rdi
  cmp rdx, 0
  je divides_only_by_5

does_not_divide_by_3_or_5:
  mov rax, r12
  call print_number
  jmp loop_continue

divides_only_by_5:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, buzz
  mov rdx, buzz_len
  syscall
  jmp loop_continue

divides_by_3:
  ; if (r12 % 5 == 0) goto divides_by_3_and_5
  mov rdx, 0
  mov rax, r12
  mov rdi, 5
  div rdi
  cmp rdx, 0
  je divides_by_3_and_5

divides_only_by_3:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, fizz
  mov rdx, fizz_len
  syscall
  jmp loop_continue

divides_by_3_and_5:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, fizzbuzz
  mov rdx, fizzbuzz_len
  syscall
  jmp loop_continue

loop_continue:
  cmp r12, iterations
  jl loop

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
buffer:
  db "                        "
buffer_last_byte: equ $ - 1
buffer_after: equ $

fizz:
  db "Fizz", 10
fizz_len: equ $ - fizz

buzz:
  db "Buzz", 10
buzz_len: equ $ - buzz

fizzbuzz:
  db "FizzBuzz", 10
fizzbuzz_len: equ $ - fizzbuzz

iterations: equ 100

Which does exactly what we want.
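
The three write blocks are nearly identical, so one possible cleanup is a small helper, using the same call and ret as print_number. This is only a sketch: print_string is a name I made up, and it expects the address in rsi and the length in rdx.

```nasm
global start

section .text

; rsi = address of the string, rdx = its length in bytes
print_string:
  ; write(1, rsi, rdx)
  mov rax, 0x2000004
  mov rdi, 1
  syscall
  ret

start:
  mov rsi, fizz
  mov rdx, fizz_len
  call print_string

  mov rsi, buzz
  mov rdx, buzz_len
  call print_string

  ; exit(0)
  mov rax, 0x2000001
  mov rdi, 0
  syscall

section .data
fizz:
  db "Fizz", 10
fizz_len: equ $ - fizz

buzz:
  db "Buzz", 10
buzz_len: equ $ - buzz
```

This prints Fizz and Buzz on separate lines. In the FizzBuzz program each of the three branches would shrink to two mov instructions, a call, and the jmp loop_continue.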

These examples are all extremely unoptimized just to keep things simple. I also feel like I barely scratched the surface, but this episode is already nearly the longest so far, so I'll just end here. I might do a followup episode after the series is over.

Should you use Assembly?

It's good fun as an esoteric language, and it's useful to have some assembly basics if you enjoy CTFs and other hacking challenges.

Assembly has a few real world uses:

compilers need to write some assembly somehow, so if you're writing a compiler, you might want some familiarity with it. It's definitely possible to create a programming language without it - nowadays most compiled languages just target LLVM or JVM JIT or whatnot, and have the VM deal with all the assembly issues, and interpreted ones just use C, but traditionally compilers were turning source code into assembly
programs sometimes use assembly for some extremely performance sensitive code, like crypto, or decoding video etc. - but that's less and less common, and it takes an insane amount of effort to beat what compilers do; almost always this time is better spent on careful benchmarking and giving the compiler hints how to optimize things better - assembly is really no magic here
if you want to write binary exploits, you'll generally need some assembly

As for writing real programs in it, it would be completely insane.

I can't verify it, but rumor has it that in Japan, where programmers work 100 hours a week on salaries lower than what a Walmart cashier makes in the US, people still commonly wrote whole programs in assembly well into this century. So I guess if you have slave labor available, that's a thing you could do. Otherwise, don't bother.