Tomasz Wegrzanowski

Posted on Jan 3, 2022

100 Languages Speedrun: Episode 44: RISC-V Assembly

#riscv #assembly #risc

The world of computing is dominated by two instruction set architectures, x86-64 (which I covered in episode 40) and ARM, with all other architectures irrelevant by now.

But a new contender recently emerged: RISC-V. RISC-V attempts to be an Open Source and architecturally neutral ISA. The goal isn't for the hardware itself to be Open Source. It is for any company or researcher to be able to make their own RISC-V-based processor, with whichever extra features and performance trade-offs they need, and to have them be interoperable thanks to the shared RISC-V standard.

So far it's a very distant third, but RISC-V based products keep showing up, especially in the embedded space, and it is obviously making ARM really uncomfortable.

How to run RISC-V code

As you probably don't have any RISC-V computers at hand, the easiest way is to use QEMU emulation through Docker. Running docker run -it riscv64/ubuntu gets you a RISC-V environment. At least on OSX, Docker comes with everything preconfigured for it, and it shouldn't be too hard to set up on other systems. This is of course much slower than the real thing would be.

So let's start a new container and install compiler tools on it:

$ docker run -it --platform linux/riscv64 -v $(pwd):/source riscv64/ubuntu
$ apt update
$ apt install -y build-essential

Simplest RISC-V program

Let's write and compile our first RISC-V program. It will just tell the operating system to exit, with exit code 7:

.text
.global _start
_start:
  /* Tell the operating system to exit with code 7 */
  li a7, 93
  li a0, 7
  ecall

The x86-64 version we had was very similar:

global start
section .text
start:
  ; Tell operating system to exit with code 7
  mov rax, 0x2000001
  mov rdi, 7
  syscall

And let's run it:

$ as exit.s -o exit.o
$ ld -static exit.o -o exit
$ ./exit
$ echo $?
7

Step by step:

these are different assemblers (GNU as vs NASM), different operating systems (Linux vs OSX), and different architectures (RISC-V vs x86-64), but it's so damn similar
I'm not really sure why the default entry symbol for a statically linked program is start in one case and _start in the other
ecall (x86 syscall) - call operating system function
.text (or section .text in NASM) - where the code is
li - load immediate; on RISC-V a single instruction can only carry a very small number (12 bits), so loading a 64-bit number takes multiple instructions. On x86-64 a single instruction can load a number of up to 64 bits, as x86-64 supports much more complex instructions
a7 (x86-64 rax) is where we choose the operating system function to call, in this case exit
numbers of system calls vary both by operating system and architecture, on x86-64 OSX it's 0x2000001, on RISC-V Linux it's 93.
a0 (x86-64 rdi) is where the first argument goes, exit has only one argument, where we pass 7
weirdly assembler comment syntax is different on different architectures, and ; comments don't work on RISC-V version of GNU as.

Hello, World!

Now that we know how to run RISC-V code, let's write a simple program that prints "Hello, World!" to the screen.

It needs to do two system calls:

write call, passing the file descriptor number (1 for standard output), the address of the "Hello, World!\n" string, and the length of the string, 14 bytes
exit call, passing 0 as exit code to indicate success
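The length argument matters: write sends exactly that many bytes, whether or not they all belong to the string. As a rough cross-check, here is a small Python sketch that makes the same write call through os.write and confirms the byte count. It is not part of the original program, and the helper name hello is made up for illustration.

```python
import os

# The exact bytes the assembly program is meant to write.
MESSAGE = b"Hello, World!\n"


def hello(fd: int = 1) -> int:
    """Write MESSAGE to file descriptor fd (1 is standard output).

    os.write is a thin wrapper over the write system call and returns
    how many bytes were actually written.
    """
    return os.write(fd, MESSAGE)


if __name__ == "__main__":
    hello()
    # 13 characters plus the newline: this is the 14 passed in a2.
    assert len(MESSAGE) == 14
```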

.text
.global _start
_start:
  /* Tell the operating system to write "Hello, World!\n" to standard output */
  li a7, 64
  li a0, 1       /* standard output */
  lla a1, hello  /* address of thing to write */
  li a2, 14      /* amount of data to write */
  ecall

  /* Tell the operating system to exit with code 0 */
  li a7, 93
  li a0, 0
  ecall

.data
hello:
  .ascii "Hello, World!\n"

It works just like we'd expect:

$ as hello.s -o hello.o
$ ld -static hello.o -o hello
$ ./hello
Hello, World!

But hang on, how does RISC-V load an address if addresses are 64-bit and it cannot load such big numbers? Let's disassemble it and take a peek:

$ objdump -d hello

hello:     file format elf64-littleriscv


Disassembly of section .text:

00000000000100b0 <_start>:
   100b0:   04000893            li  a7,64
   100b4:   00100513            li  a0,1
   100b8:   00001597            auipc   a1,0x1
   100bc:   01c58593            addi    a1,a1,28 # 110d4 <__DATA_BEGIN__>
   100c0:   00e00613            li  a2,14
   100c4:   00000073            ecall
   100c8:   05d00893            li  a7,93
   100cc:   00000513            li  a0,0
   100d0:   00000073            ecall

As you can see, each instruction here is exactly 32 bits, or 4 bytes. As some of those bits must identify which instruction we want, it's not possible to load a 32-bit number, let alone a 64-bit number, with one instruction.
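To see where the bits go, here is a minimal Python sketch that splits a few of the instruction words from the listing into their fields. It assumes the standard I-type layout (12-bit immediate, source register, 3-bit function code, destination register, 7-bit opcode). The function name decode_i_type and the small register-name table are made up for this example.

```python
# Names for the x-registers that appear in the listing above.
ABI_NAMES = {0: "zero", 10: "a0", 11: "a1", 12: "a2", 17: "a7"}


def decode_i_type(word: int) -> dict:
    """Split a 32-bit I-type instruction word into its fields."""
    imm = word >> 20                   # bits 31..20: 12-bit immediate
    if imm & 0x800:                    # sign-extend from 12 bits
        imm -= 0x1000
    return {
        "opcode": word & 0x7F,         # bits 6..0: instruction class
        "rd": (word >> 7) & 0x1F,      # bits 11..7: destination register
        "funct3": (word >> 12) & 0x7,  # bits 14..12: operation within class
        "rs1": (word >> 15) & 0x1F,    # bits 19..15: source register
        "imm": imm,
    }


for word in (0x04000893, 0x00100513, 0x05D00893):
    f = decode_i_type(word)
    # opcode 0x13 with funct3 0 is addi; "li rd, n" is addi rd, zero, n
    print(f"{word:08x}: addi {ABI_NAMES[f['rd']]}, {ABI_NAMES[f['rs1']]}, {f['imm']}")
```

Only 12 of the 32 bits are left for the constant, which is why li with a small number fits in one instruction and anything bigger does not.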

x86-64, on the other hand, can absolutely do that, as its instructions have variable width, and some of them can be very long. Apparently the longest valid instruction is 15 bytes, but you're not likely to see many of those, and most instructions are a lot shorter. The most common 10-byte instruction loads a 64-bit number into a register (2 bytes to select the instruction and register, then 8 bytes of data), but if the top half of that number is all zeroes or all ones, the encoding will be a lot shorter.
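As an illustration, here is a Python sketch of three standard encodings of "mov rax, constant" and when each one applies. It only handles the rax register, the function name encode_mov_rax is made up, and a real assembler may not always pick the shortest form. Strictly speaking, the sign-extended form needs the top 33 bits to be ones, not just the top half.

```python
import struct


def encode_mov_rax(value: int) -> bytes:
    """Encode "mov rax, value" using the shortest of three common forms.

    value is a 64-bit pattern given as an int from -2**63 to 2**64 - 1.
    """
    value &= 0xFFFFFFFFFFFFFFFF        # normalise to a 64-bit pattern
    if value <= 0xFFFFFFFF:
        # mov eax, imm32: writing eax zeroes the top half of rax (5 bytes)
        return struct.pack("<BI", 0xB8, value)
    if value >= 0xFFFFFFFF80000000:
        # mov rax, imm32 with the constant sign-extended to 64 bits (7 bytes)
        return struct.pack("<BBBI", 0x48, 0xC7, 0xC0, value & 0xFFFFFFFF)
    # movabs rax, imm64: REX.W prefix, opcode, 8 bytes of data (10 bytes)
    return struct.pack("<BBQ", 0x48, 0xB8, value)


for constant in (0x2000001, -1, 0x1122334455667788):
    encoded = encode_mov_rax(constant)
    print(len(encoded), encoded.hex())
```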

x86-64 style variable length instruction encoding tends to use a lot less memory, and due to limited size of CPU's innermost caches, it is generally a lot faster. RISC-V constant length instruction encoding tends to be a lot simpler to implement, but more instructions will be required, and less of a program will fit in a cache, so generally such choices result in poorer performance. Of course in practice a lot of other factors affect performance as well.

Anyway, how does RISC-V load that address? It translated our lla into two instructions:

auipc a1,0x1 (Add Upper Immediate to Program Counter) takes the current "program counter", adds 0x1 * 4096 to it, and saves the result to a1
addi a1,a1,28 (add immediate) adds 28 to a1 and saves the result in a1; the immediate is a signed 12-bit number, so anything from -2048 to 2047 fits in the instruction
so as a result, we have a1 equal to pc + 4096 + 28 = 0x100b8 + 0x101c = 0x110d4, which is where "Hello, World!\n" is located in the program
this pair of instructions can reach any address within a signed 32-bit offset (roughly 2GB either way) from the PC; more instructions would be needed for an arbitrary 64-bit address
a very similar technique is used for loading 32-bit numbers: first lui to load 4096 * constant, then addi to add the low 12 bits to that
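The split between the two instructions can be sketched in a few lines of Python. This is a model of the arithmetic, not the assembler's actual code, and the function name split_pc_relative is made up. The one subtle step is that the low part is signed, so the high part has to be rounded rather than truncated.

```python
def split_pc_relative(pc: int, target: int) -> tuple:
    """Return (hi20, lo12) such that pc + (hi20 << 12) + lo12 == target.

    lo12 is the signed 12-bit addi immediate (-2048..2047), so hi20 is
    rounded to the nearest multiple of 4096 instead of being truncated.
    """
    offset = target - pc
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    if not -(1 << 19) <= hi20 < (1 << 19):
        raise ValueError("target is out of reach of one auipc+addi pair")
    return hi20, lo12


# The auipc sits at 0x100b8 and the string at 0x110d4 in the listing above.
print(split_pc_relative(0x100B8, 0x110D4))  # (1, 28)
```

An offset of 0x1800 shows why the rounding is needed: it comes out as hi20 = 2 and lo12 = -2048, because 0x800 does not fit in a signed 12-bit immediate.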

This kind of relative loading is a pretty decent idea: memory might be huge, but programs tend to be small, so the distance between an instruction and the constant it refers to should generally fit in 32 bits.

This also illustrates one interesting idea RISC-V has - on surface level, it only has constant length instructions, and for simpler implementations that's it, but the idea is that more complex high-performance implementations could look at multiple instructions together, and treat such a common pair like auipc+addi or lui+addi as a single double-length instruction to load a 32bit number, instead of following each step separately. How well that's going to work in practice is a big unknown.

Loop

The loop is very straightforward:

.text
.global _start
_start:

  /* initialize iteration count to 5 */
  li s0, 5

loop:
  /* print one "Hello, World!\n" */
  li a7, 64
  li a0, 1       /* standard output */
  lla a1, hello  /* address of thing to write */
  li a2, 14      /* amount of data to write */
  ecall

  /* subtract 1, check if s0 reached 0 */
  addi s0, s0, -1
  bnez s0, loop

done:
  li a7, 93
  li a0, 0
  ecall

.data
hello:
  .ascii "Hello, World!\n"

We use s0 to store how many loop iterations are remaining.

There are two interesting quirks of RISC-V. It doesn't have any "subtract constant" operation, instead it adds a negative number.

And second, there are no "CPU flags". On x86-64 there was an operation to compare numbers that set some flags, and then a conditional jump based on those flags. RISC-V doesn't have that design; instead it uses "compare and jump" (with all the comparison flavors) as a single instruction. CPU flags were very easy to implement on simple CPUs, but they really complicate things on modern high-performance CPUs, where multiple operations can happen in parallel while the CPU needs to manage the results as if it was executing things one at a time.
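Going back to the first quirk, a short Python sketch shows why no separate subtract instruction is needed: the immediate field of addi is a 12-bit two's complement number, so -1 is simply stored as 0xfff. The function name encode_addi is made up for this example, and it assumes the standard I-type layout with opcode 0x13.

```python
def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Build the 32-bit word for "addi rd, rs1, imm" (I-type, opcode 0x13)."""
    if not -2048 <= imm <= 2047:
        raise ValueError("addi immediate must fit in 12 signed bits")
    imm12 = imm & 0xFFF                # two's complement, so -1 becomes 0xfff
    return (imm12 << 20) | (rs1 << 15) | (0b000 << 12) | (rd << 7) | 0x13


S0 = 8  # s0 is register x8

# "subtract 1" is just addi with an immediate of -1
print(f"{encode_addi(S0, S0, -1):08x}")  # fff40413
```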

Print numbers

On our way to FizzBuzz, we need to be able to print a number. This is a very similar algorithm to the one I wrote for x86-64. We start building the string from the back, one digit at a time. At every iteration we do number % 10, add 48 to get the corresponding ASCII code, and store that in the string. Then we divide the number by 10, and repeat the process.

The GNU assembler equivalents of the NASM definitions I used (buffer_last_byte: equ $ - 1 and buffer_after: equ $) didn't work correctly here; it looks like some linker issue. I didn't investigate it further, and just added an extra instruction that adds 31 to the buffer address.
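Before the assembly, here is the same algorithm as a Python sketch, using the same 33-byte buffer and the same fixed offset of 31. The names are made up for illustration, and like the assembly version it only handles non-negative numbers.

```python
# Same scratch buffer as the assembly version: 32 characters + "\n".
BUFFER_TEMPLATE = b"0123456789abcdef0123456789abcdef\n"


def format_number(number: int) -> bytes:
    """Return the decimal digits of a non-negative number plus a newline."""
    buffer = bytearray(BUFFER_TEMPLATE)
    position = 31   # index of the last character before the newline
    length = 2      # one digit plus the newline
    while True:
        number, digit = divmod(number, 10)
        buffer[position] = 48 + digit    # 48 is the ASCII code of "0"
        if number == 0:
            break
        position -= 1                    # next digit goes one byte earlier
        length += 1
    return bytes(buffer[position:position + length])


print(format_number(12345678))  # b'12345678\n'
```

The position and length variables play the roles of a1 and a2 in the code that follows.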

RISC-V

.text
.global _start

print_number:
  /* a1 = address of the last character of the buffer (excluding newline) */
  /* a2 = number of characters to print (including newline) */
  /* a3 = 10 */
  lla a1, buffer
  addi a1, a1, 31
  li a2, 2
  li a3, 10

print_number_loop_iteration:
  /* split last digit out */
  /* a4 = a0 / 10 */
  div a4, a0, a3
  /* a5 = a0 % 10 */
  rem a5, a0, a3

  /* store one character at the address */
  addi a5, a5, 48
  sb a5, (a1)

  beqz a4, print_number_loop_done

  mv a0, a4        /* a0 = a0/10, that is remove last digit */
  addi a1, a1, -1  /* move buffer back one character  */
  addi a2, a2, 1   /* increase number of characters to print by one */
  j print_number_loop_iteration

print_number_loop_done:
  /* now we can tell the operating system to print string we built */
  li a7, 64
  li a0, 1
  ecall
  /* and return */
  ret

_start:
  /* set a0 to be the argument and call print_number function */
  li a0, 12345678
  call print_number

  /* Tell the operating system to exit with code 0 */
  li a7, 93
  li a0, 0
  ecall

.data
/* 32 characters + \n; initial contents do not matter */
buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"

Print numbers with loop

It takes very little extra work to loop over all numbers from 1 to 20 instead of printing just one.

.text
.global _start

print_number:
  /* a1 = address of the last character of the buffer (excluding newline) */
  /* a2 = number of characters to print (including newline) */
  /* a3 = 10 */
  lla a1, buffer
  addi a1, a1, 31
  li a2, 2
  li a3, 10

print_number_loop_iteration:
  /* split last digit out */
  /* a4 = a0 / 10 */
  div a4, a0, a3
  /* a5 = a0 % 10 */
  rem a5, a0, a3

  /* store one character at the address */
  addi a5, a5, 48
  sb a5, (a1)

  beqz a4, print_number_loop_done

  mv a0, a4        /* a0 = a0/10, that is remove last digit */
  addi a1, a1, -1  /* move buffer back one character  */
  addi a2, a2, 1   /* increase number of characters to print by one */
  j print_number_loop_iteration

print_number_loop_done:
  /* now we can tell the operating system to print string we built */
  li a7, 64
  li a0, 1
  ecall
  /* and return */
  ret

_start:
  /* s0 = current number, s1 = last number to print (20) */
  li s0, 1
  li s1, 20
loop:
  mv a0, s0
  call print_number
  addi s0, s0, 1
  ble s0, s1, loop

  /* Tell the operating system to exit with code 0 */
  li a7, 93
  li a0, 0
  ecall

.data
/* 32 characters + \n; initial contents do not matter */
buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"

RISC-V has a lot of registers, and by convention some of them (a0-a7) are used to pass arguments to functions, and can be overwritten by the function. Others (s0-s11) should be saved and restored by any function that wants to use them. This is just a convention, not a hard requirement, but we follow it here: the data we don't want overwritten lives in s0 and s1.

FizzBuzz

I then wrote a perfectly fine FizzBuzz program, but the linker really hated it. I think this is a linker bug, as lla is supposed to always use relative addressing. Or the assembler incorrectly informs the linker about this. Or maybe it's not supposed to work and needs some extra flags; I'm not really sure.

Anyway, it all worked when I changed the command from static to dynamic linking:

$ as fizzbuzz.s -o fizzbuzz.o
$ ld -fPIC -shared fizzbuzz.o -o fizzbuzz
$ ./fizzbuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Fizz
97
98
Fizz
Buzz

And here's the program:

.option pic
.text
.global _start

print_number:
  /* a1 = address of the last character of the buffer (excluding newline) */
  /* a2 = number of characters to print (including newline) */
  /* a3 = 10 */
  lla a1, .buffer
  addi a1, a1, 31
  li a2, 2
  li a3, 10

print_number_loop_iteration:
  /* split last digit out */
  /* a4 = a0 / 10 */
  div a4, a0, a3
  /* a5 = a0 % 10 */
  rem a5, a0, a3

  /* store one character at the address */
  addi a5, a5, 48
  sb a5, (a1)

  beqz a4, print_number_loop_done

  mv a0, a4        /* a0 = a0/10, that is remove last digit */
  addi a1, a1, -1  /* move buffer back one character  */
  addi a2, a2, 1   /* increase number of characters to print by one */
  j print_number_loop_iteration

print_number_loop_done:
  /* now we can tell the operating system to print string we built */
  li a7, 64
  li a0, 1
  ecall
  /* and return */
  ret

_start:
  /* s0 = current number, s1 = last number (100), s3 and s5 = divisors */
  li s0, 1
  li s1, 100
  li s3, 3
  li s5, 5
loop:
  rem a3, s0, s3
  rem a5, s0, s5
  beqz a3, divides_by_three
  beqz a5, divides_by_five_only

divides_neither:
  mv a0, s0
  call print_number
  j continue_loop

divides_by_three:
  beqz a5, divides_by_three_and_five

divides_by_three_only:
  li a7, 64
  li a0, 1
  lla a1, .fizz
  li a2, 5
  ecall
  j continue_loop

divides_by_five_only:
  li a7, 64
  li a0, 1
  lla a1, .buzz
  li a2, 5
  ecall
  j continue_loop

divides_by_three_and_five:
  li a7, 64
  li a0, 1
  lla a1, .fizzbuzz
  li a2, 9
  ecall

continue_loop:
  addi s0, s0, 1
  ble s0, s1, loop

  /* Tell the operating system to exit with code 0 */
  li a7, 93
  li a0, 0
  ecall

.data
/* 32 characters + \n; initial contents do not matter */
.buffer:
  .ascii "0123456789abcdef0123456789abcdef\n"
.fizz:
  .ascii "Fizz\n"
.buzz:
  .ascii "Buzz\n"
.fizzbuzz:
  .ascii "FizzBuzz\n"
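The branch structure is easier to check against a high-level version. Here is a Python sketch that follows the same order of tests as the assembly (both remainders first, then the divisible-by-three branch, then divisible-by-five only). The function name fizzbuzz_line is made up for this example.

```python
def fizzbuzz_line(number: int) -> str:
    """Pick the output for one number using the same branch order."""
    remainder_3 = number % 3
    remainder_5 = number % 5
    if remainder_3 == 0:
        # divides_by_three: decide between Fizz and FizzBuzz
        return "FizzBuzz" if remainder_5 == 0 else "Fizz"
    if remainder_5 == 0:
        # divides_by_five_only
        return "Buzz"
    # divides_neither
    return str(number)


for number in range(1, 101):
    print(fizzbuzz_line(number))
```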

Should you use RISC-V?

As for the hardware, only the future will tell.

As for RISC-V assembly, it's not something you're likely to ever need, even less so than x86-64 assembly, but it's good fun to play with it if you like esoteric languages, and now thanks to Docker and QEMU it's quite easy.

There are rumors that a RISC-V alternative to the ARM-based Raspberry Pi is coming real soon now, so maybe you'll even be able to run your code on real hardware.