An update on the Rumi Programming Language, and an overview of the tools used

#computerscience #showdev #opensource

A while ago I posted a blog post mentioning I was working on a programming language. Ever since then some of you were kind enough to reach out. So here is an update on things I've been implementing since then, and then an overview of the tools.

An overview of the language

Rumi is a low-level language, it supports functions, structs, pointers, arrays, and most other basic types there are. A typical program might look like this:

import base

MyStruct: struct{
  id: int = 1;
}

MyStruct.print := () -> void{
  printf("MyStruct with id %d\n", self.id);
}

main :=()->int{
  printf("Hello world!\n");
  m: MyStruct;
  m.print();
  return 0;
}

(Sorry for the color scheme, it was the best I could achieve)

And there are two different executables, a compiler rum, and an interpreter rumi(rum interpreter!). So far, it looks normal, but we have a few different things under our sleeves:

import base
import compiler

rem := (a: int, b:int)->int{
  // returns 0 if a is dividable by b, otherwise the remainder

  // Pretend this is hard to calculate!

  return a % b;
}

main := () -> int{
  printf("This would print in runtime\n");
  printf("8 % 4 == %d\n", rem(8,4));
  return 0;
}

@compile
rem_test := ()->int{
  printf("Running tests for rem\n"); // This will run in compile time

  if(rem(8,4)){
    printf("rem(8,4) should be 0\n");
    return 2; // returning a value higher than 1 results in compile error
  }

  return 0; // This tells the compiler everything is fine
}

@compile
rem_test2 := () -> int{
  printf("Incomplete test, TODO\n"); // This will run in compile time

  return 1; // This will print a compile warning
}

Running the above code, I can get:

$ rum test.rum # this is compile only                             
Running tests for rem
Incomplete test, TODO
Compile directive compile exited with value 1 on line 35

$ rumi test.rum # this is compile and execution
Running tests for rem
Incomplete test, TODO
Compile directive compile exited with value 1 on line 35
This would print in runtime
8 % 4 == 0

$ ./a.out # this is execution only
This would print in runtime
8 % 4 == 0

As you can probably guess, the @compile directive tells the compiler to run that function in run time. I demonstrated how to run tests during compile time, but many other things could be done too. Let's look at another example (this one is not pushed to github yet, as I'm still experimenting with it)

@compile
make_verbose := (c: *compile) -> int {
  printf("making main verbose\n");

  // Get the main function
  main_f = c.getfunction("main");

  // A local function that gets a function call and inserts another function call before it
  add_runtime_print := (fc: *function_call) -> void{
    fc.insert_before(c.fcall(printf, "calling %s in main\n", fc.name));
  }

  // map all function calls to the previous function
  main_f.function_calls.map(add_runtime_print);

  return 0;
}

The possibilities are endless! But enough about the language, for now, let's look at the tools that made this possible...

Compile Chain

Looking from an academic perspective, a compiler is just a set of mapping tools, that converts a stream of strings to a stream of instructions. But digging deeper, this is the compile chain that you see in most compilers (including rumi):

Stream of characters to stream of tokens (lexing with flex)

Stream of characters here means the source code. If you think about it, any source code is just a stream of characters to a computer. But creating instructions from this stream is a hard job, so what should we do?

We use lexers to tokenize this stream. We convert payment = salary * work_hours to the token stream ID = ID * ID. We don't care if the syntax is correct, or what the meaning of any of these tokens are at this stage, all we care about is converting these characters to the tokens. I used flex to get this done, and the source code of the lexer is available at this file.

flex converts the l file into a c program that provides a few functions that are useful in the next part...

Stream of tokens to Abstract Syntax Tree (AST)

Now that we have our token stream, we can think about the syntax. This is where we define how functions are declared, how new variables are created, etc. I used gnu bison which gets rules like this:

while_stmt
: WHILE '(' expr ')' stmt {$$=new WhileStatement($3, $5);}
;

where WHILE is a token, expr and stmt are other rules defined in the parser (full source code is here), and finally, we define how the AST is created.

AST is a syntax tree, for each different statement and expression we create a different structure (or class in my case), where we just store the data. We are not doing any calculations here, just storing it for later. All this statement is saying is that a while statement has a while token, followed by an open parenthesis, followed by an expression, a close parenthesis, and then another statement(which can be a code block!).

Let's look at another example, this is how variable declarations are handled:

variable_decl
: ID DEFINE_AND_ASSIGN expr ';' {$$=new VariableDecl($1, NULL, $3);}
| ID ':' type ';' {$$=new VariableDecl($1, $3);}
| ID ':' type '=' expr ';' {$$=new VariableDecl($1,$3,$5);}
;

You can see that there are three different rules here, one for cases like id := 3;, one for cases like age : int; and finally, for name : string = unnamed;. You can look at the full source code for the parser here.

bison, similar to flex, generated c (or c++) code for us. The good news is bison works flawlessly with flex, and just calls it whenever it is looking for the next token.

AST to code gen

After the parser has finished, we can go and look at the AST, and start generating the code. Some compilers, including rumi, have a stage that go through and resolve types, hande sizes, etc, to make the code generating go smoother. I'm not going to cover that step here, you can look at the compiler.cpp code in the repo for more details.

Now, we can finally get into code generation. Fortunately for us, llvm and gcc both have quite powerful almost high-level assembly-like languages. I chose llvm for various reasons, but gcc would have worked similarly. These languages are assembly-like, meaning they are not exactly assembly, and they are not limited to a machine. llvm provides utilities to optimize the llvm-ir code (the assembly we mentioned!), to generate object files, and to even run them on command. Our previous AST would be converted to something like this:

define private i64 @successor(i64 %a) {
entry:
  %returnval = alloca i64
  %a1 = alloca i64
  store i64 %a, i64* %a1
  %readtmp = load i64, i64* %a1
  %0 = add i64 %readtmp, 1
  store i64 %0, i64* %returnval
  br label %end

end:                                              ; preds = %entry
  %1 = load i64, i64* %returnval
  ret i64 %1
}

As you can imagine, this is a pretty unoptimized version of saying

successor := (a:int)->int{
  return a + 1;
}

But the good news is, llvm handles optimization and can convert that to a really nice optimized assembly version (and then to even nicer machine code).

But how do you know what to generate? The I use clang test.c -S -emit-llvm -c to get the llvm-ir of some code, and then figure out how to output that using llvm's IRBuilder.

Let me know what you think about all this, all comments and ideas are welcome!