Marcell Cruz

Posted on Feb 3, 2023

How To Write A Programming Language

#programming #c #computerscience

The Grammar
Composing Grammar Rules
The $$ Operator And Empty Grammar Rules
Rule Types
Parsing The Language
AST

Recently I wrote a very simple programming language, here is the repo, the language is extremely small, my objective was to write a programming language that can do fibonacci, that's it.
fib written in square

[>:fib n:
  [:if n < 3:
    return n - 1
  ]
  return [:fib n - 1] + [:fib n - 2]
]
[:fib 10]

[:print n]

it works!!!

It uses one of the most popular tools to write the grammar, the AST is very minimal and is executed in C, today I'm gonna explain the main components of a programming language trying to be as concise as possible.

The Grammar

This is the heart of the language, This is how you interact with the language, the grammar rules define how to write code for the language, I used flex and bison to write the grammar, the grammar of the language is written on square.y skim through those rules just to get a idea of how they work.

Here's a grammar rule from the language

ID EQ NUM

and it defines how to interpret assignments

n = 2

You can think about the grammar rules as a way of organizing regex blocks, each token (ID, EQ, NUM) is a regex block

identifier  ([a-z]+)  // this is ID 
equal       ([=])     // this is EQ
numbers     ([0-9])+  // this is NUM

You organize the blocks in a way that you want the blocks to be matched in the language, so when the parser sees

x = 99

it's going to first transform that in the tokens ID EQ NUM then search for a rule that matches that, after finding the rule, the parser is going to execute the body of the rule

ID EQ NUM {
  struct scope * number = (struct scope *) 
  malloc(sizeof(struct scope));
  (number)->type = "number";
  (number)->value = $3;

Think about the body of the rule as a function that is going to be executed when the parser finds that match, you can access the value in the tokens using $1, $2, $3, here is where you build your AST, more on that later

Composing Grammar Rules

You can compose grammar rules, for example, in the language we have function calls that work like this [:add n] but you can also call the function passing the value directly instead of a variable [:add 10] instead of creating two different grammar rules for each case, we create a grammar rule with these two options and use it in the function call grammar rule.

OPBRA IDFUNC fcallparam CLBRA {

fcallparam: ID {
              ..body 
          } 
          | NUM {
              ..body
          }
;

OPBRA is [ IDFUNC is the name of the function :add in this case, fcallparam can be a ID identifier, basically a variable, or NUM, NUM is a number, you're free to compose the rules how many times you want.

The $$ Operator And Empty Grammar Rules

The $$ operator can be used to save the result of the composing grammar rule to be used in another grammar rule, it's easier to see the utility in action, in the last example
we can save anything on $$ than use it in the composing grammar rule

fcallparam: ID {
  $$ = "Hello There!!" 
}

OPBRA IDFUNC fcallparam CLBRA {
  printf($3) // $3 here is "Hello There!!"  
}

remember that we can access the tokens with $1, $2, $3 etc...
So now $3 is "Hello There!!".

If You just want to use the token directly you don't need the body, example

fcallparam: ID | NUM

OPBRA IDFUNC fcallparam CLBRA {
  // $3 will be either a identifier or a number depending on 
  // the match   
}

the default behavior of any rule without a body is

$$ = $1

Rule Types

A rule also has a type, to return a string like we did in the fcallparam example, the rule fcallparam needs to be a string, we define this in the beginning of the parser when we connect the parser with the lexer, you can see all definitions here

In the toy language fcallparam is of type node which is a node of the AST
%type <node> fcallparam

Parsing The Language

With the information that we have above we can begin to understand how to parse the language, and how to create the AST, for example we could create a print like function called output in the language.

In the source file…

  output("hello world")

In the parser...
We parse the contents with the right rules

OUTPUT OPPAR STR CLPAR {
  // here we just print the content inside the parenteshis
  printf($3)
}

for very simple rules we could do something like that and execute the code that we created, the problem is that programming languages have concepts like functions that postpone code execution, you cannot execute what you see directly because we have definitions that are going to be executed when the program tells you to, for example if output was inside another function we can not execute it right away, this is one of the reasons why we need a AST, the AST is a data structure that represents the code but can be executed easily, we don't need to parse anything, just traverse the data structure and execute it.

function test() {
  output("hello world")
}

in this case we cannot execute output as soon as we identify it in the parser, we need to wait for a function call that calls test(), then we execute what's inside test's body, to solve this problem we need to breakdown the parsing and the executing in to different things, that's where the AST comes in.

AST

Abstract Syntax Three is just a data structure like an linked list or binary three, you can use nested array as an AST, and some languages do use it, but we're going to use a linked list instead, this is the data structure that I used in the language

struct scope {
  char * type;
  char * extra;
  struct arg * args;
  int value;
  struct scope * scopes;
  struct scope * next; 
  struct scope * prev; 
  int return_value;
};

This is essentially a doubly linked list after I finished the language I realized that it could have been much easier but I didn't have the patience to change everything so that's what we have right now :P, it's easier to understand what it represents with an example

{
  .type = "function",
  .extra = "global",
  .scopes = {
    .type = "body",
    ...
    .scopes = {
      .type = "function",
      .extra = "fib",
      .args = {
        .key = "n", 
        .value = 0,
      }
      .scopes = {
        .type = "body",
        .scopes = {
          .type = "if"
          ...
        }
      }
    }
  }
}

The root node is a function called global, same thing as main in other languages like C, this node has a body and the body
has another node function called fib which is the declaration of the fib function, this node has a if node and so on..., I hope you get the idea, the AST is just another way of representing the code, this is what we called a IR or Intermediate Representation the reason why we do this is to make it easier to do something with that data, specially traverse it, it's very hard to traverse source code, so instead of doing that we transform the source code in to a different data structure and work with that instead
so this

[:fib n
  [:if n < 3
  ]
]

becomes this

{
  .type = "function",
  .extra = "fib",
  .scopes = {
    .type = "body",
    ...
    .scopes = {
      .type = "if",
      ...

The next thing that we need to do is execute the AST, a lot of popular languages like Python, Ruby and Java have an additional IR representing the code as bytecode and executing the bytecode in a virtual machine for optimization, but since this is a toy language we just execute the AST itself

Executing The AST

Executing the AST is pretty straighforward we pass the AST to an exec function and this function traverse and execute the AST following the rules of the language, this is the file that creates the AST and calls exec passing the AST square.y , the whole thing is less then 400 lines, and it could be much simpler as I said before, for example, the languages accepts evaluating statement in returns, so you can do things like this

return [:fib n - 1] + [:fib n - 2]

but that's not strictly necessary, you can check what the exec function does here but essentially we just traverse the AST check the node type like function if function calls etc.. and do what needs to be done, the most challenging part after building the AST is implementing scope but that can be done with pointers, each node holds the value of the variables used inside it and we can use .next and .prev to jump between scopes, I'm not gonna explain how I execute the AST here because you can simply follow the code here, it's very simple, and I think I already cover the meat of building a language, which is the parser and the AST, after you have the AST it's just a matter of executing it which is an entire different thing outside the scope of showing how a language can be implemented, if you want to build something specific I think the toy language that I've build is a good start, since a lot of the work is the actual set up.