Pablo Díaz

Posted on Apr 5, 2017

Writing My First Compiler

#compiler #java #arm

Start

I wrote my first compiler for a university class project and I really enjoyed it so I want to share my experience on that. I'm about to get very technical on this, so first a little introduction.

Set

As you know, a computer does not understand source code, it understands binary code. So how a we get from source code to binary code? The computer hardware, the physical form in which it performs cycles and processes with the electrical signals, takes care of converting the zeros and ones to those electric signals. So guess what? Each compiler depends on the processor on which it is compiled (unless something else is used between like the java virtual machine). There exists an upper layer for binary code that is called assembly, a low level programming language. So the goal for my compiler is to convert high level programming language for a low level programming language. Take in consideration that a compiler does not necessarily needs to make a conversion to a low level programming language, it could be a conversion to another high level programming language.

The process

A compiler is divided in two main stages: the analysis and the synthesis. The analysis is the recognition of the structure of the source program with the recollection of information (like variables) and the synthesis is the construction of the translation from the structure and the information recollected. I made my own version of the analysis just to go through that process but for my compiler I actually used a powerful parser generator to avoid the problems of my own implementation, called ANTLR using the Visitor pattern.

The phases

Lexical Analysis

Identifies the elements of the source program according to the grammar of the language analysed. This phase saves information on the symbol table and produce tokens that represent the identification of the elements.

pos = initial + rate * 60

Transforms to

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

Syntax Analysis

Also called parsing, this uses the tokens from the previous phase to construct a grammatical structure called the syntax tree. In this phase it is possible to recognize the syntax errors like missing ;, missing brackets

Semantic Analysis

Here things start to get fun. I started from this point because I used ANTLR for the previous two phases. The things I liked in this part was that I had to decide how actually my high level language input was going to be. Should I use indent (like python) or brackets (like java) to define scopes? Should I use strong typed variables(int var) or loosely typed (var) ? Along all this questions, there is something that always has to be kept in mind, the target assembly. The target assembly will determine what you can actually do or not. For my compiler, I choose the ARM compiler of the RaspberryPi.

The semantic analysis uses the syntax tree and the symbol table to verify the semantic of the instructions, in other words, that the source program actually makes sense.

Intermediate Code Generation

In this phase, I had to start generating assembly code look alike, called the three address code. Imagine this

pos = initial + rate * 60

t1 = id3 * t1
t3 = id2 + t2
id1 = t3

Where id and t are going to be registers of the assembly code in the next phase. In theory, this phase could make the compiler independent of the target compiler because you could generate different assembly from the same intermediate code generation.

Generating the code

In this part I had to take a few things into consideration. How I will store the values? How many space will each type use? How will I save globals? These are the first question to resolve and that's why it is important to know the how the assembly works. I decided that everything would be saved in 4 bytes, no matter what type of variable it is just to make it simple, but not efficient. And where I will save this bytes? I decided, the global vars would be saved in memory and everything else in the stack. Another step to take in consideration is how and how many registers are going to be used and for that a register descriptor and an address descriptor is needed because it is necessary to know the current values and addresses of the data and when a register is available for usage.

Finally, the most difficult part for me was the activation record. This part is used to initialise the registers to use in a block of code, it includes the parameters, the return value, the temporary variables and the control link. This part is the key to make recursion work.

Based on the Dragon Book.

Top comments (5)

dexafree • Aug 21 '17 • Edited

Both are essentially different.

The order used in a compiler is:

Lexical -> Syntactical -> Semantical -> Code generation.

Until you reach the Code generation step, you only have ensured that the code the developer wrote is "valid" according to your spec (or as technically correctly said: the construction belongs to the language).

The Lexical analysis part only parses the text and converts it to a list of tokens.

IE:

int a = 1;
int b=2;a = b;

Would transform into a list containing:

<type, int>, <id, a>, <EQUAL>, <literal, 1>, <SEMICOLON>, <type, int>, <id, b>, <literal, 2>, <SEMICOLON>, <id, a>, <EQUAL>, <id, b>, <SEMICOLON>

As you can see, is the Lexical's responsibility to take whitespaces and line breaks into account.

This list is the input of the Syntactical analyzer, which will use a defined syntax/grammar (in example, you can see Python's official grammar: docs.python.org/3/reference/gramma...)

It will consume this list and check that the order matches SYNTACTICALLY VALID language constructions (ie: TYPE, ID, EQUAL, [expression, literal, id]).

A lexical error could be "ID name too long" (if the language only allowed for ids with less than 30 characters), and a syntactical error would be trying to do int 2 = 3, as the construction TYPE, literal, EQUAL, literal does not belong to the language.

(I will only make a brief point about the semantical analysis, as it's out of the scope of your question)

The SEMANTICAL analysis takes care about the usage about the said constructions, in example, using a variable before it's defined.

int a = 1;
a = b;
int b = 2;

This is a SYNTACTICALLY valid construction, but SEMANTICALLY incorrect, as you are using the variable b before it's been declared.

Jake Vossen • Jul 18 '17

I loved the post! I think it's easy for programmers to think we just found compilers on a mountain struck by lighting, but this seemed to be very insightful. Thanks for sharing.

Agus Arias • Apr 14 '17

Nice post! The lovely task of creating a compiler. It gives so much insight on how the programming languages I use work. For a Java made compiler I would recommend the Gold parser.