Writing My First Compiler
Pablo Díaz Márquez Apr 5 '17
I wrote my first compiler for a university class project and I really enjoyed it so I want to share my experience on that. I'm about to get very technical on this, so first a little introduction.
As you know, a computer does not understand source code, it understands binary code. So how a we get from source code to binary code? The computer hardware, the physical form in which it performs cycles and processes with the electrical signals, takes care of converting the zeros and ones to those electric signals. So guess what? Each compiler depends on the processor on which it is compiled (unless something else is used between like the java virtual machine). There exists an upper layer for binary code that is called assembly, a low level programming language. So the goal for my compiler is to convert high level programming language for a low level programming language. Take in consideration that a compiler does not necessarily needs to make a conversion to a low level programming language, it could be a conversion to another high level programming language.
A compiler is divided in two main stages: the analysis and the synthesis. The analysis is the recognition of the structure of the source program with the recollection of information (like variables) and the synthesis is the construction of the translation from the structure and the information recollected. I made my own version of the analysis just to go through that process but for my compiler I actually used a powerful parser generator to avoid the problems of my own implementation, called ANTLR using the Visitor pattern.
Identifies the elements of the source program according to the grammar of the language analysed. This phase saves information on the symbol table and produce tokens that represent the identification of the elements.
pos = initial + rate * 60
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Also called parsing, this uses the tokens from the previous phase to construct a grammatical structure called the syntax tree. In this phase it is possible to recognize the syntax errors like missing
;, missing brackets
Here things start to get fun. I started from this point because I used ANTLR for the previous two phases. The things I liked in this part was that I had to decide how actually my high level language input was going to be. Should I use indent (like python) or brackets (like java) to define scopes? Should I use strong typed variables(int var) or loosely typed (var) ? Along all this questions, there is something that always has to be kept in mind, the target assembly. The target assembly will determine what you can actually do or not. For my compiler, I choose the ARM compiler of the RaspberryPi.
The semantic analysis uses the syntax tree and the symbol table to verify the semantic of the instructions, in other words, that the source program actually makes sense.
Intermediate Code Generation
In this phase, I had to start generating assembly code look alike, called the three address code. Imagine this
pos = initial + rate * 60 t1 = id3 * t1 t3 = id2 + t2 id1 = t3
t are going to be registers of the assembly code in the next phase. In theory, this phase could make the compiler independent of the target compiler because you could generate different assembly from the same intermediate code generation.
Generating the code
In this part I had to take a few things into consideration. How I will store the values? How many space will each type use? How will I save globals? These are the first question to resolve and that's why it is important to know the how the assembly works. I decided that everything would be saved in 4 bytes, no matter what type of variable it is just to make it simple, but not efficient. And where I will save this bytes? I decided, the global vars would be saved in memory and everything else in the stack. Another step to take in consideration is how and how many registers are going to be used and for that a register descriptor and an address descriptor is needed because it is necessary to know the current values and addresses of the data and when a register is available for usage.
Finally, the most difficult part for me was the activation record. This part is used to initialise the registers to use in a block of code, it includes the parameters, the return value, the temporary variables and the control link. This part is the key to make recursion work.
Based on the Dragon Book.