<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James</title>
    <description>The latest articles on DEV Community by James (@miiizen).</description>
    <link>https://dev.to/miiizen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F28215%2F9c3126c8-860f-4fd6-8f8f-39be38e7ff0b.jpg</url>
      <title>DEV Community: James</title>
      <link>https://dev.to/miiizen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/miiizen"/>
    <language>en</language>
    <item>
      <title>Compiler Series Part 5: Lexical analysis</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sat, 24 Aug 2019 12:02:48 +0000</pubDate>
      <link>https://dev.to/miiizen/compiler-series-part-5-lexical-analysis-3kf7</link>
      <guid>https://dev.to/miiizen/compiler-series-part-5-lexical-analysis-3kf7</guid>
      <description>&lt;p&gt;From this post onwards, I will be explaining sections from my compiler and the thoughts gone into designing them.  I will link to the relevant files in my compiler's repository found &lt;a href="https://github.com/miiizen/compiler" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining basic tokens
&lt;/h2&gt;

&lt;p&gt;The smallest "atoms" of my language (aside from individual characters) are called tokens.  The scanner's job is to recognise valid tokens in the input text and reject invalid ones.  The following BNF describes some basic tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;digit&amp;gt; : [0-9]+
&amp;lt;number&amp;gt; ::= [&amp;lt;digit&amp;gt;]+.[&amp;lt;digit&amp;gt;]+

&amp;lt;name&amp;gt; ::= [a-zA-Z][a-zA-Z0-9]*
&amp;lt;string_lit&amp;gt; ::= '"' [\w]* '"'

&amp;lt;op&amp;gt; ::= ['+' | '-' | '*' | '/' | '^' | '%' | '=' | '&amp;gt;' | '&amp;lt;' | '!' | '|' | '&amp;amp;']+
&amp;lt;whitespace&amp;gt; ::= [' ' | '\n' | '\t']+
&amp;lt;comment&amp;gt; ::= # &amp;lt;all&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Names will make up identifiers and keywords, as well as string literals.  Operators can be any combination of the symbols above; the scanner then checks whether a given combination is one it recognises.  If a sequence of input characters matches no token, the scanner throws an error.  The scanner also discards whitespace and comments.  &lt;/p&gt;

&lt;p&gt;A lexeme is a sequence of characters that matches the pattern for a token.  My complete list of token types looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token Types {
    PLUS, MINUS, STAR, SLASH, HAT, MOD, INC, DEC,
    EQ, LESS, GREATER, LEQ, GREQ, NEQ,
    AND, OR, NOT,
    ASSIGN, CONDITIONAL, COLON,

    LEFTPAREN, RIGHTPAREN, LEFTSQ, RIGHTSQ, COMMA,
    NUMBER, STRING, IDENTIFIER, BOOL,
    BEGIN, IF, ENDIF, ELSE, THEN,
    FOR, IN, ENDFOR,
    DEFINE, ENDDEF, EXT,
    NEWLINE, END
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Some language theory&lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/h2&gt;

&lt;p&gt;The scanner recognises a language which is type 3 on the Chomsky Hierarchy&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.  A type 3 (regular) language can be recognised/parsed by regular expressions and finite state machines.  I could have implemented a literal state machine for the scanner, but I will not, as this approach is tedious to write by hand. Similarly, I could have written one complex regex to split the input into tokens, but this obscures a lot of the logic for me.  I started this project to learn, not to write a compiler in as few lines as possible. &lt;br&gt;
Later on, the parser will interpret a type 2 language using a push-down automaton (a type of automaton which uses a stack to hold information; in my implementation, this stack is the C++ function call stack).  The difference between the two languages being recognised is the main reason to have a separate lexer and parser&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe5v1ojd7td4t5w27ny9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fe5v1ojd7td4t5w27ny9w.png" alt="The Chomsky Hierarchy"&gt;&lt;/a&gt;&lt;br&gt;
A tool called flex&lt;sup id="fnref4"&gt;4&lt;/sup&gt; exists which can generate scanner state machines in C.  It is a modern replacement for the classic Unix tool lex.  I decided that this tool would introduce too much overhead into my code.&lt;/p&gt;
&lt;h2&gt;
  
  
  Algorithm
&lt;/h2&gt;

&lt;p&gt;The basic algorithm for the scanner is below.  Functions such as &lt;code&gt;getString()&lt;/code&gt; and &lt;code&gt;getOp()&lt;/code&gt; correspond to the token rules in the BNF above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getNextTok =&amp;gt;
   nextChar()
   if next char is '#', skip rest of line then nextChar()
   skip whitespace

   if next char is digit, getNumber() and return number token

   if next char is alpha, getName()
       if string is keyword return keyword token
       else return identifier token

   if next char is operator, getOp()
       if operator string is recognised return operator token
       else error

   if next char is '"', getString() and return string token

   if next char is recognised punctuation
       return punctuation token

   if next char is unrecognised
       error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing a scanner is not too difficult at all, and the skill should come in useful for many projects where you need to parse some kind of input. &lt;br&gt;
Finally, &lt;a href="https://github.com/miiizen/Compiler/blob/master/Compiler_Lib/scanner.cpp" rel="noopener noreferrer"&gt;here&lt;/a&gt; is the scanner from my compiler. &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Aho, A., Lam, M., Sethi, R. and Ullman, J. (2006). Compilers: Principles, Techniques, and Tools. 2nd ed. Pearson Education Inc. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/toc-chomsky-hierarchy/" rel="noopener noreferrer"&gt;Theory of Computation - Chomsky Hierarchy&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://compilers.iecc.com/crenshaw/tutor7.txt" rel="noopener noreferrer"&gt;Let's Build a Compiler, by Jack Crenshaw - Part 7: LEXICAL SCANNING&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://github.com/westes/flex" rel="noopener noreferrer"&gt;GitHub - westes/flex: The Fast Lexical Analyzer&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>computerscience</category>
      <category>compilers</category>
      <category>cpp</category>
      <category>learning</category>
    </item>
    <item>
      <title>Compiler Series Part 4: Designing the SIMPLE language and compiler</title>
      <dc:creator>James</dc:creator>
      <pubDate>Sat, 10 Aug 2019 13:31:51 +0000</pubDate>
      <link>https://dev.to/miiizen/compiler-series-part-4-designing-the-simple-language-and-compiler-82b</link>
      <guid>https://dev.to/miiizen/compiler-series-part-4-designing-the-simple-language-and-compiler-82b</guid>
      <description>&lt;p&gt;I've chosen to call the language I've designed for the compiler SIMPLE for fairly self evident reasons.  This language is Turing complete as it features conditional statements and variables.&lt;br&gt;
As this compiler is an MVP, there will only be one data type: doubles. &lt;/p&gt;
&lt;h2&gt;
  
  
  Program structure
&lt;/h2&gt;

&lt;p&gt;The program should be enclosed in &lt;code&gt;BEGIN..END&lt;/code&gt;. The top level should contain function definitions with a main function as an entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN
   DEFINE average(x, y)
       (x + y) * 0.5
   ENDDEF

   DEFINE main()
       average(5, 8)
   ENDDEF
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Functions
&lt;/h2&gt;

&lt;p&gt;Functions enclose reusable blocks of code.  They can take arguments (which must be doubles) and return a value (also a double).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEFINE average(x, y)
       (x + y) * 0.5
ENDDEF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no ‘return’ keyword; the value of the last expression evaluated is the return value of the function.&lt;/p&gt;

&lt;p&gt;Functions can be defined as external. This means that a definition exists elsewhere and will be linked later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DEFINE EXT printd(x)
DEFINE EXT sin(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the language to use functions from the C standard library, as well as any user-defined functions which may also be linked.&lt;/p&gt;

&lt;h2&gt;
  
  
  If/else statements
&lt;/h2&gt;

&lt;p&gt;If statements decide which block of code to execute based on the result of a conditional.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IF 5 &amp;lt; 6 THEN
   ...
ELSE
   ...
ENDIF

IF 4==2 THEN
   ...
ENDIF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  For loops
&lt;/h2&gt;

&lt;p&gt;For loops take the form of an initialised loop variable, an end condition and an optional step value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FOR i = 1, i &amp;lt; n, 1 IN
   ...
ENDFOR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Once finished, the compiler will be capable of emitting an executable for this program.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN
 DEFINE EXT sin(x)
 DEFINE EXT putchard(x)

 DEFINE main()
   num = 7
   FOR y = 1, y &amp;gt;= -1, -0.2 IN
     FOR x = 0, x &amp;lt;= num, 0.2 IN
       s = sin(x)
       IF (0.1+y) &amp;gt;= s THEN
         IF (y-0.1) &amp;lt;= s THEN
           putchard(42) # “*“
         ELSE
           putchard(32) # “ ”
         ENDIF
       ELSE
         putchard(32) # “ “
       ENDIF
     ENDFOR
     putchard(10) # “\n”
   ENDFOR
 ENDDEF
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program's output is an approximate sketch of a sine wave in the terminal!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fha8pmzxsadze5gd8rm4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fha8pmzxsadze5gd8rm4k.png" alt="a very rough sin wave!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Compiler process
&lt;/h2&gt;

&lt;p&gt;A compiler lends itself to a modular approach.  One can take this to the extreme; see, for example, the paper on a 'nanopass' approach to compiler design&lt;sup id="fnref1"&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;A level 1 DFD outlining the processes happening within my compiler is included below.  It is a linear process:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftbobjdl8dly5qpsc2jwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ftbobjdl8dly5qpsc2jwz.png" alt="level 1 DFD"&gt;&lt;/a&gt;&lt;br&gt;
When a language feature is added to the compiler it's quite straightforward to extend each section to support it.&lt;/p&gt;

&lt;p&gt;In the next section, we'll get into writing the lexer.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://www.cs.indiana.edu/~dyb/pubs/nano-jfp.pdf" rel="noopener noreferrer"&gt;A Nanopass Framework for Compiler Education&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>compilers</category>
      <category>computerscience</category>
      <category>cpp</category>
      <category>programming</category>
    </item>
    <item>
      <title>Compiler Series Part 3: Rust</title>
      <dc:creator>James</dc:creator>
      <pubDate>Thu, 08 Aug 2019 16:09:27 +0000</pubDate>
      <link>https://dev.to/miiizen/compiler-series-part-3-rustc-5e9</link>
      <guid>https://dev.to/miiizen/compiler-series-part-3-rustc-5e9</guid>
      <description>&lt;p&gt;This post will be another fly by look at a production compiler, this time for the Rust language.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Disclaimer!&lt;/em&gt; I wrote this a while ago and have not checked how much has changed!  The compiler and language are still maturing, so changes are frequent.&lt;/p&gt;

&lt;p&gt;The &lt;a href="//www.rust-lang.org"&gt;Rust&lt;/a&gt; project is a compiler for a new systems language.  The compiler and standard library are written in Rust itself, and the project is sponsored by Mozilla.  It prides itself on being "blazingly fast", preventing segfaults and guaranteeing thread safety.&lt;br&gt;&lt;br&gt;
Rust draws on multiple paradigms.  It takes its variable mutability and ownership/borrowing rules from functional programming's emphasis on immutability, ensuring thread safety and preventing data races.  For example, after passing a variable as an argument into a function, the variable cannot be used again in the caller, as ownership has been moved into the function.  These ownership rules increase the safety of the language, as most potential problems are picked up by the compiler.  If the program compiles, you can be fairly confident that there won’t be many unexpected runtime errors.&lt;/p&gt;

&lt;p&gt;The compilation process is different to that of gcc, although the underlying idea is the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing&lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/h2&gt;

&lt;p&gt;Although a scanner can be generated from a state machine description, the rustc lexer and parser are hand-written.  First, the lexer takes the input source as UTF-8 text and generates tokens from it.  These tokens are then placed into a token tree&lt;sup id="fnref2"&gt;2&lt;/sup&gt;.  This tree is an intermediate stage between the input source and the AST.  This stage isn't always necessary, depending on the complexity of the language.  Recursive descent is then used to generate the real AST.  The AST does not include unnecessary bits of the input like parentheses; they are implied by the tree structure at this point. &lt;/p&gt;

&lt;h2&gt;
  
  
  Expansion
&lt;/h2&gt;

&lt;p&gt;At this point, the AST is passed over.  At the appropriate locations, external code such as the standard library and other specified modules is injected.  In addition, macros are expanded.  This is in contrast to the GCC approach, where macros are handled by manipulating the source text directly.  Tests (defined with &lt;code&gt;#[test]&lt;/code&gt; attribute syntax) are built into a harness at this stage and injected into the AST.  The AST is given node IDs for later use, and can optionally be output at this point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis
&lt;/h2&gt;

&lt;p&gt;Semantic analysis happens at this stage.  This involves traversing through the AST and performing different transformations.  At this stage, the AST becomes HIR, &lt;em&gt;High-Level Intermediate Representation&lt;/em&gt;.  This is basically the same as the AST, so we don’t need to worry about the detail too much.  Many tasks are done here, including but not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name resolution - checking whether identifiers are variables, functions or modules.&lt;/li&gt;
&lt;li&gt;Finding the &lt;code&gt;main&lt;/code&gt; function, as this is the entry point for the program&lt;/li&gt;
&lt;li&gt;Type checking - determining the resultant type of expressions&lt;/li&gt;
&lt;li&gt;Checking the rules associated with static, constant and private types are obeyed&lt;/li&gt;
&lt;li&gt;Match checking - Rust’s pattern matching rules are enforced here&lt;/li&gt;
&lt;li&gt;Dead code checking - emits warnings if unreachable code is detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These steps are not optimisations as such; they are more implementations of language features.  &lt;/p&gt;

&lt;p&gt;Afterwards, the HIR is translated to MIR, &lt;em&gt;Mid-Level Intermediate Representation&lt;/em&gt;&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.  This breaks up the large step between a fairly high-level, abstract representation and the final low-level representation.  For example, all control flow, loops and matches are represented using &lt;code&gt;goto&lt;/code&gt;-like statements.  Borrow checking is done at this stage, making sure the ownership rules are adhered to, and some optimisations are made before translating MIR into LLVM IR.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLVM
&lt;/h2&gt;

&lt;p&gt;LLVM is a set of tools used to help write compilers.  For now we just need to know that it takes a low-level intermediate representation of the source code as an input and generates machine code for the specified platform.&lt;br&gt;&lt;br&gt;
The first job is to translate Rust’s crates to LLVM’s modules.  "Crates" and "modules" are similar concepts; it’s just the technicalities of representation that need to be translated.&lt;br&gt;
After this step, LLVM runs its own optimisation passes, which ensure that the output program is as efficient as possible.  Rust also implements many optimisations of its own, as some of the passes LLVM could run in their place are known to be slow.  For all the efficiency of its output, compile times with LLVM are frequently quite long.&lt;br&gt;
Once the IR has been optimised, code generation happens.  LLVM writes either object files or assembly in the specified output format to disk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linking
&lt;/h2&gt;

&lt;p&gt;If object files were emitted (eg. if we are building a native library or our project contains multiple files), the final step is to link these files.  This is done by calling the platform’s C compiler (eg. cc), using it to link the object files into an executable.&lt;/p&gt;

&lt;p&gt;I would like to use LLVM in my compiler, as it is a well-used system that will make code generation much more straightforward.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://tomlee.co/2014/04/a-more-detailed-tour-of-the-rust-compiler/"&gt;"A More Detailed Tour of the Rust Compiler" - Tom Lee&lt;/a&gt;  ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://github.com/rust-lang/rust/blob/master/src/libsyntax/ast.rs#L545-L580"&gt;rust/ast.rs at master · rust-lang/rust · GitHub&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://blog.rust-lang.org/2016/04/19/MIR.html"&gt;"Introducing MIR" - The Rust Programming Language Blog&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>compilers</category>
      <category>rust</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Compiler Series Part 2: GCC</title>
      <dc:creator>James</dc:creator>
      <pubDate>Tue, 06 Aug 2019 18:44:32 +0000</pubDate>
      <link>https://dev.to/miiizen/compiler-series-part-2-gcc-2mm4</link>
      <guid>https://dev.to/miiizen/compiler-series-part-2-gcc-2mm4</guid>
      <description>&lt;p&gt;&lt;a href="https://gcc.gnu.org/"&gt;GCC&lt;/a&gt;, originally written for the GNU operating system is a set of front ends, libraries and back ends for a few different programming languages.  This is an old piece of software, but is kept up to date and used regularly by many.  The compiler has front ends for (can compile) C/C++, Objective C, Fortran, Ada and Go.&lt;br&gt;&lt;br&gt;
The compiler was originally built to avoid paying for a license to use a vendor compiler for the GNU operating system.  Since then, it has has risen past proprietary compilers in both popularity and performance.&lt;br&gt;
The process started when a user runs a command like &lt;code&gt;gcc myfile.c&lt;/code&gt; at the terminal can be broken down into 4 steps: preprocessing, compilation, assembly and linking.&lt;/p&gt;
&lt;h2&gt;
  
  
  The preprocessor&lt;sup id="fnref1"&gt;1&lt;/sup&gt;
&lt;/h2&gt;

&lt;p&gt;The preprocessor deals with making modifications to the source code before it is compiled.  We will be looking at the C/C++ preprocessor specifically.  Other preprocessors or frontends are available for different languages as mentioned above.  A certain amount of lexing occurs at this step.  The actions it performs are:&lt;/p&gt;
&lt;h3&gt;
  
  
  Line splicing
&lt;/h3&gt;

&lt;p&gt;This tidies the lines of the source up if escaped newlines are used.  Escaped newlines are backslashes used to break up lines.  They must appear between tokens or within comments.&lt;br&gt;
All comments within the code are also replaced with single spaces.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tokenisation
&lt;/h3&gt;

&lt;p&gt;Once these transformations are complete, the program is split up into preprocessor tokens.  These tokens are going to be passed to the compiler, where they will be used as compiler tokens.  For now, the preprocessor tokens can be put into 5 categories: identifiers, preprocessing numbers, string literals, punctuators, and other.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Identifiers - any sequence of letters, digits or underscores which does not begin with a digit.  Keywords such as if and else are treated as identifiers at this stage, which means we can define macros with the same names as reserved keywords in the actual C/C++ language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Preprocessing numbers - the definition encompasses a large range of representations of numbers.  In order to support hex and other representations, any combination of letters, numbers, periods and underscores is allowed after an initial decimal digit.  This lets many invalid forms slip through this step; they will be picked up by the compiler, however.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;String literals - This includes string constants, character constants and arguments for the #include statement wrapped in angular brackets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Punctuators - all punctuation characters in ASCII except ‘@’, ‘$’, and ‘`’ are considered punctuators. Punctuators have meaning to the compiler, but what that meaning is depends on the context in which they are found.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Directive and macro handling
&lt;/h3&gt;

&lt;p&gt;This is the step most people associate with the C/C++ preprocessor, as it is seen most often by the user of the compiler.  If the stream of tokens contains nothing in the preprocessing language, it is simply passed to the compiler.&lt;br&gt;&lt;br&gt;
The following are common features used in the preprocessing language.&lt;/p&gt;
&lt;h4&gt;
  
  
  Including header files
&lt;/h4&gt;

&lt;p&gt;Header files contain C/C++ declarations and other preprocessor instructions.  Header files can be provided by the OS in order to use the system API, or defined by the user.  This makes it possible to share definitions between files.&lt;br&gt;
If we were to look at the code output by all the intermediate steps, we would see any &lt;code&gt;#include &amp;lt;xyz.h&amp;gt;&lt;/code&gt; replaced with the contents of &lt;code&gt;xyz.h&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Macro handling
&lt;/h4&gt;

&lt;p&gt;Macros can be object-like or function-like.  Object-like macros look like data objects.  They are most commonly used to give names to symbolic constants, eg. &lt;code&gt;#define MAX_SIZE 1024&lt;/code&gt;&lt;br&gt;
Wherever the name &lt;code&gt;MAX_SIZE&lt;/code&gt; is used in our code, the preprocessor will replace it with &lt;code&gt;1024&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Function-like macros look like function calls when used.  When they are “called”, the code that appears in the definition is simply copied to the call site. Eg. &lt;br&gt;
&lt;code&gt;#define SQUARE(x) x*x&lt;/code&gt;&lt;br&gt;
Wherever &lt;code&gt;SQUARE(x)&lt;/code&gt; appears in code, it is replaced with &lt;code&gt;x*x&lt;/code&gt;, where x is the text passed into SQUARE.&lt;br&gt;
Macros can be used to increase performance: because the code is copied in place by the preprocessor, the assembly output carries no function-call overhead.  This paragraph barely scratches the surface of macros and how useful they can be; however, they are not an essential part of a compiler.&lt;/p&gt;
&lt;h4&gt;
  
  
  Conditional compilation
&lt;/h4&gt;

&lt;p&gt;This allows the preprocessor to decide whether to hand the following section of code to the compiler.  There are a few situations in which this is useful, for example include guards.  Include guards prevent the contents of a header file being copied into the same translation unit more than once, which avoids duplicate definitions and speeds up compilation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// An example include guard&lt;/span&gt;
&lt;span class="cp"&gt;#ifndef __MY_HEADER
#define __MY_HEADER
&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Compilation&lt;sup id="fnref2"&gt;2&lt;/sup&gt;
&lt;/h2&gt;

&lt;p&gt;This is the stage which takes our preprocessed code and generates assembly from it.  The gcc compiler makes multiple passes over representations of the input program, transforming it through different intermediate representations before outputting machine code. &lt;br&gt;
The figure below shows the steps involved&lt;sup id="fnref3"&gt;3&lt;/sup&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hzNy_02Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5nq2lrzwq3ktfhet26iy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hzNy_02Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/5nq2lrzwq3ktfhet26iy.jpeg" alt="https://upload.wikimedia.org/wikipedia/commons/0/0b/Gcc.JPG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Parsing pass
&lt;/h3&gt;

&lt;p&gt;This is the frontend of the compiler.  The details about the C preprocessor written above apply to this step.  After the preprocessor has done its job, the compiler generates an Abstract Syntax Tree (AST) from the token stream.  For now this can be thought of as a tree data structure that represents the input program.  The AST is very easy to modify and transform, so is useful as an internal representation of the input.  &lt;/p&gt;

&lt;p&gt;Each front end works in a different way, but usually ends up outputting a language-independent form called Generic.  The history and internal workings around this stage seem to be a little complicated and confused, but as long as Generic is output and passed on to the middle end we are happy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gimplification pass
&lt;/h3&gt;

&lt;p&gt;This pass transforms our Generic representation into GIMPLE.  This is another intermediate representation, but a more restrictive one than an AST, and it is used for the optimization passes later on.  At its most basic, GIMPLE is a collection of tuples representing Generic expressions.  The salient points on their structure are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There can be no more than three operands per expression.  If there are more than three, the expression is split into smaller parts.&lt;/li&gt;
&lt;li&gt;All control flow is made up of conditionals and goto statements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tree SSA
&lt;/h3&gt;

&lt;p&gt;This step involves multiple passes over the internal representation we now have of the code.  SSA stands for &lt;em&gt;Static Single Assignment&lt;/em&gt;.  The GIMPLE representation is modified so that each variable is only assigned to once.  If a variable needs to be assigned to multiple times, new versions of it are created.&lt;br&gt;
As can be seen in section 9.4 of the gcc internals manual referenced above, this step involves 47 different optimizations.  These involve optimizing loops, conditionals, removing unreachable code and other complex looking operations. A new GIMPLE representation with changes applied is returned from this step and passed onto the next.&lt;/p&gt;

&lt;h3&gt;
  
  
  RTL
&lt;/h3&gt;

&lt;p&gt;RTL stands for &lt;em&gt;Register Transfer Language&lt;/em&gt;, and is a step closer to our desired machine code output.  The GIMPLE representation is translated to RTL, which assumes an abstract processor with an infinite number of registers.  Many passes are made over this form, optimizing the code even further.  The optimized form is passed to a gcc backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assembly
&lt;/h2&gt;

&lt;p&gt;Just like front ends, gcc has support for many backends that deal with outputting platform-specific machine code.  These include x86, i386 and ARM.  When given the optimised RTL, the backend will emit assembly for the specified platform.  That assembly is turned into machine code by the GNU assembler, known as gas or simply as, which was first released in 1986 but is still in frequent use.&lt;br&gt;
The output from assembly is an object file.  This is passed to the linker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Linking&lt;sup id="fnref4"&gt;4&lt;/sup&gt;
&lt;/h2&gt;

&lt;p&gt;The compilation and assembly process is performed on individual files.  The linker is used to connect all parts of a project together, whether that be header and implementation files or external libraries.  If a function is defined in a different file with extern or similar, it is the linker’s job to check if it has actually been defined.  In a project with thousands of lines of code, recompiling everything for one tiny change in a file would be a big pain.  Instead, the file containing the change is recompiled and the project is relinked.&lt;br&gt;&lt;br&gt;
Typically on a GNU/Linux system, the program ld is used.  This is the GNU linker.  The gcc program wraps calls to ld to make life easier for the developer, but if necessary this can be done manually.  &lt;/p&gt;

&lt;p&gt;This was just a glimpse at the huge, huge GCC project - currently sitting at just under 20 million lines of code. Picking through the technical documentation helped me get an overview of how a compiler should be structured.  After this analysis, I could begin to plan my own compiler. &lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://gcc.gnu.org/onlinedocs/cpp.pdf"&gt;"The C Preprocessor"&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://gcc.gnu.org/onlinedocs/gccint.pdf"&gt;"GCC Internals"&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://en.wikibooks.org/wiki/GNU_C_Compiler_Internals/GNU_C_Compiler_Architecture"&gt;"GCC Internals/GCC Architecture - Wikibooks."&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/6264249/how-does-the-compilation-linking-process-work"&gt;"c++ - How does the compilation/linking process work? - Stack Overflow."&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>compilers</category>
      <category>cpp</category>
      <category>c</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Compiler Series Part 1: Introduction</title>
      <dc:creator>James</dc:creator>
      <pubDate>Tue, 06 Aug 2019 18:39:38 +0000</pubDate>
      <link>https://dev.to/miiizen/compiler-series-part-1-introduction-1j83</link>
      <guid>https://dev.to/miiizen/compiler-series-part-1-introduction-1j83</guid>
      <description>&lt;p&gt;Programming was my first foray into computer science, and recently I wanted to dig into this topic in more detail.  The posts which follow will be a collection of my notes, made while writing my own compiler in C++.&lt;/p&gt;

&lt;p&gt;Compilers are an essential part of computing, allowing higher level languages to be translated to machine code that the computer can understand.  This provides an important level of abstraction to the programmer, meaning they don’t have to worry about the complexity and quirks of assembly languages.&lt;br&gt;&lt;br&gt;
I will break down the compiler into a collection of separate, easier to understand modules.  Compilers lend themselves to this modular approach quite nicely.  From front to back:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The lexer - This performs lexical analysis, breaking the input up into tokens&lt;/li&gt;
&lt;li&gt;The parser - This puts the tokens we get from the lexer into the context of our language via an intermediate representation&lt;/li&gt;
&lt;li&gt;Transformation - Once we have the intermediate representation, this can be tidied and optimised&lt;/li&gt;
&lt;li&gt;Code generation - The intermediate representation is used to generate the required assembly for each language construct&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the next section, we'll get a bird's-eye view of &lt;a href="https://gcc.gnu.org/"&gt;gcc&lt;/a&gt;, the GNU Compiler Collection.&lt;/p&gt;

</description>
      <category>compilers</category>
      <category>computerscience</category>
    </item>
  </channel>
</rss>
