Compilers 101 - Overview and Lexer

#llvm #compiler

These compiler posts are all high-level overviews based on the LLVM and Compiler session from the Xojo Developer Conference 2016. None of them will teach you how to write a compiler. The goal is to give you a basic understanding of the components of a compiler and how they work together to create a native app.

Compiler Components

A compiler is a complicated piece of software made up of many components. In general, a compiler is divided into two major parts: the front end and the back end. Each of those parts has its own components.

For the purposes of these posts, this is how we will be covering the components of the compiler:

Front End

The front end is responsible for taking the source code and converting it to a format that the back end can then use to generate binary code that can run on the target CPU architecture. The front end has these components:

Lexer
Parser
Semantic Analyzer
IR (intermediate representation) Generator

Back End

The back end takes the IR, optionally optimizes it, and then generates a binary (machine code) file that can be run on the target CPU architecture. These are the components of the back end:

Optimizer
Code Generation
Linker

Each of these steps takes the output of the step before it and moves it a little further along for the next step to handle.

The linker is not technically part of the compiler but is often considered part of the compile process.

Lexer

The lexer turns source code into a stream of tokens. The name “lexer” is short for “lexical analyzer”, the component that performs lexical analysis. A token is a simple representation of one item in the code: what kind of item it is, its value (if it has one), and where it appears in the source.

By way of example, here is a line of source code that does a simple calculation:

sum = 3.14 + 2 * 4

The lexer scans the line from left to right. The first token it finds is “sum”:
- type: identifier
- value: sum
- start: 0
- length: 3
Token: =
- type: equals or assigns
- value: n/a
- start: 4
- length: 1
Token: 3.14
- type: double
- value: 3.14
- start: 6
- length: 4
Token: +
- type: plus
- value: n/a
- start: 11
- length: 1
Token: 2
- type: integer
- value: 2
- start: 13
- length: 1
Token: *
- type: multiply
- value: n/a
- start: 15
- length: 1
Token: 4
- type: integer
- value: 4
- start: 17
- length: 1

As you can see, white space is ignored, and the same goes for comments. After processing that single line of code, there are 7 tokens that are handed off to the next part of the compiler, the parser, which is covered in the next post.
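
To make the idea concrete, here is a minimal sketch of a lexer for that one line, written in Python rather than Xojo. It is an illustration only, not how Xojo or LLVM tokenize code. The `Token` class, the `TOKEN_SPEC` table, and the `tokenize` function are names made up for this example, the token type names are simplified (for instance `equals` rather than "equals or assigns"), and it only knows the handful of token types used above. It skips white space but does not handle comments.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    type: str
    value: Optional[str]  # None for operators (shown as "n/a" in the list above)
    start: int            # zero-based offset into the source line
    length: int


# Hypothetical token table for this example only.
# Order matters: "double" is tried before "integer" so 3.14 is not split at the dot.
TOKEN_SPEC = [
    ("double", r"\d+\.\d+"),
    ("integer", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("equals", r"="),
    ("plus", r"\+"),
    ("multiply", r"\*"),
    ("whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> List[Token]:
    """Turn one line of source into a list of tokens, skipping white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        text = match.group()
        if kind != "whitespace":
            # Only identifiers and literals carry a value; operators do not.
            value = text if kind in ("identifier", "double", "integer") else None
            tokens.append(Token(kind, value, match.start(), len(text)))
        pos = match.end()
    return tokens


for token in tokenize("sum = 3.14 + 2 * 4"):
    print(token)
```

Running this prints one `Token` per item in the line, each with the same four fields used in the list above. A real lexer handles far more (keywords, strings, comments, error recovery), but the basic loop of matching the next piece of text, classifying it, and recording where it was found is the same idea.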