DEV Community

Paul Lefebvre

Compilers 101 - Overview and Lexer

These compiler posts are all high-level and are based on the LLVM and Compiler session from the Xojo Developer Conference 2016. None of them will teach you how to write a compiler. The goal is to give you a basic understanding of the components of a compiler and how they work together to create a native app.

Compiler Components

A compiler is complicated and consists of many components. In general, a compiler is divided into two major parts: the front end and the back end. Each part, in turn, has its own components.

For the purposes of these posts, this is how we will be covering the components of the compiler:

Front End

The front end is responsible for taking the source code and converting it to a format that the back end can then use to generate binary code that can run on the target CPU architecture. The front end has these components:

  • Lexer
  • Parser
  • Semantic Analyzer
  • IR (intermediate representation) Generator

Back End

The back end takes the IR, optionally optimizes it and then generates a binary (machine code) file that can be run on the target CPU architecture. These are the components of the back end:

  • Optimizer
  • Code Generation
  • Linker

Each of these steps processes its input and moves the code a little further along for the next step to handle.

The linker is not technically part of the compiler but is often considered part of the compile process.

Lexer

The lexer turns source code into a stream of tokens. The term is short for “lexical analyzer”. A token is a simple representation of one item in the code: its type, its value (if it has one) and its position in the source.

By way of example, here is a line of source code that does a simple calculation:

sum = 3.14 + 2 * 4

The lexer scans this line from left to right and produces the following tokens:

  1. Token: sum
    • type: identifier
    • value: sum
    • start: 0
    • length: 3
  2. Token: =
    • type: equals or assigns
    • value: n/a
    • start: 4
    • length: 1
  3. Token: 3.14
    • type: double
    • value: 3.14
    • start: 6
    • length: 4
  4. Token: +
    • type: plus
    • value: n/a
    • start: 11
    • length: 1
  5. Token: 2
    • type: integer
    • value: 2
    • start: 13
    • length: 1
  6. Token: *
    • type: multiply
    • value: n/a
    • start: 15
    • length: 1
  7. Token: 4
    • type: integer
    • value: 4
    • start: 17
    • length: 1

As you can see, white space is ignored; comments, had there been any, would be skipped as well. After processing that single line of code, there are 7 tokens that are handed off to the next part of the compiler, the Parser, which is covered in the next post.
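
To make the idea concrete, here is a minimal Python sketch of a lexer for the sample line. It is an illustration only and not how the Xojo compiler is implemented: the names Token, TOKEN_SPEC and tokenize are hypothetical, and the regular expressions cover just the handful of token types used in this example.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    type: str              # e.g. "identifier", "double", "plus"
    value: Optional[str]   # text of the token, or None for operators
    start: int             # zero-based offset in the source line
    length: int            # number of characters in the token


# Hypothetical token rules. Order matters: "double" is tried before
# "integer" so that 3.14 is matched as a single token.
TOKEN_SPEC = [
    ("double", r"\d+\.\d+"),
    ("integer", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("equals", r"="),
    ("plus", r"\+"),
    ("multiply", r"\*"),
    ("whitespace", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Return the list of tokens in source, skipping white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind = match.lastgroup  # name of the rule that matched
        text = match.group()
        if kind != "whitespace":
            # Only identifiers and literals carry a value.
            has_value = kind in ("identifier", "double", "integer")
            tokens.append(
                Token(kind, text if has_value else None, match.start(), len(text))
            )
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize("sum = 3.14 + 2 * 4"):
        print(token)
```

Running this on the sample line yields seven tokens, each carrying a type, a value, a start position and a length. A real lexer would also handle comments, strings, keywords and error reporting, all of which this sketch leaves out.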

This post first appeared on the Xojo Blog as Compilers 101 - Overview and Lexer.

Top comments (2)

Daniel Kassen

I am already so excited for the next part. Thank you for taking the time to write this!

Paul Lefebvre

The next post, Compilers 102 - Parser, is now available.