xb

Cracking the Compiler: What Happens After Preprocessing

Introduction

In our last article, we followed a C program through its entire compilation journey - from source code to executable. Now, we’re zooming in on the compilation phase, the step that transforms your high-level code into something closer to the machine's language. This is where the real magic begins.

Overview of the Compilation Phase

In the previous article, we explored how a C source file is transformed into an executable through a series of key stages. These are:

  • Preprocessing
  • Compilation
  • Assembly
  • Linking

Each of these phases plays a crucial role, and each can be broken down into finer steps. The Compilation Phase, specifically, is made up of six smaller steps:

  • Lexical Analysis
  • Syntax Analysis
  • Semantic Analysis
  • Intermediate Code Generation
  • Optimization
  • Code Generation

Lexical Analysis (Tokenization)

Lexical Analysis (aka scanning) is the first step in the Compilation Phase. It reads the source program as a stream of characters and groups those characters into tokens, the smallest meaningful units of the language. Put simply, the Lexical Analysis Phase converts character sequences into token sequences. The program that performs this conversion is called a Lexical Analyzer (or lexer).

A Lexical Analyzer has two main responsibilities. These are:

  • Tokenization - Break the input text into basic units called tokens (keywords, identifiers, numbers, symbols, etc.).

    Let’s say, for example, we had the following line of code within our program.

    int age = 18;
    

    After tokenization, we might end up with an output that looks something like:

    ["int", "age", "=", "18", ";"]
    
  • Meaning Assignment - Categorize each token by type. For example:

    • “int” → KEYWORD
    • “age” → IDENTIFIER
    • “=” → ASSIGNMENT_OPERATOR
    • “18” → INTEGER_LITERAL
    • “;” → SEMICOLON

NOTE: Errors like unrecognized symbols or malformed tokens are caught here.
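
To make the two responsibilities concrete, here is a minimal sketch of a lexer in C. It is an illustration only, not how any production compiler is written: the `Token` struct, the `TokenType` names, and the `tokenize` function are all hypothetical, and the sketch assumes a tiny language where `int` is the only keyword and tokens are shorter than 32 characters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token categories, mirroring the labels used in this article. */
typedef enum {
    TOK_KEYWORD, TOK_IDENTIFIER, TOK_INTEGER_LITERAL,
    TOK_ASSIGNMENT_OPERATOR, TOK_PLUS, TOK_SEMICOLON, TOK_UNKNOWN
} TokenType;

typedef struct {
    TokenType type;
    char text[32];   /* assumption: tokens are shorter than 32 characters */
} Token;

/* Assumption: "int" is the only keyword this sketch knows about. */
static int is_keyword(const char *s) {
    return strcmp(s, "int") == 0;
}

/* Splits src into tokens and returns how many were written to out (at most max). */
size_t tokenize(const char *src, Token *out, size_t max) {
    assert(src != NULL && out != NULL);
    size_t n = 0;
    const char *p = src;
    while (*p != '\0' && n < max) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        Token *t = &out[n++];
        size_t len = 0;
        if (isalpha((unsigned char)*p) || *p == '_') {
            /* Tokenization: collect the whole word... */
            while ((isalnum((unsigned char)*p) || *p == '_') && len < sizeof t->text - 1)
                t->text[len++] = *p++;
            t->text[len] = '\0';
            /* ...meaning assignment: decide what kind of word it is. */
            t->type = is_keyword(t->text) ? TOK_KEYWORD : TOK_IDENTIFIER;
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p) && len < sizeof t->text - 1)
                t->text[len++] = *p++;
            t->text[len] = '\0';
            t->type = TOK_INTEGER_LITERAL;
        } else {
            t->text[0] = *p;
            t->text[1] = '\0';
            switch (*p++) {
                case '=': t->type = TOK_ASSIGNMENT_OPERATOR; break;
                case '+': t->type = TOK_PLUS; break;
                case ';': t->type = TOK_SEMICOLON; break;
                default:  t->type = TOK_UNKNOWN; break;  /* unrecognized symbol */
            }
        }
    }
    return n;
}
```

Calling `tokenize("int age = 18;", tokens, 8)` on an array of eight `Token` values fills in five tokens. A symbol the sketch does not recognize, such as `@`, comes back tagged `TOK_UNKNOWN`, which is where a real lexer would report an error.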

After the Lexical Analysis, the next step in the Compilation Phase is Syntax Analysis.

Syntax Analysis (Parsing)

Syntax Analysis (aka parsing) is the second step in the Compilation Phase. It takes the token sequence produced during the Lexical Analysis Phase and checks it against the grammar of the programming language (in our case, C). From those tokens it builds a Parse Tree or Abstract Syntax Tree (AST), which represents the program’s overall structure. For example, suppose our C program contains the following line of code:

a = b + 5;

The Lexical Analysis Phase would first break this line of code into tokens:

  • “a” → IDENTIFIER
  • “=” → ASSIGNMENT_OPERATOR
  • “b” → IDENTIFIER
  • “+” → PLUS
  • “5” → INTEGER_LITERAL
  • “;” → SEMICOLON

The Syntax Analysis Phase would then discard syntactic details that carry no structural information (such as the semicolon) and arrange the remaining tokens according to how they relate to one another, producing an Abstract Syntax Tree that might look something like this:

       =
     /   \
    a     +
         / \
        b   5

NOTE: At this stage, errors like missing semicolons or mismatched parentheses are caught.
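
The tree above can be built with a small recursive-descent parser. The sketch below is a simplified illustration under stated assumptions: the `Node` struct and the `parse_assignment` function are hypothetical, the grammar covers only `identifier = term + term ... ;`, and for brevity the parser reads characters directly instead of consuming tokens from a separate lexer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical AST node: leaves hold a name or number, inner nodes an operator. */
typedef struct Node {
    char text[16];               /* e.g. "a", "5", "=", "+" */
    struct Node *left, *right;
} Node;

static Node *new_node(const char *text, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    snprintf(n->text, sizeof n->text, "%s", text);
    n->left = left;
    n->right = right;
    return n;
}

void free_tree(Node *n) {
    if (n == NULL) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

static void skip_spaces(const char **p) {
    while (isspace((unsigned char)**p)) (*p)++;
}

/* term := identifier | integer literal. Returns NULL on a syntax error. */
static Node *parse_term(const char **p) {
    char buf[16];
    size_t len = 0;
    skip_spaces(p);
    if (!isalnum((unsigned char)**p) && **p != '_') return NULL;
    while ((isalnum((unsigned char)**p) || **p == '_') && len < sizeof buf - 1)
        buf[len++] = *(*p)++;
    buf[len] = '\0';
    return new_node(buf, NULL, NULL);
}

/* expr := term ('+' term)*, grouped left to right. */
static Node *parse_expr(const char **p) {
    Node *left = parse_term(p);
    if (left == NULL) return NULL;
    for (;;) {
        skip_spaces(p);
        if (**p != '+') return left;
        (*p)++;
        Node *right = parse_term(p);
        if (right == NULL) { free_tree(left); return NULL; }
        left = new_node("+", left, right);
    }
}

/* statement := identifier '=' expr ';'. Returns NULL on a syntax error. */
Node *parse_assignment(const char *src) {
    const char *p = src;
    Node *target = parse_term(&p);
    if (target == NULL) return NULL;
    skip_spaces(&p);
    if (isdigit((unsigned char)target->text[0]) || *p != '=') {
        free_tree(target);       /* target must be a name followed by '=' */
        return NULL;
    }
    p++;
    Node *value = parse_expr(&p);
    if (value == NULL) { free_tree(target); return NULL; }
    skip_spaces(&p);
    if (*p != ';') {             /* the missing-semicolon case */
        free_tree(target);
        free_tree(value);
        return NULL;
    }
    return new_node("=", target, value);
}
```

`parse_assignment("a = b + 5;")` returns a root node labeled `=` whose left child is `a` and whose right child is a `+` node with children `b` and `5`. Note that the semicolon is checked but never stored, which is exactly the kind of detail an AST leaves out. Release the tree with `free_tree` when done.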

After Syntax Analysis, the next step in the Compilation Phase is Semantic Analysis.

Semantic Analysis

Semantic Analysis is the third step in the Compilation Phase and is concerned with verifying that the program’s declarations and statements are semantically valid. Put simply, the Semantic Analysis step ensures that the parsed code makes sense. It uses the syntax tree together with a symbol table (a record of every declared name and its type) to check that the program is consistent with the language definition. For example, take a look at the following code snippet:

int x = "hello";         // 1
y = 5;                   // 2
int add(int a, int b) {
    return a + b;
}
int result = add(10);    // 3

Although each line follows the shape the grammar expects, the snippet contains several semantic errors:

  • (1) Initializing an int variable with a string → Type Mismatch
  • (2) Using y before it is declared → Undeclared Identifier
  • (3) Calling the add function with one argument instead of two → Arity Mismatch

NOTE: Errors like using undeclared identifiers and passing the wrong number or type of arguments are caught at this stage.
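
Two of these checks, the undeclared identifier and the arity mismatch, can be sketched with a very small symbol table. Everything here is hypothetical and heavily simplified: the `SymbolTable` layout, the `SemResult` codes, and the function names are made up for illustration, there is a single flat scope, and type checking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 32   /* assumption: a small fixed-size table, one scope */

typedef struct {
    const char *name;
    int is_function;
    int param_count;     /* only meaningful when is_function is nonzero */
} Symbol;

typedef struct {
    Symbol entries[MAX_SYMBOLS];
    int count;
} SymbolTable;

typedef enum {
    SEM_OK,
    SEM_UNDECLARED_IDENTIFIER,
    SEM_NOT_A_FUNCTION,
    SEM_ARITY_MISMATCH
} SemResult;

/* Records a declaration so that later uses can be checked against it. */
void declare(SymbolTable *t, const char *name, int is_function, int param_count) {
    assert(t != NULL && t->count < MAX_SYMBOLS);
    Symbol *s = &t->entries[t->count++];
    s->name = name;
    s->is_function = is_function;
    s->param_count = param_count;
}

static const Symbol *lookup(const SymbolTable *t, const char *name) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return &t->entries[i];
    return NULL;
}

/* Checks a plain use of a name, such as the y in "y = 5;". */
SemResult check_use(const SymbolTable *t, const char *name) {
    return lookup(t, name) != NULL ? SEM_OK : SEM_UNDECLARED_IDENTIFIER;
}

/* Checks a call such as "add(10)" against the recorded declaration. */
SemResult check_call(const SymbolTable *t, const char *name, int arg_count) {
    const Symbol *s = lookup(t, name);
    if (s == NULL) return SEM_UNDECLARED_IDENTIFIER;
    if (!s->is_function) return SEM_NOT_A_FUNCTION;
    return s->param_count == arg_count ? SEM_OK : SEM_ARITY_MISMATCH;
}
```

After `declare(&table, "x", 0, 0)` and `declare(&table, "add", 1, 2)`, the call `check_use(&table, "y")` reports `SEM_UNDECLARED_IDENTIFIER` and `check_call(&table, "add", 1)` reports `SEM_ARITY_MISMATCH`, matching errors (2) and (3) in the snippet.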

Next up, we move on to the Intermediate Code Generation Step.

Intermediate Code Generation

This is the fourth step in the Compilation Phase and is responsible for translating the Abstract Syntax Tree into a platform-independent intermediate representation (IR). Without an IR, every combination of source language and target machine would need its own full compiler. With one, the front end (everything up to this point) can be reused across machines, and only the back end that turns IR into machine-specific code has to change. This point within the Compilation Process can be thought of as a halfway point between source code and assembly.
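
One common style of intermediate representation is three-address code, where each instruction has at most one operator and intermediate results are stored in temporaries. The sketch below lowers the AST for `a = b + 5;` into that form. It is an assumption-laden illustration: the article does not say which IR is used, and the `Node`, `IRBuffer`, and `gen_assignment` names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical AST node, as produced by a parser. */
typedef struct Node {
    const char *text;                  /* "a", "5", "=", "+" */
    const struct Node *left, *right;
} Node;

/* Collects the generated IR as text, one instruction per line. */
typedef struct {
    char text[256];
    size_t len;
    int next_temp;                     /* number of the next temporary (t0, t1, ...) */
} IRBuffer;

static void emit_line(IRBuffer *ir, const char *line) {
    size_t n = strlen(line);
    assert(ir->len + n + 1 < sizeof ir->text);   /* room for line, '\n', '\0' */
    memcpy(ir->text + ir->len, line, n);
    ir->len += n;
    ir->text[ir->len++] = '\n';
    ir->text[ir->len] = '\0';
}

/* Emits code for n and writes into `place` the name that holds its value. */
static void gen_expr(IRBuffer *ir, const Node *n, char *place, size_t place_size) {
    if (n->left == NULL && n->right == NULL) {   /* leaf: variable or literal */
        snprintf(place, place_size, "%s", n->text);
        return;
    }
    char lhs[16], rhs[16], line[128];
    gen_expr(ir, n->left, lhs, sizeof lhs);
    gen_expr(ir, n->right, rhs, sizeof rhs);
    snprintf(place, place_size, "t%d", ir->next_temp++);
    snprintf(line, sizeof line, "%s = %s %s %s", place, lhs, n->text, rhs);
    emit_line(ir, line);
}

/* Lowers an assignment node: "=" with the target on the left. */
void gen_assignment(IRBuffer *ir, const Node *assign) {
    char value[16], line[128];
    assert(ir != NULL && assign != NULL && strcmp(assign->text, "=") == 0);
    gen_expr(ir, assign->right, value, sizeof value);
    snprintf(line, sizeof line, "%s = %s", assign->left->text, value);
    emit_line(ir, line);
}
```

For the tree of `a = b + 5;`, starting from a zero-initialized `IRBuffer`, the buffer ends up holding two lines: `t0 = b + 5` followed by `a = t0`. Nothing in that output mentions registers or a particular processor, which is the point of this step.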

The next step in the Compilation Process is the Optimization Phase.

Optimization

This step in the Compilation Process improves the intermediate representation generated in the previous step so that the resulting code runs faster or uses fewer resources, without changing what the program does. The optimizations performed at this stage are commonly grouped into two kinds:

  • Local Optimization - Happens within a single basic block (a straight-line sequence of code with no jumps or branches in the middle). The goal is to improve performance in small, tightly scoped chunks of code. For example:

    int x = 2 * 3;  // constant folding: becomes int x = 6;
    
    x = x * 2;      // strength reduction: becomes x = x + x;
    
    int y = a + b;
    int z = a + b;  // common subexpression elimination: becomes int z = y;
    
    
  • Global Optimization - These optimizations span multiple basic blocks or even entire functions. This is done with the goal of optimizing code with the broader context in consideration. This often leads to more significant performance gains.
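
Two of the local rewrites shown above, constant folding and replacing `x * 2` with `x + x`, can be sketched as a function that inspects one IR instruction at a time. This is a minimal illustration: the `Operand` and `Instr` structs and the `optimize_instr` function are hypothetical, and a real optimizer works on a much richer IR with many more rules.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical IR operand: either a compile-time constant or a named variable. */
typedef struct {
    int is_const;        /* 1: `value` is meaningful; 0: `name` is */
    int value;
    const char *name;
} Operand;

/* Hypothetical IR instruction: dst = lhs op rhs. */
typedef struct {
    const char *dst;
    char op;             /* '+', '*', or '=' for a plain copy of lhs */
    Operand lhs, rhs;
} Instr;

/* Applies at most one local rewrite to `in`. Returns 1 if it changed, else 0. */
int optimize_instr(Instr *in) {
    assert(in != NULL);
    if (in->op != '+' && in->op != '*') return 0;

    /* Constant folding: both operands are known at compile time. */
    if (in->lhs.is_const && in->rhs.is_const) {
        long long a = in->lhs.value, b = in->rhs.value;
        long long r = (in->op == '+') ? a + b : a * b;
        if (r < INT_MIN || r > INT_MAX) return 0;   /* do not fold on overflow */
        in->op = '=';                                /* dst = constant */
        in->lhs.value = (int)r;
        return 1;
    }

    /* Strength reduction: x * 2 becomes x + x. */
    if (in->op == '*' && !in->lhs.is_const &&
        in->rhs.is_const && in->rhs.value == 2) {
        in->op = '+';
        in->rhs = in->lhs;
        return 1;
    }
    return 0;
}
```

An instruction representing `x = 2 * 3` is rewritten into a copy of the constant `6`, and one representing `x = x * 2` is rewritten into `x = x + x`. An instruction such as `y = a + b` is left untouched because neither rule applies to it in isolation; eliminating the repeated `a + b` requires looking at more than one instruction.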

Next up, we have Code Generation, the final step in the Compilation Phase.

Code Generation

At this final stage, the optimized intermediate representation is converted into assembly code for the target platform. This assembly code is then passed to the Assembler, bringing us to the end of the Compilation Phase in our journey from Source to Success.
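
As a rough sketch of what this conversion involves, the function below turns one IR instruction of the form `dst = lhs op rhs` into three assembly-like lines. The output is for an imaginary one-register machine, not a real instruction set: the `LOAD`, `ADD`, `MUL`, and `STORE` mnemonics, the `R0` register, and the `Instr` and `gen_asm` names are all assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical IR instruction: dst = lhs op rhs. */
typedef struct {
    const char *dst;
    const char *lhs;
    char op;             /* '+' or '*' in this sketch */
    const char *rhs;
} Instr;

/*
 * Writes assembly-like text for `in` into `out`.
 * Returns 0 on success, -1 for an unsupported operator or a too-small buffer.
 */
int gen_asm(const Instr *in, char *out, size_t out_size) {
    const char *mnemonic;
    assert(in != NULL && out != NULL);
    switch (in->op) {
        case '+': mnemonic = "ADD"; break;
        case '*': mnemonic = "MUL"; break;
        default:  return -1;
    }
    /* Load the left operand, combine it with the right one, store the result. */
    int n = snprintf(out, out_size,
                     "LOAD R0, %s\n%s R0, %s\nSTORE %s, R0\n",
                     in->lhs, mnemonic, in->rhs, in->dst);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For the instruction `a = b + 5`, this produces a load of `b` into `R0`, an `ADD` of `5`, and a store of `R0` into `a`. A real code generator also has to choose among many registers, pick addressing modes, and follow the target's calling conventions, all of which this sketch ignores.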

Conclusion

The Compilation Phase is where the true transformation of high-level code begins, turning human-readable C into something much closer to machine language. By breaking down this phase into its individual components, we gain a clearer understanding of how compilers bridge the gap between our logic and the machine’s instructions.

Each step plays a critical role: Lexical and Syntax Analysis ensure structure, Semantic Analysis ensures meaning, Intermediate Code provides flexibility, Optimization boosts performance, and Code Generation seals the deal. Whether you're debugging a strange compiler error or building your own programming language, understanding this process demystifies what goes on behind the scenes and reveals just how much intelligence compilers bring to the table.
