Shreyos Ghosh

Behind the Scenes of C++ | Compiling and Linking

C++ is a compiled language: before a program can run, its source code must be translated into machine code. Most of us use an IDE (Integrated Development Environment), which makes this, for all intents and purposes, as easy as clicking a button, so most of the process is hidden from us.

Text to Binary Conversion

In this blog post, we'll look closely at the steps needed to convert our text-based code into a form the machine can understand.

Grab your coffee and write your helloworld.cpp program; we're going to convert it to helloworld.exe soon.

Hello World of C++

Let's write our first C++ program. Open a new file in Notepad or any text editor of your choice, type or paste the C++ code below, and save it as helloworld.cpp.

// This is our simple hello world program
#include <iostream>

int main(int argc, char* argv[]) {
    // this line below will print "Hello World!" to the screen
    std::cout << "Hello World!" << std::endl;
    std::cin.get();
    // if all goes well the main will return 0
    return 0;
}

Here, we will use g++, the C++ compiler from the GNU Compiler Collection (GCC), to convert our C++ code into an executable binary.

Note: By default, g++ decides how to treat a file by looking at its extension, in this case .cpp (this can be overridden with the -x option). So, if you're writing C++ source code, save it with the extension .cpp or one of the other recognized ones such as .cc, .c++, .cxx, .cp, or .C. For header files, the compiler itself imposes no such restriction, but it is best practice to name them with the .h or .hpp extension for clarity and consistency.

Don't worry if you find the above code unclear. We'll talk about the code in detail, and about headers (why we use them and how they actually work), in my upcoming blogs.

Once you have your helloworld.cpp file, open the command prompt, navigate to the folder where you saved it, and run the command below to get the output file helloworld.exe.

g++ helloworld.cpp -o helloworld.exe

Note: If you're on Windows, you must install a Windows port of the GNU toolchain, such as MinGW-w64 (Minimalist GNU for Windows), to use the g++ compiler.

By now you should have the helloworld.exe file. Run it (for example, by double-clicking it) and "Hello World!" is printed in a terminal window.

So, let's have a look at how the compiler converts our C++ code into machine-executable form.

The C++ Compilation Model

To convert our C++ code into an executable binary, C++ compilers follow a compilation model. It consists of a few essential stages that every source file passes through before it can be executed.

These stages are:

  1. Preprocessing of the source code
  2. Compiling the processed source code
  3. Assembling the compiled file
  4. Linking the object code binary


Let's look at each of these stages in a bit more detail.

Preprocessing:

This is the first stage of the build process, where the source code undergoes a series of textual transformations before the actual compilation begins. These are:

  • Handling Preprocessor Directives: Preprocessor directives (or preprocessor statements) are commands that start with the '#' symbol and instruct the preprocessor to perform certain textual manipulations on the source code. They can be further categorized as:

    • File Inclusion Directive - Inserts the contents of library headers and project headers into the source code (basically a simple copy and paste). Example: #include
    • Macro Definition Directive - Replaces identifiers with replacement tokens or token strings in the current file. Example: #define
    • Conditional Directives - Conditionally compile sections of the current file. Examples: #if, #endif, #ifdef, #ifndef, #else, #elif.
    • Pragma Directives - Give compiler-specific instructions that apply to a file or to specified sections of code. Example: #pragma once.
  • Removal of the Comments: The ability to write comments is a remarkable feature of C++; it makes the code more readable for us humans.

    Note: In C++ we can write comments using two forward slashes, "//Text-Comment" (single-line), or like this, "/*Text-Comment*/" (multi-line).

    However, comments have no utility for the machine, since English grammar is "far too loosely structured" to be used in the later stages. This is why, during the preprocessing stage, every comment in the source file is replaced with a single space character, while the newline characters are retained.
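To make the directive categories above concrete, here is a minimal sketch. The names SQUARE, GREETING_ENABLED, squareOfSum, and greeting are hypothetical and exist only for this illustration.

```cpp
#include <string>  // file inclusion: the header's text is pasted in here

// Macro definition: every SQUARE(x) is textually replaced before compilation.
// The extra parentheses keep SQUARE(1 + 2) from expanding to 1 + 2 * 1 + 2.
#define SQUARE(x) ((x) * (x))

// Hypothetical feature switch used by the conditional directives below.
#define GREETING_ENABLED 1

int squareOfSum(int a, int b) {
    return SQUARE(a + b);  // expands to ((a + b) * (a + b))
}

std::string greeting() {
#if GREETING_ENABLED
    return "Hello World!";  // kept: the condition is non-zero
#else
    return "";              // discarded by the preprocessor
#endif
}
```

Running g++ -E on a file containing this sketch should show SQUARE(a + b) already expanded and only one of the two return statements left in greeting().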

To generate a preprocessed file helloworld.i from the helloworld.cpp source code, navigate to the source code directory again and run the command below:

g++ -E helloworld.cpp -o helloworld.i

Now, if you take a look at the generated preprocessed file, you'll see that it is no longer our small helloworld program; instead, it contains thousands of lines of code.

This is an indication that the preprocessor has done its job. It copied the entire contents of the library header iostream (and everything that header includes in turn) into the output, ahead of our own code. At the very end of the file, you should see our helloworld program with all the comments removed.

# 3 "helloworld.cpp" 2


# 4 "helloworld.cpp"
int main(int argc, char* argv[]) {

    std::cout << "Hello World!" << std::endl;
    std::cin.get();

    return 0;
}

Compiling:

After preprocessing, each preprocessed source file is fed into the compiler as a separate translation unit, where it goes through a series of phases to be transformed into an assembly file.

Translation Units

These phases include:

Phases of a Compiler

  • Lexical Analysis or Tokenization: In this phase, the source code is split into lexemes (sequences of characters), and each lexeme is categorized as a token. For example, let's see how the compiler will tokenize the following lines of preprocessed source code.
int main() {
    return 0;
}
Lexeme  | Token
int     | Keyword
main    | Identifier
(       | Left parenthesis (Punctuator)
)       | Right parenthesis (Punctuator)
{       | Left curly brace (Punctuator)
return  | Keyword
0       | Integer literal
;       | Semicolon (Punctuator)
}       | Right curly brace (Punctuator)

Note: Tokens are the smallest individual units of a program. C++ has different types of tokens, including keywords (if, return, int), identifiers (main, foo, var), literals (3.1415f, 0, "Hello World!"), operators (=, +, -), and punctuators ((, ), {, }, ;).

  • Syntax Analysis or Parsing: During syntax analysis, the tokens are organized into a hierarchical structure known as a parse tree, which is used to check whether the code conforms to the language's syntax rules.
int main() {
    return 0
}

// The above code will generate error in the syntax analysis.

helloworld.cpp:5:13: error: expected ';' before '}' token
    5 |     return 0
      |             ^
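As a rough illustration of how a parser arrives at a diagnostic like the one above, here is a toy checker for a single, made-up grammar rule: statement := "return" number ";". All function names are hypothetical, and a real C++ parser is vastly more involved.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Advances pos past any whitespace.
void skipSpaces(const std::string& s, std::size_t& pos) {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
}

// Consumes `word` at pos if it is there.
// Simplification: no word-boundary check, so "return0" is not rejected.
bool expectWord(const std::string& s, std::size_t& pos, const std::string& word) {
    skipSpaces(s, pos);
    if (s.compare(pos, word.size(), word) != 0) {
        return false;
    }
    pos += word.size();
    return true;
}

// Consumes one or more decimal digits at pos.
bool expectNumber(const std::string& s, std::size_t& pos) {
    skipSpaces(s, pos);
    std::size_t start = pos;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        ++pos;
    }
    return pos > start;
}

// Returns an empty string if `s` matches the rule, otherwise a diagnostic.
std::string checkReturnStatement(const std::string& s) {
    std::size_t pos = 0;
    if (!expectWord(s, pos, "return")) return "expected 'return'";
    if (!expectNumber(s, pos)) return "expected a number";
    if (!expectWord(s, pos, ";")) return "expected ';'";
    return "";
}
```

With this sketch, checkReturnStatement("return 0;") succeeds, while checkReturnStatement("return 0") reports the missing semicolon, much like the g++ error shown above.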
  • Semantic Analysis: Semantic analysis checks the code's meaning and context. The parse tree generated in the parsing phase is converted into a simpler, abstract form without unnecessary syntactic details such as braces and commas, known as the Abstract Syntax Tree (AST).

    With the help of the AST, the compiler verifies whether names are declared before they are used, whether types are compatible, whether variables are initialized before being read, and whether the code otherwise conforms to the semantic rules of the language.

int var = "Hello";  // error: cannot convert "Hello" (const char*) to 'var' (int)
pi = 3.1415f;       // error: variable 'pi' was not declared
int x;
int result = x + 3; // 'x' is read before being initialized (typically a warning, not an error)
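For contrast, here is one way the three problems above could be resolved so that the code passes semantic analysis. The function name semanticallyValid and the chosen fixes are only an illustration; other corrections are equally valid.

```cpp
int semanticallyValid() {
    const char* text = "Hello";  // the variable's type now matches the literal
    float pi = 3.1415f;          // 'pi' is declared before it is used
    int x = 0;                   // 'x' is initialized before it is read
    int result = x + 3;
    (void)text;                  // silence "unused variable" warnings
    (void)pi;
    return result;
}
```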
  • Intermediate Code Generation: During this phase, the front-end tree (AST) obtained from semantic analysis is converted into an intermediate representation (IR) before it is sent to the optimizer for further analysis.

Note: One of the main reasons for this conversion is that ASTs differ across languages (C, C++). Performing the same kind of optimization directly on each language's AST would require a separate implementation per language, which is hard to maintain and upgrade.

g++ uses a three-address IR called GIMPLE (GNU SIMPLE representation) for high-level optimizations; it is largely independent of the processor being targeted.

You can generate the GIMPLE file directly from the source file helloworld.cpp by using the following command:

g++ -fdump-tree-gimple=helloworld.gimple helloworld.cpp -o helloworld.exe
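To give a feel for what "three-address" means, the sketch below writes the same computation twice in plain C++. It is an illustration of the idea only, not actual GIMPLE output, and the temporaries t1 and t2 are hypothetical names.

```cpp
// The expression as a programmer would write it.
int direct(int a, int b, int c, int d) {
    return a + b * c - d;
}

// The same computation in three-address style: each statement has at most
// one operator, and intermediate values go into compiler-style temporaries.
int threeAddress(int a, int b, int c, int d) {
    int t1 = b * c;
    int t2 = a + t1;
    int result = t2 - d;
    return result;
}
```

Breaking expressions down this way gives the optimizer small, uniform statements to analyze, whatever the original source language looked like.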
  • Code Optimization: This phase, also known as the middle end, is where most of the optimizations occur. These optimizations work at a higher level, meaning that they do not target the final hardware just yet.

    Instead, the optimizer performs transformations such as dead code elimination, removal of redundant calculations, loop optimizations, and constant folding, among others.

int foo(int x, int y) {
    int val = 10 + (5 * 3); // <= Constant folding: the expression is replaced
                            //    with the single integer literal "25" at
                            //    compile time instead of being calculated
                            //    at runtime.
    int sum = x + y;
    return x + y; // <= Redundant calculation: this can be replaced with
                  //    the value of "sum", or the previous line can be
                  //    removed entirely during the optimization phase.
}
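Constant folding itself can be sketched on a tiny expression tree. This is a toy model under simplifying assumptions (only constants, variables, addition, and multiplication), and it is not how GCC represents or folds expressions. Expr, constant, variable, binary, and fold are hypothetical names.

```cpp
#include <memory>
#include <utility>

struct Expr {
    enum class Kind { Constant, Variable, Add, Mul };
    Kind kind = Kind::Constant;
    int value = 0;              // meaningful only for Constant
    std::unique_ptr<Expr> lhs;  // used only by Add and Mul
    std::unique_ptr<Expr> rhs;
};

std::unique_ptr<Expr> constant(int v) {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Kind::Constant;
    e->value = v;
    return e;
}

std::unique_ptr<Expr> variable() {
    auto e = std::make_unique<Expr>();
    e->kind = Expr::Kind::Variable;
    return e;
}

std::unique_ptr<Expr> binary(Expr::Kind kind, std::unique_ptr<Expr> l,
                             std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->kind = kind;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}

// Folds bottom-up: if both operands of a node end up as constants,
// the whole node is replaced by a single constant.
std::unique_ptr<Expr> fold(std::unique_ptr<Expr> e) {
    if (e->kind == Expr::Kind::Constant || e->kind == Expr::Kind::Variable) {
        return e;  // leaves cannot be simplified further
    }
    e->lhs = fold(std::move(e->lhs));
    e->rhs = fold(std::move(e->rhs));
    if (e->lhs->kind == Expr::Kind::Constant &&
        e->rhs->kind == Expr::Kind::Constant) {
        int l = e->lhs->value;
        int r = e->rhs->value;
        return constant(e->kind == Expr::Kind::Add ? l + r : l * r);
    }
    return e;  // at least one operand is only known at runtime
}
```

Folding the tree for 10 + (5 * 3) produces a single constant node holding 25, while x + (5 * 3) is only partially folded to x + 15.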
  • Target Code Generation: In this final phase, the compiler converts the higher-level IR into a relatively lower-level format called RTL (Register Transfer Language), which is very similar to assembly instructions but uses pseudo-registers so that low-level optimizations can be performed on it easily.

    The RTL is generated for the target architecture so that the pseudo-registers can later be replaced with real registers in the register allocation phase. After register allocation, the assembly code is emitted as text.

To generate the assembly source file (helloworld.s), use the following command:

g++ -S helloworld.cpp

Assembling:

In the assembling stage, the assembly source file from the previous stage is passed to the assembler, which converts each statement into machine code to build the object file.

You can generate the object file helloworld.o directly from the helloworld.cpp file by running the command below:

g++ -c helloworld.cpp

The object code we get out of this is not yet executable. This is because we still need to link in the library code that holds the actual implementation of the features we used in our helloworld program. This is where the linker comes in.

Linking:

This is the final stage of the C++ compilation model, where the object files generated from the individual translation units are linked with each other and with the library files.

There are two ways to link our object code with a library:

  • Static Linking: In static linking, our separate object files (.o with g++, .obj with some other toolchains) and the static library file (.a or .lib) are combined by the linker at build time (before the program is loaded into memory) to produce a single self-contained executable binary.

Static Linking

Note: You might wonder why we need to link the library file when we've already included the iostream header. It is because header files such as iostream only declare the features that are available (i.e. declarations of variables and function signatures). The actual implementation of those functions lives in a library file, and by combining the two parts we get the final executable file.
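The split between declaration and implementation described in the note can be sketched as follows. The functions add and addTwice are hypothetical, and both parts are placed in one file only so that the sketch is self-contained; in a real project the declaration would sit in a header and the definition in a library or another .cpp file.

```cpp
// Declaration: what a header would provide. It tells the compiler that
// add() exists and what its signature is, but contains no implementation.
int add(int a, int b);

// This function compiles with only the declaration of add() in view.
int addTwice(int a, int b) {
    return add(a, b) + add(a, b);
}

// Definition: what a library (or another source file) would provide.
// If no linked file contained this definition, g++ would stop at the
// link step with an "undefined reference" error for add(int, int).
int add(int a, int b) {
    return a + b;
}
```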

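  • Dynamic Linking: In dynamic linking, the library code is not copied into the executable. Instead, our already built executable is linked with the library's implementation when the program is loaded or while it runs. Such a library is called a dynamically linked library, or .dll file, on Windows (.so on Linux).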

Dynamic Linking

Congrats! I hope you now have a good understanding of how the C++ compilation model and the linker function together to transform our C++ code into machine language.

Here are some of the resources that I read/watched while making this post. I highly encourage reading them if you want to learn more:

C++ Books That I have followed:

  • Programming: Principles and Practice Using C++ by Bjarne Stroustrup
  • C++ Fundamentals by Antonio Mallia and Francesco Zoffoli

