This blog is part of a series
How does a C program get compiled? For C-like languages, compilation involves four steps:
- Preprocessing, compile-time metaprogramming
- Compilation itself, translation of the source code to assembly
- Assembling, turning assembly into machine code in an object file
- Linking, turning the object file into an executable or a library
Of course, all these categories, except for linking, are to some degree arbitrary. Preprocessing is an anomaly, a language within a language, a crutch - it exposes the limited expressive power of the base language. The compilation step is blurry too, because you can embed assembly code in C code, and the compiler passes it through instead of translating it. And the assembly is not the manufacturer's assembly - it's written for the GNU Assembler (GAS). Originally, an assembly language was described in the ISA manual, and the manufacturer shipped an assembler alongside it, which read the assembly and translated it into machine code. GAS is not that: it's a portable assembler that targets many architectures, with its own directives and, on x86, AT&T syntax by default. Still, the mental framework of these four steps is a net positive, even if, past a certain point of experience, you start to see the gaps in the structure.
We already discussed the preprocessor in the previous blog, so let's now turn our attention to compilation. Compile our test program like this:
[lostghost1@archlinux c]$ gcc -S main.c
Or rather, for cleaner, unoptimized assembly:
[lostghost1@archlinux c]$ gcc -S -O0 -fno-asynchronous-unwind-tables -fno-unwind-tables -fno-ident -fno-stack-protector main.c
Resulting assembly with explanatory comments:
.file "main.c"
.text
.globl main
.type main, @function
main:
pushq %rbp # Prologue: save old base pointer
movq %rsp, %rbp # Set new base pointer
subq $16, %rsp # Allocate 16 bytes for local variables
movl %edi, -4(%rbp) # Save argc (1st argument, int) at -4(%rbp)
movq %rsi, -16(%rbp) # Save argv (2nd argument, char **) at -16(%rbp)
cmpl $1, -4(%rbp) # Compare argc to 1
jg .L2 # If argc > 1, jump to .L2 (print argument)
movl $1, %eax # argc <= 1: set return value to 1
jmp .L3 # Jump to the epilogue
.L2:
movq -16(%rbp), %rax # Load argv into %rax
addq $8, %rax # Advance to argv[1] (first argument, skipping program name)
movq (%rax), %rax # Dereference: load pointer to argument string
movq %rax, %rdi # Move that pointer to %rdi (argument for puts)
call puts@PLT # Print argv[1] with puts()
movl $0, %eax # Set return value to 0
.L3:
leave # Epilogue: restore frame pointer and stack
ret # Return to caller
.size main, .-main
.section .note.GNU-stack,"",@progbits
As you can see, many C constructs translate into assembly directly. For example:
int a = 10, b = 20, c;
c = a + b;
Translates to roughly this (simplified, and in Intel syntax rather than the AT&T syntax of the listing above):
mov eax, 10
mov ebx, 20
add eax, ebx
mov c, eax
Another example:
int arr[4] = {1, 2, 3, 4};
int *p = &arr[2];
*p = 99;
Translates to roughly this (simplified, Intel syntax):
lea rax, [arr + 8] ; p = &arr[2] (int, 4 bytes each, so the offset is 8)
mov dword ptr [rax], 99 ; *p = 99
So in a way, C is just higher-level assembly. But in other ways, it isn't. Some constructs have no defined meaning at all: the C standard calls this undefined behavior (signed integer overflow is one example), and the compiler is free to translate it however it likes. Structs, enums and unions are higher-level data types, which don't have a direct assembly counterpart. Calling conventions vary between CPUs and operating systems. If you want to explore how exactly code translates into assembly, there is a really useful website for that: Compiler Explorer (godbolt.org).
After compilation comes assembling, which translates assembly code into machine code for a given ISA. The output is no longer text - it's a binary file, specifically one in the ELF format.
But the resulting artifact is an object file, which isn't the final process image. It contains information about sections (.text, .data, .bss) and their contents, as well as references to symbols imported from other object files and libraries. The machine code in it uses section-relative addresses - offsets from the start of each section. Since we don't yet know at which addresses these sections will be loaded, we can't run the program yet. What lays out the sections in memory, thus grouping them into segments, is the linker - and it does so according to a linker script. On Arch Linux, these are at /lib/ldscripts/.
Let's examine one. Take elf_x86_64.x:
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64") /* output file format: 64-bit ELF for x86-64 */
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start) /* which symbol is the entry point of the executable */
SEARCH_DIR("/usr/x86_64-pc-linux-gnu/lib64"); SEARCH_DIR("/usr/lib"); SEARCH_DIR("/usr/local/lib"); SEARCH_DIR("/usr/x86_64-pc-linux-gnu/lib"); /* which directories to search for libraries while linking */
SECTIONS
{
/* Read-only sections, merged into text segment: */
PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000));
. = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;
/* Place the build-id as close to the ELF headers as possible. This
maximises the chance the build-id will be present in core files,
which GDB can then use to locate the associated debuginfo file. */
.note.gnu.build-id : { *(.note.gnu.build-id) }
.interp : { *(.interp) }
.hash : { *(.hash) }
This shows how sections are placed into segments, starting at address 0x400000.
Let's now link the program manually:
[lostghost1@archlinux c]$ gcc -c main.c
[lostghost1@archlinux c]$ ld main.o --dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/crt1.o -lc -o main
[lostghost1@archlinux c]$ ./main hello
hello
When invoking ld, our linker, we needed to specify the path to the dynamic loader (the option is called --dynamic-linker, which is quite confusing), because we are linking a dynamic and not a static executable - more on the distinction later. crt1.o is a special object file, part of the standard C library, which contains the entry point (the _start symbol). -lc is libc, glibc in our case - alternatives such as musl libc exist.
Now let's inspect the binary:
[lostghost1@archlinux c]$ readelf -a main
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x401060
Start of program headers: 64 (bytes into file)
Start of section headers: 13088 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 12
Size of section headers: 64 (bytes)
Number of section headers: 24
Section header string table index: 23
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .interp PROGBITS 00000000004002e0 000002e0
000000000000001c 0000000000000000 A 0 0 1
[ 2] .hash HASH 0000000000400300 00000300
0000000000000018 0000000000000004 A 4 0 8
[ 3] .gnu.hash GNU_HASH 0000000000400318 00000318
000000000000001c 0000000000000000 A 4 0 8
[ 4] .dynsym DYNSYM 0000000000400338 00000338
0000000000000048 0000000000000018 A 5 1 8
[ 5] .dynstr STRTAB 0000000000400380 00000380
0000000000000039 0000000000000000 A 0 0 1
[ 6] .gnu.version VERSYM 00000000004003ba 000003ba
0000000000000006 0000000000000002 A 4 0 2
[ 7] .gnu.version_r VERNEED 00000000004003c0 000003c0
0000000000000030 0000000000000000 A 5 1 8
[ 8] .rela.dyn RELA 00000000004003f0 000003f0
0000000000000018 0000000000000018 A 4 0 8
[ 9] .rela.plt RELA 0000000000400408 00000408
0000000000000018 0000000000000018 AI 4 18 8
[10] .plt PROGBITS 0000000000401000 00001000
0000000000000020 0000000000000010 AX 0 0 16
[11] .text PROGBITS 0000000000401020 00001020
0000000000000075 0000000000000000 AX 0 0 16
[12] .rodata PROGBITS 0000000000402000 00002000
0000000000000004 0000000000000004 AM 0 0 4
[13] .eh_frame PROGBITS 0000000000402008 00002008
0000000000000088 0000000000000000 A 0 0 8
[14] .note.gnu.pr[...] NOTE 0000000000402090 00002090
0000000000000040 0000000000000000 A 0 0 8
[15] .note.ABI-tag NOTE 00000000004020d0 000020d0
0000000000000020 0000000000000000 A 0 0 4
[16] .dynamic DYNAMIC 0000000000403e60 00002e60
0000000000000180 0000000000000010 WA 5 0 8
[17] .got PROGBITS 0000000000403fe0 00002fe0
0000000000000008 0000000000000008 WA 0 0 8
[18] .got.plt PROGBITS 0000000000403fe8 00002fe8
0000000000000020 0000000000000008 WA 0 0 8
[19] .data PROGBITS 0000000000404008 00003008
0000000000000004 0000000000000000 WA 0 0 1
[20] .comment PROGBITS 0000000000000000 0000300c
000000000000001b 0000000000000001 MS 0 0 1
[21] .symtab SYMTAB 0000000000000000 00003028
0000000000000180 0000000000000018 22 5 8
[22] .strtab STRTAB 0000000000000000 000031a8
00000000000000a6 0000000000000000 0 0 1
[23] .shstrtab STRTAB 0000000000000000 0000324e
00000000000000cc 0000000000000000 0 0 1
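We see that the executable still carries section headers alongside the program headers! The loader only needs the program headers; section headers serve the linker and debugging tools. Let's remove them, since we won't be debugging this executable: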
[lostghost1@archlinux c]$ strip --strip-section-headers main
[lostghost1@archlinux c]$ readelf -a main
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x401060
Start of program headers: 64 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 12
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
There are no sections in this file.
There are no section groups in this file.
Much better!
Now on the difference between static and dynamic executables. Object files that call external functions contain unresolved symbols. They can be resolved during linking - when the executable is laid out in program segments, the places where functions are called are patched with the actual function addresses. This makes for a static executable. However, we can choose to postpone resolving the symbols until program start. In that case the executable declares which libraries it needs and which symbols it needs from them - and at program start the dynamic linker (the loader) runs first, finds those libraries, loads them, and resolves the symbols. This makes for a dynamic executable.
Let's see which one our program is:
[lostghost1@archlinux c]$ ldd main
linux-vdso.so.1 (0x00007ffedcd23000)
libc.so.6 => /usr/lib/libc.so.6 (0x0000756dbada8000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x0000756dbafc0000)
Both libc and the loader are needed at runtime (linux-vdso is a special pseudo-library that the kernel maps into every process). That makes the executable dynamic.
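Statically linking against glibc is possible but discouraged: some of its features, such as NSS name lookups, still load shared libraries at runtime. To compile a properly static executable, install musl libc: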
[lostghost1@archlinux c]$ yay -S musl clang
[lostghost1@archlinux c]$ musl-clang --static main.c -o main
[lostghost1@archlinux c]$ ldd main
not a dynamic executable
[lostghost1@archlinux c]$ ./main hello
hello
This executable has all its symbols resolved - no dynamic loader needed!
Lastly, let's touch upon building static and dynamic libraries themselves. A static library is just an archive of object files:
[lostghost1@archlinux c]$ cat main.c
#include <stdio.h>
#include "sayhello.h"
int main(int argc, char** argv){
    sayhello();
    return 0;
}
[lostghost1@archlinux c]$ cat sayhello.h
#ifndef _SAYHELLO_H
#define _SAYHELLO_H
void sayhello();
#endif
[lostghost1@archlinux c]$ cat sayhello.c
#include <stdio.h>
void sayhello(){
    printf("Hello!\n");
}
[lostghost1@archlinux c]$ musl-clang -c sayhello.c
[lostghost1@archlinux c]$ musl-clang -c main.c
[lostghost1@archlinux c]$ ar q libsayhello.a sayhello.o
ar: creating libsayhello.a
[lostghost1@archlinux c]$ musl-clang --static main.o -L. -lsayhello -o main
[lostghost1@archlinux c]$ ldd main
not a dynamic executable
[lostghost1@archlinux c]$ ./main
Hello!
Here, -L. means "look in this directory", and -lsayhello means "look for a library named sayhello" - that is libsayhello.a, because we specified --static (otherwise the linker would prefer libsayhello.so).
As for a dynamic library:
[lostghost1@archlinux c]$ rm main
[lostghost1@archlinux c]$ gcc -shared sayhello.o -o libsayhello.so
[lostghost1@archlinux c]$ gcc main.o -L. -lsayhello -o main
[lostghost1@archlinux c]$ ldd main
linux-vdso.so.1 (0x00007ffc384aa000)
libsayhello.so => not found
libc.so.6 => /usr/lib/libc.so.6 (0x000074e040a48000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x000074e040c65000)
[lostghost1@archlinux c]$ ./main
./main: error while loading shared libraries: libsayhello.so: cannot open shared object file: No such file or directory
[lostghost1@archlinux c]$ LD_LIBRARY_PATH=. ./main
Hello!
Typically the current directory isn't searched - neither for executables (which is why we have to write ./ when running ./main), nor for libraries. This is for security reasons, so that we don't accidentally run what we didn't intend to. That is why we have to resort to setting the LD_LIBRARY_PATH environment variable.
Of course, the shared library advertises its exported symbol:
[lostghost1@archlinux c]$ readelf -a libsayhello.so
...
Symbol table '.dynsym' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterT[...]
2: 0000000000000000 0 FUNC GLOBAL DEFAULT UND [...]@GLIBC_2.2.5 (2)
3: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __gmon_start__
4: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_registerTMC[...]
5: 0000000000000000 0 FUNC WEAK DEFAULT UND [...]@GLIBC_2.2.5 (2)
6: 0000000000001110 20 FUNC GLOBAL DEFAULT 11 sayhello
And that's all I have to share when it comes to compiling and linking a C program. In the next blog we will examine loading and running an ELF executable file. See ya then!
Top comments (1)
Fantastic job, very in-depth.
I think that some people who program in C for the first time may not come from a low-level background of knowing assembly, but from a higher-level background of programming in Python, so they may not be aware of what a CPU architecture is.
It could be helpful to give a general overview of a CPU architecture defining the instructions supported by a CPU "type" and the encoding of the instructions, and also explaining the prominent architectures: