Writing Your First Compiler - Part 2: What Is a Compiler?

#computerscience #programming #tutorial #opensource

Before we dive into writing code, let's demystify what a compiler actually does. At its core, a compiler is just a program that translates source code from one language to another. Most commonly, it translates from a high-level programming language (like our YFC language) to something your computer can execute.

Think of compilation as a series of translations, each step getting closer to what your CPU understands.

Let's start with this simple recursive implementation of the Fibonacci sequence in C.

int fib(int n) {
    if (n <= 1) {
        return n;
    }
    return fib(n - 1) + fib(n - 2);
}

int main() {
    return fib(10);
}

This code should be relatively easy to understand for someone who's familiar with programming or has at least been introduced to the concept of recursion. But if you write this code, you still have to compile it before you can use it. Why?

The answer is quite simple: your CPU doesn't actually understand C.

Your processor has no idea what int fib(int n) means or what if (n <= 1) is supposed to do. At the hardware level, your CPU only understands very basic instructions like 'move this number from memory into a register' or 'add these two numbers together.'

So when you write return fib(n - 1) + fib(n - 2); that single line needs to be translated into around ten simple CPU instructions that handle things like:

Subtracting 1 from n
Calling the function recursively
Storing the result somewhere
Subtracting 2 from the original n
Calling the function again
Adding those two results together
Returning the final value
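
The steps above can be sketched in C itself by giving every intermediate value its own variable. This is only an illustration of the decomposition, not what a compiler literally produces; the function name fib_steps and the temporaries are made up for this example.

```c
/* fib rewritten so that each statement does roughly one machine-level step.
   fib_steps and the names of the temporaries are hypothetical. */
int fib_steps(int n) {
    if (n <= 1) {
        return n;
    }
    int arg1 = n - 1;              /* subtract 1 from n */
    int first = fib_steps(arg1);   /* call recursively and store the result */
    int arg2 = n - 2;              /* subtract 2 from the original n */
    int second = fib_steps(arg2);  /* call the function again */
    int sum = first + second;      /* add the two results together */
    return sum;                    /* return the final value */
}
```

Calling fib_steps(10) computes the same value as fib(10); the only difference is that every intermediate result is now visible as a named variable.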

Phew, that's quite a lot, which is why we rely on compilers to bridge the gap between the code you want to write and the instructions your CPU can actually execute.

Let's see what this looks like in practice. If you compile our Fibonacci function with GCC (targeting the x86-64 architecture) using the -O0 flag (which disables optimizations, keeping the output easy to follow), here's what the compiler actually produces, shown in Intel syntax:

fib:
        push    rbp
        mov     rbp, rsp
        push    rbx
        sub     rsp, 24
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 1
        jg      .L2
        mov     eax, DWORD PTR [rbp-20]
        jmp     .L3
.L2:
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 1
        mov     edi, eax
        call    fib
        mov     ebx, eax
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 2
        mov     edi, eax
        call    fib
        add     eax, ebx
.L3:
        mov     rbx, QWORD PTR [rbp-8]
        leave
        ret
main:
        push    rbp
        mov     rbp, rsp
        mov     edi, 10
        call    fib
        pop     rbp
        ret

Don't worry if this looks intimidating. The key point is that our short C function became more than twenty assembly instructions, each doing one very specific thing that your CPU can understand.

But it doesn't quite stop here. Assembly is still human-readable text. What your CPU actually understands is machine code - raw binary instructions, just 1s and 0s.

The program that converts assembly into machine code is called an assembler, and despite the cool name, it's essentially just another compiler. Instead of translating C to assembly, it translates assembly to machine code.
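
To make that last translation step concrete, here is a minimal sketch of a "toy assembler" that only knows five fixed instructions from the listing above. The table, the struct, and the function name toy_assemble are all hypothetical. A real assembler parses operands, resolves labels, and can often choose between several valid encodings of the same instruction; the bytes below are one standard x86-64 encoding of each.

```c
#include <stddef.h>
#include <string.h>

/* One row of the toy table: instruction text and its x86-64 machine code. */
struct encoding {
    const char *text;
    unsigned char bytes[3];
    size_t length;
};

/* Encodings for a few fixed instructions that appear in the listing. */
static const struct encoding table[] = {
    { "push rbp",     { 0x55 },             1 },
    { "mov rbp, rsp", { 0x48, 0x89, 0xE5 }, 3 },
    { "pop rbp",      { 0x5D },             1 },
    { "leave",        { 0xC9 },             1 },
    { "ret",          { 0xC3 },             1 },
};

/* Copies the machine code for `line` into `out` (at least 3 bytes long) and
   returns the number of bytes written, or 0 if the instruction is unknown. */
size_t toy_assemble(const char *line, unsigned char *out) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(line, table[i].text) == 0) {
            memcpy(out, table[i].bytes, table[i].length);
            return table[i].length;
        }
    }
    return 0;
}
```

For example, toy_assemble("ret", buf) writes the single byte 0xC3 into buf. The lookup-table shape is the point of the sketch: for simple instructions, assembling is close to a direct substitution of text for bytes.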

Compilation vs. Translation
You might've heard terms like compiler, transpiler, interpreter, and assembler thrown around. They all involve converting code from one form to another, but there are some useful distinctions. Feel free to skim this section: it isn't essential to developing our compiler, but it can serve as a good reference. Note that these definitions are loose, and the boundaries between them often don't matter much in practice:

Compiler: Typically refers to a program that translates code from a high-level language to a lower-level language. This usually means going from something human-friendly (like C, Rust, or our YFC language) to something closer to what the machine understands (like assembly, machine code, or some intermediate representation). But it's also very common to see compilers translate source code into C - the key characteristic is that you're generally moving down the abstraction ladder, with the target language being at a lower level than the source language.

Transpiler: A program that translates code from one high-level language to another high-level language, typically at roughly the same level of abstraction. TypeScript to JavaScript is a classic example - both are high-level languages, neither is really "closer to the metal" than the other. The same goes for tools like Babel that convert modern JavaScript to older JavaScript syntax, or compilers that convert SASS to CSS. Unlike compilers that move down the abstraction ladder, transpilers generally move horizontally - the source and target languages are at similar levels of abstraction, they're just different languages or different versions of the same language.

Assembler: A program that converts assembly language into machine code. As we saw earlier, assembly language uses human-readable mnemonics like mov and add, but your CPU needs these translated into raw binary instructions. The assembler handles this final step of translation. While we often give it a special name, an assembler is really just a very simple compiler - it's doing the same translation job, just with an almost 1:1 mapping between assembly instructions and machine instructions. Each assembly instruction typically corresponds directly to a single machine instruction, which makes the assembler's job much more straightforward than that of a full compiler like GCC.

Interpreter: A program that executes source code directly without producing a standalone executable. Python is a classic example - when you run a Python script, the interpreter reads and executes your code. While Python does compile to an intermediate bytecode representation (those .pyc files you might have seen), it still interprets that bytecode rather than converting it all the way to machine code. Languages like Java work similarly with the JVM. We'll talk more about intermediate representations and bytecode in a later section.
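
As a rough illustration of what "interpreting bytecode" means, here is a minimal sketch of an interpreter loop for a tiny stack machine. The opcodes and the function name run_bytecode are invented for this example and have nothing to do with Python's or the JVM's real bytecode formats; the sketch also assumes the bytecode is well formed.

```c
#include <stddef.h>

/* Hypothetical opcodes for a tiny stack machine. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Executes the bytecode in `code` and returns the value on top of the stack
   when OP_HALT is reached. OP_PUSH is followed by one operand; the other
   opcodes take none. No bounds checking: the sketch assumes valid bytecode
   that never needs more than 16 stack slots. */
int run_bytecode(const int *code) {
    int stack[16];
    size_t top = 0;  /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction to execute */

    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:
            stack[top++] = code[pc++];
            break;
        case OP_ADD:
            top--;
            stack[top - 1] += stack[top];
            break;
        case OP_MUL:
            top--;
            stack[top - 1] *= stack[top];
            break;
        case OP_HALT:
            return stack[top - 1];
        default:
            return 0;  /* unknown opcode: give up */
        }
    }
}
```

Running it on the sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_HALT evaluates (2 + 3) * 4. Notice that no machine code for the program is ever produced: the interpreter itself is the only native code running, and it decides what to do one opcode at a time.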

JIT (Just-In-Time) Compiler: A hybrid approach that combines interpretation and compilation. Modern JavaScript engines (like V8 in Chrome) and the Java Virtual Machine use JIT compilation. The idea is that code starts being interpreted for fast startup, but as the program runs, the JIT compiler identifies "hot" code paths - sections that run frequently - and compiles those to optimized machine code on the fly. This gives you the flexibility of interpretation with the performance of compilation where it matters most. It's called "just-in-time" because the compilation happens right when (or just before) the code is needed, rather than ahead of time.

What We're Building
In this tutorial, we're building a true compiler. YFC is a high-level language with functions, conditionals, and loops - abstractions that make it easy for humans to express logic.

x86-64 assembly, on the other hand, is very low-level - it's essentially a thin layer of readable text over raw machine instructions. By translating from YFC down to assembly, we're moving down that abstraction ladder we talked about, doing the same fundamental job that compilers like GCC and Clang do.

Our pipeline is going to look like this:
YFC source code → (our compiler) → x86-64 assembly → (assembler) → machine code

We're responsible for the first and hardest step: taking high-level abstractions and breaking them down into the simple instructions a CPU can execute. The assembler handles the final mechanical translation to binary.
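
To give a feel for that first step, here is a minimal sketch of a code generator for the smallest possible "language": integer constants and addition. It is not YFC and not the design this series will use. The struct expr type and the emit_expr function are hypothetical, and the sketch skips parsing entirely by starting from an already-built expression tree. It also assumes the output buffer is large enough for the generated text.

```c
#include <stdio.h>
#include <stddef.h>

/* A hypothetical expression tree: a constant, or the sum of two subtrees. */
struct expr {
    int is_add;                /* 0: constant, 1: addition */
    int value;                 /* used when is_add == 0 */
    const struct expr *left;   /* used when is_add == 1 */
    const struct expr *right;  /* used when is_add == 1 */
};

/* Appends x86-64 assembly (Intel syntax) that computes `e` into eax.
   `len` is the current length of the text in `out`, `cap` its capacity.
   Returns the new length. Assumes `out` is large enough (no truncation). */
size_t emit_expr(const struct expr *e, char *out, size_t len, size_t cap) {
    if (!e->is_add) {
        len += (size_t)snprintf(out + len, cap - len, "mov eax, %d\n", e->value);
        return len;
    }
    len = emit_expr(e->left, out, len, cap);    /* left result ends up in eax */
    len += (size_t)snprintf(out + len, cap - len, "push rax\n");  /* save it */
    len = emit_expr(e->right, out, len, cap);   /* right result ends up in eax */
    len += (size_t)snprintf(out + len, cap - len, "pop rcx\n");   /* restore left */
    len += (size_t)snprintf(out + len, cap - len, "add eax, ecx\n");
    return len;
}
```

Given a tree for 1 + 2, the function emits five lines of assembly: load 1, push it, load 2, pop the saved value into rcx, and add. The generated code is far from optimal, but it shows the core idea: the compiler walks a high-level structure and writes out simple instructions as text, which the assembler then turns into machine code.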

Next: Part 3: Lexical Analysis
Previous: Part 1: Introduction

Code Repository
All accompanying code for this tutorial series is available on GitHub. The repository includes complete source code for each step, example programs, and additional resources to help you follow along.
