Before we begin, I want to say that I am not a compiler engineer and I have very little connection to this field in general. I am not great at math, and even less so at the mathematical terms that are often used in “compiler construction.” I also do not know much about optimization and other smart things.
This article will not explain “how to make a compiler?” and similar topics. It will simply describe my small experience of creating a compiler for a language.
The idea of creating my own compiler came before the idea of creating my own programming language. I remember digging through the source code of the EV3 controller firmware to make an emulator for it, and then a thought came to me: why not make something like that, but my own? The emulator never really turned out well, but the idea of writing a compiler or a virtual machine never left my head.
After a long search for compiler projects that I could understand at least a little, I came across CheezLang. According to the repository description, the language is similar to Rust. I do not know how to program in Rust at all, but the compiler itself is written in C# (I am a C# guy btw 🥴), so my starting point ended up being this project.
After picking the project apart for a long time, after several unsuccessful attempts to modify it, and after much else, I realized that simply taking someone else’s project and slightly changing the semantics was not the best approach. I had to read a ton of articles and write a “simple compiler,” and through all that studying I gradually began to understand roughly how it all works.
This is roughly how the compiler architecture turned out to be structured (the arrows show the flow of data between compiler components):
It is worth noting that the project was not written top-down or bottom-up. It was written across all stages at once. In fact, I think that is a bad way to write compilers, but since I did not understand what might be needed at the next compiler stage, I had to keep rewriting things all the time.
Lexer and parser
The lexer and parser were simply taken from the CheezLang project and later heavily rewritten. If I understand correctly, that project used an LR(1) parser. I thought that would be enough for my project (at that time I did not even understand what language I wanted). In the end, that type of parser did not suit me because of the need to parse generics, so I had to rewrite it to do lookahead over several tokens.
A bit more detail on why I had to do this: if we have an expression
var a = Test < b > c;
with an LR(1) parser it is hard (or impossible) to tell whether this is a chain of comparisons or something involving generics. To determine that, you need to look ahead several tokens, and even then you can only maybe figure out what it is.
In reality, even that is not enough to determine whether something is a generic or a tricky expression with comparisons, because cases like this can also appear:
var a = Test < b > (c);
Such an expression needs to be handled during type inference, which we will talk about later. Unfortunately, I did not have enough skill and knowledge to implement such precise handling of these tricky expressions in my compiler, so I just left that for future me :)
Otherwise, developing the lexer and parser was not that painful: you just get tokens from the lexer, “look ahead” a little, create AST nodes in the parser, and that is it. If something goes wrong, you throw errors.
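The lookahead trick above can be sketched roughly like this. This is a minimal Python model, not the actual C# parser: after an identifier followed by `<`, scan ahead to the matching `>` and peek at what comes next. All the function and token names here are my own illustration.

```python
# Minimal sketch: disambiguating "Test < b > c" (comparisons) from
# "Test<b>(c)" (possibly a generic call) by looking several tokens ahead.

def tokenize(src):
    # Naive tokenizer, just enough for the example.
    out, i, symbols = [], 0, "<>();,"
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c in symbols:
            out.append(c); i += 1
        else:
            j = i
            while j < len(src) and src[j] not in symbols and not src[j].isspace():
                j += 1
            out.append(src[i:j]); i = j
    return out

def looks_like_generic(tokens, i):
    """tokens[i] is the identifier before '<'. Scan ahead to the
    matching '>' and peek at what follows: '(' suggests a generic
    call, anything else is parsed as comparisons."""
    if i + 1 >= len(tokens) or tokens[i + 1] != "<":
        return False
    depth, j = 0, i + 1
    while j < len(tokens):
        if tokens[j] == "<":
            depth += 1
        elif tokens[j] == ">":
            depth -= 1
            if depth == 0:
                return j + 1 < len(tokens) and tokens[j + 1] == "("
        j += 1
    return False  # '<' never closed: definitely not a generic

print(looks_like_generic(tokenize("Test < b > c ;"), 0))      # False
print(looks_like_generic(tokenize("Test < b > ( c ) ;"), 0))  # True
```

Note that, exactly as described above, `Test < b > (c)` still passes this check even when it is really two comparisons, which is why that case has to be deferred to type inference.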
Type inference
The next compilation stage is type inference. Put simply, you get an AST tree, walk through it, try to understand what the type is everywhere, try to evaluate some expressions at compile time, and that is all.
The first problem I ran into when I started writing this stage was that you cannot just walk through the trees sequentially. Since classes, structs, and so on can be in different files, can be declared below the place where they are used, and there are other nuances, I had to process these trees in a clever way. Here are the stages I divided the processing into:
- Processing declarations of top-level types without their internals and other details. Just “put” the declarations into the namespace scope;
- Processing generic parameters of these types and processing constraints of those generic parameters;
- Processing inherited types;
- Processing delegates, nested types, and function declarations;
- Processing field and property declarations;
- Processing field and property initializers;
- Processing attributes;
- And only at this stage function bodies are processed.
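The key property of the staged list above is that each pass runs over the whole program before the next pass starts, so declaration order across files stops mattering. A tiny Python sketch of the idea (file names, node shapes, and scope layout are all illustrative, not the compiler’s real API):

```python
# Sketch: why declarations are processed in global passes rather than
# one file at a time. "a.src" uses type B before b.src has been seen.

files = {
    "a.src": [("class", "UsesB", {"field_type": "B"})],
    "b.src": [("class", "B", {"field_type": None})],
}

scope = {}

# Pass 1: put every top-level type into the namespace scope, no details.
for decls in files.values():
    for kind, name, _ in decls:
        scope[name] = {"kind": kind, "fields": {}}

# A later pass: resolve field types. By now B is already in scope,
# even though a.src was processed first.
for decls in files.values():
    for _, name, info in decls:
        if info["field_type"] is not None:
            scope[name]["fields"]["f"] = scope[info["field_type"]]

assert scope["UsesB"]["fields"]["f"] is scope["B"]
```

Processing each file depth-first instead would fail at `UsesB`’s field, because `B` would not exist in scope yet.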
In general, this is not that difficult at first glance. Of course, there were annoying moments with handling operations for different types and headaches with virtual and abstract stuff, but that was manageable. The real nightmare happened where I least expected it — generics again.
What are generic types? Just something like List&lt;T&gt;: you write a concrete type instead of T, and a copy of the generic class is created with your concrete type substituted in. On paper everything is easy, but in the end I spent more than four months on this and I still do not like the current result. I had to rewrite the type inference pipeline several times simply because the processing order did not suit generics. You might ask, “What exactly is so hard about that?” Here is an example:
public class Test<T>
{
    public static Test<E> GetTest<E>();
}
In the first versions of the generic handler, my compiler would simply loop forever on code like this. What happened was: the compiler tries to process the class Test&lt;T&gt; and finishes it; then it tries to process the function GetTest, sees that it needs to process the class Test&lt;E&gt;, and processes it; then it tries to process GetTest inside Test&lt;E&gt;, and so on. It keeps creating Test&lt;E&gt; instances until the compiler crashes with a stack overflow.
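One common way to break this cycle (I am not claiming it is what the article’s compiler ultimately does) is to memoize instantiations by (generic definition, type arguments) and register the new instance in the cache *before* processing its members, so the recursive lookup finds the in-progress instance instead of spawning a fresh one. A minimal Python sketch with made-up names:

```python
# Sketch: memoizing generic instantiations so that instantiating
# Test<E> while already building Test<E> returns the in-progress
# instance instead of recursing forever.

class GenericDef:
    def __init__(self, name, members):
        self.name = name
        self.members = members   # member name -> type argument used in
                                 # the member's return type, e.g. "E"

cache = {}   # (definition name, type arg) -> instantiated type

def instantiate(defn, arg):
    key = (defn.name, arg)
    if key in cache:
        return cache[key]        # may still be under construction
    inst = {"name": f"{defn.name}<{arg}>", "members": {}}
    cache[key] = inst            # register BEFORE processing members
    for mname, ret_arg in defn.members.items():
        # GetTest<E> returns Test<E>: this call hits the cache
        # instead of instantiating Test<E> over and over.
        inst["members"][mname] = instantiate(defn, ret_arg)
    return inst

test_def = GenericDef("Test", {"GetTest": "E"})
t = instantiate(test_def, "T")
assert len(cache) == 2                        # only Test<T> and Test<E>
assert t["members"]["GetTest"] is cache[("Test", "E")]
```

Without the `cache[key] = inst` line placed before the member loop, the `GetTest` lookup inside `Test<E>` would recurse without bound, which is exactly the stack overflow described above.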
In fact, this is only one of the problems with generics; there were many, honestly. Another one that comes to mind is that the compiler has a special limitation: you cannot create static fields and properties using a generic parameter of the class. In short, this is not allowed:
public class Test<T>
{
    public static T CoolField;
}
Why not? We will discuss that in the code generation chapter.
Type preparation
Types are inferred, some expressions are evaluated, everything seems fine. Can we move on to code generation? In general, yes, but it would be a little inconvenient to use the AST tree in its current form, so I decided to add one more stage to prepare the tree for convenient code generation.
I will just list what happens in this stage:
- Handling lambda expressions and nested non-static functions. I cannot point to an article or other document about how smart people do this, because for context capture and closures I consulted an artificial brain. But here is a small example of how it works in my compiler:
public class Test
{
    public Action ReturnFunction(int a, int b)
    {
        var action = () =>
        {
            Console.WriteLine($"Result: {a + b}");
        };
        return action;
    }

    public void CallFunction()
    {
        var a = ReturnFunction(1, 2);
        a();
    }
}
This code turns into something like this:
public class __SyntheticClass0
{
    public int a;
    public int b;

    public void Lambda0()
    {
        Console.WriteLine($"Result: {a + b}");
    }
}

public class Test
{
    public Action ReturnFunction(int a, int b)
    {
        __SyntheticClass0 tmpVar = new __SyntheticClass0();
        tmpVar.a = a;
        tmpVar.b = b;
        return tmpVar.Lambda0;
    }

    public void CallFunction()
    {
        var a = ReturnFunction(1, 2);
        a();
    }
}
You have to do such tricky things to preserve context for lambdas and nested functions.
- Creating get/set functions and fields for properties;
- Replacing property access with get/set function calls;
- Scanning used code so that unused code is not generated (for libraries everything is generated);
- Some tricks with calling static constructors;
- The most important stage — everything related to classes is transformed to work with pointers. I did not want to do this at code generation time, because handling it there would be a headache. And I also did not want to deal with pointers during type inference. This exact point is what pushed me to create an additional stage before code generation;
- And finally, adding virtual functions (also functions inherited from a base class) into the class itself.
After all these manipulations, we can peacefully move on to code generation.
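The property lowering from the list above can be sketched as a small tree rewrite. This is a Python model with made-up node shapes (`prop_get`, `prop_set`, `call`), not the compiler’s actual AST:

```python
# Sketch: a property becomes a backing field plus get/set functions,
# and every property access in the tree becomes a function call.

def lower_property(cls, prop):
    # Synthesize a backing field and get/set functions for `prop`.
    backing = f"__{prop}_field"
    cls["fields"].append(backing)
    cls["functions"][f"get_{prop}"] = ("return_field", backing)
    cls["functions"][f"set_{prop}"] = ("assign_field", backing)

def rewrite_access(node):
    # ("prop_get", obj, "X")      -> ("call", obj, "get_X")
    # ("prop_set", obj, "X", val) -> ("call", obj, "set_X", val)
    if isinstance(node, tuple) and node[0] == "prop_get":
        return ("call", rewrite_access(node[1]), f"get_{node[2]}")
    if isinstance(node, tuple) and node[0] == "prop_set":
        return ("call", rewrite_access(node[1]), f"set_{node[2]}",
                rewrite_access(node[3]))
    if isinstance(node, tuple):
        return tuple(rewrite_access(n) for n in node)
    return node

cls = {"fields": [], "functions": {}}
lower_property(cls, "Count")

tree = ("prop_set", ("var", "list"), "Count",
        ("prop_get", ("var", "other"), "Count"))
assert rewrite_access(tree) == ("call", ("var", "list"), "set_Count",
                                ("call", ("var", "other"), "get_Count"))
```

After this pass, code generation never has to know properties exist; it only sees fields and function calls.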
Code generation
Before I even started working on the compiler, I had heard about LLVM somewhere, and it seemed very convenient to me — just generate one intermediate representation and it will work “everywhere”. Obviously, there are nuances in that “everywhere”: things like setjmp/longjmp differ across platforms, and the same goes for va_list. In short, generating LLVM IR is cool, but not without problems related to platform-specific code.
Once we have the prepared AST tree, we can start generating IR. In fact, this stage was one of the easiest, because all the unpleasant stuff had mostly already been solved in the previous stages.
Ah, no, I remembered something. Let us talk a little about the generic restriction that says you cannot use a generic type parameter in static fields and properties. Why is that? Let us imagine that we have three libraries and one application, with a dependency graph like this:
Library A defines a type SomeType<T>, and then libraries B and C use that type with the parameter int. Then both of these libraries are imported into our application. During code generation, both in library B and in library C, their own SomeType<int> types are created, and each library has its own version. And obviously, if the static field of SomeType<int> changes in library B, that change will not be reflected in the same type in library C. And in general, in the final application project, how do you understand which type to use — the one from B or the one from C?
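The effect can be simulated in a few lines. This is only an illustration of the symptom, not of real object files: each “library” monomorphizes its own copy of SomeType&lt;int&gt;, so each copy carries its own static storage.

```python
# Sketch: two libraries each generate their own SomeType<int>,
# so each gets its own copy of the static field.

def codegen_library(name):
    # Each library independently instantiates SomeType<int> and emits
    # the static field into its own output, as described above.
    return {"instantiation": f"SomeType<int> (from {name})",
            "static_counter": 0}

lib_b = codegen_library("B")
lib_c = codegen_library("C")

lib_b["static_counter"] += 1          # library B mutates "the" static field...
assert lib_c["static_counter"] == 0   # ...but C's copy never sees the change
```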
You might say: “Well, the problem is just in your code generation; surely this can be solved somehow.” I agree, it probably can be solved somehow. I even found something about it (somewhere in my saved links there was an article about COMDAT and how to play with the linker and other smart stuff to solve my problem, but I unfortunately lost it somewhere), but in the end I got lazy and stopped working on it. Studying how linkers work is not something I want to dive into right now, although I should.
There was another problem with the linker, I remember. It had something to do with underscores ('_'). Maybe they had to be added to the names of global variables or something else when generating for x86 Windows. But that problem was solved by replacing the MSVC linker with LLVM’s linker.
In short, there were various difficult moments, like vtables, but overall code generation was still one of the simplest parts of the compiler; you just needed to know the LLVM IR syntax.
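The vtables mentioned above boil down to a simple idea, sketched here in Python (the layout and names are illustrative, not the compiler’s): each class gets a table of function pointers, a derived class starts from a copy of its base’s table and overwrites the slots it overrides, and a virtual call site looks the function up through the object’s table.

```python
# Sketch of vtable-based dispatch.

def make_class(base_vtable, overrides):
    vtable = dict(base_vtable)  # inherit base slots (cf. the "type
    vtable.update(overrides)    # preparation" step that pulls inherited
    return vtable               # virtual functions into the class)

animal = make_class({}, {"speak": lambda self: "..."})
dog = make_class(animal, {"speak": lambda self: "woof"})

def call_virtual(obj, name):
    # The call site only knows the slot name; the object's own
    # vtable decides which function body actually runs.
    return obj["vtable"][name](obj)

a = {"vtable": animal}
d = {"vtable": dog}
assert call_virtual(a, "speak") == "..."
assert call_virtual(d, "speak") == "woof"
```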
Other things
There are a couple of things I would like to mention as well, but they are not related to a specific compilation stage.
Garbage collector
I always knew this thing would not be simple and that I would spend a lot of time on it. And it would have been, had I not just said “screw it” and made the collector run in the same thread. Threads are not implemented in my compiler (and still are not), so there is no background thread to collect garbage on anyway, and I had to think about how to collect it without such tricks. In the end, I got a funny garbage collector that has to be called manually.
The collector is conservative: it simply walks up (or down, I do not remember which way the stack goes) the stack to the main root and looks at everything; it also checks global variables (static fields of classes and structs). And yes, the problem with such collectors is that they may leave objects untouched even though they should be cleaned up, because of false positives (for example, an object address may be the same as just a number lying somewhere on the stack). Other than that, I am satisfied with this collector.
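The conservative mark phase described above can be modeled in a few lines of Python. This is only a simulation of the idea (there is no real stack walking here): every word on the stack is treated as a potential pointer, so a plain integer that happens to equal an object’s address keeps that object alive.

```python
# Sketch of a conservative mark phase, false positives included.

heap = {0x1000: "objA", 0x2000: "objB", 0x3000: "objC"}

# Simulated stack words: one real reference (0x1000), one plain
# integer that just happens to equal objB's address (the false
# positive mentioned above), and ordinary data.
stack_words = [0x1000, 0x2000, 42]

# Mark: anything that looks like a live object's address is a root.
marked = {addr for addr in stack_words if addr in heap}

# Sweep: keep only marked objects.
heap = {a: o for a, o in heap.items() if a in marked}

assert heap == {0x1000: "objA", 0x2000: "objB"}
# objC was collected; objB survived only because a number on the
# stack looked like a pointer to it.
```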
Standard library details
I made the array type generic. I honestly thought it was generic in C# too, but it turns out it is not. Well, I hope that was not a reckless decision; so far I have not noticed any major problems with it. After that, I also redesigned the generic interfaces ICollection, IEnumerable, and others so that they mirror the non-generic ones.
Compiler update
I added a command to update the compiler, because I want users to always be on the latest version of the compiler. That probably will not really be the case, but I just wanted it. I do not plan any major breaking changes. The only goal is to resemble C#.
Reflection
Reflection is still very weakly implemented. You can get the type of an object, and from there a couple of simple properties. I think this is not as important as other things like multithreading, so I left it for later.
Language Server Protocol
To make working with the language a little nicer for users (and for me), I had to add LSP support to the compiler. That was a terrible experience. Properly, as smart people do, you would make the compiler incremental, so that it looks only at small changes in the code and recompiles just those. That also seemed too difficult to me (I tried reading about tree-sitter and did not understand a thing), so I just recompile the entire project :) And now there is a VS Code extension, so working with the language became a little more pleasant.
I highlighted the main changes in the documentation, but in general the language turned out to be very similar to C#.
Conclusion
At first, I did not even know what kind of language I wanted. I ended up with a C#-like language that compiles natively :) I consider this a good experience, although obviously many things I did were “wrong”. I will continue developing the compiler; I do not plan to abandon it for now. There is still a lot to add: multithreading, support for fancy pattern matching, and the standard library could be much broader too.
Thank you for reading to the end. I would be happy to hear your opinion about this project and take your feedback into account.
Project link: Compiler for hapet programming language
Project website: hapet

