Before you read, if you want to go through the compiler yourself, you can find it here.
Introduction
Compilers have always felt something like magic to me. They seem so complex yet so simple: they’re just programs that convert code from one form to another. I've spent countless hours prodding the assembly generated by clang, just looking at how C gets converted to assembly. Naturally, I got curious. I wanted to know how to write one.
So how did I do it?
Obviously, the first thing I did was google "how to make my own programming language", which led me to this website. That gave me a rough idea of how to write a compiler, but it was kind of disappointing. The author wrote a transpiler that converted his code to C++. I did not like that.
So I dug around more. I combed through websites, read articles, and looked at GitHub repos (mind you, ChatGPT had only come out a few months earlier). Eventually, this is what I had gathered:
- Every compiler needs at least the following:
  - A tokenizer, which reads the source file and generates "tokens". These are the smallest units of code that carry meaning (lexemes).
  - An Abstract Syntax Tree (AST), a tree-like representation of the program's structure.
  - A parser, which converts the tokens into the AST.
- The best way to get your final executable is to transpile (sigh, this again) to LLVM's IR, and then call LLVM to take over and generate the binary.
I did not like this even one bit. Why can't I generate assembly on my own? Of course this question was answered too.
"Assembly is platform dependent. You would limit your language to one specific platform".
And I get it. But I just couldn't shake the feeling that I was cheating. Not because "LLVM bad" or something, but because it abstracted away what felt like the most fun part of the challenge. Not to be too dramatic, but I'm a no-external-dependencies kinda guy. Because then what's the fun? What did I even make?
So anyways, I just got started. I was already familiar with assembly, so I figured I'd just write the last part on my own.
The part where it got messy
Pretty quickly, I learned that everything from the tokenizer to the AST needs to have some kind of structure, or things fall apart fast.
And fall apart they did.
I kept ending up with these awkward ASTs, loosely borrowed from this article. They worked fine for tiny examples, but the moment I tried generating assembly from them, everything became painful. As the language grew, it was obvious this approach just wasn’t going to cut it.
Part of the problem was my language choice. I was using C. Now, C is great. I love C. I've written stuff in C. But in a world of dynamic arrays and garbage collection, C is a choice. And I stupidly chose it. After rewriting the AST more times than I want to admit, and making a fucking struct for every goddamn ASTNode array, I finally gave up and moved to C++.
And honestly? It was a breath of fresh air.
With that out of the way, I actually researched parsers and how to write a real one, not for some calculator. I ended up using a recursive descent parser.
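To make the tokenizer, parser, and AST pieces concrete, here is a minimal sketch of a recursive descent parser for arithmetic expressions. It is not Cappuccino's actual code: the names `Token`, `Node`, `tokenize`, `Parser`, and `show` are all made up for illustration, and the grammar is deliberately tiny. The idea it shows is that each grammar rule becomes one function, and operator precedence falls out of which function calls which.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for this sketch. kind is 'n' for a number, the operator
// or parenthesis character itself, or '\0' for end of input.
struct Token { char kind; long value; };

struct Node {
    char op;      // '+', '-', '*', '/', or 'n' for a number literal
    long value;   // only meaningful when op == 'n'
    std::unique_ptr<Node> lhs, rhs;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr leaf(long v) { return NodePtr(new Node{'n', v, nullptr, nullptr}); }
NodePtr binary(char op, NodePtr l, NodePtr r) {
    return NodePtr(new Node{op, 0, std::move(l), std::move(r)});
}

// Tokenizer: turn source text into a flat list of tokens.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isdigit(c)) {
            long v = 0;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                v = v * 10 + (src[i++] - '0');
            out.push_back(Token{'n', v});
        } else if (std::string("+-*/()").find(src[i]) != std::string::npos) {
            out.push_back(Token{src[i++], 0});
        } else {
            throw std::runtime_error("unexpected character");
        }
    }
    out.push_back(Token{'\0', 0});  // end-of-input marker
    return out;
}

// Parser: one function per grammar rule.
class Parser {
public:
    explicit Parser(std::vector<Token> toks) : toks_(std::move(toks)) {}

    NodePtr parse() {
        NodePtr tree = expr();
        if (peek().kind != '\0') throw std::runtime_error("trailing input");
        return tree;
    }

private:
    // expr := term (('+' | '-') term)*
    NodePtr expr() {
        NodePtr lhs = term();
        while (peek().kind == '+' || peek().kind == '-') {
            char op = next().kind;
            NodePtr rhs = term();
            lhs = binary(op, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    // term := factor (('*' | '/') factor)*
    NodePtr term() {
        NodePtr lhs = factor();
        while (peek().kind == '*' || peek().kind == '/') {
            char op = next().kind;
            NodePtr rhs = factor();
            lhs = binary(op, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    // factor := NUMBER | '(' expr ')'
    NodePtr factor() {
        Token t = next();
        if (t.kind == 'n') return leaf(t.value);
        if (t.kind == '(') {
            NodePtr inner = expr();
            if (next().kind != ')') throw std::runtime_error("expected ')'");
            return inner;
        }
        throw std::runtime_error("expected number or '('");
    }

    const Token& peek() const { assert(pos_ < toks_.size()); return toks_[pos_]; }
    // Returns the current token; never advances past the end marker.
    Token next() { Token t = peek(); if (t.kind != '\0') ++pos_; return t; }

    std::vector<Token> toks_;
    std::size_t pos_ = 0;
};

// Render the AST as an S-expression so its shape is easy to inspect.
std::string show(const Node& n) {
    if (n.op == 'n') return std::to_string(n.value);
    return std::string("(") + n.op + " " + show(*n.lhs) + " " + show(*n.rhs) + ")";
}
```

Calling `show(*Parser(tokenize("1 + 2 * 3")).parse())` gives `(+ 1 (* 2 3))`: the multiplication ends up deeper in the tree because `expr` calls `term`, not the other way around.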
Actually Generating the Assembly
Actually generating assembly is pretty easy, at least conceptually. I used the stack machine method: every operation pops its operands from the stack and pushes its result back onto the stack. Worked really well.
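As a rough illustration of the stack machine method, here is a small sketch that walks an expression tree and emits assembly text. It is a sketch under assumptions, not the code Cappuccino uses: the `Expr` type and the `num`, `bin`, and `emit` helpers are hypothetical, the target is assumed to be x86-64 with Intel-style (NASM-like) syntax, and literals are assumed to fit in a signed 32-bit immediate.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical expression node for this sketch.
struct Expr {
    char op;      // '+', '-', '*', or 'n' for a number literal
    long value;   // only meaningful when op == 'n'
    std::unique_ptr<Expr> lhs, rhs;
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr num(long v) { return ExprPtr(new Expr{'n', v, nullptr, nullptr}); }
ExprPtr bin(char op, ExprPtr l, ExprPtr r) {
    return ExprPtr(new Expr{op, 0, std::move(l), std::move(r)});
}

// Append assembly for `e` to `out`. After the emitted code runs, the value
// of `e` is the single new item on top of the machine stack.
void emit(const Expr& e, std::string& out) {
    if (e.op == 'n') {
        // Assumes the literal fits in a signed 32-bit immediate.
        out += "    push " + std::to_string(e.value) + "\n";
        return;
    }
    assert(e.lhs && e.rhs);
    emit(*e.lhs, out);         // left value is pushed first...
    emit(*e.rhs, out);         // ...so the right value sits on top
    out += "    pop rbx\n";    // right operand comes off first
    out += "    pop rax\n";    // then the left operand
    switch (e.op) {
        case '+': out += "    add rax, rbx\n";  break;
        case '-': out += "    sub rax, rbx\n";  break;  // rax = left - right
        case '*': out += "    imul rax, rbx\n"; break;
        default:  throw std::invalid_argument("unsupported operator");
    }
    out += "    push rax\n";   // result goes back on the stack
}
```

The pop order is the easy thing to get wrong: swapping `rbx` and `rax` still works for `+` and `*`, but silently flips the operands of `-`.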
But boy, oh boy, is it easy to introduce the tiniest bugs when generating assembly. You’ll go through your entire development cycle with everything working perfectly, and then your friend doesn’t order variables the same way you did, and everything breaks and your computer catches on fire.
Now, full disclosure, I had sort of given up on it. After my implementation in C kept failing, I just stopped working on it. Another dead project.
But every time I opened my GitHub, it would sit there collecting dust, and I felt kind of bad... Then I picked it up recently, switched to C++ and actually finished everything. Catching these bugs with Gemini was so much easier. Now, I get it, AI code is slop or whatever, but it can be a really useful tool if you treat it as that. A tool.
And this time, I actually finished it. I added a small standard library and finally gave the project a name, one I had decided on a long time ago. Cappuccino.
What did I learn?
I don't know anything about anything and I should shut up and google how to do something before breaking my skull over a bad design decision for 2 years.
Now why did I name it Cappuccino? I don't know actually, I guess I just really liked the idea behind how Java was named and I like coffee.
The project is available here.
Thank you so much for reading my first blog post ever. I hope you liked reading my journey on writing a compiler.
P.S. If you found this interesting please star this repo.