Discussion on: What is self-hosting, and is there value in it?

Fred Ross (madhadron)

Primarily, self-hosting was a way to gain platform independence. If your language was self-hosted and you needed to target a new platform, you added support for that platform to the compiler, then cross-compiled the compiler for the new platform using the support you had just added. And voilà, you have ported it. For a systems programming language this is essential: otherwise, how do you get the language onto a platform that has just come into being?
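
Here is a rough, purely illustrative model of that bootstrap in OCaml (the stages and names are mine, not any particular toolchain's): an existing compiler builds a compiler that knows about the new backend, and that compiler then cross-compiles itself for the new platform.

```ocaml
(* Toy model of the bootstrap; nothing here corresponds to a real toolchain. *)
type platform = Host | NewPlatform

(* A compiler binary runs on one platform and can emit code for some set
   of platforms. *)
type compiler = { runs_on : platform; can_target : platform list }

(* Building the compiler's own sources with an existing compiler yields a
   new compiler: it runs on whatever platform we asked the old compiler to
   emit code for, and it supports whatever backends are in the sources. *)
let build ~with_:(old_compiler : compiler) ~emit_for ~backends_in_source =
  assert (List.mem emit_for old_compiler.can_target);
  { runs_on = emit_for; can_target = backends_in_source }

let () =
  let old_compiler = { runs_on = Host; can_target = [ Host ] } in
  (* Stage 1: add a backend for the new platform to the sources and build
     on the host. The result still runs on the host, but can now emit
     code for the new platform. *)
  let stage1 =
    build ~with_:old_compiler ~emit_for:Host
      ~backends_in_source:[ Host; NewPlatform ]
  in
  (* Stage 2: use stage 1 to cross-compile the same sources for the new
     platform itself: a native compiler for hardware that never had one. *)
  let stage2 =
    build ~with_:stage1 ~emit_for:NewPlatform
      ~backends_in_source:[ Host; NewPlatform ]
  in
  assert (stage2.runs_on = NewPlatform)
```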

Plus, for much of the history of computing, the compiler was likely to be the most sophisticated program written in a language, so self-hosting gave the language an immediate, demanding testbed. It also acted as a brake on language complexity: if your language is self-hosted, every feature you add has to be balanced against the complexity it adds to the compiler. This biases the language towards being good for writing compilers rather than some other class of program.

If you are working on languages that aren't for systems programming, it matters less. MATLAB, for example, started as an interpreter to access BLAS libraries, so it assumed the existence of a FORTRAN compiler and toolchain on the system (or you wouldn't have BLAS in the first place). If you have a FORTRAN compiler and build on that language and toolchain, then cross-compiling to a bare platform is irrelevant.

edA‑qa mort‑ora‑y (mortoray)

I wonder if it's different now that we have platforms like LLVM. By using the LLVM backend, it's easy to target a wide range of platforms.

I wonder how much of a language has to be written in itself to be considered self-hosting. I don't think one would expect LLVM to be replaced, nor the regex library, nor any complex libraries that might be used (I used libgmp in Leaf).

I never thought about how it biases a language towards compilers. Maybe that's why it feels so natural to write compilers in C++. :)

Fred Ross (madhadron)

> I wonder if it's different now that we have platforms like LLVM. By using the LLVM backend, it's easy to target a wide range of platforms.

Imagine that aliens land and start selling their bare-metal microcontrollers at a price we can't refuse. If my compiler emits machine instructions directly, I can add a backend for the new microcontroller and then produce a compiler for it. That's where that benefit of self-hosting comes in.

That kind of scenario is simply rare today. Our processor families are pretty entrenched. And LLVM, like FORTRAN for MATLAB, is an assumed environment. If you start assuming LLVM, there's no reason to be self-hosting. That being said, self-hosting languages can quickly develop LLVM backends, and many have, by treating it as just another machine to port to.

> I never thought about how it biases a language towards compilers. Maybe that's why it feels so natural to write compilers in C++. :)

If you think it feels natural in C++, you should try the ML family (Standard ML, Haskell, OCaml). Those languages are deeply optimized for that kind of data manipulation.
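
A small, generic taste of what that looks like (a toy constant-folding pass, not tied to any compiler mentioned here): the AST is one algebraic data type, and a pass over it is one recursive pattern match.

```ocaml
(* A tiny expression AST and a constant-folding pass over it. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Mul of expr * expr

(* Recurse into children, then collapse any node whose children are
   already literals. *)
let rec fold (e : expr) : expr =
  match e with
  | Num _ -> e
  | Add (a, b) -> (
      match (fold a, fold b) with
      | Num x, Num y -> Num (x + y)
      | a', b' -> Add (a', b'))
  | Mul (a, b) -> (
      match (fold a, fold b) with
      | Num x, Num y -> Num (x * y)
      | a', b' -> Mul (a', b'))

let () =
  (* (2 + 3) * 4 folds to 20 *)
  assert (fold (Mul (Add (Num 2, Num 3), Num 4)) = Num 20)
```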

edA‑qa mort‑ora‑y (mortoray)

Why would emitting machine instructions be better than emitting LLVM IR instructions? Is there some reason to believe that a shared IR would be harder to migrate to a new platform than an exclusive one?

Note: in Leaf I had my own Leaf IR, which was already quite low-level. Unlike LLVM IR, Leaf IR still has a tree-scoped structure.
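
To make the distinction concrete, here is a hypothetical sketch (these are invented types, not Leaf's or LLVM's actual definitions): a tree-scoped IR nests scopes structurally, while a flat, LLVM-style IR is a linear instruction sequence in which scope and control flow are encoded with labels and branches.

```ocaml
(* Tree-scoped IR: scope is part of the structure itself. *)
type tree_ir =
  | Let of string * tree_ir * tree_ir  (* bind a name, visible in the body *)
  | Block of tree_ir list              (* nested scope *)
  | Use of string
  | Const of int

(* Flat, LLVM-IR-like: one linear sequence per function. *)
type flat_instr =
  | Label of string
  | Assign of string * int
  | Branch of string
  | Ret of string

type flat_ir = flat_instr list

(* The same trivial program in both shapes. *)
let as_tree = Let ("x", Const 1, Block [ Use "x" ])
let as_flat : flat_ir = [ Label "entry"; Assign ("x", 1); Ret "x" ]
```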

Fred Ross (madhadron)

If you are targeting a new architecture, you probably don't have a way to translate LLVM IR into machine instructions for it, so you'll have to write that translation yourself. Again, if the language assumes you always have a mature environment where LLVM has already been ported, it's irrelevant.

edA‑qa mort‑ora‑y (mortoray)

I think this is a good point you make. LLVM is good for targeting a family of related systems -- basically Linux, Windows, macOS. A truly new architecture will either be folded into that family, in which case LLVM will apply, or LLVM won't really help that much.

Though it does target some unusual architectures. I think the main reason to keep it is the shared manpower behind its optimizations. But in Leaf, my IR was low-level enough that it wouldn't take too much effort to lower it to a target's machine code (albeit inefficient machine code compared to what LLVM produces).
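
As a hedged sketch of what "lowering a low-level IR by hand" can look like (the IR and the pseudo-assembly here are invented for illustration, not Leaf's actual definitions): a naive, one-instruction-at-a-time translation is short to write; what you give up versus LLVM is the optimization work behind it.

```ocaml
(* A made-up three-address IR and a naive lowering to assembly-like text. *)
type ir =
  | LoadConst of string * int        (* dst, value *)
  | Add of string * string * string  (* dst, lhs, rhs *)
  | Return of string

let lower (instr : ir) : string =
  match instr with
  | LoadConst (dst, n) -> Printf.sprintf "  mov %s, #%d" dst n
  | Add (dst, a, b) -> Printf.sprintf "  add %s, %s, %s" dst a b
  | Return r -> Printf.sprintf "  mov r0, %s\n  ret" r

let () =
  (* Lower a tiny program: r3 = 2 + 3, then return it. *)
  [ LoadConst ("r1", 2); LoadConst ("r2", 3); Add ("r3", "r1", "r2"); Return "r3" ]
  |> List.map lower
  |> List.iter print_endline
```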