Desi

Posted on May 8, 2019

ELI5: How does someone write a new computer language?

#explainlikeimfive #programming

Really showing my non-technical side here, but I don't understand how someone could just come up with a new language and make computers understand it. ELI5!

Top comments (13)

edA‑qa mort‑ora‑y • May 8 '19

There's no way to truly ELI5 this without obscuring details, so I'll opt for simplifying, yet trying to stay correct.

Vocabulary

A language starts with a human level description of what the language can do. You decide on a vocabulary, and how that'll look inside a text file. That is, without considering the hardware/software for a moment, you create an agreement on what the words in your language will accomplish.

Once you've done that you'll write a compiler for the language. This is something that is going to let the computer understand your vocabulary.

The compiler consists of several parts.

Parsing

That text file may be readable by you, but the computer won't be able to make any sense of. Parsing translates that text into a form that the computer understands. This is the abstract syntax tree (AST).

At this point the compiler has read your code, but does not yet understand it.

Semantics / Typing

Once the compiler has an AST, it needs to make sense of this. In this stage it goes through and determines what the code does. It figures out what the function names are, and how to call them. It determines what are variables, and how to store data in them. It figures out the conditions for if statements and loops.

At this point the compiler understands your code.

Translation / Lowering

The compiler will now translate the code into a new form, a form that the computer understands. This is often called "lowering" as it takes your high-level language and brings it closer to the machine language the computer understands.

Compilers do this in a variety of stages, often lowering to one or more intermediate languages (IRs), and then finally to the machine code of the machine.

A simple compiler may choose to "lower" to another high-level language, such as C++. This provides a quick way to implement some new languages.

Linking

Your code doesn't live in isolation, it needs to interact with the target machine. In addition to translating the code, the compiler will also setup tables of names and resources. These are how your code will attach to other components on the system.

Host Language

But you may be wondering in what language you write the compiler? This can be any language. The job of the compiler is to translate from your high-level code, to a known low-level language. It doesn't matter what language the compiler itself is written in.

It's relatively easy to write a compiler for a simple language. Complex languages take more effort.

I encourage everybody to write a language once. I've written several of them, from basic domain specific languages, to full modern languages like C++.

This description focuses on a compiled language, and is best understood with imperative or functional languages. The details, and terminology, tends to change as we get into other paradigms, like declarative languages, or interpreted languages. Most of the stages still exist though.

Eugene Karataev • May 12 '19

Do all compilers transform source code to AST? Are there any languages which omit this step?
Let's take python and javascript for example. Their syntax is different, but abstractions (variables, functions, loops, conditions, e.t.c) are almost the same. Do they have similar ASTs? Can we take AST generated by python compiler and convert it to the source code in javascript?

edA‑qa mort‑ora‑y • May 12 '19

Yes. However, here is where we'd need to go into technical details to define what an AST truly is. There are some languages, potentially, where the parse directly results in a usable tree. There are also stream languages where no complete tree is ever created. But there's no avoiding the concept of the AST ever -- the compiler/VM/interpreter needs to understand the source.
The ASTs have similar features, but are different depending on the compiler and language. When I talked about "lowering", it is possible, and quite common, to lower from the AST to another language. Not all languages support enough of the source language to be viable. Others require essentially writing a complete VM in them to do the compilation.

Eugene Karataev • May 13 '19

Thanks!

Ben Lovy • May 8 '19 • Edited

A programming language interpreter is just a program itself that can understand strings of text and semantically tag them, and then evaluate that result.

I'll include a snippet of my latest post, which talks about a problem I had making my first programming language interpreter:

For a small concrete example, let's look at the input string + 2 (* 3 4) 5. To work with this input, we need to build a Abstract Syntax Tree structure like the following:

S-Expression(
    Symbol("+"),
    Number(2),
    S-Expression(
        Symbol("*"),
        Number(3),
        Number(4),
    ),
    Number(5),
)

The whole program is represented as an S-Expression. When our program sees one of these with multiple elements, it's going to try to execute it as a function call, looking up the function from the symbol in the first position. First, though, it's going to recursively evaluate all of its children - so if any of them are themselves s-expressions, we'll get them into values we can work with first. In this example, the inner S-Expression (* 3 4) will be evaluated first:

S-Expression(
    Symbol("*"),
    Number(3),
    Number(4),
)

This will be interpreted as a function call, and evaluates to:

S-Expression(
    Symbol("+"),
    Number(2),
    Number(12),
    Number(5),
)

Now we have an operation as the first element of the S-expression and some numbers with which to apply it. When this evaluates, we're left with just Number(19), which can be displayed to the user as the result of the computation. To get the Symbol("+") to mean "addition", the program will keep track of an environment which associates names like "+" with functions.

Not all interpreters use S-Expressions like this, but its one of the simplest ways to get started. Writing an interpreter means writing functions that can turn text into a tree like the above, and then evaluate that tree.

stereobooster • May 8 '19 • Edited

Here is Peter Norvig write minimalistic Lisp interpreter from scratch norvig.com/lispy.html. Basically what you need to do is to take program(string), tokenize it (turn text into words), convert tokens to abstract syntax tree (tree data structure, looks like a tree if you draw it on paper), walk over it starting from the root (visit each branch of the tree one by one) and execute each command in the branch.

Which of this needs more explanation? I'm curious because I want to write article on the subject.

Casey Brooks • May 10 '19 • Edited

In my (very limited) experience writing parsers/compilers, one of the more understandable forms of writing a compiler is the Parser Combinator. Reading most literature on these parsers is typically pretty heavy on the mathy- or haskelly-syntax, but its core concept is really simple.

Essentially, you can understand a parser simply as a function. Specifically one that takes an input String and returns a node, which is the part of the program that function could understand, and the remaining String to be parsed by a different function. I find Kotlin to be an exceptionally readable language for this so here's a way to represent that function as an interface:

interface Parser {
    fun parse(input: String): Pair<Node?, String> // if parsing succeeds, the Node will be non-null
}

// Node classes are simple, and either contain text or other nodes (but not both!)
class Node(val kind: String)
class LeafNode(kind: String, val content: String) : Node(kind)
class InnerNode(kind: String, val children: List<Node>) : Node(kind)

From this interface, we can write the first actual parsers, which just recognize a single character of a specific type:

class AnyCharParser : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val nextChar = input.first() // get the first character
        val remainingString = input.substring(1) // get the rest of the String
        return LeafNode("anyChar", nextChar) to remainingString // return a node with the data it found, and the remaining String to be parsed
    }
}

class DigitParser : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val nextChar = input.first() // get the first character
        if(nextChar.isDigit()) { // parsing only succeeds if the next character is a digit
            val remainingString = input.substring(1)
            return LeafNode("digit", nextChar) to remainingString
        }
        else { // otherwise we return a null node to signify parsing failure
            return null to input
        }
    }
}

At this point, we can successfully parse a language with only 1 character. Success is determined by some Parser returning a single root Node with an empty String as the remaining String.

val parser = DigitParser()
val input = "1"
val (rootNode, remainingString) = parser.parse(input)
// rootNode should be non-null, and remainingString is empty. Parsing was successful!

You can imagine more kinds of "Leaf Parsers" that recognize different types of characters (whitespace, letters) or even recognize tokens rather than individual characters (if, for, class, etc). But we usually parse code which contains more than 1 token[citation needed]. So let's add a higher-order parser to allow parsing longer String inputs:

class ManyParser(val childParser: Parser) : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val children = mutableListOf<Node>()
        var remainingString: String = input
        while(true) {
            val (currentNode, currentRemainingInput) = childParser.parse(remainingString)
            if(currentNode == null) break

            children.add(currentNode)
            remainingString = currentRemainingInput
        }

        return InnerNode("many", children)
    }
}

Notice how this ManyParser delegates parsing to another child parser. We can then combine it with one of our Leaf parsers to recognize longer inputs:

val parser = ManyParser(
    DigitParser()
)
parser.parse("1") // parsing success
parser.parse("123456789") // parsing success
parser.parse("12345a") // parsing failure!

val parser2 = ManyParser(
    LetterParser()
)
parser2.parse("1") // parsing failure!
parser2.parse("123456789") // parsing failure!
parser2.parse("a") // parsing success
parser2.parse("asdf") // parsing success
parser2.parse("qwerty") // parsing success
parser2.parse("qwerty5") // parsing failure!

Notice how we can now recognize different types of String simply by passing a different Parser to the ManyParser parser. You could then extend this idea with other types of parsers, such as sequences of specific parser rules (if token, followed by an boolean expression, followed by block of code), a choice of many different possible parsers (an assignment could parse a literal String OR a literal number), etc. The result of all of these operations is a tree of nodes nested inside each other, which are annotated with information describing the context in which it was parsed and the actual content that was found. We now have something we can traverse in code and actually do something with.

For further study, you might check out my library Kudzu, which is a more complete implementation of this exact idea and will help reinforce this idea and show how it can be used to slowly build up complex grammars from simple building blocks. For example, this test case shows how a JSON parser might be built and parsed data transformed to actual usable data.

bakenator • May 8 '19 • Edited

ELI5 version: you write a program in an existing language to read and compile the new language. Eventually using this method you can write a compiler program in the new language for the new language. Then you don't need the old lang anymore

To go further Google "tokenization" and "abstract syntax trees"

Desi • May 9 '19

This is super simple and exactly what I was hoping for! Haha thank you!

Ganesh Prasad • May 9 '19

A real programming language takes a lot of time to develop. First you start with a minimal version of the language and then you gradually grow it, write libraries around it and so on.

Check out the source of Ironscript, ironscript.github.io . It's a small programming language that I developed a couple of years ago and it's interpreter is written in ES6.

Franco Traversaro • May 8 '19

I'm following this really interesting book while is written and published for free: craftinginterpreters.com

Vincent Milum Jr • May 9 '19

I honestly think it is beneficial to actually learn how a CPU works to some degree. The ARM instruction set is fairly simple and straightforward, especially compared to x86. 16 registers, all numbered, and simple instructions to modify what are in those registers.

Programming languages we are used to build on top of this. They're almost like libraries in a sense, where things are abstracted away from the core. But learning the core makes the abstraction so much clearer.