Discussion on: ELI5: How does someone write a new computer language?

View post

In my (very limited) experience writing parsers/compilers, one of the more understandable forms of writing a compiler is the Parser Combinator. Reading most literature on these parsers is typically pretty heavy on the mathy- or haskelly-syntax, but its core concept is really simple.

Essentially, you can understand a parser simply as a function. Specifically one that takes an input String and returns a node, which is the part of the program that function could understand, and the remaining String to be parsed by a different function. I find Kotlin to be an exceptionally readable language for this so here's a way to represent that function as an interface:

interface Parser {
    fun parse(input: String): Pair<Node?, String> // if parsing succeeds, the Node will be non-null
}

// Node classes are simple, and either contain text or other nodes (but not both!)
class Node(val kind: String)
class LeafNode(kind: String, val content: String) : Node(kind)
class InnerNode(kind: String, val children: List<Node>) : Node(kind)

From this interface, we can write the first actual parsers, which just recognize a single character of a specific type:

class AnyCharParser : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val nextChar = input.first() // get the first character
        val remainingString = input.substring(1) // get the rest of the String
        return LeafNode("anyChar", nextChar) to remainingString // return a node with the data it found, and the remaining String to be parsed
    }
}

class DigitParser : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val nextChar = input.first() // get the first character
        if(nextChar.isDigit()) { // parsing only succeeds if the next character is a digit
            val remainingString = input.substring(1)
            return LeafNode("digit", nextChar) to remainingString
        }
        else { // otherwise we return a null node to signify parsing failure
            return null to input
        }
    }
}

At this point, we can successfully parse a language with only 1 character. Success is determined by some Parser returning a single root Node with an empty String as the remaining String.

val parser = DigitParser()
val input = "1"
val (rootNode, remainingString) = parser.parse(input)
// rootNode should be non-null, and remainingString is empty. Parsing was successful!

You can imagine more kinds of "Leaf Parsers" that recognize different types of characters (whitespace, letters) or even recognize tokens rather than individual characters (if, for, class, etc). But we usually parse code which contains more than 1 token[citation needed]. So let's add a higher-order parser to allow parsing longer String inputs:

class ManyParser(val childParser: Parser) : Parser {
    override fun parse(input: String): Pair<Node?, String> {
        val children = mutableListOf<Node>()
        var remainingString: String = input
        while(true) {
            val (currentNode, currentRemainingInput) = childParser.parse(remainingString)
            if(currentNode == null) break

            children.add(currentNode)
            remainingString = currentRemainingInput
        }

        return InnerNode("many", children)
    }
}

Notice how this ManyParser delegates parsing to another child parser. We can then combine it with one of our Leaf parsers to recognize longer inputs:

val parser = ManyParser(
    DigitParser()
)
parser.parse("1") // parsing success
parser.parse("123456789") // parsing success
parser.parse("12345a") // parsing failure!

val parser2 = ManyParser(
    LetterParser()
)
parser2.parse("1") // parsing failure!
parser2.parse("123456789") // parsing failure!
parser2.parse("a") // parsing success
parser2.parse("asdf") // parsing success
parser2.parse("qwerty") // parsing success
parser2.parse("qwerty5") // parsing failure!

Notice how we can now recognize different types of String simply by passing a different Parser to the ManyParser parser. You could then extend this idea with other types of parsers, such as sequences of specific parser rules (if token, followed by an boolean expression, followed by block of code), a choice of many different possible parsers (an assignment could parse a literal String OR a literal number), etc. The result of all of these operations is a tree of nodes nested inside each other, which are annotated with information describing the context in which it was parsed and the actual content that was found. We now have something we can traverse in code and actually do something with.

For further study, you might check out my library Kudzu, which is a more complete implementation of this exact idea and will help reinforce this idea and show how it can be used to slowly build up complex grammars from simple building blocks. For example, this test case shows how a JSON parser might be built and parsed data transformed to actual usable data.