DEV Community

Marianne
Marianne

Posted on • Updated on

From Parse Tree to Evaluator (featuring Sarah Withee)

Summary

A parser generator like ANTLR won't handle evaluating programs written in your new language. You need to write the code for that yourself. Marianne explains how to build an AST from a parse tree, create an evaluation loop, and how esolangs (esoteric programming languages) make for great learning tools.

Links

Bonus Content

Become a patron to support this project.

Book Club: Thinking in Systems

If you're interested in learning more about stock/flow system models or have just had Thinking in Systems by Donella H. Meadows on your reading list, I'm giving people a free one month pass to the MWaPL private community for patrons in order to read it with us. Request a free pass here

Transcript

MB: When I first started releasing episodes of this series a few people asked me to put them up on Youtube as well as traditional podcasting channels. This is going to be the episode where that comes in handy. Today I’m going to talk you through how you take the output of a parser generator and build an evaluator loop on top of it that can run programs. If you find yourself thinking “I really can’t visualize what she’s talking about”, I encourage you to look up this episode on Youtube because the Youtube version includes screen caps of code that might make it easier to understand.

MB: Funny enough in my first draft of this series this episode did not exist. That’s the challenge of learning things— once you understand them, they begin to feel obvious and trivial and you forget that they have to be explained to other people.

MB: Right. So...

MB: If you decided to write your own parser then the question of how to connect the parser output to an evaluator loop is simple, but if you’re like me and you wanted to leverage the mature tools that exist to do some of this stuff for you it is not immediately obvious what you should do after you’ve fine tuned your grammar and generated your parser. A parser is NOT a programming language.

MB: I’ve been using ANTLR to generate my parser and lexer so some comments will be specific to that, but the general concepts will be applicable to other tools so you should be able to follow along if you’ve chosen something else.

MB: In the language I’m building I don’t want global functions. Therefore when we think of a grammar as breaking a file into smaller and smaller chunks of valid code there are only three types of code that can be in a program: declarations, assertions and for statements. The for statements I’m referring to in this programming language are a bit different than you’re expecting. Because stock-flow models are themselves designed to run in a bounded loop. The for statements in my language are not for loops, the are configuration settings for the model’s internal loop.

MB: But I digress. Assertions aren’t really important at this stage of the language development so that leaves just declarations. The first level of our grammar is the spec file, the second level is a set of declarations, the third level are all the possible declarations: constants, stocks, and flows.

MB: Stocks and flows have proprieties and methods, some of which are required default. By this point you should begin to see how this forms a tree: at the root we have a spec node, under that a declaration mode, under that the specific type of variable we’re declaring … maybe a stock node, with that node we have a series of property nodes. Under a property node maybe a function definition… under that……

MB: This is what the parser will create. It will use the grammar to define how to break source code up into a set of nodes arranged into a tree, then it will walk that tree for you.

MB: But walking the tree isn’t executing the program, that’s the gap we need to write code in order to fill in.

MB: In ANTLR part of our generated code will include something to walk the parse tree (either a listener or a visitor depending on how you configure ANTLR). A listener will walk the entire tree from top to bottom and then back up, executing your custom instructions as it goes. A visitor will only move forward if you tell it to advance to the next node, which can be useful if you don’t need to consider the full program. For example, part of the REPL I built to do user testing outputs a visualization of the model, but that visualization only cares about which flows are connected to which stocks and how, the rest of the code isn’t relevant. So that’s a situation where I used a visitor instead of a listener.

MB: Both visitors and listeners will generate a set of hooks for each node. So what you do is extend the default class that ANTLR has generated (be it visitor class or listener class) and overwrite those hooks with custom code that extracts the data from the node and assembles something executable out of it.

MB: In a prototype language what this means is that I have a hash map where I store defined variables. When we walk the parse tree and encounter a declaration node, we pull the identifier out of that node (often a child node of the declaration) and turn it into a key in our hash map, then we pull the assigned value in the declaration node (again often child node of the declaration node) and store it as the corresponding value.

MB: This is super easy when the values are static— strings, numbers, etc. But what do we store when we declare a function?

MB: Well… We store a tree.

MB: I suppose you could store a copy of all the relevant child nodes from the parse tree, but I actually found it easier to create a simple AST. That’s Abstract Syntax Tree if you’re unfamiliar. ANTLR’s parse tree has a lot of information in it, after all. And parse trees tend to contain nodes that accurately represent the syntax of the grammar, but are not important to the execution of the code.

MB: Building your own AST is not super difficult. Essentially every node needs to store some information about its type and then a value. The node value will either be a child node, a set of child nodes, or a static value like a string or a number.

MB: Since I wrote the prototype in Javascript I could get by with having one node class the type being stored as a string in a property. When I start building the real language in Go, each type of node will need to be defined as its own type in Go, because Go will enforce strict assignment of the node value property which means it can’t be a single child node or a slice of nodes or an int or a string…. we have to define exactly what it going to be up front, so we have to define a different node type for each node we want to use.

MB: I mentioned before that when we extend ANTLR’s generated visitor/listener class we get hooks for each node of the parse tree. We will have code that triggers when we enter a node (its name is enter followed by the node name as it is defined in the grammar, depending on what language ANTLR is compiling to it might be enter plus node name plus the word context, the best way to know for sure is to open up the auto generated visitor or listener file and check) and then there’s code that triggers when we’re exiting the node as the walk is moving back up the tree (that has the same naming convention as enter hooks just with the word exit instead). Most of the time you want to use the exit hooks. We’ll talk about why in a second.

MB: We build the AST by wrapping data from the parse tree node at the very end of a branch in the right AST node type and pushing it onto a stack (which in most languages will either be a list or an array). That’s why the exit hooks or so important, because we want to start from the bottom (where the data is) and pass AST nodes up. That way when we get to the part of the parse tree where we define our function and assign it a name, the AST is sitting on the top of the stack, fully formed, waiting for us.

MB: As we move up from the bottom node, whenever a higher node has valuable information, we need to sort of add to our tree, we pop off the top of the stack, wrap that value in another node capturing the new information and push the larger tree back onto the stack.

MB: So for example: let’s say I have the expression x + y. We have a node representing x, a node representing y, and we want to wrap them in a parent node called addition. Using this structure allows us to create complex algorithms with the same code we do for simple ones. A+b+c is just an addition node where one child node is c and one child node is another addition node with child nodes a and b.

MB: Building ASTs is useful because most functions will make use of values passed to them at runtime, so you can’t execute them while walking the parse tree because they may not have been defined yet.

MB: When we finish walking the parse tree what we have is a hash map populated by all the named variables and the values assigned to them and all the named functions and the trees that represent the steps they execute.

MB: Since stock flow models operate in loops, walking the tree gets everything defined and assembled then the process of executing the model loop pushes those new trees into the eval loop.

MB: The eval loop is a function wrapping a giant switch statement. The value being tested in the switch statement is the type of node being passed to the eval loop and each case typically calls the eval function recursively passing in the node’s child. It the node has multiple children the case block will call eval on them all one by one. Sometimes in a for loop or sometimes in a different construct depending on what the node actually does.

MB: So our x+y example from before would operate like this. The switch statement would evaluate to node type addition add return eval(leftNode) + eval(rightNode). When the eval loop executes on the leftNode, the leftNode type is a variable, the case block probably says something like look for a variable by that name in the hash map and return that value. It this way we work down the tree until the eval loop encounters a value it can actually return and then the recursion completes by passing those values by up.

MB: The best way to understand how this works is just by doing it a bunch of times. Every time I work on an implementation I get a little bit better at it.

MB: But if you’re thinking to yourself that building an interpreter or a compiler over and over again sounds like a lot of work— you’re right and I have a suggestion!

MB: To help with my suggestion this week I asked my friend Sarah to come on the show. Sarah is another Strange Looper who I also know through the Pittsburgh based hackerspace and community group Code & Supply. Hey Sarah what are you working on now?

SW: Currently working on a side project called Code Thesaurus, which is a polyglot developer reference tool, so the idea is if you know something in one language and you're wanting to know how to do the same thing in a different language, you can look it up in the tool and kind of compare them side by side without having to, like, read entire sets of documentation. You can just kind of see a quick glance at a chart and kind of know the difference between the two.

MB: So if you hear a little background noise that’s because Sarah’s completely adorable cat Theodosius keeps swatting at her microphone. Anyway, back to my suggestion: Esolangs. If you want practice tying everything together, work on a few esolangs first.

MB: Sarah, do you know what esolangs are?

SW: I believe they're esoteric languages

MB: Yes, they are. They're this fabulous world of, like, really weird, strange, completely compilable valid programming languages. Right?

SW: They are basically like, hey, I don't think we should ever do this, but let's do it anyway.

MB: Yeah.

MB: Like, wouldn't it be great if we lived in a world where we programmed only in white space.

SW: Something you would never really want to do, or only in symbols or only in colors?

MB: Yes, the actually the only in colors ones are the ones that I like the best. So is it P*iet or Pie*t? I never quite clear how it's suppose to be…

SW: I call it Piet, but yeah, I don't know…

MB: Okay. We’ll go with Piet. It’s sad because it's someone's actual name and so… I should have done the research.

SW: I think it’s an artist name.

MB: Yeah it is. But like I was looking at some of the programs that people write with the this programming language, which is essentially just about interpreting colors and like pixels. They're gorgeous! Like people have done like almost like meme style portraits in this language and they've done beautiful rainbow mosaics in this programming language. And like I wouldn't say that any of those programs are useful. Right? Like mostly I think what gets written in esoteric languages are like hello worlds of various types that either print hello world or like print the name of language is super popular or they do calculators like this is how you program a calculator in this particular language, what's not clear to me is like where the line between an esoteric language and like a domain specific language is like how useful do you have to be before you’re a domain specific language?

SW: I don't know if I know where that solid boundary is either, but I would say, you know, typically you want a programming language to be readable and readable and like easy to convey your thoughts in the form of code and esoteric languages are almost never any of that.

MB: That's true.

MB: Well, I think like we do like as programmers care about the look of our code, right? But esoteric languages are almost inevitably like this bizarro world where the the actual look of the source code is more important than what the source code does or how efficient it is it doing it. Right?

SW: Yeah. You know, they're usually proof of concepts like can we actually use LOLCat as a way to write a programming language?

MB: Oh….. I love LOLCat so much. Can you talk a little bit about like, what is LOLCat?

SW: So LOLCat is like the memes of the cats, of, like, I can haz cheeseburger nonsense and like the embodiment of what if your cat were to actually talk to you and has bad spelling and grammar. But it's like what if you implemented that as a programming language like O HAI. O Space H-A-I would be like the introduction to your program and you know, like I can haz variable name would be like to set up a new variable and I don't remember too much past that, but you know, it's like you can have these things in their typical statements and they're not super easy to parse. Kind of like if you took Shakespeare and decided to write it in LOLCat like it's doable. It's not readable. And you may have to sit there scratching your head for a little bit, but, you know, you can convey the same thoughts. And so, like, LOLCat is one of those. It's just silly, you know, like out of like completely out of the box, like Piet or Whitespace or some of the other ones.

MB: I really love how LOLcode ends blocks in KTHXBYE. I want to do that in normal programming languages. It's just way more satisfying to write KTHXBYE instead of like END, right?

SW: Yeah, sure. Well python has some of that really readable text in some ways and Cobol even kind of have some of that.

MB: Yeah. But it's interesting that you bring up Shakespeare because did you know that there's also an esoteric language called Shakespeare?

SW: I believe I've heard of it. I don't know that I've really looked into it that much.

MB: So this is what I find really interesting about esoteric languages is kind of weird, like bizarro world form of like prioritization in code.

Shakespeare, the programming language is a programming language where your commands are meant to sound like lines from a Shakespeare play and your entire source code it has to be architected in a way that mirrors like stage directions. But in order to accomplish this, they do a couple of weird things. So they essentially sort— it's like a very keyword heavy language. And they sort nouns and adjectives into categories of like positive numbers and negative numbers. And so every noun is like a positive number of one or a negative number of one. And then you use the adjectives to sort of build upon them. So like say, look, I like I will add an adjective to one value and that gives me a two value. And if I add three adjectives to one value, I now have four and so on and so forth.

MB: And so it's like ridiculously complicated. But one of the critiques I read about Shakespeare, which I thought was interesting, is that it's actually… that kind of control flow is actually very similar to Assembly, but— Shakespeare itself, like all the implementations of it, are essentially kind of more transpiler than they are interpreter and compiler, so the main implementation of Shakespeare will parse your Shakespeare and then write a C program based on like what it's found in your program. And there's one I think that doesn't it in Python. And then what you're running is really a C, C or Python program. And since Python ultimately compiles to C, I guess ultimately just ends up being a C program. But what's interesting to me about that is that if your actual structure of your esoteric language is kind of mirroring the pattern of Assembly, like why are you translating it to a high level language that gets—that ultimately gets compiled down to assembly? Right? Like we're kind of going around in a circle in order to, like, get this joke of like, hey, we've got a programming language that looks like Shakespeare.

SW: But I think you also underestimate, not only the love of nerds to do weird things, but the love of nerds to go. I bet we can't do this. Watch me.

MB: I do historically underestimate the love of nerds to do weird things. It's true.

SW: Because I believe also you're writing a programming language, which is also kind of this like, hey, nerd, can I actually do this?

MB: Yeah, I know that's totally fair. It is totally 80 percent. “Hey, can I do this thing?” nerd-ness and like 20 percent this thing might actually have a useful function in problems that I face and other problems that other people face. It's true.

SW: And you've heard of like Ook? The language Ook, right?

MB: No. I haven’t.

SW: It’s supposed to be based on the sounds of monkeys. Oh, like Ook would be O-O-K. So it's either ook period period would be like maybe the start of a program like ook period. Question mark would be how you know like create a variable and question mark period would be something like, you know, add together and there's a whole list of the different ooks and the combinations of them that perform different actions.

MB: Oh my God. How do you…

SW: It's supposedly Turning complete if I remember correctly.

MB: I've pulled this up on Esolangs.org, which is like literally is the ultimate guide for every bizarre programming language that exists. And it's just like looking at the Hello World Program. It's just like it's ook? Ook, ook, ook. OoK, ook. OoK, ook. OoK, ook.

MB” This just goes something like that forever and ever and ever and ever. Ook is a David Morgan-Mar—so danger mouse— creation and like this seems to be like, it seems that the David Morgan-Mar really does very little except like draw cartoons and then write esoteric programming languages. He, he is the creator of Piet that we talked about before. He's the creator Chef, which is similar to Shakespeare in the sense that you're trying to get it to look like something else, but you're trying to get it to look like recipes instead of Shakespeare. And like there are at least eight different esoteric languages attributed to him. One of them I’m pulling up now is called Zombie. It's a zombie oriented machine-being interface engine designed for necromancers and pure evil ones. Programs run in a multithreaded environment where they're every kind of entity zombie, vampire, ghost, demon, dijinn, and they all act in a slightly different way. A program is a list of entity declarations stating the action that each entity must perform.

MB: Oh, it's a declarative language! It's sort of like the trolly version of Prolog.

SW: (laughs) I like that description.

MB: I now kind of want to write like a model validator in zombie and see how that would work.

SW: And you’ve heard of like the emoji language, right?

MB: Yeah.

SW: And every commands are just emoji symbols.

MB: Yeah, that's true. Emoji language like one and some of the ones I was looking at today is there’s a language called Don't, which is described as it only has three commands. Command H prints Hello World. It's the hard part is done for you, basically. Uhh the question mark teleports the user's device to a landfill in Dubai … which seems useful depending on what computer use.

SW: I can dig that.

MB: And then the last is G[] which deletes don’t from existence. And then the explanation of don't is unfortunately someone ran the G command and one of their programs while testing an interpreter, causing the language to be removed from everyone except the creators of memory. It was rendered impossible to implement….. So there you go.

SW: This language doesn't sound Turning complete.

MB: No, I don't think so. I don't I don't think so. There's also Retina, which is based entirely on regex. So like no syntactic sugar, just pure regex in case you wanted to put all the bad parts of programming into its own language. There you go.

SW: I’m constantly amazed at I feel like every time I look at the site, they've added like a whole ton more languages to it. Like I'm always just baffled at the absurd lengths people go to. I mean, like some of them are kind of fun and interesting, you know, like some of them are made as jokes. Some of them are made to be funny, but some are made to like absolutely to screw with people's heads.

MB: But I think in general, esolangs end up being really useful learning tools, because their very design usually relies on an extremely limited grammar and extremely limited vocabulary set. So you don’t have to think about, like, how do I implement like all sorts of complex data structures in order to like design a parser, a lexer and ultimately an evaluation loop, you can kind of just start off with a really simple scope and kind of like work those muscles over and over and over and over again as you as you need.

Top comments (0)