DEV Community

Cover image for Abstract Syntax Trees: They're Actually Used Everywhere -- But What Are They?
aruna-x
aruna-x

Posted on • Updated on

Abstract Syntax Trees: They're Actually Used Everywhere -- But What Are They?

Isn't it wonderful how VS Code grays out obsolete lines of code? Oops, my return statement is on line 3. Line 4 won't run... But I haven't called the function yet. So how in the world does VS Code know which lines of code won't be used in the future, when the code finally does run?

Code snippet of a function in VS Code with line 4 grayed out

If we have a conditional statement, VS Code accurately evaluates the potential for us to hit the code outside of it:

Code snippet showing that VS Code knows that an if statement with an unknown conditional might not run

bool could turn out to be false after all. But if we change the condition to true VS Code knows we will always run that block and (if there is an inevitable return inside) never reach the final line:

Code snippet showing that VS Code knows that an if statement with a true conditional will always run

It's almost as if VS Code has the ability to understand the semantics of code. But under the hood VS Code uses code to do this! How?

Enter: Abstract Syntax Trees (ASTs)

An AST is a data structure that encodes abstract information about a piece of code.

an AST object of the above code

This one is specifically for the above sample code declaring function foo(bool).

An AST is a "tree", which is a kind of graph. And a graph is a very useful type of data structure, ubiquitous in software engineering. In order to understand ASTs we have to understand graphs. (You can also skip ahead to learn more about ASTs or look at these tools to make and use an AST yourself.)

How Do Graphs Work?

Graphs consist of "nodes" and "edges", and can be represented by (often nested) objects or arrays. A graph can mix objects and arrays as well, nesting one kind within the other to whatever degree of complexity.

Each node and edge can contain information. You can travel from one node to another via the edge between them. Edges have direction as well. Here's a simple graph connecting node A to node B:

Simple graph showing A to B

At a very basic level, if you were to write this in Javascript, it could look like this:

[ ["A", ["B"] ], [ "B", [] ] ]
Enter fullscreen mode Exit fullscreen mode

or

{ 
   A: { value: data_set1, children: ["B"] }, 
   B: { value: data_set2, children: [] }
}
Enter fullscreen mode Exit fullscreen mode

You can flip the direction

Simple graph showing B to A

Resulting in code like this:

[ ["A", [] ], [ "B", ["A"] ] ]
Enter fullscreen mode Exit fullscreen mode

or this

{ 
   A: { value: data_set1, children: [] }, 
   B: { value: data_set2, children: ["A"] }
}
Enter fullscreen mode Exit fullscreen mode

And you can make the edge bidirectional, usually represented with a plain line with no arrows.

Simple graph showing a bidirectional relationship between A and B

With code that does something like this

[ ["A", ["B"] ], [ "B", ["A"] ] ]
Enter fullscreen mode Exit fullscreen mode

or this

{ 
   A: { value: data_set1, children: ["B"] }, 
   B: { value: data_set2, children: ["A"] }
}
Enter fullscreen mode Exit fullscreen mode

These are simple examples, and in practice graphs can encode large amounts of data. Google displays search results with the help of a page rank graph, for example. This is a simplified representation of one:

Simplified rank graph example

Graphs can also have certain constraints. We can say: "The graph will start with exactly one node and every node except the first will have exactly one parent. Nodes can have multiple children though."

image of a tree data structure

This is an example of one kind of tree. In general, a tree branches out. Every node after the first (root node) has exactly one parent. Trees are hierarchical and do not contain loops. (Graphs can have loops, and do not necessarily have a root node.)

But for now we'll focus on trees. Because when we build an AST, we take abstract syntactical data from code and encode it into a tree.

AST Design Standards & Traversal Functions

Because ASTs are often used in the process of compiling code (which happens all the time - every time you try to run any code), AST design standards are fairly robust. Compilers (and interpreters) essentially take the code we write (in Javascript, Python, Ruby, or C++) and turn it into machine-language instructions that a computer's CPU can run.

AST design standards include:

  • variables (and their declaration locations in the source code) must be preserved
  • the order in which statements get executed is well defined and preserved
  • in the case of binary operations, left and right positioning is preserved
  • identifiers and their values are stored

Ultimately, broken code cannot be turned into an AST. In the process of building the AST, we might come across errors like missing brackets, untyped variables (as in Typescript), or other syntactic mistakes. Instead of proceeding, we can flag these errors and display them to users for correction.

But once we successfully build an AST, it should be possible to unparse an it into something very similar to the original code, using a code generator. And the resulting code should definitely function exactly the same as the original code.

For example, using an AST like this ...

representation of generalized AST as graph showing the kinds of information each node and edge contain

We could rebuild code that would look something like this:

function euclid(a,b) {
   while (b !== 0) {
      if (a > b) { a = a - b; } 
      else { b = b - a; }
   } 
   return a;
}
Enter fullscreen mode Exit fullscreen mode

So we can take a piece of code, turn it into an AST, and eventually turn that back into code. But wait ... there's more: The function we use to step through the AST (called an AST traversal function) is intelligent enough to make sense of the semantic encodings and help us do useful things with that information.

We can use an AST traversal function to walk along the structure to discover "dead branches" (pieces of code that will never run).

Tree Shaking & More

Tree shaking refers to dead-code elimination in Javascript. In order to tree shake, we would combine the use of an AST and an AST traversal function to find which "branches" of code are "dead". This is how VS Code grays out unused lines of code. Tree shaking then eliminates those unused lines of code, for a cleaner, leaner code base.

When a code base is sufficiently large, dead-code elimination is necessary. Dead ends become dead weight, potentially causing worse performance if the product is shipped and bloated code in much need of pruning. (Amusingly, that's not a pun. That's what they call it! I came across many articles on tree pruning in writing this post though.)

There's incentive on both ends, as wet code is more confusing for developers as well.

The same traversal function can, interestingly, help us inject our own code into a given chunk of code according to preset rules if we wanted. (More about this in the follow up below.)

Tools To Make And Use An AST

Create an AST: Esprima

Traverse that AST and replace or inject code: Extraverse

Unparse the modified AST back into Javascript: Escodegen

ASTs vs CPTs

I mentioned earlier that ASTs are used in the process of compiling or interpreting. There is an alternative: Concrete Parse Tree. Unlike ASTs, CPTs include much more granular (potentially unnecessary) information. ASTs can omit some syntactic information like grouping parentheses, because of the way in which the structure of an AST already encodes that information.

CSTs are much bigger than ASTs. But the tradeoff is that they can aid in more efficient compiling. In practice, both are used.

Follow Up

My fascination with ASTs was inspired by an app I'm working on: a Big O (time complexity) calculator.

In my research on Big O approximation, I found that most tools calculate the amount of time a machine takes to run a function on different sized data sets. They use the resulting amounts of time to determine whether the rate of growth of time is sublinear, linear, exponential, etc.

I hope to create a tool that will count the number of actions taken (rather than the amount of time for a specific machine), so that for any snippet of code I can point to the most costly lines and indicate how many times they ran. This can help students learn Big O with a more concrete understanding of what's happening with their code.

The Halting Problem

Slightly outside the scope of this article, but cool enough to include: In 1936, Alan Turing (pictured at age 16, below) proved that it is impossible to write code that can examine another piece of code and its input, and tell whether or not it will ever terminate. This is called the halting problem.

Alan Turing, age 16

For this reason, code entered into the Big O calculator can run too long in an infinite loop, and lock up a user's computer. I plan to bake in a fail-safe for that.

We'll See What's Possible

I'd eventually like to expand the project into a more comprehensive teaching tool. For now, I've scoped the project to the calculator to see if it's viable.

Latest comments (17)

Collapse
 
dominicbraam profile image
Dominic Braam

Very interesting. I remember having a project in my first year at University which was to create a Big O Calculator. I didn’t get very far and I submitted with it only catering for one particular scenario. After reading this post, I’m ready to tackle it again. 😬

Collapse
 
aruna profile image
aruna-x

That’s wonderful! I’d love to take a look once it’s done 😄

Collapse
 
pulljosh profile image
Josh Pullen

Thank you for this article! It is very relevant to a project I’m working on right now. In particular, I’d love to hear more about injecting code into user code (for your big O project). I’m trying to do essentially the exact same thing, and this is all new territory for me.

Collapse
 
aruna profile image
aruna-x

Hey Josh, that’s great to hear! I’m happy to write a follow up on injecting code. Coming soon 😄 I’m curious what project you’re working on.

Collapse
 
pulljosh profile image
Josh Pullen • Edited

I am working on a JavaScript editor that enables you to debug your code by time traveling:

I hope it will help new developers understand the language better. (I've been heavily inspired by the work of Bret Victor.)

The main trick here is to inject function calls into the user's code which report back information about the state of the program. That's how the timeline of values is built.

Thread Thread
 
aruna profile image
aruna-x

Josh! Apologies for the delay. I actually had a family emergency. This project sounds amazing! I won’t have a chance to write that blog post quickly enough, but I’d be happy to hop on a quick call to chat about what I learned and how you can apply it to your project. Let me know :)

Thread Thread
 
pulljosh profile image
Josh Pullen

Sorry to hear about your family. :( Hope things are getting better for you.

I would love to chat if you have the time! You are very kind to offer; I'm sure there's a lot I could learn from you.

I've been making progress on my babel plugin to inject the diagnostic function calls, but I am definitely just hacking around as I build it. There are probably lots of things I could be doing in a better way.

If you're still interested in talking, you can shoot me an email or a DM on twitter. :)

Collapse
 
edgarbonet profile image
Edgar Bonet

If there is a syntax error (missing brace, or bracket...) wouldn't you be unable to build the AST in the first place?

Collapse
 
aruna profile image
aruna-x • Edited

Hi Edgar, I was curious about that too when I first started playing with esprima. Wonderfully, esprima (which comes pre-baked with the rules of writing code we expect - it's fun to crawl through the esprima code and see how they've done this) throws helpful errors indicating the line and column numbers of syntactic issues. As long as you catch and handle them well (for example, by indicating the issue to the user so they can fix it, as with VS Code), you should be good :)

Collapse
 
edgarbonet profile image
Edgar Bonet

Oh, I see. So it's the process of attempting to build the AST that uncovers syntax errors. It's not the AST itself, which cannot be built if there is such an error. Makes sense.

Thread Thread
 
aruna profile image
aruna-x

Ah, I see what you meant to say! It seems I played fast and loose with my phrasing in this post. Thanks for highlighting this! I’ll edit the post so it’s more accurate :)

Collapse
 
squidbe profile image
squidbe

Excellent write-up! Thanks for your time/effort. 👍

Collapse
 
pierresassoulas profile image
Pierre Sassoulas • Edited

Cool article aruna-x ! For those who want to do "applied ast", come contribute to pylint, it's a static analyser for python included in vs-code. We have a lot of issues to choose from and we'll help you with our custom ast, astroid :) We're using the visitor pattern so you can do something on all the node of a certain type easily. For example in our checker's 'visit_while' function you'll have access to all the while nodes. It's less graph/algorithmically intensive than you would think. (github.com/PyCQA/pylint/issues)

Collapse
 
aruna profile image
aruna-x

Thank you Pierre! I look forward to looking at pylint's open issues and contributing 😄

Collapse
 
danielhansson profile image
Daniel Hansson

Interesting to read, learned something new today!

Collapse
 
nickytonline profile image
Nick Taylor

Gym teacher from Glee saying amazing!

Great stuff! There's another cool tool for ASTs called AST Explorer if you wanna check it out. It supports a bunch of languages.

Looking forward to your next post! 😎

Collapse
 
aruna profile image
aruna-x

Ah man. Thank you Nick! Much appreciated 😄