Throughout all my journey as a software engineer, I've always hoped in and out of different topics and questions, OOP or Functional?, Static or Dynamic Types? But one thing always seemed sure. I, and every other programmer that I knew liked a good-looking code, and I don't mean in a functional way, I'm talking about the visual part.
Identations, blank lines here and there, double and single quotes, and a bunch of other rules and styles make our code more readable and concise. And thanks to tools like auto-formatters and Linters, we can save ourselves the hustle of pressing tab once (or twice if you are a psycho) at the start of each line or things like that.
We all know what those linters and formatters are, but something that I got interested in the last weeks was HOW they worked, how exactly they read our code and sometimes can even fix some of our errors by themselves.
So in this little article, I'll guide you through a simple and general explanation of how a code formatter work, these rules can change but in most cases apply to the majority of linters and formatters out there.
1 - Abstract Syntax Tree
Abstract Syntax Tree, or AST for short, is known as a visual representation of our code structure, it doesn't know anything about how our code works or what it does, it consists in organizing each part of our code into "branches" of a tree-like structure, or even json which I will be using for further examples for better understanding.
For example lets take this simple Javascript function.
function sum(a, b) {
return a + b;
}
Lets use AST explorer to generate the AST for this function block, it will look something like this.
{
"type": "Program",
"start": 0,
"end": 34,
"body": [
{
"type": "FunctionDeclaration",
"start": 0,
"end": 34,
"id": {
"type": "Identifier",
"start": 9,
"end": 12,
"name": "sum"
},
"expression": false,
"generator": false,
"async": false,
"params": [
{
"type": "Identifier",
"start": 13,
"end": 14,
"name": "a"
},
{
"type": "Identifier",
"start": 16,
"end": 17,
"name": "b"
}
],
"body": {
"type": "BlockStatement",
"start": 18,
"end": 34,
"body": [
{
"type": "ReturnStatement",
"start": 21,
"end": 32,
"argument": {
"type": "BinaryExpression",
"start": 28,
"end": 31,
"left": {
"type": "Identifier",
"start": 28,
"end": 29,
"name": "a"
},
"operator": "+",
"right": {
"type": "Identifier",
"start": 30,
"end": 31,
"name": "b"
}
}
}
]
}
}
],
"sourceType": "module"
}
Yes, those 3 lines of code generated this monster of a JSON, but dont panic, lets go through slowly and you'll see that its actually very simple.
2 - Traversing the tree
First, we have our type which is "program", followed by an indication of where is the start and end of our script which can be like a "coordinate" of where that character is, and blank spaces count and tabs count differently here, but the most important part of all of this is the "body", is basically where all our code is.
Inside the body, we will see an array containing again the same pattern, a type, a start and end, another body, and so on.
Each "block" of our code resides inside the "body" of another block, and each type of code block may have some specific properties like for example let's pick only the function declaration.
"type": "FunctionDeclaration",
"start": 0,
"end": 34,
"id": {
"type": "Identifier",
"start": 9,
"end": 12,
"name": "sum"
},
"expression": false,
"generator": false,
"async": false,
"params": [
{
"type": "Identifier",
"start": 13,
"end": 14,
"name": "a"
},
{
"type": "Identifier",
"start": 16,
"end": 17,
"name": "b"
}
],
We see a FunctionDeclaration type, followed by a property of the type Identifier wish contains the function name, then we have expression, generator, and async wish are boolean values, followed by an array of params, that have each function parameter contained in its object, again with type Identifier, its name, and its "coordinates".
We can use this system to read through every block of our code, and we can already start to have an Idea of how the formatters and linters can apply rules to it.
3 - Applying changes through AST
Now let us take a look at the fun part of all of this. Knowing exactly where every statement starts and ends, where every variable is, and every function name and parameter, linters will read the AST and can generate a warning based on the reading. Let's see some examples:
Unused variables / functions / imports
The linter goes through the AST storing the name for every VariableDeclaration Identifier in a temporary Array, now every time it finds this same identifier's name on the rest of the code, it will pop that entry out of the array, when it gets to the end of it, the linter loops through all the remaining items and generates the "Unused variable" warning in your console since it never got out of the array, it means that it was never called anywhere in the code. The same process happens with function declarations and function calls, as well as imports in javascript.
Formatting
When it comes to formatting its also very straightforward, the formatting tool can read through our AST and find the exact position of every character, it can increment or decrement the "start" and "end" values to manipulate the number of blank spaces or tabs that your code will have.
Conclusion
I hope that with this article I could deliver some cool information about this topic that isn't discussed as much in our community I find it very interesting, and I also hope that you could learn something new here.
Feel free to follow me here and on my socials for more tech talk and articles like this one!
Top comments (2)
Nice explanation. Any idea how this processes such large information for huge LOC efficiently (space & time)
Linters at what I've seen, dont have knowledge of line size by itself, so, it treats the entire file continuously like an one liner. The performance itself isnt bad at all no matter the size of the given file, if you only run linters / formatters on the current buffer or only in warning mode (not applying corrections).