
Arrow functions break JavaScript parsers

Sam Thorogood ・4 min read

This is an incredibly esoteric post! Enjoy. 🔨🤓

While writing a JavaScript parser in C (which is a post for another day, but you can try it out via WebAssembly online here), I discovered JavaScript's only real ambiguity.

Surprise! It's the arrow function, your favourite shorthand for writing methods and callbacks. A quick reminder of what it looks like:

const method = (arg1, arg2) => {
  console.info('do something', arg1, arg2);
};
const asyncMethodAddNumber = async foo => (await foo) + 123;

Why ➡️ At All?

Arrow functions take the this variable of the scope where they were declared. Here's a classic example:

class Foo {
  constructor(things) {
    this.total = 0;
    things.forEach((thing) => this.total += thing);
  }
}

If you were to change the above to use function (thing) { ... }, accessing this.total would fail, because this would no longer refer to the Foo instance. In general, my advice is to write () => ... by default. I believe it causes the least surprise.
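To make the failure concrete, here's a small sketch comparing the two callback styles. The class names ArrowTotal and FunctionTotal are made up for this comparison. Class bodies are always strict-mode code, so this inside the plain function callback is undefined and the update throws.

```javascript
// Hypothetical classes, named only for this comparison.
class ArrowTotal {
  constructor(things) {
    this.total = 0;
    // The arrow function reuses the constructor's `this`.
    things.forEach((thing) => { this.total += thing; });
  }
}

class FunctionTotal {
  constructor(things) {
    this.total = 0;
    // A plain function gets its own `this`. forEach() passes none, and
    // class bodies are strict, so `this` is undefined in here.
    things.forEach(function (thing) { this.total += thing; });
  }
}

const arrowTotal = new ArrowTotal([1, 2, 3]).total;

let functionError = null;
try {
  new FunctionTotal([1, 2, 3]);
} catch (err) {
  functionError = err;
}

console.info(arrowTotal);                          // 6
console.info(functionError instanceof TypeError);  // true
```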

(Here's another post on sensible defaults in JS, around let, const and var!)

The Ambiguity

So: in writing a parser, your goal is to work out what each token is, and what sort of statement or expression it makes up. JavaScript's syntax makes this fairly easy: in most cases you need to look at most one token "forward".

Easy: Let It Go

Here's an example with let. Did you know that let is only sometimes a keyword (used to declare new variables), and is otherwise a valid variable name itself?^

let += 123;  // let is a symbol which I'm adding 123 to
let abc;     // I'm declaring a variable "abc"

(note that dev.to's syntax highlighter is wrong here! 🤣)

So let is a keyword if:

  • you're at the top-level of execution (not inside brackets etc)
    • ... unless you're inside a "for" declaration, e.g.: for (let ...
  • the next token is a name, i.e. an identifier (or [ or {, as in let {x,y} = ...)
  • the next token is NOT in or instanceof
    • ... as let in foo asks, is the variable contained in "let" a key of the object "foo"
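The rules above can be sketched as a single check. This is only an illustration of the list, not the parser from this post: the function name letStartsDeclaration, the token shape and the two context flags are all assumptions.

```javascript
// Hypothetical token shape: { type: 'name' | 'punct' | 'keyword', value: string }.
// The context flags would be supplied by whatever parser is calling this.
function letStartsDeclaration(next, { atStatementStart = false, inForHead = false } = {}) {
  // Only at the start of a statement, or right after `for (`.
  if (!atStatementStart && !inForHead) return false;
  // `let in foo` and `let instanceof Foo` use `let` as a plain variable.
  if (next.value === 'in' || next.value === 'instanceof') return false;
  // Otherwise a name, `[` or `{` must follow for this to be a declaration.
  if (next.type === 'name') return true;
  return next.type === 'punct' && (next.value === '[' || next.value === '{');
}

// `let += 123;` uses `let` as a variable
console.info(letStartsDeclaration({ type: 'punct', value: '+=' }, { atStatementStart: true })); // false
// `let abc;` is a declaration
console.info(letStartsDeclaration({ type: 'name', value: 'abc' }, { atStatementStart: true })); // true
```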

Hard: Arrow Functions

But this post is about the humble arrow function! Now, the beginning of an arrow function can take two different forms. The first is simpler, and trivially determinable as an arrow function:

foo => bar;
async foo => something + await blah;

When a parser encounters foo (or any named variable), we can look at the next token and ask if it's an arrow =>. We can similarly look ahead from async, because the only valid interpretation of async variableName is the start of an async arrow function. Hooray! 🎊
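Here's a minimal sketch of that lookahead, assuming the source has already been split into plain string tokens. The names isName and startsSimpleArrow are hypothetical, and isName is deliberately simplistic: ASCII only, with no reserved-word check. That's why the async branch still confirms the => instead of trusting the name alone.

```javascript
// Assumption: tokens are plain strings produced by some earlier tokenizer step.
const isName = (token) => typeof token === 'string' && /^[A-Za-z_$][\w$]*$/.test(token);

function startsSimpleArrow(tokens, i) {
  // `async foo => ...`
  if (tokens[i] === 'async' && isName(tokens[i + 1]) && tokens[i + 2] === '=>') {
    return true;
  }
  // `foo => ...`
  return isName(tokens[i]) && tokens[i + 1] === '=>';
}

console.info(startsSimpleArrow(['foo', '=>', 'bar'], 0));                  // true
console.info(startsSimpleArrow(['async', 'foo', '=>', 'something'], 0));   // true
console.info(startsSimpleArrow(['foo', '+', 'bar'], 0));                   // false
```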

But in the case of parentheses, like this (foo, bar), our parser can't know what to do. This could just be a list of expressions: think putting some math into brackets to ensure correct order of evaluation.
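A tiny runnable illustration of the two readings, using made-up variable names. Both statements open with exactly the same tokens, and only what follows the closing bracket tells them apart.

```javascript
const foo = 1;
const bar = 2;

// Parenthesised comma expression: evaluates foo, then bar, and yields bar.
const grouped = (foo, bar);

// Identical opening tokens, but the trailing => makes them a parameter list.
const arrow = (foo, bar) => foo + bar;

console.info(grouped);        // 2
console.info(arrow(10, 20));  // 30
```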

Arrow functions are even more ambiguous with a prefix of async: because async can technically be the name of a function being called. Yes, that's right, the following JavaScript is valid: 🤮

var y = 123;
var async = (x) => x * 2;  // assign 'async' to a function
console.info(async(y));    // call 'async' as a function!

I'll wait for you to copy and paste it into a console. 📥

(again, the syntax highlighter is wrong and says async is a keyword! 😂)

The Solution

There are a couple of solutions. No matter what, we must look forward, over the ambiguous bit. And it's important to remember that this might not be "fast".

Here's a contrived example:

(arg=function() {
  // whole other program could exist here
}) => ...

If we want to work out whether the first ( opens an arrow function, we could parse forward to find the following =>. Naïvely, we would then discard all that work and start parsing from the ( again.

But if we're aiming for speed, we've just thrown away all that "work".
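As a rough sketch of that naive look-ahead, assuming the input is already a flat array of string tokens (which glosses over the fact that tokenizing JavaScript needs parser context in the first place): the hypothetical parenStartsArrow below only balances parentheses, then checks what follows the matching close. A real parser would have to parse everything in between and then redo that work.

```javascript
// Assumption: tokens are plain strings and `open` is the index of a "(".
function parenStartsArrow(tokens, open) {
  let depth = 0;
  for (let i = open; i < tokens.length; i++) {
    if (tokens[i] === '(') {
      depth++;
    } else if (tokens[i] === ')') {
      depth--;
      if (depth === 0) {
        // Found the matching ")": an arrow function needs "=>" right after it.
        return tokens[i + 1] === '=>';
      }
    }
  }
  return false; // unbalanced input
}

const contrived = ['(', 'arg', '=', 'function', '(', ')', '{', '}', ')', '=>', 'x'];
const grouping = ['(', 'a', '+', 'b', ')', '*', '2'];

console.info(parenStartsArrow(contrived, 0)); // true
console.info(parenStartsArrow(grouping, 0));  // false
```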

Instead, a better solution is to leave it intentionally ambiguous and come back to it later. The way we parse what's inside the parentheses (luckily!) doesn't change based on whether it's an arrow function or not. The same tokens, equals signs etc, are all allowed there.

So we could end up with a stream of tokens like this:

AMBIG_PAREN
PAREN       (
SYMBOL      arg
OP          =
FUNC        function
...
CLOSE       )
ARROW       =>

We can now clarify our 'AMBIG_PAREN'—it started an arrow function declaration. This also only happens at most once per "depth" of your program: the same ambiguity could happen inside the whole other program, but it'll be at an increased depth.
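One way to sketch this "mark now, resolve later" idea is with a stack of placeholder tokens. This is an illustration under assumptions, not the C parser from this post: markParens, the ARROW_PARAMS and GROUP type names, and the string-token input are all made up, and for simplicity every ( is treated as ambiguous, even ones that a real parser would already know open a call.

```javascript
// Assumption: tokens are plain strings. Output tokens are { type, value? } objects.
function markParens(tokens) {
  const out = [];
  const stack = []; // one pending placeholder per open "(", i.e. per depth
  for (let i = 0; i < tokens.length; i++) {
    const value = tokens[i];
    if (value === '(') {
      const marker = { type: 'AMBIG_PAREN' };
      stack.push(marker);
      out.push(marker, { type: 'PAREN', value });
    } else if (value === ')') {
      out.push({ type: 'CLOSE', value });
      const marker = stack.pop();
      if (marker) {
        // Resolve the placeholder in place: nothing is parsed a second time.
        marker.type = tokens[i + 1] === '=>' ? 'ARROW_PARAMS' : 'GROUP';
      }
    } else {
      out.push({ type: 'TOKEN', value });
    }
  }
  return out;
}

const marked = markParens(['(', 'arg', '=', 'function', '(', ')', '{', '}', ')', '=>', 'x']);
console.info(marked[0].type); // ARROW_PARAMS (the outer paren)
console.info(marked[5].type); // GROUP (the inner paren, one level deeper)
```

Because the placeholder is mutated in place, the tokens between the brackets are visited exactly once.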

Some Context

^ To be fair, some of JavaScript's ambiguities are solved when running in strict mode.

For instance, we can't use let as a variable name in this mode. But not all code is written or served this way—and strict mode doesn't change the behavior of async or arrow function ambiguity.
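Both claims are easy to check from a console. The sketch below leans on new Function(), whose body is sloppy-mode code unless it opts in with a "use strict" directive. The variable names are only for the demo.

```javascript
// Sloppy mode: `let` is an ordinary variable name.
const sloppyLet = new Function('var let = 1; let += 123; return let;');
console.info(sloppyLet()); // 124

// Strict mode: the same declaration is rejected at parse time.
let strictLetError = null;
try {
  new Function('"use strict"; var let = 1;');
} catch (err) {
  strictLetError = err;
}
console.info(strictLetError instanceof SyntaxError); // true

// Strict mode still allows `async` as a variable and as a call target.
const strictAsync = new Function(
  '"use strict"; var async = (x) => x * 2; return async(123);'
);
console.info(strictAsync()); // 246
```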

Regular Slashes

There's another fun challenge in JavaScript parsers: whether the humble slash is division, or the start of a regular expression. For example:

function foo() {} / 123 /g

Q: While the above code is nonsensical, we have to ask: what does the "divide by 123, divide by g" get parsed as?

A: Turns out—it's a regular expression. This is because a top-level function is a declaration, not an expression. If we surrounded the entire line with (), it would be division.
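If you'd like to verify this, here's one way to do it. It relies on indirect eval, which runs the string as sloppy-mode script code and returns the value of its last expression statement. The variable g in the second snippet is defined only so the division can run.

```javascript
// Statement position: a function declaration, then the regex literal / 123 /g.
const statementResult = (0, eval)('function foo() {} / 123 /g');
console.info(statementResult instanceof RegExp); // true
console.info(statementResult.flags);             // g

// Wrapped in parentheses, the function is an expression and the slashes divide.
const divisionResult = (0, eval)('var g = 2; (function foo() {} / 123 / g)');
console.info(Number.isNaN(divisionResult));      // true
```

Dividing a function by a number gives NaN, which is about as nonsensical as promised.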

However, unlike arrow functions, this isn't really a problem for a JavaScript parser. When walking left-to-right through code, we can just keep track of what we expect any upcoming slash to be. So it's not ambiguous. 🤷
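A rough sketch of that bookkeeping, under assumptions: the function name slashStartsRegex, the token types and the closesValue flag are hypothetical. The flag stands in for what the parser already knows about a closing bracket, namely whether it ended a value (like a parenthesised expression) or something else (like a function declaration's body or an if condition).

```javascript
// Hypothetical token shape: { type: string, closesValue?: boolean }.
// `prev` is the previous significant token, or null at the start of input.
function slashStartsRegex(prev) {
  if (prev === null) return true; // nothing before it to divide
  switch (prev.type) {
    case 'number':
    case 'string':
    case 'name':
      return false; // a value came before, so this is division
    case 'close':
      // ")" "]" "}" only allow division when they closed a value.
      return !prev.closesValue;
    default:
      return true; // operators, keywords, opening brackets, ...
  }
}

// After the "}" of a function declaration: regular expression.
console.info(slashStartsRegex({ type: 'close', closesValue: false })); // true
// After the "}" of a function expression: division.
console.info(slashStartsRegex({ type: 'close', closesValue: true }));  // false
```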

Fin

I told you this was esoteric! Thanks for reading this far. Personally, I would like to see JavaScript shed its ambiguities as it evolves, but I think its wide adoption is going to stymie fixing up what are arguably just mildly annoying idiosyncrasies in its syntax. 😄


Posted by Sam Thorogood (@samthor), Developer Relations for Web at Google.

Discussion


Agreed, super informative stuff Sam. I’m reading all of these.

 

Thanks!
Also you don't need to update dev.to's JS syntax highlighter: these are pretty obscure edge-cases. 🤣

 

JS is parsed with a TDOP (Top Down Operator Precedence) parser.

This may explain why it's hard to cover all edge cases using an alternative (e.g. EBNF).

TDOP is very flexible and very fast. But, to handle an existing grammar you have to know the order.

 

I took a look at Crockford's post on this. It seems like he thinks TDOP is a good fit for "Simplified JavaScript", which wouldn't have these ambiguities in the language.

My post is more about simple L-to-R parsers, and maybe a better title would be about "breaking tokenizers". Unfortunately tokenizers technically can't be as simple as we expect because of the ambiguities I outline. This is why breaking most syntax highlighters is trivial ;)