Robin Heggelund Hansen

Posted on May 25, 2019

Improving Elm's compiler output

#elm #compilers #optimization #performance

Elm is fast.

This is not because of any innovation in the compiler. In fact, the Elm compiler hardly does any optimization at all.

Elm is fast because it compiles to Javascript, which dedicated and talented engineers from all over the globe have been optimizing for more than a decade.

But here's a question: does the Elm compiler output Javascript which makes it easy for browsers to optimize it and, if not, is there any performance to gain by changing how the compiler outputs Javascript?

Let's have a look.

Hidden classes

Javascript is a dynamic language. Javascript allows objects to change in shape at any time. Variables and data structures can contain different values of different types at any time. In practice, however, most programs are fairly static and browsers try to take advantage of this.

In Chrome, every object literal and class is seen as a shape. If the shape of an object or class changes, like a property being dynamicly added or removed, Chrome will see the resultant new shape and try to convert between them. In some cases, Chrome will just treat objects as hashmaps (kinda like Elm's Dict) if it has trouble figuring out the precise shape of something. We call these shapes for hidden classes.

We get the best performance when our Javascript code doesn't seem to deal with many different shapes at a time. Fortunately, Elm is a static language so this should be pretty common. There is, however, a case where Elm does produce different shapes that can pass as the same type. Let's look at an example:

type Maybe a
  = Just a
  | Nothing

This is how Elm's Maybe is defined. It is compiled into the following Javascript (using --optimized):

var elm$core$Maybe$Just = function (a) {
    return {$: 0, a: a};
};

var elm$core$Maybe$Nothing = {$: 1};

As you can see, the Javascript object literal for Just and Nothing has different shapes. As a result, every Javascript code which deals with Maybe has to be able to deal with two different shapes. But is that costly?

To measure the effect I've done two things:

1) Make a benchmark which can hopefully pick up the performance difference, then make two versions where one is handcoded to avoid the overhead.
2) Look at the assembly code which Node outputs (Node and Chrome uses the same JS engine, V8).

You can find the code for this experiment at github.

I've focused on Elm' built-in List type, as it's used in pretty much every program. A performance improvement here would be a big benefit to everyone who uses Elm.

We'll be focusing on the following benchmark:

benchmark "int +" <|
  \_ -> foldl (+) 0 intList

Simply a left fold over all the elements, adding them together. A change in performance here will tell us how quickly we can iterate through a List, the theory being that removing any overhead dealing with multiple hidden classes should increase performance.

We compile the benchmark. Looking through the compiled JS output, we can find this piece of code:

var _List_Nil = { $: 0 };
function _List_Cons(hd, tl) { return { $: 1, a: hd, b: tl }; }

The empty List looks different from a List element (if you're wondering how Lists work, I explained it decently well at Elm Europe 2017).

We copy the JS file, and make the following modificaiton:

var _List_Nil = { $: 0, a: null, b: null };
function _List_Cons(hd, tl) { return { $: 1, a: hd, b: tl }; }

A List should no longer be polymorphic (of many shapes).

The result:

Firefox, before modification: 75,843 ops/sec
Firefox, after modification: 84,549 ops/sec
Safari, before modification: 248,531 ops/sec
Safari, after modification: 248,530 ops/sec
Chrome, before modification: 294,434 ops/sec
Chrome, after modification: 302,569 ops/sec

So, no difference in Safari, however both Chrome and Firefox see a pretty decent improvement: ~11% in Firefox, ~4% in Chrome. Keep in mind that this is something that can be implemented in the compiler, no Elm code would have to change, it's a performance improvement for no effort on the part of application developers.

We can also look at the code that V8 generates by running the following script:

node --print-opt-code --code-comments index.js > jit_log

By reading the jit_log for the benchmark run without modifications, we can see:

-------- Optimized code ---
optimization_id = 1
source_position = 48049
kind = OPTIMIZED_FUNCTION
name = $List$foldl
stack_slots = 10
compiler = turbofan
address = 0x28bd2bd6e9a1
Body (size = 2292)
Instructions (size = 2012)
<Assembly code>

While for the modified code we see

-------- Optimized code ---
optimization_id = 0
source_position = 48067
kind = OPTIMIZED_FUNCTION
name = $List$foldl
stack_slots = 10
compiler = turbofan
address = 0x2081135eec01
Body (size = 1848)
Instructions (size = 1600)
<Assembly code>

As expected, it generated less code when the code doesn't have to deal with polymorphism.

There is, however, something in both of this logs which give me pause:

Inlined functions (count = 1)
 0x3f2705632551 <SharedFunctionInfo A2>

This section of the log lists how many functions have been inlined. In this case, only a function which evaluates the passed in function has been inlined (I'll explain what this mean in a second). This isn't actually that surprising. foldl only contains a single function call, which is done once per loop. This function is usually never the same either, so it makes sense that it wasn't inlined. However, if you look at all the other functions that have been optimized, the only functions which are inlined are either functions which take a single argument (also called arity-1 functions) or functions called A2, A3, A4 etc.

What gives?

Inlining

Inlining function calls (replacing the function call with the implementation of the function) is one of the most important optimizations a compiler can make. This is not necessarily because function calls are all that expensive, but because it allows the compiler to better understand what the code does and perform optimizations based on that.

Let's look at our benchmark again:

benchmark "int +" <|
  \_ -> foldl (+) 0 intList

Without inlining, this would call the foldl function, which is a loop calling a function for every element in the list. Since foldl can accept any function, it would store the intermediary number value as a reference (even though it could be stored as a number on the stack), performing a lookup in memory each time the function is called. If we weren't storing ints as intermediary values, as it would be the case if we were folding over other things, the Javscript optimizer would likely treat all values as a generic hashmap inside of foldl.

With inlining, however, this function is likely to be compiled down to a single javascript loop, without any function calls at all, and with specialized code for the types actually used in the loop. This is like getting the benefits of a monomorphising compiler, without actually having a monomorphishing compiler.

But why aren't many functions being inlined, and what are all these A2 things?

Currying

Elm has this concept of currying. Given the following function:

add : Int -> Int -> Int
add a b =
  a + b

You can create a new function like this:

add2 : Int -> Int
add2 =
  add 2

So if you call a function with enough arguments, it will execute. If you call a function without all the arguments it requires, it returns a new function which accepts the missing arguments.

This is how the above add function is compiled to JS:

function F(arity, fun, wrapper) {
  wrapper.a = arity;
  wrapper.f = fun;
  return wrapper;
}

function F2(fun) {
  return F(2, fun, function(a) { return function(b) { return fun(a,b); }; })

var author$project$Main$add = F2(function(a, b) {
  return a + b;
});

Our add function, as well as every other Elm function, is wrapped in an object which contains the original function, the arity that function expects, and a curried function. Calling the function has to be done with A2, which is implemented like this:

function A2(fun, a, b) {
  return fun.a === 2 ? fun.f(a, b) : fun(a)(b);
}

So A2 takes a F2 object, and calls the function directly if the provided function actually takes two arguments, or does a curried call if not.

From the perspective of a javascript engine, this has a big problem: unless we do whole program analysis (which is too expensive) there's no way to know if the function itself should be called, or if it is to be called using currying. The A2 call itself can be inlined, but nothing more.

But what if we made the Elm compiler smarter? If the Elm compiler knew how many arguments a function required, it could re-write this:

A2(author$project$Main$add, 1, 2)

into

author$project$Main$add.f(1, 2)

We make a copy of our benchmark, and make these changes by hand for every function call in our benchmark, and in the functions called by the benchmark.

This time we're going to focus on the results for the following function:

benchmark "* 2" <|
  \_ -> map (\a -> a * 2) intList

The result:

Firefox, before modification: 24,291 ops/sec
Firefox, after modification: 50,927 ops/sec
Safari, before modification: 35,723 ops/sec
Safari, after modification: 49,029 ops/sec
Chrome, before modification: 39,253 ops/sec
Chrome, after modification: 58,491 ops/sec

Pretty good. The performance in Firefox is doubled, while we're seeing ~30% improvements in Chrome and Safari.

When looking at the inlining results of the unmodified code:

Inlined functions (count = 1)
 0x13f84c332341 <SharedFunctionInfo A2>

We can see there's been some changes after the modifications:

Inlined functions (count = 5)
 0x1f31bec396e1 <SharedFunctionInfo $map>
 0x1f31bec395a9 <SharedFunctionInfo $foldr>
 0x1f31bec39541 <SharedFunctionInfo $foldrHelper>
 0x1f31bec32049 <SharedFunctionInfo F2>
 0x1f31bec31fe1 <SharedFunctionInfo F>

However, looking over the generated assembly code I'm seeing a lot of lines containing the following:

call 0x1e5fad48abe0  (Call_ReceiverIsNotNullOrUndefined)

When we're calling functions using someObject.f(args), Chrome has to make sure that someObject isn't null or undefined.

I've run one more benchmark. This time I've placed functions outside of F wrappers, and call them directly.

The result:

Firefox, before modification: 50,927 ops/sec
Firefox, after modification: 59,632 ops/sec
Safari, before modification: 49,029 ops/sec
Safari, after modification: 43,695 ops/sec
Chrome, before modification: 58,491 ops/sec
Chrome, after modification: 63,619 ops/sec

Chrome and Firefox see some nice speedups, ~16% for Firefox and ~8% for Chrome. Safari actually sees a slowdown but I don't know why. Re-running the benchmarks several times gives wildly different results, and I don't know what to make of that.

Conclusion

Elm is fast, but there's still room to become faster. By making changes to how the Elm compiler outputs Javascript, we can increase the performance of Elm programs by a significant margin.

Top comments (16)

Gage • May 26 '19

Awesome changes! Are these changes going to be implemented soon?

Robin Heggelund Hansen • May 26 '19

The honest answer to that is: don’t know 😅

Gage • May 28 '19

Corollary, are you helping Evan directly now?

Robin Heggelund Hansen • May 29 '19

I'm a member of the core team, so I do talk with Evan from time to time. But I still write PRs which he may, or may not, merge.

Gage • May 29 '19

I just thought it was a core team of one, so that's cool 😁

Robin Heggelund Hansen • May 29 '19

There's actually a whole bunch: elm-lang.org/community

James Carlson • Dec 27 '19

Wonderfully informative post.

m-shaka • Dec 23 '20

Interesting article! Can I translate it into Japanese?

Robin Heggelund Hansen • Dec 27 '20

Sure :)

m-shaka • Jan 3 '21

I'm sorry there is something that I can't understand in Inlining section.

Is the intermediary number value is accumlator?
What does we were folding over other things mean actually? And why aren't ints stored? In my knowledge, all local variables are stored on the stack or the heap.

I'm looking forward to your answer

Robin Heggelund Hansen • Jan 3 '21

yes
other things like Booleans, Maps, Sets, Floats or other Objects. «If we weren’t storing ints» refers to the option of storing other intermediate values like objects.

m-shaka • Jan 3 '21

Thanks for your quick response! I understood.

I'm so sorry I have one more question, but this is last one.

It's about this sentence.

This is like getting the benefits of a monomorphising compiler, without actually having a monomorphishing compiler.

What I understand is below:
This talks about Inline Caching. If foldl is inlined, the function as argument of foldl is also inlined, so without Inline Caching code of inlined foldl's loop is specilized for argument functions type.
Is this right?

Robin Heggelund Hansen • Jan 4 '21

This talks about Inline Caching => inline caches is something else, inlining is just replacing a function call with the implementation of the function.

without Inline Caching code of inlined foldl's loop is specilized for argument functions type => without inlining, the foldl's loop cannot be specialized because it has to handle all possible argument types.

m-shaka • Jan 4 '21

Thank you! You mean monomophishing is to specialize code by inlining function call, right? If so, I can't understand what without actually having a monomorphishing compiler means...

Robin Heggelund Hansen • Jan 5 '21

monomorphising doesn't do inling. When a monomorphising compiler sees the following code:

listmap : (a -> b) -> List a -> List b
listmap fn list =
    List.foldl (\a acc -> fn a :: acc) list
    |> List.reverse

strs = listmap String.fromInt [ 1, 2, 3 ]

strs = listmap isEven [ 1, 2, 3 ]

it will generate two versions of listmap, one which has the type (Int -> String) -> List Int -> List String and one which has the type (Int -> Bool) -> List Int -> List Bool. It will then use the specialized listmap implementations where it can.

This allows the compiler to specialize code better, because it has precise type information.