I’ve been pretty fascinated for the past few months with trying to understand how the Elm compiler might be able to target WebAssembly in the future. What are the major differences from generating JavaScript? What are the hard parts, what approaches would make sense?
I think one of the most interesting questions is: how do you implement first-class functions in WebAssembly? JavaScript has them built in, but WebAssembly doesn’t. Treating functions as values is a pretty high level of abstraction, and WebAssembly is a very low-level language.
Contents
Elm and WebAssembly
Elm’s first-class functions
Key WebAssembly concepts
Representing closures as bytes
Function application
Lexical closure
Code generation
Summary
What’s next?
References
Elm and WebAssembly
Before we get started, I just want to mention that from what I’ve heard from the core team, there is a general expectation that Elm will compile to WebAssembly some day, but currently no concrete plan. WebAssembly is still an MVP and won’t really be ready for Elm until it has garbage collection, and probably also access to the DOM and other Web APIs. The GC extension is still in "feature proposal" stage so it'll be quite a while before it's available.
But... it will get released at some point, and WebAssembly is one of the suggested research projects for community members, and well, it's just really interesting! So let’s have a think about what Elm in WebAssembly could look like!
Now... how do you go about implementing first-class functions in a low-level language like WebAssembly? WebAssembly is all just low-level machine instructions, and machine instructions aren’t something you can "pass around"! And what about partial function application? And isn’t there something about "closing over" values from outside the function scope?
Elm’s first-class functions
Let's start by looking at some example Elm code, then list all the features of Elm functions that we’ll need to implement.
module ElmFunctionsDemo exposing (..)
outerFunc : Int -> (Int -> Int -> Int)
outerFunc closedOver =
let
innerFunc arg1 arg2 =
closedOver + arg1 + arg2
in
innerFunc
myClosure : Int -> Int -> Int
myClosure =
outerFunc 1
curried : Int -> Int
curried =
myClosure 2
higherOrder : (Int -> Int) -> Int -> Int
higherOrder function value =
function value
answer : Int
answer =
higherOrder curried 3
In case you're wondering, the answer is 1+2+3=6. This is definitely not the simplest way to write this calculation, but it does illustrate all the most important features of Elm functions!
Three key features
Firstly, Elm functions are first-class, meaning they are values that can be returned from other functions (like `outerFunc`) and passed into other functions (like `higherOrder`).
Secondly, they support lexical closure. `innerFunc` "captures" a value from its parent's scope, called `closedOver`. This means that `myClosure` "remembers" the value of `closedOver` that it was created with, which in this case is 1.
Finally, Elm functions support partial application. `myClosure` is a function that takes two arguments, but in the body of `curried` we apply only one argument to it. As a result, we get a new function that is waiting for one more argument before it can actually run. This new function "remembers" the value that was partially applied, as well as the closed-over value.
Clues in the code
Note that we now have several Elm functions that will all end up executing the same line of code when they finally run! That's this expression:
closedOver + arg1 + arg2
If somebody calls `curried` with one more argument, this is the expression that will calculate the return value. Same thing if somebody calls `myClosure` with two arguments.
This gives us a clue how to start implementing this. All of the function values we’re passing around will need to have a reference to the same WebAssembly function, which evaluates the body expression.
In WebAssembly, we can’t pass functions around, only data. But maybe we can create a data structure that represents an Elm function value, keeping track of the curried arguments and closed-over values. When we finally have all the arguments and we’re ready to evaluate the body expression, we can execute a WebAssembly function to produce a return value.
There are still lots of details missing at this stage. In order to fill in the gaps, we’re going to need a bit of background knowledge on some of WebAssembly’s language features.
Key WebAssembly concepts
Linear memory
WebAssembly modules have access to a block of "linear memory" that they can use to store and load data. It’s a linear array of bytes, indexed by a 32-bit integer. WebAssembly has built-in instructions to store and load integers and floats, but anything more complex has to be built up from raw bytes.
The fact that everything is built up from raw bytes means that WebAssembly can be a compile target for lots of different languages. Different data structures will make sense for different languages, but they’re all just bytes in the end. It’s up to each compiler and runtime to define how those bytes are manipulated.
Tables
WebAssembly has a feature called "tables" which it uses to implement "indirect calls". Indirect calls are a feature of almost every high-level language, but what are they?
When a machine executes a function call, it obviously needs some reference to know which function to invoke. In a direct call, that function reference is simply hardcoded, so it invokes the same function every time. In an indirect call, however, the function reference is provided by a runtime value instead. This is a very handy thing to be able to do, because it means the caller doesn’t need to know in advance the full list of functions it might have to call. Because of this, most languages have some version of this. C and C++ have function pointers, Java has class-based polymorphism, and Elm has first-class functions.
A WebAssembly table is an array of functions, each indexed by a 32-bit integer. There’s a special `call_indirect` instruction that takes the index of the function to be called, with a list of arguments, and executes it. The program statically declares which functions are elements of the table, and `call_indirect` only works on those functions. (Incidentally, there’s also a `call` instruction for direct calls, but we won’t be focusing on that too much for now.)
By the way, WebAssembly has this design for safety reasons. If functions were stored in linear memory, it would be possible for code to inspect or corrupt other code, which is not good for web security. But with an indexed function table, that’s impossible. The only instruction that can even access the table is `call_indirect`, which is safe.
If you’re interested in some further reading, I recommend Mozilla’s article on Understanding the Text Format, and the design document on WebAssembly Semantics.
But for now, we already have enough knowledge to discuss how to implement first-class functions.
Representing closures as bytes
As mentioned earlier, to represent an Elm function in WebAssembly we’ll need a function and a data structure. We’ll use the term "closure" to refer to the data structure, and "evaluator function" to refer to the WebAssembly function that will evaluate the body expression and produce a return value.
One way of representing a closure in binary is the following, where each box represents an integer (4 bytes).
fn_index | arity | mem_ptr0 | mem_ptr1 | mem_ptr2 | ...
---|---|---|---|---|---
`fn_index` is an integer index into the function table where the evaluator function for this closure can be found. At runtime, once all of the arguments have been applied to the closure, we can invoke the `call_indirect` instruction to look up the table, call the evaluator function, and return a result.
`arity` is the remaining number of parameters to be applied to the closure. Every time we apply another argument, we insert a pointer to that argument, and decrement the arity. When it reaches zero, we’re ready to call the evaluator function.
`mem_ptr*` are pointers representing the addresses in linear memory of the arguments and closed-over values. They all start off "empty" (zero), and are filled in reverse order as arguments are applied. So if the closure has an arity of 2, then `mem_ptr0` and `mem_ptr1` will be "empty". When we apply the next argument, `mem_ptr1` will be filled with the address of the argument value, and `arity` will be decremented from 2 to 1, with `mem_ptr0` still being empty.
Function application
We’ve already mentioned some of the things that need to happen when a closure is applied to some arguments, but here's the algorithm in full:
- Make a new copy of the closure
- For each applied argument:
  - Let `a` be the remaining arity of the closure
  - Write the address of the argument into the `mem_ptr` at position `a-1`
  - Decrement the arity `a`
- If the remaining arity is greater than 0:
  - Return the new closure
- Otherwise:
  - Use `call_indirect` to execute the function referenced by `fn_index`, passing the closure as its argument
Let's work through an example, applying two arguments to a closure of arity 2.
Here's what the data structure looks like before we apply any arguments. All of the pointers are set to zero (the `null` pointer).
fn_index | arity | mem_ptr0 | mem_ptr1
---|---|---|---
123 | 2 | null | null
Before applying the closure, we need to create a new copy of it, so that the old closure is still available for other code to use. All Elm values are immutable, and the closure is no exception.
Now let's apply an argument, `arg0`. Our algorithm says that for arity 2, we should put the argument address into the `mem_ptr` at position 2-1=1. In other words, `mem_ptr1`. Let's see what that looks like.
fn_index | arity | mem_ptr0 | mem_ptr1
---|---|---|---
123 | 1 | null | arg0
Notice that we're filling the argument pointers in reverse. This is just an efficiency trick. If we filled them in ascending order, we'd need to know how many had already been applied so that we could skip over them and put the next argument in the next free position. That information would have to be stored as an extra field in the closure, taking up extra space.
But if we fill the arguments in reverse, we only need to know the current arity. If the current arity is 2 then the first two positions are free, regardless of whether this is a simple two-parameter function, or a five-parameter function that has already had 3 other arguments applied.
Let's apply one more argument, `arg1`. As before, we'll put the address of the argument into the `mem_ptr` at position `a-1`, which this time is `mem_ptr0`, and decrement the arity.
fn_index | arity | mem_ptr0 | mem_ptr1
---|---|---|---
123 | 0 | arg1 | arg0
Having applied all of the arguments we've got, we check the remaining arity. If it's non-zero, this must be a partial application, and we can just return the closure. But if it’s zero, that means all arguments have been applied. In that case, it's time to call the evaluator function, and return the value it gives us.
Note that the evaluator function takes the closure structure as its only argument. It contains all of the necessary data, because that’s exactly what it was designed for!
Lexical closure
Let’s look again at our example of closing over values from an outer scope.
outerFunc : Int -> (Int -> Int -> Int)
outerFunc closedOver =
let
innerFunc arg1 arg2 =
closedOver + arg1 + arg2
in
innerFunc
To help us think about how to generate WebAssembly for `innerFunc`, let’s first refactor the source code to the equivalent version below.
outerFunc : Int -> (Int -> Int -> Int)
outerFunc closedOver =
let
-- Replace inner function definition with partial application
innerFunc =
transformedInnerFunc closedOver
in
innerFunc
-- Move definition to top level, inserting a new first argument
transformedInnerFunc closedOver arg1 arg2 =
closedOver + arg1 + arg2
Here we’ve moved the definition of the inner function to the top level, and inserted `closedOver` as a new first argument, instead of actually closing over it. This doesn’t make any difference to anyone who calls `outerFunc` - it still creates an `innerFunc` that remembers the value of `closedOver` it was created with.
The big win here is that we no longer have nested function definitions. Instead, they’re all defined at top level. This is useful because we need to put all of our evaluator functions into one global WebAssembly function table. Remember, the table is WebAssembly’s way of supporting indirect function calls. So we’ll need the compiler to do this transformation on all nested function definitions.
Code generation
We’re now ready to look at the steps the compiler needs to take to generate code for an Elm function.
- Generate the body expression, keeping track of all of the local names referenced in the body (we can ignore top-level names).
- From the set of local names, remove the argument names and any names defined in `let` subexpressions. Only the closed-over names will remain.
- Prepend the list of closed-over names to the list of function arguments, to get the argument list for the evaluator function.
- Generate the evaluator function.
- Declare the evaluator function as an element of the function table.
- Insert code into the parent scope that does the following:
  - Create a new closure structure in memory
  - Partially apply the closed-over values from the parent scope
Summary
At the top of the post, we started by noting that one of the interesting challenges in compiling Elm to WebAssembly is how to implement first-class functions.
Elm functions have a lot of advanced features that are not directly available in WebAssembly. They behave like values, they can be partially applied, and they can capture values from outer scopes.
Although WebAssembly doesn’t have these features natively, it does provide the foundations to build them. WebAssembly supports indirect function calls using a function table, allowing us to pass around references to WebAssembly functions in the form of a table index.
We can represent an Elm function using a WebAssembly function and a data structure. We saw what the byte level representation of the data structure could look like. The data structure is what gets passed around the program, keeping track of partially-applied arguments and closed-over values. It also contains the table index of the evaluator function, which is what will eventually produce a return value.
We discussed a way to implement lexical closure. It involves automatically transforming Elm code, flattening nested function definitions so that they can be inserted into the WebAssembly function table. This transformation turns lexical closure into partial function application.
Finally we outlined some of the steps the compiler’s code generator needs to take, and looked at the runtime algorithm for function application.
What’s next?
I’m working on a prototype code generator to prove out these ideas. I’m making reasonable progress, and there don’t appear to be any major blockers, but it needs some more work to get it working. I’ll probably share something more if/when I get that far!
I’ve also got some ideas for more blog posts around the topic of Elm in WebAssembly:
- Byte-level representations of the other Elm data structures (Extensible records, union types, numbers, comparables, appendables...)
- Code generation architecture (WebAssembly AST, Is it reasonable to generate Wasm from Haskell? What about Rust?)
- The Elm runtime in WebAssembly (Platform, Scheduler, Task, Process, Effect Managers...)
- DOM, HTTP, and ports. Differences between Wasm MVP and post-MVP.
- Strings and Unicode
- Tail-Call Elimination with trampolines
Let me know in the comments if you’d like to see any of these!
References
Haskell's implementation of closures
Wikipedia Closure article
Wikipedia Lambda Lifting article
Thanks for reading!
Top comments (6)
Definitely interested in the Elm runtime and use of trampolines for tail-call elimination. Good article. I've not looked into the wasm spec yet, but when thinking about how to implement a language like Elm that presumes the presence of an (albeit small) VM, my old FORTH experiences quickly come to mind. Your closure implementation fits into the concept of FORTH's word dictionary quite nicely. Combine this with an Actor-style concurrency model and it seems like something that could target wasm (and even other platforms) quite nicely - details & devils notwithstanding.
Thanks Ben!
This is very interesting indeed. Looking at your github, it would be interesting to know why you decided to go elm -> C -> WebAssembly?
At first I'd think that the extra layer must give some performance drawbacks?
Hi, glad you liked the post!
I actually started off this project using the direct-to-Wasm approach. I even got a toy example Elm program to compile.
But I quickly decided it wasn't worth it.
C is really not that much of a "layer" on top of assembly performance-wise, but it's massively easier to work with.
All the low level runtime stuff would have to be written in WebAssembly. The first-class function stuff mentioned in this article, plus all the "kernel code" in the core libraries, plus perhaps a Garbage Collector. It's a huge project, and not feasible for one guy's hobby project. It would need a proper team in a big company.
Plus, C compilers produce extremely good assembly code. A lot of smart people have spent 50 years making sure of that!
As far as I know, lots of languages use C as an intermediate language, including Haskell. I think OCaml might do it this way too but I'm not 100% sure.
Having said all that, if you're curious, here's the fork of the Elm compiler where I tried it:
github.com/brian-carroll/elm-compi...
Isn't one point of lambda functions that they can be curried functions with one argument each? So can you simplify the underlying function data structure to that?
Hi Richard thanks for reading. Well I suppose there are different dimensions to "simplicity". The idea about functions having one argument is certainly simple in a mathematical sense. But focusing on implementation, I'm not sure it simplifies any of the important things.
I think you're suggesting a structure more like a linked list, with one argument per cell?
Linked lists are great in some situations but the downside is that they end up spreading their data all over the heap. You have to follow the pointers, loading data from a different address each time. On modern CPUs, this access pattern is extremely slow. One memory access takes about 300 times longer than a typical instruction. I'd much rather use a nice tightly packed structure that can be loaded into cache all in one go.
I suppose the simplicity this idea buys you is getting rid of the integer field for the arity, and making the structure fixed size instead of variable size. But I don't think those things would really get rid of much code. It might actually add code.
Also note that there is only one piece of code that looks at the internals of this structure - the "apply" operator. That handles every function call in every Elm program, so it really needs to be fast. You can see it here if you like.