This was originally posted on Dangling Pointers. My goal is to help busy people stay current with recent academic developments. Head there to subscribe for regular summaries of computer science research.
Partial Evaluation, Whole-Program Compilation, by Chris Fallin and Maxwell Bernstein (PLDI'25)
The First Futamura Projection
Language implementations shamelessly violate the DRY (i.e., don't repeat yourself) principle. Take, for instance, a language which supports 32-bit integer expressions. An interpreter (written in C) for this language might contain the semantics of an add expression via a corresponding C expression in the interpreter source code. A compiler for the language might implicitly contain the semantics of such an expression via a compilation step which emits an x86 ADD instruction. Or a compiler (written in Rust) may contain an optimization (constant propagation) which encodes the semantics of an add expression by detecting add expressions with constant operands and replacing those expressions with references to constant literals (thus encoding the add semantics with a Rust addition expression).
The semantics of each operation are spread throughout the language implementation codebase like cafeteria walls after a food fight.
The first Futamura Projection is a solution to this problem: express the language semantics once in an interpreter and then use a tool to partially evaluate the interpreter given a fixed input program.
In case that didn’t make sense, let me say it another way. Say the interpreter is written in C and the language it interprets is Python. Say a user only ever wants to execute one program (PyHearts++), but wants to run it fast. The user runs a tool which specializes the interpreter for PyHearts++. Both the interpreter source code (written in C) and the application {source,byte}code (PyHearts++, written in Python) are given as input to the specializer, and out pops C code for an interpreter which can only execute PyHearts++ (the specialized interpreter no longer needs the PyHearts++ {source,byte}code as input). The specialized interpreter can be compiled with a standard C compiler to produce a binary image; the net effect is equivalent to compiling PyHearts++ to machine code.
One last run at it: the act of
1. partially evaluating (i.e., specializing) an interpreter such that the interpreted {source,byte}code is treated as a compile-time constant, and
2. compiling the specialized interpreter to machine code
is equivalent to compiling the interpreted {source,byte}code to machine code.
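To make this concrete, here is a toy sketch of my own (not the paper's tooling): a tiny stack-machine interpreter, a fixed input program, and the function a partial evaluator could, in principle, reduce the interpreter to once the program is treated as a constant.

#include <stdio.h>

enum { PUSH, ADD, PRINT, HALT };

// The general interpreter: bc is only known at runtime.
void interpret(const int *bc) {
    int stack[64], sp = 0, pc = 0;
    for (;;) {
        switch (bc[pc]) {
        case PUSH:  stack[sp++] = bc[pc + 1]; pc += 2; break;
        case ADD:   sp--; stack[sp - 1] += stack[sp]; pc++; break;
        case PRINT: printf("%d\n", stack[--sp]); pc++; break;
        case HALT:  return;
        }
    }
}

// A fixed input program: push 1, push 2, add, print.
static const int prog[] = { PUSH, 1, PUSH, 2, ADD, PRINT, HALT };

// What a specializer could reduce interpret(prog) to, since every
// bc[pc] lookup is now a compile-time constant: the dispatch loop
// evaporates, leaving only the program's own work.
void interpret_specialized(void) {
    printf("%d\n", 1 + 2);
}

int main(void) {
    interpret(prog);         // interpreted execution
    interpret_specialized(); // "compiled" execution
    return 0;
}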
If only robust partial evaluation utilities were commodities, then languages could be implemented only as interpreters, and partial evaluation could be used to automatically produce compilers.
GraalVM and Truffle enable partial evaluation of Java bytecode, but they require the interpreter to be written in a very specific way, using a special framework.
The mechanisms presented in this paper allow interpreters to be implemented much as they always have been. Partial evaluation support can be added with just a couple of strategically placed calls to intrinsic functions.
Constant Propagation
Constant propagation is typically implemented with dataflow analysis to compute a meet-over-all-paths solution. This allows constant propagation to propagate information across basic blocks. For example, in the following code:
x = 3;
do {
    print(x);
    x = 3;
} while (...);
a compiler can replace print(x) with print(3), because both basic blocks which are predecessors of the start of the loop have x=3 when they jump to the start of the loop.
If you have not seen dataflow analysis before, the way it is typically implemented is with a work list and a mapping from (basic block, variable) pairs to the constant value of the particular variable at the end of the particular basic block. The work list contains basic blocks. Constant propagation starts by inserting the entry point of the function (i.e., the first basic block of the function) onto the work list. The algorithm then loops until the work list is empty. Each loop iteration pops a basic block from the list, determines the values of variables at the start of the block, and then iterates through all operations in the block to determine the values of variables at the end of the block. The values of variables at the start of a block are determined by looking at the values of variables at the end of all predecessor blocks. If x=3 at the end of each predecessor block, then x=3 at the start of the block in question.
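Here is a minimal sketch of that algorithm, of my own making: it hard-codes the do-while example above as a two-block CFG and tracks just the single variable x.

#include <stdio.h>

// Lattice values for x: TOP (no information yet), a constant (>= 0),
// or BOT (provably not a constant). Negative sentinels work here
// because the toy program only uses non-negative constants.
enum { TOP = -1, BOT = -2 };
#define NBLOCKS 2

// Block 0: x = 3;            (entry)
// Block 1: print(x); x = 3;  (loop body; preds are block 0 and itself)
static const int preds[NBLOCKS][2] = { { -1, -1 }, { 0, 1 } };
static const int succs[NBLOCKS][2] = { { 1, -1 }, { 1, -1 } };

static int meet(int a, int b) {
    if (a == TOP) return b;
    if (b == TOP) return a;
    return (a == b) ? a : BOT; // disagreeing predecessors: not a constant
}

int main(void) {
    int in[NBLOCKS]  = { TOP, TOP };
    int out[NBLOCKS] = { TOP, TOP };
    int work[16], n = 0;
    work[n++] = 0; // seed the work list with the entry block
    while (n > 0) {
        int b = work[--n];
        // x at block entry = meet over x at the end of all predecessors
        in[b] = TOP;
        for (int i = 0; i < 2; i++)
            if (preds[b][i] >= 0) in[b] = meet(in[b], out[preds[b][i]]);
        // Transfer function: both blocks end with the assignment x = 3.
        int newout = 3;
        if (newout != out[b]) { // output changed: revisit successors
            out[b] = newout;
            for (int i = 0; i < 2; i++)
                if (succs[b][i] >= 0) work[n++] = succs[b][i];
        }
    }
    // x is 3 on entry to the loop body, so print(x) becomes print(3).
    printf("x on entry to block 1: %d\n", in[1]);
    return 0;
}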
Constant Propagation of an Interpreter Loop
This approach does not work well when applied to a prototypical interpreter outer loop:
void interpret(bytecode *bc) {
    // the contents of `bc` are known at compile time
    int pc = 0;
    while (true) {
        switch (bc[pc]) {
        case ADD: ...; pc++; break;
        }
    }
}
The trouble is that pesky pc++ statement. When constant propagation runs on interpret, all of bc is known to the compiler; however, the value of pc is not. That is because the while loop has multiple predecessors, and those predecessors do not agree on the value of pc.
You might think this could be solved with a #pragma unroll on the while loop. This almost works, except for bytecode operations which represent dynamic branches (e.g., pc = x ? target : pc + 1).
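For example, a handler like the following inside the interpreter's switch (JUMP_IF, x, and target are names I invented for illustration) makes the successor pc depend on runtime data, so no amount of unrolling keeps pc a compile-time constant:

// Inside the interpreter's dispatch switch:
case JUMP_IF:
    // `target` is a constant operand read from bc, but `x` is runtime
    // data, so the unrolled copies cannot pin down the next pc.
    pc = x ? target : pc + 1;
    break;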
Intrinsics
This paper solves the aforementioned problems with the addition of two intrinsics, update_context and specialized_value, along with changes to constant propagation in the compiler/specializer.
All references to basic blocks in constant propagation are updated to reference (basic block, context) pairs. A context is simply an integer which is known at compile time. Fig. 4 shows how the interpreted program counter can be used as the context in the interpreter outer loop:
[Fig. 4. Source: https://dl.acm.org/doi/10.1145/3729259]
In this usage, a (basic block, context) pair is really a (basic block, operation in the interpreted program) pair. This means the body of the interpreter's while loop will be replicated once for each operation in the interpreted program. In each replica, the value of pc will be known by the compiler/specializer.
specialized_value is a hint that indicates that a particular variable in the interpreter can only have a few specific values. For example, say a dynamic branch in the interpreted bytecode can either jump to pc=0 or pc=1. The interpreter can call specialized_value to indicate this information to the compiler/specializer. When the compiler/specializer encounters such a call, it “generates control flow to N blocks, branching at runtime on value, then constant-propagating at compile time in each specialized path.”
I think of the implementation of specialized_value like this:
// Original Code
x = specialized_value(x, 0, 3);
print(x);

// Transformed
switch (x) {
case 0:
    print(0);
    break;
case 1:
    print(1);
    break;
case 2:
    print(2);
    break;
}
Fig. 5 shows examples of how an interpreter can call specialized_value to implement dynamic branching:
[Fig. 5. Source: https://dl.acm.org/doi/10.1145/3729259]
The IR generated by a call like specialized_value(pc, 0, 2) contains conditional branches to one of two basic blocks (in one block pc=0, and in the other block pc=1). Because pc is a compile-time constant in both blocks, constant propagation of pc continues to work, and each expression in the interpreter that looks like bytecode[pc] can be simplified down to a literal bytecode operation.
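Putting the two intrinsics together, a dynamic-branch handler inside the interpreter's switch might look roughly like this (my paraphrase of Fig. 5; JUMP_IF, pop(), and bc_len are stand-ins I made up):

case JUMP_IF:
    pc = pop() ? bc[pc + 1] : pc + 2; // successor depends on runtime data
    // Tell the specializer pc is one of bc_len known values, so it can
    // fork one constant-pc path per possible target...
    pc = specialized_value(pc, 0, bc_len);
    update_context(pc); // ...and continue specialization in each path
    break;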
Results
wevaled and wevaled + state represent the techniques described in this paper:
[Results. Source: https://dl.acm.org/doi/10.1145/3729259]
The optimizations in state involve modifying the interpreter to call intrinsic functions to read and write bytecode operation operands. This causes the operands (which bytecode operations use) to be converted directly to SSA operands when specializing/compiling the interpreter.
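I have not dug into the exact API, but I imagine the shape of the change like this, with read_operand and write_result as invented stand-ins for the real intrinsics:

case ADD: {
    // Operand traffic goes through intrinsics instead of loads and
    // stores to a stack array in memory, so the specializer can turn
    // each operand directly into an SSA value.
    int rhs = read_operand(0); // instead of stack[--sp]
    int lhs = read_operand(1);
    write_result(lhs + rhs);   // instead of stack[sp++] = ...
    pc++;
    break;
}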
Dangling Pointers
This solution is great but feels like one point on a spectrum. It would be interesting to see other language extensions which achieve similar results. For example, I could see this implemented with something like macros/metaprogramming. A compile-time loop could iterate over all operations in the interpreted bytecode and emit interpreter source code for each operation (maybe with something like goto statements linking them together). It would mean more code changes in the interpreter (isolated in a few places), but would still achieve the same goals, without relying on a fast path in constant propagation.
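For the toy program from earlier (PUSH 1; PUSH 2; ADD; PRINT; HALT), the emitted source might look something like this (hand-written here for illustration):

#include <stdio.h>

// One labeled block per bytecode operation, with gotos encoding the
// program's control flow; a macro or code generator would emit this.
void interpret_prog(void) {
    int stack[64], sp = 0;
op0: stack[sp++] = 1;                  goto op2; // PUSH 1
op2: stack[sp++] = 2;                  goto op4; // PUSH 2
op4: sp--; stack[sp - 1] += stack[sp]; goto op5; // ADD
op5: printf("%d\n", stack[--sp]);      goto op6; // PRINT
op6: return;                                     // HALT
}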