
Sharp Cells

Posted on • Originally published at sharpcells.com

Experimentation with Optimized Closures


At Sharp Cells, we love functional programming, but we also love writing high-performance code! One feature we love, partial application, is a potential source of performance frustration.

Partial Application?

For those unaware, partial application is a technique where you pass only some of a function's arguments and receive a new function that captures the provided arguments and expects only the remaining ones.

let add a b = a + b

// Call normally
add 1 2 = 3

// Partial application creates a new function that adds 1
let addOne = add 1

// Use the partially applied function
addOne 2 = 3

This technique is made easy in F# by a language feature known as currying and is common in many functional programming languages. But partial application is not unique to functional languages; most programming languages can express the same thing with some additional ceremony:

  • Using JavaScript:
function add(x, y) {
  return x + y;
}

// Partially apply the add function to create a new function that adds 1 to any number
const addOne = add.bind(null, 1);

// Call the new function with one argument
console.log(addOne(2)); // Output: 3

In JavaScript, we can use the bind method to create a new function with one of the arguments applied.
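bind is not the only option: an ordinary arrow function that closes over the applied argument works just as well. Here's a quick sketch of our own for comparison:

```javascript
function add(x, y) {
  return x + y;
}

// An arrow function that closes over the applied argument is an
// equivalent way to partially apply add.
const addOne = (y) => add(1, y);

console.log(addOne(2)); // Output: 3
```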

  • Using C#:
Func<int, int, int> add = (x, y) => x + y;

// Partially apply the add function to create a new function that adds 1 to any number
Func<int, int> addOne = y => add(1, y);

// Call the new function with one argument
Console.WriteLine(addOne(2)); // Output: 3

In C#, we can use lambda expressions to create a new Func delegate that captures the first argument.

So What's the Problem?

The source of the potential performance issue is easier to see in the different languages' implementations. To partially apply a function, we must create a new function that encapsulates the first argument. Normal functions can't hold any state, so we must create a closure: a special type of object that couples the captured state with the function to be called.

Each time we use partial application, we allocate a new closure object. When we no longer need that closure, it must be removed from the heap, creating more work for the garbage collector and slowing down our program.
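To make the cost concrete, here is a small JavaScript sketch (our own illustration, not from the benchmarks below) in which every iteration of the loop allocates a fresh closure via bind just to add a single number:

```javascript
function add(x, y) {
  return x + y;
}

let sum = 0;
for (let i = 0; i <= 100; i++) {
  // bind allocates a brand-new function object on every iteration;
  // each one becomes garbage as soon as it has been invoked.
  const addI = add.bind(null, i);
  sum = addI(sum);
}

console.log(sum); // 5050
```

In a tight loop like this, the allocations themselves are cheap, but the garbage they leave behind is exactly the kind of work we would like to spare the collector.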

The F# Compiler Authors are Smart

Really smart. We made several attempts to demonstrate the cost of creating closures, but each time the performance was equivalent to the fully applied version of the function. That's great, but what's going on?

A Brief Detour into Inlining

When your F# gets compiled, it is transformed in many ways to make your code run as fast as possible. One of these optimizations is inlining: the compiler replaces a call to a function with the body of that function. We can see the effect of inlining and other optimizations by using SharpLab to compile F# and then decompile it to a low-level form of C#:

let add a b = a + b

let accumulate (sum: byref<_>) f = 
    sum <- f sum

type StackClosures() =

    member this.Inlined() = 
        let mutable sum = 0
        for i in 0 .. 10000 do
            sum <- add sum i
        sum

    member this.AlsoInlined() = 
        let mutable sum = 0
        for i in 0 .. 10000 do
            accumulate &sum (add i)
        sum

Optimized then decompiled to C#:

public class StackClosures
{
    public int Inlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num += num2;
            num2++;
        }
        return num;
    }

    public int AlsoInlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num = num2 + num;
            num2++;
        }
        return num;
    }
}

As you can see, the add and accumulate functions are completely eliminated by the compiler, even with the tricky byref parameter. Similarly, the for loop is replaced with an equivalent while loop. Part of the reason F# can make these optimizations is immutability: if a value is immutable, it is always safe to replace a use of it with its definition. This is great for avoiding unnecessary closure allocations.

Defeating the Optimizer

Generally, this should be avoided, but for the sake of demonstration, we need to find a way to defeat the optimizer.

type StackClosures() =

    member this.NormalClosure() = 
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. 10000 do
            x <- i
            accumulate &sum (add x)
        sum

Adding an unnecessary mutable variable x is sufficient to prevent the optimizer from doing its thing. Now, looking at the decompilation, we can see how the closure is made:

public class StackClosures
{
    public int NormalClosure()
    {
        int num = 0;
        int num2 = 0;
        int num3 = 0;
        while (num3 < 10001)
        {
            num2 = num3;
            int a = num2;
            FSharpFunc<int, int> fSharpFunc = new NormalClosure@13(a);
            num = fSharpFunc.Invoke(num);
            num3++;
        }
        return num;
    }
}

internal sealed class NormalClosure@13 : FSharpFunc<int, int>
{
    public int a;

    internal NormalClosure@13(int a)
    {
        this.a = a;
    }

    public override int Invoke(int b)
    {
        return a + b;
    }
}

There's the class we were expecting. Now each time we go through the loop a new instance of NormalClosure@13 is created, invoked, and discarded. Yuck!

A Better Way?

Objects aren't the only way to represent state in the world of .NET. We can also use structs: value types that contain state like an object but can be allocated on the stack instead of the heap, allowing them to be created and destroyed without involving the garbage collector.

We can take inspiration from the implementation of NormalClosure@13 to create a struct version of a closure:

[<Struct>]
type StackFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: 'A -> 'B -> 'C
    }
    member inline this.Invoke(b) = this.F this.A b

Neat! But how do we know if it's faster?

BenchmarkDotNet

BenchmarkDotNet is the gold standard for benchmarking all flavors of .NET code. With this tool, we can test our functions and compare their performance in a rigorous and reproducible way.

Turning our code into benchmark code is quite simple: we just add some attributes to our class and away we go:

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module TestFuncs = 

    let add a b = a + b

    let accumulate (sum: byref<_>) f = 
        sum <- f sum

    let accumulateFunc (sum: byref<_>) (f: StackFunc<_,_,_>) = 
        sum <- f.Invoke(sum)

    let expected =
        function
        | 1 -> 1
        | 10 -> 55
        | 100 -> 5050
        | 10000 -> 50005000
        | x -> failwith $"Undefined for {x}"

open TestFuncs

[<MemoryDiagnoser>]
type StackClosures() =

    [<Params(1, 10, 100, 10000)>]
    member val N = 1 with get, set

    [<Benchmark>]
    member this.Inlined() = 
        let mutable sum = 0
        for i in 0 .. this.N do
            sum <- add sum i
        if (sum <> expected this.N) then failwith "Inlined summed incorrectly"

    [<Benchmark>]
    member this.AlsoInlined() = 
        let mutable sum = 0
        for i in 0 .. this.N do
            accumulate &sum (add i)
        if (sum <> expected this.N) then failwith "AlsoInlined summed incorrectly"

    [<Benchmark>]
    member this.NormalClosure() = 
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. this.N do
            x <- i
            accumulate &sum (add x)
        if (sum <> expected this.N) then failwith "NormalClosure summed incorrectly"

    [<Benchmark>]
    member this.StackFunc() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackFunc.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackFunc summed incorrectly"

BenchmarkRunner.Run<StackClosures>()
|> ignore

Results

| Method        | N     | Mean          | Error       | StdDev      | Gen0    | Allocated |
|---------------|-------|--------------:|------------:|------------:|--------:|----------:|
| Inlined       | 1     | 1.568 ns      | 0.0536 ns   | 0.0550 ns   | -       | -         |
| AlsoInlined   | 1     | 1.935 ns      | 0.0611 ns   | 0.0541 ns   | -       | -         |
| NormalClosure | 1     | 5.956 ns      | 0.1424 ns   | 0.2417 ns   | 0.0029  | 48 B      |
| StackFunc     | 1     | 9.369 ns      | 0.1177 ns   | 0.1101 ns   | -       | -         |
| Inlined       | 10    | 4.108 ns      | 0.0995 ns   | 0.0930 ns   | -       | -         |
| AlsoInlined   | 10    | 11.470 ns     | 0.2381 ns   | 0.2228 ns   | -       | -         |
| NormalClosure | 10    | 26.572 ns     | 0.5540 ns   | 1.2730 ns   | 0.0158  | 264 B     |
| StackFunc     | 10    | 35.152 ns     | 0.7291 ns   | 0.7161 ns   | -       | -         |
| Inlined       | 100   | 28.490 ns     | 0.3842 ns   | 0.3594 ns   | -       | -         |
| AlsoInlined   | 100   | 29.988 ns     | 0.4265 ns   | 0.3989 ns   | -       | -         |
| NormalClosure | 100   | 233.629 ns    | 3.7461 ns   | 3.5041 ns   | 0.1447  | 2424 B    |
| StackFunc     | 100   | 315.870 ns    | 4.5391 ns   | 4.2459 ns   | -       | -         |
| Inlined       | 10000 | 2,224.525 ns  | 30.0005 ns  | 28.0625 ns  | -       | -         |
| AlsoInlined   | 10000 | 2,216.351 ns  | 26.0386 ns  | 24.3565 ns  | -       | -         |
| NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B  |
| StackFunc     | 10000 | 30,848.998 ns | 589.8022 ns | 551.7014 ns | -       | -         |

Looking at the results, we can see that we have successfully avoided allocating the closure, but the code is... slower. This is unexpected and extremely unfortunate. So what's going on?

Let's take another look at the decompilation. Specifically the Invoke method.

public C Invoke(B b)
{
    return FSharpFunc<A, B>.InvokeFast(F@, A@, b);
}

That's not a normal function invocation: InvokeFast has to work out at run time how the underlying function can be called, and that check adds overhead on every invocation. But all is not lost! There's more than one way to represent a function in our struct.

Alternative Struct Representations

The two other possible ways we can represent a function are the Func delegate and the OptimizedClosures.FSharpFunc. Let's create two more structs and rerun the benchmarks.

[<Struct>]
type StackOptFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: OptimizedClosures.FSharpFunc<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

[<Struct>]
type StackDelegate<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: Func<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

module TestFuncs = 
    let accumulateOptFunc (sum: byref<_>) (f: StackOptFunc<_,_,_>) = 
        sum <- f.Invoke(sum)

    let accumulateDelegate (sum: byref<_>) (f: StackDelegate<_,_,_>) = 
        sum <- f.Invoke(sum)

[<MemoryDiagnoser>]
type StackClosures() =

    [<Benchmark>]
    member this.StackOptFunc() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackOptFunc.A = 0
                F = OptimizedClosures.FSharpFunc<_,_,_>.Adapt(add)
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateOptFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackOptFunc summed incorrectly"

    [<Benchmark>]
    member this.StackDelegate() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackDelegate.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateDelegate &sum addOnStack
        if (sum <> expected this.N) then failwith "StackDelegate summed incorrectly"

And the results look great! The OptimizedClosures.FSharpFunc version pulls ahead of the Func version, and both are substantially faster (almost 50%) than the NormalClosure case for large N. The Func delegate does allocate some memory, but as it wraps the original function it only needs to be allocated once. Both are, however, around 5x slower than the fully inlined version, and NormalClosure and StackOptFunc are roughly the same speed for N = 1.

| Method        | N     | Mean          | Error       | StdDev      | Gen0    | Allocated |
|---------------|-------|--------------:|------------:|------------:|--------:|----------:|
| Inlined       | 1     | 1.568 ns      | 0.0536 ns   | 0.0550 ns   | -       | -         |
| NormalClosure | 1     | 5.956 ns      | 0.1424 ns   | 0.2417 ns   | 0.0029  | 48 B      |
| StackOptFunc  | 1     | 5.771 ns      | 0.1231 ns   | 0.1152 ns   | -       | -         |
| StackDelegate | 1     | 11.009 ns     | 0.2426 ns   | 0.4120 ns   | 0.0038  | 64 B      |
| Inlined       | 10    | 4.108 ns      | 0.0995 ns   | 0.0930 ns   | -       | -         |
| NormalClosure | 10    | 26.572 ns     | 0.5540 ns   | 1.2730 ns   | 0.0158  | 264 B     |
| StackOptFunc  | 10    | 16.039 ns     | 0.3389 ns   | 0.2830 ns   | -       | -         |
| StackDelegate | 10    | 25.177 ns     | 0.5126 ns   | 0.5484 ns   | 0.0038  | 64 B      |
| Inlined       | 100   | 28.490 ns     | 0.3842 ns   | 0.3594 ns   | -       | -         |
| NormalClosure | 100   | 233.629 ns    | 3.7461 ns   | 3.5041 ns   | 0.1447  | 2424 B    |
| StackOptFunc  | 100   | 115.131 ns    | 1.3663 ns   | 1.2781 ns   | -       | -         |
| StackDelegate | 100   | 184.305 ns    | 3.0298 ns   | 2.8340 ns   | 0.0038  | 64 B      |
| Inlined       | 10000 | 2,224.525 ns  | 30.0005 ns  | 28.0625 ns  | -       | -         |
| NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B  |
| StackOptFunc  | 10000 | 13,130.112 ns | 206.8013 ns | 193.4421 ns | -       | -         |
| StackDelegate | 10000 | 15,543.613 ns | 169.1367 ns | 158.2106 ns | -       | 64 B      |

Conclusions

We have successfully demonstrated that it is possible to significantly improve the performance of closures for partial application by replacing the standard representation with a struct version. So are we going to replace all closures in our Sharp Cells add-in with StackOptFunc? Not at this stage.

Without a detailed analysis of the compiled output, it's not clear whether the existing closures are being inlined (and therefore faster than the struct version) or allocated (and therefore slower). We hope that an optimization like this will find its way into the compiler so that everyone can enjoy faster closures when they cannot be fully inlined.
