
Sharp Cells

Posted on • Originally published at sharpcells.com

Experimentation with Optimized Closures


At Sharp Cells, we love functional programming, but we also love writing high-performance code! One feature we love, partial application, is a potential source of performance frustration.

Partial Application?

For those unaware, partial application is a technique where you pass only some of a function's arguments and receive a new function that captures the provided arguments and expects only the remaining ones.

let add a b = a + b

// Call normally
add 1 2 = 3

// Partial application creates a new function that adds 1
let addOne = add 1

// Use the partially applied function
addOne 2 = 3

This technique is made easy in F# by a language feature known as currying and is common in many functional programming languages. But partial application is not unique to functional languages; most programming languages can express the same thing with some additional ceremony:

  • Using JavaScript:
function add(x, y) {
  return x + y;
}

// Partially apply the add function to create a new function that adds 1 to any number
const addOne = add.bind(null, 1);

// Call the new function with one argument
console.log(addOne(2)); // Output: 3

In JavaScript, we can use the bind method to create a new function with one of the arguments applied.
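bind is not the only option: an ordinary arrow function that closes over the applied argument works just as well. Here's a quick sketch of our own for comparison:

```javascript
function add(x, y) {
  return x + y;
}

// An arrow function that closes over the applied argument is an
// equivalent way to partially apply add.
const addOne = (y) => add(1, y);

console.log(addOne(2)); // Output: 3
```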

  • Using C#:
Func<int, int, int> add = (x, y) => x + y;

// Partially apply the add function to create a new function that adds 1 to any number
Func<int, int> addOne = y => add(1, y);

// Call the new function with one argument
Console.WriteLine(addOne(2)); // Output: 3

In C#, we can use lambda expressions to create a new Func delegate that captures the first argument.

So What's the Problem?

The source of the potential performance issue is easier to see in the different languages' implementations. To partially apply a function, we must create a new function that encapsulates the first argument. Normal functions can't hold any state, so we must create a closure: a special type of object that couples the captured state with the function to be called.

Each time we use partial application, we allocate a new closure object. When we no longer need that closure, it must be removed from the heap, creating more work for the garbage collector and slowing down our program.
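To make the cost concrete, here is a small JavaScript sketch (our own illustration, not from the benchmarks below) in which every iteration of the loop allocates a fresh closure via bind just to add a single number:

```javascript
function add(x, y) {
  return x + y;
}

let sum = 0;
for (let i = 0; i <= 100; i++) {
  // bind allocates a brand-new function object on every iteration;
  // each one becomes garbage as soon as it has been invoked.
  const addI = add.bind(null, i);
  sum = addI(sum);
}

console.log(sum); // 5050
```

In a tight loop like this, the allocations themselves are cheap, but the garbage they leave behind is exactly the kind of work we would like to spare the collector.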

The F# Compiler Authors are Smart

Really smart. We made several attempts to demonstrate the cost of creating closures, but each time the performance was equivalent to the fully applied version of the function. That's great, but what's going on?

A Brief Detour into Inlining

When your F# gets compiled, it is transformed in many ways to make your code run as fast as possible. One of these optimizations is inlining: the compiler replaces a call to a function with the body of that function. We can see the effect of inlining and other optimizations by using SharpLab to compile F# and then decompile it to a low-level form of C#:

let add a b = a + b

let accumulate (sum: byref<_>) f = 
    sum <- f sum

type StackClosures() =

    member this.Inlined() = 
        let mutable sum = 0
        for i in 0 .. 10000 do
            sum <- add sum i
        sum

    member this.AlsoInlined() = 
        let mutable sum = 0
        for i in 0 .. 10000 do
            accumulate &sum (add i)
        sum

Optimized then decompiled to C#:

public class StackClosures
{
    public int Inlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num += num2;
            num2++;
        }
        return num;
    }

    public int AlsoInlined()
    {
        int num = 0;
        int num2 = 0;
        while (num2 < 10001)
        {
            num = num2 + num;
            num2++;
        }
        return num;
    }
}

As you can see, the add and accumulate functions are completely eliminated by the compiler, even with the tricky byref parameter. Similarly, the for loop is replaced with an equivalent while loop. Part of the reason F# can make these optimizations is immutability: if a value is immutable, it is always safe to replace a use of it with its definition. This is great for avoiding unnecessary closure allocations.

Defeating the Optimizer

Generally, this should be avoided, but for the sake of demonstration, we need to find a way to defeat the optimizer.

type StackClosures() =

    member this.NormalClosure() = 
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. 10000 do
            x <- i
            accumulate &sum (add x)
        sum

Adding an unnecessary mutable variable x is sufficient to prevent the optimizer from doing its thing. Now, looking at the decompilation, we can see how the closure is made:

public class StackClosures
{
    public int NormalClosure()
    {
        int num = 0;
        int num2 = 0;
        int num3 = 0;
        while (num3 < 10001)
        {
            num2 = num3;
            int a = num2;
            FSharpFunc<int, int> fSharpFunc = new NormalClosure@13(a);
            num = fSharpFunc.Invoke(num);
            num3++;
        }
        return num;
    }
}

internal sealed class NormalClosure@13 : FSharpFunc<int, int>
{
    public int a;

    internal NormalClosure@13(int a)
    {
        this.a = a;
    }

    public override int Invoke(int b)
    {
        return a + b;
    }
}

There's the class we were expecting. Now each time we go through the loop a new instance of NormalClosure@13 is created, invoked, and discarded. Yuck!

A Better Way?

Objects aren't the only way to represent state in the world of .NET. We can also use structs: value types that contain state like an object but can be allocated on the stack instead of the heap, allowing them to be created and destroyed without involving the garbage collector.

We can take inspiration from the implementation of NormalClosure@13 to create a struct version of a closure:

[<Struct>]
type StackFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: 'A -> 'B -> 'C
    }
    member inline this.Invoke(b) = this.F this.A b

Neat! But how do we know if it's faster?

BenchmarkDotNet

BenchmarkDotNet is the gold standard for benchmarking all flavors of .NET code. With this tool, we can test our functions and compare their performance in a rigorous and reproducible way.

Turning our code into benchmark code is quite simple: we just add some attributes to our class and away we go:

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module TestFuncs = 

    let add a b = a + b

    let accumulate (sum: byref<_>) f = 
        sum <- f sum

    let accumulateFunc (sum: byref<_>) (f: StackFunc<_,_,_>) = 
        sum <- f.Invoke(sum)

    let expected =
        function
        | 1 -> 1
        | 10 -> 55
        | 100 -> 5050
        | 10000 -> 50005000
        | x -> failwith $"Undefined for {x}"

open TestFuncs

[<MemoryDiagnoser>]
type StackClosures() =

    [<Params(1, 10, 100, 10000)>]
    member val N = 1 with get, set

    [<Benchmark>]
    member this.Inlined() = 
        let mutable sum = 0
        for i in 0 .. this.N do
            sum <- add sum i
        if (sum <> expected this.N) then failwith "Inlined summed incorrectly"

    [<Benchmark>]
    member this.AlsoInlined() = 
        let mutable sum = 0
        for i in 0 .. this.N do
            accumulate &sum (add i)
        if (sum <> expected this.N) then failwith "AlsoInlined summed incorrectly"

    [<Benchmark>]
    member this.NormalClosure() = 
        let mutable sum = 0
        let mutable x = 0
        for i in 0 .. this.N do
            x <- i
            accumulate &sum (add x)
        if (sum <> expected this.N) then failwith "NormalClosure summed incorrectly"

    [<Benchmark>]
    member this.StackFunc() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackFunc.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackFunc summed incorrectly"

BenchmarkRunner.Run<StackClosures>()
|> ignore

Results

| Method        | N     | Mean          | Error       | StdDev      | Gen0    | Allocated |
|---------------|-------|--------------:|------------:|------------:|--------:|----------:|
| Inlined       | 1     | 1.568 ns      | 0.0536 ns   | 0.0550 ns   | -       | -         |
| AlsoInlined   | 1     | 1.935 ns      | 0.0611 ns   | 0.0541 ns   | -       | -         |
| NormalClosure | 1     | 5.956 ns      | 0.1424 ns   | 0.2417 ns   | 0.0029  | 48 B      |
| StackFunc     | 1     | 9.369 ns      | 0.1177 ns   | 0.1101 ns   | -       | -         |
| Inlined       | 10    | 4.108 ns      | 0.0995 ns   | 0.0930 ns   | -       | -         |
| AlsoInlined   | 10    | 11.470 ns     | 0.2381 ns   | 0.2228 ns   | -       | -         |
| NormalClosure | 10    | 26.572 ns     | 0.5540 ns   | 1.2730 ns   | 0.0158  | 264 B     |
| StackFunc     | 10    | 35.152 ns     | 0.7291 ns   | 0.7161 ns   | -       | -         |
| Inlined       | 100   | 28.490 ns     | 0.3842 ns   | 0.3594 ns   | -       | -         |
| AlsoInlined   | 100   | 29.988 ns     | 0.4265 ns   | 0.3989 ns   | -       | -         |
| NormalClosure | 100   | 233.629 ns    | 3.7461 ns   | 3.5041 ns   | 0.1447  | 2424 B    |
| StackFunc     | 100   | 315.870 ns    | 4.5391 ns   | 4.2459 ns   | -       | -         |
| Inlined       | 10000 | 2,224.525 ns  | 30.0005 ns  | 28.0625 ns  | -       | -         |
| AlsoInlined   | 10000 | 2,216.351 ns  | 26.0386 ns  | 24.3565 ns  | -       | -         |
| NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B  |
| StackFunc     | 10000 | 30,848.998 ns | 589.8022 ns | 551.7014 ns | -       | -         |

Looking at the results, we can see that we have successfully avoided allocating the closure, but the code is... slower. This is unexpected and extremely unfortunate. So what's going on?

Let's take another look at the decompilation. Specifically the Invoke method.

public C Invoke(B b)
{
    return FSharpFunc<A, B>.InvokeFast(F@, A@, b);
}

That's not a normal function invocation: InvokeFast has to work out at run time how the underlying function can be called, and that check adds overhead on every invocation. But all is not lost! There's more than one way to represent a function in our struct.

Alternative Struct Representations

The two other possible ways we can represent a function are the Func delegate and the OptimizedClosures.FSharpFunc. Let's create two more structs and rerun the benchmarks.

[<Struct>]
type StackOptFunc<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: OptimizedClosures.FSharpFunc<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

[<Struct>]
type StackDelegate<'A, 'B, 'C> =
    {
        mutable A: 'A
        F: Func<'A, 'B, 'C>
    }
    member inline this.Invoke(b) = this.F.Invoke(this.A, b)

module TestFuncs = 
    let accumulateOptFunc (sum: byref<_>) (f: StackOptFunc<_,_,_>) = 
        sum <- f.Invoke(sum)

    let accumulateDelegate (sum: byref<_>) (f: StackDelegate<_,_,_>) = 
        sum <- f.Invoke(sum)

[<MemoryDiagnoser>]
type StackClosures() =

    [<Benchmark>]
    member this.StackOptFunc() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackOptFunc.A = 0
                F = OptimizedClosures.FSharpFunc<_,_,_>.Adapt(add)
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateOptFunc &sum addOnStack
        if (sum <> expected this.N) then failwith "StackOptFunc summed incorrectly"

    [<Benchmark>]
    member this.StackDelegate() = 
        let mutable sum = 0
        let mutable addOnStack = 
            {
                StackDelegate.A = 0
                F = add
            }
        for i in 0 .. this.N do
            addOnStack.A <- i
            accumulateDelegate &sum addOnStack
        if (sum <> expected this.N) then failwith "StackDelegate summed incorrectly"

And the results look great! The OptimizedClosures.FSharpFunc version pulls ahead of the Func version, and both are substantially faster (almost 50%) than the NormalClosure case for large N. The Func delegate does allocate some memory, but as it wraps the original function it only needs to be allocated once. Both are, however, around 5x slower than the fully inlined version, and NormalClosure and StackOptFunc are roughly the same speed for N = 1.

| Method        | N     | Mean          | Error       | StdDev      | Gen0    | Allocated |
|---------------|-------|--------------:|------------:|------------:|--------:|----------:|
| Inlined       | 1     | 1.568 ns      | 0.0536 ns   | 0.0550 ns   | -       | -         |
| NormalClosure | 1     | 5.956 ns      | 0.1424 ns   | 0.2417 ns   | 0.0029  | 48 B      |
| StackOptFunc  | 1     | 5.771 ns      | 0.1231 ns   | 0.1152 ns   | -       | -         |
| StackDelegate | 1     | 11.009 ns     | 0.2426 ns   | 0.4120 ns   | 0.0038  | 64 B      |
| Inlined       | 10    | 4.108 ns      | 0.0995 ns   | 0.0930 ns   | -       | -         |
| NormalClosure | 10    | 26.572 ns     | 0.5540 ns   | 1.2730 ns   | 0.0158  | 264 B     |
| StackOptFunc  | 10    | 16.039 ns     | 0.3389 ns   | 0.2830 ns   | -       | -         |
| StackDelegate | 10    | 25.177 ns     | 0.5126 ns   | 0.5484 ns   | 0.0038  | 64 B      |
| Inlined       | 100   | 28.490 ns     | 0.3842 ns   | 0.3594 ns   | -       | -         |
| NormalClosure | 100   | 233.629 ns    | 3.7461 ns   | 3.5041 ns   | 0.1447  | 2424 B    |
| StackOptFunc  | 100   | 115.131 ns    | 1.3663 ns   | 1.2781 ns   | -       | -         |
| StackDelegate | 100   | 184.305 ns    | 3.0298 ns   | 2.8340 ns   | 0.0038  | 64 B      |
| Inlined       | 10000 | 2,224.525 ns  | 30.0005 ns  | 28.0625 ns  | -       | -         |
| NormalClosure | 10000 | 21,989.143 ns | 273.0534 ns | 242.0547 ns | 14.3433 | 240024 B  |
| StackOptFunc  | 10000 | 13,130.112 ns | 206.8013 ns | 193.4421 ns | -       | -         |
| StackDelegate | 10000 | 15,543.613 ns | 169.1367 ns | 158.2106 ns | -       | 64 B      |

Conclusions

We have successfully demonstrated that it is possible to significantly improve the performance of closures for partial application by replacing the standard representation with a struct version. So are we going to replace all closures in our Sharp Cells add-in with StackOptFunc? Not at this stage.

Without a detailed analysis of the compiled output, it's not clear whether the existing closures are being inlined (and therefore faster than the struct version) or allocated (and therefore slower). We hope that an optimization like this will find its way into the compiler so that everyone can enjoy faster closures when they cannot be fully inlined.
