Danial Jumagaliyev
C# Performance Optimization: Using Span<T> and stackalloc to Eliminate Allocations

Runtime memory handling is a crucial part of writing software and heavily impacts the performance of your program. Every language imposes this layer of responsibility; what differs is how it is divided between the developer and the language.

In C/C++, memory allocation is entirely under the developer's control. In managed languages like JavaScript, Python, Go, and Java, developers rarely need to worry about where an array is stored in memory; it's the duty of the language's runtime to decide when to deallocate data that is no longer used. This is where garbage collection comes into play.

C# (on the .NET platform) relies on garbage collection (GC) to manage dynamic (heap-allocated) resources. The process is straightforward: the garbage collector automatically identifies unused memory (garbage) and frees it for reuse, efficiently recycling resources.

This article explores how stackalloc and Span<T> eliminate garbage collection pressure in performance-critical C# code, using real optimization examples from Dan.Net, my Unity networking library. By the end, you'll understand when and how to apply these techniques to reduce allocations by orders of magnitude in hot code paths.

Memory Allocation in C#

By default, instances of all reference types (classes, arrays, strings, delegates) are allocated on the heap. The references to heap objects, however, can live on the stack or the heap, while value types stay on the stack (or inline inside their containing object) unless boxed (converted to object):

| Location                 | Storage | GC Impact             |
|--------------------------|---------|-----------------------|
| Local reference variable | Stack   | None (reference only) |
| Class field reference    | Heap    | Collected with object |
| Local value type         | Stack   | None unless boxed     |
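To make the boxing caveat concrete, here's a minimal sketch: assigning a value type to an object variable copies it to the heap, where it becomes the garbage collector's problem.

```csharp
using System;

class BoxingDemo
{
    static void Main()
    {
        int n = 42;               // value type: lives on the stack here
        object boxed = n;         // boxing: a heap copy is allocated
        int unboxed = (int)boxed; // unboxing: copies the value back out

        Console.WriteLine(unboxed);         // prints 42
        Console.WriteLine(boxed.GetType()); // System.Int32, but heap-resident
    }
}
```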

On one hand, this is amazing. As the developer, you focus more on writing core logic and you get to easily structure your code in meaningful ways; you don't have to care about ensuring runtime safety and clean up dynamic memory resources after use - this job is handled automatically by the garbage collector, which is part of the .NET CLR (Common Language Runtime).

However, this could easily become a recipe for disaster in terms of performance. A developer who isn't familiar with the .NET ecosystem is likely to write code that may look good on the surface - but could be causing unnecessary memory allocations underneath the hood, reducing execution speed as a result.

I, personally, used to make this mistake often when working with .NET in my first few years of building Unity projects. I would receive performance feedback that I was using too many LINQ (Language Integrated Query) statements, which can cause excessive garbage collection, especially in the older .NET versions. In those days, I barely thought about the importance of memory management.

For instance, I would write code like this in a hot path that executed every frame:

// ❌ Allocates multiple enumerators and intermediate collections
var activeEnemies = enemies
    .Where(e => e.health > 0)
    .OrderBy(e => e.distanceToPlayer)
    .Take(5)
    .ToList();

Each LINQ operator allocates an enumerator (plus closures for the lambdas), and ToList() allocates a new list. At 60 frames per second, that chain produces hundreds of allocations per second, triggering frequent garbage collections that cause frame drops.
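One allocation-free alternative is to reuse preallocated buffers and a cached comparer, so the per-frame hot path allocates nothing. This is only a sketch; Enemy, Health, and DistanceToPlayer are hypothetical stand-ins for the game's own types:

```csharp
using System;
using System.Collections.Generic;

struct Enemy
{
    public float Health;
    public float DistanceToPlayer;
}

// Cached comparer instance: allocated once, reused every frame.
sealed class DistanceComparer : IComparer<Enemy>
{
    public static readonly DistanceComparer Instance = new DistanceComparer();
    public int Compare(Enemy a, Enemy b) =>
        a.DistanceToPlayer.CompareTo(b.DistanceToPlayer);
}

static class EnemyQuery
{
    // Scratch buffer reused across frames: no per-call heap allocation.
    private static readonly Enemy[] _alive = new Enemy[256];

    // Writes up to results.Length nearest living enemies into `results`
    // and returns how many were written.
    public static int NearestAlive(ReadOnlySpan<Enemy> enemies, Span<Enemy> results)
    {
        int aliveCount = 0;
        foreach (var e in enemies)
            if (e.Health > 0 && aliveCount < _alive.Length)
                _alive[aliveCount++] = e;

        Array.Sort(_alive, 0, aliveCount, DistanceComparer.Instance);

        int n = Math.Min(results.Length, aliveCount);
        _alive.AsSpan(0, n).CopyTo(results);
        return n;
    }
}
```

The trade-off is statefulness: the scratch buffer makes this not thread-safe, which is usually fine for a per-frame game loop.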


Motivation

I was recently enjoying my weekend reviewing my personal projects when I decided to look into Dan.Net again. It's my own networking library for Unity, operating over HTTP/WebSockets and prioritizing ease of development and cross-platform support. I have been experimenting with it for quite a while now, and made a proper release earlier this year on GitHub.

Shortly after its release, I had a friend review it for me and they gave me some honest feedback. There's more to the story that deserves its own article. But in short, I chose to optimize - not out of necessity, but passion. It was working, but not fast enough. The drive for optimizations truly emerged within me, echoing Linus Torvalds' ethos.

With AI producing more code every day, I realized that mastering the fundamentals is key for programmers to truly understand and oversee the systems we build. And this is exactly why I am writing this blog!

After implementing the optimizations described in this article, Dan.Net's per-frame allocation during active network synchronization dropped significantly, from ~2.5KB to under 200 bytes - a 92% reduction. This eliminated GC stalls during 20Hz network updates, maintaining stable frame times even with 50+ networked objects.


About Memory

Stack Basics

The stack holds short-lived data like local variables and method parameters in a LIFO (Last-In, First-Out) structure, with automatic allocation/deallocation on method entry/exit. Access is fast due to its contiguous layout and CPU cache affinity, not because it's "in the cache" - both stack and heap reside in RAM. Value types (structs, primitives) live here unless boxed or embedded in heap objects.

The stack has a limited size - typically 1MB per thread on Windows and 8MB on Linux/macOS by default (these are operating system defaults, not Unity-specific). Unity's stack behavior inherits from the underlying .NET runtime and platform. Exceeding this capacity results in a StackOverflowException.
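If a worker thread genuinely needs more stack room (deep recursion, large stackalloc buffers), .NET lets you request a bigger stack when creating the thread. A small sketch; the 16 MB figure is just an illustrative request that the OS may round:

```csharp
using System;
using System.Threading;

class StackSizeDemo
{
    static void Main()
    {
        // maxStackSize is a request, not a guarantee; 0 means "platform default".
        var worker = new Thread(DeepWork, maxStackSize: 16 * 1024 * 1024);
        worker.Start();
        worker.Join();
    }

    static void DeepWork()
    {
        // Comfortable with the larger stack; riskier on a small default one.
        Span<byte> scratch = stackalloc byte[64 * 1024];
        scratch[0] = 1;
        Console.WriteLine($"scratch buffer: {scratch.Length} bytes");
    }
}
```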

Heap Basics

The heap manages dynamic, long-lived objects via garbage collection, allowing flexible sizing but with slower access due to indirection and GC pauses (periods when unreferenced memory is cleaned up by the garbage collector). It grows as needed (system-limited), unlike the stack's fixed per-thread quota. As mentioned, this is where reference types (classes, arrays, etc.) reside.

Stack Limitations

To make it more clear, let's represent stack memory's LIFO behavior visually:

Stack (capacity: 6)
(empty)
(empty, next insertion goes here)
4 (first to be out, last in)
3
2
1 (last to be out, first in)

After two more insertions, this stack would become full. Assuming no deallocation occurs - the next allocation attempt would lead to a stack overflow. Real stack memory in C# manages method call frames, local variables, and return addresses in this LIFO manner, with each method invocation "pushing" a new frame and "popping" it off when returning.

C# History Lesson

C# provides stackalloc for explicit stack allocation of temporary buffers. Over time, its usability and safety have improved significantly with each major C# release, especially since C# 7.2.

The Danger Zone

Initially, stackalloc could only be used in conjunction with the unsafe modifier. Methods can be defined in an unsafe context, which tells the C# compiler "hey, we're about to do something dangerous here":

using System;

class Program
{
    static unsafe void Main(string[] args)
    {
        int length = 5;
        // Allocate a block of memory for 5 integers on the stack
        int* numbers = stackalloc int[length];

        // Use pointer arithmetic to access and populate the memory block
        for (int i = 0; i < length; i++)
        {
            *(numbers + i) = i * 10; // = numbers[i]
        }

        // Read the values back using array-like pointer syntax
        for (int i = 0; i < length; i++)
        {
            // Pointer element access operator p[n] is evaluated as *(p + n)
            Console.WriteLine($"numbers[{i}]: {numbers[i]}");
        }

        // Memory is automatically discarded when the method returns
    }
}

To run this code for yourself, you must modify the .csproj file in your C# project to include <AllowUnsafeBlocks>true</AllowUnsafeBlocks> within a <PropertyGroup>. Here's the expected output:

numbers[0]: 0
numbers[1]: 10
numbers[2]: 20
numbers[3]: 30
numbers[4]: 40

As you can see, stackalloc used to produce only raw pointers, which made it primarily a low-level, performance-focused feature for advanced developers requiring fine control over memory, similar to variable-length arrays or the alloca() function in C/C++.

If the above code seems confusing, I highly recommend brushing up on pointers and arrays in C, as C# borrows the same logic from there.

When the stack was truly stacking

Only when C# 7.2 shipped did stackalloc become usable outside unsafe contexts, and only if the result is immediately assigned to a Span<T> or ReadOnlySpan<T>.

// C# 7.2: spans as safe "views" over existing memory
int[] dataArray = new int[5] {1, 2, 3, 4, 5};
Span<int> dataSpan = dataArray;
dataSpan[0] = 100; // dataArray[0] is modified

// dataArray is now {100, 2, 3, 4, 5}

// C# 7.2: stackalloc without unsafe, as long as it goes straight into a span
Span<int> stackSpan = stackalloc int[5];

You can think of spans as "views" that allow access to memory regions like arrays, strings, or stack buffers, making them very lightweight, drastically reducing garbage collection pressure and improving speed in data-intensive code.
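For example, on modern .NET you can slice a string and parse the slice without ever allocating a substring (int.Parse has had span overloads since .NET Core 2.1):

```csharp
using System;

class SpanViews
{
    static void Main()
    {
        string csv = "10,20,30";

        // Views into the string's memory: no copying, no heap allocation.
        ReadOnlySpan<char> all = csv.AsSpan();
        ReadOnlySpan<char> second = all.Slice(3, 2);

        Console.WriteLine(second.ToString()); // "20" (this ToString does allocate)
        Console.WriteLine(int.Parse(second)); // parses directly from the view
    }
}
```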

Benchmarking Pitfalls: The JIT Optimizer

Before diving into span performance comparisons, it's critical to understand a common benchmarking problem. Modern JIT (Just-in-time) compilers are extremely aggressive about eliminating "dead code" - code whose results are never used. When benchmarking span operations, if you don't actually consume the results, the compiler may eliminate the entire operation, giving you misleading 0.0000 ns execution times.

I encountered this exact issue when benchmarking the span optimizations. The solution is using BenchmarkDotNet's Consumer class, which signals to the compiler that values are being observed, preventing optimization away. Always verify your benchmarks produce realistic non-zero timings before trusting the results.
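A minimal sketch of the pattern (BenchmarkDotNet's Consumer lives in the BenchmarkDotNet.Engines namespace; the benchmark body mirrors the span method below):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

public class ConsumedSpanBenchmark
{
    private readonly Consumer _consumer = new Consumer();
    private readonly int[] _scores = new int[1000];

    [Benchmark]
    public void Top3_Span_Consumed()
    {
        var top3 = _scores.AsSpan().Slice(_scores.Length - 3, 3);

        // Feeding each element to the Consumer marks the result as observed,
        // so the JIT can't discard the slice as dead code.
        for (int i = 0; i < top3.Length; i++)
            _consumer.Consume(top3[i]);
    }
}
```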

I wrote two methods that serve the same purpose: retrieving the top 3 scores from an integer scores array. I benchmarked them using BenchmarkDotNet:

...
// class SpanBenchmark

private const int length = 1000;
private int[] scores = new int[length];

// Constructor to initialize scores array with values
public SpanBenchmark()
{
    for (int i = 0; i < length; i++)
    {
        scores[i] = i;
    }
}

[Benchmark]
public int[] GetTop3Scores_Copy()
{
    // ❌ Traditional: Copy top 3 scores = heap allocation every time!
    int[] top3Copy = new int[3];
    Array.Copy(scores, length - 3, top3Copy, 0, 3);
    return top3Copy;
}

[Benchmark]
public Span<int> GetTop3Scores_Span()
{
    // ✅ C# 7.2: Slice original array = ZERO heap allocations
    Span<int> allScores = scores.AsSpan();
    Span<int> top3 = allScores.Slice(length - 3, 3);
    return top3;
}
...

Results:

BenchmarkDotNet v0.15.8, macOS Tahoe 26.1 (25B78) [Darwin 25.1.0]
Apple M5, 1 CPU, 10 logical and 10 physical cores
.NET SDK 9.0.306
  [Host]     : .NET 9.0.10 (9.0.10, 9.0.1025.47515), Arm64 RyuJIT armv8.0-a
...
| Method  | Mean      | Error     | StdDev    | Gen0   | Allocated |
|---------|-----------|-----------|-----------|--------|-----------|
| ...Copy | 4.0631 ns | 0.0274 ns | 0.0229 ns | 0.0048 |      40 B |
| ...Span | 0.2082 ns | 0.0202 ns | 0.0189 ns |      - |         - |

We can clearly see that the Span<T> approach allocates no heap memory at all: the span is just a lightweight view over the existing array, so no copy is ever made. The Gen0 column shows Generation 0 garbage collections per 1000 operations; the copy method creates GC pressure while the span approach has zero impact.

Span<T> (a ref struct, not a class) comes with a powerful method, Slice, which is the reason the span version is faster in the first place: it returns a view into the existing memory instead of a copy. See the official Slice documentation for the full specification.

We can be safe... and go faster?

Our spans are never modified, so they can be typed as ReadOnlySpan<T>. The main purpose of ReadOnlySpan<T> isn't performance; it's communicating intent: through this view, the data is immutable.

However, after duplicating the span method and changing the type to ReadOnlySpan<T>, I noticed roughly a 20% reduction in execution time, as evidenced by the benchmark results:

| Method          | Mean      | Error     | StdDev    | Allocated |
|-----------------|-----------|-----------|-----------|-----------|
| ...Span         | 0.1880 ns | 0.0266 ns | 0.0273 ns |         - |
| ...ReadOnlySpan | 0.1460 ns | 0.0142 ns | 0.0126 ns |         - |

The improvement from ReadOnlySpan<T> likely comes from optimizations the compiler and JIT can apply around read-only access, such as eliminating bounds checks and inlining more aggressively. Bear in mind, though, that sub-nanosecond measurements like these sit close to the noise floor, so treat the exact percentage with caution.

Execution Time Comparison (lower is better)

Array.Copy      █████████████████████████ 4.06 ns
Span<T>         ██ 0.19 ns
ReadOnlySpan<T> █ 0.15 ns
                |
                0     1     2     3     4 (nanoseconds)

Now, this might not seem like a practical example, but imagine a similar case with significantly more function calls and larger arrays. Would it make sense to allocate extra heap resources where avoidable? We'll dive into more use cases later, drawn from my own experience.


Syntax Improvements

Moving on, C# 7.3 added support for array initializer syntax with stackalloc:

// C# 7.2 and before:
Span<byte> arr = stackalloc byte[3];
arr[0] = 67;
arr[1] = 68;
arr[2] = 69;

// C# 7.3 and after:
Span<byte> arr = stackalloc byte[] {67, 68, 69};

This significantly reduced the verbosity and error-proneness of low-level buffer initialization.

It only got better from here though, as C# 8 introduced stackalloc support in more nested expression contexts, such as inside conditionals or as part of larger expressions:

int length = 1000;
Span<byte> buffer = length <= 1024
    ? stackalloc byte[length]
    : new byte[length];

C# 12 introduced collection expressions, a concise syntax that often compiles down to stackalloc for small spans, eliminating the explicit keyword entirely:

// C# 12: Compiles to stackalloc for small spans (zero heap!)
Span<int> arr = [1, 2, 3]; // Automatic stack allocation

// Same as:
Span<int> arr2 = stackalloc int[3] {1, 2, 3};  // Verbose equivalent

Applying the Knowledge

Now back to Dan.Net and where the memory optimizations came in. After a first pass of enhancing the library, I made it operate fully on binary data. Events and streaming data were now compressed and serialized into a special binary format before being sent to the server, and incoming data was, in turn, deserialized and decompressed for use on the client.

These processes involved heavy byte allocation and bit manipulation in the name of speed, but at first I set the memory overhead aside and focused on implementing the new binary protocol. Only on a second pass did I decide to tackle memory optimization, which meant cutting heap allocations.

Let's take a look at some of those examples:

1. Range Calculations

Below is a code snippet from my Quaternion compression logic, executed every time a networked object's rotation is about to be sent:

float[] components = new float[4] { q.x, q.y, q.z, q.w };
int largestIndex = 0;
float largestValue = Math.Abs(components[0]);

for (int i = 1; i < 4; i++)
{
    float abs = Math.Abs(components[i]);
    if (abs > largestValue)
    {
        largestValue = abs;
        largestIndex = i;
    }
}

q here is a Quaternion struct parameter, which contains the data of the current rotation of a networked game object. Quaternions represent 3D rotations as four floats, but can be transmitted in just 7 bytes by using the "smallest-three" compression algorithm - since quaternions are unit-length (x² + y² + z² + w² = 1), we can omit the largest component and reconstruct it from the other three, saving 9 bytes per rotation.
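A hedged sketch of how such packing can work. The exact wire layout here (one index byte plus three quantized int16s, totaling 7 bytes) is an illustrative assumption, not necessarily Dan.Net's actual format:

```csharp
using System;

static class QuaternionPacker
{
    // Once the largest component is dropped, the remaining three lie in
    // [-1/sqrt(2), 1/sqrt(2)], so we scale by sqrt(2) before quantizing.
    const float Scale = 1.41421356f;

    public static void Pack(float x, float y, float z, float w, Span<byte> dest)
    {
        Span<float> c = stackalloc float[4] { x, y, z, w };

        int largest = 0;
        for (int i = 1; i < 4; i++)
            if (Math.Abs(c[i]) > Math.Abs(c[largest])) largest = i;

        // Flip signs so the dropped component is positive; the receiver can
        // then rebuild it as +sqrt(1 - a^2 - b^2 - c^2).
        float sign = c[largest] < 0 ? -1f : 1f;

        dest[0] = (byte)largest;
        int offset = 1;
        for (int i = 0; i < 4; i++)
        {
            if (i == largest) continue;
            short q = (short)(c[i] * sign * Scale * short.MaxValue);
            BitConverter.TryWriteBytes(dest.Slice(offset, 2), q);
            offset += 2;
        }
    }
}
```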

The components array is being allocated on the heap. It's only used locally for performing a simple calculation as shown above. So it's safe to convert the array into a stack-allocated span:

Span<float> components = stackalloc float[4] { q.x, q.y, q.z, q.w };

It might seem insignificant, but this function runs at a frequency of 20 Hz. That's 20 heap allocations of a 4-float array every second: at least 320 bytes/sec of float data, plus the object header overhead each array carries.

And that's just for one network-synchronized game object. What if we had more? In real-time networking scenarios like my library's streaming system, these optimizations matter.

In a similar fashion, I dealt with Quaternion decompression using spans.
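Sketched under an assumed 7-byte layout (one byte for the dropped component's index followed by three quantized int16s; the real format may differ), decompression reverses the process and rebuilds the dropped component from the unit-length constraint:

```csharp
using System;

static class QuaternionUnpacker
{
    const float InvScale = 1f / 1.41421356f;

    public static (float x, float y, float z, float w) Unpack(ReadOnlySpan<byte> src)
    {
        int largest = src[0];
        Span<float> c = stackalloc float[4];

        int offset = 1;
        float sumSquares = 0f;
        for (int i = 0; i < 4; i++)
        {
            if (i == largest) continue;
            short q = BitConverter.ToInt16(src.Slice(offset, 2));
            offset += 2;
            c[i] = q / (float)short.MaxValue * InvScale;
            sumSquares += c[i] * c[i];
        }

        // Unit length (x^2 + y^2 + z^2 + w^2 = 1) lets us rebuild the
        // dropped component; it was made positive before packing.
        c[largest] = MathF.Sqrt(Math.Max(0f, 1f - sumSquares));

        return (c[0], c[1], c[2], c[3]);
    }
}
```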

2. Performant Chat Command Parsing

I believe this is where Spans truly shine. Consider a multiplayer game built with Dan.Net where players send chat commands at high frequency. Commands like "/say Hello everyone!", "/join lobby-01", or "/trade player123 50" need to be parsed into command names and arguments - and this happens 100+ times per second in active game lobbies.

The traditional string-based approach allocates heavily:

// ❌ Traditional: Multiple string allocations per command
[DanNetEvent]
public void OnSendChatCommand(string message)
{
    // Example: "/say Hello everyone!"

    if (!message.StartsWith("/"))
        return; // Not a command

    // Remove the leading "/" 
    string withoutSlash = message.Substring(1);

    // Find first space to separate command from args
    int spaceIndex = withoutSlash.IndexOf(' ');

    if (spaceIndex == -1)
    {
        // Execute command with no arguments (e.g., "/help")
        return;
    }

    // Split command and arguments
    string commandName = withoutSlash.Substring(0, spaceIndex);
    string arguments = withoutSlash.Substring(spaceIndex + 1);

    // Execute command with arguments
}

Every Substring() call allocates a new string by copying characters. For the example "/say Hello everyone!", that's 3 heap allocations (withoutSlash, commandName, arguments) per message. At 100 messages/sec, that's 300 allocations/sec just for command parsing.

Here's the span-based rewrite that eliminates all intermediate allocations:

// ✅ Span-based: ZERO intermediate allocations
[DanNetEvent]
public void OnSendChatCommand(ReadOnlySpan<char> message)
{
    // Example: "/say Hello everyone!"

    if (message.Length == 0 || message[0] != '/')
        return; // Not a command

    // Slice away the leading "/" - no allocation, just pointer arithmetic
    ReadOnlySpan<char> withoutSlash = message.Slice(1);

    // Find first space to separate command from args
    int spaceIndex = withoutSlash.IndexOf(' ');

    if (spaceIndex == -1)
    {
        // Execute command with no arguments (e.g., "/help")
        return;
    }

    // Slice command and arguments - still no allocation!
    ReadOnlySpan<char> commandName = withoutSlash.Slice(0, spaceIndex);
    ReadOnlySpan<char> arguments = withoutSlash.Slice(spaceIndex + 1);

    // Only allocate when storing final results
    var (finalCommand, finalArguments) = (commandName.ToString(), arguments.ToString());

    // Execute command with arguments
}

The span version performs all slicing operations as lightweight "views" into the original string - zero copies, zero heap allocations until the final ToString() calls. The IndexOf() and Slice() operations are pointer arithmetic, making them extremely fast.

Using ReadOnlySpan<char> as a parameter (instead of string) allows callers to pass string slices, stack-allocated buffers, or even substrings without allocation - the method works with any contiguous character memory.
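For illustration, here's a hypothetical CountWords helper that accepts all of those sources through implicit conversions (string, string slice, and stack buffer alike):

```csharp
using System;

static class Text
{
    // Hypothetical helper: counts whitespace-separated words in any
    // contiguous run of characters.
    public static int CountWords(ReadOnlySpan<char> text)
    {
        int words = 0;
        bool inWord = false;
        foreach (char ch in text)
        {
            if (char.IsWhiteSpace(ch)) inWord = false;
            else if (!inWord) { inWord = true; words++; }
        }
        return words;
    }

    static void Main()
    {
        string s = "hello brave new world";

        Console.WriteLine(CountWords(s));           // whole string: 4
        Console.WriteLine(CountWords(s.AsSpan(6))); // a slice, no Substring: 3

        Span<char> buf = stackalloc char[] { 'h', 'i' };
        Console.WriteLine(CountWords(buf));         // stack buffer: 1
    }
}
```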

Bonus Optimization

For known command prefixes, you can avoid string allocation entirely by using SequenceEqual for comparisons:

// Check if command is "/help" without allocating strings
if (commandName.SequenceEqual("help".AsSpan()))
{
    ShowHelpMenu();
    return;
}

// Check if command is "/say" 
if (commandName.SequenceEqual("say".AsSpan()))
{
    SayMessage(arguments);
    return;
}

This pattern allows you to handle common commands completely allocation-free by processing spans directly instead of converting to strings.
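If you're on C# 11 or newer, constant string patterns can match a ReadOnlySpan<char> directly, which reads better than chained SequenceEqual calls. A sketch; the integer command codes are just placeholders:

```csharp
using System;

static class CommandDispatch
{
    // C# 11+: a ReadOnlySpan<char> can be matched against constant strings
    // with no string allocation for the comparison.
    public static int Lookup(ReadOnlySpan<char> commandName) => commandName switch
    {
        "help" => 1,
        "say"  => 2,
        _      => 0, // unknown command
    };

    static void Main()
    {
        ReadOnlySpan<char> message = "/say hi".AsSpan();
        ReadOnlySpan<char> command = message.Slice(1, 3); // "say"
        Console.WriteLine(Lookup(command)); // prints 2
    }
}
```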

3. Vector3 Compression Pattern

private static void WriteVector3(BinaryWriter writer, Vector3 v)
{
    writer.Write(BitConverter.GetBytes(v.x));
    writer.Write(BitConverter.GetBytes(v.y));
    writer.Write(BitConverter.GetBytes(v.z));
}

This method converts a 3D vector (a struct containing three floating point values x, y and z) into binary form.

Looks good on the surface, right? Not until you realize that BitConverter.GetBytes() returns a heap-allocated byte array.

The solution? Stackalloc a 12-byte buffer, then explicitly write the Vector3 values into its three 4-byte segments, since each value is a 32-bit (4-byte) floating point number. We accomplish this with slicing and the BitConverter.TryWriteBytes() method:

private static void WriteVector3(BinaryWriter writer, Vector3 v)
{
    Span<byte> buffer = stackalloc byte[12];
    BitConverter.TryWriteBytes(buffer.Slice(0, 4), v.x);
    BitConverter.TryWriteBytes(buffer.Slice(4, 4), v.y);
    BitConverter.TryWriteBytes(buffer.Slice(8, 4), v.z);
    writer.Write(buffer);
}

There is a Problem

There's a catch: this approach relies on Span-based APIs (BitConverter.TryWriteBytes and the Span overload of BinaryWriter.Write) that aren't available on the older targets my library supports, so in my build I gate it behind .NET 5.0 or greater. And my library is used on top of Unity.

Unity still mostly targets .NET Standard 2.1, only unlocking newer .NET versions starting with Unity 6. Luckily there's a workable, if sub-optimal, solution: conditional compilation, which lets a library target multiple platforms while using the most efficient APIs available on each.

#if NET5_0_OR_GREATER
    Span<byte> buffer = stackalloc byte[12];
    BitConverter.TryWriteBytes(buffer.Slice(0, 4), v.x);
    BitConverter.TryWriteBytes(buffer.Slice(4, 4), v.y);
    BitConverter.TryWriteBytes(buffer.Slice(8, 4), v.z);
    writer.Write(buffer);  // Has Span overload in .NET 5+
#else
    writer.Write(v.x);
    writer.Write(v.y);
    writer.Write(v.z);
#endif

Now, when compiled for .NET 5+, this gets zero-allocation stackalloc; any version lower, and it uses the traditional (but still efficient) approach.


Safety Considerations

While stackalloc is powerful, it requires careful usage to avoid crashes and undefined behavior:

Stack Overflow Risks

Never use stackalloc with user-controlled sizes or variable-length inputs. The stack is limited (typically 1MB per thread), and exceeding it causes unrecoverable StackOverflowException:

// ❌ DANGEROUS: User could cause stack overflow
int userSize = GetUserInput();
Span<byte> buffer = stackalloc byte[userSize];

// ✅ SAFE: Guard with size check and fallback to heap
int size = GetUserInput();
Span<byte> buffer = size <= 1024 
    ? stackalloc byte[size] 
    : new byte[size];

Keep Allocations Under 8KB

For cross-platform safety, keep stackalloc sizes under 8KB (8192 bytes). Larger allocations risk stack overflow on platforms with smaller stack sizes:

// ✅ Safe for all platforms
Span<byte> smallBuffer = stackalloc byte[4096];

// ⚠️ Risky on some platforms
Span<byte> largeBuffer = stackalloc byte[65536]; // 64KB

Never Use stackalloc in Loops

Repeated stack allocations in loops don't get "cleaned up" until the method returns, effectively leaking stack space:

// ❌ DANGEROUS: Stack grows with each iteration
for (int i = 0; i < 1000; i++)
{
    Span<byte> buffer = stackalloc byte[1024]; // 1MB total!
    ProcessData(buffer);
}

// ✅ SAFE: Allocate once outside loop
Span<byte> buffer = stackalloc byte[1024];
for (int i = 0; i < 1000; i++)
{
    ProcessData(buffer);
}

Debugging Challenges

Stack-allocated memory doesn't appear in memory profilers or GC diagnostics, making debugging harder. Always profile both allocation counts (heap) and execution time (stack) to verify optimizations work as expected.
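One quick sanity check I find useful (GC.GetAllocatedBytesForCurrentThread is available on .NET Core and later): measure heap bytes around the hot path; stack-allocated buffers should contribute nothing.

```csharp
using System;

class AllocationCheck
{
    static void Main()
    {
        long before = GC.GetAllocatedBytesForCurrentThread();

        Span<byte> buffer = stackalloc byte[256]; // stack only
        for (int i = 0; i < buffer.Length; i++)
            buffer[i] = (byte)i;

        long delta = GC.GetAllocatedBytesForCurrentThread() - before;

        // Should report 0: the buffer never touched the heap.
        Console.WriteLine($"heap bytes allocated: {delta}");
    }
}
```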


Conclusion

Today, stackalloc is a core feature for scenarios where maximum speed and minimum allocation overhead matter, like parsing or serialization. Developers are encouraged to pair it with Span<T> to maintain safety, falling back to unsafe pointer usage only for the rarest high-performance cases. These optimizations reduced Dan.Net's per-frame allocations by over 80% on average, eliminating GC stalls during 20Hz network updates.

When NOT to optimize with stackalloc: Don't prematurely optimize. Use stackalloc only after profiling identifies allocation hotspots. For one-time setup code, IO-bound operations, or methods called <1000 times/sec, the complexity cost outweighs the performance benefit. Always measure with a profiler before and after.

This evolution has made stack-based, high-performance memory scenarios much more approachable, while maintaining C#'s general focus on safety and productivity.

If you're curious about my other works, check out my portfolio website. Dan.Net is open-source on GitHub - contributions and feedback are welcome!
