Starving the Garbage Collector: A Pragmatic Guide to Zero-Allocation C#

#csharp #dotnet #opensource #performance

Over the last few weeks, I’ve open-sourced a suite of high-performance, zero-dependency C# engines. This includes a native DataFrame library (Glacier.Polaris), a blisteringly fast text searcher (Glacier.Grep), and a semantic Markdown parser for RAG contexts (Glacier.DocTree). You can find the source code for all of these on my GitHub.

A recurring question I’m getting from other devs looking at these repositories is simple: How exactly are you bypassing the Garbage Collector to get these speeds?

I’ve never hidden my distaste for heavy, magic-filled frameworks. Whether it's an unwieldy data access library or a bloated client-side framework, they all share a common flaw: they wrap your code in layers of hidden allocations that murder your CPU caches and force the .NET Garbage Collector (GC) into overdrive.

When you want to build systems that process millions of rows a second or rival native C/C++ in raw compute speed, you have to take control of your memory. To give you a fighting chance at writing your own high-performance engines, let's break down how memory allocation actually works in C#, using the architecture of the Glacier repositories as our guide.

Level 1: The Heap, the Stack, and Cache Locality

In C#, every time you use the new keyword on a class (a reference type), you are asking the runtime to find a contiguous block of free memory on the Managed Heap.

The heap is a messy place. When it gets full, the GC steps in. It pauses your application threads, traverses the object graph to see what you are still using, compacts the memory, and cleans up the garbage. For standard CRUD apps, this pause is negligible. For a DataFrame engine like Glacier.Polaris processing millions of rows, a GC pause is a catastrophic event. It's a heavy tax on your CPU cycles.

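The alternative is the Stack. The stack is a tightly managed, incredibly fast area of memory assigned exclusively to the current thread. When you declare a struct (a value type) as a local variable, it lives on the stack or in a CPU register. When the method returns, the stack unwinds, and the memory is instantly freed. No GC involved. Zero tax. One caveat: a struct is stored wherever its container lives. A struct that is a field of a class or an element of an array sits inline on the heap (without a separate allocation of its own), and boxing a struct by casting it to object or an interface does allocate.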

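But dropping classes for structs isn't just about dodging the GC; it's about mechanical sympathy. Modern CPUs don't read bytes from RAM one at a time; they pull 64-byte "cache lines." An array of structs stores its elements back to back in one contiguous block, whereas an array of class instances stores references to objects scattered across the heap. C# structs already default to [StructLayout(LayoutKind.Sequential)], so the attribute by itself changes nothing; what matters is keeping each struct small, ordering its fields deliberately, and setting Pack only where you genuinely need it. Do that, and when the CPU grabs a cache line it receives relevant, tightly packed data, which sharply reduces cache misses.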

The Golden Rule: If you want to go fast in a tight loop, favor struct over class.
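To make the locality argument concrete, here is a rough sketch that compares an array of structs with an array of class instances by counting the bytes each one allocates. PointStruct, PointClass, and LayoutProbe are hypothetical names made up for this illustration, not types from the Glacier repositories, and the exact byte counts depend on the runtime and platform.

```csharp
using System;

// Hypothetical types for illustration only.
public struct PointStruct { public int X; public int Y; }
public sealed class PointClass { public int X; public int Y; }

public static class LayoutProbe
{
    // One allocation: the array itself, with every X/Y pair stored inline.
    public static long BytesForStructArray(int n)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        var points = new PointStruct[n];
        for (int i = 0; i < n; i++)
            points[i] = new PointStruct { X = i, Y = i };
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(points);
        return after - before;
    }

    // n + 1 allocations: an array of references plus one heap object per element.
    public static long BytesForClassArray(int n)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        var points = new PointClass[n];
        for (int i = 0; i < n; i++)
            points[i] = new PointClass { X = i, Y = i };
        long after = GC.GetAllocatedBytesForCurrentThread();
        GC.KeepAlive(points);
        return after - before;
    }
}
```

On a 64-bit runtime the class version should report several times more bytes for the same data, because every element carries its own object header and the array only holds pointers to them. The struct version is a single block the CPU can stream through.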

Level 2: Slicing Without Allocating & The Async Trap

Value types are great, but what about arrays and strings? Historically, if you wanted a subset of an array or a string, you called .Substring() or .Skip().Take(). Substring allocates a new string on the heap and copies the characters over; the LINQ chain allocates iterator objects, plus a fresh array if you materialize it with .ToArray().

If you look at the source for Glacier.DocTree or Glacier.Grep, you'll notice we rarely allocate new strings when reading text. Instead, we use Span<T> and ReadOnlySpan<T>.

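A Span<T> is a ref struct: essentially a managed pointer to a block of memory plus a length. Because it is a ref struct, it can only live on the stack; it can't be boxed, stored in a field of a class, or captured by a lambda. You can slice a massive buffer into smaller chunks, and it costs next to nothing: a bounds check and two fields. Zero allocations. Zero copying.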

// The old, bloated way that triggers the GC
string line = "Error: Connection Timeout";
string message = line.Substring(7); // Allocates a new string on the heap

// The Glacier way (Zero-Allocation)
ReadOnlySpan<char> lineSpan = "Error: Connection Timeout".AsSpan();
ReadOnlySpan<char> messageSpan = lineSpan.Slice(7); // Just a view into memory!
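Slicing pays off most when you combine it with the span-based parsing overloads in the base class library. As a minimal sketch (SpanParsing and SumCsv are hypothetical names for this example, not code from Glacier.Grep or Glacier.DocTree), here is a comma-separated integer sum that never creates a substring:

```csharp
using System;
using System.Globalization;

public static class SpanParsing
{
    // Sums comma-separated integers such as "10,20,30" without allocating substrings.
    // An empty field (for example "1,,2") makes int.Parse throw a FormatException.
    public static int SumCsv(ReadOnlySpan<char> line)
    {
        int total = 0;
        while (!line.IsEmpty)
        {
            int comma = line.IndexOf(',');

            // The current field is a view over the original characters, not a copy.
            ReadOnlySpan<char> field = comma < 0 ? line : line.Slice(0, comma);
            total += int.Parse(field, NumberStyles.Integer, CultureInfo.InvariantCulture);

            // Advance past the comma, or finish if this was the last field.
            line = comma < 0 ? ReadOnlySpan<char>.Empty : line.Slice(comma + 1);
        }
        return total;
    }
}
```

A string converts implicitly to ReadOnlySpan<char>, so SpanParsing.SumCsv("10,20,30") works without an explicit AsSpan() call.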

The Async Trap

Because Span<T> is tied to the stack, the compiler will stop you if you try to use it across an await boundary (like asynchronously reading a file stream). State machines generated by async/await cannot preserve stack-only references.

To bridge this gap, we use Memory<T>. Memory<T> is an ordinary struct, so it can safely live on the heap and travel through async pipelines. Once the I/O operation completes and you are ready to do synchronous, CPU-bound processing, you simply call .Span on your Memory<T> and begin slicing at zero cost.
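The hand-off from Memory<T> to Span<T> could look something like the sketch below. AsyncLineCounter is a hypothetical helper written for this article, and the 4096-byte buffer size is an arbitrary assumption. The async method only ever holds a Memory<byte>; the span exists solely inside the synchronous helper.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncLineCounter
{
    public static async Task<int> CountNewlinesAsync(Stream stream)
    {
        // Memory<byte> can be hoisted into the async state machine; Span<byte> cannot.
        Memory<byte> buffer = new byte[4096];
        int total = 0;
        int read;

        while ((read = await stream.ReadAsync(buffer)) > 0)
        {
            // The I/O has completed: hand a span over the filled part to synchronous code.
            total += CountNewlines(buffer.Slice(0, read).Span);
        }
        return total;
    }

    // Synchronous and CPU-bound, so it is free to work with spans.
    private static int CountNewlines(ReadOnlySpan<byte> chunk)
    {
        int count = 0;
        foreach (byte b in chunk)
        {
            if (b == (byte)'\n') count++;
        }
        return count;
    }
}
```

In a real hot path you would probably rent the buffer from a pool instead of calling new, but the shape stays the same: await with Memory<T>, compute with Span<T>.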

Level 3: Custom Allocators and the Rental Market

The stack is fast, but it's small (typically around 1MB per thread). If you try to stackalloc a massive DataFrame column there, you will crash your app with a StackOverflowException, and that exception cannot be caught: the process simply dies.
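A common way to live within that limit is to use stackalloc only below a small threshold and fall back to a pooled array above it. The following is a sketch under assumptions: ScratchBuffers, IsPalindromeIgnoreCase, and the 256-element threshold are all hypothetical choices for illustration, not something taken from the Glacier code.

```csharp
using System;
using System.Buffers;

public static class ScratchBuffers
{
    private const int StackLimit = 256; // 256 chars = 512 bytes of stack (arbitrary, conservative)

    public static bool IsPalindromeIgnoreCase(ReadOnlySpan<char> text)
    {
        char[]? rented = null;

        // Small inputs use the stack; large inputs rent from the shared pool instead.
        Span<char> scratch = text.Length <= StackLimit
            ? stackalloc char[StackLimit]
            : (rented = ArrayPool<char>.Shared.Rent(text.Length));
        try
        {
            // Either buffer may be longer than needed, so trim it to the input length.
            scratch = scratch.Slice(0, text.Length);
            text.ToLowerInvariant(scratch);

            for (int i = 0, j = scratch.Length - 1; i < j; i++, j--)
            {
                if (scratch[i] != scratch[j]) return false;
            }
            return true;
        }
        finally
        {
            // Only the pooled path has anything to give back.
            if (rented is not null) ArrayPool<char>.Shared.Return(rented);
        }
    }
}
```

The caller never knows which path ran. Both branches produce a Span<char>, which is exactly why spans make this pattern pleasant.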

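In Glacier.Polaris, we are mapping primitive types directly to dense arrays to avoid the overhead of boxing. But allocating massive arrays in a tight loop is its own trap: new int[100000] is 400,000 bytes, well past the 85,000-byte threshold, so it lands on the Large Object Heap. The LOH is only collected as part of a Gen 2 collection, the most expensive kind, and churning through large arrays triggers those far more often than you want.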

Instead of relying on standard arrays, Polaris uses custom allocators and structures like MemoryOwnerColumn and ValidityMask. This allows us to maintain C-like memory control while remaining safe within the .NET ecosystem. When we need temporary buffers, we rent them from System.Buffers.ArrayPool<T>.Shared:

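using System;
using System.Buffers;

// Rent an array of AT LEAST the requested size (it may be larger, and it is not zeroed)
int[] buffer = ArrayPool<int>.Shared.Rent(100000);
try
{
    // Slice to the length we asked for; the rented array may be longer
    Span<int> workSpan = buffer.AsSpan(0, 100000);
    workSpan.Clear(); // Rented arrays can contain stale data from a previous user
    // Do heavy data processing...
}
finally
{
    // Always return it! Once the pool is warm, renting reuses arrays instead of allocating.
    ArrayPool<int>.Shared.Return(buffer);
}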
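To give a feel for what a pooled column with a validity bitmask might look like, here is a deliberately simplified sketch. To be clear, this is not the actual MemoryOwnerColumn or ValidityMask from Glacier.Polaris; PooledIntColumn and all of its members are hypothetical, and I'm assuming the common convention of one bit per row where 1 means "has a value".

```csharp
using System;
using System.Buffers;

// Hypothetical sketch; NOT the real Glacier.Polaris column types.
public sealed class PooledIntColumn : IDisposable
{
    private readonly int[] _values;
    private readonly ulong[] _validity; // one bit per row: 1 = has a value, 0 = null
    private bool _disposed;

    public int Length { get; }

    public PooledIntColumn(int length)
    {
        Length = length;
        _values = ArrayPool<int>.Shared.Rent(length);

        int words = (length + 63) / 64;
        _validity = ArrayPool<ulong>.Shared.Rent(words);
        // Rented arrays may hold stale data, so reset the bits we will use.
        _validity.AsSpan(0, words).Clear();
    }

    // Trimmed view: the rented array may be longer than Length.
    public Span<int> Values => _values.AsSpan(0, Length);

    public void Set(int row, int value)
    {
        Values[row] = value; // bounds-checked against Length
        _validity[row >> 6] |= 1UL << (row & 63);
    }

    public bool IsValid(int row)
    {
        if ((uint)row >= (uint)Length) throw new ArgumentOutOfRangeException(nameof(row));
        return (_validity[row >> 6] & (1UL << (row & 63))) != 0;
    }

    // Sums only the rows whose validity bit is set.
    public long SumValid()
    {
        long sum = 0;
        Span<int> values = Values;
        for (int row = 0; row < values.Length; row++)
        {
            if (IsValid(row)) sum += values[row];
        }
        return sum;
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        ArrayPool<int>.Shared.Return(_values);
        ArrayPool<ulong>.Shared.Return(_validity);
    }
}
```

Wrapping the rent/return pair in IDisposable means a using statement handles the return for you, and keeping nulls in a separate bitmask lets the values array stay a dense block of primitives with no boxing and no Nullable<int> padding.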

Level 4: Unleashing Compute (SIMD & MemoryMarshal)

Once your memory is flat, contiguous, and not bothering the Garbage Collector, you can unleash the CPU.

In Glacier.Polaris, the math isn't done row-by-row in a simple for loop. We process data in chunks using SIMD (Single Instruction, Multiple Data) CPU vector instructions.

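In older .NET versions, this meant hardcoding explicit, platform-specific intrinsics and pinning arrays with fixed, which added slight overhead. Modern .NET (7 and later) abstracts this beautifully. We use MemoryMarshal.GetReference to grab a lightweight ref to our data without pinning it, and feed it into the cross-platform Vector256 API. One caveat: Vector256 is only hardware accelerated where 256-bit registers exist, which in practice means x64 with AVX2. ARM64's NEON registers are 128 bits wide, so there Vector256.IsHardwareAccelerated returns false and the code below falls through to the scalar loop. To get vectorized code on both architectures, use Vector128 or Vector<T> instead.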

using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static int SimdSum(ReadOnlySpan<int> data)
{
    int sum = 0;
    int i = 0;

    // Grab a fast, unpinned reference to the underlying data
    ref int current = ref MemoryMarshal.GetReference(data);

    // Process 8 integers at a time (if hardware supports 256-bit vectors)
    if (Vector256.IsHardwareAccelerated && data.Length >= Vector256<int>.Count)
    {
        Vector256<int> vSum = Vector256<int>.Zero;

        for (; i <= data.Length - Vector256<int>.Count; i += Vector256<int>.Count)
        {
            // Load 8 contiguous integers directly into the CPU register
            Vector256<int> vData = Vector256.LoadUnsafe(ref current, (nuint)i);

            // Add them in parallel
            vSum += vData; 
        }

        // Horizontal add to collapse the vector lanes into a final scalar sum
        sum += Vector256.Sum(vSum); 
    }

    // Process any remaining elements normally
    for (; i < data.Length; i++) 
    {
        sum += Unsafe.Add(ref current, i);
    }

    return sum;
}

This is where the magic happens. We've bypassed the GC to keep our memory clean, extracted an unpinned reference, and fed it directly into the CPU's vector lanes.
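If you want a single code path that vectorizes on whatever width the hardware offers, System.Numerics.Vector<T> is one option. The sketch below is an alternative to the Vector256 version above, not what Glacier.Polaris ships; PortableSimd is a hypothetical name, and Vector.Sum requires .NET 6 or later.

```csharp
using System;
using System.Numerics;

public static class PortableSimd
{
    public static int Sum(ReadOnlySpan<int> data)
    {
        int sum = 0;
        int i = 0;

        // Vector<int>.Count adapts to the hardware (e.g. 4 lanes at 128 bits, 8 at 256 bits).
        if (Vector.IsHardwareAccelerated && data.Length >= Vector<int>.Count)
        {
            Vector<int> acc = Vector<int>.Zero;
            int lastBlock = data.Length - Vector<int>.Count;

            for (; i <= lastBlock; i += Vector<int>.Count)
            {
                // Loads the first Vector<int>.Count elements of the slice.
                acc += new Vector<int>(data.Slice(i));
            }

            // Collapse the lanes into one scalar.
            sum = Vector.Sum(acc);
        }

        // Scalar tail for whatever did not fill a whole vector.
        for (; i < data.Length; i++)
        {
            sum += data[i];
        }
        return sum;
    }
}
```

This version trades a little control for portability: the span constructor is bounds-checked, and you don't get to pick the vector width. Like the Vector256 version, integer overflow wraps silently.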

Level 5: Show the Receipts

It’s easy to talk about zero-allocation theory, but advanced developers deal in metrics. When you strip away the frameworks and embrace the mechanics outlined above, the results in BenchmarkDotNet look like this:

Method	Mean	Allocated
Standard Substring	18.45 ns	32 B
Glacier Span Slice	0.02 ns	0 B

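Seeing that 0 B under the Allocated column is the entire point. Don't read too much into the 0.02 ns figure itself: a sub-nanosecond mean sits at the noise floor of what a benchmark harness can measure, so treat the slice as effectively free rather than as a precise speedup ratio.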
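If you want a quick sanity check on the allocation column without setting up a benchmark project, GC.GetAllocatedBytesForCurrentThread gives a rough cross-check. This is only a sketch and no substitute for BenchmarkDotNet; AllocationProbe and its members are hypothetical names, and it measures bytes, not time.

```csharp
using System;

public static class AllocationProbe
{
    private const string Line = "Error: Connection Timeout";

    // Static sinks keep the results observable so the work is not optimized away.
    private static string _stringSink = "";
    private static int _lengthSink;

    public static long MeasureSubstring(int iterations)
    {
        _stringSink = Line.Substring(7); // warm-up so one-time costs are not counted

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            _stringSink = Line.Substring(7); // a new string every iteration
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static long MeasureSlice(int iterations)
    {
        int total = Line.AsSpan().Slice(7).Length; // warm-up

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            total += Line.AsSpan().Slice(7).Length; // a view, not a copy
        }
        long allocated = GC.GetAllocatedBytesForCurrentThread() - before;

        _lengthSink = total;
        return allocated;
    }
}
```

On current runtimes I'd expect MeasureSlice to report zero bytes and MeasureSubstring to grow linearly with the iteration count, but the exact per-string size depends on the platform.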

The Bottom Line

Getting away from the Garbage Collector isn't about rewriting every line of business logic you have. It's about surgical precision. Identify the hot paths—the tight loops where data flows by the gigabyte—and strip away the abstractions.

Drop the heavy frameworks. Stop calling new in a loop. Embrace struct, slice with Span<T>, use the ArrayPool, and build custom column allocators when the scale demands it. Take a look through the Glacier repositories to see these patterns in action. This is how you build engines that don't just participate in the .NET ecosystem, but actually push it to its absolute limits.