Steve

Posted on Jun 30

Build a very large managed array in .NET

#csharp #dotnet

The story of arrays in .NET

I have been seeing complaints about the maximum size of arrays in .NET for years.

In .NET, arrays, collections, spans, and related APIs are built around 32-bit lengths and indexes. There was a very long issue on GitHub requesting support for 64-bit arrays, but it was closed as "won't fix" because changing this touches too much of the runtime and too much existing code.

Most solutions I have seen for this problem fall into one of two buckets: allocate unmanaged memory and wrap it in a class, or use a jagged array to simulate one larger array. Both approaches work for some scenarios, but both come with compromises.

Manual memory management is hard to get right, and it only works naturally with unmanaged data. That rules out reference types like string and object. Jagged arrays avoid unmanaged memory, but they are still a collection of separate arrays. You have to manage the inner array sizes, crossing segment boundaries becomes part of your indexing logic, and the result is awkward to combine with APIs that expect a contiguous region of memory.

Eventually I decided to roll my own solution. These were the goals:

It should be able to hold more than 2 billion elements and access them through a single index.
The index should be architecture-dependent: on 32-bit systems it should stay within the normal array limit, while on 64-bit systems it should be able to go much larger.
It should support reference types like string and object.
The storage should be one contiguous managed allocation, not a jagged array.
It should be garbage collected, so users do not need manual memory management.
It should work under NativeAOT, so it cannot rely on runtime code generation or reflection.

Those goals sound uncomfortably close to "change how arrays work", but modern .NET gives us features that open a different path.

The basic idea

In .NET, an SZArray is the normal single-dimensional, zero-based array shape: T[]. The length of this array is limited to an int-sized value. Loosening that limit would require runtime changes across the garbage collector, the JIT compiler, the type system, reflection, and the base class libraries.

But the limit is on the number of array elements, not directly on the number of bytes behind those elements. A byte[1024] holds 1024 bytes, while an int[1024] holds 4096 bytes. The array has the same element count in both cases.

So the first trick is simple: make each array element represent more than one logical value.

struct TwoBytes
{
    public byte A;
    public byte B;
}

An array of TwoBytes with 2 billion elements can store 4 billion bytes. It is still one managed array object, but each element has become a small chunk.

Since .NET 8, we have InlineArrayAttribute. It lets a struct represent a fixed number of repeated fields without manually declaring every field.

using System.Runtime.CompilerServices;

[InlineArray(4)]
struct FourBytes
{
    private byte _first;
}

The nice part is that InlineArray works with reference types too:

[InlineArray(4)]
struct FourStrings
{
    private string _first;
}

It also works with generics:

[InlineArray(4)]
struct FourElements<T>
{
    private T _first;
}

Now an array of FourElements<T> can hold four logical T values per physical array element. If the physical array can contain around 2 billion chunks, then a four-wide chunk can represent around 8 billion logical values.

That is the core of BigArray<T>. The storage is still one managed array, but the array element type may be a chunk type instead of T itself. BigArray<T> records the real logical length separately and treats the data inside those chunks as one long sequence of T values.

Building the chunk types

The obvious implementation would be to define a chunk type for every possible chunk length:

[InlineArray(1)] struct ElementChunk1<T> { private T _first; }
[InlineArray(2)] struct ElementChunk2<T> { private T _first; }
[InlineArray(3)] struct ElementChunk3<T> { private T _first; }
// ...
[InlineArray(65535)] struct ElementChunk65535<T> { private T _first; }

That is not practical, and it is not even valid for every T.

There is an important runtime type-loading limit here: the base value type of an array cannot grow above 65,535 bytes. For a given T, the chunk length is:

65535 / Unsafe.SizeOf<T>()

So byte can use a chunk length of 65,535. A type with a size of 8 bytes can use a chunk length of 8,191. A type with a size of 32 bytes can use a chunk length of 2,047. Once that value is known, allocation becomes a matter of calculating how many physical chunks are needed for the requested logical length.

This means the implementation does not need a chunk type for every integer. It only needs the values produced by 65535 / size for supported element sizes. This reduced the number of chunk types to 510, much better than 65,535 but still a lot.

Even better, the chunk structs themselves can be factored.

[InlineArray(2)]
struct ElementChunk2<T>
{
    private T _first;
}

[InlineArray(3)]
struct ElementChunk3<T>
{
    private T _first;
}

ElementChunk2<ElementChunk3<T>> is a chunk of 2 chunks of 3 values, which is 6 logical T values. ElementChunk23<ElementChunk89<T>> is 2047 logical values. The largest byte chunk, 65,535, can be expressed as:

ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<byte>>>>

because:

3 * 5 * 17 * 257 = 65535

We can then factor the chunk types into a small number of base chunk types whose lengths are prime numbers. To produce chunk lengths from 1 to 65,535, it turns out that we only need 85 base chunk types: from ElementChunk2<T> to ElementChunk8191<T>. The rest can be expressed as a product of those base chunk lengths.

That is much easier to maintain than thousands of hand-written fields or one struct per possible length.

With the chunk types in place, we can compose them to produce a chunk type of any length from 1 to 65,535:

var chunkSize = 65535 / Unsafe.SizeOf<T>();
var chunks = length / chunkSize + (length % chunkSize == 0 ? 0 : 1);

Array array = chunkSize switch
{
    1 => new ElementChunk1<T>[chunks],
    2 => new ElementChunk2<T>[chunks],
    3 => new ElementChunk3<T>[chunks],
    4 => new ElementChunk2<ElementChunk2<T>>[chunks],
    5 => new ElementChunk5<T>[chunks],
    6 => new ElementChunk2<ElementChunk3<T>>[chunks],
    7 => new ElementChunk7<T>[chunks],
    8 => new ElementChunk2<ElementChunk2<ElementChunk2<T>>>[chunks],
    9 => new ElementChunk3<ElementChunk3<T>>[chunks],
    10 => new ElementChunk2<ElementChunk5<T>>[chunks],
    // ...
    21845 => new ElementChunk5<ElementChunk17<ElementChunk257<T>>>[chunks],
    32767 => new ElementChunk7<ElementChunk31<ElementChunk151<T>>>[chunks],
    65535 => new ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<T>>>>[chunks],
};

Here, chunks is the length of the real managed array, not the logical length exposed by BigArray<T>. If the logical length is 10,000 and the chunk size is 4,095, the implementation allocates 3 physical chunks. The last chunk is only partly used, and _length keeps track of the real logical end.

With this approach, you might wonder if we are over-allocating memory and wasting space. The answer is yes, but only a little, and it has an upper bound:

`sizeof(T)`	`chunkSize`	worst extra elements	worst extra bytes
1	65535	65534	65534 B
2	32767	32766	65532 B
3	21845	21844	65532 B
4	16383	16382	65528 B
8	8191	8190	65520 B
16	4095	4094	65504 B
257	255	254	65278 B
32768+	1	0	0 B

The worst case is when the logical length is one more than a multiple of the chunk size, and the chunk size is 65,535. In that case, we allocate 65,535 bytes for the first chunk and then allocate another 65,535 bytes for the second chunk... but we only use one byte of the last chunk.

Type loading

Now imagine T is object on a 64-bit runtime. A reference is 8 bytes, so the valid chunk length is 8,191:

65535 / 8 = 8191

That means ElementChunk8191<object> is valid. But ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<object>>>> would be enormous, because it would contain 65,535 object references, which will take 8 * 65535 = 524,280 bytes for the base type of the array. We must not load such a type; otherwise, the runtime will throw a TypeLoadException when we try to create an array of it.

The problem is that "never executed" is not always the same as "never loaded". If a method contains references to many closed generic array types, the JIT and type loader can still run into invalid combinations while importing or compiling the method. This will show up as a TypeLoadException even though the offending chunk shape was not the one the code intended to allocate:

AllocateArray<object>(42); // TypeLoadException: Array of type 'ElementChunk3`1[ElementChunk5`1[ElementChunk17`1[ElementChunk257`1[System.__Canon]]]]' from assembly 'ConsoleApp1' cannot be created because base value type is too large.

Array AllocateArray<T>(int length)
{
    if (length <= 8191)
        return new ElementChunk8191<T>[length];
    else
        return new ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<T>>>>[length];
}

The fix is to keep the selected allocation shape lazy. The allocation path first computes the valid chunk length for T, then asks the switch for an allocator for that chunk length, and then invokes only that allocator.

The allocator is a switch over the computed chunk length. Each arm returns a static lambda that performs the allocation for exactly one chunk type:

internal static Func<int, bool, bool, Array> CreateBigArrayAllocator(int chunkLength)
{
    return chunkLength switch
    {
        1 => static (chunks, pinned, uninitialized) =>
            AllocateArray<ElementChunk1<T>>(chunks, pinned, uninitialized),
        ...,
        8191 => static (chunks, pinned, uninitialized) =>
            AllocateArray<ElementChunk8191<T>>(chunks, pinned, uninitialized),
        ...,
        65535 => static (chunks, pinned, uninitialized) =>
            AllocateArray<ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<T>>>>>(chunks, pinned, uninitialized),
        ...,
        _ => throw new UnreachableException(),
    };
}

The real switch has all 510 cases, but the important detail is the shape: the allocation is behind a lambda, and the allocation helper is marked NoInlining.

[MethodImpl(MethodImplOptions.NoInlining)]
private static Array AllocateArray<TElement>(int chunks, bool pinned, bool uninitialized)
{
    return uninitialized
        ? GC.AllocateUninitializedArray<TElement>(chunks, pinned)
        : GC.AllocateArray<TElement>(chunks, pinned);
}

This is not just cosmetic indirection. It prevents unselected chunk-array types from being eagerly loaded. Only the chunk shape that matches the actual Unsafe.SizeOf<T>() is forced, because now the JIT only compiles the method behind the lambda that is actually created. The other arms of the switch are never executed, so no lambda is created for them, and the JIT will not compile them. For object, the code can pick the 8191 arm and create an ElementChunk8191<object>[]; the 65535 arm still exists for types such as byte, but it is not forced for object.

BigArray

With the chunk machinery in place, BigArray<T> itself can stay surprisingly small.

The owner stores two things:

internal readonly Array _storage;
internal readonly nint _length;

For normal lengths, it allocates an ElementChunk1<T>[], which is effectively a normal T[] layout with one layer of wrapping. For larger lengths, it computes the chunk length, allocates an array of the selected chunk type, and records the logical length as an nint.

That is why _storage is typed as Array: the actual runtime type depends on T. It might be ElementChunk1<T>[], ElementChunk8191<T>[], or a composed chunk type such as ElementChunk3<ElementChunk5<ElementChunk17<ElementChunk257<T>>>>[].

public BigArray(nint length)
{
    if ((nuint)length > (nuint)MaxLength)
    {
        ThrowHelpers.ThrowOutOfRange(nameof(length));
    }

    if (length <= Array.MaxLength)
    {
        _storage = new ElementChunk1<T>[length];
    }
    else
    {
        _storage = CreateBigArraySlow(length);
    }

    _length = length;
}

The indexer is the part I like most. It does not need to divide by the chunk size on every access. The storage is one managed array whose data region is a contiguous sequence of chunk structs, and the chunk structs contain their elements inline. So the code gets a reference to the first logical T and uses normal by-ref arithmetic from there:

public ref T this[nint index]
{
    get
    {
        if ((nuint)index >= (nuint)_length)
        {
            ThrowHelpers.ThrowOutOfRange(nameof(index));
        }

        return ref Unsafe.Add(ref GetDataReference(), index);
    }
}

The Unsafe usage here is an implementation detail. The public inputs are still validated first, and then the implementation uses by-ref arithmetic so it does not need another normal array bounds check for every logical access. If an index, length, or slice is outside the valid range, it fails before reaching this path. This keeps the safety guarantees for the public API, while still allowing the implementation to be as fast as possible.

The data reference is obtained by reinterpreting the beginning of the array data as T:

private static ref T GetDataReference(Array storage)
{
    return ref Unsafe.As<byte, T>(ref MemoryMarshal.GetArrayDataReference(storage));
}

This is why the contiguous-storage goal matters. After the first data reference is obtained, Unsafe.Add(ref first, index) moves by index logical T elements. Crossing from one chunk into the next is just moving forward in the same array data region. BigArray<T> does not become a jagged-array wrapper with an extra division and remainder on every index. It is one managed array object viewed as a larger logical sequence.

The maximum length is architecture-dependent:

public static nint MaxLength =>
    nint.Size == 4 ? Array.MaxLength : GetChunkLength() * (nint)Array.MaxLength;

On 32-bit runtimes, nint cannot represent a larger index space, so BigArray<T> stays at the normal array limit.

On 64-bit runtimes, the maximum length scales with the chunk size. For byte, that is roughly Array.MaxLength * 65535; for long or an object reference on a 64-bit runtime, it is roughly Array.MaxLength * 8191. It can represent extremely large arrays up to 128 TiB! (precisely, 127.998 TiB in binary units) This is a theoretical maximum, of course. The machine still has to have enough memory for the allocation.

BigSpan and BigMemory

An owner type is useful, but it is not enough. In normal .NET code, arrays are only one part of the programming model. We also use Span<T>, ReadOnlySpan<T>, Memory<T>, and ReadOnlyMemory<T> to pass views around.

BigSpan<T> is the stack-only view over a large contiguous region:

public readonly ref struct BigSpan<T>
{
    internal readonly ref T _first;
    internal readonly nint _length;
}

It has the same basic shape as Span<T>: a by-ref start and a length. The difference is that the length is nint, and indexing uses nint. Like Span<T>, it does not own the allocation; it is only a view over memory that is kept alive by something else, usually a BigArray<T> or a BigMemory<T>.

BigArray<byte> buffer = new((nint)Array.MaxLength + 1024);
BigSpan<byte> span = buffer.AsBigSpan();

span[Array.MaxLength] = 42;

BigMemory<T> and BigReadOnlyMemory<T> are the storable counterparts. They keep the underlying managed array, a start offset, and a length:

internal readonly Array? _storage;
internal readonly nint _start;
internal readonly nint _length;

Their Span properties materialize BigSpan<T> or BigReadOnlySpan<T> when you need the hot by-ref path. Since BigMemory<T> keeps the underlying managed array in _storage, it can be stored in fields or returned from methods while still keeping that storage visible to the GC.

BigMemory<byte> page = buffer.AsBigMemory(1024, 4096);
page.Span.Fill(0);

The API follows the normal span and memory vocabulary as much as possible: slicing, copying, searching, sorting, trimming, splitting, ToArray, ToBigArray, and read-only conversions. When the implementation needs to call a BCL API that only accepts Span<T> or ReadOnlySpan<T>, it processes the data in int-sized windows.

That is an important boundary. BigSpan<T> is not a replacement that magically makes every existing API accept more than int.MaxValue elements. It gives you a familiar large-indexed view, and it lets you take normal Span<T> windows when you need to interoperate with existing APIs.

nint offset = (nint)5_000_000_000L;
Span<byte> window = buffer.AsSpan(offset, length: 4096);

Allocation APIs

The simplest way to allocate is still the constructor:

nint length = (nint)10_000_000_000L;
BigArray<byte> buffer = new(length);

But arrays in .NET also have explicit GC allocation helpers, so I exposed matching APIs as well:

nint length = (nint)10_000_000_000L;

BigArray<byte> zeroed = GC.AllocateBigArray<byte>(length);
BigArray<byte> scratch = GC.AllocateUninitializedBigArray<byte>(length);
BigArray<byte> pinned = GC.AllocateBigArray<byte>(length, pinned: true);

This allows you to control whether the allocation is zeroed or uninitialized, and whether it is pinned. The pinned option is useful for interop scenarios where you need to pass a pointer to unmanaged code; the uninitialized option is useful for performance-sensitive scenarios where you will immediately overwrite the memory and do not need it zeroed, especially for large allocations. Internally, these APIs go through the same storage selection as the constructor: normal lengths use an ElementChunk1<T>[], and larger lengths use the chunked allocator described above.

The owner remains managed. If T contains references, the garbage collector still understands the references because the underlying storage is an array of value types containing those references. If T is unmanaged, you can also use the pinning-oriented APIs where appropriate.

Ending thoughts

With BigArray<T>, BigSpan<T>, and BigMemory<T>, we can now work with large contiguous managed memory in a way that is similar to working with regular arrays.

Huge arrays are generally not a good idea because they can make GC work harder, and random access patterns can be slower than with smaller arrays. But for scenarios where you must work with large contiguous data, these types provide a managed, best-effort way to do so. Performance is important, but availability matters even more: if you cannot allocate the memory you need, everything else is a show stopper.

I would appreciate it if you could try the library and provide feedback. You can find the source code and documentation on GitHub, and download the package from NuGet.