Arrays and the Common Language Runtime - a Very Special Relationship

#clr #net #arrays

A while ago I wrote about the 'special relationship' that exists between Strings and the CLR. Well, it turns out that Arrays and the CLR have an even deeper one, the type of closeness where you hold hands on your first meeting.

Fundamental to the Common Language Runtime (CLR)

Arrays are such a fundamental part of the CLR that they are included in the ECMA specification, which makes it clear that the runtime has to implement them.

In addition, there are several IL (Intermediate Language) instructions that specifically deal with arrays:

newarr <etype>
- Create a new array with elements of type etype.
ldelem.ref
- Load the element at index onto the top of the stack as an O. The type of the O is the same as the element type of the array pushed on the CIL stack.
stelem <typeTok>
- Replace array element at index with the value on the stack (also stelem.i, stelem.i1, stelem.i2, stelem.r4 etc)
ldlen
- Push the length (of type native unsigned int) of array on the stack.
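
To make these instructions a little more concrete, here is a minimal C# sketch whose comments note the IL that a C# compiler typically emits for each line. The names ArrayIlDemo and Run are made up for illustration, and the exact IL can vary with compiler version and optimisation settings, so it is worth confirming with a disassembler such as ildasm or ILSpy.

```csharp
using System;

public static class ArrayIlDemo
{
    public static int Run()
    {
        int[] numbers = new int[3];      // newarr [System.Int32]
        numbers[0] = 42;                 // stelem.i4
        int first = numbers[0];          // ldelem.i4

        string[] names = new string[1];  // newarr [System.String]
        names[0] = "clr";                // stelem.ref
        string name = names[0];          // ldelem.ref

        // ldlen pushes a native unsigned int, so the compiler
        // normally follows it with conv.i4 to get a C# int.
        return first + numbers.Length + name.Length;
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```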

This makes sense because arrays are the building blocks of so many other data types; you want them to be available, well defined and efficient in a modern high-level language like C#. Without arrays you can't have lists, dictionaries, queues, stacks, trees, etc. They're all built on top of arrays, which provide low-level access to contiguous pieces of memory in a type-safe way.

Memory and Type Safety

This memory and type safety is important because without it .NET couldn't be described as a 'managed runtime', and you'd be left having to deal with the types of issues you get when you are writing code in a lower-level language.

More specifically, the CLR provides the following protections when you are using arrays (from the section on Memory and Type Safety in the BOTR 'Intro to the CLR' page):

While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field's address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.

While the common intermediate language (CIL) does have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:

Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.

Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access.
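
The automatic bounds check described above is easy to observe from C#. The sketch below (BoundsCheckDemo and IsRejected are illustrative names, not part of any library) reads an element and reports whether the runtime rejected the access with an IndexOutOfRangeException.

```csharp
using System;

public static class BoundsCheckDemo
{
    // Returns true if reading array[index] was rejected by the bounds check.
    public static bool IsRejected(int[] array, int index)
    {
        try
        {
            _ = array[index]; // ldelem: the index is checked against the array's length
            return false;
        }
        catch (IndexOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        int[] values = { 1, 2, 3 };
        Console.WriteLine(IsRejected(values, 2));  // last valid index
        Console.WriteLine(IsRejected(values, 3));  // one past the end
        Console.WriteLine(IsRejected(values, -1)); // negative index
    }
}
```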

Also, from the section on Verifiable Code - Enforcing Memory and Type Safety in the same BOTR page:

In practice, the number of run-time checks needed is actually very small. They include the following operations:

Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)

Array bounds checks (just as we saw for memory safety)

Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later...)
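
The third check exists because of array covariance: a string[] can be treated as an object[], so every store of a reference into an array has to be validated at run time. A small sketch of that behaviour follows; ArrayCovarianceDemo and StoreIsRejected are illustrative names only.

```csharp
using System;

public static class ArrayCovarianceDemo
{
    // Returns true if storing 'value' into target[0] was rejected by the runtime.
    public static bool StoreIsRejected(object[] target, object value)
    {
        try
        {
            target[0] = value; // stelem.ref: checked against the array's real element type
            return false;
        }
        catch (ArrayTypeMismatchException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Legal at compile time thanks to the liberal casting rules for arrays.
        object[] objects = new string[1];

        Console.WriteLine(StoreIsRejected(objects, "text")); // a string fits
        Console.WriteLine(StoreIsRejected(objects, 123));    // a boxed int does not
    }
}
```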

However, you don't get this protection for free; there's a cost to pay:

Note that the need to do these checks places requirements on the runtime. In particular:

All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.

All arrays must also have their size (for bounds checking).

Arrays must have complete type information about their element type.
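
These requirements are visible through reflection. As a rough illustration (ArrayTypeInfoDemo and Describe are hypothetical names), the following shows that an array still knows its real element type and its length, even when it is referenced through a less specific static type.

```csharp
using System;

public static class ArrayTypeInfoDemo
{
    // Describes an array using only information the runtime stores with the object.
    public static string Describe(Array array)
    {
        Type elementType = array.GetType().GetElementType();
        return elementType.FullName + " x " + array.Length;
    }

    public static void Main()
    {
        // The static type is object[], but the object on the heap is a string[].
        object[] items = new string[4];
        Console.WriteLine(Describe(items));
    }
}
```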

Implementation Details

It turns out that large parts of the internal implementation of arrays are best described as magic. This Stack Overflow comment from Marc Gravell sums it up nicely:

Arrays are basically voodoo. Because they pre-date generics, yet must allow on-the-fly type-creation (even in .NET 1.0), they are implemented using tricks, hacks, and sleight of hand.

Yep that's right, arrays were parametrised (i.e. generic) before generics even existed. That means you could create arrays such as int[] and string[], long before you were able to write List<int> or List<string>, which only became possible in .NET 2.0.

Special helper classes

All this magic or sleight of hand is made possible by 2 things:

The CLR breaking all the usual type-safety rules
A special array helper class called SZArrayHelper

But first the why: why were all these tricks needed? From .NET Arrays, IList<T>, Generic Algorithms, and what about STL?:

When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections. To drive generic programming, of course we must make arrays and generic collections as seamless as possible. It felt that there should be a simple solution to this problem that meant you shouldn’t have to write the same code twice, once taking an IList<T> and again taking a T[]. The solution that dawned on me was that arrays needed to implement our generic IList. We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array). What we needed was to do the same thing in a strongly typed way for IList<T>.

But it was only done for the common case, i.e. 'single dimensional' arrays:

There were some restrictions here though – we didn’t want to support multidimensional arrays since IList<T> only provides single dimensional accesses. Also, arrays with non-zero lower bounds are rather strange, and probably wouldn’t mesh well with IList<T>, where most people may iterate from 0 to the return from the Count property on that IList. So, instead of making System.Array implement IList<T>, we made T[] implement IList<T>. Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term 'vector' publically at one point in time), and the element type is T. So Int32[] implements IList<Int32>, and String[] implements IList<String>.
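
The practical effect of that design can be sketched in a few lines of C#. In the example below (GenericListDemo, Sum and ImplementsGenericIList are illustrative names), one method written against IList<int> accepts both an int[] and a List<int>, while a multidimensional array does not implement the generic interface at all.

```csharp
using System;
using System.Collections.Generic;

public static class GenericListDemo
{
    // One generic-collection algorithm that works for T[] and List<T> alike.
    public static int Sum(IList<int> values)
    {
        int total = 0;
        for (int i = 0; i < values.Count; i++)
        {
            total += values[i];
        }
        return total;
    }

    // Only single-dimensional, zero-based arrays (SZArrays) implement IList<T>.
    public static bool ImplementsGenericIList(Array array)
    {
        return array is IList<int>;
    }

    public static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3 }));
        Console.WriteLine(Sum(new List<int> { 1, 2, 3 }));
        Console.WriteLine(ImplementsGenericIList(new int[3]));
        Console.WriteLine(ImplementsGenericIList(new int[2, 2]));
    }
}
```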

Also, this comment from the array source code sheds some further light on the reasons:

//----------------------------------------------------------------------------------
// Calls to (IList<T>)(array).Meth are actually implemented by SZArrayHelper.Meth<T>
// This workaround exists for two reasons:
//
//    - For working set reasons, we don't want insert these methods in the array 
//      hierachy in the normal way.
//    - For platform and devtime reasons, we still want to use the C# compiler to 
//      generate the method bodies.
//
// (Though it's questionable whether any devtime was saved.)
//
// ....
//----------------------------------------------------------------------------------

So it was done for convenience and efficiency, as they didn't want every instance of System.Array to carry around all the code for the IEnumerable<T> and IList<T> implementations.

This mapping takes place via a call to GetActualImplementationForArrayGenericIListOrIReadOnlyListMethod(..), which wins the prize for the best method name in the CoreCLR source!! It's responsible for wiring up the corresponding method from the SZArrayHelper class, e.g. IList<T>.Count -> SZArrayHelper.Count<T>, or, if the method is part of the IEnumerator<T> interface, the SZGenericArrayEnumerator<T> is used.

But this has the potential to cause security holes, as it breaks the normal C# type system guarantees, specifically regarding the this pointer. To illustrate the problem, here's the source code of the Count property. Note the call to JitHelpers.UnsafeCast<T[]>:

internal int get_Count<T>()
{
    //! Warning: "this" is an array, not an SZArrayHelper. See comments above
    //! or you may introduce a security hole!
    T[] _this = JitHelpers.UnsafeCast<T[]>(this);
    return _this.Length;
}

Yikes, it has to remap this to be able to call Length on the correct object!!

And just in case those comments aren't enough, there is a very strongly worded comment at the top of the class that further spells out the risks!!

Generally all this magic is hidden from you, but occasionally it leaks out. For instance, if you run the code below, SZArrayHelper will show up in the StackTrace and TargetSite properties of the NotSupportedException:

using System;
using System.Collections.Generic;

try {
    int[] someInts = { 1, 2, 3, 4 };
    IList<int> collection = someInts;
    // Throws NotSupportedException 'Collection is read-only'
    collection.Clear();
} catch (NotSupportedException nsEx) {
    Console.WriteLine("{0} - {1}", nsEx.TargetSite.DeclaringType, nsEx.TargetSite);
    Console.WriteLine(nsEx.StackTrace);
}

Removing Bounds Checks

The runtime also provides support for arrays in more conventional ways, the first of which is related to performance. Array bounds checks are all well and good when providing memory-safety, but they have a cost, so where possible the JIT removes any checks that it knows are redundant.

It does this by calculating the range of values that a for loop accesses and comparing those to the actual length of the array. If it determines that there is never an attempt to access an item outside the permissible bounds of the array, the run-time checks are then removed.
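
As a rough illustration of the kind of loops involved, the sketch below contrasts two patterns. The names (RangeCheckDemo, SumWholeArray, SumFirst) are made up, and the comments describe what the JIT commonly does rather than a guarantee; the outcome depends on the JIT version and can only be confirmed by inspecting the generated machine code.

```csharp
using System;

public static class RangeCheckDemo
{
    // The loop bound is the array's own Length, so the JIT can usually
    // prove that 0 <= i < data.Length and drop the per-access check.
    public static int SumWholeArray(int[] data)
    {
        int total = 0;
        for (int i = 0; i < data.Length; i++)
        {
            total += data[i];
        }
        return total;
    }

    // The loop bound is an unrelated parameter, so the JIT generally cannot
    // prove every index is in range and has to keep the check.
    public static int SumFirst(int[] data, int count)
    {
        int total = 0;
        for (int i = 0; i < count; i++)
        {
            total += data[i];
        }
        return total;
    }

    public static void Main()
    {
        int[] data = { 1, 2, 3, 4 };
        Console.WriteLine(SumWholeArray(data));
        Console.WriteLine(SumFirst(data, 2));
    }
}
```

Both methods behave identically from the caller's point of view; the only difference is whether the check is still present in the compiled code.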

For more information, the links below take you to the areas of the JIT source code that deal with this:

JIT trying to remove range checks
RangeCheck::OptimizeRangeCheck(..)
- In turn calls RangeCheck::GetRange(..)
- Also calls Compiler::optRemoveRangeCheck(..) to actually remove the range-check
Really informative source code comment explaining the range check removal logic

And if you are really keen, take a look at this gist that I put together to explore the scenarios where bounds checks are 'removed' and 'not removed'.

Allocating an array

Another task that the runtime helps with is allocating arrays, using hand-written assembly code so that the methods are as optimised as possible.

Run-time treats arrays differently

Finally, because arrays are so intertwined with the CLR, there are lots of places in which they are dealt with as a special case. For instance, a search for 'IsArray()' in the CoreCLR source returns over 60 hits, including:

The method table for an array is built differently
- MethodTableBuilder::BuildInteropVTableForArray(..)
When you call ToString() on an array, you get special formatting, i.e. 'System.Int32[]' or 'MyClass[,]'
- TypeString::AppendType(..)
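
The ToString() formatting is simple to check from C#. This small sketch (ArrayToStringDemo and NameOf are illustrative names) returns the string the runtime produces for a few array shapes.

```csharp
using System;

public static class ArrayToStringDemo
{
    // Arrays don't override ToString(), so this is the runtime's type name.
    public static string NameOf(Array array)
    {
        return array.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(NameOf(new int[0]));    // single-dimensional
        Console.WriteLine(NameOf(new int[2, 3])); // two-dimensional
        Console.WriteLine(NameOf(new int[1][]));  // jagged (array of arrays)
    }
}
```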

So yes, it's fair to say that arrays and the CLR have a Very Special Relationship.