Introduction
What is it?
A texture can have mipmaps. Mipmap level 0 is the original texture, level 1 is downscaled by a factor of 2 along each axis, level 2 by a factor of 4, and so on.

Mipmaps are generated so that distant surfaces can be sampled from higher mip levels, which means more pleasant visual output (less aliasing) and cheaper sampling. Wikipedia has a very nice example of that phenomenon, and so does the OpenGL GitBook.
The downside is that we need more memory: a full mip chain takes roughly a third more than the base level alone.
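For intuition, here's a minimal standalone sketch (plain C#, all names mine, assuming an uncompressed RGBA32 texture) of how a mip chain's memory adds up:

```csharp
using System;

class MipMemory {
    // Bytes used by a single mip level of an uncompressed RGBA32 texture.
    public static long MipBytes(int width, int height, int level) {
        int w = Math.Max(1, width >> level);
        int h = Math.Max(1, height >> level);
        return (long)w * h * 4; // 4 bytes per pixel
    }

    static void Main() {
        const int size = 2048;
        int levels = (int)Math.Log2(size) + 1; // 12 levels for a 2048x2048 texture
        long total = 0;
        for (int level = 0; level < levels; level++) {
            total += MipBytes(size, size, level);
        }
        long mip0 = MipBytes(size, size, 0);
        Console.WriteLine($"mip 0:              {mip0 / (1024.0 * 1024.0):F2} MB"); // ~16 MB
        Console.WriteLine($"full chain:         {total / (1024.0 * 1024.0):F2} MB"); // ~21.33 MB (+33%)
        Console.WriteLine($"chain without mip0: {(total - mip0) / (1024.0 * 1024.0):F2} MB"); // ~5.33 MB
    }
}
```

Note how mip 0 dominates: everything from mip 1 down fits in a third of mip 0's size, which is where the streaming savings below come from.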

Why use it?
Most of the time we don't render textures at mip 0, which means we don't need it in memory. Since mip 0 dominates the chain, dropping it alone cuts roughly 75% of the memory per texture (each mip level is a quarter the size of the previous one). And further mips like mip 1, mip 2, and so on aren't always needed either. That means even more frugal textures.
Screenshots are from the editor. The total texture memory stat is wrong there, because the editor keeps track of every texture ever loaded and counts them all into that stat.
Here are stats from a working system. You can see how many textures are at which mip level. There is 292.4MB of non-streaming textures and 1.2GB overall memory usage, therefore roughly 0.9GB comes from streamed textures.

If I force textures to be at mip 0, then memory usage goes up to 3.4GB.

Forced mip 1 results in memory pressure around the 1.1GB level, but makes textures blurry. It may not be very visible in this screenshot, but it's very noticeable in other scenarios.

Forced mip 5 reduces textures to 299.8MB. Subtract the 292.4MB of non-streamed textures and you see that 1667 textures take just 7.4MB.

Here is a small table with a summary (by accident I moved a bit between captures, so the measurements differ slightly):

More examples can be found at the end of this article.
So as you can see, there are massive savings to be had here.
Other considerations
The mip level to sample is calculated by the GPU at sampling time, so the CPU doesn't know which mip level is needed. To obtain that knowledge we have to calculate it ourselves: for every texture, compute the minimum needed mip level (mip 0 being the original resolution), then either load it if it's not present (along with all higher levels) or unload unused levels (the unload step can be skipped if we are under the memory budget).
That means CPU overhead. The mip level approximation is calculated from the UV distribution, the distance to the camera, and the camera setup, so every renderer can potentially need a different mip level.
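A back-of-the-envelope version of that approximation (my own sketch, not the production code; the uvDistribution input mirrors the kind of metric Unity exposes via Mesh.GetUVDistributionMetric):

```csharp
using System;

static class MipEstimator {
    // Rough per-renderer mip estimate: how many texels land on one screen pixel.
    // uvDistribution ~ UV area per world-space surface area of the mesh.
    public static int EstimateMipLevel(
            float distance, float objectScale, float uvDistribution,
            int textureSize, float screenHeightPixels, float fovYRadians) {
        // World-space size covered by one pixel at this distance.
        float worldPerPixel = 2f * distance * MathF.Tan(fovYRadians * 0.5f) / screenHeightPixels;
        // Texels covered by one pixel: pixel footprint in UV space times texture resolution.
        float texelsPerPixel = worldPerPixel * MathF.Sqrt(uvDistribution) / objectScale * textureSize;
        // Each mip level halves the texel density, hence the log2; clamp to a sane range.
        int mip = (int)MathF.Max(0f, MathF.Log2(MathF.Max(texelsPerPixel, 1f)));
        return Math.Min(mip, 12);
    }
}
```

Close objects resolve to mip 0, while the same object pushed far away climbs several mip levels, so each renderer indeed produces its own answer every time the camera moves.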
Unity out of the box
Unity's engine code has built-in support for CPU mip level approximation, but only for MeshRenderers and SkinnedMeshRenderers. For unknown reasons it's very slow, so Unity implemented sparse updates: only X renderers are processed per frame.
The calculation is hidden from the C# side; the most you can do is override an opaque internal value. So you either stick to vanilla renderers and use the feature as-is, or fully reimplement it yourself.
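For completeness, Unity does expose one public override hook: Texture2D.requestedMipmapLevel lets you pin a streamed texture's level manually (bypassing, rather than reading, the internal calculation). A minimal Unity-side sketch, with component and field names of my own invention:

```csharp
using UnityEngine;

// Forces a streamed texture to a given mip level instead of letting
// Unity's built-in renderer-based calculation decide.
public class ManualMipRequest : MonoBehaviour {
    [SerializeField] Texture2D streamedTexture; // must have "Streaming Mipmaps" enabled
    [SerializeField] int requestedLevel = 2;

    void OnEnable() {
        if (streamedTexture != null && streamedTexture.streamingMipmaps) {
            streamedTexture.requestedMipmapLevel = requestedLevel;
        }
    }

    void OnDisable() {
        if (streamedTexture != null) {
            // Hand control back to Unity's own mip selection.
            streamedTexture.ClearRequestedMipmapLevel();
        }
    }
}
```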
As you know, we have custom rendering solutions (including the ECS path, where Unity implements no mipmap streaming at all – really good job), which means we must implement mipmap streaming on our own: the CPU-side level calculation and request part of it, while the GPU resource handling stays on Unity's side and cannot be reimplemented.
Requirements
The first requirements were as follows:
- Must be fast
- Avoid runtime allocations
- Debug capabilities:
  - Check when a texture got blurred
  - Detect textures which should be streamed but are misconfigured
After the first version more requirements appeared:
- The system must have universal inputs, because Drake, Leshy, Medusa, and HLODs all need mipmap streaming support.
- It must be easily expandable, as more systems may be introduced (indeed, Kandra was added later, and connecting the two systems took only a few lines of code).
Implementations
Version 1
At that time most of the data lived in ECS, so it felt natural to implement streaming as a System. But back then we were still using Entities and Entities Graphics as external packages, which meant duplicating some of their code related to material registration. Extracting textures from a material is a very costly operation (Unity's shader and material pipeline is very sluggish), yet time-slicing that work inside an ECS system is hard and awkward. We also needed to maintain multiple caches: for materials, textures, refcounts, and mappings.
Duplicated code means unnecessary slowdowns.
A caching strategy that is both complex and performant means giga-complex code, a debugging nightmare, and headaches whenever you extend it.
Systems are more or less singletons, but accessing them and storing data in them still feels awkward.
It wasn't impossible to maintain and plug new rendering systems into this version, but it was a long process.
Debugging capabilities were also nearly non-existent.
Version 2
As described above, the first version had problems: it was very rigid and error-prone. Therefore we added the requirement that the system must be easily expandable, so that plugging in other systems or making changes and fixes is cheap.
To simplify the whole code, a small layer of indirection was introduced.
A System can register a Material, and Material_MipMaps takes care of tracking that Material's textures. The System is then expected to provide, from a background thread, a mip factor associated with the MaterialHandle.
For every Material the factors are gathered, and Interlocked.CompareExchange is used in a way that always stores the minimum factor.
Then the texture part of the system picks that up and applies similar logic to the textures extracted from the registered Materials.
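The min-factor accumulation can be sketched as a lock-free "interlocked min" (a standalone illustration of the pattern, not our exact code):

```csharp
using System.Threading;

static class AtomicMin {
    // Atomically stores value into target if it is smaller than the current
    // content; safe to call from many threads at once.
    public static void Min(ref float target, float value) {
        float current = Volatile.Read(ref target);
        // Retry until either our value is no longer the smaller one,
        // or our compare-exchange wins the race.
        while (value < current) {
            float witnessed = Interlocked.CompareExchange(ref target, value, current);
            if (witnessed == current) {
                return; // exchange succeeded
            }
            current = witnessed; // another thread wrote first; re-check against its value
        }
    }
}
```

Keeping the minimum per Material means no renderer ever gets a blurrier texture than the sharpest one requested that frame (assuming, as above, that a smaller factor maps to a sharper mip).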
You can see how easy it is to add a new system – just look at the Kandra example:
[BurstCompile]
struct KandraMipmapsFactorJob : IJobFor {
    public CameraData cameraData;

    [ReadOnly] public UnsafeBitmask takenSlots;
    [ReadOnly] public UnsafeBitmask toUnregister;
    [ReadOnly] public UnsafeArray<float> xs;
    [ReadOnly] public UnsafeArray<float> ys;
    [ReadOnly] public UnsafeArray<float> zs;
    [ReadOnly] public UnsafeArray<float> radii;
    [ReadOnly] public UnsafeArray<float4x4> rootBoneMatrices;
    [ReadOnly] public UnsafeArray<float> reciprocalUvDistributions;
    [ReadOnly] public UnsafeArray<UnsafeArray<MipmapsStreamingMasterMaterials.MaterialId>> materialIndices;

    public MipmapsStreamingMasterMaterials.ParallelWriter outMipmapsStreamingWriter;

    public void Execute(int index) {
        var uIndex = (uint)index;
        if (!takenSlots[uIndex] || toUnregister[uIndex]) {
            return;
        }

        var position = new float3(xs[uIndex], ys[uIndex], zs[uIndex]);
        var radius = radii[uIndex];
        var scale = math.square(math.cmax(rootBoneMatrices[uIndex].Scale()));
        var factorFactor = MipmapsStreamingUtils.CalculateMipmapFactorFactor(cameraData, position, radius, scale);
        var fullFactor = reciprocalUvDistributions[uIndex] * factorFactor;

        var subMaterialIndices = materialIndices[uIndex];
        for (uint j = 0; j < subMaterialIndices.Length; j++) {
            outMipmapsStreamingWriter.UpdateMipFactor(subMaterialIndices[j], fullFactor);
        }
    }
}
There are still caveats. The biggest is that you can (un)register a Material mostly during the Update phase. During EarlyUpdate we extract textures from any newly added Materials. During PreLateUpdate, jobs like the one above are scheduled. Finally, during PostLateUpdate, the jobs are completed and the per-texture requested mip levels are updated.
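That last step – turning a texture's accumulated minimum factor into an integer mip level – can be sketched like this (an assumed mapping on my part; the real scaling depends on how the factors are normalized):

```csharp
using System;

static class MipLevelMapping {
    // Convert the per-texture minimum mip factor into an integer mip level.
    // Assumes the factor behaves like a squared texel-to-pixel ratio, hence
    // the 0.5 * log2, mirroring how GPUs derive a LOD from screen-space
    // derivatives.
    public static int FactorToMipLevel(float minFactor, int mipCount) {
        float lod = 0.5f * MathF.Log2(MathF.Max(minFactor, 1f));
        return Math.Clamp((int)lod, 0, mipCount - 1);
    }
}
```

The clamped result is what would end up in each texture's requested level during PostLateUpdate.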
Closing
Mipmap streaming is a very important technique for lowering memory pressure, which matters even more in Unity, since Unity's memory pressure is high to start with. Virtual Textures would be better and more impactful, but there is no real solution from Unity's side, and we are not brave enough to do it ourselves (or to adopt any solution from GitHub).
Overall, by making the code simpler I was able to make it faster too. A bit of indirection (as opposed to abstraction) makes the topic easier to think and reason about, and the best optimizations are born from that thinking and reasoning.
When designing architecture, be aware of the mental map of the tools you are about to use. An ECS System may not seem far from a singleton-like System, yet the two can lead to very different architectures – one feeling like a nightmare, the other more like a dream.
More examples of stats
Act I
Act II
Inside interior (overworld is loaded in background):
