Michael Sacco

Posted on May 13

How I Render 100,000 Unity Objects With One Draw Call (38ms to 0.4ms)

#unity3d #gamedev #performance #csharp

If you've ever tried to render tens of thousands of objects in Unity — trees, rocks, enemies, particles — you know the pain. My scene had 100,000 simple mesh instances and was running at 38ms per frame (26 FPS). After switching to GPU indirect rendering, I got it down to 0.4ms. That's a 95x speedup.

Here's exactly how it works.

The Problem: CPU Bottleneck

By default, Unity issues one draw call per object, or one per batch when batching applies. Even with GPU instancing enabled, the CPU still has to cull each renderer, gather its transform, and upload per-instance data every frame, split across many instanced batches. At 100k objects, that's brutal.

The frame breakdown looked like this:

CPU: 35ms (preparing transforms, issuing draw calls)
GPU: 3ms (actually drawing)

The GPU was sitting idle most of the time. About 92% of the frame (35 of 38 ms) was CPU cost.

The Solution: DrawMeshInstancedIndirect

Graphics.DrawMeshInstancedIndirect lets you tell the GPU "draw N instances of this mesh", where N and the other draw arguments live in a small GPU buffer (the args buffer) and the per-instance data lives in a second buffer you've already uploaded. The CPU just issues a single command per frame.

Here's the core setup:

using UnityEngine;

public class IndirectRenderer : MonoBehaviour
{
    public Mesh instanceMesh;
    public Material instanceMaterial;
    public int instanceCount = 100000;

    private ComputeBuffer positionBuffer;
    private ComputeBuffer argsBuffer;
    private uint[] args = new uint[5] { 0, 0, 0, 0, 0 };

    void Start()
    {
        // Upload all positions to GPU once
        positionBuffer = new ComputeBuffer(instanceCount, 16); // float4
        Vector4[] positions = new Vector4[instanceCount];
        for (int i = 0; i < instanceCount; i++)
        {
            positions[i] = new Vector4(
                Random.Range(-500f, 500f), 0,
                Random.Range(-500f, 500f), 1
            );
        }
        positionBuffer.SetData(positions);
        instanceMaterial.SetBuffer("_Positions", positionBuffer);

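        // Set up indirect args buffer: five uints, in the order Unity expects
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint),
            ComputeBufferType.IndirectArguments);
        args[0] = instanceMesh.GetIndexCount(0);  // index count per instance
        args[1] = (uint)instanceCount;            // instance count
        args[2] = instanceMesh.GetIndexStart(0);  // start index location
        args[3] = instanceMesh.GetBaseVertex(0);  // base vertex location
        args[4] = 0;                              // start instance location
        argsBuffer.SetData(args);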
    }

    void Update()
    {
        // ONE draw call for 100k objects
        Graphics.DrawMeshInstancedIndirect(
            instanceMesh,
            0,
            instanceMaterial,
            new Bounds(Vector3.zero, new Vector3(1000, 100, 1000)),
            argsBuffer
        );
    }

    void OnDestroy()
    {
        positionBuffer?.Release();
        argsBuffer?.Release();
    }
}
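
The Bounds passed to the draw call is hard-coded above, and it has to enclose every instance, since Unity uses it as the bounding volume for the whole draw. If your positions aren't a fixed 1000 x 1000 grid, one option is to compute the box from the data. Below is a minimal sketch of that idea. The InstanceBounds class and its meshExtents parameter are hypothetical names, not part of the original renderer, and it uses System.Numerics types so it runs outside Unity.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of the original renderer.
public static class InstanceBounds
{
    // Returns (center, size) of the axis-aligned box enclosing all positions,
    // grown by meshExtents on each side so meshes at the edge still fit.
    public static (Vector3 Center, Vector3 Size) Compute(Vector4[] positions, Vector3 meshExtents)
    {
        if (positions.Length == 0)
            throw new ArgumentException("Need at least one position.");

        var min = new Vector3(float.MaxValue);
        var max = new Vector3(float.MinValue);
        foreach (var p in positions)
        {
            var v = new Vector3(p.X, p.Y, p.Z); // w is not a coordinate
            min = Vector3.Min(min, v);
            max = Vector3.Max(max, v);
        }

        min -= meshExtents;
        max += meshExtents;
        return ((min + max) * 0.5f, max - min);
    }
}
```

In Unity the same loop works with UnityEngine.Vector3, and the result feeds new Bounds(center, size). Compute it once in Start and cache it rather than rebuilding it every frame.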

The Shader Side

The material's vertex shader needs to read the position buffer. In HLSL:

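// StructuredBuffer needs shader model 4.5: add "#pragma target 4.5" to the pass
StructuredBuffer<float4> _Positions;

void vert(appdata v, uint instanceID : SV_InstanceID,
          out v2f o)
{
    // xyz is the instance's world-space offset; w is unused here
    float3 worldPos = _Positions[instanceID].xyz + v.vertex.xyz;
    // Already in world space, so skip the object-to-world matrix
    o.pos = mul(UNITY_MATRIX_VP, float4(worldPos, 1.0));
}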

The key is SV_InstanceID — this is the GPU-side index that maps each instance to its entry in the position buffer.

The Results

Approach	CPU	GPU	Total
Default (no batching)	35ms	3ms	38ms
Static batching	28ms	3ms	31ms
GPU instancing	12ms	3ms	15ms
Indirect rendering	0.1ms	0.3ms	0.4ms

The CPU time essentially disappears. The GPU is doing all the work, which is exactly what it's built for.
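
To put the table in more familiar terms, here is a small sketch that converts the totals into FPS and a speedup relative to the default path. The FrameTimeMath class is just an illustrative helper; the numbers are the ones from the table.

```csharp
using System;

// Illustrative helper for reading the results table.
public static class FrameTimeMath
{
    // Frames per second for a given frame time in milliseconds.
    public static double Fps(double frameMs) => 1000.0 / frameMs;

    // How many times faster afterMs is than beforeMs.
    public static double Speedup(double beforeMs, double afterMs) => beforeMs / afterMs;

    // Prints FPS and speedup versus the first row for each approach.
    public static void PrintTable()
    {
        // Totals copied from the results table.
        (string Name, double TotalMs)[] rows =
        {
            ("Default (no batching)", 38.0),
            ("Static batching", 31.0),
            ("GPU instancing", 15.0),
            ("Indirect rendering", 0.4),
        };

        foreach (var row in rows)
        {
            Console.WriteLine(
                $"{row.Name}: {Fps(row.TotalMs):F1} FPS, " +
                $"{Speedup(rows[0].TotalMs, row.TotalMs):F1}x vs default");
        }
    }
}
```

Static batching and GPU instancing come out at roughly 1.2x and 2.5x over the default, while the indirect path is about 95x.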

When to Use This

GPU indirect rendering is ideal when:

You have thousands of identical or similar meshes (foliage, particles, crowds)
Objects don't need per-frame CPU logic (or you can move that logic to a compute shader)
You want LOD handled on the GPU (for example, one set of draw arguments per LOD mesh in the args buffer, with a compute shader sorting instances into LOD buckets)

It's not the right fit for objects that need individual per-frame C# callbacks, or highly unique meshes that can't share a draw call.
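
For the GPU LOD case, one common layout is a single args buffer holding one five-uint entry per LOD mesh, with each draw call pointing at its own entry through the argsOffset parameter of DrawMeshInstancedIndirect (a byte offset). The sketch below only builds that array on the CPU. LodArgs and its method names are hypothetical, and it assumes a compute pass fills in the per-LOD instance counts later.

```csharp
using System;

// Hypothetical helper: packs draw arguments for several LOD meshes into one array.
public static class LodArgs
{
    // index count, instance count, start index, base vertex, start instance
    public const int UintsPerDraw = 5;

    // One 5-uint entry per LOD. Instance counts start at 0 because a
    // culling/LOD compute pass is assumed to write them each frame.
    public static uint[] Build(uint[] indexCounts, uint[] indexStarts, uint[] baseVertices)
    {
        if (indexStarts.Length != indexCounts.Length || baseVertices.Length != indexCounts.Length)
            throw new ArgumentException("All arrays need one entry per LOD.");

        var args = new uint[indexCounts.Length * UintsPerDraw];
        for (int lod = 0; lod < indexCounts.Length; lod++)
        {
            int o = lod * UintsPerDraw;
            args[o + 0] = indexCounts[lod];
            args[o + 1] = 0;
            args[o + 2] = indexStarts[lod];
            args[o + 3] = baseVertices[lod];
            args[o + 4] = 0;
        }
        return args;
    }

    // Byte offset of a LOD's entry, suitable for the argsOffset parameter.
    public static int ByteOffset(int lod) => lod * UintsPerDraw * sizeof(uint);
}
```

In Unity you would upload the array to a ComputeBufferType.IndirectArguments buffer and issue one DrawMeshInstancedIndirect per LOD mesh, passing LodArgs.ByteOffset(lod) as argsOffset. Each LOD also needs its own region of visible-instance data for the shader to read.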

Going Further: Compute Shaders for Culling

The real power comes when you combine this with a compute shader that does frustum culling on the GPU. Instead of rendering all 100k instances, you dispatch a compute shader that copies only the visible instances into a second position buffer and writes their count into the args buffer. The CPU only resets the count and issues the dispatch; it never touches per-instance data.

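#pragma kernel CSMain

StructuredBuffer<float4> _AllPositions;        // every instance, uploaded once
RWStructuredBuffer<float4> _VisiblePositions;  // survivors; the vertex shader reads this buffer
RWStructuredBuffer<uint> _ArgsBuffer;          // indirect args; element 1 is the instance count
uint _InstanceCount;
float4 _FrustumPlanes[6];  // assumed layout: (inward normal.xyz, distance), set from C#

bool IsInFrustum(float4 pos)  // point test; pad by the mesh radius to avoid edge popping
{
    for (int i = 0; i < 6; i++)
        if (dot(_FrustumPlanes[i].xyz, pos.xyz) + _FrustumPlanes[i].w < 0) return false;
    return true;
}

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= _InstanceCount) return;
    float4 pos = _AllPositions[id.x];
    if (!IsInFrustum(pos)) return;
    uint idx;
    // Atomically reserve a slot. _ArgsBuffer[1] must be reset to 0 before each dispatch.
    InterlockedAdd(_ArgsBuffer[1], 1, idx);
    _VisiblePositions[idx] = pos;
}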

This can drop your rendered instance count from 100k to 8k depending on camera angle, with the cull happening in microseconds.
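
When debugging a culling kernel, it helps to have a CPU version to compare counts against. The sketch below is one way to write that reference, not the shader's actual code. It assumes each plane is stored as (normal.xyz, distance) with the normal pointing into the frustum, and CullReference and its method names are hypothetical. It also includes the round-up division for sizing the dispatch against the 64-thread groups used above.

```csharp
using System;
using System.Numerics;

// Hypothetical CPU reference for a GPU frustum-culling pass.
public static class CullReference
{
    // A point is visible if it lies on the inner side of every plane.
    // Assumed plane layout: (normal.xyz, distance), normal pointing inward.
    public static bool IsInFrustum(Vector4 pos, Vector4[] planes)
    {
        var p = new Vector3(pos.X, pos.Y, pos.Z);
        foreach (var plane in planes)
        {
            var n = new Vector3(plane.X, plane.Y, plane.Z);
            if (Vector3.Dot(n, p) + plane.W < 0f) return false;
        }
        return true;
    }

    // Writes survivors to the front of visible and returns how many there are,
    // the value the GPU pass accumulates as the instance count.
    public static uint CullAndCompact(Vector4[] all, Vector4[] planes, Vector4[] visible)
    {
        uint count = 0;
        foreach (var pos in all)
        {
            if (IsInFrustum(pos, planes)) visible[count++] = pos;
        }
        return count;
    }

    // Thread groups needed to cover instanceCount, rounding up.
    public static int ThreadGroups(int instanceCount, int groupSize = 64)
        => (instanceCount + groupSize - 1) / groupSize;
}
```

For 100,000 instances, ThreadGroups returns 1563, which is the x count to pass to ComputeShader.Dispatch. The GPU version writes survivors in a nondeterministic order, so compare counts or sorted sets rather than element by element.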

Ready-to-Use Starter Kit

If you want to skip the boilerplate and start with working examples, I've put together a GPU Indirect Rendering Starter Pack for Unity that includes:

Complete indirect renderer with frustum culling compute shader
LOD support baked in
Example scene with 100k instances running in real-time
Documented, production-ready C# + HLSL code

It's the foundation I wish I'd had when I started with this technique.

Have questions about GPU indirect rendering or hit a snag? Drop a comment below — happy to help.
