If you've ever tried to render tens of thousands of objects in Unity — trees, rocks, enemies, particles — you know the pain. My scene had 100,000 simple mesh instances and was running at 38ms per frame (26 FPS). After switching to GPU indirect rendering, I got it down to 0.4ms. That's a 95x speedup.
Here's exactly how it works.
The Problem: CPU Bottleneck
Unity's default rendering pipeline issues one draw call per object (or per batch). Even with GPU instancing enabled, the CPU still has to collect transform data for every object each frame and submit it batch by batch. At 100k objects, that's brutal.
The frame breakdown looked like this:
- CPU: 35ms (preparing transforms, issuing draw calls)
- GPU: 3ms (actually drawing)
The GPU was sitting idle for most of each frame. Over 90% of the 38ms went to CPU work.
The Solution: DrawMeshInstancedIndirect
Graphics.DrawMeshInstancedIndirect lets you tell the GPU "draw N instances of this mesh", where N and the other draw arguments live in a small GPU buffer, and the per-instance data (here, positions) comes from a second buffer you've already uploaded. The CPU just issues a single command.
Here's the core setup:
using UnityEngine;

public class IndirectRenderer : MonoBehaviour
{
    public Mesh instanceMesh;
    public Material instanceMaterial;
    public int instanceCount = 100000;

    private ComputeBuffer positionBuffer;
    private ComputeBuffer argsBuffer;
    // Layout: index count, instance count, start index, base vertex, start instance
    private uint[] args = new uint[5] { 0, 0, 0, 0, 0 };

    void Start()
    {
        // Upload all positions to GPU once
        positionBuffer = new ComputeBuffer(instanceCount, 16); // stride: float4 = 16 bytes
        Vector4[] positions = new Vector4[instanceCount];
        for (int i = 0; i < instanceCount; i++)
        {
            positions[i] = new Vector4(
                Random.Range(-500f, 500f), 0,
                Random.Range(-500f, 500f), 1
            );
        }
        positionBuffer.SetData(positions);
        instanceMaterial.SetBuffer("_Positions", positionBuffer);

        // Set up indirect args buffer
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint),
            ComputeBufferType.IndirectArguments);
        args[0] = instanceMesh.GetIndexCount(0);
        args[1] = (uint)instanceCount;
        args[2] = instanceMesh.GetIndexStart(0);
        args[3] = instanceMesh.GetBaseVertex(0);
        // args[4] (start instance) stays 0
        argsBuffer.SetData(args);
    }

    void Update()
    {
        // ONE draw call for 100k objects
        Graphics.DrawMeshInstancedIndirect(
            instanceMesh,
            0, // submesh index
            instanceMaterial,
            // Bounds used for culling the whole batch; must enclose every instance
            new Bounds(Vector3.zero, new Vector3(1000, 100, 1000)),
            argsBuffer
        );
    }

    void OnDestroy()
    {
        // ComputeBuffers are not garbage collected; release them explicitly
        positionBuffer?.Release();
        argsBuffer?.Release();
    }
}
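The five uint values in args follow a fixed order that the indexed indirect draw reads on the GPU. Here is a minimal sketch that gives each slot a name. The IndirectArgs class and its constants are hypothetical helpers for illustration, not part of Unity.

```csharp
// Hypothetical helper (not a Unity API): names the five slots of the
// indirect-arguments array used by an indexed, instanced draw.
public static class IndirectArgs
{
    public const int IndexCountPerInstance = 0; // indices in the submesh
    public const int InstanceCount = 1;         // how many copies to draw
    public const int StartIndexLocation = 2;    // first index of the submesh
    public const int BaseVertexLocation = 3;    // value added to every index
    public const int StartInstanceLocation = 4; // first instance ID, left at 0 here

    public const int UintsPerDraw = 5;
    public const int BytesPerDraw = UintsPerDraw * sizeof(uint); // 20 bytes

    // Builds one argument set in the order the GPU expects.
    public static uint[] Build(uint indexCount, uint instanceCount,
                               uint startIndex, uint baseVertex)
    {
        var args = new uint[UintsPerDraw];
        args[IndexCountPerInstance] = indexCount;
        args[InstanceCount] = instanceCount;
        args[StartIndexLocation] = startIndex;
        args[BaseVertexLocation] = baseVertex;
        args[StartInstanceLocation] = 0;
        return args;
    }
}
```

In the renderer above, IndirectArgs.Build(instanceMesh.GetIndexCount(0), (uint)instanceCount, instanceMesh.GetIndexStart(0), instanceMesh.GetBaseVertex(0)) would produce the same array that Start() fills in by hand.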
The Shader Side
Your material needs to read the position buffer. In HLSL:
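// Goes inside a CGPROGRAM block with #include "UnityCG.cginc" and #pragma target 4.5
StructuredBuffer<float4> _Positions; // one world-space position per instance

struct appdata { float4 vertex : POSITION; };
struct v2f     { float4 pos : SV_POSITION; };

void vert(appdata v, uint instanceID : SV_InstanceID, out v2f o)
{
    // Offset the mesh vertex by this instance's position. Only xyz is added,
    // so w stays 1 instead of becoming 2.
    float3 worldPos = v.vertex.xyz + _Positions[instanceID].xyz;
    // The result is already in world space, so apply view-projection directly
    o.pos = mul(UNITY_MATRIX_VP, float4(worldPos, 1.0));
}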
The key is SV_InstanceID — this is the GPU-side index that maps each instance to its entry in the position buffer.
The Results
| Approach | CPU | GPU | Total |
|---|---|---|---|
| Default (no batching) | 35ms | 3ms | 38ms |
| Static batching | 28ms | 3ms | 31ms |
| GPU instancing | 12ms | 3ms | 15ms |
| Indirect rendering | 0.1ms | 0.3ms | 0.4ms |
The CPU time essentially disappears. The GPU is doing all the work, which is exactly what it's built for.
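To compare your own measurements the same way, the arithmetic behind the table is just two divisions. This is a small sketch, and FrameMath is a hypothetical name rather than anything from Unity.

```csharp
// Hypothetical helper for turning frame times into comparable numbers.
public static class FrameMath
{
    // Frames per second for a frame time given in milliseconds.
    public static double Fps(double frameMs) => 1000.0 / frameMs;

    // How many times faster afterMs is compared to beforeMs.
    public static double Speedup(double beforeMs, double afterMs) => beforeMs / afterMs;
}
```

With the totals from the table, FrameMath.Speedup(38, 0.4) gives 95, FrameMath.Speedup(38, 15) gives about 2.5 for plain GPU instancing, and FrameMath.Fps(38) gives about 26.3.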
When to Use This
GPU indirect rendering is ideal when:
- You have thousands of identical or similar meshes (foliage, particles, crowds)
- Objects don't need per-frame CPU logic (or you can move that logic to a compute shader)
- You want LOD handled on the GPU (for example, by keeping one set of draw arguments per LOD level and letting a compute shader decide which level each instance belongs to)
It's not the right fit for objects that need individual per-frame C# callbacks, or highly unique meshes that can't share a draw call.
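One way to set up GPU-side LOD, offered as an assumption rather than the only scheme: keep one five-uint argument set per LOD level in a single buffer and issue one indirect draw per level. The argsOffset parameter of Graphics.DrawMeshInstancedIndirect, a byte offset into the args buffer, selects the set. The sketch below only builds the array and computes offsets. LodArgs is a hypothetical name, and it assumes each LOD is its own mesh whose submesh 0 starts at index 0 with base vertex 0.

```csharp
// Hypothetical helper: packs indirect draw arguments for several LOD levels.
public static class LodArgs
{
    public const int UintsPerDraw = 5;

    // indexCounts[i] is the index count of the LOD i mesh. Instance counts
    // start at 0 so a GPU culling/LOD pass can fill them in each frame.
    public static uint[] Build(uint[] indexCounts)
    {
        var args = new uint[indexCounts.Length * UintsPerDraw];
        for (int lod = 0; lod < indexCounts.Length; lod++)
        {
            // Slot 0 of each set: index count per instance.
            // Slots 1..4 (instance count, start index, base vertex,
            // start instance) stay 0 under the assumptions above.
            args[lod * UintsPerDraw] = indexCounts[lod];
        }
        return args;
    }

    // Byte offset of a LOD's argument set, suitable for argsOffset.
    public static int ByteOffset(int lod) => lod * UintsPerDraw * sizeof(uint);
}
```

For three LODs the array holds 15 uints, and the draws would use byte offsets 0, 20, and 40. The args ComputeBuffer would then be created with a count equal to the number of LOD levels instead of 1.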
Going Further: Compute Shaders for Culling
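The real power comes when you combine this with a compute shader that does frustum culling on the GPU. Instead of rendering all 100k instances, you dispatch a compute shader that copies only the visible instances into a second position buffer and writes the visible count into the args buffer. The material then reads from that visible buffer. The culling itself runs entirely on the GPU; the CPU only resets the instance count to zero and issues the dispatch each frame.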
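#pragma kernel CSMain

// Declarations inferred from how the kernel uses them
StructuredBuffer<float4> _AllPositions;        // every instance position (input)
RWStructuredBuffer<float4> _VisiblePositions;  // compacted visible positions (output)
RWStructuredBuffer<uint> _ArgsBuffer;          // indirect args; slot 1 is the instance count
uint _InstanceCount;

// IsInFrustum(float4) is assumed to be defined elsewhere (e.g. a six-plane test).
// _ArgsBuffer[1] must be reset to 0 from the CPU before each dispatch.
[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Guard the threads of the last, partially filled group
    if (id.x >= _InstanceCount) return;
    float4 pos = _AllPositions[id.x];
    if (IsInFrustum(pos))
    {
        uint idx;
        // Atomically bump the instance count; idx receives the value before the add
        InterlockedAdd(_ArgsBuffer[1], 1, idx);
        _VisiblePositions[idx] = pos;
    }
}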
This can drop your rendered instance count from 100k to 8k depending on camera angle, with the cull happening in microseconds.
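The kernel calls IsInFrustum without showing it, and the dispatch size has to match numthreads. As a rough CPU-side reference for both, here is a sketch that assumes the frustum is given as six planes stored as (normal.xyz, distance) with normals pointing inward. It uses System.Numerics so it runs outside Unity. CullingReference and its members are hypothetical names, and the author's actual test may differ.

```csharp
using System.Collections.Generic;
using System.Numerics;

// Hypothetical CPU reference for the GPU culling pass.
public static class CullingReference
{
    public const int ThreadsPerGroup = 64; // must match [numthreads(64, 1, 1)]

    // Thread groups needed so every instance gets a thread (rounds up).
    public static int GroupCount(int instanceCount) =>
        (instanceCount + ThreadsPerGroup - 1) / ThreadsPerGroup;

    // A point is inside when it lies on the positive side of every plane.
    public static bool IsInFrustum(Vector3 p, Vector4[] planes)
    {
        foreach (var plane in planes)
        {
            var normal = new Vector3(plane.X, plane.Y, plane.Z);
            float signedDistance = Vector3.Dot(normal, p) + plane.W;
            if (signedDistance < 0f) return false;
        }
        return true;
    }

    // CPU equivalent of the kernel: returns the compacted visible list.
    public static List<Vector3> Cull(IReadOnlyList<Vector3> positions, Vector4[] planes)
    {
        var visible = new List<Vector3>();
        foreach (var p in positions)
        {
            if (IsInFrustum(p, planes)) visible.Add(p);
        }
        return visible;
    }
}
```

For 100,000 instances, GroupCount returns 1563, which is the x value to pass to ComputeShader.Dispatch. Treating each instance as a point means a mesh whose center is just outside the frustum can pop at the screen edge, so a real test would usually pad each plane by a bounding radius. Comparing Cull against a readback of the visible buffer is one way to sanity-check the shader, keeping in mind that the GPU writes visible entries in nondeterministic order.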
Ready-to-Use Starter Kit
If you want to skip the boilerplate and start with working examples, I've put together a GPU Indirect Rendering Starter Pack for Unity that includes:
- Complete indirect renderer with frustum culling compute shader
- LOD support baked in
- Example scene with 100k instances running in real-time
- Documented, production-ready C# + HLSL code
It's the foundation I wish I'd had when I started with this technique.
Have questions about GPU indirect rendering or hit a snag? Drop a comment below — happy to help.