Our LiDAR detector spent 40% of its time in voxelization, not convs

#machinelearning #mlops #computervision #llm

TL;DR: We profiled a LiDAR object detector expecting the 3D backbone to dominate. It didn't. Voxelization alone ate roughly 40% of per-frame latency on an A100 (about 50% once you add the scatter-to-BEV step), and moving both out of the Python hot path took our p50 from 31 ms down to 19 ms.

The assumption that cost us two weeks

When I was at Valeo.ai we ran a PointPillars-style detector on nuScenes-scale point clouds, around 250k points per sweep. The mental model everyone carried was simple. The sparse conv backbone is heavy, so the backbone is where the milliseconds go. We spent a sprint trying to prune channels and fuse BatchNorm into the conv weights before anyone actually looked at a trace.

When we finally ran torch.profiler with CUDA activities enabled, the picture was not what we expected. The 2D CNN head and the pillar feature net were fast. The expensive part lived upstream, in the code nobody thought of as "the model."

## What the trace actually showed

To be precise, two things dominated. First, the voxelization that buckets raw points into a fixed grid of pillars. Second, the scatter operation that writes encoded pillar features back into a dense BEV canvas before the 2D backbone runs.
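
To make those two steps concrete, here is a minimal pure-Python sketch of pillar voxelization and the scatter into a BEV canvas. It is not the code we ran: the function names, the grid parameters, and the mean-height "feature" standing in for the pillar feature net are all assumptions for illustration. The per-point Python loop is the same general shape of work that made the CPU path slow.

```python
from collections import defaultdict

def voxelize(points, voxel_size, x_min, y_min, nx, ny, max_points):
    """Bucket (x, y, z) points into pillars on an nx-by-ny grid."""
    pillars = defaultdict(list)
    for x, y, z in points:
        ix = int((x - x_min) // voxel_size)
        iy = int((y - y_min) // voxel_size)
        if not (0 <= ix < nx and 0 <= iy < ny):
            continue  # point falls outside the detection range
        bucket = pillars[(ix, iy)]
        if len(bucket) < max_points:  # cap the number of points per pillar
            bucket.append((x, y, z))
    return dict(pillars)

def encode(pillars):
    """Hypothetical stand-in for the pillar feature net: mean height per pillar."""
    return {cell: sum(p[2] for p in pts) / len(pts) for cell, pts in pillars.items()}

def scatter_to_bev(features, nx, ny, fill=0.0):
    """Write one feature per occupied pillar into a dense ny-by-nx canvas."""
    canvas = [[fill] * nx for _ in range(ny)]
    for (ix, iy), value in features.items():
        canvas[iy][ix] = value
    return canvas

points = [(0.1, 0.1, 1.0), (0.2, 0.3, 3.0), (1.6, 0.4, 5.0), (9.0, 9.0, 0.5)]
pillars = voxelize(points, voxel_size=0.5, x_min=0.0, y_min=0.0,
                   nx=4, ny=2, max_points=32)
canvas = scatter_to_bev(encode(pillars), nx=4, ny=2)
print(canvas)
```

Most cells of the canvas stay at the fill value, which is why the scatter is a sparse write into a dense tensor rather than a full copy.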

Here is the kind of profiling output that made us stop:

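import torch
from torch.profiler import profile, ProfilerActivity

# `model` and `point_cloud` come from the detection pipeline.
# Warm up first so one-time CUDA setup stays out of the trace.
for _ in range(5):
    model(point_cloud)
torch.cuda.synchronize()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(50):
        detections = model(point_cloud)
    # Drain queued kernels before the profiler context closes.
    torch.cuda.synchronize()

# GPU-side view: scatter, backbone, head.
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=12))
# CPU-side view: the voxelizer has no CUDA time, so it only ranks here.
print(prof.key_averages().table(
    sort_by="self_cpu_time_total", row_limit=12))
# voxelize_cpu          12.4 ms  (self CPU)
# scatter_nd            3.1 ms   (CUDA)
# sparse_conv_backbone  9.8 ms   (CUDA)
# rpn_head              4.2 ms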

The voxelization ran on CPU in a Python loop. Every frame paid a host-to-device copy after the points were bucketed, and the CPU work serialized against the GPU instead of overlapping with it. The backbone was not the problem. The plumbing around it was.

## Moving the work to where the data already lives

The fix was not exotic. We replaced the CPU voxelizer with a GPU implementation from spconv 2.x, kept the point cloud resident on the device from the sensor decode onward, and let the scatter run as a single fused kernel instead of an indexed assignment in Python.

The nuance here is that the win was not only a faster voxelizer. Moving it on-device also removed a synchronization point. Once voxelization happened on the GPU, the CPU could prepare frame N+1 while the GPU finished frame N. That overlap is invisible in a microbenchmark of one op and very visible in end-to-end p50.
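
The overlap idea can be sketched in plain Python with a bounded queue. This is an illustration under assumptions, not our pipeline: `run_pipelined`, `prepare`, and `infer` are hypothetical names, and the toy lambdas stand in for CPU-side frame preparation and the GPU forward pass.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def run_pipelined(frames, prepare, infer, depth=2):
    """Overlap prepare(frame N+1) on a worker thread with infer(frame N)."""
    ready = queue.Queue(maxsize=depth)  # bounded so prep cannot run far ahead

    def producer():
        try:
            for frame in frames:
                ready.put(prepare(frame))
        finally:
            ready.put(_DONE)  # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := ready.get()) is not _DONE:
        results.append(infer(item))
    return results

# Toy stand-ins: doubling plays the CPU-side prep, +1 plays the forward pass.
outputs = run_pipelined(range(5), prepare=lambda f: f * 2, infer=lambda x: x + 1)
print(outputs)
```

A Python thread only buys real overlap when the consumer spends its time waiting on something that releases the GIL, such as a GPU. If both sides are CPU-bound Python, the two stages will still take turns.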

| Stage | Before (CPU voxelize) | After (GPU voxelize) |
| --- | --- | --- |
| Voxelization | 12.4 ms (CPU) | 1.9 ms (CUDA) |
| Scatter to BEV | 3.1 ms | 1.4 ms |
| Sparse conv backbone | 9.8 ms | 9.6 ms |
| RPN head | 4.2 ms | 4.1 ms |
| p50 end-to-end | 31 ms | 19 ms |

The backbone barely moved, which is the whole point. We had been optimizing the one part of the pipeline that was already efficient.
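
For anyone who wants to check the arithmetic, this short script recomputes each stage's share of the original p50 and the per-stage savings from the table above. The dictionary keys are just labels chosen for this example.

```python
# Per-stage averages and p50 values copied from the table above (milliseconds).
before = {"voxelization": 12.4, "scatter": 3.1, "backbone": 9.8, "rpn_head": 4.2}
after = {"voxelization": 1.9, "scatter": 1.4, "backbone": 9.6, "rpn_head": 4.1}
p50_before, p50_after = 31.0, 19.0

# Share of the original p50 taken by each stage, and time saved per stage.
share_before = {s: 100 * t / p50_before for s, t in before.items()}
saved = {s: before[s] - after[s] for s in before}

for stage in before:
    print(f"{stage:<13} {share_before[stage]:5.1f}% of p50 before, "
          f"{saved[stage]:4.1f} ms saved")
print(f"stage savings: {sum(saved.values()):.1f} ms, "
      f"p50 drop: {p50_before - p50_after:.1f} ms")
```

Voxelization alone comes to about 40% of the 31 ms p50. The stage figures are per-op averages while p50 is a percentile of end-to-end latency, so the columns should not be expected to sum exactly to the last row.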

The labeling side had the same shape of bug

While we were in there, we hit a related issue in our auto-labeling loop. We used a VLM to spot-check a sample of frames where the detector confidence was low, around 3% of 1.2M frames (roughly 36k). The captioning calls were a different kind of bottleneck, one that came from rate limits and the occasional provider timeout rather than a kernel.

We put those calls behind a gateway so a failed request to one provider would fail over to another without us babysitting it. Bifrost (https://github.com/maximhq/bifrost) was the one we landed on, mostly because it spoke an OpenAI-compatible API and we didn't want to rewrite the client. There are other options in that space. The lesson was the same as the LiDAR one. The slow part was rarely the model itself.
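
For readers who have not used a gateway, the failover behavior amounts to something like the sketch below. This is not Bifrost's API or implementation. `call_with_failover`, the provider names, and the fake provider functions are hypothetical and exist only to show the control flow.

```python
def call_with_failover(providers, request,
                       retryable=(TimeoutError, ConnectionError)):
    """Try each (name, send) pair in order; return the first success."""
    errors = []
    for name, send in providers:
        try:
            return name, send(request)
        except retryable as exc:
            errors.append((name, exc))  # remember why this provider was skipped
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical providers: the first always times out, the second answers.
def flaky_provider(request):
    raise TimeoutError("primary timed out")

def backup_provider(request):
    return f"caption for {request}"

used, response = call_with_failover(
    [("primary", flaky_provider), ("backup", backup_provider)],
    "frame_0042",
)
print(used, response)
```

Only errors listed in `retryable` trigger a failover. Anything else, such as a malformed request, propagates immediately, since retrying it against another provider would fail the same way.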

Trade-offs and limitations

Moving voxelization to the GPU is not free. The spconv GPU path holds more device memory for the hash tables that map points to voxels, so on a memory-constrained Jetson Orin we had to drop the max points-per-pillar from 32 to 20 to fit, which cost us about 0.4 mAP on the moderate split. On an A100 that trade-off never came up.

There is also a portability cost. A CPU voxelizer runs anywhere. The GPU version pins you to a CUDA toolkit version that matches your spconv build, and we burned an afternoon on a mismatch between CUDA 11.8 and a wheel built for 12.1.

And profiling itself can mislead. CUDA kernels launch asynchronously, so if you time with a host-side clock and skip an explicit torch.cuda.synchronize() before reading it, you measure launch overhead instead of real work. Half our early numbers were wrong for exactly this reason.
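
The pitfall is easy to reproduce without a GPU. In this sketch, `FakeDevice` is a made-up stand-in for an asynchronous device whose `launch()` returns before the work finishes, and `timed` is a hypothetical helper that optionally synchronizes before stopping the clock.

```python
import threading
import time

class FakeDevice:
    """Stand-in for an async device: launch() returns before the work is done."""

    def __init__(self):
        self._pending = []

    def launch(self, seconds):
        worker = threading.Thread(target=time.sleep, args=(seconds,))
        worker.start()
        self._pending.append(worker)

    def synchronize(self):
        for worker in self._pending:
            worker.join()
        self._pending.clear()

def timed(fn, sync=None):
    """Wall-clock fn(); call sync() before stopping the clock if given."""
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()
    return time.perf_counter() - start

device = FakeDevice()
unsynced = timed(lambda: device.launch(0.05))  # measures launch cost only
device.synchronize()                           # drain before the next run
synced = timed(lambda: device.launch(0.05), sync=device.synchronize)
print(f"unsynced: {unsynced * 1e3:.2f} ms, synced: {synced * 1e3:.2f} ms")
```

With PyTorch the same helper would take `sync=torch.cuda.synchronize`. Recording a pair of `torch.cuda.Event(enable_timing=True)` events around the work is another common way to get device-side timings.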

What I'd tell my past self

Profile before you optimize, and profile the whole pipeline, not the layer you find intellectually interesting. The detector network is the part you publish papers about. The preprocessing is the part that ships in production and quietly dominates latency.

For LiDAR specifically, watch the boundary between CPU and GPU. Every host-to-device copy per frame is a stall, and stalls hide from per-op benchmarks. The model is usually the easy part.