TL;DR: We spent three weeks chasing a 6-point mAP regression in an event-camera object detector. The model was fine. The bug was the accumulation window we used to turn raw events into tensors, and we had picked it once, eighteen months earlier, on a different dataset. Here is how we tune it now.
With event cameras you do not get frames. You get a stream of events, each one a tuple of (x, y, t, polarity), fired asynchronously whenever a pixel sees a brightness change. Microsecond timestamps. No global shutter, no exposure. Beautiful for high-speed motion. Annoying when you want to feed a convolutional detector that expects a dense tensor.
So everyone accumulates. You take all the events inside a time window, say 10 ms, and you build a representation out of them. A 2D histogram, a voxel grid, a time surface. That window length is a hyperparameter. And in my experience at Prophesee, it is the one people set once and never look at again.
The regression that was not a model regression
Last spring we retrained a small detector for a logistics conveyor setup. Boxes moving at roughly 1.8 m/s past a Gen4 sensor. New training run, new augmentations, and the val mAP came back at 41.2 against a previous baseline of 47.5.
Six points. Gone. We blamed the LoRA-style fine-tune first, then the augmentation pipeline, then a teammate's data split. Two of us, the better part of three weeks.
The actual cause: the old baseline accumulated events over 33 ms, the new pipeline defaulted to 10 ms. At 10 ms the boxes barely produced enough events to fill the histogram. The detector was looking at near-empty tensors. Sparse input, low recall, lost mAP. Nothing wrong with the weights at all.
What the window actually trades
A short window gives you crisp spatial structure but few events, so thin or slow-moving objects vanish. A long window collects plenty of events but smears fast motion across pixels, and the network sees a blurred ghost. The right value depends on object speed and event rate, which means it depends on your scene.
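A rough way to reason about the blur side of that trade is to convert object speed into pixels swept per window. The sketch below is only back-of-the-envelope arithmetic. `px_per_m` (how many pixels one metre spans at the object's distance) is a hypothetical parameter, and the 500 px/m in the example is a made-up value, not a figure from our setup.

```python
def blur_px(speed_m_s: float, px_per_m: float, window_ms: float) -> float:
    """Pixels an object edge sweeps across during one accumulation window."""
    return speed_m_s * px_per_m * (window_ms / 1000.0)


if __name__ == "__main__":
    # 1.8 m/s is the conveyor speed from this post; 500 px/m is an assumed scale.
    for window_ms in (5, 10, 20, 33, 50):
        print(f"{window_ms:>2} ms -> {blur_px(1.8, 500.0, window_ms):.1f} px")
```

If the swept distance is a large fraction of the object's size in pixels, the window is probably too long for that scene. If it is well under a pixel, the window is probably too short to collect many events from that edge.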
Here is the core of how we build the representation now, with the window made explicit instead of buried in a default:
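```python
import torch


def events_to_voxel(events, window_us, num_bins, height, width):
    """Accumulate the first `window_us` microseconds of events into a voxel grid.

    events: (N, 4) tensor of [x, y, t_us, polarity], polarity in {0, 1}.
    Returns a (num_bins, height, width) float32 tensor of signed counts.
    """
    voxel = torch.zeros(num_bins, height, width, device=events.device)
    if events.shape[0] == 0:
        return voxel  # no events: .min() below would raise on an empty tensor

    # Timestamps relative to the first event; drop everything past the window.
    rel_t = events[:, 2] - events[:, 2].min()
    keep = rel_t < window_us
    ev, rel_t = events[keep], rel_t[keep]

    # Map each event's time to a temporal bin in [0, num_bins - 1].
    bin_idx = (rel_t / window_us * num_bins).long().clamp(0, num_bins - 1)

    # Polarity {0, 1} -> {-1, +1}, cast to the voxel dtype as index_put_ requires.
    pol = (ev[:, 3] * 2 - 1).to(voxel.dtype)

    voxel.index_put_(
        (bin_idx, ev[:, 1].long(), ev[:, 0].long()),
        pol, accumulate=True,
    )
    return voxel
```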
We now sweep window_us as a first-class part of validation, the same way we sweep learning rate. Cheap to run, since it is a preprocessing change and the weights stay fixed for the inference-time sweep.
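A minimal sketch of what such a sweep loop can look like, not our actual harness. `evaluate` is a hypothetical callable you would supply: it rebuilds the validation tensors with the given window, runs the frozen checkpoint, and returns the metric. The row layout is an assumption chosen for easy logging.

```python
import time
from typing import Callable, Dict, Iterable, List


def sweep_windows(
    windows_us: Iterable[int],
    evaluate: Callable[[int], float],
) -> List[Dict[str, float]]:
    """Evaluate the same checkpoint once per window and record score and wall time."""
    rows = []
    for window_us in windows_us:
        start = time.perf_counter()
        score = evaluate(window_us)  # hypothetical: preprocess + inference + mAP
        rows.append(
            {
                "window_us": window_us,
                "score": score,
                "seconds": time.perf_counter() - start,
            }
        )
    return rows
```

Logging the wall time next to the score makes the cost of each extra window visible before the sweep is scaled up to a larger validation set.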
The numbers from our conveyor set
Same model, same checkpoint, same 4,100-frame validation set. Only the accumulation window changes. Latency measured on a Jetson Orin NX at INT8.
| Window | Events/frame (median) | mAP@0.5 | Preproc + inference |
|---|---|---|---|
| 5 ms | 1,900 | 38.0 | 7.4 ms |
| 10 ms | 4,300 | 41.2 | 8.1 ms |
| 20 ms | 9,800 | 46.9 | 9.3 ms |
| 33 ms | 17,400 | 47.6 | 11.0 ms |
| 50 ms | 28,500 | 45.1 | 13.8 ms |
The curve is not monotonic. It climbs, plateaus around 20 to 33 ms, then falls as motion blur sets in. For this scene the sweet spot was 20 ms, which gave us almost all the accuracy of 33 ms (46.9 against 47.6) with 1.7 ms less latency per frame. The 10 ms default had been costing us 5.7 mAP, and the old 33 ms setting had been paying 1.7 ms per frame for the last 0.7.
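The selection rule behind that choice can be written down so it is not made by eye. The sketch below takes the rows from the table above and picks the lowest-latency window whose mAP is within a tolerance of the best. The 1.0-point default tolerance is an assumption for illustration, not a threshold we have validated.

```python
# (window_ms, mAP@0.5, latency_ms) copied from the table above.
RESULTS = [
    (5, 38.0, 7.4),
    (10, 41.2, 8.1),
    (20, 46.9, 9.3),
    (33, 47.6, 11.0),
    (50, 45.1, 13.8),
]


def pick_window(results, map_tolerance=1.0):
    """Lowest-latency row whose mAP is within `map_tolerance` of the best mAP."""
    best_map = max(m for _, m, _ in results)
    good_enough = [r for r in results if best_map - r[1] <= map_tolerance]
    return min(good_enough, key=lambda r: r[2])


if __name__ == "__main__":
    print(pick_window(RESULTS))       # 20 ms row with the default tolerance
    print(pick_window(RESULTS, 0.5))  # 33 ms row once the tolerance is tighter
```

How much accuracy you are willing to trade for latency is a product decision, which is why the tolerance is a parameter rather than a constant.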
How we audit windows now
We added a small step to dataset curation. For a random 300-frame subset we render the accumulated voxel back to a grayscale-ish preview and run it past a vision-language model to flag frames where the target is unreadable, blurred, or empty. It catches degenerate windows faster than a human scrubbing through previews. We route that call through Bifrost so the same code can hit one provider in CI and a cheaper one for bulk runs without rewriting anything, and that is the whole extent of the LLM involvement here. The detector itself never touches a model bigger than 6 MB.
It is not a substitute for the mAP sweep. It is a sanity filter before we trust the sweep.
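For the preview step, here is a minimal sketch of one way to collapse a voxel grid into a grayscale image. The mapping (mid-gray for no activity, brighter for net positive polarity, darker for net negative) is an assumption, not necessarily what our renderer does. It uses plain nested lists so it runs without torch; with tensors, the sum over bins is just `voxel.sum(0)`. The VLM call is left out, since its request shape depends on the provider.

```python
def voxel_to_preview(voxel):
    """Collapse a (bins, H, W) nested list of signed counts into 0-255 grayscale.

    Zero net activity maps to 128. The strongest pixel sets the scale, so a
    full-strength positive pixel maps to 255 and a full-strength negative to 1.
    """
    height, width = len(voxel[0]), len(voxel[0][0])
    # Sum the temporal bins so each pixel holds its net signed event count.
    summed = [
        [sum(time_bin[y][x] for time_bin in voxel) for x in range(width)]
        for y in range(height)
    ]
    # Avoid dividing by zero when the window captured no events at all.
    peak = max((abs(v) for row in summed for v in row), default=0) or 1
    return [[int(round(128 + 127 * v / peak)) for v in row] for row in summed]
```

An empty window renders as a flat gray image, which is exactly the kind of frame the audit is meant to flag.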
Trade-offs and limitations
The window that wins on a conveyor at 1.8 m/s is wrong for drones or automotive. Scene speed changes everything, so these exact numbers do not transfer. Take the method, not the 20 ms.
Sweeping the window inflates validation time. Five windows means five full preprocessing and inference passes over the val set. For us that is a few minutes; for a million-frame set it is real compute you have to budget.
A fixed window also assumes roughly constant scene dynamics. The honest answer for variable-speed scenes is an adaptive or event-count-based window, which we are testing but do not yet trust in production. And the VLM audit costs money per frame, so we cap it to a subset rather than the full set.
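To make the event-count idea concrete, here is a minimal sketch of one possible slicing rule. It is not the adaptive scheme we are testing. It closes a slice when it holds a target number of events or when it would span too long a duration, whichever comes first. `events_per_window` and `max_window_us` are hypothetical parameters, and the timestamps are assumed to be sorted.

```python
def count_based_windows(timestamps_us, events_per_window, max_window_us):
    """Split sorted timestamps into (start, end) index slices, end exclusive.

    A slice closes once it holds `events_per_window` events, or once the next
    event would stretch it to `max_window_us` or longer.
    """
    slices = []
    start = 0
    for i, t in enumerate(timestamps_us):
        too_many = i - start >= events_per_window
        too_long = t - timestamps_us[start] >= max_window_us
        if i > start and (too_many or too_long):
            slices.append((start, i))
            start = i
    # Whatever is left forms a final, possibly partial, slice.
    if start < len(timestamps_us):
        slices.append((start, len(timestamps_us)))
    return slices
```

The duration cap matters in quiet scenes: without it, a slow trickle of events could stretch one slice across seconds and bring the blur problem straight back.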