Quantising event-camera networks to run under 1MB on a Cortex-M7

#machinelearning #pytorch #mlops #computervision

TL;DR: I shrank a gesture-recognition model for a Prophesee EVK4 event camera from 4.2MB to 780KB so it could run on an STM32H7 at 15ms per inference. The trick was not the quantisation itself; it was rethinking what an "image" even means when your sensor produces events instead of frames.

Most computer vision tutorials assume you start with a tensor of shape [B, 3, H, W] and end with a classification head. Event cameras break that assumption on day one. A Prophesee sensor doesn't give you frames at 30fps. It gives you a sparse stream of events, each a tuple of (x, y, t, polarity), fired only when a pixel changes brightness. You can get millions of these per second during motion and almost nothing when the scene is static.
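To make the "bursty, then silent" shape of the data concrete, here is a minimal sketch that buckets a synthetic event stream into fixed time windows and counts events per window. The Event tuple, the microsecond timestamps, and the 50 ms window length are assumptions for illustration, not the sensor SDK's actual types.

```python
from collections import Counter, namedtuple

# One event as described above; timestamps are assumed to be in microseconds.
Event = namedtuple("Event", ["x", "y", "t", "polarity"])

def events_per_window(events, window_us=50_000):
    """Count how many events fall into each fixed-length time window."""
    return Counter(ev.t // window_us for ev in events)

# Synthetic stream: a burst of motion inside the first 50 ms, then a static
# scene with a single stray event much later.
stream = [Event(x=i % 128, y=i % 128, t=i * 10, polarity=i % 2) for i in range(4000)]
stream.append(Event(x=5, y=7, t=180_000, polarity=1))

counts = events_per_window(stream)
for w in range(4):
    print(f"window {w}: {counts.get(w, 0)} events")
```

The first window holds the whole burst and the next two are empty, which is the kind of imbalance a frame-based pipeline never has to deal with.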

That changes the entire optimisation game, so here is the full picture.

The starting point

We had a gesture model trained on the DVS128 Gesture dataset (11 classes, hand movements recorded with a DVS128 sensor). The baseline was a small ResNet-ish backbone running on event frames, accumulated over 50ms windows. It hit 94.1% test accuracy at 4.2MB in fp32. Inference on a Cortex-M7 at 480MHz took 68ms per window. That is longer than the 50ms window itself, so the model falls further behind with every window while events keep arriving in real time.

Target: sub-1MB, sub-20ms, accuracy drop under 2 percentage points.
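A quick back-of-envelope check shows why int8 on its own was never going to hit the size target. The sketch below assumes every weight is stored in fp32 before and in 8 bits after, and ignores scales, zero-points, and any tensors left in float; the helper name is made up for this example.

```python
def quantised_size_mb(fp32_size_mb, bits=8):
    """Estimate model size after storing every fp32 weight in `bits` bits."""
    return fp32_size_mb * bits / 32

baseline_fp32_mb = 4.2   # fp32 baseline size from above
budget_mb = 1.0          # the sub-1MB target

int8_mb = quantised_size_mb(baseline_fp32_mb)
print(f"int8 estimate: {int8_mb:.2f} MB")
print("fits budget" if int8_mb < budget_mb else "over budget: int8 alone is not enough")
```

Under these assumptions the baseline lands just above 1MB after quantisation, so the architecture has to shrink as well.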

Step one: stop pretending events are frames

The first 1.1MB (about a quarter of the model) came from not training on frames at all. Event-frame accumulation throws away the temporal resolution you paid for when you bought a €3,000 sensor. We switched the input representation to a voxel grid of shape [2, 5, 128, 128] (2 polarities, 5 temporal bins per window, 128x128 pixels). That alone let us drop the first conv block from 64 channels to 24, because the input was already temporally structured.

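import torch

def events_to_voxel(events, num_bins=5, height=128, width=128):
    # events: tensor [N, 4] with columns (x, y, t, polarity), polarity in {0, 1}
    voxel = torch.zeros(2, num_bins, height, width)
    if events.shape[0] == 0:
        return voxel  # static scene: no events in this window
    x, y, p = events[:, 0].long(), events[:, 1].long(), events[:, 3].long()
    t = events[:, 2]
    # Normalise timestamps to [0, 1], then spread them evenly over all bins
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)
    bin_idx = (t_norm * num_bins).long().clamp(max=num_bins - 1)
    # Vectorised scatter-add; accumulate=True sums events that share a cell
    ones = torch.ones(events.shape[0], dtype=voxel.dtype)
    voxel.index_put_((p, bin_idx, y, x), ones, accumulate=True)
    return voxel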

Accuracy actually went up to 94.6%. Smaller and better. This happens more often than the literature admits.
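For readers wondering how a [2, 5, 128, 128] grid feeds an ordinary 2D backbone, here is one plausible sketch. Folding polarity and time bins into a single 10-channel axis is an assumption, as are the kernel size, stride, and BatchNorm; only the 24 output channels come from the text above.

```python
import torch
import torch.nn as nn

# One voxel grid: 2 polarities x 5 temporal bins x 128 x 128 pixels.
voxel = torch.zeros(2, 5, 128, 128)

# Assumption: fold polarity and time bins into one channel axis so a plain
# 2D convolution can consume the grid.
x = voxel.reshape(1, 2 * 5, 128, 128)  # [batch, channels, H, W]

# Hypothetical first block with 24 output channels; kernel size, stride and
# normalisation are illustrative guesses, not the real architecture.
first_block = nn.Sequential(
    nn.Conv2d(10, 24, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(24),
    nn.ReLU(inplace=True),
)
first_block.eval()
with torch.no_grad():
    out = first_block(x)
print(out.shape)  # torch.Size([1, 24, 64, 64])
```

Every layer downstream now consumes 24 channels instead of 64, which is where a narrower first block pays off in parameter count.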

Step two: QAT, not PTQ

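Post-training quantisation is fast but it lies to you on event data. The activation distributions are wildly bimodal: a huge spike at zero, because most pixels are silent, plus a thin spread of real signal. Standard min/max calibration stretches the int8 range to cover the rare extremes, which leaves only a handful of quantisation levels for the small values that carry the information.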
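The failure mode is easy to reproduce on made-up numbers. The sketch below quantises a synthetic, mostly-zero activation set twice: once with a plain min/max range and once with the range clipped near the 99.9th percentile. The data and the clipping rule are illustrative assumptions; clipping is shown only to expose the problem and is not the fix used in this post.

```python
def fake_quantise(values, max_range, levels=255):
    """Uniform unsigned quantisation of non-negative values onto [0, max_range]."""
    scale = max_range / levels
    return [min(round(v / scale), levels) * scale for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic activations (made up): mostly zeros, a band of small useful
# values, and a handful of large outliers.
zeros = [0.0] * 9500
useful = [(i + 1) / 495 for i in range(495)]  # evenly spread over (0, 1]
outliers = [50.0] * 5
acts = zeros + useful + outliers

minmax_range = max(acts)                               # plain min/max calibration
clipped_range = sorted(acts)[len(acts) * 999 // 1000]  # roughly the 99.9th percentile

err_minmax = mean_abs_error(useful, fake_quantise(useful, minmax_range))
err_clipped = mean_abs_error(useful, fake_quantise(useful, clipped_range))
print(f"min/max range {minmax_range:.2f}: error on useful values {err_minmax:.4f}")
print(f"clipped range {clipped_range:.2f}: error on useful values {err_clipped:.4f}")
```

With the min/max range, five outliers dictate the step size, so the values in (0, 1] share about five quantisation levels between them.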

We did quantisation-aware training with PyTorch's torch.ao.quantization pipeline: qint8 weights and activations, per-channel for the convs, per-tensor for the linear head, and 15 epochs of QAT on top of the fp32 checkpoint. The observer matters: MovingAverageMinMaxObserver with averaging_constant=0.01 worked, 0.1 did not.
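A rough sketch of what that configuration can look like in eager-mode torch.ao.quantization follows. The tiny stand-in model, the choice of symmetric weight schemes, and the variable names are assumptions; the training loop, the final conversion, and anything the export toolchain may require are left out.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import FakeQuantize, QConfig, prepare_qat
from torch.ao.quantization.observer import (
    MovingAverageMinMaxObserver,
    MovingAveragePerChannelMinMaxObserver,
)

# Activations: per-tensor, moving-average range with the small averaging constant.
act_fq = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    averaging_constant=0.01,
    quant_min=-128, quant_max=127,
    dtype=torch.qint8, qscheme=torch.per_tensor_affine,
)
# Conv weights: per-channel along the output-channel axis (symmetric is an assumption).
conv_weight_fq = FakeQuantize.with_args(
    observer=MovingAveragePerChannelMinMaxObserver,
    quant_min=-128, quant_max=127,
    dtype=torch.qint8, qscheme=torch.per_channel_symmetric, ch_axis=0,
)
# Linear head weights: per-tensor (symmetric is an assumption).
head_weight_fq = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    quant_min=-128, quant_max=127,
    dtype=torch.qint8, qscheme=torch.per_tensor_symmetric,
)

# Hypothetical stand-in model, far smaller than the real backbone.
model = nn.Sequential(
    nn.Conv2d(10, 24, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(24, 11),
)
model.qconfig = QConfig(activation=act_fq, weight=conv_weight_fq)
model[4].qconfig = QConfig(activation=act_fq, weight=head_weight_fq)  # head override

qat_model = prepare_qat(model.train())          # swaps in fake-quantised modules
logits = qat_model(torch.rand(2, 10, 16, 16))   # a forward pass updates the observers
print(logits.shape)
```

From here the usual fine-tuning loop runs on qat_model, starting from the fp32 checkpoint's weights.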

Stage	Size	Accuracy	M7 latency
fp32 baseline (frames)	4.2 MB	94.1%	68 ms
fp32 + voxel input	3.1 MB	94.6%	51 ms
PTQ int8	820 KB	89.2%	19 ms
QAT int8	780 KB	93.4%	15 ms

The 4.2-point accuracy gap between PTQ and QAT is the whole story. If anyone tells you PTQ "just works" on sparse inputs, ask them to show you the per-class confusion matrix.
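If you have not built a per-class confusion matrix by hand before, a minimal sketch is below. The labels are toy data invented for the example, chosen so that a respectable overall accuracy hides one class that has collapsed completely.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Toy labels for 3 classes (made up): class 2 is never predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(f"accuracy {accuracy:.0%}, per-class recall {recall}")
```

An aggregate accuracy number can look fine while a rare gesture is quietly lost, which is why the per-class view is the one worth asking for.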

Step three: deployment plumbing

Getting a quantised PyTorch model onto a microcontroller is the part nobody writes about. We export to ONNX, then run it through X-CUBE-AI from ST to get C code we can build and flash to the H7. The flow is finicky around quantisation parameters, so we wrote a small validator that runs the same 200 inputs through PyTorch, ONNX Runtime, and the on-device binary and compares logits. That is three places where a mismatch can creep in.
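The comparison step of such a validator can be sketched in a few lines. This is not our actual tool: the function name, the tolerance, and the hard-coded logits are placeholders, and in practice each list would be collected from PyTorch, ONNX Runtime, and the device respectively.

```python
def compare_logits(reference, candidate, atol=0.1):
    """Compare two lists of logit vectors, one vector per input."""
    max_diff = 0.0
    argmax_mismatches = 0
    for ref, cand in zip(reference, candidate):
        max_diff = max(max_diff, max(abs(r - c) for r, c in zip(ref, cand)))
        if ref.index(max(ref)) != cand.index(max(cand)):
            argmax_mismatches += 1
    ok = max_diff <= atol and argmax_mismatches == 0
    return {"max_abs_diff": max_diff, "argmax_mismatches": argmax_mismatches, "ok": ok}

# Placeholder logits for two inputs from each stage (values are made up).
pytorch_logits = [[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]]
onnx_logits = [[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]]
device_logits = [[1.95, 0.55, -1.0], [0.1, 0.2, 0.35]]

onnx_report = compare_logits(pytorch_logits, onnx_logits)
device_report = compare_logits(pytorch_logits, device_logits)
print("pytorch vs onnx:  ", onnx_report)
print("pytorch vs device:", device_report)
```

Tracking argmax agreement alongside the raw difference is a deliberate choice here: small logit drift is expected after quantisation, but a flipped prediction is what actually matters.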

For the cloud side, when we're benchmarking different model variants we need an LLM to summarise long evaluation logs and propose hyperparameter tweaks. Our team of four uses Bifrost in front of OpenAI and a local Ollama instance so we can switch between them when the OpenAI rate limit bites during a sweep. It's one less thing to babysit.

Trade-offs and limitations

This pipeline only really works because gesture recognition is forgiving. A 1-point accuracy drop on hand waves doesn't kill anyone. For automotive event-camera workloads (pedestrian detection at 30m), the same int8 quantisation pushed our internal benchmark from 87.3% AP to 81.0% AP, which is unacceptable. You'd need mixed precision (int8 backbone, fp16 detection head) or a larger budget.

Also: the voxel-grid representation assumes you can buffer 50ms of events before inferring. If you need sub-10ms reaction time on a 1kHz event stream, you're back to recurrent spiking architectures or sparse convs, and that's a different blog post.

The X-CUBE-AI toolchain is closed-source and the error messages will make you reconsider your career on a Tuesday afternoon. TFLite Micro is more portable but has worse op coverage for the things event-vision pipelines actually need (per-channel quantised group convs especially).

What I'd do differently

Train with QAT from epoch zero, not as a fine-tune. The fp32-then-quantise habit comes from natural-image work where it's basically free. On event data, the activation statistics shift enough during QAT that you might as well let the model converge to them properly. We're trying this on the next project.

And benchmark on the actual hardware early. For the first three weeks I tuned everything against PyTorch latency on a workstation. Useless. The M7 has a different cache hierarchy and SIMD profile, and the model that "should" be fastest on paper was 2x slower on silicon.