DEV Community: Marco Rinaldi

Routing Event-Camera Pipelines Through an LLM Gateway: A Field Report

Marco Rinaldi — Thu, 21 May 2026 16:52:22 +0000

TL;DR: We added a vision-language stage to an event-camera pipeline at Prophesee and the LLM provider routing became the messiest part. Bifrost handled the failover and the OpenAI-compatible surface without forcing us to rewrite the C++ side. Honest comparison vs LiteLLM and Portkey below.

So, the thing is, when you spend your day writing CUDA kernels for event cameras, you do not expect to also become an expert in LLM provider quotas. But here we are. A few weeks back our team at Prophesee built a small captioning service on top of our event-based object detector. The detector runs on the sensor itself, sub-1MB, quantised to int8. The captioning stage, obviously, does not. That part calls out to a vision-language model, and that is where things got annoying.

Let me give you the full picture here.

The setup

Our pipeline is the usual neuromorphic story. A Prophesee Gen4 sensor produces events, we accumulate them into time surfaces every 10ms, run a tiny YOLO-ish detector on a Jetson Orin Nano, and then for a subset of detections we want a natural-language description of what is happening. Think security analytics, where you want "person carrying a long object near the loading bay" instead of "bbox 0.87".

The captioning runs at maybe 2 Hz, not 200 Hz, so we can afford a cloud call. We started with Anthropic's Claude for the vision-language part because it handled our weird grayscale-ish event reconstructions better than the alternatives in our internal eval (37 test scenes, blind A-B with two annotators from the Milan office, Claude won 24 of them).

Then Anthropic had a regional outage in February, our pipeline went dark for 90 minutes, and the customer was not happy. That is when we started looking at gateways.

What we tried

Three candidates. I will be fair to all of them.

Tool	Language	Deploy	Failover	Built-in cache
LiteLLM	Python	pip / Docker	Yes	Redis-backed
Portkey	Hosted + OSS	SaaS-first	Yes	Yes
Bifrost	Go	npx / Docker	Yes	Semantic

LiteLLM is the one most of you know. It is fine. We ran it for two weeks. The Python process held up but our captioning service is Go (we share a binary with the on-device telemetry agent), and adding a Python sidecar just to talk to LiteLLM felt wrong. Memory footprint on our edge box mattered too.

Portkey is genuinely good if you want a hosted control plane. The dashboard is nice. But we have a contractual requirement to keep all inference routing inside our VPC. The self-hosted Portkey works, however the SaaS-first orientation showed up in small ways, and the docs assume the cloud path more often than not.

Bifrost won mostly because it is a single Go binary, the API is OpenAI-compatible end to end (so our existing OpenAI SDK code did not change), and npx -y @maximhq/bifrost got us a working gateway in about 40 seconds on the Orin. Not a marketing claim, I timed it while making espresso.

The actual config

Here is roughly what we shipped. Anthropic is primary, OpenAI is fallback, with two API keys per provider for load balancing.

providers:
  anthropic:
    keys:
      - value: env.ANTHROPIC_KEY_1
        weight: 0.5
      - value: env.ANTHROPIC_KEY_2
        weight: 0.5
    models:
      - claude-sonnet-4-6
  openai:
    keys:
      - value: env.OPENAI_KEY
    models:
      - gpt-4o
fallbacks:
  - from: anthropic/claude-sonnet-4-6
    to: openai/gpt-4o

Then on the client side, nothing changed. Same OpenAI SDK call, just pointed at http://gateway:8080/v1. The drop-in replacement claim in the README is accurate, we did not rewrite a single client.

The semantic caching was an unexpected win. Our captions repeat a lot ("person walks across frame" happens 200 times an hour in some deployments). Turning on semantic cache cut our Anthropic bill by roughly 31% over a two-week window. The cache lives in Redis, we already had one.

What broke

To be fair, not everything was smooth.

The MCP integration looked interesting on paper but we did not need it for this use case. Our tools are CUDA kernels, not filesystem helpers. We turned it off.

The Prometheus metrics endpoint worked but the default scrape interval in our Grafana setup was too aggressive and we briefly thought we had a memory leak. Operator error, not Bifrost's fault.

The web UI is genuinely useful for non-engineers on the team (our PM kept asking for cost breakdowns) but I personally edited the config file directly. Different strokes.

Trade-offs and Limitations

Bifrost is younger than LiteLLM. The community is smaller and Stack Overflow answers are sparse. If you get stuck at 2am, you are reading source code, which for me is fine, but might not be for everyone.

If your stack is already pure Python, LiteLLM has tighter integration with Python-native tools like LangChain. Bifrost speaks the OpenAI API, so it works, but it does not pretend to be Pythonic.

Portkey's analytics UI is more polished. If you care about that more than deployment shape, look there first.

And honestly, if you only call one provider and never hit quotas, you do not need any gateway. We did not, until we did.

The point

Event cameras are about doing more with less data. Adding a gateway between us and three different LLM vendors is the opposite philosophy, more layers, more moving parts. I resisted it for a while. But once the captioning service started going down for reasons completely unrelated to our computer vision work, the cost of not having a gateway became obvious. The cheapest model is the one you never had to call twice.

Event cameras and edge inference: why the frame-based mindset is still holding us back

Marco Rinaldi — Tue, 07 Apr 2026 12:54:03 +0000

So, the thing is, most edge inference pipelines for computer vision are built around a mental model that goes: capture frame → preprocess → run model → get result → repeat. Everything is designed around this loop. The latency budget, the model architecture, the preprocessing pipeline, the hardware selection — all of it assumes that "input" means "a dense grid of pixel values captured at a regular interval."

This works. For many applications it works well. But there's a class of problems where this mental model is the actual constraint — not the model, not the hardware, not the optimization. The sensing paradigm.

I've been working with event cameras at Prophesee for a while now, and I want to give you an honest, detailed picture of where they change the game and where they don't. Not hype. The technology is genuinely interesting and the engineering challenges are real.

What an event camera outputs

A conventional camera takes a photo. All pixels fire at the same time, you get a matrix of intensity values, you process it.

An event camera works completely differently. Each pixel operates independently and fires an event when the change in log luminance at that pixel crosses a threshold. The output is not a frame — it's a continuous, asynchronous stream of events, each containing:

event = {
    x: int,          # pixel column
    y: int,          # pixel row
    t: int,          # timestamp in microseconds
    p: bool,         # polarity: True = brightness increase, False = decrease
}

A static scene generates almost no events. A fast-moving object generates a dense burst. The data rate is determined by scene activity, not a fixed clock.

This has concrete engineering consequences:

Property	Frame-based camera	Event camera
Temporal resolution	1/fps (e.g. 33ms at 30fps)	~1 microsecond
Latency floor	Frame period	Sub-millisecond
Motion blur	Present for fast objects	Eliminated
Dynamic range	~60 dB	120+ dB
Data at rest	Constant (full frame every period)	Near-zero
Data during motion	Same	Proportional to activity

Processing event data — the engineering reality

The mental model shift required here is significant. You don't have frames. You have a sparse, asynchronous, continuous stream. Standard CNNs that expect (batch, channels, height, width) tensors don't directly apply.

There are four main approaches in practice:

1. Event accumulation into pseudo-frames

The pragmatic approach. Accumulate events over a fixed time window (or fixed event count), render them into an image-like representation, then run a standard CNN.

import numpy as np
import torch

def events_to_voxel_grid(
    events: np.ndarray,  # shape (N, 4): x, y, t, p
    num_bins: int = 5,
    height: int = 480,
    width: int = 640,
) -> torch.Tensor:
    """
    Convert events to a voxel grid representation.
    Each bin accumulates events in a time slice.
    Output shape: (num_bins, height, width)
    """
    t_start = events[:, 2].min()
    t_end = events[:, 2].max()

    if t_end == t_start:
        return torch.zeros(num_bins, height, width)

    # Normalize timestamps to [0, num_bins)
    t_normalized = (events[:, 2] - t_start) / (t_end - t_start) * (num_bins - 1)

    voxel = torch.zeros(num_bins, height, width)

    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ts = t_normalized
    ps = events[:, 3] * 2 - 1  # convert 0/1 polarity to -1/+1

    # Bilinear interpolation across time bins
    t_floor = np.floor(ts).astype(int)
    t_ceil = np.ceil(ts).astype(int)

    # Lower bin contribution
    weight_floor = torch.tensor(1 - (ts - t_floor), dtype=torch.float32)
    weight_ceil = torch.tensor(ts - t_floor, dtype=torch.float32)

    for i in range(len(events)):
        if 0 <= xs[i] < width and 0 <= ys[i] < height:
            if t_floor[i] < num_bins:
                voxel[t_floor[i], ys[i], xs[i]] += float(ps[i]) * float(weight_floor[i])
            if t_ceil[i] < num_bins and t_ceil[i] != t_floor[i]:
                voxel[t_ceil[i], ys[i], xs[i]] += float(ps[i]) * float(weight_ceil[i])

    return voxel

Then feed this voxel grid to any standard backbone. ResNet, EfficientNet, whatever your latency/accuracy tradeoff requires. You can export to ONNX and run TensorRT on it exactly like any other vision model.

The tradeoff: you've reintroduced a temporal discretization. If your window is 10ms, you've effectively given yourself 100fps temporal resolution. Better than 30fps, worse than raw event resolution.

2. Graph neural networks over spatiotemporal point clouds

Treat events as a 3D point cloud in (x, y, t) space. Use GNNs to process them natively.

import torch
from torch_geometric.data import Data
from torch_geometric.nn import knn_graph, GCNConv

def events_to_graph(
    events: np.ndarray,  # (N, 4): x, y, t, p
    k: int = 16,          # k-nearest neighbors
    time_weight: float = 0.1,  # weight of time dimension relative to spatial
) -> Data:
    """
    Convert events to a k-NN graph for GNN processing.
    Nodes are events, edges connect spatiotemporally close events.
    """
    # Normalize coordinates
    pos = torch.tensor(events[:, :3], dtype=torch.float32)
    pos[:, 0] /= events[:, 0].max()  # normalize x to [0,1]
    pos[:, 1] /= events[:, 1].max()  # normalize y to [0,1]
    pos[:, 2] /= events[:, 2].max()  # normalize t to [0,1]
    pos[:, 2] *= time_weight           # down-weight time dimension

    # Node features: polarity and position
    x = torch.zeros(len(events), 4, dtype=torch.float32)
    x[:, 0] = torch.tensor(events[:, 3] * 2 - 1)  # polarity: -1 or +1
    x[:, 1:] = pos  # position as additional features

    # Build k-NN graph
    edge_index = knn_graph(pos, k=k)

    return Data(x=x, edge_index=edge_index, pos=pos)

GNN approaches preserve temporal information better and handle the asynchronous nature more naturally. The cost: significantly higher computational overhead and harder to deploy on constrained edge hardware.

3. Spiking neural networks

The philosophically correct approach. Spiking neural networks (SNNs) are event-driven by nature — neurons fire when their membrane potential crosses a threshold, which maps naturally onto event data.

# Using SpikingJelly (spikingjelly.pytorch)
import torch
import torch.nn as nn
from spikingjelly.activation_based import neuron, functional, layer

class SpikingEventEncoder(nn.Module):
    def __init__(self, in_channels: int = 2, feature_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            layer.Conv2d(in_channels, 32, 3, padding=1, bias=False),
            layer.BatchNorm2d(32),
            neuron.LIFNode(tau=2.0),  # Leaky integrate-and-fire
            layer.Conv2d(32, 64, 3, stride=2, padding=1, bias=False),
            layer.BatchNorm2d(64),
            neuron.LIFNode(tau=2.0),
            layer.Conv2d(64, feature_dim, 3, stride=2, padding=1, bias=False),
            layer.BatchNorm2d(feature_dim),
            neuron.LIFNode(tau=2.0),
        )
        functional.set_step_mode(self, step_mode='m')  # multi-step mode

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (T, B, C, H, W) — T timesteps of binary spike frames
        return self.encoder(x)

SNNs are energy-efficient on neuromorphic hardware (Intel Loihi, BrainScaleS) — we're talking orders of magnitude lower power than GPU inference. Harder to train (backpropagation through spikes requires surrogate gradients), harder to deploy on standard hardware, and the training frameworks are still maturing.

Where the architecture choice is decisive

I'll be direct about this. Event cameras are not better than frame cameras in general. They're decisively better for specific problems.

High-speed tracking: A ball at 200km/h. A manufacturing defect on a conveyor at 10m/s. At 240fps, you have 4ms between frames and ~22cm of position uncertainty for that ball. Event cameras track position continuously with microsecond resolution. No model architecture on a frame-based camera solves this — it's a sensing problem.

Low-latency reactive control: If you need your control loop to react in under 2ms, the 33ms floor of 30fps sensing is disqualifying. Even 240fps (4ms) is marginal for some robotics tasks. Event cameras give you sub-millisecond sensing latency.

High dynamic range scenes: Outdoor robotics, automotive, industrial inspection in variable lighting. Frame cameras require HDR fusion tricks (multi-exposure, tone mapping) that add latency and complexity. Event cameras have 120+ dB dynamic range natively.

Power-constrained edge deployment: A static scene generates near-zero events. Data-proportional computation means dramatic power savings at rest. Pair this with an SNN and neuromorphic hardware and you have inference systems that run on milliwatts.

Where frame-based is still correct

Static scene understanding: events encode change, not state. If you need to understand a scene that isn't moving, you have no data.

Texture and color: events detect luminance changes, not absolute color or fine texture. For tasks where texture is discriminative, frame cameras are better.

Existing ecosystem: PyTorch, TensorRT, ONNX, OpenCV, every pre-trained backbone — all of this is built for frame-based input. The event camera tooling ecosystem is much smaller and less mature.

The practical advice

If you're running real-time inference at standard framerates (<30ms latency budget) on scenes with standard dynamic range: stick with frame-based. The maturity gap in tooling, pre-trained models, and deployment infrastructure is not worth it.

If your application involves any of these — fast motion, low latency requirements, HDR environments, power-constrained edge hardware — start evaluating event cameras seriously. The Prophesee Metavision SDK is the most mature industrial option. IniVation makes the DAVIS cameras that are popular in research.

The frame-based mindset isn't wrong. It's just not universal. The sensing architecture is a design choice like any other, and for some problems, it's the most important design choice you'll make.

I'm working with event cameras daily and happy to answer specific questions about hardware, processing pipelines, or deployment. Drop them in the comments.