<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marco Rinaldi</title>
    <description>The latest articles on DEV Community by Marco Rinaldi (@marcorinaldi_ai).</description>
    <link>https://dev.to/marcorinaldi_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851217%2Fce3e7721-f256-4c07-bec2-4bdbc0121b10.jpg</url>
      <title>DEV Community: Marco Rinaldi</title>
      <link>https://dev.to/marcorinaldi_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/marcorinaldi_ai"/>
    <language>en</language>
    <item>
      <title>One image schema for four VLM providers: we stopped reformatting payloads</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Mon, 01 Jun 2026 16:51:55 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/one-image-schema-for-four-vlm-providers-we-stopped-reformatting-payloads-j70</link>
      <guid>https://dev.to/marcorinaldi_ai/one-image-schema-for-four-vlm-providers-we-stopped-reformatting-payloads-j70</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We reconstruct grayscale frames from event-camera data and send them to vision-language models for weak scene labels. Four providers, four slightly different ways to attach an image, and our payload-building code had grown three branches. Putting Bifrost in front of the VLMs gave us one OpenAI-compatible image schema. Here's the honest version, including where LiteLLM and Portkey do it better.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, on my team at Prophesee we don't have classic RGB frames. We have event streams, and for some of our offline tooling we reconstruct short grayscale frames from those events so a vision-language model can give us a rough scene description. "Person crossing, low light, motion blur on the left." Weak labels. Useful for triaging which clips a human should look at first.&lt;/p&gt;

&lt;p&gt;We send those reconstructed frames to four VLMs depending on the job: OpenAI's gpt-4o, Anthropic's Claude, Gemini through Google Vertex, and occasionally Mistral's vision model. Same picture, four annoyingly different request bodies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual pain
&lt;/h2&gt;

&lt;p&gt;Let me give you the full picture here. The OpenAI shape wants an &lt;code&gt;image_url&lt;/code&gt; content block, and the URL can be a base64 data URI. Fine. Then you go to Vertex and the structure shifts, the field names shift, and the &lt;code&gt;detail&lt;/code&gt; hint you were passing to control token cost stops meaning anything. Anthropic wants its own &lt;code&gt;source&lt;/code&gt; object with a media type. None of this is hard. It's just three &lt;code&gt;if provider ==&lt;/code&gt; branches in a function that should do one thing.&lt;/p&gt;

&lt;p&gt;We had maybe 80 lines of payload-shaping code in our Python annotation service. Every time a provider tweaked an API version, something silently broke and a batch of frames came back unlabelled at 2am during an overnight run. Not dramatic. Just paper cuts that add up over a 6-person team.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed
&lt;/h2&gt;

&lt;p&gt;We put Bifrost (an open-source AI gateway written in Go) in front of all four providers. It exposes a single OpenAI-compatible API, including multimodal, so our service now builds exactly one image message and never branches on provider again.&lt;/p&gt;

&lt;p&gt;Running it locally was a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;span class="c"&gt;# or&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then our call looks identical no matter who serves it. We just change the &lt;code&gt;model&lt;/code&gt; string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "vertex/gemini-1.5-pro",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the scene. One line."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,iVBORw0K... "}}
      ]
    }]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Swap &lt;code&gt;vertex/gemini-1.5-pro&lt;/code&gt; for &lt;code&gt;openai/gpt-4o&lt;/code&gt; or &lt;code&gt;anthropic/claude-...&lt;/code&gt; and the body doesn't move. That's the whole point. The gateway translates to each provider's native multimodal format on the way out. (&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/streaming" rel="noopener noreferrer"&gt;Streaming and multimodal docs&lt;/a&gt;.)&lt;/p&gt;
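&lt;p&gt;For the Python side, here is a minimal sketch of the single message builder this leaves you with. The helper name &lt;code&gt;build_image_request&lt;/code&gt;, the default prompt and the placeholder bytes are illustrative assumptions, not our production code; only the message shape mirrors the curl call above.&lt;/p&gt;

```python
import base64
import json


def build_image_request(png_bytes, model, prompt="Describe the scene. One line."):
    """Build one OpenAI-compatible chat body with an inline PNG.

    Hypothetical helper: only the `model` string changes per provider.
    """
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + b64}},
            ],
        }],
    }


# Placeholder bytes (just the PNG signature), not a real reconstructed frame.
fake_png = b"\x89PNG\r\n\x1a\n"
for model in ("vertex/gemini-1.5-pro", "openai/gpt-4o"):
    body = build_image_request(fake_png, model)
    print(model, len(json.dumps(body)))
```

&lt;p&gt;The body is identical across the loop apart from &lt;code&gt;model&lt;/code&gt;, which is the property we wanted from the gateway.&lt;/p&gt;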

&lt;p&gt;Two things came along for free that I didn't plan for. First, automatic fallback. If Vertex throws a 503 mid-batch, the request retries on a configured backup model instead of dropping the frame. (&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Fallbacks docs&lt;/a&gt;.) Second, semantic caching. Our reconstructed frames are repetitive, lots of near-identical low-motion clips, so caching on semantic similarity cut a chunk of redundant VLM calls. (&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching docs&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares
&lt;/h2&gt;

&lt;p&gt;I won't pretend Bifrost is the only thing that does this. It isn't.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unified multimodal API&lt;/td&gt;
&lt;td&gt;Yes, OpenAI-compatible&lt;/td&gt;
&lt;td&gt;Yes, very mature&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;Go, self-hosted&lt;/td&gt;
&lt;td&gt;Python, self-host or lib&lt;/td&gt;
&lt;td&gt;Hosted-first, self-host option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider breadth&lt;/td&gt;
&lt;td&gt;23+&lt;/td&gt;
&lt;td&gt;Largest I've seen&lt;/td&gt;
&lt;td&gt;Broad&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability UI&lt;/td&gt;
&lt;td&gt;Prometheus metrics&lt;/td&gt;
&lt;td&gt;Functional&lt;/td&gt;
&lt;td&gt;Strongest dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM has been doing image normalisation longer and its provider list is wider than anyone's. If you're already deep in a Python stack and want a library call rather than a separate service, LiteLLM is genuinely the pragmatic pick. Portkey's hosted dashboards and guardrail tooling are more polished than what we run; for a team that wants observability out of the box without wiring Prometheus, that matters.&lt;/p&gt;

&lt;p&gt;We picked Bifrost mostly because it's a single Go binary we self-host next to our annotation service, and the OpenAI-compatible surface meant near-zero changes to client code. Different teams will weigh that differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;It's a network hop. You're adding a gateway between your service and the provider, so there's a small latency cost and one more thing that can fall over. For an offline labelling pipeline I don't care. For a tight real-time loop you'd measure it first.&lt;/p&gt;

&lt;p&gt;You inherit the gateway's abstraction. If a provider ships a brand-new multimodal parameter, you wait for the gateway to expose it rather than calling the raw API. So far that's been fine for our image-description use, but it's a real constraint if you live on the bleeding edge of one provider's features.&lt;/p&gt;

&lt;p&gt;And to be precise: this solved a payload-normalisation and reliability problem. It did nothing for our actual model quality. The VLM still mislabels heavy motion blur, and reconstructing good frames from sparse events is still the hard part of my week. The gateway just stopped me rewriting JSON shapes. If the upstream model is weak or the reconstructed frames are poor, no amount of plumbing fixes that.&lt;/p&gt;

&lt;p&gt;One espresso's worth of setup, a quieter on-call. That trade I'll take.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/streaming" rel="noopener noreferrer"&gt;Multimodal and streaming quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/completion/vision" rel="noopener noreferrer"&gt;LiteLLM multimodal docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>llm</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Our event-camera detector lost 6 mAP to a badly chosen accumulation window</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Mon, 01 Jun 2026 07:21:48 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/our-event-camera-detector-lost-6-map-to-a-badly-chosen-accumulation-window-k68</link>
      <guid>https://dev.to/marcorinaldi_ai/our-event-camera-detector-lost-6-map-to-a-badly-chosen-accumulation-window-k68</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We spent three weeks chasing a 6 mAP regression in an event-camera object detector. The model was fine. The bug was the accumulation window we used to turn raw events into tensors, and we had picked it once, eighteen months earlier, on a different dataset. Here is how we tune it now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, with event cameras you do not get frames. You get a stream of events, each one a tuple of &lt;code&gt;(x, y, t, polarity)&lt;/code&gt;, fired asynchronously whenever a pixel sees a brightness change. Microsecond timestamps. No global shutter, no exposure. Beautiful for high-speed motion. Annoying when you want to feed a convolutional detector that expects a dense tensor.&lt;/p&gt;

&lt;p&gt;So everyone accumulates. You take all the events inside a time window, say 10 ms, and you build a representation out of them. A 2D histogram, a voxel grid, a time surface. That window length is a hyperparameter. And in my experience at Prophesee, it is the one people set once and never look at again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The regression that was not a model regression
&lt;/h2&gt;

&lt;p&gt;Last spring we retrained a small detector for a logistics conveyor setup. Boxes moving at roughly 1.8 m/s past a Gen4 sensor. New training run, new augmentations, and the val mAP came back at 41.2 against a previous baseline of 47.5.&lt;/p&gt;

&lt;p&gt;Six points. Gone. We blamed the LoRA-style fine-tune first, then the augmentation pipeline, then a teammate's data split. Two of us, the better part of three weeks.&lt;/p&gt;

&lt;p&gt;The actual cause: the old baseline accumulated events over 33 ms, the new pipeline defaulted to 10 ms. At 10 ms the boxes barely produced enough events to fill the histogram. The detector was looking at near-empty tensors. Sparse input, low recall, lost mAP. Nothing wrong with the weights at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the window actually trades
&lt;/h2&gt;

&lt;p&gt;A short window gives you crisp spatial structure but few events, so thin or slow-moving objects vanish. A long window collects plenty of events but smears fast motion across pixels, and the network sees a blurred ghost. The right value depends on object speed and event rate, which means it depends on your scene.&lt;/p&gt;
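&lt;p&gt;A rough back-of-envelope for the smear side of that trade-off: the distance an edge travels during one window is speed times window length times the optical scale. The sketch below assumes a hypothetical scale of 500 pixels per metre, which is not a measured value from our rig, so read the output as an order of magnitude only.&lt;/p&gt;

```python
def smear_px(speed_m_s, window_ms, px_per_m):
    """Pixels an object edge travels during one accumulation window."""
    return speed_m_s * (window_ms / 1000.0) * px_per_m


# 1.8 m/s is the conveyor speed from this post; 500 px/m is an assumed scale.
for window_ms in (5, 10, 20, 33, 50):
    print(window_ms, "ms", round(smear_px(1.8, window_ms, 500.0), 1), "px")
```

&lt;p&gt;Smear grows linearly with the window, and so does the event count, which is why the choice cannot be made without knowing the scene.&lt;/p&gt;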

&lt;p&gt;Here is the core of how we build the representation now, with the window made explicit instead of buried in a default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;events_to_voxel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_us&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# events: (N, 4) tensor of [x, y, t_us, polarity]
&lt;/span&gt;    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;rel_t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;
    &lt;span class="n"&gt;keep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rel_t&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;window_us&lt;/span&gt;
    &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;bin_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;window_us&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;
    &lt;span class="n"&gt;bin_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bin_idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;long&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;voxel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
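    &lt;span class="n"&gt;pol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voxel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# {0,1} -&amp;gt; {-1, +1}, cast to match voxel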
&lt;/span&gt;    &lt;span class="n"&gt;voxel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index_put_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bin_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;long&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;long&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="n"&gt;pol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accumulate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;voxel&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now sweep &lt;code&gt;window_us&lt;/code&gt; as a first-class part of validation, the same way we sweep learning rate. Cheap to run, since it is a preprocessing change and the weights stay fixed for the inference-time sweep.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers from our conveyor set
&lt;/h2&gt;

&lt;p&gt;Same model, same checkpoint, same 4,100-frame validation set. Only the accumulation window changes. Latency measured on a Jetson Orin NX at INT8.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Window&lt;/th&gt;
&lt;th&gt;Events/frame (median)&lt;/th&gt;
&lt;th&gt;mAP@0.5&lt;/th&gt;
&lt;th&gt;Preproc + inference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 ms&lt;/td&gt;
&lt;td&gt;1,900&lt;/td&gt;
&lt;td&gt;38.0&lt;/td&gt;
&lt;td&gt;7.4 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 ms&lt;/td&gt;
&lt;td&gt;4,300&lt;/td&gt;
&lt;td&gt;41.2&lt;/td&gt;
&lt;td&gt;8.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20 ms&lt;/td&gt;
&lt;td&gt;9,800&lt;/td&gt;
&lt;td&gt;46.9&lt;/td&gt;
&lt;td&gt;9.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;33 ms&lt;/td&gt;
&lt;td&gt;17,400&lt;/td&gt;
&lt;td&gt;47.6&lt;/td&gt;
&lt;td&gt;11.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 ms&lt;/td&gt;
&lt;td&gt;28,500&lt;/td&gt;
&lt;td&gt;45.1&lt;/td&gt;
&lt;td&gt;13.8 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The curve is not monotonic. It climbs, plateaus around 20 to 33 ms, then falls as motion blur sets in. For this scene the sweet spot was 20 ms, which gave us almost all the accuracy of 33 ms with 1.7 ms less latency per frame. We had been leaving both accuracy and speed on the table.&lt;/p&gt;
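&lt;p&gt;The selection rule we applied can be written down in a few lines. This is a sketch, not our tooling: the rows are copied from the table above, and the 1.0 mAP tolerance is an assumed threshold you would set per project.&lt;/p&gt;

```python
# (window_ms, mAP@0.5, latency_ms) rows copied from the table above.
SWEEP = [
    (5, 38.0, 7.4),
    (10, 41.2, 8.1),
    (20, 46.9, 9.3),
    (33, 47.6, 11.0),
    (50, 45.1, 13.8),
]


def pick_window(rows, map_tolerance=1.0):
    """Return the lowest-latency row whose mAP is within tolerance of the best."""
    best_map = max(row[1] for row in rows)
    good_enough = [row for row in rows if row[1] + map_tolerance >= best_map]
    return min(good_enough, key=lambda row: row[2])


print(pick_window(SWEEP))
print(pick_window(SWEEP, map_tolerance=0.0))
```

&lt;p&gt;With a tolerance of zero the rule falls back to the highest-mAP window, so the tolerance is where you state how much accuracy you are willing to trade for latency.&lt;/p&gt;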

&lt;h2&gt;
  
  
  How we audit windows now
&lt;/h2&gt;

&lt;p&gt;We added a small step to dataset curation. For a random 300-frame subset we render the accumulated voxel back to a grayscale-ish preview and run it past a vision-language model to flag frames where the target is unreadable, blurred, or empty. It catches degenerate windows faster than a human scrubbing through previews. We route that call through Bifrost so the same code can hit one provider in CI and a cheaper one for bulk runs without rewriting anything, and that is the whole extent of the LLM involvement here. The detector itself never touches a model bigger than 6 MB.&lt;/p&gt;

&lt;p&gt;It is not a substitute for the mAP sweep. It is a sanity filter before we trust the sweep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;The window that wins on a conveyor at 1.8 m/s is wrong for drones or automotive. Scene speed changes everything, so these exact numbers do not transfer. Take away the method, not the 20 ms.&lt;/p&gt;

&lt;p&gt;Sweeping the window inflates validation time. Five windows means five full preprocessing passes over the val set. For us that is a few minutes; for a million-frame set it is real compute you have to budget.&lt;/p&gt;

&lt;p&gt;A fixed window also assumes roughly constant scene dynamics. The honest answer for variable-speed scenes is an adaptive or event-count-based window, which we are testing but do not yet trust in production. And the VLM audit costs money per frame, so we cap it to a subset rather than the full set.&lt;/p&gt;
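&lt;p&gt;To make the event-count idea concrete, here is a toy sketch of slicing a time-sorted stream by a fixed number of events instead of a fixed duration. It is an illustration of the general technique under simple assumptions (timestamps already sorted, trailing partial slice dropped), not the variant we are testing.&lt;/p&gt;

```python
def count_based_slices(timestamps_us, events_per_slice):
    """Split a time-sorted event stream into slices of a fixed event count.

    Returns (start_index, end_index_exclusive, duration_us) per full slice.
    A trailing partial slice is dropped.
    """
    slices = []
    full = len(timestamps_us) // events_per_slice
    for i in range(full):
        start = i * events_per_slice
        end = start + events_per_slice
        duration = timestamps_us[end - 1] - timestamps_us[start]
        slices.append((start, end, duration))
    return slices


# Toy stream in microseconds: a dense burst followed by a quiet stretch.
ts = [0, 10, 20, 30, 1000, 3000, 6000, 9000]
for start, end, duration in count_based_slices(ts, 4):
    print(start, end, duration)
```

&lt;p&gt;Every slice carries the same number of events, so the tensor never goes empty, but the slice duration now varies with scene activity and the detector has to tolerate that.&lt;/p&gt;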

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1904.08245" rel="noopener noreferrer"&gt;Gehrig et al., "End-to-End Learning of Representations for Asynchronous Event-Based Data" (EST)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.prophesee.ai/" rel="noopener noreferrer"&gt;Prophesee Metavision SDK docs on event accumulation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2304.13455" rel="noopener noreferrer"&gt;Zubić et al., survey on event-camera representations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html" rel="noopener noreferrer"&gt;PyTorch &lt;code&gt;index_put_&lt;/code&gt; reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Structured channel pruning got our detector under 12ms on a Jetson</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Fri, 29 May 2026 07:22:38 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/structured-channel-pruning-got-our-detector-under-12ms-on-a-jetson-3m3j</link>
      <guid>https://dev.to/marcorinaldi_ai/structured-channel-pruning-got-our-detector-under-12ms-on-a-jetson-3m3j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our frame-based defect detector ran at 31ms on a Jetson Orin Nano, and the production line needed 12. Structured channel pruning at 45% plus a three-epoch fine-tune got us to 11.4ms for a 0.006 mAP drop (0.910 to 0.904). Unstructured pruning looked beautiful on paper and gave exactly zero real-world speedup, so we deleted it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, the model was never the problem. The detector hit 0.91 mAP in the lab and everyone was happy. Then we put it on the actual Orin Nano sitting next to the conveyor, and it ran at 31ms per frame. The line moves a stamped part every 12ms. You can imagine how that meeting went.&lt;/p&gt;

&lt;p&gt;Let me give you the full picture here. This was a side project away from my usual event-camera work at Prophesee, a plain RGB detector for surface defects on stamped metal panels. Small team, three of us, a hard latency budget and no room for a bigger GPU on the line. Cutting model size was the only lever we had.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap I fell into first
&lt;/h2&gt;

&lt;p&gt;I reached for unstructured pruning because the literature loves it. &lt;code&gt;torch.nn.utils.prune&lt;/code&gt;, magnitude based, zero out the smallest 60% of weights, retrain. The sparsity numbers were gorgeous. Sixty percent of the parameters gone, mAP barely moved.&lt;/p&gt;

&lt;p&gt;Latency on the Jetson: 30.8ms. Against a 31.0ms baseline, that is measurement noise.&lt;/p&gt;

&lt;p&gt;Here is why it does nothing. Unstructured pruning keeps every tensor the exact shape it was and writes zeros into it. cuDNN still runs a dense convolution over those zeros. Sparse hardware paths only help when the zeros follow a pattern the kernels can skip, such as the 2:4 structured sparsity that Ampere tensor cores accelerate, and magnitude pruning scatters zeros wherever the small weights happen to be. Without that pattern and a runtime that exploits it, you bought a smaller checkpoint file and nothing else. The thing was multiplying by zero at full price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the latency actually lives
&lt;/h2&gt;

&lt;p&gt;Structured pruning removes whole filters and channels, so the tensors get physically smaller. A conv layer with 256 output channels becomes 140. FLOPs drop, memory traffic drops, the kernel finishes sooner. Real speedup.&lt;/p&gt;
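&lt;p&gt;The arithmetic behind that is short enough to sketch. Only the 256 to 140 channel change comes from the text; the 3x3 kernel and the 80x80 feature map are hypothetical values chosen for illustration.&lt;/p&gt;

```python
def conv_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulates for one dense 2D convolution (no groups, no bias)."""
    return c_in * c_out * kernel * kernel * h_out * w_out


# Hypothetical 3x3 layer on an 80x80 output feature map.
dense = conv_macs(256, 256, 3, 80, 80)
# Unstructured pruning: shapes are unchanged, so a dense kernel does the same work.
unstructured = conv_macs(256, 256, 3, 80, 80)
# Structured pruning: 256 to 140 output channels in this layer...
structured = conv_macs(256, 140, 3, 80, 80)
# ...and the following layer now receives 140 input channels instead of 256.
next_layer = conv_macs(140, 256, 3, 80, 80)

print(dense, unstructured, structured, next_layer)
print(round(structured / dense, 3))
```

&lt;p&gt;The saving shows up twice, once in the pruned layer and once in every layer that consumes its output, which is why channel pruning compounds through the network.&lt;/p&gt;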

&lt;p&gt;The painful part is dependency tracking. Remove an output channel in one layer and you must remove the matching input channel everywhere it feeds, across residual adds, concats, the lot. Do it by hand and you will spend a Sunday debugging shape mismatches instead of eating lunch in Bologna. &lt;code&gt;torch-pruning&lt;/code&gt; traces the whole graph with its DepGraph and prunes coupled layers together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch_pruning&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_detector&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;example&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;640&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# group importance by L2 norm of each filter group
&lt;/span&gt;&lt;span class="n"&gt;imp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GroupNormImportance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pruner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pruner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MetaPruner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;importance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;imp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pruning_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ignored_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# leave the classification head alone
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pruner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;macs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count_ops_and_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;macs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GFLOPs, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;M params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After pruning, the model is broken in the accuracy sense. You fine-tune to recover. Three epochs on the same training set pulled mAP back from 0.86 to 0.904. Not free, but cheap compared to retraining from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Params removed&lt;/th&gt;
&lt;th&gt;FLOPs&lt;/th&gt;
&lt;th&gt;Jetson latency&lt;/th&gt;
&lt;th&gt;mAP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline (FP16)&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;17.2 GFLOPs&lt;/td&gt;
&lt;td&gt;31.0 ms&lt;/td&gt;
&lt;td&gt;0.910&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unstructured 60%&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;td&gt;17.2 GFLOPs (dense)&lt;/td&gt;
&lt;td&gt;30.8 ms&lt;/td&gt;
&lt;td&gt;0.905&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured 45%&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;td&gt;9.4 GFLOPs&lt;/td&gt;
&lt;td&gt;11.4 ms&lt;/td&gt;
&lt;td&gt;0.904&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Structured pruning removed fewer parameters on paper and delivered all of the speed. That gap between "params removed" and "FLOPs removed" is the whole lesson. FLOPs only fall when the shapes shrink.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Pruning ratio has a cliff. At 0.45 we lost 0.006 mAP. At 0.60 the small defect class collapsed and no amount of fine-tuning brought it back, because those filters were genuinely doing work. You have to sweep the ratio per model, there is no universal number.&lt;/p&gt;

&lt;p&gt;The head is sensitive. We pinned the classification head with &lt;code&gt;ignored_layers&lt;/code&gt; after an early run where pruning it tanked recall on the rare defect type. Pruning is not uniform across a network and treating it as such will cost you.&lt;/p&gt;

&lt;p&gt;Speedup is hardware-shaped. Our 9.4 GFLOPs model runs at 11.4 ms on the Orin Nano and at 4 ms on a desktop RTX 4070. Pruning helps compute-bound layers; if your bottleneck is memory bandwidth or a fat softmax, channel pruning moves the needle less than you hope. Profile first with &lt;code&gt;trtexec&lt;/code&gt; or Nsight before you assume FLOPs equal time.&lt;/p&gt;

&lt;p&gt;And retraining needs data. We had labels, so recovery was painless. For the handful of ambiguous border crops where annotators disagreed, we sent them to a larger vision model for a second opinion and routed those calls through an AI gateway (we run Bifrost, some teams use LiteLLM) so nobody had to wire up provider SDKs by hand. That kept the labelling loop moving without a separate integration project.&lt;/p&gt;

&lt;p&gt;One more honest note. Structured pruning fights quantisation a little. Our INT8 TensorRT export of the pruned model needed re-calibration, and the calibration cache from the dense model was useless. Budget time for that step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell past me
&lt;/h2&gt;

&lt;p&gt;Stop optimising the metric that does not move the clock. Sparsity is a vanity number unless your runtime can spend it. Measure latency on the real device, with the real batch size, before and after every change. The espresso machine in our office has more consistent timing than my early benchmark scripts did, and I trusted it more.&lt;/p&gt;
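
&lt;p&gt;For what it's worth, a minimal sketch of the kind of timing harness I should have started with. It assumes the thing being timed is a plain callable. &lt;code&gt;benchmark&lt;/code&gt; and &lt;code&gt;fake_inference&lt;/code&gt; are illustrative names, and the workload is a stand-in, not a model.&lt;/p&gt;

```python
import statistics
import time

def benchmark(fn, warmup=20, iters=200):
    """Time fn() after a warm-up phase; return (median, p95) latency in milliseconds."""
    for _ in range(warmup):
        fn()  # let clocks, caches, and lazy initialisation settle
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return statistics.median(samples), p95

# Stand-in workload; replace with one inference call at the real batch size.
def fake_inference():
    sum(i * i for i in range(10_000))

median_ms, p95_ms = benchmark(fake_inference)
print(f"median {median_ms:.2f} ms, p95 {p95_ms:.2f} ms")
```

&lt;p&gt;For a GPU model the callable has to block until the device has finished (in PyTorch, by calling &lt;code&gt;torch.cuda.synchronize()&lt;/code&gt; inside it), otherwise you time the kernel launch and not the work.&lt;/p&gt;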

&lt;p&gt;If you cannot make the model smaller and faster on the actual target, you do not yet understand where its time goes. Pruning forced me to learn that, layer by layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;torch-pruning (DepGraph) - structured pruning with dependency tracking&lt;/li&gt;
&lt;li&gt;Pruning Filters for Efficient ConvNets, Li et al. 2017 - the filter-pruning baseline&lt;/li&gt;
&lt;li&gt;What is the State of Neural Network Pruning?, Blalock et al. 2020 - a sober look at pruning claims&lt;/li&gt;
&lt;li&gt;NVIDIA TensorRT best practices - profiling and INT8 calibration&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>pytorch</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>VLM-scored calibration sets for INT8 quantisation, routed through Bifrost</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Thu, 28 May 2026 16:58:23 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/vlm-scored-calibration-sets-for-int8-quantisation-routed-through-bifrost-37ek</link>
      <guid>https://dev.to/marcorinaldi_ai/vlm-scored-calibration-sets-for-int8-quantisation-routed-through-bifrost-37ek</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We pick the 512 hardest images for INT8 PTQ calibration by scoring a candidate pool with a small VLM. Bifrost sits between our calibration pipeline and four providers, gives us semantic caching, per-engineer virtual keys, and hard budget caps so a runaway loop can't burn EUR800 of API spend over the weekend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, calibration set selection is one of those topics nobody writes about until the day your INT8 model is 4.2 points of mAP worse than the FP16 reference and you have a release branch already cut. Pick the wrong 512 images and your activation histograms get biased toward easy frames. Pick the right 512 and the gap closes to 0.6 points without any QAT work at all.&lt;/p&gt;

&lt;p&gt;For about a year we picked them by hand. Then we tried random sampling, then stratified by class. The thing that finally worked, on an industrial defect detector we shipped to a customer last quarter, was scoring our candidate pool with a vision-language model and biasing toward the high-difficulty tail. That sounds neat in a paper. In production it meant roughly 80k API calls per release cycle, four providers in rotation, three engineers all running their own ablations, and one very pointed email from finance about the November bill.&lt;/p&gt;

&lt;p&gt;That's where Bifrost came in.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we actually wanted from a gateway
&lt;/h3&gt;

&lt;p&gt;A short list, in order of how much pain each item was causing us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One API surface so the calibration script doesn't care if we're hitting GPT-4o, Claude Sonnet, Gemini, or our self-hosted Qwen2-VL.&lt;/li&gt;
&lt;li&gt;Per-experiment budget caps. If someone's loop goes infinite over a Friday night, I want it to die at EUR50, not EUR5,000.&lt;/li&gt;
&lt;li&gt;Semantic caching, because we score the same images across many ablations and paying twice for the same prompt+image hash is wasteful.&lt;/li&gt;
&lt;li&gt;Per-engineer virtual keys so I can see who spent what without parsing four billing consoles.&lt;/li&gt;
&lt;li&gt;Real observability. Prometheus metrics, not a custom dashboard built by the intern who's now in Berlin.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Bifrost gave us all five. Setup took less than an afternoon, mostly spent arguing about key naming conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The scoring loop
&lt;/h3&gt;

&lt;p&gt;The actual code is boring, which is good. Our calibration scorer is a single async function that streams candidates from S3, calls Bifrost, parses a difficulty score from the JSON response, and writes it back to a Parquet shard. The gateway side is a short config, roughly this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;providers:
  openai:
    keys:
      - value: env:OPENAI_KEY_A
        weight: 0.5
      - value: env:OPENAI_KEY_B
        weight: 0.5
  anthropic:
    keys:
      - value: env:ANTHROPIC_KEY
governance:
  virtual_keys:
    - id: vk_calib_marco
      budget_eur: 200
    - id: vk_calib_giulia
      budget_eur: 200
semantic_cache:
  enabled: true
  similarity_threshold: 0.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each engineer gets their own virtual key. The budget is hard. Hit EUR200 and Bifrost returns 429 until the cap is lifted. This sounds harsh until you remember the November bill.&lt;/p&gt;

&lt;p&gt;Semantic caching pays for itself faster than I expected. About 38% of our candidate frames recur across ablations because we keep iterating on the difficulty prompt rather than the image set itself. Cache hits cost us nothing and return in under 40ms.&lt;/p&gt;
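
&lt;p&gt;Once the scores are written, the selection step itself is small. A minimal sketch, assuming the scores have been loaded into a dict of image id to difficulty. The function name &lt;code&gt;select_calibration_set&lt;/code&gt; and the toy pool are illustrative, and a plain top-k is the simplest reading of "bias toward the hard tail", not necessarily the exact rule we ship.&lt;/p&gt;

```python
import heapq

def select_calibration_set(scores, k=512):
    """Return the k image ids with the highest difficulty score.

    scores: {image_id: difficulty score parsed from the VLM response}
    """
    return heapq.nlargest(k, scores, key=scores.get)

# Toy pool with made-up scores; the real pool is read from the Parquet shards.
pool = {f"img_{i:05d}": (i * 37 % 1000) / 1000 for i in range(5000)}
calib = select_calibration_set(pool, k=512)
print(len(calib), calib[0], pool[calib[0]])
```

&lt;p&gt;A pure top-k can pile onto a single failure mode. Capping the count per class is a common guard, which this sketch does not do.&lt;/p&gt;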

&lt;h3&gt;
  
  
  How it compares to LiteLLM and Portkey
&lt;/h3&gt;

&lt;p&gt;We did genuinely evaluate both before committing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal request routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchical virtual keys / budgets&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted single Go binary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Hosted-first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native Prometheus metrics&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Portkey's hosted dashboards are more polished. If you don't want to run anything yourself, that's the right call. LiteLLM's Python ecosystem is broader and their community is bigger. We picked Bifrost because we wanted a self-hosted Go binary that fit alongside our existing inference services, with a governance model that matched how the team already splits experiments by engineer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trade-offs and Limitations
&lt;/h3&gt;

&lt;p&gt;A few things to be honest about.&lt;/p&gt;

&lt;p&gt;The semantic cache is great for repeated calibration runs and useless if your prompt template changes per experiment. We pin the prompt template and version it like model weights. When we skipped that discipline, the cache hit rate dropped to about 6%.&lt;/p&gt;
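
&lt;p&gt;One way to version a template, sketched under the assumption that a content hash is enough of an identifier. The helper &lt;code&gt;prompt_version&lt;/code&gt; and the template text are hypothetical. Storing the version next to every score keeps runs with different templates from being compared by accident.&lt;/p&gt;

```python
import hashlib

def prompt_version(template: str) -> str:
    """Short, stable identifier for a prompt template (first 12 hex chars of SHA-256)."""
    # Normalise trailing whitespace so cosmetic edits do not change the version.
    normalised = "\n".join(line.rstrip() for line in template.strip().splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:12]

# Hypothetical template text, for illustration only.
TEMPLATE = """
Rate how hard this image is for a defect detector.
Answer with JSON: {"difficulty": number between 0 and 1}
"""

version = prompt_version(TEMPLATE)
print(version)
```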

&lt;p&gt;VLM scoring is not deterministic. Two scoring passes on the same 80k images gave us a Spearman correlation of 0.94, which is fine for picking calibration sets but would not be fine for, say, regulatory model documentation.&lt;/p&gt;
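
&lt;p&gt;For reference, Spearman correlation is just Pearson correlation computed on ranks. A dependency-free sketch with made-up scores for five images follows. In practice a library routine such as SciPy's &lt;code&gt;spearmanr&lt;/code&gt; does the same job.&lt;/p&gt;

```python
def average_ranks(values):
    """Ranks starting at 1, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Made-up difficulty scores from two passes over the same five images.
pass_1 = [0.10, 0.40, 0.35, 0.80, 0.95]
pass_2 = [0.15, 0.30, 0.45, 0.85, 0.90]
print(round(spearman(pass_1, pass_2), 3))
```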

&lt;p&gt;Bifrost's MCP support is good but we don't use it for this workflow. The scoring loop is a plain HTTP client, no tool use. I mention it because the README puts MCP front and centre and you might assume you need it. You don't.&lt;/p&gt;

&lt;p&gt;The built-in UI is React-based and feels a generation behind Grafana for ad-hoc exploration. We export Prometheus metrics to our existing Grafana stack and ignore the bundled dashboard. Personal preference, not a bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost virtual keys and budgets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2106.08295" rel="noopener noreferrer"&gt;Nagel et al., "A White Paper on Neural Network Quantization" (2021)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html" rel="noopener noreferrer"&gt;PyTorch 2 Export (PT2E) PTQ tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>computervision</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>QAT vs PTQ on our edge vision model: 6 months of A/B data</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Thu, 28 May 2026 07:21:54 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/qat-vs-ptq-on-our-edge-vision-model-6-months-of-ab-data-147f</link>
      <guid>https://dev.to/marcorinaldi_ai/qat-vs-ptq-on-our-edge-vision-model-6-months-of-ab-data-147f</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We ran post-training quantisation (PTQ) and quantisation-aware training (QAT) side by side on the same defect-classification model deployed on a Jetson Orin Nano. After six months in production, QAT recovered 3.1 mAP points over PTQ overall, and between 3.3 and 10.2 points on rare defect classes depending on the PTQ variant, but cost us roughly two engineer-weeks of pipeline work and a 4x slower training cycle.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, every time someone shows me a quantisation benchmark on ImageNet, I want to ask them what their actual deployment looks like. Because ImageNet validation accuracy at INT8 tells you almost nothing about whether your model will still detect the 0.4% of defect samples that pay for the whole project. We learned this the hard way at the end of last year, when the first quarter of production data came back from one of our partner sites and our PTQ model was missing scratches that the FP16 baseline caught fine.&lt;/p&gt;

&lt;p&gt;This post is the writeup. Six months, one model architecture (ResNet-18 trunk with a custom anchor-free head), two quantisation paths, two hardware targets. No synthetic benchmarks, no toy datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model and the constraint
&lt;/h2&gt;

&lt;p&gt;The model is a defect classifier on a steel rolling line. Inference runs on a Jetson Orin Nano, 8GB version, sharing the SoC with a stereo depth pipeline. Latency budget for the classification path is 14ms. Memory budget after the depth pipeline takes its share is around 180MB. Five classes including background, with class imbalance roughly 92/3/2/2/1 percent.&lt;/p&gt;

&lt;p&gt;FP16 baseline: 17.3ms, 92.4 mAP, 71MB of activations.&lt;br&gt;
INT8 PTQ: 8.9ms, 88.1 mAP, 38MB.&lt;br&gt;
INT8 QAT: 9.2ms, 91.2 mAP, 38MB.&lt;/p&gt;

&lt;p&gt;Numbers look fine in aggregate. But aggregate hides the problem.&lt;/p&gt;
&lt;h2&gt;
  
  
  Where PTQ falls apart
&lt;/h2&gt;

&lt;p&gt;Post-training quantisation with TensorRT's entropy calibrator does a reasonable job when your dataset is balanced. Ours is not. The calibration set we initially used was drawn proportionally from the training distribution, which meant 92% of the calibration data was clean background. The activation histograms ended up dominated by the background distribution, and the quantisation scales were tuned for that.&lt;/p&gt;

&lt;p&gt;The result was that the rare defect classes (cracks at 1% prevalence, embedded particles at 2%) lost between 5 and 9 mAP points each. Hot pixels in the defect feature maps got clipped into the background range. We caught this in a customer review meeting in early March. Not my favourite Tuesday.&lt;/p&gt;
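
&lt;p&gt;The clipping is easy to reproduce on paper. A minimal sketch, assuming symmetric per-tensor INT8 and a scale taken straight from the observed maximum. TensorRT's entropy calibrator picks its threshold differently, and the activation values here are made up.&lt;/p&gt;

```python
def quantise_int8(x, scale):
    """Symmetric INT8 round-trip: scale, round, clamp to [-128, 127], rescale."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale

# Made-up activation ranges: background peaks near 2.0, a defect response near 6.0.
background_max = 2.0
defect_activation = 6.0

# Scale chosen from background-dominated statistics.
scale_bg = background_max / 127
# Scale chosen so that the defect range is covered too.
scale_full = defect_activation / 127

print(quantise_int8(defect_activation, scale_bg))    # clipped to the background ceiling
print(quantise_int8(defect_activation, scale_full))  # preserved
```

&lt;p&gt;The wider scale is not free: its step size is three times coarser, so small background values lose resolution. That trade is what the calibration set decides for you.&lt;/p&gt;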
&lt;h2&gt;
  
  
  What we tried
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Calibration / training&lt;/th&gt;
&lt;th&gt;mAP (rare classes)&lt;/th&gt;
&lt;th&gt;Engineer-weeks&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PTQ default&lt;/td&gt;
&lt;td&gt;5000 random samples&lt;/td&gt;
&lt;td&gt;79.2&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Baseline pain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PTQ rebalanced&lt;/td&gt;
&lt;td&gt;2000 samples, defects oversampled 10x&lt;/td&gt;
&lt;td&gt;84.6&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Better, still gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PTQ percentile&lt;/td&gt;
&lt;td&gt;99.99 percentile + rebalanced&lt;/td&gt;
&lt;td&gt;86.1&lt;/td&gt;
&lt;td&gt;1.5&lt;/td&gt;
&lt;td&gt;Marginal gain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QAT (10 epochs)&lt;/td&gt;
&lt;td&gt;Real training loop, fake-quant ops&lt;/td&gt;
&lt;td&gt;89.4&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;The keeper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
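
&lt;p&gt;To make the rebalanced row concrete, here is the arithmetic, assuming "oversampled 10x" means each defect class's sampling weight is multiplied by ten. Two of the class names are placeholders, since only cracks and embedded particles are named in this post.&lt;/p&gt;

```python
# Class prevalence in percent: background plus four defect classes.
prevalence = {
    "background": 92,
    "defect_3pct": 3,          # placeholder name
    "defect_2pct": 2,          # placeholder name
    "embedded_particles": 2,
    "cracks": 1,
}
OVERSAMPLE = 10  # defects oversampled 10x, background left as-is

weights = {
    name: share * (1 if name == "background" else OVERSAMPLE)
    for name, share in prevalence.items()
}
total = sum(weights.values())
calib_share = {name: w / total for name, w in weights.items()}

for name, share in calib_share.items():
    print(f"{name:20s} {share:.1%}")
```

&lt;p&gt;Under that assumption background drops from 92% to roughly 53% of the calibration set, which is still the majority.&lt;/p&gt;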

&lt;p&gt;The QAT path used &lt;code&gt;torch.ao.quantization&lt;/code&gt; with custom fake-quant observers per layer, exported through ONNX to TensorRT. We had to write a small shim because the default ONNX → TRT path stripped some of our QDQ nodes silently in TensorRT 10.4. The fix was forcing explicit precision on the affected layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Snippet from our QAT training step
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.ao.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FakeQuantize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MovingAverageMinMaxObserver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MovingAveragePerChannelMinMaxObserver&lt;/span&gt;

&lt;span class="n"&gt;per_channel_qconfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FakeQuantize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MovingAverageMinMaxObserver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quant_min&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant_max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;127&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qint8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;qscheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;per_tensor_symmetric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FakeQuantize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_args&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# A per-channel qscheme needs the per-channel observer; the per-tensor one rejects it.&lt;/span&gt;
        &lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MovingAveragePerChannelMinMaxObserver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;quant_min&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant_max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;127&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qint8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;qscheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;per_channel_symmetric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qconfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;per_channel_qconfig&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ao&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantization&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare_qat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting bit was per-channel weight quantisation. Per-tensor lost us another 1.5 mAP on rare classes. Per-channel costs roughly nothing at inference on the Orin Nano's GPU. You almost always want per-channel for vision models with significant filter diversity.&lt;/p&gt;
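
&lt;p&gt;A toy illustration of why filter diversity matters, assuming symmetric INT8 weights and two made-up filters whose magnitudes differ by a couple of orders. With one shared scale the small filter is rounded to almost nothing. With its own scale it keeps its shape.&lt;/p&gt;

```python
def quant_error(weights, scale):
    """Mean absolute error after a symmetric INT8 round-trip with the given scale."""
    err = 0.0
    for w in weights:
        q = max(-128, min(127, round(w / scale)))
        err += abs(w - q * scale)
    return err / len(weights)

def scale_for(ws):
    """Symmetric scale that maps the largest magnitude onto 127."""
    return max(abs(w) for w in ws) / 127

# Two made-up filters with very different magnitudes.
big_filter = [0.9, -1.2, 0.7, 1.0]
small_filter = [0.004, -0.006, 0.005, 0.003]

per_tensor_scale = scale_for(big_filter + small_filter)  # one scale shared by both filters
per_channel_scale = scale_for(small_filter)              # the small filter's own scale

print(quant_error(small_filter, per_tensor_scale))
print(quant_error(small_filter, per_channel_scale))
```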

&lt;h2&gt;
  
  
  What QAT does not fix
&lt;/h2&gt;

&lt;p&gt;QAT is not free and it's not magic. A few things bit us.&lt;/p&gt;

&lt;p&gt;The training cycle goes from 2 hours to 8 hours when you turn on fake-quant. We optimised some of this by only quantising from epoch 4 onwards (training in FP16 first), which the literature calls warm-start QAT. Recovered most of the wall-clock cost.&lt;/p&gt;

&lt;p&gt;Operator coverage in TensorRT for QDQ nodes is decent but not complete. Our custom group-norm replacement broke quantisation entirely and we had to fall back to batch-norm for the deployment branch. Annoying. Worth checking before you commit to an architecture.&lt;/p&gt;

&lt;p&gt;The eval pipeline itself needed work too. We use an LLM-driven workflow to triage failure modes from our weekly inspection batches (sorting false positives by visual similarity, basically), routed through Bifrost as our internal gateway so the rest of the org can share quota. Found one whole class of failures (specular highlights confused as cracks) we were not tracking before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PTQ is still the right call for prototyping.&lt;/strong&gt; If you're not sure your architecture is final, do not pay the QAT tax. Iterate with PTQ until the design freezes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QAT amplifies dataset noise.&lt;/strong&gt; Mislabelled samples hurt more under QAT than under FP training, because the fake-quant adds noise on top. We re-curated about 800 ambiguous labels before our final run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The mAP gap shrinks with mixed precision.&lt;/strong&gt; We tried mixed INT8/FP16 per layer and recovered 1 mAP for a 1.4ms latency hit. For our 14ms budget, full INT8 won. Yours may differ.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different hardware, different story.&lt;/strong&gt; The Orin's INT8 throughput is excellent. On a Coral TPU or a Cortex-M7, the analysis changes completely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pytorch.org/docs/stable/quantization.html" rel="noopener noreferrer"&gt;PyTorch quantisation docs (ao.quantization)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#optimizing-int8" rel="noopener noreferrer"&gt;TensorRT 10 INT8 best practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost AI gateway&lt;/a&gt; (what we use for the LLM tooling on the eval side)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1902.08153" rel="noopener noreferrer"&gt;LSQ paper (Esser et al., 2019)&lt;/a&gt; for learned step sizes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/NVIDIA/TensorRT-Model-Optimizer" rel="noopener noreferrer"&gt;NVIDIA Model Optimizer (PTQ → QAT workflow)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>Six weeks of Bifrost in a factory QA pilot: real cost numbers</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Wed, 27 May 2026 16:52:34 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/six-weeks-of-bifrost-in-a-factory-qa-pilot-real-cost-numbers-175a</link>
      <guid>https://dev.to/marcorinaldi_ai/six-weeks-of-bifrost-in-a-factory-qa-pilot-real-cost-numbers-175a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Six weeks running an AI gateway between our edge cameras and three cloud VLM providers cut our pilot VLM spend by 58% and gave us actual failover during a 90-minute Anthropic blip last month. Bifrost handled it. Here's what worked, what didn't, and how it compared to LiteLLM and Portkey on the same workload.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, when our team at a partner factory near Bologna wired up a defect inspection pilot with cloud VLMs in the loop, the cost story turned ugly within the first ten days. We had 28 stations, each catching anomalies from local event-camera and frame fusion, then escalating ambiguous frames to GPT-4o-mini or Claude Sonnet for a second opinion. The VLM bill landed at €4,800 in week one. Production was running 11 hours a day. Nobody had budgeted for that.&lt;/p&gt;

&lt;p&gt;The pilot also stalled twice. Once because OpenAI returned 429s for 22 minutes during what I assume was a regional capacity issue, and once because a key rotated wrong and half the fleet froze. Neither outage was the model's fault. Both were avoidable.&lt;/p&gt;

&lt;p&gt;We picked Bifrost as a gateway and ran it for six weeks. This is a writeup of what we measured. Independent perspective. I have no commercial relationship with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Our stack: Jetson Orin Nano per station, edge model (a distilled student of CLIP for class triage), and an escalation rule. If the edge confidence falls below a threshold, the frame plus context gets sent to a cloud VLM through Bifrost. Bifrost runs on a small VM in our partner's DMZ, two replicas behind a TCP load balancer. We use the OpenAI-compatible endpoint so our existing inference client didn't change.&lt;/p&gt;
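
&lt;p&gt;The escalation rule is about as simple as it sounds. A sketch, with a hypothetical threshold value and hypothetical names (&lt;code&gt;EdgeResult&lt;/code&gt;, &lt;code&gt;should_escalate&lt;/code&gt;):&lt;/p&gt;

```python
from dataclasses import dataclass

ESCALATION_THRESHOLD = 0.80  # hypothetical value, tuned per deployment

@dataclass
class EdgeResult:
    station_id: str
    label: str
    confidence: float

def should_escalate(result: EdgeResult, threshold: float = ESCALATION_THRESHOLD) -> bool:
    """Escalate to the cloud VLM only when the edge model is unsure."""
    return threshold > result.confidence

print(should_escalate(EdgeResult("station-07", "scratch", 0.62)))  # True
print(should_escalate(EdgeResult("station-07", "ok", 0.97)))       # False
```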

&lt;p&gt;Three things mattered for us, in this order. First, fallback semantics. Second, per-station budgeting. Third, observability metrics that hit Prometheus without extra scaffolding. We tried LiteLLM and Portkey before settling on Bifrost. More on that below.&lt;/p&gt;

&lt;p&gt;Configuration was a single YAML file plus one environment variable per provider key. This is roughly the relevant part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_KEY_PRIMARY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OPENAI_KEY_FAILOVER&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ANTHROPIC_KEY&lt;/span&gt;
  &lt;span class="na"&gt;bedrock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BEDROCK_KEY&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-central-1&lt;/span&gt;

&lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-6"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock/anthropic.claude-3-5-haiku"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole fallback chain. When OpenAI started rate limiting on April 18, traffic shifted to Anthropic within the configured timeout and we lost zero frames. Our oncall got a Prometheus alert on the fallback rate metric, which is exposed natively, and we caught the issue 90 seconds before any operator noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measured
&lt;/h2&gt;

&lt;p&gt;Six weeks of data from our pilot, three weeks before Bifrost and three weeks after. Same shift schedule, same product mix. Numbers below are real, scrubbed of station-specific identifiers.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before gateway&lt;/th&gt;
&lt;th&gt;After Bifrost&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Weekly VLM spend&lt;/td&gt;
&lt;td&gt;€4,640 avg&lt;/td&gt;
&lt;td&gt;€1,920 avg&lt;/td&gt;
&lt;td&gt;Semantic cache hit rate of 41%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 escalation latency&lt;/td&gt;
&lt;td&gt;1.9s&lt;/td&gt;
&lt;td&gt;1.4s&lt;/td&gt;
&lt;td&gt;Some hits served from cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outage minutes&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;One Anthropic blip auto-mitigated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operator interventions&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Most were cost-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-station cost visibility&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;per virtual key&lt;/td&gt;
&lt;td&gt;New capability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 41% semantic cache hit rate surprised me. I expected maybe 15%. Factory floor frames have a lot of repeated context (same product variant, same lighting, similar defect prompts), and the cache exploits that pattern. Documented behaviour, see the semantic caching docs linked at the bottom.&lt;/p&gt;
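
&lt;p&gt;The headline percentage follows directly from the table. The second half of this sketch rests on a simplifying assumption: cache hits are free and every request costs the same.&lt;/p&gt;

```python
before_eur = 4640  # weekly average before the gateway
after_eur = 1920   # weekly average with Bifrost in front

saving = (before_eur - after_eur) / before_eur
print(f"weekly saving: EUR {before_eur - after_eur} ({saving:.1%})")

# If cache hits were free and nothing else changed, a 41% hit rate alone
# would leave 59% of the original spend.
cache_only = before_eur * (1 - 0.41)
print(f"spend with caching alone: EUR {cache_only:.0f}")
```

&lt;p&gt;Under that simplification the cache accounts for only part of the drop. The table does not break down the rest.&lt;/p&gt;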

&lt;h2&gt;
  
  
  Bifrost vs LiteLLM vs Portkey
&lt;/h2&gt;

&lt;p&gt;Honest comparison, because we tested all three on the same workload.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-host as binary&lt;/td&gt;
&lt;td&gt;yes (Go binary or Docker)&lt;/td&gt;
&lt;td&gt;yes (Python proxy)&lt;/td&gt;
&lt;td&gt;hosted primary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;via Redis plugin&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual keys with budgets&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus metrics&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;td&gt;requires setup&lt;/td&gt;
&lt;td&gt;hosted dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput in our test&lt;/td&gt;
&lt;td&gt;~9k RPS sustained&lt;/td&gt;
&lt;td&gt;~2k RPS before tuning&lt;/td&gt;
&lt;td&gt;hosted (couldn't test fairly)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM is more mature in some places. Its Python plugin ecosystem is bigger, and if your team already lives in Python middleware, that's a real advantage. Portkey has the slickest hosted dashboard out of the box, no question. We couldn't put a hosted service in the path for this pilot because the partner factory had restrictions on outbound traffic from the OT network.&lt;/p&gt;

&lt;p&gt;Bifrost won for us on two specific points. The single Go binary deployed without a fight on the small VM we had. And the per-virtual-key budgeting let us bill each station's VLM cost back to the production line owner, which mattered for getting the pilot extended past phase one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The semantic cache occasionally returns a cached response when product variants change mid-shift. We had two false-negative defect reports in week three traced to a cache hit on a near-identical SKU. We tuned the similarity threshold and added a variant ID to the cache key. The takeaway is that semantic caching needs context-aware keys for industrial workloads. Not a Bifrost bug, but a real failure mode you have to design around.&lt;/p&gt;
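
&lt;p&gt;A sketch of what "context-aware" means here, assuming the cache can be partitioned by an opaque string. The helper &lt;code&gt;cache_namespace&lt;/code&gt;, the SKU identifiers, and the version label are hypothetical. How the string is handed to the gateway depends on the gateway and is not shown.&lt;/p&gt;

```python
import hashlib

def cache_namespace(variant_id: str, prompt_version: str) -> str:
    """Derive a namespace so cached answers never cross product variants or prompt versions."""
    raw = f"{variant_id}|{prompt_version}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

# Hypothetical SKU identifiers and prompt version label.
ns_a = cache_namespace("SKU-1041-A", "v3")
ns_b = cache_namespace("SKU-1041-B", "v3")
print(ns_a, ns_b)
```

&lt;p&gt;Two near-identical SKUs now land in different namespaces, so similarity matching only ever happens within one variant.&lt;/p&gt;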

&lt;p&gt;Bifrost's MCP integration is interesting but we didn't use it for this pilot. Our cameras don't need tool use. If you are building agentic flows on top, that calculus changes.&lt;/p&gt;

&lt;p&gt;Documentation gaps exist. The clustering setup for HA was less detailed than I would have liked when we first read it. Their team answered on Discord within an hour, which helped a lot.&lt;/p&gt;

&lt;p&gt;You don't need a gateway if you are single-provider and single-region. The added complexity is worth it once you have real failover requirements or cost attribution problems. We had both. Your factory may not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost semantic caching: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost governance and virtual keys: &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/governance/virtual-keys&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;LiteLLM proxy: &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;https://github.com/BerriAI/litellm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Portkey gateway: &lt;a href="https://github.com/Portkey-AI/gateway" rel="noopener noreferrer"&gt;https://github.com/Portkey-AI/gateway&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next pilot starts in two weeks. Same gateway, different factory, event-camera-only feeds this time. I'll write that one up once the numbers are in.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Distilling SAM 2 into a 6MB student for industrial inspection</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Wed, 27 May 2026 07:22:52 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/distilling-sam-2-into-a-6mb-student-for-industrial-inspection-2dia</link>
      <guid>https://dev.to/marcorinaldi_ai/distilling-sam-2-into-a-6mb-student-for-industrial-inspection-2dia</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We took Meta's SAM 2 small (around 224M params) and distilled it into a 6.3MB student that runs at 31 FPS on a Jetson Orin Nano for an automotive surface-defect pipeline. Mask IoU drops from 0.91 to 0.84, which is acceptable for the defect shapes we care about. The single biggest lever was a feature-alignment loss on the image embedding, not the mask logits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, most of my year goes into event-camera work at Prophesee, but a side contract this spring with an automotive supplier outside Brescia ate two months of my evenings. They make aluminium body panels and they wanted real-time masks for surface defects: scratches, dents, paint pinholes. Cameras are boring CMOS at 25 FPS and 4MP. Target hardware is a Jetson Orin Nano because the PLCs on the line already talk to one over Ethernet.&lt;/p&gt;

&lt;p&gt;First thing we tried was to fine-tune SAM 2 small directly and ship it with TensorRT FP16. About 1.2 seconds per image on the Orin. That's roughly 30x too slow for a moving line. We needed something a lot smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  The student architecture
&lt;/h2&gt;

&lt;p&gt;We did a MobileSAM-style backbone transplant but kept going further. TinyViT-5M as the image encoder, prompt encoder stripped down to dense point prompts only (we feed it candidate locations from a cheap saliency head, no box prompts), and we cut the mask decoder's upsample to 1/2 instead of 1/4. That last bit lost us a tiny amount of edge precision but it was the difference between 19 FPS and 31 FPS on the Orin.&lt;/p&gt;

&lt;p&gt;The student is 1.6M params. Storage is 6.3MB after INT8 weight-only quantisation. The teacher was held in FP16 on a separate A6000 during training.&lt;/p&gt;

&lt;h2&gt;
  
  
  The loss that actually worked
&lt;/h2&gt;

&lt;p&gt;Here is the part where I want to be very honest. We spent two weeks on cleverness: multi-scale logit matching, attention transfer, contrastive losses on the prompt embedding. None of it beat the obvious thing once we put it in.&lt;/p&gt;

&lt;p&gt;The obvious thing: align the student's image embedding to the teacher's in cosine space, alongside the usual soft mask BCE and a supervised dice term on the actual ground truth.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;distill_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gt_mask&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# feature alignment in cosine space
&lt;/span&gt;    &lt;span class="n"&gt;s_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s_emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s_n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t_n&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# soft mask distillation with temperature
&lt;/span&gt;    &lt;span class="n"&gt;soft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binary_cross_entropy_with_logits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;s_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_logits&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# supervised dice on hand-labelled defects
&lt;/span&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s_logits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;inter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gt_mask&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gt_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;dice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;inter&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;feat&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;soft&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dice&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the &lt;code&gt;feat&lt;/code&gt; term, we plateaued at IoU 0.71 on the held-out set of 4,200 defect crops. With it, 0.84. Same student, same data, same schedule. That is the kind of result that makes you doubt your prior assumptions about what knowledge transfer actually means.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers look like
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;IoU&lt;/th&gt;
&lt;th&gt;FPS (Orin)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SAM 2 small (teacher)&lt;/td&gt;
&lt;td&gt;224M&lt;/td&gt;
&lt;td&gt;884 MB&lt;/td&gt;
&lt;td&gt;0.91&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MobileSAM-style transplant&lt;/td&gt;
&lt;td&gt;9.8M&lt;/td&gt;
&lt;td&gt;39 MB&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ours, no feature loss&lt;/td&gt;
&lt;td&gt;1.6M&lt;/td&gt;
&lt;td&gt;6.3 MB&lt;/td&gt;
&lt;td&gt;0.71&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ours, with feature loss&lt;/td&gt;
&lt;td&gt;1.6M&lt;/td&gt;
&lt;td&gt;6.3 MB&lt;/td&gt;
&lt;td&gt;0.84&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where a VLM judge crept in
&lt;/h2&gt;

&lt;p&gt;One thing that surprised me during training: about 6% of held-out crops had teacher masks that were visibly wrong. Pinholes the teacher missed entirely, or scratches it bled into adjacent reflections. If you blindly distill on those, the student learns the teacher's bad habits.&lt;/p&gt;

&lt;p&gt;For the crops where student and teacher disagreed by more than 0.15 IoU, we ran a VLM-as-judge step. Claude Sonnet 4.5 was on the primary path, with Gemini 2.5 Pro as a backup. The judge was asked to pick which mask better followed the actual defect contour, given the original crop and both overlays. We routed those calls through Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) because we needed automatic failover between the two providers during peak hours when one would rate-limit us, and writing that retry logic ourselves felt like wasted time. About 38% of disagreements were judged in the student's favour, so we re-weighted those samples in the next epoch.&lt;/p&gt;
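
&lt;p&gt;The disagreement gate can be sketched roughly as follows. This assumes "disagreed by more than 0.15 IoU" means that one minus the IoU between the student and teacher masks exceeds 0.15, which is one reading of that sentence. The names &lt;code&gt;mask_iou&lt;/code&gt; and &lt;code&gt;needs_judge&lt;/code&gt; are hypothetical, and the masks are flat 0/1 lists to keep the example dependency-free.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equal-length flat sequences of 0/1."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks agree perfectly.
    return inter / union if union else 1.0


def needs_judge(student_mask, teacher_mask, threshold=0.15):
    """True when the two masks disagree enough to ask the VLM judge."""
    disagreement = 1.0 - mask_iou(student_mask, teacher_mask)
    return disagreement &gt; threshold


student = [1, 1, 1, 0]
teacher = [1, 1, 0, 0]
print(needs_judge(student, teacher))   # IoU is 2/3, so this goes to the judge
print(needs_judge(teacher, teacher))   # identical masks never do
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;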

&lt;p&gt;I will say: the VLM-as-judge is not magic and we hand-checked a stratified sample of its calls. The agreement with our QA lead was around 84%, which was good enough to drive sample re-weighting but not good enough to use as ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hairline scratches below about 80 microns at our working distance are still missed. The teacher catches some of those. We are debating whether to keep a slow second pass for ambiguous frames.&lt;/li&gt;
&lt;li&gt;The feature-alignment loss requires teacher and student image embeddings at the same spatial resolution. That ruled out a couple of backbone choices we wanted to try.&lt;/li&gt;
&lt;li&gt;Training was nine days on 4x A6000s. Distillation is not cheap, even if the result is.&lt;/li&gt;
&lt;li&gt;We have not validated on glass or carbon-fibre panels. Aluminium only.&lt;/li&gt;
&lt;li&gt;INT8 quantisation cost us 0.6 IoU points. Worth it for the size but not free.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SAM 2 paper: &lt;a href="https://arxiv.org/abs/2408.00714" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2408.00714&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MobileSAM: &lt;a href="https://arxiv.org/abs/2306.14289" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2306.14289&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TinyViT: &lt;a href="https://arxiv.org/abs/2207.10666" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2207.10666&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost gateway docs: &lt;a href="https://docs.getbifrost.ai" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NVIDIA TensorRT INT8 calibration notes: &lt;a href="https://docs.nvidia.com/deeplearning/tensorrt/" rel="noopener noreferrer"&gt;https://docs.nvidia.com/deeplearning/tensorrt/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>machinelearning</category>
      <category>pytorch</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Capping VLM spend per CV researcher: hierarchical budgets in practice</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Tue, 26 May 2026 16:52:17 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/capping-vlm-spend-per-cv-researcher-hierarchical-budgets-in-practice-4a2p</link>
      <guid>https://dev.to/marcorinaldi_ai/capping-vlm-spend-per-cv-researcher-hierarchical-budgets-in-practice-4a2p</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Our 11-person CV team at Prophesee was burning through €3-4k a week in VLM spend on dataset annotation with no idea which researcher caused which spike. We put Bifrost between the labelling scripts and the providers, mapped one virtual key per person with monthly caps, and the receipt-chasing stopped. Took an afternoon to wire up. Real savings came from enforcement, not from clever routing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, when you let eleven people each script their own VLM annotation passes against three different providers, you get one giant invoice at the end of the month and no idea who is responsible for the €700 Tuesday spike. We hit that exact wall in late February. Two consecutive weeks at around €4k each, and our team lead asked, very politely, over an espresso, whether maybe we should think about it.&lt;/p&gt;

&lt;p&gt;The honest answer was: yes, six months ago.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we were actually doing
&lt;/h2&gt;

&lt;p&gt;Our annotation pipeline labels event-camera frames reconstructed from recordings (indoor robotics, low-light driving, drone footage). For each scene we run a VLM pass to produce captions, bounding box suggestions, and a sanity-check description. The VLM is the teacher; a smaller distilled model is what we deploy on the edge.&lt;/p&gt;

&lt;p&gt;The traffic profile looked like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Calls/week&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bulk caption jobs&lt;/td&gt;
&lt;td&gt;~80k&lt;/td&gt;
&lt;td&gt;gpt-4o-mini&lt;/td&gt;
&lt;td&gt;overnight cron, predictable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bounding box suggestions&lt;/td&gt;
&lt;td&gt;~12k&lt;/td&gt;
&lt;td&gt;claude-3.5-sonnet&lt;/td&gt;
&lt;td&gt;interactive, daytime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sanity-check descriptions&lt;/td&gt;
&lt;td&gt;~6k&lt;/td&gt;
&lt;td&gt;gemini-2.0-flash&lt;/td&gt;
&lt;td&gt;per-researcher scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ad-hoc exploration&lt;/td&gt;
&lt;td&gt;who knows&lt;/td&gt;
&lt;td&gt;mixed&lt;/td&gt;
&lt;td&gt;the actual problem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row was biting us. Researchers were testing prompt variations, sweeping over thousands of frames out of curiosity, and forgetting they had kicked off a script before going home for the weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we wired up
&lt;/h2&gt;

&lt;p&gt;We deployed Bifrost as a Docker container on the same internal box that already hosts our annotation queue. Ten minutes. Then we mapped one virtual key per researcher (&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;docs&lt;/a&gt;) with a monthly cap, and two teams (vision research, robotics demos), each with their own pooled budget on top.&lt;br&gt;
The config looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;teams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vision-research&lt;/span&gt;
    &lt;span class="na"&gt;monthly_budget_eur&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1800&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;robotics-demos&lt;/span&gt;
    &lt;span class="na"&gt;monthly_budget_eur&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1200&lt;/span&gt;

&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_marco&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vision-research&lt;/span&gt;
    &lt;span class="na"&gt;monthly_budget_eur&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;220&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3.5-sonnet"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vk_sofia&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vision-research&lt;/span&gt;
    &lt;span class="na"&gt;monthly_budget_eur&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;220&lt;/span&gt;
    &lt;span class="na"&gt;allowed_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a researcher's key hits 90% of their monthly cap, the gateway rejects further requests until the next cycle. No silent overspend. The team-level budget acts as a second ceiling: even if every individual key stays under its own cap, requests are rejected once the team's pooled budget runs out.&lt;/p&gt;
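
&lt;p&gt;To make the two-ceiling behaviour concrete, here is a small sketch of the check, assuming a request is admitted only when both the key and its team can still afford it. &lt;code&gt;Budget&lt;/code&gt; and &lt;code&gt;try_spend&lt;/code&gt; are hypothetical names, the rejection point is simplified to the full cap, and none of this is how the gateway implements it internally.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass
class Budget:
    cap_eur: float
    spent_eur: float = 0.0

    def remaining(self):
        return self.cap_eur - self.spent_eur


def try_spend(key_budget, team_budget, cost_eur):
    """Record the cost only if both the key and its team can afford it."""
    if cost_eur &gt; key_budget.remaining() or cost_eur &gt; team_budget.remaining():
        return False  # rejected: one of the two ceilings would be exceeded
    key_budget.spent_eur += cost_eur
    team_budget.spent_eur += cost_eur
    return True


# Illustrative numbers: the key has plenty left, the team pool is almost empty.
team = Budget(cap_eur=1800, spent_eur=1798)
key = Budget(cap_eur=220, spent_eur=40)

print(try_spend(key, team, cost_eur=1.5))  # fits under both ceilings
print(try_spend(key, team, cost_eur=1.5))  # the team pool blocks this one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;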

&lt;p&gt;We pointed every script at &lt;code&gt;http://bifrost.internal:8080/v1/chat/completions&lt;/code&gt; and replaced the OpenAI/Anthropic SDK calls with the unified endpoint. The &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; meant we didn't touch the actual annotation code beyond the base URL.&lt;/p&gt;
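
&lt;p&gt;For illustration, a request against that endpoint might be built like this, using only the standard library. The message layout is the OpenAI-compatible chat format with an &lt;code&gt;image_url&lt;/code&gt; part. The &lt;code&gt;x-bf-vk&lt;/code&gt; header for the virtual key is an assumption here, so check the governance docs for the exact header name. &lt;code&gt;build_caption_request&lt;/code&gt; and the prompt text are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import base64
import json
import urllib.request

GATEWAY_URL = "http://bifrost.internal:8080/v1/chat/completions"


def build_caption_request(jpeg_bytes, virtual_key, model="openai/gpt-4o-mini"):
    """Build (but do not send) a caption request for one reconstructed frame."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    body = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this frame in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64," + encoded}},
            ],
        }],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(body).encode("utf-8"),
        # Assumed header name for the per-researcher virtual key.
        headers={"Content-Type": "application/json", "x-bf-vk": virtual_key},
        method="POST",
    )


req = build_caption_request(b"\xff\xd8placeholder-jpeg", "vk_marco")
# urllib.request.urlopen(req, timeout=30) would send it; left out on purpose.
print(req.get_method(), req.full_url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;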

&lt;h2&gt;
  
  
  What we got back
&lt;/h2&gt;

&lt;p&gt;Three things, in honest order of value:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Per-key spend visibility.&lt;/strong&gt; Prometheus metrics out of the box. We pipe them into our existing Grafana, and now I can see at a glance that Sofia spent €43 on Monday running a prompt sweep. Not to shame anyone. To understand the pattern.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic failover.&lt;/strong&gt; When OpenAI had its 47-minute hiccup on March 14, our overnight caption job didn't die. Bifrost rerouted to Anthropic. We found out from the metrics, not from a panicked Slack message at 7am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching on the bulk caption jobs.&lt;/strong&gt; About 18% cache hit rate on repeated frames from similar scenes. Modest. Not the savings argument I would lead with, but real.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;I should be honest about where this is and isn't worth it.&lt;/p&gt;

&lt;p&gt;If you're a solo engineer with one API key and one provider, none of this applies. Put your &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; in .env and move on. The hierarchical budget story matters when you have multiple humans with their own scripts.&lt;/p&gt;

&lt;p&gt;LiteLLM is the obvious comparison. We tried it first. It works, the routing logic is solid, and the Python-native feel suits a research team. What pushed us to Bifrost was the governance side: virtual keys with per-team and per-user budgets as a first-class concept, not bolted on. Portkey has similar governance features but felt heavier than we needed for an 11-person internal deployment.&lt;/p&gt;

&lt;p&gt;The Bifrost web UI is functional but not pretty. The semantic cache config has some sharp edges (the embedding model choice matters more than the docs imply). And if your provider mix is exotic (we briefly wanted Mistral's hosted vision endpoint, which wasn't supported at the time we tested), you'll need to wait or contribute.&lt;/p&gt;

&lt;p&gt;The real limitation though: a gateway cannot save you from a researcher who genuinely needs to spend €500 on a legitimate experiment. It can only make that spend visible and intentional. Which, honestly, is what we wanted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Bifrost virtual keys and governance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;Drop-in replacement guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;Prometheus observability defaults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;LiteLLM router docs (for comparison)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>llm</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Falling back from edge detection to a cloud VLM when confidence drops</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Tue, 26 May 2026 07:22:10 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/falling-back-from-edge-detection-to-a-cloud-vlm-when-confidence-drops-5bgh</link>
      <guid>https://dev.to/marcorinaldi_ai/falling-back-from-edge-detection-to-a-cloud-vlm-when-confidence-drops-5bgh</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We deploy a 4MB SSD detector on an ARM edge box and cascade low-confidence frames to a cloud VLM. About 3% of frames make the trip. The interesting part is not the model but the routing layer that decides when to ask for help and how to fail gracefully when the network is hostile.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last winter we shipped a vision system for an industrial inspection client. The constraint was familiar: a Cortex-A72 box bolted to a conveyor, no GPU, intermittent 4G uplink. The detector had to run at 25 FPS and flag defects. Easy enough. The hard part came later.&lt;/p&gt;

&lt;p&gt;So, the thing is, on a clean validation set the small detector hit 91% mAP at IoU 0.5. In production it hovered around 78%. The drop came from frames the model had never imagined: unusual lighting on Mondays after the floor was washed, parts placed at angles outside our training distribution, partial occlusions from operator gloves. The honest answer was that we didn't have a large enough labeled set to cover the tail.&lt;/p&gt;

&lt;p&gt;Rather than ship a bigger model the box couldn't run, we built a cascade. The edge model handles every frame. When its top-1 detection confidence drops below 0.55 or its second-best is within 0.15 of the first, we send the cropped patch to a cloud VLM for a second opinion. About 3% of frames trigger this path.&lt;/p&gt;

&lt;h3&gt;
  
  
  The cascade in practice
&lt;/h3&gt;

&lt;p&gt;Here is the gist of the gating logic, simplified:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_escalate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Assumes detections are sorted by score, highest first.&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;runner_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detections&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;runner_up&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A patch is 224x224 around the predicted bounding box, encoded as JPEG quality 80. Average payload is 11 KB. Round-trip latency to the cloud VLM is 800-1400ms over 4G, which is acceptable because the conveyor has a 2-second buffer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a routing layer mattered
&lt;/h3&gt;

&lt;p&gt;The first version called one provider directly. Then we had a 17-minute outage and the line operator had to override frames manually. Not great. The second version routes through a gateway that fans out across two VLM providers with automatic failover. We use Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) on a small VPS for this, though LiteLLM or a custom proxy would also work. With one provider as primary and another as secondary, when the primary returns 5xx or times out past 2 seconds, the next one takes over. Semantic caching helps for near-identical frames during steady-state operation.&lt;/p&gt;
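
&lt;p&gt;In our setup this logic lives inside the gateway, but the behaviour is easy to sketch. The version below assumes each provider is wrapped in a callable that raises &lt;code&gt;TimeoutError&lt;/code&gt; or &lt;code&gt;ConnectionError&lt;/code&gt; on a timeout or 5xx response. &lt;code&gt;call_with_failover&lt;/code&gt; and the two fake providers are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def call_with_failover(providers, patch, timeout_s=2.0):
    """Try providers in order and move on when one errors or times out.

    `providers` is a list of (name, callable) pairs. Each callable takes
    (patch, timeout_s) and raises on a timeout or a server-side failure.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(patch, timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(patch, timeout_s):
    # Stand-in for a provider that does not answer in time.
    raise TimeoutError("no response within %.1fs" % timeout_s)


def healthy_secondary(patch, timeout_s):
    # Stand-in for a provider that answers normally.
    return "scratch, upper left"


winner, answer = call_with_failover(
    [("primary", flaky_primary), ("secondary", healthy_secondary)],
    patch=b"jpeg-bytes",
)
print(winner, answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the provider list ordered makes the primary and secondary roles explicit, and the collected errors give you something to log when every provider is down.&lt;/p&gt;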

&lt;p&gt;The choice of gateway is not the headline. What matters is having one. Calling provider SDKs directly from edge devices is a debugging nightmare once you have failover logic and retries layered on top.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we measured
&lt;/h3&gt;

&lt;p&gt;After 6 weeks of production data on three client lines:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Edge only&lt;/th&gt;
&lt;th&gt;Edge + VLM cascade&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mAP (production)&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;86.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg latency / frame&lt;/td&gt;
&lt;td&gt;38 ms&lt;/td&gt;
&lt;td&gt;41 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p99 latency / frame&lt;/td&gt;
&lt;td&gt;52 ms&lt;/td&gt;
&lt;td&gt;1290 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud cost / day&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$4.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frames escalated&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;3.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The p99 jump is real and unavoidable. We absorb it in the conveyor buffer. Without that slack the cascade would not be an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  A subtle bug we hit
&lt;/h3&gt;

&lt;p&gt;Early on, the cascade rate drifted upward over a week, climbing from 3% to 7%. Cost was rising too. After some digging we found the detector was responding to slow camera lens contamination. Dust accumulating on the optics was lowering confidence everywhere. The cascade masked the real problem because the cloud VLM was strong enough to handle the dirty frames.&lt;/p&gt;

&lt;p&gt;Now we alert when the cascade rate moves beyond a 5% rolling band. The cloud should be a backup, not a crutch.&lt;/p&gt;
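
&lt;p&gt;One way to implement such an alert is a sliding-window escalation rate. The sketch below assumes the band means "the rolling rate must stay at or below 5%"; the class name, the window size and that interpretation are assumptions, not details from our production code.&lt;/p&gt;

```python
from collections import deque


class CascadeRateMonitor:
    """Rolling share of escalated frames over the last `window` frames."""

    def __init__(self, window: int = 10_000, max_rate: float = 0.05):
        self.flags = deque(maxlen=window)  # 1 = escalated, 0 = handled on the edge
        self.escalated = 0                 # running count of 1s inside the window
        self.max_rate = max_rate

    def record(self, was_escalated: bool) -> bool:
        """Add one frame and return True when the rolling rate exceeds max_rate."""
        if len(self.flags) == self.flags.maxlen:
            # The oldest flag is about to be evicted, so remove it from the count.
            self.escalated -= self.flags[0]
        flag = 1 if was_escalated else 0
        self.flags.append(flag)
        self.escalated += flag
        return self.escalated / len(self.flags) > self.max_rate
```

&lt;p&gt;A real monitor would also want a warm-up period, because the rate over the first handful of frames is too noisy to alert on.&lt;/p&gt;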

&lt;h3&gt;
  
  
  Trade-offs and limitations
&lt;/h3&gt;

&lt;p&gt;Latency variance. The p99 number above is honest. For applications without buffer slack (closed-loop robotics, automotive vision), this approach falls apart. It works for our throughput-tolerant inspection setup. It would not work for collision avoidance.&lt;/p&gt;

&lt;p&gt;Network dependence. When the 4G link drops the system degrades to edge-only. We log every frame that would have escalated and run a batch job once connectivity returns. About 8% of escalations end up batched on a bad day. The client accepted this. Another client might not.&lt;/p&gt;

&lt;p&gt;Cost predictability. Per-frame cost is small but it is real. A noisier production environment can double the cascade rate overnight. We have a hard daily ceiling that disables escalation if exceeded, and the line falls back to edge-only with a flag for operator review.&lt;/p&gt;
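
&lt;p&gt;The daily ceiling is simple bookkeeping. A minimal sketch, assuming a per-escalation cost estimate is available before the call; &lt;code&gt;DailyEscalationBudget&lt;/code&gt; and &lt;code&gt;try_spend&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
import datetime


class DailyEscalationBudget:
    """Hard daily spending ceiling for cloud escalations."""

    def __init__(self, ceiling_usd: float):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0
        self.day = None

    def try_spend(self, today: datetime.date, cost_usd: float) -> bool:
        """Return True and book the cost if the escalation fits today's budget."""
        if today != self.day:
            # First frame of a new day: reset the counter.
            self.day = today
            self.spent_usd = 0.0
        if self.spent_usd + cost_usd > self.ceiling_usd:
            return False  # caller stays edge-only and flags the frame for review
        self.spent_usd += cost_usd
        return True
```

&lt;p&gt;When &lt;code&gt;try_spend&lt;/code&gt; returns False, the frame keeps its edge prediction and is flagged for the operator.&lt;/p&gt;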

&lt;p&gt;Privacy. The cropped patch is sent to a third-party provider. For industrial inspection of inert parts this was fine. For anything involving people we would need on-prem VLMs, which changes the economics considerably.&lt;/p&gt;

&lt;p&gt;Calibration drift. The confidence threshold of 0.55 was tuned on validation data from one factory. The other two factories needed different thresholds (0.48 and 0.61). We now run a small calibration script after first deployment.&lt;/p&gt;
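
&lt;p&gt;As an illustration of what a per-site calibration step could look like, the sketch below picks the threshold that escalates a target share of frames, given edge-detector confidences collected at the new site. This is one possible approach under that assumption, not a description of our actual script; the function name and the sample values are hypothetical.&lt;/p&gt;

```python
from typing import Sequence


def threshold_for_escalation_rate(confidences: Sequence[float], target_rate: float) -> float:
    """Pick a threshold so that roughly target_rate of frames fall strictly below it."""
    ordered = sorted(confidences)
    k = round(len(ordered) * target_rate)
    k = min(max(k, 0), len(ordered) - 1)  # keep the index inside the list
    return ordered[k]


# Made-up confidences from a hypothetical new site.
site_confidences = [0.9, 0.2, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85, 0.75, 0.65]
threshold = threshold_for_escalation_rate(site_confidences, target_rate=0.1)
escalated = sum(1 for c in site_confidences if threshold > c)
```

&lt;p&gt;Frames whose confidence falls strictly below the returned threshold are the ones that would escalate.&lt;/p&gt;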

&lt;h3&gt;
  
  
  Closing thought
&lt;/h3&gt;

&lt;p&gt;Cascade architectures are old. Hierarchical classifiers and coarse-to-fine pipelines have been around since before I started this job. What changed is that the "ask for help" tier is now a generally capable model rather than a slightly larger specialist. That shifts the economics for the long tail of unusual inputs that no training set will ever cover.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA TensorRT INT8 calibration documentation: &lt;a href="https://docs.nvidia.com/deeplearning/tensorrt/" rel="noopener noreferrer"&gt;https://docs.nvidia.com/deeplearning/tensorrt/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Silla &amp;amp; Freitas 2011, "A survey of hierarchical classification across different application domains"&lt;/li&gt;
&lt;li&gt;Guo et al. 2017, "On Calibration of Modern Neural Networks"&lt;/li&gt;
&lt;li&gt;Bifrost gateway documentation: &lt;a href="https://docs.getbifrost.ai/" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cai &amp;amp; Vasconcelos 2018, "Cascade R-CNN" for a different but related cascade idea&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Auto-labelling 1.2M robotics frames with VLMs: a failover story</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Mon, 25 May 2026 16:53:43 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/auto-labelling-12m-robotics-frames-with-vlms-a-failover-story-1dje</link>
      <guid>https://dev.to/marcorinaldi_ai/auto-labelling-12m-robotics-frames-with-vlms-a-failover-story-1dje</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We needed to caption 1.2M reconstructed event-camera frames using vision-language models for auxiliary supervision. The first run died at 340K from Anthropic rate limits. Putting Bifrost in front of three VLM providers cut the rerun cost by 22% and finished in 9 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, when you work at a neuromorphic vision startup, your training data looks strange. At Prophesee we accumulate event streams into time-binned windows that we render into pseudo-frames. For a self-supervised pretraining run on a new asynchronous backbone, we wanted natural-language captions on every window. Not because we're going language-first. The captions act as auxiliary targets for a contrastive head that sits alongside the actual event tensor.&lt;/p&gt;

&lt;p&gt;1.2M frames. Three candidate VLMs: GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro. All three caption our weird greyscale reconstructions differently enough that we wanted a mix per frame.&lt;/p&gt;

&lt;p&gt;I tried Anthropic first because the captions were qualitatively the best on our pilot set. The job died at 340,317 captions when it hit a sustained tokens-per-minute (TPM) cap. That was a Friday evening before a long weekend in Bologna. I lost the weekend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing a gateway over more retry code
&lt;/h3&gt;

&lt;p&gt;My first instinct was to write a smarter retry loop. Every CV engineer has this instinct when they discover REST APIs aren't deterministic. After about three hours of writing what was clearly going to become a half-baked rate-limit handler with provider-specific quirks, I stopped.&lt;/p&gt;

&lt;p&gt;The actual problem was that I had multiple providers, all with their own SDKs and their own error formats. I needed something in the middle that knew about quotas, retries, and fallback chains, and that wasn't going to require me to learn yet another vendor lock-in.&lt;/p&gt;

&lt;p&gt;I looked at LiteLLM, Portkey, and &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;. Ended up running Bifrost in Docker on the same node as the batch dispatcher.&lt;/p&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;Bifrost runs as a single Go binary or container. The config that mattered for us was the fallback chain. Here's the trimmed version we shipped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;OPENAI_KEY_1&lt;/span&gt;&lt;span class="pi"&gt;},&lt;/span&gt; &lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;OPENAI_KEY_2&lt;/span&gt;&lt;span class="pi"&gt;}]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_KEY_1&lt;/span&gt;&lt;span class="pi"&gt;}]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;vertex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;VERTEX_KEY_1&lt;/span&gt;&lt;span class="pi"&gt;}]&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.2&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;anthropic/claude-3-7-sonnet&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;vertex/gemini-2.5-pro&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-3-7-sonnet&lt;/span&gt;
    &lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;openai/gpt-4o&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;vertex/gemini-2.5-pro&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our batch dispatcher called &lt;code&gt;http://bifrost:8080/v1/chat/completions&lt;/code&gt; with whatever model we picked for that frame. If a provider was over quota, Bifrost handled the failover and the dispatcher never saw the error. That part is documented under &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;retries and fallbacks&lt;/a&gt;.&lt;/p&gt;
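
&lt;p&gt;For reference, a request body in the OpenAI-compatible chat format with one inline JPEG looks roughly like this. It is a minimal sketch, not our dispatcher code: the helper name, the prompt text and the three placeholder bytes are assumptions, while the endpoint and model string come from the setup above.&lt;/p&gt;

```python
import base64
import json

BIFROST_URL = "http://bifrost:8080/v1/chat/completions"  # endpoint named in the post


def build_caption_request(jpeg_bytes: bytes, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion body with one inline JPEG."""
    image_b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64," + image_b64},
                    },
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real reconstructed frame.
body = build_caption_request(b"\xff\xd8\xff", "openai/gpt-4o", "Describe the scene.")
payload = json.dumps(body).encode("utf-8")
# POST `payload` to BIFROST_URL with a Content-Type: application/json header.
```

&lt;p&gt;Because the body is identical for every provider, switching models only changes the &lt;code&gt;model&lt;/code&gt; string.&lt;/p&gt;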

&lt;p&gt;We also turned on semantic caching for the prompt template because we caption a lot of near-identical static scenes. Robotics demos have long boring stretches. Cache hit rate landed around 14% on the full run, which isn't huge but covered the cost of running the gateway itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it compared
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concern&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Portkey&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-provider failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted in our VPC&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Paid tier&lt;/td&gt;
&lt;td&gt;Yes (Docker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching built-in&lt;/td&gt;
&lt;td&gt;Plugin&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus metrics native&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single binary deploy&lt;/td&gt;
&lt;td&gt;No (Python)&lt;/td&gt;
&lt;td&gt;N/A (SaaS)&lt;/td&gt;
&lt;td&gt;Yes (Go)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;800 req/s sustained&lt;/td&gt;
&lt;td&gt;GIL issues&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Held&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM was the most familiar option for our team because we already use it for eval scripts. Honestly for offline single-process work it's fine. The problem hit us when we tried to push sustained throughput through one Python process. Bifrost being Go meant we didn't fight the GIL. Portkey's hosted product is genuinely nice and the analytics UI is better than what Bifrost shipped, but we needed everything inside our VPC for frames covered by client confidentiality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The full 1.2M caption run finished in 9 hours and 14 minutes. Total cost was $4,180, down from a projected $5,360 if we'd run everything on GPT-4o. The 22% saving came from routing roughly a third of traffic to Gemini, which is cheaper per token for our prompt length.&lt;/p&gt;

&lt;p&gt;Two providers had transient 429 spikes during the run. I didn't have to do anything about either. The gateway absorbed them. I noticed only because the per-provider request graph in the Bifrost dashboard had a visible dip on Anthropic around hour four.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trade-offs and limitations
&lt;/h3&gt;

&lt;p&gt;Not everything was clean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency overhead.&lt;/strong&gt; Bifrost adds a hop. For batch labelling it didn't matter. For an interactive vision app streaming a webcam, I'd benchmark carefully before putting any gateway in the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caption drift across providers.&lt;/strong&gt; Captions from Gemini and Claude are stylistically different even with the same prompt. We had to normalise downstream with a small T5 rewriter. The gateway doesn't solve this for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config sprawl.&lt;/strong&gt; Once you have weights, fallbacks, virtual keys, and cache rules in one YAML, it gets hard to reason about which path a given request actually took. Bifrost's logging helped but I had to dig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP and tool use.&lt;/strong&gt; We didn't need them. If you're building an agent product instead of a labelling pipeline, the &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP support&lt;/a&gt; might matter more than failover.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd do differently
&lt;/h3&gt;

&lt;p&gt;Run a larger pilot at the full request rate before launching the full job. We did 50K frames, which was enough to catch the rate-limit issue conceptually but not enough to see what 800 req/s sustained does to a Python process. Also: drink the espresso before debugging gateway configs at 1am, not after.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;Bifrost retries and fallbacks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Bifrost semantic caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://portkey.ai/docs" rel="noopener noreferrer"&gt;Portkey docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.prophesee.ai" rel="noopener noreferrer"&gt;Prophesee event camera primer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>computervision</category>
      <category>mlops</category>
      <category>llm</category>
    </item>
    <item>
      <title>Quantising event-camera networks to run under 1MB on a Cortex-M7</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Mon, 25 May 2026 07:22:24 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/quantising-event-camera-networks-to-run-under-1mb-on-a-cortex-m7-382c</link>
      <guid>https://dev.to/marcorinaldi_ai/quantising-event-camera-networks-to-run-under-1mb-on-a-cortex-m7-382c</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: I shrunk a gesture-recognition model for a Prophesee EVK4 event camera from 4.2MB down to 780KB so it could run on an STM32H7 at 15ms per inference. The trick was not the quantisation itself, it was rethinking what an "image" even means when your sensor produces events instead of frames.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, most computer vision tutorials assume you start with a tensor of shape &lt;code&gt;[B, 3, H, W]&lt;/code&gt; and end with a classification head. Event cameras break that assumption on day one. A Prophesee sensor doesn't give you frames at 30fps. It gives you a sparse stream of events, each a tuple of &lt;code&gt;(x, y, t, polarity)&lt;/code&gt;, fired only when a pixel changes brightness. You can get millions of these per second during motion and almost nothing when the scene is static.&lt;/p&gt;

&lt;p&gt;That changes the entire optimisation game. Let me give you the full picture here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;We had a gesture model trained on the DVS128 Gesture dataset (11 classes, hand movements recorded with a DVS128 sensor). The baseline was a small ResNet-ish backbone running on event frames, accumulated over 50ms windows. It hit 94.1% test accuracy at 4.2MB in fp32. Inference on a Cortex-M7 at 480MHz took 68ms per window, which is too slow when your events are arriving in real time.&lt;/p&gt;

&lt;p&gt;Target: sub-1MB, sub-20ms, accuracy drop under 2 percentage points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step one: stop pretending events are frames
&lt;/h2&gt;

&lt;p&gt;The first 30% of the size came from not training on frames at all. Event-frame accumulation throws away the temporal resolution you paid for when you bought a €3,000 sensor. We switched the input representation to a voxel grid of shape &lt;code&gt;[2, 5, 128, 128]&lt;/code&gt; (2 polarities, 5 temporal bins per window). That alone let us drop the first conv block from 64 channels to 24, because the input was already temporally structured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;events_to_voxel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# events: [N, 4] tensor of (x, y, t, polarity); polarity indexes the channel, so it is assumed to be 0 or 1&lt;/span&gt;
    &lt;span class="n"&gt;voxel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t_norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_min&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Split the window into num_bins equal slices; clamp so the event at t_max lands in the last bin&lt;/span&gt;
    &lt;span class="n"&gt;bin_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_norm&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;long&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bins&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;voxel&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;bin_idx&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;voxel&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accuracy actually went up to 94.6%. Smaller and better. This happens more often than the literature admits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step two: QAT, not PTQ
&lt;/h2&gt;

&lt;p&gt;Post-training quantisation is fast but it lies to you on event data. The activation distributions are wildly bimodal because most pixels are zero. Standard min/max calibration collapses the useful range.&lt;/p&gt;

&lt;p&gt;We did quantisation-aware training with PyTorch's &lt;code&gt;torch.ao.quantization&lt;/code&gt; pipeline, qint8 weights and activations, per-channel for convs, per-tensor for the linear head. 15 epochs of QAT on top of the fp32 checkpoint. The observer matters: &lt;code&gt;MovingAverageMinMaxObserver&lt;/code&gt; with &lt;code&gt;averaging_constant=0.01&lt;/code&gt; worked, 0.1 did not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;M7 latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fp32 baseline (frames)&lt;/td&gt;
&lt;td&gt;4.2 MB&lt;/td&gt;
&lt;td&gt;94.1%&lt;/td&gt;
&lt;td&gt;68 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp32 + voxel input&lt;/td&gt;
&lt;td&gt;3.1 MB&lt;/td&gt;
&lt;td&gt;94.6%&lt;/td&gt;
&lt;td&gt;51 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PTQ int8&lt;/td&gt;
&lt;td&gt;820 KB&lt;/td&gt;
&lt;td&gt;89.2%&lt;/td&gt;
&lt;td&gt;19 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QAT int8&lt;/td&gt;
&lt;td&gt;780 KB&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;td&gt;15 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 4-point gap between PTQ and QAT is the whole story. If anyone tells you PTQ "just works" on sparse inputs, ask them to show you the per-class confusion matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step three: deployment plumbing
&lt;/h2&gt;

&lt;p&gt;Getting a quantised PyTorch model onto a microcontroller is the part nobody writes about. We export to ONNX, then run it through X-CUBE-AI from ST to get C code we can flash to the H7. The flow is finicky around quantisation parameters, so we wrote a small validator that runs the same 200 inputs through PyTorch, ONNX Runtime, and the on-device binary and compares logits. Three places to get a mismatch.&lt;/p&gt;
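
&lt;p&gt;The comparison step of such a validator can be sketched as below, assuming the logits from each backend have already been dumped to plain Python lists. The function names, the backend keys and the tolerance are hypothetical; the right tolerance depends on the quantisation scheme.&lt;/p&gt;

```python
from typing import Dict, List, Sequence, Tuple


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def compare_backends(
    logits: Dict[str, List[List[float]]],
    reference: str = "pytorch",
    tol: float = 0.05,
) -> List[Tuple[str, int, float]]:
    """Return (backend, input index, diff) for every output that strays from the reference."""
    mismatches = []
    ref_outputs = logits[reference]
    for backend, outputs in logits.items():
        if backend == reference:
            continue
        for idx, (ref_vec, out_vec) in enumerate(zip(ref_outputs, outputs)):
            diff = max_abs_diff(ref_vec, out_vec)
            if diff > tol:
                mismatches.append((backend, idx, diff))
    return mismatches


# Toy values: one input, two classes, three backends.
toy_logits = {
    "pytorch": [[1.0, 2.0]],
    "onnxruntime": [[1.0, 2.01]],
    "device": [[1.0, 2.5]],
}
report = compare_backends(toy_logits)
```

&lt;p&gt;Reporting the backend and input index together makes it quicker to tell whether a mismatch came from the ONNX export or from the on-device conversion.&lt;/p&gt;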

&lt;p&gt;For the cloud side, when we're benchmarking different model variants we need an LLM to summarise long evaluation logs and propose hyperparameter tweaks. Our team of four uses &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; in front of OpenAI and a local Ollama instance so we can switch between them when the OpenAI rate limit bites during a sweep. It's one less thing to babysit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This pipeline only really works because gesture recognition is forgiving. A 1% accuracy drop on hand waves doesn't kill anyone. For automotive event-camera workloads (pedestrian detection at 30m), the same int8 quantisation pushed our internal benchmark from 87.3% AP to 81.0% AP, which is unacceptable. You'd need mixed-precision (int8 backbone, fp16 detection head) or a larger budget.&lt;/p&gt;

&lt;p&gt;Also: the voxel-grid representation assumes you can buffer 50ms of events before inferring. If you need sub-10ms reaction time on a 1kHz event stream, you're back to recurrent spiking architectures or sparse convs, and that's a different blog post.&lt;/p&gt;

&lt;p&gt;The X-CUBE-AI toolchain is closed-source and the error messages will make you reconsider your career on a Tuesday afternoon. TFLite Micro is more portable but has worse op coverage for the things event-vision pipelines actually need (per-channel quantised group convs especially).&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Train with QAT from epoch zero, not as a fine-tune. The fp32-then-quantise habit comes from natural-image work where it's basically free. On event data, the activation statistics shift enough during QAT that you might as well let the model converge to them properly. We're trying this on the next project.&lt;/p&gt;

&lt;p&gt;And benchmark on the actual hardware early. The first three weeks I tuned everything against PyTorch latency on a workstation. Useless. The M7 has a different cache hierarchy and SIMD profile, and the model that "should" be fastest on paper was 2x slower in silicon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.prophesee.ai/" rel="noopener noreferrer"&gt;Prophesee Metavision SDK docs&lt;/a&gt; — event representations, EVK4 specs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/docs/stable/quantization.html" rel="noopener noreferrer"&gt;PyTorch quantisation tutorial&lt;/a&gt; — the QAT section, specifically&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tonic.readthedocs.io/" rel="noopener noreferrer"&gt;Tonic library&lt;/a&gt; — event-camera datasets and transforms in PyTorch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1904.08405" rel="noopener noreferrer"&gt;Event-based Vision: A Survey, Gallego et al.&lt;/a&gt; — still the best one-stop reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.st.com/en/embedded-software/x-cube-ai.html" rel="noopener noreferrer"&gt;ST X-CUBE-AI&lt;/a&gt; — if you're deploying to STM32&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Quantising event-camera networks to run under 1MB on a Cortex-M7</title>
      <dc:creator>Marco Rinaldi</dc:creator>
      <pubDate>Fri, 22 May 2026 07:22:38 +0000</pubDate>
      <link>https://dev.to/marcorinaldi_ai/quantising-event-camera-networks-to-run-under-1mb-on-a-cortex-m7-1n9c</link>
      <guid>https://dev.to/marcorinaldi_ai/quantising-event-camera-networks-to-run-under-1mb-on-a-cortex-m7-1n9c</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: I shrunk a gesture-recognition model for a Prophesee EVK4 event camera from 4.2MB down to 780KB so it could run on an STM32H7 at 15ms per inference. The trick was not the quantisation itself, it was rethinking what an "image" even means when your sensor produces events instead of frames.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, the thing is, most computer vision tutorials assume you start with a tensor of shape &lt;code&gt;[B, 3, H, W]&lt;/code&gt; and end with a classification head. Event cameras break that assumption on day one. A Prophesee sensor doesn't give you frames at 30fps. It gives you a sparse stream of events, each a tuple of &lt;code&gt;(x, y, t, polarity)&lt;/code&gt;, fired only when a pixel changes brightness. You can get millions of these per second during motion and almost nothing when the scene is static.&lt;/p&gt;

&lt;p&gt;That changes the entire optimisation game. Let me give you the full picture here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The starting point
&lt;/h2&gt;

&lt;p&gt;We had a gesture model trained on the DVS128 Gesture dataset (11 classes, hand movements recorded with a DVS128 sensor). The baseline was a small ResNet-ish backbone running on event frames, accumulated over 50ms windows. It hit 94.1% test accuracy at 4.2MB in fp32. Inference on a Cortex-M7 at 480MHz took 68ms per window, which is too slow when your events are arriving in real time.&lt;/p&gt;

&lt;p&gt;Target: sub-1MB, sub-20ms, accuracy drop under 2 percentage points.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step one: stop pretending events are frames
&lt;/h2&gt;

&lt;p&gt;The first 30% of the size came from not training on frames at all. Event-frame accumulation throws away the temporal resolution you paid for when you bought a €3,000 sensor. We switched the input representation to a voxel grid of shape &lt;code&gt;[2, 5, 128, 128]&lt;/code&gt; (2 polarities, 5 temporal bins per window). That alone let us drop the first conv block from 64 channels to 24, because the input was already temporally structured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;events_to_voxel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# events: [N, 4] tensor of (x, y, t, polarity); polarity indexes the channel, so it is assumed to be 0 or 1&lt;/span&gt;
    &lt;span class="n"&gt;voxel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t_min&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;t_norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_max&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t_min&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Split the window into num_bins equal slices; clamp so the event at t_max lands in the last bin&lt;/span&gt;
    &lt;span class="n"&gt;bin_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_norm&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;long&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bins&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;voxel&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;bin_idx&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;voxel&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accuracy actually went up, from 94.1% to 94.6%. Smaller and better. This happens more often than the literature admits.&lt;/p&gt;
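
&lt;p&gt;A side note on the &lt;code&gt;events_to_voxel&lt;/code&gt; loop above: iterating over events in Python gets slow once a window holds a lot of events. Here is a minimal vectorised sketch, assuming (as the loop does) that &lt;code&gt;events&lt;/code&gt; is an (N, 4) float tensor with columns x, y, t, p and that polarity is already encoded as 0 or 1. The name &lt;code&gt;events_to_voxel_fast&lt;/code&gt; is made up for this example.&lt;/p&gt;

```python
import torch


def events_to_voxel_fast(events, num_bins=5, height=128, width=128):
    """Vectorised voxel grid. Assumes events is (N, 4): x, y, t, p with p in {0, 1}."""
    voxel = torch.zeros(2, num_bins, height, width)
    if events.numel() == 0:
        return voxel  # nothing to accumulate

    x = events[:, 0].long()
    y = events[:, 1].long()
    t = events[:, 2]
    p = events[:, 3].long()

    # Same normalisation and binning as the loop version.
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)
    bin_idx = (t_norm * (num_bins - 1)).long()

    # accumulate=True sums repeated indices, so two events landing in the
    # same cell count twice, exactly like "+= 1.0" inside the loop.
    ones = torch.ones(events.shape[0])
    voxel.index_put_((p, bin_idx, y, x), ones, accumulate=True)
    return voxel
```

&lt;p&gt;The only behavioural difference is that an empty event tensor returns an all-zero grid instead of raising.&lt;/p&gt;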

&lt;h2&gt;
  
  
  Step two: QAT, not PTQ
&lt;/h2&gt;

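&lt;p&gt;Post-training quantisation is fast but it lies to you on event data. The activation distributions are wildly bimodal because most pixels are zero: a huge spike at zero and a thin tail of informative values. Standard min/max calibration stretches the range to cover the largest value in that tail, so the values that matter get squeezed into a handful of int8 levels.&lt;/p&gt;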
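
&lt;p&gt;To make the range problem concrete, here is a toy illustration in plain Python. The data is synthetic (mostly zeros, a thin band of informative values, one outlier) and is only a stand-in for a sparse activation map, not our real statistics. It counts how many of the 256 codes the informative values actually occupy under a min/max range versus a clipped one.&lt;/p&gt;

```python
import random


def distinct_levels(values, lo, hi, levels=256):
    """Count the integer codes used by `values` under affine quantisation over [lo, hi]."""
    scale = (hi - lo) / (levels - 1)
    codes = set()
    for v in values:
        q = round((v - lo) / scale)
        codes.add(max(0, min(levels - 1, q)))  # clamp to the valid code range
    return len(codes)


random.seed(0)
# Synthetic stand-in for a sparse activation map (assumption, not measured data).
informative = [random.uniform(0.0, 1.0) for _ in range(499)]
acts = [0.0] * 9500 + informative + [40.0]

# Min/max calibration: the single outlier at 40.0 sets the range.
minmax_levels = distinct_levels(informative, min(acts), max(acts))
# Clipped range: what a calibration that ignores the outlier would give.
clipped_levels = distinct_levels(informative, 0.0, 1.0)

print("codes used with min/max range:", minmax_levels)
print("codes used with clipped range:", clipped_levels)
```

&lt;p&gt;With the outlier in the range, everything between 0 and 1 has to share the bottom few codes. That is the collapse.&lt;/p&gt;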

&lt;p&gt;We did quantisation-aware training with PyTorch's &lt;code&gt;torch.ao.quantization&lt;/code&gt; pipeline: qint8 weights and activations, per-channel for convs, per-tensor for the linear head, and 15 epochs of QAT on top of the fp32 checkpoint. The observer matters: &lt;code&gt;MovingAverageMinMaxObserver&lt;/code&gt; with &lt;code&gt;averaging_constant=0.01&lt;/code&gt; worked, 0.1 did not.&lt;/p&gt;
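
&lt;p&gt;For reference, a configuration along those lines can be sketched like this. Treat it as one plausible way to express the setup in eager-mode &lt;code&gt;torch.ao.quantization&lt;/code&gt;, not a copy of our training script: the variable names are made up, and the symmetric weight schemes are an assumption.&lt;/p&gt;

```python
import torch
from torch.ao.quantization import (
    FakeQuantize,
    MovingAverageMinMaxObserver,
    MovingAveragePerChannelMinMaxObserver,
    QConfig,
)

# Activations: per-tensor, with a slow-moving range estimate.
act_fq = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    averaging_constant=0.01,
    quant_min=-128,
    quant_max=127,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_affine,
)

# Conv weights: one scale per output channel (symmetric is an assumption).
conv_weight_fq = FakeQuantize.with_args(
    observer=MovingAveragePerChannelMinMaxObserver,
    quant_min=-128,
    quant_max=127,
    dtype=torch.qint8,
    qscheme=torch.per_channel_symmetric,
    ch_axis=0,
)

# Linear head weights: a single scale for the whole tensor.
head_weight_fq = FakeQuantize.with_args(
    observer=MovingAverageMinMaxObserver,
    quant_min=-128,
    quant_max=127,
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
)

conv_qconfig = QConfig(activation=act_fq, weight=conv_weight_fq)
head_qconfig = QConfig(activation=act_fq, weight=head_weight_fq)
```

&lt;p&gt;In eager mode you would then set &lt;code&gt;model.qconfig = conv_qconfig&lt;/code&gt;, override it on the head module with &lt;code&gt;head_qconfig&lt;/code&gt;, and call &lt;code&gt;torch.ao.quantization.prepare_qat&lt;/code&gt; on the model in training mode before fine-tuning.&lt;/p&gt;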

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;M7 latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fp32 baseline (frames)&lt;/td&gt;
&lt;td&gt;4.2 MB&lt;/td&gt;
&lt;td&gt;94.1%&lt;/td&gt;
&lt;td&gt;68 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fp32 + voxel input&lt;/td&gt;
&lt;td&gt;3.1 MB&lt;/td&gt;
&lt;td&gt;94.6%&lt;/td&gt;
&lt;td&gt;51 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PTQ int8&lt;/td&gt;
&lt;td&gt;820 KB&lt;/td&gt;
&lt;td&gt;89.2%&lt;/td&gt;
&lt;td&gt;19 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QAT int8&lt;/td&gt;
&lt;td&gt;780 KB&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;td&gt;15 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 4.2-point gap between PTQ (89.2%) and QAT (93.4%) is the whole story. If anyone tells you PTQ "just works" on sparse inputs, ask them to show you the per-class confusion matrix.&lt;/p&gt;
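
&lt;p&gt;The reason to ask for the confusion matrix is that a headline accuracy number can hide one class falling apart. A small dependency-free sketch of that comparison follows; the labels and predictions are toy values invented for the example, not our results.&lt;/p&gt;

```python
def confusion_matrix(labels, preds, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        cm[y][p] += 1
    return cm


def per_class_recall(cm):
    """Diagonal over row sum; NaN for classes with no samples."""
    return [
        row[i] / sum(row) if sum(row) else float("nan")
        for i, row in enumerate(cm)
    ]


def recall_drop(labels, preds_a, preds_b, num_classes):
    """Per-class recall of model A minus model B (e.g. QAT minus PTQ)."""
    ra = per_class_recall(confusion_matrix(labels, preds_a, num_classes))
    rb = per_class_recall(confusion_matrix(labels, preds_b, num_classes))
    return [a - b for a, b in zip(ra, rb)]


# Toy data: the second model is fine on classes 0 and 1
# but sends half of class 2 to class 0.
labels = [0, 0, 1, 1, 2, 2, 2, 2]
qat_preds = [0, 0, 1, 1, 2, 2, 2, 2]
ptq_preds = [0, 0, 1, 1, 2, 2, 0, 0]

print(confusion_matrix(labels, ptq_preds, 3))
print(recall_drop(labels, qat_preds, ptq_preds, 3))
```

&lt;p&gt;In the toy data the overall accuracy drop is 25%, but all of it comes from a single class, which is the kind of thing the aggregate number never tells you.&lt;/p&gt;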

&lt;h2&gt;
  
  
  Step three: deployment plumbing
&lt;/h2&gt;

&lt;p&gt;Getting a quantised PyTorch model onto a microcontroller is the part nobody writes about. We export to ONNX, then run it through X-CUBE-AI from ST to get C code we can flash to the H7. The flow is finicky around quantisation parameters, so we wrote a small validator that runs the same 200 inputs through PyTorch, ONNX Runtime, and the on-device binary and compares logits. Three places to get a mismatch.&lt;/p&gt;
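
&lt;p&gt;The comparison step of such a validator is simple enough to sketch. This is not our actual tool: it assumes each backend has already been run and its logits collected into plain lists, and the backend names, function names and tolerance are placeholders.&lt;/p&gt;

```python
def argmax(row):
    """Index of the largest logit in one output vector."""
    return max(range(len(row)), key=row.__getitem__)


def compare_logits(reference, candidate, atol):
    """Compare two equally long lists of logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("backends were run on a different number of inputs")
    max_diff = 0.0
    same_top1 = 0
    for ref_row, cand_row in zip(reference, candidate):
        row_diff = max(abs(r - c) for r, c in zip(ref_row, cand_row))
        max_diff = max(max_diff, row_diff)
        same_top1 += argmax(ref_row) == argmax(cand_row)
    return {
        "max_abs_diff": max_diff,
        "top1_agreement": same_top1 / len(reference),
        "ok": atol >= max_diff,
    }


def validate(backends, reference_name, atol=0.1):
    """Compare every backend against the reference one."""
    ref = backends[reference_name]
    return {
        name: compare_logits(ref, logits, atol)
        for name, logits in backends.items()
        if name != reference_name
    }
```

&lt;p&gt;Usage would look like &lt;code&gt;validate({"pytorch": a, "onnxruntime": b, "device": c}, "pytorch")&lt;/code&gt;. Reporting top-1 agreement next to the raw difference helps, because small logit drift is expected after quantisation while a flipped prediction is not.&lt;/p&gt;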

&lt;p&gt;For the cloud side, when we're benchmarking different model variants we need an LLM to summarise long evaluation logs and propose hyperparameter tweaks. Our team of four uses &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; in front of OpenAI and a local Ollama instance so we can switch between them when the OpenAI rate limit bites during a sweep. It's one less thing to babysit.&lt;/p&gt;
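
&lt;p&gt;Because the gateway speaks the OpenAI chat-completions format, switching providers comes down to changing the &lt;code&gt;model&lt;/code&gt; string. A rough sketch of a client that tries one model and falls back to the next is below. The gateway URL, the provider-prefixed model names and the prompt are all assumptions for illustration, so check them against the Bifrost docs and your own config.&lt;/p&gt;

```python
import json
import urllib.request

# Assumptions, not taken from a real deployment: URL and model names.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"
MODELS = ["openai/gpt-4o-mini", "ollama/llama3"]


def build_payload(model, log_text):
    """OpenAI-style chat payload asking for a summary of an evaluation log."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarise this evaluation log and suggest hyperparameter changes."},
            {"role": "user", "content": log_text},
        ],
    }


def post_json(url, payload, timeout=60):
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarise(log_text, models=MODELS, send=post_json):
    """Try each model in order; return (model_used, summary_text)."""
    last_error = None
    for model in models:
        try:
            reply = send(GATEWAY_URL, build_payload(model, log_text))
            return model, reply["choices"][0]["message"]["content"]
        except Exception as exc:  # e.g. a rate-limit error surfaced as HTTP 429
            last_error = exc
    raise RuntimeError("all configured models failed") from last_error
```

&lt;p&gt;The &lt;code&gt;send&lt;/code&gt; parameter exists so the fallback logic can be tested without a running gateway.&lt;/p&gt;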

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;This pipeline only really works because gesture recognition is forgiving. A roughly one-point accuracy drop on hand waves doesn't kill anyone. For automotive event-camera workloads (pedestrian detection at 30 m), the same int8 quantisation dropped our internal benchmark from 87.3% AP to 81.0% AP, which is unacceptable. You'd need mixed precision (int8 backbone, fp16 detection head) or a larger budget.&lt;/p&gt;

&lt;p&gt;Also: the voxel-grid representation assumes you can buffer 50 ms of events before inferring. If you need sub-10 ms reaction time on a 1 kHz event stream, you're back to recurrent spiking architectures or sparse convs, and that's a different blog post.&lt;/p&gt;
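
&lt;p&gt;The buffering constraint is easiest to see in code. A minimal sketch of fixed-window batching, assuming time-sorted events as (x, y, t, p) tuples with timestamps in microseconds (the function name and the tuple layout are illustrative):&lt;/p&gt;

```python
def window_events(events, window_us=50_000):
    """Group time-sorted (x, y, t_us, p) events into consecutive fixed windows.

    Yields (window_start_us, events_in_window). Windows with no events are
    yielded as empty lists so downstream timing stays regular.
    """
    window = []
    start = None
    for ev in events:
        t = ev[2]
        if start is None:
            start = t  # the first event opens the first window
        # Close every window that ended before this event arrived.
        while t >= start + window_us:
            yield start, window
            window = []
            start += window_us
        window.append(ev)
    if window:
        yield start, window  # flush the final, possibly partial, window
```

&lt;p&gt;A window can only be handed to the model once it closes, so the oldest event in it has already waited up to the full window length before inference even starts. That wait is the floor on reaction time for this representation.&lt;/p&gt;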

&lt;p&gt;The X-CUBE-AI toolchain is closed-source and the error messages will make you reconsider your career on a Tuesday afternoon. TFLite Micro is more portable but has worse op coverage for the things event-vision pipelines actually need (per-channel quantised group convs especially).&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;Train with QAT from epoch zero, not as a fine-tune. The fp32-then-quantise habit comes from natural-image work where it's basically free. On event data, the activation statistics shift enough during QAT that you might as well let the model converge to them properly. We're trying this on the next project.&lt;/p&gt;

&lt;p&gt;And benchmark on the actual hardware early. For the first three weeks I tuned everything against PyTorch latency on a workstation. Useless. The M7 has a different cache hierarchy and SIMD profile, and the model that "should" be fastest on paper was 2x slower on silicon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.prophesee.ai/" rel="noopener noreferrer"&gt;Prophesee Metavision SDK docs&lt;/a&gt; — event representations, EVK4 specs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pytorch.org/docs/stable/quantization.html" rel="noopener noreferrer"&gt;PyTorch quantisation tutorial&lt;/a&gt; — the QAT section, specifically&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://tonic.readthedocs.io/" rel="noopener noreferrer"&gt;Tonic library&lt;/a&gt; — event-camera datasets and transforms in PyTorch&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1904.08405" rel="noopener noreferrer"&gt;Event-based Vision: A Survey, Gallego et al.&lt;/a&gt; — still the best one-stop reference&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.st.com/en/embedded-software/x-cube-ai.html" rel="noopener noreferrer"&gt;ST X-CUBE-AI&lt;/a&gt; — if you're deploying to STM32&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>computervision</category>
      <category>mlops</category>
      <category>pytorch</category>
    </item>
  </channel>
</rss>
