Marco Rinaldi

Falling back from edge detection to a cloud VLM when confidence drops

TL;DR: We deploy a 4 MB SSD detector on an ARM edge box and cascade low-confidence frames to a cloud VLM. About 3% of frames make the trip. The interesting part is not the model but the routing layer that decides when to ask for help and how to fail gracefully when the network is hostile.

Last winter we shipped a vision system for an industrial inspection client. The constraint was familiar: a Cortex-A72 box bolted to a conveyor, no GPU, intermittent 4G uplink. The detector had to run at 25 FPS and flag defects. Easy enough. The hard part came later.

On a clean validation set the small detector hit 91% mAP at IoU 0.5. In production it hovered around 78%. The drop came from frames unlike anything in the training set: unusual lighting on Mondays after the floor was washed, parts placed at angles outside our training distribution, partial occlusions from operator gloves. The honest answer was that we didn't have a large enough labeled set to cover the tail.

Rather than ship a bigger model the box couldn't run, we built a cascade. The edge model handles every frame. When its top-1 detection confidence drops below 0.55, or the second-best detection scores within 0.15 of the first, we send the cropped patch to a cloud VLM for a second opinion. About 3% of frames trigger this path.

The cascade in practice

Here is the gist of the gating logic, simplified:

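def should_escalate(detections):
    """Return True when a frame should go to the cloud VLM for a second opinion.

    Assumes detections are sorted by descending score.
    """
    if not detections:
        # Nothing detected, so there is no patch to crop and send.
        return False
    top = detections[0].score
    # Rule 1: the best detection is itself low-confidence.
    if top < 0.55:
        return True
    # Rule 2: the runner-up is too close to the best to trust the ranking.
    if len(detections) > 1:
        runner_up = detections[1].score
        if (top - runner_up) < 0.15:
            return True
    return False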

A patch is a 224x224 crop around the predicted bounding box, JPEG-encoded at quality 80. Average payload is 11 KB. Round-trip latency to the cloud VLM is 800-1400 ms over 4G, which is acceptable because the conveyor has a 2-second buffer.
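A quick back-of-envelope check on those numbers, using only figures quoted in this post (25 FPS, roughly 3% escalation, 11 KB per patch, a 2-second buffer). It assumes escalations are spread evenly over time, which real traffic will not be, so treat the result as an average rather than a peak. The function names are made up for illustration.

```python
def uplink_load_kb_per_s(fps: float, escalation_rate: float, payload_kb: float) -> float:
    """Average uplink load from escalated patches, in KB per second."""
    return fps * escalation_rate * payload_kb


def buffer_headroom_ms(buffer_ms: float, worst_round_trip_ms: float) -> float:
    """Slack left in the conveyor buffer after the slowest expected round trip."""
    return buffer_ms - worst_round_trip_ms


# Figures taken from the post; the even-spread assumption is ours.
load = uplink_load_kb_per_s(fps=25, escalation_rate=0.03, payload_kb=11)
headroom = buffer_headroom_ms(buffer_ms=2000, worst_round_trip_ms=1400)

print(f"average uplink load: {load:.2f} KB/s")
print(f"buffer headroom: {headroom:.0f} ms")
```

That works out to roughly 8 KB/s on average, with about 600 ms of slack left after a 1400 ms round trip.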

Why a routing layer mattered

The first version called one provider directly. Then we had a 17-minute outage and the line operator had to override frames manually. Not great. The second version routes through a gateway that fans out across two VLM providers with automatic failover. We use Bifrost (https://github.com/maximhq/bifrost) on a small VPS for this, though LiteLLM or a custom proxy would also work. One provider is primary and the other secondary: when the primary returns a 5xx or takes longer than 2 seconds, the secondary takes over. Semantic caching helps for near-identical frames during steady-state operation.
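To make the failover rule concrete, here is a minimal sketch of the logic a gateway applies on our behalf. It is not Bifrost's or LiteLLM's API. `ProviderError`, the `Provider` callable signature, and `classify_with_failover` are all hypothetical names invented for this example, and each provider is assumed to enforce the timeout it is handed.

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error standing in for a 5xx response from a provider."""


# Hypothetical signature: (jpeg_bytes, timeout_seconds) -> label string.
Provider = Callable[[bytes, float], str]


def classify_with_failover(
    patch: bytes,
    providers: Sequence[Provider],
    timeout_s: float = 2.0,
) -> str:
    """Try providers in priority order and return the first successful answer."""
    last_error: Optional[Exception] = None
    for call in providers:
        try:
            return call(patch, timeout_s)
        except (ProviderError, TimeoutError) as exc:
            # Remember why this provider failed, then fall through to the next.
            last_error = exc
    raise ProviderError("all providers failed") from last_error
```

Keeping this loop in one place, rather than in every edge device, is the whole argument for a gateway.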

The choice of gateway is not the headline. What matters is having one. Calling provider SDKs directly from edge devices is a debugging nightmare once you have failover logic and retries layered on top.

What we measured

After 6 weeks of production data on three client lines:

| Metric | Edge only | Edge + VLM cascade |
|---|---|---|
| mAP (production) | 78.2% | 86.9% |
| Avg latency / frame | 38 ms | 41 ms |
| p99 latency / frame | 52 ms | 1290 ms |
| Cloud cost / day | $0 | $4.20 |
| Frames escalated | 0% | 3.1% |

The p99 jump is real and unavoidable. We absorb it in the conveyor buffer. Without that slack the cascade would not be an option.

A subtle bug we hit

Early on, the cascade rate drifted upward over a week, climbing from 3% to 7%. Cost was rising too. After some digging we found the detector was reacting to gradual contamination of the camera lens. Dust accumulating on the optics was lowering confidence everywhere. The cascade masked the real problem because the cloud VLM was strong enough to handle the dirty frames.

Now we alert when the cascade rate moves beyond a 5% rolling band. The cloud should be a backup, not a crutch.
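A minimal sketch of that kind of alert, assuming "beyond the band" means the escalation rate over the last N frames exceeds 5%. The class name, the window size, and that interpretation are assumptions made for this example, not our production monitor.

```python
from collections import deque


class EscalationRateMonitor:
    """Tracks the escalation rate over a sliding window of recent frames."""

    def __init__(self, window: int = 50_000, max_rate: float = 0.05):
        # window and max_rate are illustrative defaults, not tuned values.
        self._frames = deque(maxlen=window)
        self._escalated = 0
        self.max_rate = max_rate

    def record(self, escalated: bool) -> bool:
        """Record one frame; return True when the rolling rate exceeds max_rate."""
        if len(self._frames) == self._frames.maxlen:
            # The oldest entry is about to be evicted, so remove it from the count.
            self._escalated -= self._frames[0]
        self._frames.append(int(escalated))
        self._escalated += int(escalated)
        if len(self._frames) < self._frames.maxlen:
            return False  # not enough history yet to judge the rate
        return self._escalated / len(self._frames) > self.max_rate
```

Calling `record()` once per frame keeps the check constant-time, which matters at 25 FPS on a small box.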

Trade-offs and limitations

Latency variance. The p99 number above is honest. For applications without buffer slack (closed-loop robotics, automotive vision), this approach falls apart. It works for our throughput-tolerant inspection setup. It would not work for collision avoidance.

Network dependence. When the 4G link drops the system degrades to edge-only. We log every frame that would have escalated and run a batch job once connectivity returns. About 8% of escalations end up batched on a bad day. The client accepted this. Another client might not.
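The deferred path can be sketched as a bounded FIFO backlog that is drained once the link is back. `EscalationBacklog`, the `send` callback, and the size cap are hypothetical. The sketch also assumes the sender raises an `OSError` (such as `ConnectionError`) when the uplink is down.

```python
from collections import deque
from typing import Callable


class EscalationBacklog:
    """Holds patches that could not be escalated while the uplink was down."""

    def __init__(self, max_items: int = 5000):
        # Bounded so a long outage cannot exhaust memory; oldest entries drop first.
        self._pending = deque(maxlen=max_items)

    def __len__(self) -> int:
        return len(self._pending)

    def defer(self, frame_id: int, patch: bytes) -> None:
        """Queue a patch for later escalation."""
        self._pending.append((frame_id, patch))

    def drain(self, send: Callable[[int, bytes], None]) -> int:
        """Send queued patches in FIFO order; stop at the first network failure."""
        sent = 0
        while self._pending:
            frame_id, patch = self._pending[0]
            try:
                send(frame_id, patch)
            except OSError:
                break  # link dropped again; keep the remainder for the next attempt
            self._pending.popleft()
            sent += 1
        return sent
```

Peeking at the head before popping means a patch is only removed once it has actually been sent.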

Cost predictability. Per-frame cost is small but it is real. A noisier production environment can double the cascade rate overnight. We have a hard daily ceiling that disables escalation if exceeded, and the line falls back to edge-only with a flag for operator review.
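A hard ceiling like this can be sketched as a per-day counter. The class name and the flat per-call cost are assumptions for illustration. Real provider pricing is usually per token or per image, so the cost model would need adapting.

```python
import datetime as dt
from typing import Optional


class DailyEscalationBudget:
    """Counts escalations per calendar day against a hard cost ceiling."""

    def __init__(self, ceiling_usd: float, cost_per_call_usd: float):
        self.ceiling_usd = ceiling_usd
        self.cost_per_call_usd = cost_per_call_usd  # assumed flat cost per escalation
        self._day: Optional[dt.date] = None
        self._calls = 0

    def try_spend(self, today: dt.date) -> bool:
        """Return True and count the call if one more escalation fits today's budget."""
        if today != self._day:
            # First call of a new day: reset the counter.
            self._day = today
            self._calls = 0
        if (self._calls + 1) * self.cost_per_call_usd > self.ceiling_usd:
            return False  # ceiling reached: stay edge-only and flag for operator review
        self._calls += 1
        return True
```

The gate would sit right after `should_escalate`: escalate only when both checks pass.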

Privacy. The cropped patch is sent to a third-party provider. For industrial inspection of inert parts this was fine. For anything involving people we would need on-prem VLMs, which changes the economics considerably.

Calibration drift. The confidence threshold of 0.55 was tuned on validation data from one factory. The other two factories needed different thresholds (0.48 and 0.61). We now run a small calibration script after first deployment.
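One way a per-site calibration step could work, sketched under assumptions: given a small labeled sample of (top-1 score, was the detection correct) pairs from the new site, pick the lowest threshold at which detections scoring at or above it still meet a target precision. This is an illustration of the idea, not our actual script. `pick_threshold` and the target value are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]],
    target_precision: float = 0.95,
) -> Optional[float]:
    """Lowest score threshold whose at-or-above detections meet target_precision.

    Returns None when no threshold reaches the target.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for i, (score, ok) in enumerate(ordered, start=1):
        correct += int(ok)
        # Evaluate only after the last sample of a tied score group, so the
        # precision covers every detection with score >= threshold.
        if i < len(ordered) and ordered[i][0] == score:
            continue
        if correct / i >= target_precision:
            best = score
    return best
```

Frames whose top score falls below the chosen value would then be the ones to escalate at that site.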

Closing thought

Cascade architectures are old. Hierarchical classifiers and coarse-to-fine pipelines have been around since before I started this job. What changed is that the "ask for help" tier is now a generally capable model rather than a slightly larger specialist. That shifts the economics for the long tail of unusual inputs that no training set will ever cover.

Further reading

  • NVIDIA TensorRT INT8 calibration documentation: https://docs.nvidia.com/deeplearning/tensorrt/
  • Silla & Freitas 2011, "A survey of hierarchical classification across different application domains"
  • Guo et al. 2017, "On Calibration of Modern Neural Networks"
  • Bifrost gateway documentation: https://docs.getbifrost.ai/
  • Cai & Vasconcelos 2018, "Cascade R-CNN" for a different but related cascade idea
