Falling back from edge detection to a cloud VLM when confidence drops

#computervision #mlops #llm

TL;DR: We deploy a 4MB SSD detector on an ARM edge box and cascade low-confidence frames to a cloud VLM. About 3% of frames make the trip. The interesting part is not the model but the routing layer that decides when to ask for help and how to fail gracefully when the network is hostile.

Last winter we shipped a vision system for an industrial inspection client. The constraint was familiar: a Cortex-A72 box bolted to a conveyor, no GPU, intermittent 4G uplink. The detector had to run at 25 FPS and flag defects. Easy enough. The hard part came later.

On a clean validation set the small detector hit 91% mAP at IoU 0.5. In production it hovered around 78%. The drop came from frames the model had never imagined: unusual lighting on Mondays after the floor was washed, parts placed at angles outside our training distribution, partial occlusions from operator gloves. The honest answer was that we didn't have a large enough labeled set to cover the tail.

Rather than ship a bigger model the box couldn't run, we built a cascade. The edge model handles every frame. When its top-1 detection confidence drops below 0.55 or its second-best is within 0.15 of the first, we send the cropped patch to a cloud VLM for a second opinion. About 3% of frames trigger this path.

The cascade in practice

Here is the gist of the gating logic, simplified:

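# Thresholds from the text. Detections are assumed sorted by score, highest first.
MIN_CONFIDENCE = 0.55
MIN_MARGIN = 0.15

def should_escalate(detections):
    """Return True when the edge result is too uncertain to trust."""
    if not detections:
        # No detection means no bounding box, so there is no patch to send.
        return False
    top = detections[0].score
    if top < MIN_CONFIDENCE:
        return True
    if len(detections) > 1:
        runner_up = detections[1].score
        # A close runner-up means the model is torn between two answers.
        if (top - runner_up) < MIN_MARGIN:
            return True
    return False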

A patch is 224x224 pixels around the predicted bounding box, encoded as JPEG at quality 80. Average payload is 11 KB. Round-trip latency to the cloud VLM is 800 to 1400 ms over 4G, which is acceptable because the conveyor has a 2-second buffer.
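A minimal sketch of how the crop window for that patch could be computed, not the production code. It assumes a fixed 224x224 window centered on the box and shifted (not shrunk) at image edges; the text does not say whether the box is windowed or resized, and `crop_window` is a hypothetical helper name.

```python
def crop_window(box, image_w, image_h, size=224):
    """Return (left, top, right, bottom) of a size x size window centered on box.

    box is (x_min, y_min, x_max, y_max) in pixels. The window is shifted,
    not shrunk, when it would cross an image edge. Assumption: the image
    is at least size x size.
    """
    if image_w < size or image_h < size:
        raise ValueError("image smaller than patch size")
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    left = int(round(cx - size / 2))
    top = int(round(cy - size / 2))
    # Clamp so the whole window stays inside the image.
    left = max(0, min(left, image_w - size))
    top = max(0, min(top, image_h - size))
    return left, top, left + size, top + size
```

The resulting tuple can be handed to whatever image library does the actual crop and JPEG encode.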

Why a routing layer mattered

The first version called one provider directly. Then we had a 17-minute outage and the line operator had to override frames manually. Not great. The second version routes through a gateway that fans out across two VLM providers with automatic failover. We use Bifrost (https://github.com/maximhq/bifrost) on a small VPS for this, though LiteLLM or a custom proxy would also work. One provider is primary and the other secondary: when the primary returns a 5xx or takes longer than 2 seconds, the secondary takes over. Semantic caching helps with near-identical frames during steady-state operation.

The choice of gateway is not the headline. What matters is having one. Calling provider SDKs directly from edge devices is a debugging nightmare once you have failover logic and retries layered on top.
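To make the failover behaviour concrete, here is a rough sketch of the ordering logic a gateway applies. This is not Bifrost's or LiteLLM's API. `ask_with_failover`, `ProviderError`, and the provider callables are hypothetical names, and each callable is assumed to enforce its own timeout and raise `TimeoutError` when it is exceeded.

```python
class ProviderError(Exception):
    """Raised by a provider callable on a 5xx-style failure (hypothetical)."""


def ask_with_failover(providers, patch_bytes, timeout_s=2.0):
    """Try each provider in order; return (name, answer) from the first success.

    providers is a list of (name, callable) pairs. Each callable takes
    (patch_bytes, timeout_s) and either returns an answer or raises
    ProviderError / TimeoutError.
    """
    failures = []
    for name, call in providers:
        try:
            return name, call(patch_bytes, timeout_s)
        except (ProviderError, TimeoutError) as exc:
            # Remember why this provider was skipped, then try the next one.
            failures.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {failures}")
```

Keeping this logic in one place off the device is the point: the edge box makes a single call and never needs to know which provider answered.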

What we measured

After 6 weeks of production data on three client lines:

Metric	Edge only	Edge + VLM cascade
mAP (production)	78.2%	86.9%
Avg latency / frame	38 ms	41 ms
p99 latency / frame	52 ms	1290 ms
Cloud cost / day	$0	$4.20
Frames escalated	0%	3.1%

The p99 jump is real and unavoidable. We absorb it in the conveyor buffer. Without that slack the cascade would not be an option.

A subtle bug we hit

Early on, the cascade rate drifted upward over a week, climbing from 3% to 7%. Cost was rising too. After some digging we found the detector was reacting to gradual contamination of the camera lens. Dust accumulating on the optics was lowering confidence across every frame. The cascade masked the real problem because the cloud VLM was strong enough to handle the dirty frames.

Now we alert when the cascade rate moves beyond a 5% rolling band. The cloud should be a backup, not a crutch.
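A sketch of what such an alert could look like, under one reading of the rule: fire when the escalation rate over the last N frames exceeds 5%. The window size, the class name `CascadeRateMonitor`, and the choice to stay silent until the window fills are all assumptions for illustration.

```python
from collections import deque


class CascadeRateMonitor:
    """Tracks the escalation rate over the last `window` frames."""

    def __init__(self, window=10_000, max_rate=0.05):
        self.events = deque(maxlen=window)
        self.escalated = 0
        self.max_rate = max_rate

    def record(self, was_escalated):
        """Record one frame; return True if the rolling rate exceeds max_rate."""
        if len(self.events) == self.events.maxlen:
            # The oldest entry is about to be evicted by append().
            self.escalated -= self.events[0]
        self.events.append(1 if was_escalated else 0)
        self.escalated += self.events[-1]
        # Stay quiet until the window is full to avoid noisy startup alerts.
        if len(self.events) < self.events.maxlen:
            return False
        return self.escalated / len(self.events) > self.max_rate
```

A slow drift like the dusty lens shows up here as a rate that creeps past the limit and stays there, which is a different signature from a short burst of odd parts.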

Trade-offs and limitations

Latency variance. The p99 number above is honest. For applications without buffer slack (closed-loop robotics, automotive vision), this approach falls apart. It works for our throughput-tolerant inspection setup. It would not work for collision avoidance.

Network dependence. When the 4G link drops the system degrades to edge-only. We log every frame that would have escalated and run a batch job once connectivity returns. About 8% of escalations end up batched on a bad day. The client accepted this. Another client might not.
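The log-and-replay idea can be sketched as a bounded queue that is drained once the link is back. This is an in-memory illustration only; a real deployment would presumably persist to disk. `DeferredEscalations`, `max_pending`, and the `send` callback are hypothetical.

```python
from collections import deque


class DeferredEscalations:
    """Holds patches that could not be sent while the uplink was down."""

    def __init__(self, max_pending=5000):
        # Bounded so a long outage cannot exhaust memory; oldest entries drop first.
        self.pending = deque(maxlen=max_pending)

    def defer(self, frame_id, patch_bytes):
        self.pending.append((frame_id, patch_bytes))

    def drain(self, send):
        """Send queued patches in order; stop at the first failure.

        send(frame_id, patch_bytes) returns True on success. Returns the
        number of patches delivered.
        """
        delivered = 0
        while self.pending:
            frame_id, patch = self.pending[0]
            if not send(frame_id, patch):
                break  # link is down again; keep the rest for next time
            self.pending.popleft()
            delivered += 1
        return delivered
```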

Cost predictability. Per-frame cost is small but it is real. A noisier production environment can double the cascade rate overnight. We have a hard daily ceiling that disables escalation if exceeded, and the line falls back to edge-only with a flag for operator review.
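A hard ceiling of this kind could be as simple as the sketch below. It assumes a flat estimated cost per call and a reset at the local date change. `DailyBudget` and its parameters are illustrative names, not the production implementation.

```python
import datetime


class DailyBudget:
    """Disables escalation once the day's estimated spend hits a ceiling."""

    def __init__(self, ceiling_usd, cost_per_call_usd):
        self.ceiling_usd = ceiling_usd
        self.cost_per_call_usd = cost_per_call_usd
        self.day = None
        self.spent_usd = 0.0

    def try_spend(self, today=None):
        """Return True and record the cost if one more call fits today's budget."""
        today = today or datetime.date.today()
        if today != self.day:
            # First call of a new day resets the counter.
            self.day = today
            self.spent_usd = 0.0
        if self.spent_usd + self.cost_per_call_usd > self.ceiling_usd:
            return False
        self.spent_usd += self.cost_per_call_usd
        return True
```

When `try_spend()` returns False, the caller skips the cloud call and marks the frame for operator review instead.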

Privacy. The cropped patch is sent to a third-party provider. For industrial inspection of inert parts this was fine. For anything involving people we would need on-prem VLMs, which changes the economics considerably.

Calibration drift. The confidence threshold of 0.55 was tuned on validation data from one factory. The other two factories needed different thresholds (0.48 and 0.61). We now run a small calibration script after first deployment.
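The calibration script itself is not shown here, so the following is only one plausible shape for it: given a small labeled sample from the new site, pick the lowest confidence cutoff at which the detections kept by the edge model still meet a target precision. `pick_threshold` and `target_precision` are hypothetical, and the selection rule is an assumption.

```python
def pick_threshold(samples, target_precision=0.95):
    """Return the lowest score cutoff whose kept detections meet target_precision.

    samples is a list of (score, is_correct) pairs from a labeled set
    collected at the site. Returns None if no cutoff reaches the target.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for kept, (score, is_correct) in enumerate(ordered, start=1):
        correct += 1 if is_correct else 0
        # Only evaluate a cutoff at the last sample of a run of equal scores.
        is_last_of_run = kept == len(ordered) or ordered[kept][0] != score
        if is_last_of_run and correct / kept >= target_precision:
            best = score
    return best
```

Anything scoring below the returned cutoff would be escalated. A check like this only sees errors that come with low scores, so confidently wrong detections still need a separate audit sample.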

Closing thought

Cascade architectures are old. Hierarchical classifiers and coarse-to-fine pipelines have been around since before I started this job. What changed is that the "ask for help" tier is now a generally capable model rather than a slightly larger specialist. That shifts the economics for the long tail of unusual inputs that no training set will ever cover.

Top comments (1)

Harjot Singh • May 31

This is a textbook cascade pattern and it's the right architecture - run the cheap/fast edge detector on everything, and only escalate to the expensive cloud VLM on the cases where the edge model is unsure. You get the cost and latency profile of the small model on the easy majority, and the accuracy of the big model exactly where it's needed. The whole design hinges on one thing being honest: the confidence signal. If the edge model is confidently wrong, the fallback never fires and you ship the error; if it's underconfident, you escalate everything and lose the savings. So calibration of that threshold is the real engineering, not the models themselves.

This is the same routing instinct I built Moonshift on - the thing I work on, a multi-agent pipeline that takes a prompt to a deployed SaaS, where each job goes to the cheapest model that can actually handle it and only escalates when the cheap one isn't confident/correct (a full build lands ~$3 flat, first run free no card). Cheap-by-default, escalate-on-uncertainty is the same pattern at a different layer. Genuinely nice writeup. How are you setting the confidence threshold - calibrated on a labeled set, or hand-tuned? And do you catch the confidently-wrong case where edge is sure but actually off, or does that slip through? That false-confidence gap is the one that quietly hurts cascades.