TL;DR — ONNX Runtime's QNN execution provider will quietly route unsupported ops to the CPU instead of the Hexagon NPU. Your accuracy is fine. Your eval set is fine. Median latency on a clean device looks fine. Then production traffic hits a different input distribution, more ops fall back, and p95 latency triples. The fix isn't more eval data — it's three CI assertions: run on real hardware, gate on median and coefficient of variation, and parse the ORT profiling output to assert what fraction of FLOPs actually ran on the NPU.
The pain
A team we work with ships a person-detection model on a robotics platform with a Snapdragon 8 Gen 3 SoC. The model is YOLOv8n, quantized to INT8 with AIMET, compiled through Qualcomm AI Hub, exported as an ONNX graph that ORT loads with the QNN execution provider targeting the Hexagon NPU in HTP performance mode.
Pre-merge they run a 5,000-image eval set on a development board. Latency: median 8.2 ms, p95 9.1 ms. Accuracy regresses by 0.3% from the FP32 baseline — well within tolerance. The PR merges. The model ships to ~400 devices in the field.
Three days later the support queue lights up. Field telemetry shows median inference at 24 ms, p95 at 41 ms, occasional spikes to 62 ms. The model isn't crashing. Accuracy is fine. It's just slow, slow enough to cause navigation hesitation, slow enough to miss frames at 30 fps. The team spends 9 engineering days trying to reproduce the issue on a single board they can't make misbehave, before someone finally captures an ORT profiling trace and discovers that ~18% of the graph's FLOPs are running on the Kryo CPU, not the Hexagon NPU. A new LayerNormalization introduced in the latest YOLOv8 release isn't supported by the QNN backend on their firmware version. ORT logs a single line about it at startup and silently routes those nodes to CPU.
The eval set never caught it because the QNN backend cache was warm on the dev board, the input distribution didn't stress the unsupported path the same way, and nobody was watching the per-op routing manifest. The model "worked." It just worked on the wrong silicon.
Why standard solutions don't work
The instinct, after this kind of incident, is to widen the eval set. More images. More edge cases. More devices. That helps with accuracy regressions. It does almost nothing for NPU fallback regressions, because the failure mode isn't about what the model predicts, it's about where the model executes. Three specific things break down:
1. Cloud eval can't observe execution provider routing
If you eval in the cloud — on a workstation, on an x86 CI runner, in a Docker container with ORT compiled for CUDA or CPU — the QNN execution provider doesn't load. ORT silently runs the entire graph on whatever backend is available, and reports a clean 100% utilization on it. You learn nothing about Hexagon coverage. Even running on a Snapdragon emulator misses this: emulators implement the NPU instruction set in software, so every op is "supported," but for the wrong reason.
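One cheap pre-flight guard is worth adding anyway: before measuring anything, assert that the runner's ORT build can load the QNN EP at all. A minimal sketch using the standard onnxruntime API:

import onnxruntime as ort

# Fail fast if this ORT build can't load the QNN EP at all — otherwise ORT
# silently runs the whole graph on whatever backend exists and reports success.
available = ort.get_available_providers()
assert "QNNExecutionProvider" in available, (
    f"QNN EP not in this ORT build (got {available}); "
    "this runner cannot tell you anything about Hexagon coverage."
)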
2. Median-of-N hides bimodal latency distributions
Most CI gating frameworks run inference N times (commonly 5 or 10), take the median, and compare against a threshold. The intuition is right — single runs are noisy, median smooths it out. But fallback creates a bimodal distribution: when the QNN scheduler hits a warm context binary cache, latency is fast; when it doesn't, latency is 3–8× slower because the fallback path involves a CPU↔NPU memory round-trip. With 10 samples, the median often lands on the fast cluster, the p95 lands on the slow cluster, and the gate (which only checks median) passes. You ship the regression.
The fix isn't to switch from median to mean — that just shifts which cluster wins. The fix is to gate on the coefficient of variation (CV) alongside the central tendency. A healthy on-NPU inference has CV around 2–5%. A graph with intermittent fallback has CV >15%. Set the gate to "median ≤ X and CV ≤ Y" and bimodal distributions fail.
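To make that concrete, here's a quick sketch with illustrative numbers: 6 warm-cache runs near 8 ms and 4 fallback runs near 30 ms.

import statistics

# 6 on-NPU runs (~8 ms) and 4 intermittent-fallback runs (~30 ms)
samples = [8.1, 8.2, 8.0, 8.3, 8.1, 8.2, 30.5, 29.8, 31.2, 30.1]
median = statistics.median(samples)  # 8.25 ms: sails under a 10 ms gate
cv = 100 * statistics.stdev(samples) / statistics.mean(samples)  # ~67%
print(f"median={median:.2f} ms, CV={cv:.1f}%")  # median passes, CV fails loudly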
3. ORT QNN EP doesn't surface per-op routing by default
This is the silent part. With default ORT logging (severity = WARNING), the only thing you see at session creation is a one-liner like:
[W:onnxruntime:Default, qnn_execution_provider.cc] 23 nodes assigned to QNNExecutionProvider, 7 nodes assigned to CPUExecutionProvider
Nobody reads ORT startup logs in CI. There's no exception, no non-zero exit code, no metric exposed. The model loads, runs, and returns correct outputs. The 7 CPU nodes are buried in the verbose log stream alongside thread-pool warnings and tensor-shape inferences. You have to opt in to per-node routing diagnostics — set session_options.log_severity_level = 0 (VERBOSE) and provider_options["profiling_level"] = "detailed", then post-process the resulting profiling JSON to count what landed where.
Most teams never do this, because nothing in the standard ORT or QAI Hub developer flow tells them they should.
The three-part fix
Here's what actually catches NPU fallback regressions before they merge. None of these requires custom infrastructure beyond what ORT and Qualcomm AI Hub already give you.
Part 1: Run inference on the real device, not an emulator or a cloud proxy
This sounds tautological, but it's not common practice. Most teams use one of three substitutes for real hardware in CI:
- Emulator / simulator (e.g., Hexagon SDK simulator) — implements the ISA but not the runtime scheduler, cache hierarchy, or thermal envelope. Useful for kernel correctness; useless for latency.
- Cloud x86 with ORT-CPU — proves the graph runs, says nothing about NPU coverage.
- Single dev board in someone's drawer — works until the dev board's firmware version diverges from the production fleet.
Qualcomm AI Hub gives you a hosted device farm: real chips, multiple firmware versions, schedulable from a script. Use it. Your CI should compile through AI Hub, submit a profiling job against the target device class, and wait for the metrics. This adds 90–180 seconds per run, well within the time budget of a normal PR check.
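A minimal sketch of that CI step using the qai_hub Python client — the device name and model artifact are illustrative, so substitute the exact names your account exposes:

import qai_hub as hub

# Illustrative device name; list the real options with `qai-hub list-devices`.
device = hub.Device("Snapdragon 8 Gen 3 QRD")

# Compile on AI Hub, then profile the compiled artifact on a real board.
compile_job = hub.submit_compile_job(model="yolov8n_int8.onnx", device=device)
profile_job = hub.submit_profile_job(
    model=compile_job.get_target_model(),  # blocks until compilation finishes
    device=device,
)
results = profile_job.download_profile()  # per-run latency stats to gate on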
Part 2: Gate on median and coefficient of variation
A gate that fits in a CI step:
import statistics

def gate(latencies_ms, median_threshold, cv_threshold):
    median = statistics.median(latencies_ms)
    mean = statistics.mean(latencies_ms)
    stdev = statistics.stdev(latencies_ms)
    cv = (stdev / mean) * 100  # coefficient of variation, in percent
    fails = []
    if median > median_threshold:
        fails.append(f"median {median:.2f}ms > {median_threshold}ms")
    if cv > cv_threshold:
        fails.append(
            f"CV {cv:.1f}% > {cv_threshold}% "
            f"(bimodal distribution — likely intermittent fallback)"
        )
    if fails:
        raise SystemExit("Latency gate failed: " + "; ".join(fails))
We use median ≤ 10 ms, CV ≤ 8% as defaults for vision models on 8 Gen 3 HTP. Yours will differ — calibrate against a known-good baseline. The key is that both thresholds must hold. A model that satisfies the median but blows past the CV is the exact signature of partial fallback.
Sample size matters: you need ≥ 20 inference runs to compute a stable CV. Don't try this with N=5. We use N=30 with a 3-iteration warmup discarded to get past first-load cache misses.
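A sketch of the harness that produces the latency series, assuming the `session` built in Part 3 below and a representative `sample_input`; the input name "images" matches a stock YOLOv8 ONNX export, so substitute yours:

import time

WARMUP, MEASURED = 3, 30

latencies_ms = []
for i in range(WARMUP + MEASURED):
    t0 = time.perf_counter()
    session.run(None, {"images": sample_input})
    if i >= WARMUP:  # discard warmup runs: first-load cache misses
        latencies_ms.append((time.perf_counter() - t0) * 1000)

gate(latencies_ms, median_threshold=10.0, cv_threshold=8.0)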
Part 3: Assert per-op routing from the ORT profile
This is the diagnostic that would have caught the YOLOv8n incident in CI. Enable ORT profiling:
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.enable_profiling = True
sess_options.log_severity_level = 0  # VERBOSE: per-node assignment shows up

provider_options = {
    "backend_path": "libQnnHtp.so",
    "profiling_level": "detailed",
    "htp_performance_mode": "burst",
}
session = onnxruntime.InferenceSession(
    model_path,
    sess_options,
    providers=[("QNNExecutionProvider", provider_options), "CPUExecutionProvider"],
)
After the run, call session.end_profiling(); it flushes the JSON profile to disk next to your script and returns the file path. Each node entry tags which execution provider handled it. A post-run assertion:
import json
from collections import Counter

profile_path = session.end_profiling()  # flush profile, get its file path
with open(profile_path) as f:
    profile = json.load(f)

op_routing = Counter()     # node count per execution provider
flops_routing = Counter()  # FLOPs per execution provider
cpu_ops = Counter()        # op types that did not land on the QNN EP

for ev in profile:
    if ev.get("cat") == "Node" and ev.get("args", {}).get("op_name"):
        ep = ev["args"].get("provider", "Unknown")
        op_routing[ep] += 1
        flops_routing[ep] += ev["args"].get("flops", 0)
        if ep != "QNNExecutionProvider":
            cpu_ops[ev["args"]["op_name"]] += 1

total_flops = sum(flops_routing.values())
npu_pct = 100 * flops_routing["QNNExecutionProvider"] / total_flops
assert npu_pct >= 95, (
    f"Only {npu_pct:.1f}% of FLOPs on NPU; "
    f"CPU fell back on: {sorted(cpu_ops)}"
)
Two numbers matter: percentage of nodes on the NPU, and percentage of FLOPs on the NPU. A model can have 95% of nodes on the NPU but 60% of FLOPs on the CPU if the fallback ops are the heaviest ones. Always gate on FLOPs, not node count.
For us, the threshold is ≥ 98% of FLOPs on the QNN EP. Drop below that, the PR fails. The CI message tells the engineer exactly which op fell back, so they can either swap it for a supported equivalent (e.g., LayerNormalization → fused MatMul+Add patterns the QNN backend recognizes), pin the QNN SDK version, or escalate to firmware support.
Putting it together
The three checks compose. A PR run looks like:
- Compile the new model through Qualcomm AI Hub for the target chipset (~90s)
- Submit profiling job to the AI Hub device farm against, say, Snapdragon 8 Gen 3 reference + 7s Gen 2 fallback (~60s)
- Pull the ORT profile + latency series
- Assert median ≤ 10 ms, CV ≤ 8%, NPU FLOPs ≥ 98%
- Emit a signed evidence bundle so the release engineer can verify what was tested
Steps 1–4 take ~4 minutes per device. Step 5 is the audit trail. If any assertion fails, the PR check goes red and the engineer sees, in the GitHub UI, the exact op that fell back — not "model is slow" but LayerNormalization node /model.10/m.0/m.0.1/norm/LayerNormalization is on CPUExecutionProvider.
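Stitched together, the whole gate is one short script. A sketch under the same assumptions as above — `run_latency_series` and `assert_npu_flops` are hypothetical wrappers around the Part 2 harness and the Part 3 profile parser:

def pr_gate(session, sample_input):
    # Part 2: latency series on real hardware, then median + CV gate.
    latencies_ms = run_latency_series(session, sample_input, warmup=3, n=30)
    gate(latencies_ms, median_threshold=10.0, cv_threshold=8.0)
    # Part 3: per-op routing; fail if too few FLOPs landed on the QNN EP.
    assert_npu_flops(session.end_profiling(), min_pct=98.0)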
That single PR check has saved one of our customers 11 engineering days per quarter on average. Not because the bugs are rare — because catching them in the PR is a 20-minute fix and catching them in the field is a multi-week fire.
Where EdgeGate fits
We built EdgeGate because rolling all of this yourself — Qualcomm AI Hub orchestration, ORT profile parsing, CV-aware gating, signed evidence bundles, GitHub Action integration — is two or three engineer-weeks. It's the kind of infrastructure every team that ships edge AI eventually builds, and most build twice because the first version doesn't handle the firmware-version axis. If you'd rather not, we run the whole pipeline as a CI gate with a single YAML file. Free tier includes 10 runs/month on real Snapdragon hardware.
But honestly, if you take nothing else from this post: turn on profiling_level=detailed, parse the JSON, and assert NPU FLOP percentage. Three lines of Python, and you'll never ship a silent fallback again.
Originally published on edgegate.frozo.ai. Discuss in the comments — corner cases and code questions welcome.