By Jack Kao — author of mk-qa-master, an MCP-native QA toolkit.
Most "AI testing" stops at calling an API and asserting the response isn't empty. Edge AI — a model running on a live camera feed — doesn't fit that mold. You can't assert exact bounding-box coordinates (the output is fuzzy by design), and "correct but 200ms too late" is a production failure, not a pass.
When I added the edge runner to mk-qa-master in v1.1, I wanted it to make that hard problem feel like writing any other test. So this post is me dogfooding it: my laptop webcam → an RTSP stream → a YOLOv8 model → assertions on IoU, throughput, and p95 latency — the whole thing orchestrated through MCP tool calls, no Jetson required.
What the edge runner is
mk-qa-master is an MCP server: your AI client (Claude Code, Cursor, etc.) calls its tools to analyze, generate, and run tests. It already ships runners for pytest-playwright, jest, cypress, go, maestro, schemathesis… and as of v1.1, edge.
Flip one env var:
// the mk-qa-master MCP server config
"env": { "QA_RUNNER": "edge" }
…and the same generate_test / run_tests tools now speak RTSP + YOLO instead of browsers. Same muscle memory, completely different domain.
The four questions an edge test must answer
A useful Edge AI test answers all of these at once:
- Correctness — does the model detect the thing it should?
- Throughput — does it keep up with the frame rate?
- Latency — is each inference fast enough? (p95, not mean.)
- Robustness — do blank/empty frames produce false detections?
Correctness is the subtle one. Two runs of the same model on the same frame can disagree by a few pixels, so detection correctness uses IoU (Intersection over Union) against an expected box with a threshold (default 0.5). You assert "the box overlaps enough," never "the box is identical." mk-qa-master ships this as edge.metrics.match_detection.
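To make the "overlaps enough" rule concrete, here is a minimal IoU sketch. It assumes boxes are (x1, y1, x2, y2) tuples in pixels; the `iou` and `matches` helpers are hypothetical illustrations and not the actual implementation behind edge.metrics.match_detection.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, top-left to bottom-right.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes produce no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(detected: Sequence[Box], expected: Box, threshold: float = 0.5) -> bool:
    """True if any detected box overlaps the expected box by at least threshold."""
    return any(iou(d, expected) >= threshold for d in detected)
```

A box that drifts by a few pixels still scores close to 1.0, while a box shifted by half its width drops to about 0.33 and fails the default 0.5 threshold.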
The pipeline
The testing layer wants a stream, not a device handle — so you bridge the camera:
MacBook camera
│ ffmpeg (avfoundation capture)
▼
local clip (5s — deterministic, replayable, no privacy surprises in CI)
│ ffmpeg -re -stream_loop -1 (loop forever)
▼
mediamtx → rtsp://127.0.0.1:8600/cam
│ OpenCV VideoCapture
▼
YOLOv8n (LocalYolo) → IoU / fps / latency assertions → pytest report
Recording a clip once and looping it over RTSP is what makes this deterministic — the same frames every run, no "why did it fail at 6pm" lighting drama. The edge runner actually does this loop for you when QA_RTSP_SOURCE points at a file; I'll show the manual version because it's clearer:
# capture once
ffmpeg -f avfoundation -framerate 30 -video_size 1280x720 \
  -i "0" -t 5 -pix_fmt yuv420p edge_demo.mp4

# serve on loop via mediamtx (mediamtx.yml: rtspAddress: :8600, paths: { all_others: })
ffmpeg -re -stream_loop -1 -i edge_demo.mp4 \
  -c:v libx264 -preset ultrafast -tune zerolatency \
  -f rtsp rtsp://127.0.0.1:8600/cam
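The file-versus-stream decision described above can be sketched in a few lines. This is an illustration only: `loop_publish_cmd` is a hypothetical helper, not the edge runner's code, and it simply reuses the ffmpeg flags from the manual command.

```python
from typing import List, Optional


def loop_publish_cmd(source: str, rtsp_url: str) -> Optional[List[str]]:
    """Return an ffmpeg argv that loops a clip to rtsp_url, or None for live streams."""
    if source.startswith(("rtsp://", "rtsps://")):
        # Already a stream: nothing to publish, the test reads it directly.
        return None
    return [
        "ffmpeg", "-re", "-stream_loop", "-1", "-i", source,
        "-c:v", "libx264", "-preset", "ultrafast", "-tune", "zerolatency",
        "-f", "rtsp", rtsp_url,
    ]
```

The returned list can be handed to subprocess.Popen as-is; keeping it as an argv list rather than a shell string avoids quoting problems with clip paths that contain spaces.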
The workflow, as MCP tool calls
This is the part I'm proud of — the whole session is just tools.
1. Confirm the runner. get_runner_info →
{ "current": "edge", "available": ["cypress","edge","go","jest","maestro","newman","pytest-playwright","schemathesis"] }
2. Probe the stream. analyze_stream(rtsp_url="rtsp://127.0.0.1:8600/cam") →
{
"url": "rtsp://127.0.0.1:8600/cam",
"width": 1280, "height": 720, "fps": 30.0,
"labels": [],
"candidate_tcs": [
"overall throughput should be >= the configured min_fps",
"single-frame p95 latency should be <= the latency SLA",
"stream reconnects after mid-test interruption without crashing",
"empty / no-target frames do not generate false-positive detections"
]
}
Note it hands back candidate test cases specific to the edge domain — the same way analyze_url hands back form/CTA test cases for web.
3. Generate the test. generate_test(...) writes a pytest skeleton wired to the edge fixtures (backend, stream, latency). The body stays boring on purpose:
import time

# MIN_FPS, MAX_FRAME, LABEL, IOU, SLA, annotations and match_detection are
# assumed to be defined earlier in the generated module.


def test_throughput(stream, backend, latency):
    """Sustained-rate check: at least MIN_FPS over a 150-frame window."""
    n, t0 = 0, time.perf_counter()
    while n < 150:
        ok, frame = stream.read()
        if not ok:
            break
        latency.add(backend.infer(frame).latency_ms)
        n += 1
    # Guard against a zero-length window before dividing.
    fps = n / max(time.perf_counter() - t0, 1e-6)
    assert fps >= MIN_FPS, f"throughput below target: {fps:.1f} < {MIN_FPS} fps"


def test_detect_target(stream, backend, latency):
    """Target label within IoU threshold; p95 latency holds."""
    hit = False
    for idx in range(MAX_FRAME):
        ok, frame = stream.read()
        if not ok:
            break
        res = backend.infer(frame)
        latency.add(res.latency_ms)
        # Annotations are keyed by frame index, so a match is traceable to a frame.
        for exp in annotations.get(str(idx), []):
            if exp["label"] == LABEL and match_detection(res.detections, exp, IOU):
                hit = True
    assert hit, f"{LABEL!r} not detected within IoU={IOU}"
    assert latency.p95() <= SLA, f"p95 {latency.p95():.1f}ms > {SLA}ms"
4. Run it. run_tests → the runner starts the RTSP source, loads YOLOv8n once (session-scoped — reloading per test would wreck wall-clock), runs pytest, and snapshots the report.
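The latency fixture in the generated tests only needs add() and p95(). A minimal stand-in could look like the following; `LatencyRecorder` is a hypothetical name, and the nearest-rank percentile is one reasonable choice, not necessarily what mk-qa-master uses.

```python
import math
from typing import List


class LatencyRecorder:
    """Collects per-frame latencies (ms) and reports summary statistics."""

    def __init__(self) -> None:
        self._samples: List[float] = []

    def add(self, ms: float) -> None:
        self._samples.append(ms)

    def mean(self) -> float:
        if not self._samples:
            raise ValueError("no samples recorded")
        return sum(self._samples) / len(self._samples)

    def percentile(self, q: int) -> float:
        """Nearest-rank percentile for q in 1..100."""
        if not self._samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self._samples)
        rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
        return ordered[max(rank, 1) - 1]

    def p95(self) -> float:
        return self.percentile(95)
```

With 90 samples at 25 ms and 10 stragglers at 90 ms, the mean is 31.5 ms but p95 is 90 ms, which is why the assertion targets p95 rather than the average.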
What the numbers actually looked like
YOLOv8n on CPU (no GPU, just the MacBook), 150 frames of the looped 720p stream:
| Metric | Result | Target (desktop yolov8n) |
|---|---|---|
| Throughput | 23.0 fps | 25 — close, CPU-bound |
| p95 latency | 27.9 ms | ≤ 40 ms ✅ |
| mean / p50 latency | 26.0 / 25.6 ms | — |
| person detections | 150 / 150 frames | — |
| Empty-frame false positives | 0 | 0 ✅ |
The model nailed person on every frame (it was pointed at me). It also confidently reported a suitcase in all 150 frames — there wasn't one. That's the reality of a tiny model: great recall, noisy precision. Exactly the thing a real test surfaces and a "looks like it works" demo hides. p95 of 27.9ms beat the 40ms SLA comfortably; 23fps fell just short of 25 purely for lack of a GPU.
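"Great recall, noisy precision" can be put into numbers. The sketch below reads the run as one true person detection and one spurious suitcase detection per frame, which is an assumption: the post reports frame counts, not per-frame detection counts. `precision_recall` is a hypothetical helper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed counts: 150 correct person boxes, 150 phantom suitcases, no misses.
precision, recall = precision_recall(tp=150, fp=150, fn=0)
```

Under that reading recall is 1.0 and precision is 0.5: every real object found, and half of everything reported is wrong.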
mk-qa-master ships SLA starting points so you don't guess:
| Scenario | min_fps | latency SLA | IoU |
|---|---|---|---|
| Desktop yolov8n dev | 25 | 40 ms | 0.5 |
| Jetson Nano | 15 | 70 ms | 0.5 |
| Jetson Orin Nano | 30 | 25 ms | 0.6 |
| Cloud GPU service | 60 | 16 ms | 0.6 |
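One way to carry these starting points into a test module is a small profile table. The values below are copied from the table; the `SlaProfile` type, the profile keys, and `sla_violations` are hypothetical names for illustration, not mk-qa-master's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SlaProfile:
    min_fps: float
    latency_sla_ms: float
    iou: float


# Values taken from the SLA table; the keys are made up for this sketch.
PROFILES: Dict[str, SlaProfile] = {
    "desktop-yolov8n-dev": SlaProfile(25, 40, 0.5),
    "jetson-nano": SlaProfile(15, 70, 0.5),
    "jetson-orin-nano": SlaProfile(30, 25, 0.6),
    "cloud-gpu": SlaProfile(60, 16, 0.6),
}


def sla_violations(profile: SlaProfile, fps: float, p95_ms: float) -> List[str]:
    """Return a human-readable message for each SLA the run misses."""
    problems = []
    if fps < profile.min_fps:
        problems.append(f"throughput {fps:.1f} fps < {profile.min_fps} fps")
    if p95_ms > profile.latency_sla_ms:
        problems.append(f"p95 {p95_ms:.1f} ms > {profile.latency_sla_ms} ms")
    return problems
```

Feeding in the CPU run from earlier (23.0 fps, 27.9 ms p95) against the desktop profile yields a single violation, for throughput; the same numbers pass the Jetson Nano profile outright.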
Design principles I baked in (and why)
- IoU, not coordinates. Pixel-exact assertions test randomness, not correctness.
- Latency and correctness in the same suite, on p95. A run can average 26ms and still drop a frame on a 90ms straggler.
- An empty-frame false-positive test is first-class. A model that hallucinates objects on noise is worse than one that misses: it poisons everything downstream. Every edge suite should have one test_empty_frame_no_false_positives.
- Every frame is traceable to an index, so a failure report says which frame, not "missed somewhere."
- Session-scoped model load. The single biggest wall-clock footgun is reloading YOLO per test.
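The empty-frame and frame-index principles combine naturally. Here is a sketch of the check that a test_empty_frame_no_false_positives body could delegate to; the `Result` type and both stub backends are stand-ins invented for this example, and a blank frame is modelled as zeroed bytes rather than a real image array.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    detections: List[dict] = field(default_factory=list)


class QuietBackend:
    """Stub model that reports nothing on any frame."""

    def infer(self, frame: bytes) -> Result:
        return Result()


class NoisyBackend:
    """Stub model that hallucinates an object on every frame."""

    def infer(self, frame: bytes) -> Result:
        return Result([{"label": "suitcase"}])


def blank_frame(width: int = 64, height: int = 48) -> bytes:
    # All-zero RGB buffer standing in for a black frame.
    return bytes(width * height * 3)


def false_positive_frames(backend, frames) -> List[int]:
    """Indices of frames on which the backend reported any detection."""
    return [i for i, f in enumerate(frames) if backend.infer(f).detections]


def check_empty_frames(backend, n_frames: int = 5) -> None:
    bad = false_positive_frames(backend, [blank_frame() for _ in range(n_frames)])
    assert not bad, f"false positives on blank frames at indices {bad}"
```

Because the helper returns frame indices, a failure message names the exact frames that triggered a phantom detection.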
Field notes from dogfooding (v1.2 roadmap)
Eating my own dog food turned up sharp edges I'm filing down:
- Stream readiness is harder than a TCP probe. The setup currently polls "can I connect() to the RTSP port?", but I had an Android emulator (qemu) squatting on the default port 8554, so the probe passed instantly against the wrong listener and the test raced ahead of ffmpeg's first publish (DESCRIBE 404). v1.2 moves readiness to an actual RTSP DESCRIBE success, and the docs now recommend a dedicated port.
- localhost is two addresses. When the server binds :: and the consumer resolves 127.0.0.1, they miss each other. Pinning both sides to 127.0.0.1 removes the ambiguity; this is going into the defaults.
- The HTML report doesn't render edge cards yet. generate_html_report builds per-test cards from the runner's get_all_test_details(); the edge runner doesn't override it yet, so an edge run shows summary tiles but no case list. Quick fix, already queued.
- Remote inference is v1.2. v1.1 is LocalYolo only; QA_JETSON_HOST / QA_INFERENCE_ENDPOINT will let the same test point at a real device or a GPU service and just re-tune the thresholds.
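The difference between a TCP probe and a DESCRIBE-based readiness check can be sketched with the standard library alone. This is an assumption-laden illustration of the idea, not the planned v1.2 code: `describe_request`, `status_code`, and `stream_ready` are hypothetical helpers, and the request carries only the minimal headers of an RTSP/1.0 DESCRIBE.

```python
import socket
from urllib.parse import urlparse


def describe_request(url: str, cseq: int = 1) -> bytes:
    """Build a minimal RTSP/1.0 DESCRIBE request for url."""
    return (
        f"DESCRIBE {url} RTSP/1.0\r\n"
        f"CSeq: {cseq}\r\n"
        "Accept: application/sdp\r\n\r\n"
    ).encode("ascii")


def status_code(response: bytes) -> int:
    """Parse the RTSP status line; 0 means the peer did not answer in RTSP."""
    line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("RTSP/") or not parts[1].isdigit():
        return 0
    return int(parts[1])


def stream_ready(url: str, timeout: float = 2.0) -> bool:
    """True only when the server answers DESCRIBE with 200 for this path."""
    u = urlparse(url)
    try:
        with socket.create_connection((u.hostname, u.port or 554), timeout=timeout) as s:
            s.sendall(describe_request(url))
            return status_code(s.recv(4096)) == 200
    except OSError:
        # Refused, timed out, or unreachable: not ready yet.
        return False
```

A listener that merely accepts the connection, or answers 404 because ffmpeg has not published yet, now counts as not ready, so the caller can keep polling instead of racing ahead.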
Try it
pip install "mk-qa-master[edge]" # opencv + ultralytics + torch
# add the MCP server to your client, set QA_RUNNER=edge,
# point QA_RTSP_SOURCE at a clip or an rtsp:// url, and call run_tests
You don't need edge hardware to start testing edge models — a webcam, ffmpeg, mediamtx, and a 6MB YOLOv8n checkpoint get you a real, asserting, repeatable test in an afternoon. When the Jetson arrives, you point the same test at a different endpoint.
If you build CV/edge pipelines, I'd genuinely like to hear how you handle the latency-vs-correctness trade-off in CI — that tension is the whole game.
— Jack Kao