I have been building a lot of server-side vision systems — cloud inference, GPU clusters, the whole stack. But a recent side project reminded me how compelling on-device AI still is, especially when you strip away the assumption of reliable connectivity.
The project: a livestock counting app for smallholders. Take a photo of your flock, tap one chicken, get a count back. No account, no subscription, no signal required. Just a model on the device doing its job.
Here is what I learned porting YOLOv8 into an iOS app via CoreML.
Why On-Device at All?
The obvious answer: barns and fields do not have 5G. But the less-obvious answer is more interesting — no server means no ongoing cost, no network latency, and no privacy concern. The photo never leaves the phone. That is increasingly a selling point, not a footnote.
For small utility apps, cloud inference is overkill. You are paying per inference and maintaining infrastructure to serve a model that could run on a £400 phone in under 200ms.
The Stack: YOLOv8n → CoreML → Apple Vision
The model is YOLOv8 nano (yolov8n), trained on COCO. Nano is the key decision — it is ~6MB, runs on the Neural Engine, and for COCO categories like bird, sheep, and cow its accuracy is genuinely good enough for a counting use case.
The conversion path:
# one-time setup
pip install ultralytics coremltools
# export with non-max suppression baked into the model
yolo export model=yolov8n.pt format=coreml nms=True
That gives you a .mlpackage. Xcode compiles it to .mlmodelc at build time and generates a Swift wrapper class automatically. The inference code is clean:
import CoreML
import Vision

// Load the compiled model, letting Core ML pick the best compute unit.
let config = MLModelConfiguration()
config.computeUnits = .all // prefer the Neural Engine, fall back to GPU/CPU
let mlModel = try MLModel(contentsOf: modelURL, configuration: config)
let vnModel = try VNCoreMLModel(for: mlModel)

let request = VNCoreMLRequest(model: vnModel)
request.imageCropAndScaleOption = .scaleFit // preserve aspect ratio, pad the rest

// Run detection on a still image, honouring its EXIF orientation.
let handler = VNImageRequestHandler(cgImage: cgImage, orientation: orientation)
try handler.perform([request])
let results = request.results as? [VNRecognizedObjectObservation]
On a modern iPhone, yolov8n inference on a 640px image runs in roughly 50–80ms. Fast enough that it feels instant.
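If you want to sanity-check that number on your own hardware, wrapping the perform call is enough for a first pass (a rough sketch; Instruments will give you better data):

import QuartzCore // for CACurrentMediaTime

let start = CACurrentMediaTime()
try handler.perform([request])
let ms = (CACurrentMediaTime() - start) * 1000
print(String(format: "inference: %.0f ms", ms))

One caveat: the first run includes model load and Neural Engine warm-up, so discard it before trusting the numbers.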
The Hard Part: Confidence Thresholds
COCO-trained YOLOv8 with default confidence thresholds performs well on textbook images. Real livestock photos are not textbook images. Partially occluded animals behind fence posts, sheep that are mostly mud, chickens half-in-frame — these score lower confidence but are still valid detections you want to count.
I ended up with a final threshold of 0.25, versus the 0.35–0.45 most tutorials recommend. The model exports with NMS baked in (conf=0.15, iou=0.65), and I apply a second filter in Swift at 0.25, as sketched below. This catches most real-world partial occlusions without drowning in false positives.
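In code, the second stage is just a filter over the Vision results (a sketch that reuses the results array from the inference snippet above; 0.25 is the value tuned on my photos, not a universal constant):

// Export-time NMS already dropped anything under 0.15.
// This pass tightens to the threshold tuned on real flock photos.
let observations = results ?? []
let countingThreshold: VNConfidence = 0.25
let detections = observations.filter { $0.confidence >= countingThreshold }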
The other trick: let users tap to remove false positives rather than trying to tune away every edge case. Editable results beat perfect results. People accept "mostly right, I will tap off the fence post shadow" much better than "sometimes wrong with no recourse."
Tap-to-Identify Flow
Instead of forcing a category selection, users can just tap on one example object in the photo. The app finds the highest-confidence detection at that point, identifies its COCO class, and returns all detections of the same class.
// normalisedPoint is the tap location in 0–1 image coordinates (UIKit space).
// Vision uses bottom-left origin; UIKit uses top-left, so flip y.
let visionPoint = CGPoint(x: normalisedPoint.x, y: 1.0 - normalisedPoint.y)
let tapped = observations
    .filter { $0.boundingBox.contains(visionPoint) }
    .max(by: { $0.confidence < $1.confidence })
That coordinate flip (1.0 - normalisedPoint.y) is the kind of thing that wastes 45 minutes if you do not know to expect it.
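From there, the rest of the flow falls out of the labels array on each observation. Vision sorts labels by confidence, so .first is the model's best guess. A sketch of my approach:

// The tapped detection's top label gives us the COCO class to count.
guard let targetClass = tapped?.labels.first?.identifier else { return }

// Keep every detection whose top label matches; flock.count is the answer.
let flock = observations.filter { $0.labels.first?.identifier == targetClass }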
What On-Device Vision Is Actually Good For
After building this, my take: on-device inference with a small COCO-trained model is a good fit for:
- Counting / detection of common real-world objects (people, animals, vehicles, plants)
- Apps that work in low-connectivity environments — field tools, outdoor apps, anything rural
- Privacy-sensitive use cases — medical, personal, anything users would not want hitting a cloud API
- One-off utility apps where server infrastructure is not justified
It is not a good fit for fine-grained classification (you need a domain-specific model for that), real-time video at scale, or anything whose categories fall outside COCO's 80 classes.
The stack — YOLOv8 + CoreML + Apple Vision framework — is mature, well-documented, and genuinely pleasant to work with. If you are building something where offline matters, it is worth the afternoon it takes to get running.