Todd Sullivan

YOLOv8 + CoreML on iOS: Shipping Offline Computer Vision That Actually Works in the Field

I have been building a lot of server-side vision systems — cloud inference, GPU clusters, the whole stack. But a recent side project reminded me how compelling on-device AI still is, especially when you strip away the assumption of reliable connectivity.

The project: a livestock counting app for smallholders. Take a photo of your flock, tap one chicken, get a count back. No account, no subscription, no signal required. Just a model on the device doing its job.

Here is what I learned porting YOLOv8 into an iOS app via CoreML.


Why On-Device at All?

The obvious answer: barns and fields do not have 5G. But the less-obvious answer is more interesting: no server means no ongoing cost, no network round trips, and no privacy concern. The photo never leaves the phone. That is increasingly a selling point, not a footnote.

For small utility apps, cloud inference is overkill. You are paying per-inference and maintaining infrastructure to serve a model that could run on a £400 phone in under 200ms.


The Stack: YOLOv8n → CoreML → Apple Vision

The model is YOLOv8 nano (yolov8n), trained on COCO. Nano is the key decision: it is ~6MB, runs on the Neural Engine, and for COCO categories like bird, sheep, and cow the accuracy is genuinely good enough for a counting use case.

The conversion path:

pip install ultralytics coremltools
yolo export model=yolov8n.pt format=coreml nms=True

That gives you a .mlpackage. Xcode compiles it to .mlmodelc at build time and generates a Swift wrapper class automatically. The inference code is clean:

import CoreML
import Vision

let config = MLModelConfiguration()
config.computeUnits = .all  // prefer the Neural Engine, fall back to GPU/CPU

// modelURL points at the compiled .mlmodelc in the app bundle
let mlModel = try MLModel(contentsOf: modelURL, configuration: config)
let vnModel = try VNCoreMLModel(for: mlModel)

let request = VNCoreMLRequest(model: vnModel)
request.imageCropAndScaleOption = .scaleFit  // preserve aspect ratio, no centre crop

let handler = VNImageRequestHandler(cgImage: cgImage, orientation: orientation)
try handler.perform([request])

let results = request.results as? [VNRecognizedObjectObservation]

On a modern iPhone, yolov8n inference on a 640px image runs in roughly 50–80ms. Fast enough that it feels instant.
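
If you want to sanity-check that on your own device, wrapping the perform call in a timer is enough of a measurement. A rough sketch (CFAbsoluteTimeGetCurrent is coarse but fine at millisecond scale; expect the first inference to be slower while CoreML loads and specialises the model):

// Rough end-to-end timing for a single Vision request.
let start = CFAbsoluteTimeGetCurrent()
try handler.perform([request])
let elapsedMs = (CFAbsoluteTimeGetCurrent() - start) * 1000
print(String(format: "Inference: %.0f ms", elapsedMs))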


The Hard Part: Confidence Thresholds

COCO-trained YOLOv8 with default confidence thresholds performs well on textbook images. Real livestock photos are not textbook images. Partially occluded animals behind fence posts, sheep that are mostly mud, chickens half-in-frame — these score lower confidence but are still valid detections you want to count.

I ended up with a final threshold of 0.25, vs the default 0.35–0.45 most tutorials recommend. The model exports with NMS baked in (conf=0.15, iou=0.65), and I apply a second filter in Swift at 0.25. This catches most real-world partial occlusions without drowning in false positives.
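
The second pass is a one-line filter over the Vision observations. A minimal sketch (the threshold constant and variable names are mine, not from the generated wrapper):

// Secondary confidence filter on top of the NMS baked into the exported model.
let confidenceThreshold: Float = 0.25

let detections = (request.results as? [VNRecognizedObjectObservation] ?? [])
    .filter { $0.confidence >= confidenceThreshold }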

The other trick: let users tap to remove false positives rather than trying to tune away every edge case. Editable results beat perfect results. People accept "mostly right, I will tap off the fence post shadow" much better than "sometimes wrong with no recourse."
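
Mechanically, removal is trivial once the detections are in hand. A minimal sketch, assuming a mutable detections array and a tap already converted to Vision coordinates (that conversion is covered in the next section):

// User tapped a false positive (e.g. a fence post shadow): drop it from the count.
if let index = detections.firstIndex(where: { $0.boundingBox.contains(visionPoint) }) {
    detections.remove(at: index)
}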


Tap-to-Identify Flow

Instead of forcing a category selection, users can just tap on one example object in the photo. The app finds the highest-confidence detection at that point, identifies its COCO class, and returns all detections of the same class.

// Vision uses bottom-left origin; UIKit uses top-left
let visionPoint = CGPoint(x: normalisedPoint.x, y: 1.0 - normalisedPoint.y)

let tapped = observations
    .filter { $0.boundingBox.contains(visionPoint) }
    .max(by: { $0.confidence < $1.confidence })
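
The counting step is then just a match on the tapped detection's top label. A sketch of that last step, assuming the same observations array as above:

// Count every detection whose top COCO class matches the tapped object.
if let targetClass = tapped?.labels.first?.identifier {
    let count = observations
        .filter { $0.labels.first?.identifier == targetClass }
        .count
    print("Counted \(count) × \(targetClass)")
}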

That coordinate flip (1.0 - normalisedPoint.y) is the kind of thing that wastes 45 minutes if you do not know to expect it.
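
The flip also applies in reverse when you draw boxes back over the photo. A rough helper for that, assuming you are drawing in pixel coordinates with a UIKit-style top-left origin (VNImageRectForNormalizedRect handles the scaling; the y-flip is manual):

import Vision

// Convert a normalised, bottom-left-origin Vision bounding box into
// top-left-origin pixel coordinates for drawing an overlay.
func drawingRect(for observation: VNRecognizedObjectObservation,
                 imageSize: CGSize) -> CGRect {
    let rect = VNImageRectForNormalizedRect(observation.boundingBox,
                                            Int(imageSize.width),
                                            Int(imageSize.height))
    return CGRect(x: rect.minX,
                  y: imageSize.height - rect.maxY,
                  width: rect.width,
                  height: rect.height)
}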


What On-Device Vision Is Actually Good For

After building this, my take: on-device inference with a small COCO-trained model is a good fit for:

  • Counting / detection of common real-world objects (people, animals, vehicles, plants)
  • Apps that work in low-connectivity environments — field tools, outdoor apps, anything rural
  • Privacy-sensitive use cases — medical, personal, anything users would not want hitting a cloud API
  • One-off utility apps where server infrastructure is not justified

It is not a good fit for fine-grained classification (you need a domain-specific model), real-time video at scale, or anything that needs categories beyond the 80 COCO classes.

The stack — YOLOv8 + CoreML + Apple Vision framework — is mature, well-documented, and genuinely pleasant to work with. If you are building something where offline matters, it is worth the afternoon it takes to get running.
