Daisuke Majima

Posted on Jun 2 • Originally published at qiita.com

Real-time object detection on iPhone with YOLO26

#ios #swift #machinelearning #computervision

What is YOLO26

YOLO26 is an object detection model released by Ultralytics in January 2026, the latest in the YOLO line at the time of writing. Compared with the previous-generation YOLO11, CPU inference is up to 43% faster, which makes it considerably more practical on edge devices.

Its biggest feature is end-to-end inference with no NMS. The Non-Maximum Suppression (NMS) post-processing step that used to be mandatory in YOLO is gone — the model outputs the final detections directly.

Model	mAP	CPU inference	Params
YOLO26n	40.9	38.9ms	2.5M
YOLO26s	48.6	63.3ms	9.2M
YOLO26m	53.1	155ms	18.7M

Why run it on iPhone

Real-time inference: 30+ FPS using the Neural Engine
Privacy: data never leaves the device
Offline: works without a network
Low latency: no server round-trip, so results come back instantly

Preparing the CoreML model

Option 1: Download a converted model

You can grab a converted model from the CoreML-Models repository.

YOLO26s (18MB)

Option 2: Convert it yourself

pip install ultralytics coremltools==8.1 "numpy<2"

python -c "
from ultralytics import YOLO
model = YOLO('yolo26s.pt')
model.export(format='coreml', nms=False)
"

Note: coremltools 9.0 crashes in _cast during conversion when combined with numpy 2.x, so pinning coremltools 8.1 with numpy<2 is recommended.

Implementing the iOS app

Loading the model

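import CoreML
import Vision

// modelURL must point to the compiled model (.mlmodelc).
// Xcode compiles a bundled .mlpackage into this form at build time.
let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine + GPU + CPU
let mlModel = try MLModel(contentsOf: modelURL, configuration: config)
let vnModel = try VNCoreMLModel(for: mlModel)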

Running inference on camera frames

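import AVFoundation
import Vision

// AVCaptureVideoDataOutputSampleBufferDelegate callback, called once per camera frame
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    // Capture self weakly so an in-flight request does not keep the owner alive
    let request = VNCoreMLRequest(model: vnModel) { [weak self] request, _ in
        self?.handleDetections(request)
    }
    // .scaleFill stretches the whole frame to the model input size,
    // so normalized box coordinates map back onto the full frame
    request.imageCropAndScaleOption = .scaleFill

    do {
        try VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
            .perform([request])
    } catch {
        print("Vision request failed: \(error)")
    }
}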

Decoding the NMS-free output

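YOLO26's output is a [1, 300, 6] tensor: up to 300 detections, each row laid out as [x1, y1, x2, y2, confidence, class_id], with box coordinates in input pixels (640×640 here). The rows are already deduplicated, so the only step left is a confidence threshold.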

func handleDetections(_ request: VNRequest) {
    guard let results = request.results as? [VNCoreMLFeatureValueObservation],
          let array = results.first?.featureValue.multiArrayValue else { return }

    let shape = array.shape.map { $0.intValue }  // expected [1, 300, 6]
    // Bail out if the model is not an NMS-free export
    guard shape.count == 3, shape[2] == 6 else { return }
    let inputSize: CGFloat = 640  // side length of the model input used at export

    for i in 0..<shape[1] {
        let confidence = array[[0, i, 4] as [NSNumber]].floatValue
        guard confidence >= 0.25 else { continue }

        let x1 = CGFloat(array[[0, i, 0] as [NSNumber]].floatValue) / inputSize
        let y1 = CGFloat(array[[0, i, 1] as [NSNumber]].floatValue) / inputSize
        let x2 = CGFloat(array[[0, i, 2] as [NSNumber]].floatValue) / inputSize
        let y2 = CGFloat(array[[0, i, 3] as [NSNumber]].floatValue) / inputSize
        let classId = Int(array[[0, i, 5] as [NSNumber]].floatValue)

        // x1, y1, x2, y2 are now normalized coordinates in [0, 1]
        // convert them to screen coordinates and draw, using classId for the label
    }
}

With conventional YOLO (v5, v8, v9, v11), an NMS step was required at this point. In YOLO26, duplicates are already suppressed inside the model through its Dual Assignment design, so filtering on a confidence threshold is enough to get the final results.
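
The threshold-only decode can also be pulled out of the Vision callback and tested on its own. The following is a minimal sketch, assuming the [1, N, 6] output has already been copied into a flat row-major [Float]. `Detection`, `decodeDetections`, and the default values are illustrative names and numbers, not part of the sample app.

```swift
/// One decoded box, normalized to [0, 1].
struct Detection: Equatable {
    let x1: Float, y1: Float, x2: Float, y2: Float
    let confidence: Float
    let classId: Int
}

/// Decodes a flat row-major buffer of N rows laid out as
/// [x1, y1, x2, y2, confidence, class_id] in input-pixel units.
/// `inputSize` (assumed to be 640 here) is the model's input side length.
func decodeDetections(_ values: [Float],
                      inputSize: Float = 640,
                      threshold: Float = 0.25) -> [Detection] {
    let rowLength = 6
    guard inputSize > 0, values.count % rowLength == 0 else { return [] }

    var detections: [Detection] = []
    for row in 0..<(values.count / rowLength) {
        let base = row * rowLength
        let confidence = values[base + 4]
        let rawClass = values[base + 5]
        // No NMS: a confidence threshold is the only filter.
        // Rows with a non-finite class value are skipped defensively.
        guard confidence >= threshold, rawClass.isFinite else { continue }

        detections.append(Detection(
            x1: values[base] / inputSize,
            y1: values[base + 1] / inputSize,
            x2: values[base + 2] / inputSize,
            y2: values[base + 3] / inputSize,
            confidence: confidence,
            classId: Int(rawClass)
        ))
    }
    return detections
}
```

Keeping the decode as a pure function like this makes it easy to check against known rows before wiring it to live camera output.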

Comparison with NMS-based YOLO

	YOLO26 (no NMS)	YOLO11 (NMS required)
Output	`[1, 300, 6]`: final detections	`[1, 84, 8400]`: needs decode + NMS
Post-processing	threshold filter only	box decode → NMS → filter
CoreML conversion	simple, `nms=False`	needs a pipeline, `nms=True`
Inference speed	up to 43% faster (CPU)	baseline
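
For contrast, this is roughly the post-processing step that NMS-based models need and that YOLO26 lets you drop. It is a generic sketch of greedy, class-agnostic NMS, not code from Ultralytics or the sample app. `Box`, `iou`, `nonMaxSuppression`, and the 0.45 IoU threshold are illustrative assumptions.

```swift
struct Box {
    let x1: Float, y1: Float, x2: Float, y2: Float
    let score: Float

    var area: Float { max(0, x2 - x1) * max(0, y2 - y1) }
}

/// Intersection over union of two axis-aligned boxes.
func iou(_ a: Box, _ b: Box) -> Float {
    let w = max(0, min(a.x2, b.x2) - max(a.x1, b.x1))
    let h = max(0, min(a.y2, b.y2) - max(a.y1, b.y1))
    let intersection = w * h
    let union = a.area + b.area - intersection
    return union > 0 ? intersection / union : 0
}

/// Greedy NMS: visit boxes from highest to lowest score and keep a box
/// only if it does not overlap an already kept box by more than the threshold.
func nonMaxSuppression(_ boxes: [Box], iouThreshold: Float = 0.45) -> [Box] {
    var kept: [Box] = []
    for candidate in boxes.sorted(by: { $0.score > $1.score }) {
        if kept.allSatisfy({ iou($0, candidate) <= iouThreshold }) {
            kept.append(candidate)
        }
    }
    return kept
}
```

With a `[1, 84, 8400]` output this would run over thousands of candidates per frame after box decoding, which is the work the NMS-free output avoids.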

Drawing bounding boxes

Use CAShapeLayer for fast drawing. Rendering the boxes with SwiftUI's ForEach rebuilds the views on every frame, which becomes slow.

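import UIKit

class BoundingBoxView {
    let shapeLayer = CAShapeLayer()
    let textLayer = CATextLayer()

    init() {
        // CAShapeLayer fills with opaque black by default; clear it so only the outline shows
        shapeLayer.fillColor = UIColor.clear.cgColor
        shapeLayer.lineWidth = 3
        textLayer.fontSize = 14
    }

    func show(frame: CGRect, label: String, color: UIColor) {
        CATransaction.begin()
        CATransaction.setDisableActions(true)  // disable implicit animation

        shapeLayer.path = UIBezierPath(roundedRect: frame, cornerRadius: 10).cgPath
        shapeLayer.strokeColor = color.cgColor
        textLayer.string = label
        // Label placement above the box is illustrative; adjust to taste
        textLayer.frame = CGRect(x: frame.minX, y: frame.minY - 18, width: frame.width, height: 18)

        CATransaction.commit()
    }
}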

Key points:

Disable implicit animation with CATransaction.setDisableActions(true). Without it, labels lag one frame behind the box.
Pool and reuse ~100 layers to avoid per-frame alloc/dealloc.
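
The pooling idea can be sketched independently of Core Animation. The generic `ReusePool` below is a hypothetical helper, not the sample app's implementation: it creates a fixed number of items once, configures as many as there are detections on each frame, and hides the rest.

```swift
/// A fixed-size pool: items are created once and reused on every frame.
final class ReusePool<Item> {
    private let items: [Item]
    private let hide: (Item) -> Void

    init(capacity: Int, make: () -> Item, hide: @escaping (Item) -> Void) {
        self.items = (0..<max(0, capacity)).map { _ in make() }
        self.hide = hide
    }

    /// Applies `configure` to one pooled item per value, then hides the rest.
    /// Values beyond the pool's capacity are dropped.
    func update<Value>(with values: [Value], configure: (Item, Value) -> Void) {
        for (index, item) in items.enumerated() {
            if index < values.count {
                configure(item, values[index])
            } else {
                hide(item)
            }
        }
    }
}
```

In the app, `Item` would presumably be something like BoundingBoxView, with `make` adding its layers to the preview once and `hide` setting `isHidden` on them, so no layer is allocated or released while frames are streaming.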

Aligning coordinates with the camera preview

This is where you get stuck the most.

// Set the camera output's videoOrientation to .portrait
// so the pixelBuffer arrives already rotated to portrait.
// (videoOrientation is deprecated as of iOS 17; its replacement is videoRotationAngle.)
let connection = videoOutput.connection(with: .video)
connection?.videoOrientation = .portrait

// Pass .up to VNImageRequestHandler (the buffer is already rotated)
let handler = VNImageRequestHandler(cvPixelBuffer: pb, orientation: .up)

Because the preview is cropped with resizeAspectFill, you have to correct for the difference between the camera's aspect ratio and the screen's aspect ratio.

let cameraRatio = shortSide / longSide  // e.g., 1080/1920
let displayRatio = screenWidth / screenHeight
// Equivalent to (screenHeight / screenWidth) / (longSide / shortSide)
let ratio = cameraRatio / displayRatio

if ratio >= 1 {
    // screen is taller than the camera → the preview is cropped left and right,
    // so scale-correct horizontally
    let offset = (1 - ratio) * (0.5 - rect.minX)
    // ... correct with an affine transform
}
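
The same correction can be written directly in terms of the aspect-fill scale factor. This is an alternative formulation rather than the affine-transform code elided above. It assumes the preview is centered, fills the view with resizeAspectFill, and that boxes are normalized with a top-left origin. `aspectFillRect` and its parameter names are illustrative.

```swift
import Foundation

/// Maps a normalized, top-left-origin rect from camera-image space into the
/// coordinate space of a view that shows the image with aspect-fill scaling.
func aspectFillRect(normalized rect: CGRect,
                    imageSize: CGSize,
                    viewSize: CGSize) -> CGRect {
    guard imageSize.width > 0, imageSize.height > 0 else { return .zero }

    // Aspect fill scales by the larger factor so the image covers the view.
    let scale = max(viewSize.width / imageSize.width,
                    viewSize.height / imageSize.height)
    let scaledWidth = imageSize.width * scale
    let scaledHeight = imageSize.height * scale

    // The overflow is cropped equally on both sides, so offsets are <= 0.
    let offsetX = (viewSize.width - scaledWidth) / 2
    let offsetY = (viewSize.height - scaledHeight) / 2

    return CGRect(x: rect.minX * scaledWidth + offsetX,
                  y: rect.minY * scaledHeight + offsetY,
                  width: rect.width * scaledWidth,
                  height: rect.height * scaledHeight)
}
```

For a 1080×1920 portrait frame shown in a taller screen, the height fits exactly and the horizontal offset is negative, which is the "cropped left and right" case handled by the `ratio >= 1` branch.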

Sample app

There's a complete sample app in the CoreML-Models repository.

YOLO26Demo (sample_apps/YOLO26Demo/) — for NMS-free models
- Real-time camera inference + FPS/latency display
- Inference on images from the photo library
- Per-frame inference on video

Setup:

Download and unzip the model
Drag the .mlpackage into your Xcode project
Build & run on a real device

Any model with output shape [1, N, 6] is loaded automatically, regardless of file name.
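
A shape check of this kind might look like the sketch below. This is an assumption about how such detection could work, not the sample app's actual code, and `nmsFreeDetectionCount` is a hypothetical name. In an app, the shape could be read from `MLModel.modelDescription.outputDescriptionsByName` through each description's `multiArrayConstraint?.shape`.

```swift
/// Returns N when `shape` looks like an NMS-free detection output [1, N, 6],
/// otherwise nil (for example, for an NMS-based [1, 84, 8400] output).
func nmsFreeDetectionCount(forOutputShape shape: [Int]) -> Int? {
    guard shape.count == 3,
          shape[0] == 1,
          shape[1] > 0,
          shape[2] == 6 else {
        return nil
    }
    return shape[1]
}
```

Checking the shape instead of the file name lets any compatible model load without the app being configured per model.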

Conversion tips

What I learned doing this conversion:

coremltools 9.0 + numpy 2.x crashes on _cast → use coremltools 8.1 + numpy<2
ultralytics 8.4.31's nms=True CoreML export fails because of a pipeline_coreml bug → with NMS-free YOLO26 you just use nms=False, so it's a non-issue
Python 3.14 isn't supported by coremltools → use Python 3.12

Summary

Thanks to its NMS-free design, YOLO26 makes both CoreML conversion and app implementation simpler. Conventional YOLO needed an NMS pipeline and decoding logic; YOLO26 needs only threshold filtering.

With the iPhone's Neural Engine you can hit real-time detection at 30+ FPS. Edge AI feels one step closer to being practical.

References

Originally published in Japanese on Qiita. Want to prototype an app or service with the latest AI, fast? Reach out: rockyshikoku@gmail.com — GitHub / X
