DEV Community

Daisuke Majima

Posted on • Originally published at qiita.com

Real-time object detection on iPhone with YOLO26

What is YOLO26

YOLO26 is the latest object detection model from Ultralytics, released in January 2026. Compared to the previous-generation YOLO11, CPU inference is up to 43% faster, which makes it far more practical on edge devices.

Its biggest feature is end-to-end inference with no NMS. The Non-Maximum Suppression (NMS) post-processing step that used to be mandatory in YOLO is gone — the model outputs the final detections directly.

Model     mAP    CPU inference   Params
YOLO26n   40.9   38.9 ms         2.5M
YOLO26s   48.6   63.3 ms         9.2M
YOLO26m   53.1   155 ms          18.7M

Why run it on iPhone

  • Real-time inference: 30+ FPS using the Neural Engine
  • Privacy: data never leaves the device
  • Offline: works without a network
  • Low latency: no server round-trip, so results come back instantly

Preparing the CoreML model

Option 1: Download a converted model

You can grab a converted model from the CoreML-Models repository.

YOLO26s (18MB)

Option 2: Convert it yourself

pip install ultralytics coremltools==8.1

python -c "
from ultralytics import YOLO
model = YOLO('yolo26s.pt')
model.export(format='coreml', nms=False)
"

Note: coremltools 9.0 crashes in _cast when paired with numpy 2.x, so coremltools 8.1 is recommended.

Implementing the iOS app

Loading the model

import CoreML
import Vision

// modelURL is assumed to point at the compiled model (.mlmodelc) in the app bundle
let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine + GPU + CPU
let mlModel = try MLModel(contentsOf: modelURL, configuration: config)
let vnModel = try VNCoreMLModel(for: mlModel)

Running inference on camera frames

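import AVFoundation
import Vision

// AVCaptureVideoDataOutputSampleBufferDelegate callback, invoked once per camera frame
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    let request = VNCoreMLRequest(model: vnModel) { request, _ in
        self.handleDetections(request)
    }
    // .scaleFill stretches the whole frame to the model input,
    // so normalized box coordinates map straight back onto the frame
    request.imageCropAndScaleOption = .scaleFill

    // The buffer already arrives in portrait, hence .up
    try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)
        .perform([request])
}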

Decoding the NMS-free output

YOLO26's output is a [1, 300, 6] tensor. Each row is [x1, y1, x2, y2, confidence, class_id]: a final, already de-duplicated detection. The only post-processing left is dropping low-confidence rows.

func handleDetections(_ request: VNRequest) {
    guard let results = request.results as? [VNCoreMLFeatureValueObservation],
          let array = results.first?.featureValue.multiArrayValue else { return }

    let shape = array.shape.map { $0.intValue }  // expected: [1, 300, 6]
    guard shape.count == 3, shape[2] == 6 else { return }
    let inputSize: CGFloat = 640  // model input resolution (the default export size)

    for i in 0..<shape[1] {
        let confidence = array[[0, i, 4] as [NSNumber]].floatValue
        guard confidence >= 0.25 else { continue }

        // Box corners come back in input-pixel units; divide to normalize
        let x1 = CGFloat(array[[0, i, 0] as [NSNumber]].floatValue) / inputSize
        let y1 = CGFloat(array[[0, i, 1] as [NSNumber]].floatValue) / inputSize
        let x2 = CGFloat(array[[0, i, 2] as [NSNumber]].floatValue) / inputSize
        let y2 = CGFloat(array[[0, i, 3] as [NSNumber]].floatValue) / inputSize
        let classId = Int(array[[0, i, 5] as [NSNumber]].floatValue)

        // x1, y1, x2, y2 are now normalized coordinates in [0, 1] (top-left origin);
        // convert them to screen coordinates and draw, using classId for the label
    }
}
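
The same decoding can be written as a pure function, which makes it easy to unit-test without a camera or a Core ML model. This is only a sketch: the Detection struct and decodeDetections are hypothetical names that do not appear in the sample app, and the function assumes the tensor has already been copied into a flat, row-major [Float] array.

```swift
import Foundation

/// One decoded detection; the box is normalized to [0, 1] relative to the model input.
struct Detection {
    let box: CGRect
    let confidence: Float
    let classId: Int
}

/// Decodes a row-major [N, 6] buffer of [x1, y1, x2, y2, confidence, class_id] rows.
/// The defaults (640 input, 0.25 threshold) mirror the values used in the article.
func decodeDetections(_ values: [Float],
                      inputSize: Float = 640,
                      threshold: Float = 0.25) -> [Detection] {
    let rowLength = 6
    var detections: [Detection] = []
    // Integer division ignores a trailing partial row instead of reading out of bounds.
    for row in 0..<(values.count / rowLength) {
        let base = row * rowLength
        let confidence = values[base + 4]
        guard confidence >= threshold else { continue }

        // Corners are in input-pixel units; divide to normalize.
        let x1 = CGFloat(values[base] / inputSize)
        let y1 = CGFloat(values[base + 1] / inputSize)
        let x2 = CGFloat(values[base + 2] / inputSize)
        let y2 = CGFloat(values[base + 3] / inputSize)
        detections.append(Detection(
            box: CGRect(x: x1, y: y1, width: x2 - x1, height: y2 - y1),
            confidence: confidence,
            classId: Int(values[base + 5])))
    }
    return detections
}
```

Keeping the decode step free of Vision types also means the threshold and input size can be tuned in tests before touching the capture pipeline.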

With conventional YOLO (v5, v8, v9, v11), an NMS step was required here. In YOLO26, duplicates are already removed inside the model thanks to Dual Assignment, so a confidence threshold is all you need to get the final results.
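
For contrast, the step that older models need and YOLO26 drops looks roughly like the following. This is a generic greedy NMS sketch, not the implementation used by Ultralytics or Core ML; ScoredBox, iou, and the 0.45 IoU threshold are illustrative assumptions.

```swift
import Foundation

struct ScoredBox {
    let rect: CGRect
    let score: Float
}

/// Intersection over union of two rectangles (0 when they do not overlap).
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let w = min(a.maxX, b.maxX) - max(a.minX, b.minX)
    let h = min(a.maxY, b.maxY) - max(a.minY, b.minY)
    guard w > 0, h > 0 else { return 0 }
    let interArea = w * h
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}

/// Greedy NMS: visit boxes from highest to lowest score and keep a box
/// only if it does not overlap an already kept box by more than the threshold.
func nonMaximumSuppression(_ boxes: [ScoredBox],
                           iouThreshold: CGFloat = 0.45) -> [ScoredBox] {
    var kept: [ScoredBox] = []
    for candidate in boxes.sorted(by: { $0.score > $1.score }) {
        if kept.allSatisfy({ iou($0.rect, candidate.rect) <= iouThreshold }) {
            kept.append(candidate)
        }
    }
    return kept
}
```

The loop is quadratic in the number of candidates, which is part of why skipping it helps on the thousands of raw boxes that NMS-based models emit.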

Comparison with NMS-based YOLO

                    YOLO26 (no NMS)                YOLO11 (NMS required)
Output              [1, 300, 6], direct results    [1, 84, 8400], needs decode + NMS
Post-processing     threshold filter only          box decode → NMS → filter
CoreML conversion   simple, nms=False              needs a pipeline, nms=True
Inference speed     up to 43% faster (CPU)         baseline

Drawing bounding boxes

Use CAShapeLayer for fast drawing. Drawing with SwiftUI's ForEach regenerates the views every frame and gets slow.

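import UIKit

class BoundingBoxView {
    let shapeLayer = CAShapeLayer()
    let textLayer = CATextLayer()

    init() {
        // CAShapeLayer fills with opaque black by default; clear it so only the outline shows
        shapeLayer.fillColor = UIColor.clear.cgColor
        shapeLayer.lineWidth = 3    // illustrative value
        textLayer.fontSize = 14     // illustrative value
    }

    func show(frame: CGRect, label: String, color: UIColor) {
        CATransaction.begin()
        CATransaction.setDisableActions(true)  // disable implicit animation

        shapeLayer.path = UIBezierPath(roundedRect: frame, cornerRadius: 10).cgPath
        shapeLayer.strokeColor = color.cgColor
        textLayer.string = label
        // Give the label a frame just above the box; without one it has zero size
        textLayer.frame = CGRect(x: frame.minX, y: frame.minY - 20, width: frame.width, height: 20)

        CATransaction.commit()
    }
}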

Key points:

  • Disable implicit animation with CATransaction.setDisableActions(true). Without it, labels lag one frame behind the box.
  • Pool and reuse ~100 layers to avoid per-frame alloc/dealloc.
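
The pooling idea can be sketched as a small generic container. ReusePool is a hypothetical name, not a type from the sample app; in the app its element would be something like BoundingBoxView, created once up front with its layers already attached to the preview.

```swift
/// A fixed-size pool that hands out pre-built elements instead of allocating per frame.
final class ReusePool<Element> {
    private let elements: [Element]
    private(set) var inUse = 0

    init(capacity: Int, make: () -> Element) {
        // Everything is allocated once, up front.
        elements = (0..<capacity).map { _ in make() }
    }

    /// Call at the start of each frame; all elements become available again.
    func reset() { inUse = 0 }

    /// Returns the next free element, or nil when the pool is exhausted.
    func next() -> Element? {
        guard inUse < elements.count else { return nil }
        defer { inUse += 1 }
        return elements[inUse]
    }

    /// Elements not handed out this frame; the caller would hide these.
    var idle: ArraySlice<Element> { elements[inUse...] }
}
```

Per frame, the caller would call reset(), take one element per detection with next(), and hide whatever remains in idle. Returning nil on exhaustion simply drops detections beyond the pool size rather than allocating more.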

Aligning coordinates with the camera preview

This is where you get stuck the most.

// Set the camera output's videoOrientation to .portrait
// so the pixelBuffer arrives already rotated to portrait.
// (videoOrientation is deprecated as of iOS 17 in favor of videoRotationAngle.)
let connection = videoOutput.connection(with: .video)
connection?.videoOrientation = .portrait

// Pass .up to VNImageRequestHandler (the buffer is already rotated)
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up)

Because the preview is cropped with resizeAspectFill, you have to correct for the difference between the camera's aspect ratio and the screen's aspect ratio.

let cameraRatio = shortSide / longSide        // e.g., 1080/1920
let displayRatio = screenWidth / screenHeight
// Equivalent to (screenHeight / screenWidth) / (longSide / shortSide)
let ratio = cameraRatio / displayRatio

if ratio >= 1 {
    // screen is taller than the camera → scale-correct horizontally
    let offset = (1 - ratio) * (0.5 - rect.minX)
    // ... correct with an affine transform
}
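
The same correction can also be written as a direct aspect-fill mapping, which may be easier to reason about than the ratio and offset form. This is an alternative sketch rather than the sample app's code: aspectFillRect is a hypothetical helper, and it assumes both the normalized box and the view use a top-left origin and that the preview is center-cropped.

```swift
import Foundation

/// Maps a rect normalized to the (already portrait) camera image onto a view
/// that shows the image with aspect-fill: scaled to cover, then center-cropped.
func aspectFillRect(normalized rect: CGRect,
                    imageSize: CGSize,
                    viewSize: CGSize) -> CGRect {
    // Aspect-fill uses the larger scale factor so the image covers the whole view.
    let scale = max(viewSize.width / imageSize.width,
                    viewSize.height / imageSize.height)
    let scaledWidth = imageSize.width * scale
    let scaledHeight = imageSize.height * scale
    // The overflow is cropped equally on both sides, so shift by half of it.
    let offsetX = (viewSize.width - scaledWidth) / 2
    let offsetY = (viewSize.height - scaledHeight) / 2
    return CGRect(x: rect.minX * scaledWidth + offsetX,
                  y: rect.minY * scaledHeight + offsetY,
                  width: rect.width * scaledWidth,
                  height: rect.height * scaledHeight)
}
```

For a 1080×1920 frame shown in a 390×844 view, the frame is scaled to 474.75×844 and about 42 points are cropped from each side, so a box centered in the frame stays centered on screen.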

Sample app

There's a complete sample app in the CoreML-Models repository.

  • YOLO26Demo (sample_apps/YOLO26Demo/) — for NMS-free models
    • Real-time camera inference + FPS/latency display
    • Inference on images from the photo library
    • Per-frame inference on video

Setup:

  1. Download and unzip the model
  2. Drag the .mlpackage into your Xcode project
  3. Build & run on a real device

Any model with output shape [1, N, 6] is loaded automatically, regardless of file name.
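
A shape check along these lines is one way such auto-detection could work. It is a guess at the idea, not the sample app's actual logic: isNMSFreeOutputShape is a hypothetical helper, and in an app the shape would be read from the loaded model's output description (for example via the multi-array constraint of the output feature).

```swift
/// True when an output shape looks like the NMS-free [1, N, 6] layout:
/// batch of 1, N candidate rows, 6 values per row.
func isNMSFreeOutputShape(_ shape: [Int]) -> Bool {
    shape.count == 3 && shape[0] == 1 && shape[1] > 0 && shape[2] == 6
}
```

A YOLO11-style [1, 84, 8400] output fails the check, so the two model families can be told apart without relying on file names.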

Conversion tips

What I learned doing this conversion:

  • coremltools 9.0 + numpy 2.x crashes on _cast → use coremltools 8.1 + numpy<2
  • ultralytics 8.4.31's nms=True CoreML export fails because of a pipeline_coreml bug → with NMS-free YOLO26 you just use nms=False, so it's a non-issue
  • Python 3.14 isn't supported by coremltools → use Python 3.12

Summary

Thanks to its NMS-free design, YOLO26 makes both CoreML conversion and app implementation simpler. Conventional YOLO needed an NMS pipeline and decoding logic; YOLO26 needs only threshold filtering.

With the iPhone's Neural Engine you can hit real-time detection at 30+ FPS. Edge AI feels one step closer to being practical.


Originally published in Japanese on Qiita. Want to prototype an app or service with the latest AI, fast? Reach out: rockyshikoku@gmail.com, GitHub, or X.
