Daisuke Majima

Posted on • Originally published at qiita.com

Type 'dog' to detect a dog: running YOLO-World on iPhone

What it does

Type a comma-separated query such as "person, red car, coffee cup" and the app detects those objects in the camera view in real time. There is no fixed class list: you can use any words you like, up to 80 queries at a time.

This is YOLO-World's "Open-Vocabulary Detection." Presented at CVPR 2024, it's a fundamentally different approach from the conventional "fixed 80 classes" YOLO.

How it works

text input ──→ CLIP Text Encoder ──→ text features [1,80,512]
                                            │
camera feed ──→ YOLO-World Detector ────────┤──→ boxes [1,4,8400]
                                            └──→ scores [1,80,8400]
                                                     │
                                                 NMS + Filter ──→ bounding boxes

YOLO-World combines CLIP's language understanding with YOLO's detection speed. It converts the text into embedding vectors, then scores each candidate box by how well those vectors match the features extracted from the image.

Changing the query text re-runs only the CLIP encoder, and per-frame inference runs only the visual detector. The text encoder therefore never runs inside the per-frame loop.
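The matching step can be pictured as a dot product between unit-length vectors. The sketch below is a simplification and not the model's actual head: as noted later in this article, the real scores come out of BNContrastiveHead as sigmoid values. The function names `dot` and `bestQuery` and the 3-dim toy vectors are made up for illustration.

```swift
/// Dot product of two equal-length vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

/// Index and similarity of the query whose embedding best matches `imageFeature`.
/// Assumes every vector is already L2-normalized, so the dot product
/// equals the cosine similarity.
func bestQuery(for imageFeature: [Float],
               among textEmbeddings: [[Float]]) -> (index: Int, similarity: Float)? {
    let sims = textEmbeddings.map { dot($0, imageFeature) }
    guard let best = sims.indices.max(by: { sims[$0] < sims[$1] }) else { return nil }
    return (best, sims[best])
}

// Toy 3-dim unit vectors standing in for 512-dim CLIP embeddings.
let queryEmbeddings: [[Float]] = [[1, 0, 0], [0, 1, 0]]  // e.g. "dog", "cat"
let regionFeature: [Float] = [0, 1, 0]
if let match = bestQuery(for: regionFeature, among: queryEmbeddings) {
    print("best query index: \(match.index), similarity: \(match.similarity)")
}
```

Because the text embeddings depend only on the query, they can be computed once and reused for every frame; only the image side changes.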

Preparing the CoreML models

Download (ready to use)

Download 3 files from the release assets of the CoreML-Models repository:

| File | Size | Role |
|---|---|---|
| yoloworld_detector.mlpackage | 25 MB | YOLO-World V2-S (image → boxes + scores) |
| clip_text_encoder.mlpackage | 121 MB | CLIP ViT-B/32 (text → embedding) |
| clip_vocab.json | 1.6 MB | BPE tokenizer vocabulary |

Convert it yourself

pip install ultralytics open_clip_torch coremltools==8.1
python convert_models.py --size s  # s/m/l/x

The conversion script does:

  1. Unwrap YOLO-World V2's Detect head — output boxes [1,4,8400] and scores [1,NC,8400] directly
  2. Convert CLIP's text encoder standalone — patch MultiheadAttention to be CoreML-compatible
  3. Export the BPE vocab as JSON — for the Swift-side tokenizer

iOS implementation

Architecture overview

TextGroundingDetector (ObservableObject)
├── visualModel: MLModel    — YOLO-World detector
├── textEncoder: MLModel    — CLIP text encoder
├── tokenizer: CLIPTokenizer — BPE tokenizer
└── cachedTxtFeats: MLMultiArray — text-feature cache

Encoding text

Run only when the user changes the query; the result is cached.

// marked `throws` because MLMultiArray(shape:dataType:) and prediction can throw
func updateQueries(_ queryString: String) throws {
    // split on commas, trim whitespace, and drop empty entries
    let queries = queryString.split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }

    // tokenize each query → CLIP encoder → 512-dim vector
    // (the detector takes a fixed [1, 80, 512] text-feature input)
    let txtFeats = try MLMultiArray(shape: [1, 80, 512], dataType: .float32)

    for (i, query) in queries.prefix(80).enumerated() {
        let tokens = tokenizer.tokenize(query)
        // ... textEncoder.prediction() via MLDictionaryFeatureProvider ...
        // L2-normalize the result and store into txtFeats[i]
    }
    cachedTxtFeats = txtFeats
}

Key points:

  • Up to 80 queries can be detected at once
  • L2 normalization is important — CLIP outputs live in a normalized cosine-similarity space
  • Fast normalization with Accelerate via vDSP_svesq + vDSP_vsmul
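The normalization mentioned in the key points can be sketched as follows. This is a minimal version under stated assumptions, not the sample app's code: the function name `l2Normalized` is hypothetical, and the plain-Swift fallback is included only so the snippet also builds where Accelerate is unavailable.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Returns `v` scaled to unit L2 norm. Empty and all-zero vectors are returned unchanged.
func l2Normalized(_ v: [Float]) -> [Float] {
    guard !v.isEmpty else { return v }

    // Step 1: sum of squares (vDSP_svesq on Apple platforms).
    var sumOfSquares: Float = 0
    #if canImport(Accelerate)
    vDSP_svesq(v, 1, &sumOfSquares, vDSP_Length(v.count))
    #else
    sumOfSquares = v.reduce(0) { $0 + $1 * $1 }
    #endif
    guard sumOfSquares > 0 else { return v }

    // Step 2: multiply every element by 1 / norm (vDSP_vsmul on Apple platforms).
    let inverseNorm = 1 / sumOfSquares.squareRoot()
    #if canImport(Accelerate)
    var scalar = inverseNorm
    var out = [Float](repeating: 0, count: v.count)
    vDSP_vsmul(v, 1, &scalar, &out, 1, vDSP_Length(v.count))
    return out
    #else
    return v.map { $0 * inverseNorm }
    #endif
}

let unit = l2Normalized([3, 4])
print(unit)  // approximately [0.6, 0.8]
```

In the app, the same two calls would presumably run on each 512-dim encoder output before it is copied into the cached text-feature array.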

Image preprocessing

YOLO-World requires letterbox preprocessing (keep aspect ratio + padding):

func preprocessImage(_ cgImage: CGImage) throws -> MLMultiArray {
    let imgW = cgImage.width
    let imgH = cgImage.height
    let scale = Float(640) / Float(max(imgW, imgH))
    let scaledW = Int(Float(imgW) * scale)
    let scaledH = Int(Float(imgH) * scale)
    let padX = (640 - scaledW) / 2
    let padY = (640 - scaledH) / 2
    // keep scale, padX and padY: post-processing needs them to undo the letterbox

    // ... create ctx (a 640x640 RGBA8 CGContext) and tensor (a CHW Float32 MLMultiArray) ...

    // draw onto a 640x640 canvas padded with gray (0.5)
    ctx.setFillColor(gray: 0.5, alpha: 1.0)
    ctx.fill(CGRect(x: 0, y: 0, width: 640, height: 640))
    ctx.draw(cgImage, in: CGRect(x: padX, y: padY, width: scaledW, height: scaledH))

    // RGBA → CHW Float32 [0,1]
    // src: the context's RGBA bytes; dst: Float32 pointer into tensor
    let hw = 640 * 640
    for i in 0..<hw {
        dst[0 * hw + i] = Float(src[i * 4 + 0]) / 255  // R
        dst[1 * hw + i] = Float(src[i * 4 + 1]) / 255  // G
        dst[2 * hw + i] = Float(src[i * 4 + 2]) / 255  // B
    }
    return tensor
}

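A stretch-to-fit resize such as .scaleFill can't be used here, because the model expects the aspect ratio to be preserved. The letterbox also shifts coordinates by the padding, so you have to subtract the padding from the output coordinates and divide by the scaled image size.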
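The letterbox arithmetic and its inverse can be kept together in one small value type. The following is a sketch under assumptions: the `Letterbox` type and its member names are hypothetical, and boxes are assumed to arrive in center format (cx, cy, w, h) in model-input pixels, as the post-processing code in this article implies.

```swift
/// Letterbox geometry for fitting a source image into a square model input.
struct Letterbox {
    let scale: Float
    let padX: Float
    let padY: Float
    let imageWidth: Float
    let imageHeight: Float

    init(imageWidth: Int, imageHeight: Int, inputSize: Int = 640) {
        // Scale so the longer side exactly fills the input.
        let s = Float(inputSize) / Float(max(imageWidth, imageHeight))
        let scaledW = Int(Float(imageWidth) * s)
        let scaledH = Int(Float(imageHeight) * s)
        scale = s
        // Center the scaled image; the remaining space is split evenly.
        padX = Float((inputSize - scaledW) / 2)
        padY = Float((inputSize - scaledH) / 2)
        self.imageWidth = Float(imageWidth)
        self.imageHeight = Float(imageHeight)
    }

    /// Maps a center-format box in model-input pixels to a normalized
    /// (x, y, width, height) box relative to the original image.
    func normalizedRect(cx: Float, cy: Float, w: Float, h: Float)
        -> (x: Float, y: Float, width: Float, height: Float) {
        let contentW = imageWidth * scale
        let contentH = imageHeight * scale
        return ((cx - w / 2 - padX) / contentW,
                (cy - h / 2 - padY) / contentH,
                w / contentW,
                h / contentH)
    }
}

// A 1280x720 frame is scaled by 0.5 to 640x360 and padded 140 px top and bottom.
let letterbox = Letterbox(imageWidth: 1280, imageHeight: 720)
let full = letterbox.normalizedRect(cx: 320, cy: 320, w: 640, h: 360)
print(letterbox.padY, full)
```

A box that covers the whole scaled frame maps back to the unit rectangle, which makes a handy sanity check for the padding math.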

Inference and post-processing

let input = try MLDictionaryFeatureProvider(dictionary: [
    "image": tensor,
    "txt_feats": cachedTxtFeats,  // cached text features
])
let output = try visualModel.prediction(from: input)

let boxes = output.featureValue(for: "boxes")!.multiArrayValue!   // [1,4,8400]
let scores = output.featureValue(for: "scores")!.multiArrayValue! // [1,NC,8400]

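for qi in 0..<queryCount {
    for anchor in 0..<8400 {
        // MLMultiArray's Int subscript returns NSNumber, so convert to Float
        let score = scores[qi * 8400 + anchor].floatValue
        guard score >= threshold else { continue }

        // center-format box (cx, cy, w, h) in 640x640 input pixels
        let cx = boxes[0 * 8400 + anchor].floatValue
        let cy = boxes[1 * 8400 + anchor].floatValue
        let bw = boxes[2 * 8400 + anchor].floatValue
        let bh = boxes[3 * 8400 + anchor].floatValue

        // remove padding and convert to normalized coordinates
        let contentW = Float(imgW) * scale
        let contentH = Float(imgH) * scale
        let nx = (cx - bw/2 - Float(padX)) / contentW
        let ny = (cy - bh/2 - Float(padY)) / contentH
        let nw = bw / contentW
        let nh = bh / contentH
        // ... append (qi, score, nx, ny, nw, nh) to allDets ...
    }
}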

The output scores are sigmoid values already computed by the BNContrastiveHead, so you can use them directly as confidence.

NMS

Apply NMS per query (per-class):

allDets.sort { $0.confidence > $1.confidence }
var kept: [Int] = []
for i in allDets.indices {
    var suppress = false
    for ki in kept {
        if allDets[i].classIndex == allDets[ki].classIndex
            && iou(allDets[i].rect, allDets[ki].rect) > 0.5 {
            suppress = true; break
        }
    }
    if !suppress { kept.append(i) }
}
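The NMS loop relies on an `iou` helper that is not shown. A minimal sketch of one, assuming the detections store their boxes as `CGRect`:

```swift
import Foundation

/// Intersection-over-union of two rectangles; 0 when they do not overlap.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let overlap = a.intersection(b)
    // intersection(_:) returns the null rectangle for disjoint inputs.
    guard !overlap.isNull else { return 0 }
    let overlapArea = overlap.width * overlap.height
    let unionArea = a.width * a.height + b.width * b.height - overlapArea
    return unionArea > 0 ? overlapArea / unionArea : 0
}

let first = CGRect(x: 0, y: 0, width: 2, height: 2)
let second = CGRect(x: 1, y: 1, width: 2, height: 2)
print(iou(first, second))  // overlap 1, union 7
```

With the 0.5 threshold used above, two boxes for the same query are merged only when they overlap by more than half of their combined area.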

BPE tokenizer (Swift)

You need to implement CLIP's tokenizer in Swift. Load the BPE merge rules and vocabulary from clip_vocab.json:

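class CLIPTokenizer {
    let contextLength = 77
    private let encoder: [String: Int]   // token string → token ID
    // Swift tuples are not Hashable, so merge ranks are keyed by "first second"
    private let bpeRanks: [String: Int]

    init(encoder: [String: Int], bpeRanks: [String: Int]) {
        self.encoder = encoder
        self.bpeRanks = bpeRanks
    }

    func tokenize(_ text: String) -> [Int] {
        var tokens = [encoder["<|startoftext|>"]!]
        // lowercase text → split into characters → BPE merge → token IDs
        // ...
        // truncate so the end token still fits within contextLength
        tokens = Array(tokens.prefix(contextLength - 1))
        tokens.append(encoder["<|endoftext|>"]!)
        // pad to contextLength (77)
        return tokens + Array(repeating: 0, count: contextLength - tokens.count)
    }
}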
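The elided "BPE merge" step can be sketched as a greedy loop that repeatedly merges the adjacent pair with the lowest rank. This is an illustrative sketch, not the sample app's tokenizer: the function name `bpeMerge`, the "first second" key format, and the two-entry rank table are assumptions, and CLIP's byte-level character mapping is left out.

```swift
/// Applies BPE merges to a single lowercase word and returns its sub-word symbols.
/// `ranks` maps "first second" pairs to merge priority (lower merges earlier).
func bpeMerge(_ word: String, ranks: [String: Int]) -> [String] {
    var symbols = word.map { String($0) }
    guard !symbols.isEmpty else { return [] }
    // CLIP-style BPE marks the end of a word on its last symbol.
    symbols[symbols.count - 1] += "</w>"

    while symbols.count > 1 {
        // Find the adjacent pair with the lowest rank.
        var best: (index: Int, rank: Int)? = nil
        for i in 0..<(symbols.count - 1) {
            let key = symbols[i] + " " + symbols[i + 1]
            if let rank = ranks[key], rank < (best?.rank ?? Int.max) {
                best = (i, rank)
            }
        }
        guard let chosen = best else { break }  // no known pair left

        // Merge every occurrence of that pair, scanning left to right.
        let first = symbols[chosen.index]
        let second = symbols[chosen.index + 1]
        var merged: [String] = []
        var i = 0
        while i < symbols.count {
            if i < symbols.count - 1 && symbols[i] == first && symbols[i + 1] == second {
                merged.append(first + second)
                i += 2
            } else {
                merged.append(symbols[i])
                i += 1
            }
        }
        symbols = merged
    }
    return symbols
}

// Toy rank table: merge "d"+"o" first, then "do"+"g</w>".
let toyRanks = ["d o": 0, "do g</w>": 1]
print(bpeMerge("dog", ranks: toyRanks))
```

Each resulting symbol would then be looked up in the vocabulary to obtain its token ID.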

Compared with ordinary YOLO

| | YOLO-World (Open-Vocabulary) | YOLO26 (fixed classes) |
|---|---|---|
| Detection target | any text | fixed COCO 80 classes |
| Model setup | Detector + CLIP Encoder + Vocab | one model only |
| Total size | ~148 MB | ~18 MB |
| NMS | implemented app-side | none (End-to-End) |
| Use for | flexible detection / search / grounding | general object detection |
| Speed | a bit slower (CLIP overhead) | fastest |

Practical scenarios

  • Search by "red sneakers" — visual search in an e-commerce app
  • Detect "cracks" — infrastructure inspection
  • Detect "dog, cat, hamster" simultaneously — pet tracking
  • Let users freely specify what to detect — deploy without customization

With fixed-class YOLO you have to collect a dataset and retrain to detect "cracks." With YOLO-World you just change the text.

Sample app

A complete sample app is in sample_apps/YOLOWorldDemo/ of the CoreML-Models repository.

  • 3 modes: camera / photo / video
  • freely change the query in a text field
  • real-time filtering with a confidence slider
  • download the models from release assets and drag into Xcode

Conversion tips

  • Use coremltools 8.1 (9.0 has a bug)
  • You need to patch torch.nn.MultiheadAttention.forward — CoreML can't convert the default PyTorch MHA well; monkey-patch it to call F.multi_head_attention_forward directly
  • Use YOLO-World V2 (faster and more accurate than V1)
  • compute_precision=ct.precision.FLOAT16 roughly halves the model size compared with FLOAT32

Summary

YOLO-World delivers intuitive, powerful object detection where you "specify what you want to detect by text." Run it on the iPhone's Neural Engine and it works server-free, offline, with low latency.

When to use which:

  • Speed-first, COCO 80 classes is enough → YOLO26
  • Want to flexibly change targets → YOLO-World


Originally published in Japanese on Qiita.
