Daisuke Majima

Posted on Jun 2 • Originally published at qiita.com

Type 'dog' to detect a dog: running YOLO-World on iPhone

#ios #machinelearning #computervision #coreml

What it does

Type text like "person, red car, coffee cup" and it detects those objects in the camera view in real time. No class list needed. You can specify any words you like, as many as you like.

This is YOLO-World's "Open-Vocabulary Detection." Presented at CVPR 2024, it's a fundamentally different approach from the conventional "fixed 80 classes" YOLO.

How it works

text input ──→ CLIP Text Encoder ──→ text features [1,80,512]
                                            │
camera feed ──→ YOLO-World Detector ────────┤──→ boxes [1,4,8400]
                                            └──→ scores [1,80,8400]
                                                     │
                                                 NMS + Filter ──→ bounding boxes

YOLO-World combines CLIP's language understanding with YOLO's detection speed. Each query is converted into an embedding vector, and every candidate box is scored by how well the features extracted from the image match that vector.

Changing the query text re-runs only the CLIP encoder, and per-frame camera inference runs only the visual detector. The text encoder therefore never adds cost to the frame loop; it runs once per query change.
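To make the "matching score" idea concrete, here is a minimal sketch of comparing one text embedding against image-region features. It is a simplified illustration rather than the model's actual head: `matchScore` is a hypothetical helper, the vectors are 3-dimensional toys, and both inputs are assumed to be unit length already.

```swift
// Simplified illustration of text-to-image matching.
// Assumption: both vectors are already L2-normalized, so the dot product
// equals their cosine similarity. The real detection head also applies
// learned normalization, scaling, and a sigmoid, which this sketch omits.
func matchScore(_ textEmbedding: [Float], _ regionFeature: [Float]) -> Float {
    precondition(textEmbedding.count == regionFeature.count, "dimension mismatch")
    var dot: Float = 0
    for i in 0..<textEmbedding.count {
        dot += textEmbedding[i] * regionFeature[i]
    }
    return dot
}

// Toy 3-dimensional vectors (the real embeddings are 512-dimensional).
let dogText: [Float] = [1, 0, 0]
let dogRegion: [Float] = [0.8, 0.6, 0]   // points mostly in the same direction
let carRegion: [Float] = [0, 0, 1]       // orthogonal to the text vector

let dogMatch = matchScore(dogText, dogRegion)
let carMatch = matchScore(dogText, carRegion)
print(dogMatch > carMatch)
```

The region whose feature points in the same direction as the text vector gets the higher score, which is all "detection by text" needs at this level.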

Preparing the CoreML models

Download (ready to use)

Download 3 files from the release assets of the CoreML-Models repository:

File	Size	Role
yoloworld_detector.mlpackage	25 MB	YOLO-World V2-S (image → boxes+scores)
clip_text_encoder.mlpackage	121 MB	CLIP ViT-B/32 (text → embedding)
clip_vocab.json	1.6 MB	BPE tokenizer vocabulary

Convert it yourself

pip install ultralytics open_clip_torch coremltools==8.1
python convert_models.py --size s  # s/m/l/x

The conversion script does:

Unwrap YOLO-World V2's Detect head — output boxes [1,4,8400] and scores [1,NC,8400] directly
Convert CLIP's text encoder standalone — patch MultiheadAttention to be CoreML-compatible
Export the BPE vocab as JSON — for the Swift-side tokenizer

iOS implementation

Architecture overview

TextGroundingDetector (ObservableObject)
├── visualModel: MLModel    — YOLO-World detector
├── textEncoder: MLModel    — CLIP text encoder
├── tokenizer: CLIPTokenizer — BPE tokenizer
└── cachedTxtFeats: MLMultiArray — text-feature cache

Encoding text

Run only when the user changes the query; the result is cached.

func updateQueries(_ queryString: String) throws {
    let queries = queryString.split(separator: ",")
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }  // skip blank entries such as "dog, ,cat"

    // tokenize each query → CLIP encoder → 512-dim vector
    let txtFeats = try MLMultiArray(shape: [1, 80, 512], dataType: .float32)

    for (i, query) in queries.prefix(80).enumerated() {
        let tokens = tokenizer.tokenize(query)
        // ... textEncoder.prediction() via MLDictionaryFeatureProvider ...
        // L2-normalize the result and store into txtFeats[i]
    }
    cachedTxtFeats = txtFeats
}

Key points:

Up to 80 queries can be detected at once
L2 normalization is important: CLIP embeddings are compared by cosine similarity, so each text vector must be scaled to unit length
Fast normalization with Accelerate via vDSP_svesq + vDSP_vsmul
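For reference, the normalization itself is only a few lines. The sketch below is plain Swift so it runs anywhere; `l2Normalized` is a hypothetical helper name and the epsilon guard is an added assumption. On device, the two marked steps are what vDSP_svesq and vDSP_vsmul would replace.

```swift
// Plain-Swift L2 normalization of one embedding vector.
// `l2Normalized` and the epsilon guard are assumptions for this sketch.
func l2Normalized(_ v: [Float], epsilon: Float = 1e-12) -> [Float] {
    var sumOfSquares: Float = 0
    for x in v { sumOfSquares += x * x }     // the quantity vDSP_svesq computes
    let norm = sumOfSquares.squareRoot()
    guard norm > epsilon else { return v }   // leave an all-zero vector unchanged
    let inverse = 1 / norm
    return v.map { $0 * inverse }            // the scaling vDSP_vsmul performs
}

let unit = l2Normalized([3, 4])
print(unit)
```

After normalization the vector has length 1, so a dot product with another unit vector is directly their cosine similarity.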

Image preprocessing

YOLO-World requires letterbox preprocessing (keep aspect ratio + padding):

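func preprocessImage(_ cgImage: CGImage) throws -> MLMultiArray {
    let side = 640
    let hw = side * side
    let imgW = cgImage.width, imgH = cgImage.height
    // keep scale, padX, and padY around: post-processing needs them
    let scale = Float(side) / Float(max(imgW, imgH))
    let scaledW = Int(Float(imgW) * scale)
    let scaledH = Int(Float(imgH) * scale)
    let padX = (side - scaledW) / 2
    let padY = (side - scaledH) / 2

    // draw onto a 640x640 RGBA canvas padded with gray (0.5)
    guard let ctx = CGContext(data: nil, width: side, height: side,
                              bitsPerComponent: 8, bytesPerRow: side * 4,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { throw CocoaError(.featureUnsupported) }
    ctx.setFillColor(gray: 0.5, alpha: 1.0)
    ctx.fill(CGRect(x: 0, y: 0, width: side, height: side))
    ctx.draw(cgImage, in: CGRect(x: padX, y: padY, width: scaledW, height: scaledH))
    guard let data = ctx.data else { throw CocoaError(.featureUnsupported) }
    let src = data.bindMemory(to: UInt8.self, capacity: hw * 4)

    // RGBA → CHW Float32 [0,1]; the [1,3,640,640] input shape is an assumption
    let tensor = try MLMultiArray(shape: [1, 3, 640, 640], dataType: .float32)
    let dst = tensor.dataPointer.bindMemory(to: Float.self, capacity: 3 * hw)
    for i in 0..<hw {
        dst[0 * hw + i] = Float(src[i * 4 + 0]) / 255  // R
        dst[1 * hw + i] = Float(src[i * 4 + 1]) / 255  // G
        dst[2 * hw + i] = Float(src[i * 4 + 2]) / 255  // B
    }
    return tensor
}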

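You can't use .scaleFill here, because it stretches the image to fill the input and breaks the aspect ratio. With letterboxing, the model's output coordinates are offset by the padding, so you have to subtract the padding back out when mapping boxes onto the original image.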
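The letterbox arithmetic is easy to get wrong, so it helps to isolate it. The sketch below repeats the same formulas as the preprocessing code in a small value type; `Letterbox` and its member names are hypothetical and not taken from the sample app.

```swift
// Letterbox geometry for a square model input.
// `Letterbox` and its member names are hypothetical, chosen for this sketch.
struct Letterbox {
    let scale: Float
    let padX: Int
    let padY: Int
    let contentW: Float   // width of the resized image inside the canvas
    let contentH: Float   // height of the resized image inside the canvas

    init(imageWidth: Int, imageHeight: Int, side: Int = 640) {
        let s = Float(side) / Float(max(imageWidth, imageHeight))
        let w = Float(imageWidth) * s
        let h = Float(imageHeight) * s
        scale = s
        contentW = w
        contentH = h
        padX = (side - Int(w)) / 2
        padY = (side - Int(h)) / 2
    }

    // Map a point from model space (pixels on the padded canvas)
    // to normalized [0, 1] coordinates of the original image.
    func normalized(x: Float, y: Float) -> (x: Float, y: Float) {
        return (x: (x - Float(padX)) / contentW, y: (y - Float(padY)) / contentH)
    }
}

let box = Letterbox(imageWidth: 1280, imageHeight: 720)
let center = box.normalized(x: 320, y: 320)
print(box.padX, box.padY, center)
```

For a 1280x720 frame the scale is 0.5, the resized image is 640x360, and 140 pixels of gray sit above and below it. The center of the canvas, (320, 320), maps back to the center of the image, (0.5, 0.5).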

Inference and post-processing

let input = try MLDictionaryFeatureProvider(dictionary: [
    "image": tensor,
    "txt_feats": cachedTxtFeats,  // cached text features
])
let output = try visualModel.prediction(from: input)

let boxes = output.featureValue(for: "boxes")!.multiArrayValue!   // [1,4,8400]
let scores = output.featureValue(for: "scores")!.multiArrayValue! // [1,NC,8400]

// size of the resized image inside the letterbox, in pixels
let contentW = Float(imgW) * scale
let contentH = Float(imgH) * scale

for qi in 0..<queryCount {
    for anchor in 0..<8400 {
        // MLMultiArray subscripts return NSNumber, so convert to Float first
        let score = scores[qi * 8400 + anchor].floatValue
        guard score >= threshold else { continue }

        let cx = boxes[0 * 8400 + anchor].floatValue
        let cy = boxes[1 * 8400 + anchor].floatValue
        let bw = boxes[2 * 8400 + anchor].floatValue
        let bh = boxes[3 * 8400 + anchor].floatValue

        // remove padding and convert to normalized coordinates
        let nx = (cx - bw/2 - Float(padX)) / contentW
        let ny = (cy - bh/2 - Float(padY)) / contentH
        let nw = bw / contentW
        let nh = bh / contentH
        // ... record a detection for query qi with rect (nx, ny, nw, nh) ...
    }
}

The output scores are sigmoid values already computed by the BNContrastiveHead, so you can use them directly as confidence.
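The index arithmetic in the loop above relies on the scores tensor being laid out row-major as [1, NC, anchors]. The sketch below shows the same thresholding on a tiny flat array so the layout is easy to check by hand; `Candidate` and `candidates` are hypothetical names, and the plain `[Float]` stands in for the MLMultiArray.

```swift
// Thresholding a flat, row-major copy of a [1, NC, anchorCount] score tensor.
// `Candidate` and `candidates` are hypothetical names for this sketch.
struct Candidate {
    let queryIndex: Int
    let anchor: Int
    let score: Float
}

func candidates(scores: [Float], queryCount: Int, anchorCount: Int,
                threshold: Float) -> [Candidate] {
    var result: [Candidate] = []
    for qi in 0..<queryCount {
        for anchor in 0..<anchorCount {
            // all anchors of query 0 come first, then all anchors of query 1, ...
            let score = scores[qi * anchorCount + anchor]
            if score >= threshold {
                result.append(Candidate(queryIndex: qi, anchor: anchor, score: score))
            }
        }
    }
    return result
}

// 2 queries x 3 anchors
let toyScores: [Float] = [0.1, 0.9, 0.2,
                          0.7, 0.05, 0.3]
let found = candidates(scores: toyScores, queryCount: 2, anchorCount: 3, threshold: 0.5)
print(found.map { "query \($0.queryIndex), anchor \($0.anchor)" })
```

With a threshold of 0.5, only anchor 1 of the first query and anchor 0 of the second survive.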

NMS

Apply NMS per query (per-class):

allDets.sort { $0.confidence > $1.confidence }
var kept: [Int] = []
for i in allDets.indices {
    var suppress = false
    for ki in kept {
        if allDets[i].classIndex == allDets[ki].classIndex
            && iou(allDets[i].rect, allDets[ki].rect) > 0.5 {
            suppress = true; break
        }
    }
    if !suppress { kept.append(i) }
}
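The NMS loop calls an `iou` helper that is not shown. A minimal sketch of intersection-over-union follows, assuming boxes are stored as a top-left origin plus width and height. The `Box` struct is a hypothetical stand-in for whatever rect type the detections use.

```swift
// Intersection-over-union for axis-aligned boxes.
// `Box` is a hypothetical stand-in for the app's rect type.
struct Box {
    let x: Float       // left edge
    let y: Float       // top edge
    let width: Float
    let height: Float
}

func iou(_ a: Box, _ b: Box) -> Float {
    // overlap along each axis, clamped to zero when the boxes do not touch
    let interW = max(0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    let interH = max(0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    let intersection = interW * interH
    let union = a.width * a.height + b.width * b.height - intersection
    return union > 0 ? intersection / union : 0
}

let first = Box(x: 0, y: 0, width: 2, height: 2)
let shifted = Box(x: 1, y: 1, width: 2, height: 2)
let farAway = Box(x: 10, y: 10, width: 1, height: 1)
print(iou(first, shifted), iou(first, farAway))
```

The two overlapping boxes share an area of 1 out of a union of 7, well below the 0.5 threshold used above, so NMS would keep both.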

BPE tokenizer (Swift)

You need to implement CLIP's tokenizer in Swift. Load the BPE merge rules and vocabulary from clip_vocab.json:

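class CLIPTokenizer {
    let contextLength: Int  // 77
    private let encoder: [String: Int]
    // Swift tuples are not Hashable, so each merge pair is keyed as "left right"
    private let bpeRanks: [String: Int]

    init(encoder: [String: Int], bpeRanks: [String: Int], contextLength: Int = 77) {
        self.encoder = encoder
        self.bpeRanks = bpeRanks
        self.contextLength = contextLength
    }

    func tokenize(_ text: String) -> [Int] {
        var tokens = [encoder["<|startoftext|>"]!]
        // lowercase text → split into characters → BPE merge → token IDs
        // ...
        // truncate long inputs so the end token always fits
        tokens = Array(tokens.prefix(contextLength - 1))
        tokens.append(encoder["<|endoftext|>"]!)
        // pad to contextLength (77)
        return tokens + Array(repeating: 0, count: contextLength - tokens.count)
    }
}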
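The elided "BPE merge" step is the core of the tokenizer. Here is a minimal sketch of it for a single word, assuming merge ranks are keyed as "left right" strings. `bpeMerge` is a hypothetical helper, and it is simplified: it skips CLIP's byte-to-unicode mapping and merges one pair per iteration.

```swift
// Greedy BPE merging for one lowercase word.
// `bpeMerge` is a hypothetical helper; ranks are keyed as "left right".
// Simplification: no byte-to-unicode mapping, one merge per iteration.
func bpeMerge(_ word: String, ranks: [String: Int]) -> [String] {
    var symbols = word.map { String($0) }
    guard !symbols.isEmpty else { return [] }
    symbols[symbols.count - 1] += "</w>"   // CLIP marks the end of each word

    while symbols.count > 1 {
        // find the adjacent pair with the lowest (earliest-learned) rank
        var best: (index: Int, rank: Int)? = nil
        for i in 0..<(symbols.count - 1) {
            let key = symbols[i] + " " + symbols[i + 1]
            if let rank = ranks[key], rank < (best?.rank ?? Int.max) {
                best = (index: i, rank: rank)
            }
        }
        guard let merge = best else { break }   // no known pair is left
        symbols[merge.index] += symbols[merge.index + 1]
        symbols.remove(at: merge.index + 1)
    }
    return symbols
}

// Toy merge table; the real one comes from clip_vocab.json.
let toyRanks = ["d o": 0, "do g</w>": 1]
let dogPieces = bpeMerge("dog", ranks: toyRanks)
let dotPieces = bpeMerge("dot", ranks: toyRanks)
print(dogPieces, dotPieces)
```

Each resulting piece is then looked up in the vocabulary to get its token ID. With the toy table, "dog" collapses into a single piece, while "dot" stops at two because no merge rule covers its ending.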

Compared with ordinary YOLO

	YOLO-World (Open-Vocabulary)	YOLO26 (fixed classes)
Detection target	any text	fixed COCO 80 classes
Model setup	Detector + CLIP Encoder + Vocab	one model only
Total size	~148 MB	~18 MB
NMS	implemented app-side	none (End-to-End)
Use for	flexible detection / search / grounding	general object detection
Speed	a bit slower (CLIP overhead)	fastest

Practical scenarios

Search by "red sneakers" — visual search in an e-commerce app
Detect "cracks" — infrastructure inspection
Detect "dog, cat, hamster" simultaneously — pet tracking
Let users freely specify what to detect — deploy without customization

With fixed-class YOLO you had to collect a dataset and retrain to detect "cracks." With YOLO-World you just change the text.

Sample app

A complete sample app is in sample_apps/YOLOWorldDemo/ of the CoreML-Models repository.

3 modes: camera / photo / video
freely change the query in a text field
real-time filtering with a confidence slider
download the models from release assets and drag into Xcode

Conversion tips

Use coremltools 8.1 (9.0 has a bug)
You need to patch torch.nn.MultiheadAttention.forward — CoreML can't convert the default PyTorch MHA well; monkey-patch it to call F.multi_head_attention_forward directly
Use YOLO-World V2 (faster and more accurate than V1)
compute_precision=ct.precision.FLOAT16 halves the model size

Summary

YOLO-World delivers intuitive, powerful object detection where you "specify what you want to detect by text." Run it on the iPhone's Neural Engine and it works server-free, offline, with low latency.

When to use which:

Speed-first, COCO 80 classes is enough → YOLO26
Want to flexibly change targets → YOLO-World

References

Originally published in Japanese on Qiita. GitHub / X
