6 Months Later: Apple Finally Shipped Local Multimodal in Xcode 27 Beta

#programming #swift #ios #ai

A while ago I wrote a full llama.cpp iOS implementation using an ObjC++ bridge because I wanted one thing:

image in -> structured JSON out -> no cloud required.

It worked. It was fast enough. It was also a lot of plumbing:

XCFramework builds
ObjC++ bridge
tokenizer/eval/sampling internals
model + projector file choreography
JSON guardrails everywhere

Now, about 6 months later, Apple has shipped Foundation Models image analysis in the Xcode 27.0 beta, and I can finally call a serious on-device model without maintaining that whole engine room myself.

Reference: "Analyzing images with multimodal prompting" (Apple Developer Documentation, developer.apple.com): analyze and extract information from images by combining them with descriptive text prompts.

With Foundation Models, the core API is basically:

import FoundationModels

@Generable
struct ReceiptExtraction: Codable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

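// Session backed by the default on-device model.
// `cgImage` used below is assumed to be a CGImage of the receipt, loaded elsewhere.
let session = LanguageModelSession(model: .default)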

let response = try await session.respond(
    generating: ReceiptExtraction.self,
    options: GenerationOptions(
        sampling: .random(top: 20, seed: 1111),
        temperature: 0.1,
        maximumResponseTokens: 384
    )
) {
    """
    Extract receipt information for bookkeeping.
    Return schema-compliant structured output only.
    Format fields for QuickBooks ingestion.
    """
    Attachment(cgImage, orientation: .right)
}

let result = response.content
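
Because ReceiptExtraction is declared Codable, getting from the typed result to the actual "JSON out" part should be plain Foundation. Here is a minimal sketch of that last step, assuming nothing beyond JSONEncoder. The `jsonString` helper and the `ReceiptRecord` stand-in are hypothetical names of mine (the stand-in mirrors the fields above without @Generable so the snippet compiles without the FoundationModels framework), and the sample values are made up.

```swift
import Foundation

// Stand-in mirroring the post's ReceiptExtraction fields, minus @Generable,
// so this sketch builds without FoundationModels. Hypothetical type name.
struct ReceiptRecord: Codable, Equatable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

// Hypothetical helper: serialize any Encodable value to a JSON string.
// sortedKeys keeps the key order stable, which makes diffs and logs easier to read.
func jsonString<T: Encodable>(_ value: T) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    let data = try encoder.encode(value)
    return String(decoding: data, as: UTF8.self)
}

// Made-up sample values standing in for a real model response.
let sample = ReceiptRecord(
    vendor_name: "Example Cafe",
    transaction_date: "2025-01-15",
    total_amount: 12.5,
    currency: "USD",
    category: "Meals",
    line_items: ["Coffee", "Bagel"]
)

let json = try jsonString(sample)
print(json)
```

In the real flow you would pass `result` to the same helper instead of the sample value. The snake_case property names then carry through as the JSON keys without any custom CodingKeys.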