DEV Community

Timothy Fosteman

6 months later: Apple Finally Shipped Local Multimodal in Xcode 27 Beta

A while ago I wrote a full llama.cpp iOS implementation using an Objective-C++ bridge because I wanted one thing:

image in -> structured JSON out -> no cloud required.

It worked. It was fast enough. It was also a lot of plumbing:

  • XCFramework builds
  • ObjC++ bridge
  • tokenizer/eval/sampling internals
  • model + projector file choreography
  • JSON guardrails everywhere
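
To give a sense of what a "JSON guardrail" meant in practice, here is a minimal sketch, not the code from my original project. It assumes the model returns free-form text with a JSON object somewhere inside it; the `Receipt` type, the `GuardrailError` cases, and the `parseReceipt(from:)` helper are all hypothetical names made up for illustration.

```swift
import Foundation

// Hypothetical, trimmed-down receipt schema used only for this sketch.
struct Receipt: Codable {
    var vendor_name: String
    var total_amount: Double
    var currency: String
}

// Hypothetical error cases for the two checks below.
enum GuardrailError: Error {
    case noJSONObject
    case invalidAmount
}

/// Cuts the outermost {...} span out of raw model text, decodes it,
/// and applies one sanity check on the decoded value.
func parseReceipt(from raw: String) throws -> Receipt {
    guard let start = raw.firstIndex(of: "{"),
          let end = raw.lastIndex(of: "}"),
          start < end else {
        throw GuardrailError.noJSONObject
    }
    let json = String(raw[start...end])
    // Throws a DecodingError if keys are missing or have the wrong type.
    let receipt = try JSONDecoder().decode(Receipt.self, from: Data(json.utf8))
    guard receipt.total_amount >= 0 else {
        throw GuardrailError.invalidAmount
    }
    return receipt
}

// Made-up model output with chatter around the JSON payload.
let raw = """
Sure! Here is the JSON:
{"vendor_name":"Cafe Uno","total_amount":12.5,"currency":"EUR"}
Hope that helps.
"""
let parsed = try parseReceipt(from: raw)
print(parsed.vendor_name, parsed.total_amount, parsed.currency)
```

Every call site that touched model output needed some variant of this, plus a retry or fallback path when decoding failed.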

Now, about 6 months later, Apple dropped Foundation Models image analysis in Xcode 27.0 beta, and I can finally call a serious on-device model without maintaining that whole engine room myself.

Analyzing images with multimodal prompting | Apple Developer Documentation

Analyze and extract information from images by combining them with descriptive text prompts.

developer.apple.com

With Foundation Models, the core API is basically:

import FoundationModels

@Generable
struct ReceiptExtraction: Codable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

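// Uses the system's default on-device model.
let session = LanguageModelSession(model: .default)

let response = try await session.respond(
    generating: ReceiptExtraction.self,
    options: GenerationOptions(
        // A fixed seed and low temperature keep extraction output repeatable.
        sampling: .random(top: 20, seed: 1111),
        temperature: 0.1,
        maximumResponseTokens: 384
    )
) {
    """
    Extract receipt information for bookkeeping.
    Return schema-compliant structured output only.
    Format fields for QuickBooks ingestion.
    """
    // cgImage is a CGImage of the receipt, loaded elsewhere in the app.
    Attachment(cgImage, orientation: .right)
}

// response.content is a typed ReceiptExtraction value.
let result = response.content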

Receipt image in → QuickBooks-ready JSON out.

No bridge.
No gguf.
No mmproj.
No custom decode loop.
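
Because ReceiptExtraction is declared Codable, turning the typed result into actual JSON should be a plain JSONEncoder call. The sketch below uses a stand-in struct without the @Generable macro so it compiles without the Foundation Models framework, and the field values are invented placeholders rather than real model output. Whether this shape matches what QuickBooks expects on import is not something the sketch checks.

```swift
import Foundation

// Stand-in for the @Generable type above, minus the macro,
// so this sketch compiles without FoundationModels.
struct ReceiptExtraction: Codable {
    var vendor_name: String
    var transaction_date: String
    var total_amount: Double
    var currency: String
    var category: String
    var line_items: [String]
}

// Invented placeholder values; in the app this would be `response.content`.
let result = ReceiptExtraction(
    vendor_name: "Cafe Uno",
    transaction_date: "2025-01-15",
    total_amount: 12.5,
    currency: "EUR",
    category: "Meals",
    line_items: ["Espresso", "Croissant"]
)

let encoder = JSONEncoder()
// Sorted keys give a stable key order, which makes diffs and tests easier.
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let jsonData = try encoder.encode(result)
let jsonString = String(decoding: jsonData, as: UTF8.self)
print(jsonString)
```

The snake_case property names end up as the JSON keys unchanged, so no CodingKeys mapping is needed here.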

Before

  • llama.cpp vendor management
  • ObjC++ wrappers and thread safety
  • bespoke schema/prompt failover handling
  • app startup warmups with model files in bundle

Now

  • native LanguageModelSession
  • native Attachment(...) for images
  • native structured generation with @Generable
  • native prewarm and model availability checks
  • native profiling in Instruments.app
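
For the availability check and prewarm items, a rough sketch might look like the following. It relies on `SystemLanguageModel.default.availability` and `LanguageModelSession.prewarm()` as I understand them from the Foundation Models framework; the helper names `onDeviceModelStatus()` and `makeWarmSession()` are my own, and the `canImport` guard is only there so the snippet still compiles on toolchains without the framework.

```swift
#if canImport(FoundationModels)
import FoundationModels
#endif

/// Hypothetical helper: returns a short status string for the on-device model.
func onDeviceModelStatus() -> String {
    #if canImport(FoundationModels)
    if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *) {
        switch SystemLanguageModel.default.availability {
        case .available:
            return "available"
        case .unavailable(let reason):
            // The reason covers cases such as an ineligible device
            // or a model that has not finished downloading.
            return "unavailable: \(reason)"
        }
    }
    return "unavailable: OS version too old"
    #else
    return "unavailable: FoundationModels is not in this SDK"
    #endif
}

#if canImport(FoundationModels)
/// Hypothetical helper: creates a session and asks the system to load
/// the model ahead of the first request.
@available(iOS 26.0, macOS 26.0, visionOS 26.0, *)
func makeWarmSession() -> LanguageModelSession {
    let session = LanguageModelSession()
    session.prewarm()
    return session
}
#endif

print(onDeviceModelStatus())
```

In an app, the status check would typically gate the UI entry point, and the warm session would be created shortly before the user is likely to scan a receipt.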

And that is exactly where it should have been from day one of fiddling with multimodal inference.
