ArshTechPro

Posted on Jun 11

WWDC 2026 - Apple Just Opened the Foundation Models Framework to Any LLM Provider

#ai #ios #llm #mobile

Until WWDC 2026, the Foundation Models framework had one rule: Apple's on-device model or nothing. That rule is gone.

Session 339 introduces a public protocol layer that any LLM provider can implement, whether the model behind it is a cloud API, an open-source local model, or a fine-tune you host yourself. Once a provider ships a conforming Swift package, your existing LanguageModelSession code works with it unchanged. No rewrites. No new session abstractions. A one-line swap.

This article is not a transcript summary. It explains the decisions you now have to make as a developer building on Foundation Models in iOS 27 / macOS 27.

The Situation Before This

The Foundation Models framework launched in iOS 26 and was genuinely useful: free, private, offline, no API key, structured output via @Generable. But it had a hard ceiling. The on-device model is about 3B parameters, not designed for general world knowledge, and has no long-context mode worth speaking of. If your feature needed anything beyond what that model handled, you were stitching together a completely separate SDK — URLSession calls, your own message history management, your own error handling, duplicated everything.

The new protocol layer closes that gap. You write your session logic once against the Foundation Models API, and the model behind it becomes a deployment decision rather than an architectural one.

What Is Actually Available Now

Four model sources, one session API:

// The existing on-device model — free, private, offline
let model = SystemLanguageModel()

// Apple's cloud model — 32K context, reasoning, Private Cloud Compute privacy guarantees
// let model = PrivateCloudComputeLanguageModel()

// A model you package and distribute via Swift Package Manager
// let model = try await CoreAILanguageModel(resourcesAt: modelURL)

// Any MLX-format model from HuggingFace
// let model = MLXLanguageModel(modelID: "mlx-community/my-model")

let session = LanguageModelSession(model: model)
let response = try await session.respond(to: "Summarize this contract.")

And third-party providers are already on board. Gemini models are available through the Firebase Apple SDK and plug directly into the Foundation Models framework through the same API, so the on-device Apple model and cloud-hosted Gemini sit behind the same interface. Anthropic is also listed as a launch partner.

The practical upshot: you can build your feature against SystemLanguageModel for dev and testing, then swap to Claude or Gemini for production without touching your session code.

The Decision You Now Have to Make

Before writing any protocol conformance, figure out which tier your use case actually belongs to.

The on-device model is good at focused tasks: summarization, extraction, classification, short-form generation, anything where you control the prompt tightly and the output is bounded. A 20B sparse model on a phone is enough to handle a meaningful slice of in-app AI tasks — structured extraction, classification, embedded summarization, tool routing. Apps that previously paid for cloud calls to do this can stop.

The cloud tier is for everything else: long context, open-ended reasoning, conversations that run for many turns, tasks that need current world knowledge. Frontier reasoning, agentic loops, vision-language across many images — all still cheaper, more capable, or both via a cloud LLM. The build decision becomes "what can't run on-device" rather than "how big a model do I need."
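The "what can't run on-device" question can be written down as a small routing function. The sketch below is illustrative only: `ModelTier`, `FeatureRequirements`, `AssumedLimits`, and `chooseTier(for:)` are hypothetical names, and the 4,096-token limit is an assumed placeholder, not a figure from the session.

```swift
// Hypothetical tiers: the on-device model versus any cloud-hosted model.
enum ModelTier: Equatable {
    case onDevice
    case cloud
}

// Illustrative requirements for one feature; every field is an assumption.
struct FeatureRequirements {
    var estimatedContextTokens: Int
    var needsCurrentWorldKnowledge: Bool
    var mustWorkOffline: Bool
}

// Placeholder limit for illustration; measure against the real model.
enum AssumedLimits {
    static let onDeviceContextTokens = 4_096
}

// Start from on-device and move to the cloud only when a requirement forces it.
func chooseTier(for feature: FeatureRequirements) -> ModelTier {
    if feature.mustWorkOffline { return .onDevice }
    if feature.needsCurrentWorldKnowledge { return .cloud }
    return feature.estimatedContextTokens <= AssumedLimits.onDeviceContextTokens
        ? .onDevice
        : .cloud
}
```

A short classification prompt would route to `.onDevice`, while a long document or a question about recent events would route to `.cloud`. Real apps will likely need more inputs (cost, latency, privacy policy), but the default direction stays the same.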

That framing matters because implementing a custom LanguageModelExecutor is real work. Do not do it if the existing models cover your use case.

If You Are Building a Provider Integration

This is the engineering meat of session 339. Two protocols to implement, one Swift package to ship.

The Two Protocols

LanguageModel describes what your model can do. LanguageModelExecutor is where generation actually happens.

// LanguageModel: declare capabilities and hand the session a configuration
public struct MyLanguageModel: LanguageModel {
    public typealias Executor = MyLanguageModelExecutor

    public var capabilities: LanguageModelCapabilities {
        LanguageModelCapabilities(capabilities: [.toolCalling, .guidedGeneration, .reasoning])
    }

    public var executorConfiguration: Executor.Configuration {
        Executor.Configuration(/* API endpoint, auth, model variant, etc. */)
    }
}

// LanguageModelExecutor: translate the framework's request into your model's wire format
public struct MyLanguageModelExecutor: LanguageModelExecutor {
    public typealias Model = MyLanguageModel

    public init(configuration: Configuration) throws { }

    public func respond(
        to request: LanguageModelExecutorGenerationRequest,
        model: MyLanguageModel,
        streamingInto channel: LanguageModelExecutorGenerationChannel
    ) async throws { }
}

The LanguageModel / LanguageModelExecutor split is deliberate. A single model type can have multiple executor configurations — for example, a fast tier and a quality tier backed by the same model family. The session caches executor instances per configuration, so if a developer creates two sessions with identical configurations, they share an executor.
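To make the caching behavior concrete, here is a minimal sketch of a cache keyed by configuration. It is not the framework's implementation: `ExecutorConfiguration`, `Executor`, and `ExecutorCache` are hypothetical stand-ins, and the only point is that a Hashable configuration lets two identical requests resolve to the same executor instance.

```swift
import Foundation

// Hypothetical stand-in for an executor configuration.
// Hashable, so equal configurations map to the same cache slot.
struct ExecutorConfiguration: Hashable {
    var endpoint: String
    var modelVariant: String
}

// Hypothetical stand-in for an executor. A class, so identity can be compared.
final class Executor {
    let configuration: ExecutorConfiguration
    init(configuration: ExecutorConfiguration) {
        self.configuration = configuration
    }
}

// Hands out one executor per distinct configuration.
final class ExecutorCache {
    private var executors: [ExecutorConfiguration: Executor] = [:]
    private let lock = NSLock()

    func executor(for configuration: ExecutorConfiguration) -> Executor {
        lock.lock()
        defer { lock.unlock() }
        if let existing = executors[configuration] {
            return existing
        }
        let created = Executor(configuration: configuration)
        executors[configuration] = created
        return created
    }
}
```

The practical consequence for provider authors: if a configuration carries a field that changes on every call (a timestamp, a fresh UUID), no two sessions would ever share an executor under a scheme like this, so keep configurations limited to values that actually distinguish executors.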

Prewarm

prewarm is called before the first request arrives. For a local model that loads weights from disk, use it to get weights into memory so the user does not wait on the first generation. For a remote API, you might warm a connection pool here or do nothing at all.

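// Assumes loadedModel lives in a reference type (or the executor is a class):
// a struct method cannot assign to a stored property unless it is `mutating`.
func prewarm(transcript: Transcript) {
    // Best effort: if loading fails here, respond(...) retries below.
    loadedModel = try? loadWeights()
}

func respond(to request: ..., streamingInto channel: ...) async throws {
    // Fall back to loading on demand if prewarm did not run or failed.
    let weights = try loadedModel ?? loadWeights()
    // generate
}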
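Because the executor shown earlier is declared as a struct, mutable state such as loaded weights has to live somewhere with reference semantics. One possible shape is a small load-once holder that both prewarm and respond go through. `Weights` and `WeightsStore` are hypothetical names, and the loader closure stands in for whatever actually reads the model from disk.

```swift
import Foundation

// Hypothetical stand-in for loaded model weights.
struct Weights {
    let byteCount: Int
}

// Loads weights at most once and shares them between prewarm and generation.
final class WeightsStore {
    private let lock = NSLock()
    private var cached: Weights?
    private(set) var loadCount = 0
    private let loader: () throws -> Weights

    init(loader: @escaping () throws -> Weights) {
        self.loader = loader
    }

    // Best effort: a failed prewarm is ignored and retried on first use.
    func prewarm() {
        _ = try? weights()
    }

    // Returns cached weights, loading them first if necessary.
    func weights() throws -> Weights {
        lock.lock()
        defer { lock.unlock() }
        if let cached = cached {
            return cached
        }
        let loaded = try loader()
        loadCount += 1
        cached = loaded
        return loaded
    }
}
```

With this shape, calling `prewarm()` and then `weights()` runs the loader once, and skipping `prewarm()` entirely still works because `weights()` loads on demand.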

Reading the Transcript

The framework passes conversation history as a typed Transcript, not a raw array of role/content strings. Your executor maps these to whatever your model expects.

// Transcript.Entry cases you will encounter:
// .instructions  → system prompt
// .prompt        → user message
// .toolCalls     → model-initiated tool invocations
// .toolOutput    → results from those tool calls
// .response      → assistant turn

This typed representation is cleaner than raw JSON message arrays and makes multi-turn context, tool call history, and system instructions all unambiguous. If your model's API uses OpenAI-compatible message formats, the mapping is mechanical.
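As a rough illustration of how mechanical that mapping is, the sketch below converts transcript-like entries into OpenAI-style chat messages. `Entry`, `ToolCall`, and `ChatMessage` are simplified stand-ins written for this example, not the framework's real types, and the associated values are assumptions. The role strings (system, user, assistant, tool) follow the OpenAI chat format.

```swift
// Hypothetical stand-ins; the framework's real Transcript types may differ.
struct ToolCall: Equatable {
    let id: String
    let name: String
    let argumentsJSON: String
}

enum Entry {
    case instructions(String)
    case prompt(String)
    case toolCalls([ToolCall])
    case toolOutput(callID: String, content: String)
    case response(String)
}

// One message in an OpenAI-style chat payload, reduced to the fields used here.
struct ChatMessage: Equatable {
    var role: String
    var content: String
    var toolCalls: [ToolCall] = []
    var toolCallID: String? = nil
}

// Maps each entry to exactly one chat message, preserving order.
func chatMessages(from entries: [Entry]) -> [ChatMessage] {
    var messages: [ChatMessage] = []
    for entry in entries {
        switch entry {
        case .instructions(let text):
            messages.append(ChatMessage(role: "system", content: text))
        case .prompt(let text):
            messages.append(ChatMessage(role: "user", content: text))
        case .toolCalls(let calls):
            messages.append(ChatMessage(role: "assistant", content: "", toolCalls: calls))
        case .toolOutput(let callID, let content):
            messages.append(ChatMessage(role: "tool", content: content, toolCallID: callID))
        case .response(let text):
            messages.append(ChatMessage(role: "assistant", content: text))
        }
    }
    return messages
}
```

The exhaustive switch is the useful part: if a later release adds an entry case, the compiler flags the mapping instead of letting the new case be dropped silently.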

Streaming Output Back

The channel is how your executor sends generation back to the session. Three phases: metadata, token usage, then text deltas.

func respond(to request: ..., streamingInto channel: ...) async throws {
    // Tell the framework what model/request handled this
    await channel.send(.response(action: .updateMetadata([
        "modelID": "my-model-2026",
        "requestID": request.id.uuidString
    ])))

    // Report input token count before generating
    await channel.send(.response(action: .updateUsage(
        input: .init(totalTokenCount: promptTokens, cachedTokenCount: cachedTokens),
        output: .init(totalTokenCount: 0, reasoningTokenCount: 0)
    )))

    // Stream tokens as they arrive
    for try await token in modelStream {
        await channel.send(.response(action: .appendText(token)))
    }
}
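The snippet above reports an output token count of zero and never revises it. One way to reason about the ordering is to model the channel with a recorder and check the sequence in a test. Everything here is a stand-in: `ChannelAction`, `RecordingChannel`, and `stream(tokens:promptTokens:into:)` are hypothetical, the code is synchronous for simplicity, and the closing usage update is an assumption about what the framework accepts, not something the article confirms.

```swift
// Hypothetical stand-ins for the channel actions described above.
enum ChannelAction: Equatable {
    case updateMetadata([String: String])
    case updateUsage(inputTokens: Int, outputTokens: Int)
    case appendText(String)
}

// Records actions in order so the sequence can be inspected in a test.
final class RecordingChannel {
    private(set) var actions: [ChannelAction] = []

    func send(_ action: ChannelAction) {
        actions.append(action)
    }
}

// Sends metadata, then input usage, then text deltas, then final usage.
func stream(tokens: [String], promptTokens: Int, into channel: RecordingChannel) {
    channel.send(.updateMetadata(["modelID": "my-model-2026"]))
    channel.send(.updateUsage(inputTokens: promptTokens, outputTokens: 0))

    var producedTokens = 0
    for token in tokens {
        channel.send(.appendText(token))
        producedTokens += 1
    }

    // Assumption: a closing usage update with the real output count is accepted.
    channel.send(.updateUsage(inputTokens: promptTokens, outputTokens: producedTokens))
}
```

A recorder like this also makes a reasonable test double for an executor: feed it a canned token list and assert on the recorded actions without touching the network.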

Handling Options You Do Not Support

The session caller can request things your model might not support — greedy sampling, guided generation with a JSON schema, a specific reasoning level. The rule is: approximate where you can, throw where you cannot.

// Caller requested greedy sampling, your API only takes temperature — close enough
if request.generationOptions.sampling?.kind == .greedy {
    apiRequest.temperature = 0
}

// Schema + tiny token budget is genuinely unsatisfiable — throw
if let schema = request.schema,
   let budget = request.generationOptions.maximumResponseTokens,
   budget < minimumTokensNeeded(for: schema) {
    throw LanguageModelError.unsupportedCapability(
        .init(capability: .guidedGeneration,
              debugDescription: "Token budget too small for this schema.")
    )
}

The framework ships typed errors for every common failure: contextSizeExceeded, rateLimited, refusal, guardrailViolation, timeout, and more. Use them. Callers know how to handle them, and apps already built on Foundation Models will degrade gracefully when they see these errors from your model.

Your own provider-specific errors — subscription limits, suspended accounts, model variants not yet provisioned — can be thrown as custom Error types. Give them a proper errorDescription; that string surfaces in developer-facing tooling.
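For the provider-specific cases, conforming to Foundation's `LocalizedError` is the usual way to supply an errorDescription. The sketch below is one possible shape. `MyProviderError`, its cases, and the message wording are all hypothetical.

```swift
import Foundation

// Hypothetical provider-specific failures that the framework's typed errors do not cover.
enum MyProviderError: LocalizedError, Equatable {
    case subscriptionLimitReached(plan: String)
    case accountSuspended
    case variantNotProvisioned(String)

    // A readable message for developer-facing tooling.
    var errorDescription: String? {
        switch self {
        case .subscriptionLimitReached(let plan):
            return "The \(plan) plan has reached its request limit."
        case .accountSuspended:
            return "This account is suspended. Contact the provider to restore access."
        case .variantNotProvisioned(let variant):
            return "The model variant '\(variant)' is not provisioned for this account."
        }
    }
}
```

Keeping these in a single enum means callers can switch over them exhaustively, while failures the framework already models (rate limits, context overflow) should still use the framework's own error types.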

Custom Segments: When Text Is Not Enough

Some models produce or consume things that are not text. The framework has a Transcript.CustomSegment protocol for this.

Define a type, use it in prompts, emit it from your executor:

public struct AudioSegment: Transcript.CustomSegment {
    public var id: String
    public var content: URL
}

// In a session
let recording = AudioSegment(id: UUID().uuidString, content: audioFileURL)
let response = try await session.respond {
    "Transcribe the key decisions from this meeting."
    recording
}

// In the executor, emit back
await channel.send(.response(action: .updateCustomSegment(
    AudioSegment(id: outputFile.id, content: outputFile.url)
)))

Same mechanism works for web search citations, image outputs, retrieval chunks — anything that should live in the transcript with type information preserved.

Server-Side Tools

If your model runs tools on the server side (web search is the common case), you do not expose them as client-side Tool conformances. Instead, declare them on your model type and surface their output through the channel as custom segments.

public struct MyLanguageModel: LanguageModel {
    public struct ServerTool: Sendable {
        public static let webSearch: ServerTool = ...
    }
    public init(serverTools: [ServerTool] = []) { }
}

// Executor routes server events to the channel
for try await chunk in apiResponse {
    switch chunk {
    case .webSearch(let result):
        await channel.send(.response(action: .updateCustomSegment(
            WebSearchSegment(url: result.url, content: result.html)
        )))
    case .textDelta(let delta):
        await channel.send(.response(action: .appendText(delta.text, tokenCount: delta.tokenCount)))
    }
}

The developer using your model sees a coherent transcript that includes both the response text and the search results, typed and inspectable, without the server-side mechanics leaking through.

Packaging Your Provider

The WWDC session is specific about how to ship this:

// Package.swift structure Apple recommends
targets: [
    .target(name: "MyModelRuntime"),          // inference engine, weights loader
    .target(name: "MyModel", dependencies: ["MyModelRuntime"]),  // public LanguageModel conformance
    .testTarget(name: "MyModelTests", dependencies: ["MyModel"])
]

A few things worth noting from the session's packaging guidance:

Platform targets. Foundation Models supports iOS, macOS, visionOS, and watchOS. The Foundation Models framework is being released as open source, so your package could also be useful to developers who deploy Swift on Linux servers — consider supporting Linux too.

Dependencies. Every dependency translates to bytes that a developer ships to their users. Be deliberate about what is linked by your package. A bloated transitive dependency graph is a fast way to get your package rejected from app codebases.

Credentials. Design initializers that steer developers toward secure usage. Persist tokens in the Keychain rather than accepting API keys as plain strings.
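One way to follow the credentials guidance is to have the public initializer accept a credential source instead of a string. The sketch below is an assumption about how a package might do that: `CredentialProvider`, `StaticCredentialProvider`, `KeychainCredentialProvider`, and `MyModelClient` are hypothetical names, and the Bearer header format is assumed. The Keychain lookup uses the standard `SecItemCopyMatching` call from the Security framework and is compiled only where that framework is available.

```swift
import Foundation

// Anything that can produce a token on demand.
protocol CredentialProvider {
    func token() throws -> String
}

enum CredentialError: Error {
    case missing
}

// In-memory provider intended for tests and previews only.
struct StaticCredentialProvider: CredentialProvider {
    let value: String
    func token() throws -> String { value }
}

#if canImport(Security)
import Security

// Reads a token stored as a generic password item in the Keychain.
struct KeychainCredentialProvider: CredentialProvider {
    let service: String
    let account: String

    func token() throws -> String {
        let query: [String: Any] = [
            kSecClass as String: kSecClassGenericPassword,
            kSecAttrService as String: service,
            kSecAttrAccount as String: account,
            kSecReturnData as String: true,
            kSecMatchLimit as String: kSecMatchLimitOne
        ]
        var item: CFTypeRef?
        let status = SecItemCopyMatching(query as CFDictionary, &item)
        guard status == errSecSuccess,
              let data = item as? Data,
              let token = String(data: data, encoding: .utf8) else {
            throw CredentialError.missing
        }
        return token
    }
}
#endif

// The client never stores a raw key; it asks the provider when it needs one.
struct MyModelClient {
    private let credentials: any CredentialProvider

    init(credentials: any CredentialProvider) {
        self.credentials = credentials
    }

    // Assumed header format; adjust to the provider's actual auth scheme.
    func authorizationHeader() throws -> String {
        "Bearer " + (try credentials.token())
    }
}
```

Because there is no initializer that takes a bare string, the path of least resistance for an app developer is the Keychain-backed provider, with the static one left for tests.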

Summary

Foundation Models has always been Apple's model or nothing. That changes in iOS 27. What Apple has built here is less a feature and more a distribution channel. Any LLM provider that ships a conforming Swift package can be dropped into apps that already use Foundation Models, and the framework keeps handling session management, tool call loops, and error handling in a consistent way.
