Last year Apple gave us an on-device LLM through the Foundation Models framework. This year that on-device model gets better, and Apple adds something many of us asked for: a larger server model you can call directly from your app, running on Private Cloud Compute (PCC).
TL;DR
- A new server-class model is now reachable from the same Foundation Models API you already use.
- Switching from on-device to server is a one-line change.
- You get a 32K context window (vs 4K on-device), reasoning support, and image input.
- No API keys, no auth, no token costs to you. Requests are metered against the user's iCloud account, with a daily per-user limit.
- Eligible for apps with fewer than 2M downloads. You apply on the developer site.
- It works across platforms, including watchOS.
Why a server model when we already have on-device
The on-device model is great for fast, private, offline tasks, and this year it improved: it now supports image input, follows instructions more reliably, and is better at calling your custom tools.
But some features just need more headroom. Think:
- Assistants that reason over a large chunk of user input.
- Workflows that make many tool calls and produce large outputs.
- Tasks where a bigger context window and deeper reasoning materially change quality.
That's where PCC comes in. You get a frontier-class model while keeping Apple's privacy posture intact.
The privacy and pricing story (the part that's genuinely different)
Using most server LLMs means provisioning an account, managing API keys, eating token costs, and shipping a privacy policy that accounts for all of it. PCC removes most of that:
- Privacy by design. Data is used only for the request and is never stored, and Apple has had this independently verified by researchers.
- No keys or auth. PCC is integrated into the OS and iCloud. Your user just needs a device that supports Apple Intelligence.
- No token bill for you. Each user gets a daily limit tied to their iCloud account. Users on iCloud+ get higher limits.
The trade-off you're accepting: a network connection is required, and there's a per-user daily cap you need to design around (more on that below).
Integrating it: one line to switch models
If you've used Foundation Models before, prompting the on-device model is three lines:
import FoundationModels
let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize this article: \(article)")
Switching to the PCC server model takes a single change. You hand the session a different model:
import FoundationModels
let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel()
)
let response = try await session.respond(to: "Summarize this article: \(article)")
That's the headline ergonomic win. Same unified Swift API, larger model behind it.
Structured output and tools work identically
@Generable structured output and Tool calling behave the same whether you're on-device or on PCC. You don't rewrite anything to move between them:
import FoundationModels
// Structured output type the model fills in.
@Generable
struct ArticleSummary {
    let oneLineSummary: String
    let keyPoints: [String]
}
struct FindRelatedArticlesTool: Tool {
    // ...
}
let session = LanguageModelSession(
    model: PrivateCloudComputeLanguageModel(),
    tools: [FindRelatedArticlesTool()] // pass tool instances, not metatypes
)
let response = try await session.respond(
    to: "Summarize this article: \(article)",
    generating: ArticleSummary.self
)
Always check availability
PCC, like the on-device model, only runs on Apple Intelligence devices. Check availability and provide a graceful fallback:
import SwiftUI
import FoundationModels
struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()
    var body: some View {
        if model.isAvailable {
            // Show UI for making request
        } else {
            // Fall back
        }
    }
}
On-device vs PCC: how to choose
Both are private. The rest is a set of trade-offs:
| Factor | On-device | PCC server |
|---|---|---|
| Privacy | Yes | Yes |
| Works offline | Yes | No (needs connection) |
| Request limits | None | Daily per-user limit |
| Context size | 4K (8K on newer devices running 27.0) | 32K |
| Reasoning | No | Yes |
The session's advice is worth repeating: pick the model based on data, not vibes. The updated on-device model may handle more than you'd expect, and it has no request limits. The only way to know is to evaluate your specific feature (Apple's new Evaluations framework, covered in "Meet the Evaluations framework," is built for exactly this).
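Before running evaluations, it can help to write the hard constraints from the table down as code, so that only the genuinely open cases need measuring. The sketch below is plain Swift and does not touch the Foundation Models framework. `ModelTier`, `FeatureRequirements`, and `chooseTier` are hypothetical names, and the 4,096 and 32,768 figures are simply the context sizes quoted in this post.

```swift
// Hypothetical helper: none of these types come from FoundationModels.
enum ModelTier: Equatable {
    case onDevice
    case privateCloudCompute
    case unsupported // no tier in the table satisfies the requirements
}

struct FeatureRequirements {
    var mustWorkOffline: Bool
    var needsReasoning: Bool
    var estimatedContextTokens: Int
}

// Context sizes as quoted in this post (assumed values; read them at runtime in real code).
let onDeviceContext = 4_096
let pccContext = 32_768

func chooseTier(for req: FeatureRequirements) -> ModelTier {
    let fitsOnDevice = req.estimatedContextTokens <= onDeviceContext && !req.needsReasoning
    if fitsOnDevice {
        // Prefer on-device whenever it is sufficient: no request limits, works offline.
        return .onDevice
    }
    if req.mustWorkOffline || req.estimatedContextTokens > pccContext {
        // The server tier needs a connection and has its own context ceiling.
        return .unsupported
    }
    return .privateCloudCompute
}
```

A filter like this only rules tiers out. Whether the on-device model is good enough for a feature that fits is still a question for evaluation.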
Reasoning levels
PCC supports reasoning, where the model generates extra "thinking" text in a separate transcript segment before producing the final answer. There are three levels:
- .light gathers a bit of extra context.
- .moderate reasons a little deeper.
- .deep can produce a reasoning segment longer than the answer itself.
You set it per request:
let response = try await session.respond(
    to: prompt,
    contextOptions: ContextOptions(reasoningLevel: .light)
)
// Reasoning levels: .light, .moderate, .deep
Two things to keep in mind:
- Reasoning is generated text, so it consumes tokens and counts against your context limit.
- The reasoning lives in the session transcript, so you can observe the transcript to show progress. This matters most with .deep, which can take a while.
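Because reasoning shares the context window with the prompt and the answer, it can help to budget for it explicitly. The arithmetic below is a plain-Swift sketch. `ReasoningLevel` here is a local stand-in for the three levels described above rather than the framework's type, and the per-level reserves are made-up placeholders, not figures from Apple.

```swift
// Local stand-in for the three levels described above (not the framework type).
enum ReasoningLevel {
    case light, moderate, deep

    // Hypothetical reserves in tokens; measure real transcripts to pick values.
    var reservedTokens: Int {
        switch self {
        case .light: return 512
        case .moderate: return 2_048
        case .deep: return 8_192
        }
    }
}

/// Tokens left for the final answer after the prompt and the reasoning reserve,
/// clamped to zero so callers can detect a request that does not fit.
func answerBudget(contextSize: Int, promptTokens: Int, level: ReasoningLevel) -> Int {
    max(0, contextSize - promptTokens - level.reservedTokens)
}
```

With these placeholder reserves, a 20,000-token prompt in a 32,768-token window leaves 4,576 tokens for the answer at the deepest level and 12,256 at the lightest, which is the kind of gap worth knowing about before you pick a level.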
Reading context size programmatically
You can now query context size directly instead of hardcoding it:
SystemLanguageModel().contextSize
// 4096 on 26.0
// 8192 on 27.0 (newer devices)
PrivateCloudComputeLanguageModel().contextSize
// 32768
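Once the context size is a value you read rather than a constant, you can size your input against it. The sketch below takes the context size as a plain Int so it runs without the framework. The four-characters-per-token ratio is a rough assumption for English text, not something the API reports, and `maxInputCharacters` and `chunks` are hypothetical helpers.

```swift
/// Rough character budget for input, leaving `reservedForOutput` tokens free.
/// Assumes about 4 characters per token, which is only a heuristic.
func maxInputCharacters(contextSize: Int, reservedForOutput: Int, charsPerToken: Int = 4) -> Int {
    max(0, contextSize - reservedForOutput) * charsPerToken
}

/// Splits text into consecutive pieces of at most `limit` characters.
func chunks(of text: String, limit: Int) -> [String] {
    guard limit > 0 else { return [] }
    var result: [String] = []
    var start = text.startIndex
    while start < text.endIndex {
        // Step forward by `limit` characters, stopping at the end of the string.
        let end = text.index(start, offsetBy: limit, limitedBy: text.endIndex) ?? text.endIndex
        result.append(String(text[start..<end]))
        start = end
    }
    return result
}
```

In an app you would pass the value read from contextSize instead of a literal, then summarize each chunk separately if the article does not fit in one request.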
Handling usage limits (don't skip this)
Because requests are metered against the user's iCloud account, your app will eventually hit a user who's at their daily cap. If the only thing that happens is a thrown error surfaced in the UI, that's a poor, non-actionable experience.
Instead, inspect quotaUsage and render persistent, actionable UI:
struct ArticleSummarizationView: View {
    private var model = PrivateCloudComputeLanguageModel()
    var body: some View {
        if case .belowLimit(let info) = model.quotaUsage.status {
            if info.isApproachingLimit {
                Text("Nearing usage limit.")
                    .foregroundStyle(Color.orange)
            }
        }
        if model.quotaUsage.isLimitReached {
            Text("Usage limit exceeded.")
                .foregroundStyle(Color.red)
        }
        if let suggestion = model.quotaUsage.limitIncreaseSuggestion {
            Button("Show options") {
                suggestion.show()
            }
        }
    }
}
Design guidance from the session:
- Avoid alerts. Use UI that persists and isn't dismissed, for example a disabled request button with a subtle label beneath it.
- Offer the upgrade path. limitIncreaseSuggestion lets the user manage or raise their limit (such as upgrading their iCloud account).
- Handle the "approaching limit" case too, so users can make informed decisions about which requests are worth spending on.
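The guidance above amounts to mapping quota state to persistent UI state rather than reacting to thrown errors. Here is a framework-free sketch of that mapping. `QuotaState` and `RequestButtonState` are hypothetical types that only mirror the situations described in this post, so in a real app you would derive them from quotaUsage.

```swift
// Hypothetical mirror of the quota situations described above.
enum QuotaState {
    case belowLimit(approaching: Bool)
    case limitReached
}

struct RequestButtonState: Equatable {
    var isEnabled: Bool
    var caption: String?    // subtle label shown beneath the button
    var offersUpgrade: Bool // whether to show a "Show options" action
}

func buttonState(for quota: QuotaState, canSuggestIncrease: Bool) -> RequestButtonState {
    switch quota {
    case .belowLimit(let approaching):
        // Requests still work; warn only when the user is close to the cap.
        return RequestButtonState(
            isEnabled: true,
            caption: approaching ? "Nearing usage limit." : nil,
            offersUpgrade: approaching && canSuggestIncrease
        )
    case .limitReached:
        // Disable the button instead of raising an alert.
        return RequestButtonState(
            isEnabled: false,
            caption: "Usage limit exceeded.",
            offersUpgrade: canSuggestIncrease
        )
    }
}
```

Keeping the mapping in a pure function like this makes each state easy to unit test without touching real quota.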
Testing limit states in Xcode
You don't need to burn real quota to test this. In your scheme, go to Debug > Options and use Simulate Apple Foundation Models Availability. You can select Quota Usage Limit Reached and Nearing Usage Limit to exercise both code paths.
Combining on-device and server models
You're not forced to pick one. A common pattern is to route simple work to the on-device model and escalate harder tasks to PCC. The session points to "Build agentic app experiences with Foundation Models" for that workflow.
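A minimal sketch of that routing follows, written against a hypothetical `TextBackend` protocol rather than the real session types so that it runs standalone. The stub backends, the `BackendError` case, and the escalate-on-overflow rule are all illustrative assumptions; the session referenced above may recommend a different strategy.

```swift
// Hypothetical abstraction; in an app each backend would wrap a LanguageModelSession.
enum BackendError: Error {
    case contextTooLarge
}

protocol TextBackend {
    var name: String { get }
    func respond(to prompt: String) throws -> String
}

// Stand-in backend that rejects prompts beyond a character limit.
struct StubBackend: TextBackend {
    let name: String
    let maxPromptCharacters: Int

    func respond(to prompt: String) throws -> String {
        guard prompt.count <= maxPromptCharacters else {
            throw BackendError.contextTooLarge
        }
        return "[\(name)] handled \(prompt.count) characters"
    }
}

/// Tries the cheap backend first and escalates only when the prompt does not fit.
/// Any other error, or a failure in the fallback, propagates to the caller.
func route(_ prompt: String, primary: any TextBackend, fallback: any TextBackend) throws -> String {
    do {
        return try primary.respond(to: prompt)
    } catch BackendError.contextTooLarge {
        return try fallback.respond(to: prompt)
    }
}
```

Escalating only on failure keeps most requests on the tier with no daily limit, at the cost of a wasted attempt when the prompt turns out to be too large.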
Getting access
The server model is available for apps with fewer than 2M downloads, and you apply on the Apple Developer website. If your feature genuinely needs the larger context or reasoning, it's worth applying early.
--
Summary
If you already use Foundation Models, reaching for a bigger model is now a one-line decision, with privacy handled and no token bill to manage. Evaluate, choose the right tier for each task, and design for the daily limit up front.