Jovan Chan

Originally published at runaihome.com

WWDC 2026 Preview: Apple Foundation Models and Core AI — What On-Device AI Actually Means for Home Lab Builders

This article was originally published on runaihome.com

TL;DR: Apple's WWDC 2026 (June 8–12) is expected to replace Core ML with a new Core AI framework, ship a Gemini-trained Foundation Model to power a chatbot-capable Siri, and expand the on-device Foundation Models developer API. The existing 3B on-device model already runs at ~30 tokens/second on iPhone 15 Pro with zero API cost. For home lab builders this matters in a specific, narrow way: if you write iOS/macOS apps, the free inference is real and the privacy story is solid. If you run open-source LLMs, Foundation Models is a separate ecosystem that doesn't replace Ollama or llama.cpp.

| | Apple Foundation Models API | Open-source LLMs on Apple Silicon | NVIDIA GPU + Ollama |
| --- | --- | --- | --- |
| Best for | iOS/macOS app developers | Running 7B–70B open models locally | Maximum tok/s, widest model choice |
| Cost | Free (on-device inference, no API key) | Device cost only | GPU cost + ~$420/year electricity |
| The catch | Apple's model only, no fine-tuning, Apple devices required | Needs 48GB+ for 70B models | 24GB VRAM ceiling, 350–450W draw |

Honest take: If you write Swift apps and want on-device AI with no API bill, enable the Foundation Models framework today — it's already shipping. If you run Llama, Qwen, or Mistral models in Ollama, Core AI doesn't change your setup at all.


What WWDC 2026 Is Actually Announcing

The keynote opens June 8 at 10 AM PT. Based on reporting from Bloomberg's Mark Gurman, AppleInsider, 9to5Mac, and TechCrunch, three AI-specific things are coming.

Core AI replaces Core ML. Apple's Core ML framework dates to 2017, when "machine learning" was the industry term and "AI" still felt like science fiction. Core AI is its modernized replacement: same underlying function (local inference on the Neural Engine, GPU, and CPU), but with a broader mandate. Core AI introduces a standardized API for developers to plug in third-party model weights alongside Apple's own models — a direct response to the fact that developers increasingly want to ship custom weights, not just Apple's. Core ML will continue running the existing model zoo in compatibility mode; Core AI takes the forward path.

Updated Foundation Models with Gemini-trained weights. Apple and Google announced a multi-year collaboration under which the next generation of Apple Foundation Models will be based on Google's Gemini architecture and training infrastructure. The current on-device model is a 3B parameter Apple-trained model. The WWDC 2026 version is expected to be larger, more capable, and significantly better at multi-turn conversation. The expanded context window is one of the explicit improvements Apple has signaled.

Siri becomes a chatbot. The rebuilt Siri arriving with iOS 27/macOS 27 gets a dedicated app, full conversation history, and text-plus-voice input. The underlying model is reportedly a 1.2 trillion parameter system developed in collaboration with Google. Unlike the current 3B Foundation Models model, which runs fully on-device, the full Siri chatbot runs on Apple's Private Cloud Compute infrastructure, not on your local hardware. The developer framework for building Siri-like experiences in your own apps, however, remains on-device.


The Foundation Models Framework Today: What Already Ships

Before getting to the WWDC 2026 announcements, it's worth being clear about what exists right now, because the framework has been available since iOS 26 shipped and is already useful.

The Foundation Models framework gives Swift developers direct API access to the 3B parameter on-device model that powers writing tools, summaries, and Smart Replies in Apple Intelligence. Performance from Apple's own technical documentation: ~30 tokens/second on iPhone 15 Pro and iPhone 17 Pro, with time-to-first-token latency under 1 millisecond per prompt token. For context, that's slower than running Llama 3 8B on an RTX 5060 Ti (55–60 tok/s), but the 3B model runs on a phone with no power plug, no API call, and no data leaving the device.
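To make those two numbers concrete, here is a back-of-the-envelope latency estimate. It is a sketch, not a benchmark: it treats the "under 1 millisecond per prompt token" figure as a 1 ms upper bound and assumes a steady 30 tokens/second during generation. The function name and the example token counts are illustrative.

```swift
/// Rough end-to-end latency: prompt processing time plus generation time.
/// Defaults reuse the figures quoted above (1 ms per prompt token as an
/// upper bound, 30 tokens/second generation); both are assumptions here.
func estimatedLatencySeconds(
    promptTokens: Int,
    outputTokens: Int,
    secondsPerPromptToken: Double = 0.001,
    tokensPerSecond: Double = 30
) -> Double {
    let timeToFirstToken = Double(promptTokens) * secondsPerPromptToken
    let generationTime = Double(outputTokens) / tokensPerSecond
    return timeToFirstToken + generationTime
}

// Example: a 500-token support ticket summarized in about 60 tokens.
// Roughly 0.5 s to first token plus 2 s of generation.
let estimate = estimatedLatencySeconds(promptTokens: 500, outputTokens: 60)
print("Estimated latency: \(estimate) s")
```

For short outputs the generation rate dominates, which is why the framework suits summaries and classification better than long-form generation.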

The Swift API to use it is deliberately minimal:

import FoundationModels

let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize this support ticket in one sentence.")
print(response.content)

Three lines. Apple handles memory management, quantization, and Neural Engine scheduling. The more interesting part is the @Generable macro for structured output:

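import FoundationModels

// @Generable lets the framework constrain decoding to this struct's shape.
@Generable
struct TicketClassification {
    let summary: String
    // A single @Guide carries both the description and the allowed values.
    @Guide(description: "Urgency level based on customer tone", .anyOf(["low", "medium", "high", "critical"]))
    let priority: String
}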

This constrained decoding approach doesn't just limit output to the four priority values — Apple's documentation reports that guided generation improves accuracy compared to free-form output, because constraining the generation space reduces hallucination probability. That's a real technical advantage for extraction and classification tasks, regardless of model size.
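A sketch of how the struct might be used for guided generation follows. It assumes the framework's `respond(to:generating:)` method and the single-attribute `@Guide(description:_:)` form. The struct is restated so the block stands alone, and the `classify(ticket:)` helper and its instruction text are hypothetical.

```swift
import FoundationModels

@Generable
struct TicketClassification {
    let summary: String
    @Guide(description: "Urgency level based on customer tone", .anyOf(["low", "medium", "high", "critical"]))
    let priority: String
}

/// Hypothetical helper: asks the on-device model for a typed result
/// instead of free-form text.
func classify(ticket: String) async throws -> TicketClassification {
    let session = LanguageModelSession(
        instructions: "You triage customer support tickets."
    )
    let response = try await session.respond(
        to: "Classify this support ticket:\n\(ticket)",
        generating: TicketClassification.self
    )
    // response.content is already a TicketClassification; no JSON parsing needed.
    return response.content
}
```

Because `priority` is constrained to four values, calling code can switch on it without first validating the string.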

Hardware requirements: Apple Intelligence must be enabled, which requires iPhone 15 Pro/15 Pro Max or any iPhone 16+, iPad with M1 or A17 Pro, or any Apple Silicon Mac (M1 or later). Intel Macs and older iPhones are excluded.
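Because eligibility depends on both the device and a user setting, an app would typically check availability before creating a session. The sketch below assumes the `SystemLanguageModel.default.availability` property and its unavailability reasons as documented by Apple. The function name and the returned messages are illustrative.

```swift
import FoundationModels

/// Hypothetical helper: returns a user-facing status string for the on-device model.
func modelStatusMessage() -> String {
    switch SystemLanguageModel.default.availability {
    case .available:
        return "On-device model is ready."
    case .unavailable(.deviceNotEligible):
        // Intel Macs and older iPhones land here.
        return "This device does not support Apple Intelligence."
    case .unavailable(.appleIntelligenceNotEnabled):
        return "Turn on Apple Intelligence in Settings to use this feature."
    case .unavailable(.modelNotReady):
        // The model may still be downloading.
        return "The model is not ready yet. Try again later."
    case .unavailable(let other):
        return "Model unavailable: \(other)"
    }
}
```

Apps that must also run on unsupported hardware can use a check like this to hide the feature or fall back to a server-side model.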


Two Different Things Home Lab Builders Need to Keep Separate

There is a conflation in most Apple AI coverage that creates real confusion for home lab builders: the Foundation Models developer API and Apple Silicon as a platform for open-source LLMs are separate stories with separate hardware considerations.

Foundation Models: the developer-facing story

If you write iOS or macOS apps, the WWDC 2026 Core AI framework announcement is relevant. You get:

  • Inference at zero API cost (no key, no billing, no rate limits)
  • Privacy guarantees: data stays on device by default, no telemetry
  • Swift-native type safety via guided generation
  • Apple handles all hardware-specific optimization per chip generation

The hard constraint of the Foundation Models framework as it ships today is that you use Apple's model. You can't swap in your own weights, you can't fine-tune on private data, and deployment is limited to Apple platforms; the third-party weight support expected in Core AI is a separate path. If your app needs a specific domain or language not well represented in the Foundation Model's training data, you're engineering around the model through prompting, not through retraining.

For AI coding tools built around Xcode and Apple's platform ecosystem, the Core AI developer story has direct implications. Aicoderscope.com covers that angle in depth.

Apple Silicon for open-source LLMs: an independent story

This is completely independent of Foundation Models. Ollama, llama.cpp, LM Studio, and every other open inference tool run on Apple Silicon through the Metal backend and, as of Ollama 0.19 in March 2026, the MLX backend. The Foundation Models 3B model and Llama 3.3 70B running in Ollama do not share inference infrastructure, don't compete for the same memory pool, and aren't connected in any way.
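The separation shows up in code: talking to an open model means calling a local server, not importing an Apple framework. The sketch below targets Ollama's `/api/generate` endpoint on its default port 11434 with streaming disabled. The model tag `llama3.3:70b` and the type and function names are assumptions for illustration.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // needed for URLSession on Linux
#endif

struct OllamaGenerateRequest: Codable {
    let model: String
    let prompt: String
    let stream: Bool
}

struct OllamaGenerateResponse: Codable {
    let response: String  // generated text; other response fields are ignored
}

/// Builds a non-streaming request for a locally running Ollama server.
func makeOllamaRequest(prompt: String, model: String = "llama3.3:70b") throws -> URLRequest {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        OllamaGenerateRequest(model: model, prompt: prompt, stream: false)
    )
    return request
}

/// Sends the prompt and returns the generated text.
func generate(prompt: String) async throws -> String {
    let request = try makeOllamaRequest(prompt: prompt)
    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(OllamaGenerateResponse.self, from: data).response
}
```

Nothing here touches `FoundationModels`, which is the point: the two stacks can coexist in one app but do not share models or runtime.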

The performance picture for open-source inference on Apple hardware in 2026, verified across multiple benchmark sources:

| Hardware | Unified Memory | Memory BW | Llama 3.3 70B Q4_K_M | Annual power cost |
| --- | --- | --- | --- | --- |
| Mac Mini M4 16GB | 16GB | 120 GB/s | Won't fit | ~$13/yr |
| Mac Mini M4 32GB | 32GB | 120 GB/s | Won't fit (needs ~43GB) | ~$17/yr |
| Mac Mini M4 Pro 48GB | 48GB | 273 GB/s | ~18 tok/s | ~$37/yr |
| Mac Studio M4 Max 64GB | 64GB | 546 GB/s | ~24 tok/s | ~$68/yr |
| Mac Studio M4 Max 128GB | 128GB | 546 GB/s | 28 tok/s | ~$82/yr |
| Mac Studio M3 Ultra 192GB | 192GB | 800 GB/s | ~40 tok/s | ~$121/yr |
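The annual power figures follow from simple arithmetic, sketched below. The article does not state its electricity rate or duty cycle, so the $0.12/kWh rate and 24/7 operation used here are assumptions. They happen to reproduce the ~$420/year GPU figure for a 400 W average draw, the midpoint of the 350–450W range. Function names are illustrative.

```swift
/// Annual electricity cost in dollars for a device with the given average draw.
func annualPowerCost(averageWatts: Double, dollarsPerKWh: Double, hoursPerDay: Double = 24) -> Double {
    let kWhPerYear = averageWatts / 1000 * hoursPerDay * 365
    return kWhPerYear * dollarsPerKWh
}

/// Inverse: the average draw implied by a yearly cost at a given rate (24/7 operation).
func impliedAverageWatts(annualCost: Double, dollarsPerKWh: Double) -> Double {
    let kWhPerYear = annualCost / dollarsPerKWh
    return kWhPerYear * 1000 / (24 * 365)
}

// Assumed rate: $0.12/kWh, running around the clock.
let gpuCost = annualPowerCost(averageWatts: 400, dollarsPerKWh: 0.12)
print("400 W GPU rig: about $\(Int(gpuCost.rounded())) per year")

// At the same assumed rate, ~$13/yr works out to an average draw of roughly 12 W.
let miniWatts = impliedAverageWatts(annualCost: 13, dollarsPerKWh: 0.12)
print("Implied Mac Mini average draw: \(miniWatts) W")
```

Substituting your own rate and realistic duty cycle changes the absolute numbers, but the ratio between the platforms stays roughly the same.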

The M4 Max 128GB at 28 tok/s on Llama 3.3 70B Q4_K_M is the Apple Silicon sweet spot for home lab work in 2026. The Q4_K_M quantization uses ~43GB of the
