I've spent weeks stress-testing Apple's on-device model — the ~3B parameter one that runs on the Neural Engine of any Apple Silicon Mac. To test it thoroughly, I built Think Local, a macOS app that exercises every capability of the model: chat, image generation, structured output, tool calling, and parameter comparison.
My conclusion: As a chatbot, the model is terrible. As a structured output and tool calling engine, it's surprisingly good.
This distinction matters because it completely changes what you should use this model for.
Chat is disappointing — and that's fine
Apple's model has a 4,096-token context window. For perspective: Claude supports up to 1M tokens and GPT-4o offers 128K. With Apple's model, add a 200-token system prompt, a 150-token schema, and three conversation turns, and you're already at roughly 70% of capacity.
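The budget math is worth doing explicitly. Assuming ~800 tokens per turn (prompt plus response, a rough figure I'm picking for illustration):

```swift
// Rough context budget for the 4,096-token window.
// The ~800 tokens/turn figure is an assumption for illustration.
let window = 4_096
let systemPrompt = 200
let schema = 150
let conversation = 3 * 800          // three turns, prompt + response
let used = systemPrompt + schema + conversation
let percent = used * 100 / window
print("\(used) tokens used, \(percent)% of the window")
// → 2750 tokens used, 67% of the window
```

Two more turns at that rate and you've blown the window entirely.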
Free-form text quality isn't impressive either. Responses are correct but generic, repetitive in long sessions, and the guardrails trigger false positives — ask about cybersecurity and sometimes it refuses because it interprets "attack" literally.
If you evaluate it as a chatbot, it's a mediocre model. But evaluating it that way is like criticizing a screwdriver for being a bad hammer.
Where it shines: constrained decoding
This is where everything changes. Foundation Models supports @Generable — a macro that converts a Swift struct into a schema the model is forced to respect. This isn't "asking for JSON and hoping for the best." It's constrained decoding: during generation, the model literally cannot emit tokens that violate the schema.
@Generable
struct BugReport {
    @Guide(.anyOf(["bug", "feature", "task"]))
    var type: String

    @Guide(.anyOf(["low", "medium", "high", "critical"]))
    var priority: String

    var title: String
    var description: String
}
With this schema, you say "Classify: the app crashes when I rotate the device" and it returns:
{
    "type": "bug",
    "priority": "high",
    "title": "Crash on device rotation",
    "description": "The application crashes when the user rotates..."
}
The constrained fields (type, priority) are always valid — it can't make up values outside the list. The free fields (title, description) are coherent and concise. I tested this with dozens of different inputs: zero schema errors in over 200 runs.
The difference between free generation and constrained decoding in this model is dramatic. It's like the difference between asking a junior developer to write whatever they want versus giving them a form with defined fields. The form always wins.
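Invoking the schema from Swift is a few lines. This is a sketch using the Foundation Models session API; it requires macOS 26 with Apple Intelligence, so it won't run elsewhere:

```swift
import FoundationModels

// Sketch: guided generation against the BugReport schema above.
// Requires macOS 26 with Apple Intelligence enabled.
let session = LanguageModelSession()
let response = try await session.respond(
    to: "Classify: the app crashes when I rotate the device",
    generating: BugReport.self
)
// response.content is a fully-typed BugReport — no JSON parsing,
// no validation step, no retry loop for malformed output.
print(response.content.type, response.content.priority)
```

The point is what's absent: there is no parsing code, because there is nothing to parse.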
Tool calling: a 3B model that knows when it doesn't know
This was what surprised me most. Define a get_weather(city: String) tool, ask "What's the weather in Madrid?" and it generates a structured invocation with correct parameters. Ask "What's the capital of France?" and it answers directly — it knows it doesn't need the tool.
A 3B parameter model, running on your laptop without network, distinguishes when to use a tool and when not to. It's not perfect — with ambiguous prompts it sometimes gets confused — but the accuracy rate on clear inputs is remarkable.
Apple's tool calling uses the same constrained decoding mechanism: the invocation is a @Generable struct, so parameters are always valid. No broken JSON parsing or invented parameters.
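A tool is a small type conforming to the Tool protocol, with an @Generable Arguments struct — which is exactly what guarantees valid parameters. A sketch (the weather lookup itself is a placeholder):

```swift
import FoundationModels

// Sketch of a tool definition. Because Arguments is @Generable,
// the model's invocation is schema-constrained like any other
// structured output: `city` is always a String, never garbage.
struct GetWeather: Tool {
    let name = "get_weather"
    let description = "Returns the current weather for a city"

    @Generable
    struct Arguments {
        @Guide(description: "The city to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Placeholder — a real tool would query an actual data source.
        "Sunny, 24°C in \(arguments.city)"
    }
}

// Tools are attached when the session is created:
let session = LanguageModelSession(tools: [GetWeather()])
```

The session decides per-prompt whether to invoke the tool, which is exactly the behavior described above.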
Image Playground: the other model nobody mentions
Apple Intelligence isn't just text. Image Playground is a separate framework that generates images on-device in three styles: animation, illustration, and sketch. It also runs entirely on the Neural Engine, without network.
The pattern repeats: for what it's designed for, it works well. Icons, stickers, simple compositions — surprisingly good results. Text within images, detailed human anatomy, complex compositions — disastrous.
Think Local includes an Image Studio where you can test prompts and compare styles side by side. The intuition you get from ten minutes of testing prompts beats any benchmark.
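Programmatic generation goes through the ImagePlayground framework's ImageCreator. The sketch below is my reading of that API — the type and style names come from the framework, but treat the exact signatures as assumptions:

```swift
import ImagePlayground

// Sketch: generate one illustration-style image on-device.
// Requires Apple Intelligence; signatures are my reading of the
// ImagePlayground API and should be checked against the docs.
let creator = try await ImageCreator()
let images = creator.images(
    for: [.text("a minimalist fox icon")],
    style: .illustration,
    limit: 1
)
for try await image in images {
    // image.cgImage is ready for display or export;
    // save(_:) here is a hypothetical helper, not framework API.
    save(image.cgImage)
}
```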
The numbers
Measured on an Apple Silicon Mac, using Think Local's resource monitor:
| Metric | Value |
|---|---|
| Generation speed | ~40 tok/s |
| Cold start (no prewarm) | ~800ms |
| Cold start (with prewarm) | ~200ms |
| Model RAM | ~1.2 GB |
| Context window | 4,096 tokens |
| Parameters | ~3B |
| Battery impact | Minimal (Neural Engine) |
The battery data is relevant: the Neural Engine consumes noticeably less power than the CPU for inference. During generation, CPU usage increases slightly for marshalling and UI, but the heavy work goes to the Neural Engine. This makes it viable to run the model in the background for tooling (git hooks, linters, classifiers) without draining your laptop.
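The cold-start gap in the table comes down to one call: LanguageModelSession exposes prewarm(), which loads the model ahead of the first request. A sketch (macOS 26 only):

```swift
import FoundationModels

let session = LanguageModelSession()

// Call before the user's first prompt — e.g. when the relevant
// view appears — to cut first-response latency from roughly
// 800ms down to roughly 200ms (figures from the table above).
session.prewarm()
```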
What to use it for and what not to
Use it for:
- Classification and triage (bugs, emails, tickets) → constrained decoding with @Generable
- Structured data extraction from free text → schemas with @Guide
- Lightweight tool calling (search, calculate, query) → type-safe invocations
- Drafting and summarizing short texts → within the 4K token limit
- Simple image generation (icons, stickers, illustrations) → Image Playground
- Any task where you can define the output format
Don't use it for:
- Long conversations → the 4K window runs out in 3-4 turns
- Complex or multi-step reasoning → the model is too small
- Long creative generation → responses are generic and repetitive
- Anything requiring knowledge post-October 2023
Try it
Think Local is open source (MIT). The app exercises all these capabilities with visual UI: a token visualizer that shows consumed context in real-time, a schema editor, a tool calling lab, and a compare mode to see how responses change with different parameters.
git clone https://github.com/frr149/think-local.git
cd think-local
open Package.swift
Requirements: macOS 26, Apple Silicon, Apple Intelligence enabled. Zero dependencies, zero API keys.
The conclusion
Apple hasn't built a chatbot. They've built a local inference engine with constrained decoding and tool calling that happens to also chat. If you evaluate it as a chatbot, it loses against everything. If you evaluate it as a structured output engine that runs free on your hardware, without network and without API keys, it has no real competition: nobody else ships that built into the operating system.
The right question isn't "Can Apple compete with GPT-4?" — it can't. The question is "What can I build with a 3B model that runs free, locally, with constrained decoding?" And the answer is: quite a bit more than you think.