
Fernando Rodriguez

Posted on • Originally published at frr.dev

Apple's On-Device Model is Terrible for Chat But Surprisingly Good at Structured Output and Tool Calling

I've spent weeks stress-testing Apple's on-device model — the ~3B parameter one that runs on the Neural Engine of any Apple Silicon Mac. To test it thoroughly, I built Think Local, a macOS app that exercises every capability of the model: chat, image generation, structured output, tool calling, and parameter comparison.

My conclusion: As a chatbot, the model is terrible. As a structured output and tool calling engine, it's surprisingly good.

This distinction matters because it completely changes what you should use this model for.

Chat is disappointing — and that's fine

Apple's model has a 4,096-token context window. To put this in perspective: Claude supports contexts of up to 1M tokens and GPT-4o 128K. With Apple, add a 200-token system prompt, a 150-token schema, and three conversation turns, and you're already at 70% capacity.

Free-form text quality isn't impressive either. Responses are correct but generic, repetitive in long sessions, and the guardrails trigger false positives — ask about cybersecurity and sometimes it refuses because it interprets "attack" literally.

If you evaluate it as a chatbot, it's a mediocre model. But evaluating it that way is like criticizing a screwdriver for being a bad hammer.

Where it shines: constrained decoding

This is where everything changes. Foundation Models supports @Generable — a macro that converts a Swift struct into a schema the model is forced to respect. This isn't "asking for JSON and hoping for the best." It's constrained decoding: during generation, the model literally cannot emit tokens that violate the schema.

@Generable
struct BugReport {
    // Constrained fields: the model can only emit one of the listed values.
    @Guide(.anyOf(["bug", "feature", "task"]))
    var type: String
    @Guide(.anyOf(["low", "medium", "high", "critical"]))
    var priority: String
    // Free-text fields: generated normally, but always present and typed.
    var title: String
    var description: String
}

With this schema, you say "Classify: the app crashes when I rotate the device" and it returns:

{
  "type": "bug",
  "priority": "high",
  "title": "Crash on device rotation",
  "description": "The application crashes when the user rotates..."
}

The constrained fields (type, priority) are always valid — it can't make up values outside the list. The free fields (title, description) are coherent and concise. I tested this with dozens of different inputs: zero schema errors in over 200 runs.
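
For reference, this is roughly how that schema is driven from a session. A minimal sketch rather than Think Local's actual code (the prompt string is mine, and the call needs an async context):

import FoundationModels

let session = LanguageModelSession()

// `generating:` switches on constrained decoding against the BugReport schema,
// so `response.content` comes back as a typed BugReport rather than raw JSON.
let response = try await session.respond(
    to: "Classify: the app crashes when I rotate the device",
    generating: BugReport.self
)

print(response.content.type)      // always one of "bug", "feature", "task"
print(response.content.priority)  // always one of the four allowed priorities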

The difference between free generation and constrained decoding in this model is dramatic. It's like the difference between asking a junior developer to write whatever they want versus giving them a form with defined fields. The form always wins.

Tool calling: a 3B model that knows when it doesn't know

This was what surprised me most. Define a get_weather(city: String) tool, ask "What's the weather in Madrid?" and it generates a structured invocation with correct parameters. Ask "What's the capital of France?" and it answers directly — it knows it doesn't need the tool.

A 3B parameter model, running on your laptop without network, distinguishes when to use a tool and when not to. It's not perfect — with ambiguous prompts it sometimes gets confused — but the accuracy rate on clear inputs is remarkable.

Apple's tool calling uses the same constrained decoding mechanism: the invocation is a @Generable struct, so parameters are always valid. No broken JSON parsing or invented parameters.
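
For a sense of the shape, here is a minimal sketch of such a tool, following the Tool protocol as shown in Apple's Foundation Models examples; the name GetWeatherTool and the stubbed response are mine:

import FoundationModels

struct GetWeatherTool: Tool {
    let name = "get_weather"
    let description = "Returns the current weather for a city."

    // Arguments are a @Generable struct, so the model's invocation is
    // produced with the same constrained decoding as structured output.
    @Generable
    struct Arguments {
        @Guide(description: "The city to look up, e.g. Madrid")
        var city: String
    }

    func call(arguments: Arguments) async throws -> ToolOutput {
        // A real implementation would query a weather service here.
        ToolOutput("Sunny, 24 °C in \(arguments.city)")
    }
}

let session = LanguageModelSession(tools: [GetWeatherTool()])

// The model decides routing on its own: "weather in Madrid" invokes the tool,
// "capital of France" is answered directly without it.
let answer = try await session.respond(to: "What's the weather in Madrid?")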

Image Playground: the other model nobody mentions

Apple Intelligence isn't just text. Image Playground is a separate framework that generates images on-device in three styles: animation, illustration, and sketch. It also runs entirely on the Neural Engine, without network.

The pattern repeats: for what it's designed for, it works well. Icons, stickers, simple compositions — surprisingly good results. Text within images, detailed human anatomy, complex compositions — disastrous.

Think Local includes an Image Studio where you can test prompts and compare styles side by side. The intuition you get from ten minutes of testing prompts beats any benchmark.
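
If you'd rather drive it from code than from the app, the programmatic entry point is ImageCreator in the ImagePlayground framework. A rough sketch, assuming the async image-stream API from Apple's documentation (the prompt text is mine):

import ImagePlayground

let creator = try await ImageCreator()

// Concepts describe what to draw; style is one of the three on-device styles.
let images = creator.images(
    for: [.text("a friendly robot sticker")],
    style: .animation,
    limit: 1
)

for try await image in images {
    let cgImage = image.cgImage  // the generated bitmap, ready to display or save
}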

The numbers

Measured on an Apple Silicon Mac, using Think Local's resource monitor:

| Metric | Value |
| --- | --- |
| Generation speed | ~40 tok/s |
| Cold start (no prewarm) | ~800 ms |
| Cold start (with prewarm) | ~200 ms |
| Model RAM | ~1.2 GB |
| Context window | 4,096 tokens |
| Parameters | ~3B |
| Battery impact | Minimal (Neural Engine) |

The battery data matters: the Neural Engine draws noticeably less power than the CPU for inference. During generation, CPU usage rises slightly for marshalling and UI, but the heavy lifting happens on the Neural Engine. That makes it viable to run the model in the background for tooling (git hooks, linters, classifiers) without draining your laptop.
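
The two cold-start rows differ only in whether the session was prewarmed, which takes one call. A minimal sketch of the pattern (the availability check is the framework's gate for Apple Intelligence being enabled):

import FoundationModels

let session = LanguageModelSession()

// Only prewarm when Apple Intelligence is on and the model is downloaded.
if case .available = SystemLanguageModel.default.availability {
    // Loads the model onto the Neural Engine ahead of the first request,
    // which is the difference between the ~800 ms and ~200 ms cold starts.
    session.prewarm()
}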

What to use it for and what not to

Use it for:

  • Classification and triage (bugs, emails, tickets) → constrained decoding with @Generable
  • Structured data extraction from free text → schemas with @Guide
  • Lightweight tool calling (search, calculate, query) → type-safe invocations
  • Drafting and summarizing short texts → within the 4K token limit
  • Simple image generation (icons, stickers, illustrations) → Image Playground
  • Any task where you can define the output format

Don't use it for:

  • Long conversations → the 4K window is exhausted in 3-4 turns
  • Complex or multi-step reasoning → the model is too small
  • Long creative generation → responses are generic and repetitive
  • Anything requiring knowledge post-October 2023

Try it

Think Local is open source (MIT). The app exercises all these capabilities through a visual UI: a token visualizer that shows consumed context in real time, a schema editor, a tool calling lab, and a compare mode to see how responses change with different parameters.

git clone https://github.com/frr149/think-local.git
cd think-local
open Package.swift

Requirements: macOS 26, Apple Silicon, Apple Intelligence enabled. Zero dependencies, zero API keys.

The conclusion

Apple hasn't built a chatbot. They've built a local inference engine with constrained decoding and tool calling that happens to also chat. If you evaluate it as a chatbot, it loses against everything. If you evaluate it as a structured output engine that runs free on your hardware, without network and without API keys — it has no competition. Literally nobody else offers that.

The right question isn't "Can Apple compete with GPT-4?" — it can't. The question is "What can I build with a 3B model that runs free, locally, with constrained decoding?" And the answer is: quite a bit more than you think.
