Juan Torchia

Posted on • Originally published at juanchi.dev

Apple as the AI 'Loser' That Ends Up Winning: I Lived It When Anthropic Ghosted Me for a Month

87 milliseconds per token. That's what I measured running Llama 3.2 on my M3 Pro the first time I tried it seriously. I ran the benchmark twice because I was convinced something was wrong with the measurement.

It's not server speed. It's not what you'll see on an H100 benchmark. But it's enough for autocomplete, for code analysis, for 80% of the things I was asking Claude Code to do every single day. And it runs on my machine. With my data. Nothing leaves my network.

That shifted a lot of things in how I think about the AI ecosystem.

Apple, On-Device AI, and the Moat Nobody Saw Coming

There was a post that circulated on HN a few weeks back about Apple's "accidental moat." The argument is simple but powerful: Apple spent the last decade building specialized inference hardware (Neural Engine since the A11 in 2017), building privacy APIs that developers hate because they complicate tracking, and building the reputation that "your data stays on device."

Nobody was taking it seriously as an AI strategy. It was always "sure, but Apple's models are garbage compared to GPT-4." And on capability benchmarks, that's true. Apple Intelligence doesn't beat Claude 3.5 Sonnet on complex reasoning.

But the problem is being framed wrong. The question isn't "which model is smarter?" The question that more and more companies, regulators, and users are asking is "where does my data actually run?"

And there Apple has an answer no hyperscaler can genuinely give: on your hardware, full stop.

The Month Anthropic Ghosted Me and What I Learned From It

In March I had a concrete problem with my Claude Code setup. I won't get into every technical detail, but basically an integration I was using to automate part of my code review workflow broke after an API change. I filed the ticket, opened the issue, waited.

A month. Nothing.

It's not that Anthropic is a terrible company. They're a startup scaling at warp speed and technical support doesn't scale the same way models do. I get it. But that month forced me into something I wouldn't have done voluntarily: actually looking for alternatives.

First I moved everything to Zed with OpenRouter. That solved the vendor dependency problem. But while I was digging around, I ran into the discussions questioning agent benchmarks and started asking myself something deeper: how much of what I use actually needs the biggest, most expensive model on the market?

The honest answer: less than I thought.

What Actually Runs Well Locally Today (Real Numbers, M3 Pro, 18GB RAM)

Here's what I measured on my machine. This isn't marketing — these are numbers from llama.cpp and ollama:

```shell
# My current setup
# ollama running in the background, models in ~/models

# Basic benchmark: tokens per second during generation
ollama run llama3.2:3b "Explain the Repository pattern in TypeScript" --verbose
# Result: ~95 tok/s — useful for autocompletion

ollama run llama3.2-vision:11b "Review this code for security vulnerabilities" --verbose
# Result: ~42 tok/s — usable for analysis

ollama run codellama:13b "Refactor this function" --verbose
# Result: ~38 tok/s — solid enough for refactor sessions

# For long contexts (where local really hurts)
ollama run llama3.1:8b --num-ctx 32768
# Result: ~29 tok/s with long context — this is where you feel the gap
```
```typescript
// Simple Ollama integration in my Next.js setup
// for in-editor code analysis

const analyzeCode = async (code: string): Promise<string> => {
  // Everything runs locally; nothing leaves the machine
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'codellama:13b',
      prompt: `Review this TypeScript code and find issues:\n\n${code}`,
      // No streaming for batch analysis
      stream: false,
    }),
  });

  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }

  const data = await response.json();
  return data.response;
};

// Used in automated PR reviews.
// Exactly the kind of thing I used to send to Claude Code
// that I now process locally — client code never leaves my machine.
// requiresComplexReasoning and analyzeWithExternalModel are my own
// app-specific helpers: a routing heuristic and a wrapper around the
// external model's API.
const reviewPR = async (diff: string) => {
  // For sensitive stuff: 100% local
  const localAnalysis = await analyzeCode(diff);

  // Only escalate to an external model for complex architectural analysis
  // where the big model's reasoning actually matters
  if (requiresComplexReasoning(localAnalysis)) {
    return await analyzeWithExternalModel(diff);
  }

  return localAnalysis;
};
```
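For interactive use — autocomplete or chat-style feedback — the batch call above can be switched to Ollama's streaming mode, which returns one JSON object per line as tokens are generated. A minimal sketch; `parseStreamChunks` and `analyzeCodeStreaming` are my own helper names, and the response shape follows Ollama's documented /api/generate output:

```typescript
// Ollama streams responses as newline-delimited JSON, one object per line:
//   {"model":"codellama:13b","response":"...","done":false}
// This helper pulls the text fragments out of a raw NDJSON buffer.
const parseStreamChunks = (raw: string): string[] =>
  raw
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line).response ?? '');

// Streaming variant of analyzeCode: invokes onToken as fragments arrive,
// so the editor can render partial output instead of waiting for the reply.
const analyzeCodeStreaming = async (
  code: string,
  onToken: (fragment: string) => void,
): Promise<void> => {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'codellama:13b',
      prompt: `Review this TypeScript code and find issues:\n\n${code}`,
      stream: true, // NDJSON chunks instead of one final object
    }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Only parse complete lines; keep any trailing partial line buffered
    const lastNewline = buffer.lastIndexOf('\n');
    if (lastNewline === -1) continue;
    for (const fragment of parseStreamChunks(buffer.slice(0, lastNewline))) {
      onToken(fragment);
    }
    buffer = buffer.slice(lastNewline + 1);
  }
};
```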

This pattern — local first, external only when it's worth it — is what changed my workflow. And it's not just about privacy. For vibe-coding PR reviews, the local model is enough for 90% of the cases I need to cover.
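The escalation check in reviewPR doesn't need to be sophisticated. Here's a hedged sketch of what requiresComplexReasoning could look like — the keyword list and the escalate-on-one-hit rule are illustrative choices of mine, not a tuned heuristic:

```typescript
// Illustrative routing heuristic: decide whether the local model's analysis
// hints at problems worth escalating to a bigger external model.
// The keywords below are arbitrary example values — tune for your workflow.
const ESCALATION_HINTS = [
  'deadlock',
  'race condition',
  'architecture',
  'not sure',
  'unclear',
];

const requiresComplexReasoning = (localAnalysis: string): boolean => {
  const text = localAnalysis.toLowerCase();
  // Escalate if the local model flags concurrency/architecture concerns
  // or expresses uncertainty about its own answer.
  return ESCALATION_HINTS.some((hint) => text.includes(hint));
};
```

The nice property of keeping the router this dumb: it runs in microseconds, it's auditable, and when it misroutes you can see exactly why.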

The Privacy Angle Developers Are Underestimating

There's something important here that I don't hear talked about enough.

When I work with clients, the code I write has business context baked into it. Table names, business logic, sometimes query fragments that reveal sensitive data structures. Every time I sent that to Claude Code, it passed through Anthropic's servers in the US.

Anthropic has a privacy policy. I read it. It's reasonable. But "reasonable" is not the same as "my client's code never leaves their infrastructure."

While I was digging into all this, I thought about something I wrote about the brain-computer interface for controlling performances with ALS — in that context I talked about how the most powerful interfaces are the most transparent ones for the user. The model that runs on your hardware is the most transparent interface possible: you know exactly where your data is.

Apple Private Cloud Compute (PCC) — the system they introduced with Apple Intelligence — takes this even further. Models that need more capacity than what runs on-device get processed on Apple servers with cryptographic guarantees that not even Apple can see the content of your requests. They published the client source code so anyone can audit it.

No other AI provider has this. Not Google, not Microsoft, not Anthropic. It's a real competitive advantage that took years to build and can't be copied in six months.

The Real Gotchas of Going On-Device (I'm Not Selling You on This)

I'd be lying if I didn't cover this part.

Long context is the Achilles heel. Locally, with 18GB of unified memory, I can run a 13B model with 32k token context. That sounds fine until you have a large codebase and need 100k tokens of context. There the gap with Claude 3.5 Sonnet is enormous and there's no local solution for it today.
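The 18GB ceiling is easy to sanity-check with back-of-the-envelope arithmetic: quantized weights plus the KV cache, which grows linearly with context length. A rough estimator — the architecture numbers below are Llama 3.1 8B's published specs (32 layers, 8 KV heads via grouped-query attention, head dimension 128), and `estimateMemoryGB` is my own helper, not anything from llama.cpp:

```typescript
// Back-of-the-envelope memory estimate for a local model:
//   weights  ≈ paramCount * (bitsPerWeight / 8)
//   KV cache ≈ 2 * layers * kvHeads * headDim * ctxLen * 2 bytes (fp16 K and V)
const estimateMemoryGB = (opts: {
  params: number;        // total parameter count
  bitsPerWeight: number; // 4 for Q4 quantization
  layers: number;
  kvHeads: number;       // KV heads (fewer than attention heads under GQA)
  headDim: number;
  ctxLen: number;        // context length in tokens
}): number => {
  const weights = opts.params * (opts.bitsPerWeight / 8);
  const kvCache = 2 * opts.layers * opts.kvHeads * opts.headDim * opts.ctxLen * 2;
  return (weights + kvCache) / 1e9;
};

const llama31_8b_32k = estimateMemoryGB({
  params: 8e9,
  bitsPerWeight: 4,
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  ctxLen: 32768,
});
// Roughly 4GB of weights plus ~4.3GB of KV cache: comfortable on 18GB.
// The KV cache scales linearly with ctxLen, so the 100k-token-codebase
// case quadruples that cache — that's the wall you hit locally.
```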

Complex reasoning doesn't scale the same way. For debugging concurrency issues like deadlocks where you need deep multi-step reasoning, Llama 3.2 11B doesn't come close to Claude 3.5 Sonnet. It's not even in the same ballpark. The model matters for hard stuff.

The setup has friction. Ollama makes the process much simpler than it was two years ago, but it's still more friction than pip install anthropic. Model downloads run anywhere from 4GB to 30GB. And your first attempts at parameter tuning will disappoint you if you don't know what you're doing.

Apple Intelligence has real limitations. Apple's on-device capabilities are solid for text, summarization, rewriting. For code they're basic. They don't replace a specialized coding model yet.

What changed isn't that local models are better. It's that the equation now has more variables than "which one responds best?"

The Strategy Apple Played That Nobody Gave Them Credit For

There's something elegant in what Apple did that only becomes visible in retrospect.

While OpenAI, Google, and Anthropic were racing on capability benchmarks — parameters, MMLU scores, reasoning benchmarks — Apple kept building the Neural Engine, generation after generation. Not to win AI benchmarks. To make Face ID, Siri (yeah, everyone laughed at Siri), and photo processing faster.

The accidental result: they have the most power-efficient inference silicon in the consumer market. The M3's Neural Engine delivers 18 TOPS; the M4's delivers 38 TOPS. For quantized models up to ~30B parameters, that's competitive with dedicated hardware from two years ago.

And they built a user base of 1.5 billion devices that already trust Apple not to sell their data. Not because Apple is morally superior, but because their business model doesn't depend on advertising.

That's the moat. It wasn't built in a year. It was built across a decade of decisions that looked irrational from the outside.

When I think about this I remember something I read about contributing to the Linux kernel: the infrastructure that looks boring and that nobody wants to maintain is exactly what eventually becomes strategic. Apple built privacy infrastructure when it wasn't cool to do so.

FAQ: Apple AI, Privacy, and Local Models

Is Apple Intelligence enough to replace Claude or GPT-4 for development?
Not today. Apple Intelligence shines at text tasks, summarization, and rewriting. For complex coding, architectural analysis, or multi-step reasoning, Anthropic and OpenAI models are still superior. The more honest question is: how many of your daily tasks actually need the most capable model available?

What hardware do I need to run locally useful models for development?
With 16GB of RAM (unified memory on Apple Silicon) you can run 8-13B models that handle 70-80% of everyday coding tasks well. With 32GB or more, you can run 30B models that compete with GPT-3.5 on many tasks. The M3 Pro with 18GB is my setup and it works well for my daily workflow.

What is Apple Private Cloud Compute and why does it matter?
It's Apple's system for processing AI requests that can't be resolved on-device. Unlike other providers, Apple uses secure enclaves with auditable cryptographic guarantees — not even Apple can see the content of your requests. They published the client code on GitHub for independent audit. No other mainstream AI provider has anything equivalent today.

Is Ollama the best way to run local models on Mac?
Ollama is the lowest-friction option to get started. Alternatives like LM Studio have a better UI. Raw llama.cpp gives you more control over parameters. For a developer who wants to get started without too much setup, Ollama plus a client like Open WebUI or the Zed/Continue integration is the fastest path.

Does on-device privacy actually matter if the provider "promises" not to train on my data?
Depends on your threat model. If you work with client code, sensitive data, or in regulated industries (fintech, healthcare, legal), contractual promises aren't the same as the technical impossibility of accessing your data. On-device or Apple's PCC gives you technical guarantees, not just contractual ones. For personal side project code, it probably doesn't matter.

Does this mean Apple is going to win the AI race?
Not in the "most capable model" sense. Probably never. But there's room for multiple winners depending on what problem you're solving. Apple can win in the segment of users who prioritize privacy, regulatory compliance, and integrated experience over raw model capability. In Europe with GDPR, in regulated sectors, with users who have sensitive data — that segment isn't small.

The Mess Got Fixed, But the Strategy Changed

Back to where I started: the month Anthropic didn't respond was uncomfortable. But it forced me to rethink what I run where and why.

The conclusion I reached isn't "local models are better" or "Apple won AI." It's more nuanced than that:

The right model depends on the problem. Privacy is just another variable in that equation. For 70% of my daily coding tasks, a local 13B model on my M3 Pro is sufficient and more private. For complex reasoning, systems architecture, analysis of non-sensitive code — I still escalate to Claude.

What changed is that I no longer have a single provider with a single model for everything. I have a strategy where cost, privacy, and capability get balanced against each task.

Apple spent ten years building something that now has real strategic value. Not because they were brilliant at AI. Because they were consistent on privacy when it wasn't earning them any headlines.

Sometimes the winning strategy is the one with the least glamour while you're executing it.

Are you already running local models in your setup? Or does everything still go to external APIs? I'm genuinely curious what balance other developers have found.
