I've been running local LLMs on my MacBook Pro M3 for a while now. Today I added codestral:22b to my stack — and the experience was interesting enough to write down.
Here's exactly what happened, what works, and how I route between models now.
My Setup
- Machine: MacBook Pro M3 (not Pro/Max — the base M3)
- RAM: 36GB unified memory
- Runtime: Ollama — the easiest way to run local models on Apple Silicon
- Models running today: qwen2.5:14b (9GB) and codestral:22b (12GB)
I'm using these models in an autonomous agent loop — research, code generation, memory optimization, writing. Not just chatting. Real workloads.
Why I Tried codestral:22b
I'd been running qwen2.5:14b as my default for everything. It's fast, versatile, and fits comfortably in memory. But for heavy coding tasks — especially multi-file refactors and bounty implementations — I wanted something purpose-built for code.
codestral:22b is Mistral's dedicated code model, trained specifically on programming tasks across 80+ languages. The 22B parameter count gives it more reasoning depth than smaller code models while remaining runnable on consumer hardware with enough RAM.
At 12GB on disk, it fits on my machine. Barely.
The Install
```
ollama pull codestral:22b
```
That's it. Ollama handles everything — quantization, model format, serving. It downloaded and was ready in a few minutes on a decent connection.
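Once pulled, Ollama also serves the model over a local HTTP API (default `localhost:11434`), which is what agent loops talk to. A minimal sketch of calling it from Python, using Ollama's standard `/api/generate` endpoint; the model name is from this post, and this of course needs the Ollama server running:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks for one complete JSON response instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    # Requires a running local Ollama instance with the model already pulled.
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (needs Ollama running):
# print(generate("codestral:22b", "Write a TypeScript debounce function."))
```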
One important caveat: you cannot run both models simultaneously on 36GB RAM.
Between the OS, Chrome, and background processes, a good chunk of that 36GB is already spoken for. Loading codestral:22b (12GB) on top of an already-loaded qwen2.5:14b (9GB) leaves too little headroom, and the resulting swapping kills performance on any model.
My workflow: close Chrome tabs, let Ollama swap models, then proceed. Ollama handles model loading automatically when you make a request — the old model unloads to make room.
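A back-of-envelope fit check makes the swap decision mechanical. This is a hypothetical helper, not part of Ollama: the model sizes and 36GB total are from this post, and the system overhead figure is an assumption you'd tune for your own machine:

```python
# Rough memory-fit check before loading a model (all figures in GB).
TOTAL_RAM = 36          # base M3 config from this post
SYSTEM_OVERHEAD = 16    # assumed: macOS + Chrome + background processes

MODEL_SIZES = {"qwen2.5:14b": 9, "codestral:22b": 12}

def fits(loaded_models: list[str]) -> bool:
    """True if the given models plus system overhead fit without swapping."""
    used = SYSTEM_OVERHEAD + sum(MODEL_SIZES[m] for m in loaded_models)
    return used <= TOTAL_RAM

print(fits(["qwen2.5:14b"]))                   # → True (one model is fine)
print(fits(["qwen2.5:14b", "codestral:22b"]))  # → False (both at once swaps)
```

This is why the swap-one-out workflow exists: either model alone fits comfortably, but both together push past physical RAM once the OS and browser are counted.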
Speed Comparison on M3
| Task | qwen2.5:14b | codestral:22b |
|---|---|---|
| Token generation (simple) | ~35 tok/s | ~22 tok/s |
| Code completion (100 lines) | Fast, decent quality | Slower, better quality |
| Multi-file reasoning | Good | Noticeably better |
| General Q&A | Excellent | Overkill |
| Memory footprint | 9GB | 12GB |
The speed difference is real. qwen2.5:14b generates tokens significantly faster. For interactive use — especially in agent loops where you're waiting on a response before continuing — that matters.
But for code specifically, codestral:22b produces tighter, more idiomatic output. On a refactor task I ran today, it caught a subtle TypeScript pattern that qwen missed. Worth the slowdown for the right task.
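To make the speed gap concrete in wall-clock terms: the rates below are my measurements from the table, and the 500-token response length is an assumed typical agent-loop reply:

```python
# Wall-clock time to generate a response at the measured token rates.
RATES = {"qwen2.5:14b": 35, "codestral:22b": 22}  # tokens/sec from the table

def seconds_for(model: str, tokens: int = 500) -> float:
    return round(tokens / RATES[model], 1)

print(seconds_for("qwen2.5:14b"))    # → 14.3
print(seconds_for("codestral:22b"))  # → 22.7
```

An extra eight seconds per response is nothing for a one-off refactor, but it compounds fast across the hundreds of calls an agent loop makes per session.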
When to Use Each
Use qwen2.5:14b for:
- Research and web summarization
- Writing tasks (articles, docs, explanations)
- Memory organization and agent reasoning
- Anything where you need fast turnaround
- General-purpose Q&A
Use codestral:22b for:
- Complex multi-file code implementations
- Code review and refactoring
- When output quality matters more than speed
- Debugging subtle logic errors
- Anything where you'd reach for Claude Sonnet otherwise
The rule I've settled on: qwen is my default. I only load codestral when I'm doing real coding work — not one-liners, not quick edits, but actual feature implementation.
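That default-plus-specialist rule can be sketched as a tiny dispatcher. This is a hypothetical illustration of the policy, not a real library; the keyword list is an assumption and a real agent would use a richer task classifier:

```python
# Route a task to a model: qwen is the default, codestral only for real coding work.
CODE_KEYWORDS = ("refactor", "implement", "debug", "code review", "feature")

def pick_model(task: str) -> str:
    task_lower = task.lower()
    if any(kw in task_lower for kw in CODE_KEYWORDS):
        return "codestral:22b"   # slower, deeper: load only when it earns its keep
    return "qwen2.5:14b"         # fast default for everything else

print(pick_model("Summarize this research paper"))           # → qwen2.5:14b
print(pick_model("Refactor the auth module in TypeScript"))  # → codestral:22b
```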
Apple Silicon Reality in 2026
The unified memory architecture is what makes any of this possible. On a traditional machine, running a 22B-parameter model would require a dedicated GPU with 24GB+ VRAM, a $1,500+ add-on card. On Apple Silicon, the CPU and GPU share the same memory pool, and Ollama runs inference on the GPU through Metal.
The result: a $1,500 MacBook Pro M3 can run models that required workstation hardware two years ago.
Is it as fast as a dedicated GPU? No. An RTX 4090 would run codestral:22b at maybe 3-4x my token rate. But it's working, it's free to run, and it doesn't require cloud API calls — which matters a lot when you're running agent loops that make hundreds of LLM calls per session.
Cost comparison on a heavy workday:
- Claude API (Sonnet): ~$2-8 depending on context length
- Codestral local: $0.00
For the kind of repetitive, high-volume tasks that autonomous agents do, local wins on economics alone.
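The economics compound quickly. A rough monthly estimate using the per-day API cost range above; the 20 workdays/month figure is an assumption:

```python
# Monthly savings from routing heavy workloads to a local model.
API_COST_PER_DAY = (2.00, 8.00)  # Claude Sonnet daily range from this post, USD
LOCAL_COST_PER_DAY = 0.00
WORKDAYS_PER_MONTH = 20          # assumption

low = (API_COST_PER_DAY[0] - LOCAL_COST_PER_DAY) * WORKDAYS_PER_MONTH
high = (API_COST_PER_DAY[1] - LOCAL_COST_PER_DAY) * WORKDAYS_PER_MONTH
print(f"${low:.0f}-${high:.0f} saved per month")  # → $40-$160 saved per month
```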
What I'd Change
The 36GB base M3 is tight. I frequently close Chrome to make room for codestral:22b. If I were buying today, I'd go straight to 48GB unified memory — that's the threshold where you can comfortably keep both a large code model and a general model loaded simultaneously, plus leave room for the OS.
The M3 Max or M4 chips with 64GB+ are the real sweet spot for local LLM work. But the base M3 with 36GB is absolutely workable — you just need to be intentional about memory management.
The Bottom Line
Local LLMs on Apple Silicon actually work in 2026. Not as a novelty. As a production tool.
qwen2.5:14b is my workhorse — fast, capable, fits comfortably, handles 90% of tasks.
codestral:22b is my specialist — slower, deeper, code-focused, worth loading when the task demands it.
Together, running entirely offline on commodity hardware, they've replaced a meaningful chunk of my API spend while keeping response quality high where it counts.
If you have an M3 or M4 Mac with 36GB+, there's no reason not to try this today. The setup is literally one command.
```
ollama pull qwen2.5:14b
ollama pull codestral:22b
```
Pick the right model for the right job. That's the whole strategy.