Normal hardware × really smart local LLMs = future
This week I ran a fully local AI model on my MacBook. Not as a curiosity, but genuinely trying to use it in my actual workflow with my agent system on real tasks.
The model was Qwen 3.5 at 9 billion parameters. My machine is an M1 Pro with 16 GB of RAM—a regular laptop, not a workstation. Qwen 3.5 is recent, and the smaller variants make this experiment worth trying now.
It worked. Not just in the sense of launching without errors, but in the sense that I could sit and do actual work without feeling punished by slowness. It was slower than Claude, obviously, but acceptably so.
Two Different "Local AI" Stories
There's an important distinction often collapsed into one concept:
Version One: Local agent with cloud model. Your code, memory systems, automation scripts, and tool integrations live on your device. The actual model is remote—calling Claude or OpenAI from your laptop. This is what I do with Wiz.
Version Two: Fully local LLM. The model itself lives on your device. No API calls, no cloud dependency, no data leaving your machine. For years this required serious hardware. That's changing now.
The MacBook Experiment
Qwen 3.5 at 9 billion parameters runs acceptably on 16 GB of RAM. This is significant.
I used Ollama: a one-command install that handles model management and exposes a local OpenAI-compatible API at localhost:11434. Any tool that supports the OpenAI format works with it.
Setup requires three commands:
brew install ollama
ollama pull qwen3.5:9b
ollama run qwen3.5:9b
Ollama starts a local server with an OpenAI-compatible API. Since Wiz is model-agnostic, switching to Ollama was a one-line configuration change.
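Since Wiz isn't public, here's a generic sketch of what that kind of switch looks like: pointing an OpenAI-style chat request at Ollama's local endpoint instead of a cloud provider. The helper names are mine, not from any particular agent system; the only real assumption is that `ollama serve` is running locally.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint (assumes `ollama serve` is running).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Assemble an OpenAI-format chat payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask_local(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Send the prompt to the local model and return its reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swapping back to a cloud provider is just a different URL and an API key, which is the whole point of targeting an OpenAI-compatible surface.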
What actually happened:
Memory recall worked surprisingly well. The model read context files and surfaced relevant information with reasonable accuracy. For tasks fundamentally about reading files and reporting findings, a 9B model handles this fine.
Tool calling was interesting. Qwen invoked my agent system's tools with reasonable accuracy on straightforward requests. For agentic work, calling the right tool matters more than producing beautiful prose.
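To make "invoked my tools" concrete, this is the shape of a tool definition in the OpenAI function-calling format, which OpenAI-compatible endpoints like Ollama's accept. The tool name and description here are hypothetical illustrations, not Wiz's actual tools.

```python
# Hypothetical tool definition in the OpenAI function-calling format.
# A small model only has to pick this tool and fill in "query".
memory_search_tool = {
    "type": "function",
    "function": {
        "name": "memory_search",
        "description": "Search the agent's local context files for a topic.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "What to look for."},
            },
            "required": ["query"],
        },
    },
}
```

The narrow surface is what makes this tractable for a 9B model: it doesn't need to write the search, only to select the tool and fill one string field.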
Creative tasks and complex reasoning? The quality gap was noticeable. That's not a criticism—it's an honest observation. The 4B variant showed a significant drop in capability; for my kind of work, 9B is the usability floor.
Critical framing: this isn't comparing Qwen to Claude Opus. They're not the same category. The question is whether local models handle a real subset of actual work. The answer is yes.
The iPhone Experiment
I used PocketPal AI (free on App Store), open-source software letting you download and run language models directly on your iPhone, completely locally. Download models from Hugging Face over Wi-Fi once, then run with no internet required.
I ran Qwen at 0.8 billion and 2 billion parameters on my iPhone 17 Pro.
The obvious question wasn't "is this as good as Claude" but "can something genuinely useful fit locally on a phone at all?" Yes, with clear limits. These are tiny models handling basic text and short question-answering reasonably well. They won't build apps overnight. But they run fully on the device.
The most interesting implication isn't model capability—it's the hardware signal. An iPhone running local LLMs in 2026 means smartphones crossed a threshold. Not because the 0.8B model is impressive, but because hardware already in your pocket can do this.
The privacy angle is real too. When nothing leaves your device, there's nothing to think about in terms of what you're sending where. No API logs, no terms of service governing your queries. Just you and the weights running on your silicon.
The Cost Angle
Not every task requires Opus, and AI subscription costs add up fast when you're running agent tasks constantly. This isn't hypothetical—I track my usage closely.
Plenty of agent work is genuinely simple: read a file, format something, summarize a note, answer factual questions from context. Routing those to local models instead of frontier models changes the math considerably.
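One way to act on that math is a crude router: known-simple task types go to the local model, everything else goes to the cloud. The task labels and model identifiers below are illustrative, not from any real system.

```python
# Illustrative router: cheap local model for simple tasks,
# frontier model for everything else.
SIMPLE_TASKS = {"summarize", "format", "extract", "recall"}

def pick_model(task_type: str) -> str:
    """Return a model identifier based on task difficulty."""
    if task_type in SIMPLE_TASKS:
        return "qwen3.5:9b"        # served locally by Ollama
    return "cloud-frontier-model"  # placeholder for Claude/OpenAI
```

Even a lookup table this naive captures the idea: the routing decision is cheap, and every query it keeps local is one you don't pay cloud rates for.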
Where This Goes
I think the future involves far more local compute than current conversation suggests.
The shape I see: cloud models for hard work. Complex reasoning, creative work, architectural decisions. But for the hundreds of small cognitive tasks happening in an agent system daily, local models will get good enough that routing makes sense.
Hardware trajectory matters. From M1 through M5, each generation has been meaningfully faster and more memory-efficient. Both trajectories—better models and better hardware—point toward the same place. In a few years, laptops people already own will run noticeably more capable models than what I ran this week.
My rough prediction: in three years, local models fine-tuned to specific use cases will genuinely compete with today's frontier models on those specific tasks. Not general reasoning or creative synthesis. But "do this specific thing I care about, quickly, privately, without internet." That's a very real category.
There's an underexplored environmental angle too. The resource cost of a data-center query dwarfs that of the same inference running on local silicon. If routine AI tasks shift local, the resource equation changes meaningfully.
The tradeoffs are clear right now: local models are limited, fine-tuning takes effort, and the capability gaps are real. But the direction of travel isn't ambiguous. The gap is closing. I tested it this week on hardware I've owned for years, and it worked well enough to make me reconsider how I route tasks.
If you're curious: install Ollama, pull Qwen 3.5 at 9B, try something simple in your workflow. The experience differs from benchmark-running—it's surprisingly real.
Originally published on Digital Thoughts