
Quentin Merle

🚀 Local AI in 2026: My Journey Through the Desert (From Terminal to GPU)

Disclaimer & Context: This article is based on my personal experience using a MacBook Pro M1 Pro with 32GB of RAM and VS Code. While I use Claude as the primary reference for Cloud AI (given its current leadership in coding tasks), the same logic applies to other giants like Gemini or ChatGPT when comparing Cloud performance vs. Local efficiency.


The Starting Point: "Is Local AI actually good? And is it a pain to set up?"
A few weeks ago, I knew nothing about Ollama. Like many devs, I was just juggling free quotas from the cloud giants in my IDE. Then, curiosity hit me before I reached for my credit card: can you actually run a world-class "brain" on a base MacBook Pro M1 Pro (32GB) in 2026?


1. The Installation Shock (Pure Euphoria)

Installing Ollama is almost too easy. One command, and boom: you have an AI in your terminal. No account, no API key, no credit card.

Installing Ollama is the easy part.


2. DeepSeek, Qwen, Mistral... Which "Brain" Should You Pick?

Before hitting my first prompt, I had to dig through the library. In 2026, three families dominate the game:

  • Qwen-Coder (Alibaba): The "Clean Code" architect. Brilliant with React and Tailwind, it produces elegant code and follows best practices.
  • DeepSeek-Coder: The logic "Sniper." Formidable for complex algorithms and pure backend tasks.
  • Mistral (France) & Llama (Meta): The pillars. Mistral is a superb, versatile European alternative, while Llama remains the universal Swiss Army knife of Open Source.

2 bis. What’s a "B"? (Understanding Brain Size)
You see labels everywhere like 4B, 7B, 32B. The "B" stands for Billion.

  • The Number: It’s the number of parameters (the model’s learned weights). The higher the number, the more "educated" the AI is.
  • The RAM Footprint: In 2026, thanks to "quantization", a 1B model consumes about 0.8GB of RAM.
    • A 4B model takes up ~3.5GB.
    • A 32B model eats ~20GB... just to exist in your memory!
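You can reproduce those figures with a back-of-the-envelope formula (a rough sketch: the effective bits-per-parameter depends on the quantization level, so treat the defaults below as estimates, not official numbers):

```python
def estimate_ram_gb(params_billion: float, bits_per_param: float = 6.0,
                    overhead_gb: float = 0.0) -> float:
    """Rough RAM footprint of a quantized model:
    weights (params * bits / 8 = bytes) plus optional runtime overhead."""
    weights_gb = params_billion * bits_per_param / 8
    return round(weights_gb + overhead_gb, 1)

# Plugging in the effective bits-per-parameter implied by the figures above:
print(estimate_ram_gb(1, bits_per_param=6.4))   # ~0.8 GB
print(estimate_ram_gb(32, bits_per_param=5.0))  # ~20 GB
```

Larger models often ship with more aggressive quantization, which is why the GB-per-B ratio shrinks as models grow.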

3. ⚠️ The "Claude Code" Disclaimer (Don’t Get Fooled)

You see it everywhere right now: "Use Claude for free via Ollama!" That's only half true. Claude Code is a great tool (an agentic CLI), but it's just an interface.

  • By default, it connects to Anthropic's paid models (Sonnet, Opus, Haiku).
  • You can "plug" it into Ollama (e.g., claude --model qwen3.5-coder). It’s free and private, but you get the Claude UX with your local model's brain.

4. The Reality Wall: "Matrix" Latency 🐌

Thinking I was doing the right thing, I loaded a Qwen 32B.

  • The Crash: My Mac froze. The AI took minutes to output a single word.
  • The Culprit: My system (Chrome, VS Code, Teams) was already hogging 20GB.
  • The Fatal Math: 20GB (System) + 20GB (AI) = 40GB. On my 32GB RAM machine, the Mac had to use the SSD (Swap). Result: unbearable slowness.
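That fatal math is easy to sanity-check before you load anything (a tiny sketch with the article's numbers plugged in):

```python
def swap_pressure_gb(model_gb: float, system_gb: float, total_ram_gb: float) -> float:
    """GB that spill over into SSD swap once the model is loaded (0 means it fits)."""
    return max(0.0, model_gb + system_gb - total_ram_gb)

# 32B model (~20GB) on top of a 20GB-hungry system, on a 32GB Mac:
print(swap_pressure_gb(20.0, 20.0, 32.0))  # 8.0 GB forced into swap -> the freeze
# A 4B model (~3.5GB) on the same machine:
print(swap_pressure_gb(3.5, 20.0, 32.0))   # 0.0 -> fits comfortably
```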

I tried pairing this with Roo Code (an open-source, AI-powered coding assistant) in VS Code, but every instruction shipped a huge context payload along with it, and the RAM saturated instantly. It’s frustrating when you're used to the instant reactivity of the Cloud.


5. The Art of Compromise: "Slicing" Your Setup

After nearly losing my mind, I pivoted to a hybrid approach:

  • Qwen-coder 1.5B: For autocomplete (instant).
  • Qwen 3.5 4B: My "daily driver." This is the Sweet Spot for 32GB: it leaves enough room for macOS to breathe while remaining highly relevant.
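The "slicing" logic boils down to picking the largest brain that still leaves macOS room to breathe. A minimal sketch (the model names and GB figures are illustrative estimates, not official sizes):

```python
def pick_model(free_ram_gb: float, candidates: dict, headroom_gb: float = 4.0):
    """Pick the largest candidate model (name -> estimated GB) that still
    leaves `headroom_gb` of RAM free for the OS and your editor."""
    budget = free_ram_gb - headroom_gb
    fitting = {name: gb for name, gb in candidates.items() if gb <= budget}
    return max(fitting, key=fitting.get) if fitting else None

# Illustrative estimates for a 32GB Mac with ~12GB actually free:
models = {"qwen-coder:1.5b": 1.2, "qwen:4b": 3.5, "qwen:32b": 20.0}
print(pick_model(12.0, models))  # qwen:4b
```

With only 12GB free, the 32B never makes the cut, and the 4B wins over the 1.5B because it's the biggest model that fits the budget.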

💡 Pro Tip: Using a smaller model requires re-learning how to prompt. Cloud AIs "read between the lines" and guess your vague intentions. In local with a 4B, that magic doesn't exist. You have to become a prompt craftsman again: be precise, concise, and structured.
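For instance (a made-up illustration, with a hypothetical `Navbar.tsx`), here is the kind of prompt a giant cloud model forgives but a local 4B won't:

```text
Too vague for a 4B:
  "Make the navbar better."

What a 4B needs:
  "In Navbar.tsx, make the <nav> sticky at the top with a blurred background.
   Use Tailwind classes only. Do not change any props or component names."
```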


6. The Essential Tool: Can I Run AI?

A life-saving discovery: canirun.ai. This site simulates the RAM consumption of a model based on your hardware before you download it. It’s a mandatory stop before every ollama pull.


🏁 Verdict: Is the Future Hybrid?

I managed to have my little 4B model code a complex Parallax component. It was smooth, clean, and 100% private. But let’s be honest for a second:

If you’ve been spoiled by the speed and "mind-reading" capabilities of Claude Sonnet or Gemini Pro, running local AI on a 32GB machine still feels a bit... outdated. It’s like switching back to a manual car after years of driving an automatic.

  • Intelligence: A local 4B is a great intern. Claude remains the Senior Architect.
  • Speed & Comfort: The sheer friction of managing your RAM and dealing with slightly "dumber" prompts makes the Cloud experience unbeatable for pure productivity.

To put it bluntly: Sometimes, I even find myself doubting the local AI's output. To stretch the point, I almost feel the urge to ask Claude to double-check Qwen's answer just to be sure 🙃.

Will I keep using my local Qwen 3.5? Yes, but mostly out of curiosity—to push its limits and see what it's really made of. But for my heavy-duty daily dev work? The comfort, speed, and sheer brilliance of a Cloud AI aren't going anywhere.

In 2026, RAM is the new CPU power. Until I have 128GB of Unified Memory on my desk, the giants still own the crown.

What about you? What’s your "Sweet Spot"? Are you playing the local card for privacy, or is the Cloud still your only co-pilot?
