cucoleadan

Posted on May 21 • Originally published at vibestacklab.substack.com on Mar 24

Ditch Your Subscriptions and Run Open Source AI on Your Device

#open #source #local #models

This post was originally published on my Substack publication as Ditch Your Subscriptions and Run Open Source AI on Your Device.

Open-source AI models are beating the paid ones. A year ago that sentence would have been ridiculous. Not anymore.

Qwen 3.5 is outscoring GPT-5.2 on key benchmarks. MiniMax M2.5 is running on people's Mac Studios at roughly 20 tokens per second, trading blows with frontier models like Opus 4.5 and Gemini 3 Pro. The gap between a $20/month cloud subscription and running that same intelligence on your own hardware has never been thinner.

The models are free. The tools are ready. The part most people get stuck on is figuring out which model their specific hardware can actually handle without crawling.

I spent weeks digging through benchmarks, community reports, and real-world results across every hardware tier for two of the most relevant open-source model families. What follows is exactly what runs where, how fast, and which model deserves a spot on your machine.

In this article:

Two Families
Efficiency Tier
GPU and Mac Mini Tier
The Final Boss
The Cheat Sheet
Your Machine, Your Model
What Comes Next

Two Families

Two model families cover the entire spectrum from "runs on a phone" to "runs on a workstation" better than anything else available today.

Qwen 3.5 (by Alibaba) is the Swiss Army knife of open-source AI. Eight sizes from 0.8B to 397B parameters. Specialized variants for coding, vision, and reasoning. All Apache 2.0 licensed. Every local AI tool worth mentioning (Ollama, LM Studio, llama.cpp, Jan.ai) supports it out of the box. The latest generation dropped between February and March 2026 with a new Gated DeltaNet architecture, 262K-token context windows, and support for 201 languages. This is where most people should start.

MiniMax M2.5 is the ambitious one. It has 230 billion total parameters, but it only activates 10 billion of them for each token it generates, thanks to an extreme Mixture-of-Experts architecture (more on this soon). 200K native context window. The community at Unsloth compressed it from 457GB down to a 101GB file, making home deployment possible. For those with the hardware, it's frontier-class intelligence on your own desk.

Efficiency Tier

You do not need an expensive GPU to run a language model locally.

You can use your travel laptop with integrated graphics, a base Mac Mini or your aging desktop to run a Qwen model at home.

Qwen3.5-4B (~2.5 GB at Q4 quantization) offers the best quality at this size. It handles drafting emails, summarizing documents, light coding help, translation, and private conversations that never leave your machine.

Based on community reviews, it's coherent and helpful in ways you wouldn't expect from a model this small. Qwen3.5-2B (~1.3 GB) is the sweet spot for CPU-only machines. Qwen3.5-0.8B (~0.5 GB) runs on anything with a CPU (like your phone).
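The file sizes quoted above follow from simple arithmetic: parameter count times bits per weight, divided by eight bits per byte. Here is a minimal sketch, assuming an effective 5 bits per weight for a Q4-style file. That figure is my assumption, not a number from the model cards; 4-bit formats also store scaling data and keep some tensors at higher precision, so the real value varies by quantization recipe.

```python
def estimate_model_size_gb(params_billion: float, bits_per_weight: float = 5.0) -> float:
    """Rough file size in GB: parameters x bits per weight / 8 bits per byte.

    bits_per_weight=5.0 is an assumed effective value for a Q4-style
    quantization, not an official figure.
    """
    return params_billion * bits_per_weight / 8


# Parameter counts taken from the model names in this article.
for name, params in [("Qwen3.5-0.8B", 0.8), ("Qwen3.5-2B", 2.0), ("Qwen3.5-4B", 4.0)]:
    print(f"{name}: ~{estimate_model_size_gb(params):.2f} GB")
```

Under that assumption the estimates land close to the ~0.5 GB, ~1.3 GB, and ~2.5 GB figures above. Leave extra room on top of the file size for the context cache and the rest of your system.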

These won't write your PhD thesis, but they're fast (40+ tokens per second on CPU), completely private, and the 4B punches well above its weight. Getting started takes one command:

ollama run qwen3.5:4b

Ollama downloads the model and then you're ready to chat. Use LM Studio if you prefer a GUI or, my favorite, Jan.ai if you want something prettier.
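Once the model is pulled, Ollama also serves a local HTTP API, so you can call it from scripts as well as from the chat prompt. A minimal sketch using only the Python standard library, assuming Ollama is running on its default port (11434) and that the `qwen3.5:4b` tag from the command above is installed. The helper names `build_request` and `ask` are mine, for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3.5:4b") -> urllib.request.Request:
    """Build a POST request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "qwen3.5:4b") -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server):
# print(ask("Summarize this paragraph in one sentence: ..."))
```

Nothing in this request leaves your machine; the URL points at localhost.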

No API latency. No rate limits. You hit Enter and the answer starts flowing instantly. This tier is where people have the "wait, this is running on MY computer?" moment.

GPU and Mac Mini Tier

The hardware range here is wide. On the lower end: an RTX 3060 12GB, an RTX 4060 Ti 16GB, an RX 7800 XT, or a Mac Mini M4 Pro with 24GB. On the upper end: an RTX 3090, an RTX 4090, or a Mac Mini M4 Pro with 48-64GB.

Apple's unified memory works like VRAM for AI inference, so a 24GB Mac Mini sits in this tier right alongside a 24GB GPU. One rule applies across the board: the bigger the GPU and the more memory you have, the faster your tokens generate and the larger the model you can fit.

Qwen3-8B (~5 GB) is a solid all-rounder that leaves tons of headroom on a 12GB card. Good for quick tasks and light conversations.

Qwen3-14B (~9 GB) is the Goldilocks model. It fits comfortably on 12-16 GB and delivers top-notch quality when you take its size into account. It does a great job at coding, reasoning, and creative writing. If you have the memory for it, this is where I'd recommend most people start.

Qwen3.5-35B-A3B (~18.6 GB) is the model that inspired me to write this article. It has 35 billion total parameters, but only 3 billion activate for each token it generates. This is a Mixture-of-Experts model.

MoE models are built differently. Instead of one massive brain firing every neuron, think of it as a team of specialists. Ask a coding question and the coding experts light up. Switch to creative writing and a different set activates. The result: you get 35B-level intelligence at close to 3B speed. The memory cost does not shrink the same way, because all 35 billion weights still have to be stored somewhere, which is why the file is ~18.6 GB. It fits on 16GB with CPU offloading.

This MoE architecture is the same idea behind MiniMax M2.5, so keep that concept in mind as we move along.
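To make the "team of specialists" idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is a generic illustration of the technique, not the actual router used by Qwen or MiniMax, whose internals this article does not cover. The function names and the scalar "experts" are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs by router weight."""
    chosen = route_top_k(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in chosen)


# Four toy "experts"; in a real model each one is a feed-forward network.
experts = [lambda x: x, lambda x: 10 * x, lambda x: 100 * x, lambda x: 1000 * x]
print(moe_layer(1.0, experts, router_scores=[0.0, 5.0, 1.0, 5.0], k=2))
```

Only two of the four experts do any work here, which is where the speed comes from. All four still have to exist in memory, because the router can pick a different pair for the next token.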

Qwen3.5-27B (~17 GB) is the dense powerhouse at the top of this tier. Built for 24GB cards and 48-64GB Macs. All 27 billion parameters fire on every response, it supports 262K context across 201 languages, and it wins on reasoning and coding benchmarks against every model at this size. With 24GB of VRAM you still have plenty of headroom left for long conversations.

Qwen3-Coder-30B-A3B also deserves a mention here, as it's a dedicated coding model (also MoE, 3B active) that rivals Claude Sonnet 4 on SWE-Bench.
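A quick way to sanity-check this tier is to compare the file sizes quoted above against your memory, leaving some room for the context cache. A minimal sketch: the sizes come from this article, while the 2 GB headroom default is a placeholder assumption of mine that you should adjust for longer conversations.

```python
# Approximate Q4 file sizes in GB, as quoted in this article.
MODEL_SIZES_GB = {
    "Qwen3-8B": 5.0,
    "Qwen3-14B": 9.0,
    "Qwen3.5-27B": 17.0,
    "Qwen3.5-35B-A3B": 18.6,
}


def models_that_fit(memory_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Return models whose file size plus headroom fits entirely in memory.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    return [
        name
        for name, size_gb in MODEL_SIZES_GB.items()
        if size_gb + headroom_gb <= memory_gb
    ]


for memory in (12, 16, 24):
    print(f"{memory} GB: {models_that_fit(memory)}")
```

This check only covers models that fit entirely in GPU or unified memory. It does not model CPU offloading, which is how the 35B-A3B can still run on a 16GB card.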

Speed across this tier ranges from 15 to 40+ tokens per second, depending on model size and your hardware. A 64GB Mac Mini M4 Pro runs the 27B at 15-25 tok/s and the 35B-A3B even faster thanks to MoE efficiency. A 24GB GPU pushes the smaller models past 40 tok/s. For reference, average human reading speed is roughly 250 words per minute, or about 5-6 tokens per second.
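The reading-speed comparison is a one-line conversion. A small sketch, assuming roughly 1.3 tokens per English word, which is a common rule of thumb and not a measured value for these models:

```python
def reading_speed_tokens_per_second(words_per_minute: float, tokens_per_word: float = 1.3) -> float:
    """Convert a reading speed in words per minute to tokens per second.

    tokens_per_word=1.3 is an assumed average for English text.
    """
    return words_per_minute / 60 * tokens_per_word


reading = reading_speed_tokens_per_second(250)
print(f"Reading speed: ~{reading:.1f} tok/s")
print(f"A 15 tok/s model is ~{15 / reading:.1f}x faster than you can read")
```

So even the low end of this tier generates text faster than most people can read it.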

Worth noting for anyone planning to run models around the clock: the Mac draws about 30W under load compared to 300W+ for a GPU rig. Over months of use, the electricity savings add up.
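To put numbers on that, here is a back-of-the-envelope sketch for a machine running under load around the clock. The wattages are the ones quoted above; the electricity price is a placeholder assumption, so substitute your own rate.

```python
def monthly_energy_kwh(watts: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Energy used in a month, in kilowatt-hours."""
    return watts * hours_per_day * days / 1000


PRICE_PER_KWH = 0.15  # assumed price in your local currency; not from the article

for label, watts in [("Mac", 30), ("GPU rig", 300)]:
    kwh = monthly_energy_kwh(watts)
    print(f"{label}: {kwh:.1f} kWh/month, about {kwh * PRICE_PER_KWH:.2f} per month")
```

The ratio matters more than the price: at these wattages the GPU rig uses ten times the energy for the same hours of uptime.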

The Final Boss

Everything above was the warm-up. This is the final boss.

You need a Mac Studio with 128GB unified memory or a multi-GPU PC with 96GB+ RAM. The Mac Mini caps at 64GB, so it tops out at the GPU tier above.

MiniMax M2.5 takes the MoE concept to the extreme: 230 billion total parameters, 10 billion active per token. A 200K native context window that handles entire codebases, full novels, or months of transcripts in one conversation.

Mac Studio 128GB is the ideal setup. There is no bottleneck between GPU and CPU since it's all one memory pool. Community benchmarks put it at 20-25 tok/s.

A PC with dual GPUs + 96GB RAM works through CPU offloading. It is generally slower (12-25 tok/s) but functional.

The key number: 101GB. Unsloth's 3-bit GGUF (UD-Q3_K_XL) compresses the model from 457GB to 101GB with minimal quality loss.
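You can check what "3-bit" means in practice by working backward from the file sizes. A minimal sketch using only the numbers quoted in this article (230 billion parameters, 457GB original, 101GB compressed); the function name is mine.

```python
def effective_bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter: file size in bits / parameter count."""
    return file_size_gb * 8 / params_billion


original = effective_bits_per_weight(457, 230)
compressed = effective_bits_per_weight(101, 230)
print(f"Original:   ~{original:.1f} bits per weight")
print(f"Compressed: ~{compressed:.1f} bits per weight")
print(f"Size ratio: ~{457 / 101:.1f}x smaller")
```

The original works out to roughly 16 bits per weight, and the compressed file averages about 3.5. That average sits above 3 because a quantized file typically keeps some tensors at higher precision than the headline bit width.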

Start with 16K-32K context and scale up. Enable flash attention and CPU-MOE offloading.

Do all that and you'll get frontier-class intelligence and massive context entirely on your hardware. No API costs, no data leaving your machine, no rate limits. For lawyers, researchers, or developers handling sensitive work, this is the endgame of private AI.

For developers at this tier: Qwen3-Coder-480B-A35B is the most capable open-source coding model available (that you can run at home). 480B total parameters, 35B active, 69.6% on SWE-bench Verified, comparable to Claude Sonnet. It needs 240GB+ at Q4, which is more than a 192GB Mac Studio can hold, so a higher-memory Mac Studio configuration or a multi-GPU server setup is the minimum. If you write code for a living and have the hardware, this is the local Copilot replacement to end all replacements.

Looking ahead: MiniMax M2.7 launched March 18 with strong coding benchmarks (56.2% SWE-Pro, 97% skill adherence across 40+ tasks), but weights are proprietary. You can't run it locally yet. MiniMax M3 is expected to add multimodal capabilities (text, images, video). M2.5 is text-only, which is its biggest gap compared to Gemini Flash or GPT-5.4 Mini. If M3 ships open-weight, it becomes a direct competitor to those cloud-only models on home hardware.

The Cheat Sheet

Find your hardware, grab your model, go.

Your Machine, Your Model

The cheat sheet above gets you close, but VRAM estimates are estimates. Your exact hardware, OS, and background apps all matter. Reddit gives conflicting advice. YouTube benchmarks were run on different machines.

This is why I'm building the AI Hardware Checker. It's a website where you plug in your hardware details (your GPU, your RAM) and it tells you which AI model fits your setup, what settings to use, and what speed to expect.

It's not live yet. I'm actively building it right now. And I want to build it around real hardware owned by real people.

What Comes Next

What needed a datacenter two years ago runs on a gaming PC today. What runs on a gaming PC will run on a phone tomorrow. MoE architectures, Gated DeltaNet, aggressive quantization. The field is sprinting toward "run anywhere."

Qwen and MiniMax are the beginning. MiniMax M2.7 is already here (API-only), M3 with multimodal is on the horizon. The walls between cloud AI and local AI are dissolving.

The best time to start was last year. The second best time is right now.
