DEV Community: Jovan Chan

WWDC 2026 Home Lab Verdict: What Apple's Foundation Models, Core AI, and Siri Actually Deliver for Local AI

Jovan Chan — Tue, 07 Jul 2026 07:03:54 +0000

This article was originally published on runaihome.com

TL;DR: Apple's WWDC 2026 delivered the AI do-over it promised in 2024 — a 20B sparse on-device model (AFM 3 Core Advanced), a Gemini-powered Siri routed through Private Cloud Compute, and Xcode 27 agents. For home lab builders the practical change is narrow: if you write Swift apps, free on-device inference and provider-agnostic agents are real. If you run Ollama or llama.cpp, none of it touches your stack — a used RTX 3090 still wins on tokens/sec and model choice.

	Apple Foundation Models (AFM 3)	Open models on Apple Silicon	Used RTX 3090 + Ollama
Best for	iOS/macOS app developers	7B–70B open models on a Mac	Max tok/s, widest model choice
Cost	Free on-device inference	Device cost only	~$1,070 used + ~$0.042/hr power
Speed	~30 tok/s (20B sparse)	28.4 tok/s on 70B Q4 (M4 Max)	~95 tok/s on a 7B model
The catch	Apple's model only, 12GB+ RAM, Apple devices	546 GB/s bandwidth ceiling	24GB VRAM ceiling, 350W draw

Honest take: WWDC 2026 makes Apple Silicon a better app development platform, not a better open-model platform. If you run Llama, Qwen, or Gemma locally, your RTX tower or Mac Studio works exactly as it did on June 7.

In our June 2 preview we laid out what the rumor mill expected from Apple's keynote. The keynote happened June 8, and most of it landed. This is the follow-up: what Apple actually shipped, with the real numbers, and a verdict for anyone who builds or runs AI at home rather than just shipping iOS apps.

What Apple actually announced

The headline is AFM 3 — the third generation of Apple Foundation Models. The model that matters for hardware discussion is AFM 3 Core Advanced: a 20-billion-parameter model that runs entirely on-device.

The interesting part is how it fits. AFM 3 Core Advanced uses a sparse architecture that activates only 1 to 4 billion parameters per request. Apple stores the full 20B weights in flash storage and loads just the relevant expert set into RAM once per prompt, through a lightweight dense routing block. That is a deliberate workaround to the RAM wall that has bottlenecked on-device models for years — you get the quality headroom of a 20B model without needing 20B parameters resident in DRAM.

Performance is roughly 30 tokens per second on iPhone 15 Pro and iPhone 17 Pro class hardware. That's the same ballpark as the previous 3B on-device model, which tells you the sparse routing is doing its job: a much larger model, similar latency.

The catch is the device floor. AFM 3 Core Advanced runs only on hardware with at least 12GB of RAM — iPhone Air, iPhone 17 Pro and Pro Max, iPads with M4 or later, Vision Pro (M5), and Macs with M3 or later. Older 8GB devices fall back to the smaller AFM 3 Core model. If your home server is an M1 Mac mini with 8GB, the flagship on-device model isn't for you.

The three-tier stack

Apple settled on a clear three-layer architecture, and it's worth understanding because it determines what touches your hardware and what doesn't:

On-device (AFM 3 Core / Core Advanced) — expressive voices, dictation, on-screen awareness, structured extraction, and quick personal-context lookups. Runs on the Neural Engine and GPU of your Apple Silicon device. No network, no API key.
Private Cloud Compute — heavier requests that still need Apple's privacy guarantees, run on Apple Silicon servers where Apple says data isn't stored or made readable.
AFM Cloud Pro — the top tier for world-knowledge and complex reasoning. Apple says it matches Gemini Frontier quality and runs on NVIDIA GPUs in Google's cloud, custom-built in collaboration with Google's Gemini program.

So the new Siri is a hybrid: simple, personal, on-device work stays local; the chatbot-grade reasoning routes out to Gemini-class infrastructure. AppleInsider's reporting is worth noting here — Apple was explicit that the on-device models contain no Gemini weights. The Google collaboration lives in the cloud tier, not on your phone.

Foundation Models framework: the developer angle

For anyone writing apps, the framework got the upgrades that matter:

Multimodal image input — you can now send images alongside text. The on-device model identifies objects, extracts text, and reads screenshots.
A single API surface that unifies on-device, server-side, and third-party provider access. You can swap the underlying provider without rewriting your code.
Open source this summer, with Linux server support — which is the genuinely surprising one, and the only WWDC item that reaches beyond Apple's own walled garden.

Xcode 27 agents

Xcode 27 ships a dual-engine agentic coding system: a local Neural Engine model for real-time Swift completion that never sends your source off-device, plus a cloud routing layer to Anthropic Claude, Google Gemini, or OpenAI GPT for heavier analysis. Xcode is now an MCP host (via a mcpbridge binary), so any agent that speaks the Model Context Protocol can read diagnostics, symbol info, SwiftUI previews, and the Swift REPL live. The agent can run test suites, drive the iOS Simulator through a new Device Hub, and pull crash reports from Organizer to fix the underlying code.

If your interest is AI coding rather than AI hardware, our sister site covers the agentic-IDE landscape in depth at aicoderscope.com — the Xcode 27 model is conceptually close to what Cursor and Cline already do, now first-party on the Mac.

What this changes for home lab builders (and what it doesn't)

Here's the part the keynote glosses over. There are two completely separate things people mean by "local AI on a Mac," and WWDC 2026 only moves one of them.

Track 1: Apple's own AI stack. Foundation Models, Core AI (Apple's modernized successor to Core ML), AFM 3, Siri. This is for shipping features inside iOS/macOS apps. It got materially better. The free on-device inference is real, the privacy story is solid, and the framework going open source could matter for cross-platform developers.

Track 2: running open-weight models yourself. Ollama, llama.cpp, LM Studio, vLLM, ComfyUI — Llama 4, Qwen3.6, Gemma 4, DeepSeek, Mistral. WWDC 2026 changed nothing here. Apple did not open the Neural Engine to third-party LLM runtimes, did not ship a faster Metal inference path as a headline feature, and AFM 3 is not a model you can pull into Ollama. Your ollama run workflow on June 17 is identical to June 7.

This distinction is the whole verdict. If you bought a Mac Studio to run Qwen3.6 and Llama 3.3 70B, the WWDC announcements are interesting news but not an upgrade to your rig.

The numbers that actually decide your hardware

For Track 2 — the thing this site is about — the bandwidth and tok/s reality is unchanged:

Hardware	Memory bandwidth	Llama 3.3 70B Q4_K_M	7B model	Notes
Mac Studio M4 Max	546 GB/s	~28.4 tok/s	~87 tok/s	Unified memory, quiet, low power
Used RTX 3090	936 GB/s	offload needed (24GB)	~95 tok/s	CUDA ecosystem, ~350W
AFM 3 Core Advanced	(flash-routed)	n/a — Apple model only	~30 tok/s	20B sparse, 12GB+ RAM

A used RTX 3090 still has the highest memory bandwidth in this comparison at 936 GB/s, and bandwidth is what governs decode speed for local LLMs. In June 2026 it averages around $1,070 used (range $966–$1,189), which is remarkable staying power for a card this old and a direct consequence of the GDDR7 shortage squeezing new GPU supply. The trade-offs are the 24GB VRAM ceiling and the ~350W draw (about $0.042/hour at $0.12/kWh when the card runs at full power).
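The running-cost figure is easy to re-derive for your own electricity rate. A minimal sketch, assuming a constant draw (real inference loads fluctuate, so treat the result as an upper bound); the function name is made up for illustration:

```python
def power_cost_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Hourly electricity cost for a device drawing a constant `watts`."""
    return watts / 1000.0 * usd_per_kwh

# Figures from the paragraph above: ~350 W draw, $0.12/kWh.
hourly = power_cost_per_hour(350, 0.12)
print(f"${hourly:.3f}/hour, ${hourly * 24 * 30:.2f} per 30-day month at 24/7")
```

Swap in your own rate and a measured wall-power reading to get a number that reflects your actual duty cycle.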

The Mac Studio M4 Max at 546 GB/s wins on capacity and noise: its unified memory le

WSL 3 GPU Passthrough for Local AI on Windows in 2026: Near-Native Ollama, llama.cpp, and PyTorch

Jovan Chan — Mon, 06 Jul 2026 07:05:36 +0000

This article was originally published on runaihome.com

TL;DR: WSL 3, previewed at Microsoft Build 2026, swaps the heavy Hyper-V backend for a paravirtualized machine that gives Linux apps GPU and NPU access within 3-5% of bare-metal Linux speed. If you already run Ollama in WSL 2 on an NVIDIA card, the practical gain is small, because WSL 2 was already within ~5%. The real story is NPU passthrough, and that ships Intel/Qualcomm-only at launch.

What you'll be able to do after this guide:

Run Ollama, llama.cpp, and PyTorch inside Linux on Windows with full GPU acceleration and no separate Linux driver install.
Understand whether WSL 3 is worth chasing on the Insider channel, or whether your current WSL 2 setup is already fast enough.
Avoid the single most common mistake that drops you to CPU-only inference (and costs you ~10× the tokens/sec).

Honest take: For an NVIDIA GPU owner, WSL 2 today already runs Ollama within 5% of native — WSL 3 is a nice-to-have, not a reason to flash an Insider build. The people who should actually care are Copilot+ laptop owners who finally get NPU passthrough.

What Microsoft actually announced

At Build 2026 (June 2, 2026), Microsoft previewed WSL 3. The headline change is architectural: it replaces the Hyper-V VM backend that WSL 2 has used since 2020 with a lighter paravirtualized machine, and routes GPU and NPU access through DirectML 2.0. Microsoft's claim is that PyTorch, CUDA, and JAX workloads inside WSL 3 run within 3-5% of bare-metal Linux speed.

That 3-5% number is the one to anchor on, because it reframes the whole pitch. WSL 2's GPU passthrough was never the bottleneck people assumed it was — for GPU-accelerated inference, WSL 2 already lands within roughly 5% of native Windows Ollama. So for a discrete NVIDIA GPU, WSL 3 is closing a gap that was already small. The genuinely new capability is NPU passthrough, which WSL 2 never had at all.

WSL 3 is available now through the Windows Insiders program and will roll out via Windows Update later, the same way WSL 2 updates have always shipped.

NPU passthrough is the real change — and it's not for everyone yet

NPU passthrough at launch is limited to Copilot+ class silicon:

Qualcomm Snapdragon X Elite / X Elite 2 — Hexagon NPU
Intel Meteor Lake / Lunar Lake — Core Ultra NPU

AMD Ryzen AI support is deferred to a later date. The minimum bar for NPU passthrough is a 40 TOPS NPU, which matches the Copilot+ hardware floor. Machines below that, or with no qualifying NPU, still get the GPU improvements — they just don't get NPU access.

Alongside WSL 3, Microsoft shipped DirectML 2.0, which adds better use of AMD's XDNA 2 architecture, brings Intel Core Ultra Series 3 (50 TOPS) support, and tunes the Phi Silica model across AMD, Intel, and Qualcomm NPUs. The XDNA 2 work in DirectML 2.0 is the hint that AMD NPU passthrough in WSL is a "when," not an "if."

One reality check before you get excited about NPU inference: an NPU is not a shortcut to GPU-class tokens/sec. Decode throughput on local LLMs is bound by memory bandwidth, not raw TOPS, which is why Copilot+ laptops post single-digit-to-low-double-digit tokens/sec on 8B models while a discrete card clears 30+. We covered exactly why in NPU vs Discrete GPU for Local LLMs — read it before assuming the NPU in your new laptop replaces a GPU.

Does the 3-5% number matter for your hardware?

Here's the comparison that actually decides whether WSL 3 is worth chasing:

	WSL 2 today	WSL 3 (Insider)	Bare-metal Linux
NVIDIA GPU (CUDA)	~5% slower than native	within 3-5%	baseline
NPU access	none	Intel/Qualcomm only	vendor-dependent
Setup effort	mature, well-documented	preview, expect rough edges	dual-boot or separate machine
Best for	anyone with an NVIDIA card now	Copilot+ laptop NPU users	absolute max throughput

If you own a discrete NVIDIA GPU, you are in the top row, and the difference between "~5% slower" and "3-5% slower" is inside the noise of run-to-run variance. There is no compelling reason to move to an Insider build for inference speed alone. Keep your stable WSL 2 setup.

If you own a Copilot+ laptop with a qualifying Intel or Qualcomm NPU, WSL 3 is the first time you can drive that NPU from Linux tooling. That's the upgrade worth the Insider risk — with the caveat above about what NPU throughput actually looks like.

For context on what GPU throughput you're protecting with that "within 5%" figure: a correctly-loaded 8B model clears about 95 tok/s on a used RTX 3090 and roughly 104 tok/s on an RTX 4090 under llama.cpp, which puts the 3090 about 9% behind the 4090 on these figures. A 5% WSL tax on top of either is a few tokens/sec: real, but not workflow-changing. (llama.cpp itself runs about 3-10% faster than Ollama on NVIDIA GPUs, so your engine choice matters more than your virtualization layer.)
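To put the "few tokens/sec" claim in concrete terms, here is a small sketch that applies a flat 5% overhead to the native figures quoted above. The flat-percentage model and the function name are illustrative assumptions; real overhead varies by workload.

```python
def with_overhead(native_tok_s: float, overhead: float) -> float:
    """Throughput after losing `overhead` (0.05 = 5%) to the virtualization layer."""
    return native_tok_s * (1.0 - overhead)

# Native figures quoted above for an 8B model under llama.cpp.
for card, native in [("RTX 3090", 95.0), ("RTX 4090", 104.0)]:
    inside_wsl = with_overhead(native, 0.05)
    print(f"{card}: {native:.0f} -> {inside_wsl:.1f} tok/s (loses {native - inside_wsl:.1f})")
```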

Setting it up: WSL 2 today, WSL 3 on Insider

The setup flow is nearly identical between WSL 2 and WSL 3 — the passthrough plumbing changed underneath, but the user-facing commands didn't. This is the path that works on a stable Windows 11 machine right now with WSL 2, and the same steps apply once WSL 3 reaches your channel.

1. Install WSL and a distro

From an elevated PowerShell:

wsl --install -d Ubuntu-24.04
wsl --update
wsl --status

wsl --update pulls the latest kernel. On the Insider channel with WSL 3 available, the same --update is what flips you onto the new backend; check wsl --version afterward to confirm.

2. Install the Windows GPU driver — and ONLY the Windows driver

This is the step that trips up almost everyone. The CUDA libraries are exposed inside Linux automatically through /usr/lib/wsl/lib/. You do not install a Linux NVIDIA driver inside the distro. Installing a separate Linux driver inside WSL is the most common way people break passthrough and silently fall back to CPU.

Install the normal NVIDIA Windows driver (a recent Game Ready or Studio driver is fine — WSL-specific CUDA drivers are no longer required), then verify from inside WSL:

nvidia-smi

If you see your GPU and its VRAM, passthrough is live. If nvidia-smi is missing or errors, you either skipped the Windows driver or installed a Linux driver on top of it.
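If nvidia-smi misbehaves, it can help to look at the passthrough directory directly. The sketch below assumes the layout described above (CUDA libraries under /usr/lib/wsl/lib/); the function name and report keys are made up for illustration, and an empty result is a hint to investigate, not a diagnosis.

```python
import os
import shutil

def wsl_gpu_passthrough_report(wsl_lib_dir: str = "/usr/lib/wsl/lib") -> dict:
    """Collect basic evidence that the Windows driver is exposed inside WSL."""
    try:
        entries = os.listdir(wsl_lib_dir)
    except OSError:
        # Directory missing or unreadable: treat as "nothing exposed".
        entries = []
    return {
        "wsl_lib_dir_present": os.path.isdir(wsl_lib_dir),
        "cuda_libs": sorted(e for e in entries if e.startswith("libcuda")),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }

if __name__ == "__main__":
    for key, value in wsl_gpu_passthrough_report().items():
        print(f"{key}: {value}")
```

An empty cuda_libs list on a machine that should have passthrough points back at the Windows driver install rather than at anything inside the distro.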

3. Install the CUDA toolkit (only if you compile)

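For Ollama you don't need the full toolkit; its bundled runtime is enough. If you compile llama.cpp or build PyTorch extensions, install the CUDA toolkit from NVIDIA's WSL-Ubuntu repository. The keyring package below registers that repository, and the toolkit package installed from it does not ship a display driver, so the driver exposed by Windows stays in place: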

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6

4. Run Ollama and confirm it's on the GPU

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b

In a second terminal, while a prompt is generating:

ollama ps

Expected output on a working GPU setup:

NAME           ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000  6.2 GB   100% GPU     4 minutes from now

100% GPU is the goal. If you see 100% CPU or a CPU/GPU split, the model didn't fully offload. A correctly offloaded 8B model should clear 30 tok/s on any modern card and much more on a 3090/4090.
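If you want this check in a script instead of eyeballing it, the PROCESSOR column can be parsed. A minimal sketch, assuming the column layout shown above (columns separated by two or more spaces, with a header row containing PROCESSOR); the helper names are hypothetical, and the parser raises ValueError if that header is absent.

```python
import re
import subprocess

def parse_ollama_ps(text: str) -> dict:
    """Map model name -> PROCESSOR column from `ollama ps` output."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    # Columns are separated by runs of 2+ spaces; cells like "100% GPU" use one.
    header = re.split(r"\s{2,}", lines[0].strip())
    col = header.index("PROCESSOR")
    result = {}
    for row in lines[1:]:
        cells = re.split(r"\s{2,}", row.strip())
        if len(cells) > col:
            result[cells[0]] = cells[col]
    return result

def fully_on_gpu(processor: str) -> bool:
    """True only for a complete offload; any CPU share counts as a failure."""
    return processor.strip() == "100% GPU"

def check_running_models() -> dict:
    """Run `ollama ps` and report, per loaded model, whether it is fully on GPU."""
    out = subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True).stdout
    return {name: fully_on_gpu(proc) for name, proc in parse_ollama_ps(out).items()}
```

Call check_running_models() while a prompt is generating; a False for any model means it is worth walking through the driver checks again.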

The error you'll actually hit, and the fix

The single most common failure isn't exotic — it's the GPU disappearing from WSL after a Windows update or a sleep cycle, leaving Ollama on CPU at a tenth of the speed. The symptom in ollama ps:

NAME           ID            SIZE     PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000  6.2 GB   100% CPU     4 minutes from now

Walk it in this order:

**nvidia-smi inside WSL returns noth

Why Local LLMs Got Good in 2026: Multi-Token Prediction, Speculative Decoding, and the MoE Efficiency Leap

Jovan Chan — Mon, 06 Jul 2026 07:04:50 +0000

This article was originally published on runaihome.com

TL;DR: Local models didn't just get bigger in 2026 — they got faster at the same quality. Three techniques did the heavy lifting: multi-token prediction (~1.8× throughput, lossless), speculative decoding (1.5–3× on consumer GPUs), and sparse MoE routing (35B of weights, only 3B active per token). Together they put GPT-4-class output on a single 24GB GPU at usable speeds.

	Multi-token prediction	Speculative decoding	Sparse MoE
What it speeds up	Decode (built into the model)	Decode (runtime trick, any model)	Decode + memory pressure
Typical gain	~1.8× (85–90% accept rate)	1.5–3× depending on accept rate	3B active vs 35B total → ~30B-dense feel at 3B-dense speed
The catch	Model must be trained with it	Needs a good small draft model + extra VRAM	Still needs VRAM for all weights

Honest take: None of these are magic — they all trade extra compute or memory for fewer sequential steps. But stacked together on a model like Qwen3.6 35B-A3B, they're the reason a used RTX 3090 now does everyday coding and writing that needed a cloud API call 18 months ago.

The thing that actually changed

Ask anyone who ran local models in 2024 and they'll tell you the same story: a 7B model was fast but dumb, and a 70B model was smart but unusably slow on consumer hardware. You picked your poison. The honest answer to "should I run this locally or just call an API?" was usually "call the API."

That tradeoff broke in 2026, and it broke for a specific, technical reason. The models people run at home — Qwen3.6, Gemma 4, GPT-OSS, Nemotron-Cascade — aren't just bigger or better-trained than their predecessors. They're architected so the expensive part of generation (decoding one token at a time) costs far less per token than the raw parameter count suggests.

There are three distinct techniques doing this, and they're often confused because they all attack the same bottleneck. This article separates them, shows what each one actually buys you, and explains why the combination — not any single one — is what closed the gap to cloud.

If you just want to know which model to pull for your GPU, the open-source LLM shootout has the picks by VRAM tier. This article is the why behind those picks.

The bottleneck: why decoding is slow in the first place

Every autoregressive LLM generates text one token at a time. To produce token N, the model runs a full forward pass using tokens 1 through N−1 as context. Then it appends token N and runs another full forward pass for token N+1. The passes are strictly sequential — you can't start computing token N+1 until you know token N.

Here's the part that surprises people: that forward pass is memory-bandwidth-bound, not compute-bound. For a single user generating one token, the GPU spends most of its time reading the model's weights out of VRAM, not doing math. A dense 32B model at Q4 has roughly 18GB of weights, and the GPU has to stream essentially all of them through its memory bus for every single token.

That's why a used RTX 3090, with 936 GB/s of memory bandwidth, does roughly 95 tok/s on a 7B model but only a fraction of that on a dense 32B: the architecture is the same, but there are more weights to read each step. The short version, which the VRAM-tier guide leans on throughout: decode speed tracks bandwidth ÷ active-weight-size far more than it tracks raw TFLOPS.
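The bandwidth ÷ active-weight-size rule can be turned into a rough ceiling. This is a back-of-envelope sketch using the two figures from the text (936 GB/s, ~18GB for a dense 32B at Q4); it ignores KV-cache reads and kernel overhead, so measured throughput should land below it. The function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on single-stream decode speed when every token must stream
    `weights_read_gb` of weights once over a bus moving `bandwidth_gb_s`."""
    return bandwidth_gb_s / weights_read_gb

RTX_3090_BANDWIDTH_GB_S = 936.0   # from the text
DENSE_32B_Q4_GB = 18.0            # approximate weight size from the text

ceiling = decode_ceiling_tok_s(RTX_3090_BANDWIDTH_GB_S, DENSE_32B_Q4_GB)
print(f"dense 32B Q4 on a 3090: at most ~{ceiling:.0f} tok/s")
```

Shrinking the denominator is the only lever left once the card is chosen.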

All three 2026 techniques are different answers to one question: how do we produce more tokens without reading all the weights that many times in sequence?

Technique 1: Sparse MoE — read fewer weights per token

Mixture-of-Experts is the most consequential of the three, and it's the easiest to understand once you frame it around bandwidth.

A dense 32B model reads ~32B parameters' worth of weights for every token. A Mixture-of-Experts model splits most of its parameters into "expert" sub-networks and adds a small router that picks which experts to use for each token. The model still stores all the weights, but it only reads the active ones per token.

Two of the most-run local models in 2026 are built this way:

Qwen3.6 35B-A3B: 35 billion total parameters, but only ~3 billion active per token.
Gemma 4 26B-A4B: 26 billion total, ~4 billion active per token.

The "A3B" / "A4B" suffix literally means "active 3 billion / active 4 billion." That naming convention is itself a sign of how central this idea became.

The payoff shows up directly in benchmarks. On an RTX 4090, a dense 32B model at Q4 lands near 60 tok/s, while a ~30B MoE model with 3B active runs around 110 tok/s at 32K context — nearly double, from a model with more total parameters. Reported numbers vary by runtime and context length (one careful Q4_K_M measurement put Qwen3.5 35B-A3B at ~78 tok/s decode on a 4090), but the direction is consistent: you get the throughput of a ~3B model with the knowledge capacity of a much larger one.

The catch — and it's a real one — is VRAM. MoE saves bandwidth, not capacity. You still have to hold all 35B parameters in memory, so Qwen3.6 35B-A3B needs a 24GB card just like a dense model of that size would. MoE makes the smart model fast; it doesn't make it fit on less. That distinction trips up a lot of buyers, which is why the VRAM-tier guide leads with total size, not active size.
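A toy layer makes the bandwidth-versus-capacity split concrete. This is a deliberately tiny sketch, not how any real runtime implements MoE: the dimensions are invented, the weights are random, and the chosen experts' outputs are summed unweighted, where real models weight them by normalized router scores.

```python
import random

def top_k_experts(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_layer(x, router, experts, k):
    """Score every expert with the router, then run only the top-k experts."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = top_k_experts(scores, k)
    out = [0.0] * len(x)
    weights_read = 0
    for e in chosen:
        matrix = experts[e]  # only the chosen experts' weights are touched
        for r, row in enumerate(matrix):
            out[r] += sum(w * xi for w, xi in zip(row, x))
            weights_read += len(row)
    return out, chosen, weights_read

random.seed(0)
DIM, N_EXPERTS, K = 4, 8, 2  # toy sizes, chosen for illustration
router = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
experts = [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(N_EXPERTS)]
weights_stored = N_EXPERTS * DIM * DIM  # all of these must sit in memory

_, chosen, weights_read = moe_layer([1.0, 0.5, -0.5, 2.0], router, experts, K)
print(f"experts used: {chosen}; expert weights read {weights_read} of {weights_stored} stored")
```

The count of weights read per token drops with K, while the count stored does not change. That is the same asymmetry as 3B active versus 35B total.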

Technique 2: Speculative decoding — guess ahead, verify in parallel

Speculative decoding is a pure runtime trick. It doesn't change the model's weights or quality at all — it changes how you run inference.

The idea: pair your big "target" model with a small, fast "draft" model. The draft model cheaply generates a short run of candidate tokens — say the next 4. Then the target model does one forward pass that verifies all 4 candidates at once (verification is parallel; generation is not). Every candidate the target agrees with is accepted for free; the first disagreement is corrected, and you start the next round from there.

The crucial property is that the output matches what the target model would have produced alone: token-for-token under greedy decoding, and drawn from the same distribution when sampling. Speculative decoding is lossless, so tuning how many tokens the draft proposes changes speed only, not what the target model says. That's what separates it from quantization or distillation, which trade quality for speed.
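The draft-then-verify loop is easier to trust once you see it run. In this toy sketch both "models" are hypothetical integer functions invented for illustration, decoding is greedy, and verification is written as a loop, whereas a real runtime checks all candidates in one batched forward pass of the target.

```python
def target_next(seq):
    """Hypothetical 'big model': deterministic next token from the context."""
    step = 2 if len(seq) % 4 == 0 else 1
    return (seq[-1] + step) % 10

def draft_next(seq):
    """Hypothetical cheap draft: agrees with the target except at every 4th position."""
    return (seq[-1] + 1) % 10

def generate_plain(prompt, n):
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target_next(seq))
    return seq

def generate_speculative(prompt, n, k=4):
    """Greedy speculative decoding; returns (sequence, number of target passes)."""
    seq, goal, target_passes = list(prompt), len(prompt) + n, 0
    while len(seq) < goal:
        # 1. The draft proposes k tokens, one after another.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the whole proposal (one pass in a real runtime).
        target_passes += 1
        ctx = list(seq)
        for tok in proposal:
            expected = target_next(ctx)
            ctx.append(expected)      # accepted token, or the target's correction
            if expected != tok:
                break                 # discard the rest of the proposal
        else:
            ctx.append(target_next(ctx))  # all accepted: the pass yields a bonus token
        seq = ctx
    return seq[:goal], target_passes

plain = generate_plain([0], 20)
spec, passes = generate_speculative([0], 20, k=4)
print("identical output:", spec == plain)
print("target passes:", passes, "instead of", 20)
```

Because every token appended to the sequence comes from the target's own prediction, the draft can only change how many passes are needed, not the result.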

Real-world gains:

General reports put speculative decoding at 2–3× speedup with no quality change.
In llama.cpp specifically, users see 1.5×–3× tokens/sec depending on how often the draft model's guesses are accepted.
NVIDIA has demonstrated up to 3.6× throughput on H200-class hardware with tuned draft models.

The acceptance rate is everything. If your draft model agrees with the target 80% of the time, most of your speculative tokens stick and you get a big speedup. If the draft is poorly matched and only agrees 30% of the time, you've paid for draft passes that get thrown away and you might even go slower. This is why picking a draft model from the same family (e.g. a 0.5B Qwen drafting for a 32B Qwen) matters — they make similar predictions.
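A rough model shows why the acceptance rate dominates. The sketch assumes each draft token is accepted independently with the same probability, which is a simplification, and it treats the draft's relative cost per pass as a free parameter. The 0.15 used here is an assumed value, not a measurement.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target pass, assuming each draft token is
    accepted independently with probability `accept_rate`."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Speedup over plain decoding. `draft_cost` is the cost of one draft pass
    relative to one target pass (0.1 means the draft is 10x cheaper)."""
    return expected_tokens_per_pass(accept_rate, draft_len) / (1.0 + draft_len * draft_cost)

for rate in (0.8, 0.3):
    tokens = expected_tokens_per_pass(rate, 4)
    speedup = estimated_speedup(rate, 4, draft_cost=0.15)
    print(f"accept {rate:.0%}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

Under these assumptions the 80% case lands at roughly 2x while the 30% case drops below 1x, which is the "might even go slower" scenario.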

By late 2025 this moved from research curiosity to production default: vLLM and TensorRT-LLM ship native support, and llama-server (the backbone of many local setups) supports several implementations. The cost is extra VRAM for the draft model and some tuning of the speculative length — a small price for a free 1.5–3×.

Technique 3: Multi-token prediction — bake the drafting into the model

Multi-token prediction (MTP) is the technique people most often confuse with speculative decoding, because the mechanism at inference time looks similar. The difference is where

RTX PRO 6000 Blackwell for Local AI in 2026: 96GB GDDR7, the 120B+ MoE Threshold, and Whether a Workstation Card Makes Sense for Home Labs

Jovan Chan — Mon, 06 Jul 2026 07:04:07 +0000

This article was originally published on runaihome.com

TL;DR: The RTX PRO 6000 Blackwell gives you 96GB of GDDR7 in a single PCIe slot, enough to run gpt-oss 120B or Llama 3.3 70B at FP8 with KV-cache headroom to spare — at ~193 tok/s on the 120B. But at roughly $8,500, it costs more than three used RTX 3090s and shares the exact same 1.79 TB/s bandwidth as a $3,000 RTX 5090. You pay for capacity and a single-slot footprint, not raw speed.

	RTX PRO 6000 Blackwell	RTX 5090	3× Used RTX 3090
Best for	70B–120B models on one card	Single-GPU 32GB workloads	Max VRAM-per-dollar
Price (Jun 2026)	~$8,000–$9,400	~$2,900–$4,300	~$2,100–$2,400
VRAM	96GB GDDR7 ECC	32GB GDDR7	72GB pooled GDDR6X
Bandwidth	1.79 TB/s	1.79 TB/s	~930 GB/s each
Power	600W (300W Max-Q)	575W	~1,050W combined
The catch	3× the price of a 5090	32GB caps you at ~32B	PCIe overhead, 3 slots, heat

Honest take: If you genuinely need a single 70B+ model resident 24/7 in one quiet slot — for an agentic coding rig, a shared family server, or fine-tuning — the PRO 6000 is the cleanest answer that exists short of an H100. For everything else, the same money buys more usable VRAM as multiple consumer cards.

What you're actually buying with 96GB

The RTX PRO 6000 Blackwell Workstation Edition is built on the same GB202 Blackwell die as the RTX 5090, but NVIDIA enables a fuller configuration: 24,064 CUDA cores versus the 5090's 21,760, paired with 96GB of GDDR7 ECC memory on a 512-bit bus. Memory bandwidth lands at 1.79 TB/s — identical to the RTX 5090. It uses 5th-generation Tensor Cores with native FP4 support and runs on PCIe 5.0 x16.

That bandwidth parity is the single most important fact in this entire article, and most buyers miss it. Token generation in LLM inference is a memory-bandwidth problem: the GPU streams every active weight out of VRAM to produce each token. Two cards with the same bandwidth produce roughly the same tokens-per-second on a model that fits in both. The PRO 6000 does not make a 14B model faster than a 5090. What it does is let you load models the 5090 physically cannot hold.

The standard Workstation Edition draws up to 600W. There's also a Max-Q variant with identical 96GB / 1.79 TB/s specs capped at 300W with a blower-style cooler — meaningfully relevant for home labs where two-slot blower cards and a 300W ceiling make a multi-card or rack build far more thermally sane. You give up some peak throughput for half the power envelope.

The benchmarks that justify it (and the ones that don't)

Here's where the 96GB earns its keep. On gpt-oss 120B, the PRO 6000 hits 193.30 tok/s on token generation (tg128) at Q8_0 in llama.cpp, peaking above 200 tok/s with full GPU offload and GQA optimization. The Q4_K_M weights for that 120B model occupy roughly 59.4 GB of VRAM — leaving over 30GB free for a long context window. At 12k context, generation runs around 134 tok/s, tapering to about 48 tok/s near the model's maximum context. That entire workload is impossible on a 32GB card without offloading to system RAM, which would crater throughput.

For batched serving — the real workstation use case — the gap widens. On Llama 3.3 70B (AWQ INT4), a single PRO 6000 delivered 8,425 tok/s aggregate throughput versus 4,570 tok/s on a single RTX 5090, a 1.8× lead, because the extra capacity lets it run far larger batches. On a 30B AWQ model, a single PRO 6000 pushed roughly 8,400 tok/s — nearly matching a 4× RTX 4090 rig at 8,900 tok/s, in one slot.

Now the unflattering number. For single-stream inference of a model that fits on both cards, the PRO 6000's advantage largely evaporates. On Llama 3.3 70B Q4_K_M under vLLM, a single PRO 6000 streams roughly 30–45 tok/s for one request — fine, but not a multiple of what a 5090 manages on models it can hold. If your workload is one user, one prompt at a time, on models ≤32GB, you are paying $5,500 extra over a 5090 for VRAM you won't touch.

Workload	RTX PRO 6000	What it means
gpt-oss 120B Q8_0, tg128	193 tok/s	Flagship MoE runs on one card
gpt-oss 120B @ 12k ctx	~134 tok/s	Long context stays fast
Llama 3.3 70B AWQ, batched	8,425 tok/s	1.8× a single 5090
Llama 3.3 70B Q4, single stream	30–45 tok/s	5090-class for one user
30B AWQ, batched	~8,400 tok/s	Matches 4× RTX 4090

Price reality in June 2026

NVIDIA launched the PRO 6000 Blackwell with an ~$8,565 MSRP in early 2025. As of June 2026, street pricing has stabilized into the $8,000–$9,400 band, with wide retailer spread: Newegg around $9,349, Amazon around $9,449, and B&H as high as $11,500, while Micro Center has listed it near $10,000 with a $1,000 instant discount. VideoCardz reported the desktop card dipping to $7,999 at one point, but that is the floor, not the norm. If renting beats buying for your duty cycle, the card is available on cloud providers like Spheron from around $0.90/hr. Do the rent-vs-buy math before committing $8,500 of capital, and remember you can spin up a comparable card on RunPod for short fine-tuning bursts instead of owning one.
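The rent-vs-buy math is a one-liner. This sketch uses the $8,500 and $0.90/hr figures from the paragraph above. It ignores resale value, the rest of the host machine, and price changes, and the optional power-cost parameter is an assumption you would fill in yourself.

```python
def breakeven_hours(purchase_usd: float, rent_usd_per_hour: float,
                    own_power_usd_per_hour: float = 0.0) -> float:
    """Hours of use after which buying becomes cheaper than renting."""
    saving_per_hour = rent_usd_per_hour - own_power_usd_per_hour
    if saving_per_hour <= 0:
        raise ValueError("owning never catches up: renting costs no more per hour")
    return purchase_usd / saving_per_hour

hours = breakeven_hours(8500, 0.90)
print(f"{hours:,.0f} rented hours, or {hours / 24:.0f} days of 24/7 use")
```

If your realistic usage is a few hours a day, divide the break-even hours by that figure to see how many years the card needs to stay useful.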

For context, the RTX 5090 sits at roughly $2,900 (ASUS TUF) to $4,329 (Amazon) depending on model and the ongoing GDDR7-driven price pressure. A used RTX 3090 runs $600–$800 on eBay. That spread frames the entire decision.

When the PRO 6000 actually wins over multi-GPU

The honest competitor isn't the H100; it's three used 3090s. Three RTX 3090s give you 72GB of pooled VRAM for about $2,100–$2,400, roughly a quarter of the PRO 6000's price. With a framework that shards models across cards, that rig runs the same 70B-class models. So why pay 3.5–4× as much?

Four reasons, and you need at least one to be real for you:

Single-slot capacity. 96GB contiguous in one card means no tensor-parallel PCIe overhead, no NUMA tuning, no per-layer split. A 120B MoE loads as one device. Multi-GPU always pays a coordination tax that grows with model size and context.
Power and noise. The Max-Q variant pulls 300W. Three 3090s pull north of 1,000W under load, dump that heat into your office, and need a 1500W PSU plus serious airflow. For a card that runs 24/7 in a home, this is not a footnote.
ECC memory. GDDR7 with ECC matters for long fine-tuning runs where a single bit-flip silently corrupts a checkpoint. Consumer cards have no ECC.
One slot, one warranty, one driver. For a shared family or team server you want to forget about, three used cards with no warranty is a different reliability story than one new pro card.

If none of those four matter — you have the PCIe lanes, the PSU, the cooling, and the patience — multi-GPU wins on pure dollars-per-usable-GB. That's the whole trade.
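Dollars-per-GB is quick to tabulate from the prices quoted in this article. The sketch uses approximate figures ($8,500 for the PRO 6000, $3,000 for the 5090, and the midpoint of the $2,100–$2,400 range for three 3090s); it says nothing about the PCIe, power, and cooling costs discussed above.

```python
def usd_per_gb(price_usd: float, vram_gb: float) -> float:
    """Purchase price divided by total VRAM."""
    return price_usd / vram_gb

# (price in USD, total VRAM in GB), approximate figures from this article
options = {
    "RTX PRO 6000": (8500, 96),
    "3x used RTX 3090": (2250, 72),  # midpoint of the quoted $2,100-$2,400
    "RTX 5090": (3000, 32),
}

for name, (price, vram) in sorted(options.items(), key=lambda kv: usd_per_gb(*kv[1])):
    print(f"{name:<18} ${usd_per_gb(price, vram):6.2f} per GB")
```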

Where it sits against the 5090 and the H100

Against the RTX 5090, the PRO 6000 is the same architecture with 3× the VRAM and 10% more cores at 3× the price. The decision is binary: do your target models exceed 32GB? If yes, the 5090 can't do the job at full speed and the PRO 6000 is the consumer-adjacent answer. If no, buy the 5090 and pocket $5,500.

Against a datacenter H100 (80GB HBM3, ~3.35 TB/s), the PRO 6000 has more VRAM (96GB vs 80GB) but roughly half the bandwidth and no NVLink. For single-card inference of large MoE models, the extra 16GB and the far lower price make the PRO 6000 the smarter home-lab pick. The H100 pulls ahead on raw bandwidth-bound throughput and multi-GPU scaling — but you're not putting an SXM H100 in a desktop, and the PCIe H100 still costs roughly 2–3× more.

The honest

Open WebUI Can't Connect to Ollama? Every Fix for the Server Connection Error (2026)

Jovan Chan — Sun, 05 Jul 2026 07:05:24 +0000

This article was originally published on runaihome.com

TL;DR: 90% of "Open WebUI can't reach Ollama" failures are one of two things: Open WebUI runs in a Docker container where localhost means the container, not your machine — or Ollama is bound to 127.0.0.1 and refuses outside connections. Fix the URL with host.docker.internal, bind Ollama with OLLAMA_HOST=0.0.0.0, and check the saved URL in Settings hasn't overridden your env var.

What you'll be able to do after this:

Diagnose whether the break is on the Open WebUI side (Docker networking) or the Ollama side (bind address) in under two minutes with one curl and one ss command.
Apply the exact fix for your setup — Docker on Linux, Docker Desktop on Mac/Windows, or --network=host.
Stop the error from coming back after a reboot or a settings change.

Honest take: Don't start editing config files. Run curl http://localhost:11434 first. If it answers, the problem is how the container is addressing Ollama; if it doesn't, Ollama itself isn't reachable. That one check tells you which half of this guide to read.

The two error messages, and what each one means

Open WebUI surfaces this failure in a couple of ways. In the chat box you'll see a red "Open WebUI: Server Connection Error" toast, or the model dropdown is simply empty with no models to pick. In the admin panel under Settings → Connections, a "Verify Connection" click returns "WebUI could not connect to Ollama."

Both mean the same thing: the Open WebUI backend tried to reach the Ollama API at the URL it has configured (default http://localhost:11434) and got nothing back. The question is why, and there are exactly three common causes. Work them in order — they're sorted from most to least likely.

This guide assumes Open WebUI v0.9.6 (released June 2, 2026) and Ollama v0.30.x (current as of June 2026). The mechanics below have been stable across the 0.x line of both projects, but the version tags matter when you're reading older forum threads: pre-0.2 Open WebUI used OLLAMA_API_BASE_URL, which has since been renamed to OLLAMA_BASE_URL.

Step 0: Is Ollama even running?

Before touching Open WebUI, confirm Ollama answers on the host where it's installed:

$ curl http://localhost:11434
Ollama is running

If you get Ollama is running, the server is up and serving on the local interface. Good — skip to Cause 1, because your problem is almost certainly Docker networking.

If you get curl: (7) Failed to connect to localhost port 11434: Connection refused, Ollama isn't running at all (or crashed on startup). Start it (ollama serve, or start the desktop app), then re-run the curl. If it still refuses after the app says it's running, that's a different class of failure — see our guide on Ollama not using the GPU and falling back to CPU and the llama runner process terminated walkthrough for crash-on-load cases.
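The Step 0 check can also be scripted. The sketch below is a minimal Python equivalent of the curl probe, assuming Ollama answers its root URL with the plain text "Ollama is running" as shown above. The function name probe_ollama and its opener parameter are illustrative, not part of any Ollama tooling.

```python
import urllib.error
import urllib.request


def probe_ollama(url="http://localhost:11434", timeout=3.0,
                 opener=urllib.request.urlopen):
    """Return 'up', 'unreachable: ...', or 'unexpected reply: ...'."""
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace").strip()
    except OSError as exc:  # URLError and ConnectionRefusedError are OSErrors
        return f"unreachable: {exc}"
    if body == "Ollama is running":
        return "up"
    return f"unexpected reply: {body!r}"


if __name__ == "__main__":
    print(probe_ollama())
```

An "up" result on the host points at container addressing (Cause 1); an "unreachable" result means Ollama itself is the thing to fix first.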

Cause 1: Docker's `localhost` is not your `localhost` (the #1 cause)

This is the single most common reason, and it trips up almost everyone running the recommended Docker install.

Here's the trap. You install Ollama natively on your machine, it listens on localhost:11434, and curl http://localhost:11434 works perfectly from your terminal. Then you run Open WebUI in Docker and point it at http://localhost:11434 — and it fails.

The reason: inside a Docker container, localhost (and 127.0.0.1) refers to the container itself, not the host machine. Ollama is running on the host. From the container's point of view, nothing is listening on its own localhost:11434. The official troubleshooting docs state it plainly — the failure is "the WebUI docker container not being able to reach the Ollama server at 127.0.0.1:11434 (host.docker.internal:11434) inside the container."

There are two clean fixes.

Fix 1a: Use `host.docker.internal` (recommended)

Docker provides a special hostname, host.docker.internal, that resolves to the host machine from inside a container. Point Open WebUI at that instead of localhost.

On Docker Desktop (Mac and Windows), host.docker.internal works automatically. On Linux, it does not exist by default — you have to add it with --add-host. This is the line most Linux tutorials get wrong.

The correct Linux command:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Then set the Ollama URL inside Open WebUI to http://host.docker.internal:11434 (either via the env var below or in the admin panel — see Cause 3 for why the panel can win).
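If you generate container configs from a script, the localhost-to-host rewrite is easy to get wrong by hand. Here is a small sketch of the substitution, assuming the alias host.docker.internal is resolvable inside the container (automatic on Docker Desktop, via --add-host on Linux). The helper name url_from_container is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames that mean "this container" once the URL is used inside Docker.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def url_from_container(url: str, host_alias: str = "host.docker.internal") -> str:
    """Rewrite a loopback URL so it targets the Docker host instead."""
    parts = urlsplit(url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return url  # already points at a real address, leave it alone
    netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(url_from_container("http://localhost:11434"))
```

URLs that already name a LAN address or another hostname pass through unchanged.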

You can also bake the URL into the container at launch:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

After this, Open WebUI is at http://localhost:3000 (the 3000:8080 mapping sends host port 3000 to the container's internal port 8080).

OLLAMA_BASE_URL is the environment variable Open WebUI reads to decide where to forward Ollama API calls. Note it's OLLAMA_BASE_URL in current versions. Older guides that reference OLLAMA_API_BASE_URL predate 0.2, and setting that variable now silently does nothing.

Fix 1b: Use `--network=host`

The blunter option is to share the host's network stack directly, so localhost inside the container points where you'd expect:

docker run -d --network=host \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Watch the port change. With --network=host, the -p 3000:8080 mapping is ignored — the container binds the host's port 8080 directly. So Open WebUI is now at http://localhost:8080, not 3000. People apply this fix, see the old URL stop working, and assume it failed. It didn't; the address moved.
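The "address moved" rule can be written down as a tiny helper. This is only a sketch of the behavior described above, assuming Open WebUI listens on container port 8080 and that ports are published with -p HOST:CONTAINER. The function name webui_url is made up for illustration.

```python
import shlex
from typing import Optional


def webui_url(docker_run: str, container_port: int = 8080) -> Optional[str]:
    """Predict where the UI is reachable for a given `docker run` command."""
    # Join backslash-continued lines so shlex sees a single command.
    args = shlex.split(docker_run.replace("\\\n", " "))
    if "--network=host" in args or "--net=host" in args:
        # Host networking: -p is ignored, the app binds the host port directly.
        return f"http://localhost:{container_port}"
    for flag, value in zip(args, args[1:]):
        if flag in ("-p", "--publish"):
            host_side, _, target = value.rpartition(":")
            if target == str(container_port):
                host_port = host_side.rpartition(":")[2]
                return f"http://localhost:{host_port}"
    return None  # port not published, UI is not reachable from the host


print(webui_url("docker run -d --network=host ghcr.io/open-webui/open-webui:main"))
```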

--network=host is the less-preferred fix because it throws away Docker's network isolation. Use host.docker.internal unless you have a specific reason not to.

Cause 2: Ollama is bound to 127.0.0.1 and rejecting the container

If curl http://localhost:11434 works on the host but Open WebUI still can't connect even after fixing the address, the problem flips to the Ollama side. By default Ollama binds only to the loopback interface 127.0.0.1, which accepts connections from the same machine but not from the Docker bridge network the container lives on.

Confirm what it's actually bound to:

$ ss -tlnp | grep 11434
LISTEN 0  4096  127.0.0.1:11434  0.0.0.0:*  users:(("ollama",pid=1234,fd=3))

127.0.0.1:11434 = local only. You want to see 0.0.0.0:11434 (all IPv4 interfaces) or *:11434. The fix is to set OLLAMA_HOST=0.0.0.0, and how you set it depends on the platform — this is where most people go wrong, because Ollama reads its environment once at startup and a variable exported in your shell often isn't the environment the service actually runs under.
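To script the bind check, you can classify the local-address column of the ss output. The sketch below assumes the column order shown in the sample line above (state, Recv-Q, Send-Q, local address, peer address). The name bind_scope is illustrative.

```python
def bind_scope(ss_line: str, port: int = 11434) -> str:
    """Classify what a LISTEN line from `ss -tlnp` says about the bind address."""
    fields = ss_line.split()
    local = fields[3]                      # e.g. "127.0.0.1:11434"
    addr, _, found_port = local.rpartition(":")
    if found_port != str(port):
        return "other port"
    if addr in ("127.0.0.1", "[::1]"):
        return "loopback only"             # containers cannot reach this
    if addr in ("0.0.0.0", "*", "[::]"):
        return "all interfaces"            # what OLLAMA_HOST=0.0.0.0 produces
    return f"specific interface {addr}"


line = 'LISTEN 0  4096  127.0.0.1:11434  0.0.0.0:*  users:(("ollama",pid=1234,fd=3))'
print(bind_scope(line))
```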

Linux (systemd)

If Ollama was installed via the official script, it runs as a systemd service — and your shell's export OLLAMA_HOST=0.0.0.0 does nothing, because systemd uses its own environment, not your terminal's. You have to edit the unit:

sudo systemctl edit ollama.service

Add, under the [Service] section:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

OLLAMA_ORIGINS=* relaxes the CORS origin check, which matters when a browser-based front end on a different host/port talks to the API. Re-run the ss check; you should now see 0.0.0.0:11434.

macOS (desktop app)

The Ollama menu-bar ap

Open-Source LLM Shootout 2026: Qwen3.6 vs Gemma 4 vs Llama 4 vs GLM-5.1 vs DeepSeek V4 — Which Fits Your GPU?

Jovan Chan — Sun, 05 Jul 2026 07:04:43 +0000

This article was originally published on runaihome.com

TL;DR: Of the five open-weight families everyone is benchmarking in 2026, only two — Qwen3.6 and Gemma 4 — actually run well on a single consumer GPU. Llama 4 Maverick, GLM-5.1, and DeepSeek V4 Pro are server-class MoEs that need 200 GB+ of VRAM at usable quantization. The headline "five-way shootout" collapses to a two-horse race on the hardware most home labs own, with DeepSeek V4 Flash and Llama 4 Scout sitting in an awkward middle.

	Qwen3.6 35B-A3B	Gemma 4 (12B–31B)	Llama 4 Scout 109B	DeepSeek V4 Flash 284B	GLM-5.1 754B
Best for	Single RTX 3090, speed	Quality per GB, 8–24 GB	Gray-zone, 24 GB at 1.78-bit	2× RTX 4090 minimum	Server / API only
Min usable VRAM	~23 GB (Q3)	6.6 GB (12B Q4) → 24 GB (31B Q4)	~24 GB (1.78-bit, degraded)	~33 GB (heavy quant)	~860 GB (FP8)
RTX 3090 speed	~120 tok/s	30–119 tok/s	~20 tok/s	Not viable	Not viable
License	Apache 2.0	Gemma custom	Llama 4 Community	MIT	MIT

Honest take: If you have one 24 GB GPU, run Qwen3.6 35B-A3B for speed or Gemma 4 31B for reasoning depth, and pick by license and task, not by leaderboard rank. The server-class MoEs (GLM-5.1, DeepSeek V4 Pro, Llama 4 Maverick) are API models you should use as APIs, not local builds.

The five families, and what they actually are

Every "open-source LLM comparison" article in mid-2026 lists the same five names. The problem is that they treat a 12B dense model and a 1.6T MoE as if they belong in the same buying decision. They don't. Here is what each family ships, and the size of the gap.

Qwen3.6 (Alibaba). The Qwen3.6-35B-A3B dropped April 16, 2026 — a 35B-total MoE that activates just 3B parameters per token. There are also dense 8B and 14B members carried over from the Qwen3 line. Everything is Apache 2.0, which matters more than people admit. Qwen optimized this family explicitly for consumer hardware.

Gemma 4 (Google). A dense-and-MoE family spanning E2B up to a 31B dense flagship, plus a 26B-A4B MoE. Google's June 5 QAT (quantization-aware training) checkpoints cut VRAM roughly 72% versus BF16 with near-original quality, which is why a 26B now fits 16 GB. License is Google's custom Gemma terms — permissive for most use, but not OSI-approved.

Llama 4 (Meta). Scout is 109B total / 17B active; Maverick is 402B total / 17B active. Both are MoE, and both load all parameters into memory regardless of how few activate per token — which is the trap. The Llama 4 Community License carries a 700M-MAU clause, a "Built with Llama" attribution requirement, and a ban on multimodal use by entities domiciled in the EU.

DeepSeek V4. Two checkpoints: V4 Flash (284B total / 13B active) and V4 Pro (1.6T total / ~49B active), released April 2026 under MIT. Flash is the only one with any home-lab pathway.

GLM-5.1 (Z.ai). A 754B open-weight MoE under MIT, released April 7, 2026. It matches frontier closed models on coding benchmarks — and needs roughly 860 GB of VRAM at FP8, i.e. 8× H200. It is on this list because it is excellent, not because you can run it.

Which family fits which GPU

This is the only question that matters for a home lab. Here is the decision by the VRAM you actually have.

8 GB (RTX 3060, RTX 4060, RTX 5060)

Your only real options are small dense models. Gemma 4 12B at Q4 needs about 6.6 GB of weights — it fits, but with little headroom for context, so keep num_ctx modest. A Qwen3 8B at Q4 sits around 5 GB and leaves more room. Llama 4, GLM-5.1, and DeepSeek V4 are all out at this tier — not "slow," but flatly impossible without spilling to system RAM and crawling.

16 GB (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 5070 Ti)

This is where Gemma 4's MoE shines. The 26B-A4B at Q4_K_M lands right at 16 GB and, because only ~4B parameters activate per token, it generates fast — 64–119 tok/s on a 24 GB RTX 3090, and comfortably usable on a 16 GB card if you trim context. Qwen3 14B dense also fits here. Qwen3.6 30B/35B-A3B at Q4 wants ~17 GB, so it's right on the edge — doable at low context, more comfortable at 24 GB.

24 GB (RTX 3090, RTX 4090)

The sweet spot, and where the shootout gets interesting:

Qwen3.6 35B-A3B: with Unsloth's Q3 quant it takes ~23 GB and runs at ~120 tok/s on an RTX 3090. This is the fastest capable model you can run on one consumer card.
Gemma 4 31B dense at Q4_K_M fills 24 GB and runs ~30–34 tok/s on the 3090 — slower, but a dense 31B scoring 87.1% MMLU is a different quality tier than a 3B-active MoE.
Gemma 4 26B-A4B at Q8 also fits 24 GB if you want quality with speed.
Llama 4 Scout technically fits via Unsloth's 1.78-bit dynamic quant at ~20 tok/s — but a 1.78-bit 109B model is a curiosity, not a daily driver. The quality you lose at that bit depth erases the point of running Scout over Qwen3.6.

On an RTX 4090, the same models run faster on the small end (Qwen3 8B Q4 hits ~104 tok/s, 14B ~69 tok/s), but the 24 GB ceiling is identical to the 3090, so it changes speed, not which models fit.

Beyond 24 GB (multi-GPU / server)

DeepSeek V4 Flash needs ~33 GB heavily quantized — realistically 2× RTX 4090 or a single RTX 6000-class card. Llama 4 Maverick at INT4 wants ~200 GB (4× H100). GLM-5.1 and DeepSeek V4 Pro are 8-GPU server territory. If you're renting rather than buying for these, a cloud GPU like RunPod is far cheaper than assembling an 8× H200 box you'll use occasionally.
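The tiers above reduce to a lookup. The sketch below hard-codes the approximate footprints quoted in this section (treat them as this article's estimates, not measurements) and lists what fits a given VRAM budget. It leaves no extra headroom for context, so a model that lands exactly on your card's size is a tight fit.

```python
# (model, quant, approx GB needed), figures as quoted in this section
FOOTPRINTS = [
    ("Qwen3 8B", "Q4", 5.0),
    ("Gemma 4 12B", "Q4", 6.6),
    ("Gemma 4 26B-A4B", "Q4_K_M", 16.0),
    ("Qwen3.6 35B-A3B", "Unsloth Q3", 23.0),
    ("Gemma 4 31B", "Q4_K_M", 24.0),
    ("Llama 4 Scout", "1.78-bit", 24.0),
    ("DeepSeek V4 Flash", "heavy quant", 33.0),
    ("Llama 4 Maverick", "INT4", 200.0),
    ("GLM-5.1", "FP8", 860.0),
]


def models_that_fit(vram_gb: float) -> list:
    """Return 'model (quant)' labels whose quoted footprint fits in vram_gb."""
    return [f"{name} ({quant})" for name, quant, need in FOOTPRINTS if need <= vram_gb]


for budget in (8, 16, 24, 48):
    print(budget, models_that_fit(budget))
```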

Speed vs. quality: the real trade-off at 24 GB

The MoE-vs-dense split is the whole story on a single 24 GB card. Qwen3.6 35B-A3B activates 3B parameters per token, so it feels like running a 3B model — ~120 tok/s — while having the knowledge of a 35B. Gemma 4 31B is dense: every one of its 31B parameters fires on every token, so you get ~30 tok/s but denser reasoning per token. Neither is "better"; they're different shapes.

For interactive chat and agentic loops where latency compounds, Qwen3.6's 120 tok/s wins. For one-shot reasoning, summarization, or code review where you read the output once, Gemma 4 31B's depth is worth the slower stream. We go deeper on the two-model decision in our Qwen3.6 35B-A3B guide and the Gemma 4 QAT update.

Licenses: the column most comparisons skip

For a home lab tinkering on a side project, license rarely bites. For anyone building a product, it's decisive.

Family	License	Commercial use	Catch
Qwen3.6	Apache 2.0	Yes, unrestricted	None
DeepSeek V4	MIT	Yes, unrestricted	None
GLM-5.1	MIT	Yes, unrestricted	None
Gemma 4	Gemma custom	Yes	Use restrictions, not OSI-approved
Llama 4	Community	Yes, with limits	700M-MAU clause, "Built with Llama" badge, EU multimodal ban

The irony of the 2026 landscape: the two model families you can actually run locally (Qwen3.6 and Gemma 4) sit at opposite license ends. Qwen3.6 is the cleanest (Apache 2.0), and Gemma 4 is the more encumbered of the two. If license cleanliness matters to you and you want speed, Qwen3.6 is the unambiguous pick.

The gotcha that wastes an afternoon

The most common failure we see: someone reads "Llama 4 Scout is only 17B active" and runs ollama pull expecting a 17B-class memory footprint. Then they hit CUDA error: out of memory on a 24 GB card. MoE models must hold all 109B parameters in VRAM even though 17B activate per token — the "active" number describes compute, not memory. The fix isn't a smaller context; it's accepting that Scout needs either a sub-2-bit dynamic quant (degraded) or more than one GPU. If you hit OOM mid-load on any of these MoEs, check total parameter count, not active count, against your VRAM. (Our full [CUDA OOM fixes](/blog/cuda-out-of-memory

Ollama v0.30 on Apple Silicon: What the Stable MLX Release Actually Changed From the Preview

Jovan Chan — Sun, 05 Jul 2026 07:04:01 +0000

This article was originally published on runaihome.com

TL;DR: Ollama v0.30 (May 13, 2026) promoted the MLX engine from a spring preview to the default Apple Silicon inference path, and the point releases through June added the parts that actually matter day-to-day: Gemma 4 QAT weights, Gemma 4 MTP speculative decoding (>2× on Macs), and better KV-cache reuse so repeated prompts skip re-prefill. If you're on the May preview build, upgrading to v0.30.10 is a free speed bump on an idle afternoon.

What you'll be able to do after this guide:

Upgrade to Ollama v0.30.10 and confirm the MLX engine is actually running, not the old llama.cpp Metal fallback
Pull the Gemma 4 QAT tags that fit your Mac's memory (1GB to 18GB) and run them at near-original quality
Verify KV-cache reuse and MTP speculative decoding are active using ollama ps and real timing

Honest take: This isn't a new engine — it's the MLX preview we covered on June 2 finally stabilized and fed. The headline tok/s numbers haven't moved much since spring, but Gemma 4 QAT plus speculative decoding plus cache reuse together make a 32GB Mac feel meaningfully snappier on real multi-turn work. Upgrade, pull a -it-qat tag, and move on.

What the v0.30 line actually shipped

The MLX engine arrived in preview this spring as a swap of Ollama's Mac backend from llama.cpp's Metal path to Apple's MLX framework, which treats unified memory as the architectural primitive instead of an edge case. That preview nearly doubled decode speed — from ~58 to ~112 tok/s on an M4 Max running Qwen3.5-35B-A3B at int4 — but it was narrow: a handful of models, a hard 32GB-memory floor, and a "preview" label.

Ollama v0.30.0, released May 13, 2026, changed the framing. The release notes describe it as "improved compatibility and performance using llama.cpp" that augments the MLX engine on Apple Silicon, bringing support to a wider range of hardware. In plain terms: MLX is now the default fast path on capable Macs, and the llama.cpp side got broader GGUF support (Hugging Face models and your own fine-tunes) plus faster NVIDIA performance for everyone else.

The interesting work happened in the point releases:

Version	Date	What it added
v0.30.0	May 13, 2026	MLX default on Apple Silicon; broader GGUF + Hugging Face model support; faster NVIDIA
v0.30.5	early June 2026	Fixed `gemma4:12b` floating-point exception crash; Gemma 4 MTP speculative decoding on Macs (>2× speedup)
v0.30.8	June 12, 2026	Improved prompt caching for better KV-cache reuse
v0.30.9	mid June 2026	Cohere2Moe architecture support
v0.30.10	June 17, 2026	Command A and North family models on Apple Silicon MLX; llama.cpp updated to build 9672

If you installed the spring preview and never touched it, you're missing all four of those. None is a marketing bullet — they're the difference between "MLX is fast in a benchmark" and "MLX is fast on the thing I actually do."

Upgrade and verify it's really MLX

Upgrading is the easy part. On macOS, re-run the installer or use Homebrew:

$ brew upgrade ollama
$ ollama --version
ollama version is 0.30.10

The part people skip — and then wonder why nothing got faster — is confirming the MLX engine is the one doing the work. The MLX path activates on Macs with 32GB or more of unified memory. Below that, Ollama silently falls back to llama.cpp Metal with no error and no speed change. That silent fallback is the single most common "I upgraded and saw nothing" complaint, and it's not a bug — it's the documented memory floor.

To check which engine is live, load a model and read ollama ps:

$ ollama run gemma4:26b-it-qat ""
$ ollama ps
NAME                 ID              SIZE     PROCESSOR    UNTIL
gemma4:26b-it-qat    a1b2c3d4e5f6    16 GB    100% GPU     4 minutes from now

100% GPU means the model is fully on the GPU via the unified-memory path. If you see any CPU percentage on a model that should fit, you're either below the memory floor or the model spilled — close other apps and reload. The SIZE column also sanity-checks your quant: a 26B QAT model should report ~16GB, not ~30GB.

Gemma 4 QAT: the upgrade that changes which Mac is enough

The most useful thing v0.30 unlocked isn't raw speed — it's Google's Gemma 4 quantization-aware training (QAT) checkpoints, released June 5, 2026, now available as first-party Ollama tags. QAT simulates quantization during training instead of bolting it on afterward, which cuts memory roughly 72% versus BF16 while keeping near-original quality. We covered the full QAT memory map in the Gemma 4 QAT hardware update; here's the short version of what to pull:

$ ollama pull gemma4:e4b-it-qat    # ~5 GB  — fits a 16GB MacBook Air
$ ollama pull gemma4:12b-it-qat    # ~7 GB  — fits 16GB comfortably
$ ollama pull gemma4:26b-it-qat    # ~15 GB — fits a 16GB Mac/GPU, barely
$ ollama pull gemma4:31b-it-qat    # ~18 GB — needs 24GB+

Gemma 4 QAT tag	Memory	What it fits
`gemma4:e2b-it-qat`	~1 GB	A phone, or any Mac
`gemma4:e4b-it-qat`	~5 GB	8–16GB MacBook Air
`gemma4:12b-it-qat`	~7 GB	16GB Mac / 8GB+ GPU
`gemma4:26b-it-qat`	~15 GB	16GB Mac/GPU (tight)
`gemma4:31b-it-qat`	~18 GB	24GB Mac/GPU

The reason this matters: the 26B-A4B model now fits in ~15GB, which means a 16GB Mac that previously couldn't touch a 26B-class model runs one at near-full quality. Critical caveat carried over from the QAT release: don't hand-convert the Hugging Face QAT BF16 weights to Q4_0 yourself — the F16-vs-BF16 scale mismatch reintroduces the exact accuracy loss QAT was meant to avoid. Use the official Ollama -it-qat tags above, which are already converted correctly.

Speculative decoding and cache reuse: where v0.30 feels faster

Two changes in the point releases don't show up as a bigger headline tok/s number but change the lived experience.

Gemma 4 MTP speculative decoding (v0.30.5) uses multi-token-prediction draft heads to propose several tokens at once and verify them in a single pass — lossless output, but Ollama reports over a 2× speedup on Macs for Gemma 4. This is the same family of technique we broke down in why local LLMs got good in 2026: it doesn't raise the memory-bandwidth ceiling, it just wastes fewer trips to it.
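To make the draft-and-verify idea concrete, here is a toy greedy version in Python. It is not Ollama's or Gemma's implementation: the target and draft callables are hypothetical stand-ins for the full model and the MTP draft heads, and verification is written as a loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> next token id


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4
                       ) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft proposes k tokens in a row.
        ctx, proposal = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks them; counted as one pass per round.
        passes += 1
        ctx, accepted = list(out), []
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))   # all matched: one bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], passes


target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10  # wrong after a 5
tokens, passes = speculative_decode(target, draft, [0], n_new=12)
print(tokens, passes)
```

Every emitted token is one the target would have chosen anyway, which is why the output is lossless. The gain comes from needing fewer target passes than tokens whenever the draft guesses well.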

KV-cache reuse (v0.30.8) is the quieter win. Before, sending a follow-up message in a long chat re-processed the entire prompt history (the prefill step) every turn. With improved prompt caching, an unchanged prefix is reused, so on a multi-turn conversation the second and later turns skip straight to generation. The bigger your system prompt and the longer your chat, the more time-to-first-token you save — on a long coding session with a 4K-token system prompt, that's the difference between a visible pause and an instant reply on every turn.

You won't see a flag for this. The way to confirm it's helping is crude but honest: time two identical follow-up prompts in the same session. The second should start streaming noticeably sooner because the shared prefix is already cached.

Real numbers, and the ceiling that didn't move

Here's what to actually expect, because "2× faster" is only true in specific places:

Mac / model	Backend	Decode	Notes
M4 Max, Qwen3.5-35B-A3B int4	MLX	~112 tok/s	vs ~58 tok/s on the old Metal path (~93% gain)
M4 Max, optimized 7B	MLX	~230 tok/s	small models show MLX's biggest lead
M3 Ultra, Gemma 4 27B Q4_K_M	MLX	~30–42 tok/s	prefill ~700–900 tok/s
M3 Ultra, Qwen3.6 30B-A3B	MLX	>80 tok/s	MoE sparsity (3B active) is why it's 2× the dense 27B

The pattern worth internalizing: MLX leads llama.cpp by **roughly 10–25% on most models, and up to 21–87% o

Ollama Not Using GPU? Fix CPU-Only Inference on Windows, WSL2, and Linux (2026)

Jovan Chan — Sat, 04 Jul 2026 07:05:50 +0000

This article was originally published on runaihome.com

TL;DR: If Ollama feels slow, run ollama ps — a "100% CPU" line means your GPU isn't being used at all, and a CPU/GPU split means the model is too big for your VRAM. Most cases come down to drivers, a WSL2/Docker passthrough gap, or VRAM overflow. The speed gap is real: ~42 tok/s on an RTX 3060 versus 8–14 tok/s CPU-only for Llama 3.1 8B.

What you'll be able to do after this guide:

Confirm in 30 seconds whether Ollama is on your GPU, your CPU, or split between both
Fix the six causes that account for nearly every "Ollama won't use my GPU" report in 2026
Read the Ollama server log to find the one line that tells you what actually happened

Honest take: 80% of these reports are one of two things — you installed Ollama before the NVIDIA driver was working, or your model simply doesn't fit in VRAM and is spilling to system RAM. Check ollama ps first; it tells you which camp you're in before you change a single setting.

Step 1: Confirm the problem (don't guess)

Before touching drivers or reinstalling anything, find out what Ollama is actually doing. Load a model and run ollama ps:

$ ollama run llama3.1:8b "hi" 
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    6.7 GB    100% GPU     4 minutes from now

That PROCESSOR column is the whole diagnosis:

100% GPU — working as intended. If it's still slow, your model/quant or context is the bottleneck, not GPU detection.
100% CPU — Ollama isn't seeing your GPU at all. This is a driver, passthrough, or unsupported-card problem.
58% / 42% CPU/GPU (a split) — Ollama found the GPU but the model doesn't fully fit in VRAM, so layers spilled to system RAM. The GPU is fine; you're out of VRAM.
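If you want to check this from a script (a health check, say), the PROCESSOR column is easy to parse. The sketch below assumes the column layout shown in the sample above, with columns separated by two or more spaces, and the split format Ollama prints, such as "58%/42% CPU/GPU". The function names are illustrative.

```python
import re


def classify_processor(processor: str) -> str:
    """Map a PROCESSOR cell to 'gpu', 'cpu', a split description, or 'unknown'."""
    text = processor.strip()
    if text == "100% GPU":
        return "gpu"
    if text == "100% CPU":
        return "cpu"
    m = re.fullmatch(r"(\d+)%\s*/\s*(\d+)%\s+CPU/GPU", text)
    if m:
        return f"split: {m.group(1)}% CPU, {m.group(2)}% GPU"
    return "unknown"


def processor_by_model(ps_output: str) -> dict:
    """Parse `ollama ps` text into {model name: classification}."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = re.split(r"\s{2,}", lines[0].strip())
    col = header.index("PROCESSOR")  # look the column up, do not hard-code it
    result = {}
    for line in lines[1:]:
        cells = re.split(r"\s{2,}", line.strip())
        result[cells[0]] = classify_processor(cells[col])
    return result


sample = """NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    6.7 GB    100% GPU     4 minutes from now"""
print(processor_by_model(sample))
```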

Cross-check with the GPU itself while a prompt is generating:

$ nvidia-smi

If nvidia-smi prints a table and you see a python/ollama process using VRAM during generation, the GPU is being used. If nvidia-smi returns command not found or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver, your driver is the problem — jump to Cause 1.

Here's the diagnostic flow in one table:

`ollama ps` shows	`nvidia-smi` shows	What's wrong	Go to
100% CPU	driver error / not found	Driver missing or broken	Cause 1
100% CPU	works fine on host, fails in WSL/Docker	Passthrough not configured	Cause 2 / 4
100% CPU	works, GPU is old	Compute capability too low	Cause 5
CPU/GPU split	GPU present, VRAM full	Model bigger than VRAM	Cause 3
100% GPU on wrong card	both GPUs listed	Ollama picked the wrong GPU	Cause 6

Cause 1: Drivers missing, outdated, or installed after Ollama

This is the single most common cause. Ollama detects GPU libraries at install time and at server start, so the order of operations matters.

The version floor in 2026: Ollama supports NVIDIA GPUs with compute capability 5.0+ and driver 531 or newer. Older Maxwell/Pascal cards (compute capability 5.0–6.2, e.g. a GTX 1060) need driver 570 or newer. If your driver is below that, Ollama silently falls back to CPU.
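Those floors can be encoded as a quick pre-flight check. The sketch below uses only the thresholds quoted in the paragraph above (it does not query Ollama or the driver), and the function name check_gpu is made up for illustration.

```python
from typing import Tuple


def check_gpu(compute_capability: Tuple[int, int], driver_major: int) -> Tuple[bool, str]:
    """Apply the version floors quoted above; returns (usable, reason)."""
    if compute_capability < (5, 0):
        return False, "compute capability below 5.0"
    # Per the text above: cards at compute capability 5.0 to 6.2 need a newer driver.
    floor = 570 if compute_capability <= (6, 2) else 531
    if driver_major < floor:
        return False, f"driver {driver_major} is below the {floor} floor"
    return True, "ok"


print(check_gpu((6, 1), 576))   # GTX 1060-class card with the driver shown below
print(check_gpu((8, 6), 535))   # RTX 30-series with an older but sufficient driver
```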

Check your driver:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
576.52

Then fix in this order:

Install/update the NVIDIA driver first. On Windows, grab the latest Game Ready or Studio driver. On Linux, install the proprietary driver (e.g. sudo ubuntu-drivers install on Ubuntu) and reboot.
Verify nvidia-smi works before going further.
Reinstall Ollama after the driver is healthy. If you installed Ollama before the driver worked, its server never registered CUDA support. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows, reinstall the app. Then systemctl restart ollama (Linux) or restart the app.

The classic trap: people install Ollama on a fresh machine, then install GPU drivers, then wonder why it's on CPU. Reinstall Ollama last.

Cause 2: WSL2 passthrough (the Windows + Linux gotcha)

Running Ollama inside WSL2 on Windows is its own special case, and the fix is counterintuitive.

Do not install a Linux NVIDIA driver inside WSL2. The Windows host driver is automatically projected into WSL2 as libcuda.so. Installing a Linux driver on top of that breaks the stub and sends you straight to CPU. This is the mistake that generates the most WSL2 bug reports.

The working setup:

Update the Windows NVIDIA driver (must be 470.76 or later for CUDA-in-WSL2; in practice use a current driver). Windows 11, or Windows 10 21H2+, is required.
Confirm you're on WSL2, not WSL1:

   # in PowerShell
   wsl -l -v

The VERSION column must say 2. WSL1 has no GPU passthrough at all.

Inside WSL2, verify the stub is visible:

   $ nvidia-smi

If that works inside WSL but Ollama still shows CPU, reinstall Ollama inside WSL after confirming nvidia-smi works.

Cause 3: The model is bigger than your VRAM

If ollama ps shows a CPU/GPU split, nothing is broken — you're out of VRAM, and Ollama is doing exactly what it's designed to do: offload the layers that fit to the GPU and run the rest on CPU. That CPU portion is what tanks your tokens/sec.

A rough VRAM budget: a Q4_K_M quant needs about 0.6 GB per billion parameters, plus 1–2 GB for the KV cache at modest context. So Llama 3.1 8B Q4_K_M wants ~6–7 GB, which is why it fits cleanly on an 8GB card; a 14B Q4 wants ~10 GB; a 32B Q4 wants ~20 GB and will split on anything under a 24GB card.
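The rule of thumb above is simple enough to turn into a calculator. This sketch applies exactly the figures from the paragraph (0.6 GB per billion parameters plus 1 to 2 GB of KV cache); the function names are illustrative and the result is an estimate, not a guarantee.

```python
def q4_vram_gb(params_billion: float, kv_cache_gb=(1.0, 2.0),
               gb_per_billion: float = 0.6):
    """Return a (low, high) VRAM estimate in GB for a Q4_K_M model."""
    weights = params_billion * gb_per_billion
    return (round(weights + kv_cache_gb[0], 1),
            round(weights + kv_cache_gb[1], 1))


def fits(params_billion: float, card_gb: float) -> bool:
    """True if even the high estimate fits on the card."""
    return q4_vram_gb(params_billion)[1] <= card_gb


for size in (8, 14, 32):
    print(size, q4_vram_gb(size))
```

For 8B, 14B, and 32B this gives roughly 5.8 to 6.8 GB, 9.4 to 10.4 GB, and 20.2 to 21.2 GB, which lines up with the figures above.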

Fixes, in order of preference:

Use a smaller quant or a smaller model. Drop from Q6_K to Q4_K_M, or pull the :8b tag instead of :14b. See our quantization explainer for what you actually lose (less than people think; Q4_K_M is the sweet spot).
Shrink the context window. A huge num_ctx eats VRAM through the KV cache. If you set OLLAMA_CONTEXT_LENGTH or num_ctx to 32768 "just in case," that alone can force a split. Drop to 4096 or 8192.
Unload other models. Ollama keeps recently-used models resident. Run ollama stop <model> to free VRAM, or set OLLAMA_MAX_LOADED_MODELS=1.
Get more VRAM. If a 32B model is your daily driver, a 24GB card is the real answer — see our VRAM-by-model guide and check whether you actually have enough system RAM for the spillover, too.
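To see why a large num_ctx alone can force a split, it helps to estimate the KV cache directly. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The defaults are assumptions: Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache, so other models and cache types will differ.

```python
def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # 2x because both a key and a value vector are stored per layer and head.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_ctx / 2**30


for ctx in (4096, 8192, 32768):
    print(ctx, kv_cache_gib(ctx), "GiB")
```

Under these assumptions, 32768 tokens of context costs about 4 GiB against 0.5 GiB at 4096, which is enough to push an 8B model off an 8GB card.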

The classic error here is the hard out-of-memory case:

Error: CUDA error: out of memory

Fix: lower the context (num_ctx 2048), use a smaller quant, or stop other loaded models — then retry.

Cause 4: Docker without GPU access

A container does not get GPU access by default. If you run Ollama in Docker and it's on CPU, the host setup is incomplete.

On the host (or inside your WSL2 distro if that's where Docker lives):

# assumes nvidia-container-toolkit is already installed; this registers its runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then launch the container with the GPU flag:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

The two things people forget: installing nvidia-container-toolkit on the host, and the --gpus=all flag itself. Miss either and the container quietly runs on CPU.

Cause 5: Your GPU is too old (compute capability)

Ollama requires compute capability 5.0 or higher — Maxwell (GTX 900 series) and newer. A Kepler-era card (GTX 700 series, compute capability 3.5) will never get GPU acceleration in current Ollama, no matter what driver you install. Datacenter Tesla K80s fall in the same bucket.

Check your card's compute capability against NVIDIA's CUDA GPUs list. If it's below 5.0, your only paths are a newer GPU or cloud rental (see Cause 7's note). This is rare on home builds but common when someone tries to repurpose an ancient mining or serv

Ollama Keeps Reloading the Model? Fix VRAM Unloading, Cold Starts, and Model Swapping (2026)

Jovan Chan — Sat, 04 Jul 2026 07:05:06 +0000

This article was originally published on runaihome.com

TL;DR: Ollama unloads a model from VRAM 5 minutes after the last request by default, so the next prompt pays a cold-start penalty while weights reload from disk. The fix is one environment variable — OLLAMA_KEEP_ALIVE — set on the service, not your shell. If you're juggling several models, OLLAMA_MAX_LOADED_MODELS decides how many stay resident at once. You almost never need more VRAM; you need the right keep-alive policy.

What you'll be able to do after this guide:

Keep a model pinned in VRAM indefinitely (or for a set window) so the first token is instant
Read ollama ps to confirm what's actually loaded and when it'll unload
Stop two models from fighting over VRAM and thrashing each other out of memory

Honest take: 90% of "Ollama is slow on the first message" complaints are the 5-minute keep-alive timeout doing exactly what it was designed to do. Set OLLAMA_KEEP_ALIVE correctly on the service and the problem disappears — no hardware change required.

What's actually happening

By default, Ollama keeps a model loaded in memory for 5 minutes after its last request, then unloads it and frees the VRAM (Ollama FAQ). That idle timeout is deliberate: it returns GPU memory to the system so other workloads (or other models) can use it.

The downside shows up the moment you step away. Come back after lunch, send a prompt, and Ollama has to reload the entire model from disk before it can answer. On a 7B model that cold start is roughly 3–10 seconds; on a 70B model loading from a SATA SSD it can be ~74 seconds, versus about 18 seconds from an NVMe drive (Markaicode NVMe load-time benchmarks). To the user it feels like Ollama "froze." It didn't — it's doing a disk-to-VRAM reload because the model went cold.

So before you blame your GPU, confirm the symptom. Open two terminals. In one, run your model. In the other, watch what's resident:

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b        42182419e950    6.7 GB    100% GPU     4 minutes from now

Three columns matter here:

PROCESSOR — 100% GPU means the whole model is in VRAM (fast). Anything less means part of it spilled to CPU/system RAM, which tanks tokens/sec. If you're seeing CPU offload, that's a different problem — see our Ollama not using GPU fix.
SIZE — how much memory this model is holding.
UNTIL — the countdown to unload. 4 minutes from now is the default 5-minute timer ticking down. This is the column that explains your cold starts.

If ollama ps shows nothing, the model is already unloaded and your next request will be a cold start. That's the whole bug.
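The same check can be scripted. Below is a minimal sketch that reads Ollama's GET /api/ps endpoint and reports, per model, how much of it sits in VRAM and how many seconds remain before unload. The field names (size, size_vram, expires_at) follow the API's response; the sample payload and its values are made up for illustration, and the helper names are hypothetical.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone


def parse_timestamp(ts):
    """Parse an RFC 3339 timestamp, normalising fractional seconds to 6 digits."""
    ts = ts.replace("Z", "+00:00")
    head, frac, tz = re.match(
        r"(.*T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(.*)$", ts
    ).groups()
    frac = (frac or "0")[:6].ljust(6, "0")
    return datetime.fromisoformat(f"{head}.{frac}{tz}")


def summarize_ps(payload, now):
    """Return GPU residency and seconds until unload for each loaded model."""
    rows = []
    for m in payload.get("models", []):
        size = m["size"]
        gpu_pct = round(100 * m.get("size_vram", 0) / size) if size else 0
        left = (parse_timestamp(m["expires_at"]) - now).total_seconds()
        rows.append(
            {"name": m["name"], "gpu_pct": gpu_pct, "seconds_left": max(0, int(left))}
        )
    return rows


def fetch_ps(host="http://localhost:11434"):
    """Query a running Ollama server (not called here; needs a live server)."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp)


# Illustrative payload only; real values come from fetch_ps().
sample = {
    "models": [
        {
            "name": "llama3.1:8b",
            "size": 6_700_000_000,
            "size_vram": 6_700_000_000,
            "expires_at": "2026-07-07T12:04:00.5+00:00",
        }
    ]
}
now = datetime(2026, 7, 7, 12, 0, 0, tzinfo=timezone.utc)
print(summarize_ps(sample, now))
```

An empty list from summarize_ps(fetch_ps(), datetime.now(timezone.utc)) is the scripted equivalent of ollama ps showing nothing: the next request will be a cold start.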

Fix 1: Set OLLAMA_KEEP_ALIVE on the service (the real fix)

OLLAMA_KEEP_ALIVE controls how long a model stays resident after its last request. It accepts (Ollama FAQ):

A duration string: "10m", "24h"
A number in seconds: 3600
Any negative value to keep it loaded forever: -1
0 to unload immediately after each response
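To make those four cases concrete, here is a small sketch of how the values can be read. It is not Ollama's own parser: it only handles a single s, m, or h suffix, and the function name is hypothetical.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def describe_keep_alive(value):
    """Return 'forever', 'unload immediately', or a positive number of seconds."""
    if isinstance(value, str) and value[-1:] in UNIT_SECONDS:
        # Duration string such as "10m" or "24h" (single unit only in this sketch).
        seconds = float(value[:-1]) * UNIT_SECONDS[value[-1]]
    else:
        # Bare number, interpreted as seconds.
        seconds = float(value)
    if seconds < 0:
        return "forever"
    if seconds == 0:
        return "unload immediately"
    return int(seconds)


for example in ("10m", "24h", 3600, -1, 0):
    print(repr(example), "->", describe_keep_alive(example))
```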

The trap that wastes the most time: setting it in your shell does nothing. Ollama usually runs as a background service (systemd on Linux, a launch agent on macOS, a tray app on Windows) with its own environment. Exporting OLLAMA_KEEP_ALIVE=-1 in .bashrc is invisible to that service (SumGuy's Ramblings). You have to set it where the service can see it.

Linux (systemd):

sudo systemctl edit ollama.service

Add, under the [Service] section:

[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

-1 pins the model in VRAM permanently — ideal for a dedicated home AI box that only ever runs one model. If you'd rather it free memory overnight, use "8h" instead.

macOS: set it for the launch environment, then restart the Ollama app:

launchctl setenv OLLAMA_KEEP_ALIVE "-1"

Windows: quit Ollama from the system tray, open Settings → System → About → Advanced system settings → Environment Variables, add a user variable OLLAMA_KEEP_ALIVE with value -1, then relaunch Ollama.

Verify it took effect — UNTIL should now read Forever:

$ ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b        42182419e950    6.7 GB    100% GPU     Forever
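If UNTIL still shows a countdown on Linux, one way to check whether the service process actually received the variable is to read its environment from /proc. This is a sketch under a few assumptions: it is Linux-only, reading another user's process environment usually needs root, and the PID has to come from somewhere such as systemctl show -p MainPID --value ollama. The function names are hypothetical.

```python
from pathlib import Path


def parse_environ(raw):
    """Split the NUL-separated KEY=VALUE block the kernel exposes per process."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def service_keep_alive(pid):
    """Return OLLAMA_KEEP_ALIVE as seen by the process, or None if unset."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return parse_environ(raw).get("OLLAMA_KEEP_ALIVE")


# Illustrative bytes in the same layout as /proc/<pid>/environ.
demo = b"PATH=/usr/bin\0OLLAMA_KEEP_ALIVE=-1\0"
print(parse_environ(demo).get("OLLAMA_KEEP_ALIVE"))
```

A result of None from service_keep_alive means the variable was set somewhere the service cannot see, which is the shell-versus-service trap described above.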

Fix 2: Override per request with keep_alive

If you don't want a global policy — say a script that should load a big model, do one batch job, and release the VRAM — pass keep_alive directly in the API call. The request-level parameter overrides the OLLAMA_KEEP_ALIVE environment variable for that call (Ollama FAQ).

Keep a model loaded for this session:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this changelog...",
  "keep_alive": -1
}'

Unload immediately when the job is done (frees VRAM the instant the response finishes):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "keep_alive": 0
}'

Sending a request with an empty prompt just loads (or unloads) the model without generating — handy for preloading.
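For a batch script, the same load, run, release pattern can be sketched in Python with only the standard library. The helper names are hypothetical, the host is the default from the curl examples, and "stream" is set to false so each call returns a single JSON object.

```python
import json
import urllib.request


def generate_payload(model, prompt=None, keep_alive=None):
    """Build an /api/generate body; omitting the prompt only loads or unloads."""
    body = {"model": model, "stream": False}
    if prompt is not None:
        body["prompt"] = prompt
    if keep_alive is not None:
        body["keep_alive"] = keep_alive
    return body


def post_generate(body, host="http://localhost:11434"):
    """Send the request to a running Ollama server (not called in this sketch)."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


preload = generate_payload("llama3.1:8b", keep_alive=-1)  # pin before the batch
unload = generate_payload("llama3.1:8b", keep_alive=0)    # release afterwards
print(json.dumps(preload))
print(json.dumps(unload))
```

A batch job would call post_generate(preload), run its prompts, then call post_generate(unload) so the VRAM is returned as soon as the work finishes.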

Fix 3: Preload before the user shows up

If your real complaint is "the first request of the day is slow," preload the model at boot so the cold start happens before anyone is waiting. The cleanest way is an empty generate request that pins the model:

ollama run llama3.1:8b ""

Or via the API in a startup script / systemd unit:

curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "keep_alive": -1}'

Run this from a cron @reboot job or a small systemd service after ollama.service, and your model is warm in VRAM before the first real prompt arrives. Combined with OLLAMA_KEEP_ALIVE=-1, you get instant responses around the clock — at the cost of holding that VRAM permanently.

Fix 4: Stop two models from thrashing

A subtler version of the same problem: you switch between, say, a coding model and a general chat model, and every switch is slow. That's not the keep-alive timer — it's model swapping.

Two environment variables govern this (Ollama FAQ, envconfig source):

Variable	Default	What it does
`OLLAMA_MAX_LOADED_MODELS`	3 × GPU count (3 for CPU)	How many distinct models can be resident at once
`OLLAMA_NUM_PARALLEL`	auto (1 or 4)	Concurrent requests per model; each parallel slot needs its own KV-cache VRAM

The catch is physical: on GPU inference, a new model must fit entirely in VRAM alongside whatever's already loaded, or Ollama unloads something to make room (Ollama FAQ). On a 24GB card, two 8B models (≈7 GB each) coexist comfortably and switching is instant. An 8B plus a 32B model won't both fit, so Ollama evicts one on every switch — and you're back to cold starts.

Practical rules:

Plenty of VRAM, a few small models you alternate between? Raise OLLAMA_MAX_LOADED_MODELS and let them all stay resident.
Tight on VRAM? Leave it at default and standardize on one model. Forcing two big models into a card that can't hold both just guarantees thrashing.
Watch the KV cache. KV-cache VRAM scales with OLLAMA_NUM_PARALLEL × context length, so bumping parallelism multiplies the cache and can quietly push you into a CUDA out-of-memory error. If you raised parallelism and things got less stable, lower it back.
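The fit check behind these rules is simple addition. The sketch below uses the ≈7 GB figure for an 8B model from above; the ~20 GB size for a 32B model and the 1.5 GB reserve for runtime overhead are assumptions for illustration, and the model labels are hypothetical.

```python
def resident_plan(models_gb, vram_gb, reserve_gb=1.5):
    """Report whether every listed model can stay in VRAM at the same time."""
    budget = vram_gb - reserve_gb
    total = sum(models_gb.values())
    return {"total_gb": total, "budget_gb": budget, "fits": total <= budget}


two_small = resident_plan({"chat-8b": 7, "code-8b": 7}, vram_gb=24)
small_plus_big = resident_plan({"chat-8b": 7, "code-32b": 20}, vram_gb=24)
print(two_small)
print(small_plus_big)
```

When fits is False, raising OLLAMA_MAX_LOADED_MODELS will not help: the card cannot hold both, so one model gets evicted on every switch.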

Check your VRAM headroom before raising any of these. A used RTX 3090's 24GB holds one 8B model with enormous room to spare, but t

vLLM Won't Start? Every Fix for the Engine Init, CUDA, and OOM Errors (2026)

Jovan Chan — Sat, 04 Jul 2026 07:02:44 +0000

This article was originally published on runaihome.com

TL;DR: Most vLLM startup failures are one of three things: the engine reserves more KV-cache memory than your card has (No available memory for the cache blocks), the CUDA driver is older than the wheel was built for (The NVIDIA driver on your system is too old), or a multi-GPU run hangs at NCCL init. The fixes are nearly always flags, not code: pin --max-model-len, tune --gpu-memory-utilization, add --enforce-eager, or set a couple of NCCL env vars. Read the last line of the traceback first — it tells you which of the three you have.

What you'll be able to do after this:

Read a vLLM startup traceback and know in one glance whether it's a KV-cache/OOM problem, a driver/CUDA problem, or a multi-GPU networking hang.
Apply the exact flag or environment variable that fixes each class, with the values that actually work on 12–24 GB consumer cards.
Stop guessing from nvidia-smi — which lies about how much memory vLLM can actually use — and trust the startup log instead.

Honest take: vLLM is a server engine, not a desktop app. If you just want a model running on one consumer GPU with the least friction, Ollama or LM Studio will get you there faster. Reach for vLLM when you need throughput under concurrency — many requests at once — and you're willing to learn three flags. Once you know those flags, 90% of the "it won't start" pain disappears.

This guide assumes vLLM v0.23.0 (released June 13, 2026), which ships on PyTorch 2.11 with the default PyPI wheel now built for CUDA 13.0 and Python 3.14 added to the supported list. Older forum threads reference very different defaults, so the version tag matters when you're copying fixes from 2024–2025 posts.

Step 0: Read the actual error, not the wall of logs

vLLM prints a lot of output on startup — model download progress, worker spawn messages, CUDA graph capture. None of that is the error. The error is the last Python traceback, and specifically its final line. Three lines account for the overwhelming majority of "vLLM won't start" reports:

The line you see	What it actually means	Jump to
`ValueError: No available memory for the cache blocks`	KV cache doesn't fit after weights load	OOM section
`RuntimeError: The NVIDIA driver on your system is too old`	Wheel built for newer CUDA than your driver	Driver section
Hangs forever after `Started a worker` / NCCL lines	Multi-GPU collective setup stuck	NCCL section

If you can't tell which bucket you're in, restart with debug logging on and capture the tail:

VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen2.5-7B-Instruct 2>&1 | tee vllm.log

The DEBUG level is documented in vLLM's own troubleshooting guide and is the single most useful thing you can do before asking anyone for help.
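Once the log is captured in vllm.log, sorting it into one of the three buckets can be automated. This is a rough sketch that matches only the two error strings from the table above and treats a log that mentions NCCL near the end, with no known error, as a possible hang. The bucket labels and function name are hypothetical.

```python
SIGNATURES = [
    ("No available memory for the cache blocks", "kv-cache/OOM"),
    ("The NVIDIA driver on your system is too old", "driver/CUDA"),
]


def classify(log_text):
    """Scan from the bottom of the log for a known startup failure signature."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    for line in reversed(lines):
        for needle, bucket in SIGNATURES:
            if needle in line:
                return bucket
    # No traceback matched: a tail full of NCCL lines suggests a stuck collective.
    if any("NCCL" in ln for ln in lines[-20:]):
        return "possible NCCL hang"
    return "unknown"


example = (
    "INFO loading weights\n"
    "ValueError: No available memory for the cache blocks. Try increasing\n"
)
print(classify(example))
```

In practice you would pass it the file contents, for example classify(open("vllm.log").read()).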

The #1 startup error: "No available memory for the cache blocks"

This is the error people hit first, and it's been the top vLLM startup complaint since at least issue #2248. The full message reads:

ValueError: No available memory for the cache blocks. Try increasing
`gpu_memory_utilization` when initializing the engine.

Why it happens

vLLM loads the model weights first, then tries to carve the remaining VRAM into KV-cache blocks. The KV cache is sized from --max-model-len (the maximum sequence length) and the number of concurrent sequences. If the weights plus the requested context budget exceed your card, there's nothing left for blocks, and the engine refuses to start rather than crash mid-request.

The trap: vLLM's default --max-model-len is the model's full trained context — often 32K or higher. A 7B model at Q4 might be ~4.5 GB of weights, but a 32K context KV cache for several parallel sequences can dwarf that. On a 12 GB card the math simply doesn't close. This is exactly the failure mode reported for 7B–13B models on the RTX 3060 12GB in issue #27934 — and it affects every Ampere 12 GB card, not just the 3060.
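To see why the context budget matters so much, here is a back-of-envelope sketch of KV-cache size per sequence. It assumes the architecture values from Qwen2.5-7B's published config (28 layers, 4 KV heads, head dimension 128) and a 16-bit cache; vLLM's real allocation works in blocks and adds overhead, so treat the output as a lower bound.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed from the Qwen2.5-7B config; check config.json for other models.
QWEN25_7B = dict(layers=28, kv_heads=4, head_dim=128)

per_token = kv_bytes_per_token(**QWEN25_7B)
for ctx in (4096, 32768):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB per sequence")
```

Under these assumptions a single sequence costs about 0.22 GiB at 4K and 1.75 GiB at 32K, and that figure is multiplied by every concurrent sequence the engine has to budget for.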

The fix, in the order to try it

1. Pin --max-model-len to what you actually need. This is the highest-leverage fix and most people skip it. If your prompts are 4K, don't pay for 32K of KV cache:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

2. Raise --gpu-memory-utilization. Default is 0.9. The error message tells you to increase it, and that's legitimate — it lets vLLM claim a larger slice of the card for blocks. On a dedicated inference box, 0.92–0.95 is reasonable. But on a card that's also driving a display, going too high starves the desktop and can crash X. On 12 GB cards, counterintuitively, lowering it to 0.75–0.80 sometimes fixes init OOMs because it leaves more headroom for the CUDA context and fragmentation overhead that the allocator needs up front.

3. Cap concurrency with --max-num-seqs. Fewer simultaneous sequences means a smaller KV-cache budget. Dropping from the default to --max-num-seqs 16 (or 8) frees real memory on tight cards.

4. Quantize the KV cache. --kv-cache-dtype fp8 roughly halves KV-cache memory at a small quality cost — often the difference between fitting and not on 16 GB.

5. Add --enforce-eager. CUDA graph capture pre-allocates extra memory. Disabling it with --enforce-eager reclaims roughly 300–500 MiB, which is useful as a last resort when you need context length more than peak throughput.

A combined low-VRAM launch that works on most 12 GB cards:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.80 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8 \
  --enforce-eager
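A rough way to sanity-check a flag combination before launching is to estimate how many KV-cache tokens the remaining pool can hold. This sketch is not vLLM's own accounting, which profiles activation memory at startup. The 1 GiB overhead figure is an assumption, the ~4.5 GB weight figure is the Q4 example from above, and the per-token byte counts assume a Qwen2.5-7B-shaped model.

```python
def kv_token_budget(card_gib, utilization, weights_gib, per_token_bytes,
                    overhead_gib=1.0):
    """Estimate how many tokens of KV cache fit after weights and overhead."""
    pool_gib = card_gib * utilization - weights_gib - overhead_gib
    if pool_gib <= 0:
        return 0  # nothing left for cache blocks: the engine would refuse to start
    return int(pool_gib * 2**30 // per_token_bytes)


FP16_BYTES_PER_TOKEN = 57344                      # 28 layers, 4 KV heads, dim 128
FP8_BYTES_PER_TOKEN = FP16_BYTES_PER_TOKEN // 2   # --kv-cache-dtype fp8

fp16_tokens = kv_token_budget(12, 0.80, 4.5, FP16_BYTES_PER_TOKEN)
fp8_tokens = kv_token_budget(12, 0.80, 4.5, FP8_BYTES_PER_TOKEN)
print("fp16 cache tokens:", fp16_tokens)
print("fp8 cache tokens: ", fp8_tokens)
```

If the estimate is below --max-model-len, a single full-length request cannot fit, which is the condition behind the cache-blocks error.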

Don't trust nvidia-smi here

A subtle point that wastes hours: nvidia-smi reports driver-level reserved memory, not the segments the CUDA allocator can actually hand to vLLM. vLLM's block allocator queries CUDA directly and can OOM even when nvidia-smi shows a couple of GB "free." When the two disagree, trust vLLM's startup log, not the system monitor.

If — and only if — the log explicitly mentions fragmentation, add:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Don't set this reflexively. It's a fix for fragmentation specifically, not a magic OOM cure, and the same PYTORCH_CUDA_ALLOC_CONF knob shows up across the broader CUDA out-of-memory fix guide for Ollama, llama.cpp, and ComfyUI too.

"The NVIDIA driver on your system is too old"

The second-most-common wall, especially right after a fresh pip install vllm:

RuntimeError: The NVIDIA driver on your system is too old (found version XXXX).

Why it happens

vLLM wheels are compiled against a specific CUDA toolkit. As of v0.23.0 the default PyPI wheel targets CUDA 13.0. If your installed driver predates that toolkit, the compiled kernels can't run. This bites people on stable LTS distros and on cloud images that pin older drivers.

The fixes

Option A — enable CUDA forward compatibility (no driver upgrade). If you're on the official vLLM Docker image, add:

docker run --gpus all -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 ... vllm/vllm-openai:latest

Outside Docker, install the matching cuda-compat package and point vLLM at it:

sudo apt-get install cuda-compat-13-0
export VLLM_ENABLE_CUDA_COMPATIBILITY=1
export VLLM_CUDA_COMPATIBILITY_PATH=/usr/local/cuda/compat

Option B — upgrade the driver. The cleaner long-term fix. Match your driver to the CUDA 13.0 toolkit minimum. On WSL2, install the latest Game Ready or Studio driver on the Windows host — never install a Linux GPU driver inside the WSL distro, which is the same trap that breaks Ollama GPU detection.

Option C — install a wheel

How to Run a 70B Model on a Single 24GB GPU in 2026 (and When You Shouldn't)

Jovan Chan — Fri, 03 Jul 2026 07:04:07 +0000

This article was originally published on runaihome.com

TL;DR: A 70B model at Q4_K_M needs ~43–45 GB of VRAM, and a single 24 GB card holds barely half of it. You can run one with partial offload, but you'll get 6–12 tok/s instead of 30+, and the honest move for most people is a 32B-class model that fits fully. This guide shows the exact -ngl math, the KV-cache trick that buys you 4–6 extra layers, and the speed you'll actually see.

What you'll be able to do after this:

Calculate the correct offload layer count for your card and context length, instead of guessing
Cut KV-cache VRAM roughly in half with one flag, freeing room for more GPU layers
Decide — with numbers, not vibes — whether a 70B is worth the speed hit on your hardware

Honest take: On one 24 GB GPU, a 70B at Q4 runs but crawls (single-digit to low-teens tok/s). If you need 70B-class quality, rent a 48 GB card by the hour; if you need speed on hardware you own, run a 32B-class model that fits in 24 GB.

The core problem: the model is twice the size of your card

Here's the wall everyone hits. A modern dense 70B like Llama 3.3 70B at Q4_K_M is about 42.5 GB of weights (bartowski's widely-used GGUF), and once you add the KV cache, compute buffers, and CUDA overhead, you need ~43–45 GB of VRAM to run it entirely on the GPU. A single RTX 3090 or RTX 4090 gives you 24 GB. That's roughly half of what a fully-resident 70B Q4 wants.

The full quant ladder for a 70B, so you can see where the cliffs are:

Quant	Approx. file size	Fits fully in 24 GB?	Notes
F16 (unquantized)	~141 GB	No	Datacenter only
Q8_0	~75 GB	No	Needs 2× 48 GB
Q5_K_M	~50 GB	No	Single 48–64 GB card
Q4_K_M	~42.5 GB	No	The "good quality" baseline; needs ~48 GB to be comfortable
Q3_K_M	~34 GB	No	Noticeable quality drop
Q2_K	~26 GB	Almost — spills ~2 GB	Quality degrades visibly
IQ1_M	~16.75 GB	Yes	Barely coherent; not worth it

File sizes from bartowski's Llama-3.3-70B-Instruct-GGUF repo; VRAM-in-use runs a few GB above file size because of context and buffers.

So the only quant that fits fully on a 24 GB card is something below Q2_K — and at that point you've quantized away most of the reason you wanted a 70B. The realistic play is partial offload: put as many layers as fit on the GPU, run the rest on the CPU.

How partial offload actually works

A 70B GGUF has 80 transformer layers plus an output layer. With llama.cpp (and anything built on it — Ollama, LM Studio, koboldcpp), the -ngl N flag (--n-gpu-layers) decides how many of those layers live in VRAM. The rest stay in system RAM and run on the CPU.

At Q4_K_M, each of those 80 layers weighs roughly 475 MB. During every forward pass, execution bounces between GPU and CPU: the GPU-resident layers run at full speed, and the CPU layers run at whatever your system RAM bandwidth allows — which is an order of magnitude slower. That handoff is why partial offload is functional but never fast.

The math for a 24 GB card:

24 GB total minus ~2 GB for CUDA context, the OS, and compute buffers = ~22 GB usable
Reserve 2–4 GB for the KV cache (more if you want long context)
That leaves ~18 GB for weights → ~18,000 MB ÷ 475 MB ≈ 38 layers
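The same arithmetic as a small function, so you can plug in your own card and KV-cache reserve. The defaults are the figures from the list above (24 GB card, ~2 GB overhead, 475 MB per layer, 80 layers); the function name is hypothetical and the result is a starting point for -ngl, not a guarantee.

```python
def gpu_layers(vram_gb=24, overhead_gb=2, kv_cache_gb=4, layer_mb=475,
               total_layers=80):
    """Whole transformer layers that fit after overhead and KV-cache reserve."""
    weights_mb = (vram_gb - overhead_gb - kv_cache_gb) * 1000
    return max(0, min(total_layers, int(weights_mb // layer_mb)))


for kv in (2, 3, 4):
    print(f"KV reserve {kv} GB -> start near -ngl {gpu_layers(kv_cache_gb=kv)}")
```

Because only whole layers count, a 4 GB reserve gives 37 rather than the rounded 38, and shrinking the reserve to 2 GB raises it to 42.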

In practice people land at 40–45 layers on a 24 GB card by trimming context length and quantizing the KV cache. That puts roughly half the model on the GPU. Community reports put a single RTX 3090 running Llama 70B Q4_K_M at about 8 tok/s with basic settings — slow, but usable for non-interactive work.

The relationship is roughly linear: with no GPU at all you're at 2–4 tok/s (pure CPU), partial offload gets you into the 8–15 tok/s range, and a card that holds the whole thing does 30+ tok/s. If you offload too few layers you barely beat CPU; the win comes from getting as many layers onto the GPU as your VRAM physically allows.

Step 1: Find your real layer ceiling

Don't guess -ngl. Start high and let the loader tell you the truth.

With llama.cpp directly:

./llama-cli -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 -fa \
  -p "Explain partial GPU offload in one paragraph."

-ngl 99 means "offload everything you can." On a 24 GB card it won't fit, and you'll get a CUDA out-of-memory error — that's expected. Watch the log line that reports offloaded XX/81 layers to GPU before it dies, then back the number down. If you OOM at 81, try 45, then nudge up until it loads with a few hundred MB to spare. (If you keep hitting OOM and don't understand why, our CUDA out-of-memory fix guide walks every cause.)
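The back-off-and-nudge procedure is a binary search. Here is a minimal sketch, assuming a hypothetical try_load(n) callback that launches the loader with -ngl n and returns True if the model loads, and assuming that if n layers fit then any smaller n also fits.

```python
def find_layer_ceiling(try_load, total_layers=81):
    """Return the largest layer count for which try_load(n) succeeds."""
    lo, hi = 0, total_layers  # 0 layers on the GPU is assumed to always load
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if try_load(mid):
            lo = mid              # mid fits: the ceiling is at least mid
        else:
            hi = mid - 1          # mid ran out of memory: the ceiling is below it
    return lo


# Stand-in for a real loader: pretend anything up to 43 layers fits.
print(find_layer_ceiling(lambda n: n <= 43))
```

With 81 layers this needs at most seven load attempts. Back the result off by a layer or two to leave the few hundred MB of slack mentioned above.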

With Ollama, the equivalent is a Modelfile parameter or an environment-driven setting:

# In a Modelfile:
FROM llama3.3:70b
PARAMETER num_gpu 45
PARAMETER num_ctx 4096

num_gpu is Ollama's name for -ngl. Set it too high and Ollama will silently spill into shared memory on Windows (which tanks speed) or OOM on Linux. Set it just below your ceiling.

One gotcha worth knowing: if your model keeps unloading and reloading between prompts, that's a separate keep-alive issue, not an offload problem — see why Ollama keeps reloading the model.

Step 2: Quantize the KV cache to buy back layers

This is the single highest-leverage trick on a memory-starved card. The KV cache stores attention keys and values for every token in context, and at the default F16 precision it eats VRAM fast — several GB at long context. Running it at q8_0 cuts that roughly in half with minimal quality impact, which frees room for 4–6 more transformer layers on the GPU.

In Ollama, KV-cache quantization requires Flash Attention, so you set both:

# Set these where the Ollama *service* reads them (systemd, launchctl, or
# the Windows user env), not just your shell:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

In raw llama.cpp:

./llama-cli -m model.gguf -ngl 45 -fa \
  -ctk q8_0 -ctv q8_0 -c 8192

-fa enables Flash Attention; -ctk/-ctv set the K and V cache types. Two warnings the docs bury:

Not every architecture supports Flash Attention. If you force q8_0 on an unsupported model, Ollama silently falls back to F16 — so you budget for half the cache and then OOM unexpectedly. Verify with ollama ps that the model loaded at the size you expected.
q4_0 cache cuts VRAM further but the quality loss becomes noticeable, especially on long-context reasoning. Stick with q8_0 unless you're desperate for those last few hundred MB.

With q8_0 cache and Flash Attention on, a 24 GB card that topped out at 40 layers can often reach 44–46, and every extra layer on the GPU is a measurable speed gain.

Step 3: Keep context realistic

Context length is the other VRAM lever, and it's the one people forget. The KV cache scales linearly with context: doubling -c from 4096 to 8192 roughly doubles cache VRAM, which costs you GPU layers. On a 24 GB card running a 70B, 4096–8192 tokens is the sweet spot. If you genuinely need 32K+ context, you're back to needing more VRAM — there's no free lunch.
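The linear scaling is easy to put numbers on. This sketch assumes Llama 3.3 70B's architecture (80 layers, 8 KV heads, head dimension 128) and counts only the raw cache for a fully offloaded model, ignoring compute buffers. The q8_0 figure uses ggml's block layout of 32 int8 values plus one fp16 scale.

```python
# Bytes per cached value; q8_0 stores 34 bytes for every 32 values.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(ctx, cache_type="f16", layers=80, kv_heads=8, head_dim=128):
    """Raw KV-cache size in GiB for one sequence of ctx tokens."""
    values = 2 * layers * kv_heads * head_dim * ctx  # keys plus values
    return values * BYTES_PER_VALUE[cache_type] / 2**30


for ctx in (4096, 8192, 32768):
    f16 = kv_cache_gib(ctx, "f16")
    q8 = kv_cache_gib(ctx, "q8_0")
    print(f"-c {ctx:>5}: f16 {f16:.2f} GiB, q8_0 {q8:.2f} GiB")
```

Under these assumptions the F16 cache is 1.25 GiB at 4096 tokens, 2.5 GiB at 8192, and 10 GiB at 32768, with q8_0 at a little over half of each.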

Also make sure your system RAM can hold the part of the model that lives on the CPU. A Q4 70B is ~42 GB of weights, so with ~18 GB of them on the GPU the remaining ~24 GB has to sit in system RAM, plus headroom for the OS and buffers. That means 48 GB of RAM minimum, 64 GB comfortable. If you're short, that's a system RAM sizing problem, and the symptom is the Linux OOM killer terminating your runner mid-load.

When you shouldn't do this at all

Here's the part most "how to run 70B" posts skip. Partial offload works, but at 8–12 tok/s a 70B is painful for anything interactive

Ornith-1.0 for Local AI in 2026: Which GPU Runs DeepReinforce's MIT-Licensed Coding Model?

Jovan Chan — Fri, 03 Jul 2026 07:03:21 +0000

This article was originally published on runaihome.com

TL;DR: Ornith-1.0 is DeepReinforce's new MIT-licensed coding family — 9B Dense, 31B Dense, 35B MoE, and 397B MoE, post-trained on Gemma 4 and Qwen 3.5. The home-lab pick is the 35B MoE: ~3B active parameters per token make it fast, and the Q4_K_M GGUF is 21.2 GB, so it just fits a single 24 GB card. The catch: 21.2 GB on a 24 GB GPU leaves almost no room for long context.

	9B Dense	35B MoE	397B MoE
Best for	8–12 GB cards	The 24 GB sweet spot	Cloud / API only
Q4 size	~6 GB	21.2 GB (Q4_K_M)	~225 GB+
Active params	9B (dense)	~3B per token	~? per token
Runs on a single consumer GPU?	Yes, easily	Yes, on 24 GB	No
The catch	Weakest of the family	No headroom for 256K context	No card holds it

Honest take: If you have a 24 GB card, grab the 35B MoE Q4_K_M — it's the rare model that gives you MoE speed and a license you can actually ship a product on. If you're on 8–16 GB, run the 9B and keep your expectations modest. The 397B is an API model; don't try to buy hardware for it.

The local-AI release calendar has been relentless this month, but Ornith-1.0 is worth stopping for — not because it's the biggest, but because it lands the two things home-labbers actually ask for: a permissive license and a variant that runs fast on a card you already own.

What Ornith-1.0 actually is

DeepReinforce released the Ornith-1.0 family on June 25, 2026, under the MIT license with no regional restrictions — every checkpoint, including the GGUF and FP8 builds, ships under that license on Hugging Face. That alone separates it from a lot of "open weight" releases that bolt on usage clauses or research-only terms.

The family spans four checkpoints, all post-trained on Gemma 4 and Qwen 3.5 bases:

Ornith-1.0-9B — dense, edge/resource-constrained target
Ornith-1.0-31B — dense
Ornith-1.0-35B — sparse Mixture-of-Experts
Ornith-1.0-397B — flagship MoE

The headline feature is the training method, not the size. Ornith is a self-scaffolding model: during reinforcement learning it learns to write its own harness — the tool-use loop, the test scaffold — and jointly optimizes that scaffold alongside the code it produces. It's also reasoning-first: each assistant turn opens with a chain-of-thought block, and the serving stack returns that reasoning in a separate field from the final answer. For a coding agent, that's the right shape: it plans, then acts.

If you've read our piece on why local LLMs got good in 2026, this is the same story playing out — sparse activation plus better post-training, not raw parameter count, is what's closing the gap.

The benchmarks (and where to be skeptical)

The vendor numbers are strong, and a few are striking enough to flag as vendor-reported until independent runs land:

Ornith-1.0-397B: 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1. DeepReinforce positions this above Claude Opus 4.7; for context, Claude Opus 4.8 sits at 87.6 on SWE-Bench Verified, so the 397B trails only the very top of the closed-model field on that test.
Ornith-1.0-35B MoE: 64.2 on Terminal-Bench 2.1 — above Qwen 3.5-397B's 53.5, a model with more than ten times the total parameter count. If that holds up under independent testing, it's the most interesting result in the release.
Ornith-1.0-9B Dense: 43.1 on Terminal-Bench 2.1, essentially matching Gemma 4-31B's 42.1.

A 35B MoE beating a 397B-parameter model on an agentic benchmark is exactly the kind of claim that needs third-party confirmation, because vendor benchmark suites tend to flatter the home team. Treat these as a reason to try the model, not as settled fact. We've taken the same cautious line on every fresh-drop coding model, from Kimi K2.7 to Qwen3-Coder-Next.

Which GPU runs which variant

This is the part that matters for your wallet. Here's how each variant maps to real hardware.

Ornith-1.0-9B — for 8 GB to 16 GB cards

The 9B dense weights are about 6 GB at Q4 quantization and roughly 19 GB in BF16. At Q4_K_M it runs comfortably on 6–8 GB of VRAM, which means it's the variant for an RTX 3060 12GB, an RTX 4060 Ti 8GB or 16GB, or even an older 8 GB card. With a 16 GB card you can move up to Q6_K or Q8_0 and still leave plenty of room for context.

Being a dense 9B, it's bandwidth-bound — generation speed scales with your card's memory bandwidth, not its FLOPS. On a modern 16 GB card you'll get interactive speeds, but understand the trade-off: this is the weakest member of the family. It's a capable local autocomplete and small-task assistant, not a replacement for a frontier agent.

Ornith-1.0-35B MoE — the 24 GB sweet spot

This is the one to care about. The 35B MoE is a sparse model with 256 routed experts, 8 active per token plus a shared expert, across 40 layers, activating roughly 3B parameters per token. That architecture is the whole point: all 35B weights have to sit in VRAM, but only ~3B are read per token, so it generates far faster than a dense model of the same footprint.

The official GGUF sizes:

Quant	Size	Fits
Q4_K_M	21.2 GB	24 GB card (tight)
Q5_K_M	24.7 GB	32 GB card
Q6_K	28.5 GB	32 GB card
Q8_0	36.9 GB	48 GB+ / multi-GPU

The practical read: Q4_K_M at 21.2 GB fits a single 24 GB GPU, but barely. That leaves under 3 GB for the KV cache and runtime overhead. You'll run it fine at 8K–16K context; the model's full 256K context window is a cloud-serving figure, not something you'll reach on 24 GB. If you want real context headroom, you want a 32 GB RTX 5090 and the Q4_K_M, or you accept short context on 24 GB.
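The headroom arithmetic for each card size, using the GGUF sizes from the table above. The function name is hypothetical, and the result is what is left for KV cache and runtime overhead before anything else on the card takes its share.

```python
# GGUF sizes in GB, taken from the table above.
QUANT_GB = {"Q4_K_M": 21.2, "Q5_K_M": 24.7, "Q6_K": 28.5, "Q8_0": 36.9}


def headroom(card_gb):
    """GB left over on the card for each quant that fits at all."""
    return {q: round(card_gb - size, 1)
            for q, size in QUANT_GB.items() if size < card_gb}


for card in (24, 32):
    print(f"{card} GB card:", headroom(card))
```

On 24 GB only Q4_K_M fits, with 2.8 GB to spare. On 32 GB the same quant leaves 10.8 GB, which is where the extra context headroom comes from.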

What about speed? We don't have independent tokens/sec measurements for Ornith yet, so we won't invent one. But we do have a measured reference point on this exact site: Nemotron-Cascade 2 is a ~3B-active MoE (30B-A3B) that hits 187 tok/s on a used RTX 3090 at a comparable quant. Ornith-1.0-35B has near-identical active-parameter math, so expect it to land in the same neighborhood on the same hardware — fast enough for genuinely interactive agentic coding. We'll update this with real numbers once community benchmarks are out. For more on how 3B-active MoE compares to dense models at the same VRAM, see our Qwen 3.6 35B-A3B guide.

The best-value card for this remains the used RTX 3090: 24 GB, 936 GB/s of bandwidth, and a used average around $1,070 as of June 2026. The RTX 4090 is faster but costs roughly twice as much used for the same 24 GB ceiling.

Ornith-1.0-31B Dense — capable, but awkward

The 31B dense variant sits in an odd spot. As a dense 31B it needs a similar VRAM footprint to the 35B MoE at the same quant (call it ~18–20 GB at Q4), but because it's dense it activates all 31B parameters per token — so it'll be meaningfully slower than the 35B MoE while taking up about the same space. Unless you have a specific reason to prefer a dense model's behavior, the 35B MoE is the better pick on identical hardware. This is the same dense-vs-MoE trade we walked through in the Codestral 2 guide.
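The speed gap follows from memory bandwidth. As a rough ceiling, decode speed is bandwidth divided by the bytes of weights read per token. The sketch below uses the RTX 3090's 936 GB/s and the bits-per-weight implied by the 21.2 GB Q4_K_M file. It ignores KV-cache reads, kernel overhead, and routing cost, so real throughput lands well below these numbers; only the ratio is the point.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on tokens/sec if every active weight is read once per token."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Bits per weight implied by a 21.2 GB file holding 35B parameters (assumption:
# the 31B dense model quantizes to about the same average width).
BITS = 21.2 * 8 / 35

moe = decode_ceiling_tok_s(936, 3, BITS)     # ~3B active parameters per token
dense = decode_ceiling_tok_s(936, 31, BITS)  # all 31B parameters per token
print(f"35B MoE ceiling:   {moe:.0f} tok/s")
print(f"31B dense ceiling: {dense:.0f} tok/s")
print(f"ratio: {moe / dense:.1f}x")
```

Under these assumptions the ceilings come out around 515 and 50 tok/s, a gap of roughly 10x for about the same VRAM footprint.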

Ornith-1.0-397B MoE — rent it, don't buy for it

The flagship is not a home-lab model. Even the Q4 weights are on the order of 225 GB+ (an FP8 checkpoint of a 397B model is larger still), well beyond any single consumer GPU and beyond most multi-GPU home builds. If you want to use the 397B, the sane paths are the

WWDC 2026 Home Lab Verdict: What Apple's Foundation Models, Core AI, and Siri Actually Deliver for Local AI

What Apple actually announced

The three-tier stack

Foundation Models framework: the developer angle

Xcode 27 agents

What this changes for home lab builders (and what it doesn't)

The numbers that actually decide your hardware

WSL 3 GPU Passthrough for Local AI on Windows in 2026: Near-Native Ollama, llama.cpp, and PyTorch

What Microsoft actually announced

NPU passthrough is the real change — and it's not for everyone yet

Does the 3-5% number matter for your hardware?

Setting it up: WSL 2 today, WSL 3 on Insider

1. Install WSL and a distro

2. Install the Windows GPU driver — and ONLY the Windows driver

3. Install the CUDA toolkit (only if you compile)

4. Run Ollama and confirm it's on the GPU

The error you'll actually hit, and the fix

Why Local LLMs Got Good in 2026: Multi-Token Prediction, Speculative Decoding, and the MoE Efficiency Leap

The thing that actually changed

The bottleneck: why decoding is slow in the first place

Technique 1: Sparse MoE — read fewer weights per token

Technique 2: Speculative decoding — guess ahead, verify in parallel

Technique 3: Multi-token prediction — bake the drafting into the model

RTX PRO 6000 Blackwell for Local AI in 2026: 96GB GDDR7, the 120B+ MoE Threshold, and Whether a Workstation Card Makes Sense for Home Labs

What you're actually buying with 96GB

The benchmarks that justify it (and the ones that don't)

Price reality in June 2026

When the PRO 6000 actually wins over multi-GPU

Where it sits against the 5090 and the H100

The honest

Open WebUI Can't Connect to Ollama? Every Fix for the Server Connection Error (2026)

The two error messages, and what each one means

Step 0: Is Ollama even running?

Cause 1: Docker's localhost is not your localhost (the #1 cause)

Fix 1a: Use host.docker.internal (recommended)

Fix 1b: Use --network=host

Cause 2: Ollama is bound to 127.0.0.1 and rejecting the container

Linux (systemd)

macOS (desktop app)

Open-Source LLM Shootout 2026: Qwen3.6 vs Gemma 4 vs Llama 4 vs GLM-5.1 vs DeepSeek V4 — Which Fits Your GPU?

The five families, and what they actually are

Which family fits which GPU

8 GB (RTX 3060, RTX 4060, RTX 5060)

16 GB (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 5070 Ti)

24 GB (RTX 3090, RTX 4090)

Beyond 24 GB (multi-GPU / server)

Speed vs. quality: the real trade-off at 24 GB

Licenses: the column most comparisons skip

The gotcha that wastes an afternoon

Ollama v0.30 on Apple Silicon: What the Stable MLX Release Actually Changed From the Preview

What the v0.30 line actually shipped

Upgrade and verify it's really MLX

Gemma 4 QAT: the upgrade that changes which Mac is enough

Speculative decoding and cache reuse: where v0.30 feels faster

Real numbers, and the ceiling that didn't move

Ollama Not Using GPU? Fix CPU-Only Inference on Windows, WSL2, and Linux (2026)

Step 1: Confirm the problem (don't guess)

Cause 1: Drivers missing, outdated, or installed after Ollama

Cause 2: WSL2 passthrough (the Windows + Linux gotcha)

Cause 3: The model is bigger than your VRAM

Cause 4: Docker without GPU access

Cause 5: Your GPU is too old (compute capability)

Ollama Keeps Reloading the Model? Fix VRAM Unloading, Cold Starts, and Model Swapping (2026)

What's actually happening

Fix 1: Set OLLAMA_KEEP_ALIVE on the service (the real fix)

Fix 2: Override per request with keep_alive

Fix 3: Preload before the user shows up

Fix 4: Stop two models from thrashing

vLLM Won't Start? Every Fix for the Engine Init, CUDA, and OOM Errors (2026)

Step 0: Read the actual error, not the wall of logs

The #1 startup error: "No available memory for the cache blocks"

Why it happens

The fix, in the order to try it

Don't trust nvidia-smi here

"The NVIDIA driver on your system is too old"

Why it happens

The fixes

How to Run a 70B Model on a Single 24GB GPU in 2026 (and When You Shouldn't)

The core problem: the model is twice the size of your card
