This article was originally published on runaihome.com
TL;DR: A 70B model at Q4_K_M needs ~43–45 GB of VRAM, and a single 24 GB card holds barely half of it. You can run one with partial offload, but you'll get 6–12 tok/s instead of 30+, and the honest move for most people is a 32B-class model that fits fully. This guide shows the exact -ngl math, the KV-cache trick that buys you 4–6 extra layers, and the speed you'll actually see.
What you'll be able to do after this:
- Calculate the correct offload layer count for your card and context length, instead of guessing
- Cut KV-cache VRAM roughly in half with one flag, freeing room for more GPU layers
- Decide — with numbers, not vibes — whether a 70B is worth the speed hit on your hardware
Honest take: On one 24 GB GPU, a 70B at Q4 runs but crawls (single-digit to low-teens tok/s). If you need 70B-class quality, rent a 48 GB card by the hour; if you need speed on hardware you own, run a 32B-class model that fits in 24 GB.
The core problem: the model is twice the size of your card
Here's the wall everyone hits. A modern dense 70B like Llama 3.3 70B at Q4_K_M is about 42.5 GB of weights (bartowski's widely-used GGUF), and once you add the KV cache, compute buffers, and CUDA overhead, you need ~43–45 GB of VRAM to run it entirely on the GPU. A single RTX 3090 or RTX 4090 gives you 24 GB. That's roughly half of what a fully-resident 70B Q4 wants.
The full quant ladder for a 70B, so you can see where the cliffs are:
| Quant | Approx. file size | Fits fully in 24 GB? | Notes |
|---|---|---|---|
| F16 (unquantized) | ~141 GB | No | Datacenter only |
| Q8_0 | ~75 GB | No | Needs 2× 48 GB |
| Q5_K_M | ~50 GB | No | Single 48–64 GB card |
| Q4_K_M | ~42.5 GB | No | The "good quality" baseline; needs ~48 GB to be comfortable |
| Q3_K_M | ~34 GB | No | Noticeable quality drop |
| Q2_K | ~26 GB | Almost — spills ~2 GB | Quality degrades visibly |
| IQ1_M | ~16.75 GB | Yes | Barely coherent; not worth it |
File sizes from bartowski's Llama-3.3-70B-Instruct-GGUF repo; VRAM-in-use runs a few GB above file size because of context and buffers.
So the only quant that fits fully on a 24 GB card is something below Q2_K — and at that point you've quantized away most of the reason you wanted a 70B. The realistic play is partial offload: put as many layers as fit on the GPU, run the rest on the CPU.
How partial offload actually works
A 70B GGUF has 80 transformer layers plus an output layer. With llama.cpp (and anything built on it — Ollama, LM Studio, koboldcpp), the -ngl N flag (--n-gpu-layers) decides how many of those layers live in VRAM. The rest stay in system RAM and run on the CPU.
At Q4_K_M, each of those 80 layers weighs roughly 475 MB. During every forward pass, execution bounces between GPU and CPU: the GPU-resident layers run at full speed, and the CPU layers run at whatever your system RAM bandwidth allows — which is an order of magnitude slower. That handoff is why partial offload is functional but never fast.
The math for a 24 GB card:
- 24 GB total minus ~2 GB for CUDA context, the OS, and compute buffers = ~22 GB usable
- Reserve 2–4 GB for the KV cache (more if you want long context)
- That leaves ~18 GB for weights → ~18,000 MB ÷ 475 MB ≈ 38 layers
In practice people land at 40–45 layers on a 24 GB card by trimming context length and quantizing the KV cache. That puts roughly half the model on the GPU. Community reports put a single RTX 3090 running Llama 70B Q4_K_M at about 8 tok/s with basic settings — slow, but usable for non-interactive work.
The relationship is roughly linear: with no GPU at all you're at 2–4 tok/s (pure CPU), partial offload gets you into the 8–15 tok/s range, and a card that holds the whole thing does 30+ tok/s. If you offload too few layers you barely beat CPU; the win comes from getting as many layers onto the GPU as your VRAM physically allows.
Step 1: Find your real layer ceiling
Don't guess -ngl. Start high and let the loader tell you the truth.
With llama.cpp directly:
./llama-cli -m Llama-3.3-70B-Instruct-Q4_K_M.gguf \
-ngl 99 -c 4096 -fa \
-p "Explain partial GPU offload in one paragraph."
-ngl 99 means "offload everything you can." On a 24 GB card it won't fit, and you'll get a CUDA out-of-memory error — that's expected. Watch the log line that reports offloaded XX/81 layers to GPU before it dies, then back the number down. If you OOM at 81, try 45, then nudge up until it loads with a few hundred MB to spare. (If you keep hitting OOM and don't understand why, our CUDA out-of-memory fix guide walks every cause.)
With Ollama, the equivalent is a Modelfile parameter or an environment-driven setting:
# In a Modelfile:
FROM llama3.3:70b
PARAMETER num_gpu 45
PARAMETER num_ctx 4096
num_gpu is Ollama's name for -ngl. Set it too high and Ollama will silently spill into shared memory on Windows (which tanks speed) or OOM on Linux. Set it just below your ceiling.
One gotcha worth knowing: if your model keeps unloading and reloading between prompts, that's a separate keep-alive issue, not an offload problem — see why Ollama keeps reloading the model.
Step 2: Quantize the KV cache to buy back layers
This is the single highest-leverage trick on a memory-starved card. The KV cache stores attention keys and values for every token in context, and at the default F16 precision it eats VRAM fast — several GB at long context. Running it at q8_0 cuts that roughly in half with minimal quality impact, which frees room for 4–6 more transformer layers on the GPU.
In Ollama, KV-cache quantization requires Flash Attention, so you set both:
# Set these where the Ollama *service* reads them (systemd, launchctl, or
# the Windows user env), not just your shell:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
In raw llama.cpp:
./llama-cli -m model.gguf -ngl 45 -fa \
-ctk q8_0 -ctv q8_0 -c 8192
-fa enables Flash Attention; -ctk/-ctv set the K and V cache types. Two warnings the docs bury:
-
Not every architecture supports Flash Attention. If you force
q8_0on an unsupported model, Ollama silently falls back to F16 — so you budget for half the cache and then OOM unexpectedly. Verify withollama psthat the model loaded at the size you expected. - q4_0 cache cuts VRAM further but the quality loss becomes noticeable, especially on long-context reasoning. Stick with q8_0 unless you're desperate for those last few hundred MB.
With q8_0 cache and Flash Attention on, a 24 GB card that topped out at 40 layers can often reach 44–46, and every extra layer on the GPU is a measurable speed gain.
Step 3: Keep context realistic
Context length is the other VRAM lever, and it's the one people forget. The KV cache scales linearly with context: doubling -c from 4096 to 8192 roughly doubles cache VRAM, which costs you GPU layers. On a 24 GB card running a 70B, 4096–8192 tokens is the sweet spot. If you genuinely need 32K+ context, you're back to needing more VRAM — there's no free lunch.
Also make sure your system RAM can hold the half of the model that lives on the CPU. A Q4 70B needs ~42 GB total, so with ~22 GB on the GPU you need at least ~24 GB of free system RAM for the rest plus headroom — meaning 48 GB of RAM minimum, 64 GB comfortable. If you're short, that's a system RAM sizing problem, and the symptom is the Linux OOM killer terminating your runner mid-load.
When you shouldn't do this at all
Here's the part most "how to run 70B" posts skip. Partial offload works, but at 8–12 tok/s a 70B is painful for anything interactive
Top comments (0)