DEV Community

Jovan Chan
Jovan Chan

Posted on • Originally published at runaihome.com

GPT-OSS 20B for local AI in 2026: 225 tok/s on RTX 4090, the 128k context trap, and which GPU you actually need

This article was originally published on runaihome.com

TL;DR: gpt-oss-20b is OpenAI's first Apache 2.0 model and it fits on any 16 GB GPU — but only if you keep context under 8k. At 128k context, generation collapses to ~9 tok/s regardless of GPU. On an RTX 4090 with context capped at 8k, you get 225 tok/s. The 20B model is the one home-lab builders should pull; the 120B requires an H100.

gpt-oss-20b Gemma 4 12B Qwen3 30B-A3B
Best for Reasoning + tool use, OpenAI quality Fast chat on budget hardware Coding + multilingual on 24 GB
Min VRAM 16 GB (8k ctx) 8 GB 24 GB
RTX 4090 speed 225 tok/s ~400+ tok/s ~130 tok/s
The catch 128k context = 9 tok/s on consumer cards Less agentic than gpt-oss Needs 24 GB; bigger download

Honest take: If you have an RTX 3090 or better and want o3-mini-quality reasoning running locally for zero per-token cost, gpt-oss-20b is the easiest pull right now. Just set --ctx-size 8192 or you will wonder why your brand-new GPU is doing 9 tokens per second.


Why GPT-OSS is different from every model before it

Every major open-weight model family before August 2025 — Llama, Qwen, Mistral, Gemma — came from research labs that never charged for API access to their models. OpenAI did. When they released gpt-oss-120b and gpt-oss-20b on August 5, 2025 under the Apache 2.0 license, it was the first time you could pull an OpenAI-trained model, run it on your own hardware, and never send a request to their servers.

That matters for trust reasons (data stays local), cost reasons (no per-token bill), and latency reasons (no network hop). Whether the quality justifies the hardware cost depends on which GPU you own.


Architecture: 21 billion parameters, 3.6 billion at a time

Both gpt-oss models use a Mixture of Experts (MoE) Transformer. The 20B model has 21 billion total parameters organized into 128 expert sub-networks. For any given token, the router activates exactly 4 of those experts, touching only 3.6 billion parameters per token. The same approach appears in Qwen3-30B-A3B and Nemotron Cascade — but in gpt-oss, it's paired with the reasoning post-training OpenAI uses for its o-series models.

Other architectural details from the model card:

  • Context: 128k tokens native (o200k_harmony tokenizer, same as GPT-4o)
  • Attention: grouped multi-query attention with group size 8
  • Positional encoding: RoPE
  • Quantization at training: MXFP4 post-training on MoE weights, which is why the 20B can run in 16 GB
  • Built-in tools: function calling, web browsing, Python execution — the same tool suite used in OpenAI's API

The 3.6B active parameters explain the speed numbers: the router skips 94% of the weights per token, so memory bandwidth pressure stays low relative to a dense 20B model.


VRAM: what the model actually uses

The gpt-oss-20b model card reports 12.0 GB for model weights, 2.7 GB for compute buffers, and approximately 0.2 GB per 8,192 tokens of KV cache. That adds up to:

Context Total VRAM needed
2k tokens ~15.3 GB
8k tokens ~15.5 GB
32k tokens ~16.5 GB
128k tokens ~21.7 GB

A 16 GB card sits right at the edge for 8k context — workable, not comfortable. A 24 GB card handles up to ~65k context before spilling. The RTX 5090's 32 GB is the first consumer card that can run the full 128k context without offloading, though the speed penalty still exists (more on that below).

The Q4_K_M GGUF for local inference is 13.3 GB on disk and 12.91 GB downloaded. Pull it once with Ollama and you're done.


Benchmark table: 8 GPUs, real numbers

These numbers are llama.cpp token generation benchmarks (tg128, Q4 quantization) from community testing as of August–September 2025. They represent sustained generation speed after the prompt has been processed.

GPU VRAM tok/s (tg128 Q4) Can it run it?
RTX 5090 32 GB 282 Yes — full 128k headroom
RTX 4090 24 GB 225 Yes — comfortable to ~65k ctx
RTX 5070 Ti 16 GB 189 Yes — 8k context recommended
RTX 4080 SUPER 16 GB 186 Yes — 8k context recommended
RTX 3090 24 GB 161 Yes — comfortable to ~65k ctx
RTX 5060 Ti 16GB 16 GB 111 Yes — 8k context recommended
RX 7900 XT 20 GB 101 Yes — ROCm required
RTX 3060 12 GB 30–31 Partial (CPU offload required)

Source: llama.cpp community benchmark thread, Discussion #15396.

The RTX 3060 result comes with an asterisk: 12 GB is below the 15 GB practical minimum, so llama.cpp offloads the excess layers to system RAM over PCIe. The 30 tok/s you get is CPU-bound, not GPU-bound. If you have an RTX 3060, gpt-oss-20b will technically load and run, but you're better served by Gemma 4 12B or Qwen3-8B.


Setup with Ollama

Ollama has a first-party gpt-oss model on its library. Two commands:

ollama pull gpt-oss:20b
ollama run gpt-oss:20b
Enter fullscreen mode Exit fullscreen mode

That downloads the MXFP4-optimized GGUF (~12.9 GB) and starts a chat session. Ollama auto-detects your GPU and loads as many layers as fit in VRAM.

The critical flag: Ollama defaults to 2048 context unless you tell it otherwise. For most sessions that's fine. If you want to use the model's full 128k context window, set it explicitly — but read the next section first.

For a persistent context setting, create a Modelfile:

cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 8192
EOF
ollama create gpt-oss-8k -f Modelfile
ollama run gpt-oss-8k
Enter fullscreen mode Exit fullscreen mode

For 24 GB cards, 32768 context is reasonable without hitting the speed cliff:

PARAMETER num_ctx 32768
Enter fullscreen mode Exit fullscreen mode

The 128k context trap

This is the number one complaint from users who pulled gpt-oss-20b on a 16 GB card and got confused.

What happens: Set context to 128k on an RTX 5060 Ti or RTX 4080 SUPER. Start generating. Speed drops to around 9 tok/s. Task Manager shows VRAM nearly empty.

Why it happens: The KV cache for 128k context (~20+ GB) doesn't fit in 16 GB of VRAM. llama.cpp and Ollama fall back to system RAM for the KV cache, routing every attention lookup through PCIe instead of the GPU's memory bus. The GPU sits idle waiting for data.

Fix: Cap context at 8k on 16 GB cards.

# In Ollama Modelfile:
PARAMETER num_ctx 8192

# Or in llama.cpp directly:
./llama-cli -m gpt-oss-20b.Q4_K_M.gguf --ctx-size 8192 -n 512
Enter fullscreen mode Exit fullscreen mode

At 8k context on an RTX 3060 12 GB (with CPU offload), one community member reported going from 9 tok/s to 43 tok/s by setting this flag. The same fix applies proportionally on 16 GB cards that were seeing similar slowdowns at larger context values.

The table below gives practical context limits by card:

GPU VRAM Safe context limit
12 GB 4k (offload mode)
16 GB 8k
24 GB 32k
32 GB 128k (native)

Setup with llama.cpp

If you want direct control, llama.cpp gives you more flags:

# Download the GGUF from Hugging Face
# (search: openai/gpt-oss-20b GGUF on HuggingFace)

./llama-server \
  -m gpt-oss-20b.Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 999 \
  --port 8080
Enter fullscreen mode Exit fullscreen mode

--n-gpu-layers 999 tells llama.cpp to push as many layers as possible onto the GPU. On a 16 GB card this will load the full model at 8k context. Check the startup logs: if you see offloaded X/33 layers to GPU where X is less than 33, some layers are going to CPU.

For 24 GB cards, bump --ctx-size to 16384 or 32768 and you'll still get full GPU utilization.


gpt-oss-120b: not for home labs

The 120B model has 117 billion total parameters and activates 5.1 billion per token — the same MoE trick, larger pool. OpenAI says it fits on a single 80 GB GPU (H100 or MI300X) in MXFP4 form.

At Q4_K_M quantization, the weight file alone is 72.7 GB. W

Top comments (0)