I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn't fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
Yes, the model can run with partial offloading to CPU, and if you have 64GB of system RAM (like I do), you can try it out. However, this setup is nowhere near production-ready performance. For truly demanding use cases, you might need something like the NVIDIA DGX Spark, which is designed specifically for high-capacity AI workloads.
I expected this LLM would benefit significantly from running on a "high-RAM AI device" like the DGX Spark. While the results are good, they're not as dramatically better as you might expect given the price difference between DGX Spark and more affordable options.
TL;DR
Ollama running GPT-OSS 120b performance comparison across three platforms:
| Device | Prompt Eval Performance (tokens/sec) | Generation Performance (tokens/sec) | Notes |
|---|---|---|---|
| NVIDIA DGX Spark | 1159 | 41 | Best overall performance, fully GPU-accelerated |
| Mac Studio | Unknown | 34 → 6 | One test showed degradation as context size increased |
| RTX 4080 | 969 | 12.45 | 78% CPU / 22% GPU split due to VRAM limits |
Model specifications:
- Model: GPT-OSS 120b
- Parameters: 117B (Mixture-of-Experts architecture)
- Active parameters per pass: 5.1B
- Quantization: MXFP4
- Model size: 65GB
This is similar in architecture to other MoE models like Qwen3:30b, but at a much larger scale.
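As a quick sanity check on those numbers (my own back-of-the-envelope arithmetic, not figures from the Ollama library page), the 65GB download is roughly what you'd expect from 117B parameters at around 4.25 bits per weight under MXFP4:

```python
# Rough size estimate for a 117B-parameter model quantized to MXFP4
# (back-of-the-envelope assumptions, not official Ollama figures).
total_params = 117e9      # total parameters (MoE: all experts counted)
active_params = 5.1e9     # parameters touched per forward pass
bits_per_weight = 4.25    # MXFP4: ~4-bit values plus shared block scales (assumed)

model_bytes = total_params * bits_per_weight / 8
print(f"Estimated weight size: {model_bytes / 1e9:.0f} GB")  # ~62 GB, close to the 65 GB download

# Only the active experts run per token, which is why generation can stay
# usable even when most of the weights live in system RAM.
print(f"Active fraction per token: {active_params / total_params:.1%}")  # ~4.4%
```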
GPT-OSS 120b on RTX 4080
I initially thought running Ollama with GPT-OSS 120b on my consumer PC wouldn't be particularly exciting, but the results pleasantly surprised me. Here's what happened when I tested it with this query:
$ ollama run gpt-oss:120b --verbose Compare the weather in state capitals of Australia
Thinking...
We need to compare weather in state capitals of Australia. Provide a comparison, perhaps include
...
*All data accessed September 2024; any updates from the BOM after that date may slightly adjust the
numbers, but the broad patterns remain unchanged.*
total duration: 4m39.942105769s
load duration: 75.843974ms
prompt eval count: 75 token(s)
prompt eval duration: 77.341981ms
prompt eval rate: 969.72 tokens/s
eval count: 3483 token(s)
eval duration: 4m39.788119563s
eval rate: 12.45 tokens/s
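If you'd rather drive the same test from code than from the CLI, here's a minimal sketch using the ollama Python package (assuming a local Ollama server on the default port); the non-streaming response carries the same counters that --verbose prints, with durations reported in nanoseconds:

```python
import ollama  # pip install ollama; assumes a local Ollama server on the default port

resp = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Compare the weather in state capitals of Australia"}],
)

# Durations come back in nanoseconds; convert to tokens per second.
prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
gen_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(resp["message"]["content"][:200])
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate:        {gen_rate:.2f} tokens/s")
```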
Now here's the interesting part—Ollama with this LLM was running mostly on CPU! The model simply doesn't fit in the 16GB VRAM, so Ollama intelligently offloaded most of it to system RAM. You can see this behavior using the ollama ps command:
$ ollama ps
NAME ID SIZE PROCESSOR CONTEXT
gpt-oss:120b a951a23b46a1 65 GB 78%/22% CPU/GPU 4096
Despite running with a 78% CPU / 22% GPU split, the RTX 4080 still delivers respectable performance for a model this size. The prompt evaluation is blazingly fast at 969 tokens/s, and even the generation speed of 12.45 tokens/s is usable for many applications.
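You can read that split programmatically too. Here's a small sketch against Ollama's local REST API (the GET /api/ps endpoint on the default localhost:11434 address) that derives the CPU/GPU percentages from the reported sizes:

```python
import requests  # assumes the Ollama server is reachable at its default address

resp = requests.get("http://localhost:11434/api/ps", timeout=5)
resp.raise_for_status()

for m in resp.json().get("models", []):
    total = m["size"]               # total bytes resident for the model
    on_gpu = m.get("size_vram", 0)  # bytes held in VRAM
    gpu_pct = 100 * on_gpu / total if total else 0
    print(f"{m['name']}: {100 - gpu_pct:.0f}% CPU / {gpu_pct:.0f}% GPU")
```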
This is particularly impressive when you consider that:
- The model is nearly 4x larger than the available VRAM
- Most of the computation happens on CPU (which benefits from my 64GB of system RAM)
- Understanding how Ollama uses CPU cores can help optimize this setup further (a quick sketch follows this list)
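Since most of the work lands on the CPU here, thread count is one of the more useful knobs. A minimal sketch passing the num_thread option per request (16 is purely illustrative; your physical core count is a sensible starting point):

```python
import ollama  # assumes a local Ollama server

# num_thread controls how many CPU threads the runtime uses for this request.
# 16 is only an example value; tune it to the machine's physical cores.
resp = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models in two sentences."}],
    options={"num_thread": 16},
)
print(resp["message"]["content"])
```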
Who would have thought a consumer GPU could handle a 117B parameter model at all, let alone with usable performance? This demonstrates the power of Ollama's intelligent memory management and the importance of having sufficient system RAM. If you're interested in integrating Ollama into your applications, check out this guide on using Ollama with Python.
Note: While this works for experimentation and testing, you'll notice GPT-OSS can have some quirks, particularly with structured output formats.
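For instance, if you request JSON via Ollama's format option, it's worth validating the output rather than trusting it. A small sketch (the prompt and parsing here are just illustrative):

```python
import json
import ollama  # assumes a local Ollama server

# Ask for JSON and verify it actually parses; GPT-OSS occasionally wraps or
# malforms structured output, so defensive parsing is worthwhile.
resp = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Return the state capitals of Australia as a JSON object "
                                          "mapping state name to capital."}],
    format="json",
)

try:
    capitals = json.loads(resp["message"]["content"])
    print(capitals)
except json.JSONDecodeError:
    print("Model returned non-JSON output:", resp["message"]["content"][:200])
```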
References
The DGX Spark and Mac Studio figures come from:
- GPT-OSS 120B on Mac Studio - Slinging Bits YouTube - Detailed video testing GPT-OSS 120b with varying context sizes
- Ollama on NVIDIA DGX Spark: Performance Benchmarks - Official Ollama blog post with comprehensive DGX Spark performance data