With the release of OpenAI’s GPT-OSS models under a permissive Apache 2.0 license, developers now have access to state-of-the-art open-weight LLMs completely free to use and run locally. Thanks to Ollama’s seamless integration with these models, you can now deploy and customize GPT-OSS on your own machine with minimal setup.
This guide covers everything you need to get started: installation, model setup, Modelfile customization, API usage, and local API testing.
What is GPT-OSS?
GPT-OSS refers to OpenAI’s open-weight models, now available in two sizes:
- gpt-oss-20b
- gpt-oss-120b
Both models are designed for high-performance tasks, including:
- Reasoning and step-by-step explanation
- Agentic workflows with tool use (e.g., function calling, Python execution, web browsing)
- Developer-specific tasks like code generation, documentation, and debugging
These models are accessible through Ollama, a local inference engine optimized for fast LLM deployment without needing additional cloud infrastructure or model conversion.
Key Features of GPT-OSS on Ollama
OpenAI and Ollama have partnered to bring GPT-OSS to local environments with advanced features and hardware-friendly optimizations:
Agentic Capabilities
Supports native function calling, optional web browsing, Python code execution, and structured output generation, enabling complex multi-step workflows on-device.
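As a sketch of what function calling looks like in practice, the snippet below registers a hypothetical get_weather tool through Ollama's OpenAI-compatible endpoint (covered in Step 5). The tool name and schema are illustrative, not built-ins:

```python
# Hedged sketch: function calling through Ollama's OpenAI-compatible API.
# Assumes the Ollama server from Steps 1-2 is running; get_weather is a
# hypothetical tool for illustration, not something the model ships with.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model decides to call the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```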
Full Chain-of-Thought Access
Models expose their internal reasoning paths, allowing developers to debug and audit responses easily.
Adjustable Reasoning Effort
You can configure the "reasoning effort" (low, medium, high) depending on your latency requirements and task complexity.
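For example, OpenAI's guidance for gpt-oss is to state the desired effort in the system message. The exact phrasing below ("Reasoning: high") follows the model card and may evolve, so treat it as an assumption to verify against your version:

```python
# Hedged sketch: requesting higher reasoning effort via the system prompt.
# The "Reasoning: low|medium|high" directive is taken from OpenAI's gpt-oss
# guidance; confirm the phrasing against the model card for your release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```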
Fine-Tuning Support
Both models can be fine-tuned for specific domains or use cases.
Apache 2.0 License
The models are released under a permissive license, allowing for both commercial and experimental use cases without copyleft constraints.
MXFP4 Quantization Format
The models use OpenAI’s MXFP4 quantization, which reduces the memory footprint by compressing mixture-of-experts (MoE) weights to 4.25 bits per parameter. This allows:
- gpt-oss-20b to run on consumer systems with 16GB of memory
- gpt-oss-120b to run on high-end machines with a single 80GB GPU
Ollama natively supports this format using custom-developed kernels optimized for performance and accuracy.
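A quick back-of-envelope calculation shows why these memory figures work out, assuming the quoted 4.25 bits applies to roughly all weights (published parameter counts are about 21B and 117B):

```python
# Back-of-envelope check of the memory claims above. Assumption: 4.25
# bits/parameter covers essentially all weights; real usage adds activation
# and KV-cache overhead on top of the weight footprint.
BITS_PER_PARAM = 4.25

for name, params_billion in [("gpt-oss-20b", 21), ("gpt-oss-120b", 117)]:
    gigabytes = params_billion * 1e9 * BITS_PER_PARAM / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of weights")

# gpt-oss-20b:  ~11.2 GB of weights -> fits a 16GB machine with headroom
# gpt-oss-120b: ~62.2 GB of weights -> fits a single 80GB GPU
```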
System Requirements
To run GPT-OSS efficiently on your machine, your system should meet the following minimum specifications:
| Model | RAM / GPU Requirement | Use Case |
|---|---|---|
| gpt-oss-20b | 16GB RAM (CPU or GPU-accelerated) | Consumer desktops and laptops |
| gpt-oss-120b | 80GB GPU (data center hardware) | Research and high-performance tasks |
You will also need:
- 20–50GB free disk space
- macOS, Linux (recommended), or Windows (WSL2 compatible)
- Stable internet connection for model download
Step 1: Install Ollama
macOS / Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com. Additional setup may be needed, such as enabling WSL2 or Docker Desktop.
Start the Ollama Server
```bash
ollama serve
```
This starts the backend server on http://localhost:11434.
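To verify the server is up, you can hit the documented /api/tags endpoint, which lists installed models. A minimal check using only Python's standard library:

```python
# Quick health check against Ollama's documented /api/tags endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp).get("models", [])

print("Ollama is up; installed models:")
for m in models:
    print(" -", m["name"])
```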
Step 2: Download GPT-OSS Models
Choose the model based on your hardware:
```bash
# For most users
ollama pull gpt-oss:20b

# For data center-grade GPUs
ollama pull gpt-oss:120b
```
Download progress depends on your internet connection. After downloading, confirm installation with:
```bash
ollama list
```
You should see gpt-oss:20b or gpt-oss:120b in the output.
20B parameter model
The gpt-oss-20b model is designed for lower-latency, local, or specialized use cases.
120B parameter model
The gpt-oss-120b model is designed for production, general-purpose, high-reasoning use cases.
NVIDIA and Ollama collaborate to accelerate gpt-oss on GeForce RTX and RTX PRO GPUs
NVIDIA and Ollama are advancing their partnership to boost model performance on NVIDIA GeForce RTX and RTX PRO GPUs. This collaboration lets users on RTX-powered PCs take full advantage of OpenAI’s gpt-oss models with hardware acceleration.
Step 3: Run GPT-OSS Locally
Interactive CLI
```bash
ollama run gpt-oss:20b
```
Type your prompts and get real-time responses.
Single Query Execution
```bash
ollama run gpt-oss:20b "Explain the principle of transformers in AI"
```
Control Response Style
You can adjust generation parameters for accuracy or creativity. The ollama run command does not accept sampling flags directly; set parameters inside an interactive session instead:
```bash
ollama run gpt-oss:20b
>>> /set parameter temperature 0.3
>>> /set parameter top_p 1.0
>>> Summarize the history of cryptography
```
Lower temperature values produce more deterministic outputs. These parameters can also be baked into a Modelfile (Step 4) or passed per request through the API (Step 5).
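For the per-request route, the native API accepts an options field, as documented by Ollama. A minimal sketch using only the standard library:

```python
# Setting sampling parameters per request via the native /api/generate
# endpoint's "options" field.
import json
import urllib.request

payload = {
    "model": "gpt-oss:20b",
    "prompt": "Summarize the history of cryptography",
    "stream": False,  # return one JSON object instead of a token stream
    "options": {"temperature": 0.3, "top_p": 1.0},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```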
Step 4: Customize Behavior with Modelfile
Ollama allows configuration of model behavior via a Modelfile, similar to Dockerfiles.
Sample Modelfile
```text
FROM gpt-oss:20b
SYSTEM "You are a helpful assistant specialized in TypeScript development. Provide well-documented code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
```
Build and Run
```bash
ollama create ts-assistant -f Modelfile
ollama run ts-assistant
```
This is useful for focused workflows, like data analysis, technical writing, or code generation.
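Once created, the custom model is addressable by name like any other tag, including through the OpenAI-compatible endpoint from Step 5:

```python
# Calling the custom ts-assistant model built above through the
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="ts-assistant",
    messages=[{"role": "user", "content": "Write a typed debounce helper."}],
)
print(response.choices[0].message.content)
```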
Step 5: API Integration
Ollama exposes an API compatible with OpenAI’s format, making it easy to plug into existing tools.
Generate Text (Single Prompt)
```bash
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss:20b", "prompt": "What is reinforcement learning?", "stream": false}'
```
Setting "stream": false returns the full completion as a single JSON object.
Chat Completion (OpenAI Style)
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "How do transformers work?"}]
  }'
```
Python Example Using OpenAI SDK
```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server; the api_key is required
# by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What is chain-of-thought reasoning?"}],
)
print(response.choices[0].message.content)
```
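The SDK also supports streaming: pass stream=True and iterate over the chunks as they arrive:

```python
# Streaming variant of the example above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What is chain-of-thought reasoning?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```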
Step 6: Local API Testing
To debug and inspect model responses, especially for streaming outputs, you can use tools like Apidog.
Why Use Apidog?
- Easily test localhost:11434 endpoints
- Preview streamed JSON responses in real time
- Experiment with different parameters
- Save test collections for repeatability
Apidog can simplify development workflows when integrating GPT-OSS with applications.
Step 7: Troubleshooting Common Issues
| Issue | Solution |
|---|---|
| Model not loading | Ensure ollama serve is running and the model appears in ollama list |
| Inference is slow | Use a GPU or reduce the context length (num_ctx) |
| API not responding | Confirm port 11434 is active and not blocked |
| Memory errors (120B model) | Switch to gpt-oss:20b or upgrade to 80GB GPU hardware |
Logs can be found under ~/.ollama/logs if further debugging is needed.
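The first two checks in the table are easy to automate. A small diagnostic sketch (the gpt-oss prefix check is a convenience assumption):

```python
# Checks that the server answers on port 11434 and that a gpt-oss model is
# actually installed, mirroring the first rows of the troubleshooting table.
import json
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
        names = [m["name"] for m in json.load(resp).get("models", [])]
except urllib.error.URLError:
    raise SystemExit("Ollama is not responding -- is `ollama serve` running?")

if not any(n.startswith("gpt-oss") for n in names):
    print("gpt-oss is not installed; run: ollama pull gpt-oss:20b")
else:
    print("Server and model look fine:", names)
```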
Bonus: Optional GUI via Open WebUI
If you prefer a browser-based interface, try pairing Ollama with Open WebUI.
Launch with Docker
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
The --add-host flag lets the container reach the Ollama server running on your host machine.
Visit http://localhost:3000 in your browser to access a full chat UI. Features include:
- Prompt history
- File uploads for RAG workflows
- Custom system instructions
Conclusion
Running GPT-OSS models locally with Ollama provides a powerful and cost-effective alternative to cloud-based LLM solutions. With privacy, flexibility, and compatibility in mind, this setup enables developers to explore and build with open-weight language models on their own infrastructure.
From installation and customization to API integration and debugging, this guide equips you with all the tools needed to create a complete local LLM development workflow. Whether you're experimenting with agents, building AI tools, or enhancing automation, Ollama with GPT-OSS gives you full control.
For advanced local testing, consider adding tools like Apidog to your stack. This approach not only increases productivity but also aligns with best practices for responsible AI deployment.