With the release of OpenAI’s GPT-OSS models under a permissive Apache 2.0 license, developers now have access to state-of-the-art open-weight LLMs completely free to use and run locally. Thanks to Ollama’s seamless integration with these models, you can now deploy and customize GPT-OSS on your own machine with minimal setup.
This guide covers everything you need to get started: installation, model setup, Modelfile customization, API usage, and local API testing.
What is GPT-OSS?
GPT-OSS refers to OpenAI’s open-weight models, now available in two sizes:
- gpt-oss-20b
- gpt-oss-120b
Both models are designed for high-performance tasks, including:
- Reasoning and step-by-step explanation
- Agentic workflows with tool use (e.g., function calling, Python execution, web browsing)
- Developer-specific tasks like code generation, documentation, and debugging
These models are accessible through Ollama, a local inference engine optimized for fast LLM deployment without needing additional cloud infrastructure or model conversion.
Key Features of GPT-OSS on Ollama
OpenAI and Ollama have partnered to bring GPT-OSS to local environments with advanced features and hardware-friendly optimizations:
Agentic Capabilities
Supports native function calling, optional web browsing, Python code execution, and structured output generation, enabling complex multi-step workflows on-device.
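As a sketch of what function calling looks like in practice, the snippet below registers a hypothetical get_weather tool through Ollama's OpenAI-compatible endpoint (covered in Step 5). The tool name and schema are illustrative, not built-ins:

```python
# Hedged sketch: function calling through Ollama's OpenAI-compatible API.
# Assumes the Ollama server from Steps 1-2 is running; get_weather is a
# hypothetical tool for illustration, not something the model ships with.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model decides to call the tool, the call arrives as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```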
Full Chain-of-Thought Access
Models expose their internal reasoning paths, allowing developers to debug and audit responses easily.
Adjustable Reasoning Effort
You can configure the "reasoning effort" (low, medium, high) depending on your latency requirements and task complexity.
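For example, OpenAI's guidance for gpt-oss is to state the desired effort in the system message. The exact phrasing below ("Reasoning: high") follows the model card and may evolve, so treat it as an assumption to verify against your version:

```python
# Hedged sketch: requesting higher reasoning effort via the system prompt.
# The "Reasoning: low|medium|high" directive is taken from OpenAI's gpt-oss
# guidance; confirm the phrasing against the model card for your release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```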
Fine-Tuning Support
Both models can be fine-tuned for specific domains or use cases.
Apache 2.0 License
The models are released under a permissive license, allowing for both commercial and experimental use cases without copyleft constraints.
MXFP4 Quantization Format
The models use OpenAI’s MXFP4 quantization, which reduces the memory footprint by compressing mixture-of-experts (MoE) weights to 4.25 bits per parameter. This allows:
- gpt-oss-20b to run on consumer systems with 16GB of memory
- gpt-oss-120b to run on high-end machines with a single 80GB GPU
Ollama natively supports this format using custom-developed kernels optimized for performance and accuracy.
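A quick back-of-envelope calculation shows why these memory figures work out, assuming the quoted 4.25 bits applies to roughly all weights (published parameter counts are about 21B and 117B):

```python
# Back-of-envelope check of the memory claims above. Assumption: 4.25
# bits/parameter covers essentially all weights; real usage adds activation
# and KV-cache overhead on top of the weight footprint.
BITS_PER_PARAM = 4.25

for name, params_billion in [("gpt-oss-20b", 21), ("gpt-oss-120b", 117)]:
    gigabytes = params_billion * 1e9 * BITS_PER_PARAM / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB of weights")

# gpt-oss-20b:  ~11.2 GB of weights -> fits a 16GB machine with headroom
# gpt-oss-120b: ~62.2 GB of weights -> fits a single 80GB GPU
```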
System Requirements
To run GPT-OSS efficiently on your machine, your system should meet the following minimum specifications:
| Model | RAM / GPU Requirement | Use Case |
|---|---|---|
| gpt-oss-20b | 16GB RAM (CPU or GPU-accelerated) | Consumer desktops and laptops |
| gpt-oss-120b | 80GB GPU (data center hardware) | Research and high-performance tasks |
You will also need:
- 20–50GB free disk space
- macOS, Linux (recommended), or Windows (WSL2 compatible)
- Stable internet connection for model download
Step 1: Install Ollama
macOS / Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com. Additional setup may be needed, such as enabling WSL2 or Docker Desktop.
Start the Ollama Server
```bash
ollama serve
```
This starts the backend server on http://localhost:11434.
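To verify the server is up, you can hit the documented /api/tags endpoint, which lists installed models. A minimal check using only Python's standard library:

```python
# Quick health check against Ollama's documented /api/tags endpoint.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp).get("models", [])

print("Ollama is up; installed models:")
for m in models:
    print(" -", m["name"])
```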
Step 2: Download GPT-OSS Models
Choose the model based on your hardware:
```bash
# For most users
ollama pull gpt-oss:20b

# For data center-grade GPUs
ollama pull gpt-oss:120b
```
Download progress depends on your internet connection. After downloading, confirm installation with:
```bash
ollama list
```
You should see gpt-oss:20b or gpt-oss:120b in the output.
20B parameter model
The gpt-oss-20b model is designed for lower-latency, local, or specialized use cases.
120B parameter model
The gpt-oss-120b model is designed for production, general-purpose, high-reasoning use cases.
NVIDIA and Ollama collaborate to accelerate gpt-oss on GeForce RTX and RTX PRO GPUs
NVIDIA and Ollama are advancing their partnership to boost model performance on NVIDIA GeForce RTX and RTX PRO GPUs. This collaboration lets users on RTX-powered PCs take full advantage of OpenAI’s gpt-oss models with hardware acceleration.
Step 3: Run GPT-OSS Locally
Interactive CLI
```bash
ollama run gpt-oss:20b
```
Type your prompts and get real-time responses.
Single Query Execution
```bash
ollama run gpt-oss:20b "Explain the principle of transformers in AI"
```
Control Response Style
You can adjust generation parameters for accuracy or creativity. The ollama run command does not accept sampling flags directly; set parameters inside an interactive session instead:
```bash
ollama run gpt-oss:20b
>>> /set parameter temperature 0.3
>>> /set parameter top_p 1.0
>>> Summarize the history of cryptography
```
Lower temperature values produce more deterministic outputs. These parameters can also be baked into a Modelfile (Step 4) or passed per request through the API (Step 5).
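For the per-request route, the native API accepts an options field, as documented by Ollama. A minimal sketch using only the standard library:

```python
# Setting sampling parameters per request via the native /api/generate
# endpoint's "options" field.
import json
import urllib.request

payload = {
    "model": "gpt-oss:20b",
    "prompt": "Summarize the history of cryptography",
    "stream": False,  # return one JSON object instead of a token stream
    "options": {"temperature": 0.3, "top_p": 1.0},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```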
Step 4: Customize Behavior with Modelfile
Ollama allows configuration of model behavior via a Modelfile, similar to Dockerfiles.
Sample Modelfile
```text
FROM gpt-oss:20b
SYSTEM "You are a helpful assistant specialized in TypeScript development. Provide well-documented code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
```
Build and Run
```bash
ollama create ts-assistant -f Modelfile
ollama run ts-assistant
```
This is useful for focused workflows, like data analysis, technical writing, or code generation.
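Once created, the custom model is addressable by name like any other tag, including through the OpenAI-compatible endpoint from Step 5:

```python
# Calling the custom ts-assistant model built above through the
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="ts-assistant",
    messages=[{"role": "user", "content": "Write a typed debounce helper."}],
)
print(response.choices[0].message.content)
```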
Step 5: API Integration
Ollama exposes an API compatible with OpenAI’s format, making it easy to plug into existing tools.
Generate Text (Single Prompt)
```bash
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss:20b", "prompt": "What is reinforcement learning?", "stream": false}'
```
Setting "stream": false returns the full completion as a single JSON object.
Chat Completion (OpenAI Style)
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "How do transformers work?"}]
  }'
```
Python Example Using OpenAI SDK
```python
from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server; the api_key is required
# by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What is chain-of-thought reasoning?"}],
)
print(response.choices[0].message.content)
```
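The SDK also supports streaming: pass stream=True and iterate over the chunks as they arrive:

```python
# Streaming variant of the example above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What is chain-of-thought reasoning?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```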
Step 6: Local API Testing
To debug and inspect model responses, especially for streaming outputs, you can use tools like Apidog.
Why Use Apidog?
- Easily test localhost:11434 endpoints
- Preview streamed JSON responses in real time
- Experiment with different parameters
- Save test collections for repeatability
Apidog can simplify development workflows when integrating GPT-OSS with applications.
Step 7: Troubleshooting Common Issues
| Issue | Solution |
|---|---|
| Model not loading | Ensure ollama serve is running and the model appears in ollama list |
| Inference is slow | Use a GPU or reduce the context length (num_ctx) |
| API not responding | Confirm port 11434 is active and not blocked |
| Memory errors (120B model) | Switch to gpt-oss:20b or upgrade to 80GB GPU hardware |
Logs can be found under ~/.ollama/logs if further debugging is needed.
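The first two checks in the table are easy to automate. A small diagnostic sketch (the gpt-oss prefix check is a convenience assumption):

```python
# Checks that the server answers on port 11434 and that a gpt-oss model is
# actually installed, mirroring the first rows of the troubleshooting table.
import json
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
        names = [m["name"] for m in json.load(resp).get("models", [])]
except urllib.error.URLError:
    raise SystemExit("Ollama is not responding -- is `ollama serve` running?")

if not any(n.startswith("gpt-oss") for n in names):
    print("gpt-oss is not installed; run: ollama pull gpt-oss:20b")
else:
    print("Server and model look fine:", names)
```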
Bonus: Optional GUI via Open WebUI
If you prefer a browser-based interface, try pairing Ollama with Open WebUI.
Launch with Docker
```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
The --add-host flag lets the container reach the Ollama server running on your host machine.
Visit http://localhost:3000 in your browser to access a full chat UI. Features include:
- Prompt history
- File uploads for RAG workflows
- Custom system instructions
Conclusion
Running GPT-OSS models locally with Ollama provides a powerful and cost-effective alternative to cloud-based LLM solutions. With privacy, flexibility, and compatibility in mind, this setup enables developers to explore and build with open-weight language models on their own infrastructure.
From installation and customization to API integration and debugging, this guide equips you with all the tools needed to create a complete local LLM development workflow. Whether you're experimenting with agents, building AI tools, or enhancing automation, Ollama with GPT-OSS gives you full control.
For advanced local testing, consider adding tools like Apidog to your stack. This approach not only increases productivity but also aligns with best practices for responsible AI deployment.