We’ve been buzzing ever since we integrated OpenAI’s GPT‑OSS‑20B and GPT‑OSS‑120B into Forgecode. These are OpenAI’s first open‑weight releases since GPT‑2, and they’re a game‑changer: you can run them on your own hardware, benchmark them head‑to‑head against cloud models, and keep your code fully private. That alone is enough to pique anyone’s curiosity.
OpenAI is finally doing justice to the 'open' in its name!!!
Want to see what GPT‑OSS‑20B and 120B can really do?
Spin them up directly inside your terminal using ForgeCode. 👉 ForgeCode: it’s fast, local, and awesome.
No cloud. No wait. Just pure AI horsepower at your fingertips.
1. Benchmarks That Speak for Themselves
Here’s how GPT‑OSS models stack up against OpenAI’s o3 and o4‑mini on key reasoning and competition math tests:
| Task | GPT‑OSS‑120B | GPT‑OSS‑20B | OpenAI o3 | OpenAI o4‑mini |
| --- | --- | --- | --- | --- |
| MMLU | 90.0 | 85.3 | 93.4 | 93.0 |
| GPQA Diamond | 80.1 | 71.5 | 83.3 | 81.4 |
| Humanity’s Last Exam | 19.0 | 17.3 | 24.9 | 17.7 |
| AIME 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
| AIME 2025 | 97.9 | 98.7 | 98.4 | 99.5 |
We are genuinely impressed by how GPT‑OSS‑120B stacks up against OpenAI’s proprietary o3 and o4‑mini; it nearly matches, and in places exceeds, them on several key reasoning benchmarks. Even the smaller GPT‑OSS‑20B delivers surprisingly strong performance for its compact size.
- On MMLU, GPT‑OSS‑120B scores 90.0 versus o3’s 93.4; GPT‑OSS‑20B follows closely with 85.3.
- GPQA Diamond sees GPT‑OSS‑120B hitting an impressive 80.1, while o3 reaches 83.3.
- Even on the notoriously challenging Humanity’s Last Exam, GPT‑OSS‑120B scores 19.0, a solid result against o3’s 24.9.
- And for competition math like AIME, both GPT‑OSS models deliver near top-tier accuracy, matching or outpacing o3’s results on the 2024 and 2025 problems.
These benchmarks reinforce that the new OpenAI GPT‑OSS models offer real, competitive power in reasoning tasks even while running locally under an open‑weight Apache 2.0 licence.
2. Sub‑Second Responses, Even with Complex Builds
We hit sub‑second response times, even when feeding in multi‑file or multi‑phase prompts. Whether we’re asking it to update configs across directories or run schema migrations, Forgecode backed by GPT‑OSS feels razor‑fast in live terminal sessions.
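To make that concrete, here’s the flavour of multi-phase prompt we mean; the file paths below are invented purely for illustration:

```text
# An illustrative multi-phase prompt inside a Forge session
# (paths are hypothetical; adjust to your own project layout):
> Update the database host in config/dev.yaml and config/prod.yaml,
  then generate a matching schema migration under migrations/.
```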
3. Stunning Accuracy with CLI Commands & Tools
We've noticed high accuracy when issuing CLI instructions or tool-enabled tasks. From generating git commit messages to scaffolding TypeScript interfaces, the model nails it consistently, even in more complex tooling flows.
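For a sense of the prompts involved, here are sketches of the two flows we just mentioned; the endpoint and file names are made up for the example:

```text
# Illustrative prompts for the tooling flows above (names are invented):
> Write a git commit message for the currently staged changes.
> Scaffold a TypeScript interface for the JSON payload returned by GET /users/:id.
```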
4. Some Collaboration Quirks (But We're Tuning Them)
A quirk: occasionally the interaction halts mid-output. For example, we’ve seen it stop at “Here’s Phase 1…” without completing the response. We’ve been refining prompts to improve its multi-step follow‑through, and the results are improving quickly.
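For anyone hitting the same thing, here’s the rough prompt pattern we’ve been using as a workaround; treat it as our own heuristic, not an official fix:

```text
# A workaround prompt pattern for mid-output halts (a heuristic, not a fix):
> Plan all phases first, then implement every phase in one response.
  Do not stop after "Phase 1"; continue until all phases are complete.
```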
5. The Power of Open‑Weight Transparency
Unlike closed models, GPT‑OSS‑20B and 120B run with full transparency. We can benchmark them directly, optimise prompts, and share results openly. That transparency fosters ecosystem momentum, pushing other providers to release powerful open alternatives, which benefits everyone.
6. Choose the Right Model for Every Task
Forgecode gives us model flexibility. For a lightweight edit, we pick GPT‑OSS‑20B. For reasoning over massive codebases, we use 120B. Switching is seamless in the CLI; just run /model, choose, and continue.
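A typical switch looks something like this; the picker output and exact model IDs depend on your provider, so read it as a sketch rather than a literal transcript:

```text
# Inside a live Forge session (a sketch; picker UI and model IDs
# will vary with your provider):
/model
  › gpt-oss-20b    ← quick, lightweight edits
  › gpt-oss-120b   ← heavy reasoning over big codebases
```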
🧠 Why This Matters
- Privacy & Control: No need to send code to the cloud.
- Performance & Speed: Real-time CLI assistance for developers.
- Transparency: Open weights give full insight into behaviour.
- Innovation Spark: Encourages broader open-source model development.
Ready to Try It?
You can already try both models right now in your terminal. Just head to Forgecode, install it, and start using GPT‑OSS‑20B or GPT‑OSS‑120B with your local setup. We’d love to hear what you think; your feedback helps us refine prompts, collaboration flows, and future features.
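If you want the short version, something like this should get you going, assuming the npx launcher shown on forgecode.dev is still current (verify the package name in their docs before running):

```sh
# Assumption: the npx launcher advertised on forgecode.dev;
# check their docs if the package name differs.
npx forgecode@latest
# Once inside the session, run /model and pick gpt-oss-20b or gpt-oss-120b.
```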
✅ Bottom Line
- We’re integrated with OpenAI’s open-weight GPT‑OSS‑20B and 120B models.
- You’ll experience super-fast, accurate CLI-powered code assistance.
- We’re optimising multi-step workflows and embracing detailed transparency.
- This is a major stride toward secure, powerful, and community-driven AI engineering.
Want to try it yourself?
Kick the tyres in your own terminal. Your feedback means everything, so let us know how it performs!
Top comments (6)
I think you have a weird definition of "local" within this context ^^
You meant the app runs on your machine, but the LLM runs in the cloud anyway, and you still have the subscription limitations. I don't know what you mean by "No cloud".
If I am wrong, tell me how to configure forge code to run locally without any internet connection :D
Thanks for reading this, Elpien! OpenAI's new models are open‑weight, meaning you can download the weights and run them yourself. With Forgecode you're using them in your terminal, and of course an API is used under the hood, but this is free!!!
Huge props to OpenAI for releasing GPT‑OSS‑20B and 120B under Apache 2.0. Running these locally with ForgeCode feels like unlocking a superpower. The benchmarks are impressive, but what really stands out is the speed and transparency. No cloud, no latency, just raw AI performance.
On point, Anik... thanks for reading man!!! Try out forgecode.
This is what I was looking for!!! Would love to try this on Forgecode. I have high hopes for this model.
Thanks for the read, Tom!!!!