DEV Community

Divyesh

I Tested Nex-N2-Pro — A Free Open-Source Model That's Matching GPT-5.5 on Coding Benchmarks

TL;DR: Nex-N2-Pro by Nex AGI is a free, open-source MoE model (397B params, 17B active) built on Qwen3.5. It scores 75.3 on Terminal-Bench 2.1 — top-3 globally, open or closed — and introduces "Adaptive Thinking" that dynamically scales its reasoning depth per task. You can run it right now on OpenRouter for free.


The Problem With Most Reasoning Models Today

Most reasoning models share the same flaw: they think at roughly the same depth regardless of what you ask.

Ask one to add two numbers? Full chain-of-thought. 2,000 tokens of internal monologue. For 1 + 1.

Ask it to architect a distributed system? Same 2,000 tokens. Not enough.

It's like hiring someone who takes a three-hour meeting to answer "yes" and a three-hour meeting to redesign your database. The fixed-depth reasoning budget is wasteful on simple tasks and dangerously shallow on complex ones.

Nex-N2-Pro just shipped a real answer to this problem.


What Is Nex-N2-Pro?

Nex-N2-Pro is an open-source agentic reasoning model by Nex AGI, released June 2, 2026. It's post-trained on Qwen3.5-397B-A17B and built specifically for:

  • Agentic coding loops (write → run → debug → iterate)
  • Long-horizon software engineering tasks
  • Deep research across large document sets
  • Tool use and function calling in agent pipelines

It comes in two variants:

| Variant | Base | Total Params | Active Params | Best for |
| --- | --- | --- | --- | --- |
| Nex-N2-Pro | Qwen3.5-397B | 397B | 17B | Full agentic workloads |
| Nex-N2-mini | Qwen3.5-35B | 35B | 3B | Low-latency, edge deployment |

The Architecture: Why 397B Parameters Costs Less Than You Think

This is a Mixture-of-Experts (MoE) model: despite 397B total parameters, only 17B are active per forward pass. In practice, per-token compute is closer to that of a 17B dense model than a 397B one. Memory is a different story, because all 397B weights still have to be loaded to serve the model.
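To make the MoE trade-off concrete, here is a minimal back-of-the-envelope sketch using only the two parameter counts quoted above. The bytes-per-parameter values (2 for 16-bit weights, 1 for 8-bit) are my assumptions for illustration, not published deployment figures, and the estimate covers weights only (no KV cache or activations).

```python
# Rough MoE arithmetic based on the parameter counts quoted in the post.
TOTAL_PARAMS = 397e9   # total parameters
ACTIVE_PARAMS = 17e9   # parameters active per forward pass


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that participate in a single forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction per token: {frac:.1%}")
    # Assumed precisions: 2 bytes (16-bit) and 1 byte (8-bit) per parameter.
    for label, width in [("16-bit", 2), ("8-bit", 1)]:
        gb = weight_memory_gb(TOTAL_PARAMS, width)
        print(f"{label} weights: ~{gb:.0f} GB")
```

The takeaway: compute per token scales with the small active slice, while the hardware footprint scales with the full parameter count.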

Other key specs:

  • Context window: 262,144 tokens (256K, i.e. 256 × 1,024), enough to load many codebases without chunking
  • Max output: 256K tokens
  • Native capabilities: Vision, Reasoning, Tool Calling, Structured Outputs, Function Calling
  • Recommended serving: Nex AGI's customized sglang fork (nexagi/sglang:v0.5.12)
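Whether a given codebase really fits in the context window is easy to sanity-check before sending anything. The sketch below is a rough estimate, not a tokenizer: the 4-characters-per-token ratio, the file extensions, and the 8,192-token output reserve are all assumptions I picked for illustration. Only the 262,144-token window comes from the spec list above.

```python
import os

CONTEXT_WINDOW = 262_144  # tokens, from the spec list above
CHARS_PER_TOKEN = 4       # assumption: rough average, varies by tokenizer and language


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Crude token estimate: character count divided by an assumed ratio, rounded up."""
    return -(-len(text) // chars_per_token)


def repo_token_estimate(root: str, extensions=(".py", ".js", ".ts", ".md")) -> int:
    """Sum the estimated tokens of all matching files under root."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    total += estimate_tokens(fh.read())
    return total


def fits_in_context(prompt_tokens: int,
                    reserved_for_output: int = 8_192,
                    window: int = CONTEXT_WINDOW) -> bool:
    """True if the prompt plus an assumed output reserve fits in the window."""
    return prompt_tokens + reserved_for_output <= window
```

Calling `fits_in_context(repo_token_estimate("."))` from a repo root gives a quick yes/no. For a precise count you would need the model's actual tokenizer.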

The Core Innovation: Agentic Thinking = Adaptive + Coherent

Nex-N2-Pro's "Agentic Thinking" framework has two components that work together:

1. Adaptive Thinking — Efficiency by Design

The model dynamically decides whether to reason deeply and how much, based on task complexity. Simple prompts execute fast. Multi-step agent tasks trigger structured, deep planning.

This isn't just a nice property: the behavior is instilled during training. The result:

  • Faster responses on straightforward tasks
  • More reliable outcomes on hard, multi-step problems
  • No wasted token budget on tasks that don't need it

2. Coherent Thinking — One Reasoning Paradigm, All Tasks

A single consistent reasoning framework is applied across coding, research, tool use, and multimodal tasks. No context-switching penalty when a task crosses domains (e.g., "research this API, then write code that calls it").

This is what makes it genuinely useful for agentic pipelines — the model doesn't lose context or change behavior when switching between subtasks.

The Full Agentic Loop

Requirement Understanding
       ↓
   Task Planning
       ↓
 Code Implementation
       ↓
Environment Feedback (actual execution output)
       ↓
Evaluation & Debugging
       ↓
Continuous Iteration → back to Task Planning if needed

The key phrase is Environment Feedback: the generated code is actually executed in the agent's environment, and the model reads the real output and incorporates it into the next step. Execution isn't simulated.
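The loop above can be sketched in a few lines of Python. This is a generic illustration of the write → run → debug → iterate pattern, not Nex AGI's actual harness: `generate` is a hypothetical callable standing in for a model call, and `run_python` is a helper I made up that executes the candidate code in a fresh interpreter.

```python
import subprocess
import sys
from typing import Callable, Optional, Tuple


def run_python(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Execute code in a fresh interpreter; return (succeeded, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stdout + proc.stderr


def agentic_loop(task: str,
                 generate: Callable[[str, Optional[str]], str],
                 max_iters: int = 5) -> Optional[str]:
    """Ask for code, run it, and feed real output back until it succeeds."""
    feedback: Optional[str] = None
    for _ in range(max_iters):
        code = generate(task, feedback)   # planning + implementation (model call)
        ok, output = run_python(code)     # environment feedback: real execution
        if ok:                            # evaluation: exit code 0 counts as success
            return code
        feedback = output                 # debugging input for the next iteration
    return None                           # gave up after max_iters attempts
```

A real harness would sandbox execution and use a richer success check (tests passing, for example) than an exit code, but the shape of the loop is the same.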


Benchmark Results

| Benchmark | What It Measures | Nex-N2-Pro Score | Context |
| --- | --- | --- | --- |
| Terminal-Bench 2.1 | Real terminal/coding tasks | 75.3 | Top-3 globally (open + closed) |
| GDPval | Long-horizon task planning | 1585 | Competes with GPT-5.5 |
| SWE-Atlas | Real software engineering | Strong | Verified generalization |
| DeepSWE | Agentic SWE tasks | Strong | End-to-end dev loops |

For context: these aren't "write a function" benchmarks. Terminal-Bench 2.1 and SWE-Atlas test actual software engineering workflows — the kind you'd run in a real development loop.


How to Try It Right Now (Free)

Via OpenRouter (Zero Setup)

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nexagi/nex-n2-pro",
    "messages": [
      {"role": "user", "content": "Refactor this function to use async/await and add error handling: [your code here]"}
    ]
  }'
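If you'd rather call it from a script, here is a rough Python equivalent of the curl command using only the standard library. The endpoint, headers, and model string are copied from the curl example. The response parsing assumes the usual OpenAI-compatible shape (`choices[0].message.content`), so treat that line as an assumption and check it against the actual response.

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nexagi/nex-n2-pro"  # model string as given in the curl example


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request the curl command sends."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the reply text (assumes OpenAI-style response)."""
    req = build_request(prompt, os.environ["OPENROUTER_API_KEY"])
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Usage is `print(ask("Refactor this function to use async/await: ..."))` with `OPENROUTER_API_KEY` set in your environment.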

Via SiliconFlow (Free Tier — $0/M Tokens)

Available at siliconflow.cn — model string: Nex-AGI/Nex-N2-Pro

Self-Hosted

docker pull nexagi/sglang:v0.5.12
# Then load the model weights from HuggingFace: NexAGI/Nex-N2-Pro

Real-World Performance: What It Actually Feels Like

The benchmark numbers tell one story. Here's the practical one:

What it does well:

  • Multi-file refactoring tasks where it needs to hold context across the whole repo
  • Debugging loops — it reads the stack trace, forms a hypothesis, patches, re-runs
  • Tool-calling workflows where it needs to chain 5–10 API calls in sequence
  • Research-to-code tasks (read a paper, then implement the algorithm)

Where you'll hit limits:

  • Very low-latency use cases (it's still a large model; the mini variant helps here)
  • Highly specialized domain knowledge outside its training distribution
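For the tool-calling workflows mentioned above, most of the client-side work is dispatching the model's requested calls to real functions and sending the results back. Here is a minimal sketch, assuming the OpenAI-compatible tool-call format (`id`, `function.name`, and `function.arguments` as a JSON string). `get_weather` is a hypothetical tool invented purely for this example.

```python
import json
from typing import Any, Callable, Dict, List


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this illustration."""
    return f"Sunny in {city}"


# Registry mapping tool names (as the model sees them) to local functions.
TOOLS: Dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch_tool_calls(tool_calls: List[dict]) -> List[dict]:
    """Run each requested tool and return 'tool' role messages for the next turn."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string in the assumed format.
        args = json.loads(call["function"]["arguments"] or "{}")
        try:
            content = str(TOOLS[name](**args))
        except Exception as exc:
            # Report failures back to the model instead of crashing the loop.
            content = f"error: {exc!r}"
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": content,
        })
    return results
```

Chaining 5–10 calls is then a matter of appending these messages to the conversation and calling the model again until it stops requesting tools.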

Watch: Benchmark Review & Speed Test

A deep-dive benchmark review comparing Nex-N2-Pro against closed-source giants. Worth watching before you decide whether to integrate it into your stack.


The Bigger Picture

For the last two years, the open-source vs. closed-source AI gap looked like this: open models were 6–12 months behind frontier commercial models on coding benchmarks.

Nex-N2-Pro changes that calculus. A 75.3 Terminal-Bench 2.1 score puts it in the same conversation as GPT-5.5 and Claude Opus 4.7 — and it's completely free to use and open-weight.

The trend is clear: the next competitive battleground isn't just raw reasoning ability. It's execution reliability — can the model actually complete a real task end-to-end, in a real environment, without falling apart?

Nex-N2-Pro's Adaptive Thinking and closed agentic loop are a serious answer to that question.


Discussion

For those who've tried it: how does the agentic loop hold up on your actual codebase vs. your current daily driver?

Specifically curious whether the "Environment Feedback" loop meaningfully reduces back-and-forth compared to, say, Claude Code or Cursor. Drop your experience below — benchmark scores are one thing, but real-world dev workflow is another.
