Divyesh

Posted on Jun 9

I Tested Nex-N2-Pro — A Free Open-Source Model That's Matching GPT-5.5 on Coding Benchmarks

#ai #llm #programming #opensource

TL;DR: Nex-N2-Pro by Nex AGI is a free, open-source MoE model (397B params, 17B active) built on Qwen3.5. It scores 75.3 on Terminal-Bench 2.1 — top-3 globally, open or closed — and introduces "Adaptive Thinking" that dynamically scales its reasoning depth per task. You can run it right now on OpenRouter for free.

The Problem With Most Reasoning Models Today

Most reasoning models today share the same flaw: they think at roughly the same depth regardless of what you ask.

Ask one to add two numbers? Full chain-of-thought. 2,000 tokens of internal monologue. For 1 + 1.

Ask it to architect a distributed system? Same 2,000 tokens. Not enough.

It's like hiring someone who takes a three-hour meeting to answer "yes" and a three-hour meeting to redesign your database. The fixed-depth reasoning budget is wasteful on simple tasks and dangerously shallow on complex ones.

Nex-N2-Pro just shipped a real answer to this problem.

What Is Nex-N2-Pro?

Nex-N2-Pro is an open-source agentic reasoning model by Nex AGI, released June 2, 2026. It's post-trained on Qwen3.5-397B-A17B and built specifically for:

Agentic coding loops (write → run → debug → iterate)
Long-horizon software engineering tasks
Deep research across large document sets
Tool use and function calling in agent pipelines

It comes in two variants:

Variant	Base	Total Params	Active Params	Best for
Nex-N2-Pro	Qwen3.5-397B	397B	17B	Full agentic workloads
Nex-N2-mini	Qwen3.5-35B	35B	3B	Low-latency, edge deployment

The Architecture: Why 397B Parameters Cost Less Than You Think

This is a Mixture-of-Experts (MoE) model. That means despite 397B total parameters, only about 17B are active per forward pass. In practice, the per-token compute cost is closer to a 17B dense model than a 397B one. Memory is a different story: all 397B parameters still have to be loaded to serve it.
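To make that arithmetic concrete, here's a rough back-of-the-envelope sketch. The parameter counts come from the table above. The 2-bytes-per-parameter figure (bf16 weights) is my assumption for illustration, not a published serving requirement.

```python
# Back-of-the-envelope MoE numbers for Nex-N2-Pro.
# Parameter counts come from the post; BYTES_PER_PARAM is an assumption (bf16).
TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9
BYTES_PER_PARAM = 2  # assumed bf16 weights; quantized builds would be smaller


def active_fraction(total: float, active: float) -> float:
    """Share of parameters used on each forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the weights, in GB (1e9 bytes)."""
    return params * bytes_per_param / 1e9


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {frac:.1%}")
print(f"Weights in memory: {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM):.0f} GB")
```

Under those assumptions, roughly 4% of the weights do work on any given token, while the full set still occupies close to 800 GB before any quantization.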

Other key specs:

Context window: 262,144 tokens (256K), enough to load many codebases whole, with no chunking needed
Max output: 256K tokens
Native capabilities: Vision, Reasoning, Tool Calling, Structured Outputs, Function Calling
Recommended serving: Nex AGI's customized sglang fork (nexagi/sglang:v0.5.12)
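Whether your own repo fits in that window is easy to estimate. The sketch below is a rough check only: the 4-characters-per-token ratio is a common rule of thumb and not the model's real tokenizer, and the `reserve` value and file suffixes are placeholders I picked for illustration.

```python
# Rough check of whether a codebase fits in the context window.
# CHARS_PER_TOKEN is an assumed heuristic, not the model's real tokenizer.
from pathlib import Path

CONTEXT_WINDOW = 262_144  # tokens, from the spec list (2**18)
CHARS_PER_TOKEN = 4       # assumption: rule of thumb for English text and code


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate token count from character count (ceiling division)."""
    return -(-len(text) // chars_per_token)


def repo_fits(root: str, suffixes=(".py", ".md"), reserve: int = 16_384) -> bool:
    """True if the estimated tokens of matching files fit, leaving `reserve` for output."""
    total = sum(
        estimate_tokens(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
    return total <= CONTEXT_WINDOW - reserve
```

Calling `repo_fits(".")` from a project root gives a quick yes/no. For a real answer, count tokens with the model's actual tokenizer.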

The Core Innovation: Agentic Thinking = Adaptive + Coherent

Nex-N2-Pro's "Agentic Thinking" framework has two components that work together:

1. Adaptive Thinking — Efficiency by Design

The model dynamically decides whether to reason deeply and how much, based on task complexity. Simple prompts execute fast. Multi-step agent tasks trigger structured, deep planning.

This isn't just a nice property. It's built into the model during training. The result:

Faster responses on straightforward tasks
More reliable outcomes on hard, multi-step problems
No wasted token budget on tasks that don't need it
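To show the difference between a fixed and an adaptive budget, here's a toy sketch. To be clear, this is not how Nex-N2-Pro decides how much to think: the model learns that behavior, and nothing below comes from Nex AGI. The keyword list, the `step` size, and the `cap` are all invented for illustration.

```python
# Toy illustration of fixed vs. adaptive reasoning budgets.
# This is NOT how Nex-N2-Pro decides; the model learns the behavior in training.
# The complexity signals and budget numbers below are invented for illustration.

FIXED_BUDGET = 2_000  # tokens, matching the example earlier in the post


def complexity_score(prompt: str) -> int:
    """Hypothetical heuristic: count rough signals that a task is multi-step."""
    signals = ("architect", "design", "refactor", "debug", "distributed", "then")
    words = prompt.lower().split()
    keyword_hits = sum(any(s in w for s in signals) for w in words)
    return keyword_hits + len(words) // 50


def adaptive_budget(prompt: str, floor: int = 0, step: int = 4_000, cap: int = 32_000) -> int:
    """Scale the thinking budget with the score, clamped to [floor, cap]."""
    return max(floor, min(cap, complexity_score(prompt) * step))
```

With this heuristic, "What is 1 + 1?" gets no thinking budget at all, while "Architect a distributed system" gets several times the fixed 2,000 tokens.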

2. Coherent Thinking — One Reasoning Paradigm, All Tasks

A single consistent reasoning framework is applied across coding, research, tool use, and multimodal tasks. No context-switching penalty when a task crosses domains (e.g., "research this API, then write code that calls it").

This is what makes it genuinely useful for agentic pipelines — the model doesn't lose context or change behavior when switching between subtasks.

The Full Agentic Loop

Requirement Understanding
       ↓
   Task Planning
       ↓
 Code Implementation
       ↓
Environment Feedback (actual execution output)
       ↓
Evaluation & Debugging
       ↓
Continuous Iteration → back to Task Planning if needed

The key phrase is Environment Feedback: the code is actually executed in a real environment, and the model reads the real output and incorporates it into the next step. It's not simulating execution. The loop runs for real.
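The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Nex AGI's agent harness: `propose_code` is a hypothetical stand-in for the model call, hard-coded to fail once and then succeed so the feedback path is visible.

```python
# Minimal sketch of a write -> run -> debug -> iterate loop.
# `propose_code` stands in for a model call and is hypothetical; a real agent
# would send the task plus the latest feedback to the model instead.
import subprocess
import sys
from typing import Optional, Tuple


def run_python(code: str, timeout: int = 10) -> Tuple[bool, str]:
    """Execute code in a subprocess and return (succeeded, combined output)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr


def propose_code(task: str, feedback: Optional[str]) -> str:
    """Hypothetical model stand-in: the first draft is buggy, the retry is fixed."""
    if feedback is None:
        return "print(undefined_name)"   # deliberately broken first draft
    return "print(sum(range(1, 11)))"    # 'patched' after seeing the error


def agent_loop(task: str, max_iters: int = 3) -> Tuple[str, int]:
    """Iterate until the code runs cleanly or the iteration budget is spent."""
    feedback = None
    for attempt in range(1, max_iters + 1):
        ok, output = run_python(propose_code(task, feedback))
        if ok:
            return output.strip(), attempt
        feedback = output  # real stderr goes back into the next planning step
    raise RuntimeError(f"no passing solution after {max_iters} attempts")
```

The part that matters is `feedback = output`: whatever the interpreter actually printed, traceback included, is what the next attempt gets to see.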

Benchmark Results

Benchmark	What It Measures	Nex-N2-Pro Score	Context
Terminal-Bench 2.1	Real terminal/coding tasks	75.3	Top-3 globally (open + closed)
GDPval	Long-horizon task planning	1585	Competes with GPT-5.5
SWE-Atlas	Real software engineering	Strong	Verified generalization
DeepSWE	Agentic SWE tasks	Strong	End-to-end dev loops

For context: these aren't "write a function" benchmarks. Terminal-Bench 2.1 and SWE-Atlas test actual software engineering workflows — the kind you'd run in a real development loop.

How to Try It Right Now (Free)

Via OpenRouter (Zero Setup)

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nexagi/nex-n2-pro",
    "messages": [
      {"role": "user", "content": "Refactor this function to use async/await and add error handling: [your code here]"}
    ]
  }'
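If you'd rather call it from Python, the same request can be sketched with only the standard library. The URL and model string are copied from the curl command above. The response parsing assumes the OpenAI-style shape (`choices[0].message.content`), which is my assumption and not something this post verifies.

```python
# Python equivalent of the curl call above, standard library only.
# The endpoint and model string are copied from the post; nothing is sent
# until you call send().
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "nexagi/nex-n2-pro"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request, timeout: int = 120) -> str:
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumes the OpenAI-style response shape: choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```

To use it, read the key from the `OPENROUTER_API_KEY` environment variable and call `send(build_request(prompt, key))`.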

Via SiliconFlow (Free Tier — $0/M Tokens)

Available at siliconflow.cn — model string: Nex-AGI/Nex-N2-Pro

Self-Hosted

docker pull nexagi/sglang:v0.5.12
# Then load the model weights from Hugging Face: NexAGI/Nex-N2-Pro

Real-World Performance: What It Actually Feels Like

The benchmark numbers tell one story. Here's the practical one:

What it does well:

Multi-file refactoring tasks where it needs to hold context across the whole repo
Debugging loops — it reads the stack trace, forms a hypothesis, patches, re-runs
Tool-calling workflows where it needs to chain 5–10 API calls in sequence
Research-to-code tasks (read a paper, then implement the algorithm)
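For a sense of what a chained tool-calling workflow involves on your side of the API, here's a small dispatcher sketch. The tool names and their return values are hypothetical, and the call format loosely follows the OpenAI-style convention of a tool name plus a JSON string of arguments. In a real agent, the model would pick each next call after seeing the previous result.

```python
# Sketch of dispatching a chain of tool calls in sequence.
# The tools and the call format here are hypothetical; the shape loosely
# follows the OpenAI-style {"name": ..., "arguments": "<json string>"} form.
import json
from typing import Any, Callable, Dict, List

TOOLS: Dict[str, Callable[..., Any]] = {
    "get_user": lambda user_id: {"id": user_id, "plan": "pro"},
    "get_limit": lambda plan: {"pro": 1000, "free": 100}[plan],
}


def dispatch(call: Dict[str, str]) -> Any:
    """Look up the named tool and invoke it with JSON-decoded arguments."""
    fn = TOOLS[call["name"]]
    return fn(**json.loads(call["arguments"]))


def run_chain(calls: List[Dict[str, str]]) -> List[Any]:
    """Run calls in order, collecting each result to feed back to the model."""
    return [dispatch(c) for c in calls]
```

Each result would normally be appended to the conversation as a tool message before the model decides on the next call.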

Where you'll hit limits:

Very low-latency use cases (it's still a large model; the mini variant helps here)
Highly specialized domain knowledge outside its training distribution

Watch: Benchmark Review & Speed Test

A deep-dive benchmark review comparing Nex-N2-Pro against closed-source giants. Worth watching before you decide whether to integrate it into your stack.

The Bigger Picture

For the last two years, the open-source vs. closed-source AI gap looked like this: open models were 6–12 months behind frontier commercial models on coding benchmarks.

Nex-N2-Pro changes that calculus. A 75.3 Terminal-Bench 2.1 score puts it in the same conversation as GPT-5.5 and Claude Opus 4.7 — and it's completely free to use and open-weight.

The trend is clear: the next competitive battleground isn't just raw reasoning ability. It's execution reliability — can the model actually complete a real task end-to-end, in a real environment, without falling apart?

Nex-N2-Pro's Adaptive Thinking and closed agentic loop are a serious answer to that question.

Discussion

For those who've tried it: how does the agentic loop hold up on your actual codebase vs. your current daily driver?

Specifically curious whether the "Environment Feedback" loop meaningfully reduces back-and-forth compared to, say, Claude Code or Cursor. Drop your experience below — benchmark scores are one thing, but real-world dev workflow is another.
