Claude Opus 4.6 vs GPT-5.5 vs Gemini 3.1 Pro: Reasoning Benchmarks (3 Real Tasks Tested)
TL;DR — On three reasoning tasks (legal contradiction analysis, multi-step proof, nested-spec planning), Claude Opus 4.6 produced the most rigorous step-by-step output, GPT-5.5 reached correct answers fastest, and Gemini 3.1 Pro delivered roughly 70% of the depth at one-third the price. There is no overall winner — only sweet spots. We tested Opus 4.6 instead of 4.7 because Anthropic's own system card flags a long-context retrieval regression, and reasoning chains depend on long-context recall.
Why this comparison, and why now
Most flagship-model comparisons in 2026 collapse coding, math, multimodal, and agentic benchmarks into a single ranking that nobody actually uses for picking a model. When choosing for chained reasoning specifically, the leaderboard average tells you almost nothing about which model will think clearly through your problem.
This article tests reasoning alone, through three real tasks where reasoning was the entire job: legal contradiction analysis, a chained proof, and nested-spec planning. Each model received identical inputs. Outputs were graded on correctness, depth of justification, and total cost.
For pricing context: all three models are available through ofox.ai's unified API gateway, with one OpenAI-compatible endpoint for switching between them.
Why Opus 4.6 (and the 4.7 disclaimer)
Opus 4.7 is Anthropic's newest flagship at the time of writing, and on most benchmarks it beats 4.6. So why test the older version?
Anthropic's published system card for Opus 4.7 reports 32.2% on MRCR v2 8-needle at 1M context, against Opus 4.6's 78.3%. That represents a real regression on long-context multi-needle retrieval — the exact failure mode that breaks chained reasoning. On Lech Mazur's Extended NYT Connections benchmark (a closed reasoning test Anthropic did not optimize against), Opus 4.6 scores 94.7% versus 41.0% for Opus 4.7 — a 54-point gap. The benchmark author notes Opus 4.7 also refuses over 50% of prompts, and even on the subset it does answer it scores below 4.6.
Community reaction split sharply: r/ClaudeAI threads praise 4.7's agentic-coding performance, while r/LocalLLaMA threads collect regression complaints about non-coding reasoning. Both camps can be right, because they are measuring different capabilities.
For the reasoning tasks below, we selected the version still recommended by users running long, layered prompts. If your workload is agentic coding, consider 4.7 instead. If it looks like the tasks covered here, 4.6 remains the safer default.
Public reasoning benchmarks: the honest summary
Before our own runs, here is what the public benchmarks actually show — with all three models compared on a like-for-like basis where data is available.
| Benchmark | Opus 4.6 (no tools) | GPT-5.5 (no tools) | Gemini 3.1 Pro (thinking) | Source |
|---|---|---|---|---|
| HLE (no tools) | 40.0% | 41.4% | 44.4% | Anthropic, OpenAI, Google model cards |
| GPQA Diamond | 91.3% | 93.6% | 94.3% | LM Council |
| FrontierMath Tier 4 | not reported | 35.4% (base) / 39.6% (Pro) | ~19% | Epoch AI / OpenAI GPT-5.5 system card |
| MRCR v2 (1M ctx, 8-needle) | 78.3% | 74.0% | not reported | Anthropic / OpenAI system cards |
Three quick observations:
- No model dominates. GPT-5.5 wins math-heavy reasoning by a wide margin, Gemini 3.1 Pro leads on PhD-level science questions, and Opus 4.6 wins long-context multi-needle retrieval (which underlies most real-world chained reasoning).
- HLE-no-tools is close. The spread across all three is 4.4 points. Marketing departments will pick whichever benchmark flatters them; ignore anything decided by a few tenths of a point.
- GPQA Diamond is saturating. When all three flagships score above 91%, GPQA stops discriminating. Treat it as a floor, not a ranking.
Three real reasoning tasks
Each task was run twice per model (to control for sampling variance) on the same date with the same prompt. Outputs were graded on a 5-point rubric: correctness, depth of justification, edge-case awareness, format clarity, and total tokens. All runs went through ofox.ai's unified endpoint to keep auth and routing identical.
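For readers who want to replicate the setup, here is a minimal sketch of the harness pattern. The base URL, API-key environment variable, and model identifiers are illustrative placeholders, not documented ofox.ai values; the only load-bearing idea is that one OpenAI-compatible client serves all three models.

```python
import os

from openai import OpenAI

# One client for all three providers; only the model string changes per run.
# Base URL, env var, and model ids below are illustrative assumptions.
client = OpenAI(
    base_url="https://api.ofox.ai/v1",
    api_key=os.environ["OFOX_API_KEY"],
)

MODELS = ["claude-opus-4.6", "gpt-5.5", "gemini-3.1-pro"]

def run_task(prompt: str, runs: int = 2) -> dict[str, list[str]]:
    """Collect `runs` samples per model on one prompt, for manual rubric grading."""
    outputs: dict[str, list[str]] = {m: [] for m in MODELS}
    for model in MODELS:
        for _ in range(runs):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            outputs[model].append(resp.choices[0].message.content)
    return outputs
```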
Task 1: legal contradiction analysis
Prompt — given a 2,800-word excerpt of a fictional jurisdiction's contract law statute, identify three internal contradictions and explain the legal reasoning for each. The contradictions are not surface-level; they require chaining across multiple sections.
Opus 4.6 (extended thinking, high effort). Identified all three contradictions on the first attempt. The reasoning chain explicitly named each section being cited, walked through the implication, and flagged a fourth potential contradiction as "ambiguous, depends on definition of 'reasonable notice' in §11." That fourth note was correct — the human author had intentionally left that section ambiguous. Total: 11,400 output tokens, ~$0.29.
GPT-5.5 (default reasoning effort). Identified three contradictions on the first run. Output was 40% shorter than Opus, with cleaner structure (numbered headings, one paragraph per contradiction). It missed the ambiguous fourth case entirely. On the second run, sampling variance flipped one of the three identifications to a wrong section reference, though the contradiction itself was still real. Total: ~6,800 output tokens, ~$0.20.
Gemini 3.1 Pro (thinking high). Identified two of three contradictions. Missed the third because it failed to chain across §4 and §17 (separated by ~1,400 tokens). Justifications for the two it found were solid. Total: ~7,100 output tokens, ~$0.09.
Winner on this task: Opus 4.6. Got every contradiction, flagged the trick case, justified each step. The cost was 3x Gemini's, but for legal-style reasoning where misses are expensive, the depth matters.
Task 2: chained mathematical proof
Prompt — prove that for any positive integer n ≥ 2, there exist n consecutive composite numbers. (This is a classic but the prompt forbade citing factorial constructions and required a fully explicit proof.)
GPT-5.5. Produced a clean, complete proof in two paragraphs. It used the (n+1)! + k construction, spelled out in full but never named as factorial, which kept it exactly within the constraints. Total: ~1,200 output tokens, ~$0.04.
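For reference, the classic construction behind this result: for each k with 2 ≤ k ≤ n+1, k divides (n+1)!, so k also divides (n+1)! + k; since (n+1)! + k > k ≥ 2, each of those numbers is composite, and (n+1)! + 2 through (n+1)! + (n+1) are n consecutive integers. A compliant proof just spells out the product 1 · 2 · 3 ⋯ (n+1) without ever writing the factorial symbol.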
Opus 4.6. Produced a more verbose three-paragraph proof: it explicitly named the construction, noticed on a second pass that the name violated the constraint and rewrote without it, then offered an alternative proof via the Chinese Remainder Theorem. The CRT proof was correct and elegant but longer than required. Total: ~3,800 output tokens, ~$0.10.
Gemini 3.1 Pro. Produced a correct proof on the first run with the cleanest exposition of the three. On the second run, with no prompt change, it produced an essentially identical proof (low variance — a good sign for production use). Total: ~1,400 output tokens, ~$0.02.
Winner on this task: Gemini 3.1 Pro on cost-quality. GPT-5.5 on raw answer speed. Opus 4.6 produced the deepest output but over-engineered the answer. For closed-form math reasoning, more depth is not better.
Task 3: nested-spec planning
Prompt — given a 1,500-word product specification with five interdependent features (each with constraints that reference other features), produce an implementation plan that respects all constraints, identifies the optimal build order, and flags any contradictions in the spec itself.
Opus 4.6. Produced a build order that respected every constraint and identified two genuine contradictions in the spec (a circular dependency and a constraint that violated a stated requirement). It also flagged a third "potential contradiction" that turned out to be the intended behavior — a false positive, but the kind of false positive that a careful engineer would also raise. Total: ~6,200 output tokens, ~$0.16.
Gemini 3.1 Pro. Produced a build order that respected most constraints but quietly reordered one feature in a way that broke its stated dependency. When pressed in a follow-up turn, it self-corrected and identified the issue. Caught one of the two genuine contradictions in the spec. Total: ~4,800 output tokens, ~$0.06.
GPT-5.5. Produced the most readable build order — closest to something you would copy into a project planner. Caught both genuine contradictions and did not flag the false positive. Did not fully justify why one specific feature needed to be third in the order; the reasoning was implicit. Total: ~3,900 output tokens, ~$0.12.
Winner on this task: GPT-5.5, narrowly. Opus 4.6 was more thorough but raised one false alarm. Gemini broke a constraint silently — the riskiest failure mode for planning work.
Sweet spots, not winners
Across three tasks, the picture matches what frontier-model reviewers keep finding but rarely say out loud: there is no single best reasoning model in May 2026. There are three good ones, each best at a specific shape of problem.
| Task shape | Best fit | Why |
|---|---|---|
| Long, layered reasoning where misses are expensive (legal, compliance, multi-document analysis) | Claude Opus 4.6 | Caught every contradiction, flagged trick cases, justified every step. The MRCR-v2 long-context strength shows up here. |
| Closed-form math, theorem proofs, anything where the right answer is short | Gemini 3.1 Pro | Cleanest output, lowest variance run-to-run, one-third the cost. |
| Implementation planning, structured output, anything that gets handed to humans or pipelines | GPT-5.5 | Most readable structure, fewest false positives, fastest to a usable answer. |
Gemini 3.1 Pro is cheap enough to change the workflow calculus. At $2/$12 per million tokens you can run Gemini as the default reasoner and only escalate the 5-10% of cases where it underperforms. That routing pattern is covered in our hybrid model routing guide.
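A minimal sketch of that escalation loop, assuming the OpenAI-compatible client from the harness above. The PASS/FAIL self-audit heuristic and the model ids are illustrative assumptions, not a documented ofox.ai feature:

```python
def answer_with_escalation(client, prompt: str) -> str:
    """Try the cheap default; escalate to the deeper model only on a failed self-audit."""
    draft = client.chat.completions.create(
        model="gemini-3.1-pro",  # illustrative id: the cheap default reasoner
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Cheap verifier pass: ask the default model to audit its own chain.
    verdict = client.chat.completions.create(
        model="gemini-3.1-pro",
        messages=[{"role": "user", "content":
                   f"Audit the answer below for broken reasoning steps. "
                   f"Reply with exactly PASS or FAIL.\n\n{draft}"}],
    ).choices[0].message.content

    if "PASS" in verdict.upper():
        return draft
    # The 5-10% escalation path: rerun the original prompt on the deeper model.
    return client.chat.completions.create(
        model="claude-opus-4.6",  # illustrative id: the expensive escalation target
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```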
Cost math, in concrete dollars
For a typical day of reasoning workloads — say 50 prompts averaging 8K input + 4K output:
| Model | Input cost/day | Output cost/day | Daily total | Monthly (×30) |
|---|---|---|---|---|
| Claude Opus 4.6 | $2.00 | $5.00 | $7.00 | $210 |
| GPT-5.5 | $2.00 | $6.00 | $8.00 | $240 |
| Gemini 3.1 Pro | $0.80 | $2.40 | $3.20 | $96 |
If your reasoning chains are output-heavy (which most are), the gap widens. The same 50-prompt day at 20K output tokens per call comes to roughly $27/day on Opus, $32/day on GPT-5.5, and $12.80/day on Gemini.
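A sketch of the arithmetic behind those figures, using the per-million-token prices implied by the table above (assumed to be flat list prices, with no caching or batch discounts):

```python
# Per-million-token prices implied by the daily-cost table above:
# (dollars per 1M input tokens, dollars per 1M output tokens).
PRICES = {
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.5":         (5.00, 30.00),
    "gemini-3.1-pro":  (2.00, 12.00),
}

def daily_cost(model: str, calls: int, in_tokens: int, out_tokens: int) -> float:
    """Daily spend for `calls` requests of the given input/output sizes."""
    p_in, p_out = PRICES[model]
    return calls * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# The output-heavy scenario from the paragraph above: 50 calls, 8K in, 20K out.
for m in PRICES:
    print(m, round(daily_cost(m, 50, 8_000, 20_000), 2))
# -> claude-opus-4.6 27.0, gpt-5.5 32.0, gemini-3.1-pro 12.8
```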
Cache pricing changes the picture for repeated context — see our breakdown of Claude API pricing and the Gemini 3.1 Pro guide for the per-provider details.
How to actually decide
If you have one reasoning workload, pick by task shape using the table above. If you have a mixed workload, routing across models is a better answer than picking one. We covered the implementation in the hybrid routing guide, and the broader case for unifying behind one endpoint in the AI API aggregation guide.
A reasonable starting policy (sketched in code after this list):
- Default to Gemini 3.1 Pro for math, code review, structured planning under 5K output tokens.
- Escalate to Opus 4.6 (extended thinking, high effort) for legal, compliance, multi-document analysis, anything where a missed edge case is expensive.
- Escalate to GPT-5.5 for output that humans or downstream systems will read directly without much editing.
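As code, that policy is a small lookup plus one escalation rule. A minimal sketch; the task-shape labels and model ids are illustrative, and the 5K-token threshold is the one from the first bullet:

```python
ROUTES = {
    # task shape -> model (ids are illustrative gateway names)
    "math": "gemini-3.1-pro",
    "code_review": "gemini-3.1-pro",
    "planning": "gemini-3.1-pro",
    "legal": "claude-opus-4.6",
    "compliance": "claude-opus-4.6",
    "multi_doc": "claude-opus-4.6",
    "human_readable": "gpt-5.5",
}

def pick_model(task_shape: str, expected_output_tokens: int) -> str:
    """Route by task shape; escalate long outputs off the cheap default."""
    model = ROUTES.get(task_shape, "gemini-3.1-pro")  # cheap default
    if model == "gemini-3.1-pro" and expected_output_tokens > 5_000:
        model = "claude-opus-4.6"  # structured plans past 5K tokens go to Opus
    return model
```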
This is closer to a portfolio approach than a "best model" ranking — and it consistently outperforms any single-model default on cost-adjusted quality.
What to test on your own data
Public benchmarks saturate fast. The numbers above will shift as each provider releases new versions. What has held for the last year, and will likely keep holding, is that reasoning capability splits along three axes: depth of justification, format cleanliness, and unit cost. The model that wins for your team is the one that wins on the axis you actually care about, not the one with the highest leaderboard average.
Three concrete things worth running on your own data before you commit:
- A 10-prompt micro-benchmark. Pick ten prompts that look like your real workload. Run each through all three models. Score on a 1-5 rubric. The result will be more useful than any public benchmark.
- A cost-per-correct-answer calculation. Track wrong answers and the human time to fix them. Cheap-and-wrong is more expensive than expensive-and-right when the human edit cost dominates; see the worked example after this list.
- A long-context test if your prompts go past 100K tokens. This is where the MRCR-v2 regression on Opus 4.7 and the chunked-attention quirks of GPT-5.5 actually show up.
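A minimal version of the cost-per-correct-answer calculation. The per-call API costs are the Task 3 figures from above; the accuracy rates and the $20 human fix cost are illustrative assumptions, not measurements:

```python
def cost_per_correct(api_cost: float, accuracy: float, fix_cost: float) -> float:
    # Every call pays the API price; the wrong fraction also pays human repair time.
    return api_cost + (1 - accuracy) * fix_cost

# Illustrative: $20 of engineer time to catch and fix a bad plan.
print(cost_per_correct(api_cost=0.06, accuracy=0.80, fix_cost=20.0))  # cheap model: $4.06
print(cost_per_correct(api_cost=0.16, accuracy=0.95, fix_cost=20.0))  # deep model:  $1.16
```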
If you want to run these tests against all three models without juggling three SDKs and three billing dashboards, ofox.ai's OpenAI-compatible endpoint lets you swap model names in one config line. That is the setup we used to keep the runs above identical across providers.
What we did not test
This article is deliberately about reasoning: not coding, not multimodal, not agentic tool use. For coding-specific comparisons, see Best LLM for Coding in 2026. For the broader flagship benchmark roundup including math and long-context, see the GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro flagship comparison. For Gemini-specific reasoning vs Opus, Gemini 3.1 Pro vs Claude Opus 4.6 goes deeper on the head-to-head.
The market's preferred model in twelve months will not be any of these three. The framework — pick by task shape, not by leaderboard average, and route across providers when the workload is mixed — is the part that survives the next round of releases.
Originally published on ofox.ai/blog.