Model-by-model takeaway
Claude Opus 4.7 had the strongest baseline reasoning, but it suffered the sharpest evidence-quality erosion under heavy context and got worse even when given more thinking budget.
Claude Sonnet 4.6 was the surprise winner on heavy-fill pairwise tests, but it paid for that with very high reasoning-token usage and long latency.
GPT-5.5 was the most resistant to hallucinations and cross-contamination, but it lost reasoning depth as context filled and dropped off sharply at very high fill.
Gemini 3.1 Pro had the flattest drift and the fastest runtime, but its baseline reasoning ceiling was lower than the others.
DeepSeek V4 Pro looked stable on absolute scores, but pairwise testing revealed the steepest hidden losses and the highest variance.
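The gap between absolute and pairwise results can seem counterintuitive, so here is a minimal sketch of how it can arise. It assumes a coarse integer rubric for absolute scores and a judge that compares the baseline and heavy-fill answers side by side. The `trials` data and both function names are hypothetical and exist only for illustration; they are not results or code from these tests.

```python
from statistics import mean

# Hypothetical illustration data, NOT results from the tests described above.
# Each tuple: (absolute score at baseline, absolute score at heavy fill,
#              pairwise verdict for the same task: "baseline", "heavy", or "tie")
trials = [
    (4, 4, "baseline"),
    (4, 4, "baseline"),
    (5, 5, "tie"),
    (4, 4, "baseline"),
    (3, 3, "heavy"),
    (4, 4, "baseline"),
]


def absolute_drift(rows):
    """Mean change in absolute score from baseline to heavy fill."""
    return mean(heavy - base for base, heavy, _ in rows)


def pairwise_loss_rate(rows):
    """Share of decided (non-tie) comparisons that the baseline answer won."""
    decided = [verdict for _, _, verdict in rows if verdict != "tie"]
    if not decided:
        return 0.0
    return sum(verdict == "baseline" for verdict in decided) / len(decided)


drift = absolute_drift(trials)
loss_rate = pairwise_loss_rate(trials)
print(f"absolute drift: {drift:+.2f}")
print(f"heavy-fill answer lost {loss_rate:.0%} of decided comparisons")
```

In this toy data the rubric scores never move, yet the heavy-fill answer loses most head-to-head comparisons. A coarse absolute scale can round away quality differences that a direct comparison still picks up, which is one plausible reading of the "hidden losses" pattern.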