MA-ProofBench, a new theorem-proving benchmark for mathematical analysis, shows GPT-5.5 reaching 16% Pass@8 on undergraduate-level problems and 5% on PhD-level problems, with most models near 0% on the harder set.
GPT-5.5 scored 16% Pass@8 on MA-ProofBench's undergraduate-level theorem-proving problems, and 5% on PhD-level. Most models tested barely registered above 0% on the harder set, per the June 2026 arXiv preprint.
Key facts
- GPT-5.5 achieved 16% Pass@8 on Level I, 5% on Level II.
- Most models scored near 0% on Level II PhD problems.
- Benchmark has 200 theorems across 6 core topics.
- Two dominant failure modes: Mathlib hallucinations and incomplete proofs.
- Natural-language version shows a clear informal-formal reasoning gap.
According to the arXiv preprint, MA-ProofBench is the first formal theorem-proving benchmark dedicated to mathematical analysis. It comprises 200 theorems across 6 core topics and 27 subcategories, including measure theory, complex analysis, and functional analysis. Problems are split into two difficulty tiers: Level I (undergraduate, 100 problems) and Level II (PhD qualifying, 100 problems).
Results: Near-zero on advanced reasoning
On Level I, GPT-5.5 achieved 16% Pass@8, while most other general-purpose reasoning models and formal theorem provers scored below 10%. On Level II, GPT-5.5 dropped to 5%, and the majority of models stayed close to 0%. The authors note that existing formal benchmarks concentrate on easier-to-formalize areas like algebra and elementary number theory, leaving a gap in advanced domains requiring deeper reasoning.
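For readers unfamiliar with the metric, the sketch below shows one common way a Pass@k score is computed. It assumes the widely used unbiased estimator (a problem with n sampled proofs, c of which verify, contributes 1 - C(n-c, k)/C(n, k)), which with n = k = 8 reduces to "solved if at least one of 8 attempts verifies". The preprint's exact evaluation procedure is not described in this article, and the function names and sample data here are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given n attempts of which c were verified (assumed estimator)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n_attempts, n_correct) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up data: 100 problems, 8 attempts each, 16 problems with
    # exactly one verified proof and 84 with none.
    results = [(8, 1)] * 16 + [(8, 0)] * 84
    print(f"Pass@8 = {benchmark_pass_at_k(results, 8):.2%}")
```

Under this reading, a 16% Pass@8 on a 100-problem tier corresponds to 16 problems for which at least one of the eight attempts produced a verified proof.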
Failure modes: Hallucination and incompleteness
The paper identifies two dominant failure modes: Mathlib hallucinations (models generate plausible-looking but incorrect Lean code that references non-existent library entities) and incomplete proofs (models start correctly but fail to finish). An evaluation on natural-language versions of the same problems revealed a clear gap between informal and formal reasoning: models performed significantly better when not constrained by formal syntax.
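To make the two failure modes concrete, here is a small illustrative Lean 4 sketch. It is not taken from the paper or the benchmark: the statement is deliberately trivial, and the lemma name `Real.square_is_nonnegative` is invented for illustration and is assumed not to exist in Mathlib.

```lean
import Mathlib

-- Hallucination pattern (left commented out so the file compiles):
-- the lemma name is made up and is not expected to exist in Mathlib,
-- so Lean would reject it as an unknown identifier.
-- example (x : ℝ) : 0 ≤ x ^ 2 := Real.square_is_nonnegative x

-- Incomplete-proof pattern: `sorry` lets the file compile with a warning,
-- but a checker that rejects `sorry` would count the problem as unsolved.
example (x : ℝ) : 0 ≤ x ^ 2 := by
  sorry

-- A complete proof using a lemma that does exist in Mathlib.
example (x : ℝ) : 0 ≤ x ^ 2 := sq_nonneg x
```

In both failure cases the model's output can look reasonable to a human reader, but it is not accepted as a complete proof by the Lean checker, which is the standard a formal benchmark applies.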
Implications for AI reasoning
MA-ProofBench exposes a stark ceiling on current LLMs' ability to perform rigorous formal reasoning in advanced mathematics. The near-zero performance on Level II suggests that today's models, including frontier systems like GPT-5.5, lack the depth to handle PhD-level formal proofs. The benchmark is intended as a reference for tracking progress, but the current results indicate that formal theorem proving in analysis remains largely unsolved.
What to watch
Watch for future model releases on MA-ProofBench, especially from OpenAI and Anthropic. The benchmark's public leaderboard will reveal whether next-generation reasoning models can crack the 20% barrier on Level II, or whether architectural changes are needed to handle formal analysis.
Source: arxiv.org
Originally published on gentic.news
