MiniMax says its M3 model exceeded the human gold-medal threshold on two math benchmark sets using MaxProof, but no scores or details were disclosed.
MiniMax's M3 model reportedly exceeded the human gold-medal threshold on both math benchmark sets using the MaxProof framework. The claim surfaced when @MiniMax_AI reposted a post by Ryan Lee that links to a paper.
Key facts
- M3 reportedly exceeded the human gold-medal threshold on both math sets.
- MaxProof framework is the claimed method.
- No benchmark names or scores disclosed.
- Announced in an X post by Ryan Lee, reposted by @MiniMax_AI.
- Full paper not yet publicly available.
MiniMax's M3 model exceeded the human gold-medal threshold on both math benchmark sets using the MaxProof framework, according to a post on X by Ryan Lee that @MiniMax_AI reposted. The post links to a paper titled 'MaxProof: ...', but the full text is not yet available. No benchmark names, numerical scores, or comparison baselines were disclosed in the announcement, so the claim cannot be independently verified at this stage.
The claim is notable because exceeding a human gold-medal threshold on competition math requires strong reasoning and step-by-step verification. The phrase "gold medal" suggests olympiad-style problems, since medals are awarded at competitions such as the International Mathematical Olympiad (IMO) and not at the AIME or AMC, but the announcement does not name the benchmarks. The MaxProof framework likely introduces a proof-based verification mechanism to reduce hallucination in mathematical reasoning, a known weakness in large language models. However, without published scores or ablations, it is unclear whether M3 achieves this via scaling, novel architecture, or a specialized inference-time procedure.
This announcement follows a pattern of Chinese AI labs—including DeepSeek, Alibaba's Qwen, and Baidu's ERNIE—publishing strong benchmark results on math reasoning tasks. MiniMax, known for its multimodal models and video generation (Hailuo AI), has not previously emphasized mathematical reasoning. The shift suggests a strategic pivot to compete in the reasoning-heavy segment dominated by OpenAI's o-series and Anthropic's Claude.
With scores, dataset names, and training details undisclosed, the claim should be treated as preliminary until the paper is released and the results are replicated. Math reasoning benchmarks are becoming a standard proxy for general reasoning capability, so any model that claims to exceed human expert performance invites scrutiny.
What the paper may reveal
The linked paper—assuming it follows the typical arXiv format—will likely describe: (1) the MaxProof framework's mechanism for generating and verifying proofs, (2) the training data and fine-tuning methodology for M3, (3) ablation studies comparing M3 with and without MaxProof, and (4) results on standard benchmarks such as AIME, AMC, or MATH. If MaxProof is a new verification layer, it could be applicable beyond math to code generation and formal verification.
Caveats and context
Human gold-medal thresholds on math competitions are not static. At the International Mathematical Olympiad, for example, papers are scored out of 42 points (six problems worth 7 points each), and the gold cutoff is set each year from the score distribution, so it varies from one year to the next. The AIME (scored out of 15) and the AMC 12 (scored out of 150) do not award gold medals, and their cutoffs for qualification and honors also change from year to year. Exceeding a threshold does not mean the model solves all problems, only that it achieves a score above the cutoff. Moreover, benchmark contamination (training on test data) remains a concern for all frontier models. Without a public evaluation set or a third-party audit, the result is unverified.
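The gap between clearing a cutoff and solving every problem can be made concrete with a small sketch. All numbers here are hypothetical: the six-problem paper, the 7 points per problem, and the cutoff of 29 are assumptions for illustration, not figures from the announcement or the MaxProof paper.

```python
from typing import Sequence


def exceeds_cutoff(problem_scores: Sequence[int], cutoff: int) -> bool:
    """Return True when the total score is strictly above the cutoff."""
    return sum(problem_scores) > cutoff


# Hypothetical paper: six problems, 7 points each (42 maximum).
MAX_PER_PROBLEM = 7

# Hypothetical result: four full solutions, one partial, one unsolved.
scores = [7, 7, 7, 7, 3, 0]

# Hypothetical cutoff; real cutoffs change from year to year.
cutoff = 29

total = sum(scores)
fully_solved = sum(1 for s in scores if s == MAX_PER_PROBLEM)

print(f"total={total}, fully solved={fully_solved} of {len(scores)}")
print("above cutoff:", exceeds_cutoff(scores, cutoff))
```

In this made-up case the total clears the cutoff even though two of the six problems are not fully solved, which is why a headline threshold says little without the per-problem breakdown.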
What to watch
Watch for the full MaxProof paper on arXiv and whether independent evaluators replicate the result on standard math benchmarks like AIME 2025. Also monitor MiniMax's next model release—if M3 is a reasoning-focused variant, expect comparisons to OpenAI's o3 and DeepSeek-R1.
Originally published on gentic.news
