Your AI can stop hallucinating math: a real Lean kernel over MCP

#ai #mcp #math #opensource

I got tired of watching AI assistants confidently misquote theorems, so I built mathlas: an MCP server that gives any agent a real Lean kernel, PSLQ, OEIS matching, and a 3.68M-document theorem index. No LLM inside, no API key, Apache-2.0.

The premise is a strict division of labor: the AI is the brain, mathlas is the hands. Every tool returns data — candidates, verdicts, checklists — and the agent does the judging. No tool inside mathlas ever calls an LLM, which means no tool inside mathlas can hallucinate.

The discipline: airtight or nothing

Every verdict-producing tier follows one rule: return an independently-checkable fact, or an honest "nothing." Never a plausible guess.

Here is what that looks like in practice — real in-process tool outputs, captured from the live server:

verify_formal runs the actual Lean 4.31.0 kernel. Hand it a proposition and your Lean 4 proof and you get one of:

VERIFIED_PROOF — the kernel typechecked the full declaration;
REFUTED — with the kernel's error message verbatim, so the agent can repair and retry;
REJECTED — for sorry/admit holes (Lean itself exits 0 on a sorried proof; mathlas scans the source and the kernel's sorryAx diagnostics, so you can't sneak a hole past it);
UNDETERMINED — when the toolchain is missing, an import can't resolve, or the 60 s cap hits. An honest shrug, never a fake verdict.

On the numeric side, identify_constant and verify_numeric use PSLQ plus an independent high-precision re-evaluation (50–51 digits). Type in 1.6449340668482264... and it hands back pi**2/6, re-verified — or nothing at all. Measured false-positive rate across every tier (numeric, sequence, Ramanujan-style relations): zero. Structureless inputs produce zero false hits, 8 out of 8 times we tried to bait it. Full tables with reproduction commands live in RESULTS.md.

Twelve tools, one pipeline

The tools compose into a workflow the agent drives:

search_existing_math → applicability_checklist / mapping_scaffold → (AI judges) → verify_numeric / verify_formal

The retrieval side is search_existing_math, served from a 3,683,428-document dense + BM25 + RRF index (the text side is open on Hugging Face). applicability_checklist is the tool I'm proudest of and the one nobody else ships: it decomposes a theorem's hypotheses into atomic preconditions the AI verifies one by one — the guardrail against applying Banach's fixed-point theorem to an incomplete space. Then there's identify_sequence (exact OEIS term-match), search_formal_math (proxies Loogle + LeanSearch for mathlib declarations, provenance-labeled), conjecture_relation (Ramanujan Machine-style PSLQ over a rich basis), a sandboxed funsearch harness, and add_finding, which matters for the benchmark below.

The benchmark, with its caveat up front

On TheoremSearch's own 110 human-written queries, mathlas with its self-augmenting web loop scores 59.1% theorem-level Hit@20 (65/110) against TheoremSearch's 45.0%. Sounds great. Here is the part you should read before quoting it:

This is a loop-vs-static comparison, not a corpus-vs-corpus one. Corpus-only, mathlas's baseline on this benchmark is 10.0% — TheoremSearch withheld ~85% of their private 9.2M corpus (the non-redistributable arXiv-licensed papers), so 95 of the 110 target papers are unreachable for any open system. What the 59.1% measures is the add_finding loop: the agent web-finds each missing statement, embeds it with the same Qwen3-Embedding-8B, and fuses it into the live index at runtime. TheoremSearch's 45.0% is a static system answering from its full private corpus. The honest headline is "a validated writeback loop repairs an open system's coverage gap at runtime," not "my index beats theirs." The math domain is the right place for such a loop precisely because the write-back candidate can be deterministically checked (verify_numeric / verify_formal) before it's trusted.

Reproduce it with benchmarks/webaug_110_bench.py in the repo.

Try it in one line

With uv installed:

claude mcp add mathlas -- uvx mathlas-mcp

That's it — Claude Code now sees twelve tools. Plain pip works too (pip install mathlas-mcp), Cursor or any MCP client can point at the same stdio command, and if the official mcp SDK isn't installed the server falls back to a dependency-free stdio JSON-RPC implementation, so it always runs. It's also on the official MCP registry as io.github.Archerkattri/mathlas.

What it doesn't do

mathlas does not write proofs — the generator/verifier split is absolute, so the kernel checks your proof and reports exactly why it fails, but the repair is on you (or your agent). Corpus-only retrieval will not beat TheoremSearch on their benchmark; the 10.0% baseline is a licensing-bounded floor and I report it as such. Two tools degrade without optional local data: identify_sequence wants a local OEIS copy and verify_formal wants a Lean toolchain — without them you get a clear "not available," never a fake answer. The full-quality index needs the Qwen3-Embedding-8B encoder, which is not laptop hardware; there are measured quantized and 0.6B tiers that trade 7–9 points of recall for running on 4 CPU threads, documented with their exact costs. And it's not a CAS — if you want to symbolically massage an expression you gave it, sympy is the right tool; mathlas finds, scopes, and verifies existing math.

Code, benchmarks, and every number's reproduction command: github.com/Archerkattri/mathlas.