A new benchmarking study finds that AI models' strong Python performance is a poor predictor of their coding ability across other programming languages. The Multi-LCB project rebuilt a respected contamination-resistant coding benchmark in twelve languages and found that models which look excellent in Python perform markedly worse elsewhere — they have over-specialized in the language they saw most in training.
Key facts
- What: A widely-trusted coding benchmark was Python-only. Expanding it to a dozen languages revealed that models acing Python often stumble badly elsewhere — Python skill isn't general coding skill.
- When: 2026-06-20
- Primary source: arXiv 2606.20517
Three findings stand out.
- Python overfitting: many models that look excellent in Python perform markedly worse in other languages; they've over-specialized in the language they saw most.
- Uneven contamination: the degree to which test problems appear to have leaked into a model's training varies by language, a fingerprint of how lopsided these models' training diets are toward popular languages.
- Large gaps across languages: models are especially weak in stricter, more structured languages and in less common ones that show up rarely in training data.
The blunt conclusion: a model's Python performance is not a reliable stand-in for its coding ability in general.
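To make the "not a reliable stand-in" point concrete, here is a minimal sketch of how a reader might measure a Python gap from per-language scores. Everything in it is an assumption for illustration: the model names, the language set, the scores, and the `python_gap` helper are hypothetical, and none of the numbers come from the Multi-LCB study.

```python
from statistics import mean

# Hypothetical pass rates, invented purely for illustration.
# These are NOT results from the Multi-LCB study.
scores = {
    "model_a": {"python": 0.80, "rust": 0.50, "haskell": 0.40},
    "model_b": {"python": 0.70, "rust": 0.65, "haskell": 0.60},
}


def other_language_mean(per_language: dict[str, float]) -> float:
    """Mean score over every language except Python."""
    return mean(s for lang, s in per_language.items() if lang != "python")


def python_gap(per_language: dict[str, float]) -> float:
    """Python score minus the mean score over all other languages."""
    return per_language["python"] - other_language_mean(per_language)


# A large positive gap suggests the Python score overstates general skill.
gaps = {model: round(python_gap(s), 3) for model, s in scores.items()}

# Rank the same models two ways: by Python alone, and by everything else.
ranked_by_python = sorted(
    scores, key=lambda m: scores[m]["python"], reverse=True
)
ranked_by_other = sorted(
    scores, key=lambda m: other_language_mean(scores[m]), reverse=True
)

for model in scores:
    print(model, "python gap:", gaps[model])
print("ranked by Python:", ranked_by_python)
print("ranked by other languages:", ranked_by_other)
```

With these made-up numbers the two rankings disagree: the model that leads on Python trails once Python is excluded. That is the pattern the study warns a Python-only leaderboard can hide.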
Testing only in Python is like judging someone's overall musical talent solely by how well they play one song they've practiced a thousand times. They'll sound like a virtuoso — until you hand them a new piece, or a different instrument, and discover the talent was narrower than it looked. Multi-LCB hands the models a different instrument and listens to what actually comes out.
Benchmarks shape everything: which models look best, which research directions get funded, and which claims make headlines. If the headline coding test is single-language, the entire field is optimizing for a narrow slice of reality while telling itself the slice is the whole. Real software is written in a sprawling variety of languages, and a coding assistant that only truly shines in Python is far less useful than its leaderboard position suggests. Building tests that span many languages forces a more honest measure of general skill — and this is part of a broader reckoning this week about how AI gets evaluated, with several groups arguing that a single tidy score hides more than it reveals.
The weaker results in less common languages might not reflect a deep inability to generalize so much as a simple shortage of training material — these models have just seen far less code in those languages. With a more balanced training diet, some of the gap might close, which would mean the problem is partly about what we feed models rather than a fundamental limit of how they learn. "Can't generalize" and "wasn't taught enough" call for different fixes. Either way, the practical lesson is sturdy: the next time a model is crowned a coding champion on a Python-only test, treat the crown with suspicion. The same model handed a different language might tell a very different story.
Originally published on Ground Truth, where every claim is checked against the primary source.