AI coding skill in Python doesn't carry over to other languages

#benchmarks #evaluation #coding

A new benchmarking study finds that AI models' strong Python performance is a poor predictor of their coding ability across other programming languages. The Multi-LCB project rebuilt a respected contamination-resistant coding benchmark in twelve languages and found that models that look excellent in Python perform markedly worse elsewhere. They have over-specialized in the language they saw most in training.

Key facts

What: A widely trusted coding benchmark was Python-only. Expanding it to a dozen languages revealed that models acing Python often stumble badly elsewhere: Python skill isn't general coding skill.
When: 2026-06-20
Primary source: arXiv 2606.20517

Three findings stand out.

Python overfitting: many models that look excellent in Python perform markedly worse in other languages, having over-specialized in the language they saw most.
Uneven contamination: the degree to which test problems appear to have leaked into a model's training varies by language, a fingerprint of how lopsided these models' training diets are toward popular languages.
Large gaps across languages: models are especially weak in stricter, more structured languages and in less common ones that show up rarely in training data.

The blunt conclusion: a model's Python performance is not a reliable stand-in for its coding ability in general.
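To make the idea of a per-language gap concrete, here is a minimal sketch of how one might compare a model's pass rate in each language against its Python pass rate. It is not the Multi-LCB methodology: the function names (`pass_rate_by_language`, `gap_vs_reference`) are hypothetical, and the `toy_results` outcomes are made up purely for illustration, not taken from the study.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Turn (language, passed) pairs into {language: fraction of problems passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        if passed:
            passes[language] += 1
    return {lang: passes[lang] / totals[lang] for lang in totals}


def gap_vs_reference(rates, reference="python"):
    """Return how far each language's pass rate falls below the reference language."""
    ref_rate = rates[reference]
    return {
        lang: ref_rate - rate
        for lang, rate in rates.items()
        if lang != reference
    }


# Made-up outcomes for one imaginary model (NOT data from the study).
toy_results = [
    ("python", True), ("python", True), ("python", True), ("python", False),
    ("rust", True), ("rust", False), ("rust", False), ("rust", False),
    ("go", True), ("go", True), ("go", False), ("go", False),
]

rates = pass_rate_by_language(toy_results)
gaps = gap_vs_reference(rates)

for lang, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {gap:.2f} below the Python pass rate")
```

A positive gap means the model solved a smaller share of problems in that language than in Python. A leaderboard that reports only the Python number hides exactly this column.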

Testing only in Python is like judging someone's overall musical talent solely by how well they play one song they've practiced a thousand times. They'll sound like a virtuoso — until you hand them a new piece, or a different instrument, and discover the talent was narrower than it looked. Multi-LCB hands the models a different instrument and listens to what actually comes out.

Benchmarks shape everything: which models look best, which research directions get funded, and which claims make headlines. If the headline coding test is single-language, the entire field is optimizing for a narrow slice of reality while telling itself the slice is the whole. Real software is written in a sprawling variety of languages, and a coding assistant that only truly shines in Python is far less useful than its leaderboard position suggests. Building tests that span many languages forces a more honest measure of general skill — and this is part of a broader reckoning this week about how AI gets evaluated, with several groups arguing that a single tidy score hides more than it reveals.

The weaker results in less common languages might not reflect a deep inability to generalize so much as a simple shortage of training material — these models have just seen far less code in those languages. With a more balanced training diet, some of the gap might close, which would mean the problem is partly about what we feed models rather than a fundamental limit of how they learn. "Can't generalize" and "wasn't taught enough" call for different fixes. Either way, the practical lesson is sturdy: the next time a model is crowned a coding champion on a Python-only test, treat the crown with suspicion. The same model handed a different language might tell a very different story.

Originally published on Ground Truth, where every claim is checked against the primary source.
