TL;DR
I had Claude Code implement a very simplified version of Git in 13 languages (15 configurations, counting the type-checked variants of Python and Ruby). Ruby, Python, and JavaScript were the fastest, cheapest, and most stable. Statically typed and type-checked configurations took 1.4–2.6× as long and cost more.
Introduction
Which programming language is best suited for AI coding agents?
"Static typing prevents AI hallucination bugs!"
"No, skipping type annotations saves tokens!"
There's plenty of qualitative debate, but quantitative data is scarce. So I ran an experiment.
Experiment
I asked Claude Code to implement a "mini-git" (a simplified version of Git) in various languages, and measured the time and cost for each. Git was famously built by Linus Torvalds in two weeks, so it seemed like a good task.
The task was split into two phases:
- v1 (New project): Implement init, add, commit, and log.
- v2 (Feature extension): Add status, diff, checkout, and reset.
The prompt was simply: "Read SPEC-v1.txt, implement it, and make sure test-v1.sh passes." v2 was analogous.
The languages compared:
| Category | Languages |
|---|---|
| Dynamic | Python, Ruby, JavaScript, Perl, Lua |
| Dynamic + type checker | Python/mypy, Ruby/Steep |
| Static | TypeScript, Go, Rust, C, Java |
| Functional | Scheme (dynamic), OCaml (static), Haskell (static) |
In the Python/mypy configuration, the agent writes fully type-annotated Python and verifies it with mypy --strict. In the Ruby/Steep configuration, it writes RBS type signatures and verifies them with steep check. These two configurations allow a direct comparison of type-checking overhead within the same language.
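To make the difference between the two Python configurations concrete, here is a minimal sketch of one helper written both ways. The function names and the one-entry-per-line index format are hypothetical illustrations, not taken from the spec or from any generated code. The second version uses the annotated style that mypy --strict expects, since strict mode rejects function definitions without type annotations.

```python
# Hypothetical helper, shown twice. Neither version comes from the
# benchmark's generated code; the "path digest" line format is an assumption.

# Plain Python: no annotations.
def parse_index(text):
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        path, digest = line.split(" ", 1)
        entries[path] = digest
    return entries


# Annotated Python: parameters and return value carry types, which is what
# strict mode requires. The annotation on `entries` is optional but common.
def parse_index_typed(text: str) -> dict[str, str]:
    entries: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        path, digest = line.split(" ", 1)
        entries[path] = digest
    return entries
```

The logic is identical; the typed version simply has more tokens to generate and an extra verification step to pass, which is the overhead the two type-checked configurations are meant to expose.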
Each language was run 20 times. The model was Claude Opus 4.6 (high effort).
Results
Average results across 20 trials, sorted by average cost.
| Language | Tests passed (v1+v2) | Time (v1+v2) | Avg. cost | LOC (v2) |
|---|---|---|---|---|
| Ruby | 40/40 | 73.1s ± 4.2s | $0.36 | 219 |
| Python | 40/40 | 74.6s ± 4.5s | $0.38 | 235 |
| JavaScript | 40/40 | 81.1s ± 5.0s | $0.39 | 248 |
| Go | 40/40 | 101.6s ± 37.0s | $0.50 | 324 |
| Java | 40/40 | 115.4s ± 34.4s | $0.50 | 303 |
| Rust | 38/40 | 113.7s ± 54.8s | $0.54 | 303 |
| Perl | 40/40 | 130.2s ± 44.2s | $0.55 | 315 |
| Python/mypy | 40/40 | 125.3s ± 19.0s | $0.57 | 326 |
| OCaml | 40/40 | 128.1s ± 28.9s | $0.58 | 216 |
| Lua | 40/40 | 143.6s ± 43.0s | $0.58 | 398 |
| Scheme | 40/40 | 130.6s ± 39.9s | $0.60 | 310 |
| TypeScript | 40/40 | 133.0s ± 29.4s | $0.62 | 310 |
| C | 40/40 | 155.8s ± 40.9s | $0.74 | 517 |
| Haskell | 39/40 | 174.0s ± 44.2s | $0.74 | 224 |
| Ruby/Steep | 40/40 | 186.6s ± 69.7s | $0.84 | 304 |
Out of 600 runs (15 configurations × 2 phases × 20 trials), only 3 failed to pass the tests: 2 in Rust and 1 in Haskell. In one of the Rust failure logs, the agent claimed "the tests are wrong." Since the other 38 Rust runs passed the same tests, this appears to be a hallucination.
Total Time and Cost
Dot plots of total time (v1 + v2):
And cost:
Ruby, Python, and JavaScript are the top 3: 73–81 seconds, $0.36–0.39, with low standard deviations — fast and stable.
From 4th place onward (Go, Java, Rust), variance increases sharply. Go averages 102s, but with a standard deviation of ±37s.
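One way to quantify "stable" is the spread relative to the mean. The sketch below computes that ratio for the first six rows of the results table, assuming the ± values are standard deviations; the helper name `relative_spread` is mine.

```python
# Mean and standard deviation of total time (seconds), copied from the
# results table above. Assumes the "±" figures are standard deviations.
totals = {
    "Ruby": (73.1, 4.2),
    "Python": (74.6, 4.5),
    "JavaScript": (81.1, 5.0),
    "Go": (101.6, 37.0),
    "Java": (115.4, 34.4),
    "Rust": (113.7, 54.8),
}


def relative_spread(mean, std):
    """Standard deviation as a fraction of the mean."""
    return std / mean


spread = {lang: relative_spread(m, s) for lang, (m, s) in totals.items()}

for lang, value in spread.items():
    print(f"{lang:<10} {value:.0%}")
```

The top three stay well under 10%, while Go, Java, and Rust land at roughly 30% to 50% of their own mean.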
Time and cost are strongly correlated:
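As a rough check of this claim, the Pearson correlation can be computed from the 15 averages in the results table. This is only a sketch over per-configuration means, not over the individual runs (those are in the per-run report), and the `pearson` helper is written out by hand rather than taken from the benchmark code.

```python
import math

# (total time in seconds, average cost in USD) per configuration,
# copied from the results table above, in table order.
results = [
    (73.1, 0.36), (74.6, 0.38), (81.1, 0.39), (101.6, 0.50),
    (115.4, 0.50), (113.7, 0.54), (130.2, 0.55), (125.3, 0.57),
    (128.1, 0.58), (143.6, 0.58), (130.6, 0.60), (133.0, 0.62),
    (155.8, 0.74), (174.0, 0.74), (186.6, 0.84),
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson(results)
print(f"Pearson r = {r:.2f}")
```

On these averages the coefficient comes out well above 0.9, which matches the visual impression that time and cost move together.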
Lines of Code
LOC after v2 completion:
OCaml (216), Ruby (219), and Haskell (224) are the most compact. C stands out at 517 lines.
Interestingly, fewer LOC does not imply faster or cheaper generation. OCaml and Haskell are compact but mid-to-low in speed and cost efficiency.
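A crude way to see this is to divide total time by final line count. The sketch below does that for the three most compact results and for C, using the table values. Seconds per final line is my own ad hoc metric, and it ignores everything the agent does besides emitting code (thinking, tool calls, test runs).

```python
# (total time in seconds, LOC after v2), copied from the results table.
data = {
    "OCaml": (128.1, 216),
    "Ruby": (73.1, 219),
    "Haskell": (174.0, 224),
    "C": (155.8, 517),
}

# Total agent time divided by the number of lines in the final program.
seconds_per_line = {lang: t / loc for lang, (t, loc) in data.items()}

for lang, value in sorted(seconds_per_line.items(), key=lambda kv: kv[1]):
    print(f"{lang:<8} {value:.2f} s per line")
```

By this measure Haskell spends more than twice as long per final line as Ruby, while C, despite its 517 lines, is the lowest of the four.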
v1/v2 Detailed Results
Breakdown by phase:
| Language | v1 Time | v1 Turns | v1 LOC | v1 Tests | v2 Time | v2 Turns | v2 LOC | v2 Tests | Total Time | Avg. Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Ruby | 33.2s± 2.5s | 6.0 | 107 | 20/20 | 40.0s± 3.0s | 7.0 | 219 | 20/20 | 73.1s± 4.2s | $0.36 |
| Python | 32.9s± 1.3s | 6.0 | 113 | 20/20 | 41.8s± 4.5s | 7.1 | 235 | 20/20 | 74.6s± 4.5s | $0.38 |
| JavaScript | 36.0s± 3.5s | 6.0 | 123 | 20/20 | 45.1s± 4.1s | 7.2 | 248 | 20/20 | 81.1s± 5.0s | $0.39 |
| Go | 47.5s±34.5s | 7.7 | 143 | 20/20 | 54.1s±12.3s | 9.7 | 324 | 20/20 | 101.6s±37.0s | $0.50 |
| Java | 64.3s±31.1s | 8.7 | 152 | 20/20 | 51.2s± 8.1s | 9.6 | 303 | 20/20 | 115.4s±34.4s | $0.50 |
| Rust | 53.6s±36.7s | 9.4 | 139 | 19/20 | 60.1s±19.2s | 10.1 | 303 | 19/20 | 113.7s±54.8s | $0.54 |
| Perl | 84.4s±43.1s | 9.2 | 173 | 20/20 | 45.7s± 6.8s | 7.5 | 315 | 20/20 | 130.2s±44.2s | $0.55 |
| Python/mypy | 52.7s± 8.3s | 9.2 | 171 | 20/20 | 72.6s±14.4s | 12.2 | 326 | 20/20 | 125.3s±19.0s | $0.57 |
| OCaml | 80.9s±28.8s | 11.2 | 111 | 20/20 | 47.1s± 6.0s | 9.2 | 216 | 20/20 | 128.1s±28.9s | $0.58 |
| Lua | 96.4s±42.8s | 10.1 | 226 | 20/20 | 47.2s± 5.2s | 8.1 | 398 | 20/20 | 143.6s±43.0s | $0.58 |
| Scheme | 66.7s±36.7s | 8.9 | 171 | 20/20 | 63.9s±10.0s | 10.6 | 310 | 20/20 | 130.6s±39.9s | $0.60 |
| TypeScript | 69.9s±18.8s | 12.2 | 149 | 20/20 | 63.1s±17.8s | 11.3 | 310 | 20/20 | 133.0s±29.4s | $0.62 |
| C | 65.0s±18.2s | 8.2 | 276 | 20/20 | 90.8s±39.1s | 13.7 | 517 | 20/20 | 155.8s±40.9s | $0.74 |
| Haskell | 74.3s±39.1s | 10.3 | 119 | 19/20 | 99.6s±32.4s | 16.4 | 224 | 20/20 | 174.0s±44.2s | $0.74 |
| Ruby/Steep | 105.0s±65.2s | 20.2 | 150 | 20/20 | 81.6s± 8.8s | 17.2 | 304 | 20/20 | 186.6s±69.7s | $0.84 |
"Turns" is the number of API round-trips (tool call → result → next response) within a single prompt execution.
v1 (New Project) Time
v1 shows the largest gap between languages. Python (32.9s) and Ruby (33.2s) lead, followed by JavaScript (36.0s). Ruby/Steep takes 105.0s, about 3.2× as long as plain Ruby. Lua (96.4s), Perl (84.4s), and OCaml (80.9s) are also slow.
v1 starts from an empty directory, so languages requiring project config files (Cargo.toml, package.json, etc.) incur additional overhead. Python, Ruby, and JavaScript only need to generate a single minigit file, which may partly explain the gap.
v2 (Feature Extension) Time
In v2, the gap narrows. The top 3 remain Ruby (40.0s), Python (41.8s), JavaScript (45.1s). Perl (45.7s), OCaml (47.1s), and Lua (47.2s) follow closely.
Haskell is the slowest in v2, averaging 99.6s despite being among the most compact (224 LOC); it appears to spend heavily on thinking tokens. C takes 90.8s, weighed down by its high LOC (517).
Type-checker overhead: Python/mypy takes 1.6–1.7× as long as Python; Ruby/Steep takes 2.0–3.2× as long as Ruby.
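These overhead ranges follow directly from the per-phase means in the v1/v2 table. A small sketch of the arithmetic, with a helper name (`overhead`) of my own choosing:

```python
# Mean phase times in seconds, copied from the v1/v2 table above.
times = {
    "Python":      {"v1": 32.9,  "v2": 41.8},
    "Python/mypy": {"v1": 52.7,  "v2": 72.6},
    "Ruby":        {"v1": 33.2,  "v2": 40.0},
    "Ruby/Steep":  {"v1": 105.0, "v2": 81.6},
}


def overhead(checked, plain):
    """Ratio of type-checked time to plain-language time, per phase."""
    return {
        phase: times[checked][phase] / times[plain][phase]
        for phase in ("v1", "v2")
    }


mypy_overhead = overhead("Python/mypy", "Python")
steep_overhead = overhead("Ruby/Steep", "Ruby")

print({phase: round(v, 1) for phase, v in mypy_overhead.items()})
print({phase: round(v, 1) for phase, v in steep_overhead.items()})
```

For mypy the larger ratio is in v2, while for Steep it is in v1, so the two checkers do not add their cost in the same phase.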
Data and Reproduction
All experiment code and results are available on GitHub:
mame/ai-coding-lang-bench: Which programming language is best for AI coding agents? Benchmarking 13 languages with Claude Code.
Per-run results are in report.md. Execution logs and generated source code are on the data branch.
Note: benchmark.rb uses --dangerously-skip-permissions, which lets the agent act without permission prompts, so reproduce the experiment only in an isolated environment (I ran it inside Docker).
Discussion
What follows is my interpretation. I'm a Ruby committer, so please keep that bias in mind. I also haven't analyzed all the generated code in detail.
What causes the speed/cost differences?
This experiment can't pinpoint a single cause, and I don't think there is one. But we can form hypotheses from the trends:
- Type system: Dynamic languages are consistently faster and more stable.
- Conciseness: Shorter code generally means faster generation — but OCaml and Haskell are compact yet expensive, apparently due to high thinking-token usage.
- Procedural vs. functional: Excluding the top 3, there isn't a large gap between procedural and functional languages. Notably, OCaml achieved 47.1s in v2, rivaling JavaScript (though OCaml can be written in a procedural style, so a pure functional comparison is difficult).
- Language-specific difficulty: C's memory management and Rust's ownership model may add overhead.
- AI familiarity: Python, Ruby, and JavaScript presumably have far more training data. Scheme and Haskell likely have less, and the results are consistent with this. Ruby/Steep's larger overhead (2.0–3.2×) vs. Python/mypy (1.6–1.7×) may also reflect lower AI familiarity with Steep compared to mypy.
Most likely, these factors combine to produce the observed results.
Doesn't lack of types mean more bugs?
Possibly. The tests pass, but untested paths in dynamically typed languages may have type errors.
That said, type errors are among the easiest bugs to detect and fix. If an agent frequently introduced type errors without a type checker, it would likely introduce logic bugs at a similar rate — at which point the problem goes beyond type checking.
It's also worth noting that the only 3 failures out of 600 runs were in Rust (2) and Haskell (1) — both statically typed languages with unique concepts like ownership and monads. This may be coincidence, but types don't prevent all bugs.
A 2× difference isn't that big, is it?
I think it is.
In real-world development, you're constantly iterating: prompt → wait → think about the next task → prompt again. The difference between waiting 30 seconds and 60 seconds matters — not just in total time, but in focus and flow. Response time is critical in iterative development.
"Take longer to build something robust" is a reasonable argument, but when competitors are shipping at twice the speed, is waiting the right call? Development speed is itself a dimension of quality.
That said, if the difference shrinks to 1 second vs. 0.5 seconds in the future, then it truly won't matter.
The task is too small. Static typing should shine at larger scales.
I don't disagree. But designing a large-scale benchmark that's fair across 13 languages (15 configurations) is quite challenging. Someone should try it.
Isn't ecosystem and runtime performance more important for language choice?
Absolutely. From a generation standpoint too — if you can leverage an ecosystem, there's less code to generate, which should be faster and cheaper. And if runtime speed is essential for your application, there's no reason to choose a slow dynamic language.
For this experiment, I intentionally chose a task with no external library dependencies to isolate language-level differences. The spec uses a custom hash instead of SHA-256 for this reason.
Conclusion
I quantitatively evaluated which programming languages are best suited for code generation with Claude Code. At least for prototyping-scale tasks, Ruby, Python, and JavaScript appear to be the best fit.
Static typing may become advantageous at larger scales — someone should test this.
The classic strategy — start with a dynamic language, then migrate to a static one as the project matures — may still be the right call. Coding agents seem very capable at cross-language migration (needs verification), making this an increasingly realistic option.
Notes
- Evaluated in March 2026. Given the pace of AI progress, results may look different in a few months.
- This experiment was supported by the Claude for Open Source Program. Thanks to Anthropic for 6 months of free Claude Max 20x!






