Yusuke Endoh

Which Programming Language Is Best for Claude Code?

TL;DR

I had Claude Code implement a very simplified version of Git in 13 languages. Ruby, Python, and JavaScript were the fastest, cheapest, and most stable. Statically typed languages were 1.4–2.6× slower and more expensive.

[Figure: Cost v1+v2]

Introduction

Which programming language is best suited for AI coding agents?

"Static typing prevents AI hallucination bugs!"
"No, skipping type annotations saves tokens!"

There's plenty of qualitative debate, but quantitative data is scarce. So I ran an experiment.

Experiment

I asked Claude Code to implement a "mini-git", a simplified version of Git, in various languages, and measured the time and cost for each. Git was famously built by Linus Torvalds in two weeks, so it seemed like a good task.

The task was split into two phases:

  • v1 (New project): Implement init, add, commit, and log.
  • v2 (Feature extension): Add status, diff, checkout, and reset.

The prompt was simply: "Read SPEC-v1.txt, implement it, and make sure test-v1.sh passes." v2 was analogous.

The languages compared:

| Category | Languages |
| --- | --- |
| Dynamic | Python, Ruby, JavaScript, Perl, Lua |
| Dynamic + type checker | Python/mypy, Ruby/Steep |
| Static | TypeScript, Go, Rust, C, Java |
| Functional | Scheme (dynamic), OCaml (static), Haskell (static) |

The Python/mypy configuration writes fully type-annotated Python verified with mypy --strict. The Ruby/Steep configuration writes RBS type signatures verified with steep check. These two configurations allow a direct comparison of type-checking overhead within the same language.
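To make the difference between the two Python configurations concrete, here is a minimal sketch of the same helper written both ways. The function name and the "path digest" line format are hypothetical, chosen only for illustration; they are not taken from the spec or from the generated code.

```python
# Plain Python configuration: no annotations are required.
def parse_entry(line):
    # Split a hypothetical "path digest" line on its last space.
    path, digest = line.rsplit(" ", 1)
    return path, digest


# Python/mypy configuration: the same logic, fully annotated.
# Under mypy --strict, the unannotated definition above would be
# reported as an untyped function.
def parse_entry_typed(line: str) -> tuple[str, str]:
    path, digest = line.rsplit(" ", 1)
    return path, digest


print(parse_entry_typed("src/main.py 0123abcd"))
```

The annotated version carries a few extra tokens per function, and the agent also has to run the checker and react to its output, which is where extra turns can come from.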

Each language was run 20 times. The model was Claude Opus 4.6 (high effort).
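As a rough illustration of how repeated trials turn into a "mean ± spread" figure, here is a minimal timing sketch. It is not the author's benchmark.rb: the helper names are hypothetical, the command is a placeholder, and it assumes the ± values reported in the tables are sample standard deviations.

```python
import statistics
import subprocess
import sys
import time


def time_trials(cmd: list[str], trials: int) -> list[float]:
    """Run cmd `trials` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list[float]) -> str:
    """Format durations as 'mean ± sample standard deviation'."""
    mean = statistics.mean(durations)
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return f"{mean:.1f}s ± {spread:.1f}s"


# Placeholder command (assumption): a real harness would invoke the agent here.
placeholder_cmd = [sys.executable, "-c", "pass"]
print(summarize(time_trials(placeholder_cmd, trials=3)))
```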

Results

Average results across 20 trials, sorted by average cost.

| Language | Tests passed (v1+v2) | Time (v1+v2) | Avg. cost | LOC (v2) |
| --- | --- | --- | --- | --- |
| Ruby | 40/40 | 73.1s ± 4.2s | $0.36 | 219 |
| Python | 40/40 | 74.6s ± 4.5s | $0.38 | 235 |
| JavaScript | 40/40 | 81.1s ± 5.0s | $0.39 | 248 |
| Go | 40/40 | 101.6s ± 37.0s | $0.50 | 324 |
| Java | 40/40 | 115.4s ± 34.4s | $0.50 | 303 |
| Rust | 38/40 | 113.7s ± 54.8s | $0.54 | 303 |
| Perl | 40/40 | 130.2s ± 44.2s | $0.55 | 315 |
| Python/mypy | 40/40 | 125.3s ± 19.0s | $0.57 | 326 |
| OCaml | 40/40 | 128.1s ± 28.9s | $0.58 | 216 |
| Lua | 40/40 | 143.6s ± 43.0s | $0.58 | 398 |
| Scheme | 40/40 | 130.6s ± 39.9s | $0.60 | 310 |
| TypeScript | 40/40 | 133.0s ± 29.4s | $0.62 | 310 |
| C | 40/40 | 155.8s ± 40.9s | $0.74 | 517 |
| Haskell | 39/40 | 174.0s ± 44.2s | $0.74 | 224 |
| Ruby/Steep | 40/40 | 186.6s ± 69.7s | $0.84 | 304 |

Out of 600 runs (15 language configurations × 2 phases × 20 trials), only 3 failed, meaning the tests did not pass: Rust (2) and Haskell (1). In one of the Rust failure logs, the agent claimed "the tests are wrong." Since all other Rust trials succeeded, this appears to be a hallucination.
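The run count and failure rate follow from simple arithmetic. A short sketch, with variable names of my own choosing:

```python
# 13 languages plus the two type-checked variants (Python/mypy, Ruby/Steep).
configurations = 13 + 2
phases = 2    # v1 and v2
trials = 20

total_runs = configurations * phases * trials

# Failure counts as reported in the text.
failures = {"Rust": 2, "Haskell": 1}
failed = sum(failures.values())

print(f"{failed} of {total_runs} runs failed ({failed / total_runs:.1%})")
```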

Total Time and Cost

Dot plots of total time (v1 + v2):

[Figure: Total time]

And cost:

[Figure: Total cost]

Ruby, Python, and JavaScript are the top 3: 73–81 seconds, $0.36–0.39, with low standard deviations — fast and stable.

From 4th place onward (Go, Rust, Java), variance increases sharply. Go averages 102s but with ±37s of spread.

Time and cost are strongly correlated:

[Figure: Time vs Cost]
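The strength of this relationship can be checked from the per-configuration averages in the results table. The sketch below computes a plain Pearson correlation over those 15 (time, cost) pairs; it uses only the averages, not the per-run data, so it is a rough check rather than a full analysis.

```python
import math

# (total time in seconds, average cost in USD), copied from the results table.
results = {
    "Ruby": (73.1, 0.36), "Python": (74.6, 0.38), "JavaScript": (81.1, 0.39),
    "Go": (101.6, 0.50), "Java": (115.4, 0.50), "Rust": (113.7, 0.54),
    "Perl": (130.2, 0.55), "Python/mypy": (125.3, 0.57), "OCaml": (128.1, 0.58),
    "Lua": (143.6, 0.58), "Scheme": (130.6, 0.60), "TypeScript": (133.0, 0.62),
    "C": (155.8, 0.74), "Haskell": (174.0, 0.74), "Ruby/Steep": (186.6, 0.84),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


times = [t for t, _ in results.values()]
costs = [c for _, c in results.values()]
r = pearson(times, costs)
print(f"Pearson r between total time and average cost: {r:.2f}")
```

On these averages the coefficient comes out well above 0.9, consistent with the plot.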

Lines of Code

LOC after v2 completion:

[Figure: Lines of code]

OCaml (216), Ruby (219), and Haskell (224) are the most compact. C stands out at 517 lines.

[Figure: Time vs LOC]

Interestingly, fewer LOC does not imply faster or cheaper generation. OCaml and Haskell are compact but mid-to-low in speed and cost efficiency.
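One way to see this from the results table is to rank the configurations by LOC and by total time and compare the two rankings. A small sketch using the table values:

```python
# (LOC after v2, total time in seconds), copied from the results table.
data = {
    "Ruby": (219, 73.1), "Python": (235, 74.6), "JavaScript": (248, 81.1),
    "Go": (324, 101.6), "Java": (303, 115.4), "Rust": (303, 113.7),
    "Perl": (315, 130.2), "Python/mypy": (326, 125.3), "OCaml": (216, 128.1),
    "Lua": (398, 143.6), "Scheme": (310, 130.6), "TypeScript": (310, 133.0),
    "C": (517, 155.8), "Haskell": (224, 174.0), "Ruby/Steep": (304, 186.6),
}

by_loc = sorted(data, key=lambda name: data[name][0])
by_time = sorted(data, key=lambda name: data[name][1])
time_rank = {name: position + 1 for position, name in enumerate(by_time)}

# The three most compact implementations and where they land on speed.
for name in by_loc[:3]:
    loc, seconds = data[name]
    print(f"{name}: {loc} LOC, {seconds}s, time rank {time_rank[name]} of {len(data)}")
```

OCaml and Haskell rank 1st and 3rd on compactness but 8th and 14th on total time, while Ruby is near the top on both.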

v1/v2 Detailed Results

Breakdown by phase:

| Language | v1 Time | v1 Turns | v1 LOC | v1 Tests | v2 Time | v2 Turns | v2 LOC | v2 Tests | Total Time | Avg. Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ruby | 33.2s ± 2.5s | 6.0 | 107 | 20/20 | 40.0s ± 3.0s | 7.0 | 219 | 20/20 | 73.1s ± 4.2s | $0.36 |
| Python | 32.9s ± 1.3s | 6.0 | 113 | 20/20 | 41.8s ± 4.5s | 7.1 | 235 | 20/20 | 74.6s ± 4.5s | $0.38 |
| JavaScript | 36.0s ± 3.5s | 6.0 | 123 | 20/20 | 45.1s ± 4.1s | 7.2 | 248 | 20/20 | 81.1s ± 5.0s | $0.39 |
| Go | 47.5s ± 34.5s | 7.7 | 143 | 20/20 | 54.1s ± 12.3s | 9.7 | 324 | 20/20 | 101.6s ± 37.0s | $0.50 |
| Java | 64.3s ± 31.1s | 8.7 | 152 | 20/20 | 51.2s ± 8.1s | 9.6 | 303 | 20/20 | 115.4s ± 34.4s | $0.50 |
| Rust | 53.6s ± 36.7s | 9.4 | 139 | 19/20 | 60.1s ± 19.2s | 10.1 | 303 | 19/20 | 113.7s ± 54.8s | $0.54 |
| Perl | 84.4s ± 43.1s | 9.2 | 173 | 20/20 | 45.7s ± 6.8s | 7.5 | 315 | 20/20 | 130.2s ± 44.2s | $0.55 |
| Python/mypy | 52.7s ± 8.3s | 9.2 | 171 | 20/20 | 72.6s ± 14.4s | 12.2 | 326 | 20/20 | 125.3s ± 19.0s | $0.57 |
| OCaml | 80.9s ± 28.8s | 11.2 | 111 | 20/20 | 47.1s ± 6.0s | 9.2 | 216 | 20/20 | 128.1s ± 28.9s | $0.58 |
| Lua | 96.4s ± 42.8s | 10.1 | 226 | 20/20 | 47.2s ± 5.2s | 8.1 | 398 | 20/20 | 143.6s ± 43.0s | $0.58 |
| Scheme | 66.7s ± 36.7s | 8.9 | 171 | 20/20 | 63.9s ± 10.0s | 10.6 | 310 | 20/20 | 130.6s ± 39.9s | $0.60 |
| TypeScript | 69.9s ± 18.8s | 12.2 | 149 | 20/20 | 63.1s ± 17.8s | 11.3 | 310 | 20/20 | 133.0s ± 29.4s | $0.62 |
| C | 65.0s ± 18.2s | 8.2 | 276 | 20/20 | 90.8s ± 39.1s | 13.7 | 517 | 20/20 | 155.8s ± 40.9s | $0.74 |
| Haskell | 74.3s ± 39.1s | 10.3 | 119 | 19/20 | 99.6s ± 32.4s | 16.4 | 224 | 20/20 | 174.0s ± 44.2s | $0.74 |
| Ruby/Steep | 105.0s ± 65.2s | 20.2 | 150 | 20/20 | 81.6s ± 8.8s | 17.2 | 304 | 20/20 | 186.6s ± 69.7s | $0.84 |

"Turns" is the number of API round-trips (tool call → result → next response) within a single prompt execution.

v1 (New Project) Time

[Figure: v1 time]

v1 shows the largest gap between languages. Python (32.9s) and Ruby (33.2s) lead, followed by JavaScript (36.0s). Ruby/Steep takes 105.0s — 3.2× slower than plain Ruby. Lua (96.4s) and OCaml (80.9s) are also slow.

v1 starts from an empty directory, so languages requiring project config files (Cargo.toml, package.json, etc.) incur additional overhead. Python, Ruby, and JavaScript only need to generate a single minigit file, which may partly explain the gap.

v2 (Feature Extension) Time

[Figure: v2 time]

In v2, the gap narrows. The top 3 remain Ruby (40.0s), Python (41.8s), JavaScript (45.1s). Perl (45.7s), OCaml (47.1s), and Lua (47.2s) follow closely.

Haskell is the slowest in v2, averaging 99.6s despite being among the most compact (224 LOC, behind only OCaml and Ruby); it appears to spend heavily on thinking tokens. C takes 90.8s, weighed down by its high LOC (517).

Type-checker overhead: Python/mypy is 1.6–1.7× slower than Python; Ruby/Steep is 2.0–3.2× slower than Ruby.
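These ratios come straight from the per-phase times in the breakdown table. A short sketch that reproduces them (the helper name is my own):

```python
# Average time per phase in seconds, copied from the breakdown table.
v1 = {"Python": 32.9, "Python/mypy": 52.7, "Ruby": 33.2, "Ruby/Steep": 105.0}
v2 = {"Python": 41.8, "Python/mypy": 72.6, "Ruby": 40.0, "Ruby/Steep": 81.6}


def overhead(times, checked, plain):
    """Slowdown factor of the type-checked variant relative to the plain one."""
    return round(times[checked] / times[plain], 1)


for label, times in (("v1", v1), ("v2", v2)):
    mypy_ratio = overhead(times, "Python/mypy", "Python")
    steep_ratio = overhead(times, "Ruby/Steep", "Ruby")
    print(f"{label}: Python/mypy {mypy_ratio}x, Ruby/Steep {steep_ratio}x")
```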

Data and Reproduction

All experiment code and results are available on GitHub:

mame / ai-coding-lang-bench

Which programming language is best for AI coding agents? Benchmarking 13 languages with Claude Code.


Per-run results are in report.md. Execution logs and generated source code are on the data branch.

Note: benchmark.rb uses --dangerously-skip-permissions, so if you want to reproduce the experiment, please be careful (I ran it inside Docker).

Discussion

What follows is my interpretation. I'm a Ruby committer, so please keep that bias in mind. I also haven't analyzed all the generated code in detail.

What causes the speed/cost differences?

This experiment can't pinpoint a single cause, and I don't think there is one. But we can form hypotheses from the trends:

  • Type system: Dynamic languages are consistently faster and more stable.
  • Conciseness: Shorter code generally means faster generation — but OCaml and Haskell are compact yet expensive, apparently due to high thinking-token usage.
  • Procedural vs. functional: Excluding the top 3, there isn't a large gap between procedural and functional languages. Notably, OCaml achieved 47.1s in v2, rivaling JavaScript (though OCaml can be written in a procedural style, so a pure functional comparison is difficult).
  • Language-specific difficulty: C's memory management and Rust's ownership model may add overhead.
  • AI familiarity: Python, Ruby, and JavaScript have vastly more training data. Scheme and Haskell likely have less, and the results reflect this. Ruby/Steep's larger overhead (2.0–3.2×) vs. Python/mypy (1.6–1.7×) may also reflect lower AI familiarity with Steep compared to mypy.

Most likely, these factors combine to produce the observed results.

Doesn't lack of types mean more bugs?

Possibly. The tests pass, but untested paths in dynamically typed languages may have type errors.

That said, type errors are among the easiest bugs to detect and fix. If an agent frequently introduced type errors without a type checker, it would likely introduce logic bugs at a similar rate — at which point the problem goes beyond type checking.

It's also worth noting that the only 3 failures out of 600 runs were in Rust (2) and Haskell (1) — both statically typed languages with unique concepts like ownership and monads. This may be coincidence, but types don't prevent all bugs.

A 2× difference isn't that big, is it?

I think it is.

In real-world development, you're constantly iterating: prompt → wait → think about the next task → prompt again. The difference between waiting 30 seconds and 60 seconds matters — not just in total time, but in focus and flow. Response time is critical in iterative development.

"Take longer to build something robust" is a reasonable argument, but when competitors are shipping at twice the speed, is waiting the right call? Development speed is itself a dimension of quality.

That said, if the difference shrinks to 1 second vs. 0.5 seconds in the future, then it truly won't matter.

The task is too small. Static typing should shine at larger scales.

I don't disagree. But designing a large-scale benchmark that's fair across 15 languages is quite challenging. Someone should try it.

Isn't ecosystem and runtime performance more important for language choice?

Absolutely. From a generation standpoint too — if you can leverage an ecosystem, there's less code to generate, which should be faster and cheaper. And if runtime speed is essential for your application, there's no reason to choose a slow dynamic language.

For this experiment, I intentionally chose a task with no external library dependencies to isolate language-level differences. The spec uses a custom hash instead of SHA-256 for this reason.
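To illustrate what a dependency-free hash can look like, here is a minimal sketch using 64-bit FNV-1a. This is an assumption for illustration only: the post does not say which hash the spec defines, and this is not claimed to be it. The point is that such a function needs only integer arithmetic, so every language in the comparison can implement it without a crypto library.

```python
# 64-bit FNV-1a constants (standard published values).
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> str:
    """Return the 64-bit FNV-1a hash of data as 16 hex digits.

    Illustrative stand-in only; not the hash from the mini-git spec.
    """
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK_64  # keep the value within 64 bits
    return f"{h:016x}"


print(fnv1a_64(b"hello mini-git\n"))
```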

Conclusion

I quantitatively evaluated which programming languages are best suited for code generation with Claude Code. At least for prototyping-scale tasks, Ruby, Python, and JavaScript appear to be the best fit.

Static typing may become advantageous at larger scales — someone should test this.

The classic strategy — start with a dynamic language, then migrate to a static one as the project matures — may still be the right call. Coding agents seem very capable at cross-language migration (needs verification), making this an increasingly realistic option.

Notes

  • Evaluated in March 2026. Given the pace of AI progress, results may look different in a few months.
  • This experiment was supported by the Claude for Open Source Program. Thanks to Anthropic for 6 months of free Claude Max 20x!
