Yusuke Endoh

Posted on Mar 5

Which Programming Language Is Best for Claude Code?

#ai #programming #llm

TL;DR

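I had Claude Code implement a very simplified version of Git in 13 languages (15 configurations, counting type-checked variants of Python and Ruby). Ruby, Python, and JavaScript were the fastest, cheapest, and most stable. Statically typed languages and type-checked variants took 1.4–2.6× as long as Ruby and cost correspondingly more.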

Introduction

Which programming language is best suited for AI coding agents?

"Static typing prevents AI hallucination bugs!"
"No, skipping type annotations saves tokens!"

There's plenty of qualitative debate, but quantitative data is scarce. So I ran an experiment.

Experiment

I asked Claude Code to implement a "mini-git" (a simplified version of Git) in various languages, and measured the time and cost for each. Linus Torvalds famously built the first version of Git in about two weeks, so it seemed like a good task.

The task was split into two phases:

v1 (New project): Implement init, add, commit, and log.
v2 (Feature extension): Add status, diff, checkout, and reset.

The prompt was simply: "Read SPEC-v1.txt, implement it, and make sure test-v1.sh passes." v2 was analogous.
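
To give a feel for the scale of the v1 task, here is a rough Python sketch of the four commands. It is not the benchmark's spec or any of the generated solutions: the `.minigit` directory name, the JSON index, and the commit layout are all assumptions made for illustration, and it uses SHA-256 from the standard library where the real spec defines its own hash.

```python
import hashlib
import json
from pathlib import Path

REPO_DIR = ".minigit"  # hypothetical name, not taken from the benchmark spec


def _repo(root):
    return Path(root) / REPO_DIR


def _store(root, data):
    """Save bytes under their SHA-256 digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    (_repo(root) / "objects" / digest).write_bytes(data)
    return digest


def init(root):
    """Create an empty repository: object store, empty index, no HEAD commit."""
    (_repo(root) / "objects").mkdir(parents=True)
    (_repo(root) / "index.json").write_text("{}")
    (_repo(root) / "HEAD").write_text("")


def add(root, filename):
    """Stage a file by recording its content digest in the index."""
    index_path = _repo(root) / "index.json"
    index = json.loads(index_path.read_text())
    index[filename] = _store(root, (Path(root) / filename).read_bytes())
    index_path.write_text(json.dumps(index))


def commit(root, message):
    """Snapshot the index as a commit object that points at its parent."""
    record = {
        "parent": (_repo(root) / "HEAD").read_text(),
        "message": message,
        "files": json.loads((_repo(root) / "index.json").read_text()),
    }
    digest = _store(root, json.dumps(record, sort_keys=True).encode())
    (_repo(root) / "HEAD").write_text(digest)
    return digest


def log(root):
    """Return commit messages from newest to oldest by following parents."""
    messages = []
    digest = (_repo(root) / "HEAD").read_text()
    while digest:
        record = json.loads((_repo(root) / "objects" / digest).read_text())
        messages.append(record["message"])
        digest = record["parent"]
    return messages
```

Calling `init`, then `add` and `commit` twice, and finally `log` returns the two messages newest first. The real task also has to satisfy a test script and a command-line interface, which this sketch leaves out.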

The languages compared:

Category	Languages
Dynamic	Python, Ruby, JavaScript, Perl, Lua
Dynamic + type checker	Python/mypy, Ruby/Steep
Static	TypeScript, Go, Rust, C, Java
Functional	Scheme (dynamic), OCaml (static), Haskell (static)

In the Python/mypy configuration, the agent writes fully type-annotated Python verified with mypy --strict. In Ruby/Steep, it writes RBS type signatures verified with steep check. These allow direct comparison of type-checking overhead within the same language.

Each configuration was run 20 times per phase. The model was Claude Opus 4.6 (high effort).

Results

Average results across 20 trials, sorted by average cost.

Language	Tests passed (v1+v2)	Time (v1+v2)	Avg. cost	LOC (v2)
Ruby	40/40	73.1s ± 4.2s	$0.36	219
Python	40/40	74.6s ± 4.5s	$0.38	235
JavaScript	40/40	81.1s ± 5.0s	$0.39	248
Go	40/40	101.6s ± 37.0s	$0.50	324
Java	40/40	115.4s ± 34.4s	$0.50	303
Rust	38/40	113.7s ± 54.8s	$0.54	303
Perl	40/40	130.2s ± 44.2s	$0.55	315
Python/mypy	40/40	125.3s ± 19.0s	$0.57	326
OCaml	40/40	128.1s ± 28.9s	$0.58	216
Lua	40/40	143.6s ± 43.0s	$0.58	398
Scheme	40/40	130.6s ± 39.9s	$0.60	310
TypeScript	40/40	133.0s ± 29.4s	$0.62	310
C	40/40	155.8s ± 40.9s	$0.74	517
Haskell	39/40	174.0s ± 44.2s	$0.74	224
Ruby/Steep	40/40	186.6s ± 69.7s	$0.84	304

Out of 600 runs (15 configurations × 2 phases × 20 trials), only 3 failed (tests did not pass). The failures were Rust (2) and Haskell (1). In one of the Rust failure logs, the agent claimed "the tests are wrong." Since all other Rust trials succeeded, this appears to be a hallucination.

Total Time and Cost

Dot plots of total time (v1 + v2):

And cost:

Ruby, Python, and JavaScript are the top 3: 73–81 seconds, $0.36–0.39, with low standard deviations — fast and stable.

From 4th place onward (Go, Rust, Java), variance increases sharply. Go averages 102s but with ±37s of spread.

Time and cost are strongly correlated:
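
As a rough check that uses only the averages from the summary table (not the per-run data), the Pearson correlation between total time and average cost can be computed directly. The `pearson` helper below is written out by hand for illustration; the numbers are copied from the table above.

```python
from math import sqrt

# (total time in seconds, average cost in USD), copied from the summary table
RESULTS = {
    "Ruby": (73.1, 0.36),
    "Python": (74.6, 0.38),
    "JavaScript": (81.1, 0.39),
    "Go": (101.6, 0.50),
    "Java": (115.4, 0.50),
    "Rust": (113.7, 0.54),
    "Perl": (130.2, 0.55),
    "Python/mypy": (125.3, 0.57),
    "OCaml": (128.1, 0.58),
    "Lua": (143.6, 0.58),
    "Scheme": (130.6, 0.60),
    "TypeScript": (133.0, 0.62),
    "C": (155.8, 0.74),
    "Haskell": (174.0, 0.74),
    "Ruby/Steep": (186.6, 0.84),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


times = [t for t, _ in RESULTS.values()]
costs = [c for _, c in RESULTS.values()]
r = pearson(times, costs)
print(f"Pearson r between total time and average cost: {r:.2f}")
```

On these 15 averages the coefficient comes out well above 0.9. Because it is computed on per-configuration means, it says nothing about run-to-run variation within a language.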

Lines of Code

LOC after v2 completion:

OCaml (216), Ruby (219), and Haskell (224) are the most compact. C stands out at 517 lines.

Interestingly, fewer LOC does not imply faster or cheaper generation. OCaml and Haskell are compact but mid-to-low in speed and cost efficiency.
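
One crude way to see this is to divide average cost by final LOC for the four languages named above. This is only a back-of-the-envelope sketch: the cost covers both phases while LOC is the v2 total, so "cost per 100 lines" is an illustrative ratio made up for this example, not a metric from the benchmark.

```python
# (average cost in USD for v1+v2, LOC after v2), copied from the summary table
COST_AND_LOC = {
    "OCaml": (0.58, 216),
    "Ruby": (0.36, 219),
    "Haskell": (0.74, 224),
    "C": (0.74, 517),
}


def cost_per_100_loc(language):
    """Average cost divided by final line count, scaled to 100 lines."""
    cost, loc = COST_AND_LOC[language]
    return cost / loc * 100


# Print the languages from most to least expensive per line
for language in sorted(COST_AND_LOC, key=cost_per_100_loc, reverse=True):
    print(f"{language:8s} ${cost_per_100_loc(language):.2f} per 100 LOC")
```

Haskell and OCaml come out as the most expensive per line of the four, while C is the cheapest per line despite its high total cost. That is consistent with the compact functional languages costing more for reasons other than output length.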

v1/v2 Detailed Results

Breakdown by phase:

Language	v1 Time	v1 Turns	v1 LOC	v1 Tests	v2 Time	v2 Turns	v2 LOC	v2 Tests	Total Time	Avg. Cost
Ruby	33.2s ± 2.5s	6.0	107	20/20	40.0s ± 3.0s	7.0	219	20/20	73.1s ± 4.2s	$0.36
Python	32.9s ± 1.3s	6.0	113	20/20	41.8s ± 4.5s	7.1	235	20/20	74.6s ± 4.5s	$0.38
JavaScript	36.0s ± 3.5s	6.0	123	20/20	45.1s ± 4.1s	7.2	248	20/20	81.1s ± 5.0s	$0.39
Go	47.5s ± 34.5s	7.7	143	20/20	54.1s ± 12.3s	9.7	324	20/20	101.6s ± 37.0s	$0.50
Java	64.3s ± 31.1s	8.7	152	20/20	51.2s ± 8.1s	9.6	303	20/20	115.4s ± 34.4s	$0.50
Rust	53.6s ± 36.7s	9.4	139	19/20	60.1s ± 19.2s	10.1	303	19/20	113.7s ± 54.8s	$0.54
Perl	84.4s ± 43.1s	9.2	173	20/20	45.7s ± 6.8s	7.5	315	20/20	130.2s ± 44.2s	$0.55
Python/mypy	52.7s ± 8.3s	9.2	171	20/20	72.6s ± 14.4s	12.2	326	20/20	125.3s ± 19.0s	$0.57
OCaml	80.9s ± 28.8s	11.2	111	20/20	47.1s ± 6.0s	9.2	216	20/20	128.1s ± 28.9s	$0.58
Lua	96.4s ± 42.8s	10.1	226	20/20	47.2s ± 5.2s	8.1	398	20/20	143.6s ± 43.0s	$0.58
Scheme	66.7s ± 36.7s	8.9	171	20/20	63.9s ± 10.0s	10.6	310	20/20	130.6s ± 39.9s	$0.60
TypeScript	69.9s ± 18.8s	12.2	149	20/20	63.1s ± 17.8s	11.3	310	20/20	133.0s ± 29.4s	$0.62
C	65.0s ± 18.2s	8.2	276	20/20	90.8s ± 39.1s	13.7	517	20/20	155.8s ± 40.9s	$0.74
Haskell	74.3s ± 39.1s	10.3	119	19/20	99.6s ± 32.4s	16.4	224	20/20	174.0s ± 44.2s	$0.74
Ruby/Steep	105.0s ± 65.2s	20.2	150	20/20	81.6s ± 8.8s	17.2	304	20/20	186.6s ± 69.7s	$0.84

"Turns" is the number of API round-trips (tool call → result → next response) within a single prompt execution.

v1 (New Project) Time

v1 shows the largest gap between languages. Python (32.9s) and Ruby (33.2s) lead, followed by JavaScript (36.0s). Ruby/Steep takes 105.0s — 3.2× slower than plain Ruby. Lua (96.4s) and OCaml (80.9s) are also slow.

v1 starts from an empty directory, so languages requiring project config files (Cargo.toml, package.json, etc.) incur additional overhead. Python, Ruby, and JavaScript only need to generate a single minigit file, which may partly explain the gap.

v2 (Feature Extension) Time

In v2, the gap narrows. The top 3 remain Ruby (40.0s), Python (41.8s), JavaScript (45.1s). Perl (45.7s), OCaml (47.1s), and Lua (47.2s) follow closely.

Haskell is the slowest in v2, averaging 99.6s despite having among the fewest LOC (224); it appears to spend heavily on thinking tokens. C takes 90.8s, weighed down by its high LOC (517).

Type-checker overhead: across the two phases, Python/mypy takes 1.6–1.7× as long as plain Python, and Ruby/Steep takes 2.0–3.2× as long as plain Ruby.
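
These ratios follow directly from the per-phase averages in the breakdown table. A small sketch of the arithmetic, with the times copied from that table and the `overhead` helper introduced here purely for illustration:

```python
# (v1 time, v2 time) in seconds, copied from the per-phase breakdown table
TIMES = {
    "Python": (32.9, 41.8),
    "Python/mypy": (52.7, 72.6),
    "Ruby": (33.2, 40.0),
    "Ruby/Steep": (105.0, 81.6),
}


def overhead(plain, checked):
    """Per-phase time ratio of the type-checked variant to the plain language."""
    return tuple(
        round(checked_time / plain_time, 1)
        for plain_time, checked_time in zip(TIMES[plain], TIMES[checked])
    )


print("Python/mypy vs Python (v1, v2):", overhead("Python", "Python/mypy"))
print("Ruby/Steep vs Ruby (v1, v2):", overhead("Ruby", "Ruby/Steep"))
```

For mypy the ratio is slightly higher in v2 than in v1, whereas for Steep most of the overhead falls in v1.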

Data and Reproduction

All experiment code and results are available on GitHub:

mame / ai-coding-lang-bench

Per-run results are in report.md. Execution logs and generated source code are on the data branch.

Note: benchmark.rb uses --dangerously-skip-permissions, so if you want to reproduce the experiment, please be careful (I ran it inside Docker).

Discussion

What follows is my interpretation. I'm a Ruby committer, so please keep that bias in mind. I also haven't analyzed all the generated code in detail.

What causes the speed/cost differences?

This experiment can't pinpoint a single cause, and I don't think there is one. But we can form hypotheses from the trends:

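Type system: The three fastest languages are all dynamically typed, and adding a type checker to Python or Ruby made generation 1.6–3.2× slower. Dynamic typing alone does not guarantee speed, though: Perl, Lua, and Scheme landed mid-table with high variance.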
Conciseness: Shorter code generally means faster generation — but OCaml and Haskell are compact yet expensive, apparently due to high thinking-token usage.
Procedural vs. functional: Excluding the top 3, there isn't a large gap between procedural and functional languages. Notably, OCaml achieved 47.1s in v2, rivaling JavaScript (though OCaml can be written in a procedural style, so a pure functional comparison is difficult).
Language-specific difficulty: C's memory management and Rust's ownership model may add overhead.
AI familiarity: Python, Ruby, and JavaScript likely have vastly more training data. Scheme and Haskell likely have less, and the results are consistent with this. Ruby/Steep's larger overhead (2.0–3.2×) vs. Python/mypy (1.6–1.7×) may also reflect lower AI familiarity with Steep compared to mypy.

Most likely, these factors combine to produce the observed results.

Doesn't lack of types mean more bugs?

Possibly. The tests pass, but untested paths in dynamically typed languages may have type errors.

That said, type errors are among the easiest bugs to detect and fix. If an agent frequently introduced type errors without a type checker, it would likely introduce logic bugs at a similar rate — at which point the problem goes beyond type checking.

It's also worth noting that the only 3 failures out of 600 runs were in Rust (2) and Haskell (1) — both statically typed languages with unique concepts like ownership and monads. This may be coincidence, but types don't prevent all bugs.

A 2× difference isn't that big, is it?

I think it is.

In real-world development, you're constantly iterating: prompt → wait → think about the next task → prompt again. The difference between waiting 30 seconds and 60 seconds matters — not just in total time, but in focus and flow. Response time is critical in iterative development.

"Take longer to build something robust" is a reasonable argument, but when competitors are shipping at twice the speed, is waiting the right call? Development speed is itself a dimension of quality.

That said, if the difference shrinks to 1 second vs. 0.5 seconds in the future, then it truly won't matter.

The task is too small. Static typing should shine at larger scales.

I don't disagree. But designing a large-scale benchmark that's fair across 15 languages is quite challenging. Someone should try it.

Isn't ecosystem and runtime performance more important for language choice?

Absolutely. From a generation standpoint too — if you can leverage an ecosystem, there's less code to generate, which should be faster and cheaper. And if runtime speed is essential for your application, there's no reason to choose a slow dynamic language.

For this experiment, I intentionally chose a task with no external library dependencies to isolate language-level differences. The spec uses a custom hash instead of SHA-256 for this reason.

Conclusion

I quantitatively evaluated which programming languages are best suited for code generation with Claude Code. At least for prototyping-scale tasks, Ruby, Python, and JavaScript appear to be the best fit.

Static typing may become advantageous at larger scales — someone should test this.

The classic strategy — start with a dynamic language, then migrate to a static one as the project matures — may still be the right call. Coding agents seem very capable at cross-language migration (needs verification), making this an increasingly realistic option.

Notes

Evaluated in March 2026. Given the pace of AI progress, results may look different in a few months.
This experiment was supported by the Claude for Open Source Program. Thanks to Anthropic for 6 months of free Claude Max 20x!

Top comments (13)

Bruce Hauman • Mar 6

Obviously I'm curious about my language of choice, Clojure. So I looked at Scheme as a possible proxy, and then I thought: if the test harness didn't have hooks to fix paren errors, that would heavily throw the numbers off. Any Lisp without some kind of paren remediation will suffer under this test.

But the bigger point is that users of all relatively niche languages have probably built prompts and tools to overcome trained-in disadvantages.

Ben Munat • Mar 5

No Elixir? :-(

Boris Barroso • Mar 5

It should be the best according to autocodebench.github.io/

Yusuke Endoh • Mar 6

I tried Elixir and Kotlin (and Ruby for comparison) just twice, but they seemed completely inadequate in this benchmark.

	run1 total time	run1 total cost	run2 total time	run2 total cost
Elixir	166.6s	$0.71	392.4s	$1.20
Kotlin	252.5s	$0.89	145.4s	$0.71
Ruby	77.6s	$0.38	80.9s	$0.38

When I get time, I'd like to add more languages and rerun the benchmark to get full results.

Just for reference :-) x.com/mametter/status/202978392925...

Bazyli Brzóska • Mar 20

The results are interesting, but have a large bias towards greenfield projects.
I think it would be worth benchmarking a task that modifies complex existing projects (e.g. feature or bug fix). That's the place where things like static typing can help significantly with discovery and validation.

Harjot Singh • Jun 1

it's interesting to see how statically typed languages lag in both speed and cost for tasks like implementing a simplified git. at moonshift, you can get a full next.js + postgres + auth app deployed in about 7 minutes, and you keep the code on your github. if you're curious, i can set you up for a free build to test it out.

J.R. Hill • Mar 6

Any qualitative analysis?

2x speedup is potentially the difference between weeks and months... but if the code is 10x more convoluted, it will be 10x more difficult to make future changes (even with AI assistance), and on a less trivial project in the long run, bring a net productivity loss.

Also I worry it might be naive to categorize Rust and Haskell failing tests as "bugs." One of the major selling points of more rigorous static analysis (including static typing) is that you catch errors early, instead of end users facing them in production. Shipping code fast and cheap is great, but if it's at the expense of customer experience then everything of value is lost.

Jesús Gómez • Apr 16

The Clojure ecosystem prides itself on being precise in the words used for domain concepts, making Clojure code particularly dense in the wild (meaning/size).

I bet that is something you would like to trial.

Peter Marreck • Mar 30 • Edited

You should have included Zig. I've had excellent results with it. Its semantics seem quite understandable by the LLM and it has enough footgun prevention to prevent many (but not all) classes of failure mode.

I understand why Elixir didn't work out: it's not great at math, and it specifically has a weakness replicating the behavior that you get with bit-constrained number types in lower-level languages.

AI Agent Digest • Mar 10

This is exactly the kind of empirical work the AI coding space needs more of. The finding that dynamic languages are 1.4-2.6x faster and cheaper isn't surprising intuitively -- less boilerplate means fewer tokens -- but having the numbers across 600 runs with error bars is valuable.