We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.

#ai #llm #benchmark #programming

By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is: can they write code that's efficient — and does telling them to be efficient actually help?

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That's 200 API calls, $0.56 total. The results are... not what most prompt engineers would predict.

GPT-5.4 got the biggest boost from prompting (+0.20), with Qwen 3.6 Plus close behind (+0.17). For most of the other models, the "write efficient code" prompt was meaningless or actively harmful.

How the Metric Works

Each task has a known optimal token budget — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal_tokens / actual_tokens, capped at 1.0.

A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the "write efficient code" instruction actually changes behaviour.
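
To make the arithmetic concrete, here is a minimal sketch of the score as defined above. The function name is an assumption for illustration, not the benchmark's actual code. The token counts come from the button example.

```python
def efficiency_score(optimal_tokens: int, actual_tokens: int) -> float:
    """Return optimal_tokens / actual_tokens, capped at 1.0."""
    if actual_tokens <= 0:
        raise ValueError("actual_tokens must be positive")
    return min(1.0, optimal_tokens / actual_tokens)


# Button example from the text: a 70-token budget vs 340 tokens of repeated blocks.
buttons_repeated = efficiency_score(70, 340)
buttons_dry = efficiency_score(70, 70)

# An answer shorter than the budget is capped at 1.0 rather than rewarded further.
under_budget = efficiency_score(70, 50)

# Reading a score back as overhead: 1 / score is the multiple of the budget used.
overhead_063 = 1 / 0.63
overhead_043 = 1 / 0.43

print(round(buttons_repeated, 2), buttons_dry, under_budget)
print(round(overhead_063, 1), round(overhead_043, 1))
```

On this metric, the ten separate button blocks would score about 0.21, and the class-based version would score 1.0.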

The Leaderboard (Sorted by Prompted Efficiency)

#	Model	Unprompted	Prompted	Δ	Frugal	Cost	Correctness
🥇	GPT-5.4	0.43	0.63	+0.20	30%	$0.096	78% → 85%
🥈	Qwen 3.6 Plus	0.44	0.60	+0.17	40%	$0.158	78% → 87%
🥉	Gemma 4 31B	0.54	0.58	+0.04	50%	$0.003	92% both
4	DeepSeek Chat	0.51	0.55	+0.04	30%	$0.006	91% → 80%
5	Claude Sonnet 4	0.47	0.52	+0.04	40%	$0.121	92% both
6	LFM 2 24B A2B	0.54	0.47	-0.06	30%	$0.001	90% → 80%
7	Mistral Large 2411	0.54	0.46	-0.08	40%	$0.050	90% → 82%
8	Gemini 2.5 Flash	0.47	0.46	-0.01	50%	$0.020	92% → 90%
9	Cohere Command A	0.60	0.44	-0.17	40%	$0.071	90% → 82%
10	Kimi K2.6	0.34	0.43	+0.09	30%	$0.029	76% → 86%

What Stands Out

GPT-5.4 Is the Prompt Whisperer

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81 — went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38 — switched to an Enum). It's the only model in the batch where the "write efficient code" instruction consistently produces different (and better) output.
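
For readers who have not seen the pattern, the config-generation win probably looks something like the sketch below. The post does not show the task's real schema, so the service names, fields, and port numbering here are all hypothetical. The point is only the shape: one template plus a loop instead of twelve hand-written blocks.

```python
import json

# Hypothetical inputs: the benchmark's actual config schema is not shown in the post.
SERVICES = [f"service-{i:02d}" for i in range(1, 13)]
BASE_PORT = 8000


def build_configs(names, base_port=BASE_PORT):
    """Generate one config dict per service from a shared template."""
    template = {"log_level": "info", "retries": 3}
    return [
        {**template, "name": name, "port": base_port + i}
        for i, name in enumerate(names)
    ]


configs = build_configs(SERVICES)
print(json.dumps(configs[:2], indent=2))
```

The loop stays the same length whether there are 12 configs or 120, which is why it scores so much better on a token-based metric.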

The cost is worth noting: $0.10 for 20 runs (10 tasks × 2 phases) is not expensive in absolute terms, but it is the third-highest bill in this batch. Still, the efficiency gain is real.

Gemma 4 31B: The Quiet Winner

Half of Gemma 4's tasks were already "frugal": naturally efficient without being told. It scored 92% correctness in both phases at just $0.003 total. That's roughly a 30x cost advantage over GPT-5.4 ($0.003 vs $0.096), with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A: Prompting Backfires

Cohere Command A had the highest unprompted efficiency in the batch (0.60) — it naturally writes concise code. But when told "write efficient code," it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.

Lesson: if a model is already efficient, don't prompt it to be more efficient.

Qwen 3.6 Plus: Second Place, Slowest

Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took 26 minutes for 20 tasks — by far the slowest model. The efficiency gain is real (especially on html-from-data where it went from hardcoded rows to a map/join pattern), but you're waiting for it. Batch workloads only.
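
The benchmark task itself is JS/HTML, but the same map/join idea can be sketched in Python to keep the examples here in one language. The row fields below are made up for illustration. What matters is that 20 rows come from one expression rather than 20 hardcoded `<tr>` lines.

```python
from html import escape


def render_rows(rows):
    """Render each dict as a <tr>, joining generated cells instead of hardcoding rows."""
    return "\n".join(
        "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row.values()) + "</tr>"
        for row in rows
    )


# Hypothetical data: 20 rows, matching the task size described in the methodology.
rows = [{"name": f"Item {i}", "price": i * 10} for i in range(1, 21)]
table_html = f"<table>\n{render_rows(rows)}\n</table>"
print(table_html.splitlines()[1])
```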

The Kimi Surprise

Kimi K2.6 had the lowest unprompted efficiency (0.34, with verbose, boilerplate-heavy code) but posted the third-largest gain in the batch (+0.09). It still finished last, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.

Frugality: What Does It Mean?

"Frugal" means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50%: half their tasks were already efficient. GPT-5.4, DeepSeek Chat, LFM 2 24B A2B, and Kimi K2.6 were only 30% frugal. Of those, GPT-5.4 and Kimi K2.6 tightened up noticeably when prompted, DeepSeek Chat barely moved (+0.04), and LFM 2 24B A2B got worse (-0.06).

The Bigger Picture

Group	Models	Behaviour
Prompt-responsive	GPT-5.4, Qwen 3.6 Plus	Efficiency improves substantially with prompting
Prompt-neutral	Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6	Prompt has little effect (within ±0.04, except Kimi K2.6 at +0.09)
Prompt-antagonistic	LFM 2 24B A2B, Mistral Large 2411, Cohere Command A	Efficiency drops when prompted
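
The grouping in the table can be reproduced from the leaderboard deltas with two cutoffs. The cutoffs below (+0.10 and -0.05) are assumptions chosen so the reported deltas land in the groups listed above. They are not thresholds stated by the benchmark.

```python
# Reported deltas (prompted minus unprompted) from the leaderboard.
DELTAS = {
    "GPT-5.4": 0.20,
    "Qwen 3.6 Plus": 0.17,
    "Gemma 4 31B": 0.04,
    "DeepSeek Chat": 0.04,
    "Claude Sonnet 4": 0.04,
    "LFM 2 24B A2B": -0.06,
    "Mistral Large 2411": -0.08,
    "Gemini 2.5 Flash": -0.01,
    "Cohere Command A": -0.17,
    "Kimi K2.6": 0.09,
}


def classify(delta, up=0.10, down=-0.05):
    """Bucket a delta using assumed cutoffs (not taken from the benchmark)."""
    if delta >= up:
        return "prompt-responsive"
    if delta <= down:
        return "prompt-antagonistic"
    return "prompt-neutral"


groups = {}
for model, delta in DELTAS.items():
    groups.setdefault(classify(delta), []).append(model)

for name, members in groups.items():
    print(f"{name}: {', '.join(members)}")
```

Moving the upper cutoff down to +0.05 would shift Kimi K2.6 into the responsive group, so the boundaries are a judgment call.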

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.

If the prompt says "write efficient code" and the model responds by writing more tokens, something in the training signal is misaligned.

My Picks

Best prompted efficiency: GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.
Best value overall: Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.
Best natural efficiency: Cohere Command A — 0.60 unprompted. Don't prompt it, just let it work.
Most consistent: Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.
Skip if you're in a hurry: Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.
Watch list: Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.

Methodology

Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: efficiency_ratio = optimal_tokens / actual_tokens (capped at 1.0). Correctness scored against expected output patterns.
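
One plausible reading of "expected output patterns" is a checklist of regular expressions per task, with correctness as the fraction that match. The sketch below assumes that reading. The function name and the patterns are hypothetical and are not taken from the benchmark.

```python
import re


def correctness(output: str, patterns: list[str]) -> float:
    """Fraction of expected regex patterns found somewhere in the model output."""
    if not patterns:
        raise ValueError("need at least one pattern")
    hits = sum(1 for pattern in patterns if re.search(pattern, output))
    return hits / len(patterns)


# Hypothetical checklist for the magic-strings task: import Enum, define an
# Enum subclass, and reference a member via attribute access.
patterns = [r"from enum import Enum", r"class \w+\(Enum\)", r"\.ACTIVE\b"]
sample = "from enum import Enum\n\nclass Status(Enum):\n    ACTIVE = 'active'\n"

score = correctness(sample, patterns)
print(f"{score:.0%}")
```

The sample defines the Enum but never uses `Status.ACTIVE`, so it matches two of the three patterns.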

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.
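
The run matrix behind those 200 calls can be sketched as follows. Only three task identifiers appear in this post (config-generation, html-from-data, magic-strings). The other task names, the exact wording of the efficiency suffix, and the placeholder prompts are assumptions for illustration.

```python
from itertools import product

MODELS = [
    "GPT-5.4", "Qwen 3.6 Plus", "Gemma 4 31B", "DeepSeek Chat", "Claude Sonnet 4",
    "LFM 2 24B A2B", "Mistral Large 2411", "Gemini 2.5 Flash", "Cohere Command A",
    "Kimi K2.6",
]
# Task ids other than html-from-data, config-generation, magic-strings are hypothetical.
TASKS = [
    "styled-buttons", "html-from-data", "bulk-rename", "form-validation",
    "parametrized-tests", "unit-conversion", "sql-reporting", "config-generation",
    "magic-strings", "middleware-decorator",
]
PHASES = ("unprompted", "prompted")
# Assumed wording: the post only says models were told to write clean, DRY, efficient code.
EFFICIENCY_SUFFIX = "\n\nWrite clean, DRY, efficient code."
SAMPLING = {"temperature": 0.1, "max_tokens": 600}


def build_runs(task_prompts):
    """Build one request spec per (model, phase, task) combination."""
    runs = []
    for model, phase, task in product(MODELS, PHASES, TASKS):
        prompt = task_prompts[task]
        if phase == "prompted":
            prompt += EFFICIENCY_SUFFIX  # appended, as described in the methodology
        runs.append({"model": model, "phase": phase, "task": task,
                     "prompt": prompt, **SAMPLING})
    return runs


task_prompts = {task: f"<prompt for {task}>" for task in TASKS}
runs = build_runs(task_prompts)
print(len(runs))
```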

Full results: benchmarks.workswithagents.dev