Nate Voss

Posted on Apr 21

3 Things I Learned Benchmarking Claude, GPT-4o, and Gemini on Real Dev Work

#ai #tutorial #javascript #optimization

If you're still picking LLM providers by gut feeling, you're leaving money on the table. I ran 5 developer use cases through Claude 3.5 Sonnet, GPT-4o, and Gemini 2.0 Flash using PromptFuel to measure token usage and cost. The results? More interesting than "cheapest wins." Here's what I found.

The Setup

I took 5 tasks I actually do in PromptFuel development:

JSON schema validation prompt — catch malformed API responses
Code review feedback — multi-file analysis with context
Refactoring suggestion — optimize a chunky utility function
Bug diagnosis — trace through a stack trace with logs
Documentation generation — write API docs from code comments

Each task was run through all three models with identical input. I used PromptFuel's CLI to count tokens and calculate costs, because doing this manually is chaos. I rated output quality myself (subjectively, but honestly).
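For context, a per-call cost figure comes down to simple arithmetic: token counts multiplied by the provider's per-million-token rates, with input and output billed separately. Here's a minimal sketch of that calculation. The `Rates` shape, the `callCost` name, and the rates in the example are placeholders for illustration, not PromptFuel's internals or any provider's actual pricing.

```typescript
// Per-million-token rates in USD. Providers publish these on their pricing pages.
interface Rates {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of one call: input and output tokens are billed at separate rates.
function callCost(inputTokens: number, outputTokens: number, rates: Rates): number {
  return (
    (inputTokens / 1e6) * rates.inputPerMillion +
    (outputTokens / 1e6) * rates.outputPerMillion
  );
}

// Placeholder rates, NOT real pricing: swap in your provider's current numbers.
const exampleRates: Rates = { inputPerMillion: 1.0, outputPerMillion: 4.0 };

// 2,000 input tokens and 500 output tokens at the placeholder rates.
console.log(callCost(2000, 500, exampleRates).toFixed(4));
```

Output tokens usually cost several times more than input tokens, so a model that answers at length can cost more than its input count suggests.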

Use Case Breakdown

1. JSON Schema Validation

Input: Schema definition + malformed JSON sample + expected error message format

Token usage (input → output):

Claude Sonnet: 1,847 → 512 (cost: $0.0043)
GPT-4o: 2,156 → 487 (cost: $0.0082)
Gemini Flash: 1,923 → 501 (cost: $0.0001)

Quality: All three nailed it. Claude was most concise in its explanation. GPT-4o over-explained. Gemini was crisp and useful.

Token efficiency win: Gemini, by cost. Claude, by clarity per token.

2. Code Review (3 files, ~200 LOC)

Input: Three TypeScript modules + review instructions + examples of good feedback

Token usage:

Claude Sonnet: 4,231 → 891 (cost: $0.0147)
GPT-4o: 4,782 → 856 (cost: $0.0208)
Gemini Flash: 4,456 → 823 (cost: $0.0003)

Quality: Claude caught subtle issues I actually cared about. GPT-4o was thorough but verbose. Gemini gave surface-level feedback.

Token efficiency win: Gemini cheapest. Claude best output/token.

3. Refactoring Suggestion

Input: 80-line utility function + performance requirements + current bottleneck description

Token usage:

Claude Sonnet: 2,134 → 618 (cost: $0.0054)
GPT-4o: 2,445 → 602 (cost: $0.0110)
Gemini Flash: 2,287 → 587 (cost: $0.0002)

Quality: Claude's refactor was production-ready. GPT-4o suggested good ideas but with syntax issues. Gemini's suggestion worked but wasn't elegant.

Token efficiency win: Gemini cost, Claude quality.

4. Bug Diagnosis

Input: Stack trace (15 lines) + error logs (20 lines) + code snippet (40 lines) + fixes already attempted

Token usage:

Claude Sonnet: 2,856 → 445 (cost: $0.0071)
GPT-4o: 3,102 → 421 (cost: $0.0127)
Gemini Flash: 2,934 → 438 (cost: $0.0002)

Quality: Claude nailed it immediately. GPT-4o circled around the issue. Gemini flagged the right file but not the root cause.

Token efficiency win: Gemini cost, Claude accuracy.

5. Documentation Generation

Input: 12 functions with JSDoc comments + expected markdown format + examples

Token usage:

Claude Sonnet: 3,445 → 734 (cost: $0.0118)
GPT-4o: 3,821 → 689 (cost: $0.0182)
Gemini Flash: 3,567 → 712 (cost: $0.0004)

Quality: Claude's docs were complete and well-structured. GPT-4o's were good but needed minor cleanup. Gemini's docs were functional but missing details.

Token efficiency win: Gemini cost, Claude completeness.

The 3 Things I Learned

1. Cost-per-task != best value. Gemini Flash is comically cheap (about 98% less than GPT-4o in the runs above), but you get what you pay for. For high-stakes work (code review, bug diagnosis), Claude was worth the extra cents because I didn't have to iterate. For throwaway tasks (generating examples, formatting), Gemini's cost made its mediocrity acceptable.
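To make the "didn't have to iterate" point concrete, here's a rough sketch that folds retries and review time into the cost of a task. The API costs are the code review figures from above. The attempt counts, the two minutes of review per attempt, and the $60/hour rate are hypothetical numbers chosen only to show the shape of the trade-off.

```typescript
interface TaskRun {
  apiCostPerAttempt: number; // USD per model call
  attempts: number; // calls needed before the output is usable
}

// Total cost of getting one usable result, including the human time
// spent reading each attempt.
function costPerUsableResult(
  run: TaskRun,
  reviewMinutesPerAttempt: number,
  hourlyRate: number
): number {
  const humanCostPerAttempt = (reviewMinutesPerAttempt / 60) * hourlyRate;
  return run.attempts * (run.apiCostPerAttempt + humanCostPerAttempt);
}

// API costs from the code review run above; attempt counts are hypothetical.
const claudeReview: TaskRun = { apiCostPerAttempt: 0.0147, attempts: 1 };
const geminiReview: TaskRun = { apiCostPerAttempt: 0.0003, attempts: 2 };

// Hypothetical: 2 minutes of review per attempt at $60/hour.
console.log(costPerUsableResult(claudeReview, 2, 60).toFixed(4));
console.log(costPerUsableResult(geminiReview, 2, 60).toFixed(4));
```

Under those assumptions, one extra attempt costs far more in review time than the entire API price gap between the two models, which is why iteration count matters more than the per-call price for high-stakes work.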

2. Token count is not predictive of quality. All three models landed on similar token counts for the same task, but output quality varied widely. GPT-4o consistently counted the most input tokens for the identical prompt (about 9% to 17% more than Claude) without being proportionally better. Claude's responses were marginally the longest of the three, yet they carried the most useful signal per token. This matters: if you're optimizing for token count or cost alone, you'll pick the wrong model.
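The tokenizer gap is easy to check from the input counts listed in the breakdown. This sketch computes how many more input tokens GPT-4o counted than Claude for the same prompt. The numbers are copied from the five runs above; the type and helper names are just illustrative.

```typescript
interface InputCount {
  task: string;
  claude: number; // input tokens counted for Claude Sonnet
  gpt4o: number; // input tokens counted for GPT-4o
}

// Input token counts from the five runs above.
const inputCounts: InputCount[] = [
  { task: "JSON schema validation", claude: 1847, gpt4o: 2156 },
  { task: "Code review", claude: 4231, gpt4o: 4782 },
  { task: "Refactoring", claude: 2134, gpt4o: 2445 },
  { task: "Bug diagnosis", claude: 2856, gpt4o: 3102 },
  { task: "Documentation", claude: 3445, gpt4o: 3821 },
];

// Percentage by which `other` exceeds `baseline`.
function percentMore(baseline: number, other: number): number {
  return ((other - baseline) / baseline) * 100;
}

inputCounts.forEach(({ task, claude, gpt4o }) => {
  console.log(`${task}: GPT-4o +${percentMore(claude, gpt4o).toFixed(1)}% input tokens`);
});
```

On these five prompts the overhead ranges from roughly 9% to 17%, which is a difference in how each tokenizer splits the same text rather than a difference in what was sent.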

3. Real-world testing beats benchmarks. Which model "wins" depends on what the task demands. For documentation, where completeness matters, Claude wins. For a cheap validation pass on a throwaway check, Gemini wins. Generic "best model" articles don't capture this. You need to test your actual tasks.

How to Benchmark Yours

Here's the thing: this comparison is data, not law. Your tasks might weigh these factors differently. Let me show you how I tested this using PromptFuel.

# Install PromptFuel (if you haven't)
npm install -g promptfuel

# Create a test file with your prompt
cat > test-prompt.txt << 'EOF'
[your prompt here]
EOF

# Count tokens across models
pf count test-prompt.txt --model claude-3-5-sonnet
pf count test-prompt.txt --model gpt-4o
pf count test-prompt.txt --model gemini-2.0-flash

# Compare costs
pf count test-prompt.txt --compare

That --compare flag gives you a cost matrix. Takes 30 seconds. Beats guessing.

The real insight: run this for your specific use cases. A document summarizer might favor Claude. A high-throughput classification pipeline might favor Gemini. The only way to know is to test.
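Once you have cost and quality numbers for your own tasks, the selection rule can be as simple as "cheapest model that clears my quality bar." Here's a minimal sketch of that rule. The costs are the bug diagnosis figures from above, but the 1-5 quality scores and the thresholds are hypothetical stand-ins for whatever rating scheme you use.

```typescript
interface Candidate {
  model: string;
  costPerCall: number; // USD
  quality: number; // your own rating, here on a 1-5 scale
}

// Return the cheapest candidate whose quality meets the bar,
// or undefined if none does.
function cheapestAcceptable(
  candidates: Candidate[],
  minQuality: number
): Candidate | undefined {
  return candidates
    .filter((c) => c.quality >= minQuality)
    .sort((a, b) => a.costPerCall - b.costPerCall)[0];
}

// Costs from the bug diagnosis run; quality scores are hypothetical.
const bugDiagnosis: Candidate[] = [
  { model: "Claude Sonnet", costPerCall: 0.0071, quality: 5 },
  { model: "GPT-4o", costPerCall: 0.0127, quality: 3 },
  { model: "Gemini Flash", costPerCall: 0.0002, quality: 3 },
];

console.log(cheapestAcceptable(bugDiagnosis, 4)?.model); // high-stakes bar
console.log(cheapestAcceptable(bugDiagnosis, 3)?.model); // throwaway bar
```

Setting the bar per task, instead of picking one model for everything, is what lets a high-stakes task and a throwaway task land on different models.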

The Real Optimization

After picking your model, there's still money left on the table. Here's a before/after from actual PromptFuel code:

Before (unoptimized prompt):

You are an expert code reviewer. Review the following code for quality, security, 
and performance issues. Check for common bugs, suggest improvements, and rate the 
code from 1-10. Consider edge cases, error handling, and best practices. Be thorough 
and detailed in your feedback.

[400 tokens of instructions]
[200 tokens of examples]
[150 tokens of code to review]
Total: ~750 input tokens

After (optimized with PromptFuel):

Review code for quality, security, performance. Rate 1-10.

[Stripped redundant instructions]
[Examples reduced to 1 exemplar instead of 3]
[Code reformatted to remove whitespace]
Total: ~420 input tokens

Cost saved: ~$0.0012 per review on Claude. Run that 100 times a day, and you're saving $0.12/day, or about $44/year. Small? Yes. Multiplied by 50 internal tools? Now you're talking real money.
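The savings math is worth scripting so you can plug in your own volume. A minimal sketch, using the token totals and the per-review saving from the example above. The function names and the 365-day year are assumptions.

```typescript
// Share of input tokens removed by the optimization, as a percentage.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Yearly savings for one tool, given savings per call and daily call volume.
function annualSavings(
  savedPerCall: number,
  callsPerDay: number,
  daysPerYear = 365
): number {
  return savedPerCall * callsPerDay * daysPerYear;
}

// Token totals from the before/after prompts above.
console.log(`${percentReduction(750, 420).toFixed(0)}% fewer input tokens`);

// $0.0012 saved per review, 100 reviews a day, then scaled to 50 tools.
const perTool = annualSavings(0.0012, 100);
console.log(`per tool: $${perTool.toFixed(2)}/year`);
console.log(`across 50 tools: $${(perTool * 50).toFixed(2)}/year`);
```

If your tools only run on workdays, pass a smaller `daysPerYear` as the third argument to get a more conservative estimate.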

The Honest Take

Pick the model that gives you the output you need, then optimize the prompt. Stop optimizing for the wrong metric. Benchmarks are fun, but production bills are real.

If you're running this analysis for your own stuff, PromptFuel makes it stupidly easy. It's free, no API keys needed, runs locally. Just npm install -g promptfuel and compare. If you want the actual numbers from your prompts, run the test. Don't inherit my data — build your own.

What's your highest-volume LLM task? Test it. You might be surprised which model wins.

DEV Community
