Shrijal Acharya for Tensorlake

Posted on • Originally published at tensorlake.ai

🔥 Claude Opus 4.5 vs GPT 5.2 High vs Gemini 3 Pro: Production Coding Test ✅

Okay, so right now the WebDev leaderboard on LMArena is basically owned by the big three: Claude Opus 4.5 from Anthropic, GPT-5.2-codex (high) from OpenAI, and finally everybody's favorite, Gemini 3 Pro from Google.

LLMDev models ranking

So, I grabbed these three and put them into the same existing project (over 8K stars and 50K+ LOC) and asked them to build a couple of real features like a normal dev would.

Same repo. Same prompts. Same constraints.

For each task, I took the best result out of three runs per model to keep things fair.

Then I compared what they actually did: code quality, how much hand-holding they needed, and whether the feature even worked in the end.

⚠️ NOTE: Don't take the results of this test as a hard rule. It's a small set of real-world coding tasks that shows how each model did for me in this exact setup, and gives you a rough sense of how the top three models compare on the same work.


TL;DR

If you want a quick take, here’s how the three models performed in our tests:

  • Claude Opus 4.5 was the most consistent overall. It shipped working results for both tasks, and the UI polish was the best of the three. The main downside is cost. If they find a way to achieve this performance while reducing cost, it will actually be over for most other models.
  • GPT-5.2-codex (high) was one of the best, but noticeably slower because of the high reasoning setting. The code quality and structure were great; it just needed more patience than the other two in this repo.
  • Gemini 3 Pro was the most efficient. Both tasks worked, but the output often felt like the minimum viable version, especially on the analytics dashboard.

💡 If you want the safest pick for real “ship a feature in a big repo” work, Opus 4.5 felt the most reliable in my runs. If you care about speed and cost and you’re okay polishing UI yourself, Gemini 3 Pro is a solid bet.


Test Workflow

For the test, we will use the following CLI coding agents:

  • Claude Opus 4.5: Claude Code (Anthropic’s terminal-based agentic coding tool)
  • Gemini 3 Pro: Gemini CLI
  • GPT-5.2 High: Codex CLI

Here’s the repo used for the entire test: iib0011/omni-tools

We will check the models on two different tasks:

  1. Task 1: Add a global Action Palette (Ctrl + K)

Each model is asked to create a global action menu that opens with a keyboard shortcut. This feature expands on the current search by adding actions, global state, and keyboard navigation. This task checks how well the model understands current UX patterns and avoids repetition without breaking what's already in place.

  2. Task 2: Tool Usage Analytics + Insights Dashboard

Each model had to add real usage tracking across the app, persist it locally, and then build an analytics dashboard that shows things like the most used tools, recent activity, and basic filters.
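
To make the persistence part of this task concrete, here's a minimal sketch of the kind of local tracking it implies. This is my own illustration (not any model's output), and the `UsageEvent` shape and storage key are assumptions:

```typescript
// Hypothetical shape for one usage record; each model structured this differently.
type UsageEvent = { toolId: string; timestamp: number };

const STORAGE_KEY = 'tool-usage-events'; // assumed key, not taken from the repo

export function loadUsageEvents(): UsageEvent[] {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as UsageEvent[]) : [];
}

export function recordToolUsage(toolId: string): void {
  const events = loadUsageEvents();
  events.push({ toolId, timestamp: Date.now() });
  localStorage.setItem(STORAGE_KEY, JSON.stringify(events));
}
```

The dashboard part then boils down to reading those events back and aggregating them (most used, most recent, filtered by date range, and so on).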

We’ll compare code quality, token usage, cost, and time to complete the build.

💡 NOTE: I will share the source code changes for each task by each model as a .patch file. That way, you can easily view them locally by cloning the repository and applying the patch with git apply <patch_file_name>. This makes sharing the changes much easier.


Real-world Coding Tests

Test 1: Add a global Action Palette (Ctrl + K)

The setup is simple: all models start from the same base commit and follow the same prompt.

And as mentioned, I evaluated each model's response based on the best of its three runs.

Let's start off the test with something interesting:

Here's the prompt used:

This project already has a search input on the home page that lets users find tools. I want to add an improved, global version of this idea that works as an **Action Palette**, similar to what you see in editors like VS Code.

**What to build**

* Pressing **Ctrl + K** (or Cmd + K on macOS) should open a centered action palette overlay from anywhere in the app.
* The palette should support:
  * Searching and navigating to tools (reuse existing tool metadata)
  * Executing actions, such as:

    * Toggle dark mode
    * Switch language
    * Toggle user type filter (General / Developer)
    * Navigate to Home and Bookmarks
    * Clear recently used tools

* Fully keyboard-driven experience:

  * Type to filter
  * Arrow keys to navigate
  * Enter to execute
  * Escape to close

**Notes**

* This should not replace the existing home page search. Think of it as a more powerful, global version that combines navigation and actions.
* The implementation should follow existing patterns, styling, and state management used in the codebase.
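Before getting into the results, here's a rough TypeScript/React sketch of what the keyboard-driven part of this prompt involves. It's purely illustrative and not taken from any model's output; `PaletteItem`, `setOpen`, and `close` are names I made up for the example:

```typescript
import { useEffect, useState } from 'react';
import type { KeyboardEvent as ReactKeyboardEvent } from 'react';

// Hypothetical shape for one palette entry (a tool link or an action).
type PaletteItem = { id: string; label: string; run: () => void };

// Global Ctrl+K / Cmd+K shortcut: toggles the palette from anywhere in the app.
export function useActionPaletteShortcut(
  setOpen: (update: (prev: boolean) => boolean) => void
) {
  useEffect(() => {
    const onKeyDown = (event: KeyboardEvent) => {
      if ((event.ctrlKey || event.metaKey) && event.key.toLowerCase() === 'k') {
        event.preventDefault(); // keep the browser from hijacking Ctrl+K
        setOpen((prev) => !prev);
      }
    };
    window.addEventListener('keydown', onKeyDown);
    return () => window.removeEventListener('keydown', onKeyDown);
  }, [setOpen]);
}

// In-palette navigation: arrows move the highlight, Enter runs the item, Escape closes.
export function usePaletteNavigation(items: PaletteItem[], close: () => void) {
  const [index, setIndex] = useState(0);

  const onKeyDown = (event: ReactKeyboardEvent) => {
    if (event.key === 'ArrowDown') {
      event.preventDefault();
      setIndex((i) => Math.min(i + 1, items.length - 1));
    } else if (event.key === 'ArrowUp') {
      event.preventDefault();
      setIndex((i) => Math.max(i - 1, 0));
    } else if (event.key === 'Enter' && items[index]) {
      items[index].run();
      close();
    } else if (event.key === 'Escape') {
      close();
    }
  };

  return { index, onKeyDown };
}
```

The hard part for the models isn't this boilerplate; it's wiring those hooks into the existing theming, i18n, routing, and tool metadata without duplicating the home page search.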

GPT-5.2-Codex (high)

GPT-5.2 handled this surprisingly well. The implementation was solid end to end, and it basically one-shotted the entire feature set, including i18n support, without needing multiple correction passes.

That said, it did take a bit longer than some other models (~20 minutes), which is expected since reasoning was explicitly set to high. You can clearly see the model spending more time thinking through architecture, naming, and edge cases rather than rushing to output code. The trade-off felt worth it here.

gpt 5.2 high model timing to finish a task

The token usage was noticeably higher due to the reasoning set to high, but the output code reflected that.

Here's the demo:

You can find the code it generated here: GPT-5.2 High Code

  • Cost: ~$0.90–1.00
  • Duration: ~20 minutes (API time)
  • Code Changes: +540 lines, minimal removals
  • Token Usage:
    • Total: ~203k
    • Input: ~140k (+ cached context)
    • Output: ~64k
    • Reasoning tokens: ~47k

💡 NOTE: I ran the exact same prompt with the same model using the default (medium) reasoning level. The difference was honestly massive. With reasoning set to high, the quality of the code, structure, and pretty much everything jumps by miles. It’s not even a fair comparison.

gpt 5.2 model token usage to finish a task

Claude Opus 4.5

Claude went all in and planned out several different strategies up front. It ran into build issues at the start, but kept re-running the build until it had fixed every build and lint error.

claude opus 4.5 build error

The entire run took about 7 minutes 50 seconds, the fastest of the three for this test. Every feature worked as asked, and the UI looked super nice, exactly how I expected.

Here's the demo:

You can find the code it generated here: Claude Opus 4.5 Code

To be honest, this exceeded my expectations; even the i18n texts are added and displayed in the UI just as expected. Absolute cinema!

  • Cost: $0.94
  • Duration: 7 min 50 sec (API Time)
  • Code Changes: +540 lines, -9 lines

claude opus 4.5 token usage to finish a task

Gemini 3 Pro

Gemini 3 got it working, but it's clearly not on the same level as GPT-5.2 High or Claude Opus 4.5. The UI it built is fine and totally usable, but it feels a bit barebones, and you don't get many choices in the palette compared to the other two.

One clear miss is that language switching does not show up inside the action palette at all, which makes the i18n support feel incomplete even though translations technically exist.

Here's the demo:

You can find the code it generated here: Gemini 3 Pro Code

  • Cost: Low (helped significantly by cache reads)
  • Duration: ~10 minutes 49 seconds (API Time)
  • Code Changes: +428 lines, -65 lines
  • Token Usage:
    • Input: ~79k
    • Cache Reads: ~536k
    • Output: ~10.7k
    • Savings: ~87% of input tokens served from cache
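
For what it's worth, that savings figure lines up with the numbers above: 536k cache reads out of roughly 615k total input (536k + 79k) comes to about 87%.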

gemini 3 pro token usage to finish a task

Overall, Gemini 3 lands in a very clear third place here. It works, the UI looks fine, and nothing is completely broken, but compared to the depth, completeness, and polish of GPT-5.2 High and Claude Opus 4.5, it feels behind.

Test 2: Tool Usage Analytics + Insights Dashboard

This test is a step up from the action palette.

You can find the prompt I've used here: Prompt

GPT-5.2-Codex (high)

GPT-5.2 absolutely nailed this one.

The final result turned out amazing. Tool usage tracking works exactly as expected, data persists correctly, and the dashboard feels like a real product feature. Most used tools, recent usage, filters, everything just works.
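
To give a rough idea of what the "most used tools" piece boils down to, here's a small sketch of my own (not GPT-5.2's actual code), reusing the hypothetical `UsageEvent` shape from the tracking sketch earlier:

```typescript
type UsageEvent = { toolId: string; timestamp: number };

// Count events per tool and return the top N, most-used first.
export function topTools(
  events: UsageEvent[],
  limit = 5
): { toolId: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const event of events) {
    counts.set(event.toolId, (counts.get(event.toolId) ?? 0) + 1);
  }
  return [...counts.entries()]
    .map(([toolId, count]) => ({ toolId, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}
```

The real implementations are larger than this (persistence, recency, filters, UI), but this is the core aggregation a "most used tools" card needs.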

One really nice touch is that it also wired analytics-related actions into the Action Palette from Test 1.

It did take a bit longer than the first test, around 26 minutes, but again, that’s the trade-off with high reasoning. You can tell the model spent time thinking through data modeling, reuse, and avoiding duplicated logic. Totally worth it here.

Here’s the demo:

You can find the code it generated here: GPT-5.2 High Code

  • Cost: ~$1.1–1.2
  • Duration: ~26 minutes (API time)
  • Code Changes: Large multi-file update, cleanly structured
  • Token Usage:
    • Total: ~236k
    • Input: ~162k (+ heavy cached context)
    • Output: ~75k
    • Reasoning tokens: ~57k

GPT-5.2 High continues to be slow but extremely powerful, and for a task like this, that’s a very good trade.

Claude Opus 4.5

Claude Opus 4.5 did great here as well.

The final implementation works end to end, and honestly, from a pure UI and feature standpoint, it’s hard to tell the difference between this and GPT-5.2 High. The dashboard looks clean, the data makes sense, and the filters work as expected.

Here’s the demo:

You can find the code it generated here: Claude Opus 4.5 Code

  • Cost: $1.78
  • Duration: ~8 minutes (API Time)
  • Code Changes: +1,279 lines, -17 lines

Gemini 3 Pro

Gemini 3 Pro gets the job done, but it clearly takes a more minimal approach compared to GPT-5.2 High and Claude Opus 4.5.

The overall experience feels very bare-minimum: the UI is functional but plain, and the dashboard lacks the polish and depth you get from the other two models.

It also didn't add a way to open the analytics dashboard from the action palette, unlike the other two models.

Here’s the demo:

You can find the code it generated here: Gemini 3 Pro Code

  • Cost: Low, with heavy cache utilization
  • Duration: ~5 minutes (API Time)
  • Code Changes: +351 lines, -3 lines
  • Token Usage:
    • Input: ~67k
    • Output: ~7.1k
    • Savings: ~85%+ input tokens served from cache

Overall, Gemini 3 Pro remains efficient and reliable, but in a comparison like this, efficiency alone is not enough. 🤷‍♂️


Conclusion

From this test, at least, I can conclude that these models are now pretty much able to one-shot decently complex work.

Still, there have been times when a model messed up so badly that fixing the problems one by one would have taken me nearly as long as building the feature from scratch.

dog sideeye gif

If I compare the results across models, Opus 4.5 definitely takes the crown. But I still don’t think we’re anywhere close to relying on it for real, big production projects. The recent improvements are honestly insane, but the results still don’t fully back them up.

For now, I think these models are great for refactoring, planning, and helping you move faster. But if you solely rely on their generated code, the codebase just won’t hold up long term.

I don't see any of these recent models as "use it and ship it" for production in a project with millions of lines of code, at least not in the way people hype it up.

Let me know your thoughts in the comments.

Top comments (1)

Shrijal Acharya Tensorlake

This is a pure test in a real-world project. Most of you wanted to see how the models would perform in a real project, so here you go.

Let me know your thoughts! :)