Codex vs Claude Code: Complete Comparison of AI Coding Agents
This comparison took me a very long time because the last two weeks in AI have been absolutely insane. A ton of new releases, literally every few days: Gemini 3, Opus 4.5, GPT-5.1 Codex, GPT-5.1 Codex Max. It was complete madness.
It's impossible to make a comparison that won't be outdated in a month - that's how fast development of all these CLI tools and agents is going. The frontier is constantly shifting, so disclaimer right away: I'll be updating this comparison and will likely make second and third parts every few months. But these articles will become outdated very quickly.
At least for now, I'll try to do a full comparison, and at the end I'll tell you which subscription I've finally settled on for late 2025 - early 2026.
Why I'm Not Considering Other Tools
First, it's worth mentioning the other agentic programming tools that people often ask about but that I won't be covering in depth.
Gemini CLI - Not a Competitor
Gemini CLI is not a competitor at all to Codex and Claude Code in my eyes. Everything is still written in TypeScript, and it's crazy vibe-code with tons of bugs. Their agent harness still seriously lags behind every other tool, and Gemini 3 as a model is very weak at programming.
When there was only Gemini 2.5, I didn't even want to mention Gemini CLI because it was impossible to use. Seriously, if you want Gemini to wipe your entire system or destroy your codebase - use Gemini CLI.
In my eyes it already has a reputation as a destroyer, and I won't be using Gemini CLI for the next six months to a year. The internet is full of reports about how Antigravity, just the day before yesterday, completely deleted someone's entire disk with all their data and no possibility of recovery; about how Gemini CLI destroyed codebases, publicly wiped git history, and pushed empty repositories.
I personally used Gemini CLI for exactly one week. After that, I decided never to touch it again, at least until they fix all the problems.
The problem is that even if you deny it write permissions, it will find a way to destroy your files and replace them with hallucinations. Right in front of me, Gemini replaced files full of perfectly adequate code with pleas to kill it. I don't know what's happening with this model or why it's so insane, but Gemini 2.5 was genuinely begging me to "kill" it.
About Gemini 3: I can see that as an assistant it's very good at reasoning, and it may genuinely lead in science. But this model is still useless for programming and produces nonsense at the first opportunity. I wouldn't recommend anyone use Gemini for programming.
OpenCode, Factory CLI and Others
We have AMP, Factory CLI, OpenCode and others. They're actually pretty good. Factory CLI is good in terms of pricing, they have decent Terminal-Based UI, decent tooling. OpenCode is also good as an open source replacement, not tied to any specific model.
If you need a bunch of models, I'd consider OpenCode as the only normal alternative right now. But such tools still have a huge problem: they're not optimized at all and can't be optimized for a specific provider.
If Codex literally ships a separate model post-trained to write code specifically inside the Codex harness (Codex Max), which performs better there by default, then what are you hoping for when you run GPT-5.1 in Factory CLI or Junie?
Vendor lock-in will force you to choose one provider and stick with it. I'd be happy to use different models to always stay on the frontier, but between the efforts of large corporations to lock you into one provider and the objective benefits a single provider brings to the workflow, I can't do that.
Think about it: why would you need to constantly switch models? Yes, some model might be better at planning, but the difference between models now is literally 1-2% on benchmarks. Apply the Pareto principle and save yourself time, for example, by leveling up your prompting for one model or optimizing your development approach around it.
Don't switch to a whole other agent harness with a bunch of its own flaws just because you need plus 2% efficiency in planning. That's nonsense in my eyes.
I now just use GPT-5.1 Pro if I need a solution to a very complex task or a detailed plan, and then return it to Codex and tell it to implement. That's it, that's enough.
Philosophy of Working with AI Agents
The problem with agentic programming is that you never know where the point is beyond which you're making things worse for yourself, not better. For example, it's very hard to understand when it would be more profitable to write a feature in Respawn yourself than to trust Codex to write it, and then prompt it 35 times to fix some minor issues.
So instead of calculating this every time (which is impossible, because it's an unpredictable system), just set rules for yourself and save your time on making these decisions rather than on squeezing out an extra 3%.
My Experience with Claude Code
I used Claude Code for many months because Claude Code was the first terminal agent on the market and gained huge popularity. I started using it in May and immediately switched to the most expensive subscription in literally less than a month.
I used it successfully until September, then grew disappointed with Claude Code: it wasn't enough for me anymore. I switched to Codex, and all this time I kept testing both, unable to decide which one I'd end up on.
Let's start with Claude Code. I didn't have much time to test it with Opus 4.5, so I'll base this mainly on terminal benchmarks and on how people review it online, after reading tons of Reddit threads and similar comparisons.
Main Insight: You're Choosing the Model's Character
The difference isn't even in what models or their versions exist, but in that you get a new identity, a new character for your agent when you come to a specific provider.
Anthropic, OpenAI, and Google models have diverged drastically in how it feels to work with them:
Anthropic models are a monkey with a Kalashnikov: a junior who will do everything you tell it, immediately. Don't tell it to plan in the prompt and it won't plan; it'll bang out a random implementation from scratch that duplicates tons of existing code in your project, and push it over all of your code.
OpenAI models - more methodical, autistic, emotionless, like a robot, executing any task. I really catch the vibes with Codex because GPT-5.1 is as autistic as I am, and we think very similarly. I like Codex's communication style.
Primarily now, in early 2026, you're not choosing between model capabilities; first and foremost you're choosing the harness that wraps the model, its quality, and your model's character. What character do you want to see at work? How do you want your communication with the model to look? This is the most important feature now.
A model you catch the vibe with, one you quickly learn to work with and understand, is worth much more than constantly switching models or hunting for whoever has the best MCP support right now.
Claude Code: Pros
Here are all the pros I identified for myself compared to Codex:
1. More Convenient Pricing
The $100 subscription is perhaps the perfect option for most developers. I was on the $100 plan after Sonnet 4.5 came out, and it was more than enough for one to three parallel agents, working whole days plus weekends. I couldn't even hit the limits; I was literally the bottleneck in their work.
2. Much Cooler Agent Harness
With Claude Code you can completely replace the system prompt and put in your own custom one. This is a killer feature because it lets you create not just coding agents but, in literally 10-30 minutes, a business partner for yourself.
I really miss my business partner and might buy an Anthropic subscription just to revive it for my next product. This is really a very useful cofounder or at least analyst.
Output styles are something in between a completely custom system prompt and user instructions. Also a killer feature: it lets you instantly switch the model's character and purpose.
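To make this concrete, here's a rough sketch of what an output style looks like; the directory path and frontmatter fields are from memory and may differ between Claude Code versions:

```bash
# Hypothetical "analyst" output style for Claude Code.
# Path and frontmatter fields are approximate and may differ by version.
mkdir -p ~/.claude/output-styles
cat > ~/.claude/output-styles/analyst.md <<'EOF'
---
name: Analyst
description: Terse product analyst instead of a coding agent
---
You are a product analyst, not a programmer. Answer in short bullet points,
challenge weak assumptions, and never write code unless explicitly asked.
EOF
# Inside Claude Code, switch to it with: /output-style analyst
```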
Codex doesn't have this and probably won't, because they verify the system prompt hash on the backend. They need to remove this limitation before they can ever give the ability to change the system prompt. And I really miss this.
I believe developers should have access to the model's system prompt to change it however they want. I tried changing the behavior of GPT-series models through user instructions, and they just don't listen to me. You can see in the thoughts how they argue about whose instructions have priority - the system prompt or user prompt.
If you're not a developer or want to use it for something else, like creating a whole team for yourself that will work on your projects - unfortunately Codex won't work. Only Claude Code or other agents (like OpenCode).
3. Interactive Tools
The model can and will ask questions. This matters because Codex has already worn me out here: you tell it to ask questions, but its system prompt says not to ask questions, so it ignores all my instructions and starts writing nonsense. I have to interrupt it and take away its permissions so it shuts up and starts asking questions.
4. Subagents
I still don't have enough context space with Codex, and it's inconvenient for me to use crutches like tmux. I just want to have a convenient ability to launch a subagent, including my killer planning subagent, which I still miss after switching to Codex.
OpenAI doesn't seem to be in a hurry to add them at all. I see that they already have everything ready to add subagents, but they're dragging their feet. I really hope they'll add it within a couple of months.
This is a killer feature not because of any specific capability, but because subagents wildly save context and organize the agent's work more correctly. I had a hook requiring that any task the user set had to be planned first, and that hook dramatically improved Claude Code's results.
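For context, here's a rough sketch of what that kind of setup looks like in Claude Code: a planning subagent plus a prompt-submit hook. The directory layout, frontmatter fields, and hook event name are from memory and may differ in current versions.

```bash
# Hypothetical planning subagent for Claude Code (project-level).
mkdir -p .claude/agents
cat > .claude/agents/planner.md <<'EOF'
---
name: planner
description: Breaks a task into a step-by-step implementation plan before any code is written.
tools: Read, Grep, Glob
---
You are a planning agent. Explore the repository, then return a numbered
implementation plan listing affected files and risks. Do not write code.
EOF

# Hypothetical hook that nudges every prompt toward planning first.
# The structure of .claude/settings.json shown here is approximate.
cat > .claude/settings.json <<'EOF'
{
  "hooks": {
    "UserPromptSubmit": [
      { "hooks": [
          { "type": "command",
            "command": "echo 'Plan this task with the planner subagent before editing any files.'" }
      ] }
    ]
  }
}
EOF
```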
Codex has neither proper hooks nor subagents. I can't accept this; it's a needed feature. I'm waiting for them to add it, but right now they're lagging far behind Claude Code here.
5. Support for Basic Technologies
Codex only got MCP support recently. How long did that take? I switched to Codex only after MCP support landed, because without it my workflow doesn't work.
Don't get me wrong, I'm not saying MCP is some magnificent feature that can't be replaced. But they save a lot of time on building infrastructure. MCP is a very convenient one-line wrapper for any API.
Yes, you can write Python scripts that completely replace these MCPs. On GitHub, for example, Peter Steinberger has a tool that converts an MCP server into a plain script automatically. You can use that, and it will be even more efficient. But it's overhead every time: you have to hand the model instructions and some way to discover the available tools.
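To show how little a "one-line wrapper" actually takes, here's roughly what registering an MCP server looks like in both tools; the server name and package are placeholders, and the config keys are from memory, so treat this as a sketch rather than gospel:

```bash
# Hypothetical MCP setup; "docs-search" and "some-mcp-server" are placeholders.

# Claude Code: register an MCP server with one command.
claude mcp add docs-search -- npx -y some-mcp-server

# Codex: declare the same server in ~/.codex/config.toml.
cat >> ~/.codex/config.toml <<'EOF'
[mcp_servers.docs-search]
command = "npx"
args = ["-y", "some-mcp-server"]
EOF
```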
In terms of convenience Claude Code wins hands down. But for me this is still not a killer feature that requires switching to Claude Code. Why? Because it's temporary. I know perfectly well that all agent harnesses will level out sooner or later, and this will happen soon.
Claude Code: Cons
Main Disadvantage - The Model
Anthropic simply has a weaker model when it comes to organizing its own work. I literally find GPT models more pleasant to work with because I like how the model behaves.
Claude Code wore me out at one point generating 9000+ Markdown files, huge walls of text for every small question. And Codex responds concisely, to the point, talks to me normally.
I already hate the phrase "you're absolutely right". I hear it 10-20 times a day, and each time it means Claude Code fucked up again.
If Codex screwed up, at least it says: "Listen, I'm sorry, I fucked up, here's what can be done to fix the situation; do you want me to roll back the changes, what do you want to do?" This is literally how a normal senior developer should communicate.
What Claude Code does is more like a circus, and by the end of August it pissed me off so much that I stopped opening Claude Code at all. By the end of the subscription I wouldn't even launch it, despite having $50 of it left to burn. I was just sick of what was happening.
You're choosing an infuriating, annoying model that makes mistakes, that lies, that constantly sucks up to you, that's too verbose and that thinks worse.
Fake Thinking at Anthropic
Yes, it's cool that with Claude Code we can see the thinking traces directly, and Anthropic doesn't hide them. But I've always been struck by how fundamentally differently OpenAI's models think compared to other providers.
Essentially, thinking at Anthropic is just a piece of text. I literally know from working with the API and logs: Anthropic's thinking is a pair of XML tags (<think>...</think>) in which the model is supposed to place a simulation of how a human would think. Everything inside these tags is regurgitation of the same thing, a retelling of what the user said. This isn't real thinking, it's nonsense, and it doesn't help the model at all.
Opus 4.5 literally scores the same on benchmarks with thinking as without it, which suggests the thinking is just useless tokens thrown in the trash.
The only place I've seen real thinking is GPT-5.1. It's roughly how thinking happens in people, only formatted a bit more adequately. We humans don't really think in words; we don't voice our thoughts in our heads. At most there are instant images, concepts that we combine together.
I recommend reading how o3 thought (and GPT-5.1 is essentially a refined o3). There's such crazy stuff happening in the thinking that it's scary. It's a jumble of words, roughly like what happens in my head:
"user requests TDD impl, task, we need plan. Plan first, test? TDD, not recommended by instructions, but user ask tdd. user asked code, they meant for tdd? how that possible means? no plan, craft plan. we plan? We make plan, not now. user asked plan. we craft task. No, didn't ask question tdd. Ask question. prompt said not ask question. we craft question. craft document? No. no document. Read file. read file, then plan. todo. Todo-list. Read plan and we start crafting code".
A madman's jumble of words, loose concepts strung together. But if you set aside the strange style, which is meant to save tokens, you can read in this madness that Codex really thinks when given a task. It's not regurgitating the same thing, not summarizing, not retelling what the user said in other words. It's genuinely preparing for execution and examining the problem from different angles.
For me this is a killer feature, because with Codex I don't have to spell out every necessary detail before giving it a task.
With an Anthropic model you need all those dozens of patches that Anthropic has layered on top of their models' shortcomings: detailed file-exploration agents, search agents, planning agents, warnings, notes, default automatic hooks. All of it exists to compensate for one thing: the model itself doesn't prepare for work and doesn't think its actions through.
These crutches aren't needed for Codex. If you give it a task, it will really think through: what it's missing now, what information is needed, what files need to be read, what its action plan is, what controversial points exist, what the user said wrong.
You can say "commit" and expect Codex to figure out itself how to properly commit across the whole repository, in what style and so on. Literally comparing with Anthropic - it will just call git commit -am, done. Even if there's not a single file in the diff, it doesn't care about this, absolutely.
Bugs, Lags and Vibe-code
What pisses me off about Claude Code is how slowly it loads, how it constantly lags, how it constantly has some glitches, bugs, non-working vibe-code. You can immediately tell that Anthropic delegates everything to Claude for development and doesn't properly approach software testing.
They rolled out a prompt-editor feature, but the editor destroys the GUI when you open prompt editing in Vim. Why they needed to ship this is unclear; three months later they still haven't fixed it.
They rolled out MCP? I used it for a month, and then they cut it out. At one point they removed support for perfectly working MCPs and replaced them with an "MCP not supported" error, which you can't even see anywhere except hidden logs in debug mode. I have no words for how much this pissed me off.
Output Styles, one of the killer features: they just deprecated it and said they'd completely remove it in a week. What kind of idiot do you have to be to make a decision like that? Hundreds of people in the community were enraged, and only then did they decide to keep it around for a while. But this is nonsense; their behavior is madness.
Their TypeScript front end is constantly, terribly laggy; it flickers so much your eyes hurt from how bad it looks, and typing is unpleasant because your characters appear with a delay.
I can say the same about any terminal agents on TypeScript: OpenCode, Gemini CLI. I don't even want to use terminal agents written in TypeScript anymore.
Junie came out recently, and it's written in Kotlin. Yes, it's too early to really use it, but I can see that its Terminal-Based UI works very well. Which suggests the problem really is TypeScript plus vibe-code.
Codex: Pros
I'll continue by describing what the pros of Codex are for me specifically:
1. Great Terminal-Based UI
The fact that they wrote everything in Rust is just awesome. I got interested in Rust purely because of how well their UI works.
I have practically no problems with Codex's TUI, there are no glitches, and I've never encountered a major bug that made everything explode, unlike Anthropic.
The only bug I've noticed is that when scrolling, if you don't switch into transcript mode, a piece of the prompt gets cut off. It's a bit confusing, but it's a known problem, and possibly related to the fact that I use ghostty on Mac.
This is the only bug I know about, unlike dozens that were at Anthropic.
2. Great Sandboxing
I just like it more; you can tell it was made for senior developers, not for vibe-coders. You get a selector: read-only, workspace write with sensible restrictions, and full-access YOLO mode. And that's exactly what I need.
I almost completely use OpenAI's default settings, and everything suits me. I even trust Codex more that it won't destroy my git folder because it's a more adequate model in behavior, more predictable and thoughtful.
Unlike Claude, which I run in a very restricted mode with lots of hooks and restrictions so it doesn't sprinkle @Suppress everywhere, doesn't pile up technical debt, doesn't work without a plan, doesn't write nonsense, and doesn't wipe out my entire codebase, I need none of that with Codex. And the most interesting thing is that I don't really regret it.
Yes, at first it was scary, but then you understand that Codex doesn't need all this. At most, it can roll back all its work, and even that can be restored through git restore. But I've never had Codex take and wipe out my code, or rewrite my code without permission, cut out something important, manually go into the .git folder and mess everything up there.
It's not just that the model itself behaves more safely; the sandbox harness OpenAI built is genuinely thought through. I don't get pissed off when something isn't allowed.
For example, Codex can't do git reset, it blocks access to the .git folder. On one hand it's annoying, but on the other hand - how good that they put this fence up so nothing happens to my repository.
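For reference, that selector maps onto a couple of settings and a CLI flag; the key names and values below are from memory of the Codex CLI config and may have shifted between releases:

```bash
# Approximate Codex sandbox/approval settings; key names may differ by release.
cat >> ~/.codex/config.toml <<'EOF'
sandbox_mode    = "workspace-write"   # or "read-only", or "danger-full-access" (YOLO)
approval_policy = "on-request"        # ask before escalating outside the sandbox
EOF

# The same choice can be made per run:
codex --sandbox read-only "audit this repo and list the risky spots"
```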
3. Proper Feature Implementation
When they shipped MCP support, they didn't just throw in a half-baked, unfinished version like Anthropic did and then ignore requests to limit the number of MCP tools for six months. In Codex, support for limiting the number of tools appeared the same day MCP support did.
That is, my MCP ToolFilter became useless, essentially. Because OpenAI did everything right away, correctly. Yes, it took longer, but honestly, I like it more this way than if they had rolled out some crap like Anthropic, but quickly.
For me speed isn't a priority because in the long run quality will win.
4. Model and Working with Files
Codex has a more interestingly thought-out work with the file system. Most agents make a bunch of different tools: read file, scroll, read more file, read multiple files, write file, edit file, delete file, search files, etc. Research has already shown that this is all excessive for models.
The OpenAI team's innovation is a very minimal harness. Codex doesn't even have a file-reading tool. Really, I was shocked when I found out. The model already knows internally that it has cat, ripgrep, sed, and awk: everything needed for search and for working with the file system. Reading a file is just cat-ing it from the terminal into context.
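In practice that means Codex explores a repository with the same commands you'd type yourself; the file and symbol names below are just made-up examples:

```bash
# The kind of plain shell exploration Codex falls back on instead of bespoke file tools.
rg -n "PaymentService" --type kotlin            # find where a symbol lives
sed -n '120,180p' src/main/kotlin/Payments.kt   # read only the relevant slice of a file
cat build.gradle.kts                            # dump a small file whole into context
awk '!/^import /' src/main/kotlin/Payments.kt | head -40   # skim a file, skipping imports
```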
I also thought at first that you need all these layered hooks, but in reality everything depends on the model. The harness should work for the model, not blocking its way, not interfering with its work, and not the other way around, putting it on rails, god forbid it makes some mistake.
If you tell Codex "find the right file", it won't call 15 tools (search repository, scroll files, read lines), which Anthropic layered on top of their model to save tokens.
Lyrical Digression: The Story with Comments
I'm shocked what kind of idiot you have to be to do this. Anthropic couldn't deal with the fact that Sonnet and Opus constantly added useless comments to code. Everyone hated it, but no matter what prompting they used, nothing helped. The model just spammed stupid comments.
How did they solve it? They built a script hook into the file-editing tool, globally for all users, so that any comment the model adds to code gets cut out immediately. Because of this, Claude now physically cannot leave a comment at all.
Recently I watched for 10 minutes, laughing, as Claude tried to add a comment at my request, but the diff kept getting canceled because the script cut the comment out. It complained about how it hates its life, how nothing works out for it, how stupid it is.
If you write a single line in AGENTS.md telling Codex "don't leave comments", Codex will follow the instruction perfectly (as long as it doesn't contradict its system prompt). It has no stupid restrictions putting it on rails; everything is done through a normal model and normal prompting.
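That whole "fix" on the Codex side is one line of project instructions; the wording here is just an example:

```bash
# One-line project instruction; Codex reads AGENTS.md from the repo root.
echo "Do not add code comments unless explicitly asked." >> AGENTS.md
```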
Codex doesn't have the problem where, if a Kotlin file has 150 lines of imports, the model calls a command to count lines so it can find where the imports end (to save 2.5 tokens by not reading the imports), then reads only that piece of the file, edits it, and everything explodes: nothing compiles, the import was never added, and you have to redo everything and spend 10 minutes adding one import and changing one line of code. All when it could have just reused the function sitting literally 10 lines further down in the same file, which it never read because it was saving tokens (after all, GPT has 400k context, and Claude still has 200k).
This situation doesn't exist with Codex, because it just calls terminal commands (and once even wrote itself a Python script) to skip the imports and read only the needed part of the file. And all of this happens without the accompanying disasters.
Anthropic also added a moronic feature they still haven't removed: literally every command call injects a warning into context: "attention, you have 45 thousand tokens left, attention!!!111". Codex doesn't do this; it just works like a normal person. Because Claude gets hundreds of these warnings, it's impossible to work with: it constantly complains about how everything is running out: tokens are running out, time is running out, strength is running out...
Pricing and Attitude Toward Users
Anthropic is known for crazy changes to model pricing. At one point there was model degradation; they did nothing about it, ignored it, and waited two months while the community went crazy. They give their users the silent treatment.
They don't clearly disclose what the limits are or how much you can use. At one point they just cut Opus limits, drastically reducing how much you could use. The price didn't go down: for the same money you get 2-3 times less agent usage. This is inhumane; this is not normal.
Although I can understand that a few people were bankrupting them through abuse, I can't accept that the answer was to roll restrictions out onto everyone. Moreover, they clearly didn't just close those holes; they added extra restrictions on top, which is why people in the community were furious and still hate Anthropic. (Don't forget that Claude Opus is 5+ times more expensive than GPT-5.1 on the API.)
On Reddit, Anthropic's most popular post is "I cancelled my subscription". Literally every third post is someone cancelling their subscription or venting about being pissed off.
OpenAI - Different Attitude
In contrast, at OpenAI the $20 subscription gives you about 70% of the work volume that Anthropic's $100 subscription does. The choice is obvious.
For me it's more cost-effective to stay on the $20 subscription, and it covers practically all my work. After the release of GPT-5.1 Codex Max it became extremely economical in tokens and work efficiency: they have better caching, better infrastructure, they can absorb more of the cost of distributing the work, and they built good load balancing.
They really work to give people more for less money, unlike Anthropic, who just think they can cut limits and this will solve all their problems.
If you only have $20 and you want to buy one agent that will close all your needs - get the ChatGPT subscription for $20, which you likely already have. Just install Codex and use it. You'll have enough limits on Codex Max with Medium thinking. You'll have the same performance or even better than with the $100 subscription from Anthropic.
Two Examples of Humane Attitude
I'll give two examples when I was shocked in a good sense.
First: on Twitter some product lead at OpenAI for Codex posted a tweet: "Guys, we made this new model and infra over here, so y'all now have twice as much usage for the same money, enjoy".
For the same money they gave twice as much, because they leveled up infrastructure, improved caching, and released a new model that's 30% more token-efficient. Any normal VC-backed corporation would just quietly make these improvements and pocket the savings rather than doubling everyone's usage.
Maybe this is some cunning plan to conquer the audience, and then they'll still rugpull us, but that will be in the future. If you compare with Anthropic - heaven and earth.
Second: I was scrolling the feed, it was Sunday. Something went down for them in the morning literally for 10-30 minutes. It affected 5-10% of users who had glitches with models, and caching broke (I didn't even notice).
The guy posted on Twitter: "We had problems for 30 minutes, there was a bit of service degradation. We reset all weekly limits for everyone. Enjoy".
As a result, my weekly limits were reset because of this. I used twice as much in the same week. That is, they gave me $8, you could say.
This is a normal attitude toward customers. There was a problem, it affected people. They didn't make you dig through it yourself, write to support, complain, or cancel your subscription to get anything done, the way it was at Anthropic when thousands of people were canceling subscriptions over problems with the model.
They're like: "Sorry, here's openly what the issue was. Fixed it fast, as quickly as possible. Rolled it out, reset your limits, sorry for the inconvenience". This is how you need to treat people, with respect.
Final Verdict
There's no perfect tool right now and no clear leader; it comes down to your own needs and expectations.
But for myself, I've understood that I'm staying on Codex. My mental health, convenience, and the pleasure of using the agent matter more than a few features I can cover with scripts or by writing my own wrapper anyway.
Yesterday I bought the $200 Codex subscription with a clear conscience and no regret at all, and got transparent limits that are exactly 10 times higher than on the $20 plan. You could literally see it: I had spent my limits down to zero, bought the $200 subscription, and immediately had 90% free. Literally 10 times more limits, which is more than enough for me; I'll never spend them. Plus lots of other cool perks like GPT-5.1 Pro, the smartest thinking model in the world (debatable). And I can trust that they won't steal my money.
The point is that despite Codex's issues, I'm staying on it, because the model's character, the harness quality, and the company's attitude toward users matter more to me than having subagents or custom prompts. And as I said, research has already shown that most of that tooling is excessive anyway.