DEV Community

sudheer singh

Posted on • Originally published at fumics.in

OpenAI's Codex App Wants to Replace Your IDE. I'm Not Sure It Should.

OpenAI shipped a desktop app for Codex yesterday. It lets you run multiple coding agents in parallel, each in its own git worktree, and manage them from a single interface.

I've spent a good chunk of recent years building AI coding agents for frontend development. Figma-to-code, agent orchestration, multi-step task planning across AI models. So when I look at the Codex app, I'm not evaluating it as a curious observer. I'm looking at it as someone who's been in the guts of this problem.

The Worktree Trick

The cleverest thing in the Codex app has nothing to do with AI. It's the use of git worktrees for agent isolation.

If you haven't used worktrees: git lets you check out multiple branches of the same repo into separate directories simultaneously. They share the same .git folder, so there's no duplication of history, but each directory has its own working tree. The Codex app creates a worktree per agent. Agent A works on the login flow in one directory while Agent B refactors the API layer in another. They can't step on each other's files because they're literally in different folders.
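The mechanism is easy to reproduce with plain git. Here's a minimal, self-contained sketch of the per-agent setup (the directory and branch names are just illustrative, not anything the Codex app actually uses):

```shell
# Throwaway repo so the example is self-contained
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email agent@example.com
git config user.name demo
git commit -q --allow-empty -m "init"

# One worktree per agent: each gets its own directory and branch,
# but all of them share the single object store under repo/.git
git worktree add ../agent-a -b login-flow      # Agent A's sandbox
git worktree add ../agent-b -b api-refactor    # Agent B's sandbox

git worktree list   # the main tree plus the two agent trees
```

Each agent can now edit, commit, and even run the test suite in its own directory; merging their branches back is an ordinary git merge rather than two processes fighting over one working tree.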

This solves a real problem. I've seen it firsthand. When you run multiple agents against the same codebase, they will try to edit the same files. One agent reformats your imports while another is halfway through adding a new one. The merge conflicts aren't just annoying, they break the agent's mental model of what the code looks like. It spends tokens trying to fix a conflict it created, fails, and you end up worse than where you started.

Worktrees sidestep all of this at the filesystem level. Each agent gets a clean, isolated checkout without duplicating the repository's history. Cursor shipped the same idea months ago with their background agents. It's a good pattern, and I expect it to become standard for any tool running multiple agents on one repo.

What I Recognize

The Codex app includes a Figma-to-code skill. You point it at a Figma file, it fetches design context and assets, and it generates production-ready UI code. I've built something like this. Here's what I learned: the demo always looks great.

Getting a model to produce a React component that visually matches a Figma frame is not that hard anymore. The models are good at it. Where things fall apart is in the stuff nobody shows in demos. Design tokens that don't match the existing system. Components that duplicate logic already in your codebase. Responsive behavior that works on the three screen sizes the model was thinking about but breaks on the fourth. Accessibility. Always accessibility.

We built component indexing, design token extraction, and prompt pipelines to handle this. Months of work, and we still shipped things that a human designer would catch in seconds. The gap between "code that looks right" and "code that belongs in this codebase" is wider than most people think, and I don't see how a skill file closes it.

I'm not saying OpenAI's Figma skill is bad. I genuinely don't know. But I am saying that if it works well, it's because they did a lot more work than a one-paragraph description suggests.

The Productivity Paradox

The fight about AI coding tools has been going on for a year now. One camp says they're the future. The other says they tried it and their code got worse. Both have data now.

METR ran a study with 16 experienced open-source developers. They were 19% slower with AI tools than without them. And here's the part that stings: they thought they were faster. They estimated a 20% speedup while actually losing almost 20%.

Then there's the Index.dev report: teams using AI completed 21% more tasks, but company-wide delivery metrics didn't improve. The extra tasks were apparently offset by more time in code review, more bugs to fix, and more security issues to patch. Apiiro found that 48% of AI-generated code contains security vulnerabilities.

I notice the pattern in my own work. AI tools are incredible for generating boilerplate, writing tests for existing code, and cranking out CRUD endpoints. They save me real time on tasks I already know how to do. But for the hard stuff, the architecture decisions and the tricky edge cases and the "wait, what should actually happen here?" moments, they mostly generate confident-looking code that I then have to carefully audit. Sometimes the audit takes longer than writing it myself would have.

The Codex app's bet is that the problem isn't the AI. The problem is the interface. If you could run more agents in parallel, review their work more easily, and isolate their changes from each other, the productivity math works out. Maybe. I think the bottleneck is somewhere else, though.

Where the Bottleneck Actually Is

Bicameral AI published a breakdown that I keep thinking about. Only 16% of a developer's time goes to writing code. The rest is code review, monitoring, deployments, requirements clarification, security patching, meetings. AI coding tools target that 16% and mostly ignore the other 84%.

The Atlassian developer experience report found that AI saves developers roughly 10 hours per week on coding tasks. But the extra overhead created by AI-generated code (review, debugging, security) nearly cancels out those savings. You write code faster and then spend the saved time cleaning up what the AI wrote.

I think this is the real problem, and it's one I haven't seen anyone solve well yet. The models generate code that looks plausible but embeds requirements gaps. A human developer hits an ambiguous requirement and asks the product manager. The AI hits the same ambiguity and makes a guess. If the guess is wrong (it often is), you don't find out until code review, or worse, production.

The Codex app has a "skills" system where you can teach it workflows. In theory you could write a skill that says "when you encounter ambiguous requirements, stop and ask." In practice, the model doesn't know what it doesn't know. That's the hard part.

It's Electron

I can't not mention this. The app is Electron. 8GB of RAM to manage some chat threads and diffs.

The usual arguments apply: VS Code is Electron and people complain about it constantly, Slack is Electron and gets endless grief for memory usage, and an app built specifically for developers should respect developer machines.

I think the Electron critics are right in principle and wrong in practice. Yes, a native app would be better. No, it won't happen. These companies optimize for iteration speed, not runtime performance. They're shipping new features every week and Electron lets them do that with a web team. Is that the right tradeoff for a tool developers live in all day? Probably not. Will it stop anyone from using it? Also probably not.

What I'd Actually Want

The Codex app is optimized for the "supervise a fleet of agents" workflow. You give each agent a task, they run in parallel, you review the diffs. That's a valid way to work, and for certain kinds of tasks (write tests for these 12 files, update the API calls in these 8 components), it's probably efficient.

But the work I find hardest can't be parallelized. Figuring out how a feature should actually work before writing any code. Deciding where in the codebase something belongs. Reading a Figma design and realizing the interaction model breaks on mobile. An agent fleet doesn't help with any of that.

What I'd want is an AI tool that's good at the requirements conversation. Something that looks at a Figma design and a codebase and says "this dropdown pattern doesn't match your existing select components, should I use the existing pattern or create a new one?" Something that reads a ticket and flags the three edge cases the PM didn't think about before I start writing code.

Nobody is building that, as far as I can tell. Everyone is building faster code generators. Cursor is still the best experience for this workflow, and the Codex app is OpenAI's attempt to catch up. But I think we're all still optimizing the wrong 16%.

Top comments (3)

PEACEBINFLOW

This really resonates, especially the framing around where the real bottleneck actually lives.

The worktree idea is genuinely smart — not because it’s “AI magic,” but because it acknowledges something very human: agents break down the moment they step on each other’s mental model of the codebase. Isolating them at the filesystem level is one of those solutions that feels obvious after you’ve been burned by conflicts a few times. Credit where it’s due.

But I’m with you on the bigger point: we keep pouring energy into accelerating the part of the job that already isn’t the slowest. Writing code faster feels productive, but most of my time is spent deciding what not to write, where something belongs, or why this requirement is underspecified. AI confidently plows through ambiguity; humans pause and ask questions. That gap shows up later as review debt, security debt, or just “why does this feel wrong?” fatigue.

The Figma-to-code section hit home too. Matching pixels is solved. Matching systems isn’t. The distance between “this renders correctly” and “this fits the codebase’s patterns, tokens, and accessibility expectations” is still huge, and no amount of parallel agents closes that without a strong judgment layer.

What I keep wanting from these tools is exactly what you describe at the end: less “here’s the code” and more “here are the decisions you need to make before any code exists.” Flagging ambiguity, surfacing edge cases, and forcing the uncomfortable questions early would probably save more time than spinning up ten agents ever could.

So yeah — Codex feels like a well-engineered answer to the wrong slice of the problem. Impressive, useful in narrow lanes, but still optimizing the 16% while the other 84% quietly eats the gains.

leob

Wow, this sounds like a reality check for companies who think they shouldn't hire juniors anymore, because AI in the hands of seniors will replace them? But in the meantime, reading what you write:

1) Coding is at most 15-20% of the whole software development process (who would be surprised at that? well, not me)

2) And when we use AI for those 15-20% it only works really well for some tasks, while for other tasks it incurs other costs (the need for more reviewing, code bloat, security vulns, etcetera) - and note that fixing those issues will take time from senior devs, not juniors ...

Are companies being far too optimistic, and are they kidding themselves?

I'm not against using AI for tasks it's suited to, but somehow I can't conceal feeling some sort of satisfaction thinking that those companies rushing to dump their junior devs, out of an urge to save a few bucks, might reach a point where they're thinking again ...

P.S. regarding OpenAI and their new product - of course OpenAI just wants to make a buck and sell tokens :-)

Naveen Kumar

I got to know something new.