OpenAI shipped a desktop app for Codex yesterday. It lets you run multiple coding agents in parallel, each in its own git worktree, and manage them from a single interface.
I've spent a good chunk of recent years building AI coding agents for frontend development. Figma-to-code, agent orchestration, multi-step task planning across AI models. So when I look at the Codex app, I'm not evaluating it as a curious observer. I'm looking at it as someone who's been in the guts of this problem.
The Worktree Trick
The cleverest thing in the Codex app has nothing to do with AI. It's the use of git worktrees for agent isolation.
If you haven't used worktrees: git lets you check out multiple branches of the same repo into separate directories simultaneously. They all share the repo's object database, so there's no duplication of history, but each directory has its own working tree. The Codex app creates a worktree per agent. Agent A works on the login flow in one directory while Agent B refactors the API layer in another. They can't step on each other's files because they're literally in different folders.
This solves a real problem. I've seen it firsthand. When you run multiple agents against the same codebase, they will try to edit the same files. One agent reformats your imports while another is halfway through adding a new one. The merge conflicts aren't just annoying, they break the agent's mental model of what the code looks like. It spends tokens trying to fix a conflict it created, fails, and you end up worse than where you started.
Worktrees sidestep all of this at the filesystem level. Each agent gets a clean, isolated checkout of the code without duplicating the repo's history. Cursor shipped the same idea months ago with their background agents. It's a good pattern, and I expect it to become standard for any tool running multiple agents on one repo.
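To make the pattern concrete, here's a minimal sketch of spinning up one worktree per agent from Node. This is an illustration of the general technique, not how the Codex app actually implements it; the branch naming and temp-directory layout are my own assumptions.

```typescript
import { execSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Create an isolated worktree for one agent, branched off main.
// The worktree shares the repo's object database, so this is cheap.
function createAgentWorktree(repoPath: string, agentId: string): string {
  const worktreeDir = mkdtempSync(join(tmpdir(), `agent-${agentId}-`));
  const branch = `agent/${agentId}`;
  execSync(`git worktree add -b ${branch} "${worktreeDir}" main`, {
    cwd: repoPath,
    stdio: "inherit",
  });
  return worktreeDir; // point the agent's file operations here
}

// Tear it down once the agent's branch has been reviewed and merged.
function removeAgentWorktree(repoPath: string, worktreeDir: string): void {
  execSync(`git worktree remove "${worktreeDir}" --force`, {
    cwd: repoPath,
    stdio: "inherit",
  });
}
```

The nice property is that merging an agent's branch back is just an ordinary git merge, reviewed like any other branch.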
What I Recognize
The Codex app includes a Figma-to-code skill. You point it at a Figma file, it fetches design context and assets, and it generates production-ready UI code. I've built something like this. Here's what I learned: the demo always looks great.
Getting a model to produce a React component that visually matches a Figma frame is not that hard anymore. The models are good at it. Where things fall apart is in the stuff nobody shows in demos. Design tokens that don't match the existing system. Components that duplicate logic already in your codebase. Responsive behavior that works on the three screen sizes the model was thinking about but breaks on the fourth. Accessibility. Always accessibility.
We built component indexing, design token extraction, and prompt pipelines to handle this. Months of work, and we still shipped things that a human designer would catch in seconds. The gap between "code that looks right" and "code that belongs in this codebase" is wider than most people think, and I don't see how a skill file closes it.
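To give a sense of what that plumbing looks like, here's a deliberately simplified sketch of one small piece: snapping raw pixel values from generated code onto an existing spacing-token scale and flagging anything that doesn't fit. The token names and the tolerance rule are invented for illustration; this is nothing like a complete design-token pipeline.

```typescript
// Hypothetical spacing scale pulled from the existing design system.
const spacingTokens: Record<string, number> = {
  "space-1": 4,
  "space-2": 8,
  "space-3": 16,
  "space-4": 24,
};

// Map a raw pixel value from a generated component onto the nearest token,
// but only if it's close enough -- otherwise return null so a human looks at it.
function resolveSpacing(px: number, tolerance = 1): string | null {
  let best: { name: string; diff: number } | null = null;
  for (const [name, value] of Object.entries(spacingTokens)) {
    const diff = Math.abs(value - px);
    if (best === null || diff < best.diff) best = { name, diff };
  }
  return best && best.diff <= tolerance ? `var(--${best.name})` : null;
}

// resolveSpacing(15) -> "var(--space-3)"  (snaps to the 16px token)
// resolveSpacing(11) -> null              (nothing close; flag for review)
```

Even a check this crude catches a lot of "looks right, doesn't belong" values, and it's only one of the gaps listed above.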
I'm not saying OpenAI's Figma skill is bad. I genuinely don't know. But I am saying that if it works well, it's because they did a lot more work than a one-paragraph description suggests.
The Productivity Paradox
The fight about AI coding tools has been going on for a year now. One camp says they're the future. The other says they tried it and their code got worse. Both have data now.
METR ran a study with 16 experienced open-source developers. They were 19% slower with AI tools than without them. And here's the part that stings: they thought they were faster. They estimated a 20% speedup while actually losing almost 20%.
Then there's the Index.dev report: teams using AI completed 21% more tasks, but company-wide delivery metrics didn't improve. The extra tasks were apparently offset by more time in code review, more bugs to fix, and more security issues to patch. Apiiro found that 48% of AI-generated code contains security vulnerabilities.
I notice the pattern in my own work. AI tools are incredible for generating boilerplate, writing tests for existing code, and cranking out CRUD endpoints. They save me real time on tasks I already know how to do. But for the hard stuff, the architecture decisions and the tricky edge cases and the "wait, what should actually happen here?" moments, they mostly generate confident-looking code that I then have to carefully audit. Sometimes the audit takes longer than writing it myself would have.
The Codex app's bet is that the problem isn't the AI. The problem is the interface. If you could run more agents in parallel, review their work more easily, and isolate their changes from each other, the productivity math would work out. Maybe. I think the bottleneck is somewhere else, though.
Where the Bottleneck Actually Is
Bicameral AI published a breakdown that I keep thinking about. Only 16% of a developer's time goes to writing code. The rest is code review, monitoring, deployments, requirements clarification, security patching, meetings. AI coding tools target that 16% and mostly ignore the other 84%.
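The arithmetic is worth doing once, because it's brutal. Treat it as Amdahl's law: take the 16% figure from that breakdown and assume, optimistically, that AI doubles your raw coding speed.

```typescript
// Amdahl's law: overall speedup when only a fraction of the work gets faster.
function overallSpeedup(fractionAccelerated: number, speedup: number): number {
  return 1 / ((1 - fractionAccelerated) + fractionAccelerated / speedup);
}

// Coding is 16% of the job and AI makes that part 2x faster:
console.log(overallSpeedup(0.16, 2)); // ~1.09 -- roughly a 9% overall gain
```

A 2x improvement on the coding slice buys you something like 9% overall, before you account for any extra review and debugging the generated code creates.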
The Atlassian developer experience report found that AI saves developers roughly 10 hours per week on coding tasks. But the extra overhead created by AI-generated code (review, debugging, security) nearly cancels out those savings. You write code faster and then spend the saved time cleaning up what the AI wrote.
I think this is the real problem, and it's one I haven't seen anyone solve well yet. The models generate code that looks plausible but embeds requirements gaps. A human developer hits an ambiguous requirement and asks the product manager. The AI hits the same ambiguity and makes a guess. If the guess is wrong (it often is), you don't find out until code review, or worse, production.
The Codex app has a "skills" system where you can teach it workflows. In theory you could write a skill that says "when you encounter ambiguous requirements, stop and ask." In practice, the model doesn't know what it doesn't know. That's the hard part.
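For what it's worth, here is the shape of the workaround I'd try: a pre-flight pass that asks the model to enumerate the ambiguities it can see in a ticket before any code gets written. This is my own sketch against the OpenAI Node SDK, not a Codex skill, and sendToProductManager is a hypothetical stand-in; crucially, it only surfaces ambiguities the model can see, which is exactly the limitation above.

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Ask the model to list ambiguous or underspecified requirements in a ticket.
// If it finds any, stop and route them to a human instead of letting an agent guess.
async function preflightAmbiguityCheck(ticket: string): Promise<string[]> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "List requirements in this ticket that are ambiguous or underspecified, " +
          "one per line. If everything is clear, reply with exactly NONE.",
      },
      { role: "user", content: ticket },
    ],
  });
  const text = completion.choices[0].message.content ?? "";
  return text.trim() === "NONE"
    ? []
    : text.split("\n").map((line) => line.trim()).filter(Boolean);
}

// const questions = await preflightAmbiguityCheck(ticketText);
// if (questions.length > 0) sendToProductManager(questions); // don't generate code yet
```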
It's Electron
I can't not mention this. The app is Electron. 8GB of RAM to manage some chat threads and diffs.
The usual complaints apply: VS Code is Electron and people complain about it constantly, Slack is Electron and gets constant grief for memory usage, and an app built specifically for developers should respect developer machines.
I think the Electron critics are right in principle and wrong in practice. Yes, a native app would be better. No, it won't happen. These companies optimize for iteration speed, not runtime performance. They're shipping new features every week and Electron lets them do that with a web team. Is that the right tradeoff for a tool developers live in all day? Probably not. Will it stop anyone from using it? Also probably not.
What I'd Actually Want
The Codex app is optimized for the "supervise a fleet of agents" workflow. You give each agent a task, they run in parallel, you review the diffs. That's a valid way to work, and for certain kinds of tasks (write tests for these 12 files, update the API calls in these 8 components), it's probably efficient.
But the work I find hardest can't be parallelized. Figuring out how a feature should actually work before writing any code. Deciding where in the codebase something belongs. Reading a Figma design and realizing the interaction model breaks on mobile. An agent fleet doesn't help with any of that.
What I'd want is an AI tool that's good at the requirements conversation. Something that looks at a Figma design and a codebase and says "this dropdown pattern doesn't match your existing select components, should I use the existing pattern or create a new one?" Something that reads a ticket and flags the three edge cases the PM didn't think about before I start writing code.
Nobody is building that, as far as I can tell. Everyone is building faster code generators. Cursor is still the best experience for this workflow, and the Codex app is OpenAI's attempt to catch up. But I think we're all still optimizing the wrong 16%.