DEV Community: sudheer singh

OpenAI's Codex App Wants to Replace Your IDE. I'm Not Sure It Should.

sudheer singh — Tue, 03 Feb 2026 08:26:31 +0000

OpenAI shipped a desktop app for Codex yesterday. It lets you run multiple coding agents in parallel, each in its own git worktree, and manage them from a single interface.

I've spent a good chunk of recent years building AI coding agents for frontend development. Figma-to-code, agent orchestration, multi-step task planning across AI models. So when I look at the Codex app, I'm not evaluating it as a curious observer. I'm looking at it as someone who's been in the guts of this problem.

The Worktree Trick

The cleverest thing in the Codex app has nothing to do with AI. It's the use of git worktrees for agent isolation.

If you haven't used worktrees: git lets you check out multiple branches of the same repo into separate directories simultaneously. They share the same .git folder, so there's no duplication of history, but each directory has its own working tree. The Codex app creates a worktree per agent. Agent A works on the login flow in one directory while Agent B refactors the API layer in another. They can't step on each other's files because they're literally in different folders.
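To make the worktree-per-agent idea concrete, here is a minimal sketch that only builds the git commands, one per agent. The directory layout, the agent/ branch prefix, and the function name are my own assumptions for illustration; I don't know how the Codex app actually names things.

```python
from pathlib import PurePosixPath


def worktree_add_command(repo_dir: str, agent_name: str, base_ref: str = "HEAD") -> list[str]:
    """Build the git command that gives one agent its own worktree and branch."""
    repo = PurePosixPath(repo_dir)
    # Hypothetical layout: worktrees sit next to the repo, one directory per agent.
    worktree_dir = repo.parent / f"{repo.name}-{agent_name}"
    branch = f"agent/{agent_name}"
    # `git worktree add -b <branch> <path> <commit-ish>` creates the branch
    # and checks it out into <path>, sharing history with the main repo.
    return ["git", "-C", repo_dir, "worktree", "add", "-b", branch, str(worktree_dir), base_ref]


commands = [worktree_add_command("/work/shop", name) for name in ("login-flow", "api-refactor")]
for command in commands:
    print(" ".join(command))
```

Each command, when run, produces a separate directory on its own branch, which is why two agents can edit the same file path without ever touching the same file on disk.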

This solves a real problem. I've seen it firsthand. When you run multiple agents against the same codebase, they will try to edit the same files. One agent reformats your imports while another is halfway through adding a new one. The merge conflicts aren't just annoying, they break the agent's mental model of what the code looks like. It spends tokens trying to fix a conflict it created, fails, and you end up worse than where you started.

Worktrees sidestep all of this at the filesystem level. Each agent gets a clean, isolated checkout of the code, and because history is shared, the only overhead is one more copy of the working files. Cursor shipped the same idea months ago with their background agents. It's a good pattern, and I expect it to become standard for any tool running multiple agents on one repo.

What I Recognize

The Codex app includes a Figma-to-code skill. You point it at a Figma file, it fetches design context and assets, and it generates production-ready UI code. I've built something like this. Here's what I learned: the demo always looks great.

Getting a model to produce a React component that visually matches a Figma frame is not that hard anymore. The models are good at it. Where things fall apart is in the stuff nobody shows in demos. Design tokens that don't match the existing system. Components that duplicate logic already in your codebase. Responsive behavior that works on the three screen sizes the model was thinking about but breaks on the fourth. Accessibility. Always accessibility.

We built component indexing, design token extraction, and prompt pipelines to handle this. Months of work, and we still shipped things that a human designer would catch in seconds. The gap between "code that looks right" and "code that belongs in this codebase" is wider than most people think, and I don't see how a skill file closes it.

I'm not saying OpenAI's Figma skill is bad. I genuinely don't know. But I am saying that if it works well, it's because they did a lot more work than a one-paragraph description suggests.

The Productivity Paradox

The fight about AI coding tools has been going on for a year now. One camp says they're the future. The other says they tried it and their code got worse. Both have data now.

METR ran a study with 16 experienced open-source developers. They were 19% slower with AI tools than without them. And here's the part that stings: they thought they were faster. They estimated a 20% speedup while actually losing almost 20%.
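The size of that perception gap is easier to see with numbers. This is a back-of-the-envelope sketch, assuming "19% slower" means tasks took 19% longer and "20% speedup" means developers believed tasks took 20% less time; the 100-minute baseline is made up.

```python
baseline_minutes = 100.0                            # hypothetical task time without AI
actual_minutes = baseline_minutes * 1.19            # measured: 19% longer with AI
believed_minutes = baseline_minutes * (1 - 0.20)    # self-reported: 20% faster
perception_gap = actual_minutes - believed_minutes

print(
    f"actual {actual_minutes:.0f} min, "
    f"believed {believed_minutes:.0f} min, "
    f"gap {perception_gap:.0f} min"
)
```

Under those assumptions, a task that felt like 80 minutes really took about 119, so the developers' own sense of time was off by roughly 39 minutes per 100.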

Then there's the Index.dev report: teams using AI completed 21% more tasks, but company-wide delivery metrics didn't improve. The extra tasks were apparently offset by more time in code review, more bugs to fix, and more security issues to patch. Apiiro found that 48% of AI-generated code contains security vulnerabilities.

I notice the pattern in my own work. AI tools are incredible for generating boilerplate, writing tests for existing code, and cranking out CRUD endpoints. They save me real time on tasks I already know how to do. But for the hard stuff, the architecture decisions and the tricky edge cases and the "wait, what should actually happen here?" moments, they mostly generate confident-looking code that I then have to carefully audit. Sometimes the audit takes longer than writing it myself would have.

The Codex app's bet is that the problem isn't the AI. The problem is the interface. If you could run more agents in parallel, review their work more easily, and isolate their changes from each other, the productivity math works out. Maybe. I think the bottleneck is somewhere else, though.

Where the Bottleneck Actually Is

Bicameral AI published a breakdown that I keep thinking about. Only 16% of a developer's time goes to writing code. The rest is code review, monitoring, deployments, requirements clarification, security patching, meetings. AI coding tools target that 16% and mostly ignore the other 84%.
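The 16% figure puts a hard ceiling on what faster code generation can buy, and Amdahl's law gives the number. A rough sketch, assuming the 16/84 split holds and the other 84% of the work is untouched (which ignores the extra review overhead, so it is optimistic):

```python
def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's law: only the coding share of the work gets faster."""
    untouched_fraction = 1.0 - coding_fraction
    return 1.0 / (untouched_fraction + coding_fraction / coding_speedup)


for factor in (2, 10, float("inf")):
    print(f"coding {factor}x faster -> whole job {overall_speedup(0.16, factor):.2f}x faster")
```

Doubling coding speed makes the whole job about 1.09x faster. Ten times faster gets you about 1.17x. Even infinitely fast code generation tops out around 1.19x.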

The Atlassian developer experience report found that AI saves developers roughly 10 hours per week on coding tasks. But the extra overhead created by AI-generated code (review, debugging, security) nearly cancels out those savings. You write code faster and then spend the saved time cleaning up what the AI wrote.

I think this is the real problem, and it's one I haven't seen anyone solve well yet. The models generate code that looks plausible but embeds requirements gaps. A human developer hits an ambiguous requirement and asks the product manager. The AI hits the same ambiguity and makes a guess. If the guess is wrong (it often is), you don't find out until code review, or worse, production.

The Codex app has a "skills" system where you can teach it workflows. In theory you could write a skill that says "when you encounter ambiguous requirements, stop and ask." In practice, the model doesn't know what it doesn't know. That's the hard part.

It's Electron

I can't not mention this. The app is Electron. 8GB of RAM to manage some chat threads and diffs.

The usual argument applies: VS Code is Electron and people complain about it constantly, Slack is Electron and gets constant grief for memory usage, and an app specifically for developers should respect developer machines.

I think the Electron critics are right in principle and wrong in practice. Yes, a native app would be better. No, it won't happen. These companies optimize for iteration speed, not runtime performance. They're shipping new features every week and Electron lets them do that with a web team. Is that the right tradeoff for a tool developers live in all day? Probably not. Will it stop anyone from using it? Also probably not.

What I'd Actually Want

The Codex app is optimized for the "supervise a fleet of agents" workflow. You give each agent a task, they run in parallel, you review the diffs. That's a valid way to work, and for certain kinds of tasks (write tests for these 12 files, update the API calls in these 8 components), it's probably efficient.

But the work I find hardest can't be parallelized. Figuring out how a feature should actually work before writing any code. Deciding where in the codebase something belongs. Reading a Figma design and realizing the interaction model breaks on mobile. An agent fleet doesn't help with any of that.

What I'd want is an AI tool that's good at the requirements conversation. Something that looks at a Figma design and a codebase and says "this dropdown pattern doesn't match your existing select components, should I use the existing pattern or create a new one?" Something that reads a ticket and flags the three edge cases the PM didn't think about before I start writing code.

Nobody is building that, as far as I can tell. Everyone is building faster code generators. Cursor is still the best experience for the code-generation workflow, and the Codex app is OpenAI's attempt to catch up. But I think we're all still optimizing the wrong 16%.

500 Lines vs. 50 Modules: What NanoClaw Gets Right About AI Agent Architecture

sudheer singh — Mon, 02 Feb 2026 06:39:29 +0000

NanoClaw is a personal Claude assistant built in roughly 500 lines of core TypeScript, with agents running inside Apple's new container technology instead of behind application-level permission checks.

The motivation was straightforward: the creator didn't want to run software he couldn't fully understand when it had access to his files, email, and shell. The alternative — OpenClaw — has 52+ modules, 45+ dependencies, and 8 config files. NanoClaw replaces all of that with four source files and a SQLite database.

What NanoClaw Actually Is

The architecture:

WhatsApp (baileys) → SQLite → Polling loop → Container (Claude Agent SDK) → Response

Four key files handle everything: WhatsApp connection, container runner, task scheduler, and SQLite operations. Each group chat gets its own isolated container with its own memory file.
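The SQLite-plus-polling part of that pipeline is small enough to sketch. This is not NanoClaw's code (which is TypeScript); it's a Python illustration of the pattern, with a made-up table schema and an echo function standing in for the containerized agent.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) messages table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " chat_id TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " reply TEXT)"
    )
    return db


def enqueue(db: sqlite3.Connection, chat_id: str, body: str) -> None:
    """Store an incoming message; a NULL reply marks it as pending."""
    db.execute("INSERT INTO messages (chat_id, body) VALUES (?, ?)", (chat_id, body))
    db.commit()


def poll_once(db: sqlite3.Connection, run_agent) -> int:
    """Answer every pending message in arrival order; return how many were handled."""
    pending = db.execute(
        "SELECT id, chat_id, body FROM messages WHERE reply IS NULL ORDER BY id"
    ).fetchall()
    for message_id, chat_id, body in pending:
        reply = run_agent(chat_id, body)  # stand-in for the per-chat container run
        db.execute("UPDATE messages SET reply = ? WHERE id = ?", (reply, message_id))
    db.commit()
    return len(pending)


def echo_agent(chat_id: str, body: str) -> str:
    """Toy agent so the sketch runs without any model or container."""
    return f"[{chat_id}] you said: {body}"


db = open_queue()
enqueue(db, "family", "what's for dinner?")
enqueue(db, "work", "summarize standup")
handled = poll_once(db, echo_agent)
print(handled, "messages answered")
```

A real loop would call poll_once on a timer. The point is that the database is the queue, so there is nothing else to deploy or keep in sync.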

No microservices. No message queues. No plugin registry.

The Complexity Trap

Fred Brooks drew the distinction between essential and accidental complexity in 1986. AI agent frameworks have a serious accidental complexity problem.

What's actually essential for a personal AI assistant:

Receive a message
Pass it to an LLM with context
Execute tools the LLM requests
Send the response back
Remember things between conversations

That's it. Everything else — plugin registries, chain abstractions, memory backends, retrieval pipelines — exists because frameworks try to be general-purpose.
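Those five essential steps fit in a few dozen lines. Here is a minimal sketch with a fake model standing in for the LLM call; the tool names, the decision format, and the in-memory list used as "memory" are all assumptions for illustration, not any framework's API.

```python
from typing import Callable

# Tools the model may request; both are toy stand-ins.
TOOLS: dict[str, Callable[[str], str]] = {
    "upper": lambda text: text.upper(),
    "count_words": lambda text: str(len(text.split())),
}


def fake_llm(message: str, memory: list[str]) -> dict:
    """Stand-in for a real model call: requests a tool when the message names one."""
    for name in TOOLS:
        if message.startswith(name + " "):
            return {"tool": name, "argument": message[len(name) + 1:]}
    return {"text": f"(seen {len(memory)} earlier messages) {message}"}


def handle_message(message: str, memory: list[str]) -> str:
    """Step 1: the caller hands us a received message."""
    decision = fake_llm(message, memory)      # step 2: pass it to the LLM with context
    if "tool" in decision:                    # step 3: execute the tool it requested
        response = TOOLS[decision["tool"]](decision["argument"])
    else:
        response = decision["text"]
    memory.append(message)                    # step 5: remember between conversations
    return response                           # step 4: send the response back


memory: list[str] = []
first = handle_message("hello", memory)
second = handle_message("count_words one two three", memory)
print(first)
print(second)
```

Swap fake_llm for a real model call and the list for a file or database, and the shape stays the same.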

NanoClaw sidesteps this by refusing to be general-purpose. One LLM (Claude), one messaging platform (WhatsApp), one storage backend (SQLite), one deployment model (single Mac).

OS-Level Isolation vs. Permission Checks

Most agent frameworks use application-level controls: allowlists, permission prompts. NanoClaw uses Apple Container — actual VMs with their own kernel, not just namespaces.

When an agent runs, it can only see directories explicitly mounted into its container. The security boundary is enforced by the hypervisor, not application code. One bug in a framework's permission system and the agent has access to everything. OS-level isolation doesn't have that problem.
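To show how little it takes for an application-level check to fail, here is a hypothetical allowlist check with a classic path-traversal bug, next to a patched version. The directory and function names are invented; no real framework is being quoted.

```python
import posixpath

ALLOWED_ROOT = "/home/user/notes"  # hypothetical directory the agent may read


def naive_is_allowed(path: str) -> bool:
    # Buggy: compares raw strings, so ".." segments slip straight through.
    return path.startswith(ALLOWED_ROOT)


def normalized_is_allowed(path: str) -> bool:
    # Collapse ".." first, then require the root itself or something beneath it.
    clean = posixpath.normpath(path)
    return clean == ALLOWED_ROOT or clean.startswith(ALLOWED_ROOT + "/")


escape = "/home/user/notes/../.ssh/id_rsa"
print("naive check allows it:", naive_is_allowed(escape))
print("normalized check allows it:", normalized_is_allowed(escape))
```

The naive check approves the escape path and the normalized one rejects it, but the normalized version still knows nothing about symlinks. Every patch is more code to audit. With a mount-based boundary, a path outside the mounted directories simply doesn't exist from the agent's point of view.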

The tradeoff: platform lock-in to macOS Tahoe on Apple silicon.

Fork and Modify vs. Plugin Architectures

NanoClaw tells contributors: "Don't add features. Add skills." Instead of a plugin system, users write skill files that teach Claude Code how to transform a fork of the codebase.

This works because:

The codebase is small enough for an LLM to safely modify
Each user gets purpose-built software

It breaks down when:

You need upstream security fixes
Skills conflict with no dependency resolution
The codebase grows past LLM-safe modification size

What This Means for Agent Architecture

Most agent complexity is accidental. Start simple, add complexity only when needed.
OS-level isolation is underused. Apple Container, gVisor, and Firecracker provide stronger guarantees with less application code to audit.
Fork-and-modify has legs for AI-era software when codebases stay small.
Readability is a security property. If you can't audit the code that has access to your files and shell, you're trusting the framework author entirely.

The lesson isn't to rewrite everything in 500 lines. It's to question every layer of abstraction.

Originally published at fumics.in

Your Phone Silently Sends GPS to Your Carrier — Here's How

sudheer singh — Mon, 02 Feb 2026 06:38:57 +0000

Here's something that will ruin your morning: right now, your mobile carrier can send a silent command to your phone, and your phone will compute its exact GPS coordinates and send them back. No notification. No permission prompt. No indication whatsoever that it happened.

This isn't a bug. It isn't a hack. It's a feature — baked into the cellular protocol stack since the early 2000s, operating at a layer so deep that your phone's operating system doesn't even know it's happening.

The protocols are called RRLP (Radio Resource Location services Protocol) for 2G/3G networks, and LPP (LTE Positioning Protocol) for 4G/5G. Together, they form what's known as control-plane positioning — and they're the reason your carrier knows where you are with GPS-level precision, whether you want them to or not.

How It Actually Works

Every smartphone has two processors:

The application processor (AP) — runs iOS or Android, your apps, your location permissions
The baseband processor (BP) — runs the cellular modem firmware, handles radio communication, talks directly to the cell tower

These two processors are largely isolated. The baseband is a black box. When your carrier sends a location request, it goes to the baseband, not to Android or iOS.

The carrier's SMLC (Serving Mobile Location Centre) sends a positioning request over the control plane. The baseband receives it, activates the GPS chipset, computes coordinates, and sends them back. The application processor is never involved.
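A toy model of the two request paths makes the asymmetry obvious. This is a deliberately simplified sketch, not real modem or OS code: the class names, method names, and coordinates are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApplicationProcessor:
    """The OS side: owns permissions and is the only place requests get logged."""
    location_permission_granted: bool = False
    events_seen: list[str] = field(default_factory=list)


@dataclass
class Baseband:
    """The modem side: answers the network directly."""
    gps_fix: tuple[float, float]

    def handle_positioning_request(self) -> tuple[float, float]:
        # No call into the application processor: no permission check, no log entry.
        return self.gps_fix


@dataclass
class Phone:
    ap: ApplicationProcessor
    baseband: Baseband

    def app_requests_location(self) -> Optional[tuple[float, float]]:
        # App path: the OS sees the request and enforces the permission.
        self.ap.events_seen.append("app location request")
        return self.baseband.gps_fix if self.ap.location_permission_granted else None

    def carrier_requests_location(self) -> tuple[float, float]:
        # Control-plane path: goes straight to the baseband.
        return self.baseband.handle_positioning_request()


phone = Phone(ApplicationProcessor(), Baseband(gps_fix=(12.9716, 77.5946)))
print("app gets:", phone.app_requests_location())
print("carrier gets:", phone.carrier_requests_location())
print("OS saw:", phone.ap.events_seen)
```

With permission denied, the app gets nothing and the OS logs the attempt. The carrier gets the full fix and the OS log shows no trace of it.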

The Protocol Details

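RRLP (originally GSM 04.31, now 3GPP TS 44.031) was designed for GSM/UMTS. LPP (3GPP TS 36.355) is the 4G/5G successor. Both support MS-Assisted positioning, where the phone returns raw satellite measurements and the network computes the fix, and MS-Based positioning, where the phone computes the fix itself.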

The critical detail: RRLP requires no authentication. The phone doesn't verify that the location request is legitimate. The baseband just responds.

Who's Been Using This?

Law enforcement: The DEA was using carrier-assisted GPS tracking by 2006
Israel's Shin Bet: Used carrier location data for COVID contact tracing at scale
Carriers selling data: T-Mobile, AT&T, Sprint sold real-time location data to third parties (the FCC fined them, along with Verizon, nearly $200M in 2024)

Why You Can't Opt Out

Airplane mode — works, but no phone
Location permissions — irrelevant, controls app access not baseband
Location Services toggle — OS-level only
VPNs/firewalls — operate at IP layer, control-plane bypasses all of it

Apple's Fix — And Its Limits

The iPhone 16e, with Apple's C1 modem and iOS 26, introduces Location Privacy:

OS is notified of control-plane location requests
User consent before responding
Option to downgrade to coarse cell-tower estimate

But it only works on devices with the C1 modem. Android has no equivalent.

What Developers Should Know

Location permissions are theater for this threat model
The baseband is the real attack surface
A phone with cellular = always trackable by carrier

Originally published at fumics.in