Reme Le Hane
Three AI Agents, One Spec

There’s been a steady undercurrent of people switching tools lately. Copilot to Claude. Claude to something else. Codex quietly entering more workflows. Strong opinions pointing in different directions, each delivered with certainty.

I had just signed up for Claude to evaluate whether it could realistically replace ChatGPT in my workflow. Copilot already lives inside my IDE and has been solid for a while. ChatGPT fills the thinking space outside it. Codex was new to me, and I hadn’t pushed it beyond small experiments. Rather than rely on opinions, I wanted to see how they behaved when given the same real problem.

I took a feature from a production side project and gave the exact same specification to all three. No prompt tuning between runs. No follow-up corrections. Just the same instructions, copied and pasted.

The project itself isn’t the focus, but for context it’s a 3D print cost calculator that’s been live for some time. The feature came directly from user feedback: support multiple materials in a single calculation.

Modern 3D printers can switch between filament spools mid-print. That means a single object might use two or three different materials. From a costing perspective, the application can no longer assume one material per job. It needs to aggregate several weights, each with its own cost per kilogram, and still produce a stable total.
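The post doesn’t show the project’s actual code, but the aggregation it describes is simple enough to sketch. A minimal TypeScript version, with all type and field names hypothetical, might look like this:

```typescript
// Hypothetical shapes; the article does not show the project's real models.
interface MaterialUsage {
  weightGrams: number; // weight of this material used in the print
  costPerKg: number;   // filament price per kilogram
}

// Sum each material's share of the cost, converting grams to kilograms,
// so the total stays stable regardless of how many usages are listed.
function materialCost(usages: MaterialUsage[]): number {
  return usages.reduce(
    (total, u) => total + (u.weightGrams / 1000) * u.costPerKg,
    0,
  );
}

// Example: 120 g at 25/kg plus 30 g at 40/kg.
const cost = materialCost([
  { weightGrams: 120, costPerKg: 25 },
  { weightGrams: 30, costPerKg: 40 },
]);
```

The single-material case is just a list of length one, which is what makes the migration from the old model straightforward.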

Previously, the model was simple: one material, one weight. The new version required introducing a list of material usages, updating the UI to support adding and removing entries, migrating existing saved calculations so nothing broke, adjusting the cost engine to sum across usages, and updating exports and tests to reflect the change. It was a contained but multi-layered change, the kind that forces you to think about backward compatibility before aesthetics.
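The migration step described above can be sketched as a small shape-normalising function. This is an illustration only, assuming hypothetical record shapes, not the project’s real code:

```typescript
// Hypothetical shapes for the old and new saved-calculation records.
interface LegacyCalculation {
  material: string;
  weightGrams: number;
  costPerKg: number;
}

interface Calculation {
  usages: { material: string; weightGrams: number; costPerKg: number }[];
}

// Wrap a legacy single-material record in the new list-based shape so
// previously saved calculations keep producing the same totals.
// Already-migrated records pass through unchanged (idempotent).
function migrate(saved: LegacyCalculation | Calculation): Calculation {
  if ("usages" in saved) {
    return saved;
  }
  return {
    usages: [
      {
        material: saved.material,
        weightGrams: saved.weightGrams,
        costPerKg: saved.costPerKg,
      },
    ],
  };
}
```

Running every record through a function like this on load is one way to keep old data working without a one-off database rewrite; the post doesn’t say which approach the tools actually took.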

All three tools produced coherent implementations. Nothing wildly off target. Nothing unusable. Each of them understood the structure of the problem and translated the specification into working code.

There was one shared oversight. The feature was meant to be premium only, and none of them added any form of gating. That detail wasn’t in the instructions, so it never appeared in the output. It wasn’t dramatic or surprising, just a straightforward reminder that these systems implement what is written. Anything that lives in your head but not in the spec simply does not exist from their perspective.

Where They Diverged

Claude wrote the most code and made the broadest structural changes. It introduced immutability into the data model and added supporting packages to enforce that pattern. It also flagged localisation concerns, something the others missed, although its implementation only covered English, leaving the other nine languages untouched. The overall feel was of an engineer who prefers improving the surrounding structure while adding the feature, even if that means expanding the footprint of the change.

Copilot took a more conservative approach. It extended the existing structures rather than reshaping them and stayed close to the project’s established conventions. The implementation felt incremental and aligned with what was already there. Existing tests were adjusted where necessary, but no tests were written for the new behaviour.

Codex landed somewhere between those two. Its structural footprint was closer to Copilot’s, but the feature felt more complete. The user flow around multiple materials was slightly cleaner, and the test coverage went beyond simply keeping the build green: it updated existing tests where necessary and added tests for the newly introduced business logic. It did not attempt to modernise the surrounding architecture, but it did push the feature towards a finished state rather than a minimal extension.

Claude also took noticeably longer to produce its output. Codex and Copilot completed within minutes of each other, while Claude took roughly twice as long. That difference did not materially change the evaluation, but it was visible when running the comparison side by side.

What stood out over time was less about correctness and more about inclination. Each tool behaved like a different kind of engineer. One leaned towards structural refinement, one towards continuity, and one towards feature completeness. None of them had context about my longer-term intentions for the project, and none of them could weigh architectural disruption against speed of integration in the way a human maintainer does. They simply optimised within the boundaries of the instructions provided.

The Decision

In the end, I am not merging any of the branches as they stand. Codex is probably the closest to something I could integrate with minimal additional work, largely because its changes sit comfortably within the existing structure while still delivering a rounded implementation. Restoring a removed widget or adding premium gating is straightforward, and extending localisation properly is manageable. The path to integration feels shorter.

But the final implementation will be selective. The strongest elements from each approach can be combined, and the business constraints that were never specified can be applied deliberately. That part still belongs to me.

What this exercise clarified is not which tool is “best.” It is how differently they interpret the same constraints and where they naturally apply effort. Used in isolation, that bias might go unnoticed. Run in parallel, it becomes visible.

That visibility is useful.

Instead of choosing a single tool and treating it as the answer, there is value in letting them generate alternative implementations and then integrating intentionally. They can explore branches quickly. I remain responsible for coherence, consistency, and long-term direction.

That division feels realistic.
