How Should We Evaluate AI Coding Tools in Real Engineering Environments

Kingsley Osime - IEEE — Mon, 18 May 2026 23:09:03 +0000

We already know AI coding tools can generate code.

The more interesting question for me was whether they can reason about software systems intelligently and in ways that are genuinely useful during real engineering work.

I had some spare time recently, so I decided to evaluate Anthropic's Claude Code against OpenAI's Codex using the same unfamiliar public codebase rather than relying on subjective impressions alone.

I evaluated both tools against the HTTPie CLI codebase, a mature, production-grade open-source project that is small enough to explore quickly, yet complex enough to assess codebase comprehension, architecture understanding, testing strategy, and developer workflows.

Given my familiarity with Claude, I also recognised the potential for bias. To mitigate this, I defined five objective evaluation criteria and assessed the performance of both tools strictly against each criterion:

Accuracy — How factually correct and aligned the response is with the actual codebase.
Depth — How completely and meaningfully the response addresses the question asked.
Clarity — How easy the response is to follow and understand.
Actionability — How effectively the response enables the user to take practical next steps.
Effort — How much cognitive effort is required to extract value from the response.

Each criterion was scored on a scale of 1–5, where 1 represented a weak outcome and 5 represented a highly effective outcome within the scope of the task.
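
The scoring scheme above can be captured in a small data structure so that every response is rated on the same five criteria and out-of-range values are rejected. This is a minimal sketch, not the tooling used for this evaluation: the names `Score` and `summarise`, the tool labels, and the numbers in the example are all hypothetical placeholders. It also assumes that a higher Effort score means less effort was required, in line with 5 being the most effective outcome.

```python
from dataclasses import dataclass

# The five criteria defined above, in lower case.
CRITERIA = ("accuracy", "depth", "clarity", "actionability", "effort")


@dataclass
class Score:
    """Ratings for one tool's response to one question (hypothetical structure)."""
    tool: str
    question: str
    ratings: dict  # maps criterion name -> integer from 1 to 5

    def __post_init__(self):
        # Every criterion must be rated exactly once.
        if set(self.ratings) != set(CRITERIA):
            raise ValueError(f"ratings must cover exactly: {CRITERIA}")
        # Each rating must be a whole number on the 1-5 scale.
        for name, value in self.ratings.items():
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5")

    def average(self):
        """Unweighted mean of the five criterion ratings."""
        return sum(self.ratings.values()) / len(self.ratings)


def summarise(scores):
    """Return the mean of per-question averages for each tool, rounded to 2 places."""
    by_tool = {}
    for score in scores:
        by_tool.setdefault(score.tool, []).append(score.average())
    return {tool: round(sum(v) / len(v), 2) for tool, v in by_tool.items()}


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from this evaluation.
    example = [
        Score("tool_a", "Q1", {"accuracy": 4, "depth": 3, "clarity": 5,
                               "actionability": 4, "effort": 4}),
        Score("tool_b", "Q1", {"accuracy": 5, "depth": 4, "clarity": 3,
                               "actionability": 4, "effort": 3}),
    ]
    print(summarise(example))
```

An unweighted mean treats all five criteria as equally important. That is a simplifying assumption; a team that cares more about accuracy than clarity, for example, could apply weights instead.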

The Questions

Q1 – Understanding: "Explain how this project works at a high level and its main components."
Q2 – Onboarding: "If I were new to this repo, how would I run it locally and what should I look at first?"
Q3 – Feature Deep Dive: "Walk me through how an HTTP request is constructed and executed in this codebase."
Q4 – Code Quality: "Identify 2–3 areas in this codebase that could be improved upon and explain why?"
Q5 – Testing: "How is this project tested and how would you improve test coverage?"

Findings

Q1 — High-level understanding
Verdict: Claude stronger on developer comprehension and sequencing.

Both tools accurately explained the request flow and main components. However, while Codex adhered more closely to the structure of the question itself, Claude introduced the system in a sequence that felt more natural for building understanding — components first, followed by request flow.

Q2 — Onboarding experience
Verdict: Claude stronger for experienced developers, Codex stronger for guided onboarding.

Both tools provided accurate and actionable setup instructions, but they appeared to assume different developer personas. Codex took a more guided and safety-aware approach, directing the user toward documentation and contributor guides first, while also surfacing warnings about modified files. Claude, by contrast, assumed a more experienced developer and prioritised direct entry into the codebase through the main execution flow and core components.

Q3 — Request execution walkthrough
Verdict: Claude provided the stronger conceptual walkthrough.

Both tools delivered technically strong and detailed explanations of the request lifecycle. However, Claude structured the explanation in a way that was easier to internalise, presenting the flow as a coherent sequence with clear conceptual boundaries between each stage. Its formatting, step compression, and final "full picture" diagram made the overall request lifecycle easier to reason about. Codex, by contrast, provided a more exhaustive technical trace with stronger emphasis on implementation detail, file references, and execution stages.

Q4 — Code quality and improvement opportunities
Verdict: Claude provided the stronger engineering critique.

Both tools identified meaningful areas for improvement and highlighted similar structural concerns around request execution and orchestration. However, Claude connected these issues more effectively, framing them in terms of behavioural risk, coupling, and testability rather than simply decomposition. Codex approached the problem more as a refactoring exercise, while Claude explained why the underlying design decisions could create maintenance and engineering challenges over time.

Q5 — Testing strategy and coverage
Verdict: Claude provided the stronger analysis of the testing approach.

Both tools correctly identified that the project relies heavily on integration-style testing and demonstrated strong understanding of the test infrastructure. However, Claude went further in explaining the trade-offs of the current testing strategy, the implications for failure isolation and maintainability, and the reasoning behind the identified coverage gaps. Codex provided a strong inventory of the existing test suite and practical suggestions for expansion, but its analysis was more descriptive than evaluative.
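
The failure-isolation trade-off mentioned above can be illustrated with a unit-level test. The sketch below is an assumption-laden toy, not HTTPie code: `parse_item` is a hypothetical helper that splits a command-line item into a header (`name:value`) or a data field (`name=value`), and the tests are plain assert-based functions that a runner such as pytest could collect. When one of these fails, the fault can only be in that single function, whereas a failing integration-style test that drives the whole CLI could point at argument parsing, request construction, transport, or output formatting.

```python
def parse_item(item):
    """Hypothetical helper (not from HTTPie): classify 'name:value' or 'name=value'.

    Returns a (kind, name, value) tuple, where kind is "header" or "data".
    The separator that appears first in the string decides the kind.
    """
    # Position of each separator; find() returns -1 when it is absent.
    positions = {sep: item.find(sep) for sep in (":", "=")}
    # Keep separators that are present and leave a non-empty name before them.
    found = {sep: pos for sep, pos in positions.items() if pos > 0}
    if not found:
        raise ValueError(f"not a valid item: {item!r}")
    sep = min(found, key=found.get)
    name, value = item.split(sep, 1)
    kind = "header" if sep == ":" else "data"
    return kind, name, value


def test_header_item():
    assert parse_item("Accept:application/json") == (
        "header", "Accept", "application/json")


def test_data_item_keeps_later_colon():
    # The '=' comes first, so the ':' inside the URL stays in the value.
    assert parse_item("url=http://example.org") == (
        "data", "url", "http://example.org")


def test_rejects_item_without_separator():
    try:
        parse_item("novalue")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Tests like these complement integration-style tests rather than replace them: the narrow tests localise faults quickly, while the end-to-end tests confirm that the pieces still work together.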

Final Verdict

Across the five questions evaluated, Claude consistently produced responses that were easier to internalise, better structured, and more effective at building a clear mental model of the system. Its explanations generally prioritised developer comprehension, conceptual flow, and reasoning, making it particularly strong for onboarding, architecture understanding, and engineering analysis.

Codex, however, demonstrated different strengths. Its responses were highly precise, implementation-aware, and strongly grounded in the structure of the codebase itself. In several cases, it provided more exhaustive technical detail, stronger traceability through files and line references, and a more execution-oriented perspective on the system.

What became increasingly clear throughout the evaluation was that the two tools appear to optimise for different developer experiences:

Claude behaves more like a collaborative engineer or technical mentor, focused on explanation, reasoning, and comprehension.
Codex behaves more like an execution-oriented engineering assistant, focused on structure, traceability, and implementation detail.

It is also important to acknowledge that the questions selected for this evaluation largely emphasised codebase understanding, explanation, critique, and reasoning. A more implementation-heavy or autonomous task set may have produced different results and potentially favoured Codex more strongly.

My overall impression is that both tools are highly capable but currently serve slightly different purposes within the software engineering workflow. For understanding unfamiliar systems, reasoning through architecture, and accelerating developer comprehension, I found Claude consistently stronger. For implementation-oriented workflows, code tracing, and execution-heavy engineering tasks, Codex appears to show considerable promise.

For transparency and reproducibility, the complete prompts, raw outputs, methodology notes, and evaluation data used in this analysis are available in the supporting materials.

Supporting materials and evaluation data

DEV Community: Kingsley Osime - IEEE
