In engineering management, few challenges are as persistent and contentious as objectively measuring team performance. How do we assess velocity, output quality, and even seniority in a way that is fair, insightful, and resistant to gaming? Traditional metrics—such as story points, commit counts, or lines of code—often fall short. They can incentivize quantity over quality, reward activity rather than impact, and fail to capture the nuanced reality of production software development.
Recently, I conducted an experiment to explore a novel approach: using a large language model (Claude) as an impartial "auditor" of several months of work in a production codebase. The goal was to answer fundamental questions:
- What was actually delivered?
- How complex was the work in reality?
- How long did it take compared to expectations for a senior engineer?
- How stable was the resulting output?
The results were thought-provoking—and somewhat unflattering—prompting deeper reflection on whether AI can (or should) play a role in performance assessment.
The Core Idea: An AI-Powered Engineering Audit
The experiment centered on a detailed system prompt designed to guide Claude through a structured audit process. Rather than relying on superficial signals like commit volume, the prompt instructs the model to:
- Identify distinct deliverables by analyzing code changes and timelines.
- Evaluate true complexity by examining architecture, dependencies, novel logic, and integration challenges.
- Compare actual development time against benchmarks for senior engineers.
- Assess stability through regression patterns, hotfixes, and rework.
- Quantify overall efficiency and highlight red flags.
Importantly, the prompt emphasizes reading the actual code, not just commit messages, to avoid bias from optimistic descriptions.
The full prompt I used is available as a GitHub Gist.
The prompt is divided into six phases:
- Deliverables Inventory – Catalog what was built.
- True Complexity Assessment – Classify each deliverable (TRIVIAL to HIGHLY COMPLEX) with evidence.
- Time Efficiency Analysis – Expected vs. actual timelines.
- Quality & Regression Assessment – Stability and post-ship issues.
- Breadth vs. Depth Analysis – Distribution of new features, improvements, maintenance, and rework.
- Honest Assessment – Overall efficiency, rework rate, and verdict.
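To make that structure concrete, here is a heavily condensed paraphrase of the prompt as it might be embedded in a script. The wording below is mine and purely illustrative; the gist contains the actual text.

```python
# Condensed, illustrative paraphrase of the audit system prompt.
# The real prompt (see the gist) is far longer; this only sketches its shape.
AUDIT_SYSTEM_PROMPT = """\
You are auditing several months of work in a production codebase.
Base every judgment on the actual code changes, not on commit messages.

Phase 1 - Deliverables Inventory: list each distinct deliverable and the
  files and modules it touches.
Phase 2 - True Complexity Assessment: classify each deliverable from TRIVIAL
  to HIGHLY COMPLEX, citing architecture, dependencies, novel logic, and
  integration evidence.
Phase 3 - Time Efficiency Analysis: estimate how long a senior engineer
  should have needed and compare it to the actual elapsed time.
Phase 4 - Quality & Regression Assessment: note regressions, hotfixes,
  reverts, and rework tied to each deliverable.
Phase 5 - Breadth vs. Depth Analysis: break effort down into new features,
  improvements, maintenance, and rework.
Phase 6 - Honest Assessment: give an overall efficiency figure, a rework
  rate, a verdict, and any red flags.
"""
```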
Running the Experiment
I applied this prompt to a real production codebase spanning approximately four months of work by a small team. The audit produced a comprehensive report with tables, evidence excerpts, and quantitative summaries.
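Mechanically, a run boils down to collecting the repository history and handing it to the model alongside the audit prompt. The sketch below captures that shape, assuming the Python `anthropic` SDK and a locally cloned repo; the model name, token limit, and crude truncation are placeholders rather than a faithful record of my setup (in practice the history has to be fed in slices, e.g. per month or per subsystem).

```python
import subprocess
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

def repo_history(path: str, since: str = "4 months ago") -> str:
    """Dump commit metadata plus patches for the audit window."""
    return subprocess.run(
        ["git", "-C", path, "log", f"--since={since}", "--stat", "--patch"],
        capture_output=True, text=True, check=True,
    ).stdout

def run_audit(history: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder; use whichever model you audit with
        max_tokens=8000,
        system=AUDIT_SYSTEM_PROMPT,  # the prompt sketched above
        messages=[{
            "role": "user",
            "content": f"Audit the following repository history:\n\n{history}",
        }],
    )
    return response.content[0].text

# Crude truncation for illustration only; a real run needs chunking plus a
# final pass that consolidates the per-chunk findings into one report.
print(run_audit(repo_history("./my-repo")[:200_000]))
```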
Key high-level findings:
- Most deliverables were rated as MODERATE or lower complexity, even those that felt substantial during development.
- Several features took 2–10x as long as the prompt's senior-engineer benchmarks.
- Rework (fixes, hotfixes, reverts) consumed a significant portion of total effort.
- The overall efficiency calculation suggested room for substantial improvement compared to a "well-functioning" baseline.
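For readers wondering what an "overall efficiency calculation" even means here, the verdict reduces to some fairly blunt arithmetic. The formula and the numbers below are my own illustration, not the exact figures or weighting from the report:

```python
# Toy reconstruction of the kind of arithmetic behind the efficiency verdict.
# All numbers are invented; the report's own figures and weighting differ.
deliverables = [
    # (expected_days for a senior engineer, actual_days elapsed, rework_days)
    (5, 12, 3),
    (10, 45, 14),
    (2, 6, 1),
]

expected = sum(e for e, _, _ in deliverables)  # 17
actual = sum(a for _, a, _ in deliverables)    # 63
rework = sum(r for _, _, r in deliverables)    # 18

efficiency = expected / actual   # 1.0 would match the senior baseline
rework_rate = rework / actual    # share of elapsed effort spent on fixes and reverts

print(f"efficiency vs baseline: {efficiency:.0%}")   # ~27%
print(f"rework rate:            {rework_rate:.0%}")  # ~29%
```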
Notably, the assessment appeared conservative—often downgrading apparent complexity by emphasizing reusable patterns, existing infrastructure, and boilerplate. For instance, systems involving real-time processing and multiple external integrations were classified as COMPLEX but with expected senior timelines of 1–2 weeks, despite months of iteration in practice.
Reflections: Does AI Underestimate—or Cut Through Bias?
The results raised immediate questions about the reliability of AI-driven audits.
On one hand, LLMs may systematically underestimate complexity in real-world systems. Production engineering involves hidden challenges that models struggle to appreciate fully:
- Iterative policy refinement and edge-case handling.
- Integration debt with legacy systems or third-party APIs.
- Coordination overhead in distributed teams.
- The intangible cost of context-switching and debugging in complex environments.
On the other hand, the audit's critical tone might reflect a valuable strength: cutting through human bias. Engineers (myself included) naturally overestimate the novelty and difficulty of our work. An impartial observer—especially one trained on vast codebases—can identify patterns, leverage points, and simplifications that feel groundbreaking internally but are standard externally.
Broader Implications and Open Questions
This experiment touches on several important topics in engineering leadership:
- Measurement Practices: How do most organizations currently evaluate velocity and quality? Many rely on proxies (velocity in story points, DORA metrics, cycle time), but few attempt deep code-level audits. Could structured AI analysis complement these?
- Seniority Calibration: The prompt uses senior-engineer time estimates as anchors. Is this fair? Seniority varies widely—what one engineer completes in days, another might require weeks due to domain knowledge gaps.
- Ethical Considerations: Automating performance assessment raises concerns around fairness, transparency, and morale. If an AI labels output as "inefficient," how do we ensure the evaluation is accurate and contextualized? Should such tools ever influence compensation or promotion?
- AI Limitations and Improvements: Current models excel at pattern matching but may miss subtle production realities. Future iterations could incorporate more signals (e.g., test coverage depth, production incident data, user impact metrics).
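On that last point, one can imagine the audit's verdict becoming just one input among several rather than a standalone judgment. Purely as a thought experiment (the weights, field names, and scaling below are invented), blending those signals might look like this:

```python
# Speculative sketch: blending the AI audit with external evidence instead of
# letting it stand alone. All weights and inputs are invented for illustration.
def blended_score(audit_efficiency: float,
                  test_coverage: float,
                  user_impact: float,
                  incident_count: int) -> float:
    """Combine the audit's efficiency estimate (0..1) with test coverage (0..1),
    a normalized user-impact measure (0..1), and a penalty per production incident."""
    incident_penalty = min(0.3, 0.05 * incident_count)
    raw = 0.4 * audit_efficiency + 0.2 * test_coverage + 0.4 * user_impact - incident_penalty
    return max(0.0, min(1.0, raw))

# Example: a team the audit rated at 27% efficiency, but with solid coverage,
# clear user impact, and two incidents in the window.
print(f"{blended_score(0.27, test_coverage=0.7, user_impact=0.8, incident_count=2):.2f}")
```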
Have you experimented with LLMs for code review, retrospective analysis, or performance measurement? What pros and cons have you encountered?
Conclusion
Using Claude to audit engineering output proved to be a revealing—if humbling—exercise. While the assessment seemed to underestimate certain complexities, it highlighted areas for process improvement and forced a more honest reckoning with deliverables.
I share this not to prescribe AI audits as the solution, but to spark discussion. In an era where AI is transforming how we write code, perhaps it can also help us better understand the code we write—and the teams that write it.
What are your thoughts? How do you measure engineering effectiveness today, and where do you see AI fitting in (or not)?
Thanks for reading. Feedback welcome in the comments.