A few months back, the tech world got hit with a wave of panic-inducing headlines. CEOs and tech leaders were on stages everywhere claiming that massive percentages of their code were now AI-generated. If you weren't on board, you were basically toast.
This kicked off what I can only describe as a spending frenzy. Companies started signing six- and seven-figure contracts for AI coding tools, desperate not to fall behind. The question everyone was asking was simple: "How do we get our entire team using AI?"
Now? The conversation's changing. The companies that jumped in early are starting to ask a much harder question: "Is this actually worth it?"
We've moved past the hype cycle into the messy reality of execution. Instead of "Are we using AI?" teams are asking "Are we using it well?" And here's the problem: the metrics we've relied on for years to measure engineering productivity? They're completely inadequate for this new world.
Why Your Current Metrics Are Lying to You
Trying to measure AI's impact with traditional engineering metrics is like trying to figure out how fuel-efficient your Tesla is by checking its oil consumption. The engine's changed completely, so the gauges you're looking at are basically useless.
Cycle Times That Mislead
Your cycle time might drop by 50%, and you'll celebrate. But here's what's really happening: AI shortened your coding phase dramatically, while the convoluted code it generated doubled the length of your review phase. The headline number looks great, but you haven't removed the constraint. You've just moved the bottleneck somewhere else in your pipeline.
Traditional tools can't see this shift, so you end up celebrating a vanity metric while your team drowns in review friction.
DORA Metrics That Can't Diagnose
Don't get me wrong, DORA metrics are great for measuring overall pipeline health. But they're too high-level to tell you anything specific about AI's impact. Your Change Failure Rate goes up, and DORA just shrugs. It can't tell you whether that's because of poorly written AI prompts, low-quality output from your AI tool, or something entirely unrelated to AI.
It's a smoke alarm that can't tell you where the fire actually is.
Lines of Code (Still Terrible)
Lines of code has always been a terrible metric, but AI makes it completely absurd. An AI tool can spit out thousands of lines in seconds. Measuring productivity this way is worse than useless because it actively rewards the wrong behavior.
A Better Framework for Measuring AI ROI
An engineering organization is fundamentally a system: you put things in (headcount, tools, cloud spend) and you get things out (a working product). A real framework for measuring AI ROI has to connect these inputs to outputs in a meaningful way.
Here's how to think about this in stages, from basic to sophisticated. I've broken it down into four pillars that every team needs to address.
Pillar 1: Measure Real Engineering Velocity
This is where everyone starts, because the core promise of AI is speed. Your job is to quantify that speed objectively.
Start Simple: Track Basic Output
First, just measure throughput. Look at pull requests merged per week or issues resolved. This gives you a baseline of what your team is actually shipping.
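If you want something concrete to start from, here's a minimal Python sketch of that baseline, assuming you can export merged-PR dates from your Git host; the sample dates are placeholders.

```python
from collections import Counter
from datetime import date

# Placeholder merge dates; in practice, pull these from your Git host's API.
merged_dates = [date(2024, 5, 6), date(2024, 5, 7), date(2024, 5, 14)]

# Bucket merges by ISO (year, week) to get a simple throughput baseline.
per_week = Counter(d.isocalendar()[:2] for d in merged_dates)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} PRs merged")
```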
Get More Sophisticated: Understand How Work Gets Done
Next, you need to understand the mechanics of that work. Start tracking your AI Code Ratio, which is just the percentage of your merged code that came from AI. At the same time, analyze your team's calendars to see if AI is actually freeing up engineers for more focused work, or if they're still spending all their time in meetings.
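As a rough illustration, here's a minimal sketch of the ratio itself, assuming your detection tooling has already tagged each merged PR with AI-written and human-written line counts; the field names are made up for the example.

```python
def ai_code_ratio(prs: list[dict]) -> float:
    """Share of merged lines that were AI-generated across a set of PRs."""
    ai_lines = sum(pr["ai_lines"] for pr in prs)
    total_lines = sum(pr["ai_lines"] + pr["human_lines"] for pr in prs)
    return ai_lines / total_lines if total_lines else 0.0

prs = [
    {"ai_lines": 100, "human_lines": 150},  # hypothetical detector output
    {"ai_lines": 0, "human_lines": 150},
]
print(f"AI Code Ratio: {ai_code_ratio(prs):.0%}")  # 100 of 400 lines -> 25%
```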
The Gold Standard: Segment by AI Usage
The ultimate goal here is to connect AI usage directly to outcomes. Segment your cycle time metrics by how much AI was used in each pull request. This lets you answer the most important question: "Are PRs with lots of AI-generated code actually moving through our system faster than human-written code?"
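One way to slice this, assuming each PR record carries an AI ratio and a first-commit-to-merge time, is to bucket PRs and compare medians per bucket. The thresholds below are arbitrary starting points, not recommendations.

```python
from statistics import median

def bucket(ai_ratio: float) -> str:
    # Arbitrary thresholds; tune them to your own distribution.
    if ai_ratio < 0.25:
        return "mostly human (<25% AI)"
    if ai_ratio < 0.75:
        return "mixed (25-75% AI)"
    return "mostly AI (>=75% AI)"

# Hypothetical PR records: AI ratio plus first-commit-to-merge time in hours.
prs = [
    {"ai_ratio": 0.10, "cycle_hours": 30},
    {"ai_ratio": 0.50, "cycle_hours": 52},
    {"ai_ratio": 0.90, "cycle_hours": 20},
]

by_bucket: dict[str, list[float]] = {}
for pr in prs:
    by_bucket.setdefault(bucket(pr["ai_ratio"]), []).append(pr["cycle_hours"])

for name, hours in by_bucket.items():
    print(f"{name}: median cycle time {median(hours):.0f}h across {len(hours)} PRs")
```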
Pillar 2: Measure Quality and Maintainability
Speed without quality is just technical debt with a stopwatch. This pillar measures the hidden costs you're accumulating.
Start Simple: Track Your Change Failure Rate
Begin with your Change Failure Rate. It's a lagging indicator, but it gives you a good sense of how many deployments are breaking things in production.
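The computation itself is simple. Here's a minimal sketch, assuming you keep a deployment log that flags which deployments triggered an incident, rollback, or hotfix.

```python
# Hypothetical deployment log; "caused_failure" means an incident,
# rollback, or hotfix was traced back to this deployment.
deployments = [
    {"id": "d1", "caused_failure": False},
    {"id": "d2", "caused_failure": True},
    {"id": "d3", "caused_failure": False},
]

failures = sum(1 for d in deployments if d["caused_failure"])
print(f"Change Failure Rate: {failures / len(deployments):.0%}")  # 1 of 3 -> 33%
```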
Get More Sophisticated: Look at Rework and Complexity
A more proactive approach is tracking your Rework Rate, which is the percentage of code that gets rewritten shortly after being merged. Are your AI-heavy PRs more brittle than human-written code? At the same time, start measuring code complexity. Is the AI generating clean, maintainable code, or is it creating unreadable messes that will haunt you for years?
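Here's one minimal way to approximate a Rework Rate, assuming you can attribute each changed line back to the merge that introduced it; the 21-day window is an illustrative choice, not a standard.

```python
from datetime import date, timedelta

REWORK_WINDOW = timedelta(days=21)  # illustrative window, not a standard

# Hypothetical per-line history: (date the line was merged, date it was
# rewritten, or None if it has not been touched since).
line_history = [
    (date(2024, 5, 1), date(2024, 5, 10)),   # rewritten 9 days later -> rework
    (date(2024, 5, 1), None),                # never touched again
    (date(2024, 5, 3), date(2024, 7, 1)),    # rewritten much later -> not rework
]

reworked = sum(
    1 for merged, rewritten in line_history
    if rewritten is not None and rewritten - merged <= REWORK_WINDOW
)
print(f"Rework Rate: {reworked / len(line_history):.0%}")  # 1 of 3 lines -> 33%
```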
The Gold Standard: Track Defects by AI Dosage
The real test of quality is measuring your Defect Escape Rate for AI-generated code versus human-written code. This requires patience: you'll usually need about 90 days post-deployment to see meaningful patterns. But it gives you the definitive answer on whether AI is improving or degrading your customer experience.
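A sketch of that comparison might look like the following, assuming you've already traced each escaped production defect back to the PR that introduced it (via blame or your incident tooling) and tagged that PR with its AI ratio.

```python
# Hypothetical PRs, each tagged with its AI ratio and the number of
# production defects traced back to it at least 90 days after deploy.
prs = [
    {"ai_ratio": 0.9, "escaped_defects": 2},
    {"ai_ratio": 0.8, "escaped_defects": 0},
    {"ai_ratio": 0.1, "escaped_defects": 1},
    {"ai_ratio": 0.0, "escaped_defects": 0},
]

def defects_per_pr(rows: list[dict]) -> float:
    return sum(r["escaped_defects"] for r in rows) / len(rows)

ai_heavy = [p for p in prs if p["ai_ratio"] >= 0.5]
human_heavy = [p for p in prs if p["ai_ratio"] < 0.5]
print(f"AI-heavy PRs:    {defects_per_pr(ai_heavy):.2f} escaped defects per PR")
print(f"Human-heavy PRs: {defects_per_pr(human_heavy):.2f} escaped defects per PR")
```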
Pillar 3: Measure Organizational Impact
AI's impact goes way beyond your codebase. It changes your team dynamics and your culture.
Start Simple: Track Who's Using the Tools
First, just measure adoption. Who's actually using your AI tools? Track weekly active usage across your organization to see which teams are leaning in and which ones are holding back.
Get More Sophisticated: Measure Onboarding Speed
One of AI's big promises is that it helps people learn faster. Start measuring Time to 10th PR for new engineers. Is AI actually helping them become productive members of the team sooner?
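As a sketch, the metric itself is just the gap between an engineer's start date and their tenth merged PR; the data shape below is an assumption.

```python
from datetime import date

def time_to_nth_pr(start_date: date, merge_dates: list[date], n: int = 10) -> int | None:
    """Days from an engineer's start date to their nth merged PR, or None if not there yet."""
    merges = sorted(merge_dates)
    if len(merges) < n:
        return None
    return (merges[n - 1] - start_date).days

# Hypothetical new hire: started March 1, merged a PR every other day from March 5.
merges = [date(2024, 3, day) for day in range(5, 30, 2)]
print(time_to_nth_pr(date(2024, 3, 1), merges))  # 22 (days to the 10th merge)
```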
The Gold Standard: Assess the Talent Pipeline Risk
Here's something most teams miss: AI is automating a lot of the simple, repetitive work that used to be the training ground for junior engineers. This creates a real long-term risk. Are you accidentally eliminating the path from junior to senior engineer? This is harder to quantify, but it's critical for any serious ROI discussion.
Pillar 4: Measure Total Cost
Once you understand velocity, quality, and organizational impact, you can finally connect everything to cost.
Start Simple: Compare License Costs to Headcount
The most basic analysis is just comparing your annual spend on AI tools to what you'd pay for an additional engineer.
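The arithmetic is trivial, but it's worth writing down; every number in this sketch is an illustrative assumption, not a benchmark.

```python
# Every number here is an illustrative assumption, not a benchmark.
seats = 200
cost_per_seat_per_year = 480          # e.g. $40/month per seat
ai_spend = seats * cost_per_seat_per_year
fully_loaded_engineer = 200_000       # salary + benefits + overhead

print(f"AI tooling spend: ${ai_spend:,}/year")                                    # $96,000/year
print(f"Equivalent headcount: {ai_spend / fully_loaded_engineer:.2f} engineers")  # 0.48 engineers
```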
Get More Sophisticated: Track Token Usage
Next, start tracking token consumption. Which teams or engineers are power users? Where are you burning through credits the fastest?
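If your vendor exposes per-request usage data, a first pass is just an aggregation by team; this sketch assumes a simple export format.

```python
from collections import defaultdict

# Hypothetical export of per-request usage records from your AI vendor's billing data.
usage_records = [
    {"team": "payments", "tokens": 1_200_000},
    {"team": "payments", "tokens": 800_000},
    {"team": "platform", "tokens": 300_000},
]

totals: dict[str, int] = defaultdict(int)
for record in usage_records:
    totals[record["team"]] += record["tokens"]

# Rank teams by consumption to spot your power users.
for team, tokens in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {tokens / 1e6:.1f}M tokens")
```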
The Gold Standard: Automate R&D Capitalization
The most advanced approach is using AI to automatically classify your engineering work into categories like "New Feature," "Maintenance," or "Infrastructure." This lets you generate automated R&D cost capitalization reports for finance, turning your engineering data into a strategic business asset.
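To make the idea concrete, here's a deliberately naive sketch where a keyword heuristic stands in for whatever classifier you'd actually use; the categories and keywords are illustrative assumptions.

```python
# A naive keyword heuristic stands in for whatever classifier you'd actually use.
CATEGORIES = {
    "New Feature": ("feat", "implement", "add "),
    "Maintenance": ("fix", "bug", "refactor", "chore"),
    "Infrastructure": ("ci", "deploy", "terraform", "infra"),
}

def classify(pr_title: str) -> str:
    title = pr_title.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in title for keyword in keywords):
            return category
    return "Unclassified"

print(classify("feat: add invoice export"))   # New Feature
print(classify("fix flaky deploy pipeline"))  # Maintenance ("fix" matches first)
```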
Building the Right Culture Around Metrics
A framework doesn't matter if your team doesn't trust it. Engineering metrics can be incredibly valuable, but if you implement them poorly, you'll just create fear and mistrust.
Be Transparent About What You're Measuring
The "why" matters more than the "what." Tell your team openly what you're measuring and why. Frame it as a tool for finding and fixing systemic problems, not for micromanaging individuals.
Focus on Systems, Not People
Use metrics to understand the health of your development process, not to create a performance leaderboard. The question should be "Is our system working?" not "Who's the fastest coder?"
Start Conversations, Don't End Them
Metrics should kick off discussions, not close them. If review times are spiking for a team, don't just demand they move faster. Ask what's getting in their way. Is the process too complex? Is the code too hard to understand?
The Surprising U-Shaped Curve
When you actually implement this framework, you'll find some counterintuitive patterns. We've seen one trend consistently across nearly every team we work with: the impact of "hybrid" AI usage.
You might assume that PRs with 100% AI-generated code are the riskiest, but the data shows something different. PRs with a mixed bag of 25-50% AI-generated code consistently cause the most rework and review friction.
This actually makes sense when you think about it. When a developer uses AI as pure autocomplete, mixing human logic and AI logic throughout a PR, it creates cognitive whiplash for the reviewer. They're constantly switching between evaluating two different "minds," and that's exhausting. A PR that's clearly all-human or all-AI is often much easier to review.
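You can surface this pattern yourself by binning PRs by AI ratio and comparing rework per bin; the numbers in this sketch are illustrative, not real data.

```python
# Illustrative numbers only; the point is the shape of the analysis, not the values.
BINS = [(0.00, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.01)]

prs = [
    {"ai_ratio": 0.10, "rework_rate": 0.05},
    {"ai_ratio": 0.40, "rework_rate": 0.22},
    {"ai_ratio": 0.60, "rework_rate": 0.12},
    {"ai_ratio": 0.95, "rework_rate": 0.07},
]

for low, high in BINS:
    rows = [p for p in prs if low <= p["ai_ratio"] < high]
    if rows:
        avg = sum(p["rework_rate"] for p in rows) / len(rows)
        print(f"{low:.0%}-{min(high, 1.0):.0%} AI: avg rework {avg:.0%} (n={len(rows)})")
```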
One team I worked with was dealing with a massive code review bottleneck. By analyzing the AI Code Ratio on each PR, they found that a small group of engineers was creating most of these high-friction hybrid PRs. It wasn't a performance issue; it was a training issue. After some coaching on prompt engineering to generate more complete PRs, the team cut its average review time by nearly 30%.
The Economic Reality Coming Your Way
This level of measurement is becoming urgent because the economics of AI are about to change dramatically. Right now, most AI tools aren't profitable. They're being subsidized by venture capital to drive adoption, just like AWS did in the early days of cloud computing.
Remember what happened next? An entire FinOps industry emerged to help companies optimize their cloud spend. The same thing is coming for AI. As these tools get more expensive, you're going to face hard trade-offs. The question will be: "Do I spend my next $100,000 on more AI licenses, or do I hire another engineer?"
Without objective data on ROI, you're just guessing.
Questions We Still Can't Answer
We're still early in this story. While we can now measure velocity, quality, and cost, a new set of long-term questions is emerging that nobody has solved yet.
For example, people care about code readability today, but will they in three years? If an AI can interface directly with code, does human readability even matter? In a sense, English is becoming the new programming language: we care about the English prompt and less about the generated code, just as we care about source code today and not the compiled assembly underneath. This is a fundamental shift in abstraction, and we don't yet know what it means long-term.
What I do know is that leaders are going to need new signals for velocity, quality, and maintainability in this AI-native world. The best place to start is by building a framework to measure what's happening right now. Move past the hype, implement a real approach to measuring ROI, and you'll be able to navigate this transformation with actual data instead of just following the crowd.
Next Steps
Ready to move beyond guesswork and start measuring AI's real impact on your engineering team? Here are concrete actions you can take:
Get Ground Truth on AI Usage: Start by understanding what percentage of your codebase is actually AI-generated. Span's AI Code Detector provides 95% accurate detection of AI vs. human-authored code, giving you the foundation for all other measurements. This is your baseline, the ground truth you need before making any strategic decisions about AI tooling.
Build a Complete Measurement Framework: Rather than cobbling together disparate tools and spreadsheets, consider a unified developer intelligence platform that connects AI detection to velocity metrics, quality indicators, and team dynamics. The right platform should automatically track your AI Code Ratio, segment outcomes by AI usage, and surface patterns like the U-shaped curve described above.
Understand What Metrics Can and Can't Tell You: Before diving deeper into measurement, learn why perception-based approaches like surveys fall short when measuring AI impact. Understanding the limitations of different measurement approaches will help you avoid common pitfalls and build a more robust framework from the start.
See a Demo of Production-Ready AI Measurement: If you're ready to implement a comprehensive AI measurement strategy without building it from scratch, schedule a demo to see how leading engineering teams are measuring and optimizing their AI coding initiatives.

