Lucien Chemaly

Stop Measuring Noise: The Productivity Metrics That Really Matter in Software Engineering

"Productivity" has become a dirty word in engineering.

Mention it in a Slack channel, and the immediate assumption is that management is looking for a reason to fire the bottom 10%, or that McKinsey is back with another controversial report. (For what it's worth, their 2023 report actually warns against using "overly simple measurements, such as lines of code produced, or number of code commits" rather than recommending them. The backlash came from other aspects of their approach.)

The skepticism is earned. For decades, productivity metrics in software engineering have been weaponized to micromanage individual contributors rather than optimize systems.

But ignoring metrics entirely is just as dangerous. When you run an engineering organization on vibes and anecdotal evidence, you end up playing a game of Telephone. The reality of what's happening on the ground gets distorted as it passes through layers of management. You lose the ground truth.

The question isn't if we should measure productivity. The question is what we should measure.

Most standard dashboards are filled with noise. In the era of AI coding tools, DORA metrics are no longer enough. Here's what actually matters in 2025.

The Problem with Velocity at All Costs

For the last few years, the industry converged on DORA metrics (Deployment Frequency, Change Lead Time, Change Fail Percentage, Failed Deployment Recovery Time) as the gold standard.

DORA is excellent, but it's a smoke alarm. It tells you when the house is burning down (Change Fail Percentage spikes) or when you're moving painfully slowly (Deployment Frequency drops). But it doesn't tell you why.

The rise of AI tools like GitHub Copilot and Cursor has broken velocity as a standalone metric. A developer can now generate a 500-line Pull Request in 30 seconds using an LLM. The cycle time for the coding phase looks incredible. But if that code is a hallucinated mess that clogs up the review process for three days, that personal velocity came at the expense of the team's throughput.

If you only measure speed, you'll incentivize behaviors that destroy quality. Research from Tilburg University analyzing GitHub activity found that while less-experienced developers gain productivity from AI tools, core developers now review 6.5% more code and show a 19% drop in their own original code productivity. The time shifted to reviewing AI-generated submissions.

The New Framework: Inputs, Outputs, and Internals

To get a real signal, stop treating engineering like a black box. Start treating it like a system. You need to measure three distinct areas:

  1. Inputs: What are we investing? (Headcount, Tooling Costs, Cloud Spend)
  2. Internals: How is the work actually happening? (PR workflow, Rework, Focus Time, Context Switching)
  3. Outputs: What's the result? (Reliability, Feature Adoption, Customer Value)

Here are the four specific metrics that are proving most valuable for modern engineering leaders.

1. Rework Rate (The AI Counter-Balance)

This is the most underrated metric in software development right now. Rework Rate measures the percentage of code that is rewritten or reverted shortly after being merged. The 2024 DORA Report now includes rework rate as part of their evolved framework, categorizing it alongside Change Fail Percentage as an "instability metric."

In an AI-augmented world, it's very easy to ship bad code fast. Data from platforms analyzing hundreds of engineering teams reveals a fascinating U-shaped curve regarding AI adoption:

  • Low AI usage: Standard rework rates
  • High AI usage (Boilerplate): Low rework rates (AI excels at unit tests and scaffolding)
  • Hybrid usage (25-50% AI): Highest rework rates

When developers use AI in a "mixed context" (half human logic, half AI autocomplete), it creates code that looks correct at a glance but fails in edge cases. This hybrid approach causes cognitive whiplash for reviewers who must switch between evaluating human and AI logic. PRs that are clearly all-human OR all-AI are easier to review than mixed-ratio PRs.

The red flag: If you see your cycle time improving but your rework rate creeping up, you're not moving faster. You're just building technical debt faster.

GitClear's analysis of 211 million lines of code found that code churn is projected to double in 2024 versus their 2021 baseline, with 7.9% of newly added code being revised within two weeks (compared to 5.5% in 2020). Copy-pasted code rose from 8.3% to 12.3%.

The challenge: you can't fix what you can't see. Most teams have no idea what percentage of their code is AI-generated, let alone how that AI code correlates with rework. Tools like Span's AI Code Detector now measure AI-authored code with 95% accuracy (for Python, TypeScript, and JavaScript), giving you ground truth on adoption patterns and quality impact.

2. Investment Distribution (The Truth vs. Your Task Tracker)

Ask a VP of Engineering what their team is working on, and they'll show you a roadmap: "40% New Features, 20% Tech Debt, 40% KTLO (keeping the lights on)."

Ask the engineers what they're working on, and you'll hear about the "shadow work": "I'm supposed to be on the Platform team, but I'm spending 20 hours a week helping the Checkout team fix bugs because I'm the only one who knows the legacy codebase."

Project management tools are often a lagging indicator of intent, not reality. Tickets get rolled over, scope creeps, and quick fixes go untracked. An IDC report analyzing developer time allocation found that application development accounts for just 16% of developers' time, with the rest consumed by meetings, context switching, and what engineering manager Anton Zaides calls "shadow work".

Zaides identifies three main types of invisible work stealing team capacity:

  1. Invisible production support: Investigating alerts, answering questions, ad-hoc requests
  2. Technical glue work: Code reviews, planning, mentoring, documenting
  3. Shadow backlog: Off-the-record PM requests, engineers doing things "right" without approval

In one case study, a senior engineer spent more than 40% of their time on invisible work. Another internal team's workload turned out to be roughly 65% shadow work, none of it tracked against cost codes or billing.

The red flag: an "Innovation" team that, once you look at actual code activity (commits, PRs, and reviews), turns out to be spending 70% of its time on maintenance. You can't fix that allocation if you rely solely on project management tools; you need a metric built on what engineers actually commit and review.

Platforms like Span automatically categorize engineering work by analyzing git activity, creating what they call an "automated P&L of engineering time." You can finally answer questions like "How much time did we actually spend on that platform migration?" with data instead of guesswork.

3. Review Burden Ratio

This metric tracks the relationship between the time spent writing code and the time spent reviewing it.

As AI drives the marginal cost of writing code to zero, the bottleneck shifts to the reviewer. AI generates code instantly; humans review it linearly. If your review burden is skyrocketing, your senior engineers are becoming human spell-checkers for LLMs.

The Tilburg University research quantified this shift: each core contributor now reviews approximately 10 additional PRs annually. Meanwhile, Faros AI's analysis found that code review time increases 91% as PR volume outpaces reviewer capacity, with PR size growing 154% and bug rates climbing 9%.

The red flag: Watch for "LGTM" culture versus "Nitpick" culture.

  • Too fast: If massive AI-generated PRs are getting approved in minutes, your Review Burden is suspiciously low, and you likely have a quality issue looming
  • Too slow: If review burden is high, your senior engineers are trapped in Review Hell, leading to burnout and a halt in architectural innovation

4. Fragmented Time (The Anti-Flow Metric)

Engineering requires flow. Yet, most organizations schedule their way out of productivity. Fragmented Time measures the blocks of available deep work time (2+ hours) versus time fractured by meetings and interruptions.

You can have the best AI tools in the world, but if an engineer has 30 minutes between standup and a planning meeting, they're not shipping complex features. They're answering emails.

Research from UC Irvine professor Gloria Mark found that it takes an average of 23 minutes and 15 seconds to fully return to a task after an interruption. The nuance: this measures returning to your original "working sphere" (project-level work), not recovering from every small distraction. Her updated 2023 research in "Attention Span" found this has increased to approximately 25 minutes, while average attention on screen dropped from 2.5 minutes (2004) to just 47 seconds (2021).

The red flag: If you look at calendar data and see that 40% of your engineering capacity is lost to context switching costs (the 30-minute dead zones between meetings), you've found the cheapest way to improve productivity: cancel meetings. No tool can fix a broken calendar.

How to Measure Without Being Big Brother

The biggest risk with productivity metrics is cultural, not technical. If you roll these out as a way to rank engineers, you've already lost. You'll incentivize gaming the system (splitting one PR into ten tiny ones to boost "throughput").

Here's the golden rule: Metrics are for debugging systems, not people.

  • Don't use data to ask: "Why is Alice slower than Bob?"
  • Do use data to ask: "Why is the Checkout Team stuck in code review twice as long as the Platform Team? Do they need better tooling? Is their tech debt unmanageable?"

Leaders shouldn't want a dashboard that tells them who to fire. They should want a dashboard that acts as a neutral third party: objective data that validates what engineers are already saying in 1:1s. When an engineer says, "I'm swamped with maintenance work," you want the data to prove it so you can secure the budget and resources the team actually needs.

Getting Started

So how do you actually implement these metrics?

For Rework Rate:
Use your version control system to track code that gets reverted or significantly modified within 14-21 days of merging. Most engineering intelligence platforms can calculate this automatically. If you're rolling your own, start with a git analysis script that flags commits that touch the same files within your chosen time window.
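If you want to prototype this before buying a platform, here's a minimal sketch. It assumes you run it inside a local repo checkout, uses a 21-day window, and treats "a commit touches a file that was touched recently" as a rough proxy for rework; that heuristic is an assumption for illustration, not GitClear's methodology.

```python
# Naive rework-rate estimate from git history (sketch, not a product).
import subprocess
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=21)  # "shortly after" threshold; tune for your team

def commits_with_files(since="6 months ago"):
    """Yield (sha, timestamp, [files]) for each commit, oldest first."""
    log = subprocess.run(
        ["git", "log", "--reverse", f"--since={since}",
         "--pretty=format:%H %ct", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    sha, ts, files = None, None, []
    for line in log.splitlines():
        if not line.strip():
            continue
        parts = line.split()
        if len(parts) == 2 and len(parts[0]) == 40 and parts[1].isdigit():
            if sha:
                yield sha, ts, files
            sha, ts, files = parts[0], datetime.fromtimestamp(int(parts[1])), []
        else:
            files.append(line.strip())
    if sha:
        yield sha, ts, files

last_touched = defaultdict(list)  # file path -> timestamps of commits touching it
total, reworked = 0, 0
for sha, ts, files in commits_with_files():
    total += 1
    # Flag the commit as rework if it revisits any file changed within the window.
    if any(ts - prev <= WINDOW for f in files for prev in last_touched.get(f, [])):
        reworked += 1
    for f in files:
        last_touched[f].append(ts)

print(f"Rework rate (naive): {reworked / max(total, 1):.1%} of commits")
```

A script like this will over-count in hot files, so treat the trend over time as the signal, not the absolute number.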

For Investment Distribution:
Engineering intelligence platforms can automatically categorize work based on code patterns, repo structure, and commit messages. The key is classification: New Features vs. Bug Fixes vs. Tech Debt vs. Operations. If you don't have budget for tools, start manually by sampling one week per month and having engineers log their actual time allocation, then compare it to what your task tracker says.
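A crude version of that classification can be scripted from commit messages alone. The keyword buckets below are assumptions for illustration; real platforms also lean on repo structure, file paths, and ticket links.

```python
# Rough investment distribution from the last 30 days of commit subjects.
import re
import subprocess
from collections import Counter

BUCKETS = {
    "Bug Fix":    r"\b(fix|bug|hotfix|regression|patch)\b",
    "Tech Debt":  r"\b(refactor|cleanup|deprecate|migrate|upgrade)\b",
    "Operations": r"\b(ci|build|deploy|config|monitoring|alert)\b",
}

def classify(message: str) -> str:
    for bucket, pattern in BUCKETS.items():
        if re.search(pattern, message, re.IGNORECASE):
            return bucket
    return "New Feature"  # default bucket for everything unmatched

subjects = subprocess.run(
    ["git", "log", "--since=30 days ago", "--pretty=format:%s"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

counts = Counter(classify(s) for s in subjects)
total = sum(counts.values()) or 1
for bucket, n in counts.most_common():
    print(f"{bucket:<12} {n:>4} commits  ({n / total:.0%})")
```

Even this blunt instrument is often enough to show the gap between the roadmap's "40% New Features" and what the commit history actually says.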

For Review Burden:
Calculate the ratio of time PRs spend in review versus time spent in active development. GitHub, GitLab, and Bitbucket all expose this data through their APIs. A healthy ratio varies by team, but if you see review time consistently exceeding development time, or if senior engineers are spending more than 50% of their time reviewing, you've got a problem.
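Here's a rough sketch against the GitHub REST API. It approximates development time as first commit to PR opened and review time as PR opened to merge; the `your-org`/`your-repo` names and the `GITHUB_TOKEN` environment variable are placeholders, and you'll want to adjust the approximation to how your team uses drafts and review requests.

```python
# Approximate review burden ratio for the last 50 closed PRs (sketch).
import os
from datetime import datetime

import requests

OWNER, REPO = "your-org", "your-repo"  # placeholders
API = f"https://api.github.com/repos/{OWNER}/{REPO}"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

pulls = requests.get(f"{API}/pulls", headers=HEADERS,
                     params={"state": "closed", "per_page": 50}).json()

dev_hours, review_hours = 0.0, 0.0
for pr in pulls:
    if not pr.get("merged_at"):
        continue  # skip PRs that were closed without merging
    commits = requests.get(f"{API}/pulls/{pr['number']}/commits",
                           headers=HEADERS).json()
    if not commits:
        continue
    first_commit = parse(commits[0]["commit"]["author"]["date"])
    opened = parse(pr["created_at"])
    merged = parse(pr["merged_at"])
    dev_hours += max((opened - first_commit).total_seconds(), 0) / 3600
    review_hours += (merged - opened).total_seconds() / 3600

ratio = review_hours / max(dev_hours, 1e-9)
print(f"Review burden ratio: {ratio:.2f} "
      f"({review_hours:.0f}h in review vs {dev_hours:.0f}h in development)")
```

Slice the ratio by team or service rather than by individual reviewer; the goal is to find where review capacity is the constraint, not to score people.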

For Fragmented Time:
Use calendar analysis tools or simple calendar exports. Count the number of 2+ hour blocks available per engineer per week. Anything less than 10 hours of uninterrupted time per week per engineer is a red flag.
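The free-block math is simple enough to run from a plain list of meetings. The sketch below assumes a single 9-to-5 workday and hard-coded sample meetings; a real version would parse an .ics export or pull events from a calendar API.

```python
# Count 2+ hour focus blocks in one workday from a list of (start, end) meetings.
from datetime import datetime, timedelta

WORKDAY_START = datetime(2025, 1, 6, 9, 0)
WORKDAY_END = datetime(2025, 1, 6, 17, 0)
FOCUS_BLOCK = timedelta(hours=2)

meetings = [  # sample data for illustration
    (datetime(2025, 1, 6, 9, 30), datetime(2025, 1, 6, 9, 45)),   # standup
    (datetime(2025, 1, 6, 11, 0), datetime(2025, 1, 6, 12, 0)),   # planning
    (datetime(2025, 1, 6, 15, 0), datetime(2025, 1, 6, 15, 30)),  # 1:1
]

def focus_blocks(meetings, day_start, day_end, minimum):
    """Return the free gaps of at least `minimum` length between meetings."""
    blocks, cursor = [], day_start
    for start, end in sorted(meetings):
        if start - cursor >= minimum:
            blocks.append((cursor, start))
        cursor = max(cursor, end)
    if day_end - cursor >= minimum:
        blocks.append((cursor, day_end))
    return blocks

blocks = focus_blocks(meetings, WORKDAY_START, WORKDAY_END, FOCUS_BLOCK)
total = sum((end - start for start, end in blocks), timedelta())
print(f"{len(blocks)} focus block(s), {total.total_seconds() / 3600:.1f}h of deep work time")
```

Run it per engineer per week and compare against the 10-hour threshold above; the 30-minute gaps between meetings won't show up as focus time, which is exactly the point.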

The AI Quality Question

The elephant in the room: how do you actually measure AI code quality?

You can't manage what you can't measure. The first step is knowing how much AI-generated code you're actually shipping. Most teams are flying blind here, relying on self-reported surveys, which are notoriously unreliable.

Platforms like Span use code-level detection to identify AI-authored code with high accuracy, then correlate that with downstream metrics like rework rate, review cycles, and bug density. This gives you a complete picture: where AI is helping, where it's hurting, and how to coach your team to use it more effectively.

The counterintuitive finding: it's not about reducing AI usage. It's about using AI more deliberately. Teams that coach engineers to use AI for complete, well-scoped tasks (not mixed human-AI work) see better outcomes.

The Bottom Line

We're entering a new era of engineering intelligence. The old proxies (lines of code and commit counts) are dead. Even the modern standards like cycle time are insufficient on their own.

To navigate the next few years, you need to understand the interplay between human creativity and AI leverage. You need to measure the quality of AI code, not just the volume.

If you want to build a high-performing team in 2025, stop measuring how busy everyone looks. Start measuring the friction that slows them down.

The metrics discussed here won't give you a single number to optimize. Instead, they'll give you a system view: where time actually goes, where quality breaks down, where your senior engineers are drowning in review work, and where your calendar is stealing focus time.

Use them to debug your system, not to rank your people.

If you're ready to move beyond vanity metrics and get real insights into your engineering organization, check out Span to see how engineering intelligence platforms are helping teams measure what actually matters.
