Kody from Kodus

Originally published at kodus.io

KPIs in Software Development: What to Track in 2026

Most engineering teams have some dashboard lying around. Usually it’s full of charts tracking everything from Jira ticket velocity to CI build time. The problem shows up when you ask how those numbers help anyone make a better decision. Most of the time, nobody knows how to answer. This is the common state of KPIs in software development: lots of data, little real understanding.

Dashboards end up becoming slide material in meetings, but they don’t make it into the day-to-day of the people building the system. Worse, when those metrics turn into isolated targets, they encourage shortcuts and local optimizations that don’t improve how the system works as a whole, and sometimes actively make it worse.

The disconnect starts when we measure engineering effort instead of impact. The moment a metric becomes a target, it stops being a good measure. If the focus becomes “number of PRs merged,” you’ll get small and irrelevant PRs. If it becomes “story points closed,” estimates start inflating. In the end, these metrics measure activity, not value. And that creates friction between engineering and the rest of the business, because they don’t help answer the only question that really matters: are we building the right thing, in a sustainable way?

What We Should Actually Measure

The obsession with purely quantitative “productivity” metrics is a leftover from a different era of work. Metrics like lines of code or commits per week are easy to game and say almost nothing about quality, maintainability, or the impact of the work being done. With AI code generation becoming common, measuring volume makes even less sense. Today, a developer can produce thousands of lines in an afternoon. That doesn’t mean they created value, and often it just means they created more stuff to maintain later.

For engineering teams in 2026, you can’t look at performance from a single angle anymore. You need a set of indicators that shows how the system actually works: delivery speed, product quality, and the impact of all that on the people building the software. Good metrics help you find bottlenecks, make conscious choices, and improve the flow of work without burning out the team. The goal isn’t to hit an isolated feature target, but to build a system that can deliver value in a predictable and consistent way over time.

DORA Metrics in 2026: what they still tell you

DORA metrics have been around for a while, but they’re still relevant because they measure two core outcomes of any delivery system: speed and stability. They tell you what is happening, not how the team gets there. In 2026, the risk is in interpretation. We live in a context where CI/CD is already part of the routine and AI code generation is common, which changes how these numbers need to be read.

Interpreting the KPIs

Looking at these metrics in isolation is what gets teams into trouble. A team can deploy frequently, but if every other deploy triggers an incident, that speed comes with a high cost in stability and user trust.

Deployment Frequency: Speed Needs Stability

This metric measures how often you can successfully ship code to production. High frequency usually indicates a healthy, automated pipeline and small batches of change. But that’s only half the story. If you deploy ten times a day but your Change Failure Rate is 20%, you’re just shipping problems faster. The real question this metric helps answer is: “How automated and reliable is our path to production?” Low deployment frequency often signals manual steps, flaky tests, or fear of breaking things, all system-level problems that need to be fixed.
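To make that concrete, here’s a minimal Python sketch of how you might compute deployment frequency from deploy records. The deploys list and its deployed_at field are hypothetical stand-ins for whatever your CI/CD tool exports.

```python
from datetime import date, timedelta

# Hypothetical deploy records exported from your CI/CD tool.
deploys = [
    {"deployed_at": date(2026, 1, 5)},
    {"deployed_at": date(2026, 1, 5)},
    {"deployed_at": date(2026, 1, 8)},
]

def deployment_frequency(deploys, start, end):
    """Average successful deploys per day over a time window."""
    days = (end - start).days or 1
    in_window = [d for d in deploys if start <= d["deployed_at"] < end]
    return len(in_window) / days

start = date(2026, 1, 1)
print(f"{deployment_frequency(deploys, start, start + timedelta(days=28)):.2f} deploys/day")
```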

Lead Time for Changes: batch size and review cost

This is the time between a developer’s first commit and that code running in production. It’s a powerful indicator of the efficiency of the entire development cycle. A long lead time is rarely a “coding speed” problem. It’s almost always a waiting problem. Code waits for CI to finish, waits for review, waits for a QA environment, or waits for a manual release window. You can think of that waiting time as a “verification tax.” Every step added to ensure quality before merge increases that tax. The point is to find the right balance, automating as much verification as possible so the tax doesn’t slow delivery to the point of stalling. Small PRs are the easiest way to reduce that tax, because they’re faster to review, test, and merge.
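As a rough sketch, lead time is just the gap between the first commit and the production deploy for each change. The records below are hypothetical; in practice the timestamps come from your Git history and deploy logs.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records: first commit timestamp and the timestamp of
# the production deploy that shipped it.
changes = [
    {"first_commit_at": datetime(2026, 1, 5, 9, 0), "deployed_at": datetime(2026, 1, 6, 14, 0)},
    {"first_commit_at": datetime(2026, 1, 7, 10, 0), "deployed_at": datetime(2026, 1, 12, 16, 0)},
]

lead_times_hours = [
    (c["deployed_at"] - c["first_commit_at"]).total_seconds() / 3600 for c in changes
]
print(f"median lead time: {median(lead_times_hours):.1f}h")
```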

Change Failure Rate: what’s breaking after deploy

This metric tracks the percentage of deploys that cause a production failure, requiring a hotfix, rollback, or patch. A high rate points to problems in your testing or review process. Maybe your test suite is missing whole categories of issues, or reviews have turned into rubber stamps because PRs are too large to analyze carefully. With increased use of AI-generated code, this metric carries even more weight. The code can look correct at first glance, but hide bugs or performance issues that static analysis won’t catch. That makes well-done integration and end-to-end tests essential, not optional.
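The calculation itself is simple: failed deploys divided by total deploys. A minimal sketch, assuming each deploy record is flagged when it triggered a hotfix, rollback, or patch (the field name is hypothetical):

```python
# Hypothetical deploy records flagged by whether they required a hotfix,
# rollback, or patch after going out.
deploys = [
    {"id": 1, "caused_failure": False},
    {"id": 2, "caused_failure": True},
    {"id": 3, "caused_failure": False},
    {"id": 4, "caused_failure": False},
]

failed = sum(1 for d in deploys if d["caused_failure"])
print(f"change failure rate: {failed / len(deploys):.0%}")
```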

Mean Time to Recovery (MTTR): system resilience

When a failure happens, how long does it take for the service to return to normal? MTTR measures the recovery capability of the system and the team. Low values usually indicate good observability, automated rollback, and clear operational procedures. Teams with low MTTR handle incidents better because they know what to do and don’t have to improvise.

This metric goes hand in hand with Change Failure Rate. Reducing failures is important, but eliminating them completely isn’t realistic. That’s why the balance is investing both in preventing problems and in recovering quickly when they inevitably happen.
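If you want to track it, MTTR is the average time from incident start to resolution. A minimal sketch, assuming incident records with hypothetical started_at and resolved_at fields from your on-call tool:

```python
from datetime import datetime

# Hypothetical incident records from your on-call or observability tool.
incidents = [
    {"started_at": datetime(2026, 1, 3, 10, 0), "resolved_at": datetime(2026, 1, 3, 10, 45)},
    {"started_at": datetime(2026, 1, 9, 22, 0), "resolved_at": datetime(2026, 1, 9, 23, 30)},
]

recovery_minutes = [
    (i["resolved_at"] - i["started_at"]).total_seconds() / 60 for i in incidents
]
print(f"MTTR: {sum(recovery_minutes) / len(recovery_minutes):.0f} minutes")
```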

Quality indicators in the development flow

DORA gives you a high-level view of system outcomes. But to understand the “why” behind those outcomes, you need to look at the work itself as it moves through the development process.

Pull Request Size

Small PRs are one of the clearest signals of a healthy workflow. Large PRs are hard to review deeply, have a higher chance of merge conflicts, and are more likely to introduce bugs. Tracking the distribution of PR size (for example, lines of code changed) helps you notice when batches are growing. If the median size of your PRs goes above 200–300 lines of code, your lead time tends to increase and your change failure rate usually goes up. It’s an early signal that the process is starting to get out of control.
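Looking at the distribution rather than a single average is what makes this useful. Here’s a small sketch of that check, with made-up PR sizes standing in for data you would pull from your Git host’s API:

```python
from statistics import median, quantiles

# Hypothetical PR sizes in lines changed (additions + deletions).
pr_sizes = [40, 85, 120, 210, 260, 340, 95, 1200, 150, 60]

p50 = median(pr_sizes)
p90 = quantiles(pr_sizes, n=10)[-1]  # 90th percentile
print(f"median PR size: {p50:.0f} lines, p90: {p90:.0f} lines")
if p50 > 250:
    print("warning: median PR size is drifting above the 200-300 line range")
```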

Time to First Review

Code that’s been written but not reviewed doesn’t generate value and risks becoming stale. This metric measures the time between opening a PR and the first truly relevant review comment. A long delay here is a clear bottleneck. It can come from unclear review ownership, reviewer overload, or simply PRs slipping through the cracks. It’s a direct measure of team collaboration and responsiveness.
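Measuring it is straightforward once you can tell a substantive review comment from a drive-by approval. A sketch with hypothetical PR records (opened_at, first_review_at):

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: when the PR was opened and when the first
# substantive review comment arrived (None if still waiting).
prs = [
    {"opened_at": datetime(2026, 1, 5, 9, 0), "first_review_at": datetime(2026, 1, 5, 11, 30)},
    {"opened_at": datetime(2026, 1, 6, 15, 0), "first_review_at": datetime(2026, 1, 8, 10, 0)},
    {"opened_at": datetime(2026, 1, 9, 9, 0), "first_review_at": None},
]

waits_hours = [
    (p["first_review_at"] - p["opened_at"]).total_seconds() / 3600
    for p in prs
    if p["first_review_at"] is not None
]
print(f"median time to first review: {median(waits_hours):.1f}h")
```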

Rework Rate

Rework is code that gets changed or removed shortly after being written, often within the same PR or in a follow-up PR. A high rework rate (for example, more than 15–20% of code experiencing churn) signals waste. The cause is usually earlier in the flow: ambiguous requirements, a poorly thought-out technical design, or feedback that arrived too late in the process. It’s a good metric for tech leads, because it points to friction in planning and design phases, not only in the coding phase.
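One way to put a number on it: out of all the lines written in a period, how many were modified or deleted again within a short window (say, three weeks). The records below are hypothetical; tools that analyze Git history can produce these counts for you.

```python
# Hypothetical per-change counts: lines written, and how many of those lines
# were touched again within three weeks of merging.
recent_changes = [
    {"lines_added": 400, "lines_reworked_within_3_weeks": 30},
    {"lines_added": 250, "lines_reworked_within_3_weeks": 90},
]

total_added = sum(c["lines_added"] for c in recent_changes)
total_reworked = sum(c["lines_reworked_within_3_weeks"] for c in recent_changes)
print(f"rework rate: {total_reworked / total_added:.0%}")  # above ~15-20% is a waste signal
```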

Code Complexity and Cognitive Load

This is harder to quantify, but critical to track. Tools can measure cyclomatic complexity or flag low-cohesion code, but this also connects to developer experience. How hard is it for someone new to understand a specific part of the codebase? How many services do you have to touch to ship a simple feature? High cognitive load slows development and increases the chance of errors. Often, this is best measured through qualitative feedback, like asking developers which parts of the system they avoid working on.
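For the quantitative side, if your codebase is in Python, one option is a tool like radon, which reports cyclomatic complexity per function. A quick sketch (the sample function is made up):

```python
# pip install radon
from radon.complexity import cc_visit

source = '''
def ship(order, user, flags):
    if not order.items:
        return None
    if user.is_blocked or (flags.get("freeze") and not user.is_admin):
        return None
    for item in order.items:
        if item.backordered:
            return "partial"
    return "ok"
'''

# cc_visit parses the source and returns one entry per function/method,
# each with a cyclomatic complexity score.
for block in cc_visit(source):
    print(block.name, block.complexity)
```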

The Distortion Problem: When Metrics Become Targets

The moment a metric becomes a performance evaluation target, it stops being a good measure. This is Goodhart’s Law, and it shows up all the time in software development.

Optimizing a single metric leads to harmful behavior. If you reward developers for closing more story points, estimates tend to inflate. If you reward high deployment frequency, the incentive becomes shipping changes that are too small just to boost the number, creating operational noise without delivering real value.

AI also feeds review bottlenecks and “commit inflation.” AI code generators can produce large volumes of code quickly. If teams are measured by lines of code or commit frequency, these tools create an illusion of productivity. But that code still needs to be reviewed, tested, and maintained by people. This can lead to “commit inflation,” where the volume of changes increases but the value delivered doesn’t, while the review process becomes a permanent bottleneck.

To avoid these distortions, metrics need to be analyzed together. Deployment frequency, for example, only makes sense when viewed alongside Change Failure Rate, to ensure speed isn’t coming at the expense of quality. In the same way, PR throughput needs to be paired with PR size, to ensure the team is shipping meaningful changes, not just inflating numbers.

How to design a metrics system that actually makes sense

Instead of looking at isolated metrics, you can adopt a set of metrics that complement each other. The idea isn’t to optimize a specific number, but to understand how the system behaves as a whole and avoid having gains in one area create problems in another.

DORA Metrics + Rework Rate

A simple but effective evolution is adding Rework Rate to the DORA metrics. That adds a dimension of quality and efficiency to the workflow itself, not just the final outcome. It helps answer the question: “How much of our effort is going toward moving forward versus fixing recent mistakes?”

Flow metrics

Flow metrics come from Lean practices and treat software development as a value delivery system. They’re excellent for diagnosing bottlenecks.

  • Flow Time: The total time from starting work to delivery (similar to Lead Time).
  • Flow Velocity: The number of work items completed per unit of time (for example, stories per week).
  • Flow Efficiency: The ratio between active work time and total flow time. Low efficiency (often below 15%) shows most time is spent waiting (see the sketch just after this list).
  • Flow Load: The number of work items currently in progress. Too much WIP slows everything down.
  • Flow Distribution: The distribution of work by type (for example, features, bugs, risks, debt). This helps you see whether you’re spending all your time fighting fires instead of building new value.
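Flow Efficiency is the one teams most often get wrong, so here’s a minimal sketch of the calculation. The per-item numbers are hypothetical; in practice they come from your board’s status history.

```python
# Hypothetical work items: total elapsed time from start to done, and the
# portion of that time someone was actively working on the item.
items = [
    {"flow_time_days": 10.0, "active_days": 1.5},
    {"flow_time_days": 6.0, "active_days": 1.0},
    {"flow_time_days": 14.0, "active_days": 2.0},
]

total_flow = sum(i["flow_time_days"] for i in items)
total_active = sum(i["active_days"] for i in items)
print(f"flow efficiency: {total_active / total_flow:.0%}")  # below ~15% means mostly waiting
```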

SPACE Framework: Developer Experience matters

The SPACE framework starts from the idea that you can’t understand productivity without considering developer well-being. In practice, that means looking at:

  • Satisfaction and well-being: How happy and healthy are developers?
  • Performance: What are the outcomes of the work? (This is where DORA metrics fit).
  • Activity: What actions are being taken? (for example, commit count, PRs created).
  • Communication and collaboration: How well does information flow within and between teams?
  • Efficiency and flow: How easily does work move forward without interruptions? (This is where flow metrics fit).

You don’t need to track every SPACE metric. Picking a few metrics across different areas already helps you see what’s going on more clearly.

Adapting metrics as the company grows

The right metrics depend on the team’s size and stage. What works well in a startup tends to create problems when applied the same way in a larger company.

Startup focus: Speed and Sustainability

Early-stage companies are focused on speed and finding product-market fit. Lead Time for Changes is crucial. The risk is piling up rushed decisions that take their toll later and end up slowing everything down. Pairing Lead Time with Rework Rate helps ensure speed represents real progress, not just motion without direction.

Scale-up challenges: Distributions

As the organization grows, averages become less useful. You need to look at distributions. Your average lead time might be two days, but if your 95th percentile is 20 days, you have a predictability problem. Here, flow metrics become critical to identify the systemic bottlenecks that show up as more teams and dependencies are added.
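Checking the tail is a one-liner once you have the raw lead times. A sketch with made-up numbers, just to show how much the average can hide:

```python
from statistics import mean, quantiles

# Hypothetical lead times in days: the average looks healthy, the tail does not.
lead_times_days = [1, 1, 2, 2, 2, 3, 3, 4, 5, 21]

p95 = quantiles(lead_times_days, n=20)[-1]  # 95th percentile
print(f"average: {mean(lead_times_days):.1f} days, p95: {p95:.1f} days")
```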

Enterprise reality: Capacity over Speed

In large organizations, the goal is often to build organizational capacity. You might measure adoption of a new CI/CD platform or the percentage of teams that can deploy independently. The focus shifts from the speed of a single team to the resilience and modularity of the entire engineering organization.

At the end of the day, KPIs are only useful when they help you understand cause-and-effect relationships in your development process. They’re a tool for making better trade-offs and improving delivery predictability, so you can sustain speed without sacrificing the quality and stability your users depend on.
