Neil Giarratana
Measuring AI's Impact on Engineering Teams: The Metrics Already Exist

There's a question that keeps coming up in engineering leadership circles: how do we measure whether AI is actually helping?

It's a fair question. Most of us are investing real time and money into AI tooling for our teams, and at some point we need to understand what we're getting for it. The challenge is that the most visible metrics...PR counts, commits, code suggestions accepted...are measuring activity, not outcomes.

We've been down this road before. There was a time when lines of code was the go-to measure of engineer productivity, and it was a terrible idea. People padded their stats and quality suffered. The same thing happened with test coverage: once it became the goal instead of a signal, teams chased the number instead of what we actually wanted, which was confidence in our code.

The 2025 DORA Report puts some real numbers to this. AI helps developers complete 21% more tasks and merge 98% more pull requests...but organizational delivery metrics stayed flat.

More activity, same outcomes.

The report also notes that AI "magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones." It's an amplifier, not a fix.

So if activity metrics aren't the answer, what is?

I keep coming back to DORA. For those unfamiliar, DORA (DevOps Research and Assessment) is a research program, now part of Google Cloud, that has studied software delivery performance across thousands of organizations. Their findings, published in the book Accelerate and in annual State of DevOps reports, identified four key metrics that correlate with high-performing engineering organizations.

What's interesting about these metrics is that none of them measure activity. They measure outcomes.

Lead time is the time from when an idea is conceived to when it's in production and in front of customers. Pragmatically, most teams measure this as first commit to first production deployment...which doesn't tell the whole story, but sometimes you have to start with what you can actually measure and refine from there.
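As a concrete sketch of that pragmatic version, lead time is just the elapsed time between two timestamps (the function name and sample dates here are illustrative, not from any particular tool):

```python
from datetime import datetime

def lead_time_days(first_commit: str, first_deploy: str) -> float:
    """Pragmatic lead time: first commit to first production deploy, in days."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(first_commit, fmt)
    end = datetime.strptime(first_deploy, fmt)
    return (end - start).total_seconds() / 86400  # seconds per day

# A change committed Monday morning and deployed Wednesday morning:
print(lead_time_days("2025-06-02T09:00:00", "2025-06-04T09:00:00"))  # 2.0
```

The real metric starts earlier (idea conception) and ends later (in front of customers), but commit-to-deploy is the span most teams can actually pull from their tooling today.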

This is where AI's leverage really shows because if we can compress that loop, we're not just shipping faster...we're learning faster. We're getting real data about whether we built the right thing, and we're able to iterate on it sooner.

That feedback loop is everything.

Deployment frequency tells us how often our teams are releasing to production. More frequent, smaller deployments are a signal of a healthy delivery pipeline and a team that's iterating confidently.

With AI in the mix, we should see this go up...not because we're pushing more code, but because we're moving through the development cycle with less friction.
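A minimal way to compute this from a list of deploy dates might look like the following (a simplification I'm assuming for illustration: it averages only over weeks that had at least one deploy, so a fuller version would span the whole calendar range):

```python
from collections import Counter
from datetime import date

def deploys_per_week(deploy_dates: list[date]) -> float:
    """Average production deploys per ISO week, over weeks with any deploys."""
    weeks = Counter(d.isocalendar()[:2] for d in deploy_dates)  # (year, week)
    return sum(weeks.values()) / len(weeks)

# Three deploys one week, one the next:
deploys = [date(2025, 6, 2), date(2025, 6, 4), date(2025, 6, 5), date(2025, 6, 10)]
print(deploys_per_week(deploys))  # 2.0
```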

Change failure rate is the percentage of deployments that result in a failure requiring remediation. This one should go down with AI since it can help catch bugs, flag regressions, and improve code quality before things hit production.

If our change failure rate isn't trending in the right direction, that's a signal worth paying attention to because it might mean we're shipping faster without the guardrails to match.
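The calculation itself is simple; the judgment call is what counts as a "failure" (rollback, hotfix, incident-triggering deploy). A sketch, with that definition left to the reader:

```python
def change_failure_rate(total_deploys: int, failed_deploys: int) -> float:
    """Percent of production deployments that required remediation."""
    if total_deploys == 0:
        return 0.0
    return 100 * failed_deploys / total_deploys

# 6 of 40 deploys this month needed a rollback or hotfix:
print(change_failure_rate(40, 6))  # 15.0
```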

Mean time to resolution is traditionally about how quickly we recover from incidents. But I think this metric gets more interesting when we broaden it a bit.

Most teams have low-priority security findings they can't get to, edge-case bugs that keep slipping down the backlog, tech debt that never wins the prioritization fight. AI gives us the horsepower to finally tackle that work.

Think about a backlog of tech debt that's been piling up for months because there was always a feature or a bug that took priority...that's the kind of work AI can help burn down. It's not just that MTTR improves; it's that the scope of what we can resolve expands.
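In its traditional incident-recovery form, the computation is just an average of resolution durations (the input format here is an assumption for illustration):

```python
def mttr_hours(resolution_minutes: list[float]) -> float:
    """Mean time to resolution: average minutes from detection to fix, in hours."""
    return sum(resolution_minutes) / len(resolution_minutes) / 60

# Three incidents: 30 min, 90 min, and a rough 4-hour one:
print(mttr_hours([30, 90, 240]))  # 2.0
```

The broader framing above...security findings, edge-case bugs, tech debt...doesn't change the math, just what you choose to feed into it.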

Now here's the thing that's always been true about DORA and is worth restating: any one of these metrics in isolation can be misleading. We can deploy like crazy and ship nothing of value. We can have a great change failure rate because we're barely deploying at all.

It's the composition of these signals together that tells the real story. A team that's deploying frequently, with short lead times, low failure rates, and fast recovery...that's a team that's delivering well. And that picture doesn't change just because AI entered the equation.

The piece I'd add on top of DORA is product metrics. At the end of the day, don't tell me how many PRs went up; tell me whether the change actually mattered to the people who use it. Did the feature hit its goals? Did users behave differently? Did we move a number the business cares about?

DORA tells us whether we're delivering well. Product metrics tell us whether we're delivering the right thing. Together, they give us the full picture.

None of this is particularly new thinking. DORA has been around for years and the Accelerate book laid out these principles back in 2018. What's new is that AI is tempting us to forget them in favor of shinier, easier-to-measure activity metrics.

The tools for measuring what actually matters are already there. We just have to use them.
