
Your devs are faster with AI. Your team might not be.

How to use DORA metrics to measure the real impact of AI tools — and stop confusing speed with delivery

OK, everyone's using AI. Now what?

I was in a conversation the other day about the real impact of AI on software development. Not the "AI will replace developers" talk — the practical one: how do we actually measure this? How do we know if these tools are genuinely improving our team's delivery?

The conversation went to DORA, to productivity metrics, to ROI. And I walked away buzzing — not with doubt, but with the urge to organize all of it. This article is the result.

I use AI every day. Claude Code, Cursor, Copilot, the whole stack. For my own work and for my team's. I've been sharing weekly in my newsletter how that journey has been going — what works, what doesn't, what surprises me.

But using is one thing. Measuring is another.

And measuring for real — with numbers, with evidence, with something you can put in a spreadsheet and show someone — is where most teams get stuck.

Here's the truth: most teams today are in "adopt and hope" mode. Rolled out Copilot for everyone, maybe a Claude Code license for whoever asked, and the success metric became a vibe. "Seems faster." "People like it." "More PRs this sprint."

Great. But did lead time drop? Did the change failure rate shift? Is review time the same, better, or — spoiler — worse?

If you can't answer those questions, you're not alone. But you can't stay there either.

In this article, I'm going to walk you through a practical plan to measure the impact of AI on your team. I'll explain what DORA metrics are (and why they matter now more than ever), show where AI actually shows up in the numbers — and where it misleads — and give you a playbook with concrete tools and metrics you can start using tomorrow. Not next week. Tomorrow.

Because an opinion without data is just another opinion.

What DORA metrics are (skip if you already know)

If you know DORA, skip ahead. No hard feelings. But if you've heard the term in a meeting and nodded along pretending you knew what it meant — stick with me. It's worth it.

DORA stands for DevOps Research and Assessment. It's a research program that started in 2013 and now lives inside Google Cloud. Every year they publish a massive report based on surveys with thousands of teams around the world. The goal is simple: understand what makes a software team deliver well.

And "deliver well" here isn't an opinion. It's metrics. Four of them, originally — now five.

The original four

Deployment frequency — how often your team ships to production. Not "how often we merge PRs." Deploy. To production. Running. High-performing teams do this multiple times a day. Struggling teams do it once a month, sometimes less.

Lead time for changes — how long it takes from commit to that code running in production. This includes everything: CI, review, QA, approval, deploy. If your commit takes 3 weeks to reach production, it doesn't matter how fast you wrote the code. Delivery is still slow.

Change failure rate — of all the deploys you make, how many cause problems? Rollbacks, hotfixes, incidents. If you deploy 10 times a week and 3 of those break something, your change failure rate is 30%. That's terrible.

MTTR (mean time to recover) — when things go wrong, how long does it take to get back to normal? A team that recovers in 15 minutes lives in a different world from a team that takes 2 days.

Together, these four metrics measure two things: speed (deployment frequency + lead time) and stability (change failure rate + MTTR). DORA's key insight is that elite teams are good at both. It's not "fast or stable" — it's both at the same time.
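To make the four definitions concrete, here is a minimal sketch that computes them from a deploy log. In practice this data would come from your CI/CD system or a tool like DevLake; the record fields and numbers below are my own invention, purely for illustration.

```python
from datetime import datetime

# Hypothetical deploy log: one record per production deploy.
# Field names are illustrative, not any tool's schema.
deploys = [
    {"committed": datetime(2025, 1, 5, 9, 0), "deployed": datetime(2025, 1, 6, 10, 0),
     "failed": False, "recovered": None},
    {"committed": datetime(2025, 1, 6, 12, 0), "deployed": datetime(2025, 1, 7, 15, 0),
     "failed": True, "recovered": datetime(2025, 1, 7, 16, 30)},
    {"committed": datetime(2025, 1, 8, 14, 0), "deployed": datetime(2025, 1, 9, 11, 0),
     "failed": False, "recovered": None},
]

def deployment_frequency(deploys, window_days):
    """Deploys per day over the observation window."""
    return len(deploys) / window_days

def lead_time_hours(deploys):
    """Mean hours from commit to running in production."""
    hours = [(d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deploys]
    return sum(hours) / len(hours)

def change_failure_rate(deploys):
    """Fraction of deploys that caused a rollback, hotfix, or incident."""
    return sum(d["failed"] for d in deploys) / len(deploys)

def mttr_hours(deploys):
    """Mean hours from failure to recovery, over failed deploys only."""
    failed = [d for d in deploys if d["failed"]]
    hours = [(d["recovered"] - d["deployed"]).total_seconds() / 3600 for d in failed]
    return sum(hours) / len(hours)
```

Note that the speed metrics use every deploy, while MTTR only looks at the failed ones: that's why a team can have a great MTTR and still a terrible change failure rate.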

The fifth metric: rework rate

In 2024, DORA added a fifth metric: rework rate. It measures the percentage of deployments that are unplanned fixes — the patch that wasn't on the backlog and only exists because something that shipped to production wasn't as ready as it seemed.

The difference from change failure rate is subtle but important: change failure rate catches the deploy that explodes. Rework rate catches the deploy that works, but works poorly. The bug a user reports three days later. The weird behavior nobody caught in review.

Keep this metric in mind. It's going to be very important when we talk about AI-generated code.

The problem with DORA (that nobody tells you)

Now, before you rush off to build a dashboard with these five metrics: hold on.

DORA on a slide looks beautiful. Four quadrants, nice colors, "elite vs low performer." You can build an entire presentation around it and walk out of the meeting to applause.

But DORA in practice requires context. If you measure deployment frequency without accounting for the fact that half your deploys are config changes that don't go through review, the number lies. If your lead time is low because your team pushes straight to main without review — congratulations, your lead time is great and your production is a ticking time bomb.

The metrics were designed to be read together, as a system. Speed without stability is chaos. Stability without speed is bureaucracy. Balance is what matters.

And that's exactly why DORA is the right framework to measure AI impact. Because AI disrupts that entire balance — and if you only look at one metric in isolation, you'll draw the wrong conclusion.

Which is exactly what's happening with most teams today.

The paradox: your devs are faster, but your team isn't

OK, now that we speak the same language about DORA, let me show you what's happening in the real world.

Faros AI analyzed telemetry from over 10,000 developers across 1,255 teams. The individual numbers look great: 21% more tasks completed, 98% more PRs merged. If you stopped reading here, the conclusion is obvious — roll out AI for everyone and watch the numbers climb.

Don't stop reading here.

When they measured DORA on the same teams — deployment frequency, lead time, change failure rate, MTTR — nothing moved. Zero. Developers produced nearly double the PRs and team delivery stayed exactly the same.

It gets worse. Anthropic ran an internal survey with 132 engineers using Claude Code. The individual results were stunning: 67% more PRs merged per day, usage jumped from 28% to 59% of daily work, self-reported productivity gains between 20% and 50%. Then someone checked the organizational dashboard. The delivery metrics hadn't moved.

This has a name: AI Productivity Paradox.

And it's not an isolated case. The 2024 DORA report was even more direct: for every 25 percentage point increase in AI adoption, delivery throughput dropped 1.5% and delivery stability dropped 7.2%. More AI, worse delivery. Read that again.

The METR study with 16 experienced developers working on familiar codebases showed that developers using AI took 19% longer to complete tasks — while estimating they were 20% faster. A 39-point gap between perception and reality.

So we have a scenario where:

  • Devs write more code ✅
  • Devs open more PRs ✅
  • Devs feel faster ✅
  • The team delivers more value ❌

How is this possible? How do you double individual output and the system's result stays the same — or gets worse?

The answer lies in what happens after the code is written.

We're reviewing code with no author

To understand the paradox, you need to look at code review. Because that's where everything jams up.

Think about the pipeline: AI generates code → dev opens PR → someone reviews → merge → deploy → production → value. AI accelerated the first step. Wonderful. But the rest of the pipeline didn't change. And when you speed up one stage without speeding up the others, you don't deliver faster — you create a bigger queue at the next stage.

In this case, the queue is code review. And Faros AI's numbers are brutal: while PR volume went up 98%, review time went up 91%. PRs got 154% larger. The bottleneck didn't disappear — it changed address.

But the problem isn't just volume. The problem is that review changed in nature.

Before AI

When you reviewed a colleague's code, you were evaluating someone's reasoning. You know the person, you know their level, you know how they think. Review was a conversation: "I see what you were going for here, but what if we did it this way?" You'd read the code and understand the intent behind it.

The effort was naming, architecture, readability. Aesthetic and structural concerns. Important, sure — but familiar.

After AI

Now you're reviewing code that nobody on the team thought through. The AI wrote it, the dev glanced at it, opened the PR. The code looks right. It compiles, passes tests, follows project conventions. But "looks right" and "is right" are very different things.

And that changes review completely:

From intent to factual correctness. Before, it was "I get what you were trying to do." Now it's "Is this actually correct?" Because AI writes plausible code, not necessarily correct code. It's great at generating something that looks like it works.

From author trust to zero trust. Before, you knew who wrote it. Now, part of the code wasn't thought through by anyone on the team. Review becomes almost an audit — trust, but verify.

From reading to mental simulation. Before, you'd read the code and follow the flow. Now you need to imagine: unexpected inputs, weird state, integrations breaking. Review became more about executing the code in your head than reading it.

From improving to detecting risk. Before, it was refactoring suggestions, clean code, better naming. Now it's "could this cause a problem?", "does this scale?", "does this open a security hole?" Priority shifted from aesthetic quality to risk mitigation.

From diff to system. Without adequate context, AI tends to ignore project patterns, duplicate existing logic, and create isolated solutions. Good specs and well-crafted prompts reduce this significantly — but most teams aren't there yet. Until they are, the reviewer foots the bill: they're no longer just reviewing the diff, they're verifying whether it fits the rest of the system.

From review to rubber stamp. And then there's the scenario nobody likes to admit: the dev gets the PR, sees it's large, sees it was AI-generated, and approves without really reading it. "The AI made it, must be fine." PR volume is already high, review is already heavier, and the temptation to drop a quick LGTM and move on is real. When that happens, review stops existing as a quality filter — and everything we've discussed about change failure rate and rework rate becomes a direct consequence.

The invisible effect

All of this together explains why review takes longer, is more draining, and demands more seniority. Because now you need to think more than read. The cognitive effort is fundamentally different.

And the data backs up what we feel in practice. Stack Overflow's 2025 survey shows developer trust in AI tool accuracy dropped from 40% to 29% in a single year. 66% of developers say AI-generated code is "almost right, but not quite" — and that "almost" is exactly what makes review harder. Completely wrong code fails fast. Almost-right code passes tests and breaks in production three weeks later.

Sonar's survey of over 1,100 developers surfaced the most worrying number: 96% of developers don't trust that AI-generated code is functionally correct. But only 48% say they always review before committing. Read that again: almost everyone is skeptical, but half don't verify. There's a name for this: verification debt.

The review workarounds

And then there's the second problem: how teams are trying to solve this.

Some people copy the PR diff, paste it into ChatGPT or Claude, and treat the response as a completed review. Others wire up an MCP server, point the AI at the code, and say "review this." And yes, it catches things — bad naming, unused imports, sometimes even an obvious bug.

But that's not code review. That's lint with better marketing.

The AI doing the review doesn't know your system's context. It doesn't know that service has an implicit contract with another module. It doesn't know that pattern was an architectural decision your team made for a specific reason. It sees the diff in isolation — which is exactly the "diff vs system" problem we already discussed.

Using AI as a first filter before human review? Excellent. Tools like Claude Code Review and CodeRabbit do exactly that — they handle the mechanical stuff so the human reviewer can focus on the cognitive stuff. But using AI as a replacement for human review is trading a bottleneck for a blind spot.

What this does to the metrics

And this creates a direct side effect:

Writing time ↓
Review time ↑
Lead time... doesn't change.

Or worse: it gets longer.

Code review stopped being a conversation between developers and became a process of validating generated code. The main job is no longer understanding intent — it's ensuring correctness.

AI doesn't remove the bottleneck — it moves the bottleneck

Let's zoom out and look at the whole system.

Before AI, most teams' bottleneck was writing code. It was slow. It required focus. There was always that task sitting in "in progress" for three days because it was complex, because the dev got stuck, because it needed research. Writing was the slowest point in the pipeline.

AI solved that. Writing got fast. Sometimes absurdly fast.

But the pipeline didn't shrink — the weight just shifted. Now the slowest point is reviewing, validating, and testing. And this is so predictable it already has a name in the literature: Verification Tax. The cost of verifying what the AI produced.

Think about it this way: if your team had a 100-hour pipeline and 40 of those hours were writing code, AI might have cut those 40 down to 15. Great. But if the 30 hours of review turned into 50 because volume doubled and review complexity increased, the total went from 100 to... 95. Five hours gained. Not exactly the miracle the slide promised.
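That back-of-the-envelope math is worth writing down, if only to make the assumptions visible. The hour figures here are illustrative, not from any study:

```python
def pipeline_hours(writing, review, everything_else):
    """Total pipeline cost: writing + review + QA/deploy/everything else."""
    return writing + review + everything_else

# Before AI: writing was the slow part of the pipeline.
before = pipeline_hours(writing=40, review=30, everything_else=30)

# After AI: writing collapsed, but review grew with volume and complexity.
after = pipeline_hours(writing=15, review=50, everything_else=30)

saved = before - after  # the actual system-level gain
```

Swap in your own numbers: if review grows faster than writing shrinks, `saved` goes negative — which is exactly the "more AI, worse delivery" result from the 2024 DORA report.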

And then you go back to DORA metrics and the question that actually matters:

"Did this increase in output translate to improvement in delivery?"

In most cases, the answer is no. Deployment frequency didn't go up because the bottleneck is now review, not writing. Lead time didn't drop because code reaches the PR faster but sits longer waiting for approval. Change failure rate might even get worse if overloaded review lets things slip through. And rework rate — remember the fifth metric? — goes up because "almost right" code generates fix after fix.

The 2025 DORA report sums this up in a way that stings: AI works as a mirror and amplifier. If the team already has good processes, AI accelerates them. If the team already has problems, AI amplifies the problems. It doesn't fix anything — it exposes what was already there.

And that's why measuring only individual output is so dangerous. The dev looks at their own work and thinks "I'm flying." The tech lead looks at the board and thinks "why hasn't velocity changed?" They're both right. They're both looking at different metrics.

The right question isn't "are my devs more productive?" It's "is my team delivering more value?" And to answer that, you need a playbook.

The playbook: how to start measuring tomorrow

Enough concepts. This is where we get operational. I'm going to give you a four-week plan with concrete tools. Adapt it to your context, but the structure works.

Week 1: create your baseline

Before measuring the impact of anything, you need to know where you stand today. Without a baseline, any number you pull later is meaningless — you have nothing to compare against.

What to measure:

Lead time — from first commit to deploy in production. If you use GitHub Actions for deployment, GitHub Insights already shows part of this. For a more complete picture, Apache DevLake (open source, self-hosted) connects with GitHub, GitLab, and Jira and gives you a ready-made DORA dashboard. If your team already uses Datadog, the DORA Metrics module integrates directly with your CI/CD pipeline and correlates with infrastructure data.

Deployment frequency — how many deploys per week your team ships. Same tools: DevLake or Datadog pull this automatically from your pipelines. For something simpler, Four Keys from Google's own DORA team (open source) does exactly this from GitHub events.

Change failure rate — percentage of deploys that cause problems in production. I need to be honest here: to measure this, you need some form of incident management. And if your team doesn't do that today, that's OK — many teams don't. But it's time to start.

You don't need an expensive tool on day one. Start simple: create a Slack channel (something like #incidents), agree with the team that every rollback, hotfix, or production issue gets a message there with a date, description, and resolution time. That's incident management. Basic, manual, but it's real data.

As that process matures, then it's worth looking at tools like PagerDuty, Opsgenie, or even GitHub Issues with incident labels. DevLake and LinearB integrate with these sources and calculate change failure rate automatically. But Slack with discipline already gives you the number that matters.

MTTR — time between the problem appearing and the service returning to normal. If you followed the advice above and created the incidents channel, you already have what you need: the time the message was posted and the time someone replied "resolved." The difference between those two is your MTTR. As your process evolves, incident management tools calculate this automatically — but the starting point is the same manual log.
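If the incidents channel is your source of truth, the computation is as simple as the log. A sketch, assuming a hand-maintained CSV export with `opened` and `resolved` timestamps — the column names and format are my own convention, not a Slack export schema:

```python
import csv
import io
from datetime import datetime

# Hypothetical export of the #incidents channel: one row per incident,
# with the timestamp of the report and of the "resolved" reply.
INCIDENT_LOG = """opened,resolved,description
2025-01-10T14:00,2025-01-10T14:45,bad deploy rolled back
2025-01-17T09:30,2025-01-17T12:30,payment webhook hotfix
"""

def mttr_minutes(log_csv):
    """Mean minutes between the incident report and the 'resolved' message."""
    fmt = "%Y-%m-%dT%H:%M"
    durations = []
    for row in csv.DictReader(io.StringIO(log_csv)):
        delta = datetime.strptime(row["resolved"], fmt) - datetime.strptime(row["opened"], fmt)
        durations.append(delta.total_seconds() / 60)
    return sum(durations) / len(durations)
```

Ten lines of Python and a disciplined channel already give you a real MTTR number — the tooling can come later.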

You don't need perfection here. You need an honest snapshot of the last 2–3 months. Pull the data, put it in a spreadsheet, and save it. That's your "before" photo.

Week 2: measure the PR flow

Now you go one level deeper and look at where AI actually shows up — in the daily work of writing code. This is where the impact (or the illusion of impact) becomes visible.

PR cycle time — time from PR open to merge. GitHub has the Issue Metrics Action (free, maintained by GitHub) that measures time to first response and time to close. Configure it once and it runs on whatever schedule you want. If you need more detail, the Pull Request Analytics Action breaks the cycle into stages: development time, reviewer response time, review time, integration time.

Time to first review — how long the PR sits waiting for someone to look at it. If this number is climbing, you already have evidence of a review bottleneck. The Issue Metrics Action gives you this for free.

Review load per dev — how many PRs each person reviews per week and how many lines per review. This reveals saturation. Graphite Insights shows this in a visual dashboard. If you don't want a paid tool, the Pull Request Analytics Action generates per-dev reports with review volume and engagement.

PR size — lines added per PR. Remember the 154% increase from the Faros AI study? Microsoft's PR Metrics Action (open source) adds automatic size labels to the PR title — XS, S, M, L, XL. Simple, visual, and everyone on the team sees it.
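If you'd rather prototype the labeling logic before wiring in an Action, the bucketing is trivial. The thresholds below are my own illustration — not the PR Metrics Action defaults — so tune them to your team:

```python
def pr_size_label(lines_changed):
    """Bucket a PR by lines changed into XS/S/M/L/XL.
    Thresholds are illustrative, not any tool's defaults."""
    thresholds = [(10, "XS"), (50, "S"), (200, "M"), (500, "L")]
    for limit, label in thresholds:
        if lines_changed <= limit:
            return label
    return "XL"
```

The point isn't the labels themselves — it's making size visible so that a 154% growth in PR size shows up as a wall of L/XL tags instead of a vague feeling that reviews got heavier.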

Week 3: compare the two photos

Now you have two data sets: the system-level DORA metrics and the day-to-day PR metrics. The question you ask here is simple:

"We're producing more code, more PRs, faster. But are we delivering more?"

Cross-reference them. If PR throughput went up but deployment frequency stayed flat, the gains are dying between the PR and production. If PR cycle time went up while writing time went down, the bottleneck moved — exactly what we discussed earlier.

You don't need a sophisticated tool here. You need a spreadsheet with two columns side by side and the honesty to read what the numbers say.
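The spreadsheet comparison can even be scripted. A sketch with invented numbers, showing the paradox in two dicts — output doubled, delivery didn't move, lead time got worse:

```python
def compare_metrics(baseline, current):
    """Classify each metric as improved / flat / worse.
    For most DORA metrics lower is better (lead time, CFR, MTTR);
    the set below holds the exceptions where higher is better."""
    higher_is_better = {"deployment_frequency", "pr_throughput"}
    verdicts = {}
    for metric, before in baseline.items():
        after = current[metric]
        if after == before:
            verdicts[metric] = "flat"
        elif (after > before) == (metric in higher_is_better):
            verdicts[metric] = "improved"
        else:
            verdicts[metric] = "worse"
    return verdicts

# Illustrative numbers only: the "before" and "after" photos side by side.
baseline = {"pr_throughput": 20, "deployment_frequency": 5, "lead_time_days": 4}
current = {"pr_throughput": 40, "deployment_frequency": 5, "lead_time_days": 6}
```

If the verdicts read "improved" on PR throughput and "flat" or "worse" everywhere else, the gains are dying between the PR and production.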

Week 4: identify the new bottleneck

With the data from previous weeks, you'll be able to pinpoint where the pipeline is jamming. And the spoiler is almost always one of these:

Review — if time to first review and review cycle time are climbing, the bottleneck is review. Action: consider automated review as a first filter (Claude Code Review, GitHub Copilot code review, CodeRabbit), limit PR size, distribute review load better.

QA/Tests — if change failure rate is climbing or rework rate is high, code is reaching production with problems. Action: more automated test coverage, quality gates in CI before merge.

Deploy — if PRs merge fast but deployment frequency doesn't rise, the bottleneck is in the deploy pipeline. Action: this isn't an AI problem, it's an infrastructure and process problem.

The real talk on tools

I'll be direct: if you want to start tomorrow at zero cost, go with Apache DevLake for DORA metrics and Issue Metrics Action + PR Metrics Action for PR flow. All open source, all integrate with GitHub.

If you have budget and want something more integrated, LinearB and Swarmia combine DORA with developer experience metrics and give you ready-made dashboards.

If you already use Datadog, activate the DORA Metrics module — it's right there, and you're probably paying for it without knowing.

What no tool can solve

Now, the disclaimer that matters: this is not a silver bullet.

Every team has its own context. Your pipeline, your review process, your deploy culture, your test maturity — all of this changes how the numbers behave and what they mean. A team that uses feature flags and continuous deploy lives in a different reality from a team that does biweekly releases with manual QA.

Tools measure. You interpret. And you adapt to your context too.

The playbook gives you the structure. The numbers give you the evidence. But the decision of what to do with it? That's on the tech lead, the manager, the team. No dashboard replaces judgment.

AI ROI: the math almost everyone gets wrong

If you've made it this far, you've probably already spotted where most people go wrong calculating AI tool ROI. But let me make it explicit, because this is the part that ends up on the C-level's slide.

The naive math goes like this:

ROI = (productivity gain − tool cost) / tool cost

And "productivity gain" is usually calculated like: "our devs save 3–4 hours per week, we have 20 devs, that's 60–80 hours per week, at an average cost of X per hour, we're saving Y per month." Subtract the license cost, divide, and you're done — positive ROI. Everyone's happy.

The problem is that those 3–4 hours saved on writing code didn't disappear from the system. They reappeared in review, validation, and debugging of code that nobody on the team actually wrote. The time was redistributed, not eliminated.

If your DORA metrics haven't improved — if deployment frequency didn't go up, if lead time didn't drop, if change failure rate didn't decrease — then the individual gain didn't become an organizational gain. And organizational ROI is what pays the bills.

That doesn't mean ROI is zero. It means the math needs to be more honest. A few things that rarely make it into the calculation but should:

The cost of additional review. If your seniors are spending 91% more time reviewing PRs, that time is coming from somewhere — mentorship, architecture, technical planning. What's the value of what they're not doing?

The cost of rework. If rework rate went up, every unplanned fix is work that wasn't on the backlog. It's invisible cost.

The cost of the learning curve. Teams need time to learn to use AI effectively — good specs, good prompts, good workflows. That initial investment rarely appears on the ROI spreadsheet.

The more honest math isn't "how much time did the dev save writing code." It's "how much more value did my team deliver to production after adopting the tool." And that answer comes from DORA metrics, not from satisfaction surveys.

If DORA improved: congrats, your ROI is real. If DORA stayed flat: your team is faster at the wrong stage. If DORA got worse: you're paying to create problems.
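A toy version of the two calculations makes the gap obvious. Every number here is invented for illustration:

```python
def naive_roi(hours_saved_weekly, devs, hourly_cost, license_cost_monthly):
    """The slide version: only writing time saved counts."""
    monthly_gain = hours_saved_weekly * devs * hourly_cost * 4  # ~4 weeks/month
    return (monthly_gain - license_cost_monthly) / license_cost_monthly

def honest_roi(hours_saved_weekly, extra_review_weekly, rework_weekly,
               devs, hourly_cost, license_cost_monthly):
    """Subtract the hours that reappeared downstream: review and rework."""
    net_weekly = hours_saved_weekly - extra_review_weekly - rework_weekly
    monthly_gain = net_weekly * devs * hourly_cost * 4
    return (monthly_gain - license_cost_monthly) / license_cost_monthly
```

With 3 hours/week saved, 20 devs at $50/hour, and a $2,000/month license, the naive math reports a 500% return. Add back 2 hours of extra review and 1 hour of rework per dev per week, and the same tool is a net loss — which is why the honest answer has to come from delivery metrics, not from the writing stage alone.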

The only sentence that matters

If you read this entire article, thank you. If you skipped everything and landed here, no judgment — at least take this:

AI optimizes the individual. DORA measures the system. And the system ≠ the sum of its individuals.

Your dev is faster. Your team might not be. And until you measure the system, you won't know the difference.

Now you have the metrics, the tools, and the playbook. What's left is to open the spreadsheet and start. Tomorrow.


Tools mentioned in this article

Metrics and observability

  • Apache DevLake — Open source engineering metrics platform. Connects with GitHub, GitLab, and Jira, ships with a ready-made DORA dashboard. Self-hosted.

  • Four Keys — Open source project from Google's DORA team. Collects events from GitHub/GitLab and calculates the four DORA metrics.

  • LinearB — Paid platform combining DORA metrics with developer experience and workflow automation (gitStream). Integrates with GitHub, GitLab, Jira, and incident tools.

  • Swarmia — Paid platform with DORA, SPACE, and investment metrics (features vs. maintenance vs. bugs). Good Slack integration.

  • Datadog DORA Metrics — DORA module inside Datadog. If you already use Datadog for observability, just turn it on — it correlates delivery metrics with infrastructure data.

PR and code review metrics

  • GitHub Issue Metrics Action — Official free GitHub Action. Measures time to first response, time to close, and generates automatic reports as repo issues.

  • Pull Request Analytics Action — Free GitHub Action that generates detailed review reports: review cycle, load per dev, engagement, and stalled PRs.

  • Microsoft PR Metrics Action — Open source GitHub Action from Microsoft. Adds automatic size labels (XS to XL) and test coverage indicators to PR titles.

  • Graphite Insights — Visual PR metrics dashboard: merged PRs, review load per dev, cycle time. Paid with free tier.

Automated code review

  • Claude Code Review — Multi-agent automated review from Anthropic. Runs on each PR, identifies bugs and ranks by severity. Doesn't approve — leaves the decision to the human.

  • CodeRabbit — AI-powered automated review that analyzes PRs and suggests improvements. Integrates with GitHub and GitLab.

Incident management

  • PagerDuty — Incident management and on-call platform. Integrates with DevLake and LinearB to calculate change failure rate and MTTR automatically.

  • Opsgenie — Incident management and alerting from Atlassian. Same integrations as PagerDuty for DORA metric calculation.

