Originally published at devopsdiary.blog. Post F2 in the "Governing AI in the Enterprise" series.
DORA worked because shipping software was a pretty stable thing to measure. You changed code, you deployed it, you watched whether prod fell over. The four metrics held up for a decade because the underlying activity didn't change much underneath them.
Then Copilot showed up. And Cursor. And whatever your team is piloting this quarter that nobody told platform engineering about.
The activity changed. The metrics didn't. That's the gap.
I keep landing in the same place on this. You need two layers. Most teams have neither. One is an evaluation layer that watches the AI itself. The other is a governance layer that decides what the evaluation results mean. Skip either one and you end up with dashboards that look healthy while the work underneath them quietly drifts.
DORA still works, it's just watching the wrong thing now
Larridin has a piece called "Why DORA Metrics Break in the AI Era" and the title is a little stronger than the argument. DORA isn't broken. It still measures what it always measured: the throughput and stability of the deployment pipeline. The problem is that with AI-assisted development, the pipeline isn't where the interesting variance lives anymore.
Lead time for changes can drop 30% because the AI wrote the boilerplate. Great. What it doesn't tell you: did the AI write the right boilerplate? Did the developer actually understand what they merged? Is change failure rate stable because the code is good, or because the model is good at producing code that compiles and passes the existing tests but slowly rots the codebase in ways that won't show up for six months?
DORA can't see any of that. It was never designed to. Asking DORA to evaluate AI-generated code is like asking a thermometer about the menu. Wrong instrument, wrong question.
This is why teams with healthy DORA dashboards get blindsided by the first AI-related incident. The metrics didn't lie. They just weren't watching the thing that broke.
Layer one: evaluation
The evaluation layer answers a specific question. Is the AI doing what we think it's doing? Forget "is the team shipping faster." That's downstream. Upstream of any throughput claim is the actual quality of what the model produced and what the human did with it.
A useful evaluation layer measures things DORA never touched.
Acceptance rate per suggestion. Skip "how many completions did Copilot serve" and count how many survived first review. How many made it to prod unchanged. How many got reverted within a week. The shape of that funnel tells you whether your developers are using AI as a thinking partner or as a stochastic autocomplete they're too tired to argue with.
Suggestion-to-defect correlation. Track which AI-generated changes correlate with later bug reports. This is hard. It's also where the real signal lives, because it's the only metric that connects model output to production reality.
Human override frequency. When the AI proposes something and the developer ignores it, that's data. When the developer accepts something they shouldn't have, that's also data, and it's the more dangerous kind.
None of these are pipeline metrics. They sit one layer up, watching the interaction between the model and the human before the result ever hits the pipeline DORA measures. Without this layer, you're flying blind on the part of the system that actually changed.
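Here is a minimal sketch of that funnel, assuming you can export one record per AI suggestion. Every field name below is hypothetical, not something Copilot or any other tool emits in this shape. The `defect_rate` is a crude stand-in for real suggestion-to-defect correlation: it only counts accepted suggestions that someone later linked to a bug report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    # Hypothetical fields; map them onto whatever your tooling actually records.
    accepted: bool              # developer accepted the suggestion
    survived_review: bool       # still present after first code review
    shipped_unchanged: bool     # reached prod without further edits
    reverted_within_week: bool  # reverted within 7 days of shipping
    linked_defect: bool         # a later bug report was traced to this change


def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def funnel(suggestions: list[Suggestion]) -> dict[str, float]:
    """Compute stage-to-stage rates for the acceptance funnel."""
    accepted = [s for s in suggestions if s.accepted]
    survived = [s for s in accepted if s.survived_review]
    shipped = [s for s in survived if s.shipped_unchanged]
    reverted = [s for s in shipped if s.reverted_within_week]
    defects = [s for s in accepted if s.linked_defect]
    return {
        "acceptance_rate": rate(len(accepted), len(suggestions)),
        "review_survival_rate": rate(len(survived), len(accepted)),
        "shipped_unchanged_rate": rate(len(shipped), len(survived)),
        "revert_rate": rate(len(reverted), len(shipped)),
        "defect_rate": rate(len(defects), len(accepted)),
    }
```

Each rate is relative to the stage before it, so a drop shows up at the step where it happens instead of getting averaged into one headline number.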
The first time I ran into this was drafting an AI transformation roadmap. We'd been exploring Copilot and Model Context Protocol for SDLC workflows, and I sat down to write the measurement section assuming I'd just point at the existing DORA dashboards. I couldn't. Every question I actually wanted to ask about the AI work lived somewhere those dashboards weren't looking: was the suggested code any good, were developers accepting things they shouldn't, was quality drifting under the throughput numbers. The roadmap ended up with a whole separate metrics track for AI-generated code quality, which felt like overkill at the time and now feels like the bare minimum.
Layer two: governance
Microsoft published a piece on adaptive AI governance back in April, and the part worth stealing is the framing around feedback loops. Their argument, roughly: governance for AI can't be a static policy document, because the models, the use cases and the risks all shift faster than any approval cycle can keep up with. So governance has to be adaptive. Adaptive means it has to consume signal from somewhere.
That somewhere is the evaluation layer.
This is the part most enterprise programs get wrong. They stand up a governance committee, draft a policy, hold quarterly reviews, and never wire any actual telemetry into the loop. The committee meets, reads vendor documentation, debates risk tiers and adjourns. The AI usage they're supposed to be governing is happening somewhere they can't see. I sat through enough versions of this pattern in a previous life (back when it was a Change Advisory Board approving deploys on vibes) to recognize the shape immediately. The label on the meeting changes. The failure mode doesn't.
A governance layer that works does three things. It defines the thresholds: what acceptance rate is too low to trust, what override pattern signals a model regression, what defect correlation is unacceptable. It checks those thresholds against metrics pulled from the evaluation layer continuously, not at quarterly review. And it keeps a clear path from "threshold breached" to "tool gets paused or scope gets narrowed" without requiring a six-week change-management cycle.
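As a rough sketch, the three pieces can fit in a few dozen lines: thresholds as data, a check against the latest metrics, and a mapping from breach to action. The metric names, limits, owners and the `Action` values are all made up for illustration. Real numbers have to come from your own baseline.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    NARROW_SCOPE = "narrow_scope"
    PAUSE_TOOL = "pause_tool"


# Ordered from least to most severe; used to pick one action when several fire.
SEVERITY = [Action.NONE, Action.NARROW_SCOPE, Action.PAUSE_TOOL]


@dataclass(frozen=True)
class Threshold:
    metric: str       # key expected in the metrics dict
    limit: float
    breach_when: str  # "below" or "above"
    action: Action
    owner: str        # who makes the call when this breaches


# Hypothetical thresholds; the values are placeholders, not recommendations.
THRESHOLDS = [
    Threshold("acceptance_rate", 0.20, "below", Action.NARROW_SCOPE, "platform-lead"),
    Threshold("revert_rate", 0.10, "above", Action.NARROW_SCOPE, "platform-lead"),
    Threshold("defect_rate", 0.05, "above", Action.PAUSE_TOOL, "eng-director"),
]


def breached(threshold: Threshold, value: float) -> bool:
    if threshold.breach_when == "below":
        return value < threshold.limit
    return value > threshold.limit


def evaluate(metrics: dict[str, float], thresholds=THRESHOLDS) -> list[Threshold]:
    """Return every threshold the current metrics breach."""
    return [
        t for t in thresholds
        if t.metric in metrics and breached(t, metrics[t.metric])
    ]


def decide(metrics: dict[str, float], thresholds=THRESHOLDS) -> Action:
    """Pick the most severe action among breached thresholds."""
    hits = evaluate(metrics, thresholds)
    return max((t.action for t in hits), key=SEVERITY.index, default=Action.NONE)
```

Calling `decide({"acceptance_rate": 0.5, "revert_rate": 0.02, "defect_rate": 0.08})` returns `Action.PAUSE_TOOL`, and `evaluate` tells you which owner to page. The code is the trivial part. It exists to show that the decision path can be written down before the breach happens.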
If you can't draw a line from a metric the evaluation layer captures to a decision the governance layer makes within a week, what you have is a steering committee.
| From | To | What flows |
|---|---|---|
| Evaluation layer (acceptance, overrides, defect correlation) | Governance layer | Signal |
| Governance layer (thresholds, owner, decision path) | Evaluation layer | Policy and scope changes |
| Governance layer | DORA (throughput, stability) | Decisions that change what ships |
| DORA | Governance layer | Did the decisions work? |
The three pieces compose. Evaluation feeds governance, governance acts, DORA tells you whether the action worked. Pull any one out and the loop breaks.
Why both, and why neither is optional
The two layers fail differently when one is missing.
An evaluation layer without governance produces dashboards nobody acts on. You can see the AI is degrading. You watch it happen. Nothing changes because no one has the authority or the framework to pull the lever.
A governance layer without evaluation produces policy theater. The committee meets, makes decisions from gut feel and vendor slides, then ships rules that don't connect to anything happening in the codebase. Developers route around the rules because the rules don't reflect reality.
You need both. The evaluation layer generates the signal. The governance layer turns the signal into a decision. DORA, sitting downstream of both, still tells you whether the decisions worked. Skip any one of them and treat the others as sufficient, and you end up explaining to leadership why the AI rollout looked great in the slides and broke production in the demo.
What I'd build first
Starting from zero on a platform team next week, I'd do this in order. Instrument acceptance and override telemetry for whatever AI tools the team is already using, even if it's ugly. A webhook and a SQLite file is fine for a month. Pick three thresholds I'd actually be willing to act on. Write down, in one page, who owns the decision when a threshold breaks and how fast they have to act. Then revisit the DORA dashboard and see how much of it I still need.
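For a sense of how small the "webhook and a SQLite file" version can be, here is a sketch of the storage half using only the standard library. The payload shape (`tool`, `outcome`) and the three outcome values are assumptions, not any vendor's webhook format. The HTTP handler is left out: `record_event` takes whatever raw JSON body your webhook receives.

```python
import json
import sqlite3

# Hypothetical schema: one row per suggestion outcome reported by a tool.
SCHEMA = """
CREATE TABLE IF NOT EXISTS suggestion_events (
    id          INTEGER PRIMARY KEY,
    received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    tool        TEXT NOT NULL,
    outcome     TEXT NOT NULL CHECK (outcome IN ('accepted', 'rejected', 'edited'))
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the telemetry database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_event(conn: sqlite3.Connection, raw_body: str) -> None:
    """Store one webhook payload; assumes JSON with 'tool' and 'outcome' keys."""
    payload = json.loads(raw_body)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO suggestion_events (tool, outcome) VALUES (?, ?)",
            (payload["tool"], payload["outcome"]),
        )


def acceptance_rate(conn: sqlite3.Connection, tool: str) -> float:
    """Share of recorded suggestions for a tool that were accepted as-is."""
    total, accepted = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(outcome = 'accepted'), 0) "
        "FROM suggestion_events WHERE tool = ?",
        (tool,),
    ).fetchone()
    return accepted / total if total else 0.0
```

Pass a file path to `open_store` to persist between runs. The default in-memory database is only useful for trying it out.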
That's the whole thing. Two layers, three thresholds, one decision owner. The shape is simpler than the slide decks make it look. Designing it is the easy part. The hard part is admitting that the dashboards you've been watching for the last decade aren't enough anymore.