DEV Community: Todd Linnertz

Validators Judge. They Don't Help.

Todd Linnertz — Tue, 14 Jul 2026 04:05:39 +0000

Originally published at devopsdiary.blog. Post F-AID3 in the "Governing AI in the Enterprise" series.

A validator that fixes the thing it just flagged has quit its job. It stopped grading and started doing the homework, in the same motion, and now you can't tell which part of the result you're supposed to trust.

I keep running into this in the tooling everyone's shipping right now. The AI-dev vendors have converged on a lot of good ideas over the past year. Most of them I want to steal. One of them I'll fight about, because it quietly breaks the one property that made the whole setup governable in the first place.

Let me take the good ones first. I'm not interested in being the guy who only shows up to say no.

Three ideas worth stealing

The first is a piece of vocabulary: verification debt. Sonar's been using it to name the gap between the quality an agent produces by default and the quality a long-lived, business-critical app actually needs. That gap has always existed. We used to file it under "tech debt" and move on, which was lazy, because tech debt is what you owe after a shortcut you chose. Verification debt is what you owe after a machine made a hundred choices you never saw. Different problem. Better name. I'm adopting it.

Second: inner loop and outer loop as a way to say where a check runs. Inner loop is inside a single reasoning step, the agent second-guessing itself mid-thought. Outer loop is across the whole task, after the work is notionally done. It sounds like pedantry until you try to explain to a security team where your quality gate lives, and you realize you've been waving your hands at "somewhere in the agent." The outer loop is where the promotion gate belongs. Naming it makes the gate a thing you can point at.

Third, and this is the one with teeth: shadow testing. Run the new agent in parallel with the real one, write path disabled, and compare what it would have done against what actually happened. One payroll-automation team took an agent from 70 percent accuracy to 98 before it ever touched production, purely by running it in the dark and grading it against humans for a few weeks. That only works because the live version is frozen while the shadow runs. Same inputs, no side effects, honest comparison. It's the most disciplined testing idea I've seen come out of the agent world, and it maps cleanly onto something I already believed: don't promote what you haven't watched behave.
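
A minimal sketch of that comparison, in Python. Everything named here (`Case`, `shadow_accuracy`, the toy `candidate` agent and its 1000 limit) is made up for illustration and is not taken from any real harness. The agent only returns a proposed decision, so there is no write path to disable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    """One recorded production input plus what actually happened."""
    payload: dict
    actual: str


def shadow_accuracy(agent: Callable[[dict], str], cases: Iterable[Case]) -> float:
    """Run the candidate on recorded inputs and score agreement with reality.

    Nothing in here writes anywhere: the agent proposes, we compare.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("no cases to compare")
    matches = sum(1 for c in cases if agent(c.payload) == c.actual)
    return matches / len(cases)


# Hypothetical candidate: approves any payroll line under a fixed limit.
def candidate(payload: dict) -> str:
    return "approve" if payload["amount"] < 1000 else "escalate"


recorded = [
    Case({"amount": 200}, "approve"),
    Case({"amount": 5000}, "escalate"),
    Case({"amount": 900}, "escalate"),  # a human escalated this one
    Case({"amount": 50}, "approve"),
]

print(shadow_accuracy(candidate, recorded))  # 0.75
```

The number only means something if `recorded` came from the frozen live version over the same window.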

So far, so agreeable. Here's where I get off the bus.

The one I won't take

Sonar's framework has a stage they call Solve. The validator doesn't just find the problem. It fixes it. Finds the bug, writes the patch, closes the loop, all inside the thing whose job was to tell you whether the code was any good.

I understand why it demos well. It feels like magic to watch a tool flag an issue and resolve it in the same breath. But think about what you just did to your audit trail. The finding and the fix collapsed into one event, with one author, and that author is the same component that decided the code was wrong. You've asked the grader to fix your answer before scoring it. It's going to give you an A. It has no reason not to.

In a regulated shop the cost is concrete, and it lands on a Monday morning. Something ships broken, the review board asks "what changed, who approved it, and what did the check actually catch," and you need those to be three separate facts. A validator that remediates smears them into one line: the tool found it, the tool fixed it, promoted clean. Nobody looked. There's no daylight between the judgment and the intervention, which means there's nowhere to stand and ask whether the judgment was right.

And there's a blast-radius problem hiding underneath. The second your validator can write, it's not a read-only observer anymore. Its scope just went from "look and report" to "look and change your code." That's a different security posture, a different threat model, and a different conversation with your platform team, and most people adopt it without having any of those conversations because it arrived bundled as a feature.

The honest counterargument

I don't want to pretend this is a clean win. The pull toward Solve is real, and the people building it aren't fools.

Fast feedback is how you get the good numbers. When the fix lands a half-second after the finding, iteration speed goes through the roof, and there's a genuine prize on the table: teams reporting something like nine in ten issues caught at edit time, before a human ever reviews a line. If I forbid the validate-and-remediate collapse with a hard rule, I'm leaving some of that speed on the floor. A strict "no" has a cost, and anyone who tells you otherwise is selling something too.

So the question isn't whether remediation is valuable. Obviously it is. The question is whether it belongs inside the validator, and my answer is still no.

Where AIEOS lands

The invariant I build on is that validators judge, they don't help. When I say that, people hear "never auto-fix anything," and that's not what I mean. Auto-fix all you want. Just don't let the grader hold the pen.

In practice, remediation is a separate artifact with its own author. A remediation agent consumes the validator's verdict and proposes a change. That change goes back through the same gate as any other change, from a frozen baseline, and has to clear it on its own merits. The verdict and the fix stay two events, two authors, two timestamps.

  Agent output
       |
       v
  [ Validator: judges only ]
       |
       +--- pass -----------------------+
       |                                |
       +--- fail --> [ Remediation      |
                      agent:            |
                      proposes fix ]    |
                            |           |
                            v           v
                  [ Promotion gate: frozen baseline ]
                            |
                      +-----+------+
                      |            |
                   clears       fails --> back to remediation
                      |
                      v
                   Promote

The validator emits a verdict and stops. A separate remediation agent proposes the fix, which clears the same promotion gate as any other change. Two authors, two events, one audit trail.

You keep the speed. You keep the audit trail. What you give up is the thing that was never yours to keep: the idea that the check and the cure can be the same act without anyone losing track of which was which. Freeze-before-promote does the load-bearing work. The fix doesn't get to skip the line because it came from a smart tool. It clears the gate like everything else, or it doesn't ship.
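
Here is roughly what that separation could look like in Python. This is a sketch under assumptions, not AIEOS code: the `Verdict` and `Proposal` types, the toy "no TODO" rule, and the `promote` loop are all hypothetical. The point is only that the validator never returns content, and every fix goes back through the same check.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Verdict:
    author: str
    passed: bool
    finding: str
    at: datetime


@dataclass(frozen=True)
class Proposal:
    author: str
    responds_to: str
    content: str
    at: datetime


def validate(content: str) -> Verdict:
    """Judge only: returns a verdict and never returns modified content."""
    ok = "TODO" not in content  # toy rule, stands in for a real check
    finding = "" if ok else "unresolved TODO"
    return Verdict("validator", ok, finding, datetime.now(timezone.utc))


def remediate(content: str, verdict: Verdict) -> Proposal:
    """A separate author consumes the verdict and proposes a change."""
    fixed = content.replace("TODO", "done")
    return Proposal("remediation-agent", verdict.finding, fixed,
                    datetime.now(timezone.utc))


def promote(content: str, max_rounds: int = 3):
    """Every candidate, original or fix, clears the same validator."""
    audit = []
    for _ in range(max_rounds):
        verdict = validate(content)
        audit.append(verdict)
        if verdict.passed:
            return True, content, audit
        proposal = remediate(content, verdict)
        audit.append(proposal)
        content = proposal.content
    return False, content, audit


ok, final, audit = promote("def pay(): TODO")
print(ok, [entry.author for entry in audit])
# True ['validator', 'remediation-agent', 'validator']
```

The audit list is the part a review board would care about: the failing verdict, the proposed fix, and the passing verdict are three entries with two different authors.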

The part I haven't figured out

If remediation doesn't live in the validator, where does it live? I have a working answer, a separate agent under its own governance, but I'm not convinced that's the final shape. Maybe it's a first-class layer. Maybe it's a practice folded under verification with hard rules about artifact separation. The industry hasn't settled this, and neither have I, and I'd rather say that plainly than paper over it with a diagram.

The one line I'll hold, though, is narrow enough to defend: the thing that decides whether your code is good doesn't also get to make it good. Keep those two jobs in two hands. The day they merge, you've automated away the only honest signal you had, and you won't notice until the thing you trusted to catch problems becomes the thing quietly creating them.

Todd Linnertz is a Senior Solutions Engineer with thirty years of enterprise engineering experience. He's the creator of AIEOS, an open-source AI governance system for software delivery teams. Find him at devopsdiary.blog and github.com/wtlinnertz.

Inside AIEOS: AI Can Write the Spec. It Can't Approve It.

Todd Linnertz — Mon, 06 Jul 2026 04:55:00 +0000

Ask Claude for a system architecture document. You'll get one. It'll have the right sections, confident prose, a diagram, tradeoffs that sound reasonable. It reads like something a staff engineer wrote on a good day.

Now tell me if it's correct.

You can't, really. Not by reading it. The thing that makes AI good at producing the document is the same thing that makes the document hard to trust. It generates what's plausible, and plausible is a long way from sound.

I've spent about a year building a framework for that exact problem. It's called AIEOS. This is the first real look at it.

AI is a generation engine, not a decision-maker

That sentence is the whole thing, so I'll sit on it for a second.

An LLM is very good at producing a first draft of almost any engineering artifact: a PRD, a test strategy, a release plan, a postmortem. It is not good at deciding whether that draft is good enough to build on. Ask it to check its own work and it'll rationalize. It wrote the thing. It's the last narrator you want grading it.

I made the platform-side version of this argument a few months back, in "The Agent Is 20% of the Work." The agent is the easy 20%. The durable system around it is the other 80%, and that's where projects live or die. AIEOS is my answer to what that 80% looks like when AI is writing the specs, designs, and records across a whole software lifecycle.

Three rules that hold it together

AIEOS runs on three rules. They're boring on paper and they matter enormously in practice.

First, separation. The rules for an artifact, the template it fills, the prompt that generates it, and the validator that judges it all live in separate files. Change the prompt and you haven't silently moved the goalposts on what counts as valid. Most AI-writing setups collapse all four into one giant instruction, and then nobody can explain why last week's output passed and this week's doesn't.

Second, freeze before promote. Before any downstream work starts, the upstream artifact is frozen. The architecture document can't quietly shift under the execution plan built on top of it. If it has to change, that's a new version and an explicit re-freeze, not a silent edit. Anyone who's watched requirements move mid-build knows why this rule earns its keep.

Third, validators judge, they don't help. A validator returns PASS or FAIL. It does not offer suggestions. It won't rewrite your draft or meet it halfway. The moment a validator starts helping, it's collaborating with the thing it's supposed to be checking, and you're right back to AI approving its own work.
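
To make the second rule concrete, here is a small Python sketch of freeze-before-promote using a content hash. It is an illustration under assumptions: `FrozenArtifact`, `freeze`, and `is_current` are hypothetical names, and the source does not say how AIEOS actually records a freeze.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenArtifact:
    name: str
    version: int
    content: str
    digest: str


def freeze(name: str, version: int, content: str) -> FrozenArtifact:
    """Record the content together with a SHA-256 digest of it."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return FrozenArtifact(name, version, content, digest)


def is_current(pinned_digest: str, artifact: FrozenArtifact) -> bool:
    """Downstream work checks that its pin still matches the upstream artifact."""
    return pinned_digest == artifact.digest


arch_v1 = freeze("architecture", 1, "Services talk over gRPC.")
plan_pin = arch_v1.digest  # the execution plan records what it was built on

# A change is a new version and an explicit re-freeze, never an in-place edit.
arch_v2 = freeze("architecture", 2, "Services talk over REST.")

print(is_current(plan_pin, arch_v1))  # True
print(is_current(plan_pin, arch_v2))  # False
```

Because the dataclass is frozen, assigning to `arch_v1.content` raises an error instead of silently shifting the document under the plan.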

Fifteen layers, and it's a loop

The work is organized into fifteen layers. The first eight are a sequential pipeline: strategy, product intelligence, solution sourcing, engineering execution, release, reliability, insight, diagnostics. Each one answers a single question and hands a frozen artifact to the next.

The other seven are cross-cutting kits (quality, security, data, infrastructure, documentation, peer review, business process). They fire on a trigger instead of waiting for a fixed turn. Security doesn't queue up behind release. It activates when an artifact event calls for it.

And it's a loop. Layer 7, insight and evolution, feeds what production actually taught you back into Layer 2, where the next round of product decisions gets made.

Delivery flows down. Learning loops back up.

  1  Strategy
  2  Product Intelligence   <---------------+
  3  Solution Sourcing                      |
  4  Engineering Execution                  |  what production
  5  Release & Exposure                     |  taught you
  6  Reliability & Resilience               |
  7  Insight & Evolution   -----------------+
  8  Operational Diagnostics

The part that makes it real

A layer model on a whiteboard is worth nothing. For a framework like this, the real test is whether each component does what its contract says it does, especially the adapters that wrap real tools like Semgrep, Trivy, Syft, cosign, and Flux.

AIEOS handles that with conformance attestation. Every adapter has to pass a conformance suite that checks its actual output against a frozen contract. When it passes, CI produces a signed attestation using Sigstore: keyless signing through Fulcio, logged in Rekor's transparency log. No valid attestation for the current contract version, no registration. The adapter doesn't get to play.

        Adapter output
             |
             v
      Conformance suite      checks real output vs. the frozen contract
             |
        +----+----+
      FAIL       PASS
        |          |
        v          v
   Rejected.   Sigstore signs      keyless: Fulcio + Rekor
   No              |
   registration.   v
               Signed attestation
                   |
                   v
               Registry admits the adapter
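
The admission rule at the bottom of that flow can be sketched in a few lines of Python. This does not call Sigstore at all: the `verify` callable is a stand-in for whatever signature verification the real pipeline performs, and `Attestation` and `Registry` are hypothetical types invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Attestation:
    adapter: str
    contract_version: str
    signature: str  # opaque here; a real one comes from the signing step


class Registry:
    def __init__(self, contract_version: str,
                 verify: Callable[[Attestation], bool]):
        self.contract_version = contract_version
        self.verify = verify  # stand-in for real signature verification
        self.adapters: set[str] = set()

    def register(self, adapter: str, attestation: Optional[Attestation]) -> bool:
        """Admit only with a valid attestation for the current contract version."""
        if attestation is None:
            return False
        if attestation.adapter != adapter:
            return False
        if attestation.contract_version != self.contract_version:
            return False
        if not self.verify(attestation):
            return False
        self.adapters.add(adapter)
        return True


# Toy verifier for the example only.
registry = Registry("2.0", verify=lambda a: a.signature == "valid")

print(registry.register("sbom", Attestation("sbom", "2.0", "valid")))   # True
print(registry.register("scan", Attestation("scan", "1.0", "valid")))   # False
print(registry.register("sign", None))                                  # False
```

The second call is the interesting one: an attestation that was valid for an older contract version does not carry forward.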

I finished wiring that up last week. Every repo in the framework, more than forty of them, now runs CI. All thirteen adapters produce real signed conformance attestations, not a green check that means "the unit tests passed and we're hoping the rest works." The signing loop is verified end to end. The orchestration core of the agent harness, the routing and state and convergence code, sits at 100% line coverage.

Turning the conformance checks back on found real bugs that a permissive CI config had been hiding. An SBOM adapter emitting the wrong CycloneDX schema version. A signing adapter still using a legacy bundle format. The exact kind of thing that looks fine until something finally checks it.

The orchestration is the easy part

The agent harness runs generation across providers. It supports a few routing strategies: fall back to a second provider when the first fails, run providers in sequence, run them in parallel and take a consensus, or route by cost. A convergence loop regenerates an artifact against validator feedback until it passes, or until the loop detects it's oscillating and gives up.
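
A rough Python sketch of that convergence loop follows. The oscillation check used here (stop when the generator repeats a draft it already produced) is my assumption about one simple way to detect it; the source does not describe the harness's actual method. `converge`, `generate`, and `validate` are hypothetical names.

```python
import hashlib
from itertools import cycle
from typing import Callable, Optional, Tuple


def converge(generate: Callable[[Optional[str]], str],
             validate: Callable[[str], Tuple[bool, str]],
             max_rounds: int = 10) -> Tuple[str, Optional[str]]:
    """Regenerate until the validator passes, a draft repeats, or rounds run out.

    Returns (status, artifact); status is "pass", "oscillating" or "exhausted".
    """
    seen = set()
    reason = None  # why the previous draft failed, fed back to the generator
    for _ in range(max_rounds):
        draft = generate(reason)
        ok, reason = validate(draft)
        if ok:
            return "pass", draft
        digest = hashlib.sha256(draft.encode("utf-8")).hexdigest()
        if digest in seen:
            return "oscillating", None  # same failing draft again: give up
        seen.add(digest)
    return "exhausted", None


def needs_owner(draft: str) -> Tuple[bool, str]:
    ok = "owner:" in draft
    return ok, "" if ok else "missing owner"


# A generator that responds to the failure reason converges in two rounds.
fixer = lambda reason: "plan\nowner: sre" if reason else "plan"
print(converge(fixer, needs_owner))  # ('pass', 'plan\nowner: sre')

# A generator that flips between two failing drafts gets cut off.
flip = cycle(["plan A", "plan B"])
print(converge(lambda reason: next(flip), needs_owner))  # ('oscillating', None)
```

Note that the validator still only judges. The failure reason goes to the generator, which is a different component.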

That's what people picture when they hear "multi-agent," and it's real. It's also a few hundred lines. Everything I described before it, the specs and freezes and validators and signed attestations, is the part that actually took a year.

Where to look

AIEOS is open. The code lives under github.com/wtlinnertz. Start with aieos-governance-foundation. That's the root of the whole thing: the layer model, the kit manifest, and the structural rules every other repo inherits.

Fair warning. I built most of this solo, over about a year, and it shows in the uneven places. Some kits are further along than others. But the spine is real and it's tested, which is more than I can say for most things wearing the "AI governance framework" label right now.

One more time, because it's the entire point. AI can write the spec. It can't approve it. The whole framework is what it takes to keep those two jobs in different hands.

More Context Isn't Better Context

Todd Linnertz — Thu, 02 Jul 2026 05:42:57 +0000

Originally published at devopsdiary.blog. Post F-AID2 in the "Governing AI in the Enterprise" series.

The advice for the last two years has been simple: give the agent more. More tools. A bigger window. Another rules file. Half the talks at AI Dev 26 argued the reverse, and they brought data.

The pitch sounds right, which is why it spreads. If the agent gave a bad answer, it must have been missing something. So you connect another MCP server, paste in more docs, add a second CLAUDE.md. The output gets worse. Now you're debugging a system that has too much to read, not too little.

The four things people believe that aren't true

A Stripe engineer put up a slide listing four myths about feeding context to agents, and I'd repeated at least two of them out loud the month before.

Naive RAG over your docs is not a context engine. A bigger context window does not fix output quality. Connecting more MCP servers does not get you there. And more rule files definitely don't. Each one feels like progress because it's an action you can take.

None of them touch the actual problem.

Underneath, the problem is that access isn't understanding. We wire agents into the code, the logs, the tickets, the docs, and we assume that putting information nearby is the same as the model comprehending it. The agent produces plausible output that compiles and then fails human review, because the thing it needed was never written down anywhere it could reach.

The slide that stuck was an iceberg. Above the waterline: the code, the docs, the tickets you can point an agent at. Below it: the original intent, the reason the thing was built this way, the migration flag nobody documented, the trade-off that got argued out in Slack eight months ago and then deleted. That submerged part decides whether the output is right. None of it lives somewhere you can index.

More context has a cost, and it compounds

Dumping more context in carries a cost, and CodeRabbit named the four ways it bites: bloat, dilution, conflict and latency.

Bloat is the obvious one. You blow the token budget. Dilution is worse, because the signal the model needed is still in there, buried under noise it now has to wade through to find it. Conflict is the dangerous one. Hand the model two sources that disagree and it picks one, hides the seam, and gives you a confident answer built on the wrong half. Latency is the tax on all of it. Slower, more expensive, every single call.

And the cost isn't flat. It grows the further the agent runs from your keyboard. At tab-complete you catch a bad suggestion in a second. With a background agent running unattended, bad context at the start means a silent failure you find days later in review, or worse, in production.

What actually works is curation

Path	Process	Outcome
Dump it all in	Bloat → dilution → conflict → latency	Plausible output that fails review
Context engine	Unify, retrieve, rank, resolve, govern, scope	Output that passes review

Same sources, two strategies. "More context" takes the left path. Curated context takes the right.

The teams that got past this stopped adding and started engineering what the agent sees. Stripe's framing was a context engine with six jobs, and it reads like a platform spec, not a prompting tip:

merge signals from every source into one view
retrieve only what the task needs, not everything that might be relevant
rank and compress so tokens aren't wasted
resolve conflicts by recency and authority instead of hiding them
enforce permissions and governance across every system it touches
scope relevance to your repos, your team, your work history

Read that list again and tell me which part is prompt-craft. None of it. It's retrieval, ranking, conflict resolution, access control and governance. That's infrastructure. It's the work platform teams have been doing for other systems for years, pointed at a new consumer.
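
To show how little of this is prompt-craft, here is a small Python sketch of three of those jobs: resolve, rank, and fit to a budget. It is an illustration, not Stripe's or anyone else's engine. The `Snippet` fields, the numeric authority scale, and the choice to let authority beat recency are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snippet:
    topic: str        # the question this snippet answers
    text: str
    authority: int    # higher wins; e.g. 3 = decision record, 1 = chat thread
    updated: date
    relevance: float  # score from whatever retriever is in use
    tokens: int


def build_context(snippets: list[Snippet], budget: int):
    """Return (chosen, superseded) so conflicts are resolved but not hidden."""
    winners: dict[str, Snippet] = {}
    superseded: list[Snippet] = []
    # Resolve: one winner per topic, by authority first, then recency.
    for s in snippets:
        best = winners.get(s.topic)
        if best is None:
            winners[s.topic] = s
        elif (s.authority, s.updated) > (best.authority, best.updated):
            superseded.append(best)
            winners[s.topic] = s
        else:
            superseded.append(s)
    # Rank and fit: most relevant first, skip anything that would exceed the budget.
    chosen, used = [], 0
    for s in sorted(winners.values(), key=lambda s: s.relevance, reverse=True):
        if used + s.tokens <= budget:
            chosen.append(s)
            used += s.tokens
    return chosen, superseded


sources = [
    Snippet("retry-policy", "Retry 3 times", 3, date(2025, 1, 10), 0.90, 40),
    Snippet("retry-policy", "Retry 5 times", 1, date(2026, 3, 1), 0.95, 30),
    Snippet("naming", "Use snake_case", 2, date(2025, 6, 1), 0.50, 50),
    Snippet("history", "Long changelog", 2, date(2024, 1, 1), 0.20, 500),
]

chosen, superseded = build_context(sources, budget=100)
print([s.text for s in chosen])      # ['Retry 3 times', 'Use snake_case']
print([s.text for s in superseded])  # ['Retry 5 times']
```

The newer chat-thread answer loses to the older decision record, and the loser is returned rather than dropped, which is the "instead of hiding them" part.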

Unblocked, who build exactly this for a living, showed a before-and-after on the same prompt and the same model, changing only the context. Without their engine, the output scored 2 to 3.5 out of 10 on things like respecting team conventions and not breaking existing code. With it, 8 to 9.5. Same model. Same prompt. The only variable was what the agent was allowed to see, and how well that slice had been curated.

Why this lands on the platform team

"Context engineering" isn't a prompt-writing trick your senior devs pick up on the side. It's a system somebody builds and owns. Something has to unify the sources, score them, govern access and hand the right slice to the right model at the right moment.

The hype said the bottleneck was model intelligence. Then it said the bottleneck was context. Both diagnoses were right about the symptom. Neither said who fixes it: context has to be assembled, ranked and governed, and that's platform engineering.

So when the next vendor tells you the fix is one more MCP server, ask them what happens to your token budget, your conflicts and your permissions once you've connected forty of them. The goal was always the right context. Building the thing that knows the difference is the job, and it's going to land on the platform team whether they staffed for it or not.

What DevOps Taught Me About AI Governance

Todd Linnertz — Wed, 17 Jun 2026 02:36:17 +0000

Originally published at devopsdiary.blog

The teams adopting AI coding tools the fastest are the same teams that would never deploy to production without a pipeline.

I've been watching this for two years. Engineers who pushed back on manual deployments, built approval gates and rollback runbooks, spent months getting GitOps through architecture review. Those same engineers are committing AI-generated code with no review policy, no acceptable use boundary, no way to answer "what did this tool actually do to our codebase."

The governance instincts are there. They've just been turned off for AI.

DevOps gave me a set of instincts I didn't appreciate until I started watching AI adoption.

The most visceral one is blast radius. Before you ship anything, you ask: what's the worst this can do, and how do you contain it? Feature flags, canary deployments, rollback runbooks: all of them exist because shipping without blast radius thinking isn't engineering. It's gambling with production. Sitting right next to it is auditability. In a regulated environment, "it worked" isn't an acceptable answer to "what happened?" You need to know who approved what, when, under what conditions, and what the rollback path was. Not bureaucracy for its own sake. That's what lets you recover without a three-week forensic investigation.

AI tools write code. That code goes into production. The blast radius question barely gets asked. The auditability trail ends at the commit. The model, the prompt, the context: gone.

Then there's measurement. In November 2023, I built a dashboard that showed teams exactly how slow their pipelines were. Some of them hated it. Not because the data was wrong. Because visible things require a response, and these teams had spent months not responding. That friction was the point. You can't govern what you can't see.

Nobody is measuring how AI tools are affecting their delivery pipeline. Not throughput, not defect rates, not review time for AI-generated PRs, not acceptance rates. The data exists in theory. Nobody's collecting it.

The translation from DevOps to AI governance is straightforward on paper:

DevOps instinct	AI governance equivalent
Blast radius analysis before deploy	Scope controls on what the AI tool can touch
Approval chains and auditability	Model, prompt, and context captured in the commit trail
Pipeline measurement	AI delivery metrics: acceptance rate, defect rate, review time
Rollback runbook	Policy to constrain or disable a tool when it misbehaves

The table isn't the hard part. The decision to build it is.

These instincts didn't come from theory. They came from watching what happens when they're actually applied.

Getting GitOps through enterprise architecture review at a large financial services firm took six months. Six months of presentations, security questions, "can you come back next cycle," and conversations with architects who needed to understand blast radius before they'd sign off. The process felt slow. It was slow. But I understood why it existed. A new deployment methodology touching hundreds of production pipelines warrants that kind of scrutiny.

GitHub Copilot arrived a year later. Teams were using it in production code within weeks of the pilots starting. No architecture review. No acceptable use policy. No measurement framework. Just "the demos land and teams want it."

GitOps got six months of scrutiny. Copilot got a pilot program and a Teams channel.

The difference is cultural. GitOps looked like infrastructure, so infrastructure governance applied. Copilot looked like a developer tool, so it went through the same review path as a new IDE plugin: essentially none.

AI coding tools write production code. That makes them infrastructure. The governance posture should match.

Platform engineers already know how to do this.

Apply blast radius thinking to AI: scope what the tool can touch, define what it can't, build the controls before you need them instead of after something breaks. Track auditability: capture the model, the prompt, the constraints, and the review that happened before the code shipped. Not for compliance theater. For the forensic investigation you'll eventually need. Measure: instrument the AI delivery pipeline the same way you'd instrument anything else. Two months of your own data will tell you more than any vendor benchmark.

None of this requires new tooling to start. It requires someone in the organization to decide that AI-generated code is production code, and production code gets governed.

That sentence is the whole shift. Everything else follows from it.

I spent most of 2024 watching the governance gap widen while the tooling race ran ahead of it. Every week there was a new agent framework, a new coding assistant, a new benchmark claiming another percentage point on SWE-bench. Very few conversations about what any of this looks like when it's been in your codebase for 18 months and something goes wrong.

AIEOS started from that frustration. The instincts built into it (blast radius, auditability, measurement, approval chains, rollback) are DevOps instincts. They translate directly. Most organizations already have engineers who understand all of this. What's missing is the decision to apply it.

That decision is the one most teams haven't made yet.

Two Layers Your AI-SDLC Metrics Are Missing

Todd Linnertz — Mon, 08 Jun 2026 03:48:56 +0000

Originally published at devopsdiary.blog. Post F2 in the "Governing AI in the Enterprise" series.

DORA worked because shipping software was a pretty stable thing to measure. You changed code, you deployed it, you watched whether prod fell over. The four metrics held up for a decade because the underlying activity didn't change much underneath them.

Then Copilot showed up. And Cursor. And whatever your team is piloting this quarter that nobody told platform engineering about.

The activity changed. The metrics didn't. That's the gap.

I keep landing in the same place on this. You need two layers. Most teams have neither. One is an evaluation layer that watches the AI itself. The other is a governance layer that decides what the evaluation results mean. Skip either one and you end up with dashboards that look healthy while the work underneath them quietly drifts.

DORA still works, it's just watching the wrong thing now

Larridin has a piece called "Why DORA Metrics Break in the AI Era" and the title is a little stronger than the argument. DORA isn't broken. It still measures what it always measured: the throughput and stability of the deployment pipeline. The problem is that with AI-assisted development, the pipeline isn't where the interesting variance lives anymore.

Lead time for changes can drop 30% because the AI wrote the boilerplate. Great. What it doesn't tell you: did the AI write the right boilerplate? Did the developer actually understand what they merged? Is change failure rate stable because the code is good, or because the model is good at producing code that compiles and passes the existing tests but slowly rots the codebase in ways that won't show up for six months?

DORA can't see any of that. It was never designed to. Asking DORA to evaluate AI-generated code is like asking a thermometer about the menu. Wrong instrument, wrong question.

This is why teams with healthy DORA dashboards get blindsided by the first AI-related incident. The metrics didn't lie. They just weren't watching the thing that broke.

Layer one: evaluation

The evaluation layer answers a specific question. Is the AI doing what we think it's doing? Forget "is the team shipping faster." That's downstream. Upstream of any throughput claim is the actual quality of what the model produced and what the human did with it.

A useful evaluation layer measures things DORA never touched.

Acceptance rate per suggestion. Skip "how many completions did Copilot serve" and count how many survived first review. How many made it to prod unchanged. How many got reverted within a week. The shape of that funnel tells you whether your developers are using AI as a thinking partner or as a stochastic autocomplete they're too tired to argue with.

Suggestion-to-defect correlation. Track which AI-generated changes correlate with later bug reports. This is hard. It's also where the real signal lives, because it's the only metric that connects model output to production reality.

Human override frequency. When the AI proposes something and the developer ignores it, that's data. When the developer accepts something they shouldn't have, that's also data, and it's the more dangerous kind.

None of these are pipeline metrics. They sit one layer up, watching the interaction between the model and the human before the result ever hits the pipeline DORA measures. Without this layer, you're flying blind on the part of the system that actually changed.
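
The acceptance funnel is simple arithmetic once the events exist. A minimal Python sketch, assuming each suggestion has already been labelled with how far it got; the `Suggestion` fields and the sample counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    accepted: bool
    survived_review: bool = False
    shipped_unchanged: bool = False
    reverted_within_week: bool = False


def funnel(suggestions: list[Suggestion]) -> dict[str, float]:
    """Share of all suggestions that reached each stage."""
    total = len(suggestions)
    if total == 0:
        return {}

    def rate(field: str) -> float:
        return sum(1 for s in suggestions if getattr(s, field)) / total

    return {
        "accepted": rate("accepted"),
        "survived_review": rate("survived_review"),
        "shipped_unchanged": rate("shipped_unchanged"),
        "reverted_within_week": rate("reverted_within_week"),
    }


sample = (
    [Suggestion(False)] * 2                    # ignored by the developer
    + [Suggestion(True)] * 2                   # accepted, rejected in review
    + [Suggestion(True, True)] * 2             # survived review, edited later
    + [Suggestion(True, True, True)]           # shipped unchanged
    + [Suggestion(True, True, True, True)]     # shipped, then reverted
)

print(funnel(sample))
# {'accepted': 0.75, 'survived_review': 0.5, 'shipped_unchanged': 0.25, 'reverted_within_week': 0.125}
```

The drop between each stage is the shape worth watching, more than any single rate.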

The first time I ran into this was drafting an AI transformation roadmap. We'd been exploring Copilot and Model Context Protocol for SDLC workflows, and I sat down to write the measurement section assuming I'd just point at the existing DORA dashboards. I couldn't. Every question I actually wanted to ask about the AI work lived somewhere those dashboards weren't looking: was the suggested code any good, were developers accepting things they shouldn't, was quality drifting under the throughput numbers. The roadmap ended up with a whole separate metrics track for AI-generated code quality, which felt like overkill at the time and now feels like the bare minimum.

Layer two: governance

Microsoft published a piece on adaptive AI governance back in April, and the part worth stealing is the framing around feedback loops. Their argument, roughly: governance for AI can't be a static policy document, because the models, the use cases and the risks all shift faster than any approval cycle can keep up with. So governance has to be adaptive. Adaptive means it has to consume signal from somewhere.

That somewhere is the evaluation layer.

This is the part most enterprise programs get wrong. They stand up a governance committee, draft a policy, hold quarterly reviews, and never wire any actual telemetry into the loop. The committee meets, reads vendor documentation, debates risk tiers and adjourns. The AI usage they're supposed to be governing is happening somewhere they can't see. I sat through enough versions of this pattern in a previous life (back when it was a Change Advisory Board approving deploys on vibes) to recognize the shape immediately. The label on the meeting changes. The failure mode doesn't.

A governance layer that works does three things. It defines the thresholds: what acceptance rate is too low to trust, what override pattern signals a model regression, what defect correlation is unacceptable. It checks those thresholds against data from the evaluation layer continuously, not at quarterly review. And it keeps a clear path from "threshold breached" to "tool gets paused or scope gets narrowed" without requiring a six-week change-management cycle.

If you can't draw a line from a metric the evaluation layer captures to a decision the governance layer makes within a week, what you have is a steering committee.
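
That line from metric to decision can be sketched as data plus one function. The Python below is an illustration only: the metric names, limits, actions, and owners are placeholders I made up, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    breach_when: str  # "below" or "above"
    action: str
    owner: str


def decide(thresholds: list[Threshold], metrics: dict[str, float]) -> list[str]:
    """Turn evaluation-layer numbers into named actions with a named owner."""
    decisions = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            # No signal is itself a finding: governance is running blind here.
            decisions.append(f"{t.owner}: no telemetry for {t.metric}")
            continue
        breached = value < t.limit if t.breach_when == "below" else value > t.limit
        if breached:
            decisions.append(f"{t.owner}: {t.action} ({t.metric}={value})")
    return decisions


# Illustrative numbers only; pick limits you would actually act on.
policy = [
    Threshold("acceptance_rate", 0.30, "below", "narrow tool scope", "platform-lead"),
    Threshold("revert_rate", 0.10, "above", "pause tool", "platform-lead"),
    Threshold("override_rate", 0.50, "above", "review model version", "eng-manager"),
]

for line in decide(policy, {"acceptance_rate": 0.22, "revert_rate": 0.04}):
    print(line)
# platform-lead: narrow tool scope (acceptance_rate=0.22)
# eng-manager: no telemetry for override_rate
```

Treating a missing metric as a decision item is a design choice in this sketch; it keeps the committee from reading silence as health.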

From	To	What flows
Evaluation layer (acceptance, overrides, defect correlation)	Governance layer	Signal
Governance layer (thresholds, owner, decision path)	Evaluation layer	Policy and scope changes
Governance layer	DORA (throughput, stability)	Decisions that change what ships
DORA	Governance layer	Did the decisions work?

The three layers compose. Evaluation feeds governance, governance acts, DORA tells you whether the action worked. Pull any layer out and the loop breaks.

Why both, and why neither is optional

The two layers fail differently when one is missing.

An evaluation layer without governance produces dashboards nobody acts on. You can see the AI is degrading. You watch it happen. Nothing changes because no one has the authority or the framework to pull the lever.

A governance layer without evaluation produces policy theater. The committee meets, makes decisions from gut feel and vendor slides, then ships rules that don't connect to anything happening in the codebase. Developers route around the rules because the rules don't reflect reality.

You need both. The evaluation layer generates the signal. The governance layer turns the signal into a decision. DORA, sitting downstream of both, still tells you whether the decisions worked. Skip any one of them and treat the others as sufficient, and you end up explaining to leadership why the AI rollout looked great in the slides and broke production in the demo.

What I'd build first

Starting from zero on a platform team next week, I'd do this in order. Instrument acceptance and override telemetry for whatever AI tools the team is already using, even if it's ugly. A webhook and a sqlite file is fine for a month. Pick three thresholds I'd actually be willing to act on. Write down, in one page, who owns the decision when a threshold breaks and how fast they have to act. Then revisit the DORA dashboard and see how much of it I still need.
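The "ugly" first step can be sketched in a few lines. This is an assumption-heavy illustration: the table layout, the outcome labels and the tool name are all made up, and the webhook handler that would call `record()` is left out.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the telemetry store. Pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS suggestions (
               ts TEXT NOT NULL,
               tool TEXT NOT NULL,
               outcome TEXT NOT NULL
                   CHECK (outcome IN ('accepted', 'rejected', 'overridden'))
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, tool: str, outcome: str) -> None:
    """Store one suggestion outcome; a webhook handler would call this."""
    conn.execute(
        "INSERT INTO suggestions (ts, tool, outcome) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), tool, outcome),
    )
    conn.commit()


def rate(conn: sqlite3.Connection, tool: str, outcome: str) -> float:
    """Share of a tool's recorded suggestions that ended with `outcome`."""
    total, hits = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(outcome = ?), 0) "
        "FROM suggestions WHERE tool = ?",
        (outcome, tool),
    ).fetchone()
    return hits / total if total else 0.0


conn = open_store()
for outcome in ["accepted", "accepted", "overridden", "rejected"]:
    record(conn, "assistant-a", outcome)

print(rate(conn, "assistant-a", "accepted"))    # 0.5
print(rate(conn, "assistant-a", "overridden"))  # 0.25
```

Two rates out of one table is enough to set the first thresholds against.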

That's the whole thing. Two layers, three thresholds, one decision owner. The shape is simpler than the slide decks make it look. Designing it is the easy part. The hard part is admitting that the dashboards you've been watching for the last decade aren't enough anymore.

ChatGPT Won't Replace Your Pipeline

Todd Linnertz — Fri, 22 May 2026 02:58:31 +0000

Originally published at devopsdiary.blog

The first time I asked ChatGPT to write a deployment runbook, it did. That was the problem.

The output was close enough to be dangerous: kubectl steps, rollback sequence, health check endpoints. Structured, clear, apparently professional. But it had no idea whether any of it belonged to us. Whether our tooling matched what it described. Whether the service was subject to SOX controls or just basic SLO monitoring.

It wrote a competent generic runbook for a deployment process that didn't exist at our org. We already had something for that. It was called Google.

What December 2023 looked like from a platform engineering chair

By late 2023, ChatGPT had been public for a year. Engineers had stopped being surprised by it and started depending on it. The use cases I was seeing were real: code generation, IaC explanations, documentation drafts, commit messages, Jira ticket descriptions. (I'll confess I was writing Jira tickets with it too, so the skepticism was not entirely consistent.) The productivity numbers were positive. Nobody was exaggerating those.

What was quietly getting ignored was the distribution problem.

AI tools are generative. They produce outputs shaped by training data, not by your specific environment. A runbook they write doesn't know your Kubernetes version, your FluxCD reconciliation loop, or how your org defines "deployment-ready." A code snippet they generate doesn't know that your team settled a particular pattern dispute two years ago and the outcome is buried in a Confluence page nobody's touched since.

The outputs weren't wrong randomly. They were wrong the specific, consistent way that generic things are wrong when applied to specific contexts. Plausible on the surface. Missing the thing that mattered underneath.

I'd watched this pattern before

Every major technology shift I'd lived through followed the same arc: a capability arrives, adoption spreads before the infrastructure does, and governance shows up late to clean up what enthusiasm left behind.

CASE tools in the 90s. Offshore development in the early 2000s. Agile in the 2010s. Each one had a legitimate productivity case. Each one also created a class of problems nobody had built the infrastructure to handle, because they were busy with the first wave of adoption.

DevOps was different, because DevOps was about the infrastructure. You couldn't do GitOps without pipelines, and pipelines forced the governance questions into the open: Who owns this? What's the approval gate? What does rollback look like? The tooling made the governance visible whether you wanted it to be or not.

AI code generation skips that forcing function. The output is a file. You can add it to a repo without any pipeline knowing the difference. You can ship it without triggering a single question about where it came from or what reviewed it. The capability arrived with a lower floor for adoption than anything I'd seen before.

That observation sits differently when you've watched the same movie play out four times.

Era	Technology	Outcome
1990s	CASE tools	Fast generation. Governance: years later.
2000s	Offshore development	Scale arrived. Governance: years later.
2010s	Agile	Velocity up. Governance: sometimes never.
2015+	DevOps	Exception — pipelines forced governance from day one.
2023+	AI code generation	The output is a file. Pipeline optional.

The question nobody was asking

By the end of 2023, the industry conversation was focused on capability: what can the model do, how accurate is the output, what's the context window. Reasonable things to care about, if you're evaluating whether to adopt the tool.

The question I kept waiting for someone to ask was the governance question. Not in a compliance-checkbox sense. In the same way platform engineers ask it about any new delivery mechanism.

Who reviews AI-generated code before it ships? If a model hallucinates a library dependency that doesn't exist, what catches it before it reaches a build? When AI writes your runbooks, how does your audit process know those runbooks were AI-assisted? If something breaks at 2am and the runbook was written by a model that's never seen your infrastructure, what does your on-call engineer actually do with it?

None of these are theoretical. They're the same class of problems platform teams have always solved for human engineers. We just hadn't started solving them for AI output yet.
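Take the hallucinated-dependency question as an example. A rough sketch of one possible check, assuming the generated code is Python and the build environment is the source of truth: parse the file, collect its top-level imports and flag any that don't resolve. The function name and the sample input are illustrative.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that don't resolve here."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which belong to the local package
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


generated = (
    "import json\n"
    "from os import path\n"
    "import totally_made_up_pkg_xyz\n"
)
print(unresolved_imports(generated))
```

This only says the name doesn't exist in the current environment. It says nothing about whether a package that does resolve is the right one, so it's a gate to run before the build, not a replacement for review.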

The gap wasn't surprising. It was predictable. The surprise was how few people seemed bothered by it.

The Agent Is 20% of the Work. The Platform Is the Other 80%.

Todd Linnertz — Sun, 17 May 2026 04:56:38 +0000

Originally published at devopsdiary.blog. Post F-AID1 in the "Governing AI in the Enterprise" series.

A payroll team shipped a production AI agent last year. Real workload, not a demo: processing 3,000+ emails a day, classifying them, extracting data and entering payroll. Six distinct steps, end to end.

Their test accuracy: 94%. Good enough to ship.

Their production accuracy: 70%.

That's the talk I keep thinking about from AI Dev 26. The drop itself isn't news. What they did about it is.

The accuracy gap has a cause

The 94% looked clean because the test set was curated. It covered the cases the team had thought of. Production didn't care about that. It sent typos. Impossible numbers. Screenshots. Hand-drawn notes. Vague references with no context. Conflicting instructions from two people in the same email thread.

The test distribution and the production distribution weren't the same. They almost never are.

A better model didn't close the gap. They ran shadow testing: the agent processed real production emails alongside their human team for four weeks, generating payroll entries but not submitting them. Humans reviewed the shadow outputs. Edge cases surfaced. New tests got written.
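The talk didn't show the harness, so what follows is only my sketch of the shape: the agent proposes an entry for each real input, nothing is submitted, and the proposal is graded against what the human actually entered. The toy agent, the emails and the `hours` field are invented for illustration.

```python
from typing import Callable


def shadow_run(
    inputs: list[str],
    agent: Callable[[str], dict],
    human_results: dict[str, dict],
) -> tuple[float, list[str]]:
    """Grade agent proposals against human entries without submitting anything.

    Returns (agreement rate, inputs where the agent disagreed with the human).
    """
    mismatches = []
    graded = 0
    for item in inputs:
        expected = human_results.get(item)
        if expected is None:
            continue  # no human entry to grade against
        graded += 1
        proposed = agent(item)  # generated only; there is no write path here
        if proposed != expected:
            mismatches.append(item)
    agreement = (graded - len(mismatches)) / graded if graded else 0.0
    return agreement, mismatches


def toy_agent(email: str) -> dict:
    # Stand-in for the real agent: only understands one phrasing.
    return {"hours": 40 if "full week" in email else 0}


emails = ["full week for Ana", "Ana worked mon-wed", "full week for Raj", "see screenshot"]
human = {
    emails[0]: {"hours": 40},
    emails[1]: {"hours": 24},
    emails[2]: {"hours": 40},
    emails[3]: {"hours": 16},
}

agreement, mismatches = shadow_run(emails, toy_agent, human)
print(agreement)   # 0.5
print(mismatches)  # ['Ana worked mon-wed', 'see screenshot']
```

The mismatch list is the valuable output. Each entry is an edge case that becomes a new test.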

Final accuracy: 98%. The agent didn't change. The scaffolding around it did.

Month	Accuracy
M1	55%
M2	97%
M3	87%
M4	94%
M5 (live)	70%
M6 (shadow)	98%

Six months of accuracy data from the payroll agent. The dip at M5 is what shipping without production-distribution testing looks like.

The 20/80 problem

The final slide from that talk had a number I wrote down immediately: agent engine = 20% of the work. The durable system around the agent = 80%.

That ratio feels off if you've spent most of your time thinking about which model to use, how to prompt it, how to evaluate it against a benchmark. Those things matter. They're just not where a production AI project lives or dies.

The 80% is the multi-stage evaluation pipeline. Shadow testing infrastructure. The control tower that gives ops and leadership visibility into what the agent is actually doing. Input governance for the weird formats production throws at you. The routing logic that decides which step of the workflow a given input actually belongs in.

None of that is prompt engineering. All of it is platform work.

I've spent 30 years watching organizations adopt new technology and invest heavily in the visible capability while underbuilding the infrastructure that makes it last. The pattern is consistent. AI isn't running a different play.

What breaks without the infrastructure

Enterprise AI conversations split fast once you get past the demo stage. Some teams want to know about governance, evaluation pipelines, how outputs get reviewed before they do anything irreversible. Most are asking which model to use and when they can ship.

The 70% drop happens. Without a control tower to surface it, teams find out through complaints, not metrics.

That's a platform problem. Someone has to own the pipeline, not just the agent.

The line I can't stop thinking about

Day two had a closing panel. Loose, riffing. One panelist dropped a line that's been with me since:

"If you don't own your harness, you don't own your memory."

It took a beat to unpack. Your harness is your evaluation infrastructure: the test pipelines, the shadow mode, the tooling that decides what "good" looks like for your specific agents on your specific workloads. Your memory is what that harness teaches you over time: where your agents fail, which prompts hold up under real traffic, what your actual production distribution looks like.

Outsource the harness to a vendor and the vendor runs your evaluation loop. They see your production failures first. Every edge case your agents surface builds their system's understanding, not yours.

Most teams are focused on which LLM provider to pick, which coding assistant to standardize on. The harness question comes later, usually when a vendor relationship turns complicated and they realize how hard it is to move.

The payroll team built their own. Multi-stage evals, shadow infrastructure, control tower, four weeks of real production traffic before anything touched the write path. That's why they landed at 98%. And that's why the knowledge of how to get there belongs to them.

Twenty percent for the agent. Eighty percent for the system around it. Teams that understand that ratio are the ones shipping agents that stick.

What DevOps Taught Me About Running a Function

Todd Linnertz — Thu, 23 Apr 2026 03:32:38 +0000

Originally published at devopsdiary.blog.

Most engineering orgs staff platform teams like project teams and then measure them like project teams. Both halves are wrong, and the second one is what kills them. Here are the three metrics that actually tell you if a platform function is working.

The first time I inherited a platform team I asked the obvious question. How is the platform doing? Uptime green, deploys up, tickets closing faster than they were opening. Two months later I knew none of those numbers had told me anything about whether the team was actually doing its job.

Once you see that gap, you can’t run a platform org any other way.

A function is not a project team

Most engineering organizations staff platform teams like project teams and then measure them like project teams. Both halves are wrong, and the second one is what kills them.

A project team exists to ship a thing. You measure it by whether the thing shipped, when, and how well it works. The metrics are about the team because the output is the team’s output.

A function is different. Platform engineering, DevOps, security, developer productivity: these are functions. A function exists to change the slope of everyone else’s work. Its output is not its own output. The thing you measure is what becomes possible across the rest of the org because the function exists.

If you measure a function the way you measure a project team you’ll get a team that ships beautiful internal artifacts nobody uses. Green dashboards and rising attrition. A platform org that looks healthy from the inside and is quietly failing from the outside, and you won’t see the failure until the consumer teams stop pretending.

Metric 1: adoption velocity

Adoption velocity is the percentage of consumer teams that move to the platform’s current standard within ninety days of release. Not whether they all get there eventually. The shape of the curve in the first quarter.
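As a rough sketch, the calculation is one pass over adoption dates. The team names and dates are made up, and treating "adopted" as a single date per team is a simplifying assumption.

```python
from datetime import date, timedelta
from typing import Optional


def adoption_velocity(
    release: date,
    adopted_on: dict[str, Optional[date]],
    window_days: int = 90,
) -> float:
    """Percentage of consumer teams that adopted within `window_days` of release."""
    if not adopted_on:
        return 0.0
    cutoff = release + timedelta(days=window_days)
    on_time = sum(
        1 for d in adopted_on.values() if d is not None and release <= d <= cutoff
    )
    return 100.0 * on_time / len(adopted_on)


release = date(2026, 1, 5)
teams = {
    "payments": date(2026, 2, 1),   # inside the window
    "search": date(2026, 3, 20),    # inside the window
    "billing": date(2026, 5, 15),   # adopted, but after day 90
    "mobile": None,                 # has not adopted
}
print(adoption_velocity(release, teams))  # 50.0
```

Teams that adopt late count the same as teams that never adopt, which is the point: the metric is about the first-quarter curve, not the eventual total.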

This is the metric that tells you whether the gap between built and adopted is closing or widening. A platform team can ship excellent technical work and still fail if the curve is flat. Worse, a flat curve means the platform team is generating debt at the same rate as the rest of the org, because every standard they release that nobody adopts becomes another version the team has to support forever.

When I led GitOps adoption, the first quarter looked great. Teams onboarded. We had momentum, we had a story to tell, the architecture review board was happy. In the second quarter we had the same platform, the same docs and the same support model, but the curve had stalled, and nobody on the team noticed because the dashboards were full of green.

I went and talked to the teams that hadn’t adopted. Almost none of their reasons were technical. The blockers were political. Once I knew that, the fix was a half-day of negotiation with the product owners. The curve unstalled the next sprint.

Without an adoption curve I would have kept measuring uptime and deploy counts and concluded the team was crushing it. The team was crushing it. The platform was failing.

Metric 2: time to first success for a new consumer

This one is the cleanest signal in the set. How long does it take a brand-new team (one that has never touched the platform) to get from “we’re adopting this” to “we shipped something to production using it”?
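A minimal way to compute it, assuming you can get two dates per onboarding: the day the team committed and the day of its first production ship. The dates here are invented, and using the median rather than the mean is my choice, so one slow team doesn't hide the typical experience.

```python
from datetime import date
from statistics import median


def time_to_first_success(onboardings: list[tuple[date, date]]) -> float:
    """Median days from 'we're adopting this' to the first production ship."""
    return float(median((shipped - started).days for started, shipped in onboardings))


onboardings = [
    (date(2025, 1, 6), date(2025, 2, 17)),   # 42 days
    (date(2025, 3, 3), date(2025, 4, 10)),   # 38 days
    (date(2025, 5, 5), date(2025, 6, 19)),   # 45 days
]
print(time_to_first_success(onboardings))  # 42.0
```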

Time to first success is the only proxy I trust for whether the documentation, the onboarding model and the support story actually work. It’s also the metric most platform teams are catastrophically wrong about, because they’ve never measured it. They ask each other whether the platform is intuitive and they all agree it is, because they built it.

Earlier in my career I inherited operational workflows where new teams were taking six weeks to onboard. Six weeks is a structural problem dressed up as an onboarding problem. The platform team had been adding documentation and the number hadn’t moved. Their theory was that the new teams weren’t reading carefully enough.

We didn’t write more docs. We restructured the handoffs. Of the four points where new teams were stalling, we collapsed two, automated one and put a single owner on the fourth. New teams started shipping in four days. Defect rates dropped, and throughput improved.

None of that came from better tooling. All of it came from going to look at a number the team wasn’t measuring and refusing to accept that the existing onboarding was working just because the team said it was.

Metric 3: support ratio

The third metric is the percentage of platform-team engineering hours going to consumer support, hand-holding and break-fix versus platform development. Healthy platform teams trend toward more development over time as the platform matures. Unhealthy teams trend the other way and don’t notice until the burnout hits and the senior engineers start interviewing.
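Sketched as code, with hypothetical hours per quarter, the ratio and its direction look like this. Comparing only the first and last period is a deliberate simplification.

```python
def support_ratio(support_hours: float, development_hours: float) -> float:
    """Percentage of engineering hours spent on support rather than development."""
    total = support_hours + development_hours
    return 100.0 * support_hours / total if total else 0.0


def trend(ratios: list[float]) -> str:
    """Compare first and last period: 'improving' means the support share fell."""
    if len(ratios) < 2 or ratios[-1] == ratios[0]:
        return "flat"
    return "improving" if ratios[-1] < ratios[0] else "worsening"


# (support hours, development hours) per quarter; illustrative numbers only.
quarters = [(300, 700), (380, 620), (450, 550)]
ratios = [support_ratio(s, d) for s, d in quarters]
print(ratios)         # [30.0, 38.0, 45.0]
print(trend(ratios))  # worsening
```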

Support ratio is the leading indicator for every organizational failure mode in platform engineering. Burnout. Attrition. Scope creep. Feature stagnation. The eventual quiet rebellion of the consumer teams who have been getting worse responses every month and have stopped expecting better. If you only get to watch one number on a platform org, watch this one.

It’s also the only metric that tells you whether the team’s design (interfaces, automation, self-service) is actually reducing toil or just relocating it. A team that ships a self-service portal and watches the support ratio climb has built a portal consumers can’t use.

This is the metric that convinced me the next generation of platform engineering needs structural governance. Better tools won’t save it. When AI generation accelerates the rate at which consumer teams produce work, the support ratio explodes unless the platform itself produces frozen, validated artifacts that the consumers can trust without a human in the loop. That conviction is why I’ve spent the last few months building AIEOS.

What these metrics force you to do

Once these three numbers are on your dashboard, the leadership job changes. You stop measuring your team by what they shipped and start measuring them by what the rest of the org shipped because of them. That sounds small. It isn’t.

The roadmap shifts, because you become willing to deprecate your own team’s work when adoption stalls instead of doubling down on a thing nobody is using. The way you spend political capital shifts, because you start defending the platform team’s time against the constant pressure to absorb every adjacent problem in the org.

It also changes the conversations you have with your own leadership. You stop reporting up on what your team built and start reporting up on what your team made possible. Those are different sentences. The second one is the one Directors and VPs are paid to say.

The hard part

The hardest thing about running a function is that the work is invisible until it isn’t. A team that’s quietly doing it right looks identical to a team that’s quietly burning down. Velocity charts won’t tell you which is which. Neither will uptime or deploy counts. These three metrics are how I tell the difference, and I can usually tell within the first month of taking over.

If you’re running a platform org and these aren’t on your dashboard, they should be. And if you’re hiring someone to run one, they should already be talking about them.

Todd Linnertz is a Senior Technology Leader with deep experience in enterprise architecture and DevOps. He is the creator of AIEOS, an open-source AI governance system for software delivery teams. Find him at devopsdiary.blog and github.com/wtlinnertz.

Why I Stopped Writing (And What Happened Since)

Todd Linnertz — Wed, 15 Apr 2026 17:14:04 +0000

Originally published at devopsdiary.blog. Series opener for "The Quiet Years," a retrospective on the work between August 2022 and now.

The last post on this blog went up in August 2022. Three and a half years later, here's why the silence happened and why it's ending now.
The last post on this blog went up in August 2022. Twenty-eight articles in nine months, and then nothing for three and a half years. Time to restore service.

I wasn’t burned out. I didn’t lose interest. The blog went quiet because I took a new job two weeks later, and the work ate the writing.

That’s the honest version. The strategic version, the one that matters now, is that the work itself was the foundation I needed for what I’m doing today. I just couldn’t see that while I was inside it.

The work that ate the blog

In August 2022 I started a new Technical Architect role. I thought I’d be enabling DevOps practice. What I ended up doing was a lot of the day-to-day firefighting that comes with a large enterprise.

I spent the better part of 2023 in conference calls explaining why declarative deployments didn’t violate change management.

While that was happening, I was also running vendor evaluations and designing the configuration automation for our public cloud alongside an existing CloudBees installation. I built dashboards nobody wanted to see and I figured out what to do when Anaconda changed their licensing and hundreds of developers were impacted. A dev container solution I prototyped for my own team ended up getting adopted.

None of that looked like blog material at the time. It felt like work. It was the daily grind of making enterprise engineering slightly less terrible one approval at a time.

What I didn’t see coming

Somewhere in the middle of that stretch, ChatGPT showed up. Then Copilot. Then a flood of other tools that could generate code faster than any human could review it.

My first reaction was skeptical. My second reaction was something like “how the hell are we going to govern this?” The output wasn’t bad. It was often impressive. But it was also a black box. It was a new source of engineering artifacts that could be produced at scale, but with no clear way to validate them or trace them back to the decisions that led to them. Architecture documents, design specifications, PRDs, test cases, deployment scripts. All of it could be generated by AI, but none of it could be governed by the processes that had been in place for human-generated artifacts.

That observation is where the rest of my career bent.

The governance instincts I’d been building (immutable artifacts, structured handoffs, validation that produces verdicts instead of suggestions, measurement that becomes gating) turned out to be the vocabulary AI-assisted software delivery needed. And almost nobody was connecting those dots. The MLOps world was building model training pipelines. The AI safety world was talking about alignment. The engineering leadership world was dreaming about productivity gains.

The gap in the middle was empty. Nobody was writing about what governance looks like when AI generates engineering artifacts at scale. That gap is where I’ve been living since early 2026.

Why now

In February I started building AIEOS, an open-source governance system for AI-assisted software delivery. I wrote the first post about it two weeks ago. That post is the reason this one exists.

I can’t keep writing forward-looking pieces about AI governance without also explaining where the ideas came from. They didn’t show up in February. They came from watching engineers try to absorb new tooling while keeping regulatory commitments, audit trails and production reliability intact. That’s the blog I didn’t write while I was living it.

So I’m going to write it now, in retrospect. This retrospective won’t read like greatest hits. Several of these posts are about things that didn’t work. A couple are about decisions I’d make differently today. I’m not trying to stack up wins. I want to show the actual path from doing enterprise governance to building AI governance infrastructure, because that path is shorter than most people think, and a lot of engineers are walking it right now without realizing it.

If you’re one of them, this series is for you.

Todd Linnertz is the creator of AIEOS, an open-source AI governance system for software delivery teams. Find him at devopsdiary.blog and github.com/wtlinnertz.

AI Doesn't Fix Your Development Problems. It Accelerates Them.

Todd Linnertz — Tue, 07 Apr 2026 12:21:50 +0000

I've watched the same failure pattern play out across every technology wave of my career.

Team gets a new tool that promises to change everything. Productivity numbers go up. Everyone celebrates. Six months later, they're drowning in the same late-stage rework they were drowning in before. Just more of it, arriving faster.

I saw it with CASE tools in the nineties. With offshore development in the 2000s. With Agile transformations in the 2010s. With DevOps automation in the 2020s.

AI code generation is the most powerful version of this pattern I've ever seen. And most engineering organizations are walking straight into it.

The Illusion Looks Like This

Your team adopts GitHub Copilot or a similar tool. A developer asks it to implement a user authentication module. In forty seconds, it produces three hundred lines of code, complete with error handling, tests and documentation comments.

It looks like progress. It genuinely feels like the future.

Most teams never stop to ask whether the spec for that authentication module was unambiguous.

Because if the acceptance criteria were vague, if the security requirements weren't spelled out, if the integration assumptions weren't documented, you didn't just get a module in forty seconds. You got a module built on a foundation of ambiguity in forty seconds. The rework that's coming is exactly the same size it would have been without AI, compressed into a shorter timeline, with more generated code to sort through.

This is what I mean when I say AI accelerates the appearance of progress while the underlying causes of late-stage rework remain unchanged.

The Real Source of the Problem

Late-stage rework has never been caused by slow typing.

After five companies and more failed projects than I can count, I can say this with confidence: rework happens because of process failures, not speed deficits.

The real culprits are consistent:

Ambiguous specifications that leave developers filling in the blanks with assumptions that won't survive contact with the product team.

Unstable upstream artifacts. The architecture document that's still being revised while the engineering team is implementing against it.

No separation between generation and judgment. The same person (or tool) that produces the artifact is asked to validate it. The result is rationalization, not evaluation.

Missing governance at handoff points. Work flows from planning to design to implementation with no formal freeze points and no immutable record of what was decided and when.

These process failures predate AI by decades. I saw every one of them long before anyone had a code assistant. What AI does is make them faster, and worse. When a developer could only produce two hundred lines of code per day, bad process produced two hundred lines of rework per day. When AI can produce two thousand lines of code per day, bad process produces two thousand lines of rework per day.

The throughput multiplied. The problem did not diminish.

What Most Teams Do About It

Most teams respond to this by trying to write better prompts.

That's the wrong level of the problem. Better prompts improve the quality of AI output within a session. They do nothing about the structural issues that make that output drift, conflict with upstream decisions, or fail validation three weeks later.

Some teams add code review. That helps at the implementation level, but it doesn't address the artifact chain. AI-generated architecture documents, PRDs, and design specifications have the same ambiguity problem as AI-generated code, and often create it earlier in the cycle where the blast radius is larger.

The instinct to treat AI governance as a prompt engineering problem is understandable. Prompt engineering is visible and immediate. The structural failures that cause rework aren't. They hide until you're already underwater.

What Actually Fixes It

After watching the same failure patterns repeat, and then watching them accelerate as my teams started adopting AI tooling, I concluded that the fix requires three structural changes, none of which are about prompting.

Treat AI as a generation engine, not a decision-maker. AI is extraordinarily good at producing artifacts: code, documentation, architecture drafts, test plans. It is not good at determining whether those artifacts are correct relative to upstream decisions it may not fully understand. The organizations that get this right separate generation (what AI does) from judgment (what humans and structured validators do). These are different activities and they need different infrastructure.

Freeze artifacts before downstream work begins. An architecture document that can change while engineering is implementing against it is a liability, plain and simple. Frozen artifacts create an immutable record of what was decided. When something downstream breaks, you know whether the upstream artifact shifted or whether the implementation deviated. Without freeze semantics, this is guesswork.

Make validation produce verdicts, not suggestions. When you ask an AI to review its own output, it will find ways to explain why what it generated is reasonable. That's rationalization, not validation. Real validation produces a binary result: the artifact meets the required criteria, or it doesn't. Anything softer than that is a governance gap dressed up as a process.
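The second and third changes can be sketched together. This is an illustration under stated assumptions, not how any particular tool implements it: "freezing" is reduced to recording a SHA-256 digest, and the criteria are hypothetical string checks standing in for real validators.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def freeze(content: str) -> str:
    """Return the digest downstream work records as 'what was decided'."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def has_drifted(current_content: str, frozen_digest: str) -> bool:
    """True if the upstream artifact no longer matches what was frozen."""
    return freeze(current_content) != frozen_digest


@dataclass(frozen=True)
class Verdict:
    passed: bool
    failed_criteria: tuple[str, ...]


def validate(content: str, criteria: dict[str, Callable[[str], bool]]) -> Verdict:
    """Judge only: report which criteria failed, never rewrite the artifact."""
    failed = tuple(name for name, check in criteria.items() if not check(content))
    return Verdict(passed=not failed, failed_criteria=failed)


spec = "Auth module: passwords hashed with bcrypt. Sessions expire after 30 minutes."
digest = freeze(spec)

# Hypothetical criteria; real ones would be far stricter than substring checks.
criteria = {
    "names a hashing scheme": lambda t: "bcrypt" in t or "argon2" in t,
    "states session expiry": lambda t: "expire" in t,
    "states lockout policy": lambda t: "lockout" in t,
}

verdict = validate(spec, criteria)
print(verdict.passed)           # False
print(verdict.failed_criteria)  # ('states lockout policy',)
print(has_drifted(spec + " Lockout after 5 attempts.", digest))  # True
```

The validator returns a pass or fail plus the reasons, and nothing else. Fixing the spec is a separate step that produces a new artifact with a new digest.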

At a previous company, I inherited four operational workflows where the same rework patterns were burning cycles everywhere. We didn't buy new tools or speed anything up. We restructured the handoffs and built validation into each transition point. Defect rates dropped 50%. Throughput improved between 35 and 57 percent across all four areas. None of that came from faster tooling. All of it came from fixing the process around the work.

These aren't novel ideas. They're the same principles that make CI/CD pipelines reliable: automated gates, immutable artifacts, clear separation of build and deploy. The insight is that they apply just as well to AI-assisted software delivery as they do to code deployment pipelines. Maybe more so.

The difference is structure around the generation.

The Framework I Built

When I led GitOps adoption at my current company, the technology was the easy part. Getting architecture review board approval, building deployment standards and creating the governance structure that let teams adopt safely took ten times longer. The teams that tried to skip the governance stalled out. The ones that went through it shipped to production. That experience confirmed something I already suspected: the structure around adoption matters more than the tool being adopted.

In early 2026, I formalized these ideas into an open-source framework called AIEOS (AI-Enabled Operating System).

AIEOS structures how engineering artifacts are produced, validated and connected across the full software development lifecycle when AI is involved in generating them. It's built across 24 repositories: an 8-layer model covering the full value-delivery cycle from strategic direction through operational diagnostics, a multi-agent orchestration harness and a guided console for running governance workflows.

The design reflects a simple premise: when AI generates engineering artifacts, the quality of the output depends on the quality of the structure around it. Better prompts help. Better governance infrastructure is what makes the results repeatable, auditable and trustworthy at scale.

The repo is at github.com/wtlinnertz. It's open source, and the rest of this series will dig into how it's designed and why.

What's Coming in This Series

Over the next six posts, I'll cover:

The eight questions every AI-assisted engineering team must be able to answer and how they map to a governance architecture
The three non-negotiable rules for trustworthy AI-generated code
What DevOps taught me about AI governance (and why that background is an advantage)
Inside AIEOS: how multi-agent orchestration runs governance workflows
AI governance in financial services and why the compliance context changes everything

If you've been watching AI tooling arrive in your organization and wondering why the rework isn't going away, this series is for you.