DEV Community: Ryan Merlin

Own the Operation

Ryan Merlin — Fri, 05 Jun 2026 13:00:04 +0000

MCP connects agents to tools. Workflows coordinate tools. Skills teach models how to use tools.

None of those automatically owns the operation.

An operation is the durable unit of work: resolve this secret, explain this failed deploy, audit this cloud account, prepare this release, reconcile these receipts, summarize this incident. It includes the auth path, retries, local conventions, failure taxonomy, output contract, safety policy, and receipt. If that knowledge lives in a prompt, a notebook cell, a workflow node, or a thin MCP wrapper, the agent has to rediscover it every time.

Lately this is the part of the stack people call the action layer, the place where an agent does things instead of merely reaching them. That name is right about the location and quiet about the substance, because what gives an action layer any leverage is where the operation behind each action actually lives.

The useful question is not whether agents should use MCP or the command line. The answer is obviously both.

The useful question is:

Where does the operation live?

My claim: for repeated, composition-heavy, personally shaped work, the operation should often live behind an owned command surface. MCP can expose it. A workflow can schedule it. A skill can explain it. But the operation itself should compile into something deterministic, inspectable, reusable, and honest about failure.

That is what a good CLI gives you. Not a pile of scripts. Not an MCP server for every API. Not terminal nostalgia. A stable, discoverable, JSON-speaking operations layer over the systems you actually touch.

The layer mistake

MCP solved a real problem. Before MCP, every agent integration was bespoke glue. Anthropic describes MCP as an open standard for connecting agents to external systems, replacing custom pairings between every agent and every tool.

That win is real. It is also easy to overextend, which is the argument I made last week in "The MCP Explosion Has a Scaling Problem": MCP won the tool layer, but a protocol boundary is not an operational boundary.

Layer	Owns	Should not own
Skills and docs	Instructions, examples, policies, procedural context	The only copy of critical behavior
CLI verbs	Executable operations, contracts, local conventions, error taxonomy	Remote distribution, OAuth consent, multi-tenant governance
Workflows	Scheduling, orchestration, state, retries across steps, approvals	The only definition of an operation other callers need
MCP	Protocol, auth, discovery, consent, hosted distribution	Your private operating knowledge by default

MCP answers: how does the agent reach a capability, what schema describes the call, who is allowed to invoke it, and how does the result come back?

An operation answers: what should actually happen, which local rule applies, what failure class is this, is it retryable, what should be redacted, what receipt proves it ran, and what output can every caller depend on?

Those are different questions. When they collapse into the same artifact, agent systems get brittle. Tool catalogs grow. Error modes blur. Prompts accumulate folklore. This is the same separation-of-layers point behind "The Agent Protocol Stack Has a Runtime Gap," applied one level down, to your own operations.

The operation should live at the lowest layer that can execute it deterministically and be reused by every caller.

For personal agents, that layer is often a CLI.

Progressive discovery won

The first scaling failure in agent tooling was not model quality. It was context pressure.

Cloudflare's Code Mode is the cleanest production example. The Cloudflare API has more than 2,500 endpoints. Exposing every endpoint as a separate MCP tool would consume 1.17 million tokens. Code Mode exposes two tools, search() and execute(), keeps the footprint around 1,000 tokens, and still reaches the full API surface. The important part is not the token reduction. It is the shape: search first, inspect only what matters, then execute.

Anthropic reached the same conclusion from the client side. Loading every MCP tool definition up front and routing large intermediate results through the model makes agents slower and more expensive. Presenting MCP servers as code APIs lets the agent inspect only the tool files it needs and process intermediate data inside the execution environment. Their example workflow dropped from 150,000 tokens to 2,000, a 98.7 percent reduction.

Claude's Tool Search makes the pattern explicit: defer loading, search the catalog, and expand only the few tools needed for the request. Anthropic's docs say a typical multi-server setup can spend roughly 55,000 tokens on tool definitions before useful work starts, and tool search usually cuts that by more than 85 percent.
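Those percentages are easy to check. A minimal Python sketch, using only the figures quoted above; the helper name `reduction_pct` is mine and does not come from any of the cited sources:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens saved when going from `before` to `after`."""
    return (before - after) / before * 100


# Anthropic's example workflow: 150,000 tokens down to 2,000.
anthropic = reduction_pct(150_000, 2_000)

# Cloudflare: about 1.17 million tokens for one tool per endpoint,
# versus roughly 1,000 tokens for the two Code Mode tools.
cloudflare = reduction_pct(1_170_000, 1_000)

# Tool Search: roughly 55,000 tokens of definitions, cut by 85 percent
# (the docs say "more than", so this is an upper bound on what remains).
remaining_after_tool_search = 55_000 * (1 - 0.85)

print(f"{anthropic:.1f}%")                  # 98.7%
print(f"{cloudflare:.1f}%")                 # 99.9%
print(round(remaining_after_tool_search))   # 8250
```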

The honest conclusion is not "MCP failed." It is sharper:

Progressive discovery won. Do not load the world. Find the next handle.

A good CLI already works that way.

mycli discover                                          # top-level domains
mycli discover secrets                                  # one domain
mycli secrets resolve --help                            # one verb
mycli secrets resolve providers/openai/api-key --json   # one operation

The agent does not need the whole surface. It needs a catalog, a drill-down path, and the schema for the verb it is about to run.
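To make the drill-down concrete, here is a minimal Python sketch of what a `discover` verb might do behind the scenes. Everything in it is an assumption for illustration: `mycli` is a placeholder, and the catalog contents, including the `rotate` verb, are hypothetical.

```python
from typing import Optional

# Hypothetical catalog: domain -> verb -> one-line summary.
CATALOG = {
    "secrets": {
        "resolve": "Resolve a secret by path",
        "rotate": "Rotate a secret",
    },
    "deploy": {
        "explain": "Explain a failed deploy",
    },
}


def discover(domain: Optional[str] = None) -> dict:
    """Return only the next level of the catalog, never the whole tree."""
    if domain is None:
        # Top level: just the domain names.
        return {"domains": sorted(CATALOG)}
    if domain not in CATALOG:
        # Unknown domain: say so, and hand back the valid handles.
        return {"error": f"unknown domain: {domain}", "domains": sorted(CATALOG)}
    # One domain: its verbs and summaries, nothing else.
    return {"domain": domain, "verbs": CATALOG[domain]}
```

Each call returns one level, so the agent pays for the handle it is about to use and nothing more.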

This is also why the CLI keeps showing up in evaluations. Arize ran 500 GitHub-task evaluations across MCP, CLI skills, and bare shell. Correctness landed in a tight band: MCP at 0.83, the shorter CLI skill at 0.83, and bare shell at 0.84. But on the hardest tasks, MCP cost more than six times what the skills cost, took five times longer, and averaged more tool calls. Its tool fidelity fell to 0.33 because the MCP agent escaped into bash when the API surface could not express the composition it needed.

The same eval also shows where MCP wins. Creating a branch and opening a pull request went better through MCP because create_branch and create_pull_request existed as direct endpoint-shaped tools. That gives us the rule:

Endpoint-shaped tasks favor endpoint-shaped tools. Composition-shaped tasks favor executable surfaces.

Use the protocol when the task is a clean remote capability. Use an executable surface when the task requires filtering, joining, computing, redacting, retrying, or applying local judgment across messy intermediate state.

The seam is where systems fail

An expired token comes back as a 404. A missing secret also comes back as a 404.

So when your agent's credential fetch fails, it spends ten minutes debugging the wrong layer, because the five-line auth dance copied into nine scripts cannot tell the two apart.

Before:

source .env
TOKEN=$(curl -s -X POST "$AUTH_URL" -d "$AUTH_BODY" | jq -r .access_token)
curl -s "$VAULT_URL/secret/$SECRET_PATH" -H "Authorization: Bearer $TOKEN" | jq -r .data.value

That is not an operation. It is ceremony. Maybe it retries. Maybe it refreshes. Maybe it distinguishes auth failure from missing data. Maybe the next script copied the fixed version. Probably not.

After:

mycli secrets resolve providers/openai/api-key --json

Now the operation exists. It owns the auth flow, token refresh, retry policy, error taxonomy, output shape, and receipt. The agent does not need to infer the difference between an expired credential and a missing path. The command tells it. (That secrets layer is its own small build, which I wrote up in "Building first-class secrets management into an AI agent.")

A useful command should not just return bytes. It should return a contract.

{
  "ok": false,
  "error": {
    "code": "AUTH_TOKEN_EXPIRED",
    "class": "auth",
    "retryable": true,
    "message": "Credential expired before vault lookup.",
    "remediation": "Run `mycli auth refresh` or allow automatic refresh."
  },
  "receipt": {
    "id": "sec_20260602_184211",
    "operation": "secrets.resolve",
    "target": "providers/openai/api-key"
  }
}

That is the difference between a model guessing and a system reporting.
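One way a command could produce that envelope, sketched in Python under stated assumptions: the vault returns 404 for both cases, the command tracks its own token expiry locally, and the codes `SECRET_NOT_FOUND` and `UPSTREAM_ERROR` are hypothetical names I added to round out the taxonomy. Treat it as a sketch, not the implementation.

```python
import json


def classify_failure(status: int, token_expires_at: float, now: float) -> dict:
    """Turn an ambiguous HTTP status into an error the caller can act on."""
    if status == 404 and now >= token_expires_at:
        # The token was already stale when the lookup ran, so blame auth first.
        return {
            "code": "AUTH_TOKEN_EXPIRED",
            "class": "auth",
            "retryable": True,
            "message": "Credential expired before vault lookup.",
        }
    if status == 404:
        # Hypothetical code: the token was valid, so the path is really missing.
        return {
            "code": "SECRET_NOT_FOUND",
            "class": "not_found",
            "retryable": False,
            "message": "No secret exists at the requested path.",
        }
    # Hypothetical catch-all: only server-side errors are worth retrying.
    return {
        "code": "UPSTREAM_ERROR",
        "class": "upstream",
        "retryable": status >= 500,
        "message": f"Unexpected status {status} from the vault.",
    }


def failure_envelope(error: dict, receipt_id: str, operation: str, target: str) -> str:
    """Serialize a failure in the same shape as the JSON shown above."""
    return json.dumps(
        {
            "ok": False,
            "error": error,
            "receipt": {"id": receipt_id, "operation": operation, "target": target},
        },
        indent=2,
    )
```

The two 404 branches are the whole point: the same status code produces two different classes, and only one of them is retryable.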

The win is not that the command is shorter. The win is that the operation is executable, inspectable, reusable, and honest about failure. A human can call it. An agent can call it. A cron job can call it. A service can call it. An MCP wrapper can call it later. The operation is no longer trapped in a prompt, workflow node, notebook cell, shell history, or vendor connector. It is part of your surface.

What about workflows?

Workflow is one of the load-bearing ideas in modern software. Whole categories of company are built on the workflow principle: define a process once as a sequence of steps, then let an engine schedule it, run it, retry it, and prove it ran. But the word does not mean one thing. It spans tools that live at very different layers.

At its core it is orchestration. GitHub Actions runs workflows inside a repository, triggered by a push or a schedule. Airflow, Dagster, and Prefect orchestrate data pipelines as graphs of tasks with dependencies, retries, and backfills. Alteryx puts the same idea on a visual canvas for analysts who join and shape data without writing code. n8n does it for moving data between SaaS apps. Different audiences, one shape: a controlled, repeatable path through steps that are known in advance.

In 2026 that idea reached into agent harnesses. Anthropic draws the line cleanly: a workflow orchestrates models and tools along predefined code paths, while an agent lets the model direct its own process. Claude Code now ships a literal workflow primitive. A dynamic workflow, in research preview since May 2026, is a JavaScript script Claude writes for your task; a runtime executes it in the background and fans the work across as many as 1,000 subagents, returning only the verified result instead of the exhaust of every step. Even there the pattern holds: the script is the orchestrator that decides what runs and in what order, and the subagents do the work.

So "workflow" already spans a YAML CI file, a data graph, a visual analytics canvas, and a generated multi-agent script. Across all of them it answers the same question. A workflow decides when and in what order work happens, and what coordinates the steps. A command decides what the work is. Use these tools when the process is known in advance. Just do not confuse the engine with the work it runs.

Bad layering puts the substance inside the orchestrator:

the workflow node contains the auth logic
it contains the provider-specific retries
it contains the jq filters and the local naming conventions
it contains the only error handling anyone trusts

Good layering keeps the workflow thin and the operations owned:

workflow trigger
  -> mycli cloud drift --stack payments-prod --json
  -> mycli deploy explain --service api --env prod --since 30m --json
  -> mycli incident snapshot --service checkout --include logs,metrics,traces --json

The workflow owns scheduling, approvals, fan-out, retries between steps, and escalation. The CLI owns the operation contract. If an agent needs the same capability interactively, it calls the same command. If a service needs it, it calls the same command. If you later expose it through MCP, the wrapper calls the same command. That is the reuse workflows do not give you by themselves.
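A thin workflow can be sketched in a few lines of Python. This is an illustration, not a workflow engine: `run_cli` and `run_steps` are hypothetical names, and the sketch assumes every command prints a JSON envelope with `ok` and `error.retryable` fields like the one shown earlier.

```python
import json
import subprocess
from typing import Callable, List


def run_cli(argv: List[str]) -> dict:
    """Run one owned command and parse the JSON envelope it prints to stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return json.loads(proc.stdout)


def run_steps(
    steps: List[List[str]],
    run: Callable[[List[str]], dict] = run_cli,
    max_attempts: int = 2,
) -> List[dict]:
    """Order and retries live here; what each step means lives in the command."""
    results = []
    for argv in steps:
        for attempt in range(1, max_attempts + 1):
            result = run(argv)
            error = result.get("error") or {}
            # Stop retrying on success, on a non-retryable error, or when out of attempts.
            if result.get("ok") or not error.get("retryable") or attempt == max_attempts:
                break
        results.append(result)
        if not result.get("ok"):
            break  # stop here and escalate; later steps are skipped
    return results
```

Notice what the workflow never does: it never inspects an HTTP status, parses logs, or decides what an error means. It only reads the contract.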

The honest exception: if the work is inherently a long-running process with durable state across restarts, approvals, and retries, use a workflow engine as the source of truth. But if a workflow node becomes the only place that knows how to interpret a failed deploy, classify an IAM error, or join logs with metrics and traces, promote that logic into an operation and let the workflow call it.

Workflows are control planes. Commands are operational primitives.

Skills and plugins are maps. Commands are machines.

This is the other meaning of "workflow," and it is where the argument gets sharpest. The ecosystem is converging on packaged agent capability: Agent Skills became an open standard in December 2025, and a Claude Code plugin bundles skills, subagents, and hooks into one installable folder.

That packaging is good. But a skill is not the operation itself. A skill can tell the agent to check the provider, refresh credentials, distinguish expired tokens from missing paths, emit structured JSON, and leave a receipt. That helps the model behave. It does not guarantee the behavior, because a SKILL.md is interpreted by the model, not executed.

So if the procedure lives only in prose, the agent reconstructs it every time. Sometimes correctly. Sometimes it skips a check, pastes stale shell, or reads the same 404 in the wrong layer. A command turns the procedure into an executable affordance: a deterministic engine runs the same code every time.

The skill or plugin tells the agent when and why. The CLI does the work.

Build by osmosis

You do not have to start from a blank file.

One thing MCP gives us, beyond runtime connectivity, is a corpus of tool-design prior art. An MCP server is a compact expression of how a domain exposes itself to agents: tool names, descriptions, schemas, auth assumptions, pagination, failure modes, and handler logic. The MCP tools spec makes this explicit: each tool carries a name plus metadata describing its schema. When the server is open source, the agent can read the implementation. When it is not, it can still inspect the exposed contract.

So MCP is useful even when MCP is not the final interface. The move is toolchain osmosis:

study the generic MCP surface
extract the operational primitives
preserve useful auth, pagination, retry, and rate-limit behavior
collapse repeated work into personal verbs, and discard what you do not use
harden the result as CLI commands
expose those commands back through MCP only if distribution requires it

The anti-pattern is a direct port:

mycli github list-issues --json
mycli github get-issue --id 123 --json
mycli github list-pull-requests --json
mycli github create-branch --json

That is endpoint sprawl wearing a terminal costume. The target is compression:

mycli work triage --project aria --since 7d --json
mycli ci explain-failure --repo aria --run latest --json
mycli release prepare --repo aria --base main --dry-run --json

Those commands know which accounts matter, which regions are production, which labels identify a release, which alarms are noisy, and what "rollback candidate" means. A good agent can help with that transformation, not by inventing tools from vibes, but by reading the existing implementation and proposing a smaller, sharper surface:

Study the MCP server in ./vendor/mcp-servers/github and its tools/list schemas.
Do not port tools one-for-one. Extract reusable operational primitives and propose
CLI verbs for my actual workflow. Constraints: collapse repeated multi-step
workflows; keep generic CRUD out unless called by more than one actor; every
command returns a stable JSON envelope, defines exit codes and error classes,
supports --dry-run for writes, and leaves a receipt for destructive actions;
keep secrets out of prompts and logs; note which MCP tools informed each verb.
Output: candidate verbs, rejected tools and why, JSON schemas, error taxonomy,
implementation plan, open questions.

The output you want is not a bigger catalog. It is a smaller one.

MCP can be the scaffolding. The CLI is what remains after the scaffolding comes down.

What earns a command

This is where the argument needs discipline. "Build your own toolchain" can decay into aesthetic DIY. That is not the claim. A command earns its place when at least one of these is true:

it is called by more than one actor: human, agent, workflow, service
it encodes a local convention a vendor cannot know
it joins multiple systems behind one stable contract
it has ambiguous failure modes that need honest errors
it handles sensitive data that should not pass through model context
it is repeated often enough that prompt reconstruction is waste
it is dangerous enough to require dry-run, receipts, or guardrails

If none of those are true, do not wrap it. Use MCP, the vendor connector, the workflow tool, or the API directly. Use the shell once and move on.

The test is not whether a command feels elegant. It is whether it reduces repeated reasoning, repeated glue, repeated tokens, repeated ambiguity, or repeated risk.

The chassis is the investment

The upfront work is real. You need the chassis before the leverage shows up: argument parsing, config loading, secret resolution, structured JSON output, stable exit codes, stable error classes, audit receipts, dry-run support, idempotency for writes, a useful --help, and machine-readable discovery.

That is not free. The first useful command costs more than it feels like it should. But the chassis changes the economics of every command after it. Adding a capability stops being "build an integration" and becomes "describe a verb." A thin wrapper over an API you already use takes minutes. A real integration is still quick, and it starts to pay off immediately.

The hundredth command is not free. It is prepaid.

You paid for it with the first ninety-nine decisions: one output envelope, one error taxonomy, one discovery model, one auth strategy, one audit trail, one binary every caller knows how to invoke. The compounding is not that commands become magically cheap. The compounding is that the surface becomes coherent.
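For a sense of scale, here is a minimal chassis sketched with Python's standard `argparse`. It covers only argument parsing, a `--json` envelope, a `--dry-run` flag, and stable exit codes; the exit-code numbers, the envelope fields, and the placeholder verb body are all my assumptions, not a prescribed layout.

```python
import argparse
import json
from typing import List, Optional

# Hypothetical exit codes: one stable number per error class.
EXIT_CODES = {"ok": 0, "usage": 2, "auth": 10, "not_found": 11, "upstream": 12}


def build_parser() -> argparse.ArgumentParser:
    """Two-level parser: `mycli <domain> <verb> ...`."""
    parser = argparse.ArgumentParser(prog="mycli")
    domains = parser.add_subparsers(dest="domain", required=True)
    secrets = domains.add_parser("secrets").add_subparsers(dest="verb", required=True)
    resolve = secrets.add_parser("resolve", help="Resolve a secret by path")
    resolve.add_argument("path")
    resolve.add_argument("--json", action="store_true", help="emit the JSON envelope")
    resolve.add_argument("--dry-run", action="store_true", help="describe, do not execute")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse, run the verb, print the result, and return a stable exit code."""
    args = build_parser().parse_args(argv)
    # Placeholder body: a real verb would perform the lookup here.
    envelope = {
        "ok": True,
        "operation": f"{args.domain}.{args.verb}",
        "target": args.path,
        "dry_run": args.dry_run,
    }
    if args.json:
        print(json.dumps(envelope))
    else:
        print(f"{envelope['operation']} {envelope['target']}")
    return EXIT_CODES["ok"]
```

Wired up as an entry point, this would end with `sys.exit(main())`. Every later verb reuses the same parser tree, envelope, and exit-code table, which is where the prepayment shows up.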

Put MCP behind it when distribution matters

None of this requires being anti-MCP. In many systems the cleanest architecture is to put the owned operation behind MCP:

agent
  -> MCP server          # remote access, OAuth, consent, governance
      -> owned CLI verb  # executes the operation, returns the contract
          -> operation   # auth, retries, policy, redaction, receipts, taxonomy
              -> APIs, files, secrets, infrastructure

That gives you the protocol benefits without making the wrapper the source of operational truth. MCP handles remote access, OAuth, consent, governance, and hosted distribution. The CLI owns the operation. The skill explains when to use it. The workflow schedules it.
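The wrapper side of that diagram can stay very small. The Python sketch below deliberately uses no MCP SDK calls, since the protocol plumbing is not the point; `handle_tool_call`, `exec_cli`, the tool names, and the `TOOL_NOT_ALLOWED` code are hypothetical, and a real server would register this handler with whatever MCP library it uses.

```python
import json
import subprocess
from typing import Callable, Dict, List

# Hypothetical allow-list: the only verbs the remote wrapper may reach.
ALLOWED_TOOLS: Dict[str, List[str]] = {
    "secrets_resolve": ["mycli", "secrets", "resolve"],
    "deploy_explain": ["mycli", "deploy", "explain"],
}


def exec_cli(argv: List[str]) -> str:
    """Run the owned command without a shell and return its stdout."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


def handle_tool_call(
    name: str,
    args: List[str],
    execute: Callable[[List[str]], str] = exec_cli,
) -> dict:
    """Map a tool call onto an owned CLI verb and pass its envelope through."""
    if name not in ALLOWED_TOOLS:
        # Policy lives in the wrapper; the operation itself is never reached.
        return {
            "ok": False,
            "error": {"code": "TOOL_NOT_ALLOWED", "class": "policy", "retryable": False},
        }
    # The wrapper adds transport and policy only; the contract comes from the CLI.
    return json.loads(execute(ALLOWED_TOOLS[name] + args + ["--json"]))
```

Passing an argument list rather than a shell string keeps the wrapper from becoming the unrestricted shell the next paragraph warns about.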

This split is also safer. Cloudflare notes that a shell introduces a much broader attack surface than a sandboxed isolate, and Code Mode keeps progressive discovery while executing inside that sandbox. Do not hand every hosted agent an unrestricted shell. Do not pretend bash is a permissions system. But also do not bury your operating knowledge in a remote schema wrapper when what you need is a deterministic operation with a contract.

Vendors can't design your daily driver

A vendor can connect GitHub, Jira, Slack, Drive, Salesforce, Kubernetes, AWS, GCP, and Azure, and expose the generic verbs: list, search, create, update, delete, summarize. That is useful. It is not enough.

The leverage in a personal agent lives in the local verbs:

explain the deploy that failed after the autoscaling event
audit the production account, but include IAM drift and recent security-group changes
resolve the secret using the local fallback rule, not the cloud default
summarize only the receipts attached to this client project
prepare the weekly report in the format I actually use

Those are not marketplace integrations. They are operating habits. A vendor cannot know them because they are not product features. They are the accumulated shape of your work.

That is why the personal toolchain matters. Not because command lines are noble, not because protocols are bad, and not because every developer should LARP as a platform team of one.

Because repeated friction is operational knowledge, and operational knowledge should compile.

What owning the operation does not solve

A command surface gives you determinism. It does not give you safety by default.

A CLI is not a permission model. It is a sharper knife.

Owning the operation does not make shell access safe, replace sandboxing, or settle multi-tenant authorization on its own. That still comes from capability scoping, secret isolation, policy gates, dry-runs, and receipts. Left undisciplined, the same surface decays into a private platform tax, or into endpoint sprawl if you port APIs one-for-one. And it is not free to keep: commands need tests, schemas, and versioning like any other code.

Where the logic lives is also a security question. A 2026 study of agentic GitHub Actions workflows, Demystifying and Detecting Agentic Workflow Injection, analyzed 13,392 agentic workflows across 10,792 repositories and confirmed 496 exploitable injections, 343 of them previously unknown, where untrusted issue or pull-request text flows into an agent's prompt and back out as exfiltrated secrets. When the auth path, the redaction rule, and the receipt live as prose inside a workflow node or a skill file, there is no hardened layer to hold a guardrail. An owned operation with secret isolation and an audit receipt at least gives you one.

The operation is the asset

Strip away the protocol wars and the tooling fashion, and one thing is left standing: the operation. Not the model, not the connector, not the orchestration graph. The durable unit of work, with its auth, its retries, its failure taxonomy, its receipt, and the local judgment no vendor can see.

Everything else is delivery. MCP carries capabilities to where they are needed, workflows schedule them, skills tell the agent when to reach for them. All of it rests on one question: where does the operation live? Leave it in a prompt, a workflow node, or a wrapper, and the agent rediscovers it every morning. Put it in something you own that runs the same way for every caller, and it compounds: the world the agent can act on becomes your world, encoded in tools you made from the work itself.

The strongest personal agent will not be the one with the biggest catalog of connectors. It will be the one with the clearest operations layer.

Own that layer.

Originally published at ryanmerlin.com.

The AI Productivity Dip Is Longer, Deeper, and Diverging

Ryan Merlin — Thu, 28 May 2026 18:24:42 +0000

One variable in DORA's AI ROI calculator changes the story from a first-year win to a first-year loss.

Not model cost. Not salary. Not adoption rate.

Duration.

DORA's sample model for AI-assisted software development assumes a three-month productivity dip. On that assumption, a 500-person engineering organization produces a first-year benefit of roughly $3.3 million, a 39% ROI, and a payback period under a year. When Faros AI stress-tested the same calculator with a twelve-month dip instead of a three-month dip, the result inverted: the same organization went from a $3.3 million first-year gain to a $6.6 million loss. A $9.9 million swing from one input.

That input is not a detail. It is the model.
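A toy model shows why duration dominates. This Python sketch is not DORA's calculator and its inputs are made-up units chosen only to show the shape: the organization loses a fixed amount in each dip month and gains a fixed amount in each month after. The last line checks the swing using the dollar figures quoted above.

```python
def first_year_net(dip_months: int, monthly_loss: float, monthly_gain: float) -> float:
    """Net over 12 months: lose `monthly_loss` per dip month, earn `monthly_gain` after."""
    dip = min(dip_months, 12)  # anything past month 12 falls outside the first year
    return (12 - dip) * monthly_gain - dip * monthly_loss


# Hypothetical inputs: one unit lost per dip month, one unit gained per month after.
short_dip = first_year_net(3, monthly_loss=1.0, monthly_gain=1.0)
long_dip = first_year_net(12, monthly_loss=1.0, monthly_gain=1.0)

print(short_dip)  # 6.0
print(long_dip)   # -12.0

# Swing between the two reported outcomes, in millions of dollars.
print(round(3.3 - (-6.6), 1))  # 9.9
```

Nothing about the per-month loss or gain changed between the two runs. Only the number of months spent in the trough did.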

DORA is Google Cloud's DevOps Research and Assessment program, the research group behind the software delivery metrics many engineering organizations use to benchmark performance. The original DORA "four keys" have now evolved into a five-metric model: change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate. DORA's own guidance says these metrics measure a team's ability to deliver software safely, quickly, and efficiently, and that they predict better organizational performance and team well-being.

The 2026 DORA report on AI-assisted software development is not a hype memo. It is a serious attempt to answer a hard management question: how should engineering leaders reason about the return on AI when the first-order effects are tangled with learning costs, verification costs, platform maturity, quality risk, and organizational redesign? The DORA ROI report proposes an ROI framework and calculator that map AI adoption through capabilities, DORA delivery metrics, and ultimately financial outcomes. It also names the pattern many practitioners already feel: AI adoption follows a J-curve. Productivity drops before it rises.

DORA's explanation for the dip is right. Teams spend time learning new workflows. Developers must review AI-generated code because trustworthiness is not free. Downstream systems (review, test, security, CI/CD, and incident response) must absorb more output. The DORA ROI report calls this "the tuition cost of transformation."

The framing is useful.

The default timeline is the dangerous part.

The question is not whether AI creates a productivity dip. The question is whether the dip lasts three months, twelve months, or long enough that leadership cuts funding, reduces headcount, or declares failure before the organization has made the complementary investments required for the upside.

DORA frames this around software development because that is the domain they study. But the same dynamic plays out everywhere AI enters production work: business process automation, analytics pipelines, customer operations, content generation, financial modeling. The mechanism is the same. AI increases the volume of output before the organization has upgraded the verification, integration, and governance systems that must absorb it. Code is the most instrumented version of this story. It is not the only version.

That is where the J-curve becomes a fork.

DORA is right about the mechanism

DORA's strongest idea is not the calculator. It is the amplifier thesis.

In its 2025 State of AI-Assisted Software Development report, DORA argued that AI amplifies existing organizational conditions. Strong engineering systems get stronger. Weak systems get faster at producing dysfunction. AI does not replace delivery maturity; it magnifies the presence or absence of it.

That is the right lens.

An organization with strong automated tests, fast CI, disciplined review culture, observable production systems, small-batch delivery, clean internal documentation, and a mature developer platform can absorb AI-generated output. It has the verification surface area to handle increased volume.

An organization without those foundations gets something else: more code, larger pull requests, more review pressure, more rework, more hidden security exposure, and more incidents that appear downstream from the dashboard celebrating "AI adoption."

DORA's calculator includes this idea, but the sample assumptions understate how asymmetric the results become. The calculator's default case shows positive ROI. Faros AI's stress test shows that changing the dip from three months to twelve months flips the result negative. Faros's telemetry-informed scenario, combining longer adaptation time and quality degradation, also produces a negative first-year ROI.

That does not prove Faros is universally right. Faros is a vendor analyzing telemetry from its own customer base, which is not the same thing as a population-representative causal study.

But it proves the management point: the ROI is highly sensitive to the duration and depth of the trough.

If leadership treats DORA's default as an expectation rather than a scenario, they will under-budget the hard part.

The evidence does not tell one story

The empirical record on AI coding productivity is not contradictory because the researchers are incompetent. It is contradictory because they are measuring different work under different conditions.

DORA's own 2024 data showed the tension early. A 25% increase in AI adoption was associated with higher perceived documentation quality, code quality, and code review speed. It was also associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. In other words: developers felt some things getting better while system-level delivery outcomes worsened. See the 2024 DORA Report.

METR's controlled experiment made that perception gap explicit. Sixteen experienced developers completed 246 tasks in their own open-source repositories, randomly assigned to use or not use AI tools. With AI tools, they took 19% longer. Before the study, they expected AI to save 24% of their time. Afterward, they still believed AI had sped them up by about 20%.

That is the most important finding in the METR paper: not merely that AI slowed these developers down, but that the developers misread their own productivity.

The caveat matters. METR's sample was small, the developers were experienced, the work was complex, and the tasks were in familiar real-world codebases. METR has also since published a follow-up noting that a later experiment produced an unreliable estimate because of study design and selection issues. The slowdown result should not be treated as a universal law.

A larger field experiment by Cui et al., run across Microsoft, Accenture, and an anonymous Fortune 100 company, found a very different result: a 26% increase in completed tasks among 4,867 developers using an AI coding assistant. The effects were stronger for newer and more junior employees.

Both findings can be true.

AI helps more when the task is bounded, the context is legible, the codebase is easier to navigate, and the developer has less accumulated domain-specific advantage. AI helps less, and can hurt, when the work requires deep local context, architectural judgment, production intuition, and careful integration into a complex codebase.

That distinction matters because most enterprise engineering is not greenfield demo work. It is legacy systems, migrations, dependencies, security constraints, test gaps, half-documented business rules, and code nobody wants to touch.

AI is very good at producing plausible code.

The enterprise problem is verified, maintainable, production-safe change.

The telemetry is flashing yellow

Faros AI's telemetry captures the shape of the tradeoff. In its analysis of 22,000 developers across 4,000 teams, output rose sharply: task throughput per developer increased 33.7%, epics per developer increased 66.2%, and tasks associated with pull requests per team increased 210%. But the quality and stability signals moved the other way: incidents per pull request increased 242.7%, monthly incidents increased 57.9%, and bugs per developer increased 54%.

Again, this is not causal proof that AI created every downstream issue. It is vendor telemetry, not an RCT.

But it is directionally consistent with what engineers are reporting elsewhere: AI increases output before organizations have upgraded the verification system that must absorb that output.

The vendor telemetry that follows carries the same caveat as Faros: these companies sell code quality, security, and review tools. They have commercial incentives to surface problems in the code their customers produce. That does not make the data wrong, but it means the findings should be read as signal, not proof.

Sonar's 2026 developer survey found that 96% of developers do not fully trust AI-generated code, yet only 48% say they always verify AI-generated code before committing it. Sonar also found that 53% of developers agree AI often produces code that looks correct but is not reliable.

That is the verification tax in compressed form: developers know the output is untrustworthy, but delivery pressure pushes them toward partial verification.

Security evidence points in the same direction. Veracode's GenAI Code Security Report tested more than 100 large language models across common programming languages and found that 45% of generated code samples failed security tests, including OWASP Top 10 classes of weakness. Larger and newer models did not consistently produce more secure code.

CodeRabbit's analysis of 470 open-source pull requests found that AI-coauthored PRs contained about 1.7 times as many issues per PR as human-authored PRs, with security vulnerabilities up to 2.74 times higher.

Apiiro reported that AI-assisted developers were writing three to four times more code and that AI-generated code was producing a tenfold increase in security findings, reaching 10,000 new findings per month by June 2025 across its observed repositories. See Apiiro's velocity and vulnerability analysis.

The pattern is not "AI code is bad."

The pattern is "AI changes the denominator."

When code volume rises faster than review capacity, test coverage, security scanning, architectural scrutiny, and production feedback loops, the system becomes less stable even if individual developers feel faster.

The verification tax is not temporary

A common mistake is treating verification as an early adoption friction that will disappear once developers get used to the tools.

Some of it will. Prompting improves. Tooling improves. Developers learn where AI is useful and where it is dangerous.

But the core verification tax is structural.

AI-generated code has no human intent behind it in the way a teammate's code does. It may be syntactically clean and idiomatic while being semantically wrong. It can pass local tests while violating an invariant that lives in a different service, a customer workflow, or an undocumented operational constraint.

That makes review harder, not easier.

The open-source world is already reacting. Curl ended its HackerOne bug bounty program after a flood of low-quality, AI-generated vulnerability reports overwhelmed maintainers. NetBSD now treats LLM-generated code as "tainted" unless approved by core developers. Gentoo banned AI-generated contributions, citing quality, copyright, and ethical concerns. The Linux kernel permits AI-assisted work, but places full responsibility on the human submitter and requires proper disclosure and review discipline.

Those are not Luddite reactions. They are maintenance systems defending scarce review capacity.

Enterprise engineering has the same problem, just inside the firewall.

If AI increases generation capacity by 2x but verification capacity by only 1.1x, the bottleneck moves. The organization does not become twice as productive. It becomes review-bound, test-bound, security-bound, and incident-bound.
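The arithmetic is easy to sketch. This is a toy model, not a measurement: it assumes verified throughput is capped by the slower of the two stages, and the baseline of 100 changes per week is an arbitrary illustrative number.

```python
def verified_throughput(generation: float, verification: float) -> float:
    """Changes that ship verified each week: capped by the slower stage."""
    return min(generation, verification)


def weekly_backlog_growth(generation: float, verification: float) -> float:
    """Unverified changes piling up each week (zero if verification keeps up)."""
    return max(0.0, generation - verification)


# Illustrative baseline (assumption): 100 changes/week generated and verified.
baseline = 100.0
generation = baseline * 2.0    # AI doubles generation capacity
verification = baseline * 1.1  # verification capacity grows only 10%

shipped = verified_throughput(generation, verification)
backlog = weekly_backlog_growth(generation, verification)
speedup = shipped / baseline

print(f"verified changes per week: {shipped:.0f}")
print(f"effective speedup: {speedup:.2f}x")
print(f"unverified backlog growth per week: {backlog:.0f}")
```

Under these assumptions the organization gets roughly a 1.1x gain in verified delivery, not 2x, while about 90 unverified changes accumulate every week. Either the backlog grows or scrutiny per change drops.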

That is why code volume is a dangerous success metric.

This has happened before

The productivity dip is not unique to AI.

Paul David's classic 1990 paper, "The Dynamo and the Computer," explained why electrification took decades to show up in factory productivity. Early factories overlaid electric motors onto steam-era layouts. They replaced the power source but kept the old organization of work. The payoff came later, when factories were redesigned around electricity: single-story layouts, unit drives, and production flows organized around materials rather than shafts and belts.

Brynjolfsson, Rock, and Syverson formalized the same mechanism as the "Productivity J-Curve." General-purpose technologies such as AI require complementary investments: process redesign, new business models, human capital, organizational restructuring, and other intangible assets that are poorly measured during the investment phase. Productivity can look flat or negative while those investments are being made, then overshoot once the new system starts compounding.

That is the economic mechanism behind DORA's J-curve.

But software leaders need to add one more observation: the J-curve does not resolve uniformly.

Some organizations invest through the trough and emerge with higher throughput, better developer experience, stronger platforms, and faster learning loops.

Others treat the trough as evidence that AI failed, or worse, use AI output as justification to cut the very people needed to verify and integrate it.

That is how the J becomes a K.

The upper branch is not just "more AI"

The companies moving up the curve are not merely buying more licenses. They are redesigning the delivery system around AI.

Google is the clearest high-scale example. In Q3 2024, Sundar Pichai said more than 25% of new code at Google was generated by AI and then reviewed by engineers. By Cloud Next 2026, Google said 75% of new code was AI-generated and approved by engineers. That is not evidence of ROI by itself, but it is evidence of a company pushing AI into the engineering workflow while preserving human review as a control point.

AWS offers a better measurement lesson. AWS reported a 15.9% year-over-year reduction in software development cost using its Cost to Serve Software framework. The important point is not "Amazon Q saved 15.9%." That would be too clean. The important point is that AWS measured the whole software delivery system: deployments per builder, human interventions, incidents per deployment, and cost-to-serve. AI was part of a broader developer-experience and operational-efficiency program, not a standalone magic line item.

Duolingo shows both the upside and the organizational risk. In 2025, the company launched 148 AI-created courses, roughly doubling its course catalog. That is real leverage. But Duolingo also faced backlash over its "AI-first" posture, and CEO Luis von Ahn later said the company would reverse a policy tying AI usage to performance reviews after employees pushed back on using AI for its own sake.

Shopify is a culture case, not an outcome case. Tobi Lütke's 2025 memo made reflexive AI usage a baseline expectation and required teams asking for more headcount or resources to show why AI could not help first. That is a strong operating philosophy. It is not yet a measured productivity result.

The upper branch is not "AI everywhere."

The upper branch is disciplined adoption: strong platforms, explicit verification, clear use-case boundaries, measurement beyond code volume, and leadership willing to fund the adaptation period.

The lower branch is not "no AI"

The lower branch can have plenty of AI.

Klarna is the canonical warning. In February 2024, Klarna announced that its AI assistant handled 2.3 million conversations, about two-thirds of customer-service chats, doing work equivalent to 700 full-time agents. It also said the assistant matched human customer-satisfaction scores and was expected to drive $40 million in profit improvement.

Then the narrative changed. By 2025, Klarna was bringing humans back into customer service, with CEO Sebastian Siemiatkowski acknowledging that the company had over-indexed on cost and needed to course-correct on quality.

The lesson is not that Klarna's AI did nothing. It clearly did something. The lesson is that volume metrics masked quality degradation in the interactions where quality mattered most.

Freshworks is a different warning. In May 2026, Freshworks announced it would cut roughly 500 jobs, about 11% of its workforce, while CEO Dennis Woodside said more than half of the company's code was written by AI and that automation had reduced rote work. The company estimated restructuring charges of about $8 million.

That may prove financially rational. It may also prove to be the exact failure mode DORA warns against: reducing human capacity during the period when AI-generated output increases the need for verification, architectural judgment, and production accountability.

The lower branch is not low adoption.

It is unmanaged adoption.

The macro data shows concentration, not universality

The broader enterprise data supports the divergence story.

BCG's 2025 research found that leading "future-built" companies are pulling away from laggards: 1.7 times the revenue growth, 3.6 times the three-year total shareholder return, and 1.6 times the EBIT margin. BCG also found that agentic AI is accelerating the value gap, with agents accounting for 17% of total AI value in 2025 and projected to reach 29% by 2028.

McKinsey's 2025 State of AI survey found that 88% of organizations use AI in at least one business function, but only about one-third have begun to scale AI programs. Only about 6% qualify as AI high performers, defined as organizations attributing 5% or more EBIT impact to AI and reporting significant value. McKinsey also found that high performers are more likely to redesign workflows, define when model outputs require human validation, and have senior leaders actively engaged in adoption.

OECD research on emerging divides in the transition to AI similarly finds that AI adoption is accelerating unevenly across firms, sectors, and regions, reinforcing existing divides. AI champions are concentrated among larger firms, innovative regions, and knowledge-intensive services, while skills shortages, cost, data protection concerns, and technology lock-in slow diffusion elsewhere.

This is the K-curve at enterprise scale.

Adoption is becoming common.

Impact is not.

The catch-up window is open, but narrowing

The strongest objection to the K-shaped thesis is the cloud precedent.

Cloud adoption also looked divergent at first. Late movers eventually learned from early movers, hired experienced consultants, adopted proven patterns, and caught up faster than expected. The playbook became legible: DevOps, CI/CD, infrastructure-as-code, SRE, platform teams, FinOps.

AI adoption may follow the same path.

But there is a difference.

Cloud was primarily an infrastructure and operating-model migration. Difficult, but codifiable.

AI-assisted software development requires a deeper form of organizational learning: trust calibration, review heuristics, task decomposition, context engineering, internal knowledge access, risk classification, human-in-the-loop design, and new quality gates. Those capabilities can be taught, but they cannot simply be installed.

The longer a high-performing organization uses AI productively, the more it accumulates process knowledge: which tasks are safe, which are dangerous, how to review, how to instrument, how to route work, how to train juniors, how to evaluate agents, and how to distinguish code generation from software delivery.

That institutional learning compounds.

The catch-up window is not closed. But catching up will not happen passively.

Agents create a second J-curve

Most organizations have not finished adapting to copilots, and agents are already creating the next transition.

McKinsey's 2025 survey found that 23% of organizations are scaling an agentic AI system somewhere in the enterprise, while another 39% are experimenting. In any individual business function, no more than 10% are scaling agents.

Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027 because of escalating costs, unclear business value, or inadequate risk controls.

S&P Global Market Intelligence found that the share of companies abandoning the majority of AI initiatives before production rose from 17% to 42% year over year, with organizations reporting that 46% of projects are scrapped between proof of concept and broad adoption.

This is not surprising. Agents require a different verification model than copilots.

A copilot suggests. An agent acts.

That means the complementary investments change: permissions, audit logs, sandboxing, tool access, approval workflows, production guardrails, identity boundaries, rollback paths, and human escalation. The verification tax does not disappear. It moves from reviewing generated code to supervising generated action.

Organizations that built strong verification discipline during the copilot phase will move faster through the agent phase.

Organizations that skipped the discipline will start the second curve with unpaid debt from the first.

The talent pipeline is the long fuse

The hidden cost in most AI ROI calculators is not licensing. It is apprenticeship.

If AI handles the work junior engineers used to do, the short-term spreadsheet looks better. Fewer entry-level hires. Fewer simple tickets. More senior engineers supervising generated output.

But software engineering judgment is not created by watching AI write code. It is created by making decisions, breaking things, debugging them, getting reviewed, discovering why the obvious solution was wrong, and slowly building taste.

The labor-market evidence is early and contested, but it is concerning. Stanford Digital Economy Lab's "Canaries in the Coal Mine?" study uses high-frequency administrative payroll data and finds that early-career workers aged 22 to 25 in the most AI-exposed occupations have experienced a significant relative employment decline, while more experienced workers in the same occupations have remained stable or continued to grow. The Stanford publication reports a 16% relative decline in the latest version; SIEPR's summary of an earlier version reports 13%.

There is counter-evidence too. Strada's 2026 employer survey found that many employers expect AI to reshape entry-level work rather than eliminate it, increasing analytical and judgment-based responsibilities while reducing routine tasks. In tech, the bar is rising: more judgment, fewer rote assignments.

That does not eliminate the pipeline risk. It clarifies it.

The risk is not simply "fewer junior developers."

The risk is fewer safe environments where junior developers can acquire the judgment senior developers need.

An organization that replaces junior work with AI output may save money now and discover in three to five years that it has fewer engineers capable of evaluating the output.

What engineering leaders should measure instead

Do not measure "percentage of code written by AI" as a success metric.

That is a volume metric. It is not a delivery metric. It can rise while quality, security, and stability fall.

Measure these instead.

Change failure rate by code origin. Compare AI-assisted changes with human-authored changes. If AI-assisted changes fail materially more often, the bottleneck is verification, not adoption.

Incident rate per pull request by code origin. Faros AI found incidents per PR rising sharply after AI adoption. Your number matters more than Faros's number. Instrument it.

Review time and review depth by code origin. If AI-assisted PRs wait longer, require more review cycles, or get merged with less scrutiny, you have a control problem.

Security findings by code origin. Static analysis, dependency scanning, secrets detection, and application security testing should be broken out by AI-assisted versus human-authored change.

Rework and churn. Track how often code is modified, reverted, or deleted within 30 days of merge. GitClear's work on AI-era code quality points toward higher churn, more duplicated code, and less refactoring-associated activity as AI coding assistants spread.

Accepted suggestion rate paired with downstream quality. Acceptance rate alone is a usage metric. It becomes useful only when paired with review outcomes, rework, incidents, and security findings.

Developer trust calibration. Teams should be able to articulate where AI is safe, where it is useful but risky, and where it is prohibited. "We use AI for X but not Y because..." is an artifact of maturity.
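As a minimal sketch of what "by code origin" could look like in practice, the snippet below groups merged changes by an origin tag and computes change failure rate and incidents per pull request for each group. The `Change` record, its field names, the origin labels, and the sample rows are all hypothetical; real data would come from your own PR, deploy, and incident systems.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Change:
    origin: str     # hypothetical tags: "human-authored", "ai-assisted"
    failed: bool    # change needed a rollback, hotfix, or caused a failed deploy
    incidents: int  # incidents attributed to this change


def metrics_by_origin(changes):
    """Return per-origin counts, change failure rate, and incidents per PR."""
    groups = defaultdict(list)
    for change in changes:
        groups[change.origin].append(change)

    report = {}
    for origin, items in groups.items():
        n = len(items)
        report[origin] = {
            "changes": n,
            "change_failure_rate": sum(c.failed for c in items) / n,
            "incidents_per_pr": sum(c.incidents for c in items) / n,
        }
    return report


# Made-up sample rows, for illustration only.
sample = [
    Change("human-authored", False, 0),
    Change("human-authored", True, 1),
    Change("human-authored", False, 0),
    Change("human-authored", False, 0),
    Change("ai-assisted", True, 2),
    Change("ai-assisted", False, 0),
    Change("ai-assisted", True, 1),
    Change("ai-assisted", False, 0),
]

report = metrics_by_origin(sample)
for origin, stats in sorted(report.items()):
    print(origin, stats)
```

The same grouping extends to review time, security findings, and 30-day rework once those fields are available. The point of the sketch is the split by origin, not the specific numbers.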

What to tell the board

Do not present DORA's 39% sample ROI as the forecast.

Present it as the optimistic case.

Then present the sensitivity case.

The board-level message should be this:

AI-assisted development has real upside, but the return depends on funding the adaptation period. The productivity dip is not wasted time; it is the investment phase in new verification, delivery, and operating capabilities. Organizations that underinvest during the dip are likely to see more code, more rework, and more incidents rather than durable productivity gains.

Based on the evidence above, my working estimate for that adaptation period is twelve to eighteen months, not three.

Months one through six are the calibration phase. Developers learn where AI helps, where it lies, and how to review its output. Productivity may look flat or negative if you measure verified delivery rather than code volume.

Months seven through twelve are the pipeline phase. Review, testing, security, CI/CD, observability, and incident response adapt to increased generation capacity. Leading indicators should improve even if aggregate ROI is still mixed.

Months thirteen through eighteen are the crossover phase. If the leading indicators are improving, the organization should begin to see durable productivity gains: higher deployment frequency without higher change failure, lower lead time without more incidents, and better developer experience without quality erosion.

That is the investment thesis.

Not "AI writes code, therefore we need fewer engineers."

"AI increases generation capacity, therefore we must upgrade the delivery system."

The first 90 days

Instrument before expanding.

Add code-origin metadata to pull requests. Track whether changes are human-authored, AI-assisted, or agent-generated. This does not need to be perfect on day one. It needs to be good enough to correlate origin with review time, rework, security findings, and incidents.
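One "good enough" starting point is a pull-request label convention. The sketch below assumes a hypothetical `origin:` label prefix and three hypothetical label values. It is not any platform's built-in feature, only a convention a team could adopt. When several origin labels are present it picks the most automated one, and it returns "unknown" rather than silently counting unlabeled work as human-authored.

```python
# Hypothetical convention: PRs carry a label such as "origin:ai-assisted".
LABEL_PREFIX = "origin:"

# Most automated origin first, so mixed PRs are attributed conservatively.
ORIGIN_PRECEDENCE = ["agent-generated", "ai-assisted", "human-authored"]


def classify_origin(labels):
    """Map a PR's labels to one origin, or "unknown" if none is declared."""
    declared = {
        label[len(LABEL_PREFIX):].strip().lower()
        for label in labels
        if label.lower().startswith(LABEL_PREFIX)
    }
    for origin in ORIGIN_PRECEDENCE:
        if origin in declared:
            return origin
    return "unknown"


print(classify_origin(["bug", "origin:ai-assisted"]))
print(classify_origin(["origin:human-authored", "Origin:Agent-Generated"]))
print(classify_origin(["needs-review"]))
```

Keeping "unknown" as its own bucket makes labeling gaps visible, which matters more early on than perfect attribution.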

Create an explicit verification owner.

Not a committee. A person. Their job is to watch the metrics, identify where AI-assisted work creates downstream risk, and feed that learning back into team practice.

Start with task boundaries.

AI is safer for tests, documentation, scaffolding, migrations with strong patterns, log analysis, small refactors, and local transformations. It is riskier for authorization logic, cryptography, concurrency, financial calculations, distributed-systems behavior, data migrations, and architecture. The list will differ by organization. The point is to make the boundary explicit.
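Making the boundary explicit can be as simple as a checked-in table that maps areas of the codebase to a tier. The sketch below is illustrative only: the path patterns and tier names are hypothetical, and each organization would substitute its own list. Anything unmatched falls into an "unclassified" tier so that a person has to decide.

```python
from fnmatch import fnmatch

# Hypothetical patterns and tiers; the first matching pattern wins.
BOUNDARIES = [
    ("src/auth/*", "riskier"),
    ("src/crypto/*", "riskier"),
    ("src/billing/*", "riskier"),
    ("data_migrations/*", "riskier"),
    ("tests/*", "safer"),
    ("docs/*", "safer"),
]
DEFAULT_TIER = "unclassified"
TIER_ORDER = {"safer": 0, "unclassified": 1, "riskier": 2}


def tier_for(path):
    """Return the tier for one changed file path."""
    for pattern, tier in BOUNDARIES:
        if fnmatch(path, pattern):
            return tier
    return DEFAULT_TIER


def strictest_tier(paths):
    """A change is treated as risky as its riskiest file."""
    return max(
        (tier_for(p) for p in paths),
        key=TIER_ORDER.__getitem__,
        default=DEFAULT_TIER,
    )


print(strictest_tier(["docs/setup.md", "tests/test_api.py"]))
print(strictest_tier(["docs/setup.md", "src/auth/login.py"]))
```

A table like this can feed review routing or CI checks, but its first job is simply to make the team's current judgment visible and arguable.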

Measure verified output, not generated output.

The unit of value is not a suggestion, a token, a line of code, or a generated pull request. The unit of value is a verified change that improves the system without increasing risk.

Protect the junior rung.

Juniors should not become passive reviewers of AI output. They need to write code, predict behavior, run experiments, debug failures, and compare their solution to the model's. The training loop matters more than the short-term ticket velocity.

Run one calibrated pilot.

Pick one team with enough delivery maturity to produce meaningful data. Give them tools, instrumentation, and freedom from code-volume targets. After 90 days, compare AI-assisted and non-AI-assisted work across delivery, quality, security, and developer-experience metrics.

Then scale the operating model, not just the licenses.

Closing: the second curve belongs to the disciplined

DORA's core insight is right: AI amplifies the system it enters.

That is why the productivity dip matters. The dip is not just a temporary inconvenience before the inevitable payoff. It is the period when organizations reveal whether they are capable of making the complementary investments AI requires.

Some will use AI to build stronger delivery systems.

Some will use it to ship more code into weak systems.

Some will cut people before they have upgraded verification.

Some will mistake adoption for transformation.

The tuition is real. But tuition is not the same as graduation.

The organizations that finish the course will not be the ones that generated the most code.

They will be the ones that learned how to verify, integrate, operate, and improve at AI speed without losing the human judgment that makes the output worth anything.