The AI revolution everyone's talking about has a governance gap nobody's filling. And the window to fix it is closing fast.
Last week, Matt Shumer published a post called "Something Big Is Happening" that went everywhere. If you haven't read it: he describes walking away from his computer, coming back four hours later, and finding the work done. Not a draft. The finished thing. GPT-5.3 Codex and Claude Opus 4.6 crossed a threshold where AI doesn't just assist — it executes autonomously with something that feels like judgment.
He's right. I've seen it too. I've spent nights and weekends for the past year building infrastructure for this exact moment. But there's something Shumer's post doesn't address, and it's the thing that keeps me up past midnight after the kids go to bed.
The agents are getting more capable. The infrastructure to govern them isn't.
Shumer describes AI that opens apps, clicks buttons, tests features, iterates on its own work, and decides when it's ready. That's an autonomous agent making consequential decisions with real-world side effects. And his reaction is essentially: isn't this amazing, now go learn to use it.
I want to talk about the other side. The part where everything goes wrong and nobody can explain what happened.
The 2am Question Nobody Can Answer
Here's the scenario that plays out today, right now, in companies running agents in production.
An agent with production credentials does something unexpected. Maybe it accessed customer data it shouldn't have. Maybe it made an API call that triggered a financial transaction. Maybe it exported records to an endpoint nobody authorized.
Legal gets on the call. Security gets on the call. The CTO is asking: what happened?
And the team discovers they can't answer. They have logs showing timestamps. They don't have evidence showing what data the agent referenced, what tool calls it made with what arguments, what policy authorized the action, or whether the same context would produce the same behavior again.
This isn't hypothetical. It's happening at companies today. And as agents go from executing five-hour tasks to multi-day projects to multi-week autonomous operations, the blast radius of this gap grows with every doubling of capability.
We solved this for databases fifty years ago. ACID properties gave us a contract: the transaction either completed correctly or it didn't, and you can prove which. We solved it for code with version control and CI/CD. We solved it for containers with Kubernetes admission controllers.
For AI agents? The "governance mechanism" at most companies is a sentence in the system prompt: "Please don't do anything bad."
The Scale Problem Shumer Described But Didn't Name
Let me connect the dots between Shumer's post and what's actually at stake.
He describes GPT-5.3 Codex writing tens of thousands of lines of code, then testing its own work. Anthropic's latest benchmarks show agents completing tasks that take human experts five hours — and that number doubles roughly every four to seven months. Dario Amodei says models "substantially smarter than almost all humans at almost all tasks" are on track for 2026 or 2027.
Now consider the operational reality: Anthropic just demonstrated 16 agents coding autonomously for two weeks. Fifty-agent swarms. AI managing human teams — closing issues, assigning work, routing based on organizational structure. Autonomous security research that found 500 zero-day vulnerabilities by reasoning through codebases.
Every one of those capabilities is an agent taking real actions with real consequences. And every one of them creates a governance question that current infrastructure cannot answer:
When 16 agents have been coding for two weeks, and something breaks on day 11 — how do you reconstruct what happened on days 1 through 10?
When an AI manager closes the wrong issue or assigns work to the wrong team — who audits the AI's organizational decisions? What policy governed which issues it could close versus which required human judgment?
When an agent with access to your git history, debuggers, and fuzzers finds a zero-day — what prevents it from exfiltrating that vulnerability to an unauthorized endpoint instead of your internal security channel?
The answer, at most companies, is: nothing deterministic. There's a system prompt saying "don't do that." Maybe a guardrail scanner checking for bad patterns. Both can be overridden by prompt injection, the very attack vector that gets more dangerous as agents grow more capable and ingest more untrusted data.
The Architectural Gap: Cameras vs. Gates
The AI security industry is growing fast. Funding is pouring in. And almost everyone is building the same thing: better cameras.
Observability platforms that watch what agents did and report on it. Guardrail scanners that check inputs and outputs for suspicious patterns. Dashboards that show you metrics about agent behavior. All useful. All fundamentally insufficient at the moment that matters most: the moment an agent is about to execute a tool call that changes state.
At that moment — when the agent is about to move money, delete data, export records, modify a database, or send an email — you don't need a camera. You need a gate.
A camera tells you what already happened. A gate determines what's allowed to happen.
The distinction is architectural, not semantic. A guardrail that catches 95% of prompt injections is genuinely valuable. But at the action boundary — where the agent converts a decision into an API call with real consequences — 95% isn't a guarantee. It's a probability. And probability is not proof.
What the action boundary needs is structural enforcement: the agent's intent (tool name, arguments, declared targets) evaluated against policy deterministically. Not natural language instructions in a prompt. Structured fields evaluated by a policy engine. Allow, deny, or require human approval. Verdict signed, recorded, traceable.
What Happens Without This Infrastructure
Shumer ends his post with practical advice: use AI, experiment, get ahead of the curve. He's right about that. But there's a shadow side to "everyone start using agents immediately" that nobody's talking about enough.
The democratization of capability without the infrastructure of accountability is how you get a disaster at scale.
When Shumer says a non-technical reporter built a Monday.com replacement in an hour for $15 — that's extraordinary. It's also an agent executing unknown tool calls, wiring unknown integrations, and accessing unknown APIs on behalf of someone who has no way to audit what actually happened under the hood.
When companies follow his advice and start automating everything with agents — legal work, financial analysis, medical analysis, customer service — every one of those deployments creates an accountability gap. The agent is doing the work. But when something goes wrong, there's no verifiable record of what the agent referenced, what it decided, what it executed, or whether it was authorized to do so.
In regulated industries — healthcare, finance, legal — this isn't just an engineering inconvenience. It's a compliance catastrophe. SOX, GDPR, HIPAA — none of them accept "probably correct" as a compliance posture. And right now, "probably correct" is the best most companies can offer for agent behavior.
The Bridge We Need to Build — Now
I believe we're in a window. A brief one. The capabilities are racing ahead. The governance infrastructure is barely started. And the gap between the two is the thing that will determine whether the AI agent revolution creates value or creates chaos.
This is why I've been building at 11pm after my kids go to bed. Twelve to fifteen hours a week. Because the infrastructure gap is obvious, it's getting wider every month, and nobody with resources seems to be building the right layer.
The right layer isn't more monitoring. It isn't better prompts. It's a control plane — infrastructure that sits in the execution path of agent tool calls and provides three guarantees:
1. Policy enforcement at the action boundary. Before an agent executes a tool call, the intent is evaluated against policy deterministically. Not "the system prompt says don't do that." Structured fields, evaluated by a policy engine, producing a signed verdict. Allow, deny, or require approval. If policy can't be evaluated, execution is blocked. Fail-closed, not fail-open.
2. Verifiable evidence of what happened. Every agent run produces a tamper-evident, cryptographically signed bundle — not logs, not metrics, evidence. Content, decisions, authorization, cryptographic verification. Any engineer can verify it offline. Attach it to a ticket. Reproduce the behavior without re-executing side effects.
3. Incidents become regressions, not recurring nightmares. When an agent fails, that failure becomes a CI fixture. A deterministic test that prevents the same class of failure from ever shipping again. The same discipline we demand for code — "you always add a regression test" — applied to agent behavior.
I've been building this as an open-source CLI called Gait. It's offline-first, deterministic, and written in Go. Not a platform, not a dashboard, not another SaaS. A tool that produces artifacts, not subscriptions. Because if the infrastructure that proves what agents did is itself a black box that requires trusting a vendor — you haven't solved the trust problem. You've just moved it.
The Clock Is Ticking
Shumer quotes METR data showing AI task completion doubling every four to seven months. If that trend holds — and nothing in the data suggests it's flattening — we're looking at agents working independently for days within a year. Weeks within two. Month-long projects within three.
Every doubling of agent capability is a doubling of the governance gap if the infrastructure doesn't keep pace.
The question isn't whether we need this infrastructure. The question is whether it exists before the first major incident that proves we needed it. The 2am call is coming. For some companies, it's already come. The only question is whether the engineer on call has artifacts and enforcement, or whether they're staring at log timestamps trying to reconstruct what an autonomous system did overnight with production credentials while legal waits for answers.
I don't want to be writing the post-mortem on why we knew this was coming and didn't build the infrastructure in time. I'd rather build it.
If you're running agents in production with real tool calls and want to help find where this breaks, Gait is open source: [repo link]. Not looking for praise — looking for the honest "this doesn't work because..." feedback from engineers who feel this gap daily.
The agents are getting smarter. The infrastructure to govern them needs to exist before the incident that proves it was needed.
The window is open. It won't stay open long.
David Ahmann is building Gait, an open-source Agent Control Plane. He writes about agent governance, systems thinking, and building infrastructure for the agentic economy.
