DEV Community

AI doesn't fail because the model is bad. It fails because there's nothing underneath it

Norbert Rosenwinkel on May 31, 2026

There's a question every system runs into the moment it goes to production and starts doing real things: what exactly happened, in what order, agai...

Syed Ahmer Shah • May 31

This is a fantastic breakdown and a perspective that desperately needs to be repeated in the industry right now.

We are so caught up in the hype of model benchmarks and parameter counts that people forget an AI model is just an engine: it doesn't matter how powerful it is if you haven't built the chassis, transmission, and wheels around it. True production-ready AI isn't about a raw prompt; it is about the data pipelines, the deterministic guardrails, and the boring, unsexy software engineering that wraps around it. The "wrapper" isn't a bad thing; it's literally the most important part of the architecture if you want reliability. Spot on with this.

Norbert Rosenwinkel • Jun 1

The engine/chassis image is perfect — and the rehabilitation of "the wrapper" is the part worth saying out loud. People use that word like an insult, but a 1000-horsepower engine with no chassis, no brakes, and no steering isn't a car, it's a way to hurt yourself faster. The wrapper is where reliability actually lives. The model is the easy 10%; the boring 90% is what decides whether you ship or just demo.

Alex Shev • May 31

Strong framing. The part that resonates is separating model quality from the operational substrate. Once an agent can mutate state, "was the answer good?" becomes less important than "can we replay, attribute, and undo the action?"

I think a lot of teams will discover they need event logs, audit trails, and rollback paths before they need a better prompt.

Norbert Rosenwinkel • May 31

Exactly. "Was the answer good?" is a model question. "Can we replay, attribute, and undo it?" is a systems question, and that is the one that decides whether you can run an agent in production.
Your last line nails the order of operations. Most teams reach for a better prompt first. The event log, the audit trail, the rollback path are what they actually hit the wall on. Prompt quality is tuning. State you cannot reconstruct is a structural problem.

Alex Shev • May 31

Yes, exactly. I think the practical test is: if the agent made a bad change at 2:14pm, can you answer three questions without guessing?

What state did it read, what state did it write, and what is the smallest reversible step?

If the answer is no, the model quality almost doesn't matter yet. You don't have an agent system, you have an impressive action generator with weak brakes.
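
The three-question test above can be made mechanical by capturing the answers at write time. A minimal sketch, assuming a hypothetical `ActionRecord` shape; none of these names come from the post or from any library:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ActionRecord:
    """Hypothetical record written for every state-changing agent action."""
    action: str
    at: datetime
    state_read: dict[str, Any]       # what the agent saw when it decided
    state_written: dict[str, Any]    # what it changed
    undo: Optional[dict[str, Any]]   # smallest reversible step, or None


def passes_three_question_test(rec: ActionRecord) -> bool:
    """True only if read state, written state, and an undo step were all captured."""
    return bool(rec.state_read) and bool(rec.state_written) and rec.undo is not None


# A change made at 2:14pm that can be answered without guessing.
answerable = ActionRecord(
    action="update_plan",
    at=datetime(2026, 5, 31, 14, 14, tzinfo=timezone.utc),
    state_read={"plan": "pro"},
    state_written={"plan": "free"},
    undo={"set": {"plan": "pro"}},
)

# The same change with nothing recorded about what was read or how to undo it.
unanswerable = ActionRecord("update_plan", answerable.at, {}, {"plan": "free"}, None)
```

The check is deliberately crude: it only verifies that the three answers exist, not that they are correct.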

Norbert Rosenwinkel • Jun 1

"Impressive action generator with weak brakes" — I'm stealing that ;-) And the three-question test is exactly right: read state, write state, smallest reversible step. If you can't answer all three without guessing, you don't have a system, you have a demo that happens to run in production.
The brakes are the product. The model just decides how fast you're going.

Alex Shev • Jun 1

Exactly. That is the line I keep coming back to: the model can be impressive and still be unsafe if the system cannot tell you what changed, why it changed, and how to roll it back.

I think a lot of agent demos optimize for acceleration first and add brakes later. For real production work it has to be the other way around: logs, state boundaries, approvals, tests, and a rollback path. Then the speed becomes useful instead of just exciting.

TxDesk • Jun 2

Strong post and the framing lands. The "AI did one thing, it took away your excuse" line is the cleanest statement of why this matters now that I've seen.

One angle worth adding from the DeFi side: the foundation you describe gives you provability of what your system did, which is necessary, but in our domain there's a second integrity surface that sits outside your event store entirely, namely the chain itself. The agent's own event log can be perfectly consistent and replayable, and the wallet it's reasoning about has still been touched by something the agent doesn't see: an MEV bot, a different dApp, a hardware-wallet signature from a different device. The event sourcing tells you what your system thought was true. It doesn't tell you whether that's still true against the world you're about to act on.

So we end up needing the foundation you describe plus a freshness gate: a deterministic re-read of the actual external state right before the action commits, with the read pinned to the same hash the decision was made against. Diverge = abort. That's outside the event store's job, but the event store is what makes the abort recoverable. Both layers, not one.
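
A freshness gate of this kind might look roughly like the following. This is a sketch under assumptions: the state is a plain dictionary, the hash is SHA-256 over canonical JSON, and `commit_with_freshness_gate`, `StaleStateError`, and the callbacks are hypothetical names, not part of any chain SDK.

```python
import hashlib
import json
from typing import Any, Callable


def state_hash(state: dict[str, Any]) -> str:
    """Deterministic hash of a state snapshot (canonical JSON, sorted keys)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class StaleStateError(Exception):
    """External state diverged from the snapshot the decision was made against."""


def commit_with_freshness_gate(
    decided_against: str,
    read_external_state: Callable[[], dict[str, Any]],
    commit: Callable[[dict[str, Any]], Any],
) -> Any:
    """Re-read external state right before committing; abort on divergence."""
    current = read_external_state()
    if state_hash(current) != decided_against:
        raise StaleStateError("external state changed since the decision; aborting")
    return commit(current)


# The decision is pinned to the hash of the wallet state it was made against.
wallet = {"balance": 100, "nonce": 7}
pinned = state_hash(wallet)
```

As the replies below point out, a gate like this only narrows the race window; it does not close it.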

The bucketed single-writer / parallel-writers tradeoff is also exactly the shape we hit on multi-chain reads: per-wallet ordering matters, but across wallets we want parallelism. 4096 buckets via a cheap bit-mask is a nice detail.
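
For readers who have not seen the trick: with a power-of-two bucket count, a bit-mask replaces the modulo. The sketch below is an illustration only; using CRC32 as the hash is an assumption, and the post's actual hashing scheme is not shown here.

```python
import zlib

BUCKET_COUNT = 4096             # must be a power of two for the mask to work
BUCKET_MASK = BUCKET_COUNT - 1  # 0xFFF, the low 12 bits


def bucket_for(aggregate_id: str) -> int:
    """Map an aggregate (e.g. a wallet) to one of 4096 buckets.

    CRC32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is salted per process.
    """
    return zlib.crc32(aggregate_id.encode("utf-8")) & BUCKET_MASK
```

The same id always lands in the same bucket, so a single writer per bucket preserves per-wallet ordering, while writers for different buckets can run in parallel.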

Norbert Rosenwinkel • Jun 3

"What your system thought was true vs whether it's still true against the world you're about to act on" — that's the distinction I didn't draw sharply enough, and it's the real one. Internal replayability is necessary and nowhere near sufficient; the event store is honest about your own history and completely blind to a wallet someone else just touched.
Freshness gate is a great name for it. The way I'd place it: it's optimistic concurrency, generalized to a resource you don't own. Event sourcing already does this internally — append expects version N, someone else bumped it, you abort and retry. You're applying the same compare-against-the-version-you-decided-on, except the "aggregate" is external chain state you can only read, never lock. Pin the read to the hash the decision was made against, diverge = abort. Same shape, no write lock available.
And you're right it's both layers: the gate decides whether to commit, the event store makes the abort recoverable and auditable instead of a dangling half-action. Where I'd stay honest — even the gate only narrows the window, it doesn't close it. Between your re-read and the action actually landing on-chain there's still a race, and finality is probabilistic anyway, so the gate is a pre-flight and the tx's own atomic guards (nonce, revert conditions) are the last line. The event store can't save you from a reorg; it can only make sure you know one happened and can compensate cleanly.
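
The internal version of that check, "append expects version N", can be sketched in a few lines. This is an in-memory illustration with hypothetical names (`InMemoryStream`, `ConcurrencyError`), not the API of Stratara or any event store:

```python
from typing import Any


class ConcurrencyError(Exception):
    """Someone else appended since the version this decision was based on."""


class InMemoryStream:
    """Toy event stream with optimistic concurrency on append."""

    def __init__(self) -> None:
        self._events: list[dict[str, Any]] = []

    @property
    def version(self) -> int:
        return len(self._events)

    def append(self, event: dict[str, Any], expected_version: int) -> int:
        # Compare against the version the caller decided on; abort on mismatch.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {self.version}"
            )
        self._events.append(event)
        return self.version
```

The caller reads the stream at version N, decides, and appends with `expected_version=N`. If another writer got there first, the append fails and the caller re-reads and retries.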

TxDesk • Jun 4

The layer worth naming between your freshness gate and the tx atomic guards is the signing flow itself. On EVM with EIP-712 typed data, the signature can embed the state hash the freshness gate verified against. User signs at T1, tx lands at T1+30s, but the signed bytes still pin the T1 state hash, so the contract verifies on-chain (revert if mismatched) what the gate verified off-chain. The signed-bytes-as-freshness-witness pattern carries the gate's decision across the network gap.

Doesn't help with the reorg case you flagged, where finality itself is probabilistic. Nothing pre-tx can. But for the "state changed between gate and landing" race specifically, the user's own signature can be the carrier of the freshness invariant.

Self-Correcting Systems • May 31

This framing is strong.

The line that stands out to me is that the model was not necessarily “wrong”: the missing layer underneath was the system’s ability to prove what happened, what state it read, who authorized it, and how to reverse it.

I’ve been running into a neighboring problem while testing AI agent memory: a retrieved memory can be relevant to the action and still not be authoritative enough to govern it.

So I think there are two layers that have to meet:

The system layer you’re describing: immutable history, attribution, replay, compensation.
The memory/authority layer: which instruction, policy, or remembered fact was allowed to decide the action in the first place.

Without the first layer, you can’t prove what happened.

Without the second, you may be able to prove the agent acted cleanly on the wrong authority.

That’s the part I think many agent systems will run into next: not just “what did the agent do?” but “what rule or memory gave it permission to do that?”

Great article. The boring substrate is becoming the real product.

Norbert Rosenwinkel • Jun 1

This is the sharpest version of the point I've seen. You're right that the two layers are different questions: one proves what happened, the other decides what was allowed to decide.
The way I'd connect them: the authority itself is also a fact worth recording. Which policy, which remembered fact, which rule admitted the action — that's not just context, it's part of the record. So the second layer doesn't sit outside the first; it becomes another event. Then you can replay not only "the agent did X" but "the agent did X because rule Y was in force at that moment." Authority becomes auditable, not just the action.
That's exactly why the boring substrate is the real product. "Clean action on the wrong authority" is a failure you can only catch if the authority left a trace too.

Self-Correcting Systems • Jun 1

Yes, that is the bridge.

Authority should not be treated as invisible reasoning around the trace. It should become part of the trace.

A normal event says:

tool_call: send_email

A better event says:

tool_call: send_email
allowed_by: current_email_policy_v3
required_gate: human_approval
memory_status: active
source_snapshot: session_start_10:00
live_check: passed_at_10:14

That changes the record from “the agent acted” to “the agent acted under this authority state.”

That is the difference between debugging behavior and auditing governance.

The failure mode you named, clean action on the wrong authority, is exactly the scary one because every local piece can look correct. The tool call succeeded. The syntax was valid. The workflow completed. The dashboard turned green.

But the wrong rule admitted the action.

If authority does not leave a trace, you only discover that after damage. If authority is recorded as an event, you can query it:

show actions governed by superseded policies
show writes without live authority checks
show tool calls where the governing memory was provisional
show decisions where the action boundary differed from the retrieved context

That is why I keep coming back to boring substrates too. JSONL, metadata, explicit status fields, authority events. Nothing flashy. But it gives the system a memory of why it believed it was allowed to act.

That is the part that makes “what happened?” and “why was it allowed?” answerable from the same run.
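
To make the queries above concrete, here is a small sketch over a JSONL run log. The field names `tool_call`, `allowed_by`, `memory_status`, and `live_check` follow the example event in this comment; the sample records, the `SUPERSEDED` set, and the function names are invented for illustration.

```python
import json
from typing import Any

# Hypothetical run log: one JSON object per line.
RUN_LOG = """\
{"tool_call": "send_email", "allowed_by": "current_email_policy_v3", "memory_status": "active", "live_check": "passed_at_10:14"}
{"tool_call": "send_email", "allowed_by": "email_policy_v2", "memory_status": "active", "live_check": null}
{"tool_call": "update_record", "allowed_by": "current_email_policy_v3", "memory_status": "provisional", "live_check": "passed_at_10:20"}
"""

# Assumption: the set of superseded policies is maintained somewhere else.
SUPERSEDED = {"email_policy_v2"}


def load_events(jsonl: str) -> list[dict[str, Any]]:
    """Parse a JSONL string into a list of event dictionaries."""
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]


def governed_by_superseded(events: list[dict[str, Any]], superseded: set[str]) -> list[dict[str, Any]]:
    """Actions admitted by a policy that has since been superseded."""
    return [e for e in events if e.get("allowed_by") in superseded]


def without_live_check(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Actions that carry no live authority check."""
    return [e for e in events if not e.get("live_check")]


def with_provisional_memory(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Tool calls where the governing memory was provisional."""
    return [e for e in events if e.get("memory_status") == "provisional"]


events = load_events(RUN_LOG)
```

Each query is a plain filter, which is the point: once authority is a recorded field, auditing governance needs no special tooling.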

Norbert Rosenwinkel • Jun 3

This is the upgrade. Recording "authorized: true" tells you a gate existed; recording the authority state tells you which gate, on which policy version, against what memory — and those are completely different questions the moment something goes wrong.
The part I'd underline: authority is itself state, so it rots exactly the way business state does. A boolean "allowed" captured at write time is already a snapshot of now — six weeks later "now" is a different policy, and you can't reconstruct why the action was admitted. Your source_snapshot and allowed_by: policy_v3 fields are the fix: they freeze the authority context as-of-decision-time. That's the same move event sourcing makes for the domain — what was true when the decision happened, not what's true today — just applied to governance instead of business data. Same discipline, second axis.
And hard agree on boring substrates. The flashy part is the agent; the part that survives an audit is JSONL with explicit status fields and an authority event sitting next to the action event. "What happened?" and "why was it allowed?" being answerable from the same run is exactly the bar — and most systems can answer the first and just shrug at the second.

Self-Correcting Systems • Jun 3

The event sourcing frame is exactly the right analogy. Freezing the authority context as-of-decision-time is the same discipline applied to governance instead of domain state: what was true when the decision happened, not what's true when someone comes asking six weeks later. That's the gap between a boolean and an authority event.

And yes on the boring substrates. A JSONL run where "what happened" and "why was it allowed" are both answerable from the same file is the bar. Most agent systems can do the first and have nothing for the second. The enforcement artifact direction we're working toward is exactly that second axis sitting next to the first. The flashy part is the agent; the part that survives an audit is the record.

David Loibner • Jun 1

This framing makes a lot of sense: the failure is often not the model alone, but the missing system layer underneath it.
I think there is also one layer before the event log: before an agent changes state, should this proposed intent be admitted at all?
Event sourcing makes impact reconstructable. An intent boundary decides what impact is allowed to exist in the first place.
Those two feel complementary to me.

sourabh Shukla • Jun 1

David's point about intent admission resonates. There's a layer even before the event log: how was the agent scoped in the first place? The permission surface, the tool contracts, the what-it-is-allowed-to-touch - those decisions made during design compound into exactly the audit problems Norbert describes. A narrowly scoped agent with explicit tool boundaries produces a much shorter "what could it have done?" surface when the replay question comes. Good architecture upfront reduces how much the event log has to carry.

Norbert Rosenwinkel • Jun 1

This is the cheapest layer of the three, and the one people reach for last. The scoping you do at design time shrinks both of the others: a tool the agent was never given is one you don't have to gate at runtime and don't have to explain at replay. You can't misuse access you don't have.
So the stack reads top to bottom: scope decides what's possible, admission decides what's allowed right now, the event log records what happened. Narrow the top and the bottom two carry less. Good architecture upfront isn't a separate concern from auditability — it's what keeps the audit surface small enough to actually answer "what could it have done?" without a week of guessing.
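
The three layers can be sketched as one request path. Everything here is hypothetical (the tool names, the approval rule, the `handle` function); it only illustrates the ordering of scope, admission, and event log described above.

```python
from typing import Any

# Scope: fixed at design time. A tool not listed here does not exist for the agent.
ALLOWED_TOOLS = frozenset({"read_invoice", "send_email"})


def admit(tool: str, args: dict[str, Any]) -> bool:
    """Admission: a runtime policy check (hypothetical rule: emails need approval)."""
    if tool == "send_email":
        return args.get("approved_by") is not None
    return True


def handle(tool: str, args: dict[str, Any], log: list[dict[str, Any]]) -> str:
    """Run a proposed action through scope, then admission, then the event log."""
    if tool not in ALLOWED_TOOLS:
        return "out_of_scope"   # never reaches admission
    if not admit(tool, args):
        return "rejected"       # never reaches the log
    log.append({"tool_call": tool, "args": args})
    return "accepted"
```

Only accepted actions append an event, so the log stays limited to things that were both possible and allowed. Whether rejected intents should also be recorded somewhere is a separate design decision this sketch leaves open.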

sourabh Shukla • Jun 1

That ordering — scope → admission → event log - makes the cost visible too. Most teams treat auditability as a cost they bolt on. If scope is designed first, the audit surface is small by construction, not by effort.

Norbert Rosenwinkel • Jun 1

Yes, those two fit together well. The event log answers "what happened, and can I prove it?" Your intent boundary answers a different question first: "should this be allowed to happen at all?"
In practice that check runs just before an action is accepted — it decides whether the request gets through. Only an accepted action creates events. So the log never has to store something that should never have happened in the first place. It gets stopped earlier.

NOVAInetwork • Jun 1

The "AI just made the gap impossible to miss" framing is the right one. The five properties (state, history, attribution, reversibility, trust) were always missing in CRUD systems; agent volume just removes the option of muddling through.

The architecture you've laid out is the right floor for the within-org case. Event sourcing + clean architecture + tamper-evidence is the answer to "can your auditor believe the log" when the log is yours and the auditor trusts your operational controls. The bucket-based single-writer lock for ordering with cross-aggregate parallelism is a genuinely good design choice. I've seen too many event-sourced systems give up throughput to global locks they didn't need.

The interesting extension is what happens when the agent's actions cross organizational trust boundaries. Stratara's tamper-evident chain is convincing because there's one writer appending to one head. The moment two organizations have to coordinate (Agent A from Org X cancels Agent B's subscription at Org Y, then the customer disputes it across both), each side has their own internally-consistent log and there's no shared referee.

That's the case where the trust assumption has to migrate from "we promise this log is correct" to something a Byzantine actor can't subvert. Same five properties, but the architecture that satisfies them changes shape: the log can't live inside any one party's infrastructure.

Not arguing every team needs to solve this today. Most agent deployments are still firmly inside one org's trust boundary, and Stratara's the right floor for that. But the cross-org case is the one I'd watch as agent-to-agent commerce starts happening, because the gap-that-was-always-there logic applies there too: it was always missing, multi-agent volume just makes it impossible to muddle through.

Norbert Rosenwinkel • Jun 3

This is the sharpest version of the limit I keep running into — thank you. You've named exactly where the in-org design stops: the chain is convincing because there's one writer and one head, and that's also precisely the assumption that breaks the moment two orgs each hold their own internally-consistent log and neither one is the referee.

I gestured at the hinge in the tamper-evidence piece without fully chasing it: a within-org hash chain is still subvertible by an insider with full DB access, and the only real fix is anchoring periodic checkpoints to something outside your own infrastructure — a notary, a public chain, OpenTimestamps. That external anchor is the seed of the cross-org case. Once the trust root sits outside any single participant, you're on the road you're describing. The in-org chain and the cross-org ledger aren't opposed; the anchor is the thing that connects them.

Where I want to stay honest: I deliberately stopped Stratara at the in-org floor. Cross-org agent-to-agent commerce needs the log to live where no party can unilaterally rewrite its own side — co-signed actions so neither can repudiate, or a shared referee none of them runs — and that's a genuinely different system, with its own latency and governance cost, not a config flag on this one. Completely agree it's the frontier to watch as agents start transacting across boundaries. Your "it was always missing, volume just makes it impossible to muddle through" logic ports straight over.
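
The in-org hash chain and the external anchor mentioned above can be illustrated with a toy example. This is a generic hash-chain sketch using SHA-256, not Stratara's implementation; `GENESIS`, `append`, `verify`, and `checkpoint` are names invented here.

```python
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # hypothetical starting value for the chain


def entry_hash(prev_hash: str, payload: dict[str, Any]) -> str:
    """Hash of an entry: previous hash plus canonical JSON of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list[dict[str, Any]], payload: dict[str, Any]) -> None:
    """Append an entry that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list[dict[str, Any]]) -> bool:
    """Check that every entry still matches its recorded hash and predecessor."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def checkpoint(chain: list[dict[str, Any]]) -> str:
    """Head hash; this is the value you would anchor outside your own infrastructure."""
    return chain[-1]["hash"] if chain else GENESIS


chain: list[dict[str, Any]] = []
append(chain, {"action": "refund", "amount": 10})
append(chain, {"action": "cancel"})
anchored = checkpoint(chain)  # assume this value was published externally
```

Editing one entry in place breaks `verify`. An insider who rebuilds the whole chain gets a log that verifies internally but no longer matches the anchored head, which is why the anchor has to live somewhere the insider cannot rewrite.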

NOVAInetwork • Jun 4

The anchoring point is right, with one subtlety I keep running into. An external anchor proves the log existed in some state at time T but doesn't enforce ordering of operations between anchors. Between checkpoints, an insider can still construct multiple internally-consistent histories that hash-match at the next anchor. The rollback only becomes visible if a counterparty has an out-of-band attestation from the intermediate window.

The shared-referee-they-don't-run case is doing something different in kind, not just degree. Every operation is ordered at write time by the protocol. There's no intermediate window where rewriting is possible because no party controls the writer. The latency cost is real and you're right it's a different system, but the property you get is ordering enforcement, not just existence proof.

Alex • Jun 1

"The AI part is now the easy part" — this matches exactly what I hit building a 12-agent pipeline. The model calls never broke; state, ordering, and "can you prove what happened?" did.

One angle to add: you focus on the runtime state an agent mutates (refunds, cancellations), while I hit the same wall one layer up — in the pipeline's own state, where a retry on a different provider can't be allowed to silently fork the run. Different domain, identical root cause. The compensation-events + idempotent-retry framing is spot on.

Bookmarking Stratara!

Norbert Rosenwinkel • Jun 1

Exactly — the pipeline's own state is the layer most people miss. A retry on a different provider forking the run silently is a perfect example: same idempotency problem, just moved up a level. The agent isn't the only stateful thing in the system; the orchestration is too, and it needs the same compensation + idempotent-retry discipline.

Thanks for the bookmark.
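
The idempotent-retry and compensation discipline discussed in this exchange might be sketched like this. The `Pipeline` class, the key format, and the event names are assumptions for illustration, not taken from either system.

```python
from typing import Any, Callable


class Pipeline:
    """Toy orchestrator: one recorded outcome per idempotency key."""

    def __init__(self) -> None:
        self.results: dict[str, Any] = {}        # idempotency key -> recorded result
        self.events: list[dict[str, Any]] = []   # append-only record of the run

    def run_step(self, key: str, provider: str, call: Callable[[], Any]) -> Any:
        # A retry with the same key reuses the recorded outcome, even if it
        # arrives via a different provider, so the run cannot silently fork.
        if key in self.results:
            return self.results[key]
        result = call()
        self.results[key] = result
        self.events.append({"type": "step_completed", "key": key, "provider": provider})
        return result

    def compensate(self, key: str, reason: str) -> None:
        # Undo is recorded as a new event; nothing is deleted or rewritten.
        self.events.append({"type": "step_compensated", "key": key, "reason": reason})
```

If the first attempt raises before a result is recorded, the key stays free and the retry runs normally; only a completed step is pinned.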

Harjot Singh • May 31

This is the thesis I'd tattoo on the industry. The model is the easy part now; the failure is always the missing layer underneath: no state handoff between steps, no verification, no retry policy, no memory of what was already tried. People keep upgrading the model and wondering why the agent still face-plants on real multi-step work, when the bottleneck was never the model. I spent the last year building exactly that "underneath" for Moonshift (the orchestration + verify + deploy harness that turns a prompt into a shipped app), and honestly the model swaps are a footnote next to the harness work. What's the piece of underneath you see teams underbuild most: state, verification, or recovery?

Norbert Rosenwinkel • May 31

Thanks. Went and actually looked at moonshift.io. Honestly the 6-minute launch isn't what got me. It was the boring stuff ;-). A security scan gated before deploy. Spend ceilings that just abort a run instead of overspending. Most agent demos skip exactly that part. "Can it act" is easy. "Can it stop safely" is the real underneath.
On your question: state is the most dangerously underbuilt, recovery the most openly underbuilt. Teams figure a database has state covered, but a row only knows now. It has no idea what was true when phase 7 made its call, so the agent keeps re-deriving a world that already moved on. Recovery they know they skipped. Panicked manual fixes instead of compensation, plus non-idempotent retries that quietly double the damage. Verification at least gets a try, usually as "we log it." Which proves something happened, not that it was allowed, or that nobody rewrote the record after.
So for me: state first, then recovery, then verification.

neitherGalax • Jun 1

This resonated with me. My recent work with MCP, Skills, and context engineering—especially after struggling with orchestration in multi-agent systems—keeps reinforcing the same lesson: AI systems succeed because of what's underneath the model, not just the model itself. Thanks for sharing.

Abdullah Shahin • May 31

The thesis lands and is also the thesis most agent practitioners are converging on. The interesting follow-on for people early in their journey: "nothing underneath" usually means three things specifically — (1) no observability into what the agent decided and why, (2) no contract on what the tool layer can and cannot do, (3) no replay capability when something breaks. Each is solvable in isolation; the hard part is the integration layer where all three meet.

Norbert Rosenwinkel • Jun 1

Clean breakdown — and I'd build the replay piece first, because it quietly carries the other two. If every decision is recorded as an event you can replay, the observability comes almost for free (the record is the trace), and replay only works if the tool layer has a clear contract — otherwise you can't trust what you're replaying.
That's actually the part of Stratara that's already built: you can rebuild any state up to any point in time, and re-run the whole event stream to rebuild your read models from scratch. The one thing it can't do for you is decide what to record — if you want to replay why an agent chose X, that decision has to be written as an event in the first place. The machinery is there; what goes into the log is on you.
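
Rebuilding state up to a point in time is, at its core, a fold over the event stream. The sketch below shows the general technique only; it is not Stratara's API, and the event types and the `seq` field are invented for the example.

```python
from functools import reduce
from typing import Any

Event = dict[str, Any]
State = dict[str, Any]


def apply(state: State, event: Event) -> State:
    """Pure transition: previous state plus one event gives the next state."""
    kind = event["type"]
    if kind == "subscription_started":
        return {**state, "status": "active", "plan": event["plan"]}
    if kind == "plan_changed":
        return {**state, "plan": event["plan"]}
    if kind == "subscription_cancelled":
        return {**state, "status": "cancelled"}
    return state  # unknown events leave state unchanged


def state_as_of(events: list[Event], seq: int) -> State:
    """Replay events up to and including sequence number `seq`."""
    return reduce(apply, (e for e in events if e["seq"] <= seq), {})


EVENTS: list[Event] = [
    {"seq": 1, "type": "subscription_started", "plan": "basic"},
    {"seq": 2, "type": "plan_changed", "plan": "pro"},
    {"seq": 3, "type": "subscription_cancelled"},
]
```

`state_as_of(EVENTS, 2)` answers "what was true when the decision at step 2 was made", which a single mutable row cannot. As noted above, the replay can only show what was written as an event in the first place.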

Syed Ahmer Shah • Jun 3

We’ve spent so much time obsessing over the prompt layer and the LLM itself, completely ignoring that an enterprise-grade AI is only as good as the deterministic architecture supporting it. Relying on raw model intelligence to handle complex business logic without a robust backend pattern is just setting yourself up for an expensive, unpredictable failure.

Your point about the need for rigorous software engineering underneath the AI is a great wake-up call. It's not about replacing developers; it's about shifting our focus back to building the invisible infrastructure that gives these models a safety net. Great perspective!

CapeStart • Jun 5

A green dashboard is not evidence of a healthy system. It's evidence that the dashboard is green. Those are very different conclusions.