Norbert Rosenwinkel

Posted on May 31 • Edited on Jun 22

AI doesn't fail because the model is bad. It fails because there's nothing underneath it

#dotnet #ai #programming #eventsourcing

Audit trails and state history for agents

There's a question every system runs into the moment it goes to production and starts doing real things: what exactly happened, in what order, against what data — and can you prove it?

AI is just making that question very loud right now. Picture the case that gets more likely with every tool-using agent: a support agent — not a human, an LLM with tool access — cancels a subscription, issues a refund, fires off three follow-up emails. The next day the customer says: I never cancelled. Now answer the question above.

In most codebases the honest answer is: you can see the current state of the database (subscription cancelled), but not the path that got it there. A few log lines the next refactor will overwrite. No reliable record of which actor acted on behalf of which customer. And undoing it means hand-writing a correction and hoping you catch every side effect.

That's not a model problem. GPT wasn't "wrong." The problem sits one layer down.

The AI part is now the easy part

Two years ago the model was the hard part. Today you can wire up an agent that calls tools, makes plans, and takes actions, and it takes you ten minutes. The demo looks fantastic, and that's exactly the trap. A demo doesn't move anything real. The moment something touches real state in production, the problems that no better prompt will solve show up:

State: what was the situation when the decision was made? A CRUD table only knows now.
History: which steps led to the outcome? Without a record — gone.
Attribution: who or what acted, and authorized by what?
Reversibility: a wrong action — how do you take it back cleanly?
Trust: can someone quietly rewrite the record afterwards?

This isn't actually an AI problem

And here's the part that matters to me more than all the agent hype: none of this is new, and none of it is AI-specific. A webhook that changes a record at night; a batch job; an admin clicking the wrong button under pressure; a second service writing in over the message queue — they all raise the exact same five questions. State, history, attribution, reversibility, trust are the properties of good software, full stop.

AI did just one thing: it took away your excuse. As long as the only actor was a human managing one click a minute, you could muddle through — grep the logs, guess when in doubt. An autonomous agent firing a hundred actions a second doesn't allow muddling through. It just makes the gap that was always there impossible to miss.

That was the idea behind what I build from the start: not an AI framework, but a foundation for good software. That an agent can run on top is an option — a very current one — but not the point.

What a foundation like that stands on

Nobody needs a particular pattern to ship a feature — I'd never claim that. But the moment a system seriously manages state, no matter who touches it, these three decisions stop being academic:

1. A system of record instead of a snapshot — CQRS + Event Sourcing.
Overwriting fields means throwing the past away. Make each change an immutable event instead, and history is built in, not bolted on. You replay the stream and reconstruct exactly what happened — whether an agent, a job, or a human triggered it. "Undo" becomes a domain-level compensation event instead of a panicked UPDATE.

2. Structure that keeps the volatile at the edge — Clean Architecture + Vertical Slices.
Every integration point is restless — a third-party API, a payment provider, and yes, AI code with its weekly-changing prompts and models. Let that seep into the domain core and it rots it. A clean core in the middle, the volatile as an outer layer — and each capability as a vertical slice (command → handler → events → projection). New things get added without tearing open five layers, and without the new toppling the existing.

3. Trust as a property, not a hope — audit, tamper-evidence, encryption, validation, authorization.
Any actor allowed to change state needs guardrails the system enforces. Every action runs as a command that's validated, authorized, and audited — and the audit records who triggered it separately from whose data it touched, because with an agent acting on a customer's behalf those are two different identities. The record itself is tamper-evident (hash-chained — rewrite a row after the fact and it shows). And personal data stays encrypted per subject and erasable — because neither "the AI did it" nor "that's just the nightly job" is a free pass against GDPR.
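A minimal sketch of the first decision, assuming nothing about Stratara's actual API: state is rebuilt by replaying immutable events, and "undo" is just another event appended to the stream. Python is used only for illustration; the event names (`SubscriptionStarted`, `SubscriptionCancelled`, `CancellationReverted`), the `actor`/`subject` fields, and the `replay` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class Event:
    kind: str     # hypothetical event name
    actor: str    # who triggered the change (agent, job, human)
    subject: str  # whose data the change touched


def replay(events: Iterable[Event]) -> str:
    """Fold the stream into the current subscription status."""
    status = "none"
    for e in events:
        if e.kind == "SubscriptionStarted":
            status = "active"
        elif e.kind == "SubscriptionCancelled":
            status = "cancelled"
        elif e.kind == "CancellationReverted":
            status = "active"
    return status


stream = [
    Event("SubscriptionStarted", actor="customer-42", subject="customer-42"),
    Event("SubscriptionCancelled", actor="support-agent-llm", subject="customer-42"),
]
# Undo is a compensation event; nothing already recorded is overwritten.
stream.append(Event("CancellationReverted", actor="admin-7", subject="customer-42"))

print(replay(stream[:2]))  # cancelled (state as of the disputed action)
print(replay(stream))      # active (state after compensation)
```

Because the cancellation stays in the stream, "the customer says they never cancelled" can be answered by reading the second event: the actor was the agent, the subject was the customer.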

Only together do the three give an answer to "what happened — and can you prove it?" that an auditor will believe. That holds for your agent. It holds just as much for everything else.
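The hash-chaining mentioned in the third decision can be sketched in a few lines. This is a generic illustration of the technique, not Stratara's storage format: the row layout, the `GENESIS` value, and the helper names `append` and `verify` are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev" link


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always produces the same bytes.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(log: list) -> int:
    """Return the index of the first broken entry, or -1 if the chain is intact."""
    prev = GENESIS
    for i, row in enumerate(log):
        if row["prev"] != prev or row["hash"] != entry_hash(prev, row["payload"]):
            return i
        prev = row["hash"]
    return -1


audit: list = []
append(audit, {"actor": "support-agent-llm", "subject": "customer-42", "action": "cancel"})
append(audit, {"actor": "support-agent-llm", "subject": "customer-42", "action": "refund"})
append(audit, {"actor": "admin-7", "subject": "customer-42", "action": "revert-cancel"})

print(verify(audit))                       # -1: chain intact
audit[1]["payload"]["action"] = "nothing"  # rewrite a row after the fact
print(verify(audit))                       # 1: the tampered row is detected
```

A chain like this only shows that a row changed relative to the stored hashes; someone who can rewrite every later hash as well is a separate problem, which is what external anchoring addresses.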

Where I have to be honest (the limits)

So this doesn't turn into a list of miracle cures: a foundation doesn't make your software correct. It makes what it does provable, replayable, and containable — which is a different thing.

It records what happened, not why a model decided it. An AI black box stays a black box; you log inputs, actions, results — not the causality inside.
It doesn't prevent a dumb action. It makes it visible and handleable via compensation, but no replay brings back the email that already went out.
It trades "delete data everywhere" for "manage keys and permissions cleanly." Honestly: a better problem — but still a problem.

Build the architecture, not the plumbing

And here's the catch: building this foundation yourself — event store, command pipeline, outbox, projections, audit, encryption, and all the wiring in between — eats months before you've shipped a single feature. So most teams skip it, ship on CRUD, and hit the can-you-prove-it question later at full force — no AI required.

I didn't want to pay those months again for every project. So I built the foundation once, cleanly, and lifted it out of our own products: Stratara, a .NET 10 stack that brings exactly this — CQRS, event sourcing, mediator, outbox, sagas, projections, identity, plus tamper-evident streams and tenant-bound encryption, lockstep-versioned across 22 NuGet packages, à la carte. The idea behind it isn't "AI platform" — it's simply that you build the architecture of your application, not the plumbing under it. That an agent fits safely on top is a nice side effect of the foundation being right.

Fast enough that you actually leave it on

The whole premise was an actor firing a hundred actions a second — so the foundation has to keep up, not buckle under its own audit guarantees. And here's the failure mode nobody admits to: the guarantees are real and slow, so the first time a load test goes red, someone quietly switches them off. Audit sampling drops to one-in-ten. The projection rebuild moves to a nightly cron instead of running live. The thing that was supposed to make the system provable becomes the thing you disable to hit your p99. That's not a foundation — that's a feature flag waiting to be turned off.

So the hot paths don't use reflection. Replaying a stream means calling an Apply method for every event, and the naive way (MethodInfo.Invoke per event) is exactly the cost that pushes people to cut corners. Instead, each apply-method, projection handler, and constructor is compiled once into a strongly-typed delegate (Expression.Lambda(...).Compile()) and cached. After that first compile it's a direct delegate call, not a reflective lookup. A compiled property write clocks about 13× faster than the reflection equivalent on this machine, and because that cost is paid per event, the absolute saving grows linearly with stream length.

The payoff shows up where it matters, in a full replay:

Events replayed	Time	Allocated
10,000	0.11 ms	64 B
100,000	1.13 ms	64 B
1,000,000	11.6 ms	64 B

A million events in ~12 ms and, the part I like more, at a constant 64 bytes no matter how long the stream. Replaying a whole history hands the garbage collector essentially nothing to chase. The audit trail you keep for the auditor is the same data structure you replay in milliseconds for the app. You don't get to pick between provable and fast; you get both or you get neither, and here it's both.

(Measured with BenchmarkDotNet on a fanless MacBook Air M4. Read the numbers as ratios, not server absolutes — a cooled box with real airflow moves the absolutes, not the shape. The benchmark project ships in the repo; dotnet run -c Release reproduces it.)

Scale by adding boxes, not by rewriting code

Speed on one core is table stakes. The harder promise — and the one event sourcing usually breaks — is what happens when the hundred-actions-a-second actually arrive, from many actors, all at once.

The textbook trap is the global lock. To keep one aggregate's events in order, the easy implementation serialises every write, and now your throughput ceiling is one core no matter how many you bought. Stratara takes a different route, and it's worth walking through, because it is the scaling story:

Commands don't block the caller. The default write path is fire-and-forget: a command goes onto a message bus and returns 202 Accepted immediately, while a worker handles it out-of-process. The request thread never waits on business logic, and a traffic spike buffers in the bus instead of pinning your web tier.

Workers compete for the work. The bus — RabbitMQ or Azure Service Bus — hands each message to whichever worker is free. Add replicas (more pods, more nodes) and they share the load automatically. No leader to elect, no partitions to reassign by hand. Scaling out is a number in a deployment manifest.

Ordering survives parallelism — through buckets, not locks. This is the clever part. Every aggregate id is hashed onto one of 4096 buckets, deterministically: the same id always lands in the same bucket. Writes within a bucket serialise through a single-writer lock, so one aggregate's events stay strictly ordered — but different buckets run fully in parallel. Per-aggregate consistency and cross-aggregate concurrency at the same time, with zero global coordination.

4096 is a power of two on purpose (cheap bit-masking instead of a modulo), and every persisted row — events, snapshots, command log, outbox — carries its bucket id and is indexed on it. So the bucket axis isn't only a lock; it's a partition key you can shard the database along too.

Read models keep up by subscribing, not polling. Projections never ask a table "anything new?" on a timer. They subscribe to the bus, and the write path publishes the event bundle the instant it commits. A read model trails its write by a beat, not by a poll interval — and you're not burning idle "is there work yet?" queries when traffic is quiet.
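The push model can be sketched with an in-process stand-in for the bus. This is an illustration only: a real deployment would use RabbitMQ or Azure Service Bus as the article says, and the `Bus`, `EventStore`, and `OrderCountProjection` classes and the `OrderPlaced` event are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class Bus:
    """In-process stand-in for a message bus."""

    def __init__(self) -> None:
        self._subscribers: list = []

    def subscribe(self, handler: Callable[[list], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, bundle: list) -> None:
        for handler in self._subscribers:
            handler(bundle)


class EventStore:
    def __init__(self, bus: Bus) -> None:
        self._bus = bus
        self.events: list = []

    def commit(self, bundle: list) -> None:
        self.events.extend(bundle)  # stand-in for the durable write
        self._bus.publish(bundle)   # push to subscribers right after the commit


class OrderCountProjection:
    """Read model: number of orders per customer."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)

    def handle(self, bundle: list) -> None:
        for event in bundle:
            if event["type"] == "OrderPlaced":
                self.counts[event["customer"]] += 1


bus = Bus()
projection = OrderCountProjection()
bus.subscribe(projection.handle)  # the projection never polls the store

store = EventStore(bus)
store.commit([{"type": "OrderPlaced", "customer": "c-1"}])
store.commit([{"type": "OrderPlaced", "customer": "c-1"},
              {"type": "OrderPlaced", "customer": "c-2"}])

print(dict(projection.counts))  # {'c-1': 2, 'c-2': 1}
```

Nothing in the projection runs until a commit happens, so a quiet system does no read-side work at all.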

Put together, scaling stops being an architecture project and becomes an operations one. Command, projection, and saga workers all scale as competing consumers; the one deliberate exception is the tamper-evidence hash worker, which stays single-instance by design — it's appending to a single chain, and you don't want two writers fighting over its head.

When the bus drops, nothing waits in line

One more thing the hundred-a-second premise demands: the fast path can't be the only path, or a broker hiccup loses commands.

So the dispatcher tries the bus first — the direct, fast publish. Only if the bus is unreachable does the command land in a durable outbox table, where an OutboxWorker re-publishes it once the bus is back. In the normal case there's no outbox round-trip on the hot path at all; the durable net only engages on failure.

Let me be precise, because this is where marketing usually overclaims: the guarantee is at-least-once, not exactly-once. A command can arrive twice — a retry after a crash mid-publish — so handlers are written to be idempotent, and correlation ids make duplicates detectable. "No message silently lost," yes. "Each message exactly once, by magic," no — and anyone selling you the latter without idempotent handlers is selling you something.
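The bus-first dispatch with an outbox fallback, plus the idempotent handler that at-least-once delivery requires, can be sketched as follows. It is a simplified in-memory model, not Stratara's implementation: `FlakyBus`, `Dispatcher`, `RefundHandler`, and the `correlation_id` field name are assumptions for the example, and a real outbox would be a durable table rather than a list.

```python
class BusDown(Exception):
    pass


class FlakyBus:
    def __init__(self) -> None:
        self.up = True
        self.delivered: list = []  # what the worker side would receive

    def publish(self, command: dict) -> None:
        if not self.up:
            raise BusDown()
        self.delivered.append(command)


class Dispatcher:
    def __init__(self, bus: FlakyBus) -> None:
        self.bus = bus
        self.outbox: list = []  # stand-in for a durable outbox table

    def dispatch(self, command: dict) -> None:
        try:
            self.bus.publish(command)    # fast path: straight to the bus
        except BusDown:
            self.outbox.append(command)  # fallback: park it durably

    def drain_outbox(self) -> None:
        """What an outbox worker would do once the bus is reachable again."""
        while self.outbox:
            self.bus.publish(self.outbox[0])  # may raise; the command stays queued
            self.outbox.pop(0)


class RefundHandler:
    def __init__(self) -> None:
        self.seen: set = set()
        self.refunded = 0

    def handle(self, command: dict) -> None:
        if command["correlation_id"] in self.seen:
            return  # duplicate delivery: already applied, do nothing
        self.seen.add(command["correlation_id"])
        self.refunded += command["amount"]


bus = FlakyBus()
dispatcher = Dispatcher(bus)
c1 = {"correlation_id": "a-1", "amount": 10}
c2 = {"correlation_id": "a-2", "amount": 25}

dispatcher.dispatch(c1)   # bus is up: published directly
bus.up = False
dispatcher.dispatch(c2)   # bus is down: lands in the outbox
bus.up = True
dispatcher.drain_outbox()
bus.delivered.append(c2)  # simulate a redelivery after a crash mid-publish

handler = RefundHandler()
for command in bus.delivered:
    handler.handle(command)

print(len(bus.delivered))  # 3 deliveries
print(handler.refunded)    # 35: the duplicate was applied only once
```

The in-memory `seen` set is the weakest part of the sketch: a real handler has to persist the processed ids alongside the state change, otherwise a restart forgets what was already applied.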

Why this is the floor

None of these techniques is mine to claim — reflection-free dispatch, bucketed single-writer locks, push projections, an outbox fallback all predate me, and you could build any of them into your own append-only store. What eats the months is wiring all of them together cleanly, lockstep-versioned, and keeping them honest under load. That's the part I didn't want to pay for twice. If you're in .NET and you want your next system — with or without an agent on top — standing on something production-grade from day one, this is my floor.

Runnable, dependency-free samples are in the repo; docs are at docs.stratara.tech. Source-available under the MIT License.

Reading is one thing, building another. Grab the repo, run a sample, put your own first action on top of it — and if it earns a place in your toolbox, a star helps the next person find it. Bring your ideas for where the foundation could get better, here in the comments or as an issue on GitHub. A foundation doesn't get good because one person builds it; it gets good because many people use it and push exactly where it still gives. 😉

Top comments (35)

Syed Ahmer Shah • May 31

This is a fantastic breakdown and a perspective that desperately needs to be repeated in the industry right now.

We are so caught up in the hype of model benchmarks and parameter counts that people forget an AI model is just an engine: it doesn't matter how powerful it is if you haven't built the chassis, transmission, and wheels around it. True production-ready AI isn't about a raw prompt; it is about the data pipelines, the deterministic guardrails, and the boring, unsexy software engineering that wraps around it. The "wrapper" isn't a bad thing; it's literally the most important part of the architecture if you want reliability. Spot on with this.

Norbert Rosenwinkel • Jun 1

The engine/chassis image is perfect — and the rehabilitation of "the wrapper" is the part worth saying out loud. People use that word like an insult, but a 1000-horsepower engine with no chassis, no brakes, and no steering isn't a car, it's a way to hurt yourself faster. The wrapper is where reliability actually lives. The model is the easy 10%; the boring 90% is what decides whether you ship or just demo.

Alex Shev • May 31

Strong framing. The part that resonates is separating model quality from the operational substrate. Once an agent can mutate state, "was the answer good?" becomes less important than "can we replay, attribute, and undo the action?"

I think a lot of teams will discover they need event logs, audit trails, and rollback paths before they need a better prompt.

Norbert Rosenwinkel • May 31

Exactly. "Was the answer good?" is a model question. "Can we replay, attribute, and undo it?" is a systems question, and that is the one that decides whether you can run an agent in production.
Your last line nails the order of operations. Most teams reach for a better prompt first. The event log, the audit trail, the rollback path are what they actually hit the wall on. Prompt quality is tuning. State you cannot reconstruct is a structural problem.

Alex Shev • May 31

Yes, exactly. I think the practical test is: if the agent made a bad change at 2:14pm, can you answer three questions without guessing?

What state did it read, what state did it write, and what is the smallest reversible step?

If the answer is no, the model quality almost doesn't matter yet. You don't have an agent system, you have an impressive action generator with weak brakes.

Norbert Rosenwinkel • Jun 1

"Impressive action generator with weak brakes" — I'm stealing that ;-) And the three-question test is exactly right: read state, write state, smallest reversible step. If you can't answer all three without guessing, you don't have a system, you have a demo that happens to run in production.
The brakes are the product. The model just decides how fast you're going.

Alex Shev • Jun 1

Exactly. That is the line I keep coming back to: the model can be impressive and still be unsafe if the system cannot tell you what changed, why it changed, and how to roll it back.

I think a lot of agent demos optimize for acceleration first and add brakes later. For real production work it has to be the other way around: logs, state boundaries, approvals, tests, and a rollback path. Then the speed becomes useful instead of just exciting.

TxDesk • Jun 2

Strong post and the framing lands. The "AI did one thing, it took away your excuse" line is the cleanest statement of why this matters now that I've seen.

One angle worth adding from the DeFi side: the foundation you describe gives you provability of what your system did, which is necessary, but in our domain there's a second integrity surface that sits outside your event store entirely: the chain itself. The agent's own event log can be perfectly consistent and replayable, and the wallet it's reasoning about has still been touched by something the agent doesn't see: an MEV bot, a different dApp, a hardware-wallet sig from a different device. The event sourcing tells you what your system thought was true. It doesn't tell you whether that's still true against the world you're about to act on.

So we end up needing the foundation you describe plus a freshness gate: a deterministic re-read of the actual external state right before the action commits, with the read pinned to the same hash the decision was made against. Diverge = abort. That's outside the event store's job, but the event store is what makes the abort recoverable. Both layers, not one.

The bucketed single-writer / parallel-writers tradeoff is also exactly the shape we hit on multi-chain reads: per-wallet ordering matters, but across wallets we want parallelism. 4096 buckets via cheap bit-mask is a nice detail.

Norbert Rosenwinkel • Jun 3

"What your system thought was true vs whether it's still true against the world you're about to act on" — that's the distinction I didn't draw sharply enough, and it's the real one. Internal replayability is necessary and nowhere near sufficient; the event store is honest about your own history and completely blind to a wallet someone else just touched.
Freshness gate is a great name for it. The way I'd place it: it's optimistic concurrency, generalized to a resource you don't own. Event sourcing already does this internally — append expects version N, someone else bumped it, you abort and retry. You're applying the same compare-against-the-version-you-decided-on, except the "aggregate" is external chain state you can only read, never lock. Pin the read to the hash the decision was made against, diverge = abort. Same shape, no write lock available.
And you're right it's both layers: the gate decides whether to commit, the event store makes the abort recoverable and auditable instead of a dangling half-action. Where I'd stay honest — even the gate only narrows the window, it doesn't close it. Between your re-read and the action actually landing on-chain there's still a race, and finality is probabilistic anyway, so the gate is a pre-flight and the tx's own atomic guards (nonce, revert conditions) are the last line. The event store can't save you from a reorg; it can only make sure you know one happened and can compensate cleanly.

TxDesk • Jun 4

The layer worth naming between your freshness gate and the tx atomic guards is the signing flow itself. On EVM with EIP-712 typed data, the signature can embed the state hash the freshness gate verified against. User signs at T1, tx lands at T1+30s, but the signed bytes still pin the T1 state hash, so the contract verifies on-chain (revert if mismatched) what the gate verified off-chain. The signed-bytes-as-freshness-witness pattern carries the gate's decision across the network gap.

Doesn't help with the reorg case you flagged, where finality itself is probabilistic. Nothing pre-tx can. But for the "state changed between gate and landing" race specifically, the user's own signature can be the carrier of the freshness invariant.

Self-Correcting Systems • May 31

This framing is strong.

The line that stands out to me is that the model was not necessarily “wrong”: the missing layer underneath was the system’s ability to prove what happened, what state it read, who authorized it, and how to reverse it.

I’ve been running into a neighboring problem while testing AI agent memory: a retrieved memory can be relevant to the action and still not be authoritative enough to govern it.

So I think there are two layers that have to meet:

The system layer you’re describing: immutable history, attribution, replay, compensation.
The memory/authority layer: which instruction, policy, or remembered fact was allowed to decide the action in the first place.

Without the first layer, you can’t prove what happened.

Without the second, you may be able to prove the agent acted cleanly on the wrong authority.

That’s the part I think many agent systems will run into next: not just “what did the agent do?” but “what rule or memory gave it permission to do that?”

Great article. The boring substrate is becoming the real product.

Norbert Rosenwinkel • Jun 1

This is the sharpest version of the point I've seen. You're right that the two layers are different questions: one proves what happened, the other decides what was allowed to decide.
The way I'd connect them: the authority itself is also a fact worth recording. Which policy, which remembered fact, which rule admitted the action — that's not just context, it's part of the record. So the second layer doesn't sit outside the first; it becomes another event. Then you can replay not only "the agent did X" but "the agent did X because rule Y was in force at that moment." Authority becomes auditable, not just the action.
That's exactly why the boring substrate is the real product. "Clean action on the wrong authority" is a failure you can only catch if the authority left a trace too.

Self-Correcting Systems • Jun 1

Yes, that is the bridge.

Authority should not be treated as invisible reasoning around the trace. It should become part of the trace.

A normal event says:

tool_call: send_email

A better event says:

tool_call: send_email
allowed_by: current_email_policy_v3
required_gate: human_approval
memory_status: active
source_snapshot: session_start_10:00
live_check: passed_at_10:14

That changes the record from “the agent acted” to “the agent acted under this authority state.”

That is the difference between debugging behavior and auditing governance.

The failure mode you named, clean action on the wrong authority, is exactly the scary one because every local piece can look correct. The tool call succeeded. The syntax was valid. The workflow completed. The dashboard turned green.

But the wrong rule admitted the action.

If authority does not leave a trace, you only discover that after damage. If authority is recorded as an event, you can query it:

show actions governed by superseded policies
show writes without live authority checks
show tool calls where the governing memory was provisional
show decisions where the action boundary differed from the retrieved context

That is why I keep coming back to boring substrates too. JSONL, metadata, explicit status fields, authority events. Nothing flashy. But it gives the system a memory of why it believed it was allowed to act.

That is the part that makes “what happened?” and “why was it allowed?” answerable from the same run.

Norbert Rosenwinkel • Jun 3

This is the upgrade. Recording "authorized: true" tells you a gate existed; recording the authority state tells you which gate, on which policy version, against what memory — and those are completely different questions the moment something goes wrong.
The part I'd underline: authority is itself state, so it rots exactly the way business state does. A boolean "allowed" captured at write time is already a snapshot of now — six weeks later "now" is a different policy, and you can't reconstruct why the action was admitted. Your source_snapshot and allowed_by: policy_v3 fields are the fix: they freeze the authority context as-of-decision-time. That's the same move event sourcing makes for the domain — what was true when the decision happened, not what's true today — just applied to governance instead of business data. Same discipline, second axis.
And hard agree on boring substrates. The flashy part is the agent; the part that survives an audit is JSONL with explicit status fields and an authority event sitting next to the action event. "What happened?" and "why was it allowed?" being answerable from the same run is exactly the bar — and most systems can answer the first and just shrug at the second.

Self-Correcting Systems • Jun 3

The event sourcing frame is exactly the right analogy. Freezing the authority context as-of-decision-time is the same discipline applied to governance instead of domain state. What was true when the decision happened, not what's true when someone comes asking six weeks later. That's the gap between a boolean and an authority event.

And yes on the boring substrates. A JSONL run where "what happened" and "why was it allowed" are both answerable from the same file is the bar. Most agent systems can do the first and have nothing for the second. The enforcement artifact direction we're working toward is exactly that second axis sitting next to the first. The flashy part is the agent, the part that survives an audit is the record.

David Loibner • Jun 1

This framing makes a lot of sense: the failure is often not the model alone, but the missing system layer underneath it.
I think there is also one layer before the event log: before an agent changes state, should this proposed intent be admitted at all?
Event sourcing makes impact reconstructable. An intent boundary decides what impact is allowed to exist in the first place.
Those two feel complementary to me.

sourabh Shukla • Jun 1

David's point about intent admission resonates. There's a layer even before the event log: how was the agent scoped in the first place? The permission surface, the tool contracts, the what-it-is-allowed-to-touch: those decisions made during design compound into exactly the audit problems Norbert describes. A narrowly scoped agent with explicit tool boundaries produces a much shorter "what could it have done?" surface when the replay question comes. Good architecture upfront reduces how much the event log has to carry.

Norbert Rosenwinkel • Jun 1

This is the cheapest layer of the three, and the one people reach for last. The scoping you do at design time shrinks both of the others: a tool the agent was never given is one you don't have to gate at runtime and don't have to explain at replay. You can't misuse access you don't have.
So the stack reads top to bottom: scope decides what's possible, admission decides what's allowed right now, the event log records what happened. Narrow the top and the bottom two carry less. Good architecture upfront isn't a separate concern from auditability — it's what keeps the audit surface small enough to actually answer "what could it have done?" without a week of guessing.

sourabh Shukla • Jun 1

That ordering (scope → admission → event log) makes the cost visible too. Most teams treat auditability as a cost they bolt on. If scope is designed first, the audit surface is small by construction, not by effort.

Norbert Rosenwinkel • Jun 1

Yes, those two fit together well. The event log answers "what happened, and can I prove it?" Your intent boundary answers a different question first: "should this be allowed to happen at all?"
In practice that check runs just before an action is accepted — it decides whether the request gets through. Only an accepted action creates events. So the log never has to store something that should never have happened in the first place. It gets stopped earlier.

NOVAInetwork • Jun 1

The "AI just made the gap impossible to miss" framing is the right one. The five properties (state, history, attribution, reversibility, trust) were always missing in CRUD systems; agent volume just removes the option of muddling through.

The architecture you've laid out is the right floor for the within-org case. Event sourcing + clean architecture + tamper-evidence is the answer to "can your auditor believe the log" when the log is yours and the auditor trusts your operational controls. The bucket-based single-writer lock for ordering with cross-aggregate parallelism is a genuinely good design choice. I've seen too many event-sourced systems give up throughput to global locks they didn't need.

The interesting extension is what happens when the agent's actions cross organizational trust boundaries. Stratara's tamper-evident chain is convincing because there's one writer appending to one head. The moment two organizations have to coordinate (Agent A from Org X cancels Agent B's subscription at Org Y, then the customer disputes it across both), each side has their own internally-consistent log and there's no shared referee.

That's the case where the trust assumption has to migrate from "we promise this log is correct" to something a Byzantine actor can't subvert. Same five properties, but the architecture that satisfies them changes shape: the log can't live inside any one party's infrastructure.

Not arguing every team needs to solve this today. Most agent deployments are still firmly inside one org's trust boundary, and Stratara's the right floor for that. But the cross-org case is the one I'd watch as agent-to-agent commerce starts happening, because the gap-that-was-always-there logic applies there too: it was always missing, multi-agent volume just makes it impossible to muddle through.

Norbert Rosenwinkel • Jun 3

This is the sharpest version of the limit I keep running into — thank you. You've named exactly where the in-org design stops: the chain is convincing because there's one writer and one head, and that's also precisely the assumption that breaks the moment two orgs each hold their own internally-consistent log and neither one is the referee.

I gestured at the hinge in the tamper-evidence piece without fully chasing it: a within-org hash chain is still subvertible by an insider with full DB access, and the only real fix is anchoring periodic checkpoints to something outside your own infrastructure — a notary, a public chain, OpenTimestamps. That external anchor is the seed of the cross-org case. Once the trust root sits outside any single participant, you're on the road you're describing. The in-org chain and the cross-org ledger aren't opposed; the anchor is the thing that connects them.

Where I want to stay honest: I deliberately stopped Stratara at the in-org floor. Cross-org agent-to-agent commerce needs the log to live where no party can unilaterally rewrite its own side — co-signed actions so neither can repudiate, or a shared referee none of them runs — and that's a genuinely different system, with its own latency and governance cost, not a config flag on this one. Completely agree it's the frontier to watch as agents start transacting across boundaries. Your "it was always missing, volume just makes it impossible to muddle through" logic ports straight over.

NOVAInetwork • Jun 4

The anchoring point is right, with one subtlety I keep running into. An external anchor proves the log existed in some state at time T but doesn't enforce ordering of operations between anchors. Between checkpoints, an insider can still construct multiple internally-consistent histories that hash-match at the next anchor. The rollback only becomes visible if a counterparty has an out-of-band attestation from the intermediate window.

The shared-referee-they-don't-run case is doing something different in kind, not just degree. Every operation is ordered at write time by the protocol. There's no intermediate window where rewriting is possible because no party controls the writer. The latency cost is real and you're right it's a different system, but the property you get is ordering enforcement, not just existence proof.

Alex • Jun 1

"The AI part is now the easy part" — this matches exactly what I hit building a 12-agent pipeline. The model calls never broke; state, ordering, and "can you prove what happened?" did.

One angle to add: you focus on the runtime state an agent mutates (refunds, cancellations), while I hit the same wall one layer up — in the pipeline's own state, where a retry on a different provider can't be allowed to silently fork the run. Different domain, identical root cause. The compensation-events + idempotent-retry framing is spot on.

Bookmarking Stratara!

Norbert Rosenwinkel • Jun 1

Exactly — the pipeline's own state is the layer most people miss. A retry on a different provider forking the run silently is a perfect example: same idempotency problem, just moved up a level. The agent isn't the only stateful thing in the system; the orchestration is too, and it needs the same compensation + idempotent-retry discipline.
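
One way to picture the idempotent-retry discipline at the orchestration level is the Python sketch below. It assumes an in-memory store and a single process; the names `idempotency_key`, `RunLog`, and `run_step` are hypothetical, and a real pipeline would need a durable store and an atomic check-and-write. The key is derived from what the step does (run, step name, input), not from which provider executes it, so a retry on a different provider reuses the recorded result instead of forking the run.

```python
import hashlib


def idempotency_key(run_id: str, step: str, step_input: str) -> str:
    """Key a step by what it does, not by which provider executes it."""
    raw = "\x1f".join([run_id, step, step_input])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class RunLog:
    """In-memory stand-in for a durable store of completed step results."""

    def __init__(self) -> None:
        self._results = {}

    def run_step(self, run_id, step, step_input, provider, call):
        key = idempotency_key(run_id, step, step_input)
        if key in self._results:
            # Already recorded: return the first result instead of forking the run.
            return self._results[key]
        output = call(provider, step_input)
        # Record which provider produced the result, so the history stays attributable.
        self._results[key] = {"provider": provider, "output": output}
        return self._results[key]
```

If the first call raises, nothing is recorded, so a retry on another provider runs normally and becomes the single recorded result for that step.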

Thanks for the bookmark.

Harjot Singh • May 31

This is the thesis I'd tattoo on the industry. The model is the easy part now; the failure is always the missing layer underneath: no state handoff between steps, no verification, no retry policy, no memory of what was already tried. People keep upgrading the model and wondering why the agent still face-plants on real multi-step work, when the bottleneck was never the model. I spent the last year building exactly that "underneath" for Moonshift (the orchestration + verify + deploy harness that turns a prompt into a shipped app), and honestly the model swaps are a footnote next to the harness work. What's the piece of underneath you see teams underbuild most: state, verification, or recovery?

Norbert Rosenwinkel • May 31

Thanks. Went and actually looked at moonshift.io. Honestly the 6-minute launch isn't what got me. It was the boring stuff ;-). A security scan gated before deploy. Spend ceilings that just abort a run instead of overspending. Most agent demos skip exactly that part. "Can it act" is easy. "Can it stop safely" is the real underneath.

On your question: state is the most dangerously underbuilt, recovery the most openly underbuilt. Teams figure a database has state covered, but a row only knows now. It has no idea what was true when phase 7 made its call, so the agent keeps re-deriving a world that already moved on. Recovery they know they skipped. Panicked manual fixes instead of compensation, plus non-idempotent retries that quietly double the damage. Verification at least gets a try, usually as "we log it." Which proves something happened, not that it was allowed, or that nobody rewrote the record after.

So for me: state first, then recovery, then verification.
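
To make "compensation instead of panicked manual fixes" concrete, here is a small Python sketch under stated assumptions: an append-only in-memory ledger, with hypothetical names throughout (`Event`, `Ledger`, `compensate`, and the event kinds `RefundIssued` and `RefundReversed`). It is an illustration of the idea, not the API of Stratara or any other library. A wrong action is undone by appending its opposite, the original event stays in the record, and a retried append with the same id is a no-op rather than a second refund.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str                          # "RefundIssued" or "RefundReversed"
    amount_cents: int
    actor: str                         # who or what acted
    compensates: Optional[str] = None  # id of the event being undone, if any


class Ledger:
    """Append-only list of events; nothing is ever updated or deleted."""

    def __init__(self) -> None:
        self.events = []
        self._seen = set()

    def append(self, event: Event) -> bool:
        """Append once; a retried append with the same id is a no-op."""
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        self.events.append(event)
        return True

    def compensate(self, target_id: str, actor: str) -> bool:
        """Undo an event by appending its opposite; the original stays in place."""
        target = next(e for e in self.events if e.event_id == target_id)
        reversal = Event(
            event_id=f"reverse:{target_id}",  # deterministic id keeps retries safe
            kind="RefundReversed",
            amount_cents=-target.amount_cents,
            actor=actor,
            compensates=target_id,
        )
        return self.append(reversal)

    def refunded_cents(self) -> int:
        """Net refunded amount, derived from the full history."""
        return sum(e.amount_cents for e in self.events)
```

Because the reversal's id is derived from the target's id, calling `compensate` twice records one reversal, and the history still shows both who issued the refund and who took it back.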

neitherGalax • Jun 1

This resonated with me. My recent work with MCP, Skills, and context engineering—especially after struggling with orchestration in multi-agent systems—keeps reinforcing the same lesson: AI systems succeed because of what's underneath the model, not just the model itself. Thanks for sharing.

Abdullah Shahin • May 31

The thesis lands and is also the thesis most agent practitioners are converging on. The interesting follow-on for people early in their journey: "nothing underneath" usually means three things specifically — (1) no observability into what the agent decided and why, (2) no contract on what the tool layer can and cannot do, (3) no replay capability when something breaks. Each is solvable in isolation; the hard part is the integration layer where all three meet.

Norbert Rosenwinkel • Jun 1

Clean breakdown, and I'd build the replay piece first, because it quietly carries the other two. If every decision is recorded as an event you can replay, the observability comes almost for free (the record is the trace), and replay only works if the tool layer has a clear contract; otherwise you can't trust what you're replaying.

That's actually the part of Stratara that's already built: you can rebuild any state up to any point in time, and re-run the whole event stream to rebuild your read models from scratch. The one thing it can't do for you is decide what to record: if you want to replay why an agent chose X, that decision has to be written as an event in the first place. The machinery is there; what goes into the log is on you.
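
The general shape of "rebuild any state up to any point in time" can be sketched in a few lines of Python. This is a generic event-sourcing fold, not Stratara's API: `Event`, `apply`, `state_at`, the event kinds, and the integer timestamps are all assumptions made for the example. It also shows the caveat above: the agent's reason is only available on replay because it was written down as its own event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    at: int     # logical timestamp; a real store would supply its own ordering
    kind: str
    data: dict


def apply(state: dict, event: Event) -> dict:
    """Pure fold step: current state plus one event gives the next state."""
    if event.kind == "SubscriptionStarted":
        return {**state, "status": "active", "plan": event.data["plan"]}
    if event.kind == "AgentDecided":
        # The decision is replayable only because it was recorded as an event.
        return {**state, "last_decision": event.data["reason"]}
    if event.kind == "SubscriptionCancelled":
        return {**state, "status": "cancelled"}
    return state  # unknown kinds are ignored in this sketch


def state_at(events: list, as_of: int) -> dict:
    """Rebuild state by replaying every event with a timestamp <= as_of."""
    state = {}
    for event in sorted(events, key=lambda e: e.at):
        if event.at > as_of:
            break
        state = apply(state, event)
    return state
```

Rebuilding a read model from scratch is the same fold run over the whole stream; asking what was true when a decision was made is the same fold stopped at that decision's timestamp.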
