There's a question every system runs into the moment it goes to production and starts doing real things: what exactly happened, in what order, against what data — and can you prove it?
AI is just making that question very loud right now. Picture the case that gets more likely with every tool-using agent: a support agent — not a human, an LLM with tool access — cancels a subscription, issues a refund, fires off three follow-up emails. The next day the customer says: I never cancelled. Now answer the question above.
In most codebases the honest answer is: you can see the current state of the database (subscription cancelled), but not the path that got it there. A few log lines the next refactor will overwrite. No reliable record of which actor acted on behalf of which customer. And undoing it means hand-writing a correction and hoping you catch every side effect.
That's not a model problem. GPT wasn't "wrong." The problem sits one layer down.
The AI part is now the easy part
Two years ago the model was the hard part. Today it takes ten minutes to wire up an agent that calls tools, makes plans, and takes actions. The demo looks fantastic, and that's exactly the trap. A demo doesn't move anything real. The moment something touches real state in production, the problems that no better prompt will solve show up:
- State: what was the situation when the decision was made? A CRUD table only knows now.
- History: which steps led to the outcome? Without a record — gone.
- Attribution: who or what acted, and authorized by what?
- Reversibility: a wrong action — how do you take it back cleanly?
- Trust: can someone quietly rewrite the record afterwards?
This isn't actually an AI problem
And here's the part that matters to me more than all the agent hype: none of this is new, and none of it is AI-specific. A webhook that changes a record at night; a batch job; an admin clicking the wrong button under pressure; a second service writing in over the message queue — they all raise the exact same five questions. State, history, attribution, reversibility, trust are the properties of good software, full stop.
AI did just one thing: it took away your excuse. As long as the only actor was a human managing one click a minute, you could muddle through — grep the logs, guess when in doubt. An autonomous agent firing a hundred actions a second doesn't allow muddling through. It just makes the gap that was always there impossible to miss.
That was the idea behind what I build from the start: not an AI framework, but a foundation for good software. That an agent can run on top is an option — a very current one — but not the point.
What a foundation like that stands on
Nobody needs a particular pattern to ship a feature — I'd never claim that. But the moment a system seriously manages state, no matter who touches it, these three decisions stop being academic:
1. A system of record instead of a snapshot — CQRS + Event Sourcing.
Overwriting fields means throwing the past away. Make each change an immutable event instead, and history is built in, not bolted on. You replay the stream and reconstruct exactly what happened — whether an agent, a job, or a human triggered it. "Undo" becomes a domain-level compensation event instead of a panicked UPDATE.
2. Structure that keeps the volatile at the edge — Clean Architecture + Vertical Slices.
Every integration point is restless: a third-party API, a payment provider, and yes, AI code with its weekly-changing prompts and models. Let that seep into the domain core and the core rots. So keep a clean core in the middle and the volatile parts in an outer layer, with each capability built as a vertical slice (command → handler → events → projection). New things get added without tearing open five layers, and without new code toppling what already works.
3. Trust as a property, not a hope — audit, tamper-evidence, encryption, validation, authorization.
Any actor allowed to change state needs guardrails the system enforces. Every action runs as a command that's validated, authorized, and audited — and the audit records who triggered it separately from whose data it touched, because with an agent acting on a customer's behalf those are two different identities. The record itself is tamper-evident (hash-chained — rewrite a row after the fact and it shows). And personal data stays encrypted per subject and erasable — because neither "the AI did it" nor "that's just the nightly job" is a free pass against GDPR.
Only together do the three give an answer to "what happened — and can you prove it?" that an auditor will believe. That holds for your agent. It holds just as much for everything else.
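To make the first decision concrete, here is a minimal sketch in Python (deliberately not .NET, and not Stratara's API). Every name in it, from `Event` to `CancellationReverted`, is a hypothetical stand-in. It shows the three ideas from the list: changes are appended as immutable events, state is reconstructed by replaying them, and "undo" is one more event rather than an edit. Each event also records the actor separately from the subject whose data it touched.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)          # frozen: an event cannot be mutated after creation
class Event:
    kind: str                    # hypothetical event name
    actor: str                   # who triggered it (agent, job, human)
    subject: str                 # whose data it touched

def apply(state: str, event: Event) -> str:
    """Pure transition: current state + one event -> next state."""
    if event.kind == "SubscriptionStarted":
        return "active"
    if event.kind == "SubscriptionCancelled":
        return "cancelled"
    if event.kind == "CancellationReverted":   # compensation event, not an UPDATE
        return "active"
    return state

def replay(events: Sequence[Event], upto: Optional[int] = None) -> str:
    """Fold the stream (or only its first `upto` events) into a state."""
    state = "none"
    for event in events[:upto]:
        state = apply(state, event)
    return state

stream = [
    Event("SubscriptionStarted", actor="user:alice", subject="customer:alice"),
    Event("SubscriptionCancelled", actor="agent:support-llm", subject="customer:alice"),
]
state_after_cancel = replay(stream)

# The correction is appended; the cancellation stays in the history.
stream.append(Event("CancellationReverted", actor="admin:bob", subject="customer:alice"))
state_now = replay(stream)
state_before_cancel = replay(stream, upto=1)   # "what was true back then?"
```

The customer dispute from the opening becomes a lookup: the second event names the agent as actor and the customer as subject, and replaying a prefix of the stream answers what the state was at any earlier point.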
Where I have to be honest (the limits)
So this doesn't turn into a list of miracle cures: a foundation doesn't make your software correct. It makes what it does provable, replayable, and containable — which is a different thing.
- It records what happened, not why a model decided it. An AI black box stays a black box; you log inputs, actions, results — not the causality inside.
- It doesn't prevent a dumb action. It makes it visible and handleable via compensation, but no replay brings back the email that already went out.
- It trades "delete data everywhere" for "manage keys and permissions cleanly." Honestly: a better problem — but still a problem.
Build the architecture, not the plumbing
And here's the catch: building this foundation yourself — event store, command pipeline, outbox, projections, audit, encryption, and all the wiring in between — eats months before you've shipped a single feature. So most teams skip it, ship on CRUD, and hit the can-you-prove-it question later at full force — no AI required.
I didn't want to pay those months again for every project. So I built the foundation once, cleanly, and lifted it out of our own products: Stratara, a .NET 10 stack that brings exactly this — CQRS, event sourcing, mediator, outbox, sagas, projections, identity, plus tamper-evident streams and tenant-bound encryption, lockstep-versioned across 22 NuGet packages, à la carte. The idea behind it isn't "AI platform" — it's simply that you build the architecture of your application, not the plumbing under it. That an agent fits safely on top is a nice side effect of the foundation being right.
Fast enough that you actually leave it on
The whole premise was an actor firing a hundred actions a second — so the foundation has to keep up, not buckle under its own audit guarantees. And here's the failure mode nobody admits to: the guarantees are real and slow, so the first time a load test goes red, someone quietly switches them off. Audit sampling drops to one-in-ten. The projection rebuild moves to a nightly cron instead of running live. The thing that was supposed to make the system provable becomes the thing you disable to hit your p99. That's not a foundation — that's a feature flag waiting to be turned off.
So the hot paths don't use reflection. Replaying a stream means calling an Apply method for every event, and the naive way, MethodInfo.Invoke per event, is exactly the cost that pushes people to cut corners. Instead, each apply-method, projection handler, and constructor is compiled once into a strongly-typed delegate (Expression.Lambda(...).Compile()) and cached. After that first compile it's a direct call, not a lookup. A compiled property write clocks about 13× faster than the reflection equivalent on my test machine (details below the table), and because that cost is paid per event, the absolute saving grows linearly with stream length.
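The shape of that optimisation, resolve once, cache, then call directly, can be shown in a few lines. This is a loose Python analogy only: Python has no Expression.Lambda, the class and function names are hypothetical, and the sketch says nothing about the 13× figure. It just shows the per-class lookup moving out of the per-event loop.

```python
import inspect

class Counter:
    """Hypothetical aggregate with one apply method per event kind."""
    def __init__(self):
        self.total = 0

    def apply_added(self, amount):
        self.total += amount

    def apply_removed(self, amount):
        self.total -= amount

_DISPATCH_CACHE = {}   # class -> {event kind: function}

def dispatch_table(cls):
    """Discover apply_* methods once per class; later calls hit the cache."""
    table = _DISPATCH_CACHE.get(cls)
    if table is None:
        table = {
            name[len("apply_"):]: fn
            for name, fn in inspect.getmembers(cls, inspect.isfunction)
            if name.startswith("apply_")
        }
        _DISPATCH_CACHE[cls] = table
    return table

def replay(aggregate, events):
    table = dispatch_table(type(aggregate))   # introspection happens here, once
    for kind, payload in events:
        table[kind](aggregate, payload)       # per event: a dict lookup and a call
    return aggregate

counter = replay(Counter(), [("added", 5), ("removed", 2), ("added", 10)])
```

The introspection cost is paid once per aggregate type; everything inside the loop is a plain call.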
The payoff shows up where it matters, in a full replay:
| Events replayed | Time | Allocated |
|---|---|---|
| 10,000 | 0.11 ms | 64 B |
| 100,000 | 1.13 ms | 64 B |
| 1,000,000 | 11.6 ms | 64 B |
A million events in ~12 ms, and, the part I like more, at a constant 64 bytes no matter how long the stream. Replaying a whole history hands the garbage collector essentially nothing to chase. The audit trail you keep for the auditor is the same data structure you replay in milliseconds for the app. You don't get to pick between provable and fast; you get both or you get neither, and here it's both.
(Measured with BenchmarkDotNet on a fanless MacBook Air M4. Read the numbers as ratios, not server absolutes — a cooled box with real airflow moves the absolutes, not the shape. The benchmark project ships in the repo; dotnet run -c Release reproduces it.)
Scale by adding boxes, not by rewriting code
Speed on one core is table stakes. The harder promise — and the one event sourcing usually breaks — is what happens when the hundred-actions-a-second actually arrive, from many actors, all at once.
The textbook trap is the global lock. To keep one aggregate's events in order, the easy implementation serialises every write, and now your throughput ceiling is one core no matter how many you bought. Stratara takes a different route, and it's worth walking through, because it is the scaling story:
Commands don't block the caller. The default write path is fire-and-forget: the command goes onto a message bus and the endpoint returns 202 Accepted immediately, while a worker handles the command out-of-process. The request thread never waits on business logic, and a traffic spike buffers in the bus instead of pinning your web tier.
Workers compete for the work. The bus — RabbitMQ or Azure Service Bus — hands each message to whichever worker is free. Add replicas (more pods, more nodes) and they share the load automatically. No leader to elect, no partitions to reassign by hand. Scaling out is a number in a deployment manifest.
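The two paragraphs above can be sketched together with an in-process queue standing in for the bus. This is an illustration under assumptions, not Stratara's code: `queue.Queue` plays the role of RabbitMQ or Azure Service Bus, threads play the role of worker replicas, and `accept` and `worker` are hypothetical names.

```python
import queue
import threading

bus = queue.Queue()              # stand-in for the message bus
handled = []                     # (worker name, command) pairs
handled_lock = threading.Lock()

def accept(command):
    """Web tier: enqueue and answer immediately; no business logic runs here."""
    bus.put(command)
    return 202

def worker(name):
    """Competing consumer: takes whichever message is next in the queue."""
    while True:
        command = bus.get()
        if command is None:      # shutdown sentinel
            bus.task_done()
            return
        with handled_lock:
            handled.append((name, command))
        bus.task_done()

# "Scaling out" in this sketch is the number in range().
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in workers:
    t.start()

statuses = [accept(f"cmd-{i}") for i in range(20)]

bus.join()                       # wait until every command has been handled
for _ in workers:
    bus.put(None)                # one sentinel per worker
for t in workers:
    t.join()
```

Every command is handled exactly once in this toy because the queue hands each item to a single consumer. Which worker gets which command is not deterministic, and that is the point: no worker is assigned work in advance.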
Ordering survives parallelism — through buckets, not locks. This is the clever part. Every aggregate id is hashed onto one of 4096 buckets, deterministically: the same id always lands in the same bucket. Writes within a bucket serialise through a single-writer lock, so one aggregate's events stay strictly ordered — but different buckets run fully in parallel. Per-aggregate consistency and cross-aggregate concurrency at the same time, with zero global coordination.
4096 is a power of two on purpose (cheap bit-masking instead of a modulo), and every persisted row — events, snapshots, command log, outbox — carries its bucket id and is indexed on it. So the bucket axis isn't only a lock; it's a partition key you can shard the database along too.
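A minimal sketch of the bucket idea in Python. The hash function (CRC32) is my assumption for illustration; the article only says the mapping is deterministic. The in-process `threading.Lock` is likewise a stand-in: across separate worker processes the single-writer guarantee would have to come from something shared, such as the database. `bucket_of` and `append_event` are hypothetical names.

```python
import threading
import zlib

BUCKET_COUNT = 4096                  # power of two, as described above
BUCKET_MASK = BUCKET_COUNT - 1       # 0x0FFF

def bucket_of(aggregate_id: str) -> int:
    """Same id -> same bucket, in every process and on every run."""
    # Python's built-in hash() is salted per process for strings,
    # so a stable hash is needed here. CRC32 is an assumed choice.
    return zlib.crc32(aggregate_id.encode("utf-8")) & BUCKET_MASK

bucket_locks = [threading.Lock() for _ in range(BUCKET_COUNT)]
bucket_streams = [dict() for _ in range(BUCKET_COUNT)]   # per bucket: id -> events

def append_event(aggregate_id: str, event: str) -> int:
    """Append under the bucket's lock; returns the event's version number."""
    bucket = bucket_of(aggregate_id)
    with bucket_locks[bucket]:       # serialises writers within one bucket only
        stream = bucket_streams[bucket].setdefault(aggregate_id, [])
        version = len(stream) + 1
        stream.append((version, bucket, event))   # every row carries its bucket id
        return version
```

Because the bucket count is a power of two, `hash & 4095` selects the same bucket as `hash % 4096`. Two aggregates that land in different buckets never contend for the same lock, while concurrent writers to one aggregate always queue behind each other.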
Read models keep up by subscribing, not polling. Projections never ask a table "anything new?" on a timer. They subscribe to the bus, and the write path publishes the event bundle the instant it commits. A read model trails its write by a beat, not by a poll interval — and you're not burning idle "is there work yet?" queries when traffic is quiet.
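The difference between subscribing and polling fits in a small in-process sketch. All classes here (`Bus`, `EventStore`, `OpenOrdersProjection`) are hypothetical stand-ins, and the publish call is synchronous for simplicity, where a real bus delivers asynchronously.

```python
from collections import defaultdict

class Bus:
    """In-process stand-in for a message bus."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)

class EventStore:
    def __init__(self, bus):
        self._bus = bus
        self.events = []

    def commit(self, bundle):
        self.events.extend(bundle)
        self._bus.publish("events", bundle)   # push right after the commit

class OpenOrdersProjection:
    """Read model that is updated by pushed bundles; it never queries the store."""
    def __init__(self):
        self.open_orders = set()

    def on_events(self, bundle):
        for kind, order_id in bundle:
            if kind == "OrderPlaced":
                self.open_orders.add(order_id)
            elif kind == "OrderShipped":
                self.open_orders.discard(order_id)

bus = Bus()
store = EventStore(bus)
projection = OpenOrdersProjection()
bus.subscribe("events", projection.on_events)

store.commit([("OrderPlaced", "o-1"), ("OrderPlaced", "o-2")])
store.commit([("OrderShipped", "o-1")])
```

There is no timer and no "anything new?" query anywhere in the sketch: the projection does work only when a commit happens.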
Put together, scaling stops being an architecture project and becomes an operations one. Command, projection, and saga workers all scale as competing consumers; the one deliberate exception is the tamper-evidence hash worker, which stays single-instance by design — it's appending to a single chain, and you don't want two writers fighting over its head.
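Why that one worker appends to a single chain is easier to see with the chain in front of you. The sketch below is a generic hash chain in Python using SHA-256 over a JSON encoding. The record layout, the genesis value, and the function names are all assumptions for illustration and not Stratara's actual format.

```python
import hashlib
import json

GENESIS = "0" * 64    # assumed starting value for the first entry's "prev"

def entry_hash(prev_hash, record):
    """Hash of this record bound to the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()

def append(chain, record):
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "record": record,
        "prev": prev_hash,
        "hash": entry_hash(prev_hash, record),
    })

def first_broken_link(chain):
    """Index of the first entry that fails verification, or None if intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(chain):
        expected = entry_hash(prev_hash, entry["record"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return None

audit = []
append(audit, {"kind": "SubscriptionCancelled", "actor": "agent:support-llm", "subject": "customer:alice"})
append(audit, {"kind": "RefundIssued", "actor": "agent:support-llm", "subject": "customer:alice"})
append(audit, {"kind": "EmailSent", "actor": "agent:support-llm", "subject": "customer:alice"})
```

Each entry's hash depends on the previous entry's hash, so the next append needs the current head. Two concurrent writers would both read the same head and fork the chain, which is why a single appender is the simple choice. Editing an old record breaks verification at that entry; recomputing its hash to hide the edit breaks the link to the entry after it.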
When the bus drops, nothing waits in line
One more thing the hundred-a-second premise demands: the fast path can't be the only path, or a broker hiccup loses commands.
So the dispatcher tries the bus first — the direct, fast publish. Only if the bus is unreachable does the command land in a durable outbox table, where an OutboxWorker re-publishes it once the bus is back. In the normal case there's no outbox round-trip on the hot path at all; the durable net only engages on failure.
Let me be precise, because this is where marketing usually overclaims: the guarantee is at-least-once, not exactly-once. A command can arrive twice — a retry after a crash mid-publish — so handlers are written to be idempotent, and correlation ids make duplicates detectable. "No message silently lost," yes. "Each message exactly once, by magic," no — and anyone selling you the latter without idempotent handlers is selling you something.
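Both halves of that guarantee, the fallback and the idempotent handler, can be sketched in Python. Everything here is a hypothetical stand-in: `FlakyBus` simulates an outage, a plain list plays the durable outbox table, and an in-memory set plays the store of processed correlation ids, which would itself need to be durable in a real system.

```python
class FlakyBus:
    """Bus stand-in that can be switched off to simulate an outage."""
    def __init__(self):
        self.up = True
        self.delivered = []

    def publish(self, message):
        if not self.up:
            raise ConnectionError("bus unreachable")
        self.delivered.append(message)

outbox = []                      # stand-in for the durable outbox table

def dispatch(bus, message):
    """Fast path first; the outbox is only touched when the publish fails."""
    try:
        bus.publish(message)
        return "published"
    except ConnectionError:
        outbox.append(message)
        return "outboxed"

def drain_outbox(bus):
    """What an outbox worker does once the bus is reachable again."""
    while outbox:
        bus.publish(outbox[0])   # publish first...
        outbox.pop(0)            # ...then delete; a crash in between re-sends

processed_ids = set()
totals = {"refunded": 0}

def handle_refund(message):
    """Idempotent handler: a repeated correlation id is a no-op."""
    if message["correlation_id"] in processed_ids:
        return False
    processed_ids.add(message["correlation_id"])
    totals["refunded"] += message["amount"]
    return True
```

The publish-then-delete order in `drain_outbox` is where at-least-once comes from: a crash between the two steps re-sends the message on restart rather than losing it. The duplicate then hits the correlation-id check in the handler and changes nothing.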
Why this is the floor
None of these techniques is mine to claim — reflection-free dispatch, bucketed single-writer locks, push projections, an outbox fallback all predate me, and you could build any of them into your own append-only store. What eats the months is wiring all of them together cleanly, lockstep-versioned, and keeping them honest under load. That's the part I didn't want to pay for twice. If you're in .NET and you want your next system — with or without an agent on top — standing on something production-grade from day one, this is my floor.
Runnable, dependency-free samples are in the repo, docs at docs.stratara.tech. Source-available under FSL-1.1-MIT (not OSI-approved OSS; flips to plain MIT two years after each release).
Reading is one thing, building another. Grab the repo, run a sample, put your own first action on top of it — and bring your ideas for where the foundation could get better. Here in the comments or as an issue on GitHub. A foundation doesn't get good because one person builds it; it gets good because many people use it and push exactly where it still gives. 😉



Top comments (4)
This is a fantastic breakdown and a perspective that desperately needs to be repeated in the industry right now.
We are so caught up in the hype of model benchmarks and parameter counts that people forget an AI model is just an engine: it doesn't matter how powerful it is if you haven't built the chassis, transmission, and wheels around it. True production-ready AI isn't about a raw prompt; it is about the data pipelines, the deterministic guardrails, and the boring, unsexy software engineering that wraps around it. The "wrapper" isn't a bad thing; it's literally the most important part of the architecture if you want reliability. Spot on with this.
This is the thesis I'd tattoo on the industry. The model is the easy part now; the failure is always the missing layer underneath: no state handoff between steps, no verification, no retry policy, no memory of what was already tried. People keep upgrading the model and wondering why the agent still face-plants on real multi-step work, when the bottleneck was never the model. I spent the last year building exactly that "underneath" for Moonshift (the orchestration + verify + deploy harness that turns a prompt into a shipped app), and honestly the model swaps are a footnote next to the harness work. What's the piece of underneath you see teams underbuild most: state, verification, or recovery?
Thanks. Went and actually looked at moonshift.io. Honestly the 6-minute launch isn't what got me. It was the boring stuff ;-). A security scan gated before deploy. Spend ceilings that just abort a run instead of overspending. Most agent demos skip exactly that part. "Can it act" is easy. "Can it stop safely" is the real underneath.
On your question: state is the most dangerously underbuilt, recovery the most openly underbuilt. Teams figure a database has state covered, but a row only knows now. It has no idea what was true when phase 7 made its call, so the agent keeps re-deriving a world that already moved on. Recovery they know they skipped. Panicked manual fixes instead of compensation, plus non-idempotent retries that quietly double the damage. Verification at least gets a try, usually as "we log it." Which proves something happened, not that it was allowed, or that nobody rewrote the record after.
So for me: state first, then recovery, then verification.
Strong framing. The part that resonates is separating model quality from the operational substrate. Once an agent can mutate state, "was the answer good?" becomes less important than "can we replay, attribute, and undo the action?"
I think a lot of teams will discover they need event logs, audit trails, and rollback paths before they need a better prompt.