DEV Community: Malcolm

AI agentic workflows on large codebases

Malcolm — Tue, 09 Jun 2026 06:54:49 +0000

The first post went over some of Edict's capabilities. Over the past week Edict reached v1.0, adding cursors for reading projections after command dispatch (to close some eventual-consistency gaps), a new type of projection that holds state inside the Orleans grain directly instead of a table, saga timeouts, schedules, an improved skills package and MCP server that ships with Edict, and more.

Edict has now grown to over 75,000 lines of code and more than 1,000 tests, and contains several deep mechanisms that have been fixed, broken, and fixed again. It is well past the point where I can hold all of Edict in my head.

This post is about working with AI on large codebases, which I expect to be the first problem most software engineers have to solve.

The context problem

Years ago I was talking to a PhD candidate whose area of research was Natural Language Processing (NLP). He explained to me that one of the most difficult NLP problems was context. If a colleague says they need to pop out to pick their kids up from school, a scene can form in your head: one with a school, the layout of the road, people waiting, walking, driving, the environs. You may never have seen the school your colleague mentioned, but you can form a rich scene from your accumulated experience and use it to drive the rest of the conversation with a shared understanding.

LLMs ingeniously dodge this entire issue by making it your problem.

Just a word-probability machine

Strip away the chat window and a Large Language Model (LLM) is doing one thing: predicting the next token. Give it a run of text and it returns a probability distribution over what comes next, samples one, appends it, and repeats.
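To make that loop concrete, here is a toy sketch of predict, sample, append, repeat. The lookup table standing in for the model is an assumption made purely for illustration: a real LLM conditions on the whole context with a neural network, not on the previous token alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy "model": for each previous token, a hand-written distribution over next tokens.
// Illustrative only; the tokens and probabilities are made up.
var model = new Dictionary<string, (string Token, double P)[]>
{
    ["the"]    = new[] { ("school", 0.6), ("road", 0.4) },
    ["school"] = new[] { ("run", 0.7), ("gate", 0.3) },
    ["road"]   = new[] { ("ahead", 1.0) },
};

var rng = new Random(42);   // a different seed may produce a different sentence
var tokens = new List<string> { "the" };

// Predict, sample, append, repeat until the toy model has no entry for the last token.
while (model.TryGetValue(tokens[^1], out var distribution))
{
    tokens.Add(Sample(distribution, rng));
}

Console.WriteLine(string.Join(" ", tokens));

// Draw one token from a discrete distribution by walking its cumulative probability.
static string Sample((string Token, double P)[] distribution, Random rng)
{
    var roll = rng.NextDouble() * distribution.Sum(d => d.P);
    foreach (var (token, p) in distribution)
    {
        roll -= p;
        if (roll < 0) return token;
    }
    return distribution[^1].Token;   // guard against floating-point rounding
}
```

Changing the seed is enough to change the output, which is the non-determinism in miniature.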

Companies like OpenAI and Anthropic then beat it into shape using techniques like supervised fine-tuning and reinforcement learning, which tune those probabilities in meaningful ways.

That is why Claude is always telling me "Good framing" or "You've spotted...". It even called me "Bold" on one occasion. The probabilities have been shaped to reward me; in other words the predictor is being steered towards outputs deemed preferable to people. This includes a proclivity to always complete a given task, making assumptions to do so, which can be dangerous when writing code.

What matters here is knowing where the gap lies between you and the LLM, because you are not doing the same thing. What the LLM builds is statistical. When your colleague gave you, the human, the input of the school run, you assembled a scene out of a life of standing at school gates. When an LLM is given the same input as text, it assembles associations from text about school runs. Models lean on word association and underproduce exactly the emotional and physical detail a person supplies for free.

That is why it lays context at your feet: it cannot fetch yours, so you have to hand it over.

Side note: affectations and non-determinism

Because the interface is natural language, it is tempting to talk to an LLM like a person. I have found that effective, but two things are worth knowing.

First, non-determinism: the same prompt can give you different answers on different runs. It is a probability engine, not a lookup. For a codebase that means you cannot assume yesterday's good result repeats exactly.

Second, and more useful day to day: the model is acutely sensitive to how you frame things. "My CTO recommended this technology" or "I'm sceptical this will work" can produce entirely different outcomes, because the model tends to mirror the stance you hand it. The trap is telegraphing the answer you want. I stay factual and state goals plainly, not because emotion is forbidden, but because it stops me biasing the reply. It is part of why spec-driven development works so well: a spec is intent stated without a thumb on the scale.

The purpose of an agentic workflow is context

The workflow I use is largely based on Matt Pocock's skills with a few tweaks. There is a lot of great stuff in there, but the absolute minimum I use is four commands, run in order.

`/grill-with-docs`

This is where the context is established. It does four things.

- Reads and maintains a CONTEXT.md (or a CONTEXT-MAP.md plus project-specific CONTEXT.mds for larger solutions), which holds your domain and its relationships. If you look at Edict's CONTEXT.md and are familiar with CQRS / event-driven systems, none of it should surprise you. But this is how I hand my context to every Claude session.
- Reads and maintains Architectural Decision Records (ADRs) for non-obvious, hard-to-reverse decisions. This is how we address bad assumptions made by the LLM. ADR 0051, for example, details how event IDs are stamped once at enqueue time and persist through the idempotency, claim-check, and outbox mechanisms. Changing event IDs would cause havoc for Edict's telemetry and break several mechanisms, including the idempotency layer.
- Walks Edict's surfaces (my addition). Across 45 projects, Edict has:
  - Source generation
  - Roslyn analysers
  - Benchmarks for performance regression
  - A skills package and MCP tooling for other developers' LLM sessions
  - An in-memory testing library, Edict.Testing
  - A gold-standard sample app demonstrating every feature of Edict, including the use of Edict.Testing

  At some point it became a nightmare to keep everything aligned, and I found that adding this as another step kept all my ducks in a row.
- Walks every branch of the decision tree. This is the essential guard against the LLM making bad guesses and assumptions. A mistake made here snowballs into every later step, as well as into future work.

`/to-prd`

Takes the context you established with /grill-with-docs, including every decision you made, analyses it along several axes (schema changes, architectural changes), and works out the implementation details while respecting your domain language and any relevant ADRs.

`/to-issues`

Takes the resulting Product Requirements Document (PRD) and turns it into vertical slices ready for implementation.

`/tdd`

Takes a vertical slice and implements it using a red-green-refactor method. Matt has an excellent video on why this works so well with LLMs.

Conclusion

At first it might be tempting to think "I need a better prompt", and that works for small projects. It does not scale to enterprise systems, or even a mere 75,000 lines of code.

If you accept what an LLM is, a powerful predictor with no access to the scene in your head, the problem shifts: how do you carry your context across every session, for an entire codebase? CONTEXT.md gives it the domain. ADRs give it the decisions it would otherwise guess at. Walking the surfaces and the decision tree catches the bad assumptions before they snowball. PRD, issues, and TDD turn that shared understanding into code, in slices small enough to review (if that's your thing).

The vocabulary differs, but GitHub, Anthropic, Thoughtworks, and many more are all converging on this concept of Context Engineering: spending human effort up front to establish durable context and constraints so LLMs can take the wheel without drifting into dangerous territory.

Edict is past the point where I can hold it in my head. The workflow is how I hold it instead. The model never understands the codebase the way I do, and it does not need to, as long as I keep laying the context at its feet.

Edict: a CQRS framework for .NET, built in two weeks with Claude

Malcolm — Tue, 02 Jun 2026 11:12:38 +0000

"Most of my career was making things the same."

A staff engineer I worked with said that to me once. He was talking about a different framework at the time, but it was exactly the kind of problem Edict solves. That conversation stuck. It was a big driver in building this.

Edict is a CQRS framework for .NET on top of Microsoft Orleans. It absorbs the plumbing every event-driven team rewrites by hand: idempotency keys for at-least-once redeliveries, an outbox that commits state and events atomically, trace propagation across stream hops, and a queryable dead-letter projection.

I built it in two weeks. Claude wrote almost every line of code; I drove the design, reviewed every change, and corrected course when needed.

This post is the short tour. What it does, how it was built, whether it's worth your time.

How this was built

Two weeks of focused sessions, almost entirely agentic. Claude wrote the code; I drove the design and reviewed every change.

The workflow was loosely modelled on Matt Pocock's skills: each feature began as a PRD on the issue tracker, got broken into vertical slices, and landed via red-green-refactor TDD. Domain language lives in CONTEXT.md; load-bearing decisions live in ADRs. Whenever I caught myself making the same correction twice, I codified it as a project skill or analyser.

This is not "look what AI can do alone." Claude is a powerful implementation surface but it needed a human with strong architectural opinions and a clear domain language to be useful. This is what agentic development looks like when the human has done the design work.

The problem

Microsoft Orleans is great. It gives you a programming model where every entity in your system has exactly one in-memory home, on one node, on one thread at a time. The whole class of "two pods, same order, race condition" problems just disappears.

But Orleans is a runtime, not an opinion. The moment you want CQRS, event-driven flows, sagas, projections, or an outbox, you start writing the same plumbing every team writes:

- Idempotency keys for at-least-once redeliveries
- Trace propagation across async stream hops
- Atomic commit of state and the events you raised
- A dead-letter table for the poison message that just took out your aggregate

None of it is hard. All of it is repetitive. And most teams get at least one of them subtly wrong.
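As an illustration of the "atomic commit of state and the events you raised" item, here is a minimal outbox sketch. It is not Edict's implementation: `InMemoryStore`, `Commit`, and `DrainOutbox` are hypothetical names, and the failure is simulated with a flag rather than a real storage error.

```csharp
using System;
using System.Collections.Generic;

var store = new InMemoryStore();

// Commit the new state and the raised event together: both land or neither does.
store.Commit("order-1", "Open", new[] { "OrderPlaced(order-1)" });

// A failing commit leaves both the state and the outbox untouched.
try
{
    store.Commit("order-2", "Open", new[] { "OrderPlaced(order-2)" }, failBeforeCommit: true);
}
catch (InvalidOperationException) { }

// A separate dispatcher drains the outbox and publishes afterwards.
var published = new List<string>();
store.DrainOutbox(published.Add);

Console.WriteLine(string.Join(", ", published));          // prints "OrderPlaced(order-1)"
Console.WriteLine(store.States.ContainsKey("order-2"));   // prints "False"

// Hypothetical in-memory store. A real one would lean on whatever all-or-nothing
// write its storage offers (for example a database transaction).
sealed class InMemoryStore
{
    private readonly Dictionary<string, string> _states = new();
    private readonly Queue<string> _outbox = new();

    public IReadOnlyDictionary<string, string> States => _states;

    public void Commit(string key, string state, IReadOnlyList<string> events, bool failBeforeCommit = false)
    {
        if (failBeforeCommit) throw new InvalidOperationException("simulated failure");
        _states[key] = state;
        foreach (var e in events) _outbox.Enqueue(e);
    }

    public void DrainOutbox(Action<string> publish)
    {
        while (_outbox.Count > 0) publish(_outbox.Dequeue());
    }
}
```

The point of the shape is that publishing is decoupled from the write: an event can only be published if the state change that raised it was stored.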

Edict's bet is that this is a framework's job, not yours.

What it feels like

A command handler is one method:

public partial class OrderCommandHandler : EdictCommandHandler<OrderState>
{
    public Task<EdictCommandResult> HandleAsync(PlaceOrderCommand cmd)
    {
        State.Status = OrderStatus.Open;
        Raise(new OrderPlacedEvent(cmd.OrderId));
        return Task.FromResult<EdictCommandResult>(new EdictCommandResult.Accepted());
    }
}

A validator that gates that handler is one constructor:

public sealed class OrderPlaceCommandValidator : EdictCommandValidator<PlaceOrderCommand>
{
    public OrderPlaceCommandValidator() =>
        RuleFor(x => x.CustomerReference).NotEmpty().WithErrorCode("customer_reference_required");
}

It runs in the same activation turn as the handler, before HandleAsync. A failure short-circuits to EdictCommandResult.Rejected with customer_reference_required as the rejection code; the handler never sees an invalid command and no state mutation occurs.
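The ordering described above can be sketched without any framework. This is a simplified stand-in, not Edict's pipeline: `Dispatch`, `Validate`, `Handle`, `PlaceOrder`, and the string results are all hypothetical, and only the `customer_reference_required` code comes from the example above.

```csharp
using System;

var state = new OrderState();

var rejected = Dispatch(new PlaceOrder("order-1", ""), state);
Console.WriteLine($"{rejected}, status = {state.Status}");   // prints "Rejected(customer_reference_required), status = None"

var accepted = Dispatch(new PlaceOrder("order-1", "REF-001"), state);
Console.WriteLine($"{accepted}, status = {state.Status}");   // prints "Accepted, status = Open"

// Validate first; only a valid command ever reaches the handler.
static string Dispatch(PlaceOrder cmd, OrderState state)
{
    var errorCode = Validate(cmd);
    if (errorCode is not null) return $"Rejected({errorCode})";   // short-circuit: no state mutation

    Handle(cmd, state);
    return "Accepted";
}

// Returns a rejection code, or null when the command is valid.
static string? Validate(PlaceOrder cmd) =>
    string.IsNullOrEmpty(cmd.CustomerReference) ? "customer_reference_required" : null;

static void Handle(PlaceOrder cmd, OrderState state) => state.Status = "Open";

record PlaceOrder(string OrderId, string CustomerReference);

sealed class OrderState
{
    public string Status { get; set; } = "None";
}
```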

An event handler is also one method:

public sealed partial class OrderEmailHandler(IEmailSender email) : EdictEventHandler
{
    public Task HandleAsync(OrderPlacedEvent evt) => email.SendConfirmation(evt.OrderId, evt.EventId);
}

Both sides of an event-driven flow. No Orleans interfaces. No stream wiring. No serialization attributes. No idempotency code. Source generators connect HandleAsync to the right stream based on its parameter type, and the base class deduplicates redeliveries by EventId before ever calling you.
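For a feel of what "deduplicates redeliveries by EventId" means, here is a minimal sketch. It is an assumption-laden illustration, not Edict's base class: `DeduplicatingHandler` and `Deliver` are hypothetical, and the seen-set lives in memory, whereas a real framework would have to persist it alongside state.

```csharp
using System;
using System.Collections.Generic;

var sent = new List<Guid>();
var handler = new DeduplicatingHandler(evt => sent.Add(evt.OrderId));

var placed = new OrderPlaced(Guid.NewGuid(), Guid.NewGuid());

handler.Deliver(placed);
handler.Deliver(placed);   // at-least-once redelivery of the same event

Console.WriteLine(sent.Count);   // prints "1"

record OrderPlaced(Guid EventId, Guid OrderId);

// Hypothetical wrapper that calls the inner handler at most once per EventId.
sealed class DeduplicatingHandler
{
    private readonly HashSet<Guid> _seen = new();
    private readonly Action<OrderPlaced> _inner;

    public DeduplicatingHandler(Action<OrderPlaced> inner) => _inner = inner;

    public void Deliver(OrderPlaced evt)
    {
        // HashSet.Add returns false when the id is already present.
        // Simplification: the id is recorded before the handler runs, so a handler
        // that throws would not be retried here.
        if (!_seen.Add(evt.EventId)) return;
        _inner(evt);
    }
}
```

This only works if a redelivered event carries the same ID it was first given, which is why stamping the ID once matters.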

The whole vocabulary

There are six things you write. That's it.

| Concept | What you write |
| --- | --- |
| Command handler | The aggregate's invariant. Receives a command, mutates state, raises events. |
| Event handler | A side effect. Receives an event, does something (send email, call API). |
| Saga | A long-running coordinator. Reacts to events, sends commands. |
| Projection builder | A read model. Receives events, writes a queryable row. |
| Sender | How callers reach into the system to issue a command. |
| Stream | A topic identity. Where events flow. |

Everything else is the framework's problem: routing, serialization, the outbox, retries, dead-lettering, tracing, the parts of Orleans you'd rather not type.

The flow end to end

One OpenTelemetry trace covers the whole graph. If any handler throws, the failure lands in a dead-letter projection you can query. The aggregate keeps accepting commands.

Pick your substrate

The same handler code runs on either of two reference pairings, both passing the same conformance battery:

| Substrate | Streaming | State |
| --- | --- | --- |
| Azure | Azure Queue Storage | Azure Table Storage |
| Kafka + Postgres | Apache Kafka | PostgreSQL |

Adding a third is a matter of implementing the substrate seam. The framework itself doesn't care which queue or store sits underneath.

Testing without containers

Edict ships an in-memory test app so you can exercise an entire command → event → saga → projection flow in-process. No Orleans cluster, no Azurite, no Docker. A few lines:

await using var app = await EdictTestApp.StartAsync(b => b
    .WithConsumer(typeof(OrderCommandHandler).Assembly));

await app.SendAsync(new PlaceOrderCommand(orderId, "REF-001"));
await app.Drain();

await Verify(await app.GetSagaProgress<OrderPaymentSaga, OrderPaymentProgress>(orderId));

Chaos is on by default. Duplicate redeliveries and bounded reorder are simulated deterministically, so every test you write exercises the at-least-once guarantees production has to tolerate. No setup required.
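"Simulated deterministically" can be pictured as chaos driven by a seeded random source. The sketch below is one possible way to do that, not how Edict.Testing does it: the `Chaos` helper, its `seed` and `window` parameters, and the 50% duplication rate are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var events = Enumerable.Range(1, 6).ToList();

var first = Chaos(events, seed: 7, window: 2).ToList();
var second = Chaos(events, seed: 7, window: 2).ToList();

Console.WriteLine(string.Join(" ", first));
Console.WriteLine(first.SequenceEqual(second));   // prints "True": same seed, same chaos

// Duplicate some items and shuffle within fixed-size windows, driven by a seeded Random.
static IEnumerable<T> Chaos<T>(IReadOnlyList<T> source, int seed, int window)
{
    var rng = new Random(seed);
    for (var start = 0; start < source.Count; start += window)
    {
        var chunk = new List<T>();
        for (var i = start; i < Math.Min(start + window, source.Count); i++)
        {
            chunk.Add(source[i]);
            if (rng.Next(2) == 0) chunk.Add(source[i]);   // simulated duplicate redelivery
        }

        // Fisher-Yates shuffle: an item never leaves its window, so reorder is bounded.
        for (var i = chunk.Count - 1; i > 0; i--)
        {
            var j = rng.Next(i + 1);
            (chunk[i], chunk[j]) = (chunk[j], chunk[i]);
        }

        foreach (var item in chunk) yield return item;
    }
}
```

Because the seed fixes the sequence, a test that fails under one particular reorder fails the same way on the next run.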

AI-assisted development, built in

Edict's whole philosophy is that the framework should absorb the things every team rewrites by hand, so feature devs can focus on feature code. The MCP server and Claude Code skill bundle that ship with Edict apply that same principle to AI tooling: consumers should be able to use Claude productively against Edict without first writing scaffolding to teach the agent what Edict is.

Ask Claude "where does PlaceOrderCommand get routed?" and it calls edict_describe_silo_wiring instead of guessing from grep results. Ask it to add a saga and it calls edict_list_route_keys to see which RouteKey Guids are already taken, so it generates a fresh one instead of colliding. Ask it why a dead-letter behaves a particular way and it calls edict_lookup_adr and returns the source decision.

The two pieces:

- edict-mcp is an MCP server that exposes six tools the agent can call against your live solution
- edict-skills is a Claude Code skill bundle that knows when to call each tool

Together they mean the agent works from your actual code, not from a guess about what an event-driven framework probably looks like:

| Skill (when it fires) | What the agent stops guessing |
| --- | --- |
| edict-authoring (adding a handler, saga, or projection) | Which `RouteKey` Guids are taken, which handlers already exist |
| edict-silo-wiring (touching any `AddEdict*` call) | Which substrate is wired in `Program.cs`, which extensions are missing |
| edict-contracts (attribute or wire-format questions) | What a `Stream` is, why `[Union]` is banned (with the source ADR) |
| edict-diagnostics (debugging dead-letter, outbox, or trace issues) | Why the framework behaves the way it does, with the decision record attached |

I think this is going to matter more over the next year, not less. The frameworks that work well with AI tools are the ones that tell the agent the truth about themselves. That's a lot easier to do for the framework author than for every consumer team to invent on their own.

When it fits

Edict is probably worth a look if:

- You're already on Orleans, or you've been weighing it up
- You want CQRS and event-driven flows without handwriting the plumbing
- You like writing C# and want one programming model across every entity in the system
- You're curious what agentic development can produce when the human in the loop knows the design space cold

It's probably not for you if:

- A REST API over EF Core is enough. You genuinely don't need this.
- You can't move to .NET 10
- You need a battle-tested production framework today. Edict is portfolio quality. The design is solid and the test coverage is real, but no one is running it at a Fortune 500 yet, including me.

I'd rather you know that up front than find out later.

Try it

- Repo: https://github.com/MalcolmMcNeely/Edict
- Getting started: docs/usage/getting-started.md
- The Sample app: clone the repo, run the Aspire AppHost, and watch commands flow through to a Blazor dashboard in real time
- The reasoning: every non-obvious design decision lives in docs/adr/

If you spot something missing, surprising, or wrong, a GitHub issue is the fastest way to reach me. I'm also on LinkedIn.

Thanks for reading.