Malcolm

Posted on Jun 9 • Originally published at malcolmania.co.uk

AI agentic workflows on large codebases

#csharp #ai #programming #dotnet

The first post went over some of Edict's capabilities. Over the past week Edict went v1.0, adding cursors for reading projections after command dispatch (to close some eventual-consistency gaps), a new type of projection that holds state inside the Orleans grain directly instead of a table, saga timeouts, schedules, an improved skills package and MCP server that ships with Edict, and more.

Edict has now grown to over 75,000 lines of code and more than 1,000 tests, and contains several deep mechanisms that have been fixed, broken, and fixed again. It is well past the point where I can hold all of Edict in my head.

This post is about working with AI on large codebases, which I expect to be the first problem most software engineers have to solve.

The context problem

Years ago I was talking to a PhD candidate whose area of research was Natural Language Processing (NLP). He explained to me that one of the most difficult NLP problems was context. If a colleague says they need to pop out to pick their kids up from school, a scene can form in your head: one with a school, the layout of the road, people waiting, walking, driving, the environs. You may never have seen the school your colleague mentioned, but you can form a rich scene from your accumulated experience and use it to drive the rest of the conversation with a shared understanding.

LLMs ingeniously dodge this entire issue by making it your problem.

Just a word-probability machine

Strip away the chat window and a Large Language Model (LLM) is doing one thing: predicting the next token. Give it a run of text and it returns a probability distribution over what comes next, samples one, appends it, and repeats.
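The loop described above can be sketched in a few lines. This is a toy, not how any real model is built: the `NEXT_TOKEN_PROBS` table and the `generate` function are made up for illustration, and the table only looks at the previous token, whereas a real LLM computes its distribution from the whole preceding text with a neural network.

```python
import random

# Toy "model": for each token, a hand-written probability distribution over
# the next token. Purely illustrative; a real LLM computes this distribution
# from all of the preceding text.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"school": 0.6, "kids": 0.4},
    "school": {"run": 0.7, "gates": 0.3},
    "kids": {"wait": 1.0},
    "run": {"<end>": 1.0},
    "gates": {"<end>": 1.0},
    "wait": {"<end>": 1.0},
}


def generate(rng: random.Random, max_tokens: int = 10) -> list[str]:
    """Predict a distribution, sample one token, append it, repeat."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        distribution = NEXT_TOKEN_PROBS[tokens[-1]]
        candidates = list(distribution)
        weights = list(distribution.values())
        # Sample one token in proportion to its probability.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker


if __name__ == "__main__":
    print(" ".join(generate(random.Random())))
```

Everything else (the chat window, the tools, the "agent") is built around repeating that one step.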

Companies like OpenAI and Anthropic then beat it into shape using techniques like supervised fine-tuning and reinforcement learning, which tune those probabilities in meaningful ways.

That is why Claude is always telling me "Good framing" or "You've spotted...". It even called me "Bold" on one occasion. The probabilities have been shaped to reward me; in other words, the predictor is steered towards outputs that people tend to prefer. That steering includes a proclivity to always complete a given task, making assumptions along the way if it has to, which can be dangerous when writing code.

What matters here is knowing where the gap lies between you and the LLM, because the two of you do not build understanding in the same way. What the model builds is statistical. When your colleague gave you, the human, the input of the school run, you assembled a scene out of a life of standing at school gates. When an LLM is given the same input as text, it assembles associations from text about school runs. Models lean on word association and underproduce exactly the emotional and physical detail a person supplies for free.

That is why it lays context at your feet: it cannot fetch yours, so you have to hand it over.

Side note: affectations and non-determinism

Because the interface is natural language, it is tempting to talk to an LLM like a person. I have found that effective, but two things are worth knowing.

First, non-determinism: the same prompt can give you different answers on different runs. It is a probability engine, not a lookup table. For a codebase that means you cannot assume yesterday's good result repeats exactly.
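A small sketch of why that happens, assuming a made-up distribution over three possible next tokens (the tokens and probabilities are invented for illustration). Sampling from it gives different answers across runs. Always picking the most likely token is shown only for contrast; it is not something the post's workflow relies on.

```python
import collections
import random

# Hypothetical distribution a model might assign to the next token.
NEXT_TOKEN_PROBS = {"refactor": 0.5, "rewrite": 0.3, "delete": 0.2}


def sample(rng: random.Random) -> str:
    """Pick one token in proportion to its probability."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy() -> str:
    """Always pick the single most likely token (shown for contrast only)."""
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)


def run_many(runs: int, seed=None) -> collections.Counter:
    """Ask the 'same prompt' many times and count the answers."""
    rng = random.Random(seed)
    return collections.Counter(sample(rng) for _ in range(runs))


if __name__ == "__main__":
    print("sampled:", dict(run_many(1000)))
    print("greedy: ", greedy())
```

Over many runs the less likely tokens still turn up regularly, which is the same effect as a prompt that worked yesterday producing something different today.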

Second, and more useful day to day: the model is acutely sensitive to how you frame things. "My CTO recommended this technology" or "I'm sceptical this will work" can produce entirely different outcomes, because the model tends to mirror the stance you hand it. The trap is telegraphing the answer you want. I stay factual and state goals plainly, not because emotion is forbidden, but because it stops me biasing the reply. It is part of why spec-driven development works so well: a spec is intent stated without a thumb on the scale.

The purpose of an agentic workflow is context

The workflow I use is largely based on Matt Pocock's skills with a few tweaks. There is a lot of great stuff in there, but the absolute minimum I use is four commands, run in order.

`/grill-with-docs`

This is where the context is established. It does four things.

- Reads and maintains a CONTEXT.md (or a CONTEXT-MAP.md plus project-specific CONTEXT.mds for larger solutions), which holds your domain and its relationships. If you look at Edict's CONTEXT.md and are familiar with CQRS / event-driven systems, none of it should surprise you. But this is how I hand my context to every Claude session.
- Reads and maintains Architectural Decision Records (ADRs) for non-obvious, hard-to-reverse decisions. This is how we address bad assumptions made by the LLM. ADR 0051, for example, details how event IDs are stamped once at enqueue time and persist through the idempotency, claim-check, and outbox mechanisms. Changing event IDs would cause havoc for Edict's telemetry and break several mechanisms, including the idempotency layer.
- Walks Edict's surfaces (my addition). Across 45 projects, Edict has:
  - Source generation
  - Roslyn analysers
  - Benchmarks for performance regression
  - A skills package and MCP tooling for other developers' LLM sessions
  - An in-memory testing library, Edict.Testing
  - A gold-standard sample app demonstrating every feature of Edict, including the use of Edict.Testing

  At some point it became a nightmare to keep everything aligned, and I found that adding this as another step kept all my ducks in a row.
- Walks every branch of the decision tree. This is the essential guard against the LLM making bad guesses and assumptions. A mistake made here snowballs into every later step, as well as into future work.
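To make the ADR 0051 example above concrete, here is a minimal sketch of the kind of invariant such an ADR protects. It is not Edict's code or API (Edict is C# on Orleans): `Envelope`, `Outbox`, `enqueue`, and `deliver` are hypothetical names, and the claim-check mechanism is left out entirely. The only point is that the ID is stamped once at enqueue time and the idempotency check depends on it never changing.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Envelope:
    """An event plus the ID it was given at enqueue time (hypothetical shape)."""
    event_id: str
    payload: dict


@dataclass
class Outbox:
    """Hypothetical stand-in for an outbox with an idempotency check."""
    pending: list = field(default_factory=list)
    seen_ids: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def enqueue(self, payload: dict) -> Envelope:
        # The ID is stamped exactly once, here, and never regenerated.
        envelope = Envelope(event_id=str(uuid.uuid4()), payload=payload)
        self.pending.append(envelope)
        return envelope

    def deliver(self, envelope: Envelope) -> bool:
        # Idempotency: a redelivery carries the same ID, so it is skipped.
        if envelope.event_id in self.seen_ids:
            return False
        self.seen_ids.add(envelope.event_id)
        self.delivered.append(envelope)
        return True


if __name__ == "__main__":
    outbox = Outbox()
    event = outbox.enqueue({"order": 1})
    print(outbox.deliver(event))  # first delivery is accepted
    print(outbox.deliver(event))  # retry with the same ID is ignored
```

An LLM that "helpfully" regenerates the ID somewhere downstream would make every retry look like a new event, which is exactly the sort of plausible-looking assumption an ADR is there to rule out.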

`/to-prd`

Takes the context you established with /grill-with-docs, including every decision you made, analyses it along several axes (schema changes, architectural changes), and works out the implementation details while respecting your domain language and any relevant ADRs.

`/to-issues`

Takes the resulting Product Requirements Document (PRD) and turns it into vertical slices ready for implementation.

`/tdd`

Takes a vertical slice and implements it using a red-green-refactor method. Matt has an excellent video on why this works so well with LLMs.
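As a rough illustration of the red-green-refactor loop, assume a tiny, hypothetical slice: deciding whether a saga has passed its timeout. The function `is_timed_out`, the test, and the `run_test` helper are invented for this sketch and are not taken from Edict or from the `/tdd` skill itself.

```python
from datetime import datetime, timedelta


def run_test(test) -> str:
    """Run one test function and report 'red' (failing) or 'green' (passing)."""
    try:
        test()
    except (AssertionError, NotImplementedError):
        return "red"
    return "green"


# Step 1 (red): the test is written first, against a stub with no behaviour.
def is_timed_out(started_at, now, timeout):
    raise NotImplementedError


def test_saga_times_out_after_deadline():
    started = datetime(2024, 1, 1, 12, 0)
    limit = timedelta(minutes=5)
    assert is_timed_out(started, started + timedelta(minutes=6), limit)
    assert not is_timed_out(started, started + timedelta(minutes=4), limit)


first_run = run_test(test_saga_times_out_after_deadline)


# Step 2 (green): the smallest implementation that makes the test pass.
def is_timed_out(started_at, now, timeout):
    return now - started_at > timeout


second_run = run_test(test_saga_times_out_after_deadline)

# Step 3 (refactor): tidy the implementation while the test stays green.

if __name__ == "__main__":
    print(first_run, second_run)
```

The failing test pins down the behaviour before any implementation exists, so the model has a concrete target to hit instead of an assumption to make.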

Conclusion

At first it might be tempting to think "I need a better prompt", and that works for small projects. It does not scale to enterprise systems, or even a mere 75,000 lines of code.

If you accept what an LLM is, a powerful predictor with no access to the scene in your head, the problem shifts: how do you carry your context across every session, for an entire codebase? CONTEXT.md gives it the domain. ADRs give it the decisions it would otherwise guess at. Walking the surfaces and the decision tree catches the bad assumptions before they snowball. PRD, issues, and TDD turn that shared understanding into code, in slices small enough to review (if that's your thing).

The vocabulary differs, but GitHub, Anthropic, Thoughtworks, and many more are all converging on this concept of Context Engineering: spending human effort up-front to establish durable context and constraints so LLMs can take the wheel without drifting into dangerous territory.

Edict is past the point where I can hold it in my head. The workflow is how I hold it instead. The model never understands the codebase the way I do, and it does not need to, as long as I keep laying the context at its feet.
