DEV Community: Aming

Route Context: How I Built Right Context for Agents

Aming — Sat, 30 May 2026 12:28:20 +0000

We already knew agents needed the right context. Dogfooding taught us the harder problem: making that context live, visible, auditable, and enforceable inside a multi-agent workflow.

The failure looked like success.

A worker changed one file. The focused test passed. The patch was small, readable, and locally correct. If you only looked at the worker's report, the job was done.

But the overall job was still wrong.

The change touched a permission-sensitive path. It needed evidence that the right role had acted, that the project map still matched the code, that the audit trail could explain why the edit was allowed, and that the final close check had not been skipped. Instead, the system had solved the nearest local problem. A lane that should have preserved the shape of the work had drifted into implementation. The worker had completed its tiny job, while the whole route had lost its meaning.

Nobody had to make a dramatic mistake for this to happen. That was the unsettling part. The worker was not lazy. The test was not fake. The prompt was not empty. The system had plenty of context. It just did not have route context.

Route context is the per-task, per-lane runtime packet that tells an agent what job it is in, what role it has, what it saw, what it cannot do, and what evidence must pass before the work is done.

That is the positioning of this article. I am not trying to convince technical readers that right context matters. Most people building with agents have already learned that, usually the hard way. The point is how we made right context operational for agent development: route context, a live artifact that turns the principle into prompt-visible, hashable, auditable, gate-checkable workflow state.

We started with a different hope. We wanted agent development to feel zero-orchestration-ish: the user says what they want, the system understands the work, the right agents handle the right pieces, and nobody has to manually conduct a meeting of subagents. The observer would keep things coherent. Workers would implement. Review lanes would check evidence. Validation would catch bad finishes.

Then we used it on ourselves.

Again and again, the same failure appeared in different clothes. The observer would begin correctly, then collapse into a nearby code edit. A worker would pass a test, but not satisfy the larger obligation. A reviewer would evaluate plausible reasoning, but not the same evidence the worker actually produced. A prompt would contain lots of useful system knowledge, but not the specific promises this worker had to satisfy.

That is when the thesis became operational: AI agent development needs right context, not more context, and route context is how we made that sentence executable.

The Words We Needed

We had to translate our internal language into something simple enough to survive a live task.

Topology means what kind of job this really is. Is this a tiny deterministic bug fix, a permission change, a runtime change, a graph update, a public decision, or a mixed task that needs independent lanes?

A contract is the set of promises a worker must satisfy: target files, acceptance criteria, out-of-scope areas, allowed actions, blocked actions, and evidence to return.

A gate is a check that prevents a false finish. A test can be a gate, but it is not the only one. A close gate may reject a change because the audit evidence is missing, the runtime was not redeployed, the project map is stale, or the wrong role acted.

Graph, backlog, timeline, and contract are the governance layers that preserve project state, user intent, execution evidence, and allowed action. The graph is the project map. The backlog records the work and acceptance criteria. The timeline records what happened. The contract says what this lane may do.

Once those words were clear, the bug was easier to see. Our static skill text explained the system, but the live task needed route context: the current path through the work, with the role, lane, injected context, and checks bound to the action about to happen.

Static skill text is only a bootloader. It can teach the agent that observers, workers, review lanes, graphs, and gates exist. It cannot reliably decide what matters for this task, this lane, this file, this permission boundary, and this moment. Route context is the runtime packet assembled for that decision.

The Route Context Artifact

The useful fix was not a bigger prompt. It was a route context alert: a small implementation artifact generated before lane dispatch, during topology classification and prompt-contract assembly. It describes what the lane is allowed to do, what was injected into its prompt, and what evidence will be checked later.

A simplified route context alert looks like this:

route_context_alert:
  task_intent: "fix permission handling without changing audit policy"
  role_boundary: "implementation worker; no merge, close, or graph mutation"
  topology: "permission-sensitive bug; requires independent review"
  contract: "edit accepted target file; run focused test; report evidence"
  blocked_actions: ["change route policy", "close backlog", "redeploy runtime"]
  visible_injection_manifest: ["route_doc@sha256:...", "contract@sha256:..."]
  evidence_gates: ["focused_test", "independent_validation", "close_gate"]

This is not magic model control. The model can still misunderstand things. The point is that the important context is visible in the prompt, listed in an audit manifest, and tied to workflow checks that can reject a false finish.

The visible injection manifest was especially important. If a document, decision summary, expert note, or implementation contract influences a lane, it should appear in the manifest with an id, kind, source reference, and hash. The hash proves the identity of the injected artifact. Gates and evidence decide whether the lane satisfied the contract. Without the manifest, nobody can reconstruct what the agent actually saw.
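As a rough illustration of the manifest idea, here is a minimal Python sketch. The field names (`artifact_id`, `kind`, `source_ref`, `sha256`) and the path `contracts/worker.md` are assumptions chosen for this example, not the actual Aming Claw schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    artifact_id: str  # hypothetical field names, for illustration only
    kind: str         # e.g. "route_doc", "contract", "expert_note"
    source_ref: str   # where the injected artifact came from
    sha256: str       # identity of the exact bytes that were injected


def manifest_entry(artifact_id: str, kind: str, source_ref: str, content: bytes) -> ManifestEntry:
    """Record an injected artifact, identified by a hash of its content."""
    return ManifestEntry(artifact_id, kind, source_ref, hashlib.sha256(content).hexdigest())


def verify_entry(entry: ManifestEntry, content: bytes) -> bool:
    """True only if the content is byte-for-byte what the manifest recorded."""
    return hashlib.sha256(content).hexdigest() == entry.sha256


contract_text = b"edit accepted target file; run focused test; report evidence"
entry = manifest_entry("contract-1", "contract", "contracts/worker.md", contract_text)
```

The hash answers only "is this the same artifact?". Whether the lane satisfied the contract is still a question for gates and evidence.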

Route context gave us a way to keep orchestration minimal without making it invisible. The system still feels close to zero-orchestration from the user's side, but the route carries enough explicit structure that lanes do not silently blend together.

The Implementation Pattern In Aming Claw

This is not only language we use around the system. Aming Claw implements the pattern as code-level contracts, local gates, audited queries, and append-only evidence.

The route prompt contract is source-controlled in the mf_workflow_runtime.v1.json template. That contract makes route-owned prompt context explicit: the injected artifacts are listed in a visible manifest, observer and review lanes are blocked from drifting into implementation, and the worker has to carry matching route_context_hash, prompt_contract_id, and prompt_contract_hash values. In other words, the prompt is no longer just a bag of helpful text. It has identity.
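One way to give a prompt packet identity is to hash a canonical serialization of it and require the worker to echo the values back. The sketch below assumes JSON-serializable packets; `canonical_hash`, `identity_mismatches`, and the id `contract-1` are hypothetical names, and only the three field names come from the contract described above.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Hash a JSON-serializable packet; sorted keys make the hash independent of key order."""
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def identity_mismatches(expected: dict, echoed: dict) -> list:
    """Return the identity fields the worker failed to echo back unchanged."""
    fields = ("route_context_hash", "prompt_contract_id", "prompt_contract_hash")
    return [name for name in fields if expected.get(name) != echoed.get(name)]


route_context = {
    "task_intent": "fix permission handling without changing audit policy",
    "blocked_actions": ["change route policy", "close backlog", "redeploy runtime"],
}
expected = {
    "route_context_hash": canonical_hash(route_context),
    "prompt_contract_id": "contract-1",  # hypothetical id
    "prompt_contract_hash": canonical_hash({"contract": "edit accepted target file"}),
}
```

A worker response whose echoed values differ from `expected` can then be rejected before anyone reads its prose.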

Before a bounded worker is handed the job, Aming Claw runs a local dispatch gate in mf_subagent_contract.py. The gate checks the worker's branch, worktree, base commit, target head, merge queue, fence token, route hash, prompt hash, and owned files. Same-worktree dispatch is blocked by default because "please stay in this directory" is not a boundary. The boundary has to be represented in durable facts the system can re-check.
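A dispatch gate of this kind can be sketched as a function that returns reasons to refuse. The version below covers only a subset of the checks listed above (worktree, fence token, route hash, owned files). The `Handoff` type and its fields are assumptions for illustration and are not taken from mf_subagent_contract.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    # Hypothetical shape of the durable facts recorded at handoff time.
    observer_worktree: str
    worker_worktree: str
    fence_token: str
    route_context_hash: str
    owned_files: tuple


def dispatch_blockers(handoff: Handoff, expected_route_hash: str,
                      allow_same_worktree: bool = False) -> list:
    """Return reasons to refuse dispatch; an empty list means the handoff may proceed."""
    reasons = []
    if handoff.worker_worktree == handoff.observer_worktree and not allow_same_worktree:
        reasons.append("same-worktree dispatch is blocked by default")
    if not handoff.fence_token:
        reasons.append("missing fence token")
    if handoff.route_context_hash != expected_route_hash:
        reasons.append("route hash mismatch")
    if not handoff.owned_files:
        reasons.append("no owned files declared")
    return reasons
```

Returning reasons rather than a bare boolean keeps the refusal itself auditable.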

When the worker returns, the finish gate in the same module treats the response as a claim, not as truth. The finish validation requires the fence token to match, tests to pass, blockers to be absent, a checkpoint id to exist, and the worker identity to still match the handoff. This is the difference between "the agent says it is done" and "the route can safely advance."
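Treating the response as a claim could look roughly like this. The dictionary keys (`fence_token`, `worker_id`, `tests_passed`, `blockers`, `checkpoint_id`) are assumed names for the five conditions in the paragraph above, not the module's real fields.

```python
def finish_blockers(handoff: dict, claim: dict) -> list:
    """Compare a worker's finish claim against the recorded handoff.

    Returns the reasons the route cannot advance; empty means it can.
    """
    reasons = []
    if claim.get("fence_token") != handoff["fence_token"]:
        reasons.append("fence token mismatch")
    if claim.get("worker_id") != handoff["worker_id"]:
        reasons.append("worker identity changed since handoff")
    if not claim.get("tests_passed", False):
        reasons.append("tests did not pass")
    if claim.get("blockers"):
        reasons.append("worker reported blockers")
    if not claim.get("checkpoint_id"):
        reasons.append("missing checkpoint id")
    return reasons


handoff = {"fence_token": "t-7", "worker_id": "worker-a"}
good_claim = {
    "fence_token": "t-7",
    "worker_id": "worker-a",
    "tests_passed": True,
    "blockers": [],
    "checkpoint_id": "cp-1",
}
```

Missing keys count as failures here, so a response that simply omits evidence cannot pass by accident.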

The audit trail follows the same idea. The task_timeline.py module is append-only execution evidence. Its close gate expects the route to have the right event kinds: implementation, verification, and close_ready. The later close verification checks those facts before a backlog item can be honestly closed.
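A minimal sketch of an append-only timeline with a close check follows. The three required event kinds come from the paragraph above. The `Timeline` class and `missing_for_close` helper are hypothetical and keep events in memory, whereas a real timeline would persist them.

```python
REQUIRED_KINDS = ("implementation", "verification", "close_ready")


class Timeline:
    """Append-only event log: events can be added and read, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, kind: str, detail: str) -> int:
        seq = len(self._events) + 1
        self._events.append((seq, kind, detail))  # tuples are immutable once stored
        return seq

    @property
    def events(self) -> tuple:
        return tuple(self._events)  # callers get a read-only copy


def missing_for_close(timeline: Timeline) -> list:
    """Event kinds the close gate still needs before the backlog item may close."""
    seen = {kind for _, kind, _ in timeline.events}
    return [kind for kind in REQUIRED_KINDS if kind not in seen]
```

With only an implementation event recorded, the helper reports that verification and close_ready are still missing, which is exactly the "test passed, work not done" case.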

Even project knowledge is handled this way. Graph context is not dumped wholesale into the prompt. It is queried through an audited graph-query trace and exposed through the MCP graph_query surface, so later review can ask what the agent looked up instead of guessing. The public manual-fix SOP names the same workflow: route, contract, timeline, and close gates are required evidence, and dispatch has to prove the worker's fenced identity before handoff (see its timeline and contract gates and its dispatch requirements).

That is the implementation pattern: right context becomes a chain of small, checkable facts. The agent can reason with them, but it does not get to be the only witness.

Before And After

Before, the agent confidently finished the wrong local job.

It saw a failing test and fixed the code. It saw a nearby file and edited it. It saw a plausible completion story and wrote one. The observer forgot what kind of job this really was. The worker optimized inside its local patch. The review or validation step, if present, evaluated the local result instead of the global obligation.

After, route context requires the observer to start by preserving global state. It names the topology: tiny fix, permission-sensitive change, runtime change, graph-impacting change, major decision, or something else. It dispatches lanes accordingly. The worker acts inside a contract. The architecture review lane checks whether the route makes sense. Validation checks evidence, not just confidence. Later validation and close gates check the route context alert, manifest, and returned evidence before accepting the result, and they reject it when those do not line up.

That rejection matters. A good system must be allowed to say, "The test passed, but the work is not done."

For example: the focused test passed, but the route context hash did not match the worker contract. Or the code changed the right file, but the audit evidence did not prove the right role acted. Or the patch was correct, but the graph, meaning the project map, was now stale. Or runtime needed redeploy before anyone could claim the fix was live. Route context makes those checks explicit instead of hoping a model remembers them.

What The Observer Is For

The observer's advantage is not that it manages subagents. That framing makes it sound like a tiny project manager.

The observer's real advantage is global-state custody. Route context gives it something concrete to preserve: route identity, dirty scope, graph/current state, runtime status, backlog state, tests, close gate requirements, and follow-up work.

Workers should be local. That is their strength. A bounded worker should know its target files, acceptance criteria, blocked actions, focused tests, and required evidence. It should not need to carry the whole project in its head. When every worker receives the whole world, prompts get heavier and guarantees get weaker.

The observer keeps the larger surfaces connected through route context. Did this lane have permission to edit? Did it stay inside its file fence? Did the test prove the actual promise or just a nearby behavior? Did an independent reviewer inspect the same packet the worker produced? Did the runtime or graph need an update? Is there follow-up work outside the worker's scope?

These checks are not glamorous. They are what stop a green test from becoming a false finish.

Efficiency Without The Theater

The biggest improvement was not raw wall-clock speed.

Parallel lanes can help. Architecture review, implementation, and validation lanes expose different gaps earlier than one linear worker. But the real win was effective efficiency and quality: fewer locally correct patches that could not be honestly closed, less rework after review, and earlier discovery of missing evidence.

The system became calmer because it stopped treating "done locally" as "done globally." Some gates remain serial on purpose. Commit, runtime redeploy, graph reconcile, and backlog close mutate shared state or claim shared state is current. Those steps should not be casually parallelized just because multiple agents are available.

This is the correction to zero-orchestration-ish design. The goal is not to hide all orchestration. Hidden orchestration is unreliable because nobody can audit what happened. Route context keeps orchestration visible and minimal at the points where role, evidence, and shared state matter.

The Rule We Use Now

A tiny deterministic edit can be one agent plus a focused test. If route context says the blast radius is clear, the file ownership is obvious, and there is no permission, audit, graph, or runtime implication, keep it simple. Give the worker a tight contract and verify the behavior.

P1 and P0 work is different. So are routing, permission, audit, graph, and runtime tasks. Those need an observer, an architecture review lane, an implementation worker, and independent validation. Major decisions need adversarial lanes: separate expert packets, an independent review lane comparing evidence, and an observer final decision. Not because important work deserves ceremony, but because important work has more ways to be locally successful and globally wrong.
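The rule above can be written down as a small lookup. This is a sketch of the decision only. The lane names and the `lanes_for` function are invented for illustration, and the real classifier may weigh more signals than priority and touched surfaces.

```python
# Surfaces that the rule above treats as needing the full set of lanes.
SENSITIVE_SURFACES = {"routing", "permission", "audit", "graph", "runtime"}


def lanes_for(priority: str, surfaces: set, major_decision: bool = False) -> list:
    """Pick the lanes a task needs from its priority and the surfaces it touches."""
    if major_decision:
        # Adversarial lanes: separate expert packets, then comparison, then a decision.
        return ["expert_packet_a", "expert_packet_b", "independent_review", "observer_decision"]
    if priority in {"P0", "P1"} or surfaces & SENSITIVE_SURFACES:
        return ["observer", "architecture_review", "implementation_worker", "independent_validation"]
    # Tiny deterministic edit: one agent plus a focused test.
    return ["implementation_worker", "focused_test"]
```

For example, a low-priority typo fix gets two lanes, while the same priority touching `permission` gets four.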

The checklist is compact:

Name the topology: what kind of job this really is.
Bind the worker contract: target files, acceptance criteria, and evidence.
List blocked actions: what this lane must not do.
Expose injected context: manifest the artifacts and hashes the lane saw.
Make gates reject false finishes: tests, validation, and close checks must be able to say no.

That is the practical lesson we got from dogfooding. More context made agents sound more informed. Route context made right context safer to act on.

For us, right context stopped being a prompt-writing aspiration when it became route context: a route that knows what matters now, a contract that bounds the worker, a manifest that shows what was injected, and gates that refuse to confuse a passing test with a finished job.

AI's tech debt is invisible — even to AI. I solved it at the architecture layer.

Aming — Sat, 23 May 2026 03:58:23 +0000

TL;DR — AI repeats your patterns badly, ignores existing services, and forgets every cross-session lesson you taught it. This isn't laziness — it's a new kind of tech debt: invisible, systemic, and architectural. Project memory hints don't scale. Bigger context windows don't help. The fix is structural: pin a graph projection of your codebase to every commit, let AI read it before writing, surface "graph stale" prompts when source drifts. Real commit receipts from my own OSS project aming-claw inline. Architects, change my mind in the comments.

What is AI tech debt?

Let me define this precisely, because it's a different beast from the tech debt you already know.

Dimension	Traditional tech debt	AI tech debt
Who creates it	Engineers (knowingly)	AI (unknowingly)
Awareness	Conscious tradeoff	AI doesn't know it's accruing
Fix lifecycle	Fix once, done	Every new session repeats it
Visibility	`git log` shows it	Invisible across sessions
Scale	Team-bounded	Systemic, AI-generated

The core asymmetry: the more your team uses AI for coding, the more invisible debt accrues — and you have no tool that sees it.

5 symptoms (diagnose yourself)

Run this checklist against your team:

❌ AI re-implemented a service that already exists
❌ AI shipped code using a pattern completely inconsistent with everything around it
❌ AI didn't see the implementation sitting in the next file over
❌ Every new session repeats the same mistakes you corrected last time
❌ AI treats a familiar codebase as if it were brand new

Three or more? You're accruing AI tech debt. The bigger your team and the more AI you use, the faster it compounds.

A real case study: my toolboxclient stateService

I'm the maintainer of toolboxclient (open-source cross-platform AI agent runtime, 274+ stars). I asked AI to add a stateService.

The directory server/services/ already contained, in clear sight:

TOOLBOXCLIENT/server/services/
├── fingerPrintService.js
├── memoryService.js
├── providerModelService.js
├── proxyService.js
├── taskService.js
├── toolServiceManager.js
├── walletService.js
└── webSocketService.js

Eight services in this listing, all sharing the same HTTP pattern.

What AI shipped (commit 68487cc, 2026-03-19):

// AI's version: WebSocket-based StateClient with Proxy
class StateClient {
  constructor(agentName) {
    // 🚨 WebSocket, not HTTP — inconsistent with every other service in the folder
    this.ws = new WebSocket(...)
    this._data = {}
    this.state = this._createProxy()
  }

  _createProxy() {
    // Proxy traps to broadcast via WebSocket
    return new Proxy(this._data, { ... })
  }
}

It used WebSocket instead of HTTP. It used a Proxy-based intercept-and-broadcast pattern unlike anything else in the codebase. It built a parallel architecture next to an established one.

This wasn't a code bug. It was a pattern bug. AI literally couldn't see the existing services.

The first fix: project memory

My first instinct: add a hint to project memory.

use existing HTTP services, don't add WebSocket

AI refactored cleanly (commit bbdf82c, 2026-03-21):

feat: stateService Phase A+B — HTTP CRUD + SSE broadcast

Phase A: /api/state/* routes (read, write, session CRUD, language pref)
Phase B: SSE subscribe endpoint with topic filtering + EventBus broadcast

74/74 tests pass. No breaking changes — additive only.

WebSocket gone. HTTP CRUD + SSE matching the existing pattern. Clean fix.

For about ten seconds, I thought I'd solved it.

Why project memory hints don't scale

Then I realized something uncomfortable:

This catch only worked because I noticed.

The next AI session would start with zero memory of this lesson.
Every context window starts as a blank slate.

This is the systemic nature of AI tech debt:

AI can't see existing patterns when it writes
I see it → I fix it once → the fix doesn't propagate to future sessions
Manual project memory maintenance puts the work back on me, not AI
This doesn't scale — and the failure mode is silent

The first insight

I stopped trying to fix prompts and started looking at the structural problem:

AI agents don't need bigger context windows.

They need a persistent structural record of the project that survives across sessions.

Context windows are short-term memory. What's missing is long-term, project-level memory — something any AI session can read before writing.

This is the insight that turned into aming-claw.

Building aming-claw (and falling into the next trap)

The idea: give every AI session a queryable graph of the project. Files, modules, functions, patterns — all of it, machine-readable, persistent.

Scan the codebase → build a graph of all entities and relations
Expose it through an MCP server that any agent can query
AI reads the graph before writing
Graph persists across sessions

I built it. It worked. Then it broke — at a higher layer.

I had implemented the graph with:

Mutable nodes — agents could edit graph state directly
A patch pipeline — 5-stage mutation flow (propose → validate → review → apply → snapshot)
A graph editor UI — humans could also edit the graph

Within a few weeks, the graph drifted from the actual code.

Why? Because I had created a second source of truth:

The real source of truth was source code
But I also let the graph be directly mutated
The two sources inevitably diverged

Same trap. Higher layer.

The real architectural insight

After hitting the same trap twice, the answer crystallized:

~~The graph is something you edit.~~

The graph is a projection of the commit.

In concrete terms:

Every commit can correspond to one graph

git commit (modifies source / hints / config)
     ↓
system detects: HEAD ≠ graph's bound commit
     ↓ ⚠️ "graph stale" prompt
user decides when to reconcile
     ↓ user-triggered
fixed_algorithm(source + hints + config)
     ↓
new graph snapshot ←→ new commit hash

4 key invariants

#	Invariant	What it guarantees
1	Fixed algorithm	Same input → same graph (deterministic, no randomness)
2	1:1 binding	Each reconciled commit hash maps to exactly one graph snapshot
3	User-triggered	Reconciliation is explicit, not a background git hook
4	Stale prompt	System surfaces drift in dashboard / CLI; user triggers when ready

Why not a git hook?

A reasonable question: why not auto-rebuild the graph on every commit via a git hook?

Three reasons I deliberately didn't:

Reconciliation is expensive (full codebase scan + algorithm)
Surprise auto-builds destabilize state — user should control when state changes
Batching commits before a single reconcile is often what users want

The system shows a graph stale indicator in dashboard and CLI. Users reconcile when they're ready. This is a deliberate design choice, not a limitation.
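The stale check itself reduces to comparing two commit hashes. Below is a minimal sketch, assuming the graph snapshot stores the commit it was built from. `graph_status` and `current_head` are hypothetical helper names, and `git rev-parse HEAD` is the standard way to read the current commit.

```python
import subprocess
from typing import Optional


def current_head(repo_dir: str = ".") -> str:
    """Read the current commit hash from git (requires running inside a repository)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def graph_status(head: str, bound_commit: Optional[str]) -> str:
    """Classify the graph relative to HEAD without rebuilding anything."""
    if bound_commit is None:
        return "missing"  # no snapshot has been built yet
    return "current" if head == bound_commit else "stale"
```

The check is cheap enough to run on every dashboard refresh or CLI call, while the expensive reconcile stays behind an explicit user action.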

How modification and rollback work

Operation	Implementation
Modify the graph	Modify source / hints / config → trigger reconcile
Roll back the graph	`git revert` → trigger reconcile
Verify consistency	Same commit → same graph (replayable)

Logic lives in code. The graph is a read-only projection.
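To make "same commit → same graph" concrete, here is a toy projection. It is a deliberately simple stand-in for the real fixed algorithm, which is not described in this post: one node per file and one edge per `import` line, with everything sorted so that input ordering cannot leak into the output.

```python
import hashlib
import json


def project(files: dict) -> dict:
    """Toy stand-in for the fixed algorithm: one node per file, one edge per 'import' line."""
    nodes = sorted(files)
    edges = sorted(
        [path, line.split()[1]]
        for path in nodes
        for line in files[path].splitlines()
        if line.startswith("import ") and len(line.split()) > 1
    )
    return {"nodes": nodes, "edges": edges}


def fingerprint(graph: dict) -> str:
    """Stable hash of a graph, usable to check that a replay produced the same result."""
    encoded = json.dumps(graph, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


files_a = {"b.py": "import a\n", "a.py": "x = 1\n"}
files_b = {"a.py": "x = 1\n", "b.py": "import a\n"}  # same content, different dict order
```

Both inputs produce the same fingerprint, and changing any file's imports changes it. That is the property that makes rollback through `git revert` plus reconcile replayable.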

How this solves AI tech debt

Returning to the original problem: AI repeats patterns badly because it can't see the codebase.

The architectural fix:

Every AI session starts by querying the graph (via MCP)
The graph records the full structure — files, functions, modules, patterns
AI sees, for example, the existing HTTP service pattern in server/services/
AI reuses the pattern instead of shipping a parallel WebSocket implementation
After AI makes changes → user commits → system flags graph as stale → user reconciles → next session sees updated patterns

Cross-session knowledge transfer happens through the graph, not the prompt.

This is what "solved at the architecture layer" means: it's not a smarter prompt, it's a different topology of state.

Coming up: the algorithm itself

This post covered why the projection model works. The next post covers how the algorithm builds the graph:

in-degree=0 entry detection
DFS 3-color marking
Tarjan SCC for cyclic clusters
6-signal layer scoring
Cross-language fact pipeline (Python + TypeScript)

Follow me here to catch the next one.

Change my mind

I claim this architectural pattern solves AI tech debt: every commit corresponds to one graph + user-triggered reconcile + stale-state prompt.

Your turn. Two architectural choices:

Treat project state as a single source of truth, commit-bound
Or maintain a separate memory store that AI writes to

Which is more robust? Which scales better? Where would you attack my approach?

Calibrated invitation: I want senior engineers and AI infra people to push back with specifics. "What about X?" or "Have you considered Y?" lands better than "this won't work." If you've shipped something adjacent, tell me — I want to compare designs.

AI proposed 5 components for my parallel system. After walking one scenario, only 3 were real.

Aming — Mon, 18 May 2026 04:18:53 +0000

TL;DR — AI loves to design "enterprise-grade" systems for you: message queue, distributed lock, state machine service, scheduler, monitoring bus. Half of them aren't real. The cheapest filter I know: before letting AI design anything, walk one concrete scenario through the system. Whatever shows up in the scenario is real. Whatever doesn't — delete. This week it took me from a 5-component design down to 3 — and surfaced one critical component AI had missed entirely.

What I was building

This week I was extending aming-claw (an open-source AI code governance tool I'm building) to support parallel multi-agent development: multiple AI agents working on the same project simultaneously, each on its own branch, all of it merging back into trunk.

I asked AI to help me design it.

It came back fast. Confident. Five components:

- Message queue        (so tasks can line up)
- Distributed lock     (so agents don't step on each other)
- State machine service (so we track progress)
- Task scheduler       (so we know what runs when)
- Monitoring bus       (so we see what's happening)

Each component had a paragraph of justification. The diagram looked impressive. The names sounded right.

I almost just said "ok, build it."

Why I didn't

A thing I've learned working with AI on architecture: AI doesn't filter for necessity. It filters for plausibility. The components it lists are real things real systems have — they're just not necessarily things your system needs.

So instead of letting it design the system, I did one thing:

I walked a concrete scenario through the system before agreeing to anything.

Here's an honest framing: nobody can look at a 5-component design and immediately tell you which 2 are load-bearing. AI can't. Most engineers reading this can't, not on inspection.

The good news:

You don't need to know what to design. You just need to walk one scenario.

The scenario does the filtering for you.

Scenario 1: five tasks with dependencies

I started with the most boring scenario I could think of:

Five AI agents working in parallel. Each one on its own branch. The tasks have a dependency chain: 1 → 2 → 3 → 4 → 5. Task 2 needs what task 1 built. Task 5 needs everything before it.

I walked through what the system has to do:

Five tasks running in parallel — they need to queue for merging. OK, "message queue" was real.
BUT — they have to merge in dependency order. Not first-come-first-served. So a plain FIFO message queue isn't enough. It has to be an ordered queue.

Already, one component refined. "Message queue" → "ordered merge queue."

Nothing has been deleted yet. Keep going.
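The difference between a FIFO queue and an ordered merge queue can be sketched with Python's standard `graphlib` module (Python 3.9+). The function names and the `depends_on` mapping are assumptions for this example, not aming-claw's API.

```python
from graphlib import TopologicalSorter


def merge_order(depends_on: dict) -> list:
    """Full merge order; depends_on maps each task to the tasks that must merge first."""
    return list(TopologicalSorter(depends_on).static_order())


def next_mergeable(waiting: list, merged: set, depends_on: dict):
    """First waiting task whose dependencies have all merged, or None if all are blocked."""
    for task in waiting:
        if depends_on.get(task, set()) <= merged:
            return task
    return None


# The scenario's chain: 1 -> 2 -> 3 -> 4 -> 5.
deps = {1: set(), 2: {1}, 3: {2}, 4: {3}, 5: {4}}
```

If task 3 finishes first and arrives at the head of the queue, `next_mergeable` still skips it until tasks 1 and 2 have merged. A plain FIFO queue cannot express that.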

Scenario 2: the machine reboots mid-batch

Now the machine reboots. When it comes back up: task 1 already merged. Task 2 tried to merge and failed. Task 3 hadn't started yet. Task 4 was waiting in queue. Task 5 was halfway through executing when the power cut.

I walked it again:

For the system to even know what state each task is in after a reboot, task state has to be on disk, not just in memory. Not a "state machine service" with its own server, just durable per-task state. (task_id → status → checkpoint.) That's a row in a database table, not a service.
Task 2 failed, but tasks 3-5 are downstream of it. The system has to recognize "upstream failed, downstream blocked" automatically. That's not a separate component — it's a query against the durable state.
Task 5 was mid-execution when the power cut. When the machine restarts, what stops a second copy from picking it up and racing the half-finished one? Each execution attempt needs a unique token — whoever has the newest token is the live runner, everyone else gets fenced off.

Now two more things have surfaced:

Durable per-task state (which AI called "state machine service" — but it's not a service, it's a table)
Fence tokens to prevent zombie reruns
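Durable per-task state and the "upstream failed, downstream blocked" query fit in a few lines of SQLite. This is a sketch under assumed table and column names (`task_state`, `task_dep`). The demo uses an in-memory database, whereas a real deployment would pass a file path so the state survives a reboot.

```python
import sqlite3


def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the task-state store; pass a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS task_state ("
               "task_id INTEGER PRIMARY KEY, status TEXT NOT NULL, checkpoint TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS task_dep (task_id INTEGER, depends_on INTEGER)")
    return db


def blocked_tasks(db: sqlite3.Connection) -> list:
    """Tasks downstream of any failed task, found by following dependencies transitively."""
    rows = db.execute("""
        WITH RECURSIVE blocked(task_id) AS (
            SELECT d.task_id FROM task_dep d
            JOIN task_state s ON s.task_id = d.depends_on
            WHERE s.status = 'failed'
            UNION
            SELECT d.task_id FROM task_dep d
            JOIN blocked b ON d.depends_on = b.task_id
        )
        SELECT task_id FROM blocked ORDER BY task_id
    """).fetchall()
    return [row[0] for row in rows]


# The reboot scenario: task 1 merged, task 2 failed, tasks 3-5 downstream of it.
db = open_state()
db.executemany("INSERT INTO task_state VALUES (?, ?, ?)", [
    (1, "merged", "cp-1"), (2, "failed", None), (3, "pending", None),
    (4, "queued", None), (5, "running", None),
])
db.executemany("INSERT INTO task_dep VALUES (?, ?)", [(2, 1), (3, 2), (4, 3), (5, 4)])
```

No extra component is involved: the blocked set is derived from the same two tables every time it is asked for.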

And here's the first thing that got deleted: distributed lock.

A distributed lock is "this resource is held by exactly one agent right now." Fence tokens solve the same problem in a much weaker, much cheaper way: "the latest token wins, all stale tokens are ignored." For agent merge work, that's sufficient. Distributed locks would be massive overkill for the actual scenario.

1 component deleted, 0 lines of code written.
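The "latest token wins" rule is small enough to sketch directly. `FenceRegistry` is a hypothetical name, and it keeps counters in memory only to stay short. In the scenario above the counter would have to live in the durable task state to survive the reboot.

```python
class FenceRegistry:
    """Issue increasing tokens per task; only the newest token is accepted."""

    def __init__(self):
        self._latest = {}

    def issue(self, task_id: int) -> int:
        """Start a new execution attempt and invalidate all earlier ones."""
        token = self._latest.get(task_id, 0) + 1
        self._latest[task_id] = token
        return token

    def accept(self, task_id: int, token: int) -> bool:
        """True only for the most recently issued token for this task."""
        return self._latest.get(task_id) == token


registry = FenceRegistry()
zombie_token = registry.issue(5)  # attempt that was running when the power cut
live_token = registry.issue(5)    # attempt started after the reboot
```

Any write from the half-finished run carries `zombie_token` and is refused, with no lock held by anyone.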

Scenario 3: the ordering itself was wrong

This one wasn't in my original head-list. It only surfaced when I kept walking:

Five tasks ran. Three merged. Then it turns out the dependency order I gave the system was wrong — it should have been 1 → 3 → 2 → 4 → 5, not 1 → 2 → 3 → 4 → 5. The three already-merged tasks need to be rolled back as a batch and replayed in the correct order.

This is a scenario most systems never plan for. Per-task rollback is common — undo one merge. Batch rollback with replay is rarer.

Plain per-task revert doesn't work — you can't revert task 2 while leaving task 3 (which depends on task 2's wrong order) intact.
The whole batch has to roll back atomically.
Then the system has to replay them in the new order, with all the graph artifacts (snapshots, indices, semantic projection, test results) re-derived per merge.
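A sketch of the batch rollback and replay, modeling trunk as a simple list of merged task ids. `replay_batch` and the `derive` callback are hypothetical. The real BatchMergeRuntime operates on git history and graph artifacts, which this example does not attempt to reproduce.

```python
def replay_batch(trunk: list, batch: list, corrected_order: list, derive) -> list:
    """Return a new trunk with `batch` rolled back and replayed in `corrected_order`.

    Works on a copy, so a failure partway through leaves the caller's trunk untouched.
    `derive` is called once per replayed merge to rebuild per-merge artifacts.
    """
    if not batch or trunk[-len(batch):] != batch:
        raise ValueError("batch must be the most recent merges on trunk")
    new_trunk = trunk[:len(trunk) - len(batch)]  # roll the whole batch back at once
    for task in corrected_order:
        if task in batch:
            new_trunk.append(task)
            derive(task)  # e.g. snapshots, indices, test results for this merge
    return new_trunk


trunk = ["base", 1, 2, 3]
derived = []
new_trunk = replay_batch(trunk, [1, 2, 3], [1, 3, 2, 4, 5], derived.append)
```

Tasks 4 and 5 appear in the corrected order but were never merged, so they are left for the ordered queue to handle afterwards.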

This is the component AI had not mentioned at all. It only surfaced because I walked a scenario nobody told me to walk.

Call it BatchMergeRuntime. It's the rarest kind of architectural decision: not "should we have it" but "do we even know we need it?" — and the answer, for most teams, is not until production.

What the architecture actually became

After walking three scenarios:

Scenario	What it surfaced
5 tasks with dependencies	Ordered merge queue
Machine reboots mid-batch	Durable task state + fence tokens
Dependency order was wrong	Batch rollback + replay runtime
All of the above untested	Test scenario matrix as P0.0 (highest priority)

Three real components. The fourth — the test scenario matrix itself — is a meta-component: the dry-run scenarios I just walked became the first acceptance bar for every subsequent PR. Anything that ships has to survive these scenarios before merge.

AI's first design vs what scenarios required

AI's first list	Reality after scenario walk
Message queue	✅ Needed — but ordered, not FIFO
Distributed lock	❌ Deleted — fence tokens are sufficient
State machine service	✅ Needed — but as a table, not a service
Task scheduler	❌ Deleted — the ordered queue is the scheduler
Monitoring bus	❌ Deleted — each component emits its own events
(AI did not propose)	✅ Batch rollback runtime — surfaced only by scenario 3

Net: 5 → 3 components. Two of AI's five survived in refined form, three were deleted, and the third real component is the critical piece AI had missed entirely.

The win is not "I deleted 3 components." The win is I now know why each remaining component exists, which means I can explain it, scope it, and reject scope creep on it. That's the difference between a system you built and a system you understand.

The method, in 3 steps

❌ Don't:   "Hey AI, design me a system that does X."
           → AI returns a plausible-looking inventory of components.
           → Half of them aren't real for your specific case.

✅ Do:      Step 1.  Write one concrete scenario yourself.
                    (Or: have AI write the scenario, you evaluate it.
                     Real numbers, real steps, with crashes,
                     failures, and orderings going wrong.)

           Step 2.  Walk the scenario through your design.
                    At each step, ask: "What does the system need here?"

           Step 3.  Aggregate "what's needed."
                    That's your minimal architecture.
                    Anything not in that list — delete.

That's it. Three steps. No architecture-pattern library required. The scenario does the work for you.
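Step 3 is a set operation, which the following sketch makes explicit using the components from this post. The `refined_as` mapping, which records which proposed component turned into which real one, is an assumption added for the example.

```python
proposed = {"message queue", "distributed lock", "state machine service",
            "task scheduler", "monitoring bus"}

# What each walked scenario showed the system actually needs.
scenario_needs = {
    "5 tasks with dependencies": {"ordered merge queue"},
    "machine reboots mid-batch": {"durable task state + fence tokens"},
    "dependency order was wrong": {"batch rollback + replay runtime"},
}

# Proposed components that survived, under their refined names.
refined_as = {
    "message queue": "ordered merge queue",
    "state machine service": "durable task state + fence tokens",
}


def minimal_architecture(needs_by_scenario: dict) -> set:
    """Aggregate everything any scenario required."""
    return set().union(*needs_by_scenario.values())


def deleted_components(proposed: set, refined_as: dict, needed: set) -> set:
    """Proposed components that no scenario asked for, even in refined form."""
    return {name for name in proposed if refined_as.get(name, name) not in needed}


needed = minimal_architecture(scenario_needs)
deleted = deleted_components(proposed, refined_as, needed)
```

Here `deleted` comes out as the distributed lock, task scheduler, and monitoring bus, matching the comparison table earlier in the post.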

Why this works (and why it's hard to skip)

Three reasons:

1. AI optimizes for plausibility, not necessity. It lists components that sound right for this kind of system, drawing from its training data. It can't know which components are necessary for your specific scenario, because it doesn't see your scenario unless you walk it through.

2. Scenarios surface the negative space. A happy-path design is the union of every component someone might need. A scenario walk is the intersection of components someone definitely needs for that scenario. The intersection is always smaller — and more honest.

3. Scenarios surface what AI missed. The batch-rollback runtime wasn't on AI's list. It surfaced because scenario 3 was a state AI's training data didn't lean on. Whatever your system's weird state is — only your scenarios will find it.

The reason this method is hard to skip is that the pressure to just accept AI's design is enormous. The design looks complete. It uses real words. You feel productive saying "yes, build it." Walking a scenario feels like slowing down. It is. That's the whole point.

What's next in this series

This is part 2 of the AI Collaboration Survival Guide. The previous post was about making AI's claims about completed work auditable via a backlog database. The next ones, lining up:

Pain	Coming up
AI edits one function, breaks 10 callers	Code graph + impact analysis
AI modifies code it shouldn't touch	Governance hints as the only authoring surface
What did AI even change this week?	Event ledger
Every session starts from zero	Project memory layer

One pain per article. All built around the same open-source project, aming-claw.

About aming-claw

GitHub: amingclawdev/aming-claw
What it is: A shared workspace where you and your AI agent see the same dashboard. Backlog database, code graph, event ledger, governance hints — all queryable by AI through MCP.
Why I'm writing this series: I keep running into the same kind of AI-collaboration pain. Each post fixes one of them. The fixes generalize beyond aming-claw — the scenario-walk method in this post is a 5-minute habit you can adopt in any project.

If the parallel-agent scenario sounded familiar, drop a comment with the architecture decision AI most recently tried to oversell you on — I'll work through it the same way in the comments. Free architectural review, basically. The repo also takes stars and they're free for you to give. 🌟

Part 2 of "AI Collaboration Survival Guide" — practical patterns for the messy reality of shipping with AI agents.

I told my AI to build a feature. Did it? I had no idea.

Aming — Sat, 16 May 2026 18:43:34 +0000

TL;DR — I tried to "manage" AI by having it write decisions, todos, and constraints into markdown docs. After 56 files, I realized AI doesn't maintain document state. So I built aming-claw — a backlog database AI can actually read and write through MCP.

A bug I kept running into

I thought I was doing AI collaboration the right way.

This is the docs/dev/ folder of my aming-claw project — 56 markdown files, all produced through AI collaboration:

proposal-* — new feature specs
review-* — design review records
handoff-* — state passed between sessions
Plus plan-, optimization-, interface-, manual-fix-...

Every file dated. Two months in, over a thousand pages of markdown. I figured the next AI session would read these. I figured I'd be able to search them too.

But there's one problem I can't engineer my way out of:

AI doesn't maintain document state.

proposal-graph-state-reconcile-and-chain-governance-modes.md — did this proposal ship? Which commit? Is it still valid?
handoff-2026-05-10-dashboard-semantic-hash-queue.md — did the next session actually pick up where this left off?
18 proposals on file. Which are done, which got rejected, which are still alive? Grep through git log line by line?

I don't manually maintain the docs, so the docs rot. AI doesn't maintain them either: its context window only sees a tiny slice of the workspace. The rest of the 56 files are invisible.

The more we talk, the more we write — and the further docs drift from code. Eventually you don't trust the docs, and you don't have time to read the code.

Why this happens

This isn't AI being lazy. It's a structural problem:

Markdown is dead text. No state machine. "TODO" doesn't become "DONE" on its own. "Decision: use Redis" doesn't auto-expire when you flip back to in-memory three weeks later.
AI context has a boundary. Each session sees ~200 lines of working code. Old docs never enter the window. Not in the window → can't be maintained.
No traceable link between docs and code. Which TODO maps to which function? Once it's done, which commit landed it? Humans can't remember. AI doesn't look it up.

GitHub Issues, Notion, Linear — none of these help. AI can't see them, so they don't exist.

The core mismatch is this: humans want global state. AI sees only the local present. Between them you need a living, traceable state layer that AI can read and write. Markdown isn't that layer.

How aming-claw solves it

I gave aming-claw a dedicated backlog database: a system on equal footing with the code graph and event ledger, with its own schema, state machine, and query interface. Not stored in markdown. Not buried in code comments. Not dependent on an external issue tracker.

Each backlog entry is a structured record (todo / decision / constraint) with status, priority, source session, and a code reference (function name or file path). AI reads and writes it through MCP.

The flow:

1. You speak → it goes to the database, not a dead doc

In chat:

"Add a retry-after to the rate limiter on UserService.login"

Or: "Decision — use Redis instead of in-memory for caching"

aming-claw's MCP server intercepts those statements and writes directly into the backlog:

target:    UserService.login   # function or file path
type:      todo | decision | constraint
status:    proposed
priority:  P1
source:    session-id-xyz
timestamp: 2026-05-16T10:23:45Z

Markdown is dead text. The backlog database is live state: it has a schema, indexes, and a state machine, and AI can read and write it. That's the difference.
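To make "schema, indexed" concrete, here is a minimal sketch of such a record in SQLite. It assumes nothing about aming-claw's real storage: the table name, column names, and the `open_backlog` and `add_entry` helpers are all hypothetical, and only the field list mirrors the record shown above.

```python
import sqlite3

# Hypothetical schema mirroring the fields of the record above.
SCHEMA = """
CREATE TABLE backlog (
    id        INTEGER PRIMARY KEY,
    target    TEXT NOT NULL,  -- function name or file path
    type      TEXT NOT NULL CHECK (type IN ('todo', 'decision', 'constraint')),
    status    TEXT NOT NULL DEFAULT 'proposed',
    priority  TEXT NOT NULL,
    source    TEXT NOT NULL,  -- session that created the entry
    timestamp TEXT NOT NULL   -- ISO 8601, UTC
);
CREATE INDEX idx_backlog_target ON backlog (target);
"""


def open_backlog(path: str = ":memory:") -> sqlite3.Connection:
    """Open a backlog database and create the table and index."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_entry(conn, target, type_, priority, source, timestamp) -> int:
    """Insert a new entry; status starts at the schema default, 'proposed'."""
    cur = conn.execute(
        "INSERT INTO backlog (target, type, priority, source, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        (target, type_, priority, source, timestamp),
    )
    conn.commit()
    return cur.lastrowid


conn = open_backlog()
entry_id = add_entry(
    conn, "UserService.login", "todo", "P1", "session-id-xyz", "2026-05-16T10:23:45Z"
)
```

Unlike a line in a markdown file, a malformed entry is rejected at write time: an unknown `type` fails the `CHECK` constraint instead of silently entering the record.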

2. Dashboard shows it instantly

Open the aming-claw dashboard — the left panel shows the new backlog entry. Click it — the right panel jumps to the function via the vscode:// protocol. Status chips are editable inline.

The backlog view — every entry has priority, status, code reference, and update timestamp. AI and you query the same source of truth.

3. State machine, automatic

proposed → in_progress → done(commit hash) → verified

in_progress — AI started working on it
done — commit landed, hash automatically bound
verified — you reviewed it

Every state change is appended to an event ledger: which day, which session proposed it, which commit implemented it, who verified it — all queryable, all replayable.
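The state machine and ledger can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: `Entry`, `advance`, and the event fields are hypothetical names, and the ledger is kept on the entry itself only to keep the example short.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Allowed moves, taken from the flow above.
TRANSITIONS = {
    "proposed": {"in_progress"},
    "in_progress": {"done"},
    "done": {"verified"},
    "verified": set(),
}


@dataclass
class Entry:
    target: str
    status: str = "proposed"
    commit: Optional[str] = None
    ledger: list = field(default_factory=list)  # append-only event log


def advance(entry: Entry, new_status: str, actor: str,
            commit: Optional[str] = None) -> None:
    """Move an entry to new_status and record the change in its ledger."""
    if new_status not in TRANSITIONS[entry.status]:
        raise ValueError(f"illegal transition: {entry.status} -> {new_status}")
    if new_status == "done":
        if not commit:
            raise ValueError("'done' requires a commit hash")
        entry.commit = commit  # bind the hash at the moment the work lands
    entry.ledger.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "from": entry.status,
        "to": new_status,
        "commit": entry.commit,
    })
    entry.status = new_status


entry = Entry("UserService.login")
advance(entry, "in_progress", actor="session-id-xyz")
advance(entry, "done", actor="session-id-xyz", commit="0ad8c7e")
advance(entry, "verified", actor="human-reviewer")
```

Two rules carry the weight here: a status cannot skip a step, and `done` cannot exist without a commit hash. Those are the two things a markdown "DONE" can never guarantee.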

4. AI reads the backlog itself, next time

Days later, in chat:

"Did we ever fix that Codex plugin Windows install bug?"

AI queries the backlog through MCP and returns:

status:     FIXED, P0
commit:     0ad8c7e
fixed at:   2 days ago
file:       agent/plugin_installer.py (line 455)
change:     replaced regex pattern with callable replacement

No grepping git log. No asking a teammate. No "I think we did?"

The key thing to notice: AI didn't "remember" this from conversation history. It queried the backlog database in real time through MCP. Even if this bug was raised three months ago, in a session that's long gone — AI still gets the current status + full commit trace.

That's the difference between dead markdown and a live state layer: the database is the memory, not the conversation.
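A rough sketch of what such a lookup might do behind the MCP tool call, assuming a SQLite-backed backlog. The `find_entries` function, the table layout, and the `done` status value are hypothetical; the MCP wiring is left out on purpose, and only the title, commit, and file values come from the example above.

```python
import sqlite3


def find_entries(conn: sqlite3.Connection, keyword: str) -> list[dict]:
    """Return current status and commit trace for entries whose title matches."""
    rows = conn.execute(
        "SELECT title, status, priority, commit_hash, file FROM backlog "
        "WHERE title LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    ).fetchall()
    keys = ("title", "status", "priority", "commit", "file")
    return [dict(zip(keys, row)) for row in rows]


# Stand-in for a database written by an earlier, long-gone session.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backlog (id INTEGER PRIMARY KEY, title TEXT, status TEXT, "
    "priority TEXT, commit_hash TEXT, file TEXT)"
)
conn.execute(
    "INSERT INTO backlog (title, status, priority, commit_hash, file) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Codex plugin Windows install bug", "done", "P0", "0ad8c7e",
     "agent/plugin_installer.py"),
)

# A fresh session knows nothing about the bug; it only knows how to ask.
matches = find_entries(conn, "windows install")
print(matches)
```

Nothing in the function depends on conversation history. The answer comes from whatever the database holds at query time.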

This is just the start

Look back at the docs/dev/ screenshot — 56 markdown files, nobody knows which are alive.
Look at the dashboard screenshot — every backlog entry has status, commit, location.

The difference isn't the tool. It's whether information has state.

The backlog solves "did the AI build the feature I asked for?" — but AI collaboration has plenty of other holes I'm planning to fill in this series:

Pain	Next article
AI edits one function, breaks 10 callers	Code graph + impact analysis
AI modifies code it shouldn't touch	Governance hints
What did AI even change this week?	Event ledger
Every session starts from zero	Project memory layer

One article per pain point.

About aming-claw

GitHub: amingclawdev/aming-claw — open source
Next post: "AI breaks 10 callers when it edits one function" — coming this week
Hit me with issues if you've felt this pain

If "did the AI actually do that thing I asked?" sounds familiar, give the repo a star — it costs you nothing and tells me I'm not the only one.

This is part 1 of an "AI Collaboration Survival Guide" series — practical tools for the messy reality of building with AI agents.