DEV Community: Vuong Ngo

Prompt Injection in Assistant Tools Is a Boundary Bug

Vuong Ngo — Tue, 07 Jul 2026 12:12:48 +0000

In one recent Reddit thread, a Claude user said the assistant surfaced internal-looking Apify and Notion tool JSON, then warned about prompt injection after the user pasted job-board text. Treat that as a third-party report, not a vendor-confirmed root cause. Still, the symptom is the lesson: if hidden tool state can show up in chat, hidden state is close enough to the model, UI, logs, or memory path to deserve a boundary review.

Figure: The reported incident is useful as a boundary-design signal, not as vendor-confirmed root cause.

Prompt injection in assistant tools is not mainly an input-filtering problem. It is a context-boundary problem. The practical fix starts by making every piece of assistant state carry three labels before it reaches a model or a tool: source, visibility, and authority.

That sounds boring. Good. Boring boundaries are easier to test than vibes.

Prompt injection in assistant tools is a boundary bug

Hidden tool or context leakage happens when internal control state, tool schema, tool output, memory, or client-only metadata crosses into model-visible content or user-visible output without an explicit source, visibility, and authority boundary.

That definition matters because "the prompt" is too small a word for modern assistants. A tool-using assistant may combine user text, browser content, retrieved files, memory summaries, tool manifests, tool outputs, guardrail annotations, system instructions, signed file URLs, and UI metadata before the user sees one response.

The Reddit report is a good opening signal because it is messy. It includes user-visible internal-looking state and a warning that may have been a false positive. The point is not "this one vendor leaked exactly X for exactly Y reason." The point is that visible boundary confusion is a security smell.

In a production assistant, the same class of confusion can appear in four places:

A model sees implementation metadata and treats it as instruction.
A user sees internal state and loses trust in the product boundary.
A tool receives stale or injected arguments.
A memory system stores a contaminated summary for later turns.

Once you look at prompt injection through that lens, a regex over user input starts to feel like a smoke alarm in a building with no fire doors.

Stop calling every string "the prompt"

A model does not care where a string came from; your application has to.

OWASP's LLM01 guidance describes prompt injection as direct or indirect input that alters model behavior, including content that may be imperceptible to humans but still parsed by the model. That is the part assistant builders should keep in their heads: the model does not care whether a string came from a chat box, a README, a website, a tool result, or a hidden field.

Your application has to care.

Plane	Examples	Boundary failure
User-visible data	Chat text, pasted job posts, README files, uploaded PDFs	Data gets treated as privileged instruction
Model-visible control	System prompts, developer instructions, tool descriptions, guardrail policy text	Hidden control state leaks into normal output
Tool and action surface	Shell commands, browser actions, MCP tools, file writes, external sends	Untrusted or stale text triggers a real action
Client-only UI state	Signed URLs, widget metadata, progress state, internal JSON, `_meta` fields	Implementation detail enters chat, logs, model context, or memory

The table is simple because the failure is simple: a string crosses a lane without being reclassified.

The hard cases are not the obvious malicious prompts. They are ordinary product states that accidentally become model-visible. A progress object contains _meta.resourceUri. A file preview carries a signed URL. A tool result includes a debug blob. A memory summary says "the user prefers automatic setup recovery" after one rushed conversation. None of these is evil. Each one can become unsafe if it graduates into instruction or tool authority.

This is the operational rule I use: every context block needs a source, a visibility level, and an authority level before it can move.

Figure: A boundary-first assistant keeps user data, model control, tool action, and client-only UI state in separate lanes.

Memory makes continuity useful and state larger

Memory is valuable because it reduces restart cost. OpenAI describes memory as shared context that helps future conversations start from preferences, projects, and constraints rather than from scratch, while also naming freshness, relevance, staleness, correctness, and scale as engineering challenges.

That is not an argument against memory. It is an argument for classifying memory as state.

Here is a concrete failure mode. A developer asks an assistant to fix a Node repo. In a previous project, the user said, "If setup fails, run npm install and try again." A memory system compresses that into "user prefers automatic setup recovery." Weeks later, the assistant opens a different repo where install scripts are not trusted. If that memory has no source, freshness, or authority label, the product has a quiet policy bug.

The same problem shows up with project preferences:

"Always use production data for realistic tests" might be fine in one sandbox and reckless in another.
"Send status updates to the team channel" might be useful for a demo and a data leak in a customer workspace.
"Ignore linter warnings for generated code" might be acceptable in one package and fatal in a shared library.

Memory improves continuity by expanding state. Expanded state needs labels.
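A minimal sketch of what "labels on memory" could look like, assuming a hypothetical `MemoryRecord` shape and a hypothetical `decideMemoryUse` helper (neither comes from a real memory API). It treats the Node repo example as data: a remembered preference is only reused as a hint when it is fresh and was stated in the same project.

```typescript
// Hypothetical record shape; none of these names come from a real memory API.
type MemoryRecord = {
  summary: string;
  sourceProject: string; // where the preference was originally stated
  createdAt: string; // ISO 8601 timestamp
  ttlSeconds: number; // how long the record may be reused without re-confirmation
};

type MemoryDecision = "use_as_hint" | "reconfirm";

function decideMemoryUse(
  record: MemoryRecord,
  currentProject: string,
  now: number = Date.now(),
): MemoryDecision {
  const created = Date.parse(record.createdAt);

  // Fail closed: an unparseable timestamp is treated as stale.
  if (!Number.isFinite(created)) return "reconfirm";

  // Stale preferences need a fresh yes from the user.
  if (now - created > record.ttlSeconds * 1000) return "reconfirm";

  // A preference stated in one project does not carry over to another.
  if (record.sourceProject !== currentProject) return "reconfirm";

  // Even a fresh, in-scope memory is only a hint, never an instruction.
  return "use_as_hint";
}

const setupMemory: MemoryRecord = {
  summary: "user prefers automatic setup recovery",
  sourceProject: "repo-a",
  createdAt: "2026-06-01T00:00:00Z",
  ttlSeconds: 7 * 24 * 60 * 60,
};
const twoDaysLater = Date.parse("2026-06-03T00:00:00Z");

console.log(decideMemoryUse(setupMemory, "repo-a", twoDaysLater)); // use_as_hint
console.log(decideMemoryUse(setupMemory, "repo-b", twoDaysLater)); // reconfirm
```

The design choice worth keeping, whatever the real schema looks like, is that the default outcome is "ask again" and the memory has to earn the right to be reused.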

Tools turn boundary mistakes into actions

With a chat-only assistant, prompt injection can produce a bad answer. With tools, the blast radius changes.

The Model Context Protocol security guidance names implementation risks such as confused deputy problems, token passthrough, SSRF, session hijacking, and local server compromise, and pairs them with mitigations such as OAuth validation and scope minimization. Those are not copywriting concerns. They are what happens when a language model sits near credentials, network access, files, and long-lived sessions.

The 0DIN Claude Code proof of concept is a sharper example. The researcher described a normal-looking repo flow where setup recovery and external lookup chained into compromise. Do not copy that attack path into your app or your tests. Take the design lesson instead: routine developer actions become dangerous when untrusted project content can steer trusted tools.

This is why "detect prompt injection" is the wrong primary promise. Detection helps. It is a tripwire. The boundary is enforced by what content may enter model context and what a tool is allowed to do after the model proposes an action.

For tool-calling assistants, I would rather block one helpful but ambiguous action than silently convert untrusted text into shell, network, credential, file-write, or external-send authority.

A small context contract you can test

Here is a compact TypeScript pattern you can adapt. It does not solve prompt injection. It does something narrower and more useful: it prevents the wrong category of text from reaching the wrong authority level.

Save this as context-boundary.ts and run it with tsx context-boundary.ts.

type Source = "user" | "external" | "tool" | "memory" | "system" | "ui";
type Visibility = "user_visible" | "model_visible" | "tool_only" | "client_only";
type Authority = "data" | "instruction" | "tool_request" | "privileged";

type ContextBlock = {
  id: string;
  source: Source;
  visibility: Visibility;
  authority: Authority;
  content: string;
  createdAt: string;
  ttlSeconds?: number;
};

type Finding = {
  level: "error" | "warn";
  blockId: string;
  message: string;
};

const INTERNAL_PATTERNS = [
  /\btool_schema\b/i,
  /\bfunction_call\b/i,
  /\bApify\b/i,
  /\bNotion\b/i,
  /\b_meta\b/i,
  /\bsystem prompt\b/i,
  /\bBearer\s+[A-Za-z0-9._-]+/i,
  /https:\/\/[^\s]+X-Amz-Signature=/i,
];

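function isExpired(block: ContextBlock, now = Date.now()): boolean {
  // No TTL means no expiry here; missing freshness metadata is flagged separately.
  if (block.ttlSeconds === undefined) return false;
  const created = Date.parse(block.createdAt);
  // Fail closed: an unparseable timestamp counts as expired.
  if (!Number.isFinite(created)) return true;
  return now - created > block.ttlSeconds * 1000;
}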

function validateForModel(blocks: ContextBlock[]): Finding[] {
  const findings: Finding[] = [];

  for (const block of blocks) {
    if (block.visibility === "client_only") {
      findings.push({
        level: "error",
        blockId: block.id,
        message: "Client-only state must never enter model-visible context.",
      });
    }

    if (block.source === "user" || block.source === "external") {
      const leakedInternal = INTERNAL_PATTERNS.some((pattern) =>
        pattern.test(block.content),
      );

      if (leakedInternal) {
        findings.push({
          level: "error",
          blockId: block.id,
          message: "Untrusted content contains internal-looking tool or secret material.",
        });
      }
    }

    if (block.source === "memory" && !block.ttlSeconds) {
      findings.push({
        level: "warn",
        blockId: block.id,
        message: "Memory block has no freshness metadata.",
      });
    }

    if (block.source === "memory" && isExpired(block)) {
      findings.push({
        level: "warn",
        blockId: block.id,
        message: "Memory block is stale and should be re-confirmed.",
      });
    }

    if (block.authority === "privileged" && block.source !== "system") {
      findings.push({
        level: "error",
        blockId: block.id,
        message: "Only system-owned blocks may carry privileged authority.",
      });
    }
  }

  return findings;
}

function buildModelContext(blocks: ContextBlock[]): string {
  const findings = validateForModel(blocks);
  const errors = findings.filter((finding) => finding.level === "error");

  if (errors.length > 0) {
    throw new Error(errors.map((error) => `${error.blockId}: ${error.message}`).join("\n"));
  }

  return blocks
    .filter((block) => block.visibility === "model_visible")
    .map((block) => `[${block.source}:${block.authority}] ${block.content}`)
    .join("\n\n");
}

type ToolCall = {
  name: string;
  args: Record<string, unknown>;
  action: "read" | "shell" | "network" | "credential" | "file_write" | "external_send";
  reason: string;
};

const APPROVAL_REQUIRED = new Set<ToolCall["action"]>([
  "shell",
  "network",
  "credential",
  "file_write",
  "external_send",
]);

function gateToolCall(call: ToolCall, currentUserGoal: string) {
  const reasonMentionsGoal = call.reason
    .toLowerCase()
    .includes(currentUserGoal.toLowerCase());

  if (!reasonMentionsGoal) {
    return {
      allowed: false,
      approvalRequired: false,
      log: `Blocked ${call.name}: reason does not match current user goal.`,
    };
  }

  if (APPROVAL_REQUIRED.has(call.action)) {
    return {
      allowed: false,
      approvalRequired: true,
      log: `Approval required for ${call.name}: ${call.action} action.`,
    };
  }

  return {
    allowed: true,
    approvalRequired: false,
    log: `Allowed ${call.name}: read-only action matches current user goal.`,
  };
}

const blocks: ContextBlock[] = [
  {
    id: "user-1",
    source: "user",
    visibility: "model_visible",
    authority: "data",
    content: "Review this README and summarize setup risks.",
    createdAt: new Date().toISOString(),
  },
  {
    id: "ui-1",
    source: "ui",
    visibility: "client_only",
    authority: "data",
    content: "https://files.example.test/private?X-Amz-Signature=demo",
    createdAt: new Date().toISOString(),
  },
];

try {
  console.log(buildModelContext(blocks));
} catch (error) {
  console.error(String(error));
}

console.log(
  gateToolCall(
    {
      name: "readFile",
      args: { path: "README.md" },
      action: "read",
      reason: "Review this README and summarize setup risks.",
    },
    "Review this README",
  ),
);

The sample is intentionally conservative:

client_only blocks fail before model submission.
User and external content cannot smuggle internal-looking tool material.
Memory without freshness metadata gets flagged.
Privileged authority is reserved for system-owned blocks.
Tool calls are checked against the current user goal before approval policy runs.

For a more complete review checklist around access, shared context, files, credentials, and human review, this public page on workflow guardrails is a useful companion. The important part is not the checklist format. It is forcing every action path to say what state it trusts.

If you want a tiny test, add this below the sample and make it part of CI:

function assertNoLeakage(text: string) {
  const leaked = INTERNAL_PATTERNS.filter((pattern) => pattern.test(text));
  if (leaked.length > 0) {
    throw new Error(`Model-visible text contains ${leaked.length} internal pattern(s).`);
  }
}

assertNoLeakage("[user:data] Summarize README setup risks.");

This test will miss real attacks. That is fine. Its job is not omniscience. Its job is to catch category mistakes early: signed URLs in model text, internal tool names in user-originated blocks, and metadata that was supposed to stay in the client.

What this does not solve

A boundary contract reduces blast radius. It does not make prompt injection disappear.

Control	Best for	Not best for	Tradeoff
Source, visibility, and authority labels	Preventing state from crossing lanes silently	Detecting every malicious phrase	Requires plumbing through every context builder
Guardrail classifiers	Catching suspicious content before or after model calls	Replacing permission design	False positives can block useful work
Scope-minimized tools	Reducing damage when a tool is misused	Complex workflows that need broad access	More prompts and approval steps
Human approval gates	Shell, network, credential, file-write, and external-send actions	Low-risk read-only operations	Review fatigue if everything needs approval
Memory freshness metadata	Preventing stale preferences from acting like current policy	Perfectly understanding user intent	Older context may need re-confirmation

The strongest objection is fair: prompt injection is not solved. So do not sell "solved." Sell smaller blast radius, clearer state movement, and better logs when something crosses a line.

My rule is plain: if a field is client-only, privileged, stale, or untrusted, it should not silently become instruction. That rule will not catch every hostile page, poisoned README, or bad memory summary. It will make the failure easier to find, easier to review, and less likely to become a tool action before a human notices.

Cursor vs Claude Code Is Really Control Surface vs Autonomy

Vuong Ngo — Sun, 28 Jun 2026 09:54:46 +0000

The Cursor vs Claude Code debate mostly stops at features. That misses the actual tradeoff.

The short version: Cursor is a control surface. Claude Code is autonomy. The choice is not about which tool is smarter; it is about where you want control to live while the work is happening.

Claude Code is a terminal-first CLI that can edit files, run commands, and manage a project from the command line. Cursor centers rules, Plan Mode, and reviewing changes inside the editor. Those are not the same shape, and the gap between them matters for how you work day to day.

If you are choosing between them, do not ask which one is smarter. Ask where you want control to live while the work is happening.

The split is where review happens

Claude Code is built for the run-ahead case. You give it a task, it works through files and commands, and you review the result after the fact. Cursor is built for the steer-as-you-go case. You stay closer to the diff, the plan, and the next decision.

Control is one axis, coordination is another.

The Model Context Protocol docs describe MCP as an open protocol for connecting AI applications to external systems. The tool choice is only one axis. The coordination layer is the other. If you want a concrete version of that idea, see the shared board around assistant work.

Claude Code when you want the agent to run ahead

Claude Code makes sense when the task is larger than a single editor session and you want the assistant to keep moving without asking you to babysit each edit.

The CLI reference shows the pattern clearly:

git diff main --name-only | claude -p "review these changed files for security issues"

Feed it context, let it work, then review the answer. The point is not that you never inspect the output. The point is that inspection comes after the machine has done the repetitive part.

If your work is scriptable, repetitive, or spread across multiple files, that shape is useful. It is also where the risk goes up. More autonomy means more trust in the agent's judgement, and more care needed in the review step.

Cursor when you want the editor to stay in the loop

You can encode the rules of engagement in a repo-level instruction file and keep the editor honest about scope and review:

# AGENTS.md

- Start with a plan before making broad changes.
- Keep diffs small enough to review in one pass.
- Ask before touching files outside the agreed scope.
- Leave a short note explaining why the change exists.

That kind of file turns the editor into a control surface. The assistant can still help, but the human stays close to the decision points. Cursor's docs are written for exactly that workflow: scope first, code second, review throughout.

This is the better fit when the problem is not raw throughput. It is when you want fewer surprises.

The coordination layer is client-agnostic

If Claude Code is allowed to run ahead and Cursor is used to keep the human in the loop, the missing piece is not a third assistant. It is a shared place for state. MCP is the cleanest neutral layer I know for that job.

{
  "mcpServers": {
    "shared-board": {
      "type": "http",
      "url": "https://board.example.com/mcp"
    }
  }
}

claude --strict-mcp-config --mcp-config ./mcp.json

The same server block can serve either client. That is the point. The assistant choice is about control style. The MCP layer is about where work stays legible.

The debate is already a workflow debate

As of mid-2026, you can see the shape of the conversation in public threads. A Product Hunt discussion asks the question directly, while a recent r/ClaudeAI thread reflects the same operational tension from the other side: fast shipping, changing behavior, and the cost of keeping up.

That is why feature checklists keep disappointing people. Developers are not just comparing model quality. They are comparing how much attention the assistant consumes, when review happens, and where the work is recorded.

A quick decision matrix

Dimension	Cursor	Claude Code
Primary shape	Control surface	Autonomy
Review point	During the edit loop	After the run finishes
State location	Editor, rules, diffs	Terminal session, command output
Best for	Careful multi-file edits, plan-first work, team review	Longer tasks, batch changes, scriptable work
Trade-off	More human attention up front	More review attention later
My default	When I need to steer	When I need the agent to keep moving

Neither tool is better in the abstract. They optimize for different kinds of trust.

What I would choose

If the task is messy, I want Cursor. It keeps the work visible and reduces the chance that an agent drifts past a constraint I cared about.

If the task is boring but large, I want Claude Code. I want the assistant to chew through the repetitive part, then give me something concrete to inspect.

If the task needs both, I do not think the answer is "pick one and hope". I think the answer is to separate the tool that does the work from the place that keeps the work legible. That is the real architecture choice.

Tiered AI Code Review: A Framework for AI-Generated PRs

Vuong Ngo — Sun, 21 Jun 2026 21:03:33 +0000

Something has shifted in the review queue. The diffs are bigger, they arrive faster, and a growing slice of them were generated by an AI tool rather than drafted by a human thinking through the change. That is not a complaint. But it is a problem if your review process has not adapted to it.

GitClear's longitudinal analysis of AI-assisted code output tracked millions of lines of code and found that AI tooling correlates with rising churn (lines rewritten or deleted shortly after being written), increasing copy-paste frequency, and declining code reuse. A peer-reviewed study from Stanford found that developers using AI assistants produced significantly less secure code and were overconfident about it. A Veracode analysis spanning 100+ LLMs sharpens that picture: 45 percent of AI-generated samples failed security checks, producing 2.74x more vulnerabilities than equivalent human-written code.

None of those findings means AI coding tools are a net negative. Teams that use them ship faster. The issue is that uniform review, treating every AI-generated PR with the same depth as every human-written one, creates a bottleneck. And skipping review because "the AI probably got it right" quietly accumulates the kind of defect debt that surfaces at the worst time.

What works is calibration: a tiered AI code review approach that matches review effort to the actual risk of each PR, rather than applying one rule to all.

Three signals for tiered code review

Before a reviewer opens a diff, three signals tell you roughly what you are dealing with:

Code origin is how much of this change came from an AI tool. A human who accepted a few Copilot completions is different from a Claude Code session that planned, drafted, and committed a full feature. The distinction matters because AI-generated code tends to be syntactically sound but logically shallow. It passes linters. It sometimes misses invariants, forgets to handle the error path, or silently ignores a business rule that only lives in the team's memory.

Change scope is how many lines changed. Not a perfect signal, but a useful one. Larger diffs mean more surface area for reviewers to miss something, more decisions that were made without explicit human intent, and less chance that any single reviewer holds the whole change in their head at once.

Blast radius is what the code touches. A PR that modifies a fixture file is recoverable if something is wrong. A PR that touches the auth flow, a payment processor integration, or a database migration schema is not. The cost of a missed defect scales with blast radius, so that is where review depth needs to be highest.

Figure: How three signals combine to assign a review tier. Blast radius is the override: a critical path always lands at Tier 3 regardless of origin or scope.

The tier decision matrix

Code Origin	Change Scope	Blast Radius	Review Tier
Human only	Any	Tests, docs, scripts	Tier 1: Skim
AI-assisted	< 100 lines	Low (internal tooling, scripts)	Tier 1: Skim
AI-assisted	100–500 lines	Moderate (API surface, business logic)	Tier 2: Scrutinize
AI-generated	≤ 300 lines	Low to moderate	Tier 2: Scrutinize
AI-generated	> 300 lines	Any	Tier 3: Sign-off
Any origin	Any scope	Critical (auth, payments, migrations, public API)	Tier 3: Sign-off

The last row is the one teams are most likely to under-enforce. A 40-line AI-generated change to the OAuth callback handler is still Tier 3. The blast radius overrides everything else.
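The matrix can be sketched as a single function. This is an illustration under stated assumptions, not the policy itself: `assignTier` and its types are hypothetical names, AI-generated changes of 300 lines or fewer are read as Tier 2, and the two cases the matrix leaves open (AI-assisted changes that are small but moderate-risk, and human-only changes with a moderate blast radius) are resolved conservatively toward Tier 2.

```typescript
type Origin = "human" | "ai-assisted" | "ai-generated";
type BlastRadius = "low" | "moderate" | "critical";
type Tier = 1 | 2 | 3;

function assignTier(origin: Origin, linesChanged: number, blast: BlastRadius): Tier {
  // Blast radius is the override: critical paths are always Tier 3.
  if (blast === "critical") return 3;

  if (origin === "ai-generated") {
    // Large AI-generated diffs escalate regardless of blast radius.
    return linesChanged > 300 ? 3 : 2;
  }

  if (origin === "ai-assisted") {
    // Assumption: a moderate blast radius or 100+ lines is enough for Tier 2.
    return blast === "moderate" || linesChanged >= 100 ? 2 : 1;
  }

  // Assumption: the matrix only lists human-only changes to tests, docs, and
  // scripts (Tier 1); this sketch sends human-only moderate changes to Tier 2.
  return blast === "moderate" ? 2 : 1;
}

console.log(assignTier("ai-generated", 40, "critical")); // 3
console.log(assignTier("ai-generated", 40, "low")); // 2
console.log(assignTier("ai-assisted", 50, "low")); // 1
```

Putting the critical check first is the whole point: no combination of origin and size can talk a payment or auth change down to a lower tier.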

Figure: Relative defect risk by review tier classification (illustrative, derived from GitClear 2024 general findings, not measured per-tier data). AI-generated Tier 3 PRs carry the highest uncaught-defect risk before review.

What each tier actually demands

Tier 1 (Skim): One reviewer. CI must pass. Read the entire diff end-to-end, including the parts that look fine. The only mandatory check beyond "CI green" is confirming there are no hardcoded credentials or API keys. Target turnaround: 4 hours.

Tier 2 (Scrutinize): One reviewer, more attention. The reviewer reads every changed function with the intent to understand the logic, not evaluate the formatting. Test coverage for new branches is required, not optional. Any new dependency added to the project gets audited: license, maintenance status, and whether it pulls in something unexpected. If the PR is AI-generated and crosses a service boundary, run a security scan. Target turnaround: 24 hours.

Tier 3 (Mandatory Sign-off): Two reviewers, one of them the tech lead. CI required. Security scan required. The PR description must include a rollback plan and evidence that the change was tested in staging. For teams building systems covered by the EU AI Act (high-risk AI obligations apply from August 2026), this tier is also where you flag and document regulatory touchpoints. Target turnaround: 48 hours.

Tier 3 is a gate, not a penalty. A PR that lands in it because it touches the payment flow is not a problem PR. It is a PR that deserves a different kind of attention.

Automating tier assignment

Manual labeling works up to about five AI-generated PRs a day. Beyond that, a GitHub Actions workflow that reads diff size and changed paths can assign the right label automatically. Contributors add the ai-generated label when they push; the workflow handles tier calculation from there.

# .github/workflows/pr-tier.yml
name: PR Review Tier

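on:
  pull_request:
    # labeled/unlabeled re-run the tier calculation when the ai-generated label changes.
    types: [opened, synchronize, reopened, labeled, unlabeled]

permissions:
  contents: read        # needed by actions/checkout on private repos once permissions are listed explicitly
  pull-requests: write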

jobs:
  assign-tier:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Compute tier inputs
        id: inputs
        run: |
          BASE=${{ github.event.pull_request.base.sha }}
          HEAD=${{ github.event.pull_request.head.sha }}

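          # Three-dot diff compares HEAD against the merge base, so commits that
          # landed on the base branch after the PR branched are not counted.
          # Only added lines are summed; binary files report "-" and count as 0.
          LINES=$(git diff --numstat "$BASE...$HEAD" \
            | awk '{added += $1} END {print added+0}')
          echo "lines=$LINES" >> "$GITHUB_OUTPUT"

          git diff --name-only "$BASE...$HEAD" > /tmp/changed_files.txt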
          if grep -qE '^(src/auth/|src/payments/|infra/|db/migrations/)' \
               /tmp/changed_files.txt; then
            echo "blast=critical" >> $GITHUB_OUTPUT
          elif grep -qE '^(src/api/|src/core/|src/services/)' \
               /tmp/changed_files.txt; then
            echo "blast=moderate" >> $GITHUB_OUTPUT
          else
            echo "blast=low" >> $GITHUB_OUTPUT
          fi

      - name: Determine tier
        id: tier
        run: |
          LINES=${{ steps.inputs.outputs.lines }}
          BLAST=${{ steps.inputs.outputs.blast }}
          # Exact label match; jq's contains() does substring matching on strings.
          HAS_AI=$(gh pr view ${{ github.event.pull_request.number }} \
            --json labels \
            -q 'any(.labels[].name; . == "ai-generated")' \
            2>/dev/null || echo false)

          if [ "$BLAST" = "critical" ]; then
            echo "label=review/tier-3" >> $GITHUB_OUTPUT
          elif [ "$HAS_AI" = "true" ] && [ "$LINES" -gt 300 ]; then
            echo "label=review/tier-3" >> $GITHUB_OUTPUT
          elif [ "$LINES" -gt 300 ] || [ "$HAS_AI" = "true" ]; then
            echo "label=review/tier-2" >> $GITHUB_OUTPUT
          else
            echo "label=review/tier-1" >> $GITHUB_OUTPUT
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Apply label
        run: |
          for l in review/tier-1 review/tier-2 review/tier-3; do
            gh pr edit ${{ github.event.pull_request.number }} \
              --remove-label "$l" 2>/dev/null || true
          done
          gh pr edit ${{ github.event.pull_request.number }} \
            --add-label "${{ steps.tier.outputs.label }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Before this runs, create the three labels (review/tier-1, review/tier-2, review/tier-3) in your GitHub repo settings. The path patterns in grep -qE are illustrative; tune them to your actual directory structure.

The ai-generated label is set manually by the contributor today. If your AI tool's GitHub App supports labeling on commit, you can automate that too.

A policy template your team can commit

The automation handles assignment. A written policy handles what the reviewer is expected to actually do. Commit this to .github/ and reference it from CONTRIBUTING.md:

# .github/ai-review-policy.yml
# Review policy for AI-assisted and AI-generated pull requests.
# Version this file; update it when your team's trust calibration changes.

version: "1.0"

# Origin labels (applied by contributor before requesting review):
#   "ai-generated"  - PR was authored primarily by an AI tool
#   "ai-assisted"   - human-led PR with AI filling in sections
# No label = human-authored

tiers:
  tier-1:
    name: Skim
    requirements:
      min_approvals: 1
      ci_required: true
      checklist:
        - CI green
        - Full diff read (no section-skipping)
        - No hardcoded credentials or API keys
    target_sla_hours: 4

  tier-2:
    name: Scrutinize
    requirements:
      min_approvals: 1
      ci_required: true
      security_scan: recommended
      checklist:
        - Every changed function read and understood
        - Logic verified (not just syntax and style)
        - Branch coverage checked for new code paths
        - Error handling and edge cases reviewed
        - New dependencies audited (license + maintenance)
    target_sla_hours: 24

  tier-3:
    name: Mandatory Sign-off
    requirements:
      min_approvals: 2        # includes tech lead
      ci_required: true
      security_scan: required
      architecture_review: required
      checklist:
        - Threat model reviewed (updated if changed)
        - Rollback plan documented in PR description
        - Staging deployment verified before merge
        - Tech lead sign-off recorded in a review comment
        - Regulatory obligations flagged (EU AI Act if applicable)
    target_sla_hours: 48

This is a starting point. The SLAs in particular need to match your team's actual capacity; 48 hours for Tier 3 is a ceiling, not a target.
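One way to keep the policy from being purely advisory is a small merge check. The sketch below is an assumption-heavy illustration: `REQUIREMENTS` mirrors only the approval, CI, security scan, and tech lead fields of the policy file, and `PrSnapshot` and `unmetRequirements` are hypothetical names, not a real GitHub API shape.

```typescript
// Hypothetical mirror of the policy file; field names are illustrative.
type TierName = "tier-1" | "tier-2" | "tier-3";

type TierRequirements = {
  minApprovals: number;
  ciRequired: boolean;
  securityScanRequired: boolean;
  techLeadApprovalRequired: boolean;
};

const REQUIREMENTS: Record<TierName, TierRequirements> = {
  "tier-1": { minApprovals: 1, ciRequired: true, securityScanRequired: false, techLeadApprovalRequired: false },
  "tier-2": { minApprovals: 1, ciRequired: true, securityScanRequired: false, techLeadApprovalRequired: false },
  "tier-3": { minApprovals: 2, ciRequired: true, securityScanRequired: true, techLeadApprovalRequired: true },
};

// Hypothetical snapshot of a PR; a real check would fill this from your forge's API.
type PrSnapshot = {
  approvedBy: string[];
  ciPassed: boolean;
  securityScanPassed: boolean;
  techLead: string;
};

function unmetRequirements(tier: TierName, pr: PrSnapshot): string[] {
  const req = REQUIREMENTS[tier];
  const approvers = new Set(pr.approvedBy); // count each reviewer once
  const unmet: string[] = [];

  if (approvers.size < req.minApprovals) {
    unmet.push(`needs ${req.minApprovals} approval(s), has ${approvers.size}`);
  }
  if (req.ciRequired && !pr.ciPassed) unmet.push("CI has not passed");
  if (req.securityScanRequired && !pr.securityScanPassed) {
    unmet.push("security scan has not passed");
  }
  if (req.techLeadApprovalRequired && !approvers.has(pr.techLead)) {
    unmet.push("tech lead sign-off missing");
  }
  return unmet;
}

const paymentPr: PrSnapshot = {
  approvedBy: ["alice"],
  ciPassed: true,
  securityScanPassed: true,
  techLead: "bob",
};

// Reports the missing second approval and the missing tech lead sign-off.
console.log(unmetRequirements("tier-3", paymentPr));
```

Returning a list of unmet items rather than a boolean keeps the failure message useful to the reviewer who has to act on it.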

What review tooling can and cannot do

In March 2026, Anthropic launched a dedicated code review tool for Claude, joining existing tools from CodeRabbit, Bito, and others. These tools surface obvious issues automatically and are worth running as part of CI on every PR. They do not replace tier assignment.

The tier determines who looks at the output, with how much attention, and with what checklist. An automated reviewer can flag a suspicious SQL interpolation. It cannot tell you whether the new auth middleware changes the behavior of a third-party SSO integration in a way that matters. That is still a human judgment call, and tiers are how you make sure the right humans are making it.

Making tiers stick

A policy that lives only in a file nobody reads will not hold. Two things that help more than documentation:

The label is visible. When a reviewer opens the PR list and sees review/tier-3 on four open PRs, they know before clicking what those reviews require. That visibility reduces the chance of a quick glance standing in for a real review.

Pattern tracking matters too. If your Tier 3 queue is consistently dominated by one service, one contributor pattern, or one type of AI-generated output, that is a signal worth addressing at the root rather than only at review time. Some teams use a shared task board to track which work items were completed by AI agents and which are still waiting for human sign-off. Agiflow's Claude Code integration, for example, gives engineering leads a view of AI-agent task handoffs alongside open work, so reviewers have context before they open the diff.

Trust through evidence

The data on AI code quality is real, and the right response to it is not to treat AI-generated code as permanently suspect. It is to build the kind of evidence trail that tells you, over time, whether your calibration is right.

Start with the matrix. Add the labels. Ship the workflow. After a month, look at your Tier 2 defect rate for AI-generated PRs versus human ones. If they are converging, you can relax the AI-specific thresholds. If they are not, you have found the signal that tells you where to spend more attention.
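The monthly comparison can be as small as one function. This is a sketch assuming a hypothetical `MergedPr` record where "defect" means whatever your team already tracks (a revert, a hotfix, a linked bug report). The sample records are made up purely to exercise the code.

```typescript
// Hypothetical record of a merged PR.
type MergedPr = {
  origin: "human" | "ai-generated";
  tier: 1 | 2 | 3;
  defectAfterMerge: boolean;
};

function defectRate(
  prs: MergedPr[],
  origin: MergedPr["origin"],
  tier: MergedPr["tier"],
): number | null {
  const cohort = prs.filter((pr) => pr.origin === origin && pr.tier === tier);
  // No PRs in the cohort means no rate; null avoids a misleading 0.
  if (cohort.length === 0) return null;
  const defects = cohort.filter((pr) => pr.defectAfterMerge).length;
  return defects / cohort.length;
}

// Made-up records, only to exercise the function.
const lastMonth: MergedPr[] = [
  { origin: "ai-generated", tier: 2, defectAfterMerge: true },
  { origin: "ai-generated", tier: 2, defectAfterMerge: false },
  { origin: "ai-generated", tier: 2, defectAfterMerge: false },
  { origin: "ai-generated", tier: 2, defectAfterMerge: false },
  { origin: "human", tier: 2, defectAfterMerge: false },
  { origin: "human", tier: 2, defectAfterMerge: false },
];

console.log(defectRate(lastMonth, "ai-generated", 2)); // 0.25
console.log(defectRate(lastMonth, "human", 2)); // 0
console.log(defectRate(lastMonth, "human", 3)); // null
```

With cohorts this small the rates are noise, so treat a single month as a prompt to keep counting rather than a verdict.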

That is the version of trust that holds up in a tiered code review system. Earned, specific, adjusted as the tools improve. Not reflexive approval, and not permanent suspicion.

Developer Handoff Package in 2026: What Engineering Actually Needs

Vuong Ngo — Wed, 17 Jun 2026 12:03:47 +0000

A developer handoff package in 2026 fails engineering not because the spec is too short, but because it is too polished.

You get a neat paragraph about the feature, a few screenshots, maybe some acceptance language, and then the first question from the developer is the one the spec never answered: what exactly is the data model, what breaks, and what is explicitly out of scope?

That gap matters more now because the work has shifted from writing docs to orchestrating systems. IBM's 2026 AI trends piece argues that the competition is moving toward systems, not models, and that matches what handoff now looks like: the spec is part of the system, not just a note to the system. If the handoff cannot be checked, it will drift. If it cannot be executed, it will be rewritten by whoever builds it.

This post is for the person who keeps receiving AI-generated specs that look complete and then fall apart under implementation. I am going to show the smallest developer handoff package I would trust, how to make it machine-checkable, and how to turn one edge case into a test before code exists.

Why a polished spec still fails

As of 2026, the failure I see most often in AI-assisted product development is a developer handoff package that looks finished on the surface but omits the data model, edge cases, and constraints that engineering needs to build.

AI output often fails in the same places humans do, just faster. It sounds decisive. It forgets constraints.

That is why I like seeing spec-driven development treated as a real practice, not a slogan. GitHub's spec-kit repository (captured 2026-06-17) is a useful signal here: the spec is the durable artifact, not a disposable note. My read is simpler than the marketing around it. If the handoff is meant to guide code, it needs the same discipline as code. It needs structure, validation, and a place where omissions are obvious.

There is also a reason the problem keeps showing up in production. A recent Hacker News thread about evaluating an AI agent in production (captured 2026-06-17) reads like a list of boring failures that were not boring to the team living with them: environment assumptions, local-only dependencies, and behavior that looked fine until it touched the real system. That is the same pattern you see in weak handoffs. The prose looked finished. The implementation was not.

So I would not start with prettier writing. I would start with a better container for the work.

The minimum package I would send to engineering

I want a handoff package that can survive two different readers:

A developer scanning it for implementation detail.
A validator checking whether the spec is actually complete.

At minimum, the package should contain:

Component	Why it matters	What it prevents
Problem statement	Keeps the team aligned on the actual user pain	Building the wrong thing
Data model	Defines the objects and fields the code must handle	Hidden schema decisions
Edge cases	Names the weird paths before they become bugs	False confidence
Constraints	Makes platform, security, and performance limits visible	Unforced rewrites
Acceptance criteria	Turns intent into testable outcomes	Debates disguised as progress
Known limitations	States what the first version will not do	Scope drift
Out of scope	Draws the boundary in writing	Feature creep

What belongs in the handoff package: each section prevents a different failure mode.

That is the checklist. I would treat everything else as optional.

Here is the shape I mean in machine-readable form:

product: notification-scheduler
owner: product
problem: Teams need to schedule customer notifications without sending duplicates.
data_model:
  entities:
    - notification
    - delivery_attempt
    - schedule
  notes:
    - notification belongs to one schedule
    - delivery_attempt records retry state
edge_cases:
  - user saves without a timezone
  - duplicate schedule is submitted twice
  - provider timeout happens after the request is accepted
constraints:
  - must work in the existing Postgres schema
  - no background job may send twice for the same schedule window
  - error responses must remain JSON
acceptance_criteria:
  - user can create a schedule with timezone, channel, and message
  - duplicate submissions do not create duplicate sends
  - validation errors explain the missing field
known_limitations:
  - manual rescheduling is not supported in v1
  - only email and SMS channels are in scope
out_of_scope:
  - full campaign analytics
  - A/B testing
  - inbox-style reply handling

That file is not fancy. That is the point. A good handoff package should be boring enough that a validator can read it and strict enough that a developer can build from it without guessing.

Make the package fail when it is incomplete

The next step is the one most teams skip. They save the handoff as a nice document, then trust people to notice when it is missing something obvious.

I would rather fail the build.

This is a small TypeScript validator that reads handoff.yaml, parses it with js-yaml, and rejects the file if any required section is missing or empty.

Install:

npm install -D tsx zod js-yaml

Validator:

// validate-handoff.ts
// Run with: npx tsx validate-handoff.ts handoff.yaml

import fs from 'node:fs';
import path from 'node:path';

import yaml from 'js-yaml';
import { z } from 'zod';

const NonEmptyStringArray = z.array(z.string().min(1)).min(1);

const HandoffSchema = z.object({
  product: z.string().min(1),
  owner: z.string().min(1),
  problem: z.string().min(1),
  data_model: z.object({
    entities: NonEmptyStringArray,
    notes: z.array(z.string().min(1)).optional(),
  }),
  edge_cases: NonEmptyStringArray,
  constraints: NonEmptyStringArray,
  acceptance_criteria: NonEmptyStringArray,
  known_limitations: NonEmptyStringArray,
  out_of_scope: NonEmptyStringArray,
});

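const filePath = process.argv[2] ?? 'handoff.yaml';

// Fail with a readable message if the file is missing or is not valid YAML.
let parsed: unknown;
try {
  const raw = fs.readFileSync(path.resolve(filePath), 'utf8');
  parsed = yaml.load(raw);
} catch (error) {
  console.error(`${filePath}: ${error instanceof Error ? error.message : String(error)}`);
  process.exit(1);
}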

const result = HandoffSchema.safeParse(parsed);

if (!result.success) {
  for (const issue of result.error.issues) {
    const location = issue.path.length > 0 ? issue.path.join('.') : '(root)';
    console.error(`${location}: ${issue.message}`);
  }
  process.exit(1);
}

console.log(`OK: ${filePath} passed handoff validation`);

Validation is cheaper than trust: fail the spec before the developer pays for it.

I like this pattern because it changes the conversation. The question is no longer "does the doc look done?" The question becomes "can the doc survive validation?"

That is close to the direction GitHub is pointing with spec-kit (captured 2026-06-17), and it lines up with the market signal from Future AGI's 2026 trends post (captured 2026-06-17), which argues for custom evals and closed-loop testing instead of one-shot trust. I would treat that as a vendor claim, not neutral research, but the practical lesson still holds: the review loop needs to be explicit.

Turn one edge case into a test

The validator keeps the handoff honest. Tests keep the implementation honest.

The cleanest handoff package I have seen is the one where each edge case turns into one small test before the feature gets built. Not after. Before. That way the developer is not reverse-engineering product intent from a half-remembered meeting.

Here is a tiny example using Node's built-in test runner:

// schedule.test.ts
// Run with: npx tsx --test schedule.test.ts

import test from 'node:test';
import assert from 'node:assert/strict';

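// Minimal stand-in for the real implementation; only timezone matters here.
function createSchedule(input: { timezone?: string }) {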
  if (!input.timezone) {
    throw new Error('timezone is required');
  }

  return input;
}

test('rejects a missing timezone', () => {
  assert.throws(() => createSchedule({}), /timezone is required/);
});

That test is tiny on purpose. It does one job: it makes the edge case visible in code. Once you start doing this consistently, the handoff package stops being a narrative artifact and starts behaving like a contract.
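One way to make "each edge case becomes a test" hard to skip is to generate the stubs straight from the handoff file. The sketch below takes the edge_cases list and emits one test.todo line per entry for Node's built-in test runner. The function name and the "edge case:" prefix are my own choices for illustration, and it assumes you have already parsed handoff.yaml into an array of strings.

```typescript
// Turns the edge_cases list from handoff.yaml into node:test TODO stubs.
// Returns source text; writing it to a file is left to the caller.
function edgeCasesToTestStubs(edgeCases: string[]): string {
  const header = "import test from 'node:test';\n";
  const stubs = edgeCases.map(
    // JSON.stringify gives a safely quoted string literal for the test name.
    (edgeCase) => `test.todo(${JSON.stringify(`edge case: ${edgeCase}`)});`,
  );
  return [header, ...stubs].join('\n') + '\n';
}

const edgeCases = [
  'user saves without a timezone',
  'duplicate schedule is submitted twice',
  'provider timeout happens after the request is accepted',
];

console.log(edgeCasesToTestStubs(edgeCases));
```

The runner reports TODO tests separately from passing and failing ones, so an edge case nobody has implemented yet stays visible in every test run until someone replaces the stub.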

There is a trade-off here. More structure means more work up front. I think that is a good trade for any team that spends time reconciling ambiguous specs later. A prose-only PRD is quicker to write, but it is also much easier to misread. A design canvas gives you context, but not enough implementation detail. A spec repo plus validator is less forgiving, which is exactly why it works for handoff.

Approach	Best for	Weak point
Prose-only PRD	Early alignment, rough discovery	Hard to validate, easy to drift
Design canvas	Visual exploration and framing	Too light on implementation detail
Machine-readable handoff package	Teams that need explicit constraints	Requires discipline up front
Spec-first repo workflow	Code-first teams with repeatable reviews	More process, less improvisation

If you want the quick version of the argument: docs are for humans, contracts are for humans and machines.

What I would keep in the package and what I would cut

I would keep anything that changes implementation.

That means data shapes, state transitions, API assumptions, non-goals, error handling, and known limitations. It also means writing down what the system will not do. Leaving those parts implicit is how teams end up debating "obvious" behavior in review while the rest of the sprint sits blocked.

I would cut decorative detail. Big promises, vague outcomes, and speculative future phases do not help engineering ship the first version. They just create more places for the model, the writer, or the reader to improvise.

This is also where fairness matters. A handoff package is not the only way to work. Some teams will stay with a lightweight PRD. Some will keep using Figma, Notion, or Miro as the main coordination surface. Some code-first teams will prefer a spec repository and tools like GitHub's spec-kit because the spec lives next to the work. I would not pretend one tool wins everywhere. The real question is whether your process can show missing information before a developer pays for it.

IBM's 2026 trend framing is still the useful mental model for me here: when the system matters, the handoff has to describe the system. Not the vibe. Not the aspiration. The system.

Conclusion

A developer handoff package earns its keep when it does three things: it narrows ambiguity, it exposes omissions, and it gives engineering something they can validate before they build.

That does not require a giant doc. It requires the right fields, a strict enough validator, and at least one edge case turned into a test.

If you want to see one concrete example of the output shape, Agimon's developer handoff guide shows a structured handoff package with technical requirements, component specs, and API suggestions. I am not suggesting every team should copy that workflow. I am saying the handoff is better when it behaves like a contract instead of a memo.

Operating Model for AI Coding Agents: Delegate, Review, Own

Vuong Ngo — Sun, 14 Jun 2026 11:08:56 +0000

An operating model for AI coding agents isn't optional. As of mid-2026, it is the gap between teams that scale AI assistance and teams that drown in AI-generated review queues.

The pattern is predictable. You add an AI coding assistant, the team ships PRs faster, and within a few weeks the review queue is longer than it has ever been.

Opsera's 2026 AI Coding Impact Benchmark, drawn from 250,000 developers across 60-plus enterprises, puts numbers to it: AI reduces time-to-PR by up to 58%, but AI-generated pull requests wait 4.6 times longer in review than human-authored ones. The agent didn't slow you down. The process around the agent did.

The Anthropic 2026 Agentic Coding Trends report names this directly: verification and coordination are the new bottleneck, not writing code. The DORA 2025 research on AI-assisted software development adds an uncomfortable corollary: higher AI adoption correlates with both more delivery throughput and more delivery instability. Agents amplify what's already in place.

If you're an engineering lead who already has agents running and is watching the review queue grow, you probably don't need more agent capability. You need a process that tells the agent exactly what to do, tells the reviewer exactly what to check, and tells the team who owns the result.

That's the framework I'll walk through here: Delegate, Review, Own.

Figure 1: The Delegate-Review-Own loop. Scope boundaries are fixed before the agent runs; the decision log feeds the next iteration.

The Problem with Fuzzy Mandates

Before the mechanics: the coordination failure usually starts before the agent runs.

Augment Code's agentic engineering operating model guide states it plainly: "fuzzy team boundaries produce fuzzy agent scopes, with the same downstream coordination costs." Their framing is a useful starting point: a three-tier decision model that maps to what most teams are already running informally.

Tier	Who acts	Examples
Human-only	Human, no agent involvement	Architecture decisions, security calls, release approvals, defining agent scope itself
Agent-assisted	Agent generates; human approves before the effect is applied	PR authoring, test writing, refactor passes, documentation drafts
Fully autonomous	Agent executes within a pre-approved, policy-bounded scope	Lint fixes, dependency patch PRs within a constraint, scheduled changelog updates

The row most teams skip is the first one. You cannot delegate well if you haven't decided what is not delegatable. Once that boundary is explicit, the other two tiers become manageable.

The framework below assumes you've done that work. If you haven't, start there.

Delegate: Scope First, Task Second

The most common delegation mistake is handing the agent a task description and letting it decide the scope. The agent will interpret scope generously, because nothing in its prompt told it not to.

A well-formed MCP task delegation includes the scope boundary directly in the call:

{
  "method": "tools/call",
  "params": {
    "name": "run_task",
    "arguments": {
      "task_id": "T-204",
      "intent": "Refactor the user authentication module to use the new token-validation library",
      "allowed_paths": [
        "src/auth/",
        "tests/auth/"
      ],
      "definition_of_done": [
        "All existing auth tests pass",
        "No changes outside allowed_paths",
        "No new dependencies added without explicit approval"
      ],
      "out_of_scope": [
        "Do not modify session management",
        "Do not touch src/middleware/",
        "Do not update package.json"
      ]
    }
  }
}

The out_of_scope list is what people omit. It takes two minutes to write and prevents the agent from helpfully refactoring things adjacent to the task because they "looked related."

For longer-running work where the agent reads a context file at the start of its session, the same contract translates to YAML:

# task-brief.yaml
task_id: T-204
intent: "Refactor auth module to use new token-validation library"
owner: "@vuong"

allowed_paths:
  - src/auth/
  - tests/auth/

definition_of_done:
  - "All existing auth tests pass (run: pnpm test src/auth)"
  - "No changes outside allowed_paths"
  - "No new dependencies without explicit approval"

out_of_scope:
  - session management
  - src/middleware/
  - package.json modifications

stop_and_ask_on_uncertainty: true

The stop_and_ask_on_uncertainty flag is a convention, not a standard MCP field. Add it to your agent config as a rule: when the agent hits a decision it wasn't scoped for, it surfaces the question rather than resolving it silently. That one convention eliminates a large portion of the scope-drift issues that produce bloated PRs and ambiguous review requests.
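The scope boundary is only useful if something checks it. Here is a minimal sketch of that check, assuming you can get the list of changed files (for example from git diff --name-only) and that the path-like entries from out_of_scope have been pulled into their own list. The TaskBrief shape and the out_of_scope_paths field are my assumptions, and the prefix matching is deliberately naive.

```typescript
// Hypothetical subset of the task brief that can be checked mechanically.
interface TaskBrief {
  allowed_paths: string[];
  out_of_scope_paths: string[]; // only the path-like out_of_scope entries
}

// A file is a violation if it is outside every allowed path,
// or if it falls under any explicitly excluded path.
function findScopeViolations(changedFiles: string[], brief: TaskBrief): string[] {
  return changedFiles.filter((file) => {
    const allowed = brief.allowed_paths.some((prefix) => file.startsWith(prefix));
    const excluded = brief.out_of_scope_paths.some((prefix) => file.startsWith(prefix));
    return !allowed || excluded;
  });
}

const brief: TaskBrief = {
  allowed_paths: ['src/auth/', 'tests/auth/'],
  out_of_scope_paths: ['src/middleware/', 'package.json'],
};

const changedFiles = [
  'src/auth/token.ts',
  'tests/auth/token.test.ts',
  'src/middleware/session.ts',
  'package.json',
];

// Logs the two out-of-scope files: src/middleware/session.ts and package.json
console.log(findScopeViolations(changedFiles, brief));
```

Entries like "session management" cannot be checked by path, which is why the review checklist still needs a human to confirm them.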

Review: Deliberate, Not Accidental

If AI-generated PRs already wait 4.6 times longer for review, the answer is not to skip review. It's to make the wait intentional rather than incidental.

The difference is specificity. "Needs review" tells the reviewer nothing. An agent-generated PR should carry a checklist that maps to the actual failure modes of agent-produced code: logic drift from the acceptance criteria, scope overrun, and security exposure in the changed path.

The Opsera report is specific on the security point: AI-generated code carries 15 to 18 percent more security vulnerabilities than human-authored code. A review gate that ignores that is not a real gate.

Figure 2: AI cuts time-to-PR by 58% but AI-generated PRs wait 4.6x longer in review. Source: Opsera AI Coding Impact 2026 Benchmark.

Here's a GitHub Actions gate that blocks agent-generated PRs until a reviewer confirms the right things:

# .github/workflows/agent-pr-review.yml
name: Agent PR Review Gate

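on:
  pull_request:
    # edited: re-run when the checklist in the PR description changes
    # labeled: run when agent-generated is added after the PR is opened
    types: [opened, edited, labeled, reopened, synchronize]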

jobs:
  check-agent-pr:
    runs-on: ubuntu-latest
    if: contains(github.event.pull_request.labels.*.name, 'agent-generated')

    steps:
      - name: Require human review checklist
        uses: actions/github-script@v7
        with:
          script: |
            const body = context.payload.pull_request.body || '';
            const required = [
              '- [x] Scope: changes are within the declared allowed_paths',
              '- [x] Intent: output matches the task definition-of-done',
              '- [x] Security: no new auth, session, or credential handling introduced',
            ];
            const allChecked = required.every(item => body.includes(item));
            if (!allChecked) {
              core.setFailed(
                'Agent-generated PR is missing the required human review checklist. ' +
                'Add and complete each item in the PR description before merging.'
              );
            }

Label agent-generated PRs with agent-generated as part of your delegation step. The gate then becomes self-activating. Reviewers know what they're looking at and what they're responsible for confirming.

The checklist items aren't arbitrary. "Scope" and "Intent" address the two most common agent failure modes. "Security" is there because the data says it should be.
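If you want the failure message to say which item is missing, the check can be pulled out into a small function and tested on its own. This is a sketch, not part of the workflow above: the function name is mine, and it assumes the checklist items appear one per line in the PR description. It also accepts an uppercase X in the checkbox, which an exact string match would reject.

```typescript
const REQUIRED_ITEMS = [
  'Scope: changes are within the declared allowed_paths',
  'Intent: output matches the task definition-of-done',
  'Security: no new auth, session, or credential handling introduced',
];

// Returns the required items that are not ticked in the PR description.
function missingChecklistItems(prBody: string): string[] {
  const checked = new Set(
    prBody
      .split(/\r?\n/)
      .map((line) => line.trim())
      .filter((line) => /^- \[[xX]\] /.test(line)) // ticked checkboxes only
      .map((line) => line.slice('- [x] '.length)),
  );
  return REQUIRED_ITEMS.filter((item) => !checked.has(item));
}

const prBody = [
  '## Review checklist',
  '- [x] Scope: changes are within the declared allowed_paths',
  '- [X] Intent: output matches the task definition-of-done',
  '- [ ] Security: no new auth, session, or credential handling introduced',
].join('\n');

// Logs only the unticked Security item.
console.log(missingChecklistItems(prBody));
```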

Own: Log the Decision So the Next Session Has Evidence

The delegate and review steps protect you during the run. Own protects you after it.

Once a reviewer approves an agent-executed task, record why. Not for compliance, though that's a side benefit. Because the next agent session, or a developer who joins the team next quarter, has no context for a decision made in a prior conversation that no longer exists.

A minimal decision log entry:

# decision-log/T-204.yaml
task_id: T-204
approved_by: "@vuong"
approved_at: "2026-06-14T09:32:00+10:00"
summary: "Auth module refactored to token-validation v2. All existing tests pass."

acceptance_check: "pnpm test src/auth (47 tests, 0 failures)"
scope_confirmed: true
out_of_scope_violations: none

rollback_plan: "Revert to commit abc123f if auth failure rate exceeds 0.5% in first 24h"

deferred:
  - id: T-205
    note: "Agent proposed removing legacy token cache during execution. Deferred pending security review."

The deferred block is the part that compounds over time. When the agent proposes something outside scope, that proposal shouldn't vanish into a dismissed PR comment. Log it as a deferred item with an ID. The next session has a starting point rather than a blank slate.
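Once deferred items carry IDs, finding the ones nobody has picked up is a short query over the log. The sketch below assumes the decision log files have already been parsed into objects, and it treats a deferred item as resolved once a decision log entry exists under the same ID. Both the types and that resolution rule are my assumptions for illustration.

```typescript
// Hypothetical types mirroring a parsed decision log entry.
interface DeferredItem {
  id: string;
  note: string;
}

interface DecisionLogEntry {
  task_id: string;
  approved_by: string;
  summary: string;
  deferred?: DeferredItem[];
}

interface OpenProposal extends DeferredItem {
  raised_in: string; // task whose run produced the proposal
}

// Collect deferred proposals that no decision log entry has picked up yet.
function openDeferredItems(entries: DecisionLogEntry[]): OpenProposal[] {
  const decided = new Set(entries.map((entry) => entry.task_id));
  return entries.flatMap((entry) =>
    (entry.deferred ?? [])
      .filter((item) => !decided.has(item.id))
      .map((item) => ({ ...item, raised_in: entry.task_id })),
  );
}

const log: DecisionLogEntry[] = [
  {
    task_id: 'T-204',
    approved_by: '@vuong',
    summary: 'Auth module refactored to token-validation v2.',
    deferred: [
      { id: 'T-205', note: 'Agent proposed removing legacy token cache. Deferred pending security review.' },
    ],
  },
];

// Logs one open proposal: T-205, raised in T-204.
console.log(openDeferredItems(log));
```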

If your team uses a project board to manage agent work, these decision records belong there rather than scattered across PR threads. Agiflow models work units with status tracking, artifact storage, and workflow locks so decision logs have a stable, addressable home the agent can reference in subsequent runs. That's a useful pattern regardless of which board you use; the critical thing is that the record lives somewhere durable and findable, not in a conversation that expires.

What Changes When This Operating Model Runs at Scale

With one agent and one task, these three steps are easy to follow manually. They become more important, not less, when you're running multiple agents across multiple work units simultaneously.

The CIO.com coverage of McKinsey's agentic AI research notes that organizations achieving 20 to 40 percent operating cost reductions from AI share one attribute: a deliberate orchestration layer with audit trails built in from the start. The article frames this as a correlation rather than proven causation, which is honest. But the direction is clear: coordination discipline is what makes the gains stick.

The DORA finding I cited earlier is the plain version of the same point. AI amplifies what's already there. Strong teams with clear ownership and tight feedback loops get better. Teams with fuzzy handoffs and unclear mandates find those problems more expensive, not cheaper, to untangle.

Delegate with an explicit scope. Review against that scope. Own a record of what changed and why, and hand that record to the next session.

The loop is short. The discipline is the work.

Three Kinds of AI Context: Most Tools Only Solve One

Vuong Ngo — Tue, 09 Jun 2026 13:02:08 +0000

AI context failure bundles three distinct problems: personal context (who you are), product-decision context (what the product should do), and local task persistence (what work is queued). Two new tools and one Anthropic feature each solve one layer. But the fourth layer — a shared, writable contract of the current open work item — is what none of them address, and it's why developers who've installed all three still feel stuck.

You set up a CLAUDE.md. Maybe you wrote memory files. In the week of 8 June 2026, two tools hit Product Hunt — one for personal context, one for product decisions — and you installed those too. You are still re-explaining the project at the start of every session.

The problem is the diagnosis. "AI starts from scratch" treats one frustration as one cause. It isn't. It's at least three separate context failures that happen to produce the same symptom, and most tools solve exactly one of them.

Here's the model I've landed on after watching this category for the past six months.

Layer 1 — Who you are

This is the most static layer. Your stack, your role, your preferences, how you like commit messages formatted, what framework you avoid. It changes roughly as often as your LinkedIn headline.

Unabyss hit #1 on Product Hunt on launch day with 755 upvotes for solving exactly this. The tagline is unambiguous: "Set it up once and never re-explain yourself to AI again." It pulls structured context from LinkedIn, Notion, and Gmail, then exposes it to any tool that speaks MCP, with per-tool visibility controls.

If this is your gap, you can fix it manually today. A CLAUDE.md covering who you are is a perfectly working solution for a single assistant:

## About me

- Stack: TypeScript, Node, PostgreSQL, React
- Prefer functional React; no class components
- Testing: Vitest + Testing Library; mock at service boundaries only
- Commit style: conventional commits, imperative mood, no period
- Time zone: AEST

The limit is portability. If you run multiple agents, switch machines, or want consistent preferences across tools, file-per-tool doesn't hold. That's the gap Unabyss fills: one writable context store, readable by anything that asks.

Layer 2 — What your product should do

This layer moves slower than your task queue but faster than your identity. It's the architectural decisions already made, the approaches that were ruled out and why, the constraints that aren't visible in the code.

Brief launched this week and reached #5 with 253 upvotes. The problem statement is sharp: "AI agents can ship quickly, but without the right product context, they're often flying blind." Brief stores those decisions and serves relevant context to agents through chat, Slack, CLI, and MCP.

A CLAUDE.md can carry this layer too, but it gets unwieldy:

## Architecture decisions

- Auth: custom JWT + refresh token. Rejected Clerk (vendor lock-in concern, 2024-11).
  See: docs/ADRs/0003-auth-approach.md
- DB: Postgres. MongoDB ruled out early — our query patterns are relational.
- Background jobs: BullMQ. No migration to new runners without a spike.

At a certain point, keeping that file accurate is its own maintenance job. Tools like Brief try to automate the curation. Whether you use a tool or a disciplined ADR directory, the important thing is that this layer exists and stays current — because an assistant that doesn't know why the auth system looks the way it does will confidently propose changes you ruled out eight months ago.

Layer 3 — What work is left (on this machine)

On January 22, 2026, Anthropic shipped Claude Code Tasks — a persistent task system that survives session termination. Tasks live in ~/.claude/tasks/ as JSON:

{
  "id": "01JJ3QZWZ4R2XM6GBTF9V7Y8KP",
  "title": "Implement rate limiting on /api/v1/completions",
  "status": "in_progress",
  "dependencies": ["01JJ3QY..."],
  "owner": "claude",
  "created": "2026-01-24T08:12:00Z"
}

Before Tasks, Claude Code stored todos in session memory. They disappeared when the terminal closed. Tasks fix this: create them once, and they persist across restarts, terminal crashes, and session resets. That's a genuine improvement over the status quo.

The constraint is scope. Tasks are local. They live on one machine. They store orchestration metadata — status, dependencies, owner — but not the content of the work. What "done" means, what the acceptance criteria are, what artifacts prove the task is complete. And they don't synchronise across machines or agents.

The four layers, together

Layer	What it answers	Change frequency	Example tools
1 — Personal context	Who you are, preferences, stack	Rarely (months)	Unabyss, CLAUDE.md
2 — Product-decision context	What should be built and why	Occasionally (weeks)	Brief, ADRs
3 — Local task persistence	What work is queued on this machine	Constantly (sessions)	Claude Code Tasks
4 — Structured current-work context	What is open, what done means, what proves it	Constantly, shared	—

The dash in that last column is where most developers who've installed layers 1–3 are still stuck.

The four layers plotted by how often they change and who can see them. Layers 1 and 2 sit in the slow-change rows; layers 3 and 4 are in constant flux. Most tools cover the left column. The top-right cell is the gap. (Author's model.)

The AI Context Gap Nobody Names

Walk through a real session. You open a new Claude Code instance. Layer 1 tells it you prefer TypeScript and conventional commits. Layer 2 tells it why the auth system looks the way it does. Layer 3 tells it there's a task called "Implement rate limiting" in progress.

What it doesn't know: what done means for that task. What the acceptance criteria are. Whether there's a failing test waiting. Whether another agent already started the same work in a different worktree. Whether the spec changed since you queued the task.

That information isn't in your CLAUDE.md. It's not in your decisions log. It's not in the Tasks JSON. It's the contract of the work — and it needs to live somewhere shared, writable, and structured. Not a file you write once and hope stays accurate.

This is also what distinguishes Layer 4 from the others in a practical sense: the contract changes as the work progresses. An assistant needs to be able to read it at session start and write to it as evidence accumulates. Static files can't do that.

Why a longer prompt doesn't close this gap

The instinct is to paste more context into the system prompt or CLAUDE.md. It rarely helps, and there's a mechanical reason.

Model recall by position in a long context window. Acceptance criteria buried in paragraph 12 of a CLAUDE.md face the worst retrieval odds. Based on Liu et al. 2023. Y-axis values are qualitative only.

A 2023 study on how language models use long contexts — "Lost in the Middle" — showed that models retrieve information reliably from the start and end of long inputs but degrade badly for content in the middle. The longer the context window, the more of your carefully-written CLAUDE.md sits in the graveyard.

Anthropic's context engineering guide for agents says it directly: "context is a critical but finite resource." The guidance is to treat it as something you curate and structure, not something you dump in bulk.

For Layer 4, the implication is concrete. If the acceptance criteria for a task are buried in paragraph 12 of a 600-line memory file, the assistant is not reliably reading them. They need to be in a distinct, retrievable record — something the assistant fetches on demand rather than scans.

What structured current-work context actually looks like

Here's the shape of the missing piece. This isn't a vendor-specific format — it's what a work item record needs to carry to be genuinely useful to an AI assistant at session start:

{
  "id": "wu_01JJ3R",
  "title": "Rate limiting on /api/v1/completions",
  "status": "active",
  "acceptanceCriteria": [
    "Returns 429 with Retry-After header when limit exceeded",
    "Limit is configurable per API key, not global",
    "Integration test covers the 429 path with a real Redis instance"
  ],
  "artifacts": [
    {
      "type": "spec",
      "label": "Rate limit spec",
      "url": "https://...",
      "linkedAt": "2026-06-08T09:00:00Z"
    },
    {
      "type": "test-result",
      "label": "Failing test run (pre-fix)",
      "url": "https://...",
      "linkedAt": "2026-06-09T11:43:00Z"
    }
  ]
}

Compare that to the Layer 3 task record. Layer 3 tells the agent that a task exists, who owns it, and whether it's in progress. Layer 4 tells it what the task means — the criteria that will constitute evidence of completion, and the evidence that already exists.
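To make the difference concrete, here is a sketch of what an assistant-side helper might do with a Layer 4 record at session start: put the acceptance criteria and existing evidence into a short brief, so they are not buried in a long memory file. The interfaces mirror the JSON above. The function name and the brief layout are my assumptions, not any tool's format.

```typescript
interface Artifact {
  type: string;
  label: string;
  url: string;
  linkedAt: string;
}

interface WorkUnit {
  id: string;
  title: string;
  status: string;
  acceptanceCriteria: string[];
  artifacts: Artifact[];
}

// Renders a work unit as a short brief: criteria first, then evidence.
function formatSessionBrief(unit: WorkUnit): string {
  const evidence =
    unit.artifacts.length > 0
      ? unit.artifacts.map((artifact) => `  - [${artifact.type}] ${artifact.label}`)
      : ['  (none linked yet)'];
  return [
    `Work unit ${unit.id}: ${unit.title} (${unit.status})`,
    'Acceptance criteria:',
    ...unit.acceptanceCriteria.map((criterion, index) => `  ${index + 1}. ${criterion}`),
    'Evidence so far:',
    ...evidence,
  ].join('\n');
}

const workUnit: WorkUnit = {
  id: 'wu_01JJ3R',
  title: 'Rate limiting on /api/v1/completions',
  status: 'active',
  acceptanceCriteria: [
    'Returns 429 with Retry-After header when limit exceeded',
    'Limit is configurable per API key, not global',
  ],
  artifacts: [
    { type: 'spec', label: 'Rate limit spec', url: 'https://...', linkedAt: '2026-06-08T09:00:00Z' },
  ],
};

console.log(formatSessionBrief(workUnit));
```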

Wiring this up over MCP looks like any other context server:

{
  "mcpServers": {
    "project": {
      "command": "npx",
      "args": ["-y", "@your-tool/project-mcp@latest"],
      "env": {
        "PROJECT_API_KEY": "your-key"
      }
    }
  }
}

With this configured, the assistant can call get_work_unit at the start of every session and receive the full record — criteria, artifacts, status — fetched fresh. Not read from a static file that may have drifted.

For a detailed breakdown of how this plays out across multiple tasks and agents, Agiflow's write-up on coordinating multi-task workflows with work units covers the model in practice.

Which layer is actually your problem

You probably don't need all four fixed today.

If the assistant keeps asking about your stack or commit style: Layer 1. A good CLAUDE.md or Unabyss solves it in an afternoon.

If it makes decisions that contradict past architecture choices: Layer 2. Start writing ADRs, or try Brief.

If it loses track of what it was doing when the session ends: Layer 3. Claude Code Tasks has already shipped, it's free, and it's local.

If it knows what work is queued but not what done means, drifts off the spec mid-session, or can't pick up where another agent left off: that's Layer 4. No static file solution handles it cleanly. You need a writable, shared, structured source of truth for the current work contract.

Most developers I've seen hit Layer 4 and diagnose it as Layer 3. They add more to CLAUDE.md, the agent still drifts, and the conclusion is "AI just isn't reliable enough yet." Sometimes that's true. More often, the right AI context structure was never there to begin with.

MCP Server for Task Tracking: What the MCP Tasks Extension Specifies in 2026

Vuong Ngo — Tue, 02 Jun 2026 15:06:01 +0000

If you are building an MCP server for task tracking, you eventually hit the same wall: the work outlives the connection. As of June 2026, that is exactly the gap MCP Tasks is trying to close. The important part is not that it exists — it is how far the protocol has actually stabilized, what still looks experimental, and where application-layer task boards already solved the same shape in a different way.

This post is a skeptical read for developers and technical architects. We will walk through the current MCP Tasks semantics, look at the state machine, compare the extension to A2A, and end with a minimal application-layer example of durable, agent-readable task state. Agiflow appears only as one concrete example of a board that exposes scoped task state to agents, not as a recommendation or a proof of spec maturity.

Why this exists at all

The core problem is simple: long-running work does not fit cleanly into a request/response cycle. If the model needs to wait for a batch job, a human approval, or a slow external API, blocking a connection is fragile and hard to resume.

The current MCP roadmap makes that explicit. The 2026 MCP roadmap (captured 2026-06-02) still treats task lifecycle edge cases as active protocol work, and the Tasks overview (captured 2026-06-02) describes a durable handle, not a streaming socket replacement.

That distinction matters. A task ID is a state handle. It is not a chat transcript. It is not a job queue. It is not a promise that every client and server will agree on the same retry behavior next quarter.

The shape of MCP Tasks

The protocol model is a durable task with a small state machine. The useful bit for builders is that the shape is explicit:

Surface	What looks stable	What still moves	Practical takeaway
Task identity	Durable task ID	Retention policies	Persist the ID and expect resume/poll flows
State machine	`working`, `input_required`, `completed`, `failed`, `cancelled`	Transition edge cases	Design for clear terminal states
Mid-flight input	`tasks/update` for outstanding input requests	Client UX details	Treat human approval as a first-class path
Transport behavior	Polling works everywhere	Push support varies	Poll first, subscribe second
Lifecycle policy	Terminal states are terminal	Retry and expiry semantics are still evolving	Keep your implementation conservative

MCP Tasks state machine: working and input_required are non-terminal; completed, failed, and cancelled are terminal. Source: MCP Tasks extension overview (captured 2026-06-02).
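
The state machine above is small enough to sketch directly. The status names come from the table; the `isTerminal` and `canTransition` helpers are hypothetical, and the rule that any non-terminal state may move to any other state is my assumption for illustration, not text from the extension.

```typescript
type TaskStatus = "working" | "input_required" | "completed" | "failed" | "cancelled";

// Terminal states as listed in the table above.
const TERMINAL_STATES: ReadonlySet<TaskStatus> = new Set<TaskStatus>([
  "completed",
  "failed",
  "cancelled",
]);

// Hypothetical helper: true once a task can no longer change.
function isTerminal(status: TaskStatus): boolean {
  return TERMINAL_STATES.has(status);
}

// Assumption: any non-terminal state may move to any different state.
// The real extension may restrict this further.
function canTransition(from: TaskStatus, to: TaskStatus): boolean {
  return !isTerminal(from) && from !== to;
}
```

A guard like this is cheap to put in front of any write path so that a completed or cancelled task is never reopened by accident.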

The easiest way to see the negotiation is to look at the client and server capabilities side by side.

{
  "_meta": {
    "io.modelcontextprotocol/clientCapabilities": {
      "extensions": {
        "io.modelcontextprotocol/tasks": {}
      }
    }
  }
}

{
  "capabilities": {
    "extensions": {
      "io.modelcontextprotocol/tasks": {}
    }
  }
}

If both sides opt in, the server can return a CreateTaskResult instead of a synchronous result.

{
  "resultType": "task",
  "task": {
    "taskId": "task_01JQ2Z8XKQ7G6Q5X5ZK1N7T9A2",
    "status": "working",
    "ttlMs": 600000,
    "pollIntervalMs": 2000
  }
}

The point of the ttlMs and pollIntervalMs fields is not cosmetic. They tell the client how long the task can reasonably be resumed and how often to ask for an update.

Polling, completion, and input_required

Once a task exists, the client follows a straightforward loop: poll until terminal, or respond if the server pauses for input.

{
  "taskId": "task_01JQ2Z8XKQ7G6Q5X5ZK1N7T9A2"
}

{
  "taskId": "task_01JQ2Z8XKQ7G6Q5X5ZK1N7T9A2",
  "status": "completed",
  "result": {
    "summary": "Build finished successfully.",
    "artifacts": [
      "dist/app.js",
      "dist/app.js.map"
    ]
  }
}

If the work needs a decision, the protocol explicitly stops pretending it can continue alone.

{
  "taskId": "task_01JQ2Z8XKQ7G6Q5X5ZK1N7T9A2",
  "status": "input_required",
  "inputRequests": {
    "approve_release": {
      "type": "confirmation",
      "message": "Approve deployment to staging?"
    }
  }
}

{
  "taskId": "task_01JQ2Z8XKQ7G6Q5X5ZK1N7T9A2",
  "inputResponses": {
    "approve_release": {
      "confirmed": true
    }
  }
}

That is the most useful mental model here: task state is not an implementation detail. It is a protocol surface for deferred work.
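
The client loop described above boils down to one decision per poll. Here is a minimal sketch of that decision as a pure function. The `TaskSnapshot` and `NextAction` types, the `nextAction` name, and the 2000 ms fallback are assumptions for illustration; only the status values and `pollIntervalMs` come from the examples above.

```typescript
type TaskStatus = "working" | "input_required" | "completed" | "failed" | "cancelled";

interface TaskSnapshot {
  taskId: string;
  status: TaskStatus;
  pollIntervalMs?: number;
}

type NextAction =
  | { kind: "wait"; delayMs: number } // poll again after the delay
  | { kind: "respond" }               // answer the outstanding input request
  | { kind: "done" };                 // terminal state, stop polling

// Hypothetical fallback when the server does not suggest an interval.
const DEFAULT_POLL_INTERVAL_MS = 2000;

// Decide what the client should do after reading one task snapshot.
function nextAction(task: TaskSnapshot): NextAction {
  switch (task.status) {
    case "working":
      return { kind: "wait", delayMs: task.pollIntervalMs ?? DEFAULT_POLL_INTERVAL_MS };
    case "input_required":
      return { kind: "respond" };
    case "completed":
    case "failed":
    case "cancelled":
      return { kind: "done" };
  }
}
```

Keeping the decision separate from the transport makes it easy to test without a server, and the actual fetch and sleep calls can wrap around it.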

The experimental repository reinforces the caution. The experimental Tasks spec repo (captured 2026-06-02) labels the extension experimental and warns that it may change or disappear. That does not make it unusable. It does mean you should treat it like an evolving contract.

What is stable for MCP task tracking in production

My read is conservative:

Safe to build on: the existence of a durable task handle, explicit task states, polling, and cooperative cancellation.
Use with caution: exact retry behavior, retention/expiry policy, and any client-specific push notification behavior.
Do not over-interpret: the fact that a task exists does not mean every agent workflow should become a task.

The roadmap backs that up. In the 2026 MCP roadmap (captured 2026-06-02), retry semantics and result-retention policy are still called out as open gaps. That is a clue that the shape is real, but the surrounding policy is still settling.

There is also market signal, but it is still only market signal. SiliconANGLE's February 12, 2026 report on Manufact (captured 2026-06-02) shows money flowing into MCP infrastructure. That says the category is getting real attention. It does not prove the protocol has finished evolving.

MCP Tasks is not the only answer

If your use case is agent-to-agent collaboration rather than deferred execution, you should also look at A2A. The A2A specification (captured 2026-06-02) focuses on inter-agent communication, capability discovery, and collaborative tasks with its own task lifecycle and message model.

That makes the comparison easier:

Topic	MCP Tasks	A2A
Primary problem	Deferred execution inside MCP requests	Coordination between independent agents
Core primitive	Durable task handle	Agent-to-agent session/task exchange
Best fit	Slow tools, approvals, batch work, resumable jobs	Multi-agent workflows and interoperability
Failure mode	Overfitting every long job into one protocol	Using a coordination protocol when you only need deferred tool execution

In other words, use the simplest layer that matches the problem. MCP Tasks is a good fit when the work is still fundamentally one request that needs to finish later. A2A is a better fit when the work is really a conversation between agents.

A minimal application-layer analogue

This is the part that often gets conflated with the protocol extension. A board can expose the same shape without implementing MCP Tasks itself: stable ID, readable status, and a write path that updates state as work progresses.

Two independent primitives converge on the same structural shape: a stable durable ID, an explicit status state machine, and pollable, readable state.

That is close to what Agiflow's connection docs show at the application layer: a scoped task endpoint that lets an assistant work against durable board state. The important distinction is still the same one from above. That is an application-level model, not proof that the product implements the MCP Tasks extension.

type TaskStatus = "working" | "input_required" | "completed" | "failed" | "cancelled";

type TaskRecord = {
  taskId: string;
  title: string;
  status: TaskStatus;
  updatedAt: string;
  notes?: string;
};

const tasks = new Map<string, TaskRecord>([
  [
    "task_123",
    {
      taskId: "task_123",
      title: "Review deployment checklist",
      status: "working",
      updatedAt: new Date().toISOString(),
    },
  ],
]);

export function readTaskState(taskId: string): TaskRecord | undefined {
  return tasks.get(taskId);
}

export function writeTaskState(
  taskId: string,
  patch: Partial<Pick<TaskRecord, "status" | "notes">>,
): TaskRecord {
  const current = tasks.get(taskId);

  if (!current) {
    throw new Error(`Unknown task: ${taskId}`);
  }

  const next: TaskRecord = {
    ...current,
    ...patch,
    updatedAt: new Date().toISOString(),
  };

  tasks.set(taskId, next);
  return next;
}

const before = readTaskState("task_123");

if (before?.status === "working") {
  writeTaskState("task_123", {
    status: "input_required",
    notes: "Need approval for the release window.",
  });
}

This is the same shape MCP Tasks is formalizing, minus the protocol mechanics. You can swap the in-memory Map for a database row, a Durable Object, a project board record, or an external job handle. The pattern stays the same.

What I would do in practice

My rule of thumb is boring on purpose:

Use core MCP for synchronous tool calls.
Use MCP Tasks when the work is truly deferred and the client can resume later.
Keep the state machine simple and terminal-state driven.
Treat push notifications as an optimization, not a requirement.
Reach for A2A when the problem is really agent coordination, not deferred execution.

That gets you most of the value without pretending the extension is more mature than it is.

Wrapping up

The useful headline is not "MCP Tasks is done." The useful headline is that MCP server task tracking now has a credible protocol shape: durable, resumable, long-running work that survives disconnects. That is a real step forward, but it is still an evolving surface area, so the safest implementation stance is conservative.

If you want to see how a real product scopes assistants to board state and task context, start with Agiflow's connection docs. It is a good application-layer contrast to the protocol story here, and it makes the boundary between "board state" and "protocol task" much easier to see.

MCP, Code, or Commands? A Decision Framework for AI Tool Integration

Vuong Ngo — Sun, 07 Dec 2025 05:16:36 +0000

When building AI-assisted development workflows, the documentation explains what each approach does—but not the real cost implications or when to use which.

I instrumented network traffic and ran controlled experiments across five approaches using identical tasks: same 500-row dataset, same analysis requirements, same model (Claude Sonnet). The results revealed that architecture matters more than protocol choice.

MCP Optimized consumed 60,420 tokens. MCP Vanilla consumed 309,053 tokens. Same protocol. Same task. 5x difference—driven entirely by one decision: file-path references vs. data-array parameters.

This article provides a decision framework based on measured data, not marketing claims.

The Decision Framework

Before diving into data, here's the framework I developed from these experiments:

Quick Decision Guide

If your situation is...	Use this approach
Repeating task (>20 executions), large datasets, need predictable costs	MCP Optimized
One-off exploration, evolving requirements, prototyping	Code-Driven (Skills)
User must control when it runs, deterministic behavior needed	Slash Commands
Production system with security requirements	MCP Optimized (never Skills)

Decision Flowchart

Q1: One-off task (< 5 executions)?
    YES → Code-Driven or direct prompting
    NO  → Continue

Q2: Dataset > 100 rows AND need < 5% cost variance?
    YES → MCP Optimized
    NO  → Continue

Q3: User needs explicit control over invocation?
    YES → Slash Commands
    NO  → Continue

Q4: Execution count > 20 AND requirements stable?
    YES → MCP Optimized
    NO  → Code-Driven (prototype, then migrate)

NEVER:
  - MCP Vanilla for production (always suboptimal)
  - Skills for multi-user or sensitive systems
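
The flowchart translates almost line for line into a function. This is a sketch only: the `Workload` shape and the `chooseApproach` name are hypothetical, and Q1's "Code-Driven or direct prompting" is collapsed into a single answer.

```typescript
type Approach = "Code-Driven" | "MCP Optimized" | "Slash Commands";

// Hypothetical description of a workload, one field per flowchart question.
interface Workload {
  expectedExecutions: number;
  datasetRows: number;
  needsLowCostVariance: boolean;   // "< 5% cost variance" in Q2
  userControlsInvocation: boolean; // Q3
  requirementsStable: boolean;     // Q4
}

function chooseApproach(w: Workload): Approach {
  // Q1: one-off task
  if (w.expectedExecutions < 5) return "Code-Driven";
  // Q2: larger dataset with a tight cost-variance budget
  if (w.datasetRows > 100 && w.needsLowCostVariance) return "MCP Optimized";
  // Q3: the user must decide when it runs
  if (w.userControlsInvocation) return "Slash Commands";
  // Q4: repeated and stable
  if (w.expectedExecutions > 20 && w.requirementsStable) return "MCP Optimized";
  // Otherwise prototype first, then migrate
  return "Code-Driven";
}
```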

The Three Approaches Explained

MCP (Model Context Protocol)

A structured protocol for AI-tool communication. The model calls tools with JSON parameters, the server executes and returns structured results.

// MCP tool call - structured, typed, validated
await call_tool('analyze_csv_file', {
  file_path: '/data/employees.csv',
  analysis_type: 'salary_by_department'
});

Characteristics: Structured I/O, access-controlled, model-decided invocation, reusable across applications.

Critical distinction: There's a 5x token difference between vanilla MCP (passing data directly) and optimized MCP (passing file references). Same protocol, vastly different economics.

Code-Driven (Skills & Code Generation)

The model writes and executes code to accomplish tasks. Claude Code's "skills" feature lets the model invoke capabilities based on semantic matching.

# Claude writes this, executes it, iterates
import pandas as pd
df = pd.read_csv('/data/employees.csv')
result = df.groupby('department')['salary'].mean()
print(result)

Characteristics: Maximum flexibility, unstructured I/O, higher variance between runs, requires sandboxing.

Slash Commands

Pure string substitution. You type /review @file.js, the command template expands, and the result is injected into your message.

<!-- .claude/commands/review.md -->
Review the following file for security vulnerabilities,
performance issues, and code quality:

{file_content}

Focus on: authentication, input validation, error handling.

Characteristics: User-explicit, deterministic, single-turn, zero tool-call overhead.

Measured Data: What the Numbers Show

Methodology

Same workload: load 500-row CSV, perform grouping, summary stats, two plots
Same model: Claude Sonnet, default settings
3-4 runs per approach with logged request/response payloads
Costs calculated at current Claude Sonnet pricing

Token Consumption

Token consumption per API request. MCP Optimized achieves consistently low usage through file-path architecture.

Approach	Avg tokens/run	vs Baseline	Why
MCP Optimized	60,420	-55%	File-path parameters; zero data duplication
MCP Proxy (warm)	81,415	-39%	Shared context + warm cache
Code-Skill (baseline)	133,006	—	Model-written Python; nothing cached
UTCP Code-Mode	204,011	+53%	Extra prompt framing
MCP Vanilla	309,053	+132%	JSON-serialized data in every call

Cost at Scale

At 1,000 monthly executions:

Approach	Per Execution	Monthly	Annual
MCP Optimized	$0.21	$210	$2,520
Code-Skill	$0.44	$440	$5,280
MCP Vanilla	$0.99	$990	$11,880

$9,360 annual difference between optimized and vanilla MCP for a single workflow.
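
The table is plain multiplication, which makes it easy to rerun with your own volume. A small sketch, assuming per-execution cost is passed in cents to avoid floating-point drift; the `projectCost` helper is hypothetical.

```typescript
interface CostProjection {
  monthlyUsd: number;
  annualUsd: number;
}

// perExecutionCents: cost of one run in cents (e.g. 21 for $0.21).
function projectCost(perExecutionCents: number, monthlyExecutions: number): CostProjection {
  const monthlyUsd = (perExecutionCents * monthlyExecutions) / 100;
  return { monthlyUsd, annualUsd: monthlyUsd * 12 };
}

const optimizedCost = projectCost(21, 1000); // MCP Optimized row
const vanillaCost = projectCost(99, 1000);   // MCP Vanilla row

console.log(optimizedCost.annualUsd);                         // 2520
console.log(vanillaCost.annualUsd - optimizedCost.annualUsd); // 9360
```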

Scalability

Cumulative token consumption. MCP Optimized maintains low growth; vanilla approaches accumulate steeply.

Approach	Scaling Factor	10K Row Projection
MCP Optimized	1.5x	~65K tokens
Code-Skill	1.1-1.6x	~150-220K tokens
MCP Vanilla	2.0-2.9x	~500-800K tokens

MCP Optimized stays nearly flat as data grows because a file path costs the same tokens regardless of file size. MCP Vanilla grows with dataset size because every additional row must be JSON-serialized into the call.

Variance

Approach	Coefficient of Variation	Consistency
MCP Optimized	0.6%	Excellent
MCP Proxy (warm)	0.5%	Excellent
Code-Skill	18.7%	Poor
MCP Vanilla	21.2%	Poor

MCP Optimized hit 60,307, 60,144, and 60,808 tokens across three runs. Code-Skill ranged from 108K to 158K. High variance breaks capacity planning and makes cost prediction unreliable.
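
For anyone checking the variance numbers against their own logs, the coefficient of variation is the standard deviation divided by the mean. A minimal sketch, assuming the sample standard deviation (n - 1 in the denominator); with that choice the three MCP Optimized runs above reproduce the 0.6% in the table.

```typescript
// Coefficient of variation as a percentage, using the sample standard deviation.
function coefficientOfVariation(samples: number[]): number {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance = samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return (Math.sqrt(variance) / mean) * 100;
}

// The three MCP Optimized runs quoted above.
const cvPercent = coefficientOfVariation([60307, 60144, 60808]);
console.log(cvPercent.toFixed(1)); // "0.6"
```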

Latency

Skills and sub-agents use tool-calling, which means two LLM invocations instead of one:

User message → Model decides → Tool call → Tool result → Final response

Slash commands avoid this: the template is expanded into the prompt before the model sees it, so there is one invocation and a direct response.

Key Lessons

1. Architecture Trumps Protocol

MCP Optimized and MCP Vanilla use the same protocol, yet differ by 5x in tokens. The difference is entirely architectural: file paths vs data arrays. Focus on data flow design, not protocol debates.

2. The File-Path Pattern

The single biggest efficiency gain: eliminate data duplication.

// Anti-pattern: 10,000 tokens just for data
await call_tool('analyze_data', {
  data: [/* 500 rows serialized */]
});

// Pattern: 50 tokens for the same operation
await call_tool('analyze_csv_file', {
  file_path: '/data/employees.csv'
});

The MCP server handles file I/O internally. Data never enters the context window.
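
To make the server side concrete, here is a rough sketch of what a handler behind a tool like `analyze_csv_file` might do. Everything here is hypothetical: the `analyzeCsvFile` function, the injected `ReadText` reader, and the `department` and `salary` column names (borrowed from the earlier pandas example). The CSV parsing is deliberately naive and does not handle quoted fields.

```typescript
// Hypothetical reader signature so the sketch stays free of runtime-specific imports.
type ReadText = (path: string) => string;

interface DepartmentAverage {
  department: string;
  averageSalary: number;
}

// The model sends only a path; rows are read here and only the summary goes back.
function analyzeCsvFile(filePath: string, readText: ReadText): DepartmentAverage[] {
  const rows = readText(filePath).trim().split("\n");
  const columns = (rows[0] ?? "").split(",");
  const deptIdx = columns.indexOf("department");
  const salaryIdx = columns.indexOf("salary");
  if (deptIdx < 0 || salaryIdx < 0) {
    throw new Error("expected department and salary columns");
  }

  const totals = new Map<string, { sum: number; count: number }>();
  for (const line of rows.slice(1)) {
    const cells = line.split(",");
    const department = cells[deptIdx] ?? "";
    const entry = totals.get(department) ?? { sum: 0, count: 0 };
    entry.sum += Number(cells[salaryIdx]);
    entry.count += 1;
    totals.set(department, entry);
  }

  return Array.from(totals, ([department, t]) => ({
    department,
    averageSalary: t.sum / t.count,
  }));
}
```

The return value is a handful of rows no matter how large the file is, which is the whole point of the pattern.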

3. Prototype with Skills, Ship with MCP

Skills execute arbitrary code—bash commands, file system access, network calls. They're excellent for figuring out what tools you need. They're inappropriate for production systems where security matters.

4. Slash Commands Are Underrated

When you need deterministic, user-controlled workflows, slash commands win. No tool-call overhead, no model surprises, no latency penalty. Use them for repeatable tasks like code review checklists or deployment procedures.

5. Sub-Agent Context Isolation

Sub-agents can't see your main conversation history. If they need context, you must explicitly pass it in the delegation prompt. This is by design—clean delegation—but requires explicit information passing.

6. CLAUDE.md Costs Compound

CLAUDE.md content is injected into every message, including sub-agent conversations. Keep it concise. Use file references to pull in additional docs only when needed:

<!-- CLAUDE.md -->
# Project Standards
See @docs/CODING_STANDARDS.md for detailed guidelines.

Key rules:
- Use TypeScript strict mode
- No any types

7. Measure Before Optimizing

Instrument your network traffic. The Anthropic API returns token usage in every response—log it. You might be surprised where tokens are actually going.

Implementation Patterns

Parallel Tool Execution

File-path architecture enables parallel calls:

// Four visualizations, one API call, ~400 tokens total
await Promise.all([
  call_tool('create_viz', { file: '/data/emp.csv', type: 'bar', x: 'dept', y: 'salary' }),
  call_tool('create_viz', { file: '/data/emp.csv', type: 'scatter', x: 'exp', y: 'salary' }),
  call_tool('create_viz', { file: '/data/emp.csv', type: 'pie', col: 'department' }),
  call_tool('create_viz', { file: '/data/emp.csv', type: 'bar', x: 'location', y: 'salary' }),
]);

Progressive Tool Discovery

For large tool catalogs (20+ tools), use meta-tools for on-demand discovery instead of loading all tools upfront:

// Initial context: 2 tools, ~400 tokens
const meta_tools = [
  { name: 'describe_tools', description: 'Discover available tools' },
  { name: 'use_tool', description: 'Execute a specific tool' }
];

// Instead of: 50 tools, ~50,000 tokens upfront
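
Behind those two meta-tools there has to be a catalog that answers them. A minimal sketch, assuming an in-memory registry; the `ToolCatalog` class and its method names are hypothetical and only mirror the `describe_tools` and `use_tool` names above.

```typescript
interface ToolSpec {
  name: string;
  description: string;
  run: (args: Record<string, unknown>) => unknown;
}

// Hypothetical server-side catalog: the model only ever sees short summaries.
class ToolCatalog {
  private readonly tools = new Map<string, ToolSpec>();

  register(tool: ToolSpec): void {
    this.tools.set(tool.name, tool);
  }

  // Backs `describe_tools`: keyword match on name or description.
  describeTools(query = ""): Array<{ name: string; description: string }> {
    const q = query.toLowerCase();
    return Array.from(this.tools.values())
      .filter((t) => t.name.toLowerCase().includes(q) || t.description.toLowerCase().includes(q))
      .map(({ name, description }) => ({ name, description }));
  }

  // Backs `use_tool`: dispatches to the named tool.
  useTool(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    return tool.run(args);
  }
}
```

Full schemas stay on the server until a tool is actually selected, so the upfront context cost stays at the two meta-tool definitions.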

Phased Migration Strategy

For uncertain repeatability:

Phase 1: Use code-driven to validate the task. Accept higher per-execution cost for flexibility.
Phase 2: If the task stabilizes and will repeat, invest in MCP Optimized.
Phase 3: Track actual execution count and token consumption. Migrate when patterns are clear.

Summary

Approach	Best For	Avoid When
MCP Optimized	Production workloads, large datasets, predictable costs, security requirements	One-off tasks, evolving requirements
Code-Driven	Prototyping, novel requirements, maximum flexibility	Production systems, multi-user environments
Slash Commands	User-controlled workflows, deterministic behavior, zero overhead	Automation, context-dependent decisions

The core insight: how you architect data flow matters more than which protocol you choose. The 5x token difference between optimized and vanilla MCP—for the same task—demonstrates this clearly.

Match the tool to your constraints. Measure the results.

References

Token Efficiency in AI-Assisted Development - Full analysis of token consumption across approaches
Claude Code Internals: Reverse Engineering Prompt Augmentation - Deep dive into how Claude Code's prompt mechanisms work
MCP Specification
AICode Toolkit (GitHub) - MCP servers and tools for AI-assisted development
Token efficiency experiments (GitHub)
Prompt augmentation analysis (GitHub)

All claims are reproducible using the open-source data and tooling in the referenced repositories.

AI Keeps Reinventing Your Components. Here's How to Stop It.

Vuong Ngo — Sun, 30 Nov 2025 10:43:40 +0000

Three days before a customer pilot, our PM pinged me: "Can we ship that analytics dashboard?" The design had been sitting in Figma for weeks. I promised I'd have it in production by Friday with an AI co-pilot.

By Wednesday morning, the PR was still in draft. Not because the UI was hard—it looked exactly like the mock—but because the AI kept inventing work.

Here's what a typical week produced:

// Monday - inline styles
export const RevenueCard = () => {
  return (
    <div style={{
      background: 'white',
      borderRadius: '12px',
      padding: '24px',
      boxShadow: '0 1px 3px rgba(0,0,0,0.1)'
    }}>
      <span style={{ color: '#6B7280', fontSize: '14px' }}>Total Revenue</span>
      <div style={{ fontSize: '32px', fontWeight: 600 }}>$124,500</div>
    </div>
  );
};

// Tuesday - MUI (we use Tailwind)
import { DataGrid } from '@mui/x-data-grid';

// Wednesday - CSS modules (since when?)
import styles from './FilterPanel.module.css';

// Thursday - styled-components (not even installed)
import styled from 'styled-components';

Four days. Four completely different approaches. The code worked, technically. But maintaining it? Good luck.

The root cause became obvious: AI doesn't read documentation the way humans do. It pattern-matches. And if your codebase doesn't have clear patterns to match, AI will invent its own—differently every time.

The Lesson

AI reflects your architecture. Chaotic codebase, chaotic output. Structured codebase, structured output.

After months of trial and error, here's what actually works.

1. Separate State From Representation (Smart vs Dumb Components)

AI writes strange things when fetch logic, loading UI, and display live in the same file. Split them.

// Container: owns data fetching
export function RevenueCardContainer() {
  const { data, error, isLoading } = useRevenue();

  if (isLoading) return <RevenueCardView state="loading" />;
  if (error) return <RevenueCardView state="error" message="Revenue unavailable" />;
  if (!data) return <RevenueCardView state="empty" message="No revenue yet" />;

  return <RevenueCardView state="ready" value={data.value} previousValue={data.previousValue} />;
}

// Presentational: pure UI, tokens only
export function RevenueCardView({ state, value, previousValue, message }: Props) {
  if (state === 'loading') return <MetricCard loading label="Revenue" />;
  if (state === 'error') return <MetricCard label="Revenue" error message={message} />;
  if (state === 'empty') return <MetricCard label="Revenue" empty message={message} />;

  return (
    <MetricCard
      label="Revenue"
      value={value}
      previousValue={previousValue}
      format="currency"
    />
  );
}

Storybook becomes the contract AI must honor. Capture the four canonical states—loading, empty, error, ready—so the bot can't invent new ones:

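// RevenueCardView.stories.tsx
// Import path is an assumption; adjust to wherever the view lives.
import { RevenueCardView } from './RevenueCardView';

// Component Story Format needs a default export naming the component.
export default { component: RevenueCardView };

export const Loading = { args: { state: 'loading' } };
export const Empty = { args: { state: 'empty', message: 'No revenue yet' } };
export const Error = { args: { state: 'error', message: 'Revenue unavailable' } };
export const Ready = { args: { state: 'ready', value: 124500, previousValue: 110600 } };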

AI stops rebuilding components that already exist because the stories show the "golden" versions.

2. Adopt Atomic Design

Atomic Design by Brad Frost turns out to be exactly what you need when AI is generating your code.

The core insight: hierarchical composition. Atoms form molecules. Molecules form organisms. Each level has a single responsibility.

Why does this matter for AI? Because AI excels at composition when given well-defined pieces and falls apart when rules are ambiguous.

// Level 1: Atoms - indivisible primitives
export const Text = ({ variant, color, children }: TextProps) => (
  <span className={cn(textVariants[variant], textColors[color])}>
    {children}
  </span>
);

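// className is accepted so callers (e.g. MetricValue below) can adjust width
export const Skeleton = ({ variant = 'text', className }: SkeletonProps) => (
  <div className={cn('animate-pulse bg-muted rounded', skeletonVariants[variant], className)} />
);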

// Level 2: Molecules - atoms with a purpose
export const MetricValue = ({ value, previousValue, format, loading }: Props) => {
  if (loading) {
    return (
      <div className="space-y-2">
        <Skeleton variant="heading" />
        <Skeleton variant="text" className="w-16" />
      </div>
    );
  }

  return (
    <div className="space-y-1">
      <Text variant="display" color="primary">{formatters[format](value)}</Text>
      {previousValue != null && <TrendIndicator value={calculateChange(value, previousValue)} />}
    </div>
  );
};

// Level 3: Organisms - complete UI sections
export const MetricCard = ({ label, value, previousValue, format, loading }: Props) => (
  <Card variant="elevated" padding="lg">
    <MetricLabel label={label} />
    <MetricValue value={value} previousValue={previousValue} format={format} loading={loading} />
  </Card>
);

Now when I ask for "a revenue metric card," AI composes: <MetricCard label="Revenue" value={revenue} format="currency" />. Consistent every time.

3. Design Tokens as Vocabulary

Components solve structural consistency. Design tokens solve visual consistency.

Named constants for every visual decision—not "blue" but action-primary, not "16px" but spacing-4:

export const colors = {
  action: {
    primary: '#6366F1',
    primaryHover: '#4F46E5',
  },
  surface: {
    page: '#F9FAFB',
    card: '#FFFFFF',
  },
  content: {
    primary: '#111827',
    secondary: '#4B5563',
    muted: '#9CA3AF',
  },
} as const;

Wire them through Tailwind config:

// tailwind.config.js
module.exports = {
  theme: {
    extend: {
      colors: {
        action: {
          primary: 'var(--color-action-primary)',
        },
        surface: {
          card: 'var(--color-surface-card)',
        },
        content: {
          primary: 'var(--color-content-primary)',
        },
      },
    },
  },
};

Now when AI writes bg-surface-card or text-content-secondary, it's speaking your design language. No hex codes that drift.
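
One step the two snippets skip is how the `colors` object becomes the `--color-*` custom properties that the Tailwind config reads. A minimal sketch, assuming kebab-case variable names (so `primaryHover` becomes `--color-action-primary-hover`); the `toCssVariables` helper is hypothetical and not part of Tailwind.

```typescript
type TokenTree = { readonly [key: string]: string | TokenTree };

// camelCase -> kebab-case, e.g. primaryHover -> primary-hover
function kebab(name: string): string {
  return name.replace(/[A-Z]/g, (c) => `-${c.toLowerCase()}`);
}

// Flatten nested tokens into CSS custom property names and values.
function toCssVariables(tree: TokenTree, prefix = "--color"): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(tree)) {
    const name = `${prefix}-${kebab(key)}`;
    if (typeof value === "string") {
      out[name] = value;
    } else {
      Object.assign(out, toCssVariables(value, name));
    }
  }
  return out;
}

const cssVars = toCssVariables({
  action: { primary: "#6366F1", primaryHover: "#4F46E5" },
  surface: { card: "#FFFFFF" },
});
console.log(cssVars["--color-action-primary"]); // "#6366F1"
```

The resulting map can be written into a `:root` block at build time, which keeps the token file as the single place a color is defined.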

4. Scaffold Before You Generate

AI behaves best when you give it a guardrailed sandbox instead of a blank file.

A command like pnpm ui:generate metric-card should create:

MetricCard/
├── MetricCard.tsx        # Container
├── MetricCardView.tsx    # Presentational
├── MetricCard.stories.tsx
├── MetricCard.test.tsx
└── index.ts

The generated files include TODOs and comments telling AI where to edit and where not to touch. AI fills in the blanks instead of rewriting the world. You can also use this MCP to help scaffold the folder structure you prefer.

5. Enforce Contracts with Lint and Stories

Static rules catch mistakes before they ship.

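// eslint.config.mjs
import tailwindcss from 'eslint-plugin-tailwindcss';

export default [
  {
    // Flat config only resolves `tailwindcss/*` rules once the plugin is registered
    plugins: { tailwindcss },
    rules: {
      'no-restricted-imports': [
        'error',
        {
          paths: ['styled-components', '@mui/material'],
          patterns: [
            // Also catches MUI subpackages such as @mui/x-data-grid
            { group: ['@mui/*'], message: 'Import UI from @/components/ui' },
            { group: ['**/../*'], message: 'Import UI from @/components/ui' },
          ],
        },
      ],
      'tailwindcss/no-custom-classname': [
        'error',
        { callees: ['cn'], config: './tailwind.config.js' },
      ],
    },
  },
];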

ESLint bans off-piste imports
Tailwind plugin forces token utilities
CI fails if stories miss the four states

Mistakes die in CI, not in code review.
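
The "CI fails if stories miss the four states" rule can be a very small check. A sketch under assumptions: the `missingStates` helper is hypothetical, and how your CI loads each stories module and passes its exports in is left out.

```typescript
// The four canonical story names from the Storybook example earlier.
const REQUIRED_STATES = ["Loading", "Empty", "Error", "Ready"] as const;

// Returns the required story names that a stories module fails to export.
function missingStates(storyExports: Record<string, unknown>): string[] {
  return REQUIRED_STATES.filter((name) => !(name in storyExports));
}

// Example: a stories module that only covers two states.
const missing = missingStates({ Loading: {}, Ready: {} });
console.log(missing); // ["Empty", "Error"]
```

A CI script would fail the build whenever the returned list is non-empty.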

Bonus: Composition Over God Components

Don't build this:

// ❌ 60+ props nobody can reason about
<DataTable
  data={transactions}
  columns={columns}
  pagination={true}
  paginationPosition="bottom"
  sortable={true}
  filterable={true}
  selectable={true}
  // ... 55 more props
/>

Build this:

// ✅ Composition: each piece does one thing well
<DataTable data={transactions} columns={columns}>
  <DataTableToolbar>
    <DataTableFilter column="status" options={statusOptions} />
    <DataTableSearch placeholder="Search..." />
  </DataTableToolbar>
  <DataTableBody loading={isLoading} emptyState={<Empty />} />
  <DataTableFooter>
    <DataTablePagination pageSize={10} />
  </DataTableFooter>
</DataTable>

Same capabilities, different mental model. When requirements change, you reorganize JSX rather than hunting through props.

Extra Tips

Embed source locations in the DOM:

<div
  data-component="MetricCard"
  data-source="src/components/MetricCard/MetricCard.tsx"
>

AI can inspect the DOM and jump straight to the file. No guessing.

Sub-agents to save context:

Your main conversation doesn't need the entire component library in memory. Spin up focused agents for specific tasks (UI fixes, story writing, a11y audits)—they load only what they need and return.

Reusable commands:

Build /add-story, /review-component, /fix-ui commands that encode your conventions. AI follows them without you repeating yourself.

The Path Forward

The fix for inconsistent AI output isn't better prompting. It's tighter architecture.

Every atom you add to your library is an atom AI never reinvents. Every design token is a decision that never drifts. Every composition pattern is a template for variations you haven't thought of yet.

Build the rails, the bot stays on track.

Originally published at agiflow.io

AI Keeps Breaking Your Architectural Patterns. Documentation Won't Fix It.

Vuong Ngo — Sun, 12 Oct 2025 07:40:32 +0000

I've been using AI coding assistants across our engineering team for over a year. Working in a data department, we had the privilege of experimenting with Claude, Roo-Code, and other in-house agents in our daily workflow.

The pattern emerged slowly. Junior developers shipping features faster than before, which was great. Code reviews taking longer, which wasn't. The code functionally worked, tests passed, but something was consistently off. Direct database imports in service layers. Default exports scattered across a codebase that had standardized on named exports years ago. Repository pattern bypassed in favor of inline SQL.

These weren't bugs. The code ran fine in production. They were architectural drift—the slow erosion of patterns we'd spent years establishing. What made it frustrating was the inconsistency. A junior developer would correctly implement dependency injection in one file, then bypass it completely in the next. Same developer, same day, same codebase. The knowledge was there, but it wasn't being applied consistently.

The obvious answer was "better code review." But that doesn't scale. When you're reviewing 20+ PRs a day across a 50-package monorepo, you can't catch every architectural violation. And the ones you miss compound.

Here's what we figured out: this isn't an AI problem or a developer problem. It's a feedback timing problem.

TL;DR

AI-generated code violates architectural patterns because of timing and context, not capability
Static documentation creates a validation gap that AI can't bridge
Effective architecture enforcement requires runtime feedback loops, not upfront documentation
Path-based pattern matching provides file-specific architectural context
We built Architect MCP to close the feedback loop at code generation time
Results: 80% pattern compliance vs 30-40% with documentation alone

The Real Problem: Temporal and Spatial Context Loss

Let's be precise about what's happening here. AI coding assistants operate with ephemeral context windows. Even with project-specific documentation (CLAUDE.md, system prompts, etc.), there's a fundamental mismatch between when architectural constraints are communicated and when they need to be applied.

Consider a typical session:

Claude reads your architectural guidelines at initialization (t=0)
You discuss requirements, explore the codebase, iterate on design (t=0 to t=20min)
Claude generates code implementing the agreed-upon logic (t=20min)

By step 3, the architectural constraints from step 1 are 20 minutes and dozens of messages removed from the working context. The AI is optimizing for correctness against the immediate requirements, not consistency against architectural patterns defined at session start.

This isn't a memory problem—it's a priority and relevance problem.

What AI Optimizes For

When generating code, LLMs are fundamentally pattern-matching against their training data. Your specific architectural conventions represent a tiny signal compared to the millions of codebases in the training set. Without active feedback, the model defaults to the strongest statistical patterns:

Common > Custom: Express.js patterns over your Hono.js conventions
Simple > Structured: Direct database calls over repository pattern
Familiar > Framework-specific: Default exports because they're ubiquitous in the training data

This is why you see the same violations repeatedly, even with extensive documentation.

Why Documentation Fails (And What That Tells Us)

Our first attempt was documentation. We already had a substantial CLAUDE.md, but we expanded it. Detailed sections on dependency injection patterns, repository layer requirements, export conventions, framework-specific architectural rules. We made it comprehensive—over 3,000 lines.

Junior developers referenced it. AI assistants had access to it. Compliance rate stayed around 40%. The failure modes are instructive:

1. The Relevance Gap

A document that long applies to every file equally, which means it applies to no file specifically. A repository needs repository-specific guidance. A React component needs component-specific rules. Serving generic "follow clean architecture" advice to both is essentially noise.

2. The Retrieval Problem

Even with RAG systems, retrieving the right architectural context at code generation time is non-trivial. You need to know what patterns apply before you can retrieve them. If Claude is generating a new file type, there's no obvious query to pull the relevant constraints.

3. The Validation Gap

This is the critical one. Documentation describes correct patterns but provides no mechanism to verify compliance. It's teaching without testing. The feedback loop is broken.

Rethinking the Problem: Feedback Over Front-loading

Here's the architectural insight: you can't front-load all context, but you can close the feedback loop.

Instead of trying to make AI remember everything upfront, we need to provide architectural feedback at two critical moments:

Before code generation: "What patterns apply to this specific file?"
After code generation: "Does this implementation comply with those patterns?"

This shifts from a memory problem to a validation problem. And validation can be automated.

The Feedback Loop Architecture

The system needs three components:

1. Pattern Database
Organized by file path patterns with specific architectural requirements:

src/repositories/**/*.ts → Repository pattern rules
src/services/**/*.ts → Service layer rules
src/components/**/*.tsx → Component architecture rules

2. Pre-generation Context Injection
Before generating code, query the pattern database with the target file path. Inject specific, relevant architectural constraints into the immediate context.

3. Post-generation Validation
After code generation, validate against the same patterns. Use severity ratings to determine action (submit, flag, auto-fix).

The key insight: specificity matters more than comprehensiveness. Better to provide five highly relevant rules for a specific file than 50 generic rules that might apply.
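To make the pattern database idea concrete, here is a minimal sketch of a path-based lookup. The `patternTable` contents, the `globToRegExp` helper, and `rulesForPath` are hypothetical names for illustration, and the glob support is deliberately limited to `**/` and `*`; this is not how Architect MCP itself is implemented.

```typescript
// Hypothetical rule table keyed by glob; paths mirror the examples above.
type PatternTable = Record<string, string[]>;

const patternTable: PatternTable = {
  "src/repositories/**/*.ts": ["Implement IRepository<T>", "Use dependency injection"],
  "src/services/**/*.ts": ["No direct database access", "Use repository layer"],
  "src/components/**/*.tsx": ["Component architecture rules"],
};

// Convert a small glob subset ("**/" and "*") into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more directory segments
      i += 3;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // any run of characters inside one segment
      i += 1;
    } else {
      // Escape regex metacharacters so the "." in ".ts" stays literal.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Collect the rules of every glob that matches the target file path.
function rulesForPath(table: PatternTable, filePath: string): string[] {
  const rules: string[] = [];
  for (const glob of Object.keys(table)) {
    if (globToRegExp(glob).test(filePath)) {
      rules.push(...table[glob]);
    }
  }
  return rules;
}
```

Calling `rulesForPath(patternTable, "src/repositories/userRepository.ts")` returns only the two repository rules, which is the point: the model sees a handful of relevant constraints instead of the whole rulebook.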

Implementation: Architect MCP

We implemented this as an MCP (Model Context Protocol) server with two primary tools:

get-file-design-pattern

Provides file-specific architectural context before code generation.

// Input: File path
get-file-design-pattern("src/repositories/userRepository.ts")

// Output: Specific patterns for this file type
{
  "template": "backend/hono-api",
  "patterns": [
    "Implement IRepository<T> interface",
    "Use constructor-injected database connection",
    "Named exports only (export class RepositoryName)",
    "No direct database imports (import from '../db' is violation)"
  ],
  "reference": "src/repositories/baseRepository.ts"
}

This runs before Claude generates code, injecting precise architectural requirements into the active context.

review-code-change

Validates generated code against architectural patterns.

// Input: File path and generated code
review-code-change("src/repositories/userRepository.ts", generatedCode)

// Output: Structured validation results
{
  "severity": "LOW" | "MEDIUM" | "HIGH",
  "violations": [...],
  "compliance": "92%",
  "patterns_followed": ["✅ Implements IRepository<User>", ...],
  "recommendations": [...]
}

This runs after code generation, providing structured feedback that can drive automation (auto-submit on LOW, flag on MEDIUM, auto-fix on HIGH).
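Chained together, the two tools form a loop. The sketch below is one way to wire it, under stated assumptions: `LoopDeps`, `generateWithFeedback`, and the retry limit are hypothetical, and the callbacks are synchronous only to keep the example short. Real tool calls would be asynchronous.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";

interface Review {
  severity: Severity;
  violations: string[];
}

// Hypothetical stand-ins for the two tools and for the code generator.
interface LoopDeps {
  getPatterns: (filePath: string) => string[];
  generate: (filePath: string, patterns: string[], feedback: string[]) => string;
  review: (filePath: string, code: string) => Review;
}

interface LoopResult {
  code: string;
  review: Review;
  attempts: number;
}

function generateWithFeedback(
  filePath: string,
  deps: LoopDeps,
  maxAttempts = 3,
): LoopResult {
  const patterns = deps.getPatterns(filePath); // pre-generation context
  let feedback: string[] = [];
  let code = "";
  let review: Review = { severity: "HIGH", violations: [] };
  let attempts = 0;

  while (attempts < maxAttempts) {
    attempts += 1;
    code = deps.generate(filePath, patterns, feedback);
    review = deps.review(filePath, code); // post-generation validation
    if (review.severity !== "HIGH") break; // LOW or MEDIUM: stop retrying
    feedback = review.violations; // HIGH: feed violations into the next attempt
  }
  return { code, review, attempts };
}
```

The retry cap matters: without it, a pattern the model cannot satisfy would loop forever, so the last review is returned for a human to look at.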

Path-Based Pattern Matching: The Critical Detail

The pattern database uses path-based matching to provide file-specific guidance. This deserves deeper explanation because it's where the system gains leverage.

Pattern Hierarchy

# Global patterns (apply to all projects)
**/*.ts:
  - No 'any' types without justification
  - Use named exports

# Template patterns (apply to projects using this template)
backend/hono-api:
  src/repositories/**/*.ts:
    - Implement IRepository<T>
    - Use dependency injection

  src/services/**/*.ts:
    - No direct database access
    - Use repository layer

# Project patterns (apply to specific project)
user-management-api:
  src/services/authService.ts:
    - Must use AuthProvider interface
    - Specific to auth domain

The system applies patterns from most general to most specific, with later patterns overriding earlier ones. This provides both consistency (global rules) and flexibility (project-specific exceptions).
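A rough sketch of that general-to-specific merge follows. It assumes each rule carries an `id` so that "overriding" has a defined meaning; the `Layer` shape, the `applies` predicate (a stand-in for glob matching), and the `data-access` override are all hypothetical and chosen only for illustration.

```typescript
interface Rule {
  id: string;
  text: string;
}

interface Layer {
  name: string;
  applies: (filePath: string) => boolean; // stand-in for glob matching
  rules: Rule[];
}

// Layers are passed from most general to most specific.
function resolveRules(layers: Layer[], filePath: string): Rule[] {
  const merged = new Map<string, Rule>();
  for (const layer of layers) {
    if (!layer.applies(filePath)) continue;
    for (const rule of layer.rules) {
      merged.set(rule.id, rule); // a later layer replaces a rule with the same id
    }
  }
  return Array.from(merged.values());
}

const layers: Layer[] = [
  {
    name: "global",
    applies: (p) => p.endsWith(".ts"),
    rules: [
      { id: "no-any", text: "No 'any' types without justification" },
      { id: "exports", text: "Use named exports" },
    ],
  },
  {
    name: "template: backend/hono-api",
    applies: (p) => p.startsWith("src/services/"),
    rules: [{ id: "data-access", text: "Use repository layer" }],
  },
  {
    name: "project: user-management-api",
    applies: (p) => p === "src/services/authService.ts",
    rules: [{ id: "data-access", text: "Must use AuthProvider interface" }],
  },
];
```

With these layers, `authService.ts` gets the project-level `data-access` rule while every other service keeps the template default.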

Why This Scales

New projects inherit template patterns automatically. No need to reconfigure architectural rules for every new service—just specify the template in project.json:

{
  "name": "new-api-service",
  "sourceTemplate": "backend/hono-api"
}

The service immediately inherits 50+ architectural patterns specific to Hono.js APIs.

LLM-Powered Validation: Using AI to Check AI

Here's a non-obvious design choice: we use Claude to validate Claude-generated code.

Why? Because architectural compliance isn't mechanical pattern matching. Consider:

Mechanical linter approach:

// Regex: /export\s+default/
// Violation: Uses default export
export default class UserService { }

LLM validation approach:

// Understands context and intent
export default class UserService { }
// Violation: Uses default export when named export required per repository pattern
// Recommendation: Change to 'export class UserService' for consistency with repository pattern established in architect.yaml

The LLM-based validation:

Understands architectural intent, not just syntax
Provides contextual explanations
Can reason about related patterns (if you're violating DI, you're probably also missing interface implementation)
Generates actionable recommendations

This is more expensive than static linting, but the cost is justified because it runs only on changed files and provides significantly higher signal.

Layered Validation: Defense in Depth

Architect MCP isn't a replacement for existing validation layers—it's complementary. The full validation stack:

Layer 1: TypeScript Compiler

Catches: Type errors, syntax violations
Speed: < 1s
Coverage: Type safety

Layer 2: Biome/ESLint

Catches: Code style, simple rules
Speed: < 5s
Coverage: Style consistency

Layer 3: Architect MCP

Catches: Architectural pattern violations
Speed: 5-10s (LLM call)
Coverage: Framework-specific architecture

Layer 4: Code Review (Human/AI)

Catches: Business logic, complex issues
Speed: Minutes to hours
Coverage: Domain-specific concerns

Each layer has different trade-offs. TypeScript is fast but can't enforce architectural patterns. Linting handles style but not domain architecture. Architect MCP fills the gap between syntax/style and human review.
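One way to picture the stack is as an ordered pipeline. The sketch below assumes a fail-fast policy, so the slower LLM-backed layer only runs when the cheaper layers pass; that policy, the `CheckLayer` shape, and `runLayers` are assumptions for illustration rather than a description of our actual setup.

```typescript
interface CheckLayer {
  name: string;
  run: (code: string) => string[]; // returns problems; empty means pass
}

interface PipelineResult {
  passed: boolean;
  failedLayer?: string;
  problems: string[];
  layersRun: string[];
}

// Run layers in order (cheapest first) and stop at the first failure.
function runLayers(layers: CheckLayer[], code: string): PipelineResult {
  const layersRun: string[] = [];
  for (const layer of layers) {
    layersRun.push(layer.name);
    const problems = layer.run(code);
    if (problems.length > 0) {
      return { passed: false, failedLayer: layer.name, problems, layersRun };
    }
  }
  return { passed: true, problems: [], layersRun };
}
```

Ordering by cost keeps the common case fast: a type error is reported in under a second, and no LLM call is spent on code that would not compile anyway.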

What Actually Changed

After 3 months in production across our 50+ project monorepo with a team of 8 developers:

The obvious improvement: Architectural violations became rare instead of common. Not eliminated—there are still legitimate cases where you need to break a pattern—but the unconscious drift stopped. Junior developers stopped ping-ponging between following patterns correctly and breaking them in the next file.

The unexpected improvement: Code review shifted. We thought we'd just catch violations faster. What actually happened was we stopped spending review cycles on architectural corrections. Comments like "this should use dependency injection" or "use named exports" basically disappeared. Reviews focused on design decisions, edge cases, business logic—things that actually need human judgment.

The subtle improvement: Context-switching overhead decreased. When you're working across multiple projects with different architectural patterns (Next.js app vs Hono API vs TypeScript library), you're constantly reloading mental context. Having the validation layer means you find out immediately when you've applied the wrong pattern to the wrong project, not three reviews later.

What didn't improve: We still see legitimate architectural violations. Sometimes you need to bypass a pattern for a specific reason. The difference is those are now conscious decisions documented in the PR, not unconscious mistakes that slip through review.

What This Reveals About AI-Assisted Development

The broader lesson: AI coding assistants need tight feedback loops, not extensive documentation.

This mirrors how junior developers actually learn a codebase. They don't absorb architectural patterns by reading documentation upfront. They learn by:

Getting specific guidance for the task at hand
Making changes
Getting feedback on what they did wrong
Iterating

When junior developers pair with AI, both need the same learning structure. The difference is speed. Human code review happens in hours or days. Automated feedback happens in seconds. That speed difference is what makes the approach viable.

The unexpected insight: this doesn't just help junior developers. Senior developers using AI make the same architectural mistakes—they just catch them earlier in their own review. Automated validation helps everyone maintain consistency when context-switching between projects with different architectural patterns.

Implementation Notes

If you're considering building something similar, a few non-obvious lessons:

1. Pattern Granularity Matters
Too broad (e.g., "follow clean architecture") and AI can't apply it. Too narrow (e.g., "line 47 must use Promise.all") and you've essentially hardcoded the implementation. The right level is "file-type specific patterns" (repository pattern for repositories, component pattern for components).

2. Severity Ratings Enable Automation
Without severity ratings, you can't automate responses. With them:

LOW → Auto-submit (pattern followed)
MEDIUM → Flag for attention (minor violations)
HIGH → Block submission (critical violations)
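
The mapping above is small enough to encode directly. A minimal sketch, with hypothetical type and function names:

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";
type Action = "auto-submit" | "flag" | "block";

// Record<Severity, Action> forces every severity to have an action at compile time.
const actionBySeverity: Record<Severity, Action> = {
  LOW: "auto-submit",
  MEDIUM: "flag",
  HIGH: "block",
};

function actionFor(severity: Severity): Action {
  return actionBySeverity[severity];
}
```

Typing the table as a `Record` means adding a fourth severity later fails the build until someone decides what it should do.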

3. Template Inheritance Is Critical for Scale
Defining patterns per-project doesn't scale past ~10 projects. Template-based inheritance means you define patterns once per framework/architecture, then all projects using that template inherit them.

4. LLM Validation Is Worth the Cost
We initially tried regex-based pattern matching. It caught obvious violations—literal regex matches like export default—but missed anything requiring context. Why is this a default export? Is it actually violating the pattern or is this one of the legitimate exceptions? Regex can't answer that. LLM validation understands intent and context. Yes, it costs money per validation. But the alternative is human code review catching these issues, which is orders of magnitude more expensive in terms of developer time.

Getting Started

Architect MCP is open source: github.com/AgiFlow/aicode-toolkit

The implementation is straightforward—it's an MCP server that reads YAML pattern definitions and uses Claude to validate code against them. The hard part isn't the code—it's defining your architectural patterns clearly enough to encode them. We spent more time debating what our patterns actually were than building the validation system.

If you're building something similar, start with:

Identify your top 5 most-violated architectural patterns
Define them as path-based rules in YAML
Build the pre-generation context injection first (higher ROI than validation)
Add validation once you've proven the concept

Open Questions

We're still figuring out:

1. Pattern Evolution
How do you version architectural patterns? When you update a pattern, do you auto-update all projects or let them opt-in?

2. Cross-File Patterns
Current implementation handles single-file patterns well. Cross-file architectural concerns (e.g., "services should only call repositories, never directly call other services") are harder to encode and validate.

3. Performance at Scale
LLM-based validation works well at our scale (50 projects, ~10 changes/day). What happens at 500 projects or 1000 changes/day? Do you need caching, batching, or a hybrid approach?
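
As a starting point for the caching question, here is a sketch of a content-keyed cache. We have not shipped this; `cachedValidator`, the `patternsVersion` argument, and the key format are hypothetical, and the 32-bit FNV-1a hash is used only because it needs no dependencies. A real system would likely want a stronger hash.

```typescript
// FNV-1a over UTF-16 code units: small and dependency-free, not cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Wrap any validator so identical (patterns version, path, content) inputs
// reuse the earlier result. R may be a Promise, in which case the pending
// promise itself is cached and shared.
function cachedValidator<R>(
  validate: (filePath: string, code: string) => R,
  patternsVersion: string,
): (filePath: string, code: string) => R {
  const cache = new Map<string, R>();
  return (filePath, code) => {
    const key = `${patternsVersion}:${filePath}:${fnv1a(code)}`;
    if (cache.has(key)) return cache.get(key) as R;
    const result = validate(filePath, code);
    cache.set(key, result);
    return result;
  };
}
```

Including the patterns version in the key is the important design choice: when a rule changes, every cached verdict is invalidated at once instead of silently going stale.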

If you've solved these problems, I'd love to hear about it.

If you're dealing with similar problems—AI generating code that works but breaks your architectural patterns—I'd be curious to hear how you're handling it. Drop a comment or reach out.

Resources:

Architect MCP GitHub
Previous post about the scaffolding technique

Scaling AI-Assisted Development: How Scaffolding Solved My Monorepo Chaos

Vuong Ngo — Sun, 05 Oct 2025 23:40:29 +0000

The Moment I Realized AI Coding Was Broken

It was 10PM. I'd just asked Claude to add a navigation component. Thirty seconds later, I was staring at this:

// What the AI generated (again)
const Navigation = ({ items }: NavigationProps) => {
  const [open, setOpen] = useState(false);
  return <nav className="navigation">...</nav>
}
export default Navigation;

Nothing wrong with it, technically. Except I don't use default exports. I use named exports. And useState should come from our custom hooks. And we use 'isOpen', not 'open'. And the TypeScript interface should be exported separately like every other component in our codebase.

I'd explained this exact pattern so many times I'd lost count.

Same pattern. Different day. Different component. Different wrong implementation.

This wasn't a one-off. My monorepo had become a Frankenstein's monster of inconsistent patterns—each one technically correct, all of them a maintenance nightmare.

The promise was simple: AI would code faster than humans.

The reality? I was spending more time fixing AI-generated code than I would've spent just writing it myself.

How AI-Assisted Development Actually Breaks

Here's what nobody tells you about scaling with AI coding assistants:

Week 1: The Honeymoon Phase

You: "Build me a login page"
AI: ✨ generates perfect login page ✨
You: "Holy shit, this is the future"

Everything works. You're shipping features at 10x speed. Your manager thinks you're a wizard. You're thinking about that promotion.

Month 1: The Cracks Appear

You're reviewing frontend components and notice something odd:

// TaskBadge.tsx (written 2 weeks ago)
export const TaskBadge = ({ status }: TaskBadgeProps) => {
  return <span className={`badge ${getColor(status)}`}>{status}</span>;
};

// PriorityBadge.tsx (written yesterday)
export function PriorityBadge(props: PriorityProps) {
  const color = getPriorityColor(props.priority);
  return <div className={color}>{props.priority}</div>;
}

// StatusLabel.tsx (written today)
function StatusLabel({ value }: StatusProps) {
  return <Badge variant={getVariant(value)}>{value}</Badge>;
}

Three badge components. Three different patterns. All from the same AI. All from the same human (you).

And on the backend, it's the same story:

// userService.ts (2 weeks ago)
@injectable()
export class UserService {
  constructor(@inject(TYPES.Database) private db: Database) {}
}

// authService.ts (yesterday)
export class AuthService {
  private db: Database;
  constructor(database: Database) {
    this.db = database;
  }
}

// paymentService.ts (today)
class PaymentService {
  constructor(public database: Database) {}
}

Three services. One uses dependency injection properly. One doesn't. One is halfway there.

"Okay, I need better instructions," you think.

Month 2: The Documentation Death Spiral

Your CLAUDE.md file has grown massively. You've documented:

Component patterns ✓
Import styles ✓
File naming ✓
Prop validation ✓
Error handling ✓
State management ✓
API patterns ✓

You've told the AI everything.

Then you ask it to create a settings page, and it still uses a different button component than the rest of your app.

"But I literally documented this!" you scream at your screen at 3 AM.

The AI apologizes (one time, I said "You're f*king right", which is hilarious). Generates a new version. Wrong again, but differently wrong.

Month 3: The Breaking Point

You're now maintaining:

Dozens of CLAUDE.md files scattered everywhere
Multiple variations of what should be the same pattern
A massive style guide that the AI follows inconsistently
Code reviews that are mostly style debates instead of logic discussions

The math breaks: You're spending hours fixing what should've taken minutes to write.

This was me. And this was my monorepo:

Frontend apps built with Next.js and TanStack Start
Backend APIs using Hono.js, FastAPI and Lambda
Shared packages for everything reusable
Microservices, edge functions, and infrastructure all in one repo

The bigger it grew, the worse it got. And I wasn't alone.

My Failed Experiments

Attempt 1: The Mega CLAUDE.md

I created comprehensive documentation files referencing everything:

Project Structure
Coding Standards
Technology Stack
Conventions
Style System
Development Process

Result: Even with token-efficient docs, I couldn't cover all design patterns across multiple languages and frameworks. AI still made mistakes.

Attempt 2: CLAUDE.md Everywhere

"Maybe collocated instructions work better?" I created CLAUDE.md files everywhere for different apps, APIs, and packages.

Result: Slightly better when loaded in context (which didn't always happen). But the real issue: I only had a handful of distinct patterns. Maintaining dozens of instruction files for those same patterns? Nightmare fuel.

Attempt 3: Autonomous Workflows

I set up autonomous loops: PRD → code → lint/test → fix → repeat.

Result: I spent more time removing code and fixing bugs than if I'd just coded it myself. The AI would hallucinate solutions, ignore patterns, and create technical debt faster than I could clean it up.

The Three Core Problems

1. Inconsistency Across Codebase

Frontend:

// AgentStatus.tsx - uses our design system
export const AgentStatus = ({ status }: Props) => {
  return <Badge className={getStatusColor(status)}>{status}</Badge>;
};

// TaskStatus.tsx - reinvents the wheel
export function TaskStatus({ task }: TaskProps) {
  return <div className="status-badge">{task.status}</div>;
}

// SessionStatus.tsx - different again
const SessionStatus = (props: SessionProps) => (
  <span className={styles.badge}>{props.status}</span>
);

Backend:

// taskRepo.ts - proper DI
@injectable()
export class TaskRepository {
  constructor(@inject(TYPES.Database) private db: IDatabaseService) {}
}

// projectRepo.ts - missing decorator
export class ProjectRepository {
  constructor(private db: IDatabaseService) {}
}

// memberRepo.ts - no DI at all
export class MemberRepository {
  private db = getDatabaseClient();
}

Same concept, different implementations. All technically correct. All maintenance nightmares.

2. Context Window Overload

Your documentation grows from this:

# Conventions
- Use functional components
- Use TypeScript

To this monstrosity:

# Conventions
## Components
- Use functional components
- Props interface must be exported
- Use PascalCase for component names
...(10+ reference docs)

Eventually, even AI can't keep up.

3. Pattern Recreation Waste

How many times have you watched AI recreate the same pattern?

Authenticated API routes with similar structure
Badge components that look identical but use different approaches
Repository classes with the same DI pattern but inconsistent implementation
Service classes that all need the same base configuration

Each time slightly different. Hours wasted on work already done.

The Solution: Intelligent Scaffolding

Instead of fighting these problems with longer instructions, I needed a fundamental shift: teach AI to use templates, not just write code.

How It Works: The scaffolding approach uses MCP (Model Context Protocol) to expose template generation as a tool that AI agents can call. The initial code generation goes through structured output (JSON Schema validation), so template variables are properly typed and validated. The generated code then acts as guided generation for the LLM: a foundation that already follows your patterns, which the AI can extend with context-specific logic. Think of it as "fill-in-the-blanks" coding: the structure is guaranteed consistent, while the AI adds intelligence where it matters.

The Key Insight

Traditional scaffolding requires complete, rigid templates. But with AI coding assistants, you only need:

A skeleton with minimal code
A header comment declaring the pattern and rules
Let the AI fill in the blanks contextually

Example from our actual codebase:

/**
 * PATTERN: Injectable Service with Dependency Injection
 * - MUST use @injectable() decorator
 * - MUST inject dependencies with @inject(TYPES.*)
 * - MUST define constructor parameters as private/public based on usage
 * - MUST include JSDoc with design principles
 */
@injectable()
export class {{ ServiceName }}Service {
  constructor(
    @inject(TYPES.Database) private db: IDatabaseService,
    @inject(TYPES.Config) private config: Config,
  ) {
    // AI fills in initialization logic
  }

  // AI generates methods following the established pattern
}

The AI now knows the rules and generates code that follows them.

Enter: Scaffold MCP

I built @agiflowai/scaffold-mcp to implement this approach. It's an MCP (Model Context Protocol) server that provides:

Boilerplate templates for new projects
Feature scaffolds for adding to existing projects
AI-friendly minimal templates with clear patterns

Why MCP?

✅ Works with Claude Code, Cursor, or any MCP-compatible tool
✅ Tech stack agnostic (Next.js, React, Hono.js, your custom setup)
✅ Multiple modes: MCP server or standalone CLI
✅ Always available to AI like any other tool

Real-World Workflow Transformation

Before Scaffolding: Starting a New API

You: "Create a new Hono API with authentication"
AI: *generates files with different patterns*
You: "Wait, where's the dependency injection?"
You: "Can you use our standard middleware setup?"
You: "Actually, use Zod for validation like our other APIs..."
*Back-and-forth debugging*

After Scaffolding: Starting a New API

Using CLI:

# See available templates
scaffold-mcp boilerplate list

# Create API with exact conventions
scaffold-mcp boilerplate create hono-api-boilerplate \
  --vars '{"apiName":"notification-api","port":"3002"}'

# ✓ Complete API structure created
# ✓ Dependency injection configured
# ✓ All following your team's conventions

Using Claude Code:
Simply say: "Create a new notification API"

Claude automatically uses the scaffold-mcp MCP server and creates your API with proper DI, middleware, and validation.

Before Scaffolding: Adding Features

You: "Add a new repository class for comments"
AI: *creates class without DI decorator*
You: "No, use dependency injection like the other repos"
AI: *adds DI but forgets the @injectable decorator*
You: "Look at TaskRepository as an example"
*More back-and-forth*

After Scaffolding: Adding Features

Using CLI:

# What can I add?
scaffold-mcp scaffold list ./backend/apis/my-api

# Add matching feature
scaffold-mcp scaffold add scaffold-repository \
  --project ./backend/apis/my-api \
  --vars '{"entityName":"Comment","tableName":"comments"}'

# ✓ Perfect pattern match with proper DI

Using Claude Code:
"Add a repository for comments to my API"

Claude uses scaffold-mcp to ensure the new repository matches your DI patterns, uses the correct decorators, and follows your coding standards.

Creating Your Own Templates

The real power comes from encoding your team's patterns.

Step 1: Installation

# Install
npm install -g @agiflowai/scaffold-mcp

# Initialize templates
scaffold-mcp init

Step 2: Enable Admin Tools (Claude Code)

{
  "mcpServers": {
    "scaffold-mcp": {
      "command": "npx",
      "args": ["-y", "@agiflowai/scaffold-mcp", "mcp-serve", "--admin-enable"]
    }
  }
}

Step 3: Create Template with AI

Tell Claude:

"Create a boilerplate template called 'injectable-service' in 'backend-templates'
that creates backend services with dependency injection and proper structure"

Claude uses the admin tool to generate:

# templates/backend-templates/scaffold.yaml
boilerplate:
  name: injectable-service
  description: Backend service with DI, config, and proper structure
  targetFolder: backend/apis/my-api/src/services

  variables_schema:
    type: object
    properties:
      serviceName:
        type: string
        pattern: "^[A-Z][a-zA-Z0-9]*$"
        description: Service name in PascalCase
    required:
      - serviceName

  includes:
    - {{ serviceName | camelCase }}Service.ts
    - {{ serviceName | camelCase }}Service.test.ts

Step 4: Create Template Files

Tell Claude:

"Create the service template file with dependency injection, config support, and JSDoc"

Claude generates:

// templates/backend-templates/{{ serviceName | camelCase }}Service.ts.liquid
/**
 * {{ serviceName }} Service
 *
 * DESIGN PRINCIPLES:
 * - Dependency Injection: Use Inversify for loose coupling
 * - Type Safety: Full TypeScript support
 * - Configuration: Use Config service for environment variables
 *
 * PATTERN: Injectable Service
 * - MUST use @injectable() decorator
 * - MUST inject dependencies with @inject(TYPES.*)
 * - MUST document design principles in JSDoc
 */
import { inject, injectable } from 'inversify';
import type { IDatabaseService } from '@agimonai/backend-database';
import type { Config } from '@/services';
import { TYPES } from '@/types';

@injectable()
export class {{ serviceName }}Service {
  constructor(
    @inject(TYPES.Database) private db: IDatabaseService,
    @inject(TYPES.Config) private config: Config,
  ) {
    // Initialization logic
  }

  /**
   * Add your service methods here
   */
  public async execute() {
    // Implementation
  }
}

Step 5: Use Your Template

scaffold-mcp boilerplate create injectable-service \
  --vars '{"serviceName":"Email"}'

# ✓ Created backend/apis/my-api/src/services/emailService.ts
# ✓ Created backend/apis/my-api/src/services/emailService.test.ts
# ✓ All with proper DI, JSDoc, and patterns

Or with Claude Code:

"Create a new Email service using our injectable service template"

The Results

After switching to scaffolding:

Before

Setup time: Hours of back-and-forth per project
Code consistency: Inconsistent across the codebase
Review time: Mostly spent on style debates
Onboarding: Weeks to learn all the conventions

After

Setup time: Minutes per project
Code consistency: Enforced by templates
Review time: Focused on logic, not style
Onboarding: Days instead of weeks

Net result: Dramatically faster initialization, zero convention debates, consistent quality across the entire monorepo.

Best Practices

1. Start Simple, Evolve Gradually

# Week 1: Use community templates
scaffold-mcp add --name nextjs-15 --url https://github.com/AgiFlow/aicode-toolkit

# Weeks 2-4: Observe what you change repeatedly

# Week 5+: Create custom templates for your patterns

2. Use Liquid Filters for Consistency

{% comment %}
✅ Good: Ensure consistent casing with filters
Available filters: pascalCase, camelCase, kebabCase, snakeCase, upperCase
{% endcomment %}
@injectable()
export class {{ serviceName | pascalCase }}Service {
  private readonly logger = createLogger('{{ serviceName | kebabCase }}');
  private readonly TABLE_NAME = '{{ tableName | snakeCase }}';
}

{% comment %}
❌ Bad: Rely on user input casing
{% endcomment %}
export class {{ serviceName }}Service {
  private logger = createLogger('{{ serviceName }}');
  private TABLE = '{{ tableName }}';
}
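
To show what those filters buy you, here is a hand-rolled sketch of the casing helpers and a tiny `{{ name | filter }}` renderer. It is a stand-in written for this post, not the Liquid engine or scaffold-mcp's own filter code, so edge cases (acronyms, Unicode) may behave differently in the real tool.

```typescript
// Split input into lowercase words on separators and camelCase boundaries.
function words(input: string): string[] {
  return input
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .split(/[^A-Za-z0-9]+/)
    .filter((w) => w.length > 0)
    .map((w) => w.toLowerCase());
}

const capitalize = (w: string): string => w.charAt(0).toUpperCase() + w.slice(1);

const filters: Record<string, (value: string) => string> = {
  pascalCase: (v) => words(v).map(capitalize).join(""),
  camelCase: (v) => words(v).map((w, i) => (i === 0 ? w : capitalize(w))).join(""),
  kebabCase: (v) => words(v).join("-"),
  snakeCase: (v) => words(v).join("_"),
  upperCase: (v) => v.toUpperCase(),
};

// Replace "{{ name }}" and "{{ name | filter }}" placeholders; fail loudly on unknowns.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}/g,
    (_match: string, name: string, filterName?: string) => {
      const value = vars[name];
      if (value === undefined) throw new Error(`Missing variable: ${name}`);
      if (filterName === undefined) return value;
      const filter = filters[filterName];
      if (filter === undefined) throw new Error(`Unknown filter: ${filterName}`);
      return filter(value);
    },
  );
}
```

Whether the user types `email notification`, `emailNotification`, or `EmailNotification`, the class name and the logger name come out the same, which is exactly the consistency the "good" template above relies on.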

3. Validate with JSON Schema

# ✅ Good: Enforce format and patterns
properties:
  serviceName:
    type: string
    pattern: "^[A-Z][a-zA-Z0-9]*$"  # Must be PascalCase
    example: "Email"
  port:
    type: number
    minimum: 3000
    maximum: 9999

# ❌ Bad: Accept anything
properties:
  serviceName:
    type: string
  port:
    type: number
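
For a feel of what the stricter schema catches, here is a minimal validator sketch. It covers only the keywords used above (`type`, `pattern`, `minimum`, `maximum`, `required`) and is a hypothetical stand-in, not a full JSON Schema implementation and not scaffold-mcp's validator.

```typescript
interface VarSchema {
  type: "string" | "number";
  pattern?: string;
  minimum?: number;
  maximum?: number;
}

interface VariablesSchema {
  properties: Record<string, VarSchema>;
  required?: string[];
}

// Returns a list of human-readable errors; an empty list means the vars are valid.
function validateVars(schema: VariablesSchema, vars: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const name of schema.required ?? []) {
    if (!(name in vars)) errors.push(`${name}: required`);
  }
  for (const name of Object.keys(schema.properties)) {
    if (!(name in vars)) continue;
    const rule = schema.properties[name];
    const value = vars[name];
    if (typeof value !== rule.type) {
      errors.push(`${name}: expected ${rule.type}`);
      continue;
    }
    if (typeof value === "string" && rule.pattern !== undefined) {
      if (!new RegExp(rule.pattern).test(value)) {
        errors.push(`${name}: does not match ${rule.pattern}`);
      }
    }
    if (typeof value === "number") {
      if (rule.minimum !== undefined && value < rule.minimum) {
        errors.push(`${name}: below minimum ${rule.minimum}`);
      }
      if (rule.maximum !== undefined && value > rule.maximum) {
        errors.push(`${name}: above maximum ${rule.maximum}`);
      }
    }
  }
  return errors;
}

// Mirrors the "good" schema above; the required list is added for illustration.
const serviceSchema: VariablesSchema = {
  properties: {
    serviceName: { type: "string", pattern: "^[A-Z][a-zA-Z0-9]*$" },
    port: { type: "number", minimum: 3000, maximum: 9999 },
  },
  required: ["serviceName"],
};
```

With the permissive schema, `{ serviceName: "email", port: 80 }` would generate files; with this one it is rejected with two errors before any template runs.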

4. Document in Templates

instruction: |
  Service created successfully!

  Files created:
  - {{ serviceName | camelCase }}Service.ts: Main service with DI
  - {{ serviceName | camelCase }}Service.test.ts: Test suite

  Next steps:
  1. Register in TYPES: add {{ serviceName }}Service to dependency container
  2. Run `pnpm test` to verify tests pass
  3. Inject: @inject(TYPES.{{ serviceName }}Service)

Getting Started Today

Quick Start (5 minutes)

# 1. Install
npm install -g @agiflowai/scaffold-mcp

# 2. Initialize
scaffold-mcp init

# 3. List templates
scaffold-mcp boilerplate list

# 4. Create project
scaffold-mcp boilerplate create <name> --vars '{"projectName":"my-app"}'

Claude Code Setup (2 minutes)

{
  "mcpServers": {
    "scaffold-mcp": {
      "command": "npx",
      "args": ["-y", "@agiflowai/scaffold-mcp", "mcp-serve"]
    }
  }
}

Restart Claude Code and say: "What scaffolding templates are available?"

The Path Forward

The future of AI-assisted development isn't about AI writing more code—it's about AI writing the right code, consistently, following your conventions.

Three Levels of Adoption

Level 1: User (Start here)

Use existing templates
10x faster setup
Guaranteed consistency

Level 2: Customizer (Next step)

Adapt templates to your team
Encode patterns once, reuse forever
Zero convention debates

Level 3: Creator (Advanced)

Build custom templates for your stack
Advanced generators for complex workflows
Share across your organization

The Bottom Line

Stop fighting with AI over conventions. Stop reviewing the same style issues. Stop recreating the same patterns.

Start with templates. Scale with scaffolding.

Resources

GitHub: github.com/AgiFlow/aicode-toolkit
NPM: @agiflowai/scaffold-mcp

"The best code is the code you don't have to write. But when you do write it, scaffolding ensures you write it right the first time—every time."

This is Part 1 of my series on making AI coding assistants work on complex projects. Stay tuned for Part 2!

Questions? I'm happy to discuss architecture patterns, scaffolding strategies, or share more implementation details in the comments.