DEV Community: mufeng

I Built a Durable AI Knowledge Base with Markdown and Git

mufeng — Sat, 25 Jul 2026 03:32:01 +0000

Why search alone does not create memory—and how source layers, agent rules, and deterministic checks keep a knowledge base useful over time

My AI knowledge base starts with a directory tree, not a search box:

10-inbox      rough, unverified capture
20-sources    evidence that cannot be silently rewritten
30-knowledge  synthesis that can change with new evidence
40-projects   goals, constraints, and current execution state
50-research   questions, competing explanations, and evidence gaps
60-writing    drafts and publication records
70-investing  observations, theses, and capital decisions
80-logs       append-only changes, decisions, and handoffs
90-archive    inactive material retained for history

The point of these folders is not tidiness. Each one defines a different change contract.

A captured conversation may be incomplete. A source record should preserve what was actually found. A synthesis page must remain editable because better evidence may overturn it. A project decision needs a date and an owner. A log should not be rewritten simply because the outcome later became inconvenient.

That distinction turned out to matter more than the choice of model, embedding database, or note-taking app.

Search can retrieve knowledge. It does not maintain it.

Most AI document tools follow a retrieval-augmented generation pattern: upload files, retrieve relevant chunks at query time, and ask a model to assemble an answer.

This is useful. It is also easy to mistake for long-term memory.

If a question requires five documents, a retrieval system may locate and combine five fragments each time the question is asked. The answer can be excellent, yet the synthesis usually disappears into chat history. A contradiction discovered today may have to be rediscovered next month. A corrected interpretation may never update the next answer.

In April 2026, Andrej Karpathy described a different pattern in his LLM Wiki proposal: put a persistent, interlinked Markdown wiki between raw sources and the user. When a new source arrives, an agent does more than index it. The agent updates topic pages, adds cross-references, records contradictions, and revises the existing synthesis.

Karpathy's document is a design proposal, not a benchmark proving that a wiki beats RAG at every scale. I treat the two as complementary:

The maintained wiki stores conclusions, relationships, and unresolved disagreements that have already been developed.
Retrieval helps an agent find the right sources and pages as the collection grows.
Raw evidence remains available when a conclusion needs to be audited or rebuilt.

Search is the navigation layer. It should not quietly become the truth layer.

A durable AI knowledge base needs three jobs

The simplest useful architecture has three layers:

raw sources -> maintained knowledge -> agent schema

Each layer answers a different question.

1. What did the source actually say?

The source layer stores provenance: the exact URL or file, author, publication date when available, access date, and any capture limitations. It may also include a faithful excerpt or source note.

Its job is not to be elegant. Its job is to make later verification possible.
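
As an illustration, a source record carrying those provenance fields could be modeled as a frozen dataclass. This is a minimal sketch, not the record format used in the repository described here: the class name `SourceRecord`, its field names, and the `missing_provenance` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class SourceRecord:
    """Hypothetical provenance record for one captured source."""
    locator: str                      # exact URL or file path
    author: str
    accessed: date                    # when the source was captured
    published: Optional[date] = None  # publication date, when available
    limitations: tuple = ()           # e.g. ("partial capture",)
    excerpt: str = ""                 # optional faithful excerpt


def missing_provenance(record: SourceRecord) -> list:
    """List the provenance fields that are empty or unknown."""
    gaps = []
    if not record.locator.strip():
        gaps.append("locator")
    if not record.author.strip():
        gaps.append("author")
    if record.published is None:
        gaps.append("published")
    return gaps
```

Freezing the dataclass only prevents accidental reassignment inside a program. It says nothing about the file on disk, where review of the Git diff remains the actual safeguard.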

2. What do I currently think the evidence supports?

The knowledge layer contains concept pages, comparisons, summaries, and evolving conclusions. These pages are expected to change. If new evidence weakens an old claim, the synthesis should say so.

This is where accumulated knowledge lives. It is not raw evidence, and it should never pretend to be.

3. How may an agent change the system?

The schema or repository protocol tells an agent how to name files, check for duplicates, cite claims, update indexes, and validate its work. In my setup, these rules live in AGENTS.md and in the frontmatter schema used by each page.

Without this layer, every new agent session has to guess the rules again. The result is predictable: duplicate pages, drifting names, missing links, and confident summaries with weak provenance.

Why I extended the three-layer model

The three-layer model explains how documents become maintained knowledge. My repository also has to support execution: projects, research questions, writing, and investment decisions.

Those objects do not age in the same way.

An early draft used a generic structure:

raw/
wiki/
daily/
memory/
projects/

It looked simple until real material arrived.

Should an unverified chat transcript go into wiki or memory? Should a research conclusion and a project decision use the same status fields? If a source later proves inaccurate, may the original record be edited?

Adding more vaguely named folders would not solve those questions. Defining mutation rules did.

That led to the numbered directory structure at the beginning of this article. The important move was separating evidence that should not be silently changed from conclusions that must remain revisable.

Both rules are necessary. Immutable conclusions become dogma. Mutable evidence destroys the audit trail.
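
The mutation rules can also be written down as data that a script or agent consults before editing. The sketch below is an assumption about how that might look, not the repository's actual mechanism. The rule names are invented, and only the three folders whose contract is stated explicitly in the directory listing are included.

```python
from pathlib import PurePosixPath

# Hypothetical rule names. Only folders whose change contract the article
# states explicitly are listed; everything else is reported as "unspecified".
CHANGE_CONTRACTS = {
    "20-sources": "no-silent-rewrite",
    "30-knowledge": "revisable",
    "80-logs": "append-only",
}


def change_contract(path: str) -> str:
    """Return the change contract for a repository-relative path."""
    parts = PurePosixPath(path).parts
    top_level = parts[0] if parts else ""
    return CHANGE_CONTRACTS.get(top_level, "unspecified")
```

A pre-commit script could call `change_contract` on each modified path and refuse, for example, a commit that rewrites existing lines under an append-only folder.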

Why Markdown and Git are a practical foundation

I am not claiming that Markdown files automatically survive for decades, or that Git makes a repository truthful.

I chose them for narrower reasons.

The CommonMark specification defines Markdown as a plain-text format for structured documents and emphasizes that the source remains readable. A person can inspect it without a proprietary application. So can Codex, Claude Code, Cursor, Gemini CLI, OpenCode, or a future tool that does not exist yet.

Pro Git describes version control as recording changes to files over time so that earlier versions can be recovered. In practice, Git gives this knowledge base reviewable diffs, history, branches, and portable clones.

Together, Markdown and Git provide four properties I care about:

Inspectability. The content, sources, and operating rules are readable without a dedicated product.
Comparability. When an agent changes a conclusion, a diff shows what changed.
Portability. Changing editors, models, or retrieval systems does not require exporting the core knowledge first.
Rebuildability. Search indexes, graph caches, and visual interfaces can be regenerated from the files.

There are limits. Git is not a backup policy. Markdown does not verify claims. Private repositories still need access control and remote copies. Sensitive material still needs an explicit visibility model. The tools make governance possible; they do not perform it.

Give the agent a repository protocol, not a vague prompt

“Organize my notes” is not an operating model.

Before making a substantive change, an agent in my setup is expected to read the current priorities, the relevant indexes, the directory rules, and the schema. It must search for an existing page on the same subject before creating another one.

A minimal page uses frontmatter like this:

---
schema: v1
id: note-llm-compiled-knowledge
type: note
title: "LLM-compiled knowledge"
status: stable
visibility: shareable
created: 2026-07-24
updated: 2026-07-24
tags:
  - ai-agents
  - knowledge-management
confidence: high
---

The schema is not an attempt to turn Markdown into a database. It establishes the minimum shared vocabulary required for several agents to work on the same repository without constantly renegotiating identity and lifecycle.
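
Because Python's standard library has no YAML parser, a standard-library check has to read the frontmatter itself. The following is a minimal sketch that handles only the flat `key: value` and `- item` subset shown in the example page. The function names and the choice of required fields are assumptions, not the author's tool.

```python
# Assumed required set; adjust to whatever the repository schema demands.
REQUIRED_FIELDS = ("id", "type", "status", "created", "updated", "tags")


def parse_frontmatter(text: str) -> dict:
    """Parse a flat frontmatter block delimited by '---' lines.

    Supports `key: value` pairs and simple `- item` lists only.
    Returns {} when there is no complete frontmatter block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if not stripped:
            continue
        if stripped.startswith("- "):
            # List item: attach to the most recent key that opened a list.
            if isinstance(meta.get(current_key), list):
                meta[current_key].append(stripped[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        current_key = key.strip()
        value = value.strip().strip('"')
        meta[current_key] = value if value else []
    return {}  # no closing delimiter found


def missing_fields(meta: dict) -> list:
    """Return required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not meta.get(name)]
```

Nested mappings, multi-line strings, and other YAML features are deliberately out of scope. If pages need them, a real YAML library is the safer choice.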

The protocol also distinguishes different kinds of statements:

Fact: directly supported by a source.
Inference: a conclusion drawn from one or more facts.
Hypothesis: a claim that still needs evidence.
Opinion: an explicit judgment.
Decision: a chosen action, including its context and date.

One of the more dangerous model errors is not an obvious fabrication. It is a compression error: an author's opinion is summarized as a verified fact, or a tentative hypothesis loses its uncertainty label after three rewrites.

Statement types make that drift easier to notice.
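
One way to make the labels checkable is to treat them as an enumeration and flag any statement marked as a fact that carries no source. This is a sketch under assumptions: the `Statement` structure and the `unsupported_facts` helper are hypothetical, and how statements are extracted from a page is left open.

```python
from dataclasses import dataclass
from enum import Enum


class StatementType(Enum):
    FACT = "fact"              # directly supported by a source
    INFERENCE = "inference"    # drawn from one or more facts
    HYPOTHESIS = "hypothesis"  # still needs evidence
    OPINION = "opinion"        # explicit judgment
    DECISION = "decision"      # chosen action with context and date


@dataclass
class Statement:
    text: str
    kind: StatementType
    sources: tuple = ()  # identifiers of supporting source records


def unsupported_facts(statements):
    """Return statements labeled FACT that cite no source."""
    return [
        s for s in statements
        if s.kind is StatementType.FACT and not s.sources
    ]
```

A check like this cannot tell whether a cited source really supports the claim. It only catches the mechanical case where a "fact" has lost its citation during a rewrite.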

The citation failure that changed the design

An early draft of my source material cited Karpathy's work with a link to his general Gist page.

Technically, there was a reference. Practically, it was not auditable. A reader could not tell which Gist supported the claim, when it was created, or whether I had accurately represented it.

I replaced the profile-level link with the exact LLM Wiki Gist and created a separate source record containing its author, creation date, access date, capture method, and limitations.

The mistake was small, but the lesson was not: a references section is not the same thing as evidence provenance. Search-result snippets, homepages, and model paraphrases are leads. They are not substitutes for the source itself.

Semantic review and deterministic checks solve different problems

Natural-language instructions are good at expressing editorial judgment:

Is this source credible enough for the claim?
Did a summary erase a disagreement?
Is a conclusion now stale?
Has an inference been presented as a fact?

They are an inefficient way to catch mechanical errors that ordinary code can find reliably:

missing required fields
duplicate IDs
broken relative links
files stored under the wrong content type
pages omitted from an index

My implementation uses a small Python tool, built only with the standard library, to perform those structural checks. The repository's verification command runs both unit tests and a health check:

make verify

This does not prove the knowledge is correct. It proves that a defined set of structural invariants holds. Semantic review and deterministic validation are complementary, not interchangeable.
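
Two of the structural checks listed above, duplicate IDs and broken relative links, can be sketched with the standard library alone. The code below is illustrative and is not the author's tool: the function names are hypothetical, and the link pattern only recognizes inline Markdown links of the form `[text](target)`.

```python
import re
from collections import defaultdict
from pathlib import Path

# Inline Markdown links only; reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def duplicate_ids(ids_by_path: dict) -> dict:
    """Map each ID used by more than one file to the sorted file paths."""
    seen = defaultdict(list)
    for path, page_id in ids_by_path.items():
        seen[page_id].append(path)
    return {pid: sorted(paths) for pid, paths in seen.items() if len(paths) > 1}


def broken_relative_links(page: Path) -> list:
    """Return relative link targets in `page` that do not exist on disk."""
    broken = []
    for target in LINK_RE.findall(page.read_text(encoding="utf-8")):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or same-page anchor
        relative = target.split("#", 1)[0]  # drop any fragment
        if not (page.parent / relative).exists():
            broken.append(target)
    return broken
```

A health check could collect the results for every page and exit with a non-zero status when either list is non-empty, so that a command such as `make verify` fails visibly.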

A minimal implementation you can build this weekend

You do not need my full directory structure. Start with:

sources/
knowledge/
projects/
logs/
AGENTS.md
INDEX.md

Then add five constraints.

1. Choose the authoritative store

Treat Markdown files and local assets as the source of record. Note apps, vector databases, and graph views can be useful interfaces, but they should not be the only copy of the knowledge.

2. Separate sources from synthesis

Preserve the origin of a claim in sources/. Put revisable summaries and concept pages in knowledge/. Never promote an AI-generated summary into the source layer.

3. Write the agent's modification rules

Your AGENTS.md should answer:

What must an agent read before editing?
How are files named and deduplicated?
Which records may be edited, and which are append-only?
How are facts cited and inferences labeled?
Which checks must pass before the task is complete?

4. Use a minimal schema

Start with id, type, status, created, updated, and tags. Add a field only when it solves an observed retrieval, review, or collaboration problem.

A large ontology created on day one is usually a maintenance bill disguised as preparation.

5. Turn mechanical rules into tests

Check required fields, unique IDs, links, file locations, and index coverage with deterministic code. Reserve model judgment for source quality, contradictions, uncertainty, and synthesis.
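
Index coverage is a good first test because it is easy to state: every page under a content directory should be linked from the index. The sketch below assumes that `INDEX.md` sits at the repository root and links to pages with repository-relative paths. Both the function name and those conventions are assumptions.

```python
import re
from pathlib import Path


def pages_missing_from_index(root: Path,
                             index_name: str = "INDEX.md",
                             content_dirs=("knowledge",)) -> list:
    """Return Markdown pages under content_dirs that INDEX.md never links to."""
    index_text = (root / index_name).read_text(encoding="utf-8")
    # Collect inline link targets, ignoring any #fragment.
    linked = set(re.findall(r"\]\(([^)#\s]+)", index_text))
    missing = []
    for directory in content_dirs:
        for page in sorted((root / directory).rglob("*.md")):
            relative = page.relative_to(root).as_posix()
            if relative not in linked:
                missing.append(relative)
    return missing
```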

When should you add vector search?

Not on the first day.

At a modest scale, an index file plus full-text search may be enough. Add BM25, embeddings, reranking, or a graph database when you can name the recurring failure:

known information is repeatedly hard to find
the index is too large to navigate
vocabulary mismatch defeats keyword search
near-duplicate pages keep appearing
relationship queries have become central to the work

This is an engineering threshold, not a universal rule. Retrieval infrastructure can become necessary as the corpus, query patterns, and number of collaborators grow. It should remain a rebuildable acceleration layer wherever possible.
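
For reference, the "full-text search" baseline can be very small. The sketch below ranks Markdown files by how often they contain the query terms. It is a hypothetical helper written for this illustration, with no stemming, ranking model, or index, which is exactly why it stays trivially rebuildable.

```python
from pathlib import Path


def search(root: Path, query: str, limit: int = 10) -> list:
    """Return (relative_path, score) for pages containing every query term.

    The score is the total number of case-insensitive term occurrences.
    """
    terms = query.lower().split()
    if not terms:
        return []
    hits = []
    for page in root.rglob("*.md"):
        text = page.read_text(encoding="utf-8", errors="replace").lower()
        if all(term in text for term in terms):
            score = sum(text.count(term) for term in terms)
            hits.append((page.relative_to(root).as_posix(), score))
    # Highest score first; path as a stable tie-breaker.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:limit]
```

When this kind of search starts failing in one of the ways listed above, that failure is the signal for which heavier retrieval layer to add.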

Frequently asked questions

What is a durable AI knowledge base?

It is a knowledge system that preserves source provenance, maintains revisable synthesis, records changes, and gives humans and agents explicit rules for updating the collection. Its value comes from accumulated, auditable work—not merely from answering the current query.

Does a Markdown wiki replace RAG?

No. A maintained wiki stores conclusions and relationships that should persist between questions. RAG or other retrieval methods help find relevant material at query time. Many systems benefit from both.

Why use Git for knowledge management?

Git makes file changes reviewable and recoverable. It can show when an agent changed a claim, compare competing edits, and preserve repository history across clones. It does not replace backups, access controls, or fact-checking.

Can several AI agents safely edit the same knowledge base?

They can collaborate more reliably when the repository defines naming, evidence, mutation, indexing, and validation rules. Concurrency still requires ordinary Git discipline and human review for consequential changes.

What should be tested automatically?

Automate structural invariants: required metadata, unique IDs, valid links, correct locations, and index coverage. Do not treat a passing linter as proof that a conclusion is true.

What this system has—and has not—proved

My current implementation shows that clear layers, a repository protocol, and standard-library checks can let multiple AI coding agents work on the same Markdown assets while leaving reviewable file history.

It has not proved that the structure will remain sufficient across thousands of pages, years of use, or heavy multi-user concurrency. Vector retrieval may become necessary. The directory tree may need to be split. The schema may grow heavier.

Those changes should be triggered by observed failures.

The useful measure of an AI knowledge base is not how many pages it generated on day one. It is whether the system can preserve evidence, correct a conclusion, and let the next conversation continue from work that can still be inspected.

References

Andrej Karpathy, “LLM Wiki”, created April 4, 2026; accessed July 25, 2026.
CommonMark Spec 0.31.2, January 28, 2024; accessed July 25, 2026.
Pro Git, “About Version Control”, accessed July 25, 2026.

What 178 Claude Code Sessions Taught Me About Working With Coding Agents

mufeng — Fri, 17 Jul 2026 07:40:41 +0000

How /insights turned repeated mistakes into CLAUDE.md rules, reusable Skills, and evidence-driven agent workflows

I no longer use Claude Code only to complete functions or fix isolated bugs.

Over the past few months, I have used it across Next.js and TypeScript product work, Swift and iOS release preparation, localization, payments, production debugging, open-source maintenance, and technical writing.

The tasks became more ambitious, but one problem became harder to see: I remembered why individual sessions succeeded or failed, yet I could not identify the patterns across dozens of them.

Then I ran:

/insights

Claude Code analyzed 178 of my 205 sessions, covering 1,412 messages across 43 days, from May 26 to July 15, 2026.

The report did not write any product code. It did something more useful: it showed me how I work with coding agents, where the collaboration repeatedly breaks down, and which temporary corrections should become permanent engineering rules.

The main value of /insights is not another usage dashboard. It is turning scattered session history into a workflow you can change and later verify.

What Claude Code Insights actually is

/insights is a built-in Claude Code command. Anthropic describes it as a way to generate a report that analyzes your sessions, including project areas, interaction patterns, and friction points.

In practice, my report tried to answer questions such as:

What kinds of projects and tasks do I use Claude Code for?
Do I ask it to debug, implement, review, or write?
How do I frame requests and approve changes?
Which collaboration patterns tend to produce strong outcomes?
Which mistakes, tool failures, and incorrect assumptions keep returning?
Which preferences belong in CLAUDE.md?
Which repeated procedures should become Skills, Hooks, or agent workflows?

That makes Insights closer to an engineering retrospective about the human-agent system than a code-quality scanner.

It is also different from a team analytics dashboard. Analytics measures adoption, accepted lines of code, activity, and cost. Insights examines the behavior inside your sessions.

This distinction matters because some parts of the report are facts, while others are model-generated interpretations. “1,412 messages” is a reported count. “You interrogate before you authorize” is a behavioral summary. A claim that this pattern explains a higher satisfaction rate is an inference, not an audited conclusion.

Treating all three as equally certain would be a mistake.

The pattern it found in my work

The report described my default style as “interrogate first, authorize second.”

I rarely begin with “change the code.” I usually start with a diagnostic question:

Why is the system deletion dialog using Chinese?
Why can users select only the monthly plan?
Why does this navigation transition look wrong?
Is this CTA actually helping the page?

Only after Claude Code explains the root cause and shows evidence do I tell it to implement the fix.

The tool data supported that description: my use of Bash and Read was much higher than Write. I spent more effort investigating and verifying than generating new code.

That approach led to some of my best sessions. Simulator recordings exposed a first-frame navigation issue. Controlled experiments clarified an iOS system-language behavior. Browser screenshots revealed layout failures. curl and GitHub API calls replaced plausible explanations with observable evidence.

But the report also found the weakness: I had never fully encoded this method into the system.

Requirements such as “always run type checking,” “show a real screenshot for UI work,” and “verify platform behavior with a controlled experiment” still lived mostly in my head. I kept repeating them in new sessions.

Before Insights, I interpreted that as “Claude misunderstood this task.” After Insights, I saw a different problem: the collaboration contract had never been made persistent.

Failure pattern 1: a small fix becomes a refactor

One request was supposed to change four localized strings. Claude Code expanded it into a refactor of two switch blocks with twelve cases.

The code was not necessarily invalid. The direction and scope were wrong.

Another request asked for an annual payment option. Claude initially implemented a deep link, but I wanted an explicit annual-plan button. The first approach had to be rejected and rolled back.

Insights grouped these incidents as scope drift. My current approval gate for cross-file work is:

Before writing code, list the exact files you would change, what would change
in each file, and every user-visible behavior change. Wait for me to say “go.”

If the work touches more than five files, provide two viable approaches and
explain the trade-offs.

Implement only what I explicitly requested. If you notice adjacent warnings,
refactoring opportunities, or alternative solutions, list them separately and
wait for approval instead of implementing them.

This prompt does not make the model more intelligent. It adds a cheap decision point before an expensive diff exists.

Failure pattern 2: the first plausible explanation wins

While debugging a GitHub OAuth 400 error, one investigation path blamed the local proxy. Later evidence pointed to the Supabase provider configuration.

This is a familiar debugging failure. Once the first explanation sounds plausible, every subsequent command starts trying to confirm it instead of falsify it.

For problems that cross application code, configuration, infrastructure, and local tooling, I now prefer mutually exclusive hypotheses:

Use three parallel agents to investigate the GitHub OAuth 400 error.

Agent 1: inspect the application code and callback route.
Agent 2: call the Supabase auth endpoint directly with curl.
Agent 3: inspect the network and proxy path.

Each agent must report only its conclusion, the exact commands it ran, and the
raw evidence. Do not propose fixes yet. After all evidence is available, choose
the smallest fix that explains the observations.

Parallelism is not the goal. Killing incorrect branches quickly is the goal.

Each agent should own one hypothesis and have a clear falsification condition. Without that structure, “use multiple agents” can simply produce several confident opinions at once.

Failure pattern 3: reasoning replaces reproduction

My most valuable debugging sessions did not stop at reading code.

For UI and platform behavior, the evidence often existed outside the source files:

A simulator recording revealed the first expanded frame of a navigation bar.
A controlled simulator experiment clarified the language used by an iOS system dialog.
A real HTTP response showed whether an upstream service was failing.

I converted that lesson into a reusable instruction:

Do not infer the cause yet. Reproduce the problem first.

Record the simulator, capture a browser screenshot, or call the endpoint with
curl to prove that the bug exists. Report exactly what you observed, then
propose a fix.

After the fix, repeat the same reproduction and show the before-and-after
evidence.

This moves verification from the end of the task to the entrance condition for root-cause analysis.

Failure pattern 4: repeated work never becomes a system

Across several sessions, I repeatedly performed the same App Store release checks:

permission-purpose strings;
hosted Terms, Privacy, and EULA URLs;
hard-coded prices in paywalls;
localization completeness across .lproj files;
app icon assets;
version and build-number increments;
multilingual release notes based on the real Git diff.

Explaining this checklist from scratch every time wastes context and increases the chance of omission.

The report suggested converting it into a reusable release-audit Skill. The important improvement is not fewer keystrokes. It is running the same release gates every time and receiving evidence in a consistent format.
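
To show what one of those release gates looks like as deterministic code, here is a sketch of the localization-completeness check. It assumes UTF-8 `Localizable.strings` files in the classic `"key" = "value";` form and an `en.lproj` base. String Catalogs, other encodings, and the actual Skill are outside its scope, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the quoted key at the start of a `"key" = "value";` line.
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*=', re.MULTILINE)


def strings_keys(path: Path) -> set:
    """Return the set of keys defined in one .strings file."""
    return set(KEY_RE.findall(path.read_text(encoding="utf-8")))


def missing_localizations(project: Path,
                          base: str = "en.lproj",
                          table: str = "Localizable.strings") -> dict:
    """Map each non-base .lproj folder to the base keys it lacks."""
    base_keys = strings_keys(project / base / table)
    report = {}
    for lproj in sorted(project.glob("*.lproj")):
        if lproj.name == base:
            continue
        table_path = lproj / table
        keys = strings_keys(table_path) if table_path.exists() else set()
        missing = sorted(base_keys - keys)
        if missing:
            report[lproj.name] = missing
    return report
```

An empty report means every localization defines every base key. It does not say whether the translations are correct, which remains a review task.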

Where each Insight should go

Not every recommendation belongs in CLAUDE.md.

| Kind of recommendation | Best home | Example |
| --- | --- | --- |
| Stable rule relevant to most sessions | `CLAUDE.md` | Run type checking and tests before reporting completion |
| Repeatable multi-step procedure | Skill | App Store release audit |
| Deterministic action after an event | Hook | Run type checking after file edits |
| Independent investigation with isolated context | Subagent | Test code, configuration, upstream, and environment hypotheses |

This separation follows the roles described in Anthropic’s Claude Code extension documentation: CLAUDE.md supplies persistent context, Skills package reusable workflows, Hooks respond to lifecycle events, and Subagents run specialized loops in separate contexts.

Putting everything into one giant instruction file would create a different problem: more context, weaker relevance, and less predictable behavior.

What changed before and after Insights

There is an important evidence boundary here.

My material contains detailed evidence from the sessions before I adopted these recommendations. It also contains the report and the workflows I derived from it. It does not yet contain several weeks of post-adoption data.

I therefore cannot honestly claim that Insights has already reduced my rework by a certain percentage or saved a specific number of hours.

What has changed is concrete but narrower:

A vague feeling about scope drift became an explicit approval rule.
A wrong proxy hypothesis became a falsification-first debugging workflow.
Repeated preferences became candidates for CLAUDE.md.
A scattered release process became a Skill specification.
Long tasks that previously died with the session now have staged checkpoints and written state.

The durable engineering effects still need to be measured.

In the next cycle, I can compare the size of unrelated changes in each diff, rollback frequency, the number of repeated instructions, how often root-cause claims come with evidence attached, and recovery cost after interrupted sessions.

That is a more defensible before-and-after story than turning recommendations into fictional results.

A practical way to use `/insights`

The command is simple:

cd /path/to/your/project
claude

Then run:

/insights

The useful work begins after the report appears:

Check the sample. Confirm the number of sessions and the time range. A few sessions can make an accident look like a habit.
Read friction before praise. The flattering sections are easy to accept; recurring failures are more actionable.
Separate counts from interpretations. Label what is measured, what is summarized, and what is inferred.
Choose one to three changes. Do not redesign your entire agent workflow in one pass.
Put each change in the right mechanism. Use CLAUDE.md, a Skill, a Hook, or a Subagent deliberately.
Define a verification metric. Decide what would prove that the change helped.
Run Insights again later. Treat the first report as a baseline, not a verdict.

For long-running work, I would add one more rule: write findings, decisions, remaining tasks, and verification output to a file while the work is happening. Several of my sessions were interrupted by usage limits, output limits, or terminated background processes. A saved checkpoint makes interruption recoverable.
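
A checkpoint file does not need to be elaborate. The sketch below appends one JSON object per line, so an interrupted session leaves every earlier checkpoint intact. The JSON Lines layout, the field names, and both function names are assumptions made for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_checkpoint(path: Path, *, findings, decisions, remaining,
                      verification: str = "") -> dict:
    """Append one checkpoint entry as a single JSON line and return it."""
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "findings": list(findings),
        "decisions": list(decisions),
        "remaining": list(remaining),
        "verification": verification,
    }
    # Append mode: earlier checkpoints are never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def latest_checkpoint(path: Path):
    """Return the most recent checkpoint, or None if there is none."""
    if not path.exists():
        return None
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    return json.loads(lines[-1]) if lines else None
```

A new session can start by reading `latest_checkpoint` and continuing from the `remaining` list instead of reconstructing state from chat history.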

Limits worth keeping in mind

Insights can analyze what happened in the recorded sessions. It cannot automatically prove that the business outcome succeeded.

It may notice that you run tests frequently, but it cannot replace an audit of test quality. It may recommend parallel agents, but parallelism costs resources and is a poor fit for work with tight sequential dependencies.

Its interpretations can also be wrong. A report should generate hypotheses about your workflow, not become an unquestionable source of truth.

Finally, review the report before sharing it. Session-derived reports and screenshots may expose project names, file paths, customer information, endpoints, tokens, account details, or unreleased products.

The real benefit

The most useful sentence in my report was not that I was “good at debugging.” It was the implication behind the evidence: I had developed a repeatable way of working with coding agents, but I had not yet encoded it into a system the tools could execute consistently.

That is where /insights becomes valuable.

It turns “Claude sometimes goes off track” into named failure modes. It turns repeated corrections into persistent rules. It turns recurring tasks into reusable workflows. And it gives you a baseline for checking whether the next version of your process is actually better.

Generate the report once for awareness. Run it again later for evidence.

DESIGN.md: Stop Letting AI Guess What Your UI Should Look Like

mufeng — Thu, 16 Jul 2026 08:54:38 +0000

A practical workflow for turning visual intent into a versioned, lintable contract for coding agents.

I started with one broken token reference:

components:
  button-primary:
    backgroundColor: "{colors.action}"

There was no colors.action token.

Google's official DESIGN.md linter caught it immediately:

Reference {colors.action} does not resolve to any defined token.
errors: 1, warnings: 1, infos: 2
exit code: 1

That small failure captures the point of DESIGN.md better than a polished demo does.
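
The check behind that error is simple to reason about. The following sketch is not the official linter. It is a hypothetical illustration of the idea that a `{group.name}` reference must resolve to a value somewhere in the token tree, and it assumes the front matter has already been loaded into nested dictionaries.

```python
import re

# A value that consists entirely of one {dotted.path} reference.
REF_RE = re.compile(r"^\{([A-Za-z0-9_.-]+)\}$")


def unresolved_references(tokens: dict) -> list:
    """Return every reference string that points at no defined token."""

    def resolves(dotted: str) -> bool:
        node = tokens
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                return False
            node = node[part]
        return True

    missing = []

    def walk(node) -> None:
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, str):
            match = REF_RE.match(node)
            if match and not resolves(match.group(1)):
                missing.append(node)

    walk(tokens)
    return missing
```

Run against a token tree that defines no `colors.action`, the function reports the `{colors.action}` reference used by `button-primary`, which mirrors the failure described above.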

AI coding tools can already produce a working page. The harder problem is getting five pages to feel like the same product, then preserving that consistency across new sessions, new agents, and later revisions.

Most teams try to solve this by adding more adjectives to the prompt:

Make it modern, clean, premium, and similar to a top-tier SaaS product.

The prompt is not too short. It is too ambiguous.

DESIGN.md replaces that ambiguity with a file that humans can review, agents can read, Git can version, and tools can validate.

What DESIGN.md actually is

Google Labs describes DESIGN.md as a format for communicating visual identity to coding agents.

A conforming file has two layers:

YAML front matter containing machine-readable colors, typography, spacing, radii, and component tokens.
Markdown prose describing the design intent, component semantics, responsive behavior, and explicit do's and don'ts.

A small file might look like this:

---
version: alpha
name: Signal Desk
colors:
  primary: "#182230"
  tertiary: "#5B5BD6"
  surface: "#FFFFFF"
rounded:
  sm: 6px
components:
  button-primary:
    backgroundColor: "{colors.tertiary}"
    textColor: "{colors.surface}"
    rounded: "{rounded.sm}"
---

## Overview

Signal Desk should feel like an operations notebook crossed with a quiet
avionics panel: compact, factual, and calm under pressure.
It is not a glossy marketing dashboard.

## Do's and Don'ts

- Do make the release state legible before secondary metrics.
- Don't add gradients, glass blur, glowing borders, or decorative charts.

The tokens and prose do different jobs.

Tokens define what to use.
Rationale explains why it should be used that way.
Negative constraints define what the result must not become.

Tokens alone reduce a design system to a palette. Prose alone leaves exact values open to interpretation. Together they form a visual contract.

Why this improves AI-generated UI

This is an engineering explanation based on the public specification and the project described below. It is not a model-vendor benchmark, and DESIGN.md does not guarantee good taste.

What it does is reduce four kinds of guessing.

1. It reduces hidden degrees of freedom

Ask for a "modern dashboard" and the model still has to choose the font, scale, palette, density, radii, shadows, motion, and responsive behavior.

Every unspecified decision is another opportunity for drift.

DESIGN.md narrows the design space before implementation begins. The model is no longer sampling from the average visual language of the web. It is working inside a smaller, explicit system.

2. It turns taste into semantic rules

#5B5BD6 is only a color value. It becomes a rule when the file says that indigo is reserved for the single primary action and the current selection.

Likewise:

"Do not turn every metric into a floating card" is more actionable than "use fewer cards."
"Status must include text and cannot rely on color alone" is more testable than "remember accessibility."

The value is not more description. It is more operational description.

3. It persists across sessions

A chat prompt is temporary. A repository file is durable.

Once design decisions live in the project, the next session, another agent, and a code reviewer can all work from the same source of truth. Git also makes those decisions diffable and reversible.

Google's CLI currently supports lint, diff, and token export. That moves part of design-system maintenance out of memory and into a normal engineering workflow.

4. It creates a verification loop

"The colors feel inconsistent" is hard to put into CI.

An unresolved {colors.action} reference is not.

Once visual rules become structured data, tooling can catch broken references, missing typography, contrast problems, and structural mistakes before the agent produces more code from a bad contract.
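
Contrast is a good example of a visual rule that becomes arithmetic once colors are tokens. The sketch below computes the WCAG contrast ratio between two six-digit hex colors. Whether the official linter computes contrast this way is an assumption, and the function names are hypothetical.

```python
def _linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of a '#RRGGBB' color."""
    digits = hex_color.lstrip("#")
    r, g, b = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)
```

A CI step could resolve each component's text and background tokens, call `contrast_ratio`, and fail when the result falls below the threshold the team has chosen, such as the 4.5:1 that WCAG level AA requires for normal-size text.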

The four-file setup

The useful setup is not four overlapping instruction manuals. Each file should own a different question.

project/
├── README.md
├── AGENTS.md
├── DESIGN.md
├── CLAUDE.md
└── src/

| File | Question it answers | What belongs there |
| --- | --- | --- |
| `README.md` | What are we building? | Users, product goal, scope, setup, acceptance criteria |
| `AGENTS.md` | How should code be changed? | Architecture, commands, tests, boundaries, security rules |
| `DESIGN.md` | What should the product look and feel like? | Tokens, rationale, component semantics, responsive rules, anti-patterns |
| `CLAUDE.md` | How should Claude Code consume the project context? | Shared-rule imports and Claude-specific workflow notes |

One detail matters here: Claude Code's official documentation says it reads CLAUDE.md, not AGENTS.md.

If the repository already uses AGENTS.md, Anthropic recommends importing it rather than maintaining a duplicate:

@AGENTS.md

## Claude Code

- Summarize the constraints in README.md and DESIGN.md before changing UI code.
- Do not replace specific design rules with generic "modern SaaS" styling.

Codex discovers AGENTS.md files and merges instructions along the directory hierarchy.

The import makes CLAUDE.md a good adapter layer. Copying all of AGENTS.md into it instead creates two sources of truth. When only one copy gets updated, the model receives conflicting context.

A working example: Signal Desk

I tested the approach with a small release-status website called Signal Desk.

The implementation uses only HTML and CSS. No React, Tailwind, component library, external font, or generated product screenshot. Keeping the stack small made it easier to separate the effect of project context from the capabilities of a framework.

Step 1: define the product boundary

The README.md describes the job and acceptance criteria:

# Signal Desk

A one-page release status dashboard for independent developers.
Users should understand the current version, outstanding risk,
recent releases, and the next action within ten seconds.

## Acceptance criteria

- Use native HTML and CSS with no third-party assets.
- Support 375px mobile and 1440px desktop widths.
- Keep the page keyboard-accessible with visible focus states.
- Follow the information hierarchy in DESIGN.md.

This comes first for a reason. A detailed design system cannot rescue an undefined product. It can only help the agent build the wrong product more consistently.

Step 2: define the engineering contract

The AGENTS.md owns implementation and verification rules:

## Implementation

- Use semantic HTML before ARIA attributes.
- Keep CSS tokens in `:root`; token names must map to `DESIGN.md`.
- Support 375px and 1440px viewports without horizontal scrolling.
- Preserve visible `:focus-visible` states and reduced-motion behavior.

## Verification

- Run `npx @google/design.md lint DESIGN.md` after editing the design contract.
- Verify that actions, warnings, and status labels do not rely on color alone.

Notice what is missing: background colors, radii, and card styling. Those belong in DESIGN.md, not in the engineering instructions.

Step 3: give the design a specific world

The Overview does not say "modern, professional, premium."

It says:

Signal Desk should feel like an operations notebook crossed with a quiet
avionics panel: compact, factual, calm under pressure.
It is not a glossy marketing dashboard.

That reference naturally suggests a warm paper-like canvas, dark ink, sparse status color, compact information density, restrained corners, and no glassmorphism or decorative charts.

Google's DESIGN.md philosophy makes the same point: a specific reference carries more useful information than a broad list of adjectives.

Step 4: implement and inspect the result

The final page puts release state at the top of the hierarchy and keeps one primary action above the fold. Versions and timestamps use monospace. Success and rollback states combine color with labels. Borders and tonal shifts create structure without floating-card effects.

The accompanying screenshot is a real local browser capture from the project. The interface copy is Chinese, but the behavior of the design contract is language-independent.

The screenshot does not prove that DESIGN.md beats every possible prompt. It proves a narrower claim: the four-file contract can produce a running artifact whose design decisions can be traced back to versioned project files.

The failure was more useful than the first render

The intentionally broken {colors.action} reference failed lint as expected.

After fixing it, the linter still reported four warnings. The neutral, line, success, and warning tokens existed but were not referenced by any component.

That exposed a common design-system mistake: more tokens do not automatically mean more control. An unused token is inventory, not a working rule.

I added semantic component mappings for page, divider, status-success, and status-warning. The final result was:

errors: 0, warnings: 0, infos: 1
exit code: 0
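The unused-token warning can be approximated in a few lines of Python. This is a sketch of the idea, not the linter's implementation. The token names come from the warnings described above, but the component names `button-primary`, the `action` and `ink` tokens, and the exact token-to-component pairing are assumptions for illustration.

```python
def unused_tokens(defined: set[str], components: dict[str, list[str]]) -> list[str]:
    """Return defined token names that no component references, sorted."""
    used = {token for refs in components.values() for token in refs}
    return sorted(defined - used)


# Hypothetical data shaped after the four warnings.
defined = {"action", "ink", "neutral", "line", "success", "warning"}
components = {
    "button-primary": ["action", "ink"],
}
print(unused_tokens(defined, components))
# ['line', 'neutral', 'success', 'warning']

# Adding semantic mappings (pairing assumed here) clears the list.
components.update({
    "page": ["neutral"],
    "divider": ["line"],
    "status-success": ["success"],
    "status-warning": ["warning"],
})
print(unused_tokens(defined, components))  # []
```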

Three practical lessons came out of that loop:

DESIGN.md has to evolve with real components. It should not expand independently of the product.
A small number of high-value tokens with semantic component mappings is better than hundreds of copied variables.
Passing lint proves structural validity, not visual quality. Responsive inspection, accessibility checks, and human review still matter.

Do not copy another company's visual identity

The awesome-design-md repository is useful for studying how detailed design documents are structured. Its files are extracted from publicly visible websites and provided as-is. The repository explicitly says it does not claim ownership of those visual identities.

The safe use is to study decisions, not clone a brand:

How are primary and secondary actions distinguished with few colors?
How do type scale and spacing establish hierarchy?
How are interaction states, responsive behavior, and accessibility described?
Which rules are general principles, and which belong to that specific company?

Vercel's public /design.md is a useful large example. It documents the light Geist system through color scales, typography, radii, spacing, and component tokens. It also points to a separate dark-theme document.

It is not a universal starter template. Vercel's own Web Interface Guidelines explicitly separate general interface guidance from Vercel-specific preferences.

Study how constraints are expressed. Do not copy the identity they protect.

A practical adoption sequence

If you want to introduce DESIGN.md into an existing app or website, use this order:

Choose one bounded screen. Start with a login, settings, dashboard, or detail page.
Extract facts from the current product. Record the colors, type, spacing, radii, and states that actually exist.
Write the Overview and Don'ts first. Define a concrete reference and the five to ten failure modes agents repeat most often.
Add the minimum useful tokens. Only add values used by the first screen, then map them to semantic components.
Put verification in AGENTS.md. State when lint runs and which viewports, states, and accessibility rules must be checked.
Keep CLAUDE.md thin. Import shared rules and add only tool-specific instructions.
Review design changes with diff. Run the official command before accepting token or semantic changes:

npx @google/design.md diff DESIGN.md DESIGN-v2.md

Design-system changes should be reviewed like API changes. They should not appear silently inside one generation.

What DESIGN.md cannot solve

DESIGN.md is not a replacement for Figma, user research, information architecture, content design, or interaction prototyping.

It is also context, not enforcement. Anthropic makes this distinction explicitly for CLAUDE.md: instructions guide model behavior, but they are not a hard configuration layer. The same caution applies to any file an agent reads.

Specific, concise, non-conflicting rules improve the odds of compliance. They do not guarantee it.

The useful scope is narrower:

Make fewer visual decisions implicit.
Keep design language stable across pages and sessions.
Put visual intent under version control and review.
Catch some contract errors before they spread into implementation.

The real gain is not that DESIGN.md raises the model's aesthetic ceiling. It raises the quality of the constraints around the model.

README.md defines what to build. AGENTS.md defines how to work. DESIGN.md defines what the result should look and feel like. CLAUDE.md adapts those shared rules for Claude Code.

That is how a one-off prompt becomes maintainable project context.

Loop Engineering: Turning /goal and /loop into Verifiable AI Agent Workflows

mufeng — Tue, 07 Jul 2026 10:46:52 +0000

Loop Engineering is becoming one of those terms that spreads faster than its definition.

That usually creates two bad outcomes. Some people dismiss it as another AI buzzword. Others treat it as magic: prepend /loop to a prompt and expect an agent to ship production-ready work while they sleep.

Both readings are wrong.

The practical definition is simpler:

Loop Engineering is the practice of designing AI agent work as a repeatable cycle with a clear goal, bounded action, verification, state persistence, and stop rules.

In other words, it is not a prompt trick. It is an engineering discipline for long-running AI work.

The Problem: You Are Still the Loop

Most developers use coding agents like this:

Write a prompt.
Let the agent edit code.
Run tests manually.
Paste the failure back.
Ask the agent to try again.
Repeat until it works or you give up.

It looks like the AI is doing the work. In reality, you are still the scheduler, QA engineer, state manager, and stop-condition checker.

The agent executes single instructions. You decide whether the result is correct, whether the next step should happen, which error matters, what changed, and when the task is done.

Loop Engineering moves those responsibilities out of your head and into an explicit workflow.

A good loop answers these questions before the agent starts:

What does done mean?
What is the allowed scope?
What should happen in each iteration?
How will the result be verified?
Where is progress recorded?
When should the agent stop?
Which actions require human approval?

That is the shift. Prompt Engineering tries to improve one response. Loop Engineering tries to make multiple responses converge toward a verifiable result.

The Vocabulary That Matters

Before talking about /goal and /loop, it helps to separate several related concepts.

Prompt Engineering

Prompt Engineering is writing a good single instruction.

Example:

Fix the failing tests in the auth module and explain what changed.

This can work for a small task. But it does not define the scope, verification method, failure policy, or stop condition.

For one-shot requests, that may be fine. For multi-step coding work, it is fragile.

Context Engineering

Context Engineering is deciding what the model sees at each step: instructions, files, tool outputs, memory, logs, retrieval results, MCP data, and previous state.

Long-running agents produce new context every turn. If you simply keep adding more history, the model gets more noise, not more clarity.

Good context engineering keeps high-signal information and externalizes state into files the agent can reread.

Harness Engineering

The harness is the environment the agent runs inside.

For coding agents, that usually means:

Project instructions: CLAUDE.md, AGENTS.md, or similar files
Permission rules: what the agent can run automatically and what requires approval
Tooling: MCP servers, browser access, GitHub, databases, design tools
Hooks: format after edit, lint before commit, log tool calls
Subagents: separate contexts for review, research, and verification
Memory: durable project decisions and recurring preferences

The loop runs on top of the harness. Without a harness, the agent has to guess your project structure, commands, conventions, and boundaries every time.

Guessing is where many agent failures begin.

Loop Engineering

At its smallest, a loop looks like this:

Read goal -> act -> verify -> write state -> continue or stop

This is not the same as a traditional script.

A script repeats fixed steps. A looped agent evaluates state, chooses the next action, handles errors, and updates the plan.

That flexibility is useful. It is also dangerous if you do not define boundaries.
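The cycle above can be sketched as a small Python skeleton. The function name `run_loop`, the state dictionary layout, and the toy worker are assumptions for illustration. A real agent loop would call a model and real checks where this sketch uses plain callables.

```python
from typing import Callable


def run_loop(
    act: Callable[[dict], None],
    verify: Callable[[dict], bool],
    max_iterations: int,
) -> dict:
    """Repeat act -> verify -> write state until verified or out of budget."""
    state = {"iteration": 0, "status": "doing", "log": []}
    while state["iteration"] < max_iterations:
        state["iteration"] += 1
        act(state)                      # one bounded step
        passed = verify(state)          # evidence, not self-assessment
        state["log"].append((state["iteration"], passed))  # write state
        if passed:
            state["status"] = "done"
            return state
    state["status"] = "needs-human"     # boundary reached: hand control back
    return state


# Toy worker: each step increments a counter; verification wants it to reach 3.
def toy_act(state: dict) -> None:
    state["counter"] = state.get("counter", 0) + 1


result = run_loop(toy_act, lambda s: s["counter"] >= 3, max_iterations=8)
print(result["status"], result["iteration"])  # done 3
```

The iteration cap is the boundary: without it, a worker that never satisfies the verifier would run indefinitely.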

Verifier

The verifier is the evidence layer.

It should not ask, "Do you think this is done?" It should check evidence:

Did the tests pass?
Did the build pass?
Did the diff stay inside the allowed scope?
Do the links open?
Does the screenshot match the target state?
Does the implementation satisfy the written acceptance criteria?

The best verifier is often separated from the worker. If the same agent writes the code and judges the result in the same context, it can rationalize its own mistakes.
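One of the checks in that list, whether the diff stayed inside the allowed scope, is easy to make mechanical. The following Python sketch assumes the changed paths are already available as a list, for example from `git diff --name-only`. The function name and the file paths are hypothetical.

```python
def out_of_scope(changed_files: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return changed paths that fall outside every allowed directory."""
    prefixes = tuple(d.rstrip("/") + "/" for d in allowed_dirs)
    return [path for path in changed_files if not path.startswith(prefixes)]


# Hypothetical change list for illustration.
changed = ["lib/auth/hash.ts", "tests/auth/hash.test.ts", "db/migrations/004.sql"]
print(out_of_scope(changed, ["lib/auth", "tests/auth"]))
# ['db/migrations/004.sql']
```

Because the check reads file paths rather than the worker's own summary, it cannot be talked into passing.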

Memory and State

Memory and state are related, but they are not the same.

Memory is long-term project knowledge:

team preferences
architectural decisions
recurring constraints
things the agent should remember across sessions

State is current task progress:

what has been done
what failed
what is blocked
what the next iteration should read first

A useful minimal setup is:

AGENTS.md or CLAUDE.md       # long-term project rules
LOOP-STATE.md               # current loop progress
IMPLEMENTATION_PLAN.md      # current plan and checklist
logs/                       # iteration logs

Without state, a loop often becomes a repetition machine. It looks busy but keeps rediscovering the same facts.

Stop Rules

Stop rules are the brakes.

Every loop needs at least two kinds:

Success stop: what evidence proves the task is complete
Failure stop: when the agent should stop trying and return control to a human

Example:

Success:
- pnpm test auth passes
- auth coverage stays above 80%
- git diff only touches lib/auth and tests/auth

Failure:
- the same test fails 3 times without a new hypothesis
- database migrations need to be modified
- a new production dependency is required
- 8 iterations pass without meeting the goal

An agent loop without stop rules is not automation. It is a cost risk.
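The example rules above can be written as a single decision function. This Python sketch is one possible encoding, not a feature of any agent tool. The `LoopStatus` fields and the `decide` name are assumptions, and "without a new hypothesis" is simplified to a counter of consecutive identical failures.

```python
from dataclasses import dataclass


@dataclass
class LoopStatus:
    tests_pass: bool
    coverage: float               # percent
    diff_in_scope: bool
    same_failure_count: int       # consecutive repeats of the same failing test
    needs_migration_change: bool
    needs_new_dependency: bool
    iteration: int


def decide(status: LoopStatus, max_iterations: int = 8) -> str:
    """Return 'stop-failure', 'stop-success', or 'continue'."""
    # Failure stops come first so a risky state is never reported as success.
    if status.needs_migration_change or status.needs_new_dependency:
        return "stop-failure"
    if status.same_failure_count >= 3:
        return "stop-failure"
    if status.tests_pass and status.coverage > 80 and status.diff_in_scope:
        return "stop-success"
    if status.iteration >= max_iterations:
        return "stop-failure"
    return "continue"


print(decide(LoopStatus(True, 84.0, True, 0, False, False, 4)))   # stop-success
print(decide(LoopStatus(False, 84.0, True, 3, False, False, 4)))  # stop-failure
```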

/goal vs /loop

In Claude Code, /goal and /loop represent two different loop shapes.

According to the Claude Code documentation, /goal sets a completion condition. Claude keeps working and checks after each turn whether the goal has been reached.

/loop runs a prompt repeatedly at an interval inside the current Claude Code session. It is better for polling, monitoring, reminders, or waiting for external state changes.

The short version:

| Command | Core question | Stop condition | Best for |
| --- | --- | --- | --- |
| `/goal` | What state counts as done? | Goal reached or failure rule triggered | Migrations, failing tests, docs, issue cleanup |
| `/loop` | How often should this be checked? | External event, human stop, explicit rule | Deployment checks, PR monitoring, scheduled review |

There is an important portability detail.

The engineering pattern transfers across tools. The command names do not.

Claude Code has /goal and /loop in its documentation. Codex emphasizes AGENTS.md, Automations, Subagents, Workflows, and CLI workflows. Do not assume every agent tool exposes the same slash commands.

Write portable loop specifications, then map them to the tool you are using.

When to Use /goal

Use /goal when the task has a verifiable endpoint.

Bad:

/goal make the project better

Good:

/goal auth migration is complete:

Done when:
- all new password writes use argon2id
- legacy bcrypt hashes are rehashed after the next successful login
- pnpm test auth passes
- tests cover migration, failed login, and legacy hash compatibility

Scope:
- only edit lib/auth, tests/auth, docs/auth-migration.md
- do not edit merged db/migrations
- do not change the session cookie format

Verification:
- run pnpm test auth after each change
- inspect git diff for out-of-scope files
- if tests fail, diagnose before editing again

Stop:
- stop when all done conditions pass
- stop if the same failure repeats 3 times
- stop before adding production dependencies
- stop after 8 iterations

State:
- maintain LOOP-STATE.md
- update done, blocked, and next-step items after every iteration

The point is not verbosity. The point is verifiability.

If your goal cannot be checked by tests, builds, diffs, file content, screenshots, link checks, metrics, or clear human acceptance criteria, it is not ready for a loop.

When to Use /loop

Use /loop when the main job is to check something repeatedly.

Example:

/loop 5m check whether production deployment is complete:

Each iteration:
- inspect the latest GitHub Actions workflow for this branch
- if it is still running, record the current job and elapsed time
- if it succeeded, check whether the production homepage returns 200
- if production is healthy, report success and stop
- if the workflow failed, summarize the failed log and stop

Stop:
- deployment succeeds and health check passes
- workflow fails
- 12 iterations pass without completion
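The same shape, check on an interval and stop on success, failure, or an iteration cap, can be sketched in plain Python. This is not how `/loop` is implemented. The `poll` function, its status strings, and the injected `sleep` parameter are assumptions chosen so the sketch runs without waiting.

```python
import time
from typing import Callable


def poll(
    check: Callable[[], str],
    interval_seconds: float,
    max_iterations: int,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call check() until it returns 'success' or 'failed', or the budget runs out."""
    for iteration in range(1, max_iterations + 1):
        status = check()
        if status in ("success", "failed"):
            return f"{status} after {iteration} checks"
        if iteration < max_iterations:
            sleep(interval_seconds)
    return f"gave up after {max_iterations} checks"


# Stand-in for a real deployment check: reports 'running' twice, then 'success'.
responses = iter(["running", "running", "success"])
print(poll(lambda: next(responses), interval_seconds=300, max_iterations=12,
           sleep=lambda seconds: None))
# success after 3 checks
```

All three exits are explicit, so the loop cannot keep polling after the question has been answered.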

Typical /loop tasks:

check deployment status every 5 minutes
monitor whether a PR has new review comments
generate a daily project status summary
watch whether an external service has recovered
periodically process failed jobs or logs

Bad /loop tasks:

one-shot questions
vague ideation
high-risk production changes
product direction decisions without human context
"keep improving this" with no stop condition

"Keep improving this" is one of the most dangerous instructions you can give to an autonomous agent. It has no endpoint, no boundary, and no cost ceiling.

A Reusable Loop Spec Template

For serious work, put the loop spec in a file such as PROMPT.md, LOOP-SPEC.md, or IMPLEMENTATION_PLAN.md.

# Goal

Describe the final state in one or two sentences.
Do not write "improve this." Write what evidence must be true.

## Work Scope

- Readable directories:
- Editable directories:
- Forbidden directories:
- Actions requiring human approval:

## Work Method

- Process one subtask per iteration
- Read the existing implementation before editing
- Prefer existing project patterns
- Do not add dependencies unless you stop and explain why

## Verification

- Commands to run each iteration:
- Files to inspect:
- Evidence to preserve:
- Retry policy when verification fails:

## State

- Read LOOP-STATE.md at the start of each iteration
- Update LOOP-STATE.md at the end of each iteration
- Allowed states: todo, doing, done, blocked, needs-human

## Stop Rules

- Success stop:
- Failure stop:
- Max iterations:
- Max budget:
- Conditions requiring human intervention:

## Report

At the end, report:
- what was completed
- verification evidence
- blockers
- files changed
- recommended next step

This template tells the agent how to work, not just what to do.

The Codex Equivalent

If you are using Codex, I would map the same pattern into three layers.

First, use AGENTS.md for repository-level instructions. OpenAI's Codex documentation describes AGENTS.md as the place for project guidance, test commands, coding standards, and constraints.

Minimal example:

# Repository Instructions

## Commands

- Install: pnpm install
- Test: pnpm test
- Lint: pnpm lint

## Rules

- Prefer existing helpers under src/lib.
- Do not add production dependencies without asking first.
- Run pnpm test after changing TypeScript files.
- Keep changes scoped to the user request.

## Verification

Before finishing, report:
- Files changed
- Commands run
- Tests passed or why they could not run

Second, use Codex Automations for scheduled or recurring checks.

Third, use Subagents and Workflows when research, verification, log analysis, or review should happen in separate contexts.

The warning is the same as with Claude Code: parallel agents are not free. They consume more tokens and introduce coordination overhead. Use them when they reduce context pollution or improve verification quality.

What Loop Engineering Actually Solves

It reduces human QA relay work

You still review the final result. But you stop acting as the manual bridge between test output and the next agent instruction.

It makes long tasks recoverable

A clear LOOP-STATE.md lets an agent resume from the previous iteration instead of relying on a giant chat transcript.

It replaces confidence with evidence

Agents are often confident. Evidence is better.

Looped work should end with test logs, build output, diffs, screenshots, link checks, benchmark results, or explicit acceptance criteria.

It turns repeated work into team assets

The first loop spec is slow to write. The second is faster. By the third time, it probably belongs in a reusable workflow, skill, automation, or project template.

That is the difference between a prompt and an engineering practice.

Three Practical Scenarios

1. Research briefs without fake citations

A common failure mode: ask an AI to write a research brief, and it returns polished claims with references that are dead links or do not support the claim.

The loop version should require verification:

/goal research brief is complete:

Done when:
- every major claim has at least 2 accessible sources
- each source supports the specific claim it is attached to
- invalid sources are removed or replaced
- final Markdown includes a references section

Verification:
- open each link
- summarize what claim each source supports
- mark mismatched sources as invalid and replace them

Failure stop:
- no authoritative source found after 3 distinct search attempts
- required evidence is behind a paywall
- 6 iterations reached

The loop is not about writing faster. It is about not publishing unsupported claims.

2. Fixing a frontend persistence bug

Suppose a settings page says "saved," but after refresh the settings disappear.

Good loop:

/goal settings persistence bug is fixed:

Done when:
- the save-refresh-loss bug is reproduced
- root cause is identified and fixed
- a regression test covers save and reload
- npm test settings passes
- if a dev server is available, manual refresh confirms persistence

Scope:
- inspect app/settings, lib/settings, tests/settings first
- do not modify auth, billing, or database migrations

Stop:
- stop if the API contract must change
- stop if a data migration is required
- stop if the same test fails 3 times without progress

This works well because the task has natural stages: reproduce, diagnose, fix, test, verify.

3. Building a 0-to-1 product

Andrew Ng's framing of three product loops is useful:

Agentic coding loop: the agent builds, tests, and fixes against a spec
Developer feedback loop: the developer reviews product direction and updates the spec
External feedback loop: real users, alpha testers, or A/B tests change the product vision

These loops run at different speeds.

The coding loop may run every few minutes. The developer feedback loop may run every few hours. External feedback may take days or weeks.

Do not try to automate all three equally.

My practical judgment: the first loop can be heavily automated. The second still needs human product context. The third must not be faked. User feedback cannot be replaced by a model's guess about what users might want.

The point is not to remove the human entirely. It is to move the human out of repetitive low-level relay work and back into judgment, direction, and acceptance.

Common Failure Modes

1. The goal is a wish

Bad:

/goal make the app better

Better:

/goal homepage performance pass is complete:
- Lighthouse Performance >= 90
- LCP < 2.5s
- existing analytics remain intact
- npm run build passes
- before and after metrics are reported

Agents need endpoints, not vibes.

2. The verifier is weak

"Check if there are any issues" is not verification.

Good verification:

runs a specific command
reads a specific output
compares against a written condition
reports pass or fail with evidence
does not silently fix failures while pretending the check passed
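Those properties map onto a small wrapper around one command. The Python sketch below is a minimal illustration. The `run_check` name and the report fields are assumptions, and the command is a stand-in that prints a fixed string so the example runs anywhere.

```python
import subprocess
import sys


def run_check(command: list[str], expected_text: str) -> dict:
    """Run one command and compare its output against a written condition."""
    completed = subprocess.run(command, capture_output=True, text=True)
    passed = completed.returncode == 0 and expected_text in completed.stdout
    return {
        "command": " ".join(command),
        "passed": passed,
        "exit_code": completed.returncode,
        # Keep the tail of the output as evidence for the report.
        "evidence": completed.stdout[-500:],
    }


# Stand-in command; a real loop would call the project's test runner here.
report = run_check([sys.executable, "-c", "print('3 passed')"], "3 passed")
print(report["passed"], report["exit_code"])  # True 0
```

The wrapper only reports. It never edits anything, so a failing check stays visible as a failure.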

3. There is no state file

Without state, loops repeat themselves:

already-fixed tests get fixed again
rejected hypotheses get rediscovered
forbidden files get reopened
previous failure reasons disappear

Keep state short and structured:

# LOOP-STATE

## Current Status

- status: doing
- current_step: add regression test for password migration
- last_verified: pnpm test auth failed on legacy hash path

## Done

- confirmed current hash implementation
- added migration helper draft

## Blocked

- none

## Next

- fix legacy bcrypt verification test
- rerun pnpm test auth
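A state file in this shape is simple enough to read back with a few lines of code. The Python sketch below assumes the layout shown above, with `##` headings followed by `- ` bullets. The `parse_state` function is hypothetical, not part of any agent tool.

```python
def parse_state(markdown: str) -> dict[str, list[str]]:
    """Group '- item' bullets under the '## Section' heading above them."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:])
    return sections


state_text = """# LOOP-STATE

## Current Status

- status: doing

## Blocked

- none

## Next

- fix legacy bcrypt verification test
- rerun pnpm test auth
"""
state = parse_state(state_text)
print(state["Next"][0])  # fix legacy bcrypt verification test
```

Keeping the file this regular means either an agent or a script can tell what the next iteration should read first.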

4. Permissions are too broad

The more autonomous the loop, the narrower the permissions should be.

Limit destructive commands, force pushes, database migrations, production deployment, customer data writes, outbound messages, purchases, and anything that cannot be safely undone.
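As a rough illustration, a gate for commands that need human approval can be a prefix check. The policy list and the `requires_approval` function in this Python sketch are assumptions. A prefix match is easy to bypass (for example with a different flag spelling), so real enforcement belongs in the harness's permission settings rather than in a helper like this.

```python
import shlex

# Hypothetical policy: command prefixes that always need human approval.
NEEDS_APPROVAL = [
    ["git", "push", "--force"],
    ["rm", "-rf"],
    ["pnpm", "add"],
]


def requires_approval(command: str) -> bool:
    """Return True when the command starts with a restricted prefix."""
    words = shlex.split(command)
    return any(words[: len(prefix)] == prefix for prefix in NEEDS_APPROVAL)


print(requires_approval("git push --force origin main"))  # True
print(requires_approval("pnpm test auth"))                # False
```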

5. Context keeps growing

More context is not always better. In long-running loops, it often becomes rot.

Prefer:

file paths as indexes
reading files only when needed
summarizing large logs
writing state to disk
compacting long sessions
separating verification into a fresh context
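Summarizing large logs does not always need a model. A deterministic head-and-tail cut, sketched below in Python, is often enough to keep the signal. The `compact_log` name and the default line counts are assumptions for illustration.

```python
def compact_log(text: str, head: int = 5, tail: int = 20) -> str:
    """Keep the first and last lines of a log and note how many were dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    dropped = len(lines) - head - tail
    kept = (
        lines[:head]
        + [f"... {dropped} lines omitted ..."]
        + lines[len(lines) - tail:]
    )
    return "\n".join(kept)


log = "\n".join(f"line {n}" for n in range(1, 101))
summary = compact_log(log, head=2, tail=3)
print(summary)
```

With a 100-line log, `head=2`, and `tail=3`, the result is six lines: the first two, a marker noting 95 omitted lines, and the last three.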

6. /loop is used where /goal belongs

If you know the endpoint, use a goal. If you only know the checking rhythm, use a loop.

Bad:

/loop 10m keep refactoring until it is better

Better:

/goal user-service split is complete:
- user-service.ts split into no more than 4 modules
- each module below 300 lines
- public API unchanged
- pnpm test user passes
- max 6 iterations

7. Human context is automated too early

Product taste, business tradeoffs, customer insight, and brand judgment can be AI-assisted. They should not be silently delegated when the model lacks the context you have.

Loop Engineering is strongest for execution and verification. Direction still needs context.

My Rule of Thumb

Before I put a task into a loop, I ask five questions:

Can the result be verified?
Can the scope be narrowed?
Can failures be recovered or escalated?
Are human-approval actions explicit?
Can progress be written to a state file?

If I cannot answer at least four of those clearly, I do not start a loop.

"Design a better business model" is not ready for a loop. I would first use normal conversation to clarify constraints and options.

"Classify 20 pieces of user feedback, output the top 5 issues, preserve the original quote for each, and assign a priority" is ready. It has inputs, outputs, evidence, and a completion condition.

Final Checklist

Before writing a loop, check this:

Goal:
- Is there a clear final state?
- Can it be verified by tests, builds, diffs, links, screenshots, metrics, or acceptance criteria?

Scope:
- What can be read?
- What can be edited?
- What is forbidden?
- Are new dependencies allowed?

Execution:
- Is each iteration small?
- Should existing project patterns be reused?
- Is state written after each iteration?

Verification:
- What command or check proves progress?
- What happens on verification failure?
- Is the verifier separated from the worker when needed?

Stop:
- What is the success stop?
- What is the failure stop?
- What is the max iteration or budget?

Permissions:
- Are destructive, production, and sensitive-data actions restricted?
- Are high-risk actions routed back to a human?

If you cannot fill this out, do not start the loop yet.

Conclusion

Loop Engineering is not about making AI "run by itself."

It is about making AI run inside boundaries.

The real shift is:

from writing prompts to writing completion conditions
from pasting errors to designing verification layers
from trusting confidence to requiring evidence
from accumulating chat history to externalizing state
from one-off interactions to reusable workflows

/goal is for tasks with endpoints. /loop is for repeated checks. The harness provides the floor. The verifier provides evidence. Memory and state provide continuity. Stop rules provide the brakes.

The stronger the model gets, the more discipline it needs.

A weak model cannot get very far. A strong model can get very far in the wrong direction.

The Real AI Productivity Hack Is Not a Better Prompt

mufeng — Sat, 04 Jul 2026 00:47:04 +0000

I used to think the next jump in AI productivity would come from writing better prompts.

Longer prompts. More precise prompts. Prompts with role definitions, tone rules, examples, constraints, and output formats.

After reading a book on Agent Skills, I think that framing is too small.

The real bottleneck is not that I fail to explain a task once. The real bottleneck is that I keep explaining the same class of task again and again: how I want an article structured, how I review code, how I prepare App Store release notes, how I generate visuals, how I check a draft before publishing.

At some point, “using AI” quietly turns into “managing AI manually.”

The book’s most useful idea is simple:

AI productivity does not come from making every prompt longer. It comes from turning repeated work into executable, maintainable, testable skills.

That changed how I think about AI work.

A Skill Is Not a Prompt

A prompt is a temporary instruction inside one conversation.

A skill is a reusable operating manual for an agent.

That difference sounds small until you use AI every day. A prompt tells the model what you want right now. A skill tells the agent how a category of work should be done every time:

when to activate
what input to read
what steps to follow
which tools or scripts to call
what output to produce
what must never happen
where the agent should stop and ask for human judgment

That last part matters.

The goal is not to remove the human from the work. The goal is to stop spending human attention on the same low-level instructions.

For me, the most obvious candidates are not exotic:

a writing style skill
a code review skill
an iOS release checklist skill
an App Store release notes skill
a book notes skill
a weekly review skill

These are not tasks I cannot do. They are tasks where I keep repeating the same standards, preferences, caveats, and checks.

That repetition is the real cost.

The Useful Split: Judgment, Mechanics, and Workflow

One of the cleanest distinctions in the book is this:

prompts handle semantic judgment
scripts handle deterministic mechanics
skills orchestrate the whole workflow

This sounds obvious, but many AI workflows fail because they give the model the wrong job.

For example, asking a model to decide where an article needs illustrations is reasonable. Asking it to reliably rename files, validate image dimensions, split long documents, or calculate table values is usually a mistake.

Those are deterministic jobs. They should be handled by scripts or strict tools.
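Splitting a long document is a good example of a job a script does better. The Python sketch below is one simple approach, assuming paragraphs are separated by blank lines. The `split_document` name and the character-based limit are assumptions for illustration.

```python
def split_document(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, breaking only between paragraphs.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "alpha\n\nbeta\n\ngamma"
print(split_document(doc, max_chars=12))  # ['alpha\n\nbeta', 'gamma']
```

The same input always produces the same chunks, which is the property a model cannot promise.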

The model is better used for judgment:

choosing the angle of an essay
identifying the weak part of a draft
comparing two architecture options
explaining a tradeoff
turning rough material into clear language

The skill sits above both. It says: when this kind of task appears, use the model for the judgment parts, use scripts for the mechanical parts, and preserve the checkpoints where a human decision is required.

That is a much more durable pattern than trying to put everything into one giant prompt.

Context Is a Workbench, Not a Warehouse

Large context windows make it tempting to dump everything into the conversation.

Style guides. Prior chats. Examples. Templates. API docs. Drafts. Personal preferences. All of it.

The book argues for the opposite discipline: load the right material at the right time.

That is how skills should be designed. The main SKILL.md should not become a warehouse. It should contain the core workflow:

trigger conditions
inputs and outputs
main steps
hard constraints
failure modes
references to load only when needed

Long templates, examples, API notes, and style samples belong in separate reference files.

This is not just about token savings. It is about attention. The more unrelated material you push into context, the easier it becomes for the model to miss the one rule that actually matters.

Context should feel like a workbench: only the tools needed for the current job should be on it.

Good Workflows Are Not Fully Automatic

The dangerous version of AI automation is the one that looks efficient because it removes every pause.

Give the agent source material. Let it choose the angle. Let it write the draft. Let it polish the draft. Let it generate images. Let it publish.

That looks like a productivity win. Often it is just a way to outsource the most important decisions.

The better workflow is more selective.

For writing, I want AI to:

analyze source material
propose several angles
stop
let me choose the angle
draft from that angle
revise against my standards
prepare platform-specific versions
The pause is not friction. It is the point.

The same applies to development. AI can propose implementation plans, write tests, scan for regressions, and generate release notes. But architecture decisions, product tradeoffs, and publish decisions still need human ownership.

AI can do the prep work. It should not silently take over the judgment.

Skills Need Engineering, Not Decoration
A useful skill should be treated more like a small software product than a clever note.

That means it has a lifecycle:

define the real problem
build the smallest usable version
run it on real tasks
record failure modes
add tests or examples
refactor when the file becomes too large
keep improving it as the work changes
The most useful part of a skill is often not the elegant workflow. It is the “gotchas” section.

That is where you record the failures that keep happening:

the agent forgot to read the reference template
the output sounded too generic
the script handled the wrong file path
the model rewrote sections it should have preserved
the task needed a human checkpoint before publishing
This is where personal experience becomes operational memory.

If the same mistake happens twice, it probably belongs in the skill. If the same task happens three times, it is probably a candidate for a skill.

The Security Boundary Is Part of the Design
Skills become more serious when they can read files, write files, call scripts, access the network, or publish content.

At that point, they are not just prompts. They are operational tools.

So the safety rules need to be designed in from the beginning:

limit where the skill can read and write
avoid destructive actions without confirmation
back up before overwriting important files
test publishing workflows with fake data first
remove local paths, secrets, and personal assumptions before sharing a skill publicly
inspect third-party skills before running their scripts
This is not paranoia. It is basic engineering hygiene.

The more capable the agent becomes, the more explicit the boundaries must be.
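
Two of those rules, limiting the write scope and backing up before overwriting, are easy to enforce in a helper script. This is a rough sketch under my own assumptions: the name safe_write and the ".bak" suffix are hypothetical, and a real skill would probably want timestamped backups.

```python
import shutil
from pathlib import Path
from typing import Optional


def safe_write(root: Path, target: Path, content: str) -> Optional[Path]:
    """Write content to target only if it lives under root.

    An existing file is copied to "<name>.bak" first. Returns the backup
    path, or None when there was nothing to back up.
    """
    root = root.resolve()
    target = target.resolve()
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{target} is outside the allowed root {root}")
    backup = None
    if target.exists():
        backup = target.with_name(target.name + ".bak")
        shutil.copy2(target, backup)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return backup
```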

What I Am Going to Try First
The book made the idea feel concrete enough that I can turn it into a weekly habit.

This week, I would start with three small skills.

First: a writing style skill.

Not a giant manifesto. Just a role, three style principles, a short banned-phrase list, and a few examples of what “good” looks like.

Second: an iOS or app release checklist skill.

The first version only needs to cover version number, release notes, screenshots, privacy text, and a final manual confirmation before submission.

Third: a gotchas section for existing skills.

Take the last three AI outputs that were disappointing. Convert each failure into a specific rule. Do not patch for one example. Capture the pattern.

There is also one experiment worth running immediately:

Take a piece of material you want to turn into an article. Do not ask AI to write the article. Ask it to do only two things: analyze the material and propose three angles. Then stop and choose the angle yourself.

If the final article improves, the human checkpoint paid for itself.

The Shift
The book did not make me want to use AI more.

It made me want to manage AI less manually.

That is the real shift: from temporary instruction to reusable workflow; from prompt accumulation to experience engineering; from asking AI to remember my preferences to writing those preferences into a system that can be maintained.

Better prompts still matter.

But the real compounding return comes when the prompt stops being a one-off instruction and becomes part of a skill.

Disclosure: this essay was adapted from my Chinese reading notes and drafted with AI assistance.

Two Ways Claude Code Calls Codex: One-Shot Subprocess vs. Persistent App Server

mufeng — Fri, 19 Jun 2026 09:59:31 +0000

"Claude Code calls Codex" sounds like one feature. It's at least two different process models, and they have almost nothing in common past the name.

The first spawns a one-shot subprocess with codex exec. You hand it one explicit instruction, it produces a file or a structured result, and it exits. The second runs a persistent runtime with codex app-server and talks to it over JSON-RPC, managing threads, turns, reviews, and interrupts for work that needs to carry state across rounds.

Both let Claude Code borrow Codex. They differ on startup cost, protocol, permissions, error recovery, and the kind of task they fit. Get the distinction wrong and you either over-engineer a one-shot job or reach for a stateless call on work that needs to resume.

The conclusion first: two architectures, not two commands

Dimension	`codex exec` one-shot subprocess	`codex app-server` persistent service
Reference implementation	baoyu `codex-imagegen` backend	OpenAI Codex Plugin for Claude Code
Process shape	Spawned per task, exits when done	Long-running, reused within a session
Transport	Launch args, stdin, JSONL event stream	JSON-RPC requests and notifications
State model	Single run, no dependence on the last	Thread holds multiple turns, can resume
Permission posture	The example uses `danger-full-access`	Review is read-only; task can switch to `workspace-write`
Typical task	Image gen, file generation, single deterministic op	Code review, long delegated tasks, multi-turn work
Main risk	Full-access child, cold start every time	More protocol and lifecycle complexity

The one-line test:

If you need to run once and get a single verifiable artifact, reach for codex exec.
If you need ongoing collaboration, retained context, and the ability to cancel or resume, reach for codex app-server.

Version scope: keep the numbers honest

The first thing this writeup exposed wasn't architecture. It was version accounting. I had carried over the original draft's phrasing about "the current local version," and only after checking the install records did I confirm that the marketplace source and the active plugin were not the same snapshot.

Local commands and plugin records show:

Codex CLI is 0.140.0.
The OpenAI Codex Plugin for Claude Code is 1.0.4, commit 807e03a.
The baoyu-skills marketplace source snapshot is 2.5.1, commit 441ca30.
But Claude Code's installed-plugin record still points baoyu-skills at the earlier 1.111.1 snapshot.

So the accurate way to state the baoyu-codex-imagegen analysis below is this:

It's based on the baoyu-skills v2.5.1 source snapshot in the local marketplace, not a claim that the active plugin has been upgraded to v2.5.1.

This is easy to miss. The marketplace source, the cached snapshot, and the active version can all be different commits. Read the directory name or the changelog alone and you'll write "the version I read" when you mean "the version actually running."

Path one: `codex exec`, Codex as a one-shot operator

What it solves

The baoyu-codex-imagegen skill has a narrow job: let a non-Codex host like Claude Code call the image_gen tool built into the Codex CLI, and save the result to a chosen path.

Tasks like that share a shape:

Clear input boundary, usually a prompt, an aspect ratio, and an output path.
Clear result boundary, usually one file and one line of structured status.
No need for multiple rounds, and no need to restore prior context.

So it skips a persistent service and spawns directly:

codex exec \
  --json \
  --sandbox danger-full-access \
  --skip-git-repo-check \
  -

If a reference image exists, it appends one or more --image arguments.

Why each flag is there

exec runs non-interactively for scripting. OpenAI's CLI docs position it as the execution path for automation and CI: run, return a result, done.

--json turns process output into line-delimited JSON events, or JSONL. The caller doesn't parse terminal display text; it reads structured events for the thread, tool calls, usage, and the final message.

--sandbox danger-full-access is here because this implementation needs Codex to copy the image from its default generation directory to an arbitrary target path the caller specifies, so it grants full file permissions.

That is not a general best practice. OpenAI's docs recommend workspace-write for automation and say to avoid unnecessary full access unless the runtime is already isolated.

--skip-git-repo-check lets Codex run outside a Git repo, since image jobs may launch from a temp or plugin directory rather than a trusted repository.

The trailing - tells Codex to read the instruction from stdin. The wrapper writes the task contract with child.stdin.write(instruction) and then closes stdin.
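
To make the launch shape concrete, here is a rough Python equivalent of that spawn. It is a sketch, not the baoyu wrapper (which writes to child.stdin from Node). The function names and the timeout are my own, and I default the sandbox to workspace-write where the wrapper described here uses danger-full-access.

```python
import subprocess


def build_exec_argv(sandbox: str = "workspace-write", images: tuple[str, ...] = ()) -> list[str]:
    """Assemble the argv for a one-shot `codex exec` run that reads stdin."""
    argv = ["codex", "exec", "--json", "--sandbox", sandbox, "--skip-git-repo-check", "-"]
    for image in images:
        argv += ["--image", image]  # appended after the base command, as described above
    return argv


def run_codex_exec(instruction: str, timeout_s: float = 300.0, **argv_options) -> list[str]:
    """Run Codex once and return the non-empty JSONL lines from stdout."""
    completed = subprocess.run(
        build_exec_argv(**argv_options),
        input=instruction,      # written to the child's stdin, which is then closed
        capture_output=True,
        text=True,
        timeout=timeout_s,      # raises subprocess.TimeoutExpired when exceeded
    )
    return [line for line in completed.stdout.splitlines() if line.strip()]
```

Keeping argv construction in its own function means the flags can be tested without launching Codex.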

The task contract is the real work

This path doesn't pass the user prompt straight through. It wraps a strict instruction, roughly:

TASK:
Generate an image and save it to the given path.

STEPS:
1. You must call the built-in image_gen.
2. Copy the result to the target path.
3. Check that the target file exists.
4. Return one line of JSON only.

HARD CONSTRAINTS:
- Do not call an external image API.
- Do not fake the image with a script.
- You must use image_gen to produce real pixels.

This is the "sub-agent as operator" design:

Fixed input structure.
Fixed set of allowed tools.
Fixed file side effects.
Fixed output format.
Explicit prohibitions.

For an automated pipeline, the constraints matter more than the phrasing. The caller wants a verifiable result, not an open conversation.

Don't trust self-reported success: three checks

The engineering detail worth keeping is that this implementation does not call the job done just because Codex replied "success."

It checks, in order (the third item is a fallback for the second):

Whether the JSONL events contain a thread ID.
Whether an image actually appears under $CODEX_HOME/generated_images/{threadId}/.
If the directory check fails, whether the tool calls include a cp or mv from the generation directory to the target path.
Whether the target file actually exists and has a byte count above zero.
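
Those checks translate into a few lines of filesystem code. The sketch below is my own reconstruction, not the wrapper's source. It assumes the event stream carries the ID in a thread_id field, it leaves out the cp/mv fallback, and the failure labels are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def find_thread_id(jsonl_lines: list[str]) -> Optional[str]:
    """Return the first thread ID found in the event stream, if any."""
    for line in jsonl_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON noise on stdout
        if isinstance(event, dict) and event.get("thread_id"):  # assumed field name
            return str(event["thread_id"])
    return None


def verify_image_job(jsonl_lines: list[str], codex_home: Path, target: Path) -> Optional[str]:
    """Return None when every check passes, else a short failure label."""
    thread_id = find_thread_id(jsonl_lines)
    if thread_id is None:
        return "no_thread_id"
    generated_dir = codex_home / "generated_images" / thread_id
    if not (generated_dir.is_dir() and any(generated_dir.iterdir())):
        return "no_generated_image"
    if not (target.is_file() and target.stat().st_size > 0):
        return "target_missing_or_empty"
    return None
```

Nothing in it reads the agent's final message, which is the point.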

Failure becomes a structured error:

agent_refused
no_image_gen_tool_use
timeout
codex_not_installed
spawn_failed
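
Three of those codes describe launch problems rather than agent behavior, so a wrapper can derive them from the exception it catches. Here is a hedged Python sketch of that mapping. The mapping itself is my assumption, and agent_refused and no_image_gen_tool_use are not covered because they come from inspecting the event stream.

```python
import subprocess


def classify_spawn_error(error: BaseException) -> str:
    """Map a launch failure to one of the structured error codes."""
    if isinstance(error, FileNotFoundError):
        return "codex_not_installed"  # most often: the executable is not on PATH
    if isinstance(error, subprocess.TimeoutExpired):
        return "timeout"
    if isinstance(error, OSError):
        return "spawn_failed"  # permissions, exhausted resources, and so on
    raise error  # anything else is a bug in the wrapper, not a job failure
```

The FileNotFoundError check has to come first because it is a subclass of OSError.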

The point isn't the image. It's a general principle:

An agent's natural-language reply is a claim. Files, events, and repeatable checks are evidence.

Where it fits and where it doesn't

Good fit:

Single image or file generation.
A code transform with clear boundaries.
One-off analysis that returns structured JSON.
Automation that doesn't need inherited context.

Limits:

Every run pays process and model cold-start cost.
No cross-run state by default.
With danger-full-access, the trust boundary is very wide.
Timeout, cancellation, and recovery usually fall to the wrapper to build.

Path two: `codex app-server`, Codex as a stateful service

The OpenAI Codex Plugin for Claude Code does not re-run codex exec per command. It starts codex app-server and manages an ongoing session over JSON-RPC.

OpenAI's docs define the App Server's core abstraction in three layers:

Thread: a conversation that persists.
Turn: one round of user input and agent execution inside a thread.
Item: events inside a turn, such as messages, reasoning, commands, and file edits.

Direct connection and broker

The plugin supports two connection modes.

Direct:

Claude Code
    |
    | stdin/stdout JSONL
    v
codex app-server

The client starts codex app-server itself and sends line-delimited JSON-RPC over stdio.
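
Line-delimited framing is simple enough to sketch in a few lines of Python. This is not the plugin's code. The helper names are mine, and I left out the "jsonrpc" version field on the assumption that the server does not require it, so add it if yours does.

```python
import json
from typing import Any, Optional


def encode_message(method: str, params: Optional[dict] = None, request_id: Optional[int] = None) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line.

    With request_id it is a request; without one it is a notification.
    """
    message: dict[str, Any] = {"method": method}
    if request_id is not None:
        message["id"] = request_id
    if params is not None:
        message["params"] = params
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_lines(buffer: bytes) -> tuple[list[dict], bytes]:
    """Split a read buffer into complete messages plus the unfinished tail."""
    *complete, tail = buffer.split(b"\n")
    return [json.loads(line) for line in complete if line.strip()], tail
```

The tail matters because a pipe read can end in the middle of a message. The caller keeps it and prepends it to the next read.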

Broker:

Claude Code command
    |
    | Unix socket
    v
Broker
    |
    | reuse
    v
codex app-server

The plugin stores the broker endpoint in CODEX_COMPANION_APP_SERVER_ENDPOINT so review, rescue, and status commands in the same Claude Code session share one Codex runtime.

If the broker returns the busy error -32001, or the connection hits ENOENT or ECONNREFUSED, the plugin drops the broker and starts an App Server directly to retry.
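
The fallback rule fits in one small predicate. This Python sketch only illustrates the decision, and the function and constant names are mine.

```python
import errno
from typing import Optional

BROKER_BUSY_CODE = -32001  # the busy error code mentioned above


def should_fall_back_to_direct(
    rpc_error_code: Optional[int] = None,
    os_error: Optional[BaseException] = None,
) -> bool:
    """Decide whether to abandon the broker and start an App Server directly."""
    if rpc_error_code == BROKER_BUSY_CODE:
        return True
    if isinstance(os_error, OSError):
        # ENOENT: the socket path is gone. ECONNREFUSED: nothing is listening.
        return os_error.errno in (errno.ENOENT, errno.ECONNREFUSED)
    return False
```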

That's one more layer than a one-shot subprocess, and it buys:

Runtime reuse within a session.
Thread persistence.
Background task management.
Cancel and resume.
Permission isolation between review and task.

Handshake: initialize first

Once the App Server connection is up, the client sends initialize, then an initialized notification.

The plugin passes this client identity:

{
  "title": "Codex Plugin",
  "name": "Claude Code",
  "version": "1.0.4"
}

It also uses optOutNotificationMethods to unsubscribe from some token-level delta events, keeping the structured notifications that are worth more to the caller and cutting noise.
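
Put together, the handshake is two messages. The Python sketch below shows the shape as I understand it. The clientInfo and capabilities nesting is my assumption about the params layout, so check it against the App Server docs before relying on it.

```python
import json

CLIENT_INFO = {"title": "Codex Plugin", "name": "Claude Code", "version": "1.0.4"}


def handshake_messages(opt_out_methods: list[str]) -> list[str]:
    """Return the two lines a client sends right after connecting."""
    initialize = {
        "id": 0,
        "method": "initialize",
        "params": {
            "clientInfo": CLIENT_INFO,
            # Assumed location for the opt-out list; verify against the docs.
            "capabilities": {"optOutNotificationMethods": opt_out_methods},
        },
    }
    initialized = {"method": "initialized"}  # a notification, so it carries no id
    return [json.dumps(initialize), json.dumps(initialized)]
```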

Session model: threads and turns

The key RPC methods the plugin uses:

Method	Purpose
`thread/start`	Create a new thread
`thread/name/set`	Name a thread
`thread/resume`	Resume an existing thread
`thread/list`	Query past threads
`turn/start`	Start a turn in a thread
`review/start`	Start a code review
`turn/interrupt`	Interrupt a running turn

So the App Server isn't a single-round wrapper that "sends a prompt and waits." It's a managed session runtime.

Review and task have different permissions

The plugin keeps the two actions separate.

Review runs read-only, on a temporary thread, through review/start. It returns findings and does not touch code.

Task defaults to read-only. Pass --write and it switches to workspace-write. It can save the thread, and it can continue prior work with --resume or --resume-last.
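
The rule is small enough to express as a single function. This is an illustrative Python sketch, not the plugin's implementation, and rejecting a write flag on review is my own choice for the example.

```python
def sandbox_for(action: str, write: bool = False) -> str:
    """Pick the sandbox mode for a plugin action."""
    if action == "review":
        if write:
            raise ValueError("review is read-only and cannot be widened")
        return "read-only"
    if action == "task":
        return "workspace-write" if write else "read-only"
    raise ValueError(f"unknown action: {action!r}")
```

Note that nothing in this mapping can produce danger-full-access.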

This is closer to a sensible engineering default than "run everything with full access." Set the minimum permission by the nature of the task, then decide whether to widen write scope.

Hooks wire Codex into the Claude Code lifecycle

The plugin registers three Claude Code hooks:

SessionStart: prepare the shared runtime.
SessionEnd: clean up the broker and session resources.
Stop: an optional stop-gate review.

With the review gate on, every time Claude Code is about to stop, it can have Codex check whether the last round has a blocking problem.

The value isn't "one more model." It's putting a second model inside the delivery flow:

Claude makes a change
    |
    v
Codex reviews independently
    |
    +-- ALLOW: stop is permitted
    |
    +-- BLOCK: return findings, keep working

It has a cost. The official plugin README warns that the review gate can create long Claude/Codex loops and burn through usage fast, so don't turn it on unconditionally.
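
For a sense of how small the gate itself is, here is a Python sketch of the ALLOW/BLOCK translation. It assumes that a Claude Code Stop hook can block by printing a JSON object with a "decision" of "block" and a "reason". That is my reading of the hooks docs, not something taken from the plugin source, and the function name is hypothetical.

```python
import json
from typing import Optional


def stop_hook_output(verdict: str, findings: list[str]) -> Optional[str]:
    """Translate a review verdict into what the Stop hook prints.

    Returns None for ALLOW (print nothing, exit 0) and a JSON string
    for BLOCK so the findings flow back to Claude.
    """
    if verdict == "ALLOW":
        return None
    if verdict == "BLOCK":
        reason = "Codex review found blocking issues:\n" + "\n".join(
            f"- {finding}" for finding in findings
        )
        return json.dumps({"decision": "block", "reason": reason})
    raise ValueError(f"unexpected verdict: {verdict!r}")
```

A real gate would also cap how many times it can block in a row, which is the loop the README warns about.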

How to choose

When `codex exec` fits

Use a one-shot subprocess when most of these hold:

The task is a single round.
The result can be verified by a file or JSON.
You don't need to restore prior context.
Cold-start cost is acceptable.
The caller can handle timeout and retry on its own.

Examples: generate an image, convert input to a fixed format, run one analysis on a file, run a check once in CI.

When `codex app-server` fits

Use the persistent service when you need:

Multiple rounds of conversation.
Thread resumption.
Background runs and status queries.
Interruption of a running task.
Separate review and write permissions.
Integration with Claude Code's session lifecycle.

Examples: review a branch continuously, delegate a long investigation, let Codex change code and then add tests, or run an automatic second-model gate before stopping.

How this was verified

This published version doesn't lean on the draft's description. I redid a minimal verification.

The steps I ran:

Read the draft and listed every factual claim about versions, commands, RPC methods, and permissions.
Ran codex --version, codex exec --help, and codex app-server --help to confirm the current CLI's commands and flags.
Checked the OpenAI plugin manifest, install records, app-server.mjs, codex.mjs, and the hook config.
Checked spawn.ts, main.ts, the version file, and the Git commit in the baoyu marketplace source.
Cross-checked against the OpenAI Codex CLI, App Server, Codex Plugin, and Claude Code Hooks docs.
Recorded "current active version" and "source snapshot I actually read" separately.

The mistake and the lesson

I first took the draft's baoyu-skills v2.5.1 as "the current local version." On further checking, the v2.5.1 marketplace source does exist locally, but Claude Code's installed-plugin record still points at an earlier snapshot.

Without checking the install record, that phrasing looks reasonable and is wrong.

The lesson:

When you analyze a local plugin, record at least the marketplace HEAD, the install cache path, the plugin manifest, and the commit. No single one of those stands in for "the version actually running."
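
One way to make that habit stick is to record the two snapshots as data and let code phrase the sentence. This is a small Python sketch with a hypothetical PluginSnapshot type, using "unknown" where I did not record a commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginSnapshot:
    role: str      # for example "marketplace source" or "installed record"
    version: str
    commit: str


def describe_mismatch(read: PluginSnapshot, running: PluginSnapshot) -> str:
    """Say plainly whether the snapshot you read is the one that runs."""
    if (read.version, read.commit) == (running.version, running.commit):
        return f"read and running agree: {read.version} ({read.commit})"
    return (
        f"analysis is based on the {read.role} {read.version} ({read.commit}), "
        f"but the {running.role} points at {running.version} ({running.commit})"
    )
```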

Practical advice

One-shot tasks: hardcode the output contract

Don't write "generate an image for me" or "check my code." An automation prompt should include at least:

Goal
Allowed tools
Input and output paths
Prohibitions
Verification steps
Final return format

That cuts the uncertainty of an agent improvising, and it lets the caller judge success or failure.

Long tasks: resume with the delta only

When you resume a thread, send only what changed:

Continue the last task. Apply the first fix and add the matching test.

There's no reason to re-paste the whole background. Repeating context adds noise and can make the model misread the task boundary.

Review tasks: bind every finding to evidence

Whether you run a standard review or an adversarial one, require each finding to carry:

The file or diff actually examined.
A reproducible failure path.
A clear risk level.
A split between fact, inference, and open question.

A "might be a problem" with no evidence rarely makes it into an engineering decision.

Permissions: start at the smallest scope

The order of preference:

read-only
    |
    v
workspace-write
    |
    v
danger-full-access

Widen only when the task genuinely needs a larger file scope and the runtime is trusted.

Closing

"Claude Code calls Codex" is not one calling convention.

codex exec is a one-shot, stateless subprocess that's easy to wrap. It fits single tasks with clear boundaries and verifiable results.

codex app-server is a stateful, resumable, manageable agent service. It fits code review, task delegation, and complex work that needs ongoing collaboration.

The real selection criteria aren't "which is more advanced." They are:

Does the task need state?
Can the result be verified in one shot?
Do you need interruption, resume, and background management?
Can permissions be graded by action?
Is the extra protocol complexity worth it?

Simple tasks get a simple process. Ongoing collaboration gets a stateful service. Draw that line clearly and the system gets easier to understand and to maintain.

Your blog is invisible to AI. Here's the 1999 fix.

mufeng — Mon, 15 Jun 2026 03:49:58 +0000

A quick story about a dead protocol, a confused chatbot, and the ten minutes that gave my blog a new kind of reader.

Hey friends,

A small thing happened the other day that I haven't been able to stop thinking about.

I dropped a link to my blog into Claude and asked it to read a few of my recent posts. It came back and told me: can't fetch it. The page returned an empty shell — undefined | loading. My blog runs on NotionNext, the content renders client-side with JavaScript, and AI crawlers don't execute JS. All it got was the skeleton that exists before the page comes to life.

I stared at that spinner for a few seconds, and something clicked: in the AI era, a site built only for human eyes is worth only half of what it could be.

The other half belongs to machine readers. And the door to those readers was already built back in 1999. It's called RSS.

If you've been around the internet long enough, you just felt a little nostalgia twinge. Stay with me — this turned out to be one of the highest-leverage things I've done for my writing in years.

What RSS actually is

One sentence: RSS is a read-only API your blog exposes to the world.

It's a static XML file listing your most recent posts in reverse-chronological order — title, link, publish date, and either a summary or the full text. Any program can grab it with a single HTTP request. No JavaScript, no login, no API key.

If you're technical, picture a public GET /articles?limit=20 endpoint whose response format has barely changed in over two decades. The format dates from 1999, and today's readers still parse feeds written against it. In web terms, that's a living fossil.
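
To show how little code that takes, here is a sketch that parses an RSS 2.0 document with Python's standard library. The sample feed and the example.com URLs are made up for illustration. In practice you would fetch the XML from your own feed URL first.

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jun 2026 08:00:00 +0000</pubDate>
    </item>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jun 2026 08:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_rss(xml_text: str) -> list[dict]:
    """Return title, link, and publish date for each item in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]
```

Atom feeds use namespaced entry elements instead of item, so they need a slightly different lookup.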

It solves exactly one problem: readers no longer have to keep reopening your site to check for updates. Someone adds your feed to their reader, the reader polls it on a schedule, new posts get pushed to them. The subscription lives entirely in their hands — no algorithm, no rate limit, no platform taking a cut.

(Sound familiar? It's basically what you're doing by reading this email. A newsletter is RSS with a friendlier face.)

Why we forgot about it

The platforms won.

When Google Reader shut down in 2013, control over information flow shifted from subscription to recommendation. Twitter/X, TikTok, Instagram — algorithms decide what you see and feed your attention on a drip. Subscription is too "dumb" for that business model: it won't guess what you like, won't manufacture anxiety, won't keep you scrolling.

So RSS retreated to the corner, kept alive by a small group: programmers, content creators, deep readers.

But here's the twist — that small group is exactly the audience an independent writer most wants. People still using an RSS reader actively curate their own sources. They don't scroll a feed; they choose their springs. Get into their list and you've earned a long-term seat at the table: they read everything you publish, not the one piece an algorithm happened to surface.

Why AI is bringing it back

Two shifts changed my mind.

One: machines are your blog's new readers. People now ask ChatGPT and Claude to summarize your work, point assistants at your site to track updates, and let agents pull your content into research. Most of those crawlers don't run JavaScript — so a client-rendered blog is a blank page to them. RSS is pure server-side XML; any AI can parse it in one line. When I sent Claude my RSS link instead, it instantly read every recent post. Same content — the HTML page is a welded-shut door, the feed is an open window.

Two: AI fixes RSS's old fatal flaw. Subscription used to die under its own weight — a hundred feeds, hundreds of daily updates, no human can keep up. An LLM dissolves that. More people now let AI sweep every source once a day and produce a linked digest, surfacing only the few pieces worth reading closely. You pick the sources, AI does the skimming, you keep the deep reading.

In the algorithm era, a platform uses AI to feed you. In the RSS + LLM era, you use AI to feed yourself. The controls have flipped.

Do it in ten minutes

Confirm you have a feed. Most frameworks ship one for free or through a standard plugin. Try yourdomain.com/rss/feed.xml (NotionNext), /atom.xml (Hexo), /index.xml (Hugo), yourdomain.com/feed (WordPress), or yourdomain.com/rss (Ghost). See XML? It works.
Make it visible. Put an RSS link (with the orange icon) in your footer or About page, and confirm your HTML <head> has the auto-discovery line:

   <link rel="alternate" type="application/rss+xml" title="RSS" href="https://yourdomain.com/rss/feed.xml" />

Use it yourself. Install Feedly, Reeder, or Folo. Subscribe to five writers you admire plus your own blog. Live with it a week and feel the difference between information finding you and you chasing it.
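
If you want to check the auto-discovery line from step two without reading page source by hand, a few lines of Python will do it. This is a sketch that uses only the standard library, and the class and function names are mine.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values from <link rel="alternate"> feed declarations."""

    def __init__(self) -> None:
        super().__init__()
        self.feeds: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        is_feed = attributes.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attributes.get("href"):
            self.feeds.append(attributes["href"])


def discover_feeds(html: str) -> list[str]:
    """Return every feed URL a page advertises in its markup."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds
```

An empty list on a page you fetched without JavaScript means feed readers and crawlers cannot find your feed either.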

Want to go further? Use n8n or GitHub Actions to pull your feeds on a schedule, send the updates to an LLM API for a daily digest, and push it to your inbox or Telegram. An evening's work — probably the highest-ROI personal infrastructure you'll ever build.

The honest limits

It won't reach a mass audience. Most people don't know what RSS is. Bulk traffic still comes from social and search. RSS serves the small high-value slice — and the machines.
Almost no engagement data. No open rates, no idea who's reading. For dashboard people, it feels like writing in the dark.
Full text vs. summary is a real tradeoff. Full text is kind to readers but invites scrapers; summaries drive clicks but degrade the experience. My take: ship full text. An independent writer's enemy was never being reposted — it's not being read at all.

One last thing

After years of building software, I keep coming back to one conviction: the good protocols outlive the platforms. Email is older than every social app and won't die. HTTP has watched products rise, throw their banquet, and collapse. RSS has been pronounced dead more times than anyone can count — and in the AI era, it found its second spring.

Platforms change. Algorithms change. Whichever channel is hot this quarter will change. But the need for an open, machine-readable outlet anyone can subscribe to does not.

Spend ten minutes today: find your feed, surface it, subscribe to it. Then hand the link to your AI assistant and watch it read back every post you've ever written.

That's the moment you realize your blog just gained a whole new audience that's always online.

Until next time,
Joey

If a friend would find this useful, forward it along. And if someone shared this with you — you can subscribe below to get the next one straight to your inbox.

Your AI Agent Is Underperforming Because of Your Harness, Not the Model

mufeng — Thu, 11 Jun 2026 05:08:36 +0000

The pattern is familiar: your AI agent produces garbage output, so you switch to a better model. Things improve for a few days, then the same problems resurface. You upgrade again.

Here's what you're probably missing: the model is just one input. The rest is harness — and that's almost always where the real problem lives.

What Is a Harness?

The cleanest definition comes from engineer Vtrivedy, who coined the term:

Agent = Model + Harness. If you're not the model, you're the harness.

A harness encompasses everything except the model itself:

System prompts, CLAUDE.md / AGENTS.md files, Skill definitions
Tool descriptions, MCP servers, and their technical specifications
Execution environment: filesystem, sandboxes, headless browsers
Subagent orchestration: spawning logic, task handoffs, routing
Hooks: deterministic enforcement layers (linting, formatting, permission checks)
Observability: cost monitoring, latency tracking, logs

This entire surface area is yours to design, not the model provider's.

Claude Code, Cursor, Codex, and Cline might run on identical underlying models, but the behavior you experience is dominated by the harness each one provides. Put the same model in two setups and you can see completely different behavior.

This leads to a counterintuitive but well-supported finding:

A decent model with a great harness consistently outperforms a great model with a bad harness.

Why Engineers Default to Model-Blaming

When an agent does something nonsensical, blaming the model is the path of least resistance. It's the most visible component, and failures often look like reasoning problems.

But most failures are legible if you look closely:

Agent ignored a coding convention → Add it to AGENTS.md
Agent ran a destructive command → Write a Hook to block it
Agent got lost in a 40-step task → Split into Planner and Executor subagents
Agent consistently ships broken types → Wire a type-checker signal into the loop
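
The second fix is the easiest one to show in code. Below is a Python sketch of the decision logic for a hook that refuses destructive shell commands. It assumes the hook receives a JSON payload with tool_name and tool_input.command, and that the hook script blocks the call by exiting with code 2 and writing the reason to stderr. That is how I understand Claude Code's PreToolUse hooks, so verify it against the hooks docs. The patterns are deliberately coarse examples.

```python
import re
from typing import Optional

# Illustrative deny list; a real one should grow out of your own incidents.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+"),    # rm with both -r and -f
    re.compile(r"\bgit\s+reset\s+--hard\b"),
    re.compile(r"\bgit\s+push\b.*--force\b"),
]


def blocked_reason(hook_input: dict) -> Optional[str]:
    """Return a refusal message for destructive Bash commands, else None."""
    if hook_input.get("tool_name") != "Bash":
        return None
    command = hook_input.get("tool_input", {}).get("command", "")
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.search(command):
            return f"Blocked by hook: command matches {pattern.pattern!r}"
    return None
```

A wrapper script would call blocked_reason(json.load(sys.stdin)), print any reason to stderr, and exit with 2 to block or 0 to allow.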

As HumanLayer frames it: "It's not a model problem. It's a configuration problem."

Consider the performance benchmarks: a leading model running inside an off-the-shelf framework often scores dramatically lower than the exact same model running in a custom, highly-tuned harness. The model's capabilities didn't change — the harness is what unlocks them.

The Ratchet: Every Failure Becomes a Rule

The most important habit in harness engineering is treating agent failures as permanent signals, not one-off flukes to retry and forget.

Think of a mechanical ratchet: it only moves forward, never backward.

When an agent makes a mistake, you don't retry and hope for better luck. You engineer a permanent fix so the same exact failure cannot happen again.

Example: An agent submits a PR with commented-out tests. It gets merged into main.

Wrong response: Fix it manually. Move on.

Harness response:

Add to AGENTS.md: "Never comment out tests. Delete or fix them."
Add a pre-commit Hook that automatically flags commented-out tests and .skip( calls in any diff.
Update the Reviewer subagent's instructions: commented-out tests are a blocking issue.

Three layers. The same failure now has to slip past all three before it can reach main again.

Constraints should be added when you observe a real failure, and removed when a more capable model makes them redundant. Every line in a good system prompt should trace back to a specific, historical failure. A harness that grows without bound is just as broken as one that never grows.

CLAUDE.md Is a Failure Log, Not Documentation

This is the mistake I see most often. Engineers treat CLAUDE.md like a README written for an AI: project overview, tech stack, coding conventions. Useful — but incomplete.

Mature harnesses treat CLAUDE.md differently: every rule should trace back to a specific, real incident. If you can't remember the failure that generated a rule, it's probably noise that dilutes the signal of the rules that actually matter.

Examples of rules with provenance:

"Never use any type without explicit authorization" → From a production bug after TypeScript checks were bypassed.
"Run the full test suite before committing, even for one-line changes" → From a regression where a small fix touched adjacent logic without running tests.
"Back up configuration files before modifying" → From an agent that overwrote a production config.

Rules derived from real incidents carry weight in the agent's reasoning. Rules written speculatively get treated as suggestions — not because the model is bad, but because they lack the contextual authority that real constraints carry.

Context Engineering: The Harness Layer People Miss

There's a component of harness design that gets less attention than it deserves: context management.

Antonio Gullí, Engineering Director at Google, defines Context Engineering in Agentic Design Patterns:

Not information dumping. Carefully selecting, trimming, and packaging context. To get AI to peak accuracy, you must give it short, focused, powerful context.

This distinguishes Context Engineering from the more common Prompt Engineering. Prompt Engineering asks: How should I phrase this request? Context Engineering asks: What should already be in front of the agent before it even sees the request?

The discipline applies to every part of the harness:

Tool descriptions: Concise and precise, not comprehensive
Skill files: Exact schemas and templates the agent needs, not everything
System prompts: Specific constraints from real failures, not generic guidelines

An agent drowning in context doesn't perform better; it performs worse. Every line in your CLAUDE.md or system prompt is doing Context Engineering. Noise in the context becomes noise in the agent's reasoning.

Two-Tier Configuration: Team Brain + Personal Brain

Claude Code's configuration architecture is worth understanding as a design pattern applicable to any agent harness.

Project .claude/ — lives in the repo, committed to Git
Team-shared rules, hooks, security policies, workflow definitions. Every engineer who clones the repo inherits the full agent behavior constraints automatically. This is an engineering asset, maintained alongside code.

Global ~/.claude/ — personal directory, stays out of Git
Personal coding style preferences, cross-project shortcuts, individual tool configurations.

The separation enforces the right ownership boundaries: team standards are reliable and shared, personal preferences are free and local. New team members inherit your agent setup the moment they clone the repository.

What Changes When You See It This Way

Once you internalize Agent = Model + Harness, the questions you ask about AI tools shift.

Before:

Which model has better code generation?
What's the context window size?
What's the price per token?

After:

How mature is this harness?
What does the failure recovery path look like?
How are harness rules maintained over time?
What's the observability story?

The model is table stakes at this point. The harness is the differentiator.

Anthropic's engineering team published this framing directly:

The gap between what today's models can theoretically do and what you actually see them doing is largely a harness gap.

The ceiling isn't the model. The floor you're operating at is almost entirely determined by your harness.

Start Here

Open your CLAUDE.md, or create one if it doesn't exist.

Think about the last thing your agent got wrong. Not a model failure — a behavioral failure. Something it did that violated an expectation.

Write one rule. Note where the failure came from. One sentence is enough.

That's the first notch on the ratchet. Over months, this file becomes a compressed history of your collaboration — every line representing a mistake that was never repeated.
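
A rule with its origin attached might look like the sketch below. The entry format, the `appendRule` helper, and the sample rule are all assumptions made for illustration; any format works as long as the source of the failure is recorded next to the rule.

```javascript
// Hypothetical helper: append one rule plus its origin to the text of a rules file.
function appendRule(fileText, rule, origin, date) {
  if (fileText.includes(`- ${rule} (`)) return fileText; // already recorded, keep the file stable
  const base = fileText === "" || fileText.endsWith("\n") ? fileText : fileText + "\n";
  return `${base}- ${rule} (from: ${origin}, ${date})\n`;
}

let rulesFile = "# Project rules\n";
rulesFile = appendRule(
  rulesFile,
  "Do not reformat files you were not asked to change.",
  "a small fix turned into a repo-wide reformat",
  "YYYY-MM-DD" // placeholder date
);
console.log(rulesFile);
```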

The harness isn't designed. It's earned.


How to Make AI Coding Agents Actually Follow Engineering Process

mufeng — Sun, 07 Jun 2026 15:53:26 +0000

The problem isn't that AI coding agents write bad code.

The problem is that they skip steps.

Ask an agent to fix a bug—it reads a few files, guesses a cause, patches the code. Ask it to add a feature—it starts writing before anyone's agreed on what the feature actually does. Ask it to refactor—it touches unrelated files, reformats half the codebase, and hands you a diff too large to review.

None of this is stupidity. It's the absence of process discipline.

Software development has always required workflow constraints: clarify before implementing, plan before coding, test before shipping, debug root causes not symptoms, verify before declaring done. The question is whether your AI agent follows them—or bypasses them entirely.

Superpowers is a plugin framework for Claude Code and Codex that encodes those constraints as loadable, composable agent workflows. This article covers what it is, when to use it, and how to get started.

What "Skills" Actually Are

The word "skill" is overloaded in AI contexts. Here it means something specific: a workflow protocol that loads into an agent session and constrains how the agent approaches a category of task.

Not "be more careful." Not a style guide. A specific sequence of steps with defined inputs, outputs, and verification gates.

The analogy is a checklist for a surgeon or a pilot—not because either lacks expertise, but because cognitive discipline under pressure requires procedural anchors.

The core Superpowers Skills cover the major failure modes in AI-assisted development:

Skill	Failure Mode It Prevents	What It Produces
`brainstorming`	Implementing the wrong thing	Clarified scope with edge cases surfaced
`writing-plans`	Drifting mid-implementation	Executable task list: file scope + verification per step
`test-driven-development`	"Works on my machine" guesswork	RED-GREEN-REFACTOR cycles that lock behavior first
`systematic-debugging`	Shotgun-patching symptoms	Root cause hypotheses, evidence-based elimination, minimal fix
`verification-before-completion`	"Should be done" claims	Actual test runs, browser paths, or device checks
`requesting-code-review`	Merging unreviewed code	Severity-ranked risk list before merge
`using-git-worktrees`	Task bleed across workstreams	Isolated workspaces with clean baseline

These aren't independent tips—they chain into a complete development pipeline:

Vague requirement
  → brainstorming  (scope + edge cases)
  → writing-plans  (executable task list)
  → test-driven-development  (behavior locked by tests)
  → requesting-code-review  (risks surfaced)
  → verification-before-completion  (actually verified)
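
The chain above can be read as a sequence of stages, each with a gate that must pass before the next stage starts. The following is a minimal sketch of that idea, not how Superpowers is implemented: `runPipeline`, the stage objects, and the gate functions are hypothetical.

```javascript
// Hypothetical sketch: run workflow stages in order and stop at the first failed gate.
function runPipeline(stages, input) {
  const log = [];
  let artifact = input;
  for (const stage of stages) {
    artifact = stage.run(artifact);
    const passed = stage.gate(artifact);
    log.push({ stage: stage.name, passed });
    if (!passed) return { completed: false, stoppedAt: stage.name, log };
  }
  return { completed: true, stoppedAt: null, log };
}

const stages = [
  {
    name: "brainstorming",
    run: (req) => ({ ...req, questions: ["Export format: PDF, CSV, or Excel?"] }),
    gate: (a) => a.questions.length > 0,
  },
  {
    name: "writing-plans",
    run: (a) => ({ ...a, tasks: [] }), // deliberately produces an empty plan
    gate: (a) => a.tasks.length > 0,
  },
  { name: "test-driven-development", run: (a) => a, gate: () => true },
];

const result = runPipeline(stages, { request: "Add a billing export feature" });
console.log(result.stoppedAt, result.log);
```

Because the plan stage yields no tasks, the run halts there and the later stages never execute, which is the behavior the pipeline is meant to enforce.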

The Key Insight: Process Errors vs. Code Errors

AI agents will get better at writing correct code over time. They won't automatically get better at following process—unless process is encoded somewhere.

The failures Superpowers Skills prevents aren't syntax errors or logic bugs. They're:

Building the wrong feature because nobody asked the right clarifying questions
Writing code that "looks complete" but has zero coverage on the edge cases that matter
Patching a symptom while the root cause persists
Refactoring that expands scope until the diff is unmergeable
Shipping because the agent said "done" without running anything

A more capable model doesn't fix these. A faster agent arguably makes them worse—more code written in the wrong direction before anyone catches it.

A Worked Example: Adding Invoice Export

Imagine you tell an agent: "Add a billing export feature."

Without workflow constraints, it will probably find the billing service, write an endpoint, add a download button, and report completion. Whether that implementation handles empty data, unauthorized requests, large datasets, or export format edge cases depends entirely on whether the model guessed right.

With Superpowers Skills, the flow looks like this:

Step 1: `brainstorming`

Before touching any files, the agent surfaces questions:

Export format: PDF, CSV, or Excel?
Date range limits?
Permission checks required?
Sync download or async background job?
What does the user see on failure?

This isn't bureaucracy. This is the list of decisions that will otherwise get made silently—by the model, in the wrong direction.

Step 2: `writing-plans`

A compliant plan doesn't say "implement invoice export." It says:

1. Add exportInvoiceCsv(userId, range) to billing service.
   Verify: unit tests covering empty data, normal data, unauthorized access.

2. Wire export endpoint in API routes.
   Verify: 403 on missing permissions, valid text/csv response on success.

3. Add download button to billing page.
   Verify: file downloads on click, loading and error states render correctly.

Every task has a file scope and a verification gate. That's what makes it executable instead of aspirational.
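
That requirement is mechanical enough to check. The sketch below assumes a plan is represented as a list of task objects with `files` and `verify` fields; the representation, the `findIncompleteTasks` name, and the file paths are hypothetical.

```javascript
// Hypothetical check: a task is executable only if it names a file scope and a verification step.
function findIncompleteTasks(plan) {
  return plan
    .filter(
      (task) =>
        !Array.isArray(task.files) ||
        task.files.length === 0 ||
        typeof task.verify !== "string" ||
        task.verify.trim() === ""
    )
    .map((task) => task.title);
}

const plan = [
  {
    title: "Add exportInvoiceCsv to billing service",
    files: ["billing/service"], // illustrative path
    verify: "unit tests covering empty data, normal data, unauthorized access",
  },
  {
    title: "Wire export endpoint in API routes",
    files: ["api/routes"], // illustrative path
    verify: "403 on missing permissions, valid text/csv response on success",
  },
  { title: "Implement invoice export", files: [], verify: "" },
];

console.log(findIncompleteTasks(plan)); // flags only the aspirational task
```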

Step 3: `test-driven-development`

Tests first. Not as documentation—as behavior contracts:

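// Note: this test passes invoices in directly, so it covers CSV formatting only.
// The plan's exportInvoiceCsv(userId, range) would presumably load the invoices
// for that user and date range first, then format them the same way.
describe("exportInvoiceCsv", () => {
  it("exports invoices as csv rows", () => {
    const csv = exportInvoiceCsv([
      { id: "inv_001", amount: 1999, currency: "USD" },
      { id: "inv_002", amount: 2999, currency: "USD" },
    ]);

    // Header row first, then one row per invoice.
    expect(csv).toContain("id,amount,currency");
    expect(csv).toContain("inv_001,1999,USD");
    expect(csv).toContain("inv_002,2999,USD");
  });
});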

Write the failing test. Confirm it fails. Implement the minimum to pass. Confirm it passes. Then refactor. The order matters.
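
For the GREEN step, an implementation along these lines would satisfy the test above. It is a minimal sketch, not the article's actual code: the column order comes from the test's expectations, while the quoting rule for commas and quotes is an added assumption.

```javascript
// Minimal sketch of an implementation that would pass the test above.
// Quoting of special characters is an assumption; the test does not require it.
function exportInvoiceCsv(invoices) {
  const columns = ["id", "amount", "currency"];
  const escapeField = (value) => {
    const text = String(value);
    // Quote fields that contain commas, double quotes, or line breaks.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const rows = invoices.map((invoice) =>
    columns.map((column) => escapeField(invoice[column])).join(",")
  );
  return [columns.join(","), ...rows].join("\n");
}

const csv = exportInvoiceCsv([
  { id: "inv_001", amount: 1999, currency: "USD" },
  { id: "inv_002", amount: 2999, currency: "USD" },
]);
console.log(csv);
```

With an empty list the function returns only the header row, which is one possible answer to the "empty data" case named in the plan.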

Step 4: `requesting-code-review`

Before merge, the review targets:

Does this match the agreed plan?
Any authorization gaps?
Large dataset edge cases?
Unhandled error states?
Files changed outside the agreed scope?
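
The last question in that list is easy to automate. Here is a small sketch, assuming the agreed scope is expressed as path prefixes and the changed files are already known (for example from a diff); `filesOutsideScope` and the paths are hypothetical.

```javascript
// Hypothetical check: list changed files that fall outside the agreed path prefixes.
function filesOutsideScope(changedFiles, allowedPrefixes) {
  return changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix))
  );
}

const outOfScope = filesOutsideScope(
  ["src/billing/export.js", "src/api/routes.js", "src/auth/session.js"],
  ["src/billing/", "src/api/"]
);
console.log(outOfScope); // only the auth file was not part of the plan
```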

Step 5: `verification-before-completion`

Depending on project type:

Project Type	Verification Method
Web app	Start dev server, walk the critical path in browser
Backend service	Run tests, type check, hit the endpoint
CLI tool	Run the command, check actual output
iOS app	Test on real device (especially IAP, StoreKit, permissions)
SDK / Library	Unit tests + integration tests + example project

The principle: evidence over claims. "I think it's done" is not verification.
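
"Evidence over claims" can be expressed as a rule that a task counts as done only when every required check has a recorded, passing result. The sketch below assumes evidence is kept as a simple object keyed by check name; `canDeclareDone` and the sample outputs are illustrative, not real test runs.

```javascript
// Hypothetical sketch: accept a completion claim only if every required check has evidence.
function canDeclareDone(requiredChecks, evidence) {
  const missing = requiredChecks.filter((check) => {
    const record = evidence[check];
    return !record || record.passed !== true || !record.output;
  });
  return { done: missing.length === 0, missing };
}

const verdict = canDeclareDone(
  ["run tests", "type check", "hit the endpoint"],
  {
    "run tests": { passed: true, output: "all tests passed" }, // sample output, not a real run
    "type check": { passed: true, output: "no errors" },
  }
);
console.log(verdict); // the endpoint was never exercised, so the task is not done
```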

How to Install

Claude Code

/plugin install superpowers@claude-plugins-official

Or via the Superpowers marketplace:

/plugin marketplace add obra/superpowers-marketplace
/plugin install superpowers@superpowers-marketplace

Codex CLI

/plugins

Search for "superpowers", then select Install Plugin.

Codex App

Sidebar → Plugins → Coding category → Superpowers → +

When to Use vs. Skip

Not every task needs a full workflow. A typo fix doesn't need a plan. A one-liner doesn't need TDD.

The right mental model is risk-proportional discipline:

Task	Recommended Approach
Typo fix, config lookup	Direct action—just verify the output
Single-file small change	Optional workflow; at minimum verify
Bug with unclear root cause	`systematic-debugging` required
New feature	`brainstorming` + `writing-plans` + TDD
Cross-module refactor	Plan + verification strongly recommended
Pre-merge / pre-deploy	`requesting-code-review` + `verification-before-completion`

Skills should add friction proportional to the blast radius of getting it wrong.
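
The table can be encoded as a lookup so the choice of workflow is not re-decided in every session. This is a sketch of one possible encoding: the task-type keys, the `skillsFor` helper, and the decision to treat unknown task types as new features are assumptions.

```javascript
// Hypothetical encoding of the table above: task type -> skills to load.
const WORKFLOWS = new Map([
  ["typo-fix", []], // direct action, just verify the output
  ["small-change", ["verification-before-completion"]],
  ["unclear-bug", ["systematic-debugging"]],
  ["new-feature", ["brainstorming", "writing-plans", "test-driven-development"]],
  ["cross-module-refactor", ["writing-plans", "verification-before-completion"]],
  ["pre-merge", ["requesting-code-review", "verification-before-completion"]],
]);

function skillsFor(taskType) {
  // Assumption: unknown task types get the new-feature path rather than no process at all.
  return WORKFLOWS.get(taskType) ?? WORKFLOWS.get("new-feature");
}

console.log(skillsFor("unclear-bug"));
console.log(skillsFor("something-unclassified"));
```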

Three Skills to Start With

If you're integrating Superpowers into an existing project, don't try to use everything at once. Start with three:

1. `systematic-debugging`

Tell the agent:

"Use systematic-debugging. Do not modify any code yet. List your root cause hypotheses first, then we'll validate them one by one."

This stops the shotgun-patch reflex before it starts.

2. `writing-plans`

Before any non-trivial feature or change:

"Use writing-plans. Produce an executable plan first. I'll confirm before you implement anything."

This surfaces scope creep before it happens, not after you're reviewing a 500-line diff.

3. `verification-before-completion`

Add this to your project's CLAUDE.md or AGENTS.md:

"Before declaring any task complete, use verification-before-completion. Run tests, verify in browser or device, report exactly what you checked and what the result was."

This closes the gap between "I think it works" and "I confirmed it works."

The Broader Pattern: Startup Superpowers

Startup Superpowers, a companion project for startup validation, illustrates why this pattern generalizes beyond coding.

It codifies hypothesis tracking, competitor research, customer interviews, and MVP scoping as loadable agent protocols. Available slash commands:

Command	Purpose
`/whats-next`	Assess current stage, recommend next action
`/competitors`	Map direct and indirect competitors
`/market-research`	Research customers, pricing, and trends
`/hypotheses`	Write testable hypotheses with evidence tracking
`/interviews`	Design scripts and analyze transcripts
`/surveys`	Design surveys and manage responses
`/mvp`	Design the minimum testable product

Everything is stored as Markdown in a startup/ directory—version-controllable, agent-readable, no SaaS dependency.

That's the actual pattern: take a repeatable professional workflow, encode it as agent steps with defined inputs and outputs, make it loadable in any session, and store all state in files the agent can read and write. The AI doesn't get smarter. The process gets stable.

Summary

Superpowers Skills solves a specific problem: AI coding agents that know how to write code but don't know how to do software development.

The six questions it forces an agent to answer before declaring a task complete:

Did you clarify the requirements before implementing?
Did you make a verifiable plan before writing code?
Did you write tests before the implementation?
Did you find the root cause before patching?
Did you get a review before merging?
Did you actually verify—not just assume—that it works?

Without workflow constraints, developers have to ask these questions themselves, every session, every task. With Superpowers, the constraints are stable, loadable, and consistent across sessions, developers, and projects.

If you're using AI coding agents in real projects today, start with three skills: systematic-debugging, writing-plans, and verification-before-completion. They won't make development magical. They'll make your agent behave like a collaborator with engineering discipline instead of one without it.

Superpowers: github.com/obra/superpowers
Startup Superpowers: github.com/SergeiGorbatiuk/startup-superpowers
