Shinsuke KAGAWA

Posted on • Originally published at norsica.jp

Stop Putting Everything in AGENTS.md

If you're using Agentic Coding and find yourself explaining the same thing to the LLM over and over, you have a learning externalization problem.

The fix seems obvious: write it down in AGENTS.md (or CLAUDE.md, depending on your tool) and never explain it again.

Note: This article uses "AGENTS.md" as the generic term for root instruction files. Claude Code uses CLAUDE.md, Codex uses AGENTS.md, and other tools have their own conventions. The principles apply regardless of the specific filename.

But here's what actually happens—you keep adding rules, AGENTS.md grows to 200+ lines, and somehow the LLM still ignores half of what you wrote.

This article is about how to actually make your rules stick: where to write them, what to write, and how to verify they work.


The Real Problem

LLMs don't learn across sessions. Every conversation starts fresh. This means:

  1. You explain something once
  2. It works
  3. Next session, you explain it again
  4. And again
  5. Eventually you get frustrated

The solution is to externalize your learnings into rules. But most people do this wrong.


The Common Mistakes

| Mistake | What Happens |
| --- | --- |
| Put everything in AGENTS.md | It bloats, becomes noise, important rules get buried |
| Put everything in code comments | The LLM doesn't load them into context unless you explicitly reference the file |
| Don't write it down at all | You repeat yourself forever |

The thing is, where you write a rule determines whether the LLM actually follows it.


Where to Write Rules

Not all rules belong in the same place. A simple decision tree:

When is this rule needed?
│
├─ Always, on every task → AGENTS.md
│
├─ When working on a specific feature → Design Doc
│
├─ When using a specific technology → Rule file (skill)
│
└─ When performing a specific task type → Task guidelines

Note: "Skills" are modular rule files used in tools like Codex and Claude Code. They allow you to inject context-specific rules only when relevant. If your tool doesn't have this concept, think of them as separate rule files you reference when needed.

"Task guidelines" refers to rules that apply only during specific operations—like code review, migration, or content generation. Some call these "task rules" or "task-specific constraints."

The Full Picture

| Destination | Scope | When Applied | Examples |
| --- | --- | --- | --- |
| AGENTS.md | All tasks | Always | Approval flows, stop conditions, project principles |
| Rule files (skills) | Specific technology area | When using that tech | Type conventions, error handling patterns, function size limits |
| Task guidelines | Specific task type | When doing that task | Subagent usage rules, review procedures |
| Design docs | Specific feature | When developing that feature | Feature requirements, API specs, security constraints |
| Code comments | Specific code location | When modifying that code | Implementation rationale, gotchas |
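
To make the destinations concrete, here's one possible layout. The paths are purely illustrative; each tool has its own conventions for where instruction files and skills live:

```
project/
├── AGENTS.md             # always loaded: approval flows, stop conditions, principles
├── docs/design/
│   └── checkout.md       # design doc: read when working on the checkout feature
├── rules/
│   ├── typescript.md     # skill: type conventions, error handling patterns
│   └── review.md         # task guideline: review procedure, subagent usage
└── src/
    └── payment.ts        # code comments: rationale and gotchas next to the code
```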

The Key Question

Ask yourself: "Is this needed on every task in this project?"

  • Yes → AGENTS.md
  • No → Put it closer to where it's needed

This keeps AGENTS.md lean (around 100 lines) and ensures task-specific rules don't create noise for unrelated work.

You don't need to get this perfect from day one. Start with one thing: keep AGENTS.md small. That alone changes a lot.


What to Write

This is the hard part. Most people write the wrong thing.

The Principle: Write Root Causes, Not Incidents

When something goes wrong, the instinct is to document the specific incident. But this creates bias—the LLM over-fits to that one case.

❌ Bad (specific incident)
"The getUser() function in UserService was missing null check"

✅ Good (root cause / system fix)
"Always null-check return values from external APIs"

The first one only helps if the LLM encounters that exact function again. The second one prevents the entire class of errors.
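
As a sketch of what following the generalized rule looks like in code (the endpoint and types here are hypothetical, not taken from the original incident):

```typescript
type User = { id: string; name: string };

// Rule applied: always null-check return values from external APIs.
async function getUser(id: string): Promise<User | null> {
  const res = await fetch(`https://api.example.com/users/${id}`);
  if (!res.ok) return null; // the external call itself failed

  const data: unknown = await res.json();
  if (data == null) return null; // never trust the external return value

  return data as User;
}
```

Written this way, the rule covers every external call, not just the one function that originally failed.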

Specific Incident vs. Root Cause

| Aspect | Specific Incident | Root Cause |
| --- | --- | --- |
| Applies to | That one location | All similar cases |
| Prevents recurrence | Weakly (the same bug can recur elsewhere) | Strongly (operates as a principle) |
| Bias risk | High (overfitting) | Low (generalizable) |

Finding the Root Cause

When you encounter an issue, ask:

  1. Why did this mistake happen? (direct cause)
  2. Why wasn't it prevented? (system gap)
  3. Where else could this same mistake occur? (scope)

Example:

  • Direct cause: getUser() was missing a null check
  • System gap: We trusted external API return values without validation
  • Scope: All external API calls

Rule to write: "Always null-check return values from external APIs"


How to Verify Rules Work

This is the step most people skip—and it's critical.

The Principle: Fix the System, Then Discard and Retry

When you add or modify a rule in AGENTS.md or a skill file, you need to verify it actually works. The only way to do this:

  1. Add/modify the rule
  2. Discard the current artifact (or stash it in a branch)
  3. Start a new session with the updated rules
  4. Re-run the same task
  5. Verify the issue doesn't recur
Continue with existing artifact after rule change → ❌
Discard and restart with new rules → ✅

Why This Matters

If you keep the existing artifact and just continue, you're still operating in a context polluted by the old system. The new rule might not get properly applied because:

  • The existing artifact carries biases from before the rule existed
  • The LLM might try to "reconcile" the new rule with existing work rather than applying it cleanly
  • You can't tell if the rule actually works or if you just manually fixed the symptom

Verification Checklist

  • [ ] Modified the rule (AGENTS.md / skill file / task guideline)
  • [ ] Discarded current artifact (or moved to a branch)
  • [ ] Started new session with updated rules
  • [ ] Re-ran the same task
  • [ ] Confirmed the issue doesn't recur

For small changes, you can stash instead of discarding. The key is: test the system in isolation.


When to Write Rules

Not every issue deserves a rule. Some guidance:

| Situation | Write a Rule? | Rationale |
| --- | --- | --- |
| You explained the same thing twice | Yes | Prevent the third time |
| Encountered unexpected behavior | Maybe | Find root cause first |
| Task completed successfully | Maybe | Retrospective—any generalizable insights? |
| Found a serious bug | Yes | Prevent recurrence |

Warning Signs You're Over-Documenting

  • AGENTS.md exceeds 100 lines
  • A single rule file exceeds 300 lines (~1,500 tokens)
  • Rules take more than 1 minute to read through
  • You find yourself thinking "is this really needed every time?"
  • Rules contradict each other

If you see these signs, it's time to prune. Rule maintenance includes deletion.


How to Write Rules (Cheat Sheet)

This section is a reference. You don't need to read it all now—come back when you're actually writing a rule. The rest of the article stands on its own.

1. Minimum Viable Length

Context is precious. Same meaning, shorter expression. But don't sacrifice clarity for brevity.

❌ Verbose (42 chars)
If an error occurs, you must always log it

✅ Concise (25 chars)
All errors must be logged

❌ Too short (unclear)
Log errors

2. No Duplication

Same content in multiple places wastes context and creates update drift.

❌ Duplicated
# base.md
Standard error format: { success: false, error: string }

# api.md
Errors use { success: false, error: string } format

✅ Single source
# base.md
Standard error format: { success: false, error: string }
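
The same principle applies to code: keep the shape in one exported type and let documents and implementations reference it. A minimal sketch (file and type names are illustrative):

```typescript
// errors.ts: the single source of truth for the standard error format
export type ErrorResponse = {
  success: false;
  error: string;
};

// Callers reference the type instead of re-declaring the shape.
const notFound: ErrorResponse = { success: false, error: "User not found" };
```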

3. Measurable Criteria

Vague instructions create interpretation variance. Use numbers and specific conditions.

✅ Measurable
- Functions: max 30 lines
- Cyclomatic complexity: max 10
- Test coverage: min 80%

❌ Vague
- Readable code
- Sufficient testing
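
A side benefit of measurable criteria is that tooling can enforce them, so the rule file and CI never drift apart. A sketch assuming ESLint's built-in `max-lines-per-function` and `complexity` rules, with the thresholds from the example above:

```typescript
// eslint.config.js (flat config; the syntax is valid TypeScript as well)
export default [
  {
    rules: {
      "max-lines-per-function": ["error", { max: 30 }],
      complexity: ["error", { max: 10 }],
    },
  },
];
```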

4. Recommendations Over Prohibitions

Banning things without alternatives leaves the LLM guessing. Show the right way.

✅ Recommendation + rationale
## State Management
Recommended: Zustand or Context API
Reason: Global variables make testing difficult, state tracking complex
Avoid: window.globalState = { ... }

❌ Prohibition list
- Don't use global variables
- Don't store values on window
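
As a sketch of the recommended direction, here's what the Zustand option might look like (the auth slice is a hypothetical example):

```typescript
import { create } from "zustand";

type AuthState = {
  token: string | null;
  setToken: (token: string | null) => void;
};

// State lives in a testable store instead of window.globalState.
export const useAuthStore = create<AuthState>((set) => ({
  token: null,
  setToken: (token) => set({ token }),
}));
```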

5. Priority Order

LLMs pay more attention to what comes first. Lead with the most important rules.

## Critical (Must Follow)
1. All APIs require JWT authentication
2. Rate limit: 100 requests/minute

## Standard Specs
- Methods: Follow REST principles
- Body: JSON format

## Edge Cases (Only When Applicable)
- File uploads may use multipart

6. Clear Scope Boundaries

State what the rule covers—and what it doesn't.

## Scope

### Applies To
- REST API endpoints
- GraphQL endpoints

### Does Not Apply To
- Static file serving
- Health checks (/health)

The Feedback Loop

This is how it all fits together in practice:

[Working with LLM]
       │
       ├─ Issue occurs
       │      │
       │      ▼
       │  Find root cause (not just symptom)
       │      │
       │      ▼
       │  Decide where to write (AGENTS.md? Skill? Task guideline?)
       │      │
       │      ▼
       │  Write the rule
       │      │
       │      ▼
       │  Discard current work
       │      │
       │      ▼
       │  New session with updated rules
       │      │
       │      ▼
       │  Verify issue doesn't recur
       │
       ▼
[Continue working]

The goal is to reach a state where you never explain the same thing twice. Every explanation either:

  • Gets externalized into a rule, or
  • Was truly a one-off that doesn't need capturing

Passing Feedback Correctly

One more thing: when you give feedback to the LLM, don't just paste error logs. Include your intent.

❌ Just the error
[Stack trace]

✅ Intent + error
Goal: Redirect to dashboard after user authentication
Issue: Following error occurred
[Stack trace]

Without the intent, the LLM optimizes for "make the error go away." With the intent, it optimizes for "achieve the goal while resolving this error."

These are very different things.


Anti-Pattern Summary

Quick reference if you want to check your current practices:

| Anti-Pattern | Reference |
| --- | --- |
| Put everything in AGENTS.md | "Where to Write Rules" |
| Write specific incidents instead of root causes | "What to Write" |
| Continue with old artifacts after changing rules | "How to Verify Rules Work" |
| List only prohibitions without recommendations | "How to Write Rules" #4 |
| Keep explaining instead of writing it down | "When to Write Rules" |

Key Takeaways

  1. AGENTS.md is not a dumping ground. Only rules needed on every task belong there. Everything else goes closer to where it's used.

  2. Write root causes, not incidents. "Null-check external API returns" beats "UserService.getUser() was missing a null check."

  3. Test your rules. After adding a rule, discard current work and re-run. If the issue recurs, the rule isn't working.

  4. Maintenance includes deletion. If AGENTS.md is over 100 lines, you've probably over-documented. Prune ruthlessly.

  5. Explain twice, document once. If you're explaining the same thing for a second time, stop and externalize it.


What's Next

This article covered where to put your rules so they actually stick. In the next article, I'll cover how planning turns execution into verification—and why that's the key to consistent LLM output.

If your AGENTS.md is already bloated—what finally made you realize it was time to stop adding?


The Research

The practices in this article are grounded in LLM research:

SALAM (Wang et al., 2023): LLM self-feedback is often inaccurate. Structured feedback from external agents (or externalized rules) is more effective.

LEMA (An et al., 2023): Learning from mistakes (error → explanation → correction) improves LLM reasoning ability—but this requires explicit externalization of what was learned.

Feedback Loop for IaC (Palavalli et al., 2024): Feedback loop effectiveness decreases exponentially with each iteration and plateaus. This supports the "discard and restart" approach over endless iteration in the same context.

Reflexion (Shinn et al., 2023): Combining short-term memory (recent trajectory) with long-term memory (past experience) enables effective self-improvement. Externalized rules function as that long-term memory.


References

  • Wang, D., et al. (2023). "Learning from Mistakes via Cooperative Study Assistant for Large Language Models." arXiv:2305.13829
  • An, S., et al. (2023). "Learning From Mistakes Makes LLM Better Reasoner." arXiv:2310.20689
  • Palavalli, M. A., et al. (2024). "Using a Feedback Loop for LLM-based Infrastructure as Code Generation." arXiv:2411.19043
  • Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023.
