Collin Wilkins

Posted on • Originally published at collinwilkins.com

Is the DOE Framework Still Relevant in the Age of Claude Skills?

Claude Opus 4.7 shipped this week, and a release like that makes anyone running their own stack ask whether the shape still fits. Skills exist now, Managed Agents exist, Cowork ships features every few weeks. Why would anyone still roll their own directive layer?

Short answer: yes, still relevant. And the 4.7 rollout is a clean example of why.

MarginLab's Claude Code tracker showed 4.6 degrading through the back half of its run, enough that you could feel it in daily work. The status page logged its own bumps over the same stretch. That's the real risk when you're tied to one model: service degradation, uptime dips, mid-cycle quality drift. A stronger harness is what protects you when the model wobbles. Audited skills, reviewed outputs, guardrails around the failure modes you've already watched happen once. DOE is the shape of that harness.

I upgraded my sessions to 4.7 the morning it dropped and re-ran a few existing workflows. The output moved, and not always for the better. The harness is what catches what the model itself can't promise — reliable behavior across releases, not just within one.

A quick DOE recap

If you didn't read the original, the shape is three layers with a clean separation between them:

  • Directive. Plain-language instructions in markdown. Goals, inputs, expected outputs, edge cases. What should happen.
  • Orchestration. The control plane. Reads directives, routes work, manages errors and state. When and how.
  • Execution. Deterministic code doing the actual work. API calls, file ops, data processing. The part you don't trust the model to improvise.

Each layer gets simpler the more explicit the boundary between them is. That's still the point.
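The split is small enough to sketch in Python. All names here are hypothetical, and the directive is inlined as a string where a real D layer would be a markdown file on disk:

```python
# Directive layer: plain-language key/value instructions (the "what").
DIRECTIVE = """\
task: count-words
goal: count the words in the payload
output: an integer
"""

def parse_directive(text: str) -> dict:
    """Read directive lines into fields; no model involved yet."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields

def count_words(payload: str) -> int:
    """Execution layer: deterministic code, nothing improvised."""
    return len(payload.split())

# Orchestration layer: reads the directive, routes, fails loudly.
EXECUTORS = {"count-words": count_words}

def run(directive_text: str, payload: str):
    task = parse_directive(directive_text).get("task")
    if task not in EXECUTORS:
        raise ValueError(f"no executor registered for {task!r}")
    return EXECUTORS[task](payload)
```

The point isn't the toy task; it's that each function lives in exactly one layer, so any of the three can be swapped without touching the other two.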

Full original appendix lives in an earlier post on enterprise AI practices.

What Skills replaced

Skills are nearly a 1:1 replacement for the Directive layer. They load on demand based on frontmatter descriptions, scope tool access per skill, and compose cleanly. You don't have to write a separate directive router, because Claude's harness does the routing for you through the description field. Invocation is mostly automatic — Claude picks the right skill based on the description, but you can also trigger one directly with a /skill command if you know the name.

Here's the concrete before/after. I used to run a slash-command directive called /review-script in my old DOE sandbox:

```markdown
---
description: "Review an execution script with fresh eyes for quality, efficiency, and reliability"
argument-hint: [script-path]
model: claude-opus-4-5-20251101
---

## Your Role
You are a code reviewer sub-agent. Your job is to review the execution script at `$ARGUMENTS` with completely fresh eyes...

## Automated Analysis
!`python3 execution/review_script.py --script $ARGUMENTS --format text --output .tmp/review_report.txt 2>&1`

## Review Checklist
### 1. Effectiveness (Does it work?) ...
### 2. Efficiency (Is it fast?) ...
### 3. Reliability (Does it fail gracefully?) ...
### 4. Maintainability ...
### 5. Security ...

## Output Format
**Verdict**: APPROVE | REVISE | REWRITE
**Scores** (1-10): Effectiveness, Efficiency, Reliability, Maintainability
**Critical Issues:** [line] description → fix
**Recommendations:** [HIGH/MEDIUM/LOW] ...
```

Roughly 60 lines of directive. It worked, but it was bolted onto a slash command, hard-coded to a specific model version, and invoked the review script via a shell call embedded in the prompt. The invocation contract was rigid.

The replacement is a Skill:

```yaml
---
name: agent-review
description: >
  Implementer-Reviewer-Resolver loop with fresh-context subagents.
  After any non-trivial implementation, spawns a reviewer with NO prior context
  to audit for correctness, edge cases, security, and simplification. If Critical/High
  issues found, spawns a resolver to fix them. Triggers on "review this implementation,"
  "agent review," "self-review," or "/agent-review".
---
```

Same job, generalized. The description field carries the trigger phrases so Claude auto-routes when I say "review this" without me having to remember a slash command. The model choice moved out of the frontmatter because it's an orchestration decision now, not a directive one — which means you can swap models globally without touching individual skills. The behavior is richer (three-agent loop instead of single reviewer) and the file is shorter. That's the upgrade.

For personal work, the directives/ folder in my old DOE sandbox has been deprecated. Skills do the job better. My agent-review skill replaced a loose "review after implementation" rule I used to stuff into CLAUDE.md. The rule lives in the SKILL.md, the prompts that shape the reviewer live in prompts/reviewer.md, and the whole thing only loads when the trigger fires. That's the directive pattern with better ergonomics.

Skills don't have to be pure directive — they can bundle execution too. The Anthropic docs describe the skill folder structure as:

```text
my-skill/
├── SKILL.md           # Main instructions (required)
├── template.md        # Template for Claude to fill in
├── examples/
│   └── sample.md      # Example output showing expected format
└── scripts/
    └── validate.sh    # Script Claude can execute
```

Since skills can run bundled scripts in any language, you can pack a directive (SKILL.md) and its execution (scripts/) into one folder — the gray area where D and E collapse into a single unit. I lean D-only when the execution is shared across multiple directives and keep scripts external. But when the script only makes sense inside one directive and you want the whole thing to be copy-pastable, bundling is fine.

If your D layer is just markdown SOPs you read every session, Skills are a pure upgrade. There's also a growing community directory at skills.sh worth browsing before you build something that already exists.

What Skills didn't replace

Orchestration still matters for anything non-trivial. A few cases where I still want my own O layer.

Multi-model routing. When I ingest articles into my knowledge base, the context I need is in the article itself — not much logic involved. That goes to Sonnet because Opus would be burning tokens on a task where the intelligence premium doesn't buy me anything. When I'm planning an architecture decision or designing a workflow, that goes to Opus because the failure mode there is cheap mistakes I'll pay for later. Managed Agents is Anthropic-only, which is fine for isolated tasks but can't make per-step model decisions like this. Once you introduce non-Anthropic tools (n8n, Modal, Supabase triggers, a paid API outside the stack), the orchestration decisions aren't Anthropic's to make.
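That routing decision is a few lines once you name it. This is a minimal sketch with illustrative step labels and model tiers, not real API identifiers:

```python
# Hypothetical per-step router: cheap tier for low-logic steps,
# expensive tier where mistakes compound later.
ROUTES = {
    "ingest": "sonnet",   # context lives in the article; no intelligence premium
    "plan":   "opus",     # architecture decisions: cheap mistakes cost later
    "design": "opus",
}

def pick_model(step: str, default: str = "sonnet") -> str:
    """Route a workflow step to a model tier; unknown steps get the cheap default."""
    return ROUTES.get(step, default)
```

Trivial on purpose: the value is that the table is yours, so a non-Anthropic step slots in as just another key.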

Hard role separation between sub-agents. A reviewer shouldn't be able to write files. A documenter shouldn't touch production code. That separation has to be enforced somewhere, and the model isn't where. More on this below.

If you're shipping a cloud-deployed workflow with end-users, DOE still earns its keep. Determinism matters more when the stakes aren't just yours.

Where each surface fits

Here's how I split work across the three surfaces now.

Claude Code (Codex) plus the DOE pattern is still 70 to 80 percent of what I do. It's where the harness lives: memory, skills, sub-agents with scoped tools, the outer loops. Anything that touches a vault, a repo, a scheduled job, or a client deliverable runs here because the state and the tools need to be local.

Chat (Claude web, ChatGPT, Gemini) is 10 to 15 percent. One-off work that doesn't need memory or file access. Drafting a quick template, iterating on a prompt before it becomes a skill, summarizing a long paste I don't want to keep.

Cowork is a small but growing slice. Anthropic's scheduled-execution environment for tasks you don't want your laptop responsible for. Right now I use it for weekly vault housekeeping and a daily trend scanner that reads like an RSS feed for topics I follow. Cowork handles scheduling and state between runs.

Routines just shipped. Good fit for simple scheduled checks (the housekeeping task is a candidate), but the 1-hour minimum cadence and lack of conditional branching rule them out for the scanner, which needs multi-step logic. /loop is the session-scoped cousin — useful for polling while a terminal is open, dies when the session closes.

None of these surfaces replace DOE. They sit at different points in the stack.

Two kinds of self-annealing

The original DOE post used "self-annealing" for one thing when it should have been two.

Runtime self-annealing is handling a failure gracefully inside one run. My scanner gets rate-limited every so often, and the retry path catches it so the workflow doesn't break. If the primary API hits its token ceiling, it falls back to a second. Both were one-time setups I haven't had to revisit. Skills handle this kind of thing adequately because the model's problem-solving and the fallback logic both sit inside the skill's context.
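The retry-then-fallback path looks roughly like this. `primary` and `fallback` are stand-ins for real API clients, and catching `RuntimeError` is an assumption standing in for whatever rate-limit error the client actually raises:

```python
import time

def call_with_fallback(primary, fallback, attempts=3, base_delay=1.0):
    """Retry the primary with exponential backoff, then try the secondary.
    Re-raises the primary's last error if the fallback also fails."""
    last_err = None
    for i in range(attempts):
        try:
            return primary()
        except RuntimeError as err:       # stand-in for a rate-limit error
            last_err = err
            time.sleep(base_delay * 2 ** i)
    try:
        return fallback()
    except RuntimeError:
        raise last_err
```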

System self-annealing is the setup getting smarter across runs. That's a different problem, and neither Skills nor Managed Agents solve it. You need your own outer loop, and that's where the interesting work sits now.

Three mechanisms in my setup that address it.

Persistent agent memory. My sub-agents have a memory: user or memory: project field in their frontmatter. There's a MEMORY.md per agent, first 200 lines auto-injected at startup. After each task the agent writes learnings back — recurring antipatterns, codebase conventions, failure modes it hit and wants to remember. Here's a trimmed excerpt from my content-writer agent's MEMORY.md:

- "Measure X. Measure Y. Measure Z." repetitive sentence openers are an AI tell.
  Vary with "Track," "Look at," "Pull."
- Bold-header advisory sections (5 consecutive **Bold.** paragraphs) need rhythm
  variation. Convert 1-2 to narrative paragraphs to break the wall.
- Composited openings (presenting survey/aggregate data as one person's story)
  erode trust. Either use the survey directly or clearly label the composite.
Enter fullscreen mode Exit fullscreen mode

None of that was in the agent on day one. Every line was a correction I gave it during a specific task, which it wrote back so the next task starts ahead of where the last one ended.
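The inject-and-write-back loop is small. A rough sketch, assuming a per-agent MEMORY.md path; the 200-line constant mirrors the injection limit described above:

```python
from pathlib import Path

MEMORY_LINES = 200  # only the first 200 lines are injected at startup

def load_memory(path: Path) -> str:
    """Return the injectable slice of the agent's memory file."""
    if not path.exists():
        return ""
    return "\n".join(path.read_text().splitlines()[:MEMORY_LINES])

def record_learning(path: Path, learning: str) -> None:
    """Append a correction so the next run starts ahead of this one."""
    with path.open("a") as f:
        f.write(f"- {learning.strip()}\n")
```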

The recent autoMemoryDirectory config in the Claude Code changelog turns this pattern into a first-class knob. What was a convention six months ago is now a supported flag, so you can point any sub-agent at a custom memory path without hand-rolling the plumbing.

Session recaps plus Map-of-Content breadcrumbs. Every session ends with a 1-3 sentence breadcrumb on the relevant topic hub, capped at eight entries. When the cap hits, the oldest rolls into a "Topic Status" block so the next session starts with compressed state instead of raw history. That's how context compounds across sessions instead of resetting each time.
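The rotation logic behind that cap is a few lines; names here are hypothetical:

```python
BREADCRUMB_CAP = 8  # matches the eight-entry cap on each topic hub

def add_breadcrumb(breadcrumbs: list, topic_status: list, entry: str) -> None:
    """Append a session recap; past the cap, the oldest breadcrumb rolls
    into the compressed Topic Status block instead of being dropped."""
    breadcrumbs.append(entry)
    while len(breadcrumbs) > BREADCRUMB_CAP:
        topic_status.append(breadcrumbs.pop(0))
```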

A weekly housekeeping task that runs in Cowork. The job asks it to sync the vault index against the filesystem, refresh each topic hub's metrics and breadcrumbs, flag stale drafts and KB entries past their review date, and scan for keywords recurring in 5+ entries without a dedicated concept page. It's a Sonnet cron job that treats the vault like a codebase and runs a weekly lint pass on it.
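The concept-page check in that lint pass might look like this, assuming the vault index can be reduced to a keyword set per entry (all names hypothetical):

```python
from collections import Counter

def missing_concept_pages(entries: dict, concept_pages: set, threshold: int = 5):
    """Flag keywords recurring in `threshold`+ entries that lack a
    dedicated concept page. `entries` maps entry name -> keywords."""
    counts = Counter(kw for kws in entries.values() for kw in set(kws))
    return sorted(kw for kw, n in counts.items()
                  if n >= threshold and kw not in concept_pages)
```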

The LLM doesn't improve itself. The system around it accumulates state that the next run reads. System self-annealing lives in that loop. And it belongs to you.

Role bleed and scoped tools

There's a failure mode I watched happen too many times without having a good name for it: the orchestrator does work that belongs to a sub-agent. It fixes code instead of calling the reviewer. It writes docs instead of calling the documenter. You set up the sub-agents, you write delegation rules, and the O still reaches for the easier path because the friction of spawning a sub-agent is higher than the friction of just doing the thing.

A delegation protocol helps — invocation contracts, explicit role boundaries in markdown. That's the protocol layer and it matters, but it's a soft boundary. Nothing mechanically prevents the orchestrator from writing to the wrong file.

The mechanical fix is scoped tools. Here's the actual frontmatter on my code-reviewer sub-agent:

```yaml
---
name: code-reviewer
description: "Expert code reviewer for security, performance, and quality audits."
tools: Read, Glob, Grep, Bash
disallowedTools: Write, Edit
model: opus
maxTurns: 15
memory: user
---
```

With disallowedTools: Write, Edit set, this agent cannot modify files. The harness enforces the separation, not the agent's discipline.

Caveat worth knowing: notice memory: user sitting right below disallowedTools. When a sub-agent has persistent memory enabled, Read/Write/Edit come back on automatically. The Claude Code docs put it plainly: "Read, Write, and Edit tools are automatically enabled so the subagent can manage its memory files." The enablement is global — not scoped to the memory path. Those two fields contradict each other silently. If you want hard mechanical scoping, either disable memory on the reviewer, or split the role across two agents — one memory-on for pattern learning, one memory-off with scoped tools for the actual review.

Both matter. Protocol tells the orchestrator when to delegate. Tool scoping guarantees the delegate can't overreach even if the orchestrator gets lazy. Practical rule: every sub-agent you build, ask what the minimum tool set for this role is, and block the rest in frontmatter.
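The memory/disallowedTools contradiction above can also be caught mechanically. A sketch of a tiny frontmatter lint, run against an already-parsed dict (field semantics as described in this post, not a supported Claude Code feature):

```python
def lint_agent(frontmatter: dict) -> list:
    """Warn when persistent memory would silently re-enable Write/Edit
    that disallowedTools is trying to block."""
    disallowed = {t.strip() for t in
                  frontmatter.get("disallowedTools", "").split(",")}
    if frontmatter.get("memory") and disallowed & {"Write", "Edit"}:
        return ["memory re-enables Write/Edit; disallowedTools will not hold"]
    return []
```

Running it over every agent file at session start turns a silent contradiction into a visible warning.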

This was missing from the original DOE post. I'm adding it to my operating rules.

The changelog lifecycle I didn't have

One piece of rigor I'd skipped entirely in the original post: a changelog with explicit item states inside the script itself. Every review finding gets a label (PENDING, SUGGESTED, IMPLEMENTED, DEFERRED, or WONT_FIX), and the orchestrator acts on its own initiative only on the Should Fix tier.

Here's what it looks like in practice — a scraper with seven review passes tracked in the script's own changelog, each finding keyed with an ID:

```markdown
## Review Pass 1 — 2026-02-22

**Reviewer verdict:** PASS WITH SUGGESTIONS

### Should Fix

| ID        | Location                    | Issue                                                                 | Status       |
|-----------|-----------------------------|-----------------------------------------------------------------------|--------------|
| VVNO-I001 | `_resolve_rating()`         | Rating guard silently drops a rating of `0.0`. Should use `is not None`. | IMPLEMENTED |
| VVNO-I002 | `extract_preloaded_state()` | Regex terminator requires a trailing newline. No `JSONDecodeError` guard. | IMPLEMENTED |

### Nice to Have

| ID        | Location                        | Issue                                                                 | Status       |
|-----------|---------------------------------|-----------------------------------------------------------------------|--------------|
| VVNO-S001 | `build_output_from_preloaded()` | `grapes` comprehension uses `g["name"]` — raises `KeyError` if missing. | IMPLEMENTED |
| VVNO-S002 | `main()`                        | Output path is cwd-relative instead of script-relative.               | IMPLEMENTED |
```

Pass 5 flagged three nice-to-haves. Pass 6 implemented them and noted one remaining. Pass 7 confirmed the fix and logged it closed. The orchestrator only touches Should Fix items on its own initiative. Nice to Have sits in the backlog until a human says otherwise.

That distinction prevents the orchestrator from treating every finding as equal priority — which is how backlog items get closed silently, meaning the next reviewer run has no record they ever existed.

Karpathy described the same pattern a couple weeks ago — filing outputs back into a markdown wiki so explorations "always add up." A knowledge surface where every run makes the next one smarter.

Where the interesting work moved

DOE described separation of concerns, not 2025 tooling. Skills ate the Directive layer. Managed Agents handles Orchestration for simple cases inside the Anthropic stack. The Execution layer is still your Python, and it always was.

What changed is where the interesting work lives. It used to be in writing good directives. Now it's in building the outer loop that makes the whole system get smarter across runs. The session-level equivalent is context engineering — same harness thinking, aimed at what loads into a single conversation. Both matter. Both belong to you.

That's where I'd point anyone asking what to build next.


If you're wondering whether your version of this actually holds up or if your team is getting the most out of its AI adoption, that's the kind of assessment I do.
