A comprehensive guide to OpenSite AI's open-source skill that gives AI coding agents enterprise-grade guardrails for migrations, multi-session refactors, and parallelized tasks.
We've all been there.
You ask an AI agent to "migrate my components to TypeScript." An hour later you check back and it has renamed your props, refactored three utility functions, installed two new packages, reorganized your folder structure, and changed your ESLint config "while it was in there."
The task was 40 files. It touched 140.
This is the core problem with AI agents on large, long-running tasks: they don't have a scope reflex. Humans instinctively know when they've drifted. Agents don't — and the bigger the task, the worse the drift gets.
The large-scale-refactor skill from OpenSite AI is a free, open-source agent skill that solves this by giving your AI agent a complete operating protocol for any task touching 50+ files — migrations, framework upgrades, codebase-wide renames, or anything running across multiple sessions or parallel agents.
This guide walks through everything: what the skill is, how every guardrail works, and a complete end-to-end walkthrough of a real JS→TS migration.
What Is an "Agent Skill"?
Before diving in, a quick note on terminology. An agent skill is a structured instruction set — loaded into the AI agent's context — that defines how it should behave for a specific class of task. It's not a library you install in your codebase. It's a behavioral protocol for your agent.
The large-scale-refactor skill works with Claude Code, Codex, Cursor, GitHub Copilot, Qoder Quest, Factory Droid, and Devin Playbooks. On platforms that support automatic activation, the skill kicks in automatically when it detects patterns like "migrate * to" or "refactor * across" or when a task is estimated to touch 50+ files.
Install it once, then never think about it:
git clone https://github.com/opensite-ai/opensite-skills.git
For Claude Code, add it to your .claude/skills/ directory. For Cursor, drop it in your .cursor/skills/ directory. The SKILL.md frontmatter handles the rest.
The Core Insight: Agents Need a Scope Reflex
The skill's design is built around one uncomfortable truth:
AI agents are pattern completers, not task completers. When they see something that could be improved, they improve it — even if that wasn't the job.
This isn't a bug in any particular model. It's the fundamental nature of how these systems work. A language model trained to produce high-quality code will naturally drift toward "better" code whenever it has the opportunity.
The large-scale-refactor skill installs a scope reflex by making the agent ask a single question before every change:
"If I remove this change from the diff, does the task still fail?"
This is the Substitution Test (§ 2.2), and it is the single most important concept in the skill. If the task succeeds without a change, the change doesn't belong in this PR.
The Eight Guardrails — A Deep Dive
The skill is organized into eight sections. Let's go through each one in depth.
§ 1 — The Spec Gate: No Execution Without Approval
This is the foundational rule. Before the agent writes a single line of code, it must produce a written spec and wait for a human to explicitly approve it.
The spec format is precise by design:
TASK SPEC
=========
**Task Name**: js-to-ts-migration
**Date**: 2026-03-27
**Initiator**: Your Name
### What This Task Does
Convert all .js files in src/ to .ts/.tsx, adding TypeScript types.
No logic changes, no style changes, no dependency additions beyond @types/*.
### Explicit Scope Boundary
**IN SCOPE** (agent may touch):
- [x] File types: *.js files in src/components/, src/hooks/, src/utils/
- [x] Operations: Type annotations, file extension changes
- [x] Directories: src/components/, src/hooks/, src/utils/
**OUT OF SCOPE — DO NOT TOUCH**:
- [ ] CSS/styling, color tokens, theme configuration
- [ ] Business logic or algorithm changes
- [ ] package.json (except @types/*)
- [ ] Build configs, CI/CD, deployment
- [ ] node_modules/, config/, scripts/, public/
### Task Decomposition
1. Subtask A — src/components/common/ — 15 files
2. Subtask B — src/components/features/ — 22 files
3. Subtask C — src/hooks/ — 8 files
4. Subtask D — src/utils/ — 12 files
5. Subtask E — test files — 35 files
### Acceptance Criteria
- [ ] All in-scope .js files converted to .ts/.tsx
- [ ] All tests pass (or pre-existing failures documented)
- [ ] No files outside scope boundary modified
- [ ] Each subtask in its own reviewable commit
### Rollback Plan
Feature branch: refactor/js-to-ts-migration
Full rollback: git checkout main && git branch -D refactor/js-to-ts-migration
After generating this, the agent on Claude Code will output:
⏸ SPEC GATE: Please review and reply 'approved' to begin execution,
or provide corrections.
No code gets written until you say the word. The spec is your contract with the agent.
Why this matters: Most agent disasters happen in the first 10 minutes, before the human realizes the task is going sideways. The spec gate forces that realization to happen before any code is touched.
§ 2 — Scope Enforcement: Five Rules the Agent Cannot Break
Once execution begins, five rules govern every file touch and every line written.
2.1 The One Task Rule
The agent has exactly one job. Not one-and-a-half. Not "one plus reasonable improvements." One.
When it notices a bug, a performance issue, or an architectural smell that isn't in the spec, it does not fix it. It logs it to OBSERVATIONS.md and moves on:
## Observations (NOT acted upon — logged for human review)
| File | Observation | Severity |
|------|-------------|----------|
| src/api/user.ts | Possible N+1 query in fetchUsers | Medium |
| src/components/Button.tsx | Inline styles could use CSS vars | Low |
This file becomes a goldmine at the end of the task. You get a curated list of real technical debt, surfaced by an agent that touched every file, with zero action taken on any of it.
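If you're building a harness around the skill, the log-don't-act rule is easy to automate. A minimal sketch — the `log_observation` helper is illustrative, not part of the skill's toolchain:

```python
from pathlib import Path

# Table header matching the OBSERVATIONS.md format shown above
HEADER = ("## Observations (NOT acted upon — logged for human review)\n"
          "| File | Observation | Severity |\n"
          "|------|-------------|----------|\n")

def log_observation(path: Path, file: str, note: str, severity: str) -> None:
    """Append one row to OBSERVATIONS.md, creating the table on first use."""
    if severity not in {"Critical", "High", "Medium", "Low"}:
        raise ValueError(f"unknown severity: {severity}")
    if not path.exists():
        path.write_text(HEADER)
    with path.open("a") as f:
        f.write(f"| {file} | {note} | {severity} |\n")
```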
2.2 The Substitution Test
Before every change: "If I remove this from the diff, does the task still fail?" If the answer is no — don't make the change.
2.3 No Emergent Systems
The agent is explicitly prohibited from creating anything that didn't exist before the task began unless it's a spec-defined output. No new utility libraries. No new abstractions. No folder reorganizations. No new config files.
If a new shared utility is genuinely necessary to avoid massive repetition, the agent proposes it in OBSERVATIONS.md and halts for approval before creating it.
2.4 Dependency Lockdown
No adding, removing, or upgrading dependencies unless explicitly listed in the spec (or @types/* for TypeScript migrations). Any dependency change is an automatic human checkpoint.
2.5 The 50-Line Net-New Code Threshold
If completing a single refactoring step requires writing more than 50 lines of net-new logic, the agent stops. This is a hard signal that something has gone wrong — either the task is being misunderstood, or the spec needs an architectural discussion.
A refactor should transform existing patterns, not invent new ones. 50 lines of net-new logic in a refactor is a red flag.
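The threshold is enforced behaviorally, but it can also be approximated mechanically from `git diff --numstat`. A sketch, assuming the session's work is diffed against a base ref — the helper names here are hypothetical, not part of the skill:

```python
import subprocess

NET_NEW_LIMIT = 50  # the § 2.5 threshold

def parse_numstat(numstat: str) -> int:
    """Sum (added - deleted) lines across `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added == "-":  # binary files report "-" instead of line counts
            continue
        total += int(added) - int(deleted)
    return total

def net_new_lines(base: str = "HEAD") -> int:
    """Net-new lines in the working tree relative to `base`."""
    out = subprocess.run(["git", "diff", "--numstat", base],
                         capture_output=True, text=True, check=True).stdout
    return parse_numstat(out)
```

A harness could call `net_new_lines()` after each step and raise a checkpoint when the result exceeds `NET_NEW_LIMIT`.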
§ 3 — Execution Protocol: Atomic, Budgeted, and Drift-Checked
3.1 Pilot Batch First
Regardless of risk level, the first batch of any new refactor is limited to 10–20 files. This surfaces edge cases in the spec before you've committed to the pattern across 200 files.
3.2 File Diff Budgets by Risk Level
| Risk Level | Max Files per Session | Review Cadence |
|---|---|---|
| Low (type renames, import fixes) | 200 files | End of session |
| Medium (logic-adjacent refactors) | 50 files | Every 25 files |
| High (framework migrations, API changes) | 20 files | Every 10 files |
When a session hits its budget, it commits, pushes, and stops. Human reviews before the next session.
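These budgets can be checked mechanically as well. A minimal sketch — the `BUDGETS` map mirrors the table above, and `session_budget_exceeded` is a hypothetical helper, not part of the skill's toolchain:

```python
import subprocess

# Mirrors the § 3.2 table
BUDGETS = {"low": 200, "medium": 50, "high": 20}

def session_budget_exceeded(changed_files: list[str], risk: str) -> bool:
    """True once a session has touched more files than its risk level allows."""
    return len(changed_files) > BUDGETS[risk]

def changed_files_since(base: str = "main") -> list[str]:
    """Files changed relative to the task's base branch."""
    out = subprocess.run(["git", "diff", "--name-only", base],
                         capture_output=True, text=True, check=True).stdout
    return [f for f in out.splitlines() if f.strip()]
```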
3.3 Parallel Agent Isolation
When running parallel agents (Qoder Quest's Worktree mode, Factory Droid batch, multiple Devin instances), the rules are strict:
- Every instance gets the approved spec as its first system message
- Instances have non-overlapping, explicitly assigned file lists
- No instance creates new shared abstractions
- Instances do not observe each other's output
3.4 Drift Detection Checkpoint (every N files)
At the configured cadence (or every 25 files by default), the agent performs a mandatory self-audit:
DRIFT CHECK
===========
Files touched so far: 47
Task: js-to-ts-migration
1. Does every changed file appear in the IN SCOPE list? YES
Evidence: All files are in src/components/, src/hooks/, or src/utils/
2. Did I add any new files not defined in the spec? NO
3. Did I add, remove, or modify any dependency? NO
4. Did I make any change that fails the Substitution Test? NO
5. Did I create any new abstraction, utility, or system? NO
All answers NO — continuing.
If any answer is "yes," it halts immediately and surfaces the issue before proceeding.
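Questions 1–3 are mechanical, so a harness can double-check the agent's self-audit against `git diff --name-status`. A sketch — the scope prefixes and manifest names are assumptions taken from the example spec, and `drift_violations` is a hypothetical helper:

```python
# Assumed from the example spec; substitute your own IN SCOPE list
IN_SCOPE = ("src/components/", "src/hooks/", "src/utils/")
MANIFESTS = frozenset({"package.json", "package-lock.json", "yarn.lock"})

def drift_violations(name_status_lines: list[str],
                     in_scope: tuple[str, ...] = IN_SCOPE,
                     manifests: frozenset = MANIFESTS) -> list[str]:
    """Check `git diff --name-status` lines against drift questions 1-3."""
    violations = []
    for line in name_status_lines:
        status, path = line.split("\t", 1)
        if not path.startswith(in_scope):
            violations.append(f"out of scope: {path}")       # question 1
        if status == "A":
            violations.append(f"new file: {path}")           # question 2
        if path.rsplit("/", 1)[-1] in manifests:
            violations.append(f"dependency change: {path}")  # question 3
    return violations
```

An empty return means the mechanical checks pass; anything else should trigger the same halt-and-surface behavior.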
§ 4 — Human Checkpoints: Hard Stops That Cannot Be Talked Past
Every checkpoint in the skill is a hard stop. The agent does not reason its way around them or make "reasonable assumptions." It halts and waits.
The checkpoint message format is structured to force clear thinking:
⏸ CHECKPOINT — dependency_required
**Trigger**: TypeScript compilation requires @types/react
**Context**: src/components/common/Button.tsx:3 - Cannot find module 'react'
**Options**:
A. Add @types/react as devDependency (in spec allowance for TS migrations)
B. Skip type checking for this batch (not recommended)
C. Abort task and preserve current state for human review
**Recommendation**: Option A — this is explicitly allowed by the spec.
Awaiting instruction. No changes will be made until a response is received.
Checkpoint triggers include: spec gate, drift check failure, any out-of-scope file, any new dependency, any new abstraction, unexpected test failures, file budget reached, and spec ambiguity.
§ 5 — Output & Verification Requirements
5.1 The Change Manifest
After each subtask, the agent produces a CHANGE_MANIFEST.md:
CHANGE MANIFEST — subtask-a
==============================
Task: js-to-ts-migration
Completed: 2026-03-27T10:30:00Z
Files modified: 15
Files created: 0
Files deleted: 0
### Modified Files
| File | Change Type | Lines +/- | Notes |
|------|-------------|-----------|-------|
| src/components/common/Button.tsx | Type annotation | +12/-3 | Added Props interface |
| src/components/common/Modal.tsx | Type annotation | +18/-2 | Added component types |
### Scope Compliance
- [x] All modified files were in the IN SCOPE list
- [x] No files created outside spec-defined outputs
- [x] No unauthorized dependencies added/removed
- [x] No new abstractions created
- [x] All drift checks passed
### Test Results
- Before: 450 passed, 12 failed
- After: 450 passed, 12 failed
- New failures: none
5.2 The Verification Sequence
Before marking any subtask complete, the agent runs this exact sequence and records the output:
# 1. Dependency check
git diff HEAD -- package.json package-lock.json yarn.lock Cargo.toml Gemfile*
# 2. New file audit
git diff HEAD --name-status | grep "^A"
# 3. Scope boundary check (the star of the show)
python scripts/verify_scope.py --strict
# 4. Test suite
npm test # or your platform equivalent
# 5. Lint
npm run lint
The verify_scope.py script is a key part of the toolchain. It reads your .refactor-scope-allowlist file (generated automatically from the spec) and checks every changed file against it:
=== Scope Verification ===
Allowlist: .refactor-scope-allowlist
Allowed patterns (6):
- src/components/
- src/hooks/
- src/utils/
- *.js
- *.ts
- *.tsx
Changed files (15):
- src/components/common/Button.tsx
- src/components/common/Modal.tsx
[...]
✅ All changed files are within approved scope
✅ No new files created
✅ No dependency manifests modified
✅ All checks passed. Scope is clean.
If it detects a violation, it exits non-zero in --strict mode — which makes it CI-friendly.
§ 6 — Context Persistence: The Session Handoff Protocol
Long-running tasks span days and multiple sessions. Context that fills up and gets flushed is the #1 cause of late-task drift. The skill solves this with two mechanisms.
6.1 The Session Handoff File
At the end of every session, the agent writes .refactor-session.md. At the start of the next session — whether it's the same model, a different model, or a different platform entirely — it reads this file as its first action.
The template in templates/session-handoff.md is comprehensive: it covers completed subtasks with commit SHAs, the in-progress subtask with exact file-level progress, decisions made (with reasoning), edge cases discovered, files requiring special handling, the most recent drift check log, and active blockers.
A fresh agent context with a spec + session handoff file can resume a 5-day migration without re-reading the entire git log.
6.2 Context Flushing Protocol (Anti-Degradation)
After each batch:
- Commit and push all changes
- Write the session handoff file
- Discard from active context: all file diffs, modified file contents, and intermediate reasoning from the completed batch
- Reload into fresh context: the approved spec, the latest `.refactor-session.md`, and the next batch's file list only
The flush signal list is explicit about when to do this mid-batch: "making changes that feel right based on pattern matching rather than spec compliance" is the most important one. Trust your gut on this — when an agent starts referencing decisions from 3 batches ago without consulting the handoff file, it's degrading.
§ 7 — Platform-Specific Notes
The skill includes platform-specific guidance for Qoder Quest, Claude Code/Codex, Factory Droid/Devin Playbooks, and GitHub Copilot. The key points:
- Qoder Quest: Always use "Code with Spec" scenario. Use Remote execution for tasks touching 100+ files. Use Worktree mode for parallel subtask isolation.
- Claude Code: Core guardrails (§§ 1–4) are explicitly not subject to "deepening" sessions. They are non-negotiable.
- Factory Droid/Devin: The spec must be injected as system prompt before delegation. Playbooks that operate on "all files matching pattern X" without per-instance file lists are prohibited.
- Copilot: Scope the workspace to IN SCOPE directories only before starting.
§ 8 — The Quick Reference Table
The skill ends with a decision table that any agent can consult in-context:
| Situation | Action |
|---|---|
| "This file might be in scope, I'm not sure" | Ask. Don't touch. |
| "This would be cleaner if I also refactored X" | Log in OBSERVATIONS.md. Don't touch X. |
| "Tests are failing and I know how to fix it" | Check if the fix is in spec. If not: halt, report. |
| "I found a bug while doing this refactor" | Log in OBSERVATIONS.md. Leave the bug alone. |
| "This approach requires a new shared utility" | Halt. Propose. Wait for approval. |
| "I could make this faster/better/cleaner" | That is not the task. Log. Move on. |
| "The spec is ambiguous about this file" | Surface the ambiguity. Await clarification. |
| "I hit the file budget for this session" | Stop. Commit. Push. Report progress. |
Complete End-to-End Walkthrough: 120-File JS→TS Migration
Let's put it all together with a real scenario. You have a React app with 120 JavaScript files to migrate to TypeScript.
Step 1: Install and Invoke
# Clone the skills repo
git clone https://github.com/opensite-ai/opensite-skills.git
# Copy the skill to your platform's skill directory
# (or configure it per your platform's documentation)
# Create your feature branch
git checkout -b refactor/js-to-ts-migration
# Invoke the skill
@large-scale-refactor js-to-ts-migration # Claude Code / Codex
/large-scale-refactor js-to-ts-migration # Cursor / Copilot
Step 2: Review and Approve the Spec
The agent generates a full spec (see the spec template above). You review it, edit any scope boundaries that need adjusting, and reply approved.
Read the OUT OF SCOPE list carefully. This is where most migrations go wrong. Make sure config/, scripts/, build/, and anything CI-related is explicitly excluded.
Step 3: Generate the Scope Allowlist
python scripts/generate_allowlist.py TASK_SPEC.md --dry-run
# Review the output, then:
python scripts/generate_allowlist.py TASK_SPEC.md
This creates .refactor-scope-allowlist:
# Refactoring Scope Allowlist
# Generated by generate_allowlist.py from refactoring spec
src/components/
src/hooks/
src/utils/
src/api/
*.js
*.ts
*.tsx
tsconfig.json
Commit this file. It's the source of truth for verify_scope.py throughout the migration.
Step 4: The Pilot Batch (10–20 files)
The agent starts with a pilot batch of 10–20 files — even though your file budget for a low-risk task like this is 200. This is non-negotiable by design. The pilot surfaces:
- Files with non-standard patterns (e.g., `forwardRef` wrappers, custom hooks with unusual signatures)
- Whether your TypeScript config is set up correctly
- Whether your test suite handles `.tsx` imports
- Edge cases the spec didn't anticipate
After the pilot batch, you'll have a much clearer picture of whether your transformation approach works.
Step 5: Full Execution with Drift Checks
The agent processes the remaining files in batches. Every 25 files (the skill's default drift-check cadence), it runs a drift check:
DRIFT CHECK
===========
Files touched so far: 25
Task: js-to-ts-migration
1. Does every changed file appear in the IN SCOPE list? YES
2. Did I add any new files not defined in the spec? NO
3. Did I add, remove, or modify any dependency? NO
4. Did I make any change that fails the Substitution Test? NO
5. Did I create any new abstraction, utility, or system? NO
All answers NO — continuing.
After each subtask, verify_scope.py --strict runs automatically.
Step 6: Handle Checkpoints
You'll likely hit at least one checkpoint. Common ones for TS migrations:
- `@types/react` or `@types/node` needed (allowed by spec if you said `@types/*` is OK)
- A file in `src/` that turned out to be generated/build output (out of scope)
- A component that requires a new shared type definition (propose in OBSERVATIONS.md first)
Each checkpoint presents options and a recommendation. You reply, the agent continues.
Step 7: Session Boundaries
After each session, .refactor-session.md is committed:
REFACTOR SESSION HANDOFF
========================
Task: js-to-ts-migration
Last session: 2026-03-27T15:45:00Z
Agent: Claude Sonnet 4.5 / Claude Code
Session number: 2
### Progress
- Completed: subtask-a (15 files), subtask-b (22 files)
- In-progress: subtask-c — 40% complete
- Remaining: subtask-c (cont.), subtask-d, subtask-e
### Files Remaining in subtask-c
src/components/layouts/Header.js
src/components/layouts/Footer.js
[...]
### Decisions made this session
- Decision: Use .tsx for all React components (not just those with JSX)
Reason: Simpler rule, avoids mid-migration ambiguity
Impact: All remaining components should get .tsx extension
### Edge cases discovered
- src/components/common/withAuth.js — HOC pattern requires special generic syntax
### Active Blockers
None.
The next session — same model, different model, or entirely different platform — reads this file first and resumes exactly where things left off.
Step 8: Final Verification
After all subtasks are complete:
# Full scope check
python scripts/verify_scope.py --strict
# TypeScript compilation
tsc --noEmit
# Full test suite
npm test
# Review observations log
cat OBSERVATIONS.md
The OBSERVATIONS.md file at this point is genuinely valuable. It contains every improvement the agent noticed but didn't act on — organized by file, severity, and type. This becomes your technical debt backlog for the next sprint.
The Verification Scripts in Detail
The skill ships with two Python scripts that require no external dependencies (stdlib only).
verify_scope.py
The scope enforcement engine. It supports three match strategies:
- **Directory prefix** — `src/components/` matches any file starting with that path (the trailing slash prevents false matches like `src/components-extra/`)
- **Glob patterns** — `*.js` matches against both the full path and the basename, so `*.js` correctly catches `src/utils/format.js` without needing `**/*.js`
- **Exact path** — `tsconfig.json` matches only `tsconfig.json`
python scripts/verify_scope.py # report only
python scripts/verify_scope.py --strict # exit 1 on any violation (CI-friendly)
python scripts/verify_scope.py --allowlist custom.txt
python scripts/verify_scope.py --base HEAD~5 # compare against specific ref
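Under those rules, the matching logic can be sketched in a few lines. This is a re-implementation of the described behavior, not the script's actual source, and `matches_allowlist` is a hypothetical name:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

def matches_allowlist(path: str, patterns: list[str]) -> bool:
    """Apply the three match strategies: directory prefix, glob, exact path."""
    for pat in patterns:
        if pat.endswith("/"):                    # directory prefix
            if path.startswith(pat):
                return True
        elif any(ch in pat for ch in "*?["):     # glob: full path or basename
            if fnmatch(path, pat) or fnmatch(PurePosixPath(path).name, pat):
                return True
        elif path == pat:                        # exact path
            return True
    return False
```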
generate_allowlist.py
Parses your TASK_SPEC.md and extracts scope patterns from the IN SCOPE section automatically. Understands both - [x] checked items and plain bullets. Handles globs, directory paths, exact file paths, and bare extensions (.js → *.js).
python scripts/generate_allowlist.py TASK_SPEC.md --dry-run # inspect first
python scripts/generate_allowlist.py TASK_SPEC.md # write allowlist
Both scripts have a full test suite in scripts/test_verify_scope.py:
python -m pytest scripts/test_verify_scope.py -v
The Templates: What Gets Created
The skill includes four templates in templates/:
change-manifest.md — The post-subtask audit trail. Files modified, lines added/removed, scope compliance checklist, test results before/after.
observations.md — The "notice but don't act" log. Includes a severity guide (Critical → High → Medium → Low) and a type guide (Bug, Performance, Security, Architecture, Dependency, Style, Debt). Critical/High entries get filed as separate issues. Medium/Low are batched for a future cleanup pass.
session-handoff.md — The context bridge between sessions. Structured to give a fresh agent everything it needs without re-reading the git log. Includes a drift check log section so the next session can verify the working tree was clean at handoff.
refactor-scope-allowlist.example — A heavily annotated example allowlist covering JS→TS migrations, CSS-in-JS→CSS Modules migrations, and Rails controller refactors. The "What Should NOT Appear In This File" section is worth reading — it explicitly calls out package.json, *.env, CI configs, and infrastructure files.
Why This Approach Works
The large-scale-refactor skill works because it addresses the four actual failure modes of agentic refactoring:
- Scope creep → Substitution Test + Drift Detection + Scope Allowlist
- Context degradation → Context Flushing Protocol + Session Handoff
- Silent compounding errors → Atomic subtask commits + Verification Sequence
- Parallel agent contamination → Non-overlapping file lists + Parallel Isolation rules
None of these are solved by "a smarter model." They're process failures, not capability failures. The skill is a process.
Getting Started
1. **Install**: `git clone https://github.com/opensite-ai/opensite-skills.git`
2. **Configure for your platform**: drop `large-scale-refactor/` into your platform's skill directory
3. **Invoke**: `@large-scale-refactor [task-name]` (Claude Code / Codex) or `/large-scale-refactor [task-name]` (Cursor / Copilot)
4. **Review the spec**: the most important five minutes of the entire task
5. **Generate the allowlist**: `python scripts/generate_allowlist.py TASK_SPEC.md`
6. **Let it run**: trust the pilot batch, trust the drift checks, trust the checkpoints
The skill is MIT licensed, works across all major AI coding platforms, and requires no dependencies beyond Python's standard library. Contributions welcome via the GitHub repo.
The large-scale-refactor skill is part of the OpenSite Skills Library — open-source behavioral protocols for AI coding agents.