The first legacy refactor I owned was a 14,000 line PHP file that handled checkout for an ecommerce site I had been hired to maintain. The original author had left two years before. The variable names were a mix of Hungarian notation, abbreviations, and what I can only describe as personal grudges. There were comments that referenced bugs that no longer existed and tickets in a tracker that had been decommissioned. I asked the team where to start. They said "do not." That was the strategy.
That advice was correct in 2018. In 2026 it is not. Refactoring legacy code used to be the kind of project that took a senior engineer three months and ended in a failed migration. Now the same work takes weeks, and the failure rate is dramatically lower. Claude Code is the difference. Not because it does the refactor for me, but because it does the research that used to make refactoring legacy code intractable.
Here is the workflow I use to modernize codebases I did not write, on a deadline I cannot move, with a team that does not have time to help.
Why Legacy Refactoring Fails
Most legacy refactor projects fail in one of three ways.
The first failure mode is scope creep. The team starts with "let us clean up the checkout module" and ends with "let us rewrite everything." Three months in, nothing ships. Six months in, the project gets cancelled. The original code is still in production.
The second failure mode is regression. The team makes changes that look right but break behavior that was load-bearing. The bug surfaces in production three weeks later when a customer hits an edge case nobody knew existed. The team rolls back. Trust in refactoring drops. Future refactors get harder to justify.
The third failure mode is abandonment. The team starts the refactor, hits an unexpected complication, and pauses. The pause becomes permanent. Six months later there is half-refactored code in production, harder to maintain than the original because now it has two patterns instead of one.
All three failure modes have the same root cause. The team did not understand the code well enough before changing it. Legacy code looks simple from the outside and is not. Every line in a legacy codebase is there for a reason. Some of those reasons are good. Some are obsolete. Some are workarounds for bugs in dependencies that have since been fixed. You cannot tell which is which without reading the code carefully, and reading 14,000 lines of legacy PHP carefully is a job nobody wants.
Legacy code is not bad code. It is code that survived. Survival is information. Most refactors fail because they discard the information.
The Archaeology Skill
Every legacy refactor starts with archaeology. The archaeology skill takes a file or a module and produces a markdown report with the following sections:
- Surface summary - what the code appears to do
- Hidden behavior - subtle behaviors that are not obvious from naming
- Dependencies - what it depends on, what depends on it
- Historical context - patterns that suggest old constraints
- Risk hotspots - code that handles edge cases or error paths
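To make this concrete, here is a minimal sketch of what such a skill can look like as a Claude Code skill file. The `.claude/skills/archaeology/SKILL.md` location and frontmatter follow Claude Code's skill conventions; the body is a trimmed illustration, not my full skill.

```markdown
---
name: archaeology
description: Produce an archaeology report for a legacy file or module before any refactor is attempted.
---

When asked to run archaeology on a file or module:

1. Read the target end to end before writing anything.
2. Write a report to docs/archaeology/<target>.md with these sections:
   Surface summary, Hidden behavior, Dependencies, Historical context,
   Risk hotspots.
3. In Historical context, flag patterns that suggest old constraints
   (retry loops, odd timeouts, multi-shape arguments) and say why.
4. Do not propose changes. Report what the code does today.
```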
The historical context section is the one that surprised me most when I started using this workflow. Claude Code can often infer why code was written a certain way by reading the code carefully and recognizing patterns. A weird retry loop with a 30 second backoff is probably a workaround for a flaky upstream service. A function that handles three different argument shapes was probably called from three different places at different times. A comment that says "TODO: fix this" with no date is probably 5 years old and the author is no longer at the company.
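The retry loop case, sketched in TypeScript with hypothetical names, shows what that inference looks like in practice:

```typescript
type Inventory = { sku: string; available: number };
declare const inventoryClient: { get(sku: string): Promise<Inventory> };

// Archaeology flag: exactly 3 attempts with a fixed 30-second backoff.
// The pattern suggests a workaround for a flaky upstream service. Confirm
// whether the upstream still misbehaves before "simplifying" this away.
async function fetchInventory(sku: string): Promise<Inventory> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      return await inventoryClient.get(sku);
    } catch (err) {
      if (attempt === 3) throw err;
      await new Promise((resolve) => setTimeout(resolve, 30_000));
    }
  }
  throw new Error("unreachable"); // satisfies the compiler; the loop always returns or throws
}
```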
The archaeology report becomes the briefing document for everything that follows. I commit it to a docs folder and reference it in every PR description. New team members read the archaeology before they touch the code. The cost of a careful read once is dramatically lower than the cost of repeated shallow reads forever.
The Behavior Spec Skill
Once I understand what the code is, the next step is documenting what the code does. The behavior spec skill takes a module and produces a markdown specification of its observable behavior, including all the edge cases and error paths.
This is not the same as documentation the original author would have written. It is a reverse-engineered specification based on reading the code as it actually is. The spec includes things the original author may not have intended but that are now load-bearing because callers depend on them.
A typical behavior spec for a 500 line module is about 80 lines and reads like a contract:
"Function
processOrder(order)returns one of:{success: true, id: string},{success: false, error: 'inventory'},{success: false, error: 'payment'},{success: false, error: 'unknown', code: number}. Theunknowncase occurs when the upstream service returns a 5xx response or times out. Thecodefield is the HTTP status code when the response is 5xx, or the value -1 when the call timed out."
That last sentence is the kind of detail that lives only in the code until someone writes it down. Once it is written down, refactoring becomes safe because the new code can be checked against the spec. Without the spec, you are checking against your assumption of what the code does, and your assumption is usually wrong in at least one place.
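Captured as a type, assuming TypeScript for illustration, the contract is a discriminated union the refactored code must stay assignable to:

```typescript
// The behavior spec above as a type. The refactor can change everything
// inside processOrder, but never this shape, including the -1 sentinel.
type ProcessOrderResult =
  | { success: true; id: string }
  | { success: false; error: "inventory" }
  | { success: false; error: "payment" }
  | { success: false; error: "unknown"; code: number }; // 5xx status, or -1 on timeout
```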
If you want the actual archaeology and behavior spec skills I use, the setup is documented at nextools.hashnode.dev. Adapt them to your stack and start treating legacy code as a research problem.
The Test Generation Pattern
Behavior specs are useful but they are not enforceable. Tests are. The test generation pattern takes a behavior spec and produces a test suite that exercises every documented behavior.
The pattern has three steps:
- Generate tests from the behavior spec
- Run tests against the existing code and confirm they all pass
- Use the tests as the safety net for the refactor
Step 2 is the critical one. If the generated tests do not all pass against the existing code, then either the behavior spec is wrong or the tests are wrong. Both are common on the first iteration. I usually run two or three iterations of "fix the spec, regenerate the tests, run them" before everything passes. Once everything passes, I have a test suite that locks down the existing behavior. From there, refactoring is safe in a way it never was before.
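A generated test for the spec above might look like this sketch, assuming a Vitest setup; the module path, order shape, and timeout hook are all hypothetical:

```typescript
import { describe, it, expect } from "vitest";
import { processOrder } from "./checkout"; // hypothetical module under test

describe("processOrder, locked to the behavior spec", () => {
  it("maps inventory shortfalls to the documented error shape", async () => {
    const result = await processOrder({ sku: "ABC", qty: 999_999 });
    expect(result).toEqual({ success: false, error: "inventory" });
  });

  it("uses the -1 sentinel when the upstream call times out", async () => {
    // Per the spec: timeouts are 'unknown' with code -1, not a thrown error.
    const result = await processOrder({ sku: "ABC", qty: 1, simulateTimeout: true });
    expect(result).toEqual({ success: false, error: "unknown", code: -1 });
  });
});
```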
The test generation pattern produces tests that are usually about 70% as good as tests an engineer would have written by hand, but at about 5% of the cost. The 30% gap is mostly in test naming and organization. The actual coverage is comparable. For legacy refactors, this trade is overwhelmingly worth it.
The Incremental Refactor Loop
Refactors fail when they are big bang releases. They succeed when they are incremental. The incremental refactor loop is the workflow that keeps a refactor incremental even when it spans weeks.
The loop has four steps that repeat:
- Pick the smallest unit that can be refactored independently
- Refactor that unit, keeping the behavior spec satisfied
- Run the test suite, ship the change
- Move to the next unit
The "smallest unit" definition is what makes this work. In a 14,000 line file, the smallest unit might be a single function. In a 100,000 line module, the smallest unit might be a single file. In a 5,000 file system, the smallest unit might be a single module. The skill is recognizing what counts as small enough to ship in one PR without losing the team's attention.
I usually aim for PRs that are 200 to 400 lines, take 2 to 4 hours to review, and ship within 24 hours of opening. Anything larger gets split. Anything smaller usually means the refactor is not making meaningful progress.
Refactoring at the speed of small PRs is dramatically faster than refactoring at the speed of big PRs, because small PRs do not stall in review and do not introduce conflicts.
The Strangler Fig Pattern
For systems that cannot be refactored in place, the strangler fig pattern is the standard approach. You build new code alongside the old code, route traffic gradually from old to new, and delete the old code once everything has migrated.
Claude Code makes the strangler fig pattern dramatically easier because it can keep the new code and the old code in sync as the migration progresses. The strangler skill takes a behavior spec and a target architecture and produces a migration plan with:
- The new code structure
- The routing logic that decides old or new per request
- The metrics that confirm equivalent behavior
- The rollback plan if metrics drift
I have used the strangler skill three times in the last year. Each migration took weeks instead of months, and none of them caused production incidents. The metric-driven routing is the part that makes the difference. You can roll new code out to 1% of traffic, watch the metrics for a day, and either expand or roll back. Without that gradient, you are deploying changes in a single step and hoping.
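The per-request routing is the load-bearing piece. A minimal sketch, assuming a percentage rollout read from an environment variable and hypothetical old and new implementations:

```typescript
type Order = { id: string; sku: string; qty: number };
type ProcessOrderResult = { success: boolean };
declare function processOrderLegacy(order: Order): Promise<ProcessOrderResult>;
declare function processOrderNew(order: Order): Promise<ProcessOrderResult>;
declare const metrics: { increment(name: string): void };

// Fraction of traffic routed to the new code path; start at 1%.
const NEW_PATH_PERCENT = Number(process.env.NEW_PATH_PERCENT ?? "1");

async function processOrderRouted(order: Order): Promise<ProcessOrderResult> {
  if (Math.random() * 100 >= NEW_PATH_PERCENT) {
    metrics.increment("checkout.legacy_path");
    return processOrderLegacy(order);
  }
  try {
    const result = await processOrderNew(order);
    metrics.increment("checkout.new_path.success"); // compare rates against legacy
    return result;
  } catch (err) {
    metrics.increment("checkout.new_path.error"); // sustained drift here means roll back
    return processOrderLegacy(order); // fail open to the old code
  }
}
```

In practice you would bucket by order or user id rather than `Math.random()`, so retries of the same request stay on one path, and you would think hard about the fallback for side-effecting steps like payment, where running both paths is worse than failing.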
The Deletion Pattern
The most underrated refactoring move is deletion. Most legacy codebases have dead code, unused features, and abandoned experiments that nobody has the courage to remove. The deletion pattern uses Claude Code to identify safely deletable code with high confidence.
The pattern works in three layers:
- Static analysis to find code that is never called
- Runtime analysis using production logs to find code that is called less than once a month
- Behavioral analysis to find code that is called but whose results are never used
The third layer is the surprising one. Most teams find dead code through static analysis. Behavioral analysis catches code that is technically reachable but functionally inert. A function whose return value is always discarded. A logging call that writes to a destination nobody reads. A side effect that nobody depends on.
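A sketch of the kind of code the behavioral layer flags, with entirely hypothetical names:

```typescript
type Order = { couponCode: string; total: number };
declare function validateCoupon(code: string): boolean;
declare function logToLegacyAudit(order: Order): void;

// Reachable, covered by static analysis, and functionally inert anyway.
function applyDiscount(order: Order): Order {
  validateCoupon(order.couponCode); // return value discarded at every call site
  logToLegacyAudit(order); // writes to a table nothing has read since the BI migration
  return { ...order, total: order.total }; // the "discount" is a no-op
}
```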
Deletion typically removes 10 to 30% of a legacy codebase without changing any observable behavior. That is 10 to 30% less code to maintain, less code to refactor, less code to test, less code to onboard new engineers into. The compound benefit is large.
What I Got Wrong
Three lessons from the first refactor I did with this workflow.
The first lesson is that I tried to refactor too much at once. The archaeology report told me everything that was wrong. I wanted to fix all of it. I started a refactor that spanned eight modules and stalled at module four. The remaining four modules sat in a half-refactored state for two months before I split them into separate projects and shipped them one at a time. Lesson learned: scope is the most important variable in refactoring. Pick small. Always smaller than you think.
The second lesson is that I underestimated how much old code is load-bearing in subtle ways. I deleted a function that the deletion pattern flagged as dead. It was not dead. It was called from a cron job that ran once a quarter to generate a report nobody asked for but that the CFO read every quarter. Three weeks after I deleted it, the CFO asked where the report was. Lesson learned: cross-check deletion candidates with cross-functional stakeholders before deleting. The static and runtime analysis cannot see organizational dependencies.
The third lesson is that I trusted the behavior spec too much without testing it manually. The spec said the function returned a list of orders sorted by date. The function returned a list of orders sorted by date in some cases and unsorted in others depending on a flag I had missed. The tests passed because the test cases happened to match the sorted case. Production users hit the unsorted case immediately. Lesson learned: spec coverage and test coverage are not the same. Spend time exercising the spec manually before locking it down.
FAQ
How big does a legacy codebase need to be before this workflow is worth it?
Around 5,000 lines. Below that you can read it in an afternoon and refactor it without much process. Above that the cost of misunderstanding starts to dominate.
What about codebases in unusual languages?
Claude Code handles most languages well. The archaeology skill is language-agnostic. The test generation pattern depends on having a usable test framework, which is a bigger constraint than the language itself.
Do I need to convince my team to use this workflow?
Not initially. Run the workflow on your own pieces, ship clean refactors, and let the results speak. Teams that see refactors landing without incidents adopt the workflow on their own. Teams that have been burned by refactor projects need evidence before process.
What about codebases with no tests at all?
That is the standard case for legacy code. The test generation pattern is designed for this. You generate tests as part of the refactor process, not before it. The test suite grows alongside the modernization, not as a prerequisite.
The Bigger Picture
Legacy code is a permanent feature of software engineering. Every codebase becomes legacy code eventually. Every engineer eventually inherits code from someone who is no longer around. The question is not whether to deal with legacy code. The question is whether you can deal with it efficiently.
For most of the history of software, dealing with legacy code efficiently was a senior engineer skill. The people who could do it had built up years of pattern recognition, archaeological intuition, and sheer patience. Junior engineers were told to avoid legacy code because the failure modes were so expensive.
Claude Code is changing this. The archaeology skill produces the briefing document a senior engineer would have produced after a careful read. The behavior spec skill captures the contracts a senior engineer would have inferred. The test generation pattern produces the safety net a senior engineer would have demanded before starting the refactor. The result is that refactoring legacy code becomes accessible to anyone who is willing to follow the workflow.
The teams that adopt this workflow first will modernize their codebases faster than the teams that do not. Faster modernization compounds. Modern codebases attract better engineers, ship features faster, and have fewer incidents. The gap between teams that can refactor and teams that cannot is going to widen sharply over the next few years.
If you want to see the exact archaeology, behavior spec, test generation, and incremental refactor skills I use, my full Claude Code legacy refactor setup is documented at nextools.hashnode.dev. Steal them, adapt them, and start shipping refactors that used to be impossible.
The cost of legacy refactoring is collapsing. The codebases that survive the next decade will be the ones that get refactored continuously, in small increments, with the help of AI that does the research that used to be too expensive. Start with the archaeology. The rest follows.