Hopkins Jesse

Posted on May 21

I Let AI Refactor My Legacy Code for 14 Days — The Data Surprised Me

#ai #automation #experiment #productivity

I have a confession. I hate refactoring legacy code.

It is tedious, risky, and frankly boring. In January 2026, I decided to stop doing it manually. I set up an autonomous agent using the latest local LLM stack to handle technical debt in my side project, a Rust-based API gateway that has been running since 2023.

The goal was simple. Let the AI identify code smells, propose fixes, and run tests without my direct intervention. I wanted to see if "agentic workflows" were finally ready for prime time or just another hype cycle.

I gave it two weeks. Here is what actually happened.

The Setup: Local Agents Only

I did not use any cloud-based coding assistants. Privacy matters, and sending proprietary logic to external APIs feels wrong in 2026. I ran everything locally on my M3 Max MacBook Pro.

The stack looked like this:

Model: Llama-4-70B (quantized) via Ollama
Orchestrator: OpenDevin fork with custom Rust plugins
Test Runner: cargo test with strict coverage requirements
Guardrails: A separate small model trained only to reject changes that alter public API signatures
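
To make the guardrail idea concrete, here is a minimal sketch of a deterministic check with the same goal: reject a change if any public signature disappears or is altered. This is not the small model described above. It is a simple line-based stand-in, and the function names `public_signatures` and `broken_signatures` are hypothetical. A real check would parse the syntax tree instead of matching line prefixes.

```rust
use std::collections::BTreeSet;

/// Collects trimmed lines that look like public item declarations.
/// Assumption: a line-prefix heuristic is good enough for illustration;
/// it will miss multi-line signatures and items such as `pub(crate) fn`.
fn public_signatures(source: &str) -> BTreeSet<String> {
    source
        .lines()
        .map(str::trim)
        .filter(|line| {
            line.starts_with("pub fn ")
                || line.starts_with("pub trait ")
                || line.starts_with("pub struct ")
                || line.starts_with("pub enum ")
        })
        // Drop the opening brace so body-only edits do not count as changes.
        .map(|line| line.trim_end_matches('{').trim_end().to_string())
        .collect()
}

/// Returns signatures that existed before the change but are missing
/// (or altered) afterwards. An empty result means the public API held.
fn broken_signatures(before: &str, after: &str) -> Vec<String> {
    let old = public_signatures(before);
    let new = public_signatures(after);
    old.difference(&new).cloned().collect()
}
```

A caller would pass the file contents from the base branch and from the agent's branch, and refuse to open the pull request if the returned list is not empty.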

I configured the agent to scan the src/ directory every night at 2 AM. It had permission to create branches, commit changes, and open pull requests. It did not have permission to merge them. That part was still on me.

Here is the configuration snippet I used for the agent's core loop:

{
  "agent_config": {
    "model": "llama4:70b-q4_K_M",
    "max_iterations": 50,
    "tools": ["file_reader", "code_editor", "bash_runner"],
    "constraints": [
      "no_changes_to_public_traits",
      "maintain_95_percent_test_coverage",
      "zero_clippy_warnings"
    ],
    "rollback_strategy": "git_reset_hard_on_failure"
  }
}

This setup cost me zero dollars in API fees. It did cost me about 4 hours of initial configuration time. I spent most of that time fighting with context window limits. The agent kept forgetting variable names in files it hadn't touched in three turns.

Week 1: The Honeymoon Phase

The first three days were impressive. The agent caught 14 instances of unused imports and fixed 8 minor clippy warnings. These are low-hanging fruit. Any linter can do this. But the agent also grouped them into logical commits and wrote decent commit messages.

On day four, it attempted its first real refactor. It identified a complex match statement in the authentication module that was nested six levels deep. This is classic "arrow code."

The agent proposed flattening it using early returns and helper functions. I reviewed the PR. The logic was sound. The tests passed. I merged it.
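
For readers who have not seen this kind of refactor, here is a small sketch of the pattern. It is not the code from my authentication module. The `Request` and `AuthError` types and both function names are made up for illustration, and the nesting is three levels deep instead of six to keep it short.

```rust
// Hypothetical types for illustration only.
#[derive(Debug, PartialEq)]
enum AuthError {
    MissingHeader,
    BadScheme,
    EmptyToken,
}

struct Request {
    auth_header: Option<String>,
}

// Before: every check adds another level of nesting ("arrow code").
fn token_nested(req: &Request) -> Result<&str, AuthError> {
    match &req.auth_header {
        Some(header) => match header.strip_prefix("Bearer ") {
            Some(token) => {
                if !token.is_empty() {
                    Ok(token)
                } else {
                    Err(AuthError::EmptyToken)
                }
            }
            None => Err(AuthError::BadScheme),
        },
        None => Err(AuthError::MissingHeader),
    }
}

// After: each failure case returns early, so the happy path stays flat.
fn token_flat(req: &Request) -> Result<&str, AuthError> {
    let header = req.auth_header.as_deref().ok_or(AuthError::MissingHeader)?;
    let token = header.strip_prefix("Bearer ").ok_or(AuthError::BadScheme)?;
    if token.is_empty() {
        return Err(AuthError::EmptyToken);
    }
    Ok(token)
}
```

Both functions return the same result for the same input. The flat version just reads top to bottom, which is what makes this kind of change easy to review.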

I felt smart. I felt like I had hacked the system. I was saving hours of mental energy by offloading the boring work. I estimated I saved about 3 hours that week.

Then day five hit.

Week 2: The Context Collapse

The agent started getting confident. Too confident.

It began modifying error handling patterns across multiple modules. In Rust, the error type is part of a function's signature. You cannot just swap the E in Result<T, E> without checking every caller, because each caller that propagates the error has to be able to convert it. The agent missed two callers in a different crate.
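
Here is a minimal sketch of why a missed caller breaks the build. The names (`ConfigError`, `GatewayError`, `parse_port`, `listener_addr`) are hypothetical and not taken from my gateway. The assumption is that a refactor changed the error type of `parse_port`, and that `listener_addr` is a caller living somewhere else.

```rust
use std::num::ParseIntError;

// Hypothetical error type introduced by the refactor.
#[derive(Debug, PartialEq)]
enum ConfigError {
    BadPort(String),
}

// Hypothetical error type used by the caller.
#[derive(Debug, PartialEq)]
enum GatewayError {
    Config(ConfigError),
}

// Suppose this used to return Result<u16, ParseIntError> and the
// refactor changed the error type to ConfigError.
fn parse_port(raw: &str) -> Result<u16, ConfigError> {
    raw.trim()
        .parse::<u16>()
        .map_err(|_e: ParseIntError| ConfigError::BadPort(raw.to_string()))
}

// The `?` in the caller below converts errors through `From`.
// If this impl is missing, `listener_addr` stops compiling, even
// though nobody edited `listener_addr` itself.
impl From<ConfigError> for GatewayError {
    fn from(err: ConfigError) -> Self {
        GatewayError::Config(err)
    }
}

// A caller that propagates the error with `?`.
fn listener_addr(raw_port: &str) -> Result<String, GatewayError> {
    let port = parse_port(raw_port)?;
    Ok(format!("0.0.0.0:{}", port))
}
```

The compiler catches this, which is the good news. The bad news is that it only catches it in crates that actually get rebuilt and checked, so a change that looks green in one crate can still fail in another.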

The CI pipeline failed. Not once, but twelve times in a row.

I watched the logs as the agent tried to "fix" the build. It added more code to patch the errors. It did not understand the root cause. It was treating symptoms, not the disease. Each fix introduced two new bugs.

By day eight, I had 15 open PRs. Twelve were broken. Three were questionable. I spent 6 hours reviewing code that was worse than when I started.

The local model struggled with long-range dependencies. It could not hold the entire project graph in its context window. When it changed a struct in models.rs, it forgot how that struct was serialized in api_handlers.rs three folders away.

I had to intervene. I paused the agent. I manually fixed the breakage. Then I tightened the constraints.

The Numbers Don't Lie

At the end of 14 days, I crunched the data. I compared the state of the repo before and after the experiment.

Metric	Before Experiment	After Experiment	Change
Total Lines of Code	12,450	12,890	+440
Clippy Warnings	42	3	-39
Test Coverage	88%	87.5%	-0.5 pts
Cyclomatic Complexity	14.2 (avg)	12.1 (avg)	-2.1
Human Review Hours	0	18	+18
Bugs Introduced	0	7	+7

The reduction in cyclomatic complexity looks good on paper. The code is technically "cleaner" in terms of nesting. But look at the other rows.

The line count went up. The agent loves verbose variable names and extra helper functions. It adds boilerplate to explain its own logic.

Test coverage dropped slightly. The agent wrote new tests for
