I have a microservice written in Python 3.8 that handles user authentication. It was born in 2019. It has survived three major framework updates and two team restructures.
The code is ugly. I mean really ugly.
It has nested if statements that go six levels deep. Variable names like data and temp_list are everywhere. I avoided touching it for years because every change broke something unexpected.
In January 2026, I decided to stop being a coward. I wanted to see if the new generation of agentic coding tools could handle a real mess. Not a toy project. Not a "hello world" app. Actual production debt.
I gave an autonomous agent full read/write access to this specific module. I set strict guardrails. No changing public API signatures. No deleting tests. Just refactor for readability and type safety.
I ran this experiment for exactly 14 days. Here is what happened.
The Setup and Guardrails
I used a local LLM setup with a specialized refactoring agent. Cloud APIs were too expensive for the number of iterations I planned. I needed the agent to run hundreds of small cycles.
My goal was simple. Reduce cyclomatic complexity. Improve type hint coverage. Reduce lines of code where possible without losing logic.
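For illustration, here is a rough sketch of how two of these numbers could be measured with nothing but the standard library. It is not the tooling from the experiment. The function name `function_metrics` and the sample source are hypothetical, and the complexity count is a simplified approximation (one point per branching construct), not a full cyclomatic complexity implementation.

```python
import ast

# Node types that add a decision point. This is a rough approximation:
# a BoolOp counts once no matter how many operands it has.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.BoolOp, ast.comprehension)


def function_metrics(source: str) -> list:
    """Return (name, complexity, fully_typed) for each function in source."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Complexity starts at 1 and grows by one per branching construct.
        complexity = 1 + sum(
            isinstance(child, _BRANCH_NODES) for child in ast.walk(node)
        )
        # *args and **kwargs are ignored here to keep the sketch short.
        args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        fully_typed = node.returns is not None and all(
            a.annotation is not None
            for a in args
            if a.arg not in ("self", "cls")
        )
        results.append((node.name, complexity, fully_typed))
    return results


# Hypothetical sample: one untyped, branchy function and one typed, flat one.
SAMPLE = '''
def check(user, roles):
    if user:
        for r in roles:
            if r == "admin":
                return True
    return False

def greet(name: str) -> str:
    return f"hi {name}"
'''

metrics = function_metrics(SAMPLE)
coverage = sum(typed for _, _, typed in metrics) / len(metrics)
```

Running a script like this before and after each batch of commits gives a cheap trend line, even if the absolute numbers differ from what a dedicated analyzer would report.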
I did not just hit "go" and walk away. That is how you get infinite loops or deleted database tables. I created a sandbox environment. The agent could only commit to a separate branch.
Every commit triggered a CI pipeline. If tests failed, the agent received the error log as feedback. It had to fix its own mistake before proceeding. This loop ran automatically.
Here is the configuration I used for the agent's system prompt:
    AGENT_CONFIG = {
        "role": "Senior Python Refactorer",
        "constraints": [
            "Preserve all existing function signatures",
            "Do not remove any comments marked # IMPORTANT",
            "Maintain 100% test pass rate",
            "Use Python 3.12 type hints exclusively",
        ],
        "feedback_loop": "strict",
        "max_iterations_per_file": 5,
        "rollback_on_failure": True,
    }
This config seems obvious now. It was not obvious on day one. I initially forgot the max_iterations_per_file limit. The agent spent four hours trying to refactor a single 400-line function. It went in circles. I had to kill the process manually.
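The iteration cap and rollback behavior can be sketched as a small loop. This is an assumption about how such a loop might look, not the agent's actual implementation. The names `refactor_with_feedback`, `fake_propose`, and `fake_tests` are hypothetical stand-ins for the agent call and the CI run.

```python
from typing import Callable, List, Optional, Tuple


def refactor_with_feedback(
    propose: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    original: str,
    max_iterations: int = 5,
) -> Tuple[str, int]:
    """Try up to max_iterations proposals; return (code, attempts used).

    Falls back to the original source if no proposal passes the tests.
    """
    error_log: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        # The proposer sees the log from the previous failure, if any.
        candidate = propose(original, error_log)
        passed, error_log = run_tests(candidate)
        if passed:
            return candidate, attempt
    # Rollback: every attempt failed, so keep the untouched source.
    return original, max_iterations


# Hypothetical stand-ins: the "agent" only gets it right on the third try.
seen_logs: List[Optional[str]] = []


def fake_propose(source: str, error_log: Optional[str]) -> str:
    seen_logs.append(error_log)
    return f"{source}  # v{len(seen_logs)}"


def fake_tests(candidate: str) -> Tuple[bool, str]:
    ok = candidate.endswith("# v3")
    return ok, ("" if ok else "AssertionError: expected v3")


result, used = refactor_with_feedback(fake_propose, fake_tests, "x = 1")
```

Without the cap, a proposer that never converges keeps the loop alive indefinitely, which is the four-hour failure described above.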
Week 1: The Honeymoon Phase
The first three days were impressive. The agent tackled the low-hanging fruit. It replaced old-style string formatting with f-strings. It added type hints to simple functions.
It renamed variables. usr_lst became user_list. dt became created_at. These are small changes. They make reading the code much easier.
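A before/after pair in the spirit of those early commits might look like this. The functions are hypothetical and not taken from the real module. Only local names change, so the public signature stays intact, and the hints use `typing` generics so the sketch still runs on Python 3.8.

```python
from typing import Dict, List


# Before: old-style formatting, abbreviated locals, no type hints.
def describe_before(records):
    usr_lst = [r["name"] for r in records]
    dt = records[0]["created"]
    return "%d users, first created %s" % (len(usr_lst), dt)


# After: same behavior with an f-string, descriptive locals, and hints.
def describe_after(records: List[Dict[str, str]]) -> str:
    user_list = [r["name"] for r in records]
    created_at = records[0]["created"]
    return f"{len(user_list)} users, first created {created_at}"


sample = [
    {"name": "alice", "created": "2019-05-01"},
    {"name": "bob", "created": "2019-06-12"},
]
before_text = describe_before(sample)
after_text = describe_after(sample)
```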
I reviewed the pull requests daily. Most were clean. The diff views were green and tidy. I felt optimistic. Maybe this was the silver bullet we were promised.
Then day four happened.
The agent decided to optimize a database query. It noticed an N+1 problem in a loop. It rewrote the logic to use a bulk fetch. Smart move.
But it missed a subtle side effect. The original code relied on the order of items returned by the database. The bulk fetch did not guarantee that order. The unit tests passed because they mocked the database response. The integration tests failed in staging.
I caught it before it hit production. But it shook my confidence. The agent was smart, but it lacked context. It did not understand the business logic behind the data ordering.
I had to spend two hours writing a new test case that explicitly checked for sort order. Then I fed that failure back into the agent. It learned. It did not make that mistake again.
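Here is a minimal sketch of that kind of ordering test, using an in-memory stand-in instead of a real database. Everything in it is hypothetical: `FAKE_TABLE`, `bulk_fetch`, and `fetch_in_requested_order` are illustrative names, and it assumes the order the caller needs is the order of the requested IDs.

```python
from typing import Dict, List

# Hypothetical stand-in for a table. The dict order mimics a database
# returning rows in an order unrelated to the request.
FAKE_TABLE: Dict[int, str] = {3: "carol", 1: "alice", 2: "bob"}


def bulk_fetch(ids: List[int]) -> List[dict]:
    """One lookup for all ids; row order follows the table, not the request."""
    wanted = set(ids)
    return [{"id": i, "name": n} for i, n in FAKE_TABLE.items() if i in wanted]


def fetch_in_requested_order(ids: List[int]) -> List[dict]:
    """Bulk fetch, then restore the caller's order explicitly."""
    rows_by_id = {row["id"]: row for row in bulk_fetch(ids)}
    return [rows_by_id[i] for i in ids if i in rows_by_id]


def test_order_matches_request() -> None:
    ids = [1, 2, 3]
    # The silent regression: the bulk result comes back in table order.
    assert [r["id"] for r in bulk_fetch(ids)] != ids
    # The fix: the result order is pinned to the request order.
    assert [r["id"] for r in fetch_in_requested_order(ids)] == ids


test_order_matches_request()
```

A mocked response hides this class of bug because the mock returns rows in whatever order the test author typed them. Asserting on order directly makes the dependency visible.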
Week 2: Diminishing Returns
By the second week, the easy wins were gone. The agent started struggling with complex conditional logic.
It tried to extract methods from a massive validate_user function. It created five new helper functions. But it named them poorly. process_step_1, process_step_2. This was worse than the original spaghetti code.
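For contrast, this is roughly what a useful extraction looks like: helpers named after the rule they check, and guard clauses instead of nesting. The rules below (a lockout threshold of five, an email_verified flag) are invented for illustration and are not the real logic of validate_user.

```python
from typing import Optional


def is_locked_out(user: dict) -> bool:
    # Hypothetical rule: five or more failed logins locks the account.
    return user.get("failed_logins", 0) >= 5


def has_verified_email(user: dict) -> bool:
    return bool(user.get("email_verified"))


def validate_user(user: Optional[dict]) -> str:
    """Guard clauses replace nested ifs; each helper says what it checks."""
    if user is None:
        return "unknown user"
    if is_locked_out(user):
        return "locked out"
    if not has_verified_email(user):
        return "email not verified"
    return "ok"


status = validate_user({"failed_logins": 1, "email_verified": True})
```

A name like is_locked_out survives being read out of context. process_step_1 does not.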
I realized I needed to intervene more. I stopped letting it run autonomously for entire files. I switched to a pair-programming mode. I would select a block of code. I would ask the agent to suggest three refactoring options.
I would pick the best one. Then I would apply it myself.
This slowed things down. But the quality went up. The agent acted as a senior reviewer rather than a junior developer running wild.
Here is a comparison of the metrics before and after the 14-day experiment:
| Metric | Before (Jan 1) | After (Jan 15) | Change |
|---|---|---|---|
| Lines of Code | 4,250 | 3,890 | -8.5% |
| Cyclomatic Complexity | 18.4 (avg) | 12.1 (avg) | -34% |
| Type Hint Coverage | 12% | 89% | +77 pp |
| Test Pass Rate | 98% | 100% | +2 pp |
| Human Review Time | 0 hrs | 14 hrs | +14 hrs |
The numbers look great. Complexity dropped significantly. Type coverage is nearly complete. The codebase is objectively healthier.
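The figures in the Change column come from two different calculations. Lines of code and complexity are relative changes against the starting value. Type hint coverage and test pass rate are differences between two percentages, so they are best read as percentage points. A short sketch of the arithmetic, with hypothetical helper names:

```python
def relative_change(before: float, after: float) -> float:
    """Percent change relative to the starting value."""
    return (after - before) / before * 100


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


loc_change = relative_change(4250, 3890)         # lines of code
complexity_change = relative_change(18.4, 12.1)  # average complexity
coverage_change = point_change(12, 89)           # type hint coverage

print(f"{loc_change:.1f}%  {complexity_change:.0f}%  {coverage_change:+} pp")
```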