I have a monolith. It is a Rails application from 2018 that handles billing for a mid-sized SaaS. We call it "The Beast" internally because it bites back whenever you try to touch the payment logic.
For years, my strategy was simple. If it works, do not touch it. If it breaks, patch it with duct tape and prayers. This approach worked until last month when we needed to migrate our payment provider. The existing code was so tangled that estimating the work felt like guessing the weight of a cloud.
I decided to run an experiment. I would let an autonomous AI agent handle the refactoring of our billing module for 30 days. Not just code completion. Full structural refactoring.
My goal was not perfection. I wanted to see if AI could reduce the cognitive load of understanding legacy spaghetti. I set strict guardrails. The AI could propose changes, but I had to approve every pull request.
Here is what happened.
The Setup: Guardrails Over Trust
I did not just point Cursor or Copilot at the codebase and walk away. That is how you get security vulnerabilities and infinite loops.
I used a local LLM setup with a specialized agent framework. I chose this over cloud APIs because our billing data contains PII. I could not risk sending customer credit card tokens to a third-party server.
The stack looked like this:
- Model: Llama-3-70b-Instruct (quantized, running on local A100s)
- Framework: LangGraph for state management
- Testing: RSpec suite with 94% coverage
- Guardrail: Every change required passing the full test suite before I even saw the diff.
I gave the agent one instruction. "Refactor the PaymentProcessor class to use the Strategy Pattern. Keep all public methods identical. Do not change business logic."
Simple, right? Wrong.
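For readers who have not used the Strategy Pattern, here is a minimal sketch of the shape I was asking for. Only PaymentProcessor is a real name from our codebase; the strategy classes, the charge signature, and the return values are hypothetical stand-ins, not our actual code.

```ruby
# Hypothetical sketch: every name except PaymentProcessor is illustrative.

# Each strategy wraps one payment provider behind the same #charge interface.
class StripeStrategy
  def charge(amount_cents, token)
    { provider: :stripe, amount_cents: amount_cents, token: token }
  end
end

class NewProviderStrategy
  def charge(amount_cents, token)
    { provider: :new_provider, amount_cents: amount_cents, token: token }
  end
end

class PaymentProcessor
  # Maps a provider name to the class that knows how to talk to it.
  STRATEGIES = {
    stripe: StripeStrategy,
    new_provider: NewProviderStrategy
  }.freeze

  def initialize(provider)
    # fetch raises KeyError for an unknown provider instead of failing later.
    @strategy = STRATEGIES.fetch(provider).new
  end

  # The public method keeps its signature; only the delegation target varies.
  def charge(amount_cents, token)
    @strategy.charge(amount_cents, token)
  end
end
```

The appeal for a provider migration is that adding a provider means adding one class and one hash entry, while callers of PaymentProcessor stay untouched.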
Week 1: The Hallucination Phase
The first three days were painful. The agent kept trying to import libraries that did not exist. It invented a gem called active_billing_strategy and tried to bundle install it.
I spent four hours just correcting its understanding of our Gemfile.
On day four, it produced its first valid pull request. It extracted the Stripe logic into a separate class. The code looked clean. Too clean.
I reviewed the diff. It had removed a critical idempotency check. This check prevented double-charging customers if the webhook fired twice. The tests passed because our test fixtures did not cover concurrent webhook delivery.
This was a wake-up call. The AI optimized for readability, not correctness. It missed the subtle side effects that only exist in production traffic.
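To make the failure concrete, here is a rough sketch of what an idempotency check does. This is not our production code: WebhookHandler, the event shape, and the in-memory Set are assumptions for illustration. A real system would more likely rely on a unique database constraint or the provider's own idempotency support.

```ruby
require "set"

# Hypothetical handler: the class name, event keys, and in-memory store
# are assumptions made for this sketch.
class WebhookHandler
  attr_reader :charges

  def initialize
    @seen_keys = Set.new
    @charges = []
    @lock = Mutex.new
  end

  # Returns :charged on first delivery and :duplicate on any repeat.
  def handle(event)
    key = event.fetch(:idempotency_key)
    @lock.synchronize do
      # Set#add? returns nil when the key was already recorded.
      return :duplicate unless @seen_keys.add?(key)

      @charges << event.fetch(:amount_cents)
      :charged
    end
  end
end

# Simulate the same webhook arriving several times at once.
handler = WebhookHandler.new
event = { idempotency_key: "evt_1", amount_cents: 500 }
results = 5.times.map { Thread.new { handler.handle(event) } }.map(&:value)
puts results.count(:charged)
```

Delete the add? line and every delivery charges the customer. A fixture that delivers the same event twice is the kind of test that would have failed on the agent's pull request.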
I added a new rule. "Do not remove any line containing idempotency_key unless a human has explicitly approved it in a review comment."
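A prompt rule is only a request, so a rule like this can also be enforced mechanically. The sketch below is one way to do that, assuming the proposed change is available as unified diff text. The function name and the sample diff are hypothetical.

```ruby
# Hypothetical guard: scans unified diff text for removed lines that
# mention idempotency_key. The function name is an assumption.
def removed_idempotency_lines(diff_text)
  removed = diff_text.each_line.select do |line|
    # Removed lines start with "-"; lines starting with "---" are file headers.
    line.start_with?("-") && !line.start_with?("---") &&
      line.include?("idempotency_key")
  end
  removed.map(&:chomp)
end

# Made-up diff resembling the change that slipped through.
sample_diff = <<~DIFF
  --- a/app/services/payment_processor.rb
  +++ b/app/services/payment_processor.rb
  @@ -10,4 +10,3 @@
   def charge(params)
  -  return if seen?(params[:idempotency_key])
     gateway.charge(params)
   end
DIFF

flagged = removed_idempotency_lines(sample_diff)
puts "Needs human approval:", flagged unless flagged.empty?
```

A check like this is crude. It also flags harmless renames and moved lines, which is acceptable when the cost of a miss is a double charge.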
Week 2: Finding the Hidden Bugs
After fixing the guardrails, things improved. The agent started identifying dead code. It found three methods in the Invoice model that had not been called since 2019.
It also spotted an N+1 query problem in the invoice generation loop. I had missed this for five years. The AI suggested adding .includes(:line_items) to the ActiveRecord query.
This single change reduced our invoice generation time from 4 seconds to 200 milliseconds.
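If the N+1 pattern is unfamiliar, the sketch below simulates it without Rails. FakeDb is a made-up stand-in that only counts queries. It is not ActiveRecord, and the data is invented. It shows why one batched load beats one query per invoice.

```ruby
# Simulation only: FakeDb is a hypothetical stand-in for the database.
class FakeDb
  attr_reader :query_count

  def initialize(line_items_by_invoice)
    @line_items_by_invoice = line_items_by_invoice
    @query_count = 0
  end

  # One query to list the invoices: the "1" in N+1.
  def invoice_ids
    @query_count += 1
    @line_items_by_invoice.keys
  end

  # One query per invoice: the "N" in N+1.
  def line_items_for(invoice_id)
    @query_count += 1
    @line_items_by_invoice.fetch(invoice_id)
  end

  # One batched query for all invoices, similar in spirit to eager loading.
  def line_items_for_all(invoice_ids)
    @query_count += 1
    @line_items_by_invoice.slice(*invoice_ids)
  end
end

def totals_lazy(db)
  db.invoice_ids.to_h { |id| [id, db.line_items_for(id).sum] }
end

def totals_eager(db)
  ids = db.invoice_ids
  items = db.line_items_for_all(ids)
  ids.to_h { |id| [id, items.fetch(id).sum] }
end

data = { 1 => [100, 250], 2 => [75], 3 => [10, 20, 30] }
lazy_db = FakeDb.new(data)
eager_db = FakeDb.new(data)
totals_lazy(lazy_db)
totals_eager(eager_db)
puts "lazy: #{lazy_db.query_count} queries, eager: #{eager_db.query_count} queries"
```

The lazy version issues one query per invoice plus one for the list, so the count grows with the number of invoices. The eager version stays at two queries regardless. In ActiveRecord, includes(:line_items) asks the framework to do that batched load for you.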
I felt a mix of pride and shame. Pride that the system was faster. Shame that I had not caught this earlier.
Here is the data from our staging environment during week two:
| Metric | Before AI Refactor | After AI Refactor | Change |
|---|---|---|---|
| Avg Invoice Gen Time | 4.1s | 0.2s | -95% |
| Cyclomatic Complexity | 42 | 18 | -57% |
| Test Suite Duration | 12m 30s | 11m 45s | -6% |
| Human Review Time | 0m/day | 45m/day | +45m/day |
The test suite duration did not drop much because the AI added more granular tests. It believed that more tests equaled better safety. In this case, it was right.
Week 3: The Style War
By week three, the code looked different. The AI preferred functional styles over object-oriented patterns. It replaced many if/else blocks with hash lookups.
Ruby developers know this as a stylistic choice. But our team had conventions. We used classes for complex logic. The AI kept turning classes into hashes.
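Here is the kind of rewrite I mean, as a small sketch. The plan names and fee amounts are invented for illustration and do not come from our billing code.

```ruby
# Hypothetical example: plan names and fees are made up.

# Before: a conditional chain.
def fee_with_conditionals(plan)
  if plan == :starter
    0
  elsif plan == :team
    500
  elsif plan == :enterprise
    2_000
  else
    raise ArgumentError, "unknown plan: #{plan}"
  end
end

# After: the same mapping as data.
FEES = { starter: 0, team: 500, enterprise: 2_000 }.freeze

def fee_with_lookup(plan)
  # The block runs only when the key is missing.
  FEES.fetch(plan) { raise ArgumentError, "unknown plan: #{plan}" }
end
```

Both versions behave the same for these inputs. The lookup is shorter, but it only works while each branch is a plain value. Once a branch needs real behavior, a class is the better home, which is why our conventions favor classes for complex logic.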
I had to spend time teaching the agent our style guide. I fed it our top 10 most recent approved PRs as few-shot examples.
Once it understood the pattern, the quality jumped. It stopped fighting our conventions. It started writing code that looked like I wrote it, but cleaner.
It also documented everything. Every method got a YARD docstring. Most were generic, but some were surprisingly insightful. It explained why a specific regex was used for email validation. I had forgotten why we used that specific regex. The AI inferred it from the test cases.
Week 4: The Final Push
In the final week, I asked the agent to tackle the hardest part. The tax calculation logic. This code had 15 nested conditionals based on state, country, and product type.
The agent proposed a complete rewrite using a rule engine pattern. It moved the logic out of code and into a YAML configuration file.
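To give a sense of the idea, here is a minimal sketch of a YAML-driven rule lookup. It is not the agent's actual proposal. The rule format, the first-match-wins ordering, the region codes, and the rates are all placeholders I made up, and none of them are real tax rates.

```ruby
require "yaml"

# Hypothetical rules: codes and rates are placeholders, not real tax data.
# Rules are ordered from most specific to least specific.
RULES_YAML = <<~YAML
  - country: XX
    state: S1
    product_type: software
    rate: 0.1
  - country: XX
    product_type: software
    rate: 0.05
  - country: YY
    rate: 0.2
YAML

RULES = YAML.safe_load(RULES_YAML).freeze

# First matching rule wins. A rule matches when every condition it lists
# (every key except "rate") equals the corresponding input attribute.
def tax_rate(attrs, rules = RULES)
  rule = rules.find do |candidate|
    conditions = candidate.reject { |key, _| key == "rate" }
    conditions.all? { |key, value| attrs[key] == value }
  end
  raise ArgumentError, "no tax rule for #{attrs}" unless rule

  rule.fetch("rate")
end

puts tax_rate({ "country" => "XX", "state" => "S1", "product_type" => "software" })
```

The nested conditionals become an ordered list that can be read top to bottom. The trade-off is that rule order now carries meaning, and a misplaced entry changes results without any code diff to review.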
I was skeptical. Moving logic