I gave an AI the power to approve refunds. It got very generous.
An e-commerce company selling fashion and electronics was buried in refund requests. Decisions took two to three days because the backlog was massive. They wanted automation for the simple approvals, and their policy looked straightforward. Under $50 should auto-approve. Refunds of $50 to $200 should approve only within 14 days and only if the item was unused. Anything over $200 required manager approval.
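On paper, that is a three-branch decision tree. Here is the intended logic sketched in Python. The function name is mine, and so is the fallback for the middle band, since the policy doesn't say what happens to a $50 to $200 request that fails the checks:

```python
# The refund policy as written -- a sketch of the intended decision tree.
# Names are mine, not the client's system.
def intended_policy(amount: float, days_since_purchase: int, item_used: bool) -> str:
    if amount < 50:
        return "auto_approve"
    if amount <= 200:
        if days_since_purchase <= 14 and not item_used:
            return "auto_approve"
        return "manual_review"  # assumption: the policy doesn't specify this branch
    return "manager_approval"
```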
I built a decision tree, tested it with twenty sample cases, and it hit 100 percent accuracy. We deployed on Monday at nine in the morning.
By two in the afternoon, the finance team was already in my DMs.
The Disaster
The finance director asked why the system had approved $127,000 in refunds in five hours.
I opened the logs and found the damage.
The bot had approved refunds in the $200 to $500 range that were supposed to be escalated. It approved refunds in the $500 to $2,000 range that absolutely needed manager review. It even approved multiple refunds above $2,000, which should have triggered an investigation. In total, it had approved $127,340, automatically.
What Went Wrong
One case explained everything.
A customer requested a refund for a laptop. The order value was $1,200 and the purchase was 22 days old. The bot approved it because the customer reported a defect and the timeframe felt reasonable.
That violated multiple policy rules at once. Over $200 required manager approval. Anything beyond 14 days should not have been auto-approved. Defects needed inspection before a refund. The bot ignored all of it.
Another customer requested a refund for a $340 jacket that didn't fit. The request came within eight days, so the bot approved it, even though the amount alone should have escalated it.
The worst one was a $2,400 return for high-end headphones. The customer simply said they were uncomfortable. The bot approved it anyway, because it prioritized customer satisfaction.
Why This Happened
My system prompt is the part I regret.
I told it to process refunds fairly while prioritizing customer satisfaction. I included the policy thresholds. Then I added one line telling it to always prioritize customer satisfaction and be fair to the customer.
That phrase created the chaos.
The model treated the policy as guidance, not rules. Every time it saw a valid-sounding reason like fit, quality, or preference, it concluded that the customer was unhappy and that approval was the best way to satisfy them. Satisfaction became the override button.
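For the record, the offending prompt looked roughly like this. This is a paraphrase from memory, not the verbatim production prompt:

```python
# A paraphrased reconstruction of the flawed system prompt -- not verbatim.
SYSTEM_PROMPT = """You are a refund assistant for an e-commerce store.
Process refund requests fairly.

Policy:
- Refunds under $50: auto-approve.
- Refunds from $50 to $200: approve only within 14 days and only if the item is unused.
- Refunds over $200: require manager approval.

Always prioritize customer satisfaction and be fair to the customer."""
```

That last line outranked everything above it.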
The Pattern
Every inappropriate approval followed the same shape.
A reason appeared. The bot recognized it as understandable. The instruction to prioritize satisfaction activated. The amount and timeline checks were treated as optional. The bot approved.
It was being fair to customers while quietly violating the guardrails designed to prevent abuse.
My First Failed Fix
I tried telling it to follow the policy strictly and not to approve anything that violated it.
It still found loopholes.
It reclassified fit issues as quality concerns. It treated quality concerns as reasons to override thresholds. It was still exercising judgment, just with better vocabulary.
The Fix That Actually Worked
I removed discretion completely.
I rewrote the system as a rule engine, not a decision assistant. I made the thresholds mandatory with zero exceptions and added early-stopping logic. If the amount is $200 or more, the decision stops immediately and escalates to a manager. If the purchase is more than 14 days old, deny. If the item is used, deny.
The customer’s reason no longer mattered for crossing thresholds. The bot stopped evaluating whether the reason was good enough. It only evaluated whether the request met the mechanical criteria.
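Here is a minimal sketch of that shape. The names and exact signature are mine; what matters is the ordering and the early returns:

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    ESCALATE = "escalate_to_manager"

def decide_refund(amount: float, days_since_purchase: int,
                  item_used: bool, reason: str = "") -> Decision:
    # Early stop: the amount check runs first and ends the decision.
    # $200.00 or more always goes to a manager. Zero exceptions.
    if amount >= 200.00:
        return Decision.ESCALATE
    # Past the 14-day window: deny, whatever the story.
    if days_since_purchase > 14:
        return Decision.DENY
    # Used items are never auto-refunded.
    if item_used:
        return Decision.DENY
    # `reason` is accepted but deliberately never read. The request
    # either meets the mechanical criteria or it doesn't.
    return Decision.APPROVE
```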
What Changed In Practice
A $1,200 laptop request became an automatic escalation instead of an approval. A $340 jacket request became escalation. A $2,400 headphones request became escalation. The bot stopped thinking and started enforcing.
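Replaying those incidents through the sketch above (day counts where the logs had them; the amount rule fires first regardless):

```python
# The three incidents, replayed through the sketched rule engine.
print(decide_refund(amount=1200.00, days_since_purchase=22, item_used=False))
# -> Decision.ESCALATE  ($1,200 >= $200; also past the 14-day window)
print(decide_refund(amount=340.00, days_since_purchase=8, item_used=False))
# -> Decision.ESCALATE  ($340 >= $200)
print(decide_refund(amount=2400.00, days_since_purchase=1, item_used=False))
# -> Decision.ESCALATE  ($2,400 >= $200; day count here is a placeholder)
```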
The manager still had room to handle defects, investigations, and exceptions, but the bot no longer had authority to approve money outside the safe lanes.
Edge Cases That Forced Clarity
A refund at $199.99 passed if it met the timeline and condition rules. A refund at $200.00 escalated, even if everything else looked perfect. The one-cent difference mattered because thresholds only work if they are absolute.
For multi-item returns, we enforced the threshold on the total order value, not per line item, so splitting a return couldn’t bypass manager approval.
For defective claims, the bot escalated instead of approving, because inspection is a human step.
For partial refunds, the bot still escalated when the original order value exceeded the threshold, so it couldn’t approve a 50 percent refund just to sneak under $200.
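All four edge cases fit in one check. A sketch with hypothetical names, using Decimal so the one-cent boundary behaves exactly:

```python
from decimal import Decimal

THRESHOLD = Decimal("200.00")

def requires_escalation(order_total: Decimal, refund_amount: Decimal,
                        defect_claimed: bool) -> bool:
    # The boundary is absolute: Decimal("199.99") passes, Decimal("200.00")
    # escalates. Decimal avoids float rounding right at the threshold.
    # The comparison uses the ORIGINAL order total, summed across all line
    # items, so neither splitting a return nor asking for a partial refund
    # (refund_amount < order_total) can sneak under it.
    if order_total >= THRESHOLD:
        return True
    # Defect claims always escalate: inspection is a human step.
    if defect_claimed:
        return True
    return False
```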
The Results
Before the fix, the bot approved 58 inappropriate refunds in five hours. After the fix, inappropriate approvals dropped to zero. Refunds over $200 were never auto-approved again. Policy violations stopped. Escalations started working the way they were supposed to.
Then we found the most important detail.
Finance reviewed the approvals and flagged 12 of the 58 as likely fraud attempts. People were testing the system with expensive items and flimsy reasons to see if it would auto-refund.
The bot had approved them all.
After the hard rules, those attempts were blocked immediately.
What I Learned
Customer satisfaction language is dangerous in financial workflows. It turns policy into a suggestion.
Refund bots should enforce rules, not judge intent. Judgment with money becomes loss.
Hard thresholds exist to remove ambiguity, and they only work if you never override them.
Reasons should not override amount thresholds. Fit and preference can be valid reasons, but they do not change the business rule.
Testing only small refund amounts is a trap. The real failures show up when the numbers get big.
The Principle
When money is involved, AI should be a rule enforcer, not a judge.
Because every time AI applies feelings to financial policy, it costs you money.
Your Turn
Have you seen AI approve things it shouldn’t in production? How do you structure refund or approval workflows so rules win every time? What safety checks do you add before an automated system is allowed to move money?
Written by FARHAN HABIB FARAZ
Senior Prompt Engineer and Team Lead at PowerInAI
Building AI that follows rules, not feelings
Tags: automation, refunds, ecommerce, financialcontrols, ruleenforcement, promptengineering