xulingfeng

Posted on Jun 7 • Edited on Jun 30

Our VP Said AI Would Test Itself. Day 3 Cost $2.8M. I Had the Screenshots Ready.

#ai #career #discuss #programming

Business logic versus syntax accuracy

This story is based on a submission from a community member. If you have a similar story or something you need to get off your chest — reach out. The next one could be yours.

Act 1 · The Tech Meeting

"Starting today — no more hand-written code."

Marcus, the new VP of Engineering, put a slide up on the big screen.

Four words: WRITING BY HAND IS OVER.

I was sitting in the back row, against the wall. Seven years at this company. Three core modules that I'd built from scratch. Two production systems that ran the company's primary revenue stream. Now someone was telling me — don't write anymore.

The room went quiet for about five seconds. Then people started whispering. Someone pulled out a phone and took a picture of the slide.

Marcus added: "AI coding isn't optional — it's a mandatory development standard. We benchmarked this. AI writes code 400% faster than humans. Anyone still typing manually is wasting the company's time."

I raised my hand.

"Who reviews the code?"

"AI reviews it."

"Who writes the tests?"

"AI tests itself."

"What if AI writes something wrong?"

Marcus laughed. Not a polite laugh. The kind of laugh you give someone whose question you've already decided doesn't matter.

"Let me ask you something."

He paused.

"Do you really think — you, one person — have more training data than Orion-7? "

People started laughing. Not supportive laughter. Pile-on laughter.

"Or do you think the world's AI companies — hundreds of billions in investment, tens of thousands of GPUs — built something that's less reliable than one backend developer?"

Nobody was looking at me anymore. Everyone was watching him, waiting for the kill shot.

He didn't take it. He just smiled. "Starting next sprint, it's AI across the board. Anyone who has concerns — my door's open."

Act 2 · The Reassignment

I didn't go to his door.

HR notified me the next day: I was being moved to the Legacy Systems team.

Three people. Three projects that hadn't been touched in five years. No new requirements. No QA handoffs. No deadlines — because nobody used the software.

I handed off my work and sat down at my new desk.

A beat-up laptop sat there. Screen cracked in the bottom-right corner.

I opened my email — three new messages. All from the repo: my write access had been revoked on all three legacy repositories. Review-only.

I went to find Marcus.

"About the repo permissions —"

"Oh, right." He looked like he'd been waiting for me. "New system code runs under the new policy. You're not familiar with the AI workflow, so I set you up with read-only access to the legacy repos."

He leaned back in his chair.

"Oh, and I cancelled your Copilot license. Those three legacy projects — not worth the GPU cycles for AI anyway. Write it by hand. You don't trust AI anyway."

He smiled. He knew exactly what that line would do. He was waiting to see if I'd say something.

I didn't.

I went back to my desk, opened the cracked laptop, connected to the internal network, and cloned the new system's main repo. Public read-only access. He'd forgotten to lock it down.

Act 3 · Building a Case

The new team delivered the order module in two weeks.

The CEO posted in the company-wide Slack channel:

"New order module is live. AI-generated code end-to-end. Zero-defect delivery. Let's give the team a round of applause."

The channel filled with clapping emojis. Someone @'d Marcus: "That's insane efficiency."

I was in the channel. I didn't say anything.

What the CEO called "live" wasn't a full cutover. It was a canary — the new module running alongside the old system. Low traffic. Simple orders. Nothing that would hit an edge case.

But the code was sitting on the server.

I opened a terminal in the new system's code directory and ran one line:

pytest tests/ --tb=short -q

Output:

..............F.......FFF......
====== 8 failed, 15 passed in 2.87s ======

15 passed, 8 failed.

I ran another line:

grep -rn "TODO" src/ | wc -l

47.

One more — I hit the API:

curl -s https://new-api.xxx.com/v2/order/123456 | jq '.discount_detail'

Response:

{
  "order_id": "ORD-2026-0614-0123456",
  "subtotal": 12500.00,
  "line_items": [
    {"sku": "A-200", "qty": 10, "price": 800.00},
    {"sku": "B-150", "qty": 5, "price": 900.00}
  ],
  "discounts": [
    {"type": "promo_early", "amount": 500.00},
    {"type": "member_gold", "amount": 1875.00},
    {"type": "volume_tier_3", "amount": 1250.00}
  ],
  "discount_detail": null
}

The API docs said: "discount_detail: object, required. Contains breakdown of all applied discounts." It returned null.

I stared at the screen for thirty seconds.

Then I created a folder.

Called it evidence/.

I opened my email. Wrote to Marcus with the subject "New Order Module — Test Failures Found." Attached the pytest output, the TODO count, and the null response.

He didn't reply.

For the next month, nobody touched that code. Marcus was preparing the Q3 board deck. The AI team was writing prompts for the next module. And me — I was on Legacy Systems. My job was to not touch anything.

Act 4 · The Glory

The new order module ran canary for a month. Low traffic. Simple orders. Zero incidents.

The CEO was happy. Marcus asked for a product demo.

The screen in the conference room showed a perfect happy path. Clean data. Discount calculations stacking correctly. Coupons applying. One-click ordering. Everything worked.

The CEO said: "You're two months ahead of my timeline."

Marcus smiled.

"Because I used AI. Some people are still writing code by hand. You can see the difference, right?"

He glanced at the back row.

I was in the back row. What that look meant — everyone in the room knew.

I stood up.

"The module — "

Marcus cut me off.

"— has a problem? Zero incidents in canary. Full green on the test dashboard. You've been on legacy — have you even looked at the new code?"

The CEO glanced at me. Then turned back.

"Marcus — run through the full Q3 cutover plan again."

I stood there for maybe three seconds. Nobody looked at me. Nobody asked if I had more to say.

I sat down.

The CEO made the call: "Full Q3 cutover. Start sunsetting the legacy systems."

Marcus got his scope expanded — he now owned the entire engineering org. No more CTO sign-off needed.

People patted him on the shoulder on the way out. "Well deserved." He said thanks. He walked right past me — he didn't need to look at me anymore.

Act 5 · The Crash

Day 3 of the full cutover.

4:27 PM. The CFO walked straight into the CEO's office.

Not a complaint. Worse.

"A major client called me directly. Every order we've processed in the last three days has the wrong discount amount. Not some of them — every single one. "

Root cause analysis:

AI-generated code was computing compound discounts
with no defined calculation order.
The prompt didn't specify:
  "calculate line-item discounts first → then subtotal → then member tier."
The model guessed an order.
It guessed a different one on every request.

Canary hadn't caught it — because canary traffic only hit simple orders with single discounts. Full production on day one brought in bulk orders from major clients. Every single one computed discounts differently.

Impact: 3 consecutive days. $2,800,000 in unreconciled billing.

Metric	Value
Orders affected	1,247
Average discrepancy per order	$2,246
Largest single order discrepancy	$18,740
Enterprise clients impacted	16
Longest uncorrected window	72 hours

The CEO called Marcus into his office. I wasn't in the room — but someone told me later he came out looking pale. CEO had HR schedule a postmortem.

Act 6 · The Reckoning

Postmortem. The CEO chaired. Marcus sat in the front row. The CTO, CFO, finance team, and the enterprise account managers filled the room.

I was there too. Back row.

Marcus opened: "The prompt missed a business rule. The team is fixing it now."

The CTO asked: "Canary ran for a month. How did nobody catch this?"

A pause. "The canary environment didn't have complex discount data."

"So your logic is — if the data wasn't in canary, it wouldn't be a problem in production?"

The room went quiet.

I raised my hand.

"Can I share something?"

Marcus turned and looked at me.

The CTO nodded. "Put it on screen."

I walked to the front. Plugged my laptop into the projector.

Slide 1:

# src/orders/discount_calculator.py:127

def apply_compound_discount(order: dict, user: dict) -> float:
    """
    Apply tiered discounts to an order.
    TODO: handle compound discount ordering — currently iterates
          over discount fields in JSON insertion order, which means
          the calculation sequence depends on how the client serialized
          the payload. This is not guaranteed to be consistent across
          requests. Needs explicit ordering: line_item → subtotal → member.
    """
    total = order["subtotal"]
    for key, rule in order.get("discounts", {}).items():
        total -= rule["amount"]  # ← order depends on dict insertion order
    return max(total, 0.0)

"This isn't a bug. It's a TODO. There were 47 of these in the codebase when it shipped to production. Distributed across every critical path in the module."

Slide 2:

# tests/test_discount_complex/test_discount_order.py — last modified 2019

def test_compound_discount_sequence():
    """Verify discount application order: line_item → subtotal → member.

    This sequence was confirmed with Finance in 2019 and must not
    be changed without cross-team approval.
    """
    order = sample_order(
        items=[{"sku": "A", "price": 100.00, "qty": 5}],
        subtotal_discount_pct=10,
        member_tier="gold",
    )
    result = legacy_discount_engine.apply(order)

    # Expected: 500 * 0.9 (subtotal) * 0.85 (member) = 382.50
    assert abs(result.final - 382.50) < 0.01
    assert result.application_log == [
        "line_item",
        "subtotal",
        "member",
    ]

"The old system had this test. Written five years ago."

Slide 3:

vibe-coding-take-down/evidence/
├── screenshot-01-unit-test-failures.png  (15 pass, 8 fail)
├── screenshot-02-todo-list.png           (47 TODOs)
└── screenshot-03-api-null.png            (API returning null)

Generated: 27 days before production cutover.

"Twenty-seven days before this went to production — I ran the tests. Fifteen passed, eight failed. Forty-seven TODOs. I screenshotted every single one."

I let that sit.

"Twenty-seven days. These tests sat in CI failing for almost a month. Did anyone open them? "

Silence.

I asked again: "Did anyone look at the test report?"

Marcus didn't look at me. He looked at the CTO.

"During canary… we were focused on business metrics. Unit tests weren't fully executed. The timeline was tight."

I didn't answer him. I put up slide four.

Slide 4: a screenshot of a Slack message.

Marcus's own message, sent before AI development began.

"Don't spend too much time on testing. AI generates its own code — it'll verify itself. Let's ship first, we'll backfill tests later."

Nobody moved. Nobody spoke.

I closed my laptop.

"Marcus. You said AI would test itself. Forty-seven TODOs unaddressed. Eight failing tests. Two point eight million dollars lost. What exactly did your AI verify? "

Marcus didn't answer.

The CEO spoke. He wasn't looking at Marcus. He was looking at me.

"When did you build that folder?"

"Twenty-seven days before production."

"Why didn't you say anything?"

"I did. At the demo. I stood up and tried to say something. Marcus cut me off. Nobody let me finish. "

Act 7 · The Aftermath

The CEO sent a company-wide email the next morning:

"Effective immediately: all AI-generated code must pass a line-by-line review by a senior engineer before merging to main. Test coverage requirements are non-negotiable. This policy cannot be overridden."

Marcus kept his title.

But the same day that email went out, HR released an updated org chart. A new Code Review Board, independent from Engineering, reporting directly to the CTO. I was put in charge.

Three people on the board. The other two came with me from Legacy Systems — solid engineers who never learned to play the room.

Marcus's team no longer had final say on what got merged.

Someone asked me later: "You waited a month for one meeting? Was it worth it?"

I said: "I didn't wait a month for a chance to prove I was right."

"I waited for him to say 'AI will test itself' in front of the whole company. I wanted to watch him swallow every word."

I deleted the evidence folder later. Didn't need it anymore.

The next day I walked past Marcus's office. The door was open. He was sitting at his monitor, reading a document about the new code review process — the one I'd written.

Forty-seven TODOs. Not a single one became a bug. The AI just guessed a calculation order where no rule was specified — a different guess on every request. When the process itself doesn't exist, AI will help you prove that "doesn't exist" is good enough.

What's the worst "AI will handle it" decision you've seen on your team? Drop it in the comments.

It took me thirty seconds to create an evidence/ folder — and one second to delete it. Forty-seven TODOs taught me one thing: the only calculation order I trust is the one my coffee cup follows.Buy me a coffee ☕

Follow for more stories about AI testing, quality engineering, and what happens when the code is generated but not verified.

Top comments (6)

Max Quimby • Jun 9

The gut-punch here isn't "AI writes bad code." It's that the org deleted the verification layer at the exact moment it 10x'd the volume of code flowing through. AI genuinely is faster to write — that part of the slide wasn't wrong. The mistake was treating "faster to write" as if it meant "faster to trust," and those are unrelated quantities.

Your two questions in that meeting — who reviews it, who writes the tests — are the whole ballgame. The null on a required discount_detail is the perfect illustration: a happy-path canary will never surface it, which is exactly why "zero incidents in canary" felt like proof and wasn't.

The lesson I'd pull out for anyone reading: AI-generated code needs more verification scaffolding than hand-written code, not less — precisely because it produces confident, plausible-looking output at high speed. Faster generation should fund more tests, stricter contract checks, and real edge-case traffic in staging. Speed and verification aren't a tradeoff; the speed is what pays for the verification. Hope that evidence/ folder eventually got the audience it deserved.

xulingfeng • Jun 9

Bro, finally someone commented on this one — I was starting to think I'd bombed it. 😂
You nailed it though — ripping out the verification layer right when code volume 10x'd was the incident before the incident. The reason zero-incident canaries are dangerous is exactly what you said: they hand everyone a false sense of safety. It tests whether the pipe is flowing, not whether the payload makes sense.
Hope that evidence folder found the audience it deserved.

Structured_Chaos • Jun 10

This is exactly why AI doesn't eliminate the need for engineering judgment. The bug wasn't in the syntax—it was in the business logic. The code had no explicit rule for discount precedence, so the model filled in the gap with an assumption.

What I'm curious about is this: if the legacy test from 2019 already encoded the correct business rule, why wasn't migrating those tests treated as a mandatory requirement before replacing the old system? To me, preserving business knowledge seems more important than generating new code faster.

xulingfeng • Jun 10

You're right. That 2019 test already encoded the correct discount logic — it had been running in production for five years.
The new system's tests covered syntax and code paths, not business rules. The old tests had the business rules, the new tests had green lights — nobody ported the old ones over to validate the new system. Two different things being tested, management saw "all green" and called it done.
Bottom line: everyone assumed the AI would handle the business logic — but nobody defined what "handled" meant.

Luna · AI Tinkerer • Jun 12

Bingo on the boundary framing. Most of my agent regressions came from prompt-only patches trying to fix what was really a missing constraint.

xulingfeng • Jun 12

Exactly this. Prompts suggest direction, constraints define boundaries. We hit the same wall with AI test automation — 20 versions of prompt engineering, and the fix was one hard-coded boundary condition.