DEV Community

Cover image for Our VP Said AI Would Test Itself. I Raised My Hand. I Got Reassigned. Day 3 Cost $2.8M. I Had the Screenshots Ready.
xulingfeng
xulingfeng

Posted on

Our VP Said AI Would Test Itself. I Raised My Hand. I Got Reassigned. Day 3 Cost $2.8M. I Had the Screenshots Ready.

Based on real software development trends. About a VP of Engineering who believed AI would verify its own output, 47 TODOs that shipped to production, and a $2,800,000 discount calculation error that nobody caught.

This story is based on a submission from a community member. If you have a similar story or something you need to get off your chest — reach out. The next one could be yours.


Act 1 · The Tech Meeting

"Starting today — no more hand-written code."

Marcus, the new VP of Engineering, put a slide up on the big screen.

Four words: WRITING BY HAND IS OVER.

I was sitting in the back row, against the wall. Seven years at this company. Three core modules that I'd built from scratch. Two production systems that ran the company's primary revenue stream. Now someone was telling me — don't write anymore.

The room went quiet for about five seconds. Then people started whispering. Someone pulled out a phone and took a picture of the slide.

Marcus added: "AI coding isn't optional — it's a mandatory development standard. We benchmarked this. AI writes code 400% faster than humans. Anyone still typing manually is wasting the company's time."

I raised my hand.

"Who reviews the code?"

"AI reviews it."

"Who writes the tests?"

"AI tests itself."

"What if AI writes something wrong?"

Marcus laughed. Not a polite laugh. The kind of laugh you give someone whose question you've already decided doesn't matter.

"Let me ask you something."

He paused.

"Do you really think — you, one person — have more training data than Orion-7? "

People started laughing. Not supportive laughter. Pile-on laughter.

"Or do you think the world's AI companies — hundreds of billions in investment, tens of thousands of GPUs — built something that's less reliable than one backend developer?"

Nobody was looking at me anymore. Everyone was watching him, waiting for the kill shot.

He didn't take it. He just smiled. "Starting next sprint, it's AI across the board. Anyone who has concerns — my door's open."


Act 2 · The Reassignment

I didn't go to his door.

HR notified me the next day: I was being moved to the Legacy Systems team.

Three people. Three projects that hadn't been touched in five years. No new requirements. No QA handoffs. No deadlines — because nobody used the software.

I handed off my work and sat down at my new desk.

A beat-up laptop sat there. Screen cracked in the bottom-right corner.

I opened my email — three new messages. All from the repo: my write access had been revoked on all three legacy repositories. Review-only.

I went to find Marcus.

"About the repo permissions —"

"Oh, right." He looked like he'd been waiting for me. "New system code runs under the new policy. You're not familiar with the AI workflow, so I set you up with read-only access to the legacy repos."

He leaned back in his chair.

"Oh, and I cancelled your Copilot license. Those three legacy projects — not worth the GPU cycles for AI anyway. Write it by hand. You don't trust AI anyway."

He smiled. He knew exactly what that line would do. He was waiting to see if I'd say something.

I didn't.

I went back to my desk, opened the cracked laptop, connected to the internal network, and cloned the new system's main repo. Public read-only access. He'd forgotten to lock it down.


Act 3 · Building a Case

The new team delivered the order module in two weeks.

The CEO posted in the company-wide Slack channel:

"New order module is live. AI-generated code end-to-end. Zero-defect delivery. Let's give the team a round of applause."

The channel filled with clapping emojis. Someone @'d Marcus: "That's insane efficiency."

I was in the channel. I didn't say anything.

What the CEO called "live" wasn't a full cutover. It was a canary — the new module running alongside the old system. Low traffic. Simple orders. Nothing that would hit an edge case.

But the code was sitting on the server.

I opened a terminal in the new system's code directory and ran one line:

pytest tests/ --tb=short -q
Enter fullscreen mode Exit fullscreen mode

Output:

..............F.......FFF......
====== 8 failed, 15 passed in 2.87s ======
Enter fullscreen mode Exit fullscreen mode

15 passed, 8 failed.

I ran another line:

grep -rn "TODO" src/ | wc -l
Enter fullscreen mode Exit fullscreen mode

47.

One more — I hit the API:

curl -s https://new-api.xxx.com/v2/order/123456 | jq '.discount_detail'
Enter fullscreen mode Exit fullscreen mode

Response:

{
  "order_id": "ORD-2026-0614-0123456",
  "subtotal": 12500.00,
  "line_items": [
    {"sku": "A-200", "qty": 10, "price": 800.00},
    {"sku": "B-150", "qty": 5, "price": 900.00}
  ],
  "discounts": [
    {"type": "promo_early", "amount": 500.00},
    {"type": "member_gold", "amount": 1875.00},
    {"type": "volume_tier_3", "amount": 1250.00}
  ],
  "discount_detail": null
}
Enter fullscreen mode Exit fullscreen mode

The API docs said: "discount_detail: object, required. Contains breakdown of all applied discounts." It returned null.

I stared at the screen for thirty seconds.

Then I created a folder.

Called it evidence/.

I opened my email. Wrote to Marcus with the subject "New Order Module — Test Failures Found." Attached the pytest output, the TODO count, and the null response.

He didn't reply.

For the next month, nobody touched that code. Marcus was preparing the Q3 board deck. The AI team was writing prompts for the next module. And me — I was on Legacy Systems. My job was to not touch anything.


Act 4 · The Glory

The new order module ran canary for a month. Low traffic. Simple orders. Zero incidents.

The CEO was happy. Marcus asked for a product demo.

The screen in the conference room showed a perfect happy path. Clean data. Discount calculations stacking correctly. Coupons applying. One-click ordering. Everything worked.

The CEO said: "You're two months ahead of my timeline."

Marcus smiled.

"Because I used AI. Some people are still writing code by hand. You can see the difference, right?"

He glanced at the back row.

I was in the back row. What that look meant — everyone in the room knew.

I stood up.

"The module — "

Marcus cut me off.

"— has a problem? Zero incidents in canary. Full green on the test dashboard. You've been on legacy — have you even looked at the new code?"

The CEO glanced at me. Then turned back.

"Marcus — run through the full Q3 cutover plan again."

I stood there for maybe three seconds. Nobody looked at me. Nobody asked if I had more to say.

I sat down.

The CEO made the call: "Full Q3 cutover. Start sunsetting the legacy systems."

Marcus got his scope expanded — he now owned the entire engineering org. No more CTO sign-off needed.

People patted him on the shoulder on the way out. "Well deserved." He said thanks. He walked right past me — he didn't need to look at me anymore.


Act 5 · The Crash

Day 3 of the full cutover.

4:27 PM. The CFO walked straight into the CEO's office.

Not a complaint. Worse.

"A major client called me directly. Every order we've processed in the last three days has the wrong discount amount. Not some of them — every single one. "

Root cause analysis:

AI-generated code was computing compound discounts
with no defined calculation order.
The prompt didn't specify:
  "calculate line-item discounts first → then subtotal → then member tier."
The model guessed an order.
It guessed a different one on every request.
Enter fullscreen mode Exit fullscreen mode

Canary hadn't caught it — because canary traffic only hit simple orders with single discounts. Full production on day one brought in bulk orders from major clients. Every single one computed discounts differently.

Impact: 3 consecutive days. $2,800,000 in unreconciled billing.

Metric Value
Orders affected 1,247
Average discrepancy per order $2,246
Largest single order discrepancy $18,740
Enterprise clients impacted 16
Longest uncorrected window 72 hours

The CEO called Marcus into his office. I wasn't in the room — but someone told me later he came out looking pale. CEO had HR schedule a postmortem.


Act 6 · The Reckoning

Postmortem. The CEO chaired. Marcus sat in the front row. The CTO, CFO, finance team, and the enterprise account managers filled the room.

I was there too. Back row.

Marcus opened: "The prompt missed a business rule. The team is fixing it now."

The CTO asked: "Canary ran for a month. How did nobody catch this?"

A pause. "The canary environment didn't have complex discount data."

"So your logic is — if the data wasn't in canary, it wouldn't be a problem in production?"

The room went quiet.

I raised my hand.

"Can I share something?"

Marcus turned and looked at me.

The CTO nodded. "Put it on screen."

I walked to the front. Plugged my laptop into the projector.

Slide 1:

# src/orders/discount_calculator.py:127

def apply_compound_discount(order: dict, user: dict) -> float:
    """
    Apply tiered discounts to an order.
    TODO: handle compound discount ordering — currently iterates
          over discount fields in JSON insertion order, which means
          the calculation sequence depends on how the client serialized
          the payload. This is not guaranteed to be consistent across
          requests. Needs explicit ordering: line_item → subtotal → member.
    """
    total = order["subtotal"]
    for key, rule in order.get("discounts", {}).items():
        total -= rule["amount"]  # ← order depends on dict insertion order
    return max(total, 0.0)
Enter fullscreen mode Exit fullscreen mode

"This isn't a bug. It's a TODO. There were 47 of these in the codebase when it shipped to production. Distributed across every critical path in the module."

Slide 2:

# tests/test_discount_complex/test_discount_order.py — last modified 2019

def test_compound_discount_sequence():
    """Verify discount application order: line_item → subtotal → member.

    This sequence was confirmed with Finance in 2019 and must not
    be changed without cross-team approval.
    """
    order = sample_order(
        items=[{"sku": "A", "price": 100.00, "qty": 5}],
        subtotal_discount_pct=10,
        member_tier="gold",
    )
    result = legacy_discount_engine.apply(order)

    # Expected: 500 * 0.9 (subtotal) * 0.85 (member) = 382.50
    assert abs(result.final - 382.50) < 0.01
    assert result.application_log == [
        "line_item",
        "subtotal",
        "member",
    ]
Enter fullscreen mode Exit fullscreen mode

"The old system had this test. Written five years ago."

Slide 3:

vibe-coding-take-down/evidence/
├── screenshot-01-unit-test-failures.png  (15 pass, 8 fail)
├── screenshot-02-todo-list.png           (47 TODOs)
└── screenshot-03-api-null.png            (API returning null)

Generated: 27 days before production cutover.
Enter fullscreen mode Exit fullscreen mode

"Twenty-seven days before this went to production — I ran the tests. Fifteen passed, eight failed. Forty-seven TODOs. I screenshotted every single one."

I let that sit.

"Twenty-seven days. These tests sat in CI failing for almost a month. Did anyone open them? "

Silence.

I asked again: "Did anyone look at the test report?"

Marcus didn't look at me. He looked at the CTO.

"During canary… we were focused on business metrics. Unit tests weren't fully executed. The timeline was tight."

I didn't answer him. I put up slide four.

Slide 4: a screenshot of a Slack message.

Marcus's own message, sent before AI development began.

"Don't spend too much time on testing. AI generates its own code — it'll verify itself. Let's ship first, we'll backfill tests later."

Nobody moved. Nobody spoke.

I closed my laptop.

"Marcus. You said AI would test itself. Forty-seven TODOs unaddressed. Eight failing tests. Two point eight million dollars lost. What exactly did your AI verify? "

Marcus didn't answer.

The CEO spoke. He wasn't looking at Marcus. He was looking at me.

"When did you build that folder?"

"Twenty-seven days before production."

"Why didn't you say anything?"

"I did. At the demo. I stood up and tried to say something. Marcus cut me off. Nobody let me finish. "


Act 7 · The Aftermath

The CEO sent a company-wide email the next morning:

"Effective immediately: all AI-generated code must pass a line-by-line review by a senior engineer before merging to main. Test coverage requirements are non-negotiable. This policy cannot be overridden."

Marcus kept his title.

But the same day that email went out, HR released an updated org chart. A new Code Review Board, independent from Engineering, reporting directly to the CTO. I was put in charge.

Three people on the board. The other two came with me from Legacy Systems — solid engineers who never learned to play the room.

Marcus's team no longer had final say on what got merged.

Someone asked me later: "You waited a month for one meeting? Was it worth it?"

I said: "I didn't wait a month for a chance to prove I was right."

"I waited for him to say 'AI will test itself' in front of the whole company. I wanted to watch him swallow every word."

I deleted the evidence folder later. Didn't need it anymore.

The next day I walked past Marcus's office. The door was open. He was sitting at his monitor, reading a document about the new code review process — the one I'd written.


Forty-seven TODOs. Not a single one became a bug. The AI just guessed a calculation order where no rule was specified — a different guess on every request. When the process itself doesn't exist, AI will help you prove that "doesn't exist" is good enough.


Have you ever seen someone package a process failure as a technology belief? What happened next?

It took me thirty seconds to create an evidence/ folder — and one second to delete it. Forty-seven TODOs taught me one thing: the only calculation order I trust is the one my coffee cup follows.Buy me a coffee ☕

Follow for more stories about AI testing, quality engineering, and what happens when the code is generated but not verified.

Top comments (0)