Based on real QA scenarios. A story about what happens when AI-generated metrics replace real testing, while the quiet engineer in the back row has been running his own numbers the whole time.
Act 1: The Review Meeting
I was sitting at the back of the long table, a ThinkPad in front of me, screen dimmed.
On the big screen, Zhang Lei was presenting the acceptance data for his "AI Automated Testing Platform." His delivery was smooth. Every slide was a beautiful chart — coverage trends, automation rate improvements, regression testing time curves. All three lines pointed up and to the right, exactly like the textbook ideal curves.
"In the past three months, the AI testing platform has executed 47,000 test cases, achieving 97.3% functional coverage. Regression testing time has dropped from 12 hours to 2.1 hours."
Sparse applause.
Zhang Lei added the final slide: "Monthly savings: approximately 200 person-days in labor cost."
General Manager Zhou nodded and started the applause. That number was what he cared about most.
I glanced at the other end of the table — the client's representative from RuiJie Technology. Chief Engineer Shen. Early fifties, thinning on top, silver-rimmed glasses. He hadn't said a word through the entire presentation. Hands folded on the table, occasionally jotting notes in a small book.
Zhang Lei opened the Q&A slide and looked around the room: "Any questions?"
Chief Engineer Shen flipped through the printed materials in front of him, stopped at the appendix, and looked up.
"Page 47, Table 3.2 — what's the confidence interval on that 97.3% coverage?"
The room went silent for about 15 seconds.
Not the kind of silence where people are thinking. The kind where nobody had ever thought about it. Zhang Lei stood by the projector, clicker still in his hand, and paused for two seconds before answering:
"Uh... the model confidence is quite high. The specific number is in the technical report."
"Which page?"
"I'll need to look it up."
Chief Engineer Shen didn't push further. He looked down and kept writing.
General Manager Zhou smoothed it over: "We'll align on the technical details later. The overall direction is solid."
As the meeting broke up, Chief Engineer Shen walked past my end of the table. He glanced at me. Didn't say a word. Walked out.
I closed my laptop, tore the page I'd been calculating out of my notebook, and folded it into my pocket. One sheet of paper. On the left side, Zhang Lei's 97.3%. On the right side, the numbers I'd actually run myself.
That number was under 30%.
Act 2: Three Months Earlier
Zhang Lei had arrived six months ago, on a headcount slot allocated from headquarters. Title: "AI Testing Architect." Rumor had it he came from HQ's AI lab. His résumé said he'd "led the rollout of a multi-million-dollar AI testing platform."
From day one, he pushed a plan: use large language models to auto-generate test cases, covering all end-to-end workflows. General Manager Zhou approved a budget — four A100s, a small GPU cluster, software licenses. Roughly $110,000 all in.
"Traditional testing is too slow," Zhang Lei said in the kickoff meeting. "One person writing cases one at a time, a week per iteration. AI generates 5,000 cases overnight. Iteration sign-off in one day."
People on the team had doubts. But none of the doubters could out-argue him. He sounded too convincing, and his slides were too polished.
I said nothing.
My desk was in the back corner by the window. Two worn notebooks sat on it year-round. One held testing strategy notes from six systems over six years: boundary conditions for every module, postmortems on production incidents, regression traps I'd stepped in. The other wasn't anything official. It was my "why" notebook. Why does this module keep producing boundary bugs? Why does that flow always end up with gaps in its tests? Why does coverage start hitting diminishing returns at 70%?
Three weeks into Zhang Lei's AI platform going live, my notebook was about two-thirds full.
I ran the three core workflows through my own test environment and pulled the actual coverage numbers for every module. Real coverage came in under 30%. The reason was simple: over 70% of the AI-generated test cases were equivalence class duplicates. The numbers looked big. The reports looked beautiful. But the most critical core flows? Not a single case covered them.
The coverage was fabricated — not by Zhang Lei himself, but by his "AI reporting template." The template automatically bumped coverage to 90%+, regardless of what had actually been executed.
I put the notebook back in my drawer. Didn't report it.
Because in that room, what I said didn't carry as much weight as a well-designed PDF.
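For readers who want to run the same kind of check, here is a minimal sketch of the duplication audit described above. It is not the platform's code or my notebook's exact method. The case format (`flow`, `partitions`) and the function names are assumptions made up for illustration: each case is tagged with the workflow it exercises and the input partition it picks for each field, and two cases count as duplicates when both match.

```python
from collections import defaultdict


def equivalence_key(case):
    """Collapse a case to its equivalence class: same flow, same input partitions.

    The 'flow' and 'partitions' fields are a hypothetical case format.
    """
    return (case["flow"], tuple(sorted(case["partitions"].items())))


def duplication_report(cases, required_flows):
    """Count unique equivalence classes and list required flows with no case at all."""
    classes = defaultdict(list)
    for case in cases:
        classes[equivalence_key(case)].append(case["id"])

    total = len(cases)
    unique = len(classes)
    # Every case beyond the first in a class adds volume but no new coverage.
    duplication_rate = (total - unique) / total if total else 0.0
    covered_flows = {key[0] for key in classes}
    uncovered = sorted(set(required_flows) - covered_flows)
    return {
        "total": total,
        "unique_classes": unique,
        "duplication_rate": duplication_rate,
        "uncovered_flows": uncovered,
    }


# Made-up sample data: three cases that test the same thing, one that differs.
cases = [
    {"id": 1, "flow": "login", "partitions": {"password": "valid"}},
    {"id": 2, "flow": "login", "partitions": {"password": "valid"}},
    {"id": 3, "flow": "login", "partitions": {"password": "valid"}},
    {"id": 4, "flow": "login", "partitions": {"password": "empty"}},
]
report = duplication_report(cases, required_flows=["login", "checkout"])
print(report)
```

The point of the sketch is that a raw case count says nothing by itself. Counting unique classes and checking them against a list of flows that must be covered is what exposes both the padding and the holes.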
Act 3: 72 Hours
Friday, 6 PM. I was about to shut down my machine.
My phone buzzed — Chief Engineer Shen calling me directly. First time I'd ever gotten his number.
"Can you come to RuiJie's server room?"
His voice was flat, but there was exhaustion underneath it. I didn't ask questions. Fifteen minutes later, I was at their building.
The server room had seven or eight people in it — General Manager Zhou, Zhang Lei, a few of RuiJie's ops engineers, and Chief Engineer Shen himself. Zhang Lei was on the phone, his voice a little strained. Chief Engineer Shen stood by a rack, screen showing the monitoring panel for the production environment that had gone live two days earlier.
Red everywhere.
89% timeouts. 43% error rate. Three core services all down.
Chief Engineer Shen scrolled down two screens on the monitor and spoke quietly, but everyone heard:
"You said 97.3% coverage. We put three core business flows into production based on that. Within 72 hours, all of them collapsed. Can you explain why everything the AI platform missed exploded in production?"
Nobody answered.
General Manager Zhou looked at Zhang Lei. Zhang Lei was still on the phone. Zhou couldn't find words either.
Chief Engineer Shen scanned the room. His eyes stopped on me.
He already knew who I was. Not because I was standing at the edge. Because at the review meeting, he'd noticed that one person in the room wasn't clapping — he was writing numbers the whole time.
"You," he said. "Come here."
Act 4: The Weekend
I pulled the two notebooks out of my bag.
"Give me three standard workstations. No GPUs needed."
"How long?"
"Monday morning."
General Manager Zhou looked at me with a complicated expression, probably because he'd never noticed there was someone like this in his own company, and now it was the client who had singled me out.
I didn't sleep much that weekend. But not because I was short on time. Most of the work had already been done three months earlier. In my notebook, I had test cases, boundary conditions, and exception paths for all six core modules: organized, labeled, with incident notes in red. All I still had to do was validate the environment and account for the latest API changes.
Sunday, 11 PM. I finished the last regression run.
Actual coverage from my own suite: 92.7%.
Not from 5,000 AI-generated cases. From 347. Every single one a real scenario I'd decomposed myself. Equivalence class duplicates: near zero. Boundary coverage: 100%.
Three standard workstations. Zero additional hardware spend.
Act 5: Monday
RuiJie's validation meeting room.
Left side of the screen: Zhang Lei's "AI testing acceptance data" from three months ago. Right side: my actual results from Sunday night.
| Metric | AI Platform Report | Actual |
|---|---|---|
| Feature Coverage | 97.3% | 28.7% |
| Core Flow Coverage | Marked "All Covered" | 0 cases |
| Equivalence Class Duplication | Not disclosed | 71.4% |
| Boundary Coverage | Not disclosed | 6.3% |
| False Positive Rate | 2.1% | 37.8% |
| Confidence Interval | Not provided | ±3.1% |
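
A note on that last row, since it was the question that started all of this. A coverage figure is an estimate, and an estimate should come with an interval. Below is a minimal sketch of one common way to compute it, the Wilson score interval for a proportion. It assumes coverage is estimated by sampling requirement items and checking whether any executed case exercises each one. The counts in the example are hypothetical and are not the project's data, and this is not necessarily how the figure in the table was produced.

```python
import math


def wilson_interval(covered, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = covered / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical sample: 800 requirement items checked, 230 found covered.
low, high = wilson_interval(covered=230, total=800)
print(f"coverage estimate: {230 / 800:.1%}, 95% interval: {low:.1%} to {high:.1%}")
```

If a report cannot say how many items were sampled, there is nothing to plug into `total`, and the headline percentage has no interval at all.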
The room was silent for about ten seconds.
Zhang Lei sat in his chair, hands folded, saying nothing.
Chief Engineer Shen wasn't looking at Zhang Lei. He was looking at General Manager Zhou.
"The 97.3% coverage data you delivered — I can't sign off on it. Next quarter's S-grade project acceptance criteria need to be reassessed. We'll adopt a benchmark-based testing standard, executed against the 347 core test cases."
He paused.
"Engineer Wang will be responsible."
He meant me.
Act 6: The Hallway
After the meeting, Chief Engineer Shen stopped me at the elevator.
He handed me a business card — not his chief engineer's card. It was an invitation to RuiJie's technical review meeting for the next quarter's new project.
"That 97.3% — when did you know it was fake?"
"The first review meeting."
He looked at me, the corner of his mouth twitching up.
"Three months without saying a word, waiting until now. Is your whole team that patient?"
I put the card away.
"Next quarter's S-grade project — am I in or out?"
He didn't answer. The elevator doors opened. He stepped in, turned back:
"Come to the technical review first. Listen to the requirements. Then decide."
The doors closed.
I stood alone in the hallway, the card in my hand, the light just a little too bright.
Sometimes the best numbers you'll ever run aren't the ones on the report. They're the ones you ran yourself, three months ago, in a notebook that nobody asked to see.
Have you ever had to quietly double-check an AI-generated metric that "passed" on paper but didn't pass your gut check? How did it end up — did the data hold, or did the production environment tell a different story?
Drop it below 👇