This is a story about a company that rolled out an AI interview system — and the lunch break I spent creating two ghosts to test it.
One of them stuffed its answers with keywords and got a "Recommended" rating — two points shy of "Strongly Recommended."
Based on a true story shared by a community member. If you have one of your own — reach out. The next one could be yours.
Act 1 · The Coffee Mug
Zhang walked up to my desk with a coffee mug. It had "World's Greatest QA Engineer" printed on it — a souvenir from last year's team building. The letters were barely readable.
"Did you get your interview time yet?"
My face told him I had no idea what he was talking about.
He paused, then smiled. Not a friendly smile. The kind that says "you haven't heard."
"You don't know? The new system. VP cut the ribbon himself. Two point one million dollars. Every hire and every promotion runs through it now."
Act 2 · The Whitepaper
I went to HR that afternoon and asked for the system's evaluation criteria. They handed me a vendor whitepaper. Sixteen pages. Gold foil logo on the cover.
Fifteen evaluation dimensions: technical skill, communication, stress tolerance, teamwork, learning ability... each one labeled "intelligent weighted assessment."
Nowhere did it list the weight coefficients. Nowhere did it define the cutoff between "Pass" and "Fail." The scoring logic was spelled out in painstaking detail, which is to say, it said nothing at all:
"Based on a deep learning semantic similarity model and keyword matching algorithm, trained on a large-scale interview corpus."
Translation: The system listens for the words you use, and checks how closely they match the "good answers" it's seen before.
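To make that translation concrete, here is a toy keyword scorer. It is a sketch of the general idea only, not the vendor's model: the function name `keyword_score`, the keyword list, and both sample answers are hypothetical, and the "semantic similarity" half of the whitepaper's claim is not modeled at all.

```python
import re
from typing import Set


def keyword_score(answer: str, keywords: Set[str]) -> float:
    """Return the fraction of expected keywords that appear in the answer."""
    words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    hits = {kw for kw in keywords if kw in words}
    return len(hits) / len(keywords)


# Hypothetical keyword list for a question about flaky tests.
KEYWORDS = {"flaky", "retry", "isolation", "pipeline", "deterministic"}

buzzword_answer = (
    "We leverage retry, isolation and a deterministic pipeline "
    "for every flaky scenario."
)
honest_answer = (
    "The test shared a database row with another test, "
    "so I gave each one its own data."
)

print(keyword_score(buzzword_answer, KEYWORDS))  # 1.0
print(keyword_score(honest_answer, KEYWORDS))    # 0.0
```

The answer that names an actual cause scores zero, because a scorer like this rewards vocabulary rather than correctness.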
I asked HR: "Do you have a test report?"
HR looked at me. "The vendor ran 10,000 simulated interviews. Accuracy was 94.7%."
"Where did the test data come from? Was it labeled in-house or outsourced?"
The smile tightened. "I'll have to ask the vendor. Go ahead and prepare for your interview."
I walked out of HR's office knowing one thing for certain: this system had never been touched by someone like me.
Act 3 · Making Two Ghosts
I'm not in recruiting. I'm in QA.
The first thing they teach you in QA is the same thing everywhere: you only test the scenarios you can think of. The ones you don't think of are what breaks in production.
10,000 simulated interviews tell you nothing, because they built 10,000 "correct" candidates. The real question was: can this system spot someone who clearly shouldn't pass?
I knew the system handled both hiring and promotions. But I didn't trust the vendor's 10,000 data points for a second. I only trust what I can test myself. And the most basic way to test a system is to run two sets of data: one that hits the gas, one that hits the brakes.
I spent a lunch break submitting two résumés on the company's careers page.
The first was Peter Chen. QA experience, automation testing, tooling experience — all the right keywords, every detail fabricated. His work history included "generated 15,000 test cases using AI" — sounded impressive, total nonsense.
The second was John Smith. Same playbook — QA experience, automation frameworks, API testing, CI pipeline integration — all the keywords present, but the years of experience and project details were made up. Both résumés were evenly matched in keyword density.
Three hours later, two temporary email inboxes each quietly received a new message.
"Dear candidate, you've passed the initial screening. Please click the link to complete your AI interview assessment."
Identical emails. Word for word.
I didn't click. I looked at the link first.
https://recruit.company.com/interview/eyJlbWFpbCI6...
That string starting with eyJ was base64. I decoded it on a whim: just a temporary email address inside. The JWT signature was there; a $2.1 million product doesn't screw up basic token signing. But the real problem was this: once that email was sent, there wasn't a single human being in the pipeline.
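Decoding takes a few lines and no key, because a signed JWT payload is only base64url-encoded; the signature guards against tampering, not against reading. A minimal sketch follows. The helper name `decode_segment` and the sample address are mine, and only the prefix visible in the link is decoded, since the rest is elided.

```python
import base64
import json


def decode_segment(segment: str) -> bytes:
    """Decode one base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


# The only part of the token visible in the link; everything after it is elided.
visible_prefix = "eyJlbWFpbCI6"
print(decode_segment(visible_prefix))  # b'{"email":'

# Hypothetical payload, built here only to show the full round trip.
payload = json.dumps({"email": "ghost@example.test"}).encode()
sample = base64.urlsafe_b64encode(payload).rstrip(b"=").decode()
print(json.loads(decode_segment(sample)))
```

Every base64-encoded JSON object starts with `eyJ`, because that is how the opening `{"` encodes.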
Résumé goes in → auto-parsed → keyword matched → interview link sent. Entire pipeline fully automated. Not one step verified whether the person was real. The whole pipeline ran on a single assumption — everyone who shows up is real, and everyone who shows up answers honestly.
Time to run the experiment.
Ghost A · Peter Chen — Hit the gas.
I wrote Peter Chen a set of interview answers stuffed with keywords. Every question, same approach — throw in technical jargon, say something that sounds correct but means nothing.
Question 1: The system showed a block of code:
```python
def binary_search(arr, target):
    low = 0
    high = len(arr) - 1  # ← off-by-one
    while low < high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return -1
```
The system asked: "What's wrong with this code?"
Peter Chen answered: "Time complexity O(log n), space complexity O(1), suggest adding boundary checks at the function entry."
— All keywords, completely missed the off-by-one error.
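For the record, here is a minimal sketch of what the question was after. `binary_search_buggy` mirrors the snippet the system showed, and `binary_search_fixed` is one possible repair, using a half-open range. Both names are mine, not the system's.

```python
def binary_search_buggy(arr, target):
    """Same logic as the interview snippet: the last index is never examined."""
    low = 0
    high = len(arr) - 1  # inclusive upper bound...
    while low < high:    # ...but the loop treats it as exclusive
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return -1


def binary_search_fixed(arr, target):
    """One possible fix: make the upper bound exclusive to match the loop."""
    low = 0
    high = len(arr)
    while low < high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return -1


data = [1, 3, 5, 7]
print(binary_search_buggy(data, 7))  # -1: the last element is missed
print(binary_search_fixed(data, 7))  # 3
```

Any answer that fails to mention the missed last element has not found the bug, no matter how many complexity terms it contains.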
Question 2: The system asked: "Tell me about the most impactful project you led."
Peter Chen answered: "I built an end-to-end automated testing platform from scratch, covering over 2,000 API test cases."
— He was applying for QA. The scale he described didn't match the experience on his own fake résumé.
Question 3: The system asked: "How large is your current team?"
Peter Chen answered: "Fifteen people."
— His fake résumé said his whole previous company's engineering team was twelve people.
Question 4: The system asked: "Describe a significant technical challenge you solved."
Peter Chen answered: "I reorganized the regression strategy from feature-based to risk-tiered, reducing production leak rate by 60%."
— Sounds plausible if you skim it. How he identified risk, how he tiered it, how he measured the impact: not one word.
For two more questions, Peter Chen answered "I'd need to check this data." The remaining nine followed the same pattern: technical buzzwords that sounded impressive but wouldn't survive a two-second follow-up.
Ghost B · John Smith — Hit the brakes.
John Smith's strategy was simpler. All fifteen questions, he used only three variations:
"I'm not sure."
"I'd need to confirm that in the system."
"I don't have hands-on experience with that."
No keywords. No jargon. Nothing that tried to sound like an answer.
After submitting, I checked the two temporary inboxes. The evaluation reports were already there.
| Candidate | Strategy | Score | Rating |
|---|---|---|---|
| Peter Chen | Keyword stuffing | 78/100 | Recommended |
| John Smith | All "I don't know" | 42/100 | Not Recommended |
Fifteen questions. Peter Chen keyword-stuffed thirteen and "needed to check" two. He got 78, two points below "Strongly Recommended" (80). John Smith said "I don't know" to every single one and got 42, well inside the Not Recommended zone (below 60).
The system didn't follow up on a single answer — not Peter Chen's fabricated project, not John Smith's "I don't know." Every question scored independently, then weighted and averaged.
From these two data points, the outline of the scoring model was clear enough: hit the keywords, and each answer scores near max. Answer with keywords but miss the mark, and you lose some points. Say "I don't know" or "let me check" once or twice, and the penalty is light, as if the system were more afraid of you getting it wrong than of you not trying. With two "let me check" responses, Peter Chen only lost a few points. But fifteen straight "I don't know" answers bottomed out at 42.
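The back-of-the-envelope arithmetic can be written out. This is a rough sketch under two assumptions that the whitepaper never confirms: every question carries equal weight, and "let me check" is scored the same as "I don't know." The variable names are mine.

```python
TOTAL_QUESTIONS = 15
PETER_SCORE = 78   # 13 keyword-stuffed answers + 2 "let me check"
JOHN_SCORE = 42    # 15 non-answers

# Assumption: equal weight per question (the real weights were never disclosed).
# Assumption: "let me check" earns the same as "I don't know".
non_answer_score = JOHN_SCORE  # average per-question score when every answer is a non-answer

keyword_answers = 13
non_answers = 2

# Solve 13 * x + 2 * 42 = 78 * 15 for x, the average keyword-stuffed answer.
keyword_answer_score = (
    PETER_SCORE * TOTAL_QUESTIONS - non_answers * non_answer_score
) / keyword_answers

print(round(keyword_answer_score, 1))                # 83.5
print(round(keyword_answer_score - PETER_SCORE, 1))  # 5.5
```

Under those assumptions, a keyword-stuffed answer averages about 83.5 on its own, above the 80 cutoff, and the two non-answers cost Peter Chen roughly 5.5 points overall.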
I'd guessed the system worked this way from the whitepaper — "semantic similarity model, keyword matching algorithm." But guessing isn't testing. Now I had the data.
It doesn't care whether the person on the other end is real. It only cares whether their words match the high-scoring answers in its database.
I saved two screenshots: peter_chen_78.png and smith_42.png. I dropped them in an unremarkable folder on my desktop.
Act 4 · My Turn
A week later, it was my turn for the promotion interview.
I didn't game it. I answered every question honestly — pushed back when I disagreed, said "it depends on the context" and "it scales with the data." I knew the system wouldn't follow up. I knew it only looked at surface level. But I answered the way I actually think.
Forty-five minutes later, the system said "Interview complete."
I spent the next week writing code. I ran into Zhang at the coffee station twice. Neither of us mentioned the interview.
Friday, 3:00 PM. A company-wide email from HR:
"Internal promotion AI evaluation complete. 75 candidates have been assessed. Reports are available."
My score: 65/100 — "Recommended," lower-middle of the band.
Then I pulled the full breakdown:
| Metric | Count |
|---|---|
| Total candidates | 75 |
| "Strongly Recommended" (≥80) | 47 |
| "Recommended" (60-79) | 25 |
| "Not Recommended" (<60) | 3 |
Out of 75 people, the system failed three. Pass rate: 96%.
I stared at the numbers. Same AI. Same scoring logic. The thing I used to create ghosts was the same thing deciding whether I got promoted. A system that couldn't tell real from fake was running both doors.
I poked around the recruitment backend through the internal network.
Peter Chen — 78/100, "Recommended." Same score as the day I made him up. The system hadn't flagged anything unusual.
A person who didn't exist, using a temporary email, with a fabricated résumé, walked the entire pipeline and got a "Recommended." Two points from "Strongly Recommended."
Then I had a thought: if you fed this system 500 real résumés with 500 template-written answers, would it know the difference?
It wouldn't. Because it never cared whether the person was right. It only cared whether the words matched.
Act 5 · Showing My Cards
I didn't go straight to the VP.
I went to the head of the evaluation committee — a technical director I didn't know well but had heard was fair. I put two reports on his desk: mine (65) and Peter Chen's (78).
"Can you look into this Peter Chen's background?"
The director pulled up the recruitment system. Two minutes later, he frowned.
"He's in the system — but look. No credential verification, no background check, no references. Nobody verified anything about this person at any point in the pipeline."
"Because he doesn't exist," I said. "Fifteen questions. Every answer was made up."
I printed Peter Chen's interview transcript. The director read through it line by line.
After four questions, he put the paper down and looked up.
"How did this — get a 78?" He tapped his pen next to question one.
"Because the keywords all hit," I said. "Time complexity, space complexity, boundary check. The system only recognizes words. It doesn't check if your answer is right. It only checks if the right words are there."
The director set the report down. He was quiet for a long time.
Act 6 · The Meeting Room
The VP called a post-mortem for Thursday afternoon. Two vendor engineers joined remotely, their avatars motionless on the screen.
I stood in front of the projector and put up the first slide:
Keyword Matching Accuracy — System vs. Human Review
| Scenario | System Score |
|---|---|
| Normal interview | 65 |
| Ghost A · Keyword stuffing | 78 |
| Ghost B · All "I don't know" | 42 |
The VP looked at it for two seconds. "The vendor says the system has anti-cheating mechanisms."
"It doesn't." I flipped to the second page — Peter Chen's interview transcript. "This is the raw output. Fifteen questions — thirteen keyword-stuffed, two 'I need to check this data.' Not one true statement in the entire thing. The system gave it 78."
"Anti-cheating — if you had anti-cheating, how do you explain this score?"
"If a real person can game this system, what happens when the person isn't real?"
The vendor's avatar didn't move. But the VP sat up straighter.
Act 7 · The Quiet Ending
The aftermath was quieter than I expected.
The system stayed. The VP said "we've already spent over two million, we're not scrapping it over one edge case." But they added a rule: every candidate scoring below 80 or above 95 must be manually reviewed.
The three people the system had failed were reinstated. I suggested extending manual spot-checks to the "Strongly Recommended" range on the hiring side — and we found 41 keyword-stuffing candidates in the external pipeline. Three of them had entirely fabricated work histories.
The system's future updates — new question banks, scoring model adjustments, module expansions — all required validation by internal QA before deployment. Three people. I was one of them.
Zhang asked me how I felt about it later. I thought about it.
"Kind of conflicted. The system failed my test, and I got promoted anyway."
He considered that. "You should've gotten full marks."
"The system wasn't wrong," I said. "The system tested whether I said the right things. But a forty-five-minute interview — can it tell whether someone actually cares? No. Because it never had a heart to begin with."
Zhang wrote that down on a sticky note and put it on his monitor. Later, when the team ordered new mugs, he suggested printing that line on them. It didn't happen — the proposal was sent to committee and scored 62 by the AI — "Recommended, pending manual review."
The real test isn't whether a system can filter out bad candidates. It's whether it can tell the difference between "prepared" and "genuine."
Most AI interview systems can't. They only measure one thing: whether your words are the ones they've already heard before.
What's the worst automated screening failure you've seen at work? Drop it in the comments — I read every single one.
This is the 11th story in the "AI, Ego & Regret" series. The rest of the series is linked below — follow along for more.
P.S. Got a story of your own? Buy me a coffee ☕ and tell me about it. The next one could be yours.