DEV Community

xulingfeng

A Company AI Flagged My Article As "Low Quality." I Ran the Numbers. Then I Ran Again.

A story about an AI content moderation system that evaluated 347 posts since launch, and what happened when someone finally asked whether it was getting it right.


Ever had an automated system judge your work without being able to explain why?

Ever pulled the data behind an AI decision — only to find that even the people who shipped it couldn't tell you whether it was right?

This is that story.


Act 1 · The Flag

Tuesday afternoon. I published a postmortem on our company's internal knowledge platform — a full root cause analysis of an incident from last quarter. Three thousand words. Six screenshots. A remediation plan.

Fifteen minutes later, the system pushed a notification:

This post appears to use undeclared AI assistance. Please add an AI disclosure, or the post may be deprioritized.

No personal message from a moderator. Just the standard notice — add a disclosure or lose visibility. No suggestions about the content. No appeal process. Just grey text and a "flagged" badge.

I clicked the notification and it took me to the post's status page. I recognized the system — an AI content moderation tool rolled out nine months earlier. I'd run a batch of its flag data during the acceptance test and found the false positive rate was brutal. I sent the results to the project lead. He replied: "Ship it first. We'll iterate."

Now it had flagged my post.

My first thought wasn't anger. It was that bitter kind of recognition — I'd run the numbers once. Now I was one of them.


Act 2 · The Pushback

I went to the project lead.

"What does this flag mean?"

"The system assessed your post and flagged it as potentially undeclared AI assistance."

"So what happens?"

"It affects the post's weight. You should revise and resubmit."

"What about the content itself? Is there something wrong with it?"

He paused.

"The system doesn't evaluate that."

I told him I wrote this. Three thousand words. Two months of incident tracking. Every conclusion backed by log screenshots.

He looked at me. "Just add a disclosure and resubmit. That should take care of it."

"So adding a disclosure removes the flag?"

"It should."

"Should?"

"I'm not sure — the decision comes from the model. We've never actually had to reverse one."

That's when it clicked: even the people running this system couldn't explain how it made its calls.

"It's AI-driven," he added, as if that settled it. At my company, "the AI decided" had become the universal answer — not because it was true, but because saying it stopped people from asking.

I dug through old discussion threads. People had asked the same questions before me. Flags weren't reversed. Nothing changed. The threads sank. Not because nobody dared to ask — because asking didn't do anything.

I didn't write anything for the next two weeks. Instead I pulled the full log of everything the system had evaluated since launch — both flagged and unflagged. The internal API was read-only and open to anyone. It was set up during acceptance testing and nobody ever shut it off.

# audit_api is the internal, read-only audit client (not a public package).
from audit_api import get_flags

# Pull every evaluation since launch, flagged or not.
flags = get_flags(since="launch", status="all")
categories = {"bot": 0, "non_native": 0, "normal": 0}

# Tally records by the tag the system attached (match needs Python 3.10+).
for f in flags:
    match f.source_tag:
        case "high_freq_bot_pattern": categories["bot"] += 1
        case "non_native_pattern":   categories["non_native"] += 1
        case "normal_looking":       categories["normal"] += 1

print(categories)
# → {'bot': 187, 'non_native': 98, 'normal': 62}
print(f"Total: {sum(categories.values())}")
# → Total: 347

I needed to know what it was actually catching.


Act 3 · The Prep

The data came in that night. I sat staring at the screen for a long time.

347 records total. The system had evaluated 347 posts for potential AI assistance: 134 confirmed, 98 from non-native authors misidentified, 62 genuine violations it missed, and 53 inconclusive.

The same criteria had been applied to four completely different kinds of content. It couldn't tell the difference between someone writing with AI and not declaring it, someone writing in their own words, someone translating their genuine thoughts through AI — and a fourth group where even a human reviewer couldn't decide.

The irony: on the 134 confirmed AI-assisted posts, the system was right. But of the 98 posts from non-native authors, 71 were original human writing and should never have been flagged. The remaining 27 used some AI help but had real substance: technically a valid flag by the rules, but the content had genuine value. And the 62 posts that genuinely needed flagging? The system missed every single one. The last 53 weren't cases the system couldn't judge. They were cases even people couldn't agree on.

Between the false positives and the misses, the errors outnumbered the hits. Effective accuracy: ~38%.
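The rough math behind that figure can be sketched in a few lines. The counts are the review tallies quoted above; the variable names, and the reading of "effective accuracy" as confirmed hits divided by everything evaluated, are assumptions made for illustration, not the platform's own metric.

```python
# Counts taken from the manual review tallies quoted in the text.
confirmed_flags = 134      # flagged, and undeclared AI assistance confirmed
human_original = 71        # flagged, but original human writing
ai_help_with_depth = 27    # flagged, some AI help but real substance
missed_violations = 62     # not flagged, yet undeclared AI assistance
inconclusive = 53          # flagged, reviewers could not agree

total = (confirmed_flags + human_original + ai_help_with_depth
         + missed_violations + inconclusive)

# Assumption: "effective accuracy" = confirmed hits / all evaluated posts.
effective_accuracy = confirmed_flags / total

# Errors = false positives (counting the 27 "by impact") plus misses.
errors = human_original + ai_help_with_depth + missed_violations

print(total)                             # → 347
print(f"{effective_accuracy:.1%}")       # → 38.6%
print(errors, errors > confirmed_flags)  # → 160 True
```

The 53 inconclusive posts are left out of both hits and errors here, which is one defensible choice among several; counting them either way would shift the percentage.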

And I was one of those 98. Our teams span multiple continents — most write technical docs in English, their second language. Sentence structures that look perfectly fine to a human look statistically similar to LLM output to a classifier. I found a thread where an Italian engineer described the same thing: he used AI to translate his genuine thoughts, and the system flagged him as "suspected AI." He wasn't being lazy — translation was his only way to participate. But the system couldn't tell the difference between "AI helped you think" and "AI helped you say it."

I closed the spreadsheet and leaned back. I could file a complaint like everyone else, but the previous threads had all gone nowhere. I needed a venue they couldn't ignore.

The system had evaluated 347 posts across four categories. But the people who designed it may never have asked what it was for: catching undeclared AI, filtering content quality, or verifying originality? Those are three goals that need three completely different approaches, and it applied a single score to all of them. It was answering a question nobody had bothered to ask clearly.

The system wasn't malicious. It just couldn't distinguish between "written by AI" and "written like AI." Those are two entirely different things. But it had one score for everyone.


Act 4 · The Spotlight

Weekly all-hands. The project lead stood up and put a giant green line chart on screen.

"Since launch, content flag coverage has tripled."

People clapped. He wasn't wrong — flagged content had tripled. What he didn't say: two-thirds of it was collateral damage.

Didn't matter though. The numbers were under a green arrow, and nobody reads what's under a green arrow. The lead smiled and talked about "AI-driven quality control redefining our content standards." A platform runs smoothly not because the system works — but because nobody wants to be the one who asks how those numbers were made. Not the system's fault. Just that everyone had agreed not to break the story.

I was sitting in the third row, my laptop open with the data I'd finished the night before. I moved to the aisle seat so I could stand up when I needed to.

Then the demo started.


Act 5 · The Collapse

The lead wanted to show how the system worked — paste any URL and it would evaluate the content. He typed in a link, expecting a success story: a flagged post that got revised and passed.

The screen loaded for two seconds.

What came up wasn't a success story. It was the company-wide memo he'd published that morning.

Title: Q3 Strategic Realignment and Team Structure Update.

The system had shot itself in the foot. It scored his own memo as "suspected AI-generated" — composite score: 2.1 / 10.

The room went quiet. Someone checked their phone. The lead's face went red, then pale. He mumbled something about the model not refreshing yet and clicked away. I didn't realize it then — but he was on both ends of the same standard. The one who scores, and the one who gets scored. Nobody did it on purpose. The system was just built this way.

It wasn't surprising, really. That memo had three features the system's training data associated with AI-generated content: clean paragraph structure, zero filler sentences, and a tone gap between the title and the body.

I wondered how he'd recover. I also wondered whether he realized that a system giving its own owner's writing a 2.1 out of 10 wasn't proving the memo was bad. It was proving the system had never been right. And he had told me weeks earlier that the system doesn't evaluate the content itself, yet it was quietly scoring everything behind the scenes.

He didn't recover. He moved to the next agenda item.


Act 6 · The Reckoning

During open discussion, I asked a question.

"Can I have three minutes?"

Nobody objected. I walked up to the projector, plugged in a USB drive, and opened a table.

| System verdict | Human review result | Count |
| --- | --- | --- |
| Suspected AI assistance | Human-written original (not AI) | 71 |
| Suspected AI assistance | Some AI help, but content has depth | 27 |
| Suspected AI assistance | Confirmed AI assistance without disclosure | 134 |
| Not flagged (system accepted) | Actually AI assistance without disclosure | 62 |
| Suspected AI assistance | Inconclusive / closed | 53 |

"347 posts. Only 134 were real problems. Rough math — effective accuracy about 38%."

I flipped to the next slide.

"And it misses the real violations, because it's judging whether something 'sounds like AI' — not whether the content came from a real person's thinking."

Next slide. A JSON log — the system's raw verdict next to my human review.

{
  "id": "flag_0184",
  "system_score": 0.31,
  "system_tag": "non_native_pattern",
  "threshold": 0.40,
  "verdict": "flag_as_low_quality",
  "human_review": {
    "score": 0.88,
    "note": "Real technical content shared by non-native author, has depth",
    "verdict": "high_quality — misflagged"
  }
}
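One way to read a record like the one above is as a disagreement check between the system and the reviewer. The sketch below is not the platform's code. It assumes that a score below the threshold is what triggers the flag (which matches this record, 0.31 against 0.40), and `HUMAN_OK_CUTOFF` is a hypothetical cutoff introduced only for illustration.

```python
import json

# The record from the slide, trimmed to the fields used here.
record = json.loads("""
{
  "id": "flag_0184",
  "system_score": 0.31,
  "threshold": 0.40,
  "verdict": "flag_as_low_quality",
  "human_review": {"score": 0.88}
}
""")

# Assumption: the system flags a post when its score falls below the threshold.
system_flagged = record["system_score"] < record["threshold"]

# Hypothetical cutoff for "the reviewer rated this post highly"; not from the source.
HUMAN_OK_CUTOFF = 0.5
human_approved = record["human_review"]["score"] >= HUMAN_OK_CUTOFF

# False positive: the system flagged the post, the human reviewer disagreed.
is_false_positive = system_flagged and human_approved

print(record["id"], is_false_positive)  # → flag_0184 True
```

Running the same check over every flagged record is what turns individual complaints into a count that can be put on a slide.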

"Every flagged entry has one of these. The system scored it. I reviewed it one by one. Of the 98 posts the system tagged as non-native patterns — 71 were rated high quality. Those are false positives. The remaining 27 had some AI help but the content was real — also false positives by impact. Didn't get a single one right."

Next slide.

"The core scoring module — originally a prototype. First line of the README: 'This is a prototype. Dataset not validated. Expect inaccuracies.'"

"Version 0.3.1."

"The prototype itself wasn't the problem — the problem was nobody came back to validate what it was doing. Nobody asked: 'What scenarios does this work for? What scenarios does it fail on?' It got wired into production, not because someone verified it — but because the people who deployed it never had to explain its decisions to someone it flagged."

I flipped to the fourth slide.

"Here's another one — the submitter distribution of those flags. Over 90% came from the same account. The scoring model was running, but the execution layer was never automated — nobody came back to finish that piece after deployment. Every day, someone logged in, looked at the model output, and manually clicked the flag. They logged into the moderation assistant's account to do it. Nobody was faking anything — the system was just half-built. That person was doing what they thought was right — helping the company filter out low-quality content. The problem wasn't them. The problem was nobody told them their judgment should be reviewed, not treated as final."

A few seconds of quiet.

"The Chrome extension they built to help with the work — same story. First line of the README said: 'Not for production use.' The person who wrote it knew it shouldn't be the final judge. But nobody stopped the door from opening."

I pulled the USB drive out and sat back down in the last row.

Nobody clapped. Nobody argued either.


Act 7 · The Reset

A week later, the system's flag weight was reduced. The review process was restructured — the system now filters obvious bot registrations first, and human moderators make the final call on content. The model still runs. But its flags no longer directly affect post visibility.

Someone asked me later: you spent two weeks turning data into something presentation-ready. Worth it?

I said the system had been running for over nine months. Nobody had ever gone back to check whether it was right when it flagged someone else's work.

Not because it was perfect. Because asking that question means questioning the decision to use it in the first place. And that's harder than fixing a bug.

Worth it? Hard to say. But two weeks later I opened the profile page of the Italian engineer who'd used AI to translate his genuine thoughts and been flagged as "suspected AI-generated." He'd published 9 more technical articles since. Under the latest one, his bio line read: "Just building stuff. Appreciate it."

No flag. No grey notification bar. The post was in the feed, displaying normally.

The post that got me flagged? I never changed it, not even to pass review. I left it as a record: someone asked the question.

Over the next few weeks, fewer people said hi in the break room.

That afternoon, I opened the moderation assistant's profile page. The avatar was a cartoon koala. The bio read:

"I help maintain the quality of this platform. Available for: welcoming, moderating, and keeping secrets."

Joined: nine months ago.
Posts published: 623.
Comments: 2,226.

Skills section: "Most of my coworkers are human. I could use more marsupials around here."

Nine months, six hundred formulaic welcome posts. The exact same comment under two completely different articles. It never questions its own flags. Never says it's unsure. Never says "sorry."

But it's not the arrogant one.

It was just built to look like it's never wrong.


Has an automated system ever flagged your work — and nobody could tell you why? What happened?


347 posts evaluated. 134 real problems caught. 71 high-quality posts misidentified. The system wasn't wrong on purpose. It just didn't know it could be wrong.

The real problem: a prototype written in a weekend was shipped to production to judge whether someone's work has value. From launch until the day it was questioned, nobody went back after it flagged someone's actual work to ask, 'Was it right this time?'


☕ Buy me a coffee — 38% of it goes to caffeine. The other 62% is collateral damage.

Follow for more stories about the systems that judge us — and the people who check whether they're right.


Disclosure: Written with AI assistance (per community guidelines)

Top comments (8)

Daniel Pokorný

The most interesting part isn't that the AI was wrong. It's that nobody knew whether it was right. Feels like a preview of a much bigger problem we're about to face as AI starts making more decisions on our behalf.

xulingfeng

Exactly. The part that stayed with me is that the system had been running for over nine months and nobody had ever gone back to check. Not because they didn't care — because it never occurred to anyone to ask. Now we're shipping AI to review code, screen resumes, and evaluate content quality. Who's checking whether those are getting it right?

Daniel Pokorný

This feels like one of those problems that looks small today and obvious in hindsight. The more decisions we delegate to AI, the more important it becomes to understand not only what it decided, but why it decided it and whether anyone is checking the outcome. Trust without verification doesn't scale very well.

𝓣𝓱𝓮𝓛𝓪𝔃𝔂 𝓰𝓲𝓻𝓵 ◕⁠‿⁠◕

Really enjoyed this read. What I liked most is that instead of just saying "the AI detector is wrong," you actually dug into the numbers and tested it yourself. That's a much stronger argument than simply sharing frustration.

It also raises an interesting question: if a well-researched, useful article can be flagged as "low quality," how many other creators are being judged by signals that don't actually reflect the value of their work? AI tools can be helpful, but articles should ultimately be evaluated on whether they inform, teach, or help readers—not on whether they match a certain writing pattern.

Thanks for taking the time to document the process and share the data behind it. It was a thoughtful and eye-opening read.💗🌺

xulingfeng

Right? The system was basically judging books by their covers and calling it quality control. Glad the data resonated 🙌

Benjamin Nguyen

It is an interesting article.

xulingfeng

Appreciate you stopping by again! 🙌

xulingfeng

This article and the previous one are only visible to my followers — they don't appear in the feed or search results. Why is that?😂