EvvyTools

Posted on Jun 12

How to Screen Writer Submissions for AI Content Without Burning Honest Writers With False Positives

#tools #writing #productivity

If you commission writing, you have probably hit the AI question. Did the writer use AI? How much? Does it matter? The central problem is that you usually cannot tell from reading, and the detector tools have failure modes that make them dangerous to use as evidence.

This is a workflow problem, not a tool problem. The right answer is not "find a better detector." It is to build a screening process that uses detector scores for what they are actually good at - flagging text that deserves a closer look - and avoids what they are bad at - rendering verdicts on individual writers.

Here is a workflow that gets the value out of detector tools without producing the false-positive damage that has gotten institutions in trouble.


Step 1: Use Multiple Detectors, Not One

A single detector score is one tool's interpretation of statistical features that other tools weight differently. Running the same submission through two or three independent detectors and comparing results tells you whether the signal is consistent.

If three detectors return 80%+, the AI signal is strong and well-supported across different methodologies. If they return wildly different scores (one at 85%, another at 30%), the text is in the disagreement zone, and any single tool's confidence is misleading.

For a workflow, pick three: one that emphasizes phrase pattern matching, one that emphasizes statistical uniformity, and one that emphasizes hedging. Run every submission through all three before drawing any conclusion.
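As a rough illustration, the agreement check can be sketched in a few lines of Python. The detector names, the `triage_scores` function, and the 25-point spread cutoff are assumptions made for this example. Only the 80% "strong signal" figure comes from the text above.

```python
from statistics import mean


def triage_scores(scores, strong=80.0, max_spread=25.0):
    """Classify one submission's detector scores (0-100).

    `scores` maps a detector name to its score. The names and the
    `max_spread` cutoff are illustrative assumptions, not standards.
    """
    if len(scores) < 2:
        raise ValueError("need at least two detector scores to compare")
    values = list(scores.values())
    spread = max(values) - min(values)
    if spread > max_spread:
        label = "disagreement"      # tools diverge; no single score is trustworthy
    elif min(values) >= strong:
        label = "consistent-high"   # every tool reports a strong signal
    else:
        label = "consistent-not-high"
    return {"label": label, "spread": spread, "mean": mean(values)}


# Hypothetical scores from three detectors for one submission.
print(triage_scores({"phrase": 85, "uniformity": 30, "hedging": 60})["label"])
```

A "disagreement" label is a reason to read the piece more closely, not a finding about the writer.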

Step 2: Establish a Per-Writer Baseline

Every writer's natural prose scores somewhere on the detector spectrum. A writer who produces tightly edited, conventional copy may naturally score 50% AI on every piece, including the ones written by hand five years ago. A writer with a looser style may naturally score 20%.

When you onboard a new writer, get them to send two or three samples of writing you know was produced before AI tools were widely available, or work they produced live in front of you in a conversation. Run those through your detectors to establish the writer's baseline.

A submission scoring 85% from a writer whose baseline is 50% is suspicious. A submission scoring 85% from a writer whose baseline is 80% is normal.

This step solves the biggest single source of detector false positives: that some writing styles naturally score high regardless of AI involvement.
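One way to make the baseline comparison concrete is to score a submission by how far it sits above the writer's own average. This is a minimal sketch. The function names and the 20-point margin are assumptions chosen so that the two examples above (85 against a baseline of 50, and 85 against a baseline of 80) come out as described.

```python
from statistics import mean


def baseline_delta(baseline_scores, submission_score):
    """Return how far a submission sits above the writer's baseline."""
    if not baseline_scores:
        raise ValueError("need at least one baseline sample")
    return submission_score - mean(baseline_scores)


def needs_closer_look(baseline_scores, submission_score, margin=20.0):
    """Flag only when the score exceeds the baseline by more than `margin`.

    The 20-point margin is an illustrative assumption; tune it against
    your own outcome log.
    """
    return baseline_delta(baseline_scores, submission_score) > margin


# Hypothetical baselines taken from known-human samples.
print(needs_closer_look([48, 50, 52], 85))  # baseline near 50
print(needs_closer_look([78, 80, 82], 85))  # baseline near 80
```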

Step 3: Read the Sub-Scores, Not the Composite

The single number a detector reports is typically a weighted combination of underlying signals. The signals are what is actually informative. A composite score of 75% might be driven by:

High AI phrase density ("delve into," "in conclusion," "furthermore")
Low sentence-length variation (mechanical uniformity)
Heavy hedging across declarative statements
Generic vocabulary throughout

Each of these is a different problem with a different appropriate response. The first is a vocabulary habit that can be addressed in revision. The second often reflects good editing rather than AI. The third may be a tone choice. The fourth could mean the writer is unfamiliar with the subject matter or that they are using a model at default settings.

Tools that expose the sub-scores let you reason about what the flag actually represents. The AI Content Detector on EvvyTools breaks the underlying signals out individually so you can see which factor is driving the result.

A workflow based on composite scores will produce false positives. A workflow that examines sub-scores produces more nuanced flags.
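To show why the composite hides information, here is a small sketch that rebuilds a composite from sub-scores and reports which signal contributes most. The signal names, the equal weights, and the `composite_and_driver` helper are all hypothetical. Real detectors weight their signals differently, and many do not expose them at all.

```python
def composite_and_driver(sub_scores, weights):
    """Return (composite, driver) for one submission.

    The composite is a weighted average of the sub-scores; the driver
    is the signal with the largest weighted contribution.
    """
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    contributions = {
        name: score * weights[name] / total_weight
        for name, score in sub_scores.items()
    }
    composite = sum(contributions.values())
    driver = max(contributions, key=contributions.get)
    return composite, driver


# Hypothetical sub-scores (0-100) with equal weights.
sub_scores = {
    "phrase_density": 90,
    "length_uniformity": 95,
    "hedging": 60,
    "generic_vocabulary": 55,
}
weights = {name: 0.25 for name in sub_scores}
composite, driver = composite_and_driver(sub_scores, weights)
print(composite, driver)
```

In this made-up case the composite lands at 75, and the largest contributor is sentence-length uniformity, which the text above notes often reflects good editing rather than AI.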

Step 4: Handle Flags as a Conversation, Not a Verdict

When a piece is flagged, the next step is not rejection. It is a conversation with the writer about the piece.

Useful questions:

Walk me through how you researched this. What sources did you use?
Did you draft any part of this with AI assistance? If so, what parts?
Can you show me your notes or earlier draft?

Most writers will answer these questions honestly. Writers who used heavy AI assistance will usually disclose it when asked directly, especially if your guidelines made it clear that disclosure rather than concealment is what matters. Writers who did the work by hand will be able to walk you through their process in specific detail.

This conversation does what a detector cannot do: it surfaces intent and process, not just statistical patterns.


Step 5: Make Your Policy Explicit

If you do not have a written policy about AI use in submissions, write one. Vague policies produce inconsistent application and selective enforcement, both of which damage writer relationships.

A useful policy covers:

Whether AI use is allowed (with bounds), disallowed entirely, or allowed with disclosure
What "AI assistance" means specifically (grammar tools? outlining? full drafts? polishing?)
How submissions will be screened
What happens when a piece is flagged (conversation, not automatic rejection)
What disclosure looks like when AI was used as a tool

The Writers Guild and several publishing trade groups have published model policies that work as starting points. Adapting one to your specific situation is faster than writing from scratch. The Modern Language Association has also published guidance specifically for academic and publishing contexts that handles the disclosure question well.

A clear policy lets writers self-select into work they want to do without ambiguity, and lets you enforce consistently when issues arise. Vague policies are also the source of most disputes: writers reasonably argue that they could not have known what was prohibited if you did not write it down. Writing the policy is cheap, and it pays back every time a question comes up.

Step 6: Track Outcomes Over Time

A screening workflow that flags pieces but never measures what happened to them is not actually catching false positives. It is just flagging some pieces and assuming each flag was correct.

A simple outcome log helps: every time a piece is flagged, record what happened next. Was the writer able to explain the work? Did revision resolve it? Did the writer leave? Did the piece get published as-is? Over time, the log reveals what the screen is actually doing.

If 95% of flagged pieces resolve with conversation and revision, the screen is working well. If 90% of flagged writers leave, the screen is producing false-positive damage that you have not noticed. The numbers are how you know.
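The log does not need to be elaborate. As a sketch, a list of records with one outcome field is enough to compute the rates discussed above. The field names and outcome labels here are assumptions for illustration, so use whatever categories match your own process.

```python
from collections import Counter


def outcome_rates(log):
    """Return the share of flagged pieces that ended in each outcome."""
    counts = Counter(entry["outcome"] for entry in log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: n / total for outcome, n in counts.items()}


# Hypothetical log: one record per flagged piece.
log = [
    {"piece": "A", "outcome": "explained"},
    {"piece": "B", "outcome": "revised"},
    {"piece": "C", "outcome": "explained"},
    {"piece": "D", "outcome": "writer_left"},
]
rates = outcome_rates(log)
print(rates)
```

A rising share of "writer_left" outcomes is the signal to revisit thresholds before more writers are lost.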

Reference material from the Association for Computational Linguistics on detection accuracy and false-positive rates is useful as a calibration check. If your screen produces flag rates significantly higher than published baselines on similar content, the threshold may need adjusting.

Step 7: Calibrate for Known Failure Modes

A few categories of writer score high on detectors regardless of AI involvement. Build awareness of these into your screening:

Non-native English writers consistently false-flag at higher rates
Formal academic prose scores high because the style is constrained
Tightly edited marketing copy scores high because editing converges on patterns
Translated text scores high because translation introduces uniformity

If your writer pool includes any of these categories, a high score is much less informative than it would be for a casual, conversational writer. Adjust your threshold accordingly.
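One possible way to encode that adjustment is a per-category offset on the flagging threshold. Everything in this sketch is an assumption: the category keys, the 80-point base, the 10-point offsets, and the cap are placeholders to be calibrated against your own data, not recommended values.

```python
BASE_THRESHOLD = 80.0

# Hypothetical offsets for the categories listed above.
CATEGORY_OFFSETS = {
    "non_native_english": 10.0,
    "formal_academic": 10.0,
    "edited_marketing": 10.0,
    "translated": 10.0,
}


def adjusted_threshold(categories, base=BASE_THRESHOLD, cap=95.0):
    """Raise the flagging threshold for known false-positive categories.

    Uses the largest applicable offset rather than summing them, so a
    writer in several categories does not become unflaggable.
    """
    offset = max(
        (CATEGORY_OFFSETS.get(c, 0.0) for c in categories), default=0.0
    )
    return min(base + offset, cap)


print(adjusted_threshold([]))
print(adjusted_threshold(["translated", "formal_academic"]))
```

Taking the maximum offset instead of the sum is a deliberate choice here. It keeps the threshold meaningful while still giving these writers more room.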

A Tool, Not a Judge

Detection tools fit naturally into a content review workflow as a screening signal. They surface text that deserves a closer human look. They are not built to render verdicts on individual writers, and the documented harm from treating them as such is real.

The right workflow uses them at the front of the funnel to triage attention, not at the back as decision-making evidence. With a per-writer baseline, sub-score awareness, multi-detector confirmation, and a clear policy framework, the tools earn their keep without producing false-positive damage.

For deeper coverage of why detector scores disagree and what they actually measure, the longer EvvyTools guide on detector disagreement walks through the underlying statistics in detail. Additional writing-quality tools at the EvvyTools tools directory pair well with detector workflows for broader pre-publish review.

Quick Workflow Recap

Multiple detectors, not one
Per-writer baseline before drawing conclusions
Sub-scores over composite scores
Conversation when flagged, not verdict
Written policy so expectations are clear
Outcome tracking so you know whether the screen is working
Known false-positive categories get extra context

This workflow produces the same value detector tools were sold to deliver, without the false-positive damage that has cost institutions trust with their own writers.

DEV Community