Building an AI Humanizer: why we stopped trying to fix prompts

This post is about a mistake we made early on: assuming that “unnatural” LLM output could be fixed at the prompt level.

It can’t. At least not reliably.

What finally worked for us was treating LLM text as a signal-processing problem at the sentence level, not a generation problem.

The signal we kept measuring 📊

We started from AI detection work, which forced us to look at text statistically instead of stylistically.

Across different LLMs and prompts, flagged samples shared the same low-level traits:

  • Sentence length variance was abnormally low
  • Clause depth was consistently shallow
  • Discourse markers repeated with high frequency
  • Sentence openers followed predictable templates

None of these are errors.

But together, they form a pattern.

When we plotted sentence length distributions, human-written text had long tails.

LLM text clustered tightly around the mean.

That clustering turned out to be a stronger signal than vocabulary choice.
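To make that concrete, here is a minimal sketch of the kind of per-sentence statistics that expose the clustering. The regex sentence split and the two-token "opener" are simplifications for illustration, not our actual measurement code.

```python
import re
import statistics
from collections import Counter


def sentence_stats(text: str) -> dict:
    """Rough per-document stats over sentence lengths and openers."""
    # Naive split on terminal punctuation; good enough to see the distribution.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return {}

    lengths = [len(s.split()) for s in sentences]
    # "Opener" = first two lowercased tokens ("however the", "it is", ...).
    openers = Counter(" ".join(s.lower().split()[:2]) for s in sentences)

    return {
        "mean_length": statistics.mean(lengths),
        "length_stdev": statistics.pstdev(lengths),  # low stdev = tight clustering
        "top_opener_share": openers.most_common(1)[0][1] / len(sentences),
    }
```

Run this over a human-written essay and an LLM draft on the same topic and the `length_stdev` gap is usually visible immediately.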

Why prompts failed at fixing this 😐

Prompt instructions like:

“Vary sentence length”

“Write more naturally”

operate at generation time, but they don’t constrain local structure.

In practice, prompts affected:

  • word choice
  • tone
  • politeness

They barely affected:

  • sentence rhythm
  • transition placement
  • redundancy density

Worse, prompt changes introduced instability. Small edits caused large global shifts, which made debugging impossible.

From an engineering standpoint, that was a dead end.

Reframing the problem 🔁

We stopped treating LLM output as “final text”.

Instead, we treated it as raw material.

That led to a two-stage pipeline:

  1. Generation — optimize for clarity and correctness
  2. Sentence-level rewriting — optimize for distribution and flow

The second stage is what later became the AI Humanizer.
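Sketched as code, the split looks roughly like this. `generate` and `humanize` stand in for whatever model call and rewriter you plug in; they are not a real API.

```python
from typing import Callable


def produce_text(
    prompt: str,
    generate: Callable[[str], str],  # stage 1: any LLM call, tuned for correctness
    humanize: Callable[[str], str],  # stage 2: sentence-level rewriter
) -> str:
    # Stage 1: the draft only has to be clear and correct.
    draft = generate(prompt)
    # Stage 2: treat the draft as raw material and fix its distribution.
    return humanize(draft)
```

The value of the split is that each stage can be evaluated on its own metric instead of asking one prompt to do both jobs.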

What sentence-level rewriting actually does 🧩

This is not paraphrasing everything.

We only touch sentences that trip specific heuristics:

  • length similarity above a threshold
  • repeated syntactic openers
  • excessive connective phrases
  • over-explained subordinate clauses

Rewrites are local:

  • split a sentence
  • compress another
  • delete a transition
  • reorder clauses

Semantics stay fixed.

Distribution changes.

That distinction matters.
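A toy version of that flag-then-rewrite loop is sketched below. The 10% length threshold, the connective list, and the single `drop_leading_transition` rewrite are stand-ins made up for illustration, not the real rule set.

```python
import re

CONNECTIVES = ("however", "moreover", "furthermore", "additionally", "in addition")
_LEADING_CONNECTIVE = re.compile(r"^(?:%s),?\s+" % "|".join(CONNECTIVES), re.IGNORECASE)


def flag_sentence(sent: str, prev: str | None) -> list[str]:
    """Return the heuristics this sentence trips relative to the previous one."""
    flags = []
    words = sent.split()
    prev_words = prev.split() if prev else []

    if prev_words:
        # Length similarity: consecutive sentences within ~10% of each other.
        if abs(len(words) - len(prev_words)) / len(prev_words) < 0.1:
            flags.append("length_similarity")
        # Repeated syntactic opener: same first token as the previous sentence.
        if words and words[0].lower() == prev_words[0].lower():
            flags.append("repeated_opener")

    # Excessive connectives: a stock transition glued to the front.
    if _LEADING_CONNECTIVE.match(sent):
        flags.append("leading_connective")
    return flags


def drop_leading_transition(sent: str) -> str:
    """One local rewrite: delete the transition, leave the semantics alone."""
    stripped = _LEADING_CONNECTIVE.sub("", sent, count=1)
    return stripped[:1].upper() + stripped[1:] if stripped else sent
```

In a real pipeline the choice of rewrite (split, compress, reorder) would hang off those flags; the point here is only that both the check and the edit are local.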

Why this works better technically ⚙️

Because it’s measurable.

After rewriting, we can observe:

  • increased sentence length variance
  • reduced opener repetition
  • lower transition density
  • more human-like rhythm curves

This makes the system debuggable.

Prompts are opaque.

Post-processing isn’t.
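Because the metrics are the same ones from the measurement sketch above, the debug loop can be as simple as a metric diff. `sentence_stats` is the earlier sketch, and the two text arguments are placeholders for a draft and its rewritten version.

```python
def humanizer_report(draft_text: str, rewritten_text: str) -> dict:
    """Delta of each metric after rewriting; reuses sentence_stats() from above."""
    before, after = sentence_stats(draft_text), sentence_stats(rewritten_text)
    return {k: round(after[k] - before[k], 3) for k in before}

# What we want to see after a pass: length_stdev goes up, top_opener_share goes down.
```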

Where the AI Humanizer fits 🧠

This approach eventually became the AI Humanizer inside Dechecker — not as a detector workaround, but as a controllable post-processing layer.

It has clear limits:

  • it won’t fix weak arguments
  • it can over-flatten voice if pushed too hard
  • different domains need different thresholds

But unlike prompt tuning, we can see exactly what changed and why.

Why this matters beyond detection 👀

Even if detectors didn’t exist, this problem would.

Uniform structure is tiring to read. Humans subconsciously expect irregularity. Sentence-level rewriting restores that irregularity without changing meaning.

From a systems perspective, it’s simply the right abstraction level.

Final takeaway ✅

If LLM-generated text feels unnatural, the issue is rarely what the model says.

It’s how evenly it says it.

Prompts don’t fix distributions.

Rewriting does.
