
wd400

Originally published at github.com

I got mass-flagged by GPTZero for my own writing. So I built an open-source alternative in pure Python.

Nearly every well-known AI text detector is either paid or closed-source.

GPTZero charges $15/month. Originality.ai charges per scan. Turnitin locks you into institutional contracts. And all of them are black boxes — when they flag your text as AI-generated, you have no idea why.

I got tired of this. Especially after GPTZero flagged my own human-written paragraphs as "98% AI."

So I built lmscan.

What it does

pip install lmscan
lmscan "paste any text here"
→ 82% AI probability, likely GPT-4

It analyzes 12 statistical features, including burstiness, entropy, Zipf deviation, vocabulary richness, and slop-word density, and fingerprints 9 LLM families.

No neural network. No API key. No internet. Runs in <50ms.
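To give a feel for what "statistical features" means here, this is a minimal sketch of three of them: word-level entropy, vocabulary richness (as a type-token ratio), and the hapax legomena ratio. The function names and the tokenizer are assumptions made for illustration, not lmscan's internals, and the real feature definitions may differ.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def word_entropy(tokens: list) -> float:
    """Shannon entropy, in bits, of the word frequency distribution."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def type_token_ratio(tokens: list) -> float:
    """Distinct words divided by total words (basic vocabulary richness)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_ratio(tokens: list) -> float:
    """Share of distinct words that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


if __name__ == "__main__":
    tokens = tokenize("The cat and the dog")
    print(word_entropy(tokens), type_token_ratio(tokens), hapax_ratio(tokens))
```

All three are plain counting over tokens, which is why this kind of analysis can run without a model or a network call.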

The detection approach

AI text is unnaturally smooth. Humans write in bursts — short punchy sentences followed by long rambling ones. LLMs produce eerily consistent sentence lengths.
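One common way to put a number on this is the coefficient of variation of sentence lengths: the standard deviation divided by the mean. The sketch below assumes that definition and a crude punctuation-based sentence splitter; lmscan's actual burstiness formula is not shown in this post and may be different.

```python
import re
import statistics


def sentence_lengths(text: str) -> list:
    """Split on ., ! and ? and count words per sentence (a crude splitter)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths: stdev / mean."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # one sentence has no variation to measure
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


if __name__ == "__main__":
    uniform = "One two three. Four five six. Seven eight nine."
    varied = "Stop. This sentence is quite a bit longer than the first one."
    print(burstiness(uniform), burstiness(varied))
```

Under this definition, evenly sized sentences score 0 and the score grows as short and long sentences mix, so a low value is the "eerily consistent" signal.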

LLMs also have vocabulary tells:

  • GPT-4 loves "delve" and "tapestry"
  • Claude says "I think it's worth noting"
  • Llama overuses "comprehensive" and "crucial"

lmscan scores text against each family's marker set to fingerprint the source.
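A rough sketch of how marker-set scoring can work: count marker hits per family, normalize by text length, and pick the highest scorer. The marker lists below contain only the examples from this post, and the names `MARKERS`, `marker_scores`, and `best_match` are hypothetical. lmscan's real lists, weights, and scoring are assumed to be more involved.

```python
import re
from typing import Optional

# Illustrative marker sets taken only from the examples in this post.
MARKERS = {
    "GPT-4": ["delve", "tapestry"],
    "Claude": ["i think it's worth noting"],
    "Llama": ["comprehensive", "crucial"],
}


def marker_scores(text: str) -> dict:
    """Marker hits per 100 words for each family (whole-word matches only)."""
    lowered = text.lower()
    n_words = max(len(lowered.split()), 1)
    scores = {}
    for family, phrases in MARKERS.items():
        hits = sum(
            len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
            for p in phrases
        )
        scores[family] = 100.0 * hits / n_words
    return scores


def best_match(text: str) -> Optional[str]:
    """Family with the highest score, or None when no marker matched."""
    scores = marker_scores(text)
    family = max(scores, key=scores.get)
    return family if scores[family] > 0 else None


if __name__ == "__main__":
    print(best_match("Let us delve into the rich tapestry of history."))
```

Normalizing by word count keeps a long document from looking more "GPT-like" just because it has more room for a marker to appear.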

Python API

from lmscan import scan
result = scan("your text")
print(f"{result.ai_probability:.0%} AI, likely {result.fingerprint.model}")

Features

  • 12 statistical features (burstiness, entropy, Zipf deviation, hapax legomena, vocabulary richness, slop-word density, and more)
  • 9 LLM fingerprints (GPT-4, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek, Cohere, Phi)
  • Multilingual support (English, French, Spanish, German, Portuguese + CJK auto-detection)
  • Batch directory scanning with --dir
  • Mixed-content paragraph analysis with --mixed
  • HTML reports with --format html
  • Streamlit web UI with pip install lmscan[web]
  • Pre-commit hook integration
  • Calibration API for tuning thresholds on your own data
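Tuning a threshold on your own data, as in the last item above, comes down to picking a cutoff from labeled samples. The sketch below is not the lmscan calibration API, whose functions are not shown in this post. It is a standalone illustration with a hypothetical `pick_threshold` helper that takes AI-probability scores you have already collected and returns the lowest cutoff that keeps false positives on human text under a cap.

```python
import math


def pick_threshold(scores, labels, max_fpr=0.05):
    """Lowest threshold whose false-positive rate on human text is <= max_fpr.

    scores: AI-probability values, one per sample
    labels: True for AI-written samples, False for human-written ones
    A sample counts as flagged when its score is >= the threshold.
    """
    human = [s for s, is_ai in zip(scores, labels) if not is_ai]
    if not human:
        raise ValueError("need at least one human-written sample")
    # Candidates: every observed score, then infinity (which flags nothing).
    for threshold in sorted(set(scores)) + [math.inf]:
        false_positives = sum(1 for s in human if s >= threshold)
        if false_positives / len(human) <= max_fpr:
            return threshold


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.6, 0.7, 0.9]
    labels = [False, False, False, True, True]
    print(pick_threshold(scores, labels, max_fpr=0.0))
```

Capping the false-positive rate first, rather than maximizing overall accuracy, fits the case that motivated this project: human writing being flagged as AI.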

Honest limitations

This is statistical analysis, not a transformer classifier. It won't catch heavily paraphrased AI text. But:

  • You can see exactly which features triggered
  • No black-box verdicts: a false positive can be traced back to the features that caused it
  • Calibration API lets you tune for your domain
  • 193 tests, Apache-2.0

GitHub: github.com/stef41/lmscan
PyPI: pypi.org/project/lmscan

Feedback welcome — especially on which types of text it struggles with. That helps calibrate the feature weights.
