Most popular AI text detectors are paid, closed-source, or both.
GPTZero charges $15/month. Originality.ai charges per scan. Turnitin locks you into institutional contracts. And all of them are black boxes — when they flag your text as AI-generated, you have no idea why.
I got tired of this. Especially after GPTZero flagged my own human-written paragraphs as "98% AI."
So I built lmscan.
What it does
pip install lmscan
lmscan "paste any text here"
→ 82% AI probability, likely GPT-4
It analyzes 12 statistical features — burstiness, entropy, Zipf deviation, vocabulary richness, slop-word density — and fingerprints 9 LLM families.
No neural network. No API key. No internet. Runs in <50ms.
The detection approach
AI text is unnaturally smooth. Humans write in bursts — short punchy sentences followed by long rambling ones. LLMs produce eerily consistent sentence lengths.
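To make the burstiness idea concrete, here is a minimal sketch that measures how much sentence length varies, using the coefficient of variation (standard deviation divided by mean). This is an illustration under my own assumptions, not lmscan's actual implementation: the function names, the naive sentence splitter, and the choice of metric are all hypothetical.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Split on runs of . ! ? and return the word count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length (population std / mean).

    Hypothetical metric for illustration; lmscan may define burstiness differently.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # variation is undefined for fewer than two sentences
    return statistics.pstdev(lengths) / statistics.mean(lengths)


bursty = "No. I waited for nearly an hour in the rain before the bus finally showed up. Typical."
uniform = "The bus was late today. The rain was heavy today. The wait was long today."

print(f"bursty:  {burstiness(bursty):.2f}")
print(f"uniform: {burstiness(uniform):.2f}")
```

The second sample has three sentences of identical length, so its score is exactly zero, while the first mixes one-word and fifteen-word sentences and scores much higher. A real detector would need a sturdier sentence splitter (abbreviations, decimals, quotes) than this regex.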
LLMs also have vocabulary tells:
- GPT-4 loves "delve" and "tapestry"
- Claude says "I think it's worth noting"
- Llama overuses "comprehensive" and "crucial"
lmscan scores text against each family's marker set to fingerprint the source.
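As a rough sketch of what scoring against marker sets can look like, the snippet below counts marker hits per 100 words for each family and picks the highest. The marker lists contain only the examples named above, and the names MARKERS, marker_scores, and best_match are hypothetical; lmscan's real lists, weights, and scoring are not shown here and are likely more involved.

```python
import re
from typing import Optional

# Hypothetical marker sets built only from the examples in this post.
MARKERS = {
    "GPT-4": ["delve", "tapestry"],
    "Claude": ["i think it's worth noting"],
    "Llama": ["comprehensive", "crucial"],
}


def marker_scores(text: str) -> dict[str, float]:
    """Return marker hits per 100 words for each family (exact word forms only)."""
    lowered = text.lower()
    n_words = max(len(lowered.split()), 1)  # avoid division by zero on empty input
    scores = {}
    for family, phrases in MARKERS.items():
        hits = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
        scores[family] = 100.0 * hits / n_words
    return scores


def best_match(text: str) -> Optional[str]:
    """Return the family with the highest marker density, or None if nothing matched."""
    scores = marker_scores(text)
    family = max(scores, key=scores.get)
    return family if scores[family] > 0 else None


sample = "Let us delve into the rich tapestry of a comprehensive plan."
print(marker_scores(sample))
print(best_match(sample))
```

Normalizing by word count keeps long documents from winning just because they have more room for markers. Matching exact forms means "delves" would be missed, so a fuller version would probably stem or lemmatize first.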
Python API
from lmscan import scan
result = scan("your text")
print(f"{result.ai_probability:.0%} AI, likely {result.fingerprint.model}")
Features
- 12 statistical features (burstiness, entropy, Zipf deviation, hapax legomena, vocabulary richness, slop-word density, and more)
- 9 LLM fingerprints (GPT-4, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek, Cohere, Phi)
- Multilingual support (English, French, Spanish, German, Portuguese + CJK auto-detection)
- Batch directory scanning with --dir
- Mixed-content paragraph analysis with --mixed
- HTML reports with --format html
- Streamlit web UI with pip install lmscan[web]
- Pre-commit hook integration
- Calibration API for tuning thresholds on your own data
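Two of the features listed above, vocabulary richness and hapax legomena, are simple enough to sketch in a few lines. The version below uses the type-token ratio for richness and the share of word types that occur exactly once for hapaxes. Both definitions are common choices that I am assuming for illustration; the function names are hypothetical and lmscan may compute these differently.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_stats(text: str) -> dict[str, float]:
    """Return type-token ratio and hapax ratio (assumed definitions, see lead-in)."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)  # words seen exactly once
    return {
        "type_token_ratio": len(counts) / len(tokens),  # distinct words / all words
        "hapax_ratio": hapaxes / len(counts),           # once-only words / distinct words
    }


print(vocabulary_stats("the cat and the dog and the bird"))
```

Both ratios shrink as a text gets longer, so they are only comparable between samples of similar length. The ASCII-only tokenizer is also a simplification that would not handle accented or CJK text.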
Honest limitations
This is statistical analysis, not a transformer classifier. It won't catch heavily paraphrased AI text. But:
- You can see exactly which features triggered
- When it gets a text wrong, the feature breakdown shows why, so false positives are not a black box
- Calibration API lets you tune for your domain
- 193 tests, Apache-2.0
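To show what tuning for your domain means in practice, here is a generic sketch that picks a decision threshold from labeled examples. It deliberately does not use lmscan's Calibration API, whose interface is not shown in this post; best_threshold is a hypothetical helper, and the scores and labels are made-up illustration values, not measurements.

```python
def best_threshold(scores: list[float], labels: list[bool]) -> tuple[float, float]:
    """Return (cutoff, accuracy) for the cutoff that best separates the labels.

    A text is predicted as AI when its score is >= cutoff. Ties keep the lowest cutoff.
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    best_cutoff, best_accuracy = 0.5, -1.0
    for cutoff in sorted(set(scores)):  # only observed scores can change the outcome
        correct = sum((s >= cutoff) == is_ai for s, is_ai in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy


# Made-up AI-probability scores for six texts; True means the text really was AI-written.
scores = [0.91, 0.84, 0.77, 0.62, 0.58, 0.30]
labels = [True, True, True, False, False, False]

cutoff, accuracy = best_threshold(scores, labels)
print(f"cutoff={cutoff}, accuracy={accuracy:.0%}")
```

In this toy data a default cutoff of 0.5 would flag two human texts, while a cutoff picked from the labeled set separates the two groups. With real data you would want many more samples and a held-out set, since a threshold tuned on six points will overfit.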
GitHub: github.com/stef41/lmscan
PyPI: pypi.org/project/lmscan
Feedback welcome — especially on which types of text it struggles with. That helps calibrate the feature weights.