I Built an AI Text Detector from Scratch — Here's What I Learned About Doing It the Hard Way First

#ai #javascript #webdev #programming

The Hard Way

Before I shipped Aidetector, I spent two weeks doing AI detection manually.

I'm not joking. A client asked me to review a batch of blog posts for AI-generated content, and I had no reliable free tool. So I did what any developer does when they're stubborn and slightly overconfident — I started reading papers.

I pulled research on AI writing patterns. I opened a spreadsheet. I flagged things like:

- Sentence length variance (AI texts are suspiciously uniform)
- Overuse of hedging language ("it is important to note that...")
- Low lexical diversity in paragraph transitions
- Predictable semantic structure: topic sentence, three supporting points, wrap-up

I was manually scoring each document on a 12-point rubric, at about 20 minutes per article. For 40 articles, that comes to roughly 13 hours.

That's when I thought: this should be a tool.

Why I Built It

Most free AI detectors at the time were either:

- Capped at 500 words (useless for long-form content)
- Requiring signup or API keys
- Running on a single heuristic with no transparency about what they were actually checking

I wanted something that ran entirely in the browser, explained its reasoning, supported recent models like GPT-5 and Claude 3.7, and had zero word limits. No backend. No user data. No nonsense.

The Tech Stack

The entire thing runs client-side:

- Vanilla JavaScript: no framework overhead, just fast DOM manipulation
- HTML/CSS: keeping it lightweight and accessible
- No external APIs: everything is computed locally in the browser

The detection logic runs 12 linguistic pattern checks derived from published NLP research. These include:

- Burstiness score (variance in sentence lengths)
- Perplexity approximation (word predictability heuristics)
- Hedging phrase frequency
- Passive voice ratio
- Transition word overuse
- Semantic flatness (paragraph topic variance)
... and six more
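
To make two of these checks concrete, here is a minimal sketch of a burstiness score and a hedging phrase frequency. It is an illustration under assumptions, not the tool's actual code: the function names, the sentence-splitting regex, the choice of coefficient of variation as the burstiness measure, and the `HEDGING_PHRASES` list are all hypothetical.

```javascript
// Rough sentence splitter: runs of text ending in ., ! or ? (a heuristic, not a real parser).
function splitSentences(text) {
  return (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Burstiness, assumed here to be the coefficient of variation of
// sentence lengths in words. Uniform sentences give a value near 0.
function burstiness(text) {
  const lengths = splitSentences(text).map((s) => s.split(/\s+/).length);
  if (lengths.length < 2) return 0;
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((acc, n) => acc + (n - mean) ** 2, 0) / lengths.length;
  return Math.sqrt(variance) / mean;
}

// Hypothetical phrase list; a real list would be much longer.
const HEDGING_PHRASES = [
  "it is important to note that",
  "it is worth noting that",
  "generally speaking",
];

// Hedging phrase hits per sentence.
function hedgingFrequency(text) {
  const sentenceCount = splitSentences(text).length;
  if (sentenceCount === 0) return 0;
  const lower = text.toLowerCase();
  let hits = 0;
  for (const phrase of HEDGING_PHRASES) {
    let idx = lower.indexOf(phrase);
    while (idx !== -1) {
      hits += 1;
      idx = lower.indexOf(phrase, idx + phrase.length);
    }
  }
  return hits / sentenceCount;
}
```

Dividing by the mean keeps the burstiness value comparable between texts with short and long average sentences, which a raw variance would not.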

Each check returns a weighted score. The final result is a composite confidence percentage, broken down so the user can actually see why the tool flagged something.
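
One way to combine the checks is a weighted average that keeps each check's contribution visible. The sketch below assumes every check reports a score between 0 and 1 (higher meaning more AI-like) plus a weight; the object shape and the `compositeScore` name are hypothetical, not taken from the tool.

```javascript
// Combine per-check scores into a percentage and keep a per-check breakdown.
// Each entry in `checks` is assumed to look like { name, score, weight },
// with score in [0, 1] and weight >= 0.
function compositeScore(checks) {
  const totalWeight = checks.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return { confidence: 0, breakdown: [] };

  const breakdown = checks.map((c) => ({
    name: c.name,
    score: c.score,
    // Share of the final result attributable to this check.
    contribution: (c.score * c.weight) / totalWeight,
  }));

  const total = breakdown.reduce((sum, b) => sum + b.contribution, 0);
  return { confidence: Math.round(total * 100), breakdown };
}

// Usage with made-up numbers:
const result = compositeScore([
  { name: "burstiness", score: 1, weight: 1 },
  { name: "hedging", score: 0, weight: 1 },
]);
```

Returning the breakdown alongside the percentage is what lets a UI show which checks drove the result.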

The Technical Challenges

1. Approximating perplexity without an LLM

True perplexity requires a language model to score token probabilities. I don't have a backend, so I approximated it using a trigram frequency lookup built from a curated corpus. It's not perfect, but it's directionally accurate for the patterns I care about.
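
As a rough illustration of the idea, the sketch below builds a trigram table from a reference corpus and reports what fraction of a text's trigrams appear in it. This is a simplified stand-in, not real perplexity and not necessarily how the tool does it; the tokenizer, the function names, and the "fraction seen" measure are assumptions.

```javascript
// Lowercase word tokenizer (an assumption; real tokenization may differ).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Count every three-word sequence in the reference corpus.
function buildTrigramCounts(corpus) {
  const words = tokenize(corpus);
  const counts = new Map();
  for (let i = 0; i + 2 < words.length; i++) {
    const key = words.slice(i, i + 3).join(" ");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Fraction of the text's trigrams that were seen in the corpus.
// Higher means more predictable wording relative to that corpus.
function predictability(text, counts) {
  const words = tokenize(text);
  const total = words.length - 2;
  if (total <= 0) return 0;
  let known = 0;
  for (let i = 0; i < total; i++) {
    if (counts.has(words.slice(i, i + 3).join(" "))) known += 1;
  }
  return known / total;
}

// Usage with a toy corpus:
const table = buildTrigramCounts("the cat sat on the mat");
const score = predictability("the cat sat quietly", table);
```

The result depends entirely on the corpus behind the table, so a score only means something relative to other texts measured against the same table.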

2. Avoiding false positives on technical writing

Technical documentation naturally has low sentence variance and formal structure — exactly what my detector was flagging as AI. I had to add a context-aware exemption layer that detects domain-specific vocabulary density and adjusts scoring accordingly.
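
A simple version of that kind of adjustment could look like the sketch below. Everything specific in it is hypothetical: the `TECHNICAL_TERMS` list, the 0.1 threshold, and the 50% maximum discount are placeholders chosen for illustration, not values from the tool.

```javascript
// Hypothetical vocabulary; a real list would be larger and per-domain.
const TECHNICAL_TERMS = new Set([
  "api", "function", "parameter", "endpoint",
  "config", "server", "request", "response",
]);

// Share of words in the text that belong to the technical vocabulary.
function technicalDensity(text) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  if (words.length === 0) return 0;
  const hits = words.filter((w) => TECHNICAL_TERMS.has(w)).length;
  return hits / words.length;
}

// Scale a structure-based score down once density passes the threshold.
// The discount grows linearly and is capped at maxDiscount.
function adjustForDomain(score, density, threshold = 0.1, maxDiscount = 0.5) {
  if (density <= threshold) return score;
  const excess = Math.min((density - threshold) / (1 - threshold), 1);
  return score * (1 - maxDiscount * excess);
}

// Usage: soften a uniformity score for a text that looks like documentation.
const density = technicalDensity("The API endpoint returns a response");
const adjusted = adjustForDomain(0.8, density);
```

Capping the discount means a text packed with technical terms can still be flagged if the other signals are strong.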

3. Keeping up with new models

GPT-5 and Claude 3.7 write noticeably differently from earlier models. I had to collect new sample sets and re-weight several heuristics. This is an ongoing calibration problem: the patterns shift as models improve.

Lessons Learned

Doing it the hard way first was actually useful. Building a manual rubric before automating it forced me to understand the problem domain deeply. I wasn't just wiring up someone else's API — I actually knew what I was detecting and why.

Transparency builds trust. Showing users which patterns triggered and why has been the most-praised feature. People don't want a black box percentage. They want to understand the reasoning.

No-login tools get used. Friction kills adoption. Removing signup entirely meant people actually came back and shared it.

Browser-only is a genuine constraint, not just a gimmick. You have to think carefully about what's computationally feasible without a server. Some things I wanted to add (real perplexity scoring, model fine-tuning) are simply not possible client-side at scale.

Try It

If you're an educator reviewing student submissions, a content editor checking freelance work, or just curious how your own writing scores — give it a shot: aidetector.getinfotoyou.com

No word limits. No login. No API key. Paste your text and see what it finds.

I'm still actively improving the heuristics. If you find a false positive or a miss, I'd genuinely like to know.