Steven Wang

Posted on Nov 20

How to Build an AI Content Checker From Scratch

#ai #webdev #python

1. Introduction

Today many people use AI to write text. AI can write blogs, homework, emails, stories, and almost anything. This is very cool, but it also brings a big problem: how do we know if a human wrote the text or if AI wrote the text?

So we need a tool. This tool is called an AI Checker.
An AI Checker reads text and tries to guess:

“Is this text written by a human?”
“Is this text written by AI?”
“Is this text copied from somewhere?”
“Is this text high quality?”

This blog will show you how to build your own AI Checker. We will use very simple English, but we will also use developer ideas, like code, models, APIs, and pipelines.

2. High-Level System Architecture

2.1 Core Components

Ingestion Layer
This part takes the text. The text can come from:

copy & paste
file upload
API call
a big batch of text

Detection Engine
This part is the “brain”. It reads the text and tries to predict:

AI or human
percent of AI text
strange patterns

Feature Analysis Module
This part looks at the text like a small scientist. It checks:

how random the words are
how long each sentence is
how hard or easy the text is
many other small things

Model Ensemble
This part mixes predictions from different models.
One model may look at grammar.
One model may look at word patterns.
One model may look at AI style.
When they join, the final result is stronger.
Report Generator
This part makes a simple result report.
It shows scores and highlights.
REST API Layer
Apps can talk to your AI Checker through an API.
Developers like APIs because they are easy to integrate.
Frontend Dashboard
A simple web page so users can upload text and see results.

2.2 Architecture Diagram (ASCII)

           +---------------------+
           |   Frontend UI       |
           +----------+----------+
                      |
                      v
           +----------+----------+
           |     REST API        |
           +----------+----------+
                      |
       +--------------+--------------+
       |                             |
       v                             v
+------+--------+            +--------+-------+
|  Ingestion    |            |Report Generator|
+------+--------+            +--------+-------+
       |                             ^
       v                             |
+------+--------+            +--------+-------+
| Feature       |   ----->   | Model Ensemble |
| Extraction    |            |  (AI Detector) |
+---------------+            +----------------+

This diagram is simple, but it shows the idea.

2.3 Tech Stack Options

You can use many languages or tools, but here are easy ones:

Python for machine learning
FastAPI for API
PyTorch for models
Transformers library for embeddings
Redis for caching
PostgreSQL for storage
Docker for deployment

This stack is simple but very strong.

3. Building the Detection Engine

Now let’s build the “brain”.
An AI Checker needs to “feel” the text.
It cannot just read the text like a human; it needs to measure things.

So we use three big ideas:

Statistical signals
Stylometric features
Semantic patterns

Then we mix them together.

3.1 Input Preprocessing

Before we check the text, we clean it.

Steps:

lower case
remove extra spaces
split into sentences
split into tokens (words)
maybe remove numbers
maybe remove emoji

Simple Python example:

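import re
import nltk

# One-time setup: nltk.download("punkt") (newer NLTK versions may need "punkt_tab")

def clean_text(text):
    text = re.sub(r"\s+", " ", text).strip()
    # Split before lowercasing: the sentence tokenizer uses capital letters as a cue.
    sentences = nltk.sent_tokenize(text)
    return [s.lower() for s in sentences]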

This step makes the next steps easier.
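The steps in the list above that the NLTK example does not show (splitting into tokens, maybe removing numbers and emoji) can be sketched with only the standard library. This is a minimal sketch, not a full tokenizer: the function names `split_sentences` and `tokenize` and the `keep_numbers` flag are made up for this example, and the regex only keeps ASCII letters, digits, and apostrophes, so emoji fall away as a side effect.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence, keep_numbers=True):
    """Lowercase a sentence and return its word tokens (ASCII only)."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    if not keep_numbers:
        # Drop tokens that are made only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

sentences = split_sentences("Hello world!  I have 2 cats. Do you?")
print(sentences)
print([tokenize(s, keep_numbers=False) for s in sentences])
```

A naive splitter like this breaks on abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer for real input.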

3.2 Statistical Signals

AI text often has patterns.
It is usually very smooth, very clean, and too perfect.
Human text has more “noise”.

So we calculate:

perplexity
burstiness
entropy
token variance

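Here is a toy example of “simple perplexity”. It estimates each word's probability from how often the word appears in the same text:

import math
from collections import Counter

def simple_perplexity(tokens):
    if not tokens:
        return 0.0
    # Toy unigram model: probability = word count / total words in this text
    counts = Counter(tokens)
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
    return 2 ** entropy

This is not real perplexity, which uses token probabilities from a language model, but it shows the idea:
AI text often has lower perplexity (more predictable).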
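Two of the other signals in the list can be sketched in the same spirit. The definitions here are assumptions, because these terms are used loosely: burstiness is taken as the coefficient of variation of sentence lengths (standard deviation divided by mean), and "token variance" is read as the variance of token lengths. Both function names are made up for this sketch.

```python
import math

def burstiness(sentence_lengths):
    """Coefficient of variation (std / mean) of sentence lengths.

    Assumed definition: 0.0 means every sentence has the same length.
    """
    if not sentence_lengths:
        return 0.0
    mean = sum(sentence_lengths) / len(sentence_lengths)
    if mean == 0:
        return 0.0
    var = sum((n - mean) ** 2 for n in sentence_lengths) / len(sentence_lengths)
    return math.sqrt(var) / mean

def token_length_variance(tokens):
    """Population variance of the character length of each token."""
    if not tokens:
        return 0.0
    lengths = [len(t) for t in tokens]
    mean = sum(lengths) / len(lengths)
    return sum((n - mean) ** 2 for n in lengths) / len(lengths)

print(burstiness([5, 5, 5]))          # uniform sentence lengths
print(burstiness([3, 25, 8, 40]))     # mixed short and long sentences
print(token_length_variance(["a", "abc"]))
```

Whether low burstiness really points to AI text depends on the data, so it is worth checking these numbers on your own labeled samples before trusting them.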

3.3 Stylometric Features

Stylometry means measuring “writing style”.
AI text often has a recognizable style:

similar sentence lengths
similar tone
similar grammar
often no strong emotion

Human writing is messier.

We can check:

average sentence length
length variance
POS tag distribution
number of commas, periods, etc.

Example:

import numpy as np

def sentence_length_features(sentences):
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {"avg_len": 0.0, "var_len": 0.0}  # avoid NaN on empty input
    return {
        "avg_len": float(np.mean(lengths)),
        "var_len": float(np.var(lengths))
    }

This helps the model understand the “shape” of the writing.
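The punctuation check from the list above can be sketched like this. It is a minimal example with assumed choices: the set of marks in `PUNCTUATION` and the "per 100 words" scaling are picked for illustration, and the function name `punctuation_features` is hypothetical.

```python
from collections import Counter

PUNCTUATION = ",.;:!?"

def punctuation_features(text):
    """Count selected punctuation marks per 100 words."""
    words = text.split()
    if not words:
        return {mark: 0.0 for mark in PUNCTUATION}
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    # Scale by word count so long and short texts can be compared.
    return {mark: 100.0 * counts[mark] / len(words) for mark in PUNCTUATION}

print(punctuation_features("Hi, there. How are you?"))
```

Scaling by word count matters because raw counts would mostly measure how long the text is.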

3.4 Semantic Pattern Detection

We use embeddings to understand meaning.

AI text often has:

too-consistent tone
very generic ideas
repeating safe phrases

We can use SentenceTransformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_vector(text):
    return model.encode([text])[0]

Then we can compare vectors or run clustering.
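One way to "compare vectors" is to measure how similar each sentence is to the next one. The sketch below works on plain lists of numbers, so it runs without any model; in practice the vectors would come from `get_vector`. The function names are made up for this example, and the idea that a very high average means "too consistent" is an assumption to test on real data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def adjacent_similarity(vectors):
    """Mean cosine similarity between each sentence vector and the next one."""
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Three tiny stand-in "sentence vectors"
print(adjacent_similarity([[1, 0], [1, 0], [0, 1]]))
```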

3.5 Ensemble Model

After we get many features, we mix them.

Example of a tiny PyTorch classifier:

import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc = nn.Linear(input_size, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

This classifier can learn from:

perplexity
burstiness
style features
embedding features

And produce a final score.
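Before the classifier can learn from these inputs, they have to be packed into one vector in a fixed order. This is a small sketch of that step; the names in `FEATURE_ORDER` and the function `build_feature_vector` are assumptions for this example, not part of any library.

```python
# Fixed order so every sample has the same layout (names are illustrative).
FEATURE_ORDER = ["perplexity", "burstiness", "avg_len", "var_len"]

def build_feature_vector(features, embedding):
    """Concatenate scalar features (in a fixed order) with the embedding."""
    scalars = [float(features[name]) for name in FEATURE_ORDER]
    return scalars + [float(x) for x in embedding]

vector = build_feature_vector(
    {"perplexity": 12.5, "burstiness": 0.4, "avg_len": 18, "var_len": 30},
    [0.1, 0.2],  # stand-in for a real sentence embedding
)
print(len(vector), vector)
```

The length of this vector is what you would pass as `input_size`. The features live on very different scales, so normalizing them before training is usually a good idea.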

4. Model Training Pipeline

To make the AI Checker smart, we need to train it.
Training a model is like teaching a child.
If you give the child good examples, the child learns.
If you give the child bad examples, the child becomes confused.

An AI Checker needs many, many samples.
We need both:

Human text
AI text

Then we show the model:

“This is human text.”
“This is AI text.”
“This is maybe AI text.”
“This is maybe human text.”

Over time, the model learns patterns.

4.1 Dataset Strategy

We need a large dataset.

Human text can come from:

blogs
books
emails
essays
forums
Wikipedia
Reddit
news sites

AI text can come from:

OpenAI models (GPT-3.5 / GPT-4 / GPT-4o)
Claude (Anthropic)
Gemini
LLaMA
DeepSeek
Mistral models
Other generative systems

We should collect AI text in many styles:

story style
academic style
SEO style
short answer
long answer
technical writing

This helps the model understand many patterns.

4.2 Labeling the Dataset

This step is simple:

If the text is written by a machine → label: AI
If written by a human → label: HUMAN

Sometimes we also give a score:

0.0 = pure human
1.0 = pure AI
0.5 = mixed

This helps the model give percentage results.

4.3 Training Loop

Here is a simple PyTorch training loop:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.BCELoss()  # expects probabilities, so the model must end in a sigmoid

    model.train()
    for epoch in range(epochs):
        for x, y in loader:
            # Flatten both sides so prediction and label shapes match.
            y_pred = model(x).view(-1)
            loss = loss_fn(y_pred, y.float().view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

This loop makes the model better each time.

4.4 Validation & Testing

We must check if our model is right.

We test:

Accuracy
Precision
Recall
False positives
False negatives

There are two kinds of mistakes:

A model that marks human text as AI (false positive) → bad
A model that marks AI text as human (false negative) → bad

We need balance.

We often test on new text that the model has never seen before.
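The metrics in the list above all come from four counts. Here is a minimal sketch that computes them from two lists of labels, where 1 means AI and 0 means human. The function name `detector_metrics` is made up for this example.

```python
def detector_metrics(y_true, y_pred):
    """Compute basic metrics. Labels: 1 = AI, 0 = human."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),  # human text marked as AI
        "false_negative_rate": ratio(fn, fn + tp),  # AI text marked as human
    }

print(detector_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

For an AI Checker the false positive rate deserves extra attention, because it counts human writers who get wrongly flagged.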

4.5 Continuous Update Pipeline

The world changes fast.
New AI models appear every month.
So our AI Checker must also improve.

A good plan is:

Every month: collect new AI text
Every month: collect new human text
Every month: retrain
Every month: re-deploy new version

This is often called continuous retraining (or continuous learning).

5. Building the REST API (FastAPI Example)

Now we make our model available to users.

A REST API lets:

apps
websites
extensions
scripts

talk to our AI Checker.

FastAPI is simple and fast.

5.1 Basic Structure

Here is a simple FastAPI app:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TextInput(BaseModel):
    text: str

@app.post("/detect")
def detect_ai(data: TextInput):
    # run_model is your own function that wraps the detection engine (not shown here)
    score = run_model(data.text)
    return {"ai_score": score}

This is all you need to start.

5.2 `/detect` Endpoint

This endpoint takes text and returns AI probability.

Example result:

{
  "ai_score": 0.87
}

Meaning: the model estimates an 87% chance that this text was made by AI. It is a prediction, not proof.

5.3 `/quality-score` Endpoint

You can also check text quality.

@app.post("/quality-score")
def check_quality(data: TextInput):
    q = get_quality_score(data.text)
    return {"quality": q}

The quality score can combine:

clarity
grammar
sentence flow
keyword strength

5.4 `/plagiarism-check` Endpoint

This endpoint checks if text is copied.

@app.post("/plagiarism-check")
def plagiarism(data: TextInput):
    result = run_plagiarism(data.text)
    return {"rate": result}

If rate = 0.45 → 45% similar
If rate = 0.95 → almost fully copied

5.5 `/explain` Endpoint

This endpoint returns an AI probability for each sentence, which the frontend can show as a heatmap.

Example:

{
  "sentences": [
    {"text": "This is a test.", "ai_prob": 0.80},
    {"text": "I like pizza.", "ai_prob": 0.12}
  ]
}

Users love this feature because they want to see which sentences made the model think the text is AI.
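The response shape above can be built by scoring each sentence on its own. This is a minimal sketch: `explain` and `dummy_scorer` are hypothetical names, the sentence splitter is a naive regex, and the scorer is a placeholder that you would replace with a call to the real model.

```python
import re

def explain(text, score_sentence):
    """Score every sentence and return the structure used by /explain."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "sentences": [
            {"text": s, "ai_prob": round(float(score_sentence(s)), 2)}
            for s in sentences
        ]
    }

def dummy_scorer(sentence):
    # Placeholder only: longer sentences get a higher score.
    return min(len(sentence.split()) / 20, 1.0)

print(explain("This is a test. I like pizza.", dummy_scorer))
```

Single sentences carry little signal, so per-sentence scores are usually noisier than the score for the whole text.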

6. Frontend Dashboard (Simple Version)

Now we make a simple UI so humans can use the tool.

A simple React component:

import { useState } from "react";

function Checker() {
  const [text, setText] = useState("");
  const [result, setResult] = useState(null);

  async function sendText() {
    const res = await fetch("/detect", {
      method: "POST",
      body: JSON.stringify({text}),
      headers: {"Content-Type": "application/json"}
    });
    setResult(await res.json());
  }

  return (
    <div>
      <textarea value={text} onChange={e => setText(e.target.value)} />
      <button onClick={sendText}>Check</button>
      {result && <p>AI Score: {result.ai_score}</p>}
    </div>
  );
}

Very basic, but it works.

7. Adding Plagiarism Detection

Plagiarism means “copying someone else’s words”.
A good AI Checker also needs this feature.

We use two kinds:

String-based plagiarism
Semantic plagiarism

7.1 String-Based Plagiarism

Very simple idea:

Compare the text to a big database of many documents.

You can use:

n-grams
shingling
hashing
cosine similarity

Example:

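def simple_similarity(a, b):
    # Jaccard similarity: shared words divided by all distinct words
    set_a = set(a.split())
    set_b = set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # both texts are empty
    return len(set_a & set_b) / len(union)

This is simple but not enough: it ignores word order and misses copied text that was reworded.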
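The n-gram and shingling ideas from the list above add word order to the comparison. Here is a minimal sketch using word 3-grams and Jaccard similarity. The function names and the default `n=3` are choices made for this example.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Text shorter than n words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shingle_similarity(a, b, n=3):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

print(shingle_similarity("the quick brown fox jumps", "the quick brown fox sleeps"))
print(shingle_similarity("the quick brown fox jumps", "fox brown quick the jumps"))
```

The second call uses the same words in a different order and scores 0.0, which a plain word-set comparison cannot tell apart from a copy.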

7.2 Semantic Plagiarism

This checks meaning, not just words.

We use embeddings again.

If two texts have similar meaning, cosine similarity is high:

from sklearn.metrics.pairwise import cosine_similarity

def semantic_score(v1, v2):
    return cosine_similarity([v1], [v2])[0][0]

This can catch text that was rewritten by AI paraphrasing tools.
Many students use AI to rewrite copied text.
The words change, but the meaning stays close, so the embedding similarity stays high.

7.3 Using Elasticsearch or Pinecone

For big databases, comparing the text against every document one by one is too slow, so we use vector search.

These tools support it:

Elasticsearch
Pinecone
Weaviate
Qdrant

They can store millions of vectors and return the nearest ones quickly.
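To see what a vector search does, here is a brute-force version in plain Python. It is only a stand-in for the idea and does not use the API of any of the tools above; real vector databases use special indexes so they do not have to scan every vector. The function names and the tiny `index` dictionary are made up for this sketch.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query, index, k=3):
    """index maps document id -> vector. Returns [(doc_id, score), ...]."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

index = {"doc_a": [1, 0], "doc_b": [0, 1], "doc_c": [1, 1]}
print(top_k([1, 0], index, k=2))
```

This scan checks every stored vector, which is fine for a few thousand documents and too slow for millions.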

8. Multi-Language Support

To support many languages, like:

Chinese
Japanese
Korean
Spanish
French
German

we must change:

tokenizers
embedding models
training datasets

Each language has different patterns.
For example:

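Chinese is written without spaces between words, so splitting on spaces does not work
English relies on many short function words (the, of, to)
Japanese mixes kanji and kana in the same sentence
Spanish sentences often run longer than English ones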

We must fine-tune per language.
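The tokenizer change can be shown with a crude example. This sketch splits on spaces and punctuation for most scripts, but emits Chinese characters and Japanese kana one character at a time. It is an assumption-heavy shortcut: real systems use a word segmenter for each language, and the function names here are made up.

```python
def is_unspaced_script(ch):
    """True for Chinese ideographs and Japanese kana (written without spaces)."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x309F   # Hiragana
        or 0x30A0 <= code <= 0x30FF   # Katakana
    )

def tokenize_multilingual(text):
    """Whitespace-style tokens, except unspaced scripts become single characters."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_unspaced_script(ch):
            flush()
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            flush()  # space or punctuation ends the current word
    flush()
    return tokens

print(tokenize_multilingual("I love 北京"))
print(tokenize_multilingual("Hola, mundo!"))
```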

9. Performance Optimization

If many users use your API, it must be fast.

Here are tips:

Cache model results
Use GPU if possible
Batch requests
Preload model to memory
Use Redis queue
Use async FastAPI

Example async API:

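@app.post("/detect")
async def detect_ai(data: TextInput):
    # run_async_model must be a real coroutine (for example, one that runs
    # inference in a worker thread); a blocking call here would stall the event loop.
    score = await run_async_model(data.text)
    return {"ai_score": score}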
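The "cache model results" tip can be sketched with a dictionary keyed by a hash of the text. In production the dictionary would be replaced by Redis or a similar store; the class name `ScoreCache` and the `fake_model` function are hypothetical and exist only for this example.

```python
import hashlib

class ScoreCache:
    """In-memory cache keyed by a hash of the text (stand-in for Redis)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key_for(text):
        # Hash the text so the cache key has a fixed size.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, compute):
        key = self.key_for(text)
        if key not in self._store:
            self._store[key] = compute(text)  # only runs on a cache miss
        return self._store[key]

calls = []

def fake_model(text):
    calls.append(text)  # record how often the "model" really runs
    return 0.87

cache = ScoreCache()
print(cache.get_or_compute("hello world", fake_model))
print(cache.get_or_compute("hello world", fake_model))
print("model calls:", len(calls))
```

Storing a hash instead of the raw text also means the cache does not keep a readable copy of what users submit.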

10. Deployment

You can deploy with Docker:

Dockerfile

FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Then:

docker build -t ai-checker .
docker run -p 8000:8000 ai-checker

For cloud, use:

AWS
GCP
Azure
Railway
Render
Cloudflare Workers (for frontend)

11. Security & Privacy

Very important.

You must protect user data.

Rules:

Do not store submitted text
Or store it only with user permission
Do not write submitted text to logs
No selling data
Use HTTPS
Add rate limit to block bots

Example (slowapi is one rate-limiting library that works with FastAPI):

from slowapi import Limiter
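To show what a rate limit does under the hood, here is a small sliding-window limiter in plain Python. It is a sketch of the idea only and is not how slowapi is implemented; the class name, the `client_id` key, and the optional `now` argument (useful for testing) are assumptions made for this example.

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request times

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("203.0.113.7") for _ in range(3)])
```

This version keeps its state in one process, so an API running on several servers would need a shared store for the counters.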

12. Future Extensions

Future ideas:

detect AI images
detect AI audio
detect AI video
detect AI code
browser extensions
WordPress plugin
API for LMS (school systems)

AI will grow fast.
AI detection must grow too.

13. Conclusion

We now have a full AI Checker:

architecture
models
features
API
frontend
training
deployment

We used simple English, but real tech.
Now any developer can build an AI Checker.

DEV Community
