DEV Community

Steven Wang


How to Build an AI Content Checker From Scratch

1. Introduction

Today many people use AI to write text. AI can write blogs, homework, emails, stories, and almost anything. This is very cool, but it also brings a big problem: how do we know if a human wrote the text or if AI wrote the text?

So we need a tool. This tool is called an AI Checker.
An AI Checker reads text and tries to guess:

  • “Is this text written by a human?”
  • “Is this text written by AI?”
  • “Is this text copied from somewhere?”
  • “Is this text high quality?”

This blog will show you how to build your own AI Checker. We will use very simple English, but we will also use developer ideas, like code, models, APIs, and pipelines.

2. High-Level System Architecture

2.1 Core Components

  1. Ingestion Layer
    This part takes the text. The text can come from:
      • copy & paste
      • file upload
      • API call
      • a big batch of text

  2. Detection Engine
    This part is the “brain”. It reads the text and tries to predict:
      • AI or human
      • percent of AI text
      • strange patterns

  3. Feature Analysis Module
    This part looks at the text like a small scientist. It checks:
      • how random the words are
      • how long each sentence is
      • how hard or easy the text is
      • many other small things

  4. Model Ensemble
    This part mixes predictions from different models.
    One model may look at grammar.
    One model may look at word patterns.
    One model may look at AI style.
    When they join, the final result is stronger.

  5. Report Generator
    This part makes a simple result report.
    It shows scores and highlights.

  6. REST API Layer
    Apps can talk to your AI Checker through an API.
    Developers like APIs because they are easy to use.

  7. Frontend Dashboard
    A simple web page so users can upload text and see results.

2.2 Architecture Diagram (ASCII)

           +---------------------+
           |   Frontend UI       |
           +----------+----------+
                      |
                      v
           +----------+----------+
           |     REST API        |
           +----------+----------+
                      |
       +--------------+--------------+
       |                             |
       v                             v
+------+--------+            +-------+----------+
|  Ingestion    |            | Report Generator |
+------+--------+            +-------+----------+
       |                             ^
       v                             |
+------+--------+            +-------+----------+
| Feature       |   ----->   | Model Ensemble   |
| Extraction    |            |  (AI Detector)   |
+---------------+            +------------------+

This diagram is simple, but it shows the idea.
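
As a rough sketch, the boxes in the diagram could be wired together like this. All function names here (`ingest`, `extract_features`, `ensemble_score`, `build_report`, `check`) are made up for illustration, and the scoring rule is a placeholder, not a trained detector.

```python
from dataclasses import dataclass


@dataclass
class Report:
    ai_score: float
    n_sentences: int


def ingest(raw: str) -> str:
    # Ingestion layer: normalize whitespace, whatever the input source was
    return " ".join(raw.split())


def extract_features(text: str) -> dict:
    # Feature extraction: one placeholder feature (average sentence length)
    sentences = [s for s in text.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0
    return {"n_sentences": len(sentences), "avg_len": avg_len}


def ensemble_score(features: dict) -> float:
    # Model ensemble: placeholder rule only, NOT a real model
    return min(1.0, features["avg_len"] / 40.0)


def build_report(features: dict, score: float) -> Report:
    # Report generator: package the score for the API layer
    return Report(ai_score=score, n_sentences=features["n_sentences"])


def check(raw: str) -> Report:
    text = ingest(raw)
    features = extract_features(text)
    return build_report(features, ensemble_score(features))
```

In a real system, each function would be replaced by the matching component from section 2.1.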

2.3 Tech Stack Options

You can use many languages or tools, but here are easy ones:

  • Python for machine learning
  • FastAPI for API
  • PyTorch for models
  • Transformers library for embeddings
  • Redis for caching
  • PostgreSQL for storage
  • Docker for deployment

This stack is simple but very strong.

3. Building the Detection Engine

Now let’s build the “brain”.
An AI Checker needs to “feel” the text.
It cannot just read the text like a human; it needs to measure things.

So we use three big ideas:

  1. Statistical signals
  2. Stylometric features
  3. Semantic patterns

Then we mix them together.

3.1 Input Preprocessing

Before we check the text, we clean it.

Steps:

  • lower case
  • remove extra spaces
  • split into sentences
  • split into tokens (words)
  • maybe remove numbers
  • maybe remove emoji

Simple Python example:

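import re
import nltk

# One-time setup: nltk.download("punkt") (newer NLTK versions may need "punkt_tab")
def clean_text(text):
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # Split before lowercasing: the sentence tokenizer uses capital letters as a cue
    sentences = nltk.sent_tokenize(text)
    return [s.lower() for s in sentences]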

This step makes the next steps easier.
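
The token step from the list above can be sketched with only the standard library. The function name `tokenize`, its flags, and the emoji ranges are assumptions for illustration; the emoji pattern covers common pictographs, not all of Unicode.

```python
import re

# Rough emoji ranges (assumption: common pictographs and symbols only)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Integers and simple decimals such as 3, 3.14 or 1,000
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# A token is a word (with an optional 's / 't part) or one symbol
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(sentence, drop_numbers=True, drop_emoji=True):
    text = sentence.lower()
    if drop_emoji:
        text = EMOJI_RE.sub(" ", text)
    if drop_numbers:
        text = NUMBER_RE.sub(" ", text)
    return TOKEN_RE.findall(text)
```

For example, `tokenize("I don't have 3 cats!")` keeps the words and the `!` but drops the number. Punctuation stays as its own token because later style features may want to count it.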

3.2 Statistical Signals

AI text often has patterns.
It is usually very smooth, very clean, and too perfect.
Human text has more “noise”.

So we calculate:

  • perplexity
  • burstiness
  • entropy
  • token variance

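Here is a very simple toy example that calculates a “simple perplexity” from word counts:

import math
from collections import Counter

def simple_perplexity(tokens):
    # Toy unigram model: each token's probability is its frequency in this text
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # Shannon entropy (in bits) of the word distribution
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy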

This is only a toy. A real checker gets token probabilities from a language model. But it shows the idea:
AI text often has lower perplexity (it is more predictable).
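
Burstiness has more than one definition. The sketch below assumes a simple one: the standard deviation of sentence lengths divided by their mean. The function name `burstiness` is made up for this example.

```python
import statistics


def burstiness(sentences):
    # Sentence lengths in words, skipping empty strings
    lengths = [len(s.split()) for s in sentences if s.strip()]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    # Coefficient of variation: spread of lengths relative to the mean
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```

Under this definition, text where every sentence has the same length scores 0, and text that mixes short and long sentences scores higher. Some detectors measure the variation of per-sentence perplexity instead.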

3.3 Stylometric Features

Stylometry means measuring writing style.
AI text often has a recognizable style:

  • similar sentence lengths
  • similar tone
  • similar grammar
  • often no strong emotion

Human writing is messier.

We can check:

  • average sentence length
  • length variance
  • POS tag distribution
  • number of commas, periods, etc.

Example:

import numpy as np

def sentence_length_features(sentences):
    lengths = [len(s.split()) for s in sentences]
    return {
        "avg_len": np.mean(lengths),
        "var_len": np.var(lengths)
    }

This helps the model understand the “shape” of the writing.
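
Punctuation counts from the list above could be collected like this. The function name and the feature names are assumptions. Counts are scaled per 100 words so that short and long texts can be compared.

```python
import string
from collections import Counter


def punctuation_features(text):
    # Avoid dividing by zero on empty input
    n_words = max(len(text.split()), 1)
    # Count every ASCII punctuation character in the text
    counts = Counter(ch for ch in text if ch in string.punctuation)
    return {
        "commas_per_100_words": 100 * counts[","] / n_words,
        "periods_per_100_words": 100 * counts["."] / n_words,
        "punct_per_100_words": 100 * sum(counts.values()) / n_words,
    }
```

POS tag distribution needs a tagger (for example from NLTK or spaCy), so it is left out of this sketch.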

3.4 Semantic Pattern Detection

We use embeddings to understand meaning.

AI text often has:

  • too-consistent tone
  • very generic ideas
  • repeating safe phrases

We can use SentenceTransformers:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_vector(text):
    return model.encode([text])[0]

Then we can compare vectors or run clustering.
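
One possible way to compare vectors is to measure how similar the sentences of one text are to each other. The sketch below is an assumption, not a proven detection rule: it computes the mean cosine similarity over all sentence pairs. The vectors can be plain lists or the arrays returned by `get_vector`.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors):
    # Average cosine similarity over every pair of sentence vectors
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A high value means the sentences stay close in meaning and tone. Whether that really points to AI text must be checked on your own data.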

3.5 Ensemble Model

After we get many features, we mix them.

Example of a tiny PyTorch classifier:

import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc = nn.Linear(input_size, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

This classifier can learn from:

  • perplexity
  • burstiness
  • style features
  • embedding features

And produce a final score.
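
Before the classifier can use these features, they must be packed into one vector with a fixed order. Here is a minimal sketch. The names `FEATURE_ORDER` and `to_feature_vector` are assumptions, and missing features default to 0.0.

```python
# Fixed order so that position i always means the same feature
FEATURE_ORDER = ["perplexity", "burstiness", "avg_len", "var_len"]


def to_feature_vector(features, embedding=()):
    # Scalar features first, in a stable order; missing ones become 0.0
    scalars = [float(features.get(name, 0.0)) for name in FEATURE_ORDER]
    # Then the embedding values, if any
    return scalars + [float(x) for x in embedding]
```

With this layout, `input_size` for `SmallClassifier` would be `len(FEATURE_ORDER)` plus the embedding size (384 for `all-MiniLM-L6-v2`). The list can be turned into a tensor with `torch.tensor([vector])`. Scaling the features to similar ranges before training usually helps.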

4. Model Training Pipeline

To make the AI Checker smart, we need to train it.
Training a model is like teaching a child.
If you give the child good examples, the child learns.
If you give the child bad examples, the child becomes confused.

An AI Checker needs many, many samples.
We need both:

  • Human text
  • AI text

Then we show the model:

“This is human text.”
“This is AI text.”
“This is maybe AI text.”
“This is maybe human text.”

Over time, the model learns patterns.

4.1 Dataset Strategy

We need a large dataset.

Human text can come from:

  • blogs
  • books
  • emails
  • essays
  • forums
  • Wikipedia
  • Reddit
  • news sites

AI text can come from:

  • OpenAI models (GPT-3.5 / GPT-4 / GPT-4o)
  • Claude (Anthropic)
  • Gemini
  • LLaMA
  • DeepSeek
  • Mistral models
  • Other generative systems

We should collect AI text in many styles:

  • story style
  • academic style
  • SEO style
  • short answer
  • long answer
  • technical writing

This helps the model understand many patterns.

4.2 Labeling the Dataset

This step is simple:

  • If the text is written by a machine → label: AI
  • If written by a human → label: HUMAN

Sometimes we also give a score:

  • 0.0 = pure human
  • 1.0 = pure AI
  • 0.5 = mixed

This helps the model give percentage results.
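
For mixed documents, one possible way to get a score between 0.0 and 1.0 is to use the share of characters written by AI. This rule and the function name `mixed_label` are assumptions for illustration; other weightings (by words or by sentences) work too.

```python
def mixed_label(segments):
    """segments: list of (text, is_ai) pairs that make up one document."""
    total = sum(len(text) for text, _ in segments)
    if total == 0:
        raise ValueError("document is empty")
    ai_chars = sum(len(text) for text, is_ai in segments if is_ai)
    # 0.0 = pure human, 1.0 = pure AI, values in between = mixed
    return ai_chars / total
```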

4.3 Training Loop

Here is a simple PyTorch training loop:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=3):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # BCELoss expects probabilities in [0, 1], so the model must end in a sigmoid
    loss_fn = nn.BCELoss()

    model.train()
    for epoch in range(epochs):
        for x, y in loader:
            y_pred = model(x)
            # Targets must be floats with the same shape as the predictions
            loss = loss_fn(y_pred, y.float().view_as(y_pred))
            opt.zero_grad()
            loss.backward()
            opt.step()

This loop makes the model better each time.

4.4 Validation & Testing

We must check if our model is right.

We test:

  • Accuracy
  • Precision
  • Recall
  • False positives
  • False negatives

There are two kinds of mistakes:

  • A model that marks human text as AI (false positive) → bad
  • A model that marks AI text as human (false negative) → bad

We need balance.

We often test on new text that the model has never seen before.
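
The metrics above can be computed from true labels and model scores. This is a minimal sketch, assuming label 1 means AI, label 0 means human, and a score at or above the threshold counts as an “AI” prediction. The function name `evaluate` is made up for this example.

```python
def evaluate(y_true, y_score, threshold=0.5):
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted_ai = score >= threshold
        if predicted_ai and truth == 1:
            tp += 1  # AI text correctly flagged
        elif predicted_ai:
            fp += 1  # human text wrongly flagged as AI
        elif truth == 1:
            fn += 1  # AI text that slipped through
        else:
            tn += 1  # human text correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negatives": fn,
    }
```

Raising the threshold gives fewer false positives but more false negatives, so the threshold is where you set the balance.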

4.5 Continuous Update Pipeline

The world changes fast.
New AI models appear every month.
So our AI Checker must also improve.

A good plan is:

  • Every month: collect new AI text
  • Every month: collect new human text
  • Every month: retrain
  • Every month: re-deploy new version

This is often called continuous training (or continuous learning).

5. Building the REST API (FastAPI Example)

Now we make our model available to users.

A REST API lets:

  • apps
  • websites
  • extensions
  • scripts

talk to our AI Checker.

FastAPI is simple and fast.

5.1 Basic Structure

Here is a simple FastAPI app:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TextInput(BaseModel):
    text: str

@app.post("/detect")
def detect_ai(data: TextInput):
    # run_model is a placeholder for your detection pipeline; it should return a float in [0, 1]
    score = run_model(data.text)
    return {"ai_score": score}

This is all you need to start.

5.2 /detect Endpoint

This endpoint takes text and returns AI probability.

Example result:

{
  "ai_score": 0.87
}

Meaning: the model estimates an 87% chance that this text was made by AI.

5.3 /quality-score Endpoint

You can also check text quality.

@app.post("/quality-score")
def check_quality(data: TextInput):
    # get_quality_score is a placeholder for your own scoring function
    q = get_quality_score(data.text)
    return {"quality": q}

Quality can be:

  • clarity
  • grammar
  • sentence flow
  • keyword strength

5.4 /plagiarism-check Endpoint

This endpoint checks if text is copied.

@app.post("/plagiarism-check")
def plagiarism(data: TextInput):
    # run_plagiarism is a placeholder for your own similarity search
    result = run_plagiarism(data.text)
    return {"rate": result}

If rate = 0.45 → 45% similar
If rate = 0.95 → almost fully copied

5.5 /explain Endpoint

This endpoint returns a score for each sentence, so the UI can show a sentence-level heatmap.

Example:

{
  "sentences": [
    {"text": "This is a test.", "ai_prob": 0.80},
    {"text": "I like pizza.", "ai_prob": 0.12}
  ]
}

Users like this feature because it shows which sentences made the model think the text is AI.
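
A response in this shape could be built as follows. This is only a sketch: `explain` and its `score_sentence` argument are hypothetical names, the sentence split is a simple regex, and the scorer can be any function that maps one sentence to a probability.

```python
import re


def explain(text, score_sentence):
    # Naive split after ., ! or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    return {
        "sentences": [
            {"text": s, "ai_prob": round(float(score_sentence(s)), 2)}
            for s in sentences
        ]
    }
```

In the API, an `/explain` handler would call this with the real model as `score_sentence`. Scores for very short sentences are unreliable, so a real system may want to merge or skip them.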

6. Frontend Dashboard (Simple Version)

Now we make a simple UI so humans can use the tool.

A simple React component:

import { useState } from "react";

function Checker() {
  const [text, setText] = useState("");
  const [result, setResult] = useState(null);

  async function sendText() {
    const res = await fetch("/detect", {
      method: "POST",
      body: JSON.stringify({text}),
      headers: {"Content-Type": "application/json"}
    });
    setResult(await res.json());
  }

  return (
    <div>
      <textarea value={text} onChange={e => setText(e.target.value)} />
      <button onClick={sendText}>Check</button>
      {result && <p>AI Score: {result.ai_score}</p>}
    </div>
  );
}

Very basic, but it works.

7. Adding Plagiarism Detection

Plagiarism means “copying someone else’s words”.
A good AI Checker also needs this feature.

We use two kinds:

  1. String-based plagiarism
  2. Semantic plagiarism

7.1 String-Based Plagiarism

Very simple idea:

Compare the text to a big database of many documents.

You can use:

  • n-grams
  • shingling
  • hashing
  • cosine similarity

Example:

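def simple_similarity(a, b):
    # Jaccard similarity over the sets of unique words
    set_a = set(a.split())
    set_b = set(b.split())
    union = set_a | set_b
    # Two empty texts share nothing; this also avoids dividing by zero
    return len(set_a & set_b) / len(union) if union else 0.0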

This is simple but not enough: it ignores word order, so the same words in a different order still match.
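
Shingling (word n-grams) is one way to take word order into account. Here is a minimal sketch; the function names and the default `n=3` are assumptions.

```python
def shingles(text, n=3):
    # Every run of n consecutive words becomes one shingle
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shingle_similarity(a, b, n=3):
    # Jaccard similarity over shingles instead of single words
    sa, sb = shingles(a, n), shingles(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Two texts that share long runs of words score high. The same words in shuffled order score low, because the shingles no longer match.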

7.2 Semantic Plagiarism

This checks meaning, not just words.

We use embeddings again.

If two texts have similar meaning, cosine similarity is high:

from sklearn.metrics.pairwise import cosine_similarity

def semantic_score(v1, v2):
    return cosine_similarity([v1], [v2])[0][0]

This can catch text that was rewritten by AI rewrite tools.
Many students use AI to rewrite copied text.
The words change, but the meaning stays close, so semantic models can still find it.

7.3 Using Elasticsearch or Pinecone

For big databases, we must use vector search.

These tools:

  • Elasticsearch
  • Pinecone
  • Weaviate
  • Qdrant

They can store millions of vectors.

Then we search fast.
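
To show what a vector search does, here is a brute-force version in plain Python. It is a stand-in for illustration only and does not use the API of any of the tools above. The name `top_k` and the dict-based index are assumptions.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, index, k=3):
    """index: dict mapping document id -> vector."""
    # Score every stored vector against the query, keep the k best
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)  # list of (score, doc_id), best first
```

This checks every document, so it gets slow as the database grows. Vector databases use approximate nearest-neighbor indexes to avoid scanning everything.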

8. Multi-Language Support

To support many languages, like:

  • Chinese
  • Japanese
  • Korean
  • Spanish
  • French
  • German

we must change:

  • tokenizers
  • embedding models
  • training datasets

Each language has different patterns.
For example:

  • Chinese does not put spaces between words
  • English uses many short function words (the, of, to)
  • Japanese mixes kanji and kana
  • Spanish sentences are often longer than English ones

We must fine-tune per language.
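
The tokenizer difference can be shown with a tiny sketch. It assumes a character-level fallback for Chinese and Japanese, which is only a rough stand-in: real systems use a word segmenter or the subword tokenizer of the embedding model. The function name and the language codes are assumptions.

```python
def tokenize_for_language(text, lang):
    if lang in {"zh", "ja"}:
        # No spaces between words: fall back to one token per character
        return [ch for ch in text if not ch.isspace()]
    # Languages that separate words with spaces
    return text.split()
```

Without this branch, `"我喜欢披萨".split()` returns the whole sentence as one token, and every length feature from section 3.3 would be wrong.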

9. Performance Optimization

If many users use your API, it must be fast.

Here are tips:

  • Cache model results
  • Use a GPU if possible
  • Batch requests
  • Preload the model into memory
  • Use a Redis queue
  • Use async FastAPI

Example async API:

@app.post("/detect")
async def detect_ai(data: TextInput):
    # run_async_model is a placeholder for a non-blocking call into your model
    score = await run_async_model(data.text)
    return {"ai_score": score}

10. Deployment

You can deploy with Docker:

Dockerfile

FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Then:

docker build -t ai-checker .
docker run -p 8000:8000 ai-checker

For cloud, use:

  • AWS
  • GCP
  • Azure
  • Railway
  • Render
  • Cloudflare Workers (for frontend)

11. Security & Privacy

Very important.

You must protect user data.

Rules:

  • Do not store submitted text, or store it only with user permission
  • Keep submitted text out of logs
  • Do not sell user data
  • Use HTTPS
  • Add a rate limit to block bots

Example (slowapi is a rate-limiting library for FastAPI):

from slowapi import Limiter
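
To show the idea behind a rate limit, here is a small sliding-window limiter written with only the standard library. It is an illustration of the concept, not how slowapi works inside. The class name and parameters are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # client id -> recent request times

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[client_id]
        # Forget requests that are older than the window
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # over the limit: the API should answer HTTP 429
        recent.append(now)
        return True
```

The client id is usually the IP address or an API key. This version keeps its state in one process, so with several servers the counters should live in a shared store such as Redis.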

12. Future Extensions

Future ideas:

  • detect AI images
  • detect AI audio
  • detect AI video
  • detect AI code
  • browser extensions
  • WordPress plugin
  • API for LMS (school systems)

AI will grow fast.
AI detection must grow too.

13. Conclusion

We now have a full AI Checker:

  • architecture
  • models
  • features
  • API
  • frontend
  • training
  • deployment

We used simple English, but real tech.
Now any developer can build an AI Checker.
