DEV Community

Fahad Ali Khan
Building an AI-Generated Text Detector: A Full-Stack NLP Project Guide

The line between human and machine writing is blurring fast. This guide walks through building a complete AI-text detection system from scratch — from classical ML baselines to transformer fine-tuning, wrapped in a production API and interactive demo.

This isn't a toy notebook. By the end, you'll have a modular codebase with an ensemble model, REST API, Gradio UI, Docker deployment, CI/CD, and a test suite. Let's break down each commit.


Commit 1: The Foundation — Clean Init

2a66fcb Clean init (no large data/models/outputs)

Every good project starts with structure. The initial commit sets up a config-driven architecture where nothing is hardcoded:

configs/config.yaml centralizes every hyperparameter — TF-IDF feature counts, learning rates, batch sizes, file paths. This means you never dig through scripts to change a setting.

src/ holds reusable library modules. scripts/ holds runnable entry points. This separation matters: your training logic lives in importable functions, not buried inside if __name__ == "__main__" blocks.

src/data.py handles schema normalization. Datasets from Kaggle have inconsistent column names — "text" vs "content" vs "essay", "label" vs "generated" vs "is_gpt". The normalize_schema() function maps all variants to a canonical (text, label) format with binary labels (0=human, 1=AI). This is the kind of defensive data engineering that saves hours of debugging later.
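A minimal sketch of the normalize_schema() idea, assuming pandas and the column-name variants mentioned above (the real module may handle more aliases and string labels):

```python
import pandas as pd

# Map common Kaggle column-name variants to a canonical (text, label) pair.
# Alias lists here are illustrative, taken from the variants named above.
TEXT_ALIASES = ["text", "content", "essay"]
LABEL_ALIASES = ["label", "generated", "is_gpt"]

def normalize_schema(df: pd.DataFrame) -> pd.DataFrame:
    text_col = next(c for c in TEXT_ALIASES if c in df.columns)
    label_col = next(c for c in LABEL_ALIASES if c in df.columns)
    out = df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"}
    )
    # Coerce labels to binary ints: 0 = human, 1 = AI.
    out["label"] = out["label"].astype(int)
    return out
```

Whatever the upstream dataset calls its columns, everything downstream can rely on `df["text"]` and `df["label"]`.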

src/baseline.py implements the classical ML pipeline: TF-IDF vectorization (200K features, unigrams + bigrams) piped into Logistic Regression. The entire pipeline is a single scikit-learn Pipeline object, which means the vectorizer and classifier are serialized together — no risk of train/serve skew.
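The baseline described above can be sketched in a few lines of scikit-learn (a sketch, not the repo's exact code; the feature cap and n-gram range come from the article):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF (unigrams + bigrams, capped at 200K features) piped into
# Logistic Regression. Serializing this one Pipeline object keeps the
# vectorizer and classifier together, avoiding train/serve skew.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data for illustration only.
texts = ["humans ramble wonderfully", "the model generated this text",
         "a person wrote this note", "generated output from the model"]
labels = [0, 1, 0, 1]  # 0 = human, 1 = AI
baseline.fit(texts, labels)
probs = baseline.predict_proba(["the model generated this"])[0]
```

Because the pipeline is a single object, `joblib.dump(baseline, path)` persists the fitted vectorizer vocabulary and the classifier weights in one artifact.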

src/transformer.py fine-tunes DistilBERT for binary classification using HuggingFace's Trainer API. Key design decisions:

  • Automatic validation split creation if one isn't provided
  • Null/empty text filtering before tokenization (prevents cryptic errors)
  • Mixed-precision training (fp16) and gradient checkpointing for memory efficiency
  • Offline-friendly model loading (checks if model_name is a local directory)

.gitignore excludes data/, models/, and outputs/ — keeping the repo lightweight. Large binary files don't belong in git.


Commit 2: Linguistic Feature Extraction

0c716b4 Add linguistic feature extraction module

Raw text classification is powerful, but interpretability matters. The src/features.py module extracts stylometric signals — measurable properties of writing style that differ between humans and language models.

What gets extracted:

| Feature | Why It Matters |
| --- | --- |
| Type-Token Ratio (TTR) | AI text tends toward lower vocabulary diversity — it reuses common words more |
| Hapax Legomena Ratio | Words used exactly once. Humans use more rare/unique words |
| Flesch Reading Ease | AI-generated text often clusters in a narrow readability band |
| Sentence Length Variation | Humans write with more variable sentence structure |
| Word Entropy | Shannon entropy of the word distribution — higher means more varied word choice |
| Punctuation Rates | AI text has distinctive punctuation patterns (fewer semicolons, more consistent comma usage) |

The implementation uses zero external NLP libraries — just regex and math. The syllable counter uses a vowel-group heuristic (_syllable_count()), which is fast and accurate enough for readability formulas.
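A vowel-group heuristic in the spirit of _syllable_count() looks something like this (a sketch; the silent-'e' handling is an assumption about the implementation):

```python
import re

def syllable_count(word: str) -> int:
    """Approximate syllables as runs of vowels, minus a trailing silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    # Drop a trailing silent 'e' ("cake" -> 1), but keep "-le"/"-ee" endings.
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)
```

It misfires on edge cases ("queue", loanwords), but readability formulas average over whole documents, so small per-word errors wash out.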

The key design choice: extract_features() returns a flat dictionary, and extract_features_df() wraps it for batch processing. This makes the features usable both in the API (single predictions) and in training pipelines (DataFrame operations).
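A condensed sketch of the flat-dictionary shape, assuming regex tokenization and covering three of the signals from the table (the real extract_features() computes many more):

```python
import math
import re
from collections import Counter

def extract_features(text: str) -> dict:
    """Return a flat dict of stylometric features for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)
    if n == 0:  # defensive: empty input yields neutral features
        return {"ttr": 0.0, "hapax_ratio": 0.0, "word_entropy": 0.0}
    counts = Counter(words)
    ttr = len(counts) / n                                    # type-token ratio
    hapax = sum(1 for c in counts.values() if c == 1) / n    # once-only words
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"ttr": ttr, "hapax_ratio": hapax, "word_entropy": entropy}
```

Because the return value is a flat dict, a batch wrapper is just `pd.DataFrame([extract_features(t) for t in texts])`.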


Commit 3: Ensemble Model

b8b8911 Add weighted soft-voting ensemble model

Neither model alone is optimal. The baseline is fast but misses nuance. The transformer is accurate but slow. The ensemble combines them through weighted probability averaging.

How it works:

P_ensemble = 0.3 × P_baseline + 0.7 × P_transformer

Both models output class probabilities (not just labels). The EnsembleDetector class:

  1. Runs TF-IDF + LogReg to get P(human) and P(AI) from the baseline
  2. Runs DistilBERT inference in batches to get transformer probabilities
  3. Computes the weighted average
  4. Returns the argmax as the final prediction

Why soft voting over hard voting? Hard voting (majority rule) throws away confidence information. If the baseline says 51% human and the transformer says 95% AI, hard voting sees a tie. Soft voting correctly favors AI because the transformer's high-confidence signal dominates.
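Working the 51%/95% example through the article's 0.3/0.7 weights (a sketch; the real EnsembleDetector wraps actual model calls):

```python
def soft_vote(p_base_ai: float, p_trans_ai: float,
              w_base: float = 0.3, w_trans: float = 0.7):
    """Weighted average of per-model P(AI), then threshold at 0.5."""
    p_ai = w_base * p_base_ai + w_trans * p_trans_ai
    return p_ai, ("AI" if p_ai >= 0.5 else "human")

# Baseline says 51% human (so P(AI) = 0.49); transformer says 95% AI.
p_ai, label = soft_vote(p_base_ai=0.49, p_trans_ai=0.95)
# p_ai = 0.3 * 0.49 + 0.7 * 0.95 = 0.812, so the verdict is "AI"
```

Hard voting would have seen one vote each; soft voting lets the transformer's 0.95 confidence pull the combined probability well past the threshold.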

The weights (0.3/0.7) are configurable via CLI args or environment variables. You could tune them on a validation set, but 0.3/0.7 is a reasonable default given the transformer's higher accuracy.

scripts/run_ensemble.py evaluates the ensemble on both validation and test splits, saving JSON metrics for comparison.


Commit 4: FastAPI REST API

c130c05 Add FastAPI REST API for inference

A model without an API is a notebook. The api/app.py module wraps the ensemble detector in a production-grade FastAPI service.

Three endpoints:

  • GET /health — Returns model load status. Essential for container orchestration (Docker health checks, Kubernetes liveness probes).

  • POST /predict — Single-text classification. Returns the label, confidence, per-class probabilities, all linguistic features, and inference latency in milliseconds. The latency field is useful for monitoring performance degradation.

  • POST /predict/batch — Classify up to 64 texts in one request. Batch inference is significantly faster than 64 individual calls because the transformer processes them in GPU-friendly batches.
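To make the /predict response concrete, here is an illustrative payload shape based on the fields listed above; exact key names and values are assumptions, not the API's literal schema:

```json
{
  "label": "ai",
  "confidence": 0.942,
  "probabilities": {"human": 0.058, "ai": 0.942},
  "features": {"ttr": 0.61, "hapax_ratio": 0.38, "word_entropy": 5.2},
  "latency_ms": 41.7
}
```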

Design decisions:

  • Pydantic models for request/response validation. PredictRequest enforces min_length=1 so empty strings are rejected at the schema level, not in model code.
  • Lifespan context manager for model loading. The detector loads once at startup and stays in memory — no per-request loading overhead.
  • Environment variable configuration (BASELINE_WEIGHT, TRANSFORMER_WEIGHT, CONFIG_PATH) so the API is configurable without code changes in deployment.

Run it with:

uvicorn api.app:app --host 0.0.0.0 --port 8000

FastAPI auto-generates interactive Swagger docs at /docs.


Commit 5: Gradio Interactive Demo

ebf7183 Add Gradio interactive demo with feature analysis

APIs are for machines. Demos are for humans. The app.py file creates a web interface where anyone can paste text and see results instantly.

The UI has three output components:

  1. Classification label with color-coded probabilities (Gradio's Label component handles this natively)
  2. Verdict text — a plain-English summary like "Prediction: AI-generated (confidence: 94.2%)"
  3. Feature analysis table — the linguistic breakdown rendered as a markdown table

Why Gradio over Streamlit? Gradio is purpose-built for ML demos. A gr.Interface or gr.Blocks app can be shared with a public URL via share=True, embedded in HuggingFace Spaces, and requires zero frontend code.

The demo includes pre-loaded examples so users can test immediately without typing.


Commit 6: Unit Tests

347fdbc Add unit test suite for features, data, utils, and API

Tests are non-negotiable for a serious project. The tests/ directory covers four modules:

test_features.py (22 tests) — The most comprehensive. Tests every feature function individually (vocabulary richness, sentence stats, readability, punctuation, entropy) plus integration tests for the full extract_features() pipeline. Key edge cases: empty strings, single-word inputs, all-identical words.

test_data.py — Tests schema normalization with different column name variants (text/content/essay, label/generated/is_gpt), string label mapping, NaN handling, and the split creation function (verifies CSVs are created and row counts sum correctly).

test_utils.py — Tests seed reproducibility (set seed, generate random numbers, reset seed, verify identical output), directory creation, metric computation (perfect/zero/partial accuracy), and JSON serialization.
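The seed-reproducibility check follows the pattern described above; a minimal sketch using the stdlib random module (the project presumably also seeds NumPy and torch):

```python
import random

def sample_with_seed(seed: int, n: int = 5) -> list:
    """Seed the RNG, then draw n floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

# Same seed twice must yield identical sequences;
# a different seed must not.
first = sample_with_seed(42)
second = sample_with_seed(42)
other = sample_with_seed(7)
```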

test_api.py — Tests Pydantic schemas without loading actual models. Validates that empty text is rejected, batch requests parse correctly, and response models serialize properly.

The test structure uses pytest classes to group related tests, making the output easy to scan.


Commit 7: Docker Support

2c75827 Add Dockerfile and docker-compose for containerized deployment

Dockerfile builds a slim Python 3.10 image with CPU-only PyTorch (the full CUDA torch is 2GB+; CPU-only is ~200MB). Models are not baked into the image — they're mounted as volumes at runtime. This keeps the image small and lets you swap models without rebuilding.

docker-compose.yml orchestrates two services:

  • api on port 8000 — the FastAPI service with a health check
  • demo on port 7860 — the Gradio UI, which depends on the api service

Both mount models/ and configs/ as read-only volumes. Environment variables control ensemble weights.
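Put together, the compose file looks roughly like this (a sketch consistent with the description above; mount paths and service internals are assumptions):

```yaml
services:
  api:
    build: .
    ports: ["8000:8000"]
    environment:
      - BASELINE_WEIGHT=0.3
      - TRANSFORMER_WEIGHT=0.7
    volumes:
      - ./models:/app/models:ro
      - ./configs:/app/configs:ro
  demo:
    build: .
    command: python app.py
    ports: ["7860:7860"]
    depends_on:
      - api
    volumes:
      - ./models:/app/models:ro
      - ./configs:/app/configs:ro
```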

.dockerignore excludes data, outputs, git history, and the presentation/recording files — keeping the Docker build context small.

Deploy with one command:

docker-compose up --build

Commit 8: GitHub Actions CI

e8abcfa Add GitHub Actions CI pipeline

The CI pipeline (.github/workflows/ci.yml) runs on every push and PR to main:

  1. Matrix testing — Python 3.10 and 3.11 on Ubuntu
  2. Linting — Ruff checks for syntax errors and style issues (E, F, W rules, ignoring line length)
  3. Testing — Full pytest suite
  4. Docker build — Validates the Dockerfile builds successfully (runs after tests pass)

CPU-only PyTorch is installed from the PyTorch index URL to keep CI fast.
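The workflow sketched below matches the four steps above; step details beyond what the article states (action versions, requirements file name) are assumptions:

```yaml
name: CI
on:
  push: {branches: [main]}
  pull_request: {branches: [main]}
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # CPU-only torch from the PyTorch index keeps CI fast.
      - run: pip install torch --index-url https://download.pytorch.org/whl/cpu
      - run: pip install -r requirements.txt ruff pytest
      - run: ruff check .
      - run: pytest
  docker:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ai-text-detector .
```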


Commit 9: README and License

34953ee Overhaul README and add MIT license

The README is your project's landing page. The final version includes:

  • CI badge — Green checkmark shows the project builds and tests pass
  • Architecture diagram — ASCII art showing the ensemble data flow
  • Project structure — Every file explained with one-line descriptions
  • Quick start — Four numbered steps from clone to running inference
  • API reference — curl examples with sample JSON responses
  • Docker deployment — One-command setup
  • Model details table — Hyperparameters at a glance
  • Feature documentation — What each linguistic feature measures
  • Tech stack table — Every technology organized by category
  • Ethical considerations — Bias risks and usage caveats

The MIT license makes the project freely usable.


What Makes This Resume-Worthy

This project demonstrates competence across the ML engineering stack:

| Skill | Evidence |
| --- | --- |
| NLP/ML | TF-IDF, transformer fine-tuning, ensemble methods, feature engineering |
| Software engineering | Modular architecture, config-driven design, type hints |
| API development | FastAPI with Pydantic schemas, batch endpoints, health checks |
| Testing | Pytest suite with edge cases and integration tests |
| DevOps | Docker, docker-compose, GitHub Actions CI |
| Documentation | Comprehensive README with architecture diagrams and API reference |
| Data engineering | Schema normalization, stratified splitting, null handling |

It's not just a Jupyter notebook with model.fit(). It's a deployable system with multiple entry points (CLI scripts, REST API, web demo), proper error handling, and automated quality checks.


The full source code is available on GitHub. Clone it, train the models, and try the demo yourself.
