The line between human and machine writing is blurring fast. This guide walks through building a complete AI-text detection system from scratch — from classical ML baselines to transformer fine-tuning, wrapped in a production API and interactive demo.
This isn't a toy notebook. By the end, you'll have a modular codebase with an ensemble model, REST API, Gradio UI, Docker deployment, CI/CD, and a test suite. Let's break down each commit.
Commit 1: The Foundation — Clean Init
2a66fcb Clean init (no large data/models/outputs)
Every good project starts with structure. The initial commit sets up a config-driven architecture where nothing is hardcoded:
configs/config.yaml centralizes every hyperparameter — TF-IDF feature counts, learning rates, batch sizes, file paths. This means you never dig through scripts to change a setting.
src/ holds reusable library modules. scripts/ holds runnable entry points. This separation matters: your training logic lives in importable functions, not buried inside if __name__ == "__main__" blocks.
src/data.py handles schema normalization. Datasets from Kaggle have inconsistent column names — "text" vs "content" vs "essay", "label" vs "generated" vs "is_gpt". The normalize_schema() function maps all variants to a canonical (text, label) format with binary labels (0=human, 1=AI). This is the kind of defensive data engineering that saves hours of debugging later.
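A minimal sketch of what such a normalizer might look like (the alias lists and label mapping here are illustrative assumptions, not the module's actual contents):

```python
import pandas as pd

# Alias lists are illustrative; the real module may recognize more variants
TEXT_ALIASES = ["text", "content", "essay"]
LABEL_ALIASES = ["label", "generated", "is_gpt"]

def normalize_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Map column-name variants onto canonical (text, label) with 0=human, 1=AI."""
    text_col = next(c for c in TEXT_ALIASES if c in df.columns)
    label_col = next(c for c in LABEL_ALIASES if c in df.columns)
    out = df[[text_col, label_col]].rename(
        columns={text_col: "text", label_col: "label"}
    )
    # Coerce string/int label variants to binary ints; unknown values become NaN
    out["label"] = out["label"].map({0: 0, 1: 1, "human": 0, "ai": 1})
    out = out.dropna(subset=["text", "label"])
    out["label"] = out["label"].astype(int)
    return out.reset_index(drop=True)
```

Because unknown labels map to NaN and are dropped, a bad row degrades gracefully instead of poisoning training with a mislabeled example.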
src/baseline.py implements the classical ML pipeline: TF-IDF vectorization (200K features, unigrams + bigrams) piped into Logistic Regression. The entire pipeline is a single scikit-learn Pipeline object, which means the vectorizer and classifier are serialized together — no risk of train/serve skew.
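The single-object pipeline described above can be sketched in a few lines (hyperparameters here mirror the ones quoted in the article; solver settings are assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One Pipeline object: vectorizer and classifier are fit, saved, and loaded
# together, so the serving path can never use a mismatched vocabulary.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
```

Calling `baseline.fit(texts, labels)` fits both stages, and `joblib.dump(baseline, path)` serializes them as one artifact.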
src/transformer.py fine-tunes DistilBERT for binary classification using HuggingFace's Trainer API. Key design decisions:
- Automatic validation split creation if one isn't provided
- Null/empty text filtering before tokenization (prevents cryptic errors)
- Mixed-precision training (`fp16`) and gradient checkpointing for memory efficiency
- Offline-friendly model loading (checks if `model_name` is a local directory)
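The null/empty filtering step is simple but worth showing, since missing it produces exactly the cryptic tokenizer errors mentioned above (a sketch; the function name is hypothetical):

```python
def filter_texts(texts, labels):
    """Drop null, empty, and whitespace-only texts before tokenization.

    Non-string entries (None, NaN floats from pandas) are removed along
    with their labels, keeping the two lists aligned.
    """
    pairs = [(t, y) for t, y in zip(texts, labels)
             if isinstance(t, str) and t.strip()]
    kept_texts, kept_labels = zip(*pairs) if pairs else ((), ())
    return list(kept_texts), list(kept_labels)
```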
.gitignore excludes data/, models/, and outputs/ — keeping the repo lightweight. Large binary files don't belong in git.
Commit 2: Linguistic Feature Extraction
0c716b4 Add linguistic feature extraction module
Raw text classification is powerful, but interpretability matters. The src/features.py module extracts stylometric signals — measurable properties of writing style that differ between humans and language models.
What gets extracted:
| Feature | Why It Matters |
|---|---|
| Type-Token Ratio (TTR) | AI text tends toward lower vocabulary diversity — it reuses common words more |
| Hapax Legomena Ratio | Words used exactly once. Humans use more rare/unique words |
| Flesch Reading Ease | AI-generated text often clusters in a narrow readability band |
| Sentence Length Variation | Humans write with more variable sentence structure |
| Word Entropy | Shannon entropy of the word distribution — higher means more varied word choice |
| Punctuation Rates | AI text has distinctive punctuation patterns (fewer semicolons, more consistent comma usage) |
The implementation uses zero external NLP libraries — just regex and math. The syllable counter uses a vowel-group heuristic (_syllable_count()), which is fast and accurate enough for readability formulas.
The key design choice: extract_features() returns a flat dictionary, and extract_features_df() wraps it for batch processing. This makes the features usable both in the API (single predictions) and in training pipelines (DataFrame operations).
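A few of the table's features fit in one self-contained function using only `re`, `math`, and `collections`, exactly as described (an illustrative subset; the real module extracts many more signals):

```python
import math
import re
from collections import Counter

def extract_features(text: str) -> dict:
    """Flat dict of stylometric signals (illustrative subset)."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    counts = Counter(words)
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        # Type-Token Ratio: unique words / total words
        "ttr": len(counts) / n,
        # Hapax Legomena Ratio: words appearing exactly once
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n,
        # Shannon entropy of the word distribution, in bits
        "word_entropy": -sum((c / n) * math.log2(c / n)
                             for c in counts.values()),
    }
```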
Commit 3: Ensemble Model
b8b8911 Add weighted soft-voting ensemble model
Neither model alone is optimal. The baseline is fast but misses nuance. The transformer is accurate but slow. The ensemble combines them through weighted probability averaging.
How it works:
P_ensemble = 0.3 × P_baseline + 0.7 × P_transformer
Both models output class probabilities (not just labels). The EnsembleDetector class:
- Runs TF-IDF + LogReg to get `P(human)` and `P(AI)` from the baseline
- Runs DistilBERT inference in batches to get transformer probabilities
- Computes the weighted average
- Returns the argmax as the final prediction
Why soft voting over hard voting? Hard voting (majority rule) throws away confidence information. If the baseline says 51% human and the transformer says 95% AI, hard voting sees a tie. Soft voting correctly favors AI because the transformer's high-confidence signal dominates.
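The 51%-vs-95% scenario is easy to verify numerically (a sketch of the averaging step, not the `EnsembleDetector` class itself):

```python
import numpy as np

def soft_vote(p_baseline, p_transformer, w_baseline=0.3, w_transformer=0.7):
    """Weighted average of class-probability vectors; argmax is the prediction."""
    p = w_baseline * np.asarray(p_baseline) + w_transformer * np.asarray(p_transformer)
    return p, int(np.argmax(p))

# The article's example: baseline says 51% human, transformer says 95% AI.
# Probabilities are ordered [P(human), P(AI)].
probs, pred = soft_vote([0.51, 0.49], [0.05, 0.95])
# pred == 1 (AI): 0.3*0.49 + 0.7*0.95 = 0.812 beats 0.188
```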
The weights (0.3/0.7) are configurable via CLI args or environment variables. You could tune them on a validation set, but 0.3/0.7 is a reasonable default given the transformer's higher accuracy.
scripts/run_ensemble.py evaluates the ensemble on both validation and test splits, saving JSON metrics for comparison.
Commit 4: FastAPI REST API
c130c05 Add FastAPI REST API for inference
A model without an API is a notebook. The api/app.py module wraps the ensemble detector in a production-grade FastAPI service.
Three endpoints:
- `GET /health` — Returns model load status. Essential for container orchestration (Docker health checks, Kubernetes liveness probes).
- `POST /predict` — Single-text classification. Returns the label, confidence, per-class probabilities, all linguistic features, and inference latency in milliseconds. The latency field is useful for monitoring performance degradation.
- `POST /predict/batch` — Classify up to 64 texts in one request. Batch inference is significantly faster than 64 individual calls because the transformer processes them in GPU-friendly batches.
Design decisions:
- Pydantic models for request/response validation. `PredictRequest` enforces `min_length=1` so empty strings are rejected at the schema level, not in model code.
- Lifespan context manager for model loading. The detector loads once at startup and stays in memory — no per-request loading overhead.
- Environment variable configuration (`BASELINE_WEIGHT`, `TRANSFORMER_WEIGHT`, `CONFIG_PATH`) so the API is configurable without code changes in deployment.
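Schema-level rejection looks roughly like this (the response fields shown are the ones the article lists; exact field names in `api/app.py` may differ):

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    # min_length=1 makes Pydantic reject empty strings before any model code runs
    text: str = Field(..., min_length=1)

class PredictResponse(BaseModel):
    label: str
    confidence: float
    probabilities: dict[str, float]
    latency_ms: float
```

With this schema, FastAPI returns a 422 for an empty `text` automatically; the handler never sees the bad request.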
Run it with:
`uvicorn api.app:app --host 0.0.0.0 --port 8000`
FastAPI auto-generates interactive Swagger docs at /docs.
Commit 5: Gradio Interactive Demo
ebf7183 Add Gradio interactive demo with feature analysis
APIs are for machines. Demos are for humans. The app.py file creates a web interface where anyone can paste text and see results instantly.
The UI has three output components:
- Classification label with color-coded probabilities (Gradio's `Label` component handles this natively)
- Verdict text — a plain-English summary like "Prediction: AI-generated (confidence: 94.2%)"
- Feature analysis table — the linguistic breakdown rendered as a markdown table
Why Gradio over Streamlit? Gradio is purpose-built for ML demos. A gr.Interface or gr.Blocks app can be shared with a public URL via share=True, embedded in HuggingFace Spaces, and requires zero frontend code.
The demo includes pre-loaded examples so users can test immediately without typing.
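The verdict text can come from a tiny helper like this (a hypothetical sketch of the formatting logic; the real callback also assembles the label and feature table):

```python
def format_verdict(p_ai: float) -> str:
    """Plain-English summary string, matching the demo's verdict style."""
    label = "AI-generated" if p_ai >= 0.5 else "Human-written"
    confidence = p_ai if p_ai >= 0.5 else 1 - p_ai
    return f"Prediction: {label} (confidence: {confidence:.1%})"
```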
Commit 6: Unit Tests
347fdbc Add unit test suite for features, data, utils, and API
Tests are non-negotiable for a serious project. The tests/ directory covers four modules:
test_features.py (22 tests) — The most comprehensive. Tests every feature function individually (vocabulary richness, sentence stats, readability, punctuation, entropy) plus integration tests for the full extract_features() pipeline. Key edge cases: empty strings, single-word inputs, all-identical words.
test_data.py — Tests schema normalization with different column name variants (text/content/essay, label/generated/is_gpt), string label mapping, NaN handling, and the split creation function (verifies CSVs are created and row counts sum correctly).
test_utils.py — Tests seed reproducibility (set seed, generate random numbers, reset seed, verify identical output), directory creation, metric computation (perfect/zero/partial accuracy), and JSON serialization.
test_api.py — Tests Pydantic schemas without loading actual models. Validates that empty text is rejected, batch requests parse correctly, and response models serialize properly.
The test structure uses pytest classes to group related tests, making the output easy to scan.
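The class-grouped style looks like this (a self-contained illustration with a stand-in feature function, not the suite's actual code):

```python
def type_token_ratio(text: str) -> float:
    """Minimal stand-in for the feature function under test."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

class TestTypeTokenRatio:
    """pytest groups related cases in a class, as the article's suite does."""

    def test_empty_string(self):
        assert type_token_ratio("") == 0.0

    def test_all_identical_words(self):
        assert type_token_ratio("spam spam spam") == 1 / 3

    def test_single_word(self):
        assert type_token_ratio("hello") == 1.0
```

Running `pytest -v` then prints one line per method, nested under the class name, which is what makes the output easy to scan.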
Commit 7: Docker Support
2c75827 Add Dockerfile and docker-compose for containerized deployment
Dockerfile builds a slim Python 3.10 image with CPU-only PyTorch (the full CUDA torch is 2GB+; CPU-only is ~200MB). Models are not baked into the image — they're mounted as volumes at runtime. This keeps the image small and lets you swap models without rebuilding.
docker-compose.yml orchestrates two services:
- `api` on port 8000 — the FastAPI service with a health check
- `demo` on port 7860 — the Gradio UI, which depends on the API service
Both mount models/ and configs/ as read-only volumes. Environment variables control ensemble weights.
.dockerignore excludes data, outputs, git history, and the presentation/recording files — keeping the Docker build context small.
Deploy with one command:
`docker-compose up --build`
Commit 8: GitHub Actions CI
e8abcfa Add GitHub Actions CI pipeline
The CI pipeline (.github/workflows/ci.yml) runs on every push and PR to main:
- Matrix testing — Python 3.10 and 3.11 on Ubuntu
- Linting — Ruff checks for syntax errors and style issues (`E`, `F`, `W` rules, ignoring line length)
- Testing — Full pytest suite
- Docker build — Validates the Dockerfile builds successfully (runs after tests pass)
CPU-only PyTorch is installed from the PyTorch index URL to keep CI fast.
Commit 9: README and License
34953ee Overhaul README and add MIT license
The README is your project's landing page. The final version includes:
- CI badge — Green checkmark shows the project builds and tests pass
- Architecture diagram — ASCII art showing the ensemble data flow
- Project structure — Every file explained with one-line descriptions
- Quick start — Four numbered steps from clone to running inference
- API reference — curl examples with sample JSON responses
- Docker deployment — One-command setup
- Model details table — Hyperparameters at a glance
- Feature documentation — What each linguistic feature measures
- Tech stack table — Every technology organized by category
- Ethical considerations — Bias risks and usage caveats
The MIT license makes the project freely usable.
What Makes This Resume-Worthy
This project demonstrates competence across the ML engineering stack:
| Skill | Evidence |
|---|---|
| NLP/ML | TF-IDF, transformer fine-tuning, ensemble methods, feature engineering |
| Software engineering | Modular architecture, config-driven design, type hints |
| API development | FastAPI with Pydantic schemas, batch endpoints, health checks |
| Testing | Pytest suite with edge cases and integration tests |
| DevOps | Docker, docker-compose, GitHub Actions CI |
| Documentation | Comprehensive README with architecture diagrams and API reference |
| Data engineering | Schema normalization, stratified splitting, null handling |
It's not just a Jupyter notebook with model.fit(). It's a deployable system with multiple entry points (CLI scripts, REST API, web demo), proper error handling, and automated quality checks.
The full source code is available on GitHub. Clone it, train the models, and try the demo yourself.