Most detection pipelines today are black boxes — a neural network says "fake" and you just trust it. I wanted to see how far pure statistics could go. No deep learning. Just handcrafted image features and a logistic regression.
The results were better than I expected.
## The setup
Dataset: CIFAKE — ~120,000 images (60,000 real photos, 60,000 AI-generated)
Approach: Extract statistical features from each image, evaluate with two metrics:
- Covariance difference (Frobenius norm) — how different are the real vs. fake distributions?
- LDA accuracy — how well does a linear classifier separate the two classes?
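Both metrics are easy to compute with NumPy and scikit-learn. Here's a minimal sketch, assuming each feature family has already been extracted into per-class matrices (the function name and the 70/30 split are illustrative, not from the original code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def evaluate_feature_family(X_real, X_fake):
    """Score one feature family with the two metrics above.

    X_real, X_fake: (n_samples, n_features) arrays of extracted features.
    """
    # Covariance difference: Frobenius norm of the gap between class covariances
    cov_gap = np.linalg.norm(
        np.cov(X_real, rowvar=False) - np.cov(X_fake, rowvar=False), ord="fro"
    )

    # LDA accuracy: how well a linear boundary separates the two classes
    X = np.vstack([X_real, X_fake])
    y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    acc = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
    return cov_gap, acc
```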
## Results by feature family
| Feature | Cov. Difference | LDA Accuracy |
|---|---|---|
| Noise residual | 2.05 × 10³ | 84.8% |
| FFT (frequency) | 6.23 × 10¹¹ | 79.9% |
| Texture (LBP + GLCM + Gabor) | 1.05 × 10⁵ | 76.2% |
| Color statistics | 5.23 × 10³ | 73.0% |
| DCT coefficients | 4.65 × 10³ | 68.2% |
| Intensity statistics | 2.61 × 10³ | 64.3% |
| Wavelet decomposition | 8.99 × 10³ | 63.1% |
Two things stand out:
1. Noise wins. At 84.8% LDA accuracy, noise residuals outperform every other feature family. Real cameras produce structured, spatially correlated sensor noise. Generative models don't have a camera — their noise patterns are statistically different, and easy to measure.
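The core idea is simple: denoise the image and treat whatever is left over as the noise residual. A sketch, assuming a 3×3 median filter as the denoiser and a handful of summary statistics (both choices are mine for illustration, not necessarily what the experiments used):

```python
import numpy as np
from scipy.ndimage import median_filter

def noise_residual_features(gray):
    """gray: 2-D float array, grayscale image in [0, 1].

    The residual is the image minus a denoised copy; camera sensor noise
    leaves a spatially correlated residual that generators don't reproduce.
    """
    residual = gray - median_filter(gray, size=3)
    return np.array([
        residual.mean(),
        residual.std(),
        np.abs(residual).mean(),  # average residual energy
        # horizontal spatial correlation between neighboring residual pixels
        np.corrcoef(residual[:, :-1].ravel(), residual[:, 1:].ravel())[0, 1],
    ])
```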
2. FFT is huge but nonlinear. The covariance gap for frequency features is 6.23 × 10¹¹ — orders of magnitude larger than anything else — yet LDA accuracy sits at only 79.9%. The differences are real but the decision boundary is nonlinear. FFT features likely need an SVM or neural network layer to be fully exploited.
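For reference, a common way to turn the 2-D spectrum into a small feature vector is a radially averaged power spectrum. This is one plausible sketch; the binning scheme and log compression are my assumptions, not necessarily the original pipeline's:

```python
import numpy as np

def fft_features(gray, n_bins=8):
    """Radially averaged power spectrum of a grayscale image."""
    # 2-D power spectrum, low frequencies shifted to the center
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h // 2, xx - w // 2)
    # assign each pixel to one of n_bins radial shells
    idx = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    power = np.bincount(idx.ravel(), weights=spectrum.ravel(), minlength=n_bins)
    counts = np.bincount(idx.ravel(), minlength=n_bins)
    return np.log1p(power / counts)  # log-compress the huge dynamic range
```

The log compression matters here: raw spectral magnitudes span the enormous range that produces that 10¹¹ covariance gap, which is exactly why a linear model struggles with them.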
## Full pipeline results
Combining all features into a single 48-dimensional vector, training on 84,000 images and testing on 36,000:
| Metric | Score |
|---|---|
| Accuracy | 85.5% |
| Precision | 86.3% |
| Recall | 84.5% |
| F1 | 85.4% |
| ROC-AUC | 92.9% |
| Training time | 4.04 s |
| Inference time | 0.02 s |
A 92.9% ROC-AUC from a logistic regression, trained in 4 seconds, running inference in 20ms. No GPU needed.
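The classifier itself is almost trivially simple. A sketch, assuming the 48-D feature vectors are already extracted; the standardization step is my assumption, but some rescaling is clearly needed given how wildly the feature families differ in scale (compare the covariance magnitudes in the table above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_detector(X_train, y_train):
    """X_train: (n_samples, 48) stacked feature vectors; y: 0 = real, 1 = fake."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf
```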
## Why this matters
Statistical detectors give you three things deep learning often doesn't:
- Interpretability — you can point to exactly which feature triggered the flag
- Speed — 20ms inference on a laptop, no GPU cluster required
- Generalization potential — features grounded in physical image properties are less tied to a specific generator than a CNN trained on one dataset
The best production systems will likely be hybrid: statistical features for fast first-pass screening, deep models for depth. Neither replaces the other.
## The anomaly map
Beyond classification, I built a patch-level anomaly heatmap. Each patch gets a weighted score:
score = 0.45 × residual + 0.35 × frequency + 0.20 × gradient
Real images produce flat, uniform maps. Synthetic images show concentrated anomalies — usually at object boundaries or in regions where the generator lost spatial coherence. That's spatial explainability you don't get from a softmax output.
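A minimal sketch of the heatmap. The 0.45/0.35/0.20 weights come from the formula above, but the three per-patch statistics (median-filter residual energy, high-frequency spectral energy, mean gradient magnitude) are my stand-ins for the original pipeline's terms:

```python
import numpy as np
from scipy.ndimage import median_filter

def anomaly_map(gray, patch=16):
    """gray: 2-D float array. Returns one weighted anomaly score per patch."""
    h, w = gray.shape
    rows, cols = h // patch, w // patch
    score = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            p = gray[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            # residual term: energy left after a simple denoiser
            res = np.abs(p - median_filter(p, size=3)).mean()
            # frequency term: spectral energy outside the low-frequency center
            spec = np.fft.fftshift(np.abs(np.fft.fft2(p)))
            c, k = patch // 2, patch // 4
            mask = np.ones_like(spec, dtype=bool)
            mask[c - k:c + k, c - k:c + k] = False
            freq = spec[mask].mean()
            # gradient term: mean gradient magnitude
            gy, gx = np.gradient(p)
            grad = np.hypot(gx, gy).mean()
            score[i, j] = 0.45 * res + 0.35 * freq + 0.20 * grad
    return score
```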
Experiments run on CIFAKE using Python, scikit-learn, OpenCV, and scikit-image.