Most detection pipelines today are black boxes — a neural network says "fake" and you just trust it. I wanted to see how far pure statistics could go. No deep learning. Just handcrafted image features and a logistic regression.
The results were better than I expected.
## The setup
Dataset: CIFAKE — ~120,000 images (60,000 real photos, 60,000 AI-generated)
Approach: Extract statistical features from each image, evaluate with two metrics:
- Covariance difference (Frobenius norm) — how different are the real vs. fake distributions?
- LDA accuracy — how well does a linear classifier separate the two classes?
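Both metrics are easy to compute with NumPy and scikit-learn. Here's a minimal sketch, assuming each feature family has already been extracted into per-class matrices (the function name and the 70/30 split are illustrative, not from the original code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def evaluate_feature_family(X_real, X_fake):
    """Score one feature family with the two metrics above.

    X_real, X_fake: (n_samples, n_features) arrays of extracted features.
    """
    # Covariance difference: Frobenius norm of the gap between class covariances
    cov_gap = np.linalg.norm(
        np.cov(X_real, rowvar=False) - np.cov(X_fake, rowvar=False), ord="fro"
    )

    # LDA accuracy: how well a linear boundary separates the two classes
    X = np.vstack([X_real, X_fake])
    y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    acc = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)
    return cov_gap, acc
```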
## Results by feature family
| Feature | Cov. Difference | LDA Accuracy |
|---|---|---|
| Noise residual | 2.05 × 10³ | 84.8% |
| FFT (frequency) | 6.23 × 10¹¹ | 79.9% |
| Texture (LBP + GLCM + Gabor) | 1.05 × 10⁵ | 76.2% |
| Color statistics | 5.23 × 10³ | 73.0% |
| DCT coefficients | 4.65 × 10³ | 68.2% |
| Intensity statistics | 2.61 × 10³ | 64.3% |
| Wavelet decomposition | 8.99 × 10³ | 63.1% |
Two things stand out:
1. Noise wins. At 84.8% LDA accuracy, noise residuals outperform every other feature family. Real cameras produce structured, spatially correlated sensor noise. Generative models don't have a camera — their noise patterns are statistically different, and easy to measure.
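The core idea is simple: denoise the image and treat whatever is left over as the noise residual. A sketch, assuming a 3×3 median filter as the denoiser and a handful of summary statistics (both choices are mine for illustration, not necessarily what the experiments used):

```python
import numpy as np
from scipy.ndimage import median_filter

def noise_residual_features(gray):
    """gray: 2-D float array, grayscale image in [0, 1].

    The residual is the image minus a denoised copy; camera sensor noise
    leaves a spatially correlated residual that generators don't reproduce.
    """
    residual = gray - median_filter(gray, size=3)
    return np.array([
        residual.mean(),
        residual.std(),
        np.abs(residual).mean(),  # average residual energy
        # horizontal spatial correlation between neighboring residual pixels
        np.corrcoef(residual[:, :-1].ravel(), residual[:, 1:].ravel())[0, 1],
    ])
```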
2. FFT is huge but nonlinear. The covariance gap for frequency features is 6.23 × 10¹¹ — orders of magnitude larger than anything else — yet LDA accuracy sits at only 79.9%. The differences are real but the decision boundary is nonlinear. FFT features likely need an SVM or neural network layer to be fully exploited.
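For reference, a common way to turn the 2-D spectrum into a small feature vector is a radially averaged power spectrum. This is one plausible sketch; the binning scheme and log compression are my assumptions, not necessarily the original pipeline's:

```python
import numpy as np

def fft_features(gray, n_bins=8):
    """Radially averaged power spectrum of a grayscale image."""
    # 2-D power spectrum, low frequencies shifted to the center
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h // 2, xx - w // 2)
    # assign each pixel to one of n_bins radial shells
    idx = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    power = np.bincount(idx.ravel(), weights=spectrum.ravel(), minlength=n_bins)
    counts = np.bincount(idx.ravel(), minlength=n_bins)
    return np.log1p(power / counts)  # log-compress the huge dynamic range
```

The log compression matters here: raw spectral magnitudes span the enormous range that produces that 10¹¹ covariance gap, which is exactly why a linear model struggles with them.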
## Full pipeline results
Combining all features into a single 48-dimensional vector, training on 84,000 images and testing on 36,000:
| Metric | Score |
|---|---|
| Accuracy | 85.5% |
| Precision | 86.3% |
| Recall | 84.5% |
| F1 | 85.4% |
| ROC-AUC | 92.9% |
| Training time | 4.04 s |
| Inference time | 0.02 s |
A 92.9% ROC-AUC from a logistic regression, trained in 4 seconds, running inference in 20ms. No GPU needed.
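The classifier itself is almost trivially simple. A sketch, assuming the 48-D feature vectors are already extracted; the standardization step is my assumption, but some rescaling is clearly needed given how wildly the feature families differ in scale (compare the covariance magnitudes in the table above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_detector(X_train, y_train):
    """X_train: (n_samples, 48) stacked feature vectors; y: 0 = real, 1 = fake."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    return clf
```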
## Why this matters
Statistical detectors give you three things deep learning often doesn't:
- Interpretability — you can point to exactly which feature triggered the flag
- Speed — 20ms inference on a laptop, no GPU cluster required
- Generalization potential — features grounded in physical image properties are less tied to a specific generator than a CNN trained on one dataset
The best production systems will likely be hybrid: statistical features for fast first-pass screening, deep models for depth. Neither replaces the other.
## The anomaly map
Beyond classification, I built a patch-level anomaly heatmap. Each patch gets a weighted score:
score = 0.45 × residual + 0.35 × frequency + 0.20 × gradient
Real images produce flat, uniform maps. Synthetic images show concentrated anomalies — usually at object boundaries or in regions where the generator lost spatial coherence. That's spatial explainability you don't get from a softmax output.
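A minimal sketch of the heatmap. The 0.45/0.35/0.20 weights come from the formula above, but the three per-patch statistics (median-filter residual energy, high-frequency spectral energy, mean gradient magnitude) are my stand-ins for the original pipeline's terms:

```python
import numpy as np
from scipy.ndimage import median_filter

def anomaly_map(gray, patch=16):
    """gray: 2-D float array. Returns one weighted anomaly score per patch."""
    h, w = gray.shape
    rows, cols = h // patch, w // patch
    score = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            p = gray[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            # residual term: energy left after a simple denoiser
            res = np.abs(p - median_filter(p, size=3)).mean()
            # frequency term: spectral energy outside the low-frequency center
            spec = np.fft.fftshift(np.abs(np.fft.fft2(p)))
            c, k = patch // 2, patch // 4
            mask = np.ones_like(spec, dtype=bool)
            mask[c - k:c + k, c - k:c + k] = False
            freq = spec[mask].mean()
            # gradient term: mean gradient magnitude
            gy, gx = np.gradient(p)
            grad = np.hypot(gx, gy).mean()
            score[i, j] = 0.45 * res + 0.35 * freq + 0.20 * grad
    return score
```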
Experiments run on CIFAKE using Python, scikit-learn, OpenCV, and scikit-image.