Jason Peterson
Did You Know CLIP Works as an AI Image Detector?

OpenAI's CLIP model was trained to match images with text descriptions. But here's something surprising: it also works remarkably well at detecting AI-generated images. No fine-tuning required—just extract embeddings and add a simple classifier.

I built one, with some help from Claude Code, to see how well this actually works. Here's what I learned.

The Dataset

I collected 1,050 portrait-style images:

  • 525 AI images from CivitAI (various Stable Diffusion models)
  • 525 real photos from Unsplash

Both sets were curated to look similar—street photography, portraits, natural lighting. The goal was to make this hard, not easy.

[Image: AI-generated portraits from CivitAI]

[Image: Real photos from Unsplash]

Can you tell which is which? Deep into curating the AI images, I'd occasionally think "wow, that looks real." But the moment I switched to Unsplash, I realized none of them actually did. Real photos, to my eye and for now anyway, have a texture, a messiness that resets your expectations entirely.

The Traditional Approach: FFT Analysis

Before trying CLIP, I tested a traditional forensics technique: analyzing the frequency spectrum.

The intuition is simple: real cameras introduce high-frequency sensor noise. AI generators don't simulate this noise, so AI images should have less energy in the high frequencies.

import numpy as np
from PIL import Image

def compute_high_freq_energy(image_path, radius_frac=0.75):
    # Grayscale -> centered 2D power spectrum
    img = Image.open(image_path).convert("L")
    img_array = np.array(img, dtype=np.float64)

    fft = np.fft.fft2(img_array)
    fft_shifted = np.fft.fftshift(fft)
    power = np.abs(fft_shifted) ** 2

    # Measure energy in the outer ring (high frequencies);
    # the cutoff radius here is an illustrative choice
    h, w = power.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    cutoff = radius_frac * min(h, w) / 2

    high_freq_energy = power[dist >= cutoff].sum()
    total_energy = power.sum()
    return high_freq_energy / total_energy

Result: 50.4% accuracy. Basically random.

The problem? JPEG compression destroys high-frequency information anyway. On compressed web images, this technique is useless.
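
You can convince yourself of this with a quick experiment (my own illustration, not from the original study): re-encode an image at a lower JPEG quality and watch the high-frequency ratio drop.

import io

from PIL import Image

def recompressed(image_path, quality=60):
    # Re-encode as JPEG in memory. Image.open also accepts file objects,
    # so the buffer can go straight back into compute_high_freq_energy.
    buf = io.BytesIO()
    Image.open(image_path).convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return buf

print(compute_high_freq_energy("photo.png"))               # hypothetical file
print(compute_high_freq_energy(recompressed("photo.png"))) # typically lower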

The CLIP Approach

CLIP (Contrastive Language-Image Pre-training) was trained on 400 million image-text pairs. It learned rich visual features that transfer surprisingly well to other tasks—including AI detection.

The approach is dead simple:

import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def get_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embedding = model.get_image_features(**inputs)

    # Normalize to unit vector
    embedding = embedding / embedding.norm()
    return embedding.numpy().flatten()
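Running that over the whole dataset gives a feature matrix and labels (a sketch; the folder layout and split parameters here are my assumptions, not the post's):

from glob import glob

import numpy as np
from sklearn.model_selection import train_test_split

ai_paths = sorted(glob("data/ai/*.jpg"))      # hypothetical layout
real_paths = sorted(glob("data/real/*.jpg"))  # hypothetical layout

X = np.stack([get_embedding(p) for p in ai_paths + real_paths])
y = np.array([1] * len(ai_paths) + [0] * len(real_paths))  # 1 = AI, 0 = real

# 80/20 stratified split (illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)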

Each image becomes a 512-dimensional vector. Then train a simple logistic regression:

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
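Scoring against the held-out split defined above:

print(classifier.score(X_test, y_test))  # accuracy on the held-out 20%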

Result: 88.5% accuracy on held-out test images.

That's a massive jump from 50.4% (FFT) to 88.5% (CLIP + LogReg).

Why Does This Work?

CLIP's contrastive training on 400 million image-text pairs (Radford et al., 2021) produced visual features that transfer to tasks it was never trained for, and AI detection turns out to be one of them.

I used the smallest CLIP variant (ViT-B/32, ~150M parameters). Larger models like ViT-L/14 would likely do even better, but the small one already works surprisingly well.
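
Swapping in the larger variant would be a two-line change (untested here; note that its embeddings are 768-dimensional rather than 512):

# Hypothetical swap to the larger CLIP variant (not benchmarked in this post)
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")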

When we project the 512-dimensional embeddings down to 2D using UMAP, we can see the separation:

[Image: UMAP projection of CLIP embeddings. Real images (blue) and AI images (red) cluster separately.]

The two classes naturally separate in embedding space; the logistic regression just has to draw a boundary (a hyperplane in the 512-dimensional space) between them.
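
To reproduce a plot like this, here's a minimal sketch using the umap-learn package (assuming the X and y arrays built earlier):

import matplotlib.pyplot as plt
import umap  # pip install umap-learn

# Project the 512-dim embeddings down to 2D
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(X)

plt.scatter(coords[y == 0, 0], coords[y == 0, 1], s=5, label="Real")
plt.scatter(coords[y == 1, 0], coords[y == 1, 1], s=5, label="AI")
plt.legend()
plt.show()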

You Don't Even Need a Classifier

Here's the surprising part: you can detect AI images without training anything.

Just compute the centroid (mean) of each class and classify by nearest neighbor:

# Compute class centroids
ai_centroid = X_train[y_train == 1].mean(axis=0)
real_centroid = X_train[y_train == 0].mean(axis=0)

# Classify a new image (its embedding comes from get_embedding above)
dist_to_ai = np.linalg.norm(embedding - ai_centroid)
dist_to_real = np.linalg.norm(embedding - real_centroid)

prediction = "AI" if dist_to_ai < dist_to_real else "Real"
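Vectorized over the whole test set (my version of the same rule):

# Distance from every test embedding to each centroid
dist_to_ai = np.linalg.norm(X_test - ai_centroid, axis=1)
dist_to_real = np.linalg.norm(X_test - real_centroid, axis=1)

preds = (dist_to_ai < dist_to_real).astype(int)  # 1 = AI
print((preds == y_test).mean())                  # fraction classified correctly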

Result: 74.8% accuracy with zero training.

The logistic regression adds about 14 percentage points, but CLIP embeddings alone get you most of the way there.

What Does the Model See?

Here's the honest answer: I don't know.

I tried probing the CLIP dimensions to understand what features matter. The results were messy and inconclusive. These are learned representations, not human-interpretable features.

Looking at the AI images ranked by confidence, there's no obvious pattern:

[Image: AI images ranked from "fooled the detector" (top-left) to "obviously AI" (bottom-right). The visual pattern isn't clear; the model detects something we can't see.]

The image at 12% confidence (the one that fooled the detector) does look like a real photo at a glance, but does the 98%-confidence image of a woman sitting at dusk in a sidewalk cafe really scream AI? CLIP is detecting subtle statistical signatures that aren't visible to human eyes.
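
The confidence scores behind that ranking come straight from the classifier's predicted probabilities (a sketch using the logistic regression from earlier):

# P(AI) for each held-out AI image, used to rank them
ai_mask = y_test == 1
probs = classifier.predict_proba(X_test[ai_mask])[:, 1]
order = np.argsort(probs)  # lowest first: the ones that fooled the detector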

Limitations

This is an exploration of the technique, not a production AI detector.

It won't generalize well. I trained on portrait photography. It won't work reliably on landscapes, illustrations, or other styles. A real detector would need a much more diverse training set.

AI generators are improving. The patterns CLIP detects today may disappear as generators get better at mimicking real image statistics.

The model isn't interpretable. We can measure that it works, but we can't explain why it works. That makes it hard to trust for high-stakes decisions.

Conclusion

CLIP embeddings are surprisingly effective for AI image detection:

Approach                           Accuracy
FFT (traditional)                  50.4%
Centroid distance (no training)    74.8%
Logistic Regression on CLIP        88.5%

The key insight: CLIP has learned features that capture something fundamental about how real and AI images differ—even though we can't see or explain what that something is.

For a quick-and-dirty AI detector on a specific image domain, this approach works remarkably well. Just don't expect it to generalize to everything.
