DEV Community

DetectArena

I checked 1000+ AI and real images with top AI image detectors. You will be surprised

I built AI Detector Arena — an independent platform that pits AI image detectors against each other. We tested 1,000+ AI-generated images and 251 real photographs across 11 detection services.

Here's what we found.

The dataset

AI images: generated by 17 models — Flux Pro, Midjourney, GPT Image 1.5, Gemini 3 Pro, Stable Diffusion 3.5, SeedDream v4, Grok Aurora, Ideogram v3, Leonardo, Firefly v4, and more.

We wrote prompts at 3 difficulty levels. The hardest ones describe mundane scenes with imperfections — motion blur, bad lighting, JPEG artifacts, phone camera noise. These are the images that fool detectors the most.

Detectors tested: Hive Moderation, SightEngine, AI or Not, Winston AI, Was It AI, Decopy, QuillBot, TruthScan, MyDetector, and two open-source HuggingFace models.

5 surprising findings

1. **False positives are rampant**

This was the biggest shock. The HuggingFace SDXL-detector classified a real photo of London Bridge as 99.8% AI-generated. A real city skyline — also 99.8%. These are genuine photographs taken before AI generators existed. Some detectors had false positive rates over 20% — one in five real photos flagged as AI. In journalism, academia, or legal contexts, that's dangerous.
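The false positive rate here is just the share of genuine photos a detector wrongly flags as AI. A minimal sketch (the helper name and the numbers in the example are illustrative, not the study's actual code or data):

```python
def false_positive_rate(verdicts: list[bool]) -> float:
    """verdicts: detector outputs for REAL photos only (True = flagged as AI)."""
    return sum(verdicts) / len(verdicts)

# Hypothetical example: 251 real photos, 52 wrongly flagged as AI.
flags = [True] * 52 + [False] * 199
print(f"{false_positive_rate(flags):.1%}")  # -> 20.7%
```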

2. **No detector is consistently the best**

A detector might catch 95% of Midjourney images but only 40% of Flux outputs. Another nails Stable Diffusion but completely misses GPT Image. There is no single winner across the board.
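This is why a single headline accuracy number hides so much: you have to break detection rates out per generator. A sketch of that breakdown, with made-up records standing in for real results:

```python
from collections import defaultdict

# Hypothetical records for ONE detector: (generator, flagged_as_ai), AI images only.
results = [
    ("Midjourney", True), ("Midjourney", True), ("Midjourney", False),
    ("Flux Pro", False), ("Flux Pro", True), ("Flux Pro", False),
]

# Tally caught/total per generator to expose per-model blind spots.
tally = defaultdict(lambda: [0, 0])  # generator -> [caught, total]
for gen, caught in results:
    tally[gen][0] += int(caught)
    tally[gen][1] += 1

for gen, (caught, total) in tally.items():
    print(f"{gen}: {caught}/{total} = {caught / total:.0%}")
```

The same detector can score very differently in each row — which is exactly the "no single winner" pattern we saw.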

3. **New AI models are winning the arms race**

Flux Pro v1.1, GPT Image 1.5, SeedDream v4 — detection rates dropped significantly compared to older models. Detectors claiming "99% accuracy" were clearly trained on last year's generators.

4. **Prompt difficulty breaks detectors**

Simple prompts ("a woman's face") get caught easily. Hard prompts describing imperfect real-world scenes — bad lighting, motion blur, phone camera quality — reduced detection rates dramatically. The very flaws that signal "real photo" can now be synthesized.

5. **Agreement beats any single detector**

When 8 out of 9 detectors say "AI," it probably is. When they split 5-4, treat it as inconclusive. The ensemble approach consistently outperformed every individual service. Never trust a single detector's verdict.
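The ensemble logic above can be sketched as a simple majority vote with an "inconclusive" band for near-splits. The 80% threshold is an illustrative choice, not a value from the study:

```python
def ensemble_verdict(votes: list[bool], strong: float = 0.8) -> str:
    """Majority vote across detectors (True = 'this is AI').

    Verdicts near a split are treated as inconclusive; the `strong`
    threshold is an assumption, tune it for your own risk tolerance.
    """
    share = sum(votes) / len(votes)
    if share >= strong:
        return "likely AI"
    if share <= 1 - strong:
        return "likely real"
    return "inconclusive"

print(ensemble_verdict([True] * 8 + [False]))     # 8 of 9 agree -> likely AI
print(ensemble_verdict([True] * 5 + [False] * 4)) # 5-4 split   -> inconclusive
```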
