How to Prevent AI Models from Training on Website Images

Bianca Rus

Why people keep asking about this

AI models need absurd amounts of training data. A lot of that data still comes from scraping public websites.

If you publish original images (photography, design work, product photos), it's only natural to wonder whether you can stop them from being pulled into yet another training set.

The honest answer is boring but important: not always fully.

The more useful answer is that you can reduce exposure and clearly state where you stand. That alone already filters out a surprising amount of automated traffic.

What AI crawlers actually do (and don't do)

These bots don't really visit your site.

There's no browser window. No scrolling. No clicking around. Most of the time they just request your HTML and image files directly, store the responses, and deal with everything later somewhere else.
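In practice, a visit is nothing more than a plain HTTP GET with a self-identifying User-Agent header, roughly like this (the path is invented and the exact user-agent string varies by vendor and version):

GET /images/portfolio-shot.jpg HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot
Accept: */*

That User-Agent string is the hook most of the techniques below rely on.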

JavaScript usually doesn't run. UI tricks don't matter. Anything that relies on a human being annoyed enough to give up is irrelevant here.

That's why so many "protections" you'll see recommended online feel nice but change absolutely nothing.

Things that don't work (so don't waste time on them)

Let's get this out of the way:

  • disabling right-click
  • JavaScript overlays
  • CSS tricks to hide or blur images
  • base64 encoding assets

All of that targets people, not bots. Crawlers just fetch the file and move on.

If something doesn't operate at the HTTP, server, or network level, it's mostly for show.

Two different levers: signaling vs enforcement

This distinction matters more than most people realize.

Signaling is about saying "I don't consent to this".
Enforcement is about actually blocking requests.

You usually want both. Signals catch the crawlers that are trying to behave. Enforcement deals with the rest.

Generic websites (any site with server access)

If you control your server (a VPS, dedicated hosting, or a custom backend), you have the most flexibility.

robots.txt: basic signaling

A robots.txt file is still worth having:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

You can scope this down to image directories if you want:

User-agent: GPTBot
Disallow: /images/
Disallow: /uploads/
Disallow: /media/
Allow: /

This list is non-exhaustive and should be updated as new crawlers appear.

Just keep expectations realistic: robots.txt is a request, not a lock.
Compliant crawlers will respect it. Others won't.
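If you want to see which of these bots actually hit your site, and whether they obey the file, grepping your access logs for the tokens above is usually enough (the log path assumes a default Nginx setup; adjust for Apache or your host):

grep -iE "GPTBot|CCBot|ClaudeBot|anthropic-ai|Bytespider" /var/log/nginx/access.log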

HTTP headers: stating your policy clearly

Another layer is response headers.

On Apache:

<IfModule mod_headers.c>
Header set X-Robots-Tag "noai, noimageai"
</IfModule>
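If you only want to tag image responses rather than every page, you can scope the header; a minimal Apache sketch (adjust the extension list to your assets; Nginx can do the same inside a location ~* block):

<IfModule mod_headers.c>
  <FilesMatch "\.(jpe?g|png|gif|webp|svg)$">
    Header set X-Robots-Tag "noai, noimageai"
  </FilesMatch>
</IfModule>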

On Nginx:

add_header X-Robots-Tag "noai, noimageai";

In plain terms, this tells crawlers not to use your content, especially images, for training.

These directives aren't formal web standards, and support for them varies from crawler to crawler; the major AI companies primarily document robots.txt user agents as their opt-out mechanism.

They express intent and are respected by compliant crawlers, but they do not guarantee exclusion from all AI training.

Think of this as declaring intent, not enforcing it.

Server-level blocking (.htaccess / Nginx)

This is where things become real enforcement.

Apache – block everything:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot) [NC]
RewriteRule .* - [F,L]
</IfModule>

Apache – block only images:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot) [NC]
RewriteCond %{REQUEST_URI} \.(jpg|jpeg|png|gif|webp|svg|bmp|tiff)$ [NC]
RewriteRule .* - [F,L]
</IfModule>

Nginx equivalent:

if ($http_user_agent ~* (GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot)) {
    return 403;
}
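And if you only want to refuse image requests, the same check can sit inside a location block; a sketch (make sure it doesn't clash with an existing location that already serves your static files):

location ~* \.(jpg|jpeg|png|gif|webp|svg|bmp|tiff)$ {
    if ($http_user_agent ~* (GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot)) {
        return 403;
    }
}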

This stops requests before your application even sees them.

The obvious limitation is that user agents can be spoofed. Also, some tokens exist only as robots.txt controls: Google-Extended, for instance, never shows up as a request user agent because Google crawls with its regular Googlebot agents, so matching it here is harmless but does nothing. Still, this filters out a large chunk of automated traffic from companies that at least identify themselves.

CDN and firewall blocking

If you're using a CDN like Cloudflare, blocking at the edge is often the cleanest setup.

A simple rule matching known AI user agents and setting the action to Block or Challenge can save bandwidth and keep this traffic away from your origin entirely.
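With Cloudflare, for example, that can be a custom WAF rule whose expression matches the same user agents used above, with the action set to Block; a sketch in Cloudflare's rules language (note that contains is case-sensitive, so match the casing the bots actually send):

(http.user_agent contains "GPTBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider")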

For hosted platforms, this is often the only real enforcement option.

WordPress sites

WordPress makes images very easy to find. Public uploads, predictable URLs, lots of resized variants sitting in /wp-content/uploads/. Crawlers love that.

You've got multiple options depending on your technical comfort level and hosting setup.

Option 1: Use plugins (easiest)

If you don't want to touch server files, plugins handle everything through the WordPress admin.

Image optimization plugins with AI blocking:

ShortPixel Image Optimizer is a strong pick here. Because it already handles image optimization and delivery, it's a natural place to apply AI-related access controls, such as injecting X-Robots-Tag headers or restricting access to image assets, without touching server config files. It optimizes your images, can deliver them via CDN, and gives you an option to restrict AI training on them.

Dedicated AI crawler blocking plugins:

  • Block AI Crawlers – straightforward plugin that blocks known AI bots.
  • Dark Visitors – more comprehensive, with an updated crawler database, robots.txt generation, server-level blocking, and analytics.

These typically let you:

  • choose which crawlers to block
  • select the blocking method (robots.txt, server-level, or both)
  • monitor what's getting blocked

Option 2: Customize robots.txt

WordPress generates its own robots.txt automatically, but you can override it.

Physical file approach:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

Or block just uploads:

User-agent: GPTBot
Disallow: /wp-content/uploads/
Allow: /
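Filter approach: if you'd rather not maintain a physical file, WordPress exposes a robots_txt filter, so a small snippet in a plugin or your theme's functions.php can append the same rules to the generated file. A minimal sketch (it only takes effect when no physical robots.txt exists in the web root, and the uploads path assumes a default install):

<?php
// Append AI-crawler rules to the robots.txt WordPress generates on the fly.
add_filter( 'robots_txt', function ( $output, $public ) {
    if ( $public ) { // skip if the site is set to discourage search engines
        $output .= "\nUser-agent: GPTBot\nDisallow: /wp-content/uploads/\n";
        $output .= "\nUser-agent: CCBot\nDisallow: /wp-content/uploads/\n";
    }
    return $output;
}, 10, 2 );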

Option 3: Server-level blocking with .htaccess

Add rules before the WordPress rewrite section:

# Block AI crawlers
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|Google-Extended|CCBot|anthropic-ai|ClaudeBot) [NC]
RewriteRule .* - [F,L]
</IfModule>

# Declare policy
<IfModule mod_headers.c>
Header set X-Robots-Tag "noai, noimageai"
</IfModule>

# BEGIN WordPress
# (WordPress rules below)

Shopify, Wix, Squarespace and similar platforms

These platforms are the most restrictive.

You can usually edit robots.txt in some form. Beyond that, options are limited. No custom headers. No server-level rules.

In practice, you can signal intent, but you can't really enforce much unless you put a CDN in front of the site and block traffic there.

A realistic setup that's "good enough"

You don't need to do everything.

  • robots.txt for compliant crawlers
  • X-Robots-Tag headers to state policy
  • server or CDN blocking where possible

That combination already reduces exposure significantly.

Let's be honest about the limits

What you can do:

  • clearly state boundaries
  • block crawlers that respect the rules
  • make large-scale scraping more expensive

What you can't do:

  • stop a determined actor that spoofs its user agent
  • undo past scraping
  • build a perfect technical wall

The goal isn't perfection. It's risk reduction.

Final thoughts

Preventing AI models from training on your website images isn't about hiding content. It's about drawing lines and enforcing them where you realistically can.

Most legitimate AI companies do respect those lines. For everyone else, layered technical controls and clear legal language are your best tools.

This space changes fast, so whatever setup you choose, it's worth revisiting it from time to time.

If you've implemented any of these techniques, I'm curious which ones actually made a difference for you.
