Andrew Betts for Fastly

Who kept the bots out? Stopping content being harvested by AI

AI-powered content generation has exploded in popularity recently with tools like ChatGPT and Bard, but the vast amounts of data these models require come from harvesting the web. What if you don’t want your content feeding the bots? Some respect robots.txt; others recognise a new ‘noai’ header directive.

A recent article in Vice drew attention to the way AI bots are harvesting the web, in some cases quite aggressively. Site owners reasonably want to protect the originality of their creative works, assert their copyright, and avoid traffic surges from ill-configured or badly behaved scraper bots.

The particular tool mentioned in the Vice article is Img2dataset, and right now it doesn't pay attention to the robots.txt file, the usual mechanism for dissuading well-behaved bots from indexing your content. However, it does respect a new HTTP header directive, X-Robots-Tag: noai (and also noindex, which is an existing and already well-known robots directive).
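
For reference, robots.txt is a plain text file served from the root of your domain. Here's a minimal sketch; the user-agent token is illustrative, since each crawler documents its own:

# The user-agent token below is a placeholder; check each crawler's docs
User-agent: ExampleAIBot
Disallow: /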

A lot of Fastly customers have multiple websites running through Fastly, and in many cases multiple backend servers feeding a single public domain. Managing the addition of HTTP metadata at the origin application can be tedious and error-prone, but fortunately applying it at the edge is pretty simple.

If you have a VCL-powered Fastly service, you can add the header in a deliver snippet or in the vcl_deliver subroutine of your custom VCL code:

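# Ask compliant AI scrapers not to use this content for training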
set resp.http.x-robots-tag = "noai";

Here's a fiddle demonstrating that code, which you can play with.

If you are using our new Compute@Edge platform, setting a header varies from one language to another. There's a generic example in our developer hub covering all languages we support, but as an example, here's how to do it in JavaScript:

resp.headers.set("x-robots-tag", "noai");

And again, here's a fiddle with that code you can remix and try for yourself.
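
To see that line in context, here's a minimal sketch of a complete Compute handler in JavaScript. The backend name "origin" is an assumption here; use whatever backend name is configured on your service.

addEventListener("fetch", (event) => event.respondWith(handleRequest(event)));

async function handleRequest(event) {
  // Forward the incoming request to the origin ("origin" is a placeholder name)
  const resp = await fetch(event.request, { backend: "origin" });
  // Ask compliant AI scrapers not to use this content for training
  resp.headers.set("x-robots-tag", "noai");
  return resp;
}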

Of course, this fixes the issue for only one bot, and while others might pick up on signals from robots.txt, some are just going to stubbornly insist on crawling and downloading your content regardless. These bots should be considered bad actors and blocked. Fastly has lots of ways to do this: manual logic can be written in both VCL and all Compute languages to block based on signals such as:

  • ASN (Autonomous System Number): the block of IP addresses owned by a single network operator
  • Explicit IP addresses or ranges
  • HTTP headers such as User-Agent or Accept
  • Originating traffic location

Here's an example of banning by IP address.
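
In VCL you would typically do this with an ACL; here's the equivalent idea as a minimal Compute JavaScript sketch, where the blocked addresses and the backend name are illustrative placeholders:

// Requests from these client IPs will be rejected (placeholder addresses)
const BLOCKED_IPS = new Set(["192.0.2.1", "198.51.100.7"]);

addEventListener("fetch", (event) => event.respondWith(handleRequest(event)));

async function handleRequest(event) {
  // event.client.address is the IP address of the connecting client
  if (BLOCKED_IPS.has(event.client.address)) {
    return new Response("Forbidden", { status: 403 });
  }
  return fetch(event.request, { backend: "origin" });
}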

You might like to use challenges to separate out bot traffic, such as a CAPTCHA or proof of work, or use DNS lookups to authenticate good crawlers.
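
The standard way to authenticate a crawler such as Googlebot is a double DNS lookup: reverse-resolve the client IP to a hostname, check the hostname belongs to the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. DNS lookups aren't available in VCL, so as an illustration here's a sketch of that check in Node.js, which you might run in a backend or a separate verification service:

import { reverse, resolve4 } from "node:dns/promises";

// Verify that a client claiming to be Googlebot really is, using
// the reverse-then-forward DNS check Google recommends.
async function isVerifiedGooglebot(ip) {
  try {
    const hostnames = await reverse(ip); // e.g. "crawl-66-249-66-1.googlebot.com"
    const host = hostnames.find(
      (h) => h.endsWith(".googlebot.com") || h.endsWith(".google.com")
    );
    if (!host) return false;
    const addresses = await resolve4(host); // forward lookup must return the same IP
    return addresses.includes(ip);
  } catch {
    return false; // unresolvable addresses are treated as unverified
  }
}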

Finally, if you want more of a magic bullet, consider enabling our Next-Gen Web Application Firewall on your service, which detects anomalous behaviour and can alert on and block attacks automatically.

If you want to get inspiration for what else you could be doing with edge compute technologies, check out the solutions area on the Fastly developer hub. Happy coding!
