Ervin Szilagyi

Posted on Aug 8, 2023

Disallow GPT Bot from Scraping our Blog Posts

#watercooler #openai #gpt3 #discuss

We can now block OpenAI's GPTBot from scraping the pages of a site that we control by adding the following lines to its robots.txt file:

User-agent: GPTBot
Disallow: /
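To see how a well-behaved crawler would read this rule, here is a minimal sketch using Python's standard `urllib.robotparser`. The URL and the `SomeOtherBot` agent name are made up for illustration, and the rules are parsed from a string rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as above, supplied inline instead of being downloaded.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page on a site we control.
url = "https://example.com/blog/my-post"

gpt_allowed = parser.can_fetch("GPTBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)

print(gpt_allowed)    # False: the rule names GPTBot and disallows every path
print(other_allowed)  # True: no group in the file applies to this agent
```

Note that the rule targets only the agent it names; any crawler with a different user agent is unaffected unless it gets its own group or a `User-agent: *` group is added.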

I found out about this from a tweet by Gergely Orosz.

My stance on this is similar to what Gergely is saying. GPT offers no citations for the information it provides. While I did update the robots.txt file on my personal website, I am also cross-posting to DEV. If we look at the robots.txt from DEV.to, we can see that it does not have the same rule for GPTBot.

There are thousands of people posting to DEV, many of whom have different views about scraping information for LLM training. I'm in no position to request changes that will affect how the site works. I'm just curious what other authors think about having their articles scraped for ML training.

Does updating your `robots.txt` actually solve something?

Obviously not. Adding this rule to your site is just a hint, a request for a bot to please not scrape your data.
This won't stop huge LLM bots (including GPT) from getting the information they want.
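The reason it is only a hint is that the check happens on the crawler's side, not on the server's. The sketch below illustrates this under some assumptions: `fake_fetch`, `polite_crawl`, and `rude_crawl` are hypothetical helpers, and no real HTTP request is made.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

def fake_fetch(url):
    # Stand-in for a real HTTP request. The server serving the page
    # never consults robots.txt on the crawler's behalf.
    return f"<html>content of {url}</html>"

def polite_crawl(agent, url, rules):
    # A cooperative crawler checks the rules itself and backs off.
    if not rules.can_fetch(agent, url):
        return None
    return fake_fetch(url)

def rude_crawl(agent, url):
    # Nothing forces a crawler to perform the check at all.
    return fake_fetch(url)

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/posts/1"  # hypothetical page
polite_result = polite_crawl("GPTBot", url, rules)
rude_result = rude_crawl("GPTBot", url)

print(polite_result)            # None: the polite crawler skipped the page
print(rude_result is not None)  # True: the rude crawler got the content anyway
```

Actually keeping a crawler out would require enforcement on the server, which is a different mechanism from robots.txt.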

So, I'm curious: what do other people think about this topic?

Latest comments (4)

José Benavente • Aug 16 '23

This is good to know at least. Like Reme and Sijmen mention though, it won't do much realistically speaking.

Sijmen J. Mulder • Aug 13 '23

This won't stop huge LLM bots (including GPT) from getting the information they want.

It won't stop badly behaved actors, but I feel OpenAI deserves the benefit of the doubt here; afaik there are no indications they are a bad actor in this regard.

Reme Le Hane • Aug 13 '23

Personally, I don’t care. Like you say, this only stops GPT, and it does not expunge everything they’ve already scraped. Also, as mentioned, they’re not the only one. If you really don’t want to train them, stop posting online; that really is the only solution.

What about sites like the Web Archive and all the cross-posts? If you’re going to put it online, you need to accept the fact that you’ve basically given it away for free to everyone, and what they do with it is entirely up to their own personal moral code.

Your licence in your code on GitHub is utterly meaningless without a lawyer and the finances to back it up. GitHub won’t do squat if Joe Soap or Big Corpa uses your code without credit.

Have you ever read the licence on a package you installed into a side project or even a work project? I don’t. I am pretty sure that at some point, professionally or personally, I’ve used a package in a way that goes against one of its guidelines, purely as a result of ignorance and the fact that the vast majority of people never read licences or have any clue what they mean.

You cannot license a blog post that you are putting online and apply restrictions and limitations to it, and even if you do, it’s up to you to legally enforce them. Someone can copy it word for word, and unless you can afford a lawyer or for some reason a big news channel cares about it, you’re SOL.

Bit late to start kicking up a fuss now; these LLMs have been training for 4+ years on 30+ years’ worth of information we all willingly and freely put online.

Ervin Szilagyi • Aug 13 '23 • Edited

Hi Reme,

Thank you for your comment, I really appreciate it.

You bring up some valid points. You are absolutely right that if you put content online, you are giving it away for free. You have no control over who will use your information and for what purpose.

Regarding licensing, I don't really want to get deeper into this. The reason being that I don't have the necessary knowledge, I'm not a lawyer, and frankly I don't really mind.

What bothers me is the lack of citations in the case of LLMs. Putting aside the technicalities of whether it is possible for an LLM to give you its sources for a piece of information, the problem is that they simply don't do it. You see, presenting your sources is not just beneficial for the author of that information, it is also helpful to the user of an LLM for fact-checking.

You are right that if you don't want your information to be taken, you just don't put it online. What I want to add here is that you can limit the reach of your content with a simple login, but this has many downsides as well.
