Discussion on: No Robots Allowed

Jamie

How do you feel about the robots HTTP header?

For those who don't know, it's a header you can include in a page's response that tells a web crawler what it's permitted to do with that page. It's not a replacement for robots.txt, and (just like the robots.txt file) search engines don't have to support it.

An example of the robots header would be something like:

X-Robots-Tag: noarchive, nosnippet

This tells any web crawler that fetches the page that it may not archive the page or show snippets from it in search results.
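
If you wanted to send it yourself, a minimal sketch (assuming a Flask app; the route and page body are just placeholders) might look like:

from flask import Flask, make_response

app = Flask(__name__)

@app.route("/some-page")
def some_page():
    # Build the response, then attach the robots header to it
    response = make_response("<h1>Some page</h1>")
    # Ask crawlers not to archive this page or show snippets of it
    response.headers["X-Robots-Tag"] = "noarchive, nosnippet"
    return response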

James Turner (Turner Software)

I'm a bit torn on the robots header. On one hand, it allows really fine-grained control on a per-page basis. On the other hand, you have to make a request to the page just to find out whether you're allowed to keep the data, which feels like a waste of bandwidth.

I mean, you could do a HEAD request to find out, but then you might end up with two HTTP requests just to get the content in an "allowed" scenario.
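
Something like this, as a rough sketch with Python's requests library (the URL and the "am I allowed" check are placeholders):

import requests

url = "https://example.com/some-page"

# Request 1: HEAD, to read the robots header without downloading the body
head = requests.head(url, allow_redirects=True)
tag = head.headers.get("X-Robots-Tag", "").lower()

if "noarchive" not in tag:
    # Request 2: GET, now that we know we may keep a copy.
    # In the "allowed" case this is exactly the doubled round trip.
    body = requests.get(url).text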

That said, I do see value in the header. I'm actually building my own web crawler (which I will do another post about in the future) and I want to add support for the header.
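
For the crawler, the first building block is probably just splitting the header value into directives. A rough sketch of what that might look like (note that some servers scope directives to one crawler, e.g. "googlebot: noindex", and the unavailable_after directive legitimately contains a colon, so the prefix-stripping has to skip it):

def parse_robots_tag(value):
    """Split a raw X-Robots-Tag value into a set of lowercase directives."""
    directives = set()
    for part in value.split(","):
        part = part.strip().lower()
        # Drop an optional user-agent prefix like "googlebot:", but leave
        # unavailable_after alone since its value follows a colon too
        if ":" in part and not part.startswith("unavailable_after"):
            part = part.split(":", 1)[1].strip()
        if part:
            directives.add(part)
    return directives

# parse_robots_tag("noarchive, nosnippet") -> {"noarchive", "nosnippet"}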