
SmartScanner

Posted on • Originally published at thesmartscanner.com

Is your robots.txt file vulnerable? Here's how to check and secure it

A badly configured robots.txt file is like a welcome sign to your website for hackers.
Let's see how to set up an efficient and secure robots.txt.

Beware of Robots.txt! You should not reveal everything to robots.

What is robots.txt?

Strictly speaking, the robots.txt file is not vulnerable by itself; the problem is what you put in it.
Robots.txt is a file for web robots. Web robots, also known as spiders or crawlers, are programs that browse the web to collect content. Search engine robots like Googlebot collect your web pages for indexing, while spammer robots look for email addresses and other data on your website.

Web robots look for the robots.txt file in the top-level directory of a website, directly under the root domain (for example, https://example.com/robots.txt).

Anatomy of a robots.txt

In robots.txt, you can put instructions for web robots about your website; these instructions follow the Robots Exclusion Protocol.

Each line of a robots.txt file consists of a field, a colon, and a value. Comments start with the # character, and white space is optional. So the general syntax looks like this:

<field>:<value><#optional-comment>
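For instance, a single rule with a trailing comment might look like this (the path here is just a made-up example):

disallow: /old-stuff/    # ask robots to skip this directory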

Common fields are the following:

  • user-agent: identifies which crawler the rules apply to.
  • allow: a URL path that may be crawled.
  • disallow: a URL path that may not be crawled.
  • noindex: an unofficial directive meant to keep a page out of the search index; Google no longer supports it in robots.txt.
  • sitemap: the complete URL of a sitemap.

By default, robots try to crawl your whole website, or at least as many pages as they choose. Using the directives above, you can guide how they crawl your site.
The allow and disallow directives are the most commonly used, telling robots which pages they may crawl and which they should not.
Using user-agent, you can scope a group of rules to a specific robot.

For example, consider the sample robots.txt below.

user-agent: *
allow: /*

user-agent: googlebot
disallow: /oldui/

sitemap: https://example.com/sitemap.xml

This robots.txt means that any robot (user-agent: *) is allowed to crawl any URL (allow: /*), but Googlebot (user-agent: googlebot) is not allowed to crawl URLs under /oldui/. This sample also includes a link to the sitemap.

The evil's nest

The disallow and noindex directives are often misunderstood. Using them to hide pages from Google and other robots may seem like a good idea, but not all web robots respect these directives. Keep in mind that robots.txt is a public file that can be read by Googlebot and by an attacker alike.

So if you put something like disallow: /admin/ in your robots.txt file, you're actually revealing the URL of your website's admin section.
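For example, a robots.txt like the one below (the paths are made up for illustration) hands an attacker a ready-made map of interesting places to probe:

user-agent: *
disallow: /admin/
disallow: /backup/
disallow: /old-login/

Anyone can open https://example.com/robots.txt in a browser and read this list, no hacking tools required.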

Any page you list in robots.txt is exposed in a file that the whole internet can read, just like your home page. So robots.txt is not a locker where you can hide your secrets.

If a page is already public and you simply don't want robots to crawl it, the disallow directive is fine, because listing it reveals nothing sensitive. Note that disallow only stops compliant robots from crawling the page; to reliably keep a public page out of search results, use a noindex robots meta tag or the X-Robots-Tag HTTP header on the page itself (and don't disallow that page, or crawlers will never see the tag).
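As a rough sketch, a safer robots.txt only lists public, non-sensitive sections (the paths below are hypothetical), while anything truly private stays off the list and behind authentication instead:

user-agent: *
disallow: /search/    # public search-result pages, nothing secret here
disallow: /print/     # printer-friendly duplicates of public pages

sitemap: https://example.com/sitemap.xml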

Conclusion

Robots.txt is a good tool for guiding your website's robot visitors. But it's not read only by robots, and not all robots are friendly, so do not put sensitive information in your robots.txt file. You can test your robots.txt for sensitive information leakage with SmartScanner, the web vulnerability scanner. It's free and easy: just enter the address of your website and hit Scan.
