Introduction
A robots.txt file is a plain-text file placed at the root of a website that tells web crawlers which URLs they may request and which they should skip. For example, you may want to keep duplicate pages out of search results, or stop crawlers from surfacing pages that expose private data such as your customers' names and addresses.
A robots.txt file instructs web crawlers which sections of a site to crawl and which to ignore. By controlling which sections are crawled and indexed, you influence what content from the site appears in search results.
Enumerating robots.txt files
To locate the robots.txt file on any website, append /robots.txt to the site's root URL, for example http://your_domain/robots.txt. If the website uses robots.txt to instruct web crawlers, the output will look something like this:
User-Agent: *
Allow: /admin/
Disallow: /
From the output above, it is clear the web crawler has been instructed to crawl files under the /admin/ directory and ignore everything else.
Using the information gathered from the robots.txt file, an attacker can visit https://your_domain/admin/ to look for material that aids malicious activity.
In the scenario above, the attacker may be led straight to the admin panel login page. This poses a critical security risk, since attackers will try every means available to break into the admin panel.
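Enumerating a robots.txt file can be sketched in a few lines of Python. The `parse_robots` helper below is hypothetical (not part of any library); the sample text mirrors the output shown above.

```python
# Minimal sketch: parse robots.txt content into (directive, value) pairs.
def parse_robots(text):
    """Return (directive, value) pairs from robots.txt content."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            directive, _, value = line.partition(":")
            rules.append((directive.strip().lower(), value.strip()))
    return rules

sample = """User-Agent: *
Allow: /admin/
Disallow: /
"""
print(parse_robots(sample))
# → [('user-agent', '*'), ('allow', '/admin/'), ('disallow', '/')]
```

In practice you would fetch http://your_domain/robots.txt (for example with `urllib.request.urlopen`) and feed the response body to a helper like this, then visit each listed path.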
Impact
The robots.txt file is not a security control in itself. However, if its contents reveal sensitive paths, attackers can use them to locate and probe unauthorized web directories. It is therefore advisable not to list important files or directories in the robots.txt file.
Prevention
To keep sensitive locations out of the robots.txt file, developers can instead use the X-Robots-Tag with appropriate values to instruct web crawlers whether or not to index particular files or directories. The X-Robots-Tag is sent in the response headers:
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
The value nofollow tells Googlebot not to follow links on the page, while noindex, nofollow tells the other bot not to index the page or follow its links.
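Sending these headers can be sketched with a minimal stdlib-only WSGI application; the app itself and its response body are illustrative, but the header values mirror the examples above.

```python
# Minimal WSGI sketch that attaches X-Robots-Tag headers to every response.
def app(environ, start_response):
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Tell Googlebot not to follow links on the page:
        ("X-Robots-Tag", "googlebot: nofollow"),
        # Tell the other bot not to index the page or follow its links:
        ("X-Robots-Tag", "otherbot: noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [b"<h1>Private area</h1>"]
```

Because the crawler never needs to see a path in robots.txt for this to work, the sensitive directory stays unlisted while still being kept out of the index.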
Let's take a practical look at the following scenario...
If users can upload images to a website, for example, an uploaded image could end up on an indexed page where visitors may be exposed to adult content. In this case, the developers of the website will want to ensure that no crawlable pages contain adult content by serving those pages with a "noindex, nofollow" directive, via the X-Robots-Tag header or a robots meta tag (noindex is not an officially supported robots.txt directive).
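The per-page equivalent of the X-Robots-Tag header is a robots meta tag in the page's HTML. The `with_robots_meta` helper below is a hypothetical sketch of stamping that tag into a template.

```python
# Hypothetical sketch: insert a robots meta tag right after <head>,
# equivalent to sending X-Robots-Tag for that page.
def with_robots_meta(html, directives="noindex, nofollow"):
    """Insert a robots meta tag right after the opening <head> tag."""
    tag = '<meta name="robots" content="%s">' % directives
    return html.replace("<head>", "<head>" + tag, 1)

page = "<html><head><title>Uploads</title></head><body></body></html>"
print(with_robots_meta(page))
```

A real site would do this in its template engine rather than with string replacement, but the effect on the crawler is the same.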
Thanks for reading. Looking forward to your feedback.