Shubham Singh

Originally published at minddwave.com

Robots.txt Mistakes That Are Silently Killing Your SEO And How To Fix Them

Let’s be real: for many people working on the web, the robots.txt file is a source of low-grade anxiety. It’s a small text file sitting on your server that carries an outsized amount of power. One mistake, one misplaced slash, and you could be telling Google to shut out your entire website. It’s like that big red button you are told never to touch.

However, it does not need to be that way.

Instead of a frightening red button, think of your robots.txt file as an efficient, friendly bouncer standing at the door of your site’s private club. Its job is to check the ID of every web crawler that shows up (Googlebot, Bingbot, AhrefsBot, and so on) and hand out specific instructions: you can go in here, but that area is off-limits.

Used correctly, this bouncer is your best friend. It protects your crawl budget, keeps irrelevant pages out of the search results, and helps search engines understand your site structure. This guide will help you understand the robots.txt file once and for all. We’ll walk through the most common mistakes, cover the most recent changes, and give you the confidence to turn this file from an object of dread into one of your most useful SEO tools.

What is a robots.txt file and why should I care?

The robots.txt file is a plain text file based on the Robots Exclusion Protocol. It tells web crawlers (bots) which URLs on a domain they are allowed to access and crawl.

The most crucial rule is where it should be: it must live in the root directory of your website.

Correct: https://yourdomain.com/robots.txt
Incorrect: https://yourdomain.com/blog/robots.txt
If it’s not in the root directory, crawlers won’t find it and will assume they are allowed to crawl everything.

The Most Common robots.txt Problems That Could Ruin Your SEO

These mistakes are more common than you might think. Let’s look at the worst offenders and how to fix them so you can audit your own site.

The Ghost File: A 404 or inaccessible robots.txt

The Problem: If a crawler requests yourdomain.com/robots.txt and gets a 404 Not Found error, it assumes no rules exist and will happily crawl your entire site. That includes pages you never wanted touched, such as admin logins or internal search results, which wastes crawl budget.
The Fix: Every site should have a robots.txt file, even if it is empty or allows everything. It shows you are in control. A simple allow-all file looks like this:
User-agent: *
Disallow:

Blocking Critical Resources (CSS & JavaScript)

The Problem: In the early days of SEO, people blocked CSS and JavaScript files to save crawl budget. Today that is a serious mistake. Google renders your site the way a user’s browser would in order to understand its layout and content. If you block these files, Google sees a broken, jumbled mess of text, which can severely hurt your rankings.
The Fix: Audit your robots.txt file for any Disallow rules that block .css or .js files, and make sure crawlers can reach every resource needed to render your site.
Bad: Disallow: /assets/js/
Good: Allow: /assets/js/ (or simply have no rule that blocks it).
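If you do need to keep part of an asset directory blocked, a minimal sketch like this (the /assets/ paths are hypothetical; match your own directory structure) uses Allow rules to carve out the rendering resources while leaving the rest off-limits:

User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/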

Using Disallow to Noindex a Page

The Problem: This is the most common and damaging misunderstanding of robots.txt. Disallowing a URL in robots.txt does NOT prevent it from being indexed; it only stops it from being crawled. If other sites link to the disallowed page, Google can still discover it and index it. The search result will look unappealing, often showing just the URL with a description like "No information is available for this page."
The Fix: If you want to stop a page from appearing in search results, use a noindex robots meta tag in the page’s HTML head, or an X-Robots-Tag in the HTTP response header.

Importantly, Google can only see the noindex tag if it is able to crawl the page, so make sure the page is not blocked in robots.txt.
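For reference, a minimal sketch of the two approaches looks like this.

In the page’s <head>:
<meta name="robots" content="noindex">

Or as an HTTP response header sent by your server:
X-Robots-Tag: noindex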

Case Sensitivity and Trailing Slashes

The Problem: robots.txt is brutally literal.
Case: Disallow: /My-Folder/ will not block /my-folder/.
Trailing slashes: rules are prefix matches, so Disallow: /folder/ does not block the URL /folder (without the slash), while Disallow: /folder also matches anything else that starts with that string, such as /folder-name.
The Fix: Always match the exact case used in your URLs. To block a whole directory, include the trailing slash (/folder/) so the rule is unambiguous. If both the /folder and /folder/ versions of a URL exist on your site, you may need rules for both, or set up redirects so only one version survives.
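A minimal sketch (the folder names are placeholders): if both capitalizations of a directory exist on your server, give each its own rule:

User-agent: *
Disallow: /My-Folder/
Disallow: /my-folder/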

The Most Recent robots.txt Changes You Must Be Aware Of

The web is constantly evolving, and so are the rules of robots.txt. Here’s what’s new and what’s changed.

The noindex Directive is Dead

Google officially stopped supporting the noindex directive in robots.txt back in 2019. If you still have noindex: /some-page/ in the file, it is completely ignored. Remove those lines and use the meta name="robots" content="noindex" approach described above instead.

The Rise of the AI Crawlers

With the explosive popularity of large language models (LLMs), a new wave of bots is crawling the web to collect training data. If you don’t want your content used to train models such as ChatGPT or Google’s AI, you can block their crawlers.

GPTBot: OpenAI’s web crawler.

Google-Extended: the token Google uses to gather content for training its AI models, separate from Googlebot itself.

Here’s how to block them while still allowing normal search indexing:

# Block OpenAI's AI training crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI training crawler
User-agent: Google-Extended
Disallow: /

# Allow Google's normal search crawler
User-agent: Googlebot
Allow: /

How to Create a Perfect robots.txt File

Open a plain text editor: use Notepad (Windows), TextEdit (Mac), or a code editor such as VS Code. Don’t use Microsoft Word or Google Docs for this, since they can add formatting that breaks the file.

Define the User-agent: the User-agent line specifies which crawler the rules apply to. User-agent: * applies the rules to all bots; User-agent: Googlebot applies them only to Google’s crawler.

Set Your Disallow and Allow Rules:

Disallow: tells a bot not to crawl a URL path.
Allow: overrides a Disallow rule for a particular subfolder or file. Google follows the most specific matching rule: Allow: /wp-admin/admin-ajax.php is more specific than Disallow: /wp-admin/, so admin-ajax.php would still be crawled.
Include your sitemap: it is a good idea to add the full URL of your XML sitemap. This helps search engines find every page you want crawled.

Save and upload: save the file as robots.txt (all lowercase) and upload it to the root directory of your site, the main folder commonly named public_html or htdocs.

Sample robots.txt for a typical WordPress website:

User-agent: *

# Block crawling of the admin area and plugin files
Disallow: /wp-admin/
Disallow: /wp-content/plugins/

# Allow one important file inside the admin area
Allow: /wp-admin/admin-ajax.php

# Block theme and core include files
Disallow: /wp-content/themes/
Disallow: /wp-includes/

Sitemap: https://yourdomain.com/sitemap.xml

Final Check: Test Your robots.txt

Before you finish, always test your file. Go to the robots.txt testing tool in Google Search Console. You can paste in your file’s contents or test the live version. It will flag any errors and let you check individual URLs to see whether they are blocked or allowed.

The robots.txt file is a quiet workhorse. Give it the time and attention it deserves, and it will work tirelessly to improve your site’s relationship with search engines. Open the file and start taking control. You’ve got this.
