How to Write a Robots.txt File That Works With Your XML Sitemap

Robots.txt and sitemaps are separate files that do related jobs. Robots.txt tells crawlers what they can and cannot access. A sitemap tells them what you want indexed. When these two files contradict each other, you end up with disallowed URLs appearing in the sitemap, which confuses crawlers and wastes crawl budget.

Getting both files right is not complicated, but it does require knowing which rules apply where. This guide walks through the robots.txt file format, common configuration mistakes, and how to use a visual generator to avoid the most frequent errors.


What Robots.txt Does (and Does Not Do)

A robots.txt file sits at the root of your domain and provides crawling directives to well-behaved bots. The two most common directives are User-agent (which crawler the rule applies to) and Disallow (which paths are off-limits for that crawler).

Crucially, robots.txt is not a security mechanism. It relies on crawlers choosing to comply. Malicious bots routinely ignore it. It is also not a way to prevent pages from being indexed. A page can be indexed without ever being crawled directly, through links from other indexed pages. If you want a page excluded from the index, use a noindex meta tag on the page itself.

What robots.txt does well is limit crawl scope. If you have a large staging area, admin interface, or pagination set you do not want crawlers touching, robots.txt is the appropriate tool.

The Sitemap Directive in Robots.txt

One of the most overlooked features of robots.txt is the Sitemap directive. This lets you point crawlers directly to your sitemap file without requiring them to discover it elsewhere.

User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml

This single line removes any guesswork about where your sitemap lives. Google, Bing, and most other crawlers support this directive. It is separate from submitting the sitemap through search consoles, but both are worth doing. The directive gives automatic discovery; the console submission gives you monitoring and error reporting.

Common Robots.txt Mistakes

Disallowing Pages That Are in Your Sitemap

This is the most damaging inconsistency. If your sitemap lists a URL and your robots.txt blocks the crawler from accessing it, the crawler cannot verify the page's content. Different crawlers handle this differently. Some will skip the page. Some will attempt to index it based on limited information. Neither outcome is good.

Before finalizing your robots.txt, cross-check your disallow rules against your sitemap URL list. Any URL that appears in the sitemap should not be blocked in robots.txt.
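
For illustration (the paths and domain here are placeholders), a file like the following would contradict a sitemap that lists individual blog posts:

# The sitemap lists https://yourdomain.com/blog/widget-guide,
# but this rule blocks crawlers from fetching anything under /blog/:
User-agent: *
Disallow: /blog/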

Blocking CSS and JavaScript

A common legacy mistake is blocking static assets from crawlers to reduce server load. Blocking CSS and JavaScript prevents Googlebot from rendering your pages accurately. Google has consistently recommended allowing crawlers access to all assets used to render the page. An inaccurate render leads to inaccurate indexing.
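
A safer pattern, sketched below with illustrative paths, is to keep asset directories crawlable and block only the subpaths that genuinely need blocking:

# A blanket rule like "Disallow: /assets/" can stop Googlebot from rendering pages.
# Keep CSS and JavaScript crawlable; block only what truly needs blocking.
User-agent: *
Allow: /assets/
Disallow: /assets/private/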

Using Wildcards Incorrectly

Disallow paths match by prefix. Disallow: /search blocks /search, /search/, /search-results, and any other path that starts with /search. If you intended only to block /search/ and its subpaths, use Disallow: /search/. The * wildcard matches any sequence of characters anywhere in the path, and Google and Bing also support $ to anchor a pattern to the end of the URL.
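
The patterns below illustrate the difference (the sessionid parameter and the .pdf rule are hypothetical examples):

User-agent: *
# Prefix match: blocks /search, /search/, /search-results, /searchable-index, ...
Disallow: /search
# Directory only: blocks /search/ and everything beneath it
Disallow: /search/
# * matches any sequence of characters; $ anchors the end of the URL (Google/Bing extensions)
Disallow: /*?sessionid=
Disallow: /*.pdf$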

Overlapping wildcard rules can also interact unexpectedly. Test your rules with the robots.txt tester in Google Search Console before deploying.

Step-by-Step: Building a Robots.txt File

The following steps apply to any site, from a simple blog to a large e-commerce catalog.

Step 1: List what you want to block. Start by identifying paths that crawlers should not follow: admin areas, API endpoints, duplicate content paths, staging subfolders, and internal search result pages. Be specific. The goal is to limit crawl scope, not to hide content.

Step 2: Decide on user-agent scoping. Most robots.txt files use a catch-all User-agent: * block. If you want rules that apply only to Googlebot or only to Bingbot, create separate blocks. Be careful not to accidentally restrict a crawler you want to allow.
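
A sketch of separate blocks (the Bingbot rules are purely illustrative):

# Applies to every crawler that has no more specific block
User-agent: *
Disallow: /admin/

# Applies only to Bingbot. A crawler follows its most specific matching block
# and ignores the * block, so shared rules must be repeated here.
User-agent: Bingbot
Disallow: /admin/
Disallow: /internal-search/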

Step 3: Use the visual generator. The Robots.txt Generator provides presets for WordPress, Next.js, Laravel, and Shopify that cover the most common starting configurations. From there, add your specific disallow rules. The validator checks for syntax errors and flags paths that would block commonly crawled assets.

Step 4: Add the Sitemap directive. Before saving, add the Sitemap line pointing to your sitemap.xml or sitemap_index.xml. This is often forgotten because it is not a disallow rule. By convention it goes at the bottom of the file, outside the user-agent blocks, though crawlers accept it anywhere because it is independent of any user-agent group.
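
Putting the steps together, a minimal finished file might look like this (domain and paths are placeholders):

User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap_index.xml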

Step 5: Validate and cross-check. Run the robots.txt through the Google Search Console tester after deployment. Test a few URLs that should be blocked and a few that should be allowed. Also open your sitemap and verify that none of the listed URLs fall under a disallow rule.


The Robots.txt and Sitemap Relationship in Practice

Think of robots.txt as the gate and your sitemap as the directory. The gate controls who can enter and where. The directory tells visitors what is inside. A gate that blocks a wing of the building is fine, but listing that wing in the directory while the gate is locked is a contradiction.

For most sites, the cleanest setup is:

  • Sitemap contains only canonical, indexable URLs
  • Robots.txt blocks non-canonical paths, admin areas, and dynamic parameters
  • No URL appears in both the sitemap and a Disallow rule

For a deeper look at building and submitting the sitemap side of this relationship, the guide on XML Sitemap Generator: How to Build and Submit Sitemaps That Search Engines Actually Use covers URL validation, sitemap index splitting, and submission workflow in full.

Platform-Specific Considerations

Different platforms generate robots.txt files differently, and the generated files do not always match what you actually need.

WordPress, for example, serves a virtual robots.txt that cannot be directly edited in the file system. Changes require a plugin or a filter hook. The default WordPress robots.txt blocks /wp-admin/ while allowing admin-ajax.php, which is a sensible baseline, but it does not block dynamic query strings like /?s= (search results) or /?p= (numeric post IDs) that many sites want excluded.
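
A hedged sketch of a WordPress-oriented file (the sitemap URL assumes the core WordPress sitemap; plugins such as Yoast use a different filename):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
# Block internal search results and raw query-string search URLs
Disallow: /?s=
Disallow: /*?s=

Sitemap: https://yourdomain.com/wp-sitemap.xml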

Next.js projects often need robots.txt to allow all paths by default while blocking /_next/static and other internal build directories. The Robots.txt Generator includes a Next.js preset that handles this correctly.

Shopify has its own managed robots.txt with limited edit access. In Online Store 2.0 themes, you can add a robots.txt.liquid template to customize it. Be careful not to block Shopify's own crawlable product and collection paths.

Checking Your Work Against the Protocol

The official robots.txt protocol is documented at robotstxt.org. Google has additional documentation on how Googlebot interprets robots.txt at developers.google.com, including specifics on wildcard handling and which directives it ignores (crawl-delay among them).

After deployment, Google Search Console provides a live robots.txt tester and will report crawl errors caused by blocked paths. If you are seeing coverage drops in Search Console after a robots.txt change, the tester is the first place to look.

EvvyTools includes the Robots.txt Generator as part of a suite of dev and technical site tools. The generator is free to use and works without requiring login for the core generation flow.

Summary

A correct robots.txt file requires only a few things: accurate disallow rules, no contradiction with your sitemap, assets left accessible for rendering, and the Sitemap directive pointing to your sitemap file. The complexity comes from knowing what to block, not from the format itself. Use the visual generator to catch syntax errors, use the platform presets to start from a sensible baseline, and always validate after deployment.
