robots.txt is a plain text file at the root of your domain that tells search engine crawlers which URLs they can and cannot request. It's not a security mechanism (it's a suggestion, not a block), but it's a critical tool for managing how search engines interact with your site.
The syntax
```
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
```
User-agent: Which crawler the rules apply to. * means all crawlers. Specific agents include Googlebot, Bingbot, GPTBot, Bytespider.
Disallow: Paths the crawler should not request. /admin/ blocks everything under /admin/. / blocks the entire site. A bare Disallow: with no value blocks nothing.
Allow: Overrides a Disallow for specific paths. Useful for allowing a subdirectory within a blocked directory.
Sitemap: Points crawlers to your XML sitemap. Not all crawlers use this, but Google does.
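You can sanity-check rules like these with Python's built-in urllib.robotparser. One caveat worth hedging on: Python's parser applies rules in file order (first match wins), while Google picks the longest matching path, so the Allow line is listed first in this sketch to keep both readings in agreement. The URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Allow listed before Disallow so Python's first-match-wins logic
# agrees with Google's longest-path-wins logic.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /admin/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/login"))      # -> False
print(rp.can_fetch("*", "https://example.com/api/public/docs"))  # -> True
```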
Common mistakes
Blocking CSS and JavaScript. Disallow: /assets/ or Disallow: /*.css$ prevents Googlebot from rendering your pages. Google needs to download your CSS and JS to understand your page layout, especially for JavaScript-rendered content. Blocking these files can hurt your search rankings because Google can't evaluate the rendered page.
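One way to catch this mistake before it ships is to run candidate rules through a parser and probe an asset URL. A minimal sketch with the standard library (the /assets/ path is a placeholder; note that urllib.robotparser treats * and $ literally, so test plain path prefixes):

```python
from urllib.robotparser import RobotFileParser

# A rule set that mistakenly blocks the asset directory.
rules = """\
User-agent: *
Disallow: /assets/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot falls back to the * group, so stylesheets and scripts
# under /assets/ are off-limits and pages can't be fully rendered.
print(rp.can_fetch("Googlebot", "https://example.com/assets/app.css"))  # -> False
```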
Blocking everything during development and forgetting to fix it. Disallow: / is the nuclear option. It tells all crawlers to ignore your entire site. This is appropriate for staging environments but catastrophic if it makes it to production. Many sites have lost their entire search index because someone pushed a staging robots.txt to production.
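A cheap safeguard is a deploy-time check that the production robots.txt still permits crawling. This is a sketch, not a prescribed setup; the probe URL and function name are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def site_is_crawlable(robots_txt: str, probe_url: str = "https://example.com/") -> bool:
    """Return True if the rules allow all crawlers to fetch the probe URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", probe_url)

staging_rules = "User-agent: *\nDisallow: /\n"
assert not site_is_crawlable(staging_rules)  # a CI gate would fail this deploy
```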
Putting sensitive paths in robots.txt. If you have /admin-secret-dashboard/ and put it in Disallow, you've just published the path to every attacker who reads your robots.txt. robots.txt is public. Everyone can read it. Don't use it to "hide" pages. Use authentication.
Not specifying a sitemap. The Sitemap directive is the simplest way to help crawlers discover your pages. Without it, crawlers rely entirely on following links, which may miss orphaned pages.
Crawl budget management
Large sites (millions of pages) need to manage crawl budget: the number of URLs Googlebot is willing and able to crawl on your site in a given period. robots.txt is one tool for this:
- Block faceted navigation URLs that create infinite parameter combinations
- Block internal search result pages
- Block sorted/filtered versions of the same page
- Block paginated archives beyond a reasonable depth
A site with 50 million product pages, each with 20 sort/filter variations, exposes a billion URLs. Google won't crawl them all. Blocking the variations in robots.txt focuses the crawl budget on the canonical pages.
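As a sketch, such a rule set might look like this (the parameter names sort and filter are placeholders for whatever your faceted navigation actually uses; Googlebot supports * and $ wildcards in paths, though not every crawler does):

```
User-agent: *
Disallow: /search
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*page=

Sitemap: https://example.com/sitemap.xml
```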
The Crawl-delay directive
Some crawlers support Crawl-delay: 10, which requests a 10-second pause between requests. Googlebot ignores this (use Google Search Console to set crawl rate instead). Bingbot respects it. For smaller servers that struggle under crawl load, this directive helps.
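If you write your own polite crawler, Python's standard-library parser exposes the directive via crawl_delay(). A minimal sketch (the user-agent string is illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: bingbot
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler sleeps this many seconds between requests.
print(rp.crawl_delay("bingbot"))  # -> 10
```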
I built a robots.txt generator at zovo.one/free-tools/robots-txt-generator that creates properly formatted directives based on your site structure. It includes presets for common CMS platforms, handles multiple user agents, and validates the output against the specification.
I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.