robots.txt is a plain text file at the root of your domain that tells search engine crawlers which URLs they can and cannot request. It's not a security mechanism (it's a suggestion, not a block), but it's a critical tool for managing how search engines interact with your site.
The syntax
```
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
```
User-agent: Which crawler the rules apply to. * means all crawlers. Specific agents include Googlebot, Bingbot, GPTBot, Bytespider.
Disallow: Paths the crawler should not request. /admin/ blocks everything under /admin/. / blocks the entire site. A bare Disallow: with no value blocks nothing.
Allow: Overrides a Disallow for specific paths. Useful for allowing a subdirectory within a blocked directory.
Sitemap: Points crawlers to your XML sitemap. Not all crawlers use this, but Google does.
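You can sanity-check rules like these with Python's built-in urllib.robotparser. One caveat worth hedging on: Python's parser applies rules in file order (first match wins), while Google picks the longest matching path, so the Allow line is listed first in this sketch to keep both readings in agreement. The URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Allow listed before Disallow so Python's first-match-wins logic
# agrees with Google's longest-path-wins logic.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /admin/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/login"))      # -> False
print(rp.can_fetch("*", "https://example.com/api/public/docs"))  # -> True
```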
Common mistakes
Blocking CSS and JavaScript. Disallow: /assets/ or Disallow: /*.css$ prevents Googlebot from rendering your pages. Google needs to download your CSS and JS to understand your page layout, especially for JavaScript-rendered content. Blocking these files can hurt your search rankings because Google can't evaluate the rendered page.
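One way to catch this mistake before it ships is to run candidate rules through a parser and probe an asset URL. A minimal sketch with the standard library (the /assets/ path is a placeholder; note that urllib.robotparser treats * and $ literally, so test plain path prefixes):

```python
from urllib.robotparser import RobotFileParser

# A rule set that mistakenly blocks the asset directory.
rules = """\
User-agent: *
Disallow: /assets/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot falls back to the * group, so stylesheets and scripts
# under /assets/ are off-limits and pages can't be fully rendered.
print(rp.can_fetch("Googlebot", "https://example.com/assets/app.css"))  # -> False
```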
Blocking everything during development and forgetting to fix it. Disallow: / is the nuclear option. It tells all crawlers to ignore your entire site. This is appropriate for staging environments but catastrophic if it makes it to production. Many sites have lost their entire search index because someone pushed a staging robots.txt to production.
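A cheap safeguard is a deploy-time check that the production robots.txt still permits crawling. This is a sketch, not a prescribed setup; the probe URL and function name are hypothetical:

```python
from urllib.robotparser import RobotFileParser

def site_is_crawlable(robots_txt: str, probe_url: str = "https://example.com/") -> bool:
    """Return True if the rules allow all crawlers to fetch the probe URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", probe_url)

staging_rules = "User-agent: *\nDisallow: /\n"
assert not site_is_crawlable(staging_rules)  # a CI gate would fail this deploy
```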
Putting sensitive paths in robots.txt. If you have /admin-secret-dashboard/ and put it in Disallow, you've just published the path to every attacker who reads your robots.txt. robots.txt is public. Everyone can read it. Don't use it to "hide" pages. Use authentication.
Not specifying a sitemap. The Sitemap directive is the simplest way to help crawlers discover your pages. Without it, crawlers rely entirely on following links, which may miss orphaned pages.
Crawl budget management
Large sites (millions of pages) need to manage crawl budget: the number of URLs Googlebot is willing and able to crawl on your site in a given period. robots.txt is one tool for this:
- Block faceted navigation URLs that create infinite parameter combinations
- Block internal search result pages
- Block sorted/filtered versions of the same page
- Block paginated archives beyond a reasonable depth
A site with 50 million product pages, each with 20 sort/filter variations, exposes a billion URLs. Google won't crawl them all. Blocking the variations in robots.txt focuses the crawl budget on the canonical pages.
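As a sketch, such a rule set might look like this (the parameter names sort and filter are placeholders for whatever your faceted navigation actually uses; Googlebot supports * and $ wildcards in paths, though not every crawler does):

```
User-agent: *
Disallow: /search
Disallow: /*?*sort=
Disallow: /*?*filter=
Disallow: /*?*page=

Sitemap: https://example.com/sitemap.xml
```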
The Crawl-delay directive
Some crawlers support Crawl-delay: 10, which requests a 10-second pause between requests. Googlebot ignores this (use Google Search Console to set crawl rate instead). Bingbot respects it. For smaller servers that struggle under crawl load, this directive helps.
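If you write your own polite crawler, Python's standard-library parser exposes the directive via crawl_delay(). A minimal sketch (the user-agent string is illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: bingbot
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler sleeps this many seconds between requests.
print(rp.crawl_delay("bingbot"))  # -> 10
```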
I built a robots.txt generator at zovo.one/free-tools/robots-txt-generator that creates properly formatted directives based on your site structure. It includes presets for common CMS platforms, handles multiple user agents, and validates the output against the specification.
I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.