Before scraping any website, check its robots.txt. The file spells out which paths the site asks crawlers to avoid, and it often reveals information the site never meant to advertise. It always sits at the root of the domain:
https://example.com/robots.txt
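Because the file always lives at the root of the host, you can derive its URL from any page URL with the standard library alone. A minimal sketch (example.com and the page path are placeholders):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the host root, regardless of how deep the page is
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post-42"))
# https://example.com/robots.txt
```

The query string and fragment are deliberately dropped; only scheme and host carry over.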
What robots.txt Reveals
- Disallowed paths = hidden content. When a site blocks /admin/, /staging/, or /api/v2/, it is confirming those paths exist.
- Sitemap location. Most robots.txt files include a line like Sitemap: https://example.com/sitemap.xml, a complete index of the site's URLs.
- Crawl-delay. How many seconds the site wants bots to wait between requests. Respect this.
- Bot-specific rules. Some sites block GPTBot, Google-Extended, or CCBot specifically, revealing their policies on AI crawlers.
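A small scanner can pull all four of these fields out of a robots.txt body. This is a sketch using plain string parsing, no third-party libraries; the returned dict keys are my own naming, not a standard:

```python
def scan_robots(text: str) -> dict:
    """Collect recon-relevant fields from a robots.txt body."""
    info = {"disallow": [], "sitemaps": [], "crawl_delay": None, "blocked_agents": []}
    agent = "*"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = value
        elif field == "disallow" and value:
            info["disallow"].append(value)
            # "Disallow: /" under a named agent means that bot is fully blocked
            if value == "/" and agent != "*":
                info["blocked_agents"].append(agent)
        elif field == "sitemap":
            info["sitemaps"].append(value)
        elif field == "crawl-delay":
            info["crawl_delay"] = float(value)
    return info
```

Run against the example below, this reports the admin and internal-API paths, the sitemap URL, the 2-second delay, and GPTBot as a fully blocked agent.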
Example
    User-agent: *
    Disallow: /admin/
    Disallow: /api/internal/
    Crawl-delay: 2
    Sitemap: https://example.com/sitemap.xml

    User-agent: GPTBot
    Disallow: /
This tells you: the site has an admin panel and an internal API, it wants 2 seconds between requests, and it blocks OpenAI's GPTBot from all content.
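You can check these rules programmatically with Python's standard-library urllib.robotparser, feeding it the example above as text (my-crawler is a made-up user agent for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A generic bot may read the blog but not the admin panel
print(rp.can_fetch("my-crawler", "https://example.com/blog/"))    # True
print(rp.can_fetch("my-crawler", "https://example.com/admin/x"))  # False

# GPTBot is shut out of everything
print(rp.can_fetch("GPTBot", "https://example.com/"))             # False

print(rp.crawl_delay("my-crawler"))  # 2
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```

In production you would call `rp.set_url(...)` and `rp.read()` instead of `parse()`, and sleep for `crawl_delay()` seconds between requests. Note that `site_maps()` requires Python 3.8+.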
Tools
- Robots.txt Analyzer — parse and analyze any robots.txt
- Sitemap Scraper — extract all URLs from sitemaps
- SEO Audit Tool — comprehensive technical SEO
All 77 tools: Apify Store
Custom SEO audit — $20: Order via Payoneer