Before scraping any website, check its robots.txt. The file spells out which paths the site asks crawlers to avoid, and it often reveals information the site never meant to advertise. It always sits at the root of the domain:
https://example.com/robots.txt
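Because the file always lives at the root of the host, you can derive its URL from any page URL with the standard library alone. A minimal sketch (example.com and the page path are placeholders):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # robots.txt sits at the host root, regardless of how deep the page is
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/blog/post-42"))
# https://example.com/robots.txt
```

The query string and fragment are deliberately dropped; only scheme and host carry over.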
What robots.txt Reveals
- Disallowed paths = hidden content. When a site blocks /admin/, /staging/, or /api/v2/, it is confirming those paths exist.
- Sitemap location. Most robots.txt files include a line like Sitemap: https://example.com/sitemap.xml, a complete index of the site's URLs.
- Crawl-delay. How many seconds the site wants bots to wait between requests. Respect this.
- Bot-specific rules. Some sites block GPTBot, Google-Extended, or CCBot specifically, revealing their policies on AI crawlers.
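A small scanner can pull all four of these fields out of a robots.txt body. This is a sketch using plain string parsing, no third-party libraries; the returned dict keys are my own naming, not a standard:

```python
def scan_robots(text: str) -> dict:
    """Collect recon-relevant fields from a robots.txt body."""
    info = {"disallow": [], "sitemaps": [], "crawl_delay": None, "blocked_agents": []}
    agent = "*"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = value
        elif field == "disallow" and value:
            info["disallow"].append(value)
            # "Disallow: /" under a named agent means that bot is fully blocked
            if value == "/" and agent != "*":
                info["blocked_agents"].append(agent)
        elif field == "sitemap":
            info["sitemaps"].append(value)
        elif field == "crawl-delay":
            info["crawl_delay"] = float(value)
    return info
```

Run against the example below, this reports the admin and internal-API paths, the sitemap URL, the 2-second delay, and GPTBot as a fully blocked agent.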
Example
    User-agent: *
    Disallow: /admin/
    Disallow: /api/internal/
    Crawl-delay: 2
    Sitemap: https://example.com/sitemap.xml

    User-agent: GPTBot
    Disallow: /
This tells you: the site has an admin panel and an internal API, it wants 2 seconds between requests, and it blocks OpenAI's GPTBot from all content.
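You can check these rules programmatically with Python's standard-library urllib.robotparser, feeding it the example above as text (my-crawler is a made-up user agent for illustration):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A generic bot may read the blog but not the admin panel
print(rp.can_fetch("my-crawler", "https://example.com/blog/"))    # True
print(rp.can_fetch("my-crawler", "https://example.com/admin/x"))  # False

# GPTBot is shut out of everything
print(rp.can_fetch("GPTBot", "https://example.com/"))             # False

print(rp.crawl_delay("my-crawler"))  # 2
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```

In production you would call `rp.set_url(...)` and `rp.read()` instead of `parse()`, and sleep for `crawl_delay()` seconds between requests. Note that `site_maps()` requires Python 3.8+.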
Tools
- Robots.txt Analyzer — parse and analyze any robots.txt
- Sitemap Scraper — extract all URLs from sitemaps
- SEO Audit Tool — comprehensive technical SEO
All 77 tools: Apify Store
Custom SEO audit — $20: Order via Payoneer