I once scraped a news website with 50,000 articles by crawling from the homepage. It took 6 hours and hammered their server with requests as I followed every link.
Then I discovered their sitemap. It listed all 50,000 article URLs in one XML file. I scraped the same site in 20 minutes by going directly to each article. No wasted requests, no link-following, just pure efficiency.
Sitemaps and robots.txt are gifts from websites to crawlers. Let me show you how to use them properly.
What Is a Sitemap?
A sitemap is a file that lists all important URLs on a website.
Think of it like:
- A table of contents for a book
- A directory of all pages
- A roadmap to the website's content
Benefits for scraping:
- Skip navigation and link-following
- Get all URLs instantly
- Find pages not linked anywhere
- See when pages were updated
- Know page priorities
What Is robots.txt?
A robots.txt file tells crawlers what they can and cannot access.
Location: always at the root of the domain, e.g. https://example.com/robots.txt
What it contains:
- Which pages crawlers can access
- Which pages are blocked
- Crawl delay recommendations
- Sitemap locations
Example robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
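Outside of Scrapy, Python's standard library can evaluate these rules for you. A minimal sketch with urllib.robotparser (URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses the file

print(rp.can_fetch('*', 'https://example.com/private/page'))  # False for the example above
print(rp.crawl_delay('*'))                                     # 2 for the example above
print(rp.site_maps())  # ['https://example.com/sitemap.xml'] (Python 3.8+)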
Finding Sitemaps
Method 1: Check robots.txt
Most sites list sitemap location in robots.txt:
curl https://example.com/robots.txt
Look for lines like:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Method 2: Common Locations
Try these URLs (a quick probe script is sketched after the list):
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/sitemap1.xml
https://example.com/wp-sitemap.xml (WordPress)
https://example.com/page-sitemap.xml (WordPress)
https://example.com/post-sitemap.xml (WordPress)
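If you'd rather not try these by hand, a short script can probe the common locations for you. A minimal sketch, assuming the requests library is installed (domain and paths are just the candidates above):

import requests

CANDIDATE_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemap1.xml',
    '/wp-sitemap.xml',
    '/page-sitemap.xml',
    '/post-sitemap.xml',
]

def find_sitemaps(base_url):
    # Send a lightweight HEAD request to each candidate path and keep the ones that return 200
    found = []
    for path in CANDIDATE_PATHS:
        url = base_url.rstrip('/') + path
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code == 200:
            found.append(url)
    return found

print(find_sitemaps('https://example.com'))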
Method 3: Google Search
Search: site:example.com filetype:xml sitemap
Method 4: Check HTML Source
Some sites link to sitemaps in HTML:
<link rel="sitemap" type="application/xml" href="/sitemap.xml">
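Inside a Scrapy callback you could pick such a link up with an XPath; a small sketch (parse_sitemap here is a hypothetical callback of your own):

def parse(self, response):
    # look for a <link> tag whose rel attribute mentions "sitemap"
    sitemap_href = response.xpath('//link[contains(@rel, "sitemap")]/@href').get()
    if sitemap_href:
        yield response.follow(sitemap_href, callback=self.parse_sitemap)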
Types of Sitemaps
Type 1: Basic XML Sitemap
Simple list of URLs:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/page1</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://example.com/page2</loc>
<lastmod>2024-01-14</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
</urlset>
Fields explained:
- loc: URL of the page (required)
- lastmod: last modified date (optional)
- changefreq: how often the page changes (optional)
- priority: importance from 0.0 to 1.0 (optional)
Type 2: Sitemap Index
Points to multiple sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2024-01-14</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-categories.xml</loc>
<lastmod>2024-01-10</lastmod>
</sitemap>
</sitemapindex>
Why use index:
- Organize by content type
- Split large sites (max 50,000 URLs per sitemap)
- Keep file sizes manageable (max 50MB)
Type 3: News Sitemap
For news articles:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>https://example.com/article1</loc>
<news:news>
<news:publication>
<news:name>Example News</news:name>
<news:language>en</news:language>
</news:publication>
<news:publication_date>2024-01-15T10:00:00Z</news:publication_date>
<news:title>Breaking News Title</news:title>
</news:news>
</url>
</urlset>
Extra fields:
- Publication name
- Publication date
- Article title
- Keywords
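Those fields live in the news: XML namespace, so you have to register it when selecting. A sketch using parsel, the selector library Scrapy is built on (the sitemap URL is a placeholder):

import requests
from parsel import Selector

NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

body = requests.get('https://example.com/news-sitemap.xml').text
sel = Selector(text=body, type='xml')

for url in sel.xpath('//sm:url', namespaces=NS):
    print({
        'loc': url.xpath('sm:loc/text()', namespaces=NS).get(),
        'title': url.xpath('.//news:title/text()', namespaces=NS).get(),
        'published': url.xpath('.//news:publication_date/text()', namespaces=NS).get(),
    })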
Type 4: Image Sitemap
For image-heavy sites:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://example.com/product1</loc>
<image:image>
<image:loc>https://example.com/images/product1.jpg</image:loc>
<image:caption>Product 1 main image</image:caption>
<image:title>Product 1</image:title>
</image:image>
<image:image>
<image:loc>https://example.com/images/product1-back.jpg</image:loc>
<image:caption>Product 1 back view</image:caption>
</image:image>
</url>
</urlset>
Type 5: Video Sitemap
For video content:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://example.com/video-page</loc>
<video:video>
<video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
<video:title>Video Title</video:title>
<video:description>Video description here</video:description>
<video:content_loc>https://example.com/video.mp4</video:content_loc>
<video:duration>600</video:duration>
<video:publication_date>2024-01-15T10:00:00Z</video:publication_date>
</video:video>
</url>
</urlset>
Type 6: Text Sitemap
Simple text file with one URL per line:
https://example.com/page1
https://example.com/page2
https://example.com/page3
Not as common, but some sites use it.
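Since it's plain text, there's nothing to parse. A minimal sketch with requests (URL is a placeholder):

import requests

resp = requests.get('https://example.com/sitemap.txt')
urls = [line.strip() for line in resp.text.splitlines() if line.strip()]
print(f'Found {len(urls)} URLs')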
Scrapy SitemapSpider
Scrapy has a built-in spider for sitemaps!
Basic SitemapSpider
from scrapy.spiders import SitemapSpider
class MySitemapSpider(SitemapSpider):
name = 'sitemap'
sitemap_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
# This gets called for every URL in sitemap
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
That's it! Scrapy handles everything:
- Downloads sitemap
- Parses XML
- Extracts URLs
- Makes requests
- Calls your parse method
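To try it outside a full project, save the spider to a file and run it with scrapy runspider, for example scrapy runspider sitemap_spider.py -o pages.json (the filename is whatever you saved it as); the -o flag writes every yielded item to a feed file.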
Following Sitemap Index
class MySitemapSpider(SitemapSpider):
name = 'sitemap'
sitemap_urls = ['https://example.com/sitemap_index.xml']
def parse(self, response):
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
Scrapy automatically:
- Detects sitemap index
- Downloads all child sitemaps
- Extracts all URLs
- Scrapes everything
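If the index lists many child sitemaps and you only need a few, the sitemap_follow attribute filters them by regex before they're downloaded. A small sketch (the pattern is an example based on the index above):

from scrapy.spiders import SitemapSpider

class ProductsOnlySpider(SitemapSpider):
    name = 'products_only'
    sitemap_urls = ['https://example.com/sitemap_index.xml']
    sitemap_follow = ['sitemap-products']  # only download child sitemaps whose URL matches

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('h1::text').get()}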
Filtering URLs with Rules
Only scrape specific URLs:
class ProductSitemapSpider(SitemapSpider):
name = 'products'
sitemap_urls = ['https://example.com/sitemap.xml']
sitemap_rules = [
('/product/', 'parse_product'), # URLs with /product/ use parse_product
('/category/', 'parse_category'), # URLs with /category/ use parse_category
]
def parse_product(self, response):
yield {
'type': 'product',
'name': response.css('.product-name::text').get(),
'price': response.css('.price::text').get()
}
def parse_category(self, response):
yield {
'type': 'category',
'name': response.css('h1::text').get()
}
Filtering by Regular Expression
sitemap_rules = [
(r'/product/\d+/', 'parse_product'), # /product/123/
(r'/blog/\d{4}/\d{2}/', 'parse_blog'), # /blog/2024/01/
]
Following Alternate Languages
Some sitemaps have alternate language versions:
<url>
<loc>https://example.com/page</loc>
<xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/page"/>
<xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page"/>
</url>
Scrape all languages:
class MultiLangSpider(SitemapSpider):
name = 'multilang'
sitemap_urls = ['https://example.com/sitemap.xml']
sitemap_follow = ['.*']
sitemap_alternate_links = True # Follow alternate language links
def parse(self, response):
yield {
'url': response.url,
'language': response.url.split('/')[3], # Extract language code
'title': response.css('h1::text').get()
}
Parsing Sitemaps Manually
Sometimes you need more control:
Download and Parse Sitemap
import scrapy
import xml.etree.ElementTree as ET
class ManualSitemapSpider(scrapy.Spider):
name = 'manual_sitemap'
start_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
# Parse XML
root = ET.fromstring(response.text)
# Define namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
# Extract all URLs
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
lastmod = url_element.find('sm:lastmod', ns)
lastmod_date = lastmod.text if lastmod is not None else None
# Make request to URL
yield scrapy.Request(
loc,
callback=self.parse_page,
meta={'lastmod': lastmod_date}
)
def parse_page(self, response):
yield {
'url': response.url,
'lastmod': response.meta['lastmod'],
'title': response.css('h1::text').get()
}
Handling Sitemap Index
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
# Check if it's a sitemap index
if root.tag == '{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex':
# It's an index, download child sitemaps
for sitemap in root.findall('sm:sitemap', ns):
sitemap_url = sitemap.find('sm:loc', ns).text
yield scrapy.Request(sitemap_url, callback=self.parse)
else:
# It's a regular sitemap, extract URLs
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
Using robots.txt
Respecting robots.txt
# settings.py
ROBOTSTXT_OBEY = True # Respect robots.txt rules
When enabled, Scrapy:
- Downloads robots.txt automatically
- Checks allowed/disallowed paths
- Won't scrape blocked URLs
One caveat: Scrapy does not apply the Crawl-delay directive from robots.txt on its own; set a delay yourself, as shown below.
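A minimal settings sketch for that, roughly matching the two-second Crawl-delay in the example robots.txt above (values are illustrative):

# settings.py
ROBOTSTXT_OBEY = True        # still skip disallowed paths
DOWNLOAD_DELAY = 2           # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # optionally let Scrapy adapt the delay to server load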
Parsing robots.txt for Sitemaps
import scrapy
class RobotsSitemapSpider(scrapy.Spider):
name = 'robots_sitemap'
start_urls = ['https://example.com/robots.txt']
def parse(self, response):
# Extract sitemap URLs from robots.txt
for line in response.text.split('\n'):
if line.lower().startswith('sitemap:'):
sitemap_url = line.split(':', 1)[1].strip()
self.logger.info(f'Found sitemap: {sitemap_url}')
# Download sitemap
yield scrapy.Request(sitemap_url, callback=self.parse_sitemap)
def parse_sitemap(self, response):
# Parse sitemap and extract URLs
import xml.etree.ElementTree as ET
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
def parse_page(self, response):
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
Extracting Crawl-Delay
def parse(self, response):
crawl_delay = None
for line in response.text.split('\n'):
if line.lower().startswith('crawl-delay:'):
crawl_delay = float(line.split(':', 1)[1].strip())
break
if crawl_delay:
self.logger.info(f'robots.txt specifies crawl-delay: {crawl_delay}')
        # There is no engine.download_delay attribute; the supported knobs are the
        # DOWNLOAD_DELAY setting or the spider's download_delay attribute
        self.download_delay = crawl_delay
Smart Crawling Strategies
Strategy 1: Incremental Scraping
Only scrape updated content:
from datetime import datetime
from scrapy.spiders import SitemapSpider

class IncrementalSpider(SitemapSpider):
    name = 'incremental'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.last_scrape_date = self.load_last_scrape_date()

    def sitemap_filter(self, entries):
        # SitemapSpider does not pass lastmod into response.meta, so the
        # comparison has to happen here, before requests are scheduled
        for entry in entries:
            lastmod = entry.get('lastmod', '')
            if not lastmod or lastmod[:10] > self.last_scrape_date[:10]:
                yield entry
            else:
                self.logger.info(f"Skipping {entry['loc']} (not modified)")

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
            'scraped_at': datetime.now().isoformat()
        }

    def load_last_scrape_date(self):
        # Load from file or database
        try:
            with open('last_scrape.txt', 'r') as f:
                return f.read().strip()
        except FileNotFoundError:
            return '1970-01-01'  # Scrape everything
Strategy 2: Priority-Based Scraping
Use sitemap priority:
import xml.etree.ElementTree as ET
class PrioritySpider(scrapy.Spider):
name = 'priority'
start_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
priority = url_element.find('sm:priority', ns)
# Convert priority to Scrapy priority (higher = sooner)
if priority is not None:
scrapy_priority = int(float(priority.text) * 100)
else:
scrapy_priority = 50 # Default
yield scrapy.Request(
loc,
callback=self.parse_page,
priority=scrapy_priority
)
def parse_page(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Strategy 3: Filtering by Change Frequency
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
changefreq = url_element.find('sm:changefreq', ns)
# Only scrape pages that change frequently
if changefreq is not None and changefreq.text in ['always', 'hourly', 'daily']:
yield scrapy.Request(loc, callback=self.parse_page)
else:
self.logger.info(f'Skipping {loc} (changes {changefreq.text if changefreq is not None else "unknown"})')
Handling Large Sitemaps
Compressed Sitemaps (.gz)
Many sites compress sitemaps:
import gzip
import xml.etree.ElementTree as ET
class GzipSitemapSpider(scrapy.Spider):
name = 'gzip_sitemap'
start_urls = ['https://example.com/sitemap.xml.gz']
def parse(self, response):
# Decompress gzip
decompressed = gzip.decompress(response.body)
# Parse XML
root = ET.fromstring(decompressed)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
def parse_page(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Good news: SitemapSpider handles .gz automatically!
Paginated Sitemaps
Some sites split sitemaps by date or number:
sitemap-2024-01.xml
sitemap-2024-02.xml
sitemap-2024-03.xml
from datetime import datetime, timedelta
from scrapy.spiders import SitemapSpider

class PaginatedSitemapSpider(SitemapSpider):
    name = 'paginated'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build sitemap URLs for the last 6 months; SitemapSpider's own
        # start_requests will download and parse each one for us
        current_date = datetime.now()
        self.sitemap_urls = [
            f'https://example.com/sitemap-{d.year}-{d.month:02d}.xml'
            for d in (current_date - timedelta(days=30 * i) for i in range(6))
        ]
def parse(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Complete Real-World Example
Combining everything:
import scrapy
from scrapy.spiders import SitemapSpider
from datetime import datetime, timedelta
class ProductionSitemapSpider(SitemapSpider):
name = 'production_sitemap'
    # Point sitemap_urls at robots.txt; SitemapSpider extracts the sitemap locations from it
    sitemap_urls = ['https://example.com/robots.txt']
sitemap_rules = [
(r'/product/', 'parse_product'),
(r'/blog/', 'parse_blog'),
]
# Follow alternate languages
sitemap_alternate_links = True
custom_settings = {
'ROBOTSTXT_OBEY': True,
'DOWNLOAD_DELAY': 1,
'CONCURRENT_REQUESTS': 8
}
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
        self.last_scrape_date = self.load_last_scrape_date()
        self.lastmod_by_url = {}  # filled in sitemap_filter, read by the callbacks
self.stats = {
'products': 0,
'blogs': 0,
'skipped': 0
}
    def sitemap_filter(self, entries):
        # Skip entries that have not changed since the last scrape and
        # remember each URL's lastmod so the callbacks can report it
        for entry in entries:
            lastmod = entry.get('lastmod')
            if lastmod and lastmod[:10] < self.last_scrape_date[:10]:
                self.stats['skipped'] += 1
                continue
            self.lastmod_by_url[entry['loc']] = lastmod
            yield entry
    def parse_product(self, response):
        # Unmodified pages were already dropped in sitemap_filter;
        # look up the lastmod recorded there for this URL
        lastmod = self.lastmod_by_url.get(response.url)
        self.stats['products'] += 1
yield {
'type': 'product',
'url': response.url,
'name': response.css('.product-name::text').get(),
'price': response.css('.price::text').get(),
'stock': response.css('.stock::text').get(),
'lastmod': lastmod,
'scraped_at': datetime.now().isoformat()
}
    def parse_blog(self, response):
        lastmod = self.lastmod_by_url.get(response.url)
        self.stats['blogs'] += 1
yield {
'type': 'blog',
'url': response.url,
'title': response.css('h1::text').get(),
'author': response.css('.author::text').get(),
'date': response.css('.date::text').get(),
'lastmod': lastmod
}
def load_last_scrape_date(self):
try:
with open('last_scrape.txt', 'r') as f:
return f.read().strip()
except FileNotFoundError:
return '1970-01-01'
def closed(self, reason):
self.logger.info('='*60)
self.logger.info('SCRAPING STATISTICS')
self.logger.info(f'Products: {self.stats["products"]}')
self.logger.info(f'Blogs: {self.stats["blogs"]}')
self.logger.info(f'Skipped (not modified): {self.stats["skipped"]}')
self.logger.info('='*60)
# Save current scrape date
with open('last_scrape.txt', 'w') as f:
f.write(datetime.now().isoformat())
Common Mistakes
Mistake #1: Not Checking for Sitemap Index
# BAD: assumes it's a regular sitemap
for url in root.findall('sm:url', ns):
    ...  # finds nothing if the file is actually a sitemap index!

# GOOD: check the root tag first
if root.tag.endswith('sitemapindex'):
    ...  # handle sitemap index
else:
    ...  # handle regular sitemap
Mistake #2: Ignoring lastmod
# BAD: Scrapes everything every time
yield scrapy.Request(url)
# GOOD: Check if updated
if lastmod > last_scrape_date:
yield scrapy.Request(url)
Mistake #3: Not Respecting robots.txt
# BAD
ROBOTSTXT_OBEY = False # Ignoring website rules
# GOOD
ROBOTSTXT_OBEY = True # Be a good citizen
Summary
Sitemaps are goldmines:
- List all URLs on site
- Include metadata (lastmod, priority)
- Much faster than crawling
- Find hidden pages
Types of sitemaps:
- XML (most common)
- Sitemap index (multiple sitemaps)
- News, image, video (specialized)
- Text (rare)
robots.txt:
- Lists sitemap locations
- Defines crawl rules
- Specifies crawl-delay
- Always check it first
Best practices:
- Use SitemapSpider when possible
- Check robots.txt for sitemaps
- Respect crawl-delay
- Use lastmod for incremental scraping
- Filter by URL patterns
- Handle sitemap indexes
- Handle compressed (.gz) sitemaps
Remember:
- Always check robots.txt first
- Sitemaps save massive time
- Use lastmod for efficiency
- Respect crawl-delay
- Not all sites have sitemaps
Start every scraping project by checking for sitemaps. They're the fastest path to the data you want!
Happy scraping! 🕷️