I once scraped a news website with 50,000 articles by crawling from the homepage. It took 6 hours and hammered their server with requests as I followed every link.
Then I discovered their sitemap. It listed all 50,000 article URLs in one XML file. I scraped the same site in 20 minutes by going directly to each article. No wasted requests, no link-following, just pure efficiency.
Sitemaps and robots.txt are gifts from websites to crawlers. Let me show you how to use them properly.
What Is a Sitemap?
A sitemap is a file that lists all important URLs on a website.
Think of it like:
- A table of contents for a book
- A directory of all pages
- A roadmap to the website's content
Benefits for scraping:
- Skip navigation and link-following
- Get all URLs instantly
- Find pages not linked anywhere
- See when pages were updated
- Know page priorities
What Is robots.txt?
A robots.txt file tells crawlers what they can and cannot access.
Location: always at the root of the domain, e.g. https://example.com/robots.txt
What it contains:
- Which pages crawlers can access
- Which pages are blocked
- Crawl delay recommendations
- Sitemap locations
Example robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
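Outside of Scrapy, Python's standard library can evaluate these rules for you. A minimal sketch with urllib.robotparser (URLs are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses the file

print(rp.can_fetch('*', 'https://example.com/private/page'))  # False for the example above
print(rp.crawl_delay('*'))                                     # 2 for the example above
print(rp.site_maps())  # ['https://example.com/sitemap.xml'] (Python 3.8+)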
Finding Sitemaps
Method 1: Check robots.txt
Most sites list sitemap location in robots.txt:
curl https://example.com/robots.txt
Look for lines like:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap_index.xml
Method 2: Common Locations
Try these URLs (a quick probe script is sketched after the list):
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/sitemap1.xml
https://example.com/wp-sitemap.xml (WordPress)
https://example.com/page-sitemap.xml (WordPress)
https://example.com/post-sitemap.xml (WordPress)
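If you'd rather not try these by hand, a short script can probe the common locations for you. A minimal sketch, assuming the requests library is installed (domain and paths are just the candidates above):

import requests

CANDIDATE_PATHS = [
    '/sitemap.xml',
    '/sitemap_index.xml',
    '/sitemap1.xml',
    '/wp-sitemap.xml',
    '/page-sitemap.xml',
    '/post-sitemap.xml',
]

def find_sitemaps(base_url):
    # Send a lightweight HEAD request to each candidate path and keep the ones that return 200
    found = []
    for path in CANDIDATE_PATHS:
        url = base_url.rstrip('/') + path
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code == 200:
            found.append(url)
    return found

print(find_sitemaps('https://example.com'))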
Method 3: Google Search
Search: site:example.com filetype:xml sitemap
Method 4: Check HTML Source
Some sites link to sitemaps in HTML:
<link rel="sitemap" type="application/xml" href="/sitemap.xml">
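Inside a Scrapy callback you could pick such a link up with an XPath; a small sketch (parse_sitemap here is a hypothetical callback of your own):

def parse(self, response):
    # look for a <link> tag whose rel attribute mentions "sitemap"
    sitemap_href = response.xpath('//link[contains(@rel, "sitemap")]/@href').get()
    if sitemap_href:
        yield response.follow(sitemap_href, callback=self.parse_sitemap)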
Types of Sitemaps
Type 1: Basic XML Sitemap
Simple list of URLs:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/page1</loc>
<lastmod>2024-01-15</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://example.com/page2</loc>
<lastmod>2024-01-14</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
</urlset>
Fields explained:
- loc: URL of the page (required)
- lastmod: last modified date (optional)
- changefreq: how often the page changes (optional)
- priority: importance from 0.0 to 1.0 (optional)
Type 2: Sitemap Index
Points to multiple sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2024-01-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2024-01-14</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-categories.xml</loc>
<lastmod>2024-01-10</lastmod>
</sitemap>
</sitemapindex>
Why use index:
- Organize by content type
- Split large sites (max 50,000 URLs per sitemap)
- Keep file sizes manageable (max 50MB)
Type 3: News Sitemap
For news articles:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>https://example.com/article1</loc>
<news:news>
<news:publication>
<news:name>Example News</news:name>
<news:language>en</news:language>
</news:publication>
<news:publication_date>2024-01-15T10:00:00Z</news:publication_date>
<news:title>Breaking News Title</news:title>
</news:news>
</url>
</urlset>
Extra fields:
- Publication name
- Publication date
- Article title
- Keywords
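Those fields live in the news: XML namespace, so you have to register it when selecting. A sketch using parsel, the selector library Scrapy is built on (the sitemap URL is a placeholder):

import requests
from parsel import Selector

NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

body = requests.get('https://example.com/news-sitemap.xml').text
sel = Selector(text=body, type='xml')

for url in sel.xpath('//sm:url', namespaces=NS):
    print({
        'loc': url.xpath('sm:loc/text()', namespaces=NS).get(),
        'title': url.xpath('.//news:title/text()', namespaces=NS).get(),
        'published': url.xpath('.//news:publication_date/text()', namespaces=NS).get(),
    })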
Type 4: Image Sitemap
For image-heavy sites:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>https://example.com/product1</loc>
<image:image>
<image:loc>https://example.com/images/product1.jpg</image:loc>
<image:caption>Product 1 main image</image:caption>
<image:title>Product 1</image:title>
</image:image>
<image:image>
<image:loc>https://example.com/images/product1-back.jpg</image:loc>
<image:caption>Product 1 back view</image:caption>
</image:image>
</url>
</urlset>
Type 5: Video Sitemap
For video content:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://example.com/video-page</loc>
<video:video>
<video:thumbnail_loc>https://example.com/thumb.jpg</video:thumbnail_loc>
<video:title>Video Title</video:title>
<video:description>Video description here</video:description>
<video:content_loc>https://example.com/video.mp4</video:content_loc>
<video:duration>600</video:duration>
<video:publication_date>2024-01-15T10:00:00Z</video:publication_date>
</video:video>
</url>
</urlset>
Type 6: Text Sitemap
Simple text file with one URL per line:
https://example.com/page1
https://example.com/page2
https://example.com/page3
Not as common, but some sites use it.
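Since it's plain text, there's nothing to parse. A minimal sketch with requests (URL is a placeholder):

import requests

resp = requests.get('https://example.com/sitemap.txt')
urls = [line.strip() for line in resp.text.splitlines() if line.strip()]
print(f'Found {len(urls)} URLs')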
Scrapy SitemapSpider
Scrapy has a built-in spider for sitemaps!
Basic SitemapSpider
from scrapy.spiders import SitemapSpider
class MySitemapSpider(SitemapSpider):
name = 'sitemap'
sitemap_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
# This gets called for every URL in sitemap
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
That's it! Scrapy handles everything:
- Downloads sitemap
- Parses XML
- Extracts URLs
- Makes requests
- Calls your parse method
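To try it outside a full project, save the spider to a file and run it with scrapy runspider, for example scrapy runspider sitemap_spider.py -o pages.json (the filename is whatever you saved it as); the -o flag writes every yielded item to a feed file.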
Following Sitemap Index
class MySitemapSpider(SitemapSpider):
name = 'sitemap'
sitemap_urls = ['https://example.com/sitemap_index.xml']
def parse(self, response):
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
Scrapy automatically:
- Detects sitemap index
- Downloads all child sitemaps
- Extracts all URLs
- Scrapes everything
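If the index lists many child sitemaps and you only need a few, the sitemap_follow attribute filters them by regex before they're downloaded. A small sketch (the pattern is an example based on the index above):

from scrapy.spiders import SitemapSpider

class ProductsOnlySpider(SitemapSpider):
    name = 'products_only'
    sitemap_urls = ['https://example.com/sitemap_index.xml']
    sitemap_follow = ['sitemap-products']  # only download child sitemaps whose URL matches

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('h1::text').get()}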
Filtering URLs with Rules
Only scrape specific URLs:
class ProductSitemapSpider(SitemapSpider):
name = 'products'
sitemap_urls = ['https://example.com/sitemap.xml']
sitemap_rules = [
('/product/', 'parse_product'), # URLs with /product/ use parse_product
('/category/', 'parse_category'), # URLs with /category/ use parse_category
]
def parse_product(self, response):
yield {
'type': 'product',
'name': response.css('.product-name::text').get(),
'price': response.css('.price::text').get()
}
def parse_category(self, response):
yield {
'type': 'category',
'name': response.css('h1::text').get()
}
Filtering by Regular Expression
sitemap_rules = [
(r'/product/\d+/', 'parse_product'), # /product/123/
(r'/blog/\d{4}/\d{2}/', 'parse_blog'), # /blog/2024/01/
]
Following Alternate Languages
Some sitemaps have alternate language versions:
<url>
<loc>https://example.com/page</loc>
<xhtml:link rel="alternate" hreflang="es" href="https://example.com/es/page"/>
<xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/page"/>
</url>
Scrape all languages:
class MultiLangSpider(SitemapSpider):
name = 'multilang'
sitemap_urls = ['https://example.com/sitemap.xml']
sitemap_follow = ['.*']
sitemap_alternate_links = True # Follow alternate language links
def parse(self, response):
yield {
'url': response.url,
'language': response.url.split('/')[3], # Extract language code
'title': response.css('h1::text').get()
}
Parsing Sitemaps Manually
Sometimes you need more control:
Download and Parse Sitemap
import scrapy
import xml.etree.ElementTree as ET
class ManualSitemapSpider(scrapy.Spider):
name = 'manual_sitemap'
start_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
# Parse XML
root = ET.fromstring(response.text)
# Define namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
# Extract all URLs
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
lastmod = url_element.find('sm:lastmod', ns)
lastmod_date = lastmod.text if lastmod is not None else None
# Make request to URL
yield scrapy.Request(
loc,
callback=self.parse_page,
meta={'lastmod': lastmod_date}
)
def parse_page(self, response):
yield {
'url': response.url,
'lastmod': response.meta['lastmod'],
'title': response.css('h1::text').get()
}
Handling Sitemap Index
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
# Check if it's a sitemap index
if root.tag == '{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex':
# It's an index, download child sitemaps
for sitemap in root.findall('sm:sitemap', ns):
sitemap_url = sitemap.find('sm:loc', ns).text
yield scrapy.Request(sitemap_url, callback=self.parse)
else:
# It's a regular sitemap, extract URLs
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
Using robots.txt
Respecting robots.txt
# settings.py
ROBOTSTXT_OBEY = True # Respect robots.txt rules
When enabled, Scrapy:
- Downloads robots.txt automatically
- Checks allowed/disallowed paths
- Won't scrape blocked URLs
One caveat: Scrapy does not apply the Crawl-delay directive from robots.txt on its own; set a delay yourself, as shown below.
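A minimal settings sketch for that, roughly matching the two-second Crawl-delay in the example robots.txt above (values are illustrative):

# settings.py
ROBOTSTXT_OBEY = True        # still skip disallowed paths
DOWNLOAD_DELAY = 2           # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # optionally let Scrapy adapt the delay to server load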
Parsing robots.txt for Sitemaps
import scrapy
class RobotsSitemapSpider(scrapy.Spider):
name = 'robots_sitemap'
start_urls = ['https://example.com/robots.txt']
def parse(self, response):
# Extract sitemap URLs from robots.txt
for line in response.text.split('\n'):
if line.lower().startswith('sitemap:'):
sitemap_url = line.split(':', 1)[1].strip()
self.logger.info(f'Found sitemap: {sitemap_url}')
# Download sitemap
yield scrapy.Request(sitemap_url, callback=self.parse_sitemap)
def parse_sitemap(self, response):
# Parse sitemap and extract URLs
import xml.etree.ElementTree as ET
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
def parse_page(self, response):
yield {
'url': response.url,
'title': response.css('h1::text').get()
}
Extracting Crawl-Delay
def parse(self, response):
crawl_delay = None
for line in response.text.split('\n'):
if line.lower().startswith('crawl-delay:'):
crawl_delay = float(line.split(':', 1)[1].strip())
break
if crawl_delay:
self.logger.info(f'robots.txt specifies crawl-delay: {crawl_delay}')
        # There is no engine.download_delay attribute; the supported knobs are the
        # DOWNLOAD_DELAY setting or the spider's download_delay attribute
        self.download_delay = crawl_delay
Smart Crawling Strategies
Strategy 1: Incremental Scraping
Only scrape updated content:
from datetime import datetime
from scrapy.spiders import SitemapSpider

class IncrementalSpider(SitemapSpider):
    name = 'incremental'
    sitemap_urls = ['https://example.com/sitemap.xml']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.last_scrape_date = self.load_last_scrape_date()

    def sitemap_filter(self, entries):
        # SitemapSpider does not pass lastmod into response.meta, so the
        # comparison has to happen here, before requests are scheduled
        for entry in entries:
            lastmod = entry.get('lastmod', '')
            if not lastmod or lastmod[:10] > self.last_scrape_date[:10]:
                yield entry
            else:
                self.logger.info(f"Skipping {entry['loc']} (not modified)")

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
            'scraped_at': datetime.now().isoformat()
        }

    def load_last_scrape_date(self):
        # Load from file or database
        try:
            with open('last_scrape.txt', 'r') as f:
                return f.read().strip()
        except FileNotFoundError:
            return '1970-01-01'  # Scrape everything
Strategy 2: Priority-Based Scraping
Use sitemap priority:
import xml.etree.ElementTree as ET
class PrioritySpider(scrapy.Spider):
name = 'priority'
start_urls = ['https://example.com/sitemap.xml']
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
priority = url_element.find('sm:priority', ns)
# Convert priority to Scrapy priority (higher = sooner)
if priority is not None:
scrapy_priority = int(float(priority.text) * 100)
else:
scrapy_priority = 50 # Default
yield scrapy.Request(
loc,
callback=self.parse_page,
priority=scrapy_priority
)
def parse_page(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Strategy 3: Filtering by Change Frequency
def parse(self, response):
root = ET.fromstring(response.text)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
changefreq = url_element.find('sm:changefreq', ns)
# Only scrape pages that change frequently
if changefreq is not None and changefreq.text in ['always', 'hourly', 'daily']:
yield scrapy.Request(loc, callback=self.parse_page)
else:
self.logger.info(f'Skipping {loc} (changes {changefreq.text if changefreq is not None else "unknown"})')
Handling Large Sitemaps
Compressed Sitemaps (.gz)
Many sites compress sitemaps:
import gzip
import xml.etree.ElementTree as ET
class GzipSitemapSpider(scrapy.Spider):
name = 'gzip_sitemap'
start_urls = ['https://example.com/sitemap.xml.gz']
def parse(self, response):
# Decompress gzip
decompressed = gzip.decompress(response.body)
# Parse XML
root = ET.fromstring(decompressed)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
for url_element in root.findall('sm:url', ns):
loc = url_element.find('sm:loc', ns).text
yield scrapy.Request(loc, callback=self.parse_page)
def parse_page(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Good news: SitemapSpider handles .gz automatically!
Paginated Sitemaps
Some sites split sitemaps by date or number:
sitemap-2024-01.xml
sitemap-2024-02.xml
sitemap-2024-03.xml
from datetime import datetime, timedelta
from scrapy.spiders import SitemapSpider

class PaginatedSitemapSpider(SitemapSpider):
    name = 'paginated'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build sitemap URLs for the last 6 months; SitemapSpider's own
        # start_requests will download and parse each one for us
        current_date = datetime.now()
        self.sitemap_urls = [
            f'https://example.com/sitemap-{d.year}-{d.month:02d}.xml'
            for d in (current_date - timedelta(days=30 * i) for i in range(6))
        ]
def parse(self, response):
yield {'url': response.url, 'title': response.css('h1::text').get()}
Complete Real-World Example
Combining everything:
import scrapy
from scrapy.spiders import SitemapSpider
from datetime import datetime, timedelta
class ProductionSitemapSpider(SitemapSpider):
name = 'production_sitemap'
    # Point sitemap_urls at robots.txt; SitemapSpider extracts the sitemap locations from it
    sitemap_urls = ['https://example.com/robots.txt']
sitemap_rules = [
(r'/product/', 'parse_product'),
(r'/blog/', 'parse_blog'),
]
# Follow alternate languages
sitemap_alternate_links = True
custom_settings = {
'ROBOTSTXT_OBEY': True,
'DOWNLOAD_DELAY': 1,
'CONCURRENT_REQUESTS': 8
}
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
        self.last_scrape_date = self.load_last_scrape_date()
        self.lastmod_by_url = {}  # filled in sitemap_filter, read by the callbacks
self.stats = {
'products': 0,
'blogs': 0,
'skipped': 0
}
    def sitemap_filter(self, entries):
        # Skip entries that have not changed since the last scrape and
        # remember each URL's lastmod so the callbacks can report it
        for entry in entries:
            lastmod = entry.get('lastmod')
            if lastmod and lastmod[:10] < self.last_scrape_date[:10]:
                self.stats['skipped'] += 1
                continue
            self.lastmod_by_url[entry['loc']] = lastmod
            yield entry
    def parse_product(self, response):
        # Unmodified pages were already dropped in sitemap_filter;
        # look up the lastmod recorded there for this URL
        lastmod = self.lastmod_by_url.get(response.url)
        self.stats['products'] += 1
yield {
'type': 'product',
'url': response.url,
'name': response.css('.product-name::text').get(),
'price': response.css('.price::text').get(),
'stock': response.css('.stock::text').get(),
'lastmod': lastmod,
'scraped_at': datetime.now().isoformat()
}
    def parse_blog(self, response):
        lastmod = self.lastmod_by_url.get(response.url)
        self.stats['blogs'] += 1
yield {
'type': 'blog',
'url': response.url,
'title': response.css('h1::text').get(),
'author': response.css('.author::text').get(),
'date': response.css('.date::text').get(),
'lastmod': lastmod
}
def load_last_scrape_date(self):
try:
with open('last_scrape.txt', 'r') as f:
return f.read().strip()
except FileNotFoundError:
return '1970-01-01'
def closed(self, reason):
self.logger.info('='*60)
self.logger.info('SCRAPING STATISTICS')
self.logger.info(f'Products: {self.stats["products"]}')
self.logger.info(f'Blogs: {self.stats["blogs"]}')
self.logger.info(f'Skipped (not modified): {self.stats["skipped"]}')
self.logger.info('='*60)
# Save current scrape date
with open('last_scrape.txt', 'w') as f:
f.write(datetime.now().isoformat())
Common Mistakes
Mistake #1: Not Checking for Sitemap Index
# BAD: assumes it's a regular sitemap
for url in root.findall('sm:url', ns):
    ...  # finds nothing if the file is actually a sitemap index!

# GOOD: check the root tag first
if root.tag.endswith('sitemapindex'):
    ...  # handle sitemap index
else:
    ...  # handle regular sitemap
Mistake #2: Ignoring lastmod
# BAD: Scrapes everything every time
yield scrapy.Request(url)
# GOOD: Check if updated
if lastmod > last_scrape_date:
yield scrapy.Request(url)
Mistake #3: Not Respecting robots.txt
# BAD
ROBOTSTXT_OBEY = False # Ignoring website rules
# GOOD
ROBOTSTXT_OBEY = True # Be a good citizen
Summary
Sitemaps are goldmines:
- List all URLs on site
- Include metadata (lastmod, priority)
- Much faster than crawling
- Find hidden pages
Types of sitemaps:
- XML (most common)
- Sitemap index (multiple sitemaps)
- News, image, video (specialized)
- Text (rare)
robots.txt:
- Lists sitemap locations
- Defines crawl rules
- Specifies crawl-delay
- Always check it first
Best practices:
- Use SitemapSpider when possible
- Check robots.txt for sitemaps
- Respect crawl-delay
- Use lastmod for incremental scraping
- Filter by URL patterns
- Handle sitemap indexes
- Handle compressed (.gz) sitemaps
Remember:
- Always check robots.txt first
- Sitemaps save massive time
- Use lastmod for efficiency
- Respect crawl-delay
- Not all sites have sitemaps
Start every scraping project by checking for sitemaps. They're the fastest path to the data you want!
Happy scraping! 🕷️