I made every mistake on this list. Every single one.
My first scraper got my IP banned in 10 minutes. My second one crashed after scraping 5 pages. My third one worked perfectly on my laptop but broke the next day when the website changed.
I learned web scraping the hard way. You don't have to.
Let me show you the 10 most common mistakes beginners make, so you can avoid them from day one.
Mistake #1: Ignoring robots.txt
What This Means
Every website has a file called robots.txt that tells crawlers what they can and can't scrape.
Find it here:
https://example.com/robots.txt
Just add /robots.txt to any website.
The Mistake
You start scraping without checking this file.
Example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
This says: "Don't scrape these folders."
If you scrape them anyway, you're being rude (and might get blocked).
How to Fix It
1. Always check robots.txt first:
curl https://example.com/robots.txt
Or just visit it in your browser.
2. Respect what it says:
If it says Disallow: /admin/, don't scrape that section.
3. In Scrapy, enable it:
# settings.py
ROBOTSTXT_OBEY = True
Scrapy will automatically check and obey robots.txt.
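Outside Scrapy, you can check robots.txt programmatically with Python's built-in urllib.robotparser; a minimal sketch (example.com is just a placeholder):
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://example.com/robots.txt')
robots.read()

# Returns True only if the rules allow this user agent to fetch the URL
if robots.can_fetch('*', 'https://example.com/products'):
    print('Allowed to scrape')
else:
    print('Disallowed, skip this URL')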
Real Example
Bad:
# Just scraping everything
start_urls = [
    'https://example.com/products',
    'https://example.com/admin',  # Might be disallowed!
]
Good:
# Check robots.txt first, then only scrape allowed pages
start_urls = [
    'https://example.com/products',  # Allowed
]
Mistake #2: Not Using Proper Headers
What This Means
When your scraper visits a website, it sends information about itself called "headers."
Without proper headers, you look like a robot. Websites block robots.
The Mistake
You send requests without a User-Agent header.
What websites see:
User-Agent: python-requests/2.28.0
This screams "I'm a bot!"
How to Fix It
Add a User-Agent that looks like a real browser:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com', headers=headers)
In Scrapy:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
Better: Add More Headers
Real browsers send more than just User-Agent:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}
This looks much more like a real browser.
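If you're using Scrapy, you can set these once for every request instead of repeating them; a sketch using the USER_AGENT and DEFAULT_REQUEST_HEADERS settings:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}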
Real Example
Bad:
# No headers, obvious bot
response = requests.get('https://example.com')
Good:
# Looks like a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'}
response = requests.get('https://example.com', headers=headers)
Mistake #3: Using Bad Selectors
What This Means
Selectors are how you find elements on a page (like finding a specific button or product name).
Bad selectors break when the website changes even slightly.
The Mistake
Using absolute XPath:
# Breaks if website changes anything
title = response.xpath('/html/body/div[1]/div[2]/div[3]/h1/text()').get()
This says: "Go to body, then first div, then second div, then third div, then h1."
If the website adds ONE div anywhere, this breaks!
How to Fix It
Use relative selectors based on classes or IDs:
# Better, looks for class name
title = response.css('.product-title::text').get()
# Or ID
title = response.css('#product-name::text').get()
Even better, use multiple options:
# Try class first, then ID, then tag
title = (
    response.css('.product-title::text').get() or
    response.css('#product-name::text').get() or
    response.css('h1::text').get()
)
Real Example
Bad:
# Absolute XPath, very fragile
price = response.xpath('/html/body/div[1]/div[2]/span[1]/text()').get()
Good:
# CSS class, much more stable
price = response.css('.price::text').get()
Better:
# Multiple fallbacks
price = (
    response.css('.price::text').get() or
    response.css('.product-price::text').get() or
    response.css('[data-price]::text').get()
)
Mistake #4: Missing Pagination
What This Means
Many websites split content across multiple pages (page 1, page 2, page 3, etc.).
If you only scrape page 1, you miss most of the data!
The Mistake
def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # FORGOT TO GO TO NEXT PAGE!
You scrape page 1 and stop. Next page never gets scraped.
How to Fix It
Find and follow the "Next" button:
def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
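If the page shows several pagination links at once (numbered pages), Scrapy 2.0+ also has response.follow_all, which queues every matching link in one call; a rough sketch (the .pagination selector is just an assumption about the page's markup):
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Queue every pagination link; Scrapy's duplicate filter
    # skips pages that were already requested
    yield from response.follow_all(css='.pagination a', callback=self.parse)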
Real Example
Bad:
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # Only scrapes page 1!
Good:
def parse(self, response):
    # Scrape this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Go to next page
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
    # Now scrapes ALL pages!
Mistake #5: No Error Handling
What This Means
Things go wrong. Websites go down. Internet disconnects. Elements don't exist.
Without error handling, your scraper crashes and loses all progress.
The Mistake
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title,
        'price': float(price)  # CRASHES if price is None!
    }
If price doesn't exist, float(None) crashes your spider.
How to Fix It
Check if things exist before using them:
def parse(self, response):
    title = response.css('h1::text').get()
    price_text = response.css('.price::text').get()

    # Check before converting
    if price_text:
        try:
            price = float(price_text.replace('$', ''))
        except ValueError:
            price = None
    else:
        price = None

    yield {
        'title': title or 'Unknown',
        'price': price
    }
In Scrapy, use errback:
def start_requests(self):
    urls = ['https://example.com']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.error_handler)

def error_handler(self, failure):
    self.logger.error(f'Request failed: {failure}')
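Scrapy also retries failed requests on its own; the defaults are sensible, but you can tune them in settings.py (the values shown are examples, not requirements):
# settings.py
RETRY_ENABLED = True  # on by default
RETRY_TIMES = 3       # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]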
Real Example
Bad:
price = float(response.css('.price::text').get()) # Crashes if None
Good:
price_text = response.css('.price::text').get()
if price_text:
    try:
        price = float(price_text.replace('$', ''))
    except ValueError:
        price = None
else:
    price = None
Mistake #6: Scraping Too Fast
What This Means
You send 100 requests per second. The website thinks you're attacking it. It blocks you.
The Mistake
# Hammering the server
for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    # No delay, sends all 1000 requests immediately!
This is like knocking on someone's door 1000 times in one second. Rude!
How to Fix It
Add delays between requests:
import time
import requests
for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    time.sleep(1)  # Wait 1 second between requests
In Scrapy (even better):
# settings.py
DOWNLOAD_DELAY = 1 # Wait 1 second between requests
Scrapy handles this automatically!
Or randomize the delay:
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Random delay between 0.5*DELAY and 1.5*DELAY
Real Example
Bad:
# 1000 requests instantly
for url in urls:
    requests.get(url)  # BOOM BOOM BOOM
Good:
# Polite scraping
for url in urls:
    requests.get(url)
    time.sleep(1)  # Wait 1 second
Better (Scrapy):
# settings.py
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8 # Only 8 at a time
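Scrapy can also adjust the delay on the fly based on how quickly the server responds. A minimal AutoThrottle setup (the values are just reasonable starting points):
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # cap the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average requests in flight per server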
Mistake #7: Using requests on JavaScript Sites
What This Means
Some websites load content with JavaScript. The HTML you get with requests is empty!
The Mistake
response = requests.get('https://modern-spa-site.com')
soup = BeautifulSoup(response.text, 'html.parser')
products = soup.find_all('div', class_='product') # Returns nothing!
You look at the page in a browser and see products, but requests sees nothing because the JavaScript hasn't run yet.
How to Check
View page source in browser:
- Right-click page
- Click "View Page Source"
- Press Ctrl+F and search for the data you want
If you can't find it in the source, it's loaded with JavaScript!
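You can also check from Python itself: fetch the page with plain requests and search the raw HTML for something you can see in the browser. A quick sketch (the URL and the 'Blue Widget' text are placeholders):
import requests

response = requests.get('https://example.com/products')
if 'Blue Widget' in response.text:
    print('Data is in the static HTML, plain requests is enough')
else:
    print('Data is probably loaded by JavaScript, use a browser tool')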
How to Fix It
Option 1: Use Selenium or Playwright
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://modern-spa-site.com')

# Wait for JavaScript to load
time.sleep(3)

# Now you can get the content
html = driver.page_source
driver.quit()
Option 2: Use Scrapy-Playwright
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
# scrapy-playwright also needs Scrapy to run on the asyncio reactor
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
# In spider
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={'playwright': True}
    )
Option 3: Find the API
Sometimes it's easier to find the hidden API that loads the data (see the sketch after these steps):
- Open browser DevTools (F12)
- Go to Network tab
- Reload page
- Look for XHR/Fetch requests
- Find JSON data
- Scrape the API directly!
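Here's roughly what that looks like once you've copied the endpoint from the Network tab. Everything below (the URL, the results list, the name and price fields) is hypothetical; your API will use its own names:
import requests

# Endpoint copied from the XHR/Fetch request in DevTools (hypothetical)
url = 'https://example.com/api/products?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

data = requests.get(url, headers=headers).json()
for product in data['results']:  # field names are assumptions
    print(product['name'], product['price'])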
Real Example
Bad:
# Won't work on JavaScript sites
response = requests.get('https://react-site.com')
data = response.text # Empty or incomplete!
Good:
# Use browser automation
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://react-site.com')
    page.wait_for_selector('.product')  # Wait for content
    html = page.content()  # Now has data!
    browser.close()
Mistake #8: Assuming Elements Always Exist
What This Means
Just because an element exists on one page doesn't mean it exists on all pages.
The Mistake
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title.strip(),  # CRASHES if title is None!
        'price': price
    }
If title doesn't exist, None.strip() crashes.
How to Fix It
Always check before using:
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    # Check before using
    if title:
        title = title.strip()
    if price:
        price = price.strip()

    yield {
        'title': title or 'No title',
        'price': price or 'No price'
    }
Or use get() with a default:
# Gets first item or returns empty string
title = response.css('h1::text').get(default='')
Real Example
Bad:
name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper(),    # Crashes if name is None
    'email': email.lower()   # Crashes if email is None
}
Good:
name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper() if name else None,
    'email': email.lower() if email else None
}
Mistake #9: Mixing Everything Together
What This Means
You put scraping, cleaning, and saving all in one messy function.
The Mistake
def parse(self, response):
    for product in response.css('.product'):
        # Scraping
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Cleaning
        name = name.strip() if name else ''
        price = price.replace('$', '').replace(',', '') if price else '0'
        price = float(price)

        # Validating
        if price < 0:
            price = 0
        if len(name) > 100:
            name = name[:100]

        # Saving
        with open('products.csv', 'a') as f:
            f.write(f'{name},{price}\n')
This is messy and hard to maintain!
How to Fix It
Separate concerns:
1. Scraping (in spider):
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
2. Cleaning (in pipeline):
class CleaningPipeline:
    def process_item(self, item, spider):
        # Clean name
        if item.get('name'):
            item['name'] = item['name'].strip()
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('$', '')
        return item
3. Validating (in another pipeline):
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Validate
        if not item.get('name'):
            raise DropItem('Missing name')
        return item
4. Saving (in yet another pipeline):
class SavePipeline:
    def process_item(self, item, spider):
        # Save to file/database
        return item
Much cleaner!
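Pipelines only run if you register them in settings.py. The module path depends on your project name (myproject below is an assumption), and the numbers set the order they run in (lowest first):
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 100,
    'myproject.pipelines.ValidationPipeline': 200,
    'myproject.pipelines.SavePipeline': 300,
}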
Real Example
Bad:
# Everything in one place
def parse(self, response):
    title = response.css('h1::text').get()
    title = title.strip().upper()[:50]  # Scraping + cleaning together
    if title and len(title) > 3:  # Validation
        save_to_db(title)  # Saving
    yield {'title': title}
Good:
from scrapy.exceptions import DropItem

# Spider: Just scrape
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# Pipeline: Clean
class CleanPipeline:
    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'].strip().upper()[:50]
        return item

# Pipeline: Validate
class ValidatePipeline:
    def process_item(self, item, spider):
        if not item.get('title') or len(item['title']) < 3:
            raise DropItem('Invalid title')
        return item
Mistake #10: No Logging or Monitoring
What This Means
Your spider runs, but you have no idea what's happening. Did it work? Did it fail? How many items scraped?
The Mistake
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # No logging, no idea what happened!
Spider finishes. You have no idea if it worked.
How to Fix It
Add logging:
def parse(self, response):
    # Scrapy spiders come with self.logger built in
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products on {response.url}')

    for product in products:
        name = product.css('h2::text').get()
        if name:
            self.logger.debug(f'Scraped: {name}')
            yield {'name': name}
        else:
            self.logger.warning('Product has no name!')
Enable file logging:
# settings.py
LOG_FILE = 'spider.log'
LOG_LEVEL = 'INFO'
Track stats:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    self.logger.info(f'Scraped {stats.get("item_scraped_count", 0)} items')
    self.logger.info(f'Failed {stats.get("downloader/exception_count", 0)} requests')
Real Example
Bad:
# No idea what's happening
def parse(self, response):
for item in response.css('.item'):
yield {'data': item.css('::text').get()}
Good:
def parse(self, response):
    items = response.css('.item')
    self.logger.info(f'Processing {len(items)} items from {response.url}')

    for item in items:
        data = item.css('::text').get()
        if data:
            yield {'data': data}
        else:
            self.logger.warning('Empty item found')

    self.logger.info(f'Finished processing {response.url}')
Quick Checklist
Before running your scraper, check:
- [ ] Read robots.txt
- [ ] Added User-Agent header
- [ ] Using CSS selectors (not absolute XPath)
- [ ] Following pagination/next pages
- [ ] Added error handling
- [ ] Set download delay (at least 1 second)
- [ ] Checked if site uses JavaScript
- [ ] Checking that elements exist before using them
- [ ] Separated scraping from cleaning
- [ ] Enabled logging
If you check all these, you're ahead of 90% of beginners!
Summary
The 10 mistakes:
- Ignoring robots.txt → Always check and obey it
- No headers → Add User-Agent and other headers
- Bad selectors → Use CSS classes, not absolute XPath
- Missing pagination → Follow next page links
- No error handling → Check if elements exist
- Scraping too fast → Add delays between requests
- Using requests on JS sites → Use Playwright/Selenium
- Assuming elements exist → Always check before using
- Mixing everything → Separate scraping, cleaning, saving
- No logging → Log what's happening
Remember:
- Be polite (delays, robots.txt)
- Be prepared (error handling, checking)
- Be organized (separate concerns)
- Be informed (logging, monitoring)
Fix these 10 mistakes and your scrapers will be more reliable, respectful, and professional.
Happy scraping! 🕷️