Muhammad Ikramullah Khan

10 Web Scraping Mistakes Beginners Make (And How to Fix Them)

I made every mistake on this list. Every single one.

My first scraper got my IP banned in 10 minutes. My second one crashed after scraping 5 pages. My third one worked perfectly on my laptop but broke the next day when the website changed.

I learned web scraping the hard way. You don't have to.

Let me show you the 10 most common mistakes beginners make, so you can avoid them from day one.


Mistake #1: Ignoring robots.txt

What This Means

Most websites have a file called robots.txt that tells crawlers what they can and can't scrape.

Find it here:

https://example.com/robots.txt

Just add /robots.txt to any website.

The Mistake

You start scraping without checking this file.

Example:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/

This says: "Don't scrape these folders."

If you scrape them anyway, you're being rude (and might get blocked).

How to Fix It

1. Always check robots.txt first:

curl https://example.com/robots.txt

Or just visit it in your browser.

2. Respect what it says:

If it says Disallow: /admin/, don't scrape that section.

3. In Scrapy, enable it:

# settings.py
ROBOTSTXT_OBEY = True

Scrapy will automatically check and obey robots.txt.
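
If you're using plain requests instead of Scrapy, Python's built-in urllib.robotparser can do the check for you. A minimal sketch (the URLs are just placeholders):

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Ask before you scrape
if rp.can_fetch('*', 'https://example.com/products'):
    print('Allowed, scrape away')
else:
    print('Disallowed, skip this URL')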

Real Example

Bad:

# Just scraping everything
start_urls = [
    'https://example.com/products',
    'https://example.com/admin',  # Might be disallowed!
]

Good:

# Check robots.txt first, then only scrape allowed pages
start_urls = [
    'https://example.com/products',  # Allowed
]

Mistake #2: Not Using Proper Headers

What This Means

When your scraper visits a website, it sends information about itself called "headers."

Without proper headers, you look like a robot. Websites block robots.

The Mistake

You send requests without a User-Agent header.

What websites see:

User-Agent: python-requests/2.28.0

This screams "I'm a bot!"

How to Fix It

Add a User-Agent that looks like a real browser:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get('https://example.com', headers=headers)

In Scrapy:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

Better: Add More Headers

Real browsers send more than just User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}

This looks much more like a real browser.

Real Example

Bad:

# No headers, obvious bot
response = requests.get('https://example.com')

Good:

# Looks like a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'}
response = requests.get('https://example.com', headers=headers)

Mistake #3: Using Bad Selectors

What This Means

Selectors are how you find elements on a page (like finding a specific button or product name).

Bad selectors break when the website changes even slightly.

The Mistake

Using absolute XPath:

# Breaks if website changes anything
title = response.xpath('/html/body/div[1]/div[2]/div[3]/h1/text()').get()

This says: "Go to body, then first div, then second div, then third div, then h1."

If the website adds a single div anywhere along that path, this breaks!

How to Fix It

Use relative selectors based on classes or IDs:

# Better, looks for class name
title = response.css('.product-title::text').get()

# Or ID
title = response.css('#product-name::text').get()

Even better, use multiple options:

# Try class first, then ID, then tag
title = (
    response.css('.product-title::text').get() or
    response.css('#product-name::text').get() or
    response.css('h1::text').get()
)

Real Example

Bad:

# Absolute XPath, very fragile
price = response.xpath('/html/body/div[1]/div[2]/span[1]/text()').get()

Good:

# CSS class, much more stable
price = response.css('.price::text').get()

Better:

# Multiple fallbacks
price = (
    response.css('.price::text').get() or
    response.css('.product-price::text').get() or
    response.css('[data-price]::text').get()
)

Mistake #4: Missing Pagination

What This Means

Many websites split content across multiple pages (page 1, page 2, page 3, etc.).

If you only scrape page 1, you miss 90% of the data!

The Mistake

def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # FORGOT TO GO TO NEXT PAGE!

You scrape page 1 and stop. Next page never gets scraped.

How to Fix It

Find and follow the "Next" button:

def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)

Real Example

Bad:

def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # Only scrapes page 1!

Good:

def parse(self, response):
    # Scrape this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Go to next page
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
    # Now scrapes ALL pages!

Mistake #5: No Error Handling

What This Means

Things go wrong. Websites go down. Internet disconnects. Elements don't exist.

Without error handling, your scraper crashes and loses all progress.

The Mistake

def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title,
        'price': float(price)  # CRASHES if price is None!
    }

If price doesn't exist, float(None) crashes your spider.

How to Fix It

Check if things exist before using them:

def parse(self, response):
    title = response.css('h1::text').get()
    price_text = response.css('.price::text').get()

    # Check before converting
    if price_text:
        try:
            price = float(price_text.replace('$', ''))
        except ValueError:
            price = None
    else:
        price = None

    yield {
        'title': title or 'Unknown',
        'price': price
    }

In Scrapy, use errback:

def start_requests(self):
    urls = ['https://example.com']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.error_handler)

def error_handler(self, failure):
    self.logger.error(f'Request failed: {failure}')

Real Example

Bad:

price = float(response.css('.price::text').get())  # Crashes if None

Good:

price_text = response.css('.price::text').get()
if price_text:
    try:
        price = float(price_text.replace('$', ''))
    except ValueError:
        price = None
else:
    price = None

Mistake #6: Scraping Too Fast

What This Means

You send 100 requests per second. The website thinks you're attacking it. It blocks you.

The Mistake

# Hammering the server
for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    # No delay between requests, hits the server as fast as possible!

This is like knocking on someone's door 1000 times in one second. Rude!

How to Fix It

Add delays between requests:

import time
import requests

for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    time.sleep(1)  # Wait 1 second between requests

In Scrapy (even better):

# settings.py
DOWNLOAD_DELAY = 1  # Wait 1 second between requests

Scrapy handles this automatically!

Or randomize the delay:

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # Random delay between 0.5*DELAY and 1.5*DELAY

Real Example

Bad:

# 1000 requests instantly
for url in urls:
    requests.get(url)  # BOOM BOOM BOOM

Good:

# Polite scraping
for url in urls:
    requests.get(url)
    time.sleep(1)  # Wait 1 second

Better (Scrapy):

# settings.py
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8  # Only 8 at a time
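
Scrapy also ships with an AutoThrottle extension that adjusts the delay based on how fast the server is responding. A minimal settings sketch:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1  # Initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10  # Back off up to 10 seconds if the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # Aim for about one request in flight at a time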

Mistake #7: Using requests on JavaScript Sites

What This Means

Some websites load content with JavaScript. The HTML you get with requests is missing the data you want!

The Mistake

import requests
from bs4 import BeautifulSoup

response = requests.get('https://modern-spa-site.com')
soup = BeautifulSoup(response.text, 'html.parser')
products = soup.find_all('div', class_='product')  # Returns nothing!

You look at the page in a browser and see products, but requests sees nothing because the JavaScript hasn't run yet.

How to Check

View page source in browser:

  • Right-click page
  • Click "View Page Source"
  • Press Ctrl+F and search for the data you want

If you can't find it in the source, it's loaded with JavaScript!
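
You can also check from code: fetch the page with requests and search the raw HTML for something you can see in the browser. A quick sketch (the URL and the search text are placeholders):

import requests

response = requests.get('https://example.com/products')

# If text you can see in the browser isn't in the raw HTML,
# the page is rendered by JavaScript
if 'Wireless Mouse' in response.text:
    print('Data is in the HTML, plain requests is enough')
else:
    print('Data is loaded by JavaScript, use a browser tool')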

How to Fix It

Option 1: Use Selenium or Playwright

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://modern-spa-site.com')

# Wait for JavaScript to load
time.sleep(3)

# Now you can get the content
html = driver.page_source

driver.quit()  # Close the browser when you're done

Option 2: Use Scrapy-Playwright

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# In spider
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={'playwright': True}
    )

Option 3: Find the API

Sometimes easier to find the hidden API that loads the data:

  1. Open browser DevTools (F12)
  2. Go to Network tab
  3. Reload page
  4. Look for XHR/Fetch requests
  5. Find JSON data
  6. Scrape the API directly!
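
Once you've spotted the endpoint in the Network tab, you can usually call it directly and get clean JSON back. A minimal sketch (this endpoint and its JSON structure are made up; use whatever you find in DevTools):

import requests

# Hypothetical endpoint found in the Network tab
api_url = 'https://example.com/api/products?page=1'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'}
response = requests.get(api_url, headers=headers)
data = response.json()  # Already structured, no HTML parsing needed

for product in data.get('products', []):
    print(product.get('name'), product.get('price'))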

Real Example

Bad:

# Won't work on JavaScript sites
response = requests.get('https://react-site.com')
data = response.text  # Empty or incomplete!

Good:

# Use browser automation
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://react-site.com')
    page.wait_for_selector('.product')  # Wait for content
    html = page.content()  # Now has data!

Mistake #8: Assuming Elements Always Exist

What This Means

Just because an element exists on one page doesn't mean it exists on all pages.

The Mistake

def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title.strip(),  # CRASHES if title is None!
        'price': price
    }

If title doesn't exist, None.strip() crashes.

How to Fix It

Always check before using:

def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    # Check before using
    if title:
        title = title.strip()

    if price:
        price = price.strip()

    yield {
        'title': title or 'No title',
        'price': price or 'No price'
    }

Or use get() with a default value:

# Gets first item or returns empty string
title = response.css('h1::text').get(default='')

Real Example

Bad:

name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper(),  # Crashes if name is None
    'email': email.lower()  # Crashes if email is None
}

Good:

name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper() if name else None,
    'email': email.lower() if email else None
}

Mistake #9: Mixing Everything Together

What This Means

You put scraping, cleaning, and saving all in one messy function.

The Mistake

def parse(self, response):
    for product in response.css('.product'):
        # Scraping
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Cleaning
        name = name.strip() if name else ''
        price = price.replace('$', '').replace(',', '') if price else '0'
        price = float(price)

        # Validating
        if price < 0:
            price = 0
        if len(name) > 100:
            name = name[:100]

        # Saving
        with open('products.csv', 'a') as f:
            f.write(f'{name},{price}\n')

This is messy and hard to maintain!

How to Fix It

Separate concerns:

1. Scraping (in spider):

def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

2. Cleaning (in pipeline):

class CleaningPipeline:
    def process_item(self, item, spider):
        # Clean name
        if item.get('name'):
            item['name'] = item['name'].strip()

        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('$', '')

        return item

3. Validating (in another pipeline):

from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Validate
        if not item.get('name'):
            raise DropItem('Missing name')

        return item

4. Saving (in yet another pipeline):

class SavePipeline:
    def process_item(self, item, spider):
        # Save to file/database
        return item

Much cleaner!
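
One thing to remember: pipelines only run if you register them in settings.py. The module path below is a placeholder for your own project; lower numbers run first:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 100,
    'myproject.pipelines.ValidationPipeline': 200,
    'myproject.pipelines.SavePipeline': 300,
}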

Real Example

Bad:

# Everything in one place
def parse(self, response):
    title = response.css('h1::text').get()
    title = title.strip().upper()[:50]  # Scraping + cleaning together
    if title and len(title) > 3:  # Validation
        save_to_db(title)  # Saving
        yield {'title': title}

Good:

# Spider: Just scrape
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# Pipeline: Clean
class CleanPipeline:
    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'].strip().upper()[:50]
        return item

# Pipeline: Validate
class ValidatePipeline:
    def process_item(self, item, spider):
        if not item.get('title') or len(item['title']) < 3:
            raise DropItem('Invalid title')
        return item

Mistake #10: No Logging or Monitoring

What This Means

Your spider runs, but you have no idea what's happening. Did it work? Did it fail? How many items did it scrape?

The Mistake

def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # No logging, no idea what happened!

Spider finishes. You have no idea if it worked.

How to Fix It

Add logging:

def parse(self, response):
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products on {response.url}')

    for product in products:
        name = product.css('h2::text').get()
        if name:
            self.logger.debug(f'Scraped: {name}')
            yield {'name': name}
        else:
            self.logger.warning('Product has no name!')

Enable file logging:

# settings.py
LOG_FILE = 'spider.log'
LOG_LEVEL = 'INFO'
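
If you're scraping with plain requests instead of Scrapy, the standard logging module does the same job. A minimal sketch:

import logging

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

logging.info('Starting scrape')
logging.warning('Page returned no products')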

Track stats:

def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    self.logger.info(f'Scraped {stats.get("item_scraped_count", 0)} items')
    self.logger.info(f'Failed {stats.get("downloader/exception_count", 0)} requests')

Real Example

Bad:

# No idea what's happening
def parse(self, response):
    for item in response.css('.item'):
        yield {'data': item.css('::text').get()}

Good:

def parse(self, response):
    items = response.css('.item')
    self.logger.info(f'Processing {len(items)} items from {response.url}')

    for item in items:
        data = item.css('::text').get()
        if data:
            yield {'data': data}
        else:
            self.logger.warning('Empty item found')

    self.logger.info(f'Finished processing {response.url}')

Quick Checklist

Before running your scraper, check:

  • [ ] Read robots.txt
  • [ ] Added a User-Agent header
  • [ ] Used CSS selectors (not absolute XPath)
  • [ ] Followed pagination/next pages
  • [ ] Added error handling
  • [ ] Set a download delay (at least 1 second)
  • [ ] Checked if the site uses JavaScript
  • [ ] Checked that elements exist before using them
  • [ ] Separated scraping from cleaning
  • [ ] Enabled logging

If you check all these, you're ahead of 90% of beginners!


Summary

The 10 mistakes:

  1. Ignoring robots.txt → Always check and obey it
  2. No headers → Add User-Agent and other headers
  3. Bad selectors → Use CSS classes, not absolute XPath
  4. Missing pagination → Follow next page links
  5. No error handling → Check if elements exist
  6. Scraping too fast → Add delays between requests
  7. Using requests on JS sites → Use Playwright/Selenium
  8. Assuming elements exist → Always check before using
  9. Mixing everything → Separate scraping, cleaning, saving
  10. No logging → Log what's happening

Remember:

  • Be polite (delays, robots.txt)
  • Be prepared (error handling, checking)
  • Be organized (separate concerns)
  • Be informed (logging, monitoring)

Fix these 10 mistakes and your scrapers will be more reliable, respectful, and professional.

Happy scraping! 🕷️
