I made every mistake on this list. Every single one.
My first scraper got my IP banned in 10 minutes. My second one crashed after scraping 5 pages. My third one worked perfectly on my laptop but broke the next day when the website changed.
I learned web scraping the hard way. You don't have to.
Let me show you the 10 most common mistakes beginners make, so you can avoid them from day one.
Mistake #1: Ignoring robots.txt
What This Means
Every website has a file called robots.txt that tells crawlers what they can and can't scrape.
Find it here:
https://example.com/robots.txt
Just add /robots.txt to any website.
The Mistake
You start scraping without checking this file.
Example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/
This says: "Don't scrape these folders."
If you scrape them anyway, you're being rude (and might get blocked).
How to Fix It
1. Always check robots.txt first:
curl https://example.com/robots.txt
Or just visit it in your browser.
2. Respect what it says:
If it says Disallow: /admin/, don't scrape that section.
3. In Scrapy, enable it:
# settings.py
ROBOTSTXT_OBEY = True
Scrapy will automatically check and obey robots.txt.
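Outside Scrapy, you can check robots.txt programmatically with Python's built-in urllib.robotparser; a minimal sketch (example.com is just a placeholder):
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://example.com/robots.txt')
robots.read()

# Returns True only if the rules allow this user agent to fetch the URL
if robots.can_fetch('*', 'https://example.com/products'):
    print('Allowed to scrape')
else:
    print('Disallowed, skip this URL')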
Real Example
Bad:
# Just scraping everything
start_urls = [
    'https://example.com/products',
    'https://example.com/admin',  # Might be disallowed!
]
Good:
# Check robots.txt first, then only scrape allowed pages
start_urls = [
    'https://example.com/products',  # Allowed
]
Mistake #2: Not Using Proper Headers
What This Means
When your scraper visits a website, it sends information about itself called "headers."
Without proper headers, you look like a robot. Websites block robots.
The Mistake
You send requests without a User-Agent header.
What websites see:
User-Agent: python-requests/2.28.0
This screams "I'm a bot!"
How to Fix It
Add a User-Agent that looks like a real browser:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get('https://example.com', headers=headers)
In Scrapy:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
Better: Add More Headers
Real browsers send more than just User-Agent:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}
This looks much more like a real browser.
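If you're using Scrapy, you can set these once for every request instead of repeating them; a sketch using the USER_AGENT and DEFAULT_REQUEST_HEADERS settings:
# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}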
Real Example
Bad:
# No headers, obvious bot
response = requests.get('https://example.com')
Good:
# Looks like a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'}
response = requests.get('https://example.com', headers=headers)
Mistake #3: Using Bad Selectors
What This Means
Selectors are how you find elements on a page (like finding a specific button or product name).
Bad selectors break when the website changes even slightly.
The Mistake
Using absolute XPath:
# Breaks if website changes anything
title = response.xpath('/html/body/div[1]/div[2]/div[3]/h1/text()').get()
This says: "Go to body, then first div, then second div, then third div, then h1."
If the website adds ONE div anywhere, this breaks!
How to Fix It
Use relative selectors based on classes or IDs:
# Better, looks for class name
title = response.css('.product-title::text').get()
# Or ID
title = response.css('#product-name::text').get()
Even better, use multiple options:
# Try class first, then ID, then tag
title = (
    response.css('.product-title::text').get() or
    response.css('#product-name::text').get() or
    response.css('h1::text').get()
)
Real Example
Bad:
# Absolute XPath, very fragile
price = response.xpath('/html/body/div[1]/div[2]/span[1]/text()').get()
Good:
# CSS class, much more stable
price = response.css('.price::text').get()
Better:
# Multiple fallbacks
price = (
    response.css('.price::text').get() or
    response.css('.product-price::text').get() or
    response.css('[data-price]::text').get()
)
Mistake #4: Missing Pagination
What This Means
Many websites split content across multiple pages (page 1, page 2, page 3, etc.).
If you only scrape page 1, you miss most of the data!
The Mistake
def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # FORGOT TO GO TO NEXT PAGE!
You scrape page 1 and stop. Next page never gets scraped.
How to Fix It
Find and follow the "Next" button:
def parse(self, response):
    # Scrape items on this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Follow next page
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
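If the page shows several pagination links at once (numbered pages), Scrapy 2.0+ also has response.follow_all, which queues every matching link in one call; a rough sketch (the .pagination selector is just an assumption about the page's markup):
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Queue every pagination link; Scrapy's duplicate filter
    # skips pages that were already requested
    yield from response.follow_all(css='.pagination a', callback=self.parse)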
Real Example
Bad:
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # Only scrapes page 1!
Good:
def parse(self, response):
    # Scrape this page
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    # Go to next page
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
    # Now scrapes ALL pages!
Mistake #5: No Error Handling
What This Means
Things go wrong. Websites go down. Internet disconnects. Elements don't exist.
Without error handling, your scraper crashes and loses all progress.
The Mistake
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title,
        'price': float(price)  # CRASHES if price is None!
    }
If price doesn't exist, float(None) crashes your spider.
How to Fix It
Check if things exist before using them:
def parse(self, response):
    title = response.css('h1::text').get()
    price_text = response.css('.price::text').get()

    # Check before converting
    if price_text:
        try:
            price = float(price_text.replace('$', ''))
        except ValueError:
            price = None
    else:
        price = None

    yield {
        'title': title or 'Unknown',
        'price': price
    }
In Scrapy, use errback:
def start_requests(self):
    urls = ['https://example.com']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.error_handler)

def error_handler(self, failure):
    self.logger.error(f'Request failed: {failure}')
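Scrapy also retries failed requests on its own; the defaults are sensible, but you can tune them in settings.py (the values shown are examples, not requirements):
# settings.py
RETRY_ENABLED = True  # on by default
RETRY_TIMES = 3       # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]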
Real Example
Bad:
price = float(response.css('.price::text').get()) # Crashes if None
Good:
price_text = response.css('.price::text').get()
if price_text:
    try:
        price = float(price_text.replace('$', ''))
    except ValueError:
        price = None
else:
    price = None
Mistake #6: Scraping Too Fast
What This Means
You send 100 requests per second. The website thinks you're attacking it. It blocks you.
The Mistake
# Hammering the server
for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    # No delay, sends all 1000 requests immediately!
This is like knocking on someone's door 1000 times in one second. Rude!
How to Fix It
Add delays between requests:
import time
import requests
for i in range(1000):
    response = requests.get(f'https://example.com/page{i}')
    time.sleep(1)  # Wait 1 second between requests
In Scrapy (even better):
# settings.py
DOWNLOAD_DELAY = 1 # Wait 1 second between requests
Scrapy handles this automatically!
Or randomize the delay:
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True # Random delay between 0.5*DELAY and 1.5*DELAY
Real Example
Bad:
# 1000 requests instantly
for url in urls:
    requests.get(url)  # BOOM BOOM BOOM
Good:
# Polite scraping
for url in urls:
    requests.get(url)
    time.sleep(1)  # Wait 1 second
Better (Scrapy):
# settings.py
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 8 # Only 8 at a time
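Scrapy can also adjust the delay on the fly based on how quickly the server responds. A minimal AutoThrottle setup (the values are just reasonable starting points):
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10             # cap the delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average requests in flight per server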
Mistake #7: Using requests on JavaScript Sites
What This Means
Some websites load content with JavaScript. The HTML you get with requests is empty!
The Mistake
response = requests.get('https://modern-spa-site.com')
soup = BeautifulSoup(response.text, 'html.parser')
products = soup.find_all('div', class_='product') # Returns nothing!
You look at the page in a browser and see products, but requests sees nothing because the JavaScript hasn't run yet.
How to Check
View page source in browser:
- Right-click page
- Click "View Page Source"
- Press Ctrl+F and search for the data you want
If you can't find it in the source, it's loaded with JavaScript!
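You can also check from Python itself: fetch the page with plain requests and search the raw HTML for something you can see in the browser. A quick sketch (the URL and the 'Blue Widget' text are placeholders):
import requests

response = requests.get('https://example.com/products')
if 'Blue Widget' in response.text:
    print('Data is in the static HTML, plain requests is enough')
else:
    print('Data is probably loaded by JavaScript, use a browser tool')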
How to Fix It
Option 1: Use Selenium or Playwright
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://modern-spa-site.com')

# Wait for JavaScript to load
time.sleep(3)

# Now you can get the content
html = driver.page_source
driver.quit()
Option 2: Use Scrapy-Playwright
# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
# scrapy-playwright also needs Scrapy to run on the asyncio reactor
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
# In spider
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        meta={'playwright': True}
    )
Option 3: Find the API
Sometimes it's easier to find the hidden API that loads the data (see the sketch after these steps):
- Open browser DevTools (F12)
- Go to Network tab
- Reload page
- Look for XHR/Fetch requests
- Find JSON data
- Scrape the API directly!
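Here's roughly what that looks like once you've copied the endpoint from the Network tab. Everything below (the URL, the results list, the name and price fields) is hypothetical; your API will use its own names:
import requests

# Endpoint copied from the XHR/Fetch request in DevTools (hypothetical)
url = 'https://example.com/api/products?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

data = requests.get(url, headers=headers).json()
for product in data['results']:  # field names are assumptions
    print(product['name'], product['price'])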
Real Example
Bad:
# Won't work on JavaScript sites
response = requests.get('https://react-site.com')
data = response.text # Empty or incomplete!
Good:
# Use browser automation
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://react-site.com')
    page.wait_for_selector('.product')  # Wait for content
    html = page.content()  # Now has data!
    browser.close()
Mistake #8: Assuming Elements Always Exist
What This Means
Just because an element exists on one page doesn't mean it exists on all pages.
The Mistake
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    yield {
        'title': title.strip(),  # CRASHES if title is None!
        'price': price
    }
If title doesn't exist, None.strip() crashes.
How to Fix It
Always check before using:
def parse(self, response):
    title = response.css('h1::text').get()
    price = response.css('.price::text').get()

    # Check before using
    if title:
        title = title.strip()
    if price:
        price = price.strip()

    yield {
        'title': title or 'No title',
        'price': price or 'No price'
    }
Or use get() with a default:
# Gets first item or returns empty string
title = response.css('h1::text').get(default='')
Real Example
Bad:
name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper(),    # Crashes if name is None
    'email': email.lower()   # Crashes if email is None
}
Good:
name = response.css('.name::text').get()
email = response.css('.email::text').get()

yield {
    'name': name.upper() if name else None,
    'email': email.lower() if email else None
}
Mistake #9: Mixing Everything Together
What This Means
You put scraping, cleaning, and saving all in one messy function.
The Mistake
def parse(self, response):
    for product in response.css('.product'):
        # Scraping
        name = product.css('h2::text').get()
        price = product.css('.price::text').get()

        # Cleaning
        name = name.strip() if name else ''
        price = price.replace('$', '').replace(',', '') if price else '0'
        price = float(price)

        # Validating
        if price < 0:
            price = 0
        if len(name) > 100:
            name = name[:100]

        # Saving
        with open('products.csv', 'a') as f:
            f.write(f'{name},{price}\n')
This is messy and hard to maintain!
How to Fix It
Separate concerns:
1. Scraping (in spider):
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
2. Cleaning (in pipeline):
class CleaningPipeline:
    def process_item(self, item, spider):
        # Clean name
        if item.get('name'):
            item['name'] = item['name'].strip()
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('$', '')
        return item
3. Validating (in another pipeline):
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Validate
        if not item.get('name'):
            raise DropItem('Missing name')
        return item
4. Saving (in yet another pipeline):
class SavePipeline:
    def process_item(self, item, spider):
        # Save to file/database
        return item
Much cleaner!
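Pipelines only run if you register them in settings.py. The module path depends on your project name (myproject below is an assumption), and the numbers set the order they run in (lowest first):
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 100,
    'myproject.pipelines.ValidationPipeline': 200,
    'myproject.pipelines.SavePipeline': 300,
}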
Real Example
Bad:
# Everything in one place
def parse(self, response):
    title = response.css('h1::text').get()
    title = title.strip().upper()[:50]  # Scraping + cleaning together
    if title and len(title) > 3:  # Validation
        save_to_db(title)  # Saving
    yield {'title': title}
Good:
from scrapy.exceptions import DropItem

# Spider: Just scrape
def parse(self, response):
    yield {'title': response.css('h1::text').get()}

# Pipeline: Clean
class CleanPipeline:
    def process_item(self, item, spider):
        if item.get('title'):
            item['title'] = item['title'].strip().upper()[:50]
        return item

# Pipeline: Validate
class ValidatePipeline:
    def process_item(self, item, spider):
        if not item.get('title') or len(item['title']) < 3:
            raise DropItem('Invalid title')
        return item
Mistake #10: No Logging or Monitoring
What This Means
Your spider runs, but you have no idea what's happening. Did it work? Did it fail? How many items scraped?
The Mistake
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}
    # No logging, no idea what happened!
Spider finishes. You have no idea if it worked.
How to Fix It
Add logging:
def parse(self, response):
    # Scrapy spiders come with self.logger built in
    products = response.css('.product')
    self.logger.info(f'Found {len(products)} products on {response.url}')

    for product in products:
        name = product.css('h2::text').get()
        if name:
            self.logger.debug(f'Scraped: {name}')
            yield {'name': name}
        else:
            self.logger.warning('Product has no name!')
Enable file logging:
# settings.py
LOG_FILE = 'spider.log'
LOG_LEVEL = 'INFO'
Track stats:
def closed(self, reason):
    stats = self.crawler.stats.get_stats()
    self.logger.info(f'Scraped {stats.get("item_scraped_count", 0)} items')
    self.logger.info(f'Failed {stats.get("downloader/exception_count", 0)} requests')
Real Example
Bad:
# No idea what's happening
def parse(self, response):
for item in response.css('.item'):
yield {'data': item.css('::text').get()}
Good:
def parse(self, response):
    items = response.css('.item')
    self.logger.info(f'Processing {len(items)} items from {response.url}')

    for item in items:
        data = item.css('::text').get()
        if data:
            yield {'data': data}
        else:
            self.logger.warning('Empty item found')

    self.logger.info(f'Finished processing {response.url}')
Quick Checklist
Before running your scraper, check:
- [ ] Read robots.txt
- [ ] Added User-Agent header
- [ ] Using CSS selectors (not absolute XPath)
- [ ] Following pagination/next pages
- [ ] Added error handling
- [ ] Set download delay (at least 1 second)
- [ ] Checked if site uses JavaScript
- [ ] Checking that elements exist before using them
- [ ] Separated scraping from cleaning
- [ ] Enabled logging
If you check all these, you're ahead of 90% of beginners!
Summary
The 10 mistakes:
- Ignoring robots.txt → Always check and obey it
- No headers → Add User-Agent and other headers
- Bad selectors → Use CSS classes, not absolute XPath
- Missing pagination → Follow next page links
- No error handling → Check if elements exist
- Scraping too fast → Add delays between requests
- Using requests on JS sites → Use Playwright/Selenium
- Assuming elements exist → Always check before using
- Mixing everything → Separate scraping, cleaning, saving
- No logging → Log what's happening
Remember:
- Be polite (delays, robots.txt)
- Be prepared (error handling, checking)
- Be organized (separate concerns)
- Be informed (logging, monitoring)
Fix these 10 mistakes and your scrapers will be more reliable, respectful, and professional.
Happy scraping! 🕷️