Scraped 300 pages successfully. Site updated robots.txt at page 187 and blocked me.
Building a price tracker for electronics. Target: 300 product pages across an ecommerce site. Tested first 20 pages, everything worked. Ran the full scraper overnight.
Woke up to find 187 products scraped, then nothing. Zero errors in my logs.
What happened
The site admin updated their robots.txt while I was sleeping. Added Disallow: /products/* between page 187 and 188. My scraper checks robots.txt once at startup, then runs. By page 188, their server started returning 403 Forbidden.
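For context, the broken pattern looked roughly like this (a reconstruction, not the exact code; `parse()` is fed an in-memory copy of the rules so the sketch runs without a network call):

```python
from urllib.robotparser import RobotFileParser

# Reconstructed sketch of the buggy pattern: robots.txt is parsed ONCE
# before the loop and never re-checked, so a mid-run change is invisible.
robots_at_startup = [
    "User-agent: *",
    "Disallow: /admin/",  # /products/ was still allowed at startup
]

parser = RobotFileParser()
parser.parse(robots_at_startup)  # one-time check, then off to the races

pages = [f"https://example.com/products/{i}" for i in range(1, 4)]
allowed = [p for p in pages if parser.can_fetch("*", p)]
# Every page passes the stale check, even after the live file adds
# "Disallow: /products/*" -- the 403s are the first sign of trouble.
```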
Fun times.
The mess I made
First attempt: Just scraped the remaining 113 pages ignoring robots.txt.
Got IP banned within 15 minutes. Smart.

Second attempt: Added 5-second delays between requests.
Still banned. Slower this time, but same result.
Third attempt: Residential proxies.
This worked but cost $40 for what should've been free data.
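For what it's worth, the proxy setup was nothing clever: round-robin rotation fed into the `proxies` parameter of requests. A minimal sketch with placeholder endpoints (a real residential provider hands you authenticated URLs in the same scheme):

```python
import itertools

# Placeholder endpoints -- your proxy provider supplies the real ones.
PROXY_POOL = [
    "http://user:pass@proxy1.example.net:8000",
    "http://user:pass@proxy2.example.net:8000",
]

def rotating_proxies(pool):
    """Yield requests-style proxies mappings, cycling through the pool."""
    for proxy in itertools.cycle(pool):
        yield {"http": proxy, "https": proxy}

proxies = rotating_proxies(PROXY_POOL)
first = next(proxies)
# usage in the loop: requests.get(url, proxies=next(proxies), timeout=10)
```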
What I changed
```python
import time
from urllib.robotparser import RobotFileParser

import requests


class RobotChecker:
    def __init__(self, base_url):
        self.base_url = base_url
        self.last_check = 0
        self.cache_duration = 300  # re-fetch robots.txt every 5 minutes
        self.parser = RobotFileParser()

    def can_fetch(self, url):
        # Refresh robots.txt every 5 min instead of once at startup
        if time.time() - self.last_check > self.cache_duration:
            self.parser.set_url(f"{self.base_url}/robots.txt")
            self.parser.read()
            self.last_check = time.time()
        return self.parser.can_fetch("*", url)


# In the scraper loop
robot = RobotChecker("https://example.com")
for page in pages:
    if not robot.can_fetch(page):
        print(f"robots.txt changed, stopping at {page}")
        break
    # scrape page, e.g. requests.get(page)
```
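One caveat with the refresh above: `RobotFileParser.read()` can raise on a network failure (`urllib.error.URLError`, which subclasses `OSError`), which would crash the scraper mid-run. A hedged wrapper that keeps the previous rules when the re-fetch fails:

```python
from urllib.robotparser import RobotFileParser

def refresh_robots(current, robots_url):
    """Re-read robots.txt; on a network error, keep the stale rules."""
    fresh = RobotFileParser()
    fresh.set_url(robots_url)
    try:
        fresh.read()
    except OSError:  # urllib.error.URLError subclasses OSError
        return current  # better stale rules than a crashed scraper
    return fresh
```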
Checking robots.txt every 5 minutes catches changes before the bans start. It saves proxy costs when a site decides to block partway through a run.
Platform quirks
Some ecommerce platforms update robots.txt dynamically when traffic spikes; Shopify stores sometimes do this. Big sites like Amazon rarely touch theirs, while smaller ones panic and lock everything down.
If your scraper runs longer than 10 minutes, periodic robot checks matter. Most tutorials skip this because test runs finish fast.
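Polling robots.txt still won't catch every block (some sites just start returning 403s without touching the file), so it's worth pairing it with a circuit breaker on the response codes. A minimal sketch; the threshold of 3 is an arbitrary choice:

```python
class BlockDetector:
    """Treat a short run of consecutive 403s as a block signal."""

    def __init__(self, threshold=3):  # threshold is an arbitrary choice
        self.threshold = threshold
        self.consecutive_403s = 0

    def record(self, status_code):
        """Feed each response status; returns True when it's time to stop."""
        if status_code == 403:
            self.consecutive_403s += 1
        else:
            self.consecutive_403s = 0
        return self.consecutive_403s >= self.threshold

detector = BlockDetector()
# in the loop: if detector.record(resp.status_code): break
```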
Still annoying when sites block you halfway through.