Site changed their HTML at page 200. Scraping 500 pages took 3 attempts.

#python #programming #webscraping #webdev

Last week I was running this scrape job. Everything was going smoothly, then page 200 hit and suddenly every single field extraction returned null.

First thought was my code broke. Checked the code, looked fine. Checked the URL, still valid. Checked if the site was down, it wasn't. Pulled up page 200 in my browser and the HTML looked completely different in the devtools.

Turns out the site had two different page templates. Most pages used Template A. Somewhere around page 180-200 they switched to Template B. Same URL structure, same CSS classes in the HTML, but the actual DOM hierarchy was totally different.
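To make that failure mode concrete, here is a minimal sketch of how the same class names in a different hierarchy can break an extractor. The markup and both helper functions are made up for illustration, since the real site's templates aren't shown here, and it uses the standard library's XML parser on well-formed snippets (real-world HTML would usually need a proper HTML parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: both templates use the same class names,
# but Template B wraps the title in an extra <div>.
TEMPLATE_A = '<div class="product-card"><h2 class="title">Widget</h2></div>'
TEMPLATE_B = (
    '<div class="product-card"><div class="info">'
    '<h2 class="title">Widget</h2></div></div>'
)

def title_via_direct_child(markup):
    """Look for the title as a direct child of the card (works for A only)."""
    card = ET.fromstring(markup)
    node = card.find("./h2[@class='title']")
    return node.text if node is not None else None

def title_via_descendant(markup):
    """Look for the title anywhere under the card (works for A and B)."""
    card = ET.fromstring(markup)
    node = card.find(".//h2[@class='title']")
    return node.text if node is not None else None

print(title_via_direct_child(TEMPLATE_A))  # Widget
print(title_via_direct_child(TEMPLATE_B))  # None
print(title_via_descendant(TEMPLATE_B))    # Widget
```

An extractor that depends on the exact parent/child path returns nothing on the second template even though every class name is still there.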

The annoying part was figuring out exactly which page changed. Can't just eyeball it when you have 500 pages to deal with. Wrote a quick script to flag any page where my extraction count dropped below a threshold.

from pathlib import Path
import re

# Find pages with suspiciously low extraction counts
pattern = r'page_(\d+)\.html'
extracted_counts = {}

for html_file in Path('scraped_pages').glob('*.html'):
    match = re.search(pattern, html_file.name)
    if match:
        page_num = int(match.group(1))
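        # Explicit encoding so the read doesn't depend on the platform default
        content = html_file.read_text(encoding='utf-8', errors='replace')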
        # Count product containers
        count = content.count('class="product-card')
        extracted_counts[page_num] = count

# Flag anomalies
for page_num, count in sorted(extracted_counts.items()):
    if count < 5:  # Normal is 15-20 per page
        print(f'Suspicious: page {page_num} only extracted {count} items')

Found 47 pages with low extraction counts. The actual boundary was page 187. Not 200. Not 180. Page 187 exactly.
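A sketch of one way to pin down a boundary like that from the counts: instead of trusting any single low page, look for the first page that starts a run of consecutive low pages. `find_boundary`, its `min_run` parameter, and the toy data are all assumptions for illustration, not the exact check used for this job.

```python
def find_boundary(counts, threshold=5, min_run=3):
    """Return the first page that starts a run of at least `min_run`
    consecutive low-count pages, or None if there is no such run."""
    run_start = None
    run_len = 0
    prev_page = None
    for page in sorted(counts):
        low = counts[page] < threshold
        consecutive = prev_page is not None and page == prev_page + 1
        if low and run_len > 0 and consecutive:
            run_len += 1                    # extend the current run
        elif low:
            run_start, run_len = page, 1    # start a new run
        else:
            run_start, run_len = None, 0    # normal page resets the run
        if run_len >= min_run:
            return run_start
        prev_page = page
    return None

# Toy data: page 3 is a one-off blip, pages 6 onward are consistently low.
toy_counts = {1: 18, 2: 16, 3: 2, 4: 17, 5: 15, 6: 0, 7: 1, 8: 0, 9: 0}
print(find_boundary(toy_counts))  # 6
```

Passing the `extracted_counts` dict from the script above would work the same way. Requiring a run of low pages keeps a single short page from being mistaken for the template switch.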

Why page 187? No idea honestly. Maybe they ran a split test and page 187 was when the new template shipped to 100% of users. Maybe someone pushed a template change without checking existing scraped pages. Could be aliens for all I know.

Fix was simple enough: two extraction functions, routed by page number.

def extract_products(html, page_num):
    if page_num < 187:
        return extract_template_a(html)
    else:
        return extract_template_b(html)
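Routing on a hard-coded page number is fine for a one-off job, but it stops working if the boundary ever moves. A possible alternative, sketched here under assumptions, is to try one extractor and fall back to the other when it comes back empty. The two extractors below are stand-in stubs (the real ones aren't shown in this post), and `extract_with_fallback` is a hypothetical name.

```python
def extract_template_a(html):
    # Stand-in stub: pretend Template A pages contain the marker "layout-a".
    return ["item"] if "layout-a" in html else []

def extract_template_b(html):
    # Stand-in stub: pretend Template B pages contain the marker "layout-b".
    return ["item"] if "layout-b" in html else []

def extract_with_fallback(html, extractors=None):
    """Try each extractor in order and return the first non-empty result
    along with the name of the extractor that produced it."""
    if extractors is None:
        extractors = (extract_template_a, extract_template_b)
    for extractor in extractors:
        items = extractor(html)
        if items:
            return items, extractor.__name__
    return [], None

items, used = extract_with_fallback('<div class="layout-b">...</div>')
print(used)  # extract_template_b
```

Returning the extractor name alongside the items makes it easy to log which template each page matched, and a result of `None` marks pages that neither extractor could handle.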

Three passes to get clean data. Still annoyed about it.
