Bartosz Raubo
Scan.co.uk sales scraper

I was working today on a scraper that grabs the sale items from Scan.co.uk and collects the data in a .csv file. Nothing fancy - its sole value is educational. And fittingly, the simple bs4 script threw up two issues that seem worth mentioning.

  1. HTTP Error 403 - access to the server was not authorised, so the HTML could not be grabbed. How frustrating!

  2. x.findAll() does not return all results - I was trying to grab 6 'li' containers, but only 4 were ever found by the function. What do?

HTTP Error 403: Forbidden

This is related to urllib's headers - the website does not want to get bogged down dealing with requests from countless scrapers, so requests bearing urllib's default User-Agent are blocked.

To get around this, you must obscure the fact you are running a scraping bot. The simplest way to do this is by using headers, as follows:

from urllib.request import Request, urlopen

req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
page_html = page.read()
page.close()

At first, this did not work for me (it's User-Agent, not User_Agent, bwap bwap).

So here's another, apparently older, solution from user Zeta over on StackOverflow:

import urllib.request

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

uClient_opener = AppURLopener()
uClient = uClient_opener.open(my_url)

This appears to be a legacy solution, and not the preferred one. In the end, both solutions worked for me, typos aside.

x.findAll() does not return all results

product_list = product_categories[x].findAll('li')

The above code should have returned 6 results, but I could never get it to go above 4.

Some googling suggested that this was a problem with the built-in html.parser. Suggested solution - use html5lib.

This is what parsing the html with BeautifulSoup looked like before:

from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, 'html.parser')
product_categories = page_soup.findAll('div', {'class': 'category'})

The change to the code is minimal - just swap the 'html.parser' argument for 'html5lib':

# html5lib must be installed (pip install html5lib); BeautifulSoup loads it
# by name, so an explicit "import html5lib" is not actually needed
page_soup = soup(page_html, 'html5lib')
product_categories = page_soup.findAll('div', {'class': 'category'})

And it works! len(product_list) now returns the 6 results I was looking for.
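Why does the parser matter? html.parser does not try to repair malformed markup, while html5lib rebuilds the page the way a browser would - so on a broken page the two can produce different trees and different findAll() counts. Here's a small self-contained illustration using a stray close tag, a difference the BeautifulSoup documentation itself calls out (your page's actual markup problem may well be different):

```python
from bs4 import BeautifulSoup

broken = "<a></p>"  # a stray </p> with no opening <p>

lenient = BeautifulSoup(broken, "html.parser")
repaired = BeautifulSoup(broken, "html5lib")

# html.parser simply drops the stray close tag
print(len(lenient.find_all("p")))   # 0

# html5lib repairs the document into <a><p></p></a>, so a <p> appears
print(len(repaired.find_all("p")))  # 1
```

The lesson: if a scoped findAll() comes up short, try a different parser before assuming your selectors are wrong.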

Hope someone finds this helpful.
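P.S. The intro mentions collecting the data in a .csv file, but that step never appears above, so here's a minimal sketch using the stdlib csv module. The column names and the shape of the scraped rows are assumptions for illustration - in the real script they'd come from the 'li' containers found earlier:

```python
import csv

# Hypothetical rows; in the real scraper these would be extracted
# from the <li> elements in product_list
products = [
    {"name": "Example GPU", "price": "299.99"},
    {"name": "Example SSD", "price": "59.99"},
]

with open("scan_sales.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()  # first row: name,price
    writer.writerows(products)
```

DictWriter keeps the column order consistent even if the scraped dicts are built in a different order, which makes the output file easier to diff between runs.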
