DEV Community

BaraaZ95

How I used Scrapy for my ML Project

You're in your early days of learning machine learning. You're getting the hang of it, and now you want to start a cool project to demonstrate your skills. You're sick of practicing with public datasets and want something fresher, more creative, and closer to your own interests... enter web scraping.

If you want the superpower of fetching any data you want from any website you please, you MUST learn web scraping.

Learning web scraping for the first time can be dauntingly difficult. There are so many tools and libraries you can use, and each has its own pros, cons, and use cases.

I wanted to invest my time and energy in learning the fastest, most efficient one, a tool that can scale with me as my projects get more and more complex. After all, I want my projects to shine so brightly on my CV that they blind the recruiter's eyes...

Enter Scrapy. Scrapy is one of the fastest tools you can use for scraping, thanks to its low-level flexibility. Being low level also makes it the most complex one to set up, but once you get the hang of it, there's an almost unlimited amount of data you can scrape.

One of my first projects was scraping FPS data for various graphics cards from https://www.gpucheck.com/

How to Start with Scrapy

First, install Scrapy with `pip install scrapy`.
Then, in a terminal, navigate to the folder where you want to save your project: `cd C:\Users\Baraa\Projects`
Next, start a Scrapy project: `scrapy startproject gpus`
Finally, move into the project folder and generate a spider: `cd gpus`, then `scrapy genspider gpucheck gpucheck.com` (`genspider` expects the site's domain rather than the full URL).

How to Find Elements with Scrapy

Now comes the tedious part: we need to find the locations, in the HTML, of the elements we want to scrape.

The GPUs were listed at https://www.gpucheck.com/graphics-cards. Using inspect in my browser, I first needed to find the URL of each GPU, so the scraper could visit it and scrape that GPU's information.

(Image: CSS locators)

Boom! The URLs are under these locators, and I've entered them in my scraper:

(Image: Scrapy spider code)

After that, I tell my scraper to follow each URL under that selector.

Now comes the hard part... The site has FPS data for various resolutions and graphics-quality settings. To have my scraper go through each settings URL and return all the FPS data, I needed to run this portion of the scraper asynchronously.

The async function lets me parse the GPU information on the page while also visiting all the other graphics-settings pages and fetching their FPS data, as such:

(Image: asynchronous execution diagram. Source: phpmind)

The function should look something like this:

(Image: async function)

Now my scraper updates my GPU dict for every graphics setting and every resolution simultaneously, like so:

(Image: resolution function)
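The nesting itself can be sketched as a small helper (`merge_fps`, the setting and resolution labels, and the numbers are all made up for illustration):

```python
def merge_fps(gpu: dict, setting: str, resolution: str, fps: dict) -> dict:
    """Nest an FPS table under its graphics setting and resolution."""
    gpu.setdefault(setting, {})[resolution] = fps
    return gpu


gpu = {"name": "Example GPU"}
merge_fps(gpu, "ultra", "1920x1080", {"Game A": 88, "Game B": 120})
merge_fps(gpu, "ultra", "2560x1440", {"Game A": 61, "Game B": 90})
# gpu["ultra"] now maps each resolution to its per-game FPS dict
```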

This yields a nested dict containing, for each graphics setting and each resolution, the FPS of every game.

The output should look like this:

(Image: scraper output)
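For reference, the nested structure looks roughly like this (the card name, game names, and numbers are invented placeholders, not real scraped values):

```python
gpu = {
    "name": "Example RTX card",
    "ultra": {
        "1920x1080": {"Game A": 71, "Game B": 170},
        "2560x1440": {"Game A": 52, "Game B": 135},
    },
    "medium": {
        "1920x1080": {"Game A": 98, "Game B": 220},
    },
}

# One lookup: FPS for Game A at ultra settings, 1080p
print(gpu["ultra"]["1920x1080"]["Game A"])  # 71
```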

See? That wasn't so hard after all.

The full code for the scraper can be found on GitHub.
The dataset can be found on Kaggle.
