<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ian Kerins</title>
    <description>The latest articles on DEV Community by Ian Kerins (@iankerins).</description>
    <link>https://dev.to/iankerins</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F426918%2F0c864e5b-3884-4b39-93d4-2e1bf45ac525.jpg</url>
      <title>DEV Community: Ian Kerins</title>
      <link>https://dev.to/iankerins</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iankerins"/>
    <language>en</language>
    <item>
      <title>The 5 Best Scrapyd Dashboards &amp; Admin Tools </title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Fri, 14 Jan 2022 09:06:13 +0000</pubDate>
      <link>https://dev.to/iankerins/the-5-best-scrapyd-dashboards-admin-tools-42eb</link>
      <guid>https://dev.to/iankerins/the-5-best-scrapyd-dashboards-admin-tools-42eb</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Scrapyd is the de facto spider management tool for developers who want a free and effective way to manage their Scrapy spiders on multiple servers without having to configure cron jobs or use paid tools like &lt;a href="https://www.zyte.com/scrapy-cloud/"&gt;Scrapy Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The one major drawback with Scrapyd, however, is that the default dashboard it comes with is basic, to say the least.&lt;/p&gt;

&lt;p&gt;Because of this, numerous web scraping teams have had to build their own Scrapyd dashboards to get the functionality that they need. &lt;/p&gt;

&lt;p&gt;In this guide, we're going to go through the &lt;strong&gt;5 Best Scrapyd Dashboards&lt;/strong&gt; that these developers have decided to share with the community so you don't have to build your own.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ScrapeOps&lt;/li&gt;
&lt;li&gt;ScrapydWeb&lt;/li&gt;
&lt;li&gt;Gerapy&lt;/li&gt;
&lt;li&gt;SpiderKeeper&lt;/li&gt;
&lt;li&gt;Crawlab&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  #1 ScrapeOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a new Scrapyd dashboard and monitoring tool for Scrapy. &lt;/p&gt;

&lt;p&gt;With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary goal with ScrapeOps is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.&lt;/p&gt;

&lt;p&gt;Unlike the other options on this list, ScrapeOps is a full end-to-end web scraping monitoring and management tool dedicated to web scraping that automatically sets up all the monitors, health checks and alerts for you. &lt;/p&gt;

&lt;p&gt;If you have an issue integrating ScrapeOps or need advice on setting up your scrapers, they have a support team on hand to assist you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs against your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Job stats tracked include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Items Parsed &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Response Status Codes&lt;/li&gt;
&lt;li&gt;✅ Success Rates&lt;/li&gt;
&lt;li&gt;✅ Latencies&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;There are two steps to integrate ScrapeOps with your Scrapyd servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install ScrapeOps Logger Extension&lt;/li&gt;
&lt;li&gt;Connect ScrapeOps to Your Scrapyd Servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can't connect ScrapeOps to a Scrapyd server that is only running locally and doesn't expose a public IP address for ScrapeOps to connect to.&lt;/p&gt;

&lt;p&gt;Once set up, you will be able to schedule, run and manage all your Scrapyd servers from one dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Install Scrapy Logger Extension
&lt;/h4&gt;

&lt;p&gt;For ScrapeOps to monitor your scrapers, create dashboards and trigger alerts, you need to install the ScrapeOps logger extension in each of your Scrapy projects.&lt;/p&gt;

&lt;p&gt;Simply install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Connect ScrapeOps to Your Scrapyd Servers
&lt;/h4&gt;

&lt;p&gt;The next step is giving ScrapeOps the connection details of your Scrapyd servers so that you can manage them from the dashboard. &lt;/p&gt;

&lt;p&gt;Within your dashboard go to the &lt;a href="https://scrapeops.io/app/servers"&gt;Servers page&lt;/a&gt; and click on the &lt;strong&gt;Add Scrapyd Server&lt;/strong&gt; button at the top of the page.&lt;/p&gt;

&lt;p&gt;In the dropdown section that appears, enter your connection details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server Name&lt;/li&gt;
&lt;li&gt;Server Domain Name (optional)&lt;/li&gt;
&lt;li&gt;Server IP Address&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once set up, you can schedule your scraping jobs to run periodically using the ScrapeOps scheduler and monitor your scraping results in your dashboards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XXy4D1IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XXy4D1IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" alt="ScrapeOps Dashboard Demo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a powerful web scraping monitoring tool, that gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free unlimited community plan.&lt;/li&gt;
&lt;li&gt;Simple 30-second install.&lt;/li&gt;
&lt;li&gt;Hosted solution, so you don't need to spin up a server.&lt;/li&gt;
&lt;li&gt;Full Scrapyd JSON API support.&lt;/li&gt;
&lt;li&gt;Includes the most fully featured scraping monitoring, health checks and alerts straight out of the box.&lt;/li&gt;
&lt;li&gt;Customer support team available to help you get set up and to add new features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not open source, if that is your preference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #2 ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;The most popular open source Scrapyd dashboard, &lt;a href="https://github.com/my8100/scrapydweb"&gt;ScrapydWeb&lt;/a&gt; is a great solution for anyone looking for a robust spider management tool that can be integrated with their Scrapyd servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" alt="Scrapydweb Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With ScrapydWeb, you can schedule, run and see the stats from all your jobs across all your servers on a single dashboard. ScrapydWeb supports all the Scrapyd JSON API endpoints, so you can also stop jobs mid-crawl and delete projects without having to log into your Scrapyd server.&lt;/p&gt;

&lt;p&gt;When combined with &lt;a href="https://github.com/my8100/logparser"&gt;LogParser&lt;/a&gt;, ScrapydWeb will also extract your Scrapy logs from your server and parse them into an easier-to-understand format.&lt;/p&gt;

&lt;p&gt;A powerful feature of ScrapydWeb that many of the other open source Scrapyd dashboards lack is the ability to easily connect multiple Scrapyd servers to your dashboard, execute actions on multiple nodes with the same command, and auto-package your spiders on the Scrapyd server.&lt;/p&gt;

&lt;p&gt;Although ScrapydWeb has a lot of spider management functionality, its monitoring/job visualisation capabilities are quite limited, and there are a number of user experience issues that make it less than ideal if you plan to rely on it completely as your main spider monitoring solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;If you want an easy-to-use open-source Scrapyd dashboard then ScrapydWeb is a great choice. It is the most popular open-source Scrapyd dashboard at the moment, and has a lot of functionality built in.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source.&lt;/li&gt;
&lt;li&gt;Robust and battle tested Scrapyd management tool.&lt;/li&gt;
&lt;li&gt;Lots of Spider management functionality.&lt;/li&gt;
&lt;li&gt;Best multi-node server management functionality.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #3 Gerapy
&lt;/h2&gt;

&lt;p&gt;Next on our list is &lt;a href="https://github.com/Gerapy/Gerapy"&gt;Gerapy&lt;/a&gt;. With 2.6k stars on GitHub, it is another very popular open source Scrapyd dashboard.&lt;/p&gt;

&lt;p&gt;Gerapy enables you to schedule, run and control all your Scrapy scrapers from a single dashboard. Like others on this list, its goal is to make managing distributed crawler projects easier and less time consuming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---dWVH8Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070132.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---dWVH8Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070132.png" alt="Gerapy Dashboard" width="880" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gerapy boasts the following features and functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More convenient control of crawler runs.&lt;/li&gt;
&lt;li&gt;View crawl results in closer to real time.&lt;/li&gt;
&lt;li&gt;Easier scheduling of timed tasks.&lt;/li&gt;
&lt;li&gt;Easier project deployment.&lt;/li&gt;
&lt;li&gt;More unified host management.&lt;/li&gt;
&lt;li&gt;Write crawler code more easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike ScrapydWeb, Gerapy also has a visual code editor built in, so you can edit your project's code right from the Gerapy dashboard if you would like to make a quick change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TRE3yi8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070248.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TRE3yi8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070248.png" alt="Gerapy Visual Editor" width="880" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Gerapy is a great alternative to the open-source ScrapydWeb. It will allow you to manage multiple Scrapyd servers with a single dashboard. &lt;/p&gt;

&lt;p&gt;However, it doesn't extract the job stats from your log files, so you can't view all your jobs' scraping results in a single view as you can with ScrapydWeb.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source, and very active maintainers.&lt;/li&gt;
&lt;li&gt;Robust Scrapyd management tool.&lt;/li&gt;
&lt;li&gt;Full Spider management functionality.&lt;/li&gt;
&lt;li&gt;Ability to edit spiders within dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No log parsing functionality equivalent to ScrapydWeb's &lt;a href="https://github.com/my8100/logparser"&gt;LogParser&lt;/a&gt; integration.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #4 SpiderKeeper
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/DormyMo/SpiderKeeper"&gt;SpiderKeeper&lt;/a&gt; is another open-source Scrapyd dashboard based on the old Scrapinghub Scrapy Cloud dashboard.&lt;/p&gt;

&lt;p&gt;SpiderKeeper was once a very popular Scrapyd dashboard because it had robust functionality and looked good. &lt;/p&gt;

&lt;p&gt;However, it has fallen out of favour due to the launch of other dashboard projects and the fact that it isn't maintained anymore (last update was in 2018, plus numerous open pull requests).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd33XKvp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/DormyMo/SpiderKeeper/master/screenshot/screenshot_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd33XKvp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/DormyMo/SpiderKeeper/master/screenshot/screenshot_1.png" alt="SpiderKeeper Dashboard" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SpiderKeeper is a simpler implementation of the functionality that ScrapeOps, ScrapydWeb or Gerapy provide; however, it still covers all the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage your Scrapy spiders from a dashboard.&lt;/li&gt;
&lt;li&gt;Schedule periodic jobs to run automatically.&lt;/li&gt;
&lt;li&gt;Deploy spiders to Scrapyd with a single click.&lt;/li&gt;
&lt;li&gt;Basic spider stats.&lt;/li&gt;
&lt;li&gt;Full Scrapyd API support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;SpiderKeeper was a great open-source Scrapyd dashboard; however, since it hasn't been actively maintained in years, we would recommend using one of the other options on this list.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source.&lt;/li&gt;
&lt;li&gt;Good functionality that covers all the basics.&lt;/li&gt;
&lt;li&gt;Ability to deploy spiders within dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not actively maintained, last update 2018.&lt;/li&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #5 Crawlab
&lt;/h2&gt;

&lt;p&gt;Whilst &lt;a href="https://github.com/crawlab-team/crawlab"&gt;Crawlab&lt;/a&gt; isn’t a Scrapyd dashboard per se, it is definitely an interesting tool if you are looking for a way to manage all your spiders from one central admin dashboard.&lt;/p&gt;

&lt;p&gt;Crawlab is a Golang-based distributed web crawler admin platform for spider management, regardless of language or framework. This means you can use it with any type of spider, be it based on Python Requests, NodeJS, Golang, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uRwa8izO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/crawlab-team/images/raw/main/20210729/screenshot-home.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uRwa8izO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/crawlab-team/images/raw/main/20210729/screenshot-home.png%3Fraw%3Dtrue" alt="Crawlab Dashboard" width="880" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fact that Crawlab isn't Scrapy specific gives you huge flexibility: if you decide to move away from Scrapy in the future, or need to create a Puppeteer scraper to scrape a particularly difficult site, you can easily add that scraper to your Crawlab setup.&lt;/p&gt;

&lt;p&gt;Of the open-source tools on the list, Crawlab is by far the most comprehensive solution with a whole range of features and functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naturally supports distributed spiders out of the box.&lt;/li&gt;
&lt;li&gt;Schedule cron jobs.&lt;/li&gt;
&lt;li&gt;Task management.&lt;/li&gt;
&lt;li&gt;Results exporting.&lt;/li&gt;
&lt;li&gt;Online code editor.&lt;/li&gt;
&lt;li&gt;Configurable spiders.&lt;/li&gt;
&lt;li&gt;Notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the only downsides to it is that there is a bit of a learning curve to get it set up on your own server.&lt;/p&gt;

&lt;p&gt;As of writing this article, it is the most active open source project on this list.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Crawlab is a very powerful scraper management solution with a huge range of functionality, and is a great option for anyone who is running multiple types of scrapers. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source, and actively maintained.&lt;/li&gt;
&lt;li&gt;Very powerful functionality.&lt;/li&gt;
&lt;li&gt;Ability to deploy any type of scraper (Python, Scrapy, NodeJS, Golang, etc.).&lt;/li&gt;
&lt;li&gt;Very good documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Job monitoring and data visualisation functionality could be better.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>scrapy</category>
      <category>scrapyd</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Scrapyd Guide - Deploy, Schedule &amp; Run Your Scrapy Spiders</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 15:23:44 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-scrapyd-guide-deploy-schedule-run-your-scrapy-spiders-3ip9</link>
      <guid>https://dev.to/iankerins/the-complete-scrapyd-guide-deploy-schedule-run-your-scrapy-spiders-3ip9</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook" rel="noopener noreferrer"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You've built your scraper, tested that it works and now want to schedule it to run every hour, day, etc. and scrape the data you need. But what is the best way to do that?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/scrapy/scrapyd" rel="noopener noreferrer"&gt;Scrapyd&lt;/a&gt; is one of the most popular options. Created by the same developers that developed Scrapy itself, Scrapyd is a tool for running Scrapy spiders in production on remote servers so you don't need to run them on a local machine. &lt;/p&gt;

&lt;p&gt;In this guide, we're going to run through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Is Scrapyd?&lt;/li&gt;
&lt;li&gt;How To Setup Scrapyd?&lt;/li&gt;
&lt;li&gt;Deploying Spiders To Scrapyd&lt;/li&gt;
&lt;li&gt;Controlling Spiders With Scrapyd&lt;/li&gt;
&lt;li&gt;Integrating Scrapyd with ScrapeOps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available, from &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;) to &lt;a href="https://github.com/my8100/scrapydweb" rel="noopener noreferrer"&gt;ScrapydWeb&lt;/a&gt;, &lt;a href="https://github.com/DormyMo/SpiderKeeper" rel="noopener noreferrer"&gt;SpiderKeeper&lt;/a&gt;, and more. &lt;/p&gt;

&lt;p&gt;So if you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui" rel="noopener noreferrer"&gt;Guide to the Best Scrapyd Dashboards&lt;/a&gt;, so you can see the pros and cons of each before you decide on which option to go with.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Scrapyd?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapyd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Scrapyd&lt;/a&gt; is application that allows us to deploy Scrapy spiders on a server and run them remotely using a JSON API. Scrapyd allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run Scrapy jobs.&lt;/li&gt;
&lt;li&gt;Pause &amp;amp; Cancel Scrapy jobs.&lt;/li&gt;
&lt;li&gt;Manage Scrapy project/spider versions.&lt;/li&gt;
&lt;li&gt;Access Scrapy logs remotely. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapyd is a great option for developers who want an easy way to manage production Scrapy spiders that run on a remote server. &lt;/p&gt;

&lt;p&gt;With Scrapyd you can manage multiple servers from one central point by using a ready-made Scrapyd management tool like &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, an open source alternative, or by building your own.&lt;/p&gt;

&lt;p&gt;Here you can check out the full &lt;a href="https://scrapyd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Scrapyd docs&lt;/a&gt; and &lt;a href="https://github.com/scrapy/scrapyd" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Setup Scrapyd
&lt;/h2&gt;

&lt;p&gt;Getting Scrapyd set up is quick and simple. You can run it locally or on a server.&lt;/p&gt;

&lt;p&gt;The first step is to install Scrapyd:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then start the server by using the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start Scrapyd running on &lt;code&gt;http://localhost:6800/&lt;/code&gt;. You can open this url in your browser and you should see the following screen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" alt="Scrapyd Homepage"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Deploying Spiders To Scrapyd
&lt;/h2&gt;

&lt;p&gt;To run jobs using Scrapyd, we first need to eggify and deploy our Scrapy project to the Scrapyd server. To do this, there is an easy-to-use library called &lt;a href="https://github.com/scrapy/scrapyd-client" rel="noopener noreferrer"&gt;scrapyd-client&lt;/a&gt; that makes this process very simple.&lt;/p&gt;

&lt;p&gt;First, let's install scrapyd-client:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/scrapy/scrapyd-client.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once installed, navigate to the Scrapy project you want to deploy and open your &lt;code&gt;scrapy.cfg&lt;/code&gt; file, which should be located in your project's root directory. You should see something like this, with the &lt;strong&gt;"demo"&lt;/strong&gt; text replaced by your Scrapy project's name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy]
#url = http://localhost:6800/
project = demo  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the &lt;code&gt;scrapy.cfg&lt;/code&gt; configuration file defines the endpoint your Scrapy project should be deployed to. To deploy our project to a locally running Scrapyd server, we just need to uncomment the &lt;code&gt;url&lt;/code&gt; value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy]
url = http://localhost:6800/
project = demo  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then run the following command in your Scrapy project's root directory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will then eggify your Scrapy project and deploy it to your locally running Scrapyd server. You should get a result like this in your terminal if it was successful:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scrapyd-deploy default
Packing version 1640086638
Deploying to project "demo" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "DESKTOP-67BR2", "status": "ok", "project": "demo", "version": "1640086638", "spiders": 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now your Scrapy project has been deployed to your Scrapyd server and is ready to be run.&lt;/p&gt;

&lt;h4&gt;
  
  
  Aside: Custom Deployment Endpoints
&lt;/h4&gt;

&lt;p&gt;The above example was the simplest implementation and assumed you were just deploying your Scrapy project to a local Scrapyd server. However, you can customise or add multiple deployment endpoints to the &lt;code&gt;scrapy.cfg&lt;/code&gt; file if you would like.&lt;/p&gt;

&lt;p&gt;For example you can define local and production endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy:local]
url = http://localhost:6800/
project = demo 

[deploy:production]
url = &amp;lt;MY_IP_ADDRESS&amp;gt;
project = demo 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And deploy your Scrapy project locally or to production using these commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Deploy locally
scrapyd-deploy local

## Deploy to production
scrapyd-deploy production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or deploy a specific project by specifying the target and project name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy &amp;lt;target&amp;gt; -p &amp;lt;project&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For more information about this, check out the &lt;a href="https://github.com/scrapy/scrapyd-client" rel="noopener noreferrer"&gt;scrapyd-client docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Controlling Spiders With Scrapyd
&lt;/h2&gt;

&lt;p&gt;Scrapyd comes with a minimal web interface which can be accessed at &lt;a href="http://localhost:6800/" rel="noopener noreferrer"&gt;http://localhost:6800/&lt;/a&gt;; however, this interface gives just a rudimentary overview of what is running on a Scrapyd server and doesn't allow you to control the spiders deployed to it.&lt;/p&gt;

&lt;p&gt;To control your spiders with Scrapyd you have 3 options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapyd JSON API&lt;/li&gt;
&lt;li&gt;Python-Scrapyd-API Library&lt;/li&gt;
&lt;li&gt;Scrapyd Dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Scrapyd JSON API
&lt;/h3&gt;

&lt;p&gt;To schedule, run and cancel jobs on your Scrapyd server, we need to use the JSON API it provides. Depending on the endpoint, the API supports &lt;code&gt;GET&lt;/code&gt; or &lt;code&gt;POST&lt;/code&gt; HTTP requests. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl http://localhost:6800/daemonstatus.json
{ "status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "DESKTOP-67BR2" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The API has the following endpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;daemonstatus.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Checks the status of the Scrapyd server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;addversion.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a version to a project, creating the project if it doesn’t exist.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;schedule.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule a job to run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cancel.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cancel a job. If the job is pending, it will be removed. If the job is running, the job will be shutdown.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listprojects.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of the projects uploaded to the Scrapyd server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listversions.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of versions available for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listspiders.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of the spiders available for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listjobs.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of pending, running and finished jobs for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;delversion.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deletes a project version. If the project only has one version, deletes the project too.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;delproject.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deletes the project, and all associated versions.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://scrapyd.readthedocs.io/en/stable/api.html" rel="noopener noreferrer"&gt;Full API specifications can be found here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can interact with these endpoints using &lt;strong&gt;Python Requests&lt;/strong&gt; or any other HTTP request library, or we can use &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;python-scrapyd-api&lt;/a&gt;, a Python wrapper for the Scrapyd API.&lt;/p&gt;
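&lt;p&gt;For example, here is a minimal sketch of how you could hit the schedule.json and listjobs.json endpoints using Python Requests (the project and spider names below are just placeholders for your own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

SCRAPYD_URL = 'http://localhost:6800'

## Schedule a job by POSTing the project and spider name
response = requests.post(
    SCRAPYD_URL + '/schedule.json',
    data={'project': 'demo', 'spider': 'my_spider'},
)
print(response.json())
## e.g. {'status': 'ok', 'jobid': '...', 'node_name': '...'}

## List pending, running and finished jobs for that project
jobs = requests.get(SCRAPYD_URL + '/listjobs.json', params={'project': 'demo'})
print(jobs.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;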




&lt;h3&gt;
  
  
  Python-Scrapyd-API Library
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;python-scrapyd-api&lt;/a&gt; provides a clean and easy to use Python wrapper around the Scrapyd JSON API, which can simplify your code.&lt;/p&gt;

&lt;p&gt;First, we need to install it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install python-scrapyd-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then in our code we need to import the library and configure it to interact with our Scrapyd server by passing it the Scrapyd IP address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from scrapyd_api import ScrapydAPI
&amp;gt;&amp;gt;&amp;gt; scrapyd = ScrapydAPI('http://localhost:6800')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here, we can use the built in methods to interact with the Scrapyd server.&lt;/p&gt;




&lt;h4&gt;
  
  
  Check Daemon Status
&lt;/h4&gt;

&lt;p&gt;Checks the status of the Scrapyd server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.daemon_status()
{u'finished': 0, u'running': 0, u'pending': 0, u'node_name': u'DESKTOP-67BR2'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  List All Projects
&lt;/h4&gt;

&lt;p&gt;Returns a list of the projects uploaded to the Scrapyd server. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.list_projects()
[u'demo', u'quotes_project']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  List All Spiders
&lt;/h4&gt;

&lt;p&gt;Enter the project name, and it will return a list of the spiders available for the requested project.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.list_spiders('project_name')
[u'raw_spider', u'js_enhanced_spider', u'selenium_spider']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  Run a Job
&lt;/h4&gt;

&lt;p&gt;Run a Scrapy spider by specifying the project and spider name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.schedule('project_name', 'spider_name')
# Returns the Scrapyd job id.
u'14a6599ef67111e38a0e080027880ca6'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pass custom settings using the &lt;code&gt;settings&lt;/code&gt; argument.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; settings = {'DOWNLOAD_DELAY': 2}
&amp;gt;&amp;gt;&amp;gt; scrapyd.schedule('project_name', 'spider_name', settings=settings)
u'25b6588ef67333e38a0e080027880de7'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One important thing to note about the schedule.json API endpoint: even though it is called schedule.json, using it only adds a job to Scrapyd's internal scheduler queue, which will be run when a slot is free.&lt;/p&gt;

&lt;p&gt;This endpoint doesn't have the functionality to schedule a job to run at a specific time in the future; Scrapyd will simply add the job to a queue and run it once a slot becomes available.&lt;/p&gt;

&lt;p&gt;To actually schedule a job to run at a specific date/time in the future, or periodically at a specific time, you will need to control this scheduling on your end. Tools like &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; will do this for you.&lt;/p&gt;
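&lt;p&gt;As a very rough sketch of what controlling the scheduling on your end could look like, the snippet below reuses the python-scrapyd-api client from above plus the standard library to queue a spider roughly once an hour. In practice you would more likely use cron, a task scheduler, or a tool like ScrapeOps, and the project/spider names here are just placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')

## Minimal periodic scheduler: queue the spider roughly once an hour.
while True:
    job_id = scrapyd.schedule('project_name', 'spider_name')
    print('Queued job:', job_id)
    time.sleep(60 * 60)  ## wait one hour before queueing the next run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;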




&lt;h4&gt;
  
  
  Cancel a Running Job
&lt;/h4&gt;

&lt;p&gt;Cancel a running job by sending the project name and the job_id.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.cancel('project_name', '14a6599ef67111e38a0e080027880ca6')
# Returns the "previous state" of the job before it was cancelled: 'running' or 'pending'.
'running'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When sent, it will return the "previous state" of the job before it was cancelled. You can verify that the job was actually cancelled by checking the job's status.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.job_status('project_name', '14a6599ef67111e38a0e080027880ca6')
# Returns 'running', 'pending', 'finished' or '' for unknown state.
'finished'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For more functionality, check out the &lt;strong&gt;python-scrapyd-api&lt;/strong&gt; &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;documentation here&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scrapyd Dashboard
&lt;/h3&gt;

&lt;p&gt;Using Scrapyd's JSON API to control your spiders is possible; however, it isn't ideal as you will need to create custom workflows on your end to monitor, manage and run your spiders, which can become a major project in itself if you need to manage spiders spread across multiple servers.&lt;/p&gt;

&lt;p&gt;Other developers ran into this problem too, so luckily for us they decided to create free and open-source Scrapyd dashboards that can connect to your Scrapyd servers, letting you manage everything from a single dashboard.&lt;/p&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/python-scrapy-playbook/extensions/scrapydweb-guide"&gt;ScrapydWeb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/DormyMo/SpiderKeeper" rel="noopener noreferrer"&gt;SpiderKeeper&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui" rel="noopener noreferrer"&gt;Guide to the Best Scrapyd Dashboards here&lt;/a&gt;.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating Scrapyd with ScrapeOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; is a free monitoring tool for web scraping that also has a Scrapyd dashboard that allows you to schedule, run and manage all your scrapers from a single dashboard. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;Unlike the other Scrapyd dashboards, ScrapeOps is a full end-to-end web scraping monitoring and management tool that automatically sets up all the monitors, health checks and alerts for you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once set up, ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs against your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;There are two steps to integrate ScrapeOps with your Scrapyd servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install ScrapeOps Logger Extension&lt;/li&gt;
&lt;li&gt;Connect ScrapeOps to Your Scrapyd Servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can't connect ScrapeOps to a Scrapyd server that is only running locally and doesn't expose a public IP address for ScrapeOps to connect to.&lt;/p&gt;

&lt;p&gt;Once set up, you will be able to schedule, run and manage all your Scrapyd servers from one dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" alt="ScrapeOps Dashboard Demo"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: Install Scrapy Logger Extension
&lt;/h3&gt;

&lt;p&gt;For ScrapeOps to monitor your scrapers, create dashboards and trigger alerts, you need to install the ScrapeOps logger extension in each of your Scrapy projects.&lt;/p&gt;

&lt;p&gt;Simply install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Connect ScrapeOps to Your Scrapyd Servers
&lt;/h3&gt;

&lt;p&gt;The next step is giving ScrapeOps the connection details of your Scrapyd servers so that you can manage them from the dashboard. &lt;/p&gt;

&lt;h4&gt;
  
  
  Enter Scrapyd Server Details
&lt;/h4&gt;

&lt;p&gt;Within your dashboard go to the &lt;a href="https://scrapeops.io/app/servers" rel="noopener noreferrer"&gt;Servers page&lt;/a&gt; and click on the &lt;strong&gt;Add Scrapyd Server&lt;/strong&gt; button at the top of the page.&lt;/p&gt;

&lt;p&gt;In the dropdown section that appears, enter your connection details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server Name&lt;/li&gt;
&lt;li&gt;Server Domain Name (optional)&lt;/li&gt;
&lt;li&gt;Server IP Address&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Whitelist Our Server (Optional)
&lt;/h4&gt;

&lt;p&gt;Depending on how you are securing your Scrapyd server, you might need to whitelist our IP address so it can connect to your Scrapyd servers. There are two options to do this:&lt;/p&gt;


&lt;h4&gt;
  
  
  Option 1: Auto Install (Ubuntu)
&lt;/h4&gt;

&lt;p&gt;SSH into your server as root and run the following command in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; scrapeops_setup.sh &lt;span class="s2"&gt;"https://assets-scrapeops.nyc3.digitaloceanspaces.com/Bash_Scripts/scrapeops_setup.sh"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; bash scrapeops_setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will begin the provisioning process for your server, and will configure the server so that Scrapyd can be managed by ScrapeOps.&lt;/p&gt;


&lt;h4&gt;
  
  
  Option 2: Manual Install
&lt;/h4&gt;

&lt;p&gt;This step is optional but needed if you want to run/stop/re-run/schedule any jobs using our site. If we cannot reach your server via port 80 or 443, the server will be listed as read only.&lt;/p&gt;

&lt;p&gt;The following steps should work on Linux/Unix-based servers that have the UFW firewall installed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Log into your server via SSH&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Allow SSH so that you don't get locked out of your server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Allow incoming connections from 46.101.44.87&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow from 46.101.44.87 to any port 443,80 proto tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Enable ufw &amp;amp; check firewall rules are implemented&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable
sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Install Nginx &amp;amp; set up a reverse proxy to let connections from ScrapeOps reach your Scrapyd server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;nginx &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the proxy_pass &amp;amp; proxy_set_header lines below into the "location" block of your nginx default config file (usually found in /etc/nginx/sites-available).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;proxy_pass http://localhost:6800/&lt;span class="p"&gt;;&lt;/span&gt;
proxy_set_header X-Forwarded-Proto http&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
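&lt;p&gt;For context, a stripped-down default config with those two lines added might look something like the sketch below (the listen port and server_name are assumptions, so adapt them to your own setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## /etc/nginx/sites-available/default

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://localhost:6800/;
        proxy_set_header X-Forwarded-Proto http;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;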



&lt;p&gt;Reload your nginx config&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl reload nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this is done you should be able to run, re-run, stop, schedule jobs for this server from the ScrapeOps dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for how to use Scrapyd to run your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook/" rel="noopener noreferrer"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapyd</category>
      <category>scrapy</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Guide To ScrapydWeb, Get Setup In 3 Minutes!</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 14:05:12 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-guide-to-scrapydweb-get-setup-in-3-minutes-3ib</link>
      <guid>https://dev.to/iankerins/the-complete-guide-to-scrapydweb-get-setup-in-3-minutes-3ib</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/my8100/scrapydweb"&gt;ScrapydWeb&lt;/a&gt; is the most popular open source &lt;a href="https://github.com/scrapy/scrapyd"&gt;Scrapyd&lt;/a&gt; admin dashboards. Boasting 2,400 Github stars, ScrapydWeb has been fully embraced by the Scrapy community.&lt;/p&gt;

&lt;p&gt;In this guide, we're going to run through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Is ScrapydWeb?&lt;/li&gt;
&lt;li&gt;How To Setup ScrapydWeb?&lt;/li&gt;
&lt;li&gt;Using ScrapydWeb&lt;/li&gt;
&lt;li&gt;Alternatives To ScrapydWeb&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available, from &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo"&gt;Live Demo&lt;/a&gt;) to &lt;a href="https://github.com/DormyMo/SpiderKeeper"&gt;SpiderKeeper&lt;/a&gt;, and &lt;a href="https://github.com/Gerapy/Gerapy"&gt;Gerapy&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;So if you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui"&gt;Guide to the Best Scrapyd Dashboards&lt;/a&gt;, so you can see the pros and cons of each before you decide on which option to go with.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is ScrapydWeb?
&lt;/h2&gt;

&lt;p&gt;ScrapydWeb is an admin dashboard that is designed to make interacting with Scrapyd daemons much easier. It allows you to schedule, run and view your scraping jobs across multiple servers in one easy-to-use dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" alt="Scrapydweb Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This addresses the main problem with the default Scrapyd setup: the fact that its user interface has very limited functionality and is pretty ugly.&lt;/p&gt;

&lt;p&gt;Although there are many other Scrapyd dashboards out there, ScrapydWeb quickly became the most popular option after its launch in 2018 because of its ease of use and the extra functionality it offered compared to the other alternatives at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;💠 &lt;strong&gt;Scrapyd Cluster Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💯 All Scrapyd JSON API Supported&lt;/li&gt;
&lt;li&gt;☑️ Group, filter and select any number of nodes&lt;/li&gt;
&lt;li&gt;🖱️ Execute command on multinodes with just a few clicks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;🔍 &lt;strong&gt;Scrapy Log Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 Stats collection&lt;/li&gt;
&lt;li&gt;📈 Progress visualization&lt;/li&gt;
&lt;li&gt;📑 Logs categorization&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;🔋 &lt;strong&gt;Enhancements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 Auto packaging&lt;/li&gt;
&lt;li&gt;🕵️‍♂️ Integrated with &lt;a href="https://github.com/my8100/logparser"&gt;🔗 &lt;em&gt;LogParser&lt;/em&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;⏰ Timer tasks&lt;/li&gt;
&lt;li&gt;📧 Monitor &amp;amp; Alert&lt;/li&gt;
&lt;li&gt;📱 Mobile UI&lt;/li&gt;
&lt;li&gt;🔐 Basic auth for web UI&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How To Setup ScrapydWeb?
&lt;/h2&gt;

&lt;p&gt;Getting set up with ScrapydWeb is pretty simple. You just need to install the ScrapydWeb package and connect it to your Scrapyd server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Scrapyd Server
&lt;/h3&gt;

&lt;p&gt;To run through the installation process, we're first going to need to have a Scrapyd server set up with a project running on it. (You can skip this step if you already have a Scrapyd server set up.)&lt;/p&gt;

&lt;p&gt;If you would like an in-depth walkthrough of what Scrapyd is and how to set it up, then check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/extensions/scrapy-scrapyd-guide"&gt;Scrapyd guide here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Install Scrapyd
&lt;/h4&gt;

&lt;p&gt;The first step is to install Scrapyd:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then start the server by using the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start Scrapyd running on &lt;code&gt;http://localhost:6800/&lt;/code&gt;. You can open this url in your browser and you should see the following screen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_xvNs2-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_xvNs2-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" alt="Scrapyd Homepage" width="880" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Deploy Scrapy Project to Scrapyd
&lt;/h4&gt;

&lt;p&gt;To run jobs using Scrapyd, we first need to eggify and deploy our Scrapy project to the Scrapyd server. Luckily, there is an easy-to-use library called &lt;a href="https://github.com/scrapy/scrapyd-client"&gt;scrapyd-client&lt;/a&gt; to do this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/scrapy/scrapyd-client.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once installed, navigate to your Scrapy project, open the &lt;code&gt;scrapy.cfg&lt;/code&gt; file and uncomment the url line under &lt;code&gt;[deploy]&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  ## demo will be the name of your scrapy project

[deploy]
url = http://localhost:6800/
project = demo  ## demo will be the name of your scrapy project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;[deploy]&lt;/code&gt; section configures the url of the Scrapyd endpoint the project should be deployed to, and the &lt;code&gt;project&lt;/code&gt; field tells Scrapyd which project is being deployed.&lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;scrapy.cfg&lt;/code&gt; file configured, we are now able to deploy the project to the Scrapyd server. To do this, navigate to the Scrapy project you want to deploy in your command line and enter the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When you run this command, you should get a response like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scrapyd-deploy default
Packing version 1640086638
Deploying to project "scrapy_demo" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "DESKTOP-67BR2", "status": "ok", "project": "scrapy_demo", "version": "1640086638", "spiders": 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure you have your Scrapyd server running, otherwise you will get an error. &lt;/p&gt;
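
&lt;p&gt;If you want to double-check the deployment went through, you can also query Scrapyd's &lt;code&gt;listprojects.json&lt;/code&gt; and &lt;code&gt;listspiders.json&lt;/code&gt; endpoints from Python. A rough sketch, using the project name from the example response above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## verify_deploy.py (example; assumes Scrapyd on localhost:6800)
import requests

## List the projects deployed to the Scrapyd server
projects = requests.get('http://localhost:6800/listprojects.json').json()
print(projects)  ## e.g. {"status": "ok", "projects": ["scrapy_demo"]}

## List the spiders available in a given project
spiders = requests.get(
    'http://localhost:6800/listspiders.json',
    params={'project': 'scrapy_demo'},
).json()
print(spiders)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;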

&lt;p&gt;Now that we have a Scrapyd server set up and a Scrapy project deployed to it, we can control it with ScrapydWeb.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing ScrapydWeb
&lt;/h3&gt;

&lt;p&gt;Getting ScrapydWeb installed and set up is super easy. (This is a big reason why it has become so popular.)&lt;/p&gt;

&lt;p&gt;To get started we need to install the latest version of ScrapydWeb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade git+https://github.com/my8100/scrapydweb.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, to run ScrapydWeb we just need to use the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapydweb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will build a ScrapydWeb instance for you, create the necessary settings files and launch a ScrapydWeb server on &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Sometimes the first time you run &lt;code&gt;scrapydweb&lt;/code&gt; it will just create the ScrapydWeb files but won't start the server. If this happens just run the &lt;code&gt;scrapydweb&lt;/code&gt; command again and it will start the server. &lt;/p&gt;

&lt;p&gt;Now, when you open &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt; in your browser you should see a screen like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ItH3g7T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-jobs2-25ef8a459ead57b5290b7bbcbb7ada1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ItH3g7T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-jobs2-25ef8a459ead57b5290b7bbcbb7ada1a.png" alt="ScrapydWeb Jobs" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Logparser
&lt;/h3&gt;

&lt;p&gt;With the current setup you can use ScrapydWeb to schedule and run your scraping jobs, but you won't see any stats for your jobs in your dashboard. &lt;/p&gt;

&lt;p&gt;Not to worry though, the developers behind ScrapydWeb have created a library called &lt;a href="https://github.com/my8100/logparser"&gt;Logparser&lt;/a&gt; to solve exactly this problem.&lt;/p&gt;

&lt;p&gt;If you run Logparser in the same directory as your Scrapyd server, it will automatically parse your Scrapy logs and make them available to your ScrapydWeb dashboard.&lt;/p&gt;

&lt;p&gt;To install Logparser, enter the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install logparser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then in the same directory as your Scrapyd server, run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logparser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start a daemon that will automatically parse your Scrapy logs for ScrapydWeb to consume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are running Scrapyd and ScrapydWeb on the same machine then it is recommended to set the &lt;code&gt;LOCAL_SCRAPYD_LOGS_DIR&lt;/code&gt; path to your log files directory and &lt;code&gt;ENABLE_LOGPARSER&lt;/code&gt; to &lt;strong&gt;True&lt;/strong&gt; in your ScrapydWeb's settings file.&lt;/p&gt;
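
&lt;p&gt;As a rough sketch, those two settings would look something like this in the settings file ScrapydWeb generated on first run (the filename is version-dependent, something like &lt;code&gt;scrapydweb_settings_vN.py&lt;/code&gt;, and the log path below is just a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## ScrapydWeb settings file (values below are placeholders)

## Point ScrapydWeb at the directory where your local Scrapyd server writes its logs
LOCAL_SCRAPYD_LOGS_DIR = '/home/yourusername/logs'

## Let ScrapydWeb start and manage Logparser for you
ENABLE_LOGPARSER = True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;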

&lt;p&gt;At this point, you will have a running Scrapyd server, a running logparser instance, and a running ScrapydWeb server. From here, we are ready to use ScrapydWeb to schedule, run and monitor our jobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;Now let's look at how we can actually use ScrapydWeb to run and monitor our jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting Scrapyd Servers
&lt;/h3&gt;

&lt;p&gt;Adding Scrapyd servers to your ScrapydWeb dashboard is pretty simple. You just need to edit your ScrapydWeb settings file.&lt;/p&gt;

&lt;p&gt;By default, ScrapydWeb is set up to connect to a locally running Scrapyd server on localhost:6800.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SCRAPYD_SERVERS = [
    '127.0.0.1:6800',
    # 'username:password@localhost:6801#group', ## string format
    #('username', 'password', 'localhost', '6801', 'group'), ## tuple format
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to connect to remote Scrapyd servers, just add them to the above array and restart the server. You can add servers in either string or tuple format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; you need to make sure &lt;code&gt;bind_address = 0.0.0.0&lt;/code&gt; is set in your Scrapyd config file, and restart Scrapyd to make it visible externally. &lt;/p&gt;

&lt;p&gt;With this done, you should see something like this on your servers page: &lt;a href="http://127.0.0.1:5000/1/servers/"&gt;http://127.0.0.1:5000/1/servers/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NP47DH6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-servers-9eff0d0cfc50a46bdb987b9608e8e2d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NP47DH6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-servers-9eff0d0cfc50a46bdb987b9608e8e2d1.png" alt="ScrapydWeb Servers" width="880" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Spiders
&lt;/h3&gt;

&lt;p&gt;Now, with your server connected, we are able to schedule and run spiders from the projects that have been deployed to our Scrapyd server.&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;strong&gt;Run Spider&lt;/strong&gt; page (&lt;a href="http://127.0.0.1:5000/1/schedule/"&gt;http://127.0.0.1:5000/1/schedule/&lt;/a&gt;), and you will be able to select and run spiders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wkzW1SyW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/run.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wkzW1SyW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/run.gif" alt="ScrapydWeb Running Spiders" width="730" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will then send a &lt;code&gt;POST&lt;/code&gt; request to the &lt;code&gt;/schedule.json&lt;/code&gt; endpoint of your Scrapyd server, triggering Scrapyd to run your spider.&lt;/p&gt;
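
&lt;p&gt;Under the hood this is the same &lt;code&gt;schedule.json&lt;/code&gt; call you could make yourself. A minimal sketch with the &lt;code&gt;requests&lt;/code&gt; library (the project and spider names below are placeholders; use whichever project you deployed earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## schedule_job.py (example; project/spider names are placeholders)
import requests

response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'demo', 'spider': 'bookspider'},
)

## Scrapyd responds with the id of the job it just queued, e.g.
## {"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
print(response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;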

&lt;p&gt;You can also schedule jobs to run periodically by enabling the &lt;strong&gt;timer task&lt;/strong&gt; toggle and entering your cron details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Stats
&lt;/h3&gt;

&lt;p&gt;When Logparser is running, ScrapydWeb will periodically poll the Scrapyd logs endpoint and display your job stats so you can see how they have performed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XVyE4d0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/jobs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XVyE4d0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/jobs.png" alt="ScrapydWeb Job Stats Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternatives To ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;There are many alternatives to ScrapydWeb, each offering different functionality and flexibility. We've summarised them in this article here: &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui"&gt;Guide to the 5 Best Scrapyd Dashboards&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are still open to other options then we would highly recommend that you give ScrapeOps a try. &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; does everything ScrapydWeb does and more. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not only can you schedule, run and manage spiders on Scrapyd servers like you can with ScrapydWeb, but ScrapeOps is also a fully fledged job monitoring solution for web scraping. &lt;/p&gt;

&lt;p&gt;It allows you to monitor jobs, view the results in numerous dashboards, run automatic job health checks, receive alerts and more. &lt;/p&gt;

&lt;p&gt;What's more, the monitoring and scheduling parts of ScrapeOps are separate. So if you would like to use ScrapydWeb for job scheduling, you can still integrate the ScrapeOps Scrapy extension, which will log your scraping data and populate your monitoring dashboards.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>scrapydweb</category>
      <category>scrapyd</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Guide To Scrapy Spidermon, Start Monitoring in 5 Minutes!</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 09:51:47 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-guide-to-scrapy-spidermon-start-monitoring-in-5-minutes-2aii</link>
      <guid>https://dev.to/iankerins/the-complete-guide-to-scrapy-spidermon-start-monitoring-in-5-minutes-2aii</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have done a lot of web scraping, the one thing you know for certain is that your scrapers always break and degrade over time. &lt;/p&gt;

&lt;p&gt;Web scraping isn't like other software applications, where for the most part you control all the variables. In web scraping, you are writing scrapers that are trying to extract data from moving targets. &lt;/p&gt;

&lt;p&gt;Websites can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change the HTML structure of their pages.&lt;/li&gt;
&lt;li&gt;Implement new anti-bot countermeasures.&lt;/li&gt;
&lt;li&gt;Block whole ranges of IPs from accessing their site.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of which can degrade or completely break your scrapers. Because of this, it is vital that you have a robust monitoring and alerting setup in place for your web scrapers so you can react immediately when your spiders eventually begin to break.&lt;/p&gt;

&lt;p&gt;In this guide, we're going to walk you through Spidermon, a Scrapy extension that is designed to make monitoring your scrapers easier and more effective.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Spidermon?&lt;/li&gt;
&lt;li&gt;Integrating Spidermon&lt;/li&gt;
&lt;li&gt;Spidermon Monitors&lt;/li&gt;
&lt;li&gt;Spidermon MonitorSuites&lt;/li&gt;
&lt;li&gt;Spidermon Actions&lt;/li&gt;
&lt;li&gt;Item Validation&lt;/li&gt;
&lt;li&gt;End-to-End Spidermon Example + Code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For more scraping monitoring solutions, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook/how-to-monitor-scrapy-spiders/"&gt;the full list of Scrapy monitoring options here&lt;/a&gt;. Including &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt;, the purpose built job monitoring &amp;amp; scheduling tool for web scraping. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Spidermon?
&lt;/h2&gt;

&lt;p&gt;Spidermon is a Scrapy extension to build monitors for Scrapy spiders. Built by the same developers that develop and maintain Scrapy, Spidermon is a highly versatile and customisable monitoring framework for Scrapy which greatly expands the default stats collection and logging functionality within Scrapy.&lt;/p&gt;

&lt;p&gt;Spidermon allows you to create custom monitors that will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor your scrapers with template &amp;amp; custom monitors.&lt;/li&gt;
&lt;li&gt;Validate the data being scraped from each page.&lt;/li&gt;
&lt;li&gt;Notify you with the results of those checks. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spidermon is highly customisable, so if you can track a stat then you will be able to create a Spidermon monitor to monitor it in real-time.&lt;/p&gt;

&lt;p&gt;Spidermon is centered around &lt;strong&gt;Monitors&lt;/strong&gt;, &lt;strong&gt;MonitorSuites&lt;/strong&gt;, &lt;strong&gt;Validators&lt;/strong&gt; and &lt;strong&gt;Actions&lt;/strong&gt;, which are then used to monitor your scraping jobs and alert you if any tests are failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating Spidermon
&lt;/h2&gt;

&lt;p&gt;Getting set up with Spidermon is straightforward, but you do need to manually set up your monitors after installing the Spidermon extension. &lt;/p&gt;

&lt;p&gt;To get started you need to install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here, you need to define your &lt;strong&gt;Monitors&lt;/strong&gt;, &lt;strong&gt;Validators&lt;/strong&gt; and &lt;strong&gt;Actions&lt;/strong&gt;, then schedule them to run with your &lt;strong&gt;MonitorSuites&lt;/strong&gt;. We will go through each of these in this guide.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon Monitors
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Monitor&lt;/strong&gt; is the core piece of Spidermon. Built on top of Python's unittest framework, a monitor is a series of unit tests you define that check the scraping stats of your job against predefined thresholds. &lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Monitors
&lt;/h3&gt;

&lt;p&gt;Out of the box, Spidermon has a number of &lt;a href="https://spidermon.readthedocs.io/en/latest/monitors.html#the-basic-monitors"&gt;basic monitors&lt;/a&gt; built in, which you just need to enable and configure in your project or spider settings to activate for your jobs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monitors&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ItemCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if spider extracted the minimum number of items threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ItemValidationMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check for item validation errors if item validation pipelines are enabled.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FieldCoverageMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if field coverage rules are met.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ErrorCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the number of errors versus a threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WarningCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the number of warnings versus a threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FinishReasonMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if a job finished for an expected finish reason.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RetryCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if any requests have reached the maximum amount of retries and the crawler had to drop those requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DownloaderExceptionMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the amount of downloader exceptions (timeouts, rejected connections, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SuccessfulRequestsMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the total number of successful requests made.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TotalRequestsMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the total number of requests made.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To use any of these monitors you will need to define the thresholds for each of them in your &lt;code&gt;settings.py&lt;/code&gt; file or your spider's custom settings.&lt;/p&gt;
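
&lt;p&gt;For example, here is a sketch of what those threshold settings can look like in your &lt;code&gt;settings.py&lt;/code&gt; file. The setting names are taken from the Spidermon basic monitors documentation (double-check them against your Spidermon version); the values are placeholders you should tune per spider:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py (example thresholds, adjust to your spider)

## ItemCountMonitor: fail if fewer than 100 items are scraped
SPIDERMON_MIN_ITEMS = 100

## ErrorCountMonitor: fail if more than 20 errors are logged
SPIDERMON_MAX_ERRORS = 20

## FinishReasonMonitor: fail if the job finishes for any other reason
SPIDERMON_EXPECTED_FINISH_REASONS = ['finished']

## FieldCoverageMonitor: fail if the 'price' field is filled in under 90% of items
SPIDERMON_FIELD_COVERAGE_RULES = {
    'dict/price': 0.9,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;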



&lt;h3&gt;
  
  
  Custom Monitors
&lt;/h3&gt;

&lt;p&gt;With Spidermon you can also create your own custom monitors that can do just about anything. They can work with any type of stat that is being tracked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Requests&lt;/li&gt;
&lt;li&gt;✅ Responses&lt;/li&gt;
&lt;li&gt;✅ Pages Scraped&lt;/li&gt;
&lt;li&gt;✅ Items Scraped&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;li&gt;✅ HTTP Response Codes&lt;/li&gt;
&lt;li&gt;✅ Retries&lt;/li&gt;
&lt;li&gt;✅ Custom Stats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, you can create a monitor to verify any stat that appears in the Scrapy stats (either the default stats, or custom stats you configure your spider to insert).&lt;/p&gt;
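
&lt;p&gt;So if the built-in stats don't cover what you care about, you can push a custom stat from your spider and then monitor it like any other. A quick sketch (the spider, selector and stat name are just examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/my_spider.py (example: pushing a custom stat into the Scrapy stats)
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        products = response.css('article.product_pod')

        ## Track pages that came back without the element we expect,
        ## then check 'custom/empty_pages' in a Spidermon monitor
        if not products:
            self.crawler.stats.inc_value('custom/empty_pages')

        for product in products:
            yield {'title': product.css('h3 a::attr(title)').get()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;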

&lt;p&gt;Here is an example of a simple monitor that will check the number of items scraped versus a minimum threshold.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# my_project/monitors.py
from spidermon import Monitor, monitors

@monitors.name('Item count')
class CustomItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 10

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To run a &lt;strong&gt;Monitor&lt;/strong&gt;, they need to be included in a &lt;strong&gt;MonitorSuite&lt;/strong&gt;. &lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon MonitorSuites
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;MonitorSuite&lt;/strong&gt; is how you activate your &lt;strong&gt;Monitors&lt;/strong&gt;. It tells Spidermon when you would like your monitors to run and what actions Spidermon should take if your scrape passes or fails any of your health checks.&lt;/p&gt;

&lt;p&gt;There are three built-in types of &lt;strong&gt;MonitorSuites&lt;/strong&gt; within Spidermon:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MonitorSuites&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_SPIDER_OPEN_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors when Spider starts running.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_SPIDER_CLOSE_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors when Spider has finished scraping.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_PERIODIC_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors at periodic intervals that you can define.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Within these &lt;strong&gt;MonitorSuites&lt;/strong&gt; you can specify which actions should be taken after the &lt;strong&gt;Monitors&lt;/strong&gt; have been executed.&lt;/p&gt;

&lt;p&gt;To create a &lt;strong&gt;MonitorSuite&lt;/strong&gt;, simply create a new &lt;strong&gt;MonitorSuite&lt;/strong&gt; class, and define which monitors you want to run and what actions should be taken afterwards:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## tutorial/monitors.py
from spidermon.core.suites import MonitorSuite

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        CustomItemCountMonitor, ## defined above
    ]

    monitors_finished_actions = [
        # actions to execute when suite finishes its execution
    ]

    monitors_failed_actions = [
        # actions to execute when suite finishes its execution with a failed monitor
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add that &lt;strong&gt;MonitorSuite&lt;/strong&gt; to the &lt;code&gt;SPIDERMON_SPIDER_CLOSE_MONITORS&lt;/code&gt; tuple in your &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;##settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'tutorial.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now Spidermon will run this &lt;strong&gt;MonitorSuite&lt;/strong&gt; at the end of every job. &lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon Actions
&lt;/h2&gt;

&lt;p&gt;The final piece of your &lt;strong&gt;MonitorSuite&lt;/strong&gt; are &lt;strong&gt;Actions&lt;/strong&gt;, which define what happens after a set of monitors has been run.&lt;/p&gt;

&lt;p&gt;Spidermon has pre-built &lt;strong&gt;Action&lt;/strong&gt; templates already included, but you can easily create your own custom &lt;strong&gt;Actions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is a list of the pre-built &lt;strong&gt;Action&lt;/strong&gt; templates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Actions&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Email&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send alerts or job reports to you and your team.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send slack notifications to any channel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telegram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send alerts or reports to any Telegram channel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job Tags&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Set tags on your jobs when using Scrapy Cloud.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Report&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create and save a HTML report locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Report&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create and save a HTML report to a S3 bucket.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sentry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send custom messages to Sentry.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For example, to get Slack notifications when a job fails one of your monitors, you can use the pre-built &lt;strong&gt;SendSlackMessageSpiderFinished&lt;/strong&gt; action by adding your Slack details to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;##settings.py
SPIDERMON_SLACK_SENDER_TOKEN = '&amp;lt;SLACK_SENDER_TOKEN&amp;gt;'
SPIDERMON_SLACK_SENDER_NAME = '&amp;lt;SLACK_SENDER_NAME&amp;gt;'
SPIDERMON_SLACK_RECIPIENTS = ['@yourself', '#yourprojectchannel']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then including &lt;strong&gt;SendSlackMessageSpiderFinished&lt;/strong&gt; in your &lt;strong&gt;MonitorSuite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## tutorial/monitors.py
from spidermon.core.suites import MonitorSuite
from spidermon.contrib.actions.slack.notifiers import SendSlackMessageSpiderFinished

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        CustomItemCountMonitor, 
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Item Validation
&lt;/h2&gt;

&lt;p&gt;One really powerful feature of Spidermon is its support for Item validation. Using &lt;a href="https://schematics.readthedocs.io/en/latest/"&gt;schematics&lt;/a&gt; or &lt;a href="https://json-schema.org/"&gt;JSON Schema&lt;/a&gt;, you can define custom unit tests on fields of each Item.&lt;/p&gt;

&lt;p&gt;For example, we can have Spidermon test that every product item we scrape has a valid product url, that its price is a number and doesn’t include any currency signs or special characters, etc. &lt;/p&gt;

&lt;p&gt;Here is an example product item validator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## validators.py
from schematics.models import Model
from schematics.types import URLType, StringType, ListType, DecimalType

class ProductItem(Model):
    url = URLType(required=True)
    name = StringType(required=True)
    price = DecimalType(required=True)
    features = ListType(StringType)
    image_url = URLType()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This validator can be enabled in your spider by activating Spidermon's &lt;strong&gt;ItemValidationPipeline&lt;/strong&gt; and telling Spidermon to use the &lt;strong&gt;ProductItem&lt;/strong&gt; validator class we just created in your project's &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

SPIDERMON_VALIDATION_MODELS = (
    'tutorial.validators.ProductItem',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This validator will then append new stats to your Scrapy stats which you can then use in your &lt;strong&gt;Monitors&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## log file
...
'spidermon/validation/fields': 400,
'spidermon/validation/items': 100,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
[scrapy.core.engine] INFO: Spider closed (finished)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  End-to-End Spidermon Example
&lt;/h2&gt;

&lt;p&gt;Now, we're going to run through a full Spidermon example so that you can see how to setup your own monitoring suite. &lt;/p&gt;

&lt;p&gt;The full code from this example is available on &lt;a href="https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/4.%20Scrapy%20Extensions/spidermon/spidermon_demo"&gt;Github here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scrapy Project
&lt;/h3&gt;

&lt;p&gt;First things first, we need a Scrapy project, a spider and a website to scrape. In this case &lt;a href="https://books.toscrape.com/"&gt;books.toscrape.com&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject spidermon_demo
scrapy genspider bookspider books.toscrape.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next we need to create a Scrapy Item for the data we want to scrape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## items.py
import scrapy

class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally we need to write the spider code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/bookspider.py
import scrapy
from spidermon_demo.items import BookItem

class BookSpider(scrapy.Spider):
  name = 'bookspider'
  start_urls = ["http://books.toscrape.com"]

  def parse(self, response):

    for article in response.css('article.product_pod'):
      book_item = BookItem(
        url = article.css("h3 &amp;gt; a::attr(href)").get(),
        title = article.css("h3 &amp;gt; a::attr(title)").extract_first(),
        price = article.css(".price_color::text").extract_first(),
      )
      yield book_item

    next_page_url = response.css("li.next &amp;gt; a::attr(href)").get()
    if next_page_url:
      yield response.follow(url=next_page_url, callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By now, you should have a working spider that will scrape every page of &lt;a href="https://books.toscrape.com/"&gt;books.toscrape.com&lt;/a&gt;. Next we integrate Spidermon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrate Spidermon
&lt;/h3&gt;

&lt;p&gt;To install Spidermon just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create Item Validator
&lt;/h3&gt;

&lt;p&gt;For this example, we're going to validate the Items we scrape to make sure all fields are scraped and the data is valid. To do this we need to create a validator, which is pretty simple.&lt;/p&gt;

&lt;p&gt;First, we're going to need to install the schematics library:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install schematics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, we will define our validator for our BookItem model in a new &lt;code&gt;validators.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## validators.py
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class BookItem(Model):
    url = URLType(required=True)
    title = StringType(required=True)
    price = StringType(required=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then enable this validator in our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

SPIDERMON_VALIDATION_MODELS = (
    'spidermon_demo.validators.BookItem',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, when you run your spider Spidermon will validate every item being scraped and update the Scrapy Stats with the results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Scrapy Stats Output
(...)
'spidermon/validation/fields': 3000,
'spidermon/validation/fields/errors': 1000,
'spidermon/validation/fields/errors/invalid_url': 1000,
'spidermon/validation/fields/errors/invalid_url/url': 1000,
'spidermon/validation/items': 1000,
'spidermon/validation/items/errors': 1000,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see from these stats that the &lt;strong&gt;url&lt;/strong&gt; field of our &lt;strong&gt;BookItem&lt;/strong&gt; is failing all the validation checks. Digging deeper, we find the reason is that the scraped urls are relative urls (&lt;code&gt;catalogue/a-light-in-the-attic_1000/index.html&lt;/code&gt;), not absolute urls.&lt;/p&gt;
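
&lt;p&gt;A quick way to fix that particular failure is to build absolute urls in the spider with &lt;code&gt;response.urljoin()&lt;/code&gt; before yielding the item. A sketch of the updated &lt;code&gt;parse&lt;/code&gt; method (only the url line really changes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/bookspider.py (parse() updated to yield absolute urls)
def parse(self, response):
    for article in response.css('article.product_pod'):
        yield BookItem(
            url = response.urljoin(article.css("h3 &amp;gt; a::attr(href)").get()),
            title = article.css("h3 &amp;gt; a::attr(title)").extract_first(),
            price = article.css(".price_color::text").extract_first(),
        )

    next_page_url = response.css("li.next &amp;gt; a::attr(href)").get()
    if next_page_url:
        yield response.follow(url=next_page_url, callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;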

&lt;h3&gt;
  
  
  Create Our Monitors
&lt;/h3&gt;

&lt;p&gt;Next up, we want to create the &lt;strong&gt;Monitors&lt;/strong&gt; that will run the unit tests when activated. In this example we're going to create two monitors in our &lt;code&gt;monitors.py&lt;/code&gt; file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitor 1: Item Count Monitor
&lt;/h4&gt;

&lt;p&gt;This monitor will validate that our spider has scraped a set number of items. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon import Monitor, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 200

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Monitor 2: Item Validation Monitor
&lt;/h4&gt;

&lt;p&gt;This monitor will check the stats from the Item validator to make sure we have no item validation errors. &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.contrib.monitors.mixins import StatsMonitorMixin

@monitors.name('Item validation')
class ItemValidationMonitor(Monitor, StatsMonitorMixin):

    @monitors.name('No item validation errors')
    def test_no_item_validation_errors(self):
        validation_errors = getattr(
            self.stats, 'spidermon/validation/fields/errors', 0
        )
        self.assertEqual(
            validation_errors,
            0,
            msg='Found validation errors in {} fields'.format(
                validation_errors)
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create Monitor Suites
&lt;/h3&gt;

&lt;p&gt;For this example, we're going to run two &lt;strong&gt;MonitorSuites&lt;/strong&gt;. One at the end of the job, and another that runs every 5 seconds (for demo purposes).&lt;/p&gt;
&lt;h4&gt;
  
  
  MonitorSuite 1: Spider Close
&lt;/h4&gt;

&lt;p&gt;Here, we're going to add both of our monitors (&lt;strong&gt;ItemCountMonitor&lt;/strong&gt;, &lt;strong&gt;ItemValidationMonitor&lt;/strong&gt;) to the monitor suite as we want both to run when the job finishes. To do so we just need to create the &lt;strong&gt;MonitorSuite&lt;/strong&gt; in our &lt;code&gt;monitors.py&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.core.suites import MonitorSuite

class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
        ItemValidationMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then enable this &lt;strong&gt;MonitorSuite&lt;/strong&gt; in our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'spidermon_demo.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  MonitorSuite 2: Periodic Monitor
&lt;/h4&gt;

&lt;p&gt;Setting up a periodic monitor to run every 5 seconds is just as easy. Simply create a new &lt;strong&gt;MonitorSuite&lt;/strong&gt; and in this case we're only going to have it run the &lt;strong&gt;ItemValidationMonitor&lt;/strong&gt; every 5 seconds:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
class PeriodicMonitorSuite(MonitorSuite):
    monitors = [
        ItemValidationMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then enable it in our &lt;code&gt;settings.py&lt;/code&gt; file, where we also specify how frequently we want it to run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_PERIODIC_MONITORS = {
    'spidermon_demo.monitors.PeriodicMonitorSuite': 5,  # time in seconds
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With both of these &lt;strong&gt;MonitorSuites&lt;/strong&gt; set up, Spidermon will automatically run these &lt;strong&gt;Monitors&lt;/strong&gt; and add the results to your Scrapy logs and stats.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create Our Actions
&lt;/h3&gt;

&lt;p&gt;Having the results of these &lt;strong&gt;Monitors&lt;/strong&gt; is good, but to make them really useful we want something to happen when a &lt;strong&gt;MonitorSuite&lt;/strong&gt; has completed its tests. &lt;/p&gt;

&lt;p&gt;The most common action is getting notified of a failed health check so for this example we're going to send a Slack notification.  &lt;/p&gt;

&lt;p&gt;First we need to install some libraries to be able to work with Slack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install slack slackclient jinja2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next we will need to enable Slack notifications in our &lt;strong&gt;MonitorSuites&lt;/strong&gt; by importing &lt;code&gt;SendSlackMessageSpiderFinished&lt;/code&gt; from Spidermon actions, and updating our &lt;strong&gt;MonitorSuites&lt;/strong&gt; to use it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.contrib.actions.slack.notifiers import SendSlackMessageSpiderFinished

## ... Existing Monitors

## Update Spider Close MonitorSuite
class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
        ItemValidationMonitor,
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished, 
    ]

## Update Periodic MonitorSuite
class PeriodicMonitorSuite(MonitorSuite):
    monitors = [
        ItemValidationMonitor,
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished, 
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add our Slack details to our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_SLACK_SENDER_TOKEN = '&amp;lt;SLACK_SENDER_TOKEN&amp;gt;'
SPIDERMON_SLACK_SENDER_NAME = '&amp;lt;SLACK_SENDER_NAME&amp;gt;'
SPIDERMON_SLACK_RECIPIENTS = ['@yourself', '#yourprojectchannel']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://spidermon.readthedocs.io/en/latest/howto/configuring-slack-for-spidermon.html#configuring-slack-bot"&gt;Use this guide to create a Slack app and get your Slack credentials.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, anytime one of your Spidermon &lt;strong&gt;MonitorSuites&lt;/strong&gt; fail, you will get a Slack notification.&lt;/p&gt;

&lt;p&gt;The full code from this example is available on &lt;a href="https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/4.%20Scrapy%20Extensions/spidermon/spidermon_demo"&gt;Github here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for how to use Spidermon to monitor your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>spidermon</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Monitor Your Scrapy Spiders?</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Wed, 12 Jan 2022 16:18:18 +0000</pubDate>
      <link>https://dev.to/iankerins/how-to-monitor-your-scrapy-spiders-5c9o</link>
      <guid>https://dev.to/iankerins/how-to-monitor-your-scrapy-spiders-5c9o</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For anyone who has been in web scraping for a while, you know that if there is anything certain in web scraping, it is that just because your scrapers work today doesn’t mean they will work tomorrow. &lt;/p&gt;

&lt;p&gt;From day to day, your scrapers can break or their performance degrade for a whole host of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The HTML structure of the target site can change.&lt;/li&gt;
&lt;li&gt;The target site can change their anti-bot countermeasures.&lt;/li&gt;
&lt;li&gt;Your proxy network can degrade or go down.&lt;/li&gt;
&lt;li&gt;Or something can go wrong on your server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this it is very important for you to have a reliable and effective way for you to monitor your scrapers in production, conduct health checks and get alerts when the performance of your spider drops.&lt;/p&gt;

&lt;p&gt;In this guide, we will go through the &lt;strong&gt;4 popular options to monitor your scrapers&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapy Logs &amp;amp; Stats&lt;/li&gt;
&lt;li&gt;ScrapeOps Extension&lt;/li&gt;
&lt;li&gt;Spidermon Extension&lt;/li&gt;
&lt;li&gt;Generic Logging &amp;amp; Monitoring Tools&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  #1: Scrapy Logs &amp;amp; Stats
&lt;/h2&gt;

&lt;p&gt;Out of the box, Scrapy boasts by far the best logging and stats functionality of any web scraping library or framework out there. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-12-17 17:02:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1330,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 11551,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 2.600152,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 17, 16, 2, 22, 118835),
'httpcompression/response_bytes': 55120,
'httpcompression/response_count': 5,
'item_scraped_count': 50,
'log_count/INFO': 10,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2021, 12, 17, 16, 2, 19, 518683)}
2021-12-17 17:02:25 [scrapy.core.engine] INFO: Spider closed (finished)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whereas most other scraping libraries and frameworks focus solely on making requests and parsing the responses, Scrapy has a whole logging and stats layer under the hood that tracks your spiders in real-time, making it really easy to test and debug your spiders while developing them.&lt;/p&gt;

&lt;p&gt;You can easily customise the logging levels and add more stats to the default Scrapy stats in your spiders with a couple of lines of code. &lt;/p&gt;

&lt;p&gt;The major problem with relying solely on this approach to monitor your scrapers is that it quickly becomes impractical and cumbersome in production, especially when you have multiple spiders running every day across multiple servers.&lt;/p&gt;

&lt;p&gt;To check the health of your scraping jobs you will need to store these logs, and either periodically SSH into the server to view them or set up a custom log exporting system so you can view them in a central user interface. More on this later.&lt;/p&gt;
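
&lt;p&gt;If you do want to go down this route, the first step is simply getting Scrapy to persist its logs to a file, which only takes a couple of settings (the file path below is just a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Write the Scrapy log to a file instead of just the console (placeholder path)
LOG_FILE = 'logs/my_spider.log'

## Reduce noise in production (DEBUG is the default level)
LOG_LEVEL = 'INFO'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;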

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Using Scrapy's built-in logging and stats functionality is great during development, but when running scrapers in production you should look to use a better monitoring setup. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Set up right out of the box, and very lightweight.&lt;/li&gt;
&lt;li&gt;Easy to customise so it logs more stats.&lt;/li&gt;
&lt;li&gt;Great for local testing and the development phase. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard functionality, so you need to set up your own system to export your logs and display them.&lt;/li&gt;
&lt;li&gt;No historical comparison capabilities within jobs. &lt;/li&gt;
&lt;li&gt;No inbuilt health check functionality.&lt;/li&gt;
&lt;li&gt;Cumbersome to rely solely on when in production. &lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #2: ScrapeOps Extension
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a monitoring and alerting tool dedicated to web scraping. With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary goal with ScrapeOps is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.&lt;/p&gt;

&lt;p&gt;Unlike the other options on this list, ScrapeOps is a full end-to-end web scraping monitoring and management tool dedicated to web scraping that automatically sets up all the monitors, health checks and alerts for you. If you have an issue with integrating ScrapeOps or need advice on setting up your scrapers then they have a support team on-hand to assist you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs versus your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Job stats tracked include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Items Parsed &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Response Status Codes&lt;/li&gt;
&lt;li&gt;✅ Success Rates&lt;/li&gt;
&lt;li&gt;✅ Latencies&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;Getting setup with the logger is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard Demo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a powerful web scraping monitoring tool, that gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free unlimited community plan. &lt;/li&gt;
&lt;li&gt;Simple 30 second install, gives you advanced job monitoring, health checks and alerts straight out of the box.&lt;/li&gt;
&lt;li&gt;Job scheduling and management functionality so you can manage and monitor your scrapers from one dashboard.&lt;/li&gt;
&lt;li&gt;Customer support team, available to help you get setup and add new features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Currently, less customisable than Spidermon or other log management tools. (Will be soon!)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #3: Spidermon Extension
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spidermon.readthedocs.io/en/latest/index.html"&gt;Spidermon&lt;/a&gt; is an open-source monitoring extension for Scrapy. When integrated it allows you to set up custom monitors that can run at the start, end or periodically during your scrape, and alert you via your chosen communication method.&lt;/p&gt;

&lt;p&gt;This is a very powerful tool as it allows you to create custom monitors for each of your Spiders that can validate each Item scraped with your own unit tests. &lt;/p&gt;

&lt;p&gt;For example, you can make sure a required field has been scraped, that a url field actually contains a valid url, or have it double check that a scraped price is actually a number and doesn’t include any currency signs or special characters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from schematics.models import Model
from schematics.types import URLType, StringType, ListType, DecimalType

class ProductItem(Model):
    url = URLType(required=True)
    name = StringType(required=True)
    price = DecimalType(required=True)
    features = ListType(StringType)
    image_url = URLType()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, the two major drawbacks with Spidermon are:&lt;/p&gt;

&lt;h4&gt;
  
  
  #1 - No Dashboard or User Interface
&lt;/h4&gt;

&lt;p&gt;Spidermon doesn’t have any dashboard or user interface where you can see the output of your monitors.&lt;/p&gt;

&lt;p&gt;The output of your Spidermon monitors is just added to your log files and Scrapy stats, so you will either need to view each spider log to check your scrapers' performance or set up a custom system to extract this log data and display it in your own custom dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  #2 - Upfront Setup Time
&lt;/h4&gt;

&lt;p&gt;Unlike ScrapeOps, with Spidermon you will have to spend a bit of upfront time creating the monitors you need for each spider and integrating them into your Scrapy projects. &lt;/p&gt;

&lt;p&gt;Spidermon does include some out-of-the-box monitors; however, you will still need to activate them and define the failure thresholds for every spider. &lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once set up, Spidermon can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers with the defined monitors.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage of each of the Items you've defined unit tests for.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Periodic/Finished Health Checks -&lt;/strong&gt; At periodic intervals or at job finish, you can configure Spidermon to check the health of your job versus pre-set thresholds.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Job stats tracked out of the box include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped&lt;/li&gt;
&lt;li&gt;✅ Items Scraped&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also track more stats if you customise your scrapers to log them and have Spidermon monitor them.&lt;/p&gt;
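&lt;p&gt;As a rough illustration of what that could look like, the sketch below logs a hypothetical &lt;code&gt;custom/missing_price_count&lt;/code&gt; stat from a spider using Scrapy's stats collector, and then checks it from a Spidermon monitor (the stat name, CSS selector and threshold are just illustrative examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## In your spider: increment a custom stat whenever a price is missing
def parse_product(self, response):
    price = response.css('.price::text').get()
    if not price:
        self.crawler.stats.inc_value('custom/missing_price_count')
    ...

## In your monitors.py: fail the job if too many prices were missing
from spidermon import Monitor, monitors

@monitors.name('Missing prices')
class MissingPriceMonitor(Monitor):

    @monitors.name('Maximum number of missing prices')
    def test_missing_prices(self):
        missing = getattr(self.data.stats, 'custom/missing_price_count', 0)
        self.assertTrue(missing &amp;lt;= 20, msg='More than 20 items were missing a price')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;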

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;Getting set up with Spidermon is straightforward, but you do need to manually set up your monitors after installing the Spidermon extension. &lt;/p&gt;

&lt;p&gt;To get started you need to install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here you will also need to build your custom monitors and add each of them to your &lt;code&gt;settings.py&lt;/code&gt; file. Here is a simple example of how to set up a monitor that will check the number of items scraped at the end of the job versus a fixed threshold.&lt;/p&gt;

&lt;p&gt;First we create a custom monitor in a monitors.py file within our Scrapy project: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# my_project/monitors.py
from spidermon import Monitor, MonitorSuite, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 10

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )

class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then we add this monitor to our &lt;code&gt;settings.py&lt;/code&gt; file so that Spidermon will run it at the end of every job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon Monitor
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'my_project.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This monitor will then run at the end of every job and output the result in your log file. Here is an example of the monitor failing its tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: [Spidermon] -------------------- MONITORS --------------------
INFO: [Spidermon] Item count/Minimum number of items... FAIL
INFO: [Spidermon] --------------------------------------------------
ERROR: [Spidermon]
====================================================================
FAIL: Item count/Minimum number of items
--------------------------------------------------------------------
Traceback (most recent call last):
File "/tutorial/monitors.py",
    line 17, in test_minimum_number_of_items
    item_extracted &amp;gt;= minimum_threshold, msg=msg
AssertionError: False is not true : Extracted less than 10 items
INFO: [Spidermon] 1 monitor in 0.001s
INFO: [Spidermon] FAILED (failures=1)
INFO: [Spidermon] ---------------- FINISHED ACTIONS ----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
INFO: [Spidermon] ----------------- PASSED ACTIONS -----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
INFO: [Spidermon] ----------------- FAILED ACTIONS -----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you would like a more detailed explanation of how to use Spidermon, you can check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/python-scrapy-playbook/extensions/scrapy-spidermon-guide"&gt;Complete Spidermon Guide here&lt;/a&gt; or the &lt;a href="https://spidermon.readthedocs.io/en/latest/index.html"&gt;official documentation here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Spidermon is a great option for anyone who wants to take their scrapers to the next level and integrate a highly customisable monitoring solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open-source. Developed by core Scrapy developers.&lt;/li&gt;
&lt;li&gt;Stable and battle tested. Used internally by Zyte developers.&lt;/li&gt;
&lt;li&gt;Offers the ability to set custom item validation rules on every Item being scraped.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard functionality, so you need to build your own system to extract the Spidermon stats to a dashboard.&lt;/li&gt;
&lt;li&gt;Need to do a decent bit of customisation in your Scrapy projects to get the spider monitors, alerts, etc. set up for each spider.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #4: Generic Logging &amp;amp; Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;Another option is to use any of the many active monitoring or logging platforms available, like DataDog, Logz.io, LogDNA, Sentry, etc.&lt;/p&gt;

&lt;p&gt;These tools boast a huge range of functionality and features that allow you to graph, filter and aggregate your log data in whatever way best suits your requirements.&lt;/p&gt;

&lt;p&gt;However, although these can be used for monitoring your spiders, you will have to do a lot of customisation work to set up the dashboards, monitors and alerts that you would get out of the box with ScrapeOps or Spidermon.&lt;/p&gt;

&lt;p&gt;Plus, because most of these tools need to ingest all your log data to power the graphs, monitors, etc., they will likely be a lot more expensive than using ScrapeOps or Spidermon, as they charge based on how much data they ingest and how long they retain it for.&lt;/p&gt;
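&lt;p&gt;As a rough illustration of the glue code involved, here is a minimal sketch of a Scrapy extension that POSTs the final job stats to a hypothetical log collector endpoint when the spider closes (the ingest URL is a placeholder, and in practice most platforms provide their own agents or logging handlers that may be a better fit):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## my_project/extensions.py (sketch - the ingest endpoint is a placeholder)
import json
import urllib.request

from scrapy import signals


class StatsShipperExtension:

    def __init__(self, crawler):
        self.crawler = crawler
        ## Ship the stats when the spider finishes
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider):
        stats = self.crawler.stats.get_stats()
        payload = json.dumps({'spider': spider.name, 'stats': stats}, default=str).encode('utf-8')
        request = urllib.request.Request(
            'https://logs.example.com/ingest',   ## placeholder endpoint for your logging platform
            data=payload,
            headers={'Content-Type': 'application/json'},
        )
        urllib.request.urlopen(request, timeout=10)

## Enable it in settings.py:
## EXTENSIONS = {'my_project.extensions.StatsShipperExtension': 500}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;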

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;If you have a very unique web scraping stack with a complicated ETL pipeline, then customising one of the big logging tools to your requirements might be a good option. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Lots of feature rich logging tools to choose from. &lt;/li&gt;
&lt;li&gt;Can integrate with your existing logging stack if you have one.&lt;/li&gt;
&lt;li&gt;Highly customisable. If you can dream it, then you can likely build it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Will need to create a custom logging setup to properly track your jobs. &lt;/li&gt;
&lt;li&gt;No job management or scheduling capabilities.&lt;/li&gt;
&lt;li&gt;Can get expensive when doing large scale scraping.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for all the ways you can monitor your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>python</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider) </title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 17 Nov 2020 16:53:42 +0000</pubDate>
      <link>https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4hpc</link>
      <guid>https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4hpc</guid>
      <description>&lt;p&gt;Google is the undisputed king of search engines in just about every aspect, making it the ultimate source of data for a whole host of use cases.&lt;/p&gt;

&lt;p&gt;If you want to get access to this data you either need to extract it manually, pay a 3rd party for an expensive data feed, or build your own scraper to extract the data for you.&lt;/p&gt;

&lt;p&gt;In this article I will show you the easiest way to build a Google scraper that can extract millions of pages of data each day with just a few lines of code. &lt;/p&gt;

&lt;p&gt;By combining Scrapy with Scraper API's proxy/autoparsing functionality we will build a Google scraper that can scrape the search engine results from any Google query and return the following for each result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Link&lt;/li&gt;
&lt;li&gt;Related links&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;Snippet&lt;/li&gt;
&lt;li&gt;Images&lt;/li&gt;
&lt;li&gt;Thumbnails&lt;/li&gt;
&lt;li&gt;Sources, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also refine your search queries with parameters by specifying a keyword, the geographic region, the language, the number of results, results from a particular domain, or even only returning safe results. The possibilities are nearly limitless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ian-kerins/google-scraper-python-scrapy" rel="noopener noreferrer"&gt;The code for this project is available on GitHub here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this guide, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup" rel="noopener noreferrer"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How To Query Google Using Scraper API’s Autoparse Functionality
&lt;/h2&gt;

&lt;p&gt;We will use &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proxies&lt;/strong&gt;, so we won't get blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing&lt;/strong&gt;, so we don't have to worry about writing our own parsers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scraper API is a proxy management API that handles everything to do with rotating and managing proxies so our requests don't get banned, which is great for a difficult site to scrape like Google.&lt;/p&gt;

&lt;p&gt;However, what makes Scraper API extra useful for sites like Google and Amazon is that they provide auto parsing functionality free of charge so you don't need to write and maintain your own parsers.&lt;/p&gt;

&lt;p&gt;By using &lt;a href="https://www.scraperapi.com/google" rel="noopener noreferrer"&gt;Scraper API’s autoparse&lt;/a&gt; functionality for Google Search or Google Shopping, all the HTML will be automatically parsed into JSON format for you. Greatly simplifying the scraping process.&lt;/p&gt;

&lt;p&gt;All we need to do to make use of this handy capability is to add the following parameter to our request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "&amp;amp;autoparse=true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll send the HTTP request with this parameter via Scrapy which will scrape google results based on specified keywords. The results will be returned in JSON format which we will then parse using Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scrapy Installation and Setup
&lt;/h2&gt;

&lt;p&gt;First things first, the requirements for this tutorial are very straightforward:&lt;/p&gt;

&lt;p&gt;• You will need at least Python version 3 or later&lt;br&gt;
• And &lt;em&gt;pip&lt;/em&gt; to install the necessary software packages&lt;/p&gt;

&lt;p&gt;So, assuming you have both of those things, you only need to run the following command in your terminal to install Scrapy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scrapy will automatically create a project folder where all the packages and project files will be located. So navigate to the folder where you want your project to live, and then run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First Scrapy will create a new project folder called “google_scraper”, which is also the project name. We then navigate into this folder and run the “genspider” command which will generate a web scraper for us with the name “google.”&lt;/p&gt;

&lt;p&gt;You should now see a bunch of configuration files, a “spiders” folder with your scraper(s), and a Python modules folder with some package files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building URLs to Query Google
&lt;/h2&gt;

&lt;p&gt;As you might expect, Google uses a very standard and easy to query URL structure. To build a URL to query Google with, you only need to know the URL parameters for the data you need. In this tutorial, I’ll use some of the parameters that will be the most useful for the majority of web scraping projects. &lt;/p&gt;

&lt;p&gt;Every Google Search query will start with the following base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://www.google.com/search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then build out your query simply by adding one or more of the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;search keyword&lt;/strong&gt; parameter denoted as &lt;strong&gt;q&lt;/strong&gt;. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&lt;/a&gt;&lt;/em&gt; will search for results containing the “tshirt” keyword.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;language&lt;/strong&gt; parameter &lt;strong&gt;hl&lt;/strong&gt;. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;hl=en" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameter which will specify a domain (or, website) to search. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;as_sitesearch=amazon.com" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;as_sitesearch=amazon.com&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;num&lt;/strong&gt; parameter that specifies the number of results per page (maximum is 100). For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;num=50" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;num=50&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The  &lt;strong&gt;start&lt;/strong&gt; parameter which specifies the offset point. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;start=100" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;start=100&lt;/a&gt;&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;safe&lt;/strong&gt; parameter which will only output “safe” results. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;safe=active" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;safe=active&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more parameters to use for querying Google, such as date, encoding, or even operators such as ‘or’ or ‘and’ to implement some basic logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Google Search Query URL
&lt;/h2&gt;

&lt;p&gt;Below is the function I’ll be using to build the Google Search query URL. It creates a dictionary with key-value pairs for the &lt;strong&gt;q&lt;/strong&gt;, &lt;strong&gt;num&lt;/strong&gt;, and &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameters. If you want to add more parameters, this is where you could do it.&lt;/p&gt;

&lt;p&gt;If no site is specified, it will return a URL without the &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameter. If one is specified, it will first extract the network location using &lt;em&gt;netloc&lt;/em&gt; (e.g. amazon.com), then add this key-value pair to &lt;em&gt;google_dict&lt;/em&gt;, and, finally, encode it in the return URL with the other parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlparse
from urllib.parse import urlencode

def create_google_url(query, site=''):
   google_dict = {'q': query, 'num': 100, }
   if site:
       web = urlparse(site).netloc
       google_dict['as_sitesearch'] = web
       return 'http://www.google.com/search?' + urlencode(google_dict)
   return 'http://www.google.com/search?' + urlencode(google_dict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Connecting to a Proxy via the Scraper API
&lt;/h2&gt;

&lt;p&gt;When scraping an internet service like Google, you will need to use a proxy if you want to scrape at any reasonable scale. If you don’t, you could get flagged by its anti-bot countermeasures and get your IP banned. Thankfully, you can use &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;Scraper API’s proxy solution for free for up to 5,000 API calls&lt;/a&gt;, using up to 10 concurrent threads. You can also use some of Scraper API’s more advanced features, such as geotargeting, JS rendering, and residential proxies. &lt;/p&gt;

&lt;p&gt;To use the proxy, just head &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;here&lt;/a&gt; to sign up for free. Once you have, find your API key in the dashboard as you’ll need it to set up a proxy connection. &lt;/p&gt;

&lt;p&gt;The proxy is incredibly easy to implement into your web spider. In the &lt;em&gt;get_url&lt;/em&gt; function below, we’ll create a payload with our Scraper API key and the URL we built in the &lt;em&gt;create_google_url function&lt;/em&gt;. We’ll also enable the &lt;strong&gt;autoparse&lt;/strong&gt; feature here as well as set the proxy location as the U.S.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
   payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
   proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
   return proxy_url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To send our request via one of Scraper API’s proxy pools, we only need to append our query URL to Scraper API’s proxy URL. This will return the information that we requested from Google and that we’ll parse later on.&lt;/p&gt;
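&lt;p&gt;For example, chaining the two helper functions together for the “tshirt” keyword would produce a single Scraper API URL with the Google query URL encoded inside it (the API key is a placeholder and the exact parameter ordering may vary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url = create_google_url('tshirt')
print(get_url(url))
## http://api.scraperapi.com/?api_key=YOUR_KEY&amp;amp;url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dtshirt%26num%3D100&amp;amp;autoparse=true&amp;amp;country_code=us
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;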




&lt;h2&gt;
  
  
  Querying Google Search
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;start_requests&lt;/em&gt; function is where we will set everything into motion. It will iterate through a list of queries that will be sent through to the &lt;em&gt;create_google_url&lt;/em&gt; function as keywords for our query URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']
       for query in queries:
           url = create_google_url(query)
           yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query URL we built will then be sent as a request to Google Search using Scrapy’s &lt;em&gt;yield&lt;/em&gt; via the proxy connection we set up in the &lt;em&gt;get_url&lt;/em&gt; function. The result (which should be in JSON format) will then be sent to the &lt;em&gt;parse&lt;/em&gt; function to be processed. We also add the &lt;em&gt;{'pos': 0}&lt;/em&gt; key-value pair to the &lt;em&gt;meta&lt;/em&gt; parameter which is just used to count the number of pages scraped.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scraping the Google Search Results
&lt;/h2&gt;

&lt;p&gt;Because we used Scraper API’s &lt;strong&gt;autoparse&lt;/strong&gt; functionality to return data in JSON format, parsing is very straightforward. We just need to select the data we want from the response dictionary.&lt;/p&gt;

&lt;p&gt;First of all, we’ll load the entire JSON response and then iterate through each result, extracting some information and then putting it together into a new item we can use later on.&lt;/p&gt;

&lt;p&gt;This process also checks to see if there is another page of results. If there is, it invokes &lt;strong&gt;yield scrapy.Request&lt;/strong&gt; again and sends the results to the &lt;em&gt;parse&lt;/em&gt; function. In the meantime, &lt;em&gt;pos&lt;/em&gt; is used to keep track of the number of pages we have scraped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
       di = json.loads(response.text)
       pos = response.meta['pos']
       dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
       for result in di['organic_results']:
           title = result['title']
           snippet = result['snippet']
           link = result['link']
           item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
           pos += 1
           yield item
       next_page = di['pagination']['nextPageUrl']
       if next_page:
           yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Putting it All Together and Running the Spider
&lt;/h2&gt;

&lt;p&gt;You should now have a solid understanding of how the spider works and the flow of it. The spider we created, &lt;strong&gt;google.py&lt;/strong&gt;, should now have the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_KEY'

def get_url(url):
   payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
   proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
   return proxy_url

def create_google_url(query, site=''):
   google_dict = {'q': query, 'num': 100, }
   if site:
       web = urlparse(site).netloc
       google_dict['as_sitesearch'] = web
       return 'http://www.google.com/search?' + urlencode(google_dict)
   return 'http://www.google.com/search?' + urlencode(google_dict)

class GoogleSpider(scrapy.Spider):
   name = 'google'
   allowed_domains = ['api.scraperapi.com']
   custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                  'CONCURRENT_REQUESTS_PER_DOMAIN': 10}

   def start_requests(self):
       queries = ['scrapy', 'beautifulsoup']
       for query in queries:
           url = create_google_url(query)
           yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

   def parse(self, response):
       di = json.loads(response.text)
       pos = response.meta['pos']
       dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
       for result in di['organic_results']:
           title = result['title']
           snippet = result['snippet']
           link = result['link']
           item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
           pos += 1
           yield item
       next_page = di['pagination']['nextPageUrl']
       if next_page:
           yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before testing the scraper we need to configure the settings to allow it to integrate with the Scraper API free plan with 10 concurrent threads.&lt;/p&gt;

&lt;p&gt;To do this we defined the following custom settings in our spider class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 
                       'RETRY_TIMES': 5}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set the concurrency to 10 threads to match the Scraper API free plan and set &lt;code&gt;RETRY_TIMES&lt;/code&gt; to tell Scrapy to retry any failed requests 5 times. In the &lt;strong&gt;settings.py&lt;/strong&gt; file we also need to make sure that &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt; and &lt;code&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/code&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;/p&gt;

&lt;p&gt;To test or run the spider, just make sure you are in the project folder and then run the following crawl command, which will also output the results to a .csv file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl google -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all goes according to plan, the spider will scrape Google Search for all the keywords you provide. By using a proxy, you’ll also avoid getting banned for using a bot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Google project then feel free to do so. &lt;a href="https://github.com/ian-kerins/google-scraper-python-scrapy" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API &lt;code&gt;API_KEY&lt;/code&gt; by signing up &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;for a free account here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>Build Your Own Google Scholar API With Python Scrapy</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 18 Aug 2020 18:17:11 +0000</pubDate>
      <link>https://dev.to/iankerins/build-your-own-google-scholar-api-with-python-scrapy-4p73</link>
      <guid>https://dev.to/iankerins/build-your-own-google-scholar-api-with-python-scrapy-4p73</guid>
      <description>&lt;p&gt;Google Scholar is a treasure trove of academic and industrial research that could prove invaluable to any research project.&lt;/p&gt;

&lt;p&gt;However, as Google doesn’t provide any API for Google Scholar, it is notoriously hard to mine for information.&lt;/p&gt;

&lt;p&gt;Faced with this problem, I decided to develop a simple Scrapy spider in Python and create my own Google Scholar API.&lt;/p&gt;

&lt;p&gt;In this article, I’m going to show you how I built a Scrapy spider that searches Google Scholar for a particular keyword, and iterates through every available page extracting the following data from the search results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title &lt;/li&gt;
&lt;li&gt;Link&lt;/li&gt;
&lt;li&gt;Citations&lt;/li&gt;
&lt;li&gt;Related Links&lt;/li&gt;
&lt;li&gt;Number of Versions&lt;/li&gt;
&lt;li&gt;Author&lt;/li&gt;
&lt;li&gt;Publisher&lt;/li&gt;
&lt;li&gt;Snippet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of research keywords per month. &lt;a href="https://github.com/ian-kerins/google-scholar-scrapy-spider"&gt;The code for the project is available on GitHub here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Google Scholar results at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;For this tutorial, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Our Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then navigate to the folder where you want your project to live and run the “startproject” command along with the project name (“scholar” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject scholar

cd scholar

scrapy genspider scholar scholar.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── scholar                   # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── scholar.py        # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, that’s the Scrapy spider templates set up. Now let’s start building our Google Scholar spider.&lt;/p&gt;

&lt;p&gt;From here we’re going to create three functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will construct the Google Scholar URL for the search queries and send the request to Google.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse -&lt;/strong&gt; will extract all the search results from the Google Scholar search results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; to scrape Google Scholar at scale without getting blocked we need to use a proxy solution. For this project we will use Scraper API so we need to create a function to send the request to their API endpoint.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Understanding Google Scholar Search Queries
&lt;/h2&gt;

&lt;p&gt;The first step of any scraping project is to figure out a way to reliably query the target website to get the data we need. So in this case we need to understand how to construct Google Scholar search queries that will return the search results we need.&lt;/p&gt;

&lt;p&gt;Luckily for us, Google uses a very predictable URL structure. There are many more query parameters we can use with Google to refine our search results, but here are four of the most important ones for querying Google Scholar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the &lt;strong&gt;search keyword&lt;/strong&gt; using the &lt;strong&gt;“q” parameter&lt;/strong&gt;. Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb"&gt;http://www.google.com/scholar?q=airbnb&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;language&lt;/strong&gt; of output using the &lt;strong&gt;“hl” parameter&lt;/strong&gt;. Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb&amp;amp;hl=en"&gt;http://www.google.com/scholar?q=airbnb&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;starting date&lt;/strong&gt; using the &lt;strong&gt;“as_ylo”&lt;/strong&gt; parameter. Example: &lt;em&gt;&lt;a href="https://scholar.google.com/scholar?as_ylo=2020&amp;amp;q=airbnb"&gt;https://scholar.google.com/scholar?as_ylo=2020&amp;amp;q=airbnb&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;number of results per page&lt;/strong&gt; using the &lt;strong&gt;“num”&lt;/strong&gt; parameter. However, this is not recommended for Google Scholar, so we will leave it as the default (10). Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb&amp;amp;num=10&amp;amp;hl=en"&gt;http://www.google.com/scholar?q=airbnb&amp;amp;num=10&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
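&lt;p&gt;Putting these parameters together in Python is just a matter of URL-encoding a dictionary, as we will do in the spider below. A quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlencode

params = {'q': 'airbnb', 'hl': 'en', 'as_ylo': 2020}
url = 'https://scholar.google.com/scholar?' + urlencode(params)
## https://scholar.google.com/scholar?q=airbnb&amp;amp;hl=en&amp;amp;as_ylo=2020
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;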

&lt;h2&gt;
  
  
  Querying Google Scholar
&lt;/h2&gt;

&lt;p&gt;Now that we have created a Scrapy project and are familiar with how to send search queries to Google Scholar, we can begin coding the spiders.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;start_requests&lt;/strong&gt; function is going to be pretty simple: we just need to send requests to Google Scholar with the keyword we want to search for, along with the language we want the output to be in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        queries = ['airbnb']
        for query in queries:
            url = 'https://scholar.google.com/scholar?' + urlencode({'hl': 'en', 'q': query})
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'position': 0})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;start_requests&lt;/strong&gt; function will iterate through a list of keywords in the queries list and then send the request to Google Scholar using the &lt;strong&gt;yield scrapy.Request(get_url(url), callback=self.parse)&lt;/strong&gt; where the response is sent to the &lt;strong&gt;parse&lt;/strong&gt; function in the callback.&lt;/p&gt;

&lt;p&gt;You will also notice that we include the {'position': 0} dictionary in the meta parameter. This isn’t sent to Google, it is sent to the &lt;strong&gt;parse&lt;/strong&gt; callback function and is used to track how many pages the spider has scraped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping The Search Results
&lt;/h2&gt;

&lt;p&gt;The next step is to write our parser to extract the data we need from the HTML response we are getting back from Google Scholar. &lt;/p&gt;

&lt;p&gt;We will use XPath selectors to extract the data from the HTML response. XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should &lt;a href="https://docs.scrapy.org/en/latest/topics/selectors.html"&gt;check out the documentation here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
   position = response.meta['position']
   for res in response.xpath('//*[@data-rp]'):
       link = res.xpath('.//h3/a/@href').extract_first()
       temp = res.xpath('.//h3/a//text()').extract()
       if not temp:
           title = "[C] " + "".join(res.xpath('.//h3/span[@id]//text()').extract())
       else:
           title = "".join(temp)
       snippet = "".join(res.xpath('.//*[@class="gs_rs"]//text()').extract())
       cited = res.xpath('.//a[starts-with(text(),"Cited")]/text()').extract_first()
       temp = res.xpath('.//a[starts-with(text(),"Related")]/@href').extract_first()
       related = "https://scholar.google.com" + temp if temp else ""
       num_versions = res.xpath('.//a[contains(text(),"version")]/text()').extract_first()
       published_data = "".join(res.xpath('.//div[@class="gs_a"]//text()').extract())
       position += 1
       item = {'title': title, 'link': link, 'cited': cited, 'relatedLink': related, 'position': position,
               'numOfVersions': num_versions, 'publishedData': published_data, 'snippet': snippet}
       yield item
   next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To iterate through all the available pages of search results we will need to check to see if there is another page there and then construct the next URL query if there is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
   ##...parsing logic from above
   next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()
   if next_page:
       url = "https://scholar.google.com" + next_page
       yield scrapy.Request(get_url(url), callback=self.parse,meta={'position': position})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Connecting Our Proxy Solution
&lt;/h2&gt;

&lt;p&gt;Google has very sophisticated anti-bot detection systems that will quickly detect that you are scraping their search results and block your IP. As a result, it is vital that you use a high-quality web scraping proxy that works with Google Scholar.&lt;/p&gt;

&lt;p&gt;For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Google Scholar. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scraperapi.com/"&gt;Scraper API&lt;/a&gt; is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 5,000 free requests and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using this function in our &lt;strong&gt;scrapy.Request()&lt;/strong&gt; requests in the &lt;strong&gt;start_requests&lt;/strong&gt; and &lt;strong&gt;parse&lt;/strong&gt; functions we are able to route all our requests through Scraper APIs proxies pools and not worry about getting blocked.&lt;/p&gt;

&lt;p&gt;Before going live we need to update the settings in settings.py to make sure we can use all the concurrent threads available in our Scraper API free plan (5 threads), and set the number of retries to 5, whilst making sure &lt;strong&gt;DOWNLOAD_DELAY&lt;/strong&gt; and &lt;strong&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/strong&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

RETRY_TIMES = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 5 
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Live!
&lt;/h2&gt;

&lt;p&gt;Now we are good to go. You can test the spider by running it with the crawl command and exporting the results to a csv file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl scholar -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spider will scrape all the available search results for your keyword without getting banned.&lt;/p&gt;

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Google Scholar project then feel free to do so. &lt;a href="https://github.com/ian-kerins/google-scholar-scrapy-spider"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup"&gt;signing up here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>The Easy Way to Scrape Instagram Using Python Scrapy &amp; GraphQL</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 06 Aug 2020 13:10:16 +0000</pubDate>
      <link>https://dev.to/iankerins/the-easy-way-to-build-an-instagram-spider-using-python-scrapy-graphql-4gko</link>
      <guid>https://dev.to/iankerins/the-easy-way-to-build-an-instagram-spider-using-python-scrapy-graphql-4gko</guid>
      <description>&lt;p&gt;After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scraping.&lt;/p&gt;

&lt;p&gt;However, for anyone who’s tried to build a web scraping spider for scraping Instagram, Facebook, Twitter or TikTok you know that it can be a bit tricky.&lt;/p&gt;

&lt;p&gt;These sites use sophisticated anti-bot technologies to block your requests and regularly make changes to their site schemas which can break your spider's parsing logic.&lt;/p&gt;

&lt;p&gt;So in this article, I’m going to show you the easiest way to build a Python Scrapy spider that scrapes all Instagram posts for every user account that you send to it. Whilst removing the worry of getting blocked or having to design XPath selectors to scrape the data from the raw HTML.&lt;/p&gt;

&lt;p&gt;The code for the project is available on &lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;GitHub here&lt;/a&gt;, and is set up to scrape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post URL&lt;/li&gt;
&lt;li&gt;Image URL or Video URL&lt;/li&gt;
&lt;li&gt;Post Captions&lt;/li&gt;
&lt;li&gt;Date Posted&lt;/li&gt;
&lt;li&gt;Number of Likes&lt;/li&gt;
&lt;li&gt;Number of Comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every post on that user's account. As you will see there is more data we could easily extract, however, to keep this guide simple I just limited it to the most important data types.&lt;/p&gt;

&lt;p&gt;This code can also be quickly modified to scrape all the posts related to a specific tag or geographical location with only minor changes, so it is a great base to build future spiders with.&lt;/p&gt;

&lt;p&gt;This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Instagram at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;The full-code is on GitHub here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this example, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup" rel="noopener noreferrer"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Our Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then navigate to the folder where you want your project to live and run the “startproject” command along with the project name (“instascraper” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject instascraper

cd instascraper

scrapy genspider instagram instagram.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── instascraper              # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── instagram.py     # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, that’s the Scrapy spider templates set up. Now let’s start building our Instagram spiders.&lt;/p&gt;

&lt;p&gt;From here we’re going to create five functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will construct the Instagram URL for the users account and send the request to Instagram.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse -&lt;/strong&gt; will extract all the posts data from the users news feed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_page -&lt;/strong&gt; if there is more than one page, this function will parse all the posts data from those pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_video -&lt;/strong&gt; if the post includes a video, this function will be called and extract the videos url.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; will send the request to Scraper API so it can retrieve the HTML response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s get to work…&lt;/p&gt;




&lt;h2&gt;
  
  
  Requesting Instagram Accounts
&lt;/h2&gt;

&lt;p&gt;To retrieve a user's data from Instagram we need to first create a list of users we want to monitor and then incorporate their user ids into a URL. Luckily for us, Instagram uses a pretty straightforward URL structure.&lt;/p&gt;

&lt;p&gt;Every user has a unique name and/or user id that we can use to create the user URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/&amp;lt;user_name&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also retrieve the posts associated with a specific tag or from a specific location by using the following url format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Tags URL
https://www.instagram.com/explore/tags/&amp;lt;example_tag&amp;gt;/

## Location URL
https://www.instagram.com/explore/locations/&amp;lt;location_id&amp;gt;/

# Note: the location URL is a numeric value so you need to identify the location ID number for
# the locations you want to scrape. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So for this example spider, I’m going to use Nike and Adidas as the two Instagram accounts I want to scrape.&lt;/p&gt;

&lt;p&gt;Using the above framework the Nike url is &lt;code&gt;https://www.instagram.com/nike/&lt;/code&gt;, and we also want to have the ability to specify the page language using the “hl” parameter. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/nike/?hl=en  #English
https://www.instagram.com/nike/?hl=de  #German
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #1: Retrieving Instagram Accounts
&lt;/h2&gt;

&lt;p&gt;Now that we have created a Scrapy project and are familiar with how Instagram displays its data, we can begin coding the spiders.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;start_requests&lt;/strong&gt; function is going to be pretty simple: we just need to send requests to Instagram with the username URL to get the user's account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The start_requests function will iterate through a list of user_accounts and then send the request to Instagram using the &lt;strong&gt;yield scrapy.Request(get_url(url), callback=self.parse)&lt;/strong&gt; where the response is sent to the &lt;strong&gt;parse&lt;/strong&gt; function in the callback.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spider #2: Scraping Post Data
&lt;/h2&gt;

&lt;p&gt;Okay, now that we are getting a response back from Instagram we can extract the data we want.&lt;/p&gt;

&lt;p&gt;At first glance, the post data we want, like image URLs, likes, comments, etc., doesn’t seem to be in the HTML. However, on closer inspection we will see that the data is in the form of a JSON dictionary in the script tag that starts with “window._sharedData”.&lt;/p&gt;

&lt;p&gt;This is because Instagram first loads the layout and all the data it needs from its internal GraphQL API and then puts the data in the correct layout.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7gzoad3vbykljm50fado.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7gzoad3vbykljm50fado.PNG" alt="Image of window._sharedData"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could scrape this data by querying Instagram’s GraphQL endpoint directly, adding "/?__a=1" onto the end of the URL. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/nike/?__a=1/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But we wouldn’t be able to iterate through all the pages, so instead we’re going to get the HTML response and then extract the data from the window._sharedData JSON dictionary.&lt;/p&gt;

&lt;p&gt;Because the data is already formatted as JSON it will be very easy to extract the data we want. We can just use a simple XPath selector to extract the JSON string and then convert it into a JSON dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        json_string = x.strip().split('= ')[1][:-1]
        data = json.loads(json_string)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here we just need to extract the data we want from the JSON dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        json_string = x.strip().split('= ')[1][:-1]
        data = json.loads(json_string)
        # all that we have to do here is to parse the JSON we have
        user_id = data['entry_data']['ProfilePage'][0]['graphql']['user']['id']
        next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']
        edges = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_felix_video_timeline']['edges']
        for i in edges:
            url = 'https://www.instagram.com/p/' + i['node']['shortcode']
            video = i['node']['is_video']
            date_posted_timestamp = i['node']['taken_at_timestamp']
            date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
            like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
            comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i[
                'node'].keys() else ''
            captions = ""
            if i['node']['edge_media_to_caption']:
                for i2 in i['node']['edge_media_to_caption']['edges']:
                    captions += i2['node']['text'] + "\n"

            if video:
                image_url = i['node']['display_url']
            else:
                image_url = i['node']['thumbnail_resources'][-1]['src']
            item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                    'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
                    'captions': captions[:-1]}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #3: Extracting Video URLs
&lt;/h2&gt;

&lt;p&gt;To extract the video URL we need to make another request to that specific post as that data isn’t included in the JSON response previously returned by Instagram.&lt;/p&gt;

&lt;p&gt;If the post includes a video then the &lt;strong&gt;is_video&lt;/strong&gt; flag will be set to true, which will trigger our scraper to request that post’s page and send the response to the &lt;strong&gt;get_video&lt;/strong&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if video:
     yield scrapy.Request(get_url(url), callback=self.get_video, meta={'item': item})
else:
     item['videoURL'] = ''
     yield item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The get_video function will then extract the videoURL from the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_video(self, response):
        # only from the first page
        item = response.meta['item']
        video_url = response.xpath('//meta[@property="og:video"]/@content').extract_first()
        item['videoURL'] = video_url
        yield item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #4: Iterating Through Available Pages
&lt;/h2&gt;

&lt;p&gt;The last piece of extraction logic we need to implement is the ability for our crawler to iterate through all the available pages on that user account and scrape all the data.&lt;/p&gt;

&lt;p&gt;Like the &lt;strong&gt;get_video&lt;/strong&gt; function, we need to check if there are any more pages available before calling the &lt;strong&gt;parse_pages&lt;/strong&gt; function. We do that by checking if the &lt;strong&gt;has_next_page&lt;/strong&gt; field in the JSON dictionary is true or false.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it is true, then we will extract the &lt;strong&gt;end_cursor&lt;/strong&gt; value from the JSON dictionary and create a new request to Instagram’s GraphQL API endpoint, along with the &lt;strong&gt;user_id&lt;/strong&gt;, &lt;strong&gt;query_hash&lt;/strong&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        if next_page_bool:
            cursor = \
                data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                    'end_cursor']
            di = {'id': user_id, 'first': 12, 'after': cursor}
            print(di)
            params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
            url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
            yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will then call the &lt;strong&gt;parse_pages&lt;/strong&gt; function which will repeat the process of extracting all the post data and checking to see if there are any more pages. &lt;/p&gt;

&lt;p&gt;The difference between this function and the original parse function is that it doesn’t make a separate request to each post’s page for the video URL; instead it reads the video_url field directly from the GraphQL response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_pages(self, response):
   di = response.meta['pages_di']
   data = json.loads(response.text)
   for i in data['data']['user']['edge_owner_to_timeline_media']['edges']:
       video = i['node']['is_video']
       url = 'https://www.instagram.com/p/' + i['node']['shortcode']
       if video:
           image_url = i['node']['display_url']
           video_url = i['node']['video_url']
       else:
           video_url = ''
           image_url = i['node']['thumbnail_resources'][-1]['src']
       date_posted_timestamp = i['node']['taken_at_timestamp']
       captions = ""
       if i['node']['edge_media_to_caption']:
           for i2 in i['node']['edge_media_to_caption']['edges']:
               captions += i2['node']['text'] + "\n"
       comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
       date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
       like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
       item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
               'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
               'videoURL': video_url,'captions': captions[:-1]
               }
       yield item
   next_page_bool = data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
   if next_page_bool:
       cursor = data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
       di['after'] = cursor
       params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
       url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
       yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Proxies
&lt;/h2&gt;

&lt;p&gt;Finally, we are pretty much ready to go live. The last thing we need to do is set our spiders up to use a proxy so that we can scrape at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Instagram. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project, I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlencode

API = '&amp;lt;YOUR_API_KEY&amp;gt;'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then modify our spider functions to use the Scraper API proxy by setting the url parameter in scrapy.Request to &lt;strong&gt;get_url(url)&lt;/strong&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also have to change the spider’s settings to set &lt;strong&gt;allowed_domains&lt;/strong&gt; to api.scraperapi.com, and the max concurrency per domain to the concurrency limit of our Scraper API plan, which in the case of the Scraper API free plan is 5 concurrent threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 5}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, we should set &lt;strong&gt;RETRY_TIMES&lt;/strong&gt; to tell Scrapy to retry any failed requests (to 5 for example) and make sure that &lt;strong&gt;DOWNLOAD_DELAY&lt;/strong&gt;  and &lt;strong&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/strong&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;/p&gt;
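&lt;p&gt;For reference, the relevant &lt;code&gt;settings.py&lt;/code&gt; entries would look something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

RETRY_TIMES = 5

## leave these disabled when using Scraper API
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;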




&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Live!
&lt;/h2&gt;

&lt;p&gt;Now we are good to go. You can test the spider again by running it with the crawl command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl instagram -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once complete, the spider will store the scraped account data in a CSV file.&lt;/p&gt;
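&lt;p&gt;The columns in the CSV correspond to the keys of the items we yielded (the exact column order may vary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postURL, isVideo, date_posted, timestamp, likeCount, commentCount, image_url, videoURL, captions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;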

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Instagram project then feel free to do so. &lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;signing up here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>How To Scrape Amazon at Scale With Python Scrapy, And Never Get Banned</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 28 Jul 2020 18:19:29 +0000</pubDate>
      <link>https://dev.to/iankerins/how-to-scrape-amazon-at-scale-with-python-scrapy-and-never-get-banned-44cm</link>
      <guid>https://dev.to/iankerins/how-to-scrape-amazon-at-scale-with-python-scrapy-and-never-get-banned-44cm</guid>
      <description>&lt;p&gt;With thousands of companies offering products and price monitoring solutions for Amazon, scraping Amazon is big business.&lt;/p&gt;

&lt;p&gt;But for anyone who’s tried to scrape it at scale you know how quickly you can get blocked.&lt;/p&gt;

&lt;p&gt;So in this article, I’m going to show you how I built a Scrapy spider that searches Amazon for a particular keyword, and then goes into every single product it returns and scrapes all the main information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ASIN&lt;/li&gt;
&lt;li&gt;Product name&lt;/li&gt;
&lt;li&gt;Image url&lt;/li&gt;
&lt;li&gt;Price&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;Available sizes&lt;/li&gt;
&lt;li&gt;Available colors&lt;/li&gt;
&lt;li&gt;Ratings&lt;/li&gt;
&lt;li&gt;Number of reviews&lt;/li&gt;
&lt;li&gt;Seller rank&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of products per month. &lt;a href="https://github.com/ian-kerins/amazon-python-scrapy-scraper" rel="noopener noreferrer"&gt;The code for the project is available on GitHub here&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Will We Need?
&lt;/h2&gt;

&lt;p&gt;Obviously, you could build your scrapers from scratch using basic libraries like requests and BeautifulSoup, but I chose to build it using Scrapy, the open-source web crawling framework written in Python, as it is by far the most powerful and popular web scraping framework amongst large scale web scrapers.&lt;/p&gt;

&lt;p&gt;Compared to other web scraping libraries such as BeautifulSoup, Selenium or Cheerio, which are great libraries for parsing HTML data, Scrapy is a full web scraping framework with a large community that has loads of built-in functionality to make web scraping as simple as possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XPath and CSS selectors for HTML parsing&lt;/li&gt;
&lt;li&gt;data pipelines&lt;/li&gt;
&lt;li&gt;automatic retries&lt;/li&gt;
&lt;li&gt;proxy management&lt;/li&gt;
&lt;li&gt;concurrent requests&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it really easy to get started, and very simple to scale up.&lt;/p&gt;

&lt;h4&gt;
  
  
  Proxies
&lt;/h4&gt;

&lt;p&gt;The second thing that is a must if you want to scrape Amazon at any type of scale is a large pool of proxies, along with the code to automatically rotate IPs and headers and deal with bans and CAPTCHAs. This can be very time consuming if you build the proxy management infrastructure yourself.&lt;/p&gt;

&lt;p&gt;For this project I opted to use &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt;, a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response. &lt;/p&gt;

&lt;p&gt;Scraper API has a free plan that allows you to make up to 1,000 requests per month which makes it ideal for the development phase, but can be easily scaled up to millions of pages per month if needs be.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitoring
&lt;/h4&gt;

&lt;p&gt;Lastly, we will need some way to monitor our scraper in production to make sure that everything is running smoothly. For that we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring tool specifically designed for web scraping. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started With Scrapy
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, from the folder you want the project to live in, run the “startproject” command along with the project name (“amazon_scraper” in this case) and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject amazon_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── tutorial                  # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── amazon.py        # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to Django, when you create a project with Scrapy it automatically creates all the files you need, each of which has its own purpose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Items.py&lt;/strong&gt; is useful for creating your base dictionary that you import into the spider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings.py&lt;/strong&gt; is where all your request settings live and where pipelines and middlewares are activated. Here you can change the delays, concurrency, and lots more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines.py&lt;/strong&gt; is where the items yielded by the spider get passed; it’s mostly used to clean the text and connect to data stores (Excel, SQL, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middlewares.py&lt;/strong&gt; is useful when you want to modify how requests are made and how Scrapy handles responses.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Creating Our Amazon Spider
&lt;/h2&gt;

&lt;p&gt;Okay, we’ve created the general project structure. Now, we’re going to develop our spiders that will do the scraping.&lt;/p&gt;

&lt;p&gt;Scrapy provides a number of different spider types; however, in this tutorial we will cover the most common one, the Generic Spider.&lt;/p&gt;

&lt;p&gt;To create a new spider, simply run the &lt;strong&gt;“genspider”&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# syntax is --&amp;gt; scrapy genspider name_of_spider website.com 
scrapy genspider amazon amazon.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Scrapy will create a new file, with a spider template.&lt;/p&gt;

&lt;p&gt;In our case, we will get a new file in the spiders folder called &lt;strong&gt;“amazon.py”&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're going to remove the default code from this (allowed_domains, start_urls, parse function) and start writing our own code.&lt;/p&gt;

&lt;p&gt;We’re going to create four functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will send a search query to Amazon with a particular keyword.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_keyword_response -&lt;/strong&gt; will extract the ASIN value for each product returned in the Amazon keyword query, then send a new request to Amazon to return the product page of that product. It will also move to the next page and repeat the process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_product_page -&lt;/strong&gt; will extract all the target information from the product page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; will send the request to Scraper API so it can retrieve the HTML response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With a plan made, now let’s get to work…&lt;/p&gt;

&lt;h2&gt;
  
  
  Send Search Queries To Amazon
&lt;/h2&gt;

&lt;p&gt;The first step is building &lt;strong&gt;start_requests&lt;/strong&gt;, our function that sends search queries to Amazon with our keywords, which is pretty simple…&lt;/p&gt;

&lt;p&gt;First let’s quickly define a list variable with our search keywords outside the AmazonSpider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;queries = ['tshirt for men', 'tshirt for women']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let's create our &lt;strong&gt;start_requests&lt;/strong&gt; function within the AmazonSpider that will send the requests to Amazon.&lt;/p&gt;

&lt;p&gt;To access Amazon’s search functionality via a URL we need to send a search query &lt;strong&gt;“k=SEARCH_KEYWORD”&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/s?k=&amp;lt;SEARCH_KEYWORD&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When implemented in our &lt;strong&gt;start_requests&lt;/strong&gt; function, it looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## amazon.py
import scrapy
from urllib.parse import urlencode

queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):

    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every query in our queries list, we will &lt;strong&gt;urlencode&lt;/strong&gt; it so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL. &lt;/p&gt;
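&lt;p&gt;For example, urlencoding our first keyword produces a URL-safe query string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from urllib.parse import urlencode
&amp;gt;&amp;gt;&amp;gt; 'https://www.amazon.com/s?' + urlencode({'k': 'tshirt for men'})
'https://www.amazon.com/s?k=tshirt+for+men'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;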

&lt;p&gt;Since Scrapy is async, we will use &lt;strong&gt;yield&lt;/strong&gt; instead of &lt;strong&gt;return&lt;/strong&gt;, which means the functions should either yield a request or a completed dictionary. If a new request is yielded it will go to the callback method, if an item is yielded it will go to the pipeline for data cleaning.&lt;/p&gt;

&lt;p&gt;In our case, when a scrapy.Request is yielded it will activate our &lt;strong&gt;parse_keyword_response&lt;/strong&gt; callback function, which will then extract the ASIN for each product.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scraping Amazon’s Product Listing Page
&lt;/h2&gt;

&lt;p&gt;The cleanest and most popular way to retrieve Amazon product pages is to use their ASIN ID. &lt;/p&gt;

&lt;p&gt;ASINs are unique IDs that every product on Amazon has. We can use this ID as part of our URL to retrieve the product page of any Amazon product, like this...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/dp/&amp;lt;ASIN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can extract the ASIN value from the product listing page by using Scrapy’s built-in XPath selector extractor methods.&lt;/p&gt;

&lt;p&gt;XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should &lt;a href="https://docs.scrapy.org/en/latest/topics/selectors.html" rel="noopener noreferrer"&gt;check out the documentation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using Scrapy Shell, I’m able to develop an XPath selector that grabs the ASIN value for every product on the product listing page and creates a URL for each product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;products = response.xpath('//*[@data-asin]')

for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will configure the function to send a request to this URL and then call the &lt;strong&gt;parse_product_page&lt;/strong&gt; callback function when we get a response. We will also add the meta parameter to this request which is used to pass items between functions (or edit certain settings).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Extracting Product Data From Product Page
&lt;/h2&gt;

&lt;p&gt;Now, we’re finally getting to the good stuff!&lt;/p&gt;

&lt;p&gt;So after the parse_keyword_response function requests the product pages URL, it passes the response it receives from Amazon to the &lt;strong&gt;parse_product_page&lt;/strong&gt; callback function along with the ASIN ID in the meta parameter.&lt;/p&gt;

&lt;p&gt;Now, we want to extract the data we need from a product page like &lt;a href="https://www.amazon.com/dp/B06XWLMYVY" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq9g4br629ogqmgbn6ji9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq9g4br629ogqmgbn6ji9.PNG" alt="Amazon Product Page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do so we will have to write XPath selectors to extract each field we want from the HTML response we receive back from Amazon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scraping the image URL, I’ve gone with a regex selector over an XPath selector, as the XPath was extracting the image in base64.&lt;/p&gt;

&lt;p&gt;With very big websites like Amazon, which have various types of product pages, you will notice that sometimes writing a single XPath selector won’t be enough, as it might work on some pages but not on others.&lt;/p&gt;

&lt;p&gt;In cases like these, you will need to write numerous XPath selectors to cope with the various page layouts. I ran into this issue when trying to extract the product price so I needed to give the spider 3 different XPath options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the spider can't find a price with the first XPath selector then it moves onto the next one, etc.&lt;/p&gt;

&lt;p&gt;If we look at the product page again, we will see that it contains variations of the product in different sizes and colors. To extract this data we will write a quick test to see if this section is present on the page, and if it is we will extract it using regex selectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp = response.xpath('//*[@id="twister"]')
        sizes = []
        colors = []
        if temp:
            s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
            json_acceptable = s.replace("'", "\"")
            di = json.loads(json_acceptable)
            sizes = di.get('size_name', [])
            colors = di.get('color_name', [])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting it all together, the &lt;strong&gt;parse_product_page&lt;/strong&gt; function will look like this, and will return a JSON object which will be sent to the pipelines.py file for data cleaning (we will discuss this later).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

        temp = response.xpath('//*[@id="twister"]')
        sizes = []
        colors = []
        if temp:
            s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
            json_acceptable = s.replace("'", "\"")
            di = json.loads(json_acceptable)
            sizes = di.get('size_name', [])
            colors = di.get('color_name', [])

        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
        yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating, 'NumberOfReviews': number_of_reviews,
               'Price': price, 'AvailableSizes': sizes, 'AvailableColors': colors, 'BulletPoints': bullet_points,
               'SellerRank': seller_rank}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Iterating Through Product Listing Pages
&lt;/h2&gt;

&lt;p&gt;We’re looking good now…&lt;/p&gt;

&lt;p&gt;Our spider will search Amazon based on the keyword we give it and scrape the details of the products it returns on page 1. However, what if we want our spider to navigate through every page and scrape the products of each one?&lt;/p&gt;

&lt;p&gt;To implement this, all we need to do is add a small bit of extra code to our &lt;strong&gt;parse_keyword_response&lt;/strong&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

        next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
        if next_page:
            url = urljoin("https://www.amazon.com",next_page)
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the spider has scraped all the product pages on the first page, it will then check to see if there is a next page button. If there is, it will retrieve the url extension and create a new URL for the next page. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/s?k=tshirt+for+men&amp;amp;page=2&amp;amp;qid=1594912185&amp;amp;ref=sr_pg_1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there it will restart the &lt;strong&gt;parse_keyword_response&lt;/strong&gt; function using the callback and extract the ASIN IDs for each product and extract all the product data like before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing The Spider
&lt;/h2&gt;

&lt;p&gt;Now that we’ve developed our spider it is time to test it. Here we can use Scrapy’s built-in CSV exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl amazon -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All going good, you should now have items in test.csv, but you will notice there are 2 issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the text is messy and some values are lists&lt;/li&gt;
&lt;li&gt;we are getting 429 responses from Amazon, which means Amazon is detecting that our requests are coming from a bot and is blocking our spider. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Issue number two is the far bigger problem: if we keep going like this, Amazon will quickly ban our IP address and we won’t be able to scrape Amazon at all. &lt;/p&gt;

&lt;p&gt;In order to solve this, we will need to use a large proxy pool and rotate our proxies and headers with every request. For this we will use Scraper API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Your Proxies With Scraper API
&lt;/h2&gt;

&lt;p&gt;As discussed at the start of this article, Scraper API is a proxy API designed to take the hassle out of using web scraping proxies. &lt;/p&gt;

&lt;p&gt;Instead of finding your own proxies and building your own proxy infrastructure to rotate proxies and headers with every request, along with detecting bans and bypassing anti-bots, you just send the URL you want to scrape to Scraper API and it will take care of everything for you.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;/p&gt;

&lt;p&gt;To do so, I just needed to create a simple function that sends a GET request to Scraper API with the URL we want to scrape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API = '&amp;lt;YOUR_API_KEY&amp;gt;'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then modify our spider functions so as to use the Scraper API proxy by setting the url parameter in scrapy.Request to &lt;strong&gt;get_url(url)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
       ...
       …
       yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

def parse_keyword_response(self, response):
       ...
       …
      yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
        ...
       …
       yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A really cool feature with Scraper API is that you can enable Javascript rendering, geotargeting, residential IPs, etc. by simply adding a flag to your API request.&lt;/p&gt;

&lt;p&gt;As Amazon changes the pricing data and supplier data shown based on the country you are making the request from, we're going to use Scraper API's geotargeting feature so that Amazon thinks our requests are coming from the US. To do this we need to add the flag &lt;strong&gt;"&amp;amp;country_code=us"&lt;/strong&gt; to the request, which we can do by adding another parameter to the payload variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
    payload = {'api_key': API, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out Scraper APIs other functionality here in their &lt;a href="https://www.scraperapi.com/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, we have to go into the &lt;strong&gt;settings.py&lt;/strong&gt; file and change the number of concurrent requests we’re allowed to make based on the concurrency limit of our Scraper API plan, which for the free plan is 5 concurrent requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

CONCURRENT_REQUESTS = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concurrency is the number of requests you are allowed to make in parallel at any one time. The more concurrent requests you can make the faster you can scrape.&lt;/p&gt;

&lt;p&gt;Also, we should set &lt;code&gt;RETRY_TIMES&lt;/code&gt; to tell Scrapy to retry any failed requests (to 5 for example) and make sure that &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt;  and &lt;code&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/code&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleaning Data With Pipelines
&lt;/h2&gt;

&lt;p&gt;The final step we need to do is to do a bit of data cleaning using the &lt;strong&gt;pipelines.py&lt;/strong&gt; file as the text is messy and some values are lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TutorialPipeline:

    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the spider has yielded a JSON object, the item is passed to the pipeline to be cleaned.&lt;/p&gt;
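&lt;p&gt;For illustration, here is roughly what the pipeline does to a single (hypothetical) scraped item:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## before cleaning (hypothetical values)
{'Title': '  Mens Cotton T-Shirt ', 'Rating': '4.5 out of 5 stars', 'AvailableSizes': ['Small', 'Medium', 'Large'], ...}

## after TutorialPipeline.process_item
{'Title': 'Mens Cotton T-Shirt', 'Rating': '4.5', 'AvailableSizes': 'Small, Medium, Large', ...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;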

&lt;p&gt;To enable the pipeline we need to add it to the settings.py file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are good to go. You can test the spider again by running it with the crawl command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl amazon -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time you should see that the spider was able to scrape all the available products for your keyword without getting banned.&lt;/p&gt;

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Amazon project then feel free to do so. &lt;a href="https://github.com/ian-kerins/amazon-python-scrapy-scraper" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;signing up here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
  </channel>
</rss>
