<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ian Kerins</title>
    <description>The latest articles on DEV Community by Ian Kerins (@iankerins).</description>
    <link>https://dev.to/iankerins</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F426918%2F0c864e5b-3884-4b39-93d4-2e1bf45ac525.jpg</url>
      <title>DEV Community: Ian Kerins</title>
      <link>https://dev.to/iankerins</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iankerins"/>
    <language>en</language>
    <item>
      <title>The 5 Best Scrapyd Dashboards &amp; Admin Tools </title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Fri, 14 Jan 2022 09:06:13 +0000</pubDate>
      <link>https://dev.to/iankerins/the-5-best-scrapyd-dashboards-admin-tools-42eb</link>
      <guid>https://dev.to/iankerins/the-5-best-scrapyd-dashboards-admin-tools-42eb</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Scrapyd is the de facto spider management tool for developers who want a free and effective way to manage their Scrapy spiders on multiple servers without having to configure cron jobs or use paid tools like &lt;a href="https://www.zyte.com/scrapy-cloud/"&gt;Scrapy Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The one major drawback with Scrapyd, however, is that the default dashboard it comes with is basic, to say the least.&lt;/p&gt;

&lt;p&gt;Because of this, numerous web scraping teams have had to build their own Scrapyd dashboards to get the functionality that they need. &lt;/p&gt;

&lt;p&gt;In this guide, we're going to go through the &lt;strong&gt;5 Best Scrapyd Dashboards&lt;/strong&gt; that these developers have decided to share with the community so you don't have to build your own.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ScrapeOps&lt;/li&gt;
&lt;li&gt;ScrapydWeb&lt;/li&gt;
&lt;li&gt;Gerapy&lt;/li&gt;
&lt;li&gt;SpiderKeeper&lt;/li&gt;
&lt;li&gt;Crawlab&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  #1 ScrapeOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a new Scrapyd dashboard and monitoring tool for Scrapy. &lt;/p&gt;

&lt;p&gt;With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary goal with ScrapeOps is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.&lt;/p&gt;

&lt;p&gt;Unlike the other options on this list, ScrapeOps is a full end-to-end web scraping monitoring and management tool dedicated to web scraping that automatically sets up all the monitors, health checks and alerts for you. &lt;/p&gt;

&lt;p&gt;If you have an issue integrating ScrapeOps or need advice on setting up your scrapers, they have a support team on hand to assist you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs against your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Job stats tracked include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Items Parsed &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Response Status Codes&lt;/li&gt;
&lt;li&gt;✅ Success Rates&lt;/li&gt;
&lt;li&gt;✅ Latencies&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;There are two steps to integrate ScrapeOps with your Scrapyd servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install ScrapeOps Logger Extension&lt;/li&gt;
&lt;li&gt;Connect ScrapeOps to Your Scrapyd Servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can't connect ScrapeOps to a Scrapyd server that is only running locally and doesn't expose a public IP address for ScrapeOps to connect to.&lt;/p&gt;

&lt;p&gt;Once set up, you will be able to schedule, run and manage all your Scrapyd servers from one dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Install Scrapy Logger Extension
&lt;/h4&gt;

&lt;p&gt;For ScrapeOps to monitor your scrapers, create dashboards and trigger alerts, you need to install the ScrapeOps logger extension in each of your Scrapy projects.&lt;/p&gt;

&lt;p&gt;Simply install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Connect ScrapeOps to Your Scrapyd Servers
&lt;/h4&gt;

&lt;p&gt;The next step is giving ScrapeOps the connection details of your Scrapyd servers so that you can manage them from the dashboard. &lt;/p&gt;

&lt;p&gt;Within your dashboard go to the &lt;a href="https://scrapeops.io/app/servers"&gt;Servers page&lt;/a&gt; and click on the &lt;strong&gt;Add Scrapyd Server&lt;/strong&gt; button at the top of the page.&lt;/p&gt;

&lt;p&gt;In the dropdown section that appears, enter your connection details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server Name&lt;/li&gt;
&lt;li&gt;Server Domain Name (optional)&lt;/li&gt;
&lt;li&gt;Server IP Address&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once set up, you can schedule your scraping jobs to run periodically using the ScrapeOps scheduler and monitor your scraping results in your dashboards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XXy4D1IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XXy4D1IZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" alt="ScrapeOps Dashboard Demo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a powerful web scraping monitoring tool, that gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free unlimited community plan.&lt;/li&gt;
&lt;li&gt;Simple 30-second install.&lt;/li&gt;
&lt;li&gt;Hosted solution, so you don't need to spin up a server.&lt;/li&gt;
&lt;li&gt;Full Scrapyd JSON API support.&lt;/li&gt;
&lt;li&gt;Includes the most fully featured scraping monitoring, health checks and alerts straight out of the box.&lt;/li&gt;
&lt;li&gt;Customer support team available to help you get set up and to add new features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not open source, if that is your preference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #2 ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;The most popular open source Scrapyd dashboard, &lt;a href="https://github.com/my8100/scrapydweb"&gt;ScrapydWeb&lt;/a&gt; is a great solution for anyone looking for a robust spider management tool that can be integrated with their Scrapyd servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" alt="Scrapydweb Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With ScrapydWeb, you can schedule, run and see the stats from all your jobs across all your servers on a single dashboard. ScrapydWeb supports all the Scrapyd JSON API endpoints, so you can also stop jobs mid-crawl and delete projects without having to log into your Scrapyd server.&lt;/p&gt;

&lt;p&gt;When combined with &lt;a href="https://github.com/my8100/logparser"&gt;LogParser&lt;/a&gt;, ScrapydWeb will also extract your Scrapy logs from your server and parse them into an easier-to-understand format.&lt;/p&gt;

&lt;p&gt;A powerful feature of ScrapydWeb that many of the other open source Scrapyd dashboards lack is the ability to easily connect multiple Scrapyd servers to your dashboard, execute actions on multiple nodes with the same command, and auto-package your spiders on the Scrapyd server.&lt;/p&gt;

&lt;p&gt;Although ScrapydWeb has a lot of spider management functionality, its monitoring/job visualisation capabilities are quite limited, and there are a number of user experience issues that make it less than ideal if you plan to rely on it completely as your main spider monitoring solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;If you want an easy-to-use open-source Scrapyd dashboard then ScrapydWeb is a great choice. It is the most popular open-source Scrapyd dashboard at the moment, and has a lot of functionality built in.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source.&lt;/li&gt;
&lt;li&gt;Robust and battle tested Scrapyd management tool.&lt;/li&gt;
&lt;li&gt;Lots of Spider management functionality.&lt;/li&gt;
&lt;li&gt;Best multi-node server management functionality.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #3 Gerapy
&lt;/h2&gt;

&lt;p&gt;Next on our list is &lt;a href="https://github.com/Gerapy/Gerapy"&gt;Gerapy&lt;/a&gt;. With 2.6k stars on GitHub, it is another very popular open source Scrapyd dashboard.&lt;/p&gt;

&lt;p&gt;Gerapy enables you to schedule, run and control all your Scrapy scrapers from a single dashboard. Like others on this list, its goal is to make managing distributed crawler projects easier and less time consuming.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---dWVH8Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070132.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---dWVH8Aj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070132.png" alt="Gerapy Dashboard" width="880" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gerapy boasts the following features and functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More convenient control of crawler runs.&lt;/li&gt;
&lt;li&gt;View crawl results in closer to real time.&lt;/li&gt;
&lt;li&gt;Easier scheduling of timed tasks.&lt;/li&gt;
&lt;li&gt;Easier project deployment.&lt;/li&gt;
&lt;li&gt;More unified host management.&lt;/li&gt;
&lt;li&gt;Write crawler code more easily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike ScrapydWeb, Gerapy also has a visual code editor built in, so you can edit your project's code right from the Gerapy dashboard if you would like to make a quick change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TRE3yi8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070248.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TRE3yi8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://qiniu.cuiqingcai.com/2019-11-23-070248.png" alt="Gerapy Visual Editor" width="880" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Gerapy is a great alternative to the open-source ScrapydWeb. It will allow you to manage multiple Scrapyd servers with a single dashboard. &lt;/p&gt;

&lt;p&gt;However, it doesn't extract the job stats from your log files, so you can't view all your jobs' scraping results in a single view as you can with ScrapydWeb.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source, and very active maintainers.&lt;/li&gt;
&lt;li&gt;Robust Scrapyd management tool.&lt;/li&gt;
&lt;li&gt;Full Spider management functionality.&lt;/li&gt;
&lt;li&gt;Ability to edit spiders within dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No log parsing functionality equivalent to ScrapydWeb's &lt;a href="https://github.com/my8100/logparser"&gt;LogParser&lt;/a&gt; integration.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #4 SpiderKeeper
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/DormyMo/SpiderKeeper"&gt;SpiderKeeper&lt;/a&gt; is another open-source Scrapyd dashboard based on the old Scrapinghub Scrapy Cloud dashboard.&lt;/p&gt;

&lt;p&gt;SpiderKeeper was once a very popular Scrapyd dashboard because it had robust functionality and looked good. &lt;/p&gt;

&lt;p&gt;However, it has fallen out of favour due to the launch of other dashboard projects and the fact that it isn't maintained anymore (last update was in 2018, plus numerous open pull requests).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd33XKvp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/DormyMo/SpiderKeeper/master/screenshot/screenshot_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd33XKvp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/DormyMo/SpiderKeeper/master/screenshot/screenshot_1.png" alt="SpiderKeeper Dashboard" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SpiderKeeper is a simpler implementation of the functionality that ScrapeOps, ScrapydWeb or Gerapy provide; however, it still covers all the basics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage your Scrapy spiders from a dashboard.&lt;/li&gt;
&lt;li&gt;Schedule periodic jobs to run automatically.&lt;/li&gt;
&lt;li&gt;Deploy spiders to Scrapyd with a single click.&lt;/li&gt;
&lt;li&gt;Basic spider stats.&lt;/li&gt;
&lt;li&gt;Full Scrapyd API support.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;SpiderKeeper was a great open-source Scrapyd dashboard; however, since it hasn't been actively maintained in years, we would recommend using one of the other options on this list.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source.&lt;/li&gt;
&lt;li&gt;Good functionality that covers all the basics.&lt;/li&gt;
&lt;li&gt;Ability to deploy spiders within dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Not actively maintained, last update 2018.&lt;/li&gt;
&lt;li&gt;Limited job monitoring and data visualisation functionality.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #5 Crawlab
&lt;/h2&gt;

&lt;p&gt;Whilst &lt;a href="https://github.com/crawlab-team/crawlab"&gt;Crawlab&lt;/a&gt; isn’t a Scrapyd dashboard per se, it is definitely an interesting tool if you are looking for a way to manage all your spiders from one central admin dashboard.&lt;/p&gt;

&lt;p&gt;Crawlab is a Golang-based distributed web crawler admin platform for spider management, regardless of language or framework. This means you can use it with any type of spider, be it based on Python Requests, NodeJS, Golang, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uRwa8izO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/crawlab-team/images/raw/main/20210729/screenshot-home.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uRwa8izO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/crawlab-team/images/raw/main/20210729/screenshot-home.png%3Fraw%3Dtrue" alt="Crawlab Dashboard" width="880" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fact that Crawlab isn't Scrapy specific gives you huge flexibility: if you decide to move away from Scrapy in the future, or need to create a Puppeteer scraper to scrape a particularly difficult site, you can easily add that scraper to your Crawlab setup.&lt;/p&gt;

&lt;p&gt;Of the open-source tools on the list, Crawlab is by far the most comprehensive solution with a whole range of features and functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Naturally supports distributed spiders out of the box.&lt;/li&gt;
&lt;li&gt;Schedule cron jobs.&lt;/li&gt;
&lt;li&gt;Task management.&lt;/li&gt;
&lt;li&gt;Results exporting.&lt;/li&gt;
&lt;li&gt;Online code editor.&lt;/li&gt;
&lt;li&gt;Configurable spiders.&lt;/li&gt;
&lt;li&gt;Notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the only downsides to it is that there is a bit of a learning curve to get it set up on your own server.&lt;/p&gt;

&lt;p&gt;As of writing this article, it is the most active open source project on this list.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Crawlab is a very powerful scraper management solution with a huge range of functionality, and is a great option for anyone who is running multiple types of scrapers. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open source, and actively maintained.&lt;/li&gt;
&lt;li&gt;Very powerful functionality.&lt;/li&gt;
&lt;li&gt;Ability to deploy any type of scraper (Python, Scrapy, NodeJS, Golang, etc.).&lt;/li&gt;
&lt;li&gt;Very good documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Job monitoring and data visualisation functionality could be better.&lt;/li&gt;
&lt;li&gt;No customer support.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>scrapy</category>
      <category>scrapyd</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Scrapyd Guide - Deploy, Schedule &amp; Run Your Scrapy Spiders</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 15:23:44 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-scrapyd-guide-deploy-schedule-run-your-scrapy-spiders-3ip9</link>
      <guid>https://dev.to/iankerins/the-complete-scrapyd-guide-deploy-schedule-run-your-scrapy-spiders-3ip9</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook" rel="noopener noreferrer"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You've built your scraper, tested that it works and now want to schedule it to run every hour, day, etc. and scrape the data you need. But what is the best way to do that?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/scrapy/scrapyd" rel="noopener noreferrer"&gt;Scrapyd&lt;/a&gt; is one of the most popular options. Created by the same developers that developed Scrapy itself, Scrapyd is a tool for running Scrapy spiders in production on remote servers so you don't need to run them on a local machine. &lt;/p&gt;

&lt;p&gt;In this guide, we're going to run through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Is Scrapyd?&lt;/li&gt;
&lt;li&gt;How To Setup Scrapyd?&lt;/li&gt;
&lt;li&gt;Deploying Spiders To Scrapyd&lt;/li&gt;
&lt;li&gt;Controlling Spiders With Scrapyd&lt;/li&gt;
&lt;li&gt;Integrating Scrapyd with ScrapeOps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available, from &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;) to &lt;a href="https://github.com/my8100/scrapydweb" rel="noopener noreferrer"&gt;ScrapydWeb&lt;/a&gt;, &lt;a href="https://github.com/DormyMo/SpiderKeeper" rel="noopener noreferrer"&gt;SpiderKeeper&lt;/a&gt;, and more. &lt;/p&gt;

&lt;p&gt;So if you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui" rel="noopener noreferrer"&gt;Guide to the Best Scrapyd Dashboards&lt;/a&gt;, so you can see the pros and cons of each before you decide on which option to go with.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Scrapyd?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapyd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Scrapyd&lt;/a&gt; is application that allows us to deploy Scrapy spiders on a server and run them remotely using a JSON API. Scrapyd allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run Scrapy jobs.&lt;/li&gt;
&lt;li&gt;Pause &amp;amp; Cancel Scrapy jobs.&lt;/li&gt;
&lt;li&gt;Manage Scrapy project/spider versions.&lt;/li&gt;
&lt;li&gt;Access Scrapy logs remotely. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapyd is a great option for developers who want an easy way to manage production Scrapy spiders that run on a remote server. &lt;/p&gt;

&lt;p&gt;With Scrapyd you can manage multiple servers from one central point by using a ready-made Scrapyd management tool like &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, an open source alternative, or by building your own.&lt;/p&gt;

&lt;p&gt;Here you can check out the full &lt;a href="https://scrapyd.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;Scrapyd docs&lt;/a&gt; and &lt;a href="https://github.com/scrapy/scrapyd" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Setup Scrapyd
&lt;/h2&gt;

&lt;p&gt;Getting Scrapyd set up is quick and simple. You can run it locally or on a server.&lt;/p&gt;

&lt;p&gt;The first step is to install Scrapyd:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then start the server by using the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start Scrapyd running on &lt;code&gt;http://localhost:6800/&lt;/code&gt;. You can open this url in your browser and you should see the following screen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" alt="Scrapyd Homepage"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Deploying Spiders To Scrapyd
&lt;/h2&gt;

&lt;p&gt;To run jobs using Scrapyd, we first need to eggify and deploy our Scrapy project to the Scrapyd server. To do this, there is an easy-to-use library called &lt;a href="https://github.com/scrapy/scrapyd-client" rel="noopener noreferrer"&gt;scrapyd-client&lt;/a&gt; that makes this process very simple.&lt;/p&gt;

&lt;p&gt;First, let's install scrapyd-client:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/scrapy/scrapyd-client.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once installed, navigate to the Scrapy project you want to deploy and open your &lt;code&gt;scrapy.cfg&lt;/code&gt; file, which should be located in your project's root directory. You should see something like this, with the &lt;strong&gt;"demo"&lt;/strong&gt; text replaced by your Scrapy project's name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy]
#url = http://localhost:6800/
project = demo  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the &lt;code&gt;scrapy.cfg&lt;/code&gt; configuration file defines the endpoint your Scrapy project should be deployed to. To deploy our project to a locally running Scrapyd server, we just need to uncomment the &lt;code&gt;url&lt;/code&gt; value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy]
url = http://localhost:6800/
project = demo  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then run the following command in your Scrapy project's root directory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will then eggify your Scrapy project and deploy it to your locally running Scrapyd server. You should get a result like this in your terminal if it was successful:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scrapyd-deploy default
Packing version 1640086638
Deploying to project "demo" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "DESKTOP-67BR2", "status": "ok", "project": "demo", "version": "1640086638", "spiders": 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now your Scrapy project has been deployed to your Scrapyd server and is ready to be run.&lt;/p&gt;

&lt;h4&gt;
  
  
  Aside: Custom Deployment Endpoints
&lt;/h4&gt;

&lt;p&gt;The above example was the simplest implementation and assumed you were just deploying your Scrapy project to a local Scrapyd server. However, you can customise or add multiple deployment endpoints to the &lt;code&gt;scrapy.cfg&lt;/code&gt; file if you would like.&lt;/p&gt;

&lt;p&gt;For example you can define local and production endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  

[deploy:local]
url = http://localhost:6800/
project = demo 

[deploy:production]
url = &amp;lt;MY_IP_ADDRESS&amp;gt;
project = demo 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And deploy your Scrapy project locally or to production using these commands:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Deploy locally
scrapyd-deploy local

## Deploy to production
scrapyd-deploy production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or deploy a specific project by specifying the target and project name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy &amp;lt;target&amp;gt; -p &amp;lt;project&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For more information about this, check out the &lt;a href="https://github.com/scrapy/scrapyd-client" rel="noopener noreferrer"&gt;scrapyd-client docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Controlling Spiders With Scrapyd
&lt;/h2&gt;

&lt;p&gt;Scrapyd comes with a minimal web interface which can be accessed at &lt;a href="http://localhost:6800/" rel="noopener noreferrer"&gt;http://localhost:6800/&lt;/a&gt;; however, this interface gives just a rudimentary overview of what is running on a Scrapyd server and doesn't allow you to control the spiders deployed to it.&lt;/p&gt;

&lt;p&gt;To control your spiders with Scrapyd you have 3 options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapyd JSON API&lt;/li&gt;
&lt;li&gt;Python-Scrapyd-API Library&lt;/li&gt;
&lt;li&gt;Scrapyd Dashboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Scrapyd JSON API
&lt;/h3&gt;

&lt;p&gt;To schedule, run and cancel jobs on your Scrapyd server, we need to use the JSON API it provides. Depending on the endpoint, the API supports &lt;code&gt;GET&lt;/code&gt; or &lt;code&gt;POST&lt;/code&gt; HTTP requests. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl http://localhost:6800/daemonstatus.json
{ "status": "ok", "running": "0", "pending": "0", "finished": "0", "node_name": "DESKTOP-67BR2" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The API has the following endpoints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;daemonstatus.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Checks the status of the Scrapyd server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;addversion.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Add a version to a project, creating the project if it doesn’t exist.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;schedule.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schedule a job to run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;cancel.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cancel a job. If the job is pending, it will be removed. If the job is running, the job will be shutdown.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listprojects.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of the projects uploaded to the Scrapyd server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listversions.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of versions available for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listspiders.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of the spiders available for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;listjobs.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Returns a list of pending, running and finished jobs for the requested project.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;delversion.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deletes a project version. If the project only has one version, deletes the project too.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;delproject.json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deletes the project, and all associated versions.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://scrapyd.readthedocs.io/en/stable/api.html" rel="noopener noreferrer"&gt;Full API specifications can be found here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can interact with these endpoints using &lt;strong&gt;Python Requests&lt;/strong&gt; or any other HTTP request library, or we can use &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;python-scrapyd-api&lt;/a&gt;, a Python wrapper for the Scrapyd API.&lt;/p&gt;
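&lt;p&gt;For example, here is a minimal sketch of how you could hit the schedule.json and listjobs.json endpoints using Python Requests (the project and spider names below are just placeholders for your own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

SCRAPYD_URL = 'http://localhost:6800'

## Schedule a job by POSTing the project and spider name
response = requests.post(
    SCRAPYD_URL + '/schedule.json',
    data={'project': 'demo', 'spider': 'my_spider'},
)
print(response.json())
## e.g. {'status': 'ok', 'jobid': '...', 'node_name': '...'}

## List pending, running and finished jobs for that project
jobs = requests.get(SCRAPYD_URL + '/listjobs.json', params={'project': 'demo'})
print(jobs.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;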




&lt;h3&gt;
  
  
  Python-Scrapyd-API Library
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;python-scrapyd-api&lt;/a&gt; provides a clean and easy to use Python wrapper around the Scrapyd JSON API, which can simplify your code.&lt;/p&gt;

&lt;p&gt;First, we need to install it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install python-scrapyd-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then in our code we need to import the library and configure it to interact with our Scrapyd server by passing it the Scrapyd IP address.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from scrapyd_api import ScrapydAPI
&amp;gt;&amp;gt;&amp;gt; scrapyd = ScrapydAPI('http://localhost:6800')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here, we can use the built in methods to interact with the Scrapyd server.&lt;/p&gt;




&lt;h4&gt;
  
  
  Check Daemon Status
&lt;/h4&gt;

&lt;p&gt;Checks the status of the Scrapyd server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.daemon_status()
{u'finished': 0, u'running': 0, u'pending': 0, u'node_name': u'DESKTOP-67BR2'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  List All Projects
&lt;/h4&gt;

&lt;p&gt;Returns a list of the projects uploaded to the Scrapyd server. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.list_projects()
[u'demo', u'quotes_project']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  List All Spiders
&lt;/h4&gt;

&lt;p&gt;Enter the project name, and it will return a list of the spiders available for the requested project.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.list_spiders('project_name')
[u'raw_spider', u'js_enhanced_spider', u'selenium_spider']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h4&gt;
  
  
  Run a Job
&lt;/h4&gt;

&lt;p&gt;Run a Scrapy spider by specifying the project and spider name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.schedule('project_name', 'spider_name')
# Returns the Scrapyd job id.
u'14a6599ef67111e38a0e080027880ca6'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pass custom settings using the &lt;code&gt;settings&lt;/code&gt; argument.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; settings = {'DOWNLOAD_DELAY': 2}
&amp;gt;&amp;gt;&amp;gt; scrapyd.schedule('project_name', 'spider_name', settings=settings)
u'25b6588ef67333e38a0e080027880de7'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One important thing to note about the schedule.json API endpoint: even though it is called schedule.json, using it only adds a job to Scrapyd's internal scheduler queue, which will be run when a slot is free.&lt;/p&gt;

&lt;p&gt;This endpoint doesn't have the functionality to schedule a job to run at a specific time in the future; Scrapyd will simply add the job to a queue and run it once a slot becomes available.&lt;/p&gt;

&lt;p&gt;To actually schedule a job to run at a specific date/time in the future, or periodically at a specific time, you will need to control this scheduling on your end. Tools like &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; will do this for you.&lt;/p&gt;
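&lt;p&gt;As a very rough sketch of what controlling the scheduling on your end could look like, the snippet below reuses the python-scrapyd-api client from above plus the standard library to queue a spider roughly once an hour. In practice you would more likely use cron, a task scheduler, or a tool like ScrapeOps, and the project/spider names here are just placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')

## Minimal periodic scheduler: queue the spider roughly once an hour.
while True:
    job_id = scrapyd.schedule('project_name', 'spider_name')
    print('Queued job:', job_id)
    time.sleep(60 * 60)  ## wait one hour before queueing the next run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;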




&lt;h4&gt;
  
  
  Cancel a Running Job
&lt;/h4&gt;

&lt;p&gt;Cancel a running job by sending the project name and the job_id.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.cancel('project_name', '14a6599ef67111e38a0e080027880ca6')
# Returns the "previous state" of the job before it was cancelled: 'running' or 'pending'.
'running'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When sent, it will return the "previous state" of the job before it was cancelled. You can verify that the job was actually cancelled by checking the job's status.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; scrapyd.job_status('project_name', '14a6599ef67111e38a0e080027880ca6')
# Returns 'running', 'pending', 'finished' or '' for unknown state.
'finished'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For more functionality, check out the &lt;strong&gt;python-scrapyd-api&lt;/strong&gt; &lt;a href="https://github.com/djm/python-scrapyd-api" rel="noopener noreferrer"&gt;documentation here&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scrapyd Dashboard
&lt;/h3&gt;

&lt;p&gt;Using Scrapyd's JSON API to control your spiders is possible; however, it isn't ideal as you will need to create custom workflows on your end to monitor, manage and run your spiders, which can become a major project in itself if you need to manage spiders spread across multiple servers.&lt;/p&gt;

&lt;p&gt;Other developers ran into this problem too, so luckily for us they decided to create free and open-source Scrapyd dashboards that can connect to your Scrapyd servers, letting you manage everything from a single dashboard.&lt;/p&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/python-scrapy-playbook/extensions/scrapydweb-guide"&gt;ScrapydWeb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/DormyMo/SpiderKeeper" rel="noopener noreferrer"&gt;SpiderKeeper&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui" rel="noopener noreferrer"&gt;Guide to the Best Scrapyd Dashboards here&lt;/a&gt;.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating Scrapyd with ScrapeOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; is a free monitoring tool for web scraping that also has a Scrapyd dashboard that allows you to schedule, run and manage all your scrapers from a single dashboard. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;Unlike the other Scrapyd dashboards, ScrapeOps is a full end-to-end web scraping monitoring and management tool that automatically sets up all the monitors, health checks and alerts for you.&lt;/p&gt;




&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once set up, ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs against your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;There are two steps to integrate ScrapeOps with your Scrapyd servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install ScrapeOps Logger Extension&lt;/li&gt;
&lt;li&gt;Connect ScrapeOps to Your Scrapyd Servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can't connect ScrapeOps to a Scrapyd server that is only running locally and doesn't expose a public IP address for ScrapeOps to connect to.&lt;/p&gt;

&lt;p&gt;Once set up, you will be able to schedule, run and manage all your Scrapyd servers from one dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-scheduler-holder-162d9dd0d364d461b3a2ce1f9989fd25.png" alt="ScrapeOps Dashboard Demo"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: Install Scrapy Logger Extension
&lt;/h3&gt;

&lt;p&gt;For ScrapeOps to monitor your scrapers, create dashboards and trigger alerts, you need to install the ScrapeOps logger extension in each of your Scrapy projects.&lt;/p&gt;

&lt;p&gt;Simply install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Connect ScrapeOps to Your Scrapyd Servers
&lt;/h3&gt;

&lt;p&gt;The next step is giving ScrapeOps the connection details of your Scrapyd servers so that you can manage them from the dashboard. &lt;/p&gt;

&lt;h4&gt;
  
  
  Enter Scrapyd Server Details
&lt;/h4&gt;

&lt;p&gt;Within your dashboard go to the &lt;a href="https://scrapeops.io/app/servers" rel="noopener noreferrer"&gt;Servers page&lt;/a&gt; and click on the &lt;strong&gt;Add Scrapyd Server&lt;/strong&gt; button at the top of the page.&lt;/p&gt;

&lt;p&gt;In the dropdown section that appears, enter your connection details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Server Name&lt;/li&gt;
&lt;li&gt;Server Domain Name (optional)&lt;/li&gt;
&lt;li&gt;Server IP Address&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Whitelist Our Server (Optional)
&lt;/h4&gt;

&lt;p&gt;Depending on how you are securing your Scrapyd server, you might need to whitelist our IP address so it can connect to your Scrapyd servers. There are two options to do this:&lt;/p&gt;


&lt;h4&gt;
  
  
  Option 1: Auto Install (Ubuntu)
&lt;/h4&gt;

&lt;p&gt;SSH into your server as root and run the following command in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; scrapeops_setup.sh &lt;span class="s2"&gt;"https://assets-scrapeops.nyc3.digitaloceanspaces.com/Bash_Scripts/scrapeops_setup.sh"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; bash scrapeops_setup.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will begin the provisioning process for your server, and will configure the server so that Scrapyd can be managed by ScrapeOps.&lt;/p&gt;


&lt;h4&gt;
  
  
  Option 2: Manual Install
&lt;/h4&gt;

&lt;p&gt;This step is optional but needed if you want to run/stop/re-run/schedule any jobs using our site. If we cannot reach your server via port 80 or 443, the server will be listed as read only.&lt;/p&gt;

&lt;p&gt;The following steps should work on Linux/Unix-based servers that have the UFW firewall installed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Log into your server via SSH&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Allow SSH so that you don't get locked out of your server&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Allow incoming connections from 46.101.44.87&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow from 46.101.44.87 to any port 443,80 proto tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Enable ufw &amp;amp; check firewall rules are implemented&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable
sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Install Nginx &amp;amp; set up a reverse proxy to let connections from ScrapeOps reach your Scrapyd server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;nginx &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the proxy_pass &amp;amp; proxy_set_header lines below into the "location" block of your nginx default config file (usually found in /etc/nginx/sites-available).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;proxy_pass http://localhost:6800/&lt;span class="p"&gt;;&lt;/span&gt;
proxy_set_header X-Forwarded-Proto http&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
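&lt;p&gt;For context, a stripped-down default config with those two lines added might look something like the sketch below (the listen port and server_name are assumptions, so adapt them to your own setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## /etc/nginx/sites-available/default

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://localhost:6800/;
        proxy_set_header X-Forwarded-Proto http;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;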



&lt;p&gt;Reload your nginx config&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl reload nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this is done you should be able to run, re-run, stop, schedule jobs for this server from the ScrapeOps dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for how to use Scrapyd to run your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook/" rel="noopener noreferrer"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapyd</category>
      <category>scrapy</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Guide To ScrapydWeb, Get Setup In 3 Minutes!</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 14:05:12 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-guide-to-scrapydweb-get-setup-in-3-minutes-3ib</link>
      <guid>https://dev.to/iankerins/the-complete-guide-to-scrapydweb-get-setup-in-3-minutes-3ib</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/my8100/scrapydweb"&gt;ScrapydWeb&lt;/a&gt; is the most popular open source &lt;a href="https://github.com/scrapy/scrapyd"&gt;Scrapyd&lt;/a&gt; admin dashboards. Boasting 2,400 Github stars, ScrapydWeb has been fully embraced by the Scrapy community.&lt;/p&gt;

&lt;p&gt;In this guide, we're going to run through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Is ScrapydWeb?&lt;/li&gt;
&lt;li&gt;How To Setup ScrapydWeb?&lt;/li&gt;
&lt;li&gt;Using ScrapydWeb&lt;/li&gt;
&lt;li&gt;Alternatives To ScrapydWeb&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many different Scrapyd dashboard and admin tools available, from &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; (&lt;a href="https://scrapeops.io/app/login/demo"&gt;Live Demo&lt;/a&gt;) to &lt;a href="https://github.com/DormyMo/SpiderKeeper"&gt;SpiderKeeper&lt;/a&gt;, and &lt;a href="https://github.com/Gerapy/Gerapy"&gt;Gerapy&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;So if you'd like to choose the best one for your requirements then be sure to check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui"&gt;Guide to the Best Scrapyd Dashboards&lt;/a&gt;, so you can see the pros and cons of each before you decide on which option to go with.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is ScrapydWeb?
&lt;/h2&gt;

&lt;p&gt;ScrapydWeb is an admin dashboard that is designed to make interacting with Scrapyd daemons much easier. It allows you to schedule, run and view your scraping jobs across multiple servers in one easy-to-use dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fYFLTC----/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/scrapydweb/master/screenshots/servers.png" alt="Scrapydweb Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This addresses the main problem with the default Scrapyd setup: the fact that its user interface has very limited functionality and is pretty ugly.&lt;/p&gt;

&lt;p&gt;Although there are many other Scrapyd dashboards out there, ScrapydWeb quickly became the most popular option after its launch in 2018 because of its ease of use and the extra functionality it offered compared to the other alternatives at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;💠 &lt;strong&gt;Scrapyd Cluster Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💯 All Scrapyd JSON API Supported&lt;/li&gt;
&lt;li&gt;☑️ Group, filter and select any number of nodes&lt;/li&gt;
&lt;li&gt;🖱️ Execute command on multinodes with just a few clicks&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;🔍 &lt;strong&gt;Scrapy Log Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📊 Stats collection&lt;/li&gt;
&lt;li&gt;📈 Progress visualization&lt;/li&gt;
&lt;li&gt;📑 Logs categorization&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;🔋 &lt;strong&gt;Enhancements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 Auto packaging&lt;/li&gt;
&lt;li&gt;🕵️‍♂️ Integrated with &lt;a href="https://github.com/my8100/logparser"&gt;🔗 &lt;em&gt;LogParser&lt;/em&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;⏰ Timer tasks&lt;/li&gt;
&lt;li&gt;📧 Monitor &amp;amp; Alert&lt;/li&gt;
&lt;li&gt;📱 Mobile UI&lt;/li&gt;
&lt;li&gt;🔐 Basic auth for web UI&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How To Setup ScrapydWeb?
&lt;/h2&gt;

&lt;p&gt;Getting set up with ScrapydWeb is pretty simple. You just need to install the ScrapydWeb package and connect it to your Scrapyd server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Scrapyd Server
&lt;/h3&gt;

&lt;p&gt;To run through the installation process, we're first going to need to have a Scrapyd server set up with a project running on it. (You can skip this step if you already have a Scrapyd server set up.)&lt;/p&gt;

&lt;p&gt;If you would like an in-depth walkthrough of what Scrapyd is and how to set it up, then check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/extensions/scrapy-scrapyd-guide"&gt;Scrapyd guide here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Install Scrapyd
&lt;/h4&gt;

&lt;p&gt;The first step is to install Scrapyd:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then start the server by using the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start Scrapyd running on &lt;code&gt;http://localhost:6800/&lt;/code&gt;. You can open this url in your browser and you should see the following screen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_xvNs2-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_xvNs2-n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapyd-homepage-5b9237d9297d5c99275ac3c0477b6384.png" alt="Scrapyd Homepage" width="880" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Deploy Scrapy Project to Scrapyd
&lt;/h4&gt;

&lt;p&gt;To run jobs using Scrapyd, we first need to eggify and deploy our Scrapy project to the Scrapyd server. Luckily, there is an easy-to-use library called &lt;a href="https://github.com/scrapy/scrapyd-client"&gt;scrapyd-client&lt;/a&gt; to do this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/scrapy/scrapyd-client.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once installed, navigate to your Scrapy project, open the &lt;code&gt;scrapy.cfg&lt;/code&gt; file and uncomment the url line under &lt;code&gt;[deploy]&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## scrapy.cfg

[settings]
default = demo.settings  ## demo will be the name of your scrapy project

[deploy]
url = http://localhost:6800/
project = demo  ## demo will be the name of your scrapy project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This &lt;code&gt;[deploy]&lt;/code&gt; section configures the url of the Scrapyd endpoint the project should be deployed to, and the &lt;code&gt;project&lt;/code&gt; field tells Scrapyd which project is being deployed.&lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;scrapy.cfg&lt;/code&gt; file configured, we are now able to deploy the project to the Scrapyd server. To do this, navigate to the Scrapy project you want to deploy in your command line and enter the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapyd-deploy default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When you run this command, you should get a response like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ scrapyd-deploy default
Packing version 1640086638
Deploying to project "scrapy_demo" in http://localhost:6800/addversion.json
Server response (200):
{"node_name": "DESKTOP-67BR2", "status": "ok", "project": "scrapy_demo", "version": "1640086638", "spiders": 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Make sure you have your Scrapyd server running, otherwise you will get an error. &lt;/p&gt;
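
&lt;p&gt;If you want to double-check the deployment went through, you can also query Scrapyd's &lt;code&gt;listprojects.json&lt;/code&gt; and &lt;code&gt;listspiders.json&lt;/code&gt; endpoints from Python. A rough sketch, using the project name from the example response above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## verify_deploy.py (example; assumes Scrapyd on localhost:6800)
import requests

## List the projects deployed to the Scrapyd server
projects = requests.get('http://localhost:6800/listprojects.json').json()
print(projects)  ## e.g. {"status": "ok", "projects": ["scrapy_demo"]}

## List the spiders available in a given project
spiders = requests.get(
    'http://localhost:6800/listspiders.json',
    params={'project': 'scrapy_demo'},
).json()
print(spiders)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;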

&lt;p&gt;Now that we have a Scrapyd server set up and a Scrapy project deployed to it, we can control it with ScrapydWeb.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing ScrapydWeb
&lt;/h3&gt;

&lt;p&gt;Getting ScrapydWeb installed and set up is super easy. (This is a big reason why it has become so popular.)&lt;/p&gt;

&lt;p&gt;To get started we need to install the latest version of ScrapydWeb:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade git+https://github.com/my8100/scrapydweb.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, to run ScrapydWeb we just need to use the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapydweb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will build a ScrapydWeb instance for you, create the necessary settings files and launch a ScrapydWeb server on &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Sometimes the first time you run &lt;code&gt;scrapydweb&lt;/code&gt; it will just create the ScrapydWeb files but won't start the server. If this happens just run the &lt;code&gt;scrapydweb&lt;/code&gt; command again and it will start the server. &lt;/p&gt;

&lt;p&gt;Now, when you open &lt;code&gt;http://127.0.0.1:5000&lt;/code&gt; in your browser you should see a screen like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ItH3g7T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-jobs2-25ef8a459ead57b5290b7bbcbb7ada1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ItH3g7T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-jobs2-25ef8a459ead57b5290b7bbcbb7ada1a.png" alt="ScrapydWeb Jobs" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Logparser
&lt;/h3&gt;

&lt;p&gt;With the current setup you can use ScrapydWeb to schedule and run your scraping jobs, but you won't see any stats for your jobs in your dashboard. &lt;/p&gt;

&lt;p&gt;Not to worry though, the developers behind ScrapydWeb have created a library called &lt;a href="https://github.com/my8100/logparser"&gt;Logparser&lt;/a&gt; to solve exactly this problem.&lt;/p&gt;

&lt;p&gt;If you run Logparser in the same directory as your Scrapyd server, it will automatically parse your Scrapy logs and make them available to your ScrapydWeb dashboard.&lt;/p&gt;

&lt;p&gt;To install Logparser, enter the command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install logparser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then in the same directory as your Scrapyd server, run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logparser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will start a daemon that will automatically parse your Scrapy logs for ScrapydWeb to consume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are running Scrapyd and ScrapydWeb on the same machine then it is recommended to set the &lt;code&gt;LOCAL_SCRAPYD_LOGS_DIR&lt;/code&gt; path to your log files directory and &lt;code&gt;ENABLE_LOGPARSER&lt;/code&gt; to &lt;strong&gt;True&lt;/strong&gt; in your ScrapydWeb's settings file.&lt;/p&gt;
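
&lt;p&gt;As a rough sketch, those two settings would look something like this in the settings file ScrapydWeb generated on first run (the filename is version-dependent, something like &lt;code&gt;scrapydweb_settings_vN.py&lt;/code&gt;, and the log path below is just a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## ScrapydWeb settings file (values below are placeholders)

## Point ScrapydWeb at the directory where your local Scrapyd server writes its logs
LOCAL_SCRAPYD_LOGS_DIR = '/home/yourusername/logs'

## Let ScrapydWeb start and manage Logparser for you
ENABLE_LOGPARSER = True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;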

&lt;p&gt;At this point, you will have a running Scrapyd server, a running logparser instance, and a running ScrapydWeb server. From here, we are ready to use ScrapydWeb to schedule, run and monitor our jobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Using ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;Now let's look at how we can actually use ScrapydWeb to run and monitor our jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting Scrapyd Servers
&lt;/h3&gt;

&lt;p&gt;Adding Scrapyd servers to your ScrapydWeb dashboard is pretty simple. You just need to edit your ScrapydWeb settings file.&lt;/p&gt;

&lt;p&gt;By default, ScrapydWeb is set up to connect to a locally running Scrapyd server on localhost:6800.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SCRAPYD_SERVERS = [
    '127.0.0.1:6800',
    # 'username:password@localhost:6801#group', ## string format
    #('username', 'password', 'localhost', '6801', 'group'), ## tuple format
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to connect to remote Scrapyd servers, just add them to the above array and restart the server. You can add servers in either string or tuple format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; you need to make sure &lt;code&gt;bind_address = 0.0.0.0&lt;/code&gt; is set in your Scrapyd config file, and restart Scrapyd to make it visible externally. &lt;/p&gt;

&lt;p&gt;With this done, you should see something like this on your servers page: &lt;a href="http://127.0.0.1:5000/1/servers/"&gt;http://127.0.0.1:5000/1/servers/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NP47DH6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-servers-9eff0d0cfc50a46bdb987b9608e8e2d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NP47DH6q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapydweb-servers-9eff0d0cfc50a46bdb987b9608e8e2d1.png" alt="ScrapydWeb Servers" width="880" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Spiders
&lt;/h3&gt;

&lt;p&gt;Now, with your server connected, we are able to schedule and run spiders from the projects that have been deployed to our Scrapyd server.&lt;/p&gt;

&lt;p&gt;Navigate to the &lt;strong&gt;Run Spider&lt;/strong&gt; page (&lt;a href="http://127.0.0.1:5000/1/schedule/"&gt;http://127.0.0.1:5000/1/schedule/&lt;/a&gt;), and you will be able to select and run spiders.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wkzW1SyW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/run.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wkzW1SyW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/run.gif" alt="ScrapydWeb Running Spiders" width="730" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will then send a &lt;code&gt;POST&lt;/code&gt; request to the &lt;code&gt;/schedule.json&lt;/code&gt; endpoint of your Scrapyd server, triggering Scrapyd to run your spider.&lt;/p&gt;
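
&lt;p&gt;Under the hood this is the same &lt;code&gt;schedule.json&lt;/code&gt; call you could make yourself. A minimal sketch with the &lt;code&gt;requests&lt;/code&gt; library (the project and spider names below are placeholders; use whichever project you deployed earlier):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## schedule_job.py (example; project/spider names are placeholders)
import requests

response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'demo', 'spider': 'bookspider'},
)

## Scrapyd responds with the id of the job it just queued, e.g.
## {"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
print(response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;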

&lt;p&gt;You can also schedule jobs to run periodically by enabling the &lt;strong&gt;timer task&lt;/strong&gt; toggle and entering your cron details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Stats
&lt;/h3&gt;

&lt;p&gt;When Logparser is running, ScrapydWeb will periodically poll the Scrapyd logs endpoint and display your job stats so you can see how they have performed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XVyE4d0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/jobs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XVyE4d0j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/my8100/files/master/scrapydweb/screenshots/jobs.png" alt="ScrapydWeb Job Stats Dashboard" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Alternatives To ScrapydWeb
&lt;/h2&gt;

&lt;p&gt;There are many alternatives to ScrapydWeb, each offering different functionality and flexibility. We've summarised them in this article here: &lt;a href="https://scrapeops.io/python-scrapy-playbook/best-scrapyd-dashboards-ui"&gt;Guide to the 5 Best Scrapyd Dashboards&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are still open to other options then we would highly recommend that you give ScrapeOps a try. &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; does everything ScrapydWeb does and more. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not only can you schedule, run and manage spiders on Scrapyd servers like you can with ScrapydWeb, but ScrapeOps is also a fully fledged job monitoring solution for web scraping. &lt;/p&gt;

&lt;p&gt;It allows you to monitor jobs, view the results in numerous dashboards, run automatic job health checks, receive alerts and more. &lt;/p&gt;

&lt;p&gt;What's more, the monitoring and scheduling parts of ScrapeOps are separate. So if you would like to use ScrapydWeb for job scheduling, you can still integrate the ScrapeOps Scrapy extension, which will log your scraping data and populate your monitoring dashboards.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>scrapydweb</category>
      <category>scrapyd</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Complete Guide To Scrapy Spidermon, Start Monitoring in 5 Minutes!</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 13 Jan 2022 09:51:47 +0000</pubDate>
      <link>https://dev.to/iankerins/the-complete-guide-to-scrapy-spidermon-start-monitoring-in-5-minutes-2aii</link>
      <guid>https://dev.to/iankerins/the-complete-guide-to-scrapy-spidermon-start-monitoring-in-5-minutes-2aii</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have done a lot of web scraping, the one thing you know for certain is that your scrapers always break and degrade over time. &lt;/p&gt;

&lt;p&gt;Web scraping isn't like other software applications, where for the most part you control all the variables. In web scraping, you are writing scrapers that are trying to extract data from moving targets. &lt;/p&gt;

&lt;p&gt;Websites can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change the HTML structure of their pages.&lt;/li&gt;
&lt;li&gt;Implement new anti-bot countermeasures.&lt;/li&gt;
&lt;li&gt;Block whole ranges of IPs from accessing their site.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of which can degrade or completely break your scrapers. Because of this, it is vital that you have a robust monitoring and alerting setup in place for your web scrapers so you can react immediately when your spiders eventually begin to break.&lt;/p&gt;

&lt;p&gt;In this guide, we're going to walk you through Spidermon, a Scrapy extension that is designed to make monitoring your scrapers easier and more effective.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Spidermon?&lt;/li&gt;
&lt;li&gt;Integrating Spidermon&lt;/li&gt;
&lt;li&gt;Spidermon Monitors&lt;/li&gt;
&lt;li&gt;Spidermon MonitorSuites&lt;/li&gt;
&lt;li&gt;Spidermon Actions&lt;/li&gt;
&lt;li&gt;Item Validation&lt;/li&gt;
&lt;li&gt;End-to-End Spidermon Example + Code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For more scraping monitoring solutions, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook/how-to-monitor-scrapy-spiders/"&gt;the full list of Scrapy monitoring options here&lt;/a&gt;. Including &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt;, the purpose built job monitoring &amp;amp; scheduling tool for web scraping. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Spidermon?
&lt;/h2&gt;

&lt;p&gt;Spidermon is a Scrapy extension to build monitors for Scrapy spiders. Built by the same developers that develop and maintain Scrapy, Spidermon is a highly versatile and customisable monitoring framework for Scrapy which greatly expands the default stats collection and logging functionality within Scrapy.&lt;/p&gt;

&lt;p&gt;Spidermon allows you to create custom monitors that will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor your scrapers with template &amp;amp; custom monitors.&lt;/li&gt;
&lt;li&gt;Validate the data being scraped from each page.&lt;/li&gt;
&lt;li&gt;Notify you with the results of those checks. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spidermon is highly customisable, so if you can track a stat then you will be able to create a Spidermon monitor to monitor it in real-time.&lt;/p&gt;

&lt;p&gt;Spidermon is centered around &lt;strong&gt;Monitors&lt;/strong&gt;, &lt;strong&gt;MonitorSuites&lt;/strong&gt;, &lt;strong&gt;Validators&lt;/strong&gt; and &lt;strong&gt;Actions&lt;/strong&gt;, which are then used to monitor your scraping jobs and alert you if any tests are failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating Spidermon
&lt;/h2&gt;

&lt;p&gt;Getting set up with Spidermon is straightforward, but you do need to manually set up your monitors after installing the Spidermon extension. &lt;/p&gt;

&lt;p&gt;To get started you need to install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here, you need to define your &lt;strong&gt;Monitors&lt;/strong&gt;, &lt;strong&gt;Validators&lt;/strong&gt; and &lt;strong&gt;Actions&lt;/strong&gt;, then schedule them to run with your &lt;strong&gt;MonitorSuites&lt;/strong&gt;. We will go through each of these in this guide.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon Monitors
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Monitor&lt;/strong&gt; is the core piece of Spidermon. Built on top of Python's unittest framework, a monitor is a series of unit tests you define that check the scraping stats of your job against predefined thresholds. &lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Monitors
&lt;/h3&gt;

&lt;p&gt;Out of the box, Spidermon has a number of &lt;a href="https://spidermon.readthedocs.io/en/latest/monitors.html#the-basic-monitors"&gt;basic monitors&lt;/a&gt; built in, which you just need to enable and configure in your project or spider settings to activate for your jobs. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Monitors&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ItemCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if spider extracted the minimum number of items threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ItemValidationMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check for item validation errors if item validation pipelines are enabled.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FieldCoverageMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if field coverage rules are met.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ErrorCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the number of errors versus a threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WarningCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the number of warnings versus a threshold.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FinishReasonMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if a job finished for an expected finish reason.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RetryCountMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check if any requests have reached the maximum amount of retries and the crawler had to drop those requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DownloaderExceptionMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the amount of downloader exceptions (timeouts, rejected connections, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SuccessfulRequestsMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the total number of successful requests made.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TotalRequestsMonitor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Check the total number of requests made.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To use any of these monitors you will need to define the thresholds for each of them in your &lt;code&gt;settings.py&lt;/code&gt; file or your spider's custom settings.&lt;/p&gt;
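
&lt;p&gt;For example, here is a sketch of what those threshold settings can look like in your &lt;code&gt;settings.py&lt;/code&gt; file. The setting names are taken from the Spidermon basic monitors documentation (double-check them against your Spidermon version); the values are placeholders you should tune per spider:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py (example thresholds, adjust to your spider)

## ItemCountMonitor: fail if fewer than 100 items are scraped
SPIDERMON_MIN_ITEMS = 100

## ErrorCountMonitor: fail if more than 20 errors are logged
SPIDERMON_MAX_ERRORS = 20

## FinishReasonMonitor: fail if the job finishes for any other reason
SPIDERMON_EXPECTED_FINISH_REASONS = ['finished']

## FieldCoverageMonitor: fail if the 'price' field is filled in under 90% of items
SPIDERMON_FIELD_COVERAGE_RULES = {
    'dict/price': 0.9,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;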



&lt;h3&gt;
  
  
  Custom Monitors
&lt;/h3&gt;

&lt;p&gt;With Spidermon you can also create your own custom monitors that can do just about anything. They can work with any type of stat that is being tracked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Requests&lt;/li&gt;
&lt;li&gt;✅ Responses&lt;/li&gt;
&lt;li&gt;✅ Pages Scraped&lt;/li&gt;
&lt;li&gt;✅ Items Scraped&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;li&gt;✅ HTTP Response Codes&lt;/li&gt;
&lt;li&gt;✅ Retries&lt;/li&gt;
&lt;li&gt;✅ Custom Stats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, you can create a monitor to verify any stat that appears in the Scrapy stats (either the default stats, or custom stats you configure your spider to insert).&lt;/p&gt;
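
&lt;p&gt;So if the built-in stats don't cover what you care about, you can push a custom stat from your spider and then monitor it like any other. A quick sketch (the spider, selector and stat name are just examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/my_spider.py (example: pushing a custom stat into the Scrapy stats)
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://books.toscrape.com']

    def parse(self, response):
        products = response.css('article.product_pod')

        ## Track pages that came back without the element we expect,
        ## then check 'custom/empty_pages' in a Spidermon monitor
        if not products:
            self.crawler.stats.inc_value('custom/empty_pages')

        for product in products:
            yield {'title': product.css('h3 a::attr(title)').get()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;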

&lt;p&gt;Here is an example of a simple monitor that will check the number of items scraped versus a minimum threshold.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# my_project/monitors.py
from spidermon import Monitor, monitors

@monitors.name('Item count')
class CustomItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 10

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To run a &lt;strong&gt;Monitor&lt;/strong&gt;, they need to be included in a &lt;strong&gt;MonitorSuite&lt;/strong&gt;. &lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon MonitorSuites
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;MonitorSuite&lt;/strong&gt; is how you activate your &lt;strong&gt;Monitors&lt;/strong&gt;. It tells Spidermon when you would like your monitors to run and what actions Spidermon should take if your scrape passes or fails any of your health checks.&lt;/p&gt;

&lt;p&gt;There are three built-in types of &lt;strong&gt;MonitorSuites&lt;/strong&gt; within Spidermon:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MonitorSuites&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_SPIDER_OPEN_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors when Spider starts running.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_SPIDER_CLOSE_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors when Spider has finished scraping.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPIDERMON_PERIODIC_MONITORS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs monitors at periodic intervals that you can define.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Within these &lt;strong&gt;MonitorSuites&lt;/strong&gt; you can specify which actions should be taken after the &lt;strong&gt;Monitors&lt;/strong&gt; have been executed.&lt;/p&gt;

&lt;p&gt;To create a &lt;strong&gt;MonitorSuite&lt;/strong&gt;, simply create a new &lt;strong&gt;MonitorSuite&lt;/strong&gt; class, and define which monitors you want to run and what actions should be taken afterwards:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## tutorial/monitors.py
from spidermon.core.suites import MonitorSuite

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        CustomItemCountMonitor, ## defined above
    ]

    monitors_finished_actions = [
        # actions to execute when suite finishes its execution
    ]

    monitors_failed_actions = [
        # actions to execute when suite finishes its execution with a failed monitor
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add that &lt;strong&gt;MonitorSuite&lt;/strong&gt; to the &lt;code&gt;SPIDERMON_SPIDER_CLOSE_MONITORS&lt;/code&gt; tuple in your &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;##settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'tutorial.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now Spidermon will run this &lt;strong&gt;MonitorSuite&lt;/strong&gt; at the end of every job. &lt;/p&gt;




&lt;h2&gt;
  
  
  Spidermon Actions
&lt;/h2&gt;

&lt;p&gt;The final piece of your &lt;strong&gt;MonitorSuite&lt;/strong&gt; are &lt;strong&gt;Actions&lt;/strong&gt;, which define what happens after a set of monitors has been run.&lt;/p&gt;

&lt;p&gt;Spidermon has pre-built &lt;strong&gt;Action&lt;/strong&gt; templates already included, but you can easily create your own custom &lt;strong&gt;Actions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is a list of the pre-built &lt;strong&gt;Action&lt;/strong&gt; templates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Actions&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Email&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send alerts or job reports to you and your team.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send slack notifications to any channel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telegram&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send alerts or reports to any Telegram channel.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Job Tags&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Set tags on your jobs when using Scrapy Cloud.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Report&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create and save a HTML report locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Report&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create and save a HTML report to a S3 bucket.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sentry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send custom messages to Sentry.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For example, to get Slack notifications when a job fails one of your monitors, you can use the pre-built &lt;strong&gt;SendSlackMessageSpiderFinished&lt;/strong&gt; action by adding your Slack details to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;##settings.py
SPIDERMON_SLACK_SENDER_TOKEN = '&amp;lt;SLACK_SENDER_TOKEN&amp;gt;'
SPIDERMON_SLACK_SENDER_NAME = '&amp;lt;SLACK_SENDER_NAME&amp;gt;'
SPIDERMON_SLACK_RECIPIENTS = ['@yourself', '#yourprojectchannel']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then including &lt;strong&gt;SendSlackMessageSpiderFinished&lt;/strong&gt; in your &lt;strong&gt;MonitorSuite&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## tutorial/monitors.py
from spidermon.core.suites import MonitorSuite
from spidermon.contrib.actions.slack.notifiers import SendSlackMessageSpiderFinished

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        CustomItemCountMonitor, 
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Item Validation
&lt;/h2&gt;

&lt;p&gt;One really powerful feature of Spidermon is its support for Item validation. Using &lt;a href="https://schematics.readthedocs.io/en/latest/"&gt;schematics&lt;/a&gt; or &lt;a href="https://json-schema.org/"&gt;JSON Schema&lt;/a&gt;, you can define custom unit tests on fields of each Item.&lt;/p&gt;

&lt;p&gt;For example, we can have Spidermon test that every product item we scrape has a valid product url, that its price is a number and doesn’t include any currency signs or special characters, etc. &lt;/p&gt;

&lt;p&gt;Here is an example product item validator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## validators.py
from schematics.models import Model
from schematics.types import URLType, StringType, ListType, DecimalType

class ProductItem(Model):
    url = URLType(required=True)
    name = StringType(required=True)
    price = DecimalType(required=True)
    features = ListType(StringType)
    image_url = URLType()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This validator can be enabled in your spider by activating Spidermon's &lt;strong&gt;ItemValidationPipeline&lt;/strong&gt; and telling Spidermon to use the &lt;strong&gt;ProductItem&lt;/strong&gt; validator class we just created in your project's &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# settings.py
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

SPIDERMON_VALIDATION_MODELS = (
    'tutorial.validators.ProductItem',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This validator will then append new stats to your Scrapy stats which you can then use in your &lt;strong&gt;Monitors&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## log file
...
'spidermon/validation/fields': 400,
'spidermon/validation/items': 100,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
[scrapy.core.engine] INFO: Spider closed (finished)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  End-to-End Spidermon Example
&lt;/h2&gt;

&lt;p&gt;Now, we're going to run through a full Spidermon example so that you can see how to setup your own monitoring suite. &lt;/p&gt;

&lt;p&gt;The full code from this example is available on &lt;a href="https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/4.%20Scrapy%20Extensions/spidermon/spidermon_demo"&gt;Github here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scrapy Project
&lt;/h3&gt;

&lt;p&gt;First things first, we need a Scrapy project, a spider and a website to scrape. In this case &lt;a href="https://books.toscrape.com/"&gt;books.toscrape.com&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject spidermon_demo
scrapy genspider bookspider books.toscrape.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next we need to create a Scrapy Item for the data we want to scrape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## items.py
import scrapy

class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally we need to write the spider code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/bookspider.py
import scrapy
from spidermon_demo.items import BookItem

class BookSpider(scrapy.Spider):
  name = 'bookspider'
  start_urls = ["http://books.toscrape.com"]

  def parse(self, response):

    for article in response.css('article.product_pod'):
      book_item = BookItem(
        url = article.css("h3 &amp;gt; a::attr(href)").get(),
        title = article.css("h3 &amp;gt; a::attr(title)").extract_first(),
        price = article.css(".price_color::text").extract_first(),
      )
      yield book_item

    next_page_url = response.css("li.next &amp;gt; a::attr(href)").get()
    if next_page_url:
      yield response.follow(url=next_page_url, callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By now, you should have a working spider that will scrape every page of &lt;a href="https://books.toscrape.com/"&gt;books.toscrape.com&lt;/a&gt;. Next we integrate Spidermon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integrate Spidermon
&lt;/h3&gt;

&lt;p&gt;To install Spidermon just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create Item Validator
&lt;/h3&gt;

&lt;p&gt;For this example, we're going to validate the Items we scrape to make sure all fields are scraped and the data is valid. To do this we need to create a validator, which is pretty simple.&lt;/p&gt;

&lt;p&gt;First, we're going to need to install the schematics library:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install schematics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next, we will define our validator for our BookItem model in a new &lt;code&gt;validators.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## validators.py
from schematics.models import Model
from schematics.types import URLType, StringType, ListType

class BookItem(Model):
    url = URLType(required=True)
    title = StringType(required=True)
    price = StringType(required=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then enable this validator in our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}

SPIDERMON_VALIDATION_MODELS = (
    'spidermon_demo.validators.BookItem',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, when you run your spider Spidermon will validate every item being scraped and update the Scrapy Stats with the results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Scrapy Stats Output
(...)
'spidermon/validation/fields': 3000,
'spidermon/validation/fields/errors': 1000,
'spidermon/validation/fields/errors/invalid_url': 1000,
'spidermon/validation/fields/errors/invalid_url/url': 1000,
'spidermon/validation/items': 1000,
'spidermon/validation/items/errors': 1000,
'spidermon/validation/validators': 1,
'spidermon/validation/validators/item/schematics': True,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We can see from these stats that the &lt;strong&gt;url&lt;/strong&gt; field of our &lt;strong&gt;BookItem&lt;/strong&gt; is failing all the validation checks. Digging deeper, we find the reason is that the scraped urls are relative urls (&lt;code&gt;catalogue/a-light-in-the-attic_1000/index.html&lt;/code&gt;), not absolute urls.&lt;/p&gt;
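
&lt;p&gt;A quick way to fix that particular failure is to build absolute urls in the spider with &lt;code&gt;response.urljoin()&lt;/code&gt; before yielding the item. A sketch of the updated &lt;code&gt;parse&lt;/code&gt; method (only the url line really changes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## spiders/bookspider.py (parse() updated to yield absolute urls)
def parse(self, response):
    for article in response.css('article.product_pod'):
        yield BookItem(
            url = response.urljoin(article.css("h3 &amp;gt; a::attr(href)").get()),
            title = article.css("h3 &amp;gt; a::attr(title)").extract_first(),
            price = article.css(".price_color::text").extract_first(),
        )

    next_page_url = response.css("li.next &amp;gt; a::attr(href)").get()
    if next_page_url:
        yield response.follow(url=next_page_url, callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;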

&lt;h3&gt;
  
  
  Create Our Monitors
&lt;/h3&gt;

&lt;p&gt;Next up, we want to create the &lt;strong&gt;Monitors&lt;/strong&gt; that will run the unit tests when activated. In this example we're going to create two monitors in our &lt;code&gt;monitors.py&lt;/code&gt; file.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitor 1: Item Count Monitor
&lt;/h4&gt;

&lt;p&gt;This monitor will validate that our spider has scraped a set number of items. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon import Monitor, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 200

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Monitor 2: Item Validation Monitor
&lt;/h4&gt;

&lt;p&gt;This monitor will check the stats from the Item validator to make sure we have no item validation errors. &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.contrib.monitors.mixins import StatsMonitorMixin

@monitors.name('Item validation')
class ItemValidationMonitor(Monitor, StatsMonitorMixin):

    @monitors.name('No item validation errors')
    def test_no_item_validation_errors(self):
        validation_errors = getattr(
            self.stats, 'spidermon/validation/fields/errors', 0
        )
        self.assertEqual(
            validation_errors,
            0,
            msg='Found validation errors in {} fields'.format(
                validation_errors)
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Create Monitor Suites
&lt;/h3&gt;

&lt;p&gt;For this example, we're going to run two &lt;strong&gt;MonitorSuites&lt;/strong&gt;. One at the end of the job, and another that runs every 5 seconds (for demo purposes).&lt;/p&gt;
&lt;h4&gt;
  
  
  MonitorSuite 1: Spider Close
&lt;/h4&gt;

&lt;p&gt;Here, we're going to add both of our monitors (&lt;strong&gt;ItemCountMonitor&lt;/strong&gt;, &lt;strong&gt;ItemValidationMonitor&lt;/strong&gt;) to the monitor suite as we want both to run when the job finishes. To do so we just need to create the &lt;strong&gt;MonitorSuite&lt;/strong&gt; in our &lt;code&gt;monitors.py&lt;/code&gt; file:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.core.suites import MonitorSuite

class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
        ItemValidationMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then enable this &lt;strong&gt;MonitorSuite&lt;/strong&gt; in our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'spidermon_demo.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  MonitorSuite 2: Periodic Monitor
&lt;/h4&gt;

&lt;p&gt;Setting up a periodic monitor to run every 5 seconds is just as easy. Simply create a new &lt;strong&gt;MonitorSuite&lt;/strong&gt; and in this case we're only going to have it run the &lt;strong&gt;ItemValidationMonitor&lt;/strong&gt; every 5 seconds:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
class PeriodicMonitorSuite(MonitorSuite):
    monitors = [
        ItemValidationMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then enable it in our &lt;code&gt;settings.py&lt;/code&gt; file, where we also specify how frequently we want it to run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_PERIODIC_MONITORS = {
    'spidermon_demo.monitors.PeriodicMonitorSuite': 5,  # time in seconds
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With both of these &lt;strong&gt;MonitorSuites&lt;/strong&gt; set up, Spidermon will automatically run these &lt;strong&gt;Monitors&lt;/strong&gt; and add the results to your Scrapy logs and stats.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create Our Actions
&lt;/h3&gt;

&lt;p&gt;Having the results of these &lt;strong&gt;Monitors&lt;/strong&gt; is good, but to make them really useful we want something to happen when a &lt;strong&gt;MonitorSuite&lt;/strong&gt; has completed its tests. &lt;/p&gt;

&lt;p&gt;The most common action is getting notified of a failed health check so for this example we're going to send a Slack notification.  &lt;/p&gt;

&lt;p&gt;First we need to install some libraries to be able to work with Slack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install slack slackclient jinja2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Next we will need to enable Slack notifications in our &lt;strong&gt;MonitorSuites&lt;/strong&gt; by importing &lt;code&gt;SendSlackMessageSpiderFinished&lt;/code&gt; from Spidermon actions, and updating our &lt;strong&gt;MonitorSuites&lt;/strong&gt; to use it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## monitors.py
from spidermon.contrib.actions.slack.notifiers import SendSlackMessageSpiderFinished

## ... Existing Monitors

## Update Spider Close MonitorSuite
class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
        ItemValidationMonitor,
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished, 
    ]

## Update Periodic MonitorSuite
class PeriodicMonitorSuite(MonitorSuite):
    monitors = [
        ItemValidationMonitor,
    ]

    monitors_failed_actions = [
        SendSlackMessageSpiderFinished, 
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add our Slack details to our &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py
SPIDERMON_SLACK_SENDER_TOKEN = '&amp;lt;SLACK_SENDER_TOKEN&amp;gt;'
SPIDERMON_SLACK_SENDER_NAME = '&amp;lt;SLACK_SENDER_NAME&amp;gt;'
SPIDERMON_SLACK_RECIPIENTS = ['@yourself', '#yourprojectchannel']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://spidermon.readthedocs.io/en/latest/howto/configuring-slack-for-spidermon.html#configuring-slack-bot"&gt;Use this guide to create a Slack app and get your Slack credentials.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, anytime one of your Spidermon &lt;strong&gt;MonitorSuites&lt;/strong&gt; fail, you will get a Slack notification.&lt;/p&gt;

&lt;p&gt;The full code from this example is available on &lt;a href="https://github.com/ScrapeOps/python-scrapy-playbook/tree/master/4.%20Scrapy%20Extensions/spidermon/spidermon_demo"&gt;Github here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for how to use Spidermon to monitor your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>spidermon</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Monitor Your Scrapy Spiders?</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Wed, 12 Jan 2022 16:18:18 +0000</pubDate>
      <link>https://dev.to/iankerins/how-to-monitor-your-scrapy-spiders-5c9o</link>
      <guid>https://dev.to/iankerins/how-to-monitor-your-scrapy-spiders-5c9o</guid>
      <description>&lt;p&gt;Published as part of &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Python Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For anyone who has been in web scraping for a while, you know that if there is anything certain in web scraping, it is that just because your scrapers work today doesn’t mean they will work tomorrow. &lt;/p&gt;

&lt;p&gt;From day to day, your scrapers can break or their performance degrade for a whole host of reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The HTML structure of the target site can change.&lt;/li&gt;
&lt;li&gt;The target site can change their anti-bot countermeasures.&lt;/li&gt;
&lt;li&gt;Your proxy network can degrade or go down.&lt;/li&gt;
&lt;li&gt;Or something can go wrong on your server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this it is very important for you to have a reliable and effective way for you to monitor your scrapers in production, conduct health checks and get alerts when the performance of your spider drops.&lt;/p&gt;

&lt;p&gt;In this guide, we will go through the &lt;strong&gt;4 popular options to monitor your scrapers&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapy Logs &amp;amp; Stats&lt;/li&gt;
&lt;li&gt;ScrapeOps Extension&lt;/li&gt;
&lt;li&gt;Spidermon Extension&lt;/li&gt;
&lt;li&gt;Generic Logging &amp;amp; Monitoring Tools&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  #1: Scrapy Logs &amp;amp; Stats
&lt;/h2&gt;

&lt;p&gt;Out of the box, Scrapy boasts by far the best logging and stats functionality of any web scraping library or framework out there. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-12-17 17:02:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1330,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 11551,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 2.600152,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 17, 16, 2, 22, 118835),
'httpcompression/response_bytes': 55120,
'httpcompression/response_count': 5,
'item_scraped_count': 50,
'log_count/INFO': 10,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2021, 12, 17, 16, 2, 19, 518683)}
2021-12-17 17:02:25 [scrapy.core.engine] INFO: Spider closed (finished)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whereas most other scraping libraries and frameworks focus solely on making requests and parsing the responses, Scrapy has a whole logging and stats layer under the hood that tracks your spiders in real-time, making it really easy to test and debug your spiders while developing them.&lt;/p&gt;

&lt;p&gt;You can easily customise the logging levels and add more stats to the default Scrapy stats in your spiders with a couple of lines of code. &lt;/p&gt;

&lt;p&gt;The major problem with relying solely on this approach to monitor your scrapers is that it quickly becomes impractical and cumbersome in production, especially when you have multiple spiders running every day across multiple servers.&lt;/p&gt;

&lt;p&gt;To check the health of your scraping jobs you will need to store these logs, and either periodically SSH into the server to view them or set up a custom log exporting system so you can view them in a central user interface. More on this later.&lt;/p&gt;
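
&lt;p&gt;If you do want to go down this route, the first step is simply getting Scrapy to persist its logs to a file, which only takes a couple of settings (the file path below is just a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Write the Scrapy log to a file instead of just the console (placeholder path)
LOG_FILE = 'logs/my_spider.log'

## Reduce noise in production (DEBUG is the default level)
LOG_LEVEL = 'INFO'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;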

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Using Scrapy's built-in logging and stats functionality is great during development, but when running scrapers in production you should look to use a better monitoring setup. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Set up right out of the box, and very lightweight.&lt;/li&gt;
&lt;li&gt;Easy to customise so it logs more stats.&lt;/li&gt;
&lt;li&gt;Great for local testing and the development phase. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard functionality, so you need to set up your own system to export your logs and display them.&lt;/li&gt;
&lt;li&gt;No historical comparison capabilities within jobs. &lt;/li&gt;
&lt;li&gt;No inbuilt health check functionality.&lt;/li&gt;
&lt;li&gt;Cumbersome to rely solely on when in production. &lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #2: ScrapeOps Extension
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a monitoring and alerting tool dedicated to web scraping. With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Promo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary goal with ScrapeOps is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.&lt;/p&gt;

&lt;p&gt;Unlike the other options on this list, ScrapeOps is a full end-to-end web scraping monitoring and management tool dedicated to web scraping that automatically sets up all the monitors, health checks and alerts for you. If you have an issue with integrating ScrapeOps or need advice on setting up your scrapers then they have a support team on-hand to assist you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers.&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Dashboards -&lt;/strong&gt; Visualise your job data in dashboards, so you see real-time &amp;amp; historical stats.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Auto Health Checks -&lt;/strong&gt; Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.&lt;/li&gt;
&lt;li&gt;✔️ &lt;strong&gt;Custom Health Checks -&lt;/strong&gt; Check each job with any custom health checks you have enabled for it.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;li&gt;📑 &lt;strong&gt;Reports -&lt;/strong&gt; Generate daily (periodic) reports that check all jobs versus your criteria and let you know if everything is healthy or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Job stats tracked include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Items Parsed &amp;amp; Missed&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Response Status Codes&lt;/li&gt;
&lt;li&gt;✅ Success Rates&lt;/li&gt;
&lt;li&gt;✅ Latencies&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;li&gt;✅ Bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;Getting setup with the logger is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, your scraping stats will be automatically logged and automatically shipped to your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard Demo" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scrapeops.io"&gt;ScrapeOps&lt;/a&gt; is a powerful web scraping monitoring tool, that gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free unlimited community plan. &lt;/li&gt;
&lt;li&gt;Simple 30 second install, gives you advanced job monitoring, health checks and alerts straight out of the box.&lt;/li&gt;
&lt;li&gt;Job scheduling and management functionality so you can manage and monitor your scrapers from one dashboard.&lt;/li&gt;
&lt;li&gt;Customer support team, available to help you get setup and add new features.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Currently, less customisable than Spidermon or other log management tools. (Will be soon!)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #3: Spidermon Extension
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spidermon.readthedocs.io/en/latest/index.html"&gt;Spidermon&lt;/a&gt; is an open-source monitoring extension for Scrapy. When integrated it allows you to set up custom monitors that can run at the start, end or periodically during your scrape, and alert you via your chosen communication method.&lt;/p&gt;

&lt;p&gt;This is a very powerful tool as it allows you to create custom monitors for each of your Spiders that can validate each Item scraped with your own unit tests. &lt;/p&gt;

&lt;p&gt;For example, you can make sure a required field has been scraped, that a url field actually contains a valid url, or have it double check that a scraped price is actually a number and doesn’t include any currency signs or special characters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from schematics.models import Model
from schematics.types import URLType, StringType, ListType, DecimalType

class ProductItem(Model):
    url = URLType(required=True)
    name = StringType(required=True)
    price = DecimalType(required=True)
    features = ListType(StringType)
    image_url = URLType()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, the two major drawbacks with Spidermon are:&lt;/p&gt;

&lt;h4&gt;
  
  
  #1 - No Dashboard or User Interface
&lt;/h4&gt;

&lt;p&gt;Spidermon doesn’t have any dashboard or user interface where you can see the output of your monitors.&lt;/p&gt;

&lt;p&gt;The output of your Spidermon monitors is just added to your log files and Scrapy stats, so you will either need to view each spider log to check your scrapers' performance or set up a custom system to extract this log data and display it in your own custom dashboard.&lt;/p&gt;

&lt;h4&gt;
  
  
  #2 - Upfront Setup Time
&lt;/h4&gt;

&lt;p&gt;Unlike ScrapeOps, with Spidermon you will have to spend a bit of upfront time creating the monitors you need for each spider and integrating them into your Scrapy projects. &lt;/p&gt;

&lt;p&gt;Spidermon does include some out-of-the-box monitors; however, you will still need to activate them and define the failure thresholds for every spider. &lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;Once set up, Spidermon can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🕵️‍♂️ &lt;strong&gt;Monitor -&lt;/strong&gt; Automatically monitor all your scrapers with the defined monitors.&lt;/li&gt;
&lt;li&gt;💯 &lt;strong&gt;Data Quality -&lt;/strong&gt; Validate the field coverage of each of the Items you've defined unit tests for.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Periodic/Finished Health Checks -&lt;/strong&gt; At periodic intervals or at job finish, you can configure Spidermon to check the health of your job versus pre-set thresholds.&lt;/li&gt;
&lt;li&gt;⏰ &lt;strong&gt;Alerts -&lt;/strong&gt; Alert you via email, Slack, etc. if any of your jobs are unhealthy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Job stats tracked out of the box include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pages Scraped&lt;/li&gt;
&lt;li&gt;✅ Items Scraped&lt;/li&gt;
&lt;li&gt;✅ Item Field Coverage&lt;/li&gt;
&lt;li&gt;✅ Runtimes&lt;/li&gt;
&lt;li&gt;✅ Errors &amp;amp; Warnings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also track more stats if you customise your scrapers to log them and have Spidermon monitor them.&lt;/p&gt;
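&lt;p&gt;As a rough illustration of what that could look like, the sketch below logs a hypothetical &lt;code&gt;custom/missing_price_count&lt;/code&gt; stat from a spider using Scrapy's stats collector, and then checks it from a Spidermon monitor (the stat name, CSS selector and threshold are just illustrative examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## In your spider: increment a custom stat whenever a price is missing
def parse_product(self, response):
    price = response.css('.price::text').get()
    if not price:
        self.crawler.stats.inc_value('custom/missing_price_count')
    ...

## In your monitors.py: fail the job if too many prices were missing
from spidermon import Monitor, monitors

@monitors.name('Missing prices')
class MissingPriceMonitor(Monitor):

    @monitors.name('Maximum number of missing prices')
    def test_missing_prices(self):
        missing = getattr(self.data.stats, 'custom/missing_price_count', 0)
        self.assertTrue(missing &amp;lt;= 20, msg='More than 20 items were missing a price')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;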

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;p&gt;Getting set up with Spidermon is straightforward, but you do need to manually set up your monitors after installing the Spidermon extension. &lt;/p&gt;

&lt;p&gt;To get started you need to install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spidermon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then add 2 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon
SPIDERMON_ENABLED = True

## Add In The Spidermon Extension
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From here you will also need to build your custom monitors and add each of them to your &lt;code&gt;settings.py&lt;/code&gt; file. Here is a simple example of how to set up a monitor that will check the number of items scraped at the end of the job versus a fixed threshold.&lt;/p&gt;

&lt;p&gt;First we create a custom monitor in a monitors.py file within our Scrapy project: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# my_project/monitors.py
from spidermon import Monitor, MonitorSuite, monitors

@monitors.name('Item count')
class ItemCountMonitor(Monitor):

    @monitors.name('Minimum number of items')
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, 'item_scraped_count', 0)
        minimum_threshold = 10

        msg = 'Extracted less than {} items'.format(
            minimum_threshold)
        self.assertTrue(
            item_extracted &amp;gt;= minimum_threshold, msg=msg
        )

class SpiderCloseMonitorSuite(MonitorSuite):

    monitors = [
        ItemCountMonitor,
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then we add this monitor to our &lt;code&gt;settings.py&lt;/code&gt; file so that Spidermon will run it at the end of every job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Enable Spidermon Monitor
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    'my_project.monitors.SpiderCloseMonitorSuite',
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This monitor will then run at the end of every job and output the result in your log file. Here is an example of the monitor failing its tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: [Spidermon] -------------------- MONITORS --------------------
INFO: [Spidermon] Item count/Minimum number of items... FAIL
INFO: [Spidermon] --------------------------------------------------
ERROR: [Spidermon]
====================================================================
FAIL: Item count/Minimum number of items
--------------------------------------------------------------------
Traceback (most recent call last):
File "/tutorial/monitors.py",
    line 17, in test_minimum_number_of_items
    item_extracted &amp;gt;= minimum_threshold, msg=msg
AssertionError: False is not true : Extracted less than 10 items
INFO: [Spidermon] 1 monitor in 0.001s
INFO: [Spidermon] FAILED (failures=1)
INFO: [Spidermon] ---------------- FINISHED ACTIONS ----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
INFO: [Spidermon] ----------------- PASSED ACTIONS -----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
INFO: [Spidermon] ----------------- FAILED ACTIONS -----------------
INFO: [Spidermon] --------------------------------------------------
INFO: [Spidermon] 0 actions in 0.000s
INFO: [Spidermon] OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you would like a more detailed explanation of how to use Spidermon, you can check out our &lt;a href="https://scrapeops.io/python-scrapy-playbook/python-scrapy-playbook/extensions/scrapy-spidermon-guide"&gt;Complete Spidermon Guide here&lt;/a&gt; or the &lt;a href="https://spidermon.readthedocs.io/en/latest/index.html"&gt;official documentation here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Spidermon is a great option for anyone who wants to take their scrapers to the next level and integrate a highly customisable monitoring solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Open-source. Developed by core Scrapy developers.&lt;/li&gt;
&lt;li&gt;Stable and battle tested. Used internally by Zyte developers.&lt;/li&gt;
&lt;li&gt;Offers the ability to set custom item validation rules on every Item being scraped.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;No dashboard functionality, so you need to build your own system to extract the Spidermon stats to a dashboard.&lt;/li&gt;
&lt;li&gt;Need to do a decent bit of customisation in your Scrapy projects to get the spider monitors, alerts, etc. set up for each spider.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  #4: Generic Logging &amp;amp; Monitoring Tools
&lt;/h2&gt;

&lt;p&gt;Another option is to use any of the many active monitoring or logging platforms available, like DataDog, Logz.io, LogDNA, Sentry, etc.&lt;/p&gt;

&lt;p&gt;These tools boast a huge range of functionality and features that allow you to graph, filter and aggregate your log data in whatever way best suits your requirements.&lt;/p&gt;

&lt;p&gt;However, although these can be used for monitoring your spiders, you will have to do a lot of customisation work to set up the dashboards, monitors and alerts that you would get out of the box with ScrapeOps or Spidermon.&lt;/p&gt;

&lt;p&gt;Plus, because most of these tools need to ingest all your log data to power the graphs, monitors, etc., they will likely be a lot more expensive than using ScrapeOps or Spidermon, as they charge based on how much data they ingest and how long they retain it for.&lt;/p&gt;
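&lt;p&gt;As a rough illustration of the glue code involved, here is a minimal sketch of a Scrapy extension that POSTs the final job stats to a hypothetical log collector endpoint when the spider closes (the ingest URL is a placeholder, and in practice most platforms provide their own agents or logging handlers that may be a better fit):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## my_project/extensions.py (sketch - the ingest endpoint is a placeholder)
import json
import urllib.request

from scrapy import signals


class StatsShipperExtension:

    def __init__(self, crawler):
        self.crawler = crawler
        ## Ship the stats when the spider finishes
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_closed(self, spider):
        stats = self.crawler.stats.get_stats()
        payload = json.dumps({'spider': spider.name, 'stats': stats}, default=str).encode('utf-8')
        request = urllib.request.Request(
            'https://logs.example.com/ingest',   ## placeholder endpoint for your logging platform
            data=payload,
            headers={'Content-Type': 'application/json'},
        )
        urllib.request.urlopen(request, timeout=10)

## Enable it in settings.py:
## EXTENSIONS = {'my_project.extensions.StatsShipperExtension': 500}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;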

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;If you have a very unique web scraping stack with a complicated ETL pipeline, then customising one of the big logging tools to your requirements might be a good option. &lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Lots of feature rich logging tools to choose from. &lt;/li&gt;
&lt;li&gt;Can integrate with your existing logging stack if you have one.&lt;/li&gt;
&lt;li&gt;Highly customisable. If you can dream it, then you can likely build it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Will need to create a custom logging setup to properly track your jobs. &lt;/li&gt;
&lt;li&gt;No job management or scheduling capabilities.&lt;/li&gt;
&lt;li&gt;Can get expensive when doing large scale scraping.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  More Scrapy Tutorials
&lt;/h2&gt;

&lt;p&gt;That's it for all the ways you can monitor your Scrapy spiders. If you would like to learn more about Scrapy, then be sure to check out &lt;a href="https://scrapeops.io/python-scrapy-playbook"&gt;The Scrapy Playbook&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>python</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Scraping Millions of Google SERPs The Easy Way (Python Scrapy Spider) </title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 17 Nov 2020 16:53:42 +0000</pubDate>
      <link>https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4hpc</link>
      <guid>https://dev.to/iankerins/scraping-millions-of-google-serps-the-easy-way-python-scrapy-spider-4hpc</guid>
      <description>&lt;p&gt;Google is the undisputed king of search engines in just about every aspect, making it the ultimate source of data for a whole host of use cases.&lt;/p&gt;

&lt;p&gt;If you want to get access to this data you either need to extract it manually, pay a 3rd party for an expensive data feed, or build your own scraper to extract the data for you.&lt;/p&gt;

&lt;p&gt;In this article I will show you the easiest way to build a Google scraper that can extract millions of pages of data each day with just a few lines of code. &lt;/p&gt;

&lt;p&gt;By combining Scrapy with Scraper API's proxy/autoparsing functionality we will build a Google scraper that can scrape the search engine results from any Google query and return the following for each result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Link&lt;/li&gt;
&lt;li&gt;Related links&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;Snippet&lt;/li&gt;
&lt;li&gt;Images&lt;/li&gt;
&lt;li&gt;Thumbnails&lt;/li&gt;
&lt;li&gt;Sources, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also refine your search queries with parameters by specifying a keyword, the geographic region, the language, the number of results, results from a particular domain, or even only returning safe results. The possibilities are nearly limitless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ian-kerins/google-scraper-python-scrapy" rel="noopener noreferrer"&gt;The code for this project is available on GitHub here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this guide, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup" rel="noopener noreferrer"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How To Query Google Using Scraper API’s Autoparse Functionality
&lt;/h2&gt;

&lt;p&gt;We will use &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Proxies&lt;/strong&gt;, so we won't get blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing&lt;/strong&gt;, so we don't have to worry about writing our own parsers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scraper API is a proxy management API that handles everything to do with rotating and managing proxies so our requests don't get banned, which is great for a difficult site to scrape like Google.&lt;/p&gt;

&lt;p&gt;However, what makes Scraper API extra useful for sites like Google and Amazon is that they provide auto parsing functionality free of charge so you don't need to write and maintain your own parsers.&lt;/p&gt;

&lt;p&gt;By using &lt;a href="https://www.scraperapi.com/google" rel="noopener noreferrer"&gt;Scraper API’s autoparse&lt;/a&gt; functionality for Google Search or Google Shopping, all the HTML will be automatically parsed into JSON format for you. Greatly simplifying the scraping process.&lt;/p&gt;

&lt;p&gt;All we need to do to make use of this handy capability is to add the following parameter to our request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; "&amp;amp;autoparse=true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll send the HTTP request with this parameter via Scrapy which will scrape google results based on specified keywords. The results will be returned in JSON format which we will then parse using Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scrapy Installation and Setup
&lt;/h2&gt;

&lt;p&gt;First things first, the requirements for this tutorial are very straightforward:&lt;/p&gt;

&lt;p&gt;• You will need at least Python version 3 or later&lt;br&gt;
• And &lt;em&gt;pip&lt;/em&gt; to install the necessary software packages&lt;/p&gt;

&lt;p&gt;So, assuming you have both of those things, you only need to run the following command in your terminal to install Scrapy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scrapy will automatically create a project folder where all the packages and project files will be located. So navigate to the folder where you want your project to live, and then run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First Scrapy will create a new project folder called “google_scraper”, which is also the project name. We then navigate into this folder and run the “genspider” command which will generate a web scraper for us with the name “google.”&lt;/p&gt;

&lt;p&gt;You should now see a bunch of configuration files, a “spiders” folder with your scraper(s), and a Python modules folder with some package files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building URLs to Query Google
&lt;/h2&gt;

&lt;p&gt;As you might expect, Google uses a very standard and easy to query URL structure. To build a URL to query Google with, you only need to know the URL parameters for the data you need. In this tutorial, I’ll use some of the parameters that will be the most useful for the majority of web scraping projects. &lt;/p&gt;

&lt;p&gt;Every Google Search query will start with the following base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://www.google.com/search
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then build out your query simply by adding one or more of the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;search keyword&lt;/strong&gt; parameter denoted as &lt;strong&gt;q&lt;/strong&gt;. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&lt;/a&gt;&lt;/em&gt; will search for results containing the “tshirt” keyword.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;language&lt;/strong&gt; parameter &lt;strong&gt;hl&lt;/strong&gt;. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;hl=en" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameter which will specify a domain (or, website) to search. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;as_sitesearch=amazon.com" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;as_sitesearch=amazon.com&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;num&lt;/strong&gt; parameter that specifies the number of results per page (maximum is 100). For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;num=50" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;num=50&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The  &lt;strong&gt;start&lt;/strong&gt; parameter which specifies the offset point. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;start=100" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;start=100&lt;/a&gt;&lt;/em&gt; &lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;safe&lt;/strong&gt; parameter which will only output “safe” results. For example, &lt;em&gt;&lt;a href="http://www.google.com/search?q=tshirt&amp;amp;safe=active" rel="noopener noreferrer"&gt;http://www.google.com/search?q=tshirt&amp;amp;safe=active&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more parameters to use for querying Google, such as date, encoding, or even operators such as ‘or’ or ‘and’ to implement some basic logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Google Search Query URL
&lt;/h2&gt;

&lt;p&gt;Below is the function I’ll be using to build the Google Search query URL. It creates a dictionary with key-value pairs for the &lt;strong&gt;q&lt;/strong&gt;, &lt;strong&gt;num&lt;/strong&gt;, and &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameters. If you want to add more parameters, this is where you could do it.&lt;/p&gt;

&lt;p&gt;If no site is specified, it will return a URL without the &lt;strong&gt;as_sitesearch&lt;/strong&gt; parameter. If one is specified, it will first extract the network location using &lt;em&gt;netloc&lt;/em&gt; (e.g. amazon.com), then add this key-value pair to &lt;em&gt;google_dict&lt;/em&gt;, and, finally, encode it in the return URL with the other parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlparse
from urllib.parse import urlencode

def create_google_url(query, site=''):
   google_dict = {'q': query, 'num': 100, }
   if site:
       web = urlparse(site).netloc
       google_dict['as_sitesearch'] = web
       return 'http://www.google.com/search?' + urlencode(google_dict)
   return 'http://www.google.com/search?' + urlencode(google_dict)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Connecting to a Proxy via the Scraper API
&lt;/h2&gt;

&lt;p&gt;When scraping an internet service like Google, you will need to use a proxy if you want to scrape at any reasonable scale. If you don’t, you could get flagged by its anti-bot countermeasures and get your IP banned. Thankfully, you can use &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;Scraper API’s proxy solution for free for up to 5,000 API calls&lt;/a&gt;, using up to 10 concurrent threads. You can also use some of Scraper API’s more advanced features, such as geotargeting, JS rendering, and residential proxies. &lt;/p&gt;

&lt;p&gt;To use the proxy, just head &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;here&lt;/a&gt; to sign up for free. Once you have, find your API key in the dashboard as you’ll need it to set up a proxy connection. &lt;/p&gt;

&lt;p&gt;The proxy is incredibly easy to implement into your web spider. In the &lt;em&gt;get_url&lt;/em&gt; function below, we’ll create a payload with our Scraper API key and the URL we built in the &lt;em&gt;create_google_url function&lt;/em&gt;. We’ll also enable the &lt;strong&gt;autoparse&lt;/strong&gt; feature here as well as set the proxy location as the U.S.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
   payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
   proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
   return proxy_url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To send our request via one of Scraper API’s proxy pools, we only need to append our query URL to Scraper API’s proxy URL. This will return the information that we requested from Google and that we’ll parse later on.&lt;/p&gt;
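&lt;p&gt;For example, chaining the two helper functions together for the “tshirt” keyword would produce a single Scraper API URL with the Google query URL encoded inside it (the API key is a placeholder and the exact parameter ordering may vary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url = create_google_url('tshirt')
print(get_url(url))
## http://api.scraperapi.com/?api_key=YOUR_KEY&amp;amp;url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dtshirt%26num%3D100&amp;amp;autoparse=true&amp;amp;country_code=us
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;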




&lt;h2&gt;
  
  
  Querying Google Search
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;start_requests&lt;/em&gt; function is where we will set everything into motion. It will iterate through a list of queries that will be sent through to the &lt;em&gt;create_google_url&lt;/em&gt; function as keywords for our query URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        queries = ['scrapy', 'beautifulsoup']
       for query in queries:
           url = create_google_url(query)
           yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query URL we built will then be sent as a request to Google Search using Scrapy’s &lt;em&gt;yield&lt;/em&gt; via the proxy connection we set up in the &lt;em&gt;get_url&lt;/em&gt; function. The result (which should be in JSON format) will then be sent to the &lt;em&gt;parse&lt;/em&gt; function to be processed. We also add the &lt;em&gt;{'pos': 0}&lt;/em&gt; key-value pair to the &lt;em&gt;meta&lt;/em&gt; parameter which is just used to count the number of pages scraped.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scraping the Google Search Results
&lt;/h2&gt;

&lt;p&gt;Because we used Scraper API’s &lt;strong&gt;autoparse&lt;/strong&gt; functionality to return data in JSON format, parsing is very straightforward. We just need to select the data we want from the response dictionary.&lt;/p&gt;

&lt;p&gt;First of all, we’ll load the entire JSON response and then iterate through each result, extracting some information and then putting it together into a new item we can use later on.&lt;/p&gt;

&lt;p&gt;This process also checks to see if there is another page of results. If there is, it invokes &lt;strong&gt;yield scrapy.Request&lt;/strong&gt; again and sends the results to the &lt;em&gt;parse&lt;/em&gt; function. In the meantime, &lt;em&gt;pos&lt;/em&gt; is used to keep track of the number of pages we have scraped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
       di = json.loads(response.text)
       pos = response.meta['pos']
       dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
       for result in di['organic_results']:
           title = result['title']
           snippet = result['snippet']
           link = result['link']
           item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
           pos += 1
           yield item
       next_page = di['pagination']['nextPageUrl']
       if next_page:
           yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Putting it All Together and Running the Spider
&lt;/h2&gt;

&lt;p&gt;You should now have a solid understanding of how the spider works and the flow of it. The spider we created, &lt;strong&gt;google.py&lt;/strong&gt;, should now have the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_KEY'

def get_url(url):
   payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
   proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
   return proxy_url

def create_google_url(query, site=''):
   google_dict = {'q': query, 'num': 100, }
   if site:
       web = urlparse(site).netloc
       google_dict['as_sitesearch'] = web
       return 'http://www.google.com/search?' + urlencode(google_dict)
   return 'http://www.google.com/search?' + urlencode(google_dict)

class GoogleSpider(scrapy.Spider):
   name = 'google'
   allowed_domains = ['api.scraperapi.com']
   custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                  'CONCURRENT_REQUESTS_PER_DOMAIN': 10}

   def start_requests(self):
       queries = ['scrapy', 'beautifulsoup']
       for query in queries:
           url = create_google_url(query)
           yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

   def parse(self, response):
       di = json.loads(response.text)
       pos = response.meta['pos']
       dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
       for result in di['organic_results']:
           title = result['title']
           snippet = result['snippet']
           link = result['link']
           item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
           pos += 1
           yield item
       next_page = di['pagination']['nextPageUrl']
       if next_page:
           yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before testing the scraper we need to configure the settings to allow it to integrate with the Scraper API free plan with 10 concurrent threads.&lt;/p&gt;

&lt;p&gt;To do this we defined the following custom settings in our spider class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 
                       'RETRY_TIMES': 5}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We set the concurrency to 10 threads to match the Scraper API free plan and set &lt;code&gt;RETRY_TIMES&lt;/code&gt; to tell Scrapy to retry any failed requests 5 times. In the &lt;strong&gt;settings.py&lt;/strong&gt; file we also need to make sure that &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt; and &lt;code&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/code&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;/p&gt;

&lt;p&gt;To test or run the spider, just make sure you are in the project folder and then run the following crawl command, which will also output the results to a .csv file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl google -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all goes according to plan, the spider will scrape Google Search for all the keywords you provide. By using a proxy, you’ll also avoid getting banned for using a bot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Google project then feel free to do so. &lt;a href="https://github.com/ian-kerins/google-scraper-python-scrapy" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API &lt;code&gt;API_KEY&lt;/code&gt; by signing up &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;for a free account here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>Build Your Own Google Scholar API With Python Scrapy</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 18 Aug 2020 18:17:11 +0000</pubDate>
      <link>https://dev.to/iankerins/build-your-own-google-scholar-api-with-python-scrapy-4p73</link>
      <guid>https://dev.to/iankerins/build-your-own-google-scholar-api-with-python-scrapy-4p73</guid>
      <description>&lt;p&gt;Google Scholar is a treasure trove of academic and industrial research that could prove invaluable to any research project.&lt;/p&gt;

&lt;p&gt;However, as Google doesn’t provide any API for Google Scholar, it is notoriously hard to mine for information.&lt;/p&gt;

&lt;p&gt;Faced with this problem, I decided to develop a simple Scrapy spider in Python and create my own Google Scholar API.&lt;/p&gt;

&lt;p&gt;In this article, I’m going to show you how I built a Scrapy spider that searches Google Scholar for a particular keyword, and iterates through every available page extracting the following data from the search results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title &lt;/li&gt;
&lt;li&gt;Link&lt;/li&gt;
&lt;li&gt;Citations&lt;/li&gt;
&lt;li&gt;Related Links&lt;/li&gt;
&lt;li&gt;Number of Versions&lt;/li&gt;
&lt;li&gt;Author&lt;/li&gt;
&lt;li&gt;Publisher&lt;/li&gt;
&lt;li&gt;Snippet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of research keywords per month. &lt;a href="https://github.com/ian-kerins/google-scholar-scrapy-spider"&gt;The code for the project is available on GitHub here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Google Scholar results at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;For this tutorial, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y4w75y8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Our Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then navigate to the folder where you want your project to live and run the “startproject” command along with the project name (“scholar” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject scholar

cd scholar

scrapy genspider scholar scholar.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── scholar                   # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── scholar.py        # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, that’s the Scrapy spider templates set up. Now let’s start building our Google Scholar spider.&lt;/p&gt;

&lt;p&gt;From here we’re going to create three functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will construct the Google Scholar URL for the search queries and send the request to Google.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse -&lt;/strong&gt; will extract all the search results from the Google Scholar search results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; to scrape Google Scholar at scale without getting blocked we need to use a proxy solution. For this project we will use Scraper API so we need to create a function to send the request to their API endpoint.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Understanding Google Scholar Search Queries
&lt;/h2&gt;

&lt;p&gt;The first step of any scraping project is to figure out a way to reliably query the target website to get the data we need. So in this case we need to understand how to construct Google Scholar search queries that will return the search results we need.&lt;/p&gt;

&lt;p&gt;Luckily for us, Google uses a very predictable URL structure. There are many more query parameters we can use with Google to refine our search results, but here are four of the most important ones for querying Google Scholar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define the &lt;strong&gt;search keyword&lt;/strong&gt; using the &lt;strong&gt;“q” parameter&lt;/strong&gt;. Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb"&gt;http://www.google.com/scholar?q=airbnb&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;language&lt;/strong&gt; of output using the &lt;strong&gt;“hl” parameter&lt;/strong&gt;. Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb&amp;amp;hl=en"&gt;http://www.google.com/scholar?q=airbnb&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;starting date&lt;/strong&gt; using the &lt;strong&gt;“as_ylo”&lt;/strong&gt; parameter. Example: &lt;em&gt;&lt;a href="https://scholar.google.com/scholar?as_ylo=2020&amp;amp;q=airbnb"&gt;https://scholar.google.com/scholar?as_ylo=2020&amp;amp;q=airbnb&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Define the &lt;strong&gt;number of results per page&lt;/strong&gt; using the &lt;strong&gt;“num”&lt;/strong&gt; parameter. However, this is not recommended for Google Scholar, so we will leave it as the default (10). Example: &lt;em&gt;&lt;a href="http://www.google.com/scholar?q=airbnb&amp;amp;num=10&amp;amp;hl=en"&gt;http://www.google.com/scholar?q=airbnb&amp;amp;num=10&amp;amp;hl=en&lt;/a&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
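&lt;p&gt;Putting these parameters together in Python is just a matter of URL-encoding a dictionary, as we will do in the spider below. A quick sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlencode

params = {'q': 'airbnb', 'hl': 'en', 'as_ylo': 2020}
url = 'https://scholar.google.com/scholar?' + urlencode(params)
## https://scholar.google.com/scholar?q=airbnb&amp;amp;hl=en&amp;amp;as_ylo=2020
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;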

&lt;h2&gt;
  
  
  Querying Google Scholar
&lt;/h2&gt;

&lt;p&gt;Now that we have created a Scrapy project and are familiar with how to send search queries to Google Scholar, we can begin coding the spiders.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;start_requests&lt;/strong&gt; function is going to be pretty simple: we just need to send requests to Google Scholar with the keyword we want to search for, along with the language we want the output to be in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        queries = ['airbnb']
        for query in queries:
            url = 'https://scholar.google.com/scholar?' + urlencode({'hl': 'en', 'q': query})
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'position': 0})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;start_requests&lt;/strong&gt; function will iterate through a list of keywords in the queries list and then send the request to Google Scholar using the &lt;strong&gt;yield scrapy.Request(get_url(url), callback=self.parse)&lt;/strong&gt; where the response is sent to the &lt;strong&gt;parse&lt;/strong&gt; function in the callback.&lt;/p&gt;

&lt;p&gt;You will also notice that we include the {'position': 0} dictionary in the meta parameter. This isn’t sent to Google, it is sent to the &lt;strong&gt;parse&lt;/strong&gt; callback function and is used to track how many pages the spider has scraped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping The Search Results
&lt;/h2&gt;

&lt;p&gt;The next step is to write our parser to extract the data we need from the HTML response we are getting back from Google Scholar. &lt;/p&gt;

&lt;p&gt;We will use XPath selectors to extract the data from the HTML response. XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should &lt;a href="https://docs.scrapy.org/en/latest/topics/selectors.html"&gt;check out the documentation here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
   position = response.meta['position']
   for res in response.xpath('//*[@data-rp]'):
       link = res.xpath('.//h3/a/@href').extract_first()
       temp = res.xpath('.//h3/a//text()').extract()
       if not temp:
           title = "[C] " + "".join(res.xpath('.//h3/span[@id]//text()').extract())
       else:
           title = "".join(temp)
       snippet = "".join(res.xpath('.//*[@class="gs_rs"]//text()').extract())
       cited = res.xpath('.//a[starts-with(text(),"Cited")]/text()').extract_first()
       temp = res.xpath('.//a[starts-with(text(),"Related")]/@href').extract_first()
       related = "https://scholar.google.com" + temp if temp else ""
       num_versions = res.xpath('.//a[contains(text(),"version")]/text()').extract_first()
       published_data = "".join(res.xpath('.//div[@class="gs_a"]//text()').extract())
       position += 1
       item = {'title': title, 'link': link, 'cited': cited, 'relatedLink': related, 'position': position,
               'numOfVersions': num_versions, 'publishedData': published_data, 'snippet': snippet}
       yield item
   next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To iterate through all the available pages of search results we will need to check to see if there is another page there and then construct the next URL query if there is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
   ##...parsing logic from above
   next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()
   if next_page:
       url = "https://scholar.google.com" + next_page
       yield scrapy.Request(get_url(url), callback=self.parse,meta={'position': position})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Connecting Our Proxy Solution
&lt;/h2&gt;

&lt;p&gt;Google has very sophisticated anti-bot detection systems that will quickly detect that you are scraping their search results and block your IP. As a result, it is vital that you use a high-quality web scraping proxy that works with Google Scholar.&lt;/p&gt;

&lt;p&gt;For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Google Scholar. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scraperapi.com/"&gt;Scraper API&lt;/a&gt; is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 5,000 free requests and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using this function in our &lt;strong&gt;scrapy.Request()&lt;/strong&gt; requests in the &lt;strong&gt;start_requests&lt;/strong&gt; and &lt;strong&gt;parse&lt;/strong&gt; functions we are able to route all our requests through Scraper APIs proxies pools and not worry about getting blocked.&lt;/p&gt;

&lt;p&gt;Before going live we need to update the settings in settings.py to make sure we can use all the concurrent threads available in our Scraper API free plan (5 threads), and set the number of retries to 5, whilst making sure &lt;strong&gt;DOWNLOAD_DELAY&lt;/strong&gt; and &lt;strong&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/strong&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

RETRY_TIMES = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 5 
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yoq4Bja3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://scrapeops.io/assets/images/scrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Live!
&lt;/h2&gt;

&lt;p&gt;Now we are good to go. You can test the spider by running it with the crawl command and exporting the results to a csv file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl scholar -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spider will scrape all the available search results for your keyword without getting banned.&lt;/p&gt;

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Google Scholar project then feel free to do so. &lt;a href="https://github.com/ian-kerins/google-scholar-scrapy-spider"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup"&gt;signing up here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>The Easy Way to Scrape Instagram Using Python Scrapy &amp; GraphQL</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Thu, 06 Aug 2020 13:10:16 +0000</pubDate>
      <link>https://dev.to/iankerins/the-easy-way-to-build-an-instagram-spider-using-python-scrapy-graphql-4gko</link>
      <guid>https://dev.to/iankerins/the-easy-way-to-build-an-instagram-spider-using-python-scrapy-graphql-4gko</guid>
      <description>&lt;p&gt;After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scraping.&lt;/p&gt;

&lt;p&gt;However, for anyone who’s tried to build a web scraping spider for scraping Instagram, Facebook, Twitter or TikTok you know that it can be a bit tricky.&lt;/p&gt;

&lt;p&gt;These sites use sophisticated anti-bot technologies to block your requests and regularly make changes to their site schemas which can break your spider's parsing logic.&lt;/p&gt;

&lt;p&gt;So in this article, I’m going to show you the easiest way to build a Python Scrapy spider that scrapes all Instagram posts for every user account that you send to it. Whilst removing the worry of getting blocked or having to design XPath selectors to scrape the data from the raw HTML.&lt;/p&gt;

&lt;p&gt;The code for the project is available on &lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;GitHub here&lt;/a&gt;, and is set up to scrape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post URL&lt;/li&gt;
&lt;li&gt;Image URL or Video URL&lt;/li&gt;
&lt;li&gt;Post Captions&lt;/li&gt;
&lt;li&gt;Date Posted&lt;/li&gt;
&lt;li&gt;Number of Likes&lt;/li&gt;
&lt;li&gt;Number of Comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every post on that user's account. As you will see there is more data we could easily extract, however, to keep this guide simple I just limited it to the most important data types.&lt;/p&gt;

&lt;p&gt;This code can also be quickly modified to scrape all the posts related to a specific tag or geographical location with only minor changes, so it is a great base to build future spiders with.&lt;/p&gt;

&lt;p&gt;This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Instagram at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;The full-code is on GitHub here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For this example, we're going to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a &lt;a href="https://dashboard.scraperapi.com/signup" rel="noopener noreferrer"&gt;free account here&lt;/a&gt; which will give you 5,000 free requests.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt; to monitor our scrapers for free and alert us if they run into trouble. &lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Our Scrapy Spider
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then navigate to the folder where you want your project to live and run the “startproject” command along with the project name (“instascraper” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject instascraper

cd instascraper

scrapy genspider instagram instagram.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── instascraper              # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── instagram.py     # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, that’s the Scrapy spider templates set up. Now let’s start building our Instagram spiders.&lt;/p&gt;

&lt;p&gt;From here we’re going to create five functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will construct the Instagram URL for the users account and send the request to Instagram.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse -&lt;/strong&gt; will extract all the posts data from the users news feed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_page -&lt;/strong&gt; if there is more than one page, this function will parse all the posts data from those pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_video -&lt;/strong&gt; if the post includes a video, this function will be called and extract the videos url.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; will send the request to Scraper API so it can retrieve the HTML response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s get to work…&lt;/p&gt;




&lt;h2&gt;
  
  
  Requesting Instagram Accounts
&lt;/h2&gt;

&lt;p&gt;To retrieve a user's data from Instagram we need to first create a list of users we want to monitor and then incorporate their user ids into a URL. Luckily for us, Instagram uses a pretty straightforward URL structure.&lt;/p&gt;

&lt;p&gt;Every user has a unique name and/or user id that we can use to create the user URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/&amp;lt;user_name&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also retrieve the posts associated with a specific tag or from a specific location by using the following url format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Tags URL
https://www.instagram.com/explore/tags/&amp;lt;example_tag&amp;gt;/

## Location URL
https://www.instagram.com/explore/locations/&amp;lt;location_id&amp;gt;/

# Note: the location URL is a numeric value so you need to identify the location ID number for
# the locations you want to scrape. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So for this example spider, I’m going to use Nike and Adidas as the two Instagram accounts I want to scrape.&lt;/p&gt;

&lt;p&gt;Using the above framework the Nike url is &lt;code&gt;https://www.instagram.com/nike/&lt;/code&gt;, and we also want to have the ability to specify the page language using the “hl” parameter. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/nike/?hl=en  #English
https://www.instagram.com/nike/?hl=de  #German
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #1: Retrieving Instagram Accounts
&lt;/h2&gt;

&lt;p&gt;Now that we have created a Scrapy project and are familiar with how Instagram displays its data, we can begin coding the spiders.&lt;/p&gt;

&lt;p&gt;Our &lt;strong&gt;start_requests&lt;/strong&gt; function is going to be pretty simple: we just need to send requests to Instagram with the username URL to get the user's account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The start_requests function will iterate through a list of user_accounts and then send the request to Instagram using the &lt;strong&gt;yield scrapy.Request(get_url(url), callback=self.parse)&lt;/strong&gt; where the response is sent to the &lt;strong&gt;parse&lt;/strong&gt; function in the callback.&lt;/p&gt;




&lt;h2&gt;
  
  
  Spider #2: Scraping Post Data
&lt;/h2&gt;

&lt;p&gt;Okay, now that we are getting a response back from Instagram we can extract the data we want.&lt;/p&gt;

&lt;p&gt;At first glance, the post data we want, like image URLs, likes, comments, etc., doesn’t seem to be in the HTML. However, on closer inspection we will see that the data is in the form of a JSON dictionary in the script tag that starts with “window._sharedData”.&lt;/p&gt;

&lt;p&gt;This is because Instagram first loads the layout and all the data it needs from its internal GraphQL API and then puts the data in the correct layout.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7gzoad3vbykljm50fado.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7gzoad3vbykljm50fado.PNG" alt="Image of window._sharedData"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could scrape this data by querying Instagram’s GraphQL endpoint directly, adding "/?__a=1" onto the end of the URL. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.instagram.com/nike/?__a=1/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But we wouldn’t be able to iterate through all the pages, so instead we’re going to get the HTML response and then extract the data from the window._sharedData JSON dictionary.&lt;/p&gt;

&lt;p&gt;Because the data is already formatted as JSON it will be very easy to extract the data we want. We can just use a simple XPath selector to extract the JSON string and then convert it into a JSON dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        json_string = x.strip().split('= ')[1][:-1]
        data = json.loads(json_string)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here we just need to extract the data we want from the JSON dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        json_string = x.strip().split('= ')[1][:-1]
        data = json.loads(json_string)
        # all that we have to do here is to parse the JSON we have
        user_id = data['entry_data']['ProfilePage'][0]['graphql']['user']['id']
        next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']
        edges = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_felix_video_timeline']['edges']
        for i in edges:
            url = 'https://www.instagram.com/p/' + i['node']['shortcode']
            video = i['node']['is_video']
            date_posted_timestamp = i['node']['taken_at_timestamp']
            date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
            like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
            comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i[
                'node'].keys() else ''
            captions = ""
            if i['node']['edge_media_to_caption']:
                for i2 in i['node']['edge_media_to_caption']['edges']:
                    captions += i2['node']['text'] + "\n"

            if video:
                image_url = i['node']['display_url']
            else:
                image_url = i['node']['thumbnail_resources'][-1]['src']
            item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                    'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
                    'captions': captions[:-1]}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #3: Extracting Video URLs
&lt;/h2&gt;

&lt;p&gt;To extract the video URL we need to make another request to that specific post as that data isn’t included in the JSON response previously returned by Instagram.&lt;/p&gt;

&lt;p&gt;If the post includes a video then the &lt;strong&gt;is_video&lt;/strong&gt; flag will be set to true, which will trigger our scraper to request that post’s page and send the response to the &lt;strong&gt;get_video&lt;/strong&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if video:
     yield scrapy.Request(get_url(url), callback=self.get_video, meta={'item': item})
else:
     item['videoURL'] = ''
     yield item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The get_video function will then extract the videoURL from the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_video(self, response):
        # only from the first page
        item = response.meta['item']
        video_url = response.xpath('//meta[@property="og:video"]/@content').extract_first()
        item['videoURL'] = video_url
        yield item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Spider #4: Iterating Through Available Pages
&lt;/h2&gt;

&lt;p&gt;The last piece of extraction logic we need to implement is the ability for our crawler to iterate through all the available pages on that user account and scrape all the data.&lt;/p&gt;

&lt;p&gt;Like the &lt;strong&gt;get_video&lt;/strong&gt; function, we need to check if there are any more pages available before calling the &lt;strong&gt;parse_pages&lt;/strong&gt; function. We do that by checking if the &lt;strong&gt;has_next_page&lt;/strong&gt; field in the JSON dictionary is true or false.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it is true, then we will extract the &lt;strong&gt;end_cursor&lt;/strong&gt; value from the JSON dictionary and create a new request to Instagram’s GraphQL API endpoint, along with the &lt;strong&gt;user_id&lt;/strong&gt;, &lt;strong&gt;query_hash&lt;/strong&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        if next_page_bool:
            cursor = \
                data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                    'end_cursor']
            di = {'id': user_id, 'first': 12, 'after': cursor}
            print(di)
            params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
            url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
            yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will then call the &lt;strong&gt;parse_pages&lt;/strong&gt; function which will repeat the process of extracting all the post data and checking to see if there are any more pages. &lt;/p&gt;

&lt;p&gt;The difference between this function and the original parse function is that it doesn’t make a separate request to each post’s page for the video URL; instead it reads the video_url field directly from the GraphQL response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_pages(self, response):
   di = response.meta['pages_di']
   data = json.loads(response.text)
   for i in data['data']['user']['edge_owner_to_timeline_media']['edges']:
       video = i['node']['is_video']
       url = 'https://www.instagram.com/p/' + i['node']['shortcode']
       if video:
           image_url = i['node']['display_url']
           video_url = i['node']['video_url']
       else:
           video_url = ''
           image_url = i['node']['thumbnail_resources'][-1]['src']
       date_posted_timestamp = i['node']['taken_at_timestamp']
       captions = ""
       if i['node']['edge_media_to_caption']:
           for i2 in i['node']['edge_media_to_caption']['edges']:
               captions += i2['node']['text'] + "\n"
       comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
       date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
       like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
       item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
               'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
               'videoURL': video_url,'captions': captions[:-1]
               }
       yield item
   next_page_bool = data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
   if next_page_bool:
       cursor = data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
       di['after'] = cursor
       params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
       url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
       yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Proxies
&lt;/h2&gt;

&lt;p&gt;Finally, we are pretty much ready to go live. The last thing we need to do is set our spiders up to use a proxy so that we can scrape at scale without getting blocked.&lt;/p&gt;

&lt;p&gt;For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Instagram. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt; is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project, I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlencode

API = '&amp;lt;YOUR_API_KEY&amp;gt;'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then modify our spider functions to use the Scraper API proxy by setting the url parameter in scrapy.Request to &lt;strong&gt;get_url(url)&lt;/strong&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also have to change the spider’s settings to set &lt;strong&gt;allowed_domains&lt;/strong&gt; to api.scraperapi.com, and the max concurrency per domain to the concurrency limit of our Scraper API plan, which in the case of the Scraper API free plan is 5 concurrent threads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 5}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, we should set &lt;strong&gt;RETRY_TIMES&lt;/strong&gt; to tell Scrapy to retry any failed requests (to 5 for example) and make sure that &lt;strong&gt;DOWNLOAD_DELAY&lt;/strong&gt;  and &lt;strong&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/strong&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;/p&gt;
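&lt;p&gt;For reference, the relevant &lt;code&gt;settings.py&lt;/code&gt; entries would look something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

RETRY_TIMES = 5

## leave these disabled when using Scraper API
# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;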




&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Live!
&lt;/h2&gt;

&lt;p&gt;Now we are good to go. You can test the spider again by running it with the crawl command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl instagram -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once complete, the spider will store the scraped account data in a CSV file.&lt;/p&gt;
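&lt;p&gt;The columns in the CSV correspond to the keys of the items we yielded (the exact column order may vary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postURL, isVideo, date_posted, timestamp, likeCount, commentCount, image_url, videoURL, captions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;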

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Instagram project then feel free to do so. &lt;a href="https://github.com/ian-kerins/instagram-python-scrapy-spider" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;signing up here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
    <item>
      <title>How To Scrape Amazon at Scale With Python Scrapy, And Never Get Banned</title>
      <dc:creator>Ian Kerins</dc:creator>
      <pubDate>Tue, 28 Jul 2020 18:19:29 +0000</pubDate>
      <link>https://dev.to/iankerins/how-to-scrape-amazon-at-scale-with-python-scrapy-and-never-get-banned-44cm</link>
      <guid>https://dev.to/iankerins/how-to-scrape-amazon-at-scale-with-python-scrapy-and-never-get-banned-44cm</guid>
      <description>&lt;p&gt;With thousands of companies offering products and price monitoring solutions for Amazon, scraping Amazon is big business.&lt;/p&gt;

&lt;p&gt;But for anyone who’s tried to scrape it at scale you know how quickly you can get blocked.&lt;/p&gt;

&lt;p&gt;So in this article, I’m going to show you how I built a Scrapy spider that searches Amazon for a particular keyword, and then goes into every single product it returns and scrapes all the main information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ASIN&lt;/li&gt;
&lt;li&gt;Product name&lt;/li&gt;
&lt;li&gt;Image url&lt;/li&gt;
&lt;li&gt;Price&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;Available sizes&lt;/li&gt;
&lt;li&gt;Available colors&lt;/li&gt;
&lt;li&gt;Ratings&lt;/li&gt;
&lt;li&gt;Number of reviews&lt;/li&gt;
&lt;li&gt;Seller rank&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of products per month. &lt;a href="https://github.com/ian-kerins/amazon-python-scrapy-scraper" rel="noopener noreferrer"&gt;The code for the project is available on GitHub here&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  What Will We Need?
&lt;/h2&gt;

&lt;p&gt;Obviously, you could build your scrapers from scratch using basic libraries like requests and BeautifulSoup, but I chose to build it using Scrapy, the open-source web crawling framework written in Python, as it is by far the most powerful and popular web scraping framework amongst large scale web scrapers.&lt;/p&gt;

&lt;p&gt;Compared to other web scraping libraries such as BeautifulSoup, Selenium or Cheerio, which are great libraries for parsing HTML data, Scrapy is a full web scraping framework with a large community that has loads of built-in functionality to make web scraping as simple as possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XPath and CSS selectors for HTML parsing&lt;/li&gt;
&lt;li&gt;data pipelines&lt;/li&gt;
&lt;li&gt;automatic retries&lt;/li&gt;
&lt;li&gt;proxy management&lt;/li&gt;
&lt;li&gt;concurrent requests&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it really easy to get started, and very simple to scale up.&lt;/p&gt;

&lt;h4&gt;
  
  
  Proxies
&lt;/h4&gt;

&lt;p&gt;The second thing that is a must if you want to scrape Amazon at any type of scale is a large pool of proxies, along with the code to automatically rotate IPs and headers and deal with bans and CAPTCHAs. This can be very time consuming if you build the proxy management infrastructure yourself.&lt;/p&gt;

&lt;p&gt;For this project I opted to use &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;Scraper API&lt;/a&gt;, a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response. &lt;/p&gt;

&lt;p&gt;Scraper API has a free plan that allows you to make up to 1,000 requests per month which makes it ideal for the development phase, but can be easily scaled up to millions of pages per month if needs be.&lt;/p&gt;

&lt;h4&gt;
  
  
  Monitoring
&lt;/h4&gt;

&lt;p&gt;Lastly, we will need some way to monitor our scraper in production to make sure that everything is running smoothly. For that we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring tool specifically designed for web scraping. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-promo-286a59166d9f41db1c195f619aa36a06.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started With Scrapy
&lt;/h2&gt;

&lt;p&gt;Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, from the folder you want the project to live in, run the “startproject” command along with the project name (“amazon_scraper” in this case) and Scrapy will build a web scraping project folder for you, with everything already set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy startproject amazon_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what you should see&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── scrapy.cfg                # deploy configuration file
└── tutorial                  # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── amazon.py        # spider we just created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to Django, when you create a project with Scrapy it automatically creates all the files you need, each of which has its own purpose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Items.py&lt;/strong&gt; is useful for creating your base dictionary that you import into the spider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Settings.py&lt;/strong&gt; is where all your request settings live and where pipelines and middlewares are activated. Here you can change the delays, concurrency, and lots more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines.py&lt;/strong&gt; is where the items yielded by the spider get passed; it’s mostly used to clean the text and connect to data stores (Excel, SQL, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middlewares.py&lt;/strong&gt; is useful when you want to modify how requests are made and how Scrapy handles responses.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Creating Our Amazon Spider
&lt;/h2&gt;

&lt;p&gt;Okay, we’ve created the general project structure. Now, we’re going to develop our spiders that will do the scraping.&lt;/p&gt;

&lt;p&gt;Scrapy provides a number of different spider types; however, in this tutorial we will cover the most common one, the Generic Spider.&lt;/p&gt;

&lt;p&gt;To create a new spider, simply run the &lt;strong&gt;“genspider”&lt;/strong&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# syntax is --&amp;gt; scrapy genspider name_of_spider website.com 
scrapy genspider amazon amazon.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And Scrapy will create a new file, with a spider template.&lt;/p&gt;

&lt;p&gt;In our case, we will get a new file in the spiders folder called &lt;strong&gt;“amazon.py”&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're going to remove the default code from this (allowed_domains, start_urls, parse function) and start writing our own code.&lt;/p&gt;

&lt;p&gt;We’re going to create four functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;start_requests -&lt;/strong&gt; will send a search query to Amazon with a particular keyword.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_keyword_response -&lt;/strong&gt; will extract the ASIN value for each product returned in the Amazon keyword query, then send a new request to Amazon to return the product page of that product. It will also move to the next page and repeat the process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;parse_product_page -&lt;/strong&gt; will extract all the target information from the product page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get_url -&lt;/strong&gt; will send the request to Scraper API so it can retrieve the HTML response.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With a plan made, now let’s get to work…&lt;/p&gt;

&lt;h2&gt;
  
  
  Send Search Queries To Amazon
&lt;/h2&gt;

&lt;p&gt;The first step is building &lt;strong&gt;start_requests&lt;/strong&gt;, our function that sends search queries to Amazon with our keywords, which is pretty simple…&lt;/p&gt;

&lt;p&gt;First let’s quickly define a list variable with our search keywords outside the AmazonSpider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;queries = ['tshirt for men', 'tshirt for women']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let's create our &lt;strong&gt;start_requests&lt;/strong&gt; function within the AmazonSpider that will send the requests to Amazon.&lt;/p&gt;

&lt;p&gt;To access Amazon’s search functionality via a URL we need to send a search query &lt;strong&gt;“k=SEARCH_KEYWORD”&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/s?k=&amp;lt;SEARCH_KEYWORD&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When implemented in our &lt;strong&gt;start_requests&lt;/strong&gt; function, it looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## amazon.py
import scrapy
from urllib.parse import urlencode

queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):

    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every query in our queries list, we will &lt;strong&gt;urlencode&lt;/strong&gt; it so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL. &lt;/p&gt;
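&lt;p&gt;For example, urlencoding our first keyword produces a URL-safe query string:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from urllib.parse import urlencode
&amp;gt;&amp;gt;&amp;gt; 'https://www.amazon.com/s?' + urlencode({'k': 'tshirt for men'})
'https://www.amazon.com/s?k=tshirt+for+men'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;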

&lt;p&gt;Since Scrapy is async, we will use &lt;strong&gt;yield&lt;/strong&gt; instead of &lt;strong&gt;return&lt;/strong&gt;, which means the functions should either yield a request or a completed dictionary. If a new request is yielded it will go to the callback method, if an item is yielded it will go to the pipeline for data cleaning.&lt;/p&gt;

&lt;p&gt;In our case, when a scrapy.Request is yielded it will activate our &lt;strong&gt;parse_keyword_response&lt;/strong&gt; callback function, which will then extract the ASIN for each product.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scraping Amazon’s Product Listing Page
&lt;/h2&gt;

&lt;p&gt;The cleanest and most popular way to retrieve Amazon product pages is to use their ASIN ID. &lt;/p&gt;

&lt;p&gt;ASINs are unique IDs that every product on Amazon has. We can use this ID as part of our URL to retrieve the product page of any Amazon product, like this...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/dp/&amp;lt;ASIN&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can extract the ASIN value from the product listing page by using Scrapy’s built-in XPath selector extractor methods.&lt;/p&gt;

&lt;p&gt;XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should &lt;a href="https://docs.scrapy.org/en/latest/topics/selectors.html" rel="noopener noreferrer"&gt;check out the documentation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using Scrapy Shell, I’m able to develop an XPath selector that grabs the ASIN value for every product on the product listing page and creates a URL for each product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;products = response.xpath('//*[@data-asin]')

for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we will configure the function to send a request to this URL and then call the &lt;strong&gt;parse_product_page&lt;/strong&gt; callback function when we get a response. We will also add the meta parameter to this request which is used to pass items between functions (or edit certain settings).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Extracting Product Data From Product Page
&lt;/h2&gt;

&lt;p&gt;Now, we’re finally getting to the good stuff!&lt;/p&gt;

&lt;p&gt;So after the parse_keyword_response function requests the product pages URL, it passes the response it receives from Amazon to the &lt;strong&gt;parse_product_page&lt;/strong&gt; callback function along with the ASIN ID in the meta parameter.&lt;/p&gt;

&lt;p&gt;Now, we want to extract the data we need from a product page like &lt;a href="https://www.amazon.com/dp/B06XWLMYVY" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq9g4br629ogqmgbn6ji9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq9g4br629ogqmgbn6ji9.PNG" alt="Amazon Product Page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do so we will have to write XPath selectors to extract each field we want from the HTML response we receive back from Amazon.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scraping the image URL, I’ve gone with a regex selector over an XPath selector, as the XPath was extracting the image in base64.&lt;/p&gt;

&lt;p&gt;With very big websites like Amazon, which have various types of product pages, you will notice that sometimes writing a single XPath selector won’t be enough, as it might work on some pages but not on others.&lt;/p&gt;

&lt;p&gt;In cases like these, you will need to write numerous XPath selectors to cope with the various page layouts. I ran into this issue when trying to extract the product price so I needed to give the spider 3 different XPath options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the spider can't find a price with the first XPath selector then it moves onto the next one, etc.&lt;/p&gt;

&lt;p&gt;If we look at the product page again, we will see that it contains variations of the product in different sizes and colors. To extract this data we will write a quick test to see if this section is present on the page, and if it is we will extract it using regex selectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;temp = response.xpath('//*[@id="twister"]')
        sizes = []
        colors = []
        if temp:
            s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
            json_acceptable = s.replace("'", "\"")
            di = json.loads(json_acceptable)
            sizes = di.get('size_name', [])
            colors = di.get('color_name', [])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting it all together, the &lt;strong&gt;parse_product_page&lt;/strong&gt; function will look like this, and will return a JSON object which will be sent to the pipelines.py file for data cleaning (we will discuss this later).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_product_page(self, response):
        asin = response.meta['asin']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        image = re.search('"large":"(.*?)"',response.text).groups()[0]
        rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
        number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        if not price:
            price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                    response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

        temp = response.xpath('//*[@id="twister"]')
        sizes = []
        colors = []
        if temp:
            s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
            json_acceptable = s.replace("'", "\"")
            di = json.loads(json_acceptable)
            sizes = di.get('size_name', [])
            colors = di.get('color_name', [])

        bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
        seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
        yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating, 'NumberOfReviews': number_of_reviews,
               'Price': price, 'AvailableSizes': sizes, 'AvailableColors': colors, 'BulletPoints': bullet_points,
               'SellerRank': seller_rank}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Iterating Through Product Listing Pages
&lt;/h2&gt;

&lt;p&gt;We’re looking good now…&lt;/p&gt;

&lt;p&gt;Our spider will search Amazon based on the keyword we give it and scrape the details of the products it returns on page 1. However, what if we want our spider to navigate through every page and scrape the products of each one?&lt;/p&gt;

&lt;p&gt;To implement this, all we need to do is add a small bit of extra code to our &lt;strong&gt;parse_keyword_response&lt;/strong&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_keyword_response(self, response):
        products = response.xpath('//*[@data-asin]')

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.com/dp/{asin}"
            yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

        next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
        if next_page:
            url = urljoin("https://www.amazon.com",next_page)
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the spider has scraped all the product pages on the first page, it will then check to see if there is a next page button. If there is, it will retrieve the url extension and create a new URL for the next page. Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.amazon.com/s?k=tshirt+for+men&amp;amp;page=2&amp;amp;qid=1594912185&amp;amp;ref=sr_pg_1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there it will restart the &lt;strong&gt;parse_keyword_response&lt;/strong&gt; function using the callback and extract the ASIN IDs for each product and extract all the product data like before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing The Spider
&lt;/h2&gt;

&lt;p&gt;Now that we’ve developed our spider it is time to test it. Here we can use Scrapy’s built-in CSV exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl amazon -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All going good, you should now have items in test.csv, but you will notice there are 2 issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the text is messy and some values are lists&lt;/li&gt;
&lt;li&gt;we are getting 429 responses from Amazon, which means Amazon is detecting that our requests are coming from a bot and is blocking our spider. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Issue number two is the far bigger problem: if we keep going like this, Amazon will quickly ban our IP address and we won’t be able to scrape Amazon at all. &lt;/p&gt;

&lt;p&gt;In order to solve this, we will need to use a large proxy pool and rotate our proxies and headers with every request. For this we will use Scraper API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Your Proxies With Scraper API
&lt;/h2&gt;

&lt;p&gt;As discussed at the start of this article, Scraper API is a proxy API designed to take the hassle out of using web scraping proxies. &lt;/p&gt;

&lt;p&gt;Instead of finding your own proxies and building your own proxy infrastructure to rotate proxies and headers with every request, along with detecting bans and bypassing anti-bots, you just send the URL you want to scrape to Scraper API and it will take care of everything for you.&lt;/p&gt;

&lt;p&gt;To use Scraper API you need to &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;sign up to a free account here&lt;/a&gt; and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.&lt;/p&gt;

&lt;p&gt;Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.&lt;/p&gt;

&lt;p&gt;For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.&lt;/p&gt;

&lt;p&gt;To do so, I just needed to create a simple function that sends a GET request to Scraper API with the URL we want to scrape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API = '&amp;lt;YOUR_API_KEY&amp;gt;'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then modify our spider functions so as to use the Scraper API proxy by setting the url parameter in scrapy.Request to &lt;strong&gt;get_url(url)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def start_requests(self):
       ...
       …
       yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

def parse_keyword_response(self, response):
       ...
       …
      yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
        ...
       …
       yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A really cool feature with Scraper API is that you can enable Javascript rendering, geotargeting, residential IPs, etc. by simply adding a flag to your API request.&lt;/p&gt;

&lt;p&gt;As Amazon changes the pricing data and supplier data shown based on the country you are making the request from, we're going to use Scraper API's geotargeting feature so that Amazon thinks our requests are coming from the US. To do this we need to add the flag &lt;strong&gt;"&amp;amp;country_code=us"&lt;/strong&gt; to the request, which we can do by adding another parameter to the payload variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_url(url):
    payload = {'api_key': API, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out Scraper APIs other functionality here in their &lt;a href="https://www.scraperapi.com/documentation" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, we have to go into the &lt;strong&gt;settings.py&lt;/strong&gt; file and change the number of concurrent requests we’re allowed to make based on the concurrency limit of our Scraper API plan, which for the free plan is 5 concurrent requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

CONCURRENT_REQUESTS = 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concurrency is the number of requests you are allowed to make in parallel at any one time. The more concurrent requests you can make the faster you can scrape.&lt;/p&gt;

&lt;p&gt;Also, we should set &lt;code&gt;RETRY_TIMES&lt;/code&gt; to tell Scrapy to retry any failed requests (to 5 for example) and make sure that &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt;  and &lt;code&gt;RANDOMIZE_DOWNLOAD_DELAY&lt;/code&gt; aren’t enabled as these will lower your concurrency and are not needed with Scraper API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting Up Monitoring
&lt;/h2&gt;

&lt;p&gt;To monitor our scraper we're going to use &lt;a href="https://scrapeops.io/" rel="noopener noreferrer"&gt;ScrapeOps&lt;/a&gt;, a free monitoring and alerting tool dedicated to web scraping. &lt;/p&gt;

&lt;p&gt;With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo here:&lt;/strong&gt; &lt;a href="https://scrapeops.io/app/login/demo" rel="noopener noreferrer"&gt;ScrapeOps Demo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting setup with ScrapeOps is simple. Just install the Python package:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install scrapeops-scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And add 3 lines to your &lt;code&gt;settings.py&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
 'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500, 
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = { 
'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550, 
'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From there, our scraping stats will be automatically logged and automatically shipped to our dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapeops.io%2Fassets%2Fimages%2Fscrapeops-demo-holder-7dd5eec8fc4395cfa9c9994d0ec09807.png" alt="ScrapeOps Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Cleaning Data With Pipelines
&lt;/h2&gt;

&lt;p&gt;The final step we need to do is to do a bit of data cleaning using the &lt;strong&gt;pipelines.py&lt;/strong&gt; file as the text is messy and some values are lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TutorialPipeline:

    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the spider has yielded a JSON object, the item is passed to the pipeline to be cleaned.&lt;/p&gt;
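&lt;p&gt;For illustration, here is roughly what the pipeline does to a single (hypothetical) scraped item:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## before cleaning (hypothetical values)
{'Title': '  Mens Cotton T-Shirt ', 'Rating': '4.5 out of 5 stars', 'AvailableSizes': ['Small', 'Medium', 'Large'], ...}

## after TutorialPipeline.process_item
{'Title': 'Mens Cotton T-Shirt', 'Rating': '4.5', 'AvailableSizes': 'Small, Medium, Large', ...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;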

&lt;p&gt;To enable the pipeline we need to add it to the settings.py file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## settings.py

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we are good to go. You can test the spider again by running it with the crawl command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl amazon -o test.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time you should see that the spider was able to scrape all the available products for your keyword without getting banned.&lt;/p&gt;

&lt;p&gt;If you would like to run the spider for yourself or modify it for your particular Amazon project then feel free to do so. &lt;a href="https://github.com/ian-kerins/amazon-python-scrapy-scraper" rel="noopener noreferrer"&gt;The code is on GitHub here&lt;/a&gt;. Just remember that you need to get your own Scraper API api key by &lt;a href="https://www.scraperapi.com/signup" rel="noopener noreferrer"&gt;signing up here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>scrapy</category>
      <category>python</category>
    </item>
  </channel>
</rss>
