Scrapy is a great framework to use for scraping projects. However, did you know there is a way to run Scrapy straight from a script? In fact, looking at the documentation, there are two ways to run Scrapy: using the Scrapy API or using the framework itself.
Here we will discuss using the Scrapy API to access the required settings and classes needed to run Scrapy in a single Python script. This is an area the documentation only touches on briefly, which is the main reason a tutorial covering the more practical aspects of writing such scripts is worth having.
In this article you will learn:
- Why you would use Scrapy from a script
- The basic script you need every time you want to access Scrapy from an individual script
- How to specify customised Scrapy settings
- How to specify the HTTP requests you want Scrapy to invoke
- How to process those HTTP responses using Scrapy within one script
Why Use Scrapy from a Script?
Scrapy can be used for heavy-duty scraping work; however, there are a lot of projects that are actually quite small and don't require the whole Scrapy framework. This is where using Scrapy in a Python script comes in: there is no need to set up the whole framework, you can do it all from a single script.
The Scrapy API allows you to run Scrapy entirely within one script, all inside a single process.
Let's see what the basics of this look like before fleshing out some of the settings needed to scrape.
Basic Script
The key to running Scrapy in a Python script is the CrawlerProcess class. This class, part of the scrapy.crawler module, provides the engine to run Scrapy within a Python script. Within the CrawlerProcess class, Python's Twisted framework is imported.
Twisted is a Python framework used for input and output processes, such as HTTP requests. It does this through what's called a Twisted event reactor, and Scrapy is actually built on top of Twisted! We won't go into too much detail here, but needless to say, the CrawlerProcess class imports a Twisted reactor which listens for events such as multiple HTTP requests. This is at the heart of how Scrapy works.
CrawlerProcess assumes that the Twisted reactor is NOT being used by anything else, for example by another script already running spiders. With that, let's look at the code below.
import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
- To use Scrapy we must create our own spider. This is done by creating a class which inherits from the scrapy.Spider class, the most basic spider that every Scrapy project must derive from. We also have to give this spider a name for it to run. Spiders normally require a couple of functions and a URL to scrape, but for this example we will omit those for the moment.
- Next you see if __name__ == "__main__". This is a best practice in Python: when we write a script we want it to be able to run the code directly, but also to be importable somewhere else without running it.
- We instantiate the CrawlerProcess class to get access to the functions we need to start scraping data. CrawlerProcess has two functions we are interested in: crawl and start. We use crawl to start the spider we created, and then the start function to launch the Twisted reactor, the engine that processes and listens for the HTTP requests we want. A short sketch of this flow follows below.
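A minimal sketch of that flow, assuming a hypothetical second spider called OtherSpider purely to illustrate that crawl can be called more than once before start launches the reactor:

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'


class OtherSpider(scrapy.Spider):
    # hypothetical second spider, only here to show scheduling
    name = 'other'


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)   # schedule the first spider
    process.crawl(OtherSpider)  # schedule another spider in the same process
    process.start()             # start the Twisted reactor; blocks until both finish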
Adding in Settings
The Scrapy framework provides a list of settings that it will use automatically; however, when working with the Scrapy API we have to provide the settings explicitly. The settings we define are how we can customise our spiders. The scrapy.Spider class has a variable called custom_settings that can be used to override the settings Scrapy uses automatically. We have to create a dictionary of our own settings to do this, as the custom_settings variable is set to None by default.
You may want to use some or most of the settings Scrapy provides, in which case you could copy them from there. Alternatively, a list of the built-in settings can be found here.
As an example, see below:

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
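A few of the other built-in settings are commonly overridden the same way. The values below are an illustrative sketch, not recommendations:

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOAD_DELAY': 1,       # wait one second between requests
        'ROBOTSTXT_OBEY': True,    # respect robots.txt rules
        'USER_AGENT': 'my-scraper (contact@example.com)',  # identify the scraper
        'LOG_LEVEL': 'INFO',       # reduce console noise
    }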
Specifying URLs to scrape
We have shown how to create a spider and define the settings, but we haven't actually specified any URLs to scrape, or how to customise the requests to the website we want data from: for example, parameters, headers and user agents.
Every spider has a method called start_requests(), which Scrapy calls to create the requests for the URLs we want. There are two ways to use this method:
1) By defining the start_urls attribute
2) By implementing our own start_requests method
The shortest way is by defining start_urls as a list of the URLs we want to get. By specifying this attribute, the default start_requests() automatically goes through each one of our URLs.
class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    start_urls = ['URL1', 'URL2']
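Jumping ahead slightly: when start_urls is used, each response is delivered to a method called parse by default, so in practice the spider needs that method too. A sketch, using http://quotes.toscrape.com purely as an example URL:

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # every response generated from start_urls arrives here by default
        self.logger.info('Fetched %s (%d bytes)', response.url, len(response.body))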
However, notice that if we do this we can't specify our own headers, parameters or anything else we want to go along with the request. This is where implementing our own start_requests method comes in.
First we define the variables we want to go along with the request. We then implement our own start_requests method so we can make use of the headers and params we want, as well as decide where we want the response to go.
from urllib.parse import urlencode

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    headers = {}
    params = {}

    def start_requests(self):
        # scrapy.Request has no params argument, so any query
        # parameters are encoded into the URL first
        url = 'URL1' + '?' + urlencode(self.params)
        yield scrapy.Request(url, headers=self.headers)
Here we use the scrapy.Request class which, when given a URL, will make the HTTP request and return the response that later arrives as the response variable. Note that scrapy.Request does not accept query parameters directly, which is why the params dictionary is encoded into the URL first.
Did you notice that we didn't specify a callback? That is, we didn't tell Scrapy where to send the response to the request we just asked it to make.
Let's fix that. By default Scrapy expects the callback method to be named parse, but it could be anything we want it to be.
from urllib.parse import urlencode

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    headers = {}
    params = {}

    def start_requests(self):
        url = 'URL1' + '?' + urlencode(self.params)
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        print(response.body)
Here we have defined the method parse, which accepts a response variable; remember, this is created when Scrapy makes the HTTP request for us. We then ask Scrapy to print the response body.
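In practice you would usually extract data from the response rather than just print it. As a hypothetical sketch (the CSS selectors are generic examples, not tied to any particular site), parse inside the same spider class could instead yield an item:

    def parse(self, response):
        # pull the page title and every link href, yielded as one item
        yield {
            'title': response.css('title::text').get(),
            'links': response.css('a::attr(href)').getall(),
        }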
With that, we now have the basics of running Scrapy in a Python script. We can use all the same methods; we just have to do a bit of configuring beforehand.
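Putting the pieces together, a complete script might look like the sketch below. The target http://quotes.toscrape.com and the CSS selector are purely illustrative; swap in your own URL, headers and params:

import scrapy
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlencode


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {'DOWNLOAD_DELAY': 1}
    headers = {}
    params = {}

    def start_requests(self):
        url = 'http://quotes.toscrape.com'
        if self.params:
            # encode any query parameters into the URL
            url += '?' + urlencode(self.params)
        yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        # print the first quote on the page as a quick sanity check
        print(response.css('span.text::text').get())


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()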
It's important to review the content you read; the focus should be on recalling and understanding it. I encourage you to do the exercises below.
Exercises
- Recall how to write a basic script that will enable Scrapy
- What is imported when we instantiate CrawlerProcess?
- How do you add in your own custom settings?
- Why would you use start_requests() rather than start_urls when directing HTTP requests with Scrapy?
Top comments (3)
Fantastic. We use a similar approach at work. For our case, each Scrapy project is run as a Docker image for specific URL domains. Just curious, does this approach still allow middleware to be declared in the custom settings?
Hi Musale! Glad you liked the article.
Yes, there is an option to declare middleware in the custom settings, but defining the middleware itself inside a single script isn't something I've come across. My guess would be that if you're thinking of middleware, the whole Scrapy framework is the better fit; indeed, that is what it's there for.
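For reference, enabling an existing middleware class through custom_settings might look like the sketch below; the dotted path myproject.middlewares.CustomDownloaderMiddleware and the priority 543 are placeholders:

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # placeholder path to a middleware class defined elsewhere
            'myproject.middlewares.CustomDownloaderMiddleware': 543,
        },
    }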
Great info. But how do you store the results in a database (pipeline settings)?