Scrapy is an open source framework written in Python for extracting data from structured content such as HTML and XML. It crawls and scrapes websites quickly. First, install Python's package manager, pip.
Install Scrapy with pip:
pip install scrapy
Starting a new project:
scrapy startproject tutorial
When a scrapy project is created, a file / directory structure will be created as follows.
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Example spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Scrapy spiders can crawl and extract data from one or more URLs. With Scrapy selectors, the desired fields can be selected and filtered. In addition to CSS, Scrapy selectors also support XPath expressions.
To start crawling:
scrapy crawl quotes
To write the output to a JSON file:
scrapy crawl quotes -o quotes.json
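Scrapy's JSON feed export writes a single JSON array of the items the spider yielded, so the file can be loaded back with the standard library. A minimal sketch (the sample string below is hypothetical; actual values depend on the crawl):

```python
# Loads data in the shape Scrapy's -o quotes.json feed produces:
# a JSON array of dicts, one per yielded item.
import json

# Hypothetical sample of the exported structure.
sample = '[{"text": "...", "author": "Albert Einstein", "tags": ["change", "thoughts"]}]'

items = json.loads(sample)
print(items[0]["author"])  # -> Albert Einstein
```

Note that repeated runs with -o append to the file, which produces invalid JSON; delete the file between runs (or use a fresh filename) to keep the output well-formed.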