David MM🐍
Creating your first spider - 01 - Python scrapy tutorial for beginners


Learn how to fetch the data of any website with Python and the Scrapy framework in just minutes. In this first lesson of 'Python scrapy tutorial for beginners', we will scrape the data from a book store, extracting all the information and storing it in a file.

In this post you will learn:

  • How to prepare your environment and install everything
  • How to create a Scrapy project and spider
  • How to fetch the data from the HTML
  • How to manipulate the data and extract the data you want
  • How to store the data into a .json, .csv and .xml file

Preparing your environment and installing everything

Python scrapy tutorial for beginners

Before anything, we need to prepare our environment and install everything.

In Python, we create virtual environments to have a separate environment with its own dependencies.

For example, Project1 has Python 3.4 and Scrapy 1.2, while Project2 has Python 3.7.4 and Scrapy 1.7.3.

As we keep separate environments, one for each project, we will never have a conflict between different versions of packages.

You can use Conda, virtualenv or Pipenv to create a virtual environment. In this course, I will use Pipenv. You only need to install it with pip install pipenv and create a new virtual environment with pipenv shell.

Once you are set, install Scrapy with pip install scrapy. That's all you need.

Time to create the project and your spider.

Python scrapy tutorial for beginners - Virtual Env (base image provided by Vecteezy)

Creating a project and a spider – And what they are

Python scrapy tutorial for beginners - Scrapy

Before anything, we need to create a Scrapy project. In your current folder, enter:

scrapy startproject books

This will create a project named 'books'. Inside you'll find a few files. I'll explain them in a more detailed post but here's a brief explanation:

    books/
        scrapy.cfg            <-- Configuration file (DO NOT TOUCH!)
        books/
            __init__.py       <-- Empty file that marks this as a Python folder
            items.py          <-- Model of the item to scrape
            middlewares.py    <-- Scrapy processing hooks (DO NOT TOUCH)
            pipelines.py      <-- What to do with the scraped item
            settings.py       <-- Project settings file
            spiders/          <-- Directory of our spiders (empty by now)

After creating the project, navigate into it (cd books) and, once inside the folder, create a spider by passing it a name and the root URL without 'www':

scrapy genspider spider books.toscrape.com

Now we have our spider inside the 'spiders' folder! You will have something like this:

# -*- coding: utf-8 -*-
import scrapy

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

First, we import scrapy. Then, a class is created inheriting 'Spider' from Scrapy. That class has 3 variables and a method.

The variables are the spider's name, the allowed_domains and the start_urls. Pretty self-explanatory. The name is what we will use in a second to run the spider, allowed_domains limits the scope of the scraping process (the spider can't visit any URL outside the domains listed here) and start_urls contains the starting points of the spider. In this case, just one.

The parse method is called internally when we start the Scrapy spider. Right now it only contains 'pass': it does nothing. Let's solve that.

How to fetch data from the HTML

Python scrapy tutorial for beginners - How Crawlers work

We are going to query the HTML and to do so we need Xpath, a query language. Don't you worry, even if it seems weird at first, it is easy to learn as all you need are a few functions.

Parse method

But first, let's see what we have in the 'parse' method.

'parse' is called automatically when the Scrapy spider starts. As arguments, we have self (the instance of the class) and response. The response is what the server returns when we request an HTML page. In this spider, we are requesting http://books.toscrape.com/, and in response we get an object with all the HTML, a status message and more.

Replace "pass" with 'print(response.status)' and run the spider:

scrapy crawl spider

This is what we got:

Python scrapy tutorial for beginners - First run

Among a lot of information, we see that we have crawled the start_url, got a 200 HTTP status (success) and then the spider stopped.

Besides 'status', our response has a lot of attributes and methods. The one we are going to use right now is 'xpath'.

Our first steps with Xpath

Open the starting URL, right-click any book and select 'Inspect'. A side menu will open with the HTML structure of the website (if not, make sure you have selected the 'Elements' tab). You'll have something like this:

Python scrapy tutorial for beginners - Xpath

We can see that each 'article' tag contains all the information we want.

The plan is to grab all articles, then, one by one, get all the information from each book.

First, let's see how we select all articles.

If we click on the HTML in the side menu and press Control + F, the search menu opens:

At the bottom-right, you can read "Find by string, selector or Xpath". Scrapy uses Xpath, so let's use it.

To start a query with Xpath, write '//' then what you want to find. We want to grab all the articles, so type '//article'. We want to be more accurate, so let's grab all the articles with the attribute 'class = product_pod'. To specify an attribute, type it between brackets, like this: '//article[@class="product_pod"]'.

You can see now that we have selected 20 elements: The 20 initial books.

Seems like we got it! Let's copy that Xpath instruction and use it to select the articles in our spider. Then, we store all the books.

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

Once we have all the books, we want to look inside each book for the information we want. Let's start with the title. Go to your URL and search where the full title is located. Right-click any title and then select 'Inspect'.

Inside the h3 tag, there is an 'a' tag with the book title as 'title' attribute. Let's loop over the books and extract it.

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()

We get all the books, and for each one of them, we search for the 'h3' tag, then the 'a' tag, and we select the @title attribute. We want that text, so we use 'extract_first' (we can also use 'extract' to extract all of them).

As we are scraping not the whole HTML but a small subset (the one in 'book'), we need to put a dot at the start of the Xpath expression. Remember: '//' for the whole HTML response, './/' for a subset of that HTML we already extracted.

We have the title; now let's get the price. Right-click the price and inspect it.

The text we want is inside a 'p' tag with the 'price_color' class inside a 'div' tag. Add this after the title:

price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first() 

We go to any 'div' with a 'p' child that has the 'price_color' class, then we use the 'text()' function to get the text, and then we extract_first() our selection.

Let's see what we have. Print both the title and the price and run the spider.

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()
            price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
            print(title)
            print(price)

scrapy crawl spider

Everything is working as planned. Let's take the image URL too. Right-click the image, inspect it:

We don't have a full URL here, but a partial one.

The 'src' attribute has the relative URL, not the whole URL: the base URL is missing. Well, we just need to add it. Add this at the bottom of your method.

image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()

We get the 'img' tag with the class 'thumbnail', take the relative URL from 'src', then prepend the first (and only) start_url. Again, let's print the result. Run the spider again.
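Simple concatenation works on this site because the relative paths never start with a slash. For trickier cases, the standard library's urljoin is more robust; a sketch with a hypothetical 'src' value:

```python
from urllib.parse import urljoin

base = 'http://books.toscrape.com/'
relative_src = 'media/cache/fe/72/cover.jpg'  # hypothetical relative 'src'
# urljoin resolves the relative path against the base URL
image_url = urljoin(base, relative_src)
print(image_url)  # http://books.toscrape.com/media/cache/fe/72/cover.jpg
```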

Looking good! Open any of the URLs and you'll see the cover's thumbnail.

Now let's extract the URL so we can buy any book if we are interested.

The book URL is stored in the href attribute of both the title and the thumbnail links. Either will do.

book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()

Run the spider again:

Click on any URL and you'll go to that book website.

Now we are selecting all the fields we want, but we are not doing anything with them, right? We need to 'yield' (or 'return') them. For each book, we are going to yield its title, price, image URL and book URL.
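Because parse will use yield, it becomes a generator: Scrapy iterates over it and collects the items one by one. The same pattern in plain Python, with hypothetical data and no Scrapy involved:

```python
def parse_books(books):
    # Yield one dict per book, just like the spider's parse() will do
    for title, price in books:
        yield {'title': title, 'price': price}

items = list(parse_books([('Book One', '£10.00'), ('Book Two', '£20.00')]))
print(items[0])  # {'title': 'Book One', 'price': '£10.00'}
```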

Remove all the prints and yield the items like a dictionary:

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()
            price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
            image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
            yield {
                'title': title,
                'price': price,
                'Image URL': image_url,
                'Book URL': book_url,
            }
Run the spider and look at the terminal:

Saving the data into a file

While it looks cool on the terminal, it is of little use there. Why don't we store it in a file we can use later?

When we run our spider, we have optional arguments. One of them, -o, sets the name of the output file. Run this:

scrapy crawl spider -o books.json

Wait until it's done… a new file has appeared! Double click it to open it.

All the information we saw on the terminal is now stored in 'books.json'. Isn't that cool? We can do the same with .csv and .xml files:

scrapy crawl spider -o books.csv

scrapy crawl spider -o books.xml
I know the first time is tricky, but you have learnt the basics of Scrapy. You know:

  • How to create a Scrapy spider to navigate a URL
  • How a Scrapy project is structured
  • How to use Xpath to extract the data
  • How to store the data into .json, .csv and .xml files

I suggest you keep training. Look for a URL you want to scrape and try extracting a few fields, as you did in the Beautiful Soup tutorial. The trick to Scrapy is learning how Xpath works.

But… do you remember that each book has a URL like this one?

Inside each item we scraped, there's more information we can take. And we'll do it in the second lesson of this series.

Final code on Github

Reach to me on Twitter

My Youtube tutorial videos
