How to use Scrapy Items - 05 - Python Scrapy tutorial for beginners

#python #scrapy #tutorial

Original post Python Scrapy tutorial for beginners – 05 – How to use Scrapy Items

The goal of scraping is to extract data from websites. Without Scrapy Items, we return unstructured data in the form of Python dictionaries, which makes it easy to introduce typos and return faulty data.
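To see the problem, here is a minimal sketch in plain Python (no Scrapy needed). The function names and sample values are made up for illustration; the point is that a misspelled key in a dictionary raises no error at all.

```python
def scrape_first_book():
    # Correct key names
    return {"title": "A Light in the Attic", "price": "51.77"}


def scrape_second_book():
    # Typo: 'tittle' instead of 'title'. Python accepts it silently.
    return {"tittle": "Tipping the Velvet", "price": "53.74"}


books = [scrape_first_book(), scrape_second_book()]

# The second book silently loses its title
titles = [book.get("title") for book in books]
print(titles)  # ['A Light in the Attic', None]
```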

Luckily, Scrapy provides us with the Item class: a class we can inherit from to give our data a fixed, declared structure, so we yield a proper Python object instead of a free-form dictionary.

In this post you will learn how to:

Create Scrapy Items
Use them to return a structured object

Starting point

While you can use your own Scrapy project for this tutorial, I recommend following along with the final version from the previous lesson of this series, where we added Rules and a LinkExtractor to our spider.

Clone the GitHub repo, and you are all set!

Creating our Scrapy Item

To use our Item, first we need to create it and… it is already done!

Inside the project folder, we have an items.py file with the skeleton of an item:

Besides the scrapy import and a useful link to the documentation, we have nothing there. Let's solve that!

Our BooksItem is going to be the class we use for every scraped element. Think of it as a blueprint that tells us which fields every book is going to have.

And what do we need on each item? A title, an image, a price… Exactly what our spider.py file yields.

Copy those names, paste them inside the class, and assign scrapy.Field() to each one:

class BooksItem(scrapy.Item):
    title = scrapy.Field()
    final_image = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    stars = scrapy.Field()
    description = scrapy.Field()
    upc = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_inc_tax = scrapy.Field()
    tax = scrapy.Field()

That's it! We are done!

Our BooksItem class is created. The only fields we can set are the ones we explicitly declared inside the class. Let's test that theory.

Testing our Scrapy Item

Let's check that our theory holds. Open the Scrapy shell (run scrapy shell in your terminal), import the item, and create an object with only some of its fields. Nothing goes wrong, as every field is optional.

Now try to create another object with a field that does not exist. You'll get a KeyError:

Our theory is right: we can only set the fields that we declared on our Item.

We have talked enough about the Item. Let's use it.

Using our Scrapy Item in our Spider

Open your spider.py (finally!) and add the import at the top of the file:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import BooksItem # New line

import scrapy

Then, inside the parsing method, create an instance of the Item. For example, I created it right after all the data is scraped:

               ....
               '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
            price_excl_tax = response.xpath(
                '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
            price_inc_tax = response.xpath(
                '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
            tax = response.xpath(
                '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

            book = BooksItem() # New line

At this point, the method still ends with a nice yield that returns a dictionary with all the data.

Delete it.

Then assign each scraped value to its field on the book object, and yield the instance:

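            book = BooksItem()

            # Fill every field declared in items.py with the scraped value
            book['title'] = title
            book['final_image'] = final_image
            book['price'] = price
            book['stock'] = stock
            book['stars'] = stars
            book['description'] = description
            book['upc'] = upc
            book['price_excl_tax'] = price_excl_tax
            book['price_inc_tax'] = price_inc_tax
            book['tax'] = tax

            yield book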

That's enough. Let's run the code. Run scrapy crawl spider -o scrapy_item_version.json and wait until the spider is done.

As always, we get our 1000 books, this time with more robust code thanks to Items.
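A quick way to confirm the count is to load the exported file and count its entries. This is a minimal sketch: count_items is a hypothetical helper, and it assumes the feed was written by a single run into a file that did not exist before.

```python
import json
from pathlib import Path


def count_items(feed_path):
    """Return the number of items stored in a Scrapy JSON feed file."""
    with open(feed_path, encoding="utf-8") as feed:
        return len(json.load(feed))


feed_file = Path("scrapy_item_version.json")
if feed_file.exists():  # only meaningful after the crawl has finished
    print(count_items(feed_file))
```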

Conclusion

It is easy to make your spiders less buggy, and one of the easiest improvements is using Scrapy Items. The Item class lets us define a class whose fields are declared up front, so only those fields can ever be set. To use Items, we just need to:

Create an Item by specifying the fields it is going to have
Import the class created
Create an instance of that class
For every field extracted, add it to the Item instance
Finally, yield the object instance.

This opens the door to the Item Pipeline, which processes each scraped item. There we can tell Scrapy how to handle the item: for example, cleaning it, validating its fields, and more.

And of course, we'll learn that and more in our next lesson*.

*The sixth lesson is being built right now. Thanks for your patience.

My Youtube tutorial videos

Final code on Github

Reach out to me on Twitter

Previous lesson: 04 – Crawler, Rules and LinkExtractor