Muhammad Ikramullah Khan

Scrapy Items: The Complete Beginner's Guide (Why You Should Stop Using Dictionaries)

When I first started scraping with Scrapy, I used plain dictionaries for everything:

yield {
    'name': 'Product Name',
    'price': '$29.99',
    'stock': 'In Stock'
}

It worked. My scraper ran. Data got saved. Mission accomplished, right?

Wrong.

Three weeks later, I made a typo. Instead of 'price', I accidentally typed 'pric'. My scraper kept running, but all the price data was going into a field that didn't exist. I didn't notice for days.

That's when I learned about Scrapy Items. They would have caught that typo immediately and saved me hours of frustration.

Let me show you what Items are, why they matter, and the tricks nobody talks about.


What Are Scrapy Items?

Think of Items as blueprints for your data. Instead of throwing random dictionaries around, you define exactly what fields your data should have.

With dictionaries (the risky way):

def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get()
    }

    # Later in your code, you make a typo
    yield {
        'nam': response.css('h1::text').get(),  # Typo! But no error!
        'price': response.css('.price::text').get()
    }

With dictionaries, typos silently create new fields. Your data gets messed up and you might not notice.

With Items (the safe way):

# Define the structure once
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

# Use it in your spider
def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['pric'] = response.css('.price::text').get()  # ERROR! Field doesn't exist

With Items, typos cause immediate errors. You catch problems right away.


Creating Your First Item

Step 1: Define Your Item

Open your items.py file (Scrapy creates this automatically when you start a project):

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    stock = scrapy.Field()

That's it. You just created a blueprint. A ProductItem can only contain these fields: you don't have to populate every one, but you can't add any others.

Step 2: Use It in Your Spider

# spider.py
import scrapy
from myproject.items import ProductItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()
            item['url'] = product.css('a::attr(href)').get()
            item['rating'] = product.css('.rating::text').get()
            item['stock'] = product.css('.stock::text').get()

            yield item

Notice the pattern:

  1. Import your Item class
  2. Create an instance
  3. Fill in the fields
  4. Yield the item

Why Bother with Items? (The Real Benefits)

Benefit #1: Typo Protection

This is huge and saved me so many times.

item = ProductItem()
item['pricee'] = '29.99'  # CRASH! Raises a KeyError immediately

With dictionaries, this would silently create a field called pricee. With Items, you get an error immediately.
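
If you want to see this for yourself, here's a minimal sketch you can run in a Python shell (it assumes the two-field ProductItem shown earlier):

import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = ProductItem()
item['price'] = '29.99'       # fine: 'price' is a declared field

try:
    item['pricee'] = '29.99'  # typo: not a declared field
except KeyError as error:
    print(error)              # the KeyError tells you the field isn't supported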

Benefit #2: Clear Data Structure

When someone (or future you) looks at your code, they can instantly see what data you're collecting:

# items.py
class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    isbn = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()

One glance at items.py tells you everything being scraped. No need to hunt through spider code.

Benefit #3: Works Seamlessly with Pipelines

Items work perfectly with Scrapy pipelines:

# pipelines.py
class PricePipeline:
    def process_item(self, item, spider):
        # You know 'price' exists because it's defined in the Item
        item['price'] = item['price'].replace('$', '')
        item['price'] = float(item['price'])
        return item
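
One thing to keep in mind (the full example later walks through this too): a pipeline only runs once it's enabled in your settings, for example:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
}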

Benefit #4: Better for Teams

When multiple people work on a project, Items create a contract. Everyone knows exactly what fields exist.


Items vs Dictionaries: Side by Side

Using dictionaries:

def parse(self, response):
    yield {
        'product_name': 'Widget',
        'product_price': 29.99
    }

def parse_detail(self, response):
    yield {
        'name': 'Widget',  # Different key name! Oops.
        'price': 29.99
    }

Different parts of your code use different field names. Chaos.

Using Items:

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Consistent!
    item['price'] = 29.99
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Same fields everywhere
    item['price'] = 29.99
    yield item

Consistency enforced automatically.


Working with Items (The Practical Stuff)

Creating an Item

item = ProductItem()

Setting Fields

# Like a dictionary
item['name'] = 'Product Name'
item['price'] = 29.99

Getting Fields

# Like a dictionary
name = item['name']
price = item.get('price', 0.0)  # With default value

Checking if a Field Exists

# Check if populated
if 'name' in item:
    print('Name is set')

# Check if declared (even if not populated)
if 'name' in item.fields:
    print('Name is a valid field')

Getting All Fields

# Get only populated fields
data = dict(item)

# Get all declared fields (even if not set)
all_fields = item.fields.keys()

Advanced: Field Metadata

Here's something most tutorials skip. Fields can have metadata that components use:

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field(serializer=str)
    last_updated = scrapy.Field(serializer=str)

The serializer tells Scrapy how to serialize this field when exporting data. You can add any metadata you want:

class ProductItem(scrapy.Item):
    name = scrapy.Field(
        required=True,
        max_length=200
    )
    price = scrapy.Field(
        required=True,
        serializer=float
    )
    description = scrapy.Field(
        required=False,
        default='No description available'
    )

Scrapy itself doesn't use this metadata (except serializer), but your pipelines can!
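
For example, here's a minimal sketch of a pipeline that reads that metadata, assuming the ProductItem above with its required and default keys (remember, those keys mean nothing to Scrapy itself; only serializer does):

# pipelines.py
from scrapy.exceptions import DropItem

class MetadataValidationPipeline:
    def process_item(self, item, spider):
        # item.fields maps every declared field name to its metadata dict
        for field_name, meta in item.fields.items():
            value = item.get(field_name)
            if value is None and 'default' in meta:
                item[field_name] = meta['default']  # fall back to the declared default
            elif value is None and meta.get('required'):
                raise DropItem(f'Missing required field: {field_name}')
        return item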


Real-World Example: Building an Item-Based Spider

Let's build a complete spider that scrapes book data:

Step 1: Define the Item

# items.py
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
    scraped_at = scrapy.Field()

Step 2: Create the Spider

# spiders/books.py
import scrapy
from datetime import datetime
from myproject.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        for book in response.css('.product_pod'):
            book_url = book.css('h3 a::attr(href)').get()
            yield response.follow(book_url, callback=self.parse_book)

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        item = BookItem()

        item['title'] = response.css('h1::text').get()
        item['price'] = response.css('.price_color::text').get()
        item['rating'] = response.css('.star-rating::attr(class)').get()
        item['availability'] = response.css('.availability::text').getall()[1].strip()
        item['description'] = response.css('#product_description + p::text').get()
        item['url'] = response.url
        item['scraped_at'] = datetime.now().isoformat()

        # Author might not exist
        author = response.css('th:contains("Author") + td a::text').get()
        if author:
            item['author'] = author

        yield item

Step 3: Process with Pipelines

# pipelines.py
class BookCleaningPipeline:
    def process_item(self, item, spider):
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('£', '')
            item['price'] = float(item['price'])

        # Extract rating number
        if item.get('rating'):
            rating_map = {
                'One': 1,
                'Two': 2,
                'Three': 3,
                'Four': 4,
                'Five': 5
            }
            rating_class = item['rating'].replace('star-rating ', '')
            item['rating'] = rating_map.get(rating_class, 0)

        return item

Step 4: Enable Pipeline in Settings

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.BookCleaningPipeline': 300,
}

Run it:

scrapy crawl books -o books.json
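
One small note on the -o flag: in recent Scrapy versions (2.1+), -o appends to an existing file, while -O overwrites it. If you're re-running the crawl and want a fresh file, use:

scrapy crawl books -O books.json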

Advanced Patterns Nobody Talks About

Pattern #1: Partial Items (Building Across Pages)

Sometimes you scrape data from multiple pages. Build your item gradually:

def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            meta={'item': item}
        )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    item['reviews'] = len(response.css('.review'))
    yield item
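
Passing the item through meta works, but newer Scrapy versions (1.7+) also provide cb_kwargs, which delivers the value directly as a keyword argument of the callback. A minimal sketch of the same flow using cb_kwargs:

def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            cb_kwargs={'item': item}  # passed straight to parse_detail as a keyword argument
        )

def parse_detail(self, response, item):
    item['description'] = response.css('.description::text').get()
    item['reviews'] = len(response.css('.review'))
    yield item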

Pattern #2: Conditional Fields

Not all items need all fields:

def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()

    # Only add discount if it exists
    discount = response.css('.discount::text').get()
    if discount:
        item['discount'] = discount

    # Only add rating if it exists
    rating = response.css('.rating::text').get()
    if rating:
        item['rating'] = rating

    yield item

Pattern #3: Item Inheritance

You can extend Items for different types of products:

# items.py
class BaseProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

class BookItem(BaseProductItem):
    author = scrapy.Field()
    isbn = scrapy.Field()
    publisher = scrapy.Field()

class ElectronicsItem(BaseProductItem):
    brand = scrapy.Field()
    model = scrapy.Field()
    warranty = scrapy.Field()
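
Inherited fields are merged into the subclass, so a BookItem accepts both its own fields and everything from BaseProductItem. A quick check (a sketch, assuming the classes above):

book = BookItem()
book['name'] = 'Some Book'     # from BaseProductItem
book['isbn'] = '1234567890'    # from BookItem
print(sorted(book.fields.keys()))
# ['author', 'isbn', 'name', 'price', 'publisher', 'url']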

Pattern #4: Default Values

Set defaults when creating items:

def parse(self, response):
    # Assumes scraped_at and currency are declared as Fields on ProductItem
    item = ProductItem(
        scraped_at=datetime.now().isoformat(),
        currency='USD'
    )

    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()

    yield item

Common Mistakes and How to Avoid Them

Mistake #1: Reusing the Same Item Across the Loop

# WRONG (one shared item gets mutated and yielded repeatedly)
def parse(self, response):
    item = ProductItem()  # Created once, outside the loop
    for product in response.css('.product'):
        item['name'] = product.css('h2::text').get()
        yield item  # Yielding the SAME item multiple times!

# RIGHT (creates new item for each product)
def parse(self, response):
    for product in response.css('.product'):
        item = ProductItem()  # New item each time
        item['name'] = product.css('h2::text').get()
        yield item

Mistake #2: Not Importing the Item

# WRONG
def parse(self, response):
    item = ProductItem()  # NameError!

# RIGHT
from myproject.items import ProductItem

def parse(self, response):
    item = ProductItem()  # Works!

Mistake #3: Mixing Dictionaries and Items

# WRONG (inconsistent)
def parse(self, response):
    yield {'name': 'Product 1'}  # Dictionary

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item  # Item

# RIGHT (consistent)
def parse(self, response):
    item = ProductItem()
    item['name'] = 'Product 1'
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item

Mistake #4: Not Handling Missing Fields

# RISKY (silently stores None if the author element doesn't exist)
item['author'] = response.css('.author::text').get()

# BETTER (handles missing data explicitly)
author = response.css('.author::text').get()
if author:
    item['author'] = author
# Or with a default value
item['author'] = response.css('.author::text').get() or 'Unknown'

Items vs ItemAdapter (Modern Scrapy)

Modern Scrapy recommends using ItemAdapter in pipelines. It gives you one consistent interface whether the item is a dict, a scrapy.Item, or another supported item type (such as a dataclass):

# pipelines.py
from itemadapter import ItemAdapter

class MyPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Works whether item is a dict or Item object
        if adapter.get('price'):
            adapter['price'] = float(adapter['price'])

        return item

This makes your pipelines more flexible.
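
ItemAdapter also gives you a few conveniences beyond dictionary-style access; for instance, asdict() returns a plain dict regardless of the underlying item type. A small sketch:

# pipelines.py
from itemadapter import ItemAdapter

class LoggingPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        record = adapter.asdict()  # plain dict, whatever the item type
        spider.logger.debug('Scraped %d fields', len(record))
        return item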


When to Use Items vs Dictionaries

Use Items when:

  • Building a serious, production scraper
  • Working in a team
  • Need typo protection
  • Want clear data structure
  • Using pipelines extensively

Use dictionaries when:

  • Quick, one-off scraping
  • Learning Scrapy basics
  • Scraping very simple data
  • Don't need validation

My recommendation? Start with Items from day one. The tiny bit of extra work saves massive debugging time later.


Debugging Items

Check What Fields Are Populated

def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'

    # See what's in the item
    self.logger.info(f'Item: {dict(item)}')

    # Check if field is set
    if 'price' in item:
        self.logger.info('Price is set')
    else:
        self.logger.warning('Price is missing!')

    yield item
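
You can also check what a callback yields straight from the command line with the scrapy parse command (a sketch; adjust the URL, spider name, and callback to your project):

scrapy parse 'https://example.com/products' --spider=myspider -c parse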

Validate Items in Pipelines

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Check required fields
        required = ['name', 'price', 'url']
        for field in required:
            if not adapter.get(field):
                raise DropItem(f'Missing required field: {field}')

        return item

Quick Reference

Defining Items

# items.py
import scrapy

class MyItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    field3 = scrapy.Field(serializer=str)

Using Items

# Create
item = MyItem()

# Set fields
item['field1'] = 'value'

# Get fields
value = item['field1']
value = item.get('field2', 'default')

# Check existence
if 'field1' in item:
    pass

# Convert to dict
data = dict(item)

# Yield
yield item

With Inheritance

class BaseItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class ExtendedItem(BaseItem):
    extra_field = scrapy.Field()

Summary

Items are blueprints for your scraped data. They:

  • Prevent typos with field validation
  • Make data structure crystal clear
  • Work seamlessly with pipelines
  • Enable better teamwork
  • Catch errors early

Key takeaways:

  • Define Items in items.py
  • Use scrapy.Field() for each field
  • Create new item instance for each scraped object
  • Access like dictionaries with item['field']
  • Use ItemAdapter in pipelines for flexibility

Start using Items in your next spider. Your future self (and teammates) will thank you.

Happy scraping! 🕷️
