When I first started scraping with Scrapy, I used plain dictionaries for everything:
yield {
    'name': 'Product Name',
    'price': '$29.99',
    'stock': 'In Stock'
}
It worked. My scraper ran. Data got saved. Mission accomplished, right?
Wrong.
Three weeks later, I made a typo. Instead of 'price', I accidentally typed 'pric'. My scraper kept running, but every price was landing under a stray 'pric' key instead of the field the rest of my code expected. I didn't notice for days.
That's when I learned about Scrapy Items. They would have caught that typo immediately and saved me hours of frustration.
Let me show you what Items are, why they matter, and the tricks nobody talks about.
What Are Scrapy Items?
Think of Items as blueprints for your data. Instead of throwing random dictionaries around, you define exactly what fields your data should have.
With dictionaries (the risky way):
def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get()
    }

    # Later in your code, you make a typo
    yield {
        'nam': response.css('h1::text').get(),  # Typo! But no error!
        'price': response.css('.price::text').get()
    }
With dictionaries, typos silently create new fields. Your data gets messed up and you might not notice.
With Items (the safe way):
# Define the structure once
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

# Use it in your spider
def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['pric'] = response.css('.price::text').get()  # ERROR! Field doesn't exist
With Items, typos cause immediate errors. You catch problems right away.
Creating Your First Item
Step 1: Define Your Item
Open your items.py file (Scrapy creates this automatically when you start a project):
# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    stock = scrapy.Field()
That's it. You just created a blueprint. A ProductItem can only carry these fields; trying to set anything else raises an error.
Step 2: Use It in Your Spider
# spider.py
import scrapy
from myproject.items import ProductItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()
            item['url'] = product.css('a::attr(href)').get()
            item['rating'] = product.css('.rating::text').get()
            item['stock'] = product.css('.stock::text').get()
            yield item
Notice the pattern:
- Import your Item class
- Create an instance
- Fill in the fields
- Yield the item
Why Bother with Items? (The Real Benefits)
Benefit #1: Typo Protection
This is huge and saved me so many times.
item = ProductItem()
item['pricee'] = '29.99' # CRASH! KeyError: 'pricee'
With dictionaries, this would silently create a field called pricee. With Items, you get an error immediately.
Benefit #2: Clear Data Structure
When someone (or future you) looks at your code, they can instantly see what data you're collecting:
# items.py
class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    isbn = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()
One glance at items.py tells you everything being scraped. No need to hunt through spider code.
Benefit #3: Works Seamlessly with Pipelines
Items work perfectly with Scrapy pipelines:
# pipelines.py
class PricePipeline:
    def process_item(self, item, spider):
        # You know 'price' is a valid field because it's defined in the Item
        item['price'] = item['price'].replace('$', '')
        item['price'] = float(item['price'])
        return item
Benefit #4: Better for Teams
When multiple people work on a project, Items create a contract. Everyone knows exactly what fields exist.
Items vs Dictionaries: Side by Side
Using dictionaries:
def parse(self, response):
    yield {
        'product_name': 'Widget',
        'product_price': 29.99
    }

def parse_detail(self, response):
    yield {
        'name': 'Widget',  # Different key name! Oops.
        'price': 29.99
    }
Different parts of your code use different field names. Chaos.
Using Items:
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Consistent!
    item['price'] = 29.99
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Same fields everywhere
    item['price'] = 29.99
    yield item
Consistency enforced automatically.
Working with Items (The Practical Stuff)
Creating an Item
item = ProductItem()
Setting Fields
# Like a dictionary
item['name'] = 'Product Name'
item['price'] = 29.99
Getting Fields
# Like a dictionary
name = item['name']
price = item.get('price', 0.0) # With default value
Checking if a Field Exists
# Check if populated
if 'name' in item:
    print('Name is set')

# Check if declared (even if not populated)
if 'name' in item.fields:
    print('Name is a valid field')
Getting All Fields
# Get only populated fields
data = dict(item)
# Get all declared fields (even if not set)
all_fields = item.fields.keys()
Advanced: Field Metadata
Here's something most tutorials skip. Fields can have metadata that components use:
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field(serializer=str)
    last_updated = scrapy.Field(serializer=str)
The serializer tells Scrapy how to serialize this field when exporting data. You can add any metadata you want:
class ProductItem(scrapy.Item):
    name = scrapy.Field(
        required=True,
        max_length=200
    )
    price = scrapy.Field(
        required=True,
        serializer=float
    )
    description = scrapy.Field(
        required=False,
        default='No description available'
    )
Scrapy itself doesn't use this metadata (except serializer), but your pipelines can!
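For example, here's a minimal sketch of a pipeline that acts on that metadata. The required and default keys (and the pipeline name) are just the illustrative ones declared on the Item above, not something Scrapy defines:
# pipelines.py -- a sketch that consumes the custom metadata declared above
from scrapy.exceptions import DropItem

class MetadataValidationPipeline:
    def process_item(self, item, spider):
        # item.fields maps each field name to the metadata passed to scrapy.Field()
        for name, meta in item.fields.items():
            if meta.get('required') and not item.get(name):
                raise DropItem(f'Missing required field: {name}')
            if 'default' in meta and item.get(name) is None:
                item[name] = meta['default']
        return item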
Real-World Example: Building an Item-Based Spider
Let's build a complete spider that scrapes book data:
Step 1: Define the Item
# items.py
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
    scraped_at = scrapy.Field()
Step 2: Create the Spider
# spiders/books.py
import scrapy
from datetime import datetime
from myproject.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        for book in response.css('.product_pod'):
            book_url = book.css('h3 a::attr(href)').get()
            yield response.follow(book_url, callback=self.parse_book)

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        item = BookItem()
        item['title'] = response.css('h1::text').get()
        item['price'] = response.css('.price_color::text').get()
        item['rating'] = response.css('.star-rating::attr(class)').get()
        item['availability'] = response.css('.availability::text').getall()[1].strip()
        item['description'] = response.css('#product_description + p::text').get()
        item['url'] = response.url
        item['scraped_at'] = datetime.now().isoformat()

        # Author might not exist (CSS has no :contains, so use XPath)
        author = response.xpath(
            '//th[contains(text(), "Author")]/following-sibling::td/a/text()'
        ).get()
        if author:
            item['author'] = author

        yield item
Step 3: Process with Pipelines
# pipelines.py
class BookCleaningPipeline:
    def process_item(self, item, spider):
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('£', '')
            item['price'] = float(item['price'])

        # Extract rating number
        if item.get('rating'):
            rating_map = {
                'One': 1,
                'Two': 2,
                'Three': 3,
                'Four': 4,
                'Five': 5
            }
            rating_class = item['rating'].replace('star-rating ', '')
            item['rating'] = rating_map.get(rating_class, 0)

        return item
Step 4: Enable Pipeline in Settings
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.BookCleaningPipeline': 300,
}
Run it:
scrapy crawl books -o books.json
Advanced Patterns Nobody Talks About
Pattern #1: Partial Items (Building Across Pages)
Sometimes you scrape data from multiple pages. Build your item gradually:
def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            meta={'item': item}
        )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    item['reviews'] = len(response.css('.review'))
    yield item
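One caveat: meta is also used by Scrapy's own middlewares, so newer Scrapy versions (1.7+) offer cb_kwargs, which hands the item straight to the callback as a keyword argument. A sketch of the same pattern, using the same illustrative selectors and fields:
def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        # Each cb_kwargs entry becomes a keyword argument of the callback
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            cb_kwargs={'item': item},
        )

def parse_detail(self, response, item):
    item['description'] = response.css('.description::text').get()
    yield item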
Pattern #2: Conditional Fields
Not all items need all fields:
def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()

    # Only add discount if it exists
    # (remember: 'discount' and 'rating' must be declared on ProductItem)
    discount = response.css('.discount::text').get()
    if discount:
        item['discount'] = discount

    # Only add rating if it exists
    rating = response.css('.rating::text').get()
    if rating:
        item['rating'] = rating

    yield item
Pattern #3: Item Inheritance
You can extend Items for different types of products:
# items.py
class BaseProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

class BookItem(BaseProductItem):
    author = scrapy.Field()
    isbn = scrapy.Field()
    publisher = scrapy.Field()

class ElectronicsItem(BaseProductItem):
    brand = scrapy.Field()
    model = scrapy.Field()
    warranty = scrapy.Field()
Pattern #4: Default Values
Set defaults when creating items:
def parse(self, response):
    # Keyword arguments pre-populate fields (they still have to be declared on the Item)
    item = ProductItem(
        scraped_at=datetime.now().isoformat(),
        currency='USD'
    )
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()
    yield item
Common Mistakes and How to Avoid Them
Mistake #1: Reusing One Item Across the Loop
# WRONG (reuses and mutates the same item)
def parse(self, response):
    item = ProductItem()  # Created once
    for product in response.css('.product'):
        item['name'] = product.css('h2::text').get()
        yield item  # Yielding the SAME item multiple times!

# RIGHT (creates a new item for each product)
def parse(self, response):
    for product in response.css('.product'):
        item = ProductItem()  # New item each time
        item['name'] = product.css('h2::text').get()
        yield item
Mistake #2: Not Importing the Item
# WRONG
def parse(self, response):
    item = ProductItem()  # NameError!

# RIGHT
from myproject.items import ProductItem

def parse(self, response):
    item = ProductItem()  # Works!
Mistake #3: Mixing Dictionaries and Items
# WRONG (inconsistent)
def parse(self, response):
    yield {'name': 'Product 1'}  # Dictionary

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item  # Item

# RIGHT (consistent)
def parse(self, response):
    item = ProductItem()
    item['name'] = 'Product 1'
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item
Mistake #4: Not Handling Missing Fields
# RISKY (silently stores None if the author element is missing)
item['author'] = response.css('.author::text').get()

# RIGHT (handles missing data explicitly)
author = response.css('.author::text').get()
if author:
    item['author'] = author

# Or with a default
item['author'] = response.css('.author::text').get() or 'Unknown'
Items vs ItemAdapter (Modern Scrapy)
Modern Scrapy recommends using ItemAdapter in pipelines. It gives you one consistent interface whether the spider yields Items or plain dictionaries:
# pipelines.py
from itemadapter import ItemAdapter

class MyPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Works whether item is a dict or an Item object
        if adapter.get('price'):
            adapter['price'] = float(adapter['price'])
        return item
This makes your pipelines more flexible.
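ItemAdapter isn't limited to dicts and scrapy.Item, either: it also accepts dataclass and attrs objects (which Scrapy 2.2+ lets spiders yield directly), so the same pipeline keeps working if you switch item types later. A minimal sketch with an illustrative class name:
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductDataclassItem:
    # ItemAdapter treats these attributes like Item fields
    name: Optional[str] = None
    price: Optional[float] = None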
When to Use Items vs Dictionaries
Use Items when:
- Building a serious, production scraper
- Working in a team
- Need typo protection
- Want clear data structure
- Using pipelines extensively
Use dictionaries when:
- Quick, one-off scraping
- Learning Scrapy basics
- Scraping very simple data
- Don't need validation
My recommendation? Start with Items from day one. The tiny bit of extra work saves massive debugging time later.
Debugging Items
Check What Fields Are Populated
def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'

    # See what's in the item
    self.logger.info(f'Item: {dict(item)}')

    # Check if a field is set
    if 'price' in item:
        self.logger.info('Price is set')
    else:
        self.logger.warning('Price is missing!')

    yield item
Validate Items in Pipelines
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Check required fields
        required = ['name', 'price', 'url']
        for field in required:
            if not adapter.get(field):
                raise DropItem(f'Missing required field: {field}')
        return item
Quick Reference
Defining Items
# items.py
import scrapy

class MyItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    field3 = scrapy.Field(serializer=str)
Using Items
# Create
item = MyItem()
# Set fields
item['field1'] = 'value'
# Get fields
value = item['field1']
value = item.get('field2', 'default')
# Check existence
if 'field1' in item:
    pass
# Convert to dict
data = dict(item)
# Yield
yield item
With Inheritance
class BaseItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class ExtendedItem(BaseItem):
    extra_field = scrapy.Field()
Summary
Items are blueprints for your scraped data. They:
- Prevent typos with field validation
- Make data structure crystal clear
- Work seamlessly with pipelines
- Enable better teamwork
- Catch errors early
Key takeaways:
- Define Items in items.py
- Use scrapy.Field() for each field
- Create a new item instance for each scraped object
- Access like dictionaries with item['field']
- Use ItemAdapter in pipelines for flexibility
Start using Items in your next spider. Your future self (and teammates) will thank you.
Happy scraping! 🕷️