When I first started scraping with Scrapy, I used plain dictionaries for everything:
yield {
    'name': 'Product Name',
    'price': '$29.99',
    'stock': 'In Stock'
}
It worked. My scraper ran. Data got saved. Mission accomplished, right?
Wrong.
Three weeks later, I made a typo. Instead of 'price', I accidentally typed 'pric'. My scraper kept running, but every price was landing under a stray 'pric' key instead of the field the rest of my code expected. I didn't notice for days.
That's when I learned about Scrapy Items. They would have caught that typo immediately and saved me hours of frustration.
Let me show you what Items are, why they matter, and the tricks nobody talks about.
What Are Scrapy Items?
Think of Items as blueprints for your data. Instead of throwing random dictionaries around, you define exactly what fields your data should have.
With dictionaries (the risky way):
def parse(self, response):
    yield {
        'name': response.css('h1::text').get(),
        'price': response.css('.price::text').get()
    }

    # Later in your code, you make a typo
    yield {
        'nam': response.css('h1::text').get(),  # Typo! But no error!
        'price': response.css('.price::text').get()
    }
With dictionaries, typos silently create new fields. Your data gets messed up and you might not notice.
With Items (the safe way):
# Define the structure once
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

# Use it in your spider
def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['pric'] = response.css('.price::text').get()  # ERROR! Field doesn't exist
With Items, typos cause immediate errors. You catch problems right away.
Creating Your First Item
Step 1: Define Your Item
Open your items.py file (Scrapy creates this automatically when you start a project):
# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    rating = scrapy.Field()
    stock = scrapy.Field()
That's it. You just created a blueprint. A ProductItem can only carry these fields; trying to set anything else raises an error.
Step 2: Use It in Your Spider
# spider.py
import scrapy
from myproject.items import ProductItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            item = ProductItem()
            item['name'] = product.css('h2::text').get()
            item['price'] = product.css('.price::text').get()
            item['url'] = product.css('a::attr(href)').get()
            item['rating'] = product.css('.rating::text').get()
            item['stock'] = product.css('.stock::text').get()
            yield item
Notice the pattern:
- Import your Item class
- Create an instance
- Fill in the fields
- Yield the item
Why Bother with Items? (The Real Benefits)
Benefit #1: Typo Protection
This is huge and saved me so many times.
item = ProductItem()
item['pricee'] = '29.99' # CRASH! KeyError: 'pricee'
With dictionaries, this would silently create a field called pricee. With Items, you get an error immediately.
Benefit #2: Clear Data Structure
When someone (or future you) looks at your code, they can instantly see what data you're collecting:
# items.py
class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    isbn = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()
One glance at items.py tells you everything being scraped. No need to hunt through spider code.
Benefit #3: Works Seamlessly with Pipelines
Items work perfectly with Scrapy pipelines:
# pipelines.py
class PricePipeline:
    def process_item(self, item, spider):
        # You know 'price' is a valid field because it's defined in the Item
        item['price'] = item['price'].replace('$', '')
        item['price'] = float(item['price'])
        return item
Benefit #4: Better for Teams
When multiple people work on a project, Items create a contract. Everyone knows exactly what fields exist.
Items vs Dictionaries: Side by Side
Using dictionaries:
def parse(self, response):
    yield {
        'product_name': 'Widget',
        'product_price': 29.99
    }

def parse_detail(self, response):
    yield {
        'name': 'Widget',  # Different key name! Oops.
        'price': 29.99
    }
Different parts of your code use different field names. Chaos.
Using Items:
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Consistent!
    item['price'] = 29.99
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Widget'  # Same fields everywhere
    item['price'] = 29.99
    yield item
Consistency enforced automatically.
Working with Items (The Practical Stuff)
Creating an Item
item = ProductItem()
Setting Fields
# Like a dictionary
item['name'] = 'Product Name'
item['price'] = 29.99
Getting Fields
# Like a dictionary
name = item['name']
price = item.get('price', 0.0) # With default value
Checking if a Field Exists
# Check if populated
if 'name' in item:
    print('Name is set')

# Check if declared (even if not populated)
if 'name' in item.fields:
    print('Name is a valid field')
Getting All Fields
# Get only populated fields
data = dict(item)
# Get all declared fields (even if not set)
all_fields = item.fields.keys()
Advanced: Field Metadata
Here's something most tutorials skip. Fields can have metadata that components use:
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field(serializer=str)
    last_updated = scrapy.Field(serializer=str)
The serializer tells Scrapy how to serialize this field when exporting data. You can add any metadata you want:
class ProductItem(scrapy.Item):
    name = scrapy.Field(
        required=True,
        max_length=200
    )
    price = scrapy.Field(
        required=True,
        serializer=float
    )
    description = scrapy.Field(
        required=False,
        default='No description available'
    )
Scrapy itself doesn't use this metadata (except serializer), but your pipelines can!
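For example, here's a minimal sketch of a pipeline that acts on that metadata. The required and default keys (and the pipeline name) are just the illustrative ones declared on the Item above, not something Scrapy defines:
# pipelines.py -- a sketch that consumes the custom metadata declared above
from scrapy.exceptions import DropItem

class MetadataValidationPipeline:
    def process_item(self, item, spider):
        # item.fields maps each field name to the metadata passed to scrapy.Field()
        for name, meta in item.fields.items():
            if meta.get('required') and not item.get(name):
                raise DropItem(f'Missing required field: {name}')
            if 'default' in meta and item.get(name) is None:
                item[name] = meta['default']
        return item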
Real-World Example: Building an Item-Based Spider
Let's build a complete spider that scrapes book data:
Step 1: Define the Item
# items.py
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
    scraped_at = scrapy.Field()
Step 2: Create the Spider
# spiders/books.py
import scrapy
from datetime import datetime
from myproject.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        for book in response.css('.product_pod'):
            book_url = book.css('h3 a::attr(href)').get()
            yield response.follow(book_url, callback=self.parse_book)

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        item = BookItem()
        item['title'] = response.css('h1::text').get()
        item['price'] = response.css('.price_color::text').get()
        item['rating'] = response.css('.star-rating::attr(class)').get()
        item['availability'] = response.css('.availability::text').getall()[1].strip()
        item['description'] = response.css('#product_description + p::text').get()
        item['url'] = response.url
        item['scraped_at'] = datetime.now().isoformat()

        # Author might not exist (CSS has no :contains, so use XPath)
        author = response.xpath(
            '//th[contains(text(), "Author")]/following-sibling::td/a/text()'
        ).get()
        if author:
            item['author'] = author

        yield item
Step 3: Process with Pipelines
# pipelines.py
class BookCleaningPipeline:
    def process_item(self, item, spider):
        # Clean price
        if item.get('price'):
            item['price'] = item['price'].replace('£', '')
            item['price'] = float(item['price'])

        # Extract rating number
        if item.get('rating'):
            rating_map = {
                'One': 1,
                'Two': 2,
                'Three': 3,
                'Four': 4,
                'Five': 5
            }
            rating_class = item['rating'].replace('star-rating ', '')
            item['rating'] = rating_map.get(rating_class, 0)

        return item
Step 4: Enable Pipeline in Settings
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.BookCleaningPipeline': 300,
}
Run it:
scrapy crawl books -o books.json
Advanced Patterns Nobody Talks About
Pattern #1: Partial Items (Building Across Pages)
Sometimes you scrape data from multiple pages. Build your item gradually:
def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            meta={'item': item}
        )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    item['reviews'] = len(response.css('.review'))
    yield item
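One caveat: meta is also used by Scrapy's own middlewares, so newer Scrapy versions (1.7+) offer cb_kwargs, which hands the item straight to the callback as a keyword argument. A sketch of the same pattern, using the same illustrative selectors and fields:
def parse_listing(self, response):
    for product in response.css('.product'):
        item = ProductItem()
        item['name'] = product.css('h2::text').get()
        item['price'] = product.css('.price::text').get()

        detail_url = product.css('a::attr(href)').get()
        # Each cb_kwargs entry becomes a keyword argument of the callback
        yield response.follow(
            detail_url,
            callback=self.parse_detail,
            cb_kwargs={'item': item},
        )

def parse_detail(self, response, item):
    item['description'] = response.css('.description::text').get()
    yield item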
Pattern #2: Conditional Fields
Not all items need all fields:
def parse(self, response):
    item = ProductItem()
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()

    # Only add discount if it exists
    # (remember: 'discount' and 'rating' must be declared on ProductItem)
    discount = response.css('.discount::text').get()
    if discount:
        item['discount'] = discount

    # Only add rating if it exists
    rating = response.css('.rating::text').get()
    if rating:
        item['rating'] = rating

    yield item
Pattern #3: Item Inheritance
You can extend Items for different types of products:
# items.py
class BaseProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()

class BookItem(BaseProductItem):
    author = scrapy.Field()
    isbn = scrapy.Field()
    publisher = scrapy.Field()

class ElectronicsItem(BaseProductItem):
    brand = scrapy.Field()
    model = scrapy.Field()
    warranty = scrapy.Field()
Pattern #4: Default Values
Set defaults when creating items:
def parse(self, response):
    # Keyword arguments pre-populate fields (they still have to be declared on the Item)
    item = ProductItem(
        scraped_at=datetime.now().isoformat(),
        currency='USD'
    )
    item['name'] = response.css('h1::text').get()
    item['price'] = response.css('.price::text').get()
    yield item
Common Mistakes and How to Avoid Them
Mistake #1: Reusing One Item Across the Loop
# WRONG (reuses and mutates the same item)
def parse(self, response):
    item = ProductItem()  # Created once
    for product in response.css('.product'):
        item['name'] = product.css('h2::text').get()
        yield item  # Yielding the SAME item multiple times!

# RIGHT (creates a new item for each product)
def parse(self, response):
    for product in response.css('.product'):
        item = ProductItem()  # New item each time
        item['name'] = product.css('h2::text').get()
        yield item
Mistake #2: Not Importing the Item
# WRONG
def parse(self, response):
    item = ProductItem()  # NameError!

# RIGHT
from myproject.items import ProductItem

def parse(self, response):
    item = ProductItem()  # Works!
Mistake #3: Mixing Dictionaries and Items
# WRONG (inconsistent)
def parse(self, response):
    yield {'name': 'Product 1'}  # Dictionary

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item  # Item

# RIGHT (consistent)
def parse(self, response):
    item = ProductItem()
    item['name'] = 'Product 1'
    yield item

def parse_detail(self, response):
    item = ProductItem()
    item['name'] = 'Product 2'
    yield item
Mistake #4: Not Handling Missing Fields
# RISKY (silently stores None if the author element is missing)
item['author'] = response.css('.author::text').get()

# RIGHT (handles missing data explicitly)
author = response.css('.author::text').get()
if author:
    item['author'] = author

# Or with a default
item['author'] = response.css('.author::text').get() or 'Unknown'
Items vs ItemAdapter (Modern Scrapy)
Modern Scrapy recommends using ItemAdapter in pipelines. It gives you one consistent interface whether the spider yields Items or plain dictionaries:
# pipelines.py
from itemadapter import ItemAdapter

class MyPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Works whether item is a dict or an Item object
        if adapter.get('price'):
            adapter['price'] = float(adapter['price'])
        return item
This makes your pipelines more flexible.
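ItemAdapter isn't limited to dicts and scrapy.Item, either: it also accepts dataclass and attrs objects (which Scrapy 2.2+ lets spiders yield directly), so the same pipeline keeps working if you switch item types later. A minimal sketch with an illustrative class name:
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductDataclassItem:
    # ItemAdapter treats these attributes like Item fields
    name: Optional[str] = None
    price: Optional[float] = None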
When to Use Items vs Dictionaries
Use Items when:
- Building a serious, production scraper
- Working in a team
- Need typo protection
- Want clear data structure
- Using pipelines extensively
Use dictionaries when:
- Quick, one-off scraping
- Learning Scrapy basics
- Scraping very simple data
- Don't need validation
My recommendation? Start with Items from day one. The tiny bit of extra work saves massive debugging time later.
Debugging Items
Check What Fields Are Populated
def parse(self, response):
    item = ProductItem()
    item['name'] = 'Widget'

    # See what's in the item
    self.logger.info(f'Item: {dict(item)}')

    # Check if a field is set
    if 'price' in item:
        self.logger.info('Price is set')
    else:
        self.logger.warning('Price is missing!')

    yield item
Validate Items in Pipelines
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Check required fields
        required = ['name', 'price', 'url']
        for field in required:
            if not adapter.get(field):
                raise DropItem(f'Missing required field: {field}')
        return item
Quick Reference
Defining Items
# items.py
import scrapy

class MyItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    field3 = scrapy.Field(serializer=str)
Using Items
# Create
item = MyItem()
# Set fields
item['field1'] = 'value'
# Get fields
value = item['field1']
value = item.get('field2', 'default')
# Check existence
if 'field1' in item:
    pass
# Convert to dict
data = dict(item)
# Yield
yield item
With Inheritance
class BaseItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class ExtendedItem(BaseItem):
    extra_field = scrapy.Field()
Summary
Items are blueprints for your scraped data. They:
- Prevent typos with field validation
- Make data structure crystal clear
- Work seamlessly with pipelines
- Enable better teamwork
- Catch errors early
Key takeaways:
- Define Items in items.py
- Use scrapy.Field() for each field
- Create a new item instance for each scraped object
- Access like dictionaries with item['field']
- Use ItemAdapter in pipelines for flexibility
Start using Items in your next spider. Your future self (and teammates) will thank you.
Happy scraping! 🕷️