When I first started scraping, my spiders looked like this:
def parse(self, response):
    price = response.css('.price::text').get() or ''
    price = price.replace('$', '').replace(',', '').strip()
    price = float(price) if price else 0.0

    title = response.css('h1::text').get()
    title = title.strip().title() if title else ''

    date = response.css('.date::text').get()
    date = datetime.strptime(date.strip(), '%Y-%m-%d') if date else None

    yield {'title': title, 'price': price, 'date': date}
My parse() methods were 80% data cleaning, 20% actual scraping. It was ugly, repetitive, and hard to maintain.
Then I discovered Item Loaders. Suddenly, data cleaning moved to a separate, reusable place. My spiders became clean and focused.
Let me show you how.
What Are Item Loaders?
Item Loaders are processors that clean and transform scraped data BEFORE putting it into items.
Without Item Loaders:
# Clean everything manually in parse()
title = response.css('h1::text').get().strip().title()
price = float(response.css('.price::text').get().replace('$', ''))
With Item Loaders:
# Just load the data, cleaning happens automatically
loader = ItemLoader(item=ProductItem(), response=response)
loader.add_css('title', 'h1::text')
loader.add_css('price', '.price::text')
return loader.load_item() # Returns cleaned item
The cleaning happens in processors you define once and reuse everywhere.
Basic Example: Before and After
Before (Manual Cleaning)
def parse(self, response):
    for product in response.css('.product'):
        # Extract and clean manually
        title = product.css('h2::text').get()
        if title:
            title = title.strip()

        price = product.css('.price::text').get()
        if price:
            price = price.replace('$', '').replace(',', '')
            price = float(price)

        yield {'title': title, 'price': price}
After (With Item Loader)
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)

def parse(self, response):
    for product in response.css('.product'):
        loader = ProductLoader(item=ProductItem(), selector=product)
        loader.add_css('title', 'h2::text')
        loader.add_css('price', '.price::text')
        yield loader.load_item()
Same result, cleaner code, reusable processors.
How Item Loaders Work
Item Loaders process data in two stages:
Stage 1: Input Processors (when data is added)
- Clean each individual value
- Run on each piece of data as it's added
- Example: strip whitespace, remove special characters
Stage 2: Output Processors (when item is loaded)
- Combine all values for a field
- Run once when you call load_item()
- Example: take first value, join a list into a string
# Input processor runs immediately
loader.add_css('title', 'h2::text')  # raw value: ' Product Name '
# After input processor: 'Product Name'

# Output processor runs on load_item()
titles = ['Title 1', 'Title 2']  # multiple values collected
# After output processor: 'Title 1' (TakeFirst)
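To watch both stages fire, here's a minimal runnable sketch. DemoLoader and the sample strings are invented for illustration, and it uses add_value() so no response is needed:

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class DemoLoader(ItemLoader):
    # Input processor: runs on each value as it is added
    title_in = MapCompose(str.strip)
    # Output processor: runs once, on the collected list, at load_item()
    title_out = TakeFirst()

loader = DemoLoader(item={})
loader.add_value('title', '  Product Name  ')  # input processor strips it now
loader.add_value('title', 'Alternate Name')    # a second value is collected

item = loader.load_item()  # output processor picks the first value
print(item['title'])       # 'Product Name'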
Creating Your First Item Loader
Step 1: Define Your Item
# items.py
import scrapy
class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()
Step 2: Create Item Loader
# loaders.py
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
import re
class ProductLoader(ItemLoader):
    # Default: take first value for all fields
    default_output_processor = TakeFirst()

    # Price: remove $ and commas, convert to float
    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)

    # Title: strip whitespace, title case
    title_in = MapCompose(str.strip, str.title)

    # Description: join multiple paragraphs
    description_out = Join('\n')
Step 3: Use in Spider
# spider.py
import scrapy

from myproject.loaders import ProductLoader
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        for product in response.css('.product'):
            loader = ProductLoader(item=ProductItem(), selector=product)

            # Just add data, cleaning happens automatically
            loader.add_css('title', 'h2::text')
            loader.add_css('price', '.price::text')
            loader.add_css('description', '.description p::text')

            yield loader.load_item()
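With a standard Scrapy project layout, you can run this and export the cleaned items with `scrapy crawl products -O products.json` (`-O` overwrites the output file; lowercase `-o` appends).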
Built-In Processors (The Useful Ones)
TakeFirst (Most Common)
Takes the first non-null value:
from itemloaders.processors import TakeFirst
class MyLoader(ItemLoader):
    default_output_processor = TakeFirst()
# Without TakeFirst
# title = ['Product Name', 'Another Name']
# With TakeFirst
# title = 'Product Name'
Join (Combine Strings)
Joins list into single string:
from itemloaders.processors import Join
class MyLoader(ItemLoader):
    description_out = Join('\n')  # Join with newline
# Input: ['Para 1', 'Para 2', 'Para 3']
# Output: 'Para 1\nPara 2\nPara 3'
MapCompose (Process Each Value)
Applies functions to each value:
from itemloaders.processors import MapCompose
def remove_currency(value):
    return value.replace('$', '').replace('€', '').replace(',', '')

def to_float(value):
    return float(value)

class MyLoader(ItemLoader):
    price_in = MapCompose(remove_currency, to_float)

# Input: '$1,234.56'
# After remove_currency: '1234.56'
# After to_float: 1234.56
Compose (Chain Processors)
Like MapCompose, but the functions receive the whole list of values at once instead of one value at a time:
from itemloaders.processors import Compose
def get_unique(values):
    return list(set(values))

def sort_values(values):
    return sorted(values)

class MyLoader(ItemLoader):
    tags_out = Compose(get_unique, sort_values)

# Input: ['python', 'scrapy', 'python', 'web']
# After get_unique: ['python', 'scrapy', 'web'] (order not guaranteed)
# After sort_values: ['python', 'scrapy', 'web']
Real-World Processors
Clean Price
import re

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose

def clean_price(value):
    # Remove currency symbols and thousands separators
    value = re.sub(r'[$€£,]', '', value)
    # Extract just the number
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

class ProductLoader(ItemLoader):
    price_in = MapCompose(clean_price)

# Handles:
# '$1,234.56'     → 1234.56
# '£1,234.56'     → 1234.56
# 'Price: $99.99' → 99.99
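Note that this assumes US/UK-style formatting, where the comma is a thousands separator. If you also need European decimal-comma prices like '€1.234,56', here's one hedged approach (clean_price_eu is a heuristic sketch, not an exhaustive parser):

import re

def clean_price_eu(value):
    # Heuristic: if the last separator is a comma, treat it as the
    # decimal point (European style); otherwise assume US/UK style.
    value = re.sub(r'[$€£\s]', '', value)
    if ',' in value and value.rfind(',') > value.rfind('.'):
        value = value.replace('.', '').replace(',', '.')  # '1.234,56' → '1234.56'
    else:
        value = value.replace(',', '')                    # '1,234.56' → '1234.56'
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

print(clean_price_eu('€1.234,56'))  # 1234.56
print(clean_price_eu('$1,234.56'))  # 1234.56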
Clean Date
from dateutil import parser

def parse_date(value):
    try:
        return parser.parse(value).date()
    except (ValueError, OverflowError):
        return None

class ArticleLoader(ItemLoader):
    date_in = MapCompose(str.strip, parse_date)

# Handles:
# '2024-12-25'        → date(2024, 12, 25)
# 'December 25, 2024' → date(2024, 12, 25)
# '25/12/2024'        → date(2024, 12, 25)
Clean Text
def clean_text(value):
    # Collapse runs of whitespace
    value = re.sub(r'\s+', ' ', value)
    # Remove special characters
    value = re.sub(r'[^\w\s\-.,!?]', '', value)
    return value.strip()

class ArticleLoader(ItemLoader):
    title_in = MapCompose(clean_text)
    description_in = MapCompose(clean_text)
Extract Numbers
def extract_number(value):
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

class ProductLoader(ItemLoader):
    stock_in = MapCompose(extract_number)

# '23 in stock'  → 23
# 'Stock: 100'   → 100
# 'Out of stock' → 0
Advanced Techniques
Per-Field Input and Output Processors
from itemloaders.processors import Identity, MapCompose, TakeFirst

class ProductLoader(ItemLoader):
    # Default for all fields
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    # Custom for specific fields
    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)
    price_out = TakeFirst()

    tags_in = MapCompose(str.strip, str.lower)
    tags_out = Identity()  # Keep as list, don't take first
Adding Values with XPath
loader.add_xpath('title', '//h1/text()')
loader.add_xpath('price', '//span[@class="price"]/text()')
Adding Values Directly
from datetime import datetime

# Add hardcoded values
loader.add_value('scraped_at', datetime.now().isoformat())
loader.add_value('source', 'example.com')

# Add from a variable
category = 'electronics'
loader.add_value('category', category)
Replacing Values
# Add initial value
loader.add_css('title', 'h1::text') # 'Product Name'
# Replace it
loader.replace_css('title', 'h2::text') # Now uses h2 instead
Getting Values Without Loading
loader.add_css('title', 'h1::text')
loader.add_css('price', '.price::text')
# Get value without loading item
title = loader.get_output_value('title') # 'Product Name'
# Load item later
item = loader.load_item()
Nested Loaders (Advanced)
For complex nested data:
def parse_product(self, response):
    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1::text')

    # Create a nested loader for each review
    for review in response.css('.review'):
        review_loader = ReviewLoader(selector=review)
        review_loader.add_css('author', '.author::text')
        review_loader.add_css('rating', '.rating::text')
        review_loader.add_css('text', '.review-text::text')

        # Add nested item
        loader.add_value('reviews', review_loader.load_item())

    yield loader.load_item()
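ReviewLoader isn't defined above; a minimal sketch might look like the following (ReviewItem and its field names are assumptions for this example). Note that the reviews field on ProductLoader needs an Identity() output processor, otherwise a TakeFirst() default would discard all but the first review:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity, MapCompose, TakeFirst

class ReviewItem(scrapy.Item):  # hypothetical nested item
    author = scrapy.Field()
    rating = scrapy.Field()
    text = scrapy.Field()

class ReviewLoader(ItemLoader):
    default_item_class = ReviewItem
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

# On ProductLoader, keep every collected review:
#     reviews_out = Identity()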
Common Patterns
Pattern 1: Multiple Selectors
Try multiple selectors in order:
# Try h1 first, fallback to h2
loader.add_css('title', 'h1::text')
loader.add_css('title', 'h2::text') # Only used if h1 is empty
# Or use one add with multiple selectors
loader.add_css('title', 'h1::text, h2::text')
Pattern 2: Default Values
def default_value(default):
    def processor(values):
        return values[0] if values else default
    return processor

class ProductLoader(ItemLoader):
    stock_out = default_value(0)
    available_out = default_value(False)

One caveat: output processors only run for fields that collected at least one value, so if the selector matched nothing, the default never fires. Pair this with a raw add_value() fallback when a field may be missing from the page.
Pattern 3: Conditional Processing
def process_if_condition(value):
    if 'sale' in value.lower():
        return value.replace('SALE:', '').strip()
    return value

class ProductLoader(ItemLoader):
    title_in = MapCompose(str.strip, process_if_condition)
Complete Real-World Example
# items.py
import scrapy
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    original_price = scrapy.Field()
    discount = scrapy.Field()
    rating = scrapy.Field()
    reviews_count = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    features = scrapy.Field()
    image_urls = scrapy.Field()
    scraped_at = scrapy.Field()
# loaders.py
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join, Identity
import re

def clean_price(value):
    value = re.sub(r'[$,]', '', value)
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

def clean_rating(value):
    match = re.search(r'(\d+\.?\d*)', value)
    return float(match.group()) if match else 0.0

def extract_number(value):
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

def clean_text(value):
    return re.sub(r'\s+', ' ', value).strip()

class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    # Name
    name_in = MapCompose(clean_text)

    # Prices
    price_in = MapCompose(clean_price)
    original_price_in = MapCompose(clean_price)

    # Rating
    rating_in = MapCompose(clean_rating)
    reviews_count_in = MapCompose(extract_number)

    # Description (join multiple paragraphs)
    description_in = MapCompose(clean_text)
    description_out = Join('\n\n')

    # Features (keep as list)
    features_in = MapCompose(clean_text)
    features_out = Identity()

    # Images (keep as list, e.g. for the images pipeline)
    image_urls_out = Identity()
# spider.py
from datetime import datetime

from scrapy import Spider

from myproject.items import ProductItem
from myproject.loaders import ProductLoader

class ProductSpider(Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            loader = ProductLoader(item=ProductItem(), selector=product)

            # Basic info
            loader.add_css('name', 'h2.product-name::text')
            loader.add_css('price', 'span.price::text')
            loader.add_css('original_price', 'span.original-price::text')

            # Rating
            loader.add_css('rating', '.rating::text')
            loader.add_css('reviews_count', '.reviews-count::text')

            # Availability
            loader.add_css('availability', '.availability::text')

            # Description
            loader.add_css('description', '.description p::text')

            # Features
            loader.add_css('features', '.features li::text')

            # Images
            loader.add_css('image_urls', 'img::attr(src)')

            # Metadata
            loader.add_value('scraped_at', datetime.now().isoformat())

            yield loader.load_item()
When NOT to Use Item Loaders
Item Loaders add complexity. Sometimes simple is better.
Don't use Item Loaders when:
- Spider is simple (few fields, no cleaning)
- Data is already clean
- You're prototyping quickly
- Cleaning logic is page-specific (not reusable)
Do use Item Loaders when:
- Multiple spiders need same cleaning
- Complex data transformation
- Keeping spiders clean matters
- You want testable cleaning code
Testing Your Loaders
# test_loaders.py
from scrapy.http import HtmlResponse

from myproject.items import ProductItem
from myproject.loaders import ProductLoader

def test_price_cleaning():
    html = '<span class="price">$1,234.56</span>'
    response = HtmlResponse(url='http://example.com', body=html.encode())
    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('price', '.price::text')
    item = loader.load_item()
    assert item['price'] == 1234.56

def test_name_cleaning():
    html = '<h1>  Product   Name </h1>'
    response = HtmlResponse(url='http://example.com', body=html.encode())
    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1::text')
    item = loader.load_item()
    assert item['name'] == 'Product Name'  # clean_text collapsed the whitespace
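Because processors are plain functions, you can also unit-test them directly, without building a response at all. A lighter-weight sketch, assuming the clean_price and clean_text functions from loaders.py above:

from myproject.loaders import clean_price, clean_text

def test_clean_price_direct():
    assert clean_price('$1,234.56') == 1234.56
    assert clean_price('Price: $99.99') == 99.99
    assert clean_price('no digits here') == 0.0

def test_clean_text_direct():
    assert clean_text('  too   much\n whitespace ') == 'too much whitespace'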
Summary
Item Loaders separate data extraction from data cleaning:
- Input processors: Clean each value as it's added
- Output processors: Combine/transform final values
Common processors:
- TakeFirst() - Get the first value
- Join() - Combine strings
- MapCompose() - Process each value
- Identity() - Keep values as-is
Benefits:
- Cleaner spider code
- Reusable cleaning logic
- Easier testing
- Separation of concerns
Start simple:
- Use default processors first
- Add custom ones as needed
- Don't over-engineer
Item Loaders make your code cleaner and more maintainable. Use them when cleaning logic is complex or reusable.
Happy scraping! 🕷️