DEV Community

Muhammad Ikramullah Khan

Scrapy Item Loaders: Stop Writing Messy Data Cleaning Code

When I first started scraping, my spiders looked like this:

def parse(self, response):
    price = response.css('.price::text').get()
    price = price.replace('$', '').replace(',', '').strip() if price else None
    price = float(price) if price else 0.0

    title = response.css('h1::text').get()
    title = title.strip().title() if title else ''

    date = response.css('.date::text').get()
    date = datetime.strptime(date.strip(), '%Y-%m-%d') if date else None

    yield {'title': title, 'price': price, 'date': date}

My parse() methods were 80% data cleaning, 20% actual scraping. It was ugly, repetitive, and hard to maintain.

Then I discovered Item Loaders. Suddenly, data cleaning moved to a separate, reusable place. My spiders became clean and focused.

Let me show you how.


What Are Item Loaders?

Item Loaders are processors that clean and transform scraped data BEFORE putting it into items.

Without Item Loaders:

# Clean everything manually in parse()
title = response.css('h1::text').get().strip().title()
price = float(response.css('.price::text').get().replace('$', ''))

With Item Loaders:

# Just load the data, cleaning happens automatically
loader = ItemLoader(item=ProductItem(), response=response)
loader.add_css('title', 'h1::text')
loader.add_css('price', '.price::text')
return loader.load_item()  # Returns cleaned item

The cleaning happens in processors you define once and reuse everywhere.


Basic Example: Before and After

Before (Manual Cleaning)

def parse(self, response):
    for product in response.css('.product'):
        # Extract and clean manually
        title = product.css('h2::text').get()
        if title:
            title = title.strip()

        price = product.css('.price::text').get()
        if price:
            price = price.replace('$', '').replace(',', '')
            price = float(price)

        yield {'title': title, 'price': price}

After (With Item Loader)

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)

def parse(self, response):
    for product in response.css('.product'):
        loader = ProductLoader(item=ProductItem(), selector=product)
        loader.add_css('title', 'h2::text')
        loader.add_css('price', '.price::text')
        yield loader.load_item()

Same result, cleaner code, reusable processors.


How Item Loaders Work

Item Loaders process data in two stages:

Stage 1: Input Processors (when data is added)

  • Clean each individual value
  • Run on each piece of data as it's added
  • Example: strip whitespace, remove special characters

Stage 2: Output Processors (when item is loaded)

  • Combine all values for a field
  • Run once when you call load_item()
  • Example: take first value, join list into string

# Input processor runs immediately
loader.add_css('title', 'h2::text')  # → '  Product Name  '
# After input processor: 'Product Name'

# Output processor runs on load_item()
titles = ['Title 1', 'Title 2']  # Multiple values
# After output processor: 'Title 1' (TakeFirst)
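To make the two stages concrete, here's a simplified pure-Python sketch of what a loader does internally (just the idea, not Scrapy's actual implementation):

```python
def input_processor(value):
    # Stage 1: clean each value the moment it is added
    return value.strip()

def output_processor(values):
    # Stage 2: combine the collected values once, like TakeFirst
    return values[0] if values else None

collected = []
for raw in ['  Product Name  ', '  Another Name  ']:
    collected.append(input_processor(raw))  # runs per add_css/add_value call

title = output_processor(collected)  # runs once, on load_item()
print(title)  # Product Name
```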

Creating Your First Item Loader

Step 1: Define Your Item

# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    image_urls = scrapy.Field()

Step 2: Create Item Loader

# loaders.py
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
import re

class ProductLoader(ItemLoader):
    # Default: take first value for all fields
    default_output_processor = TakeFirst()

    # Price: remove $, commas, convert to float
    price_in = MapCompose(lambda x: x.replace('$', '').replace(',', ''), float)

    # Title: strip whitespace, title case
    title_in = MapCompose(str.strip, str.title)

    # Description: join multiple paragraphs
    description_out = Join('\n')

Step 3: Use in Spider

# spider.py
from myproject.loaders import ProductLoader
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        for product in response.css('.product'):
            loader = ProductLoader(item=ProductItem(), selector=product)

            # Just add data, cleaning happens automatically
            loader.add_css('title', 'h2::text')
            loader.add_css('price', '.price::text')
            loader.add_css('description', '.description p::text')

            yield loader.load_item()

Built-In Processors (The Useful Ones)

TakeFirst (Most Common)

Takes the first non-null, non-empty value:

from itemloaders.processors import TakeFirst

class MyLoader(ItemLoader):
    default_output_processor = TakeFirst()

# Without TakeFirst
# title = ['Product Name', 'Another Name']

# With TakeFirst
# title = 'Product Name'

Join (Combine Strings)

Joins list into single string:

from itemloaders.processors import Join

class MyLoader(ItemLoader):
    description_out = Join('\n')  # Join with newline

# Input: ['Para 1', 'Para 2', 'Para 3']
# Output: 'Para 1\nPara 2\nPara 3'

MapCompose (Process Each Value)

Applies functions to each value:

from itemloaders.processors import MapCompose

def remove_currency(value):
    return value.replace('$', '').replace(',', '')

def to_float(value):
    return float(value)

class MyLoader(ItemLoader):
    price_in = MapCompose(remove_currency, to_float)

# Input: '$1,234.56'
# After remove_currency: '1234.56'
# After to_float: 1234.56

Compose (Chain Processors)

Like MapCompose but for the whole list:

from itemloaders.processors import Compose

def get_unique(values):
    return list(set(values))

def sort_values(values):
    return sorted(values)

class MyLoader(ItemLoader):
    tags_out = Compose(get_unique, sort_values)

# Input: ['python', 'scrapy', 'python', 'web']
# After get_unique: arbitrary order, e.g. ['web', 'python', 'scrapy']
# After sort_values: ['python', 'scrapy', 'web']
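If you want to see what these processors do without a spider, here are rough plain-Python approximations (sketches of the behavior, not the real itemloaders code — for instance, the real MapCompose also flattens lists and drops None results):

```python
def take_first(values):
    # First value that is neither None nor an empty string
    return next((v for v in values if v is not None and v != ''), None)

def join(values, separator=' '):
    return separator.join(values)

def map_compose(values, *functions):
    # Run every function over each value in turn
    result = []
    for value in values:
        for function in functions:
            value = function(value)
        result.append(value)
    return result

print(take_first(['', None, 'Title 1', 'Title 2']))         # Title 1
print(join(['Para 1', 'Para 2'], '\n'))                     # Para 1 / Para 2
print(map_compose(['  a ', ' b  '], str.strip, str.upper))  # ['A', 'B']
```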

Real-World Processors

Clean Price

def clean_price(value):
    # Remove currency symbols and commas
    value = re.sub(r'[$€£,]', '', value)
    # Extract just the number
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

class ProductLoader(ItemLoader):
    price_in = MapCompose(clean_price)

# Handles:
# '$1,234.56' → 1234.56
# 'Price: $99.99' → 99.99
# 'Out of stock' → 0.0
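Because clean_price only depends on the standard library, it's easy to sanity-check outside Scrapy (the function is repeated here so the snippet runs standalone):

```python
import re

def clean_price(value):
    # Remove currency symbols and thousands separators
    value = re.sub(r'[$€£,]', '', value)
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

assert clean_price('$1,234.56') == 1234.56
assert clean_price('Price: $99.99') == 99.99
assert clean_price('Out of stock') == 0.0
```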

Clean Date

from dateutil import parser

def parse_date(value):
    try:
        return parser.parse(value).date()
    except (ValueError, OverflowError):
        return None

class ArticleLoader(ItemLoader):
    date_in = MapCompose(str.strip, parse_date)

# Handles:
# '2024-12-25' → date(2024, 12, 25)
# 'December 25, 2024' → date(2024, 12, 25)
# '25/12/2024' → date(2024, 12, 25)

Clean Text

def clean_text(value):
    # Remove extra whitespace
    value = re.sub(r'\s+', ' ', value)
    # Remove special characters
    value = re.sub(r'[^\w\s\-.,!?]', '', value)
    return value.strip()

class ArticleLoader(ItemLoader):
    title_in = MapCompose(clean_text)
    description_in = MapCompose(clean_text)
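The same goes for clean_text — a quick standalone check (function repeated so the snippet is self-contained):

```python
import re

def clean_text(value):
    # Collapse runs of whitespace, then drop unusual characters
    value = re.sub(r'\s+', ' ', value)
    value = re.sub(r'[^\w\s\-.,!?]', '', value)
    return value.strip()

assert clean_text('  Hello,\n\n  world!  ') == 'Hello, world!'
assert clean_text('A\t\tB') == 'A B'
```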

Extract Numbers

def extract_number(value):
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

class ProductLoader(ItemLoader):
    stock_in = MapCompose(extract_number)

# '23 in stock' → 23
# 'Stock: 100' → 100
# 'Out of stock' → 0
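And a quick check of extract_number, repeated here so it runs on its own:

```python
import re

def extract_number(value):
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

assert extract_number('23 in stock') == 23
assert extract_number('Stock: 100') == 100
assert extract_number('Out of stock') == 0
```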

Advanced Techniques

Per-Field Input and Output Processors

from itemloaders.processors import TakeFirst, MapCompose, Identity

class ProductLoader(ItemLoader):
    # Default for all fields
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    # Custom for specific fields
    price_in = MapCompose(lambda x: x.replace('$', ''), float)
    price_out = TakeFirst()

    tags_in = MapCompose(str.strip, str.lower)
    tags_out = Identity()  # Keep as list, don't take first

Adding Values with XPath

loader.add_xpath('title', '//h1/text()')
loader.add_xpath('price', '//span[@class="price"]/text()')

Adding Values Directly

# Add hardcoded value
loader.add_value('scraped_at', datetime.now().isoformat())
loader.add_value('source', 'example.com')

# Add from variable
category = 'electronics'
loader.add_value('category', category)

Replacing Values

# Add initial value
loader.add_css('title', 'h1::text')  # 'Product Name'

# Replace it
loader.replace_css('title', 'h2::text')  # Now uses h2 instead

Getting Values Without Loading

loader.add_css('title', 'h1::text')
loader.add_css('price', '.price::text')

# Get value without loading item
title = loader.get_output_value('title')  # 'Product Name'

# Load item later
item = loader.load_item()

Nested Loaders (Advanced)

For complex nested data:

def parse_product(self, response):
    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1::text')

    # Create nested loader for reviews
    for review in response.css('.review'):
        review_loader = ReviewLoader(selector=review)
        review_loader.add_css('author', '.author::text')
        review_loader.add_css('rating', '.rating::text')
        review_loader.add_css('text', '.review-text::text')

        # Add nested item (define reviews_out = Identity() on ProductLoader
        # so every review is kept, not just the first)
        loader.add_value('reviews', review_loader.load_item())

    yield loader.load_item()

Common Patterns

Pattern 1: Multiple Selectors

Try multiple selectors in order:

# Try h1 first, fallback to h2
loader.add_css('title', 'h1::text')
loader.add_css('title', 'h2::text')  # Only used if h1 is empty

# Or use one add with multiple selectors
loader.add_css('title', 'h1::text, h2::text')

Pattern 2: Default Values

def default_value(default):
    def processor(values):
        return values[0] if values else default
    return processor

class ProductLoader(ItemLoader):
    stock_out = default_value(0)
    available_out = default_value(False)
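Output processors are just callables that receive the list of collected values, so this pattern is easy to verify on its own:

```python
def default_value(default):
    def processor(values):
        # Output processors receive the full list of collected values
        return values[0] if values else default
    return processor

stock_out = default_value(0)
assert stock_out([25]) == 25   # value present → use it
assert stock_out([]) == 0      # nothing scraped → fall back to default
```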

Pattern 3: Conditional Processing

def strip_sale_prefix(value):
    if value.lower().startswith('sale:'):
        return value.split(':', 1)[1].strip()
    return value

class ProductLoader(ItemLoader):
    title_in = MapCompose(str.strip, strip_sale_prefix)

Complete Real-World Example

# items.py
import scrapy

class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    original_price = scrapy.Field()
    discount = scrapy.Field()
    rating = scrapy.Field()
    reviews_count = scrapy.Field()
    availability = scrapy.Field()
    description = scrapy.Field()
    features = scrapy.Field()
    image_urls = scrapy.Field()
    scraped_at = scrapy.Field()

# loaders.py
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join, Identity
import re

def clean_price(value):
    value = re.sub(r'[$,]', '', value)
    match = re.search(r'\d+\.?\d*', value)
    return float(match.group()) if match else 0.0

def clean_rating(value):
    match = re.search(r'(\d+\.?\d*)', value)
    return float(match.group()) if match else 0.0

def extract_number(value):
    match = re.search(r'\d+', value)
    return int(match.group()) if match else 0

def clean_text(value):
    return re.sub(r'\s+', ' ', value).strip()

class ProductLoader(ItemLoader):
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()

    # Name
    name_in = MapCompose(clean_text)

    # Prices
    price_in = MapCompose(clean_price)
    original_price_in = MapCompose(clean_price)

    # Rating
    rating_in = MapCompose(clean_rating)
    reviews_count_in = MapCompose(extract_number)

    # Description (join multiple paragraphs)
    description_in = MapCompose(clean_text)
    description_out = Join('\n\n')

    # Features (keep as list)
    features_in = MapCompose(clean_text)
    features_out = Identity()  # Keep list

# spider.py
from scrapy import Spider
from myproject.items import ProductItem
from myproject.loaders import ProductLoader
from datetime import datetime

class ProductSpider(Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            loader = ProductLoader(item=ProductItem(), selector=product)

            # Basic info
            loader.add_css('name', 'h2.product-name::text')
            loader.add_css('price', 'span.price::text')
            loader.add_css('original_price', 'span.original-price::text')

            # Rating
            loader.add_css('rating', '.rating::text')
            loader.add_css('reviews_count', '.reviews-count::text')

            # Availability
            loader.add_css('availability', '.availability::text')

            # Description
            loader.add_css('description', '.description p::text')

            # Features
            loader.add_css('features', '.features li::text')

            # Images
            loader.add_css('image_urls', 'img::attr(src)')

            # Metadata
            loader.add_value('scraped_at', datetime.now().isoformat())

            yield loader.load_item()

When NOT to Use Item Loaders

Item Loaders add complexity. Sometimes simple is better.

Don't use Item Loaders when:

  • Spider is simple (few fields, no cleaning)
  • Data is already clean
  • You're prototyping quickly
  • Cleaning logic is page-specific (not reusable)

Do use Item Loaders when:

  • Multiple spiders need same cleaning
  • Complex data transformation
  • Keeping spiders clean matters
  • You want testable cleaning code

Testing Your Loaders

# test_loaders.py
from scrapy.http import HtmlResponse
from myproject.loaders import ProductLoader
from myproject.items import ProductItem

def test_price_cleaning():
    html = '<span class="price">$1,234.56</span>'
    response = HtmlResponse(url='http://example.com', body=html.encode())

    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('price', '.price::text')
    item = loader.load_item()

    assert item['price'] == 1234.56

def test_title_cleaning():
    html = '<h1>  PRODUCT   NAME  </h1>'
    response = HtmlResponse(url='http://example.com', body=html.encode())

    loader = ProductLoader(item=ProductItem(), response=response)
    loader.add_css('name', 'h1::text')
    item = loader.load_item()

    # clean_text collapses whitespace but keeps the original case
    assert item['name'] == 'PRODUCT NAME'

Summary

Item Loaders separate data extraction from data cleaning:

  • Input processors: Clean each value as it's added
  • Output processors: Combine/transform final values

Common processors:

  • TakeFirst() - Get first value
  • Join() - Combine strings
  • MapCompose() - Process each value
  • Identity() - Keep as-is

Benefits:

  • Cleaner spider code
  • Reusable cleaning logic
  • Easier testing
  • Separation of concerns

Start simple:

  • Use default processors first
  • Add custom ones as needed
  • Don't over-engineer

Item Loaders make your code cleaner and more maintainable. Use them when cleaning logic is complex or reusable.

Happy scraping! 🕷️
