When I first started using Scrapy, I ran into a frustrating problem.
I'd scrape data from one page, then follow a link to get more details. But when I got to the second page, I had no way to access the data I scraped from the first page. It was like the data just disappeared.
I know you've probably felt this frustration too.
The solution? Scrapy's meta parameter. It's like a backpack your spider carries around, keeping data safe as it jumps from page to page.
Let me show you exactly how it works.
The Problem: Losing Data Between Pages
Here's a common scenario. You're scraping a product listing site:
- Page 1 has product names and prices
- You click through to Page 2 to get the full description
- You want to combine everything into one item
But here's the problem. Each callback function only gets the response from its current page. It doesn't automatically know about data from previous pages.
Look at this broken code:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()
            # Go to detail page
            yield scrapy.Request(detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        description = response.css('.description::text').get()
        # Uh oh! Where are name and price?
        # They're lost! We scraped them on the previous page.
        yield {
            'name': ???,   # Don't have this
            'price': ???,  # Don't have this either
            'description': description
        }
See the problem? We scraped the name and price on the listing page, but when we get to parse_detail, that data is gone. We can't access it.
This is exactly where meta comes in.
The Solution: Using Meta to Carry Data
The meta parameter is a dictionary that travels with your request. Think of it as a backpack. You can put anything in it, and it'll be there when you reach the next page.
Here's the same code, but fixed with meta:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()
            # Put data in the meta "backpack"
            yield scrapy.Request(
                detail_url,
                callback=self.parse_detail,
                meta={'name': name, 'price': price}
            )

    def parse_detail(self, response):
        # Take data out of the meta "backpack"
        name = response.meta['name']
        price = response.meta['price']
        # Scrape description from current page
        description = response.css('.description::text').get()
        # Now we have everything!
        yield {
            'name': name,
            'price': price,
            'description': description
        }
Perfect! Now name and price travel from parse to parse_detail through meta.
How Meta Actually Works
Let me break down exactly what's happening:
Step 1: Putting Data Into Meta
yield scrapy.Request(
    detail_url,
    callback=self.parse_detail,
    meta={'name': name, 'price': price}
)
When you create a request, you pass a dictionary to the meta parameter. This dictionary can contain anything you want. Strings, numbers, lists, dictionaries, whatever.
Step 2: Accessing Data From Meta
def parse_detail(self, response):
    name = response.meta['name']
    price = response.meta['price']
In the callback, you access the data through response.meta. It's just a regular Python dictionary.
That's it. Put data in when making the request. Take data out in the callback.
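One more detail worth knowing: meta is also exposed as an attribute on the Request object itself, so you can build the request first and attach data afterwards. Here's a minimal sketch of that pattern (the URL and values are placeholders, not from the example site):

def parse(self, response):
    name = response.css('h2::text').get()
    # Build the request first, then fill the "backpack" afterwards
    request = scrapy.Request('https://example.com/detail', callback=self.parse_detail)
    request.meta['name'] = name
    yield request

Both forms end up in the same place: response.meta in the callback.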
Real Example: Scraping a Book Store
Let me show you a complete, working example. We'll scrape a bookstore where:
- Category pages list books
- Book pages have full details
import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # Get all categories
        for category in response.css('.side_categories a')[1:]:
            category_name = category.css('::text').get().strip()
            category_url = category.css('::attr(href)').get()
            # Pass category name to next callback
            yield response.follow(
                category_url,
                callback=self.parse_category,
                meta={'category': category_name}
            )

    def parse_category(self, response):
        # Get category from previous page
        category = response.meta['category']
        # Get all books in this category
        for book in response.css('.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price = book.css('.price_color::text').get()
            book_url = book.css('h3 a::attr(href)').get()
            # Pass both category AND book info to next callback
            yield response.follow(
                book_url,
                callback=self.parse_book,
                meta={
                    'category': category,
                    'title': title,
                    'price': price
                }
            )

    def parse_book(self, response):
        # Get everything from meta
        category = response.meta['category']
        title = response.meta['title']
        price = response.meta['price']
        # Scrape additional details from this page
        description = response.css('#product_description + p::text').get()
        availability = response.css('.availability::text').getall()[1].strip()
        # Return complete item
        yield {
            'category': category,
            'title': title,
            'price': price,
            'description': description,
            'availability': availability,
            'url': response.url
        }
See how meta carries data through three different callbacks? Category → Books → Book Details.
Passing Complex Data
You can put anything in meta. Not just strings.
Passing Dictionaries
def parse(self, response):
    item = {
        'name': 'Product Name',
        'price': 29.99,
        'rating': 4.5
    }
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        meta={'item': item}
    )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    yield item
Passing Lists
def parse(self, response):
    images = response.css('img::attr(src)').getall()
    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'images': images}
    )

def parse_detail(self, response):
    images = response.meta['images']
    # Use the images list
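To make the receiving side concrete, here's a small sketch of that callback (the .gallery selector and field names are just examples) that merges the listing-page images with whatever the detail page adds:

def parse_detail(self, response):
    # Images collected on the listing page, carried over in meta
    images = response.meta['images']
    # Add any extra images found on the detail page
    images += response.css('.gallery img::attr(src)').getall()
    yield {
        'url': response.url,
        'images': images,
    }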
Passing Numbers, Booleans, etc.
yield scrapy.Request(
    url,
    callback=self.parse_page,
    meta={
        'page_number': 5,
        'is_premium': True,
        'score': 98.6
    }
)
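On the receiving end these values come back exactly as you sent them, so a sketch of the matching callback (field names invented for illustration) might look like this:

def parse_page(self, response):
    # Same types you put in: int, bool, float
    page_number = response.meta.get('page_number', 1)
    is_premium = response.meta.get('is_premium', False)
    score = response.meta.get('score', 0.0)
    self.logger.info(f'Page {page_number}, premium={is_premium}, score={score}')

One caveat: if you ever pause and resume a crawl with JOBDIR, queued requests (meta included) are serialized to disk, so simple, picklable values are the safest choice.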
Common Use Cases for Meta
Use Case 1: Building Items Across Multiple Pages
This is the most common use. You scrape data from multiple pages and combine it into one item.
def parse_listing(self, response):
    for product in response.css('.product'):
        item = {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
        reviews_url = product.css('.reviews-link::attr(href)').get()
        yield scrapy.Request(
            reviews_url,
            callback=self.parse_reviews,
            meta={'item': item}
        )

def parse_reviews(self, response):
    item = response.meta['item']
    item['rating'] = response.css('.rating::text').get()
    item['review_count'] = len(response.css('.review'))
    yield item
Use Case 2: Tracking Source Information
Keep track of where data came from:
def parse(self, response):
    for link in response.css('a'):
        yield scrapy.Request(
            link.css('::attr(href)').get(),
            callback=self.parse_page,
            meta={
                'source_url': response.url,
                'source_title': response.css('title::text').get()
            }
        )

def parse_page(self, response):
    yield {
        'data': response.css('.content::text').get(),
        'scraped_from': response.meta['source_url'],
        'parent_title': response.meta['source_title']
    }
Use Case 3: Depth Tracking
Keep track of how deep you are in the crawl:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'depth': 0})

def parse(self, response):
    depth = response.meta.get('depth', 0)
    # Only follow links if not too deep
    if depth < 3:
        for link in response.css('a::attr(href)').getall():
            yield response.follow(
                link,
                callback=self.parse,
                meta={'depth': depth + 1}
            )
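Side note: Scrapy's built-in DepthMiddleware already maintains a 'depth' key in meta for every followed request, and the DEPTH_LIMIT setting can cap the crawl for you. Rolling your own counter, as above, still gives you full control, but if you'd rather lean on the built-in behaviour, a sketch (spider name and URL are placeholders) looks like this:

import scrapy

class DeepSpider(scrapy.Spider):
    name = 'deep'
    start_urls = ['https://example.com']
    # Ask Scrapy to stop following links beyond depth 3
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        # DepthMiddleware fills this in automatically
        depth = response.meta.get('depth', 0)
        self.logger.info(f'Parsing {response.url} at depth {depth}')
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)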
Use Case 4: Pagination with Context
When following pagination, carry the item type or category:
def parse_category(self, response):
    category_name = response.css('h1::text').get()

    # Scrape products
    for product in response.css('.product'):
        yield {
            'category': category_name,
            'name': product.css('h2::text').get()
        }

    # Follow next page, keeping category context
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_category,
            meta={'category': category_name}  # Optional but cleaner
        )
Important Things to Know About Meta
1. Meta Is Always a Dictionary
# CORRECT
meta={'key': 'value'}
# WRONG
meta='just a string' # This won't work!
2. You Can Add Multiple Keys
meta={
    'name': 'John',
    'age': 30,
    'city': 'New York',
    'items': [1, 2, 3]
}
3. Meta Is Preserved Through Redirects
If a request gets redirected, the meta data stays with it:
yield scrapy.Request(
    'http://example.com/redirect',
    callback=self.parse,
    meta={'important': 'data'}
)

# Even after redirect, meta is still there
def parse(self, response):
    data = response.meta['important']  # Works fine!
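Scrapy's redirect middleware also records the redirect chain itself in meta, under the redirect_urls key, which is handy when you want to know where a request originally pointed. A small sketch building on the callback above:

def parse(self, response):
    data = response.meta['important']
    # Filled in by the redirect middleware when a redirect happened
    redirect_chain = response.meta.get('redirect_urls', [])
    if redirect_chain:
        self.logger.info(f'{redirect_chain[0]} redirected to {response.url}')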
4. Use .get() to Avoid KeyErrors
Instead of:
name = response.meta['name'] # Crashes if 'name' doesn't exist
Use:
name = response.meta.get('name', 'Default Value') # Safe!
5. Meta Is Shallow Copied
Scrapy only ever copies meta shallowly, and if you pass the same object to several requests they all share it by reference. For simple values (strings, numbers) this is fine, but be careful with mutable objects like dicts and lists:
# This item will be shared between requests
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item}) # Same item object!
# If one callback modifies it, both see the change
To avoid this, create new objects:
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': item.copy()})
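Note that .copy() on a dictionary is itself shallow: if your item contains nested lists or dicts, those inner objects are still shared. For deeply nested items, copy.deepcopy is the safer (if slightly slower) choice. A quick sketch, with url1 and url2 standing in for real URLs as in the snippet above:

import copy

def parse(self, response):
    item = {'name': 'Product', 'specs': {'color': 'red'}}
    # Each request gets a fully independent copy, nested dicts included
    yield scrapy.Request(url1, meta={'item': copy.deepcopy(item)})
    yield scrapy.Request(url2, meta={'item': copy.deepcopy(item)})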
Special Meta Keys (Used by Scrapy)
Scrapy uses some special keys in meta for its own purposes. You can use these to control Scrapy's behavior:
dont_redirect
Prevent Scrapy from following redirects:
yield scrapy.Request(
    url,
    meta={'dont_redirect': True}
)
dont_retry
Prevent automatic retries on failure:
yield scrapy.Request(
    url,
    meta={'dont_retry': True}
)
download_timeout
Set a custom timeout for this specific request:
yield scrapy.Request(
    url,
    meta={'download_timeout': 30}  # 30 seconds
)
proxy
Use a specific proxy for this request:
yield scrapy.Request(
    url,
    meta={'proxy': 'http://proxy.example.com:8080'}
)
handle_httpstatus_list
Tell Scrapy not to treat certain status codes as errors, so responses with those codes still reach your callback:
yield scrapy.Request(
    url,
    callback=self.parse,
    meta={'handle_httpstatus_list': [404, 500]}
)
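When you let non-200 responses through like this, your callback becomes responsible for them, so it's worth branching on response.status. A sketch of what that might look like:

def parse(self, response):
    if response.status in (404, 500):
        # The page still reached our callback; log it and stop here
        self.logger.warning(f'Got {response.status} for {response.url}')
        return
    yield {'title': response.css('title::text').get(), 'url': response.url}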
Meta vs cb_kwargs (What's the Difference?)
Scrapy has another way to pass data called cb_kwargs. Here's when to use each:
Use Meta When:
- Working with Scrapy components (middlewares, extensions)
- Need data to persist through redirects
- Want to control Scrapy behavior (dont_retry, proxy, etc.)
- Working with older Scrapy code
Use cb_kwargs When:
- Just passing data to your own callback
- Want cleaner, more explicit code
- Working with newer Scrapy projects
Example with cb_kwargs:
def parse(self, response):
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        cb_kwargs={'name': 'Product', 'price': 29.99}
    )

def parse_detail(self, response, name, price):
    # name and price come as function arguments
    yield {
        'name': name,
        'price': price,
        'description': response.css('.description::text').get()
    }
With cb_kwargs, data comes as function arguments. With meta, you access it through response.meta.
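The two aren't mutually exclusive, either. A common pattern is to keep your own data in cb_kwargs and reserve meta for the Scrapy-facing keys; here's a sketch (url and the values are placeholders):

def parse(self, response):
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        cb_kwargs={'name': 'Product', 'price': 29.99},  # your own data
        meta={'download_timeout': 30}                   # Scrapy behaviour
    )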
My advice: Use meta for now while learning. It's more common in tutorials and older code. You can learn cb_kwargs later.
Common Mistakes and How to Avoid Them
Mistake 1: Forgetting to Pass Meta
# WRONG
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(detail_url, callback=self.parse_detail)
    # Forgot to pass name!

# RIGHT
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'name': name}
    )
Mistake 2: Typo in Dictionary Key
# WRONG
yield scrapy.Request(url, meta={'prodcut_name': name})  # Typo!

def parse_detail(self, response):
    name = response.meta['product_name']  # KeyError!

# RIGHT
yield scrapy.Request(url, meta={'product_name': name})

def parse_detail(self, response):
    name = response.meta['product_name']  # Works!
Mistake 3: Not Using .get() for Optional Data
# WRONG (crashes if 'optional_data' doesn't exist)
data = response.meta['optional_data']
# RIGHT (returns None if doesn't exist)
data = response.meta.get('optional_data')
# EVEN BETTER (with default value)
data = response.meta.get('optional_data', 'default_value')
Mistake 4: Modifying Shared Objects
# WRONG
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item})
# Both requests share the same item dictionary!
# RIGHT
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': {'name': 'Product'}})
Debugging Meta
When things aren't working, print the meta to see what's in it:
def parse_detail(self, response):
    # See what's in meta
    self.logger.info(f'Meta contains: {response.meta}')

    # Or check if a specific key exists
    if 'name' in response.meta:
        self.logger.info(f'Name is: {response.meta["name"]}')
    else:
        self.logger.warning('Name not found in meta!')
Complete Real-World Example
Here's a complete spider that uses meta effectively:
import scrapy

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://example-shop.com']

    def parse(self, response):
        """Scrape category pages"""
        for category in response.css('.category'):
            category_name = category.css('h2::text').get()
            category_url = category.css('a::attr(href)').get()
            yield response.follow(
                category_url,
                callback=self.parse_products,
                meta={'category': category_name, 'page': 1}
            )

    def parse_products(self, response):
        """Scrape product listings"""
        category = response.meta['category']
        page = response.meta.get('page', 1)
        self.logger.info(f'Scraping {category}, page {page}')

        for product in response.css('.product'):
            product_data = {
                'category': category,
                'name': product.css('h3::text').get(),
                'price': product.css('.price::text').get(),
                'image_url': product.css('img::attr(src)').get()
            }
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_product_detail,
                meta={'product': product_data, 'source_page': page}
            )

        # Handle pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse_products,
                meta={'category': category, 'page': page + 1}
            )

    def parse_product_detail(self, response):
        """Scrape full product details"""
        product = response.meta['product']
        source_page = response.meta['source_page']

        # Add details from this page
        product['description'] = response.css('.description::text').get()
        product['rating'] = response.css('.rating::text').get()
        product['reviews_count'] = len(response.css('.review'))
        product['in_stock'] = bool(response.css('.in-stock'))
        product['source_page'] = source_page
        product['detail_url'] = response.url

        yield product
This spider demonstrates:
- Passing data through multiple callbacks
- Tracking category and page numbers
- Building items across pages
- Using .get() for optional data
- Proper logging
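One thing the spider above doesn't handle is a detail page that fails to download: in that case the partial product sitting in meta never gets yielded. If you'd rather salvage it, you can attach an errback. Here's a rough sketch of that idea, showing only the changed lines; handle_detail_error is a made-up name, not part of the original spider:

    def parse_products(self, response):
        ...
        yield response.follow(
            detail_url,
            callback=self.parse_product_detail,
            errback=self.handle_detail_error,
            meta={'product': product_data, 'source_page': page}
        )

    def handle_detail_error(self, failure):
        # The failed Request still carries its meta "backpack"
        product = failure.request.meta['product']
        product['description'] = None
        self.logger.warning(f'Detail page failed, yielding partial item: {product["name"]}')
        yield product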
When NOT to Use Meta
Sometimes meta isn't the right choice:
Don't use meta for:
- Data that can be scraped from the current page (just scrape it directly!)
- Large binary data (images, files)
- Data that doesn't need to travel between pages
Do use meta for:
- Context from previous pages
- Tracking state (depth, source, page numbers)
- Passing partial items to be completed later
- Controlling Scrapy behavior per-request
Final Tips
Keep meta simple: Don't put huge objects in meta. Keep it lightweight.
Use descriptive keys: Instead of meta={'d': data}, use meta={'product_data': data}.
Always use .get(): Use response.meta.get('key', default) to avoid KeyErrors.
Check what's in meta: When debugging, print response.meta to see what's there.
Don't overuse meta: If you're passing 10+ keys, consider restructuring your code.
Remember meta is for data, not configuration: Use spider attributes or settings for configuration
Summary
Meta is your spider's backpack. It carries data from one page to the next.
Key takeaways:
- Use meta={'key': 'value'} when making requests
- Access data with response.meta['key'] in callbacks
- Use .get() for optional data to avoid errors
- Meta can hold any Python object
- Common uses: building items across pages, tracking context, controlling Scrapy
Start using meta in your next spider. It'll make multi-page scraping so much easier.
Happy scraping! 🕷️