When I first started using Scrapy, I ran into a frustrating problem.
I'd scrape data from one page, then follow a link to get more details. But when I got to the second page, I had no way to access the data I scraped from the first page. It was like the data just disappeared.
I know you've probably felt this frustration too.
The solution? Scrapy's meta parameter. It's like a backpack your spider carries around, keeping data safe as it jumps from page to page.
Let me show you exactly how it works.
The Problem: Losing Data Between Pages
Here's a common scenario. You're scraping a product listing site:
- Page 1 has product names and prices
- You click through to Page 2 to get the full description
- You want to combine everything into one item
But here's the problem. Each callback function only gets the response from its current page. It doesn't automatically know about data from previous pages.
Look at this broken code:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()
            # Go to detail page
            yield scrapy.Request(detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        description = response.css('.description::text').get()
        # Uh oh! Where are name and price?
        # They're lost! We scraped them on the previous page.
        yield {
            'name': ???,   # Don't have this
            'price': ???,  # Don't have this either
            'description': description
        }
See the problem? We scraped the name and price on the listing page, but when we get to parse_detail, that data is gone. We can't access it.
This is exactly where meta comes in.
The Solution: Using Meta to Carry Data
The meta parameter is a dictionary that travels with your request. Think of it as a backpack. You can put anything in it, and it'll be there when you reach the next page.
Here's the same code, but fixed with meta:
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()
            # Put data in the meta "backpack"
            yield scrapy.Request(
                detail_url,
                callback=self.parse_detail,
                meta={'name': name, 'price': price}
            )

    def parse_detail(self, response):
        # Take data out of the meta "backpack"
        name = response.meta['name']
        price = response.meta['price']
        # Scrape description from current page
        description = response.css('.description::text').get()
        # Now we have everything!
        yield {
            'name': name,
            'price': price,
            'description': description
        }
Perfect! Now name and price travel from parse to parse_detail through meta.
How Meta Actually Works
Let me break down exactly what's happening:
Step 1: Putting Data Into Meta
yield scrapy.Request(
    detail_url,
    callback=self.parse_detail,
    meta={'name': name, 'price': price}
)
When you create a request, you pass a dictionary to the meta parameter. This dictionary can contain anything you want. Strings, numbers, lists, dictionaries, whatever.
Step 2: Accessing Data From Meta
def parse_detail(self, response):
    name = response.meta['name']
    price = response.meta['price']
In the callback, you access the data through response.meta. It's just a regular Python dictionary.
That's it. Put data in when making the request. Take data out in the callback.
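One more detail worth knowing: meta is also exposed as an attribute on the Request object itself, so you can build the request first and attach data afterwards. Here's a minimal sketch of that pattern (the URL and values are placeholders, not from the example site):

def parse(self, response):
    name = response.css('h2::text').get()
    # Build the request first, then fill the "backpack" afterwards
    request = scrapy.Request('https://example.com/detail', callback=self.parse_detail)
    request.meta['name'] = name
    yield request

Both forms end up in the same place: response.meta in the callback.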
Real Example: Scraping a Book Store
Let me show you a complete, working example. We'll scrape a bookstore where:
- Category pages list books
- Book pages have full details
import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # Get all categories
        for category in response.css('.side_categories a')[1:]:
            category_name = category.css('::text').get().strip()
            category_url = category.css('::attr(href)').get()
            # Pass category name to next callback
            yield response.follow(
                category_url,
                callback=self.parse_category,
                meta={'category': category_name}
            )

    def parse_category(self, response):
        # Get category from previous page
        category = response.meta['category']
        # Get all books in this category
        for book in response.css('.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price = book.css('.price_color::text').get()
            book_url = book.css('h3 a::attr(href)').get()
            # Pass both category AND book info to next callback
            yield response.follow(
                book_url,
                callback=self.parse_book,
                meta={
                    'category': category,
                    'title': title,
                    'price': price
                }
            )

    def parse_book(self, response):
        # Get everything from meta
        category = response.meta['category']
        title = response.meta['title']
        price = response.meta['price']
        # Scrape additional details from this page
        description = response.css('#product_description + p::text').get()
        availability = response.css('.availability::text').getall()[1].strip()
        # Return complete item
        yield {
            'category': category,
            'title': title,
            'price': price,
            'description': description,
            'availability': availability,
            'url': response.url
        }
See how meta carries data through three different callbacks? Category → Books → Book Details.
Passing Complex Data
You can put anything in meta. Not just strings.
Passing Dictionaries
def parse(self, response):
    item = {
        'name': 'Product Name',
        'price': 29.99,
        'rating': 4.5
    }
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        meta={'item': item}
    )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    yield item
Passing Lists
def parse(self, response):
    images = response.css('img::attr(src)').getall()
    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'images': images}
    )

def parse_detail(self, response):
    images = response.meta['images']
    # Use the images list
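To make the receiving side concrete, here's a small sketch of that callback (the .gallery selector and field names are just examples) that merges the listing-page images with whatever the detail page adds:

def parse_detail(self, response):
    # Images collected on the listing page, carried over in meta
    images = response.meta['images']
    # Add any extra images found on the detail page
    images += response.css('.gallery img::attr(src)').getall()
    yield {
        'url': response.url,
        'images': images,
    }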
Passing Numbers, Booleans, etc.
yield scrapy.Request(
    url,
    callback=self.parse_page,
    meta={
        'page_number': 5,
        'is_premium': True,
        'score': 98.6
    }
)
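On the receiving end these values come back exactly as you sent them, so a sketch of the matching callback (field names invented for illustration) might look like this:

def parse_page(self, response):
    # Same types you put in: int, bool, float
    page_number = response.meta.get('page_number', 1)
    is_premium = response.meta.get('is_premium', False)
    score = response.meta.get('score', 0.0)
    self.logger.info(f'Page {page_number}, premium={is_premium}, score={score}')

One caveat: if you ever pause and resume a crawl with JOBDIR, queued requests (meta included) are serialized to disk, so simple, picklable values are the safest choice.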
Common Use Cases for Meta
Use Case 1: Building Items Across Multiple Pages
This is the most common use. You scrape data from multiple pages and combine it into one item.
def parse_listing(self, response):
    for product in response.css('.product'):
        item = {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }
        reviews_url = product.css('.reviews-link::attr(href)').get()
        yield scrapy.Request(
            reviews_url,
            callback=self.parse_reviews,
            meta={'item': item}
        )

def parse_reviews(self, response):
    item = response.meta['item']
    item['rating'] = response.css('.rating::text').get()
    item['review_count'] = len(response.css('.review'))
    yield item
Use Case 2: Tracking Source Information
Keep track of where data came from:
def parse(self, response):
    for link in response.css('a'):
        yield scrapy.Request(
            link.css('::attr(href)').get(),
            callback=self.parse_page,
            meta={
                'source_url': response.url,
                'source_title': response.css('title::text').get()
            }
        )

def parse_page(self, response):
    yield {
        'data': response.css('.content::text').get(),
        'scraped_from': response.meta['source_url'],
        'parent_title': response.meta['source_title']
    }
Use Case 3: Depth Tracking
Keep track of how deep you are in the crawl:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'depth': 0})

def parse(self, response):
    depth = response.meta.get('depth', 0)
    # Only follow links if not too deep
    if depth < 3:
        for link in response.css('a::attr(href)').getall():
            yield response.follow(
                link,
                callback=self.parse,
                meta={'depth': depth + 1}
            )
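Side note: Scrapy's built-in DepthMiddleware already maintains a 'depth' key in meta for every followed request, and the DEPTH_LIMIT setting can cap the crawl for you. Rolling your own counter, as above, still gives you full control, but if you'd rather lean on the built-in behaviour, a sketch (spider name and URL are placeholders) looks like this:

import scrapy

class DeepSpider(scrapy.Spider):
    name = 'deep'
    start_urls = ['https://example.com']
    # Ask Scrapy to stop following links beyond depth 3
    custom_settings = {'DEPTH_LIMIT': 3}

    def parse(self, response):
        # DepthMiddleware fills this in automatically
        depth = response.meta.get('depth', 0)
        self.logger.info(f'Parsing {response.url} at depth {depth}')
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)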
Use Case 4: Pagination with Context
When following pagination, carry the item type or category:
def parse_category(self, response):
    category_name = response.css('h1::text').get()

    # Scrape products
    for product in response.css('.product'):
        yield {
            'category': category_name,
            'name': product.css('h2::text').get()
        }

    # Follow next page, keeping category context
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_category,
            meta={'category': category_name}  # Optional but cleaner
        )
Important Things to Know About Meta
1. Meta Is Always a Dictionary
# CORRECT
meta={'key': 'value'}
# WRONG
meta='just a string' # This won't work!
2. You Can Add Multiple Keys
meta={
    'name': 'John',
    'age': 30,
    'city': 'New York',
    'items': [1, 2, 3]
}
3. Meta Is Preserved Through Redirects
If a request gets redirected, the meta data stays with it:
yield scrapy.Request(
    'http://example.com/redirect',
    callback=self.parse,
    meta={'important': 'data'}
)

# Even after redirect, meta is still there
def parse(self, response):
    data = response.meta['important']  # Works fine!
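Scrapy's redirect middleware also records the redirect chain itself in meta, under the redirect_urls key, which is handy when you want to know where a request originally pointed. A small sketch building on the callback above:

def parse(self, response):
    data = response.meta['important']
    # Filled in by the redirect middleware when a redirect happened
    redirect_chain = response.meta.get('redirect_urls', [])
    if redirect_chain:
        self.logger.info(f'{redirect_chain[0]} redirected to {response.url}')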
4. Use .get() to Avoid KeyErrors
Instead of:
name = response.meta['name'] # Crashes if 'name' doesn't exist
Use:
name = response.meta.get('name', 'Default Value') # Safe!
5. Meta Is Shallow Copied
Scrapy only ever copies meta shallowly, and if you pass the same object to several requests they all share it by reference. For simple values (strings, numbers) this is fine, but be careful with mutable objects like dicts and lists:
# This item will be shared between requests
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item}) # Same item object!
# If one callback modifies it, both see the change
To avoid this, create new objects:
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': item.copy()})
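Note that .copy() on a dictionary is itself shallow: if your item contains nested lists or dicts, those inner objects are still shared. For deeply nested items, copy.deepcopy is the safer (if slightly slower) choice. A quick sketch, with url1 and url2 standing in for real URLs as in the snippet above:

import copy

def parse(self, response):
    item = {'name': 'Product', 'specs': {'color': 'red'}}
    # Each request gets a fully independent copy, nested dicts included
    yield scrapy.Request(url1, meta={'item': copy.deepcopy(item)})
    yield scrapy.Request(url2, meta={'item': copy.deepcopy(item)})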
Special Meta Keys (Used by Scrapy)
Scrapy uses some special keys in meta for its own purposes. You can use these to control Scrapy's behavior:
dont_redirect
Prevent Scrapy from following redirects:
yield scrapy.Request(
    url,
    meta={'dont_redirect': True}
)
dont_retry
Prevent automatic retries on failure:
yield scrapy.Request(
    url,
    meta={'dont_retry': True}
)
download_timeout
Set a custom timeout for this specific request:
yield scrapy.Request(
    url,
    meta={'download_timeout': 30}  # 30 seconds
)
proxy
Use a specific proxy for this request:
yield scrapy.Request(
    url,
    meta={'proxy': 'http://proxy.example.com:8080'}
)
handle_httpstatus_list
Tell Scrapy not to treat certain status codes as errors, so responses with those codes still reach your callback:
yield scrapy.Request(
    url,
    callback=self.parse,
    meta={'handle_httpstatus_list': [404, 500]}
)
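When you let non-200 responses through like this, your callback becomes responsible for them, so it's worth branching on response.status. A sketch of what that might look like:

def parse(self, response):
    if response.status in (404, 500):
        # The page still reached our callback; log it and stop here
        self.logger.warning(f'Got {response.status} for {response.url}')
        return
    yield {'title': response.css('title::text').get(), 'url': response.url}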
Meta vs cb_kwargs (What's the Difference?)
Scrapy has another way to pass data called cb_kwargs. Here's when to use each:
Use Meta When:
- Working with Scrapy components (middlewares, extensions)
- Need data to persist through redirects
- Want to control Scrapy behavior (dont_retry, proxy, etc.)
- Working with older Scrapy code
Use cb_kwargs When:
- Just passing data to your own callback
- Want cleaner, more explicit code
- Working with newer Scrapy projects
Example with cb_kwargs:
def parse(self, response):
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        cb_kwargs={'name': 'Product', 'price': 29.99}
    )

def parse_detail(self, response, name, price):
    # name and price come as function arguments
    yield {
        'name': name,
        'price': price,
        'description': response.css('.description::text').get()
    }
With cb_kwargs, data comes as function arguments. With meta, you access it through response.meta.
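The two aren't mutually exclusive, either. A common pattern is to keep your own data in cb_kwargs and reserve meta for the Scrapy-facing keys; here's a sketch (url and the values are placeholders):

def parse(self, response):
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        cb_kwargs={'name': 'Product', 'price': 29.99},  # your own data
        meta={'download_timeout': 30}                   # Scrapy behaviour
    )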
My advice: Use meta for now while learning. It's more common in tutorials and older code. You can learn cb_kwargs later.
Common Mistakes and How to Avoid Them
Mistake 1: Forgetting to Pass Meta
# WRONG
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(detail_url, callback=self.parse_detail)
    # Forgot to pass name!

# RIGHT
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'name': name}
    )
Mistake 2: Typo in Dictionary Key
# WRONG
yield scrapy.Request(url, meta={'prodcut_name': name})  # Typo!

def parse_detail(self, response):
    name = response.meta['product_name']  # KeyError!

# RIGHT
yield scrapy.Request(url, meta={'product_name': name})

def parse_detail(self, response):
    name = response.meta['product_name']  # Works!
Mistake 3: Not Using .get() for Optional Data
# WRONG (crashes if 'optional_data' doesn't exist)
data = response.meta['optional_data']
# RIGHT (returns None if doesn't exist)
data = response.meta.get('optional_data')
# EVEN BETTER (with default value)
data = response.meta.get('optional_data', 'default_value')
Mistake 4: Modifying Shared Objects
# WRONG
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item})
# Both requests share the same item dictionary!
# RIGHT
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': {'name': 'Product'}})
Debugging Meta
When things aren't working, print the meta to see what's in it:
def parse_detail(self, response):
    # See what's in meta
    self.logger.info(f'Meta contains: {response.meta}')

    # Or check if a specific key exists
    if 'name' in response.meta:
        self.logger.info(f'Name is: {response.meta["name"]}')
    else:
        self.logger.warning('Name not found in meta!')
Complete Real-World Example
Here's a complete spider that uses meta effectively:
import scrapy

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://example-shop.com']

    def parse(self, response):
        """Scrape category pages"""
        for category in response.css('.category'):
            category_name = category.css('h2::text').get()
            category_url = category.css('a::attr(href)').get()
            yield response.follow(
                category_url,
                callback=self.parse_products,
                meta={'category': category_name, 'page': 1}
            )

    def parse_products(self, response):
        """Scrape product listings"""
        category = response.meta['category']
        page = response.meta.get('page', 1)
        self.logger.info(f'Scraping {category}, page {page}')

        for product in response.css('.product'):
            product_data = {
                'category': category,
                'name': product.css('h3::text').get(),
                'price': product.css('.price::text').get(),
                'image_url': product.css('img::attr(src)').get()
            }
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_product_detail,
                meta={'product': product_data, 'source_page': page}
            )

        # Handle pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse_products,
                meta={'category': category, 'page': page + 1}
            )

    def parse_product_detail(self, response):
        """Scrape full product details"""
        product = response.meta['product']
        source_page = response.meta['source_page']

        # Add details from this page
        product['description'] = response.css('.description::text').get()
        product['rating'] = response.css('.rating::text').get()
        product['reviews_count'] = len(response.css('.review'))
        product['in_stock'] = bool(response.css('.in-stock'))
        product['source_page'] = source_page
        product['detail_url'] = response.url

        yield product
This spider demonstrates:
- Passing data through multiple callbacks
- Tracking category and page numbers
- Building items across pages
- Using .get() for optional data
- Proper logging
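One thing the spider above doesn't handle is a detail page that fails to download: in that case the partial product sitting in meta never gets yielded. If you'd rather salvage it, you can attach an errback. Here's a rough sketch of that idea, showing only the changed lines; handle_detail_error is a made-up name, not part of the original spider:

    def parse_products(self, response):
        ...
        yield response.follow(
            detail_url,
            callback=self.parse_product_detail,
            errback=self.handle_detail_error,
            meta={'product': product_data, 'source_page': page}
        )

    def handle_detail_error(self, failure):
        # The failed Request still carries its meta "backpack"
        product = failure.request.meta['product']
        product['description'] = None
        self.logger.warning(f'Detail page failed, yielding partial item: {product["name"]}')
        yield product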
When NOT to Use Meta
Sometimes meta isn't the right choice:
Don't use meta for:
- Data that can be scraped from the current page (just scrape it directly!)
- Large binary data (images, files)
- Data that doesn't need to travel between pages
Do use meta for:
- Context from previous pages
- Tracking state (depth, source, page numbers)
- Passing partial items to be completed later
- Controlling Scrapy behavior per-request
Final Tips
Keep meta simple: Don't put huge objects in meta. Keep it lightweight.
Use descriptive keys: Instead of meta={'d': data}, use meta={'product_data': data}.
Always use .get(): Use response.meta.get('key', default) to avoid KeyErrors.
Check what's in meta: When debugging, print response.meta to see what's there.
Don't overuse meta: If you're passing 10+ keys, consider restructuring your code.
Remember meta is for data, not configuration: Use spider attributes or settings for configuration
Summary
Meta is your spider's backpack. It carries data from one page to the next.
Key takeaways:
- Use meta={'key': 'value'} when making requests
- Access data with response.meta['key'] in callbacks
- Use .get() for optional data to avoid errors
- Meta can hold any Python object
- Common uses: building items across pages, tracking context, controlling Scrapy
Start using meta in your next spider. It'll make multi-page scraping so much easier.
Happy scraping! 🕷️