Muhammad Ikramullah Khan

Scrapy Meta: The Complete Beginner's Guide to Passing Data Between Callbacks

When I first started using Scrapy, I ran into a frustrating problem.

I'd scrape data from one page, then follow a link to get more details. But when I got to the second page, I had no way to access the data I scraped from the first page. It was like the data just disappeared.

I know you've probably felt this frustration too.

The solution? Scrapy's meta parameter. It's like a backpack your spider carries around, keeping data safe as it jumps from page to page.

Let me show you exactly how it works.


The Problem: Losing Data Between Pages

Here's a common scenario. You're scraping a product listing site:

  1. Page 1 has product names and prices
  2. You click through to Page 2 to get the full description
  3. You want to combine everything into one item

But here's the problem. Each callback function only gets the response from its current page. It doesn't automatically know about data from previous pages.

Look at this broken code:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()

            # Go to detail page
            yield scrapy.Request(detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        description = response.css('.description::text').get()

        # Uh oh! Where are name and price?
        # They're lost! We scraped them on the previous page.
        yield {
            'name': None,   # ??? Don't have this
            'price': None,  # ??? Don't have this either
            'description': description
        }

See the problem? We scraped the name and price on the listing page, but when we get to parse_detail, that data is gone. We can't access it.

This is exactly where meta comes in.


The Solution: Using Meta to Carry Data

The meta parameter is a dictionary that travels with your request. Think of it as a backpack. You can put anything in it, and it'll be there when you reach the next page.

Here's the same code, but fixed with meta:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            name = product.css('h2::text').get()
            price = product.css('.price::text').get()
            detail_url = product.css('a::attr(href)').get()

            # Put data in the meta "backpack"
            yield scrapy.Request(
                detail_url,
                callback=self.parse_detail,
                meta={'name': name, 'price': price}
            )

    def parse_detail(self, response):
        # Take data out of the meta "backpack"
        name = response.meta['name']
        price = response.meta['price']

        # Scrape description from current page
        description = response.css('.description::text').get()

        # Now we have everything!
        yield {
            'name': name,
            'price': price,
            'description': description
        }

Perfect! Now name and price travel from parse to parse_detail through meta.


How Meta Actually Works

Let me break down exactly what's happening:

Step 1: Putting Data Into Meta

yield scrapy.Request(
    detail_url,
    callback=self.parse_detail,
    meta={'name': name, 'price': price}
)

When you create a request, you pass a dictionary to the meta parameter. This dictionary can contain anything you want. Strings, numbers, lists, dictionaries, whatever.

Step 2: Accessing Data From Meta

def parse_detail(self, response):
    name = response.meta['name']
    price = response.meta['price']

In the callback, you access the data through response.meta. It's just a regular Python dictionary.

That's it. Put data in when making the request. Take data out in the callback.


Real Example: Scraping a Book Store

Let me show you a complete, working example. We'll scrape a bookstore where:

  1. Category pages list books
  2. Book pages have full details

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # Get all categories
        for category in response.css('.side_categories a')[1:]:
            category_name = category.css('::text').get().strip()
            category_url = category.css('::attr(href)').get()

            # Pass category name to next callback
            yield response.follow(
                category_url,
                callback=self.parse_category,
                meta={'category': category_name}
            )

    def parse_category(self, response):
        # Get category from previous page
        category = response.meta['category']

        # Get all books in this category
        for book in response.css('.product_pod'):
            title = book.css('h3 a::attr(title)').get()
            price = book.css('.price_color::text').get()
            book_url = book.css('h3 a::attr(href)').get()

            # Pass both category AND book info to next callback
            yield response.follow(
                book_url,
                callback=self.parse_book,
                meta={
                    'category': category,
                    'title': title,
                    'price': price
                }
            )

    def parse_book(self, response):
        # Get everything from meta
        category = response.meta['category']
        title = response.meta['title']
        price = response.meta['price']

        # Scrape additional details from this page
        description = response.css('#product_description + p::text').get()
        availability = response.css('.availability::text').getall()[1].strip()

        # Return complete item
        yield {
            'category': category,
            'title': title,
            'price': price,
            'description': description,
            'availability': availability,
            'url': response.url
        }

See how meta carries data through three different callbacks? Category → Books → Book Details.


Passing Complex Data

You can put anything in meta. Not just strings.

Passing Dictionaries

def parse(self, response):
    item = {
        'name': 'Product Name',
        'price': 29.99,
        'rating': 4.5
    }

    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        meta={'item': item}
    )

def parse_detail(self, response):
    item = response.meta['item']
    item['description'] = response.css('.description::text').get()
    yield item

Passing Lists

def parse(self, response):
    images = response.css('img::attr(src)').getall()

    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'images': images}
    )

def parse_detail(self, response):
    images = response.meta['images']
    # Use the images list

Passing Numbers, Booleans, etc.

yield scrapy.Request(
    url,
    callback=self.parse_page,
    meta={
        'page_number': 5,
        'is_premium': True,
        'score': 98.6
    }
)

Common Use Cases for Meta

Use Case 1: Building Items Across Multiple Pages

This is the most common use. You scrape data from multiple pages and combine it into one item.

def parse_listing(self, response):
    for product in response.css('.product'):
        item = {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

        reviews_url = product.css('.reviews-link::attr(href)').get()

        yield scrapy.Request(
            reviews_url,
            callback=self.parse_reviews,
            meta={'item': item}
        )

def parse_reviews(self, response):
    item = response.meta['item']
    item['rating'] = response.css('.rating::text').get()
    item['review_count'] = len(response.css('.review'))
    yield item

Use Case 2: Tracking Source Information

Keep track of where data came from:

def parse(self, response):
    for link in response.css('a'):
        yield scrapy.Request(
            link.css('::attr(href)').get(),
            callback=self.parse_page,
            meta={'source_url': response.url, 'source_title': response.css('title::text').get()}
        )

def parse_page(self, response):
    yield {
        'data': response.css('.content::text').get(),
        'scraped_from': response.meta['source_url'],
        'parent_title': response.meta['source_title']
    }

Use Case 3: Depth Tracking

Keep track of how deep you are in the crawl. (Scrapy's built-in DepthMiddleware already maintains a depth meta key and honors the DEPTH_LIMIT setting, but rolling your own, as below, makes the mechanics visible:)

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, meta={'depth': 0})

def parse(self, response):
    depth = response.meta.get('depth', 0)

    # Only follow links if not too deep
    if depth < 3:
        for link in response.css('a::attr(href)').getall():
            yield response.follow(
                link,
                callback=self.parse,
                meta={'depth': depth + 1}
            )

Use Case 4: Pagination with Context

When following pagination, carry the item type or category:

def parse_category(self, response):
    category_name = response.css('h1::text').get()

    # Scrape products
    for product in response.css('.product'):
        yield {
            'category': category_name,
            'name': product.css('h2::text').get()
        }

    # Follow next page, keeping category context
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_category,
            meta={'category': category_name}  # Optional but cleaner
        )

Important Things to Know About Meta

1. Meta Is Always a Dictionary

# CORRECT
meta={'key': 'value'}

# WRONG
meta='just a string'  # This won't work!

2. You Can Add Multiple Keys

meta={
    'name': 'John',
    'age': 30,
    'city': 'New York',
    'items': [1, 2, 3]
}

3. Meta Is Preserved Through Redirects

If a request gets redirected, its meta dictionary stays with it:

yield scrapy.Request(
    'http://example.com/redirect',
    callback=self.parse,
    meta={'important': 'data'}
)

# Even after redirect, meta is still there
def parse(self, response):
    data = response.meta['important']  # Works fine!
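
On top of that, Scrapy's own RedirectMiddleware uses meta here: after a redirect, the original URL(s) are recorded under the redirect_urls key. A small sketch reading both your own data and the redirect chain:

def parse(self, response):
    data = response.meta['important']  # your key survived the redirect

    # 'redirect_urls' is set by RedirectMiddleware, and only when a
    # redirect actually happened, hence the .get() with a default
    chain = response.meta.get('redirect_urls', [])
    if chain:
        self.logger.info(f'Redirected: {chain} -> {response.url}')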

4. Use .get() to Avoid KeyErrors

Instead of:

name = response.meta['name']  # Crashes if 'name' doesn't exist

Use:

name = response.meta.get('name', 'Default Value')  # Safe!

5. Meta Is Shallow Copied

When Scrapy copies a request, it only does a shallow copy of meta. For simple values (strings, numbers), this is fine. But for complex objects, be careful:

# This item will be shared between requests
item = {'name': 'Product'}

yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item})  # Same item object!

# If one callback modifies it, both see the change

To avoid this, create new objects:

item = {'name': 'Product'}

yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': item.copy()})

Special Meta Keys (Used by Scrapy)

Scrapy uses some special keys in meta for its own purposes. You can use these to control Scrapy's behavior:

dont_redirect

Prevent Scrapy from following redirects:

yield scrapy.Request(
    url,
    meta={'dont_redirect': True}
)

dont_retry

Prevent automatic retries on failure:

yield scrapy.Request(
    url,
    meta={'dont_retry': True}
)

download_timeout

Set a custom timeout for this specific request:

yield scrapy.Request(
    url,
    meta={'download_timeout': 30}  # 30 seconds
)
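
If the timeout fires, the request fails instead of reaching your callback. You can catch that with an errback, a standard scrapy.Request parameter. A minimal sketch (the URL is a placeholder):

def start_requests(self):
    yield scrapy.Request(
        'https://example.com/slow-page',   # placeholder URL
        callback=self.parse,
        errback=self.on_error,             # runs on timeout or other failure
        meta={'download_timeout': 30}
    )

def on_error(self, failure):
    # failure.request is the original request, meta and all
    self.logger.warning(f'Failed: {failure.request.url}')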

proxy

Use a specific proxy for this request:

yield scrapy.Request(
    url,
    meta={'proxy': 'http://proxy.example.com:8080'}
)

handle_httpstatus_list

Tell Scrapy not to treat certain status codes as errors:

yield scrapy.Request(
    url,
    callback=self.parse,
    meta={'handle_httpstatus_list': [404, 500]}
)
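
With those codes whitelisted, Scrapy hands the 404 or 500 response to your callback instead of discarding it, so check response.status yourself. A small sketch:

def parse(self, response):
    if response.status in (404, 500):
        self.logger.warning(f'Got {response.status} for {response.url}')
        return  # nothing worth scraping

    yield {'title': response.css('title::text').get()}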

Meta vs cb_kwargs (What's the Difference?)

Scrapy has another way to pass data called cb_kwargs (added in Scrapy 1.7). Here's when to use each:

Use Meta When:

  • Working with Scrapy components (middlewares, extensions)
  • Need data to persist through redirects
  • Want to control Scrapy behavior (dont_retry, proxy, etc.)
  • Working with older Scrapy code

Use cb_kwargs When:

  • Just passing data to your own callback
  • Want cleaner, more explicit code
  • Working with newer Scrapy projects

Example with cb_kwargs:

def parse(self, response):
    yield scrapy.Request(
        url,
        callback=self.parse_detail,
        cb_kwargs={'name': 'Product', 'price': 29.99}
    )

def parse_detail(self, response, name, price):
    # name and price come as function arguments
    yield {
        'name': name,
        'price': price,
        'description': response.css('.description::text').get()
    }

With cb_kwargs, data comes as function arguments. With meta, you access it through response.meta.
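
You can also combine the two: cb_kwargs for your own data, meta for Scrapy's control keys. A quick sketch:

yield scrapy.Request(
    url,
    callback=self.parse_detail,
    cb_kwargs={'name': 'Product'},    # your data, arrives as an argument
    meta={'download_timeout': 30}     # Scrapy control key stays in meta
)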

My advice: Use meta for now while learning. It's more common in tutorials and older code. You can learn cb_kwargs later.


Common Mistakes and How to Avoid Them

Mistake 1: Forgetting to Pass Meta

# WRONG
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(detail_url, callback=self.parse_detail)
    # Forgot to pass name!

# RIGHT
def parse(self, response):
    name = response.css('h1::text').get()
    yield scrapy.Request(
        detail_url,
        callback=self.parse_detail,
        meta={'name': name}
    )

Mistake 2: Typo in Dictionary Key

# WRONG
yield scrapy.Request(url, meta={'prodcut_name': name})  # Typo!

def parse_detail(self, response):
    name = response.meta['product_name']  # KeyError!

# RIGHT
yield scrapy.Request(url, meta={'product_name': name})

def parse_detail(self, response):
    name = response.meta['product_name']  # Works!

Mistake 3: Not Using .get() for Optional Data

# WRONG (crashes if 'optional_data' doesn't exist)
data = response.meta['optional_data']

# RIGHT (returns None if doesn't exist)
data = response.meta.get('optional_data')

# EVEN BETTER (with default value)
data = response.meta.get('optional_data', 'default_value')

Mistake 4: Modifying Shared Objects

# WRONG
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item})
yield scrapy.Request(url2, meta={'item': item})
# Both requests share the same item dictionary!

# RIGHT
item = {'name': 'Product'}
yield scrapy.Request(url1, meta={'item': item.copy()})
yield scrapy.Request(url2, meta={'item': {'name': 'Product'}})

Debugging Meta

When things aren't working, log the meta to see what's in it. Don't be surprised to also find Scrapy's internal keys there (download_timeout, download_slot, depth, and so on) alongside your own data:

def parse_detail(self, response):
    # See what's in meta
    self.logger.info(f'Meta contains: {response.meta}')

    # Or check if a specific key exists
    if 'name' in response.meta:
        self.logger.info(f'Name is: {response.meta["name"]}')
    else:
        self.logger.warning('Name not found in meta!')

Complete Real-World Example

Here's a complete spider that uses meta effectively:

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://example-shop.com']

    def parse(self, response):
        """Scrape category pages"""
        for category in response.css('.category'):
            category_name = category.css('h2::text').get()
            category_url = category.css('a::attr(href)').get()

            yield response.follow(
                category_url,
                callback=self.parse_products,
                meta={'category': category_name, 'page': 1}
            )

    def parse_products(self, response):
        """Scrape product listings"""
        category = response.meta['category']
        page = response.meta.get('page', 1)

        self.logger.info(f'Scraping {category}, page {page}')

        for product in response.css('.product'):
            product_data = {
                'category': category,
                'name': product.css('h3::text').get(),
                'price': product.css('.price::text').get(),
                'image_url': product.css('img::attr(src)').get()
            }

            detail_url = product.css('a::attr(href)').get()

            yield response.follow(
                detail_url,
                callback=self.parse_product_detail,
                meta={'product': product_data, 'source_page': page}
            )

        # Handle pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                callback=self.parse_products,
                meta={'category': category, 'page': page + 1}
            )

    def parse_product_detail(self, response):
        """Scrape full product details"""
        product = response.meta['product']
        source_page = response.meta['source_page']

        # Add details from this page
        product['description'] = response.css('.description::text').get()
        product['rating'] = response.css('.rating::text').get()
        product['reviews_count'] = len(response.css('.review'))
        product['in_stock'] = bool(response.css('.in-stock'))
        product['source_page'] = source_page
        product['detail_url'] = response.url

        yield product

This spider demonstrates:

  • Passing data through multiple callbacks
  • Tracking category and page numbers
  • Building items across pages
  • Using .get() for optional data
  • Proper logging

When NOT to Use Meta

Sometimes meta isn't the right choice:

Don't use meta for:

  • Data that can be scraped from the current page (just scrape it directly; see the sketch after this list)
  • Large binary data (images, files)
  • Data that doesn't need to travel between pages

Do use meta for:

  • Context from previous pages
  • Tracking state (depth, source, page numbers)
  • Passing partial items to be completed later
  • Controlling Scrapy behavior per-request
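
To make that first "don't" concrete: the page's own URL is already on the response, so there's no need to carry it in meta. A minimal sketch of the contrast (selectors are placeholders):

# Unnecessary: the detail page already knows its own URL
yield scrapy.Request(detail_url, callback=self.parse_detail,
                     meta={'detail_url': detail_url})

# Simpler: read it off the current response
def parse_detail(self, response):
    yield {
        'url': response.url,                      # no meta needed
        'title': response.css('h1::text').get()   # it's on this page anyway
    }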

Final Tips

  1. Keep meta simple: Don't put huge objects in meta. Keep it lightweight. (If you persist a crawl with JOBDIR, only requests whose meta can be serialized make it to the disk queue.)

  2. Use descriptive keys: Instead of meta={'d': data}, use meta={'product_data': data}

  3. Always use .get(): Use response.meta.get('key', default) to avoid KeyErrors

  4. Check what's in meta: When debugging, print response.meta to see what's there

  5. Don't overuse meta: If you're passing 10+ keys, consider restructuring your code

  6. Remember meta is for data, not configuration: Use spider attributes or settings for configuration (see the sketch below)
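
To illustrate that last tip, per-spider configuration belongs in attributes or custom_settings, while meta carries per-request data. A sketch with made-up values:

import scrapy

class ConfiguredSpider(scrapy.Spider):
    name = 'configured'
    max_pages = 50  # configuration: a spider attribute, set once

    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,  # configuration: a Scrapy setting
    }

    def parse(self, response):
        page = response.meta.get('page', 1)  # data: per-request state
        if page < self.max_pages:
            next_page = response.css('.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse,
                                      meta={'page': page + 1})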


Summary

Meta is your spider's backpack. It carries data from one page to the next.

Key takeaways:

  • Use meta={'key': 'value'} when making requests
  • Access with response.meta['key'] in callbacks
  • Use .get() for optional data to avoid errors
  • Meta can hold any Python object
  • Common uses: building items across pages, tracking context, controlling Scrapy

Start using meta in your next spider. It'll make multi-page scraping so much easier.

Happy scraping! 🕷️
