Muhammad Ikramullah Khan

Scrapy: yield vs yield from and Request vs response.follow() (The Complete Beginner's Guide)

When I first started using Scrapy, I kept seeing yield everywhere. Then sometimes I'd see yield from. And don't even get me started on when to use scrapy.Request() versus response.follow().

I spent hours reading documentation that explained what these do, but not when to use them or why they matter.

Let me fix that for you. I'll explain both topics in plain English, with real examples and the practical insights you won't find in the docs.


Part 1: Understanding yield vs yield from

What is yield, Really?

Let's start with the basics. In regular Python, when you want to return something from a function, you use return:

def get_numbers():
    return [1, 2, 3, 4, 5]

numbers = get_numbers()
print(numbers)  # [1, 2, 3, 4, 5]

This returns all the numbers at once. The function runs, creates the entire list, and gives it back.

But yield works differently. It returns values one at a time:

def get_numbers():
    yield 1
    yield 2
    yield 3
    yield 4
    yield 5

for num in get_numbers():
    print(num)  # Prints one number at a time

This is called a generator. Each time you ask for the next value, the function runs until it hits the next yield, gives you that value, then pauses.
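
If you want to see that pause-and-resume behavior directly, you can drive a generator by hand with next(). This is plain Python, nothing Scrapy-specific:

def get_numbers():
    yield 1
    yield 2
    yield 3

gen = get_numbers()   # nothing has run yet
print(next(gen))      # runs until the first yield, prints 1
print(next(gen))      # resumes where it paused, prints 2
print(next(gen))      # prints 3
# one more next(gen) would raise StopIteration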

Why Scrapy Uses yield

In Scrapy, you scrape multiple items from pages. You don't want to wait until you've scraped everything to start processing. You want to send each item as soon as you scrape it.

With return (works, but not ideal):

def parse(self, response):
    items = []
    for product in response.css('.product'):
        items.append({
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        })
    return items  # Wait for ALL products, then return

This collects all products, stores them in memory, then returns them all at once.

With yield (right way):

def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }  # Send each product immediately

This sends each product to Scrapy's pipeline as soon as it's scraped. Memory efficient. Faster processing.

The Power of yield in Scrapy

Here's what most tutorials don't explain clearly. You can yield different things:

def parse(self, response):
    # Yield items (data you scraped)
    yield {'name': 'Product 1', 'price': 29.99}

    # Yield more requests (pages to visit)
    yield scrapy.Request('https://example.com/page2', callback=self.parse)

    # You can mix them!
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

Every time you yield something, Scrapy takes it and decides what to do:

  • If it's a dict or item, it goes through your item pipelines (see the minimal sketch just below)
  • If it's a Request, it goes to the scheduler to be fetched
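
Here's a minimal item pipeline sketch showing where those yielded dicts end up. The class name and the settings entry are placeholders for illustration, not part of any particular project:

# pipelines.py -- every dict or item the spider yields is passed to process_item()
class ProductPipeline:
    def process_item(self, item, spider):
        spider.logger.info("Got item: %s", item)
        return item  # return it so any later pipelines receive it too

# settings.py -- register the pipeline (the number controls execution order)
# ITEM_PIPELINES = {"myproject.pipelines.ProductPipeline": 300}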

What is yield from?

Now let's talk about yield from. This is syntactic sugar for yielding everything from another generator or iterable.

Without yield from:

def parse(self, response):
    for request in self.make_requests(response):
        yield request

def make_requests(self, response):
    for link in response.css('a::attr(href)').getall():
        yield scrapy.Request(link, callback=self.parse_page)

With yield from:

def parse(self, response):
    yield from self.make_requests(response)

def make_requests(self, response):
    for link in response.css('a::attr(href)').getall():
        yield scrapy.Request(link, callback=self.parse_page)

yield from takes everything that make_requests() yields and yields it directly. It's cleaner and more readable.
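
yield from also works with any iterable, not just generators, so these two plain-Python functions produce exactly the same values:

def one_by_one():
    yield 1
    yield 2
    yield 3

def with_yield_from():
    yield from [1, 2, 3]  # same three values, one line

print(list(one_by_one()))       # [1, 2, 3]
print(list(with_yield_from()))  # [1, 2, 3]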

When to Use yield from in Scrapy

Use Case 1: Extracting Helper Functions

def parse(self, response):
    # Scrape products
    yield from self.parse_products(response)

    # Follow pagination
    yield from self.parse_pagination(response)

def parse_products(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

def parse_pagination(self, response):
    for page_link in response.css('.pagination a::attr(href)').getall():
        yield response.follow(page_link, callback=self.parse)

This keeps your code organized. Instead of one giant parse() function, you split it into smaller, readable pieces.

Use Case 2: Reusing Request Generation

def parse_category(self, response):
    # Generate product requests
    yield from self.create_product_requests(response)

def parse_search(self, response):
    # Use the same request generator
    yield from self.create_product_requests(response)

def create_product_requests(self, response):
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield scrapy.Request(url, callback=self.parse_product)

DRY principle. Don't repeat yourself.

Use Case 3: Processing Lists

def parse(self, response):
    urls = response.css('.product a::attr(href)').getall()

    # Instead of looping and yielding
    # for url in urls:
    #     yield response.follow(url, callback=self.parse_product)

    # You can do this
    yield from response.follow_all(urls, callback=self.parse_product)

yield vs yield from: Quick Decision Guide

Use yield when:

  • Yielding one item or request at a time
  • Simple, straightforward code
  • You're just learning Scrapy

Use yield from when:

  • Delegating to helper functions that yield
  • Reusing request generators
  • Code is getting messy with nested loops
  • You want cleaner, more organized code

Part 2: Request vs response.follow()

Now let's tackle the second confusion: when to use scrapy.Request() versus response.follow().

What's the Difference?

Both create requests to visit URLs. But response.follow() is a shortcut that makes your life easier.

Using scrapy.Request() (The Verbose Way)

def parse(self, response):
    for product in response.css('.product'):
        # Get the relative URL
        relative_url = product.css('a::attr(href)').get()

        # Convert to absolute URL
        absolute_url = response.urljoin(relative_url)

        # Create request
        yield scrapy.Request(
            url=absolute_url,
            callback=self.parse_product
        )

Notice what you have to do:

  1. Extract the URL
  2. Convert relative URL to absolute URL (manually!)
  3. Create the Request object

Using response.follow() (The Smart Way)

def parse(self, response):
    for product in response.css('.product'):
        # Just follow the link!
        yield response.follow(
            product.css('a::attr(href)').get(),
            callback=self.parse_product
        )

response.follow() automatically:

  • Handles relative URLs (converts them to absolute)
  • Works with both URLs and selectors
  • Uses less code
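
Roughly speaking, the shortcut does what the verbose version did by hand: join the (possibly relative) URL against the page's URL, then build a Request. This is a sketch of the behavior, not Scrapy's actual implementation:

# inside a scrapy.Spider subclass (with `import scrapy` at the top)
def parse(self, response):
    for product in response.css('.product'):
        href = product.css('a::attr(href)').get()
        # yield response.follow(href, callback=self.parse_product) behaves roughly like:
        yield scrapy.Request(
            url=response.urljoin(href),   # relative -> absolute
            callback=self.parse_product
        )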

The Magic: response.follow() with Selectors

Here's something really cool that most beginners don't know. You can pass a selector directly to response.follow():

def parse(self, response):
    # Instead of this
    for link in response.css('.product a'):
        url = link.css('::attr(href)').get()
        yield response.follow(url, callback=self.parse_product)

    # You can do this
    for link in response.css('.product a'):
        yield response.follow(link, callback=self.parse_product)

When you pass a selector, response.follow() automatically extracts the href attribute for you!

Even better:

def parse(self, response):
    # Follow all links at once!
    yield from response.follow_all(
        response.css('.product a'),
        callback=self.parse_product
    )

response.follow_all() is like response.follow() but for multiple links. Super clean!
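
One more shortcut, assuming you're on Scrapy 2.0 or newer: response.follow_all() also accepts a css (or xpath) keyword argument, so you can skip the explicit .css() call:

def parse(self, response):
    # Equivalent to passing response.css('.product a') yourself
    yield from response.follow_all(css='.product a', callback=self.parse_product)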

When to Use scrapy.Request() vs response.follow()

Use scrapy.Request() when:

  1. Creating start requests (no response yet):
def start_requests(self):
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse)
  2. You need full control (custom headers, cookies, etc.):
yield scrapy.Request(
    url='https://example.com',
    callback=self.parse,
    headers={'Custom-Header': 'value'},
    cookies={'session': '12345'},
    meta={'page': 1},
    priority=10,
    dont_filter=True
)
  3. Making POST requests:
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    callback=self.parse
)
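
If the POST body is JSON, recent Scrapy versions also offer scrapy.http.JsonRequest, a Request subclass that serializes the body and sets the Content-Type header for you. A rough equivalent of the example above:

from scrapy.http import JsonRequest

def start_requests(self):
    yield JsonRequest(
        url='https://example.com/api',
        data={'key': 'value'},   # serialized to JSON; method defaults to POST when data is given
        callback=self.parse
    )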

Use response.follow() when:

  1. Following links from a page (99% of the time):
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
  2. Working with relative URLs:
# These all work!
yield response.follow('/products/123', callback=self.parse)
yield response.follow('../category/books', callback=self.parse)
yield response.follow('?page=2', callback=self.parse)
  3. Quick and simple link following:
for link in response.css('a.product'):
    yield response.follow(link, callback=self.parse_product)

Real-World Examples

Example 1: Complete Spider Using yield and response.follow()

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        # Use yield to scrape products on current page
        for book in response.css('.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get()
            }

        # Use response.follow for pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Example 2: Using yield from for Organization

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Use yield from to delegate to helper methods
        yield from self.scrape_products(response)
        yield from self.follow_categories(response)
        yield from self.follow_pagination(response)

    def scrape_products(self, response):
        """Scrape all products on the page"""
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

    def follow_categories(self, response):
        """Follow all category links"""
        for category in response.css('.category a'):
            yield response.follow(category, callback=self.parse)

    def follow_pagination(self, response):
        """Follow pagination links"""
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Example 3: Mixing Request and response.follow()

import scrapy

class MixedSpider(scrapy.Spider):
    name = 'mixed'

    def start_requests(self):
        # Use Request for start URLs (no response yet)
        urls = ['https://example.com/category/electronics',
                'https://example.com/category/books']

        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse_category,
                meta={'category': url.split('/')[-1]}
            )

    def parse_category(self, response):
        category = response.meta['category']

        # Use response.follow for links (easier!)
        for product in response.css('.product a'):
            yield response.follow(
                product,
                callback=self.parse_product,
                meta={'category': category}
            )

    def parse_product(self, response):
        yield {
            'category': response.meta['category'],
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

Hidden Tricks the Docs Don't Emphasize

Trick #1: You Can Mix yield and yield from

def parse(self, response):
    # Yield a single item
    yield {'page_title': response.css('title::text').get()}

    # Yield from multiple requests
    yield from response.follow_all(
        response.css('.product a'),
        callback=self.parse_product
    )

    # Yield another item
    yield {'product_count': len(response.css('.product'))}

Trick #2: response.follow() Accepts meta Too

def parse(self, response):
    for i, product in enumerate(response.css('.product a')):
        yield response.follow(
            product,
            callback=self.parse_product,
            meta={'position': i, 'category': 'electronics'}
        )

Trick #3: Yielding from CrawlSpider Rules

When using CrawlSpider with rules, you're already using yield behind the scenes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AutoSpider(CrawlSpider):
    name = 'auto'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        # Just yield like normal
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

Trick #4: Yield Different Item Types for Different Pipeline Logic

You can yield different types and route them through different pipelines:

def parse(self, response):
    # Yield a product item
    yield {
        'type': 'product',
        'name': response.css('h1::text').get()
    }

    # Yield an image URL
    yield {
        'type': 'image',
        'url': response.css('.main-image::attr(src)').get()
    }

Then in your pipeline, handle them differently:

class TypedPipeline:
    def process_item(self, item, spider):
        if item['type'] == 'product':
            # Save to products database
            pass
        elif item['type'] == 'image':
            # Download image
            pass
        return item

Common Mistakes

Mistake #1: Forgetting to yield

# WRONG (nothing happens!)
def parse(self, response):
    for product in response.css('.product'):
        {  # Missing yield!
            'name': product.css('h2::text').get()
        }

# RIGHT
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get()
        }

Mistake #2: Using return Instead of yield

# WRONG (only returns first item!)
def parse(self, response):
    for product in response.css('.product'):
        return {'name': product.css('h2::text').get()}

# RIGHT
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

Mistake #3: Not Using response.follow() for Relative URLs

# WRONG (breaks with relative URLs like '/products/123')
def parse(self, response):
    url = response.css('.next::attr(href)').get()
    yield scrapy.Request(url, callback=self.parse)  # Might break!

# RIGHT (handles relative URLs automatically)
def parse(self, response):
    url = response.css('.next::attr(href)').get()
    yield response.follow(url, callback=self.parse)

Mistake #4: Confusing yield from with yield

# WRONG (yields a generator object, not the items!)
def parse(self, response):
    yield self.parse_products(response)  # Yields the generator itself

# RIGHT
def parse(self, response):
    yield from self.parse_products(response)  # Yields each item from the generator

Performance Considerations

Memory Usage

# BAD (stores everything in memory)
def parse(self, response):
    items = []
    for product in response.css('.product'):
        items.append({'name': product.css('h2::text').get()})
    return items  # All at once

# GOOD (memory efficient)
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}  # One at a time

Processing Speed

Using yield lets Scrapy start pushing items through its pipelines immediately, while the spider keeps sending requests and parsing responses. Because Scrapy's engine is asynchronous, this overlap keeps your scraper fast.
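
You can see the same effect in plain Python: the consumer starts handling items while the producer is still working, instead of waiting for everything to be built first. (This is an analogy for the spider-to-pipeline hand-off, not Scrapy internals.)

import time

def scrape_page():
    """Stand-in for a parse() method: each product takes a moment to extract."""
    for i in range(3):
        time.sleep(0.1)                 # simulate extraction work
        yield {'name': f'Product {i}'}

# The consumer (think: item pipeline) handles each item the moment it is yielded,
# rather than after the whole page is finished.
for item in scrape_page():
    print('processing', item)           # first item arrives after ~0.1s, not 0.3s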


Quick Decision Guide

yield vs yield from:

  • Use yield for single items or requests
  • Use yield from when delegating to helper functions that yield multiple things
  • Use yield from response.follow_all() for following multiple links cleanly

Request vs response.follow():

  • Use Request in start_requests() or when you need full control
  • Use response.follow() everywhere else (following links from pages)
  • Use response.follow_all() for multiple links

Summary

yield:

  • Returns values one at a time (generator)
  • Memory efficient
  • Lets pipelines process items while the spider keeps crawling
  • Prefer it over return in Scrapy callbacks

yield from:

  • Yields everything from another generator
  • Cleaner code
  • Good for helper functions
  • Keeps code organized

scrapy.Request():

  • Full control over request
  • Use in start_requests()
  • Use when you need custom headers/cookies/etc.

response.follow():

  • Shortcut for following links
  • Handles relative URLs automatically
  • Can work with selectors directly
  • Use 99% of the time when following links

Start with the simple patterns. Use yield and response.follow() everywhere. Add yield from and scrapy.Request() only when you need them.

Happy scraping! 🕷️
