Muhammad Ikramullah Khan

Scrapy: yield vs yield from and Request vs response.follow() (The Complete Beginner's Guide)

When I first started using Scrapy, I kept seeing yield everywhere. Then sometimes I'd see yield from. And don't even get me started on when to use scrapy.Request() versus response.follow().

I spent hours reading documentation that explained what these do, but not when to use them or why they matter.

Let me fix that for you. I'll explain both topics in plain English, with real examples and the practical insights you won't find in the docs.


Part 1: Understanding yield vs yield from

What is yield, Really?

Let's start with the basics. In regular Python, when you want to return something from a function, you use return:

def get_numbers():
    return [1, 2, 3, 4, 5]

numbers = get_numbers()
print(numbers)  # [1, 2, 3, 4, 5]

This returns all the numbers at once. The function runs, creates the entire list, and gives it back.

But yield works differently. It returns values one at a time:

def get_numbers():
    yield 1
    yield 2
    yield 3
    yield 4
    yield 5

for num in get_numbers():
    print(num)  # Prints one number at a time

This is called a generator. Each time you ask for the next value, the function runs until it hits the next yield, gives you that value, then pauses.
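
If you want to see that pause-and-resume behavior directly, you can drive a generator by hand with next(). This is plain Python, nothing Scrapy-specific:

def get_numbers():
    yield 1
    yield 2
    yield 3

gen = get_numbers()   # nothing has run yet
print(next(gen))      # runs until the first yield, prints 1
print(next(gen))      # resumes where it paused, prints 2
print(next(gen))      # prints 3
# one more next(gen) would raise StopIteration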

Why Scrapy Uses yield

In Scrapy, you scrape multiple items from pages. You don't want to wait until you've scraped everything to start processing. You want to send each item as soon as you scrape it.

With return (works, but not ideal):

def parse(self, response):
    items = []
    for product in response.css('.product'):
        items.append({
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        })
    return items  # Wait for ALL products, then return

This collects all products, stores them in memory, then returns them all at once.

With yield (right way):

def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }  # Send each product immediately

This sends each product to Scrapy's pipeline as soon as it's scraped. Memory efficient. Faster processing.

The Power of yield in Scrapy

Here's what most tutorials don't explain clearly. You can yield different things:

def parse(self, response):
    # Yield items (data you scraped)
    yield {'name': 'Product 1', 'price': 29.99}

    # Yield more requests (pages to visit)
    yield scrapy.Request('https://example.com/page2', callback=self.parse)

    # You can mix them!
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

Every time you yield something, Scrapy takes it and decides what to do:

  • If it's a dict or item, it goes through your item pipelines (see the minimal sketch just below)
  • If it's a Request, it goes to the scheduler to be fetched
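
Here's a minimal item pipeline sketch showing where those yielded dicts end up. The class name and the settings entry are placeholders for illustration, not part of any particular project:

# pipelines.py -- every dict or item the spider yields is passed to process_item()
class ProductPipeline:
    def process_item(self, item, spider):
        spider.logger.info("Got item: %s", item)
        return item  # return it so any later pipelines receive it too

# settings.py -- register the pipeline (the number controls execution order)
# ITEM_PIPELINES = {"myproject.pipelines.ProductPipeline": 300}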

What is yield from?

Now let's talk about yield from. This is syntactic sugar for yielding everything from another generator or iterable.

Without yield from:

def parse(self, response):
    for request in self.make_requests(response):
        yield request

def make_requests(self, response):
    for link in response.css('a::attr(href)').getall():
        yield scrapy.Request(link, callback=self.parse_page)

With yield from:

def parse(self, response):
    yield from self.make_requests(response)

def make_requests(self, response):
    for link in response.css('a::attr(href)').getall():
        yield scrapy.Request(link, callback=self.parse_page)

yield from takes everything that make_requests() yields and yields it directly. It's cleaner and more readable.
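
yield from also works with any iterable, not just generators, so these two plain-Python functions produce exactly the same values:

def one_by_one():
    yield 1
    yield 2
    yield 3

def with_yield_from():
    yield from [1, 2, 3]  # same three values, one line

print(list(one_by_one()))       # [1, 2, 3]
print(list(with_yield_from()))  # [1, 2, 3]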

When to Use yield from in Scrapy

Use Case 1: Extracting Helper Functions

def parse(self, response):
    # Scrape products
    yield from self.parse_products(response)

    # Follow pagination
    yield from self.parse_pagination(response)

def parse_products(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get(),
            'price': product.css('.price::text').get()
        }

def parse_pagination(self, response):
    for page_link in response.css('.pagination a::attr(href)').getall():
        yield response.follow(page_link, callback=self.parse)

This keeps your code organized. Instead of one giant parse() function, you split it into smaller, readable pieces.

Use Case 2: Reusing Request Generation

def parse_category(self, response):
    # Generate product requests
    yield from self.create_product_requests(response)

def parse_search(self, response):
    # Use the same request generator
    yield from self.create_product_requests(response)

def create_product_requests(self, response):
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield scrapy.Request(url, callback=self.parse_product)

DRY principle. Don't repeat yourself.

Use Case 3: Processing Lists

def parse(self, response):
    urls = response.css('.product a::attr(href)').getall()

    # Instead of looping and yielding
    # for url in urls:
    #     yield response.follow(url, callback=self.parse_product)

    # You can do this
    yield from response.follow_all(urls, callback=self.parse_product)

yield vs yield from: Quick Decision Guide

Use yield when:

  • Yielding one item or request at a time
  • Simple, straightforward code
  • You're just learning Scrapy

Use yield from when:

  • Delegating to helper functions that yield
  • Reusing request generators
  • Code is getting messy with nested loops
  • You want cleaner, more organized code

Part 2: Request vs response.follow()

Now let's tackle the second confusion: when to use scrapy.Request() versus response.follow().

What's the Difference?

Both create requests to visit URLs. But response.follow() is a shortcut that makes your life easier.

Using scrapy.Request() (The Verbose Way)

def parse(self, response):
    for product in response.css('.product'):
        # Get the relative URL
        relative_url = product.css('a::attr(href)').get()

        # Convert to absolute URL
        absolute_url = response.urljoin(relative_url)

        # Create request
        yield scrapy.Request(
            url=absolute_url,
            callback=self.parse_product
        )

Notice what you have to do:

  1. Extract the URL
  2. Convert relative URL to absolute URL (manually!)
  3. Create the Request object

Using response.follow() (The Smart Way)

def parse(self, response):
    for product in response.css('.product'):
        # Just follow the link!
        yield response.follow(
            product.css('a::attr(href)').get(),
            callback=self.parse_product
        )

response.follow() automatically:

  • Handles relative URLs (converts them to absolute)
  • Works with both URLs and selectors
  • Uses less code
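
Roughly speaking, the shortcut does what the verbose version did by hand: join the (possibly relative) URL against the page's URL, then build a Request. This is a sketch of the behavior, not Scrapy's actual implementation:

# inside a scrapy.Spider subclass (with `import scrapy` at the top)
def parse(self, response):
    for product in response.css('.product'):
        href = product.css('a::attr(href)').get()
        # yield response.follow(href, callback=self.parse_product) behaves roughly like:
        yield scrapy.Request(
            url=response.urljoin(href),   # relative -> absolute
            callback=self.parse_product
        )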

The Magic: response.follow() with Selectors

Here's something really cool that most beginners don't know. You can pass a selector directly to response.follow():

def parse(self, response):
    # Instead of this
    for link in response.css('.product a'):
        url = link.css('::attr(href)').get()
        yield response.follow(url, callback=self.parse_product)

    # You can do this
    for link in response.css('.product a'):
        yield response.follow(link, callback=self.parse_product)

When you pass a selector, response.follow() automatically extracts the href attribute for you!

Even better:

def parse(self, response):
    # Follow all links at once!
    yield from response.follow_all(
        response.css('.product a'),
        callback=self.parse_product
    )

response.follow_all() is like response.follow() but for multiple links. Super clean!
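
One more shortcut, assuming you're on Scrapy 2.0 or newer: response.follow_all() also accepts a css (or xpath) keyword argument, so you can skip the explicit .css() call:

def parse(self, response):
    # Equivalent to passing response.css('.product a') yourself
    yield from response.follow_all(css='.product a', callback=self.parse_product)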

When to Use scrapy.Request() vs response.follow()

Use scrapy.Request() when:

  1. Creating start requests (no response yet):
def start_requests(self):
    urls = ['https://example.com/page1', 'https://example.com/page2']
    for url in urls:
        yield scrapy.Request(url, callback=self.parse)
  2. You need full control (custom headers, cookies, etc.):
yield scrapy.Request(
    url='https://example.com',
    callback=self.parse,
    headers={'Custom-Header': 'value'},
    cookies={'session': '12345'},
    meta={'page': 1},
    priority=10,
    dont_filter=True
)
  3. Making POST requests:
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    callback=self.parse
)
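
If the POST body is JSON, recent Scrapy versions also offer scrapy.http.JsonRequest, a Request subclass that serializes the body and sets the Content-Type header for you. A rough equivalent of the example above:

from scrapy.http import JsonRequest

def start_requests(self):
    yield JsonRequest(
        url='https://example.com/api',
        data={'key': 'value'},   # serialized to JSON; method defaults to POST when data is given
        callback=self.parse
    )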

Use response.follow() when:

  1. Following links from a page (99% of the time):
next_page = response.css('.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
  2. Working with relative URLs:
# These all work!
yield response.follow('/products/123', callback=self.parse)
yield response.follow('../category/books', callback=self.parse)
yield response.follow('?page=2', callback=self.parse)
  3. Quick and simple link following:
for link in response.css('a.product'):
    yield response.follow(link, callback=self.parse_product)

Real-World Examples

Example 1: Complete Spider Using yield and response.follow()

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        # Use yield to scrape products on current page
        for book in response.css('.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get()
            }

        # Use response.follow for pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Example 2: Using yield from for Organization

import scrapy

class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Use yield from to delegate to helper methods
        yield from self.scrape_products(response)
        yield from self.follow_categories(response)
        yield from self.follow_pagination(response)

    def scrape_products(self, response):
        """Scrape all products on the page"""
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

    def follow_categories(self, response):
        """Follow all category links"""
        for category in response.css('.category a'):
            yield response.follow(category, callback=self.parse)

    def follow_pagination(self, response):
        """Follow pagination links"""
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Example 3: Mixing Request and response.follow()

import scrapy

class MixedSpider(scrapy.Spider):
    name = 'mixed'

    def start_requests(self):
        # Use Request for start URLs (no response yet)
        urls = ['https://example.com/category/electronics',
                'https://example.com/category/books']

        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse_category,
                meta={'category': url.split('/')[-1]}
            )

    def parse_category(self, response):
        category = response.meta['category']

        # Use response.follow for links (easier!)
        for product in response.css('.product a'):
            yield response.follow(
                product,
                callback=self.parse_product,
                meta={'category': category}
            )

    def parse_product(self, response):
        yield {
            'category': response.meta['category'],
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

Hidden Tricks the Docs Don't Emphasize

Trick #1: You Can Mix yield and yield from

def parse(self, response):
    # Yield a single item
    yield {'page_title': response.css('title::text').get()}

    # Yield from multiple requests
    yield from response.follow_all(
        response.css('.product a'),
        callback=self.parse_product
    )

    # Yield another item
    yield {'product_count': len(response.css('.product'))}

Trick #2: response.follow() Accepts meta Too

def parse(self, response):
    for i, product in enumerate(response.css('.product a')):
        yield response.follow(
            product,
            callback=self.parse_product,
            meta={'position': i, 'category': 'electronics'}
        )

Trick #3: Yielding from CrawlSpider Rules

When using CrawlSpider with rules, you're already using yield behind the scenes:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AutoSpider(CrawlSpider):
    name = 'auto'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        # Just yield like normal
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get()
        }

Trick #4: Yield Different Item Types for Different Pipeline Logic

You can yield different types and route them through different pipelines:

def parse(self, response):
    # Yield a product item
    yield {
        'type': 'product',
        'name': response.css('h1::text').get()
    }

    # Yield an image URL
    yield {
        'type': 'image',
        'url': response.css('.main-image::attr(src)').get()
    }

Then in your pipeline, handle them differently:

class TypedPipeline:
    def process_item(self, item, spider):
        if item['type'] == 'product':
            # Save to products database
            pass
        elif item['type'] == 'image':
            # Download image
            pass
        return item

Common Mistakes

Mistake #1: Forgetting to yield

# WRONG (nothing happens!)
def parse(self, response):
    for product in response.css('.product'):
        {  # Missing yield!
            'name': product.css('h2::text').get()
        }

# RIGHT
def parse(self, response):
    for product in response.css('.product'):
        yield {
            'name': product.css('h2::text').get()
        }

Mistake #2: Using return Instead of yield

# WRONG (only returns first item!)
def parse(self, response):
    for product in response.css('.product'):
        return {'name': product.css('h2::text').get()}

# RIGHT
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}

Mistake #3: Not Using response.follow() for Relative URLs

# WRONG (breaks with relative URLs like '/products/123')
def parse(self, response):
    url = response.css('.next::attr(href)').get()
    yield scrapy.Request(url, callback=self.parse)  # Might break!

# RIGHT (handles relative URLs automatically)
def parse(self, response):
    url = response.css('.next::attr(href)').get()
    yield response.follow(url, callback=self.parse)

Mistake #4: Confusing yield from with yield

# WRONG (yields a generator object, not the items!)
def parse(self, response):
    yield self.parse_products(response)  # Yields the generator itself

# RIGHT
def parse(self, response):
    yield from self.parse_products(response)  # Yields each item from the generator

Performance Considerations

Memory Usage

# BAD (stores everything in memory)
def parse(self, response):
    items = []
    for product in response.css('.product'):
        items.append({'name': product.css('h2::text').get()})
    return items  # All at once

# GOOD (memory efficient)
def parse(self, response):
    for product in response.css('.product'):
        yield {'name': product.css('h2::text').get()}  # One at a time

Processing Speed

Using yield lets Scrapy start pushing items through its pipelines immediately, while the spider keeps sending requests and parsing responses. Because Scrapy's engine is asynchronous, this overlap keeps your scraper fast.
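
You can see the same effect in plain Python: the consumer starts handling items while the producer is still working, instead of waiting for everything to be built first. (This is an analogy for the spider-to-pipeline hand-off, not Scrapy internals.)

import time

def scrape_page():
    """Stand-in for a parse() method: each product takes a moment to extract."""
    for i in range(3):
        time.sleep(0.1)                 # simulate extraction work
        yield {'name': f'Product {i}'}

# The consumer (think: item pipeline) handles each item the moment it is yielded,
# rather than after the whole page is finished.
for item in scrape_page():
    print('processing', item)           # first item arrives after ~0.1s, not 0.3s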


Quick Decision Guide

yield vs yield from:

  • Use yield for single items or requests
  • Use yield from when delegating to helper functions that yield multiple things
  • Use yield from response.follow_all() for following multiple links cleanly

Request vs response.follow():

  • Use Request in start_requests() or when you need full control
  • Use response.follow() everywhere else (following links from pages)
  • Use response.follow_all() for multiple links

Summary

yield:

  • Returns values one at a time (generator)
  • Memory efficient
  • Lets pipelines process items while the spider keeps crawling
  • Prefer it over return in Scrapy callbacks

yield from:

  • Yields everything from another generator
  • Cleaner code
  • Good for helper functions
  • Keeps code organized

scrapy.Request():

  • Full control over request
  • Use in start_requests()
  • Use when you need custom headers/cookies/etc.

response.follow():

  • Shortcut for following links
  • Handles relative URLs automatically
  • Can work with selectors directly
  • Use 99% of the time when following links

Start with the simple patterns. Use yield and response.follow() everywhere. Add yield from and scrapy.Request() only when you need them.

Happy scraping! 🕷️
