Muhammad Ikramullah Khan

Scrapy Requests and Responses: The Complete Beginner's Guide (With Secrets the Docs Don't Tell You)

When I first started using Scrapy, I thought Requests and Responses were simple concepts. You make a request, you get a response. Easy, right?

Wrong.

There's so much hidden under the surface. Things the documentation mentions but doesn't explain. Tricks that experienced scrapers use every day but nobody writes about.

After scraping hundreds of websites and debugging thousands of issues, I've learned the ins and outs of Scrapy's Request and Response objects. Let me share everything with you, including the stuff the docs leave out.


What Are Requests and Responses, Really?

Think of web scraping like having a conversation:

Request: "Hey website, can you show me this page?"
Response: "Sure, here's the HTML!"

In Scrapy:

  • A Request is an object that says "I want to visit this URL"
  • A Response is an object that contains what the website sent back

But here's where it gets interesting. These aren't just simple objects. They carry a ton of hidden information and have special behaviors most beginners never discover.
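
You can poke at both objects in a plain Python shell. A minimal sketch (the URL is a placeholder):

import scrapy

req = scrapy.Request('https://example.com')
print(req.url)     # 'https://example.com'
print(req.method)  # 'GET' -- the default
print(req.meta)    # {} -- a dict that travels with the request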


Creating Your First Request (The Right Way)

Most tutorials show you this:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Do something with response
        pass

But what's actually happening here? Scrapy automatically creates Request objects from start_urls. Behind the scenes, it does roughly this (note the dont_filter=True: start URLs skip the duplicate filter):

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)  # callback defaults to self.parse

Now let's make requests manually and see all the options:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse,
            method='GET',
            headers={'User-Agent': 'My Custom Agent'},
            cookies={'session': 'abc123'},
            meta={'page_num': 1},
            dont_filter=False,
            priority=0
        )

    def parse(self, response):
        # Process response
        pass

Let me break down each parameter:

url (Required)

The page you want to scrape. Pretty straightforward.

yield scrapy.Request(url='https://example.com/products')

callback (Optional, but Important)

The function that processes the response. If you don't specify, Scrapy uses parse() by default.

yield scrapy.Request(
    url='https://example.com/products',
    callback=self.parse_products
)

def parse_products(self, response):
    # Handle response here
    pass

method (Optional)

The HTTP method. Default is GET, but you can use POST, PUT, DELETE, etc.

yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}'
)
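
A side note for JSON APIs: the sketch above sends a raw string body, but Scrapy also ships JsonRequest (a real class in scrapy.http since Scrapy 1.8) that serializes the payload and sets the Content-Type header for you:

from scrapy.http import JsonRequest

# data is serialized to JSON; the method defaults to POST when data is given
yield JsonRequest(url='https://example.com/api', data={'key': 'value'})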

headers (Optional)

Custom headers to send with the request.

yield scrapy.Request(
    url='https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html',
        'Referer': 'https://google.com'
    }
)

cookies (Optional)

Cookies to send with the request.

yield scrapy.Request(
    url='https://example.com',
    cookies={'session_id': '12345', 'user': 'john'}
)
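
Scrapy persists cookies between requests on its own. If you need several independent sessions in one spider, the documented cookiejar meta key keeps them separate. A sketch, with a placeholder URL:

# Two independent cookie sessions in one spider
for session in (1, 2):
    yield scrapy.Request(
        'https://example.com',
        meta={'cookiejar': session},
        callback=self.parse,
        dont_filter=True  # same URL twice, so skip the duplicate filter
    )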

meta (Optional)

Data to carry forward to the callback. This is huge for passing data between pages.

yield scrapy.Request(
    url='https://example.com/details',
    meta={'product_name': 'Widget', 'price': 29.99},
    callback=self.parse_details
)

def parse_details(self, response):
    name = response.meta['product_name']
    price = response.meta['price']
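
Since Scrapy 1.7 there is also cb_kwargs, which delivers data as plain keyword arguments to the callback; the official docs now prefer it over meta for user data. The same example, rewritten:

yield scrapy.Request(
    url='https://example.com/details',
    callback=self.parse_details,
    cb_kwargs={'product_name': 'Widget', 'price': 29.99}
)

def parse_details(self, response, product_name, price):
    self.logger.info(f'{product_name} costs {price}')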

dont_filter (Optional)

By default, Scrapy filters duplicate URLs. Set this to True to visit the same URL multiple times.

yield scrapy.Request(
    url='https://example.com',
    dont_filter=True  # Visit this URL even if we've been there
)

priority (Optional)

Higher priority requests get processed first. Default is 0.

yield scrapy.Request(
    url='https://example.com/important',
    priority=10  # Process this before priority 0 requests
)

The Response Object (What You Actually Get Back)

When your request completes, you get a Response object in your callback. Let's see what's inside:

def parse(self, response):
    # The URL of the response (might differ from request due to redirects)
    url = response.url

    # The HTML body as bytes
    html = response.body

    # The HTML as a string (more useful!)
    text = response.text

    # The HTTP status code
    status = response.status  # 200, 404, 500, etc.

    # Response headers
    headers = response.headers

    # The original request that generated this response
    original_request = response.request

    # Metadata carried over from the request
    meta_data = response.meta

Useful Response Methods

def parse(self, response):
    # CSS selectors (easiest!)
    titles = response.css('h1.title::text').getall()
    first_title = response.css('h1.title::text').get()

    # XPath selectors (more powerful)
    titles = response.xpath('//h1[@class="title"]/text()').getall()

    # Follow links (super convenient)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

    # Urljoin (combine relative URLs with base URL)
    full_url = response.urljoin('/relative/path')

Secrets the Documentation Doesn't Emphasize

Secret #1: Response.follow() Is Magic

Instead of manually creating requests like this:

next_url = response.css('a.next::attr(href)').get()
full_url = response.urljoin(next_url)
yield scrapy.Request(full_url, callback=self.parse)

Just use response.follow():

next_url = response.css('a.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)

Even better, you can inline the extraction, or pass an <a> selector directly:

# Passing the extracted href string inline works
yield response.follow(response.css('a.next::attr(href)').get(), callback=self.parse)

# Passing <a> selectors also works -- follow() extracts the href for you
for link in response.css('a'):
    yield response.follow(link, callback=self.parse_page)

response.follow() automatically:

  • Handles relative URLs
  • Extracts the href attribute if you pass a selector
  • Creates the Request object for you
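
Scrapy 2.0 added a sibling, response.follow_all(), which builds one request per matched link in a single call:

# One Request per matched <a>; relative URLs are resolved for you
yield from response.follow_all(css='.product a', callback=self.parse_page)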

Secret #2: response.request Gets You the Original Request

def parse(self, response):
    # Access the original request
    original_url = response.request.url
    original_headers = response.request.headers
    original_meta = response.request.meta

    # Useful for debugging
    self.logger.info(f'Requested: {original_url}')
    self.logger.info(f'Got back: {response.url}')
    # These might differ if there was a redirect!

Secret #3: You Can Inspect Response Headers

def parse(self, response):
    # Get all headers
    all_headers = response.headers

    # Get a specific header (note: header values come back as bytes)
    content_type = response.headers.get('Content-Type')  # e.g. b'text/html'

    # Check cookies the server sent back (also a list of bytes)
    cookies = response.headers.getlist('Set-Cookie')

    # Useful for debugging blocks
    server = response.headers.get('Server')
    self.logger.info(f'Server type: {server}')

Secret #4: response.meta Survives Redirects

This is huge and not well documented. When a request gets redirected, the meta data stays with it:

def start_requests(self):
    yield scrapy.Request(
        'https://example.com/redirect',
        meta={'important': 'data'},
        callback=self.parse
    )

def parse(self, response):
    # Even after redirect, meta is still there!
    data = response.meta['important']

    # The URL might be different
    self.logger.info(f'Ended up at: {response.url}')
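
On top of that, the redirect middleware records the chain it followed; redirect_urls is a real meta key set by Scrapy's RedirectMiddleware:

def parse(self, response):
    # Every URL the request was redirected through, in order
    chain = response.meta.get('redirect_urls', [])
    if chain:
        self.logger.info(f'Redirected via: {chain}')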

Secret #5: Request Priority Actually Matters

Most people ignore priority, but it's powerful:

def parse_listing(self, response):
    # High priority for product pages (process first)
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield response.follow(
            url,
            callback=self.parse_product,
            priority=10
        )

    # Low priority for pagination (process later)
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_listing,
            priority=0
        )

This ensures you scrape important pages first before moving to the next page of listings.


FormRequest: For Login and POST Requests

When you need to submit forms or POST data, use FormRequest:

Simple POST Request

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            self.logger.info('Login successful!')
        else:
            self.logger.error('Login failed!')

FormRequest.from_response() (The Smart Way)

This is incredibly useful but underused:

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Automatically fill in the form from the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Now you're logged in!
        yield response.follow('/dashboard', callback=self.parse_dashboard)

from_response() automatically:

  • Finds the form on the page
  • Extracts all form fields
  • Preserves hidden fields (CSRF tokens, etc.)
  • Fills in your data
  • Builds the request that submits the form

It's like magic for login forms!
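
One caveat: if the page has more than one form, from_response() picks the first by default. The formname, formid, formnumber, and formxpath arguments (all real parameters) let you target a specific one; the form name below is made up:

yield scrapy.FormRequest.from_response(
    response,
    formname='login-form',  # matches <form name="login-form">
    formdata={'username': 'myuser', 'password': 'mypass'},
    callback=self.after_login
)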


Real-World Examples

Example 1: Scraping With Pagination

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            meta={'page': 1},
            callback=self.parse
        )

    def parse(self, response):
        page = response.meta['page']
        self.logger.info(f'Scraping page {page}')

        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'page': page
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                meta={'page': page + 1},
                callback=self.parse
            )

Example 2: Scraping Details Across Multiple Pages

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        """Scrape product listings"""
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            # Go to detail page to get more info
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={'item': item}
            )

    def parse_detail(self, response):
        """Add details to the item"""
        item = response.meta['item']
        item['description'] = response.css('.description::text').get()
        item['rating'] = response.css('.rating::text').get()
        item['reviews'] = len(response.css('.review'))
        yield item

Example 3: Handling Authentication

import scrapy

class AuthSpider(scrapy.Spider):
    name = 'auth'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        """Login first"""
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        """Check if login succeeded"""
        if 'logout' in response.text:
            self.logger.info('Logged in successfully!')
            yield response.follow('/protected/data', callback=self.parse_data)
        else:
            self.logger.error('Login failed')

    def parse_data(self, response):
        """Scrape protected data"""
        for item in response.css('.data-item'):
            yield {
                'title': item.css('h3::text').get(),
                'data': item.css('.value::text').get()
            }

Common Mistakes and How to Fix Them

Mistake #1: Not Yielding Requests

# WRONG
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    response.follow(next_url, callback=self.parse)  # Missing yield!

# RIGHT
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    yield response.follow(next_url, callback=self.parse)

Mistake #2: Forgetting to Handle None

# WRONG (crashes if no next button)
next_url = response.css('.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)

# RIGHT
next_url = response.css('.next::attr(href)').get()
if next_url:
    yield response.follow(next_url, callback=self.parse)

Mistake #3: Not Using response.follow() for Relative URLs

# WRONG (breaks with relative URLs)
url = response.css('a::attr(href)').get()
yield scrapy.Request(url, callback=self.parse)

# RIGHT (handles relative URLs automatically)
url = response.css('a::attr(href)').get()
yield response.follow(url, callback=self.parse)

Mistake #4: Modifying Response

# WRONG (responses are immutable; assigning raises AttributeError)
response.body = 'new content'  # This doesn't work!

# RIGHT (create a new response if needed)
new_response = response.replace(body=b'new content')

Advanced: Request and Response Tricks

Trick #1: Chaining Multiple Pages

def parse_category(self, response):
    category = response.css('h1::text').get()

    for product_link in response.css('.product a'):
        yield response.follow(
            product_link,
            callback=self.parse_product,
            meta={'category': category}
        )

def parse_product(self, response):
    category = response.meta['category']

    review_link = response.css('.reviews-link::attr(href)').get()
    if review_link:
        yield response.follow(
            review_link,
            callback=self.parse_reviews,
            meta={
                'category': category,
                'product': response.css('h1::text').get()
            }
        )

def parse_reviews(self, response):
    yield {
        'category': response.meta['category'],
        'product': response.meta['product'],
        'reviews': response.css('.review::text').getall()
    }

Trick #2: Conditional Requests

def parse(self, response):
    for link in response.css('a'):
        url = link.css('::attr(href)').get()
        if not url:
            continue  # skip <a> tags without an href

        # Only follow links to product pages
        if '/product/' in url:
            yield response.follow(url, callback=self.parse_product)

        # Only follow links to category pages
        elif '/category/' in url:
            yield response.follow(url, callback=self.parse_category)

Trick #3: Dynamic Headers Per Request

def parse(self, response):
    for i, product in enumerate(response.css('.product')):
        url = product.css('a::attr(href)').get()

        # Different referer for each request
        yield scrapy.Request(
            url,
            callback=self.parse_product,
            headers={'Referer': response.url},
            meta={'product_position': i}
        )

Debugging Requests and Responses

See What Requests Are Being Made

def parse(self, response):
    self.logger.info(f'Visiting: {response.url}')
    self.logger.info(f'Status: {response.status}')
    self.logger.info(f'Headers: {response.headers}')

Check Response Content

def parse(self, response):
    # Save response to file for inspection
    filename = 'response.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    self.logger.info(f'Saved response to {filename}')
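
Scrapy also bundles a helper for exactly this: open_in_browser (a real utility in scrapy.utils.response) pops the received page open in your default browser:

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Shows exactly what the spider received; note that JavaScript
    # never ran, so the page may look different from your browser
    open_in_browser(response)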

Debug Failed Requests

def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        callback=self.parse,
        errback=self.handle_error
    )

def handle_error(self, failure):
    self.logger.error(f'Request failed: {failure}')
    self.logger.error(f'URL: {failure.request.url}')
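
The failure object can also tell you what kind of error occurred. This sketch follows the errback pattern from the Scrapy docs (HttpError, DNSLookupError, and TimeoutError are real exception classes):

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def handle_error(self, failure):
    if failure.check(HttpError):
        # A response arrived, but with a non-2xx status
        response = failure.value.response
        self.logger.error(f'HTTP {response.status} on {response.url}')
    elif failure.check(DNSLookupError):
        self.logger.error(f'DNS lookup failed: {failure.request.url}')
    elif failure.check(TimeoutError):
        self.logger.error(f'Request timed out: {failure.request.url}')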

Response Types (The Secret Hierarchy)

Scrapy actually has different types of Response objects:

Response (Base Class)

Basic response for any content.

TextResponse (Most Common)

For HTML, XML, and text content. Has .text and selector methods.

HtmlResponse

Specifically for HTML. Auto-detects encoding.

XmlResponse

For XML content. Auto-detects encoding from XML declaration.

You rarely need to care about this, but it explains why .css() and .xpath() work on HTML responses but would fail on binary responses.
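
If you're curious, you can construct these classes by hand and watch the difference. A minimal sketch with placeholder URLs and body:

from scrapy.http import HtmlResponse, Response, TextResponse

html = HtmlResponse(
    url='https://example.com',
    body=b'<html><body><h1>Hello</h1></body></html>',
    encoding='utf-8'
)
print(isinstance(html, TextResponse))  # True -> .text, .css(), .xpath() work
print(html.css('h1::text').get())      # 'Hello'

binary = Response(url='https://example.com/file.pdf', body=b'%PDF-1.4')
# binary.css('h1') would raise NotSupported: response content isn't text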


Performance Tips

Tip #1: Use dont_filter Sparingly

# Risky: duplicate filtering is disabled, so you can re-crawl
# pages or even loop forever on circular links
yield scrapy.Request(url, dont_filter=True)

# Better: only disable filtering when you genuinely need a revisit
if need_to_revisit:
    yield scrapy.Request(url, dont_filter=True)
else:
    yield scrapy.Request(url)  # duplicate-filtered by default

Tip #2: Set Appropriate Priorities

# Important requests first
yield scrapy.Request(important_url, priority=100)

# Less important requests later
yield scrapy.Request(other_url, priority=1)

Tip #3: Don't Pass Huge Objects in Meta

# BAD: a large object rides along in meta for the request's whole lifetime
huge_data = load_all_records()  # placeholder for a big in-memory list
yield scrapy.Request(url, meta={'data': huge_data})

# GOOD: pass only the small piece you actually need
record_id = huge_data[0]['id']  # e.g. just an identifier
yield scrapy.Request(url, meta={'id': record_id})

Summary: Request and Response Cheat Sheet

Creating Requests:

# Basic
yield scrapy.Request(url, callback=self.parse)

# With all options
yield scrapy.Request(
    url=url,
    callback=self.parse,
    method='GET',
    headers={'User-Agent': 'custom'},
    cookies={'session': '123'},
    meta={'data': 'value'},
    priority=10,
    dont_filter=False
)

# Form request
yield scrapy.FormRequest(url, formdata={'key': 'value'})

# From response (shortcut)
yield response.follow(url, callback=self.parse)

Using Responses:

# Get data
url = response.url
status = response.status
text = response.text
body = response.body

# Selectors
response.css('selector')
response.xpath('xpath')

# Follow links
yield response.follow(url, callback=self.parse)

# Access meta
data = response.meta['key']

# Original request
original = response.request

Final Thoughts

Requests and Responses are the foundation of Scrapy. Master these, and everything else gets easier.

Key takeaways:

  • Always yield requests (don't forget!)
  • Use response.follow() for convenience
  • Pass data through meta
  • Handle None values
  • Check response.status
  • Use FormRequest for logins
  • Debug with logging

Start simple. Practice with basic requests. Then add complexity as you need it.

Happy scraping! 🕷️
