When I first started using Scrapy, I thought Requests and Responses were simple concepts. You make a request, you get a response. Easy, right?
Wrong.
There's so much hidden under the surface. Things the documentation mentions but doesn't explain. Tricks that experienced scrapers use every day but nobody writes about.
After scraping hundreds of websites and debugging thousands of issues, I've learned the ins and outs of Scrapy's Request and Response objects. Let me share everything with you, including the stuff the docs leave out.
What Are Requests and Responses, Really?
Think of web scraping like having a conversation:
Request: "Hey website, can you show me this page?"
Response: "Sure, here's the HTML!"
In Scrapy:
- A Request is an object that says "I want to visit this URL"
- A Response is an object that contains what the website sent back
But here's where it gets interesting. These aren't just simple objects. They carry a ton of hidden information and have special behaviors most beginners never discover.
Creating Your First Request (The Right Way)
Most tutorials show you this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Do something with response
        pass
But what's actually happening here? Scrapy automatically creates Request objects from start_urls. Behind the scenes, it's doing roughly this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse)
Now let's make requests manually and see all the options:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse,
            method='GET',
            headers={'User-Agent': 'My Custom Agent'},
            cookies={'session': 'abc123'},
            meta={'page_num': 1},
            dont_filter=False,
            priority=0
        )

    def parse(self, response):
        # Process response
        pass
Let me break down each parameter:
url (Required)
The page you want to scrape. Pretty straightforward.
yield scrapy.Request(url='https://example.com/products')
callback (Optional, but Important)
The function that processes the response. If you don't specify, Scrapy uses parse() by default.
yield scrapy.Request(
    url='https://example.com/products',
    callback=self.parse_products
)

def parse_products(self, response):
    # Handle response here
    pass
method (Optional)
The HTTP method. Default is GET, but you can use POST, PUT, DELETE, etc.
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    # For JSON APIs you'll usually also want to set the Content-Type header
    headers={'Content-Type': 'application/json'}
)
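Side note: if you're sending JSON to an API, recent Scrapy versions (1.8+) also ship a JsonRequest subclass that serializes the body and sets the Content-Type header for you. A minimal sketch (the URL and payload are placeholders):

from scrapy.http import JsonRequest

def start_requests(self):
    # JsonRequest turns `data` into a JSON body and sets
    # Content-Type: application/json automatically
    yield JsonRequest(
        url='https://example.com/api',
        data={'key': 'value'},
        callback=self.parse
    )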
headers (Optional)
Custom headers to send with the request.
yield scrapy.Request(
    url='https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html',
        'Referer': 'https://google.com'
    }
)
cookies (Optional)
Cookies to send with the request.
yield scrapy.Request(
    url='https://example.com',
    cookies={'session_id': '12345', 'user': 'john'}
)
meta (Optional)
Data to carry forward to the callback. This is huge for passing data between pages.
yield scrapy.Request(
    url='https://example.com/details',
    meta={'product_name': 'Widget', 'price': 29.99},
    callback=self.parse_details
)

def parse_details(self, response):
    name = response.meta['product_name']
    price = response.meta['price']
dont_filter (Optional)
By default, Scrapy filters duplicate URLs. Set this to True to visit the same URL multiple times.
yield scrapy.Request(
    url='https://example.com',
    dont_filter=True  # Visit this URL even if we've been there
)
priority (Optional)
Higher priority requests get processed first. Default is 0.
yield scrapy.Request(
    url='https://example.com/important',
    priority=10  # Process this before priority 0 requests
)
The Response Object (What You Actually Get Back)
When your request completes, you get a Response object in your callback. Let's see what's inside:
def parse(self, response):
    # The URL of the response (might differ from request due to redirects)
    url = response.url

    # The HTML body as bytes
    html = response.body

    # The HTML as a string (more useful!)
    text = response.text

    # The HTTP status code
    status = response.status  # 200, 404, 500, etc.

    # Response headers
    headers = response.headers

    # The original request that generated this response
    original_request = response.request

    # Meta data from the request
    meta_data = response.meta
Useful Response Methods
def parse(self, response):
    # CSS selectors (easiest!)
    titles = response.css('h1.title::text').getall()
    first_title = response.css('h1.title::text').get()

    # XPath selectors (more powerful)
    titles = response.xpath('//h1[@class="title"]/text()').getall()

    # Follow links (super convenient)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

    # urljoin (combine relative URLs with the base URL)
    full_url = response.urljoin('/relative/path')
Secrets the Documentation Doesn't Emphasize
Secret #1: Response.follow() Is Magic
Instead of manually creating requests like this:
next_url = response.css('a.next::attr(href)').get()
full_url = response.urljoin(next_url)
yield scrapy.Request(full_url, callback=self.parse)
Just use response.follow():
next_url = response.css('a.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)
Even better, you can pass a selector directly:
# This works!
yield response.follow(response.css('a.next::attr(href)').get(), callback=self.parse)
# This also works!
for link in response.css('a'):
    yield response.follow(link, callback=self.parse_page)
response.follow() automatically:
- Handles relative URLs
- Extracts the href attribute if you pass a selector
- Creates the Request object for you
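And if you're on Scrapy 2.0 or newer, there's also response.follow_all(), which creates one request per matched link in a single call. A quick sketch (the selectors and callback names here are placeholders for whatever your target page uses):

def parse(self, response):
    # One call replaces the whole select-loop-follow dance
    yield from response.follow_all(css='a.product', callback=self.parse_product)

    # It also accepts a list of links or selectors directly
    pagination_links = response.css('.pagination a')
    yield from response.follow_all(pagination_links, callback=self.parse)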
Secret #2: response.request Gets You the Original Request
def parse(self, response):
    # Access the original request
    original_url = response.request.url
    original_headers = response.request.headers
    original_meta = response.request.meta

    # Useful for debugging
    self.logger.info(f'Requested: {original_url}')
    self.logger.info(f'Got back: {response.url}')
    # These might differ if there was a redirect!
Secret #3: You Can Inspect Response Headers
def parse(self, response):
    # Get all headers
    all_headers = response.headers

    # Get a specific header
    content_type = response.headers.get('Content-Type')

    # Check cookies the server sent back
    cookies = response.headers.getlist('Set-Cookie')

    # Useful for debugging blocks
    server = response.headers.get('Server')
    self.logger.info(f'Server type: {server}')
Secret #4: response.meta Survives Redirects
This is huge and not well documented. When a request gets redirected, the meta data stays with it:
def start_requests(self):
    yield scrapy.Request(
        'https://example.com/redirect',
        meta={'important': 'data'},
        callback=self.parse
    )

def parse(self, response):
    # Even after redirect, meta is still there!
    data = response.meta['important']

    # The URL might be different
    self.logger.info(f'Ended up at: {response.url}')
Secret #5: Request Priority Actually Matters
Most people ignore priority, but it's powerful:
def parse_listing(self, response):
    # High priority for product pages (process first)
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield response.follow(
            url,
            callback=self.parse_product,
            priority=10
        )

    # Low priority for pagination (process later)
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_listing,
            priority=0
        )
This ensures you scrape important pages first before moving to the next page of listings.
FormRequest: For Login and POST Requests
When you need to submit forms or POST data, use FormRequest:
Simple POST Request
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            self.logger.info('Login successful!')
        else:
            self.logger.error('Login failed!')
FormRequest.from_response() (The Smart Way)
This is incredibly useful but underused:
class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Automatically fill in the form from the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Now you're logged in!
        yield response.follow('/dashboard', callback=self.parse_dashboard)
from_response() automatically:
- Finds the form on the page
- Extracts all form fields
- Preserves hidden fields (CSRF tokens, etc.)
- Fills in your data
- Submits the form
It's like magic for login forms!
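One caveat: if the page contains more than one form, from_response() may pick the wrong one. You can point it at the right form explicitly with its formname, formnumber, or formxpath arguments; in this sketch the form name and XPath are placeholders for your target site:

yield scrapy.FormRequest.from_response(
    response,
    formname='login-form',               # match the form's name attribute
    # formnumber=1,                      # ...or pick it by position (0-based)
    # formxpath='//form[@id="login"]',   # ...or match it with XPath
    formdata={'username': 'myuser', 'password': 'mypass'},
    callback=self.after_login
)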
Real-World Examples
Example 1: Scraping With Pagination
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            meta={'page': 1},
            callback=self.parse
        )

    def parse(self, response):
        page = response.meta['page']
        self.logger.info(f'Scraping page {page}')

        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'page': page
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                meta={'page': page + 1},
                callback=self.parse
            )
Example 2: Scraping Details Across Multiple Pages
import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        """Scrape product listings"""
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            # Go to detail page to get more info
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={'item': item}
            )

    def parse_detail(self, response):
        """Add details to the item"""
        item = response.meta['item']
        item['description'] = response.css('.description::text').get()
        item['rating'] = response.css('.rating::text').get()
        item['reviews'] = len(response.css('.review'))
        yield item
Example 3: Handling Authentication
import scrapy

class AuthSpider(scrapy.Spider):
    name = 'auth'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        """Login first"""
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        """Check if login succeeded"""
        if 'logout' in response.text:
            self.logger.info('Logged in successfully!')
            yield response.follow('/protected/data', callback=self.parse_data)
        else:
            self.logger.error('Login failed')

    def parse_data(self, response):
        """Scrape protected data"""
        for item in response.css('.data-item'):
            yield {
                'title': item.css('h3::text').get(),
                'data': item.css('.value::text').get()
            }
Common Mistakes and How to Fix Them
Mistake #1: Not Yielding Requests
# WRONG
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    response.follow(next_url, callback=self.parse)  # Missing yield!

# RIGHT
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    yield response.follow(next_url, callback=self.parse)
Mistake #2: Forgetting to Handle None
# WRONG (crashes if no next button)
next_url = response.css('.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)
# RIGHT
next_url = response.css('.next::attr(href)').get()
if next_url:
    yield response.follow(next_url, callback=self.parse)
Mistake #3: Not Using response.follow() for Relative URLs
# WRONG (breaks with relative URLs)
url = response.css('a::attr(href)').get()
yield scrapy.Request(url, callback=self.parse)
# RIGHT (handles relative URLs automatically)
url = response.css('a::attr(href)').get()
yield response.follow(url, callback=self.parse)
Mistake #4: Modifying Response
# WRONG (response is read-only)
response.body = 'new content' # This doesn't work!
# RIGHT (create a new response if needed)
new_response = response.replace(body=b'new content')
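When would you actually do this? One hypothetical case: the server wraps the real markup in tags that confuse your selectors, so you parse a cleaned copy instead. Selectors are built from whatever body the response carries, so the copy re-parses the new body:

def parse(self, response):
    # Strip <noscript> wrappers (placeholder scenario), then parse the cleaned copy
    cleaned = response.replace(
        body=response.body.replace(b'<noscript>', b'').replace(b'</noscript>', b'')
    )
    for title in cleaned.css('h2.title::text').getall():
        yield {'title': title}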
Advanced: Request and Response Tricks
Trick #1: Chaining Multiple Pages
def parse_category(self, response):
    category = response.css('h1::text').get()

    for product_link in response.css('.product a'):
        yield response.follow(
            product_link,
            callback=self.parse_product,
            meta={'category': category}
        )

def parse_product(self, response):
    category = response.meta['category']

    review_link = response.css('.reviews-link::attr(href)').get()
    if review_link:
        yield response.follow(
            review_link,
            callback=self.parse_reviews,
            meta={
                'category': category,
                'product': response.css('h1::text').get()
            }
        )

def parse_reviews(self, response):
    yield {
        'category': response.meta['category'],
        'product': response.meta['product'],
        'reviews': response.css('.review::text').getall()
    }
Trick #2: Conditional Requests
def parse(self, response):
    for link in response.css('a'):
        url = link.css('::attr(href)').get()
        if not url:
            continue  # skip anchors without an href

        # Only follow links to product pages
        if '/product/' in url:
            yield response.follow(url, callback=self.parse_product)
        # Only follow links to category pages
        elif '/category/' in url:
            yield response.follow(url, callback=self.parse_category)
Trick #3: Dynamic Headers Per Request
def parse(self, response):
    for i, product in enumerate(response.css('.product')):
        url = product.css('a::attr(href)').get()

        # Different referer for each request
        yield scrapy.Request(
            url,
            callback=self.parse_product,
            headers={'Referer': response.url},
            meta={'product_position': i}
        )
Debugging Requests and Responses
See What Requests Are Being Made
def parse(self, response):
    self.logger.info(f'Visiting: {response.url}')
    self.logger.info(f'Status: {response.status}')
    self.logger.info(f'Headers: {response.headers}')
Check Response Content
def parse(self, response):
    # Save response to file for inspection
    filename = 'response.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.logger.info(f'Saved response to {filename}')
Debug Failed Requests
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        callback=self.parse,
        errback=self.handle_error
    )

def handle_error(self, failure):
    self.logger.error(f'Request failed: {failure}')
    self.logger.error(f'URL: {failure.request.url}')
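The failure object also tells you what kind of error happened. Checking for the common cases looks like this (a sketch in the spirit of Scrapy's own errback example; which exception types you care about is up to you):

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

def handle_error(self, failure):
    # failure.request is always available, whatever went wrong
    self.logger.error(f'Request failed: {failure.request.url}')

    if failure.check(HttpError):
        # The server answered, but with a non-2xx status
        response = failure.value.response
        self.logger.error(f'HttpError: status {response.status}')
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError: could not resolve the domain')
    elif failure.check(TimeoutError, TCPTimedOutError):
        self.logger.error('Timeout while connecting or downloading')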
Response Types (The Secret Hierarchy)
Scrapy actually has different types of Response objects:
Response (Base Class)
Basic response for any content.
TextResponse (Most Common)
For HTML, XML, and text content. Has .text and selector methods.
HtmlResponse
Specifically for HTML. Auto-detects encoding.
XmlResponse
For XML content. Auto-detects encoding from XML declaration.
You rarely need to care about this, but it explains why .css() and .xpath() work on HTML responses but would fail on binary responses.
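If your spider might hit non-HTML URLs (PDFs, images, downloads), a quick type check avoids that failure. A minimal sketch:

from scrapy.http import TextResponse

def parse(self, response):
    # .css() and .xpath() only exist on TextResponse and its subclasses
    if not isinstance(response, TextResponse):
        self.logger.warning(f'Skipping non-text response: {response.url}')
        return

    yield {'title': response.css('title::text').get()}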
Performance Tips
Tip #1: Use dont_filter Sparingly
# Risky: duplicate URLs get crawled again (and can even loop forever)
yield scrapy.Request(url, dont_filter=True)

# Better: only disable filtering when you genuinely need to revisit a page
if need_to_revisit:
    yield scrapy.Request(url, dont_filter=True)
else:
    yield scrapy.Request(url)  # duplicates filtered by default
Tip #2: Set Appropriate Priorities
# Important requests first
yield scrapy.Request(important_url, priority=100)
# Less important requests later
yield scrapy.Request(other_url, priority=1)
Tip #3: Don't Pass Huge Objects in Meta
# BAD (large object carried through the scheduler in meta)
huge_data = [...]  # imagine a big list of scraped records
yield scrapy.Request(url, meta={'data': huge_data})

# GOOD (only pass what you need)
small_id = get_id(huge_data)  # placeholder for however you derive the ID
yield scrapy.Request(url, meta={'id': small_id})
Summary: Request and Response Cheat Sheet
Creating Requests:
# Basic
yield scrapy.Request(url, callback=self.parse)
# With all options
yield scrapy.Request(
    url=url,
    callback=self.parse,
    method='GET',
    headers={'User-Agent': 'custom'},
    cookies={'session': '123'},
    meta={'data': 'value'},
    priority=10,
    dont_filter=False
)
# Form request
yield scrapy.FormRequest(url, formdata={'key': 'value'})
# From response (shortcut)
yield response.follow(url, callback=self.parse)
Using Responses:
# Get data
url = response.url
status = response.status
text = response.text
body = response.body
# Selectors
response.css('selector')
response.xpath('xpath')
# Follow links
yield response.follow(url, callback=self.parse)
# Access meta
data = response.meta['key']
# Original request
original = response.request
Final Thoughts
Requests and Responses are the foundation of Scrapy. Master these, and everything else gets easier.
Key takeaways:
- Always yield requests (don't forget!)
- Use response.follow() for convenience
- Pass data through meta
- Handle None values
- Check response.status
- Use FormRequest for logins
- Debug with logging
Start simple. Practice with basic requests. Then add complexity as you need it.
Happy scraping! 🕷️