When I first started using Scrapy, I thought Requests and Responses were simple concepts. You make a request, you get a response. Easy, right?
Wrong.
There's so much hidden under the surface. Things the documentation mentions but doesn't explain. Tricks that experienced scrapers use every day but nobody writes about.
After scraping hundreds of websites and debugging thousands of issues, I've learned the ins and outs of Scrapy's Request and Response objects. Let me share everything with you, including the stuff the docs leave out.
What Are Requests and Responses, Really?
Think of web scraping like having a conversation:
Request: "Hey website, can you show me this page?"
Response: "Sure, here's the HTML!"
In Scrapy:
- A Request is an object that says "I want to visit this URL"
- A Response is an object that contains what the website sent back
But here's where it gets interesting. These aren't just simple objects. They carry a ton of hidden information and have special behaviors most beginners never discover.
Creating Your First Request (The Right Way)
Most tutorials show you this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Do something with response
        pass
But what's actually happening here? Scrapy automatically creates Request objects from start_urls. Behind the scenes, it's doing roughly this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse)
Now let's make requests manually and see all the options:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse,
            method='GET',
            headers={'User-Agent': 'My Custom Agent'},
            cookies={'session': 'abc123'},
            meta={'page_num': 1},
            dont_filter=False,
            priority=0
        )

    def parse(self, response):
        # Process response
        pass
Let me break down each parameter:
url (Required)
The page you want to scrape. Pretty straightforward.
yield scrapy.Request(url='https://example.com/products')
callback (Optional, but Important)
The function that processes the response. If you don't specify, Scrapy uses parse() by default.
yield scrapy.Request(
    url='https://example.com/products',
    callback=self.parse_products
)

def parse_products(self, response):
    # Handle response here
    pass
method (Optional)
The HTTP method. Default is GET, but you can use POST, PUT, DELETE, etc.
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    # For JSON APIs you'll usually also want to set the Content-Type header
    headers={'Content-Type': 'application/json'}
)
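Side note: if you're sending JSON to an API, recent Scrapy versions (1.8+) also ship a JsonRequest subclass that serializes the body and sets the Content-Type header for you. A minimal sketch (the URL and payload are placeholders):

from scrapy.http import JsonRequest

def start_requests(self):
    # JsonRequest turns `data` into a JSON body and sets
    # Content-Type: application/json automatically
    yield JsonRequest(
        url='https://example.com/api',
        data={'key': 'value'},
        callback=self.parse
    )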
headers (Optional)
Custom headers to send with the request.
yield scrapy.Request(
    url='https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html',
        'Referer': 'https://google.com'
    }
)
cookies (Optional)
Cookies to send with the request.
yield scrapy.Request(
    url='https://example.com',
    cookies={'session_id': '12345', 'user': 'john'}
)
meta (Optional)
Data to carry forward to the callback. This is huge for passing data between pages.
yield scrapy.Request(
    url='https://example.com/details',
    meta={'product_name': 'Widget', 'price': 29.99},
    callback=self.parse_details
)

def parse_details(self, response):
    name = response.meta['product_name']
    price = response.meta['price']
dont_filter (Optional)
By default, Scrapy filters duplicate URLs. Set this to True to visit the same URL multiple times.
yield scrapy.Request(
    url='https://example.com',
    dont_filter=True  # Visit this URL even if we've been there
)
priority (Optional)
Higher priority requests get processed first. Default is 0.
yield scrapy.Request(
    url='https://example.com/important',
    priority=10  # Process this before priority 0 requests
)
The Response Object (What You Actually Get Back)
When your request completes, you get a Response object in your callback. Let's see what's inside:
def parse(self, response):
    # The URL of the response (might differ from request due to redirects)
    url = response.url

    # The HTML body as bytes
    html = response.body

    # The HTML as a string (more useful!)
    text = response.text

    # The HTTP status code
    status = response.status  # 200, 404, 500, etc.

    # Response headers
    headers = response.headers

    # The original request that generated this response
    original_request = response.request

    # Meta data from the request
    meta_data = response.meta
Useful Response Methods
def parse(self, response):
    # CSS selectors (easiest!)
    titles = response.css('h1.title::text').getall()
    first_title = response.css('h1.title::text').get()

    # XPath selectors (more powerful)
    titles = response.xpath('//h1[@class="title"]/text()').getall()

    # Follow links (super convenient)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

    # urljoin (combine relative URLs with the base URL)
    full_url = response.urljoin('/relative/path')
Secrets the Documentation Doesn't Emphasize
Secret #1: Response.follow() Is Magic
Instead of manually creating requests like this:
next_url = response.css('a.next::attr(href)').get()
full_url = response.urljoin(next_url)
yield scrapy.Request(full_url, callback=self.parse)
Just use response.follow():
next_url = response.css('a.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)
Even better, you can pass a selector directly:
# This works!
yield response.follow(response.css('a.next::attr(href)').get(), callback=self.parse)
# This also works!
for link in response.css('a'):
    yield response.follow(link, callback=self.parse_page)
response.follow() automatically:
- Handles relative URLs
- Extracts the href attribute if you pass a selector
- Creates the Request object for you
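And if you're on Scrapy 2.0 or newer, there's also response.follow_all(), which creates one request per matched link in a single call. A quick sketch (the selectors and callback names here are placeholders for whatever your target page uses):

def parse(self, response):
    # One call replaces the whole select-loop-follow dance
    yield from response.follow_all(css='a.product', callback=self.parse_product)

    # It also accepts a list of links or selectors directly
    pagination_links = response.css('.pagination a')
    yield from response.follow_all(pagination_links, callback=self.parse)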
Secret #2: response.request Gets You the Original Request
def parse(self, response):
    # Access the original request
    original_url = response.request.url
    original_headers = response.request.headers
    original_meta = response.request.meta

    # Useful for debugging
    self.logger.info(f'Requested: {original_url}')
    self.logger.info(f'Got back: {response.url}')
    # These might differ if there was a redirect!
Secret #3: You Can Inspect Response Headers
def parse(self, response):
    # Get all headers
    all_headers = response.headers

    # Get a specific header
    content_type = response.headers.get('Content-Type')

    # Check cookies the server sent back
    cookies = response.headers.getlist('Set-Cookie')

    # Useful for debugging blocks
    server = response.headers.get('Server')
    self.logger.info(f'Server type: {server}')
Secret #4: response.meta Survives Redirects
This is huge and not well documented. When a request gets redirected, the meta data stays with it:
def start_requests(self):
    yield scrapy.Request(
        'https://example.com/redirect',
        meta={'important': 'data'},
        callback=self.parse
    )

def parse(self, response):
    # Even after redirect, meta is still there!
    data = response.meta['important']

    # The URL might be different
    self.logger.info(f'Ended up at: {response.url}')
Secret #5: Request Priority Actually Matters
Most people ignore priority, but it's powerful:
def parse_listing(self, response):
    # High priority for product pages (process first)
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield response.follow(
            url,
            callback=self.parse_product,
            priority=10
        )

    # Low priority for pagination (process later)
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_listing,
            priority=0
        )
This ensures you scrape important pages first before moving to the next page of listings.
FormRequest: For Login and POST Requests
When you need to submit forms or POST data, use FormRequest:
Simple POST Request
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            self.logger.info('Login successful!')
        else:
            self.logger.error('Login failed!')
FormRequest.from_response() (The Smart Way)
This is incredibly useful but underused:
class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Automatically fill in the form from the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Now you're logged in!
        yield response.follow('/dashboard', callback=self.parse_dashboard)
from_response() automatically:
- Finds the form on the page
- Extracts all form fields
- Preserves hidden fields (CSRF tokens, etc.)
- Fills in your data
- Submits the form
It's like magic for login forms!
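One caveat: if the page contains more than one form, from_response() may pick the wrong one. You can point it at the right form explicitly with its formname, formnumber, or formxpath arguments; in this sketch the form name and XPath are placeholders for your target site:

yield scrapy.FormRequest.from_response(
    response,
    formname='login-form',               # match the form's name attribute
    # formnumber=1,                      # ...or pick it by position (0-based)
    # formxpath='//form[@id="login"]',   # ...or match it with XPath
    formdata={'username': 'myuser', 'password': 'mypass'},
    callback=self.after_login
)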
Real-World Examples
Example 1: Scraping With Pagination
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            meta={'page': 1},
            callback=self.parse
        )

    def parse(self, response):
        page = response.meta['page']
        self.logger.info(f'Scraping page {page}')

        # Scrape products
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'page': page
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                meta={'page': page + 1},
                callback=self.parse
            )
Example 2: Scraping Details Across Multiple Pages
import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        """Scrape product listings"""
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            # Go to detail page to get more info
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={'item': item}
            )

    def parse_detail(self, response):
        """Add details to the item"""
        item = response.meta['item']
        item['description'] = response.css('.description::text').get()
        item['rating'] = response.css('.rating::text').get()
        item['reviews'] = len(response.css('.review'))
        yield item
Example 3: Handling Authentication
import scrapy

class AuthSpider(scrapy.Spider):
    name = 'auth'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        """Login first"""
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        """Check if login succeeded"""
        if 'logout' in response.text:
            self.logger.info('Logged in successfully!')
            yield response.follow('/protected/data', callback=self.parse_data)
        else:
            self.logger.error('Login failed')

    def parse_data(self, response):
        """Scrape protected data"""
        for item in response.css('.data-item'):
            yield {
                'title': item.css('h3::text').get(),
                'data': item.css('.value::text').get()
            }
Common Mistakes and How to Fix Them
Mistake #1: Not Yielding Requests
# WRONG
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    response.follow(next_url, callback=self.parse)  # Missing yield!

# RIGHT
def parse(self, response):
    next_url = response.css('.next::attr(href)').get()
    yield response.follow(next_url, callback=self.parse)
Mistake #2: Forgetting to Handle None
# WRONG (crashes if no next button)
next_url = response.css('.next::attr(href)').get()
yield response.follow(next_url, callback=self.parse)
# RIGHT
next_url = response.css('.next::attr(href)').get()
if next_url:
    yield response.follow(next_url, callback=self.parse)
Mistake #3: Not Using response.follow() for Relative URLs
# WRONG (breaks with relative URLs)
url = response.css('a::attr(href)').get()
yield scrapy.Request(url, callback=self.parse)
# RIGHT (handles relative URLs automatically)
url = response.css('a::attr(href)').get()
yield response.follow(url, callback=self.parse)
Mistake #4: Modifying Response
# WRONG (response is read-only)
response.body = 'new content' # This doesn't work!
# RIGHT (create a new response if needed)
new_response = response.replace(body=b'new content')
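When would you actually do this? One hypothetical case: the server wraps the real markup in tags that confuse your selectors, so you parse a cleaned copy instead. Selectors are built from whatever body the response carries, so the copy re-parses the new body:

def parse(self, response):
    # Strip <noscript> wrappers (placeholder scenario), then parse the cleaned copy
    cleaned = response.replace(
        body=response.body.replace(b'<noscript>', b'').replace(b'</noscript>', b'')
    )
    for title in cleaned.css('h2.title::text').getall():
        yield {'title': title}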
Advanced: Request and Response Tricks
Trick #1: Chaining Multiple Pages
def parse_category(self, response):
    category = response.css('h1::text').get()

    for product_link in response.css('.product a'):
        yield response.follow(
            product_link,
            callback=self.parse_product,
            meta={'category': category}
        )

def parse_product(self, response):
    category = response.meta['category']

    review_link = response.css('.reviews-link::attr(href)').get()
    if review_link:
        yield response.follow(
            review_link,
            callback=self.parse_reviews,
            meta={
                'category': category,
                'product': response.css('h1::text').get()
            }
        )

def parse_reviews(self, response):
    yield {
        'category': response.meta['category'],
        'product': response.meta['product'],
        'reviews': response.css('.review::text').getall()
    }
Trick #2: Conditional Requests
def parse(self, response):
    for link in response.css('a'):
        url = link.css('::attr(href)').get()
        if not url:
            continue  # skip anchors without an href

        # Only follow links to product pages
        if '/product/' in url:
            yield response.follow(url, callback=self.parse_product)
        # Only follow links to category pages
        elif '/category/' in url:
            yield response.follow(url, callback=self.parse_category)
Trick #3: Dynamic Headers Per Request
def parse(self, response):
    for i, product in enumerate(response.css('.product')):
        url = product.css('a::attr(href)').get()

        # Different referer for each request
        yield scrapy.Request(
            url,
            callback=self.parse_product,
            headers={'Referer': response.url},
            meta={'product_position': i}
        )
Debugging Requests and Responses
See What Requests Are Being Made
def parse(self, response):
    self.logger.info(f'Visiting: {response.url}')
    self.logger.info(f'Status: {response.status}')
    self.logger.info(f'Headers: {response.headers}')
Check Response Content
def parse(self, response):
    # Save response to file for inspection
    filename = 'response.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.logger.info(f'Saved response to {filename}')
Debug Failed Requests
def start_requests(self):
    yield scrapy.Request(
        'https://example.com',
        callback=self.parse,
        errback=self.handle_error
    )

def handle_error(self, failure):
    self.logger.error(f'Request failed: {failure}')
    self.logger.error(f'URL: {failure.request.url}')
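The failure object also tells you what kind of error happened. Checking for the common cases looks like this (a sketch in the spirit of Scrapy's own errback example; which exception types you care about is up to you):

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

def handle_error(self, failure):
    # failure.request is always available, whatever went wrong
    self.logger.error(f'Request failed: {failure.request.url}')

    if failure.check(HttpError):
        # The server answered, but with a non-2xx status
        response = failure.value.response
        self.logger.error(f'HttpError: status {response.status}')
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError: could not resolve the domain')
    elif failure.check(TimeoutError, TCPTimedOutError):
        self.logger.error('Timeout while connecting or downloading')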
Response Types (The Secret Hierarchy)
Scrapy actually has different types of Response objects:
Response (Base Class)
Basic response for any content.
TextResponse (Most Common)
For HTML, XML, and text content. Has .text and selector methods.
HtmlResponse
Specifically for HTML. Auto-detects encoding.
XmlResponse
For XML content. Auto-detects encoding from XML declaration.
You rarely need to care about this, but it explains why .css() and .xpath() work on HTML responses but would fail on binary responses.
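If your spider might hit non-HTML URLs (PDFs, images, downloads), a quick type check avoids that failure. A minimal sketch:

from scrapy.http import TextResponse

def parse(self, response):
    # .css() and .xpath() only exist on TextResponse and its subclasses
    if not isinstance(response, TextResponse):
        self.logger.warning(f'Skipping non-text response: {response.url}')
        return

    yield {'title': response.css('title::text').get()}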
Performance Tips
Tip #1: Use dont_filter Sparingly
# Risky: duplicate URLs get crawled again (and can even loop forever)
yield scrapy.Request(url, dont_filter=True)

# Better: only disable filtering when you genuinely need to revisit a page
if need_to_revisit:
    yield scrapy.Request(url, dont_filter=True)
else:
    yield scrapy.Request(url)  # duplicates filtered by default
Tip #2: Set Appropriate Priorities
# Important requests first
yield scrapy.Request(important_url, priority=100)
# Less important requests later
yield scrapy.Request(other_url, priority=1)
Tip #3: Don't Pass Huge Objects in Meta
# BAD (large object carried through the scheduler in meta)
huge_data = [...]  # imagine a big list of scraped records
yield scrapy.Request(url, meta={'data': huge_data})

# GOOD (only pass what you need)
small_id = get_id(huge_data)  # placeholder for however you derive the ID
yield scrapy.Request(url, meta={'id': small_id})
Summary: Request and Response Cheat Sheet
Creating Requests:
# Basic
yield scrapy.Request(url, callback=self.parse)
# With all options
yield scrapy.Request(
    url=url,
    callback=self.parse,
    method='GET',
    headers={'User-Agent': 'custom'},
    cookies={'session': '123'},
    meta={'data': 'value'},
    priority=10,
    dont_filter=False
)
# Form request
yield scrapy.FormRequest(url, formdata={'key': 'value'})
# From response (shortcut)
yield response.follow(url, callback=self.parse)
Using Responses:
# Get data
url = response.url
status = response.status
text = response.text
body = response.body
# Selectors
response.css('selector')
response.xpath('xpath')
# Follow links
yield response.follow(url, callback=self.parse)
# Access meta
data = response.meta['key']
# Original request
original = response.request
Final Thoughts
Requests and Responses are the foundation of Scrapy. Master these, and everything else gets easier.
Key takeaways:
- Always yield requests (don't forget!)
- Use response.follow() for convenience
- Pass data through meta
- Handle None values
- Check response.status
- Use FormRequest for logins
- Debug with logging
Start simple. Practice with basic requests. Then add complexity as you need it.
Happy scraping! 🕷️